# Implementing Linked Lists with Contiguous Arrays

## 1 Introduction

The linked list is a well-known data structure, with a straightforward implementation that is a common exercise for students and interviewees alike. The advantage of a linked list is the $\mathcal O(1)$ cost of insertion and erasure of elements at arbitrary positions, which does not require the existing data elements to be moved around. It is also useful for swapping elements, or splicing two lists together, which can be done purely by modifying the links.

The problem is that the traversal of a linked list is deeply offensive to modern hardware, and it’s difficult to avoid traversal when implementing a higher-level algorithm. Even if the use case primarily concerns insertion/erasure, there is often some traversal involved to get to the required position of insertion/erasure.

Pointer chasing is one of the best ways to destroy performance on modern hardware, and the linked list is the king of pointer chasing. This is something that has been highlighted in recent years as the importance of cache-friendliness has become more widely publicized.

The traversal of a linked list is so disastrous that its cost can overwhelm the linked list’s benefits, to such an extent that a vector ends up being the better choice, despite its expensive insert/erase. This is particularly the case with conventional implementations, such as std::list, that do separate allocations for each node. This has the cost of an allocation/deallocation for every insertion/erasure, and potentially scatters the nodes wildly over memory.

An alternate implementation is the intrusive linked list, where the allocation is outsourced to the user, who can preallocate the nodes in a contiguous array and then link/unlink them at their leisure. Allocating the nodes in a contiguous array improves the situation somewhat, but the link pointers (prev, next) can still be spread out over a large area as they are separated by the data elements:

prev next [---data---] prev next [---data---] prev next [---data---] ...

When traversing the list, we don’t necessarily need to access the whole of each data element; we primarily wish to follow the trail of pointers. We will refer to traversal that doesn’t touch the values as pure traversal. This memory layout is inefficient for pure traversal, as a significant amount of each cache line contains irrelevant data, both wasting memory bandwidth and increasing the chance of cache misses. This gets worse with larger data types.

## 2 Reinventing the Linked List

My objective was to create a non-standard linked list that attempts to mitigate the cost of pure traversal while still retaining the $\mathcal O(1)$ insert/erase benefit, and then to see how that affects the performance of various algorithms. To do this, I began by separating the storage of the values from the nodes that describe the linked structure.

struct list {
    struct node {
        index_type prev, next;
    };
    std::vector<value_type> values;
    std::vector<node> nodes;
};


(templates omitted)

Both the nodes and the values are stored in their own contiguous arrays. This allows indexes to be used instead of pointers, which in turn allows small unsigned integers (such as uint16_t) to be used instead of the full 64 bits of a pointer. The idea here is to make the array of nodes as compact as possible.

[---data---] [---data---] [---data---] [---data---] [---data---] ...
prev next prev next prev next prev next prev next ...

The association between a node and its value is implicit: the value associated with nodes[i] is values[i]. There is no explicit pointer-to-value. Each index identifies both a node and a value, and this design allows us to pack as many of these indexes into cache as possible.

The contiguous property of the two vectors is an invariant of the list. The insertion and erasure operations are designed to achieve this:

• Insertion pushes a node and value to the back of their respective vectors, which is amortized $\mathcal O(1)$. The links are updated as usual.
• Erasure moves the back node/value to the index of the erased node/value and then pops both vectors. The links need to be updated to reflect this move, and it costs a move operation of one node and one value, but there is no deallocation and it is $\mathcal O(1)$.

With the basic design established I then fleshed out the functionality, following the interface of std::list so that it can be used as a drop-in replacement. The implementation is on GitHub: https://github.com/cwzx/cwds.

## 2.1 Benefits

There are several benefits to this approach that can be noted:

• As the implementation is backed by vectors, additional functionality not found in std::list can be provided, such as .reserve(). Being able to preallocate the memory is a big win for performance.
• No allocation on insert, apart from when the vectors need to grow to accommodate more elements. This can be reserved upfront.
• No deallocation on erase.
• As the values are separate from the linking machinery, we can implement some algorithms directly on the vector of values, bypassing the linked structure entirely. This is possible when the algorithm accesses all of the elements and doesn’t depend on their order. Examples:
  • .remove()
  • .remove_if()
  • .unique()

And when applied to the whole list, these STL algorithms can also benefit in this way:

• std::accumulate
• std::any_of
• std::all_of
• std::none_of
• std::count
• std::count_if
• std::fill
• std::replace
• std::replace_if

These can be given the full cache/prefetcher benefits of linear traversal over contiguous arrays.

• Some operations do not touch the values at all; they just rewire the links. For example, .reverse(). These operations can be implemented directly on the vector of nodes, bypassing the values entirely.
• The list can be constructed/assigned to using an existing vector of values by both move and copy.
• The list can be copied/moved without needing to update any links.
• Without any explicit pointers to worry about, the list can be easily serialized.

## 2.2 Drawbacks

There are also some trade-offs relative to std::list:

• Due to the separation of nodes from values, swapping two lists invalidates all iterators into both lists.
• Both .merge() and .splice() require allocation and moving the values across, as we require contiguous data.
• Erasing an element invalidates iterators, as the (vector) back is moved to the erased index, and the moved element could have any position in the list.

## 3 Benchmarks

Benchmarks were performed to compare cw::list with std::list. Tests were performed for different numbers of elements, and for different sizes of the value type. For cw::list, the smallest possible unsigned integer type was used for the index type depending on the number of elements tested. Each test was repeated multiple times.

The following compiler and settings were used:

• Visual Studio 2015 Preview
• x64 Release
• AVX2 instructions enabled

A custom timer based on Windows’ high-resolution performance counter (QueryPerformanceCounter) was used in place of std::chrono::high_resolution_clock.

The following hardware was used:

• Intel i5-4670K @ 4.4GHz (Haswell)
• 16GB DDR3 1600MHz CAS-9

Three list creation strategies were tested:

• Back — insert each value at the back.
• Mid — insert each value at the midpoint.
• Random — insert each value at the front or back, chosen at random with equal probability.

The first strategy produces a linear ordering (i.e. iterating over the list corresponds to linearly traversing the underlying arrays), the other two produce nonlinear orderings.

As we’re interested in comparing performance of the two list implementations, we will plot the ratio of time taken for std::list over the time taken for cw::list. Results greater than 1 indicate a performance improvement.

## 3.1 Pure Traversal

Pure traversal means iterating over all the elements of the list without touching the values.

For value types between 1 byte and 8 bytes we see the same pattern.

Going beyond 8 bytes, the advantage cw::list has over std::list at large list sizes grows.

## 3.1.1 Large list improvements

| Value Type Size (bytes) | Improvement Factor (linear) | Improvement Factor (nonlinear) | List Sizes |
|---|---|---|---|
| 1–8 | 1.6 | 3.2 | >300 000 |
| 16 | 2.3 | 4.5 | >250 000 |
| 32 | 3.2 | 7.5 | >200 000 |
| 64 | 4.75 | 14 | >100 000 |
| 128 | 15 | 20 | >120 000 |
| 1024 | 25 | 40 | >120 000 |

## 3.2 Accumulate

Now we test traversal where every value is accessed. std::accumulate sums the values of the list. As the objective here is to test traversal of the linked structure, we use the standard implementation of accumulate, rather than cw::list’s optimized version that bypasses the linked structure.

As with the previous benchmark, we see improvements for large lists, with benefits increasing with value size, although to a lesser degree this time. The nonlinear traversals see the greatest improvement, with the random pattern tending to see the most benefit.

This is good news, as we would expect a list in the wild to be in a nonlinear, pseudo-random configuration, particularly after prolonged use under many insertions/erasures/swaps/etc.

## 3.2.1 Large list improvements

| Value Type Size (bytes) | Improvement Factor (linear) | Improvement Factor (nonlinear) | List Sizes |
|---|---|---|---|
| 1–8 | 1.2 | 1.8 | >300 000 |
| 16 | 1.8 | 2.25 | >200 000 |
| 32 | 1.7 | 2.15 | >750 |
| 64 | 1.45 | 2.8 | >2500 |
| 128 | 1.9 | 3.9 | >2500 |
| 1024 | 3.6 | 6.2 | >70 000 |

## 3.3 Insert Random Values Sorted

This benchmark creates a list by inserting random values such that the list remains sorted. This tests both traversal (to find the insertion point) and insertion. The traversal is performed by std::lower_bound, which is a binary search. Note that a binary search uses pure traversal to navigate the lists (via std::advance). We can also compare the lists to std::vector, for which the binary search uses direct indexing.

For this benchmark, the maximum list sizes tested are significantly smaller than the previous two benchmarks. This is due to the complexity being $\mathcal O(n\log n)$ for the lists and $\mathcal O(n^2)$ for the vector.

First we can compare the two list implementations:

Again, cw::list performs better for large list sizes, for all value sizes.

We can also compare both std::list and cw::list with std::vector. The vector wins for small value types, but the lists win for large value types:

For cw::list, the performance advantage over std::vector begins at 256-byte values.

For std::list, the performance advantage over std::vector also begins at 256-byte values, but it appears to decay somewhat with increasing list sizes.

The performance of std::vector on this task declines significantly as the value size increases due to the cost of insertion, which must shift large amounts of data proportional to the value size.

For std::list, performance declines slightly with value size, either due to the allocation, or due to the cost of traversal, as for large value sizes it must use an entire cache line for each node traversal even though it doesn’t touch the value.

However, the performance of cw::list is almost completely independent of value size, as the dominant factor is the (pure) traversal required by the binary search, which is independent of value size.

## 4 Conclusion

Contiguous arrays are great for performance, but the requirements of a linked list mean we can’t traverse these linearly. However, we can still gain some benefit by mitigating the effect of pointer-chasing by separating the nodes from the values and making them as compact as possible.

The benchmarks show that this approach is a win over std::list for large list sizes and the benefit increases with value size.

However, for applications involving a combination of traversal and insertion, the vector still wins for small value types (128 bytes or less in this example), which appears to be true for all list sizes.

So unless you have large value types, the standing advice remains: just use a vector.

# Iterating Over Distinct Pairs in C++

It is common in programming to iterate over a data set, processing each element in turn. However, there are cases where one wants to process pairs of elements from the data set. This situation readily arises when evaluating binary relations between elements, such as collisions (intersections) between geometric objects, or the relation “has the same colour as” between coloured objects. Many binary relations (such as the aforementioned examples) feature two properties — symmetry and reflexivity — that mean we don’t have to process all pairs, but only a subset of the pairs. Symmetry means the order doesn’t matter — e.g. if x has the same colour as y then y has the same colour as x. Reflexivity means that every element is related to itself — e.g. x has the same colour as x for all x. These two properties mean we only need to iterate over the distinct pairs.

Let $X$ be a set of $N$ elements. The set of pairs of elements from $X$ is just the Cartesian product $X^2 = X\times X$. In this set the pair $(x,y)$ is a different element to $(y,x)$, and this set also includes elements paired with themselves, $(x,x)$. There are $N^2$ pairs in total.

On the other hand, we can consider distinct pairs, where the two elements $x$ and $y$ must be different, and the order of the pairing does not matter, i.e. $(x,y)$ and $(y,x)$ represent the same distinct pair. The set of distinct pairs contains $N(N-1)/2$ elements.

In order to iterate over the pairs of $X$, one typically writes nested for loops, like so:

for( int i = 0; i < N; ++i )
    for( int j = 0; j < N; ++j )
        process_pair( X[i], X[j] );


One can use the range-based for loops of C++11 to extend this to more general containers:

for( auto& x : X )
    for( auto& y : X )
        process_pair( x, y );


For distinct pairs, the indexed approach looks like this:

for( int i = 0; i < N; ++i )
    for( int j = i + 1; j < N; ++j )
        process_pair( X[i], X[j] );


i.e. for each i, the index j runs over values strictly greater than i.

What would be nice is to leverage the iterators of C++ in order to make use of both range-based for loops and the host of STL algorithms for processing pairs and distinct pairs of elements from a container, something like this:

for( auto&& p : pairs(X) )
    process_pair( p );

for( auto&& p : distinct_pairs(X) )
    process_pair( p );


The STL already has std::pair for representing a pair of values. The naive solution to this would be to have pairs() construct a new container of std::pairs, copying the elements over from the original container, and then iterating over the new container as normal. This is bad for the following reasons:

• The cost associated with allocation and copying.
• It multiplies the memory requirement by $2N+1$ for pairs, or by $N$ for distinct pairs. For large containers this can be a problem. In fact, it doesn’t take a very large value of $N$ for the memory requirements to exceed the limits of a desktop machine: if you have more than 131072 1-byte elements, you’ll need more than 32 GB to store all the pairs, which is the limit of most motherboards at the time of writing (185364 for distinct pairs).
• We may wish to mutate the original elements in place — mutating an element on one iteration may affect future iterations.

Due to the memory requirements, constructing a container of pairs of references is out of the question as well. A much more elegant solution is to use the existing iterators of the container to create new iterator types: pairs_iterator and distinct_pairs_iterator.

My implementation of this approach can be found on GitHub.

The two new iterator types describe each pair using a std::pair of iterators to the original container. The dereferencing operation returns a std::pair of references to the elements, which can then be used to read/write the elements that make up the pair. All of the usual iterator operations, such as incrementation, can be implemented in terms of those from the container iterator.

The only problematic operation is operator->(). Since the pairs of elements don’t actually exist anywhere in memory, we can’t return a pointer to them. Without this operator, the iterators technically don’t satisfy the InputIterator concept, but they do satisfy the more basic Iterator concept. It may be possible to resolve this by returning a custom wrapper that mimics a pointer to a pair of elements, but I haven’t tried it. It turns out that this shortcoming is not really a problem, since generic algorithms tend not to use this operator (it’s used to access members of the element type, and an arbitrary type need not have any members).

The iterators do satisfy the OutputIterator concept, as you can assign a std::pair of values to a std::pair of references, which will perform a member-wise assignment that updates the elements of the original container. This approach is therefore compatible with algorithms that mutate, although with great caution as the result may well be sensitive to the order of iteration over the pairs (which is the same order as in the indexed for loops above).

The utility functions pairs() and distinct_pairs() construct a wrapper object that represents the set of pairs without actually storing them. The wrapper only contains the range (pair of iterators) that defines the extent of original data set, and provides the key begin() and end() functionality that returns the new iterators. The const versions are called cpairs() and cdistinct_pairs().

The really cool thing about this approach is that pairs() and distinct_pairs() can be applied to any data set that is bookended by a pair of iterators. This means we can apply them to their own result to iterate over pairs-of-pairs:

for( auto pp : pairs(pairs(X)) )
    process_pair_of_pairs( pp );


This works because the reference type of the iterator isn’t actually a reference — it’s a std::pair of references from the underlying iterator. This means the type of pp is “a pair of a pair of references”, i.e. std::pair<std::pair<T&,T&>,std::pair<T&,T&>>, so there are no references to temporaries to worry about.

## Examples

### A big data set

// assumes the pairs/distinct_pairs header from the GitHub repo
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
using namespace std;

// a big vector
vector<int> v( 1 << 17 );

// fill the vector with ascending numbers starting with 1
iota( begin(v), end(v), 1 );

// count the number of distinct pairs whose sum is even
auto ps = distinct_pairs(v);
cout << count_if( begin(ps), end(ps),
    []( auto p ) {
        return ( p.first + p.second ) % 2 == 0;
    }
);

/* Output:
4294901760
*/


Note that it would take 64 GB of memory to store the distinct pairs for this example, yet this implementation had a running time of only 9 seconds and memory usage of around 1 MB.

### Print distinct pairs of distinct pairs

// assumes the pairs/distinct_pairs header from the GitHub repo
#include <iostream>
#include <vector>
using namespace std;

vector<int> v { 1, 2, 3, 4 };

for( auto p : distinct_pairs(distinct_pairs(v)) ) {
    cout << "( ( " << p.first.first << ", " << p.first.second << " ), ( "
         << p.second.first << ", " << p.second.second << " ) )" << endl;
}

/* Output:
( ( 1, 2 ), ( 1, 3 ) )
( ( 1, 2 ), ( 1, 4 ) )
( ( 1, 2 ), ( 2, 3 ) )
( ( 1, 2 ), ( 2, 4 ) )
( ( 1, 2 ), ( 3, 4 ) )
( ( 1, 3 ), ( 1, 4 ) )
( ( 1, 3 ), ( 2, 3 ) )
( ( 1, 3 ), ( 2, 4 ) )
( ( 1, 3 ), ( 3, 4 ) )
( ( 1, 4 ), ( 2, 3 ) )
( ( 1, 4 ), ( 2, 4 ) )
( ( 1, 4 ), ( 3, 4 ) )
( ( 2, 3 ), ( 2, 4 ) )
( ( 2, 3 ), ( 3, 4 ) )
( ( 2, 4 ), ( 3, 4 ) )
*/


## Limitations

If your data set contains two different elements that are equal, the distinct_pairs iterator will still consider them to be distinct. In other words, ‘distinct’ refers to instances of elements, not to semantic distinctness with respect to the equality comparison.