## Contents

## 1 Introduction

The linked list is a well-known data structure, with a straightforward implementation that is a common exercise for students and interviewees alike. The advantage of a linked list is the cost of insertion and erasure of elements at arbitrary positions, which does not require the existing data elements to be moved around. It is also useful for swapping elements, or splicing two lists together, which can be done purely by modifying the links.

The problem is that the traversal of a linked list is deeply offensive to modern hardware, and it’s difficult to avoid traversal when implementing a higher-level algorithm. Even if the use case primarily concerns insertion/erasure, there is often some traversal involved to get to the required position of insertion/erasure.

Pointer chasing is one of the best ways to destroy performance on modern hardware, and the linked list is the king of pointer chasing. This is something that has been highlighted in recent years as the importance of cache-friendliness has become more publicized. Here are a few recent talks on the topic:

- Build 2014: Herb Sutter — Modern C++: What You Need to Know
- CppCon 2014: Mike Acton — Data-Oriented Design and C++
- CppCon 2014: Chandler Carruth — Efficiency with Algorithms, Performance with Data Structures
- Code::Dive 2014: Scott Meyers — CPU Caches and Why You Care

The traversal of a linked list is so disastrous that its cost can overwhelm the linked list’s benefits, to such an extent that a vector ends up being the better choice, despite its expensive insert/erase. This is particularly the case with conventional implementations, such as `std::list`

, that do separate allocations for each node. This has the cost of an allocation/deallocation for every insertion/erasure, and potentially scatters the nodes wildly over memory.

An alternate implementation is the intrusive linked list, where the allocation is outsourced to the user, who can preallocate the nodes in a contiguous array and then link/unlink them at their leisure. Allocating the nodes in a contiguous array improves the situation somewhat, but the link pointers (`prev`

, `next`

) can still be spread out over a large area as they are separated by the data elements:

`prev next [---data---] prev next [---data---] prev next [---data---] ...`

When traversing the list, we don’t necessarily need to access the whole of each data element; we primarily wish to follow the trail of pointers. We will refer to traversal that doesn’t touch the values as *pure* traversal. This memory layout is inefficient for pure traversal as a significant amount of each cache line contains irrelevant data, wasting both memory bandwidth and increasing the chance of cache misses. This gets worse with larger data types.

## 2 Reinventing the Linked List

My objective was to create a non-standard linked list that attempts to mitigate the cost of pure traversal while still retaining the insert/erase benefit, and then to see how that affects the performance of various algorithms. To do this, I began by separating the storage of the values from the nodes that describe the linked structure.

struct list { struct node { index_type prev, next; }; std::vector<value_type> values; std::vector<node> nodes; index_type head, tail; };

(templates omitted)

Both the nodes and the values are stored in their own contiguous arrays. This allows for indexes to be used instead of pointers, and this in turn allows for small unsigned integers (such as `uint16_t`

) to be used instead of the full 64-bits that you get with a pointer. The idea here is to make the array of nodes as compact as possible.

`[---data---] [---data---] [---data---] [---data---] [---data---] ...`

prev next prev next prev next prev next prev next ...

The association between a node and its value is implicit — the value associated with `nodes[i]`

is `values[i]`

. There is no explicit pointer-to-value. Each index identifies both a node and a value, and this design allows us to pack as much of these indexes into cache as possible.

The contiguous property of the two vectors is an invariant of the list. The insertion and erasure operations are designed to achieve this:

- Insertion pushes a node and value to the back of their respective vectors, which is amortized . The links are updated as usual.
- Erasure moves the back node/value to the index of the erased node/value and then pops both vectors. The links need to be updated to reflect this move, and it costs a move operation of one node and one value, but there is no deallocation and it is .

With the basic design established I then fleshed out the functionality, following the interface of `std::list`

so that it can be used as a drop-in replacement. The implementation is on GitHub: https://github.com/cwzx/cwds.

## 2.1 Benefits

There are several benefits to this approach that can be noted:

- As the implementation is backed by vectors, additional functionality not found in
`std::list`

can be provided, such as`.reserve()`

. Being able to preallocate the memory is a big win for performance. - No allocation on insert, apart from when the vectors need to grow to accommodate more elements. This can be reserved upfront.
- No deallocation on erase.
- As the values are separate to the linking machinery, we can implement some algorithms directly on the vector of values, bypassing the linked structure entirely. This is possible when the algorithm accesses all of the elements and doesn’t depend on the order of the elements. Examples:
`.remove()`

`.remove_if()`

`.unique()`

`std::accumulate`

`std::any_of`

`std::all_of`

`std::none_of`

`std::count`

`std::count_if`

`std::fill`

`std::replace`

`std::replace_if`

- Some operations do not touch the values at all, they just rewire the links. For example,
`.reverse()`

. These operations can be implemented directly on the vector of nodes, bypassing the values entirely. - The list can be constructed/assigned to using an existing vector of values by both move and copy.
- The list can be copied/moved without needing to update any links.
- Without any explicit pointers to worry about, the list can be easily serialized.

And when applied to the whole list, these STL algorithms can also benefit in this way:

These can be given the full cache/prefetcher benefits of linear traversal over contiguous arrays.

## 2.2 Disadvantages

- Due to the separation of nodes from values, swapping two lists invalidates all iterators to both lists.
- Both
`.merge()`

and`.splice()`

require allocation and moving the values across, as we require contiguous data. - Erasing an element invalidates iterators, as the (vector) back is moved to the erased index, and the moved element could have any position in the list.

## 3 Benchmarks

Benchmarks were performed to compare `cw::list`

with `std::list`

. Tests were performed for different numbers of elements, and for different sizes of the value type. For `cw::list`

, the smallest possible unsigned integer type was used for the index type depending on the number of elements tested. Each test was repeated multiple times.

The following compiler and settings were used:

- Visual Studio 2015 Preview
- x64 Release
- AVX2 instructions enabled

A custom implementation of `std::chrono::high_resolution_timer`

was used that uses Window’s actual high resolution timer (`QueryPerformanceCounter`

).

The following hardware was used:

- Intel i5-4670K @ 4.4GHz (Haswell)
- 16GB DDR3 1600MHz CAS-9

Three list creation strategies were tested:

- Back — insert each value at the back.
- Mid — insert each value at the midpoint.
- Random — insert each value at the front or back, chosen at random with equal probability.

The first strategy produces a linear ordering (i.e. iterating over the list corresponds to linearly traversing the underlying arrays), the other two produce nonlinear orderings.

As we’re interested in comparing performance of the two list implementations, we will plot the ratio of time taken for `std::list`

over the time taken for `cw::list`

. Results greater than 1 indicate a performance improvement.

## 3.1 Pure Traversal

Pure traversal means iterating over all the elements of the list without touching the values.

For value types between 1 byte and 8 bytes we see the same pattern:

Going beyond 8 bytes, the advantage `cw::list`

has over `std::list`

at large list sizes grows:

## 3.1.1 Large list improvements

Value Type Size (bytes) |
Improvement Factor (linear) |
Improvement Factor (nonlinear) |
List Sizes |
---|---|---|---|

1–8 | 1.6 | 3.2 | >300 000 |

16 | 2.3 | 4.5 | >250 000 |

32 | 3.2 | 7.5 | >200 000 |

64 | 4.75 | 14 | >100 000 |

128 | 15 | 20 | >120 000 |

1024 | 25 | 40 | >120 000 |

## 3.2 Accumulate

Now we test traversal where every value is accessed. `std::accumulate`

sums the values of the list. As the objective here is to test traversal of the linked structure, we use the standard implementation of `accumulate`

, rather than `cw::list`

‘s optimized version that bypasses the linked structure.

As with the previous benchmark, we see improvements for large lists, with benefits increasing with value size, although to a lesser degree this time. The nonlinear traversals see the greatest improvement, with the random pattern tending to see the most benefit.

This is good news, as we would expect a list in the wild to be in a nonlinear, pseudo-random configuration, particularly after prolonged use under many insertions/erasures/swaps/etc.

## 3.2.1 Large list improvements

Value Type Size (bytes) |
Improvement Factor (linear) |
Improvement Factor (nonlinear) |
List Sizes |
---|---|---|---|

1–8 | 1.2 | 1.8 | >300 000 |

16 | 1.8 | 2.25 | >200 000 |

32 | 1.7 | 2.15 | >750 |

64 | 1.45 | 2.8 | >2500 |

128 | 1.9 | 3.9 | >2500 |

1024 | 3.6 | 6.2 | >70 000 |

## 3.3 Insert Random Values Sorted

This benchmark creates a list by inserting random values such that the list remains sorted. This tests both traversal (to find the insertion point) and insertion. The traversal is performed by `std::lower_bound`

, which is a binary search. Note that a binary search uses pure traversal to navigate the lists (via `std::advance`

). We can also compare the lists to `std::vector`

, for which the binary search uses direct indexing.

For this benchmark, the maximum list sizes tested are significantly smaller than the previous two benchmarks. This is due to the complexity being for the lists and for the vector.

First we can compare the two list implementations:

Again, `cw::list`

performs better for large lists sizes, for all value sizes.

We can also compare both `std::list`

and `cw::list`

with `std::vector`

. The vector wins for small value types, but the lists win for large value types:

For `cw::list`

, the performance advantage over `std::vector`

begins at 256-byte values.

For `std::list`

, the performance advantage over `std::vector`

also begins at 256-byte values, but it appears to decay somewhat with increasing list sizes.

The performance of `std::vector`

on this task declines significantly as the value size increases due to the cost of insertion, which must shift large amounts of data proportional to the value size.

For `std::list`

, performance declines slightly with value size, either due to the allocation, or due to the cost of traversal, as for large value sizes it must use an entire cache line for each node traversal even though it doesn’t touch the value.

However, the performance of `cw::list`

is almost completely independent of value size, as the dominant factor is the (pure) traversal required by the binary search, which is independent of value size.

## 4 Conclusion

Contiguous arrays are great for performance, but the requirements of a linked list mean we can’t traverse these linearly. However, we can still gain some benefit by mitigating the effect of pointer-chasing by separating the nodes from the values and making them as compact as possible.

The benchmarks show that this approach is a win over `std::list`

for large list sizes and the benefit increases with value size.

However, for applications involving a combination of traversal and insertion, the vector still wins for small value types (128 bytes or less in this example), which appears to be true for all list sizes.

So unless you have large value types, the standing advice remains: just use a vector.