Iterating Over Distinct Pairs in C++

It is common in programming to iterate over a data set, processing each element in turn. However, there are cases where one wants to process pairs of elements from the data set. This situation readily arises when evaluating binary relations between elements, such as collisions (intersections) between geometric objects, or the relation “has the same colour as” between coloured objects. Many binary relations (such as the aforementioned examples) feature two properties, symmetry and reflexivity, that mean we don’t have to process all pairs, but only a subset of them. Symmetry means the order doesn’t matter: if x has the same colour as y, then y has the same colour as x. Reflexivity means that every element is related to itself: x has the same colour as x for all x. Symmetry makes one of $(x,y)$ and $(y,x)$ redundant, and reflexivity makes $(x,x)$ trivial, so we only need to iterate over the distinct pairs.

Let $X$ be a set of $N$ elements. The set of pairs of elements from $X$ is just the Cartesian product $X^2 = X\times X$. In this set the pair $(x,y)$ is a different element to $(y,x)$, and this set also includes elements paired with themselves, $(x,x)$. There are $N^2$ pairs in total.

On the other hand, we can consider distinct pairs, where the two elements $x$ and $y$ must be different, and the order of the pairing does not matter, i.e. $(x,y)$ and $(y,x)$ represent the same distinct pair. The set of distinct pairs contains $N(N-1)/2$ elements.

In order to iterate over the pairs of $X$, one typically writes nested for loops, like so:

for(int i=0;i<N;++i)
    for(int j=0;j<N;++j)
        process_pair( X[i], X[j] );


One can use the range-based for loops of C++11 to extend this to more general containers:

for( auto& x : X )
    for( auto& y : X )
        process_pair( x, y );


For distinct pairs, the indexed approach looks like this:

for(int i=0;i<N;++i)
    for(int j=i+1;j<N;++j)
        process_pair( X[i], X[j] );


i.e. for each i, the index j runs over values strictly greater than i.
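
As a quick sanity check of the $N(N-1)/2$ count, the distinct-pairs loop above can be instrumented to count its own iterations. This is only a minimal sketch: `count_distinct_pairs` and `distinct_pairs_formula` are hypothetical helper names introduced here for illustration, and the counter stands in for the call to process_pair.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the (i, j) index pairs visited by the
// distinct-pairs loop for a data set of n elements.
std::size_t count_distinct_pairs( std::size_t n )
{
    std::size_t visited = 0;
    for( std::size_t i = 0; i < n; ++i )
        for( std::size_t j = i + 1; j < n; ++j )
            ++visited;   // stands in for process_pair( X[i], X[j] )
    return visited;
}

// Closed form from the text: N(N-1)/2.
std::size_t distinct_pairs_formula( std::size_t n )
{
    return n < 2 ? 0 : n * ( n - 1 ) / 2;
}
```

For example, four elements give six distinct pairs, compared with sixteen pairs in the full Cartesian product.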

What would be nice is to leverage the iterators of C++ in order to make use of both range-based for loops and the host of STL algorithms for processing pairs and distinct pairs of elements from a container, something like this:

for( auto& p : pairs(X) )
    process_pair( p );

for( auto& p : distinct_pairs(X) )
    process_pair( p );


The STL already has std::pair for representing a pair of values. The naive solution to this would be to have pairs() construct a new container of std::pairs, copying the elements over from the original container, and then iterating over the new container as normal. This is bad for the following reasons:

• The cost associated with allocation and copying.
• Counting the original container, it multiplies the memory requirement by $2N+1$ for pairs, or by $N$ for distinct pairs. For large containers this can be a problem. In fact, it doesn’t take a very large value of $N$ for the memory requirements to exceed the limits of a desktop machine: with more than 131072 one-byte elements, storing all the pairs takes more than 32 GB, which is the limit of most motherboards at the time of writing. For distinct pairs, the same limit is exceeded at 185364 elements.
• We may wish to mutate the original elements in place, so that a change made to an element on one iteration is visible to later iterations. Copies of the elements cannot provide this.
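
The memory figures in the list above can be reproduced with a little arithmetic. The sketch below assumes the accounting that matches the factors $2N+1$ and $N$: the original container is counted together with the copied pairs, and each stored pair holds two full elements. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Total bytes for the original N elements plus a copy of all N^2 ordered
// pairs (two elements per pair): N*s + 2*N^2*s = N*(2N+1)*s.
std::uint64_t total_bytes_with_all_pairs( std::uint64_t n, std::uint64_t element_size )
{
    return n * ( 2 * n + 1 ) * element_size;
}

// Total bytes for the original N elements plus a copy of all N(N-1)/2
// distinct pairs: N*s + N*(N-1)*s = N^2*s.
std::uint64_t total_bytes_with_distinct_pairs( std::uint64_t n, std::uint64_t element_size )
{
    return n * n * element_size;
}
```

With one-byte elements, `total_bytes_with_all_pairs( 131072, 1 )` is just over $2^{35}$ bytes (32 GB in binary units), and `total_bytes_with_distinct_pairs` first passes $2^{35}$ at 185364 elements.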

Due to the memory requirements, constructing a container of pairs of references is out of the question as well. A much more elegant solution is to use the existing iterators of the container to create new iterator types: pairs_iterator and distinct_pairs_iterator.

My implementation of this approach can be found on GitHub.

The two new iterator types describe each pair using a std::pair of iterators to the original container. The dereferencing operation returns a std::pair of references to the elements, which can then be used to read/write the elements that make up the pair. All of the usual iterator operations, such as incrementation, can be implemented in terms of those from the container iterator.
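
To make the description above concrete, here is a minimal sketch of what such an iterator could look like. It is not the implementation from the repository: the class name `distinct_pairs_iterator_sketch`, its constructor signature and the helper `collect_distinct_pairs` are assumptions made for illustration, and only the operations needed by a plain loop (dereference, pre-increment, comparison) are provided.

```cpp
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical sketch of an iterator over distinct pairs, layered on an
// existing forward iterator type It. Not the author's implementation.
template< typename It >
class distinct_pairs_iterator_sketch
{
    It first_, second_, end_;

public:
    using element_ref = typename std::iterator_traits< It >::reference;
    using reference   = std::pair< element_ref, element_ref >;

    // Begin iterator: ( begin, end ). End iterator: ( end, end ).
    distinct_pairs_iterator_sketch( It first, It end )
      : first_( first ), second_( first ), end_( end )
    {
        if( second_ != end_ ) ++second_;       // partner starts one past first_
        if( second_ == end_ ) first_ = end_;   // fewer than two elements: already at the end
    }

    // A std::pair of references to the two elements of the current pair.
    reference operator*() const { return reference( *first_, *second_ ); }

    distinct_pairs_iterator_sketch& operator++()
    {
        if( ++second_ == end_ )        // partners of first_ exhausted
        {
            ++first_;
            second_ = first_;
            if( ++second_ == end_ )    // no partner left for the new first_
                first_ = end_;
        }
        return *this;
    }

    bool operator==( const distinct_pairs_iterator_sketch& other ) const
    {
        return first_ == other.first_ && second_ == other.second_;
    }
    bool operator!=( const distinct_pairs_iterator_sketch& other ) const
    {
        return !( *this == other );
    }
};

// Example use: copy out every distinct pair of a vector, in iteration order.
std::vector< std::pair< int, int > > collect_distinct_pairs( std::vector< int >& v )
{
    using iter = distinct_pairs_iterator_sketch< std::vector< int >::iterator >;
    std::vector< std::pair< int, int > > out;
    for( iter it( v.begin(), v.end() ), last( v.end(), v.end() ); it != last; ++it )
    {
        auto p = *it;                          // std::pair< int&, int& >
        out.emplace_back( p.first, p.second );
    }
    return out;
}
```

The iterator stores only three underlying iterators, so its size is independent of $N$. Writing through the dereferenced pair, e.g. `(*it).second = 20`, modifies the element in the original container.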

The only problematic operation is operator->(). Since the pairs of elements don’t actually exist anywhere in memory, we can’t return a pointer to them. Without this operator, the iterators technically don’t satisfy the InputIterator concept, but do satisfy the more basic Iterator concept. It may be possible to resolve this by returning a custom wrapper that mimics a pointer to a pair of elements, but I haven’t tried it. It turns out that this shortcoming is not really a problem, since generic algorithms tend not to use this operator (it’s used to access members of the element type, and an arbitrary type need not have any members).
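
For what it’s worth, the wrapper idea mentioned above might look something like the following. This is an untested-in-anger sketch rather than part of the implementation: `arrow_proxy_sketch` and `int_pair_cursor` are hypothetical names, and the cursor is a deliberately tiny stand-in for a real pair iterator. It relies on the language rule that operator-> is applied repeatedly until a raw pointer is obtained.

```cpp
#include <cassert>
#include <utility>

// Hypothetical wrapper: owns a temporary pair of references and mimics a
// pointer to it, so that the expression it->first can compile.
template< typename Pair >
class arrow_proxy_sketch
{
    Pair value_;

public:
    explicit arrow_proxy_sketch( Pair value ) : value_( value ) {}

    // The compiler chains operator-> calls until it reaches a raw pointer.
    Pair* operator->() { return &value_; }
};

// Minimal stand-in for a pair iterator: just two pointers into the data.
struct int_pair_cursor
{
    int* a;
    int* b;

    using reference = std::pair< int&, int& >;

    reference operator*() const { return reference( *a, *b ); }

    // Returns the proxy by value; it lives until the end of the full
    // expression in which -> is used, which is long enough for it->first.
    arrow_proxy_sketch< reference > operator->() const
    {
        return arrow_proxy_sketch< reference >( **this );
    }
};
```

With this in place, `c->first` behaves like `(*c).first` for a cursor `c`. Whether that is enough to satisfy every formal requirement of the InputIterator concept is something I have not verified.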

The iterators do satisfy the OutputIterator concept, as you can assign a std::pair of values to a std::pair of references, which will perform a member-wise assignment that updates the elements of the original container. This approach is therefore compatible with algorithms that mutate, though such algorithms should be used with great caution, as the result may well be sensitive to the order of iteration over the pairs (which is the same order as in the indexed for loops above).
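
The member-wise assignment described above is ordinary std::pair behaviour and can be seen in isolation. In this small sketch, `overwrite` is a hypothetical helper name; the only library feature used is the converting assignment operator of std::pair.

```cpp
#include <cassert>
#include <utility>

// Overwrites two ints in place by assigning a pair of values to a pair
// of references, as an output iterator over pairs would do.
void overwrite( int& x, int& y, int new_x, int new_y )
{
    std::pair< int&, int& > refs( x, y );    // refers to x and y, owns nothing
    refs = std::make_pair( new_x, new_y );   // member-wise: x = new_x, y = new_y
}
```

After `overwrite( a, b, 5, 6 )`, the variables `a` and `b` hold 5 and 6. The pair of references itself is never rebound.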

The utility functions pairs() and distinct_pairs() construct a wrapper object that represents the set of pairs without actually storing them. The wrapper only contains the range (pair of iterators) that defines the extent of the original data set, and provides the key begin() and end() functionality that returns the new iterators. The const versions are called cpairs() and cdistinct_pairs().
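
A minimal sketch of such a wrapper, here for the full set of ordered pairs, could look like this. The names `pairs_iterator_sketch`, `pairs_range_sketch`, `pairs_sketch` and `sum_of_products` are all hypothetical and chosen so as not to be confused with the real pairs(); the layout of the actual implementation may well differ.

```cpp
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical iterator over all ordered pairs of the range [begin, end).
template< typename It >
class pairs_iterator_sketch
{
    It first_, second_, begin_, end_;

public:
    using element_ref = typename std::iterator_traits< It >::reference;
    using reference   = std::pair< element_ref, element_ref >;

    pairs_iterator_sketch( It first, It begin, It end )
      : first_( first ), second_( begin ), begin_( begin ), end_( end ) {}

    reference operator*() const { return reference( *first_, *second_ ); }

    pairs_iterator_sketch& operator++()
    {
        if( ++second_ == end_ ) { ++first_; second_ = begin_; }   // wrap the inner index
        return *this;
    }

    bool operator==( const pairs_iterator_sketch& o ) const
    {
        return first_ == o.first_ && second_ == o.second_;
    }
    bool operator!=( const pairs_iterator_sketch& o ) const { return !( *this == o ); }
};

// The wrapper stores only the two iterators that bound the original data.
template< typename It >
class pairs_range_sketch
{
    It begin_, end_;

public:
    pairs_range_sketch( It begin, It end ) : begin_( begin ), end_( end ) {}

    pairs_iterator_sketch< It > begin() const { return { begin_, begin_, end_ }; }
    pairs_iterator_sketch< It > end()   const { return { end_,   begin_, end_ }; }
};

// Utility function: deduces the iterator type from the container.
template< typename Container >
auto pairs_sketch( Container& c ) -> pairs_range_sketch< decltype( std::begin( c ) ) >
{
    return { std::begin( c ), std::end( c ) };
}

// Example: the sum of x*y over all ordered pairs equals (sum of x)^2.
long sum_of_products( const std::vector< int >& v )
{
    long total = 0;
    for( auto p : pairs_sketch( v ) )
        total += static_cast< long >( p.first ) * p.second;
    return total;
}
```

Because the wrapper exposes begin() and end(), it works directly in a range-based for loop. Passing a const container makes the underlying iterators const iterators, which is roughly what a separate cpairs() would guarantee.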

The really cool thing about this approach is that pairs() and distinct_pairs() can be applied to any data set that is bookended by a pair of iterators. This means we can apply it to itself to iterate over the pairs-of-pairs:

for( auto pp : pairs(pairs(X)) )
    process_pair_of_pairs( pp );


This works because the reference type of the iterator isn’t actually a reference — it’s a std::pair of references from the underlying iterator. This means the type of pp is “a pair of a pair of references”, i.e. std::pair<std::pair<T&,T&>,std::pair<T&,T&>>, so there are no references to temporaries to worry about.
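
The type argument above can be checked directly with std::pair, without any of the iterator machinery. The sketch below uses `int` for `T`; the aliases `ref_pair` and `nested_pair` and the function `make_nested` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

using ref_pair    = std::pair< int&, int& >;
using nested_pair = std::pair< ref_pair, ref_pair >;

// The outer pair holds the inner pairs by value, not by reference ...
static_assert( !std::is_reference< nested_pair::first_type >::value,
               "inner pairs are stored by value" );
// ... while the inner pairs still refer to the original elements.
static_assert( std::is_same< ref_pair::first_type, int& >::value,
               "inner members are references" );

// Builds a pair-of-pairs from two temporary inner pairs. The result stays
// valid after the temporaries are destroyed, because the inner pairs are
// copied and only the references to the ints are retained.
nested_pair make_nested( int& a, int& b, int& c, int& d )
{
    return nested_pair( ref_pair( a, b ), ref_pair( c, d ) );
}
```

Writing to `pp.second.first` on the returned object changes the third int passed in, which is the behaviour the nested iteration relies on.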

Examples

A big data set

// a big vector
vector<int> v( 1 << 17 );

// fill the vector with ascending numbers starting with 1
iota( begin(v), end(v), 1 );

// count the number of distinct pairs whose sum is even
auto ps = distinct_pairs(v);
cout << count_if( begin(ps), end(ps),
    []( auto p ) {
        return ( p.first + p.second ) % 2 == 0;
    }
);

/* Output:
4294901760
*/


Note that it would take 64 GB of memory to store the distinct pairs for this example, yet this implementation had a running time of only 9 seconds and memory usage of around 1 MB.
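
The printed count can be cross-checked without iterating over billions of pairs: a sum is even exactly when both numbers are even or both are odd. The sketch below compares that closed form against a brute-force loop for small inputs. Both function names are hypothetical, and the data are assumed to be the integers 1 to n, as in the example.

```cpp
#include <cassert>
#include <cstdint>

// Number of ways to choose 2 items out of k.
std::uint64_t choose2( std::uint64_t k )
{
    return k < 2 ? 0 : k * ( k - 1 ) / 2;
}

// Distinct pairs from 1..n with an even sum: both even or both odd.
std::uint64_t even_sum_pairs_closed_form( std::uint64_t n )
{
    const std::uint64_t evens = n / 2;
    const std::uint64_t odds  = n - evens;
    return choose2( evens ) + choose2( odds );
}

// Direct count with nested loops, only practical for small n.
std::uint64_t even_sum_pairs_brute_force( std::uint64_t n )
{
    std::uint64_t count = 0;
    for( std::uint64_t x = 1; x <= n; ++x )
        for( std::uint64_t y = x + 1; y <= n; ++y )
            if( ( x + y ) % 2 == 0 ) ++count;
    return count;
}
```

For n = 131072 the closed form gives 65536 × 65535 = 4294901760, matching the output above.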

Print distinct pairs of distinct pairs

vector<int> v { 1, 2, 3, 4 };

for( auto p : distinct_pairs(distinct_pairs(v)) ) {
    cout << "( ( " << p.first.first << ", " << p.first.second << " ), ( " << p.second.first << ", " << p.second.second << " ) )" << endl;
}

/* Output:
( ( 1, 2 ), ( 1, 3 ) )
( ( 1, 2 ), ( 1, 4 ) )
( ( 1, 2 ), ( 2, 3 ) )
( ( 1, 2 ), ( 2, 4 ) )
( ( 1, 2 ), ( 3, 4 ) )
( ( 1, 3 ), ( 1, 4 ) )
( ( 1, 3 ), ( 2, 3 ) )
( ( 1, 3 ), ( 2, 4 ) )
( ( 1, 3 ), ( 3, 4 ) )
( ( 1, 4 ), ( 2, 3 ) )
( ( 1, 4 ), ( 2, 4 ) )
( ( 1, 4 ), ( 3, 4 ) )
( ( 2, 3 ), ( 2, 4 ) )
( ( 2, 3 ), ( 3, 4 ) )
( ( 2, 4 ), ( 3, 4 ) )
*/


Limitations

If your data set contains two different elements that are equal, the distinct_pairs iterator will still consider them to be distinct. In other words, ‘distinct’ refers to instances of elements, not to semantic distinctness with respect to the equality comparison.
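
The difference is easy to see with plain indexed loops, which visit pairs in the same way as the iterator. In this sketch both function names are hypothetical; the second shows one possible workaround, which is simply to skip pairs whose values compare equal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts distinct pairs of instances (positions), regardless of value.
std::size_t count_pairs_by_instance( const std::vector< int >& v )
{
    std::size_t count = 0;
    for( std::size_t i = 0; i < v.size(); ++i )
        for( std::size_t j = i + 1; j < v.size(); ++j )
            ++count;
    return count;
}

// Counts only those distinct pairs whose two values differ.
std::size_t count_pairs_with_unequal_values( const std::vector< int >& v )
{
    std::size_t count = 0;
    for( std::size_t i = 0; i < v.size(); ++i )
        for( std::size_t j = i + 1; j < v.size(); ++j )
            if( v[i] != v[j] ) ++count;
    return count;
}
```

For the data set { 1, 1, 2 } the first function reports three pairs, including ( 1, 1 ), while the second reports two.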