Rust hash iteration+reinsertion

It was recently discovered that some surprising operations on Rust’s standard hash table types could go quadratic.

Perhaps the simplest illustration is this snippet from a comment, here simplified even further:

use std::collections::hash_set::HashSet;

fn main() {
    println!("populating...");

    let mut one = HashSet::new();
    for i in 1..5000000 {
        one.insert(i);
    }
    println!("cloning...");

    let mut two = HashSet::new();
    for v in one {
        two.insert(v);
    }
}

In this example, the first loop (populating one) finishes fairly quickly, as you would expect, but the second loop (copying into two) takes much, much longer.

I enjoy this bug for at least two reasons: One, it’s fun technically, allowing us to take a brief deep dive into hash-table implementation – something most of us are normally able to treat as a solved problem – and two, it’s a great example of subtle quadratic behavior in a system specifically designed by smart and well-informed developers to avoid the possibility of accidental quadratic behavior!

Robin Hood hashing

To understand the bug, we’re going to need to understand the strategy used by Rust’s standard hash tables. A major challenge in any general-purpose hash table implementation is how to handle hash collisions: if you have a table with N buckets, eventually you’ll get two elements with the same value of hashcode%N, and what do you do?

Rust uses Robin Hood hashing, a relatively old technique that’s recently received new attention. Robin Hood hashing is a variation on good old open addressing with linear probing, with a small twist.

First, let’s refresh our memory:

In hash tables, “open addressing” refers to the technique of, upon encountering a collision, somehow selecting an alternate location in the hash table. It’s most commonly contrasted with “chaining”, in which you store multiple entries in a single hash bucket, typically using a linked list.

Linear probing, in particular, means that you select alternate buckets by scanning forward from the natural bucket. If you try to insert an element with hash code H into a table with N buckets, and bucket H%N is full, you try (H+1)%N, (H+2)%N and so on, until you find an empty bucket. On lookup, you start at H%N and scan forward until you find your element or an empty bucket.

Open addressing with linear probing is arguably the simplest possible general-purpose hash table implementation, and it can work fine (assuming a good hash function) as long as you don’t let your load factor get too high (reminder: the load factor is the number of elements stored in the table, divided by the number of buckets in the hash table. A load factor of 0.5 means that half the buckets are in use). If your load factor gets too high, then elements can end up extremely far from their natural hash bucket, necessitating a lot of scanning on inserts and lookups.
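To make this concrete, here’s a minimal sketch of open addressing with linear probing. This is my own illustration, not Rust’s implementation; for simplicity, a key serves as its own hash code.

// A toy open-addressing table with linear probing (illustration only).
// Assumes the table never becomes completely full, so probe loops terminate.
struct Table {
    buckets: Vec<Option<u64>>,
}

impl Table {
    fn new(n: usize) -> Table {
        Table { buckets: vec![None; n] }
    }

    fn insert(&mut self, key: u64) {
        let n = self.buckets.len();
        let mut idx = key as usize % n; // the "natural" bucket: hash % N
        // Scan forward, wrapping around, until we find an empty bucket.
        while self.buckets[idx].is_some() {
            idx = (idx + 1) % n;
        }
        self.buckets[idx] = Some(key);
    }

    fn contains(&self, key: u64) -> bool {
        let n = self.buckets.len();
        let mut idx = key as usize % n;
        // Stop when we find the key or hit an empty bucket.
        while let Some(k) = self.buckets[idx] {
            if k == key {
                return true;
            }
            idx = (idx + 1) % n;
        }
        false
    }
}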

Robin Hood hashing improves on linear probing with a simple trick: When you’re scanning forward from H%N, look at each element you encounter. If the inserting element is further from its “natural” bucket than the element in the bucket you’re considering, then swap the new element and the element in that bucket, and continue scanning, now carrying the displaced element.

Let’s consider an example. Imagine this is a subset of a hash table; I’ve labeled each bucket below it with its index, and inside the bucket I’ve placed hashcode%N, what I’ve been calling the “natural” index of an element:

+---+---+---+---+---+
| 0 | 0 | 2 | 1 |   | ...
+---+---+---+---+---+
  0   1   2   3   4

Now suppose we want to insert an element with hash%N == 1. We start at bucket 1, and find it full. The element in that bucket is at index 1 and wants to be at index 0, so it’s further from home than our new element is. We keep looking. At index 2 we find an element with natural index of 2. It is a distance 0 from home, and we’re a distance 1, which is further. So we place our new element at index 2, and pick up whatever element used to be there, and continue. Eventually, we end up with:

+---+---+---+---+---+
| 0 | 0 | 1 | 1 | 2 | ...
+---+---+---+---+---+
  0   1   2   3   4

The net result of this process is to reduce the variance of how far out-of-position elements are. Reducing variance means we have much more predictable insert and lookup times, which is strongly desirable. When we employ Robin Hood hashing, we can safely fill a table to a load factor of 90% or even 95%. We can even prove (as the author of the original paper did) fairly tight bounds on the expected maximum probe distance.
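In code, the only change from plain linear probing is inside the probe loop. Here’s a rough sketch of Robin Hood insertion in the same toy style as the earlier snippet (again my own illustration, not Rust’s implementation):

// Robin Hood insertion into a table of Option<u64> buckets (illustration only).
// As before, a key serves as its own hash code, and we assume the table never
// becomes completely full.
fn robin_hood_insert(buckets: &mut [Option<u64>], key: u64) {
    let n = buckets.len();
    // How far the key stored at `idx` currently sits from its natural bucket.
    let displacement = |key: u64, idx: usize| (idx + n - key as usize % n) % n;

    let mut key = key;              // the element we are currently carrying
    let mut idx = key as usize % n; // start at its natural bucket
    let mut dist = 0;               // its distance from home so far
    loop {
        let resident = buckets[idx]; // Option<u64> is Copy
        match resident {
            None => {
                buckets[idx] = Some(key);
                return;
            }
            Some(existing) => {
                let existing_dist = displacement(existing, idx);
                if existing_dist < dist {
                    // The resident is closer to home than we are: take its
                    // bucket and continue scanning with the displaced element.
                    buckets[idx] = Some(key);
                    key = existing;
                    dist = existing_dist;
                }
            }
        }
        idx = (idx + 1) % n;
        dist += 1;
    }
}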

The Problem

Rust’s problem arises when you iterate over one table and insert the resulting entries into another table. The problem occurs because iteration over a hash table proceeds by walking the backing array of the hash table directly. This results in yielding entries approximately in hash order, which turns out to be fatal when combined with the other elements of Rust’s implementation.

The final detail we need to know in order to understand the bug is Rust’s table-sizing algorithm. Rust uses power-of-two sized tables (a common technique, since it means you can use bitmasks instead of more-expensive modulus), and grows tables when they reach a 90% load factor.
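As an aside, the power-of-two sizing is what makes the bucket computation cheap: for a power-of-two table size N, hash % N is the same as masking off the low bits. A tiny standalone check (mine, not from Rust’s source):

fn main() {
    let n: u64 = 1 << 16;                  // a power-of-two table size
    let hash: u64 = 0x9e37_79b9_7f4a_7c15; // an arbitrary hash value
    // For power-of-two n, reducing modulo n is just a bit mask.
    assert_eq!(hash % n, hash & (n - 1));
    println!("bucket index: {}", hash & (n - 1));
}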

Now, let’s consider what happens when we execute the above example. For simplicity, we’ll pretend that two starts out with a table size of half of one’s. In reality it will start out extremely small and grow through powers of two, but the half-size case exhibits the necessary behavior and is simpler for illustration. We can modify the above example to initialize two as let mut two = HashSet::with_capacity(one.capacity()/2); to verify that this still exhibits quadratic behavior.

So, let’s consider what happens as we start copying one into a two half its size. Things proceed uneventfully for the first half of the table: because two has half as many buckets as one, hash%two.raw_capacity() == hash%one.raw_capacity() for the first half of the table, and elements are inserted into two at approximately the same indexes they had in one. Because elements have been reordered slightly in one by the insertion process, we occasionally have to probe a little, but never far, because most of the space to the right is still free. So, as we go, the picture looks something like this (where Xs are filled buckets, and dots are empty):

one:    |XX.XX.XXXX.XXXXXX.XXX..XXX|
                 ^-- cursor
two:    |X.XXXX.XX....|

However, eventually we reach the midpoint of one’s table. At this point, indexes in two’s half-sized table wrap around, and we start searching for openings beginning near the start of two once again.

At this point, two’s load factor is approximately equal to that of one; it has half as many buckets, and approximately half as many elements. Because Rust resizes tables at a 90% load factor, we know one’s load factor to be between 45% and 90%. If we suppose by way of example that one is at a 70% load factor, then two can absorb another 20% of its capacity past one’s half-way point before it resizes.

During the insertion of that additional 20% of two’s capacity (which is 10% of one’s capacity, or roughly 14% of one’s elements at a 70% load factor), we run into trouble.

We begin inserting at the beginning of two. two is 70% full, so as we insert we search to the right until we find an empty bucket.

However, “immediately to the right” is also where our future inserts will land: since we’re walking one in bucket order, the natural positions of successive inserts in two also advance in bucket order. And so, as we go, the position at which we attempt an insertion marches forward at a rate of one bucket per bucket in one; but the insertions all “stack up”, pushing the distance-until-empty forward at greater than one bucket per bucket, since we must also contend with the 70% of buckets already full. The situation rapidly looks something more like:

one:    |XX.XX.XXXX.XXXXXX.XXX..XXX|
                           ^-- cursor
two:    |XXXXXXX.X.....|
            ^-- cursor/2

With each element we insert, we must search further and further to the right to find the empty spaces in two. This is, essentially, a classic triangular pattern – the k’th insert in this regime has to scan past roughly k buckets’ worth of stacked-up elements, for a total of about 1 + 2 + … + k = k(k+1)/2 probes – and definitely quadratic.

This pattern continues until we fill up two to 90% capacity; at that point, two will be resized, and performance will go back to something reasonable. However, even though the quadratic regime covers only a small fraction of the middle of the loop, it’s an approximately-constant fraction, and so the overall behavior is still quadratic.

We can demonstrate this visually. If we augment the above program to measure the cumulative time to insert N elements into two, we observe the behavior described above:

We can see the exact behavior I described – efficient performance for the first half, a quadratic explosion, and then a return to efficiency.
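One simple way to collect that data is to record the cumulative elapsed time every few thousand insertions. This is my own sketch of the instrumentation, not necessarily the exact code behind the chart:

use std::collections::hash_set::HashSet;
use std::time::Instant;

fn main() {
    let mut one = HashSet::new();
    for i in 1..5000000 {
        one.insert(i);
    }

    // The half-capacity variant described above.
    let mut two = HashSet::with_capacity(one.capacity() / 2);
    let start = Instant::now();
    for (n, v) in one.into_iter().enumerate() {
        two.insert(v);
        // Print (elements inserted, cumulative time) every 10,000 inserts;
        // plotting these pairs shows the flat / quadratic / flat shape.
        if n % 10000 == 0 {
            println!("{}\t{:?}", n, start.elapsed());
        }
    }
}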

If we don’t preallocate capacity/2 elements, we can see the full picture:

It is an essentially similar picture, except that the quadratic breakdown happens prior to each resize of the underlying table.

The Fix

Rust is working around the issue by reseeding the underlying SipHash hashing algorithm on a per-table basis. This solves the problem very thoroughly, by ensuring that the orders of elements in two different tables have no correlation with each other, restoring all the nice assumptions about independence that make Robin Hood hashing work. They’re also speculating about a more fundamental fix to the hash table implementation, so that users who drop in a different hash algorithm are not again vulnerable. One proposal is that hash tables ought to retain their insertion order, since this bug is much less likely to happen if tables do not leak their hash order.
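To illustrate the idea (though not the actual patch), you can give each table its own hasher state explicitly. RandomState is documented as being initialized with random keys, so two sets built this way store the same elements in unrelated internal orders:

use std::collections::hash_map::RandomState;
use std::collections::hash_set::HashSet;

fn main() {
    // Two sets with identical contents but independently seeded hashers.
    let mut one: HashSet<u32, RandomState> = HashSet::with_hasher(RandomState::new());
    let mut two: HashSet<u32, RandomState> = HashSet::with_hasher(RandomState::new());
    for i in 0..16 {
        one.insert(i);
        two.insert(i);
    }
    // Iteration follows each table's own, unrelated, bucket order.
    println!("{:?}", one.iter().collect::<Vec<_>>());
    println!("{:?}", two.iter().collect::<Vec<_>>());
}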

Concluding Thoughts

This was a subtle bug. Rust’s hash tables were very carefully thought out by experienced developers well-versed in the literature and practice of deliberate quadratic attacks on hash tables. They selected a table algorithm (Robin Hood hashing) and a hashing algorithm (SipHash), and developed an implementation specifically to avoid such anomalies. But quadratics are sneaky creatures, and unexpected correlations can bite any algorithm based on statistical independence.

For my part, this was one of the harder bugs I’ve written up to understand. When the ticket was initially linked to me, I had to read it several times, read up on Robin Hood hashing, and try several approaches to developing intuitions about the system before it began to make sense to me. But even then, when I sat down to write this post, I realized I had oversimplified it in my head and my explanation was wrong! I had to resort to a lot of experimentation with the above test case before I was confident I understood the dynamics.

Surprisingly to me, the specific dynamics of Robin Hood hashing end up being relatively unimportant here; I believe that vanilla linear probing would exhibit similar behaviors. The key effect of Robin Hood hashing is just that it gives you confidence and/or hubris to push a table to 90% capacity, which greatly exacerbates the problem.

Definitely a keeper of a bug.