The usual way to find an element in a sorted array is a binary search, which takes log(N) time, where logs are understood to be in base 2 and N is the size of the array.
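
For comparison, a textbook binary search over such an array might look like this in C; the function name and signature here are just for illustration:

	/* Classic binary search: return the index of vtarg in the sorted
	   array of N doubles, or -1 if it is not present. Takes at most
	   about log(N) probes. */
	static int binary_search(const double *array, int N, double vtarg) {
		int lo = 0, hi = N - 1;
		while(lo <= hi) {
			int mid = lo + (hi - lo) / 2;
			if(array[mid] == vtarg) return(mid);
			if(array[mid] < vtarg) lo = mid + 1;
			else hi = mid - 1;
		}
		return(-1);
	}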

Linus Torvalds made the clever observation that you can do better than log(N) if you know that the contents of the array are uniformly distributed. For instance, git's pack files store multiple objects identified by the SHA-1 hashes of their contents. Each pack file has an index containing a list of the pack's SHA-1 IDs and their objects' locations in the pack. The index is sorted by SHA-1 ID. Since SHA-1 is a cryptographic hash, we can assume its output is uniformly distributed.

Linus described his technique as "Newton-Raphson" which is a bit of a misnomer, since N-R works on smooth differentiable curves whereas what we have is a straight line with some stochastic variations. What we're actually doing is an iterated linear interpolation. If the SHA-1 IDs were perfectly evenly distributed then a single linear interpolation would land us right on the target item, but the random variation means we will be off by some amount, so we need to continue searching.
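
Concretely, a single interpolation step guesses the target's index from its value and the values at the ends of the current range. Something like this (the names are mine, not Linus's code):

	/* One linear interpolation probe: given exclusive index bounds
	   klo..khi and the corresponding value bounds vlo..vhi, estimate
	   where vtarg would sit if the values were spread perfectly evenly. */
	static int interpolate(int klo, int khi, double vlo, double vhi, double vtarg) {
		return klo + (int)((khi - klo) * (vtarg - vlo) / (vhi - vlo));
	}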

How far off will we be? It turns out (based on Monte Carlo simulation) that the expected error is about 0.31 * sqrt(N) with a standard deviation of about 0.26 * sqrt(N). This is a really promising result since it implies that each iteration reduces the search space to sqrt(N) whereas an iteration of binary search reduces it to N/2. So we should expect a complete search to take O(log(log(N))) iterations.
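
That kind of Monte Carlo experiment is easy to knock together: fill an array with N sorted uniform random values, pick an element at random, make a single interpolation over the whole array from its value, and see how far the guess lands from its true index. A rough sketch (the constants and helper names are mine):

	#include <math.h>
	#include <stdio.h>
	#include <stdlib.h>

	/* Rough Monte Carlo estimate of the single-probe interpolation error. */
	static int cmp_double(const void *a, const void *b) {
		double x = *(const double *)a, y = *(const double *)b;
		return (x > y) - (x < y);
	}

	int main(void) {
		const int N = 1 << 16, TRIALS = 1000;
		double *array = malloc(N * sizeof(double));
		double sum = 0.0, sumsq = 0.0;
		srand(1);
		for(int t = 0; t < TRIALS; t++) {
			for(int k = 0; k < N; k++)
				array[k] = rand() / ((double)RAND_MAX + 1);
			qsort(array, N, sizeof(double), cmp_double);
			int ktrue = rand() % N;
			/* single interpolation over the whole array */
			int kguess = (int)(N * array[ktrue]);
			double err = fabs((double)(kguess - ktrue));
			sum += err;
			sumsq += err * err;
		}
		double mean = sum / TRIALS;
		double sd = sqrt(sumsq / TRIALS - mean * mean);
		printf("error: mean %.2f*sqrt(N), sd %.2f*sqrt(N)\n",
		    mean / sqrt(N), sd / sqrt(N));
		free(array);
		return(0);
	}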

I wrote a simulation to try this out, and it matches the log(log(N)) prediction: in fact the number of iterations was about 1 + log(log(N)). But what is the variation around this expected result? In my tests the maximum number of probes was log(N), though for small N it bottomed out at about 16. When testing lots of different randomly filled arrays, the standard deviation was about 1.2 for all values of N, but with fewer test arrays this figure ramped up.

Junio Hamano's implementation of Linus's idea is included in git but disabled by default. He added a tweak that biases the linear interpolation towards the centre of the search range, so it's kind of a balance between binary search and linear interpolation search. In my simulator this tweaked version required about (log(N)+3)/2 iterations on average, with a standard deviation of 0.8. The maximum number of iterations was again log(N), but it bottomed out at about 12. Overall it's a bit slower but better behaved.
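
I won't reproduce git's exact formula here, but to give the flavour of such a tweak, one simple way to bias the probe towards the centre is to average the interpolated guess with the plain binary-search midpoint. The blend below is my own illustration, not Junio's code:

	/* Probe biased towards the centre of the range: average the linear
	   interpolation guess with the binary-search midpoint. Illustrative
	   only; this is not the formula git actually uses. The out-of-bounds
	   clamping from the full code at the end still applies. */
	static int biased_probe(int klo, int khi, double vlo, double vhi, double vtarg) {
		int kint = klo + (int)((khi - klo) * (vtarg - vlo) / (vhi - vlo));
		int kmid = klo + (khi - klo) / 2;
		return (kint + kmid) / 2;
	}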

In git, where a large repository might contain two million objects, and where pack index lookups are not particularly performance-critical, this improved lookup code doesn't provide a noticeable advantage. Still, I think it's interesting and the idea might be useful in other situations. Note that unlike a binary search, which can just use comparisons returning greater / equal / less, the linear interpolation search needs to know the absolute values of the elements. Git's code actually uses a lexicographic variant that ignores any common prefix shared by the elements in the search range, and uses only the next two bytes for the interpolation.
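
To give a flavour of that, here is one way you might derive interpolation values from 20-byte SHA-1 IDs: skip the prefix shared by the low and high ends of the range, then treat the next two bytes of each ID as a 16-bit number. This is my own sketch of the idea, not git's code:

	/* Interpolating on SHA-1 IDs: ignore the prefix shared by the ends
	   of the search range and use the following two bytes as a 16-bit
	   value. Sketch only; git's real code differs in the details. */
	static unsigned key16(const unsigned char *id, int skip) {
		return ((unsigned)id[skip] << 8) | id[skip + 1];
	}

	static int sha1_probe(int klo, int khi, const unsigned char *idlo,
	    const unsigned char *idhi, const unsigned char *idtarg) {
		int skip = 0;
		while(skip < 18 && idlo[skip] == idhi[skip])
			skip++;  /* skip common prefix, keeping two bytes in hand */
		double vlo = key16(idlo, skip), vhi = key16(idhi, skip);
		double vtarg = key16(idtarg, skip);
		if(vhi <= vlo) return klo + (khi - klo) / 2;  /* degenerate range */
		return klo + (int)((khi - klo) * (vtarg - vlo) / (vhi - vlo));
	}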

To finish, here's a bit of code. In this example, 0.0 <= array[k] < 1.0; I use k for array indexes and v for the values of array elements. We are searching for vtarg.

	/* Interpolation search: find vtarg in the sorted array of N
	   doubles and return its index, or -1 if it is not present.
	   DBL_MIN comes from <float.h>. */
	/* all bounds are exclusive */
	double vlo = -DBL_MIN, vhi = +1.0;
	int klo = -1, khi = N;
	while(khi - klo > 1) {
		int kmid = klo + (khi-klo) * (vtarg-vlo) / (vhi-vlo);
		/* ensure rounding does not put us out of bounds */
		if(kmid <= klo) kmid = klo + 1;
		if(kmid >= khi) kmid = khi - 1;
		double vmid = array[kmid];
		if(vmid == vtarg) return(kmid);
		if(vmid < vtarg) klo = kmid, vlo = vmid;
		if(vmid > vtarg) khi = kmid, vhi = vmid;
	}
	return(-1);

Addendum: there are a few corrections in a follow-up post.