The Bloom filter is a simple and clever data structure that every programmer should know about. It represents a set where the universe of possible members is very large and where it doesn't matter if the filter occasionally falsely claims that a non-member is in the set. It is useful for filtering lookups into more expensive (larger) data structures, so that most of the time you can avoid performing lookups that will fail. For example, a hyphenation program on a small computer might have some rules that cover most words, and a dictionary of exceptions; it can represent the entries in the exceptions dictionary as a Bloom filter that fits in RAM, so that it doesn't have to go to disk to find out that its standard rules apply.
The way it works is to represent the set as an array of size bits, which is typically a small multiple of the expected population of the set. To add an item to the set, you hash it with nh independent hash functions, and set the nh bits in the array indexed by the hashes. To check if an item is in the set, you hash it in the same way, and check that all nh bits are set. You get a false positive when hash collisions cause all the bits to be set by accident.
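For concreteness, here's a minimal sketch in Python (my own illustration, not code from any particular library); it fakes the nh independent hash functions by salting SHA-256 with the function index.

    import hashlib

    class BloomFilter:
        def __init__(self, size, nh):
            # size bits in the array, nh hash functions per item
            self.size = size
            self.nh = nh
            self.bits = bytearray((size + 7) // 8)

        def _hashes(self, item):
            # Simulate nh independent hash functions by salting
            # SHA-256 with the function index.
            for i in range(self.nh):
                h = hashlib.sha256(b"%d:" % i + item).digest()
                yield int.from_bytes(h, "big") % self.size

        def add(self, item):
            for b in self._hashes(item):
                self.bits[b // 8] |= 1 << (b % 8)

        def __contains__(self, item):
            # True may be a false positive; False is always correct.
            return all(self.bits[b // 8] & (1 << (b % 8))
                       for b in self._hashes(item))

With the sizing discussed below (16 bits per element, nh = 8), a filter for a million dictionary exceptions fits in 2 MB of RAM.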
There are a few neat tricks you can perform with Bloom filters. You can take the union of two sets by ORing the two Bloom filters together. You can halve the size of a Bloom filter by ORing the top half with the bottom half. You can't delete elements from a Bloom filter, though if false negatives are OK as well as false positives, you can represent the set of deletions as a second Bloom filter. Alternatively, you can use a counting Bloom filter, in which each bit is replaced with a small counter that is incremented for insertions and decremented for deletions. Four-bit counters overflow with a probability of 1.37 ∗ 10^−15.
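At the bit level the first two tricks are one-liners. A sketch using the BloomFilter above (union requires identical size and nh; halving assumes size is a multiple of 16 so the byte array splits cleanly):

    def union(f, g):
        # An element inserted into either filter has all its
        # bits set in the result.
        assert f.size == g.size and f.nh == g.nh
        u = BloomFilter(f.size, f.nh)
        u.bits = bytearray(a | b for a, b in zip(f.bits, g.bits))
        return u

    def halve(f):
        # OR the top half onto the bottom half. Lookups still work
        # because h % (size/2) == (h % size) % (size/2).
        assert f.size % 16 == 0
        g = BloomFilter(f.size // 2, f.nh)
        mid = len(f.bits) // 2
        g.bits = bytearray(a | b for a, b in zip(f.bits[:mid], f.bits[mid:]))
        return g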
The following analysis is mostly based on Andrei Broder and Michael Mitzenmacher's survey of network applications of Bloom filters.
After pop elements have been added to the set, the probability that a particular bit is zero is

pz′ = (1 − 1/size)^(nh ∗ pop)

which is approximately

pz = exp(−nh ∗ pop / size)

The probability of a false positive is

fpr = (1 − pz)^nh
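To make the formulas concrete, a quick numerical check (illustrative parameters of my own):

    from math import exp

    size, nh, pop = 1024, 8, 64
    pz_exact = (1 - 1/size) ** (nh * pop)   # pz'
    pz = exp(-nh * pop / size)              # the approximation
    fpr = (1 - pz) ** nh
    print(pz_exact, pz, fpr)   # ≈ 0.6064, 0.6065, 5.7e-4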
Given a fixed array size and population, we can work out the number of hash functions that minimizes the false positive rate. Increasing nh makes the exponent larger, which reduces fpr, but beyond a certain point the smaller pz starts making fpr larger again. First note that

fpr = (1 − pz)^nh = exp(nh ∗ ln(1 − pz))

and from the equation for pz,

ln(pz) = −nh ∗ pop / size
nh = −ln(pz) ∗ size / pop

which gives us

ln(fpr) = nh ∗ ln(1 − pz) = −(size / pop) ∗ ln(pz) ∗ ln(1 − pz)

Symmetry reveals that the minimum occurs when pz = 1/2, i.e. half the bits are set, so when the array is full (i.e. at maximum information density),

nh = ln(2) ∗ size / pop
size = pop ∗ nh / ln(2) = pop ∗ nh ∗ log2(e)
fpr = exp(−nh ∗ ln(2)) = 2^−nh

The size is a factor of only log2(e) ≈ 1.44 larger than the information-theoretic lower bound (which I'm not going to explain here).
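We can confirm the optimum numerically (again with illustrative numbers of my own): fixing size/pop = 16, fpr bottoms out near nh = 16 ∗ ln(2) ≈ 11.

    from math import exp, log

    size_per_pop = 16   # size / pop

    def fpr(nh):
        pz = exp(-nh / size_per_pop)
        return (1 - pz) ** nh

    best = min(range(1, 33), key=fpr)
    print(best, log(2) * size_per_pop)   # 11 vs 11.09...
    print(fpr(best), 2.0 ** -11)         # ≈ 4.6e-4 vs 4.9e-4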
Let's compare a Bloom filter with a simple one-bit-per-element hashed set, with a target false positive rate of 2^−k. If you set only one bit per element, then (assuming no hash collisions) fpr = pop / size, so size = pop ∗ 2^k, which is exponentially worse than a k-hash Bloom filter.
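For example, with k = 8 the one-bit set needs 2^8 = 256 bits per element, whereas a Bloom filter with nh = 8 needs only 8 ∗ log2(e) ≈ 11.5 bits per element.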
It seems reasonable (for the application I have in mind) to set size = 16 ∗ limit and nh = 8, which gives us headroom for the population to overflow limit by a factor of 2 ∗ ln(2) ≈ 1.39 before fpr rises to 1/256.
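Plugging these numbers into the formulas above bears that out (limit = 1000 is an arbitrary illustration):

    from math import exp, log

    nh, limit = 8, 1000
    size = 16 * limit

    def fpr(pop):
        pz = exp(-nh * pop / size)
        return (1 - pz) ** nh

    print(fpr(limit))                # ≈ 5.7e-4, well under 1/256
    print(fpr(2 * log(2) * limit))   # ≈ 3.9e-3 = 1/256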
Adam Kirsch and Michael Mitzenmacher show that, rather than requiring nh independent hash functions, we can use a set of functions of the form h1(x) + i ∗ h2(x), where 0 ≤ i < nh, without increasing the false positive rate. A hash function might be implemented by taking log2(size) bits from the result of MD5(x), so without this trick, nh = 8 limits size to 2^16 (eight 16-bit slices of a 128-bit digest), whereas with the trick we only need two functions, so the limit is large enough that we don't need to worry.
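A sketch of the trick as a drop-in replacement for the _hashes method of the BloomFilter above (splitting the MD5 digest into two 64-bit halves is my choice of detail; forcing h2 odd so the stride visits every bit when size is a power of two is a common refinement, not from the paper):

    import hashlib

    class DoubleHashBloom(BloomFilter):
        def _hashes(self, item):
            # Two base hashes from one MD5 digest; the i-th hash
            # is h1 + i*h2 (mod size), per Kirsch & Mitzenmacher.
            d = hashlib.md5(item).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:], "big") | 1
            for i in range(self.nh):
                yield (h1 + i * h2) % self.size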