What use are Bloom filters, anyway?

All that stuff about Bloom filters was because I wanted to make sure I understood them before using them to improve Exim's ratelimit feature.

One of the things that our users often do which triggers the rate limiter is forward all their email offsite. This is a huge rate of messages, but they are all from and to the same person - not many different addresses. Secondly, the vast majority of the computers on our network are used by one person, so should only originate email from a limited number of email addresses. Unexpectedly using lots of addresses may be a sign of a compromised machine.

What I wanted was a way of identifying how often different events occurred, with reasonable efficiency. The existing ratelimit code stores (for each sender) a 64 bit timestamp and a 64 bit floating point rate measurement, plus a key to identify the sender. This is small and easy to update. I wanted something that didn't bloat it too much and which is similarly easy to update.

The essential idea is that the new per_addr option also stores a Bloom filter per sender. Each recipient specified by the sender is checked for membership of its Bloom filter; if it is present then no update happens; if it is absent, the sender's rate is incremented and the recipient is added to the filter.

The rate limit configuration includes a time period which is currently used to determine the exponential smoothing constant, i.e. how bursty senders can be. With per_addr it is also used to determine the Bloom filter's lifetime: when it gets too old, the filter is thrown away. This solves the problem of the filter getting over-full and producing loads of false positives; this means that Exim is measuring the number of different recipient addresses used by the sender in each period. It's very pleasing that this is so simple for both the code and the user :-)

The remaining question is how big the filter should be, and how many hash functions to use. For most configurations, the maximum number of entries in the filter will be the same as the maximum rate, so I decided to simply allocate 2 bytes i.e. 16 bits times the limit, since the false positive rate was too high with just one byte per element. I chose to use 8 hash functions, since that provides some headroom above the maximum rate before the false positive rate gets too high. (You can tell Exim to measure the client's success rate, which is always less than the limit, or the client's attempt rate, which can be very much higher.)

With this choice of parameters, the false positive rates for clients sending to different addresses at rates relative to the limit are:

sending rate/limit	false positive rate
0.5	0.0006% or 1 in 170000
1	0.06% or 1 in 1700
2	2.5% or 1 in 40
2.7 = e	9.3% or 1 in 10.7
4	31% or 1 in 3.2

It looks like the false positive rate for fast senders might easily become embarrassing, but again the periodic filter clearance comes to the rescue. When a fast sender stops, its smoothed rate decays exponentially by 1/e each period. Therefore if its rate was more than e times the limit, then the over-full Bloom filter will be cleared before its rate falls below the limit. Thus the smoothing period is the maximum time for which Exim can be fooled into under-estimating the sender's rate, but when this is a serious problem (false positive rate over 9%) the client will be subject to countermeasures continuously. The only bad effect should be incorrect rate measurements in the logs.

The code is actually more general than I have described: you aren't limited to measuring unique recipient addresses. Under the covers the per_addr option is equivalent to per_rcpt/unique=$local_part@$domain so by changing the unique value you can, for example, count unique sender addresses: per_mail/unique=$sender_address.

I have posted a patch to the exim-users list so that people can try out the new code before it gets baked into a release.