Luck or judgment? – Tony Finch

In 2005 I developed a mathematical model for measuring average event rates which became the core of a new rate limiting feature for Exim. The model has a particularly useful property which I did not expect it to have and which I did not (until recently) fully understand.

The model

The central formula is the standard exponentially weighted moving average:

r_new = (1 − a) ∗ r_inst + a ∗ r_old

We use this to calculate the new average rate from the instantaneous rate and the result of the previous average calculation. These rates are all measured in events per some configurable time period. The smoothing factor a is a value between 0 and 1 which determines how slowly the model forgets about past behaviour.

We calculate r_inst and a from the raw inputs to the formula, which are p, a time period configured by the postmaster, and i, the time interval between the previous event and this event. By dividing them we get

r_inst = p / i

In this formula, p determines the per time unit used for measuring rates, e.g. events per hour or events per day.

The exponentially weighted moving average is usually used to smooth samples that are measured at fixed intervals, in which case the smoothing factor, a, is also fixed. In our situation events occur at varying intervals, so the smoothing factor needs to be varied accordingly.

a = e^{−i / p}

In this formula, p determines the smoothing period, i.e. the length of time it takes to forget 63% of past behaviour.

A useful property

When developing the model, I needed to understand how it reacts to changes in the rate of events. It's fairly simple to see that if the rate drops, the average decays exponentially towards the new rate. It's less clear what happens when the rate increases. A particular practical question is what happens if there's a sudden burst of messages? How much of the burst gets through before the average rate passes some configured maximum?

I did some trial computations and I found that when the interval is very small (i.e. the rate is high) the average rate increases by nearly one for each event (the smaller the interval, the closer to one). This means that the maximum rate is also the maximum burst size. How wonderfully simple! The postmaster can configure the model with two numbers, a maximum number of events per a time period, which also directly specifies the units of measurement, the smoothing period, and the maximum burst size.

(The maximum burst size is larger for slower senders, increasing to infinity for those sending below the maximum rate. However you don't have to be much above the maximum for your average to hit the limit within one period.)

This property produces a simple user interface, but I did not understand how it works. It's obvious that when i is small, a ≈ 1, but it is not so clear why (1 − a) ∗ r_inst ≈ 1. It seems I landed directly on a mathematical sweet spot without the fumbling around that might have led me to understand the situation better.

Arbitrary choices

I made two choices when developing the model which at the time seemed arbitrary, but which both must be right to get the max rate = burst size property.

Firstly, I decided to re-use the configured period for two purposes: as the smoothing period and as the per time unit of rates. I could instead have made them independently configurable, but this seemed to give no benefits that compensated for the extra complexity. Alternatively, I could have used a fixed value instead of p in the formula for r_inst, but that seemed liable to be confusing or awkward, and to require more mental arithmetic to configure.

Secondly, I decided to use e as the base of the exponent. I could have used 2, in which case the smoothing period would have been a half life, or 10, so that 90% of past behaviour would be forgotten after one period. There seemed to be no clear way to choose, so I split the difference and went with e on the basis of mathematical superstition and because exp() has fewer arguments than pow().

How it works

Looking back over my old notes this week, I had a revelation that the max rate = burst size property comes from the fact that the gradient of e^x is 1 when x is 0. To show why this is so, I first need to define a bit of notation:

y ≡ f(x) ≡ e^x
δy ≡ f(x + δx) − f(x)
δx ≡ −i / p

This allows us to say, when x is zero and δx is small,

(1 − a) ∗ r_inst
= (1 − e^−i/p) ∗ p/i
= (1 − e^δx) / −δx
= (e^δx − 1) / δx
= (e^{0 + δx} − e⁰) / δx
= ( f(0 + δx) − f(0) ) / δx
= δy / δx
≈ dy / dx
= e^x
= 1

Sweet.

	(1 − a) ∗ r_inst
=	(1 − e^−i/p) ∗ p/i
=	(1 − e^δx) / −δx
=	(e^δx − 1) / δx
=	(e^{0 + δx} − e⁰) / δx
=	( f(0 + δx) − f(0) ) / δx
=	δy / δx
≈	dy / dx
=	e^x
=	1