

A while back I wrote about the linear rate limit algorithms leaky bucket and GCRA. Since then I have been vexed by how common it is to implement rate limiting using complicated and wasteful algorithms (for example).

But linear (and exponential) rate limiters have a disadvantage: they can be slow to throttle clients whose request rate is above the limit but not super fast. And I just realised that this disadvantage can be unacceptable in some situations, when it’s imperative that no more than some quota of requests is accepted within a window of time.

In this article I’ll explore a way to enforce rate limit quotas more precisely, without undue storage costs, and without encouraging clients to oscillate between bursts and pauses. However I’m not sure it’s a good idea.

linear reaction time

How many requests does a linear rate limiter allow before throttling?

The parameters for a rate limiter are:

  * the quota q, the maximum number of requests that should be allowed;
  * the window w, the period of time (in seconds) over which the quota applies.

So the maximum permitted rate is q/w.

Let’s consider a client whose request rate c is some multiple a > 1 of the permitted rate (a for abuse factor):

    c = a * q/w

I’ll model the rate limiter as a token bucket which starts off with q tokens at time 0. The bucket accumulates tokens at the permitted rate and the client consumes them at its request rate. (It is capped at q tokens but we can ignore that detail when a > 1.)

    b(t) = q + t*q/w - t*a*q/w

The time taken for n requests is

    t(n) = n/c = (n*w) / (a*q)

After n requests the bucket contains

    b(n) = q + n/a - n

The rate limiter throttles the client when the bucket is empty.

    b(t) = 0 = q + t * (1 - a) * q/w

    0 = 1 - t * (a - 1) / w

    t = w / (a - 1)

    b(n) = 0 = q + n * (1/a - 1)

    0 = q - n * (a - 1) / a

    n = q * a / (a - 1)

For example, if the client is running at twice the permitted rate, a=2, they will be allowed q*2 requests within w seconds before they are throttled.

That’s a bit slow.
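
As a sanity check on the algebra, here is a small Python simulation of the token bucket described above (the function name and example numbers are made up for illustration):

    def requests_before_throttle(q: float, a: float) -> int:
        """Requests accepted before a client running at a times the permitted
        rate q/w is first throttled by a token bucket that starts full.
        The window w cancels out: each inter-request gap of w/(a*q) seconds
        adds (q/w) * w/(a*q) = 1/a tokens to the bucket."""
        tokens = float(q)           # bucket starts with a full quota
        accepted = 0
        while tokens >= 1.0:        # a request needs one whole token
            tokens -= 1.0
            accepted += 1
            tokens = min(float(q), tokens + 1.0 / a)
        return accepted

    if __name__ == "__main__":
        q = 100                     # made-up quota for illustration
        for a in (1.5, 2.0, 4.0):
            print(f"a={a}: simulated {requests_before_throttle(q, a)}, "
                  f"formula {q * a / (a - 1):.1f}")

The simulated counts land slightly below the formula because the simulation needs a whole token per request, while the continuous model drains the bucket all the way to zero.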

The problem is that the bucket starts with a full quota of tokens, and during the first window of time another quota-full is added. So a linear rate limiter seems to be more generous in its startup phase than its parameters suggest.

fixed window quota resets

This is a fairly common rate limit algorithm that enforces quotas more precisely than linear rate limiters. It is similar to a token bucket, but it resets the bucket to the full quota q after each window w.

The disadvantage of periodically resetting the bucket is that careless clients are likely to send a fast burst of requests every window. (A linear rate limiter will tend to smooth out requests from careless clients, though it can’t prevent deliberately abusive burstiness.) And a quota-reset rate limiter uses more space than a linear rate limiter: it needs a separate bucket counter as well as a timestamp.
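
For concreteness, here is a minimal sketch of a quota-reset limiter in Python (the class and field names are my own, chosen for illustration):

    from dataclasses import dataclass

    @dataclass
    class FixedWindow:
        """Quota-reset limiter: at most q requests in each window of w seconds."""
        q: int                                # quota per window
        w: float                              # window length, in seconds
        tokens: int = 0                       # remaining quota in the current window
        window_start: float = float("-inf")   # when the current window began

        def allow(self, now: float) -> bool:
            if now - self.window_start >= self.w:
                # a new window has started: reset the bucket to the full quota
                self.window_start = now
                self.tokens = self.q
            if self.tokens > 0:
                self.tokens -= 1
                return True
            return False

The two per-client fields, tokens and window_start, are the counter and timestamp mentioned above.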

hybrid quota-linear algorithm

The idea is to have two operating modes:

  * bursty mode, which behaves like a quota-reset rate limiter;
  * smooth mode, which behaves like a linear rate limiter (leaky bucket / GCRA).

The algorithm switches to smooth mode when a client consumes its entire quota and the bucket reaches zero, and switches back to bursty mode when the bucket recovers to a full quota.

Bursty mode avoids being too generous by adding at most one quota of tokens to the bucket per window. Smooth mode also avoids being too generous by starting with an empty bucket.

I’ve written this pseudocode in a repetitive style because that makes each paragraph more independent of its context, though it is still somewhat stateful and the order of the clauses matters.
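
The description maps fairly directly onto code. The following is a rough sketch in Python of my own reconstruction from the paragraphs above, with hypothetical names, not the original pseudocode: bursty mode does a fixed-window reset, smooth mode does a continuous leaky-bucket refill, and the mode switches happen where the previous paragraphs say they should.

    from dataclasses import dataclass

    @dataclass
    class HybridLimiter:
        """Bursty mode resets the bucket once per window, like a quota-reset
        limiter; smooth mode refills it continuously at q/w tokens per
        second, like a leaky bucket."""
        q: float                        # quota per window
        w: float                        # window length, in seconds
        tokens: float = 0.0             # bucket contents
        stamp: float = float("-inf")    # window start (bursty) or last update (smooth)
        smooth: bool = False            # which mode we are in

        def allow(self, now: float) -> bool:
            if self.smooth:
                # smooth mode: accumulate tokens at the permitted rate, capped at q
                self.tokens = min(self.q, self.tokens + (now - self.stamp) * self.q / self.w)
                self.stamp = now
                if self.tokens >= self.q:
                    self.smooth = False     # bucket recovered to a full quota
            elif now - self.stamp >= self.w:
                # bursty mode: add at most one quota of tokens per window
                self.stamp = now
                self.tokens = self.q

            if self.tokens < 1.0:
                return False                # throttled
            self.tokens -= 1.0
            if self.tokens <= 0.0 and not self.smooth:
                # the whole quota is gone: switch to smooth mode
                self.smooth = True
                self.stamp = now            # smooth mode starts with an empty bucket
            return True

Per client this keeps a token count, a single timestamp, and a mode flag.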

discussion

This hybrid algorithm has a similar effect to running both a quota-reset and a linear rate limiter in parallel, but it uses less space.

The precision of quota enforcement in smooth mode is maybe arguable: it guarantees that the client remains below the limit on average over the whole time it is in smooth mode. But if it slows down for a while (though not long enough to return to bursty mode), it can speed up again and make more requests than its quota within a single window.

opinion

I think it’s a mistake to try to treat each time window in strict isolation when assessing a client’s request quota. A linear rate limiter only seems to be unduly generous on startup if you ignore the fact that the client was quiet in the previous window.

As well as being more expensive than necessary (in some cases disgracefully wasteful), algorithms like sliding-window and quota-reset encourage clients into cyclic burst-pause behaviour which is unhealthy for servers. And I’ve seen developers complaining about how annoying it is to use glut/famine rate limiters.

By contrast, rate limiters that measure longer-term average behaviour by keeping state across multiple windows can naturally encourage clients to smooth out their requests.

So on balance I think that instead of using this hybrid quota-linear rate limiter, you should reframe your problem so that you can use a simple linear rate limiter like GCRA.