A while back I wrote about the linear rate limit algorithms leaky bucket and GCRA. Since then I have been vexed by how common it is to implement rate limiting using complicated and wasteful algorithms (for example).
But linear (and exponential) rate limiters have a disadvantage: they can be slow to throttle clients whose request rate is above the limit but not super fast. And I just realised that this disadvantage can be unacceptable in some situations, when it’s imperative that no more than some quota of requests is accepted within a window of time.
In this article I’ll explore a way to enforce rate limit quotas more precisely, without undue storage costs, and without encouraging clients to oscillate between bursts and pauses. However I’m not sure it’s a good idea.
linear reaction time
How many requests does a linear rate limiter allow before throttling?
The parameters for a rate limiter are:
- q, the permitted quota of requests
- w, the accounting time window
So the maximum permitted rate is q/w.
Let’s consider a client whose request rate c is some multiple a > 1 of the
permitted rate (a for abuse factor):
c = a * q/w
I’ll model the rate limiter as a token bucket which starts off with
q tokens at time 0. The bucket accumulates tokens at the permitted
rate and the client consumes them at its request rate. (It is capped
at q tokens but we can ignore that detail when a > 1.)
b(t) = q + t*q/w - t*a*q/w
The time taken for n requests is
t(n) = n/c = (n*w) / (a*q)
After n requests the bucket contains
b(n) = q + n/a - n
The rate limiter throttles the client when the bucket is empty.
b(t) = 0 = q + t * (1 - a) * q/w
0 = 1 - t * (a - 1) / w
t = w / (a - 1)
b(n) = 0 = q + n * (1/a - 1)
0 = q - n * (a - 1) / a
n = q * a / (a - 1)
For example, if the client is running at twice the permitted rate,
a=2, they will be allowed q*2 requests within w seconds before
they are throttled.
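To sanity-check the algebra, here is a small simulation (mine, not from the post) of the token bucket above. Like the derivation, it ignores the cap at q tokens, and it uses exact fractions to avoid floating-point drift; a is taken to be an integer for simplicity.

```python
from fractions import Fraction as F

def requests_before_throttle(q, w, a):
    """Simulate a token bucket that starts full with q tokens and refills
    at rate q/w, against a client sending at rate a*q/w (integer a > 1).
    Returns how many requests are allowed before the first denial."""
    bucket = F(q)
    interval = F(w, a * q)        # time between the client's requests
    refill = interval * F(q, w)   # tokens accumulated per interval; simplifies to 1/a
    n = 0
    while True:
        bucket += refill          # refill since the previous request
        if bucket < 1:            # not enough tokens: throttled
            return n
        bucket -= 1               # spend a token on this request
        n += 1
```

At twice the permitted rate this allows q * a / (a - 1) = 2 * q requests, matching the formula; for a = 5 it allows floor(5 * q / 4).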
That’s a bit slow.
The problem is that the bucket starts with a full quota of tokens, and during the first window of time another quota-full is added. So a linear rate limiter seems to be more generous in its startup phase than its parameters suggest.
fixed window quota resets
This is a fairly common rate limit algorithm that enforces quotas more
precisely than linear rate limiters. It is similar to a token bucket,
but it resets the bucket to the full quota q after each window w.
The disadvantage of periodically resetting the bucket is that careless clients are likely to send a fast burst of requests every window. (A linear rate limiter will tend to smooth out requests from careless clients, though it can’t prevent deliberately abusive burstiness.) And a quota-reset rate limiter uses more space than a linear rate limiter: it needs a separate bucket counter as well as a timestamp.
- each client has the time when its window started and a bucket of tokens

      if c == NULL
          c = new Client
          c.bucket = quota
          c.time = now

- reset the bucket when the window has expired

      if c.time + window <= now
          c.bucket = quota
          c.time = now

- the request is allowed if the client has a token to spend

      if c.bucket >= 1
          c.bucket -= 1
          return ALLOW
      else
          return DENY
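As a concrete rendering of the pseudocode, here is a minimal Python sketch; the class and method names are mine, not from the post, and `now` can be injected to make the behaviour easy to test.

```python
import time

class FixedWindowLimiter:
    """Fixed-window quota reset: each client gets `quota` requests
    per `window` seconds, and the bucket resets when the window expires."""

    def __init__(self, quota, window):
        self.quota = quota
        self.window = window
        self.clients = {}  # client id -> (window start time, tokens left)

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        # a new client starts a window now with a full bucket
        start, bucket = self.clients.get(client, (now, self.quota))
        # reset the bucket when the window has expired
        if start + self.window <= now:
            start, bucket = now, self.quota
        # the request is allowed if the client has a token to spend
        if bucket >= 1:
            self.clients[client] = (start, bucket - 1)
            return True
        self.clients[client] = (start, bucket)
        return False
```

Note the per-client state is a counter plus a timestamp, which is the extra storage cost mentioned below.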
hybrid quota-linear algorithm
The idea is to have two operating modes:
- Low-traffic clients are allowed to make bursts of requests, because that’s good for interactive latency.

- High-traffic clients are required to smooth out their requests to an even rate.
The algorithm switches to smooth mode when a client consumes its entire quota and the bucket reaches zero, and switches back to bursty mode when the bucket recovers to a full quota.
Bursty mode avoids being too generous by adding at most one quota of tokens to the bucket per window. Smooth mode also avoids being too generous by starting with an empty bucket.
I’ve written this pseudocode in a repetitive style because that makes each paragraph more independent of its context, though it is still somewhat stateful and the order of the clauses matters.
- the rate limit is derived from the quota and window

      rate = quota / window

- a client that returns after a long absence is reinitialized like a new client; in bursty mode the time is the start of a fixed window and the bucket contains a whole number of tokens; we subtract one from the quota to account for the current request

      fn reset()
          c.bucket = quota - 1
          c.time = now
          c.mode = BURSTY
          return ALLOW

- initialize the state of a new client

      if c == NULL
          c = new Client
          return reset()

- reset the state when a bursty client’s window has expired

      if c.mode == BURSTY and c.time + window <= now
          return reset()

- when a client consumes its last token, switch modes; in smooth mode the time is updated for every request and the bucket can hold fractional tokens; apply a negative penalty so we don’t allow any over-quota requests before the end of the fixed window; add one to allow the next request at the start of the next window

      if c.mode == BURSTY and c.bucket == 1
          remaining = c.time + window - now
          c.bucket = -remaining * rate + 1
          c.time = now
          c.mode = SMOOTH
          return ALLOW

- smooth mode accumulates tokens proportional to the time since the client’s previous request; update the request time so that tokens accumulate at the same rate whether the request is allowed or denied

      if c.mode == SMOOTH
          c.bucket += (now - c.time) * rate
          c.time = now

- when the bucket has refilled, switch back to bursty mode

      if c.mode == SMOOTH and c.bucket >= quota
          return reset()

- we have dealt with all the special cases; in either mode the request is allowed if the client has a token to spend

      if c.bucket >= 1
          c.bucket -= 1
          return ALLOW
      else
          return DENY
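Putting the clauses together, a Python sketch of the hybrid algorithm might look like this (again, the names are mine and `now` is injectable; the logic follows the pseudocode clause by clause):

```python
import time
from dataclasses import dataclass

BURSTY, SMOOTH = "bursty", "smooth"

@dataclass
class Client:
    time: float    # window start (bursty) or previous request (smooth)
    bucket: float  # whole tokens in bursty mode, fractional in smooth mode
    mode: str

class HybridLimiter:
    def __init__(self, quota, window):
        self.quota = quota
        self.window = window
        self.rate = quota / window  # derived from the quota and window
        self.clients = {}

    def _reset(self, key, now):
        # reinitialize like a new client, spending one token on this request
        self.clients[key] = Client(time=now, bucket=self.quota - 1, mode=BURSTY)
        return True

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        c = self.clients.get(key)
        if c is None:                                      # new client
            return self._reset(key, now)
        if c.mode == BURSTY and c.time + self.window <= now:
            return self._reset(key, now)                   # window expired
        if c.mode == BURSTY and c.bucket == 1:
            # last token: switch to smooth mode with a negative penalty
            # that denies over-quota requests until the fixed window ends,
            # plus one token to allow the first request of the next window
            remaining = c.time + self.window - now
            c.time, c.bucket, c.mode = now, -remaining * self.rate + 1, SMOOTH
            return True
        if c.mode == SMOOTH:
            # accumulate tokens for the time since the previous request,
            # whether or not this request ends up being allowed
            c.bucket += (now - c.time) * self.rate
            c.time = now
            if c.bucket >= self.quota:                     # refilled
                return self._reset(key, now)
        # in either mode, allow if the client has a token to spend
        if c.bucket >= 1:
            c.bucket -= 1
            return True
        return False
```

For example, with quota = 4 and window = 4 a burst of four requests at time 0 is allowed, a fifth is denied, and one request goes through at time 4 when the next window starts.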
discussion
This hybrid algorithm has a similar effect to running both a quota-reset and a linear rate limiter in parallel, but it uses less space.
The precision of quota enforcement in smooth mode is perhaps arguable: it guarantees that the client remains below the limit on average over the whole time it is in smooth mode. But if the client slows down for a while (not so long that it returns to bursty mode) it can then speed up and make more requests than its quota within a single window.
opinion
I think it’s a mistake to try to treat each time window in strict isolation when assessing a client’s request quota. A linear rate limiter only seems to be unduly generous on startup if you ignore the fact that the client was quiet in the previous window.
As well as being more expensive than necessary (in some cases disgracefully wasteful), algorithms like sliding-window and quota-reset encourage clients into cyclic burst-pause behaviour which is unhealthy for servers. And I’ve seen developers complaining about how annoying it is to use glut/famine rate limiters.
By contrast, rate limiters that measure longer-term average behaviour by keeping state across multiple windows can naturally encourage clients to smooth out their requests.
So on balance I think that instead of using this hybrid quota-linear rate limiter, you should reframe your problem so that you can use a simple linear rate limiter like GCRA.
Tony Finch – blog