

Here's an idea I came up with in 1999, when I was working at Demon. It's a design for a scalable log processing infrastructure: you plug in simple processing scripts with very modest requirements, and still get good scalability and reliability.

It's all based on dividing logs into relatively small batches: too large and the log processing latency gets too big; too small and the batch handling overhead gets too big. I decided that one-minute batches were a happy medium. Batches are always aligned to minute boundaries NN:NN:00, so that (assuming your computers' clocks are synchronized) batches generated on different computers start and end at the same time.
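
For illustration, here's a minimal sketch in Python of how a client might work out which batch a log record belongs to; the batch naming scheme is my own assumption, not part of the design:

    from datetime import datetime, timezone

    def batch_start(timestamp: float) -> datetime:
        """Round a Unix timestamp down to the enclosing minute boundary."""
        return datetime.fromtimestamp(int(timestamp) - int(timestamp) % 60,
                                      tz=timezone.utc)

    def batch_name(timestamp: float) -> str:
        """Name a batch after the minute boundary it starts on."""
        return batch_start(timestamp).strftime("%Y-%m-%dT%H:%MZ")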

Log processing scripts can be simple sequential code. The requirements are that they must be idempotent, so that they can be re-run if anything goes wrong; they must depend only on the timestamps in the logs, not on the wall-clock time; and they must be able to process each batch independently of the others.
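
As a sketch of the shape such a script might take (not the original Demon code), here's a toy Python example that counts hits per client in one minute's web log batch; the file layout and log format are assumptions made for illustration. Writing the output to a temporary file and renaming it into place means a re-run simply overwrites the same result, which keeps the script idempotent:

    import os

    def process_batch(batch_path: str, out_path: str) -> None:
        """Process one minute's batch of web logs, idempotently."""
        counts = {}
        with open(batch_path) as batch:
            for line in batch:
                if not line.strip():
                    continue
                # Use only what's in the log line (fields, timestamps),
                # never the wall clock.
                client = line.split()[0]   # assumed to be the client address
                counts[client] = counts.get(client, 0) + 1
        tmp_path = out_path + ".tmp"
        with open(tmp_path, "w") as out:
            for client, hits in sorted(counts.items()):
                out.write(f"{client} {hits}\n")
        os.replace(tmp_path, out_path)     # atomic rename: safe to re-run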

The log transport protocol must also satisfy these requirements. I decided to use HTTP, because it's already ubiquitous. The PUT method is the one we want, because it's guaranteed to be idempotent (unlike POST), and it pushes data, which is more efficient for the log server than pulling (which usually implies polling). HTTP already has lots of useful features (such as security) that we don't have to re-invent. If you use the "chunked" Transfer-Encoding, then you can start sending a batch as soon as it starts being generated, which minimizes latency.
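
Here's a rough sketch, in Python, of what the client side of that might look like; the /logs/<batch> URL layout is an assumption for the sake of the example. Passing an iterable body with encode_chunked=True makes http.client stream the request with chunked transfer encoding, so sending can begin before the batch is complete:

    import http.client

    def send_batch(host: str, name: str, lines) -> int:
        """PUT one batch to the log server, streaming it as it's generated.

        `lines` is any iterable of byte strings (for example an open file
        in binary mode); the body is sent with chunked transfer encoding.
        """
        conn = http.client.HTTPConnection(host)
        try:
            conn.request("PUT", "/logs/" + name, body=lines,
                         encode_chunked=True)
            response = conn.getresponse()
            response.read()
            return response.status
        finally:
            conn.close()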

This scheme gets its scalability by exploiting parallelism.

A batch of logs may take longer to process than it took to generate if you have lots of log-generating machines, or if a naive log processing script is slow - the real-world example from Demon was doing reverse DNS lookups on web logs. In this situation you can start processing the next batch as soon as it becomes available, even if the previous batch is still being processed. This means you can make use of CPU time that would otherwise sit idle waiting for DNS lookups, or of multiple CPUs.
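
A sketch of that overlap, assuming an idempotent per-batch function like the one above: a small worker pool keeps several batches in flight, so time spent waiting on DNS for one batch doesn't hold up the next.

    from concurrent.futures import ThreadPoolExecutor

    def process_backlog(batch_paths, handler, workers=4):
        """Run the per-batch handler over many batches in parallel.

        `handler` is assumed to be an idempotent function of one batch
        file, so a failed batch can simply be retried.
        """
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for future in [pool.submit(handler, path) for path in batch_paths]:
                future.result()   # re-raise any error for that batch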

If one machine isn't enough to handle your log load, you can send alternate batches to different machines. For example, one machine can process all the even minutes and another all the odd minutes. You can also have a hierarchy of log processors: perhaps a log collector for each cluster which in turn passes its collected logs to the central processors.
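
A sketch of that kind of sharding, with made-up processor hostnames: route each batch by its minute number, so two processors split the even and odd minutes between them.

    def choose_processor(minute: int, processors: list) -> str:
        """Pick the log processor for a batch starting at the given minute.

        With two processors this gives the even/odd split; with more, the
        batches are spread round-robin across them.
        """
        return processors[minute % len(processors)]

    # e.g. choose_processor(batch_start(timestamp).minute,
    #                       ["logproc-a.example", "logproc-b.example"])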

The reliability comes from a couple of simple things.

Logs are transmitted using TCP, which isn't lossy like traditional UDP-based syslog. Logs are written to disk before being sent to the server, so that if there is a network problem they can be transmitted later. Similarly, the log server can delay processing a batch until all the clients have sent their chunks. The log processor can catch up after an outage by processing the backlog of batches in parallel.
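
Here's a sketch of the client-side spool and retry, reusing the send_batch() sketch from above; the spool layout (one file per batch, renamed when acknowledged) is an assumption:

    import os

    def flush_spool(spool_dir: str, host: str) -> None:
        """Send every not-yet-acknowledged batch file in the spool directory.

        Batches are written to disk before sending; if the PUT fails they
        stay in the spool and are retried on the next run. Because the
        server side is idempotent, resending after an outage is harmless.
        """
        for name in sorted(os.listdir(spool_dir)):
            if name.endswith(".sent"):
                continue
            path = os.path.join(spool_dir, name)
            with open(path, "rb") as batch:
                status = send_batch(host, name, batch)
            if 200 <= status < 300:
                os.rename(path, path + ".sent")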

This implies that the log server has to have a list of clients so that it knows how many batches to expect each minute. Alternatively, if a machine comes back after an outage and sends a load of old batches to the log processor, idempotence means that it can re-process those old batches safely. I'm not really sure what the best approach is here - it needs some operational experience.
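
For what it's worth, the first approach might look something like this sketch, where the server is configured with the set of expected clients (a real version would also need a timeout so that one dead client can't hold up processing forever):

    from collections import defaultdict

    class BatchTracker:
        """Track which clients have delivered their share of each minute."""

        def __init__(self, expected_clients):
            self.expected = set(expected_clients)
            self.received = defaultdict(set)   # minute -> clients heard from

        def record(self, minute, client) -> bool:
            """Note a delivery; return True once the batch is complete."""
            self.received[minute].add(client)
            return self.expected <= self.received[minute]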