How not to design an MTA - part 5 - spool file logistics

Yesterday I wrote about the format of message data in spool files, but that's only part of what is stored in the spool - and from the performance point of view it isn't even a very important part, even though that is I was writing about. Far more important for performance is the high-level logistics of spool file management: how they are created, written, and deleted.

As well as the message data, the spool will contain for each each message all the information needed to route it to its destination(s). At a minimum this is the SMTP envelope, i.e. the recipient addresses, the return path (in case the message bounces) and the DSN options. However sometimes the postmaster wants to make routeing decisions using more details of the message's origin, such as the client and server IP addresses, information from the DNS, security details from TLS and SASL, the time it arrived, and results from anti-spam and anti-virus scanners. Also, for my cunning wire format scheme you need to preserve the MIME parse tree and dot-stuffing points.

Almost all of this information never changes after the message has been received. What does change is the list of recipients that have successfully received the message, or failed, or which need to be retried.

Sendmail and Exim have quite similar schemes for their spool files. See the Sendmail operations guide page 100 and the Exim specification page 410. Both store a message in two principal files, one containing the message body (i.e. the message data excluding the header) and one containing the envelope and message header. (Both can make routeing decisions based on arbitrary header fields.) There are also some short-lived auxiliary files: a transaction journal which records changes to the recipient list during a delivery, and a temporary file which is used when writing a new envelope file which holds the merged data from the old envelope file and the transaction journal. (Exim also has per-message log files, which IME are not much use, and can be turned off.)

The split is very slightly different in qmail, though it still has essentially two files per message: its data file stores the whole message data, and its envelope file stores a very bare envelope. It doesn't have a complicated envelope file because qmail is limited to routeing messages based on just their email addresses.

All of these MTAs are quite profligate in their use of files, and therefore the amount of filesystem metadata manipulation they cause. File creation requires modification of the parent directory's data and inode, the allocation of a new inode and at least one block for the file's data. This implies that the disks are going to have to do quite a lot of seeking to write the message data to stable (crash-resistant) storage. As a result these MTAs don't get the best possible performance out of their disk systems.

Postfix is better because it only uses one file per message. However it has multiple queue directories, and moves messages between them at different stages in their life cycle. This is wasteful: for example the Postfix queue documentation says that "whenever the queue manager is restarted [...] the queue manager moves all the active queue messages back into the incoming queue, and then uses its normal incoming queue scan to refill the active queue. The process of moving all the messages back and forth [...] is expensive. At all costs, avoid frequent restarts of the queue manager."

My simple spool scheme is as follows:

You use one file per message, which is divided into three sections: first the message data (in the same format as it came in on the wire), then the static envelope (information from the client connection, security negotiation, parser, scanner, etc), and finally the recipient addresses and transaction journal.
You create the spool file when you start receiving the message data, because that is the first time that it becomes absolutely necessary. It is never moved after it is created.
Some of the envelope information depends on the message data, so it is simpler to write all of the envelope after the data. The size of the envelope has a much lower bound than the data so it's OK to keep it in memory.
The recipient list comes last, because that is the part of the envelope that changes as deliveries are performed. The transaction journal is not necessarily just a list of addresses that we no longer need to deal with: in some situations you want to add recipients to the message after receiving it.
The file is almost entirely written only once in append-only order: the data and static envelope information never changes, and the transaction journal is append-only. However you need to be able to find the start of the envelope quickly when re-opening an old message, so its offset in the file is stored before the message data, and this offset information must be written after the data has been received.

In fact the presence of the envelope offset can serve as a way to distinguish between messages that have been correctly received and those that were only partially written (e.g. because of a crash). When the client finishes sending you the message data, you append the envelope, then update the envelope offset to mark the message as being OK immediately before sending the OK response to the client.

Let's see if we can reduce the load on the disks even more.

One obvious cost is the creation and deletion of spool files. You could keep a pool of unused files, and re-open one of them instead of creating a new file. When you have finished with a file, you can truncate it and return it to the unused pool. A small disadvantage of this is it removes the connection between a message's spool ID and its file on disk.

You are still truncating and extending the files a lot, which is a lot of work for the filesystem's block allocator. You could skip the truncation step before returning the file to the unused pool. This means you have to change your spool file format slightly to add a marker so that you can tell where the file's current contents end and the junk from previous messages starts. You have to overwrite the marker and add a new one whenever writing to the file, and you can no longer use O_APPEND mode. You also have to mark the file as unused, by zapping the envelope offset, before returning it to the pool.

The remaining work is fsync()ing changes to the files to ensure they are safely written to disk. A message needs to be synced when it is received, when you take responsibility for it. You then need to sync each entry in the transaction journal, which is at least once per message and at most once per recipient. Because there are lots of messages in their individual files, these syncs are all over the disk, which implies lots of costly seeking.

High performance NNTP servers minimize seeks by having a few large spool files to which they write messages sequentially. There are a couple of reasons that this doesn't work for us. Firstly, NNTP servers generally receive messages streamed from relatively few high-bandwidth clients, whereas SMTP servers often have many low-bandwidth clients which send few messages per connection. You want at least one spool file per client so that one client can write to a file slowly without holding up another. Secondly, you can't write the messages contiguously if you are going to keep a message's transaction journal next to it.

So why not move all the transaction journals into a global transaction log?

When I first thought of this, I doubted that the benefits would outweigh the complexity, but it turns out to have wonderful consequences.

The nice thing about this kind of log-structured storage is that it's very friendly to disks: the write position moves slowly so you can get much higher performance than you do with random seeks. The great difficulty is that you need to clean garbage out of the log so that it doesn't grow without bound, and this process can easily cannibalize all the performance you gained from using a log structure in the first place. Hence my initial skepticism.

Another problem is that the information about a message can end up scattered throughout the journal as records are appended when its delivery is tried and retried. You can avoid this by re-writing its remaining recipient list at the head of the journal whenever it is retried. (This is a bit like Sendmail and Exim's envelope re-writing, but lest costly because of the log-structured write activity.) The next time it is tried you only have to look at its most recent entry in the journal.

One significant difference between an MTA transaction journal and a database or filesystem is that the MTA's data has a natural maximum lifetime, and this can actually be quite short. A nice consequence of the re-writing I just described is that the maximum age of data in the journal is the same as your maximum retry interval, which is probably only a few hours - compared to days for the lifetime of the message data, or years for blocks in a filesystem or database.

So you are effectively driving your journal garbage collector from your MTA's queue runners. But wait! Turn it around: you can drive the queue runners from the journal. It has exactly the data your queue runners need, already sorted into in a fair retry order. What's more, your queue runners no longer need to scan the directories in the spool and open each file in turn to check whether it needs to be retried. They also never have to go searching in the global journal for a message's recipient list. In fact, if you keep enough envelope information in the journal to do all the routing for each message, you only need to re-open a spool file just before it is transmitted - and not at all if the destination host is still unavailable. Wonderful!

This division of the spool into a collection of message data files and a global log-structured queue is also very amenable for tuned storage setups. You could reserve a mirrored pair of disks for the queue, so that these disks almost never have to seek. The message data files can live on separate disks that have more random access patterns, but which have a lower load because most work goes on in the queue. The principal workload is only one fsync() per message file in the lifetime of a message. The queue gets all the fsync()s to record deliveries, so a load of less critical writes (such as re-writing envelopes) get fsync()ed for free.

In this setup the queue is somewhat like an NNTP server's overview database, and it has the same problems. All your eggs are in one basket, so it had better be very reliable. When INN introduced advanced storage layouts that depended on the correctness of the overview database, any problems with the database were catastrophic - and they were sadly common. That's not to say it can't be done: Demon's NewsBorg managed it. And in fact, an MTA's log-structured queue database which only needs to be accessed in log-structured order is much simpler than a random-access overview database, so it should also be more reliable.

I must say, I'm pretty pleased with this :-) Although I came at it from the direction of performance hackery, it turns out to be a unifying and simplifying idea of great elegance. And there are still at least a couple more ways of making it faster...