Message content scanning is vital for blocking viruses and spam that aren't blocked by DNSBLs. Unfortunately the interface between MTAs and scanners varies wildly depending on the MTA and the scanner - there are no good standards in this area. However I'm going to ignore that cesspit for now, and instead concentrate on when the scanning is performed relative to other message processing. As you might expect, most of the time it is done wrong. Fortunately for me the Postfix documentation has an excellent catalogue of wrong ways that I can refer to in this article.
An old approach is for the MTA to deliver the message to the scanner, which then re-injects it into the MTA. The MTA needs to distinguish messages from outside and messages from the scanner, so that the former are scanned and the latter are delivered normally. The Postfix documentation describes doing the delivery and re-injection in the "simple" way via a pipe and the sendmail command, or in the "advanced" way via SMTP. The usual way to do this with Exim is to tell Exim to deliver to itself using BSMTP over a pipe, using the transport filter feature to invoke the scanner. This setup has a couple of disadvantages worth noting. It at least doubles your load, because each message is received and delivered twice. It also makes the logs confusing to read, since the message gets a different queue ID after scanning, so you cannot easily match up its original receipt with its final delivery.
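For concreteness, here's a rough sketch of the "advanced" arrangement in Python (the ports, the aiosmtpd plumbing, and the scan_message() stub are all my own illustration, not taken from any particular setup): one listener receives messages from the MTA's front end, scans them, and re-injects them into a back-end listener that skips scanning.

```python
# Sketch of a deliver-and-re-inject content filter. The MTA's front
# end delivers to port 10025; we scan and re-inject on port 10026,
# where the MTA's back end accepts messages without re-scanning them.
# Ports and scan_message() are illustrative placeholders.
import smtplib
import time
from aiosmtpd.controller import Controller

def scan_message(data: bytes) -> bool:
    """Placeholder for the real scanner; True means the message is clean."""
    return b"EICAR" not in data

class FilterHandler:
    async def handle_DATA(self, server, session, envelope):
        if not scan_message(envelope.content):
            # This rejection goes to the MTA, not the original client,
            # so the MTA must bounce or discard - the "too late"
            # problem discussed below.
            return "554 5.7.1 Message rejected by content scanner"
        # Re-inject the (unmodified) message into the MTA's back end.
        with smtplib.SMTP("127.0.0.1", 10026) as mta:
            mta.sendmail(envelope.mail_from, envelope.rcpt_tos,
                         envelope.content)
        return "250 2.0.0 Ok: scanned and re-injected"

if __name__ == "__main__":
    controller = Controller(FilterHandler(), hostname="127.0.0.1", port=10025)
    controller.start()           # runs the SMTP server in a background thread
    try:
        while True:
            time.sleep(3600)
    except KeyboardInterrupt:
        controller.stop()
```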
Another arrangement is MailScanner's bump-in-the-queue setup. The MTA is configured to leave messages in the queue after receiving them, instead of delivering them immediately as it usually would. MailScanner picks them up fairly promptly - it scans the queue every second or two - and after scanning them drops them into a second queue, then tells the MTA to deliver from there. MailScanner has the advantage that it can work in batch mode, so when load is high (several messages arrive between incoming queue scans) the scanner start-up cost is spread more thinly. This is useful for old-fashioned scanners that can't be daemonized. Apart from the scanning itself, its only overhead is moving the messages between queues. MailScanner also preserves queue IDs, keeping logs simple. A key disadvantage is that MailScanner needs intimate knowledge of the MTA's queue format, which is usually considered private to the MTA. Sendmail and Exim do at least document their queue formats, though MailScanner is still vulnerable to format changes (e.g. Exim's recent extension to ACL variable syntax). Postfix is much more shy of its private parts, so there's a long-standing argument between people who want to use MailScanner and Wietse Venema, who insists that it is completely wrong to fiddle with the queue in this way.
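The bump-in-the-queue idea reduces to a polling loop; here's a sketch (the directory paths and the scan() stub are placeholders, and the hard part MailScanner actually has to do - parsing the MTA's private queue file format - is waved away):

```python
# Sketch of a bump-in-the-queue scanner loop, in the style of
# MailScanner. Paths and scan() are placeholders; a real implementation
# must understand the MTA's queue file format, and must also do
# something with messages that fail the scan (omitted here).
import os
import shutil
import time

INCOMING = "/var/spool/mta/hold"     # MTA receives here, never delivers
OUTGOING = "/var/spool/mta/deliver"  # MTA delivers from here

def scan(paths):
    """Placeholder batch scan; returns the subset judged clean.

    Batching is the point: one scanner start-up is amortised over
    every message that arrived since the last poll."""
    return paths

def main():
    while True:
        batch = [os.path.join(INCOMING, f) for f in os.listdir(INCOMING)]
        for path in scan(batch):
            # Moving between queues is the only overhead besides the
            # scan itself; a same-filesystem rename preserves the
            # queue ID, which keeps the logs readable.
            shutil.move(path, OUTGOING)
        time.sleep(2)  # MailScanner polls every second or two

if __name__ == "__main__":
    main()
```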
So far I have completely ignored the most important problem that both these designs share: It is too late to identify junk email after you have accepted responsibility for delivering it. You can't bounce the junk, because it will have a bogus return path, so the bounce will go to the wrong place. You can't discard it, because of the risk of throwing away legitimate email. You can't quarantine it or file it in a junk mailbox, because people will not check the quarantine and the ultimate effect will be the same as discarding. (Perhaps I exaggerate a bit: If the recipient doesn't get an expected message promptly, or if the sender contacts them out of band because they didn't get a reply, the recipient can at least look in the quarantine for it. However you can only expect people to check their quarantines for unexpected misclassified email if the volume of junk in the quarantine is relatively small, which means the quarantine should be reserved for the most difficult-to-classify messages.)
You must design the MTA to scan email during the SMTP conversation, before it accepts responsibility for the message. It can then reject messages that smell bad. Software that sends junk will just drop a rejected message, whereas legitimate software will generate a bounce to inform the sender of the problem. You minimise the problem of spam backscatter, and legitimate senders still get prompt notification of false positives. However you become much more vulnerable to overload: If you scan messages after accepting them, you can deal with an overload situation by letting a backlog build up, to be dealt with when the load goes down again. You do not have this latitude with SMTP-time scanning.
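To make the distinction concrete, an SMTP-time rejection looks like this on the wire (an illustrative dialogue; the 550 text is made up):

```
C: DATA
S: 354 End data with <CR><LF>.<CR><LF>
C: ...message headers and body...
C: .
S: 550 5.7.1 Rejected: message scored as spam
```

The reply to the final dot is where responsibility changes hands: a 250 means the receiving MTA must now deliver or bounce the message, whereas a 550 leaves the problem with the sender, whose software (if legitimate) generates the bounce.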
The Postfix before-queue content filter setup uses the Postfix smtpd on the front end to do non-content anti-spam checks (e.g. DNS blacklists and address verification); it then passes each message through the scanner using SMTP (in a similar manner to Postfix's "advanced" after-queue filters), and then in turn to another instance of smtpd, which inserts the message into the queue. There is minimal buffering before the scanner, so the whole message must be scanned in memory as it comes in, which means the scanner's concurrency is the same as the number of incoming connections. This is a waste: messages come in over the network slowly; if you buffer them so that you can pass them to the scanner at full speed, you can handle the same volume of email with lower scanner concurrency, saving memory or increasing the number of connections you can handle at once. However you don't want to buffer large messages in memory, because that brings back the problem in another form. You also don't want to buffer them on disk, since that would add overhead to the slowest part of the system - unless you use the queue file as the buffer. This implies that Postfix's before-queue filtering happens too early: The write to disk happens after the message has gone through the scanner.
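To illustrate the buffering argument, here's a sketch of the receive side done the other way around (scan_stream() and the numbers are illustrative): spool the message at the client's speed, then hold a scanner slot only for the time it takes to feed the scanner at full speed.

```python
# Sketch of buffering before scanning: accept the message at network
# speed into a spool (memory for small messages, disk for large ones),
# then stream it to the scanner at full speed under a concurrency cap.
import tempfile
import threading

SCANNER_SLOTS = threading.BoundedSemaphore(4)  # far fewer than connections

def scan_stream(fileobj) -> bool:
    """Placeholder: stream the spooled message to the scanner."""
    return True

def receive_and_scan(sock_reader) -> bool:
    # SpooledTemporaryFile stays in memory up to max_size and then
    # overflows to disk - wasted work unless that same file can double
    # as the queue file, which is the point made above.
    with tempfile.SpooledTemporaryFile(max_size=256 * 1024) as spool:
        for chunk in iter(lambda: sock_reader.read(8192), b""):
            spool.write(chunk)      # proceeds at the client's speed
        spool.seek(0)
        with SCANNER_SLOTS:         # the scanner is busy only briefly
            return scan_stream(spool)
```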
Sendmail's milter API couples scanners to the MTA in about the same place as Postfix's before-queue content filter, so it has the same performance problems. (Actually, in some cases it is worse: If you have a filter that wants to modify the message body, then with Postfix it can in principle do so in streaming mode with minimal in-memory buffering, whereas with Sendmail the milter API forces it to buffer the entire message before it can start emitting the modified version.) What's more interesting is their contrasting approaches to protocol design. Postfix goes for a simple open standard on-the-wire protocol as the interface to its scanners. However it misses its target: It speaks a simplified version of SMTP to the scanner, with a non-standard protocol extension to pass information about the client through to Postfix's back end. The simplification means that Postfix cannot offer SMTP extensions such as BINARYMIME unless the scanner does so too, which is a bit crippling. Sendmail goes for an open API, and expects scanners to link to a library that provides it. The connection to the MTA is a private undocumented protocol internal to Sendmail, subject to change between versions. This decouples scanners from the details of SMTP, but couples them to Sendmail instead. This is terrible for interoperability - and in practice it's futile to resist interoperability by keeping the protocol private, because people will create independent implementations of it anyway: 1 2 3. So I don't like either the Postfix or the Sendmail approach, both because of their performance characteristics and because of their bad interfaces.
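For example, here's roughly what a body-modifying milter looks like using the third-party pymilter bindings (treat the details as approximate; the point is that replacebody() may only be called at end-of-message, so the body chunks must pile up in memory first):

```python
# Sketch of why a body-modifying milter must buffer: the milter
# protocol delivers the body in chunks, but the replacement body can
# only be supplied at end-of-message. Uses the third-party pymilter
# bindings; socket name and milter name are illustrative.
import Milter

class RewritingMilter(Milter.Base):
    def __init__(self):
        self.chunks = []

    def body(self, chunk):
        self.chunks.append(chunk)   # forced to buffer the whole body
        return Milter.CONTINUE

    def eom(self):
        data = b"".join(self.chunks)
        # ...transform data here...
        self.replacebody(data)      # only legal at end-of-message
        return Milter.ACCEPT

if __name__ == "__main__":
    Milter.factory = RewritingMilter
    Milter.runmilter("rewriter", "inet:8894@127.0.0.1", timeout=240)
```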
Exim is agnostic about its interface to scanners: it has sections of code that talk directly to each of the popular scanners, e.g. SpamAssassin, ClamAV, etc. This is rather inefficient in terms of development resources (though the protocols tend to be simple), and succumbs to exactly the Babel that Postfix and Sendmail were trying to avoid. Exim's approach has the potential to be better from the performance point of view: It writes the message to disk before passing it to the scanner at full speed, so in principle the same file could act as the buffer for the scanner and as the queue file for later delivery. This would mean there is no buffering overhead for messages that are accepted; a message that is rejected would only hit the disk if the machine is under memory pressure. Sadly the current implementation formats the message to a second file on disk before passing it to the scanner(s), instead of formatting it in the process of piping it to the scanner. The other weakness is that although there is a limit on the number of concurrent SMTP connections, you can't set a separate, lower limit on the number of messages being scanned at once. You must instead rely on the scanners themselves to implement concurrency limits, and avoid undaemonized scanners that don't have such limits. This is probably adequate for many setups, but it means the MTA can't make use of its greater knowledge to do things like prioritize internal traffic over external traffic in the event of overload.
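As an example of how simple these protocols are, here is a minimal client for SpamAssassin's spamd protocol, which is a couple of header lines in each direction (error handling omitted; 783 is spamd's default port):

```python
# Minimal client for the spamd protocol: send a request line and a
# Content-length header, then the message; read back a status line
# and a verdict. Error handling omitted for brevity.
import socket

def spamd_check(message: bytes, host="127.0.0.1", port=783) -> str:
    request = (b"CHECK SPAMC/1.5\r\n"
               b"Content-length: %d\r\n\r\n" % len(message)) + message
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request)
        conn.shutdown(socket.SHUT_WR)   # signal end of request
        response = b""
        while chunk := conn.recv(4096):
            response += chunk
    # Response looks like:
    #   SPAMD/1.1 0 EX_OK
    #   Spam: True ; 7.5 / 5.0
    return response.decode("ascii", "replace")
```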
So, having criticised everything in sight, what properties do we want from the MTA's interface to scanners? In general, we would like the logistics of passing the message to the scanner to add no significant overhead - i.e. the cost should be the same as receiving the message and scanning the message considered separately, with nothing added to plug these processes together. Furthermore we'd like to save scanners from having to duplicate functionality that already exists in the MTA. Specifically:
- Buffer the message in its queue file before scanning, so that the scanner does not take longer than necessary because it is limited by the client's sending speed.
- Insulate the scanner from the details of SMTP extensions and wire formats, without compromising the MTA's support for same. This implies that any reformatting (e.g. downgrade binary attachments to base64) needed by the scanner should not pessimize onward delivery.
- Put sensible limits on the concurrency demanded of the scanner to maximise its throughput. Use short-term queueing and scheduling (a few seconds) to handle spikes in load. (The sketch after this list illustrates this point and the next.)
- Cache scanner results.
- Put a security boundary between the MTA and the scanner.
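Here's the promised sketch of the queueing, concurrency, and caching requirements in Python (all names and numbers are illustrative, and a real MTA would key the cache more carefully than on a whole-message digest):

```python
# Sketch of short-term queueing, a concurrency cap, and a result
# cache for a content scanner. run_scanner() is a placeholder.
import hashlib
import threading

SLOTS = threading.BoundedSemaphore(4)   # concurrency the scanner prefers
WAIT = 5.0                              # short-term queueing, in seconds
cache: dict[str, bool] = {}             # message digest -> clean?
cache_lock = threading.Lock()

def run_scanner(message: bytes) -> bool:
    """Placeholder for the real scanner invocation."""
    return True

def scan(message: bytes) -> bool | None:
    key = hashlib.sha256(message).hexdigest()
    with cache_lock:
        if key in cache:                # e.g. a retry of the same message
            return cache[key]
    if not SLOTS.acquire(timeout=WAIT): # ride out a short spike...
        return None                     # ...else answer 4xx: try again later
    try:
        verdict = run_scanner(message)
    finally:
        SLOTS.release()
    with cache_lock:
        cache[key] = verdict
    return verdict
```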
Notice that these have a certain commonality with callout address verification, which also needs a results cache, concurrency limits, and a queue/scheduler. This gives me the idea for what I call "data callouts" for content scanning, based on a loose analogy between verifying that the message's addresses are OK and verifying that the message's contents are OK. Also notice that message reformatting and security boundaries are requirements for local delivery. So a "data callout" is essentially a special kind of local delivery that the MTA performs before sending its reply to the end of the message data; it's a special kind of delivery because it is only done to check for success or failure - unlike normal deliveries, the message isn't stored in a mailbox. This design makes good use of existing infrastructure: The MTA can use its global scheduler to manage the load on the scanner. There is already lots of variability in local delivery, so the variability in content scanner protocols fits in nicely.
The data callout is actually a special case of "early delivery", i.e. delivering a message before telling the client that it has been accepted. This feature gives you a massive performance boost, since you can relay a message without touching disk at all (except to log!). If you are going to attempt this stunt then you need a coherent way to deal with problems caused by the early delivery taking too long. Probably the best plan is to ensure that a very slow diskless early delivery can be converted to a normal on-disk delivery, so that a response can be given to the client before it times out, and so that the effort spent on delivery so far is not wasted. This is similar to allowing lengthy callout address verifications to continue even after the client that triggered them has gone, so that the callout cache will be populated with a result that can be returned quickly when the client retries. (I'm not sure if it's worth doing the same thing with data callouts, or if a slow scanner more likely indicates some nasty problem that the MTA should back away from.)
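A sketch of the conversion idea, using asyncio (relay(), queue_to_disk(), and the 20-second budget are placeholders): the in-flight relay is shielded so that hitting the deadline doesn't cancel it, and the message is queued to disk so the client can be answered promptly.

```python
# Sketch of converting a slow diskless early delivery into a normal
# queued one. relay() and queue_to_disk() are placeholders; a real
# MTA must also make sure the still-running relay and the queued copy
# agree on who delivers, so the message isn't delivered twice.
import asyncio

async def relay(message: bytes) -> None:
    """Placeholder: stream the message onward without touching disk."""

async def queue_to_disk(message: bytes) -> None:
    """Placeholder: write a queue file for the normal delivery path."""

async def accept(message: bytes) -> str:
    task = asyncio.ensure_future(relay(message))
    try:
        # shield() means a timeout cancels our wait, not the relay,
        # so the effort spent on delivery so far is not wasted.
        await asyncio.wait_for(asyncio.shield(task), timeout=20.0)
        return "250 2.0.0 delivered"          # never touched disk
    except asyncio.TimeoutError:
        await queue_to_disk(message)          # fall back to on-disk delivery
        return "250 2.0.0 queued"             # answer before the client gives up
```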
The Postfix and Sendmail filter interfaces have a feature that is missing from Exim's scanner interface and my data callout idea. The filters can modify the message, whereas the scanners can only return a short result (such as a score). Message mangling is not something I particularly approve of, but it is a popular requirement. Fortunately my idea can support it, by going back to the old approach of delivering the message to the scanner which then re-injects it. Early delivery removes most of the disadvantages from this technique: it happens before we accept the message, and it doesn't add to disk load. It adds a new advantage of being able to fall back gracefully from scan-then-accept to accept-then-scan in the event of overload, if that's what you want. It still has the disadvantages of log obfuscation and depending on the scanner to support advanced SMTP features (though perhaps these can be avoided with a better filter protocol).
I hope that this convinces you that - as I said in my last essay - lots of cool things become possible if you get callouts right. This essay also serves as a response to iwj10, who complained that my log-structured queue idea was a pointless optimisation because early delivery is much more effective. He wasn't convinced when I said that early delivery was a separate problem. Even when you have early delivery - so that the queue only contains very slow or undeliverable messages - the log-structured queue reduces the effort required to work out which messages to retry next because the necessary information is stored in the right order.