How not to design an MTA - part 6 - address verification

Back in the autumn I wrote three posts (1, 2, 3) about MTA queue logistics, which together comprised part 5 of my MTA design series. Part 4 was about message file format, part 3 was about local delivery, and part 2 was about security partitions. It's nearly a year since I wrote part 1 (on local message submission), so I should probably write some more...

One of the things I most like about Exim's architecture is the split between the two major sections of its configuration file. The "acls" section contains the "access control lists" which control which messages Exim will accept. ("Access control logic" would be a more accurate name since they are not simple lists.) The "routers" section controls how Exim decides where to deliver messages. This is a fairly clean front-end / back-end split, though it is somewhat obscured by historical baggage, such as the separation between routers and transports, the client/server confusion of authenticators, and the remaining ACL-related stuff in the global configuration section. What is neat, though, is the loose coupling between the front and back ends.

One of the key requirements of an SMTP server (especially an MX) is to verify addresses before it accepts messages, so that it does not take responsibility for undeliverable messages. Postfix normally verifies addresses using its local_recipient_maps setting, which duplicates the routing logic implemented by other parts of the MTA. This implies that when you change the configuration of Postfix's back-end you must make a corresponding change to the front-end. This sucks.

Exim avoids this duplication by using the routers directly to do verification. The ACL just says verify = recipient or verify = sender, and Exim attempts to route the relevant address. If routing succeeds the address is valid, and if it fails the ACL returns the error from the routers directly to the client.
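To make the loose coupling concrete, here is a minimal sketch in Python rather than Exim's own configuration language. Everything in it is made up for illustration: the router functions, the LOCAL_USERS set, the example.org domain and the reply strings are assumptions, not Exim internals. The point is only that the front end's RCPT check calls the same route() function that delivery would use, so there is no second copy of the routing logic to keep in step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RouteResult:
    ok: bool
    detail: str  # transport name on success, SMTP error text on failure


# A router returns None to decline the address (pass it to the next router),
# or a RouteResult to accept or reject it.
Router = Callable[[str], Optional[RouteResult]]

LOCAL_DOMAIN = "example.org"    # hypothetical local domain
LOCAL_USERS = {"alice", "bob"}  # hypothetical user database


def local_user_router(address: str) -> Optional[RouteResult]:
    local, sep, domain = address.rpartition("@")
    if not sep or domain != LOCAL_DOMAIN:
        return None  # not ours, decline
    if local in LOCAL_USERS:
        return RouteResult(True, "local_delivery")
    return RouteResult(False, "550 unknown user")


def remote_router(address: str) -> Optional[RouteResult]:
    local, sep, domain = address.rpartition("@")
    if not sep or not domain or domain == LOCAL_DOMAIN:
        return None
    return RouteResult(True, "remote_smtp")


ROUTERS: List[Router] = [local_user_router, remote_router]


def route(address: str) -> RouteResult:
    """Back end: run the routers in order until one handles the address."""
    for router in ROUTERS:
        result = router(address)
        if result is not None:
            return result
    return RouteResult(False, "550 unrouteable address")


def acl_check_rcpt(address: str) -> str:
    """Front end: the moral equivalent of 'verify = recipient'."""
    result = route(address)
    return "250 OK" if result.ok else result.detail
```

With this shape, adding or changing a router changes what the front end accepts without touching acl_check_rcpt() at all.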

Under the covers, Sendmail works in a similar way to Exim. The rulesets 0-5 correspond to Exim's routers, and the check_* rulesets correspond to Exim's ACLs. Sendmail's rulesets can invoke each other, and the check_* rulesets invoke the numbered rulesets to do address verification.

One of the less nice things about Exim is the ad-hoc configuration language - in fact, Exim has about 7 little languages squeezed inside it, by my count. Sendmail has a much more unified configuration syntax, but it is much, much more obscure. So in fact most people configure it using m4 macros and don't get to appreciate its lurking crufty elegance.

There is an architectural bug in Postfix that prevents it from using this loose coupling design effectively. All email traffic between its front end and back end is via files on disk. This is catastrophic for performance, since using this mechanism for verification would at least triple the disk load required to receive a message: one disk op to receive the message, one for sender verification, and one for recipient verification. That is ignoring the 10%+ of messages that have multiple recipients, and the 25%-33% of messages that have invalid recipients (even after blacklist checks). Even so, you can configure Postfix to work in this way in order to do callout address verification.
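The arithmetic behind "at least triple" can be written down as a toy model. This is a deliberately crude sketch: it assumes exactly one disk op per verification and one to receive the message, as in the paragraph above, and the function name and parameters are mine.

```python
def disk_ops_per_message(recipients: int = 1, verify_sender: bool = True) -> int:
    """Disk ops to accept one message if every verification goes via disk."""
    receive = 1                        # writing the message itself
    sender_checks = 1 if verify_sender else 0
    recipient_checks = recipients      # one verification per RCPT
    return receive + sender_checks + recipient_checks


# A single-recipient message costs 3 ops instead of 1, and each extra
# recipient adds another.
print(disk_ops_per_message())              # 3
print(disk_ops_per_message(recipients=4))  # 6
```

The model leaves out verifications of invalid recipients, which cost a disk op without any message being received at all, so the real multiplier would be worse.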

At this point I need to take a brief diversion to explore the varied depths of email address verification. In particular, how much effort can or should you put in, which is to say, how close do you get to delivering a message before stopping? (You can't go all the way because you don't have a message to deliver when verifying!) If the address is an alias, do you go on to verify the address it redirects to? What if there is more than one address? If it is a local user, do you check the quota? The answers can depend on both the implementation of your MTA and on local policy decisions.

Callout verification is specific to addresses that the MTA will deliver to over SMTP or LMTP. Traditionally MTAs verify remote addresses by just checking that the domain's DNS is sane, but you can go quite a lot further before you are actually delivering a message. You can ensure that you can connect to the remote SMTP server, and you can start an SMTP transaction by sending the MAIL and RCPT commands of a message envelope. You can abort by resetting the transaction before sending any message data. The result of the verification is the destination's response to the RCPT command, if everything worked.
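The exchange described above is short enough to sketch. This is a rough illustration, not how any particular MTA implements callouts: callout_verify() is a name I have made up, and it expects an already-connected client with the same methods as Python's smtplib.SMTP (ehlo_or_helo_if_needed, mail, rcpt, rset). The FakeServer class is a hypothetical stand-in so the sketch can be tried without touching a real mail server.

```python
def callout_verify(client, sender, recipient):
    """Probe a recipient: send MAIL and RCPT, then reset. Never sends DATA.

    Returns the (code, text) reply to RCPT, or the reply to MAIL if the
    server refused the sender before we got that far.
    """
    client.ehlo_or_helo_if_needed()
    code, text = client.mail(sender)
    if code != 250:
        client.rset()
        return code, text
    code, text = client.rcpt(recipient)
    client.rset()  # abandon the transaction before any message data
    return code, text


class FakeServer:
    """Hypothetical offline stand-in that accepts a fixed set of mailboxes."""

    def __init__(self, valid):
        self.valid = set(valid)
        self.commands = []  # record of what the probe sent

    def ehlo_or_helo_if_needed(self):
        self.commands.append("EHLO")

    def mail(self, sender):
        self.commands.append("MAIL")
        return 250, b"OK"

    def rcpt(self, recipient):
        self.commands.append("RCPT")
        if recipient in self.valid:
            return 250, b"OK"
        return 550, b"No such user"

    def rset(self):
        self.commands.append("RSET")
        return 250, b"OK"


# Against a real server it would look something like this
# (mx.example.net is a placeholder host name):
#
#   import smtplib
#   with smtplib.SMTP("mx.example.net", 25, timeout=30) as client:
#       print(callout_verify(client, "", "user@example.net"))
```

Note that the sketch says nothing about time-outs, retries or caching the answer, which is where the real difficulty lies.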

In practice, callout verification is not very useful for addresses at domains not under your control, which is usually the case for sender addresses. This is partly because a lot of email is sent from broken domains that are not running an MTA - most commonly, email from web servers. A lot of it is junk from compromised servers, but a lot of it is also very desirable email related to some transaction performed by the user on the web site. A more worrying reason is that if sender callout verification is widely deployed, then a joe-job turns into an anonymized distributed denial-of-service attack: criminals can attempt to send lots of email "from" their victim and thereby cause lots of otherwise legitimate MTAs to bombard the victim with verification requests.

However, recipient callout verification can be very useful. If you are running an MX for domains that are under separate management it can be difficult to get their lists of valid recipients. They will have various different user administration systems, some more ad-hoc than others, and even if they are accessible in a reasonably standard way (e.g. LDAP queries to a Microsoft Active Directory) you still have to establish a second out-of-band trust relationship between your MX and the destination system. In comparison, callout verification works in-band, using normal SMTP behaviour with no special work required at either end. Much simpler!

Unfortunately, callout implementations are usually sub-standard. I've already described Postfix's performance problems. Exim's is crippled because it is too closely coupled to the ACLs: it is implemented as a subroutine call from the ACL, so it has at most one SMTP time-out period in which to perform an operation that can take several time-out periods. Postfix doesn't make this mistake because its implementation follows the loosely-coupled design that Exim only hints at. This also means that Postfix's callouts can make use of its global scheduling, concurrency control, and connection caching. Exim, being too decentralized, doesn't have these features at all.

Clearly there is room for improvement. In my next article I am going to argue that if you can get callout verification right then it makes a lot of other really cool stunts easy.