Mail switch naming and addressing at Cambridge

A postmaster at another university asked me why Cambridge has just one MX record pointing to a host name with multiple IP addresses, and what our experiences are with this setup. I thought I would post my answer in public since it might be of general interest.

Our current setup dates from 2004, though we reshuffled it a bit in 2010. It still has some historical artifacts which it would be nice to fix, but which aren't all that important.

Until 2004 our mail hub ppsw.cam.ac.uk (named after the infamous JANET email relay software) handled both incoming and outgoing email. Since approximately the dawn of time ppswitch has been scaled to multiple servers by giving the name multiple IP addresses. (PPswitch dates from 1991; I don't know when it was first scaled to multiple hosts - perhaps the mid-1990s?) We've generally depended on hardware and software reliability rather than fancy load-balancing fail-over appliances; this has been a very cheap and effective strategy for the last 10 years, though it didn't work so well when we were running PP :-)

By 2004 ppswitch was also providing a message submission service on smtp.hermes.cam.ac.uk, which ran on a different set of IP addresses on the same machines. (Plus POP+IMAP proxies which aren't really relevant to this post.) At that time the Exim configuration was a bit unsatisfactory because it did not clearly distinguish between the different classes of traffic - incoming, outgoing, submission - which meant it was not possible to take aggressive SMTP-time anti-spam measures without affecting internal email service.

So we created mx.cam.ac.uk to replace the use of ppsw.cam.ac.uk in MX records, keeping the traditional name ppsw.cam.ac.uk for outgoing relay service. Since then each ppswitch machine has had three public IP addresses, one for each type of service. Exim is configured to behave differently depending on the IP address that the sender connected to. The delivery logic is the same regardless of how messages arrive.
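
The idea of choosing behaviour from the address the sender connected to can be sketched in a few lines of Python. This is only an illustration, not our Exim configuration: the addresses come from the 192.0.2.0/24 documentation range, and the Policy fields (strict_antispam, requires_auth) are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    service: str           # class of traffic handled on this address
    public_name: str       # public host name that resolves to it
    strict_antispam: bool  # hypothetical flag: aggressive SMTP-time checks
    requires_auth: bool    # hypothetical flag: client must authenticate


# Hypothetical service addresses (documentation range, not real ones).
POLICY_BY_LOCAL_ADDRESS = {
    "192.0.2.36": Policy("outgoing relay", "ppsw.cam.ac.uk", False, False),
    "192.0.2.46": Policy("incoming", "mx.cam.ac.uk", True, False),
    "192.0.2.56": Policy("submission", "smtp.hermes.cam.ac.uk", False, True),
}


def policy_for(local_address: str) -> Policy:
    """Return the policy for the local address a client connected to."""
    try:
        return POLICY_BY_LOCAL_ADDRESS[local_address]
    except KeyError:
        raise ValueError(f"no service on {local_address}") from None
```

In a real server the local address would come from the accepted socket (for example conn.getsockname()[0]); everything after the policy lookup, such as delivery, stays the same for all three classes.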

The setup of mx.cam.ac.uk was basically a copy of ppsw.cam.ac.uk and smtp.hermes.cam.ac.uk, which is why it is configured like a scaled service host name rather than making use of the extra indirection that MX records allow. This simple arrangement has never really been a problem for us. The load is not perfectly balanced - we tend to get more on the lowest IP address - but it has never been impossibly out of whack. The extra traffic tends to be easily-rejected spam and we have enough headroom that it isn't a problem.
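
From the sending side, a name set up this way simply resolves to several addresses. A rough sketch of listing them with the Python standard library follows; the helper names are made up for this example, and the standard library cannot do the MX lookup itself, so only the address lookup step is shown.

```python
import ipaddress
import socket


def sorted_unique(addresses):
    """Deduplicate address strings and sort them, v4 before v6, lowest first."""
    parsed = {ipaddress.ip_address(a) for a in addresses}
    ordered = sorted(parsed, key=lambda ip: (ip.version, int(ip)))
    return [str(ip) for ip in ordered]


def service_addresses(host, port=25):
    """Look up every address behind a service host name (needs working DNS)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as a string.
    return sorted_unique(info[4][0] for info in infos)
```

Calling service_addresses("mx.cam.ac.uk") should list whatever addresses the name currently has; the result depends on the live DNS, so no expected output is given here.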

Last year we made a change that improves ppswitch's manageability and robustness - more the former than the latter in practice, but auditors like to hear about disaster recovery plans. Now, each ppswitch machine by default only has a management IP address (and since this is the system's default IP address it is also used for outgoing connections). Machines in service or testing also have three service IP addresses for incoming connections.

The service addresses can be brought up on any of the physical servers, so if one of them dies we can bring up its addresses on a spare server. We can also use this for potentially disruptive configuration changes: put the new configuration on a spare server, flip the IP addresses over, and in case of cockup back out with a reverse flip. This is considerably better than relying on DNS changes to move service between machines, as we used to do!
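
As a rough sketch of what a flip involves, the following Python builds (but does not run) the commands to take a set of service addresses down on one server and bring them up on another. It assumes Linux hosts with the iproute2 "ip" command, an interface called eth0 and a /24 prefix; all of these are assumptions for illustration, as are the addresses, and this is not a description of our actual tooling.

```python
def flip_plan(service_addresses, iface="eth0", prefix_len=24):
    """Return (commands for the old server, commands for the new server).

    Nothing is executed; each command is an argument list that could be
    passed to subprocess.run on the relevant machine.
    """
    on_old_server = [
        ["ip", "addr", "del", f"{addr}/{prefix_len}", "dev", iface]
        for addr in service_addresses
    ]
    on_new_server = [
        ["ip", "addr", "add", f"{addr}/{prefix_len}", "dev", iface]
        for addr in service_addresses
    ]
    return on_old_server, on_new_server


# Hypothetical service addresses from the documentation range.
remove_cmds, add_cmds = flip_plan(["192.0.2.36", "192.0.2.46", "192.0.2.56"])
```

Backing out is the same plan with the two servers swapped. A real procedure would probably also need to make neighbouring routers and hosts refresh their ARP or neighbour-discovery caches, which this sketch ignores.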

This year we did IPv6 day, and we're in the process of putting IPv6 into full service on ppswitch. The IPv6 setup is basically the same as the v4 one, except that we have allocated separate addresses for the IMAP and POP proxies in v6 whereas they share the message submission address on v4. So a dual stack machine has 5 v6 and 3 v4 service addresses, plus one v4 and one v6 management address.

You can see how all this appears in the DNS if you run

dig axfr cam.ac.uk @authdns0.csx.cam.ac.uk | grep ppsw | grep -v RRSIG

That should give you some idea of how we have laid out ppswitch's names and IP addresses. The public service host names are: ppsw.cam.ac.uk (outgoing relay), mx.cam.ac.uk (incoming anti-spam gateway), smtp.hermes.cam.ac.uk (secure message submission), and {pop,imap}.hermes.cam.ac.uk (message store access).

We have well-defined IP address ranges to accommodate parts of the University with strict packet filters: 131.111.8.128/27 and 2001:630:212:8::e:0/112.
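
For anyone writing a filter or a log-processing script against these ranges, membership is easy to check with Python's ipaddress module. A minimal sketch; the function name is made up for the example and the test addresses are arbitrary points inside and outside the ranges, not necessarily addresses in use.

```python
import ipaddress

# The two ranges quoted above.
MAIL_RANGES = [
    ipaddress.ip_network("131.111.8.128/27"),
    ipaddress.ip_network("2001:630:212:8::e:0/112"),
]


def in_mail_range(address: str) -> bool:
    """True if the address falls inside either ppswitch range."""
    ip = ipaddress.ip_address(address)
    # Only compare against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in MAIL_RANGES)
```

The v4 range covers 131.111.8.128 to 131.111.8.159, and the v6 range covers the last 16 bits of the address.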

The way the (numbered) physical hosts and the (lettered) virtual service addresses fit into the v4 range is complicated. The final decimal digit of the address tells you whether it's a physical host (0,1 = on site; 2,3 = off site) or a virtual service address (4,5 = testing; 6-9 = live), and the penultimate digit identifies the kind of service (3 = ppsw, 4 = mx, 5 = hermes).
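
The digit scheme is easier to follow when written out as code. This sketch decodes the last two decimal digits of a v4 address in the range, following the rules just described. The function name and the wording of the labels are made up for the example, and the sample addresses are chosen to exercise the rules rather than because they are known to be in use.

```python
import ipaddress

V4_RANGE = ipaddress.ip_network("131.111.8.128/27")

# Penultimate decimal digit -> kind of service.
SERVICE_BY_DIGIT = {3: "ppsw", 4: "mx", 5: "hermes"}


def role_for_digit(digit: int) -> str:
    """Final decimal digit -> physical host or virtual service address."""
    if digit in (0, 1):
        return "physical host, on site"
    if digit in (2, 3):
        return "physical host, off site"
    if digit in (4, 5):
        return "virtual service address, testing"
    return "virtual service address, live"  # digits 6-9


def decode(address: str):
    """Return (service, role) for a v4 address inside the ppswitch range."""
    ip = ipaddress.ip_address(address)
    if ip.version != 4 or ip not in V4_RANGE:
        raise ValueError(f"{address} is not in {V4_RANGE}")
    last_octet = int(ip) & 0xFF
    # Digits outside 3-5 (128 and 129 in this range) are not covered
    # by the scheme, so they are reported as "unknown".
    service = SERVICE_BY_DIGIT.get((last_octet // 10) % 10, "unknown")
    return service, role_for_digit(last_octet % 10)
```

For example, decode("131.111.8.146") gives ("mx", "virtual service address, live"): the 4 selects mx and the 6 falls in the live range.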

What could be improved?

I would quite like to rename all the hosts into a mail.cam.ac.uk subdomain, instead of using the generic Computing Service Internal domain.

I have occasionally wished for an MX host name like mx0.mail.cam.ac.uk, so we have the option of more flexibility without polluting our top level namespace. But the only thing that might have benefited from the ability to add MX records was the possibility of fake low-priority anti-spam MXs.

The current naming scheme for the physical and service addresses is confusing and not as helpful in practice as I thought it might be. But I haven't come up with a scheme that is better enough to be worth the effort of renaming.