Spam bot signatures – Tony Finch

Recently I have been investigating spam bot signatures, specifically the characteristic domain names they choose to put in their SMTP HELO commands. A lot of spam bots use the same HELO domains from lots of different compromised PCs, which makes them quite easy to spot and block with very little risk of blocking legitimate email. This kind of block can take care of about 15%-20% of spam without relying on third-party services like DNSBLs. Of course, this is one of the techniques used to populate the Spamhaus XBL, so the only advantage of doing it yourself is catching spam bots that the Spamhaus guys have not yet spotted.

Steve Champeon's Enemieslist service is a highly developed commercial implementation of this idea. His patterns are much more comprehensive, covering spam bot signatures, domestic IP connectivity (like the Spamhaus PBL), and spam-infested netblocks.

Yesterday evening I was thinking about how to automatically identify spam bot signatures, when I realised that I had already written the code to do the job! I wanted to count how many different IP addresses were using the same HELO domain, and block connections that used excessively popular domains. All I needed was a few lines of Exim configuration:

  deny
    message   = Probable spam bot HELO seen from $sender_rate networks
    condition = ${if !eqi{localhost.localdomain}{$sender_helo_name} }
  ! verify    = helo
    ratelimit = 4 / 1w / per_conn / strict \
      / unique=${mask:$sender_host_address/24} / ${lc:$sender_helo_name}

Let's unpack this in reverse order.

We're measuring the rate of use of HELO domains, so the ratelimit key is ${lc:$sender_helo_name}. It's forced to lower case so that SERVER and server are treated as the same thing.

But we don't care about the total usage rate, only the rate of uses from distinct IP addresses. The unique= option invokes the Bloom filter code to avoid counting each spam bot more than once. In fact we only count distinct /24 network blocks, in order to avoid false positives from mail clusters in which all servers use the same name. For example, Facebook's MTAs all say HELO mx-out.facebook.com though they are spread across about 100 IP addresses on a couple of /24 networks.
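To make the counting concrete, here is a rough Python sketch of the same idea: one key per lower-cased HELO name, and one count per distinct /24 block. It is not Exim's implementation. Plain sets stand in for the Bloom filter, and the names network_block and HeloTracker are made up for illustration.

```python
import ipaddress
from collections import defaultdict


def network_block(address: str, prefix_len: int = 24) -> str:
    """Mask an IPv4 address down to its enclosing block, e.g. 192.0.2.77 -> 192.0.2.0/24."""
    return str(ipaddress.ip_network(f"{address}/{prefix_len}", strict=False))


class HeloTracker:
    """Counts distinct /24 blocks per HELO name (exact sets stand in for the Bloom filter)."""

    def __init__(self) -> None:
        self._blocks = defaultdict(set)

    def record(self, helo_name: str, address: str) -> int:
        """Note one connection and return the number of distinct blocks seen for this name."""
        key = helo_name.lower()  # SERVER and server share one key
        self._blocks[key].add(network_block(address))
        return len(self._blocks[key])

    def distinct_blocks(self, helo_name: str) -> int:
        """Number of distinct blocks seen so far for this name, without recording anything."""
        return len(self._blocks.get(helo_name.lower(), ()))
```

With this sketch, a cluster of a hundred servers spread over two /24 networks only ever counts as two, however many connections it makes.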

The strict option means keep counting even when the measured rate has passed the limit. The per_conn option means only count once for each connection (which mainly helps with efficiency).

The smoothing period is set to one week, which should mean that Exim doesn't easily forget which HELO domains have been abused.
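The effect of a long smoothing period can be pictured with a small sketch. It assumes plain exponential smoothing, where an old measurement fades by a factor of e for every period that passes without new events. The function name and the exact formula are assumptions for illustration rather than Exim's code.

```python
import math

WEEK = 7 * 24 * 60 * 60  # smoothing period in seconds, matching the 1w setting


def decayed_rate(rate: float, elapsed: float, period: float = WEEK) -> float:
    """Weight left on an old measurement after `elapsed` seconds with no new events.

    Assumes simple exponential decay with time constant `period`.
    """
    return rate * math.exp(-elapsed / period)
```

Under that assumption a measurement still keeps most of its weight after a quiet day, and loses about 63% of it after a full quiet week.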

I've currently got the limit set to four /24 blocks. It might even be reasonable to reduce this to three. I'll need to run it a bit longer to see if any more odd false positives sneak out of the woodwork.

We do not apply this check if the DNS agrees with the HELO domain (that is what the ! verify = helo line is for). There are some legitimate host names which are being heavily abused by spam bots, such as mail.aol.com and mx54.mail.com, so we want to block them if the connection comes from anywhere other than the host itself. (Sadly Facebook's MTAs are misconfigured so they don't pass this check.)
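The "DNS agrees with the HELO domain" test can be sketched as a forward lookup of the HELO name followed by a comparison with the connecting address. This is a simplified stand-in, not a description of everything Exim's HELO verification does. The resolve parameter is a hypothetical callable supplied by the caller, which keeps the sketch free of real network lookups.

```python
from typing import Callable, Iterable


def helo_matches_dns(helo_name: str, client_address: str,
                     resolve: Callable[[str], Iterable[str]]) -> bool:
    """True when one of the addresses the HELO name resolves to is the client's own.

    `resolve` is a caller-supplied lookup (hypothetical); it should return the
    addresses for a name, or raise LookupError when the name does not exist.
    """
    try:
        addresses = resolve(helo_name.lower())
    except LookupError:
        return False
    return client_address in set(addresses)
```

A connection that claims a well-known mail host's name from some unrelated address fails this test, so it stays subject to the rate check. A connection from the real host passes and is left alone.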

The only exception to this (so far) is localhost.localdomain, which is the result of a popular misconfiguration (or lack of configuration) on legitimate Unix MTAs. If I find any other false positives they'll be exempted in a similar way.

ETA: rediffmail.com also needs whitelisting. It's the mail service of rediff.com, which is a portal for Indian expats. Also easyjet.com.
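As the list of exceptions grows, the exemption amounts to a case-insensitive set lookup. A minimal sketch follows, assuming the exempted names are matched exactly; the names EXEMPT_HELO_NAMES and is_exempt are invented here.

```python
# HELO names exempted from the popularity check (taken from the text above).
EXEMPT_HELO_NAMES = frozenset({
    "localhost.localdomain",  # common default on unconfigured Unix MTAs
    "rediffmail.com",
    "easyjet.com",
})


def is_exempt(helo_name: str) -> bool:
    """Case-insensitive exact match, in the spirit of the eqi comparison in the ACL."""
    return helo_name.lower() in EXEMPT_HELO_NAMES
```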

This heuristic seems to catch about three different kinds of spam bot behaviour:

- The HELO domain is the same as the domain in the MAIL FROM address. Spammers like to forge email "from" surprisingly few popular sites.
- The HELO domain is one of a few bare hostnames, such as pc or computer - I guess they use the name of the compromised host.
- The HELO domain is a parent domain of an ISP's edge network, e.g. telesp.net.br.
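For sorting log entries into these three groups, something like the following sketch would do. It is only a rough classifier: the function name and labels are made up, and the third case assumes the client's reverse DNS name is available, which the Exim rule itself does not need.

```python
from typing import Optional


def classify_helo(helo_name: str, mail_from: str,
                  client_rdns: Optional[str] = None) -> str:
    """Label a HELO name with one of the three observed behaviours, or "other"."""
    helo = helo_name.lower().rstrip(".")
    sender_domain = mail_from.rpartition("@")[2].lower()
    if sender_domain and helo == sender_domain:
        return "matches-mail-from"      # HELO equals the MAIL FROM domain
    if "." not in helo:
        return "bare-hostname"          # e.g. pc or computer
    if client_rdns and client_rdns.lower().rstrip(".").endswith("." + helo):
        return "isp-parent-domain"      # HELO is a parent of the client's rDNS name
    return "other"
```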

I'm really pleased by how easy and effective this has turned out to be. The only annoyance is that it took me 20 months to realise that my Bloom filter ratelimit code could do this! Also I hope there aren't too many lurking gotchas that I haven't spotted yet.

This check really shows up a long-standing weakness in Exim's hints database implementation. It just uses a local DBM file to store ratelimit data, so each individual server in my SMTP cluster has to accumulate data on spam bot HELO domains without being able to benefit from the experience of the rest of the cluster. I suppose I should spend some quality time with Tokyo Tyrant...
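The cost of per-server hints is easy to show with a sketch. Suppose each server keeps its own map from HELO name to the set of /24 blocks it has seen. The merge below is purely illustrative: it says nothing about how Exim stores hints or how a shared store would work, and merged_blocks is a made-up name.

```python
from typing import Dict, Iterable, Set


def merged_blocks(per_server: Iterable[Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Union each server's view of HELO name -> set of /24 blocks."""
    merged: Dict[str, Set[str]] = {}
    for view in per_server:
        for helo, blocks in view.items():
            merged.setdefault(helo, set()).update(blocks)
    return merged
```

If two servers have each seen the same HELO name from two different blocks, neither reaches a limit of four on its own, but the merged view does.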