A rant about whois

I have been fiddling around with FreeBSD's whois client. Since I have become responsible for Cambridge's Internet registrations, it's helpful to have a whois client which isn't annoying.

Sadly, whois is an unspeakably crappy protocol. In fact it's barely even a protocol, more like a set of vague suggestions. Ugh.

The first problem...

... is to work out which server to send your whois query to. There are a number of techniques, most of which are necessary and none of which are sufficient.

(0) Rely on a knowledgeable user to specify the server.

Happily we can do better than just this, but the feature has to be available for special queries.

(1) Have a built-in curated mapping from query patterns to servers.

This is the approach used by Debian's whois client. Sadly in the era of vast numbers of new gTLDs, this requires software updates a couple of times a week.

(2) Send the query to TLD.whois-servers.net which maps TLDs to whois servers using CNAMEs in the DNS.

This is a brilliant service, particularly good for the wild and wacky two-letter country-class TLDs. Unfortunately it has also failed to keep up with the new gTLDs, even though it only needs a small amount of extra automation to do so.

(3) Try whois.nic.TLD which is the standard required for new gTLDs.

In practice a combination of (2) and (3) is extremely effective for domain name whois lookups.

(4) Follow referrals from a server with broad but shallow data to one with narrower and deeper data.

Referrals are necessary for domain queries in "thin" registries, in which the TLD's registry does not contain all the details about registrants (domain owners), but instead refers queries to the registrar (i.e. reseller).

They are also necessary for IP address lookups, for which ARIN's database contains registrations in North America, plus referrals to the other regional Internet registries for IP address registrations in other parts of the world.
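
To make the two hostname-based techniques concrete (TLD.whois-servers.net and whois.nic.TLD), here is a minimal sketch in Python. It is illustrative only and is not the C code in FreeBSD's client; the function name candidate_servers is made up for this example.

```python
def candidate_servers(domain: str) -> list[str]:
    """Return whois server hostnames worth trying for a domain name.

    Hypothetical helper for illustration; not taken from any real client.
    """
    # The TLD is the last label, ignoring a trailing root dot.
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return [
        f"{tld}.whois-servers.net",  # CNAME-based mapping kept in the DNS
        f"whois.nic.{tld}",          # conventional name for new gTLDs
    ]


if __name__ == "__main__":
    print(candidate_servers("example.london"))
```

A client would presumably try each candidate in turn and use the first one that resolves in the DNS.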

Back in May I added (3) to FreeBSD's whois to fix its support for new gTLDs, and I added a bit more (1).

One motivation for the latter was for looking up ac.uk domains: (4) doesn't work because Nominet's .uk whois server doesn't provide referrals to JANET's whois server; and (2) is a bit awkward, because although there is an entry for ac.uk.whois-servers.net you have to have some idea of when it makes sense to try DNS queries for 2LDs. (whois-servers.net would be easier to use if it had a wildcard for each entry.)

The other motivation for extending the curated server list was to teach it about more NIC handle formats, such as -RIPE and -NICAT handles; and the same mechanism is useful for special-case domains.

Last week I added support for AS numbers, moving them from (0) to (1). After doing that I continued to fiddle around, and soon realised that it is possible to dispense with (3) and (2) and a large chunk of (1), by relying more on (4). The IANA whois server knows about most things you might look up with whois - domain names, IP addresses, AS numbers - and can refer you to the right server.

This allowed me to throw away a lot of query syntax analysis and trial-and-error DNS lookups. Very satisfying.
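
A rough sketch of the "start at IANA and follow the referral" idea, in Python rather than the C of the real client. It assumes the IANA server (whois.iana.org) marks its referral with a "refer:" line; the function names and the max_hops limit are inventions for this example. Other servers use other referral formats, so in practice this simple version only follows the first hop.

```python
import re
import socket

IANA_SERVER = "whois.iana.org"

# Assumption: IANA's reply carries a line of the form "refer:   hostname".
REFER_RE = re.compile(r"^refer:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def whois_query(server: str, query: str, timeout: float = 10.0) -> str:
    """Send one whois query over TCP port 43 and return the whole reply."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def lookup(query: str, ask=whois_query, max_hops: int = 5):
    """Query IANA first, then follow "refer:" lines.

    Returns a list of (server, reply) pairs in the order they were queried.
    The `ask` parameter exists so the network call can be replaced in tests.
    """
    server = IANA_SERVER
    seen = set()
    results = []
    while server and server not in seen and len(results) < max_hops:
        seen.add(server)
        reply = ask(server, query)
        results.append((server, reply))
        match = REFER_RE.search(reply)
        server = match.group(1).lower() if match else None
    return results
```

The seen set and the hop limit guard against referral loops, which seems prudent given how loosely specified everything else is.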

(I'm not sure if this excellently comprehensive data is a new feature of IANA's whois server, or if I just failed to notice it before...)

The second problem...

... is that the output from whois servers is only vaguely machine-readable.

For example, FreeBSD's whois now knows about 4 different referral formats, two of which occur with varying spacing and casing from different servers. (I've removed support for one amazingly ugly and happily obsolete referral format.)

My code just looks for a match for any referral format without trying to be knowledgeable about which servers use which syntax.
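
The "try every format against every reply" approach might look something like this Python sketch. The three patterns are examples of referral lines commonly seen in the wild ("refer:", "Whois Server:" with or without a "Registrar" prefix, and "ReferralServer: whois://..."); they are assumptions for illustration, not the exact list FreeBSD's whois uses.

```python
import re

# Illustrative patterns only. Each tolerates leading indentation, odd spacing
# after the colon, and any letter case.
REFERRAL_PATTERNS = [
    re.compile(r"^[ \t]*refer:[ \t]*(\S+)", re.I | re.M),
    re.compile(r"^[ \t]*(?:Registrar[ \t]+)?Whois[ \t]+Server:[ \t]*(\S+)", re.I | re.M),
    re.compile(r"^[ \t]*ReferralServer:[ \t]*whois://([^\s:/]+)", re.I | re.M),
]


def find_referral(reply: str):
    """Return the first referred-to hostname found in a reply, or None.

    Every pattern is tried regardless of which server produced the reply.
    """
    for pattern in REFERRAL_PATTERNS:
        match = pattern.search(reply)
        if match:
            return match.group(1).lower()
    return None
```

Not caring which server uses which syntax keeps the table small, at the cost of the occasional false match on a server that happens to print a similar-looking line.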

The output from whois is basically a set of key: value pairs, but often these will belong to multiple separate objects (such as a domain name or a person or a net block); servers differ about whether blank lines separate objects or are just for pretty-printing a single object. I'm not sure if there's anything that can be done about this without huge amounts of tedious work.
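
For illustration, here is a naive Python parser that takes one side of that disagreement: it assumes blank lines always separate objects, and that lines starting with % or # are comments. Both assumptions are wrong for some servers, which is exactly the problem. The function name is made up for this sketch.

```python
def parse_objects(reply: str) -> list[list[tuple[str, str]]]:
    """Split a whois reply into objects, each a list of (key, value) pairs.

    Pairs are kept in a list rather than a dict because keys often repeat
    within one object. Lines without a colon, including continuation lines,
    are silently dropped in this sketch.
    """
    objects = []
    current = []
    for line in reply.splitlines():
        stripped = line.strip()
        if not stripped:
            # Assumption: a blank line ends the current object.
            if current:
                objects.append(current)
                current = []
            continue
        if stripped[0] in "%#":
            continue  # assumed comment marker
        key, sep, value = stripped.partition(":")
        if sep:
            current.append((key.strip(), value.strip()))
    if current:
        objects.append(current)
    return objects
```

On a server that uses blank lines purely for pretty-printing, this will chop a single object into several pieces.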

And servers often emit a lot of rubric such as terms and conditions or hints and tips, which might or might not have comment markers. FreeBSD's whois has a small amount of rudimentary rubric-trimming code which works in a lot of the most annoying cases.

The third problem...

... is that the syntax of whois queries is enormously variable. What is worse, some servers require some non-standard complication to get useful output.

If you query Verisign for microsoft.com the server does fuzzy matching and returns a list of dozens of spammy name server names. To get a useful answer you need to ask for domain microsoft.com.

ARIN also returns an unhelpfully terse list if a query matches multiple objects, e.g. a net block and its first subnet. To make it return full details for all matches (like RIPE's whois server) you need to prefix the query with a +.

And for .dk the verbosity option is --show-handles.

The best one is DENIC, which requires a different query syntax depending on whether the domain name is a non-ASCII internationalized domain name, or a plain ASCII domain (which might be a punycode-encoded internationalized domain name). Good grief, can't it just give a useful answer without hand-holding?
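
The per-server special cases above boil down to a small lookup on the server name before sending the query. A hedged Python sketch follows. The hostnames are the usual public ones for each registry, but the DENIC flags in particular ("-T dn,ace" for ASCII names, "-T dn" otherwise) are an assumption about the syntax and should be checked against DENIC's own documentation.

```python
def format_query(server: str, query: str) -> str:
    """Decorate a bare query with the syntax a given server wants.

    Hypothetical helper for illustration; not FreeBSD's actual code.
    """
    host = server.lower()
    if " " in query:
        # The user already supplied server-specific syntax; leave it alone.
        return query
    if host == "whois.verisign-grs.com":
        return f"domain {query}"          # avoid fuzzy name server matches
    if host == "whois.arin.net":
        return f"+ {query}"               # full details for every match
    if host == "whois.dk-hostmaster.dk":
        return f"--show-handles {query}"  # verbose output for .dk
    if host == "whois.denic.de":
        # Assumed flags: ASCII names (including punycode) versus raw IDNs.
        return f"-T dn,ace {query}" if query.isascii() else f"-T dn {query}"
    return query


if __name__ == "__main__":
    print(format_query("whois.verisign-grs.com", "microsoft.com"))
```

Passing queries that already contain a space through untouched leaves room for the knowledgeable user to override all of this.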

Conclusion

That's quite a lot of bullshit for a small program to cope with, and it's really only scratching the surface. Debian's whois implementation has attacked this mess with a lot more sustained diligence, but I still prefer FreeBSD's because of its better support for new gTLDs.