2022-06-27 – tolower() in bulk at speed

Here’s a fun bit of Monday optimization.

DNS servers often have to convert DNS names to their canonical lower case form. BIND has to do so a bit more than most because it tries to preserve the case of DNS names rather than canonicalizing them.

I was recently distracted by a cleanup oportunity: there were multiple maptolower lookup tables in BIND, so I decided to make a home for them so that we would only need one copy. Then I thought, surely there are faster ways to implement tolower() than a lookup table.

SIMD within a register

There’s an amazing amount of space in a 64-bit register for crunching data in parallel. Many years ago I wrote about a bitsliced Life In A Register, my link log has a collection of web pages about bit twiddling hacks, and my shelves are graced by a copy of Hacker’s Delight.

Code often needs to work with DNS names one label at a time (the labels are the dot-separated parts), and labels are frequently less than 8 bytes long, and therefore fit inside a 64-bit register. So it probably isn’t worth dealing with the portability issues of working with wide vector registers. (Especially since I could not find a quick way to load an arbitrary number of bytes into a vector register with AVX2 nor with NEON.)

bytewise parallel tolower()

Hacker’s Delight has a lot of discussion of this kind of SWAR string processing code, but it does not directly tackle tolower(). So, here’s how I adapted Hank Warren’s examples for my purposes.

Our function is going to work on 8 ASCII bytes packed into a uint64_t.

uint64_t tolower8(uint64_t octets) {

For readability, it’s helpful to be able to replicate a single-byte value to all 8 bytes in a word. To do so we can multiply by a constant that has a 1 in each byte. The compiler’s constant propagation and instruction selection will make this efficient for us.

    uint64_t all_bytes = 0x0101010101010101;

We are going to steal the 0x80 bit in each byte to use as a flag for various purposes.

    uint64_t heptets = octets & (0x7F * all_bytes);

The main trick we’ll use from Hank Warren is how to find out which bytes are greater than a particular constant, such as 'Z'. If we add 0x7F to any of our bytes, carries will propagate and the result will overflow into our 0x80 flag bit only if the byte is greater than zero. And because we cleared the flag bits, they act as fire breaks stopping the carries from lower bytes messing up the calculations for higher bytes.

To compare with a value other than zero, we can subtract it from 0x7F:

    uint64_t is_gt_Z = heptets + (0x7F - 'Z') * all_bytes;

To do a greater-than-or-equal comparison, we just need to add 1:

    uint64_t is_ge_A = heptets + (0x80 - 'A') * all_bytes;

The flags we just calculated are only valid if the corresponding octet is in the ASCII range of 0 to 127. So we need a clear flag for each byte that is 128 or greater.

    uint64_t is_ascii = ~octets;

We are now ready to identify which bytes contain upper case ASCII letters:

    uint64_t is_upper = is_ascii & (is_ge_A ^ is_gt_Z);

The difference between corresponding upper case and lower case letters in ASCII is 32, aka 0x20. So to get our conversion value, we need to move our 0x80 flag bits to 0x20, and clear out the junk that our calculations left behind in the other bits:

    uint64_t to_lower = (is_upper >> 2) & (0x20 * all_bytes);

All that’s left is to add our flag bits to the original word, so that each byte that was an ASCII upper case letter has 32 added to change it to lower case.

    return (octets | to_lower);

That should compile down to about 9 instructions, which is not much more than one instruction per byte. And these are the fastest kinds of instructions :-)

edited to add

We can reduce the critical path from 7 to 6 instructions with a slight adjustment: change is_ascii and the return value as follows, and delete to_lower.

    uint64_t is_ascii = ~octets & (0x80 * all_bytes);
    return (octets | is_upper >> 2);

performance numbers

I ran some basic benchmarks on a million random bytes to see how this code compares to one-byte-at-a-time tolower():

    0.098 ms memmove() copy
    0.399 ms tolower8() copy
    1.817 ms tolower() copy
    0.280 ms tolower8() compare
    2.090 ms tolower() compare

The copy lines are for making a new lower-case version of the string. The compare lines are for testing two strings for case-insensitive equality.

My tolower8() function is at its best when it has a lot to chew on; it is not so good for really short strings. The crossover point, when comparing two strings, is about 4 characters - conveniently, the same length as .com. (There is a trick for 4-character strings: load them both into the same register so they can both be converted to lower case at the same time.)

So now you know how to convert strings to lower case at gigabytes per second in a single thread.

⇐ 2022-06-22 ⇐ Compacting a qp-trie ⇐ ⇒ A DNS name compression algorithm ⇒ 2022-07-01 ⇒