A couple of years ago I wrote about tolower() in bulk at speed using SWAR tricks. A couple of days ago I was intrigued by Olivier Giniaux’s article about unsafe read beyond of death, an optimization for handling small strings with SIMD instructions, for a fast hash function written in Rust.

I’ve long been annoyed that SIMD instructions can easily eat short strings whole, but it’s irritatingly difficult to transfer short strings between memory and vector registers. Olivier’s post caught my eye because it seemed like a fun way to avoid the problem, at least for loads. (Stores remain awkward!)

Actually, to be frank, Olivier nerdsniped me.

This article is part of a series on optimizing DNS names.

signs of hope

Reading more around the topic, I learned that some SIMD instruction sets do, in fact, have useful masked loads and stores that are suitable for string processing, that is, they have byte granularity. They are AVX-512-BW, ARM SVE, and the RISC-V vector extension.

I have an AMD Zen 4 box, so I thought I would try a little AVX-512-BW.

tolower64()

Using the Intel intrinsics guide I wrote a basic tolower() function that can munch 64 bytes at once.

Top tip: You can use * as a wildcard in the search box, so I made heavy use of mm512*epi8 to find byte-wise AVX-512 functions (epi8 is an obscure alias for byte).

First, we fill a few registers with 64 copies of some handy bytes.

We need the letters A and Z:

    __m512i A = _mm512_set1_epi8('A');
    __m512i Z = _mm512_set1_epi8('Z');

We need a number to add to uppercase letters to make them lowercase:

    __m512i to_lower = _mm512_set1_epi8('a' - 'A');

We compare our input characters c with A and Z. The result of each comparison is a 64 bit mask which has bits set for the bytes where the comparison is true:

    __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
    __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);

If it’s greater than or equal to A, and less than or equal to Z, then it is upper case. (AVX mask registers have names beginning with k.)

    __mmask64 is_upper = _kand_mask64(ge_A, le_Z);

Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_upper is false, and when is_upper is true the result is c + to_lower.

    return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
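Putting those pieces together, the whole kernel is pleasingly small. Here’s a sketch of how it might be assembled (my paraphrase of the steps above, which may differ in detail from the full version in the repository):

    #include <immintrin.h>

    // sketch: convert ASCII uppercase letters to lowercase,
    // 64 bytes at a time; all other bytes pass through unchanged
    static inline __m512i tolower64(__m512i c) {
        __m512i A = _mm512_set1_epi8('A');
        __m512i Z = _mm512_set1_epi8('Z');
        __m512i to_lower = _mm512_set1_epi8('a' - 'A');
        __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
        __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
        __mmask64 is_upper = _kand_mask64(ge_A, le_Z);
        return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
    }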

bulk load and store

The tolower64() kernel in the previous section needs to be wrapped up in more convenient functions such as copying a string while converting it to lower case.

For long strings, the bulk of the work uses unaligned vector load and store instructions:

	__m512i src_vec = _mm512_loadu_epi8(src_ptr);
	__m512i dst_vec = tolower64(src_vec);
	_mm512_storeu_epi8(dst_ptr, dst_vec);
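Wrapped in a loop, the bulk path might look something like this (a sketch, assuming src_ptr, dst_ptr, and len are the copy function’s working variables; the loop in the repository may be arranged differently):

	// process full 64-byte chunks of a longer string
	while (len >= 64) {
		__m512i src_vec = _mm512_loadu_epi8(src_ptr);
		__m512i dst_vec = tolower64(src_vec);
		_mm512_storeu_epi8(dst_ptr, dst_vec);
		// advance to the next 64-byte chunk
		src_ptr += 64;
		dst_ptr += 64;
		len -= 64;
	}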

masked load and store

Small strings and the stub end of long strings use masked unaligned loads and stores.

This is the magic! Here is the reason I wrote this blog post!

The mask has its lowest len bits set (its first len bits in little-endian order). I wrote these two lines with perhaps more ceremony than required, but I thought it was helpful to indicate that the mask is not any old 64 bit integer: it has to be loaded into one of the SIMD unit’s mask registers.

	uint64_t len_bits = (~0ULL) >> (64 - len);
	__mmask64 len_mask =  _cvtu64_mask64(len_bits);
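For example, with len = 3, len_bits is 0x7, so only the first three bytes take part in the masked load and store below. (One caveat, which is a property of C rather than of the intrinsics: this expression only works for len from 1 to 64, because shifting a 64-bit value by 64 bits is undefined behaviour, so a zero length needs to be handled separately.)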

The load and store look fairly similar to the full-width versions, but with the mask stuff added. The z in maskz means zero the destination register when the mask is clear, as opposed to copying from another register (like in mask_add above).

	__m512i src_vec = _mm512_maskz_loadu_epi8(len_mask, src_ptr);
	__m512i dst_vec = tolower64(src_vec);
	_mm512_mask_storeu_epi8(dst_ptr, len_mask, dst_vec);

That’s the essence of it: you can see the complete version of copytolower64() in my git repository.

benchmarking

To see how well it works, I benchmarked several similar functions. Here’s a chart of the results, compiled with Clang 16 on Debian 11, and run on an AMD Ryzen 9 7950X.

The benchmark measures the time to copy about 1 MiByte, in chunks of various lengths from 1 byte to 1 kilobyte. I wanted to take into account differences in alignment in the source and destination strings, so there are a few bytes between each source and destination string, which are not counted as part of the megabyte.

On this CPU the L2 cache is 1 MiB per core, so I expect each run of the test spills into the L3 cache.

To be sure I was measuring what I thought I was, I compiled each function separately to avoid interference from inlining and code motion. In real code it’s more likely that you would want to encourage inlining, not prevent it!

[chart: benchmark results]

conclusion

So, AVX-512-BW is very nice indeed for working with strings, especially short strings. On Zen 4 it’s very fast, and the intrinsic functions are reasonably easy to use.

The most notable thing is AVX-512-BW’s smooth performance: there’s very little sign of the performance troughs that the autovectorizer suffers from as it shifts to scalar code for small string fragments.

I don’t have convenient access to an ARM box with SVE support or RISC-V with the vector extension, so I have not investigated them in detail. It’ll be interesting to see how well they work for short strings.

I would like both of these instruction set extensions to be much more widely available. They should improve the performance of string handling tremendously.

The code for this blog post is available from my web site.


Thanks to Bruce Hoult on Lobsters for providing a version of tolower() using RISC-V vector instructions, and measuring it on multiple single-board computers.


Thanks to LelouBil on Hacker News for pointing out a variable was named backwards. Ooops!