I’m pleased that so many people enjoyed my previous blog post on tolower() with AVX-512. Thanks for all the great comments and discussion!
One aspect that needed more work was examining the performance for small strings. The previous blog post had a graph for strings up to about 1000 bytes long, mainly because it illustrated some curious behaviour – but that distracted me from what I really care about, which is strings up to about 100 bytes long, 10x smaller.
Eventually I managed to get some numbers that I think are plausible.
This article is part of a series on optimizing DNS names:
- tolower() in bulk at speed
- a DNS name compression algorithm
- faster DNS name decompression
- BIND zone transfer performance
- slower DNS name decompression
- tolower() with AVX-512
- tolower() small string performance
question
When processing very small string fragments, what is the cross-over point between scalar code and AVX-512 with masked loads and stores?
method
I am testing the same functions as described in my previous blog post. The `copybytes` functions are different implementations of `memcpy`; the `tolower` functions convert a string to lower-case while copying, in a similar manner to `memcpy`. Most of the functions are autovectorized by Clang, apart from `tolower64` and `copybytes64`, which use AVX-512 intrinsic functions with masked loads and stores. (The `copybytes1` function never gets out of its scalar regime on the graph below.)
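To make that concrete, here is a rough sketch of the two flavours, with made-up names (`tolower1_sketch`, `tolower64_sketch`); the real functions are in bench.c and differ in detail. The first is plain C that Clang can autovectorize; the second uses AVX-512 masked loads and stores, so short strings and tails need no scalar fallback.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Plain C, in the style of tolower1: simple enough for Clang to autovectorize.
static void tolower1_sketch(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        dst[i] = (c >= 'A' && c <= 'Z') ? (uint8_t)(c + 0x20) : c;
    }
}

// AVX-512 with masked loads and stores, in the style of tolower64: the mask
// covers only the live bytes, so short strings and tails are handled without
// a scalar loop. (Needs -mavx512bw.)
static void tolower64_sketch(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i += 64) {
        size_t remain = len - i;
        __mmask64 mask = remain >= 64 ? ~(__mmask64)0
                                      : (((__mmask64)1 << remain) - 1);
        __m512i bytes = _mm512_maskz_loadu_epi8(mask, src + i);
        // Select the bytes in 'A'..'Z' and add 0x20 to just those.
        __mmask64 upper = _mm512_cmpgt_epi8_mask(bytes, _mm512_set1_epi8('A' - 1))
                        & _mm512_cmplt_epi8_mask(bytes, _mm512_set1_epi8('Z' + 1));
        bytes = _mm512_mask_add_epi8(bytes, upper, bytes, _mm512_set1_epi8(0x20));
        _mm512_mask_storeu_epi8(dst + i, mask, bytes);
    }
}
```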
Each function is called 1000 times with varying alignments of the source and destination strings. The graph shows the total time divided by 1000, measured using Linux `perf_event_open(2)`. There are two separate 256 byte buffers, one for the source strings and one for the destination strings, so the code is working entirely from L1 cache.

The functions are compiled separately, so that the compiler can’t over-optimize the benchmark loop. There is an `MFENCE` instruction before each call to the function under test. The baseline of 65 cycles includes the `MFENCE`, the benchmark loop, and the function call overhead.

You can find the source code in the `bench-short.c` and `bench.c` files.
results
The lines of interest are:

- The purple `tolower1` line, which is very simple C code autovectorized by Clang.
- The pink `tolower64` line, which is written using AVX-512 intrinsic functions, and uses masked loads and stores.
- The crossover point for very small strings is 5 bytes: scalar code is faster for strings less than 5 bytes, and AVX-512 is faster for strings longer than 5 bytes.
- Clang used an AVX-256 section in `tolower1` for strings between 32 and 256 bytes long. The `tolower1` line drops below the `tolower64` line for lengths around a multiple of 32, which suggests AVX-256 is faster than AVX-512 for small numbers of chunks.
So it looks like masked loads and stores do pretty well for short strings, but they aren’t completely dominant. There’s still some performance to be gained from code tuned to a few more size classes.
perf_event_open
The graph in my previous blog post was made using timing measurements from `clock_gettime()`. I found that I was getting implausibly tiny numbers for short strings – not impossibly tiny, but there was clearly something hinky going on. So I thought I should try reading the CPU performance counters, which on Linux means `perf_event_open(2)`.

There is a helpful example at the end of the man page which I used as a starting point. I also used Daniel Lemire’s C++ benchmarking harness and the `linux/perf_event.h` header file for reference.
The `perf_event_open(2)` API is fairly painful. You have to write your own syscall wrapper because it’s missing from glibc. It requires you to open a file descriptor for each event you are measuring, such as instructions or clock cycles.
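The wrapper itself is just a thin shim over `syscall(2)`, much like the one in the man page example:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc has no wrapper, so make the raw syscall directly
// (essentially the wrapper shown in the perf_event_open(2) man page).
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```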
You can create a group of events that are measured together, in which case you are mostly working with the group leader file descriptor, and the others are dead weight. (It would make more sense to me to add events to a group using `ioctl()`s on the group fd, instead of several `open()`s, but, shrug.)
Depending on how you configure your group, you get different data layouts when `read()`ing from the event fd.
You can reset a group’s measurement counters, but you can’t reset the total running time counter. I gave up and implemented my own reset logic.
Anyway, after too much faff I had some basic Linux `perf_event_open()` benchmarking functions.
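For the record, the shape of the setup is roughly this (a sketch with error handling omitted; `open_counter` and `measure_example` are made-up names, and the real functions are organised differently): open a group leader for cycles, attach an instructions counter to the same group, then read both back through the leader with `PERF_FORMAT_GROUP`.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Hypothetical helper: open one counter, optionally joining an existing group.
// Uses the perf_event_open() syscall wrapper from above.
static int open_counter(uint32_t type, uint64_t config, int group_fd) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = (group_fd == -1);  // only the group leader starts disabled
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    attr.read_format = PERF_FORMAT_GROUP;
    return (int)perf_event_open(&attr, 0, -1, group_fd, 0);
}

static void measure_example(void) {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
    int instrs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, cycles);
    (void)instrs;  // read back via the group leader below

    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    /* ... run the code under test ... */
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    // With PERF_FORMAT_GROUP, read()ing the leader returns the number of
    // events followed by one value per event, in the order they were opened.
    struct { uint64_t nr, cycles, instrs; } counts;
    read(cycles, &counts, sizeof(counts));
}
```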
MFENCE
I was still getting implausibly tiny numbers.
For small strings, copies were taking about 2 nanoseconds, or about 10 clock cycles, or about 25 instructions. The instruction count was plausible, but it seemed to be running without touching memory at all, not even L1 cache.
I guess the inner loop was so tiny that the CPU could run multiple iterations concurrently and completely hide any latency? If so, it’s impressive, but it made a nonsense of my measurements, and I wasn’t sure how to solve the problem.
After sleeping on it, I cribbed another idea from Daniel Lemire’s benchmarking harness which includes memory fences around the function under test.
I added an `_mm_mfence()` call to my benchmarking loop and the results made a lot more sense! Most obviously, the `tolower64` line was clearly slower than `copybytes64`.
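The inner loop ends up looking roughly like this (hypothetical names; the real loop in bench-short.c also varies the alignments and reads the perf counters around the whole run):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// The function under test is compiled in a separate file so the
// compiler can't fold it into the benchmark loop.
extern void copy_under_test(uint8_t *dst, const uint8_t *src, size_t len);

extern uint8_t dst_buf[256], src_buf[256];  // separate 256 byte buffers

static void bench_one(size_t len) {
    for (int i = 0; i < 1000; i++) {
        _mm_mfence();  // fence so the CPU can't overlap iterations
                       // and hide the memory latency of each call
        copy_under_test(dst_buf, src_buf, len);
    }
}
```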
conclusion
Benchmarking is hard.
I would love to hear more suggestions on how to get better measurements of these functions in isolation.
The graph above is definitely not a good indicator of the performance of these functions in real code, where they are likely to be inlined and optimized together with the surrounding code, and where the CPU has other things to work on.