I’m pleased that so many people enjoyed my previous blog post on tolower() with AVX-512. Thanks for all the great comments and discussion!
One aspect that needed more work was examining the performance for small strings. The previous blog post had a graph for strings up to about 1000 bytes long, mainly because it illustrated some curious behaviour – but that distracted me from what I really care about, which is strings up to about 100 bytes long, 10x smaller.
Eventually I managed to get some numbers that I think are plausible.
This article is part of a series on optimizing DNS names:
- tolower() in bulk at speed
- a DNS name compression algorithm
- faster DNS name decompression
- BIND zone transfer performance
- slower DNS name decompression
- tolower() with AVX-512
- tolower() small string performance
question
When processing very small string fragments, what is the cross-over point between scalar code and AVX-512 with masked loads and stores?
method
I am testing the same functions as described in my previous blog post. The `copybytes` functions are different implementations of `memcpy`; the `tolower` functions convert a string to lower-case while copying, in a similar manner to `memcpy`. Most of the functions are autovectorized by Clang, apart from `tolower64` and `copybytes64`, which use AVX-512 intrinsic functions with masked loads and stores. (The `copybytes1` function never gets out of its scalar regime on the graph below.)
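To make that concrete, here is a rough sketch of the two flavours, with made-up names (`tolower1_sketch`, `tolower64_sketch`); the real functions are in bench.c and differ in detail. The first is plain C that Clang can autovectorize; the second uses AVX-512 masked loads and stores, so short strings and tails need no scalar fallback.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Plain C, in the style of tolower1: simple enough for Clang to autovectorize.
static void tolower1_sketch(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        dst[i] = (c >= 'A' && c <= 'Z') ? (uint8_t)(c + 0x20) : c;
    }
}

// AVX-512 with masked loads and stores, in the style of tolower64: the mask
// covers only the live bytes, so short strings and tails are handled without
// a scalar loop. (Needs -mavx512bw.)
static void tolower64_sketch(uint8_t *dst, const uint8_t *src, size_t len) {
    for (size_t i = 0; i < len; i += 64) {
        size_t remain = len - i;
        __mmask64 mask = remain >= 64 ? ~(__mmask64)0
                                      : (((__mmask64)1 << remain) - 1);
        __m512i bytes = _mm512_maskz_loadu_epi8(mask, src + i);
        // Select the bytes in 'A'..'Z' and add 0x20 to just those.
        __mmask64 upper = _mm512_cmpgt_epi8_mask(bytes, _mm512_set1_epi8('A' - 1))
                        & _mm512_cmplt_epi8_mask(bytes, _mm512_set1_epi8('Z' + 1));
        bytes = _mm512_mask_add_epi8(bytes, upper, bytes, _mm512_set1_epi8(0x20));
        _mm512_mask_storeu_epi8(dst + i, mask, bytes);
    }
}
```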
Each function is called 1000 times with varying alignments of the source and destination strings. The graph shows the total time divided by 1000, measured using Linux `perf_event_open(2)`. There are two separate 256 byte buffers, one for the source strings and one for the destination strings, so the code is working entirely from L1 cache.

The functions are compiled separately, so that the compiler can’t over-optimize the benchmark loop. There is an `MFENCE` instruction before each call to the function under test. The baseline of 65 cycles includes the `MFENCE`, the benchmark loop, and the function call overhead.

You can find the source code in the `bench-short.c` and `bench.c` files.
results
The lines of interest are:

- The purple `tolower1` line, which is very simple C code autovectorized by Clang.
- The pink `tolower64` line, which is written using AVX-512 intrinsic functions, and uses masked loads and stores.
- The crossover point for very small strings is 5 bytes: scalar code is faster for strings less than 5 bytes, and AVX-512 is faster for strings longer than 5 bytes.
- Clang used an AVX-256 section in `tolower1` for strings between 32 and 256 bytes long. The `tolower1` line drops below the `tolower64` line for lengths around a multiple of 32, which suggests AVX-256 is faster than AVX-512 for small numbers of chunks.
So it looks like masked loads and stores do pretty well for short strings, but they aren’t completely dominant. There’s still some performance to be gained from code tuned to a few more size classes.
perf_event_open
The graph in my previous blog post was made using timing measurements from `clock_gettime()`. I found that I was getting implausibly tiny numbers for short strings – not impossibly tiny, but there was clearly something hinky going on. So I thought I should try reading the CPU performance counters, which on Linux means `perf_event_open(2)`.

There is a helpful example at the end of the man page which I used as a starting point. I also used Daniel Lemire’s C++ benchmarking harness and the `linux/perf_event.h` header file for reference.
The `perf_event_open(2)` API is fairly painful. You have to write your own syscall wrapper because it’s missing from glibc. It requires you to open a file descriptor for each event you are measuring, such as instructions or clock cycles.
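The wrapper itself is just a thin shim over `syscall(2)`, much like the one in the man page example:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc has no wrapper, so make the raw syscall directly
// (essentially the wrapper shown in the perf_event_open(2) man page).
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```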
You can create a group of events that are measured together, in which case you are mostly working with the group leader file descriptor, and the others are dead weight. (It would make more sense to me to add events to a group using `ioctl()`s on the group fd, instead of several `open()`s, but, shrug.)
Depending on how you configure your group, you get different data layouts when `read()`ing from the event fd.
You can reset a group’s measurement counters, but you can’t reset the total running time counter. I gave up and implemented my own reset logic.
Anyway, after too much faff I had some basic Linux `perf_event_open()` benchmarking functions.
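For the record, the shape of the setup is roughly this (a sketch with error handling omitted; `open_counter` and `measure_example` are made-up names, and the real functions are organised differently): open a group leader for cycles, attach an instructions counter to the same group, then read both back through the leader with `PERF_FORMAT_GROUP`.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Hypothetical helper: open one counter, optionally joining an existing group.
// Uses the perf_event_open() syscall wrapper from above.
static int open_counter(uint32_t type, uint64_t config, int group_fd) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = (group_fd == -1);  // only the group leader starts disabled
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    attr.read_format = PERF_FORMAT_GROUP;
    return (int)perf_event_open(&attr, 0, -1, group_fd, 0);
}

static void measure_example(void) {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
    int instrs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, cycles);
    (void)instrs;  // read back via the group leader below

    ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    /* ... run the code under test ... */
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    // With PERF_FORMAT_GROUP, read()ing the leader returns the number of
    // events followed by one value per event, in the order they were opened.
    struct { uint64_t nr, cycles, instrs; } counts;
    read(cycles, &counts, sizeof(counts));
}
```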
MFENCE
I was still getting implausibly tiny numbers.
For small strings, copies were taking about 2 nanoseconds, or about 10 clock cycles, or about 25 instructions. The instruction count was plausible, but it seemed to be running without touching memory at all, not even L1 cache.
I guess the inner loop was so tiny that the CPU could run multiple iterations concurrently and completely hide any latency? If so, it’s impressive, but it made a nonsense of my measurements, and I wasn’t sure how to solve the problem.
After sleeping on it, I cribbed another idea from Daniel Lemire’s benchmarking harness which includes memory fences around the function under test.
I added an `_mm_mfence()` call to my benchmarking loop and the results made a lot more sense! Most obviously, the `tolower64` line was clearly slower than `copybytes64`.
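The inner loop ends up looking roughly like this (hypothetical names; the real loop in bench-short.c also varies the alignments and reads the perf counters around the whole run):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// The function under test is compiled in a separate file so the
// compiler can't fold it into the benchmark loop.
extern void copy_under_test(uint8_t *dst, const uint8_t *src, size_t len);

extern uint8_t dst_buf[256], src_buf[256];  // separate 256 byte buffers

static void bench_one(size_t len) {
    for (int i = 0; i < 1000; i++) {
        _mm_mfence();  // fence so the CPU can't overlap iterations
                       // and hide the memory latency of each call
        copy_under_test(dst_buf, src_buf, len);
    }
}
```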
conclusion
Benchmarking is hard.
I would love to hear more suggestions on how to get better measurements of these functions in isolation.
The graph above is definitely not a good indicator of the performance of these functions in real code, where they are likely to be inlined and optimized together with the surrounding code, and where the CPU has other things to work on.