In the previous article on auto-vectorization we treated instructions as either SIMD (Single Instruction Multiple Data) or non-SIMD. We also assumed that SIMD meant four values at a time.

That was true for the way we wrote and compiled our code in that article, but we’re going to expand beyond that. There is a progression of SIMD instruction families, with new releases of CPUs from Intel and AMD supporting new instructions we can use to increase the performance of our code.

If our goal is the best possible performance, we need to take advantage of all the SIMD instructions our hardware supports.

In this article we’re going to:

  • Look at the compiler output when targeting the different SIMD instruction set families.
  • Benchmark the different instruction sets.
  • Look at how we can structure our Rust code to support compiling for multiple instruction sets and then selecting at runtime the one to use.

The Different SIMD Instruction Sets on x86 CPUs

The history of SIMD on x86 CPUs starts with the MMX family of instructions on the Pentium in 1997, but we can skip that early stage and go straight to the SSE2 family. The reason this family is so important is that it’s the most recent one guaranteed to be supported by every 64-bit x86 CPU.

SSE2 supports a wide variety of mathematical and logical operations on four floating point values at a time, as well as on packed 8-, 16-, 32-, and 64-bit integer types. All the SIMD from the previous article was limited to SSE2 or earlier instructions, as the compiler’s default settings produce code that is compatible with all 64-bit x86 CPUs.

There have been many instruction families released after SSE2. The ones that followed directly are SSE3, SSSE3, SSE4.1, and SSE4.2. As the naming suggests these are all incremental upgrades. They added new instructions that make it easier to optimise a broader variety of applications using SIMD. But the number of floating point values processed by each instruction remained at four.

When Intel introduced AVX in 2011, it was a major change: it added registers that are 256 bits wide, meaning a single instruction can operate on up to eight floats. AVX2 followed in 2013, and while it didn’t increase the size of the registers, it did add new instructions that make using these wider registers easier.

The newest additions are in the AVX-512 family. As the name suggests, it increases the register size to 512 bits, doubling the number of floats that can be processed by a single instruction to sixteen.

Hardware Surveys

In evaluating which instruction families we want to support, it can be useful to look at data covering a large number of users.

The Steam Hardware Survey is the best such data publicly available.

Instruction Family    Percentage of Steam Users with Computers that Support It (April 2020)
SSE2                  100%
SSE3                  100.00%
SSSE3                 98.52%
SSE4.1                97.80%
SSE4.2                97.05%
AVX                   92.25%
AVX2                  73.88%
AVX512F               0.30%

We can see that because SSE2 first appeared on CPUs in 2000 and is mandatory for 64-bit x86, its support level in 2020 is at 100%. The trend is clear: the newer the instruction family, the lower its support among CPUs in the wild.

Be aware that this survey largely reflects western gamers and might not capture other types of users well.

The Benchmark Function

To test the compiler we’re going to stick with the theme of audio processing but use a task a little more complicated than the one used in the previous article.

When we store digital audio data with multiple channels, such as stereo with two channels or full surround with eight, it can be done in one of two ways: interleaved or de-interleaved. De-interleaved means that each channel has its own slice of memory. Interleaved means the samples for all the channels are adjacent in memory and there is only a single slice.

Interleaved audio is used in audio file formats such as WAV, or when sending audio to output devices. De-interleaved audio is often used when processing audio in memory.

We can get a better understanding by looking at how we would represent 7.1 surround sound in the two layouts using Rust types:

// In all the following code we use these acronyms for our channel names for conciseness
// FL - front left
// FR - front right
// FC - front center, also just called center
// LF - low frequency, also called low frequency effects, or sub-woofer
// SL - surround left
// SR - surround right
// RL - rear left
// RR - rear right

// A single sample. repr(C) keeps the channel order fixed in memory, which the
// hand-written intrinsics later in the article rely on.
#[repr(C)]
pub struct InterleavedSample71 {
    fl: i16,
    fr: i16,
    fc: i16,
    lf: i16,
    sl: i16,
    sr: i16,
    rl: i16,
    rr: i16,
}

pub struct InterleavedBuffer71 {
    data: Vec<InterleavedSample71>,
}

pub struct DeinterleavedBuffer71 {
    num_samples: usize,
    data_fl: Vec<f32>,
    data_fr: Vec<f32>,
    data_fc: Vec<f32>,
    data_lf: Vec<f32>,
    data_sl: Vec<f32>,
    data_sr: Vec<f32>,
    data_rl: Vec<f32>,
    data_rr: Vec<f32>,
}

You might notice that the interleaved audio represents each channel as a signed 16-bit number, while the de-interleaved buffer uses floating point. This is because 16-bit integers are mostly used when storing audio in file formats such as WAV, while floating point is used in memory when applying effects or mixing.

The function we’re going to explore takes a de-interleaved buffer and transforms it into an interleaved buffer. The basic structure is to loop over the samples and, for each one: load, re-arrange, convert from float to 16-bit, and then store.

static POS_FLOAT_TO_16_SCALE: f32 = 0x7fff as f32;

#[inline(always)]
fn pcm_float_to_16(x: f32) -> i16 {
    (x * POS_FLOAT_TO_16_SCALE) as i16
}
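
// A couple of illustrative values (not from the original article):
// pcm_float_to_16(1.0) == 32767, pcm_float_to_16(0.0) == 0, and
// pcm_float_to_16(-0.5) == -16383, since `as i16` truncates toward zero.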

pub fn interleave_71(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    let num_samples = deinterleaved.num_samples;
    let dst = &mut interleaved.data[0..num_samples];
    let src_fl = &deinterleaved.data_fl[0..num_samples];
    let src_fr = &deinterleaved.data_fr[0..num_samples];
    let src_fc = &deinterleaved.data_fc[0..num_samples];
    let src_lf = &deinterleaved.data_lf[0..num_samples];
    let src_sl = &deinterleaved.data_sl[0..num_samples];
    let src_sr = &deinterleaved.data_sr[0..num_samples];
    let src_rl = &deinterleaved.data_rl[0..num_samples];
    let src_rr = &deinterleaved.data_rr[0..num_samples];
    for i in 0..num_samples {
        dst[i].fl = pcm_float_to_16(src_fl[i]);
        dst[i].fr = pcm_float_to_16(src_fr[i]);
        dst[i].fc = pcm_float_to_16(src_fc[i]);
        dst[i].lf = pcm_float_to_16(src_lf[i]);
        dst[i].sl = pcm_float_to_16(src_sl[i]);
        dst[i].sr = pcm_float_to_16(src_sr[i]);
        dst[i].rl = pcm_float_to_16(src_rl[i]);
        dst[i].rr = pcm_float_to_16(src_rr[i]);
    }
}

N.B. For readability I decided to use the indexed version rather than iterators and multiple nested zip calls. izip! from the itertools crate is a more readable option (see the sketch below) but is not available in the compiler explorer. All the options produce equivalent compiler output.
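
For reference, here is a rough sketch of what the izip! variant might look like, assuming the itertools crate is added as a dependency:

use itertools::izip;

pub fn interleave_71_izip(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    let n = deinterleaved.num_samples;
    // izip! expands to nested zips over all the channel slices at once
    for (dst, fl, fr, fc, lf, sl, sr, rl, rr) in izip!(
        &mut interleaved.data[..n],
        &deinterleaved.data_fl[..n],
        &deinterleaved.data_fr[..n],
        &deinterleaved.data_fc[..n],
        &deinterleaved.data_lf[..n],
        &deinterleaved.data_sl[..n],
        &deinterleaved.data_sr[..n],
        &deinterleaved.data_rl[..n],
        &deinterleaved.data_rr[..n],
    ) {
        dst.fl = pcm_float_to_16(*fl);
        dst.fr = pcm_float_to_16(*fr);
        dst.fc = pcm_float_to_16(*fc);
        dst.lf = pcm_float_to_16(*lf);
        dst.sl = pcm_float_to_16(*sl);
        dst.sr = pcm_float_to_16(*sr);
        dst.rl = pcm_float_to_16(*rl);
        dst.rr = pcm_float_to_16(*rr);
    }
}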

Controlling the Compiler

All the assembly output presented in the previous article was produced by the Rust compiler with the default settings for a release build on x86_64. As we said earlier, 64-bit x86 has guaranteed support for SSE2, so that is the highest SIMD instruction family the compiler will use with these settings.

If we use those same default compiler settings and compile this new function, we can see in the assembly output that our lessons from the previous article have carried over and that the compiler has produced SIMD code for the loop.

.LBB0_40:
        movups          xmm0, xmmword ptr [rdx + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm7, xmm0
        movups          xmm0, xmmword ptr [r9 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm9, xmm0
        movups          xmm0, xmmword ptr [r10 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm13, xmm0
        movups          xmm0, xmmword ptr [r11 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm11, xmm0
        movups          xmm0, xmmword ptr [r14 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm15, xmm0
        movups          xmm0, xmmword ptr [r15 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm10, xmm0
        movups          xmm0, xmmword ptr [r12 + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm2, xmm0
        movups          xmm0, xmmword ptr [rbx + 4*rax]
        mulps           xmm0, xmm8
        cvttps2dq       xmm12, xmm0

The signs of SIMD we discussed before are all here. Instructions such as mulps appear, which is short for Multiply Packed Single-precision float, packed being the term for operating on multiple values, here four, at once.

There are other features we now need to notice in this disassembly. If we look at the registers passed as arguments to each instruction, we see they all follow the naming convention xmm<NUMBER>. Registers following this convention are the SSE registers, which are 128 bits wide and can hold up to four floats.

Telling the Compiler to use Newer Instruction Families

Using the Godbolt compiler explorer we can pass additional compiler arguments. In addition to turning on optimization (the -O argument), we can allow the compiler to use newer instruction families.
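
Outside the compiler explorer, the same codegen flags can be passed to a normal Cargo build through the RUSTFLAGS environment variable. A minimal sketch (choose the features appropriate for the machines you intend to run on):

# Allow the compiler to use AVX for the whole build
RUSTFLAGS="-C target-feature=+avx" cargo build --release

# Or allow everything the build machine itself supports
RUSTFLAGS="-C target-cpu=native" cargo build --release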

The following assembly is taken from the start of the loop when using the arguments -O -C target-feature=+avx:

.LBB0_40:
        vmovaps         ymm6, ymmword ptr [rip + .LCPI0_0]
        vmulps          ymm0, ymm6, ymmword ptr [rdx + 4*rax]
        vcvttps2dq      ymm0, ymm0
        vextractf128    xmm1, ymm0, 1
        vpackssdw       xmm11, xmm0, xmm1
        vmulps          ymm0, ymm6, ymmword ptr [r9 + 4*rax]
        vcvttps2dq      ymm0, ymm0
        vextractf128    xmm2, ymm0, 1
        vmulps          ymm3, ymm6, ymmword ptr [r10 + 4*rax]
        vpackssdw       xmm12, xmm0, xmm2
        vcvttps2dq      ymm0, ymm3
        vextractf128    xmm3, ymm0, 1
        vpackssdw       xmm13, xmm0, xmm3
        vmulps          ymm0, ymm6, ymmword ptr [r11 + 4*rax]
        vcvttps2dq      ymm0, ymm0
        vextractf128    xmm4, ymm0, 1
        vpackssdw       xmm14, xmm0, xmm4

Full Output

The first thing to notice is that in addition to the SSE registers we see registers with the naming convention ymm<NUMBER>. As we discussed earlier in the post, AVX added new 256-bit-wide registers capable of storing up to eight floats, and now we see how they appear in our assembly. Unsurprisingly, registers starting with ymm are referred to as AVX registers.

When we see a SIMD instruction such as vmulps, we now need to look at the registers that are being passed to it. If we see xmm registers, we know it’s performing four multiplies. If the register arguments are ymm then it’s performing eight multiplies.

I’m going to skip over the fact that the compiler used vmulps instead of mulps. This is a result of VEX encoding and not relevant to these discussions.

If we change our compiler arguments to -O -C target-feature=+avx2 we can view the resulting output. It might be hard to spot the difference at first; not a lot has changed. AVX2 was an incremental improvement over AVX: it added no new register types, only new instructions. If you scan the output you can see that the compiler has used instructions such as vpblendvb and vinserti128, and consulting the Intel Intrinsics Guide shows these were added with AVX2.

Runtime Selection of SIMD Instruction Set

Compiling our entire program so that it only runs on a subset of available CPUs may not be feasible. But there are alternatives to compiler flags.

Instead, we can use the built-in target_feature attribute and the is_x86_feature_detected! macro.

#[inline(always)]
fn interleave_71_inner(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    // the body of this function is the same as the earlier interleave_71
}

#[target_feature(enable = "avx")]
pub unsafe fn interleave_71_avx(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    interleave_71_inner(deinterleaved, interleaved)
}

#[target_feature(enable = "avx2")]
pub unsafe fn interleave_71_avx2(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    interleave_71_inner(deinterleaved, interleaved)
}

pub fn interleave_71(deinterleaved: &DeinterleavedBuffer71, interleaved: &mut InterleavedBuffer71) {
    if is_x86_feature_detected!("avx2") {
        unsafe { interleave_71_avx2(deinterleaved, interleaved) }
    } else if is_x86_feature_detected!("avx") {
        unsafe { interleave_71_avx(deinterleaved, interleaved) }
    } else {
        interleave_71_inner(deinterleaved, interleaved)
    }
}

To break down what we’ve done to our code:

Firstly we’ve renamed our function interleave_71 to interleave_71_inner, made it private, and forced it to be inlined everywhere it’s used.

Secondly, we added two new functions, interleave_71_avx and interleave_71_avx2. They use the target_feature attribute to tell the compiler that inside these functions it’s allowed to assume the CPU supports the given instruction family. Because these functions inline the contents of interleave_71_inner, we get a version of our code compiled with the allowed instructions. Because calling this code on a CPU that doesn’t support those instructions is undefined behavior, we’ve had to mark the functions as unsafe.

Finally, we have a new public function interleave_71. It uses the is_x86_feature_detected! macro to check the CPU, which tells it whether it’s safe to call the functions compiled for the newer instruction families. If it doesn’t detect that the instructions are available, it falls back to the default compiled version. This lets us present a safe function that can’t trigger undefined behavior. We’ve arranged the checks in descending order, from the newest instruction family to the oldest.

Be aware that this pattern should only be applied to functions that perform enough work to justify the runtime check. Applying it at the level of only a few short instructions may be a net loss in performance. One way to amortize the check is to resolve the choice once and keep a function pointer, as sketched below.
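
As an illustration, the InterleaveFn alias and select_interleave_71 function here are a sketch, not part of the article’s code:

type InterleaveFn = fn(&DeinterleavedBuffer71, &mut InterleavedBuffer71);

// Resolve the best implementation once, e.g. at startup, and reuse the
// returned function pointer on the hot path.
pub fn select_interleave_71() -> InterleaveFn {
    if is_x86_feature_detected!("avx2") {
        return |d, out| unsafe { interleave_71_avx2(d, out) };
    }
    if is_x86_feature_detected!("avx") {
        return |d, out| unsafe { interleave_71_avx(d, out) };
    }
    |d, out| interleave_71_inner(d, out)
}

Callers hold on to the returned function pointer and call it directly, so the feature detection runs once rather than on every buffer.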

I’ve never used it, but there is a proc macro crate available to automate this pattern: multiversion.

Does it Matter (is it Faster?)

We’ve seen that we can either compile our whole application to only target a subset of available CPUs or add extra code to detect support at runtime.

Do the performance gains justify the extra effort? We can set up a benchmark of all the variations on the same CPU for comparison (a sketch of a possible setup follows). I’ve also included a run compiled with -O -C no-vectorize-loops to get an idea of the overall benefit of SIMD.
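
A benchmark along these lines can be set up with the criterion crate. This is only a sketch: the make_test_buffers helper is hypothetical and stands in for allocating a source and destination of 100,000 samples.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_interleave(c: &mut Criterion) {
    // Hypothetical helper that allocates a de-interleaved source and an
    // interleaved destination holding 100,000 samples each.
    let (deinterleaved, mut interleaved) = make_test_buffers(100_000);

    // Only benchmark the AVX2 version if the CPU actually supports it.
    if is_x86_feature_detected!("avx2") {
        c.bench_function("interleave_71_avx2", |b| {
            b.iter(|| unsafe {
                interleave_71_avx2(black_box(&deinterleaved), black_box(&mut interleaved))
            })
        });
    }
}

criterion_group!(benches, bench_interleave);
criterion_main!(benches);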

Version                     Time for 100,000 Samples
vectorize-loops disabled    321 μs
SSE2                        215 μs
AVX                         245 μs
AVX2                        207 μs

Unfortunately, the answer seems to be no. With the current version of Rust (1.43.0), AVX performance is worse than SSE2. AVX2 is a slight win here, but in testing on machines other than my main laptop it too was slower than the SSE2 version.

Comparing the Compiler’s AVX2 Code to Hand-written Intrinsics

To judge the assembly output by the compiler, we’re again going to write a version of the function by hand using the intrinsics in Rust’s core::arch::x86_64 module.

use std::arch::x86_64::*;

#[inline(always)]
pub unsafe fn transpose_and_convert(a: __m256, b: __m256, c: __m256, d: __m256, dst: *mut __m256i) {
    // keep shuffling individual elements inside the vector around to finish interleaving
    let unpack_lo_pd_0 =
        _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
    let unpack_lo_pd_1 =
        _mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(c), _mm256_castps_pd(d)));

    let unpack_hi_pd_0 =
        _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
    let unpack_hi_pd_1 =
        _mm256_castpd_ps(_mm256_unpackhi_pd(_mm256_castps_pd(c), _mm256_castps_pd(d)));

    let permute_0 = _mm256_permute2f128_ps(unpack_lo_pd_0, unpack_hi_pd_0, 0b_0010_0000);
    let permute_1 = _mm256_permute2f128_ps(unpack_lo_pd_0, unpack_hi_pd_0, 0b_0011_0001);
    let permute_2 = _mm256_permute2f128_ps(unpack_lo_pd_1, unpack_hi_pd_1, 0b_0010_0000);
    let permute_3 = _mm256_permute2f128_ps(unpack_lo_pd_1, unpack_hi_pd_1, 0b_0011_0001);

    // convert all our values from f32 [-1.0, 1.0] to i32 [-32767, 32767]
    let scale = _mm256_set1_ps(POS_FLOAT_TO_16_SCALE);
    let i32_0 = _mm256_cvtps_epi32(_mm256_mul_ps(permute_0, scale));
    let i32_1 = _mm256_cvtps_epi32(_mm256_mul_ps(permute_1, scale));
    let i32_2 = _mm256_cvtps_epi32(_mm256_mul_ps(permute_2, scale));
    let i32_3 = _mm256_cvtps_epi32(_mm256_mul_ps(permute_3, scale));

    // convert from i32 to i16
    let i16_0 = _mm256_packs_epi32(i32_0, i32_2);
    let i16_1 = _mm256_packs_epi32(i32_1, i32_3);

    // store to destination memory
    _mm256_storeu_si256(dst.offset(0), i16_0);
    _mm256_storeu_si256(dst.offset(2), i16_1);
}

#[target_feature(enable = "avx2")]
pub unsafe fn interleave_71_manual_avx2(
    deinterleaved: &DeinterleavedBuffer71,
    interleaved: &mut InterleavedBuffer71,
) {
    let num_samples = deinterleaved.num_samples;
    assert_eq!(interleaved.data.len(), num_samples);
    assert!(num_samples % 8 == 0);
    let mut dst_base = interleaved.data.as_mut_ptr() as *mut __m256i;
    let mut src_fl_base = deinterleaved.data_fl.as_ptr();
    let mut src_fr_base = deinterleaved.data_fr.as_ptr();
    let mut src_fc_base = deinterleaved.data_fc.as_ptr();
    let mut src_lf_base = deinterleaved.data_lf.as_ptr();
    let mut src_sl_base = deinterleaved.data_sl.as_ptr();
    let mut src_sr_base = deinterleaved.data_sr.as_ptr();
    let mut src_rl_base = deinterleaved.data_rl.as_ptr();
    let mut src_rr_base = deinterleaved.data_rr.as_ptr();
    for _ in 0..num_samples / 8 {
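        // Each iteration handles eight samples: load eight floats from each of the eight channels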
        let src_fl = _mm256_loadu_ps(src_fl_base);
        let src_fr = _mm256_loadu_ps(src_fr_base);
        let src_fc = _mm256_loadu_ps(src_fc_base);
        let src_lf = _mm256_loadu_ps(src_lf_base);
        let src_sl = _mm256_loadu_ps(src_sl_base);
        let src_sr = _mm256_loadu_ps(src_sr_base);
        let src_rl = _mm256_loadu_ps(src_rl_base);
        let src_rr = _mm256_loadu_ps(src_rr_base);

        transpose_and_convert(
            _mm256_unpacklo_ps(src_fl, src_fr),
            _mm256_unpacklo_ps(src_fc, src_lf),
            _mm256_unpacklo_ps(src_sl, src_sr),
            _mm256_unpacklo_ps(src_rl, src_rr),
            dst_base,
        );
        transpose_and_convert(
            _mm256_unpackhi_ps(src_fl, src_fr),
            _mm256_unpackhi_ps(src_fc, src_lf),
            _mm256_unpackhi_ps(src_sl, src_sr),
            _mm256_unpackhi_ps(src_rl, src_rr),
            dst_base.offset(1),
        );

        src_fl_base = src_fl_base.offset(8);
        src_fr_base = src_fr_base.offset(8);
        src_fc_base = src_fc_base.offset(8);
        src_lf_base = src_lf_base.offset(8);
        src_sl_base = src_sl_base.offset(8);
        src_sr_base = src_sr_base.offset(8);
        src_rl_base = src_rl_base.offset(8);
        src_rr_base = src_rr_base.offset(8);
        dst_base = dst_base.offset(4);
    }
}

The thing to note about this code is that all the intrinsics we use start with _mm256 to indicate they are from the AVX or AVX2 instruction family. We’ve also had to keep the #[target_feature(enable = "avx2")] attribute on our function to ensure the compiler generates the correct code for our intrinsics.

Version                     Time for 100,000 Samples
vectorize-loops disabled    321 μs
SSE2                        215 μs
AVX                         245 μs
AVX2                        207 μs
Hand-written AVX2           104 μs

As we might expect from an instruction set able to process eight values at a time instead of four, the hand-written AVX2 code is twice as fast as the SSE2 code.

We’re not going to fully break down what the compiler has produced when auto-vectorizing for AVX2, but we can do a brief comparison against the hand-written version.

If we look at the hand-written AVX2 version, it’s verbose but straightforward in places.

Remember the basic structure is to loop over the samples and, for each one: load, re-arrange, convert from float to 16-bit, and then store. The intrinsics version does eight samples at a time.

_mm256_loadu_ps and _mm256_storeu_si256 are the load and store. _mm256_mul_ps, _mm256_cvtps_epi32, and _mm256_packs_epi32 are used to convert from floating point values in the range [-1.0, 1.0] to signed 16-bit integers.

The hard part of the code to follow is the use of _mm256_permute2f128_ps and the various unpack intrinsics such as _mm256_unpacklo_ps to interleave the values. If we compare this part of the hand-written version to the compiler-generated assembly, we can see where the compiler has produced sub-optimal code.
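
As a rough sketch of why both are needed, here is the data movement for the FL/FR pair in one iteration (the digits are sample indices 0–7), following the lane-wise behaviour of the unpack instructions:

// The 256-bit unpacks operate within each 128-bit lane:
//   src_fl                        = [FL0 FL1 FL2 FL3 | FL4 FL5 FL6 FL7]
//   src_fr                        = [FR0 FR1 FR2 FR3 | FR4 FR5 FR6 FR7]
//   _mm256_unpacklo_ps(fl, fr)    = [FL0 FR0 FL1 FR1 | FL4 FR4 FL5 FR5]
//   _mm256_unpackhi_ps(fl, fr)    = [FL2 FR2 FL3 FR3 | FL6 FR6 FL7 FR7]
// Because the unpacks never cross the 128-bit lane boundary, the
// _mm256_permute2f128_ps calls are needed afterwards to move whole 128-bit
// lanes into the order the interleaved samples must land in memory.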

Conclusion

The AVX and AVX2 instruction set families are supported by a large percentage of CPUs right now. We can target them at compile time or select them at runtime and get a boost in the performance of our code. However, compilers may still be lacking in their ability to auto-vectorize code for these instruction families. Benchmarking and examination of the compiler output are still required in performance-critical situations.

Sources

All the source code for this article can be found on GitHub.