20th May 2020 - 02:09:50 - 3256 characters

Rust, SIMD, and ISPC

We don't use our hardware effectively

Examples of fast compression, parsing JSON, vectorized emulation of 16 32-bit VMs per AVX-512 thread, ...

SIMD isn't easy but it is endlessly fascinating. There's an odd sense of joy gained from peering high level to low through the lens of the compiler and raw assembly. It also happens to be the perfect knife edge between awe and expletives.

Compilers have been improved, and for obvious cases can give you high performance thanks to autovectorization, but that's not guaranteed. As noted by Lucas Meijer on the dangers of relying on autovectorization by the compiler:

I should be able to say “if this loop for some reason doesn’t vectorize, that should be a compiler error, not a ‘oh code is now just 8x slower but it still produces correct values, no biggy!’”

Task: pseudo dot product

The dot product between two vectors is z[i] = x[i] * y[i], or \(z_i = \sum_i x_i y_i \) if you prefer the mathematical notation.

We'll be performing a pseudo dot product, specifically z[i] += x[i] * y[i], for two reasons: first, the compiler is too smart and will remove all our redundant loops as they're not used, and second, we want to demonstrate the fused multiply add (FMA) SIMD intrinsic.

Benchmark

For the benchmark we'll be performing this tight loop 10 million times.

For the CPU I'm using the Intel Core i7-7700k in my desktop machine. This supports various SIMD instruction sets up to and including AVX-2 instructions.

Pre-requisite compiler magic: RUSTFLAGS="-C target-cpu=native" to say "optimize for the CPU on this machine" and --release to perform speed runs on a build with full optimization.

For analyzing the resulting assembly we'll be using cargo asm and perf

Version 1: Naive Rust and naive C

Naive Rust: 4 loads, 4 muls (implicit load), 4 adds (implicit load), 4 saves

Naive C: 4 loads, 4 loads, 4 fmas (implicit load), 4 saves

Version 2: Rust SIMD intrinsics with and without inline

Inlining allows for re-ordering of instructions as long as those instructions don't interfere with each other. In our case this appears to cause issues, likely stalls, as blocks of up to eight moves (vmovups) are scheduled in one go. By turning off inlining we're able to prevent that and the given instructions (move, move, mul + move, save) are unrolled four times.

An interesting aside: Rust may prevent certain code from being inlined if it contains SIMD intrinsics as those SIMD intrinsics might not be supported for all the hardware the binary is intended for.

Version 3: Rust-ISPC and Rust without bounds checking

ISPC's ASM: 4 x (load, load, fma, save) for 992 / 32 loops

Rust: Almost exactly the same generated assembly - yet it's slower ...