The Rust compiler validates the extern ABI while parsing the `extern`
keyword, so normal conditional compilation (`#[cfg(...)]`) isn't enough
to hide the ABI from Rust versions that don't recognize it.
I tried hiding the extern ABI behind a macro, but the contents of an
`extern` block don't match the `item` fragment, and I couldn't find any
other working way to pass the function declarations to the macro.
The solution that worked in the end was `include!`. This prevents the
compiler from even trying to parse the `extern` block unless the
nightly-only cargo feature `simd` is enabled.
Before the conversion to SIMD-like vectors, deriving these traits was
not possible: the array had more than 32 elements, and the standard
library implements them only for arrays of up to 32 elements.
After the conversion, the array has only 2 elements, so deriving these
traits is possible and simplifies the code.
For each round, BLAKE2 loads a different set of words from the message,
selected by the SIGMA array. This seems like an obvious place to use a
SIMD gather instruction. To allow for further experimentation, move the
gather of the message words into the SIMD code.
The compiler doesn't seem to be able to convert 8-bit rotating shuffles
of 64-bit elements into the VEXT instruction.
Unfortunately, this code crashes the LLVM compiler used by rustc.
This also simplifies the SIMD code and uses built-in operations for
shifts and shuffles. The compiler does a good job on x86 and x86_64,
and with VREV shuffles on arm.
Unfortunately, the compiler fails to use VEXT for shuffles on arm, and
the inline assembly for it crashes the LLVM compiler, so I removed it in
this commit.
It seems the compiler is not smart enough to avoid the abstraction
penalty in the 32-bit case. Passing the vector directly, instead of
encapsulating it in a struct, made the SIMD Blake2s code very slightly
*faster* than the non-SIMD Blake2s code.
Go all the way and manually vectorize the code using a generic
4-element vector type, adding both a fallback implementation and an
attempt at an SSE2 implementation using LLVM intrinsics.
The experiment was not successful; for some reason, the SIMD code is
much slower. Interestingly, however, the fallback code was much faster
for Blake2s, and only slightly slower for Blake2b. Both implementations
are now around 80% of the speed of the SIMD-optimized reference libb2
code.
Change the order of the round operations to make more parallelism
visible to the compiler's autovectorizer.
The results were mixed; on my 64-bit machine, it made Blake2b 14%
faster, but Blake2s became 3% slower.