The Rust compiler validates the extern ABI while parsing the "extern"
keyword, so normal conditional compilation (`#[cfg(...)]`) isn't enough
to hide the ABI from Rust versions which don't know it.
I tried hiding the extern ABI using a macro, but the contents of an
"extern" block aren't a valid `item`, and I couldn't find any other
working way to pass the function declarations to the macro.
The solution which worked in the end was to use `include!`. This
prevents the compiler from even trying to parse the "extern" block
unless the nightly-only cargo feature "simd" is enabled.
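
A minimal sketch of the `include!` approach, to make the mechanism concrete. The module name `backend`, the file name `simd_extern.rs`, and the `add4` function are hypothetical; only the "simd" feature name comes from the text above. When the feature is off, the `cfg`-disabled module is dropped before `include!` is expanded, so the compiler never opens the included file or parses its "extern" block.

```rust
// Only compiled when the nightly-only "simd" cargo feature is enabled.
// The included file (hypothetical name) would hold the "extern" block
// with the ABI that older compilers reject, plus its own `add4`.
#[cfg(feature = "simd")]
mod backend {
    include!("simd_extern.rs");
}

// Used on every other build. Since the module above is stripped first,
// `simd_extern.rs` is never read here and does not even need to exist.
#[cfg(not(feature = "simd"))]
mod backend {
    /// Portable fallback: lane-wise wrapping addition of four u32 values.
    pub fn add4(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
        [
            a[0].wrapping_add(b[0]),
            a[1].wrapping_add(b[1]),
            a[2].wrapping_add(b[2]),
            a[3].wrapping_add(b[3]),
        ]
    }
}
```

Callers use `backend::add4` either way; which definition they get depends only on the feature flag.
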
This commit also simplifies the SIMD code, and uses built-in operations
for shifts and shuffles. The compiler does a good job with these on x86
and x86_64, and with the shuffles that map to VREV on ARM.
Unfortunately, the compiler fails to use VEXT for shuffles on ARM, and
the inline assembly for it crashes LLVM, so I removed it in
this commit.
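
To illustrate what "shifts and shuffles" means here, a rough sketch on plain arrays. It is not the code from this commit, which uses nightly-only SIMD types; the names `U32x4`, `rotate_right` and `rotate_lanes_left_1` are made up for the example. The bit rotation is written with nothing but shifts and OR, and the shuffle is a fixed permutation of the lanes. Rotating the lanes of a vector like this is the kind of shuffle I would expect to map to VEXT on ARM.

```rust
/// Plain-array stand-in for a 4-lane SIMD vector (an assumption for this
/// sketch; the real code uses a nightly-only SIMD type).
pub type U32x4 = [u32; 4];

/// Rotate every lane right by `n` bits using only shifts and OR.
/// Only valid for 0 < n < 32.
pub fn rotate_right(v: U32x4, n: u32) -> U32x4 {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = (v[i] >> n) | (v[i] << (32 - n));
    }
    out
}

/// Shuffle that rotates the lanes by one position:
/// [a, b, c, d] becomes [b, c, d, a].
pub fn rotate_lanes_left_1(v: U32x4) -> U32x4 {
    [v[1], v[2], v[3], v[0]]
}
```

A rotation by half the lane width (16 bits here) just swaps the two halves of each lane, so it can also be expressed as a shuffle of narrower elements.
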
It seems that the compiler is not smart enough to avoid the abstraction
penalty in the 32-bit case. Passing the vector directly instead of
encapsulating it inside a struct makes the SIMD Blake2s code very
slightly *faster* than the non-SIMD Blake2s code.
Go all the way and manually vectorize the code, using a generic
4-element vector type, and add both a fallback implementation and an
attempt at an SSE2 implementation using LLVM intrinsics.
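
A rough sketch of what manual vectorization with a 4-element vector type can look like, covering the fallback path only. The names `Vec4` and `g_columns` are hypothetical and this is not the code from the commit. The rotation amounts 16, 12, 8 and 7 are the BLAKE2s constants from RFC 7693. Each row of the 4x4 state is one vector, so one call mixes all four columns at once.

```rust
/// Hypothetical generic 4-lane vector type backed by a plain array.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Vec4<T>(pub [T; 4]);

/// Fallback implementation for 32-bit lanes: every operation is applied
/// to each lane with ordinary scalar code.
impl Vec4<u32> {
    fn zip(self, rhs: Self, f: impl Fn(u32, u32) -> u32) -> Self {
        Vec4([
            f(self.0[0], rhs.0[0]),
            f(self.0[1], rhs.0[1]),
            f(self.0[2], rhs.0[2]),
            f(self.0[3], rhs.0[3]),
        ])
    }

    pub fn wrapping_add(self, rhs: Self) -> Self {
        self.zip(rhs, u32::wrapping_add)
    }

    pub fn xor(self, rhs: Self) -> Self {
        self.zip(rhs, |x, y| x ^ y)
    }

    pub fn rotate_right(self, n: u32) -> Self {
        Vec4([
            self.0[0].rotate_right(n),
            self.0[1].rotate_right(n),
            self.0[2].rotate_right(n),
            self.0[3].rotate_right(n),
        ])
    }
}

/// BLAKE2s mixing function applied to all four columns at once.
/// `rows` holds the state rows a, b, c, d; `m0` and `m1` hold the two
/// message words for each column.
pub fn g_columns(rows: &mut [Vec4<u32>; 4], m0: Vec4<u32>, m1: Vec4<u32>) {
    let [mut a, mut b, mut c, mut d] = *rows;
    a = a.wrapping_add(b).wrapping_add(m0);
    d = d.xor(a).rotate_right(16);
    c = c.wrapping_add(d);
    b = b.xor(c).rotate_right(12);
    a = a.wrapping_add(b).wrapping_add(m1);
    d = d.xor(a).rotate_right(8);
    c = c.wrapping_add(d);
    b = b.xor(c).rotate_right(7);
    *rows = [a, b, c, d];
}
```

An SSE2 version would keep the same method names and swap the per-lane bodies for intrinsics; that part is left out here.
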
The experiment was not successful; the SIMD code for some reason is much
slower. Interestingly, however, the fallback code was much faster for
Blake2s, and only slightly slower for Blake2b. Now both implementations
are around 80% of the speed of the SIMD-optimized reference libb2 code.