It seems that the compiler is not smart enough to avoid the abstraction
penalty in the 32-bit case. Passing the vector directly instead of
encapsulating it inside a struct made the SIMD Blake2s code very
slightly *faster* than the non-SIMD Blake2s code.
Go all the way and manually vectorize the code, using a generic
4-element vector type, and add both a fallback implementation and an
attempt at an SSE2 implementation using LLVM intrinsics.
The experiment was not successful; for some reason the SIMD code is much
slower. Interestingly, however, the fallback code was much faster for
Blake2s, and only slightly slower for Blake2b. Now both implementations
are around 80% of the speed of the SIMD-optimized reference libb2 code.
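
For illustration, a minimal sketch of what a generic 4-element vector type
with a portable fallback could look like. The name `Vec4`, its tuple layout,
and the `shuffle_left_1` helper are assumptions made for this example, not
the code from this commit; the SSE2 path is omitted entirely.

```rust
use std::ops::{Add, BitXor};

/// Hypothetical 4-lane vector type (the name and layout are assumptions).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Vec4<T>(pub T, pub T, pub T, pub T);

/// Fallback: lane-wise wrapping addition on plain integers.
impl Add for Vec4<u32> {
    type Output = Self;
    #[inline(always)]
    fn add(self, rhs: Self) -> Self {
        Vec4(
            self.0.wrapping_add(rhs.0),
            self.1.wrapping_add(rhs.1),
            self.2.wrapping_add(rhs.2),
            self.3.wrapping_add(rhs.3),
        )
    }
}

/// Fallback: lane-wise XOR.
impl BitXor for Vec4<u32> {
    type Output = Self;
    #[inline(always)]
    fn bitxor(self, rhs: Self) -> Self {
        Vec4(self.0 ^ rhs.0, self.1 ^ rhs.1, self.2 ^ rhs.2, self.3 ^ rhs.3)
    }
}

impl Vec4<u32> {
    /// Rotate every lane right by `n` bits.
    #[inline(always)]
    pub fn rotate_right(self, n: u32) -> Self {
        Vec4(
            self.0.rotate_right(n),
            self.1.rotate_right(n),
            self.2.rotate_right(n),
            self.3.rotate_right(n),
        )
    }

    /// Move every lane one position to the left: (a, b, c, d) -> (b, c, d, a).
    #[inline(always)]
    pub fn shuffle_left_1(self) -> Self {
        Vec4(self.1, self.2, self.3, self.0)
    }
}
```

Because the fallback consists only of simple lane-wise integer operations,
the compiler is free to autovectorize it, which may be part of why the
fallback fared better than the hand-written intrinsics here.
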
Change the order of the round operations to make more parallelism
visible to the compiler's autovectorizer.
The results were mixed; on my 64-bit machine, it made Blake2b 14%
faster, but Blake2s became 3% slower.
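
One plausible reading of such a reordering, sketched for the column step of
a Blake2s round: instead of finishing each G call before starting the next,
each operation is applied to all four columns before moving on. The function
names and the message layout (`m` holding the eight words already selected
for this step) are assumptions for this example; the reordering actually
used in this commit may differ.

```rust
/// The Blake2s G function on four state words, with message words x and y.
fn g(v: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, x: u32, y: u32) {
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
    v[d] = (v[d] ^ v[a]).rotate_right(16);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(12);
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
    v[d] = (v[d] ^ v[a]).rotate_right(8);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(7);
}

/// Column step written as four complete G calls, one after another.
fn column_step_sequential(v: &mut [u32; 16], m: &[u32; 8]) {
    for i in 0..4 {
        g(v, i, 4 + i, 8 + i, 12 + i, m[2 * i], m[2 * i + 1]);
    }
}

/// Same column step, but each operation runs across all four columns
/// before the next one starts. The columns touch disjoint state words,
/// so the result is identical while each loop is a 4-wide operation.
fn column_step_interleaved(v: &mut [u32; 16], m: &[u32; 8]) {
    for i in 0..4 {
        v[i] = v[i].wrapping_add(v[4 + i]).wrapping_add(m[2 * i]);
    }
    for i in 0..4 {
        v[12 + i] = (v[12 + i] ^ v[i]).rotate_right(16);
    }
    for i in 0..4 {
        v[8 + i] = v[8 + i].wrapping_add(v[12 + i]);
    }
    for i in 0..4 {
        v[4 + i] = (v[4 + i] ^ v[8 + i]).rotate_right(12);
    }
    for i in 0..4 {
        v[i] = v[i].wrapping_add(v[4 + i]).wrapping_add(m[2 * i + 1]);
    }
    for i in 0..4 {
        v[12 + i] = (v[12 + i] ^ v[i]).rotate_right(8);
    }
    for i in 0..4 {
        v[8 + i] = v[8 + i].wrapping_add(v[12 + i]);
    }
    for i in 0..4 {
        v[4 + i] = (v[4 + i] ^ v[8 + i]).rotate_right(7);
    }
}
```

Both functions compute the same state; only the order in which independent
operations appear differs. Whether the interleaved form is faster depends on
the target and on what the autovectorizer makes of it, which is consistent
with the mixed results above.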