Also simplifies the SIMD code, and uses builtin operations for shifts
and shuffles. On x86 and x86_64, and for VREV shuffles on ARM, the
compiler does a good job.
Unfortunately, the compiler fails to use VEXT for shuffles on ARM, and
the inline assembly for it crashes the LLVM compiler, so I removed it in
this commit.
Go all the way and manually vectorize the code, using a generic
4-element vector type, and add both a fallback implementation and an
attempt at an SSE2 implementation using LLVM intrinsics.
The experiment was not successful: the SIMD code is, for some reason,
much slower. Interestingly, however, the fallback code was much faster
for Blake2s, and only slightly slower for Blake2b. Both implementations
now run at around 80% of the speed of the SIMD-optimized reference
libb2 code.