riscv64: Implement optimised crc using zbc and zbb extensions #299
The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions, based on the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.
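For readers unfamiliar with carry-less multiplication, the following is a minimal software model of what the Zbc `clmul` and `clmulh` instructions compute on RV64. The function names `clmul64_model` and `clmulh64_model` are hypothetical and are not part of this patch; the real implementation uses the hardware instructions directly. Each set bit of one operand XORs in a shifted copy of the other, so the result is a polynomial product over GF(2), which is the operation the folding approach relies on.

```c
#include <stdint.h>

/* Low 64 bits of the 128-bit carry-less product (models Zbc `clmul`).
 * Partial products are combined with XOR instead of addition. */
static uint64_t clmul64_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1)
            out ^= a << i;
    }
    return out;
}

/* High 64 bits of the 128-bit carry-less product (models Zbc `clmulh`).
 * The loop starts at 1 because bit 0 of b contributes nothing above bit 63. */
static uint64_t clmulh64_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 1; i < 64; i++) {
        if ((b >> i) & 1)
            out ^= a >> (64 - i);
    }
    return out;
}
```

As a quick check, `clmul64_model(3, 3)` gives 5, since (x + 1)^2 = x^2 + 1 over GF(2), whereas ordinary integer multiplication would give 9.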
This patch also requires the target to have the Zbb bit-manipulation extension, which provides a hardware byte-swap instruction (rev8). Byte swaps make up a fair part of the core folding loop for non-reflected CRCs, because the input has to be converted from little-endian memory order before it can be folded.
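To illustrate why the swap is needed, here is a rough C sketch of loading 64 bits of input for a non-reflected CRC on a little-endian target. The helper names `swap_bytes64` and `load_be64` are hypothetical, and the sketch assumes a little-endian host. With GCC or Clang and Zbb enabled, the builtin would be expected to compile to a single `rev8`; the fallback shows the equivalent shift-and-mask sequence.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64-bit value. */
static inline uint64_t swap_bytes64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(x);
#else
    /* Portable fallback: swap bytes, then 16-bit pairs, then 32-bit halves. */
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
#endif
}

/* Load 8 input bytes so that the first byte in memory ends up in the most
 * significant position, as a non-reflected CRC expects.
 * Assumption: the host is little-endian, so a plain load needs a swap. */
static inline uint64_t load_be64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v); /* memcpy avoids unaligned-access issues */
    return swap_bytes64(v);
}
```

Reflected CRCs consume the data in little-endian order already, so they can skip this step entirely.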
On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers, observing around a 20x increase in throughput for reflected algorithms and around 17x for normal (non-reflected) algorithms. The smaller gain for normal algorithms is likely due to the extra endianness swap instructions they need.
This patch doesn't currently have functionality for picking which version to use at runtime, as the CRC implementations for aarch64 and x86_64 do. The approach they use (reading either cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternative routes: either using the Linux RISC-V hwprobe interface, which would require an up-to-date kernel (v6.4+), or detecting support at build time with compiler flags, which works with gcc/clang only and doesn't help with runtime detection. It would be great to get your opinion on which approach would be preferred.
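To make the two options concrete, here is a rough sketch of how they might be combined. This is not part of the patch, and the function name `crc_zbc_zbb_available` is hypothetical. It assumes the `__riscv_zbb` and `__riscv_zbc` macros that gcc/clang define when the extensions are enabled at build time, and the `riscv_hwprobe` syscall with the `RISCV_HWPROBE_EXT_ZBB` and `RISCV_HWPROBE_EXT_ZBC` bits from `<asm/hwprobe.h>`. As far as I can tell, those extension bits were added to the kernel headers later than the syscall itself, so the sketch guards on the macros and falls back to reporting "not available".

```c
#include <stdbool.h>

#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#  if __has_include(<asm/hwprobe.h>)
#    include <asm/hwprobe.h>
#    include <sys/syscall.h>
#    include <unistd.h>
#    define CRC_HAVE_HWPROBE_HEADER 1
#  endif
#endif

/* Hypothetical helper: returns true if both Zbc and Zbb can be used. */
static bool crc_zbc_zbb_available(void)
{
#if defined(__riscv_zbc) && defined(__riscv_zbb)
    /* Build-time route: the compiler was told the target has both. */
    return true;
#elif defined(CRC_HAVE_HWPROBE_HEADER) && defined(__NR_riscv_hwprobe) && \
      defined(RISCV_HWPROBE_EXT_ZBB) && defined(RISCV_HWPROBE_EXT_ZBC)
    /* Runtime route: ask the kernel which extensions all online CPUs share.
     * cpusetsize = 0 and cpus = NULL select every online CPU. */
    struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0, .value = 0 };
    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
        return false; /* e.g. ENOSYS on kernels without hwprobe */
    unsigned long long need = RISCV_HWPROBE_EXT_ZBB | RISCV_HWPROBE_EXT_ZBC;
    return (pair.value & need) == need;
#else
    /* Not RISC-V Linux, or headers too old to know about these bits. */
    return false;
#endif
}
```

A dispatcher could call this once at initialisation and select either the optimised or the base CRC routines, similar in spirit to the existing aarch64 and x86_64 dispatch.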