riscv64: Implement optimised crc using zbc and zbb extensions #299

daniel-gregory · 2024-08-27T16:10:49Z

The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions, based on the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.
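For readers unfamiliar with carry-less multiplication, here is a minimal C sketch of what the Zbc `clmul` and `clmulh` instructions compute on RV64. The names `clmul_sw` and `clmulh_sw` are hypothetical software stand-ins for illustration only; they are not part of this patch or of isa-l.

```c
#include <stdint.h>

/* Low 64 bits of the carry-less (XOR instead of add) product of a and b,
 * mirroring what the Zbc `clmul` instruction returns on RV64. */
static uint64_t clmul_sw(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1)
            out ^= a << i;
    }
    return out;
}

/* High 64 bits of the same 128-bit carry-less product, mirroring `clmulh`.
 * The loop starts at 1 because bit 0 of b contributes nothing to the high
 * half (and a shift by 64 would be undefined in C). */
static uint64_t clmulh_sw(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 1; i < 64; i++) {
        if ((b >> i) & 1)
            out ^= a >> (64 - i);
    }
    return out;
}
```

Treating the operands as polynomials over GF(2), `clmul_sw(3, 3)` is (x + 1)(x + 1) = x^2 + 1, i.e. 5, because the two middle terms cancel under XOR. This is the polynomial multiplication that CRC folding relies on.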

This patch also requires the target to have the Zbb bit-manipulation extension, which provides a hardware byte-reversal (endianness swap) instruction. That instruction makes up a fair part of the core folding loop for non-reflected CRCs.
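To illustrate why the swap is needed, here is a small C sketch. A non-reflected CRC treats the first message byte as the highest-order coefficients, whereas a native 64-bit load on a little-endian machine puts that byte in the least significant position. `rev8_sw` and `load_msb_first` are hypothetical names, and the load helper assumes a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Software equivalent of Zbb `rev8` on RV64: reverse the byte order
 * of a 64-bit value. */
static uint64_t rev8_sw(uint64_t x)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        out = (out << 8) | (x & 0xff);
        x >>= 8;
    }
    return out;
}

/* Load 8 message bytes so that the first byte in memory lands in the most
 * significant position. ASSUMPTION: the host is little-endian, so the
 * native load has to be byte-reversed. */
static uint64_t load_msb_first(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return rev8_sw(v);
}
```

In a reflected algorithm the native little-endian load already has the layout the fold expects, so no swap is required there.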

On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers. I observed around a 20x increase in throughput for reflected algorithms and 17x for normal algorithms; the gap is likely due to the extra endianness swap instructions that the normal (non-reflected) algorithms need.

| Algorithm | Throughput (MB/s) |
| --- | --- |
| Table (Base) | 206 |
| CRC16_t10dif_copy | 463 |
| CRC16_t10dif | 3855 |
| CRC32_gzip_refl | 4530 |
| CRC32_IEEE | 3855 |
| CRC32_iscsi | 4530 |
| CRC64_norm | 3856 |
| CRC64_refl | 4538 |

This patch doesn't currently have functionality for picking which version to use at runtime, as the CRC implementations for aarch64 and x86_64 do. The approach they use (reading either cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternative routes: either using the Linux riscv hwprobe interface, which would require an up-to-date kernel (v6.4+), or detecting support at build time with compiler flags (gcc/clang only, and it doesn't help detect at runtime). It would be great to get your opinion on which approach would be preferred.
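A rough C sketch of the two options, to make the trade-off concrete. It is not part of this patch. The function names are hypothetical. The `__riscv_zbb` / `__riscv_zbc` macros are the extension test macros that gcc and clang define when the extension is enabled via `-march`. The runtime path assumes the kernel headers provide `<asm/hwprobe.h>` with `RISCV_HWPROBE_EXT_ZBB` and `RISCV_HWPROBE_EXT_ZBC`; whenever any of that is missing, the sketch simply reports "not available".

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#if __has_include(<asm/hwprobe.h>)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif
#endif

/* Option 1: build-time detection. True only if the compiler was told to
 * target both extensions; says nothing about the CPU the binary runs on. */
static bool have_zbb_zbc_buildtime(void)
{
#if defined(__riscv_zbb) && defined(__riscv_zbc)
    return true;
#else
    return false;
#endif
}

/* Option 2: runtime detection via the riscv_hwprobe syscall.
 * ASSUMPTION: headers expose the syscall number and extension bits;
 * otherwise this falls back to false. */
static bool have_zbb_zbc_runtime(void)
{
#if defined(__NR_riscv_hwprobe) && defined(RISCV_HWPROBE_KEY_IMA_EXT_0) && \
    defined(RISCV_HWPROBE_EXT_ZBB) && defined(RISCV_HWPROBE_EXT_ZBC)
    struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0, .value = 0 };
    /* cpu_count = 0 and cpus = NULL ask about all online CPUs. */
    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
        return false;
    unsigned long long want = RISCV_HWPROBE_EXT_ZBB | RISCV_HWPROBE_EXT_ZBC;
    return (pair.value & want) == want;
#else
    return false;
#endif
}

/* Hypothetical dispatcher predicate: use the clmul path if either check passes. */
static bool use_clmul_crc(void)
{
    return have_zbb_zbc_buildtime() || have_zbb_zbc_runtime();
}
```

A dispatcher could call `use_clmul_crc()` once and cache a function pointer to either the base or the Zbc implementation. An older kernel or older headers degrade to the base implementation rather than failing.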

Use the base implementations for every function.

Signed-off-by: Daniel Gregory <[email protected]>

The Zbc extension defines instructions for carryless multiplication that can be used to accelerate the calculation of CRC checksums. This technique is described in Intel's whitepaper, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction".

The Zbb extension defines, among other bit manipulation operations, an instruction for byte-reversing a register (rev8). This is used when doing endianness swaps.

crc_fold_common_clmul.h defines a macro that reduces a double-word aligned buffer to 128 bits by folding four 128-bit chunks in parallel then folding a single 128-bit chunk until less than two remain. This macro can be reused for all the CRC algorithms with some parametrisation controlling:
- where the seed is xor-ed into the first fold
- whether an endianness swap is needed on double-words read in
- whether the algorithm is reflected, which affects whether clmulh gives back the high double word of a result or the low double word

Where the algorithms differ more is in how the final 128-bits is reduced to a 32/64 bit result (which also changes if the algorithm is reflected) and how the buffer is made to be double-word aligned.

32-bit CRCs use a Barrett's reduction to reduce the buffer enough to be double-word aligned and to reduce any excess leftover after folding. As the different CRC32 algorithms isa-l supports differ in whether the seed is inverted and function signature, the alignment, excess and 128-bit reduction are defined as macros in crc32_*_common_clmul.h that the implementations (crc32_*.S) include and surround with algorithm-specific assembly and precomputed constants. This also makes it straightforward to reuse the macros to calculate crc16_t10dif.

64-bit CRCs use a table-based reduction to align the buffer and handle excess. All isa-l's CRC64 algorithms pass arguments in the same order and invert the seed before & after folding, so crc64_*_common_clmul.h both contain a macro for defining a CRC64 function with a particular name. Then each of the crc64_*.S contain a call to that macro along with the precomputed constants and lookup table.

The .h header files added don't contain C code and so are excluded from Clang formatting, similarly to the header files defined for aarch64.

Signed-off-by: Daniel Gregory <[email protected]>
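The folding step described above rests on one identity in GF(2) polynomial arithmetic: a 128-bit chunk `hi:lo` shifted forward by D bits is congruent, modulo the generator, to `hi * (x^(D+64) mod P) XOR lo * (x^D mod P)`. The C sketch below checks that idea for a single fold over D = 128 bits in the non-reflected bit order. It is an illustration, not the patch's assembly: all names are hypothetical, the CRC32 IEEE generator is used only as an example, seed and final-XOR handling are left out, and the constants are computed on the fly rather than precomputed.

```c
#include <stdint.h>

#define POLY 0x104C11DB7ULL /* example generator of degree 32 (CRC32 IEEE) */

/* Full 128-bit carry-less product; software stand-in for clmul + clmulh. */
static void clmul128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    *hi = 0;
    *lo = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            *lo ^= a << i;
            if (i)
                *hi ^= a >> (64 - i);
        }
    }
}

/* Shift one message bit into a running remainder and reduce modulo POLY. */
static uint64_t shift_in(uint64_t rem, unsigned bit)
{
    rem = (rem << 1) | bit;
    if (rem >> 32)
        rem ^= POLY;
    return rem;
}

/* Remainder of the polynomial hi:lo multiplied by x^zeros, bit by bit. */
static uint64_t poly_mod(uint64_t hi, uint64_t lo, unsigned zeros)
{
    uint64_t rem = 0;
    for (int i = 63; i >= 0; i--)
        rem = shift_in(rem, (unsigned)((hi >> i) & 1));
    for (int i = 63; i >= 0; i--)
        rem = shift_in(rem, (unsigned)((lo >> i) & 1));
    while (zeros--)
        rem = shift_in(rem, 0);
    return rem;
}

/* x^n mod POLY: the kind of constant a fold multiplies by. */
static uint64_t xpow_mod(unsigned n)
{
    return poly_mod(0, 1, n);
}

/* Fold one 128-bit chunk forward by 128 bits. The result is congruent to
 * (hi:lo) * x^128 modulo POLY and can be XORed into the next chunk. */
static void fold128(uint64_t hi, uint64_t lo, uint64_t *out_hi, uint64_t *out_lo)
{
    uint64_t h1, l1, h2, l2;
    clmul128(hi, xpow_mod(128 + 64), &h1, &l1);
    clmul128(lo, xpow_mod(128), &h2, &l2);
    *out_hi = h1 ^ h2;
    *out_lo = l1 ^ l2;
}
```

After `fold128`, XORing the result with the following 128-bit chunk gives a single chunk with the same remainder as the original two. Folding four chunks in parallel works the same way, except that each accumulator skips 512 bits per iteration, so the constants correspond to larger powers of x.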

Rather than duplicating all the crc32 4-folding and modifying it to write back to the destination the read-in bytes, write a very simple memcpy that then tail calls crc16_t10dif. This makes the performance of crc16_t10dif_copy much worse than crc16_t10dif, but still about twice as fast as crc16_t10dif_copy_base.

Signed-off-by: Daniel Gregory <[email protected]>
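In C terms, the shape of that copy variant is roughly the following sketch. The `sketch_` function names are hypothetical, and a plain bitwise CRC16 T10-DIF (polynomial 0x8BB7, non-reflected, no final XOR) stands in for the optimised routine. Whether the checksum is taken over the source or the destination buffer is an assumption here; the two hold identical bytes after the copy.

```c
#include <stdint.h>
#include <string.h>

/* Bitwise CRC16 T10-DIF used as a stand-in for the optimised routine. */
static uint16_t sketch_crc16_t10dif(uint16_t crc, const unsigned char *buf, uint64_t len)
{
    for (uint64_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Copy first, then hand off to the CRC routine. The call is in tail
 * position, so a compiler (or hand-written assembly) can turn it into a
 * jump instead of a call. */
static uint16_t sketch_crc16_t10dif_copy(uint16_t crc, unsigned char *dst,
                                         const unsigned char *src, uint64_t len)
{
    memcpy(dst, src, (size_t)len);
    return sketch_crc16_t10dif(crc, src, len);
}
```

The cost of this design is that the data is traversed twice, once for the copy and once for the CRC, which is consistent with the copy variant being much slower than crc16_t10dif.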

pablodelara · 2024-10-03T08:28:33Z

Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!

pablodelara · 2024-11-04T10:17:56Z

> Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!

Any update here, @daniel-gregory?

daniel-gregory added 3 commits August 8, 2024 14:35

build: Add riscv64 support

8a4c891


daniel-gregory force-pushed the riscv-crc branch from aab4a5b to a62dd04 Compare August 30, 2024 09:40
