riscv64: Implement optimised crc using zbc and zbb extensions #299

Open
wants to merge 3 commits into master

Conversation

daniel-gregory

The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions, based on the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.

This patch also requires that the target have the Zbb bit-manipulation extension. Zbb provides a hardware endianness-swap instruction, which makes up a fair part of the core folding loop for non-reflected CRCs.
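
For context, the primitives the new assembly leans on are clmul and clmulh from Zbc (the low and high 64 bits of a carry-less 64x64-bit product) and rev8 from Zbb (byte-reverse a register). A minimal C sketch with inline assembly, only to illustrate the instructions rather than mirror the patch (built with something like `-march=rv64gc_zbb_zbc`):

```c
#include <stdint.h>

/* 64x64 -> 128-bit carry-less multiply built from the Zbc instructions. */
static inline void clmul128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l, h;

    __asm__("clmul  %0, %2, %3\n\t"
            "clmulh %1, %2, %3"
            : "=&r"(l), "=r"(h)
            : "r"(a), "r"(b));

    *lo = l;
    *hi = h;
}

/* Zbb byte reversal, used to load big-endian data for non-reflected CRCs. */
static inline uint64_t bswap64(uint64_t x)
{
    uint64_t r;

    __asm__("rev8 %0, %1" : "=r"(r) : "r"(x));
    return r;
}
```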

On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers, observing around a 20x increase in throughput for reflected algorithms and 17x for normal algorithms; the smaller gain for the normal algorithms is likely due to the extra endianness-swap instructions they need.

| Algorithm | Throughput (MB/s) |
|---|---|
| Table (Base) | 206 |
| CRC16_t10dif_copy | 463 |
| CRC16_t10dif | 3855 |
| CRC32_gzip_refl | 4530 |
| CRC32_IEEE | 3855 |
| CRC32_iscsi | 4530 |
| CRC64_norm | 3856 |
| CRC64_refl | 4538 |

This patch doesn't currently have functionality for picking which version to use at runtime, like the CRC implementations for aarch64 and x86_64 do. The approach they use (reading cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternate routes: either the Linux RISC-V hwprobe interface, which requires an up-to-date kernel (v6.4+), or detection at build time via compiler flags (GCC/Clang only, and no help at runtime). It would be great to get your opinion on which approach would be preferred.
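
For concreteness, a rough sketch of what the hwprobe route might look like, assuming kernel headers that expose the riscv_hwprobe syscall and the RISCV_HWPROBE_EXT_ZBB/ZBC flags (Zbc reporting needs an even newer kernel than the interface itself, so treat the names and availability below as assumptions rather than tested code):

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/hwprobe.h> /* struct riscv_hwprobe, RISCV_HWPROBE_* (kernel uapi) */

/* Returns non-zero when the online harts report both Zbb and Zbc. */
static int cpu_has_zbb_zbc(void)
{
#if defined(__NR_riscv_hwprobe) && defined(RISCV_HWPROBE_EXT_ZBB) && \
    defined(RISCV_HWPROBE_EXT_ZBC)
    struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

    /* cpusetsize == 0 and cpus == NULL asks about all online harts. */
    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
        return 0; /* old kernel or failed probe: fall back to base versions */

    return (pair.value & RISCV_HWPROBE_EXT_ZBB) &&
           (pair.value & RISCV_HWPROBE_EXT_ZBC);
#else
    return 0;
#endif
}
```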

Use the base implementations for every function.

Signed-off-by: Daniel Gregory <[email protected]>
The Zbc extension defines instructions for carryless multiplication that
can be used to accelerate the calculation of CRC checksums. This
technique is described in Intel's whitepaper, "Fast CRC Computation for
Generic Polynomials Using PCLMULQDQ Instruction".

The Zbb extension defines, among other bit manipulation operations, an
instruction for byte-reversing a register (rev8). This is used when
doing endianness swaps.

crc_fold_common_clmul.h defines a macro that reduces a double-word
aligned buffer to 128 bits by folding four 128-bit chunks in parallel,
then folding a single 128-bit chunk at a time until fewer than two
remain. This macro can be reused for all the CRC algorithms, with some
parametrisation controlling:

- where the seed is xor-ed into the first fold
- whether an endianness swap is needed on double-words read in
- whether the algorithm is reflected, which affects whether clmulh gives
  back the high double word of a result or the low double word

Where the algorithms differ more is in how the final 128 bits are
reduced to a 32/64-bit result (which also changes if the algorithm is
reflected) and in how the buffer is made double-word aligned.
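
Schematically, the single-chunk fold that the macro performs (and that the 4-way loop performs on four chunks at once) looks like the following C, using the clmul128 helper sketched earlier; the names are mine, and the exact ordering and reflection handling in the assembly differ:

```c
#include <stdint.h>

void clmul128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi); /* Zbc helper above */

/* One fold step: advance the running 128-bit remainder (hi, lo) over the
 * next 128 bits of input. k_hi and k_lo are the precomputed fold constants
 * (x^(n+64) mod P and x^n mod P for a fold distance of n bits); for
 * non-reflected CRCs the data words are byte-swapped with rev8 first. */
static inline void fold_step(uint64_t *hi, uint64_t *lo,
                             uint64_t k_hi, uint64_t k_lo,
                             uint64_t data_hi, uint64_t data_lo)
{
    uint64_t h_lo, h_hi, l_lo, l_hi;

    clmul128(*hi, k_hi, &h_lo, &h_hi); /* 128-bit carry-less products */
    clmul128(*lo, k_lo, &l_lo, &l_hi);

    *lo = h_lo ^ l_lo ^ data_lo;
    *hi = h_hi ^ l_hi ^ data_hi;
}
```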

32-bit CRCs use Barrett reduction both to consume enough of the buffer
to make it double-word aligned and to reduce any excess left over after
folding. As the CRC32 algorithms isa-l supports differ in their function
signatures and in whether the seed is inverted, the alignment, excess
and 128-bit reduction steps are defined as macros in
crc32_*_common_clmul.h, which the implementations (crc32_*.S) include
and surround with algorithm-specific assembly and precomputed constants.
This also makes it straightforward to reuse the macros to calculate
crc16_t10dif.
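
For reference, the whitepaper's Barrett step for the non-reflected 32-bit case has roughly this shape; a schematic sketch only, where mu = floor(x^64 / P) and poly is the full 33-bit polynomial including the x^32 term, and where the reflected variants and the actual constants in the assembly differ:

```c
#include <stdint.h>

void clmul128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi); /* Zbc helper above */

/* Schematic Barrett reduction of a 64-bit polynomial r modulo a degree-32
 * polynomial poly, with mu = floor(x^64 / poly). Non-reflected form only. */
static inline uint32_t barrett_reduce32(uint64_t r, uint64_t mu, uint64_t poly)
{
    uint64_t q_lo, q_hi, t_lo, t_hi;

    clmul128(r >> 32, mu, &q_lo, &q_hi);      /* estimate the quotient */
    clmul128(q_lo >> 32, poly, &t_lo, &t_hi); /* multiply back by the polynomial */

    return (uint32_t)(r ^ t_lo);              /* remainder is the low 32 bits */
}
```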

64-bit CRCs use a table-based reduction to align the buffer and handle
excess. All of isa-l's CRC64 algorithms pass arguments in the same order
and invert the seed before and after folding, so the
crc64_*_common_clmul.h headers each contain a macro for defining a CRC64
function with a particular name. Each of the crc64_*.S files then
invokes that macro alongside the precomputed constants and lookup table.
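
The table-driven part is just the usual byte-at-a-time CRC64; for the non-reflected case it is roughly the following (the reflected case indexes with the low byte and shifts right instead):

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time CRC64 over the unaligned head (and any leftover tail),
 * non-reflected form; `table` is the standard 256-entry lookup table. */
static uint64_t crc64_norm_bytes(uint64_t crc, const uint8_t *buf, size_t len,
                                 const uint64_t table[256])
{
    while (len--)
        crc = table[(uint8_t)(crc >> 56) ^ *buf++] ^ (crc << 8);

    return crc;
}
```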

The .h header files added don't contain C code and so are excluded from
Clang formatting, similarly to the header files defined for aarch64.

Signed-off-by: Daniel Gregory <[email protected]>
Rather than duplicating all the crc32 4-folding and modifying it to
write the bytes it reads back to the destination, write a very simple
memcpy that then tail-calls crc16_t10dif. This makes the performance of
crc16_t10dif_copy much worse than crc16_t10dif, but still about twice
as fast as crc16_t10dif_copy_base.

Signed-off-by: Daniel Gregory <[email protected]>
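
In C terms the copy variant described in this commit is essentially the following (prototypes loosely follow isa-l's crc.h; the real code is a small assembly routine that tail-calls the optimised crc16_t10dif):

```c
#include <stdint.h>
#include <string.h>

/* Assumed to be the optimised implementation added by the previous commit. */
uint16_t crc16_t10dif(uint16_t init_crc, const unsigned char *buf, uint64_t len);

/* Copy the input, then compute the CRC over the freshly written destination. */
uint16_t crc16_t10dif_copy(uint16_t init_crc, uint8_t *dst, uint8_t *src,
                           uint64_t len)
{
    memcpy(dst, src, len);
    return crc16_t10dif(init_crc, dst, len);
}
```
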
@pablodelara
Contributor

Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!

@pablodelara
Contributor

> Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!

Any update here, @daniel-gregory?
