ARMv7 optimization #213
I would like to speed up signature verification using the ARMv7 instruction set (ARM4 or ARM33 device). Verification takes about 3 seconds with the BLST_PORTABLE build. Could you hazard a guess at how much speedup I could get with an ARMv7 asm implementation?

By comparison, ECDSA signature verification takes about 75 msec with the asm from https://github.com/kmackay/micro-ecc, using about 40K of static lookup tables. Am I correct that there is less of a role for lookup tables with BLS?
Pairings (BLS signatures) are significantly slower than scalar multiplications (ECDSA). If you look at micro-ecc, it uses the ARM DSP extension's UMAAL instruction: https://github.com/kmackay/micro-ecc/blob/master/asm_arm_mult_square_umaal.inc

This paper, https://eprint.iacr.org/2018/700.pdf, has some timings for generic Montgomery-based field multiplication (like BLST's) with NEON (it also covers UMAAL), and the ratio is 6.5x.

If you want to speed up signature verification, maybe explore batching: if you can verify 20+ signatures at once, the total verification time should be lower than that of 20+ ECDSA signatures. A sketch of what that could look like follows below.
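Here is a rough sketch of batched verification with blst's pairing accumulator. The names are taken from blst.h as I remember them (blst_pairing_sizeof, blst_pairing_init, blst_pairing_aggregate_pk_in_g1, blst_pairing_commit, blst_pairing_finalverify), so treat it as an outline to check against your copy of the header, not a definitive recipe:

```c
#include <stdbool.h>
#include <stdlib.h>
#include "blst.h"

/* Verify n (public key, signature, message) triples with one pairing
 * context: one Miller-loop accumulation per triple, but only a single
 * final exponentiation for the whole batch. */
bool verify_batch(const blst_p1_affine *pks[], const blst_p2_affine *sigs[],
                  const byte *msgs[], const size_t msg_lens[], size_t n,
                  const byte *dst, size_t dst_len)
{
    blst_pairing *ctx = malloc(blst_pairing_sizeof());
    blst_pairing_init(ctx, true /* hash-to-curve */, dst, dst_len);

    for (size_t i = 0; i < n; i++)
        if (blst_pairing_aggregate_pk_in_g1(ctx, pks[i], sigs[i],
                                            msgs[i], msg_lens[i],
                                            NULL, 0) != BLST_SUCCESS) {
            free(ctx);
            return false;
        }

    blst_pairing_commit(ctx);                      /* finish Miller loops */
    bool ok = blst_pairing_finalverify(ctx, NULL); /* one final exp. */
    free(ctx);
    return ok;
}
```

One caveat: when the signatures come from independent signers, each term is normally weighted by a random scalar so invalid signatures can't cancel each other out; blst has mul_n_aggregate-style entry points for that, if memory serves.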
Since there is no such thing as an "ARM4 or ARM33 device," the question is more likely about Cortex-M4/M33 micro-controllers, which don't have a NEON unit. And since UMAAL is optional on the M33, it's safer to assume it's not an option either. This is not to say that assembly won't give a meaningful improvement, but it won't be as impressive as suggested.
One can wonder why one would want to task a micro-controller with an operation as heavy as BLS signature verification. But either way, as pointed out by @mratsim, there is no way to get BLS anywhere near ECDSA in terms of wall-clock performance. Even signature generation is computationally heavier, let alone verification.
Right, Cortex-M4/M33 is what I meant, and I can require UMAAL if it delivers a significant gain. The application is delivering a consensus proof to a hardware wallet. We will probably use a Schnorr threshold signature scheme like FROST, but I would like to avoid extra steps that may fail. 3 seconds is too slow; 0.5 seconds might be viable. Signing and aggregation speed are not a concern.
Depends on the definition of "significant," but if taking it from 3s to 0.5s is the goal, UMAAL won't get you there, no.
BTW, how exactly did you compile it? Do you use, say, the Rust bindings, or make raw C calls?
I just added server.c to a Keil project. My (out-of-date) Keil compiler threw some warnings about the inline functions calling static functions in no_asm.h, and then crashed. I got it to build by removing the static function declarations in no_asm.h. This was just to see if I could get in the ballpark; I expected not, given the RFC showing BLS 30x slower than ECDSA. But the blst lib was such a smooth build and test on macOS that I decided it was worth checking out on one of my devices.
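For what it's worth, here is a hypothetical reduction of the pattern the Keil compiler likely objected to. In C99 (6.7.4p3), an inline definition of a function with external linkage shall not reference an identifier with internal linkage, which is exactly what a plain-inline caller of a file-static helper does. The function names below are made up for illustration, not the actual ones in no_asm.h:

```c
/* internal linkage */
static void mul_mont_384(void) { /* ... */ }

/* external linkage: referencing the static helper above violates a
 * C99 constraint (6.7.4p3); some compilers warn, some reject. */
inline void point_double(void)
{
    mul_mont_384();
}

/* Declaring the caller 'static inline' instead gives it internal
 * linkage too, and the constraint no longer applies:
 *
 *     static inline void point_double(void) { mul_mont_384(); }
 */
```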
The question is what the optimization level was. A possible pitfall is that assembler-assisted ECDSA won't be that dependent on the optimization level, while pure C is very dependent on it. Secondly, if your target executable format is ELF, you should be able to compile libblst.a with a newer gcc and link your project with it. The thing is that blst has no explicit dependencies on the language run-time, which should give you the freedom to mix compilers. It would be a good idea to add
Hmm, gcc leaves references to memset (without us calling it) and __aeabi_uidivmod.
Oh! And of course you'd need to throw in
BTW, clang even issues umaal instructions, so assembly won't give you an edge in terms of hardware-capability utilization. This is still not to say that it won't give a meaningful improvement, but that improvement won't be the direct result of utilizing "magic" instructions.
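To illustrate: UMAAL computes RdHi:RdLo = Rn*Rm + RdLo + RdHi, a 32x32-bit multiply with two 32-bit additions that provably cannot overflow 64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. Recent clang targeting ARMv7 tends to recognize the corresponding C idiom in a schoolbook-multiplication inner loop; a sketch of that idiom (worth confirming in the disassembly for your particular compiler version):

```c
#include <stddef.h>
#include <stdint.h>

/* One row of schoolbook multiplication: acc[0..n-1] += a[0..n-1] * b.
 * The widening multiply plus two additions fits in 64 bits, which is
 * the shape clang can lower to one umaal per iteration. */
static uint32_t mul_add_row(uint32_t acc[], const uint32_t a[],
                            uint32_t b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * b + acc[i] + carry;
        acc[i] = (uint32_t)t;          /* low word stays in the row   */
        carry  = (uint32_t)(t >> 32);  /* high word carries rightward */
    }
    return carry;
}
```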
Thanks! I will try this and see if there is any performance difference. This was with -O3 optimization.
Easier said than done: Keil does not like the library format, so I will have to dig around on what is required there. I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility of taking advantage of that? It's not the batch scenario, because the signatures arrive over time, but there would be room to cache some pre-computes.
Giving it server.o might work better...
You make it sound as if the lookup table makes ECDSA 6 times faster. It doesn't. It does yield a meaningful improvement, but it's not "magical." BLS is just significantly computationally heavier than ECDSA. But to the point: while there is a thing that can be pre-computed in BLS, the gain is marginal.
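For reference, the pre-computable thing is the set of Miller-loop line functions for a fixed G2 input; blst exposes this, if I remember the header correctly, via blst_precompute_lines and blst_miller_loop_lines. Note this only applies when the repeated element lives in G2 (e.g. the public key in the minimal-signature-size variant), and it only shaves part of one Miller loop; the other pairing and the final exponentiation remain, hence the marginal gain. A sketch under those assumptions:

```c
#include "blst.h"

/* One-time cache of the line functions for a fixed G2 point. */
static blst_fp6 cached_lines[68];

void cache_fixed_g2(const blst_p2_affine *Q)
{
    blst_precompute_lines(cached_lines, Q);      /* once per key */
}

/* Miller loop against the cached point; the final exponentiation and
 * the other pairing still have to be paid for in full. */
void miller_loop_with_cache(blst_fp12 *ret, const blst_p1_affine *P)
{
    blst_miller_loop_lines(ret, cached_lines, P);
}
```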
For an embedded system the correct answer is likely an integrated co-processor. One wishes OpenTitan were an accessible option, but that doesn't seem to be the case. Otherwise there is, for example, the Kinetis K8x with usable capabilities in the Cortex-M4 space...
The Kinetis K8x looks almost identical to what we are currently running on, an Infineon PSoC 62. We are currently running the 150 MHz M4 at 48 MHz to save power, so there's a 3x gain available there. It also has "12 programmable logic blocks, each with 8 Macrocells and an 8-bit data path (called universal digital blocks or UDBs)". Not sure if those can be of any use.

The kmackay ECDSA with uECC_OPTIMIZATION_LEVEL 0 takes 10 sec to verify 1 signature. The big leap comes at level 1, which drops to 170 msec with no change in image size. This drops to 75 msec at level 3, with a 40K jump in image size. So I guess the precomputes deliver about 2x; the big win was something else. Probably level 0 is just a very simplistic reference implementation.

Using server.o revealed the need to add -fshort-enums to the build, for compatibility with our existing compile options. But then I got a bunch of link errors about relocating objects, mostly related to stdio. This is mysterious; I will investigate further some time.
As already mentioned, server.c has
No doubt the linker was confused; removing -fPIC avoided that. But there's no significant performance difference: clang -O2 was about 50 msec slower than the Keil -O3 build, and clang -O3 was about 50 msec faster, based on a single sample. Thanks for your help.
To summarize: it's the ability to keep whole vectors in the register bank that translates into performance "leaps." Most notably, one can barely accommodate a 256-bit vector in ARM32 registers(*), but not a 384-bit one. BLS operates on 384-bit inputs, which is basically why one can't expect impressive times-N improvements here no matter what you do. This is still not to say that assembly won't provide a meaningful improvement, but it won't take you anywhere near the stated objective.

As for the Kinetis K8x: the K82 is the only one with relevant capabilities, more specifically a modular arithmetic unit. However, I don't have hands-on experience with it, so don't take the mention as "this one will actually solve the problem."

(*) Which is what the referenced micro-ecc does at a high enough uECC_OPTIMIZATION_LEVEL.
Just in case: one has to recognize that a co-processor is still power-constrained, which in practice means it won't run as fast as one wishes. It surely can deliver significant improvements on "large" operations such as modular exponentiation and ECC arithmetic, but the most computationally intensive part of BLS signature verification would have to rely on individual field operations: multiplications, additions, etc. And it's not impossible to imagine that the overhead of frequent data transfers to/from the co-processor could add up and diminish the wow factor.