ARMv7 optimization #213
I would like to speed up signature verification using the ARMv7 instruction set (ARM4 or ARM33 device). Verification takes about 3 seconds with the BLST_PORTABLE build. Could you hazard a guess at how much speedup I could get with an ARMv7 asm implementation?

By comparison, ECDSA signature verification takes about 75 msec with the asm from https://github.com/kmackay/micro-ecc, using about 40K of static lookup tables. Am I correct that there is less of a role for lookup tables with BLS?
Pairings (BLS signatures) are significantly slower than scalar multiplications (ECDSA). If you look at micro-ecc, it uses the ARM DSP extension's UMAAL instruction: https://github.com/kmackay/micro-ecc/blob/master/asm_arm_mult_square_umaal.inc

This paper, https://eprint.iacr.org/2018/700.pdf, has some timings for generic Montgomery-based field multiplication (like BLST's) with NEON (it also covers UMAAL), and the ratio is 6.5x.

If you want to speed up signature verification, maybe explore batching: if you can verify 20+ signatures at once, the total verification time should be lower than that of 20+ ECDSA signatures. A sketch of what that could look like follows below.
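Here is a rough sketch of batched verification with blst's pairing accumulator. The names are taken from blst.h as I remember them (blst_pairing_sizeof, blst_pairing_init, blst_pairing_aggregate_pk_in_g1, blst_pairing_commit, blst_pairing_finalverify), so treat it as an outline to check against your copy of the header, not a definitive recipe:

```c
#include <stdbool.h>
#include <stdlib.h>
#include "blst.h"

/* Verify n (public key, signature, message) triples with one pairing
 * context: one Miller-loop accumulation per triple, but only a single
 * final exponentiation for the whole batch. */
bool verify_batch(const blst_p1_affine *pks[], const blst_p2_affine *sigs[],
                  const byte *msgs[], const size_t msg_lens[], size_t n,
                  const byte *dst, size_t dst_len)
{
    blst_pairing *ctx = malloc(blst_pairing_sizeof());
    blst_pairing_init(ctx, true /* hash-to-curve */, dst, dst_len);

    for (size_t i = 0; i < n; i++)
        if (blst_pairing_aggregate_pk_in_g1(ctx, pks[i], sigs[i],
                                            msgs[i], msg_lens[i],
                                            NULL, 0) != BLST_SUCCESS) {
            free(ctx);
            return false;
        }

    blst_pairing_commit(ctx);                      /* finish Miller loops */
    bool ok = blst_pairing_finalverify(ctx, NULL); /* one final exp. */
    free(ctx);
    return ok;
}
```

One caveat: when the signatures come from independent signers, each term is normally weighted by a random scalar so invalid signatures can't cancel each other out; blst has mul_n_aggregate-style entry points for that, if memory serves.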
Since there is no such thing as an "ARM4 or ARM33 device," the question is more likely about Cortex-M4/M33 micro-controllers, which don't have a NEON unit. And since UMAAL is optional on the M33, it's safer to assume it's not an option either. This is not to say that assembly won't give a meaningful improvement, but it won't be as impressive as suggested.
One can wonder why one would want to task a micro-controller with an operation as heavy as BLS signature verification. But either way, as pointed out by @mratsim, there is no way to get BLS anywhere near ECDSA in terms of wall-clock performance. Even signature generation is computationally heavier, let alone verification.
Right, Cortex-M4/M33 is what I meant, and I can require UMAAL if it delivers a significant gain. The application is delivering a consensus proof to a hardware wallet. We will probably use a Schnorr threshold signature scheme like FROST, but I would like to avoid extra steps that may fail. 3 seconds is too slow; 0.5 seconds might be viable. Signing and aggregation speed are not a concern.
Depends on the definition of "significant," but if taking it from 3s to 0.5s is the goal, UMAAL won't get you there, no.
BTW, how exactly did you compile it? Do you use, say, the Rust bindings, or make raw C calls?
I just added server.c to a Keil project. My (out-of-date) Keil compiler threw some warnings about the inline functions calling static functions in no_asm.h, and then crashed. I got it to build by removing the static function declarations in no_asm.h. This was just to see if I could get in the ballpark; I expected not, given the RFC showing BLS 30x slower than ECDSA. But the blst lib was such a smooth build and test on macOS that I decided it was worth checking out on one of my devices.
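For what it's worth, here is a hypothetical reduction of the pattern the Keil compiler likely objected to. In C99 (6.7.4p3), an inline definition of a function with external linkage shall not reference an identifier with internal linkage, which is exactly what a plain-inline caller of a file-static helper does. The function names below are made up for illustration, not the actual ones in no_asm.h:

```c
/* internal linkage */
static void mul_mont_384(void) { /* ... */ }

/* external linkage: referencing the static helper above violates a
 * C99 constraint (6.7.4p3); some compilers warn, some reject. */
inline void point_double(void)
{
    mul_mont_384();
}

/* Declaring the caller 'static inline' instead gives it internal
 * linkage too, and the constraint no longer applies:
 *
 *     static inline void point_double(void) { mul_mont_384(); }
 */
```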
The question is what the optimization level was. A possible pitfall is that assembler-assisted ECDSA won't be that dependent on the optimization level, while pure C is very dependent on it. Secondly, if your target executable format is ELF, you should be able to compile libblst.a with a newer gcc and link your project with it. The thing is that blst has no explicit dependencies on the language run-time, which should give you the freedom to mix compilers. It would be a good idea to add
Hmm, gcc leaves references to memset (without us calling it) and __aeabi_uidivmod.
Oh! And of course you'd need to throw in
BTW, clang even issues umaal instructions, so assembly won't give you an edge in terms of hardware-capability utilization. This is still not to say that it won't give a meaningful improvement, but that improvement won't be the direct result of utilizing "magic" instructions.
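To illustrate: UMAAL computes RdHi:RdLo = Rn*Rm + RdLo + RdHi, a 32x32-bit multiply with two 32-bit additions that provably cannot overflow 64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. Recent clang targeting ARMv7 tends to recognize the corresponding C idiom in a schoolbook-multiplication inner loop; a sketch of that idiom (worth confirming in the disassembly for your particular compiler version):

```c
#include <stddef.h>
#include <stdint.h>

/* One row of schoolbook multiplication: acc[0..n-1] += a[0..n-1] * b.
 * The widening multiply plus two additions fits in 64 bits, which is
 * the shape clang can lower to one umaal per iteration. */
static uint32_t mul_add_row(uint32_t acc[], const uint32_t a[],
                            uint32_t b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * b + acc[i] + carry;
        acc[i] = (uint32_t)t;          /* low word stays in the row   */
        carry  = (uint32_t)(t >> 32);  /* high word carries rightward */
    }
    return carry;
}
```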
Thanks! I will try this and see if there is any performance difference. This was with -O3 optimization.
Easier said than done: Keil does not like the library format, so I will have to dig around on what is required there. I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility of taking advantage of that? It's not the batch scenario, because the signatures arrive over time, but there would be room to cache some pre-computes.
Giving it server.o might work better...
You make it sound as if the lookup table makes ECDSA 6 times faster. It doesn't. It does yield a meaningful improvement, but it's not "magical." BLS is just significantly computationally heavier than ECDSA. But to the point: while there is a thing that can be pre-computed in BLS, the gain is marginal.
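For reference, the pre-computable thing is the set of Miller-loop line functions for a fixed G2 input; blst exposes this, if I remember the header correctly, via blst_precompute_lines and blst_miller_loop_lines. Note this only applies when the repeated element lives in G2 (e.g. the public key in the minimal-signature-size variant), and it only shaves part of one Miller loop; the other pairing and the final exponentiation remain, hence the marginal gain. A sketch under those assumptions:

```c
#include "blst.h"

/* One-time cache of the line functions for a fixed G2 point. */
static blst_fp6 cached_lines[68];

void cache_fixed_g2(const blst_p2_affine *Q)
{
    blst_precompute_lines(cached_lines, Q);      /* once per key */
}

/* Miller loop against the cached point; the final exponentiation and
 * the other pairing still have to be paid for in full. */
void miller_loop_with_cache(blst_fp12 *ret, const blst_p1_affine *P)
{
    blst_miller_loop_lines(ret, cached_lines, P);
}
```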
For an embedded system the correct answer is likely an integrated co-processor. One wishes OpenTitan were an accessible option, but that doesn't seem to be the case. Otherwise there is, for example, the Kinetis K8x with usable capabilities in the Cortex-M4 space...
The Kinetis K8x looks almost identical to what we are currently running on, an Infineon PSoC 62. We are currently running the 150 MHz M4 at 48 MHz to save power, so there's a 3x gain available there. It also has "12 programmable logic blocks, each with 8 Macrocells and an 8-bit data path (called universal digital blocks or UDBs)". Not sure if those can be of any use.

The kmackay ECDSA with uECC_OPTIMIZATION_LEVEL 0 takes 10 sec to verify 1 signature. The big leap comes at level 1, which drops to 170 msec with no change in image size. This drops to 75 msec at level 3, with a 40K jump in image size. So I guess the precomputes deliver about 2x; the big win was something else. Probably level 0 is just a very simplistic reference implementation.

Using server.o revealed the need to add -fshort-enums to the build, for compatibility with our existing compile options. But then I got a bunch of link errors about relocating objects, mostly related to stdio. This is mysterious; I will investigate further some time.
As already mentioned, server.c has
No doubt the linker was confused; removing -fPIC avoided that. But there's no significant performance difference: clang -O2 was about 50 msec slower than the Keil -O3 build, and clang -O3 was about 50 msec faster, based on a single sample. Thanks for your help.
To summarize: it's the ability to keep whole vectors in the register bank that translates into performance "leaps." Most notably, one can barely accommodate a 256-bit vector in ARM32 registers(*), but not a 384-bit one. BLS operates on 384-bit inputs, which is basically why one can't expect impressive times-N improvements here no matter what you do. This is still not to say that assembly won't provide a meaningful improvement, but it won't take you anywhere near the stated objective.

As for the Kinetis K8x: the K82 is the only one with relevant capabilities, more specifically a modular arithmetic unit. However, I don't have hands-on experience with it, so don't take the mention as "this one will actually solve the problem."

(*) Which is what the referenced micro-ecc does at a high enough uECC_OPTIMIZATION_LEVEL.
Just in case: one has to recognize that a co-processor is still power-constrained, which in practice means it won't run as fast as one wishes. It surely can deliver significant improvements on "large" operations such as modular exponentiation and ECC arithmetic, but the most computationally intensive part of BLS signature verification would have to rely on individual field operations: multiplications, additions, etc. And it's not impossible to imagine that the overhead of frequent data transfers to/from the co-processor could add up and diminish the wow factor.