Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Fast CIOS for Montgomery modular multiplication #1009

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

clementjuventin
Copy link

@clementjuventin clementjuventin commented Sep 17, 2024

This PR implements the improvement of the CIOS variant of the Montgomery algorithm showcased in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication and proposed in #869.
The optimization occurs when most_significant_p_word < (word_size / 2), p is represented by the variable mod in the current implementation.

Here are the benchmarks performed after applying these changes starting from commit 01eca77.

Build: cmake --build build --parallel

Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json

Comparison:

user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json 
Comparing cios_classic.json to cios_improved.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_mul<uint256, bn254>_pvalue                     0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean                      +7.8884         +7.8873            27           238            27           238
evmmax_mul<uint256, bn254>_median                    +7.8595         +7.8584            27           237            27           237
evmmax_mul<uint256, bn254>_stddev                   +17.5405        +17.5057             0             3             0             3
evmmax_mul<uint256, bn254>_cv                        +1.0859         +1.0823             0             0             0             0
evmmax_mul<uint256, secp256k1>_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean                  +7.6644         +7.6636            29           249            29           249
evmmax_mul<uint256, secp256k1>_median                +7.7730         +7.7721            28           249            28           249
evmmax_mul<uint256, secp256k1>_stddev               +10.6559        +10.6522             1             7             1             7
evmmax_mul<uint256, secp256k1>_cv                    +0.3453         +0.3450             0             0             0             0
OVERALL_GEOMEAN                                      +1.0661         +1.0658             0             0             0             0

As you can see, the results do not indicate a performance improvement.

@clementjuventin
Copy link
Author

clementjuventin commented Sep 19, 2024

After further investigation, I found another way of ordering branches (second commit) and obtained what we were looking for (~15% gain on evmmax_mul<uint256, bn254>)!

evmmax_mul<uint256, bn254>_median                    -0.1529         -0.1529            27            23            27            23
evmmax_mul<uint256, secp256k1>_median                +0.0059         +0.0058            28            28            28            28

I still don't get why this implementation is better, even assuming branch prediction makes the checks insignificant.
I would also like to compare the assembly code in case there is important optimization under the hood but I never did so let's see if I manage to do it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant