Optimize comparison operators #269

greg7mdp · 2022-04-12T11:58:59Z

No description provided.

greg7mdp · 2022-04-13T15:12:31Z

@chfast, let me know what you find out please :-)

chfast · 2022-05-05T16:13:26Z

Preliminary benchmarks:

ecmod<addmod_public>_mean                                            +0.0584         +0.0584             7             7             7             7                                                                              
ecmod<addmod_simple>_mean                                            -0.0017         -0.0017            22            22            22            22                                                                              
ecmod<addmod_prenormalize>_mean                                      -0.0999         -0.0999             5             5             5             5                                                                              
ecmod<addmod_daosvik_v1>_mean                                        +0.0210         +0.0210            12            12            12            12                                                                              
ecmod<addmod_daosvik_v2>_mean                                        +0.0103         +0.0103             6             6             6             6                                                                              
ecmod<mulmod>_mean                                                   +0.0051         +0.0051            68            68            68            68                                                                              
shift<uint256, uint256, shl_public>/-1_mean                          -0.0789         -0.0789             3             2             3             2
shift<uint256, uint256, shl_public>/0_mean                           -0.0716         -0.0716             3             3             3             3
shift<uint256, uint256, shl_public>/1_mean                           -0.0020         -0.0020             3             3             3             3
shift<uint256, uint256, shl_public>/2_mean                           -0.0885         -0.0885             3             2             3             2
shift<uint256, uint256, shl_public>/3_mean                           -0.1080         -0.1080             2             2             2             2
shift<uint256, uint256, shl_halves>/-1_mean                          +0.0045         +0.0045             2             2             2             2
shift<uint256, uint256, shl_halves>/0_mean                           +0.0012         +0.0012             3             3             3             3
shift<uint256, uint256, shl_halves>/1_mean                           +0.0016         +0.0016             3             3             3             3
shift<uint256, uint256, shl_halves>/2_mean                           +0.0004         +0.0004             2             2             2             2
shift<uint256, uint256, shl_halves>/3_mean                           -0.0006         -0.0006             2             2             2             2
shift<uint256, uint64_t, shl_public>/-1_mean                         -0.0052         -0.0052             2             2             2             2
shift<uint256, uint64_t, shl_public>/0_mean                          +0.0021         +0.0021             3             3             3             3
shift<uint256, uint64_t, shl_public>/1_mean                          +0.0002         +0.0002             2             2             2             2
shift<uint256, uint64_t, shl_public>/2_mean                          -0.0959         -0.0959             2             2             2             2
shift<uint256, uint64_t, shl_public>/3_mean                          +0.0070         +0.0070             2             2             2             2
shift<uint256, uint64_t, shl_halves>/-1_mean                         -0.0030         -0.0030             2             2             2             2
shift<uint256, uint64_t, shl_halves>/0_mean                          -0.0022         -0.0022             3             3             3             3
shift<uint256, uint64_t, shl_halves>/1_mean                          -0.0017         -0.0017             2             2             2             2
shift<uint256, uint64_t, shl_halves>/2_mean                          -0.0785         -0.0785             2             2             2             2
shift<uint256, uint64_t, shl_halves>/3_mean                          +0.0006         +0.0006             2             2             2             2
shift<uint256, uint64_t, shl_llvm>/-1_mean                           -0.0057         -0.0057             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/0_mean                            -0.0049         -0.0049             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/1_mean                            -0.0056         -0.0056             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/2_mean                            -0.0073         -0.0073             6             6             6             6
shift<uint256, uint64_t, shl_llvm>/3_mean                            -0.0053         -0.0053             6             6             6             6
shift<uint512, uint512, shl_public>/-1_mean                          +0.0005         +0.0005             9             9             9             9
shift<uint512, uint512, shl_public>/0_mean                           +0.0008         +0.0008            10            10            10            10
shift<uint512, uint512, shl_public>/1_mean                           -0.0046         -0.0046             9             9             9             9
shift<uint512, uint512, shl_public>/2_mean                           -0.0028         -0.0028             9             9             9             9
shift<uint512, uint512, shl_public>/3_mean                           +0.0006         +0.0006             8             8             8             8
shift<uint512, uint64_t, shl_public>/-1_mean                         -0.0081         -0.0081             8             8             8             8
shift<uint512, uint64_t, shl_public>/0_mean                          +0.0538         +0.0538             8             9             8             9
shift<uint512, uint64_t, shl_public>/1_mean                          -0.0263         -0.0263             8             8             8             8
shift<uint512, uint64_t, shl_public>/2_mean                          -0.0286         -0.0286             8             8             8             8
shift<uint512, uint64_t, shl_public>/3_mean                          -0.0297         -0.0297             8             8             8             8
compare<lt_public>/64_mean                                           -0.0021         -0.0021             2             2             2             2
compare<lt_public>/128_mean                                          -0.1423         -0.1423             2             2             2             2
compare<lt_public>/192_mean                                          +0.0196         +0.0196             2             2             2             2
compare<lt_public>/256_mean                                          -0.1432         -0.1432             2             1             2             1
compare<lt_sub>/64_mean                                              +0.0092         +0.0092             2             2             2             2
compare<lt_sub>/128_mean                                             +0.0044         +0.0044             2             2             2             2
compare<lt_sub>/192_mean                                             +0.0062         +0.0062             2             2             2             2
compare<lt_sub>/256_mean                                             +0.0078         +0.0078             2             2             2             2
compare<lt_wordcmp>/64_mean                                          +0.0009         +0.0009             2             2             2             2
compare<lt_wordcmp>/128_mean                                         +0.0024         +0.0024             2             2             2             2
compare<lt_wordcmp>/192_mean                                         -0.0349         -0.0349             4             4             4             4
compare<lt_wordcmp>/256_mean                                         +0.0308         +0.0308             3             4             3             4
compare<lt_halves>/64_mean                                           -0.0025         -0.0025             3             3             3             3
compare<lt_halves>/128_mean                                          -0.0021         -0.0021             3             3             3             3
compare<lt_halves>/192_mean                                          -0.0018         -0.0018             3             3             3             3
compare<lt_halves>/256_mean                                          -0.0031         -0.0031             3             3             3             3
compare<lt_llvm>/64_mean                                             -0.0127         -0.0127             2             2             2             2
compare<lt_llvm>/128_mean                                            -0.0117         -0.0117             2             2             2             2
compare<lt_llvm>/192_mean                                            -0.0109         -0.0109             2             2             2             2
compare<lt_llvm>/256_mean                                            -0.0115         -0.0115             2             2             2             2

So this looks solid for compare operators directly and seems to help the shifts as well, although this may be code layout effect.

greg7mdp · 2022-05-05T16:26:14Z

Thanks, glad to hear :-)

chfast · 2022-05-05T16:27:16Z

Confirmed, this does not affect assembly of shift operators. Proof: https://godbolt.org/z/4b6cT4rje.

greg7mdp · 2022-05-05T16:31:44Z

Yes, makes sense. When I played with this repo a while back, I noticed that the benchmarks had variability from one run to the next, so I'm not sure how reliable the comparison truly is.
I do believe that my other changes in add, sub were also beneficial.

codecov-commenter · 2022-05-05T16:40:38Z

Codecov Report

Merging #269 (f977136) into master (e243bfb) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #269   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines         1941      1937    -4     
=========================================
- Hits          1941      1937    -4

Flag	Coverage Δ
32bit	`100.00% <100.00%> (ø)`
gcc	`99.47% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
include/intx/intx.hpp	`100.00% <100.00%> (ø)`

axic · 2022-06-01T16:38:21Z

@chfast should this be merged?

chfast · 2022-06-01T16:45:21Z

On the latest compilers I don't see much difference so I'm not very much in favor because this implementation has 4 branches while the original one is branchless. But people are welcome to drop their own benchmarking results (use "compare" filter).

greg7mdp · 2022-06-01T18:20:04Z

Why does it matter that it is branchless if it is slower?

chfast · 2022-08-31T12:10:37Z

I will try to clean up the situation.
We are discussing 3 "less than" implementations:

"sub" - subtract two values and return the borrow flag,
"ne" - find the top pair of not-equal words, return comparison of these; proposed here by @greg7mdp,
"split" - the current one which is the mixture of the two above: split into two 128-bit values and subtract them.

Here you can see the assembly output for each: https://godbolt.org/z/WY3qqzrET
Compilers are doing good job.

Analysis

In the worst case every variant need to load all 8 words. Additionally, "sub" executes 4 subtract instructions (independent of input). The "ne" executes 4 cmp+jmp instead. These 2 instructions are fused to single micro-instruction. So it should perform the same as "sub" in worst case but may perform better if it can "exit" earlier.

Benchmarks

I added all variants of implementations in #277 and benchmarked them on two CPUs. There are 5 benchmark cases:

64, 128, 192, 256 mean the value size in bits. E.g. 64 means we are comparing two 64 bit numbers (i.e. 3 top words are zeros).
0 equal combination of above cases, shuffled.
There are also some internal changes to benchmarks to improve stability of results.

Haswell

compare<[lt_split vs. lt_ne]>/0_mean                    +0.1471         +0.1471             1             2             1             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  -0.0009         -0.0009             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  +0.0013         +0.0013             1             1             1             1
compare<[lt_split vs. lt_ne]>/256_mean                  +0.2908         +0.2908             1             2             1             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.0345         -0.0345             1             1             1             1
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0827         -0.0827             2             1             2             1
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0818         -0.0818             2             1             2             1
compare<[lt_split vs. lt_sub]>/192_mean                  +0.0992         +0.0992             1             1             1             1
compare<[lt_split vs. lt_sub]>/256_mean                  +0.0978         +0.0978             1             1             1             1

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.1882         +0.1882             1             2             1             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0892         +0.0892             1             2             1             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.0881         +0.0881             1             2             1             2
compare<[lt_sub vs. lt_ne]>/192_mean                  -0.0890         -0.0890             1             1             1             1
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.1758         +0.1758             1             2             1             2

Skylake

compare<[lt_split vs. lt_ne]>/0_mean                    +0.0854         +0.0854             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   +0.0036         +0.0036             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.1674         +0.1674             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0026         -0.0027             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  +0.0057         +0.0057             2             2             2             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.1562         -0.1562             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0004         -0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  +0.0004         +0.0004             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.2086         -0.2086             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.2122         -0.2122             2             2             2             2

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.2863         +0.2863             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   +0.0039         +0.0039             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.1670         +0.1670             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.2602         +0.2602             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.2766         +0.2766             2             2             2             2

Zen3

This is noisy machine with AMD CPU. For this case I used lt_ne2 for comparison because it was 15% better than the lt_ne.

compare<[lt_split vs. lt_ne2]>/0_mean                    -0.1398         -0.1397             4             3             4             3
compare<[lt_split vs. lt_ne2]>/64_mean                   -0.2668         -0.2668             3             2             3             2
compare<[lt_split vs. lt_ne2]>/128_mean                  -0.1391         -0.1392             3             3             3             3
compare<[lt_split vs. lt_ne2]>/192_mean                  -0.1564         -0.1564             3             3             3             3
compare<[lt_split vs. lt_ne2]>/256_mean                  -0.1487         -0.1487             3             3             3             3

compare<[lt_split vs. lt_sub]>/0_mean                    -0.2012         -0.2010             4             3             4             3
compare<[lt_split vs. lt_sub]>/64_mean                   -0.1363         -0.1366             3             3             3             3
compare<[lt_split vs. lt_sub]>/128_mean                  -0.1508         -0.1507             3             3             3             3
compare<[lt_split vs. lt_sub]>/192_mean                  -0.1555         -0.1555             3             3             3             3
compare<[lt_split vs. lt_sub]>/256_mean                  -0.1456         -0.1456             3             3             3             3

compare<[lt_sub vs. lt_ne2]>/0_mean                    +0.0769         +0.0768             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/64_mean                   -0.1511         -0.1508             3             2             3             2
compare<[lt_sub vs. lt_ne2]>/128_mean                  +0.0137         +0.0136             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/192_mean                  -0.0010         -0.0010             3             3             3             3
compare<[lt_sub vs. lt_ne2]>/256_mean                  -0.0036         -0.0036             3             3             3             3

Xeon Platinum 8272CL

compare<[lt_split vs. lt_ne]>/0_mean                    +0.2017         +0.2016             2             2             2             2
compare<[lt_split vs. lt_ne]>/64_mean                   -0.0078         -0.0073             2             2             2             2
compare<[lt_split vs. lt_ne]>/128_mean                  +0.3924         +0.3924             2             2             2             2
compare<[lt_split vs. lt_ne]>/192_mean                  -0.0017         -0.0016             2             2             2             2
compare<[lt_split vs. lt_ne]>/256_mean                  -0.0009         -0.0009             2             2             2             2

compare<[lt_split vs. lt_sub]>/0_mean                    -0.0803         -0.0803             2             2             2             2
compare<[lt_split vs. lt_sub]>/64_mean                   -0.0051         -0.0046             2             2             2             2
compare<[lt_split vs. lt_sub]>/128_mean                  -0.0021         -0.0020             2             2             2             2
compare<[lt_split vs. lt_sub]>/192_mean                  -0.0025         -0.0025             2             2             2             2
compare<[lt_split vs. lt_sub]>/256_mean                  -0.0023         -0.0023             2             2             2             2

compare<[lt_sub vs. lt_ne]>/0_mean                    +0.3065         +0.3065             2             2             2             2
compare<[lt_sub vs. lt_ne]>/64_mean                   -0.0027         -0.0027             2             2             2             2
compare<[lt_sub vs. lt_ne]>/128_mean                  +0.3953         +0.3953             2             2             2             2
compare<[lt_sub vs. lt_ne]>/192_mean                  +0.0009         +0.0009             2             2             2             2
compare<[lt_sub vs. lt_ne]>/256_mean                  +0.0015         +0.0015             2             2             2             2

From the results we should rather consider switching back to "sub" implementation. I may do some other benchmarks if I have time.

sonarcloud · 2023-09-20T12:11:08Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
11 Code Smells

No Coverage information
0.0% Duplication

greg7mdp mentioned this pull request Apr 12, 2022

streamline/speed up addc/subc by using in/out carry parameter #267

Open

chfast self-assigned this May 5, 2022

chfast force-pushed the compare branch from 4004a04 to 38b4b74 Compare May 5, 2022 16:23

chfast changed the title ~~separate change to operator<() in a separate PR for easier testing~~ Optimize comparison operators May 5, 2022

Optimize comparison operators

f977136

chfast force-pushed the compare branch from 38b4b74 to f977136 Compare August 26, 2022 12:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize comparison operators #269

Optimize comparison operators #269

greg7mdp commented Apr 12, 2022

greg7mdp commented Apr 13, 2022

chfast commented May 5, 2022

greg7mdp commented May 5, 2022

chfast commented May 5, 2022

greg7mdp commented May 5, 2022

codecov-commenter commented May 5, 2022 •

edited

Loading

axic commented Jun 1, 2022

chfast commented Jun 1, 2022

greg7mdp commented Jun 1, 2022

chfast commented Aug 31, 2022 •

edited

Loading

sonarcloud bot commented Sep 20, 2023

Optimize comparison operators #269

Are you sure you want to change the base?

Optimize comparison operators #269

Conversation

greg7mdp commented Apr 12, 2022

greg7mdp commented Apr 13, 2022

chfast commented May 5, 2022

greg7mdp commented May 5, 2022

chfast commented May 5, 2022

greg7mdp commented May 5, 2022

codecov-commenter commented May 5, 2022 • edited Loading

Codecov Report

axic commented Jun 1, 2022

chfast commented Jun 1, 2022

greg7mdp commented Jun 1, 2022

chfast commented Aug 31, 2022 • edited Loading

Analysis

Benchmarks

Haswell

Skylake

Zen3

Xeon Platinum 8272CL

sonarcloud bot commented Sep 20, 2023

codecov-commenter commented May 5, 2022 •

edited

Loading

chfast commented Aug 31, 2022 •

edited

Loading