Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make Eth2Digest isZero 8x faster #6682

Merged
merged 3 commits into from
Oct 28, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion beacon_chain/spec/digest.nim
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ import
json_serialization

from nimcrypto/utils import burnMem
from stew/staticfor import staticFor

export
# Exports from sha2 / hash are explicit to avoid exporting upper-case `$` and
Expand Down Expand Up @@ -143,7 +144,13 @@ func `==`*(a, b: Eth2Digest): bool =
equalMem(unsafeAddr a.data[0], unsafeAddr b.data[0], sizeof(a.data))

func isZero*(x: Eth2Digest): bool =
x.isZeroMemory
var tmp {.noinit.}: uint64
var tmp2 = 0'u64
static: doAssert sizeof(x.data) mod sizeof(tmp) == 0
staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it's a weekend, there's one thing that one might try, which is to get rid of the branch entirely:

var tmp2 = 0'u64
var tmp {.noinit}
staticFor:
  copyMEm(tmp, ..)
  tmp2 = tmp2 or tmp
tmp2 > 0

it could pay off because mdigest is 32 bytes which fits in a cache line so it won't cost anything more to read the rest of it and the resulting code should be much smaller but who knows these days..

Copy link
Member

@arnetheduck arnetheduck Oct 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the other hand, if it's nonzero, it's likely to be so in the first word too - choices, choices... it's certainly faster when it's zero, so it might be that it depends on the frequency of zero-mem digests, which one ends up a "total winner" :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, it's comparable, maybe slightly faster. It's easy enough to try both, but the advantage of the one with branches is that it can early-exit, so these benchmarks are best case:

  PR: 0.872 microseconds/10k zero Eth2Digest comparisons
  PR: 1.71 microseconds/10k zero Eth2Digest comparisons
  PR: 0.631 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.24 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.612 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.32 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 0.601 microseconds/10k zero Eth2Digest comparisons
  PR: 1.27 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.642 microseconds/10k zero Eth2Digest comparisons
  PR: 0.612 microseconds/10k zero Eth2Digest comparisons
  PR: 0.651 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons

Would have to add non-zero benchmarks to compare properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried all 4 combinations,

  • 4a8289f (staticFor + copyMem); and
  • staticFor/copyMem/or loop with no early exit

and zero/non-zero hashes (set them to all 255)

And:

$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem_or.txt 
0.929	0.601	0.641	1.342	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem.txt 
0.966	0.611	0.706	1.352	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem_or.txt 
0.945	0.600	0.632	1.363	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem.txt 
1.072	0.612	1.297	1.362	100.000

benchmarks_nonzerohash_copymem.txt
benchmarks_nonzerohash_copymem_or.txt
benchmarks_zerohash_copymem.txt
benchmarks_zerohash_copymem_or.txt

Remarkably similar to each other. The main advantage seems to be the efficient static iteration over fewer, large elements, not the exact operations or even the early exit. It matters, but it's fairly marginal performance-wise.

Copy link
Contributor Author

@tersec tersec Oct 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The or approach isn't much better, but sure, slightly better, so going with it.

diff --git a/beacon_chain/spec/digest.nim b/beacon_chain/spec/digest.nim
index 87c447b84..06ea59be9 100644
--- a/beacon_chain/spec/digest.nim
+++ b/beacon_chain/spec/digest.nim
@@ -145,12 +145,12 @@ func `==`*(a, b: Eth2Digest): bool =
 
 func isZero*(x: Eth2Digest): bool =
   var tmp {.noinit.}: uint64
+  var tmp2 = 0'u64
   static: doAssert sizeof(x.data) mod sizeof(tmp) == 0
   staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp):
     copyMem(addr tmp, addr x.data[i*sizeof(tmp)], sizeof(tmp))
-    if tmp != 0:
-      return false
-  true
+    tmp2 = tmp2 or tmp
+  tmp2 == 0
 
 proc writeValue*(w: var JsonWriter, a: Eth2Digest) {.raises: [IOError].} =
   w.writeValue $a

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typedef struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg
    tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg;
typedef NU8 tyArray__vEOa9c5qaE9ajWxR5R4zwfQg[32];
struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg {
  NIM_ALIGN(16) tyArray__vEOa9c5qaE9ajWxR5R4zwfQg data;
};
static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2);

static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2) {
  void *T1_;
  T1_ = (void *)0;
  T1_ = memcpy(dest_p0, source_p1, ((size_t)(size_p2)));
}

static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2) {
  nimCopyMem(dest_p0, source_p1, size_p2);
}

N_LIB_PRIVATE N_NIMCALL(NIM_BOOL, _ZN6digest6isZeroE7MDigestI6staticI3intEE)(
    tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg *x_p0) {
  NIM_BOOL result;
  NU64 tmp;
  NU64 tmp2;
  result = (NIM_BOOL)0;
  tmp2 = 0ULL;
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)0))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)8))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)16))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)24))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }

  result = (tmp2 == 0ULL);
  return result;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quirk of this version is that it generalizes less outside the Eth2Digest and/or hash function case, where there are only 4 chunks to process, so the early exit doesn't buy much.

For more general isZeroMemory, early exit, unless it had to be constant-time, might become more generally useful.

copyMem(addr tmp, addr x.data[i*sizeof(tmp)], sizeof(tmp))
tmp2 = tmp2 or tmp
tmp2 == 0

proc writeValue*(w: var JsonWriter, a: Eth2Digest) {.raises: [IOError].} =
w.writeValue $a
Expand Down
Loading