
make Eth2Digest isZero 8x faster #6682

Merged
merged 3 commits into unstable from twW on Oct 28, 2024
Conversation

tersec
Contributor

@tersec tersec commented Oct 26, 2024

Via profiling by @etan-status (attached profiling screenshot: iszeromemory_holesky_syncing).

During Holesky syncing, Nimbus spends 25% of its time in isZeroMemory. This PR speeds up that hot path.

test 1:
  PR: 2.94 microseconds/10k zero Eth2Digest comparisons
prev: 18.6 microseconds/10k zero Eth2Digest comparisons

test 2:
  PR: 2.60 microseconds/10k zero Eth2Digest comparisons
prev: 18.6 microseconds/10k zero Eth2Digest comparisons

test 3:
  PR: 1.29 microseconds/10k zero Eth2Digest comparisons
prev: 8.33 microseconds/10k zero Eth2Digest comparisons

test 4:
  PR: 1.27 microseconds/10k zero Eth2Digest comparisons
prev: 8.29 microseconds/10k zero Eth2Digest comparisons

test 5:
  PR: 1.21 microseconds/10k zero Eth2Digest comparisons
prev: 8.29 microseconds/10k zero Eth2Digest comparisons

test 6:
  PR: 2.79 microseconds/10k zero Eth2Digest comparisons
prev: 18.7 microseconds/10k zero Eth2Digest comparisons

test 7:
  PR: 2.82 microseconds/10k zero Eth2Digest comparisons
prev: 18.7 microseconds/10k zero Eth2Digest comparisons

test 8:
  PR: 2.85 microseconds/10k zero Eth2Digest comparisons
prev: 18.7 microseconds/10k zero Eth2Digest comparisons

test 9:
  PR: 1.28 microseconds/10k zero Eth2Digest comparisons
prev: 8.28 microseconds/10k zero Eth2Digest comparisons

test 10:
  PR: 1.27 microseconds/10k zero Eth2Digest comparisons
prev: 8.32 microseconds/10k zero Eth2Digest comparisons

using

import "."/beacon_chain/spec/digest
import std/[strformat, times]
var aZeroHash = default(Eth2Digest)

func f(d: float): string =
  fmt"{d:.3}"

template measure(s: static string, code: untyped): bool =
  let t1 = cpuTime()
  var res: bool
  for _ in 0.uint ..< 1000.uint:
    res = code
  block:
    let duration = (cpuTime() - t1)*1_000_000
    echo s, ": ", f(duration), " microseconds/10k zero Eth2Digest comparisons"
  res

let _ = measure("  PR") do:
  isZero(aZeroHash)

let _ = measure("prev") do:
  isZeroSlow(aZeroHash)

It would arguably be better to use cmpMem (i.e. memcmp() in thin disguise) to

  • avoid assuming 4-byte alignment, not readily statically checkable;
  • allow exploiting the more usual case of its being 16-byte aligned; and
  • reduce the pointer arithmetic.

However, attempting to do this in a safe and efficient way triggers various combinations of:

So this can function as a stopgap approach.
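For reference, the memcmp-based variant described above might look like the following in C (a sketch with invented names; this is not what the PR merged, and the Nim-level obstacles mentioned above are why):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* Eth2Digest is a 32-byte hash */

/* Hypothetical memcmp-based zero check: compare the digest against a
 * static all-zero buffer, letting the compiler/libc pick the widest
 * loads it can justify, with no pointer arithmetic on our side. */
static bool is_zero_digest_memcmp(const uint8_t digest[DIGEST_LEN]) {
  static const uint8_t zero[DIGEST_LEN] = {0};
  return memcmp(digest, zero, DIGEST_LEN) == 0;
}
```

Modern compilers typically inline this memcmp into a handful of vector or 64-bit compares, which is the appeal of the approach.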

@tersec tersec enabled auto-merge (squash) October 26, 2024 04:43
@tersec tersec disabled auto-merge October 26, 2024 04:48

github-actions bot commented Oct 26, 2024

Unit Test Results

       12 files  ±0    1 810 suites  ±0   56m 4s ⏱️ +9s
  5 227 tests ±0    4 879 ✔️ ±0  348 💤 ±0  0 ±0 
29 057 runs  ±0  28 605 ✔️ ±0  452 💤 ±0  0 ±0 

Results for commit 5789e3c. ± Comparison against base commit 58a34e0.

♻️ This comment has been updated with latest results.

@arnetheduck
Member

arnetheduck commented Oct 26, 2024

It would arguably be better

what you can do instead is:

var tmp {.noinit.}: uint64
staticFor i, 0..<sizeof(mdigest) div sizeof(tmp):
  copyMem(addr tmp, addr mdigest(i) ...)
  if tmp == 0: ...

the compiler will then understand what's going on and use at least uint64 alignment, and if it detects something better, use that (this is based on gut feeling, needs measuring of course)
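In C terms (a hypothetical analogue, not the Nimbus code itself), the suggestion amounts to reading the digest word by word through memcpy, which imposes no alignment assumption on the source yet compiles down to plain 64-bit loads:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* Eth2Digest is a 32-byte hash */

/* Hypothetical word-wise zero check: copy 8 bytes at a time into a
 * uint64_t (legal for any alignment), test each word, and exit early
 * on the first nonzero word. */
static bool is_zero_digest_wordwise(const uint8_t digest[DIGEST_LEN]) {
  for (size_t i = 0; i < DIGEST_LEN; i += sizeof(uint64_t)) {
    uint64_t tmp;
    memcpy(&tmp, digest + i, sizeof tmp);  /* no alignment assumption */
    if (tmp != 0)
      return false;  /* early exit on first nonzero word */
  }
  return true;
}
```

The memcpy is the key move: the compiler sees a fixed-size copy into an aligned local and emits a single load, upgrading to wider loads if it can prove better alignment.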

@tersec
Contributor Author

tersec commented Oct 26, 2024

It would arguably be better

what you can do instead is:

var tmp {.noinit.}: uint64
staticFor i, 0..<sizeof(mdigest) div sizeof(tmp):
  copyMem(addr tmp, addr mdigest(i) ...)
  if tmp == 0: ...

the compiler will then understand what's going on and use at least uint64 alignment, and if it detects something better, use that (this is based on gut feeling, needs measuring of course)

4a8289f

It's faster:

  PR: 1.48 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.631 microseconds/10k zero Eth2Digest comparisons
  PR: 1.31 microseconds/10k zero Eth2Digest comparisons
  PR: 1.33 microseconds/10k zero Eth2Digest comparisons
  PR: 1.32 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 1.31 microseconds/10k zero Eth2Digest comparisons
  PR: 1.32 microseconds/10k zero Eth2Digest comparisons
  PR: 1.36 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 1.29 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.29 microseconds/10k zero Eth2Digest comparisons
  PR: 0.631 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 1.28 microseconds/10k zero Eth2Digest comparisons
  PR: 1.33 microseconds/10k zero Eth2Digest comparisons
  PR: 1.37 microseconds/10k zero Eth2Digest comparisons
  PR: 0.631 microseconds/10k zero Eth2Digest comparisons

And the codegen looks reasonable:

static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2);

static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2) {
  void *T1_;
  T1_ = (void *)0;
  T1_ = memcpy(dest_p0, source_p1, ((size_t)(size_p2)));
}

static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2) {
  nimCopyMem(dest_p0, source_p1, size_p2);
}

N_LIB_PRIVATE N_NIMCALL(NIM_BOOL, _ZN6digest6isZeroE7MDigestI6staticI3intEE)(
    tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg *x_p0) {
  NIM_BOOL result;
  NU64 tmp;
  {
    result = (NIM_BOOL)0;
    {
      _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
          ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)0))-0]))),
          ((NI)8));

      {
        if (!!((tmp == 0ULL)))
          goto LA4_;

        result = NIM_FALSE;
        goto BeforeRet_;
      }
    LA4_:;
    }
    {
      _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
          ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)8))-0]))),
          ((NI)8));

      {
        if (!!((tmp == 0ULL)))
          goto LA9_;

        result = NIM_FALSE;
        goto BeforeRet_;
      }
    LA9_:;
    }
    {
      _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
          ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)16))-0]))),
          ((NI)8));

      {
        if (!!((tmp == 0ULL)))
          goto LA14_;

        result = NIM_FALSE;
        goto BeforeRet_;
      }
    LA14_:;
    }
    {
      _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
          ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)24))-0]))),
          ((NI)8));

      {
        if (!!((tmp == 0ULL)))
          goto LA19_;

        result = NIM_FALSE;
        goto BeforeRet_;
      }
    LA19_:;
    }
    result = NIM_TRUE;
  }
BeforeRet_:;
  return result;
}

@tersec
Contributor Author

tersec commented Oct 26, 2024

I also tried a staticFor/equalMem-based approach and it was not as fast; presumably the comparison to 0 specifically is something compilers and CPUs handle well:

  PR: 2.75 microseconds/10k zero Eth2Digest comparisons
  PR: 1.16 microseconds/10k zero Eth2Digest comparisons
  PR: 1.18 microseconds/10k zero Eth2Digest comparisons
  PR: 1.19 microseconds/10k zero Eth2Digest comparisons
  PR: 2.37 microseconds/10k zero Eth2Digest comparisons
  PR: 2.60 microseconds/10k zero Eth2Digest comparisons
  PR: 2.66 microseconds/10k zero Eth2Digest comparisons
  PR: 1.20 microseconds/10k zero Eth2Digest comparisons
  PR: 1.19 microseconds/10k zero Eth2Digest comparisons
  PR: 1.16 microseconds/10k zero Eth2Digest comparisons
  PR: 1.12 microseconds/10k zero Eth2Digest comparisons
  PR: 1.18 microseconds/10k zero Eth2Digest comparisons
  PR: 1.19 microseconds/10k zero Eth2Digest comparisons
  PR: 1.20 microseconds/10k zero Eth2Digest comparisons
  PR: 1.21 microseconds/10k zero Eth2Digest comparisons
  PR: 1.18 microseconds/10k zero Eth2Digest comparisons
  PR: 1.18 microseconds/10k zero Eth2Digest comparisons
  PR: 1.20 microseconds/10k zero Eth2Digest comparisons
  PR: 2.67 microseconds/10k zero Eth2Digest comparisons
  PR: 2.55 microseconds/10k zero Eth2Digest comparisons
  PR: 2.56 microseconds/10k zero Eth2Digest comparisons
  PR: 2.63 microseconds/10k zero Eth2Digest comparisons
  PR: 2.61 microseconds/10k zero Eth2Digest comparisons
  PR: 1.19 microseconds/10k zero Eth2Digest comparisons
  PR: 1.18 microseconds/10k zero Eth2Digest comparisons

@arnetheduck
Member

During Holesky syncing, Nimbus is spending 25% of its time in isZeroMem

uh, maybe a good question to ask is what is calling isZeroMem? the hashlist cache should be checking the first 8 bytes only
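The "first 8 bytes only" idea can be sketched in C as a cheap pre-filter (hypothetical helper; not the actual hashlist-cache code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* Eth2Digest is a 32-byte hash */

/* Hypothetical prefix filter: a nonzero first word rules out the zero
 * digest with a single 64-bit load. Note this is only a filter: a zero
 * prefix does NOT prove the whole digest is zero, so callers needing an
 * exact answer must fall back to a full check. */
static bool prefix_is_zero(const uint8_t digest[DIGEST_LEN]) {
  uint64_t head;
  memcpy(&head, digest, sizeof head);
  return head == 0;
}
```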

@tersec
Contributor Author

tersec commented Oct 26, 2024

During Holesky syncing, Nimbus is spending 25% of its time in isZeroMem

uh, maybe a good question to ask is what is calling isZeroMem? the hashlist cache should be checking the first 8 bytes only

func compute_deltas(
       deltas: var openArray[Delta],
       indices: Table[Eth2Digest, Index],
       indices_offset: Index,
       votes: var openArray[VoteTracker],
       old_balances: openArray[Gwei],
       new_balances: openArray[Gwei]
     ): FcResult[void] =
  ## Update `deltas`
  ##   between old and new balances
  ##   between votes
  ##
  ## `deltas.len` must match `indices.len` (length match)
  ##
  ## Error:
  ## - If a value in indices is greater than `indices.len`
  ## - If a `Eth2Digest` in `votes` does not exist in `indices`
  ##   except for the `default(Eth2Digest)` (i.e. zero hash)
  for val_index, vote in votes.mpairs():
    # No need to create a score change if the validator has never voted
    # or if votes are for the zero hash (alias to the genesis block)
    if vote.current_root.isZero and vote.next_root.isZero:
      continue

    # If the validator was not included in `old_balances` (i.e. did not exist)
    # its balance is zero
    let old_balance = if val_index < old_balances.len: old_balances[val_index]
                      else: 0.Gwei

    # If the validator is not known in the `new_balances` then use balance of zero
    #
    # It is possible that there is a vote for an unknown validator if we change our
    # justified state to a new state with a higher epoch on a different fork
    # because that fork may have on-boarded fewer validators than the previous fork.
    #
    # Note that attesters are not different as they are activated only under finality
    let new_balance = if val_index < new_balances.len: new_balances[val_index]
                      else: 0.Gwei

    if vote.current_root != vote.next_root or old_balance != new_balance:
      # Ignore the current or next vote if it is not known in `indices`.
      # We assume that it is outside of our tree (i.e., pre-finalization) and therefore not interesting.
      if vote.current_root in indices:
        let index = indices.unsafeGet(vote.current_root) - indices_offset
        if index >= deltas.len:
          return err ForkChoiceError(
            kind: fcInvalidNodeDelta,
            index: index)
        deltas[index] -= Delta old_balance
          # Note that delta can be negative
          # TODO: is int64 big enough?

      if vote.next_epoch != FAR_FUTURE_EPOCH or not vote.next_root.isZero:
        if vote.next_root in indices:
          let index = indices.unsafeGet(vote.next_root) - indices_offset
          if index >= deltas.len:
            return err ForkChoiceError(
              kind: fcInvalidNodeDelta,
              index: index)
          deltas[index] += Delta new_balance
            # Note that delta can be negative
            # TODO: is int64 big enough?

      vote.current_root = vote.next_root
  return ok()

@arnetheduck
Member

oh. oof, that old code is starting to show its age .. it was written for 20k validators, not almost 2M - not bad that it has held on for so long

@arnetheduck
Member

oh. oof, that old code is starting to show its age .. it was written for 20k validators, not almost 2M - not bad that it has held on for so long

i.e. instead of comparing hashes, it could keep a bit list or any number of other solutions; anyway, that's unrelated to this PR, which certainly is a nice little improvement

x.isZeroMemory
var tmp {.noinit.}: uint64
static: doAssert sizeof(x.data) mod sizeof(tmp) == 0
staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp):
Member


since it's a weekend, there's one thing that one might try, which is to get rid of the branch entirely:

var tmp2 = 0'u64
var tmp {.noinit.}
staticFor:
  copyMem(tmp, ..)
  tmp2 = tmp2 or tmp
tmp2 > 0

it could pay off because mdigest is 32 bytes, which fits in a single cache line, so reading the rest of it costs nothing extra, and the resulting code should be much smaller; but who knows these days..
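As a hypothetical C analogue of this branch-free idea (names invented here, not the Nimbus code): OR all four 64-bit words together and test once at the end.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* Eth2Digest is a 32-byte hash */

/* Hypothetical branch-free zero check: accumulate all words with OR so
 * there are no data-dependent branches; all 32 bytes sit in one cache
 * line, so reading them unconditionally is cheap. */
static bool is_zero_digest_branchless(const uint8_t digest[DIGEST_LEN]) {
  uint64_t acc = 0;
  for (size_t i = 0; i < DIGEST_LEN; i += sizeof(uint64_t)) {
    uint64_t tmp;
    memcpy(&tmp, digest + i, sizeof tmp);
    acc |= tmp;  /* any nonzero word makes acc nonzero */
  }
  return acc == 0;
}
```

A side effect worth noting: because every byte is always read, this variant's running time does not depend on the digest's contents, unlike the early-exit version.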

Member

@arnetheduck arnetheduck Oct 26, 2024


on the other hand, if the digest is nonzero, it's likely to be nonzero in the first word too - choices, choices... the branch-free version is certainly faster when the digest is zero, so which one ends up the "total winner" may depend on how often zero digests actually occur :)

Contributor Author


Interesting, it's comparable, maybe slightly faster. It's easy enough to try both, but the advantage of the one with branches is that it can early-exit, so these benchmarks are best case:

  PR: 0.872 microseconds/10k zero Eth2Digest comparisons
  PR: 1.71 microseconds/10k zero Eth2Digest comparisons
  PR: 0.631 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.24 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.612 microseconds/10k zero Eth2Digest comparisons
  PR: 1.35 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.34 microseconds/10k zero Eth2Digest comparisons
  PR: 1.32 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 0.601 microseconds/10k zero Eth2Digest comparisons
  PR: 1.27 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons
  PR: 0.642 microseconds/10k zero Eth2Digest comparisons
  PR: 0.612 microseconds/10k zero Eth2Digest comparisons
  PR: 0.651 microseconds/10k zero Eth2Digest comparisons
  PR: 0.611 microseconds/10k zero Eth2Digest comparisons
  PR: 0.621 microseconds/10k zero Eth2Digest comparisons

Would have to add non-zero benchmarks to compare properly.

Contributor Author


I tried all 4 combinations of:

  • 4a8289f (staticFor + copyMem, with early exit); and
  • the staticFor/copyMem/or loop with no early exit

against zero and non-zero hashes (non-zero: every byte set to 255).

And:

$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem_or.txt 
0.929	0.601	0.641	1.342	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem.txt 
0.966	0.611	0.706	1.352	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem_or.txt 
0.945	0.600	0.632	1.363	100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem.txt 
1.072	0.612	1.297	1.362	100.000

benchmarks_nonzerohash_copymem.txt
benchmarks_nonzerohash_copymem_or.txt
benchmarks_zerohash_copymem.txt
benchmarks_zerohash_copymem_or.txt

Remarkably similar to each other. The main win seems to come from the efficient static iteration over a few large words, not from the exact operations or even the early exit; those matter, but only marginally.

Contributor Author

@tersec tersec Oct 26, 2024


The or approach isn't much better, but sure, slightly better, so going with it.

diff --git a/beacon_chain/spec/digest.nim b/beacon_chain/spec/digest.nim
index 87c447b84..06ea59be9 100644
--- a/beacon_chain/spec/digest.nim
+++ b/beacon_chain/spec/digest.nim
@@ -145,12 +145,12 @@ func `==`*(a, b: Eth2Digest): bool =
 
 func isZero*(x: Eth2Digest): bool =
   var tmp {.noinit.}: uint64
+  var tmp2 = 0'u64
   static: doAssert sizeof(x.data) mod sizeof(tmp) == 0
   staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp):
     copyMem(addr tmp, addr x.data[i*sizeof(tmp)], sizeof(tmp))
-    if tmp != 0:
-      return false
-  true
+    tmp2 = tmp2 or tmp
+  tmp2 == 0
 
 proc writeValue*(w: var JsonWriter, a: Eth2Digest) {.raises: [IOError].} =
   w.writeValue $a

Contributor Author


typedef struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg
    tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg;
typedef NU8 tyArray__vEOa9c5qaE9ajWxR5R4zwfQg[32];
struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg {
  NIM_ALIGN(16) tyArray__vEOa9c5qaE9ajWxR5R4zwfQg data;
};
static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2);

static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2) {
  void *T1_;
  T1_ = (void *)0;
  T1_ = memcpy(dest_p0, source_p1, ((size_t)(size_p2)));
}

static N_INLINE(void,
                _ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
    void *dest_p0, void *source_p1, NI size_p2) {
  nimCopyMem(dest_p0, source_p1, size_p2);
}

N_LIB_PRIVATE N_NIMCALL(NIM_BOOL, _ZN6digest6isZeroE7MDigestI6staticI3intEE)(
    tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg *x_p0) {
  NIM_BOOL result;
  NU64 tmp;
  NU64 tmp2;
  result = (NIM_BOOL)0;
  tmp2 = 0ULL;
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)0))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)8))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)16))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }
  {
    _ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
        ((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)24))-0]))), ((NI)8));

    tmp2 = (NU64)(tmp2 | tmp);
  }

  result = (tmp2 == 0ULL);
  return result;
}

Contributor Author


One quirk of this version is that it generalizes less well beyond the Eth2Digest/hash-function case: with only 4 chunks to process, the early exit doesn't buy much.

For a more general isZeroMemory, an early exit might be more useful, unless constant-time behavior were required.

@tersec tersec merged commit 4565c02 into unstable Oct 28, 2024
13 checks passed
@tersec tersec deleted the twW branch October 28, 2024 05:21