-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make Eth2Digest isZero 8x faster #6682
Conversation
what you can do instead is: var tmp: uint64 {.noinit.}
staticFor i, 0..<sizeof(mdigest)/sizeof(tmp):
copyMem(addr tmp, addr mdigest(i) ...)
if tmp == 0: ... the compiler will then understand what's going on and use at least uint64 alignment, and if it detects something better, use that (this is based on gut feeling, needs measuring of course) |
It's faster:
And the codegen looks reasonable: static N_INLINE(void,
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2) {
void *T1_;
T1_ = (void *)0;
T1_ = memcpy(dest_p0, source_p1, ((size_t)(size_p2)));
}
static N_INLINE(void,
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
void *dest_p0, void *source_p1, NI size_p2) {
nimCopyMem(dest_p0, source_p1, size_p2);
}
N_LIB_PRIVATE N_NIMCALL(NIM_BOOL, _ZN6digest6isZeroE7MDigestI6staticI3intEE)(
tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg *x_p0) {
NIM_BOOL result;
NU64 tmp;
{
result = (NIM_BOOL)0;
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)0))-0]))),
((NI)8));
{
if (!!((tmp == 0ULL)))
goto LA4_;
result = NIM_FALSE;
goto BeforeRet_;
}
LA4_:;
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)8))-0]))),
((NI)8));
{
if (!!((tmp == 0ULL)))
goto LA9_;
result = NIM_FALSE;
goto BeforeRet_;
}
LA9_:;
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)16))-0]))),
((NI)8));
{
if (!!((tmp == 0ULL)))
goto LA14_;
result = NIM_FALSE;
goto BeforeRet_;
}
LA14_:;
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)24))-0]))),
((NI)8));
{
if (!!((tmp == 0ULL)))
goto LA19_;
result = NIM_FALSE;
goto BeforeRet_;
}
LA19_:;
}
result = NIM_TRUE;
}
BeforeRet_:;
return result;
} |
I also tried a
|
uh, maybe a good question to ask is what is calling isZeroMem? the hashlist cache should be checking the first 8 bytes only |
nimbus-eth2/beacon_chain/fork_choice/fork_choice.nim Lines 405 to 470 in 58a34e0
|
oh. oof, that old code is starting to show its age .. it was written for 20k validators, not almost 2M - not bad that it has held on for so long |
ie instead of comparing hashes, it could keep a bit list or any number of other solutions - anyway, that's unrelated to this pr, which certainly is a nice little improvement |
x.isZeroMemory | ||
var tmp {.noinit.}: uint64 | ||
static: doAssert sizeof(x.data) mod sizeof(tmp) == 0 | ||
staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's a weekend, there's one thing that one might try, which is to get rid of the branch entirely:
var tmp2 = 0'u64
var tmp {.noinit}
staticFor:
copyMEm(tmp, ..)
tmp2 = tmp2 or tmp
tmp2 > 0
it could pay off because mdigest is 32 bytes which fits in a cache line so it won't cost anything more to read the rest of it and the resulting code should be much smaller but who knows these days..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the other hand, if it's nonzero, it's likely to be so in the first word too - choices, choices... it's certainly faster when it's zero, so it might be that it depends on the frequency of zero-mem digests, which one ends up a "total winner" :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, it's comparable, maybe slightly faster. It's easy enough to try both, but the advantage of the one with branches is that it can early-exit, so these benchmarks are best case:
PR: 0.872 microseconds/10k zero Eth2Digest comparisons
PR: 1.71 microseconds/10k zero Eth2Digest comparisons
PR: 0.631 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 1.24 microseconds/10k zero Eth2Digest comparisons
PR: 1.34 microseconds/10k zero Eth2Digest comparisons
PR: 1.35 microseconds/10k zero Eth2Digest comparisons
PR: 0.612 microseconds/10k zero Eth2Digest comparisons
PR: 1.35 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 1.34 microseconds/10k zero Eth2Digest comparisons
PR: 1.34 microseconds/10k zero Eth2Digest comparisons
PR: 1.32 microseconds/10k zero Eth2Digest comparisons
PR: 0.611 microseconds/10k zero Eth2Digest comparisons
PR: 0.601 microseconds/10k zero Eth2Digest comparisons
PR: 1.27 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
PR: 0.642 microseconds/10k zero Eth2Digest comparisons
PR: 0.612 microseconds/10k zero Eth2Digest comparisons
PR: 0.651 microseconds/10k zero Eth2Digest comparisons
PR: 0.611 microseconds/10k zero Eth2Digest comparisons
PR: 0.621 microseconds/10k zero Eth2Digest comparisons
Would have to add non-zero benchmarks to compare properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried all 4 combinations,
- 4a8289f (
staticFor
+copyMem
); and staticFor
/copyMem
/or
loop with no early exit
and zero/non-zero hashes (set them to all 255)
And:
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem_or.txt
0.929 0.601 0.641 1.342 100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_zerohash_copymem.txt
0.966 0.611 0.706 1.352 100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem_or.txt
0.945 0.600 0.632 1.363 100.000
$ datamash -R3 mean 1 perc:10 1 median 1 perc:90 1 count 1 < ~/benchmarks_nonzerohash_copymem.txt
1.072 0.612 1.297 1.362 100.000
benchmarks_nonzerohash_copymem.txt
benchmarks_nonzerohash_copymem_or.txt
benchmarks_zerohash_copymem.txt
benchmarks_zerohash_copymem_or.txt
Remarkably similar to each other. The main advantage seems to be the efficient static iteration over fewer, large elements, not the exact operations or even the early exit. It matters, but it's fairly marginal performance-wise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The or
approach isn't much better, but sure, slightly better, so going with it.
diff --git a/beacon_chain/spec/digest.nim b/beacon_chain/spec/digest.nim
index 87c447b84..06ea59be9 100644
--- a/beacon_chain/spec/digest.nim
+++ b/beacon_chain/spec/digest.nim
@@ -145,12 +145,12 @@ func `==`*(a, b: Eth2Digest): bool =
func isZero*(x: Eth2Digest): bool =
var tmp {.noinit.}: uint64
+ var tmp2 = 0'u64
static: doAssert sizeof(x.data) mod sizeof(tmp) == 0
staticFor i, 0 ..< sizeof(x.data) div sizeof(tmp):
copyMem(addr tmp, addr x.data[i*sizeof(tmp)], sizeof(tmp))
- if tmp != 0:
- return false
- true
+ tmp2 = tmp2 or tmp
+ tmp2 == 0
proc writeValue*(w: var JsonWriter, a: Eth2Digest) {.raises: [IOError].} =
w.writeValue $a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typedef struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg
tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg;
typedef NU8 tyArray__vEOa9c5qaE9ajWxR5R4zwfQg[32];
struct tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg {
NIM_ALIGN(16) tyArray__vEOa9c5qaE9ajWxR5R4zwfQg data;
};
static N_INLINE(void,
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2);
static N_INLINE(void, nimCopyMem)(void *dest_p0, void *source_p1, NI size_p2) {
void *T1_;
T1_ = (void *)0;
T1_ = memcpy(dest_p0, source_p1, ((size_t)(size_p2)));
}
static N_INLINE(void,
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807)(
void *dest_p0, void *source_p1, NI size_p2) {
nimCopyMem(dest_p0, source_p1, size_p2);
}
N_LIB_PRIVATE N_NIMCALL(NIM_BOOL, _ZN6digest6isZeroE7MDigestI6staticI3intEE)(
tyObject_MDigest__sonz1Q4qEjQxue9bhMfa9bfg *x_p0) {
NIM_BOOL result;
NU64 tmp;
NU64 tmp2;
result = (NIM_BOOL)0;
tmp2 = 0ULL;
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)0))-0]))), ((NI)8));
tmp2 = (NU64)(tmp2 | tmp);
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)8))-0]))), ((NI)8));
tmp2 = (NU64)(tmp2 | tmp);
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)16))-0]))), ((NI)8));
tmp2 = (NU64)(tmp2 | tmp);
}
{
_ZN6system7copyMemE7pointer7pointer25range09223372036854775807(
((void *)((&tmp))), ((void *)((&(*x_p0).data[(((NI)24))-0]))), ((NI)8));
tmp2 = (NU64)(tmp2 | tmp);
}
result = (tmp2 == 0ULL);
return result;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One quirk of this version is that it generalizes less outside the Eth2Digest
and/or hash function case, where there are only 4 chunks to process, so the early exit doesn't buy much.
For more general isZeroMemory
, early exit, unless it had to be constant-time, might become more generally useful.
Via profiling by @etan-status
During Holesky syncing, Nimbus is spending 25% of its time in
isZeroMemory
. This PR improves this.using
It would arguably be better to use cmpMem (i.e.
memcmp()
in thin disguise) toHowever, attempting to do this in a safe and efficient way triggers various combinations of:
let
/const
-bound anonymousproc
doesn't participate in overload resolution nim-lang/Nim#24357Error: VM problem: dest register is not set
withconst
-bound proc nim-lang/Nim#24359error: ‘T2_’ undeclared
anderror: incompatible types when assigning to type ‘void *’
nim-lang/Nim#24361So this can function as a stopgap approach.