gc: findHead scan optimization #3899

HattoriHanzo031 · 2023-09-08T17:26:22Z

Draft PR to continue discussion with @dgryski and @aykevl about optimizing findHead algorithm in case of large object allocation.
The idea is to scan the block state byte at the time. State byte is first XOR-ed with all tails byte which turns all tails in state byte to 0 and then bits.LeadingZeros8 is used to get how many tails are in the byte. Also, if XOR-ed byte is 0 that means it is all tails so we can skip to the next. For the first byte we first shift out the bits that are not part of current object.

dgryski · 2023-09-08T20:52:22Z

I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result. Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Also, there is a similar forward scanning loop in findNext(). I wonder if that should also be tweaked.

aykevl · 2023-09-09T15:01:42Z

I need to take a closer look at the code still, but I like how it's kept small.
It's unfortunate that there's a ~100 byte increase in binary size for baremetal chips. Normally that would be a blocker for me but considering the performance benefits I'm not so sure.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result.

I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).

Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement.
(Deciding at compile time is relatively easy: for example if you want to only do this on wasm then you can simply use if GOARCH == "wasm" - this check will be optimized away entirely).

HattoriHanzo031 · 2023-09-09T16:00:59Z

Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

32bit version would require much more considerations regarding edge cases and platform differences so I suspect it would grow much more complex than 8bit version, but I agree we can try to implement it later in another PR because of significant performance gain it could provide.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

It also looked to me that first if could be avoided and I tried to eliminate it, but the problem is that the first byte we get is a special case because it can also contain bits from different object so it must be handled differently. There still might be a way to do it that I didn't find.

Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table. I just built the compiler using self-built LLVM so I can tun the tests and see what can I do to reduce the size. I first have to confirm if that is the reason, because if we are already using this lookup table somewhere it is probably reused (since it is defined in the library) and not add to binary size (and also since lookup table is 256 bytes and increase is around 100). One idea I have is to checking just 6 bits instead of 8 because if we found the byte that is not all tails it must have at least one block that will not be 00 after XOR. This would reduce lookup table to 64bytes.

HattoriHanzo031 · 2023-09-09T20:51:50Z

@aykevl you were right, it can be simplified, I just needed to subtract number of blocks from other objects at the end :)
Unfortunately, it only saves 4 bytes, but better than nothing, and the code is much more clear.

aykevl · 2023-09-10T17:19:00Z

Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table.

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

On armv7m LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc
On armv6m (microbit for example) LLVM emits a library call, which is probably the main reason for the big increase in binary size: https://godbolt.org/z/YjrqE56Yz. Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

HattoriHanzo031 · 2023-09-11T12:33:36Z

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

Thanks for the info, it is good to know.

Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

It probably makes sense, on those targets there will probably not be big amount of heap allocations anyway.

dgryski · 2024-09-19T19:55:01Z

Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line.

HattoriHanzo031 · 2024-09-20T17:12:41Z

@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on embedded target. I built test program that creates some heap variables in the loop and ran it on rp2040 board. I also added toggling of one pin at the start and the end of the findHead call and capture them on logic analyser so I have exact duration of the call. I will use this over weekend to see the results and maybe tweak the algorithm.

dgryski · 2024-09-20T20:40:12Z

Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench

   4.0  +
        |
        |
        |                   *
        |
   3.8  +                                        *
        |
        |
        |
   3.6  +                   |                    |
        |         +-------------------+          |
        |         |         *         |          |
        |         +-------------------++-------------------+
   3.4  +                   |          |         *         |
        |                              +-------------------+
        |                              +-------------------+
        |                                        |
   3.2  +--------------------------------------------------------------
                          dev.s                opt.s

File   N   Mean  Stddev
dev.s  20  3.51  0.11    (control)
opt.s  20  3.39  0.12    (3.39 < 3.51 ± 0.07, p = .003)

So we can see an improvement of about 10ms or about 3%. ( Data analysis via https://github.com/codahale/tinystat )

HattoriHanzo031 · 2024-09-23T20:51:09Z

@dgryski Here are some results from benchmarking. findHead is on average around 2 times faster. It is only slower for very short searches, and the longer the search is, the better the improvement. Here is a comparisons, one channel is the durations of findHead and second channel is duration of runGC
seeed XIAO (SAMD21) board:
dev branch:

PR branch:

nice!nano board:
dev branch:

PR branch:

I also ran both algorithms side by side (also comparing the output):

One strange thing I noticed is that on RP-2040 board runGC is slower with new algorithm although findHead invocations took significantly les time:
dev branch:

PR branch:

Still didn't figure out what is the cause for this since only change is in the findHead function and it runs (on average) much faster and returns the same results. Maybe @aykevl has some explanation.

dgryski · 2024-10-23T20:07:27Z

Given this is slower for very short scans, and that most allocations are only for a single block, I wonder if we want to have an additional test to see if it's a one-block allocation before jumping into the larger logic? Or maybe the extra test outweighs the difference.

dgryski · 2024-10-23T21:36:20Z

Also, ran this code against the corpus with the following patch to check for any mismatches:

diff --git a/src/runtime/gc_blocks.go b/src/runtime/gc_blocks.go
index 43c73143..02a5a5b4 100644
--- a/src/runtime/gc_blocks.go
+++ b/src/runtime/gc_blocks.go
@@ -129,6 +129,12 @@ func (b gcBlock) address() uintptr {
 // points to an allocated object. It returns the same block if this block
 // already points to the head.
 func (b gcBlock) findHead() gcBlock {
+
+       var btmp gcBlock
+       if gcAsserts {
+           btmp = b
+       }
+
        stateBytePtr := (*uint8)(unsafe.Add(metadataStart, b/blocksPerStateByte))

        // XOR the stateByte with byte containing all tails to turn tail bits to 0
@@ -156,6 +162,14 @@ func (b gcBlock) findHead() gcBlock {
        b -= gcBlock(bits.LeadingZeros8(stateByte) / stateBits)

        if gcAsserts {
+               for btmp.state() == blockStateTail {
+                       btmp--
+               }
+
+               if btmp != b {
+                   runtimePanic("gc: findHead optimization mismatch")
+               }
+
                if b.state() != blockStateHead && b.state() != blockStateMark {
                        runtimePanic("gc: found tail without head")
                }

dgryski · 2024-10-24T23:49:35Z

This is a short-circuit patch to return early if we're already on a head block.

diff --git a/src/runtime/gc_blocks.go b/src/runtime/gc_blocks.go
index 43c73143..2bc02a02 100644
--- a/src/runtime/gc_blocks.go
+++ b/src/runtime/gc_blocks.go
@@ -129,6 +129,11 @@ func (b gcBlock) address() uintptr {
 // points to an allocated object. It returns the same block if this block
 // already points to the head.
 func (b gcBlock) findHead() gcBlock {
+       // Nothing to do.
+       if b.state() == blockStateHead {
+               return b
+       }
+
        stateBytePtr := (*uint8)(unsafe.Add(metadataStart, b/blocksPerStateByte))

        // XOR the stateByte with byte containing all tails to turn tail bits to 0

HattoriHanzo031 · 2024-10-27T08:02:53Z

Given this is slower for very short scans, and that most allocations are only for a single block, I wonder if we want to have an additional test to see if it's a one-block allocation before jumping into the larger logic? Or maybe the extra test outweighs the difference.

@dgryski sorry for the late response, I think I tried adding check for single block allocations, but it got complicated because of assert logic. I like how your suggestion avoids dealing with assert since short circuit check does not decrements b so the rest of the code goes trough blocks from the beginning again. I will try and see if this is speeding things up and let you know the results.

Also, ran this code against the corpus with the following patch to check for any mismatches:

Did you found any mismatch when running these tests? I did similar for my test programs and never got different value between old and new algorithm.

dgryski · 2024-10-27T18:16:48Z

I never got any differences between your code and the old version running tinygo through the test corpus on either Linux or WASI. (Pretty sure, although I might double-check this claim...)

dgryski · 2024-10-29T18:26:01Z

builder/sizes_test.go

-		{"hifive1b", "examples/echo", 4560, 280, 0, 2268},
-		{"microbit", "examples/serial", 2868, 388, 8, 2272},
-		{"wioterminal", "examples/pininterrupt", 6104, 1484, 116, 6832},
+		{"hifive1b", "examples/echo", 4688, 280, 0, 2268},


Current echo size should be 4660. Then this should pass CI and I think we'll be good to merge.

dgryski · 2024-10-29T18:26:30Z

@aykevl Aside from the above CI issue with binary sizes, do you have any comments before this is merged?

aykevl · 2024-10-31T10:46:40Z

@dgryski is there a way I can run tinyjsonbench myself? I'd like to play a bit with findHead optimizations.

This is similar to #3899, but smaller and hopefully just as efficient.

aykevl · 2024-10-31T11:52:09Z

Did some experimenting, and wrote some code that's a bit smaller (typically 20 byte increase in binary size, as opposed to 44-84 bytes in this PR) and I think just as fast or maybe slightly faster:

        |                                                                 
        |                                                                 
        |                                                                 
        |                                                                 
        |                                                                 
   4.0  +                   |                                             
        |                   |                                             
        |         +-------------------+                                   
        |         +---------*---------+                                   
   3.9  +         |                   |                                   
        |         +-------------------+                                   
        |                   |                                             
        |                                        |                        
   3.8  +                                        |                        
        |                              +-------------------+              
        |                              |         *         |              
        |                              +-------------------+              
   3.7  +----------------------------------------|---------------------   
                        old.times            new.times                    

File       N   Mean  Stddev  
old.times  20  3.92  0.04    (control)
new.times  20  3.77  0.03    (3.77 < 3.92 ± 0.03, p = .000)

PR is at #4574.

The main difference is that I don't try complicated XOR and bits.LeadingZeros8, but instead care about just two cases:

If the block state is stored in the last byte, check the whole byte. This is efficient for large allocations where a pointer could point to the middle of it.
Else, just use the old code of checking the block state. This is efficient for pointers that point to the start of an allocation (a very common case). It is also reasonably efficient for the first few and last few blocks of an allocation that aren't aligned (b % 4 == 3) because it will loop at most 3 times.

@dgryski as you can see, I did get tinyjsonbench to work. However I did find that run.sh is locale dependent: I had to replace a bunch of commas with dots in old.times and new.times for it to show something sensible (I've set my system to a weird mixture of English language and Dutch formatting of dates and numbers and stuff).

HattoriHanzo031 · 2024-11-01T02:35:52Z

PR is at #4574.

The main difference is that I don't try complicated XOR and bits.LeadingZeros8, but instead care about just two cases:

If the block state is stored in the last byte, check the whole byte. This is efficient for large allocations where a pointer could point to the middle of it.

Else, just use the old code of checking the block state. This is efficient for pointers that point to the start of an allocation (a very common case). It is also reasonably efficient for the first few and last few blocks of an allocation that aren't aligned (b % 4 == 3) because it will loop at most 3 times.

@aykevl I like the simplicity of your suggestion. I did some benchmarking on RP2040 and the code from this PR is slightly (but consistently) faster than your suggestion. I'm not sure if this small speedup justifies so much size increase, especially if benchmarks on linux/wasm show similar/better performance.

aykevl · 2024-11-01T07:47:49Z

@HattoriHanzo031 can you share the benchmark code?

HattoriHanzo031 force-pushed the gc_find_head_optimization branch 2 times, most recently from e282884 to 565a24d Compare September 8, 2023 18:56

HattoriHanzo031 force-pushed the gc_find_head_optimization branch from 9a315e8 to 565a24d Compare September 9, 2023 10:05

HattoriHanzo031 force-pushed the gc_find_head_optimization branch from 16f759a to 47d0cb1 Compare September 20, 2024 11:43

HattoriHanzo031 force-pushed the gc_find_head_optimization branch from 47d0cb1 to 9a1ca3c Compare October 18, 2024 07:52

HattoriHanzo031 marked this pull request as ready for review October 24, 2024 03:27

dgryski reviewed Oct 29, 2024

View reviewed changes

HattoriHanzo031 and others added 6 commits October 31, 2024 00:59

Scan optimization

44d8b04

changed expected size so all tests can run

a82b085

Simplify algorithm

bbbf759

restoring original comment

bcfa5e8

updated new binary sizes for TestBinarySize

5859de8

test binary size update

a48056d

HattoriHanzo031 force-pushed the gc_find_head_optimization branch from 9a1ca3c to a48056d Compare October 31, 2024 00:12

aykevl added a commit that referenced this pull request Oct 31, 2024

runtime: optimize findHead

48dc0fc

This is similar to #3899, but smaller and hopefully just as efficient.

aykevl mentioned this pull request Oct 31, 2024

runtime: optimize findHead #4574

Draft

aykevl added a commit that referenced this pull request Oct 31, 2024

runtime: optimize findHead

afd2900

This is similar to #3899, but smaller and hopefully just as efficient.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gc: findHead scan optimization #3899

gc: findHead scan optimization #3899

HattoriHanzo031 commented Sep 8, 2023 •

edited

Loading

dgryski commented Sep 8, 2023

aykevl commented Sep 9, 2023

HattoriHanzo031 commented Sep 9, 2023

HattoriHanzo031 commented Sep 9, 2023

aykevl commented Sep 10, 2023 •

edited

Loading

HattoriHanzo031 commented Sep 11, 2023

dgryski commented Sep 19, 2024

HattoriHanzo031 commented Sep 20, 2024

dgryski commented Sep 20, 2024

HattoriHanzo031 commented Sep 23, 2024 •

edited

Loading

dgryski commented Oct 23, 2024

dgryski commented Oct 23, 2024

dgryski commented Oct 24, 2024

HattoriHanzo031 commented Oct 27, 2024

dgryski commented Oct 27, 2024

dgryski Oct 29, 2024

dgryski commented Oct 29, 2024

aykevl commented Oct 31, 2024

aykevl commented Oct 31, 2024 •

edited

Loading

HattoriHanzo031 commented Nov 1, 2024 •

edited

Loading

aykevl commented Nov 1, 2024

gc: findHead scan optimization #3899

Are you sure you want to change the base?

gc: findHead scan optimization #3899

Conversation

HattoriHanzo031 commented Sep 8, 2023 • edited Loading

dgryski commented Sep 8, 2023

aykevl commented Sep 9, 2023

HattoriHanzo031 commented Sep 9, 2023

HattoriHanzo031 commented Sep 9, 2023

aykevl commented Sep 10, 2023 • edited Loading

HattoriHanzo031 commented Sep 11, 2023

dgryski commented Sep 19, 2024

HattoriHanzo031 commented Sep 20, 2024

dgryski commented Sep 20, 2024

HattoriHanzo031 commented Sep 23, 2024 • edited Loading

dgryski commented Oct 23, 2024

dgryski commented Oct 23, 2024

dgryski commented Oct 24, 2024

HattoriHanzo031 commented Oct 27, 2024

dgryski commented Oct 27, 2024

dgryski Oct 29, 2024

Choose a reason for hiding this comment

dgryski commented Oct 29, 2024

aykevl commented Oct 31, 2024

aykevl commented Oct 31, 2024 • edited Loading

HattoriHanzo031 commented Nov 1, 2024 • edited Loading

aykevl commented Nov 1, 2024

HattoriHanzo031 commented Sep 8, 2023 •

edited

Loading

aykevl commented Sep 10, 2023 •

edited

Loading

HattoriHanzo031 commented Sep 23, 2024 •

edited

Loading

aykevl commented Oct 31, 2024 •

edited

Loading

HattoriHanzo031 commented Nov 1, 2024 •

edited

Loading