
Enable per thread register state cache on libunwind #55049

Merged

Conversation

@andrebsguedes (Contributor) commented Jul 6, 2024

Looking into a profile recently, I realized that when recording backtraces the CPU utilization is mostly dominated by lookups/updates to libunwind's register state cache (`get_rs_cache`, `put_rs_cache`):

![Screenshot from 2024-07-05 19-29-45](https://github.com/JuliaLang/julia/assets/5301739/5e65f867-6dc8-4d55-8669-aaf1f756a2ac)

It is also worth noting that those functions take a lock and use `sigprocmask`, which does not scale, so by recording backtraces in parallel we get:

![Screenshot from 2024-07-05 19-30-21](https://github.com/JuliaLang/julia/assets/5301739/ed3124dd-f340-4b52-a7f9-c0a203f935b6)

And this translates to these times on a recent laptop (Linux X86_64):

```
julia> @time for i in 1:1000000 Base.backtrace() end
  8.286924 seconds (32.00 M allocations: 8.389 GiB, 1.46% gc time)

julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
 20.448630 seconds (160.01 M allocations: 123.740 GiB, 8.05% gc time, 0.43% compilation time: 18% of which was recompilation)
```

The good news is that libunwind already has a solution for this in the form of the `--enable-per-thread-cache` build option, which uses a thread-local cache for register state instead of the default global one ([1](https://libunwind-devel.nongnu.narkive.com/V3gtFUL9/question-about-performance-of-threaded-access-in-libunwind)). But this is not without some hiccups due to how we `dlopen` libunwind, so we need a small patch ([2](https://libunwind-devel.nongnu.narkive.com/QG1K3Uke/tls-model-initial-exec-attribute-prevents-dynamic-loading-of-libunwind-via-dlopen)).

By applying those changes we get:

```
julia> @time for i in 1:1000000 Base.backtrace() end
  2.378070 seconds (32.00 M allocations: 8.389 GiB, 4.72% gc time)

julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
  3.657772 seconds (160.01 M allocations: 123.740 GiB, 52.05% gc time, 2.33% compilation time: 19% of which was recompilation)
```

Single-Threaded:
![Screenshot from 2024-07-05 20-25-49](https://github.com/JuliaLang/julia/assets/5301739/ebc87952-e51f-488c-92f4-72aed5abb93a)

Multi-Threaded:
![Screenshot from 2024-07-05 20-26-32](https://github.com/JuliaLang/julia/assets/5301739/0ea2160a-60e8-49ea-af62-7d8ffc35c963)

As a companion to this PR I have created another one applying the same change to LibUnwind_jll [on Yggdrasil](JuliaPackaging/Yggdrasil#9030). After that lands we can bump the version here.

Contributor

Does it make sense to upstream this patch?

@vtjnash (Member) commented Jul 6, 2024

I thought the problem with this option was it causes libunwind to leak memory. Why does it matter if this is scalable?

@andrebsguedes (Contributor, Author)

> I thought the problem with this option was it causes libunwind to leak memory. Why does it matter if this is scalable?

@vtjnash What kind of memory leak are we talking about? My understanding is that it only leaks memory if the cache is resized and a thread exits, but with this feature enabled the cache cannot be resized.

@vtjnash (Member) commented Jul 22, 2024

Okay, yes, it looks like this option has been fixed: https://github.com/libunwind/libunwind/pull/8/files#r111250766

@NHDaly (Member) commented Jul 22, 2024

> Why does it matter if this is scalable?

If you have a multithreaded program, and two different tasks are logging an exception or a warning with backtraces, they will serialize unnecessarily. We encounter this case in production fairly regularly, especially due to failures that cascade across threads, meaning the system grinds to a halt when attempting to report an error.

@andrebsguedes (Contributor, Author)

It turns out that there is a side effect of loading libunwind through `dlopen` when using glibc and per-thread caches: it makes any function that uses the cache not async-signal-safe.

This happens because, in order to `dlopen` libunwind, we must use the global-dynamic TLS model, which in glibc means that any space for TLS is allocated lazily on the first access from a new thread. This lazy allocation is performed with `malloc`, which is not async-signal-safe, thus rendering the first access to the cache from a new thread not async-signal-safe (e.g., if this first access comes from a signal handler that interrupts a thread within the critical section of `malloc`, it will deadlock).

I didn't take the time to check, but I am assuming we use libunwind from signal handlers, which makes me think we have the following options:

  • always set the caching policy to UNW_CACHE_GLOBAL when unwinding from signal handlers
  • have the default caching policy be UNW_CACHE_GLOBAL and create backtrace alternatives that don't need to be async-signal-safe (which use UNW_CACHE_PER_THREAD)

Do we have any preference, or maybe other options? @vtjnash @giordano

P.S.: We cannot use the current initial-exec TLS model because we get `cannot allocate memory in static TLS block`, as glibc only reserves so much static TLS space (however, the amount can be changed at startup by setting the `glibc.rtld.optional_static_tls` glibc tunable).
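For reference, the initial-exec model is requested per variable with a GCC/clang attribute (or globally with `-ftls-model=initial-exec`). A minimal sketch of the annotation, with the tunable workaround noted in a comment; the variable name here is illustrative:

```c
#include <assert.h>

/* initial-exec TLS: the linker reserves a slot in glibc's static TLS
 * block at load time, so each access is just an offset from the thread
 * pointer (no __tls_get_addr, no malloc, async-signal-safe).  The
 * downside: a dlopen'ed library annotated this way fails to load with
 * "cannot allocate memory in static TLS block" once that reserve is
 * exhausted.  glibc can enlarge the reserve at startup, e.g.:
 *   GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4096 ./julia
 */
static _Thread_local int rs_cache_slot
    __attribute__((tls_model("initial-exec")));

int touch_slot(void)
{
    rs_cache_slot = 1;
    return rs_cache_slot;
}
```

This is why the original libunwind patch linked above exists: stock libunwind uses this attribute, which prevents `dlopen`ing it reliably.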

This jemalloc issue comment has a brief explanation of the `dlopen` issue.

@vtjnash (Member) commented Jul 23, 2024

I thought I saw that libunwind now uses mmap to avoid that problem, as long as the pthreads implementation itself is safe.

@vtjnash (Member) commented Jul 23, 2024

Apparently Google proposed fixing glibc a decade ago, because it is required for MSan to be reliable, but I don't know if this was merged: https://sourceware.org/legacy-ml/libc-alpha/2014-12/msg00583.html

@andrebsguedes (Contributor, Author)

The proposed fix from Google is for `pthread_getspecific`, which is used with pthread keys, but the caches we are talking about use plain C11 `_Thread_local` variables that get lazily initialized within `__tls_get_addr` in glibc. Google also discussed making `__tls_get_addr` async-signal-safe for `dlopen`ed libraries here, but I am not aware of any real fix for it. After a quick look at the glibc source code, it appears that the `malloc` is still there.
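The distinction being drawn here can be shown with a minimal sketch contrasting the two TLS mechanisms (variable and function names are illustrative):

```c
#include <pthread.h>
#include <assert.h>
#include <stdint.h>

/* Mechanism 1 -- POSIX keys: values are read via pthread_getspecific,
 * the function Google's proposed glibc fix targets. */
static pthread_key_t tls_key;

/* Mechanism 2 -- C11 _Thread_local: the compiler emits the access
 * directly.  Under the global-dynamic model that dlopen requires, a
 * thread's first access goes through glibc's __tls_get_addr, which may
 * call malloc -- this is the path libunwind's caches actually use. */
static _Thread_local intptr_t c11_slot;

intptr_t read_both(void)
{
    pthread_key_create(&tls_key, NULL);
    pthread_setspecific(tls_key, (void *)21);
    c11_slot = 21;  /* first access in this thread may allocate TLS */
    return (intptr_t)pthread_getspecific(tls_key) + c11_slot;
}
```

A fix for `pthread_getspecific` alone would therefore not help the `_Thread_local` path discussed in this PR.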

@vtjnash (Member) commented Jul 23, 2024

Could you add a function to libunwind that simply accesses that variable to force it to get initialized? We can call that from `jl_adopt_thread`.

@andrebsguedes (Contributor, Author) commented Jul 23, 2024

@vtjnash I added a function to the patch that simply accesses a dummy `_Thread_local` variable, as this is enough to allocate the DTV entry that is also used by the other `_Thread_local` variables (I verified this). I had to call it from `jl_init_threadtls` instead of `jl_adopt_thread` because the latter is only called by foreign threads and we want all threads to run this. I also managed to confirm by stepping with GDB that the first call to the `unw_ensure_tls` function always allocates and subsequent `jl_unw_step` calls do not, whereas previously `jl_unw_step` would indeed allocate within `__tls_get_addr` on the first call.
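The approach can be sketched as follows; `unw_ensure_tls` is the function name used in the patch, but the body below is an illustrative reconstruction, not the patch's actual code:

```c
#include <assert.h>

/* Touching any _Thread_local in a dlopen'ed module forces glibc to
 * allocate that module's DTV entry (via malloc inside __tls_get_addr).
 * Doing this once per thread from ordinary context -- Julia calls it
 * from jl_init_threadtls -- means later register-state-cache accesses
 * from a signal handler no longer allocate, and are therefore
 * async-signal-safe. */
_Thread_local volatile char unw_tls_probe;

void unw_ensure_tls(void)
{
    unw_tls_probe = 1;  /* first access per thread triggers the allocation */
}
```

After this runs on a thread, signal-handler unwinds on that thread hit already-allocated TLS.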

@giordano (Contributor)

Windows fails to build due to linker errors related to libunwind.

@andrebsguedes (Contributor, Author)

@giordano Is there a way to restart only the `build powerpc64le-linux-gnuassert` job that timed out? It looks unrelated.

@giordano (Contributor)

I restarted it.

@andrebsguedes (Contributor, Author)

@giordano Can you restart `test x86_64-linux-gnuassertrr-net` and `test x86_64-linux-gnuassertrr`? The failures look flaky to me.

@giordano (Contributor)

Yeah, restarted. That's #55235, it's hitting very frequently lately.

@vtjnash (Member) commented Jul 25, 2024

FWIW, we are running into that frequently now due to Pkg bugs. The error message printed there is just our attempted error recovery not being successful.

@andrebsguedes (Contributor, Author)

The `test x86_64-linux-gnuassertrr-net` job seems to be timing out with `Process failed to exit within 7200s, requesting termination (SIGTERM) of PID` before running into that assertion on shutdown. I see the same pattern on other builds.

@andrebsguedes (Contributor, Author)

I see, it hangs there because Pkg tests never finish.

@giordano (Contributor)

This is finally all green. Are we good to go now? There haven't been substantial changes since the last approval.

@andrebsguedes (Contributor, Author)

Yeah, the only changes are the version bump of the package and some minor build fixes; I think we are good.

@andrebsguedes (Contributor, Author)

@vtjnash gentle nudge here : )

@IanButterworth IanButterworth merged commit 5a904ac into JuliaLang:master Jul 31, 2024
7 checks passed
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Aug 17, 2024
fingolfin pushed a commit that referenced this pull request Aug 31, 2024
Assuming non-Windows and libunwind not disabled:

The flag `-DLLVMLIBUNWIND` is currently set on macOS only for `USE_SYSTEM_UNWIND=0`, which seems wrong to me and causes build issues for macOS on Yggdrasil in combination with the recent #55049, which should only affect GNU libunwind (`error: call to undeclared function 'unw_ensure_tls'`). This flag is now set independently of the system-libunwind flag (on Darwin and OpenBSD, as before).

`LIBUNWIND=-lunwind` is set for `USE_SYSTEM_UNWIND=0` || (`USE_SYSTEM_UNWIND=1` && `OS != Darwin`). I don't think the check for Darwin makes sense; it might be a leftover from using osxunwind a (long) while ago. Changed that to always set `-lunwind` if enabled.

x-ref: JuliaPackaging/Yggdrasil#9331
KristofferC pushed a commit that referenced this pull request Sep 12, 2024