Enable per thread register state cache on libunwind #55049
Conversation
Does it make sense to upstream this patch?
I thought the problem with this option was that it causes libunwind to leak memory. Why does it matter if this is scalable?

@vtjnash What kind of memory leak are we talking about? My understanding is that it only leaks memory if the cache is resized and a thread exits, but with this feature enabled the cache cannot be resized.

Okay, yes, looks like this option has been fixed: https://github.com/libunwind/libunwind/pull/8/files#r111250766

If you have a multithreaded program and two different tasks are logging an exception or a warning with backtraces, they will serialize unnecessarily. We encounter this case in production somewhat regularly, especially due to failures that cascade across threads, meaning the system grinds to a halt when attempting to report an error.
It turns out that there is a side effect of loading libunwind through […]. This happens because, in order to […]. I didn't take the time to check, but I am assuming we use libunwind from signal handlers, which makes me think we have the following options: […]

Do we have any preference, or maybe other options? @vtjnash @giordano

P.S.: We cannot use the current […]. This jemalloc issue comment has a brief explanation of the dlopen issue.
I thought I saw libunwind now uses mmap to avoid that problem, as long as the pthreads implementation itself is safe.

Apparently Google proposed fixing glibc a decade ago, because it is required for MSAN to be reliable, but I don't know if this was merged: https://sourceware.org/legacy-ml/libc-alpha/2014-12/msg00583.html

The proposed fix from Google is for

Could you add a function to libunwind that simply accesses that variable to force it to get initialized? We can call that from jl_adopt_thread.
@vtjnash I added a function to the patch that simply accesses a dummy

Windows fails to build due to linker errors related to libunwind.
@giordano Is there a way to restart only the

I restarted it.

@giordano Can you restart

Yeah, restarted. That's #55235; it's been hitting very frequently lately.

FWIW, we are running into that frequently now due to Pkg bugs. The error message printed there is just our attempted error recovery not being successful.

The

I see, it hangs there because
This is finally all green. Are we good to go now? There haven't been substantial changes since the last approval.

Yeah, the only change is the version bump of the package and some minor build fixes. I think we are good.

@vtjnash gentle nudge here : )
Looking into a profile recently, I realized that when recording backtraces the CPU utilization is mostly dominated by lookups/updates to libunwind's register state cache (`get_rs_cache`, `put_rs_cache`):

![Screenshot from 2024-07-05 19-29-45](https://github.com/JuliaLang/julia/assets/5301739/5e65f867-6dc8-4d55-8669-aaf1f756a2ac)

It is also worth noting that those functions take a lock and use `sigprocmask`, which does not scale, so by recording backtraces in parallel we get:

![Screenshot from 2024-07-05 19-30-21](https://github.com/JuliaLang/julia/assets/5301739/ed3124dd-f340-4b52-a7f9-c0a203f935b6)

And this translates to these times on a recent laptop (Linux x86_64):

```julia
julia> @time for i in 1:1000000 Base.backtrace() end
  8.286924 seconds (32.00 M allocations: 8.389 GiB, 1.46% gc time)

julia> @time Threads.@sync for i in 1:16 Threads.@spawn for j in 1:1000000 Base.backtrace() end end
 20.448630 seconds (160.01 M allocations: 123.740 GiB, 8.05% gc time, 0.43% compilation time: 18% of which was recompilation)
```

The good news is that libunwind already has a solution for this in the form of the `--enable-per-thread-cache` build option, which uses a thread-local cache for register state instead of the default global one ([1](https://libunwind-devel.nongnu.narkive.com/V3gtFUL9/question-about-performance-of-threaded-access-in-libunwind)). But this is not without some hiccups due to how we `dlopen` libunwind, so we need a small patch ([2](https://libunwind-devel.nongnu.narkive.com/QG1K3Uke/tls-model-initial-exec-attribute-prevents-dynamic-loading-of-libunwind-via-dlopen)).

By applying those changes we get:

```julia
julia> @time for i in 1:1000000 Base.backtrace() end
  2.378070 seconds (32.00 M allocations: 8.389 GiB, 4.72% gc time)

julia> @time Threads.@sync for i in 1:16 Threads.@spawn for j in 1:1000000 Base.backtrace() end end
  3.657772 seconds (160.01 M allocations: 123.740 GiB, 52.05% gc time, 2.33% compilation time: 19% of which was recompilation)
```

Single-threaded:

![Screenshot from 2024-07-05 20-25-49](https://github.com/JuliaLang/julia/assets/5301739/ebc87952-e51f-488c-92f4-72aed5abb93a)

Multi-threaded:

![Screenshot from 2024-07-05 20-26-32](https://github.com/JuliaLang/julia/assets/5301739/0ea2160a-60e8-49ea-af62-7d8ffc35c963)

As a companion to this PR I have created another one for applying the same change to LibUnwind_jll [on Yggdrasil](JuliaPackaging/Yggdrasil#9030). After that lands we can bump the version here.
Assuming non-Windows and libunwind not disabled:

The flag `-DLLVMLIBUNWIND` is currently set on macOS only for `USE_SYSTEM_UNWIND=0`, which seems wrong to me and causes build issues for macOS on Yggdrasil in combination with the recent #55049, which should only affect GNU libunwind (`error: call to undeclared function 'unw_ensure_tls'`). This flag is now set independently of the system-libunwind flag (on Darwin and OpenBSD, as before).

`LIBUNWIND=-lunwind` is set for `USE_SYSTEM_UNWIND=0`, or for `USE_SYSTEM_UNWIND=1` when `OS != Darwin`. I don't think the check for Darwin makes sense; it might be a leftover from using osxunwind a (long) while ago. Changed that to always set `-lunwind` if enabled.

x-ref: JuliaPackaging/Yggdrasil#9331