
Enable per thread register state cache on libunwind #55049

Merged

Conversation

@andrebsguedes (Contributor) commented Jul 6, 2024

Looking into a profile recently, I realized that when recording backtraces the CPU utilization is mostly dominated by lookups/updates to libunwind's register state cache (`get_rs_cache`, `put_rs_cache`):

![Screenshot from 2024-07-05 19-29-45](https://github.com/JuliaLang/julia/assets/5301739/5e65f867-6dc8-4d55-8669-aaf1f756a2ac)

It is also worth noting that those functions take a lock and use `sigprocmask`, which does not scale, so by recording backtraces in parallel we get:

![Screenshot from 2024-07-05 19-30-21](https://github.com/JuliaLang/julia/assets/5301739/ed3124dd-f340-4b52-a7f9-c0a203f935b6)

And this translates to these times on a recent laptop (Linux X86_64):

```
julia> @time for i in 1:1000000 Base.backtrace() end
  8.286924 seconds (32.00 M allocations: 8.389 GiB, 1.46% gc time)

julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
 20.448630 seconds (160.01 M allocations: 123.740 GiB, 8.05% gc time, 0.43% compilation time: 18% of which was recompilation)
```

The good news is that libunwind already has a solution for this in the form of the `--enable-per-thread-cache` build option, which uses a thread-local cache for register state instead of the default global one ([1](https://libunwind-devel.nongnu.narkive.com/V3gtFUL9/question-about-performance-of-threaded-access-in-libunwind)). But this is not without some hiccups due to how we `dlopen` libunwind, so we need a small patch ([2](https://libunwind-devel.nongnu.narkive.com/QG1K3Uke/tls-model-initial-exec-attribute-prevents-dynamic-loading-of-libunwind-via-dlopen)).

By applying those changes we get:

```
julia> @time for i in 1:1000000 Base.backtrace() end
  2.378070 seconds (32.00 M allocations: 8.389 GiB, 4.72% gc time)

julia> @time Threads.@sync for i in 1:16
           Threads.@spawn for j in 1:1000000
               Base.backtrace()
           end
       end
  3.657772 seconds (160.01 M allocations: 123.740 GiB, 52.05% gc time, 2.33% compilation time: 19% of which was recompilation)
```

Single-Threaded:
![Screenshot from 2024-07-05 20-25-49](https://github.com/JuliaLang/julia/assets/5301739/ebc87952-e51f-488c-92f4-72aed5abb93a)

Multi-Threaded:
![Screenshot from 2024-07-05 20-26-32](https://github.com/JuliaLang/julia/assets/5301739/0ea2160a-60e8-49ea-af62-7d8ffc35c963)

As a companion to this PR I have created another one applying the same change to LibUnwind_jll [on Yggdrasil](JuliaPackaging/Yggdrasil#9030). After that lands we can bump the version here.

Contributor

Does it make sense to upstream this patch?

@vtjnash (Member) commented Jul 6, 2024

I thought the problem with this option was it causes libunwind to leak memory. Why does it matter if this is scalable?

@andrebsguedes (Contributor, Author)

> I thought the problem with this option was it causes libunwind to leak memory. Why does it matter if this is scalable?

@vtjnash What kind of memory leak are we talking about? My understanding is that it only leaks memory if the cache is resized and a thread exits, but with this feature enabled the cache cannot be resized.

@vtjnash (Member) commented Jul 22, 2024

Okay, yes, it looks like this option has been fixed: https://github.com/libunwind/libunwind/pull/8/files#r111250766

@NHDaly (Member) commented Jul 22, 2024

> Why does it matter if this is scalable?

If you have a multithreaded program, and two different tasks are logging an exception or a warning with backtraces, they will serialize unnecessarily. We encounter this case in production fairly regularly, especially due to failures that cascade across threads, meaning the system grinds to a halt when attempting to report an error.

@andrebsguedes (Contributor, Author)

It turns out that there is a side effect of loading libunwind through `dlopen` when using glibc and per-thread caches: it makes any function that uses the cache not async-signal-safe.

This happens because, in order to `dlopen` libunwind, we must use the global-dynamic TLS model, which in glibc means that any space for TLS is allocated lazily on the first access from a new thread. This lazy allocation is performed with `malloc`, which is not async-signal-safe, thus rendering the first access to the cache from a new thread not async-signal-safe (e.g., if this first access comes from a signal handler that interrupts a thread within the critical section of `malloc`, it will deadlock).

I didn't take the time to check, but I am assuming we use libunwind from signal handlers, which makes me think we have the following options:

  • always set the caching policy to UNW_CACHE_GLOBAL when unwinding from signal handlers
  • have the default caching policy be UNW_CACHE_GLOBAL and create backtrace alternatives that don't need to be async-signal-safe (which use UNW_CACHE_PER_THREAD)

Do we have any preference, or maybe other options? @vtjnash @giordano

P.S.: We cannot use the current initial-exec TLS model because we get `cannot allocate memory in static TLS block`, as glibc only reserves so much static TLS space (however, the amount can be changed at startup by setting the `glibc.rtld.optional_static_tls` glibc tunable).
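For reference, the initial-exec model is requested per variable with a GCC/clang attribute (or globally with `-ftls-model=initial-exec`). A minimal sketch of the annotation, with the tunable workaround noted in a comment; the variable name here is illustrative:

```c
#include <assert.h>

/* initial-exec TLS: the linker reserves a slot in glibc's static TLS
 * block at load time, so each access is just an offset from the thread
 * pointer (no __tls_get_addr, no malloc, async-signal-safe).  The
 * downside: a dlopen'ed library annotated this way fails to load with
 * "cannot allocate memory in static TLS block" once that reserve is
 * exhausted.  glibc can enlarge the reserve at startup, e.g.:
 *   GLIBC_TUNABLES=glibc.rtld.optional_static_tls=4096 ./julia
 */
static _Thread_local int rs_cache_slot
    __attribute__((tls_model("initial-exec")));

int touch_slot(void)
{
    rs_cache_slot = 1;
    return rs_cache_slot;
}
```

This is why the original libunwind patch linked above exists: stock libunwind uses this attribute, which prevents `dlopen`ing it reliably.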

This jemalloc issue comment has a brief explanation of the `dlopen` issue.

@vtjnash (Member) commented Jul 23, 2024

I thought I saw that libunwind now uses mmap to avoid that problem, as long as the pthreads implementation itself is safe.

@vtjnash (Member) commented Jul 23, 2024

Apparently Google proposed fixing glibc a decade ago, because it is required for MSan to be reliable, but I don't know if this was merged: https://sourceware.org/legacy-ml/libc-alpha/2014-12/msg00583.html

@andrebsguedes (Contributor, Author)

The proposed fix from Google is for `pthread_getspecific`, which is used with pthread keys, but the caches we are talking about use plain C11 `_Thread_local` variables that get lazily initialized within `__tls_get_addr` in glibc. Google also discussed making `__tls_get_addr` async-signal-safe for `dlopen`ed libraries here, but I am not aware of any real fix for it. After a quick look at the glibc source code, it appears that the `malloc` is still there.
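The distinction being drawn here can be shown with a minimal sketch contrasting the two TLS mechanisms (variable and function names are illustrative):

```c
#include <pthread.h>
#include <assert.h>
#include <stdint.h>

/* Mechanism 1 -- POSIX keys: values are read via pthread_getspecific,
 * the function Google's proposed glibc fix targets. */
static pthread_key_t tls_key;

/* Mechanism 2 -- C11 _Thread_local: the compiler emits the access
 * directly.  Under the global-dynamic model that dlopen requires, a
 * thread's first access goes through glibc's __tls_get_addr, which may
 * call malloc -- this is the path libunwind's caches actually use. */
static _Thread_local intptr_t c11_slot;

intptr_t read_both(void)
{
    pthread_key_create(&tls_key, NULL);
    pthread_setspecific(tls_key, (void *)21);
    c11_slot = 21;  /* first access in this thread may allocate TLS */
    return (intptr_t)pthread_getspecific(tls_key) + c11_slot;
}
```

A fix for `pthread_getspecific` alone would therefore not help the `_Thread_local` path discussed in this PR.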

@vtjnash (Member) commented Jul 23, 2024

Could you add a function to libunwind that simply accesses that variable to force it to get initialized? We can call that from `jl_adopt_thread`.

@andrebsguedes (Contributor, Author) commented Jul 23, 2024

@vtjnash I added a function to the patch that simply accesses a dummy `_Thread_local` variable, as this is enough to allocate the DTV entry that is also used by the other `_Thread_local` variables (I verified this). I had to call it from `jl_init_threadtls` instead of `jl_adopt_thread` because the latter is only called by foreign threads and we want all threads to run this. I also managed to confirm by stepping with GDB that the first call to the `unw_ensure_tls` function always allocates and subsequent `jl_unw_step` calls do not, whereas previously `jl_unw_step` would indeed allocate within `__tls_get_addr` on the first call.
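The approach can be sketched as follows; `unw_ensure_tls` is the function name used in the patch, but the body below is an illustrative reconstruction, not the patch's actual code:

```c
#include <assert.h>

/* Touching any _Thread_local in a dlopen'ed module forces glibc to
 * allocate that module's DTV entry (via malloc inside __tls_get_addr).
 * Doing this once per thread from ordinary context -- Julia calls it
 * from jl_init_threadtls -- means later register-state-cache accesses
 * from a signal handler no longer allocate, and are therefore
 * async-signal-safe. */
_Thread_local volatile char unw_tls_probe;

void unw_ensure_tls(void)
{
    unw_tls_probe = 1;  /* first access per thread triggers the allocation */
}
```

After this runs on a thread, signal-handler unwinds on that thread hit already-allocated TLS.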

@giordano (Contributor)

Windows fails to build due to linker errors related to libunwind.

@andrebsguedes (Contributor, Author)

@giordano Is there a way to restart only the `build powerpc64le-linux-gnuassert` job that timed out? It looks unrelated.

@giordano (Contributor)

I restarted it.

@andrebsguedes (Contributor, Author)

@giordano Can you restart `test x86_64-linux-gnuassertrr-net` and `test x86_64-linux-gnuassertrr`? The failures look flaky to me.

@giordano (Contributor)

Yeah, restarted. That's #55235, it's hitting very frequently lately.

@vtjnash (Member) commented Jul 25, 2024

FWIW, we are running into that frequently now due to Pkg bugs. The error message printed there is just our attempted error recovery not being successful.

@andrebsguedes (Contributor, Author)

The `test x86_64-linux-gnuassertrr-net` job seems to be timing out with `Process failed to exit within 7200s, requesting termination (SIGTERM) of PID` before running into that assertion on shutdown. I see the same pattern on other builds.

@andrebsguedes (Contributor, Author)

I see, it hangs there because Pkg tests never finish.

@giordano (Contributor)

This is finally all green. Are we good to go now? There haven't been substantial changes since the last approval.

@andrebsguedes (Contributor, Author)

Yeah, the only changes are the version bump of the package and some minor build fixes; I think we are good.

@andrebsguedes (Contributor, Author)

@vtjnash gentle nudge here : )

@IanButterworth IanButterworth merged commit 5a904ac into JuliaLang:master Jul 31, 2024
7 checks passed
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Aug 17, 2024
fingolfin pushed a commit that referenced this pull request Aug 31, 2024
Assuming non-Windows and libunwind not disabled:

The flag `-DLLVMLIBUNWIND` is currently set on macOS only for `USE_SYSTEM_UNWIND=0`, which seems wrong to me and causes build issues for macOS on Yggdrasil in combination with the recent #55049, which should only affect GNU libunwind (`error: call to undeclared function 'unw_ensure_tls'`). This flag is now set independently of the system-libunwind flag (on Darwin and OpenBSD, as before).

`LIBUNWIND=-lunwind` is set for `USE_SYSTEM_UNWIND=0` || (`USE_SYSTEM_UNWIND=1` && `OS != Darwin`). I don't think the check for Darwin makes sense; it might be a leftover from using osxunwind a (long) while ago. Changed that to always set `-lunwind` if enabled.

x-ref: JuliaPackaging/Yggdrasil#9331
KristofferC pushed a commit that referenced this pull request Sep 12, 2024