Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TlsRandomNumberGenerator causes memory corruption when the process is forked. #2408

Open
ghalliday opened this issue Nov 16, 2023 · 3 comments
Labels
bug Something isn't working do-not-stale triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ghalliday
Copy link

I manage an open source project https://github.com/hpcc-systems/HPCC-Platform which we adding support to open telemetry using this library.

One of the components launches large numbers (100+) of child processes at the same time. Since adding the open telemetry library this is now the system to core when those child processes are started. Here is an example of a stack trace from gdb:

#0  0x00007fc40090eb5d in read () from /usr/lib64/libc.so.6
#1  0x00007fc40089ad54 in __GI__IO_file_underflow () from /usr/lib64/libc.so.6
#2  0x00007fc400899808 in __GI__IO_file_xsgetn () from /usr/lib64/libc.so.6
#3  0x00007fc40088e1ef in fread () from /usr/lib64/libc.so.6
#4  0x00007fc4011baa50 in std::random_device::_M_getval() () from /usr/lib64/libstdc++.so.6
#5  0x00007fc40183995b in opentelemetry::v1::sdk::common::(anonymous namespace)::TlsRandomNumberGenerator::Seed() ()
   from /opt/HPCCSystems/lib/libopentelemetry_common.so
#6  0x00007fc4008e4c4e in fork () from /usr/lib64/libc.so.6
#7  0x00007fc4036aeb69 in CLinuxPipeProcess::run (this=0x7fc352cb63a0)
    at /hpcc-dev/HPCC-Platform/system/jlib/jthread.cpp:2072

I believe the problem is caused by the following code:

  TlsRandomNumberGenerator() noexcept
  {
    Seed();
    platform::AtFork(nullptr, nullptr, OnFork);
  }

where an onFork handler was added inside the random number generator made in 2018.

There are only a very restricted set of functions that are valid to be called at that point within the child process (https://man7.org/linux/man-pages/man3/pthread_atfork.3.html): they must be async-signal-safe, and not use any heap functions because that can cause memory corruption. This onFork call does not obey those restrictions. It also performs unnecessary work whenever you create a child process (I suspect it now takes a long time to start the child processes because of the need to wait for sufficient entropy from the random number generator).

I believe the fix is to delete that call to onFork, and re-examine why that change was made. In particular, why would a process ever call the open telemetry functions inside the child of a fork()?

@ghalliday ghalliday added the bug Something isn't working label Nov 16, 2023
@github-actions github-actions bot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 16, 2023
@ghalliday
Copy link
Author

Here is my suggested replacement code (untested):

namespace
{
class TlsRandomNumberGenerator
{
public:
  TlsRandomNumberGenerator() noexcept
  {
    Seed();
  }

  FastRandomNumberGenerator & engine() noexcept { return engine_; }

private:
  FastRandomNumberGenerator engine_;

  void Seed() noexcept
  {
    std::random_device random_device;
    std::seed_seq seed_seq{random_device(), random_device(), random_device(), random_device()};
    engine_.seed(seed_seq);
  }
};

}  // namespace

FastRandomNumberGenerator &Random::GetRandomNumberGenerator() noexcept
{
  static thread_local TlsRandomNumberGenerator random_number_generator{};
  return random_number_generator.engine();
}


@marcalff marcalff added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 29, 2023
Copy link

This issue was marked as stale due to lack of activity.

@github-actions github-actions bot added the Stale label Jan 30, 2024
@rpastrana
Copy link

How do we remove the stale label off this jira?

@lalitb lalitb added do-not-stale and removed Stale labels Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working do-not-stale triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants