Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(profiling): bound memory growth for internal string table #11310

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

nsrip-dd
Copy link
Contributor

@nsrip-dd nsrip-dd commented Nov 6, 2024

The profiler interns strings in an internal buffer. This buffer is currently not cleared.
But it can contain arbitrary user-generated strings (e.g. asyncio Task names),
and thus can grow without bound over the lifetime of the program.
Clear the string buffer after every profile upload to mitigate the problem.

This PR doesn't currently have a regression test. I took a stab at writing one
in de29bfb but it's a little tricky to test reliably. For now, we can rely on the relenv
to catch this since we originally saw the problem there.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

PROF-10824

Copy link
Contributor

github-actions bot commented Nov 6, 2024

CODEOWNERS have been resolved as:

releasenotes/notes/profiling-fix-unbounded-string-buffer-14108832cfd1d1a3.yaml  @DataDog/apm-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/profile.cpp           @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp            @DataDog/profiling-python

@pr-commenter
Copy link

pr-commenter bot commented Nov 6, 2024

Benchmarks

Benchmark execution time: 2024-11-08 17:24:43

Comparing candidate commit 2c023d9 in PR branch nick.ripley/fix-profiler-string-table-leak with baseline commit 6272798 in branch main.

Found 2 performance improvements and 0 performance regressions! Performance is the same for 386 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathsplitext_aspect

  • 🟩 execution_time [-413.078ns; -358.742ns] or [-10.369%; -9.005%]

scenario:iast_aspects-rsplit_aspect

  • 🟩 execution_time [-160.143ns; -138.134ns] or [-8.928%; -7.701%]

It's not currently working as intended and not clear that there's a path
forward for that test that won't be either too sensitve (flaky or tied
to implementation details) or not sensitive enough (don't actually catch
the problem). Remove it for now and rely on the relenv to evaluate the
fix.
@nsrip-dd nsrip-dd changed the title WIP: fix(profiling): bound memory growth for internal string table fix(profiling): bound memory growth for internal string table Nov 8, 2024
@nsrip-dd nsrip-dd marked this pull request as ready for review November 8, 2024 16:44
@nsrip-dd nsrip-dd requested review from a team as code owners November 8, 2024 16:44
{
const std::lock_guard<std::mutex> lock(profile_mtx);
strings.clear();
string_storage.clear();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe to do? What I'm worried about is if something uses insert_or_get, gets a string view pointing to the strings we're storing here, and then we call reset while something still holds on to the string view. cc @sanchda @taegyunkim

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically I had wondered about here, where we release the profile lock and then clear it out. Is there a window where other parts of the program can add stuff to the string table (and get references into it) before we clear it?

{
uploader.upload(Datadog::Sample::profile_borrow());
Datadog::Sample::profile_release();
Datadog::Sample::profile_clear_state();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, as far as I know state is cleared when uploading has been done.

uploader.upload(Datadog::Sample::profile_borrow());
Datadog::Sample::profile_release();
Datadog::Sample::profile_clear_state();

And when clearing state if we clear everything properly in the profile state, it should be ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there's the profile_release call there that releases the profile lock. Then profile_clear_state clears out some data structures, under the lock for each thing it does. Another way to phrase my question is, can samples be added between the times the lock is released here? Like, between lines 310 and 311?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that can happen, and those samples would be lost, since they would be cleared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... the more I think about this the less I'm sure there's a safe way to do this without bigger changes. The problem is that when we build samples, we hold references into the string table across multiple critical sections. So for something like this:

auto s = dup_start_sample();
ddup_push_task_name(s, "abc");
ddup_flush_sample(s);

The ddup_push_task_name call adds a copy of "abc" to the string table and holds that reference via s. Then, ddup_flush_sample(s) reads that string table reference when passing the sample into libdatadog. Between those two calls, we can't drop the string from the table or we'll deallocate it and ddup_flush_sample will read a dangling pointer. But the profiling lock isn't held between those two calls so the profiler might concurrently drop the string table (regardless of where we put the call to drop the string table). It seems like we need a different approach...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following let us avoid this issue?
hold the profile lock

  • upload
  • clear the state
    and release the lock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no. The problem is that we build a sample across multiple API calls, and the profile (specifically the string table) is not locked between those calls. So even if we clear the state under the lock, we can get dangling references if we happen to be concurrently building a profile sample. I'm working on a few ideas for how to change this.

@nsrip-dd nsrip-dd marked this pull request as draft November 11, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants