-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(profiling): bound memory growth for internal string table #11310
base: main
Are you sure you want to change the base?
Conversation
|
BenchmarksBenchmark execution time: 2024-11-08 17:24:43 Comparing candidate commit 2c023d9 in PR branch Found 2 performance improvements and 0 performance regressions! Performance is the same for 386 metrics, 2 unstable metrics. scenario:iast_aspects-ospathsplitext_aspect
scenario:iast_aspects-rsplit_aspect
|
It's not currently working as intended and not clear that there's a path forward for that test that won't be either too sensitve (flaky or tied to implementation details) or not sensitive enough (don't actually catch the problem). Remove it for now and rely on the relenv to evaluate the fix.
{ | ||
const std::lock_guard<std::mutex> lock(profile_mtx); | ||
strings.clear(); | ||
string_storage.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe to do? What I'm worried about is if something uses insert_or_get
, gets a string view pointing to the strings we're storing here, and then we call reset
while something still holds on to the string view. cc @sanchda @taegyunkim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically I had wondered about here, where we release the profile lock and then clear it out. Is there a window where other parts of the program can add stuff to the string table (and get references into it) before we clear it?
dd-trace-py/ddtrace/internal/datadog/profiling/dd_wrapper/src/ddup_interface.cpp
Lines 308 to 312 in 2c023d9
{ | |
uploader.upload(Datadog::Sample::profile_borrow()); | |
Datadog::Sample::profile_release(); | |
Datadog::Sample::profile_clear_state(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, as far as I know state is cleared when uploading has been done.
dd-trace-py/ddtrace/internal/datadog/profiling/dd_wrapper/src/ddup_interface.cpp
Lines 309 to 311 in 6272798
uploader.upload(Datadog::Sample::profile_borrow()); | |
Datadog::Sample::profile_release(); | |
Datadog::Sample::profile_clear_state(); |
And when clearing state if we clear everything properly in the profile state, it should be ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there's the profile_release
call there that releases the profile lock. Then profile_clear_state
clears out some data structures, under the lock for each thing it does. Another way to phrase my question is, can samples be added between the times the lock is released here? Like, between lines 310 and 311?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that can happen, and those samples would be lost, since they would be cleared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... the more I think about this the less I'm sure there's a safe way to do this without bigger changes. The problem is that when we build samples, we hold references into the string table across multiple critical sections. So for something like this:
auto s = dup_start_sample();
ddup_push_task_name(s, "abc");
ddup_flush_sample(s);
The ddup_push_task_name
call adds a copy of "abc"
to the string table and holds that reference via s
. Then, ddup_flush_sample(s)
reads that string table reference when passing the sample into libdatadog. Between those two calls, we can't drop the string from the table or we'll deallocate it and ddup_flush_sample
will read a dangling pointer. But the profiling lock isn't held between those two calls so the profiler might concurrently drop the string table (regardless of where we put the call to drop the string table). It seems like we need a different approach...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the following let us avoid this issue?
hold the profile lock
- upload
- clear the state
and release the lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately no. The problem is that we build a sample across multiple API calls, and the profile (specifically the string table) is not locked between those calls. So even if we clear the state under the lock, we can get dangling references if we happen to be concurrently building a profile sample. I'm working on a few ideas for how to change this.
The profiler interns strings in an internal buffer. This buffer is currently not cleared.
But it can contain arbitrary user-generated strings (e.g. asyncio
Task
names),and thus can grow without bound over the lifetime of the program.
Clear the string buffer after every profile upload to mitigate the problem.
This PR doesn't currently have a regression test. I took a stab at writing one
in de29bfb but it's a little tricky to test reliably. For now, we can rely on the relenv
to catch this since we originally saw the problem there.
Checklist
Reviewer Checklist
PROF-10824