StackCollector is causing SIGSEGV #11158
Segfaults are bad, and this does look like one. That said, it looks like the top frame is the CPython signal handler. I see that this is on CPython 3.12; what's the patch version? Also, what's the ddtrace version? Thanks! |
This happened on 3.11.x (can't remember which) and 3.12.7. We're on 2.4.2 of the agent currently, due to #11141. It is the CPython signal handler, and once it's run, it goes back to the PyDict_SetItem call and immediately segfaults again. It doesn't leave the signal handler either, due to how Python works: https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers This tracks with the behaviour we're seeing.
Because of the Cython/pyx usage in the StackCollector code, the Python signal handlers never actually get run. edit: :eye-twitch: we do have signal handlers; removing them. |
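For context on the linked docs, here's a tiny illustrative snippet (hypothetical, not from our app or from ddtrace): a Python-level handler registered like this only ever runs between bytecode instructions on the main thread, because the C-level handler just sets a flag. If the faulting instruction is immediately re-executed inside Cython code that never returns to the eval loop, the Python handler never gets a chance to run.

```python
import signal

def on_segv(signum, frame):
    # Illustrative only: Python-level handlers are deferred until the
    # interpreter is back between bytecode instructions on the main thread.
    # Handling SIGSEGV from pure Python is unreliable in practice.
    print("SIGSEGV observed, current frame:", frame)

signal.signal(signal.SIGSEGV, on_segv)
```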
I'm glad to know why it's hanging instead of crashing, and I take it the fact that it hangs rather than crashes is what made it possible to catch that the collected stack is stale? |
Yeah, I'm not 100% sure about this one. The behavior is almost, but not quite, representative of a class of defects we started observing in Python 3.11. Assuming that we're looking at a similar defect, here are some suggestions for avoiding the segmentation faults:
|
We should be able to give a newer version a shot, but uhh- this is what our docker container's entrypoint.sh looks like.
Is there any documentation outlining these suggestions? Python 3.11 was released October 24, 2022, so it seems kinda wild that these issues are still ongoing two years later. |
Ah--yeah; 2.8.0 is the first
|
For #4, isn't it still going to have to go through
For this issue, is it a case of Python pausing the StackCollector's thread, releasing the GIL, and then, when the StackCollector's thread gets the GIL again, finding that the frame it's iterating through is full of invalid references? |
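To make sure I'm picturing the mechanism right, here's a minimal pure-Python sketch of the sampling idea (an assumption on my part; ddtrace's StackCollector does this in Cython against interpreter internals, which is where stale state could segfault rather than just yield an out-of-date stack):

```python
import sys
import threading
import time

# Minimal sketch of what a sampling profiler does conceptually; not
# ddtrace's actual implementation. sys._current_frames() snapshots each
# thread's topmost frame. Because this is pure Python, the frame objects
# are kept alive by reference counting, so walking f_back here is safe.
# The hazard described above arises when the walk happens at the C level
# against raw interpreter state: once the target thread reacquires the
# GIL and keeps running, that state can go stale.
def sample_once():
    for thread_id, frame in sys._current_frames().items():
        stack = []
        while frame is not None:
            stack.append(f"{frame.f_code.co_filename}:{frame.f_lineno}")
            frame = frame.f_back
        print(thread_id, stack)

def run_sampler(interval=0.01, samples=5):
    for _ in range(samples):
        sample_once()
        time.sleep(interval)

if __name__ == "__main__":
    threading.Thread(target=lambda: time.sleep(1), daemon=True).start()
    run_sampler()
```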
We are experiencing this same issue after upgrading to Python 3.11. Not only is it causing crashes, but combined with gunicorn and ddtrace-run, the memory from the crashed worker is never recovered, eventually causing OOM issues. |
@lightswitch05 the OOM issue is unsettling--I'd guess it's similar to the behavior reported in the OP. Which version of ddtrace are you running? |
2.15.1 - we upgraded to the latest version (at the time) as an attempted fix before just disabling it. Prior to that, we also experienced it with 2.14.4 and 2.13.0. We did not have the issue on 2.12.2, but that was also prior to upgrading to Python 3.11. Unfortunately, we've been unable to reliably reproduce the issue, and it has only happened in production, so I'm not able to provide any more debugging details. Also - and it's not really surprising - but the memory leak was not apparent in the APM profiler. The leak was clearly visible when reviewing Kubernetes pod memory usage. |
One thing to note is that Kubernetes RSS is the sum of the container RSS plus Kubernetes ephemeral storage (and some other Kubernetes things, I don't have a comprehensive list). What I can imagine happening is that if your system is configured to emit corefiles to ephemeral storage, then those corefiles will continue to count against the pod RSS until you hit an inevitable OOM.

In Datadog, you can split out the Kubernetes ephemeral storage metric (sorry, I forget exactly what it's called, but it should come up right away in the search). If what I'm saying is accurate, you'd see the ephemeral storage look like a step function over time. On the other hand, you'd see container (not Kubernetes) RSS look pretty normal. It's also possible that workers are getting their corefiles written into a tmpfs (probably

Naturally, I don't know whether this is truly indicative of your system. I just wanted to mention this dynamic because, if it is accurate, it should be noted that many users prefer to disable corefile generation in their prod environments exactly to prevent crashes from turning into future OOMs.

Anyway, since you're on a relatively recent version of |
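As a hedged aside on the corefile point (my own suggestion, not something from the thread above): one way to keep crashing workers from leaving core dumps on ephemeral storage is to zero the core-file rlimit early in startup, for example in a gunicorn config module; `ulimit -c 0` in the container entrypoint does the same.

```python
import resource

# Disable core-dump generation for this process (and the workers it forks),
# so a SIGSEGV in a worker cannot fill ephemeral storage with corefiles.
# Equivalent to running `ulimit -c 0` before starting the app.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```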
This is great actionable information! |
When this happens, it goes into the below loop. This thread never releases the GIL. The debug logs stop being written. I suspect there's an issue in our code, but ddtrace-py should never keep a process alive that's receiving SIGSEGV. I mean, Python is trying to access memory at 0x2; something has gone horribly wrong. I suspect that is happening because it's running Cython, and it's continually trying to re-run the same Python op-code, getting the signal, re-running, forever, never yielding the GIL. I should not have had to use GDB to suss out the SIGSEGV. It then goes into the above loop again if you keep issuing `next` commands.

edit: If I go and look at the thread it's trying to capture this for and look at the backtrace, it is NOT at line 148 in REDACTED.py. It has gone further along.
Could the Profiler be trying to access a stale frame, causing the SIGSEGV?