StackCollector is causing SIGSEGV #11158

Open
lattwood opened this issue Oct 24, 2024 · 12 comments
Labels: Profiling Continous Profling

lattwood commented Oct 24, 2024

When this happens, the process gets stuck in the loop shown below. The thread never releases the GIL, and the debug logs stop being written.

I suspect there's an issue in our code, but ddtrace-py should never keep a process alive that's receiving SIGSEGV. I mean, Python is trying to access memory at 0x2, so something has gone horribly wrong.

I suspect that happens because the collector is running Cython: it keeps re-running the same Python opcode, receiving the signal, and re-running it again, forever, never yielding the GIL.

I should not have had to use GDB to suss out the SIGSEGV.

// GDB OUTPUT
(gdb) next
signal_handler (sig_num=11) at ./Modules/signalmodule.c:348
348     ./Modules/signalmodule.c: No such file or directory.
(gdb) next
350     in ./Modules/signalmodule.c
(gdb) next
368     in ./Modules/signalmodule.c
(gdb) next

Thread 3 "python3" received signal SIGSEGV, Segmentation fault.
PyDict_SetItem (value=<unknown at remote 0x2>, key='scan',
    op={'ts': <datetime.datetime at remote 0xffff49506880>, 'entry': {'queue_name': 'BLAH', 'tasks': {'task__0': {'response': True}, 'task__1': {'response': True}}}, 'events': [], 'region_config': <LambdaRegion(region='us-west-2', ) at remote 0xffff7d934a50>,'logger': <TaskLogger(logger=<ExtraLogger(filters=[], name='cs.redacted-worker', level=20, parent=<RootLogger(filters=[], name='root', level=20, parent=None, propagate=False, handlers=[<StreamHandler(filters=[], _name=None, level=0, formatter=<FunctionFormatter(_format_func=<function at remote 0xffff4a26bba0>, datefmt='%Y-%m-%dT%H:%M:%S.%f+') at remote 0xffff495c4860>, _closed=False, lock=<_thread.RLock at remote 0xffff7e283ac0>, stream=<_io.TextIOWrapper(mode='w') at remote 0xffffa4e81b60>) at remote 0xffff7e1d05f0>], disabled=False, _cache={}) at remot...(truncated)) at Objects/dictobject.c:1884
1884    Objects/dictobject.c: No such file or directory.
(gdb) next
signal_handler (sig_num=11) at ./Modules/signalmodule.c:348
348     ./Modules/signalmodule.c: No such file or directory.
(gdb) next
350     in ./Modules/signalmodule.c
(gdb) next
368     in ./Modules/signalmodule.c
(gdb) next

Thread 3 "python3" received signal SIGSEGV, Segmentation fault.
PyDict_SetItem (value=<unknown at remote 0x2>, key='scan',
    op={'ts': <datetime.datetime at remote 0xffff49506880>, 'entry': {'queue_name': 'BLAH, 'tasks': {'task__0': {'response': True}, 'task__1': {'response': True}}}, 'events': [], 'region_config': <LambdaRegion(region='us-west-2', ) at remote 0xffff7d934a50>,'logger': <TaskLogger(logger=<ExtraLogger(filters=[], name='cs.redacted-worker', level=20, parent=<RootLogger(filters=[], name='root', level=20, parent=None, propagate=False, handlers=[<StreamHandler(filters=[], _name=None, level=0, formatter=<FunctionFormatter(_format_func=<function at remote 0xffff4a26bba0>, datefmt='%Y-%m-%dT%H:%M:%S.%f+') at remote 0xffff495c4860>, _closed=False, lock=<_thread.RLock at remote 0xffff7e283ac0>, stream=<_io.TextIOWrapper(mode='w') at remote 0xffffa4e81b60>) at remote 0xffff7e1d05f0>], disabled=False, _cache={}) at remot...(truncated)) at Objects/dictobject.c:1884
1884    Objects/dictobject.c: No such file or directory.
(gdb) next
signal_handler (sig_num=11) at ./Modules/signalmodule.c:348
348     ./Modules/signalmodule.c: No such file or directory.

It then goes into the above loop again if you keep issuing next commands.

(gdb) bt
#0  signal_handler (sig_num=11) at ./Modules/signalmodule.c:348
#1  <signal handler called>
#2  0x0000ffffa5381b68 in Py_INCREF (op=<unknown at remote 0x2>) at ./Include/object.h:642
#3  _Py_NewRef (obj=<unknown at remote 0x2>) at ./Include/object.h:814
#4  PyDict_SetItem (value=<unknown at remote 0x2>, key='scan',
    op={'ts': <datetime.datetime at remote 0xffff49506880>, 'entry': {'queue_name': 'BLAH', 'tasks': {'task__0': {'response': True}, 'task__1': {'response': True}}}, 'events': [], 'region_config': <LambdaRegion(region='us-west-2') at remote 0xffff7d934a50>, 'scan_region': 'us', 'logger': <TaskLogger(logger=<ExtraLogger(filters=[], name='cs.redacted-worker', level=20, parent=<RootLogger(filters=[], name='root', level=20, parent=None, propagate=False, handlers=[<StreamHandler(filters=[], _name=None, level=0, formatter=<FunctionFormatter(_format_func=<function at remote 0xffff4a26bba0>, datefmt='%Y-%m-%dT%H:%M:%S.%f+') at remote 0xffff495c4860>, _closed=False, lock=<_thread.RLock at remote 0xffff7e283ac0>, stream=<_io.TextIOWrapper(mode='w') at remote 0xffffa4e81b60>) at remote 0xffff7e1d05f0>], disabled=False, _cache={}) at remot...(truncated)) at Objects/dictobject.c:1884
#5  dict_ass_sub (mp=0xffff494ec200, v='scan', w=<unknown at remote 0x2>) at Objects/dictobject.c:2526
#6  0x0000ffffa5466114 in _PyFrame_GetLocals (frame=0xffffa5882f78, include_hidden=include_hidden@entry=1) at Objects/frameobject.c:1266
#7  0x0000ffffa535c38c in frame_getlocals (
    f=Frame 0xffff4bcf09d0, for file REDACTED.py, line 148, in resolve_scan_check (ts=<datetime.datetime at remote 0xffff49506880>, entry={'queue_name': 'BLAH', 'tasks': {'task__0': {'response': True}, 'task__1': {'response': True}}}, events=[], region_config=<LambdaRegion(region='us-west-2') at remote 0xffff7d934a50>, scan_region='us', logger=<TaskLogger(logger=<ExtraLogger(filters=[], name='cs.redacted-worker', level=20, parent=<RootLogger(filters=[], name='root', level=20, parent=None, propagate=False, handlers=[<StreamHandler(filters=[], _name=None, level=0, formatter=<FunctionFormatter(_format_func=<function at remote 0xffff4a26bba0>, datefmt='%Y-%m-%dT%H:%M:%S.%f+') at remote 0xffff495c4860>, _closed=False, lock=<_thread.RLock at remote 0xffff7e283ac0>, stream=<_io.Tex...(truncated), closure=<optimized out>) at Objects/frameobject.c:32
#8  0x0000ffffa53af29c in _PyObject_GenericGetAttrWithDict (
    obj=Frame 0xffff4bcf09d0, for file REDACTED.py, line 148, in resolve_scan_check (ts=<datetime.datetime at remote 0xffff49506880>, entry={'queue_name': 'BLAH', 'tasks': {'task__0': {'response': True}, 'task__1': {'response': True}}}, events=[], region_config=<LambdaRegion(region='us-west-2') at remote 0xffff7d934a50>, scan_region='us', logger=<TaskLogger(logger=<ExtraLogger(filters=[], name='cs.redacted-worker', level=20, parent=<RootLogger(filters=[], name='root', level=20, parent=None, propagate=False, handlers=[<StreamHandler(filters=[], _name=None, level=0, formatter=<FunctionFormatter(_format_func=<function at remote 0xffff4a26bba0>, datefmt='%Y-%m-%dT%H:%M:%S.%f+') at remote 0xffff495c4860>, _closed=False, lock=<_thread.RLock at remote 0xffff7e283ac0>, stream=<_io.Tex...(truncated), name='f_locals', dict=0x0, suppress=0) at ./Include/object.h:220
#9  0x0000ffffa1ac7b1c in __pyx_f_7ddtrace_9profiling_9collector_10_traceback__extract_class_name.constprop.0 () from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_traceback.cpython-312-aarch64-linux-gnu.so
#10 0x0000ffffa1acb418 in __pyx_pw_7ddtrace_9profiling_9collector_10_traceback_5pyframe_to_frames () from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/_traceback.cpython-312-aarch64-linux-gnu.so
#11 0x0000ffffa1a4e6f8 in __pyx_f_7ddtrace_9profiling_9collector_5stack_stack_collect.constprop.0 () from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/stack.cpython-312-aarch64-linux-gnu.so
#12 0x0000ffffa1a5f8ec in __pyx_pw_7ddtrace_9profiling_9collector_5stack_14StackCollector_13collect () from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/profiling/collector/stack.cpython-312-aarch64-linux-gnu.so
#13 0x0000ffffa53a7874 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<_cython_3_0_11.cython_function_or_method at remote 0xffffa1d3f370>, tstate=0xffff94000b70)
    at ./Include/internal/pycore_call.h:92
#14 PyObject_Vectorcall (callable=<_cython_3_0_11.cython_function_or_method at remote 0xffffa1d3f370>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:325
#15 0x0000ffffa538dc04 in _PyEval_EvalFrameDefault (tstate=0xffff94000b70, frame=0xffffa5849020, throwflag=<optimized out>) at Python/bytecodes.c:2715
#16 0x0000ffffa53d5e68 in _PyFunction_Vectorcall (kwnames=0x0, nargsf=1, stack=0xffffa102c7f8, func=<function at remote 0xffffa1d16ca0>) at Objects/call.c:419
#17 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0xffffa102c7f8, callable=<function at remote 0xffffa1d16ca0>, tstate=0xffff94000b70) at ./Include/internal/pycore_call.h:92
#18 method_vectorcall (method=<optimized out>, args=0x0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/classobject.c:69
#19 0x0000ffffa537f080 in _PyObject_VectorcallTstate (tstate=0xffff94000b70, callable=<method at remote 0xffffa1cd6400>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#20 0x0000ffffa2fb327c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<PeriodicThread_start(periodic_thread*, _object*)::{lambda()#1}> > >::_M_run() ()
   from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/internal/_threads.cpython-312-aarch64-linux-gnu.so
#21 0x0000ffffa2fb363c in execute_native_thread_routine () from /usr/src/.venv/lib/python3.12/site-packages/ddtrace/internal/_threads.cpython-312-aarch64-linux-gnu.so
#22 0x0000ffffa50dee30 in ?? () from /lib/aarch64-linux-gnu/libc.so.6
#23 0x0000ffffa5147adc in ?? () from /lib/aarch64-linux-gnu/libc.so.6

edit: If I go and look at the thread whose stack it's trying to capture and check that thread's backtrace, it is NOT at line 148 in REDACTED.py. It has moved further along.

Could the Profiler be trying to access a stale frame, causing the SIGSEGV?
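
For context, here is a rough pure-Python sketch of what I understand a GIL-holding sampler to be doing (this is not ddtrace's actual code, just an illustration): walk every thread's frame chain and touch f_locals, which is the _PyFrame_GetLocals / PyDict_SetItem path in the backtrace above.

    import sys
    import threading
    import time

    def sample_once():
        # Illustrative only: ddtrace's collector does the equivalent in Cython.
        for thread_id, frame in sys._current_frames().items():
            if thread_id == threading.get_ident():
                continue  # skip the sampling thread itself
            while frame is not None:
                # Touching f_locals makes CPython materialize the frame's
                # locals dict (the _PyFrame_GetLocals call in frame #6 above).
                names = list(frame.f_locals)
                print(thread_id, frame.f_code.co_name, frame.f_lineno, names)
                frame = frame.f_back

    def busy_worker():
        total = 0
        for i in range(5_000_000):
            total += i  # keep a frame with live locals around to be sampled
        return total

    t = threading.Thread(target=busy_worker, daemon=True)
    t.start()
    time.sleep(0.01)
    sample_once()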


sanchda commented Oct 24, 2024

Segfaults are bad, and it does look like ddtrace is segfaulting. Taking a look at that. I think your hypothesis is valid.

That said, it looks like the top frame is the CPython signal handler. As far as I know, ddtrace does not set this handler. Do you have the Python faulthandler enabled? The reason I ask is that crashing a process is bad enough, but crashing a process in a way that forces it to gobble an entire core while also being unresponsive is worse, and it'd be valuable to understand the circumstances in which that occurs.

I see that this is on CPython 3.12. What's the patch version? Also, what's the ddtrace version? Thanks!

@sanchda sanchda self-assigned this Oct 24, 2024
@sanchda sanchda added the Profiling Continous Profling label Oct 24, 2024

lattwood commented Oct 24, 2024

This happened on 3.11.x (can't remember which) and 3.12.7.

We're on 2.4.2 of the agent currently, due to #11141.

It is the CPython signal handler, and once it's run, it goes back to the PyDict_SetItem call and immediately segfaults again.

It doesn't leave the signal handler either, due to how Python works: https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers

> It makes little sense to catch synchronous errors like [SIGFPE](https://docs.python.org/3/library/signal.html#signal.SIGFPE) or [SIGSEGV](https://docs.python.org/3/library/signal.html#signal.SIGSEGV) that are caused by an invalid operation in C code. Python will return from the signal handler to the C code, which is likely to raise the same signal again, causing Python to apparently hang. From Python 3.3 onwards, you can use the [faulthandler](https://docs.python.org/3/library/faulthandler.html#module-faulthandler) module to report on synchronous errors.

This tracks with the behaviour we're seeing.

> A long-running calculation implemented purely in C (such as regular expression matching on a large body of text) may run uninterrupted for an arbitrary amount of time, regardless of any signals received. The Python signal handlers will be called when the calculation finishes.

Because of the Cython/pyx usage in the StackCollector code, the Python-level signal handlers never actually get run.
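
For reference, the faulthandler approach the docs point at installs a C-level handler, so it can still report these synchronous faults, unlike a Python-level handler registered with signal.signal(). A minimal sketch:

    import faulthandler
    import signal
    import sys

    # faulthandler registers C-level handlers for SIGSEGV/SIGFPE/SIGABRT/
    # SIGBUS/SIGILL, so it can dump a traceback even when the fault happens
    # inside long-running C or Cython code.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    # A Python-level handler like this one only runs between bytecode
    # instructions, so it never fires while the fault keeps recurring in
    # C code - which matches the spinning behaviour above.
    def python_level_handler(signum, frame):
        print(f"got signal {signum}", file=sys.stderr)

    signal.signal(signal.SIGSEGV, python_level_handler)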

edit: :eye-twitch: we do have signal handlers. removing them

@lattwood

I'm glad to know why it's hanging instead of crashing, and I suppose the fact that it's been hanging instead of crashing is what made it possible to catch that the collected stack is stale?

@lattwood lattwood changed the title StackCollector is constantly causing SIGSEGV which never crash the app StackCollector is causing SIGSEGV Oct 24, 2024

sanchda commented Nov 1, 2024

Yeah, I'm not 100% sure about this one. The behavior is almost, but not quite, representative of a class of defects we started observing in Python 3.11. Assuming that we're looking at a similar defect, here are some suggestions for avoiding the segmentation faults:

  1. Since TypeError endpoint profiling for stack v2 #11141 has been resolved, you should be able to run a contemporary version of ddtrace. The latest should be the greatest. I advise setting the DD_PROFILING_STACK_V2_ENABLED=true environment variable, which switches to our new stack collector implementation. I have a sales pitch about how this feature has less sampling bias and overhead, but what matters is that it should eliminate these segfaults, and it's practically at feature parity with the "old" sampler (we're going to try to make this the default sometime this year).
  2. You can disable just stack sampling (DD_PROFILING_STACK_ENABLED=false).


lattwood commented Nov 1, 2024

We should be able to give a newer version a shot, but uhh, this is what our Docker container's entrypoint.sh looks like:


    export PYTHONUNBUFFERED=1 
    # Temporarily enable debug logging for troubleshooting.
    export DD_TRACE_DEBUG=true
    export DD_TRACE_LOG_FILE=ddtrace_logs.log
    export DD_TRACE_LOG_FILE_LEVEL=DEBUG
    export DD_TRACE_LOG_STREAM_HANDLER=false
    export DD_PROFILING_STACK_V2_ENABLED=true
    export DD_PROFILING_EXPORT_LIBDD_ENABLED=true
    export DD_PROFILING_TIMELINE_ENABLED=true
    exec ddtrace-run python3 worker.py

Is there any documentation outlining these suggestions? Python 3.11 was released October 24, 2022, so it seems kinda wild that these issues are still ongoing two years later.


sanchda commented Nov 1, 2024

Ah--yeah; 2.8.0 is the first ddtrace version where that v2 flag was used for anything, so it makes sense that you wouldn't see any advantage to enabling it in the current version.

> Is there any documentation outlining these suggestions?

Not really. Right now it's an opt-in, pre-GA component. I think later this month we'll do a doc drop with some more information on what it is and does, but briefly:

  1. Profiling consists of a few different sampling components (stack--cpu/walltime/exceptions; memory--allocations/live-heap; locks)
  2. The "old" stack profiler is suboptimal for a few reasons. It takes the GIL to perform a sample (expensive), and in so doing it can only observe threads in their state when they are taken off-GIL--most threads don't transition because they've hit their bytecode budget in the scheduler, but rather because they hit a convenient point, so this creates bias.
  3. The 3.11 stability issue I mentioned before (and which I'll say a few more words about below).
  4. Stack V2 is intended to fix these problems. It takes sampling off-GIL.

> so it seems kinda wild that these issues are still ongoing two years later.

Yeah, it's wild. I can't say too much about it, but basically it took over half a year for customers to start reporting issues (it's not super common); when they did emerge, we spent a huge amount of time trying to fix what we had, and only then did we realize we probably had to rebuild everything from scratch (and then polish it to a comparable level of integration and maturity as what we had).


lattwood commented Nov 1, 2024

For #4, isn't it still going to have to go through frame.f_locals to copy the local variables and resolve the name of the class in the sample, which is where the SIGSEGV was happening? Since it's not holding the GIL, I would expect that to be even racier.
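
(To illustrate what I mean, I'm assuming the class-name resolution looks something like the sketch below; the helper and its exact logic are my guess at what ddtrace's _extract_class_name does, not its actual code. The point is that it has to read f_locals.)

    def extract_class_name(frame):
        # Hypothetical sketch: infer the class for a frame by looking at
        # its first argument ("self"/"cls") via f_locals.
        code = frame.f_code
        if code.co_argcount == 0:
            return None
        first_arg = code.co_varnames[0]
        if first_arg == "self":
            instance = frame.f_locals.get("self")
            if instance is not None:
                return type(instance).__name__
        if first_arg == "cls":
            cls = frame.f_locals.get("cls")
            if isinstance(cls, type):
                return cls.__name__
        return None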

For this issue, is it a case of Python pausing the StackCollector's thread, releasing the GIL, and then when the StackCollector's thread gets the GIL again, the frame it's iterating through is full of invalid references?

@lightswitch05

We are experiencing this same issue after upgrading to Python 3.11. Not only is it causing crashes, but combined with gunicorn and ddtrace-run, the memory from the crashed worker is never recovered, eventually causing OOM issues.


sanchda commented Nov 6, 2024

@lightswitch05 the OOM issue is unsettling--I'd guess it's similar to the behavior reported in the OP. Which version of ddtrace are you running?


lightswitch05 commented Nov 6, 2024

2.15.1 - we upgraded to the latest version (at the time) as an attempted fix before just disabling it. Prior to that, we also experienced it with 2.14.4 and 2.13.0. We did not have the issue on 2.12.2, but that was also prior to upgrading to python 3.11. Unfortunately, we've been unable to reliably reproduce the issue, and it has only happened in production, so I'm not able to provide any more debugging details.

Also - and it's not really surprising - the memory leak was not apparent in the APM profiler. The leak was clearly visible when reviewing Kubernetes pod memory usage.


sanchda commented Nov 6, 2024

@lightswitch05

> but the memory leak was not apparent in the APM profiler.

One thing to note is that Kubernetes RSS is the sum of the container RSS plus Kubernetes ephemeral storage (and some other Kubernetes things, I don't have a comprehensive list). What I can imagine happening is that, if your system is configured to emit corefiles to ephemeral storage, those corefiles will continue to count against the pod RSS until you hit an inevitable OOM.

In Datadog, you can split out the Kubernetes ephemeral storage metric (sorry, I forget exactly what it's called, but it should come up right away in the search). If what I'm saying is accurate, you'd see the ephemeral storage look like a step function over time. On the other hand, you'd see container (not Kubernetes) RSS look pretty normal.

It's also possible that workers are getting their corefiles written into a tmpfs (probably /tmp). In that case, container RSS and Kubernetes RSS would agree, and the ephemeral storage explanation would be wrong.

Naturally, I don't know whether this is truly indicative of your system. I just wanted to mention this dynamic because, if it is accurate, it's worth noting that many users prefer to disable corefile generation in their prod environments exactly to prevent crashes from turning into future OOMs.
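
For example, one way to turn core dumps off from inside the worker process itself (assuming a Linux container; ulimit -c 0 in the entrypoint or a host-level core_pattern change are the other usual routes) is a sketch like:

    import resource

    # Set both the soft and hard core-dump size limits to 0 so a crashing
    # worker doesn't leave a corefile behind to count against ephemeral
    # storage.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))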

Anyway, since you're on a relatively recent version of ddtrace, I'd also give export DD_PROFILING_STACK_V2_ENABLED=true a try. It's a feature we're working hard on and it has fixed these segfaults for several customers.

@lightswitch05

This is great actionable information! kubernetes.ephemeral_storage.usage grows on each worker crash - but container.memory.rss is pretty flat. I'm not sure if we can try out DD_PROFILING_STACK_V2_ENABLED or not given our inability to reproduce in a lower environment, but your recommendation for corefiles should solve the OOM issue at least. Thank you!
