Worker autorestarts cause worker-dependent duplicate logs #3218
Comments
About asking cherrypy to disable its own autoreloading and clean up its "bus" on shutdown: we should probably improve the documentation, but I don't expect that to be related.
Thanks. I still believe the most obvious smoking gun is your mention of "up to the current max workers number of times". I do not understand those smoke signals, so for now I keep asking:
For reliable reproduction, you may need not just a significant request count (relative to the max_requests configuration) to trigger regular restarts, but also hanging requests, caused either by client behaviour or by a processing/DB bottleneck. Try firing a benchmarking tool and a manually issued "slow" request at the same (multi-process) test instance.
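One way to stage that combination is sketched below; the URLs, request counts, and the existence of a deliberately slow endpoint are assumptions for illustration, not anything from this issue:

```python
# Hypothetical reproduction driver: hold one request open while a burst of
# cheap requests pushes the workers past max_requests under load.
import concurrent.futures
import urllib.request

BASE = "http://127.0.0.1:8000"   # placeholder test instance

def hit(path):
    with urllib.request.urlopen(BASE + path, timeout=120) as resp:
        return resp.status

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    # One long-running request that ties up a thread in some worker...
    slow = pool.submit(hit, "/slow")      # assumes an endpoint that sleeps
    # ...while a flood of fast requests triggers the periodic auto-restarts.
    fast = [pool.submit(hit, "/health") for _ in range(20_000)]
    concurrent.futures.wait(fast)
    slow.result()
```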
I'm currently having an issue where, when a worker hits its max requests and auto-restarts as usual, there is a traffic-dependent chance that the new worker will be initialized into a state where every log record sent to a WatchedFileHandler is duplicated up to as many times as the current number of workers.
Our logs are currently configured to go to a different file per worker, so it is clear that the issue is contained within a single worker after its auto-restart. This per-worker setup was added recently; all workers used to log to a single file. The hope was that switching to per-worker files would resolve the issue (which had been going on long before), but it made no difference.
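A minimal sketch of how such a per-worker file could be wired up through gunicorn's post_fork server hook; the log path, numbering scheme, and format below are assumptions rather than the actual setup:

```python
# gunicorn.conf.py (sketch) -- hypothetical per-worker log file wiring.
import logging
import logging.handlers

def post_fork(server, worker):
    # Runs in the freshly forked child. worker.age increases monotonically,
    # so age % num_workers gives a stable 0..N-1 worker number.
    worker_num = worker.age % server.num_workers
    handler = logging.handlers.WatchedFileHandler(
        f"/var/log/app/worker-{worker_num}.log"   # illustrative path
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [pid %(process)d] %(message)s")
    )
    logging.getLogger().addHandler(handler)
    server.log.info("post_fork: worker %s logging to worker-%s.log",
                    worker.pid, worker_num)
```

Note that if anything along these lines runs more than once in the same process (for example both here and in the application's own logging setup), handlers stack, which is exactly the kind of duplication described further down.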
This is a fairly high-traffic application (around 10k requests per minute at peak), so we are currently configured with 3 workers and 50 threads per worker. Ideally we would like to raise the number of workers and lower the number of threads to be closer to the gunicorn documentation's recommendations, but after testing a worker increase, the maximum number of duplicate logs that can appear increased to match the new worker count.
We have some code that assigns numbered worker IDs within gunicorn server hooks, so those are reflected in the pre_fork and post_fork logging here, but this is what the Docker logs show during an auto-restart:
I've searched around for a while and have had a hard time finding information on how to resolve the RuntimeWarning shown, so it's entirely possible that that is the problem, in which case this issue becomes a question of how to resolve that warning.
This is our gunicorn configuration at the moment:
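A minimal sketch of a configuration along those lines; only the worker and thread counts come from the description above, while the bind address, max_requests values, and worker class are placeholder assumptions:

```python
# gunicorn.conf.py (illustrative sketch; only workers/threads match the description)
bind = "0.0.0.0:8000"        # placeholder
workers = 3                  # as described above
threads = 50                 # as described above
worker_class = "gthread"     # assumption: threaded workers
max_requests = 1000          # placeholder value that drives the auto-restarts
max_requests_jitter = 100    # stagger restarts so workers don't recycle together
```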
For example, if the application is deployed in the morning (low traffic), none of the workers will show any duplicate logs. The max-request value might be hit around mid-afternoon and an auto-restart triggered. After re-initialization, ALL logs from that worker could begin going to its appropriate log file with anywhere from 0 to 3 duplicates. Once a worker enters this state, every single log originating from it is duplicated the same number of times. This possibly points to the WatchedFileHandler being added multiple times, but there is no indication as to why a gunicorn worker restart would trigger this, or why it is inconsistent and traffic-dependent. The worker continues like this until its next auto-restart, at which point it could again be initialized with anywhere from 0 to 3 duplicate logs. All other workers behave the same way, but independently of each other.
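If the theory that the WatchedFileHandler is being attached repeatedly is right, one low-risk mitigation to try (a sketch, not a confirmed fix for this issue) is to make the attachment idempotent, so a second initialization pass in the same process cannot stack handlers:

```python
import logging
import logging.handlers
import os

def attach_worker_handler(logfile: str) -> logging.Handler:
    """Attach a WatchedFileHandler for `logfile` exactly once per process."""
    root = logging.getLogger()
    target = os.path.abspath(logfile)
    # Idempotency guard: reuse an equivalent handler if one is already attached.
    for existing in root.handlers:
        if isinstance(existing, logging.handlers.WatchedFileHandler) \
                and getattr(existing, "baseFilename", None) == target:
            return existing
    handler = logging.handlers.WatchedFileHandler(logfile)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root.addHandler(handler)
    return handler
```

That does not explain why the restart path re-runs the logging setup, but it would at least cap the duplication at one handler per file.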
The reason I'm posting this here is that it only ever happens when a worker is initialized, and it seems to occur more frequently under high traffic, so some process/thread unsafety is possibly to blame. We recently implemented similar logging in another, much lower-traffic application and haven't seen this issue there yet.
If any more information is needed I can definitely provide it, and any help would be greatly appreciated, even if that help is "{other location} might be a better place to ask this question".