celery workers randomly hang on IPC (causing all jobs to queue, but no longer run) #4185
Comments
I'm running on Python 3.6 and using Celery 4.1.0.
@codeadict are you able to strace the two processes (worker and parent) to see if they are both stuck on the same file descriptor? You can verify it's the same fd with …
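For anyone who wants to try that check, here is a rough sketch (Linux only); the pids and fd numbers are placeholders for whatever `strace` shows, not values taken from this issue:

```python
import os

def fd_target(pid, fd):
    """Resolve what a process's file descriptor actually points at."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")

# Hypothetical pids/fds taken from `strace -p <pid>` output.
master_pid, master_fd = 12345, 49
worker_pid, worker_fd = 12346, 7

# If both print the same string (e.g. 'pipe:[98765]'), the two processes
# really are blocked on the same underlying pipe.
print(fd_target(master_pid, master_fd))
print(fd_target(worker_pid, worker_fd))
```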
@codeadict as @chrismeyersfsu mentioned, you can …
If you have …
https://github.com/celery/celery/blob/v3.1.17/celery/concurrency/asynpool.py#L219
Same problem. celery==3.1.23, billiard==3.3.0.23. After a few hours celery 'freezes'. This is the master process:
And one of the children:
And this is the current 'frozen' frame in the master process:
@harlov @codeadict @chrismeyersfsu Did you find any solution or workaround?
No. We are upgrading to celery 4. We will run the same set of tests that invokes this error and see if we can recreate it on celery 4.
@korycins we removed --autoscale and --maxtasksperchild and everything has been okay so far. I guess the ongoing respawning of workers caused by autoscale or maxtasksperchild may trigger this situation.
@chrismeyersfsu I have the same issue with celery 4.1, so I guess switching to v4.1 will not help you.
Can anyone verify the issue on top of the latest master? I'm not sure if it's already fixed by the latest commit.
@auvipy which commit(s) do you suspect might resolve this? I'm currently able to reproduce this issue on celery 4.1.0, billiard 3.5.0.3 from PyPI.
I am not sure, so could you please try the master branch? If it still exists after 4.2 is released, work will be carried out to fix it.
@auvipy we could give it a try, but it may be hard to say for sure - personally, we only seem to encounter this hang once every few weeks, so it's pretty hard to narrow down. |
@ryanpetrello Try to add max_task_per_child and set it to 1. http://docs.celeryproject.org/en/latest/userguide/workers.html#max-tasks-per-child-setting |
Same issue, cannot reproduce right now, I thought It was configure issue, when restart just un-hang the worker, but looks like it happening from time to time. |
Just chiming in to say we also run into this issue: celery 4.1.0, kombu 4.1.0, billiard 3.5.0.3. With respect to blocking on file descriptors, we see the same as @harlov above: the master is in a deadlock with one of its workers. Sending SIGUSR1 to the celery master "unsticks" the process and dumps the stacktrace below. What is suspicious is that in a function "on_result_readable" it gets stuck on a read.

This feels like a classic race condition: poll() returns readable but the read() blocks. This can happen, which is why it is always recommended to set the FDs to non-blocking mode to make sure nothing goes wrong. Relevant message on LKML: https://lkml.org/lkml/2011/6/18/105

I'm not sure it is easy to set the descriptors non-blocking, but we have an environment with many celery masters and there's always a few stuck, so if it's a simple change we may be able to test it quickly.
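A minimal sketch of the mitigation being suggested, assuming you can get hold of the raw pipe fds: mark them `O_NONBLOCK` (and verify the flag), so a spurious "readable" event produces `EAGAIN` instead of a `read()` that hangs the master forever:

```python
import fcntl
import os

def set_nonblocking(fd):
    """Add O_NONBLOCK to an existing file descriptor."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

def is_nonblocking(fd):
    return bool(fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_NONBLOCK)

# Stand-in for one of the worker result pipes.
r, w = os.pipe()
set_nonblocking(r)
assert is_nonblocking(r)
```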
Just an update. We spent some time testing things and trying to reproduce it.
I verified that the file descriptors normally used by celery are all correctly marked non-blocking, and they are. So the question is: how does the code block if every descriptor is non-blocking? The track I'm following now is that the epoll() object is getting confused about the descriptors it's supposed to be watching. If you strace celery you get lots of output like:
It's entirely plausible that at some point a file descriptor that was once used for one thing gets used for something else; the hub gets confused, calls the wrong method, and boom!
Ok, so I've found the problem, and I have strace output capturing it failing. It's very subtle and comes down to the following completely non-obvious 'feature' of the epoll() interface. From the manpage:

What is happening is that a file descriptor is being used by celery for communication, the file descriptor leaks to a worker, and the master closes the file descriptor and then opens another for the sentinel, which is not marked non-blocking. The child process dies, causing the file descriptor to get marked as readable; the master tries to read it and blocks. Yes, epoll() returns readability for a file descriptor that is no longer open in this process. Here I'm pasting captured strace output for just fd 49 as evidence (apologies for the multi-line output, strace was monitoring multiple processes):

Note how epoll() admits that fd 49 is no longer in its fdset, but it returned the number nonetheless. The way to fix this is to ensure that every file descriptor being monitored by epoll() is closed in the worker. Currently workers close the file descriptors they don't want that correspond to themselves, but they don't touch the file descriptors they inherit that relate to other workers. Workarounds are to mark the sentinel fd as non-blocking too, or to simply use poll() instead of epoll(). I'm testing the latter locally.
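The epoll() behaviour described above can be reproduced outside celery. A small standalone demonstration (Linux only), where a `dup()` stands in for the copy of the fd leaked to a worker:

```python
import os
import select

r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)

leaked = os.dup(r)   # stands in for the fd inherited by a child process
os.close(r)          # the master closes its copy; the registration survives,
                     # because epoll tracks the open file description, which
                     # is still held open via `leaked`

os.write(w, b"x")    # make the pipe readable
print(ep.poll(1))    # still reports an event for fd `r`, now closed here

try:
    ep.unregister(r)  # ...yet epoll_ctl(EPOLL_CTL_DEL) fails: fd `r` is closed
except OSError as exc:
    print("unregister failed:", exc)
```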
Ok, as it turns out, using poll() doesn't solve the problem, as the actual issue is that a file descriptor is being closed while still registered with the hub. Once you realise that, it is easy to figure out exactly where it goes wrong, and the following hack fixes this issue for us. Without it we could reproduce the issue within an hour or two; with it we haven't been able to trigger it yet. The hack is not the correct fix; it really requires someone who knows the code better to write a good fix.
It looks like the write queue fd is properly removed from the hub (a few lines up), but the read queue fd is not.
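Not the actual patch (the diff itself was elided above), just a toy model of the idea being described: when a worker goes away, the hub has to forget both of that worker's queue fds before they are closed, not only the write side:

```python
class ToyHub:
    """Minimal stand-in for an event hub keyed by fd number."""
    def __init__(self):
        self.readers = {}
        self.writers = {}

    def remove_reader(self, fd):
        self.readers.pop(fd, None)

    def remove_writer(self, fd):
        self.writers.pop(fd, None)

def on_worker_exit(hub, write_fd, read_fd):
    hub.remove_writer(write_fd)  # this side is already handled a few lines up
    hub.remove_reader(read_fd)   # the missing half this comment points at
```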
No updates here? :(
@auvipy any thoughts on @kleptog's analysis above at #4185 (comment)?
Sorry, I don't have enough time right now to investigate the issue.
@kleptog thanks for so much legwork on this. It seems like celery/celery/concurrency/asynpool.py lines 735 to 745 (at b46bea2) …
...which is called in celery/celery/concurrency/asynpool.py lines 1245 to 1249 (at b46bea2).
So is this statement actually accurate?
I'm not knowledgeable enough about celery internals to know why the code is currently only doing this for …
IIRC it was indeed the other queue that was being left out, so maybe it's a matter of doing an inqueue_close on the other queue? I won't have the opportunity to test it for a while though.
@auvipy I'm sorry to be a nag, and I can sympathize with the often thankless task of doing free work on an open source project 😞 - I work on several in my spare time, and another one for my day job. Do you know if you or any other celery maintainer has had a chance to take a peek at this one? We've got a notable number of celery users in this thread who have been encountering this debilitating issue since it was reported last summer, and even a few who have pitched in and possibly identified the area of code that's the culprit. Is there any sort of ETA on when this bug might be looked at?
Hi everyone. While there is a workaround which fixes the problem, @kleptog suggests it is a hack. @vlademy, did you encounter any problems while applying this patch?
This was a separate issue with a library that was using a separate polling mechanism.
Hi there, I've got the same issue. The worker was running fine for several days, but then suddenly stopped picking up and executing tasks, which led to queue growth. A restart fixed the problem. Here is what I found:
env:
Celery starts like this:
A few hours after it stopped executing tasks, I found records like this in the logs (I didn't see such records before the incident). Interestingly, the workers were syncing with different workers; I'm not sure what the correct behaviour is (should they sync with celery beat?).
I encountered a similar scenario in production with celery 5.2.3. I logged issue celery/billiard#389 against billiard; I'm not sure if this is a problem with celery or billiard. If I kill the child, the task gets executed elsewhere. Strangely, the master seems to be fine processing other requests even after this stuck task on the child. Soft timeout/hard timeout don't trigger - probably the master did not see an ack from the child, as the child is still trying to read from the socket.
The same issue on 5.2.7. I have configured MAX_TASK_PER_CHILD as 100, but we hit a verify_process_alive exception.
Noticing the same issue in 5.2.3. The child stack trace shows it's waiting on recv. Debugging the code further, I observed the code snippet in Celery's _write_job function, where an exception raised to the main worker process with errno values of EINTR or EAGAIN results in suspension using yield. Could this behavior be a potential cause of random failures in sending jobs from the supervisor to child workers?
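A toy model of the pattern described there (not celery's actual `_write_job`): a generator-based writer that yields on EINTR/EAGAIN and relies on the event loop to resume it once the fd is writable again:

```python
import errno
import fcntl
import os

def write_job(fd, payload):
    """Write payload to a non-blocking fd, yielding whenever it isn't ready."""
    sent = 0
    while sent < len(payload):
        try:
            sent += os.write(fd, payload[sent:])
        except OSError as exc:
            if exc.errno in (errno.EINTR, errno.EAGAIN):
                yield  # suspend; a real hub would reschedule us on writability
            else:
                raise

# Drive it by hand against a non-blocking pipe.
r, w = os.pipe()
fcntl.fcntl(w, fcntl.F_SETFL, fcntl.fcntl(w, fcntl.F_GETFL) | os.O_NONBLOCK)
writer = write_job(w, b"x" * 200_000)  # larger than the default pipe buffer
next(writer, None)                     # fills the pipe, then suspends at yield
```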
And again.
python gdb: (gdb) py-bt
Interestingly, the issue occurs right during thread exec:
Threads overall:
Or, to be honest, in the next thread (the 2nd) there is a place where my code is waiting for the paramiko ssh command to give me some output:
I'm still not sure about the root cause; I'm just randomly changing the code and removing try's to let the thread fail with something more verbose. I can't get any deeper with gdb since I can't find an actual traceback or visible issues.
Thanks for your detailed investigation.
On celery 5.2.7 I can see this error again:
This happened when a worker timed out on start, for example:
ForkPoolWorker-7 started before ForkPoolWorker-27, and every hub.remove in on_pool_start kept happening because of the hub tick, but when changing the … is this assert … A recent reproduction:
By changing the use of hub.remove to hub.remove_writer …
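A toy illustration of the distinction the commit below relies on, assuming a hub that tracks readers and writers in separate maps keyed by fd number: `hub.remove(fd)` forgets both sides, while `hub.remove_writer(fd)` leaves a reader registered under the same fd number alone:

```python
readers = {49: "result-handler"}    # read side registered for fd 49
writers = {49: "write-generator"}   # write side registered for the same fd number

def remove(fd):            # old behaviour: drop both sides
    readers.pop(fd, None)
    writers.pop(fd, None)

def remove_writer(fd):     # fixed behaviour: drop only the write side
    writers.pop(fd, None)

remove_writer(49)
assert 49 in readers       # the reader survives, so no AssertionError later
```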
- fix main process Unrecoverable error: AssertionError() when read fd is deleted - see #4185 (comment)
- tests:
  - change hub.remove to hub.remove_writer in test_poll_write_generator and test_poll_write_generator_stopped
  - add 3 more tests for schedule_writes to assert only hub.writers is removed when hub.readers have the same fd id

Co-authored-by: Idan Haim Shalom <[email protected]>
Co-authored-by: Tomer Nosrati <[email protected]>
This is now fixed. I'm closing the issue.
Fails in the same way on multiple celery versions:
celery version 3.1.7 billiard version 3.3.0.23
celery version 3.1.25 billiard version 3.3.0.23
celery worker -l debug --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,localhost -n celery@localhost
Steps to reproduce
Happens sporadically when running our nightly 8-hour integration tests on Ubuntu 14.04 and 16.04. It does not happen on RHEL 7 or CentOS 7. We are working on a smaller set of reproduction steps.
Expected behavior
Actual behavior
The celery master process and a particular worker both block on the same file descriptor, performing a `read()`. The worker has finished a job and is ready for more work. The parent is "locked up", blocking on the `read()` system call shared by the worker.

Restarting celery "fixes" the issue. More surgically, sending a SIGUSR1 to the master process "fixes" the issue by breaking it out of the `read()` system call. The child then returns from the read and seems to process a/the pending message. The master process does NOT try to `read()` from the PIPE again.

https://github.com/celery/celery/blob/3.1/celery/concurrency/asynpool.py#L220 is where we are hanging. This is a call to `__read__`, obtained from https://github.com/celery/billiard/blob/3.3/Modules/_billiard/multiprocessing.c#L202. This is non-blocking and non-asynchronous. I don't really understand how this code isn't susceptible to a deadlock/infinite-blocking scenario: the `read` is non-blocking and is called in such a way that it can block forever if the child dies.

We are now trying to recreate the problem with `CELERY_RDBSIG="1"` set so that we can jump into a remote debug session when the deadlock occurs. Any advice would be helpful.
From an OS perspective, I can't reason how this could occur. Note the `write()` being observed after the SIGUSR1.