Hard time limit ... exceeded (followup) #2603
Comments
Just a note that I was already looking into that some time ago: packit/deployment#548 (comment)
Now that I think about it, we could probably enable the concurrency on stage too (that's the only difference between prod and stage in the short-running workers)… If these issues are really caused by Celery × gevent, as I suspect in #2522, we have a clear cause. As I discussed with Maja in the DMs, it doesn't make much sense to bump the resources, as they just get used up anyway (I've already bumped them once with no result; it only prolongs the periods between OOM kills).
Probably
oh, we have concurrency on stage too 👀 I see patterns in the metrics similar to the production deployment, but… it doesn't run out of resources (probably the higher load causes more hanging?)
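For context, a minimal sketch of the kind of worker settings being discussed here (pool implementation and concurrency for the short-running workers). This is not the actual packit-service/deployment configuration; the app name and values are assumptions.

```python
from celery import Celery

# Hypothetical app name; the real one lives in packit-service/deployment.
app = Celery("short_running_worker")

app.conf.update(
    worker_pool="gevent",   # the Celery × gevent combination suspected in #2522
    worker_concurrency=2,   # concurrency enabled for the short-running workers
)

# Roughly equivalent CLI invocation (also an assumption about the exact flags used):
#   celery -A short_running_worker worker --pool=gevent --concurrency=2
```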
This is a follow-up card for #2512.

I went through the logs and metrics again, and I now believe that the task hangs are somehow related to the health of the `short-running` pods:

- The hangs happen only in our `short-running` instances, and most often in the `short-running-0` instance (which seems to be the least healthy one).
- Recently, many hangs happened on the 23rd of October. Metrics show that on the 23rd of October the average memory usage was higher than on other days, throughout the whole day.
I would try to solve our memory issues (`short-running` pods memory leaks -> #2522) before investigating this further.

After solving the memory issues, if this still happens, I would investigate our liveness probe check, because the use of `celery status` seems to be discouraged in favour of `celery inspect ping` -> celery/celery#4079.
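As a rough illustration, this is what a ping-based check could look like (a sketch only: the import path, timeout, and exit-code handling are assumptions, not our current probe):

```python
#!/usr/bin/env python3
"""Liveness check sketch based on Celery's ping, roughly what `celery inspect ping`
does: broadcast a ping and collect replies from live workers."""
import sys

from short_running_worker import app  # hypothetical Celery app import path

# Ask the reachable workers to reply within 5 seconds.
replies = app.control.ping(timeout=5.0)

# Fail the probe (non-zero exit) when no worker answered.
sys.exit(0 if replies else 1)
```

In the deployment, this (or the `celery inspect ping` CLI itself) could replace the current `celery status` exec command in the probe.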
I can see in OpenShift that our `short-running` pods are sometimes killed by the "OOMKiller" and other times by a failed liveness probe.