
Hard time limit ... exceeded (followup) #2603

Open
majamassarini opened this issue Oct 25, 2024 · 4 comments
Labels
area/general Related to whole service, not a specific part/integration. kind/internal Doesn't affect users directly, may be e.g. infrastructure, DB related.

Comments

@majamassarini
Member

majamassarini commented Oct 25, 2024

This is a follow up card for #2512.

I went through the logs and metrics again, and now I believe that the task hangs are somehow related to the health of the short-running pods.

The hangs happen only in our short-running instances, and most often in the short-running-0 instance (which seems to be the least healthy):

 $ oc describe packit-worker-short-running-0    
[...]
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  63m (x38 over 3d12h)  kubelet  Liveness probe failed: Ignored keyword arguments: {'type': 'pagure'} 
 $ oc describe packit-worker-short-running-1
  
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------  
  Warning  Unhealthy  61m (x23 over 3d17h)  kubelet  Liveness probe failed: Ignored keyword arguments: {'type': 'pagure'}

Recently, many hangs happened on the 23rd of October. Metrics show that on that day the average memory usage was higher than on other days, throughout the whole day.

I would try to solve our memory issues (short-running pods memory leaks -> #2522) before investigating this further.
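
For a quick spot check of the workers' current memory consumption, something like the following could complement the dashboard metrics (the label selector is an assumption, not taken from the actual deployment):

# show current CPU/memory usage of the short-running worker pods (requires cluster metrics)
$ oc adm top pods -l component=packit-worker-short-running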

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079
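
For reference, a rough sketch of the two checks (the <app> placeholder and the celery@$HOSTNAME node name are assumptions, not taken from our deployment):

# current style of check (discouraged per celery/celery#4079):
# broadcasts to all workers and waits for their replies
$ celery --app=<app> status

# suggested alternative: ping only this node, with an explicit timeout
$ celery --app=<app> inspect ping --destination "celery@$HOSTNAME" --timeout 10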

I can see in OpenShift that our short-running pods are sometimes killed by the OOM killer and other times because of a failed liveness probe (see the sketch after the excerpts below).

$ oc describe packit-worker-short-running-0    

      State:          Running
      Started:      Thu, 24 Oct 2024 14:22:48 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 23 Oct 2024 16:12:35 +0200
      Finished:     Thu, 24 Oct 2024 14:22:47 +0200
    Ready:          True
    Restart Count:  4
$  oc describe packit-worker-short-running-1

    State:          Running
      Started:      Thu, 24 Oct 2024 19:51:59 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 23 Oct 2024 16:31:17 +0200
      Finished:     Thu, 24 Oct 2024 19:51:58 +0200
    Ready:          True
    Restart Count:  6
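
To compare the last termination reason across the worker pods without reading the full describe output, something like this should work (again, the label selector is an assumption):

# print each short-running worker pod together with the reason for its last termination
$ oc get pods -l component=packit-worker-short-running \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
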
@nforro
Member

nforro commented Oct 25, 2024

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079

Just a note that I was already looking into that some time ago: packit/deployment#548 (comment)
Perhaps switching to celery inspect ping would help.

@mfocko
Member

mfocko commented Oct 25, 2024

Now that I think about it, we could probably enable the concurrency on stage too (that's the only difference between prod and stage in the short-running workers)… If these issues are really caused by Celery × gevent, as I suspect in #2522, we have a clear cause.
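
For context, the difference boils down to the worker invocation; a minimal sketch, with <app> and the concurrency value as placeholders rather than the actual deployment settings:

# worker with concurrency enabled on the gevent pool
$ celery --app=<app> worker --pool=gevent --concurrency=100

# worker without it: default prefork pool, a single worker process
$ celery --app=<app> worker --concurrency=1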

As I discussed with Maja in the DMs, it doesn't make much sense to bump the resources, as they just get fully used anyway (I've already bumped them once with no result; it only prolongs the periods between OOM kills).

@majamassarini
Member Author

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079

Just a note that I was already looking into that some time ago: packit/deployment#548 (comment) Perhaps switching to celery inspect ping would help.

celery status is probably not the best check and I would fix it. But I think that right now it is somewhat useful, because it surfaces the unhealthiness of our pods. The check seems to fail only in the short-running pods and not in the long-running ones... I would keep it, just as feedback, until we fix the memory problems.

@mfocko
Member

mfocko commented Oct 25, 2024

oh, we have concurrency on stage too 👀 I see similar patterns in the metrics as in the production deployment, but… it doesn't run out of resources (probably the higher load causes more hanging?)

@mfocko mfocko moved this from new to backlog in Packit Kanban Board Oct 29, 2024
@mfocko mfocko added the area/general Related to whole service, not a specific part/integration. label Oct 29, 2024
Projects
Status: backlog
Development

No branches or pull requests

3 participants