
Hard time limit ... exceeded (followup) #2603

Open
majamassarini opened this issue Oct 25, 2024 · 4 comments
Labels
area/general Related to whole service, not a specific part/integration. kind/internal Doesn't affect users directly, may be e.g. infrastructure, DB related.

Comments

@majamassarini
Member

majamassarini commented Oct 25, 2024

This is a follow up card for #2512.

I went through the logs and metrics again, and now I believe that the task hangs are somehow related to the health of the short-running pods.

The hangs happen only in our short-running instances, and most often in the short-running-0 instance (which seems to be the least healthy):

 $ oc describe packit-worker-short-running-0    
[...]
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  63m (x38 over 3d12h)  kubelet  Liveness probe failed: Ignored keyword arguments: {'type': 'pagure'} 
 $ oc describe packit-worker-short-running-1
  
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------  
  Warning  Unhealthy  61m (x23 over 3d17h)  kubelet  Liveness probe failed: Ignored keyword arguments: {'type': 'pagure'}

Recently, many hangs happened on the 23rd of October. Metrics show that on that day the average memory usage was higher than on other days, throughout the whole day.

I would try to solve our memory issues (short-running pods memory leaks -> #2522) before investigating this further.
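
For a quick spot check of the workers' current memory consumption, something like the following could complement the dashboard metrics (the label selector is an assumption, not taken from the actual deployment):

# show current CPU/memory usage of the short-running worker pods (requires cluster metrics)
$ oc adm top pods -l component=packit-worker-short-running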

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079
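
For reference, a rough sketch of the two checks (the <app> placeholder and the celery@$HOSTNAME node name are assumptions, not taken from our deployment):

# current style of check (discouraged per celery/celery#4079):
# broadcasts to all workers and waits for their replies
$ celery --app=<app> status

# suggested alternative: ping only this node, with an explicit timeout
$ celery --app=<app> inspect ping --destination "celery@$HOSTNAME" --timeout 10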

I can see in OpenShift that our short-running pods are sometimes killed by the OOM killer and other times because of a failed liveness probe (see the sketch after the excerpts below).

$ oc describe packit-worker-short-running-0    

      State:          Running
      Started:      Thu, 24 Oct 2024 14:22:48 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 23 Oct 2024 16:12:35 +0200
      Finished:     Thu, 24 Oct 2024 14:22:47 +0200
    Ready:          True
    Restart Count:  4
$  oc describe packit-worker-short-running-1

    State:          Running
      Started:      Thu, 24 Oct 2024 19:51:59 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 23 Oct 2024 16:31:17 +0200
      Finished:     Thu, 24 Oct 2024 19:51:58 +0200
    Ready:          True
    Restart Count:  6
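
To compare the last termination reason across the worker pods without reading the full describe output, something like this should work (again, the label selector is an assumption):

# print each short-running worker pod together with the reason for its last termination
$ oc get pods -l component=packit-worker-short-running \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
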
@nforro
Member

nforro commented Oct 25, 2024

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079

Just a note that I was already looking into that some time ago: packit/deployment#548 (comment)
Perhaps switching to celery inspect ping would help.

@mfocko
Member

mfocko commented Oct 25, 2024

Now that I think about it, we could probably enable the concurrency on stage too (that's the only difference between prod and stage in the short-running workers)… If these issues are really caused by Celery × gevent, as I suspect in #2522, we have a clear cause.
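
For context, the difference boils down to the worker invocation; a minimal sketch, with <app> and the concurrency value as placeholders rather than the actual deployment settings:

# worker with concurrency enabled on the gevent pool
$ celery --app=<app> worker --pool=gevent --concurrency=100

# worker without it: default prefork pool, a single worker process
$ celery --app=<app> worker --concurrency=1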

As I discussed with Maja in the DMs, it doesn't make much sense to bump the resources, as they just get fully used anyway (I've already bumped them once with no result; it only prolongs the periods between OOM kills).

@majamassarini
Member Author

After solving the memory issues, if this still happens, I would investigate our Liveness Probe check, because the use of celery status seems to be discouraged in favour of celery inspect ping -> celery/celery#4079

Just a note that I was already looking into that some time ago: packit/deployment#548 (comment) Perhaps switching to celery inspect ping would help.

celery status is probably not the best check and I would fix it. But I think that right now it is somewhat useful, because it surfaces the unhealthiness of our pods. The check seems to fail only in the short-running pods and not in the long-running ones... I would keep it, just as feedback, until we fix the memory problems.

@mfocko
Member

mfocko commented Oct 25, 2024

oh, we have concurrency on stage too 👀 I see similar patterns in the metrics as in the production deployment, but… it doesn't run out of resources (probably the higher load causes more hanging?)

@mfocko mfocko moved this from new to backlog in Packit Kanban Board Oct 29, 2024
@mfocko mfocko added the area/general Related to whole service, not a specific part/integration. label Oct 29, 2024
Projects
Status: backlog
Development

No branches or pull requests

3 participants