Description

We've noticed some GitLab CI worker pods failing to get scheduled. The typical output after the job times out is:
ERROR: Job failed (system failure): prepare environment: waiting for pod running:
timed out waiting for pod to start.
I caught one such pod before it timed out and ran kubectl describe on it. I saw:
Events:
  Type    Reason     Age  From       Message
  ----    ------     ---  ----       -------
  Normal  Nominated  55s  karpenter  Pod should schedule on ip-10-0-168-134.ec2.internal
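For anyone else trying to catch one of these pods before the runner gives up, something like the following works (a sketch; the gitlab-runner namespace name is an assumption about our setup):

# List runner pods stuck in Pending (namespace name is an assumption)
kubectl get pods -n gitlab-runner --field-selector=status.phase=Pending

# Then inspect the scheduling events of a stuck pod
kubectl describe pod -n gitlab-runner <pod-name>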
Hmm, that's suspicious. Karpenter sees a node where it can schedule this pod, but the pod never actually lands there. And sure enough, running kubectl describe on that node turned up some telling output:
Lease: Failed to get lease: leases.coordination.k8s.io "ip-10-0-168-134.ec2.internal" not found
Conditions:
  Type   Status   LastHeartbeatTime                LastTransitionTime               Reason                  Message
  ----   ------   -----------------                ------------------               ------                  -------
  Ready  Unknown  Fri, 17 Mar 2023 05:26:56 -0400  Fri, 17 Mar 2023 05:28:00 -0400  NodeStatusNeverUpdated  Kubelet never posted node status.
Events:
  Type     Reason               Age                    From       Message
  ----     ------               ---                    ----       -------
  Warning  FailedInflightCheck  4m3s (x737 over 5d2h)  karpenter  Instance Type "" not found
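To check whether other nodes are in the same state, this should do it (a sketch; assumes jq is available):

# Nodes whose kubelet never posted a status (Ready condition Unknown,
# reason NodeStatusNeverUpdated)
kubectl get nodes -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[];
          .type == "Ready" and .status == "Unknown" and .reason == "NodeStatusNeverUpdated"))
      | .metadata.name'

# A healthy node also holds a lease in kube-node-lease; the broken one does not
kubectl get lease -n kube-node-lease ip-10-0-168-134.ec2.internal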
Relevant upstream issues

kubernetes-sigs/karpenter#750
aws/karpenter-provider-aws#3156
aws/karpenter-provider-aws#3311

Mitigation

For now, I manually found the affected node in the AWS web console and terminated it. If we can't properly resolve this issue, we should at least automatically detect and terminate such nodes, as sketched below.
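As a starting point for that automation, here is a rough sketch of what a periodic job (e.g. a CronJob) could run. It assumes kubectl, jq, and the AWS CLI are available and authorized, and that any node whose kubelet never posted a status is safe to kill; neither assumption has been validated yet.

#!/usr/bin/env bash
# Sketch: detect nodes whose kubelet never posted a status and terminate
# the backing EC2 instance. Not battle-tested; review before running.
set -euo pipefail

stuck_nodes=$(kubectl get nodes -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[];
          .type == "Ready" and .status == "Unknown" and .reason == "NodeStatusNeverUpdated"))
      | .metadata.name')

for node in $stuck_nodes; do
  # spec.providerID looks like aws:///us-east-1a/i-0123456789abcdef0
  instance_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
  echo "Terminating stuck node $node (instance $instance_id)"
  kubectl delete node "$node"
  aws ec2 terminate-instances --instance-ids "$instance_id"
done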