Description

We've noticed some GitLab CI worker pods failing to get scheduled. The typical output after the job times out is:
ERROR: Job failed (system failure): prepare environment: waiting for pod running:
timed out waiting for pod to start.
I caught one such pod before it timed out and ran kubectl describe on it. I saw:
Events:
  Type    Reason     Age  From       Message
  ----    ------     ---  ----       -------
  Normal  Nominated  55s  karpenter  Pod should schedule on ip-10-0-168-134.ec2.internal
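For anyone else trying to catch one of these pods before the runner gives up, something like the following works (a sketch; the gitlab-runner namespace name is an assumption about our setup):

# List runner pods stuck in Pending (namespace name is an assumption)
kubectl get pods -n gitlab-runner --field-selector=status.phase=Pending

# Then inspect the scheduling events of a stuck pod
kubectl describe pod -n gitlab-runner <pod-name>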
Hmm, that's suspicious. Karpenter sees a node where it can schedule this pod, but the pod never actually lands there. And sure enough, running kubectl describe on that node turned up some telling output:
Lease: Failed to get lease: leases.coordination.k8s.io "ip-10-0-168-134.ec2.internal" not found
Conditions:
  Type   Status   LastHeartbeatTime                LastTransitionTime               Reason                  Message
  ----   ------   -----------------                ------------------               ------                  -------
  Ready  Unknown  Fri, 17 Mar 2023 05:26:56 -0400  Fri, 17 Mar 2023 05:28:00 -0400  NodeStatusNeverUpdated  Kubelet never posted node status.
Events:
  Type     Reason               Age                    From       Message
  ----     ------               ---                    ----       -------
  Warning  FailedInflightCheck  4m3s (x737 over 5d2h)  karpenter  Instance Type "" not found
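To check whether other nodes are in the same state, this should do it (a sketch; assumes jq is available):

# Nodes whose kubelet never posted a status (Ready condition Unknown,
# reason NodeStatusNeverUpdated)
kubectl get nodes -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[];
          .type == "Ready" and .status == "Unknown" and .reason == "NodeStatusNeverUpdated"))
      | .metadata.name'

# A healthy node also holds a lease in kube-node-lease; the broken one does not
kubectl get lease -n kube-node-lease ip-10-0-168-134.ec2.internal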
Relevant upstream issues

kubernetes-sigs/karpenter#750
aws/karpenter-provider-aws#3156
aws/karpenter-provider-aws#3311

Mitigation

For now, I manually found the affected node in the AWS web console and terminated it. If we can't properly resolve this issue, we should at least automatically detect and terminate such nodes, as sketched below.
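As a starting point for that automation, here is a rough sketch of what a periodic job (e.g. a CronJob) could run. It assumes kubectl, jq, and the AWS CLI are available and authorized, and that any node whose kubelet never posted a status is safe to kill; neither assumption has been validated yet.

#!/usr/bin/env bash
# Sketch: detect nodes whose kubelet never posted a status and terminate
# the backing EC2 instance. Not battle-tested; review before running.
set -euo pipefail

stuck_nodes=$(kubectl get nodes -o json \
  | jq -r '.items[]
      | select(any(.status.conditions[];
          .type == "Ready" and .status == "Unknown" and .reason == "NodeStatusNeverUpdated"))
      | .metadata.name')

for node in $stuck_nodes; do
  # spec.providerID looks like aws:///us-east-1a/i-0123456789abcdef0
  instance_id=$(kubectl get node "$node" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
  echo "Terminating stuck node $node (instance $instance_id)"
  kubectl delete node "$node"
  aws ec2 terminate-instances --instance-ids "$instance_id"
done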