Drift replacement stuck due to "Cannot disrupt NodeClaim" #1684

Closed
willthames opened this issue Sep 18, 2024 · 12 comments
Comments

@willthames
Contributor

Description

Observed Behavior:

Two nodes were replaced during drift replacement, next one seems stuck with

  Normal   DisruptionBlocked      19s (x29 over 57m)  karpenter  Cannot disrupt Node: state node is marked for deletion

There is nothing in Karpenter's logs to explain this. We saw similar behaviour during the Karpenter 1.0.1 upgrade but put that down to API version mismatches; we don't seem to have any such mismatches this time.

k get nodeclaims -o custom-columns='APIVER:.apiVersion,NAME:.metadata.name,OWNER_API_VER:.metadata.ownerReferences[0].apiVersion,OWNERKIND:.metadata.ownerReferences[0].kind'
APIVER            NAME                               OWNER_API_VER     OWNERKIND
karpenter.sh/v1   bottlerocket-general-amd64-6k89s   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-cmdfv   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-hw9p6   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-jjmnz   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-ml9q4   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-general-amd64-z7r8c   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-smaller-amd64-gc2nt   karpenter.sh/v1   NodePool
karpenter.sh/v1   bottlerocket-smaller-amd64-sr2lr   karpenter.sh/v1   NodePool

Expected Behavior:

All nodes get replaced during drift replacement

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version):
      Client Version: v1.31.1
      Kustomize Version: v5.4.2
      Server Version: v1.30.3-eks-2f46c53

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@willthames willthames added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 18, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@willthames
Contributor Author

I've done some more digging on this and can add some additional information.

The node and nodeclaim both have deletionTimestamp set, which means they're just waiting for their finalizers to be removed before they can actually be deleted.
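
For anyone following along, this state can be confirmed by looking at the deletionTimestamp and finalizers on both objects. The node and nodeclaim names below are placeholders, not taken from this cluster:

$ kubectl get node ip-10-0-0-1.example.internal \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'
$ kubectl get nodeclaim bottlerocket-general-amd64-xxxxx \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'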

Looking at the code, it seems that the node finalizer ensures that all the nodeclaims related to the node are deleted before it finishes, by calling cloudProvider.Delete for each nodeclaim; the node then finalizes and terminates, and after that the nodeclaims finalize and terminate.

I can't see any evidence that the deletion request has made it to the cloud (I ran an Athena search over CloudTrail for all non-read-only EC2 requests in the region; no calls to TerminateInstances or similar were made), so something in between seems to be stuck.

The only thing I can see that would cause the cloud termination never to be reached, without any errors being logged, is if one of the two actions that cause a reconcile requeue never finishes: namely the drain, or the volume attachment cleanup.

Actually it can't be the drain, because a failed drain causes a node event to be published.

@leoryu

leoryu commented Sep 19, 2024

Are you using the AWS provider? Karpenter will first create a new nodeclaim to ensure the pods can be scheduled. Could you check your controller logs to see whether the provider is trying to create a nodeclaim or not?

@willthames
Contributor Author

It is the volume attachments.

If I run kubectl get volumeattachments, there is still a volume attached to the node.

However, the persistent volume associated with the attachment is in Released status (the PVC, and the pod that was associated with the claim, no longer exist, presumably having been drained).
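
For reference, this can be confirmed roughly as follows (the PV name below is a placeholder): list the attachments with the node and PV they reference, then check the phase of the PV behind the suspect attachment.

$ kubectl get volumeattachments \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached'
$ kubectl get pv pvc-0123abcd -o jsonpath='{.status.phase}{"\n"}'   # prints "Released" for the orphaned volume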

I know that the fix is to ignore released attachments in filterVolumeAttachments; I'm just struggling to create a test case that fails without the fix and passes with it!

@willthames
Contributor Author

An easy way to validate that this was the problem (in hindsight, obviously!) is that running kubectl delete volumeattachment on the volume attachment associated with the stuck node causes the node to finally terminate.
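
In other words, something along these lines, where the attachment name is a placeholder and should be the one that still references the stuck node:

$ kubectl delete volumeattachment csi-0123456789abcdef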

@wmgroot
Contributor

wmgroot commented Sep 20, 2024

We are seeing similar behavior with Karpenter v1: nodeclaims stuck in a drifted state without ever being disrupted.

In my case, I have a node with no volumeattachments:

$ kubectl get volumeattachments | grep ip-10-115-210-142.us-east-2.compute.internal
<nothing>

My node says that disruption is blocked due to a pending pod, but I have no pending pods in my cluster, and the node in question has a taint to allow only a single do-not-disrupt pod to schedule there as a test case.

$ kubectl describe node ip-10-115-210-142.us-east-2.compute.internal

Taints:             test-disruption=true:NoSchedule
Unschedulable:      false

Events:
  Type     Reason                   Age                   From                   Message
  ----     ------                   ----                  ----                   -------
  Normal   NodeHasSufficientMemory  47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeAllocatableEnforced  47m                   kubelet                Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientPID     47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasNoDiskPressure    47m (x2 over 47m)     kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   Starting                 47m                   kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      47m                   kubelet                invalid capacity 0 on image filesystem
  Normal   RegisteredNode           47m                   node-controller        Node ip-10-115-210-142.us-east-2.compute.internal event: Registered Node ip-10-115-210-142.us-east-2.compute.internal in Controller
  Normal   Synced                   47m                   cloud-node-controller  Node synced successfully
  Normal   DisruptionBlocked        47m                   karpenter              Cannot disrupt Node: state node isn't initialized
  Normal   NodeReady                47m                   kubelet                Node ip-10-115-210-142.us-east-2.compute.internal status is now: NodeReady
  Normal   DisruptionBlocked        39m (x4 over 45m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        37m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        30m (x4 over 36m)     karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        29m                   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        4m58s (x13 over 29m)  karpenter              Cannot disrupt Node: state node is nominated for a pending pod
  Normal   DisruptionBlocked        45s (x2 over 2m46s)   karpenter              Cannot disrupt Node: state node is nominated for a pending pod
$ kubectl get pod -n default -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
hello-world-nginx-5964768b4c-fnrxp   1/1     Running   0          77m   10.75.173.246   ip-10-115-217-230.us-east-2.compute.internal   <none>           <none>
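
(For what it's worth, a cluster-wide check for pending pods, in case one is hiding in another namespace, would be something like the following.)

$ kubectl get pods -A --field-selector=status.phase=Pending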

I am using terminationGracePeriod on this nodeclaim, and I expect that disruption via drift should make the node unschedulable and create a new nodeclaim for pods to reschedule on.

@wmgroot
Contributor

wmgroot commented Sep 20, 2024

Opened up a new issue here since my problem looks to be unrelated to the issue with volumeattachments, even though it results in similar behavior.
#1702

@willthames
Contributor Author

Thanks to @AndrewSirenko for providing some valuable insight in #1700 by suggesting that maybe the CSI drivers weren't handling volume detachment correctly. I had also missed that only the EBS CSI volumes were affected, and the EFS volumes were being handled fine.

I decided to check the EBS controller logs during node termination, only to discover I no longer had an EBS controller on the node, because the EBS node daemonset didn't tolerate the termination taints.

Once I changed the tolerations so that the EBS node controller remained alive during termination, the volumes could be cleaned up appropriately, and drift replacement now works perfectly again.
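
For anyone hitting the same thing, the check and the kind of change involved look roughly like this. The DaemonSet name, namespace, and toleration values are assumptions based on a default aws-ebs-csi-driver install (and on the karpenter.sh/disrupted taint mentioned later in this thread), not taken from this cluster; tolerating all taints, as many CSI node DaemonSets do, also works.

# See which taints the EBS CSI node DaemonSet currently tolerates
$ kubectl -n kube-system get daemonset ebs-csi-node \
    -o jsonpath='{.spec.template.spec.tolerations}'

# One way to keep the node pods alive while Karpenter drains the node
# (assumes the DaemonSet already has a tolerations list)
$ kubectl -n kube-system patch daemonset ebs-csi-node --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"key":"karpenter.sh/disrupted","operator":"Exists","effect":"NoSchedule"}}]'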

It looks like #1294 was released with 1.0.1 and was a breaking change for us due to our incorrect EBS CSI configuration, which we'd previously got away with!

@willthames
Contributor Author

Closing this now

@AndrewSirenko
Contributor

> I decided to check the EBS controller logs during node termination, only to discover I no longer had an EBS controller on the node, because the EBS node daemonset didn't tolerate the termination taints.

Glad you root caused this @willthames, and thanks for sharing this tricky failure mode. I'll make sure we over at EBS CSI Driver add this to some kind of Karpenter + EBS CSI FAQ/Troubleshooting guide.

Just curious, but what version of the EBS CSI Driver were you running? v1.29.0 added a check for the karpenter.sh/disrupting taint in PR#1969, when Karpenter changed to a custom taint last year. If your installation is ≥ v1.29.0, then this is something for the EBS team to investigate... Cheers!

@willthames
Contributor Author

I've just checked the running version in an as-yet-unfixed cluster; it's v1.34.0, so the version shouldn't be a problem (we have a GitHub Action that regularly checks our Helm charts and bumps them, so we're rarely too far off the leading edge).

I'll validate that the correct taints are being applied and watched for when I apply the AMI bump to our remaining cluster.
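
(For the record, the taint that Karpenter puts on a node it is draining can be checked with something like the command below; the node name is a placeholder, and on v1 the expected taint key is karpenter.sh/disrupted, per the next comment.)

$ kubectl get node ip-10-0-0-1.example.internal -o jsonpath='{.spec.taints}{"\n"}'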

@willthames
Contributor Author

@AndrewSirenko I've raised kubernetes-sigs/aws-ebs-csi-driver#2158 now; it seems that the taint has changed with v1 to karpenter.sh/disrupted.
