Only wait for volume attachments for drainable nodes #1700

Conversation

@willthames willthames commented Sep 20, 2024

Fixes #1684

Description

This solves two problems identified so far:

  • waiting for a volume attachment associated with a persistent volume claim that has been released
  • waiting for a volume attachment associated with a persistent volume claim that has been assigned to a new pod that can't start until the association is removed

Effectively, this changes the logic from finding all the PVCs associated with non-drainable pods and ignoring them, to finding all the PVCs associated with drainable pods and blocking only on those.
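As an illustration, here is a minimal Python sketch of the two strategies (hypothetical data shapes; Karpenter's real implementation is in Go and operates on Kubernetes API objects). A released PVC, still referenced by a VolumeAttachment but mounted by no pod, blocks draining under the old logic but not under the new one:

```python
# Hypothetical sketch, not Karpenter's actual code. Pods are dicts with a
# "drainable" flag and the set of PVC names they mount; attachments
# reference a PVC by name.

def blocking_attachments_old(pods, attachments):
    """Old logic: ignore PVCs of non-drainable pods, block on everything else."""
    ignored = {pvc for pod in pods if not pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] not in ignored]

def blocking_attachments_new(pods, attachments):
    """New logic: block only on PVCs of drainable pods."""
    drainable = {pvc for pod in pods if pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] in drainable]

pods = [
    {"name": "app-0", "drainable": True, "pvcs": {"data-app-0"}},
    {"name": "daemon-0", "drainable": False, "pvcs": {"data-daemon-0"}},
]
attachments = [
    {"pvc": "data-app-0"},
    {"pvc": "data-daemon-0"},
    {"pvc": "released-pvc"},  # PVC released, attachment lingering (problem case)
]

print([a["pvc"] for a in blocking_attachments_old(pods, attachments)])
# -> ['data-app-0', 'released-pvc']  (the released PVC blocks draining)
print([a["pvc"] for a in blocking_attachments_new(pods, attachments)])
# -> ['data-app-0']  (only the drainable pod's PVC blocks)
```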

As such, some of the previous tests are now redundant, because they check that we wait for volume attachments that aren't tied to a PVC at all.

How was this change tested?

make test

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 20, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: willthames
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 20, 2024
@k8s-ci-robot

Hi @willthames. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 20, 2024
@willthames

Note that this is an alternative, much more impactful approach than #1699.

There may be cases that I have not considered.

#1699 is a safer PR but doesn't protect against a second problem I found where a pod with a PVC associated with a volume attachment got moved to a new node (statefulset-like behaviour) and the new pod couldn't start because its persistent volume was still attached to the old node.

AndrewSirenko commented Sep 23, 2024

Hey @willthames, thanks for looking into this. I'm sorry that you're running into this drift replacement issue. I'll take a deeper look at the PR on Tuesday, but I see one main issue with the following statement:

Effectively this changes the logic from finding all the PVCs associated with non-drainable pods and ignoring those ones, to finding all the PVCs associated with drainable pods and only blocking on those ones.

The issue is that Pod resources can get cleaned up in Kubernetes BEFORE the associated volumes are unpublished from the node (unmounted and/or detached). This is because the pod lifecycle is not affected by the lifecycle of the PVCs it is associated with.

Therefore, if you block node termination by filtering in attachments of drainable pods instead of filtering out those of non-drainable pods, you are not guaranteed to block node termination until all volumes are detached (because the pod resources might be deleted by Kubernetes before the associated VolumeAttachment resources are dealt with). You can run into the following race condition:

  1. Node starts being drained by Karpenter
  2. Pod enters terminating state
  3. Pod is drained and pod resource is deleted
  4. K8s + CSI Driver sees that volume must be unmounted, begins unmount (but not detach)
  5. Karpenter does not see this yet-to-be-detached volume when looping through node's pods and therefore proceeds with node termination
  6. CSI Driver begins volume detach, but node is already terminating (holding volume hostage)
  7. When stateful workload is rescheduled on another node, it now cannot attach volume and start until previous node fully terminated (Which this blocking of node termination is trying to prevent)

Or even worse, the Node resource could be deleted before kubelet confirms the volume is unmounted (which leads to a 6+ minute delay on stateful workload migration).
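The gap described in steps 3–5 can be shown with a small hypothetical sketch (plain Python data, not Karpenter's actual code): once the pod resource is gone, a filter that blocks only on PVCs of drainable pods no longer sees the still-attached volume.

```python
# Hypothetical illustration of the race: the stateful pod was deleted in
# step 3, but its VolumeAttachment still exists (steps 4-6).

def blocking_attachments(pods, attachments):
    # Block only on attachments whose PVC belongs to a drainable pod.
    drainable_pvcs = {pvc for pod in pods if pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] in drainable_pvcs]

pods = []                          # pod resource already deleted (step 3)
attachments = [{"pvc": "data-0"}]  # volume not yet detached (step 4)

print(blocking_attachments(pods, attachments))
# -> []  : nothing blocks, so node termination proceeds before detach (step 5)
```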


Nonetheless, VolumeAttachment resources should not linger once a PV is released, so I'm curious about that case. Is there anything special about these volumes? Could this be a bug in the Kubernetes Attach/Detach controller rather than in Karpenter?

You can see the "Preventing 6+ minute delays from StatefulSet Disruption" Request for Comment for more information on why this blocking until volumes are detached happens. I'm working on an up-to-date, cleaned-up version of this document, which should be up by the end of the month.

willthames commented Sep 23, 2024

Hi @AndrewSirenko, thanks for your detailed response, definitely appreciate the extra context!

The scenario you describe is the opposite problem to the one I have (nodes disappearing before the volume attachment is removed, causing problems, rather than nodes not being terminated because of the volume attachment), so there's definitely a conflict somewhere.

As far as I can tell, I'm only having this problem with one file system type (EBS) and with one pod type (couchbase, which is managed as bare pods by the couchbase controller, rather than by a higher-level abstraction such as a StatefulSet). But the couchbase pods are long gone by the time the cleanup needs to happen. (Initially I thought this was a problem with EFS, where we have many more applications using it, but it's EBS, and this is the only application using it. EBS is also much less tolerant of being mounted by multiple pods!)

This problem persists for days; the only way I can get Karpenter to complete drift termination is by deleting the affected volume attachments (which is not a satisfying approach at all).

I'll definitely check for any outstanding EFS controller bugs.

The alternative PR to this one doesn't block on unattached PVCs, but it does not clean up if the volume is then mounted onto another pod.

@willthames

This PR is no longer required (I'll add an explanation in #1684)

Successfully merging this pull request may close these issues.

Drift replacement stuck due to "Cannot disrupt NodeClaim"