Only wait for volume attachments for drainable nodes #1700

Conversation

@willthames willthames commented Sep 20, 2024

Fixes #1684

Description

This solves two problems identified so far:

  • waiting for a volume attachment associated with a persistent volume claim that has been released
  • waiting for a volume attachment associated with a persistent volume claim that has been assigned to a new pod that can't start until the association is removed

Effectively, this changes the logic from finding all the PVCs associated with non-drainable pods and ignoring them, to finding all the PVCs associated with drainable pods and blocking only on those.
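As an illustration, here is a minimal Python sketch of the two strategies (hypothetical data shapes; Karpenter's real implementation is in Go and operates on Kubernetes API objects). A released PVC, still referenced by a VolumeAttachment but mounted by no pod, blocks draining under the old logic but not under the new one:

```python
# Hypothetical sketch, not Karpenter's actual code. Pods are dicts with a
# "drainable" flag and the set of PVC names they mount; attachments
# reference a PVC by name.

def blocking_attachments_old(pods, attachments):
    """Old logic: ignore PVCs of non-drainable pods, block on everything else."""
    ignored = {pvc for pod in pods if not pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] not in ignored]

def blocking_attachments_new(pods, attachments):
    """New logic: block only on PVCs of drainable pods."""
    drainable = {pvc for pod in pods if pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] in drainable]

pods = [
    {"name": "app-0", "drainable": True, "pvcs": {"data-app-0"}},
    {"name": "daemon-0", "drainable": False, "pvcs": {"data-daemon-0"}},
]
attachments = [
    {"pvc": "data-app-0"},
    {"pvc": "data-daemon-0"},
    {"pvc": "released-pvc"},  # PVC released, attachment lingering (problem case)
]

print([a["pvc"] for a in blocking_attachments_old(pods, attachments)])
# -> ['data-app-0', 'released-pvc']  (the released PVC blocks draining)
print([a["pvc"] for a in blocking_attachments_new(pods, attachments)])
# -> ['data-app-0']  (only the drainable pod's PVC blocks)
```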

As such, some of the previous tests are now redundant, because they check that we wait for volume attachments that aren't tied to a PVC at all.

How was this change tested?

make test

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 20, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: willthames
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 20, 2024
@k8s-ci-robot

Hi @willthames. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 20, 2024
@willthames

Note that this is an alternative, much more impactful approach than #1699.

There may be cases that I have not considered.

#1699 is a safer PR but doesn't protect against a second problem I found where a pod with a PVC associated with a volume attachment got moved to a new node (statefulset-like behaviour) and the new pod couldn't start because its persistent volume was still attached to the old node.

AndrewSirenko commented Sep 23, 2024

Hey @willthames, thanks for looking into this. I'm sorry that you're running into this drift replacement issue. I'll take a deeper look at the PR on Tuesday, but I see one main issue with the following statement:

Effectively this changes the logic from finding all the PVCs associated with non-drainable pods and ignoring those ones, to finding all the PVCs associated with drainable pods and only blocking on those ones.

The issue is that Pod resources can get cleaned up in Kubernetes BEFORE the associated volumes are unpublished from the node (unmounted and/or detached). This is because the pod lifecycle is not affected by the lifecycle of the PVCs it is associated with.

Therefore, if you block node termination by filtering in attachments of drainable pods instead of filtering out those of non-drainable pods, you are not guaranteed to block node termination until all volumes are detached (because the pod resources might be deleted by Kubernetes before the associated VolumeAttachment resources are dealt with). You can run into the following race condition:

  1. Node starts being drained by Karpenter
  2. Pod enters terminating state
  3. Pod is drained and pod resource is deleted
  4. K8s + CSI Driver sees that volume must be unmounted, begins unmount (but not detach)
  5. Karpenter does not see this yet-to-be-detached volume when looping through node's pods and therefore proceeds with node termination
  6. CSI Driver begins volume detach, but node is already terminating (holding volume hostage)
  7. When stateful workload is rescheduled on another node, it now cannot attach volume and start until previous node fully terminated (Which this blocking of node termination is trying to prevent)

Or even worse, the Node resource could be deleted before kubelet confirms the volume is unmounted (which leads to a 6+ minute delay on stateful workload migration).
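The gap described in steps 3–5 can be shown with a small hypothetical sketch (plain Python data, not Karpenter's actual code): once the pod resource is gone, a filter that blocks only on PVCs of drainable pods no longer sees the still-attached volume.

```python
# Hypothetical illustration of the race: the stateful pod was deleted in
# step 3, but its VolumeAttachment still exists (steps 4-6).

def blocking_attachments(pods, attachments):
    # Block only on attachments whose PVC belongs to a drainable pod.
    drainable_pvcs = {pvc for pod in pods if pod["drainable"] for pvc in pod["pvcs"]}
    return [a for a in attachments if a["pvc"] in drainable_pvcs]

pods = []                          # pod resource already deleted (step 3)
attachments = [{"pvc": "data-0"}]  # volume not yet detached (step 4)

print(blocking_attachments(pods, attachments))
# -> []  : nothing blocks, so node termination proceeds before detach (step 5)
```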


Nonetheless, VolumeAttachment resources should not linger once a PV is released, so I'm curious about that case. Is there anything special about these volumes? Could this be a bug in the Kubernetes Attach/Detach controller rather than in Karpenter?

You can see the "Preventing 6+ minute delays from StatefulSet Disruption" Request for Comment for more information on why this blocking until volumes are detached happens. I'm working on an up-to-date, cleaned-up version of this document, which should be up by the end of the month.

willthames commented Sep 23, 2024

Hi @AndrewSirenko, thanks for your detailed response, definitely appreciate the extra context!

The scenario you describe is the opposite problem to the one I have (nodes disappearing before the volume attachment is removed, causing problems, rather than nodes not being terminated because of the volume attachment), so there's definitely a conflict somewhere.

As far as I can tell, I'm only having this problem with one file system type (EBS) and with one pod type (couchbase, which is managed as bare pods by the couchbase controller, rather than by a higher-level abstraction such as a StatefulSet). But the couchbase pods are long gone by the time the cleanup needs to happen. (Initially I thought this was a problem with EFS, where we have many more applications using it, but it's EBS, and this is the only application using it. EBS is also much less tolerant of being mounted by multiple pods!)

This problem persists for days; the only way I can get Karpenter to complete drift termination is by deleting the affected volume attachments (which is not a satisfying approach at all).

I'll definitely check for any outstanding EFS controller bugs.

The alternative PR to this one doesn't block on unattached PVCs, but it does not clean up if the volume is then mounted onto another pod.

@willthames

This PR is no longer required (I'll add an explanation in #1684)

Successfully merging this pull request may close these issues.

Drift replacement stuck due to "Cannot disrupt NodeClaim"