
🐛 Machine deletion skips waiting for volumes detached for unreachable Nodes #10662

Merged 1 commit into kubernetes-sigs:main on Jun 17, 2024

Conversation

@typeid (Contributor) commented May 22, 2024

What this PR does / why we need it:

When an MHC is remediating an unresponsive node, CAPI drains the node and gives up waiting for pods to be deleted after SkipWaitForDeleteTimeoutSeconds. If one of the pods it gave up on has an attached volume, the machine deletion gets stuck.

This PR aligns the volume-detachment behavior during deletion with what we already do for drains: if the node is unreachable, we skip waiting for its volumes to be detached.

Which issue(s) this PR fixes
Fixes #10661

/area machine
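
To illustrate the change, here is a minimal Go sketch of the intended behavior (not the actual diff; isNodeUnreachable and shouldWaitForVolumeDetach are hypothetical names used only for this example):

```go
// Minimal sketch of the idea behind this PR (not the actual diff): during
// Machine deletion, skip waiting for volume detachment when the Node is
// unreachable, mirroring what the drain logic already does.
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// isNodeUnreachable treats a Node whose Ready condition is Unknown as
// unreachable (the kubelet has stopped heartbeating).
func isNodeUnreachable(node *corev1.Node) bool {
	if node == nil {
		return false
	}
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionUnknown
		}
	}
	return false
}

// shouldWaitForVolumeDetach returns false for unreachable Nodes so that a dead
// kubelet cannot block Machine deletion forever; otherwise it keeps waiting
// while the Node still reports attached volumes.
func shouldWaitForVolumeDetach(node *corev1.Node) bool {
	if isNodeUnreachable(node) {
		return false
	}
	return len(node.Status.VolumesAttached) > 0
}
```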

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 22, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 22, 2024
@k8s-ci-robot (Contributor)

Hi @typeid. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mjlshen (Contributor) left a comment

This is how OpenShift handles this situation as well: https://github.com/openshift/machine-api-operator/blob/dcf1387cb69f8257345b2062cff79a6aefb1f5d9/pkg/controller/machine/drain_controller.go#L164-L171. Furthermore, this allows the pods to actually be deleted, which triggers the CSI drivers to detach the relevant volumes and lets the machine be replaced.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 038bbf764ac6107032bcd184e1434415cc660f3d

@jackfrancis (Contributor)

@willie-yao @Jont828 Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

@sbueringer (Member)

> @willie-yao @Jont828 Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

Do MachineHealthChecks work with MachinePool Machines? (I just don't remember us implementing anything for them, at least in core CAPI.)

@chrischdi (Member) left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2024
@willie-yao (Contributor)

> Could you sanity check the equivalent functionality for MachinePoolMachines? Do we need changes there as well?

I know for a fact that MachineHealthCheck is not implemented for ClusterClass MachinePools, as stated in the "Later" section of issue #5991. @Jont828 are you able to verify if they are implemented for MachinePoolMachines?

@enxebre (Member) commented May 29, 2024

@typeid can you clarify where exactly in the https://github.com/kubernetes/kubectl/blob/master/pkg/drain/drain.go#L268-L360 flow this makes the difference?

@mjlshen (Contributor) commented May 30, 2024

@enxebre GracePeriodSeconds (the difference) is in https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L312
--> https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L150
--> https://github.com/kubernetes/kubectl/blob/d3ad75324522280fa8bccd7ad949b34336f5fc84/pkg/drain/drain.go#L133-L136

Skipping a couple of steps here, but GracePeriodSeconds then takes effect in the kube-apiserver (https://github.com/kubernetes/apiserver/blob/259cd1817cb1cc73bc33029df69934ac5c0d07ea/pkg/registry/generic/registry/store.go#L1116), where graceful eventually evaluates to false after the grace period has expired, so the pods are deleted by the API server instead of being stuck in a Terminating state forever.

That said, this behavior could also be handled via https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown instead. That description exactly matches the symptoms of this bug:

> When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods that are part of a StatefulSet will be stuck in terminating status on the shutdown node and cannot move to a new running node. This is because kubelet on the shutdown node is not available to delete the pods so the StatefulSet cannot create a new pod with the same name. If there are volumes used by the pods, the VolumeAttachments will not be deleted from the original shutdown node so the volumes used by these pods cannot be attached to a new running node. As a result, the application running on the StatefulSet cannot function properly. If the original shutdown node comes up, the pods will be deleted by kubelet and new pods will be created on a different running node. If the original shutdown node does not come up, these pods will be stuck in terminating status on the shutdown node forever.
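
For reference, the GracePeriodSeconds plumbing traced above starts at the kubectl drain helper's options; here is a minimal sketch of how those options are set (illustrative values only, not what CAPI actually configures):

```go
// Minimal sketch (illustrative values, not CAPI's actual configuration) of the
// kubectl drain helper options that the links above trace into the apiserver.
package main

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/kubectl/pkg/drain"
)

func drainNode(ctx context.Context, cfg *rest.Config, nodeName string, nodeUnreachable bool) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true, // evict pods not managed by a controller
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		// -1 defers to each pod's terminationGracePeriodSeconds (30s by default).
		GracePeriodSeconds: -1,
		// Give up waiting for pods whose deletion has been pending this long.
		SkipWaitForDeleteTimeoutSeconds: 300,
		Out:                             os.Stdout,
		ErrOut:                          os.Stderr,
	}
	if nodeUnreachable {
		// Hypothetical tweak discussed in this thread: shorten the grace period
		// when the kubelet will never confirm graceful termination anyway.
		helper.GracePeriodSeconds = 1
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```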

@enxebre (Member) commented May 30, 2024

> That said, this behavior could also be handled via https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown instead. That description exactly matches the symptoms of this bug.

I guess for us to take advantage of that, the instance would already need to be in the process of shutting down, so it makes sense to me to mimic it earlier, at least for now.

> Skipping a couple of steps here, but GracePeriodSeconds then takes effect in the kube-apiserver (https://github.com/kubernetes/apiserver/blob/259cd1817cb1cc73bc33029df69934ac5c0d07ea/pkg/registry/generic/registry/store.go#L1116), where graceful eventually evaluates to false after the grace period has expired, so the pods are deleted by the API server instead of being stuck in a Terminating state forever.

So does this effectively invalidate our SkipWaitForDeleteTimeoutSeconds, since all pods will match GracePeriodSeconds and get terminated first?

@typeid (Contributor, Author) commented May 30, 2024

I initially had the same understanding as @mjlshen's explanation, but after more digging I'm no longer sure that just setting the grace period to 1 actually solves the issue here.

Reason:

  • If no grace period or -1 is specified (the CAPI case), we use the default deleteOptions: see here.
  • We call the eviction API with these empty deleteOptions: see here.
  • This results in using the default grace period for the object (30 seconds for pods if unspecified):
    • "If this value is nil, the default grace period for the specified type will be used." (source)
    • For pods, this "Defaults to 30 seconds"; see terminationGracePeriodSeconds here.

So I would expect the same behavior with the current logic as with a 1-second grace period; nothing differs in terms of pod deletion.

However, at this point the eviction API has already been called (in both cases, current vs. PR) and the deletion grace period should be set on the pod, so the API server should take care of it.
There seems to be no difference in actual pod deletion between explicitly setting the grace period and letting it default; it only decides whether that happens sooner or later.

I re-tested draining with grace period 1, and it doesn't delete the pods in my reproducer. I thought I had tried that before, but the fact that it doesn't work seems to confirm the logic above. Setting the grace period to 0, however, does result in the pods being deleted.

Further digging in the apiserver...

The logic after that gets a bit complex, but it seems like setting the grace period to 0 could be what we want for immediate deletion.

I tried reproducing this with a MAPI MHC: it skips the pod deletion but doesn't get stuck replacing the node. I'm not sure what happens with the volume; maybe it doesn't block the node deletion on existing volumes? I'll have to dig further into that.

> So does this effectively invalidate our SkipWaitForDeleteTimeoutSeconds, since all pods will match GracePeriodSeconds and get terminated first?

Not necessarily. I think in the case where the pods still don't delete even with grace period = 1, we would still skip them once their deletionTimestamp is older than 5 minutes with that setting. MAPI actually sets SkipWaitForDeleteTimeoutSeconds to 1 as well, though (here and here). We might want to mimic that too?

I'll sync up with @mjlshen to get the same understanding and see if I'm possibly missing something here. :)
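
For reference, here is roughly what the eviction call under discussion looks like with client-go (an illustrative sketch, not CAPI code): a nil GracePeriodSeconds in DeleteOptions defers to the pod's terminationGracePeriodSeconds (30 seconds by default), while a pointer to 0 requests immediate deletion.

```go
// Illustrative sketch (not CAPI code) of evicting a pod via the policy/v1
// Eviction API. A nil GracePeriodSeconds defers to the pod's
// terminationGracePeriodSeconds (30s by default); a pointer to 0 asks the
// apiserver for immediate deletion.
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evictPod(ctx context.Context, client kubernetes.Interface, namespace, name string, gracePeriodSeconds *int64) error {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		// nil means "use the pod's own grace period"; a pointer to 0 means delete immediately.
		DeleteOptions: &metav1.DeleteOptions{GracePeriodSeconds: gracePeriodSeconds},
	}
	return client.CoreV1().Pods(namespace).EvictV1(ctx, eviction)
}
```

This is consistent with the observation above that only a grace period of 0 actually unblocks the pods when the kubelet is unreachable.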

@enxebre (Member) commented May 30, 2024

With grace period 0, we run different logic on the apiserver side and return (graceful = false, gracefulPending = false, err = nil).
With grace period 1 (or -1, which results in deleteGracePeriod = the pod's setting, 30 seconds by default), we update the deletion timestamp and grace period (https://github.com/kubernetes/apiserver/blob/4aef12dc73bd08c68036e1f8990c869806cbab58/pkg/registry/rest/delete.go#L134) and return (graceful = true, gracefulPending = false, err = nil).

Pending source-code verification, I assume the second case won't help in this scenario because it requires kubelet coordination via the pod status, which is not possible while the kubelet is unreachable.

> I tried reproducing this with a MAPI MHC: it skips the pod deletion but doesn't get stuck replacing the node. I'm not sure what happens with the volume; maybe it doesn't block the node deletion on existing volumes? I'll have to dig further into that.

Yes, decoupling volume detaching from draining is not implemented in MAPI.

So to summarise, the erratic behaviour is that we have a semantic to acknowledge an unreachable Node and an opinion on how to react to it (we assume that giving up on those pods after 5 minutes is safe because the eviction request is satisfied by PDBs), but we are not consistent with that opinion for volume detachment.
So my preference would be to add that semantic consistently to the volume detachment process, i.e. I propose we add additional logic within isNodeVolumeDetachingAllowed / nodeVolumeDetachTimeoutExceeded to account for unreachable Nodes and give up after the same SkipWaitForDeleteTimeoutSeconds we use for drain.

FWIW, see the related old discussion from when we introduced this semantic: #2165 (comment)
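
A rough sketch of that proposal (hypothetical names; using the Machine's deletionTimestamp as the reference point and a 5-minute timeout are assumptions made only for this illustration):

```go
// Hypothetical sketch of the proposal above: keep waiting for volume
// detachment, but give up for unreachable Nodes after a timeout, mirroring
// SkipWaitForDeleteTimeoutSeconds for drain. Names, the timeout value, and
// the choice of the Machine's deletionTimestamp are assumptions.
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const skipWaitForVolumeDetachAfter = 5 * time.Minute // assumed to match the drain skip timeout

func volumeDetachWaitExceeded(node *corev1.Node, machineDeletionTimestamp *metav1.Time) bool {
	if node == nil || machineDeletionTimestamp == nil {
		return false
	}
	unreachable := false
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionUnknown {
			unreachable = true
		}
	}
	if !unreachable {
		// Reachable Nodes keep the existing behavior: wait for detachment.
		return false
	}
	return time.Since(machineDeletionTimestamp.Time) > skipWaitForVolumeDetachAfter
}
```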

@typeid (Contributor, Author) commented May 30, 2024

That sounds good to me.

I think we should still make the drain consistent with what MAPI does for unreachable nodes: setting GracePeriod and the PodDeletionTimeout both to 1 second. While this still would not fix issue #10661 (to my current understanding), it would at least allow faster replacement of unreachable nodes. I will open a new issue for that and link a new PR, so as not to cause extra confusion by repurposing this PR.

@typeid changed the title from "🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached" to "[WIP] 🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached" on May 30, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mjlshen May 31, 2024 10:20
@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 31, 2024
@enxebre (Member) commented May 31, 2024

For historical context, this is where we introduced the permanent wait on detachment:
#4707

This is where we exposed it as an API input:
#6285

Now this PR aligns the behaviour with drain for unreachable Nodes.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: d99ffbe32ca8dfdfe705b34e26485170293b14c8

@enxebre (Member) commented May 31, 2024

ptal @chrischdi @sbueringer

@sbueringer (Member)

I'll take a look, but I need some time to get a full picture

@chrischdi (Member)

Implementation makes sense to me and matches the issue

/lgtm

@sbueringer (Member)

Just a quick update, I'm looking into this. Just have to double check something with folks downstream. I'll get back to you ASAP.

@sbueringer (Member) commented Jun 13, 2024

/lgtm

Slightly renamed the PR, because this affects more than just MHC.

/hold
I want to check with some other folks, will get back to you ASAP

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2024
@sbueringer changed the title from "🐛 MachineHealthCheck properly remediates unreachable nodes with volumes attached" to "🐛 Machine deletion skips waiting for volumes detached for unreachable Nodes" on Jun 13, 2024
@sbueringer (Member)

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2024
@sbueringer (Member)

Thx folks!

@sbueringer (Member)

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 9411031 into kubernetes-sigs:main Jun 17, 2024
36 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Jun 17, 2024
@sbueringer (Member)

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #10765

In response to this:

> /cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/machine - Issues or PRs related to machine lifecycle management.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

MachineHealthCheck unable to remediate unreachable node with volumes attached
9 participants