Graceful scaledown of deprecated MachineDeployment #812

Closed

mattburgess opened this issue May 5, 2023 · 7 comments
Labels
- area/usability (Usability related)
- kind/enhancement (Enhancement, improvement, extension)
- priority/3 (Priority; lower number equals higher priority)
- status/closed (Issue is closed, either delivered or triaged)

Comments

@mattburgess

How to categorize this issue?

/area usability
/kind enhancement
/priority 3

What would you like to be added:

We'd like a way to gracefully scale a MachineDeployment down to 0, specifically without assuming that PDBs will protect pod availability.

Why is this needed:

From time to time we need to completely remove a MachineDeployment from our clusters. Ideally we'd run something like kubectl -n machine-controller-manager scale machinedeployment my-md --replicas 0 and just let MCM handle the rest. However, this can lead to undesirable consequences:

  1. If that MachineDeployment manages more than x% of the cluster's capacity, scaling it down can evict more pods than our buffer pods reserve capacity for, causing outages while we wait for CA to notice the unschedulable pods and bring in more nodes.
  2. If a particular pod has no PDB, or its PDB is misconfigured (e.g. maxUnavailable: 0), the nodes running those pods will hit the 10-minute eviction timeout and all be terminated at the same time, causing a loss of service. Unfortunately, we're not in control of those PDBs and, despite some efforts to ask for them to be adjusted, we can't be guaranteed that they will be.

In an ideal scenario I'd quite like the following workflow:

  1. I can mark the MachineDeployment in some way so that CA ignores it for any scaling decisions (is cluster-autoscaler.kubernetes.io/scale-down-disabled: true sufficient? Does that also tell it not to scale up?)
  2. I can mark the MachineDeployment in some way so that MCM knows it needs to be gracefully drained
  3. MCM proceeds to scale down the MachineDeployment x (user-configurable) nodes at a time. It waits for that scale-down to complete and for the number of unschedulable pods to be < y (user-configurable) before proceeding with the next iteration of the scale-down loop
@mattburgess mattburgess added the kind/enhancement Enhancement, improvement, extension label May 5, 2023
@gardener-robot gardener-robot added area/usability Usability related priority/3 Priority (lower number equals higher priority) labels May 5, 2023
@himanshu-kun
Contributor

himanshu-kun commented May 12, 2023

I can mark the MachineDeployment in some way so that CA ignores it for any scaling decisions

You can remove the MachineDeployment from the --nodes flag of the autoscaler.
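
For reference, a rough sketch of what that looks like, assuming each --nodes entry takes the form <min>:<max>:<namespace>.<machinedeployment-name> (worth double-checking against your CA deployment); the deployment names below are placeholders:

```bash
# Sketch only: the entry for the deprecated "my-md" is simply omitted,
# so CA neither scales that node group up nor down.
cluster-autoscaler \
  --nodes=1:10:machine-controller-manager.keep-md-a \
  --nodes=1:10:machine-controller-manager.keep-md-b
```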

is cluster-autoscaler.kubernetes.io/scale-down-disabled: true sufficient? Does that also tell it to not scale up either?

It is a per-node annotation and doesn't tell the autoscaler anything about the node group, so scale-up of the node group would still happen.
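
For reference, that annotation is set on individual Node objects (the node name below is a placeholder):

```bash
# Prevents CA from scaling down this particular node, but says nothing
# about the node group it belongs to.
kubectl annotate node ip-10-0-1-23.example.internal \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
```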

It waits for that scaledown to complete and for the number of unschedulable pods to be < y (user configurable) before proceeding with the next iteration of the scaledown loop

We have written MCM to work with CA. CA deals with unschedulable pods and directs MCM to scale a particular node group up or down. MCM only deals with machines in terms of their count, so making MCM that smart would just complicate things.
(Plus it would get quite complicated in situations where new unschedulable pods keep arriving and keep stopping the scale-down, so it cannot be generalized.)

MCM proceeds to scale down the MachineDeployment x (user configurable) nodes at a time

This can also be done with a script where you issue the command

kubectl -n machine-controller-manager scale machinedeployment my-md --replicas <replicas>

while keeping track of the available machines in the deployment. Too much configurability on our side is not required.
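
A rough sketch of such a script, assuming placeholder names and thresholds (counting Pending pods is only a crude stand-in for "unschedulable pods", and none of this is an MCM feature):

```bash
#!/usr/bin/env bash
# Sketch only: scale a MachineDeployment down STEP replicas at a time,
# waiting between iterations for the scale-down to settle and for the
# number of Pending pods to drop below THRESHOLD.
set -euo pipefail

NS=machine-controller-manager
MD=my-md        # placeholder MachineDeployment name
STEP=2          # machines to remove per iteration
THRESHOLD=5     # proceed only while fewer than this many pods are Pending

replicas=$(kubectl -n "$NS" get machinedeployment "$MD" -o jsonpath='{.spec.replicas}')
while [ "$replicas" -gt 0 ]; do
  replicas=$(( replicas > STEP ? replicas - STEP : 0 ))
  kubectl -n "$NS" scale machinedeployment "$MD" --replicas "$replicas"

  # Wait for MCM to finish this step and for scheduling pressure to ease.
  while true; do
    available=$(kubectl -n "$NS" get machinedeployment "$MD" \
      -o jsonpath='{.status.availableReplicas}')
    pending=$(kubectl get pods -A --field-selector status.phase=Pending \
      --no-headers 2>/dev/null | wc -l)
    if [ "${available:-0}" -le "$replicas" ] && [ "$pending" -lt "$THRESHOLD" ]; then
      break
    fi
    sleep 30
  done
done
```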

@himanshu-kun
Contributor

is cluster-autoscaler.kubernetes.io/scale-down-disabled: true sufficient? Does that also tell it to not scale up either?

You can achieve this by adding a taint, which your pods don't tolerate, to all nodes of the MachineDeployment. To do so, add the taint in the spec.template.spec.nodeTemplate.taints section.
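
A rough sketch of what that could look like, assuming the field path named above (worth verifying against the MachineDeployment CRD in use); the deployment name and taint key are placeholders:

```bash
# Not verified against the MCM CRD: add a NoSchedule taint to the node
# template so nodes of this MachineDeployment repel pods that don't
# tolerate it. Note that a merge patch replaces the whole taints list.
kubectl -n machine-controller-manager patch machinedeployment my-md \
  --type merge \
  -p '{"spec":{"template":{"spec":{"nodeTemplate":{"taints":[{"key":"pool-deprecated","effect":"NoSchedule"}]}}}}}'
```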

@himanshu-kun
Contributor

/ping @mattburgess

@gardener-robot

@mattburgess ℹ️ please take some time to help himanshu-kun or redirect to someone else if you can't.

@mattburgess
Author

MCM only deals with machines in terms of their count, so making MCM that smart would just complicate things.

Yeah, understood. Thanks for the detailed response. We were trying to avoid having to write our own scale-down utility but it looks like that might be unavoidable.

You can remove the MachineDeployment from the --nodes flag of the autoscaler.

It's a shame that node-group auto-discovery hasn't been plugged in yet, as doing this obviously requires a code change and redeployment on our side. We may look at contributing auto-discovery if it isn't already being worked on?

It'd be nice, then, if CA supported such node-group deprecation. That way our migration of MDs/node-groups would look like this:

  1. Create new MachineDeployment (with CA node-group auto-discovery wired up we'd not need to make any changes to CA config)
  2. Add a label to the old MachineDeployment to signal to CA that it should no longer consider the related node-group for scale-up, and to gracefully scale the node-group to 0
  3. Delete the old MachineDeployment

Do you think that's a reasonable request that might be considered on the CA side? Either way, I'm happy for this to be closed, and we'll deal with the scale-down on our side for now.

@himanshu-kun
Contributor

It's a shame that node-group auto-discovery hasn't been plugged in yet, as doing this obviously requires a code change and redeployment on our side. We may look at contributing auto-discovery if it isn't already being worked on?

We also wanted to implement it, but because of low demand and our hands being full, we iceboxed the issue gardener/autoscaler#29.

Your contributions are welcome. Please comment on that issue with how you want to implement it, and we can discuss it there.
Kindly close this issue if there are no further queries.

@mattburgess
Author

Thanks again for the feedback @himanshu-kun.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 7, 2023