Support parallel node-group scale-ups #268

Open
himanshu-kun opened this issue Dec 7, 2023 · 4 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@himanshu-kun

What would you like to be added:

Upstream has added a feature that lets CA scale up multiple node groups in a single RunOnce() loop. This could help reduce scale-up latency.

PR -> kubernetes#5731
It is available from the v1.28 k/CA release, but as noted in the release notes we have to test it with our MCM and should also wait a few upstream releases for the feature to stabilize.

Demand was raised in live ticket #4048

Why is this needed:
There are customers who need quicker scale-ups: scaling up one node group at a time delays their scale-downs, adding cost and overrunning maintenance time windows if they are doing something like a blue/green deployment.

@himanshu-kun himanshu-kun added the kind/enhancement Enhancement, improvement, extension label Dec 7, 2023
@ashwani2k

Post-grooming of the issue:
Inputs from @rishabh-11 and @unmarshall, verbatim:

I think in our case, parallel scale-ups and sequential scale-ups don't make much of a difference. The reason is that if you look at the methods that execute these scale-ups, i.e. executeScaleUpsParallel (func (e *scaleUpExecutor) executeScaleUpsParallel(...)) and executeScaleUpsSync (func (e *scaleUpExecutor) executeScaleUpsSync(...)), both call executeScaleUp (func (e *scaleUpExecutor) executeScaleUp(...)). executeScaleUpsParallel calls it in a goroutine for each scale-up, and executeScaleUpsSync calls it one by one in a for loop. Now, executeScaleUp just increases the replica field of the MachineDeployment corresponding to the node group, which won't take time if everything is working fine. So we don't save any noticeable time.
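For context, a minimal sketch of the pattern described above; this is not the actual CA code, and all names besides the three methods are simplified stand-ins. The parallel variant fans out one goroutine per scale-up while the sequential variant loops, but both funnel into the same cheap per-node-group call. Upstream additionally gates the parallel path behind an opt-in flag (--parallel-scale-up, if I read the 1.28 changelog correctly).

```go
// Sketch (not CA source) of the two executor paths being compared.
package main

import (
	"fmt"
	"sync"
)

// scaleUpInfo stands in for CA's per-node-group scale-up request.
type scaleUpInfo struct {
	nodeGroup string
	newSize   int
}

// executeScaleUp stands in for the real method: with MCM it essentially
// patches the MachineDeployment's replica field, a single quick API call.
func executeScaleUp(info scaleUpInfo) error {
	fmt.Printf("scaling %s to %d replicas\n", info.nodeGroup, info.newSize)
	return nil
}

// executeScaleUpsSync mirrors the sequential path: one scale-up at a time.
func executeScaleUpsSync(infos []scaleUpInfo) error {
	for _, info := range infos {
		if err := executeScaleUp(info); err != nil {
			return err
		}
	}
	return nil
}

// executeScaleUpsParallel mirrors the parallel path: one goroutine per
// scale-up, with errors collected on a channel.
func executeScaleUpsParallel(infos []scaleUpInfo) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(infos))
	for _, info := range infos {
		wg.Add(1)
		go func(i scaleUpInfo) {
			defer wg.Done()
			if err := executeScaleUp(i); err != nil {
				errs <- err
			}
		}(info)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err
	}
	return nil
}

func main() {
	infos := []scaleUpInfo{{"green-a", 5}, {"green-b", 3}}
	_ = executeScaleUpsSync(infos)
	_ = executeScaleUpsParallel(infos)
}
```

In other words, parallelism only pays off when executeScaleUp itself is slow (e.g. a slow cloud-provider call); with MCM's replica patch it is near-instant either way, which is the point made above.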

@rubenscheja

I thought the parallelism would at least improve scale-up requests across different worker groups.

When looking at the 1.28.0 CA changelog I noticed that the parallel drain feature is now the default.
Would that be something that could help us? I could not find any change in the precedence of scale-up vs. scale-down in the architecture description, though.

Another idea that we briefly talked about in the Gardener stakeholder sync is using multiple CA instances in parallel, each instance taking care of a different set of worker groups (which we would need to assign to a CA in the shoot).
[image: diagram of the proposed blue/green worker-group split across two CA instances]
As the "blue" group would only face scale-downs during maintenance, and the "green" group would only face scale-ups, and as both would be handled by different CAs, there should not be any blocking of scale-downs due to scale-ups.
The question is whether that is feasible to implement with reasonable effort; a rough sketch of the idea follows.
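To make the assignment idea concrete, here is a hypothetical sketch of how a per-instance worker-group allow-list could look. The --node-group-allow-list flag and the filtering hook are invented for illustration and do not exist in today's CA:

```go
// Hypothetical sketch of per-instance node-group ownership. Nothing here is
// real CA code: the flag and the filter are assumptions made to illustrate
// the "one CA per set of worker groups" idea.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// ownedGroups parses a comma-separated allow-list such as
// "green-a,green-b" into a set of node-group names.
func ownedGroups(list string) map[string]bool {
	owned := make(map[string]bool)
	for _, g := range strings.Split(list, ",") {
		if g = strings.TrimSpace(g); g != "" {
			owned[g] = true
		}
	}
	return owned
}

func main() {
	// Hypothetical flag: which worker groups this CA instance manages.
	groups := flag.String("node-group-allow-list", "",
		"comma-separated worker groups owned by this instance")
	flag.Parse()

	owned := ownedGroups(*groups)
	// In a real loop, node groups outside the allow-list would be skipped
	// entirely, so a "blue" CA never blocks on "green" scale-ups.
	for _, ng := range []string{"blue-a", "green-a", "green-b"} {
		fmt.Printf("%s managed by this instance: %v\n", ng, owned[ng])
	}
}
```

Each shoot would then run one such CA per worker-group set, so the "blue" instance's scale-down computations never wait on the "green" instance's scale-ups.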

@rubenscheja

rubenscheja commented Mar 2, 2024

BTW: We just faced an immense scale-down delay in our latest maintenance (link in our internal ticket).
The Cluster Autoscaler seems to have been overwhelmed by the number of pods for which it needed to calculate potential scale-ups. This increased the CA cycle time from the usual 10s to over one minute, as far as I can see from the linked control-plane logs in the ticket.

@gardener-robot

@rubenscheja You have mentioned internal references in the public. Please check.
