Support parallel node-group scale-ups #268

Open
himanshu-kun opened this issue Dec 7, 2023 · 4 comments
Labels
kind/enhancement Enhancement, improvement, extension

Comments

@himanshu-kun

What would you like to be added:

Upstream has added a feature that lets CA scale up multiple node groups in a single RunOnce() loop. This could help reduce scale-up latency.

PR -> kubernetes#5731
It is available from the v1.28 k/CA release, but as noted in the release notes we have to test it with our MCM and should also wait a few upstream releases for the feature to stabilize.

Demand was raised in live ticket #4048

Why is this needed:
There are customers who need quicker scale-ups: scaling up one node group at a time delays their scale-downs, adding cost and overrunning maintenance time windows if they are doing something like a blue/green deployment.

@himanshu-kun himanshu-kun added the kind/enhancement Enhancement, improvement, extension label Dec 7, 2023
@ashwani2k

Post-grooming of the issue:
Inputs from @rishabh-11 and @unmarshall, verbatim:

I think in our case, parallel scale-ups and sequential scale-ups don't make much of a difference. The reason is that if you look at the methods that execute these scale-ups, i.e. executeScaleUpsParallel (func (e *scaleUpExecutor) executeScaleUpsParallel(...)) and executeScaleUpsSync (func (e *scaleUpExecutor) executeScaleUpsSync(...)), both call executeScaleUp (func (e *scaleUpExecutor) executeScaleUp(...)). executeScaleUpsParallel calls it in a goroutine for each scale-up, and executeScaleUpsSync calls it one by one in a for loop. Now, executeScaleUp just increases the replica field of the MachineDeployment corresponding to the node group, which won't take time if everything is working fine. So we don't save any noticeable time.
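For context, a minimal sketch of the pattern described above; this is not the actual CA code, and all names besides the three methods are simplified stand-ins. The parallel variant fans out one goroutine per scale-up while the sequential variant loops, but both funnel into the same cheap per-node-group call. Upstream additionally gates the parallel path behind an opt-in flag (--parallel-scale-up, if I read the 1.28 changelog correctly).

```go
// Sketch (not CA source) of the two executor paths being compared.
package main

import (
	"fmt"
	"sync"
)

// scaleUpInfo stands in for CA's per-node-group scale-up request.
type scaleUpInfo struct {
	nodeGroup string
	newSize   int
}

// executeScaleUp stands in for the real method: with MCM it essentially
// patches the MachineDeployment's replica field, a single quick API call.
func executeScaleUp(info scaleUpInfo) error {
	fmt.Printf("scaling %s to %d replicas\n", info.nodeGroup, info.newSize)
	return nil
}

// executeScaleUpsSync mirrors the sequential path: one scale-up at a time.
func executeScaleUpsSync(infos []scaleUpInfo) error {
	for _, info := range infos {
		if err := executeScaleUp(info); err != nil {
			return err
		}
	}
	return nil
}

// executeScaleUpsParallel mirrors the parallel path: one goroutine per
// scale-up, with errors collected on a channel.
func executeScaleUpsParallel(infos []scaleUpInfo) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(infos))
	for _, info := range infos {
		wg.Add(1)
		go func(i scaleUpInfo) {
			defer wg.Done()
			if err := executeScaleUp(i); err != nil {
				errs <- err
			}
		}(info)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err
	}
	return nil
}

func main() {
	infos := []scaleUpInfo{{"green-a", 5}, {"green-b", 3}}
	_ = executeScaleUpsSync(infos)
	_ = executeScaleUpsParallel(infos)
}
```

In other words, parallelism only pays off when executeScaleUp itself is slow (e.g. a slow cloud-provider call); with MCM's replica patch it is near-instant either way, which is the point made above.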

@rubenscheja

I thought the parallelism would at least improve scale-up requests across different worker groups.

When looking at the 1.28.0 CA changelog I noticed that the parallel drain feature is now the default.
Would that be something that could help us? I could not find any change in the precedence of scale-up vs. scale-down in the architecture description, though.

Another idea that we briefly talked about in the Gardener stakeholder sync is using multiple CA instances in parallel, each instance taking care of a different set of worker groups (which we would need to assign to a CA in the shoot).
[image: diagram of the proposed blue/green worker-group split across two CA instances]
As the "blue" group would only face scale-downs during maintenance, and the "green" group would only face scale-ups, and as both would be handled by different CAs, there should not be any blocking of scale-downs due to scale-ups.
The question is whether that is feasible to implement with reasonable effort; a rough sketch of the idea follows.
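To make the assignment idea concrete, here is a hypothetical sketch of how a per-instance worker-group allow-list could look. The --node-group-allow-list flag and the filtering hook are invented for illustration and do not exist in today's CA:

```go
// Hypothetical sketch of per-instance node-group ownership. Nothing here is
// real CA code: the flag and the filter are assumptions made to illustrate
// the "one CA per set of worker groups" idea.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// ownedGroups parses a comma-separated allow-list such as
// "green-a,green-b" into a set of node-group names.
func ownedGroups(list string) map[string]bool {
	owned := make(map[string]bool)
	for _, g := range strings.Split(list, ",") {
		if g = strings.TrimSpace(g); g != "" {
			owned[g] = true
		}
	}
	return owned
}

func main() {
	// Hypothetical flag: which worker groups this CA instance manages.
	groups := flag.String("node-group-allow-list", "",
		"comma-separated worker groups owned by this instance")
	flag.Parse()

	owned := ownedGroups(*groups)
	// In a real loop, node groups outside the allow-list would be skipped
	// entirely, so a "blue" CA never blocks on "green" scale-ups.
	for _, ng := range []string{"blue-a", "green-a", "green-b"} {
		fmt.Printf("%s managed by this instance: %v\n", ng, owned[ng])
	}
}
```

Each shoot would then run one such CA per worker-group set, so the "blue" instance's scale-down computations never wait on the "green" instance's scale-ups.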

@rubenscheja

rubenscheja commented Mar 2, 2024

BTW: We just faced an immense scale-down delay in our latest maintenance (link in our internal ticket).
The Cluster Autoscaler seems to have been overwhelmed by the number of pods for which it needed to calculate potential scale-ups. This increased the CA cycle time from the usual 10s to over one minute, as far as I can see from the linked control-plane logs in the ticket.

@gardener-robot

@rubenscheja You have mentioned internal references in the public. Please check.
