Allow a way to specify extended resources for scale-from-zero scenario #132

himanshu-kun · 2022-06-28T12:35:03Z

What would you like to be added:
There should be a mechanism in our autoscaler that lets users specify the extended resources their nodes provide, so that the autoscaler is aware of them and can take them into account when scaling a node group out from zero.

Why is this needed:
Currently the autoscaler cannot scale a node group from zero if the pending pod requests an extended resource. This happens because the nodeTemplate that the autoscaler creates does not have the extended resources specified.
However, it is able to scale from one, because in that case the autoscaler can form the nodeTemplate from an existing node.
The AWS and Azure implementations of the autoscaler already offer ways to specify such resources.
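To make the failure mode concrete, here is a minimal sketch of the fit check, not the actual cluster-autoscaler code. The `ResourceList` type, the `fits` function, and the resource name `example.com/foo` are hypothetical and use whole-unit integers instead of Kubernetes quantities.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for a Kubernetes
// resource list: resource name -> quantity in whole units.
type ResourceList map[string]int64

// fits reports whether every resource requested by the pod is covered by
// the node template's capacity. A resource that is missing from the
// template counts as zero, so any request for it fails.
func fits(requests, capacity ResourceList) bool {
	for name, want := range requests {
		if capacity[name] < want {
			return false
		}
	}
	return true
}

func main() {
	podRequests := ResourceList{"cpu": 1, "example.com/foo": 2}

	// Template built for an empty node group: the extended resource is unknown.
	fromZero := ResourceList{"cpu": 4, "memory": 16}
	// Template built from an existing node, which advertises the resource.
	fromNode := ResourceList{"cpu": 4, "memory": 16, "example.com/foo": 4}

	fmt.Println("scale from zero fits:", fits(podRequests, fromZero))
	fmt.Println("scale from one fits:", fits(podRequests, fromNode))
}
```

With a template that lacks the extended resource, the pod never fits, so the node group is never considered for scale-up.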

himanshu-kun · 2022-06-28T13:00:06Z

I have thought of some possible solutions:

We could enhance the logic of using an existing node in the worker group (introduced in this PR) to also include extended resources. But this will not solve the case where there is no node in any zone of the worker group.
We could enhance the feature that passes the nodeTemplate through the shoot YAML (enabled for AWS and Azure as of now) to also pass extended resources. But this approach comes with certain drawbacks:

Any update here would lead to a rolling update of the worker group nodes.
Delivering this would require changes to gardenlet, the extensions, and the autoscaler, so it would take time.

We could use something similar to what AWS and Azure do. They have a mechanism to add tags to their scaling groups (in our case, the machineDeployment). From those tags, the autoscaler could update the nodeTemplate with the resources. But we have one problem:

We currently don't allow customers to add tags directly to the machineDeployment.
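The tag-based option could look roughly like the sketch below. The prefix is the one the upstream AWS cluster-autoscaler documents for node-template resources; applying it to a machineDeployment, the `resourcesFromTags` helper, and the restriction to whole-number values are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resourceTagPrefix follows the tag convention documented for the AWS
// cluster-autoscaler. Using it on a machineDeployment is an assumption.
const resourceTagPrefix = "k8s.io/cluster-autoscaler/node-template/resources/"

// resourcesFromTags (hypothetical helper) collects the resources declared
// through tags. Only whole-number values are handled in this sketch; real
// quantities such as "10Gi" would need a proper quantity parser.
func resourcesFromTags(tags map[string]string) (map[string]int64, error) {
	resources := make(map[string]int64)
	for key, value := range tags {
		if !strings.HasPrefix(key, resourceTagPrefix) {
			continue // unrelated tag
		}
		name := strings.TrimPrefix(key, resourceTagPrefix)
		if name == "" {
			return nil, fmt.Errorf("tag %q has no resource name", key)
		}
		qty, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("tag %q: invalid quantity %q: %w", key, value, err)
		}
		resources[name] = qty
	}
	return resources, nil
}

func main() {
	tags := map[string]string{
		resourceTagPrefix + "example.com/foo": "4",
		"team":                                "ml",
	}
	resources, err := resourcesFromTags(tags)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(resources)
}
```

The returned map could then be merged into the nodeTemplate capacity before the autoscaler simulates scheduling.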

cc @unmarshall @ashwani2k

himanshu-kun · 2022-06-28T13:02:18Z

Upstream issue worth noting: kubernetes#1869

himanshu-kun · 2023-02-28T10:19:56Z

Post grooming decision

Specify the node template in the provider config section of the worker. The corresponding extension picks it up from there and populates the worker config, which contains the NodeTemplate. This is then used to generate the machine class. At the moment, the CA code does not consider ephemeral storage when scaling from zero.

Inside GetMachineDeploymentNodeTemplate

if len(filteredNodes) > 0 {
	klog.V(1).Infof("Nodes already existing in the worker pool %s", workerPool)
	baseNode := filteredNodes[0]
	klog.V(1).Infof("Worker pool node used to form template is %s and its capacity is cpu: %s, memory:%s", baseNode.Name, baseNode.Status.Capacity.Cpu().String(), baseNode.Status.Capacity.Memory().String())
	instance = instanceType{
		VCPU:             baseNode.Status.Capacity[apiv1.ResourceCPU],
		Memory:           baseNode.Status.Capacity[apiv1.ResourceMemory],
		GPU:              baseNode.Status.Capacity[gpu.ResourceNvidiaGPU],
		EphemeralStorage: baseNode.Status.Capacity[apiv1.ResourceEphemeralStorage],
		PodCount:         baseNode.Status.Capacity[apiv1.ResourcePods],
	}
} else {
	klog.V(1).Infof("Generating node template only using nodeTemplate from MachineClass %s: template resources-> cpu: %s,memory: %s", machineClass.Name, nodeTemplateAttributes.Capacity.Cpu().String(), nodeTemplateAttributes.Capacity.Memory().String())
	instance = instanceType{
		VCPU:   nodeTemplateAttributes.Capacity[apiv1.ResourceCPU],
		Memory: nodeTemplateAttributes.Capacity[apiv1.ResourceMemory],
		GPU:    nodeTemplateAttributes.Capacity["gpu"],
		// Note: EphemeralStorage is not set in this branch.
		// The number of pods per node depends on the CNI used and the maxPods kubelet config; the default is often 110.
		PodCount: resource.MustParse("110"),
	}
}

We need to fix the else branch so that it also considers ephemeral storage. We also need to fix the validation of NodeTemplate in the gardener provider extension so that ephemeral storage can be specified explicitly without also specifying CPU, GPU, or memory.
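One possible shape of that fix, as a self-contained sketch rather than the actual change: the types below are simplified stand-ins that use quantity strings instead of `resource.Quantity`, and the `ExtendedResources` field and the `instanceFromTemplate` helper are hypothetical names that do not appear in the code quoted above.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for the template
// capacity: resource name -> quantity string.
type ResourceList map[string]string

// instanceType mirrors the fields used in the snippet above, plus a
// hypothetical ExtendedResources field for everything else.
type instanceType struct {
	VCPU, Memory, GPU, EphemeralStorage, PodCount string
	ExtendedResources                             ResourceList
}

// coreResources lists the names that already map to dedicated fields.
var coreResources = map[string]bool{
	"cpu": true, "memory": true, "gpu": true, "ephemeral-storage": true,
}

// instanceFromTemplate builds the instance for the scale-from-zero branch,
// copying ephemeral storage and any extended resources from the template.
func instanceFromTemplate(capacity ResourceList) instanceType {
	inst := instanceType{
		VCPU:              capacity["cpu"],
		Memory:            capacity["memory"],
		GPU:               capacity["gpu"],
		EphemeralStorage:  capacity["ephemeral-storage"],
		PodCount:          "110", // same default as the original snippet
		ExtendedResources: ResourceList{},
	}
	for name, qty := range capacity {
		if !coreResources[name] {
			inst.ExtendedResources[name] = qty
		}
	}
	return inst
}

func main() {
	inst := instanceFromTemplate(ResourceList{
		"cpu":               "4",
		"memory":            "16Gi",
		"ephemeral-storage": "50Gi",
		"example.com/foo":   "2",
	})
	fmt.Printf("%+v\n", inst)
}
```

Keeping the extended resources in a separate map means new resource names do not require a new struct field each time.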

himanshu-kun added the kind/enhancement, area/auto-scaling, and priority/3 labels on Jun 28, 2022

himanshu-kun mentioned this issue Nov 25, 2022

Support for IaaS machine tags for all machines in a worker pool gardener/machine-controller-manager#750

Open

gardener-robot added the lifecycle/stale label on Dec 25, 2022

himanshu-kun added the needs/planning label and removed the lifecycle/stale label on Feb 28, 2023

gardener-robot added the lifecycle/stale label on Nov 7, 2023

rishabh-11 assigned elankath May 23, 2024

elankath linked a pull request Jul 4, 2024 that will close this issue

support ephemeral-storage and custom extended resouces #311

Open

rishabh-11 removed the needs/planning label on Jul 17, 2024

elankath mentioned this issue Jul 18, 2024

Support extended resources in provider config node template without re-specifying core resources gardener/gardener-extension-provider-aws#1009

Open
