
Node utilization util snapshot #1533

Open · wants to merge 7 commits into master
Conversation


@ingvagabund ingvagabund commented Oct 11, 2024

Built on top of #1532.
Prep work for load-aware descheduling: #225.

Later on, usageSnapshot can be promoted to an interface, with an actualUsageSnapshot implementation pulling the usage from metrics.

Notes (towards integrating plugins with metrics):

  • utilize https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#metrics-api
    • kubectl get podmetrics + kubectl get nodemetrics
  • implement an exponentially weighted moving average (EWMA) over the metrics above (introduce an initial delay so at least 5 samples are collected before the descheduling plugins run, with the default window set to 5s and both the window and the initial delay configurable)
  • extend the descheduling framework with a metrics collector (with the EWMA computed) and make it available to each plugin (the collector starts collecting only if at least one of the plugins accesses the metrics)

TODO:

  • define initialDelay (and pullInterval for the Kubernetes metrics) in case a user wants to let the metrics collector smooth out the resource changes.

For testing purposes:

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
# metricsCollector:
#   enabled: true
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "memory": 20
        targetThresholds:
          "memory": 70
        metricsUtilization:
          metricsServer: true
          prometheusURL: https://prometheus-k8s-openshift-monitoring.apps.jchaloup-20241106.group-b.devcluster.openshift.com
          prometheusAuthToken: XXXXX
          promQuery: instance:node_cpu:rate:sum
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
```

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ingvagabund. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 11, 2024
@fanhaouu
Contributor

Hello, master. Due to a busy schedule at my company, I previously only managed to complete half of the related KEP. I'm glad to see that you're working on this. It looks like you're aiming to reuse the current Node utilization logic. I have a few suggestions:

  • It should support different data sources, similar to PayPal's load-watcher.
  • It should support various real-time data processing algorithms. For instance, real-time calculations, using rate averages, or predictions based on EWMA + P95, similar to the approach used by autoscaler.
  • If the goal is to address real-time CPU hotspots, perhaps there’s no need to calculate the number of nodes below or above a certain threshold. Of course, you could also provide a switch to control this behavior.

Hope these suggestions help!

@ingvagabund
Contributor Author

Hello sir :)

thank you for taking part in composing the out-of-tree descheduling plugin KEP.

> It should support different data sources, similar to PayPal's load-watcher.

You are on the right track here. I'd like to get in touch with the load-watcher maintainers and extend the codebase to provide a generic interface for accessing pod utilization metrics as well; currently, only actual node utilization is collected. In the meantime, I am shaping the code here to integrate better with other utilization sources such as metrics.

> It should support various real-time data processing algorithms. For instance, real-time calculations, using rate averages, or predictions based on EWMA + P95, similar to the approach used by autoscaler.

This is where we can debate more. Thank you for sharing the specifics. There's an open issue for the pod autoscaler suggesting introducing an EMA: kubernetes/kubernetes#62235. Are you aware of a similar issue or discussion for the cluster autoscaler? I'd love to learn more about how it's implemented there. Ultimately, the current plugin just needs to know which pod, when evicted and properly re-scheduled, will improve the overall node/workload utilization. I can see the utilization snapshot being produced through various methods.

> If the goal is to address real-time CPU hotspots, perhaps there’s no need to calculate the number of nodes below or above a certain threshold. Of course, you could also provide a switch to control this behavior.

I can see how evicting hotspot pods relates to consuming metrics/real-time node utilization. In the current plugin's context this is better suited for a new/different plugin. I can also see how RemoveDuplicates could be extended to evict based on overall node utilization instead of the current counting approach. Not every plugin will need to consume metrics, though common pieces can be shared across them through the descheduling framework.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 5, 2024
@ingvagabund ingvagabund force-pushed the node-utilization-util-snapshot branch 3 times, most recently from d744a96 to 800c92c Compare November 6, 2024 18:34
@ingvagabund
Contributor Author

kubernetes/kubernetes#128663 to address the discrepancy in the fake metrics client's node/pod metrics resource names.

…xtraction into a dedicated usage client

Turning the usage client into an interface allows implementing other kinds of usage clients, such as actual-usage or Prometheus-based resource collection.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 7, 2024
@ingvagabund
Contributor Author

/test pull-descheduler-verify-master

3 participants