Swap pods to reduce fragmentation #1519
Hi @jinglinliang. The issue description looks quite awesome. I love the snapshot picture. I wish there were more such reports :). With regard to either of the node utilization strategies, it's ultimately up to the scheduler to make the switch. The descheduler plugins might evict some of the non-anti-affinity pods, yet these non-anti-affinity pods need to first get scheduled to where the blue pods are. I presume preemption and priorities do not help, since both blue and green pods have the same or very similar priority? With the profiles you can configure something like:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: Round1Low
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:
        "memory": 20
      targetThresholds:
        "memory": 70
  - name: "DefaultEvictor"
    args:
      ... # evict only green pods
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
- name: Round1High
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "HighNodeUtilization"
    args:
      thresholds:
        "memory": 20
  - name: "DefaultEvictor"
    args:
      ... # evict only blue pods
  plugins:
    balance:
      enabled:
      - "HighNodeUtilization"
- name: Round2Low
  # timeout: 1m or any reasonable number that takes into account number of nodes/pods in a cluster.
  pluginConfig:
  - name: "LowNodeUtilization"
    args:
      thresholds:
        "memory": 20
      targetThresholds:
        "memory": 70
  - name: "DefaultEvictor"
    args:
      ... # evict only green pods
  plugins:
    balance:
      enabled:
      - "LowNodeUtilization"
- name: Round2High
  pluginConfig:
  - name: "HighNodeUtilization"
    args:
      thresholds:
        "memory": 20
  - name: "DefaultEvictor"
    args:
      ... # evict only blue pods
  plugins:
    balance:
      enabled:
      - "HighNodeUtilization"
...

Perform the shaking multiple times. Yet, the current descheduler will be quite quick in evicting pods, so we'd have to implement a timeout between profiles that waits e.g. 1 minute (user configured) before "shaking the nodes" again.
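As a rough sketch only, the elided DefaultEvictor args above could be filled in along these lines, assuming the non-anti-affinity ("green") pods carry a distinguishing label (app: green is purely hypothetical; the "blue" profiles would use the opposite selector):

- name: "DefaultEvictor"
  args:
    labelSelector:            # only pods matching this selector are considered evictable
      matchLabels:
        app: green            # hypothetical label on the non-anti-affinity ("green") pods
    nodeFit: true             # only evict a pod if it appears to fit on another node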
Hi @ingvagabund. Thank you very much for the reply :) Some clarifications:

"Shaking the nodes" is an interesting idea but seems very unpredictable. It would be difficult to define the "blue" or "green" pods here. Also, the clusters may just enter a stable state based on the profile when all nodes are, for example, 50% allocated, and the total number of nodes stays the same as in the snapshot (please correct me if I'm wrong; see the illustration below). Another idea we had is to set the
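To make the stable-state concern concrete, here is my reading of the thresholds in the config above (a sketch of the concern, not a statement about actual descheduler behavior):

# Suppose every node sits at roughly 50% memory utilization:
# LowNodeUtilization:  thresholds "memory": 20  <  50  <  targetThresholds "memory": 70
#                      -> the node is neither underutilized nor overutilized, so no pods are evicted from it
# HighNodeUtilization: 50  >  thresholds "memory": 20
#                      -> the node does not count as underutilized, so no pods are evicted from it either
# In that state neither profile moves anything, and the node count stays the same as in the snapshot.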
Some of our clusters have small anti-affinity deployments that cause a lot of fragmentation. Here's a snapshot of one cluster: the blue deployment has anti-affinity, and the cluster ended up in this state after the blue deployment restarted.
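(For context, the constraint in play is presumably a required pod anti-affinity rule along these lines; app: blue is a placeholder, not the actual manifest:)

# Illustrative only: at most one "blue" pod per node via required pod anti-affinity.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: blue                        # hypothetical label on the anti-affinity deployment
      topologyKey: kubernetes.io/hostname  # spreads the pods so no two land on the same node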
I'm poking around for solutions to alleviate this situation.
First, the cluster autoscaler (CAS) is not scaling down those low-utilization nodes because none of the blue pods can fit onto the rest of the nodes, which are pretty much fully packed.
I came across the HighNodeUtilization and LowNodeUtilization plugins in the descheduler, but it looks like their eviction logic is similar to CAS. I'm wondering if it's possible to implement, or use existing, descheduler plugins to achieve some kind of swap function that swaps the blue pods with the non-anti-affinity pods on other nodes, so that each of the fully packed nodes can host one blue pod and the swapped-out non-anti-affinity pods can be packed onto far fewer nodes.
Any ideas are appreciated!