Rejection due to timeout / unreserve #722
Comments
To make sure I can fully reproduce the problem, a few questions about the testing procedure: may I know each Job's replica count? And I suppose each PodGroup's spec is basically the same (with scheduleTimeoutSeconds=10 and minMember equal to the Job's replica count)? Also, are you running on an 8-node cluster, and what is the CPU capacity of each node? Lastly, we recently introduced a perf fix for coscheduling; the latest master and the Helm image (v0.28.9) should already contain it.
hey @Huang-Wei! I figured this out - it comes down to the default plugin config in the helm chart install.
And for the experiments I was running, even the default of 60 seconds was too low. I bumped it up to 300 seconds, the errors resolved, and I was able to get everything working. I should have read the rejection error more closely (at the time of posting this I did not).
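For reference, here is a minimal sketch of raising that timeout at the plugin level, assuming the coscheduling plugin's `permitWaitingTimeSeconds` argument (compiled-in default of 60 seconds) is the knob being discussed; the API version, scheduler name, and profile layout are placeholders and should match whatever the chart actually renders:

```yaml
# Hypothetical KubeSchedulerConfiguration fragment, not the chart's literal default.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: scheduler-plugins-scheduler   # placeholder; match your install
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
  pluginConfig:
  - name: Coscheduling
    args:
      # Plugin-level wait timeout; raised from the 60s default to 300s so a
      # large burst of PodGroups is not rejected while still queued.
      permitWaitingTimeSeconds: 300
```

As noted further down in the thread, a per-PodGroup `scheduleTimeoutSeconds` takes precedence over this plugin-level value when both are set.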
I'm wondering - should that default maybe be bumped to something like 120? It would still serve as an example of customizing the argument, without constraining a test case that is more extensive than a small hello-world run (as mine was).
May I know the size (minMember) of the PodGroup in your experiment?
120 seems like too much as a general default value, IMO. Actually, in addition to the plugin-level config, the plugin also honors a PodGroup-level timeout, which can be specified in the PodGroup spec and takes precedence over the plugin-level one (see scheduler-plugins/pkg/util/podgroup.go, lines 64 to 67 at 3f841b4).
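To illustrate that per-group override, here is a minimal sketch of a PodGroup carrying its own timeout via `scheduleTimeoutSeconds`; the name, group size, and timeout value are placeholders:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0                     # placeholder; typically one PodGroup per Job
spec:
  minMember: 4                    # gang size, e.g. the Job's parallelism
  scheduleTimeoutSeconds: 300     # per-group wait; overrides permitWaitingTimeSeconds
```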
My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the queue was. It would have been helped hugely by Kueue, but we wanted to test coscheduling on its own.
Sure, and agreed. I do that as well - I was generating the PodGroup specs dynamically and arbitrarily decided to put the setting at the config level; you're right that I could have done it the other way around. Anyhow, we're good to close the issue if you don't see any need for follow-up or changes.
I will find time to simulate it locally. This could be a good test to verify the result of some ongoing work (e.g., #661).
Great! Here is the automation for what we are running - I'm building a tool to collect data about scheduler decisions to add to it, but this should minimally reproduce the problem (you can change the timeout, or look at earlier runs - the directory names - to find the initial bug): https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run10#coscheduling
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned", in response to the /close not-planned command above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@Huang-Wei the bot closed the issue, but did you ever get to test this? |
Not yet. Let me re-open it in case anyone can look into it. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned", in response to the /close not-planned command above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi! I want to make sure I'm not doing anything wrong. I bring up a new cluster on GKE:
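The exact command is not shown in this transcript, so here is only a rough sketch of the kind of GKE cluster creation being described; the cluster name, zone, machine type, and node count are placeholders (the discussion above mentions an 8-node cluster):

```bash
# Placeholder values; substitute your own project, zone, and machine type.
gcloud container clusters create sched-test \
  --zone us-central1-a \
  --num-nodes 8 \
  --machine-type c2-standard-8
```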
And then install the scheduler plugin as a custom scheduler:
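Again only a sketch: a typical second-scheduler install uses the project's Helm chart under manifests/install/charts/as-a-second-scheduler; the release name and namespace here are placeholders:

```bash
git clone https://github.com/kubernetes-sigs/scheduler-plugins.git
cd scheduler-plugins/manifests/install/charts
# Installs the controller plus a second scheduler (scheduler-plugins-scheduler by default).
helm install scheduler-plugins ./as-a-second-scheduler \
  --create-namespace --namespace scheduler-plugins
```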
And I am running jobs - about 190 total - that look like variants of this (note each has a PodGroup, Job, and Service):
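The actual manifests are not included here, so this is only a sketch of the Job-plus-PodGroup pairing being described, assuming the `scheduling.x-k8s.io/pod-group` label and the chart's default scheduler name; the names, sizes, image, and resource requests are placeholders, and the accompanying headless Service is omitted:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-017                               # placeholder; one PodGroup per Job
spec:
  minMember: 4                                # gang size, matches the Job's parallelism
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-017
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: job-017     # ties the pods to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler   # route pods to the custom scheduler
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36                        # placeholder workload
        command: ["sh", "-c", "sleep 30"]
        resources:
          requests:
            cpu: "1"
```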
I think the submission logic is sane, because the first few out of the gate (only 3) run to completion and I have logs for them:
I can also verify that the other plugins we are testing run all jobs to completion, so (as far as I can see) it's not an issue with the script that collects the logs, which basically just submits, watches for completion, and saves the log with one request. I get three jobs total that run, and then it loops like this forever:
What we are doing that is non-standard is bulk submission at once - do you see any potential gotchas there, or something else? Thanks for the help!