Rejection due to timeout / unreserve #722

Closed
vsoch opened this issue Apr 17, 2024 · 15 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vsoch
Contributor

vsoch commented Apr 17, 2024

Hi! I want to make sure I'm not doing anything wrong. I bring up a new cluster on GKE:

gcloud container clusters create test-cluster \
    --threads-per-core=1 \
    --placement-type=COMPACT \
    --num-nodes=8 \
    --no-enable-autorepair \
    --no-enable-autoupgrade \
    --region=us-central1-a \
    --project=${GOOGLE_PROJECT} \
    --machine-type=c2d-standard-8

And then install the scheduler plugin as a custom scheduler:

git clone --depth 1 https://github.com/kubernetes-sigs/scheduler-plugins /tmp/sp
cd /tmp/sp/manifests/install/charts
helm install coscheduling as-a-second-scheduler/
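
As a sanity check after the install (a hedged aside; the exact namespace and pod names depend on how the Helm release is configured), the second scheduler and its controller should show up as running pods:

# list the scheduler-plugins pods wherever the chart placed them
kubectl get pods --all-namespaces | grep scheduler-plugins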

And I am running about 190 jobs total that look like variants of this (note that each has a PodGroup, Job, and Service):

apiVersion: v1
kind: Service
metadata:
  name: s906
spec:
  clusterIP: None
  selector:
    job-name: job-0-9-size-6
---
apiVersion: batch/v1
kind: Job
metadata:
  # name will be derived based on iteration
  name: job-0-9-size-6
spec:
  completions: 6
  parallelism: 6
  completionMode: Indexed
  # alpha in 1.30 so not supported yet
  # successPolicy:
  #  - succeededIndexes: "0"
  template:
    metadata:
      labels:
        app: job-0-9-size-6
        scheduling.x-k8s.io/pod-group: job-0-9-size-6
        
    spec:
      subdomain: s906
      schedulerName: scheduler-plugins-scheduler
      restartPolicy: Never
      containers:
      - name: example-workload
        image: bash:latest
        resources:
          limits:
            cpu: "2"
          requests:
            cpu: "2"
        command:
        - bash
        - -c
        - |
          if [ $JOB_COMPLETION_INDEX -ne "0" ]
            then
              sleep infinity
          fi
          echo "START: $(date +%s%N | cut -b1-13)"
          for i in 0 1 2 3 4 5
          do
            gotStatus="-1"
            wantStatus="0"             
            while [ $gotStatus -ne $wantStatus ]
            do                                       
              ping -c 1 job-0-9-size-6-${i}.s906 > /dev/null 2>&1
              gotStatus=$?                
              if [ $gotStatus -ne $wantStatus ]; then
                echo "Failed to ping pod job-0-9-size-6-${i}.s906, retrying in 1 second..."
                sleep 1
              fi
            done                                                         
            echo "Successfully pinged pod: job-0-9-size-6-${i}.s906"
          done
          echo "DONE: $(date +%s%N | cut -b1-13)"
          # echo "DONE: $(date +%s)"

---
# PodGroup CRD spec
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  scheduleTimeoutSeconds: 10
  minMember: 6

I think that logic is sane because the first few out of the gate (only 3) run to completion and I have logs:

===
Output: recorded-at: 2024-04-17 12:17:50.257757
START: 1713356244
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Successfully pinged pod: job-0-0-size-2-0.s002
Successfully pinged pod: job-0-0-size-2-1.s002
DONE: 1713356246

===
Times: recorded-at: 2024-04-17 12:17:50.257899
{"end_time": "2024-04-17 12:17:50.257629", "start_time": "2024-04-17 12:17:50.118119", "batch_done_submit_time": "2024-04-17 12:17:49.217638", "submit_time": "2024-04-17 12:17:21.270262", "submit_to_completion": 28.987367, "total_time": 0.13951, "uid": "scheduler-plugins-scheduler-batch-0-iter-0-size-2"}

I can also verify that other plugins we are testing run all jobs to completion, so (as far as I can see) it's not an issue with the script that collects the logs, which basically just submits each job, watches for it to complete, and saves the log with one request. I get three jobs total that run, and then it loops like this forever:

I0417 12:22:39.317221       1 coscheduling.go:215] "Pod is waiting to be scheduled to node" pod="default/job-0-0-size-4-0-gkv47" nodeName="gke-test-cluster-default-pool-b4ebeb32-trmp"
E0417 12:22:40.700252       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejected due to timeout after waiting 10s at plugin Coscheduling" pod="default/job-0-0-size-4-2-h4mv8"
E0417 12:22:40.768269       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejection in Unreserve" pod="default/job-0-0-size-4-0-gkv47"
I0417 12:22:43.604752       1 trace.go:236] Trace[927737765]: "Scheduling" namespace:default,name:job-0-0-size-4-1-gpp59 (17-Apr-2024 12:22:43.015) (total time: 589ms):
Trace[927737765]: ---"Computing predicates done" 588ms (12:22:43.604)
Trace[927737765]: [589.250812ms] [589.250812ms] END

What we are doing that is non-standard is bulk submission at once - do you see any potential gotchas there, or something else? Thanks for the help!
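
For context, here is a minimal sketch of that bulk-submit-and-watch flow, assuming the generated manifests (Service + Job + PodGroup) sit in a hypothetical ./jobs/ directory; the real automation is linked further down in the thread:

# apply every generated manifest back to back, with no pacing between submissions
for manifest in ./jobs/*.yaml; do
  kubectl apply -f "$manifest"
done

# for each job, wait for completion and save the log in one request,
# shown here for the sample job above
kubectl wait --for=condition=complete job/job-0-9-size-6 --timeout=600s
kubectl logs job/job-0-9-size-6 > job-0-9-size-6.log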

@Huang-Wei
Contributor

To ensure I can fully reproduce the problem, the testing procedure is:

  • submit 160 Jobs (at once I suppose?)
  • each Job associated w/ its own PodGroup

May I know each Job's replica count? And I suppose each PodGroup's spec is basically the same (w/ scheduleTimeoutSeconds=10 and minMember equal to the Job's replica count)?

Also, are you running on an 8-node cluster? And what's the CPU capacity of each node?
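
(For reference, a hedged one-liner that reports per-node allocatable CPU, assuming kubectl access to the cluster:)

kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu'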

Lastly, we recently introduced a perf fix for coscheduling; the latest master and Helm image (v0.28.9) should contain it.

@vsoch
Contributor Author

vsoch commented Apr 18, 2024

hey @Huang-Wei! I figured this out - the default plugin config in the Helm chart's values.yaml is 10 seconds:

pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 10

And for the experiments I was running, even the default of 60 was too low. I bumped this up to 300 seconds, the errors resolved, and I was able to get it working. I should have read the error more closely (at the time of posting this I did not):

rejected due to timeout after waiting 10s at plugin Coscheduling

I'm wondering - should that default maybe be raised to something like 120? It would still provide an example of customizing the argument, but without limiting a test case that someone might have, which might be more extensive than a small hello-world case (at least for me it was).
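
For reference, a minimal sketch of the override described above, assuming a hypothetical local values file named coscheduling-values.yaml passed to the chart used earlier in this thread:

# coscheduling-values.yaml (hypothetical file name)
pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 300   # the value that resolved the timeouts in this experiment

# install or upgrade the chart with the override
helm upgrade --install coscheduling as-a-second-scheduler/ -f coscheduling-values.yaml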

@Huang-Wei
Contributor

And for the experiments I was running, even the default of 60 was too low.

May I know the size (minMember) of the PodGroup in your experiment?

should that default maybe be upped to something like 120?

120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:

// GetWaitTimeDuration returns a wait timeout based on the following precedences:
// 1. spec.scheduleTimeoutSeconds of the given pg, if specified
// 2. given scheduleTimeout, if not nil
// 3. fall back to DefaultWaitTime
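
So, per that precedence, a single large or slow group can opt into a longer wait without touching the chart default. A hedged sketch, reusing the PodGroup from the top of the issue:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  minMember: 6
  # per-group override; takes precedence over the plugin-level permitWaitingTimeSeconds
  scheduleTimeoutSeconds: 300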

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped by Kueue, but we wanted to test coscheduling on its own.

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:

Sure, and agreed. And I do that as well - I was generating the PodGroup specs dynamically and arbitrarily decided to put the setting at the config level. You are right, I could have done it the other way around. Anyhoo, we are good to close the issue if you don't see any need for follow-up or changes.

@Huang-Wei
Contributor

My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped by Kueue, but we wanted to test coscheduling on its own.

I will find time to simulate it locally. This could be a good test to verify the results of some ongoing work (e.g., #661).

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

Great! Here is the automation for what we are running - I'm building a tool to collect data about scheduler decisions to add to this, but that should minimally reproduce it (you can change the timeout, or look at earlier runs (the directory names) to find the initial bug): https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run10#coscheduling

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Sep 17, 2024
@vsoch
Contributor Author

vsoch commented Sep 17, 2024

@Huang-Wei the bot closed the issue, but did you ever get to test this?

@Huang-Wei
Contributor

did you ever get to test this?

Not yet.

Let me re-open it in case anyone can look into it.

@Huang-Wei reopened this on Sep 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Oct 18, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
