Rejection due to timeout / unreserve #722

Closed
vsoch opened this issue Apr 17, 2024 · 15 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vsoch
Contributor

vsoch commented Apr 17, 2024

Hi! I want to make sure I'm not doing anything wrong. I bring up a new cluster on GKE:

gcloud container clusters create test-cluster \
    --threads-per-core=1 \
    --placement-type=COMPACT \
    --num-nodes=8 \
    --no-enable-autorepair \
    --no-enable-autoupgrade \
    --region=us-central1-a \
    --project=${GOOGLE_PROJECT} \
    --machine-type=c2d-standard-8

And then install the scheduler plugin as a custom scheduler:

git clone --depth 1 https://github.com/kubernetes-sigs/scheduler-plugins /tmp/sp
cd /tmp/sp/manifests/install/charts
helm install coscheduling as-a-second-scheduler/
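
As a sanity check after the install (a hedged aside; the exact namespace and pod names depend on how the Helm release is configured), the second scheduler and its controller should show up as running pods:

# list the scheduler-plugins pods wherever the chart placed them
kubectl get pods --all-namespaces | grep scheduler-plugins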

And I am running about 190 jobs total that look like variants of this (note that each has a PodGroup, Job, and Service):

apiVersion: v1
kind: Service
metadata:
  name: s906
spec:
  clusterIP: None
  selector:
    job-name: job-0-9-size-6
---
apiVersion: batch/v1
kind: Job
metadata:
  # name will be derived based on iteration
  name: job-0-9-size-6
spec:
  completions: 6
  parallelism: 6
  completionMode: Indexed
  # alpha in 1.30 so not supported yet
  # successPolicy:
  #  - succeededIndexes: "0"
  template:
    metadata:
      labels:
        app: job-0-9-size-6
        scheduling.x-k8s.io/pod-group: job-0-9-size-6
        
    spec:
      subdomain: s906
      schedulerName: scheduler-plugins-scheduler
      restartPolicy: Never
      containers:
      - name: example-workload
        image: bash:latest
        resources:
          limits:
            cpu: "2"
          requests:
            cpu: "2"
        command:
        - bash
        - -c
        - |
          if [ $JOB_COMPLETION_INDEX -ne "0" ]
            then
              sleep infinity
          fi
          echo "START: $(date +%s%N | cut -b1-13)"
          for i in 0 1 2 3 4 5
          do
            gotStatus="-1"
            wantStatus="0"             
            while [ $gotStatus -ne $wantStatus ]
            do                                       
              ping -c 1 job-0-9-size-6-${i}.s906 > /dev/null 2>&1
              gotStatus=$?                
              if [ $gotStatus -ne $wantStatus ]; then
                echo "Failed to ping pod job-0-9-size-6-${i}.s906, retrying in 1 second..."
                sleep 1
              fi
            done                                                         
            echo "Successfully pinged pod: job-0-9-size-6-${i}.s906"
          done
          echo "DONE: $(date +%s%N | cut -b1-13)"
          # echo "DONE: $(date +%s)"

---
# PodGroup CRD spec
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  scheduleTimeoutSeconds: 10
  minMember: 6

I think that logic is sane because the first few out of the gate (only 3) run to completion and I have logs:

===
Output: recorded-at: 2024-04-17 12:17:50.257757
START: 1713356244
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Successfully pinged pod: job-0-0-size-2-0.s002
Successfully pinged pod: job-0-0-size-2-1.s002
DONE: 1713356246

===
Times: recorded-at: 2024-04-17 12:17:50.257899
{"end_time": "2024-04-17 12:17:50.257629", "start_time": "2024-04-17 12:17:50.118119", "batch_done_submit_time": "2024-04-17 12:17:49.217638", "submit_time": "2024-04-17 12:17:21.270262", "submit_to_completion": 28.987367, "total_time": 0.13951, "uid": "scheduler-plugins-scheduler-batch-0-iter-0-size-2"}

I can also verify that other plugins we are testing run all jobs to completion, so (as far as I can see) it's not an issue with the script that collects the logs, which basically just submits each job, watches for it to complete, and saves the log with one request. I get three jobs total that run, and then it loops like this forever:

I0417 12:22:39.317221       1 coscheduling.go:215] "Pod is waiting to be scheduled to node" pod="default/job-0-0-size-4-0-gkv47" nodeName="gke-test-cluster-default-pool-b4ebeb32-trmp"
E0417 12:22:40.700252       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejected due to timeout after waiting 10s at plugin Coscheduling" pod="default/job-0-0-size-4-2-h4mv8"
E0417 12:22:40.768269       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejection in Unreserve" pod="default/job-0-0-size-4-0-gkv47"
I0417 12:22:43.604752       1 trace.go:236] Trace[927737765]: "Scheduling" namespace:default,name:job-0-0-size-4-1-gpp59 (17-Apr-2024 12:22:43.015) (total time: 589ms):
Trace[927737765]: ---"Computing predicates done" 588ms (12:22:43.604)
Trace[927737765]: [589.250812ms] [589.250812ms] END

What we are doing that is non-standard is bulk submission at once - do you see any potential gotchas there, or something else? Thanks for the help!
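
For context, here is a minimal sketch of that bulk-submit-and-watch flow, assuming the generated manifests (Service + Job + PodGroup) sit in a hypothetical ./jobs/ directory; the real automation is linked further down in the thread:

# apply every generated manifest back to back, with no pacing between submissions
for manifest in ./jobs/*.yaml; do
  kubectl apply -f "$manifest"
done

# for each job, wait for completion and save the log in one request,
# shown here for the sample job above
kubectl wait --for=condition=complete job/job-0-9-size-6 --timeout=600s
kubectl logs job/job-0-9-size-6 > job-0-9-size-6.log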

@Huang-Wei
Contributor

To ensure I can fully reproduce the problem, the testing procedure is:

  • submit 160 Jobs (at once I suppose?)
  • each Job associated w/ its own PodGroup

May I know each Job's replica count? And I suppose each PodGroup's spec is basically the same (w/ scheduleTimeoutSeconds=10 and minMember equal to the Job's replica count)?

Also, are you running on an 8-node cluster? And what's the CPU capacity of each node?
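
(For reference, a hedged one-liner that reports per-node allocatable CPU, assuming kubectl access to the cluster:)

kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu'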

Lastly, we recently introduced a perf fix for coscheduling; the latest master and Helm image (v0.28.9) should contain it.

@vsoch
Contributor Author

vsoch commented Apr 18, 2024

hey @Huang-Wei! I figured this out - the default plugin config in the Helm chart's values.yaml is 10 seconds:

pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 10

And for the experiments I was running, even the default of 60 was too low. I bumped this up to 300 seconds, the errors resolved, and I was able to get it working. I should have read the error more closely (at the time of posting this I did not):

rejected due to timeout after waiting 10s at plugin Coscheduling

I'm wondering - should that default maybe be raised to something like 120? It would still provide an example of customizing the argument, but without limiting a test case that someone might have, which might be more extensive than a small hello-world case (at least for me it was).
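
For reference, a minimal sketch of the override described above, assuming a hypothetical local values file named coscheduling-values.yaml passed to the chart used earlier in this thread:

# coscheduling-values.yaml (hypothetical file name)
pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 300   # the value that resolved the timeouts in this experiment

# install or upgrade the chart with the override
helm upgrade --install coscheduling as-a-second-scheduler/ -f coscheduling-values.yaml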

@Huang-Wei
Contributor

And for the experiments I was running, even the default of 60 was too low.

May I know the size (minMember) of the PodGroup in your experiment?

should that default maybe be upped to something like 120?

120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:

// GetWaitTimeDuration returns a wait timeout based on the following precedences:
// 1. spec.scheduleTimeoutSeconds of the given pg, if specified
// 2. given scheduleTimeout, if not nil
// 3. fall back to DefaultWaitTime
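
So, per that precedence, a single large or slow group can opt into a longer wait without touching the chart default. A hedged sketch, reusing the PodGroup from the top of the issue:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  minMember: 6
  # per-group override; takes precedence over the plugin-level permitWaitingTimeSeconds
  scheduleTimeoutSeconds: 300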

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped by Kueue, but we wanted to test coscheduling on its own.

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:

Sure, and agreed. And I do that as well - I was generating the PodGroup specs dynamically and arbitrarily decided to put the setting at the config level. You are right, I could have done it the other way around. Anyhoo, we are good to close the issue if you don't see any need for follow-up or changes.

@Huang-Wei
Contributor

My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped by Kueue, but we wanted to test coscheduling on its own.

I will find time to simulate it locally. This could be a good test to verify the results of some ongoing work (e.g., #661).

@vsoch
Contributor Author

vsoch commented Apr 19, 2024

Great! Here is the automation for what we are running - I'm building a tool to collect data about scheduler decisions to add to this, but that should minimally reproduce it (you can change the timeout, or look at earlier runs (the directory names) to find the initial bug): https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run10#coscheduling

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Sep 17, 2024
@vsoch
Contributor Author

vsoch commented Sep 17, 2024

@Huang-Wei the bot closed the issue, but did you ever get to test this?

@Huang-Wei
Contributor

did you ever get to test this?

Not yet.

Let me re-open it in case anyone can look into it.

@Huang-Wei reopened this on Sep 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Oct 18, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
