Custom Volcano Queues not working with MPIJob V1 #2325

Open
ameya-parab opened this issue Nov 9, 2024 · 0 comments

What happened?

I am unable to use custom queues created for the Volcano scheduler with Kubeflow MPIJobs. When Volcano creates a PodGroup, it is automatically assigned to the default queue rather than the custom queue specified in runPolicy.schedulingPolicy.queue.

The following MPIJob should use the custom queue production, but it uses the default queue instead.

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-mnist-high
  namespace: my-namespace
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            Owner: my-namespace
            pipelines.kubeflow.org/pipeline-sdk-type: kfp
            training.kubeflow.org/type: mpijobs
        spec:
          schedulerName: volcano
          containers:
          - args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /horovod/examples/pytorch/pytorch_mnist.py
            - --epochs
            - "100"
            command:
            - mpirun
            env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: horovod/horovod:0.28.1
            imagePullPolicy: Always
            name: launcher
            resources:
              limits:
                cpu: "2"
                memory: 4Gi
              requests:
                cpu: "1"
                memory: 2Gi
            securityContext:
              privileged: true
          hostNetwork: false
          serviceAccountName: default-editor
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            Owner: my-namespace
            pipelines.kubeflow.org/pipeline-sdk-type: kfp
            training.kubeflow.org/type: mpijobs
        spec:
          schedulerName: volcano
          containers:
          - env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: horovod/horovod:0.28.1
            imagePullPolicy: Always
            name: worker
            resources:
              limits:
                memory: 10Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 5Gi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /usr/local/bin/kubectl
              name: mpi-job-kubectl
              subPath: kubectl
            - mountPath: /dev/shm
              name: dshm
          hostNetwork: false
          initContainers:
          - command:
            - cp
            - /opt/bitnami/kubectl/bin/kubectl
            - /shared/kubectl
            env:
            - name: OMPI_JOB_NAME
              value: horovod-mnist-high
            - name: KUBERNETES_NAMESPACE
              value: my-namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: bitnami/kubectl:1.30.6
            imagePullPolicy: Always
            name: kubectl-delivery
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
            volumeMounts:
            - mountPath: /shared
              name: mpi-job-kubectl
          serviceAccountName: default-editor
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
          - emptyDir: {}
            name: mpi-job-kubectl
  runPolicy:
    cleanPodPolicy: Running
    schedulingPolicy:
      queue: production
      minAvailable: 3
  slotsPerWorker: 1

Resulting PodGroup:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: podgroup-1d7357bc-880f-4334-97e7-d3e3c06e0f47
  namespace: my-namespace
status:
  conditions:
    - lastTransitionTime: '2024-11-09T19:55:01Z'
      reason: tasks in gang are ready to be scheduled
      status: 'True'
      transitionID: 01d05369-859e-45c4-b20c-5801e577552e
      type: Scheduled
  phase: Running
  running: 1
spec:
  minMember: 1
  minResources:
    count/pods: '1'
    cpu: '1'
    limits.cpu: '2'
    limits.memory: 4Gi
    memory: 2Gi
    pods: '1'
    requests.cpu: '1'
    requests.memory: 2Gi
  queue: default
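
To make the discrepancy explicit, here is a minimal sketch that compares the schedulingPolicy requested in the MPIJob with the spec of the PodGroup shown above. It works on plain dictionaries copied from this report and does not talk to a cluster. The helper name `find_mismatches` and the `FIELD_MAP` pairing of `minAvailable` with `minMember` are assumptions made for illustration, not part of the Training Operator or Volcano APIs.

```python
# Values copied from the MPIJob and PodGroup manifests in this report.
scheduling_policy = {"queue": "production", "minAvailable": 3}
podgroup_spec = {"queue": "default", "minMember": 1}

# Assumed pairing for illustration: MPIJob schedulingPolicy field -> PodGroup spec field.
FIELD_MAP = {"queue": "queue", "minAvailable": "minMember"}


def find_mismatches(policy, spec):
    """Return (policy_field, requested, observed) for each field that differs."""
    mismatches = []
    for policy_field, spec_field in FIELD_MAP.items():
        if policy_field not in policy:
            continue  # nothing was requested for this field
        requested = policy[policy_field]
        observed = spec.get(spec_field)
        if requested != observed:
            mismatches.append((policy_field, requested, observed))
    return mismatches


for field, requested, observed in find_mismatches(scheduling_policy, podgroup_spec):
    print(f"{field}: requested {requested!r}, PodGroup has {observed!r}")
```

With the values from this report, the check flags the queue and also shows that minMember is 1 although minAvailable was set to 3, so it may be that none of the schedulingPolicy fields reach this PodGroup.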

What did you expect to happen?

If runPolicy.schedulingPolicy.queue specifies a custom queue, the Volcano PodGroup should be assigned to that queue, not the default Volcano queue.
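
The expectation can be written down as a small sketch that derives the PodGroup fields I would expect from an MPIJob manifest. The function `expected_podgroup_spec` is hypothetical, and its fallbacks (the queue named default when no queue is set, and the total replica count when minAvailable is unset) are assumptions for illustration rather than confirmed operator behavior.

```python
def expected_podgroup_spec(mpijob):
    """Derive the PodGroup fields this report expects from an MPIJob dict."""
    spec = mpijob["spec"]
    policy = spec.get("runPolicy", {}).get("schedulingPolicy", {})
    # Total pods across all replica types (Launcher + Worker in this report).
    total = sum(r.get("replicas", 1) for r in spec["mpiReplicaSpecs"].values())
    return {
        # Assumed fallback: the queue named "default" when none is requested.
        "queue": policy.get("queue", "default"),
        # Assumed fallback: gang-schedule all replicas when minAvailable is unset.
        "minMember": policy.get("minAvailable", total),
    }


# Trimmed-down version of the MPIJob from this report.
job = {
    "spec": {
        "mpiReplicaSpecs": {
            "Launcher": {"replicas": 1},
            "Worker": {"replicas": 2},
        },
        "runPolicy": {
            "schedulingPolicy": {"queue": "production", "minAvailable": 3},
        },
    }
}

print(expected_podgroup_spec(job))
```

For the job in this report, that gives queue production and minMember 3, which is what I expected to see in the PodGroup spec.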

Environment

Kubernetes version: 1.25
Training Operator version: kubeflow/training-operator:v1-855e096
Training Operator Python SDK version: NA
Volcano version: 1.10.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@ameya-parab ameya-parab changed the title Custom Volcano Queue not working with MPIJob V1 Custom Volcano Queues not working with MPIJob V1 Nov 9, 2024