
Implement MPS natively as in linux #807

Open · wants to merge 5 commits into main

Conversation

@thien-lm commented Jul 8, 2024

How are GPU resources shared by MPS on Linux?

  1. The GPU compute mode is set to EXCLUSIVE_PROCESS, which ensures that any process requesting the GPU must go through the MPS control daemon and MPS server
  2. By default, each MPS client process can access up to 100% of the memory and 100% of the available threads of the GPUs
  3. MPS resources can be limited at the MPS control daemon level, the MPS client level, and the CUDA context level: https://docs.nvidia.com/deploy/mps/#performance

refer: https://docs.nvidia.com/deploy/mps
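
The setup described above can be sketched as host-side commands (a sketch assuming a Linux host with the NVIDIA driver installed; it needs root and an actual GPU, so treat it as a setup fragment, not something runnable anywhere):

```shell
# Put the GPU into EXCLUSIVE_PROCESS mode so every CUDA process must
# go through the MPS control daemon (device 0 here; pick the GPU you share).
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon; it spawns the MPS server on demand
# when the first client connects.
nvidia-cuda-mps-control -d

# Clients then connect automatically via the default pipe directory
# (/tmp/nvidia-mps), or the one set in CUDA_MPS_PIPE_DIRECTORY.
```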

Strategies for provisioning resources in MPS

refer: https://docs.nvidia.com/deploy/mps/#performance:

  1. A common provisioning strategy is to uniformly partition the available threads equally among the MPS client processes - this is how the NVDP devs implemented MPS
  2. A more optimal strategy is to uniformly partition the portion by half the number of expected clients (oversubscribing each client's share, since typically not all clients are active at the same time)
  3. A near-optimal strategy is to non-uniformly partition the available threads based on each MPS client's workload (e.g., set the active thread percentage to 30% for client 1 and 70% for client 2 if the ratio of client 1's workload to client 2's workload is 30:70) - this is what I want
  4. The most optimal strategy is to precisely limit the number of SMs used by each MPS client, given knowledge of each client's execution resource requirements
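
The uniform split (strategy 1) and the workload-proportional split (strategy 3) come down to simple arithmetic; a quick sketch, using the illustrative numbers from above (n = 2 clients, a 30:70 workload ratio):

```shell
# Strategy 1: uniform partition - each of n clients gets 100/n percent.
n=2
uniform=$((100 / n))
echo "uniform share per client: ${uniform}%"

# Strategy 3: partition proportionally to the observed workload ratio
# (assumed here to be 30:70 between two clients).
w1=30; w2=70
total=$((w1 + w2))
p1=$((100 * w1 / total))
p2=$((100 * w2 / total))
echo "client 1: ${p1}%  client 2: ${p2}%"
```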

How does the main branch of the NVIDIA device plugin implement MPS?

  • The NVDP devs just set a hard limit at the control daemon level of 100/n for both memory and threads, where n is the number of replicas
  • I think this makes MPS quite inconvenient for us to use
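
For reference, the daemon-level caps the current implementation applies could be expressed through the MPS control interface like this (a sketch for n = 2 replicas; the 4G memory value is illustrative, and the commands require a running MPS control daemon):

```shell
# Daemon-wide defaults: every MPS client inherits these caps.

# 100/n active threads, i.e. 50% for n = 2 replicas:
echo "set_default_active_thread_percentage 50" | nvidia-cuda-mps-control

# Pinned device-memory cap for device 0 (the 4G value is illustrative):
echo "set_default_device_pinned_mem_limit 0 4G" | nvidia-cuda-mps-control
```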

My solution

  1. I will remove the hard limit of 100/n set at the control daemon level
  2. Instead, I will set the resource limit for each container that uses MPS in Kubernetes via two environment variables: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
  3. This way, resource provisioning for MPS in NVDP becomes very flexible, because each container is given exactly the number of threads and the amount of memory it needs - isn't that nice?
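
The per-container approach in steps 2 and 3 could look like this in a pod spec (a hypothetical example: the pod name, image, and the 30% / 8G values are illustrative, not part of the PR):

```yaml
# Hypothetical pod spec: per-container MPS limits via environment
# variables instead of a daemon-wide 100/n cap.
apiVersion: v1
kind: Pod
metadata:
  name: mps-client-1
spec:
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      env:
        - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
          value: "30"       # strategy 3: this client's workload share
        - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
          value: "0=8G"     # cap pinned memory on device 0 for this client
      resources:
        limits:
          nvidia.com/gpu: 1
```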
