[FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi #59

Merged
merged 6 commits into from
Jun 21, 2024

Conversation

Qubitium
Contributor

@Qubitium Qubitium commented Jun 21, 2024

nvidia-smi enumerates GPUs in PCI_BUS_ID order, but a Python program may be launched with the default CUDA device order, which is not PCI_BUS_ID order. If the env settings do not match, the wrong GPU is returned for gpu_id. This PR validates the env and raises an error if there is a mismatch.

TESTS

  • gpu_id in range
  • gpu_id out of range
  • gpu_id with multi-gpu + no PCI_BUS_ID order
  • gpu_id with multi-gpu + PCI_BUS_ID order set
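
For illustration, the check described above could look roughly like the sketch below. This is not the code merged in this PR: the helper name `validate_gpu_id`, its parameters, and the exception types are assumptions made for the example. It relies only on the documented `CUDA_DEVICE_ORDER` environment variable, whose `PCI_BUS_ID` value makes CUDA enumerate devices in the same order as nvidia-smi.

```python
import os


def validate_gpu_id(gpu_id, gpu_count, env=None):
    """Hypothetical helper (not the PR's actual code).

    Returns gpu_id if it can safely be interpreted as an nvidia-smi index,
    otherwise raises.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env

    # Case: gpu_id out of range.
    if not 0 <= gpu_id < gpu_count:
        raise ValueError(
            f"gpu_id {gpu_id} is out of range for {gpu_count} GPU(s)"
        )

    # Case: multi-GPU without PCI_BUS_ID order. The CUDA runtime may then
    # number devices differently from nvidia-smi, so the index is ambiguous.
    if gpu_count > 1 and env.get("CUDA_DEVICE_ORDER") != "PCI_BUS_ID":
        raise RuntimeError(
            "Multiple GPUs detected: set CUDA_DEVICE_ORDER=PCI_BUS_ID so that "
            "gpu_id matches the nvidia-smi ordering"
        )

    return gpu_id
```

With a single GPU the ordering cannot differ, so this sketch only enforces the env setting when more than one device is present. For example, `validate_gpu_id(1, 2, env={"CUDA_DEVICE_ORDER": "PCI_BUS_ID"})` returns `1`, while the same call with an empty `env` raises `RuntimeError`.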

@Qubitium Qubitium changed the title Raise error if CUDA_DEVICE_ORDER=PCI_BUS_ID env is not applied in mul… [FIX] Must validate ENV settings for gpu value for nvidia-smi is the wrong gpu Jun 21, 2024
@Qubitium Qubitium changed the title [FIX] Must validate ENV settings for gpu value for nvidia-smi is the wrong gpu [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi Jun 21, 2024
@Qubitium
Contributor Author

@microsoft-github-policy-service agree

@Qubitium Qubitium marked this pull request as ready for review June 21, 2024 11:47
@Qubitium
Contributor Author

@LeiWang1999 Ready for review. In a multi-GPU environment, the CUDA device order ENV must be validated (so that it matches nvidia-smi), or we get the wrong GPU back.

@LeiWang1999 LeiWang1999 merged commit 2634815 into microsoft:main Jun 21, 2024
3 checks passed

2 participants