
Improve "Can't find nvidia/amd toolkit" error #25912

Merged: 3 commits into chapel-lang:main on Sep 9, 2024

Conversation

@e-kayrakli (Contributor)
  1. Suggested by @mppf, this PR adds "Try setting CHPL_CUDA_PATH to the cuda installation path" to the error message. Before, CHPL_CUDA_PATH was not mentioned in the error message.
  2. The error says "nvidia toolkit" or "amd toolkit". Those are not real things; they should be "cuda toolkit" and "rocm toolkit". This PR adjusts for that. Capitalization is still not perfect, but I don't want to wire a new variable into these scripts at this point.

Tested the following on a system with no GPUs:

> export CHPL_GPU=nvidia
> printchplenv

Error: Can't find cuda toolkit. Try setting CHPL_CUDA_PATH to the cuda installation path. To avoid this issue, you can have GPU code run on the CPU by setting 'CHPL_GPU=cpu'. To turn this error into a warning set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS.

> export CHPL_GPU=amd
> printchplenv

Error: Can't find rocm toolkit. Try setting CHPL_ROCM_PATH to the rocm installation path. To avoid this issue, you can have GPU code run on the CPU by setting 'CHPL_GPU=cpu'. To turn this error into a warning set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS.

Scripts continue to function normally on a system with a GPU.
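The behavior shown in the transcripts above can be sketched as a small Python function. This is an illustrative sketch only, not the actual chplenv code; the function name and mapping dictionary are hypothetical, and only the message text comes from the transcripts above.

```python
# Hypothetical sketch: compose the improved "missing toolkit" error message
# for a given CHPL_GPU value. Names are illustrative, not the real chplenv code.

def missing_toolkit_error(gpu_type):
    # Map the user-facing CHPL_GPU value to the toolkit name and
    # the environment variable that points at its installation.
    toolkit = {
        "nvidia": ("cuda", "CHPL_CUDA_PATH"),
        "amd": ("rocm", "CHPL_ROCM_PATH"),
    }
    name, path_var = toolkit[gpu_type]
    return (
        "Error: Can't find {0} toolkit. "
        "Try setting {1} to the {0} installation path. "
        "To avoid this issue, you can have GPU code run on the CPU "
        "by setting 'CHPL_GPU=cpu'. To turn this error into a warning "
        "set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS."
    ).format(name, path_var)

print(missing_toolkit_error("nvidia"))
print(missing_toolkit_error("amd"))
```

With a single mapping table, both occurrences of the toolkit name in the message stay consistent, which is the gist of the fix.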

@vasslitvinov (Member) left a comment

Given that it works, I am OK with it as-is.

I find it surprising that the two occurrences of "cuda" or "rocm" in the error message are generated by different functions, get() and gpu.runtime_impl. I hope that was intentional. May be worth adding a note in the PR's OP or in the code explaining the difference.

@e-kayrakli (Contributor Author)

> I find it surprising that the two occurrences of "cuda" or "rocm" in the error message are generated by different functions, get() and gpu.runtime_impl. I hope that was intentional. May be worth adding a note in the PR's OP or in the code explaining the difference.

This is somewhat historical. get() will give you nvidia/amd/cpu. These are the alternatives for CHPL_GPU and are more user-facing than their counterpart, because that's what the user knows, i.e., the brand of their GPU. runtime_impl is not that "user-facing"; its values are cuda/rocm/none for the three alternatives. One motivation for this separation was the potential future of having a portable runtime implementation like llvm/offload for all vendors, where we'd still care about the CHPL_GPU input from the user, but the runtime_impl would no longer be the toolkit provided by that vendor. Note also that, technically, HIP can be used on NVIDIA GPUs, which could be another combination down the road.
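The separation described above can be sketched in a few lines of Python. This is a hypothetical illustration of the two layers, not the actual chplenv functions; only the value sets (nvidia/amd/cpu vs. cuda/rocm/none) come from the comment above.

```python
# Illustrative sketch of the two layers described above.
# Function names and structure are hypothetical, not the real chplenv code.

def get(chpl_gpu):
    # User-facing layer: the value of CHPL_GPU, i.e., the GPU brand
    # (or "cpu") that the user knows and sets.
    assert chpl_gpu in ("nvidia", "amd", "cpu")
    return chpl_gpu

def runtime_impl(chpl_gpu):
    # Internal layer: the runtime/toolkit backing that brand.
    # Keeping this separate from get() leaves room for, e.g., a portable
    # llvm/offload runtime for all vendors, or HIP on NVIDIA GPUs.
    return {"nvidia": "cuda", "amd": "rocm", "cpu": "none"}[get(chpl_gpu)]

for g in ("nvidia", "amd", "cpu"):
    print(g, "->", runtime_impl(g))
```

Under this split, a future backend change only touches the runtime_impl mapping while the user-facing CHPL_GPU values stay stable.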

Signed-off-by: Engin Kayraklioglu <[email protected]>
@e-kayrakli (Contributor Author)

Oh, my answer might have been a bit misinformed. I realized that I hadn't pushed my final changes, which may be the answer to your comment, @vass: db35993

@e-kayrakli e-kayrakli merged commit 6ed664a into chapel-lang:main Sep 9, 2024
7 checks passed
@e-kayrakli e-kayrakli deleted the gpu-missing-sdk-error branch September 9, 2024 22:39