
Improve "Can't find nvidia/amd toolkit" error #25912

Merged: 3 commits into chapel-lang:main on Sep 9, 2024

Conversation

@e-kayrakli (Contributor)
  1. Suggested by @mppf, this PR adds "Try setting CHPL_CUDA_PATH to the cuda installation path" to the error message. Before, CHPL_CUDA_PATH was not mentioned in the error message.
  2. The error says "nvidia toolkit" or "amd toolkit". Those are not real things; they should be "cuda toolkit" and "rocm toolkit". This PR adjusts for that. Capitalization is still not perfect, but I don't want to wire a new variable into these scripts at this point.

Tested the following on a system with no GPUs:

> export CHPL_GPU=nvidia
> printchplenv

Error: Can't find cuda toolkit. Try setting CHPL_CUDA_PATH to the cuda installation path. To avoid this issue, you can have GPU code run on the CPU by setting 'CHPL_GPU=cpu'. To turn this error into a warning set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS.

> export CHPL_GPU=amd
> printchplenv

Error: Can't find rocm toolkit. Try setting CHPL_ROCM_PATH to the rocm installation path. To avoid this issue, you can have GPU code run on the CPU by setting 'CHPL_GPU=cpu'. To turn this error into a warning set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS.

Scripts continue to function normally on a system with a GPU.
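The behavior shown in the transcripts above can be sketched as a small Python function. This is an illustrative sketch only, not the actual chplenv code; the function name and mapping dictionary are hypothetical, and only the message text comes from the transcripts above.

```python
# Hypothetical sketch: compose the improved "missing toolkit" error message
# for a given CHPL_GPU value. Names are illustrative, not the real chplenv code.

def missing_toolkit_error(gpu_type):
    # Map the user-facing CHPL_GPU value to the toolkit name and
    # the environment variable that points at its installation.
    toolkit = {
        "nvidia": ("cuda", "CHPL_CUDA_PATH"),
        "amd": ("rocm", "CHPL_ROCM_PATH"),
    }
    name, path_var = toolkit[gpu_type]
    return (
        "Error: Can't find {0} toolkit. "
        "Try setting {1} to the {0} installation path. "
        "To avoid this issue, you can have GPU code run on the CPU "
        "by setting 'CHPL_GPU=cpu'. To turn this error into a warning "
        "set CHPLENV_GPU_REQ_ERRS_AS_WARNINGS."
    ).format(name, path_var)

print(missing_toolkit_error("nvidia"))
print(missing_toolkit_error("amd"))
```

With a single mapping table, both occurrences of the toolkit name in the message stay consistent, which is the gist of the fix.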

@vasslitvinov (Member) left a comment

Given that it works, I am OK with it as-is.

I find it surprising that the two occurrences of "cuda" or "rocm" in the error message are generated by different functions, get() and gpu.runtime_impl. I hope that was intentional. May be worth adding a note in the PR's OP or in the code explaining the difference.

@e-kayrakli (Contributor Author)

> I find it surprising that the two occurrences of "cuda" or "rocm" in the error message are generated by different functions, get() and gpu.runtime_impl. I hope that was intentional. May be worth adding a note in the PR's OP or in the code explaining the difference.

This is somewhat historical. get() will give you nvidia/amd/cpu. These are the alternatives for CHPL_GPU and are more user-facing than their counterpart, because that's what the user knows, i.e., the brand of their GPU. runtime_impl is not that "user-facing"; its values are cuda/rocm/none for the three alternatives. One motivation for this separation was the potential future of having a portable runtime implementation like llvm/offload for all vendors, where we'd still care about the CHPL_GPU input from the user, but the runtime_impl would no longer be the toolkit provided by that vendor. Note also that, technically, HIP can be used on NVIDIA GPUs, which could be another combination down the road.
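The separation described above can be sketched in a few lines of Python. This is a hypothetical illustration of the two layers, not the actual chplenv functions; only the value sets (nvidia/amd/cpu vs. cuda/rocm/none) come from the comment above.

```python
# Illustrative sketch of the two layers described above.
# Function names and structure are hypothetical, not the real chplenv code.

def get(chpl_gpu):
    # User-facing layer: the value of CHPL_GPU, i.e., the GPU brand
    # (or "cpu") that the user knows and sets.
    assert chpl_gpu in ("nvidia", "amd", "cpu")
    return chpl_gpu

def runtime_impl(chpl_gpu):
    # Internal layer: the runtime/toolkit backing that brand.
    # Keeping this separate from get() leaves room for, e.g., a portable
    # llvm/offload runtime for all vendors, or HIP on NVIDIA GPUs.
    return {"nvidia": "cuda", "amd": "rocm", "cpu": "none"}[get(chpl_gpu)]

for g in ("nvidia", "amd", "cpu"):
    print(g, "->", runtime_impl(g))
```

Under this split, a future backend change only touches the runtime_impl mapping while the user-facing CHPL_GPU values stay stable.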

Signed-off-by: Engin Kayraklioglu <[email protected]>
@e-kayrakli (Contributor Author)

Oh, my answer might have been a bit misinformed. I realized that I hadn't pushed my final changes, which may be the answer to your comment, @vass: db35993

@e-kayrakli e-kayrakli merged commit 6ed664a into chapel-lang:main Sep 9, 2024
7 checks passed
@e-kayrakli e-kayrakli deleted the gpu-missing-sdk-error branch September 9, 2024 22:39