
CUDA Forward Compatibility mode #1022

Open
luyug opened this issue Aug 29, 2024 · 1 comment
luyug commented Aug 29, 2024

Hello,

I am running the JAX container on a GH200 cluster. The cluster maintainer would like to keep the CUDA kernel driver at v12.2.
When running the jax-toolbox nightly container, fused_attention in Transformer Engine raises an unsupported-PTX exception.
I am trying to resolve this and wonder if it is possible to enable CUDA Forward Compatibility mode in the container?

Thanks in advance!

olupton (Contributor) commented Aug 30, 2024

In principle, the forward-compatibility packages are installed in the ghcr.io/nvidia/jax:XXX containers.
If you run nvidia-smi inside and outside the container, what CUDA versions does it show?

If it shows the older 12.2 version in both places, it might be that you are not using the NVIDIA Container Toolkit (https://docs.nvidia.com/deploy/cuda-compatibility/index.html#frequently-asked-questions), or that manual LD_LIBRARY_PATH changes or directories mounted in from the host system are interfering. You can check which libcuda.so* libraries/symlinks appear inside your container with something like find / -name 'libcuda.so*'. In this configuration (the current nightlies use CUDA 12.5 containers, and your 12.2 driver is older), libcuda.so.555.42.02 (from the compat package) should be the one in use.
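The checks above can be sketched as a small script to run inside the container. This is only a diagnostic sketch: the exact compat library version (libcuda.so.555.42.02) is taken from this thread and may differ in your environment, and the search paths are narrowed from the suggested `find /` for speed.

```shell
#!/bin/sh
# Diagnostic sketch for CUDA forward-compatibility setup (run inside the container).
# No `set -e`: each check should run even if an earlier one finds nothing.

# Show what LD_LIBRARY_PATH the process sees; manual edits here can shadow
# the compat libcuda with an older copy mounted in from the host.
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"

# List libcuda.so* libraries/symlinks in common library locations
# (the thread suggests searching all of /; narrowed here for speed):
find /usr /lib /usr/local -name 'libcuda.so*' 2>/dev/null

# Ask the dynamic linker which libcuda it would actually resolve:
ldconfig -p 2>/dev/null | grep libcuda || echo "no libcuda in linker cache"
```

With forward compatibility working, the compat copy (e.g. under /usr/local/cuda/compat/) should be the one the linker resolves, not a 12.2-era library from the host driver.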

If it still doesn't work, please provide more details of the cluster environment.
