Vega GPU not recognized inside container #65
I was experiencing this issue too on Arch Linux. I tried a variety of Docker flags before settling on one that worked. Here's the full command:
Debian bullseye with the 5.5 kernel here: I had the same issue as you and your solution fixed it, except that I did not have rocm-dkms installed on the host (it fails to build on this kernel version anyway). When the flag is missing, starting TensorFlow fails with the following error:

After I added the flag, it worked.
I'm having this issue with Ubuntu 20.04 as the host. The --privileged flag has no effect. I upgraded from Ubuntu 19.04, which had the same issue, since Ubuntu 20.04 apparently has better AMD support, but that doesn't seem to have helped. I've followed all the instructions from here: https://github.com/RadeonOpenCompute/ROCm-docker/blob/master/quick-start.md with no issues. When I run the check, it lists the CPU and one GPU with the name "gfx900" (Vega 64).
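For anyone following along, a quick way to see just the agent names that rocminfo reports (a sketch, assuming the standard /opt/rocm install path inside the container):

```shell
# List the HSA agents the runtime can see; a working setup should
# show both the CPU and the GPU (e.g. gfx900 for a Vega 64).
/opt/rocm/bin/rocminfo | grep -E 'Marketing Name|Name'
```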
@CyberCyclone
@aligirayhanozbay Are you sure it's running on the GPU? I can get it to run, but it only runs on the CPU. The GPU load is 0%.
@CyberCyclone I'm fairly sure. Running a little snippet of code like this, I see 100% GPU utilization and the GPU clock shoots up to 1800 MHz (the stock max clock of the VII), monitored with radeontop outside the container:

```python
import torch

q = torch.rand(10000, 10000).cuda()
while True:
    r = torch.einsum('ij,jk->', q, q)
```
@CyberCyclone You can get rid of the error in rocminfo by installing kmod. Run the following in the Docker container:
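The exact command didn't survive in the quote above; a minimal sketch, assuming the rocm-terminal image is Ubuntu/Debian-based (sudo may be needed if you are not root in the container):

```shell
# Install kmod, which provides the lsmod binary rocminfo expects
sudo apt-get update && sudo apt-get install -y kmod
```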
Presumably kmod contains lsmod or pulls it in. The point is that lsmod is missing from the container (at least as root). You will then get:
Running Arch Linux, kernel 5.5.11. rocminfo on the host shows both agents and I can use OpenCL, but MIOpen currently fails to build, so I went for Docker. Inside the rocm-terminal container, however, rocminfo only shows my CPU as an agent.
/dev/kfd under current Arch is not in group video but render. The container doesn't have that group defined and shows the numeric GID 990 instead:
I therefore call the container as follows for testing:

```shell
docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
  --ipc=host --shm-size 8G --group-add video --group-add 990 \
  --security-opt seccomp=unconfined rocm/rocm-terminal
```
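Since the numeric GID of the render group varies between distros, a quick way to look up the right value for --group-add on the host (990 in my case) is:

```shell
# Print the owning GID and group name of the KFD device node;
# pass the numeric GID to docker run via --group-add.
stat -c '%g %G' /dev/kfd
```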
rocminfo inside container:
Interestingly, rocm-smi sees the GPU:
rocminfo on host:
Also on host:
Would appreciate any assistance...