Nvidia passthrough not working for TrueNAS 24.04.0 (Dragonfish) #127

Closed
dalgibbard opened this issue Apr 24, 2024 · 68 comments · Fixed by #166
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@dalgibbard (Contributor) commented Apr 24, 2024

Latest version of jailmaker (1.1.5).
As per the title: in Dragonfish 24.04.0, Nvidia passthrough seems to be broken. nvidia-smi works fine on the host, but inside the container it gives:

nvidia-container-cli: initialization error: load library failed: /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.1: file too short

It seems the script uses nvidia-container-cli list to find the Nvidia files which need mounting, but the container expects files outside of this list:

# On TrueNAS host:
$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08

Note that this list doesn't include the file the container is expecting.

Adding a manual mount to my jail's config for --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current resolved it though; not sure if that's a good idea or not, but it works at least :)
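For anyone wondering where that goes: it's an extra line under systemd_nspawn_user_args in the jail's config, roughly like this (a sketch; the bridge arg is just a placeholder for whatever args are already there):

systemd_nspawn_user_args=--network-bridge=br0
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current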

@neoKushan

I had the same problem, @dalgibbard's solution fixed it for me.

I did an apt-get update/upgrade to fix it at first, and somehow that hosed the entire jail, but I quickly rolled back and applied the --bind fix.

@Jip-Hop (Owner) commented Apr 26, 2024

Glad to hear you were able to get it working @dalgibbard.

I found the same error message mentioned in: NVIDIA/libnvidia-container#224. But I'm leaving the investigation of the root cause, and the implementation of a fix for this issue, up to the community. I don't have an nvidia GPU in my system and I'm not looking forward to another blind, back-and-forth, trial-and-error fixing process akin to #4.

@Jip-Hop added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Apr 26, 2024
@mooglestiltzkin commented Apr 28, 2024

> I had the same problem, @dalgibbard's solution fixed it for me.
>
> I did an apt-get update/upgrade to fix it at first and somehow that hosed the entire jail, but I quickly rolled back and applied the --bind fix

I still couldn't get Immich to use the graphics card.

To find the files, I ran:
nvidia-container-cli list

Then jlmkr edit docker, made the changes below, saved, and jlmkr restart docker:

        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08
        --bind-ro=/usr/bin/nvidia-persistenced
        --bind=/dev/nvidia0
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
        --bind=/dev/nvidia-uvm-tools
        --bind=/dev/nvidiactl 
        --bind=/dev/nvidia-uvm
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
        --bind-ro=/usr/bin/nvidia-smi
        --bind-ro=/usr/lib/nvidia/current/nvidia-smi

When trying to deploy Immich I get this:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

*Noticed that with the line below in the config, the docker jail wouldn't start; I removed it from the config and the jail worked again:
--bind=/dev/nvidia-modeset

@neoKushan

> jlmkr edit docker made changes, saved, jlmkr restart docker
> --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current
> […]

You shouldn't need all these bind mounts; the only one you need is --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current

@mooglestiltzkin

ty will update accordingly

@dasunsrule32 commented May 6, 2024

I created a new jail with Nvidia GPU passthrough enabled and mounted the suggested directory as read-only. I can run nvidia-smi and it detects my card inside the container:

nvidia-smi 
Mon May  6 16:43:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:2B:00.0 Off |                  N/A |
|  0%   37C    P8              11W / 280W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

When I attempt to use the card in Plex, I get the following errors:

[Req#18ae/Transcode] [FFMPEG] - Cannot load libcuda.so.1
[Req#18ae/Transcode] [FFMPEG] - Could not dynamically load CUDA

I did bind-ro the TrueNAS SCALE directory with the Nvidia libraries in it:

ls -la /usr/lib/x86_64-linux-gnu/nvidia/current/
total 79835
drwxr-xr-x 2 root root       21 Apr 22 19:20 .
drwxr-xr-x 3 root root        3 May  6 16:02 ..
lrwxrwxrwx 1 root root       12 Nov  6 21:21 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Nov  6 21:21 libcuda.so.1 -> libcuda.so.545.23.08
-rw-r--r-- 1 root root 29453200 Nov  6 16:49 libcuda.so.545.23.08
lrwxrwxrwx 1 root root       15 Nov  6 21:21 libnvcuvid.so -> libnvcuvid.so.1
lrwxrwxrwx 1 root root       23 Nov  6 21:21 libnvcuvid.so.1 -> libnvcuvid.so.545.23.08
-rw-r--r-- 1 root root 10009920 Nov  6 16:22 libnvcuvid.so.545.23.08
lrwxrwxrwx 1 root root       26 Nov  6 21:21 libnvidia-cfg.so.1 -> libnvidia-cfg.so.545.23.08
-rw-r--r-- 1 root root   274968 Nov  6 16:19 libnvidia-cfg.so.545.23.08
lrwxrwxrwx 1 root root       21 Nov  6 21:21 libnvidia-encode.so -> libnvidia-encode.so.1
lrwxrwxrwx 1 root root       29 Nov  6 21:21 libnvidia-encode.so.1 -> libnvidia-encode.so.545.23.08
-rw-r--r-- 1 root root   252576 Apr 25 09:50 libnvidia-encode.so.545.23.08
lrwxrwxrwx 1 root root       17 Nov  6 21:21 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root       25 Nov  6 21:21 libnvidia-ml.so.1 -> libnvidia-ml.so.545.23.08
-rw-r--r-- 1 root root  1992128 Nov  6 16:23 libnvidia-ml.so.545.23.08
lrwxrwxrwx 1 root root       19 Nov  6 21:21 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root       27 Nov  6 21:21 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.545.23.08
-rw-r--r-- 1 root root 86781944 Nov  6 17:28 libnvidia-nvvm.so.545.23.08
lrwxrwxrwx 1 root root       37 Nov  6 21:21 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.545.23.08
-rw-r--r-- 1 root root 26589472 Nov  6 16:55 libnvidia-ptxjitcompiler.so.545.23.08

Here is the nvidia-container-cli output from the host:

nvidia-container-cli list    
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08

Jail config:

startup=1
gpu_passthrough_intel=0
gpu_passthrough_nvidia=1
# Turning off seccomp filtering improves performance at the expense of security
seccomp=1

# Use macvlan networking to provide an isolated network namespace,
# so docker can manage firewall rules
# Alternatively use --network-bridge=br1 instead of --network-macvlan
# Ensure to change eno1/br1 to the interface name you want to use
# You may want to add additional options here, e.g. bind mounts
systemd_nspawn_user_args=--network-bridge=br0
        --system-call-filter='add_key keyctl bpf'
        --bind='/mnt/tank/containers/:/mnt/containers'
        --bind='/mnt/tank/data/apps/:/mnt/data'
        --bind='/mnt/tank/media/:/mnt/media'
        --bind='/mnt/tank/data/stacks:/opt/stacks'
        --bind-ro='/usr/lib/x86_64-linux-gnu/nvidia/current'

# Script to run on the HOST before starting the jail
# Load kernel module and config kernel settings required for docker
pre_start_hook=#!/usr/bin/bash
        set -euo pipefail
        echo 'PRE_START_HOOK'
        echo 1 > /proc/sys/net/ipv4/ip_forward
        modprobe br_netfilter
        echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
        echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables

# Only used while creating the jail
distro=debian
release=bookworm

# Install docker inside the jail:
# https://docs.docker.com/engine/install/debian/#install-using-the-repository
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
        set -euo pipefail

        apt-get update && apt-get -y install ca-certificates curl host
        install -m 0755 -d /etc/apt/keyrings
        curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
        chmod a+r /etc/apt/keyrings/docker.asc

        echo \
        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
        $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
        tee /etc/apt/sources.list.d/docker.list > /dev/null
        apt-get update
        apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed
        --property=Type=notify
        --property=RestartForceExitStatus=133
        --property=SuccessExitStatus=133
        --property=Delegate=yes
        --property=TasksMax=infinity
        --collect
        --setenv=SYSTEMD_NSPAWN_LOCK=0

systemd_nspawn_default_args=--keep-unit
        --quiet
        --boot
        --bind-ro=/sys/module
        --inaccessible=/sys/module/apparmor

Plex docker compose file:

services:
  plex:
    container_name: plex
    image: plexinc/pms-docker
    restart: unless-stopped
    network_mode: host
    volumes:
      - /mnt/data/plex/configs:/config
      - /mnt/data/plex/transcode:/transcode
      - /mnt/media/plex:/data
    env_file:
      - .env

I can see the library is there. Any ideas?

@neoKushan

Did you install the Nvidia runtime and configure Plex to use it?

@dasunsrule32 commented May 7, 2024

Yes, just got it working. I missed setting the NVIDIA_* vars in my env file. After that it fires up. So, in addition to the above, I had to do the following:

  1. Install the Nvidia Container Toolkit in the jail
  2. Configure the Nvidia Container Toolkit in the jail
  3. Enable the NVIDIA_* env vars:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility
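For steps 1 and 2, inside the jail that's roughly the standard toolkit setup (a sketch; assumes the Nvidia apt repository has already been added per Nvidia's install docs):

apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker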

Working Plex compose.yml:

services:
  plex:
    container_name: plex
    image: plexinc/pms-docker
    restart: unless-stopped
    network_mode: host
    volumes:
      - /mnt/data/plex/configs:/config
      - /mnt/data/plex/transcode:/transcode
      - /mnt/media/plex:/data
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
    env_file:
      - .env
networks: {}

Working nvidia-smi output from TrueNAS host:

Mon May  6 18:35:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:2B:00.0 Off |                  N/A |
|  0%   45C    P2              73W / 280W |    298MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2658633      C   ...lib/plexmediaserver/Plex Transcoder      296MiB |
+---------------------------------------------------------------------------------------+

@mooglestiltzkin

moogle here. glad it helped gratz :}

@dasunsrule32

> moogle here. glad it helped gratz :}

Thank you for pointing me in the right direction. 😊

@mooglestiltzkin

> Thank you for pointing me in the right direction. 😊

np

By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.

jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg

this location i think

@dasunsrule32

> By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.
>
> jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg
>
> this location i think

I'll take a look tomorrow. I'll grab the compose file and the docker image and spin it up and see if I can get it to recognize.

@mooglestiltzkin commented May 7, 2024

> I'll take a look tomorrow. I'll grab the compose file and the docker image and spin it up and see if I can get it to recognize.

np. I would really appreciate it when you are able to do so.

I think the graphics card part is set up correctly, but I don't know how to install ffmpeg when taking jailmaker into account.

In jailmaker we have this

/mnt/tank/jailmaker/jails/docker/rootfs/usr/lib

in jellyfin the ffmpeg points to

FFmpeg path:
/usr/lib/jellyfin-ffmpeg/ffmpeg
The path to the FFmpeg application file or folder containing FFmpeg.

but that jellyfin-ffmpeg/ffmpeg does not exist at that location.

So I thought I'd have to go jlmkr shell docker, then install the repo and the custom Jellyfin ffmpeg, but I don't know how.

I'm using debian bookworm for jailmaker docker.

@neoKushan

> By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.
>
> jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg
>
> this location i think

You shouldn't really be modifying the contents of a jail from outside the jail itself, nor should you be modifying the contents of a docker container image from outside the container itself - it's just asking for trouble.

Which Docker image are you using for Jellyfin? I can see there are 3 available and I'd be surprised if they don't come with ffmpeg already installed, especially as Jellyfin has its own fork. Looking at the linuxserver.io image, I can see that it should have the jellyfin-ffmpeg5 bundle as part of it - https://github.com/linuxserver/docker-jellyfin/blob/master/package_versions.txt

@mooglestiltzkin

> You shouldn't really be modifying the contents of a jail from outside the jail itself, nor should you be modifying the contents of a docker container image from outside the container itself - it's just asking for trouble.
>
> Which Docker image are you using for Jellyfin? […]

I use the linuxserver Jellyfin image, but when I enable Nvidia hardware acceleration it doesn't work. I think ffmpeg doesn't get detected, so it exits.

My error message:
https://forums.truenas.com/t/qnap-ts-877-truenas-journal/1646/590?u=mooglestiltzkin

@mooglestiltzkin

How do you do this?

> We automatically add the necessary environment variable that will utilise all the features available on a GPU on the host. Once nvidia-container-toolkit is installed on your host you will need to re/create the docker container with the nvidia container runtime --runtime=nvidia and add an environment variable -e NVIDIA_VISIBLE_DEVICES=all (can also be set to a specific gpu's UUID, this can be discovered by running nvidia-smi --query-gpu=gpu_name,gpu_uuid --format=csv ). NVIDIA automatically mounts the GPU and drivers from your host into the container.
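In plain docker run terms, what the quoted docs describe would be something like this (an illustrative sketch; the image name is taken from this thread, the container name and flags around it are assumptions):

docker run -d --name jellyfin --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all lscr.io/linuxserver/jellyfin:latest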

@mooglestiltzkin

Tried adding this to the environment but it still didn't work:

NVIDIA_VISIBLE_DEVICES=all

@neoKushan commented May 7, 2024

> but when i enable nvidia hardware acceleration it doesn't work. i think ffmpeg doesn't get detected so it exits.
>
> my error message https://forums.truenas.com/t/qnap-ts-877-truenas-journal/1646/590?u=mooglestiltzkin

I think you're going down the wrong track. The error message you linked to doesn't say FFMPEG isn't present; it says FFMPEG exited with an error (possibly due to transcoding, but not necessarily). I am 99.99% certain that FFMPEG is installed and isn't the issue.

So you tried nvidia-smi and that works - good. That means that your jail can see the GPU. Now you need to check if a container within the jail can see the GPU, so run this command:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

You should get the same result, it's just running nvidia-smi within a container instead of on the OS itself. Try that first.

EDIT: Looking at your docker compose script, you don't appear to have set runtime: nvidia either. That'll be a problem.

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    runtime: nvidia
    container_name: jellyfin
    environment:
      - PUID=0
      - PGID=0
      - TZ=newyork
      - JELLYFIN_PublishedServerUrl=https://jellyfin.mydomain.duckdns.org #optional
    volumes:
      - /mnt/docker/data/jellyfin:/config
      - /mnt/xxxxx/Videos:/data/media1:ro
      - /mnt/xxxxx/Videos:/data/media2:ro
    ports:
      - 8096:8096
      - 8920:8920 #optional
      - 7359:7359/udp #optional
    # - 1900:1900/udp #optional
    networks:
      - proxy
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
    restart: unless-stopped
#networks: {}

networks:
  proxy:
    external: true
Something like that. You also don't need all that stuff about reserving the GPU, either. You can remove all of the stuff in deploy.
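Putting that together, the Nvidia-relevant bits can be as small as this (a sketch, not a complete compose file):

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all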

@mooglestiltzkin

@neoKushan

ty Neo. Now I have a better angle to look at this.

I used dockge, entered bash, did the command, but got this:

root@bxxxxxxxxe:/# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
bash: sudo: command not found

Did a google, went to jlmkr shell docker, then docker exec -it bash, which is pretty much the same as the dockge earlier.

Not sure how to sh into the container to be able to run that command.

@neoKushan commented May 7, 2024

No no, the command is to be run in the jail, not within the container. It'll spin up a new container that only displays that output.

So 'jlmkr shell docker' then

'sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi'

@mooglestiltzkin commented May 7, 2024

But I think my jail for docker uses Debian, so isn't ubuntu incorrect? Confused.

I deployed docker using the Debian docker template:
https://github.com/Jip-Hop/jailmaker/tree/main/templates/docker

root@docker:~# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49xxxxx4a: Pull complete 
Digest: sha25xxxxxxxxxxe15
Status: Downloaded newer image for ubuntu:latest

Ya, thought so. That doesn't look right, since I'm not using Ubuntu.

@neoKushan commented May 7, 2024

The container image is based on Ubuntu; that's normal in the docker world. It's fine.

Containers can be based on other operating systems, like Arch, and will happily run on any docker host.

@mooglestiltzkin commented May 7, 2024

Well, after running that it says this:

root@docker:~# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49xxxxxxa: Pull complete 
Digest: sha256xxxxxxxxxxxxxxxfe15
Status: Downloaded newer image for ubuntu:latest
Tue May  7 xxxxxx 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:12:00.0 Off |                  N/A |
| 35%   36C    P0              N/A /  75W |      0MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@docker:~# 

@neoKushan commented May 7, 2024

Okay good, so update your docker compose file to add runtime: nvidia, do another docker compose up to reload jellyfin and see if that works.

@mooglestiltzkin

> Okay good, so update your docker compose file to add runtime: nvidia, do another docker compose up to reload jellyfin and see if that works.

How do I add runtime: nvidia? Is that under environment? I use dockge, so

runtime=nvidia

?

@mooglestiltzkin

I did a search and found this:

version: '3'
services:
  jellyfin:
    image: jellyfin/jellyfin
    user: 1000:1000
    network_mode: 'host'
    volumes:
      - /path/to/config:/config
      - /path/to/cache:/cache
      - /path/to/media:/media
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

I'll try it:
https://jellyfin.org/docs/general/administration/hardware-acceleration/nvidia/

@neoKushan

Yeah, I did give an example in this post, but you might have missed it.

@mooglestiltzkin

Tried it; still didn't work.

Went to look at other stuff to troubleshoot:
https://www.reddit.com/r/selfhosted/comments/w559xa/jellyfin_nvidia_in_docker_a_guide_for_newbies/

root@truenas[~]# jlmkr shell docker
Connected to machine docker. Press ^] three times within 1s to exit session.
root@docker:~# nvidia-smi --query-gpu=gpu_name,gpu_uuid --format=csv
name, uuid
NVIDIA GeForce GTX 1050, GPU-67xxxxx110
root@docker:~# 

hm...

@dasunsrule32

>> Okay, those are the logs from jellyfin - there should be a separate set of logs for ffmpeg or transcoding - can you find and post those?
>
> omg it worked.
>
> i was testing some stuff
>
> in the docker compose in environments i added NVIDIA_VISIBLE_DEVICES=all
>
> after making that change it worked.
>
> So, the .env in dockge where i had a NVIDIA_VISIBLE_DEVICES=all did absolutely nothing apparently, because i can't explain otherwise why the same value did not kick in.
>
> Albeit i had both an environment in the compose, as well as in the UI for dockge. Not sure the implications of that. I just assumed if u had env top, and env both, they would both be active. Guess not.
>
> Anyway that's what fixed it.
>
> I think the solution overall was a combination of steps you suggested:
>
> sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>
>     environment:
>       - NVIDIA_VISIBLE_DEVICES=all
>
>     runtime: nvidia
>
> hope this helps the next person that comes along and ran into my issue.
>
> ty so much @neoKushan and @dasunsrule32

Nice! I don't know that you need the runtime because it's defined in the docker daemon config, but it can't hurt to have it.

cat /etc/docker/daemon.json 
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Just to be safe, I did add it to my compose.yml, but it didn't affect it either way. The big one is the NVIDIA_* vars to enable the GPU in the container.
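If you'd rather not set runtime: per service, another option (a standard Docker daemon setting, sketched here; not something from this thread's configs) is to make nvidia the default runtime in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}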

@mooglestiltzkin

> Nice! I don't know that you need the runtime because it's defined in the docker daemon config, but it can't hurt to have it.

Mine is the same as yours in that location:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

thx for the update

@Jip-Hop (Owner) commented May 7, 2024

Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

@dalgibbard (Contributor, Author)

> Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

Lol yeah, can we take the unrelated issues offline?
I do plan to make a PR for this but keep forgetting.
I'll stick it in my calendar as a reminder and see if I can come up with something tomorrow.

@dasunsrule32

> Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

I have a custom config for Nvidia that I'm testing now. I'll push it soon, with example instructions on how to deploy.

@dasunsrule32

I can confirm that the config is working out of the box with Nvidia. I just created a jail and booted Plex on it and hw transcoding is working. I'll submit it shortly.

@dasunsrule32

This is the PR; I'm going to work on adding documentation now as well.

#163

@Jip-Hop (Owner) commented May 9, 2024

@dasunsrule32 reports not running into the "file too short" error on SCALE 24.04 when using the pre-release version of jailmaker from the develop branch:

#163 (comment)

I don't think I added anything in particular to fix the "file too short" error though...

@dalgibbard (Contributor, Author)

@Jip-Hop I'll get a PR raised in a bit; it's caused by the mounts for Nvidia common, unrelated to the template/init changes etc.

TL;DR: the current method for locating the Nvidia modules to bind mount should use the parent dir of the nvidia-common folder, else it misses some of the libs/modules.

i.e. what this issue was actually about all along, before people went on huge tangents 🤣
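As a rough illustration of the change (a Python sketch, since jlmkr.py is Python; not the actual PR diff):

import os

# Two of the library paths reported by `nvidia-container-cli list`
listed = [
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08",
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08",
]
# Instead of binding each listed file individually, bind the parent directory
# once, so sibling libs (and their .so.1 symlinks) come along too
parent_dirs = {os.path.dirname(p) for p in listed}
flags = [f"--bind-ro={d}" for d in sorted(parent_dirs)]
print(flags)  # ['--bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current']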

@Jip-Hop (Owner) commented May 9, 2024

> it's caused by the mounts for Nvidia common; unrelated to the template/init changes etc

But isn't it strange the "file too short" error doesn't occur for some users on SCALE 24.04?

@Jip-Hop (Owner) commented May 9, 2024

The old nvidia issue may also provide some helpful insights as to how the current method of passing through came to be. Specifically this comment: #4 (comment).

@dalgibbard (Contributor, Author) commented May 9, 2024

> > it's caused by the mounts for Nvidia common; unrelated to the template/init changes etc
>
> But isn't it strange the "file too short" error doesn't occur for some users on SCALE 24.04?

It'll depend on the application being run, I suspect (my application is CUDA-based, for example, so it needs certain libs that might not be mounted, compared to, say, NVENC), plus any manual mounts they have already added.

I myself can attest to multiple custom bind mounts already :)

@dalgibbard (Contributor, Author)

PR for review: #165

@dalgibbard (Contributor, Author) commented May 9, 2024

> The old nvidia issue may also provide some helpful insights as to how the current method of passing through came to be. Specifically this comment: #4 (comment).

I suspect that this is valid where the Nvidia container runtime is being used, but not when running GPU workloads directly in the jail? That might be the key difference here.

If we're already mounting a few files from that dir anyway, I don't see any harm in extending that to be the directory personally.

FWIW, I'm running both container-based and non-container based workloads with this change, and both are working correctly.

@dalgibbard (Contributor, Author)

Re-raised PR against develop branch: #166

@dasunsrule32 commented May 9, 2024

This is a good approach. I might suggest using awk rather than grep, as it's usually more performant.

@dalgibbard (Contributor, Author)

> This is a good approach. I might suggest using awk rather than grep, as it's usually more performant.

Yeah... I guess. But in this case we're iterating through like 10 lines of output; I figured the grep implementation is more readable than awk here. I could test the perf difference, but given the volume of text, it's very likely to be minimal, if any.

The equivalent awk implementation would look like this, I guess:

command1 | awk 'NR==FNR{a[$0];next} !($0 in a)' - <(command2)

Which is... incomprehensible to most, lol. Unless there's a cleaner/easier-to-read implementation?
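For reference, my reading of that one-liner, annotated (same command, comments added):

# First input ("-", i.e. stdin fed by command1): store each line as a key in array a
# Second input (<(command2)): print only the lines NOT present in a
command1 | awk 'NR==FNR{a[$0];next} !($0 in a)' - <(command2)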

@Jip-Hop (Owner) commented May 10, 2024

I don't have time to look at this properly this week, but I prefer a pure Python implementation. I think we're pretty close to solving this cleanly :)

@dalgibbard (Contributor, Author)

> I don't have time to look at this properly this week but I prefer a pure python implementation. I think we're pretty close to solving this cleanly :)

A "pure Python" implementation is possible (ignoring the subprocess calls, but at least getting rid of the 'shell' requirement) by pre-running the subtractive command, storing its output, and then subtracting it from the existing list.

Since that's the preference, I'll sort this in a bit :)
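A minimal sketch of that shape (illustrative only, not the PR's actual code; the subtractive command here is a placeholder):

import subprocess

def lines(cmd):
    # Capture a command's stdout as a list of non-empty lines
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

# Pre-run the "subtractive" command once and store its output as a set...
already_covered = set(lines(["some-subtractive-command"]))  # placeholder
# ...then filter the main list against it in Python, no shell pipeline needed
wanted = [p for p in lines(["nvidia-container-cli", "list"])
          if p not in already_covered]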

@Jip-Hop (Owner) commented May 10, 2024

Is anyone able to test and confirm that this issue is fixed in: https://github.com/dalgibbard/jailmaker/blob/issue-127-nvidia-passthrough/jlmkr.py? I'd like to test it myself but can't without an nvidia GPU. Would be great to get additional confirmation before creating a new release.

Please create a new jail with nvidia passthrough enabled. Don't manually add any additional mounts.

Thanks again @dalgibbard for reporting this issue and providing a PR.

@neoKushan commented May 10, 2024

So I just gave this a test and hit an error even when starting the jail:

May 10 15:22:52 ming .ExecStartPre[490613]: PRE_START_HOOK
May 10 15:22:52 ming systemd-nspawn[490615]: systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +>
May 10 15:22:52 ming systemd-nspawn[490615]: Detected virtualization systemd-nspawn.
May 10 15:22:52 ming systemd-nspawn[490615]: Detected architecture x86-64.
May 10 15:22:52 ming systemd-nspawn[490615]: Detected first boot.
May 10 15:22:52 ming systemd-nspawn[490615]:
May 10 15:22:52 ming systemd-nspawn[490615]: Welcome to Debian GNU/Linux 12 (bookworm)!
May 10 15:22:52 ming systemd-nspawn[490615]:
May 10 15:22:52 ming systemd-nspawn[490615]: Initializing machine ID from container UUID.
May 10 15:22:53 ming systemd-nspawn[490615]: Failed to create control group inotify object: Too many open files
May 10 15:22:53 ming systemd-nspawn[490615]: Failed to allocate manager object: Too many open files
May 10 15:22:53 ming systemd-nspawn[490615]: [!!!!!!] Failed to allocate manager object.
May 10 15:22:53 ming systemd-nspawn[490615]: Exiting PID 1...
May 10 15:22:53 ming systemd[1]: jlmkr-nvtesting.service: Main process exited, code=exited, status=255/EXCEPTION

I used the v1.4.1 version of the script in the branch you linked and the docker template from the same branch. The only thing I changed in the template was setting gpu_passthrough_nvidia=0 to gpu_passthrough_nvidia=1 and changing systemd_nspawn_user_args=--network-bridge=br1 to systemd_nspawn_user_args=--network-bridge=br0 (as br0 is my bridge).

However, I am not sure if this is related to the nvidia changes, as I tried again with another new jail using the v1.4.1 script, without setting gpu_passthrough_nvidia to 1, and got similar output:

Starting jail test with the following command:

systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-test --working-directory=./jails/test '--description=My nspawn jail test [created with jailmaker]' --property=ExecStartPre=/mnt/gordon/jailmaker/jails/test/.ExecStartPre -- systemd-nspawn --keep-unit --quiet --boot --bind-ro=/sys/module --inaccessible=/sys/module/apparmor --machine=test --directory=rootfs --notify-ready=yes --network-bridge=br0 --resolv-conf=bind-host '--system-call-filter=add_key keyctl bpf'

Job for jlmkr-test.service failed.
See "systemctl status jlmkr-test.service" and "journalctl -xeu jlmkr-test.service" for details.

Failed to start jail test...
In case of a config error, you may fix it with:
jlmkr edit test

root@ming[/mnt/gordon/jailmaker]# journalctl -xeu jlmkr-test.service
May 10 15:28:56 ming .ExecStartPre[509351]: PRE_START_HOOK
May 10 15:28:56 ming systemd-nspawn[509353]: systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +>
May 10 15:28:56 ming systemd-nspawn[509353]: Detected virtualization systemd-nspawn.
May 10 15:28:56 ming systemd-nspawn[509353]: Detected architecture x86-64.
May 10 15:28:56 ming systemd-nspawn[509353]: Detected first boot.
May 10 15:28:56 ming systemd-nspawn[509353]:
May 10 15:28:56 ming systemd-nspawn[509353]: Welcome to Debian GNU/Linux 12 (bookworm)!
May 10 15:28:56 ming systemd-nspawn[509353]:
May 10 15:28:56 ming systemd-nspawn[509353]: Initializing machine ID from container UUID.
May 10 15:28:56 ming systemd-nspawn[509353]: Failed to create control group inotify object: Too many open files
May 10 15:28:56 ming systemd-nspawn[509353]: Failed to allocate manager object: Too many open files
May 10 15:28:56 ming systemd-nspawn[509353]: [!!!!!!] Failed to allocate manager object.
May 10 15:28:56 ming systemd-nspawn[509353]: Exiting PID 1...
May 10 15:28:56 ming systemd[1]: jlmkr-test.service: Main process exited, code=exited, status=255/EXCEPTION
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit jlmkr-test.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 255.
May 10 15:28:56 ming systemd[1]: jlmkr-test.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit jlmkr-test.service has entered the 'failed' state with result 'exit-code'.
May 10 15:28:56 ming systemd[1]: Failed to start jlmkr-test.service - My nspawn jail test [created with jailmaker].
░░ Subject: A start job for unit jlmkr-test.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit jlmkr-test.service has finished with a failure.
░░
░░ The job identifier is 1628358 and the job result is failed.

EDIT: I reverted back to v1.2.1 of jlmkr and it created a jail just fine. Let me see where the breaking change came in.

@dasunsrule32 commented May 10, 2024

@neoKushan

Ahhh, that's useful to know! I'll make that tweak and try again.

@neoKushan

Yup, can confirm the v1.4.1 script works a treat!

I installed the jail using the docker template with gpu_passthrough_nvidia=1 and that was all fine.

I then shelled into the jail and ran nvidia-smi which showed my GPU humming away.

I then ran sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi and it also showed my gpu stats, I assume that's about all that's needed to prove it out?

@b-neufeld

> Yup, can confirm the v1.4.1 script works a treat!
>
> I installed the jail using the docker template with gpu_passthrough_nvidia=1 and that was all fine.
>
> I then shelled into the jail and ran nvidia-smi which showed my GPU humming away.
>
> I then ran sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi and it also showed my gpu stats, I assume that's about all that's needed to prove it out?

Confirming all these steps worked for me too!

Jip-Hop pushed a commit that referenced this issue May 11, 2024
* Fix Nvidia Passthrough closing #127
* Mount libraries parent directory
* Use the dynamic library path from the existing code
@Jip-Hop (Owner) commented May 17, 2024

Anyone here able to help debug issue #174?
