Nvidia passthrough not working for TrueNAS 24.04.0 (Dragonfish) #127

Closed
dalgibbard opened this issue Apr 24, 2024 · 68 comments · Fixed by #166
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@dalgibbard (Contributor) commented Apr 24, 2024

Latest version of jailmaker (1.1.5).
As per the title: in Dragonfish 24.04.0, Nvidia passthrough seems to be broken. nvidia-smi works fine on the host, but inside the container it gives:

nvidia-container-cli: initialization error: load library failed: /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.1: file too short

It seems the script uses nvidia-container-cli list to find the Nvidia files which need mounting, but the container expects files outside of this list:

# On TrueNAS host:
$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08

Note that this list doesn't include the file the container is expecting.

Adding a manual mount to my jail's config for --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current resolved it though; not sure if that's a good idea or not, but it works at least :)
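For anyone wondering where that goes: it's an extra line under systemd_nspawn_user_args in the jail's config, roughly like this (a sketch; the bridge arg is just a placeholder for whatever args are already there):

systemd_nspawn_user_args=--network-bridge=br0
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current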

@neoKushan

I had the same problem, @dalgibbard's solution fixed it for me.

I did an apt-get update/upgrade to fix it at first, and somehow that hosed the entire jail, but I quickly rolled back and applied the --bind fix.

@Jip-Hop (Owner) commented Apr 26, 2024

Glad to hear you were able to get it working @dalgibbard.

I found the same error message mentioned in: NVIDIA/libnvidia-container#224. But I'm leaving the investigation of the root cause, and the implementation of a fix for this issue, up to the community. I don't have an nvidia GPU in my system and I'm not looking forward to another blind, back-and-forth, trial-and-error fixing process akin to #4.

@Jip-Hop added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Apr 26, 2024
@mooglestiltzkin commented Apr 28, 2024

> I had the same problem, @dalgibbard's solution fixed it for me.
>
> I did an apt-get update/upgrade to fix it at first and somehow that hosed the entire jail, but I quickly rolled back and applied the --bind fix

I still couldn't get Immich to use the graphics card.

To find the files, I ran:
nvidia-container-cli list

Then jlmkr edit docker, made the changes below, saved, and jlmkr restart docker:

        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08
        --bind-ro=/usr/bin/nvidia-persistenced
        --bind=/dev/nvidia0
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
        --bind=/dev/nvidia-uvm-tools
        --bind=/dev/nvidiactl 
        --bind=/dev/nvidia-uvm
        --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
        --bind-ro=/usr/bin/nvidia-smi
        --bind-ro=/usr/lib/nvidia/current/nvidia-smi

When trying to deploy Immich I get this:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

*Noticed that with the line below in the config, the docker jail wouldn't start; I removed it from the config and the jail worked again:
--bind=/dev/nvidia-modeset

@neoKushan

> jlmkr edit docker made changes, saved, jlmkr restart docker
> --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current
> […]

You shouldn't need all these bind mounts; the only one you need is --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current

@mooglestiltzkin

ty will update accordingly

@dasunsrule32 commented May 6, 2024

I created a new jail with Nvidia GPU passthrough enabled and mounted the suggested directory as read-only. I can run nvidia-smi and it detects my card inside the container:

nvidia-smi 
Mon May  6 16:43:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:2B:00.0 Off |                  N/A |
|  0%   37C    P8              11W / 280W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

When I attempt to use the card in Plex, I get the following errors:

[Req#18ae/Transcode] [FFMPEG] - Cannot load libcuda.so.1
[Req#18ae/Transcode] [FFMPEG] - Could not dynamically load CUDA

I did bind-ro the TrueNAS SCALE directory with the Nvidia libraries in it:

ls -la /usr/lib/x86_64-linux-gnu/nvidia/current/
total 79835
drwxr-xr-x 2 root root       21 Apr 22 19:20 .
drwxr-xr-x 3 root root        3 May  6 16:02 ..
lrwxrwxrwx 1 root root       12 Nov  6 21:21 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Nov  6 21:21 libcuda.so.1 -> libcuda.so.545.23.08
-rw-r--r-- 1 root root 29453200 Nov  6 16:49 libcuda.so.545.23.08
lrwxrwxrwx 1 root root       15 Nov  6 21:21 libnvcuvid.so -> libnvcuvid.so.1
lrwxrwxrwx 1 root root       23 Nov  6 21:21 libnvcuvid.so.1 -> libnvcuvid.so.545.23.08
-rw-r--r-- 1 root root 10009920 Nov  6 16:22 libnvcuvid.so.545.23.08
lrwxrwxrwx 1 root root       26 Nov  6 21:21 libnvidia-cfg.so.1 -> libnvidia-cfg.so.545.23.08
-rw-r--r-- 1 root root   274968 Nov  6 16:19 libnvidia-cfg.so.545.23.08
lrwxrwxrwx 1 root root       21 Nov  6 21:21 libnvidia-encode.so -> libnvidia-encode.so.1
lrwxrwxrwx 1 root root       29 Nov  6 21:21 libnvidia-encode.so.1 -> libnvidia-encode.so.545.23.08
-rw-r--r-- 1 root root   252576 Apr 25 09:50 libnvidia-encode.so.545.23.08
lrwxrwxrwx 1 root root       17 Nov  6 21:21 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root       25 Nov  6 21:21 libnvidia-ml.so.1 -> libnvidia-ml.so.545.23.08
-rw-r--r-- 1 root root  1992128 Nov  6 16:23 libnvidia-ml.so.545.23.08
lrwxrwxrwx 1 root root       19 Nov  6 21:21 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root       27 Nov  6 21:21 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.545.23.08
-rw-r--r-- 1 root root 86781944 Nov  6 17:28 libnvidia-nvvm.so.545.23.08
lrwxrwxrwx 1 root root       37 Nov  6 21:21 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.545.23.08
-rw-r--r-- 1 root root 26589472 Nov  6 16:55 libnvidia-ptxjitcompiler.so.545.23.08

Here is the nvidia-container-cli output from the host:

nvidia-container-cli list    
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.545.23.08
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.545.23.08

Jail config:

startup=1
gpu_passthrough_intel=0
gpu_passthrough_nvidia=1
# Turning off seccomp filtering improves performance at the expense of security
seccomp=1

# Use macvlan networking to provide an isolated network namespace,
# so docker can manage firewall rules
# Alternatively use --network-bridge=br1 instead of --network-macvlan
# Ensure to change eno1/br1 to the interface name you want to use
# You may want to add additional options here, e.g. bind mounts
systemd_nspawn_user_args=--network-bridge=br0
        --system-call-filter='add_key keyctl bpf'
        --bind='/mnt/tank/containers/:/mnt/containers'
        --bind='/mnt/tank/data/apps/:/mnt/data'
        --bind='/mnt/tank/media/:/mnt/media'
        --bind='/mnt/tank/data/stacks:/opt/stacks'
        --bind-ro='/usr/lib/x86_64-linux-gnu/nvidia/current'

# Script to run on the HOST before starting the jail
# Load kernel module and config kernel settings required for docker
pre_start_hook=#!/usr/bin/bash
        set -euo pipefail
        echo 'PRE_START_HOOK'
        echo 1 > /proc/sys/net/ipv4/ip_forward
        modprobe br_netfilter
        echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
        echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables

# Only used while creating the jail
distro=debian
release=bookworm

# Install docker inside the jail:
# https://docs.docker.com/engine/install/debian/#install-using-the-repository
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
        set -euo pipefail

        apt-get update && apt-get -y install ca-certificates curl host
        install -m 0755 -d /etc/apt/keyrings
        curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
        chmod a+r /etc/apt/keyrings/docker.asc

        echo \
        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
        $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
        tee /etc/apt/sources.list.d/docker.list > /dev/null
        apt-get update
        apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed
        --property=Type=notify
        --property=RestartForceExitStatus=133
        --property=SuccessExitStatus=133
        --property=Delegate=yes
        --property=TasksMax=infinity
        --collect
        --setenv=SYSTEMD_NSPAWN_LOCK=0

systemd_nspawn_default_args=--keep-unit
        --quiet
        --boot
        --bind-ro=/sys/module
        --inaccessible=/sys/module/apparmor

Plex docker compose file:

services:
  plex:
    container_name: plex
    image: plexinc/pms-docker
    restart: unless-stopped
    network_mode: host
    volumes:
      - /mnt/data/plex/configs:/config
      - /mnt/data/plex/transcode:/transcode
      - /mnt/media/plex:/data
    env_file:
      - .env

I can see the library is there. Any ideas?

@neoKushan

Did you install the Nvidia runtime and configure Plex to use it?

@dasunsrule32 commented May 7, 2024

Yes, just got it working. I missed setting the NVIDIA_* vars in my env file. After that it fires up. So, in addition to the above, I had to do the following:

  1. Install the Nvidia Container Toolkit in the jail
  2. Configure the Nvidia Container Toolkit in the jail
  3. Enable the NVIDIA_* env vars:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility
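For steps 1 and 2, inside the jail that's roughly the standard toolkit setup (a sketch; assumes the Nvidia apt repository has already been added per Nvidia's install docs):

apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker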

Working Plex compose.yml:

services:
  plex:
    container_name: plex
    image: plexinc/pms-docker
    restart: unless-stopped
    network_mode: host
    volumes:
      - /mnt/data/plex/configs:/config
      - /mnt/data/plex/transcode:/transcode
      - /mnt/media/plex:/data
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
    env_file:
      - .env
networks: {}

Working nvidia-smi output from TrueNAS host:

Mon May  6 18:35:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:2B:00.0 Off |                  N/A |
|  0%   45C    P2              73W / 280W |    298MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2658633      C   ...lib/plexmediaserver/Plex Transcoder      296MiB |
+---------------------------------------------------------------------------------------+

@mooglestiltzkin

moogle here. glad it helped gratz :}

@dasunsrule32

> moogle here. glad it helped gratz :}

Thank you for pointing me in the right direction. 😊

@mooglestiltzkin

> Thank you for pointing me in the right direction. 😊

np

By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.

jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg

this location i think

@dasunsrule32

> By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.
>
> jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg
>
> this location i think

I'll take a look tomorrow. I'll grab the compose file and the docker image and spin it up and see if I can get it to recognize.

@mooglestiltzkin commented May 7, 2024

> I'll take a look tomorrow. I'll grab the compose file and the docker image and spin it up and see if I can get it to recognize.

np. I would really appreciate it when you are able to do so.

I think the graphics card part is set up correctly, but I don't know how to install ffmpeg when taking jailmaker into account.

In jailmaker we have this

/mnt/tank/jailmaker/jails/docker/rootfs/usr/lib

in jellyfin the ffmpeg points to

FFmpeg path:
/usr/lib/jellyfin-ffmpeg/ffmpeg
The path to the FFmpeg application file or folder containing FFmpeg.

but that jellyfin-ffmpeg/ffmpeg does not exist at that location.

So I thought I'd have to go jlmkr shell docker, then install the repo and the custom Jellyfin ffmpeg, but I don't know how.

I'm using debian bookworm for jailmaker docker.

@neoKushan

> By the way, may I ask how you got ffmpeg installed? I need that for my Jellyfin setup to get it to work.
>
> jailmaker/jails/docker/rootfs/usr/lib/jellyfin-ffmpeg/ffmpeg
>
> this location i think

You shouldn't really be modifying the contents of a jail from outside the jail itself, nor should you be modifying the contents of a docker container image from outside the container itself - it's just asking for trouble.

Which Docker image are you using for Jellyfin? I can see there are 3 available and I'd be surprised if they don't come with ffmpeg already installed, especially as Jellyfin has its own fork. Looking at the linuxserver.io image, I can see that it should have the jellyfin-ffmpeg5 bundle as part of it - https://github.com/linuxserver/docker-jellyfin/blob/master/package_versions.txt

@mooglestiltzkin

> You shouldn't really be modifying the contents of a jail from outside the jail itself, nor should you be modifying the contents of a docker container image from outside the container itself - it's just asking for trouble.
>
> Which Docker image are you using for Jellyfin? […]

I use the linuxserver Jellyfin image, but when I enable Nvidia hardware acceleration it doesn't work. I think ffmpeg doesn't get detected, so it exits.

My error message:
https://forums.truenas.com/t/qnap-ts-877-truenas-journal/1646/590?u=mooglestiltzkin

@mooglestiltzkin

How do you do this?

> We automatically add the necessary environment variable that will utilise all the features available on a GPU on the host. Once nvidia-container-toolkit is installed on your host you will need to re/create the docker container with the nvidia container runtime --runtime=nvidia and add an environment variable -e NVIDIA_VISIBLE_DEVICES=all (can also be set to a specific gpu's UUID, this can be discovered by running nvidia-smi --query-gpu=gpu_name,gpu_uuid --format=csv ). NVIDIA automatically mounts the GPU and drivers from your host into the container.
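In plain docker run terms, what the quoted docs describe would be something like this (an illustrative sketch; the image name is taken from this thread, the container name and flags around it are assumptions):

docker run -d --name jellyfin --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all lscr.io/linuxserver/jellyfin:latest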

@mooglestiltzkin

Tried adding this to the environment but it still didn't work:

NVIDIA_VISIBLE_DEVICES=all

@neoKushan commented May 7, 2024

> but when i enable nvidia hardware acceleration it doesn't work. i think ffmpeg doesn't get detected so it exits.
>
> my error message https://forums.truenas.com/t/qnap-ts-877-truenas-journal/1646/590?u=mooglestiltzkin

I think you're going down the wrong track. The error message you linked to doesn't say FFMPEG isn't present; it says FFMPEG exited with an error (possibly due to transcoding, but not necessarily). I am 99.99% certain that FFMPEG is installed and isn't the issue.

So you tried nvidia-smi and that works - good. That means that your jail can see the GPU. Now you need to check if a container within the jail can see the GPU, so run this command:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

You should get the same result, it's just running nvidia-smi within a container instead of on the OS itself. Try that first.

EDIT: Looking at your docker compose script, you don't appear to have set runtime: nvidia either. That'll be a problem.

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    runtime: nvidia
    container_name: jellyfin
    environment:
      - PUID=0
      - PGID=0
      - TZ=newyork
      - JELLYFIN_PublishedServerUrl=https://jellyfin.mydomain.duckdns.org #optional
    volumes:
      - /mnt/docker/data/jellyfin:/config
      - /mnt/xxxxx/Videos:/data/media1:ro
      - /mnt/xxxxx/Videos:/data/media2:ro
    ports:
      - 8096:8096
      - 8920:8920 #optional
      - 7359:7359/udp #optional
    # - 1900:1900/udp #optional
    networks:
      - proxy
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
    restart: unless-stopped
#networks: {}

networks:
  proxy:
    external: true
Something like that. You also don't need all that stuff about reserving the GPU, either. You can remove all of the stuff in deploy.
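Putting that together, the Nvidia-relevant bits can be as small as this (a sketch, not a complete compose file):

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all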

@mooglestiltzkin

@neoKushan

ty Neo. Now I have a better angle to look at this.

I used dockge, entered bash, did the command, but got this:

root@bxxxxxxxxe:/# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
bash: sudo: command not found

Did a google, went to jlmkr shell docker, then docker exec -it bash, which is pretty much the same as the dockge earlier.

Not sure how to sh into the container to be able to run that command.

@neoKushan commented May 7, 2024

No no, the command is to be run in the jail, not within the container. It'll spin up a new container that only displays that output.

So 'jlmkr shell docker' then

'sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi'

@mooglestiltzkin commented May 7, 2024

But I think my jail for docker uses Debian, so isn't ubuntu incorrect? Confused.

I deployed docker using the Debian docker template:
https://github.com/Jip-Hop/jailmaker/tree/main/templates/docker

root@docker:~# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49xxxxx4a: Pull complete 
Digest: sha25xxxxxxxxxxe15
Status: Downloaded newer image for ubuntu:latest

Ya, thought so. That doesn't look right, since I'm not using Ubuntu.

@neoKushan commented May 7, 2024

The container image is based on Ubuntu; that's normal in the docker world. It's fine.

Containers can be based on other operating systems, like Arch, and will happily run on any docker host.

@mooglestiltzkin commented May 7, 2024

Well, after running that it says this:

root@docker:~# sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49xxxxxxa: Pull complete 
Digest: sha256xxxxxxxxxxxxxxxfe15
Status: Downloaded newer image for ubuntu:latest
Tue May  7 xxxxxx 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:12:00.0 Off |                  N/A |
| 35%   36C    P0              N/A /  75W |      0MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@docker:~# 

@neoKushan commented May 7, 2024

Okay good, so update your docker compose file to add runtime: nvidia, do another docker compose up to reload jellyfin and see if that works.

@mooglestiltzkin

> Okay good, so update your docker compose file to add runtime: nvidia, do another docker compose up to reload jellyfin and see if that works.

How do I add runtime: nvidia? Is that under environment? I use dockge, so

runtime=nvidia

?

@mooglestiltzkin

I did a search and found this:

version: '3'
services:
  jellyfin:
    image: jellyfin/jellyfin
    user: 1000:1000
    network_mode: 'host'
    volumes:
      - /path/to/config:/config
      - /path/to/cache:/cache
      - /path/to/media:/media
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

I'll try it:
https://jellyfin.org/docs/general/administration/hardware-acceleration/nvidia/

@neoKushan

Yeah, I did give an example in this post, but you might have missed it.

@mooglestiltzkin

Tried it; still didn't work.

Went to look at other stuff to troubleshoot:
https://www.reddit.com/r/selfhosted/comments/w559xa/jellyfin_nvidia_in_docker_a_guide_for_newbies/

root@truenas[~]# jlmkr shell docker
Connected to machine docker. Press ^] three times within 1s to exit session.
root@docker:~# nvidia-smi --query-gpu=gpu_name,gpu_uuid --format=csv
name, uuid
NVIDIA GeForce GTX 1050, GPU-67xxxxx110
root@docker:~# 

hm...

@dasunsrule32

>> Okay, those are the logs from jellyfin - there should be a separate set of logs for ffmpeg or transcoding - can you find and post those?
>
> omg it worked.
>
> i was testing some stuff
>
> in the docker compose in environments i added NVIDIA_VISIBLE_DEVICES=all
>
> after making that change it worked.
>
> So, the .env in dockge where i had a NVIDIA_VISIBLE_DEVICES=all did absolutely nothing apparently, because i can't explain otherwise why the same value did not kick in.
>
> Albeit i had both an environment in the compose, as well as in the UI for dockge. Not sure the implications of that. I just assumed if u had env top, and env both, they would both be active. Guess not.
>
> Anyway that's what fixed it.
>
> I think the solution overall was a combination of steps you suggested:
>
> sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>
>     environment:
>       - NVIDIA_VISIBLE_DEVICES=all
>
>     runtime: nvidia
>
> hope this helps the next person that comes along and ran into my issue.
>
> ty so much @neoKushan and @dasunsrule32

Nice! I don't know that you need the runtime because it's defined in the docker daemon config, but it can't hurt to have it.

cat /etc/docker/daemon.json 
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Just to be safe, I did add it to my compose.yml, but it didn't affect it either way. The big one is the NVIDIA_* vars to enable the GPU in the container.
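If you'd rather not set runtime: per service, another option (a standard Docker daemon setting, sketched here; not something from this thread's configs) is to make nvidia the default runtime in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}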

@mooglestiltzkin

> Nice! I don't know that you need the runtime because it's defined in the docker daemon config, but it can't hurt to have it.

Mine is the same as yours in that location:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

thx for the update

@Jip-Hop (Owner) commented May 7, 2024

Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

@dalgibbard (Contributor, Author)

> Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

Lol yeah, can we take the unrelated issues offline?
I do plan to make a PR for this but keep forgetting.
I'll stick it in my calendar as a reminder and see if I can come up with something tomorrow.

@dasunsrule32

> Would be great if anyone could focus on getting nvidia passthrough working again without having to do workarounds and provide a PR because this issue is getting too long IMHO. 😅

I have a custom config for Nvidia that I'm testing now. I'll push it soon, with example instructions on how to deploy.

@dasunsrule32

I can confirm that the config is working out of the box with Nvidia. I just created a jail and booted Plex on it and hw transcoding is working. I'll submit it shortly.

@dasunsrule32

This is the PR; I'm going to work on adding documentation now as well.

#163

@Jip-Hop (Owner) commented May 9, 2024

@dasunsrule32 reports not running into the "file too short" error on SCALE 24.04 when using the pre-release version of jailmaker from the develop branch:

#163 (comment)

I don't think I added anything in particular to fix the "file too short" error though...

@dalgibbard (Contributor, Author)

@Jip-Hop I'll get a PR raised in a bit; it's caused by the mounts for Nvidia common, unrelated to the template/init changes etc.

TL;DR: the current method for locating the Nvidia modules to bind mount should use the parent dir of the nvidia-common folder, else it misses some of the libs/modules.

i.e. what this issue was actually about all along, before people went on huge tangents 🤣
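As a rough illustration of the change (a Python sketch, since jlmkr.py is Python; not the actual PR diff):

import os

# Two of the library paths reported by `nvidia-container-cli list`
listed = [
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.545.23.08",
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.545.23.08",
]
# Instead of binding each listed file individually, bind the parent directory
# once, so sibling libs (and their .so.1 symlinks) come along too
parent_dirs = {os.path.dirname(p) for p in listed}
flags = [f"--bind-ro={d}" for d in sorted(parent_dirs)]
print(flags)  # ['--bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current']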

@Jip-Hop (Owner) commented May 9, 2024

> it's caused by the mounts for Nvidia common; unrelated to the template/init changes etc

But isn't it strange the "file too short" error doesn't occur for some users on SCALE 24.04?

@Jip-Hop (Owner) commented May 9, 2024

The old nvidia issue may also provide some helpful insights as to how the current method of passing through came to be. Specifically this comment: #4 (comment).

@dalgibbard (Contributor, Author) commented May 9, 2024

> > it's caused by the mounts for Nvidia common; unrelated to the template/init changes etc
>
> But isn't it strange the "file too short" error doesn't occur for some users on SCALE 24.04?

It'll depend on the application being run, I suspect (my application is CUDA-based, for example, so it needs certain libs that might not be mounted, compared to, say, NVENC), plus any manual mounts they have already added.

I myself can attest to multiple custom bind mounts already :)

@dalgibbard (Contributor, Author)

PR for review: #165

@dalgibbard (Contributor, Author) commented May 9, 2024

> The old nvidia issue may also provide some helpful insights as to how the current method of passing through came to be. Specifically this comment: #4 (comment).

I suspect that this is valid where the Nvidia container runtime is being used, but not when running GPU workloads directly in the jail? That might be the key difference here.

If we're already mounting a few files from that dir anyway, I don't see any harm in extending that to be the directory personally.

FWIW, I'm running both container-based and non-container based workloads with this change, and both are working correctly.

@dalgibbard (Contributor, Author)

Re-raised PR against develop branch: #166

@dasunsrule32 commented May 9, 2024

This is a good approach. I might suggest using awk rather than grep, as it's usually more performant.

@dalgibbard (Contributor, Author)

> This is a good approach. I might suggest using awk rather than grep, as it's usually more performant.

Yeah... I guess. But in this case we're iterating through like 10 lines of output; I figured the grep implementation is more readable than awk here. I could test the perf difference, but given the volume of text, it's very likely to be minimal, if any.

The equivalent awk implementation would look like this, I guess:

command1 | awk 'NR==FNR{a[$0];next} !($0 in a)' - <(command2)

Which is... incomprehensible to most, lol. Unless there's a cleaner/easier-to-read implementation?
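For reference, my reading of that one-liner, annotated (same command, comments added):

# First input ("-", i.e. stdin fed by command1): store each line as a key in array a
# Second input (<(command2)): print only the lines NOT present in a
command1 | awk 'NR==FNR{a[$0];next} !($0 in a)' - <(command2)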

@Jip-Hop (Owner) commented May 10, 2024

I don't have time to look at this properly this week, but I prefer a pure Python implementation. I think we're pretty close to solving this cleanly :)

@dalgibbard (Contributor, Author)

> I don't have time to look at this properly this week but I prefer a pure python implementation. I think we're pretty close to solving this cleanly :)

A "pure Python" implementation is possible (ignoring the subprocess calls, but at least getting rid of the 'shell' requirement) by pre-running the subtractive command, storing its output, and then subtracting it from the existing list.

Since that's the preference, I'll sort this in a bit :)
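A minimal sketch of that shape (illustrative only, not the PR's actual code; the subtractive command here is a placeholder):

import subprocess

def lines(cmd):
    # Capture a command's stdout as a list of non-empty lines
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

# Pre-run the "subtractive" command once and store its output as a set...
already_covered = set(lines(["some-subtractive-command"]))  # placeholder
# ...then filter the main list against it in Python, no shell pipeline needed
wanted = [p for p in lines(["nvidia-container-cli", "list"])
          if p not in already_covered]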

@Jip-Hop (Owner) commented May 10, 2024

Is anyone able to test and confirm that this issue is fixed in: https://github.com/dalgibbard/jailmaker/blob/issue-127-nvidia-passthrough/jlmkr.py? I'd like to test it myself but can't without an nvidia GPU. Would be great to get additional confirmation before creating a new release.

Please create a new jail with nvidia passthrough enabled. Don't manually add any additional mounts.

Thanks again @dalgibbard for reporting this issue and providing a PR.

@neoKushan commented May 10, 2024

So I just gave this a test and hit an error even when starting the jail:

May 10 15:22:52 ming .ExecStartPre[490613]: PRE_START_HOOK
May 10 15:22:52 ming systemd-nspawn[490615]: systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +>
May 10 15:22:52 ming systemd-nspawn[490615]: Detected virtualization systemd-nspawn.
May 10 15:22:52 ming systemd-nspawn[490615]: Detected architecture x86-64.
May 10 15:22:52 ming systemd-nspawn[490615]: Detected first boot.
May 10 15:22:52 ming systemd-nspawn[490615]:
May 10 15:22:52 ming systemd-nspawn[490615]: Welcome to Debian GNU/Linux 12 (bookworm)!
May 10 15:22:52 ming systemd-nspawn[490615]:
May 10 15:22:52 ming systemd-nspawn[490615]: Initializing machine ID from container UUID.
May 10 15:22:53 ming systemd-nspawn[490615]: Failed to create control group inotify object: Too many open files
May 10 15:22:53 ming systemd-nspawn[490615]: Failed to allocate manager object: Too many open files
May 10 15:22:53 ming systemd-nspawn[490615]: [!!!!!!] Failed to allocate manager object.
May 10 15:22:53 ming systemd-nspawn[490615]: Exiting PID 1...
May 10 15:22:53 ming systemd[1]: jlmkr-nvtesting.service: Main process exited, code=exited, status=255/EXCEPTION

I used the v1.4.1 version of the script in the branch you linked and the docker template from the same branch. The only thing I changed in the template was setting gpu_passthrough_nvidia=0 to gpu_passthrough_nvidia=1 and changing systemd_nspawn_user_args=--network-bridge=br1 to systemd_nspawn_user_args=--network-bridge=br0 (as br0 is my bridge).

However, I am not sure if this is related to the nvidia changes, as I tried again with another new jail using the v1.4.1 script, without setting gpu_passthrough_nvidia to 1, and got similar output:

Starting jail test with the following command:

systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-test --working-directory=./jails/test '--description=My nspawn jail test [created with jailmaker]' --property=ExecStartPre=/mnt/gordon/jailmaker/jails/test/.ExecStartPre -- systemd-nspawn --keep-unit --quiet --boot --bind-ro=/sys/module --inaccessible=/sys/module/apparmor --machine=test --directory=rootfs --notify-ready=yes --network-bridge=br0 --resolv-conf=bind-host '--system-call-filter=add_key keyctl bpf'

Job for jlmkr-test.service failed.
See "systemctl status jlmkr-test.service" and "journalctl -xeu jlmkr-test.service" for details.

Failed to start jail test...
In case of a config error, you may fix it with:
jlmkr edit test

root@ming[/mnt/gordon/jailmaker]# journalctl -xeu jlmkr-test.service
May 10 15:28:56 ming .ExecStartPre[509351]: PRE_START_HOOK
May 10 15:28:56 ming systemd-nspawn[509353]: systemd 252.22-1~deb12u1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +>
May 10 15:28:56 ming systemd-nspawn[509353]: Detected virtualization systemd-nspawn.
May 10 15:28:56 ming systemd-nspawn[509353]: Detected architecture x86-64.
May 10 15:28:56 ming systemd-nspawn[509353]: Detected first boot.
May 10 15:28:56 ming systemd-nspawn[509353]:
May 10 15:28:56 ming systemd-nspawn[509353]: Welcome to Debian GNU/Linux 12 (bookworm)!
May 10 15:28:56 ming systemd-nspawn[509353]:
May 10 15:28:56 ming systemd-nspawn[509353]: Initializing machine ID from container UUID.
May 10 15:28:56 ming systemd-nspawn[509353]: Failed to create control group inotify object: Too many open files
May 10 15:28:56 ming systemd-nspawn[509353]: Failed to allocate manager object: Too many open files
May 10 15:28:56 ming systemd-nspawn[509353]: [!!!!!!] Failed to allocate manager object.
May 10 15:28:56 ming systemd-nspawn[509353]: Exiting PID 1...
May 10 15:28:56 ming systemd[1]: jlmkr-test.service: Main process exited, code=exited, status=255/EXCEPTION
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit jlmkr-test.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 255.
May 10 15:28:56 ming systemd[1]: jlmkr-test.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit jlmkr-test.service has entered the 'failed' state with result 'exit-code'.
May 10 15:28:56 ming systemd[1]: Failed to start jlmkr-test.service - My nspawn jail test [created with jailmaker].
░░ Subject: A start job for unit jlmkr-test.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit jlmkr-test.service has finished with a failure.
░░
░░ The job identifier is 1628358 and the job result is failed.

EDIT: I reverted back to v1.2.1 of jlmkr and it created a jail just fine. Let me see where the breaking change came in.

@dasunsrule32 commented May 10, 2024

@neoKushan

Ahhh, that's useful to know! I'll make that tweak and try again.

@neoKushan

Yup, can confirm the v1.4.1 script works a treat!

I installed the jail using the docker template with gpu_passthrough_nvidia=1 and that was all fine.

I then shelled into the jail and ran nvidia-smi which showed my GPU humming away.

I then ran sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi and it also showed my gpu stats, I assume that's about all that's needed to prove it out?

@b-neufeld

> Yup, can confirm the v1.4.1 script works a treat!
>
> I installed the jail using the docker template with gpu_passthrough_nvidia=1 and that was all fine.
>
> I then shelled into the jail and ran nvidia-smi which showed my GPU humming away.
>
> I then ran sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi and it also showed my gpu stats, I assume that's about all that's needed to prove it out?

Confirming all these steps worked for me too!

Jip-Hop pushed a commit that referenced this issue May 11, 2024
* Fix Nvidia Passthrough closing #127
* Mount libraries parent directory
* Use the dynamic library path from the existing code
@Jip-Hop (Owner) commented May 17, 2024

Anyone here able to help debug issue #174?
