What you expected to happen:
I expected a new pod to start on the worker that has a Mellanox card and to compile the driver for the kernel used on CoreOS. The pod does start on the appropriate worker, but the driver fails to compile, with these logs from the pod:
Unsetting driver ready state
No OFED driver found for kernel 4.18.0-372.41.1.rt7.198.el8_6.x86_64
Enabling RHOCP and EUS RPM repos...
ID="rhcos"
VERSION_ID="4.12"
RHEL_VERSION="8.6"
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda 7.8 kB/s | 3.5 kB 00:00
cuda 2.6 MB/s | 2.9 MB 00:01
Red Hat OpenShift Container Platform 4.12 for R 9.5 MB/s | 31 MB 00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre 12 MB/s | 47 MB 00:03
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 14 MB/s | 53 MB 00:03
Red Hat Universal Base Image 8 (RPMs) - BaseOS 6.3 kB/s | 3.8 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS 466 kB/s | 717 kB 00:01
Red Hat Universal Base Image 8 (RPMs) - AppStre 9.1 kB/s | 4.2 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 1.6 MB/s | 3.0 MB 00:01
Red Hat Universal Base Image 8 (RPMs) - CodeRea 8.3 kB/s | 3.8 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea 74 kB/s | 102 kB 00:01
Metadata cache created.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda 11 kB/s | 3.5 kB 00:00
Red Hat OpenShift Container Platform 4.12 for R 6.4 kB/s | 4.1 kB 00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 19 MB/s | 64 MB 00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre 7.7 kB/s | 4.5 kB 00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 8.1 kB/s | 4.1 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS 7.0 kB/s | 3.8 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 6.7 kB/s | 4.2 kB 00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea 4.9 kB/s | 3.8 kB 00:00
Metadata cache created.
Installing dependencies
Error: Unable to find a match: kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64
Command "dnf -q -y --releasever=8.6 install kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64" failed with exit code: 1
Terminate event caught
Terminating container
Unsetting driver ready state
Deleting udev rules
rm: cannot remove '/host/etc/udev/rules.d/82-net-setup-link.rules': No such file or directory
rm: cannot remove '/host/etc/udev/mlnx_bf_udev': No such file or directory
rm: cannot remove '/host/etc/infiniband/vf-net-link-name.sh': No such file or directory
Unmounting Mellanox OFED driver rootfs
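To pull the actual failure out of a long pod log like the one above, a small filter can help. This is only a sketch: it assumes the `Command "..." failed with exit code: N` line format seen in this log, and the function name `find_failed_commands` is made up for illustration.

```python
import re

# Excerpt copied from the pod log above.
LOG = '''\
Installing dependencies
Error: Unable to find a match: kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64
Command "dnf -q -y --releasever=8.6 install kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64" failed with exit code: 1
Terminate event caught
'''

# Assumption: failed commands are always reported in this exact shape.
FAILED_CMD = re.compile(r'^Command "(?P<cmd>.+)" failed with exit code: (?P<code>\d+)$')


def find_failed_commands(log_text):
    """Return (command, exit_code) pairs for every failed-command line."""
    failures = []
    for line in log_text.splitlines():
        match = FAILED_CMD.match(line.strip())
        if match:
            failures.append((match.group("cmd"), int(match.group("code"))))
    return failures


failures = find_failed_commands(LOG)
for cmd, code in failures:
    print(f"exit {code}: {cmd}")
```

In this log the only failing step is the `dnf install` of the kernel package; the later `rm: cannot remove` lines come from cleanup after the container was already terminating.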
How to reproduce it (as minimally and precisely as possible):
Try to install the driver on OpenShift 4.12.2. I also tried updating to OpenShift 4.12.45 and hit the same issue (with a different kernel in use).
Anything else we need to know?:
I ran a manual dnf search for the kernel package it tries to install, and it finds nothing. DNF seems to have kpatch-patch packages available for a few kernels, but not the base kernel package for this version:
kpatch-patch-4_18_0-372_13_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.13.1.el8_6.x86_64
kpatch-patch-4_18_0-372_16_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.16.1.el8_6.x86_64
kpatch-patch-4_18_0-372_19_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.19.1.el8_6.x86_64
kpatch-patch-4_18_0-372_26_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.26.1.el8_6.x86_64
kpatch-patch-4_18_0-372_32_1.x86_64 : Initial empty kpatch-patch for kernel-4.18.0-372.32.1.el8_6.x86_64
kpatch-patch-4_18_0-372_9_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.9.1.el8.x86_64
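One detail that may matter here: the release string the container asks for contains `.rt7.198`, and the `uname -a` output under Environment reports `PREEMPT_RT`, while every kernel in the kpatch listing is a non-RT build. The sketch below only classifies a release string and shows which package name might be worth searching for. It assumes the `.rtN.M.` marker identifies a real-time kernel and that RHEL ships such kernels under a `kernel-rt` package name, possibly from a separate repository. Both are assumptions, not something this log confirms, and the helper names are hypothetical.

```python
import re

# Assumption: a ".rt<N>.<M>." component marks a real-time kernel build.
RT_MARKER = re.compile(r"\.rt\d+\.\d+\.")


def is_rt_kernel(release):
    """True if the kernel release string carries the real-time marker."""
    return RT_MARKER.search(release) is not None


def candidate_package(release):
    """Package name to search for; 'kernel-rt' is an assumed naming convention."""
    prefix = "kernel-rt" if is_rt_kernel(release) else "kernel"
    return f"{prefix}-{release}"


node_kernel = "4.18.0-372.41.1.rt7.198.el8_6.x86_64"   # from the pod log
listed_kernel = "4.18.0-372.32.1.el8_6.x86_64"         # from the kpatch listing

for release in (node_kernel, listed_kernel):
    print(release, "rt" if is_rt_kernel(release) else "non-rt", candidate_package(release))
```

If the assumption holds, searching for the `kernel-rt` name instead of `kernel-<release>` would show whether the enabled repositories carry the real-time kernel at all.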
I tried different entitlements and always got the same result. The last attempt used a "Red Hat Developer Subscription for Individuals" entitlement on the cluster.
Output of: kubectl -n nvidia-network-operator get -A:
kubectl -n nvidia-network-operator get -A
You must specify the type of resource to get. Use "kubectl api-resources" for a complete list of supported resources.
error: Required resource not specified.
Use "kubectl explain <resource>" for a detailed description of that resource (e.g. kubectl explain pods).
See 'kubectl get -h' for help and examples
Network Operator version: 23.7.0
Logs of Network Operator controller:
Logs of the various Pods in nvidia-network-operator namespace: Provided above
Helm Configuration (if applicable): N/A
Kubernetes' nodes information (labels, annotations and status): kubectl get node -o yaml: I think it is not pertinent for this issue
Environment:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4+a34b9e9", GitCommit:"b6d1f054747e9886f61dd85316deac3415e2726f", GitTreeState:"clean", BuildDate:"2023-01-10T15:55:28Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
What happened:
Followed the instructions on this page to install the operator on OpenShift:
https://docs.nvidia.com/networking/display/kubernetes2310/network+operator
(Used the nvidia-network-operator from the certified community operators, together with the NFD operator.)
I then created a NicClusterPolicy with the following spec to have the OFED driver installed on the worker.
Environment:
Hardware configuration:
4b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
4b:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
firmware-version: 22.32.2004 (MT_0000000437)
OS (e.g. cat /etc/os-release):
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202301311551-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202301311551-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202301311551-0"
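The pod log echoes `ID`, `VERSION_ID` and `RHEL_VERSION` from this file, and the failing command uses `--releasever=8.6`, so the driver container appears to derive its release version from these fields. As a rough illustration of reading them, here is a minimal parser. It is a sketch that assumes simple `KEY="value"` lines, and `parse_os_release` is a hypothetical helper, not part of the operator.

```python
import shlex

# A few fields copied from the os-release output above.
OS_RELEASE = '''\
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION_ID="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
'''


def parse_os_release(text):
    """Parse KEY=value lines into a dict, stripping shell-style quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # handles both quoted and bare values
        fields[key] = parts[0] if parts else ""
    return fields


fields = parse_os_release(OS_RELEASE)
print(f"distro={fields['ID']} openshift={fields['OPENSHIFT_VERSION']} rhel={fields['RHEL_VERSION']}")
```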
Kernel (e.g. uname -a): Linux ocp1-worker3 4.18.0-372.41.1.rt7.198.el8_6.x86_64 #1 SMP PREEMPT_RT Fri Jan 6 15:08:19 EST 2023 x86_64 x86_64 x86_64 GNU/Linux