Fail to install driver on openshift 4.12 #720

Closed
jf1cloutier opened this issue Dec 22, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@jf1cloutier

What happened:
I followed the instructions on this page to install the operator on OpenShift:
https://docs.nvidia.com/networking/display/kubernetes2310/network+operator

(I used the nvidia-network-operator from the certified community operators, together with the NFD operator.)

I then created a NicClusterPolicy with the following spec to have the OFED driver installed on the worker.

spec:
  ofedDriver:
    image: mofed
    livenessProbe:
      initialDelaySeconds: 3000
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 1000
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 1000
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ''
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      waitForCompletion:
        timeoutSeconds: 0
    version: 23.07-0.5.0.0

What you expected to happen:
I expected a new pod to start on the worker that has a Mellanox card and to compile the driver for the kernel used on CoreOS. The pod does start on the appropriate worker, but the driver fails to compile, with these logs from the pod:

Unsetting driver ready state
No OFED driver found for kernel 4.18.0-372.41.1.rt7.198.el8_6.x86_64
Enabling RHOCP and EUS RPM repos...
ID="rhcos"
VERSION_ID="4.12"
RHEL_VERSION="8.6"
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda                                            7.8 kB/s | 3.5 kB     00:00
cuda                                            2.6 MB/s | 2.9 MB     00:01
Red Hat OpenShift Container Platform 4.12 for R 9.5 MB/s |  31 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre  12 MB/s |  47 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - BaseOS   14 MB/s |  53 MB     00:03
Red Hat Universal Base Image 8 (RPMs) - BaseOS  6.3 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS  466 kB/s | 717 kB     00:01
Red Hat Universal Base Image 8 (RPMs) - AppStre 9.1 kB/s | 4.2 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 1.6 MB/s | 3.0 MB     00:01
Red Hat Universal Base Image 8 (RPMs) - CodeRea 8.3 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea  74 kB/s | 102 kB     00:01
Metadata cache created.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda                                             11 kB/s | 3.5 kB     00:00
Red Hat OpenShift Container Platform 4.12 for R 6.4 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS   19 MB/s |  64 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre 7.7 kB/s | 4.5 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS  8.1 kB/s | 4.1 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS  7.0 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 6.7 kB/s | 4.2 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea 4.9 kB/s | 3.8 kB     00:00
Metadata cache created.
Installing dependencies
Error: Unable to find a match: kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64

Command "dnf -q -y --releasever=8.6 install kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64" failed with exit code: 1
Terminate event caught
Terminating container
Unsetting driver ready state
Deleting udev rules
rm: cannot remove '/host/etc/udev/rules.d/82-net-setup-link.rules': No such file or directory
rm: cannot remove '/host/etc/udev/mlnx_bf_udev': No such file or directory
rm: cannot remove '/host/etc/infiniband/vf-net-link-name.sh': No such file or directory
Unmounting Mellanox OFED driver rootfs

How to reproduce it (as minimally and precisely as possible):
Try to install the driver on OpenShift 4.12.2. I also tried updating to OpenShift 4.12.45 and hit the same issue (with a different kernel in use).

Anything else we need to know?:
I ran a manual dnf search for the kernel package it tries to install, and the search fails. DNF seems to have kernel patch (kpatch) packages available for a few kernels, but not the base kernel package for this version:
kpatch-patch-4_18_0-372_13_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.13.1.el8_6.x86_64
kpatch-patch-4_18_0-372_16_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.16.1.el8_6.x86_64
kpatch-patch-4_18_0-372_19_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.19.1.el8_6.x86_64
kpatch-patch-4_18_0-372_26_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.26.1.el8_6.x86_64
kpatch-patch-4_18_0-372_32_1.x86_64 : Initial empty kpatch-patch for kernel-4.18.0-372.32.1.el8_6.x86_64
kpatch-patch-4_18_0-372_9_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.9.1.el8.x86_64

I tried different entitlements and always got the same result. The last attempt used a "Red Hat Developer Subscription for Individuals" entitlement on the cluster.
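
A possibly relevant detail: the kernel release in the log contains an `.rt7.` component, which marks a real-time build, while the failing command asks dnf for a package named `kernel-<release>`. On RHEL, real-time kernels are normally packaged under the `kernel-rt` name, so a plain `kernel-...` lookup would not be expected to match. The sketch below is only an illustration of that naming difference. The helper `kernel_package_for` is hypothetical and is not part of the driver container's scripts.

```python
import re


def kernel_package_for(release: str) -> str:
    """Guess the RPM name for a kernel release string as printed by `uname -r`.

    Assumption: real-time builds carry an ".rtN." component in the release
    and are packaged as kernel-rt rather than kernel.
    """
    is_realtime = re.search(r"\.rt\d+\.", release) is not None
    prefix = "kernel-rt" if is_realtime else "kernel"
    return f"{prefix}-{release}"


# Release string taken from the pod log in this report.
print(kernel_package_for("4.18.0-372.41.1.rt7.198.el8_6.x86_64"))
```

Under that assumption, the name to search for would start with `kernel-rt-`, and the package may live in a repository that the enabled subscriptions do not cover.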

Logs:

  • NicClusterPolicy CR spec and state:
  k get NicClusterPolicy nic-cluster-policy -o yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  creationTimestamp: "2023-12-21T19:33:13Z"
  generation: 1
  name: nic-cluster-policy
  resourceVersion: "280143"
  uid: 7d28443a-0a48-486f-92ef-7509c3678842
spec:
  ofedDriver:
    image: mofed
    livenessProbe:
      initialDelaySeconds: 3000
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 1000
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 1000
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ""
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      waitForCompletion:
        timeoutSeconds: 0
    version: 23.07-0.5.0.0
status:
  appliedStates:
  - name: state-pod-security-policy
    state: ignore
  - name: state-multus-cni
    state: ignore
  - name: state-container-networking-plugins
    state: ignore
  - name: state-ipoib-cni
    state: ignore
  - name: state-whereabouts-cni
    state: ignore
  - name: state-OFED
    state: notReady
  - name: state-SRIOV-device-plugin
    state: ignore
  - name: state-RDMA-device-plugin
    state: ignore
  - name: state-ib-kubernetes
    state: ignore
  - name: state-nv-ipam-cni
    state: ignore
  state: notReady
  • Output of: kubectl -n nvidia-network-operator get -A:
kubectl -n nvidia-network-operator get -A
You must specify the type of resource to get. Use "kubectl api-resources" for a complete list of supported resources.

error: Required resource not specified.
Use "kubectl explain <resource>" for a detailed description of that resource (e.g. kubectl explain pods).
See 'kubectl get -h' for help and examples
  • Network Operator version: 23.7.0
  • Logs of Network Operator controller:
  • Logs of the various Pods in nvidia-network-operator namespace: Provided above
  • Helm Configuration (if applicable): N/A
  • Kubernetes' nodes information (labels, annotations and status): kubectl get node -o yaml: I think it is not pertinent for this issue

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4+a34b9e9", GitCommit:"b6d1f054747e9886f61dd85316deac3415e2726f", GitTreeState:"clean", BuildDate:"2023-01-10T15:55:28Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

  • Hardware configuration:
    4b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
    4b:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
    firmware-version: 22.32.2004 (MT_0000000437)

  • OS (e.g: cat /etc/os-release):
    cat /etc/os-release
    NAME="Red Hat Enterprise Linux CoreOS"
    ID="rhcos"
    ID_LIKE="rhel fedora"
    VERSION="412.86.202301311551-0"
    VERSION_ID="4.12"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202301311551-0 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
    HOME_URL="https://www.redhat.com/"
    DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
    BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
    REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
    REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
    REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
    REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
    OPENSHIFT_VERSION="4.12"
    RHEL_VERSION="8.6"
    OSTREE_VERSION="412.86.202301311551-0"

  • Kernel (e.g. uname -a): Linux ocp1-worker3 4.18.0-372.41.1.rt7.198.el8_6.x86_64 #1 SMP PREEMPT_RT Fri Jan 6 15:08:19 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

@jf1cloutier added the bug label on Dec 22, 2023
@rollandf
Member

Please use a more recent version. MOFED compilation has moved to using DTK, so no entitlement is needed.
