
Troubleshooting

Your AMD sales representative and support team are available to assist with troubleshooting KubernetesOnload in your cluster. Please see release notes.

FAQ

How can I confirm Onload Operator has deployed successfully?

The Onload Operator commands will show the expected output: its pod is running with logs clear of errors, and the Onload CRD is present.
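
As a minimal sketch of that check, assuming the default namespace (see Diagnostic commands below for the full set):

# Expect a 'Running' controller-manager pod and the Onload CRD to be present
kubectl get pods -n onload-operator-system
kubectl get crd onloads.onload.amd.com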

Why is Onload Operator failing?

Inspect the Onload Operator resources to determine whether:

  • the allocated namespace (onload-operator-system by default) or the selected nodes have insufficient resource quotas, insufficient storage, or incompatible architectures, or
  • your current user context is insufficiently privileged for creating the cluster-level administrative resources, or
  • the onload-operator image is not accessible from its registry or is stale, or
  • the deployment's DEVICE_PLUGIN_IMG environment variable is not set, or
  • the deployment, with its Controller Manager container, was applied but the Onload CRD object was not, or vice versa.
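
To check the DEVICE_PLUGIN_IMG point above, one option is to read it straight off the Deployment; a sketch assuming the default namespace and deployment name:

# Expect DEVICE_PLUGIN_IMG to be set and to reference an accessible image
kubectl get deployment onload-operator-controller-manager -n onload-operator-system \
  -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="DEVICE_PLUGIN_IMG")].value}'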

How can I confirm my Onload CR has deployed successfully?

  • Onload Operator will be deployed (see above).
  • KMM Operator will be a compatible version and operating.
  • Onload component resources, including the Onload CR itself, will be applied without validation errors and be present in the cluster with their logs clear of errors.
  • Your workloads will be accelerated when they request Onload. See sfnettest example.

Why is Onload Device Plugin failing?

For example, the pod has a status of CrashLoopBackOff.

Use the Onload component commands to determine whether:

  • onload-device-plugin, onload-worker, or onload-user images failed to pull from registries, or
  • the allocated namespace (by default, the namespace of your current context):
    • has a security policy which disallows privileged pods/roles, or
    • has insufficient resource quotas, or
  • your current user context is insufficiently privileged for creating privileged resources, or
  • the init container failed to copy Onload files to a node's local storage because either:
    • sufficient writable free space is not available on node filesystem mounts, or
    • kernel security (e.g. SELinux) lacks a compatible configuration for the mounts, or
    • one or more of the host mount paths is incompatible with the underlying host filesystem layout, or
  • Onload Worker failed to start an environment for the Onload Control Plane because:
    • /dev/onload was not found because the node host is missing:
      • onload kernel module, or
      • out-of-tree sfc kernel module, or
    • its container ID could not be found or parsed due to an incompatible CRI, or
  • Onload Device Plugin's AMD Solarflare hardware discovery failed due to incompatible or missing hardware, or another kernel sysfs enumeration issue, or
  • another more general issue, such as namespace security policies preventing privileged containers.
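
For the /dev/onload and kernel module points above, a quick node-level sketch (using oc debug; substitute your node name, and see Node host below for a Kubernetes variant):

node=compute-0

# Expect /dev/onload to exist and both modules to be loaded on the node host
oc debug -q node/$node -- chroot /host ls -l /dev/onload
oc debug -q node/$node -- chroot /host lsmod | grep -e onload -e sfc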

Why are kernel modules not loading on the nodes / Why is KMM failing?

Determine whether:

  • the Onload components' Module CRs (named *-module) have not been autogenerated due to:
    • an invalid Onload CR, or
    • an Onload Operator reconciler error which will be present in its log, or
  • KMM Operator is not a compatible version or is not operating, or
  • invalid values in the Onload CR have been passed through to KMM, such as:
    • spec.onload.kernelMapping.regexp which does not compile or does not match any nodes' kernel versions, or
    • spec.onload.kernelMapping.kernelModuleImage is not fully configured, such as:
      • missing the literal ${KERNEL_FULL_VERSION} KMM template variable due to being inadvertently pre-rendered by other configuration management, or
      • referring to a registry that is not configured or operational, or
    • spec.onload.kernelMapping.build.buildArgs.ONLOAD_SOURCE is missing or mismatched with the onload-user image, or .ONLOAD_BUILD_PARAMS is invalid or likewise mismatched with the build of Onload in the onload-user image, either of which causes KMM's in-cluster build to fail.
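
To check the kernelMapping points above, compare the regexp in the Onload CR against the nodes' running kernel versions; a sketch assuming a single Onload CR named onload:

# Expect every targeted node's kernel version to match spec.onload.kernelMapping.regexp
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
kubectl get onload onload -o yaml | grep -A6 kernelMapping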

Why is my workload not accelerating or starting?

Use the accelerated pod commands to determine whether:

  • the resource amd.com/onload has not been requested, or
  • the LD_PRELOAD environment variable is not set when expected, and/or
  • the /bin/onload mount has been disabled, or
  • an AMD Solarflare hardware network interface provided by Multus:
    • is misconfigured, or
    • references outdated kernel networking state due to differing behaviour in out-of-tree kernel modules, or
  • the LIBC version of provided onload binary is incompatible with the container's base image.

See sfnettest example.
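
For the LIBC compatibility point above, a quick sketch is to compare the container's C library version against the requirements of the provided onload build (assuming the sfnettest client pod and that ldd is present in its image):

# Expect a glibc version compatible with the provided onload binaries
kubectl exec onload-sfnettest-client -- ldd --version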

Why is my accelerated workload performing differently than on bare metal?

Refer to the Onload User Guide to compare all standard tunables. Many configurations, such as kernel boot parameters, NUMA node selection, and Onload profiles, are specified in slightly different ways.

This is likely caused by a variable other than KubernetesOnload as:

  • Onload Operator does not introduce any additional layers into Onload's bypass stack; it is just orchestration.
  • The version of Onload compiled in the onload-user images and the kernel modules compiled in-cluster or out of cluster from the onload-source image match the standard Linux packages and are not customised variants.
  • The kernel module files may be encapsulated in container images but are loaded directly and natively into the kernel.
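
To compare host-level tuning with your bare-metal reference, the node host commands below can be reused; a sketch using the debug_node helper defined under Node host, with a placeholder interface name:

debug_node="oc debug node/compute-0 -q"

# Compare kernel boot parameters (e.g. isolcpus, hugepages) with the bare-metal host
$debug_node -- chroot /host cat /proc/cmdline

# Compare the NUMA node of the Solarflare interface ('eth0' is a placeholder)
$debug_node -- chroot /host cat /sys/class/net/eth0/device/numa_node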

Please contact your AMD sales representative and support team to discuss optimisation of your workloads. Please see release notes.

Why is the NFD label selector missing?

If using NFD selectors, has your NodeFeatureDiscovery CR been configured with this PCIe feature?
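
A quick check, hedged on your NFD version's exact field layout, is to look for Solarflare's PCI vendor ID in the CR (see NFD selectors below):

# Expect vendor ID 1924 in the worker config's PCI feature source
kubectl get NodeFeatureDiscovery -n openshift-nfd -o yaml | grep -B4 -A4 '1924'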

How can I reinstall everything?

Uninstallation reverses the deployment procedures outlined in the README: replace the kubectl apply --validate=true commands with kubectl delete.

You may wish to run all deletion commands at once, or step backwards through them sequentially to check for errors.

  1. Delete your accelerated pods
  2. Delete any Onload profiles (e.g. latency ConfigMap)
  3. Delete Onload CR. Confirm Onload component commands return empty.
  4. Delete Onload Operator. Confirm Onload Operator commands return empty.
  5. Delete KMM & NFD Operators, either via OLM (OperatorHub) or their Kustomize command lines.
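
A sketch of step 3, assuming a single Onload CR named onload and the labels used throughout this guide:

# Delete the Onload CR, then confirm its autogenerated dependents are gone
kubectl delete onload onload
kubectl get ds,pod,module -l app.kubernetes.io/managed-by=onload-operator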

Diagnostic commands

Onload Operator

# Expect 'Running' Deployment of `onload-operator-controller-manager`
kubectl get all -n onload-operator-system

# Check conditions are true, replica is available, namespace is as expected.
# On error, check locations of manager image & `DEVICE_PLUGIN_IMG` are accessible
kubectl describe -n onload-operator-system deployment/onload-operator-controller-manager

# Expect to be present
kubectl get crd onloads.onload.amd.com

kubectl logs -n onload-operator-system deployment/onload-operator-controller-manager --tail 200 -f

Onload components

All resources in all namespaces

kubectl get onload,ds,pod,module,cm,crd,sa,role,rolebinding --all-namespaces -l app.kubernetes.io/part-of=onload

Onload CR and an overview of its autogenerated dependents

# Expect your Onload CR(s)
kubectl get onload

# Expect Onload Device Plugin DaemonSet, 'onload-module' Module, and optionally 'onload-sfcmod' Module
kubectl get ds,pod,module -l app.kubernetes.io/managed-by=onload-operator

# Expect Status of Module Loader to be Available
kubectl describe module onload-module

# Optional
kubectl describe module onload-sfcmod

Onload Device Plugin containers

kubectl describe pod -l app.kubernetes.io/component=device-plugin

# Expect files copied by init container
kubectl logs -l app.kubernetes.io/component=device-plugin -c init -f

# Expect Onload Worker (containing Onload Control Plane) to have loaded
kubectl logs -l app.kubernetes.io/component=device-plugin -c onload-worker --tail 200 -f

# Expect listing of provided mounts, discovered interfaces, and RPC server up
kubectl logs -l app.kubernetes.io/component=device-plugin -c device-plugin -f

Registries

Are the images accessible to the cluster? Either by allowing access to the internet sources (Docker Hub) or by providing and configuring container images in an internal registry.

If using in-cluster builds, has a container image registry with write access been configured within your cluster?

Have the registries been configured in the cluster, Service Account, and/or Onload CR as required, for example as insecure, internally-signed, or with a pull secret? In OpenShift, this is configured in the image.config.openshift.io/cluster CR.

Are the images stale? This is of particular concern if images have been cached locally using tags that do not uniquely identify the image, including any particular build parameters if they are built in-house. Related to this is the imagePullPolicy in the Onload CR.
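
To spot pull problems quickly, check the Onload component pods and recent failure events; a sketch using the label selector from the diagnostic commands above:

# Expect no ImagePullBackOff or ErrImagePull statuses among Onload components
kubectl get pods --all-namespaces -l app.kubernetes.io/part-of=onload

# Recent pull failures, if any, name the registry and tag that were attempted
kubectl get events --all-namespaces --field-selector reason=Failed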

OpenShift Image Registry

KMM and Onload Operator will work with any writable registry; however, this repo's OCP-specific Onload CR samples utilise the OpenShift Image Registry by default to store in-cluster build images.

# Expect configured, eg. spec.managementState, spec.defaultRoute, and spec.storage
oc get configs.imageregistry.operator.openshift.io/cluster -o yaml

# Expect `onload-module` when using in-cluster builds for kernel modules
oc get imagestreams -n onload-clusterlocal

# Expect list of KMM-built or pre-built images
oc describe imagestream onload-module -n onload-clusterlocal

KMM Operator version

# When deployed via OLM (OperatorHub), version:
kubectl get csv -n openshift-kmm

# else, version in labels and image tags:
kubectl describe -n openshift-kmm deployment/kmm-operator-controller-manager

KMM Operator operation

While Onload Operator configures KMM, it does not report on, or otherwise repeat, logs and statuses produced by KMM. The following commands are a summary of KMM 1.1 troubleshooting commands:

# Expect 'Running' pods and no messages about in-cluster build failures
onload_cr_name=onload
kubectl describe ds -l kmm.node.kubernetes.io/module.name=${onload_cr_name}-module
kubectl describe ds -l kmm.node.kubernetes.io/module.name=${onload_cr_name}-sfcmod
kubectl logs -l kmm.node.kubernetes.io/module.name=${onload_cr_name}-module
kubectl logs -l kmm.node.kubernetes.io/module.name=${onload_cr_name}-sfcmod

# Expect reconciler loop to have succeeded
kubectl logs -n openshift-kmm deployment/kmm-operator-controller-manager --tail 200 -f

Refer to KMM documentation for further expected deployment state details and OpenShift-specific troubleshooting advice.

Day 0/1 MachineConfig

  • Has the container image been stored locally?
  • Are the systemd services healthy?
  • Is the module loaded?
node=compute-0

oc get mc -l app.kubernetes.io/part-of=onload

# Expect two healthy services
oc debug -q node/$node -- chroot /host systemctl status '*sfc*'

# Expect the onload-module image in local image store
oc debug -q node/$node -- chroot /host podman image ls -a | grep onload

# Expect the module is loaded without fatal messages
oc debug -q node/$node -- lsmod | grep sfc
oc debug -q node/$node -- dmesg -xT | grep -i sfc

NFD selectors

# Expect PCI 1924 snippet to be present
kubectl get NodeFeatureDiscovery -n openshift-nfd -o yaml

# Expect all nodes with sfc cards
kubectl get nodes -l feature.node.kubernetes.io/pci-1924.present

Node host

# Set the node name and choose OpenShift or Kubernetes (their debug implementations differ)
debug_node="oc debug node/compute-0 -q"
#debug_node="kubectl debug node/compute-0 --image=busybox -q"

# Expect storage space available
$debug_node -- chroot /host df -h /opt

# Expect matching Onload version number
$debug_node ONLOAD_PRELOAD=/opt/onload/usr/lib64/libonload.so -- chroot /host /opt/onload/usr/bin/onload --version

# Expect both `sfc` and `onload` kernel modules loaded
$debug_node -- chroot /host lsmod | grep -e sfc -e onload

# Expect clean logs for both modules after boot time
$debug_node -- dmesg -xT | grep -ie sfc -e onload

$debug_node -- chroot /host /opt/onload/usr/bin/onload_stackdump lots # or `filters` or `filter_table`
$debug_node -- chroot /host /opt/onload/usr/bin/onload_mibdump llap

Accelerated pods

pod=onload-sfnettest-client

# Expect `amd.com/onload`
kubectl get pod $pod -o jsonpath='{..resources.limits}'

# Expect path to `libonload.so` unless `setPreload` in Onload CR is disabled
kubectl exec $pod -- sh -c 'echo $LD_PRELOAD'

# Expect `onload` binary and/or `libonload.so` libraries. Adjust if different mount paths are set in the Onload CR
kubectl exec $pod -- ls -R /opt/onload

# Expect Onload version
kubectl exec $pod -- /opt/onload/usr/lib64/libonload.so

# Expect a net1 interface for Multus networks
kubectl get events --field-selector reason=AddedInterface

# Expect `NetworkAttachmentDefinition` referencing accelerable hardware
nad=ipvlan-bond0
kubectl get net-attach-def $nad -o yaml

Further details on individual applications are available from the Onload tools on the node host.

Modules and devices already loaded or mounted

Remove any old deployments, including:

  • Onload Operator deployments which were not cleanly removed, or
  • alternatives to KubernetesOnload such as direct filesystem install.

A crude method, to be performed on all nodes, that does not handle management components, would be:

debug_node="oc debug node/compute-0 -q"
#debug_node="kubectl debug node/compute-0 --image=busybox -q"

$debug_node -- chroot /host rmmod -v onload sfc_char sfc_resource sfc sfc_driverlink mtd vdpa

Caution

Removing the sfc kernel module will remove any network interfaces it provides, which may include connections the node is configured to depend on.

Footnotes

SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: (c) Copyright 2024 Advanced Micro Devices, Inc.