test flake: spontaneous node reboot #1055

pohly · 2021-12-07T08:14:49Z

A worker node spontaneously rebooted, causing container restarts and thus test failures.

Seen in https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/artifact/joblog-jenkins-pmem-csi-PR-1054-4-test-1.19.log:

Dec  7 00:33:17.367: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
[AfterEach] direct-production
  /mnt/workspace/pmem-csi_PR-1054/test/e2e/deploy/deploy.go:1112
STEP: checking for test "direct-production Deployment Kata Containers [Testpattern: CSI Ephemeral-volume (ext4)] dax should support MAP_SYNC" in namespace default, test success
pmem-csi-intel-com-controller-d875b774-r6shd/[email protected]: ==== end of pod log ====
WARNING: pod log: pmem-csi-intel-com-controller-d875b774-r6shd/pmem-driver: Get "https://172.17.0.5:10250/containerLogs/default/pmem-csi-intel-com-controller-d875b774-r6shd/pmem-driver?follow=true": dial tcp 172.17.0.5:10250: connect: connection refused
...
Dec  7 00:34:37.493: INFO: Done with waiting, PMEM-CSI driver v1.0.0-48-g858d2ca0 is ready.
Dec  7 00:34:37.514: FAIL: container "pmem-driver" in pod "pmem-csi-intel-com-controller-d875b774-r6shd" restarted 1 times, last state: {Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2021-12-06 23:23:32 +0000 UTC,FinishedAt:2021-12-07 00:33:58 +0000 UTC,ContainerID:containerd://c103cca52585e83c30b0afa64ce57b8048fb90998e36bf7a96bafeafaec4ecb3,}}

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/artifact/joblog-jenkins-pmem-csi-PR-1054-4-kubeletlogs-1.19.log:

Dec 07 00:30:52 pmem-csi-govm-worker1 kubelet[855]: E1207 00:30:52.749878     855 upgradeaware.go:387] Error proxying data from backend to client: tls: use of closed connection
-- Boot 2a48549ebe844612bb074c64784b43f9 --
Dec 07 00:33:59 pmem-csi-govm-worker1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Dec 07 00:34:01 pmem-csi-govm-worker1 kubelet[636]: I1207 00:34:01.312285     636 server.go:411] Version: v1.19.11
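
The `-- Boot ... --` line in the kubelet log is what gives the reboot away: `journalctl` prints it when its output crosses a boot boundary. A minimal sketch for scanning a saved journal for such markers could look like the following. The function name `find_reboots` is made up for illustration and is not part of the PMEM-CSI test suite, and the marker format is assumed to match the one in the excerpt above.

```python
import re

# Assumption: boot boundaries look like the marker in the excerpt above,
# i.e. "-- Boot <32 hex digits> --" on a line of its own.
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]{32}) --$")


def find_reboots(lines):
    """Return (boot_id, last line before marker, first line after marker) per marker."""
    lines = [line.rstrip("\n") for line in lines]
    reboots = []
    for i, line in enumerate(lines):
        match = BOOT_MARKER.match(line.strip())
        if match:
            before = lines[i - 1] if i > 0 else None
            after = lines[i + 1] if i + 1 < len(lines) else None
            reboots.append((match.group(1), before, after))
    return reboots


if __name__ == "__main__":
    sample = [
        "Dec 07 00:30:52 pmem-csi-govm-worker1 kubelet[855]: E1207 00:30:52.749878     855 upgradeaware.go:387] Error proxying data from backend to client: tls: use of closed connection",
        "-- Boot 2a48549ebe844612bb074c64784b43f9 --",
        "Dec 07 00:33:59 pmem-csi-govm-worker1 systemd[1]: Started kubelet: The Kubernetes Node Agent.",
    ]
    for boot_id, before, after in find_reboots(sample):
        print("boot", boot_id)
        print("  last entry before:", before)
        print("  first entry after: ", after)
```

The gap between the last entry before the marker and the first entry after it brackets the time of the reboot, which can then be compared with the `FinishedAt` timestamp of the terminated container.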

Putting the Kubernetes version nearer the start of the file names ensures that files for the same test case are grouped together when sorting. Sorting by the type of content is less useful because usually one wants to investigate a specific test failure which occurred under a specific Kubernetes version. Splitting up log messages by node makes the individual files smaller and simplifies debugging of a problem that occurred on a specific node. To debug the spontaneous node reboot (intel#1055), the full systemd journal is needed.
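
As a rough illustration of the naming and splitting idea, here is a sketch under stated assumptions. The name layout `joblog-<build>-<version>-<kind>[-<node>].log` and the helpers `log_name` and `split_by_node` are hypothetical and not taken from the actual change. The splitter also assumes that the input is a concatenation of per-node journals in the short `journalctl` format, where the host name is the fourth field.

```python
from collections import defaultdict


def log_name(build, k8s_version, kind, node=None):
    """Build a file name with the Kubernetes version ahead of the content type.

    Hypothetical layout: joblog-<build>-<version>-<kind>[-<node>].log
    """
    parts = ["joblog", build, k8s_version, kind]
    if node:
        parts.append(node)
    return "-".join(parts) + ".log"


def split_by_node(journal_lines):
    """Group journal lines by host name (fourth whitespace-separated field).

    Marker lines such as "-- Boot <id> --" carry no host name; they are kept
    with the node of the preceding entry so that reboots remain visible.
    """
    per_node = defaultdict(list)
    current = None
    for line in journal_lines:
        if line.startswith("-- "):
            if current is not None:
                per_node[current].append(line)
            continue
        fields = line.split(maxsplit=4)
        if len(fields) < 5:
            continue  # not a regular journal entry, skip it
        current = fields[3]
        per_node[current].append(line)
    return dict(per_node)


if __name__ == "__main__":
    names = [
        log_name("PR-1054-4", "1.20", "test"),
        log_name("PR-1054-4", "1.19", "kubelet", "pmem-csi-govm-worker1"),
        log_name("PR-1054-4", "1.19", "test"),
    ]
    # Plain lexical sorting now keeps all files of one Kubernetes version together.
    for name in sorted(names):
        print(name)
```

With the version ahead of the content type, an ordinary directory listing groups the test log and the per-node logs of one Kubernetes version next to each other.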
