This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

test flake: spontaneous node reboot #1055

Open
pohly opened this issue Dec 7, 2021 · 0 comments

pohly commented Dec 7, 2021

A worker node spontaneously rebooted, causing container restarts and thus test failures.

Seen in https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/artifact/joblog-jenkins-pmem-csi-PR-1054-4-test-1.19.log:

Dec  7 00:33:17.367: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
[AfterEach] direct-production
  /mnt/workspace/pmem-csi_PR-1054/test/e2e/deploy/deploy.go:1112
STEP: checking for test "direct-production Deployment Kata Containers [Testpattern: CSI Ephemeral-volume (ext4)] dax should support MAP_SYNC" in namespace default, test success
pmem-csi-intel-com-controller-d875b774-r6shd/[email protected]: ==== end of pod log ====
WARNING: pod log: pmem-csi-intel-com-controller-d875b774-r6shd/pmem-driver: Get "https://172.17.0.5:10250/containerLogs/default/pmem-csi-intel-com-controller-d875b774-r6shd/pmem-driver?follow=true": dial tcp 172.17.0.5:10250: connect: connection refused
...
Dec  7 00:34:37.493: INFO: Done with waiting, PMEM-CSI driver v1.0.0-48-g858d2ca0 is ready.
Dec  7 00:34:37.514: FAIL: container "pmem-driver" in pod "pmem-csi-intel-com-controller-d875b774-r6shd" restarted 1 times, last state: {Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:255,Signal:0,Reason:Unknown,Message:,StartedAt:2021-12-06 23:23:32 +0000 UTC,FinishedAt:2021-12-07 00:33:58 +0000 UTC,ContainerID:containerd://c103cca52585e83c30b0afa64ce57b8048fb90998e36bf7a96bafeafaec4ecb3,}}

https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/view/pmem-csi/job/pmem-csi/view/change-requests/job/PR-1054/4/artifact/joblog-jenkins-pmem-csi-PR-1054-4-kubeletlogs-1.19.log:

Dec 07 00:30:52 pmem-csi-govm-worker1 kubelet[855]: E1207 00:30:52.749878     855 upgradeaware.go:387] Error proxying data from backend to client: tls: use of closed connection
-- Boot 2a48549ebe844612bb074c64784b43f9 --
Dec 07 00:33:59 pmem-csi-govm-worker1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Dec 07 00:34:01 pmem-csi-govm-worker1 kubelet[636]: I1207 00:34:01.312285     636 server.go:411] Version: v1.19.11
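
The `-- Boot ... --` separator in the kubelet log above is what gives the reboot away. A minimal sketch of how such a log could be scanned for that marker automatically, assuming the log has been saved as plain text in the format shown; `find_reboots` is a hypothetical helper, not part of the PMEM-CSI test suite:

```python
import re

# Separator that appears between boots in the journal excerpt above, e.g.
# "-- Boot 2a48549ebe844612bb074c64784b43f9 --".
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]{32}) --$")


def find_reboots(lines):
    """Return (boot_id, last_line_before, first_line_after) per boot marker.

    Hypothetical helper: it only looks at the text of the log and makes no
    attempt to determine why the node rebooted.
    """
    reboots = []
    previous = None   # last regular log line seen so far
    pending = None    # most recent marker still waiting for its first line
    for raw in lines:
        line = raw.rstrip("\n")
        match = BOOT_MARKER.match(line)
        if match:
            pending = [match.group(1), previous, None]
            reboots.append(pending)
            continue
        if pending is not None and pending[2] is None:
            pending[2] = line
        previous = line
    return [tuple(entry) for entry in reboots]


if __name__ == "__main__":
    sample = [
        "Dec 07 00:30:52 pmem-csi-govm-worker1 kubelet[855]: E1207 00:30:52.749878     855 upgradeaware.go:387] Error proxying data from backend to client: tls: use of closed connection",
        "-- Boot 2a48549ebe844612bb074c64784b43f9 --",
        "Dec 07 00:33:59 pmem-csi-govm-worker1 systemd[1]: Started kubelet: The Kubernetes Node Agent.",
    ]
    for boot_id, before, after in find_reboots(sample):
        print(boot_id)
        print("  last line before:", before)
        print("  first line after:", after)
```

Comparing the timestamp of the first line after the marker (00:33:59) with the `FinishedAt` value of the terminated container (00:33:58) is what ties the container restart to the node reboot.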
pohly added a commit to pohly/pmem-CSI that referenced this issue Dec 7, 2021
Putting the Kubernetes version nearer the start of the file names ensures that
files for the same test case are grouped together when sorting. Sorting by the
type of content is less useful because usually one wants to investigate a
specific test failure which occurred under a specific Kubernetes version.

Splitting up log messages by node makes the individual files smaller and
simplifies debugging of a problem that occurred on a specific node.

To debug the spontaneous node reboot (intel#1055), the full systemd journal
is needed.
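
The grouping argument in the commit message can be illustrated with a small sketch. The file names below are hypothetical and heavily simplified; the actual naming scheme introduced by the commit is not shown in this issue, and `version_first` is an invented helper used only for the demonstration:

```python
# Hypothetical, simplified log file names with the content type before the
# Kubernetes version.
content_first = [
    "joblog-kubeletlogs-1.19.log",
    "joblog-test-1.19.log",
    "joblog-kubeletlogs-1.21.log",
    "joblog-test-1.21.log",
]


def version_first(name):
    """Move the Kubernetes version in front of the content type (illustrative only)."""
    prefix, kind, rest = name.split("-", 2)   # "joblog", "kubeletlogs", "1.19.log"
    version = rest[: -len(".log")]
    return f"{prefix}-{version}-{kind}.log"


# Sorting content-first names groups by content type, so the files that belong
# to one Kubernetes version end up apart from each other.
by_content = sorted(content_first)

# Sorting version-first names keeps all files for one version adjacent.
by_version = sorted(version_first(name) for name in content_first)

if __name__ == "__main__":
    print("\n".join(by_content))
    print()
    print("\n".join(by_version))
```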