Pods stuck in "ContainerCreating" status in AKS: FailedCreatePodSandBox #11478

Closed
oskarm93 opened this issue Oct 12, 2023 · 5 comments

oskarm93 commented Oct 12, 2023

What is the issue?

When we roll out deployment updates, new pods sometimes get stuck in "ContainerCreating" and never finish starting.
The affected pods are not even meshed: the linkerd annotation is not set on them.
We pre-install linkerd in CNI mode on all our clusters, but some teams don't use it, and they still run into this issue.

5m43s       Warning   FailedCreatePodSandBox   pod/<app_name>-68c448f44d-vp62n              (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "57f16a0e9098017767041eae11660c574c3350ad12073b261914898f55a5c63c": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
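
To separate this failure from other sandbox errors, the event stream can be filtered for the exact signature above. The following is a minimal sketch, not part of the original report; `filter_cni_unauthorized` is a hypothetical helper name, and it assumes the event message format shown in this issue.

```shell
# Hypothetical helper: keep only sandbox failures raised by the linkerd-cni
# plugin with an "Unauthorized" result. Reads event lines on stdin.
filter_cni_unauthorized() {
  grep 'FailedCreatePodSandBox' \
    | grep 'plugin type="linkerd-cni"' \
    | grep 'Unauthorized'
}

# Against a live cluster (assumes kubectl is configured for it):
#   kubectl get events -A --field-selector reason=FailedCreatePodSandBox \
#     | filter_cni_unauthorized
```

The field selector narrows events server-side by reason; the grep chain then drops sandbox failures caused by other CNI plugins.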

How can it be reproduced?

Unknown. Seems to happen randomly on different nodes.

Logs, error output, etc

kubectl get pod -o wide
NAME                          READY   STATUS              RESTARTS        AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
<app_name>-68c448f44d-vp62n   0/1     ContainerCreating   0               40m   <none>        aks-default-19181164-vmss000009   <none>           <none>
<app_name>-8574c97d4b-wsgq2   1/1     Running             5 (2d22h ago)   23d   10.18.16.80   aks-default-19181164-vmss00000a   <none>           <none>

Describe pod:
https://gist.github.com/oskarm93/335679f5abfc6b0f6c8da198c71f6db9

kubectl get pod -n linkerd-cni -o wide
NAME                READY   STATUS    RESTARTS        AGE   IP          NODE                              NOMINATED NODE   READINESS GATES
linkerd-cni-pg589   1/1     Running   0               62d   10.18.1.5   aks-default-19181164-vmss000000   <none>           <none>
linkerd-cni-rhpb8   1/1     Running   1 (2d22h ago)   54d   10.18.1.6   aks-default-19181164-vmss00000a   <none>           <none>
linkerd-cni-v9rv4   1/1     Running   0               55d   10.18.1.7   aks-default-19181164-vmss000009   <none>           <none>

Linkerd CNI logs (node 09):
https://gist.github.com/oskarm93/3e67a6ff935c55fdb0b42e0c190281d7

Linkerd CNI describe pod (node 09):
https://gist.github.com/oskarm93/b93dbdd4c1977c08514067abdfbf9bc5

output of linkerd check -o short

linkerd check -o short
Linkerd core checks
===================

kubernetes-version
------------------
× is running the minimum kubectl version
    exit status 1
    see https://linkerd.io/2.11/checks/#kubectl-version for hints

linkerd-existence
-----------------
‼ cluster networks can be verified
    the following nodes do not expose a podCIDR:
        aks-default-19181164-vmss000000
        aks-default-19181164-vmss000009
        aks-default-19181164-vmss00000a
    see https://linkerd.io/2.11/checks/#l5d-cluster-networks-verified for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.4 but the latest stable version is 2.14.1
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.1 but the latest stable version is 2.14.1
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.13.1 but cli running stable-2.11.4
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-c74967cdf-sjpdx (stable-2.13.1)
        * linkerd-identity-5d5d8954c6-whjrs (stable-2.13.1)
        * linkerd-proxy-injector-7d458667cd-p6wcc (stable-2.13.1)
        * prometheus-69cd9b4b65-c4l9k (stable-2.13.1)
        * tap-76b6bd6d59-mdqwr (stable-2.13.1)
        * tap-injector-59f9cb8655-p5g77 (stable-2.13.1)
        * web-cc997c6b5-2v9nn (stable-2.13.1)
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-c74967cdf-sjpdx running stable-2.13.1 but cli running stable-2.11.4
    see https://linkerd.io/2.11/checks/#l5d-cp-proxy-cli-version for hints

- Running viz extension check
<this always gets stuck>

Environment

Kubernetes version: 1.26.6
Environment: AKS
OS: AKSUbuntu-2204gen2containerd-202307.27.0
Linkerd Version:

helm ls -A
NAME                                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                                                                           APP VERSION
linkerd-cni                             linkerd-cni             1               2023-05-09 09:39:16.993582904 +0000 UTC deployed        linkerd2-cni-30.8.1                                                             stable-2.13.1
linkerd-control-plane                   linkerd                 1               2023-05-09 09:46:25.68485696 +0000 UTC  deployed        linkerd-control-plane-1.12.1                                                    stable-2.13.1
linkerd-crds                            linkerd                 1               2023-05-09 09:39:13.387040572 +0000 UTC deployed        linkerd-crds-1.6.0
linkerd-viz                             linkerd                 1               2023-05-09 09:47:02.75521461 +0000 UTC  deployed        linkerd-viz-30.8.1                                                              stable-2.13.1

Possible solution

Restarting the linkerd-cni pod on the node where the stuck pod was scheduled usually solves the problem.
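
A rough sketch of that workaround, assuming the `linkerd-cni` namespace shown earlier and that the CNI pods are managed by a DaemonSet that recreates a deleted pod. `cni_pod_on_node` is a hypothetical helper name, not a Linkerd or kubectl command.

```shell
# Hypothetical helper: print the name of the linkerd-cni pod on a given node.
# Reads `kubectl get pod -n linkerd-cni -o wide` output on stdin.
# The RESTARTS column can contain spaces (e.g. "1 (2d22h ago)"), so the node
# is matched against every field instead of a fixed column number.
cni_pod_on_node() {
  awk -v node="$1" 'NR > 1 {
    for (i = 2; i <= NF; i++) if ($i == node) { print $1; break }
  }'
}

# Against a live cluster (assumes kubectl is configured for it):
#   pod=$(kubectl get pod -n linkerd-cni -o wide \
#           | cni_pod_on_node aks-default-19181164-vmss000009)
#   kubectl delete pod -n linkerd-cni "$pod"
```

On a live cluster the same lookup can also be done server-side with `kubectl get pod -n linkerd-cni --field-selector spec.nodeName=<node> -o name`.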

Additional context

No response

Would you like to work on fixing this bug?

None

@oskarm93 oskarm93 added the bug label Oct 12, 2023
@alpeb
Member

alpeb commented Oct 12, 2023

Thanks for the detailed report 💯
There have been important improvements in linkerd's CNI plugin since version stable-2.13.1, which is what you have. Please upgrade to at least stable-2.13.6, and let us know how it goes!
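
As a quick way to check a deployed version against that minimum, here is a small sketch. `version_at_least` is a hypothetical helper, and it assumes version strings of the form `stable-X.Y.Z` as shown in the `helm ls` output, plus a `sort` that supports `-V` (GNU coreutils).

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Strips the "stable-" prefix, then uses version sort to find the smaller one.
version_at_least() {
  have=${1#stable-}
  want=${2#stable-}
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# The CNI chart in this report is at stable-2.13.1:
version_at_least stable-2.13.1 stable-2.13.6 || echo "upgrade needed"
```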

@Dark3clipse

Dark3clipse commented Nov 7, 2023

I am getting the same error on one of my linkerd-cni pods:

Every 2.0s: kubectl get all                                           kubernetes-client: Tue Nov  7 15:48:13 2023

NAME                    READY   STATUS              RESTARTS       AGE
pod/linkerd-cni-4djjv   1/1     Running             1 (3d2h ago)   4d4h
pod/linkerd-cni-blx5v   1/1     Running             1 (3d2h ago)   4d4h
pod/linkerd-cni-cxsn2   0/1     ContainerCreating   0              12m
pod/linkerd-cni-flmv2   1/1     Running             1 (3d3h ago)   4d4h
pod/linkerd-cni-rfvhj   1/1     Running             1 (24h ago)    42h
pod/linkerd-cni-zlxdz   1/1     Running             2 (42h ago)    4d4h
pod/linkerd-cni-zsxj6   1/1     Running             2 (42h ago)    4d4h

Deleting the pod does not resolve this issue for me.

Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               20s   default-scheduler  Successfully assigned linkerd-cni/linkerd-cni-cxsn2 to cp3
  Warning  FailedCreatePodSandBox  20s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8fe60ffc62f4539a892f74f4f0207ea0e2667edaff67a294fbbf402ee11c8b76": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "10a9b49fcc2039757e0e00c6dd485faf866d1711516adc73c097af63e9140dad": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

linkerd versions:

chart: linkerd-control-plane
version: "v1.17.5-edge"

chart: linkerd2-cni
version: "30.13.1-edge"

chart: linkerd-crds
version: "v1.9.0-edge"

chart: linkerd-viz
version: "30.13.5-edge"

I am not on AKS.

@zip-chanko

@alpeb I have a similar issue on EKS v1.25 with vpc-cni v1.12.6-eksbuild.2. Linkerd version is 2.13.5.

Warning  FailedCreatePodSandBox  4m20s (x90059 over 13d)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f4df221c93495b1b811911c8a9f371b9483102e8fe2d3c154c51c5d036d11de7": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized

My current temporary workaround is to recycle the aws-node pod, as mentioned in #1831 and #59.

This is possibly the race condition mentioned in #10738. Is that race condition fixed in version 2.13.6 by #11169?

@wmorgan
Member

wmorgan commented Jan 7, 2024

Please try on a more recent Linkerd. 2.14.8 is the most recent. See support policy section in https://linkerd.io/releases/#stable-latest-version-stable-2148

@oskarm93
Author

oskarm93 commented Mar 7, 2024

We have not experienced this issue in a while.
AKS 1.27.9
Linkerd control plane Helm chart version 1.16.9

@oskarm93 oskarm93 closed this as completed Mar 7, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 7, 2024