Runner pod always comes back in Terminating state #37

Closed
lanecm opened this issue Apr 16, 2020 · 16 comments

@lanecm

lanecm commented Apr 16, 2020

I deleted a runner pod to pick up IRSA changes, but the pod always comes back in the Terminating state:

kubectl get pods rev-gh-actions-sample-tkv6h-tr7d4 
NAME                                READY   STATUS        RESTARTS   AGE
rev-gh-actions-sample-tkv6h-tr7d4   0/2     Terminating   0          5s

In the manager logs, I see the following error:

2020-04-16T17:21:45.963Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "runner", "request": "actions-runners/rev-gh-actions-sample-tkv6h-tr7d4", "error": "pods \"rev-gh-actions-sample-tkv6h-tr7d4\" already exists"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

Any ideas on what I'm doing wrong? Thank you!
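
One way to read the "already exists" error: the request name in the log matches the pod name, so the pod name appears to be fixed, and a create call issued while the old pod is still Terminating would be rejected. The sketch below models that with a toy in-memory cluster. `FakeCluster`, `reconcile`, and the return values are hypothetical names for illustration only; this is not the controller's actual code.

```python
class AlreadyExists(Exception):
    """Raised when a pod with the same name is still stored."""


class FakeCluster:
    """Toy stand-in for the API server; not the real Kubernetes client."""

    def __init__(self):
        self.pods = {}  # name -> {"deleting": bool}

    def create(self, name):
        if name in self.pods:
            raise AlreadyExists(f'pods "{name}" already exists')
        self.pods[name] = {"deleting": False}

    def delete(self, name):
        # Graceful deletion only marks the pod; it stays visible as Terminating.
        self.pods[name]["deleting"] = True

    def finish_termination(self, name):
        # Once the containers have stopped, the object is removed for real.
        del self.pods[name]


def reconcile(cluster, name):
    """Create the pod if absent, but wait while an old one is terminating."""
    pod = cluster.pods.get(name)
    if pod is None:
        cluster.create(name)
        return "created"
    if pod["deleting"]:
        return "requeue"  # retry later instead of calling create()
    return "ok"
```

A reconcile step that skipped the `deleting` check and called `create()` straight away would raise `AlreadyExists` for as long as the old pod object lingers.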

@chenrui333
Contributor

chenrui333 commented Apr 17, 2020

I am seeing the same error as well (I think the pod object just did not get recreated?)

2020-04-17T21:13:58.846Z	ERROR	controllers.RunnerReplicaSet	Failed to update runner status	{"runner": "default/meetup-android-runner-ml4g4", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"meetup-android-runner-ml4g4\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
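
The "object has been modified" message is the API server's optimistic-concurrency conflict: the update carried a stale resourceVersion. A common way to handle it is to re-read the object and retry. The sketch below simulates that with a toy store. `FakeStore`, `Conflict`, and `update_with_retry` are hypothetical names, and nothing here is taken from the controller's source.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale resourceVersion."""


class FakeStore:
    """Toy single-object store with a resourceVersion check."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "status": {"replicas": 0}}

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return {
            "resourceVersion": self.obj["resourceVersion"],
            "status": dict(self.obj["status"]),
        }

    def update_status(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self.obj = {
            "resourceVersion": obj["resourceVersion"] + 1,
            "status": dict(obj["status"]),
        }


def update_with_retry(store, mutate, attempts=5):
    """Re-read, re-apply the change, and retry when a conflict occurs.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        fresh = store.get()
        mutate(fresh)
        try:
            store.update_status(fresh)
            return attempt
        except Conflict:
            continue
    raise Conflict(f"gave up after {attempts} attempts")
```

A conflict on its own is usually transient, so it may be a side effect of the loop rather than its cause.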

@summerwind
Contributor

summerwind commented Apr 18, 2020

@lanecm Thank you for the report!
Let me ask one thing. How did you delete the runner pod?

@chenrui333
Contributor

chenrui333 commented Apr 18, 2020

@summerwind Here is what I did:

  • kubectl delete -f runnerdeployment.yaml
  • update replicas
  • kubectl apply -f runnerdeployment.yaml

I'm running into the same issue.

@lanecm
Author

lanecm commented Apr 20, 2020

@summerwind -- Similar to what @chenrui333 did:

I deleted the runnerdeployment using:

kubectl delete runnerdeployment rev-gh-actions-sample

I've also tried deleting the controller as well. Let me know if I can provide any additional information!

@lanecm
Author

lanecm commented Apr 23, 2020

@summerwind -- Just wanted to check in: Any updates on this issue? Is there any additional information I can provide or help with?

@summerwind
Contributor

Sorry for the delayed response. I'll see if it reproduces in my environment.

@summerwind
Contributor

summerwind commented Apr 24, 2020

I couldn't reproduce the problem in my environment. Here is what I tried.

Manifest

# test.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: test
spec:
  replicas: 2
  template:
    spec:
      repository: summerwind/actions-runner-controller

Starting runners

All runners have been successfully started.

$ kubectl apply -f test.yaml
runnerdeployment.actions.summerwind.dev/test created
$ kubectl get pods -w
NAME               READY   STATUS    RESTARTS   AGE
test-wzhz7-xtwbh   0/2     Pending   0          0s
test-wzhz7-2gw99   0/2     Pending   0          0s
test-wzhz7-xtwbh   0/2     Pending   0          0s
test-wzhz7-2gw99   0/2     Pending   0          0s
test-wzhz7-xtwbh   0/2     ContainerCreating   0          0s
test-wzhz7-2gw99   0/2     ContainerCreating   0          0s
test-wzhz7-xtwbh   2/2     Running             0          4s
test-wzhz7-2gw99   2/2     Running             0          6s

Deleting runners

Confirmed that all runners have been stopped.

$ kubectl delete runnerdeployments test
runnerdeployment.actions.summerwind.dev "test" deleted
$ kubectl get pods -w
NAME               READY   STATUS    RESTARTS   AGE
test-wzhz7-2gw99   2/2     Running   0          62s
test-wzhz7-xtwbh   2/2     Running   0          62s
test-wzhz7-xtwbh   2/2     Terminating   0          77s
test-wzhz7-2gw99   2/2     Terminating   0          78s
test-wzhz7-xtwbh   0/2     Terminating   0          80s
test-wzhz7-2gw99   0/2     Terminating   0          81s
test-wzhz7-xtwbh   0/2     Terminating   0          85s
test-wzhz7-xtwbh   0/2     Terminating   0          85s
test-wzhz7-2gw99   0/2     Terminating   0          85s
test-wzhz7-2gw99   0/2     Terminating   0          85s
$ kubectl get pods
No resources found in default namespace.

@lanecm Can I see your pod status in the terminating state with the following command?

$ kubectl get pods ${POD_NAME} -o yaml

@lanecm
Author

lanecm commented Apr 27, 2020

Hi @summerwind -- Yes, output:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2020-04-27T10:51:51Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2020-04-27T10:52:22Z"
  labels:
    runner-template-hash: 77f4656b99
  name: rev-gh-actions-sample-85lmx-t9gdp
  namespace: actions-runners
  ownerReferences:
    - apiVersion: actions.summerwind.dev/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: Runner
      name: rev-gh-actions-sample-85lmx-t9gdp
      uid: e5215a92-80d3-11ea-a8aa-0a9c722078fd
  resourceVersion: "61248901"
  selfLink: /api/v1/namespaces/actions-runners/pods/rev-gh-actions-sample-85lmx-t9gdp
  uid: 19abde54-8875-11ea-b382-122f77966705
spec:
  containers:
    - env:
        - name: RUNNER_NAME
          value: rev-gh-actions-sample-85lmx-t9gdp
        - name: RUNNER_REPO
          value: cfacorp/rev-gh-actions-sample
        - name: RUNNER_TOKEN
          value: AMEHS6E4YAMDS4YO5U265B26U262I
        - name: AWS_ROLE_ARN
          value:
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      image: <runner-image>
      imagePullPolicy: Always
      name: runner
      resources: {}
      securityContext:
        runAsGroup: 0
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run
          name: docker
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: rev-gh-actions-sample-sa-token-rjxpx
          readOnly: true
        - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          name: aws-iam-token
          readOnly: true
    - env:
        - name: AWS_ROLE_ARN
          value: <irsa-role-arn>
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      image: docker:19.03.6-dind
      imagePullPolicy: IfNotPresent
      name: docker
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run
          name: docker
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: rev-gh-actions-sample-sa-token-rjxpx
          readOnly: true
        - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          name: aws-iam-token
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-220-14.ec2.internal
  priority: 100
  priorityClassName: global-default
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: rev-gh-actions-sample-sa
  serviceAccountName: rev-gh-actions-sample-sa
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - name: aws-iam-token
      projected:
        defaultMode: 420
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
    - emptyDir: {}
      name: docker
    - name: rev-gh-actions-sample-sa-token-rjxpx
      secret:
        defaultMode: 420
        secretName: rev-gh-actions-sample-sa-token-rjxpx
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-04-27T10:51:51Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-04-27T10:51:51Z"
      message: "containers with unready status: [runner docker]"
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-04-27T10:51:51Z"
      message: "containers with unready status: [runner docker]"
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-04-27T10:51:51Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - image: docker:19.03.6-dind
      imageID: ""
      lastState: {}
      name: docker
      ready: false
      restartCount: 0
      state:
        terminated:
          exitCode: 0
          finishedAt: null
          startedAt: null
    - image: <runner-image>
      imageID: ""
      lastState: {}
      name: runner
      ready: false
      restartCount: 0
      state:
        terminated:
          exitCode: 0
          finishedAt: null
          startedAt: null
  hostIP: 10.0.220.14
  phase: Pending
  qosClass: BestEffort
  startTime: "2020-04-27T10:51:51Z"

Thank you for investigating! Please let me know how I can help.
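
The status above shows a pod that has a deletionTimestamp while its containers are reported as terminated with no startedAt, which suggests the containers never started. A minimal sketch of that check, assuming the pod has been loaded as a dict (for example from `kubectl get pods ${POD_NAME} -o json`). `deleted_before_start` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
def deleted_before_start(pod):
    """Return True if the pod is being deleted and no container ever started."""
    meta = pod.get("metadata", {})
    statuses = pod.get("status", {}).get("containerStatuses", [])
    if "deletionTimestamp" not in meta or not statuses:
        return False
    for cs in statuses:
        terminated = cs.get("state", {}).get("terminated")
        # A container that ran would carry a non-null startedAt.
        if terminated is None or terminated.get("startedAt") is not None:
            return False
    return True


# Trimmed-down version of the pod shown above.
example = {
    "metadata": {"deletionTimestamp": "2020-04-27T10:52:22Z"},
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "docker", "state": {"terminated": {"exitCode": 0, "startedAt": None}}},
            {"name": "runner", "state": {"terminated": {"exitCode": 0, "startedAt": None}}},
        ],
    },
}
print(deleted_before_start(example))
```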

@summerwind
Contributor

summerwind commented Apr 29, 2020

@lanecm Thank you for the information.
I'll try an EKS cluster to see if this can be reproduced.
Can you also give me the results of the following commands for the terminating pods?

$ kubectl logs ${POD_NAME} -c runner
$ kubectl describe pods ${POD_NAME}
$ kubectl describe runners ${POD_NAME}
$ kubectl describe runnerdeployments rev-gh-actions-sample

summerwind self-assigned this Apr 29, 2020

@lanecm
Author

lanecm commented Apr 30, 2020

@summerwind -- Output for each command:

  1. kubectl describe pods ${POD_NAME} :
Name:                      rev-gh-actions-sample-85lmx-t9gdp
Namespace:                 actions-runners
Priority:                  100
Priority Class Name:       global-default
Node:                      <node-name>/
Labels:                    runner-template-hash=77f4656b99
Annotations:               kubernetes.io/psp: eks.privileged
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        
IPs:                       <none>
Controlled By:             Runner/rev-gh-actions-sample-85lmx-t9gdp
Containers:
  runner:
    Image:      <runner-image>
    Port:       <none>
    Host Port:  <none>
    Environment:
      RUNNER_NAME:                  rev-gh-actions-sample-85lmx-t9gdp
      RUNNER_REPO:                  <orgname>/rev-gh-actions-sample
      RUNNER_TOKEN:                 AMEHS6H6E7LFR4FIL3SO6S26VLIBM
      AWS_ROLE_ARN:                 <role-arn>
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run from docker (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from rev-gh-actions-sample-sa-token-rjxpx (ro)
  docker:
    Image:      docker:19.03.6-dind
    Port:       <none>
    Host Port:  <none>
    Environment:
      AWS_ROLE_ARN:                 <role-arn>
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run from docker (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from rev-gh-actions-sample-sa-token-rjxpx (ro)
Conditions:
  Type           Status
  PodScheduled   True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  docker:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  rev-gh-actions-sample-sa-token-rjxpx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rev-gh-actions-sample-sa-token-rjxpx
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  1s    default-scheduler  Successfully assigned actions-runners/rev-gh-actions-sample-85lmx-t9gdp to <node-name>

  2. kubectl describe runners ${POD_NAME} :
Name:         rev-gh-actions-sample-85lmx-t9gdp
Namespace:    actions-runners
Labels:       runner-template-hash=77f4656b99
Annotations:  <none>
API Version:  actions.summerwind.dev/v1alpha1
Kind:         Runner
Metadata:
  Creation Timestamp:  2020-04-17T17:50:15Z
  Finalizers:
    runner.actions.summerwind.dev
  Generate Name:  rev-gh-actions-sample-85lmx-
  Generation:     1
  Owner References:
    API Version:           actions.summerwind.dev/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  RunnerReplicaSet
    Name:                  rev-gh-actions-sample-85lmx
    UID:                   e5205099-80d3-11ea-a8aa-0a9c722078fd
  Resource Version:        65957286
  Self Link:               /apis/actions.summerwind.dev/v1alpha1/namespaces/actions-runners/runners/rev-gh-actions-sample-85lmx-t9gdp
  UID:                     e5215a92-80d3-11ea-a8aa-0a9c722078fd
Spec:
  Image:       
  Repository:  <orgname>//rev-gh-actions-sample
  Resources:
  Service Account Name:  rev-gh-actions-sample-sa
Status:
  Message:  
  Phase:    Pending
  Reason:   
  Registration:
    Expires At:  2020-04-30T13:18:14Z
    Repository:  <orgname>/rev-gh-actions-sample
    Token:       AMEHS6H6E7LFR4FIL3SO6S26VLIBM
Events:
  Type    Reason      Age                        From               Message
  ----    ------      ----                       ----               -------
  Normal  PodDeleted  23m (x362730 over 5d21h)   runner-controller  Deleted pod 'rev-gh-actions-sample-85lmx-t9gdp'
  Normal  PodCreated  3m6s (x327159 over 5d21h)  runner-controller  Created pod 'rev-gh-actions-sample-85lmx-t9gdp'

  3. kubectl describe runnerdeployments rev-gh-actions-sample :
Name:         rev-gh-actions-sample
Namespace:    actions-runners
Labels:       app.kubernetes.io/instance=cicd-actions-runners
Annotations:
API Version:  actions.summerwind.dev/v1alpha1
Kind:         RunnerDeployment
Metadata:
  Creation Timestamp:  2020-04-17T17:50:15Z
  Generation:          1
  Resource Version:    47289236
  Self Link:           /apis/actions.summerwind.dev/v1alpha1/namespaces/actions-runners/runnerdeployments/rev-gh-actions-sample
  UID:                 e51f10d9-80d3-11ea-a8aa-0a9c722078fd
Spec:
  Replicas:  1
  Template:
    Spec:
      Env:
      Repository:            <orgname>/rev-gh-actions-sample
      Service Account Name:  rev-gh-actions-sample-sa
Events:                      <none>
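
The Runner events above give a sense of how tight the loop is: 362730 PodDeleted and 327159 PodCreated events over 5d21h. A small sketch of the arithmetic follows. The helper names are hypothetical, and the duration parser only handles the simple `5d21h` / `3m6s` style seen in this output.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_seconds(text):
    """Convert a kubectl-style age such as '5d21h' into seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def seconds_per_event(count, window):
    """Average spacing between events over the given window."""
    return duration_seconds(window) / count


print(duration_seconds("5d21h"))                     # 507600
print(round(seconds_per_event(362730, "5d21h"), 2))  # 1.4
print(round(seconds_per_event(327159, "5d21h"), 2))  # 1.55
```

On average that is one deletion roughly every 1.4 seconds and one creation roughly every 1.55 seconds, sustained for almost six days.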

@summerwind
Contributor

@lanecm Thank you for giving me the information!
Could you also share the runner container's log, using the following command? It may be related to #33.

$ kubectl logs ${POD_NAME} -c runner

@lanecm
Author

lanecm commented May 5, 2020

@summerwind -- Unfortunately, it's a bit tricky to get the pod logs, because the pod is stuck in a loop:

$ kubectl logs rev-gh-actions-sample-85lmx-t9gdp -c runner 
Error from server (BadRequest): container "runner" in pod "rev-gh-actions-sample-85lmx-t9gdp" is terminated

I'm not sure of the best way to get the logs.

@summerwind
Contributor

Thanks! How about using the --previous flag?

$ kubectl logs ${POD_NAME} -c runner --previous

@svrakitin

svrakitin commented Nov 2, 2020

Hey @summerwind,

I've encountered the same issue in our EKS cluster. The runner pod is stuck in a terminated state, and there is an event saying:

Error: cannot find volume "aws-iam-token" to mount into container "runner".

The runner has a service account with the eks.amazonaws.com/role-arn annotation attached, which makes the IAM mutating webhook inject the aws-iam-token volume along with the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables.

I managed to work around it by hardcoding the volume the same way the mutating webhook does, like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ${var.github_organization}-runner
  namespace: actions-runner-system
spec:
  replicas: 3
  template:
    spec:
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      organization: ${var.github_organization}
      env:
      - name: AWS_REGION
        value: ${var.region}
      - name: AWS_DEFAULT_REGION
        value: ${var.region}
      - name: AWS_ROLE_ARN
        value: ${aws_iam_role.actions_runner.arn}
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      securityContext:
        fsGroup: 65534
      volumeMounts:
      - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
        name: aws-iam-token
        readOnly: true
      volumes:
      - name: aws-iam-token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
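
The "cannot find volume" event suggests the pod ended up with a volumeMount that has no matching entry under volumes. How that state arises is not shown in this thread, but the mismatch itself is easy to check for. A minimal sketch, assuming the pod spec has been loaded as a dict; `missing_volumes` is a hypothetical helper name.

```python
def missing_volumes(pod_spec):
    """Map container name -> mounted volume names that the pod does not declare."""
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    missing = {}
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in declared:
                missing.setdefault(container["name"], []).append(mount["name"])
    return missing


# Illustrative spec: the mount is present but the volume is not declared.
spec = {
    "containers": [
        {
            "name": "runner",
            "volumeMounts": [
                {"name": "docker", "mountPath": "/var/run"},
                {
                    "name": "aws-iam-token",
                    "mountPath": "/var/run/secrets/eks.amazonaws.com/serviceaccount",
                },
            ],
        }
    ],
    "volumes": [{"name": "docker", "emptyDir": {}}],
}
print(missing_volumes(spec))  # {'runner': ['aws-iam-token']}
```

Declaring the volume explicitly, as the workaround above does, makes the result empty.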

@mumoshu
Collaborator

mumoshu commented Apr 25, 2021

@svrakitin It's been addressed in #200 and PR #226, and it should be a non-issue today! Thanks for reporting.

@mumoshu
Collaborator

mumoshu commented Apr 25, 2021

Closing as resolved.
