
COMPUTE-6881: Prevent following initContainer to run if previous sidecar container failed #48

Open
wants to merge 1 commit into base: release-1.30.4-lyft.1

Conversation

@murongl-lyft commented Oct 23, 2024

The upstream kubelet native sidecar implementation favors not blocking the follow-up containers,
so it won't wait on a sidecar container even if it failed to start.

For us, we'd like the strong dependency: if any sidecar container fails, the pod stops proceeding to the following init containers.
Credit to @abhinavdahiya here.
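
To make the intent concrete, here is a minimal, hypothetical sketch of the kind of guard being described; it is not the actual patch, and the package name and helper names (`isRestartableInit`, `previousSidecarFailed`) are made up for illustration, using only `k8s.io/api/core/v1` types.

```go
// Illustrative sketch only; not the actual kubelet change in this PR.
package sidecarorder

import v1 "k8s.io/api/core/v1"

// isRestartableInit reports whether an init container is a native sidecar,
// i.e. declares restartPolicy: Always.
func isRestartableInit(c v1.Container) bool {
	return c.RestartPolicy != nil && *c.RestartPolicy == v1.ContainerRestartPolicyAlways
}

// previousSidecarFailed reports whether any sidecar init container before
// index next has already failed, in which case the next init container
// should not be started yet during the initial pod startup.
func previousSidecarFailed(pod *v1.Pod, statuses map[string]v1.ContainerStatus, next int) bool {
	for i := 0; i < next && i < len(pod.Spec.InitContainers); i++ {
		c := pod.Spec.InitContainers[i]
		if !isRestartableInit(c) {
			continue // regular init containers are already serialized by normal ordering
		}
		s, ok := statuses[c.Name]
		if !ok {
			continue
		}
		// A sidecar that terminated with a non-zero exit code, or that has
		// restarted without coming back to a started state, blocks the
		// following init containers until it is running again.
		if s.State.Terminated != nil && s.State.Terminated.ExitCode != 0 {
			return true
		}
		if s.RestartCount > 0 && (s.Started == nil || !*s.Started) {
			return true
		}
	}
	return false
}
```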

@murongl-lyft (Author)

/ptal @abhinavdahiya @tomwans

@abhinavdahiya

What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.

@murongl-lyft (Author)

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.

Per Lyft's design, we might see the scenario below, which we want to avoid:

2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
2024-10-10T16:16:38Z  fluentbit                 (Ready)
2024-10-10T16:16:42Z  nsqv2                     (Ready)
2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)

@abhinavdahiya

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.
>
> Per Lyft's design, we might see the scenario below, which we want to avoid:
>
> 2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
> 2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
> 2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
> 2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
> 2024-10-10T16:16:38Z  fluentbit                 (Ready)
> 2024-10-10T16:16:42Z  nsqv2                     (Ready)
> 2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
> 2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)

But the container will restart and be running again, which is the expected workflow.

This is also how sidecars currently behave with the Lyft-Patch: if a sidecar container fails, it is restarted.

@murongl-lyft (Author)

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.
>
> Per Lyft's design, we might see the scenario below, which we want to avoid:
>
> 2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
> 2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
> 2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
> 2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
> 2024-10-10T16:16:38Z  fluentbit                 (Ready)
> 2024-10-10T16:16:42Z  nsqv2                     (Ready)
> 2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
> 2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)
>
> But the container will restart and be running again, which is the expected workflow.
>
> This is also how sidecars currently behave with the Lyft-Patch: if a sidecar container fails, it is restarted.

We only enforce this strict ordering during the initial Pod startup procedure.
After that, if an individual sidecar container restarts, that's fine; this matches current production behavior.
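
A sketch of one way the "initial startup only" scoping could be expressed; `podInitialized` is a hypothetical helper, not part of this patch, and simply keys off the Pod's `Initialized` condition.

```go
// Hypothetical helper, not part of this patch: once the Pod has completed its
// initial init-container sequence, the strict ordering check can be skipped
// and later sidecar restarts are treated as normal operation.
package sidecarorder

import v1 "k8s.io/api/core/v1"

// podInitialized reports whether the Pod's Initialized condition is True.
func podInitialized(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodInitialized {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```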

@murongl-lyft changed the title from "Prevent following initContainer to run if previous sidecar container failed" to "COMPUTE-6881: Prevent following initContainer to run if previous sidecar container failed" on Oct 24, 2024
@abhinavdahiya commented Oct 24, 2024

> For us, we'd like the strong dependency: if any sidecar container fails, the pod stops proceeding to the following init containers.

I don't think this is a requirement for our sidecar containers.

We don't really care how many times the sidecar containers restart during pod startup. We expect eventual consistency.

What we care about is that the application container is started only after all sidecars are ready:

```go
if !isSidecar(pod, container.Name) && sidecarStatus.SidecarsPresent && sidecarStatus.ContainersWaiting && !sidecarStatus.SidecarsReady {
	klog.Infof("Pod: %s, Container: %s, sidecar=%v skipped: Present=%v,Ready=%v,ContainerWaiting=%v", format.Pod(pod), container.Name, isSidecar(pod, container.Name), sidecarStatus.SidecarsPresent, sidecarStatus.SidecarsReady, sidecarStatus.ContainersWaiting)
	continue
}
```

And after things are ready, future failures are also part of normal operation from the kubelet's perspective, for us.
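
For context on what that gate keys off, here is a simplified sketch (not the actual Lyft-Patch code) of how flags like `SidecarsPresent`, `SidecarsReady`, and `ContainersWaiting` could be derived from container statuses; the `isSidecar` stand-in below uses a hypothetical annotation rather than whatever the real patch does.

```go
// Simplified sketch, not the actual Lyft-Patch implementation.
package sidecarorder

import v1 "k8s.io/api/core/v1"

type sidecarStatus struct {
	SidecarsPresent   bool // the pod declares at least one sidecar container
	SidecarsReady     bool // every declared sidecar currently reports Ready
	ContainersWaiting bool // at least one app container has not been started yet
}

// isSidecar is a stand-in for the Lyft-Patch helper of the same name; here it
// reads a hypothetical per-container annotation instead of the real mechanism.
func isSidecar(pod *v1.Pod, name string) bool {
	return pod.Annotations["sidecar.lyft.net/"+name] == "true"
}

func computeSidecarStatus(pod *v1.Pod, statuses []v1.ContainerStatus) sidecarStatus {
	byName := make(map[string]v1.ContainerStatus, len(statuses))
	for _, s := range statuses {
		byName[s.Name] = s
	}
	st := sidecarStatus{SidecarsReady: true}
	for _, c := range pod.Spec.Containers {
		s, seen := byName[c.Name]
		if isSidecar(pod, c.Name) {
			st.SidecarsPresent = true
			if !seen || !s.Ready {
				st.SidecarsReady = false
			}
			continue
		}
		// An app container with no status yet, or still in Waiting, counts as
		// waiting for the purposes of the gate above.
		if !seen || s.State.Waiting != nil {
			st.ContainersWaiting = true
		}
	}
	return st
}
```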

@abhinavdahiya

Also, I did some testing with https://github.com/lyft/k8senvoy/pull/1325/commits/406b3f70aefc988e32f81e089dc2d2c402b24911

  • forcefully make the envoymesh container fail

and then updated boba4 to test pyexample2:

```
kc get pods pyexample2-main-69894f7546-2pp8n
NAME                               READY   STATUS                  RESTARTS      AGE
pyexample2-main-69894f7546-2pp8n   0/7     Init:CrashLoopBackOff   4 (69s ago)   2m58s
  - containerID: cri-o://73a387a7bb690f7cf7595979948b39573865bccd75a56167c0c582f285e116c0
    image: lyft.net/envoysidecar:406b3f70aefc988e32f81e089dc2d2c402b24911
    imageID: lyft.net/envoysidecar@sha256:b0502612d022b7a8904cc16a8be9f138c417786e7afdaef9e80b167592389665
    lastState:
      terminated:
        containerID: cri-o://73a387a7bb690f7cf7595979948b39573865bccd75a56167c0c582f285e116c0
        exitCode: 1
        finishedAt: "2024-10-24T13:05:58Z"
        reason: Error
        startedAt: "2024-10-24T13:05:57Z"
    name: envoymesh
    ready: false
    restartCount: 5
    started: false
    state:
      waiting:
        message: back-off 2m40s restarting failed container=envoymesh pod=pyexample2-main-69894f7546-2pp8n_pyexample2-staging(734f4d3d-3823-4053-bdae-d505ef3521f5)
        reason: CrashLoopBackOff
        
  - containerID: cri-o://76a6a1eb9b20bfd3c01aa0e8069b24e487347723eaaaad16f9a24e6d8a9e123b
    image: lyft.net/tugboatsc:47db478d00d7fa843f740832ec1664e95d8f0578
    imageID: 173840052742.dkr.ecr.us-east-1.amazonaws.com/tugboatsc@sha256:3735c3a3a3b98795a8af57e1c863ed7b0afb8d9e9696d9535befc84083940242
    lastState: {}
    name: statsrelayagent-unknown
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-24T13:06:01Z"

The envoymesh container failed, but it is actually being restarted,
and the other sidecar containers are running fine.

Both of these outcomes are correct, imo.

```
  - containerID: cri-o://abf999664141a19780d195252218820b777404e86f7421a8f4d994d9a8e396ea
    image: lyft.net/pyexample2:387a684886a9481157ae14a27d0fd9c1cf82785c
    imageID: lyft.net/pyexample2@sha256:366037a83eff01aa2a241bc43b27d2b8d44974821ab1257c510e16bb76ecc26e
    lastState: {}
    name: pyexample2-service-pythonkv
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-24T13:06:02Z"

The problem is that the app container is also running,
which is the bad outcome; but that is not specific to an init container failing and restarting.

In general, my understanding of the native sidecar behavior is that the app container is started after all sidecar containers have started (regardless of their state); it does not wait for them to become ready.
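
For reference, the upstream docs describe a startup probe on a restartable init (sidecar) container as the mechanism that makes the kubelet wait before moving on to the remaining containers. Below is a minimal sketch of such a spec using `corev1` types; the container names, probe endpoint, and image tags are illustrative, not taken from this PR. Note that this only gates on the probe succeeding once during startup; it does not make later containers wait on ongoing readiness, which matches the observation above.

```go
// Hedged sketch: per the upstream docs, a startupProbe on a restartable init
// container delays the remaining containers until the probe succeeds.
// Names, ports, and image tags here are illustrative, not from this PR.
package sidecarorder

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func sidecarPodSpec() v1.PodSpec {
	always := v1.ContainerRestartPolicyAlways
	return v1.PodSpec{
		InitContainers: []v1.Container{{
			Name:          "envoymesh",
			Image:         "lyft.net/envoysidecar:example-tag", // hypothetical tag
			RestartPolicy: &always,                             // marks it as a native sidecar
			// The kubelet only treats the sidecar as started once this probe
			// succeeds, so the remaining init and app containers are held
			// back until then.
			StartupProbe: &v1.Probe{
				ProbeHandler: v1.ProbeHandler{
					HTTPGet: &v1.HTTPGetAction{
						Path: "/ready",
						Port: intstr.FromInt32(9901),
					},
				},
				PeriodSeconds:    2,
				FailureThreshold: 30,
			},
		}},
		Containers: []v1.Container{{
			Name:  "pyexample2-service-pythonkv",
			Image: "lyft.net/pyexample2:example-tag", // hypothetical tag
		}},
	}
}
```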
