
COMPUTE-6881: Prevent following initContainer to run if previous sidecar container failed #48

Open
wants to merge 1 commit into base: release-1.30.4-lyft.1

Conversation

@murongl-lyft commented Oct 23, 2024

The upstream kubelet native sidecar implementation favors not blocking the follow-up containers,
so it won't wait on a sidecar container even if it failed to start.

For us, we'd like the strong dependency: if any sidecar container fails, the pod stops proceeding to the following init containers.
Credit to @abhinavdahiya here.
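
To make the intent concrete, here is a minimal, hypothetical sketch of the kind of guard being described; it is not the actual patch, and the package name and helper names (`isRestartableInit`, `previousSidecarFailed`) are made up for illustration, using only `k8s.io/api/core/v1` types.

```go
// Illustrative sketch only; not the actual kubelet change in this PR.
package sidecarorder

import v1 "k8s.io/api/core/v1"

// isRestartableInit reports whether an init container is a native sidecar,
// i.e. declares restartPolicy: Always.
func isRestartableInit(c v1.Container) bool {
	return c.RestartPolicy != nil && *c.RestartPolicy == v1.ContainerRestartPolicyAlways
}

// previousSidecarFailed reports whether any sidecar init container before
// index next has already failed, in which case the next init container
// should not be started yet during the initial pod startup.
func previousSidecarFailed(pod *v1.Pod, statuses map[string]v1.ContainerStatus, next int) bool {
	for i := 0; i < next && i < len(pod.Spec.InitContainers); i++ {
		c := pod.Spec.InitContainers[i]
		if !isRestartableInit(c) {
			continue // regular init containers are already serialized by normal ordering
		}
		s, ok := statuses[c.Name]
		if !ok {
			continue
		}
		// A sidecar that terminated with a non-zero exit code, or that has
		// restarted without coming back to a started state, blocks the
		// following init containers until it is running again.
		if s.State.Terminated != nil && s.State.Terminated.ExitCode != 0 {
			return true
		}
		if s.RestartCount > 0 && (s.Started == nil || !*s.Started) {
			return true
		}
	}
	return false
}
```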

@murongl-lyft (Author)

/ptal @abhinavdahiya @tomwans

@abhinavdahiya

What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.

@murongl-lyft (Author)

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.

Per Lyft's design, we might see the scenario below, which we want to avoid:

2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
2024-10-10T16:16:38Z  fluentbit                 (Ready)
2024-10-10T16:16:42Z  nsqv2                     (Ready)
2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)

@abhinavdahiya

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.
>
> Per Lyft's design, we might see the scenario below, which we want to avoid:
>
> 2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
> 2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
> 2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
> 2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
> 2024-10-10T16:16:38Z  fluentbit                 (Ready)
> 2024-10-10T16:16:42Z  nsqv2                     (Ready)
> 2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
> 2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)

But the container will restart and be running again, which is the expected workflow.

This is also how sidecars currently behave with the Lyft-Patch: if a sidecar container fails, it is restarted.

@murongl-lyft (Author)

> What's wrong with moving on to the next init container if the previous one starts but fails afterward? Eventually everything will be fine.
>
> Per Lyft's design, we might see the scenario below, which we want to avoid:
>
> 2024-10-10T16:15:13Z  iamwait-gojson            finishedAt: "2024-10-10T16:15:13Z"
> 2024-10-10T16:15:13Z  runtimepullinit           finishedAt: "2024-10-10T16:16:04Z"
> 2024-10-10T16:16:29Z  init-envoymesh            finishedAt: "2024-10-10T16:16:29Z"
> 2024-10-10T16:16:30Z  envoymesh                 >>>> Failed
> 2024-10-10T16:16:38Z  fluentbit                 (Ready)
> 2024-10-10T16:16:42Z  nsqv2                     (Ready)
> 2024-10-10T16:16:58Z  aeproxy-unknown           (Ready)
> 2024-10-10T16:17:05Z  gotestbed-service-gojson  (Ready)
>
> But the container will restart and be running again, which is the expected workflow.
>
> This is also how sidecars currently behave with the Lyft-Patch: if a sidecar container fails, it is restarted.

We only enforce this strict ordering during the initial Pod startup procedure.
After that, if an individual sidecar container restarts, that's fine; this matches current production behavior.
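
A sketch of one way the "initial startup only" scoping could be expressed; `podInitialized` is a hypothetical helper, not part of this patch, and simply keys off the Pod's `Initialized` condition.

```go
// Hypothetical helper, not part of this patch: once the Pod has completed its
// initial init-container sequence, the strict ordering check can be skipped
// and later sidecar restarts are treated as normal operation.
package sidecarorder

import v1 "k8s.io/api/core/v1"

// podInitialized reports whether the Pod's Initialized condition is True.
func podInitialized(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodInitialized {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}
```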

@murongl-lyft changed the title from "Prevent following initContainer to run if previous sidecar container failed" to "COMPUTE-6881: Prevent following initContainer to run if previous sidecar container failed" on Oct 24, 2024
@abhinavdahiya commented Oct 24, 2024

> For us, we'd like the strong dependency: if any sidecar container fails, the pod stops proceeding to the following init containers.

I don't think this is a requirement for our sidecar containers.

We don't really care how many times the sidecar containers restart during pod startup. We expect eventual consistency.

What we care about is that the application container is started only after all sidecars are ready:

```go
if !isSidecar(pod, container.Name) && sidecarStatus.SidecarsPresent && sidecarStatus.ContainersWaiting && !sidecarStatus.SidecarsReady {
	klog.Infof("Pod: %s, Container: %s, sidecar=%v skipped: Present=%v,Ready=%v,ContainerWaiting=%v", format.Pod(pod), container.Name, isSidecar(pod, container.Name), sidecarStatus.SidecarsPresent, sidecarStatus.SidecarsReady, sidecarStatus.ContainersWaiting)
	continue
}
```

And after things are ready, future failures are also part of normal operation from the kubelet's perspective, for us.
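
For context on what that gate keys off, here is a simplified sketch (not the actual Lyft-Patch code) of how flags like `SidecarsPresent`, `SidecarsReady`, and `ContainersWaiting` could be derived from container statuses; the `isSidecar` stand-in below uses a hypothetical annotation rather than whatever the real patch does.

```go
// Simplified sketch, not the actual Lyft-Patch implementation.
package sidecarorder

import v1 "k8s.io/api/core/v1"

type sidecarStatus struct {
	SidecarsPresent   bool // the pod declares at least one sidecar container
	SidecarsReady     bool // every declared sidecar currently reports Ready
	ContainersWaiting bool // at least one app container has not been started yet
}

// isSidecar is a stand-in for the Lyft-Patch helper of the same name; here it
// reads a hypothetical per-container annotation instead of the real mechanism.
func isSidecar(pod *v1.Pod, name string) bool {
	return pod.Annotations["sidecar.lyft.net/"+name] == "true"
}

func computeSidecarStatus(pod *v1.Pod, statuses []v1.ContainerStatus) sidecarStatus {
	byName := make(map[string]v1.ContainerStatus, len(statuses))
	for _, s := range statuses {
		byName[s.Name] = s
	}
	st := sidecarStatus{SidecarsReady: true}
	for _, c := range pod.Spec.Containers {
		s, seen := byName[c.Name]
		if isSidecar(pod, c.Name) {
			st.SidecarsPresent = true
			if !seen || !s.Ready {
				st.SidecarsReady = false
			}
			continue
		}
		// An app container with no status yet, or still in Waiting, counts as
		// waiting for the purposes of the gate above.
		if !seen || s.State.Waiting != nil {
			st.ContainersWaiting = true
		}
	}
	return st
}
```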

@abhinavdahiya

Also, I did some testing with https://github.com/lyft/k8senvoy/pull/1325/commits/406b3f70aefc988e32f81e089dc2d2c402b24911

  • forcefully make the envoymesh container fail

and then updated boba4 to test pyexample2:

```
kc get pods pyexample2-main-69894f7546-2pp8n
NAME                               READY   STATUS                  RESTARTS      AGE
pyexample2-main-69894f7546-2pp8n   0/7     Init:CrashLoopBackOff   4 (69s ago)   2m58s
  - containerID: cri-o://73a387a7bb690f7cf7595979948b39573865bccd75a56167c0c582f285e116c0
    image: lyft.net/envoysidecar:406b3f70aefc988e32f81e089dc2d2c402b24911
    imageID: lyft.net/envoysidecar@sha256:b0502612d022b7a8904cc16a8be9f138c417786e7afdaef9e80b167592389665
    lastState:
      terminated:
        containerID: cri-o://73a387a7bb690f7cf7595979948b39573865bccd75a56167c0c582f285e116c0
        exitCode: 1
        finishedAt: "2024-10-24T13:05:58Z"
        reason: Error
        startedAt: "2024-10-24T13:05:57Z"
    name: envoymesh
    ready: false
    restartCount: 5
    started: false
    state:
      waiting:
        message: back-off 2m40s restarting failed container=envoymesh pod=pyexample2-main-69894f7546-2pp8n_pyexample2-staging(734f4d3d-3823-4053-bdae-d505ef3521f5)
        reason: CrashLoopBackOff
        
  - containerID: cri-o://76a6a1eb9b20bfd3c01aa0e8069b24e487347723eaaaad16f9a24e6d8a9e123b
    image: lyft.net/tugboatsc:47db478d00d7fa843f740832ec1664e95d8f0578
    imageID: 173840052742.dkr.ecr.us-east-1.amazonaws.com/tugboatsc@sha256:3735c3a3a3b98795a8af57e1c863ed7b0afb8d9e9696d9535befc84083940242
    lastState: {}
    name: statsrelayagent-unknown
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-24T13:06:01Z"

The envoymesh container failed, but it is actually being restarted,
and the other sidecar containers are running fine.

Both of these outcomes are correct, imo.

```
  - containerID: cri-o://abf999664141a19780d195252218820b777404e86f7421a8f4d994d9a8e396ea
    image: lyft.net/pyexample2:387a684886a9481157ae14a27d0fd9c1cf82785c
    imageID: lyft.net/pyexample2@sha256:366037a83eff01aa2a241bc43b27d2b8d44974821ab1257c510e16bb76ecc26e
    lastState: {}
    name: pyexample2-service-pythonkv
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-24T13:06:02Z"

The problem is that the app container is also running,
which is the bad outcome; but that is not specific to an init container failing and restarting.

In general, my understanding of the native sidecar behavior is that the app container is started after all sidecar containers have started (regardless of their state); it does not wait for them to become ready.
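
For reference, the upstream docs describe a startup probe on a restartable init (sidecar) container as the mechanism that makes the kubelet wait before moving on to the remaining containers. Below is a minimal sketch of such a spec using `corev1` types; the container names, probe endpoint, and image tags are illustrative, not taken from this PR. Note that this only gates on the probe succeeding once during startup; it does not make later containers wait on ongoing readiness, which matches the observation above.

```go
// Hedged sketch: per the upstream docs, a startupProbe on a restartable init
// container delays the remaining containers until the probe succeeds.
// Names, ports, and image tags here are illustrative, not from this PR.
package sidecarorder

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func sidecarPodSpec() v1.PodSpec {
	always := v1.ContainerRestartPolicyAlways
	return v1.PodSpec{
		InitContainers: []v1.Container{{
			Name:          "envoymesh",
			Image:         "lyft.net/envoysidecar:example-tag", // hypothetical tag
			RestartPolicy: &always,                             // marks it as a native sidecar
			// The kubelet only treats the sidecar as started once this probe
			// succeeds, so the remaining init and app containers are held
			// back until then.
			StartupProbe: &v1.Probe{
				ProbeHandler: v1.ProbeHandler{
					HTTPGet: &v1.HTTPGetAction{
						Path: "/ready",
						Port: intstr.FromInt32(9901),
					},
				},
				PeriodSeconds:    2,
				FailureThreshold: 30,
			},
		}},
		Containers: []v1.Container{{
			Name:  "pyexample2-service-pythonkv",
			Image: "lyft.net/pyexample2:example-tag", // hypothetical tag
		}},
	}
}
```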
