Watch stream missing job completion events #2238

Open
headyj opened this issue May 23, 2024 · 7 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@headyj

headyj commented May 23, 2024

What happened (please include outputs or screenshots):

Sometimes the watch stream seems to miss job completion events. This is not easy to reproduce, as two executions of the same code in a row might produce different results.

Here is the code, which watches the job status and prints the pod logs:

import logging
import sys

from kubernetes import client, config, watch

# namespace and jobName are provided by the surrounding script;
# the script runs inside the cluster, hence the in-cluster config.
config.load_incluster_config()
batchV1 = client.BatchV1Api()
coreV1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
  if event['object'].metadata.name == jobName:
    logging.info(event)
    if event['type'] == "ADDED":
      logging.info("job %s created, waiting for pod to be running...", jobName)
    if event["object"].status.ready:
      # Find the job's pod and stream its logs until the container exits
      pods = coreV1.list_namespaced_pod(namespace=namespace, label_selector="job-name={}".format(jobName))
      logging.info("pod %s is ready", pods.items[0].metadata.name)
      for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
        print(line.decode(), end='')
    if event["object"].status.succeeded:
      logging.info("Finished pod stream.")
      w.stop()
    if not event["object"].status.active and event["object"].status.failed:
      w.stop()
      logging.error("Job Failed")
      sys.exit(1)

Sometimes the script never ends, even though the watched job has completed. The script itself runs in the same Kubernetes cluster but in a different namespace. I tried several values for timeout_seconds, but it doesn't help; the last event received is the one where the job becomes active:

[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]

The events are correctly recorded on the Kubernetes side, as seen in k9s:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  35m   job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m   job-controller  Job completed

What you expected to happen:

The job completion event should be caught and delivered by the watch stream.

How to reproduce it (as minimally and precisely as possible):

Just run the above code in the python:3.12-slim Docker image. As mentioned above, the problem appears to be sporadic; I haven't been able to reproduce it another way yet, but I will update this ticket if I do.

Anything else we need to know?:

Environment:

  • Kubernetes version (kubectl version): v1.29 (EKS)
  • OS (e.g., MacOS 10.13.6): N/A
  • Python version (python --version): python:3.12-slim official Docker image (https://hub.docker.com/_/python)
  • Python client version (pip list | grep kubernetes): 29.0.0
@headyj headyj added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2024
@yliaog
Contributor

yliaog commented Jun 5, 2024

Thanks for reporting the issue; please update the ticket when you can reproduce it reliably.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 3, 2024
@leseb

leseb commented Oct 16, 2024

I realize this isn't a very helpful comment, but I've been encountering the same issue. It's exactly as described earlier—the events are being reported on the Kubernetes side, but the watch seems to be stuck. :/

It mostly happens with long-running jobs (though is 20 minutes really that long?). The issue is quite sporadic and difficult to reproduce consistently...

I'm starting to think that resetting the watcher every 5 minutes or so might be the way to go. I believe the events will still be captured, even if they're sent during the restart. It's not the ideal solution, but if anyone has a better suggestion, I'm open to it.

@headyj did you get past that problem?
Thanks!

@leseb

leseb commented Oct 16, 2024

Might be related to #869. I had to re-establish the watch each time I caught:

"Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

With that, I don't have any hanging watchers anymore.

@headyj
Author

headyj commented Oct 18, 2024

@leseb we are currently using a workaround by adding a retry:

w = watch.Watch()
timedOut = True
for i in range(maxRetry):
  for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=timeoutSeconds):
    [...]  # job handling as in the snippet above, clearing timedOut once the job completes
  if not timedOut:
    break
if timedOut:
  logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
  sys.exit(1)

@leseb

leseb commented Oct 18, 2024

@headyj thanks for your response, here's what I do:

import time

import urllib3
from kubernetes import watch

w = watch.Watch()
exit_flag = False
while not exit_flag:
  try:
    for event in w.stream(timeout_seconds=60):
      [...]
      if success:
        exit_flag = True

  # Catches the following error:
  # urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength
  except urllib3.exceptions.ProtocolError as e:
    logger.warning("Connection broken, reconnecting the watcher: %s", str(e))
    time.sleep(5)  # Backoff before retrying
  finally:
    w.stop()
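For completeness, here is a minimal sketch of the same reconnect loop that also remembers the last seen resourceVersion, so that job events emitted while the watch is being re-established are not skipped. This is only an illustration of the general Kubernetes watch semantics, not code from this issue: batchV1, namespace, and the hypothetical handle_event callback (assumed to return True once the job is finished) stand in for the job-handling logic above.

import logging
import time

import urllib3
from kubernetes import watch
from kubernetes.client.exceptions import ApiException

def watch_job_until_done(batchV1, namespace, handle_event):
  resource_version = None
  while True:
    w = watch.Watch()
    kwargs = {"namespace": namespace, "timeout_seconds": 60}
    if resource_version:
      kwargs["resource_version"] = resource_version
    try:
      for event in w.stream(func=batchV1.list_namespaced_job, **kwargs):
        # Remember where we are in the event stream so a reconnect can resume here.
        resource_version = event["object"].metadata.resource_version
        if handle_event(event):
          return
    except urllib3.exceptions.ProtocolError as e:
      logging.warning("Connection broken, reconnecting the watcher: %s", str(e))
      time.sleep(5)  # Backoff before re-establishing the watch
    except ApiException as e:
      if e.status == 410:
        # The stored resourceVersion has expired; restart from the current state.
        resource_version = None
      else:
        raise
    finally:
      w.stop()

If the 60-second server-side timeout expires without the job finishing, the loop simply re-establishes the watch from the stored resourceVersion, which is essentially the periodic reset suggested earlier in this thread.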
