Watch stream missing job completion events #2238

Open
headyj opened this issue May 23, 2024 · 7 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@headyj

headyj commented May 23, 2024

What happened (please include outputs or screenshots):

Sometimes the watch stream seems to miss job completion events. This is not easy to reproduce, as two executions of the same code in a row might produce different results.

Here is the code, which watches the job status and prints the pod logs:

import logging
import sys

from kubernetes import client, config, watch

# namespace and jobName are provided by the surrounding script;
# the script runs inside the cluster, hence the in-cluster config.
config.load_incluster_config()
batchV1 = client.BatchV1Api()
coreV1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
  if event['object'].metadata.name == jobName:
    logging.info(event)
    if event['type'] == "ADDED":
      logging.info("job %s created, waiting for pod to be running...", jobName)
    if event["object"].status.ready:
      # Find the job's pod and stream its logs until the container exits
      pods = coreV1.list_namespaced_pod(namespace=namespace, label_selector="job-name={}".format(jobName))
      logging.info("pod %s is ready", pods.items[0].metadata.name)
      for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
        print(line.decode(), end='')
    if event["object"].status.succeeded:
      logging.info("Finished pod stream.")
      w.stop()
    if not event["object"].status.active and event["object"].status.failed:
      w.stop()
      logging.error("Job Failed")
      sys.exit(1)

Sometimes the script never ends, even though the watched job has completed. The script itself runs in the same Kubernetes cluster but in a different namespace. I tried several values for timeout_seconds, but it doesn't help; the last event received is the one where the job becomes active:

[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]

The events are correctly recorded on the Kubernetes side, as seen in k9s:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  35m   job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m   job-controller  Job completed

What you expected to happen:

The job completion event should be caught and delivered by the watch stream.

How to reproduce it (as minimally and precisely as possible):

Just run the above code in the python:3.12-slim Docker image. As mentioned above, the problem appears to be sporadic; I haven't been able to reproduce it another way yet, but I will update this ticket if I do.

Anything else we need to know?:

Environment:

  • Kubernetes version (kubectl version): v1.29 (EKS)
  • OS (e.g., MacOS 10.13.6): N/A
  • Python version (python --version): python:3.12-slim official Docker image (https://hub.docker.com/_/python)
  • Python client version (pip list | grep kubernetes): 29.0.0
@headyj headyj added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2024
@yliaog
Contributor

yliaog commented Jun 5, 2024

Thanks for reporting the issue; please update the ticket when you can reproduce it reliably.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 3, 2024
@leseb

leseb commented Oct 16, 2024

I realize this isn't a very helpful comment, but I've been encountering the same issue. It's exactly as described earlier—the events are being reported on the Kubernetes side, but the watch seems to be stuck. :/

It mostly happens with long-running jobs (though is 20 minutes really that long?). The issue is quite sporadic and difficult to reproduce consistently...

I'm starting to think that resetting the watcher every 5 minutes or so might be the way to go. I believe the events will still be captured, even if they're sent during the restart. It's not the ideal solution, but if anyone has a better suggestion, I'm open to it.

@headyj did you get past that problem?
Thanks!

@leseb

leseb commented Oct 16, 2024

Might be related to #869. I had to re-establish the watch each time I caught:

"Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

With that, I don't have any hanging watchers anymore.

@headyj
Author

headyj commented Oct 18, 2024

@leseb we are currently using a workaround by adding a retry:

w = watch.Watch()
timedOut = True
for i in range(maxRetry):
  for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=timeoutSeconds):
    [...]  # job handling as in the snippet above, clearing timedOut once the job completes
  if not timedOut:
    break
if timedOut:
  logging.error("Watch stream timed out after %s seconds", maxRetry * timeoutSeconds)
  sys.exit(1)

@leseb

leseb commented Oct 18, 2024

@headyj thanks for your response, here's what I do:

import time

import urllib3
from kubernetes import watch

w = watch.Watch()
exit_flag = False
while not exit_flag:
  try:
    for event in w.stream(timeout_seconds=60):
      [...]
      if success:
        exit_flag = True

  # Catches the following error:
  # urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength
  except urllib3.exceptions.ProtocolError as e:
    logger.warning("Connection broken, reconnecting the watcher: %s", str(e))
    time.sleep(5)  # Backoff before retrying
  finally:
    w.stop()
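For completeness, here is a minimal sketch of the same reconnect loop that also remembers the last seen resourceVersion, so that job events emitted while the watch is being re-established are not skipped. This is only an illustration of the general Kubernetes watch semantics, not code from this issue: batchV1, namespace, and the hypothetical handle_event callback (assumed to return True once the job is finished) stand in for the job-handling logic above.

import logging
import time

import urllib3
from kubernetes import watch
from kubernetes.client.exceptions import ApiException

def watch_job_until_done(batchV1, namespace, handle_event):
  resource_version = None
  while True:
    w = watch.Watch()
    kwargs = {"namespace": namespace, "timeout_seconds": 60}
    if resource_version:
      kwargs["resource_version"] = resource_version
    try:
      for event in w.stream(func=batchV1.list_namespaced_job, **kwargs):
        # Remember where we are in the event stream so a reconnect can resume here.
        resource_version = event["object"].metadata.resource_version
        if handle_event(event):
          return
    except urllib3.exceptions.ProtocolError as e:
      logging.warning("Connection broken, reconnecting the watcher: %s", str(e))
      time.sleep(5)  # Backoff before re-establishing the watch
    except ApiException as e:
      if e.status == 410:
        # The stored resourceVersion has expired; restart from the current state.
        resource_version = None
      else:
        raise
    finally:
      w.stop()

If the 60-second server-side timeout expires without the job finishing, the loop simply re-establishes the watch from the stored resourceVersion, which is essentially the periodic reset suggested earlier in this thread.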
