Watch stream missing job completion events #2238
Comments
Thanks for reporting the issue; please update the ticket when you can reproduce it reliably.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
I realize this isn't a very helpful comment, but I've been encountering the same issue. It's exactly as described earlier: the events are being reported on the Kubernetes side, but the watch stream never receives them. It mostly happens with long-running jobs (though is 20 minutes really that long?). The issue is quite sporadic and difficult to reproduce consistently. I'm starting to think that resetting the watcher every 5 minutes or so might be the way to go; I believe the events will still be captured, even if they're sent during the restart. It's not the ideal solution, but if anyone has a better suggestion, I'm open to it. @headyj did you get past that problem?
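A minimal sketch of that periodic-reset idea, assuming the standard `kubernetes` Python client (the job name and namespace below are placeholders, not from this thread). Re-reading the Job's status directly after each watch restart means a completion event that fired while the watcher was down is still picked up:

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # assumption: running in-cluster, as in the report
batch_v1 = client.BatchV1Api()

JOB_NAME, NAMESPACE = "my-job", "default"  # placeholder names

def job_complete() -> bool:
    # Re-read the Job directly, so a completion missed by the stream
    # is still detected between watch restarts.
    job = batch_v1.read_namespaced_job_status(JOB_NAME, NAMESPACE)
    return any(c.type == "Complete" and c.status == "True"
               for c in (job.status.conditions or []))

while not job_complete():
    w = watch.Watch()
    # timeout_seconds=300 asks the server to end the stream after ~5 minutes,
    # so the outer loop re-establishes a fresh watch regularly.
    for event in w.stream(batch_v1.list_namespaced_job,
                          namespace=NAMESPACE,
                          field_selector=f"metadata.name={JOB_NAME}",
                          timeout_seconds=300):
        job = event["object"]
        if any(c.type == "Complete" and c.status == "True"
               for c in (job.status.conditions or [])):
            w.stop()  # ends the stream; the while condition then exits the loop
```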
Might be related to #869. I had to re-establish the watch each time I caught:

With that, I don't have any hanging watchers anymore.
@leseb we are currently working around this by adding a retry:
@headyj thanks for your response, here's what I do:

```python
import logging
import time

import urllib3

logger = logging.getLogger(__name__)

# `w` is a kubernetes.watch.Watch instance; the stream's list-function
# argument, the event handling, and where `success` gets set are elided
# here, as in the original comment.
exit_flag = False
while not exit_flag:
    try:
        for event in w.stream(timeout_seconds=60):
            [...]
            if success:
                exit_flag = True
    # Catches the following error:
    # urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength
    except urllib3.exceptions.ProtocolError as e:
        logger.warning("Connection broken, reconnecting the watcher: %s", str(e))
        time.sleep(5)  # Backoff before retrying
    finally:
        w.stop()
```
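One possible refinement to the retry above (an assumption on my side, not something confirmed in this thread): track the last seen `resource_version` so events delivered during the back-off sleep are not skipped on reconnect. Resuming from a version the server has already expired surfaces as an HTTP 410, which recent versions of the client raise as an `ApiException`; the fallback is a fresh list-and-watch:

```python
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_incluster_config()  # assumption: running in-cluster
batch_v1 = client.BatchV1Api()

resource_version = None  # resume point carried across reconnects
while True:  # sketch only: real code would add an exit condition
    w = watch.Watch()
    try:
        for event in w.stream(batch_v1.list_namespaced_job,
                              namespace="default",  # placeholder namespace
                              resource_version=resource_version,
                              timeout_seconds=60):
            # Remember how far we got so the next connect resumes without a gap.
            resource_version = event["object"].metadata.resource_version
    except ApiException as e:
        if e.status == 410:  # saved resource_version expired on the server
            resource_version = None  # fall back to a fresh list+watch
        else:
            raise
    finally:
        w.stop()
```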
What happened (please include outputs or screenshots):
Sometimes the watch stream seems to be missing job completion events. This is not easy to reproduce, as two executions of the same code in a row might produce different results.
Here is the code, which watches a job's status and prints its logs:
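(The snippet is not reproduced above; what follows is a minimal reconstruction under assumptions, not the reporter's exact code: the standard `kubernetes` client, placeholder job name and namespace, completion detected via the Job's `Complete` condition, and pod logs fetched through the `job-name` label that the Job controller sets on its pods.)

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # the script runs inside the cluster
batch_v1 = client.BatchV1Api()
core_v1 = client.CoreV1Api()

JOB_NAME, NAMESPACE = "my-job", "jobs"  # placeholder names

w = watch.Watch()
for event in w.stream(batch_v1.list_namespaced_job,
                      namespace=NAMESPACE,
                      field_selector=f"metadata.name={JOB_NAME}",
                      timeout_seconds=600):
    job = event["object"]
    print(f"{event['type']}: active={job.status.active} "
          f"succeeded={job.status.succeeded} failed={job.status.failed}")
    if any(c.type == "Complete" and c.status == "True"
           for c in (job.status.conditions or [])):
        # The Job controller labels its pods with job-name=<name>.
        pods = core_v1.list_namespaced_pod(
            NAMESPACE, label_selector=f"job-name={JOB_NAME}")
        for pod in pods.items:
            print(core_v1.read_namespaced_pod_log(pod.metadata.name, NAMESPACE))
        w.stop()  # completion seen: end the stream and the script
```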
Sometimes the script never ends, even when the watched job is completed. The script itself is executed in the same Kubernetes cluster but in a different namespace. I tried setting multiple values for `timeout_seconds`, but it doesn't help: the last event received is the one where the job becomes active. The event is correctly updated on the Kubernetes side, checking in k9s:
What you expected to happen:
The job completion event should be caught and sent through the watch stream.
How to reproduce it (as minimally and precisely as possible):
Just run the above code in the python:3.12-slim Docker image. As said above, the problem seems to be sporadic; I wasn't able to reproduce it any other way yet, but I will update this ticket if I do.
Anything else we need to know?:
Environment:
- Kubernetes version (`kubectl version`): v1.29 (EKS)
- Python version (`python --version`): python 3.12-slim official Docker image (https://hub.docker.com/_/python)
- Python client version (`pip list | grep kubernetes`): 29.0.0