Linkerd proxy fails to connect to other proxy #12720
Comments
We enabled DEBUG logs on the client proxy: I captured 1000 lines before any mention of These seem suspect:
But it's only one of the client replicas having this issue. Others are sending requests just fine. Also, these logs seem like they would be good to have in INFO; it's difficult for us to toggle DEBUG logs on because they log user tokens, etc.
I'm afraid I can't give you a lot of guidance here without having a reproducible scenario. We have some docs about failfast and 503 and 504 errors that might help. What's special about the client that is causing these issues? Is it generating so many requests that the resources allocated to the proxy aren't enough? You could also look at the proxy metrics for the problematic pod and see how they compare to the healthy ones.
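One way to do the metrics comparison suggested above is sketched below. It assumes you have saved the proxy metrics of a healthy pod and of the problematic pod as Prometheus text-format dumps (like the gists linked in this issue), that samples carry no trailing timestamps, and that series names and labels match between the two pods. The function names, the `min_ratio` threshold, and the metric names in the sample data are all hypothetical and are not taken from Linkerd.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    Assumes each sample line is '<name>{labels} <value>' with no timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; label values may
        # themselves contain spaces, so split from the right.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # not a numeric sample; ignore it
    return samples


def compare(healthy, suspect, min_ratio=2.0):
    """Return (series, healthy_value, suspect_value) rows where the suspect
    pod's value is at least min_ratio times the healthy pod's value.

    min_ratio is an arbitrary illustrative threshold. Series present in
    only one of the two dumps are skipped.
    """
    rows = []
    for series, bad in suspect.items():
        good = healthy.get(series)
        if good is None:
            continue
        # max(good, 1.0) keeps tiny or zero baselines from flagging noise.
        if bad >= min_ratio * max(good, 1.0):
            rows.append((series, good, bad))
    # Largest absolute difference first.
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)


# Hypothetical sample data; real dumps would be read from files.
healthy_dump = (
    '# HELP example_errors_total illustrative counter\n'
    'example_errors_total{peer="dst"} 3\n'
    'example_requests_total 100\n'
)
suspect_dump = (
    'example_errors_total{peer="dst"} 40\n'
    'example_requests_total 110\n'
)

outliers = compare(parse_metrics(healthy_dump), parse_metrics(suspect_dump))
for series, good, bad in outliers:
    print(f"{series}: healthy={good} suspect={bad}")
```

Counters grow with pod age, so a younger pod naturally has lower totals than an older one; a series that is much higher on the younger pod is therefore more notable than the raw ratio suggests.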
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
After upgrading to 2024.5.5, we have seen an increase in 503/504 timeouts, though seemingly only for a single (client) pod. It shows as logs like this:
It fails to connect, then goes into fail-fast mode. This is only happening on a single (client) pod out of 5 total replicas. This is the youngest replica at <1d old; others are ~5 days old. Other replicas don't have any issue connecting to this service (even while this one is having a problem). Also, there are requests that do successfully make it from this problematic pod occasionally.
It's also not just this service (00dfd97) that it is having trouble connecting to. The same is happening with ~20 others as well.
This happened yesterday. I removed the bad pod, but the new one that was created had the same issue. After I removed the pod a second time, the replacement seemed to work, but only for a few hours.
How can it be reproduced?
I don't know how to reproduce it. It seems to just happen after some time.
Logs, error output, etc
controller metrics: https://gist.github.com/andrewdinunzio/022c6db28b347cc333402af3092ac18a
(problematic client) proxy metrics: https://gist.github.com/andrewdinunzio/9597687d904d3ffe8cd30489c1568fb6
(server) proxy metrics: https://gist.github.com/andrewdinunzio/29ed81d33f915596a4ba72bf9a97db5a
Linkerd-init logs (client):
Policy controller logs:
(looks like a lot of errors but we see similar logs in other regions that are not exhibiting this problem)
output of linkerd check -o short
Environment
AKS
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None