etcd troubleshooting docs don't mention the most common issue #1404

Open
Tejeev opened this issue Jul 25, 2024 · 0 comments

Summary

https://ranchermanager.docs.rancher.com/v2.8/troubleshooting/kubernetes-components/troubleshooting-etcd-nodes does not mention "took too long" warnings. In my experience with our customers, slow etcd I/O is the most common cause of etcd issues.

I believe it's also worth adding a section on gRPC errors, since those come up fairly often.

Details

Slow etcd performance (performance testing and optimization) (000020100)
Tuning etcd for large installs

What does the etcd warning “apply entries took too long” mean?
After a majority of etcd members agree to commit a request, each etcd server applies the request to its data store and persists the result to disk. Even with a slow mechanical disk or a virtualized network disk, such as Amazon’s EBS or Google’s PD, applying a request should normally take fewer than 50 milliseconds. If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply.
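As a rough illustration of the two thresholds quoted above, here is a minimal Python sketch that checks a set of apply durations against them. The function name, the constants, and the sample numbers are hypothetical, and this is not how etcd computes the warning internally; it only restates the 50 ms and 100 ms figures from the FAQ.

```python
from statistics import mean
from typing import Dict, Sequence

# Thresholds taken from the FAQ text above, expressed in seconds.
EXPECTED_MAX_APPLY_S = 0.050   # a single apply should normally finish faster than this
WARN_AVG_APPLY_S = 0.100       # an average above this is described as triggering the warning


def summarize_apply_durations(durations_s: Sequence[float]) -> Dict[str, object]:
    """Summarize apply durations (seconds) against the FAQ thresholds.

    Hypothetical helper for illustration only; etcd's own bookkeeping may differ.
    """
    if not durations_s:
        raise ValueError("need at least one duration")
    average = mean(durations_s)
    return {
        "average_s": average,
        # How many individual applies were slower than the "normal" figure.
        "slower_than_expected": sum(1 for d in durations_s if d >= EXPECTED_MAX_APPLY_S),
        # Whether the average crosses the warning threshold.
        "would_warn": average > WARN_AVG_APPLY_S,
    }


if __name__ == "__main__":
    # Made-up sample values, not measurements from a real cluster.
    print(summarize_apply_durations([0.020, 0.030, 0.040]))
    print(summarize_apply_durations([0.080, 0.120, 0.160]))
```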

Usually this issue is caused by a slow disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk as the cause of this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem.
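To make the p99 check concrete, here is a sketch of estimating a quantile from cumulative histogram buckets and comparing it with the 25 ms guideline. It assumes the buckets for backend_commit_duration_seconds have already been scraped into `(upper_bound_seconds, cumulative_count)` pairs. The helper names and the sample counts are hypothetical, and the linear interpolation is an approximation in the spirit of Prometheus's `histogram_quantile`, not an exact reimplementation.

```python
import math
from typing import List, Sequence, Tuple

P99_GUIDELINE_S = 0.025  # "p99 duration should be less than 25ms", per the FAQ


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered: List[Tuple[float, float]] = sorted(buckets)
    if not ordered:
        raise ValueError("need at least one bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate into.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def disk_is_fast_enough(buckets: Sequence[Tuple[float, float]]) -> bool:
    """True when the estimated p99 commit duration is under the guideline."""
    return histogram_quantile(0.99, buckets) < P99_GUIDELINE_S


if __name__ == "__main__":
    # Made-up cumulative counts, not data from a real cluster.
    sample = [(0.001, 0), (0.002, 10), (0.004, 500), (0.008, 900),
              (0.016, 980), (0.032, 995), (0.064, 1000), (math.inf, 1000)]
    print(f"p99 ~ {histogram_quantile(0.99, sample) * 1000:.1f} ms")
    print("disk fast enough:", disk_is_fast_enough(sample))
```

With real data, a p99 that stays above the guideline over a sustained window is a stronger signal than a single scrape.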

The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to a dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process to a higher priority can usually solve the problem.
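For a quick first look at CPU pressure on an etcd node, a sketch like the following compares the 1-minute load average with the number of CPUs. The function name and the threshold of 1.0 load per core are assumptions for illustration, and load average is only a coarse proxy; it does not replace proper CPU utilization monitoring.

```python
import os


def cpu_looks_saturated(load_1m: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Rough heuristic: True when load per core exceeds the threshold.

    The default threshold is an assumed rule of thumb, not an etcd recommendation.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_1m / cpu_count > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1m, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1m load {load_1m:.2f} on {cpus} CPUs, saturated: "
          f"{cpu_looks_saturated(load_1m, cpus)}")
```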

Expensive user requests which access too many keys (e.g., fetching the entire keyspace) can also cause long apply latencies. Accessing fewer than several hundred keys per request, however, should always be performant.

If none of the above suggestions clear the warnings, please open an issue with detailed logging, monitoring, metrics and optionally workload information.

https://etcd.io/docs/v3.1/faq/
