
Mitigate the performance impact on tikv slow #51585

Open
cfzjywxk opened this issue Mar 7, 2024 · 1 comment

Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@cfzjywxk (Contributor) commented Mar 7, 2024

This is the all-in-one document about improvements for mitigating the performance impact when TiKV is slow, including restarts and slow TiKV instances (disk I/O jitter, hangs, overload, etc.). From this issue, you can track all related problems, bug fixes, and improvement/enhancement tasks.

Background

Challenges in stability: when one or more TiKV instances in a large-scale cluster encounter issues or slow down, what impact does that have on the overall performance of the TiDB cluster?
Assuming we have 100 TiKV nodes and 1 of them encounters an issue, we typically assume that the overall performance impact is no more than 1/100, but in reality this is not the case at all.

From many production environment issues, we have found that as the cluster size increases, the performance impact of a single problematic TiKV node on the entire cluster far exceeds the assumption mentioned above. The reasons for the significant impact can be categorized as follows:

  1. Bugs, such as the schema cache not caching historical schema versions as expected, resulting in requests penetrating to TiKV.
  2. Reaching the capacity boundaries of the existing implementation, such as the impact on overall performance when a TiKV node hosting meta regions fails.
  3. Constraints of the architecture itself.

Therefore, improving the overall stability and resilience of TiDB essentially requires:

  1. Improving quality and addressing bugs.
  2. Optimizing implementations to bring TiDB's resilience capabilities closer to the upper limits of architectural design constraints.

This tracking issue focuses on and consolidates the second point.

From an end-to-end perspective, when some TiKV nodes fail, the speed of failover depends on the critical paths of both the kv-client and TiKV. Taking a top-down perspective, and combining the known user issues and pain points encountered so far, we examine the current state of each component related to the kv-client and TiKV and the improvements planned for it.

Region Cache Related

Problems:

  • The region information is not updated in time, causing unexpected cross-AZ traffic (see the sketch at the end of this section)
  • Region reload causes significant pressure on PD
  • The interface has multiple usage patterns and is tightly coupled with surrounding modules, making it difficult to maintain

Tracking issue:
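
As a rough illustration of the direction (all names below are hypothetical, not client-go's actual API), a region cache can serve possibly stale routes immediately and refresh them asynchronously, deduplicating concurrent reloads of the same region so that stale routing is corrected quickly without flooding PD:

```go
package regioncache

import (
	"sync"
	"time"
)

// RegionInfo is a minimal, hypothetical view of a cached region route.
type RegionInfo struct {
	ID        uint64
	LeaderIdx int
	LoadedAt  time.Time
}

// Cache keeps region routes and schedules asynchronous reloads,
// deduplicating in-flight reloads for the same region to avoid
// hammering PD when many requests hit the same stale entry.
type Cache struct {
	mu        sync.Mutex
	regions   map[uint64]*RegionInfo
	reloading map[uint64]struct{}
	ttl       time.Duration
	loadFn    func(id uint64) (*RegionInfo, error) // e.g. a PD lookup
}

func NewCache(ttl time.Duration, loadFn func(uint64) (*RegionInfo, error)) *Cache {
	return &Cache{
		regions:   make(map[uint64]*RegionInfo),
		reloading: make(map[uint64]struct{}),
		ttl:       ttl,
		loadFn:    loadFn,
	}
}

// Get returns the cached route immediately and, if the entry is stale,
// triggers a single background reload instead of blocking the request.
func (c *Cache) Get(id uint64) *RegionInfo {
	c.mu.Lock()
	defer c.mu.Unlock()
	r := c.regions[id]
	if r == nil || time.Since(r.LoadedAt) > c.ttl {
		if _, inflight := c.reloading[id]; !inflight {
			c.reloading[id] = struct{}{}
			go c.reload(id)
		}
	}
	return r // may be nil or stale; the caller falls back to PD on a miss
}

func (c *Cache) reload(id uint64) {
	info, err := c.loadFn(id)
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.reloading, id)
	if err == nil && info != nil {
		info.LoadedAt = time.Now()
		c.regions[id] = info
	}
}
```

Deduplicating in-flight reloads is what keeps PD load bounded when many requests hit the same stale region at once.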

Replica Selection & TiKV Error Handling & Retry Related

Problems:

  • Inappropriate error handling with unexpected backoff/retry, leading to slow recovery or timeout errors
  • The "tikv slow" information is not being utilized, leading to ineffective retries and resource wastage
  • Insufficient unit test coverage for state transitions, resulting in complex code state machines that are difficult to maintain
  • Lack of stability test baselines to measure the performance and stability of replica selection, error handling, and retry mechanisms

Tracking issue:
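
A minimal sketch of the intended direction, assuming a per-store slow score is fed back from TiKV (the names and error kinds below are hypothetical, not the actual client-go state machine): prefer healthy replicas, and back off only when waiting can actually help.

```go
package replicaselect

// storeState is a hypothetical per-store record kept by the kv-client.
type storeState struct {
	slowScore int // higher means slower; fed back from TiKV health info
}

// pickReplica chooses the next peer to try: it prefers the leader unless the
// leader's store is flagged as slow, in which case it falls back to the
// healthiest follower instead of retrying the slow store.
func pickReplica(leader int, stores []storeState, slowThreshold int) int {
	if stores[leader].slowScore < slowThreshold {
		return leader
	}
	best, bestScore := leader, stores[leader].slowScore
	for i, s := range stores {
		if s.slowScore < bestScore {
			best, bestScore = i, s.slowScore
		}
	}
	return best
}

// shouldBackoff distinguishes errors where waiting helps (the store may
// recover shortly) from errors where the client should immediately retry
// another replica or refresh its region cache.
func shouldBackoff(errKind string) bool {
	switch errKind {
	case "server_busy", "region_unavailable":
		return true // waiting gives TiKV a chance to recover
	default:
		return false // e.g. not-leader / epoch mismatch: switch peer or refresh instead
	}
}
```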

Enabling TiKV Slow Score By Default

Problems:

  • Raft log write I/O jitter may have a significant impact on user queries (a conceptual slow-score sketch follows below)

Tracking issue:
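
For intuition only, a conceptual slow-score sketch: timed-out disk inspections push a bounded score up, and healthy rounds let it decay. TiKV's real slow-score algorithm differs in its details; everything here is illustrative.

```go
package slowscore

// SlowScore is a conceptual per-store slowness indicator in the range
// [1, 100]: inspection timeouts raise it multiplicatively, healthy rounds
// let it decay. Not TiKV's exact algorithm.
type SlowScore struct {
	value float64
}

func New() *SlowScore { return &SlowScore{value: 1} }

// Observe records one inspection round, given the fraction of disk
// inspections that timed out in that round.
func (s *SlowScore) Observe(timedOutRatio float64) {
	if timedOutRatio > 0 {
		s.value *= 1 + 9*timedOutRatio // grow faster when more inspections time out
	} else {
		s.value *= 0.8 // decay when the disk looks healthy
	}
	if s.value < 1 {
		s.value = 1
	}
	if s.value > 100 {
		s.value = 100
	}
}

// Value can be reported to PD so that leaders are evicted from stores
// that stay slow for a long time.
func (s *SlowScore) Value() float64 { return s.value }
```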

Building A Unified Health Controller And Feedback Mechanism

Problems:

  • Slowness information cannot currently be detected by the kv-client; exposing it would help the kv-client decide peer selection and avoid wasting resources (see the sketch at the end of this section)

Tracking issue:
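
A sketch of what a unified feedback path could look like on the client side, assuming TiKV piggybacks a health record on responses or pushes it periodically (all types below are hypothetical): the kv-client keeps the latest record per store and the replica selector consults it.

```go
package healthfeedback

import (
	"sync"
	"time"
)

// Feedback is a hypothetical health record a TiKV store could attach to
// responses so the kv-client learns about slowness without waiting for
// request timeouts.
type Feedback struct {
	StoreID   uint64
	SlowScore int
	SeenAt    time.Time
}

// Controller aggregates the latest feedback per store on the client side.
type Controller struct {
	mu     sync.RWMutex
	stores map[uint64]Feedback
}

func NewController() *Controller {
	return &Controller{stores: make(map[uint64]Feedback)}
}

// Update is called whenever a response carries fresh health information.
func (c *Controller) Update(fb Feedback) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stores[fb.StoreID] = fb
}

// IsSlow tells the replica selector whether to deprioritize a store.
// Stale feedback is ignored so a recovered store is not penalized forever.
func (c *Controller) IsSlow(storeID uint64, threshold int, maxAge time.Duration) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fb, ok := c.stores[storeID]
	return ok && time.Since(fb.SeenAt) <= maxAge && fb.SlowScore >= threshold
}
```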

Warmup Before PD Heartbeat And Leader Movement

Problems:

  • A TiKV node may be asked to handle requests before it has warmed up, causing latency spikes (issue). The PD store heartbeat could instead be sent only after log applying and warm-up finish on the restarted TiKV node (see the sketch at the end of this section).
  • A TiKV node may be busy applying raft logs after a network partition; scheduling leader peers to the just-started node may cause high write latency because of apply wait (issue).

Tracking issue:
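
A sketch of the intended startup ordering (the Store interface below is hypothetical): finish applying pending raft logs and warming up caches before the store heartbeats to PD, so PD does not schedule leaders onto a cold node.

```go
package startup

import "context"

// Store is a hypothetical interface for the startup steps discussed above.
type Store interface {
	ApplyPendingRaftLogs(ctx context.Context) error // catch up on committed logs
	WarmUpCaches(ctx context.Context) error         // e.g. populate block cache
	StartPDHeartbeat(ctx context.Context) error     // only now become a scheduling target
}

// BootSequence orders the steps so PD only sees the store as ready (and may
// start moving leaders onto it) after the expensive catch-up work is done.
func BootSequence(ctx context.Context, s Store) error {
	if err := s.ApplyPendingRaftLogs(ctx); err != nil {
		return err
	}
	if err := s.WarmUpCaches(ctx); err != nil {
		return err
	}
	return s.StartPDHeartbeat(ctx)
}
```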

Enabling Async-IO By Default

image

Problems:

  • Raft log I/O jitter has a significant impact on the raftstore loop; using async-io helps mitigate the impact of the I/O jitter (see the sketch at the end of this section)

Tracking issue:
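
The referenced tikv/tikv commit enables async-io by changing `raftstore.store-io-pool-size` from 0 to 1, i.e. raft log writes are handed to a dedicated I/O worker instead of being performed inline in the store loop. The Go sketch below only illustrates the decoupling pattern (channel hand-off to I/O workers); it is not TiKV's Rust implementation, and all names are made up.

```go
package asyncio

// writeTask is a batch of raft log entries handed off to an I/O worker.
type writeTask struct {
	regionID uint64
	entries  [][]byte
	done     chan error // the store loop is notified asynchronously
}

// StoreLoop sketches the decoupling: instead of syncing to disk inline
// (which stalls every region on one slow write), the loop enqueues the
// task and keeps processing other messages.
func StoreLoop(inbox <-chan writeTask, ioWorkers chan<- writeTask) {
	for task := range inbox {
		ioWorkers <- task // hand off; do not block on disk here
	}
}

// IOWorker drains write tasks and performs the blocking disk writes, so that
// I/O jitter is absorbed by the worker pool rather than the store loop.
func IOWorker(tasks <-chan writeTask, syncToDisk func(writeTask) error) {
	for task := range tasks {
		task.done <- syncToDisk(task)
	}
}
```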

Allow Log Apply When The Quorum Has Formed

image

Problems:

  • The leader cannot advance write request processing (applying logs) even though the logs have already been committed by a majority of the replicas, resulting in a significant write-latency impact from a single EBS I/O jitter (see the sketch at the end of this section)

Tracking issue:
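
Conceptually, a raft log entry is committed once it is durable on a majority of replicas, and that majority does not have to include the leader's own (slow) disk; the leader can then apply the entry from its in-memory log. A small sketch of the commit-index computation (illustrative only, not TiKV's raft-rs code):

```go
package quorum

import "sort"

// CommittedIndex returns the highest log index persisted on a majority of
// replicas. persisted[i] is the highest durable index reported by replica i,
// with the leader's own entry included. Because the majority can be formed
// by followers alone, a leader whose local disk is slow can still advance
// the commit index and apply entries.
func CommittedIndex(persisted []uint64) uint64 {
	idx := append([]uint64(nil), persisted...)
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	quorum := len(idx)/2 + 1
	return idx[quorum-1] // the quorum-th largest persisted index
}
```

For example, with three replicas where the leader has persisted up to index 90 but both followers have persisted up to 100, CommittedIndex returns 100, so the leader may apply through index 100 without waiting for its local fsync.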

Avoid IO Operations In The Store Loop

Problems:

  • IO operations should be avoided in the store loop as much as possible (the async-io sketch above illustrates one way to move them off the loop)

Tracking issue:

cfzjywxk added the type/enhancement label Mar 7, 2024
ti-chi-bot bot pushed a commit to tikv/tikv that referenced this issue Mar 7, 2024
close #16614, ref pingcap/tidb#51585

Enable `async-io` by default with changing the setting `raftstore.store-io-pool-size` from 0 to 1.

Signed-off-by: lucasliang <[email protected]>
@gengliqi (Contributor) commented Mar 7, 2024

FYI: The last two figures come from this PPT which has more explanation.

dbsid pushed a commit to dbsid/tikv that referenced this issue Mar 24, 2024
close tikv#16614, ref pingcap/tidb#51585

Enable `async-io` by default with changing the setting `raftstore.store-io-pool-size` from 0 to 1.

Signed-off-by: lucasliang <[email protected]>
Signed-off-by: dbsid <[email protected]>