
Mitigate the performance impact on tikv slow #51585

Open
cfzjywxk opened this issue Mar 7, 2024 · 1 comment

Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@cfzjywxk (Contributor) commented Mar 7, 2024

This is the all-in-one document about improvements for mitigating the performance impact when TiKV is slow, including restarts and slow TiKV instances (disk I/O jitter, hangs, overload, etc.). From this issue, you can track all related problems, bug fixes, and improvement/enhancement tasks.

Background

Challenges in stability: when one or more TiKV instances in a large-scale cluster encounter issues or slow down, what impact does that have on the overall performance of the TiDB cluster?
Assuming we have 100 TiKV nodes and 1 of them encounters an issue, we typically assume that the overall performance impact is no more than 1/100, but in reality this is not the case at all.

From many production environment issues, we have found that as the cluster size increases, the performance impact of a single problematic TiKV node on the entire cluster far exceeds the assumption mentioned above. The reasons for the significant impact can be categorized as follows:

  1. Bugs, such as the schema cache not caching historical schema versions as expected, resulting in requests penetrating to TiKV.
  2. Reaching the capacity boundaries of the existing implementation, such as the impact on overall performance when a TiKV node hosting meta regions fails.
  3. Constraints of the architecture itself.

Therefore, improving the overall stability and resilience of TiDB essentially requires:

  1. Improving quality and addressing bugs.
  2. Optimizing implementations to bring TiDB's resilience capabilities closer to the upper limits of architectural design constraints.

This tracking issue focuses on and consolidates the second point.

From an end-to-end perspective, when some TiKV nodes fail, the speed of failover depends on the critical paths of both the kv-client and TiKV. Taking a top-down perspective, and combining the known user issues and pain points encountered so far, we examine the current state of each component related to the kv-client and TiKV and the improvements planned for it.

Region Cache Related

Problems:

  • The region information is not updated in time, causing unexpected cross-AZ traffic (see the sketch at the end of this section)
  • Region reload causes significant pressure on PD
  • The interface has multiple usage patterns and is tightly coupled with surrounding modules, making it difficult to maintain

Tracking issue:
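
As a rough illustration of the direction (all names below are hypothetical, not client-go's actual API), a region cache can serve possibly stale routes immediately and refresh them asynchronously, deduplicating concurrent reloads of the same region so that stale routing is corrected quickly without flooding PD:

```go
package regioncache

import (
	"sync"
	"time"
)

// RegionInfo is a minimal, hypothetical view of a cached region route.
type RegionInfo struct {
	ID        uint64
	LeaderIdx int
	LoadedAt  time.Time
}

// Cache keeps region routes and schedules asynchronous reloads,
// deduplicating in-flight reloads for the same region to avoid
// hammering PD when many requests hit the same stale entry.
type Cache struct {
	mu        sync.Mutex
	regions   map[uint64]*RegionInfo
	reloading map[uint64]struct{}
	ttl       time.Duration
	loadFn    func(id uint64) (*RegionInfo, error) // e.g. a PD lookup
}

func NewCache(ttl time.Duration, loadFn func(uint64) (*RegionInfo, error)) *Cache {
	return &Cache{
		regions:   make(map[uint64]*RegionInfo),
		reloading: make(map[uint64]struct{}),
		ttl:       ttl,
		loadFn:    loadFn,
	}
}

// Get returns the cached route immediately and, if the entry is stale,
// triggers a single background reload instead of blocking the request.
func (c *Cache) Get(id uint64) *RegionInfo {
	c.mu.Lock()
	defer c.mu.Unlock()
	r := c.regions[id]
	if r == nil || time.Since(r.LoadedAt) > c.ttl {
		if _, inflight := c.reloading[id]; !inflight {
			c.reloading[id] = struct{}{}
			go c.reload(id)
		}
	}
	return r // may be nil or stale; the caller falls back to PD on a miss
}

func (c *Cache) reload(id uint64) {
	info, err := c.loadFn(id)
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.reloading, id)
	if err == nil && info != nil {
		info.LoadedAt = time.Now()
		c.regions[id] = info
	}
}
```

Deduplicating in-flight reloads is what keeps PD load bounded when many requests hit the same stale region at once.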

Replica Selection & TiKV Error Handling & Retry Related

Problems:

  • Inappropriate error handling with unexpected backoff/retry, leading to slow recovery or timeout errors
  • The "tikv slow" information is not being utilized, leading to ineffective retries and resource wastage
  • Insufficient unit test coverage for state transitions, resulting in complex code state machines that are difficult to maintain
  • Lack of stability test baselines to measure the performance and stability of replica selection, error handling, and retry mechanisms

Tracking issue:
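
A minimal sketch of the intended direction, assuming a per-store slow score is fed back from TiKV (the names and error kinds below are hypothetical, not the actual client-go state machine): prefer healthy replicas, and back off only when waiting can actually help.

```go
package replicaselect

// storeState is a hypothetical per-store record kept by the kv-client.
type storeState struct {
	slowScore int // higher means slower; fed back from TiKV health info
}

// pickReplica chooses the next peer to try: it prefers the leader unless the
// leader's store is flagged as slow, in which case it falls back to the
// healthiest follower instead of retrying the slow store.
func pickReplica(leader int, stores []storeState, slowThreshold int) int {
	if stores[leader].slowScore < slowThreshold {
		return leader
	}
	best, bestScore := leader, stores[leader].slowScore
	for i, s := range stores {
		if s.slowScore < bestScore {
			best, bestScore = i, s.slowScore
		}
	}
	return best
}

// shouldBackoff distinguishes errors where waiting helps (the store may
// recover shortly) from errors where the client should immediately retry
// another replica or refresh its region cache.
func shouldBackoff(errKind string) bool {
	switch errKind {
	case "server_busy", "region_unavailable":
		return true // waiting gives TiKV a chance to recover
	default:
		return false // e.g. not-leader / epoch mismatch: switch peer or refresh instead
	}
}
```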

Enabling TiKV Slow Score By Default

Problems:

  • Raft log write I/O jitter may have a significant impact on user queries (a conceptual slow-score sketch follows below)

Tracking issue:
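
For intuition only, a conceptual slow-score sketch: timed-out disk inspections push a bounded score up, and healthy rounds let it decay. TiKV's real slow-score algorithm differs in its details; everything here is illustrative.

```go
package slowscore

// SlowScore is a conceptual per-store slowness indicator in the range
// [1, 100]: inspection timeouts raise it multiplicatively, healthy rounds
// let it decay. Not TiKV's exact algorithm.
type SlowScore struct {
	value float64
}

func New() *SlowScore { return &SlowScore{value: 1} }

// Observe records one inspection round, given the fraction of disk
// inspections that timed out in that round.
func (s *SlowScore) Observe(timedOutRatio float64) {
	if timedOutRatio > 0 {
		s.value *= 1 + 9*timedOutRatio // grow faster when more inspections time out
	} else {
		s.value *= 0.8 // decay when the disk looks healthy
	}
	if s.value < 1 {
		s.value = 1
	}
	if s.value > 100 {
		s.value = 100
	}
}

// Value can be reported to PD so that leaders are evicted from stores
// that stay slow for a long time.
func (s *SlowScore) Value() float64 { return s.value }
```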

Building A Unified Health Controller And Feedback Mechanism

Problems:

  • Slowness information cannot currently be detected by the kv-client; exposing it would help the kv-client decide peer selection and avoid wasting resources (see the sketch at the end of this section)

Tracking issue:
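
A sketch of what a unified feedback path could look like on the client side, assuming TiKV piggybacks a health record on responses or pushes it periodically (all types below are hypothetical): the kv-client keeps the latest record per store and the replica selector consults it.

```go
package healthfeedback

import (
	"sync"
	"time"
)

// Feedback is a hypothetical health record a TiKV store could attach to
// responses so the kv-client learns about slowness without waiting for
// request timeouts.
type Feedback struct {
	StoreID   uint64
	SlowScore int
	SeenAt    time.Time
}

// Controller aggregates the latest feedback per store on the client side.
type Controller struct {
	mu     sync.RWMutex
	stores map[uint64]Feedback
}

func NewController() *Controller {
	return &Controller{stores: make(map[uint64]Feedback)}
}

// Update is called whenever a response carries fresh health information.
func (c *Controller) Update(fb Feedback) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stores[fb.StoreID] = fb
}

// IsSlow tells the replica selector whether to deprioritize a store.
// Stale feedback is ignored so a recovered store is not penalized forever.
func (c *Controller) IsSlow(storeID uint64, threshold int, maxAge time.Duration) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	fb, ok := c.stores[storeID]
	return ok && time.Since(fb.SeenAt) <= maxAge && fb.SlowScore >= threshold
}
```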

Warmup Before PD Heartbeat And Leader Movement

Problems:

  • A TiKV node may be asked to handle requests before it has warmed up, causing latency spikes (issue). The PD store heartbeat could instead be sent only after log applying and warm-up finish on the restarted TiKV node (see the sketch at the end of this section).
  • A TiKV node may be busy applying raft logs after a network partition; scheduling leader peers to the just-started node may cause high write latency because of apply wait (issue).

Tracking issue:
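
A sketch of the intended startup ordering (the Store interface below is hypothetical): finish applying pending raft logs and warming up caches before the store heartbeats to PD, so PD does not schedule leaders onto a cold node.

```go
package startup

import "context"

// Store is a hypothetical interface for the startup steps discussed above.
type Store interface {
	ApplyPendingRaftLogs(ctx context.Context) error // catch up on committed logs
	WarmUpCaches(ctx context.Context) error         // e.g. populate block cache
	StartPDHeartbeat(ctx context.Context) error     // only now become a scheduling target
}

// BootSequence orders the steps so PD only sees the store as ready (and may
// start moving leaders onto it) after the expensive catch-up work is done.
func BootSequence(ctx context.Context, s Store) error {
	if err := s.ApplyPendingRaftLogs(ctx); err != nil {
		return err
	}
	if err := s.WarmUpCaches(ctx); err != nil {
		return err
	}
	return s.StartPDHeartbeat(ctx)
}
```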

Enabling Async-IO By Default

image

Problems:

  • Raft log I/O jitter has a significant impact on the raftstore loop; using async-io helps mitigate the impact of the I/O jitter (see the sketch at the end of this section)

Tracking issue:
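
The referenced tikv/tikv commit enables async-io by changing `raftstore.store-io-pool-size` from 0 to 1, i.e. raft log writes are handed to a dedicated I/O worker instead of being performed inline in the store loop. The Go sketch below only illustrates the decoupling pattern (channel hand-off to I/O workers); it is not TiKV's Rust implementation, and all names are made up.

```go
package asyncio

// writeTask is a batch of raft log entries handed off to an I/O worker.
type writeTask struct {
	regionID uint64
	entries  [][]byte
	done     chan error // the store loop is notified asynchronously
}

// StoreLoop sketches the decoupling: instead of syncing to disk inline
// (which stalls every region on one slow write), the loop enqueues the
// task and keeps processing other messages.
func StoreLoop(inbox <-chan writeTask, ioWorkers chan<- writeTask) {
	for task := range inbox {
		ioWorkers <- task // hand off; do not block on disk here
	}
}

// IOWorker drains write tasks and performs the blocking disk writes, so that
// I/O jitter is absorbed by the worker pool rather than the store loop.
func IOWorker(tasks <-chan writeTask, syncToDisk func(writeTask) error) {
	for task := range tasks {
		task.done <- syncToDisk(task)
	}
}
```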

Allow Log Apply When The Quorum Has Formed

image

Problems:

  • The leader cannot advance write request processing (applying logs) even though the logs have already been committed by a majority of the replicas, resulting in a significant write-latency impact from a single EBS I/O jitter (see the sketch at the end of this section)

Tracking issue:
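
Conceptually, a raft log entry is committed once it is durable on a majority of replicas, and that majority does not have to include the leader's own (slow) disk; the leader can then apply the entry from its in-memory log. A small sketch of the commit-index computation (illustrative only, not TiKV's raft-rs code):

```go
package quorum

import "sort"

// CommittedIndex returns the highest log index persisted on a majority of
// replicas. persisted[i] is the highest durable index reported by replica i,
// with the leader's own entry included. Because the majority can be formed
// by followers alone, a leader whose local disk is slow can still advance
// the commit index and apply entries.
func CommittedIndex(persisted []uint64) uint64 {
	idx := append([]uint64(nil), persisted...)
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	quorum := len(idx)/2 + 1
	return idx[quorum-1] // the quorum-th largest persisted index
}
```

For example, with three replicas where the leader has persisted up to index 90 but both followers have persisted up to 100, CommittedIndex returns 100, so the leader may apply through index 100 without waiting for its local fsync.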

Avoid IO Operations In The Store Loop

Problems:

  • IO operations should be avoided in the store loop as much as possible (the async-io sketch above illustrates one way to move them off the loop)

Tracking issue:

cfzjywxk added the type/enhancement label Mar 7, 2024
ti-chi-bot bot pushed a commit to tikv/tikv that referenced this issue Mar 7, 2024
close #16614, ref pingcap/tidb#51585

Enable `async-io` by default with changing the setting `raftstore.store-io-pool-size` from 0 to 1.

Signed-off-by: lucasliang <[email protected]>
@gengliqi (Contributor) commented Mar 7, 2024

FYI: The last two figures come from this PPT which has more explanation.

dbsid pushed a commit to dbsid/tikv that referenced this issue Mar 24, 2024
close tikv#16614, ref pingcap/tidb#51585

Enable `async-io` by default with changing the setting `raftstore.store-io-pool-size` from 0 to 1.

Signed-off-by: lucasliang <[email protected]>
Signed-off-by: dbsid <[email protected]>