improve batch efficiency for high throughput workloads #1411
Conversation
Signed-off-by: zyguan <[email protected]>
```go
	Q float64 `json:"q,omitempty"`
}

type turboBatchTrigger struct {
```
Need to add comments from the design doc here to briefly introduce the `turboBatchTrigger` and `turboBatchOptions` concepts. The detailed design document also needs to be referenced in the GitHub issue.
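For illustration, a hedged sketch of what such introductory comments might look like (the wording is mine, not the design doc's; the types are the ones under review):

```go
package client

// turboBatchOptions holds the tunable parameters of a batch policy; for the
// "custom" policy they are decoded from a JSON payload in the config value.
type turboBatchOptions struct{ /* fields discussed later in this review */ }

// turboBatchTrigger watches request arrival intervals and decides whether
// the send loop should wait a little longer to batch more requests together.
type turboBatchTrigger struct {
	opts turboBatchOptions
}
```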
LGTM
```go
trigger, ok := newTurboBatchTriggerFromPolicy(cfg.BatchPolicy)
if !ok {
	initBatchPolicyWarn.Do(func() {
		logutil.BgLogger().Warn("fallback to default batch policy due to invalid value", zap.String("value", cfg.BatchPolicy))
	})
}
```
When the configuration is incorrect, this log may be printed too frequently.
It's in a `sync.Once`.
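For reference, a minimal runnable sketch of the pattern being pointed at: the function passed to `sync.Once.Do` runs at most once per process, so an invalid policy can't flood the log (names mirror the snippet above; the logging call is simplified):

```go
package main

import (
	"fmt"
	"sync"
)

var initBatchPolicyWarn sync.Once

func warnInvalidPolicy(value string) {
	// sync.Once guarantees the callback runs at most once per process,
	// so a misconfigured policy cannot flood the log.
	initBatchPolicyWarn.Do(func() {
		fmt.Printf("fallback to default batch policy due to invalid value: %q\n", value)
	})
}

func main() {
	for i := 0; i < 3; i++ {
		warnInvalidPolicy("bogus") // only the first call prints
	}
}
```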
config/client.go
```go
BatchPolicyBasic    = "basic"
BatchPolicyStandard = "standard"
BatchPolicyPositive = "positive"
BatchPolicyCustom   = "custom"
```
Maybe it's better to have one-line comments for all the enums defined in this PR?
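A hedged illustration of what that could look like; the one-line descriptions below are guesses from the policy names, not the PR's actual documentation:

```go
package config

const (
	// BatchPolicyBasic keeps the original batching behavior.
	BatchPolicyBasic = "basic"
	// BatchPolicyStandard enables adaptive batching with default parameters.
	BatchPolicyStandard = "standard"
	// BatchPolicyPositive batches more aggressively, favoring throughput.
	BatchPolicyPositive = "positive"
	// BatchPolicyCustom reads user-supplied trigger options (JSON-encoded).
	BatchPolicyCustom = "custom"
)
```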
Some thoughts: Have we considered node load metrics like CPU utilization (for TiDB, and possibly TiKV) as factors in the algorithm? It might be worth exploring an adaptive approach that accounts for these variables.
internal/client/client_batch.go
```go
n, m := math.Modf(avgBatchWaitSize)
batchWaitSize := int(n)
if trigger.opts.V == 0 {
	batchWaitSize = int(cfg.BatchWaitSize)
} else if m >= trigger.opts.Q {
	batchWaitSize++
}
```
Could you extract this into a function and add tests to it?
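A possible shape for the extraction plus a test; the function name, receiver, and stub types here are hypothetical:

```go
package client

import (
	"math"
	"testing"
)

// Stub types so the sketch is self-contained; the real ones are in this PR.
type turboBatchOptions struct {
	V int
	Q float64
}

type turboBatchTrigger struct{ opts turboBatchOptions }

// preferredBatchWaitSize is a hypothetical extraction of the inlined logic:
// V == 0 keeps the configured wait size; otherwise the average is truncated,
// rounding up when the fractional part reaches the Q threshold.
func (t *turboBatchTrigger) preferredBatchWaitSize(avg float64, configured int) int {
	if t.opts.V == 0 {
		return configured
	}
	n, m := math.Modf(avg)
	size := int(n)
	if m >= t.opts.Q {
		size++
	}
	return size
}

func TestPreferredBatchWaitSize(t *testing.T) {
	trig := turboBatchTrigger{opts: turboBatchOptions{V: 1, Q: 0.7}}
	if got := trig.preferredBatchWaitSize(3.69, 8); got != 3 {
		t.Fatalf("want 3, got %d", got)
	}
	if got := trig.preferredBatchWaitSize(3.7, 8); got != 4 {
		t.Fatalf("want 4, got %d", got)
	}
	off := turboBatchTrigger{opts: turboBatchOptions{V: 0, Q: 0.7}}
	if got := off.preferredBatchWaitSize(3.7, 8); got != 8 {
		t.Fatalf("want 8, got %d", got)
	}
}
```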
internal/client/client_batch.go
```diff
@@ -98,6 +106,8 @@ type batchCommandsBuilder struct {
 	requestIDs []uint64
 	// In most cases, there isn't any forwardingReq.
 	forwardingReqs map[string]*tikvpb.BatchCommandsRequest
+
+	maxReqStartTime time.Time
```
How about `lastReqStartTime` or `latestReqStartTime`? I'm afraid "max...time" may look like it describes some longest duration.
```go
if !a.fetchMoreTimer.Stop() {
	<-a.fetchMoreTimer.C
}
```
It looks like this needs to be executed on every path that exits this `select` block. I'm afraid it may easily be missed when the code is modified later. What do you think about adding a defer block after starting the timer as insurance (while still keeping lines 364-366)?
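A self-contained sketch of the suggested insurance: a deferred stop-and-drain that runs on every exit path, with a `default` case so it doesn't block if the channel was already drained inline (the surrounding function is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// fetchWithTimer is illustrative; only the stop-and-drain pattern matters.
func fetchWithTimer() {
	fetchMoreTimer := time.NewTimer(50 * time.Millisecond)
	// Insurance: on every exit path, stop the timer and drain its channel
	// if it already fired. The default case keeps the drain from blocking
	// when the channel was already consumed inline.
	defer func() {
		if !fetchMoreTimer.Stop() {
			select {
			case <-fetchMoreTimer.C:
			default:
			}
		}
	}()

	select {
	case <-fetchMoreTimer.C:
		fmt.Println("timer fired: send the batch as-is")
	case <-time.After(10 * time.Millisecond):
		fmt.Println("another request arrived before the timer fired")
	}
}

func main() { fetchWithTimer() }
```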
internal/client/client_batch.go
```go
// Do an additional non-blocking try. Here we test the length against `maxBatchSize`
// instead of `batchWaitSize` because trying our best to fetch more requests is
// necessary so that we can adjust `batchWaitSize` dynamically.
yield := false
```
`yielded`?
```go
// BatchSendLoopPanicCounter is only used for testing.
var BatchSendLoopPanicCounter int64 = 0

var initBatchPolicyWarn sync.Once
```
It looks better as a field on the batch client instance, instead of a global. If some caller creates the client more than once during the process's lifetime, the warning may be missing for the later-created ones.
It might not be that important though... maybe we'd better not change this for now if we need to include this in the next release.
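A sketch of the per-instance alternative (the type, field, and helper names are assumed, not from the PR):

```go
package main

import (
	"log"
	"sync"
)

// Hypothetical stand-in for the PR's newTurboBatchTriggerFromPolicy.
func policyIsValid(policy string) bool { return policy == "basic" }

// batchClient is illustrative: the point is that the warn-once state
// lives on the instance rather than in a package-level variable.
type batchClient struct {
	batchPolicyWarn sync.Once
}

func (c *batchClient) applyBatchPolicy(policy string) {
	if !policyIsValid(policy) {
		// Each client instance warns at most once, so a client created
		// later in the process still reports its own invalid config.
		c.batchPolicyWarn.Do(func() {
			log.Printf("fallback to default batch policy due to invalid value: %q", policy)
		})
	}
}

func main() {
	a, b := &batchClient{}, &batchClient{}
	a.applyBatchPolicy("bogus") // warns
	a.applyBatchPolicy("bogus") // silent
	b.applyBatchPolicy("bogus") // warns again: per-instance Once
}
```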
```go
T float64 `json:"t,omitempty"`
W float64 `json:"w,omitempty"`
P float64 `json:"p,omitempty"`
Q float64 `json:"q,omitempty"`
```
Can there be some comments explaining the semantics of these parameters?
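What such comments might look like: the T and Q descriptions below are inferred from code elsewhere in this diff (`turboWaitTime` and the batch-wait-size rounding), while W and P are deliberately left pointing at the design doc rather than guessed:

```go
package client

type turboBatchOptions struct {
	// T is the extra ("turbo") wait time in seconds applied once the
	// trigger fires; see turboWaitTime below.
	T float64 `json:"t,omitempty"`
	// W and P parameterize the trigger condition; their precise meaning
	// should be quoted from the design doc rather than guessed here.
	W float64 `json:"w,omitempty"`
	P float64 `json:"p,omitempty"`
	// Q is the fractional-part threshold at or above which the average
	// batch wait size is rounded up to the next integer.
	Q float64 `json:"q,omitempty"`
}
```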
```go
}

func (t *turboBatchTrigger) turboWaitTime() time.Duration {
	return time.Duration(t.opts.T * float64(time.Second))
}
```
Thanks golang...
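The quip refers to Go's time arithmetic: `time.Duration` is an int64 count of nanoseconds, so a fractional seconds value has to be scaled in float space and explicitly converted. A standalone illustration:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := 0.25 // seconds, as float64

	// This does not compile: mismatched types float64 and time.Duration.
	// d := t * time.Second

	// The working form: scale in float space, then convert to Duration.
	d := time.Duration(t * float64(time.Second))
	fmt.Println(d) // 250ms
}
```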
Yes, the load and CPU utilization have been considered. Since the feature is preferred to be turned on by default, the most important task is to eliminate the perf regression under light workloads. The main reason for the regression is the extra wait time introduced by
Currently, we can only batch more requests when TiKV is detected as overloaded. This PR allows adaptively batching more requests according to request arrival intervals. Workloads with high throughput (like sysbench oltp point select) can benefit a lot from it, because batching more requests reduces the gRPC overhead and the number of syscalls.
ref: #1366 & #1373
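As a toy model of the idea described above (none of these names are from the PR): estimate the average inter-arrival interval of requests, and when requests arrive densely, it pays to wait briefly and send a bigger batch:

```go
package main

import (
	"fmt"
	"time"
)

// arrivalEstimator keeps an exponentially weighted moving average of the
// interval between consecutive request arrivals.
type arrivalEstimator struct {
	last time.Time
	avg  time.Duration
}

func (e *arrivalEstimator) observe(now time.Time) {
	if !e.last.IsZero() {
		interval := now.Sub(e.last)
		if e.avg == 0 {
			e.avg = interval
		} else {
			e.avg = (e.avg*7 + interval) / 8 // EWMA with weight 1/8
		}
	}
	e.last = now
}

// shouldBatchMore reports whether requests arrive fast enough that waiting
// briefly is likely to pack extra requests into the same gRPC batch.
func (e *arrivalEstimator) shouldBatchMore(threshold time.Duration) bool {
	return e.avg > 0 && e.avg < threshold
}

func main() {
	var e arrivalEstimator
	base := time.Now()
	for i := 0; i < 10; i++ {
		e.observe(base.Add(time.Duration(i*50) * time.Microsecond))
	}
	fmt.Println("avg interval:", e.avg)
	fmt.Println("batch more?", e.shouldBatchMore(time.Millisecond)) // true
}
```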