improve batch efficiency for high throughput workloads #1411
Conversation
Signed-off-by: zyguan <[email protected]>
```go
	Q float64 `json:"q,omitempty"`
}

type turboBatchTrigger struct {
```
Need to add comments from the design doc here to briefly introduce the `turboBatchTrigger` and `turboBatchOptions` concepts. The detailed design document also needs to be referenced in the GitHub issue.
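For illustration, a hedged sketch of what such introductory comments might look like (the wording is mine, not the design doc's; the types are the ones under review):

```go
package client

// turboBatchOptions holds the tunable parameters of a batch policy; for the
// "custom" policy they are decoded from a JSON payload in the config value.
type turboBatchOptions struct{ /* fields discussed later in this review */ }

// turboBatchTrigger watches request arrival intervals and decides whether
// the send loop should wait a little longer to batch more requests together.
type turboBatchTrigger struct {
	opts turboBatchOptions
}
```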
LGTM
```go
trigger, ok := newTurboBatchTriggerFromPolicy(cfg.BatchPolicy)
if !ok {
	initBatchPolicyWarn.Do(func() {
		logutil.BgLogger().Warn("fallback to default batch policy due to invalid value", zap.String("value", cfg.BatchPolicy))
	})
}
```
When the configuration is incorrect, this log may be printed too frequently.
It's in a `sync.Once`.
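For reference, a minimal runnable sketch of the pattern being pointed at: the function passed to `sync.Once.Do` runs at most once per process, so an invalid policy can't flood the log (names mirror the snippet above; the logging call is simplified):

```go
package main

import (
	"fmt"
	"sync"
)

var initBatchPolicyWarn sync.Once

func warnInvalidPolicy(value string) {
	// sync.Once guarantees the callback runs at most once per process,
	// so a misconfigured policy cannot flood the log.
	initBatchPolicyWarn.Do(func() {
		fmt.Printf("fallback to default batch policy due to invalid value: %q\n", value)
	})
}

func main() {
	for i := 0; i < 3; i++ {
		warnInvalidPolicy("bogus") // only the first call prints
	}
}
```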
config/client.go
```go
BatchPolicyBasic    = "basic"
BatchPolicyStandard = "standard"
BatchPolicyPositive = "positive"
BatchPolicyCustom   = "custom"
```
Maybe it's better to have one-line comments for all the enums defined in this PR?
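A hedged illustration of what that could look like; the one-line descriptions below are guesses from the policy names, not the PR's actual documentation:

```go
package config

const (
	// BatchPolicyBasic keeps the original batching behavior.
	BatchPolicyBasic = "basic"
	// BatchPolicyStandard enables adaptive batching with default parameters.
	BatchPolicyStandard = "standard"
	// BatchPolicyPositive batches more aggressively, favoring throughput.
	BatchPolicyPositive = "positive"
	// BatchPolicyCustom reads user-supplied trigger options (JSON-encoded).
	BatchPolicyCustom = "custom"
)
```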
Some thoughts: Have we considered node load metrics like CPU utilization (for TiDB, and possibly TiKV) as factors in the algorithm? It might be worth exploring an adaptive approach that accounts for these variables.
internal/client/client_batch.go
```go
n, m := math.Modf(avgBatchWaitSize)
batchWaitSize := int(n)
if trigger.opts.V == 0 {
	batchWaitSize = int(cfg.BatchWaitSize)
} else if m >= trigger.opts.Q {
	batchWaitSize++
}
```
Could you extract this into a function and add tests to it?
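A possible shape for the extraction plus a test; the function name, receiver, and stub types here are hypothetical:

```go
package client

import (
	"math"
	"testing"
)

// Stub types so the sketch is self-contained; the real ones are in this PR.
type turboBatchOptions struct {
	V int
	Q float64
}

type turboBatchTrigger struct{ opts turboBatchOptions }

// preferredBatchWaitSize is a hypothetical extraction of the inlined logic:
// V == 0 keeps the configured wait size; otherwise the average is truncated,
// rounding up when the fractional part reaches the Q threshold.
func (t *turboBatchTrigger) preferredBatchWaitSize(avg float64, configured int) int {
	if t.opts.V == 0 {
		return configured
	}
	n, m := math.Modf(avg)
	size := int(n)
	if m >= t.opts.Q {
		size++
	}
	return size
}

func TestPreferredBatchWaitSize(t *testing.T) {
	trig := turboBatchTrigger{opts: turboBatchOptions{V: 1, Q: 0.7}}
	if got := trig.preferredBatchWaitSize(3.69, 8); got != 3 {
		t.Fatalf("want 3, got %d", got)
	}
	if got := trig.preferredBatchWaitSize(3.7, 8); got != 4 {
		t.Fatalf("want 4, got %d", got)
	}
	off := turboBatchTrigger{opts: turboBatchOptions{V: 0, Q: 0.7}}
	if got := off.preferredBatchWaitSize(3.7, 8); got != 8 {
		t.Fatalf("want 8, got %d", got)
	}
}
```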
internal/client/client_batch.go
```diff
@@ -98,6 +106,8 @@ type batchCommandsBuilder struct {
 	requestIDs []uint64
 	// In most cases, there isn't any forwardingReq.
 	forwardingReqs map[string]*tikvpb.BatchCommandsRequest
+
+	maxReqStartTime time.Time
```
How about `lastReqStartTime` or `latestReqStartTime`? I'm afraid "max...time" may look like it describes some longest duration.
```go
if !a.fetchMoreTimer.Stop() {
	<-a.fetchMoreTimer.C
}
```
It looks like this needs to be executed on every path that exits this `select` block. I'm afraid it may easily be missed when the code is modified later. What do you think about adding a defer block after starting the timer as insurance (while still keeping lines 364-366)?
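A self-contained sketch of the suggested insurance: a deferred stop-and-drain that runs on every exit path, with a `default` case so it doesn't block if the channel was already drained inline (the surrounding function is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// fetchWithTimer is illustrative; only the stop-and-drain pattern matters.
func fetchWithTimer() {
	fetchMoreTimer := time.NewTimer(50 * time.Millisecond)
	// Insurance: on every exit path, stop the timer and drain its channel
	// if it already fired. The default case keeps the drain from blocking
	// when the channel was already consumed inline.
	defer func() {
		if !fetchMoreTimer.Stop() {
			select {
			case <-fetchMoreTimer.C:
			default:
			}
		}
	}()

	select {
	case <-fetchMoreTimer.C:
		fmt.Println("timer fired: send the batch as-is")
	case <-time.After(10 * time.Millisecond):
		fmt.Println("another request arrived before the timer fired")
	}
}

func main() { fetchWithTimer() }
```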
internal/client/client_batch.go
```go
// Do an additional non-blocking try. Here we test the length against `maxBatchSize`
// instead of `batchWaitSize` because trying our best to fetch more requests is
// necessary so that we can adjust `batchWaitSize` dynamically.
yield := false
```
`yielded`?
```go
// BatchSendLoopPanicCounter is only used for testing.
var BatchSendLoopPanicCounter int64 = 0

var initBatchPolicyWarn sync.Once
```
It looks better as a field on the batch client instance, instead of a global. If some caller creates the client more than once during the process's lifetime, the warning may be missing for the later-created ones.
It might not be that important though... maybe we'd better not change this for now if we need to include this in the next release.
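A sketch of the per-instance alternative (the type, field, and helper names are assumed, not from the PR):

```go
package main

import (
	"log"
	"sync"
)

// Hypothetical stand-in for the PR's newTurboBatchTriggerFromPolicy.
func policyIsValid(policy string) bool { return policy == "basic" }

// batchClient is illustrative: the point is that the warn-once state
// lives on the instance rather than in a package-level variable.
type batchClient struct {
	batchPolicyWarn sync.Once
}

func (c *batchClient) applyBatchPolicy(policy string) {
	if !policyIsValid(policy) {
		// Each client instance warns at most once, so a client created
		// later in the process still reports its own invalid config.
		c.batchPolicyWarn.Do(func() {
			log.Printf("fallback to default batch policy due to invalid value: %q", policy)
		})
	}
}

func main() {
	a, b := &batchClient{}, &batchClient{}
	a.applyBatchPolicy("bogus") // warns
	a.applyBatchPolicy("bogus") // silent
	b.applyBatchPolicy("bogus") // warns again: per-instance Once
}
```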
```go
T float64 `json:"t,omitempty"`
W float64 `json:"w,omitempty"`
P float64 `json:"p,omitempty"`
Q float64 `json:"q,omitempty"`
```
Can there be some comments explaining the semantics of these parameters?
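What such comments might look like: the T and Q descriptions below are inferred from code elsewhere in this diff (`turboWaitTime` and the batch-wait-size rounding), while W and P are deliberately left pointing at the design doc rather than guessed:

```go
package client

type turboBatchOptions struct {
	// T is the extra ("turbo") wait time in seconds applied once the
	// trigger fires; see turboWaitTime below.
	T float64 `json:"t,omitempty"`
	// W and P parameterize the trigger condition; their precise meaning
	// should be quoted from the design doc rather than guessed here.
	W float64 `json:"w,omitempty"`
	P float64 `json:"p,omitempty"`
	// Q is the fractional-part threshold at or above which the average
	// batch wait size is rounded up to the next integer.
	Q float64 `json:"q,omitempty"`
}
```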
```go
}

func (t *turboBatchTrigger) turboWaitTime() time.Duration {
	return time.Duration(t.opts.T * float64(time.Second))
}
```
Thanks golang...
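The quip refers to Go's time arithmetic: `time.Duration` is an int64 count of nanoseconds, so a fractional seconds value has to be scaled in float space and explicitly converted. A standalone illustration:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := 0.25 // seconds, as float64

	// This does not compile: mismatched types float64 and time.Duration.
	// d := t * time.Second

	// The working form: scale in float space, then convert to Duration.
	d := time.Duration(t * float64(time.Second))
	fmt.Println(d) // 250ms
}
```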
Yes, the load and CPU utilization have been considered. Since the feature is preferred to be turned on by default, the most important task is to eliminate the perf regression under light workloads. The main reason for the regression is the extra wait time introduced by
Currently, we can only batch more requests when TiKV is detected as overloaded. This PR allows adaptively batching more requests according to request arrival intervals. Workloads with high throughput (like sysbench oltp point select) can benefit a lot from it, because batching more requests reduces the gRPC overhead and the number of syscalls.
ref: #1366 & #1373
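As a toy model of the idea described above (none of these names are from the PR): estimate the average inter-arrival interval of requests, and when requests arrive densely, it pays to wait briefly and send a bigger batch:

```go
package main

import (
	"fmt"
	"time"
)

// arrivalEstimator keeps an exponentially weighted moving average of the
// interval between consecutive request arrivals.
type arrivalEstimator struct {
	last time.Time
	avg  time.Duration
}

func (e *arrivalEstimator) observe(now time.Time) {
	if !e.last.IsZero() {
		interval := now.Sub(e.last)
		if e.avg == 0 {
			e.avg = interval
		} else {
			e.avg = (e.avg*7 + interval) / 8 // EWMA with weight 1/8
		}
	}
	e.last = now
}

// shouldBatchMore reports whether requests arrive fast enough that waiting
// briefly is likely to pack extra requests into the same gRPC batch.
func (e *arrivalEstimator) shouldBatchMore(threshold time.Duration) bool {
	return e.avg > 0 && e.avg < threshold
}

func main() {
	var e arrivalEstimator
	base := time.Now()
	for i := 0; i < 10; i++ {
		e.observe(base.Add(time.Duration(i*50) * time.Microsecond))
	}
	fmt.Println("avg interval:", e.avg)
	fmt.Println("batch more?", e.shouldBatchMore(time.Millisecond)) // true
}
```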