Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round-robin updateMember may cause the timestamp to fallback #6770

Open
JmPotato opened this issue Jul 6, 2023 · 6 comments
Open

Round-robin updateMember may cause the timestamp to fallback #6770

JmPotato opened this issue Jul 6, 2023 · 6 comments

Comments

@JmPotato
Copy link
Member

JmPotato commented Jul 6, 2023

Bug Report

We use a round-robin way to request the TSO server with FindGroupByKeyspaceID gRPC call.

func (c *tsoServiceDiscovery) updateMember() error {
// The keyspace membership or the primary serving address of the keyspace group, to which this
// keyspace belongs, might have been changed. We need to query tso servers to get the latest info.
tsoServerAddr, err := c.getTSOServer(c.apiSvcDiscovery)
if err != nil {
log.Error("[tso] failed to get tso server", errs.ZapError(err))
return err
}
keyspaceID := c.GetKeyspaceID()
var keyspaceGroup *tsopb.KeyspaceGroup
if len(tsoServerAddr) > 0 {
keyspaceGroup, err = c.findGroupByKeyspaceID(keyspaceID, tsoServerAddr, updateMemberTimeout)
if err != nil {
if c.tsoServerDiscovery.countFailure() {
log.Error("[tso] failed to find the keyspace group",
zap.Uint32("keyspace-id-in-request", keyspaceID),
zap.String("tso-server-addr", tsoServerAddr),
errs.ZapError(err))
}
return err
}
c.tsoServerDiscovery.resetFailure()
} else {

However, if the requested TSO server satisfies this kind of case, it may cause the timestamp to fallback.

  • TSO node A initialized group 1.
  • TSO node B has not initialized group 1 yet.

If we do updateMember quickly while requesting the TSO, it may cause the keyspace group in the request to change between the default group and group 1, resulting in the timestamp fallback.

@JmPotato JmPotato added type/bug The issue is confirmed as a bug. component/client Client logic. component/tso Timestamp Oracle. component/keyspace Key space. labels Jul 6, 2023
@JmPotato
Copy link
Member Author

JmPotato commented Jul 7, 2023

After some investigation, I found the following minimum reproduction steps, presume we have a keyspace 1 belonging to group 1.

  1. TSO node A initializes group 1 and is elected as the primary of group 1.
  2. Keyspace 1 client executes updateMember and finds that the TSO node A is the primary of group 1.
  3. As the primary of group 1, TSO node A responds successfully to the TSO request of the keyspace 1 client at least once.
  4. The gRPC stream between keyspace 1 client and TSO node A occurs an error and is destroyed.
  5. TSO node B has not initialized group 1 yet at this time.
  6. Keyspace 1 client executes updateMember and finds that the TSO node B doesn't have any info about keyspace 1, so the default group is chosen as the served group.
  7. As the primary of the default group, TSO node B responds successfully to the TSO request of the keyspace 1 client at least once.
  8. If the TSO of the default group is less than group 1, panic will be triggered.

The critical point here is the order in which steps 4 and 5 occur, that is, the reconstruction of the stream needs to be triggered before the TSO node of the next polling in round-robin does not initialize this group at this time.

wsd6e7RE15

@JmPotato
Copy link
Member Author

Because the trigger conditions are complicated, we consider not fixing this problem for the time being. However, due to the lack of some constraints in the current client-side group switching mechanism, it is still worthwhile to introduce some protection mechanisms to limit the occurrence of this situation.

@nolouch
Copy link
Contributor

nolouch commented Oct 23, 2023

@lhy1024
Copy link
Contributor

lhy1024 commented Oct 26, 2023

@lhy1024
Copy link
Contributor

lhy1024 commented Nov 7, 2023

@CabinfeverB
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants