Fix data loss when the old primary takes over the slots after online #974

Open: wants to merge 2 commits into base: unstable
41 changes: 36 additions & 5 deletions src/cluster_legacy.c
@@ -122,6 +122,10 @@ int verifyClusterNodeId(const char *name, int length);
sds clusterEncodeOpenSlotsAuxField(int rdbflags);
int clusterDecodeOpenSlotsAuxField(int rdbflags, sds s);

#define CLUSTER_BROADCAST_ALL 0
#define CLUSTER_BROADCAST_LOCAL_REPLICAS 1
void clusterBroadcastPong(int target);

/* Only primaries that own slots have voting rights.
* Returns 1 if the node has voting rights, otherwise returns 0. */
static inline int clusterNodeIsVotingPrimary(clusterNode *n) {
@@ -1830,6 +1834,7 @@ int clusterBumpConfigEpochWithoutConsensus(void) {
if (myself->configEpoch == 0 || myself->configEpoch != maxEpoch) {
server.cluster->currentEpoch++;
myself->configEpoch = server.cluster->currentEpoch;
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);
serverLog(LL_NOTICE, "New configEpoch set to %llu", (unsigned long long)myself->configEpoch);
return C_OK;
@@ -1880,6 +1885,14 @@ int clusterBumpConfigEpochWithoutConsensus(void) {
* with the conflicting epoch (the 'sender' node), it will assign itself
* the greatest configuration epoch currently detected among nodes plus 1.
*
The above is an optimistic scenario. If this node and the sender node
are in the same shard, a conflict in configEpoch indicates that a
node has experienced a partition; for example, the old primary node
went down and came back up while the new primary won the election. In
this case, we need to take the replication offset into consideration,
otherwise, if the old primary wins the collision, we will lose some of
the new primary's data.
*
* This means that even if there are multiple nodes colliding, the node
* with the greatest Node ID never moves forward, so eventually all the nodes
* end with a different configuration epoch.
Comment on lines 1885 to 1898

Contributor:
It's better to describe this as if same shard ... otherwise ... like you did in another code comment below:

/* If sender and myself are in the same shard, the one with the
 * bigger offset will win. Otherwise if sender and myself are not
 * in the same shard, the one with the lexicographically smaller
 * Node ID will win. */

And the text below the added text describes the guarantee "the node with the greatest Node ID never moves forward". We need to change this text.

Do we have any other guarantees in the same-shard scenario? Can it happen that sender and myself don't have the same view on the replication offsets and both try to bump the epoch?

And what if one node thinks that it is in the same shard but the other one has a different idea and tries to compare by node-id? Can it happen that both try to bump the epoch?

Member:

some thoughts:

  1. shard-id is mostly stable except in two situations:

case 1. The delayed propagation of shard-id, as captured in #778 and #573. I am still trying to wrap my head around all the possible permutations.

case 2. cluster-allow-replica-migration when enabled would allow a replica or primary to join another shard if their shard lost their last slot to the other shard. BTW, I don't get the use case of this setting and even more why it is enabled by default. This setting created quite a few surprises IMO. We disable it on GCP Memorystore.

In both cases, we could have a split "are_in_the_same_shard" view on the nodes involved.

case 1. The split view could happen such that node A correctly concludes that it is in the same shard as B, but B, because it hasn't received A's real shard_id, thinks otherwise and relies on the node-id only to resolve the conflict. When this happens, the conflict continues. Eventually, B will receive A's real shard_id and should arrive at the same conclusion as A. Data written to the losing node during this transitional period will get lost when the conflict is eventually resolved.

case 2. The split could happen in both the source and target shards but I think the end result is innocuous in the source shard since it will be empty by then. The impact on the target shard should be mitigated by #885

  2. We will always lose data (except for the trivial case of no user writes) even if we pick the larger offset of the two. The replication history diverges the moment the "split brain" occurs, even though the two still share the same repl_id, and that divergence is what leads to the epoch conflict. By favoring the larger offset, we are saying the one with more user writes wins. I think this is a reasonable decision but just wanted to clarify that the two no longer share the same replication history despite having the same repl_id.

Member:

btw, @madolson, this is one of the many places where I fully agree with you that a new clustering solution designed holistically would fare better.

@@ -1888,14 +1901,34 @@ void clusterHandleConfigEpochCollision(clusterNode *sender) {
/* Prerequisites: nodes have the same configEpoch and are both primaries. */
if (sender->configEpoch != myself->configEpoch || !clusterNodeIsPrimary(sender) || !clusterNodeIsPrimary(myself))
return;
/* Don't act if the colliding node has a smaller Node ID. */
if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;

/* If sender and myself are in the same shard, the one with the
* bigger offset will win. Otherwise if sender and myself are not
* in the same shard, the one with the lexicographically smaller
* Node ID will win. */
if (areInSameShard(sender, myself)) {
Member:

I suppose I don't get why this works. If both nodes think they are primaries, the replication offsets we are now using could still be the same, and then both nodes would bump. The previous logic was consistent between nodes, whereas this logic may not be, and they may fight? It feels like this is just a race between the manual failover and the automated failover, and I'm not sure we necessarily should automatically be picking a winner.

Member Author:

If the offsets are the same, yes, it will still have the same issue; in that case we fall back to the original Node ID logic to pick a winner, so the behavior is the same as before.

I guess the only problem I want to solve here is: the old primary paused or blocked (its offset is small), and the new primary was elected and did some writes (its offset is large). They encountered a conflict and the old primary won, got a new config epoch, and forced the new primary to become a replica again.

long long sender_offset = getNodeReplicationOffset(sender);
long long myself_offset = getNodeReplicationOffset(myself);
if (sender_offset > myself_offset) {
/* Don't act if the colliding node has a bigger offset. */
return;
} else if (sender_offset == myself_offset) {
/* If the offsets are the same, we fall back to Node ID logic.
* Don't act if the colliding node has a smaller Node ID. */
if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;
}
} else {
/* Don't act if the colliding node has a smaller Node ID. */
if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;
}

/* Get the next ID available at the best of this node knowledge. */
server.cluster->currentEpoch++;
myself->configEpoch = server.cluster->currentEpoch;
clusterSaveConfigOrDie(1);
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
Member:

Can we document why there is a broadcast here as well? If there is any significant number of conflicts, we'll send out a lot of unnecessary messages.

Member Author:

Yes, we can document it. I guess in normal cases conflicts should be rare.

I think I am also doing this here because, if the primary has a larger epoch and its replicas don't learn about it in time, a replica may not be able to win an election (it has an old epoch and will get rejected; this should be rare, just my guess).

serverLog(LL_NOTICE, "configEpoch collision with node %.40s (%s). configEpoch set to %llu", sender->name,
sender->human_nodename, (unsigned long long)myself->configEpoch);
clusterSaveConfigOrDie(1);
Member Author:

print the logs first and then save the config, in case the save fails and we lose the context

Member:

I think this is a good context to keep in the code. add it as code comment?

Member Author:

SGTM, i will add this in the next commit.

}

/* -----------------------------------------------------------------------------
@@ -4001,8 +4034,6 @@ void clusterSendPing(clusterLink *link, int type) {
* CLUSTER_BROADCAST_ALL -> All known instances.
* CLUSTER_BROADCAST_LOCAL_REPLICAS -> All replicas in my primary-replicas ring.
*/
#define CLUSTER_BROADCAST_ALL 0
#define CLUSTER_BROADCAST_LOCAL_REPLICAS 1
void clusterBroadcastPong(int target) {
dictIterator *di;
dictEntry *de;
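The tie-break order this diff introduces (same shard: bigger replication offset wins; offset tie or different shards: the lexicographically smaller Node ID wins and bumps its epoch) can be sketched as a standalone function. This is a hedged illustration only; `should_bump_epoch` and its parameters are hypothetical and not part of Valkey's actual API:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the configEpoch collision tie-break in this PR.
 * Returns 1 if "myself" should bump its configEpoch (and thereby win the
 * collision by taking the next, higher epoch), 0 if it should yield. */
static int should_bump_epoch(int same_shard,
                             long long sender_offset,
                             long long myself_offset,
                             const char *sender_name,
                             const char *myself_name,
                             size_t name_len) {
    if (same_shard) {
        /* Same shard: the bigger replication offset wins. */
        if (sender_offset > myself_offset) return 0;
        if (sender_offset < myself_offset) return 1;
        /* Equal offsets: fall through to the Node ID comparison. */
    }
    /* Different shards (or an offset tie): act only if the colliding
     * node's Node ID is lexicographically greater than ours, i.e. the
     * node with the smaller Node ID bumps its epoch. */
    return memcmp(sender_name, myself_name, name_len) > 0;
}
```

Note the divergence risk the reviewers discuss: if the two nodes disagree on `same_shard` (or on each other's offsets), this function can return 1 on both sides and the conflict repeats until their views converge.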
4 changes: 4 additions & 0 deletions tests/unit/cluster/manual-takeover.tcl
@@ -22,8 +22,12 @@ set paused_pid [srv 0 pid]
set paused_pid1 [srv -1 pid]
set paused_pid2 [srv -2 pid]
test "Killing majority of master nodes" {
# Bumping the epochs to increase the chance of conflicts.
Member Author:

without the fix, this can lead to increased conflicts, which can cause tests to fail

Member:

I think the original issue is less about the non-deterministic epoch resolution behavior we have today and more about the race between the primary election and the forced takeover. I think that would be addressed if we disable the replica failover before pausing the primaries.

The change of bumping the epoch before pausing the primaries changes the intent of the test, IMO, to validating your "more deterministic conflict resolution" proposal, which does require increasing the conflict probability.

Can we make a surgical fix to the original test issue but create a new file to add the new test? It is a bit confusing to include it in a test file called manual-takeover as conflicts can happen outside manual-takeover too. maybe "cluster-epoch-conflict.tcl" or just "epoch-conflict.tcl"?

Member Author:

I fail to see why disabling the replica failover will work. Do you mean disabling it on the replica side? The code is already doing it. Or do you mean disabling it on the primary side?

# For this test, disable replica failover until
# all of the primaries are confirmed killed. Otherwise
# there might be enough time to elect a replica.
set replica_ids { 5 6 7 }
foreach id $replica_ids {
    R $id config set cluster-replica-no-failover yes
}

set paused_pid [srv 0 pid]
set paused_pid1 [srv -1 pid]
set paused_pid2 [srv -2 pid]
test "Killing majority of master nodes" {
    pause_process $paused_pid
    pause_process $paused_pid1
    pause_process $paused_pid2
}

foreach id $replica_ids {
    R $id config set cluster-replica-no-failover no
}

Yes, it is a good idea to split it; I will find some way to do it.

Member (@PingXie, Sep 1, 2024):

> I fail to see why disabling the replica failover will work.

I think there is a race condition between turning off failover on the replicas and stopping the primary. I meant to suggest the following changes (reverting the order R $id config set cluster-replica-no-failover no and pause_process $paused_pid). Can you give it a try (without the server code change)?

Suggested change
# Bumping the epochs to increase the chance of conflicts.
foreach id $replica_ids {
R $id config set cluster-replica-no-failover no
}
test "Killing majority of master nodes" {
pause_process $paused_pid
pause_process $paused_pid1
pause_process $paused_pid2
}

Member Author:

What is the specific race? It looks like you are suggesting to remove this one? e81bd15

R 0 cluster bumpepoch
pause_process $paused_pid
R 1 cluster bumpepoch
pause_process $paused_pid1
R 2 cluster bumpepoch
pause_process $paused_pid2
}
