Replies: 4 comments
-
Scenario 1: I'd expect process 3 to know who the leader is, and that its state machine will catch up with the leader's state machine. I'd be interested in more tracing information to see what's going on.

Scenario 2: I would also expect process 3 to catch up and to know who the leader is.
-
OK, I added tracing and noticed my code had mistakes: I had started with scenario 2 but I had forgotten one of the calls. One thing I noticed with tracing enabled, though, is that when assigning a process as spare, the process is not notified of it (the new configuration is sent only to the remaining voters/standbys), so the process that is now a spare starts election rounds, which is useless and consumes cycles. Is there any way to prevent it from doing that? (Maybe assigning it to standby and then to spare?)
-
Hmm, that's interesting, let me think about that.
-
This is a good point! Indeed, assigning the standby role first should do the trick, but I'm not opposed to implementing a fix so that assigning directly to spare just works. We would need a special case on the leader that replicates the new configuration to any nodes that have been demoted to spare. That might end up having some tricky edge cases, especially around retrying, but it doesn't seem totally impractical.
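In the meantime, the two-step demotion could look roughly like the sketch below. This is not code from the library, just an illustration against the `raft_assign`/`raft_change` API: node ID 3 is a placeholder, error handling is reduced to comments, and I'm assuming the usual `data` field on the request struct for passing context.

```c
#include <raft.h>

static struct raft_change to_standby;
static struct raft_change to_spare;

static void spare_cb(struct raft_change *req, int status)
{
    (void)req;
    (void)status; /* real code would check and report failures */
}

/* Runs once the demotion to standby has been applied. The idea from this
 * thread: standbys still receive log entries, so by now the node has seen
 * the configuration demoting it and won't keep campaigning once it finally
 * becomes a spare. */
static void standby_cb(struct raft_change *req, int status)
{
    struct raft *r = req->data; /* leader instance stashed in req->data */
    if (status != 0) {
        return; /* real code would retry or surface the error */
    }
    raft_assign(r, &to_spare, 3, RAFT_SPARE, spare_cb);
}

/* Demote node 3 in two steps: voter -> standby -> spare. */
static void demote_to_spare(struct raft *r)
{
    to_standby.data = r;
    raft_assign(r, &to_standby, 3, RAFT_STANDBY, standby_cb);
}
```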
-
I am trying various scenarios to better understand the expected usage of some functions. The results in some of these scenarios don't match my expectations, so I'm opening this thread in the hope of understanding where my assumptions are wrong.
Scenario 1

1. Calling `raft_leader` on each process gives me the same answer: process 1 is the leader.
2. I call `raft_assign` in the leader to assign the role of SPARE to process 3.
3. If I call `raft_apply` in the leader, processes 1 and 2 will see the commands appear in their state machine.
4. I call `raft_assign` in the leader to assign the role of VOTER to process 3 (the function succeeds).
5. If I call `raft_leader` in process 3, it fails (process 3 doesn't know who the leader is). If I ask for its state machine's content, I don't see the commands it missed. I can wait as long as I want, process 3 does not catch up.
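To make steps 2–5 concrete, here is a condensed sketch of the sequence (not my actual harness: initialization, the event loop and the `raft_io` wiring are omitted, the callbacks are no-ops, signatures are paraphrased from raft.h, and in the real test each step only runs after the previous request's callback has fired; I'm also assuming `raft_leader()` reports an unknown leader by leaving the ID at 0):

```c
#include <raft.h>
#include <string.h>

static void change_cb(struct raft_change *req, int status) { (void)req; (void)status; }
static void apply_cb(struct raft_apply *req, int status, void *result)
{
    (void)req; (void)status; (void)result;
}

/* r1 is the leader's raft instance, r3 the instance of process 3 (ID 3). */
static void scenario1(struct raft *r1, struct raft *r3)
{
    /* Request objects must stay alive until their callbacks fire. */
    static struct raft_change demote, promote;
    static struct raft_apply apply;
    struct raft_buffer buf;

    /* Step 2: demote process 3 to spare. */
    raft_assign(r1, &demote, 3, RAFT_SPARE, change_cb);

    /* Step 3: apply a command; processes 1 and 2 see it in their FSM. */
    buf.len = 8;
    buf.base = raft_malloc(buf.len); /* assuming raft takes ownership of it */
    memset(buf.base, 0, buf.len);    /* payload for my FSM, contents elided */
    raft_apply(r1, &apply, &buf, 1, apply_cb);

    /* Step 4: promote process 3 back to voter; the call succeeds. */
    raft_assign(r1, &promote, 3, RAFT_VOTER, change_cb);

    /* Step 5: process 3 still doesn't know the leader, and its FSM never
     * receives the command applied while it was a spare. */
    raft_id leader_id;
    const char *leader_address;
    raft_leader(r3, &leader_id, &leader_address);
    /* leader_id stays 0 here, however long I wait. */
}
```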
If `raft_assign` is local to the leader and does not communicate anything to the other processes, I would expect the leader to simply stop contacting process 3, which would stop receiving heartbeats and would eventually try to elect itself but fail, because it would never get a majority. But when it is reassigned as a voter, I would expect it to start receiving heartbeats from the leader again and to catch up, which doesn't seem to be what happens. Alternatively, I could also expect `raft_assign` to commit a configuration change in all the processes, including process 3, in which case process 3 should know it's not supposed to expect heartbeats anymore?
Scenario 2

This scenario is similar, but instead of calling `raft_assign` to assign process 3 the role of SPARE and then back to VOTER, I call `raft_remove` to remove process 3, and eventually call `raft_add` again followed by `raft_assign` to make it a voter. Note that I don't shut down the process: process 3 is still running. The result is the same as above: when process 3 is back to being a voter, it does not catch up and does not know who the leader is.
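Again, just to be concrete, scenario 2 boils down to the following sequence (same caveats as above: placeholder address, no-op callbacks, and each call is really issued only after the previous change has been applied):

```c
#include <raft.h>

static void change_cb(struct raft_change *req, int status) { (void)req; (void)status; }

/* r1 is the leader's raft instance; process 3 keeps running the whole time. */
static void scenario2(struct raft *r1)
{
    static struct raft_change remove_req, add_req, assign_req;

    /* Remove process 3 from the configuration (the process itself stays up). */
    raft_remove(r1, &remove_req, 3, change_cb);

    /* Later, add it back under the same ID and (placeholder) address... */
    raft_add(r1, &add_req, 3, "127.0.0.1:9003", change_cb);

    /* ...and promote it to voter. Same outcome as scenario 1: process 3
     * never learns who the leader is and never catches up. */
    raft_assign(r1, &assign_req, 3, RAFT_VOTER, change_cb);
}
```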
I will run with more tracing to see what's happening, in particular in process 3, but in the meantime, do my scenarios make sense?
Note that I am using my own implementation of a `raft_io` backend, which I have tested extensively (the scenarios above are exercising some edge cases). In particular, I can spin up a new process, call `raft_add` followed by `raft_assign` to make it visible to the leader and assign it the role of voter, and the new process does catch up on missing entries. The problem happens when I have an existing process running and I either assign it as spare and then back to voter, or remove it and then re-add it to the cluster.

Thanks!
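For reference, the path that does work for me (adding a brand-new process) chains the two membership changes through the change callback, roughly like this (node ID 4 and the address are placeholders, and I'm again assuming the request struct's `data` field for passing the raft instance around):

```c
#include <raft.h>
#include <stdio.h>

static struct raft_change add_req;
static struct raft_change assign_req;

static void assign_cb(struct raft_change *req, int status) { (void)req; (void)status; }

/* Once the add has been applied, promote the new node to voter. */
static void add_cb(struct raft_change *req, int status)
{
    struct raft *r = req->data;
    if (status != 0) {
        fprintf(stderr, "add failed: %s\n", raft_strerror(status));
        return;
    }
    raft_assign(r, &assign_req, 4, RAFT_VOTER, assign_cb);
}

/* Add a freshly started process (ID 4) and make it a voter. */
static void add_new_voter(struct raft *r)
{
    add_req.data = r;
    raft_add(r, &add_req, 4, "127.0.0.1:9004", add_cb);
}
```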