Replies: 4 comments
-
Scenario 1: I'd expect process 3 to know who the leader is, and that its state machine will catch up with the leader's state machine. I'd be interested in more tracing information to see what's going on.

Scenario 2: I would also expect process 3 to catch up and to know who the leader is.
-
OK, I added tracing and noticed my code had mistakes: I had started with scenario 2 but I had forgotten one of the calls. One thing I noticed with tracing enabled, though, is that when assigning a process as spare, the process is not notified of it (the new configuration is sent only to the remaining voters/standbys), so the process that is now a spare starts election rounds, which is useless and consumes cycles. Is there any way to prevent it from doing that? (Maybe assigning it to standby and then to spare?)
-
Hmm, that's interesting, let me think about that.
-
This is a good point! Indeed, assigning the standby role first should do the trick, but I'm not opposed to implementing a fix so that assigning directly to spare just works. We would need a special case on the leader that replicates the new configuration to any nodes that have been demoted to spare. That might end up having some tricky edge cases, especially around retrying, but it doesn't seem totally impractical.
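In the meantime, the two-step demotion could look roughly like the sketch below. This is not code from the library, just an illustration against the `raft_assign`/`raft_change` API: node ID 3 is a placeholder, error handling is reduced to comments, and I'm assuming the usual `data` field on the request struct for passing context.

```c
#include <raft.h>

static struct raft_change to_standby;
static struct raft_change to_spare;

static void spare_cb(struct raft_change *req, int status)
{
    (void)req;
    (void)status; /* real code would check and report failures */
}

/* Runs once the demotion to standby has been applied. The idea from this
 * thread: standbys still receive log entries, so by now the node has seen
 * the configuration demoting it and won't keep campaigning once it finally
 * becomes a spare. */
static void standby_cb(struct raft_change *req, int status)
{
    struct raft *r = req->data; /* leader instance stashed in req->data */
    if (status != 0) {
        return; /* real code would retry or surface the error */
    }
    raft_assign(r, &to_spare, 3, RAFT_SPARE, spare_cb);
}

/* Demote node 3 in two steps: voter -> standby -> spare. */
static void demote_to_spare(struct raft *r)
{
    to_standby.data = r;
    raft_assign(r, &to_standby, 3, RAFT_STANDBY, standby_cb);
}
```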
-
I am trying various scenarios to better understand the expected usage of some functions. The results in some of these scenarios don't match my expectations, so I'm opening this thread in the hope of understanding where my assumptions are wrong.
Scenario 1

1. Calling `raft_leader` on each process gives me the same answer: process 1 is the leader.
2. I call `raft_assign` in the leader to assign the role of SPARE to process 3.
3. If I call `raft_apply` in the leader, processes 1 and 2 will see the commands appear in their state machine.
4. I call `raft_assign` in the leader to assign the role of VOTER to process 3 (the function succeeds).
5. If I call `raft_leader` in process 3, it fails (process 3 doesn't know who the leader is). If I ask for its state machine's content, I don't see the commands it missed. I can wait as long as I want, process 3 does not catch up.
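To make steps 2–5 concrete, here is a condensed sketch of the sequence (not my actual harness: initialization, the event loop and the `raft_io` wiring are omitted, the callbacks are no-ops, signatures are paraphrased from raft.h, and in the real test each step only runs after the previous request's callback has fired; I'm also assuming `raft_leader()` reports an unknown leader by leaving the ID at 0):

```c
#include <raft.h>
#include <string.h>

static void change_cb(struct raft_change *req, int status) { (void)req; (void)status; }
static void apply_cb(struct raft_apply *req, int status, void *result)
{
    (void)req; (void)status; (void)result;
}

/* r1 is the leader's raft instance, r3 the instance of process 3 (ID 3). */
static void scenario1(struct raft *r1, struct raft *r3)
{
    /* Request objects must stay alive until their callbacks fire. */
    static struct raft_change demote, promote;
    static struct raft_apply apply;
    struct raft_buffer buf;

    /* Step 2: demote process 3 to spare. */
    raft_assign(r1, &demote, 3, RAFT_SPARE, change_cb);

    /* Step 3: apply a command; processes 1 and 2 see it in their FSM. */
    buf.len = 8;
    buf.base = raft_malloc(buf.len); /* assuming raft takes ownership of it */
    memset(buf.base, 0, buf.len);    /* payload for my FSM, contents elided */
    raft_apply(r1, &apply, &buf, 1, apply_cb);

    /* Step 4: promote process 3 back to voter; the call succeeds. */
    raft_assign(r1, &promote, 3, RAFT_VOTER, change_cb);

    /* Step 5: process 3 still doesn't know the leader, and its FSM never
     * receives the command applied while it was a spare. */
    raft_id leader_id;
    const char *leader_address;
    raft_leader(r3, &leader_id, &leader_address);
    /* leader_id stays 0 here, however long I wait. */
}
```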
If `raft_assign` is local to the leader and does not communicate anything to the other processes, I would expect the leader to simply stop contacting process 3, which would stop receiving heartbeats and would eventually try to elect itself but fail, because it would never get a majority. But when it is reassigned as a voter, I would expect it to start receiving heartbeats from the leader again and to catch up, which doesn't seem to be what happens. Alternatively, I could also expect `raft_assign` to commit a configuration change in all the processes, including process 3, in which case process 3 should know it's not supposed to expect heartbeats anymore?
Scenario 2

This scenario is similar, but instead of calling `raft_assign` to assign process 3 the role of SPARE and then back to VOTER, I call `raft_remove` to remove process 3, and eventually call `raft_add` again followed by `raft_assign` to make it a voter. Note that I don't shut down the process: process 3 is still running. The result is the same as above: when process 3 is back to being a voter, it does not catch up and does not know who the leader is.
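Again, just to be concrete, scenario 2 boils down to the following sequence (same caveats as above: placeholder address, no-op callbacks, and each call is really issued only after the previous change has been applied):

```c
#include <raft.h>

static void change_cb(struct raft_change *req, int status) { (void)req; (void)status; }

/* r1 is the leader's raft instance; process 3 keeps running the whole time. */
static void scenario2(struct raft *r1)
{
    static struct raft_change remove_req, add_req, assign_req;

    /* Remove process 3 from the configuration (the process itself stays up). */
    raft_remove(r1, &remove_req, 3, change_cb);

    /* Later, add it back under the same ID and (placeholder) address... */
    raft_add(r1, &add_req, 3, "127.0.0.1:9003", change_cb);

    /* ...and promote it to voter. Same outcome as scenario 1: process 3
     * never learns who the leader is and never catches up. */
    raft_assign(r1, &assign_req, 3, RAFT_VOTER, change_cb);
}
```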
I will run with more tracing to see what's happening, in particular in process 3, but in the meantime, do my scenarios make sense?
Note that I am using my own implementation of a `raft_io` backend, which I have tested extensively (the scenarios above are exercising some edge cases). In particular, I can spin up a new process, call `raft_add` followed by `raft_assign` to make it visible to the leader and assign it the role of voter, and the new process does catch up on missing entries. The problem happens when I have an existing process running and I either assign it as spare and then back to voter, or remove it and then re-add it to the cluster.

Thanks!
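For reference, the path that does work for me (adding a brand-new process) chains the two membership changes through the change callback, roughly like this (node ID 4 and the address are placeholders, and I'm again assuming the request struct's `data` field for passing the raft instance around):

```c
#include <raft.h>
#include <stdio.h>

static struct raft_change add_req;
static struct raft_change assign_req;

static void assign_cb(struct raft_change *req, int status) { (void)req; (void)status; }

/* Once the add has been applied, promote the new node to voter. */
static void add_cb(struct raft_change *req, int status)
{
    struct raft *r = req->data;
    if (status != 0) {
        fprintf(stderr, "add failed: %s\n", raft_strerror(status));
        return;
    }
    raft_assign(r, &assign_req, 4, RAFT_VOTER, assign_cb);
}

/* Add a freshly started process (ID 4) and make it a voter. */
static void add_new_voter(struct raft *r)
{
    add_req.data = r;
    raft_add(r, &add_req, 4, "127.0.0.1:9004", add_cb);
}
```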