Symbol | Meaning |
---|---|
N |
Total number of cluster nodes, at least 1. The total number of clusters cannot be dynamically expanded |
M |
Minimum number of cluster high availability nodes, M = N/2 + 1 . It is considered passed only when the number of votes is at least M |
seed |
Seed node, the entry point for cluster access |
leader |
The manager in the cluster, only the leader can modify the status of members |
follower |
Ordinary members in the cluster, including seed and non-seed , can participate in voting |
member |
A general term for cluster nodes, can be seed , follower , leader |
version |
The version of the cluster, each time a new leader is generated, the version increases |
receiver |
Specifically refers to com.github.liuyehcf.framework.io.athena.Receiver |
heartbeatInterval |
Heartbeat interval |
heartbeatTimeout |
Heartbeat timeout |
ttlTimeout |
Maximum downtime allowed for a node (including network exceptions) |
retryInterval |
Retry interval, including rejoining the cluster, querying the status of the leader , etc. |
proposal |
Proposal, divided into pre-proposal stage (pre ) and formal proposal stage (formal ) |
vote |
Vote, corresponding to proposal , divided into pre-proposal stage (pre ) and formal proposal stage (formal ) |
proposer |
The member who proposes the proposal |
candidate |
The member who hopes to be elected as the leader by proposing |
Events are divided into two categories, system events (system-event
) and (custom-event
).
system-event
is handled by the system framework and will not be passed to the user'sreceiver
.custom-event
is passed to the user'sreceiver
and processed according to the matching logic.
The sender of an event must ensure that the event is sent accurately, i.e., the information expressed by the event matches the state of the cluster when the event is sent (only focus on that moment).
- For example, when sending
LeaderKeepAliveAck
, it is necessary to verify whether the sender is the leader, and only send it after successful verification.
Suppose two nodes A
and B
in the cluster. Node A
observes that the state of node B
is λ1
, and based on this observation, node A
sends the observation result λ1
and the processing method θ
to B
.
When B
receives the processing method θ
, it needs to validate whether the current state is λ1
.
- If the observed result of
B
is alsoλ1
, then executeθ
. - If the observed result of
B
is notλ1
, then do not process it.
Node states and their meanings are as follows
State | Description |
---|---|
joining |
Initialization state. When a node communicates with the leader for the first time to request joining the cluster, the leader will mark it as joining state. |
active |
After a node successfully joins the cluster, the leader will mark it as active state. |
unreachable |
When the leader does not receive the heartbeat packet from the node after heartbeatTimeout , it will mark the node as unreachable . |
leaving |
When the leader does not receive the heartbeat from the node after ttlTimeout , it will mark the node as leaving . |
removed |
After marking the node as leaving , the leader will perform some cleanup work. After the cleanup is completed, the node will be marked as removed . |
State Machine
Current State | Condition | Next State |
---|---|---|
/ | Join the cluster | joining |
joining |
Finish initialization | active |
active |
The leader does not receive the heartbeat after heartbeatTimeout |
unreachable |
unreachable |
The heartbeatTimeout is reached and the heartbeat is received again |
active |
unreachable |
The leader does not receive the heartbeat after ttlTimeout |
leaving |
leaving |
Finish cleanup | removed |
Cluster Startup
seed
startup:- Iterate through all other
seed
nodes and attempt to join the cluster. - If the joining is successful, end.
- If the joining fails, directly become the
leader
,version = 1
, and end.
- Iterate through all other
non-seed
startup:- Iterate through all
seed
nodes and attempt to join the cluster. - If the joining is successful, end.
- If the joining fails, wait for
retryInterval
and repeat the process until the end.
- Iterate through all
Leader Recovery After Failure
- When the number of live nodes is at least
M
, a newleader
can be elected, and the cluster is restored to a normal state. - When the number of live nodes is less than
M
, the cluster becomes unavailable, and the nodes themselves still attempt to initiate the election process, but it will inevitably fail.seed
: If noleader
is elected after3 * ttlTimeout
, clean up the cluster, increase the version number, and directly become theleader
.non-seed
: If noleader
is elected after3 * ttlTimeout
, clean up the cluster, reset the version number to 1, and reattempt to join the cluster.
KeepAlive
follower
needs to maintain a heartbeat with theleader
.follower
sendsLeaderKeepAlive
.leader
replies withLeaderKeepAliveAck
.- If the
leader
does not receive theLeaderKeepAlive
event from thefollower
afterheartbeatTime
, it marks the status of thefollower
asunreachable
and notifies otherfollowers
. - If the
leader
does not receive theLeaderKeepAlive
event from the follower afterttlTimeout
, it marks the status of thefollower
asleaving
andremoved
, then removes thefollower
from the cluster and notifies otherfollowers
. - If the
follower
does not receive theLeaderKeepAliveAck
event from theleader
afterttlTimeout
, it resets the localleader
, initiates theleader
election process, and nominates itself as theleader
.
leader
needs to maintain a heartbeat with allseed
nodes to avoid isolated islands.leader
sendsSeedKeepAlive
.seed
replies withSeedKeepAliveAck
.- If a
seed
finds that the observedleader
is different from the local observedleader
, it sendsLeaderRelieve
based on certain principles to force oneleader
to step down. - When the
leader
receives the message, it notifies otherfollowers
to rejoin the cluster (clears the cluster and resets the version to 1), and it also attempts to rejoin the cluster (clears the cluster and resets the version to 1). - Conflict resolution strategy for
leader
: If the versions are different, take the one with the larger version number. If the versions are the same, take the one with the largerAddress
(compare host strings first, then compare ports).
Follower Probe
- Maintain a heartbeat with the
leader
at intervals ofheartbeatInterval
and check the status of theleader
.
Leader Probe
- Maintain a heartbeat with the
follower
and handle exceptions. The detailed process is not described here. - Maintain a heartbeat with all
seed
nodes to avoid isolated islands.- If all nodes have the same seed configuration, isolation can be completely avoided.
- If all nodes have different seed configurations, there is a certain probability of isolated islands.
Paxos
- In one election, the
id
of each proposal must be different. - After the election, to avoid receiving proposals from the previous election and polluting the
candidate
. - After a new
leader
is generated, the local state needs to be cleared, including the maximum proposal number and thecandidate
corresponding to that proposal. - The validity period of the local state is
ttlTimeout * 3
.
Flowchart Source Code
skinparam backgroundColor #EEEBDC
skinparam handwritten true
skinparam sequence {
ArrowColor DeepSkyBlue
ActorBorderColor DeepSkyBlue
LifeLineBorderColor blue
LifeLineBackgroundColor #A9DCDF
ParticipantBorderColor DeepSkyBlue
ParticipantBackgroundColor DodgerBlue
ParticipantFontName Impact
ParticipantFontSize 17
ParticipantFontColor #A9DCDF
ActorBackgroundColor aqua
ActorFontColor DeepSkyBlue
ActorFontSize 17
ActorFontName Aapex
}
participant member
participant seed
participant leader
member -> seed: LeaderQuery
alt leader exist
seed --> member: LeaderResponse(leaderAddress)
member -> leader: JoinClusterRequest(address)
alt is leader
leader --> member: JoinClusterResponse(accept)
leader -> leader: moving member status to <joining>\nand broadcast to all active members
leader -> leader: moving member status to <active>\nand broadcast to all active members
end
alt is not leader
leader --> member: JoinClusterResponse(reject)
...
note right member: repeat this process from start\n(ask other seed)
end
end
alt leader not exist
seed --> member: LeaderResponse(not exist)
...
note right member: repeat this process from start\n(ask other seed)
end
Flowchart
Flowchart Source Code
skinparam backgroundColor #EEEBDC
skinparam handwritten true
skinparam sequence {
ArrowColor DeepSkyBlue
ActorBorderColor DeepSkyBlue
LifeLineBorderColor blue
LifeLineBackgroundColor #A9DCDF
ParticipantBorderColor DeepSkyBlue
ParticipantBackgroundColor DodgerBlue
ParticipantFontName Impact
ParticipantFontSize 17
ParticipantFontColor #A9DCDF
ActorBackgroundColor aqua
ActorFontColor DeepSkyBlue
ActorFontSize 17
ActorFontName Aapex
}
participant seed1
participant seed2
seed1 -> seed2: LeaderQuery
alt leader exist
seed2 --> seed1: LeaderResponse(leaderAddress)
note right seed1: continue with steps in NORMAL_JOINING_PROCESS
end
alt leader not exist\nhas other seeds to try
note right seed1: repeat this process from start\n(ask other seed)
end
alt leader not exist\nhas no other seeds to try
note right seed1: let itself as leader
end
Flowchart
leader1
sends SeedKeepAlive
event to seed
, and the locally observed leader
of seed
is leader2
.
Flowchart Source Code
skinparam backgroundColor #EEEBDC
skinparam handwritten true
skinparam sequence {
ArrowColor DeepSkyBlue
ActorBorderColor DeepSkyBlue
LifeLineBorderColor blue
LifeLineBackgroundColor #A9DCDF
ParticipantBorderColor DeepSkyBlue
ParticipantBackgroundColor DodgerBlue
ParticipantFontName Impact
ParticipantFontSize 17
ParticipantFontColor #A9DCDF
ActorBackgroundColor aqua
ActorFontColor DeepSkyBlue
ActorFontSize 17
ActorFontName Aapex
}
participant leader1
participant seed
participant leader2
leader1 -> seed: SeedKeepAlive(leader1, version1)\nfrom the local perspective
seed -> seed: check whether the leader1 and version1 \nare consistent with the local observations
seed --> leader1: SeedKeepAliveAck
alt inconsistent
note right seed: assume leader1 win the competition
seed -> leader2: LeaderRelieve(leader1)
leader2 -> leader2: rejoin the cluster\nand broadcast to all active members to do so
end
alt other case
note right seed: do nothing
end
Flowchart
Before starting the election, it is necessary to inquire other nodes whether the leader
is normal from their perspectives. The election process begins only if at least M
replies indicate that the leader is not normal. This is to avoid unnecessary leader
elections due to a node's recovery from disconnection.
Flowchart Source Code
skinparam backgroundColor #EEEBDC
skinparam handwritten true
skinparam sequence {
ArrowColor DeepSkyBlue
ActorBorderColor DeepSkyBlue
LifeLineBorderColor blue
LifeLineBackgroundColor #A9DCDF
ParticipantBorderColor DeepSkyBlue
ParticipantBackgroundColor DodgerBlue
ParticipantFontName Impact
ParticipantFontSize 17
ParticipantFontColor #A9DCDF
ActorBackgroundColor aqua
ActorFontColor DeepSkyBlue
ActorFontSize 17
ActorFontName Aapex
}
participant follower
participant otherFollowers
follower -> otherFollowers: LeaderStatusQuery
otherFollowers --> follower: LeaderStatusQueryResponse(isHealthy)
alt leader is healthy
follower -> follower: in all replies, if the number of unhealthy < M,\nthen stop shis process
end
alt leader is not healthy
follower -> follower: in all replies, if the number of unhealthy >= M,\nthen start the following steps
follower -> otherFollowers: Proposal(pre, self, self)
otherFollowers --> follower: Vote(pre, isAccept, candidate),\nthe candidate has the largest accepted formal-proposal id
alt pre-proposal reject
follower -> follower: in all pre-votes, if the number of accept < M,\nthen stop this process
end
alt pre-proposal accept
follower -> follower: in all pre-votes, if the number of accept >= M,\nthen choose the candidate with largest formal-proposal id
follower -> otherFollowers: Proposal(formal, self, candidate)
otherFollowers --> follower: Vote(formal, isAccept)
alt formal-proposal reject
follower -> follower: in all formal-votes, if the number of accept < M,\nthen stop this process
end
alt formal-proposal accept
follower -> follower: in all formal-votes, if the number of accept >= M,\nthen the candidate is elected as leader
end
end
end
Flowchart
Core Principles
- The party initiating the connection will send
active-greet
. - The passive party receiving
active-greet
will reply withpassive-greet
. - The mapping of
address
andchannel
is cached only when receiving the correspondinggreet
from the other party. - If a mapping conflict is detected during caching, only the connection initiated by the party with the smaller
address
is kept.
In the following flow, assuming address(member1)<address(member2)
- The red lines represent connections initiated by
member1
. - The green lines represent connections initiated by
member2
.
skinparam backgroundColor #EEEBDC
skinparam handwritten true
skinparam sequence {
ArrowColor DeepSkyBlue
ActorBorderColor DeepSkyBlue
LifeLineBorderColor blue
LifeLineBackgroundColor #A9DCDF
ParticipantBorderColor DeepSkyBlue
ParticipantBackgroundColor DodgerBlue
ParticipantFontName Impact
ParticipantFontSize 17
ParticipantFontColor #A9DCDF
ActorBackgroundColor aqua
ActorFontColor DeepSkyBlue
ActorFontSize 17
ActorFontName Aapex
}
participant timeline
timeline-[#red]>timeline: member1 try to talk to member2,\nbut cannot find connection with member2,\nthen connect member2 and send active-greet
timeline-[#green]>timeline: member2 try to talk to member1,\nbut cannot find connection with member1,\nthen connect member1 and send active-greet
...
alt active-greet from member2 to member1 arrive first
timeline-[#green]>timeline: member1 received the active-greet from member2,\nand save member2->channel(green one) mapping,\nthen send passive-greet to member2
...
alt active-greet from member1 to member2 arrive first
timeline-[#red]>timeline: member2 received the active-greet from member1,\nand save member1->channel(red one) mapping,\nthen send passive-greet to member1
note right timeline: No matter the following two steps in which order the results will not be affected
timeline-[#green]>timeline: member2 received the passive-greet from member1,\nand try to save member1->channel(green one) mapping but find mapping exists,\nthen save member1->channel(red one) according address comparation policy
timeline-[#red]>timeline: member1 received the passive-greet from member2,\nand try to save member2->channel(red one) mapping but find mapping exists,\nthen save member2->channel(red one) according address comparation policy
end
alt passive-greet from member1 to member2 arrive first
timeline-[#green]>timeline: member2 received the passive-greet from member1,\nand save member1->channel(green one) mapping
timeline-[#red]>timeline: member2 received the active-greet from member1,\nand try to save member1->channel(red one) mapping but find mapping exists,\nthen save member1-channel(red one) according address comparation policy\nsend passive-greet to member1
timeline-[#red]>timeline: member1 received the passive-greet from member2,\nand try to save member2->channel(red one) mapping but find mapping exists,\nthen save member2->channel(red one) according address comparation policy
end
end
alt active-greet from member1 to member2 arrive first
note right timeline: this case is opposite with the upper case
end
Flowchart
member1
andmember2
simultaneously detect the leader offline and inquire other members, receiving replies that the leader is not online.- Both
member1
andmember2
simultaneously initiate leader elections, each nominating themselves as the leader. - The proposal from
member1
reaches other members first, so they all agree, andmember1
becomes the leader. - At this point,
member2
has already acceptedmember1
as the leader, but due to the previous investigation where everyone said the leader is offline, it also initiates a leader election.
How to avoid it: The version when initiating the election must be the version when initiating the leader status query + 1.
System messages are out of order.
- The leader sends
JoinClusterResponse
first and then sendsMemberStatusUpdate
. - The member receives
JoinClusterResponse
first and then receivesMemberStatusUpdate
. - But due to completely asynchronous processing,
MemberStatusUpdate
may be executed beforeJoinClusterResponse
. - This causes the member's state to become
joining
, although it will reply automatically, it is unnecessary.
Does ClusterAlignment only synchronize with active nodes?
- A member is in the joining process.
- The leader handling the joining process receives a LeaderRelieve.
- Since the member is not in the active state, it does not send ReJoinCluster to the member.
- Therefore, after the member exceeds 3 times the ttlTimeout, it will attempt to rejoin the cluster.