-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for race condition in node-join/node-left loop #15521
Fix for race condition in node-join/node-left loop #15521
Conversation
❌ Gradle check result for 9496aa1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 94ff71b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 78b6fdd: ABORTED Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java
Outdated
Show resolved
Hide resolved
❌ Gradle check result for 8411788: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for f0cf40c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/transport/TransportService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/RemoteConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
❌ Gradle check result for a1db9fd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 61eac52: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for dd14cc8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
4a46aa6
to
d7be8b7
Compare
rebased main |
❌ Gradle check result for 4a46aa6: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 412eada: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for dd68f30: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for ca48626: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 36e473d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
9c788fb
to
e02aa10
Compare
❌ Gradle check result for 9c788fb: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for e02aa10: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/NodeConnectionsService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/opensearch/transport/ClusterConnectionManager.java
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/opensearch/cluster/coordination/NodeJoinLeftIT.java
Outdated
Show resolved
Hide resolved
❌ Gradle check result for b0a7ae3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 0d3db12: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
0d3db12
to
9e322c4
Compare
force pushed to rebase from main |
❌ Gradle check result for 9e322c4: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 85a9e37: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rahul Karajgikar <[email protected]>
Signed-off-by: Rahul Karajgikar <[email protected]>
Signed-off-by: Rahul Karajgikar <[email protected]>
Signed-off-by: Rahul Karajgikar <[email protected]>
a107371
to
1ded2cc
Compare
Signed-off-by: Rahul Karajgikar <[email protected]>
❌ Gradle check result for 9a060cc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rahul Karajgikar <[email protected]>
❌ Gradle check result for 7b0d28b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rahul Karajgikar <[email protected]>
❕ Gradle check result for e0a0ae2: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
1563e1a
into
opensearch-project:main
* Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]> (cherry picked from commit 1563e1a) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
(cherry picked from commit 1563e1a) Signed-off-by: Rahul Karajgikar <[email protected]>
…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>
…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>
…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>
…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>
…t#15521) * Add custom connect to node for handleJoinRequest Signed-off-by: Rahul Karajgikar <[email protected]>
Description
Fix for race condition in node-join/node-left loop.
Scenario where race condition can happen:
node-left
task on cluster manager thread.node-left
task.discovery.find_peers_interval
setting]node-join
task into its thread.node-join
task would only start after thenode-left
task is completed since cluster-manager is single threaded.node-left
task has completed publication and has started to apply the new state on the cluster manager.node-left
task, cluster manager wipes out the connection to the leaving node.node-left
task then completes and thenode-join
task begins.node-join
task starts. This task assumes that because the previous join request succeeded, that the connection to the joining node would still be there.FollowersChecker
thread to add this new node.FollowerChecker
thread fails withNodeNotConnectedException
because the connection was wiped and triggers a newnode-left
task.node-left
task also takes time, we end up in an infinite loop ofnode-left
andnode-join
tasks.FollowerChecker
is modified to handle thisNodeNotConnectedException
gracefully without triggering anode-left
task, the state publication to this joining node still fails because the connection was wiped. So thenode-join
state never completes on the joining node and it forever remains in candidate phase.To summarise, if we allow a
node-join
task into the queue before thenode-left
task disconnects from the node, we will see the race condition happen.Fix:
As part of the fix for this, we now reject the initial join request from a node that has an ongoing
node-left
task.The join request will only succeed after the
node-left
task completes committing state on cluster manager, so the connection that gets created as part of the join request does not get wiped out and causenode-join
task to fail.This is done by marking nodes as pending disconnect right before publish state of
node-left
task.We mark the nodes as completed disconnect after commit state of
node-left
task, or on re-election of cluster manager.If there is a connection request from cluster-manager to any other node during this time, we reject the connection request. This blocks join requests because during join requests, cluster-manager tries to connect to the node trying to join.
The join request will keep retrying, and once the
node-left
succeeds, the join request will be able to make a connection and succeed.Main classes:
Related Issues
Resolves #4874
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.