Fix for reroute operation during node-left #16468
❌ Gradle check result for 8a64664: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 96c0766: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 83a5525: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 4227892: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
server/src/main/java/org/opensearch/gateway/ReplicaShardBatchAllocator.java
❌ Gradle check result for c506262: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 3c2bd1d: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
844b95d to 0bfcc63
❌ Gradle check result for 0bfcc63: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Rajiv Kumar Vaidyanathan <[email protected]>
Signed-off-by: Rajiv Kumar Vaidyanathan <[email protected]> (cherry picked from commit 9f7d3b6) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@rajiv-kv @shwetathareja why did we skip a changelog entry for this bug fix?
There weren't any user-facing changes, as the null check already exists in the code.
But it's fixing a bug. (I think? I don't see a linked bug issue.) Let me make sure I understand what's happening here: this was a node that was already shutting down and hit an exception while shutting down? So the only impact would be confusing log entries for anyone trying to figure out why the node dropped in the first place? In that case, while an end user may not see it, anyone trying to diagnose a node drop will see a change in log behavior.
…16507) (cherry picked from commit 9f7d3b6) Signed-off-by: Rajiv Kumar Vaidyanathan <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
During node-left task execution, the shards on the leaving node are failed and a reroute operation is triggered. The RoutingAllocation passed to reroute carries the updated ShardRouting information. As part of reroute, existing recoveries are checked for a better match using the ShardRouting entries cached in the ShardBatch cache. Because this cache can still contain a stale ShardRouting assigned to the node that is leaving the cluster, an NPE is thrown when that node is referenced during shard allocation decisions.
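The failure mode described above can be illustrated with a small, self-contained sketch. This is not the OpenSearch code and not the change made in this PR: `StaleBatchCacheSketch`, `Routing`, `State`, and `resolveNode` are hypothetical stand-ins, and the maps only simulate the batch cache, the current routing view, and the set of live nodes. It shows one defensive approach, assuming the cached entry can be compared against the current routing before the node is looked up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

public class StaleBatchCacheSketch {

    enum State { UNASSIGNED, INITIALIZING, STARTED }

    /** Hypothetical, simplified stand-in for a shard routing entry. */
    static final class Routing {
        final String shardId;
        final State state;
        final String nodeId; // null when the shard is unassigned

        Routing(String shardId, State state, String nodeId) {
            this.shardId = shardId;
            this.state = state;
            this.nodeId = nodeId;
        }

        boolean sameAs(Routing other) {
            return other != null
                && shardId.equals(other.shardId)
                && state == other.state
                && Objects.equals(nodeId, other.nodeId);
        }
    }

    /**
     * Resolves the node name for a cached routing entry, or returns empty when
     * the cached entry is stale (differs from the current routing) or when its
     * node is no longer part of the cluster.
     */
    static Optional<String> resolveNode(Routing cached,
                                        Map<String, Routing> currentRouting,
                                        Map<String, String> liveNodes) {
        Routing fresh = currentRouting.get(cached.shardId);
        if (!cached.sameAs(fresh)) {
            return Optional.empty(); // cache is out of date, skip this entry
        }
        if (cached.nodeId == null) {
            return Optional.empty(); // unassigned shard has no node
        }
        // Map.get returns null for a departed node; wrap it instead of dereferencing.
        return Optional.ofNullable(liveNodes.get(cached.nodeId));
    }

    public static void main(String[] args) {
        // The node that left is absent from the live-node map.
        Map<String, String> liveNodes = new HashMap<>();
        liveNodes.put("node-b", "data-node-b");

        // Current routing: the replica was failed and is now UNASSIGNED.
        Map<String, Routing> currentRouting = new HashMap<>();
        currentRouting.put("[index][2]", new Routing("[index][2]", State.UNASSIGNED, null));

        // Batch cache still holds the old INITIALIZING entry on the departed node.
        Routing stale = new Routing("[index][2]", State.INITIALIZING, "uwpG13DpRq-ZvWOUbgpivA");

        System.out.println(resolveNode(stale, currentRouting, liveNodes).isPresent());
    }
}
```

In this sketch the stale entry is skipped rather than dereferenced, so the lookup of the departed node never produces a null that later code would trip over.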
Repro for stale batch cache entries as compared to RoutingAllocation
Node uwpG13DpRq-ZvWOUbgpivA is removed from the cluster and the ShardAllocator tries to reference the primary shard [index][2], which was previously allocated on that node. The batch cache still has the replica shard for [index][2] in INITIALIZING state, although it has been marked as UNASSIGNED in RoutingAllocation.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.