Retry remote state download while bootstrap #15950

Open
wants to merge 2 commits into
base: main
from

Conversation

soosinha
Member

Description

When a remote-state-enabled cluster manager node boots up, it tries to download the remote state. If the download hits a problem, such as a missing file, it fails with an exception, and the OpenSearch process stays alive but unresponsive. This PR addresses this by:

  • Adding retries to the download, which mitigates the case where the node is trying to download a stale cluster state.
  • Bailing out by throwing an Error if the retries fail, so that the process exits and a new process is spawned.
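The retry-then-bail-out behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `BootstrapRetrySketch`, the method `downloadWithRetries`, and the fixed delay between attempts are all assumptions made for the example.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: the names here are illustrative and are not taken from GatewayMetaState.
final class BootstrapRetrySketch {

    /**
     * Runs the download up to maxAttempts times, sleeping a fixed delay between failed attempts.
     * Once every attempt has failed, throws an Error so that the process can exit and be respawned.
     */
    static <T> T downloadWithRetries(Callable<T> download, int maxAttempts, long delayInMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return download.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayInMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        // Assumption: an Error (rather than an Exception) is what lets the failure escape
        // ordinary exception handling, as the description above states.
        throw new Error("Remote state download failed during bootstrap", lastFailure);
    }
}
```

A caller would wrap the actual restore call in the `Callable`, for example `downloadWithRetries(() -> restore(), 5, 100L)`, where `restore()` stands in for whatever performs the download.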

Related Issues

NA

Check List

  • Functionality includes testing.
  • [] API changes companion pull request created, if applicable.
  • [] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for 3e02411: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❕ Gradle check result for 8f79550: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 71.68%. Comparing base (92d7fe8) to head (440c131).
Report is 15 commits behind head on main.

Files with missing lines Patch % Lines
.../java/org/opensearch/gateway/GatewayMetaState.java 75.00% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15950      +/-   ##
============================================
- Coverage     71.90%   71.68%   -0.23%     
+ Complexity    64216    64112     -104     
============================================
  Files          5272     5277       +5     
  Lines        300597   300708     +111     
  Branches      43440    43451      +11     
============================================
- Hits         216151   215567     -584     
- Misses        66680    67385     +705     
+ Partials      17766    17756      -10     


int delayInMills = 100;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        return restoreClusterState(remoteStoreRestoreService, clusterState, lastKnownClusterUUID);
Member

Add an info log for every attempt.

Collaborator

Maybe reuse a retryable entity that is already present in the cluster manager.

Member Author

@Bukhtawar I do not see a suitable retryable entity already present. Let me know if you have any reference.
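For illustration, the two review suggestions in this thread (an info log on every attempt and a reusable retry abstraction) could be combined along these lines. This is only a sketch under stated assumptions: `SimpleRetry` is a hypothetical class, not an existing OpenSearch component. It uses `java.util.logging` to stay self-contained, whereas the project's own logging setup would differ, and the doubling delay is an arbitrary choice.

```java
import java.util.concurrent.Callable;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not an existing OpenSearch class.
final class SimpleRetry<T> {
    private static final Logger logger = Logger.getLogger(SimpleRetry.class.getName());

    private final String description;
    private final Callable<T> action;
    private final int maxAttempts;
    private final long initialDelayMillis;

    SimpleRetry(String description, Callable<T> action, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.description = description;
        this.action = action;
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
    }

    /** Runs the action, logging each attempt, and rethrows the last failure when attempts run out. */
    T run() throws Exception {
        long delay = initialDelayMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            logger.info(description + ": attempt " + attempt + " of " + maxAttempts);
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;
                logger.warning(description + ": attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // assumption: double the wait after each failure
                }
            }
        }
        throw lastFailure;
    }
}
```

The caller decides what to do with the final exception, so the bootstrap path could wrap it in an Error while other callers handle it differently.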

Signed-off-by: Sooraj Sinha <[email protected]>
Contributor

✅ Gradle check result for 440c131: SUCCESS


3 participants