Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use local clusterState call during healthchecks #8187

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

gargharsh3134
Copy link
Contributor

@gargharsh3134 gargharsh3134 commented Sep 13, 2024

Description

During health checks, while trying to get the optimised health checks property for all the nodes, a cluster state call is made to Opensearch which eventually is served by the ClusterManager node and hence stresses it out. For large clusters having high number of indices and shards, the cluster state size grows and serving cluster state requests from each node starts to degrade the cluster manager's performance. The attribute for each node can very well be served from local cluster state of the node. As per cluster manager's functioning, a node whose cluster state is lagging will be kicked out of the cluster after 90 seconds of waiting. So, serving clusterState from local for OSD process should not lead to any functionality change, except a little time lag in case the cluster state is lagging on the local node.

This change just updates the clusterState call to use the local=true query param, so that the request doesn't impact ClusterManager node.

Issues Resolved

Screenshot

Testing the changes

Changelog

Check List

  • All tests pass
    • yarn test:jest
    • yarn test:jest_integration
  • New functionality includes testing.
  • New functionality has been documented.
  • Update CHANGELOG.md
  • Commits are signed per the DCO using --signoff

Copy link
Contributor

ℹ️ Manual Changeset Creation Reminder

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR.

If you want to use the available OpenSearch Changeset Bot App to avoid manual creation of changeset file you can install it in your forked repository following this link.

For more information about formatting of changeset files, please visit OpenSearch Auto Changeset and Release Notes Tool.

Copy link
Contributor

❌ Changeset File Not Added Yet

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing.

1 similar comment
Copy link
Contributor

❌ Changeset File Not Added Yet

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing.

Copy link

codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.06%. Comparing base (b826df8) to head (f478d79).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8187      +/-   ##
==========================================
+ Coverage   64.05%   64.06%   +0.01%     
==========================================
  Files        3741     3741              
  Lines       88629    88629              
  Branches    13801    13801              
==========================================
+ Hits        56771    56784      +13     
+ Misses      31260    31247      -13     
  Partials      598      598              
Flag Coverage Δ
Linux_1 30.06% <ø> (ø)
Linux_2 58.83% <ø> (ø)
Linux_3 40.37% <ø> (ø)
Linux_4 31.46% <ø> (ø)
Windows_1 30.09% <ø> (+0.01%) ⬆️
Windows_2 58.78% <ø> (ø)
Windows_3 40.37% <ø> (ø)
Windows_4 31.46% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ruanyl
ruanyl previously approved these changes Sep 16, 2024
@ruanyl
Copy link
Member

ruanyl commented Sep 16, 2024

Could you please mark this PR to be ready to review?
The code change looks fine to me, but I left a question about this old comment:

    /*
     * Using _cluster/state/nodes to retrieve the cluster_id of each node from cluster manager node which
     * is considered to be a lightweight operation to aggegrate different cluster_ids from the OpenSearch nodes.
     */

Copy link
Contributor

❌ Changeset File Not Added Yet

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing.

1 similar comment
Copy link
Contributor

❌ Changeset File Not Added Yet

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing.

Copy link
Member

@ashwin-pc ashwin-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Can you add a changeset to the PR. I think this change is worth calling out in the changelog. Will approve once thats ready.

Signed-off-by: Harsh Garg <[email protected]>
@gargharsh3134
Copy link
Contributor Author

Nice! Can you add a changeset to the PR. I think this change is worth calling out in the changelog. Will approve once thats ready.

@ashwin-pc Thanks for taking a look. I have added the changelog fragment, please take a look again.

@gargharsh3134
Copy link
Contributor Author

Could you please mark this PR to be ready to review? The code change looks fine to me, but I left a question about this old comment:

    /*
     * Using _cluster/state/nodes to retrieve the cluster_id of each node from cluster manager node which
     * is considered to be a lightweight operation to aggegrate different cluster_ids from the OpenSearch nodes.
     */

@ruanyl Thanks for looking into it. I have updated the code comment, please check again.

@ruanyl
Copy link
Member

ruanyl commented Sep 17, 2024

@gargharsh3134 While I was diving into the original implementation, I found the related issue of it: #330

I'm not familiar with the context of it, would you mind to take a look and see if your changes are compatible with it?

@gargharsh3134
Copy link
Contributor Author

@gargharsh3134 While I was diving into the original implementation, I found the related issue of it: #330

I'm not familiar with the context of it, would you mind to take a look and see if your changes are compatible with it?

@ruanyl I went through the linked issue, and looks like the new changes should maintain compatibility. The issue calls out the problem of fanning out the health check calls directly to all the nodes, and incase some of the nodes are slow and under resource constraint, they can bring down the OSD process. The proposed solution was to just make a single _cluster/state call to figure out when to really fan out these health checks to all the nodes, instead of doing it each time.

In the existing implementation, the request path was OSD making a call to local OS process which then is calling clusterManager node, once clusterManager responds back to local OS, the same response is served to OSD. With the new changes, the only difference is, instead of local node calling cluster manager, it would serve the request from it's own local cluster state (which might be lagging if the node is slow but would eventually catch up).

@BionIT BionIT merged commit cc5531b into opensearch-project:main Sep 18, 2024
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants