-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use local clusterState call during healthchecks #8187
Use local clusterState call during healthchecks #8187
Conversation
Signed-off-by: Harsh Garg <[email protected]>
ℹ️ Manual Changeset Creation ReminderPlease ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. If you want to use the available OpenSearch Changeset Bot App to avoid manual creation of changeset file you can install it in your forked repository following this link. For more information about formatting of changeset files, please visit OpenSearch Auto Changeset and Release Notes Tool. |
❌ Changeset File Not Added YetPlease ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing. |
1 similar comment
❌ Changeset File Not Added YetPlease ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #8187 +/- ##
==========================================
+ Coverage 64.05% 64.06% +0.01%
==========================================
Files 3741 3741
Lines 88629 88629
Branches 13801 13801
==========================================
+ Hits 56771 56784 +13
+ Misses 31260 31247 -13
Partials 598 598
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Could you please mark this PR to be ready to review?
|
❌ Changeset File Not Added YetPlease ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing. |
1 similar comment
❌ Changeset File Not Added YetPlease ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Can you add a changeset to the PR. I think this change is worth calling out in the changelog. Will approve once thats ready.
Signed-off-by: Harsh Garg <[email protected]>
b648a5b
to
936cdce
Compare
@ashwin-pc Thanks for taking a look. I have added the changelog fragment, please take a look again. |
@ruanyl Thanks for looking into it. I have updated the code comment, please check again. |
@gargharsh3134 While I was diving into the original implementation, I found the related issue of it: #330 I'm not familiar with the context of it, would you mind to take a look and see if your changes are compatible with it? |
@ruanyl I went through the linked issue, and looks like the new changes should maintain compatibility. The issue calls out the problem of fanning out the health check calls directly to all the nodes, and incase some of the nodes are slow and under resource constraint, they can bring down the OSD process. The proposed solution was to just make a single _cluster/state call to figure out when to really fan out these health checks to all the nodes, instead of doing it each time. In the existing implementation, the request path was OSD making a call to local OS process which then is calling clusterManager node, once clusterManager responds back to local OS, the same response is served to OSD. With the new changes, the only difference is, instead of local node calling cluster manager, it would serve the request from it's own local cluster state (which might be lagging if the node is slow but would eventually catch up). |
Description
During health checks, while trying to get the optimised health checks property for all the nodes, a cluster state call is made to Opensearch which eventually is served by the ClusterManager node and hence stresses it out. For large clusters having high number of indices and shards, the cluster state size grows and serving cluster state requests from each node starts to degrade the cluster manager's performance. The attribute for each node can very well be served from local cluster state of the node. As per cluster manager's functioning, a node whose cluster state is lagging will be kicked out of the cluster after 90 seconds of waiting. So, serving clusterState from local for OSD process should not lead to any functionality change, except a little time lag in case the cluster state is lagging on the local node.
This change just updates the clusterState call to use the
local=true
query param, so that the request doesn't impact ClusterManager node.Issues Resolved
Screenshot
Testing the changes
Changelog
Check List
yarn test:jest
yarn test:jest_integration