Use local clusterState call during healthchecks #8187

gargharsh3134 · 2024-09-13T03:54:56Z

Description

During health checks, OSD looks up the optimised health check attribute for every node by making a cluster state call to OpenSearch. That call is ultimately served by the cluster manager node and puts load on it. In large clusters with many indices and shards, the cluster state grows, and serving the cluster state requests that arrive via each node starts to degrade the cluster manager's performance. The attribute for each node can just as well be read from the node's local cluster state. The cluster manager removes a node whose cluster state is lagging after waiting 90 seconds, so serving the cluster state locally to the OSD process should not change functionality, apart from a short delay when the local node's cluster state is lagging.

This change only adds the local=true query parameter to the clusterState call, so that the request does not impact the cluster manager node.
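As a rough illustration of the request shape described above, the sketch below builds the REST path for a cluster state call with `local=true`. It is not the code from this PR: `ClusterStateRequest` and `buildClusterStatePath` are hypothetical names, and the `cluster_id` attribute in the usage note is only an example taken from the code comment quoted later in this thread.

```typescript
// Hypothetical request options; these names are illustrative, not from the PR.
interface ClusterStateRequest {
  metric: string; // e.g. 'nodes'
  filterPath: string[]; // response filters, e.g. per-node attributes
  local: boolean; // true: answer from the receiving node's own cluster state
}

// Builds the path for GET _cluster/state/<metric> with optional query params.
function buildClusterStatePath(req: ClusterStateRequest): string {
  const query: string[] = [];
  if (req.filterPath.length > 0) {
    const filters = req.filterPath.map((p) => encodeURIComponent(p)).join(',');
    query.push(`filter_path=${filters}`);
  }
  if (req.local) {
    // With local=true the node does not forward the request to the cluster manager.
    query.push('local=true');
  }
  const suffix = query.length > 0 ? `?${query.join('&')}` : '';
  return `/_cluster/state/${encodeURIComponent(req.metric)}${suffix}`;
}
```

Calling `buildClusterStatePath({ metric: 'nodes', filterPath: ['nodes.*.attributes.cluster_id'], local: true })` yields a path ending in `&local=true`, which is the only difference from the earlier request.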

Check List

- All tests pass
  - yarn test:jest
  - yarn test:jest_integration
- New functionality includes testing.
- New functionality has been documented.
- Update CHANGELOG.md
- Commits are signed per the DCO using --signoff

Signed-off-by: Harsh Garg <[email protected]>

github-actions · 2024-09-13T03:55:28Z

ℹ️ Manual Changeset Creation Reminder

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR.

If you want to use the available OpenSearch Changeset Bot App to avoid manual creation of changeset file you can install it in your forked repository following this link.

For more information about formatting of changeset files, please visit OpenSearch Auto Changeset and Release Notes Tool.

github-actions · 2024-09-13T03:59:29Z

❌ Changeset File Not Added Yet

Please ensure manual commit for changeset file 8187.yml under folder changelogs/fragments to complete this PR. File still missing.

codecov · 2024-09-13T04:10:08Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.06%. Comparing base (b826df8) to head (f478d79).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8187      +/-   ##
==========================================
+ Coverage   64.05%   64.06%   +0.01%     
==========================================
  Files        3741     3741              
  Lines       88629    88629              
  Branches    13801    13801              
==========================================
+ Hits        56771    56784      +13     
+ Misses      31260    31247      -13     
  Partials      598      598

| Flag | Coverage Δ | |
|---|---|---|
| Linux_1 | `30.06% <ø> (ø)` | |
| Linux_2 | `58.83% <ø> (ø)` | |
| Linux_3 | `40.37% <ø> (ø)` | |
| Linux_4 | `31.46% <ø> (ø)` | |
| Windows_1 | `30.09% <ø> (+0.01%)` | ⬆️ |
| Windows_2 | `58.78% <ø> (ø)` | |
| Windows_3 | `40.37% <ø> (ø)` | |
| Windows_4 | `31.46% <ø> (+<0.01%)` | ⬆️ |

src/core/server/opensearch/version_check/ensure_opensearch_version.ts

ruanyl · 2024-09-16T06:00:47Z

Could you please mark this PR to be ready to review?
The code change looks fine to me, but I left a question about this old comment:

    /*
     * Using _cluster/state/nodes to retrieve the cluster_id of each node from cluster manager node which
     * is considered to be a lightweight operation to aggegrate different cluster_ids from the OpenSearch nodes.
     */

ashwin-pc

Nice! Can you add a changeset to the PR? I think this change is worth calling out in the changelog. Will approve once that's ready.

gargharsh3134 · 2024-09-16T20:42:10Z

> Nice! Can you add a changeset to the PR? I think this change is worth calling out in the changelog. Will approve once that's ready.

@ashwin-pc Thanks for taking a look. I have added the changelog fragment. Please take another look.

gargharsh3134 · 2024-09-17T04:05:03Z

> Could you please mark this PR to be ready to review? The code change looks fine to me, but I left a question about this old comment:
>
>     /*
>      * Using _cluster/state/nodes to retrieve the cluster_id of each node from cluster manager node which
>      * is considered to be a lightweight operation to aggegrate different cluster_ids from the OpenSearch nodes.
>      */

@ruanyl Thanks for looking into it. I have updated the code comment. Please check again.

ruanyl · 2024-09-17T06:17:21Z

@gargharsh3134 While digging into the original implementation, I found the issue it relates to: #330

I'm not familiar with its context. Would you mind taking a look to see whether your changes are compatible with it?

gargharsh3134 · 2024-09-17T16:38:17Z

> @gargharsh3134 While digging into the original implementation, I found the issue it relates to: #330
>
> I'm not familiar with its context. Would you mind taking a look to see whether your changes are compatible with it?

@ruanyl I went through the linked issue, and it looks like the new changes should remain compatible with it. The issue describes the problem of fanning out health check calls directly to all nodes: if some nodes are slow or resource constrained, they can bring down the OSD process. The proposed solution was to make a single _cluster/state call to decide when the health checks really need to fan out to all nodes, instead of fanning out every time.

In the existing implementation, OSD calls the local OS process, which in turn calls the cluster manager node. Once the cluster manager responds to the local OS process, the same response is returned to OSD. With the new changes, the only difference is that the local node no longer calls the cluster manager and instead serves the request from its own local cluster state (which might lag if the node is slow, but would eventually catch up).
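For readers who have not seen #330, the "single call decides whether to fan out" idea can be sketched roughly as below. This is an assumed decision rule written for illustration, not the logic in ensure_opensearch_version.ts: `NodesState`, `HealthCheckTarget` and `chooseHealthCheckTarget` are hypothetical names, and the rule (one shared attribute value means a single node can be checked) is an assumption.

```typescript
// Hypothetical, trimmed shape of a `_cluster/state/nodes` response.
interface NodesState {
  nodes: Record<string, { attributes?: Record<string, string> }>;
}

type HealthCheckTarget = 'local' | 'all-nodes';

// Assumed rule: check only one node when every node reports the same
// attribute value; otherwise fan the health check out to all nodes.
function chooseHealthCheckTarget(state: NodesState, attribute: string): HealthCheckTarget {
  const values = new Set<string>();
  for (const node of Object.values(state.nodes)) {
    const value = node.attributes?.[attribute];
    // A node without the attribute gives no guarantee, so fall back to fanning out.
    if (value === undefined) {
      return 'all-nodes';
    }
    values.add(value);
  }
  // Exactly one shared value: a single node can stand in for the rest.
  return values.size === 1 ? 'local' : 'all-nodes';
}
```

Because the function only reads per-node attributes, it behaves the same whether the state came from the cluster manager or from the local node's copy. A lagging local copy would at worst delay the decision.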

Use local clusterState call during healthchecks

637692f

Signed-off-by: Harsh Garg <[email protected]>

github-actions bot added the first-time-contributor label Sep 13, 2024

github-actions bot added the failed changeset label Sep 13, 2024

ruanyl previously approved these changes Sep 16, 2024

ruanyl reviewed Sep 16, 2024

src/core/server/opensearch/version_check/ensure_opensearch_version.ts

gargharsh3134 marked this pull request as ready for review September 16, 2024 16:48

gargharsh3134 requested review from ananzh, kavilla, AMoo-Miki, ashwin-pc, joshuarrrr, abbyhu2000, zengyan-amazon, zhongnansu, manasvinibs, ZilongX, Flyingliuhub, curq, bandinib-amzn, SuZhou-Joe, BionIT, xinruiba, zhyuanqi, mengweieric and LDrago27 as code owners September 16, 2024 16:48

gargharsh3134 requested review from virajsanghvi, sejli, joshuali925 and huyaboo as code owners September 16, 2024 16:48

ashwin-pc reviewed Sep 16, 2024

gargharsh3134 dismissed ruanyl’s stale review via b648a5b September 16, 2024 20:39

Adding changeLog

936cdce

Signed-off-by: Harsh Garg <[email protected]>

gargharsh3134 force-pushed the localClusterStateCall branch from b648a5b to 936cdce Compare September 16, 2024 20:40

Merge branch 'main' into localClusterStateCall

de92f84

ashwin-pc approved these changes Sep 17, 2024

Merge branch 'main' into localClusterStateCall

f478d79

ruanyl approved these changes Sep 17, 2024

github-actions bot removed the failed changeset label Sep 18, 2024

BionIT merged commit cc5531b into opensearch-project:main Sep 18, 2024
74 checks passed
