DAOS-16477 mgmt: return suspect engines for pool healthy query #15458
base: master
Conversation
After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol but were not excluded from the system, to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.

An example of the output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
Ticket title is 'Provide admin interface to query hanging engines after massive failure'
To reviewers: This PR landed before but got reverted because of conflicts with the MD-on-SSD phase 2 PR. I refreshed the PR against master and removed a workaround in the pool query tests (since the rebuild bug is fixed).
C changes look good to me.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/1/execution/node/1478/log
There is at least one fix needed. I also have a couple questions about the intent of some changes.
        data['response'].get('enabled_ranks')))
    self.assertListEqual(
        data['response'].get('disabled_ranks'), [],
-       "Invalid disabled_ranks field: want=[], got={}".format(
+       "Invalid suspect_ranks field: want=[], got={}".format(
Shouldn't this still be disabled_ranks?
Yes, it should be disabled_ranks.
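A minimal sketch of what the corrected pair of checks might look like, assuming the surrounding test keeps the data['response'] structure shown in the diff context (the suspect_ranks check is illustrative of the new field this PR adds, not a copy of the final test code):

    # Sketch only: assumes the dmg pool query JSON response exposes
    # 'disabled_ranks' and 'suspect_ranks' lists, as in the diff above.
    self.assertListEqual(
        data['response'].get('disabled_ranks'), [],
        "Invalid disabled_ranks field: want=[], got={}".format(
            data['response'].get('disabled_ranks')))
    self.assertListEqual(
        data['response'].get('suspect_ranks'), [],
        "Invalid suspect_ranks field: want=[], got={}".format(
            data['response'].get('suspect_ranks')))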
    self.pool.wait_for_rebuild_to_end()
    exclude_rank = all_ranks[0]
    suspect_rank = all_ranks[1]
    self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks)
The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

Since we're modifying this test it would be beneficial to use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log:
self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks) | |
self.log_step(f"Starting excluding rank:{exclude_rank} all_ranks={all_ranks}") |
The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

I think the test description needs to be updated. We don't need to stop all three ranks; the test intention is to verify that the enabled, disabled, and suspect ranks change accordingly.
Since we're modifying this test it would be beneficial to use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log.

Yup, that is a good idea indeed.
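To make the suggestion concrete, a rough sketch of how the step markers might be laid out (only self.log_step, self.pool.wait_for_rebuild_to_end, and the rank variables come from the discussion above; the step wording and the placeholders between steps are illustrative, not the final test flow):

    # Sketch only: step messages are illustrative; the placeholders stand in
    # for whatever the existing test code already does at each step.
    self.log_step(f"Starting excluding rank:{exclude_rank} all_ranks={all_ranks}")
    # ... existing code that excludes exclude_rank from the pool ...

    self.log_step(f"Waiting for rebuild to end after excluding rank {exclude_rank}")
    self.pool.wait_for_rebuild_to_end()

    self.log_step(f"Stopping rank {suspect_rank} to produce a suspect engine")
    # ... existing code that stops suspect_rank without excluding it ...

    self.log_step("Verifying enabled/disabled/suspect ranks via dmg pool query --health-only")
    # ... existing assertions on data['response'] ...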
@@ -495,6 +494,7 @@ def __init__(self, base_namespace, index, provider=None, max_storage_tiers=MAX_S
     "ABT_ENV_MAX_NUM_XSTREAMS=100",
     "ABT_MAX_NUM_XSTREAMS=100",
     "DAOS_MD_CAP=1024",
+    "DAOS_POOL_RF=4",
By moving this into default_env_vars, DAOS_POOL_RF=4 will only be set for tests that do NOT define a /run/server_config/engines/x/env_vars in their test yaml, or that redefine it in that same yaml setting. Just confirming that this is the intent. For example, a test like src/tests/ftest/daos_test/rebuild.py (because its yaml file defines specific env_vars entries, overriding the default) will no longer run with DAOS_POOL_RF=4 with this change. There are currently ~50 functional tests that override the server config env_vars.
@phender I am a bit confused by the current behavior. For this specific test, I want to set DAOS_POOL_RF=2, but for all other tests I want the default. Putting DAOS_POOL_RF=4 in default_env_vars does not match what I need, but if I don't move it, setting it in the server yaml file won't change the value. Would you point out how we should fix this? I asked this before...
Hmm, we currently don't have anything set up in the test harness for individual env_vars that would allow us to define one like DAOS_POOL_RF with a value that will be set if not defined, but that can also be overridden. The REQUIRED_ENV_VARS dictionary defines env_vars that must be set with a very specific value, so there is no logic that would allow them to be overridden.

Moving DAOS_POOL_RF=4 out of REQUIRED_ENV_VARS["common"] allows the test yaml to define a different value, like DAOS_POOL_RF=2 - so that is needed - but then any test that does not define any DAOS_POOL_RF value no longer gets DAOS_POOL_RF=4 set.
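For reference, a hypothetical test yaml fragment along these lines, following the /run/server_config/engines/x/env_vars layout mentioned above (the exact keys for this test and the repetition of the other default entries are assumptions, since an env_vars override replaces the default list):

    # Hypothetical test yaml override: repeats the other defaults from the
    # diff above and changes only DAOS_POOL_RF for this one test.
    server_config:
      engines:
        0:
          env_vars:
            - ABT_ENV_MAX_NUM_XSTREAMS=100
            - ABT_MAX_NUM_XSTREAMS=100
            - DAOS_MD_CAP=1024
            - DAOS_POOL_RF=2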
I see two options: 1) Add DAOS_POOL_RF=4 to all other tests that currently define a server_config/engines/x/env_vars entry in their test yaml - like you've done in src/tests/ftest/daos_test/suite.yaml, or 2) rework the REQUIRED_ENV_VARS logic to support allowing certain values to be overridden.
I would probably prefer the second method to avoid touching too many files.
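A rough sketch of what that second option could look like, assuming the harness keeps a REQUIRED_ENV_VARS-style dictionary; the DEFAULT_ENV_VARS name and the merge helper below are hypothetical and only illustrate "set if not defined, but overridable from the test yaml", not the actual harness code:

# Hypothetical sketch: split env vars into ones that must keep a fixed value
# and ones that only provide a default when the test yaml does not set them.
REQUIRED_ENV_VARS = {
    "common": ["ABT_ENV_MAX_NUM_XSTREAMS=100", "ABT_MAX_NUM_XSTREAMS=100"],
}
DEFAULT_ENV_VARS = {
    "common": ["DAOS_POOL_RF=4"],   # applied only if the yaml does not override it
}


def merge_env_vars(yaml_env_vars, required, defaults):
    """Combine yaml-defined env_vars with required and default entries."""
    merged = list(yaml_env_vars)
    defined = {entry.split("=", 1)[0] for entry in merged}
    # Defaults apply only when the test yaml did not define the variable.
    for entry in defaults:
        name = entry.split("=", 1)[0]
        if name not in defined:
            merged.append(entry)
            defined.add(name)
    # Required entries always win and replace any yaml-defined value.
    for entry in required:
        name = entry.split("=", 1)[0]
        merged = [item for item in merged if item.split("=", 1)[0] != name]
        merged.append(entry)
    return merged


# Example: a test yaml that sets DAOS_POOL_RF=2 keeps its value, while a test
# that sets nothing would still get DAOS_POOL_RF=4 from the defaults.
print(merge_env_vars(["DAOS_POOL_RF=2"], REQUIRED_ENV_VARS["common"],
                     DEFAULT_ENV_VARS["common"]))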
Required-githooks: true
Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15458/2/testReport/

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/2/execution/node/1196/log

@phender would you help fix the CI env issue? Thanks!
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: