DAOS-16477 mgmt: return suspect engines for pool healthy query #15458
base: master
Conversation
After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol but were not excluded from the system, to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.

An example of the output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
Ticket title is 'Provide admin interface to query hanging engines after massive failure'
To reviewers: This PR landed before but got reverted because of conflicts with the MD-on-SSD phase 2 PR. I refreshed the PR against master and removed a workaround in the pool query tests (since the rebuild bug is fixed).
C changes look good to me.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/1/execution/node/1478/log
There is at least one fix needed. I also have a couple questions about the intent of some changes.
        data['response'].get('enabled_ranks')))
    self.assertListEqual(
        data['response'].get('disabled_ranks'), [],
-       "Invalid disabled_ranks field: want=[], got={}".format(
+       "Invalid suspect_ranks field: want=[], got={}".format(
Shouldn't this still be disabled_ranks?
Yes, it should be disabled_ranks.
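A minimal sketch of what the corrected pair of checks might look like, assuming the surrounding test keeps the data['response'] structure shown in the diff context (the suspect_ranks check is illustrative of the new field this PR adds, not a copy of the final test code):

    # Sketch only: assumes the dmg pool query JSON response exposes
    # 'disabled_ranks' and 'suspect_ranks' lists, as in the diff above.
    self.assertListEqual(
        data['response'].get('disabled_ranks'), [],
        "Invalid disabled_ranks field: want=[], got={}".format(
            data['response'].get('disabled_ranks')))
    self.assertListEqual(
        data['response'].get('suspect_ranks'), [],
        "Invalid suspect_ranks field: want=[], got={}".format(
            data['response'].get('suspect_ranks')))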
    self.pool.wait_for_rebuild_to_end()
    exclude_rank = all_ranks[0]
    suspect_rank = all_ranks[1]
    self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks)
The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

Since we're modifying this test it would be beneficial to use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log:
self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks) | |
self.log_step(f"Starting excluding rank:{exclude_rank} all_ranks={all_ranks}") |
The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

I think the test description needs to be updated. We don't need to stop all three ranks; the test intention is to verify that the enabled, disabled, and suspect ranks change accordingly.
Since we're modifying this test it would be beneficial to use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log.

Yup, that is a good idea indeed.
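To make the suggestion concrete, a rough sketch of how the step markers might be laid out (only self.log_step, self.pool.wait_for_rebuild_to_end, and the rank variables come from the discussion above; the step wording and the placeholders between steps are illustrative, not the final test flow):

    # Sketch only: step messages are illustrative; the placeholders stand in
    # for whatever the existing test code already does at each step.
    self.log_step(f"Starting excluding rank:{exclude_rank} all_ranks={all_ranks}")
    # ... existing code that excludes exclude_rank from the pool ...

    self.log_step(f"Waiting for rebuild to end after excluding rank {exclude_rank}")
    self.pool.wait_for_rebuild_to_end()

    self.log_step(f"Stopping rank {suspect_rank} to produce a suspect engine")
    # ... existing code that stops suspect_rank without excluding it ...

    self.log_step("Verifying enabled/disabled/suspect ranks via dmg pool query --health-only")
    # ... existing assertions on data['response'] ...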
@@ -495,6 +494,7 @@ def __init__(self, base_namespace, index, provider=None, max_storage_tiers=MAX_S
     "ABT_ENV_MAX_NUM_XSTREAMS=100",
     "ABT_MAX_NUM_XSTREAMS=100",
     "DAOS_MD_CAP=1024",
+    "DAOS_POOL_RF=4",
By moving this into default_env_vars, DAOS_POOL_RF=4 will only be set for tests that do NOT define a /run/server_config/engines/x/env_vars in their test yaml, or that redefine it in that same yaml setting. Just confirming that this is the intent. For example, a test like src/tests/ftest/daos_test/rebuild.py (because its yaml file defines specific env_vars entries, overriding the default) will no longer run with DAOS_POOL_RF=4 with this change. There are currently ~50 functional tests that override the server config env_vars.
@phender I am a bit confused by the current behavior. For this specific test, I want to set DAOS_POOL_RF=2, but for all other tests I want the default. Putting DAOS_POOL_RF=4 in default_env_vars does not match what I need, but if I don't move it, setting it in the server yaml file won't change the value. Would you point out how we should fix this? I asked this before...
Hmm, we currently don't have anything set up in the test harness for individual env_vars that would allow us to define one like DAOS_POOL_RF with a value that will be set if not defined, but that can also be overridden. The REQUIRED_ENV_VARS dictionary defines env_vars that must be set with a very specific value, so there is no logic that would allow them to be overridden.

Moving DAOS_POOL_RF=4 out of REQUIRED_ENV_VARS["common"] allows the test yaml to define a different value, like DAOS_POOL_RF=2 - so that is needed - but then any test that does not define any DAOS_POOL_RF value no longer gets DAOS_POOL_RF=4 set.
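For reference, a hypothetical test yaml fragment along these lines, following the /run/server_config/engines/x/env_vars layout mentioned above (the exact keys for this test and the repetition of the other default entries are assumptions, since an env_vars override replaces the default list):

    # Hypothetical test yaml override: repeats the other defaults from the
    # diff above and changes only DAOS_POOL_RF for this one test.
    server_config:
      engines:
        0:
          env_vars:
            - ABT_ENV_MAX_NUM_XSTREAMS=100
            - ABT_MAX_NUM_XSTREAMS=100
            - DAOS_MD_CAP=1024
            - DAOS_POOL_RF=2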
I see two options: 1) Add DAOS_POOL_RF=4 to all other tests that currently define a server_config/engines/x/env_vars entry in their test yaml - like you've done in src/tests/ftest/daos_test/suite.yaml, or 2) rework the REQUIRED_ENV_VARS logic to support allowing certain values to be overridden.
I would probably prefer the second method to avoid touching too many files.
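A rough sketch of what that second option could look like, assuming the harness keeps a REQUIRED_ENV_VARS-style dictionary; the DEFAULT_ENV_VARS name and the merge helper below are hypothetical and only illustrate "set if not defined, but overridable from the test yaml", not the actual harness code:

# Hypothetical sketch: split env vars into ones that must keep a fixed value
# and ones that only provide a default when the test yaml does not set them.
REQUIRED_ENV_VARS = {
    "common": ["ABT_ENV_MAX_NUM_XSTREAMS=100", "ABT_MAX_NUM_XSTREAMS=100"],
}
DEFAULT_ENV_VARS = {
    "common": ["DAOS_POOL_RF=4"],   # applied only if the yaml does not override it
}


def merge_env_vars(yaml_env_vars, required, defaults):
    """Combine yaml-defined env_vars with required and default entries."""
    merged = list(yaml_env_vars)
    defined = {entry.split("=", 1)[0] for entry in merged}
    # Defaults apply only when the test yaml did not define the variable.
    for entry in defaults:
        name = entry.split("=", 1)[0]
        if name not in defined:
            merged.append(entry)
            defined.add(name)
    # Required entries always win and replace any yaml-defined value.
    for entry in required:
        name = entry.split("=", 1)[0]
        merged = [item for item in merged if item.split("=", 1)[0] != name]
        merged.append(entry)
    return merged


# Example: a test yaml that sets DAOS_POOL_RF=2 keeps its value, while a test
# that sets nothing would still get DAOS_POOL_RF=4 from the defaults.
print(merge_env_vars(["DAOS_POOL_RF=2"], REQUIRED_ENV_VARS["common"],
                     DEFAULT_ENV_VARS["common"]))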
Required-githooks: true
Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15458/2/testReport/

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/2/execution/node/1196/log

@phender would you help fix the CI env issue? Thanks!
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: