
DAOS-16477 mgmt: return suspect engines for pool healthy query #15458

Open
wants to merge 2 commits into master
Conversation

wangshilong
Contributor

After significant failures, the system may leave behind some suspect engines that were marked DEAD by the SWIM protocol but were not excluded from the system, in order to prevent data loss. An administrator can bring these ranks back online by restarting them.

This PR provides an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the --health-only option of daos/dmg.

Example output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:

  • Disabled ranks: 1
  • Suspect ranks: 2
  • Rebuild busy, 0 objs, 0 recs

Features: DmgPoolQueryRanks
Required-githooks: true
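For illustration, the health rank fields shown above can also be consumed from the JSON form of the query. This is a hedged sketch only: the field names (disabled_ranks, suspect_ranks) are assumed from the ftest snippets in this PR, and the sample payload is illustrative rather than captured output.

```python
import json

# Illustrative payload shaped like a `dmg pool query --health-only` JSON
# response; field names assumed from the test code in this PR.
SAMPLE = json.loads("""
{
  "response": {
    "uuid": "6f450a68-8c7d-4da9-8900-02691650f6a2",
    "disabled_ranks": [1],
    "suspect_ranks": [2]
  }
}
""")

def health_ranks(data):
    """Return (disabled, suspect) rank lists from a pool query response dict."""
    resp = data.get("response") or {}
    return resp.get("disabled_ranks", []), resp.get("suspect_ranks", [])

disabled, suspect = health_ranks(SAMPLE)
```

A non-empty suspect list here tells the administrator which ranks may only need a restart rather than exclusion.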

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Features: DmgPoolQueryRanks
Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>

github-actions bot commented Nov 6, 2024

Ticket title is 'Provide admin interface to query hanging engines after massive failure'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-16477

@wangshilong wangshilong marked this pull request as ready for review November 7, 2024 00:47
@wangshilong wangshilong requested review from a team as code owners November 7, 2024 00:47
@wangshilong
Contributor Author

To reviewers: This PR landed before but was reverted because of conflicts with the MD-on-SSD phase2 PR. I refreshed the PR against master and removed a workaround in the pool query tests (since the rebuild bug is fixed).

liw
liw previously approved these changes Nov 7, 2024
Contributor

@liw liw left a comment


C changes look good to me.

tanabarr
tanabarr previously approved these changes Nov 7, 2024
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/1/execution/node/1478/log

Contributor

@phender phender left a comment


There is at least one fix needed. I also have a couple questions about the intent of some changes.

     data['response'].get('enabled_ranks')))
 self.assertListEqual(
     data['response'].get('disabled_ranks'), [],
-    "Invalid disabled_ranks field: want=[], got={}".format(
+    "Invalid suspect_ranks field: want=[], got={}".format(
Contributor


Shouldn't this still be disabled_ranks?

Contributor Author


Yes, it should be disabled_ranks.

self.pool.wait_for_rebuild_to_end()
exclude_rank = all_ranks[0]
suspect_rank = all_ranks[1]
self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks)
Contributor


The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

Since we're modifying this test it would be beneficial use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log:

Suggested change
self.log.info("Starting excluding rank:%d all_ranks=%s", exclude_rank, all_ranks)
self.log_step(f"Starting excluding rank:{exclude_rank} all_ranks={all_ranks}")

Contributor Author


The new behavior of only stopping two ranks does not align with the test description or how it currently runs. The test starts 3 ranks (with the option to expand this by changing the test_servers count in the test yaml). The old code would stop all three ranks (or more if more servers were specified). Is this fully intentional?

I think the test docs need to be updated. We don't need to stop all three ranks; the test intention is to verify that the enabled, disabled, and suspect ranks change accordingly.

Since we're modifying this test it would be beneficial use the new log step feature. This is helpful for determining how long parts of the test take and provides a unique search string to jump to in the log:

Yup, that is a good idea indeed.
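The rank bookkeeping discussed above (exclude one rank, stop another so it becomes suspect, then assert the three rank lists) can be modeled roughly as below. This is my own illustration under assumed semantics, not harness code; the function name is hypothetical.

```python
def expected_rank_sets(all_ranks, excluded, suspect):
    """Model of the rank sets the test expects after excluding/stopping ranks.

    Illustrative only: names and semantics are assumed from the review
    discussion, not taken from the actual ftest harness.
    """
    enabled = sorted(r for r in all_ranks if r not in excluded)
    return {
        "enabled_ranks": enabled,
        "disabled_ranks": sorted(excluded),
        "suspect_ranks": sorted(suspect),
    }

# Excluding rank 1 and stopping rank 2 (making it suspect) on a 3-rank pool:
sets = expected_rank_sets([0, 1, 2], excluded={1}, suspect={2})
```

Under this model, stopping only two ranks is enough to exercise all three lists, which matches the stated test intention.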

@@ -495,6 +494,7 @@ def __init__(self, base_namespace, index, provider=None, max_storage_tiers=MAX_S
"ABT_ENV_MAX_NUM_XSTREAMS=100",
"ABT_MAX_NUM_XSTREAMS=100",
"DAOS_MD_CAP=1024",
"DAOS_POOL_RF=4",
Contributor


By moving this into default_env_vars, DAOS_POOL_RF=4 will only be set for tests that do NOT define a /run/server_config/engines/x/env_vars in their test yaml, or that redefine it in that same yaml setting. Just confirming that this is the intent. For example, a test like src/tests/ftest/daos_test/rebuild.py (because its yaml file defines specific env_vars entries, overriding the default) will no longer run with DAOS_POOL_RF=4 after this change. There are currently ~50 functional tests that override the server config env_vars.

Contributor Author


@phender I am a bit confused by the current behavior. For this specific test I want to set DAOS_POOL_RF=2, but for all other tests I want the default.

Putting DAOS_POOL_RF=4 in default_env_vars does not match what I need, but if I don't move it, setting it in the server yaml file won't change the value. Would you point out how we should fix this? I asked this before...

Contributor


Hmm, we currently don't have anything set up in the test harness for individual env_vars that would let us define one like DAOS_POOL_RF with a value that is set if not defined but can also be overridden. The REQUIRED_ENV_VARS dictionary defines env_vars that must be set with a very specific value, so there is no logic that would allow them to be overridden.

Moving DAOS_POOL_RF=4 out of REQUIRED_ENV_VARS["common"] allows the test yaml to define a different value, like DAOS_POOL_RF=2 - so that is needed - but then any test that does not define a DAOS_POOL_RF value no longer gets DAOS_POOL_RF=4 set.

I see two options: 1) Add DAOS_POOL_RF=4 to all other tests that currently define a server_config/engines/x/env_vars entry in their test yaml - like you've done in src/tests/ftest/daos_test/suite.yaml, or 2) rework the REQUIRED_ENV_VARS logic to support allowing certain values to be overridden.

Contributor Author


I would probably prefer the second method, to avoid touching too many files.
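A minimal sketch of what that second option could look like, under stated assumptions: the names STRICT_ENV_VARS, DEFAULT_ENV_VARS, and merge_env_vars are hypothetical, not the existing harness API. The idea is to split the current dictionary into strictly required values and overridable defaults, letting test-yaml entries win only for the latter.

```python
# Hypothetical sketch of option 2: overridable defaults alongside strict
# required env vars. Names are illustrative, not the real harness API.
STRICT_ENV_VARS = {"ABT_ENV_MAX_NUM_XSTREAMS": "100"}  # must always hold this value
DEFAULT_ENV_VARS = {"DAOS_POOL_RF": "4"}               # used only if the test yaml doesn't set it

def merge_env_vars(yaml_env_vars):
    """Merge yaml-provided "KEY=VALUE" strings with defaults and strict values."""
    merged = dict(DEFAULT_ENV_VARS)
    for entry in yaml_env_vars:
        key, _, value = entry.partition("=")
        merged[key] = value          # yaml may override a default
    merged.update(STRICT_ENV_VARS)   # strict values always win
    return sorted(f"{k}={v}" for k, v in merged.items())
```

With this split, dmg_pool_query_ranks could set DAOS_POOL_RF=2 in its yaml while every other test still gets DAOS_POOL_RF=4, without touching their yaml files.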

Required-githooks: true
Features: DmgPoolQueryRanks
Signed-off-by: Wang Shilong <[email protected]>
@wangshilong wangshilong dismissed stale reviews from tanabarr and liw via 92c6882 November 11, 2024 03:51
@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15458/2/testReport/

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15458/2/execution/node/1196/log

@wangshilong
Contributor Author

@phender would you help fix the CI env issue? Thanks!
