Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traverser: add valid return in mod_plan for partial cancel when allocation exists #1292

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

milroy
Copy link
Member

@milroy milroy commented Sep 5, 2024

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. This PR adds a goto statement to return 0 under this circumstance.

jameshcorbett
jameshcorbett previously approved these changes Sep 5, 2024
Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed it makes the error messages I was seeing go away!

@milroy
Copy link
Member Author

milroy commented Sep 5, 2024

@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?

It would be good to know if the partial cancel behaves as expected when encountering this issue.

@jameshcorbett
Copy link
Member

Yeah it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch so I don't think there's as much hurry to merge it. I'm also working on a test case in flux-coral2 environment but it's not working quite right yet and the tests are slow so it's taking a while.

I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.

@milroy
Copy link
Member Author

milroy commented Sep 5, 2024

After more thought, I think we need to add a testsuite check for this issue.

@jameshcorbett
Copy link
Member

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssd resources are allocated.

@milroy
Copy link
Member Author

milroy commented Sep 5, 2024

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: flux-framework/flux-coral2#208, but it would be good to have a test in sched, too.

@milroy milroy dismissed jameshcorbett’s stale review September 5, 2024 17:59

Dismissing the first review to make sure this doesn't get merged accidentally. I'll re-request once we understand the behavior better.

@jameshcorbett
Copy link
Member

What are the scheduler and queue policies set to in the coral2 tests?

Whatever the defaults are I think, there's no configuration done.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

Thanks for the pointer, I should be able to look into this tomorrow!

Problem: issue flux-framework#1284 identified a scenario where rabbits are not
released due to a traverser error during partial cancellation. The
traverser should skip the rest of the mod_plan function when an
allocation is found and mod_data.mod_type ==
job_modify_t::PARTIAL_CANCEL.

Add a goto statement to return 0 under this circumstance.
Copy link

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.2%. Comparing base (d11a3d5) to head (5203c6c).

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1292     +/-   ##
========================================
- Coverage    75.2%   75.2%   -0.1%     
========================================
  Files         110     110             
  Lines       15230   15231      +1     
========================================
- Hits        11467   11457     -10     
- Misses       3763    3774     +11     
Files with missing lines Coverage Δ
resource/traversers/dfu_impl_update.cpp 74.8% <100.0%> (-1.5%) ⬇️

... and 1 file with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants