Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dense resource type with DoS protection #1287

Merged
merged 3 commits into from
Sep 30, 2024

Conversation

trws
Copy link
Member

@trws trws commented Sep 3, 2024

I'm still debating this change to be honest. The refcounted strings for resource types are showing up as a meaningful cost in comparisons and refcounting, it's not terrible but it's also not great. This solves the issue by making it so that dense storage interners have a notion of being "final" or placed in a state where new strings can't be added anymore. It also adds an interface to "open" a finalized interner for updates, on a per-thread basis, so that we can do things like database updates. To my surprise, this passed all but one of our tests without re-opening the interners, and that last one by re-opening them in update_resource.

The upside is that after this change, and without using it to optimize any of the type maps or anything like that, we get another 20-30% improvement in throughput for some of my tests. I've seen as much as 250 jobs/s. The downside is that the interface is a little bit odd for this, and at the moment the only way we have to signal the thing is closed is to throw an exception. Given that the only source of new resource types after init should be jobspec though, which already throws exceptions for any parse errors (which is basically what this would be, if the resource type doesn't exist it can't match) I'm not sure how bad that is. Would love input on this. Might be easier to use if it worked with a lock/unlock pair so lock_guard would work, not sure.

I think this
fixes #1298

@trws trws force-pushed the dense_resource_type branch 2 times, most recently from c1012c9 to 4a341e6 Compare September 3, 2024 19:36
problem: we regularly need to manage resources in the face of exceptions
and don't have a scope_guard implementation in either our required boost
version or standard library

solution: add an open source implementation that's simple and easy to
maintain
problem: refcounted strings for resource types are turning out to be
expensive enough to actually slow us down.  We compare them and copy
them so often they show up in the traces as 20-30% of runtime even as
just shared_ptrs

solution: avoid the possible denial of service issue from resource types
by adding an interface to the interner type that finalizes it,
preventing new strings from being added until it is explicitly "opened"
for addition.  This way after initialization of the graph/database we
finalize the interners for subsystems and resource types, then open
them back up for updates in update_resource.  The danger is in JobSpec,
which can no longer create new resource types.  This also effectively
makes it a parse error for a user to specify an unknown resource, which
will happen as soon as the jobspec hits qmanager.

This changes the output of one test of resource_query from emitting
"unsatisfied" as info to emitting a descriptive parse error because it
was unsatisfied due to unknown resources.  That's why one test output
file is modified.
problem: we now validate resources used in jobspecs against known
resources, new resources in _updates_ should work but new ones from
users in jobs or matches should fail, we have no tests for this other
than some accidental ones

solution: add a jobspec with an invalid resource, and ensure that it
fails to match
@trws
Copy link
Member Author

trws commented Sep 25, 2024

@garlick, @jameshcorbett, @milroy, have a look at this. The code path that causes the crash in #1298 should be completely gone, and it should make things both faster and a bit more self-validating since we'll catch unknown resource types at jobspec parsing time rather than later. Unfortunately I haven't been able to repro the issue from #1298, and my attempts to get onto elcap to do it have been foiled by some really gnarly latency from Perth (at pawsey it's sometimes 5 seconds to see a character in ssh 😬 ). If someone could give it a poke and verify I'd really appreciate it.

Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do

// resource_type must be closed or full
throw parse_error (resnode["type"],
"Value of \"type\" must be a resource type known to fluxion");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change goes along with the test change right? Maybe these two changes could be their own commit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ensuring that the failure gets converted to a parse error, otherwise it gets handled like something completely else and several existing tests fail. I could group them up but it would leave a commit that can't pass.

@@ -56,6 +56,11 @@ test_expect_success 'handling of a malformed jobspec works' '
test_expect_code 2 flux ion-resource match allocate ${malform}
'

test_expect_success 'handling of an invalid resource type works' '
test_expect_code 1 flux ion-resource match allocate \
"${SHARNESS_TEST_SRCDIR}/data/resource/jobspecs/basics/bad_res_type.yaml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to check for the specific error message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not here, but that's a good point. I should be able to add a resource_query test that can check for the parse error and that it comes from the resource type specifically.

@trws
Copy link
Member Author

trws commented Sep 25, 2024

I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do

I'm not 100% sure other than build it and run some bulksubmit jobs. @grondo did you ever manage to repro it?

@grondo
Copy link
Contributor

grondo commented Sep 25, 2024

I was never able to reproduce it. It seemed to occur 10% of the time at 128 nodes. The jobspec and batch script are in my homedir at fluxion-crash.

@trws trws added the merge-when-passing mergify.io - merge PR automatically once CI passes label Sep 30, 2024
@mergify mergify bot merged commit a042bbf into flux-framework:master Sep 30, 2024
21 checks passed
Copy link

codecov bot commented Sep 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.3%. Comparing base (d11a3d5) to head (852543f).
Report is 4 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #1287   +/-   ##
======================================
  Coverage    75.2%   75.3%           
======================================
  Files         110     111    +1     
  Lines       15230   15262   +32     
======================================
+ Hits        11467   11505   +38     
+ Misses       3763    3757    -6     
Files with missing lines Coverage Δ
resource/libjobspec/jobspec.cpp 86.5% <100.0%> (+0.1%) ⬆️
resource/modules/resource_match.cpp 69.2% <100.0%> (+0.3%) ⬆️
resource/utilities/command.cpp 78.5% <100.0%> (+<0.1%) ⬆️
resource/utilities/resource-query.cpp 76.4% <100.0%> (+0.2%) ⬆️
src/common/libintern/interner.cpp 70.0% <100.0%> (+5.7%) ⬆️
src/common/libintern/interner.hpp 97.7% <100.0%> (+0.1%) ⬆️
src/common/libintern/scope_guard.hpp 100.0% <100.0%> (ø)

... and 5 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fluxion crash in 10% of user runs on elcap
3 participants