-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dense resource type with DoS protection #1287
Conversation
c1012c9
to
4a341e6
Compare
4a341e6
to
6cb8142
Compare
problem: we regularly need to manage resources in the face of exceptions and don't have a scope_guard implementation in either our required boost version or standard library solution: add an open source implementation that's simple and easy to maintain
problem: refcounted strings for resource types are turning out to be expensive enough to actually slow us down. We compare them and copy them so often they show up in the traces as 20-30% of runtime even as just shared_ptrs solution: avoid the possible denial of service issue from resource types by adding an interface to the interner type that finalizes it, preventing new strings from being added until it is explicitly "opened" for addition. This way after initialization of the graph/database we finalize the interners for subsystems and resource types, then open them back up for updates in update_resource. The danger is in JobSpec, which can no longer create new resource types. This also effectively makes it a parse error for a user to specify an unknown resource, which will happen as soon as the jobspec hits qmanager. This changes the output of one test of resource_query from emitting "unsatisfied" as info to emitting a descriptive parse error because it was unsatisfied due to unknown resources. That's why one test output file is modified.
problem: we now validate resources used in jobspecs against known resources, new resources in _updates_ should work but new ones from users in jobs or matches should fail, we have no tests for this other than some accidental ones solution: add a jobspec with an invalid resource, and ensure that it fails to match
b3e770e
to
852543f
Compare
@garlick, @jameshcorbett, @milroy, have a look at this. The code path that causes the crash in #1298 should be completely gone, and it should make things both faster and a bit more self-validating since we'll catch unknown resource types at jobspec parsing time rather than later. Unfortunately I haven't been able to repro the issue from #1298, and my attempts to get onto elcap to do it have been foiled by some really gnarly latency from Perth (at pawsey it's sometimes 5 seconds to see a character in ssh 😬 ). If someone could give it a poke and verify I'd really appreciate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do
// resource_type must be closed or full | ||
throw parse_error (resnode["type"], | ||
"Value of \"type\" must be a resource type known to fluxion"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change goes along with the test change right? Maybe these two changes could be their own commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ensuring that the failure gets converted to a parse error, otherwise it gets handled like something completely else and several existing tests fail. I could group them up but it would leave a commit that can't pass.
@@ -56,6 +56,11 @@ test_expect_success 'handling of a malformed jobspec works' ' | |||
test_expect_code 2 flux ion-resource match allocate ${malform} | |||
' | |||
|
|||
test_expect_success 'handling of an invalid resource type works' ' | |||
test_expect_code 1 flux ion-resource match allocate \ | |||
"${SHARNESS_TEST_SRCDIR}/data/resource/jobspecs/basics/bad_res_type.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way to check for the specific error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not here, but that's a good point. I should be able to add a resource_query test that can check for the parse error and that it comes from the resource type specifically.
I'm not 100% sure other than build it and run some bulksubmit jobs. @grondo did you ever manage to repro it? |
I was never able to reproduce it. It seemed to occur 10% of the time at 128 nodes. The jobspec and batch script are in my homedir at |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1287 +/- ##
======================================
Coverage 75.2% 75.3%
======================================
Files 110 111 +1
Lines 15230 15262 +32
======================================
+ Hits 11467 11505 +38
+ Misses 3763 3757 -6
|
I'm still debating this change to be honest. The refcounted strings for resource types are showing up as a meaningful cost in comparisons and refcounting, it's not terrible but it's also not great. This solves the issue by making it so that dense storage interners have a notion of being "final" or placed in a state where new strings can't be added anymore. It also adds an interface to "open" a finalized interner for updates, on a per-thread basis, so that we can do things like database updates. To my surprise, this passed all but one of our tests without re-opening the interners, and that last one by re-opening them in update_resource.
The upside is that after this change, and without using it to optimize any of the type maps or anything like that, we get another 20-30% improvement in throughput for some of my tests. I've seen as much as 250 jobs/s. The downside is that the interface is a little bit odd for this, and at the moment the only way we have to signal the thing is closed is to throw an exception. Given that the only source of new resource types after init should be jobspec though, which already throws exceptions for any parse errors (which is basically what this would be, if the resource type doesn't exist it can't match) I'm not sure how bad that is. Would love input on this. Might be easier to use if it worked with a lock/unlock pair so lock_guard would work, not sure.
I think this
fixes #1298