Dense resource type with DoS protection #1287

trws · 2024-09-03T17:53:52Z

I'm still debating this change to be honest. The refcounted strings for resource types are showing up as a meaningful cost in comparisons and refcounting, it's not terrible but it's also not great. This solves the issue by making it so that dense storage interners have a notion of being "final" or placed in a state where new strings can't be added anymore. It also adds an interface to "open" a finalized interner for updates, on a per-thread basis, so that we can do things like database updates. To my surprise, this passed all but one of our tests without re-opening the interners, and that last one by re-opening them in update_resource.

The upside is that after this change, and without using it to optimize any of the type maps or anything like that, we get another 20-30% improvement in throughput for some of my tests. I've seen as much as 250 jobs/s. The downside is that the interface is a little bit odd for this, and at the moment the only way we have to signal the thing is closed is to throw an exception. Given that the only source of new resource types after init should be jobspec though, which already throws exceptions for any parse errors (which is basically what this would be, if the resource type doesn't exist it can't match) I'm not sure how bad that is. Would love input on this. Might be easier to use if it worked with a lock/unlock pair so lock_guard would work, not sure.

I think this
fixes #1298

problem: we regularly need to manage resources in the face of exceptions and don't have a scope_guard implementation in either our required boost version or standard library solution: add an open source implementation that's simple and easy to maintain

problem: refcounted strings for resource types are turning out to be expensive enough to actually slow us down. We compare them and copy them so often they show up in the traces as 20-30% of runtime even as just shared_ptrs solution: avoid the possible denial of service issue from resource types by adding an interface to the interner type that finalizes it, preventing new strings from being added until it is explicitly "opened" for addition. This way after initialization of the graph/database we finalize the interners for subsystems and resource types, then open them back up for updates in update_resource. The danger is in JobSpec, which can no longer create new resource types. This also effectively makes it a parse error for a user to specify an unknown resource, which will happen as soon as the jobspec hits qmanager. This changes the output of one test of resource_query from emitting "unsatisfied" as info to emitting a descriptive parse error because it was unsatisfied due to unknown resources. That's why one test output file is modified.

problem: we now validate resources used in jobspecs against known resources, new resources in _updates_ should work but new ones from users in jobs or matches should fail, we have no tests for this other than some accidental ones solution: add a jobspec with an invalid resource, and ensure that it fails to match

trws · 2024-09-25T03:39:16Z

@garlick, @jameshcorbett, @milroy, have a look at this. The code path that causes the crash in #1298 should be completely gone, and it should make things both faster and a bit more self-validating since we'll catch unknown resource types at jobspec parsing time rather than later. Unfortunately I haven't been able to repro the issue from #1298, and my attempts to get onto elcap to do it have been foiled by some really gnarly latency from Perth (at pawsey it's sometimes 5 seconds to see a character in ssh 😬 ). If someone could give it a poke and verify I'd really appreciate it.

jameshcorbett

I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do

jameshcorbett · 2024-09-25T05:00:34Z

resource/libjobspec/jobspec.cpp

+        // resource_type must be closed or full
+        throw parse_error (resnode["type"],
+                           "Value of \"type\" must be a resource type known to fluxion");
+    }


This change goes along with the test change right? Maybe these two changes could be their own commit?

This is ensuring that the failure gets converted to a parse error, otherwise it gets handled like something completely else and several existing tests fail. I could group them up but it would leave a commit that can't pass.

jameshcorbett · 2024-09-25T05:01:08Z

t/t4001-match-allocate.t

@@ -56,6 +56,11 @@ test_expect_success 'handling of a malformed jobspec works' '
    test_expect_code 2 flux ion-resource match allocate ${malform}
 '

+test_expect_success 'handling of an invalid resource type works' '
+    test_expect_code 1 flux ion-resource match allocate \
+        "${SHARNESS_TEST_SRCDIR}/data/resource/jobspecs/basics/bad_res_type.yaml"


Any way to check for the specific error message?

Not here, but that's a good point. I should be able to add a resource_query test that can check for the parse error and that it comes from the resource type specifically.

trws · 2024-09-25T05:16:15Z

I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do

I'm not 100% sure other than build it and run some bulksubmit jobs. @grondo did you ever manage to repro it?

grondo · 2024-09-25T14:15:55Z

I was never able to reproduce it. It seemed to occur 10% of the time at 128 nodes. The jobspec and batch script are in my homedir at fluxion-crash.

codecov · 2024-09-30T16:02:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.3%. Comparing base (d11a3d5) to head (852543f).
Report is 4 commits behind head on master.

Additional details and impacted files

@@          Coverage Diff           @@
##           master   #1287   +/-   ##
======================================
  Coverage    75.2%   75.3%           
======================================
  Files         110     111    +1     
  Lines       15230   15262   +32     
======================================
+ Hits        11467   11505   +38     
+ Misses       3763    3757    -6

Files with missing lines	Coverage Δ
resource/libjobspec/jobspec.cpp	`86.5% <100.0%> (+0.1%)`	⬆️
resource/modules/resource_match.cpp	`69.2% <100.0%> (+0.3%)`	⬆️
resource/utilities/command.cpp	`78.5% <100.0%> (+<0.1%)`	⬆️
resource/utilities/resource-query.cpp	`76.4% <100.0%> (+0.2%)`	⬆️
src/common/libintern/interner.cpp	`70.0% <100.0%> (+5.7%)`	⬆️
src/common/libintern/interner.hpp	`97.7% <100.0%> (+0.1%)`	⬆️
src/common/libintern/scope_guard.hpp	`100.0% <100.0%> (ø)`

... and 5 files with indirect coverage changes

trws requested review from milroy and jameshcorbett September 3, 2024 17:53

trws force-pushed the dense_resource_type branch 2 times, most recently from c1012c9 to 4a341e6 Compare September 3, 2024 19:36

trws force-pushed the dense_resource_type branch from 4a341e6 to 6cb8142 Compare September 25, 2024 03:26

trws added 3 commits September 25, 2024 11:34

helpers: add scope_guard

0f37a56

problem: we regularly need to manage resources in the face of exceptions and don't have a scope_guard implementation in either our required boost version or standard library solution: add an open source implementation that's simple and easy to maintain

trws force-pushed the dense_resource_type branch from b3e770e to 852543f Compare September 25, 2024 03:35

jameshcorbett approved these changes Sep 25, 2024

View reviewed changes

jameshcorbett mentioned this pull request Sep 30, 2024

NEWS.md: add release notes for v0.39.0 #1302

Merged

2 tasks

trws added the merge-when-passing mergify.io - merge PR automatically once CI passes label Sep 30, 2024

mergify bot merged commit a042bbf into flux-framework:master Sep 30, 2024
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dense resource type with DoS protection #1287

Dense resource type with DoS protection #1287

trws commented Sep 3, 2024 •

edited

Loading

trws commented Sep 25, 2024

jameshcorbett left a comment

jameshcorbett Sep 25, 2024

trws Sep 25, 2024

jameshcorbett Sep 25, 2024

trws Sep 25, 2024

trws commented Sep 25, 2024

grondo commented Sep 25, 2024

codecov bot commented Sep 30, 2024

Dense resource type with DoS protection #1287

Dense resource type with DoS protection #1287

Conversation

trws commented Sep 3, 2024 • edited Loading

trws commented Sep 25, 2024

jameshcorbett left a comment

Choose a reason for hiding this comment

jameshcorbett Sep 25, 2024

Choose a reason for hiding this comment

trws Sep 25, 2024

Choose a reason for hiding this comment

jameshcorbett Sep 25, 2024

Choose a reason for hiding this comment

trws Sep 25, 2024

Choose a reason for hiding this comment

trws commented Sep 25, 2024

grondo commented Sep 25, 2024

codecov bot commented Sep 30, 2024

Codecov Report

trws commented Sep 3, 2024 •

edited

Loading