Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modules: Move feasibility/satisfiability checking into a new module #1285

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

jacobtkeio
Copy link

Problem: Feasibility checking takes up a significant amount of sched-fluxion-resource's time that could be better spent on scheduling. Additionally, feasibility checking is not RFC 27 compliant.
Solution: Move feasibility checking into a new module, sched-fluxion-feasibility, that can run on multiple ranks and make it RFC 27 compliant.

Copy link

codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 72.53259% with 295 lines in your changes missing coverage. Please review.

Project coverage is 75.4%. Comparing base (4d9fc6f) to head (c3d821e).

Files with missing lines Patch % Lines
resource/modules/resource.cpp 70.9% 219 Missing ⚠️
resource/modules/feasibility.cpp 69.8% 60 Missing ⚠️
resource/modules/resource_match.cpp 85.9% 16 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #1285    +/-   ##
=======================================
  Coverage    75.3%   75.4%            
=======================================
  Files         111     114     +3     
  Lines       15361   15526   +165     
=======================================
+ Hits        11575   11708   +133     
- Misses       3786    3818    +32     
Files with missing lines Coverage Δ
qmanager/modules/qmanager.cpp 73.3% <ø> (+4.2%) ⬆️
resource/modules/resource_match.hpp 100.0% <100.0%> (ø)
resource/modules/resource_match.cpp 67.2% <85.9%> (-1.8%) ⬇️
resource/modules/feasibility.cpp 69.8% <69.8%> (ø)
resource/modules/resource.cpp 70.9% <70.9%> (ø)

Jacob Tkeio and others added 14 commits September 28, 2024 16:07
Problem: sched.feasibility RPCs take up too much of
	 sched-fluxion-resource's single-threaded time.

Add CLI option to create a 'feasibility version' of
sched-fluxion-resource called sched-fluxion-satisfiability
that can run on multiple ranks.
Problem: Only one resource.acquire RPC can be active at a time, but both
         s-f-resource and s-f-satisfiability try to open one.

Make s-f-satisfiability call s-f-resource.notify to populate
	its resources (kind of like s-f-qmanager does).
Make s-f-resource send its resources instead of null on notify RPCs
	to accomodate s-f-satisfiability.
Finally, force the FIRST matching policy for s-f-satisfiability.
Problem: Having a 'satisfiability version' of the s-f-resource module
         makes resource_match.cpp hard to read and less maintainable.

Split resource_match.cpp into a resource.cpp and feasibility.cpp
that contain their respective modules. Leave the code common to both
in resource_match.cpp with a new header, resource_match.hpp. This
simplifies adding new modules that acquire and match resources.
Problem: Tests segfault when unloading s-f-resource under flux-broker

Create a sched-fluxion-resource-module library that loads the
'resource' target only once between both s-f-resource and
s-f-feasibility. The previous CMakeLists.txt was loading
the 'resource' target twice, which caused a segfault during
unload when running under 'flux broker' on static variable
ANY_RESOURCE_TYPE{"*"} in matcher.cpp.
Problem: The feasibility module ignores s-f-resource after it gets
         resources from s-f-resource.notify, but it should exit when
         s-f-resource does to prevent odd behavior after an
         s-f-resource reload, especially with different resources.

Make s-f-feasibility listen for errors on the .notify stream.

Send graph expiration time to feasibility module

Problem: s-f-resource.notify does not currently send graph expiration
         info to s-f-feasibility.

Send it.

Fix several errors

Problem: Improper use of git including forgetting to add a file in
         the previous commit and losing changes during a merge.

Fix various things including broken type of m_acquired_resources and
    unnecessary+broken code in resource_match_opts.
Problem: s-f-feasibility launches on all ranks by default. This is
         probably not a good default behavior.

Change rc1.d/01-s-f to launch s-f-feasibility on only rank 0.
Change rc3.d/01-s-f to remove s-f-feasibility from all ranks.
This allows for any layout of s-f-feasibility instances while
guaranteeing at least one.
Problem: notify_request_cb waits for resource.acquire if it
         has not yet recieved resources. However, it is
         guaranteed to have resources since init_resource-
         _graph must return before flux_reactor_run starts,
         and notify_request_cb can only happen after that.

Remove the check.
Problem: s-f-feasibility performs feasibility checking through
         the sched.feasibility RPC. However, RFC 27 requires
         feasibility checking to be performed in feasibility.check.

Make s-f-feasibility register the 'feasibility' service and
'feasibility.check' RPC instead of the 'sched.feasibility' RPC.
Remove 'sched.feasibility' forwarder cb in qmanager.cpp.
Problem: Some tests call 'sched.feasibility', which no loger exists.

Swap 'sched.feasibility' for 'feasibility.check'.

Add feasibility module to sched-sharness and t1020

Problem: Some tests that need satisfiability information do not
         load and unload s-f-feasibility.

Add load_feasibility, reload_feasibility, and remove_feasibility
functions to sched-sharness.sh and the relevant tests.
Problem: All calls to feasibility.check are in tests for s-f-resource,
	 not tests for s-f-feasibility.

Move them to a separate test for the feasibility module.

Disallow load-file behavior in feasibility test

Problem: t4014 expects feasibility to load resources from a resource
         module that was passed a load-file, which is not desired
         behavior.

Update t4014 to expect failure on such a load.
Problem: The formatting did not pass the CI code formatting check.

Apply the required changes.
Problem: As a holdover from s-f-resource, s-f-feasibility marks its
         acquired resources as DOWN, which is unnecessary.

Remove this resource marking from init_resource_graph.
Problem: If notify_request_cb fails flux_respond_pack,
         it responds with nothing, leaving feasibility
         without resources but active.

Make notify_request_cb send a flux_respond_error on
a flux_respond_pack_error.
Problem: fedora40 CI fails on t4014 due to """
  error: bug in the test script: broken &&-chain:
      load_feasibility
      flux dmesg -c | grep -q "File exists"
"""

Add && between successive statements in t4014.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant