Skip to content

modules: Move feasibility/satisfiability checking into a new module #1285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jacobtkeio
Copy link

@jacobtkeio jacobtkeio commented Aug 30, 2024

Problem: Feasibility checking takes up a significant amount of sched-fluxion-resource's time that could be better spent on scheduling.
Solution: Move feasibility checking into a new module, sched-fluxion-feasibility, that can run on multiple ranks.

@jacobtkeio jacobtkeio marked this pull request as ready for review March 1, 2025 08:23
@jacobtkeio jacobtkeio force-pushed the master branch 3 times, most recently from 6e18b88 to cd8ba09 Compare March 6, 2025 20:10
@jacobtkeio jacobtkeio force-pushed the master branch 5 times, most recently from 65ca1fc to 4eaef1c Compare May 1, 2025 21:20
@jacobtkeio jacobtkeio force-pushed the master branch 7 times, most recently from fd25295 to fd31259 Compare May 28, 2025 19:58
Jacob Tkeio and others added 4 commits June 30, 2025 10:58
Problem: feasibility.check RPCs take up too much of
sched-fluxion-resource's single-threaded time.

Add sched-fluxion-feasibility module that can run on multiple ranks
to handle feasibility.check RPCs.
Problem: All calls to the feasibility.check RPC are
in s-f-resource tests, not s-f-feasibility tests.

Create a feasibility module test file, t4014, and update other
feasibility.check tests to query the feasibility module.
Problem: t4014 does not test the behavior of
resource -> feasibility notification after
the feasibility module restarts. This is a
special case because resource needs to send
first-time information after the restart,
including all resources, lost resources,
and resource graph expiration.

Create t4015 to test feasibility notification.
Problem: PR flux-framework#1352 changed populate_resource_db to support resource
shrinking, but it left in an old version of the function.

Remove it.
Copy link

codecov bot commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 75.67128% with 299 lines in your changes missing coverage. Please review.

Project coverage is 76.5%. Comparing base (fc7621f) to head (cece306).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
resource/modules/resource.cpp 74.4% 219 Missing ⚠️
resource/modules/feasibility.cpp 74.6% 56 Missing ⚠️
resource/modules/resource_match.cpp 83.3% 24 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1285     +/-   ##
========================================
+ Coverage    76.4%   76.5%   +0.1%     
========================================
  Files         112     115      +3     
  Lines       16700   16929    +229     
========================================
+ Hits        12765   12962    +197     
- Misses       3935    3967     +32     
Files with missing lines Coverage Δ
qmanager/modules/qmanager.cpp 74.3% <ø> (+3.9%) ⬆️
resource/modules/resource_match.hpp 100.0% <100.0%> (ø)
resource/modules/resource_match.cpp 67.9% <83.3%> (-3.1%) ⬇️
resource/modules/feasibility.cpp 74.6% <74.6%> (ø)
resource/modules/resource.cpp 74.4% <74.4%> (ø)

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +1 to +3
/*****************************************************************************\
* Copyright 2014 Lawrence Livermore National Security, LLC
* (c.f. AUTHORS, NOTICE.LLNS, LICENSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the copyright year:

Suggested change
/*****************************************************************************\
* Copyright 2014 Lawrence Livermore National Security, LLC
* (c.f. AUTHORS, NOTICE.LLNS, LICENSE)
/*****************************************************************************\
* Copyright 2025 Lawrence Livermore National Security, LLC
* (c.f. AUTHORS, NOTICE.LLNS, LICENSE)

@@ -0,0 +1,1548 @@
/*****************************************************************************\
* Copyright 2014 Lawrence Livermore National Security, LLC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright 2014 Lawrence Livermore National Security, LLC
* Copyright 2025 Lawrence Livermore National Security, LLC

Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good work. I have a few nits, and a minor point for discussion regarding the RPC name.

Apart from that, I'd like to discuss changing the satisfiability pruning filter to be ALL:core,ALL:node and perhaps add ALL:gpu as well. Adding a node type to the pruning filter by default should make the is_satisfiable function in the traverser faster and more powerful.

@@ -113,7 +113,7 @@ def rpc_namespace_info(self, rank, type_name, identity):

def rpc_satisfiability(self, jobspec):
payload = {"jobspec": jobspec}
return self.handle.rpc("sched-fluxion-resource.satisfiability", payload).get()
return self.handle.rpc("feasibility.check", payload).get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revisit this RPC name. Adhering to the existing pattern yields sched-fluxion-feasibility.satisfiability which is clunky.

@@ -0,0 +1,192 @@
/*****************************************************************************\
* Copyright 2014 Lawrence Livermore National Security, LLC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright 2014 Lawrence Livermore National Security, LLC
* Copyright 2025 Lawrence Livermore National Security, LLC

Comment on lines +1 to +20
version: 9999
resources:
- type: node
count: 2
with:
- type: slot
count: 1
label: default
with:
- type: core
count: 1
attributes:
system:
duration: 300
tasks:
- command: [ "hostlist" ]
slot: default
count:
per_slot: 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the current directory structure convention, I think it's better to move shrink2-4 into t/data/resource/jobspecs/satisfiability.

Comment on lines +30 to +31
prune-filters=ALL:core subsystems=containment policy=high &&
load_feasibility load-file=${grug} load-format=grug \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should set prune-filters to ALL:core,ALL:node in the tests and in the module.

@@ -86,6 +86,7 @@ set(ALL_TESTS
t4012-set-status.t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message body should wrap at 72 characters.

@milroy milroy requested a review from trws July 22, 2025 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants