Skip to content

add flux module remove --cancel option #6894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 30, 2025

Conversation

garlick
Copy link
Member

@garlick garlick commented Jun 26, 2025

This adds an option to remove an ornery module by sending it a pthread_cancel(3) instead of politely requesting that it shut down. That required the addition of a pthread cleanup callback, which triggered a bit of refactoring.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@garlick
Copy link
Member Author

garlick commented Jun 30, 2025

Thanks!

garlick added 8 commits June 30, 2025 20:38
Problem: flux module remove has no way to request that
pthread_cancel(3) be used to stop a module.

Add a --cancel option.
Problem: if pthread_join() on a module thread fails, the
error says that pthread_cancel() failed, but there isn't
really a way for pthread_join() to fail in a way that implicates
pthread_cancel().

Change pthred_cancel to pthread_join in the error message.
Problem: in the future, module thread cleanup will need
to occur in a cancellation callback, but the module
argv is allocated in the module thread and would be leaked
in such an arrangement.

Add argc and argv as module struct members and alloc/free
them in the broker before the module thread starts.
There was really no need to do that in the module thread.
Problem: in the future, module thread cleanup will need
to occur in a cancellation callback, but the module
errno (if any) is on the stack of the module thread
and would not be available in a cleanup callback.

Add mod_main_errno and mod_main_failed as module struct
members so they can be easily passed to the cancellation
callback.
Problem: a module expects a response to its request to
change the module state to FINALIZING, but there won't be
one if the request is made in a cancellation cleanup callback
and the cancellation occurs after the broker reactor has
exited.

Wait at most 1s for this response, then move on.
Problem: when a broker module is stuck looping somewhere other
than the reactor, flux module remove's reactive shutdown request
is ignored.

Support an optional cancel flag in the module.remove request.
When this flag is set, pthread_cancel(3) is called on the module
thread.  As long as the thread's loop is passing through one of
the cancellation points listed in pthreads(7), cancellation should
be effective.

Fixes flux-framework#6887
Problem: flux module remove --cancel has no test coverage.

Add a module that calls pause(), a pthreads(7) cancellation point.
Show that flux module remove cannot remove it.
Show that flux module remove --cancel can remove it.
Problem: flux module remove --cancel has no documentation.

Add that option to the man page.
@mergify mergify bot force-pushed the module_cancel branch from d2bd2e9 to 1221851 Compare June 30, 2025 20:38
@mergify mergify bot merged commit 984102b into flux-framework:master Jun 30, 2025
35 checks passed
Copy link

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (4e80d20) to head (1221851).
Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
src/broker/modhash.c 77.77% 2 Missing ⚠️
src/broker/module.c 92.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6894      +/-   ##
==========================================
- Coverage   83.85%   83.85%   -0.01%     
==========================================
  Files         538      538              
  Lines       90029    90041      +12     
==========================================
+ Hits        75491    75501      +10     
- Misses      14538    14540       +2     
Files with missing lines Coverage Δ
src/cmd/flux-module.c 80.00% <100.00%> (+0.04%) ⬆️
src/broker/modhash.c 80.41% <77.77%> (+0.20%) ⬆️
src/broker/module.c 81.40% <92.00%> (+1.81%) ⬆️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants