add simple shell plugin for monitoring memory and cpu usage#6958

Merged
mergify[bot] merged 3 commits into flux-framework:master from garlick:sysmon
Sep 3, 2025


@garlick garlick commented Aug 8, 2025

There was a flux-discuss email question recently asking about a way to monitor resource utilization. I had some cpu and memory utilization code lying around and made a quick and dirty broker module that polled /proc on the heartbeat and calculated cpu usage over the heartbeat interval. My thought was that someone could do better by making a bulk RPC to each rank to gather this info than by trying to spawn processes.

Then I made a demo shell plugin that just monitors the local cpu and memory usage and looks for big changes to log, with the thought that this could be the basis for a debugging tool that looks for deadlocked or leaky ranks in a big job.

Meh. This may be a solution in search of a problem. I thought I'd post it anyway, since the sunk cost is nonzero, and see if anybody had any thoughts on how to redirect it towards something more useful. If not, I have no problem euthanizing it.

Edit: tests are TBD


grondo commented Aug 8, 2025

I haven't looked at this fully yet, but this seems like a pretty useful start to me!

If the shell sysmon plugin could also monitor some process statistics, this could also potentially solve #5488.

We could add a special handler to the shell rank 0 plugin that pulls data from all the other shells and sends the result back to the caller (as is done with the mpir plugin for the procdesc data), perhaps with some simple ways to aggregate the bulk data. This approach has proven to be pretty scalable.

@garlick garlick force-pushed the sysmon branch 2 times, most recently from a21ceee to 71e5470 Compare August 21, 2025 20:26

garlick commented Aug 21, 2025

I reworked this to just use the shell's cgroup memory and cpu controller directly. If enabled with -o sysmon, each shell will log memory and cpu info at the end of the job:

    5.065s: flux-shell[1]: sysmon: memory.peak=3.1179504G
    5.066s: flux-shell[1]: sysmon: loadavg-overall=2.00
    5.068s: flux-shell[0]: sysmon: memory.peak=3.1179504G
    5.068s: flux-shell[0]: sysmon: loadavg-overall=2.00

If tracing is enabled with -o verbose=2, a little snapshot is logged on each heartbeat:

    2.497s: flux-shell[0]: TRACE: sysmon: memory.current=1.3986053G
    2.497s: flux-shell[0]: TRACE: sysmon: loadavg=2.00
    2.497s: flux-shell[1]: TRACE: sysmon: memory.current=1.3990936G
    2.497s: flux-shell[1]: TRACE: sysmon: loadavg=2.00

The polling period for the tracing can be explicitly set with -o sysmon.period=FSD.
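For readers unfamiliar with the cgroup v2 interface: memory.current and memory.peak each contain a single decimal byte count, so taking a sample amounts to reading one integer from a file. The following is only a minimal sketch of that idea, not the code from this PR; the helper name read_cgroup_u64 is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from the PR): read a single unsigned decimal
 * value from a cgroup v2 interface file such as memory.current or
 * memory.peak.  Returns 0 on success, -1 on failure with errno set.
 */
int read_cgroup_u64 (const char *path, uint64_t *value)
{
    FILE *fp;
    uint64_t v;

    assert (path != NULL && value != NULL);
    if (!(fp = fopen (path, "r")))
        return -1;                  /* errno set by fopen() */
    if (fscanf (fp, "%" SCNu64, &v) != 1) {
        fclose (fp);
        errno = EINVAL;             /* file did not start with a number */
        return -1;
    }
    fclose (fp);
    *value = v;
    return 0;
}
```

A caller would pass the cached path of the shell's own cgroup file and convert the byte count to whatever unit it wants to log.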

Aggregation of this data would make it a lot more useful, but I'm wondering if this baby step is enough better than nothing that we should consider it as is, as a stepping stone?

No tests yet - I wanted to see what others thought.

@garlick garlick force-pushed the sysmon branch 3 times, most recently from 4bb5c06 to a1242e1 Compare August 30, 2025 16:38
@garlick garlick changed the title WIP: add sysmon shell plugin add simple shell plugin for monitoring memory and cpu usage Aug 30, 2025

garlick commented Aug 30, 2025

I added testing and removed the WIP. I am not hard over on merging this as is, but wanted the option to be available.


@grondo grondo left a comment

This seems like a useful thing to throw in. If we want something more later, this could be a great starting point, and it may help some users in the near term.

Just a couple comments inline.

Comment on lines +313 to +316
    int i = json_is_integer (period);
    if (i < 0)
        goto error_period;
    ctx->period = json_integer_value (period);

Contributor

I think json_is_integer() only returns 0 or 1, so the test could never be true? Did you mean to compare json_integer_value() < 0? (I'm not sure, though, since i wasn't reused to assign to ctx->period.)

Member Author

Good catch! Yeah I'm sure I meant to check the value.

Comment on lines +344 to +352
    ctx->path_memory_current = get_cgroup_path (mypid, "memory.current", R_OK);
    ctx->path_memory_peak = get_cgroup_path (mypid, "memory.peak", R_OK);
    ctx->path_cpu_stat = get_cgroup_path (mypid, "cpu.stat", R_OK);
    if (!ctx->path_memory_current
        || !ctx->path_memory_peak) {
        shell_log_error ("error caching cgroup paths");
        goto error;
    }

@grondo grondo Sep 3, 2025

Probably should be obvious, but why isn't ctx->path_cpu_stat checked along with the memory paths?

Member Author

An oversight! Fixing.
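To make the path caching concrete, here is a rough sketch of how a cgroup v2 file path could be derived from a line of /proc/PID/cgroup, followed by a check that covers all three paths. This is not the PR's get_cgroup_path(); the function names are hypothetical, and the sketch assumes a unified hierarchy mounted at /sys/fs/cgroup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the path of a cgroup v2 interface file from
 * one line of /proc/PID/cgroup.  On a unified (v2) hierarchy that line
 * looks like "0::/some/group".  Assumes cgroup2 is mounted at
 * /sys/fs/cgroup.  Returns 0 on success, -1 if the line is not a v2
 * entry or buf is too small.
 */
int cgroup_path_from_line (const char *line, const char *name,
                           char *buf, size_t size)
{
    const char *prefix = "0::";
    size_t len;
    int n;

    assert (line != NULL && name != NULL && buf != NULL);
    if (strncmp (line, prefix, strlen (prefix)) != 0)
        return -1;
    line += strlen (prefix);
    len = strcspn (line, "\n");     /* ignore the trailing newline */
    if (len == 0 || line[0] != '/')
        return -1;
    if (len == 1)                   /* root cgroup "/": avoid a double slash */
        len = 0;
    n = snprintf (buf, size, "/sys/fs/cgroup%.*s/%s", (int)len, line, name);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}

/* Return 0 only if every required path was found, including cpu.stat,
 * so that a missing path is reported at setup rather than during polling.
 */
int check_paths (const char *memory_current,
                 const char *memory_peak,
                 const char *cpu_stat)
{
    if (!memory_current || !memory_peak || !cpu_stat)
        return -1;
    return 0;
}
```

A real implementation would also confirm that the resulting file is readable, which is presumably what the R_OK argument in the reviewed code is for.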


garlick commented Sep 3, 2025

I fixed those things and rebased on current master.


garlick commented Sep 3, 2025

I'll set MWP, thanks!

Problem: currently there is no built-in monitoring of cpu
and memory utilization for jobs.

Add a shell plugin that samples data from the cgroups V2 cpu and
memory controllers.  The plugin is disabled by default, and can be
enabled with the 'sysmon' shell option.

The default behavior is to log the summary data from each shell
at the end of the job:

    5.065s: flux-shell[1]: sysmon: memory.peak=3.1179504G
    5.066s: flux-shell[1]: sysmon: loadavg-overall=2.00
    5.068s: flux-shell[0]: sysmon: memory.peak=3.1179504G
    5.068s: flux-shell[0]: sysmon: loadavg-overall=2.00

The plugin polls cgroups periodically.  By default, this is tied
to the Flux heartbeat event.  A different period may be selected by
setting the sysmon.period=FSD shell option.  Each polling cycle logs
sample information at trace level (view with -o verbose=2):

    2.497s: flux-shell[0]: TRACE: sysmon: memory.current=1.3986053G
    2.497s: flux-shell[0]: TRACE: sysmon: loadavg=2.00
    2.497s: flux-shell[1]: TRACE: sysmon: memory.current=1.3990936G
    2.497s: flux-shell[1]: TRACE: sysmon: loadavg=2.00

In this case the reported load average only covers the polling period.
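
The thread does not show how the load average is computed. One plausible reading, sketched here as an assumption, is that it is the CPU time consumed between two samples (the usage_usec field of the cgroup v2 cpu.stat file) divided by the wall-clock time elapsed. The function names parse_usage_usec and cpu_load are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extract usage_usec (total CPU time consumed, in microseconds) from the
 * text of a cgroup v2 cpu.stat file.  Returns 0 on success, -1 if the
 * key is absent or malformed.
 */
int parse_usage_usec (const char *cpu_stat, uint64_t *usec)
{
    const char *key = "usage_usec ";
    const char *p = cpu_stat;

    assert (cpu_stat != NULL && usec != NULL);
    while (p && *p) {               /* p always points at a line start */
        if (strncmp (p, key, strlen (key)) == 0) {
            if (sscanf (p + strlen (key), "%" SCNu64, usec) == 1)
                return 0;
            return -1;
        }
        p = strchr (p, '\n');       /* advance to the next line */
        if (p)
            p++;
    }
    return -1;
}

/* Assumed definition: average number of busy CPUs between two samples,
 * i.e. CPU seconds consumed divided by wall-clock seconds elapsed.
 * Returns -1.0 on invalid input.
 */
double cpu_load (uint64_t usec_prev, uint64_t usec_now, double elapsed_sec)
{
    if (elapsed_sec <= 0. || usec_now < usec_prev)
        return -1.0;
    return (double)(usec_now - usec_prev) / 1e6 / elapsed_sec;
}
```

Under this reading, the per-poll loadavg uses the previous sample as its baseline, while loadavg-overall would use the first sample taken when the plugin started.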

Problem: the sysmon shell plugin has no documentation.

Add some to the man page.

Problem: there is no test coverage for the sysmon shell plugin.

Add a script.
@mergify mergify bot merged commit ebe8b15 into flux-framework:master Sep 3, 2025
34 of 35 checks passed
@garlick garlick deleted the sysmon branch September 3, 2025 18:43
codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 82.38342% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.03%. Comparing base (fa1329b) to head (fc89ccd).
⚠️ Report is 508 commits behind head on master.

Files with missing lines | Patch % | Lines
src/shell/sysmon.c | 82.38% | 34 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6958      +/-   ##
==========================================
- Coverage   84.05%   84.03%   -0.03%     
==========================================
  Files         545      546       +1     
  Lines       91955    92148     +193     
==========================================
+ Hits        77296    77437     +141     
- Misses      14659    14711      +52     
Files with missing lines | Coverage Δ
src/shell/builtins.c | 92.30% <ø> (ø)
src/shell/sysmon.c | 82.38% <82.38%> (ø)

... and 11 files with indirect coverage changes
