add simple shell plugin for monitoring memory and cpu usage by garlick · Pull Request #6958 · flux-framework/flux-core

garlick · 2025-08-08T22:33:02Z

There was a flux-discuss email question recently asking about a way to monitor resource utilization. I had some cpu and memory utilization code lying around and made a quick and dirty broker module that polls /proc on the heartbeat and calculates cpu usage over the heartbeat interval. My thought was that someone could gather this info more efficiently with a bulk RPC to each rank than by spawning processes.

Then I made a demo shell plugin that just monitors the local cpu and memory usage and looks for big changes to log, with the thought that this could be the basis for a debugging tool that looks for deadlocked or leaky ranks in a big job.

Meh. This may be a solution in search of a problem. I thought I'd post it anyway, since the sunk cost is nonzero, and see if anybody had any thoughts on how to redirect it towards something more useful. If not, I have no problem euthanizing it.

Edit: tests are TBD

grondo · 2025-08-08T22:59:38Z

I haven't looked at this fully yet, but this seems like a pretty useful start to me!

If the shell sysmon plugin could also monitor some process statistics, this could also potentially solve #5488.

We could add a special handler to the shell rank 0 plugin that pulls data from all the other shells and sends the result back to the caller (as is done with the mpir plugin for the procdesc data), perhaps with some simple ways to aggregate the bulk data. This has proven to be pretty scalable.

src/shell/sysmon.c

garlick · 2025-08-21T20:40:56Z

I reworked this to just use the shell's cgroup memory and cpu controller directly. If enabled with -o sysmon, each shell will log memory and cpu info at the end of the job:

    5.065s: flux-shell[1]: sysmon: memory.peak=3.1179504G
    5.066s: flux-shell[1]: sysmon: loadavg-overall=2.00
    5.068s: flux-shell[0]: sysmon: memory.peak=3.1179504G
    5.068s: flux-shell[0]: sysmon: loadavg-overall=2.00
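For readers unfamiliar with the cgroup v2 memory controller, a value like memory.peak is exposed as a small text file holding one decimal byte count. The following is a minimal sketch of reading such a file, not the code from src/shell/sysmon.c. The helper names (`parse_cgroup_bytes`, `read_cgroup_bytes`, `bytes_to_gib`) are hypothetical, and the GiB conversion is an assumption about the display unit rather than the plugin's actual formatting.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the contents of a single-value cgroup v2 file such as
 * memory.peak or memory.current: one decimal byte count, optionally
 * followed by a newline.  Returns 0 on success, -1 on malformed input.
 */
int parse_cgroup_bytes (const char *text, unsigned long long *value)
{
    char *endptr;
    unsigned long long v;

    /* Require a leading digit so that "max", "-5", and "" are rejected */
    if (!text || !value || *text < '0' || *text > '9')
        return -1;
    errno = 0;
    v = strtoull (text, &endptr, 10);
    if (errno != 0 || (*endptr != '\0' && *endptr != '\n'))
        return -1;
    *value = v;
    return 0;
}

/* Read a byte count from the cgroup file at 'path' (hypothetical helper).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_cgroup_bytes (const char *path, unsigned long long *value)
{
    char buf[64];
    FILE *fp;
    int rc = -1;

    if (!(fp = fopen (path, "r")))
        return -1;
    if (fgets (buf, sizeof (buf), fp))
        rc = parse_cgroup_bytes (buf, value);
    fclose (fp);
    return rc;
}

/* Convert bytes to GiB (2^30 bytes) for display.  The unit is an
 * assumption made for this sketch.
 */
double bytes_to_gib (unsigned long long bytes)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

A caller would pass the full path of the shell's memory.peak file to `read_cgroup_bytes()` and print the result with something like `printf ("%.2fG\n", bytes_to_gib (value))`.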

If tracing is enabled with -o verbose=2, a little snapshot is logged on each heartbeat:

    2.497s: flux-shell[0]: TRACE: sysmon: memory.current=1.3986053G
    2.497s: flux-shell[0]: TRACE: sysmon: loadavg=2.00
    2.497s: flux-shell[1]: TRACE: sysmon: memory.current=1.3990936G
    2.497s: flux-shell[1]: TRACE: sysmon: loadavg=2.00

The polling period for the tracing can be explicitly set with -o sysmon.period=FSD.
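As an illustration of what an FSD (Flux Standard Duration) value such as `2s` or `500ms` means, here is a simplified parsing sketch. It assumes the common form of a non-negative number followed by an optional `ms`, `s`, `m`, `h`, or `d` suffix, with seconds as the default. It is not flux-core's own FSD parser, the function name `parse_duration` is hypothetical, and, as a simplification, it rejects infinite values.

```c
#include <errno.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Parse a duration string like "2s", "500ms", "1.5m", or "10" into
 * seconds.  Returns 0 on success, -1 on invalid input, including
 * negative, non-finite, or unknown-suffix values.
 */
int parse_duration (const char *s, double *seconds)
{
    char *end;
    double val;
    double scale;

    if (!s || !seconds)
        return -1;
    errno = 0;
    val = strtod (s, &end);
    /* end == s means no number was found at all */
    if (end == s || errno != 0 || !isfinite (val) || val < 0.)
        return -1;
    if (*end == '\0' || !strcmp (end, "s"))
        scale = 1.;
    else if (!strcmp (end, "ms"))
        scale = 1e-3;
    else if (!strcmp (end, "m"))
        scale = 60.;
    else if (!strcmp (end, "h"))
        scale = 3600.;
    else if (!strcmp (end, "d"))
        scale = 86400.;
    else
        return -1;
    *seconds = val * scale;
    return 0;
}
```

Under these assumptions, `-o sysmon.period=500ms` would correspond to a polling interval of half a second, and a negative period would be rejected up front rather than reaching the timer.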

Aggregation of this data would make it a lot more useful, but I'm wondering if this baby step is enough better than nothing that we should consider it as is, as a stepping stone?

No tests yet - I wanted to see what others thought.

garlick · 2025-08-30T17:06:39Z

I added testing and removed the WIP. I am not hard over on merging this as is, but wanted the option to be available.

grondo

This seems like a useful thing to throw in. If we want something more later, this could be a great starting point, and it may help some users in the near term.

Just a couple comments inline.

grondo · 2025-09-03T15:53:08Z

src/shell/sysmon.c

+                int i = json_is_integer (period);
+                if (i < 0)
+                    goto error_period;
+                ctx->period = json_integer_value (period);


I think json_is_integer() only returns 0 or 1, so the test could never be true? Did you mean to compare json_integer_value() < 0? (I'm not sure, though, since i wasn't reused to assign to ctx->period.)

Good catch! Yeah I'm sure I meant to check the value.

grondo · 2025-09-03T15:56:17Z

src/shell/sysmon.c

+    ctx->path_memory_current = get_cgroup_path (mypid, "memory.current", R_OK);
+    ctx->path_memory_peak = get_cgroup_path (mypid, "memory.peak", R_OK);
+    ctx->path_cpu_stat = get_cgroup_path (mypid, "cpu.stat", R_OK);
+    if (!ctx->path_memory_current
+        || !ctx->path_memory_peak) {
+        shell_log_error ("error caching cgroup paths");
+        goto error;
+    }


Probably should be obvious, but why isn't ctx->path_cpu_stat checked along with the memory paths?

An oversight! Fixing.
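To make the point of this exchange concrete, here is a hedged sketch of resolving the three cgroup files and failing if any one of them cannot be resolved. It is not the PR's `get_cgroup_path()`: the names `cgroup_file_path`, `sysmon_paths`, and `sysmon_paths_init` are hypothetical, the cgroup v2 hierarchy is assumed to be mounted at /sys/fs/cgroup, and the R_OK access check from the original is omitted.

```c
#include <stdio.h>
#include <string.h>

/* Given the text of /proc/PID/cgroup, find the unified (v2) entry
 * "0::<path>" and write "/sys/fs/cgroup<path>/<name>" into buf.
 * Returns 0 on success, -1 if there is no v2 entry or buf is too small.
 */
int cgroup_file_path (const char *proc_cgroup,
                      const char *name,
                      char *buf,
                      size_t size)
{
    const char *line = proc_cgroup;

    while (line && *line) {
        if (!strncmp (line, "0::", 3)) {
            const char *rel = line + 3;
            size_t len = strcspn (rel, "\n");
            int n;

            /* The root cgroup is "/": drop it to avoid a doubled slash */
            if (len == 1 && rel[0] == '/')
                len = 0;
            n = snprintf (buf, size, "/sys/fs/cgroup%.*s/%s",
                          (int)len, rel, name);
            return (n < 0 || (size_t)n >= size) ? -1 : 0;
        }
        /* Advance to the start of the next line, if any */
        line = strchr (line, '\n');
        if (line)
            line++;
    }
    return -1;
}

struct sysmon_paths {
    char memory_current[1024];
    char memory_peak[1024];
    char cpu_stat[1024];
};

/* Resolve all three files, treating a failure on any of them
 * (including cpu.stat) as an error.
 */
int sysmon_paths_init (struct sysmon_paths *p, const char *proc_cgroup)
{
    if (cgroup_file_path (proc_cgroup, "memory.current",
                          p->memory_current, sizeof (p->memory_current)) < 0
        || cgroup_file_path (proc_cgroup, "memory.peak",
                             p->memory_peak, sizeof (p->memory_peak)) < 0
        || cgroup_file_path (proc_cgroup, "cpu.stat",
                             p->cpu_stat, sizeof (p->cpu_stat)) < 0)
        return -1;
    return 0;
}
```

Checking all three results in one condition keeps a missing cpu.stat from surfacing later as a confusing read error during polling.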

garlick · 2025-09-03T16:17:30Z

I fixed those things and rebased on current master.

garlick · 2025-09-03T17:13:22Z

I'll set MWP, thanks!

Problem: currently there is no built-in monitoring of cpu and memory utilization for jobs.

Add a shell plugin that samples data from the cgroups V2 cpu and memory controllers. The plugin is disabled by default, and can be enabled with the 'sysmon' shell option. The default behavior is to log the summary data from each shell at the end of the job:

    5.065s: flux-shell[1]: sysmon: memory.peak=3.1179504G
    5.066s: flux-shell[1]: sysmon: loadavg-overall=2.00
    5.068s: flux-shell[0]: sysmon: memory.peak=3.1179504G
    5.068s: flux-shell[0]: sysmon: loadavg-overall=2.00

The plugin polls cgroups periodically. By default, this is tied to the Flux heartbeat event. A different period may be selected by setting the sysmon.period=FSD shell option. Each polling cycle logs sample information at trace level (view with -o verbose=2):

    2.497s: flux-shell[0]: TRACE: sysmon: memory.current=1.3986053G
    2.497s: flux-shell[0]: TRACE: sysmon: loadavg=2.00
    2.497s: flux-shell[1]: TRACE: sysmon: memory.current=1.3990936G
    2.497s: flux-shell[1]: TRACE: sysmon: loadavg=2.00

In this case the reported load average only covers the polling period.
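One way to picture a load average that covers only the polling period is as CPU time consumed divided by wall-clock time elapsed between two samples. The sketch below assumes the figure is derived from the `usage_usec` field of the cgroup v2 cpu.stat file. That derivation is an assumption made for illustration, not a statement of how sysmon.c computes it, and the function names `parse_cpu_usage_usec` and `interval_load` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Extract the usage_usec counter (total CPU time in microseconds)
 * from the text of a cgroup v2 cpu.stat file.
 * Returns 0 on success, -1 if the field is not present.
 */
int parse_cpu_usage_usec (const char *text, unsigned long long *usec)
{
    const char *line = text;

    while (line && *line) {
        if (sscanf (line, "usage_usec %llu", usec) == 1)
            return 0;
        /* Advance to the start of the next line, if any */
        line = strchr (line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Average number of busy CPUs over one interval: CPU seconds consumed
 * divided by wall-clock seconds elapsed.  Returns a negative value if
 * the interval is empty or the counter went backwards.
 */
double interval_load (unsigned long long usec_prev,
                      unsigned long long usec_now,
                      double elapsed_sec)
{
    if (elapsed_sec <= 0. || usec_now < usec_prev)
        return -1.;
    return (double)(usec_now - usec_prev) / 1e6 / elapsed_sec;
}
```

With this definition, a shell whose tasks consumed 4 CPU-seconds during a 2 second polling period would report a load of 2.00. Using the first and last samples of the job instead of adjacent ones would give an overall figure.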

Problem: the sysmon shell plugin has no documentation. Add some to the man page.

Problem: there is no test coverage for the sysmon shell plugin. Add a script.

codecov · 2025-12-02T19:01:14Z

Codecov Report

❌ Patch coverage is 82.38342% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.03%. Comparing base (fa1329b) to head (fc89ccd).
⚠️ Report is 508 commits behind head on master.

Files with missing lines	Patch %	Lines
src/shell/sysmon.c	82.38%	34 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6958      +/-   ##
==========================================
- Coverage   84.05%   84.03%   -0.03%     
==========================================
  Files         545      546       +1     
  Lines       91955    92148     +193     
==========================================
+ Hits        77296    77437     +141     
- Misses      14659    14711      +52

Files with missing lines	Coverage Δ
src/shell/builtins.c	`92.30% <ø> (ø)`
src/shell/sysmon.c	`82.38% <82.38%> (ø)`

... and 11 files with indirect coverage changes

garlick force-pushed the sysmon branch from bb5e890 to 412ce5e Compare August 9, 2025 00:03

garlick force-pushed the sysmon branch 2 times, most recently from a21ceee to 71e5470 Compare August 21, 2025 20:26

github-advanced-security bot found potential problems Aug 21, 2025

garlick force-pushed the sysmon branch from 71e5470 to 78318dc Compare August 21, 2025 23:00

garlick force-pushed the sysmon branch 3 times, most recently from 4bb5c06 to a1242e1 Compare August 30, 2025 16:38

garlick changed the title ~~WIP: add sysmon shell plugin~~ add simple shell plugin for monitoring memory and cpu usage Aug 30, 2025

grondo approved these changes Sep 3, 2025

garlick force-pushed the sysmon branch from a1242e1 to 8cfb230 Compare September 3, 2025 16:17

garlick added the merge-when-passing label Sep 3, 2025

garlick added 3 commits September 3, 2025 17:54

flux-shell(1): document the sysmon plugin

498f7e2

Problem: the sysmon shell plugin has no documentation. Add some to the man page.

testsuite: add sysmon sharness script

fc89ccd

Problem: there is no test coverage for the sysmon shell plugin. Add a script.

mergify bot force-pushed the sysmon branch from 8cfb230 to fc89ccd Compare September 3, 2025 17:54

mergify bot merged commit ebe8b15 into flux-framework:master Sep 3, 2025
34 of 35 checks passed

garlick deleted the sysmon branch September 3, 2025 18:43
