add simple shell plugin for monitoring memory and cpu usage #6958
mergify[bot] merged 3 commits into flux-framework:master
|
I haven't looked at this fully yet, but this seems like a pretty useful start to me! If the shell sysmon plugin could also monitor some process statistics, this could also potentially solve #5488. We could add a special handler to the shell rank 0 plugin that pulls data from all the other shells and sends the result back to the caller (as is done with the mpir plugin for the procdesc data), perhaps with some simple ways to aggregate the bulk data. This has proven to be pretty scalable. |
a21ceee to 71e5470
|
I reworked this to just use the shell's cgroup memory and cpu controller directly. If enabled with If tracing is enabled with The polling period for the tracing can be explicitly set with Aggregation of this data would make it a lot more useful, but I'm wondering if this baby step is enough better than nothing that we should consider it as is, as a stepping stone? No tests yet - I wanted to see what others thought. |
4bb5c06 to a1242e1
|
I added testing and removed the WIP. I am not hard over on merging this as is, but wanted the option to be available. |
grondo left a comment
This seems like a useful thing to throw in. If we want something more later, this could be a great starting point, and it may help some users in the near term.
Just a couple comments inline.
src/shell/sysmon.c
Outdated
int i = json_is_integer (period);
if (i < 0)
    goto error_period;
ctx->period = json_integer_value (period);
I think json_is_integer() only returns 0 or 1, so the test could never be true? Did you mean to compare json_integer_value() < 0? (I'm not sure, though, since i wasn't reused to assign to ctx->period.)
Good catch! Yeah I'm sure I meant to check the value.
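To illustrate the point raised above, here is a minimal sketch of the corrected pattern: a type predicate only answers yes or no, so the range check has to be made on the extracted value. The `struct value` type and `parse_period()` are hypothetical stand-ins for illustration only; they are not the jansson API and not the code that was merged.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a parsed JSON value (not the jansson json_t). */
struct value {
    bool is_integer;     /* result of the type predicate: true or false only */
    long long integer;   /* the value itself, valid when is_integer is true */
};

/* Store the period and return 0 on success.
 * Return -1 if the value is not an integer or is negative.
 * The sign check is applied to the value, not to the predicate result.
 */
static int parse_period (const struct value *v, long long *period)
{
    if (!v->is_integer)
        return -1;
    if (v->integer < 0)
        return -1;
    *period = v->integer;
    return 0;
}
```

With this shape, a negative period is rejected, whereas a `< 0` test on the predicate result would let it through.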
ctx->path_memory_current = get_cgroup_path (mypid, "memory.current", R_OK);
ctx->path_memory_peak = get_cgroup_path (mypid, "memory.peak", R_OK);
ctx->path_cpu_stat = get_cgroup_path (mypid, "cpu.stat", R_OK);
if (!ctx->path_memory_current
    || !ctx->path_memory_peak) {
    shell_log_error ("error caching cgroup paths");
    goto error;
}
Probably should be obvious, but why isn't ctx->path_cpu_stat checked along with the memory paths?
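For readers unfamiliar with the lookup being discussed, the following is a rough sketch of how a cgroup v2 control file path can be derived from a line of `/proc/PID/cgroup`, which has the form `0::<path>` on a unified hierarchy. The function name `cgroup_v2_file()` is hypothetical, the `/sys/fs/cgroup` mount point is an assumption, and the access check implied by the `R_OK` argument of `get_cgroup_path()` is left out. This is not the plugin's implementation.

```c
#include <stdio.h>
#include <string.h>

/* Build the path to cgroup v2 control file 'name' from one line of
 * /proc/PID/cgroup (e.g. "0::/user.slice/job\n").
 * Assumes the unified hierarchy is mounted on /sys/fs/cgroup.
 * Returns 0 on success, -1 if the line is not a v2 entry or buf is too small.
 */
static int cgroup_v2_file (const char *line,
                           const char *name,
                           char *buf,
                           size_t size)
{
    const char *prefix = "0::";
    size_t n;
    int len;

    if (strncmp (line, prefix, strlen (prefix)) != 0)
        return -1;
    line += strlen (prefix);
    n = strcspn (line, "\n"); /* length of the path without the newline */
    len = snprintf (buf, size, "/sys/fs/cgroup%.*s/%s", (int)n, line, name);
    if (len < 0 || (size_t)len >= size)
        return -1;
    return 0;
}
```

A caller that caches several such paths, as in the snippet under review, would need to check each result, including the one for `cpu.stat`.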
|
I fixed those things and rebased on current master. |
|
I'll set MWP, thanks! |
Problem: currently there is no built-in monitoring of cpu and memory utilization for jobs.

Add a shell plugin that samples data from the cgroups V2 cpu and memory controllers. The plugin is disabled by default, and can be enabled with the 'sysmon' shell option.

The default behavior is to log the summary data from each shell at the end of the job:

  5.065s: flux-shell[1]: sysmon: memory.peak=3.1179504G
  5.066s: flux-shell[1]: sysmon: loadavg-overall=2.00
  5.068s: flux-shell[0]: sysmon: memory.peak=3.1179504G
  5.068s: flux-shell[0]: sysmon: loadavg-overall=2.00

The plugin polls cgroups periodically. By default, this is tied to the Flux heartbeat event. A different period may be selected by setting the sysmon.period=FSD shell option.

Each polling cycle logs sample information at trace level (view with -o verbose=2):

  2.497s: flux-shell[0]: TRACE: sysmon: memory.current=1.3986053G
  2.497s: flux-shell[0]: TRACE: sysmon: loadavg=2.00
  2.497s: flux-shell[1]: TRACE: sysmon: memory.current=1.3990936G
  2.497s: flux-shell[1]: TRACE: sysmon: loadavg=2.00

In this case the reported load average only covers the polling period.
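As a rough illustration of how a per-period load figure could be derived from the cgroup v2 cpu controller, the sketch below parses the `usage_usec` counter from the text of a `cpu.stat` file and divides the change in CPU time by the elapsed wall-clock time. The helper names `parse_usage_usec()` and `cpu_load()` are hypothetical, and the assumption that the plugin computes its `loadavg` value this way is not confirmed by the commit message.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Extract the usage_usec counter (total CPU time in microseconds) from
 * the text of a cgroup v2 cpu.stat file.
 * Returns 0 on success, -1 if the key is not present.
 */
static int parse_usage_usec (const char *text, uint64_t *usage)
{
    const char *line = text;

    while (line && *line) {
        if (sscanf (line, "usage_usec %" SCNu64, usage) == 1)
            return 0;
        line = strchr (line, '\n'); /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}

/* Average number of busy CPUs between two samples of usage_usec taken
 * elapsed_sec seconds apart. Returns 0 if the interval is not positive
 * or the counter went backwards.
 */
static double cpu_load (uint64_t usage0, uint64_t usage1, double elapsed_sec)
{
    if (elapsed_sec <= 0. || usage1 < usage0)
        return 0.;
    return (double)(usage1 - usage0) / (elapsed_sec * 1e6);
}
```

Under these assumptions, sampling once per polling period gives a load that covers only that period, while comparing the final sample against the first gives an overall figure for the job.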
Problem: the sysmon shell plugin has no documentation. Add some to the man page.
Problem: there is no test coverage for the sysmon shell plugin. Add a script.
Codecov Report ❌ Patch coverage is
@@ Coverage Diff @@
## master #6958 +/- ##
==========================================
- Coverage 84.05% 84.03% -0.03%
==========================================
Files 545 546 +1
Lines 91955 92148 +193
==========================================
+ Hits 77296 77437 +141
- Misses 14659 14711 +52
There was a flux-discuss email question recently asking about a way to monitor resource utilization. I had some cpu and memory utilization code lying around and made a quick and dirty broker module that polled /proc on the heartbeat and calculated cpu usage over the heartbeat interval. My thought was that someone could do better making a bulk RPC to each rank to gather this info than trying to spawn processes.
Then I made a demo shell plugin that just monitors the local cpu and memory usage and looks for big changes to log, with the thought that this could be the basis for a debugging tool that looks for deadlocked or leaky ranks in a big job.
Meh. This may be a solution in search of a problem. I thought I'd post it anyway since the sunk cost is nonzero, and see if anybody had any thoughts on how to redirect it towards something more useful. If not, I have no problem euthanizing it.
Edit: tests are TBD