Skip to content

Conversation

@ethan-puyaubreau
Copy link
Contributor

@ethan-puyaubreau ethan-puyaubreau commented Aug 15, 2025

This PR introduces a generic Daemon class for background task execution in the energy profiler infrastructure:

  • Thread-safe periodic execution with configurable intervals
  • Clean start/stop lifecycle management
  • Reusable foundation for background monitoring tasks

The daemon system will enable energy providers to perform periodic power sampling at specified intervals, supporting the energy measurement capabilities planned in #301.

@ethan-puyaubreau ethan-puyaubreau changed the title Feature/energy profiler daemon Energy profiling tools: Deamon system Aug 15, 2025
@ethan-puyaubreau ethan-puyaubreau changed the title Energy profiling tools: Deamon system Energy profiling tools: Daemon system Aug 15, 2025
@ethan-puyaubreau ethan-puyaubreau force-pushed the feature/energy-profiler-daemon branch from b757df7 to 84a0ef5 Compare August 15, 2025 17:14
@ethan-puyaubreau ethan-puyaubreau force-pushed the feature/energy-profiler-daemon branch from 4039559 to 10c87de Compare August 18, 2025 13:30
@ethan-puyaubreau ethan-puyaubreau marked this pull request as ready for review August 19, 2025 18:16
@ethan-puyaubreau ethan-puyaubreau force-pushed the feature/energy-profiler-daemon branch 4 times, most recently from 2c20e1f to 08e5d24 Compare August 28, 2025 19:01
void run();
void stop();
bool is_running() const { return running_; }
std::thread& get_thread() { return thread_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually want this to be modifiable?

Copy link
Contributor

@JBludau JBludau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description and title could be updated, but the rest is ok for now

@ethan-puyaubreau ethan-puyaubreau changed the title Energy profiling tools: Daemon system Energy profiling tools: Add Daemon class for periodic task execution Aug 29, 2025
@JBludau
Copy link
Contributor

JBludau commented Aug 29, 2025

If we go to c++20 we can use jthread for thread safety, but that is for kokkos 5

@ethan-puyaubreau ethan-puyaubreau force-pushed the feature/energy-profiler-daemon branch from 180df12 to 9458838 Compare August 29, 2025 18:26

class Daemon {
public:
Daemon(std::function<void()> func, int interval_ms)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass in a std::chrono::duration directly, instead of an int? What about the type int; e.g. what is the default arithmetic type that std::chrono::milliseconds encodes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 646bf35

std::thread& get_thread() { return thread_; }

private:
std::chrono::milliseconds interval_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be chosen as the smallest possible tick period? Thinking that "milliseconds" may be quite long as a sampling interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ... we could use the smallest to make it as general as possible. Nevertheless, the question that arises is if there is anything useful below the ms threshold. Do you have an example in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in 646bf35

Copy link
Contributor

@vlkale vlkale Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in 646bf35

Sounds good though at least for interactions with performance monitoring systems (including for #291 for LDMS), I don't think there is any benefit (and it may actually be detrimental) to go below the millisecond level in production scenarios.


private:
std::chrono::milliseconds interval_;
bool running_{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an issue of thread safety?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. As stated in a previous comment, we could just switch to jthread and https://en.cppreference.com/w/cpp/thread/jthread/request_stop.html once we have c++20

Comment on lines 19 to 21
#include <functional>
#include <thread>
#include <chrono>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to order the includes alphabetically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 626f777

//
//@HEADER

#pragma once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably agree at the level of kokkos-tools on using consistently either preprocessor conditionals or this pragma to protect our includes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that consistency would be a good thing to aim for. There are a lot more things in Tools that could be made consistent though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched to classic ifdef in 67a8030

Comment on lines 25 to 26
} else {
throw std::runtime_error("Daemon already started");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to think about how to be consistent in handling errors. In your tool, it appears you're sometimes using exceptions, and sometimes using a boolean to handle errors. I'm not sure if there's a policy at the kokkos-tools level to avoid exceptions. Probably best to check e.g. with @JBludau.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even say we don't want to throw when the daemon is already running. This is not an exception but should just do nothing if the daemon is started.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the exceptions in 4502851

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, exceptions aren't really used for Kokkos Tools libraries - and I think not throwing here is the right choice.

Comment on lines +32 to +34
auto next_run = std::chrono::high_resolution_clock::now() + interval_;
func_();
std::this_thread::sleep_until(next_run);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering about use cases where e.g. the time that the function takes is longer than the sampling interval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never ment this to be able to run a function at an interval lower than its own execution time. That would be a much more complex daemon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Just thinking that in terms of strategy for the PR, it may make sense to clarify whether the PR is going for a general deamon, or for a deamon that works for the energy use case. Perhaps focusing on the energy use case would be helpful because it would give a good framework to think about issues like the choice of the tick for the duration, or this issue of the sampling frequency vs the duration of the API call.

If you narrow the focus to a daemon specific to the energy use case, the daemon could be placed into the KokkosTools::EnergyProfiler namespace for now.

That being said, I do think it would make sense to think about what happens when the API call takes longer than the sampling interval. Perhaps not systematically. But e.g. in the case where for whatever random reason, the API call happens to take a bit longer than expected.

Copy link
Contributor

@vlkale vlkale Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, let me preface by saying this work by @ethan-puyaubreau has been valuable in terms of capabilities for energy profiling and monitoring as well as ideas for software engineering that apply to other tools in Kokkos Tools.

With that, to hone in on your bigger picture question on a generic daemon class:

Assuming I have understood your use of the term Daemon as a real-time independent monitoring service, a key reason for my Kokkos Tools PR #291 for the Kokkos Tools LDMS connector is enabling this daemon-like functionality via a production-level performance monitoring software tool, called LDMS (Lightweight Distributed Monitoring Service). The motivation for it is that the Kokkos Tools LDMS connector can eventually integrated with - or more specifically be a parent tools library to - any other Kokkos Tools child/base library, e.g., nvtx-connector, variorium-connector. A variant of this LDMS connector had been used for the last couple years in production - it has been closed-source until this PR #291. (By the way, PR #291 is ready for review - feel free to provide suggestions for improvements and/or let me know if you have questions on it.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it into the KokkosTools::EnergyProfiler namespace in f01b82e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@maartenarnst maartenarnst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR would also benefit from tests and an example.

@maartenarnst
Copy link
Contributor

Hey @JBludau. I've just gone over the changes. You'll find a few suggestions. Sill thinking that it would be good to add a small test. Perhaps there could be a test with a callback function that just waits for a given duration. The test could use durations that would be typical for the use case of the energy profiler.

JBludau and others added 4 commits September 18, 2025 09:41
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
@JBludau
Copy link
Contributor

JBludau commented Sep 18, 2025

Hey @JBludau. I've just gone over the changes. You'll find a few suggestions. Sill thinking that it would be good to add a small test. Perhaps there could be a test with a callback function that just waits for a given duration. The test could use durations that would be typical for the use case of the energy profiler.

Yeah, I asked @ethan-puyaubreau if he could add his test.

Copy link
Contributor

@aprokop aprokop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I only have a couple question.
We need this capability as soon as we can for one of my projects.


namespace KokkosTools::EnergyProfiler {
void Daemon::start() {
if (!running_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics of start() if it's already running? Should it break? Should it be restarted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should just do nothing if it is already running. Multiple start calls should be fine since in our current use case the global system time is recorded. If someone wants to use it differently I would opt we add a restart() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but as described below ... this is not threadsafe anyway

}

void Daemon::stop() {
if (running_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics here? Should we require that it's been running?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it is not threadsafe, so we are limited with the current impl. But if we care we can either go to jthread or at least use atomics for the status

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I think C++20 std::jthread is a great idea here. I have a slight inclination to use C++ atomics for now since it seems more straightforward - I am wondering how portable/available jthreads would be on various computing systems currently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the tools don't have a minimum requirement for 20, right? Since most of the other tools are not threadsafe, I would say that is a larger discussion.
But if we want it to be std::jthread now, we can update. I am happy for any contribution

Co-authored-by: Andrey Prokopenko <andrey.prok@gmail.com>
@vlkale
Copy link
Contributor

vlkale commented Dec 2, 2025

Hi all, I was wondering about the status on this. I think this is a good PR that would be valuable for a lot of profiling data beyond energy information, which is relevant to #291 - albeit I would say the energy profiling use case is the one that is probably of most value currently to production Kokkos applications.

Do you know the rough timeline of this PR by chance? I know that @ethan-puyaubreau is done with his internship, so there may be timing constraints from his end.

I think this PR is on the right track overall.

@JBludau
Copy link
Contributor

JBludau commented Dec 2, 2025

The main open thing is a desire to have an example for the daemon. So far I was not able to come up with a good one.
Personally I think it is fine without an example as the energy profiling pr's that depend on this are basically going to be example usages of this.

@vlkale
Copy link
Contributor

vlkale commented Dec 2, 2025

The main open thing is a desire to have an example for the daemon. So far I was not able to come up with a good one. Personally I think it is fine without an example as the energy profiling pr's that depend on this are basically going to be example usages of this.

Yes, I definitely agree just considering the energy profiling use case already shows an example usage. It would certainly help to have a small example since it can make clear assumptions being made on its use, but I think it doesn't need to be a blocker. I do think some of the error handling and/or tests for corner cases (e.g., interval being longer than time for kernel) that are discussed by @maartenarnst and you in this PR are important to think about.

Except for making sure no critical error checks are missing, I think the code logic of this PR makes sense to me.

Doing some sort of beta release of this Daemon class, especially so, e.g., @aprokop along with #291 can use this would be good to get early feedback on real-world production use cases, as that could directly provide a lot of insight for improvements. Maybe it could be merged in as a beta feature early next year?

Just my thoughts on this now - I will look and see if I can make a few more improvements to the code and error checks in the PR myself till then.

Copy link
Contributor

@vlkale vlkale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, just a suggestion to consider a default interval

namespace KokkosTools::EnergyProfiler {
class Daemon {
public:
Daemon(std::function<void()> func, const std::chrono::duration& interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick note is I don't see a default value for the time interval of the Daemon but maybe you could have one. If you do, I think a reasonable one would be 1 second. This would at least makes sense from the point of view of #291. Maybe there are other use cases and situations where you want it to be shorter (or longer).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think the user explicitly specifying a duration is a good practice here. It is a core setting of the daemon that should be thought about explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it - yes, I agree. We want this Daemon class to be generic beyond the energy profiling use case and/or LDMS use case, and it should indeed be thought out by users explicitly based on the scenario.

@JBludau
Copy link
Contributor

JBludau commented Dec 5, 2025

@maartenarnst do you want us to investigate the use case of a longer return time than the frequency the daemon is given? If so, do you have a specific case in mind that would need this?

@JBludau
Copy link
Contributor

JBludau commented Jan 5, 2026

@vlkale I think we can merge this. Maarten did not reply and we still can look at making the daemon able to deal with runtimes longer than its frequency later if we ever have a usecase for this

@vlkale vlkale added the feature Needed feature but software still is correct on its own label Jan 6, 2026
@vlkale
Copy link
Contributor

vlkale commented Jan 6, 2026

@vlkale I think we can merge this. Maarten did not reply and we still can look at making the daemon able to deal with runtimes longer than its frequency later if we ever have a usecase for this

@JBludau I am good with merging it. Yes, allowing for runtimes longer than its frequency would be good but it's not needed right now for immediate/priority use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Needed feature but software still is correct on its own

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants