-
Notifications
You must be signed in to change notification settings - Fork 69
Energy profiling tools: Add Daemon class for periodic task execution #300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Energy profiling tools: Add Daemon class for periodic task execution #300
Conversation
b757df7 to
84a0ef5
Compare
4039559 to
10c87de
Compare
2c20e1f to
08e5d24
Compare
| void run(); | ||
| void stop(); | ||
| bool is_running() const { return running_; } | ||
| std::thread& get_thread() { return thread_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually want this to be modifiable?
JBludau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description and title could be updated, but the rest is ok for now
|
If we go to c++20 we can use jthread for thread safety, but that is for kokkos 5 |
180df12 to
9458838
Compare
profiling/energy-profiler/daemon.hpp
Outdated
|
|
||
| class Daemon { | ||
| public: | ||
| Daemon(std::function<void()> func, int interval_ms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pass in a std::chrono::duration directly, instead of an int? What about the type int; e.g. what is the default arithmetic type that std::chrono::milliseconds encodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 646bf35
profiling/energy-profiler/daemon.hpp
Outdated
| std::thread& get_thread() { return thread_; } | ||
|
|
||
| private: | ||
| std::chrono::milliseconds interval_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be chosen as the smallest possible tick period? Thinking that "milliseconds" may be quite long as a sampling interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ... we could use the smallest to make it as general as possible. Nevertheless, the question that arises is if there is anything useful below the ms threshold. Do you have an example in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 646bf35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| private: | ||
| std::chrono::milliseconds interval_; | ||
| bool running_{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an issue of thread safety?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. As stated in a previous comment, we could just switch to jthread and https://en.cppreference.com/w/cpp/thread/jthread/request_stop.html once we have c++20
profiling/energy-profiler/daemon.hpp
Outdated
| #include <functional> | ||
| #include <thread> | ||
| #include <chrono> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best to order the includes alphabetically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 626f777
profiling/energy-profiler/daemon.hpp
Outdated
| // | ||
| //@HEADER | ||
|
|
||
| #pragma once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably agree at the level of kokkos-tools on using consistently either preprocessor conditionals or this pragma to protect our includes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that consistency would be a good thing to aim for. There are a lot more things in Tools that could be made consistent though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched to classic ifdef in 67a8030
profiling/energy-profiler/daemon.cpp
Outdated
| } else { | ||
| throw std::runtime_error("Daemon already started"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to think about how to be consistent in handling errors. In your tool, it appears you're sometimes using exceptions, and sometimes using a boolean to handle errors. I'm not sure if there's a policy at the kokkos-tools level to avoid exceptions. Probably best to check e.g. with @JBludau.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even say we don't want to throw when the daemon is already running. This is not an exception but should just do nothing if the daemon is started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the exceptions in 4502851
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exceptions aren't really used for Kokkos Tools libraries - and I think not throwing here is the right choice.
| auto next_run = std::chrono::high_resolution_clock::now() + interval_; | ||
| func_(); | ||
| std::this_thread::sleep_until(next_run); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering about use cases where e.g. the time that the function takes is longer than the sampling interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never ment this to be able to run a function at an interval lower than its own execution time. That would be a much more complex daemon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Just thinking that in terms of strategy for the PR, it may make sense to clarify whether the PR is going for a general deamon, or for a deamon that works for the energy use case. Perhaps focusing on the energy use case would be helpful because it would give a good framework to think about issues like the choice of the tick for the duration, or this issue of the sampling frequency vs the duration of the API call.
If you narrow the focus to a daemon specific to the energy use case, the daemon could be placed into the KokkosTools::EnergyProfiler namespace for now.
That being said, I do think it would make sense to think about what happens when the API call takes longer than the sampling interval. Perhaps not systematically. But e.g. in the case where for whatever random reason, the API call happens to take a bit longer than expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, let me preface by saying this work by @ethan-puyaubreau has been valuable in terms of capabilities for energy profiling and monitoring as well as ideas for software engineering that apply to other tools in Kokkos Tools.
With that, to hone in on your bigger picture question on a generic daemon class:
Assuming I have understood your use of the term Daemon as a real-time independent monitoring service, a key reason for my Kokkos Tools PR #291 for the Kokkos Tools LDMS connector is enabling this daemon-like functionality via a production-level performance monitoring software tool, called LDMS (Lightweight Distributed Monitoring Service). The motivation for it is that the Kokkos Tools LDMS connector can eventually integrated with - or more specifically be a parent tools library to - any other Kokkos Tools child/base library, e.g., nvtx-connector, variorium-connector. A variant of this LDMS connector had been used for the last couple years in production - it has been closed-source until this PR #291. (By the way, PR #291 is ready for review - feel free to provide suggestions for improvements and/or let me know if you have questions on it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it into the KokkosTools::EnergyProfiler namespace in f01b82e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
maartenarnst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR would also benefit from tests and an example.
|
Hey @JBludau. I've just gone over the changes. You'll find a few suggestions. Sill thinking that it would be good to add a small test. Perhaps there could be a test with a callback function that just waits for a given duration. The test could use durations that would be typical for the use case of the energy profiler. |
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Yeah, I asked @ethan-puyaubreau if he could add his test. |
aprokop
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I only have a couple question.
We need this capability as soon as we can for one of my projects.
|
|
||
| namespace KokkosTools::EnergyProfiler { | ||
| void Daemon::start() { | ||
| if (!running_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the semantics of start() if it's already running? Should it break? Should it be restarted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should just do nothing if it is already running. Multiple start calls should be fine since in our current use case the global system time is recorded. If someone wants to use it differently I would opt we add a restart() functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but as described below ... this is not threadsafe anyway
| } | ||
|
|
||
| void Daemon::stop() { | ||
| if (running_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the semantics here? Should we require that it's been running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it is not threadsafe, so we are limited with the current impl. But if we care we can either go to jthread or at least use atomics for the status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think C++20 std::jthread is a great idea here. I have a slight inclination to use C++ atomics for now since it seems more straightforward - I am wondering how portable/available jthreads would be on various computing systems currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the tools don't have a minimum requirement for 20, right? Since most of the other tools are not threadsafe, I would say that is a larger discussion.
But if we want it to be std::jthread now, we can update. I am happy for any contribution
Co-authored-by: Andrey Prokopenko <andrey.prok@gmail.com>
|
Hi all, I was wondering about the status on this. I think this is a good PR that would be valuable for a lot of profiling data beyond energy information, which is relevant to #291 - albeit I would say the energy profiling use case is the one that is probably of most value currently to production Kokkos applications. Do you know the rough timeline of this PR by chance? I know that @ethan-puyaubreau is done with his internship, so there may be timing constraints from his end. I think this PR is on the right track overall. |
|
The main open thing is a desire to have an example for the daemon. So far I was not able to come up with a good one. |
Yes, I definitely agree just considering the energy profiling use case already shows an example usage. It would certainly help to have a small example since it can make clear assumptions being made on its use, but I think it doesn't need to be a blocker. I do think some of the error handling and/or tests for corner cases (e.g., interval being longer than time for kernel) that are discussed by @maartenarnst and you in this PR are important to think about. Except for making sure no critical error checks are missing, I think the code logic of this PR makes sense to me. Doing some sort of beta release of this Daemon class, especially so, e.g., @aprokop along with #291 can use this would be good to get early feedback on real-world production use cases, as that could directly provide a lot of insight for improvements. Maybe it could be merged in as a beta feature early next year? Just my thoughts on this now - I will look and see if I can make a few more improvements to the code and error checks in the PR myself till then. |
vlkale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, just a suggestion to consider a default interval
| namespace KokkosTools::EnergyProfiler { | ||
| class Daemon { | ||
| public: | ||
| Daemon(std::function<void()> func, const std::chrono::duration& interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One quick note is I don't see a default value for the time interval of the Daemon but maybe you could have one. If you do, I think a reasonable one would be 1 second. This would at least makes sense from the point of view of #291. Maybe there are other use cases and situations where you want it to be shorter (or longer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think the user explicitly specifying a duration is a good practice here. It is a core setting of the daemon that should be thought about explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it - yes, I agree. We want this Daemon class to be generic beyond the energy profiling use case and/or LDMS use case, and it should indeed be thought out by users explicitly based on the scenario.
|
@maartenarnst do you want us to investigate the use case of a longer return time than the frequency the daemon is given? If so, do you have a specific case in mind that would need this? |
|
@vlkale I think we can merge this. Maarten did not reply and we still can look at making the daemon able to deal with runtimes longer than its frequency later if we ever have a usecase for this |
@JBludau I am good with merging it. Yes, allowing for runtimes longer than its frequency would be good but it's not needed right now for immediate/priority use cases. |
This PR introduces a generic Daemon class for background task execution in the energy profiler infrastructure:
The daemon system will enable energy providers to perform periodic power sampling at specified intervals, supporting the energy measurement capabilities planned in #301.