-
Notifications
You must be signed in to change notification settings - Fork 32
RE-Implemented NVIDIA Energy capture via C #1167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 file(s) reviewed, 3 comment(s)
Edit PR Review Bot Settings | Greptile
@ribalba Can you please check what the correct install command is for the libraries under Fedora. ChatGPT suggested: sudo dnf install cuda-nvml-dev It also mentioned it is not in the distributions repos and you need to add: sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora.repo |
Old Energy EstimationEco CI Output:
🌳 CO2 Data: Total cost of whole PR so far: |
@ribalba Ping |
From https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-fedora:
and then I can build it with
but I can't test. But the executable loads fine. I would not patch everything in our default dev instead just adding this to the documentation. Also I would warn in the makefile and exit if we are on something different than ubuntu. |
Hmm this looks VERY flaky. Also I do not want to fix on Cuda 12.9 can you somehow send me a dynamic version of this and maybe add it directly to the PR? My question would be why the |
The problem is that there are only these two packages that provide the nvm.h
and cuda installs it in its own dir.
and the package puts them there:
|
* main: (73 commits) Forcing int64 in pandas to be safe Splits the diskio provider into reads and writes (#1189) Sorting by created_at now Hotfix: Compare values were 3 orders of magnitude to low due to double division (#1191) Sampling rate rework (#1194) Phase padding can now be turned on and off (#1193) User 0 should have flow_process_duration and total duration only at 30 minutes and no data in json 'measurement' AI-Tests can now activated and deactivated in tests (Testing QoL): JS errors in frontend tests are now reported typo added no-else-raise Checking in more cases now if github detected even if path broken AI Optimisations Frontend added to FOSS version as appetizer (#1192) Allow repo URLs with unknown schemes but issues warning Revert "Test fix\nwe changed from failing on unknowns to allowing them due to allowing other vendors or private repos with reduced capbility tokens that might be cloneable but do not expose the API" general wording Runtime phase reconstruction only when runtime phase is present (fix): shutdown_on_job_no must only be non false (fix): Null check for resolution must also be in system_checks (fix): Providers without resolution must also be mappable to _sampling_interval_padding ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
18 file(s) reviewed, 8 comment(s)
Edit PR Review Bot Settings | Greptile
So after some research there is now way to make this "nicer". There is no meta package so I would need to do some shell magic to find out the newest version and the library dir will need to be added as this is intended from Nvidia so you can have multiple versions of the library installed. |
Funny that Ubuntu does than exactly that and creates a meta package on top of it :) Ok, I will use your latest pointers then. People have to make PRs then to update nvidia libraries for our static install process in the future I guess |
* main: (24 commits) Removing -x again as the stderr out was problematic for tests (fix): Using correct container name Simplifying flow with new setup-commands to detach adding detach option to setup-commands Using pre-built squid reverse proxy run-template now echoes command for debugging If no carbon data present show placeholder instead of broken image Image badges correct link and noopener Bump pydantic from 2.11.4 to 2.11.5 (#1196) Bump hiredis from 3.1.1 to 3.2.0 (#1197) Added git bisect script to find bad commits Sending emails text only again. Brevo supports this Nicer email display when a new softare gets added (signature change): Changing metrics to metric in both timeline APIs (fix): metrics typo for API query Checking outside symlinks when cloning from URL (#1195) Errors for non-200 not correctly captures Allowing filter by failed runs Moving to firefox browser (fix): Powermetrics must set sampling_rate_configured ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9 file(s) reviewed, 4 comment(s)
Edit PR Review Bot Settings | Greptile
Eco CI Output:
🌳 CO2 Data: Total cost of whole PR so far: |
We were experiencing some sampling-rate issues with the
nvidia-smi
implementation where the sampling jitter was too high.This is a re-implementation in C which is still minimally slower than our other providers, but quite performant so we can achieve sampling < 100 ms

Greptile Summary
Re-implemented NVIDIA GPU power monitoring from shell script to C using NVML library, significantly improving sampling rate consistency and enabling sub-100ms measurement intervals.
metric_providers/gpu/energy/nvidia/nvml/component/source.c
with NVML library integration for precise GPU power measurementsprovider.py
where microseconds to seconds conversion is missing in energy calculationMakefile
that may need better version handling (/usr/local/cuda-12.9
)Makefile
(-lnvidia-ml
appears twice in LDFLAGS)install_linux.sh
with new--nvidia-gpu
flag