RE-Implemented NVIDIA Energy capture via C #1167

ArneTR · 2025-04-29T11:18:10Z

We were experiencing some sampling-rate issues with the nvidia-smi implementation where the sampling jitter was too high.

This is a re-implementation in C. It is still marginally slower than our other providers, but performant enough that we can achieve sampling intervals below 100 ms.
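
One common way to keep jitter low in a C sampling loop is to sleep until absolute deadlines instead of sleeping for relative intervals, so the time spent reading the sensor does not accumulate as drift. The sketch below only illustrates that idea and is not taken from this PR's source.c; `advance_deadline`, `sample_loop`, and the `read_fn` callback are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Move an absolute deadline forward by interval_us microseconds,
 * keeping tv_nsec normalized to the range [0, 1e9). */
void advance_deadline(struct timespec *deadline, long interval_us)
{
    assert(interval_us > 0);
    deadline->tv_sec  += interval_us / 1000000L;
    deadline->tv_nsec += (interval_us % 1000000L) * 1000L;
    if (deadline->tv_nsec >= NSEC_PER_SEC) {
        deadline->tv_sec  += 1;
        deadline->tv_nsec -= NSEC_PER_SEC;
    }
}

/* Call read_fn() `samples` times, once every interval_us microseconds.
 * Sleeping until an absolute CLOCK_MONOTONIC deadline means the time
 * spent inside read_fn() does not push later samples back. */
void sample_loop(unsigned int samples, long interval_us, void (*read_fn)(void))
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);

    for (unsigned int i = 0; i < samples; i++) {
        read_fn(); /* hypothetical sensor read, e.g. a GPU power query */
        advance_deadline(&deadline, interval_us);
        /* If a signal interrupts the sleep, resume waiting for the same deadline. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                               &deadline, NULL) == EINTR) {
        }
    }
}
```

With a 99,000 µs interval, for instance, each deadline is computed from the previous deadline rather than from the moment the read finished, so a slow read shortens the following sleep instead of shifting every later sample.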

Greptile Summary

Re-implemented NVIDIA GPU power monitoring from shell script to C using NVML library, significantly improving sampling rate consistency and enabling sub-100ms measurement intervals.

Added metric_providers/gpu/energy/nvidia/nvml/component/source.c with NVML library integration for precise GPU power measurements
Fixed unit conversion issue in provider.py where microseconds to seconds conversion is missing in energy calculation
Added CUDA path dependency in Makefile that may need better version handling (/usr/local/cuda-12.9)
Duplicate NVML library linking in Makefile (-lnvidia-ml appears twice in LDFLAGS)
Added NVIDIA toolkit headers installation support in install_linux.sh with new --nvidia-gpu flag
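
The unit conversion mentioned for provider.py can be illustrated with a small worked example. NVML reports power in milliwatts; assuming the sampling interval is recorded in microseconds, the interval has to be converted to seconds before multiplying, otherwise the energy value is off by a factor of 10^6. This is a minimal sketch in C, not the PR's actual Python code, and `energy_millijoules` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Energy in millijoules for one sample.
 * mW * s = mJ, so the interval in microseconds is divided by 1e6 first. */
double energy_millijoules(uint32_t power_mw, uint64_t interval_us)
{
    assert(interval_us > 0);
    double interval_s = (double)interval_us / 1e6;
    return (double)power_mw * interval_s;
}
```

For example, a reading of 50,000 mW (50 W) held over a 99,000 µs interval gives 50,000 × 0.099 = 4,950 mJ.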

greptile-apps

4 file(s) reviewed, 3 comment(s)

metric_providers/gpu/energy/nvidia/smi/component/Makefile

metric_providers/gpu/energy/nvidia/smi/component/source.c

ArneTR · 2025-04-29T11:49:39Z

@ribalba Can you please check what the correct install command is for the libraries under Fedora?

ChatGPT suggested: sudo dnf install cuda-nvml-dev

It also mentioned that the package is not in the distribution's repos, so you need to add: sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora.repo

github-actions · 2025-04-29T12:00:27Z

Old Energy Estimation

Eco CI Output:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]
Measurement #1	27.6041	2858.67	4.14	690.87
Total Run	27.60	2858.67	4.14	690.87
Additional overhead from Eco CI	N/A	8.32	4.02	2.07

🌳 CO2 Data:
City: Chicago, Lat: 41.8835, Lon: -87.6305
IP: 172.183.76.133
CO₂ from energy is: 1.003393170 g
CO₂ from manufacturing (embodied carbon) is: 0.197114751 g
Carbon Intensity for this location: 351 gCO₂eq/kWh
SCI: 1.200508 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

ArneTR · 2025-05-14T05:36:55Z

@ribalba Ping

ribalba · 2025-05-14T17:22:17Z

From https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-fedora:


wget https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora$(rpm -E %fedora).repo
sudo mv cuda-fedora$(rpm -E %fedora).repo /etc/yum.repos.d/
sudo dnf makecache
sudo dnf install libnvidia-ml
sudo dnf install cuda-nvml-devel-12-9
sudo ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so

and then I can build it with

gcc -O3 -Wall -Werror -I../../../../../../lib/c ../../../../../../lib/c/gmt-lib.o source.c     -L../../../../../../lib/c -lc     -I/usr/local/cuda-12.9/targets/x86_64-linux/include     -L/usr/lib64     -lnvidia-ml     -o metric-provider-binary

I can't test it, but the executable loads fine.

I would not patch all of this into our default dev setup; instead I would just add it to the documentation. I would also warn in the Makefile and exit if we are on anything other than Ubuntu.

ArneTR · 2025-05-15T05:07:16Z

Hmm, this looks VERY flaky. Also, I do not want to pin this to CUDA 12.9.

Can you somehow send me a dynamic version of this and maybe add it directly to the PR? My question would be why -I/usr/local/cuda-12.9/targets/x86_64-linux/include is necessary here, while under Ubuntu it is not.

ribalba · 2025-05-16T07:32:57Z

The problem is that only these two packages provide nvml.h:

didi@fedora:~/code/green-metrics-tool/metric_providers/gpu/energy/nvidia/nvml/component$ sudo dnf search cuda-nvml-devel
Updating and loading repositories:
Repositories loaded.
Matched fields: name
 cuda-nvml-devel-12-8.x86_64: NVML native dev links, headers.
 cuda-nvml-devel-12-9.x86_64: NVML native dev links, headers.
didi@fedora:~/code/green-metrics-tool/metric_providers/gpu/energy/nvidia/nvml/component$

and cuda installs it in its own dir.

$ gcc -print-search-dirs
install: /usr/lib/gcc/x86_64-redhat-linux/14/
programs: =/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/bin/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/bin/
libraries: =/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../lib64/:/lib/x86_64-redhat-linux/14/:/lib/../lib64/:/usr/lib/x86_64-redhat-linux/14/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../:/lib/:/usr/lib/

and the package puts them there:

$ sudo dnf repoquery -l  cuda-nvml-devel-12-9.x86_64
Updating and loading repositories:
Repositories loaded.
/usr/lib64/pkgconfig/nvidia-ml-12.9.pc
/usr/local/cuda-12.9
/usr/local/cuda-12.9/include
/usr/local/cuda-12.9/lib64
/usr/local/cuda-12.9/nvml
/usr/local/cuda-12.9/nvml/example
/usr/local/cuda-12.9/nvml/example/Makefile
/usr/local/cuda-12.9/nvml/example/README.txt
/usr/local/cuda-12.9/nvml/example/example.c
/usr/local/cuda-12.9/nvml/example/supportedVgpus.c
/usr/local/cuda-12.9/targets
/usr/local/cuda-12.9/targets/x86_64-linux
/usr/local/cuda-12.9/targets/x86_64-linux/include
/usr/local/cuda-12.9/targets/x86_64-linux/include/nvml.h
/usr/local/cuda-12.9/targets/x86_64-linux/lib
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs/libnvidia-ml.a
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/share/licenses/cuda-nvml-devel-12-9
/usr/share/licenses/cuda-nvml-devel-12-9/LICENSE

* main: (73 commits) Forcing int64 in pandas to be safe Splits the diskio provider into reads and writes (#1189) Sorting by created_at now Hotfix: Compare values were 3 orders of magnitude to low due to double division (#1191) Sampling rate rework (#1194) Phase padding can now be turned on and off (#1193) User 0 should have flow_process_duration and total duration only at 30 minutes and no data in json 'measurement' AI-Tests can now activated and deactivated in tests (Testing QoL): JS errors in frontend tests are now reported typo added no-else-raise Checking in more cases now if github detected even if path broken AI Optimisations Frontend added to FOSS version as appetizer (#1192) Allow repo URLs with unknown schemes but issues warning Revert "Test fix\nwe changed from failing on unknowns to allowing them due to allowing other vendors or private repos with reduced capbility tokens that might be cloneable but do not expose the API" general wording Runtime phase reconstruction only when runtime phase is present (fix): shutdown_on_job_no must only be non false (fix): Null check for resolution must also be in system_checks (fix): Providers without resolution must also be mappable to _sampling_interval_padding ...

ArneTR · 2025-05-19T05:19:05Z

@greptileai

greptile-apps

18 file(s) reviewed, 8 comment(s)

lib/install_shared.sh

install_linux.sh

metric_providers/gpu/energy/nvidia/nvml/component/provider.py

metric_providers/gpu/energy/nvidia/nvml/component/source.c

ribalba · 2025-05-22T08:36:30Z

So after some research, there is no way to make this "nicer". There is no meta package, so I would need to do some shell magic to find the newest version, and the library directory will need to be added explicitly. This is intended by NVIDIA so that you can have multiple versions of the library installed.

ArneTR · 2025-05-23T06:11:01Z

Funny that Ubuntu then does exactly that and creates a meta package on top of it :)

OK, I will use your latest pointers then. I guess people will have to make PRs in the future to update the NVIDIA libraries for our static install process.

* main: (24 commits) Removing -x again as the stderr out was problematic for tests (fix): Using correct container name Simplifying flow with new setup-commands to detach adding detach option to setup-commands Using pre-built squid reverse proxy run-template now echoes command for debugging If no carbon data present show placeholder instead of broken image Image badges correct link and noopener Bump pydantic from 2.11.4 to 2.11.5 (#1196) Bump hiredis from 3.1.1 to 3.2.0 (#1197) Added git bisect script to find bad commits Sending emails text only again. Brevo supports this Nicer email display when a new softare gets added (signature change): Changing metrics to metric in both timeline APIs (fix): metrics typo for API query Checking outside symlinks when cloning from URL (#1195) Errors for non-200 not correctly captures Allowing filter by failed runs Moving to firefox browser (fix): Powermetrics must set sampling_rate_configured ...

ArneTR · 2025-05-24T04:37:21Z

@greptileai

greptile-apps

9 file(s) reviewed, 4 comment(s)

metric_providers/gpu/energy/nvidia/nvml/component/Makefile

metric_providers/gpu/energy/nvidia/nvml/component/source.c

github-actions · 2025-05-24T04:46:29Z

Eco CI Output:

Label	🖥 avg. CPU utilization [%]	🔋 Total Energy [Joules]	🔌 avg. Power [Watts]	Duration [Seconds]
Measurement #1	28.0073	3517.39	4.17	844.25
Total Run	28.01	3517.39	4.17	844.25
Additional overhead from Eco CI	N/A	10.16	4.08	2.49

🌳 CO2 Data:
City: Boydton, Lat: 36.6676, Lon: -78.3875
IP: 68.154.31.175
CO₂ from energy is: 1.252190840 g
CO₂ from manufacturing (embodied carbon) is: 0.240876182 g
Carbon Intensity for this location: 356 gCO₂eq/kWh
SCI: 1.493067 gCO₂eq / pipeline run emitted

Total cost of whole PR so far:

RE-Implemented NVIDIA Energy capture via C

3b83ff3

greptile-apps bot reviewed Apr 29, 2025

ArneTR added 4 commits April 29, 2025 13:23

Name change from NVIDIA SMI to NVML [skip ci]

6a61ec7

Makefile cleanup

98ef46a

Adding NVIDIA Headers download to install

dc25a09

Directory rename

d10051d

ArneTR added 2 commits May 19, 2025 07:14

Changed resolution to sampling_rate [skip ci]

565f414

greptile-apps bot reviewed May 19, 2025

ArneTR added 4 commits May 24, 2025 06:22

Fixing installer --nvidia-gpu optione

cd601fb

Installing libs for fedora for --nvidia-gpu

3d0a8c1

Including fedora libs. Should not harm Ubuntu target

9792307

greptile-apps bot reviewed May 24, 2025

Nvidia lm was duplicate; nvmlShutdown added when checking card [skip ci]

7d9ac87

ArneTR merged commit 9469fe8 into main May 24, 2025

ArneTR deleted the nvidia-energy-c branch May 24, 2025 10:49
