RE-Implemented NVIDIA Energy capture via C #1167

Merged
ArneTR merged 12 commits into main from nvidia-energy-c
May 24, 2025

Conversation

ArneTR
Member

@ArneTR ArneTR commented Apr 29, 2025

We were experiencing some sampling-rate issues with the nvidia-smi implementation where the sampling jitter was too high.

This is a re-implementation in C. It is still marginally slower than our other providers, but fast enough that we can sample at intervals below 100 ms.
Screenshot 2025-04-29 at 1 14 11 PM
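
For illustration only, here is a minimal sketch of one common way to keep sampling jitter low in C: sleeping until absolute deadlines instead of for relative durations, so per-iteration work does not accumulate as drift. It is not the `source.c` from this PR. `read_power_mw()` is a hypothetical stand-in for the NVML power query and returns a constant so the sketch runs without a GPU; `sample_power()` and `timespec_add_ns()` are likewise assumed names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the real power query; returns a fixed
 * reading in milliwatts so the sketch runs without a GPU. */
static unsigned int read_power_mw(void) {
    return 25000u;
}

/* Advance a timespec by interval_ns and normalise the nanosecond field. */
static void timespec_add_ns(struct timespec *t, long interval_ns) {
    t->tv_nsec += interval_ns;
    while (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Take `count` samples spaced interval_ns apart.
 * Stores the readings in out[] and returns the elapsed time in ns. */
int64_t sample_power(unsigned int *out, int count, long interval_ns) {
    struct timespec start, next, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    next = start;

    for (int i = 0; i < count; i++) {
        out[i] = read_power_mw();
        timespec_add_ns(&next, interval_ns);
        /* Sleep until the absolute deadline; retry if a signal interrupts. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) == EINTR) {
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
           + (end.tv_nsec - start.tv_nsec);
}
```

Because every deadline is derived from the start time, a slow read in one iteration shortens the following sleep rather than shifting all later samples.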

Greptile Summary

Re-implemented NVIDIA GPU power monitoring from shell script to C using NVML library, significantly improving sampling rate consistency and enabling sub-100ms measurement intervals.

  • Added metric_providers/gpu/energy/nvidia/nvml/component/source.c with NVML library integration for precise GPU power measurements
  • Fixed unit conversion issue in provider.py where microseconds to seconds conversion is missing in energy calculation
  • Added CUDA path dependency in Makefile that may need better version handling (/usr/local/cuda-12.9)
  • Duplicate NVML library linking in Makefile (-lnvidia-ml appears twice in LDFLAGS)
  • Added NVIDIA toolkit headers installation support in install_linux.sh with new --nvidia-gpu flag
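
To make the unit-conversion point above concrete, here is a small arithmetic sketch. It assumes power readings in milliwatts (the unit NVML reports) and sample intervals in microseconds. The function names are hypothetical, and this is not the calculation from `provider.py`, which is not shown in this conversation.

```c
#include <stddef.h>
#include <stdint.h>

/* Energy of one sample in joules.
 * Assumes power in milliwatts and the sample interval in microseconds. */
double sample_energy_joules(uint32_t power_mw, uint64_t interval_us) {
    double watts = power_mw / 1000.0;            /* mW -> W */
    double seconds = (double)interval_us / 1e6;  /* us -> s, the easy step to miss */
    return watts * seconds;                      /* W * s = J */
}

/* Sum the energy over n samples (hypothetical helper for illustration). */
double total_energy_joules(const uint32_t *power_mw,
                           const uint64_t *interval_us, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += sample_energy_joules(power_mw[i], interval_us[i]);
    }
    return total;
}
```

For example, a 25 W reading held over a 100 ms interval contributes 2.5 J. Leaving out the microsecond-to-second division would inflate that figure by a factor of one million.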

Contributor

@greptile-apps greptile-apps bot left a comment

4 file(s) reviewed, 3 comment(s)

@ArneTR
Member Author

ArneTR commented Apr 29, 2025

@ribalba Can you please check what the correct install command is for the libraries under Fedora?

ChatGPT suggested: sudo dnf install cuda-nvml-dev

It also mentioned that the package is not in the distribution's repos, so you first need to add: sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora.repo


github-actions bot commented Apr 29, 2025

Old Energy Estimation

Eco CI Output:

| Label | 🖥 avg. CPU utilization [%] | 🔋 Total Energy [Joules] | 🔌 avg. Power [Watts] | Duration [Seconds] |
|---|---|---|---|---|
| Measurement #1 | 27.6041 | 2858.67 | 4.14 | 690.87 |
| Total Run | 27.60 | 2858.67 | 4.14 | 690.87 |
| Additional overhead from Eco CI | N/A | 8.32 | 4.02 | 2.07 |

🌳 CO2 Data:
City: Chicago, Lat: 41.8835, Lon: -87.6305
IP: 172.183.76.133
CO₂ from energy is: 1.003393170 g
CO₂ from manufacturing (embodied carbon) is: 0.197114751 g
Carbon Intensity for this location: 351 gCO₂eq/kWh
SCI: 1.200508 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

@ArneTR
Member Author

ArneTR commented May 14, 2025

@ribalba Ping

@ribalba
Member

ribalba commented May 14, 2025

From https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-fedora:


wget https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora$(rpm -E %fedora).repo
sudo mv cuda-fedora$(rpm -E %fedora).repo /etc/yum.repos.d/
sudo dnf makecache
sudo dnf install  libnvidia-ml
sudo dnf install cuda-nvml-devel-12-9
sudo ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so

and then I can build it with

gcc -O3 -Wall -Werror -I../../../../../../lib/c ../../../../../../lib/c/gmt-lib.o source.c     -L../../../../../../lib/c -lc     -I/usr/local/cuda-12.9/targets/x86_64-linux/include     -L/usr/lib64     -lnvidia-ml     -o metric-provider-binary

but I can't test. But the executable loads fine.

I would not patch all of this into our default dev setup; instead I would just add it to the documentation. I would also have the Makefile warn and exit if we are on anything other than Ubuntu.

@ArneTR
Member Author

ArneTR commented May 15, 2025

Hmm, this looks VERY flaky. Also, I do not want to pin us to CUDA 12.9.

Can you somehow send me a dynamic version of this and maybe add it directly to the PR? My question would be why -I/usr/local/cuda-12.9/targets/x86_64-linux/include is necessary here but not under Ubuntu ...

@ribalba
Member

ribalba commented May 16, 2025

The problem is that there are only these two packages that provide nvml.h:

didi@fedora:~/code/green-metrics-tool/metric_providers/gpu/energy/nvidia/nvml/component$ sudo dnf search cuda-nvml-devel
Updating and loading repositories:
Repositories loaded.
Matched fields: name
 cuda-nvml-devel-12-8.x86_64: NVML native dev links, headers.
 cuda-nvml-devel-12-9.x86_64: NVML native dev links, headers.
didi@fedora:~/code/green-metrics-tool/metric_providers/gpu/energy/nvidia/nvml/component$ 

and CUDA installs it in its own directory, which is not among GCC's default search directories:

$ gcc -print-search-dirs
install: /usr/lib/gcc/x86_64-redhat-linux/14/
programs: =/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/14/:/usr/libexec/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/bin/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/bin/
libraries: =/usr/lib/gcc/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../x86_64-redhat-linux/14/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../lib64/:/lib/x86_64-redhat-linux/14/:/lib/../lib64/:/usr/lib/x86_64-redhat-linux/14/:/usr/lib/../lib64/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../../x86_64-redhat-linux/lib/:/usr/lib/gcc/x86_64-redhat-linux/14/../../../:/lib/:/usr/lib/

whereas the package puts its files here:

$ sudo dnf repoquery -l  cuda-nvml-devel-12-9.x86_64
Updating and loading repositories:
Repositories loaded.
/usr/lib64/pkgconfig/nvidia-ml-12.9.pc
/usr/local/cuda-12.9
/usr/local/cuda-12.9/include
/usr/local/cuda-12.9/lib64
/usr/local/cuda-12.9/nvml
/usr/local/cuda-12.9/nvml/example
/usr/local/cuda-12.9/nvml/example/Makefile
/usr/local/cuda-12.9/nvml/example/README.txt
/usr/local/cuda-12.9/nvml/example/example.c
/usr/local/cuda-12.9/nvml/example/supportedVgpus.c
/usr/local/cuda-12.9/targets
/usr/local/cuda-12.9/targets/x86_64-linux
/usr/local/cuda-12.9/targets/x86_64-linux/include
/usr/local/cuda-12.9/targets/x86_64-linux/include/nvml.h
/usr/local/cuda-12.9/targets/x86_64-linux/lib
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs/libnvidia-ml.a
/usr/local/cuda-12.9/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/share/licenses/cuda-nvml-devel-12-9
/usr/share/licenses/cuda-nvml-devel-12-9/LICENSE

ArneTR added 2 commits May 19, 2025 07:14
* main: (73 commits)
  Forcing int64 in pandas to be safe
  Splits the diskio provider into reads and writes (#1189)
  Sorting by created_at now
  Hotfix: Compare values were 3 orders of magnitude to low due to double division (#1191)
  Sampling rate rework (#1194)
  Phase padding can now be turned on and off (#1193)
  User 0 should have flow_process_duration and total duration only at 30 minutes and no data in json 'measurement'
  AI-Tests can now activated and deactivated in tests
  (Testing QoL): JS errors in frontend tests are now reported
  typo
  added no-else-raise
  Checking in more cases now if github detected even if path broken
  AI Optimisations Frontend added to FOSS version as appetizer (#1192)
  Allow repo URLs with unknown schemes but issues warning
  Revert "Test fix\nwe changed from failing on unknowns to allowing them due to allowing other vendors or private repos with reduced capbility tokens that might be cloneable but do not expose the API"
  general wording
  Runtime phase reconstruction only when runtime phase is present
  (fix): shutdown_on_job_no must only be non false
  (fix): Null check for resolution must also be in system_checks
  (fix): Providers without resolution must also be mappable to _sampling_interval_padding
  ...
@ArneTR
Member Author

ArneTR commented May 19, 2025

@greptileai

Contributor

@greptile-apps greptile-apps bot left a comment

18 file(s) reviewed, 8 comment(s)

@ribalba
Member

ribalba commented May 22, 2025

So after some research, there is no way to make this "nicer". There is no meta package, so I would need to do some shell magic to find the newest version, and the library dir will still need to be added. This is intended by NVIDIA so that you can have multiple versions of the library installed.

@ArneTR
Member Author

ArneTR commented May 23, 2025

Funny that Ubuntu then does exactly that and creates a meta package on top of it :)

Ok, I will use your latest pointers then. I guess people will have to make PRs in the future to update the NVIDIA libraries for our static install process.

ArneTR added 4 commits May 24, 2025 06:22
* main: (24 commits)
  Removing -x again as the stderr out was problematic for tests
  (fix): Using correct container name
  Simplifying flow with new setup-commands to detach
  adding detach option to setup-commands
  Using pre-built squid reverse proxy
  run-template now echoes command for debugging
  If no carbon data present show placeholder instead of broken image
  Image badges correct link and noopener
  Bump pydantic from 2.11.4 to 2.11.5 (#1196)
  Bump hiredis from 3.1.1 to 3.2.0 (#1197)
  Added git bisect script to find bad commits
  Sending emails text only again. Brevo supports this
  Nicer email display when a new softare gets added
  (signature change): Changing metrics to metric in both timeline APIs
  (fix): metrics typo for API query
  Checking outside symlinks when cloning from URL (#1195)
  Errors for non-200 not correctly captures
  Allowing filter by failed runs
  Moving to firefox browser
  (fix): Powermetrics must set sampling_rate_configured
  ...
@ArneTR
Member Author

ArneTR commented May 24, 2025

@greptileai

Contributor

@greptile-apps greptile-apps bot left a comment

9 file(s) reviewed, 4 comment(s)


Eco CI Output:

| Label | 🖥 avg. CPU utilization [%] | 🔋 Total Energy [Joules] | 🔌 avg. Power [Watts] | Duration [Seconds] |
|---|---|---|---|---|
| Measurement #1 | 28.0073 | 3517.39 | 4.17 | 844.25 |
| Total Run | 28.01 | 3517.39 | 4.17 | 844.25 |
| Additional overhead from Eco CI | N/A | 10.16 | 4.08 | 2.49 |

🌳 CO2 Data:
City: Boydton, Lat: 36.6676, Lon: -78.3875
IP: 68.154.31.175
CO₂ from energy is: 1.252190840 g
CO₂ from manufacturing (embodied carbon) is: 0.240876182 g
Carbon Intensity for this location: 356 gCO₂eq/kWh
SCI: 1.493067 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:

@ArneTR ArneTR merged commit 9469fe8 into main May 24, 2025
@ArneTR ArneTR deleted the nvidia-energy-c branch May 24, 2025 10:49