
RE-Implemented NVIDIA Energy capture via C #1167


Open
wants to merge 5 commits into
base: main


ArneTR
Member

@ArneTR ArneTR commented Apr 29, 2025

We were experiencing sampling-rate issues with the nvidia-smi implementation: the sampling jitter was too high.

This is a re-implementation in C. It is still slightly slower than our other providers, but fast enough that we can sample at intervals below 100 ms.
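
To illustrate the jitter concern, here is a minimal sketch of one common way to keep a sampling cadence steady: sleeping until absolute deadlines instead of sleeping for a relative interval after each read. This is not the code from this PR. The names `timespec_add_ms`, `sample_loop` and `fake_read_power` are hypothetical, the output format is made up for the example, and the stub reader stands in for an NVML call such as `nvmlDeviceGetPowerUsage`, which reports milliwatts.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Advance an absolute timestamp by interval_ms milliseconds. */
void timespec_add_ms(struct timespec *t, long interval_ms) {
    t->tv_sec  += interval_ms / 1000;
    t->tv_nsec += (interval_ms % 1000) * 1000000L;
    if (t->tv_nsec >= 1000000000L) {      /* carry into seconds */
        t->tv_sec  += 1;
        t->tv_nsec -= 1000000000L;
    }
}

/* Hypothetical reader type: returns one power reading in milliwatts. */
typedef unsigned int (*read_power_fn)(void);

/* Stub reader (assumption): a real provider would query NVML here. */
unsigned int fake_read_power(void) {
    return 42000;
}

/* Take `count` samples, one per interval, and return how many were taken.
 * Deadlines are absolute, so the time spent reading and printing does not
 * accumulate as drift from one sample to the next. */
int sample_loop(read_power_fn read_power, long interval_ms, int count) {
    struct timespec deadline;
    int taken = 0;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    for (int i = 0; i < count; i++) {
        struct timespec now;
        clock_gettime(CLOCK_REALTIME, &now);
        long long us = (long long)now.tv_sec * 1000000LL + now.tv_nsec / 1000;
        /* Illustrative output: wall-clock microseconds, then the reading. */
        printf("%lld %u\n", us, read_power());
        taken++;

        timespec_add_ms(&deadline, interval_ms);
        /* clock_nanosleep returns the error number directly; retry on EINTR. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                               &deadline, NULL) == EINTR) {
        }
    }
    return taken;
}
```

Calling `sample_loop(fake_read_power, 99, 10)` from a `main` function would print ten readings roughly 99 ms apart. Whether the provider in this PR uses absolute deadlines or a simpler relative sleep is not stated in the conversation.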

Greptile Summary

Re-implemented NVIDIA GPU power monitoring from Bash to C using NVML library, significantly improving sampling rate consistency and enabling sub-100ms measurement intervals.

  • Added source.c with direct NVML library integration for precise GPU power measurements
  • Removed metric-provider-nvidia-smi-wrapper.sh shell script to eliminate sampling jitter issues
  • Added Makefile with -O3 optimization and NVML library linkage for performance
  • Updated provider.py to support card model identification and handle the new C-based metrics format
  • Implemented static variables in source.c to prevent thread state persistence

Contributor

@greptile-apps greptile-apps bot left a comment


4 file(s) reviewed, 3 comment(s)

Comment on lines 113 to 114
{"help", no_argument, NULL, 'h'},
{"interval", no_argument, NULL, 'i'},

logic: the interval option is declared as no_argument in the getopt_long option table, but it takes a value, so it should be declared as required_argument

Suggested change
- {"help", no_argument, NULL, 'h'},
- {"interval", no_argument, NULL, 'i'},
+ {"help", no_argument, NULL, 'h'},
+ {"interval", required_argument, NULL, 'i'},
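
The effect of the suggestion can be sketched in isolation. The following is a minimal example, not the PR's `source.c`: the function name `parse_interval_ms`, the `"hi:"` short-option string and the fallback handling are assumptions made for illustration. With `required_argument`, `getopt_long` places the value that follows `--interval` in `optarg`; with `no_argument`, the long option would not consume the value and `optarg` would not point to it.

```c
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse -i/--interval <msec>.
 * Returns fallback_ms if the option is absent and -1 on malformed input.
 * Hypothetical helper for illustration only. */
long parse_interval_ms(int argc, char **argv, long fallback_ms) {
    static const struct option long_options[] = {
        {"help",     no_argument,       NULL, 'h'},
        {"interval", required_argument, NULL, 'i'},  /* takes a value */
        {NULL, 0, NULL, 0}
    };
    long interval_ms = fallback_ms;
    int c;

    optind = 1;  /* restart scanning so the function can be called again */
    /* The ':' after 'i' marks the short option as taking a value too. */
    while ((c = getopt_long(argc, argv, "hi:", long_options, NULL)) != -1) {
        switch (c) {
        case 'i': {
            char *end = NULL;
            long value = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || value <= 0) {
                return -1;  /* not a positive integer */
            }
            interval_ms = value;
            break;
        }
        case 'h':
            printf("usage: %s [-h] [-i msec]\n", argv[0]);
            break;
        default:
            return -1;  /* unknown option or missing value */
        }
    }
    return interval_ms;
}
```

A `main` function could call `parse_interval_ms(argc, argv, 100)` and treat a return value of -1 as a usage error.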

@ArneTR
Member Author

ArneTR commented Apr 29, 2025

@ribalba Can you please check what the correct install command for the libraries is on Fedora?

ChatGPT suggested: sudo dnf install cuda-nvml-dev

It also mentioned that the package is not in the distribution's repos, so you first need to add the NVIDIA repo: sudo dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/fedora$(rpm -E %fedora)/x86_64/cuda-fedora.repo


Eco CI Output:

| Label | 🖥 avg. CPU utilization [%] | 🔋 Total Energy [Joules] | 🔌 avg. Power [Watts] | Duration [Seconds] |
|---|---|---|---|---|
| Measurement #1 | 27.6041 | 2858.67 | 4.14 | 690.87 |
| Total Run | 27.60 | 2858.67 | 4.14 | 690.87 |
| Additional overhead from Eco CI | N/A | 8.32 | 4.02 | 2.07 |

🌳 CO2 Data:
City: Chicago, Lat: 41.8835, Lon: -87.6305
IP: 172.183.76.133
CO₂ from energy is: 1.003393170 g
CO₂ from manufacturing (embodied carbon) is: 0.197114751 g
Carbon Intensity for this location: 351 gCO₂eq/kWh
SCI: 1.200508 gCO₂eq / pipeline run emitted


Total cost of whole PR so far:
