Better support for AMD GPUS #77

Open
anaprietonem opened this issue Oct 9, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@anaprietonem
Contributor

Is your feature request related to a problem? Please describe.

From a profiler and MLflow perspective, the current system-metrics tracking lacks support for hardware other than NVIDIA GPUs, because the only package we use for this is pynvml, which is limited to NVIDIA GPUs.

Describe the solution you'd like

We could improve this and at least provide better support for AMD GPUs, since there is an open-source package called pyrsmi (developed by ROCm) that does the same as pynvml but for AMD ROCm hardware.
We could define a custom SystemMetrics monitor that can handle multiple hardware types (see comments for a potential implementation). This would work out of the box, with the same config settings you already use for MLflow. Lastly, this would only monitor one node, i.e. 8 AMD GPUs on the same node and not across nodes, since we assume that memory consumption, speed, etc. would be the same for all GPUs except the "master GPU".
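
To make the idea of a monitor that handles multiple hardware types more concrete, here is a minimal sketch of one possible shape. It is not the reference implementation attached in the comments. Everything named here (`GPUBackend`, `StaticBackend`, `detect_vendor`, `collect_gpu_metrics`, and the metric key names) is hypothetical and chosen for illustration. The real adapters that would wrap pynvml and pyrsmi are deliberately left out, and checking whether a package is installed is only a rough stand-in for checking which hardware is actually present.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol

MEGABYTE = 1024 ** 2


class GPUBackend(Protocol):
    """Hypothetical vendor-neutral interface; one adapter per vendor library."""

    def device_count(self) -> int: ...
    def utilization_percent(self, index: int) -> float: ...
    def memory_used_bytes(self, index: int) -> int: ...
    def memory_total_bytes(self, index: int) -> int: ...


@dataclass
class StaticBackend:
    """Stand-in backend with fixed readings, used here instead of real hardware."""

    utilization: List[float]
    used: List[int]
    total: List[int]

    def device_count(self) -> int:
        return len(self.utilization)

    def utilization_percent(self, index: int) -> float:
        return self.utilization[index]

    def memory_used_bytes(self, index: int) -> int:
        return self.used[index]

    def memory_total_bytes(self, index: int) -> int:
        return self.total[index]


def _is_installed(package: str) -> bool:
    # True if the package can be imported in this environment.
    return importlib.util.find_spec(package) is not None


def detect_vendor(
    is_installed: Callable[[str], bool] = _is_installed,
) -> Optional[str]:
    """Guess the GPU vendor from the installed monitoring package.

    Assumption: pynvml implies NVIDIA and pyrsmi implies AMD. A real
    monitor should also confirm that devices are actually visible.
    """
    if is_installed("pynvml"):
        return "nvidia"
    if is_installed("pyrsmi"):
        return "amd"
    return None


def collect_gpu_metrics(backend: GPUBackend) -> Dict[str, float]:
    """Collect per-GPU metrics for the local node only."""
    metrics: Dict[str, float] = {}
    for i in range(backend.device_count()):
        used = backend.memory_used_bytes(i)
        total = backend.memory_total_bytes(i)
        metrics[f"gpu_{i}_utilization_percentage"] = float(
            backend.utilization_percent(i)
        )
        metrics[f"gpu_{i}_memory_usage_megabytes"] = used / MEGABYTE
        # Guard against a backend reporting zero total memory.
        metrics[f"gpu_{i}_memory_usage_percentage"] = (
            100.0 * used / total if total else 0.0
        )
    return metrics


if __name__ == "__main__":
    demo = StaticBackend(
        utilization=[50.0], used=[100 * MEGABYTE], total=[400 * MEGABYTE]
    )
    print(detect_vendor(), collect_gpu_metrics(demo))
```

With this split, the monitor itself only depends on the small backend interface, so supporting another vendor would mean adding one adapter rather than touching the collection logic.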

Describe alternatives you've considered

No response

Additional context

This solution was originally suggested by @einrone so many thanks for this!

Organisation

No response

@anaprietonem added the enhancement label Oct 9, 2024
@anaprietonem
Contributor Author

See reference implementation provided by Aram for AIFS
monitor_system_metric.txt

@mpvginde

Hi @anaprietonem, FYI:
Right now we are using the PR mlflow/mlflow#12694, from the people at MetNO, to do some basic monitoring of AMD GPUs.
