Add fix for devices that do not have memory resources #6823
csadorf
left a comment
I think we should target this for 25.08, not 25.06.
python/cuml/cuml/tests/conftest.py
Outdated
    else:
        return None
I don't think returning None here is a good idea, because it would lead to TypeErrors in many of our (stress) tests.
What do you recommend? Would returning 0 be a good solution?
The only place we use this function is to set pytest.max_gpu_memory. That variable is implicitly expected to be a nonzero integer number wherever it is used.
I think using None is ok to indicate "unknown", but then we need to make sure to test for that wherever pytest.max_gpu_memory is used.
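A minimal sketch of the kind of check this implies, assuming a hypothetical helper `scale_by_gpu_memory` (not part of cuML) and assuming the memory size is expressed in GB. It only illustrates treating `None` as "unknown" before any arithmetic happens:

```python
from typing import Optional


def scale_by_gpu_memory(
    base_rows: int, max_gpu_memory: Optional[int]
) -> Optional[int]:
    """Hypothetical helper: scale a row count by the GPU memory size.

    Returns None when the memory size is unknown, so the caller can
    decide to skip instead of failing inside the arithmetic.
    """
    if max_gpu_memory is None:
        return None
    return base_rows * max_gpu_memory


# Without a guard, None fails as soon as it is used in arithmetic.
unknown = None
try:
    unknown * 1000
except TypeError as exc:
    print(f"unguarded use fails: {exc}")

# With the guard, the unknown case is explicit and easy to test for.
print(scale_by_gpu_memory(1000, 16))
print(scale_by_gpu_memory(1000, None))
```

In a test, a `None` result could then be turned into a `pytest.skip(...)` call rather than an error.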
Have we decided what we would like to do with this change?
@viclafargue I think we should fix this up and merge into branch-25.10.
/ok to test b4f2c85
/ok to test 7f9c957
@viclafargue I think you'll also need to add a pynvml dependency, I think that needs to be in the
    try:
        if device_id and not str(device_id).isnumeric():
            # This means device_id is UUID.
            # This works for both MIG and non-MIG device UUIDs.
            handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(device_id))
            if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
                # Additionally get parent device handle
                # if the device itself is a MIG instance
                handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(
                    handle
                )
        else:
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
        return handle
    except pynvml.NVMLError:
        raise ValueError(f"Invalid device index or UUID: {device_id}")
This seems fairly complicated for what appears to be a rather basic function. Is this really the recommended approach for this?
For general support, yes, but I presume this is for CI only, so not all of it is necessarily required. However, this is a verbatim copy from Dask-CUDA, which is probably the only place this function is tested, so I think it makes sense to keep a verbatim copy here, as it will be less of a headache for you.
In the long term, I'd like to have those functions in some shared package so that all RAPIDS projects can piggyback on it instead of copying verbatim. I've been pushing for that for 2 years, but it has been really hard to convince our management of its value. Perhaps now that we have similar functions copied in like 50 different places, its value will finally become obvious. @quasiben
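For readers following the thread, the index-versus-UUID branch in the quoted `try` block can be exercised without NVML. The sketch below only mirrors that one condition. The function name `uses_uuid_lookup` and the UUID-like strings are made up for illustration and are not part of Dask-CUDA or cuML:

```python
def uses_uuid_lookup(device_id) -> bool:
    """Mirror the condition in the quoted snippet.

    True when the identifier is truthy and not purely numeric,
    which is the case the snippet treats as a UUID.
    """
    return bool(device_id) and not str(device_id).isnumeric()


# Hypothetical identifiers; the UUID-like strings are placeholders.
candidates = (0, 1, "1", "GPU-hypothetical-uuid", "MIG-hypothetical-uuid")
for candidate in candidates:
    branch = "UUID lookup" if uses_uuid_lookup(candidate) else "index lookup"
    print(f"{candidate!r}: {branch}")
```

Note that the integer `0` is falsy, so the first half of the condition already routes it to the index lookup.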
pentschev
left a comment
LGTM, thanks @viclafargue!
/merge