Skip to content

Conversation

@viclafargue
Copy link
Contributor

No description provided.

@viclafargue viclafargue requested review from csadorf and pentschev May 30, 2025 15:02
@viclafargue viclafargue requested a review from a team as a code owner May 30, 2025 15:02
@viclafargue viclafargue requested a review from jcrist May 30, 2025 15:02
@viclafargue viclafargue changed the base branch from branch-25.08 to branch-25.06 May 30, 2025 15:02
@github-actions github-actions bot added the Cython / Python Cython or Python issue label May 30, 2025
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should target this for 25.08, not 25.06.

Comment on lines 293 to 294
else:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think returning None here is a good idea, because it would lead to TypeErrors in many of our (stress) tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you recommend? Would returning 0 be a good solution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place we use this function is to set pytest.max_gpu_memory. That variable is implicitly expected to be a nonzero integer number wherever it is used.

I think using None is ok to indicate "unknown", but then we need to make sure to test for that wherever pytest.max_gpu_memory is used.

@viclafargue viclafargue changed the base branch from branch-25.06 to branch-25.08 May 30, 2025 16:22
@jakirkham jakirkham added bug Something isn't working non-breaking Non-breaking change labels May 31, 2025
@jakirkham
Copy link
Member

Have we decided what we would like to do with this change?

@csadorf
Copy link
Contributor

csadorf commented Aug 1, 2025

Have we decided what we would like to do with this change?

@viclafargue I think we should fix this up and merge into branch-25.10.

@viclafargue viclafargue requested review from a team as code owners August 4, 2025 09:36
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@viclafargue viclafargue changed the base branch from branch-25.08 to branch-25.10 August 4, 2025 09:37
@viclafargue
Copy link
Contributor Author

/ok to test b4f2c85

@viclafargue
Copy link
Contributor Author

/ok to test 7f9c957

@pentschev
Copy link
Member

@viclafargue I think you'll also need to add a pynvml dependency, I think that needs to be in the test_python section.

@github-actions github-actions bot added the conda conda issue label Aug 4, 2025
@viclafargue viclafargue requested a review from csadorf August 4, 2025 17:10
Comment on lines +305 to +320
try:
if device_id and not str(device_id).isnumeric():
# This means device_id is UUID.
# This works for both MIG and non-MIG device UUIDs.
handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(device_id))
if pynvml.nvmlDeviceIsMigDeviceHandle(handle):
# Additionally get parent device handle
# if the device itself is a MIG instance
handle = pynvml.nvmlDeviceGetDeviceHandleFromMigDeviceHandle(
handle
)
else:
handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
return handle
except pynvml.NVMLError:
raise ValueError(f"Invalid device index or UUID: {device_id}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fairly complicated for what appears to be a rather basic function. Is this really the recommended approach for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For general support, yes, but I presume this is for CI only so not necessarily all is required. However, this is a verbatim copy from Dask-CUDA, which is probably the only place this function is tested, so I think it makes sense to have a verbatim copy here as it will be less headache for you.

In the long-term, I'd like to have those functions in some shared package so that all RAPIDS projects can piggyback instead of copying verbatim. I've been pushing on that for 2 years but it has been really hard to convince our management of its value, perhaps now that we have similar functions copied in like 50 different places its value will finally become obvious. @quasiben

@viclafargue viclafargue requested a review from csadorf August 15, 2025 12:48
Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @viclafargue !

@viclafargue
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit faf59a8 into rapidsai:branch-25.10 Aug 27, 2025
76 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working conda conda issue Cython / Python Cython or Python issue non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants