ch4/ofi: Convert CUDA device id to handle for fi_mr_regattr #7156

raffenet · 2024-10-02T18:56:50Z

Pull Request Description

Libfabric docs say that the cuda field of the attribute struct passed
to fi_mr_regattr must hold the device handle returned by cuDeviceGet,
not the device ordinal. Fixes #7148.
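A minimal sketch of the id-to-handle conversion described above. So that it builds without CUDA or libfabric installed, it uses stand-ins: `mock_CUdevice`, `mock_cuDeviceGet`, `struct mock_mr_attr`, and `set_cuda_device` are hypothetical names that only mimic the shape of `CUdevice`, `cuDeviceGet`, and the `iface` / `device.cuda` members of `struct fi_mr_attr`. The stand-in deliberately returns a value that differs from the ordinal so the distinction is visible; real handle values are opaque and this is not the code from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles without <cuda.h> or <rdma/fi_domain.h>. */
typedef int mock_CUdevice;                 /* real type: CUdevice */
enum mock_hmem_iface { MOCK_HMEM_SYSTEM = 0, MOCK_HMEM_CUDA };

struct mock_mr_attr {                      /* mimics a subset of struct fi_mr_attr */
    enum mock_hmem_iface iface;
    union {
        int cuda;
    } device;
};

/* Stand-in for cuDeviceGet(CUdevice *device, int ordinal).
 * Returns 0 on success. The +100 offset is arbitrary and only makes
 * the handle visibly different from the ordinal. */
int mock_cuDeviceGet(mock_CUdevice *device, int ordinal)
{
    if (device == NULL || ordinal < 0)
        return 1;
    *device = ordinal + 100;
    return 0;
}

/* Fill the registration attributes for a CUDA buffer on device dev_id.
 * Returns 0 on success, -1 if the id cannot be converted. */
int set_cuda_device(struct mock_mr_attr *attr, int dev_id)
{
    mock_CUdevice handle;

    assert(attr != NULL);
    if (mock_cuDeviceGet(&handle, dev_id) != 0)
        return -1;

    attr->iface = MOCK_HMEM_CUDA;
    attr->device.cuda = handle;            /* the handle, not dev_id */
    return 0;
}
```

In real code the filled attribute struct would then be passed to `fi_mr_regattr`; the point of the sketch is only that the ordinal is converted before it is stored.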

This PR picks [842f4cf] from #7049 in order to avoid confusion between device ids and handles.

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.

raffenet · 2024-10-02T21:36:20Z

test:mpich/ch4/gpu

hzhou · 2024-10-09T15:05:09Z

src/mpl/include/mpl_gpu_cuda.h

@@ -22,4 +23,6 @@ typedef volatile int MPL_gpu_event_t;

 #define MPL_GPU_DEV_AFFINITY_ENV "CUDA_VISIBLE_DEVICES"

+#define MPL_gpu_device_id_to_handle(...) cuDeviceGet(__VA_ARGS__)


Rather than an MPL_gpu_ wrapper, maybe we can call cuDeviceGet directly? The usage (in fi_mr_regattr) is always inside the #ifdef MPL_HAVE_CUDA anyway, which makes it OK to use CUDA-specific calls. If we still want the MPL abstraction, I think we can use an MPL_cuda_ prefix for CUDA-specific functions.

Makes sense. This version just calls cuDeviceGet in ofi_impl.h.
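As a rough illustration of the direct-call approach, the sketch below calls `cuDeviceGet` only inside an `#ifdef MPL_HAVE_CUDA` guard. The helper name `cuda_handle_from_id` is hypothetical and this is not the code in ofi_impl.h. It assumes the CUDA driver API has already been initialized (`cuInit`) when the guard is active; without `MPL_HAVE_CUDA` defined it compiles to a stub that reports failure.

```c
#include <assert.h>
#include <stddef.h>
#ifdef MPL_HAVE_CUDA
#include <cuda.h>
#endif

/* Convert a CUDA device id (ordinal) to a device handle.
 * Returns 0 and stores the handle in *handle_out on success,
 * -1 on failure or when built without CUDA support. */
int cuda_handle_from_id(int dev_id, int *handle_out)
{
    assert(handle_out != NULL);
#ifdef MPL_HAVE_CUDA
    CUdevice dev;

    /* CUDA-specific call is fine here: this branch only exists in CUDA builds. */
    if (cuDeviceGet(&dev, dev_id) != CUDA_SUCCESS)
        return -1;
    *handle_out = (int) dev;
    return 0;
#else
    (void) dev_id;                         /* unused without CUDA */
    return -1;
#endif
}
```

Keeping the call behind the guard means no GPU-neutral wrapper is needed for a conversion that only the CUDA path uses.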

The abstraction of device id as an integer is a good abstraction above MPL. The opaqueness of MPL_gpu_device_handle_t, on the other hand, makes it useless, since the upper layer can't do anything with it. For ZE, simply expose the internal device id. For new GPU runtimes that do not support integer device ids, we can similarly do a map as in mpl_gpu_ze.c.
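One way such a map could look, as a sketch only: a small table assigns integer ids to a runtime's native device handles, so upper layers see plain integers. All names here (`native_device_t`, `register_device`, `device_from_id`, `MAX_DEVICES`) are hypothetical, and the code is not taken from mpl_gpu_ze.c.

```c
#include <stddef.h>

#define MAX_DEVICES 16                     /* arbitrary capacity for the sketch */

typedef void *native_device_t;             /* hypothetical opaque runtime handle */

static native_device_t device_table[MAX_DEVICES];
static int device_count = 0;

/* Assign an integer id to a native handle.
 * Returns the existing id if the handle is already known,
 * or -1 if the handle is NULL or the table is full. */
int register_device(native_device_t dev)
{
    if (dev == NULL)
        return -1;
    for (int i = 0; i < device_count; i++) {
        if (device_table[i] == dev)
            return i;
    }
    if (device_count >= MAX_DEVICES)
        return -1;
    device_table[device_count] = dev;
    return device_count++;
}

/* Look up the native handle for an integer id; NULL if the id is unknown. */
native_device_t device_from_id(int dev_id)
{
    if (dev_id < 0 || dev_id >= device_count)
        return NULL;
    return device_table[dev_id];
}
```

With this shape, code above the GPU layer passes integer ids around and only the runtime-specific layer ever touches the native handle.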


raffenet · 2024-10-09T16:05:10Z

test:mpich/ch4/gpu/ofi
test:mpich/ch4/ofi

raffenet force-pushed the issue-7148 branch from 0faf467 to f8c92b8 Compare October 2, 2024 21:36

hzhou reviewed Oct 9, 2024


hzhou and others added 2 commits October 9, 2024 10:58

ch4/ofi: Convert CUDA device id to handle for fi_mr_regattr

e4cef78

Libfabric docs say that the value of the cuda field in the regattr struct is the device handle gotten from cuDeviceGet, not the ordinal. Fixes pmodels#7148.

raffenet force-pushed the issue-7148 branch from f8c92b8 to e4cef78 Compare October 9, 2024 16:03
