
ch4/ofi: Convert CUDA device id to handle for fi_mr_regattr #7156

Draft · wants to merge 2 commits into main
Conversation

raffenet (Contributor) commented Oct 2, 2024

Pull Request Description

Libfabric docs say that the value of the cuda field in the regattr
struct is the device handle obtained from cuDeviceGet, not the
ordinal. Fixes #7148.

This PR cherry-picks [842f4cf] from #7049 in order to avoid confusion between device ids and handles.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

raffenet (Contributor, Author) commented Oct 2, 2024

test:mpich/ch4/gpu

@@ -22,4 +23,6 @@ typedef volatile int MPL_gpu_event_t;

#define MPL_GPU_DEV_AFFINITY_ENV "CUDA_VISIBLE_DEVICES"

#define MPL_gpu_device_id_to_handle(...) cuDeviceGet(__VA_ARGS__)
Contributor

Rather than an MPL_gpu_ wrapper, maybe we can call cuDeviceGet directly? The usage (in fi_mr_regattr) is always inside #ifdef MPL_HAVE_CUDA anyway, which makes it OK to use CUDA-specific calls. If we still want the MPL abstraction, I think we can use an MPL_cuda_ prefix for CUDA-specific functions.

Contributor Author

Makes sense. This version just calls cuDeviceGet in ofi_impl.h.
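
For reference, the conversion at the fi_mr_regattr call site plausibly looks like the following sketch. This is illustrative, not the actual MPICH code: the function name, error handling, and access flags are assumptions; the one point taken from the discussion is that cuDeviceGet converts the integer ordinal into the CUdevice handle that libfabric expects in the device.cuda field.

```c
/* Hedged sketch (not the actual ofi_impl.h code): register CUDA
 * device memory with libfabric, converting the device ordinal to
 * the handle fi_mr_regattr expects. Requires CUDA and libfabric. */
#include <cuda.h>
#include <rdma/fi_domain.h>

static int register_cuda_memory(struct fid_domain *domain, void *buf,
                                size_t len, int device_ordinal,
                                struct fid_mr **mr_out)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov = &iov,
        .iov_count = 1,
        .access = FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE,
        .iface = FI_HMEM_CUDA,
    };

    /* Per the libfabric fi_mr docs, device.cuda must be the CUdevice
     * handle from cuDeviceGet, not the raw ordinal. */
    CUdevice dev;
    if (cuDeviceGet(&dev, device_ordinal) != CUDA_SUCCESS)
        return -1;
    attr.device.cuda = dev;

    return fi_mr_regattr(domain, &attr, 0, mr_out);
}
```

On current CUDA the handle and the ordinal usually have the same integer value, which is why the original code appeared to work; the conversion matters for correctness against the documented contract.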

hzhou and others added 2 commits October 9, 2024 10:58
The abstraction of device id as an integer is a good abstraction above
MPL. The opaqueness of MPL_gpu_device_handle_t, on the other hand, makes
it useless, since the upper layer can't do anything with it.

For ZE, simply expose the internal device id. For new GPU runtimes that
do not natively use integer device ids, we can similarly maintain a map as in
mpl_gpu_ze.c.
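
The id-to-handle map mentioned above can be sketched generically. Everything here is illustrative and not taken from mpl_gpu_ze.c: the gpu_handle_t type, the fixed-size table, and the function names are assumptions; the idea is simply that the runtime's opaque handles are recorded at init time and the table index becomes the integer device id seen above MPL.

```c
/* Illustrative sketch of an integer-id to opaque-handle map, in the
 * spirit of mpl_gpu_ze.c. Types and limits are made up for the example. */
#include <assert.h>
#include <stddef.h>

typedef void *gpu_handle_t;     /* stand-in for an opaque runtime handle */
#define MAX_GPU_DEVICES 16

static gpu_handle_t device_table[MAX_GPU_DEVICES];
static int device_count = 0;

/* At init time, enumerate devices and record their handles; the
 * index in the table becomes the device id used by upper layers. */
static int gpu_register_device(gpu_handle_t h)
{
    if (device_count >= MAX_GPU_DEVICES)
        return -1;
    device_table[device_count] = h;
    return device_count++;
}

static gpu_handle_t gpu_id_to_handle(int id)
{
    return (id >= 0 && id < device_count) ? device_table[id] : NULL;
}
```

With this scheme the upper layer passes plain integers around, and only the MPL backend ever touches the runtime-specific handle type.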
Libfabric docs say that the value of the cuda field in the regattr
struct is the device handle obtained from cuDeviceGet, not the
ordinal. Fixes pmodels#7148.
raffenet (Contributor, Author) commented Oct 9, 2024

test:mpich/ch4/gpu/ofi
test:mpich/ch4/ofi

Successfully merging this pull request may close these issues.

OFI: memory registration of cuda memory