@ryanhankins ryanhankins commented Nov 23, 2025

This PR fixes segfaults inside fi_write() during internode PUT operations over the libfabric transport. For small operations the local handle is NULL, and the code accessed it anyway, which led to an invalid memory descriptor being used.

To reproduce, run a multinode test with NVSHMEM_REMOTE_TRANSPORT=libfabric, NVSHMEM_BOOTSTRAP=mpi, FI_PROVIDER=cxi, and NVSHMEM_HEAP_KIND=sysmem, for example a test program that performs intranode PUT operations followed by internode PUT operations across multiple nodes.

The fix sets the local memory descriptor only when a valid local handle exists, which prevents the NULL pointer dereference.
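
A minimal sketch of the guard described above, in C. The types `mem_handle` and `write_args` and the function `prepare_write` are hypothetical stand-ins, not the NVSHMEM transport code or the libfabric API. The only point is that the descriptor is read through the handle when, and only when, the handle is non-NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered-memory handle (assumption). */
struct mem_handle {
    void *desc; /* plays the role of a local memory descriptor */
};

/* Hypothetical stand-in for the arguments handed to a write call. */
struct write_args {
    const void *buf;
    size_t len;
    void *local_desc; /* NULL means "no local descriptor" */
};

/* Fill in the write arguments. The descriptor is taken from the handle
 * only if a handle was supplied; otherwise it stays NULL. */
static void prepare_write(struct write_args *args, const void *buf, size_t len,
                          const struct mem_handle *local_handle)
{
    assert(args != NULL);
    args->buf = buf;
    args->len = len;
    args->local_desc = NULL;
    if (local_handle != NULL) {
        args->local_desc = local_handle->desc;
    }
}
```

With a NULL handle, `local_desc` stays NULL and the handle is never dereferenced. With a valid handle, the descriptor is copied through as before.
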
@a-szegel
Contributor

Good catch, Ryan! It looks like local_handle is only used for EFA, so it might be cleaner to initialize local_handle=NULL and local_mr=NULL at the top of the function and assign them inside the if statement on line 635.
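
A rough sketch of that shape, in C. Everything here is hypothetical: the `provider` enum, the `mem_handle` type, and `select_local_mr` are illustration-only names, and the sketch assumes the if statement in question is a provider check for EFA, which the thread does not confirm.

```c
#include <stddef.h>

/* Hypothetical stand-in for a registered-memory handle (assumption). */
struct mem_handle {
    void *mr; /* plays the role of the local memory region */
};

/* Hypothetical provider tag; the real code may test this differently. */
enum provider { PROVIDER_EFA, PROVIDER_OTHER };

/* Both locals start as NULL at the top of the function and are assigned
 * only inside the provider-specific branch, so every other path sees NULL. */
static void *select_local_mr(enum provider prov, struct mem_handle *registered)
{
    struct mem_handle *local_handle = NULL;
    void *local_mr = NULL;

    if (prov == PROVIDER_EFA && registered != NULL) {
        local_handle = registered;
        local_mr = local_handle->mr;
    }
    return local_mr;
}
```

Keeping the assignments inside one branch means there is a single place where the handle is dereferenced, and the default for every other provider is explicit.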
