prov/efa: Potential atomicity issue with ofi_copy_to_iov() for NCCL_PROTO=LL128 protocol support #11499

@zhou-yukun

Description

NCCL_PROTO=LL128 requires that received data be written atomically within each 128-byte boundary. I've identified a potential issue with this requirement in the current implementation.

In efa_rdm_pke_copy_payload_to_ope(), when the destination is system memory (that is, neither CUDA nor any other HMEM memory type), the code copies the payload with ofi_copy_to_iov():

    bytes_copied = ofi_copy_to_iov(ope->iov, ope->iov_count,
                                   segment_offset + ep->msg_prefix_size,
                                   pke->payload, pke->payload_size);

ofi_copy_to_iov() ultimately calls memcpy(), which makes no guarantee about the width or order of the stores it issues. A 128-byte block could therefore become visible to a concurrent reader in pieces, which could violate the atomicity requirements of the LL128 protocol.
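To make the concern concrete, here is a minimal sketch of what a memcpy-based iov copy does. `sketch_copy_to_iov` is a hypothetical stand-in I wrote for illustration, not the libfabric source; it only assumes that the real helper walks the iov entries and calls memcpy() on each piece.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical stand-in for an iov copy helper (not the libfabric code).
 * Copies up to bufsize bytes from buf into the iov, starting iov_offset
 * bytes into the iov, and returns the number of bytes copied. */
static uint64_t sketch_copy_to_iov(const struct iovec *iov, size_t iov_count,
                                   uint64_t iov_offset, const void *buf,
                                   uint64_t bufsize)
{
    const char *src = buf;
    uint64_t done = 0;

    for (size_t i = 0; i < iov_count && done < bufsize; i++) {
        size_t len = iov[i].iov_len;

        /* Skip entries that lie entirely before the requested offset. */
        if (iov_offset >= len) {
            iov_offset -= len;
            continue;
        }

        size_t room = len - (size_t)iov_offset;
        size_t n = (bufsize - done < room) ? (size_t)(bufsize - done) : room;

        /* Plain memcpy: the C standard says nothing about store width or
         * ordering here, so a reader polling this memory may observe a
         * partially written 128-byte block. */
        memcpy((char *)iov[i].iov_base + iov_offset, src + done, n);
        done += n;
        iov_offset = 0;
    }
    return done;
}
```

Note that, in this sketch, a 128-byte block of payload can also be split across two iov entries (or across two packets), in which case it is written by separate memcpy() calls.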

Questions:

  1. Is this a genuine concern for LL128 protocol compliance?
  2. Should system memory copies also have atomicity guarantees for 128-byte boundaries?
  3. If this is indeed an issue, what would be the recommended approach to ensure atomicity for system memory copies in LL128 scenarios?
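For question 3, one direction I can imagine is a "flag last" copy. This is only a sketch under assumptions: it assumes each 128-byte LL128 line carries 120 bytes of data followed by an 8-byte flag that the reader polls, that the destination is 8-byte aligned, and that the GCC/Clang `__atomic_store_n` builtin is available. The function name and constants are hypothetical, and I have not verified that this ordering is sufficient for a GPU reader polling host memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LINE_SIZE 128  /* assumed LL128 line size in bytes */
#define SKETCH_DATA_SIZE 120  /* assumption: flag occupies the last 8 bytes */

/* Hypothetical helper: copy whole 128-byte lines so that each line's flag
 * word is stored last, with release ordering, after that line's data.
 * dst must be 8-byte aligned. */
static void sketch_copy_lines_flag_last(void *dst, const void *src,
                                        size_t nlines)
{
    char *d = dst;
    const char *s = src;

    for (size_t i = 0; i < nlines; i++) {
        uint64_t flag;

        /* Data portion first; ordering inside this memcpy is unspecified. */
        memcpy(d, s, SKETCH_DATA_SIZE);

        /* Read the flag word from the source without aliasing issues. */
        memcpy(&flag, s + SKETCH_DATA_SIZE, sizeof flag);

        /* Single 8-byte release store: the data stores above are ordered
         * before the flag becomes visible to an acquiring reader. */
        __atomic_store_n((uint64_t *)(d + SKETCH_DATA_SIZE), flag,
                         __ATOMIC_RELEASE);

        d += SKETCH_LINE_SIZE;
        s += SKETCH_LINE_SIZE;
    }
}
```

This only handles payloads that are line-aligned and a whole number of lines long; a real change would also have to deal with lines split across iov entries or packets.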

Thank you for your attention to this issue.
