prov/efa/src/rdm/efa_rdm_pke.c:152 Assertion `pkt_entry->next == NULL' #11269

@jq

Description

Describe the bug
prov/efa/src/rdm/efa_rdm_pke.c:152: efa_rdm_pke_release_rx: Assertion `pkt_entry->next == NULL' failed.
This assertion failure occurs when sending traffic with multiple tags in a distributed training application. The issue appears when multiple tagged send/receive operations are active concurrently, even when they use different tag values (e.g., tag=0 for one operation, tag=1 for another).
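
For context, the failing check is a debug-build invariant on the provider's packet-entry list: an entry must already be detached from any chain before it is returned to its pool. A minimal sketch of that invariant (hypothetical types and helper, not the provider's actual code):

```c
/* Hypothetical sketch of the invariant behind the assertion; not the
 * provider's actual code. A packet entry must be detached from any
 * chain before it is handed back to its pool. */
#include <assert.h>
#include <stddef.h>

struct pkt_entry {
	struct pkt_entry *next; /* link used while the entry sits in a chain */
	/* payload, length, peer info, etc. elided */
};

static void pkt_entry_release(struct pkt_entry *pkt_entry)
{
	/* The failing check: a non-NULL next means some code path is
	 * releasing the entry while it is still linked to another one. */
	assert(pkt_entry->next == NULL);
	/* ... return the entry to its free pool ... */
}

int main(void)
{
	struct pkt_entry e = { .next = NULL };
	pkt_entry_release(&e); /* passes: the entry is detached */
	return 0;
}
```
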
To Reproduce

  1. Set up a multi-node EFA environment (e.g., 2+ nodes on AWS EC2 with EFA enabled)
  2. Initialize libfabric with EFA provider
  3. Perform concurrent tagged operations:
    - Node 0: Broadcast operation with tag=0
    - Node 0: Broadcast operation with tag=1 (potentially overlapping with first broadcast)
    - Other nodes: Corresponding receive operations with matching tags
  4. The assertion fails during packet entry cleanup, typically on the receiving nodes (a receiver-side sketch of this pattern follows below)

The issue is more likely to occur when:

  • Multiple broadcast operations are initiated in quick succession
  • Operations are performed asynchronously
  • High message rates or larger message sizes are used
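
For reference, here is a receiver-side sketch of this workload using libfabric's tagged API. It illustrates the shape of the traffic, not a verified reproducer: the `CHECK` macro, buffer sizes, and API version are arbitrary choices, and peer address exchange into the AV is assumed to happen out of band.

```c
/* Receiver-side sketch: two tagged receives posted back to back with
 * different tags, mirroring the overlapping broadcasts above.
 * Hedged illustration only; setup values are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

#define CHECK(call)                                               \
	do {                                                      \
		long _ret = (long)(call);                         \
		if (_ret) {                                       \
			fprintf(stderr, "%s failed: %ld\n",       \
				#call, _ret);                     \
			exit(EXIT_FAILURE);                       \
		}                                                 \
	} while (0)

int main(void)
{
	struct fi_info *hints = fi_allocinfo(), *info;
	struct fid_fabric *fabric;
	struct fid_domain *domain;
	struct fid_av *av;
	struct fid_cq *cq;
	struct fid_ep *ep;
	struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_TAGGED };
	struct fi_av_attr av_attr = { .type = FI_AV_TABLE };
	struct fi_cq_tagged_entry comp;
	static char buf0[1 << 20], buf1[1 << 20];
	int done = 0;

	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_TAGGED;
	hints->fabric_attr->prov_name = strdup("efa");

	CHECK(fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info));
	CHECK(fi_fabric(info->fabric_attr, &fabric, NULL));
	CHECK(fi_domain(fabric, info, &domain, NULL));
	CHECK(fi_endpoint(domain, info, &ep, NULL));
	CHECK(fi_cq_open(domain, &cq_attr, &cq, NULL));
	CHECK(fi_av_open(domain, &av_attr, &av, NULL));
	CHECK(fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV));
	CHECK(fi_ep_bind(ep, &av->fid, 0));
	CHECK(fi_enable(ep));
	/* Peer address exchange (fi_av_insert of the senders' raw
	 * addresses) elided; assumed to happen out of band. */

	/* Two concurrent tagged receives with different tag values. */
	CHECK(fi_trecv(ep, buf0, sizeof buf0, NULL, FI_ADDR_UNSPEC,
		       0 /* tag */, 0 /* ignore */, NULL));
	CHECK(fi_trecv(ep, buf1, sizeof buf1, NULL, FI_ADDR_UNSPEC,
		       1 /* tag */, 0 /* ignore */, NULL));

	/* Progress the endpoint. In the failing runs the assertion fires
	 * inside the provider's cleanup path, not in application code. */
	while (done < 2) {
		ssize_t ret = fi_cq_read(cq, &comp, 1);
		if (ret == 1)
			done++;
		else if (ret != -FI_EAGAIN)
			exit(EXIT_FAILURE);
	}

	fi_freeinfo(info);
	fi_freeinfo(hints);
	return 0;
}
```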

Expected behavior
Multiple tagged operations should execute concurrently without assertion failures. The EFA provider should manage packet entries correctly across concurrent tagged operations, either by keeping per-operation packet lists or by safely synchronizing any shared resources.
Output
prov/efa/src/rdm/efa_rdm_pke.c:152: efa_rdm_pke_release_rx: Assertion `pkt_entry->next == NULL' failed.

Environment:

  1. OS Information:
    PRETTY_NAME="Ubuntu 22.04.3 LTS"
    NAME="Ubuntu"
    VERSION_ID="22.04"
    VERSION="22.04.3 LTS (Jammy Jellyfish)"
    VERSION_CODENAME=jammy

  2. Libfabric Version:
    libfabric: 2.1.0amzn4.0
    libfabric api: 2.1

  3. EFA Packages Installed:
    EFA installer version: 1.43.0
    Debug packages installed: yes
    Packages installed:
    ibacm_58.amzn0-1_amd64 ibverbs-providers_58.amzn0-1_amd64 ibverbs-utils_58.amzn0-1_amd64 infiniband-diags_58.amzn0-1_amd64 libibmad-dev_58.amzn0-1_amd64 libibmad5_58.amzn0-1_amd64 libibnetdisc-dev_58.amzn0-1_amd64 libibnetdisc5_58.amzn0-1_amd64 libibumad-dev_58.amzn0-1_amd64 libibumad3_58.amzn0-1_amd64 libibverbs-dev_58.amzn0-1_amd64 libibverbs1_58.amzn0-1_amd64 librdmacm-dev_58.amzn0-1_amd64 librdmacm1_58.amzn0-1_amd64 rdma-core_58.amzn0-1_amd64 rdmacm-utils_58.amzn0-1_amd64 efa-profile_1.7_all libfabric-aws-bin_2.1.0amzn4.0_amd64 libfabric-aws-dev_2.1.0amzn4.0_amd64 libfabric1-aws_2.1.0amzn4.0_amd64 libnccl-ofi_1.16.1-1_amd64 libpmix-aws_4.2.8_amd64 openmpi40-aws_4.1.7-1_amd64 openmpi50-aws_5.0.6_amd64 prrte-aws_3.0.6_amd64 ibacm-dbgsym_58.amzn0-1_amd64.ddeb ibverbs-providers-dbgsym_58.amzn0-1_amd64.ddeb ibverbs-utils-dbgsym_58.amzn0-1_amd64.ddeb infiniband-diags-dbgsym_58.amzn0-1_amd64.ddeb libibmad5-dbgsym_58.amzn0-1_amd64.ddeb libibnetdisc5-dbgsym_58.amzn0-1_amd64.ddeb libibumad3-dbgsym_58.amzn0-1_amd64.ddeb libibverbs1-dbgsym_58.amzn0-1_amd64.ddeb librdmacm1-dbgsym_58.amzn0-1_amd64.ddeb rdma-core-dbgsym_58.amzn0-1_amd64.ddeb rdmacm-utils-dbgsym_58.amzn0-1_amd64.ddeb libfabric1-aws-dbg_2.1.0amzn4.0_amd64 libnccl-ofi-dbgsym_1.16.1-1_amd64.ddeb

  4. EFA Provider Info:
    provider: efa
    fabric: efa-direct
    domain: rdmap223s0-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
    provider: efa
    fabric: efa
    domain: rdmap223s0-rdm
    version: 201.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
    provider: efa
    fabric: efa
    domain: rdmap223s0-dgrm
    version: 201.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA

  5. EFA Devices:
    hca_id: rdmap223s0
    fw_ver: 0.0.0.0
    vendor_id: 0x1d0f

Additional context

  • The issue appears to be related to packet entry linked list management in the EFA RDM implementation
  • Using synchronous operations or adding delays between operations can sometimes avoid the issue
  • The assertion suggests that during packet entry release, the entry is still linked to another packet when it should already be detached; this may indicate a race condition in packet entry pool management or improper cleanup of completed operations (illustrated in the sketch after this list)
  • The issue is observed in distributed ML training workloads where multiple collective operations (broadcasts, all-reduces) may overlap
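
A self-contained sketch of that suspected failure mode (hypothetical code, not the provider's implementation) shows how releasing chained entries without first clearing their links trips exactly this assertion:

```c
/* Hypothetical failure-mode sketch. Releasing chained entries without
 * clearing their links reproduces exactly the asserted condition. */
#include <assert.h>
#include <stddef.h>

struct pkt_entry { struct pkt_entry *next; };

/* Same debug invariant as in the earlier sketch. */
static void pkt_entry_release(struct pkt_entry *e)
{
	assert(e->next == NULL);
	/* ... return e to the free pool ... */
}

static void release_chain_buggy(struct pkt_entry *head)
{
	while (head) {
		struct pkt_entry *nxt = head->next;
		/* BUG in this sketch: head->next still points at nxt, so
		 * the debug check fires. Correct code clears head->next
		 * first, and needs synchronization if two completion paths
		 * can touch the same chain, which is the suspected race
		 * when tagged operations overlap. */
		pkt_entry_release(head);
		head = nxt;
	}
}

int main(void)
{
	struct pkt_entry b = { NULL };
	struct pkt_entry a = { &b }; /* two-entry chain: a -> b */

	release_chain_buggy(&a); /* aborts on the first entry */
	return 0;
}
```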

Potential areas to investigate:

  1. Thread safety of packet entry pool management with multiple concurrent tagged operations
  2. Tag matching logic and its interaction with packet entry lifecycle
  3. Race conditions in efa_rdm_pke_release_rx when multiple completions are processed
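
For items 1 and 3 in particular, one pattern worth auditing for is a detach-then-release discipline, sketched below under the assumption of a hypothetical `ep_lock` standing in for whatever lock actually guards the provider's packet chains:

```c
/* Sketch of a detach-then-release discipline; ep_lock is a hypothetical
 * stand-in for whatever lock guards the provider's packet chains. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct pkt_entry { struct pkt_entry *next; };

static pthread_mutex_t ep_lock = PTHREAD_MUTEX_INITIALIZER;

static void pkt_entry_detach_and_release(struct pkt_entry **prev_link,
					 struct pkt_entry *e)
{
	/* Unlink and clear the stale pointer inside the same critical
	 * section that protects chain manipulation... */
	pthread_mutex_lock(&ep_lock);
	*prev_link = e->next;
	e->next = NULL;
	pthread_mutex_unlock(&ep_lock);

	/* ...so the asserted invariant holds by construction. */
	assert(e->next == NULL);
	/* ... return e to the free pool ... */
}

int main(void)
{
	struct pkt_entry b = { NULL };
	struct pkt_entry a = { &b };
	struct pkt_entry *head = &a; /* chain: head -> a -> b */

	pkt_entry_detach_and_release(&head, &a);
	assert(head == &b && a.next == NULL);
	return 0;
}
```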
