Description
Describe the bug
prov/efa/src/rdm/efa_rdm_pke.c:152: efa_rdm_pke_release_rx: Assertion `pkt_entry->next == NULL' failed.
This assertion failure occurs when sending traffic with multiple tags in a distributed training application. The
issue appears when multiple tagged send/receive operations are active concurrently, even when using
different tag values (e.g., tag=0 for one operation, tag=1 for another).
To Reproduce
- Set up a multi-node EFA environment (e.g., 2+ nodes on AWS EC2 with EFA enabled)
- Initialize libfabric with EFA provider
- Perform concurrent tagged operations:
  - Node 0: Broadcast operation with tag=0
  - Node 0: Broadcast operation with tag=1 (potentially overlapping with first broadcast)
  - Other nodes: Corresponding receive operations with matching tags
- The assertion fails during packet entry cleanup, typically on the receiving nodes
The issue is more likely to occur when:
- Multiple broadcast operations are initiated in quick succession
- Operations are performed asynchronously
- High message rates or larger message sizes are used
Expected behavior
Multiple tagged operations should be able to execute concurrently without assertion failures. The EFA
provider should properly manage packet entries for different tagged operations, maintaining separate packet
lists or properly handling shared resources.
Output
prov/efa/src/rdm/efa_rdm_pke.c:152:
efa_rdm_pke_release_rx: Assertion `pkt_entry->next == NULL' failed.
Environment:
- OS Information:
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
- Libfabric Version:
libfabric: 2.1.0amzn4.0
libfabric api: 2.1
- EFA Packages Installed:
EFA installer version: 1.43.0
Debug packages installed: yes
Packages installed:
ibacm_58.amzn0-1_amd64 ibverbs-providers_58.amzn0-1_amd64 ibverbs-utils_58.amzn0-1_amd64 infiniband-diags_58.amzn0-1_amd64 libibmad-dev_58.amzn0-1_amd64 libibmad5_58.amzn0-1_amd64 libibnetdisc-dev_58.amzn0-1_amd64 libibnetdisc5_58.amzn0-1_amd64 libibumad-dev_58.amzn0-1_amd64 libibumad3_58.amzn0-1_amd64 libibverbs-dev_58.amzn0-1_amd64 libibverbs1_58.amzn0-1_amd64 librdmacm-dev_58.amzn0-1_amd64 librdmacm1_58.amzn0-1_amd64 rdma-core_58.amzn0-1_amd64 rdmacm-utils_58.amzn0-1_amd64 efa-profile_1.7_all libfabric-aws-bin_2.1.0amzn4.0_amd64 libfabric-aws-dev_2.1.0amzn4.0_amd64 libfabric1-aws_2.1.0amzn4.0_amd64 libnccl-ofi_1.16.1-1_amd64 libpmix-aws_4.2.8_amd64 openmpi40-aws_4.1.7-1_amd64 openmpi50-aws_5.0.6_amd64 prrte-aws_3.0.6_amd64 ibacm-dbgsym_58.amzn0-1_amd64.ddeb ibverbs-providers-dbgsym_58.amzn0-1_amd64.ddeb ibverbs-utils-dbgsym_58.amzn0-1_amd64.ddeb infiniband-diags-dbgsym_58.amzn0-1_amd64.ddeb libibmad5-dbgsym_58.amzn0-1_amd64.ddeb libibnetdisc5-dbgsym_58.amzn0-1_amd64.ddeb libibumad3-dbgsym_58.amzn0-1_amd64.ddeb libibverbs1-dbgsym_58.amzn0-1_amd64.ddeb librdmacm1-dbgsym_58.amzn0-1_amd64.ddeb rdma-core-dbgsym_58.amzn0-1_amd64.ddeb rdmacm-utils-dbgsym_58.amzn0-1_amd64.ddeb libfabric1-aws-dbg_2.1.0amzn4.0_amd64 libnccl-ofi-dbgsym_1.16.1-1_amd64.ddeb
- EFA Provider Info:
provider: efa
fabric: efa-direct
domain: rdmap223s0-rdm
version: 201.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: efa
domain: rdmap223s0-rdm
version: 201.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: efa
domain: rdmap223s0-dgrm
version: 201.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
- EFA Devices:
hca_id: rdmap223s0
fw_ver: 0.0.0.0
vendor_id: 0x1d0f
Additional context
- The issue appears to be related to packet entry linked list management in the EFA RDM implementation
- Using synchronous operations or adding delays between operations can sometimes avoid the issue
- The assertion suggests that during packet entry release, the entry is still linked to another packet when it should be isolated
- This may indicate a race condition in packet entry pool management or improper cleanup of completed operations
- The issue is observed in distributed ML training workloads where multiple collective operations (broadcasts, all-reduces) may overlap
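To make the invariant behind the assertion concrete, here is a toy model in C. It is not the libfabric code: `toy_pke`, `toy_pke_release`, and `toy_pke_release_chain` are hypothetical names, and only the `next` link is modeled. The sketch assumes the release path expects each entry to be detached from any chain before it is released, which is what the failing check `pkt_entry->next == NULL` suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet entry; only the chain link is modeled. */
struct toy_pke {
    struct toy_pke *next;
    int released;
};

/*
 * Models the shape of the failing check. Returns 0 on success and -1 in the
 * case where the real code would trip assert(pkt_entry->next == NULL).
 */
int toy_pke_release(struct toy_pke *pke)
{
    if (pke->next != NULL)
        return -1;          /* entry is still linked to another packet */
    pke->released = 1;
    return 0;
}

/*
 * Releases a whole chain by detaching each entry before releasing it.
 * Returns the number of entries released, or -1 on failure.
 */
int toy_pke_release_chain(struct toy_pke *head)
{
    int count = 0;

    while (head != NULL) {
        struct toy_pke *next = head->next;  /* remember the rest of the chain */

        head->next = NULL;                  /* isolate the entry first */
        if (toy_pke_release(head) != 0)
            return -1;
        count++;
        head = next;
    }
    return count;
}
```

In this model, releasing the head of a two-entry chain directly fails, while walking the chain and clearing `next` before each release succeeds. If the real failure follows the same pattern, a backtrace at the assertion should show which caller passes a still-linked entry to `efa_rdm_pke_release_rx`.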
Potential areas to investigate:
- Thread safety of packet entry pool management with multiple concurrent tagged operations
- Tag matching logic and its interaction with packet entry lifecycle
- Race conditions in efa_rdm_pke_release_rx when multiple completions are processed