prov/lnx: shm+cxi:lnx only shm is working #11392

@andreaa93

Describe the bug
Going from libfabric 2.0 to 2.2.0, we observed that with OMPI_MCA_opal_common_ofi_provider_include set to shm+cxi:lnx, only shm appears to work, and performance between different nodes is degraded.

To Reproduce
Our Open MPI module contains the following:

setenv          FI_CXI_RX_MATCH_MODE software
setenv          FI_SHM_USE_XPMEM 1
setenv          FI_LNX_PROV_LINKS shm+cxi
setenv          PALS_PMI pmix
setenv          PALS_CPU_BIND none
setenv          OMPI_MCA_mtl ofi
setenv          OMPI_MCA_pml ^ucx
setenv          OMPI_MCA_opal_common_ofi_provider_include shm+cxi:lnx
setenv          PRTE_MCA_ras_base_launch_orted_on_hn 1

mpirun -np 2 opt/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw --validation
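
For anyone reproducing this without our module, the settings above could be expressed in a POSIX shell roughly as follows. This is only a sketch: it is a direct translation of the setenv lines (values copied verbatim), and the final `env | grep` check is an addition of mine to confirm what the job will inherit, not part of our module.

```shell
# Bourne-shell equivalent of the modulefile settings (values copied verbatim).
export FI_CXI_RX_MATCH_MODE=software
export FI_SHM_USE_XPMEM=1
export FI_LNX_PROV_LINKS=shm+cxi
export PALS_PMI=pmix
export PALS_CPU_BIND=none
export OMPI_MCA_mtl=ofi
# Quoted so the shell never treats the leading ^ specially.
export OMPI_MCA_pml='^ucx'
export OMPI_MCA_opal_common_ofi_provider_include=shm+cxi:lnx
export PRTE_MCA_ras_base_launch_orted_on_hn=1

# Addition for verification only: list the fabric-related variables the job will see.
env | grep -E '^(FI_|OMPI_MCA_|PALS_|PRTE_MCA_)' | sort
```

After sourcing this, the mpirun command above can be run unchanged.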

Expected behavior
Bandwidth between two nodes should be comparable to what we observe with libfabric 2.0 (see the output below).

Output

libfabric 2.2.0 performance

# OSU MPI Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
2097152              4435.40              Pass

libfabric 2.0 performance

# OSU MPI Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)        Validation
2097152             22587.12              Pass
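
To put a number on the regression, the ratio of the two 2 MiB results above can be computed directly. The helper name `slowdown` is hypothetical and only for illustration; the two inputs are the bandwidth figures from the outputs above.

```shell
# Hypothetical helper: ratio of a baseline bandwidth to a regressed one (both in MB/s).
slowdown() {
    # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
    LC_ALL=C awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", base / cur }'
}

# libfabric 2.0 result divided by libfabric 2.2.0 result, both at 2097152 bytes.
slowdown 22587.12 4435.40
```

This prints 5.09, i.e. libfabric 2.2.0 delivers roughly one fifth of the libfabric 2.0 bandwidth at this message size.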

Environment:
Linux node 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Additional context
The NIC is Slingshot 11 on an HPE cluster.
