Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.8
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from conda-forge
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A
Please describe the system on which you are running
- Operating system/version: Ubuntu 22.04
- Computer hardware: Single node with AMD EPYC 7742 CPU and 8xA30 Nvidia GPUs
- Network type: NA
Details of the problem
When using OMPI_MCA_pml=ob1 for my multi-GPU application, I noticed that MPI communication is staged through CPU buffers instead of using peer-to-peer Direct Memory Access (P2P DMA) between GPUs. Upon investigation, I found that the culprit is a failed initialization of the smcuda BTL:
select: initializing btl component smcuda
CUDA: cuCtxGetCurrent returned NULL context
select: init of component smcuda returned failure
mca: base: close: component smcuda closed
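For context, the check that fails here is the driver call cuCtxGetCurrent, which finds no current context because the process has not touched CUDA yet when MPI_Init runs. A minimal illustration of what that call sees in a fresh Python process (a sketch using ctypes against the driver library; it only mirrors the driver call named in the log and assumes libcuda.so.1 is on the loader path):
import ctypes

# Load the NVIDIA driver library directly (assumption: libcuda.so.1 is findable).
libcuda = ctypes.CDLL("libcuda.so.1")
libcuda.cuInit(0)

ctx = ctypes.c_void_p()
libcuda.cuCtxGetCurrent(ctypes.byref(ctx))
print("current CUDA context:", ctx.value)   # prints None: no context has been created yet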
Seeing the NULL context, I suspected that issuing a dummy CUDA call to create a CUDA context before any MPI call would fix this. Indeed, it resolves the issue and I get the expected bandwidth!
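For reference, this is the shape of the workaround as a standalone script. The file name, the per-rank device pinning, and the final send/recv check are illustrative additions, not part of the benchmark; the essential point is that the CUDA context is created before mpi4py triggers MPI_Init:
# touch_cuda_then_mpi.py (hypothetical file name); run with: mpiexec -n 2 python touch_cuda_then_mpi.py
import os
import cupy

# Optional: pin each rank to its own GPU before the context is created.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpiexec.
cupy.cuda.Device(int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))).use()

cupy.zeros(10)           # dummy allocation: forces CUDA context creation before MPI_Init

from mpi4py import MPI   # importing MPI calls MPI_Init; smcuda now finds a current context

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Simple GPU-to-GPU transfer to exercise the fast path end to end.
buf = cupy.arange(1 << 20, dtype=cupy.float32)
if rank == 0:
    comm.Send(buf, dest=1, tag=0)
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    print("rank 1 received", buf.size, "floats")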
Steps to reproduce
- Set up a fresh conda environment and run the mpi4py pingpong benchmark with CuPy buffers using ob1:
conda create -n smcuda_issue -y python=3.12 mpi4py openmpi cupy-core cuda-cudart cuda-version=12
conda activate smcuda_issue
export OMPI_MCA_opal_cuda_support=true
export OMPI_MCA_pml=ob1
mpiexec -n 2 python -m mpi4py.bench pingpong -a cupy -m 134217728
Output 1 (your absolute bandwidths will likely differ):
# Size [B] Bandwidth [MB/s] | Time Mean [s] ± StdDev [s] Samples
134217728 2704.35 | 4.9630280e-02 ± 1.4886e-03 10
268435456 2759.26 | 9.7285419e-02 ± 1.6067e-04 10
536870912 2761.97 | 1.9437956e-01 ± 1.1675e-04 10
1073741824 2765.35 | 3.8828477e-01 ± 7.7874e-04 10
- Add the line
import cupy; _ = cupy.zeros(10)
to the top of the file $CONDA_PREFIX/lib/python3.12/site-packages/mpi4py/bench.py
and rerun the benchmark.
Output 2 (if your GPUs support P2P DMA, your bandwidths will be noticeably higher than in Output 1; a quick way to check P2P support follows the table below):
# MPI PingPong Test
# Size [B] Bandwidth [MB/s] | Time Mean [s] ± StdDev [s] Samples
134217728 260967.35 | 5.1430850e-04 ± 3.2618e-05 10
268435456 325483.86 | 8.2472740e-04 ± 2.1475e-06 10
536870912 365462.14 | 1.4690192e-03 ± 1.5596e-06 10
1073741824 391377.22 | 2.7434960e-03 ± 1.0634e-05 10
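In case it helps others reproduce this, here is a small helper (hypothetical script, not part of the benchmark) that reports P2P reachability between all GPU pairs on the node via CuPy's CUDA runtime bindings:
# check_p2p.py (illustrative): report P2P reachability between all GPU pairs.
from cupy.cuda import runtime

ndev = runtime.getDeviceCount()
for i in range(ndev):
    for j in range(ndev):
        if i == j:
            continue
        ok = runtime.deviceCanAccessPeer(i, j)
        print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'not available'}")
nvidia-smi topo -m shows the same information from the system side.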