
smcuda initialization fails when MPI is initialized before CUDA #13354

@ghanem-nv

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.8

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed using conda forge

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

NA

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04
  • Computer hardware: Single node with AMD EPYC 7742 CPU and 8xA30 Nvidia GPUs
  • Network type: NA

Details of the problem

When using OMPI_MCA_pml=ob1 for my multi-GPU application, I noticed that MPI communication is staged through CPU buffers instead of using peer-to-peer direct memory access (P2P DMA). Upon investigation, I found that the culprit is a failed initialization of smcuda:

select: initializing btl component smcuda
CUDA: cuCtxGetCurrent returned NULL context
select: init of component smcuda returned failure
mca: base: close: component smcuda closed

Seeing the NULL context, I suspected that making a dummy CUDA call to initialize CUDA before any MPI call would fix this. Indeed, it resolves the issue and I get the expected bandwidth.
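
For reference, a minimal sketch of the workaround (assuming mpi4py and CuPy are installed and the Open MPI build is CUDA-aware): the only essential point is that a CUDA context exists before MPI_Init runs, so smcuda can find it during BTL selection. Buffer names and sizes below are illustrative.

# Touch CUDA *before* MPI is initialized so smcuda finds a live context.
import cupy
_ = cupy.zeros(10)            # forces CUDA context creation on this rank's GPU

from mpi4py import MPI        # importing mpi4py.MPI calls MPI_Init by default
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Exchange a device buffer between two ranks; with smcuda active this should
# use P2P DMA instead of being staged through host memory.
buf = cupy.arange(1 << 20, dtype=cupy.float32)
if rank == 0:
    comm.Send(buf, dest=1, tag=0)
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)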

Steps to reproduce

  • Set up a fresh conda environment and run the mpi4py pingpong benchmark with CuPy arrays using ob1:
conda create -n smcuda_issue -y python=3.12 mpi4py openmpi cupy-core cuda-cudart cuda-version=12 
conda activate smcuda_issue
export OMPI_MCA_opal_cuda_support=true
export OMPI_MCA_pml=ob1
mpiexec -n 2 python -m mpi4py.bench pingpong -a cupy -m 134217728

Output 1: (your bandwidths will likely differ)

# Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]  Samples
 134217728           2704.35 | 4.9630280e-02 ± 1.4886e-03       10
 268435456           2759.26 | 9.7285419e-02 ± 1.6067e-04       10
 536870912           2761.97 | 1.9437956e-01 ± 1.1675e-04       10
1073741824           2765.35 | 3.8828477e-01 ± 7.7874e-04       10
  • Add the line import cupy; _ = cupy.zeros(10) to the top of the file $CONDA_PREFIX/lib/python3.12/site-packages/mpi4py/bench.py and rerun the benchmark.

Output 2: (your bandwidths will be noticeably higher than Output 1 if your GPUs support P2P DMA; see the check sketched after the output)

# MPI PingPong Test
# Size [B]  Bandwidth [MB/s] | Time Mean [s] ± StdDev [s]  Samples
 134217728         260967.35 | 5.1430850e-04 ± 3.2618e-05       10
 268435456         325483.86 | 8.2472740e-04 ± 2.1475e-06       10
 536870912         365462.14 | 1.4690192e-03 ± 1.5596e-06       10
1073741824         391377.22 | 2.7434960e-03 ± 1.0634e-05       10
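
As a quick sanity check (a sketch, not part of the original report), CuPy's runtime bindings expose the CUDA runtime query that reports whether two GPUs can access each other's memory, which is what P2P DMA requires:

# Check P2P DMA availability between GPU 0 and GPU 1 (illustrative device IDs).
import cupy
can_peer = cupy.cuda.runtime.deviceCanAccessPeer(0, 1)
print("P2P DMA GPU0 -> GPU1:", bool(can_peer))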
