bug with ngc image #11029

@wukong1992

Describe the bug

I'm not sure which component this error comes from. I'm running distributed training with Megatron-DeepSpeed. Launching with DeepSpeed itself works without any issues, but launching the same job with mpirun fails with the error below. A simple mpirun script, however, runs fine, so I can't tell which part is causing the failure.

NGC image: nvcr.io/nvidia/pytorch:25.01-py3
Launched with: mpirun
Error: python: symbol lookup error: /opt/hpcx/ucx/lib/ucx/libuct_cuda.so.0: undefined symbol: cuCtxSetFlags
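
To narrow this down, one option is to check whether the CUDA driver library that the process actually loads exports the symbol named in the error. The snippet below is only a minimal diagnostic sketch, not a fix. The helper name `has_symbol` is hypothetical, and the library name `libcuda.so.1` is an assumption about what the dynamic loader resolves inside the container. If the result is False, the `libcuda` that was found may be a different (possibly older or stub) library than the one UCX was built against, but that is a guess rather than a confirmed cause.

```python
import ctypes


def has_symbol(library, symbol):
    """Check whether a shared library exports a symbol.

    Returns True or False when the library loads, and None when the
    dynamic loader cannot load the library at all.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # Library not found or not loadable in this environment.
        return None
    try:
        # ctypes raises AttributeError when the symbol is not exported.
        getattr(lib, symbol)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    # Assumption: the driver library is resolved under this name.
    result = has_symbol("libcuda.so.1", "cuCtxSetFlags")
    if result is None:
        print("libcuda.so.1 could not be loaded")
    elif result:
        print("libcuda.so.1 exports cuCtxSetFlags")
    else:
        print("libcuda.so.1 does not export cuCtxSetFlags")
```

Running this once directly and once under mpirun (with the same environment the training job uses) may show whether the two launch paths resolve different driver libraries.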

Debug info:

trace.log

Open MPI version:
mpirun (Open MPI) 4.1.7rc1

ucx_info -d

ucxinfo.log
