Describe the bug
I'm not sure which component this error belongs to. I'm running distributed training with Megatron-DeepSpeed using mpirun. Launching with DeepSpeed itself works without any issues, but launching the same job with mpirun fails with the error below. A simple mpirun script works fine, so I can't tell which part is causing the error.
NGC image: nvcr.io/nvidia/pytorch:25.01-py3
run mpi
bug: python: symbol lookup error: /opt/hpcx/ucx/lib/ucx/libuct_cuda.so.0: undefined symbol: cuCtxSetFlags
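A possible way to narrow this down (a minimal sketch, not verified on this setup): `cuCtxSetFlags` is a CUDA driver API function, so it should be exported by the driver library rather than by UCX. The snippet below checks whether the `libcuda.so.1` that the dynamic loader resolves in the current environment exports that symbol. The helper name `has_symbol` is hypothetical, and `libcuda.so.1` is assumed to be the driver library name inside the container. Running it once directly and once under mpirun might show whether the two launch paths resolve different driver libraries, for example because of a different `LD_LIBRARY_PATH`.

```python
import ctypes


def has_symbol(lib_name, symbol):
    """Check whether a shared library exports a symbol.

    Returns True or False if the library loads, and None if the
    dynamic loader cannot load the library at all.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # ctypes resolves the symbol lazily on attribute access and raises
    # AttributeError when it is missing, so hasattr() is enough here.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumption: the driver library is visible as "libcuda.so.1".
    result = has_symbol("libcuda.so.1", "cuCtxSetFlags")
    if result is None:
        print("libcuda.so.1 could not be loaded in this environment")
    elif result:
        print("libcuda.so.1 exports cuCtxSetFlags")
    else:
        print("libcuda.so.1 loads but does not export cuCtxSetFlags")
```

If the symbol is missing only under mpirun, that would suggest the ranks load an older or different `libcuda` than the one `libuct_cuda.so.0` expects, rather than a problem in Megatron-DeepSpeed itself.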
debug info:
[
openmpi version
mpirun (Open MPI) 4.1.7rc1
ucx_info -d