Open
Description
import torch sometimes succeeds and sometimes crashes, seemingly at random, with the following error when run on 128 nodes.
Could you tag this if it is an already known issue?
I used commits up to 92bda82 on the devel branch.
PyTorch 2.5
kaushikvelusamy@node_name:lustre_dir_location/kaushik> mpiexec --spindle=python-prefix=${CONDA_DIR} --np 24 --ppn 12 --genvall --genv=PYTHONPATH=${CONDA_DIR} --cpu-bind=list:4:9:14:19:20:25:56:61:66:71:74:79 ${APP_LOC}
spindle-bin/bin/spindle --location=/tmp/kaushik/spindle-test/spindle-temp-dir mpiexec --genvall --genv TMPDIR=/tmp/kaushik/spindle-test/spindle-temp-dir --genv=PYTHONPATH=lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/lus_pip_torch_2.3_env_128/ --np 1536 --ppn 12 --cpu-bind=list:4:9:14:19:20:25:56:61:66:71:74:79 spindlemarker lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/app_3.sh
Traceback (most recent call last):
File "<string>", line 1, in <module>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/lus_pip_torch_2.3_env_2/torch/__init__.py", line 36, in <module>
from .torch_version import __version__ as __version__
File "lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/lus_pip_torch_2.3_env_2/torch/torch_version.py", line 5, in <module>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/lus_pip_torch_2.3_env_2/torch/__init__.py", line 36, in <module>
from .torch_version import __version__ as __version__
File "lustre_dir_location/kaushik/spindle/spindle_tests_apps/torch-with-spindle/lus_pip_torch_2.3_env_2/torch/torch_version.py", line 5, in <module>
from ._vendor.packaging.version import Version, InvalidVersion
ValueError: source code string cannot contain null bytes
node_name.hostmgmt2611.cm.system_name.: rank 6 exited with code 1
node_name.hostmgmt2611.cm.system_name.: rank 1 died from signal 15
kaushikvelusamy@node_name:lustre_dir_location/kaushik>
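For anyone trying to narrow this down: Python raises "source code string cannot contain null bytes" when the bytes it reads for a source file contain a NUL (0x00) character. A minimal sketch for checking which files are affected is below. It assumes the corruption is visible in the file contents themselves; find_null_byte_files is a hypothetical helper name, not part of Spindle or PyTorch.

```python
import os


def find_null_byte_files(root, suffix=".py"):
    """Return sorted (path, offset) pairs for files under root that contain a NUL byte.

    Hypothetical diagnostic helper: offset is the position of the first NUL byte.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                # Read raw bytes so no decoding step can hide or alter NULs.
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                # Unreadable files are skipped; this sketch only reports NULs.
                continue
            offset = data.find(b"\x00")
            if offset != -1:
                hits.append((path, offset))
    return sorted(hits)
```

Calling find_null_byte_files on the torch package directory on Lustre, and on whatever local copy a failing rank actually reads (if that is accessible), should show whether the original files are clean and only the copy seen by some ranks contains NUL bytes. The sketch makes no assumption about how that local copy is laid out.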