train error #24

Open
@liuxinyiwssy

Fatal Python error: Segmentation fault

Current thread 0x000074aa85dae740 (most recent call first):
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 1166 in create_module
File "<frozen importlib._bootstrap>", line 556 in module_from_spec
File "<frozen importlib._bootstrap>", line 657 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/ortools/graph/pywrapgraph.py", line 13 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/f_score.py", line 40 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/evaluate.py", line 26 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/__init__.py", line 1 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/datasets/openlanev2_subset_A_dataset.py", line 20 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/datasets/__init__.py", line 2 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/__init__.py", line 1 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 843 in exec_module
File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 991 in _find_and_load
File "<frozen importlib._bootstrap>", line 1014 in _gcd_import
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/importlib/__init__.py", line 127 in import_module
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/mmcv/utils/misc.py", line 73 in import_modules_from_strings
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/mmcv/utils/config.py", line 343 in fromfile
File "tools/train.py", line 171 in main
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361 in wrapper
File "tools/train.py", line 316 in <module>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 99244) of binary: /home/bydpc/anaconda3/envs/toponet/bin/python
/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 99244 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 99244
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


          tools/train.py FAILED               

==================================================
Root Cause:
[0]:
time: 2025-03-17_13:06:05
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 99244)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 99244"

Other Failures:
<NO_OTHER_FAILURES>


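I only have one GPU, so I ran ./tools/dist_train.sh with the GPU count set to 1 (./tools/dist_train.sh 1), but the process crashed with the segmentation fault shown above before training started. Can anyone help me fix this?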
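
The innermost frame in the trace is the import of ortools.graph.pywrapgraph (pulled in by openlanev2/evaluation/f_score.py), so the crash appears to happen while loading a compiled extension rather than in the training code itself. One way to narrow this down is to import each module from the trace in a fresh interpreter and look at the exit code. The sketch below is only a diagnostic aid, not a fix: the helper name try_import is hypothetical, and the module names are taken from the traceback on the assumption that the script is run from the TopoNet repository root inside the same conda environment.

```python
import subprocess
import sys


def try_import(module_name):
    """Import module_name in a fresh interpreter and return the child's exit code.

    Running the import in a subprocess keeps a segfault in the child from
    killing this script. Output is discarded so only the exit code is reported.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", f"import {module_name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    # Module names copied from the traceback, innermost first (assumption:
    # the current directory is the TopoNet repository root).
    for name in (
        "ortools.graph.pywrapgraph",
        "openlanev2.evaluation",
        "projects.toponet",
    ):
        code = try_import(name)
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(f"{name}: {status}")
```

On Linux, an exit code of -11 means the child was killed by SIGSEGV, which matches the "exitcode: -11" in the log. If the first module already fails that way, the problem is likely in the installed ortools build or its interaction with the environment, independent of TopoNet and the distributed launcher.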
