Describe the Bug
Error message:
We are training an XPU job over the container network and hit the following error:
Traceback (most recent call last):
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/finetune.py", line 13, in <module>
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/train.py", line 402, in main
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 233, in parse_args_into_dataclasses
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 243, in common_parse
File "<string>", line 108, in __init__
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/training_args.py", line 1223, in __post_init__
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 340, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 727, in _init_hybrid_parallel_env
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/base/topology.py", line 218, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/all_reduce.py", line 89, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 157, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 51, in _all_reduce_in_dygraph
ValueError: (InvalidArgument) TCP send error. Details: Broken pipe.
[Hint: Expected byte_sent > 0, but received byte_sent:-1 <= 0:0.] (at /root/paddlejob/workspace/env_run/Paddle/paddle/phi/core/distributed/store/tcp_utils.h:83)
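In the container network environment, the Ethernet NIC eth0 and the RoCE NICs xgbe2, xgbe3, xgbe4, and xgbe5 all pass the ping and ib_send_bw tests, and training also runs normally in host network mode. Only training under the container network fails.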
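The error is raised from the TCP store code (tcp_utils.h), and ping and ib_send_bw do not necessarily exercise the TCP port that the store uses. One additional check is therefore whether a plain TCP connection to the master endpoint can be opened from inside each container. Below is a minimal sketch of such a check. The helper name `can_connect` and the host and port values are hypothetical placeholders, not taken from this job, and need to be replaced with the real master address and port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical placeholders: substitute the master endpoint of the actual job.
    master_host = "127.0.0.1"
    master_port = 6170
    print(can_connect(master_host, master_port))
```

A successful connect only shows that the port is reachable. It does not rule out the connection being dropped later, which is what "Broken pipe" indicates.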
Additional Supplementary Information
No response