Error during multi-node P800 distributed training in a container network environment #69220

Open
yalbaba opened this issue Nov 7, 2024 · 1 comment

yalbaba commented Nov 7, 2024

Describe the Bug

Error message:
When we ran an XPU training job over the container network, it failed with this error:

Traceback (most recent call last):
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/finetune.py", line 13, in
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/train.py", line 402, in main
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 233, in parse_args_into_dataclasses
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 243, in common_parse
File "", line 108, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/training_args.py", line 1223, in post_init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 340, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 727, in _init_hybrid_parallel_env
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/base/topology.py", line 218, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/all_reduce.py", line 89, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 157, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 51, in _all_reduce_in_dygraph
ValueError: (InvalidArgument) TCP send error. Details: Broken pipe.
[Hint: Expected byte_sent > 0, but received byte_sent:-1 <= 0:0.] (at /root/paddlejob/workspace/env_run/Paddle/paddle/phi/core/distributed/store/tcp_utils.h:83)
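
For context, "Broken pipe" is the operating system's EPIPE error: a send on a connection whose peer has already closed it, which matches the failed send (byte_sent of -1) reported in the hint. The sketch below reproduces that condition with a local socket pair instead of Paddle's TCP store. The socket pair and the function name `send_after_peer_close` are assumptions made purely for illustration; this shows the OS-level behavior on Linux, not Paddle's own code path.

```python
import errno
import socket


def send_after_peer_close() -> int:
    """Return the errno raised when sending to a peer that has closed."""
    # A connected pair of local stream sockets, used here as an illustrative
    # stand-in for a TCP connection between two trainer nodes.
    left, right = socket.socketpair()
    try:
        right.close()  # the peer goes away
        try:
            left.sendall(b"ping")
        except OSError as exc:
            return exc.errno
        return 0  # the send unexpectedly succeeded
    finally:
        left.close()


if __name__ == "__main__":
    code = send_after_peer_close()
    print(code, errno.errorcode.get(code))
```

If the same error shows up during initialization, one plausible reading is that the remote end of the rendezvous connection was closed or reset before the first all_reduce could send data.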

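In the container network environment, the Ethernet NIC eth0 and the RoCE NICs xgbe2, xgbe3, xgbe4, and xgbe5 all pass ping and ib_send_bw tests, and training runs normally in host network mode, but it fails in container network mode.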
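
The ping and ib_send_bw checks may not cover the exact TCP address and port used by the failing store connection (the error originates in tcp_utils.h). A plain TCP round trip between containers can be tested separately. The following is a minimal sketch using only the Python standard library; the function names `start_echo_server` and `echo_roundtrip` are hypothetical helpers, not part of Paddle. Run the server part in one container, bound to the address of the interface under test, and the client part in another.

```python
import socket
import threading


def start_echo_server(bind_host: str, port: int = 0):
    """Listen on bind_host and echo one message back to the first client.

    Returns the listening socket and the port actually bound
    (port=0 lets the OS choose a free port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((bind_host, port))
    server.listen(1)

    def handle_one() -> None:
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    threading.Thread(target=handle_one, daemon=True).start()
    return server, server.getsockname()[1]


def echo_roundtrip(host: str, port: int, payload: bytes = b"hello",
                   timeout: float = 3.0) -> bool:
    """Connect to host:port, send payload, and check that it is echoed back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as client:
            client.sendall(payload)
            return client.recv(1024) == payload
    except OSError:
        return False


if __name__ == "__main__":
    # Local self-check; across nodes, replace 127.0.0.1 with the container IP.
    srv, bound_port = start_echo_server("127.0.0.1")
    print(echo_roundtrip("127.0.0.1", bound_port))
    srv.close()
```

If the round trip fails between containers while it succeeds in host network mode, that would point at the container network layer rather than at the training code.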

Additional Supplementary Information

No response

Contributor

westfish commented Nov 8, 2024

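Your problem is most likely caused by the container network configuration, which is making communication between nodes fail. You may need to adjust the container's network settings so that the required ports and network interfaces are correctly mapped and configured. In particular, confirm that the container network ports are mapped correctly, and check and configure the firewall so that the ports needed for training are allowed through.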
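
One way to carry out the port check suggested above is to try a TCP connection to each trainer endpoint from inside every container. This is a minimal sketch, not an official Paddle tool. It assumes the job exposes its endpoints as a comma-separated `host:port` list in the `PADDLE_TRAINER_ENDPOINTS` environment variable, which may not hold for every launcher. The helper names `parse_endpoints` and `port_is_reachable` are hypothetical.

```python
import os
import socket


def parse_endpoints(spec: str) -> list:
    """Split "host:port,host:port" into a list of (host, port) tuples."""
    endpoints = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: endpoints are published in this variable by the launcher.
    spec = os.environ.get("PADDLE_TRAINER_ENDPOINTS", "")
    for host, port in parse_endpoints(spec):
        status = "reachable" if port_is_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} {status}")
```

A port only accepts connections while a process is listening on it, so the check is meaningful only while the training processes (or a test listener) are running on the target nodes.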
