Open
Description
When training on the packed_data produced by generate_packed_dataset.py, training hangs at metric_logger.synchronize_between_processes() in accessory/engine_pretrain.py, and DDP then terminates with a timeout.
When using *.parquet files, there is no problem.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=495104, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1805715 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
The environment strictly follows the requirement.txt from the documentation.
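A hang in a collective such as the `_ALLGATHER_BASE` above is often seen when ranks run a different number of iterations, so one rank reaches the synchronization call while another has already left the loop. Whether that is what happens here is not confirmed. The following is only a minimal sketch for checking that hypothesis offline; the helper names and the shard sizes are hypothetical and are not taken from the repository.

```python
import math

def steps_per_rank(items_per_rank, batch_size, drop_last=True):
    """Return how many iterations each rank would run over its own shard.

    items_per_rank: hypothetical list with the number of packed items
    that each rank ends up loading (one entry per rank).
    """
    if drop_last:
        return [n // batch_size for n in items_per_rank]
    return [math.ceil(n / batch_size) for n in items_per_rank]

def find_step_mismatch(items_per_rank, batch_size, drop_last=True):
    """Return (steps, mismatched); mismatched is True if ranks disagree."""
    steps = steps_per_rank(items_per_rank, batch_size, drop_last)
    return steps, len(set(steps)) > 1

# Hypothetical shard sizes for two ranks, for illustration only.
steps, mismatched = find_step_mismatch([1000, 996], batch_size=8)
print(steps, mismatched)
```

If the counts differ for the real packed shards, the rank with more iterations would wait in the collective until the NCCL watchdog timeout fires, which would match the log above.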