Description
Here is the command that reproduces the issue:
CUDA_LAUNCH_BLOCKING=1 python -u -m torch.distributed.run --nproc_per_node 4 --nnodes 1 --rdzv_endpoint 127.0.0.1:1234 --rdzv_backend c10d --max_restarts 0 --tee 3 \
pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --distributed-backend nccl \
--num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 \
--micro-batch-size 4 --global-batch-size 32 --lr 0.00015 --train-iters 10 --lr-decay-iters 2 --lr-decay-style cosine \
--min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path ../Megatron-DeepSpeed_DSAI/gpt2_345m_text_document \
--vocab-file ../Megatron-DeepSpeed_DSAI/gpt2-vocab.json --merge-file ../Megatron-DeepSpeed_DSAI/gpt2-merges.txt --data-impl mmap --split 949,50,1 \
--log-interval 1 --save-interval 1 --eval-interval 1 --eval-iters 1 --checkpoint-activations \
--save ./zero3_gpt2_345m --load ./zero3_gpt2_345m --deepspeed --deepspeed_config ds_config_z3.config --zero-stage 3 --deepspeed-activation-checkpointing
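
For reference, ds_config_z3.config is a plain ZeRO stage 3 DeepSpeed config along the lines of the sketch below. This is a minimal reconstruction whose values are assumptions chosen to mirror the command-line flags, not the exact file; it is written as a small Python snippet that emits the JSON config:

import json

# Minimal reconstruction of ds_config_z3.config (assumed contents;
# the actual file may contain additional settings).
ds_config = {
    "train_batch_size": 32,               # mirrors --global-batch-size 32
    "train_micro_batch_size_per_gpu": 4,  # mirrors --micro-batch-size 4
    "gradient_clipping": 1.0,             # mirrors --clip-grad 1.0
    "zero_optimization": {"stage": 3},    # mirrors --zero-stage 3
    "fp16": {"enabled": True},            # mirrors --fp16
}

with open("ds_config_z3.config", "w") as f:
    json.dump(ds_config, f, indent=2)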
It reports the following error:
ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism