Description
Here is the command that reproduces the issue:
CUDA_LAUNCH_BLOCKING=1 python -u -m torch.distributed.run --nproc_per_node 4 --nnodes 1 --rdzv_endpoint 127.0.0.1:1234 --rdzv_backend c10d --max_restarts 0 --tee 3 \
pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --distributed-backend nccl \
--num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 \
--micro-batch-size 4 --global-batch-size 32 --lr 0.00015 --train-iters 10 --lr-decay-iters 2 --lr-decay-style cosine \
--min-lr 1.0e-5 --weight-decay 1e-2 --lr-warmup-fraction .01 --clip-grad 1.0 --fp16 --data-path ../Megatron-DeepSpeed_DSAI/gpt2_345m_text_document \
--vocab-file ../Megatron-DeepSpeed_DSAI/gpt2-vocab.json --merge-file ../Megatron-DeepSpeed_DSAI/gpt2-merges.txt --data-impl mmap --split 949,50,1 \
--log-interval 1 --save-interval 1 --eval-interval 1 --eval-iters 1 --checkpoint-activations \
--save ./zero3_gpt2_345m --load ./zero3_gpt2_345m --deepspeed --deepspeed_config ds_config_z3.config --zero-stage 3 --deepspeed-activation-checkpointing
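
For reference, ds_config_z3.config is a plain ZeRO stage 3 DeepSpeed config along the lines of the sketch below. This is a minimal reconstruction whose values are assumptions chosen to mirror the command-line flags, not the exact file; it is written as a small Python snippet that emits the JSON config:

import json

# Minimal reconstruction of ds_config_z3.config (assumed contents;
# the actual file may contain additional settings).
ds_config = {
    "train_batch_size": 32,               # mirrors --global-batch-size 32
    "train_micro_batch_size_per_gpu": 4,  # mirrors --micro-batch-size 4
    "gradient_clipping": 1.0,             # mirrors --clip-grad 1.0
    "zero_optimization": {"stage": 3},    # mirrors --zero-stage 3
    "fp16": {"enabled": True},            # mirrors --fp16
}

with open("ds_config_z3.config", "w") as f:
    json.dump(ds_config, f, indent=2)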
It reports the following error:
ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism