
BUG: training gpt2 with pp=2 error: list index out of range #406

Open
@9LLPPLL6

Description

This is my launch script:

#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=/mnt/Megatron-DeepSpeed/Models/gpt-2/checkpoint
VOCAB_FILE=/mnt/Megatron-DeepSpeed/Models/gpt-2/data/gpt2-vocab.json
MERGE_FILE=/mnt/Megatron-DeepSpeed/Models/gpt-2/data/gpt2-merges.txt
DATA_PATH=/mnt/Megatron-DeepSpeed/Models/gpt-2/data/meg-gpt2_text_document

PP_SIZE=2

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --train-iters 500000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --pipeline-model-parallel-size $PP_SIZE \
    --tensor-model-parallel-size 1
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-impl mmap \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10
"

PRETRAINPATH=/mnt/Megatron-DeepSpeed

torchrun $DISTRIBUTED_ARGS $PRETRAINPATH/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH
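
For context: with GPUS_PER_NODE=4, PP_SIZE=2, and --tensor-model-parallel-size 1, torchrun launches 4 ranks, which Megatron (as far as I understand its rank layout) splits into a data-parallel size of 4 / (2 * 1) = 2 and two pipeline stages of 24 / 2 = 12 layers each. A quick plain-Python sanity check of that split, using the values from the script above (the offsets list is my reconstruction of how each stage numbers its layers globally):

world_size = 4      # GPUS_PER_NODE * NNODES
pp_size = 2         # --pipeline-model-parallel-size
tp_size = 1         # --tensor-model-parallel-size
num_layers = 24     # --num-layers

dp_size = world_size // (pp_size * tp_size)   # 2
layers_per_stage = num_layers // pp_size      # 12
# Each pipeline stage builds its layers starting at a global offset:
offsets = [stage * layers_per_stage for stage in range(pp_size)]
print(dp_size, layers_per_stage, offsets)     # 2 12 [0, 12]

Only the second stage (offset 12) runs into trouble, which is consistent with the failure below appearing on rank 2.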

The error is:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/Megatron-DeepSpeed/pretrain_gpt.py", line 360, in <module>
[rank2]:     pretrain(train_valid_test_datasets_provider,
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/training.py", line 172, in pretrain
[rank2]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/training.py", line 534, in setup_model_and_optimizer
[rank2]:     model = get_model(model_provider_func, model_type)
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/training.py", line 373, in get_model
[rank2]:     model = model_provider_func(
[rank2]:   File "/mnt/Megatron-DeepSpeed/pretrain_gpt.py", line 83, in model_provider
[rank2]:     model = GPTModel(
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 208, in __init__
[rank2]:     self.language_model, self._language_model_key = get_language_model(
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/model/language_model.py", line 68, in get_language_model
[rank2]:     language_model = TransformerLanguageModel(
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/model/language_model.py", line 442, in __init__
[rank2]:     self.encoder = ParallelTransformer(
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/model/transformer.py", line 1867, in __init__
[rank2]:     experts_per_layer = get_num_experts_per_layer(num_experts, self.num_layers, args.expert_interval, offset)
[rank2]:   File "/mnt/Megatron-DeepSpeed/megatron/model/transformer.py", line 1676, in get_num_experts_per_layer
[rank2]:     n_e = num_experts[(layer_num-1) // expert_interval] if layer_num % expert_interval == 0 else 1
[rank2]: IndexError: list index out of range
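
The crash is in get_num_experts_per_layer in megatron/model/transformer.py. Below is a minimal standalone sketch of what the failing logic appears to boil down to: the indexing line is copied verbatim from the traceback, while the surrounding loop and the [1] default for --num-experts are my reconstruction (the script above never enables MoE), and expert_interval=1 is chosen only for illustration:

def get_num_experts_per_layer(num_experts, num_layers, expert_interval, offset=0):
    # With a single-element num_experts (the assumed non-MoE default), the
    # list is expanded to one entry per *local* layer of this pipeline stage...
    if len(num_experts) == 1:
        num_experts = num_experts * (num_layers // expert_interval)
    experts_per_layer = []
    for layer_num in range(offset + 1, offset + num_layers + 1):
        # ...but it is indexed with the *global* layer number, which runs past
        # the end of the list on every stage after the first (offset > 0).
        n_e = num_experts[(layer_num-1) // expert_interval] if layer_num % expert_interval == 0 else 1
        experts_per_layer.append(n_e)
    return experts_per_layer

get_num_experts_per_layer([1], num_layers=12, expert_interval=1, offset=0)   # first stage: ok
get_num_experts_per_layer([1], num_layers=12, expert_interval=1, offset=12)  # second stage: IndexError

If that reading is right, any run with --pipeline-model-parallel-size > 1 would hit this on the second and later stages, and indexing with the stage-local layer number (something like (layer_num - offset - 1) // expert_interval) would be one candidate fix.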
