Reproducing K400 pre-training: loss stops decreasing #132

@Inscredion

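I have recently been trying to reproduce the K400 ViT-Small pre-training on 2x8 H100 GPUs with a per-GPU batch size of 50. The loss drops to about 0.6 and then stops decreasing.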

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node={len(gpu_ids)}
--master_port={port} --nnodes={num_nodes} --node_rank={index} --master_addr={MASTER_ADDR}
--use-env
run_mae_pretraining.py
--data_path {DATA_PATH}
--mask_type tube
--mask_ratio 0.9
--model pretrain_videomae_small_patch16_224
--decoder_depth 4
--batch_size 50
--num_frames 16
--sampling_rate 4
--opt adamw
--opt_betas 0.9 0.95
--warmup_epochs 40
--lr 1.5e-4
--save_ckpt_freq 50
--epochs 800
--resume {MODEL_PATH}
--log_dir {OUTPUT_DIR}
--output_dir {OUTPUT_DIR}
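
For anyone comparing this run against another setup, the snippet below is a rough sketch of the global batch size implied by the command above (2 nodes, 8 GPUs each, `--batch_size 50` per GPU). The `scaled_lr` helper assumes the training script applies the common linear scaling rule `lr * total_batch / 256`; that rule and the helper names are assumptions for illustration, so it is worth checking how `run_mae_pretraining.py` actually treats `--lr` before relying on the number.

```python
def effective_batch_size(num_nodes: int, gpus_per_node: int, per_gpu_batch: int) -> int:
    """Global batch size seen by the optimizer at each step."""
    return num_nodes * gpus_per_node * per_gpu_batch


def scaled_lr(base_lr: float, total_batch: int, reference_batch: int = 256) -> float:
    """Linear LR scaling rule (ASSUMED here, not confirmed from the script)."""
    return base_lr * total_batch / reference_batch


if __name__ == "__main__":
    total = effective_batch_size(num_nodes=2, gpus_per_node=8, per_gpu_batch=50)
    print("global batch size:", total)  # 800
    print(f"peak lr under the assumed rule: {scaled_lr(1.5e-4, total):.3e}")
```

If the reference run used a different global batch size, the peak learning rate under this rule would differ too, which is one thing to rule out when comparing loss curves.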

Later in training, the loss looks like this:
eta: 0:03:24 lr: 0.000040 min_lr: 0.000040 loss: 0.6245 (0.6292) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.3587 (0.4485) time: 1.0438 data: 0.2533 max mem: 9071
What could be causing this?
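
To make a log line like the one above easier to compare across runs, here is a minimal sketch that pulls the decimal-valued fields out of it. The function name `parse_log_line` is hypothetical, and reading the number in parentheses as a running average is an assumption about the logger's format, not something the log itself states.

```python
import re

# Matches "key: 0.1234" optionally followed by " (0.5678)".
# Only decimal values are captured, so "eta: 0:03:24" and "max mem: 9071" are skipped.
FIELD_RE = re.compile(r"(\w+): (\d+\.\d+)(?: \((\d+\.\d+)\))?")


def parse_log_line(line: str) -> dict:
    """Return {name: (current_value, parenthesized_value_or_None)}."""
    metrics = {}
    for name, current, running in FIELD_RE.findall(line):
        metrics[name] = (float(current), float(running) if running else None)
    return metrics


LOG_LINE = (
    "eta: 0:03:24 lr: 0.000040 min_lr: 0.000040 loss: 0.6245 (0.6292) "
    "loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0500 (0.0500) "
    "grad_norm: 0.3587 (0.4485) time: 1.0438 data: 0.2533 max mem: 9071"
)

if __name__ == "__main__":
    for name, (current, running) in parse_log_line(LOG_LINE).items():
        print(f"{name:>12}: current={current} parenthesized={running}")
```

In this line `lr` equals `min_lr`, which suggests the schedule has already decayed to its floor by this point in the run.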
