Loss of qav and vaq becomes NaN quickly. #23

@chuanwise

Not using distributed mode
[18:17:54.955532] job dir: /home/23031212503/projects/Flipped-VQA
[18:17:54.955618] Namespace(batch_size=1,
epochs=5,
accum_iter=4,
llama_model_path='./pretrained/llama/',
model='7B',
adapter_layer=32,
adapter_len=10,
max_seq_len=650,
max_feats=10,
weight_decay=0.02,
lr=None,
blr=0.07,
min_lr=0.0,
warmup_epochs=2,
dataset='tvqa',
output_dir='./checkpoint/tvqa',
device='cuda',
seed=0,
resume='',
start_epoch=0,
num_workers=2,
pin_mem=True,
world_size=1,
local_rank=-1,
dist_on_itp=False,
dist_url='env://',
vaq=True,
qav=True,
bias=3.0,
tau=100.0,
sub=True,
distributed=False)
[18:18:16.740925] Num train data: 122039
[18:18:24.026051] Num val data: 15253
[18:18:24.039350] Using model: 7B
[18:18:24.041255] loading from pretrained/llama/7B/consolidated.00.pth
[18:19:13.553202] base lr: 7.00e-02
[18:19:13.553243] actual lr: 1.09e-03
[18:19:13.553254] accumulate grad iterations: 4
[18:19:13.553258] effective batch size: 4
[18:19:13.554187] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.00109375
    maximize: False
    weight_decay: 0.02
)
[18:19:13.554305] Start training for 5 epochs
[18:19:17.576096] Epoch: [0]  [     0/122039]  eta: 5 days, 16:15:56  lr: 0.000000  loss: 5.6871 (5.6871)  vqa_loss: 1.4844 (1.4844)  vaq_loss: 1.8125 (1.8125)  qav_loss: 2.3903 (2.3903)  time: 4.0197  data: 0.7782  max mem: 37679
[18:19:23.617162] Loss is nan, stopping training

But according to the printed log line, the loss is not NaN.
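The printed values are from iteration 0 only, and the stop message appears about six seconds later, so the NaN presumably comes from a later iteration that was never printed. To see which term becomes non-finite first, a check along these lines could be run on the per-term losses each step. This is a minimal sketch: `first_non_finite` and the dict layout are hypothetical and not taken from `train.py`, and the `later` values are invented purely for illustration.

```python
import math


def first_non_finite(losses):
    """Return the name of the first loss term that is NaN or inf, else None.

    `losses` maps a term name to a plain Python float
    (e.g. the result of calling .item() on a loss tensor).
    """
    for name, value in losses.items():
        if not math.isfinite(value):
            return name
    return None


# Values printed for iteration 0 in the log above: all finite.
step0 = {"vqa_loss": 1.4844, "vaq_loss": 1.8125, "qav_loss": 2.3903}
print(first_non_finite(step0))  # None

# Made-up values for a later step, only to show what a hit looks like.
later = {"vqa_loss": 1.5, "vaq_loss": float("nan"), "qav_loss": 2.4}
print(first_non_finite(later))  # vaq_loss
```

Printing the offending term name together with the iteration index just before the training loop exits would show whether vqa, vaq, or qav diverges first.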

The command is the training command from the README, with the distributed-training arguments removed:

python train.py --model 7B --max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa --blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
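For reference, the `actual lr: 1.09e-03` in the log is consistent with scaling `--blr` linearly by the effective batch size over 256. That rule is inferred from the logged numbers rather than confirmed from `train.py`, and `scaled_lr` is a hypothetical helper. A quick sketch of the arithmetic:

```python
def scaled_lr(blr, batch_size, accum_iter, world_size=1, base=256):
    """Linear scaling rule (assumed): lr = blr * effective_batch_size / base."""
    eff_batch_size = batch_size * accum_iter * world_size
    return blr * eff_batch_size / base, eff_batch_size


# Arguments from the command above; world_size=1 because the run is not distributed.
lr, eff = scaled_lr(blr=7e-2, batch_size=1, accum_iter=4, world_size=1)
print(eff)  # 4
print(lr)   # 0.00109375
```

Under this assumption, dropping the distributed arguments shrinks the effective batch size, and with it the actual learning rate, compared with a multi-GPU run that uses the same `--blr`.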
