
Discussion on training issues I have encountered #8

@zmy1116

Thank you for the implementation of the paper. This is the first time I'm working with a transformer model; I tried to train it on the Kinetics-700 dataset, and I just want to share some of the issues I have encountered:

The paper suggests that the model works better when initialized from pretrained weights. Since this is a direct extension of the image transformer, most of the vision transformer's weights can be applied directly, but two places are different:

  1. Positional encoding: we now have H x W x T tokens instead of H x W, so I copied the same positional encoding to every frame, similar to how ImageNet weights are inflated for I3D but without dividing by T. An alternative I'm considering is to use angular (sinusoidal) initialization to generate a 1 x T temporal encoding and add it to the H x W image positional encoding to form the H x W x T encoding (see the sketch after this list).
  2. We now apply two self-attentions per block instead of one, so there are twice as many weights for the qkv and output fc layers. For now, when I keep the same number of heads as the pretrained image model, I initialize the first and second self-attention of each block with the same weights. Alternatively, in a different model I use half the number of heads, so the time attention and spatial attention each use half of the pretrained head weights.
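To make the inflation concrete, here is a minimal sketch of how I copy the ViT weights into the video model. The key names (pos_embed, spatial_attn, temporal_attn) are placeholders, not necessarily what this repo uses, and it assumes the video positional embedding simply tiles the image embedding across the T frames:

import torch

def inflate_vit_weights(vit_state, video_model, num_frames):
    # Initialize a space-time transformer from 2D ViT weights.
    # Key/module names below are hypothetical placeholders.
    video_state = video_model.state_dict()

    # 1) Positional encoding: tile the (1 + H*W) image embedding across T frames,
    #    analogous to I3D inflation but without dividing by T.
    pos_2d = vit_state["pos_embed"]                    # (1, 1 + H*W, C)
    cls_pos, patch_pos = pos_2d[:, :1], pos_2d[:, 1:]
    patch_pos_3d = patch_pos.repeat(1, num_frames, 1)  # (1, T*H*W, C)
    video_state["pos_embed"] = torch.cat([cls_pos, patch_pos_3d], dim=1)

    # 2) Attention weights: the spatial attention reuses the ViT attention directly,
    #    and the extra temporal attention starts from the same weights.
    for k, v in vit_state.items():
        if ".attn." in k:
            video_state[k.replace(".attn.", ".spatial_attn.")] = v.clone()
            video_state[k.replace(".attn.", ".temporal_attn.")] = v.clone()
        elif k in video_state and video_state[k].shape == v.shape:
            video_state[k] = v.clone()

    video_model.load_state_dict(video_state)

The half-heads variant would instead slice the qkv and output projection weights per head and give half of the heads to each attention before copying.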

Since this is my first time working with transformers, I wanted to reproduce what the paper claims, so I started with the "original" basic vision transformer setup:

  • 12 heads 12 blocks
  • GELU instead of GEGLU
  • embedding size 768
  • Image size 224, divided into 16x16 patches

With this setup, on a V100 GPU I can only fit 4 videos (4x8x3x224x224) per batch for training, even with torch.amp. This means that an experiment on a p3.8xlarge machine with 4 V100 GPUs (~$12/h on demand) would take about 39 days for 300 epochs. Of course it may not need the full 300 epochs, but intuitively, training with a total batch size of 16 is usually not optimal.
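For reference, this is roughly how the 39-day figure can be sanity-checked; the clip count and per-step time below are illustrative assumptions I picked for the calculation, not measured numbers:

# Back-of-the-envelope training-time estimate (illustrative numbers only).
train_clips = 540_000    # rough Kinetics-700 training set size (assumption)
total_batch = 4 * 4      # 4 clips per GPU x 4 V100s
sec_per_iter = 0.33      # hypothetical time per optimizer step
epochs = 300

iters_per_epoch = train_clips / total_batch            # ~33,750
days = iters_per_epoch * sec_per_iter * epochs / 86_400
print(f"~{days:.0f} days for {epochs} epochs at batch size {total_batch}")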

So, alternatively, I tried a smaller model with 6 heads and 8 blocks. Now I can fit 16 videos per GPU, for a total batch size of 64. The model started training smoothly, but the training error began to increase after 7-8 epochs. Training accuracy peaked around 55%, and I didn't bother running validation because it clearly wasn't working. The relevant configuration I was using is listed below.

DATA:
  NUM_FRAMES: 8
  SAMPLING_RATE: 16
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1
  TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 3
  INPUT_CHANNEL_NUM: [3]
  DECODING_BACKEND: torchvision
  MEAN: [0.5, 0.5, 0.5]
  STD: [0.5, 0.5, 0.5]
  WEIGHT_DECAY: 0.0
SOLVER:
  BASE_LR: 0.1 # 1 machine
  BASE_LR_SCALE_NUM_SHARDS: True
  LR_POLICY: cosine
  MAX_EPOCH: 300
  WEIGHT_DECAY: 5e-5
  WARMUP_EPOCHS: 35.0
  WARMUP_START_LR: 0.01
  OPTIMIZING_METHOD: sgd
TRANSFORMER:
  TOKEN_DIM: 768
  PATCH_SIZE: 16
  DEPTH: 8
  HEADS: 6
  HEAD_DIM: 64
  FF_DROPOUT: 0.1
  ATTN_DROPOUT: 0.0
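For reference, this is roughly how I read the TRANSFORMER section above: each of the DEPTH blocks runs a temporal self-attention, then a spatial self-attention, then a feed-forward layer. It is a simplified sketch, not this repo's actual code; it uses torch's built-in attention (so the per-head dim is dim / heads rather than HEAD_DIM) and omits the class token:

import torch.nn as nn
from einops import rearrange

class DividedSpaceTimeBlock(nn.Module):
    # One block with separate time and space self-attention
    # (simplified sketch, not the repository's implementation).
    def __init__(self, dim=768, heads=6, ff_dropout=0.1, attn_dropout=0.0):
        super().__init__()
        self.time_norm = nn.LayerNorm(dim)
        self.time_attn = nn.MultiheadAttention(dim, heads, dropout=attn_dropout, batch_first=True)
        self.space_norm = nn.LayerNorm(dim)
        self.space_attn = nn.MultiheadAttention(dim, heads, dropout=attn_dropout, batch_first=True)
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(),
            nn.Dropout(ff_dropout), nn.Linear(dim * 4, dim),
        )

    def forward(self, x, t, n):
        # x: (batch, t * n, dim) patch tokens; t = frames, n = patches per frame.

        # Temporal attention: for each spatial position, attend across frames.
        xt = rearrange(x, "b (t n) d -> (b n) t d", t=t, n=n)
        h = self.time_norm(xt)
        xt = xt + self.time_attn(h, h, h, need_weights=False)[0]
        x = rearrange(xt, "(b n) t d -> b (t n) d", n=n)

        # Spatial attention: within each frame, attend across patches.
        xs = rearrange(x, "b (t n) d -> (b t) n d", t=t, n=n)
        h = self.space_norm(xs)
        xs = xs + self.space_attn(h, h, h, need_weights=False)[0]
        x = rearrange(xs, "(b t) n d -> b (t n) d", t=t)

        # Feed-forward with pre-norm and residual.
        return x + self.ff(self.ff_norm(x))

Stacking DEPTH of these blocks gives the 8-block / 6-head variant described above.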

These are the issues I have encountered so far. I wanted to share them because hopefully some of you are also working with video models and we can have a discussion. My next step will probably be to increase the depth.

Regards
