AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition [ArXiv] [Project Page]
This repository is the official implementation of AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition.
Rameswar Panda*, Chun-Fu (Richard) Chen*, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris, "AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition", ICCV 2021. (*: Equal Contribution)
If you use the code and models from this repo, please cite our work. Thanks!
```
@inproceedings{panda2021adamml,
    title={{AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition}},
    author={Panda, Rameswar and Chen, Chun-Fu and Fan, Quanfu and Sun, Ximeng and Saenko, Kate and Oliva, Aude and Feris, Rogerio},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2021}
}
```
Install the dependencies:
```
pip3 install torch torchvision librosa tqdm Pillow numpy
```
The dataloader (`utils/video_dataset.py`) can load RGB frames stored in the following directory structure:
```
-- dataset_dir
---- train.txt
---- val.txt
---- test.txt
---- videos
------ video_0_folder
-------- 00001.jpg
-------- 00002.jpg
-------- ...
------ video_1_folder
------ ...
```
Each line in `train.txt` and `val.txt` contains 4 elements separated by a delimiter, e.g. a space (` `) or a semicolon (`;`). The four elements are, in order: (1) the relative path of `video_x_folder` from `dataset_dir`, (2) the starting frame number (usually 1), (3) the ending frame number, and (4) the label id (a number). For example, if `video_x` has 300 frames and belongs to label 1:
```
path/to/video_x_folder 1 300 1
```
The difference for `test.txt` is that each line has only 3 elements (no label information).
The same format is used for optical flow, except that each frame file (`00001.jpg`) is replaced by a pair of files `x_00001.jpg` and `y_00001.jpg`.
For audio data, change the first element to the path of the corresponding `wav` file, e.g.:
```
path/to/audio_x.wav 1 300 1
```
After that, update `utils/data_config.py` for your datasets accordingly.
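As a convenience, a list file in this format can be generated from already-extracted frame folders with a short shell loop. The sketch below is not part of the repo; it assumes frames live under `dataset_dir/videos/` as shown above, and `LABEL_ID` is a placeholder you replace with the label from your own annotations:
```
# Sketch only (not part of this repo): build train.txt from extracted frame folders.
# Run from inside dataset_dir; replace LABEL_ID with the numeric label of each video.
for d in videos/*/; do
  num_frames=$(ls "$d" | grep -c '\.jpg$')   # number of extracted RGB frames
  echo "${d%/} 1 ${num_frames} LABEL_ID"     # relative_path start_frame end_frame label
done > train.txt
```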
We provide scripts in the `tools` folder to extract RGB frames and audio from a video. To extract optical flow, we use the docker image provided by TSN. Please see the help in the scripts for usage.
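For reference, frame and audio extraction boils down to standard `ffmpeg` calls. The commands below are only an illustrative sketch (not the actual scripts in `tools`), assuming `ffmpeg` is installed and the output paths follow the layout above:
```
# Illustrative sketch only; use the scripts in the tools folder for the real pipeline.
# Extract RGB frames numbered 00001.jpg, 00002.jpg, ... into a per-video folder.
mkdir -p dataset_dir/videos/video_x_folder
ffmpeg -i video_x.mp4 -q:v 2 dataset_dir/videos/video_x_folder/%05d.jpg

# Extract the audio track as a wav file (16-bit PCM, 44.1 kHz).
ffmpeg -i video_x.mp4 -vn -acodec pcm_s16le -ar 44100 path/to/audio_x.wav
```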
We provide pretrained models on the Kinetics-Sounds dataset, including the unimodality models and our AdaMML models. You can find all the models here.
After downloading the unimodality pretrained models, here is the command template to train AdaMML:
```
python3 train.py --multiprocessing-distributed --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 --epochs 20 --warmup_epochs 5 --finetune_epochs 10 \
--modality MODALITY1 MODALITY2 --datadir /PATH/TO/MODALITY1 /PATH/TO/MODALITY2 --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --unimodality_pretrained /PATH/TO/MODEL_MODALITY1 /PATH/TO/MODEL_MODALITY2 \
--learnable_lf_weights --num_segments 5 --cost_weights 1.0 0.005 --causality_modeling lstm --gammas 10.0 --sync-bn \
--lr 0.001 --p_lr 0.01 --lr_scheduler multisteps --lr_steps 10 15
```
The length of the following arguments depends on how many modalities you include in AdaMML:

- `--modality`: the list of modalities; the other arguments below must follow this order
- `--datadir`: the data directory for each modality
- `--unimodality_pretrained`: the pretrained unimodality model for each modality

Note that to use `rgbdiff` as a proxy, both `rgbdiff` and `flow` need to be specified in `--modality` (with their corresponding `--datadir`); however, you only need to provide the `flow` pretrained model in `--unimodality_pretrained`.
Here are examples of training AdaMML with different modality combinations.
RGB + Audio
```
python3 train.py --multiprocessing-distributed --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 --epochs 20 --warmup_epochs 5 --finetune_epochs 10 \
--modality rgb sound --datadir /PATH/TO/RGB_DATA /PATH/TO/AUDIO_DATA --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --unimodality_pretrained /PATH/TO/RGB_MODEL /PATH/TO/AUDIO_MODEL \
--learnable_lf_weights --num_segments 5 --cost_weights 1.0 0.05 --causality_modeling lstm --gammas 10.0 --sync-bn \
--lr 0.001 --p_lr 0.01 --lr_scheduler multisteps --lr_steps 10 15
```
RGB + Flow (with RGBDiff as Proxy)
```
python3 train.py --multiprocessing-distributed --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 --epochs 20 --warmup_epochs 5 --finetune_epochs 10 \
--modality rgb flow rgbdiff --datadir /PATH/TO/RGB_DATA /PATH/TO/FLOW_DATA /PATH/TO/RGB_DATA --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --unimodality_pretrained /PATH/TO/RGB_MODEL /PATH/TO/FLOW_MODEL \
--learnable_lf_weights --num_segments 5 --cost_weights 1.0 1.0 --causality_modeling lstm --gammas 10.0 --sync-bn \
--lr 0.001 --p_lr 0.01 --lr_scheduler multisteps --lr_steps 10 15
```
RGB + Audio + Flow (with RGBDiff as Proxy)
```
python3 train.py --multiprocessing-distributed --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 --epochs 20 --warmup_epochs 5 --finetune_epochs 10 \
--modality rgb sound flow rgbdiff --datadir /PATH/TO/RGB_DATA /PATH/TO/AUDIO_DATA /PATH/TO/FLOW_DATA /PATH/TO/RGB_DATA --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --unimodality_pretrained /PATH/TO/RGB_MODEL /PATH/TO/SOUND_MODEL /PATH/TO/FLOW_MODEL \
--learnable_lf_weights --num_segments 5 --cost_weights 0.5 0.05 0.8 --causality_modeling lstm --gammas 10.0 --sync-bn \
--lr 0.001 --p_lr 0.01 --lr_scheduler multisteps --lr_steps 10 15
```
Testing an AdaMML model is straightforward: simply reuse the training command with the following modifications:

- add `-e` to the command
- use `--pretrained /PATH/TO/MODEL` to load the trained model
- remove `--multiprocessing-distributed` and `--unimodality_pretrained`
- set `--val_num_clips` if you would like to test with a different number of video segments (default is 10)
Here is the command template:
```
python3 train.py -e --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 \
--modality MODALITY1 MODALITY2 --datadir /PATH/TO/MODALITY1 /PATH/TO/MODALITY2 --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --pretrained /PATH/TO/ADAMML_MODEL \
--learnable_lf_weights --num_segments 5 --causality_modeling lstm --sync-bn
```
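For instance, to evaluate an RGB + Audio model with 20 video segments instead of the default 10 (all paths are placeholders):
```
python3 train.py -e --backbone_net adamml -d 50 \
--groups 8 --frames_per_group 4 -b 72 -j 96 \
--modality rgb sound --datadir /PATH/TO/RGB_DATA /PATH/TO/AUDIO_DATA --dataset DATASET --logdir LOGDIR \
--dense_sampling --fusion_point logits --pretrained /PATH/TO/ADAMML_MODEL \
--learnable_lf_weights --num_segments 5 --val_num_clips 20 --causality_modeling lstm --sync-bn
```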