This repository contains code for the CVPR'22 paper Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos.
Prerequisites
- Nvidia docker (i.e., only Linux is supported; alternatively, you can set up the environment from the Dockerfile manually, or run without a GPU, in which case neither docker nor Linux is required).
- Tested on a GPU with at least 4 GB of VRAM.
Download model weights
```
mkdir weights; cd weights
wget https://data.ciirc.cvut.cz/public/projects/2022LookForTheChange/look-for-the-change.pth
wget https://isis-data.science.uva.nl/mettes/imagenet-shuffle/mxnet/resnext101_bottomup_12988/resnext-101-1-0040.params
wget https://isis-data.science.uva.nl/mettes/imagenet-shuffle/mxnet/resnext101_bottomup_12988/resnext-101-symbol.json
mv resnext-101-symbol.json resnext-101-1-symbol.json
```
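If `wget` is not available on your system, the same files can be fetched with a few lines of standard-library Python (same URLs and final file names as above):
```python
# Minimal Python alternative to the wget commands above.
import os
import urllib.request

FILES = {
    "look-for-the-change.pth":
        "https://data.ciirc.cvut.cz/public/projects/2022LookForTheChange/look-for-the-change.pth",
    "resnext-101-1-0040.params":
        "https://isis-data.science.uva.nl/mettes/imagenet-shuffle/mxnet/resnext101_bottomup_12988/resnext-101-1-0040.params",
    # the symbol file is saved under its new name, matching the `mv` above
    "resnext-101-1-symbol.json":
        "https://isis-data.science.uva.nl/mettes/imagenet-shuffle/mxnet/resnext101_bottomup_12988/resnext-101-symbol.json",
}

os.makedirs("weights", exist_ok=True)
for name, url in FILES.items():
    target = os.path.join("weights", name)
    if not os.path.exists(target):  # skip files that are already present
        print("downloading", name)
        urllib.request.urlretrieve(url, target)
```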
Set up the environment
- Our code can be run in a docker container. Build it by running the following command. Note that by default, we compile custom CUDA code for architectures 6.1, 7.0, 7.5, and 8.0; you may need to update the Dockerfile with your GPU architecture (see the sketch after this list for how to check it).
```
docker build -t look-for-the-change .
```
- Enter the docker container.
```
docker run -it --rm --gpus 1 -v $(pwd):$(pwd) -w $(pwd) look-for-the-change bash
```
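If you are unsure which architecture your GPU has, a quick PyTorch check prints its compute capability (the list 6.1, 7.0, 7.5, and 8.0 above refers to these numbers):
```python
# Print the compute capability of the first visible GPU, e.g. "8.6" for
# an RTX 30xx card; add it to the Dockerfile if it is not covered.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: {major}.{minor}")
else:
    print("No CUDA device visible.")
```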
Extract video features
- Our model runs on pre-extracted features. Run the following command for the extraction.
```
python extract.py path/to/video.mp4
```
The script creates a `path/to/video.pickle` file with the extracted features.
- Note: you may need to lower the `memory_limit` of tensorflow in `feature_extraction/tsm_model.py` if you have less than 6 GB of VRAM.
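To sanity-check the extraction output, you can peek inside the pickle file. This is only a sketch: the internal structure of the file (a dict of arrays vs. a single array) is an assumption here, so print first and adjust:
```python
# Inspect the extracted features; adapt once you see the actual structure.
import pickle

with open("path/to/video.pickle", "rb") as f:
    features = pickle.load(f)

print(type(features))
if isinstance(features, dict):
    for key, value in features.items():
        print(key, getattr(value, "shape", value))  # arrays print their shape
else:
    print(getattr(features, "shape", features))
```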
Get predictions
- Run the following command to get predictions for your video.
```
python predict.py category path/to/video.pickle [--visualize --video path/to/video.mp4]
```
Here `category` is the id of a dataset category, such as `bacon` for Bacon Frying. See the ChangeIt dataset categories for all options.
- The script creates `path/to/video.category.csv` with raw model predictions for each second of the original video.
- If a path to the original video is provided, the script also generates a visualization of the predictions.
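To consume the raw predictions downstream, the CSV can be loaded with the standard library. A minimal sketch, assuming only one row per second of video; check the file's contents to see which column holds which score:
```python
# Load the per-second predictions produced by predict.py.
import csv

with open("path/to/video.category.csv") as f:
    rows = list(csv.reader(f))

print("rows (seconds of video):", len(rows))
print("first row:", rows[0])  # inspect to learn the column layout
```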
Prerequisites
- Set up the docker environment and download the ResNeXT model weights, as described in the previous section.
- Note that a GPU is required for training due to the custom CUDA op.
Dataset preparation
- Download the ChangeIt dataset videos. Note that it is not necessary to download the videos in the best available resolution, as only 224-by-224 px resolution is needed for feature extraction.
- Extract features from the videos.
```
python extract.py path/to/video1.mp4 path/to/video2.mp4 ... --n_augmentations 10 --export_dir path/to/dataset_root/category_name
```
This script will create `path/to/dataset_root/category_name/video1.pickle` and `path/to/dataset_root/category_name/video2.pickle` files with the extracted features. It is important to have some `dataset_root` folder containing `category_name` sub-folders with the individual video feature files.
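To process a whole category folder in one go, a small wrapper around `extract.py` could look like the sketch below. The wrapper itself is hypothetical glue code, and the `.mp4` glob and folder names are assumptions about your local layout:
```python
# Run feature extraction for every video of one category.
import pathlib
import subprocess

VIDEO_DIR = pathlib.Path("path/to/videos/category_name")   # downloaded videos
EXPORT_DIR = "path/to/dataset_root/category_name"          # feature output

videos = sorted(str(p) for p in VIDEO_DIR.glob("*.mp4"))
subprocess.run(
    ["python", "extract.py", *videos,
     "--n_augmentations", "10", "--export_dir", EXPORT_DIR],
    check=True,
)
```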
Train a model
- Run the following command to train on the pre-extracted features. Note that a separate training run is needed for every category. Also keep in mind that due to the unsupervised nature of the algorithm, you may end up in a bad local minimum; we recommend running the training multiple times and keeping the best result.
```
python train.py --pickle_roots path/to/dataset_root --category category_name --annotation_root path/to/annotation_root --noise_adapt_weight_root path/to/video_csv_files --noise_adapt_weight_threshold_file path/to/categories.csv
```
Here `--annotation_root` is the location of the `annotations` folder of the ChangeIt dataset, `--noise_adapt_weight_root` is the location of the `videos` folder of the dataset, and `--noise_adapt_weight_threshold_file` points to the `categories.csv` file of the dataset.
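Since each category is trained separately, a hypothetical driver script can sweep them all. Reading category names from the first column of `categories.csv` is an assumption here; adjust it to the file's actual format:
```python
# Launch one training run per category (repeat runs to escape bad minima).
import csv
import subprocess

with open("path/to/categories.csv") as f:
    categories = [row[0] for row in csv.reader(f) if row]

for category in categories:
    subprocess.run(
        ["python", "train.py",
         "--pickle_roots", "path/to/dataset_root",
         "--category", category,
         "--annotation_root", "path/to/annotation_root",
         "--noise_adapt_weight_root", "path/to/video_csv_files",
         "--noise_adapt_weight_threshold_file", "path/to/categories.csv"],
        check=True,
    )
```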
Citation
Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, and Josef Sivic. Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
```
@inproceedings{soucek2022lookforthechange,
    title = {Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos},
    author = {Sou\v{c}ek, Tom\'{a}\v{s} and Alayrac, Jean-Baptiste and Miech, Antoine and Laptev, Ivan and Sivic, Josef},
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2022}
}
```
Acknowledgements
The project was supported by the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468), by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140), by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute), and by the Louis Vuitton ENS Chair on Artificial Intelligence. We would also like to thank Kateřina Součková and Lukáš Kořínek for their help with the dataset.