
Q2E

This repo contains all the data and code related to the paper Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

🔥News:

  • [14 Feb, 2025] Paper submitted to the ARR February cycle.
  • [24 July, 2025] Paper will be presented at the MAGMaR workshop.

Outline

Installation

Caution

Installation was tested with CUDA 12.4 on an A100 GPU. If you see errors, please use the Docker instructions.

To run the code in this project, first create a Python virtual environment using uv. To install uv, follow the UV Installation Guide.

uv venv --seed --python 3.10
uv sync

Data

Pre-Generated Data

source .venv/bin/activate
gdown --fuzzy https://drive.google.com/file/d/1qcr9ZqHptibJKHOwyOrjjbwQTjcsp_Vk/view
tar -xzvf data.tgz

Videos

Due to the redistribution policy, we cannot provide the videos directly. However, you can download them using the instructions below (the expected directory layout is sketched after the list).

  1. Download the MultiVENT videos, and save them in the data/MultiVENT/videos directory.
  2. Download the MSR-VTT videos, and save them in the data/MSR-VTT-1kA/videos directory.
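
For reference, a minimal sketch of the expected layout after both downloads (the exact download procedure follows each dataset's own release instructions):

# Hypothetical layout check: the video directories named above should exist and contain the downloaded videos
mkdir -p data/MultiVENT/videos data/MSR-VTT-1kA/videos
ls data/MultiVENT/videos | head
ls data/MSR-VTT-1kA/videos | head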

Pre-Trained Models

mkdir -p data/models/MultiCLIP
wget -O data/models/MultiCLIP/open_clip_pytorch_model.bin https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/open_clip_pytorch_model.bin

mkdir -p data/models/InternVideo2

Due to different licensing agreements, we cannot provide the InternVideo2 model directly. However, you can download the InternVideo2 model from here and save it as data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt.
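
Once both checkpoints are in place, a quick sanity check of the paths used above (just a sketch; it only confirms the files exist) is:

ls -lh data/models/MultiCLIP/open_clip_pytorch_model.bin
ls -lh data/models/InternVideo2/InternVideo2-stage2_1b-224p-f4.pt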

Evaluation

The pre-generated data is already populated in the data directory. To regenerate it yourself, follow the instructions in the Data Generation Scripts section.

Evaluating MultiVENT

bash scripts/eval_multivent.sh 

Evaluating MSR-VTT-1kA

bash scripts/eval_msrvtt.sh

Data Generation Scripts

Data for the MSR-VTT-1kA and MultiVENT datasets can be generated using the scripts below. The scripts transcribe the audio and generate the data for evaluation. Pre-generated data is available in the Pre-Generated Data section.

Dataset      Audio   Script Location
MultiVENT    ✓       scripts/generate_multivent_asr.sh
MultiVENT    -       scripts/generate_multivent_noasr.sh
MSR-VTT-1kA  ✓       scripts/generate_msrvtt_asr.sh
MSR-VTT-1kA  -       scripts/generate_msrvtt_noasr.sh
ALL          ALL     scripts/grid_search_data.py
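
For example, to regenerate the MultiVENT data with audio transcripts, one would run the following (a sketch, assuming the generation scripts are invoked the same way as the evaluation scripts):

source .venv/bin/activate
bash scripts/generate_multivent_asr.sh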

Use Your Own Data

If you want to generate data using your own dataset, i.e., {DATA_DIR}, follow the instructions below.

  1. Download all videos to {DATA_DIR}/videos
  2. Write a CSV file with the columns query, video_id and save it as {DATA_DIR}/dataset.csv (a minimal example is sketched after step 5)
  3. Generate the data
echo "Transcribing videos"
python -m src.data.transcribe_audios \
    --video_dir={DATA_DIR}/videos

echo "Processing raw data"
python -m src.data.query_decomp  \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=2048

echo "Captioning frames"
python -m src.data.frame_caption \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=16384 \
    --num_of_frames=16

echo "Captioning videos"
python -m src.data.frame2video_caption \
    --data_dir={DATA_DIR} \
    --video_dir={DATA_DIR}/videos \
    --gen_max_model_len=16384 \
    --num_of_frames=16
  4. Evaluate using MultiCLIP
echo "Without ASR"
python -m src.eval.MultiCLIP.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy


echo "With ASR"
python -m src.eval.MultiCLIP.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy
  5. Evaluate using InternVideo2
echo "Without ASR"
python -m src.eval.InternVideo2.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy


echo "With ASR"
python -m src.eval.InternVideo2.infer \
    --note=eval \
    --dataset_dir={HFDatasetDIR} \
    --aggregation_methods=inv_entropy
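
As referenced in step 2, a minimal {DATA_DIR}/dataset.csv could look like the sketch below. The queries and video IDs are purely hypothetical placeholders; it is assumed that each video_id corresponds to a video file in {DATA_DIR}/videos.

# Hypothetical example of {DATA_DIR}/dataset.csv
cat > {DATA_DIR}/dataset.csv <<'EOF'
query,video_id
flood rescue operations in Kerala,video_0001
protest march in Berlin city center,video_0002
EOF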

Using Docker

Many of us don't have root permission on the server, which is where udocker comes to the rescue. To use the code in a Docker container, follow the instructions below.

# Install udocker
uv add udocker
# Create and run the container
udocker pull runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker create --name="runpod" runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
udocker setup --nvidia runpod
udocker run --volume="/${PWD}:/workspace" --name="runpod" runpod bash

# Inside the container
## install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

## install the dependencies
uv venv --seed --python=3.10
uv sync
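
Once the dependencies are installed inside the container, the rest of the workflow is the same as above, e.g. (assuming the repository was mounted at /workspace as in the run command):

cd /workspace
source .venv/bin/activate
bash scripts/eval_multivent.sh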

Citation

If you find this code useful for your research, please consider citing:

@article{dipta2025q2e,
  title={Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval},
  author={Dipta, Shubhashis Roy and Ferraro, Francis},
  journal={arXiv preprint arXiv:2506.10202},
  year={2025}
}
