- [2025-03-18] We release the data and example evaluation code!
Impossible videos refer to videos displaying counterfactual and anti-reality scenes that are impossible in the real world. We show some video examples below. Please visit our website to find more examples.
(Demo video: imp_vid_demo.mp4)
Impossible videos can be a touchstone for advanced video models. As out-of-real-world-distribution data, they require a model not to simply memorize real-world data and retrieve similar information based on the input, but to genuinely learn from real-world data and reason upon the input.
This project aims to advance video research by answering the following important questions:
- Can today's video generation models effectively follow prompts to generate impossible video content?
- Are today's video understanding models good enough for understanding impossible videos?
To answer these questions, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation.
- §IPV Taxonomy: IPV-Bench is underpinned by a comprehensive taxonomy encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws.
- §IPV-Txt Prompt Suite: A prompt suite is constructed based on the taxonomy to evaluate video generation models, challenging their prompt following and creativity capabilities.
- §IPV-Vid Videos: A video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning over temporal dynamics and world knowledge.
First, go to Hugging Face and download our data and code, including videos, task files, and example evaluation code. The task files and example files can also be found in this GitHub repo.
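If you prefer a scripted download, a minimal sketch using `huggingface_hub` might look like the following; the `repo_id` below is a placeholder, so please use the actual dataset repo linked from this README.

```python
# Minimal sketch: download the benchmark with huggingface_hub.
# The repo_id is a placeholder (assumption), not the official one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="your-org/IPV-Bench",  # hypothetical repo id; replace with the real one
    repo_type="dataset",
    local_dir="./ipv_bench",
)
print("Benchmark downloaded to:", local_dir)
```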
- Use `example_read_prompt.py` to read the `ipv_txt_prompt_suite.json` file and get the text prompts.
- Use the text prompts to generate videos with your models.
- Annotate the `visual quality` and `prompt following` fields for each video.
- Compute the `IPV Score` as the percentage of videos that exhibit both high visual quality and good prompt following (see the sketch after this list).
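The sketch below illustrates this workflow end to end. The JSON structure and the annotation field names (`visual_quality`, `prompt_following`) are assumptions made for illustration; adapt them to the actual files and your annotation format.

```python
# Sketch of the text-to-video evaluation loop described above (assumed field names).
import json

# Read the prompt suite (structure assumed: a list of prompt entries).
with open("ipv_txt_prompt_suite.json", "r") as f:
    prompts = json.load(f)

# Suppose each generated video has been annotated with two boolean fields.
# The entries below are dummy examples; in practice, load your own annotations.
annotations = [
    {"visual_quality": True, "prompt_following": True},
    {"visual_quality": True, "prompt_following": False},
    {"visual_quality": False, "prompt_following": True},
]

# IPV Score: percentage of videos that are both high quality and follow the prompt.
good = sum(a["visual_quality"] and a["prompt_following"] for a in annotations)
ipv_score = 100.0 * good / len(annotations)
print(f"IPV Score: {ipv_score:.1f}%")
```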
🛠️ In this study, we employ human annotation to provide reliable insights into the models. We are still polishing an automatic evaluation framework, which will be open-sourced in the future.
- The benchmark involves three tasks: Judgement, Multi-choice QA, and Open-ended QA.
- Navigate to example_eval/eval_judgement.py, example_eval/eval_mcqa.py, and example_eval/eval_openqa.py for each task.
- The example code implements the full evaluation pipeline. To evaluate your model, simply modify the `inference_one()` function to produce your model's output, as sketched below.
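A hedged sketch of what an adapted `inference_one()` might look like; the exact signature in `example_eval/*.py` may differ, and the model call is a placeholder for your own Video-LLM inference API.

```python
# Hypothetical adaptation of inference_one() for your own model.
def inference_one(video_path: str, question: str) -> str:
    """Run your video-language model on one (video, question) pair."""
    # Replace this stub with your model's actual inference call, e.g.:
    # answer = my_video_llm.generate(video=video_path, prompt=question)
    answer = "yes"  # placeholder output
    return answer
```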
You are welcome to discuss with us and help continuously improve the quality of impossible videos. Reach us via the WeChat QR code below!
If you find our work helpful, please kindly star this repo and consider citing our paper.
@misc{bai2025impossible,
title={Impossible Videos},
author={Zechen Bai and Hai Ci and Mike Zheng Shou},
year={2025},
eprint={xxxx.xxxxx},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/xxxx.xxxxx},
}