A framework for scoring reasoning capabilities of video generation models at scale, through cognitive tasks. We make it convenient to add models, add tasks, run inference, run scoring, manage datasets, and display results. VMEvalKit is permissively open-source, and we welcome everyone to join us and build in public together!
VMEvalKit provides unified access to 40 video generation models across 11 provider families:
For commercial APIs, we support Luma Dream Machine, Google Veo, Google Veo 3.1, WaveSpeed WAN 2.1, WaveSpeed WAN 2.2, Runway ML, and OpenAI Sora. For open-source models, we support HunyuanVideo, VideoCrafter, DynamiCrafter, Stable Video Diffusion, Morphic, LTX-Video, and more. See here for details.
VMEvalKit provides access to 9 local task generation engines (with more being added quickly) and other external benchmark datasets (HuggingFace) here.
Tasks supported by VMEvalKit:
Chess, Maze, Raven, Rotation, Sudoku, Object Subtraction, Clock, and Mirror Clock. For more details, see Task Docs.
VMEvalKit aims to provide an infrastructure for reasoning research in video models at scale:
- Task Creation at Scale: Programmatically create question datasets for many different cognitive tasks, with the framework keeping the dataset well organized.
- Model Inference at Scale: One-click inference over the entire question dataset across many video models (commercial APIs + open-source), with automatic resume, error handling, and structured output management; inference results are automatically synced into the dataset.
- Scoring Pipeline: Human scoring via a web interface and AI scoring via an automated MLLM pipeline; scoring results are likewise synced into the dataset automatically.
- Dataset Management: Manage question datasets from task creation, inference results from video models, and scoring results from humans or MLLM pipelines. Provides AWS S3 integration with version tracking and built-in logging for reproducibility.
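Putting the pieces together, and based on the paths used elsewhere in this README, a typical on-disk layout looks roughly like the sketch below (illustrative; exact folder names for outputs and scorings may vary by experiment):

```text
data/
├── questions/                      # created by task generation
│   └── chess_task/
│       └── {question_id}/
│           ├── first_frame.png
│           ├── final_frame.png
│           ├── prompt.txt
│           └── question_metadata.json
├── outputs/                        # generated videos, per model/experiment
└── scorings/                       # human and MLLM scoring results
```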
We have completed runs of a question dataset covering chess, maze, Sudoku, mental rotation, and Raven's Matrices on the latest video models. Check out the raw result videos on this website. Here are a few examples.
- Clone the repository

```bash
git clone https://github.com/hokindeng/VMEvalKit.git
cd VMEvalKit
```

- Initialize submodules (needed for the optional open-source models and datasets)

```bash
git submodule update --init --recursive
```

- Configure the environment: copy the example environment file and add your API keys

```bash
cp env.template .env
```

- Set up a Python environment (recommended: a fresh virtual environment)

```bash
python -m venv venv
source venv/bin/activate
```

  Alternatively, you can use other tools such as uv for faster installs (`uv venv`), or conda if your use case has cross-language dependencies.

- Install dependencies

```bash
pip install -r requirements.txt
pip install -e .
```

For open-source video generation and evaluator models, please refer to Open Source Models for detailed installation instructions.
Here's a complete workflow from creating questions to scoring results:
```bash
# Generate 5 chess and maze questions each
python examples/create_questions.py --task chess maze --pairs-per-domain 5
# Output: creates data/questions/ with chess_task/ and maze_task/ folders

# Run a specific model (e.g., Stable Video Diffusion)
python examples/generate_videos.py --model svd --task chess maze
# Output: creates data/outputs/pilot_experiment/ with generated videos
# For closed-source models, set your API key in the .env file

# Automated scoring with an open-source VLM
bash script/lmdeploy_server.sh

# Human scoring via web interface
python examples/score_videos.py human

# Automated GPT-4o scoring
python examples/score_videos.py gpt4o

# Launch the web dashboard to explore results
cd web && ./start.sh
# Open http://localhost:5000 in your browser
```

That's it! You now have:
- Custom reasoning questions in `data/questions/`
- Generated videos in `data/outputs/`
- Scoring results in `data/scorings/`
- An interactive dashboard
Every VMEvalKit dataset consists of Task Pairs, the basic unit of video reasoning scoring:
We have two types of tasks:
For image-goal tasks, each Task Pair consists of three core components:

- Initial state image (`first_frame.png`): shows the starting point or problem to be solved
- Final state image (`final_frame.png`): illustrates the goal state or solution
- Text prompt (`prompt.txt`): provides natural language instructions for the video model

There is also an accompanying `question_metadata.json` file with rich metadata. Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files.
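As a minimal sketch of how such a folder can be consumed, the hypothetical helper below (not part of the toolkit's API; the real loader may differ) reads one task-pair folder laid out as described above:

```python
import json
from pathlib import Path


def load_task_pair(pair_dir):
    """Load one task-pair folder: two frame images, a prompt, and metadata.

    Hypothetical helper for illustration; assumes the four-file layout
    described above (first_frame.png, final_frame.png, prompt.txt,
    question_metadata.json).
    """
    pair_dir = Path(pair_dir)
    return {
        "first_frame": pair_dir / "first_frame.png",   # path to initial state
        "final_frame": pair_dir / "final_frame.png",   # path to goal state
        "prompt": (pair_dir / "prompt.txt").read_text().strip(),
        "metadata": json.loads((pair_dir / "question_metadata.json").read_text()),
    }
```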
For text-goal tasks, each Task Pair consists of three core components:

- Initial state image (`first_frame.png`): shows the starting point or problem to be solved
- Text answer (`goal.txt`): provides the text answer to the question
- Text prompt (`prompt.txt`): provides natural language instructions for the video model
With VMEvalKit, you can create tasks with a final text answer simply by adding a `goal.txt` file to the task folder, making it easy to adapt existing VQA datasets into video reasoning tasks.
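For example, a VQA item (image, question, answer) can be written out in this layout with a few lines of standard-library code. The helper name and signature below are hypothetical; only the file names follow the layout described above:

```python
import json
from pathlib import Path


def write_text_goal_pair(out_dir, image_bytes, question, answer, metadata=None):
    """Write one text-goal task pair from a VQA-style (image, Q, A) triple.

    Hypothetical helper for illustration; the toolkit may provide its own
    conversion utilities.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "first_frame.png").write_bytes(image_bytes)   # initial state
    (out_dir / "prompt.txt").write_text(question)            # instructions
    (out_dir / "goal.txt").write_text(answer)                # text answer
    (out_dir / "question_metadata.json").write_text(json.dumps(metadata or {}))
    return out_dir
```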
For more details, see Task Docs.
See Inference Guide for details.
See Scoring Guide for details.
See Data Management for details.
See Web Dashboard for details.
You can add new video generation models and reasoning tasks with minimal effort:
Adding New Models
Add any video generation model (API-based or open-source) with just a few steps:
```python
# Example: adding a new model wrapper
from vmevalkit.models.base import BaseVideoModel

class MyModelWrapper(BaseVideoModel):
    def generate_video(self, image_path, text_prompt, **kwargs):
        # Your model's video generation logic
        return video_path
```

Then register it in `MODEL_CATALOG.py`:
"my-model": {
"provider": "mycompany",
"wrapper_path": "vmevalkit.models.my_model.MyModelWrapper",
...
}See Adding Models Guide for details.
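A registry entry like the one above can be resolved to a class by splitting the dotted `wrapper_path`. The sketch below shows the general pattern with a standard-library class as a stand-in; the actual loading logic in `MODEL_CATALOG.py` may differ:

```python
import importlib

# Stand-in catalog; real entries carry more fields, and the wrapper_path
# here points at a stdlib class purely so the sketch is self-contained.
CATALOG = {
    "my-model": {
        "provider": "mycompany",
        "wrapper_path": "collections.OrderedDict",
    },
}


def load_wrapper_class(model_name):
    """Resolve a dotted 'package.module.ClassName' path to the class object."""
    module_path, class_name = CATALOG[model_name]["wrapper_path"].rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```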
Adding New Tasks
Create new reasoning tasks programmatically at scale:
```python
from vmevalkit.tasks.base_task import BaseTask

class MyTask(BaseTask):
    def generate_task_pair(self, ...):
        # Generate initial and final states
        initial_state = self.create_initial_state()
        final_state = self.create_final_state()
        prompt = self.create_prompt()
        return {
            "first_frame": initial_state,
            "final_frame": final_state,
            "prompt": prompt,
            "metadata": {...},
        }
```

See Adding Tasks Guide for details.
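To make the shape of a task generator concrete, here is a standalone toy version of an object-subtraction task. It does not use the real `BaseTask` API: states are plain coordinate lists standing in for rendered frames, and all names are illustrative:

```python
import random


def generate_toy_subtraction_pair(n_objects=5, n_removed=2, seed=0):
    """Toy 'object subtraction' task pair (illustrative only).

    Returns the same dict shape as the BaseTask example above, but the
    frames are coordinate lists rather than images.
    """
    rng = random.Random(seed)  # seeded so pairs are reproducible
    # Initial state: n_objects points on a 10x10 grid
    initial = [(rng.randrange(10), rng.randrange(10)) for _ in range(n_objects)]
    # Final state: the same scene with n_removed objects taken away
    final = initial[: n_objects - n_removed]
    prompt = (
        f"The scene contains {n_objects} objects. "
        f"Show {n_removed} of them being removed, one at a time."
    )
    return {
        "first_frame": initial,
        "final_frame": final,
        "prompt": prompt,
        "metadata": {"domain": "object_subtraction", "seed": seed},
    }
```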
VMEvalKit is meant to be a permissively open-source shared playground for everyone. If you're interested in machine cognition, video models, evaluation, or anything else, we'd love to build with you:
- Add new reasoning tasks (planning, causality, social, physical, etc.)
- Plug in new video models (APIs or open-source)
- Experiment with better evaluation metrics and protocols
- Improve infrastructure, logging, and the web dashboard
- Use VMEvalKit in your own research and share back configs/scripts
- Or anything else!
Join us on Slack to ask questions, propose ideas, or start a collaboration: Slack Invite
Core Documentation:
- Inference Guide - Complete guide to running inference, supported models, and architecture
- Scoring Guide - Human and automated scoring methods
- Data Management - Dataset organization, S3 sync, and version tracking
- Adding Models - How to add new video generation models
- Adding Tasks - How to create new reasoning tasks
- Web Dashboard - Interactive results visualization
Here we keep track of papers spun off from this code infrastructure, as well as some works in progress.
This paper implements our experimental framework and demonstrates that leading video generation models (e.g., Sora-2) can perform visual reasoning tasks with success rates above 60%. See results.
Apache 2.0
If you find VMEvalKit useful in your research, please cite:
@misc{VMEvalKit,
author = {VMEvalKit Team},
title = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
year = {2025},
howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}
