A framework for scoring reasoning capabilities of video generation models at scale, through cognitive tasks. We make it convenient to add models, add tasks, run inference, run scoring, manage datasets, and display results. VMEvalKit is permissively open-source, and we welcome everyone to join us and build in public together!
VMEvalKit provides unified access to 40 video generation models across 11 provider families:
For commercial APIs, we support Luma Dream Machine, Google Veo, Google Veo 3.1, WaveSpeed WAN 2.1, WaveSpeed WAN 2.2, Runway ML, and OpenAI Sora. For open-source models, we support HunyuanVideo, VideoCrafter, DynamiCrafter, Stable Video Diffusion, Morphic, LTX-Video, and more. See here for details.
VMEvalKit provides access to 9 local task generation engines (with more being added quickly) and other external benchmark datasets (HuggingFace) here.
Tasks supported by VMEvalKit:
Chess, Maze, Raven, Rotation, Sudoku, Object Subtraction, Clock, and Mirror Clock. For more details, see Task Docs.
VMEvalKit aims to provide an infrastructure for reasoning research in video models at scale:
- Task Creation at Scale: Programmatically create question datasets for many different cognitive tasks, with the framework keeping the dataset well organized.
- Model Inference at Scale: One-click inference over the entire question dataset across many video models (commercial APIs + open-source), with automatic resume, error handling, and structured output management; inference results are automatically synced into the dataset.
- Scoring Pipeline: Human scoring via a web interface and AI scoring via an automated MLLM pipeline; scoring results are likewise synced into the dataset automatically.
- Dataset Management: Manage question datasets from task creation, inference results from video models, and scoring results from humans or MLLM pipelines. Provides AWS S3 integration with version tracking and built-in logging for reproducibility.
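Putting the pieces together, and based on the paths used elsewhere in this README, a typical on-disk layout looks roughly like the sketch below (illustrative; exact folder names for outputs and scorings may vary by experiment):

```text
data/
├── questions/                      # created by task generation
│   └── chess_task/
│       └── {question_id}/
│           ├── first_frame.png
│           ├── final_frame.png
│           ├── prompt.txt
│           └── question_metadata.json
├── outputs/                        # generated videos, per model/experiment
└── scorings/                       # human and MLLM scoring results
```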
We have completed runs of a question dataset covering chess, maze, Sudoku, mental rotation, and Raven's Matrices on the latest video models. Check out the raw result videos on this website. Here are a few examples.
- Clone the repository

```bash
git clone https://github.com/hokindeng/VMEvalKit.git
cd VMEvalKit
```

- Initialize submodules (needed for the optional open-source models and datasets)

```bash
git submodule update --init --recursive
```

- Configure the environment: copy the example environment file and add your API keys

```bash
cp env.template .env
```

- Set up a Python environment (recommended: a fresh virtual environment)

```bash
python -m venv venv
source venv/bin/activate
```

  Alternatively, you can use other tools such as uv for faster installs (`uv venv`), or conda if your use case has cross-language dependencies.

- Install dependencies

```bash
pip install -r requirements.txt
pip install -e .
```

For open-source video generation and evaluator models, please refer to Open Source Models for detailed installation instructions.
Here's a complete workflow from creating questions to scoring results:
```bash
# Generate 5 chess and maze questions each
python examples/create_questions.py --task chess maze --pairs-per-domain 5
# Output: creates data/questions/ with chess_task/ and maze_task/ folders

# Run a specific model (e.g., Stable Video Diffusion)
python examples/generate_videos.py --model svd --task chess maze
# Output: creates data/outputs/pilot_experiment/ with generated videos
# For closed-source models, set your API key in the .env file

# Automated scoring with an open-source VLM
bash script/lmdeploy_server.sh

# Human scoring via web interface
python examples/score_videos.py human

# Automated GPT-4o scoring
python examples/score_videos.py gpt4o

# Launch the web dashboard to explore results
cd web && ./start.sh
# Open http://localhost:5000 in your browser
```

That's it! You now have:
- Custom reasoning questions in `data/questions/`
- Generated videos in `data/outputs/`
- Scoring results in `data/scorings/`
- An interactive dashboard
Every VMEvalKit dataset consists of Task Pairs, the basic unit of video reasoning scoring:
We have two types of tasks:
For image-goal tasks, each Task Pair consists of three core components:

- Initial state image (`first_frame.png`): shows the starting point or problem to be solved
- Final state image (`final_frame.png`): illustrates the goal state or solution
- Text prompt (`prompt.txt`): provides natural language instructions for the video model

There is also an accompanying `question_metadata.json` file with rich metadata. Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files.
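As a minimal sketch of how such a folder can be consumed, the hypothetical helper below (not part of the toolkit's API; the real loader may differ) reads one task-pair folder laid out as described above:

```python
import json
from pathlib import Path


def load_task_pair(pair_dir):
    """Load one task-pair folder: two frame images, a prompt, and metadata.

    Hypothetical helper for illustration; assumes the four-file layout
    described above (first_frame.png, final_frame.png, prompt.txt,
    question_metadata.json).
    """
    pair_dir = Path(pair_dir)
    return {
        "first_frame": pair_dir / "first_frame.png",   # path to initial state
        "final_frame": pair_dir / "final_frame.png",   # path to goal state
        "prompt": (pair_dir / "prompt.txt").read_text().strip(),
        "metadata": json.loads((pair_dir / "question_metadata.json").read_text()),
    }
```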
For text-goal tasks, each Task Pair consists of three core components:

- Initial state image (`first_frame.png`): shows the starting point or problem to be solved
- Text answer (`goal.txt`): provides the text answer to the question
- Text prompt (`prompt.txt`): provides natural language instructions for the video model
With VMEvalKit, you can create tasks with a final text answer simply by adding a `goal.txt` file to the task folder, making it easy to adapt existing VQA datasets into video reasoning tasks.
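For example, a VQA item (image, question, answer) can be written out in this layout with a few lines of standard-library code. The helper name and signature below are hypothetical; only the file names follow the layout described above:

```python
import json
from pathlib import Path


def write_text_goal_pair(out_dir, image_bytes, question, answer, metadata=None):
    """Write one text-goal task pair from a VQA-style (image, Q, A) triple.

    Hypothetical helper for illustration; the toolkit may provide its own
    conversion utilities.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "first_frame.png").write_bytes(image_bytes)   # initial state
    (out_dir / "prompt.txt").write_text(question)            # instructions
    (out_dir / "goal.txt").write_text(answer)                # text answer
    (out_dir / "question_metadata.json").write_text(json.dumps(metadata or {}))
    return out_dir
```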
For more details, see Task Docs.
See Inference Guide for details.
See Scoring Guide for details.
See Data Management for details.
See Web Dashboard for details.
You can add new video generation models and reasoning tasks with minimal effort:
Adding New Models
Add any video generation model (API-based or open-source) with just a few steps:
```python
# Example: adding a new model wrapper
from vmevalkit.models.base import BaseVideoModel

class MyModelWrapper(BaseVideoModel):
    def generate_video(self, image_path, text_prompt, **kwargs):
        # Your model's video generation logic
        return video_path
```

Then register it in `MODEL_CATALOG.py`:
"my-model": {
"provider": "mycompany",
"wrapper_path": "vmevalkit.models.my_model.MyModelWrapper",
...
}See Adding Models Guide for details.
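A registry entry like the one above can be resolved to a class by splitting the dotted `wrapper_path`. The sketch below shows the general pattern with a standard-library class as a stand-in; the actual loading logic in `MODEL_CATALOG.py` may differ:

```python
import importlib

# Stand-in catalog; real entries carry more fields, and the wrapper_path
# here points at a stdlib class purely so the sketch is self-contained.
CATALOG = {
    "my-model": {
        "provider": "mycompany",
        "wrapper_path": "collections.OrderedDict",
    },
}


def load_wrapper_class(model_name):
    """Resolve a dotted 'package.module.ClassName' path to the class object."""
    module_path, class_name = CATALOG[model_name]["wrapper_path"].rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```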
Adding New Tasks
Create new reasoning tasks programmatically at scale:
```python
from vmevalkit.tasks.base_task import BaseTask

class MyTask(BaseTask):
    def generate_task_pair(self, ...):
        # Generate initial and final states
        initial_state = self.create_initial_state()
        final_state = self.create_final_state()
        prompt = self.create_prompt()
        return {
            "first_frame": initial_state,
            "final_frame": final_state,
            "prompt": prompt,
            "metadata": {...},
        }
```

See Adding Tasks Guide for details.
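To make the shape of a task generator concrete, here is a standalone toy version of an object-subtraction task. It does not use the real `BaseTask` API: states are plain coordinate lists standing in for rendered frames, and all names are illustrative:

```python
import random


def generate_toy_subtraction_pair(n_objects=5, n_removed=2, seed=0):
    """Toy 'object subtraction' task pair (illustrative only).

    Returns the same dict shape as the BaseTask example above, but the
    frames are coordinate lists rather than images.
    """
    rng = random.Random(seed)  # seeded so pairs are reproducible
    # Initial state: n_objects points on a 10x10 grid
    initial = [(rng.randrange(10), rng.randrange(10)) for _ in range(n_objects)]
    # Final state: the same scene with n_removed objects taken away
    final = initial[: n_objects - n_removed]
    prompt = (
        f"The scene contains {n_objects} objects. "
        f"Show {n_removed} of them being removed, one at a time."
    )
    return {
        "first_frame": initial,
        "final_frame": final,
        "prompt": prompt,
        "metadata": {"domain": "object_subtraction", "seed": seed},
    }
```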
VMEvalKit is meant to be a permissively open-source shared playground for everyone. If you're interested in machine cognition, video models, evaluation, or anything else, we'd love to build with you:
- Add new reasoning tasks (planning, causality, social, physical, etc.)
- Plug in new video models (APIs or open-source)
- Experiment with better evaluation metrics and protocols
- Improve infrastructure, logging, and the web dashboard
- Use VMEvalKit in your own research and share back configs/scripts
- Or anything else!
Join us on Slack to ask questions, propose ideas, or start a collaboration: Slack Invite
Core Documentation:
- Inference Guide - Complete guide to running inference, supported models, and architecture
- Scoring Guide - Human and automated scoring methods
- Data Management - Dataset organization, S3 sync, and version tracking
- Adding Models - How to add new video generation models
- Adding Tasks - How to create new reasoning tasks
- Web Dashboard - Interactive results visualization
Here we keep track of papers spun off from this code infrastructure, as well as some works in progress.
This paper implements our experimental framework and demonstrates that leading video generation models (e.g., Sora-2) can perform visual reasoning tasks with success rates above 60%. See results.
Apache 2.0
If you find VMEvalKit useful in your research, please cite:
@misc{VMEvalKit,
author = {VMEvalKit Team},
title = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
year = {2025},
howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}
