
VMEvalKit πŸŽ₯🧠


A framework for scoring the reasoning capabilities of video generation models at scale through cognitive tasks. It makes it easy to add models, add tasks, run inference, run scoring, manage datasets, and display results. VMEvalKit is permissively open-source, and we welcome everyone to join us and build in public together! πŸš€

VMEvalKit Framework

🎬 Supported Models

VMEvalKit provides unified access to 40 video generation models across 11 provider families:

For commercial APIs, we support Luma Dream Machine, Google Veo, Google Veo 3.1, WaveSpeed WAN 2.1, WaveSpeed WAN 2.2, Runway ML, and OpenAI Sora. For open-source models, we support HunyuanVideo, VideoCrafter, DynamiCrafter, Stable Video Diffusion, Morphic, LTX-Video, and more. See here for details.

πŸ“Š Supported Datasets

VMEvalKit provides access to 9 local task generation engines (with more being added quickly) and to external benchmark datasets (HuggingFace) here.

Local Task Generation Engines

Tasks supported by VMEvalKit:

Chess, Maze, Raven's Matrices, Mental Rotation, Sudoku, Object Subtraction, Clock, and Mirror Clock. For more details, see Task Docs.

Basic Idea

VMEvalKit aims to provide an infrastructure for reasoning research in video models at scale:

  • 🎯 Task Creation at Scale: Programmatically create question datasets spanning many different cognitive tasks; the framework keeps the dataset well organized.
  • πŸš€ Model Inference at Scale: One-click inference over the entire question dataset across many video models (commercial APIs + open-source), with automatic resume, error handling, and structured output management; inference results are automatically synced into the dataset.
  • βš–οΈ Scoring Pipeline: Human scoring via a web interface and automated scoring via MLLMs; scoring results are likewise automatically synced into the dataset.
  • ☁️ Dataset Management: Manage question datasets from task creation, inference results from video models, and scoring results from human or MLLM pipelines. Provides AWS S3 integration with version tracking and built-in logging for reproducibility.

We have already run a question dataset of chess, maze, Sudoku, mental rotation, and Raven's Matrices tasks on the latest video models. Check out the raw result videos on this website. Here are a few examples.

Installation & Setup

  1. Clone the repository
git clone https://github.com/hokindeng/VMEvalKit.git
cd VMEvalKit
  2. Initialize submodules - needed for the optional open-source models and datasets
git submodule update --init --recursive
  3. Configure environment - copy the example environment file and add your API keys
cp env.template .env
  4. Set up a Python environment – we recommend a fresh virtual environment
python -m venv venv
source venv/bin/activate

Alternatively, you can use other tools such as uv for faster installs (uv venv), or conda if your use case has cross-language dependencies.

  5. Install dependencies:
pip install -r requirements.txt
pip install -e .

For open-source video generation and evaluator models, please refer to Open Source Models for detailed installation instructions.

πŸš€ Quick Start - End-to-End Example

Here's a complete workflow from creating questions to scoring results:

1️⃣ Create Questions

# Generate 5 chess and maze questions each
python examples/create_questions.py --task chess maze --pairs-per-domain 5

# Output: Creates data/questions/ with chess_task/ and maze_task/ folders

2️⃣ Generate Videos

# Run on specific model (e.g., stable video diffusion)
python examples/generate_videos.py --model svd --task chess maze

# Output: Creates data/outputs/pilot_experiment/ with generated videos
# For closed-source models, set the API key in the .env file

3️⃣ Score Results

# Automated scoring with an open-source VLM (start the serving backend first)
bash script/lmdeploy_server.sh

# Human scoring via web interface
python examples/score_videos.py human

# Automated GPT-4o scoring
python examples/score_videos.py gpt4o

4️⃣ View Results

# Launch web dashboard to explore results
cd web && ./start.sh
# Open http://localhost:5000 in your browser

That's it! You now have:

  • βœ… Custom reasoning questions in data/questions/
  • βœ… Generated videos in data/outputs/
  • βœ… Scoring results in data/scorings/
  • βœ… Interactive dashboard
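After a run, it can be handy to sanity-check how many videos each model actually produced. The sketch below walks an outputs directory and counts .mp4 files per model; the exact folder layout (one subdirectory per model under the experiment directory) is an assumption for illustration, not a documented VMEvalKit contract.

```python
from pathlib import Path

def count_videos(outputs_dir):
    """Count generated .mp4 files per model under an outputs directory.

    Assumes a layout like data/outputs/<experiment>/<model>/... -- this
    layout is an assumption here, so adjust to your actual tree.
    """
    outputs_dir = Path(outputs_dir)
    counts = {}
    for model_dir in sorted(p for p in outputs_dir.iterdir() if p.is_dir()):
        # rglob searches all nested question folders for rendered videos
        counts[model_dir.name] = len(list(model_dir.rglob("*.mp4")))
    return counts
```

For example, `count_videos("data/outputs/pilot_experiment")` would return a dict mapping each model folder name to its video count.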

Tasks

Every VMEvalKit dataset consists of Task Pairs - the basic unit for video reasoning scoring:

We support two types of task pairs: those with a final image and those with a final text answer.

Final image

Each Task Pair consists of three core components:

  • πŸ“Έ Initial state image (first_frame.png): shows the starting point or problem to be solved
  • 🎯 Final state image (final_frame.png): illustrates the goal state or solution
  • πŸ“ Text prompt (prompt.txt): provides natural language instructions for the video model

There is also an accompanying question_metadata.json file with rich metadata. Each task pair is organized in its own folder (data/questions/{domain}_task/{question_id}/) containing all four files.

Task Pair Structure
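Given the four-file layout above, reading a task pair back is straightforward. The helper below is a minimal sketch (not a VMEvalKit API) that loads one question folder, assuming the documented file names first_frame.png, final_frame.png, prompt.txt, and question_metadata.json.

```python
import json
from pathlib import Path

def load_task_pair(question_dir):
    """Load one task pair folder (data/questions/{domain}_task/{question_id}/).

    Returns the prompt text, the metadata dict, and paths to the two frames.
    A sketch with minimal error handling, following the layout described above.
    """
    question_dir = Path(question_dir)
    prompt = (question_dir / "prompt.txt").read_text().strip()
    metadata = json.loads((question_dir / "question_metadata.json").read_text())
    return {
        "first_frame": question_dir / "first_frame.png",
        "final_frame": question_dir / "final_frame.png",
        "prompt": prompt,
        "metadata": metadata,
    }
```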

Final text answer

Each Task Pair consists of three core components:

  • πŸ“Έ Initial state image (first_frame.png): shows the starting point or problem to be solved
  • πŸ“ Text answer (goal.txt): provides the text answer to the question
  • πŸ“ Text prompt (prompt.txt): provides natural language instructions for the video model

With VMEvalKit, you can create final-text-answer tasks simply by adding a goal.txt file to the task folder, which makes it easy to adapt existing VQA datasets into video reasoning tasks.
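As a rough illustration of that adaptation, the sketch below converts a single VQA example (image, question, answer) into the three-file final-text-answer layout. The function name and output-folder naming are hypothetical; only the file names first_frame.png, prompt.txt, and goal.txt come from the layout described above.

```python
import shutil
from pathlib import Path

def vqa_to_task_pair(image_path, question, answer, out_dir):
    """Convert one VQA example into a final-text-answer task pair folder.

    Writes first_frame.png (copied from image_path), prompt.txt, and goal.txt.
    An illustrative sketch, not the official converter.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(image_path, out_dir / "first_frame.png")
    (out_dir / "prompt.txt").write_text(question)
    (out_dir / "goal.txt").write_text(answer)
    return out_dir
```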

For more details, see Task Docs.

Inference Architecture

See Inference Guide for details.

Scoring Pipeline

See Scoring Guide for details.

Dataset Management

See Data Management for details.

Display Results

See Web Dashboard for details.

Add Models or Tasks

You can add new video generation models and reasoning tasks with minimal effort:

Adding New Models

Add any video generation model (API-based or open-source) with just a few steps:

# Example: Adding a new model wrapper
from vmevalkit.models.base import BaseVideoModel

class MyModelWrapper(BaseVideoModel):
    def generate_video(self, image_path, text_prompt, **kwargs):
        # Your model's video generation logic
        return video_path

Then register it in MODEL_CATALOG.py:

"my-model": {
    "provider": "mycompany",
    "wrapper_path": "vmevalkit.models.my_model.MyModelWrapper",
    ...
}
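A catalog entry like the one above carries a dotted wrapper_path, which can be resolved to a class at runtime with importlib. The snippet below is a sketch of how such a lookup might work, not VMEvalKit's actual loader.

```python
import importlib

def load_wrapper(catalog_entry):
    """Resolve a catalog entry's dotted wrapper_path to the wrapper class.

    e.g. "vmevalkit.models.my_model.MyModelWrapper" -> MyModelWrapper.
    A sketch of dynamic loading; the real loader may differ.
    """
    module_path, class_name = catalog_entry["wrapper_path"].rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```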

See Adding Models Guide for details.

Adding New Tasks

Create new reasoning tasks programmatically at scale:

from vmevalkit.tasks.base_task import BaseTask

class MyTask(BaseTask):
    def generate_task_pair(self, ...):
        # Generate initial and final states
        initial_state = self.create_initial_state()
        final_state = self.create_final_state()
        prompt = self.create_prompt()
        
        return {
            "first_frame": initial_state,
            "final_frame": final_state, 
            "prompt": prompt,
            "metadata": {...}
        }
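Once a task class returns pairs in this shape, a small driver can materialize them on disk at scale. The helper below writes prompts and metadata into the data/questions/{domain}_task/ layout described earlier; the function name, folder naming scheme, and the choice to leave frame rendering to the task's own code are all illustrative assumptions, not VMEvalKit APIs.

```python
import json
from pathlib import Path

def save_task_pairs(task_pairs, domain, root="data/questions"):
    """Write task-pair dicts into {root}/{domain}_task/{domain}_NNNN/ folders.

    Expects each dict to carry "prompt" and "metadata"; frames are assumed
    to be saved separately by the task's rendering code. Illustrative only.
    """
    root = Path(root)
    written = []
    for i, pair in enumerate(task_pairs):
        folder = root / f"{domain}_task" / f"{domain}_{i:04d}"
        folder.mkdir(parents=True, exist_ok=True)
        (folder / "prompt.txt").write_text(pair["prompt"])
        (folder / "question_metadata.json").write_text(
            json.dumps(pair["metadata"]))
        written.append(folder)
    return written
```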

See Adding Tasks Guide for details.

Invitation to Collaborate 🀝

VMEvalKit is meant to be a permissively open-source shared playground for everyone. If you’re interested in machine cognition, video models, evaluation, or anything anything πŸ¦„βœ¨, we’d love to build with you:

  • πŸ§ͺ Add new reasoning tasks (planning, causality, social, physical, etc.)
  • πŸŽ₯ Plug in new video models (APIs or open-source)
  • πŸ“Š Experiment with better evaluation metrics and protocols
  • 🧱 Improve infrastructure, logging, and the web dashboard
  • πŸ“š Use VMEvalKit in your own research and share back configs/scripts
  • πŸŒŸπŸŽ‰ Or Anything anything πŸ¦„βœ¨

πŸ’¬ Join us on Slack to ask questions, propose ideas, or start a collab: Slack Invite πŸš€

Documentation

πŸ“š Core Documentation:

Research

Here we keep track of papers spun off from this code infrastructure, as well as some work in progress.

This paper implements our experimental framework and demonstrates that leading video generation models (e.g., Sora-2) can perform visual reasoning tasks with >60% success rates. See results.

License

Apache 2.0

Citation

If you find VMEvalKit useful in your research, please cite:

@misc{VMEvalKit,
  author       = {VMEvalKit Team},
  title        = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
  year         = {2025},
  howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
}