VBVR-Bench provides rule-based evaluation of video generation models' reasoning capabilities: 100+ task-specific evaluators, deterministic 0-1 scores, and no API calls.
Each VBVR-Bench evaluator is selected by the generator name in the directory path. For example, `O-9_shape_scaling_data-generator` maps to the shape-scaling evaluator. The evaluator reads the generated video and the reference data, then scores the video.
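The selection is purely name-based, so it can be illustrated in a few lines. A minimal sketch, assuming a hypothetical `EVALUATORS` registry keyed by generator name (the real registry and evaluator signatures live inside VBVR-Bench):

```python
from pathlib import Path

# Hypothetical registry for illustration: generator name -> scoring function.
EVALUATORS = {
    "O-9_shape_scaling_data-generator": lambda sample_dir: 0.0,  # placeholder
}

def select_evaluator(sample_dir: str):
    """Return the evaluator registered for the generator-name path component."""
    for part in Path(sample_dir).parts:
        if part in EVALUATORS:
            return EVALUATORS[part]
    raise KeyError(f"no evaluator matches any component of {sample_dir!r}")

evaluator = select_evaluator(
    "outputs_rubrics/svd/O-9_shape_scaling_data-generator/"
    "shape_scaling_task/shape_scaling_00000000/default"
)
```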
1. Generate questions using a VBVR-DataFactory data-generator:

```bash
git clone https://github.com/VBVR-DataFactory/O-9_shape_scaling_data-generator.git
cd O-9_shape_scaling_data-generator && pip install -r requirements.txt
python examples/generate.py --num-samples 10 --seed 42 --output /path/to/questions
```

2. Run inference to generate videos:

```bash
python examples/generate_videos.py --questions-dir ./questions --output-dir ./outputs --model svd
```

3. Organize for evaluation. The evaluator expects this directory structure:
```
outputs_rubrics/
└── {model_name}/
    └── {generator_name}/          # e.g., O-9_shape_scaling_data-generator
        └── {task_type}/           # e.g., shape_scaling_task
            └── {task_id}/         # e.g., shape_scaling_00000000
                └── {run_id}/      # any name (e.g., "default")
                    ├── video/
                    │   └── output.mp4        # model-generated video
                    └── question/
                        ├── first_frame.png   # reference first frame
                        ├── final_frame.png   # reference final frame
                        ├── prompt.txt        # text prompt
                        └── ground_truth.mp4  # GT video (optional)
```
The generator-name directory (e.g., `O-9_shape_scaling_data-generator`) must match a VBVR-Bench task name so that the correct rule-based evaluator is selected. Each sample needs both the generated `video/` and the reference `question/` files.
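Before scoring, it can save time to catch layout mistakes up front. A minimal sketch (not part of the toolkit) that walks the expected five-level hierarchy and reports run directories missing required files:

```python
from pathlib import Path

REQUIRED = ["video/output.mp4", "question/first_frame.png",
            "question/final_frame.png", "question/prompt.txt"]

def check_layout(root: str) -> list[str]:
    """Return run directories under root that are missing required files."""
    problems = []
    # Run dirs sit 5 levels deep: model/generator/task_type/task_id/run_id.
    for run_dir in Path(root).glob("*/*/*/*/*"):
        if not run_dir.is_dir():
            continue
        missing = [f for f in REQUIRED if not (run_dir / f).is_file()]
        if missing:
            problems.append(f"{run_dir}: missing {', '.join(missing)}")
    return problems

for line in check_layout("./outputs_rubrics"):
    print(line)
```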
4. Run evaluation:

```bash
python examples/score_videos.py --inference-dir ./outputs_rubrics
```

Multiple generators can be placed under the same root; the evaluator walks all of them automatically:
```
outputs_rubrics/
└── svd/
    ├── G-3_stable_sort_data-generator/
    │   └── stable_sort_task/...
    ├── O-9_shape_scaling_data-generator/
    │   └── shape_scaling_task/...
    └── G-15_maze_solving_data-generator/
        └── maze_solving_task/...
```
```bash
python examples/score_videos.py --inference-dir ./outputs_rubrics
```

By default, only the `task_specific` score is returned (reasoning correctness). Use `--full-score` for the weighted combination of all 5 dimensions (a sketch of the weighting follows the table):
| Dimension | Weight | What it measures |
|---|---|---|
| `task_specific` | 25% | Task-specific reasoning logic |
| `final_frame_accuracy` | 35% | Does the final frame match the expected result? |
| `first_frame_consistency` | 15% | Does the first frame match the input image? |
| `temporal_smoothness` | 15% | Are frame transitions smooth? |
| `visual_quality` | 10% | Sharpness and noise levels |
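The weighted combination itself is simple arithmetic. A minimal sketch using the weights from the table above (the actual combination code inside VBVR-Bench may differ in details such as rounding):

```python
WEIGHTS = {
    "task_specific": 0.25,
    "final_frame_accuracy": 0.35,
    "first_frame_consistency": 0.15,
    "temporal_smoothness": 0.15,
    "visual_quality": 0.10,
}

def full_score(dimensions: dict[str, float]) -> float:
    """Weighted combination of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)

# Example: strong final frame, mediocre task logic.
print(round(full_score({
    "task_specific": 0.5,
    "final_frame_accuracy": 1.0,
    "first_frame_consistency": 1.0,
    "temporal_smoothness": 0.9,
    "visual_quality": 0.8,
}), 4))  # 0.5*0.25 + 1.0*0.35 + 1.0*0.15 + 0.9*0.15 + 0.8*0.10 = 0.84
```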
```bash
python examples/score_videos.py --inference-dir ./outputs_rubrics                       # basic
python examples/score_videos.py --inference-dir ./outputs_rubrics --full-score          # all 5 dimensions
python examples/score_videos.py --inference-dir ./outputs_rubrics --device cpu          # CPU mode
python examples/score_videos.py --inference-dir ./outputs_rubrics --gt-base-path /path  # external GT data
python examples/score_videos.py --inference-dir ./outputs_rubrics -o ./my_evals         # custom output dir
```

| Flag | Default | Description |
|---|---|---|
| `-i, --inference-dir` | (required) | Root directory with the structure above |
| `-o, --eval-output-dir` | `./evaluations/rubrics` | Where to write result JSONs |
| `-g, --gt-base-path` | `None` | External GT data path (optional) |
| `--device` | `cuda` | `cuda` or `cpu` |
| `--full-score` | off | Score all 5 dimensions instead of just `task_specific` |
You can also use the runner module directly:

```bash
python -m vbvrevalkit.runner.score --inference-dir ./outputs_rubrics
```
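To drive scoring from another script, one option is to shell out to that module. A minimal sketch using only the documented CLI flags:

```python
import subprocess
import sys

# Run the documented module CLI; flags match examples/score_videos.py.
result = subprocess.run(
    [sys.executable, "-m", "vbvrevalkit.runner.score",
     "--inference-dir", "./outputs_rubrics",
     "-o", "./my_evals"],
)
if result.returncode != 0:
    raise SystemExit(f"scoring failed with exit code {result.returncode}")
```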
The evaluator writes two types of files.

Per-sample (`VBVRBenchEvaluator.json`):

```json
{
  "metadata": {
    "evaluator": "VBVRBenchEvaluator",
    "model_name": "svd",
    "task_type": "O-9_shape_scaling_data-generator/shape_scaling_task",
    "task_id": "shape_scaling_00000000"
  },
  "result": {
    "score": 0.8667,
    "dimensions": { "task_specific": 0.8667 },
    "details": {
      "task_specific_details": {
        "element_preservation": 0.6667,
        "scaling_ratio": 1.0,
        "shape_type_matching": 1.0,
        "position_correctness": 1.0
      }
    },
    "evaluation_type": "rubrics",
    "vbvr_task_name": "O-9_shape_scaling_data-generator"
  }
}
```
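Because per-sample results are plain JSON, post-hoc filtering is straightforward. A minimal sketch that collects every `VBVRBenchEvaluator.json` under the eval output directory and prints samples below a threshold (keys taken from the example above):

```python
import json
from pathlib import Path

def low_scoring_samples(eval_dir: str, threshold: float = 0.5):
    """Yield (task_id, score) for per-sample results below threshold."""
    for path in Path(eval_dir).rglob("VBVRBenchEvaluator.json"):
        data = json.loads(path.read_text())
        score = data["result"]["score"]
        if score < threshold:
            yield data["metadata"]["task_id"], score

for task_id, score in low_scoring_samples("./evaluations/rubrics"):
    print(f"{task_id}: {score:.4f}")
```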
Summary (`VBVRBenchEvaluator_summary.json`):

```json
{
  "global_statistics": {
    "total_models": 1,
    "total_samples": 10,
    "mean_score": 0.8667
  },
  "models": {
    "svd": {
      "model_statistics": { "mean_score": 0.8667, "total_samples": 10 },
      "by_category": { "Transformation": { "mean_score": 0.8667 } },
      "by_split": { "Out_of_Domain": { "mean_score": 0.8667 } }
    }
  }
}
```

Breakdowns include:
- Model: Mean, median, std per model
- Category: 6 categories (Abstraction, Categorization, Navigation, Perception, Physics, Transformation)
- Split: In_Domain / Out_of_Domain (50 tasks each)
Evaluation is resumable: re-running the same command skips already-scored tasks.
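For quick reporting, the summary can be loaded directly. A minimal sketch, assuming the summary JSON sits at the top of the default eval output directory (adjust the path if you passed `-o`):

```python
import json
from pathlib import Path

# Assumed location: top of the default eval output directory.
summary = json.loads(
    Path("./evaluations/rubrics/VBVRBenchEvaluator_summary.json").read_text()
)
print(f"overall mean: {summary['global_statistics']['mean_score']:.4f}")
for model, stats in summary["models"].items():
    ms = stats["model_statistics"]
    print(f"{model}: mean={ms['mean_score']:.4f} over {ms['total_samples']} samples")
```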