Qineng Wang1*, Wenlong Huang2*, Yu Zhou3, Hang Yin2, Tianwei Bao1, Jianwen Lyu1, Weiyu Liu2
Ruohan Zhang2†, Jiajun Wu2†, Li Fei-Fei2†, Manling Li1†
*Equal contribution, †Equal advising
1Northwestern University, 2Stanford University, 3UCLA
ENACT is a benchmark that evaluates embodied cognition through world modeling from egocentric interaction. It is designed to be simple, with a scalable dataset for evaluating forward and inverse dynamics in embodied AI systems.
The benchmark tests models on their ability to:
- Forward World Modeling: Predict the correct sequence of future states given a current state and a series of actions
- Inverse World Modeling: Infer the correct sequence of actions that led from an initial state to a sequence of observed future states
- Environment Installation
- Data Download
- Data Evaluation
- Optional: Generate Data Yourself
- Simulator Installation
⚠️ IMPORTANT: If you plan to use the BEHAVIOR-1K simulator for data generation (replaying HDF5 files), skip step 2 and jump directly to the Simulator Installation section below. The simulator setup creates its own conda environment with all required dependencies. After installing the simulator environment, return to step 3.
git clone [email protected]:QinengWang-Aiden/ENACT.git
cd ENACT/

Create a new conda environment named `enact` with Python 3.10:
conda create -n enact python=3.10 -y
conda activate enact

Install the package in editable mode:
pip install -e .
# Verify installation
enact --help

By default, ENACT downloads the ENACT QA dataset, which contains question-answer pairs with images for VLM evaluation. You can optionally download additional datasets such as HDF5 files, replayed activities, and segmented activities.
# Download only ENACT QA
python scripts/helpers/download_dataset.py
# Download ALL datasets
python scripts/helpers/download_dataset.py --all

This downloads the QA dataset (approximately 17 GB) to data/QA/ by default.
Complete options
# Download only ENACT QA dataset (default)
python scripts/helpers/download_dataset.py --output-dir ./data
# Skip ENACT QA dataset if you don't need it
python scripts/helpers/download_dataset.py --no-enact
# Download HDF5 dataset (raw simulation recordings)
python scripts/helpers/download_dataset.py --hdf5
# Download replayed activities (extracted scene graphs and frames)
python scripts/helpers/download_dataset.py --replayed
# Download segmented activities (segmented scene graphs)
python scripts/helpers/download_dataset.py --segmented

Dataset Descriptions:
- ENACT QA (default, ~17 GB): Contains `enact_ordering.jsonl` with 8,972 QA pairs and associated images for evaluation
- HDF5 (Optional): Raw simulation recordings from the BEHAVIOR-1K simulator
- Replayed Activities (Optional): Scene graphs and extracted frames from replayed HDF5 files
- Segmented Activities (Optional): Segmented scene graphs with action boundaries identified
After downloading, your data/ directory will contain:
data/
├── QA/ # ENACT QA dataset
│ ├── enact_ordering.jsonl # 8972 QA pairs
│ └── images/ # Associated images
│ ├── forward_world_modeling_ordering_3_steps/
│ ├── forward_world_modeling_ordering_4_steps/
│ ├── ...
│ ├── inverse_world_modeling_ordering_3_steps/
│ └── ...
├── raw_hdf5/ # (Optional) Raw simulation data
├── replayed_activities/ # (Optional) Extracted scene graphs
└── segmented_activities/ # (Optional) Segmented frames
Each line in enact_ordering.jsonl contains a QA instance with the following structure.
Key Fields:
- `id`: Unique identifier for this QA instance
- `type`: Question type (forward/inverse world modeling with N steps)
- `images`: List of image paths; the first is the current state, the rest are shuffled future states
- `question`: Full prompt with the task description and actions
- `gt_answer`: Ground truth ordering (e.g., `[2, 1]` means the correct order is image 2, then image 1)
Example input format
{
"id": "task_name_type_hash",
"type": "forward_world_modeling_ordering_3_steps",
"task_name": "assembling_gift_baskets_1749468508582193",
"key_frame_ids": ["16084", "18290", "18501"],
"images": [
"QA/images/.../cur_state.png",
"QA/images/.../next_state_1.png",
"QA/images/.../next_state_2.png"
],
"question": "You are a capable agent...",
"options": [],
"gt_answer": [2, 1]
}

Your model should generate a JSONL file where each line contains the original fields plus an `answer` field.
Requirements:
- All fields except `answer` must match the input `enact_ordering.jsonl`
- `answer` should be a string containing a parsable list (e.g., `"[2, 1]"` instead of `[2, 1]`)
- Recommended naming: `enact_ordering_{model_name}.jsonl`
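A minimal sketch of a prediction script that follows these requirements (the `run_model` function is a placeholder for your own VLM call; everything else copies the input fields and adds the stringified `answer`):

```python
import json

def run_model(question: str, image_paths: list[str]) -> list[int]:
    """Placeholder: call your VLM here and return a predicted ordering, e.g. [2, 1]."""
    raise NotImplementedError

model_name = "mymodel"  # used for the recommended output file naming
with open("data/QA/enact_ordering.jsonl") as fin, \
     open(f"enact_ordering_{model_name}.jsonl", "w") as fout:
    for line in fin:
        item = json.loads(line)
        ordering = run_model(item["question"], item["images"])
        item["answer"] = str(ordering)  # keep all original fields, add `answer` as a parsable string
        fout.write(json.dumps(item) + "\n")
```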
Example model output format
{
"id": "task_name_type_hash",
"type": "forward_world_modeling_ordering_3_steps",
"task_name": "assembling_gift_baskets_1749468508582193",
"key_frame_ids": ["16084", "18290", "18501"],
"gt_answer": [2, 1],
"answer": "[2, 1]"
}

# single file evaluation
enact eval your_model_output.jsonl
# batch file evaluation
# the evaluator will look for files matching pattern "enact_ordering_*.jsonl"
enact eval model_outputs_directory/

Complete version with all options
# Specify custom data paths
enact eval your_model_output.jsonl \
--segmented-data data/segmented_activities \
--raw-data data/replayed_activities \
--output-root data/evaluation
# Enable detailed wrong case output
enact eval your_model_output.jsonl --analyze-wrong-cases
# Preview what would be evaluated without running
enact eval your_model_output.jsonl --dry-run

Arguments:
- `input_path`: Path to a JSONL file or a directory containing JSONL files
- `--segmented-data`: Path to segmented activities (default: `data/segmented_activities`)
- `--raw-data`: Path to replayed activities (default: `data/replayed_activities`)
- `--output-root`: Where to save evaluation results (default: `data/evaluation`)
- `--analyze-wrong-cases`: Generate detailed signatures for incorrect predictions
- `--dry-run`: Show what would be evaluated without actually processing
After evaluation, results are saved to the output directory (default: data/evaluation/):
data/evaluation/
├── batch_evaluation_summary.json # Overall summary across all models
├── meta_performance/ # Summary metrics per model
│ └── enact_ordering_modelname.json
├── detailed_eval/ # Per-sample detailed results (JSONL)
│ └── enact_ordering_modelname.jsonl
└── signatures/ # (If --analyze-wrong-cases enabled, JSONL)
└── enact_ordering_modelname.jsonl
Note: The evaluator extracts the model name from the input filename. For example:
- Input: `enact_ordering_gpt-4.jsonl` → Output files: `enact_ordering_gpt-4.json` / `.jsonl`
- Input: `my_model_predictions.jsonl` → Model name: `my_model_predictions`
Contains aggregated metrics with overall and per-task-type breakdowns.
Key Metrics:
- `model_name`: Name of the model being evaluated (extracted from the filename)
- `overall_performance.overall`: Overall performance across all question types
  - `count`: Total number of QA instances evaluated
  - `task_accuracy`: Percentage of correctly ordered sequences (exact match)
  - `pairwise_accuracy`: Percentage of correct pairwise orderings
- `forward_world_modeling` / `inverse_world_modeling`: Breakdown by dynamics type
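These summary files are plain JSON, so they are easy to inspect programmatically. A small sketch (assuming the default `data/evaluation/meta_performance/` location) that prints the overall metrics for every evaluated model:

```python
import json
from pathlib import Path

for path in sorted(Path("data/evaluation/meta_performance").glob("*.json")):
    summary = json.loads(path.read_text())
    overall = summary["overall_performance"]["overall"]
    print(f'{summary["model_name"]}: '
          f'task_accuracy={overall["task_accuracy"]:.4f}, '
          f'pairwise_accuracy={overall["pairwise_accuracy"]:.4f} '
          f'(n={overall["count"]})')
```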
Example JSON output
{
"model_name": "human",
"overall_performance": {
"overall": {
"count": 8972,
"task_accuracy": 0.8859786000891663,
"pairwise_accuracy": 0.9492396096497747
},
"forward_world_modeling": {
"count": 4486,
"task_accuracy": 0.879402585822559,
"pairwise_accuracy": 0.9481513916311064
},
"inverse_world_modeling": {
"count": 4486,
"task_accuracy": 0.8925546143557735,
"pairwise_accuracy": 0.9503278276684429
}
}
}

Contains per-sample results with individual predictions and correctness (JSONL format, one JSON object per line).
Key Fields:
- `eval_metrics`: Multiple accuracy measures
  - `exact_match`: Whether the full sequence matches exactly
  - `semantic_match`: Whether the meaning matches (allows reordering of simultaneous events)
  - `task_accuracy`: Task-level correctness (same as `exact_match`)
  - `pairwise_accuracy`: Percentage of correct pairwise orderings (partial credit)
- `ground_truth`: Correct ordering
- `model_answer`: Model's predicted ordering
- `raw_answer`: Raw string output from the model
- `wrong_case_analysis`: Detailed breakdown (always included, even for correct answers)
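Because the detailed results are line-delimited JSON, you can aggregate them however you like. For example, a short sketch that computes exact-match accuracy per question type (the filename is hypothetical):

```python
import json
from collections import defaultdict

totals, correct = defaultdict(int), defaultdict(int)
with open("data/evaluation/detailed_eval/enact_ordering_mymodel.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        totals[sample["type"]] += 1
        correct[sample["type"]] += int(sample["eval_metrics"]["exact_match"])

for qtype in sorted(totals):
    print(f"{qtype}: {correct[qtype] / totals[qtype]:.3f} exact match ({totals[qtype]} samples)")
```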
Example JSONL entry
{
"id": "assembling_gift_baskets_1749468508582193_forward_dynamics_ordering_3_steps_5dc7cfd5",
"task_name": "assembling_gift_baskets_1749468508582193",
"type": "forward_dynamics_ordering_3_steps",
"eval_metrics": {
"exact_match": false,
"semantic_match": false,
"task_accuracy": false,
"pairwise_accuracy": 0.5
},
"ground_truth": [2, 1],
"model_answer": [1, 2],
"raw_answer": "[1, 2]",
"wrong_case_analysis": {
"id": "...",
"type": "...",
"key_frame_ids": ["16084", "18290", "18501"],
"gt_answer": [2, 1],
"parsed_answer": [1, 2],
"correct_signatures": [["edge_add_..."], ["edge_remove_..."]],
"input_signatures": [["edge_remove_...", "edge_add_..."], ["edge_add_..."]],
"correct_natural_language": ["Action 1 description", "Action 2 description"],
"input_natural_language": ["Wrong action 1", "Wrong action 2"]
}
}

When `--analyze-wrong-cases` is enabled, the evaluator generates a detailed analysis with action signatures (JSONL format, one JSON object per line).
Signature Analysis Fields:
- `correct_signatures`: The actual state changes at each step (as edge operations)
- `input_signatures`: The state changes predicted by the model
- `correct_natural_language`: Human-readable descriptions of the correct transitions
- `input_natural_language`: Human-readable descriptions of the model's predictions
- `equal_length`: Whether the model output has the correct number of steps
This file helps you understand why the model made mistakes by comparing the predicted state transitions with the ground truth.
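One convenient way to use it is to print the ground-truth and predicted transitions side by side for the cases a model got wrong. A minimal sketch (the signature filename is hypothetical):

```python
import json

with open("data/evaluation/signatures/enact_ordering_mymodel.jsonl") as f:
    for line in f:
        case = json.loads(line)
        if case["eval_metrics"]["exact_match"]:
            continue  # only inspect wrong cases
        print(f'== {case["id"]} (gt={case["gt_answer"]}, predicted={case["parsed_answer"]})')
        steps = zip(case["correct_natural_language"], case["input_natural_language"])
        for i, (gt_desc, pred_desc) in enumerate(steps, start=1):
            print(f"  step {i} correct:   {gt_desc}")
            print(f"  step {i} predicted: {pred_desc}")
```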
Example JSONL entry
{
"id": "assembling_gift_baskets_1749468508582193_forward_dynamics_ordering_3_steps_5dc7cfd5",
"type": "forward_dynamics_ordering_3_steps",
"task_name": "assembling_gift_baskets_1749468508582193",
"key_frame_ids": ["16084", "18290", "18501"],
"gt_answer": [2, 1],
"parsed_answer": [1, 2],
"raw_answer": "[1, 2]",
"eval_metrics": {
"exact_match": false,
"semantic_match": false,
"task_accuracy": false,
"pairwise_accuracy": 0.5
},
"equal_length": true,
"correct_signatures": [
["edge_add_the robot r1_the butter cookie_LeftGrasping"],
["edge_remove_the butter cookie_the coffee table_OnTop"]
],
"input_signatures": [
["edge_remove_the butter cookie_the coffee table_OnTop", "edge_add_the robot r1_the butter cookie_LeftGrasping"],
["edge_add_the butter cookie_the coffee table_OnTop"]
],
"correct_natural_language": [
"The robot r1 changes to be using the left gripper to grasp the butter cookie.",
"The butter cookie stopped being on top of and touching the coffee table."
],
"input_natural_language": [
"The robot r1 changes to be using the left gripper to grasp the butter cookie. The butter cookie is no longer on top of and touching the coffee table.",
"The butter cookie transitions to be on top of and touching the coffee table."
]
}

When evaluating a directory with multiple model outputs, a `batch_evaluation_summary.json` is created. This provides a quick comparison across all evaluated models.
Example JSON output
{
"total_processed": 2,
"successful": 2,
"failed": 0,
"results": [
{
"model_name": "gpt-5-mini-2025-08-07",
"status": "success",
"overall_stats": {
"count": 8972,
"task_accuracy": 0.3695,
"pairwise_accuracy": 0.6474
}
},
{
"model_name": "human",
"status": "success",
"overall_stats": {
"count": 8972,
"task_accuracy": 0.8860,
"pairwise_accuracy": 0.9492
}
}
]
}

# 1. Download the ENACT QA dataset
python scripts/helpers/download_dataset.py
# 2. Run your model on data/QA/enact_ordering.jsonl to generate predictions
# Your model should output: enact_ordering_mymodel.jsonl
# 3. Evaluate your predictions
enact eval enact_ordering_mymodel.jsonl --analyze-wrong-cases
# 4. Check results
cat data/evaluation/meta_performance/enact_ordering_mymodel.json
# 5. For batch evaluation of multiple models
enact eval model_outputs_directory/ --analyze-wrong-cases
cat data/evaluation/batch_evaluation_summary.json

The ENACT dataset generation follows a multi-stage pipeline. You can start from any stage, as we provide official intermediate datasets for each stage. Only Stage 1 (replaying HDF5 files) requires the BEHAVIOR-1K simulator.
Stage 0 (Optional): Collect Robot Data → raw_hdf5/
↓ (requires simulator)
Stage 1 (Optional): Replay HDF5 → replayed_activities/ (mp4 + scene_graph)
↓
Stage 1.5: Extract Frames → replayed_activities/*/external_sensor1/
↓
Stage 2: Segment Activities → segmented_activities/ (key frames only)
↓
Stage 3: Generate QA → QA/enact_ordering.jsonl
Official Data Sources:
- raw_hdf5: Google Drive (Ours) or Behavior HuggingFace (29 tasks, 200 trajectories each)
- replayed_activities: Google Drive
- segmented_activities: Google Drive
- QA dataset: HuggingFace (default)
Use Official Data Instead:
- Option 1 - Our curated dataset (subset):
python scripts/helpers/download_dataset.py --hdf5
- Option 2 - Full HuggingFace dataset (29 tasks × 200 trajectories):
- Visit: https://huggingface.co/datasets/behavior-1k/2025-challenge-rawdata
- This includes all available HDF5 datasets used in the BEHAVIOR Challenge.
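If you prefer to fetch Option 2 programmatically instead of through the website, a sketch using the `huggingface_hub` library (an extra dependency, not installed by ENACT) could look like this:

```python
from huggingface_hub import snapshot_download

# The full BEHAVIOR Challenge raw-data repository is large; consider
# allow_patterns to restrict the download to the tasks you actually need.
snapshot_download(
    repo_id="behavior-1k/2025-challenge-rawdata",
    repo_type="dataset",
    local_dir="data/raw_hdf5",
)
```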
Output: data/raw_hdf5/ containing HDF5 simulation recordings
This stage replays HDF5 files in the simulator to extract:
- Scene graphs (object relationships and states at each timestep)
- MP4 video (egocentric camera view)
Run Replay (Single File):
# After installing simulator
python scripts/helpers/replay_hdf5.py --file data/raw_hdf5/task_name.hdf5 --output_dir data/replayed_activities

Run Replay (Batch Mode - All Files):
# Processes all HDF5 files in data/raw_hdf5/
bash scripts/helpers/batch_replay_hdf5.sh

Or Download Official Replayed Data:
python scripts/helpers/download_dataset.py --replayed

Output Structure: data/replayed_activities/
replayed_activities/
├── assembling_gift_baskets_1749468508582193/
│ ├── external_sensor1.mp4 # Egocentric video
│ └── scene_graph_0.json # Scene graph data
└── bringing_water_1750844141719178/
├── external_sensor1.mp4
└── scene_graph_0.json
No simulator required. Extract PNG frames from the MP4 videos produced in Stage 1. This step is required before segmentation.
Input: data/replayed_activities/ with MP4 files
Extract Frames (Single Task):
python scripts/helpers/frame_extraction.py --task_folder data/replayed_activities/assembling_gift_baskets_1749468508582193

Extract Frames (Batch Mode - All Tasks):
python scripts/helpers/frame_extraction.py --task_folder data/replayed_activities

Skip Already Processed:
python scripts/helpers/frame_extraction.py --task_folder data/replayed_activities --skip_existing

Output: Frames are extracted into an external_sensor1/ subfolder in each task directory:
replayed_activities/
├── assembling_gift_baskets_1749468508582193/
│ ├── external_sensor1.mp4
│ ├── scene_graph_0.json
│ └── external_sensor1/ # New: extracted frames
│ ├── 00001.png
│ ├── 00002.png
│ └── ...
└── bringing_water_1750844141719178/
└── ...
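If you need to adapt frame extraction to your own pipeline, the core of what this stage does can be approximated with OpenCV (a rough sketch, not the repository's `frame_extraction.py`):

```python
import cv2
from pathlib import Path

video_path = Path("data/replayed_activities/assembling_gift_baskets_1749468508582193/external_sensor1.mp4")
out_dir = video_path.with_suffix("")  # .../external_sensor1/
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture(str(video_path))
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    # 1-based, zero-padded naming to match the layout shown above (00001.png, 00002.png, ...)
    cv2.imwrite(str(out_dir / f"{frame_idx:05d}.png"), frame)
cap.release()
```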
No simulator required. This stage processes scene graphs to identify key frames where significant state changes occur (action boundaries), then copies the corresponding frames.
Input:
data/replayed_activities/with extracted frames (from Stage 1.5)- Scene graph JSON files
Run Segmentation:
# Basic usage (uses default paths)
enact segment
# Custom paths
enact segment data/replayed_activities data/segmented_activities
# Preview before processing
enact segment --dry-run

Or Download Official Segmented Data:
python scripts/helpers/download_dataset.py --segmented

Output Structure: data/segmented_activities/
segmented_activities/
├── assembling_gift_baskets_1749468508582193/
│ ├── external_sensor1/ # Segmented key frames
│ │ ├── 00059.png
│ │ ├── 00705.png
│ │ ├── 00916.png
│ │ └── ... # 53 key frames total
│ └── segmented_scene_graph_0.json # Scene graph with only key frames
├── canning_food_1751278778230696/
│ ├── external_sensor1/ # 78 key frames
│ │ └── ...
│ └── segmented_scene_graph_0.json
└── bringing_water_1750844141719178/
├── external_sensor1/ # 15 key frames
│ └── ...
└── segmented_scene_graph_0.json
Note: Each task typically has 15-80 segmented frames representing key action boundaries. For example, canning_food has 78 segmented frames, which can generate over 0.5 billion possible 10-step ordering questions.
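Conceptually, segmentation keeps only the frames where the scene graph changes. A heavily simplified sketch of that idea, assuming each frame's scene graph has been reduced to a set of `(subject, relation, object)` edges (the real processor works on the scene graph JSON and handles more cases):

```python
def segment_key_frames(frames: dict[str, set[tuple[str, str, str]]]) -> list[str]:
    """Return frame ids whose relation set differs from the previous frame's (action boundaries)."""
    key_frames, prev_edges = [], None
    for frame_id in sorted(frames, key=int):
        edges = frames[frame_id]
        if prev_edges is not None and edges != prev_edges:
            key_frames.append(frame_id)  # something was added or removed at this frame
        prev_edges = edges
    return key_frames

# Toy example: the grasp appearing at frame "3" makes it a key frame.
frames = {
    "1": {("cookie", "OnTop", "table")},
    "2": {("cookie", "OnTop", "table")},
    "3": {("cookie", "OnTop", "table"), ("robot_r1", "LeftGrasping", "cookie")},
}
print(segment_key_frames(frames))  # ['3']
```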
No simulator required. This stage samples state transitions from segmented data to create forward and inverse world modeling questions.
Input:
data/segmented_activities/(from Stage 2 or downloaded)data/replayed_activities/(for extracting images)
Run QA Generation:
# Basic usage (uses default paths)
enact qa
# Custom paths
enact qa data/segmented_activities data/replayed_activities data/QA/enact_ordering.jsonl
# Control sampling
enact qa --seed 42 --num-to-sample 10
# Preview before generating
enact qa --dry-run

Or Download Official QA Dataset:
python scripts/helpers/download_dataset.py  # Downloads QA by default

Output:
- `data/QA/enact_ordering.jsonl` - 8,972 QA pairs (in our paper's version)
- `data/QA/images/` - Organized by question type
Data Generation Scale:
For example, a task like Canning Food with 78 segmented frames can generate over 0.5 billion possible 10-step ordering questions. Our sampling strategy ensures diverse and challenging questions while maintaining computational feasibility.
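To make the sampling idea concrete, here is a toy sketch of how an N-step ordering question could be assembled from a sequence of key frames. It illustrates the `gt_answer` semantics only; the official generator also attaches prompts, images, and action descriptions, and its exact sampling strategy may differ:

```python
import random

def make_ordering_question(key_frames: list[str], n_steps: int, rng: random.Random) -> dict:
    """Pick n_steps + 1 consecutive key frames, shuffle the futures, record the ground truth."""
    start = rng.randrange(len(key_frames) - n_steps)
    window = key_frames[start : start + n_steps + 1]
    current, futures = window[0], window[1:]
    order = list(range(1, n_steps + 1))
    rng.shuffle(order)  # order[j] = true step shown as presented image j+1
    shuffled = [futures[k - 1] for k in order]
    # gt_answer[k-1] = presented position (1-based) of the k-th true future state
    gt_answer = [order.index(k) + 1 for k in range(1, n_steps + 1)]
    return {"current": current, "shuffled_futures": shuffled, "gt_answer": gt_answer}

rng = random.Random(42)
print(make_ordering_question(["16084", "18290", "18501", "19002"], n_steps=2, rng=rng))
```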
Example QA entry structure
Each generated QA instance includes:
- Question prompt: Instructions for the model
- Images: Current state + shuffled future state images
- Actions: Ordered list of state transitions
- Ground truth: Correct ordering of future states
See Data Evaluation section for detailed format.
Example 1: Start from raw HDF5 (requires simulator)
# 1. Install simulator (see Simulator Installation section)
# 2. Download HDF5 files
python scripts/helpers/download_dataset.py --hdf5
# 3. Replay HDF5 in simulator (batch mode)
bash scripts/helpers/batch_replay_hdf5.sh
# Or single file:
# python scripts/helpers/replay_hdf5.py --file data/raw_hdf5/task.hdf5 --output_dir data/replayed_activities
# 4. Extract frames from videos
python scripts/helpers/frame_extraction.py --task_folder data/replayed_activities
# 5. Segment activities
enact segment
# 6. Generate QA
enact qa --seed 42

Example 2: Start from replayed activities (no simulator needed)
# 1. Download replayed activities
python scripts/helpers/download_dataset.py --replayed
# 2. Extract frames from videos
python scripts/helpers/frame_extraction.py --task_folder data/replayed_activities
# 3. Segment activities
enact segment
# 4. Generate QA
enact qa --seed 42

Example 3: Start from segmented activities (no simulator needed)
# 1. Download segmented activities and replayed activities (for images)
python scripts/helpers/download_dataset.py --segmented --replayed
# 2. Generate QA
enact qa --seed 42

Example 4: Only evaluate on official QA dataset (no generation)
# 1. Download QA dataset (default)
python scripts/helpers/download_dataset.py
# 2. Run your model and evaluate
enact eval your_model_output.jsonl

Only required if you want to replay HDF5 files (Stage 1). The BEHAVIOR-1K simulator setup will create its own conda environment with all dependencies including OmniGibson, BDDL, and datasets.
⚠️ Important: If you already created an `enact` conda environment following the earlier steps but want to use the simulator later, you may delete your old environment and reinstall with the simulator installation script.
1. Initialize BEHAVIOR-1K submodule
cd ENACT/
git submodule update --init --recursive

2. Run BEHAVIOR-1K setup script
cd BEHAVIOR-1K/
./setup.sh --new-env --omnigibson --bddl --joylo --dataset

This command will:
- Create a new conda environment
- Install OmniGibson simulator
- Install BDDL (Behavior Domain Definition Language)
- Download necessary datasets for simulation
Setup time: ~30-60 minutes depending on your internet connection and hardware.
After setup completes, verify the installation:
Test 1: Launch Isaac Sim
conda activate enact
isaacsim

This should open the Isaac Sim GUI. Close it after confirming it launches.
Test 2: Run robot control example
python OmniGibson/omnigibson/examples/robots/robot_control_example.py

This should run a simulation with robot control.
After verifying simulator installation, return to the ENACT root directory:
cd ..
conda activate enact
pip install -e .

Now you can proceed to Stage 1 (Replay HDF5) to replay your HDF5 files.
# General help
enact --help
# Help for specific subcommands
enact segment --help
enact qa --help
enact eval --help

You can also import and use ENACT modules in your own Python code:
from enact.processors import SegmentationProcessor, EvaluatorProcessor
from enact.core.evaluators import OrderingEvaluator
# Segmentation
seg_processor = SegmentationProcessor(
input_root="data/replayed_activities",
output_root="data/segmented_activities"
)
seg_processor.process_all_tasks()
# Evaluation
eval_processor = EvaluatorProcessor(
input_path="model_output.jsonl",
segmented_data_dir="data/segmented_activities",
raw_data_dir="data/replayed_activities",
output_root="data/evaluation",
analyze_wrong_cases=True
)
eval_processor.process_all_files()

If you use ENACT in your research, please cite:
@article{enact2025,
title={ENACT: Embodied Cognition through World Modeling from Egocentric Interaction},
author={ENACT Team},
year={2025}
}

This project is licensed under the MIT License - see the LICENSE file for details.
ENACT builds upon the BEHAVIOR simulator.