-
Notifications
You must be signed in to change notification settings - Fork 4
Open
Description
Match-3 (Candy Crush Style) Task for VMEvalKit
📊 Overview
The Match-3 task evaluates video generation models' capacity for pattern recognition, cascade prediction, and spatial reasoning through simplified match-3 puzzle games. This task tests whether models can:
- Identify matching patterns - Recognize horizontal/vertical groups of 3+ identical elements
- Predict cascading effects - Understand gravity and chain reactions after matches
- Generate elimination sequences - Produce videos showing step-by-step matching and clearing
- Demonstrate physics understanding - Show elements falling and new elements appearing
The task uses simplified 3×3 to 5×5 grids with 4-6 color types as a minimal test case for pattern matching and cascade reasoning capabilities.
🎯 Task Description
Input Components
- First Frame: A grid (3×3 to 5×5) with colored blocks/tiles arranged randomly
- Prompt: Text instruction to match and eliminate groups of 3 or more identical elements
- Format: 600×600px PNG image at 150 DPI with clear grid lines and distinct colors
Expected Output
- Video Sequence: Animation showing matching, elimination, gravity effects, and cascades
- Final Frame: Grid after all possible matches are resolved (stable state)
- Solution Path: Clear demonstration of matching patterns and chain reactions
Core Rules
- Match Requirement: 3 or more identical elements in a row (horizontal) or column (vertical)
- Elimination: Matched elements are removed from the grid
- Gravity: Elements above empty spaces fall downward
- Refill: New random elements appear from the top
- Cascade: New matches created by falling elements trigger additional eliminations
🎨 Visual Design Specification
Grid Layout
┌───┬───┬───┬───┐
│ 🔴│ 🟡│ 🔴│ 🟢│ ← Row 0: No match
├───┼───┼───┼───┤
│ 🟡│ 🔴│ 🔴│ 🔴│ ← Row 1: Match! (3 reds)
├───┼───┼───┼───┤
│ 🟢│ 🟡│ 🟢│ 🟡│ ← Row 2: No match
├───┼───┼───┼───┤
│ 🔴│ 🟢│ 🟡│ 🟢│ ← Row 3: No match
└───┴───┴───┴───┘
Visual Elements
| Component | Specification | Purpose |
|---|---|---|
| Grid Lines | Black, 2px width | Clear cell boundaries |
| Color Blocks | 4-6 distinct colors (Red, Blue, Green, Yellow, Purple, Orange) | High contrast, easily distinguishable |
| Cell Size | 100×100px minimum | Clear visibility |
| Canvas | White background | Maximum contrast |
| Aspect Ratio | 1:1 square | Uniform cell sizing |
Color Palette
- Red: RGB(255, 0, 0) or #FF0000
- Blue: RGB(0, 0, 255) or #0000FF
- Green: RGB(0, 255, 0) or #00FF00
- Yellow: RGB(255, 255, 0) or #FFFF00
- Purple: RGB(128, 0, 128) or #800080
- Orange: RGB(255, 165, 0) or #FFA500
🧩 Mathematical Foundation
Match Detection Algorithm
def find_matches(grid):
matches = []
rows, cols = len(grid), len(grid[0])
# Check horizontal matches
for i in range(rows):
count = 1
for j in range(1, cols):
if grid[i][j] == grid[i][j-1]:
count += 1
else:
if count >= 3:
matches.extend([(i, k) for k in range(j-count, j)])
count = 1
if count >= 3:
matches.extend([(i, k) for k in range(cols-count, cols)])
# Check vertical matches
for j in range(cols):
count = 1
for i in range(1, rows):
if grid[i][j] == grid[i-1][j]:
count += 1
else:
if count >= 3:
matches.extend([(k, j) for k in range(i-count, i)])
count = 1
if count >= 3:
matches.extend([(k, j) for k in range(rows-count, rows)])
return set(matches)Gravity Simulation
def apply_gravity(grid):
rows, cols = len(grid), len(grid[0])
new_grid = [['EMPTY'] * cols for _ in range(rows)]
# For each column, move non-empty cells down
for j in range(cols):
write_idx = rows - 1
for i in range(rows - 1, -1, -1):
if grid[i][j] != 'EMPTY':
new_grid[write_idx][j] = grid[i][j]
write_idx -= 1
return new_gridRefill Algorithm
def refill_empty_cells(grid, colors):
rows, cols = len(grid), len(grid[0])
for i in range(rows):
for j in range(cols):
if grid[i][j] == 'EMPTY':
grid[i][j] = random.choice(colors)
return gridComplexity Analysis
- Grid Sizes: 3×3 (9 cells), 4×4 (16 cells), 5×5 (25 cells)
- Color Count: 4-6 distinct colors
- Match Patterns: Horizontal (3+), Vertical (3+), L-shapes, T-shapes
- Cascade Depth: 1-5 levels depending on initial configuration
🎚️ Difficulty Levels
Easy (Level 0)
- Grid Size: 3×3
- Colors: 4 types
- Initial Matches: 0-1 existing matches
- Cascade Depth: 0-1 levels
- Complexity: Simple single-match scenarios
Medium (Level 1)
- Grid Size: 4×4
- Colors: 5 types
- Initial Matches: 1-2 existing matches
- Cascade Depth: 1-2 levels
- Complexity: Multiple matches with simple cascades
Hard (Level 2)
- Grid Size: 5×5
- Colors: 6 types
- Initial Matches: 2-3 existing matches
- Cascade Depth: 2-5 levels
- Complexity: Complex cascades with multiple chain reactions
🧠 Reasoning Requirements
Level 1: Pattern Recognition
- Visual Parsing: Identify grid structure and color distribution
- Match Detection: Recognize groups of 3+ identical elements
- Spatial Mapping: Track row/column positions (0-based indexing)
Level 2: Match Identification
- Horizontal Scanning: Find consecutive same-color elements in rows
- Vertical Scanning: Find consecutive same-color elements in columns
- Overlap Handling: Identify matches that share cells
Level 3: Cascade Prediction
- Gravity Understanding: Predict which elements fall after elimination
- Refill Prediction: Anticipate new elements appearing from top
- Chain Reaction: Identify new matches created by falling elements
- Stability Detection: Recognize when no more matches are possible
Level 4: Strategic Reasoning
- Match Selection: Choose which matches to trigger first for maximum cascade
- Multi-step Planning: Plan sequence of matches for optimal clearing
- Constraint Satisfaction: Ensure all matches are valid and legal
📝 Data Structure
Core Task Representation
@dataclass
class Match3TaskPair:
# Identification
id: str # Unique task identifier (e.g., "match3_0042")
task_category: str # Always "Match3"
# Task Content
prompt: str # Instruction text for the model
first_image_path: str # Path to initial grid image
final_image_path: str # Path to final stable state image
# Grid Data
initial_grid: List[List[str]] # 2D grid with color codes
final_grid: List[List[str]] # 2D grid after all matches resolved
grid_size: Tuple[int, int] # (rows, cols) dimensions
num_colors: int # Number of distinct colors used
# Match Data
initial_matches: List[Tuple[int, int]] # Positions of initial matches
total_matches: int # Total number of matches in solution
cascade_depth: int # Maximum cascade level reached
# Metadata
difficulty: str # "easy", "medium", or "hard"
match3_data: Dict[str, Any] # Additional puzzle metadata
created_at: str # ISO timestamp of generationGrid Representation
Cells are indexed (row, col) with (0,0) at top-left:
Grid positions: Array representation:
┌───┬───┬───┐
│0,0│0,1│0,2│ grid = [
├───┼───┼───┤ ['R', 'B', 'G'], # Row 0
│1,0│1,1│1,2│ ['B', 'R', 'R'], # Row 1
├───┼───┼───┤ ['G', 'B', 'R'] # Row 2
│2,0│2,1│2,2│ ]
└───┴───┴───┘
Color Encoding
- 'R': Red
- 'B': Blue
- 'G': Green
- 'Y': Yellow
- 'P': Purple
- 'O': Orange
🔄 Generation Pipeline
1. Grid Initialization
def generate_initial_grid(rows, cols, num_colors):
colors = ['R', 'B', 'G', 'Y', 'P', 'O'][:num_colors]
grid = [[random.choice(colors) for _ in range(cols)]
for _ in range(rows)]
# Ensure at least one match exists
matches = find_matches(grid)
if not matches:
# Force a match by modifying cells
create_guaranteed_match(grid, colors)
return grid2. Match Resolution
def resolve_all_matches(grid):
steps = []
cascade_level = 0
while True:
matches = find_matches(grid)
if not matches:
break
# Remove matched cells
for row, col in matches:
grid[row][col] = 'EMPTY'
steps.append({
'level': cascade_level,
'matches': matches,
'grid_before': copy.deepcopy(grid)
})
# Apply gravity
grid = apply_gravity(grid)
# Refill empty cells
grid = refill_empty_cells(grid, colors)
cascade_level += 1
return grid, steps3. Image Generation
def create_grid_image(grid, filepath):
rows, cols = len(grid), len(grid[0])
cell_size = 100
img_size = (cols * cell_size, rows * cell_size)
img = Image.new('RGB', img_size, 'white')
draw = ImageDraw.Draw(img)
color_map = {
'R': (255, 0, 0),
'B': (0, 0, 255),
'G': (0, 255, 0),
'Y': (255, 255, 0),
'P': (128, 0, 128),
'O': (255, 165, 0)
}
# Draw grid lines
for i in range(rows + 1):
y = i * cell_size
draw.line([(0, y), (img_size[0], y)], fill='black', width=2)
for j in range(cols + 1):
x = j * cell_size
draw.line([(x, 0), (x, img_size[1])], fill='black', width=2)
# Fill cells with colors
for i in range(rows):
for j in range(cols):
if grid[i][j] != 'EMPTY':
x1, y1 = j * cell_size + 2, i * cell_size + 2
x2, y2 = (j + 1) * cell_size - 2, (i + 1) * cell_size - 2
draw.rectangle([x1, y1, x2, y2], fill=color_map[grid[i][j]])
img.save(filepath, 'PNG', dpi=(150, 150))4. Validation
def validate_solution(initial_grid, final_grid):
# Check grid dimensions match
if len(initial_grid) != len(final_grid):
return False
if len(initial_grid[0]) != len(final_grid[0]):
return False
# Check no matches remain in final grid
remaining_matches = find_matches(final_grid)
if remaining_matches:
return False
# Check all cells are filled (no EMPTY cells)
for row in final_grid:
if 'EMPTY' in row:
return False
return True📊 Evaluation Framework
Primary Metrics
| Metric | Description | Scoring |
|---|---|---|
| Match Correctness | All matches identified are valid (3+ same color) | Binary (0/1) |
| Cascade Accuracy | Gravity and refill effects correctly simulated | Binary (0/1) |
| Completeness | Final state has no remaining matches | Binary (0/1) |
| Visual Clarity | Clear color distinction and grid structure | Scale (1-5) |
| Sequence Coherence | Logical progression from initial to final state | Scale (1-5) |
Evaluation Methods
1. Automatic (GPT-4O Vision)
# Evaluation prompt for GPT-4O
prompt = """
Compare the video sequence with expected match-3 resolution.
Check if:
1. All matches are valid (3+ same color in row/column)
2. Gravity effects are correctly shown (elements fall down)
3. New elements appear from top after elimination
4. Final state has no remaining matches
5. Cascade reactions are properly demonstrated
Rate 1-5 based on correctness and clarity.
"""2. Human Evaluation
- Visual inspection of generated video
- Manual verification of match validity
- Assessment of cascade demonstration
- Evaluation of physics simulation accuracy
3. Programmatic Validation
def evaluate_solution(generated_video, expected_steps):
# Extract frames from video
frames = extract_frames(generated_video)
# Parse grid from each frame
grids = [parse_grid_from_image(frame) for frame in frames]
# Validate each transition
for i in range(len(grids) - 1):
current = grids[i]
next_grid = grids[i + 1]
# Check if transition is valid
matches = find_matches(current)
expected_next = apply_match_elimination(current, matches)
expected_next = apply_gravity(expected_next)
expected_next = refill_empty_cells(expected_next, colors)
if expected_next != next_grid:
return False
return True🚀 Usage Examples
Basic Dataset Generation
from vmevalkit.tasks.match3_task import create_dataset
# Generate 50 puzzles with mixed difficulties
dataset = create_dataset(num_samples=50)
# Access individual tasks
for task in dataset['pairs']:
print(f"Task {task['id']}: {task['grid_size']} grid")
print(f"Initial matches: {len(task['initial_matches'])}")
print(f"Cascade depth: {task['cascade_depth']}")Custom Generation
from vmevalkit.tasks.match3_task import Match3TaskGenerator
generator = Match3TaskGenerator()
# Generate single task with specific difficulty
task = generator.generate_single_task(
task_id="custom_001",
difficulty=2, # Hard level
grid_size=(5, 5),
num_colors=6
)Integration with Runner
# Via command line
python vmevalkit/runner/create_dataset.py \
--pairs-per-domain 100 \
--random-seed 42
# Programmatic
from vmevalkit.runner.create_dataset import generate_domain_to_folders
tasks = generate_domain_to_folders(
domain_name="match3",
num_samples=100,
output_base=Path("data/questions"),
random_seed=42
)🔧 Technical Implementation
Dependencies
- NumPy: Array operations and grid manipulation
- Pillow (PIL): Image processing and color rendering
- Matplotlib: Optional visualization for debugging
- Dataclasses: Structured data representation
File Structure
vmevalkit/tasks/match3_task/
├── __init__.py # Module exports and initialization
├── PROMPTS.py # Standardized prompt templates
├── match3_reasoning.py # Core implementation
└── match3.md # This documentation
Key Classes
| Class | Purpose | Key Methods |
|---|---|---|
Match3GridGenerator |
Grid generation logic | generate_initial_grid(), find_matches() |
Match3TaskGenerator |
Task orchestration | generate_single_task(), generate_dataset() |
Match3TaskPair |
Data structure | Dataclass with validation |
Match3Dataset |
Collection management | save(), load() |
🎯 Model Performance Insights
Expected Behaviors
Successful Solution
Frame 1: Shows initial grid with colored blocks
Frame 2-N: Progressive matching and elimination
- Matched blocks highlighted or removed
- Blocks falling down (gravity effect)
- New blocks appearing from top
- Cascade reactions triggered
Final Frame: Stable state with no remaining matches
Common Failure Modes
- Invalid Matches: Identifying matches with less than 3 elements
- Missing Cascades: Not showing gravity effects after elimination
- Incorrect Refill: New elements appearing from wrong direction
- Incomplete Resolution: Final state still contains matches
- Visual Artifacts: Unclear colors or distorted grid structure
- No Sequence Logic: Random transitions without cause-effect relationship
Performance Factors
- Pattern Recognition: Ability to identify matching groups
- Physics Understanding: Grasping gravity and cascade mechanics
- Sequential Reasoning: Tracking state changes across frames
- Spatial Awareness: Understanding grid structure and positions
📈 Future Enhancements
Proposed Improvements
- Special Blocks: Power-ups (bombs, rainbow blocks) for advanced gameplay
- Larger Grids: Extend to 6×6, 7×7 for increased complexity
- Diagonal Matches: Support diagonal matching patterns
- Time Constraints: Add time-based challenges
- Objective Types: Clear specific colors, reach target score, clear all blocks
- Multi-step Planning: Require strategic match selection for optimal clearing
- Animation Quality: Smooth transitions and particle effects
Research Questions
- Can models learn cascade prediction from examples?
- Do models show consistent understanding of gravity mechanics?
- How does performance scale with grid size and color count?
- Can models plan multi-step match sequences strategically?
- Do models understand chain reaction causality?
Metadata
Metadata
Assignees
Labels
No labels