Skip to content

[Feature] match-3 #128

@LukeLIN-web

Description

@LukeLIN-web

Match-3 (Candy Crush Style) Task for VMEvalKit

📊 Overview

The Match-3 task evaluates video generation models' capacity for pattern recognition, cascade prediction, and spatial reasoning through simplified match-3 puzzle games. This task tests whether models can:

  1. Identify matching patterns - Recognize horizontal/vertical groups of 3+ identical elements
  2. Predict cascading effects - Understand gravity and chain reactions after matches
  3. Generate elimination sequences - Produce videos showing step-by-step matching and clearing
  4. Demonstrate physics understanding - Show elements falling and new elements appearing

The task uses simplified 3×3 to 5×5 grids with 4-6 color types as a minimal test case for pattern matching and cascade reasoning capabilities.

🎯 Task Description

Input Components

  • First Frame: A grid (3×3 to 5×5) with colored blocks/tiles arranged randomly
  • Prompt: Text instruction to match and eliminate groups of 3 or more identical elements
  • Format: 600×600px PNG image at 150 DPI with clear grid lines and distinct colors

Expected Output

  • Video Sequence: Animation showing matching, elimination, gravity effects, and cascades
  • Final Frame: Grid after all possible matches are resolved (stable state)
  • Solution Path: Clear demonstration of matching patterns and chain reactions

Core Rules

  • Match Requirement: 3 or more identical elements in a row (horizontal) or column (vertical)
  • Elimination: Matched elements are removed from the grid
  • Gravity: Elements above empty spaces fall downward
  • Refill: New random elements appear from the top
  • Cascade: New matches created by falling elements trigger additional eliminations

🎨 Visual Design Specification

Grid Layout

┌───┬───┬───┬───┐
│ 🔴│ 🟡│ 🔴│ 🟢│  ← Row 0: No match
├───┼───┼───┼───┤
│ 🟡│ 🔴│ 🔴│ 🔴│  ← Row 1: Match! (3 reds)
├───┼───┼───┼───┤
│ 🟢│ 🟡│ 🟢│ 🟡│  ← Row 2: No match
├───┼───┼───┼───┤
│ 🔴│ 🟢│ 🟡│ 🟢│  ← Row 3: No match
└───┴───┴───┴───┘

Visual Elements

Component Specification Purpose
Grid Lines Black, 2px width Clear cell boundaries
Color Blocks 4-6 distinct colors (Red, Blue, Green, Yellow, Purple, Orange) High contrast, easily distinguishable
Cell Size 100×100px minimum Clear visibility
Canvas White background Maximum contrast
Aspect Ratio 1:1 square Uniform cell sizing

Color Palette

  • Red: RGB(255, 0, 0) or #FF0000
  • Blue: RGB(0, 0, 255) or #0000FF
  • Green: RGB(0, 255, 0) or #00FF00
  • Yellow: RGB(255, 255, 0) or #FFFF00
  • Purple: RGB(128, 0, 128) or #800080
  • Orange: RGB(255, 165, 0) or #FFA500

🧩 Mathematical Foundation

Match Detection Algorithm

def find_matches(grid):
    matches = []
    rows, cols = len(grid), len(grid[0])
    
    # Check horizontal matches
    for i in range(rows):
        count = 1
        for j in range(1, cols):
            if grid[i][j] == grid[i][j-1]:
                count += 1
            else:
                if count >= 3:
                    matches.extend([(i, k) for k in range(j-count, j)])
                count = 1
        if count >= 3:
            matches.extend([(i, k) for k in range(cols-count, cols)])
    
    # Check vertical matches
    for j in range(cols):
        count = 1
        for i in range(1, rows):
            if grid[i][j] == grid[i-1][j]:
                count += 1
            else:
                if count >= 3:
                    matches.extend([(k, j) for k in range(i-count, i)])
                count = 1
        if count >= 3:
            matches.extend([(k, j) for k in range(rows-count, rows)])
    
    return set(matches)

Gravity Simulation

def apply_gravity(grid):
    rows, cols = len(grid), len(grid[0])
    new_grid = [['EMPTY'] * cols for _ in range(rows)]
    
    # For each column, move non-empty cells down
    for j in range(cols):
        write_idx = rows - 1
        for i in range(rows - 1, -1, -1):
            if grid[i][j] != 'EMPTY':
                new_grid[write_idx][j] = grid[i][j]
                write_idx -= 1
    
    return new_grid

Refill Algorithm

def refill_empty_cells(grid, colors):
    rows, cols = len(grid), len(grid[0])
    
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == 'EMPTY':
                grid[i][j] = random.choice(colors)
    
    return grid

Complexity Analysis

  • Grid Sizes: 3×3 (9 cells), 4×4 (16 cells), 5×5 (25 cells)
  • Color Count: 4-6 distinct colors
  • Match Patterns: Horizontal (3+), Vertical (3+), L-shapes, T-shapes
  • Cascade Depth: 1-5 levels depending on initial configuration

🎚️ Difficulty Levels

Easy (Level 0)

  • Grid Size: 3×3
  • Colors: 4 types
  • Initial Matches: 0-1 existing matches
  • Cascade Depth: 0-1 levels
  • Complexity: Simple single-match scenarios

Medium (Level 1)

  • Grid Size: 4×4
  • Colors: 5 types
  • Initial Matches: 1-2 existing matches
  • Cascade Depth: 1-2 levels
  • Complexity: Multiple matches with simple cascades

Hard (Level 2)

  • Grid Size: 5×5
  • Colors: 6 types
  • Initial Matches: 2-3 existing matches
  • Cascade Depth: 2-5 levels
  • Complexity: Complex cascades with multiple chain reactions

🧠 Reasoning Requirements

Level 1: Pattern Recognition

  • Visual Parsing: Identify grid structure and color distribution
  • Match Detection: Recognize groups of 3+ identical elements
  • Spatial Mapping: Track row/column positions (0-based indexing)

Level 2: Match Identification

  • Horizontal Scanning: Find consecutive same-color elements in rows
  • Vertical Scanning: Find consecutive same-color elements in columns
  • Overlap Handling: Identify matches that share cells

Level 3: Cascade Prediction

  • Gravity Understanding: Predict which elements fall after elimination
  • Refill Prediction: Anticipate new elements appearing from top
  • Chain Reaction: Identify new matches created by falling elements
  • Stability Detection: Recognize when no more matches are possible

Level 4: Strategic Reasoning

  • Match Selection: Choose which matches to trigger first for maximum cascade
  • Multi-step Planning: Plan sequence of matches for optimal clearing
  • Constraint Satisfaction: Ensure all matches are valid and legal

📝 Data Structure

Core Task Representation

@dataclass
class Match3TaskPair:
    # Identification
    id: str                      # Unique task identifier (e.g., "match3_0042")
    task_category: str           # Always "Match3"
    
    # Task Content  
    prompt: str                  # Instruction text for the model
    first_image_path: str        # Path to initial grid image
    final_image_path: str        # Path to final stable state image
    
    # Grid Data
    initial_grid: List[List[str]]  # 2D grid with color codes
    final_grid: List[List[str]]    # 2D grid after all matches resolved
    grid_size: Tuple[int, int]    # (rows, cols) dimensions
    num_colors: int              # Number of distinct colors used
    
    # Match Data
    initial_matches: List[Tuple[int, int]]  # Positions of initial matches
    total_matches: int           # Total number of matches in solution
    cascade_depth: int          # Maximum cascade level reached
    
    # Metadata
    difficulty: str              # "easy", "medium", or "hard"
    match3_data: Dict[str, Any]  # Additional puzzle metadata
    created_at: str             # ISO timestamp of generation

Grid Representation

Cells are indexed (row, col) with (0,0) at top-left:

Grid positions:     Array representation:
┌───┬───┬───┐      
│0,0│0,1│0,2│      grid = [
├───┼───┼───┤        ['R', 'B', 'G'],  # Row 0
│1,0│1,1│1,2│        ['B', 'R', 'R'],  # Row 1
├───┼───┼───┤        ['G', 'B', 'R']   # Row 2
│2,0│2,1│2,2│      ]
└───┴───┴───┘

Color Encoding

  • 'R': Red
  • 'B': Blue
  • 'G': Green
  • 'Y': Yellow
  • 'P': Purple
  • 'O': Orange

🔄 Generation Pipeline

1. Grid Initialization

def generate_initial_grid(rows, cols, num_colors):
    colors = ['R', 'B', 'G', 'Y', 'P', 'O'][:num_colors]
    grid = [[random.choice(colors) for _ in range(cols)] 
            for _ in range(rows)]
    
    # Ensure at least one match exists
    matches = find_matches(grid)
    if not matches:
        # Force a match by modifying cells
        create_guaranteed_match(grid, colors)
    
    return grid

2. Match Resolution

def resolve_all_matches(grid):
    steps = []
    cascade_level = 0
    
    while True:
        matches = find_matches(grid)
        if not matches:
            break
        
        # Remove matched cells
        for row, col in matches:
            grid[row][col] = 'EMPTY'
        
        steps.append({
            'level': cascade_level,
            'matches': matches,
            'grid_before': copy.deepcopy(grid)
        })
        
        # Apply gravity
        grid = apply_gravity(grid)
        
        # Refill empty cells
        grid = refill_empty_cells(grid, colors)
        
        cascade_level += 1
    
    return grid, steps

3. Image Generation

def create_grid_image(grid, filepath):
    rows, cols = len(grid), len(grid[0])
    cell_size = 100
    img_size = (cols * cell_size, rows * cell_size)
    
    img = Image.new('RGB', img_size, 'white')
    draw = ImageDraw.Draw(img)
    
    color_map = {
        'R': (255, 0, 0),
        'B': (0, 0, 255),
        'G': (0, 255, 0),
        'Y': (255, 255, 0),
        'P': (128, 0, 128),
        'O': (255, 165, 0)
    }
    
    # Draw grid lines
    for i in range(rows + 1):
        y = i * cell_size
        draw.line([(0, y), (img_size[0], y)], fill='black', width=2)
    
    for j in range(cols + 1):
        x = j * cell_size
        draw.line([(x, 0), (x, img_size[1])], fill='black', width=2)
    
    # Fill cells with colors
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] != 'EMPTY':
                x1, y1 = j * cell_size + 2, i * cell_size + 2
                x2, y2 = (j + 1) * cell_size - 2, (i + 1) * cell_size - 2
                draw.rectangle([x1, y1, x2, y2], fill=color_map[grid[i][j]])
    
    img.save(filepath, 'PNG', dpi=(150, 150))

4. Validation

def validate_solution(initial_grid, final_grid):
    # Check grid dimensions match
    if len(initial_grid) != len(final_grid):
        return False
    if len(initial_grid[0]) != len(final_grid[0]):
        return False
    
    # Check no matches remain in final grid
    remaining_matches = find_matches(final_grid)
    if remaining_matches:
        return False
    
    # Check all cells are filled (no EMPTY cells)
    for row in final_grid:
        if 'EMPTY' in row:
            return False
    
    return True

📊 Evaluation Framework

Primary Metrics

Metric Description Scoring
Match Correctness All matches identified are valid (3+ same color) Binary (0/1)
Cascade Accuracy Gravity and refill effects correctly simulated Binary (0/1)
Completeness Final state has no remaining matches Binary (0/1)
Visual Clarity Clear color distinction and grid structure Scale (1-5)
Sequence Coherence Logical progression from initial to final state Scale (1-5)

Evaluation Methods

1. Automatic (GPT-4O Vision)

# Evaluation prompt for GPT-4O
prompt = """
Compare the video sequence with expected match-3 resolution.
Check if:
1. All matches are valid (3+ same color in row/column)
2. Gravity effects are correctly shown (elements fall down)
3. New elements appear from top after elimination
4. Final state has no remaining matches
5. Cascade reactions are properly demonstrated
Rate 1-5 based on correctness and clarity.
"""

2. Human Evaluation

  • Visual inspection of generated video
  • Manual verification of match validity
  • Assessment of cascade demonstration
  • Evaluation of physics simulation accuracy

3. Programmatic Validation

def evaluate_solution(generated_video, expected_steps):
    # Extract frames from video
    frames = extract_frames(generated_video)
    
    # Parse grid from each frame
    grids = [parse_grid_from_image(frame) for frame in frames]
    
    # Validate each transition
    for i in range(len(grids) - 1):
        current = grids[i]
        next_grid = grids[i + 1]
        
        # Check if transition is valid
        matches = find_matches(current)
        expected_next = apply_match_elimination(current, matches)
        expected_next = apply_gravity(expected_next)
        expected_next = refill_empty_cells(expected_next, colors)
        
        if expected_next != next_grid:
            return False
    
    return True

🚀 Usage Examples

Basic Dataset Generation

from vmevalkit.tasks.match3_task import create_dataset

# Generate 50 puzzles with mixed difficulties
dataset = create_dataset(num_samples=50)

# Access individual tasks
for task in dataset['pairs']:
    print(f"Task {task['id']}: {task['grid_size']} grid")
    print(f"Initial matches: {len(task['initial_matches'])}")
    print(f"Cascade depth: {task['cascade_depth']}")

Custom Generation

from vmevalkit.tasks.match3_task import Match3TaskGenerator

generator = Match3TaskGenerator()

# Generate single task with specific difficulty
task = generator.generate_single_task(
    task_id="custom_001",
    difficulty=2,  # Hard level
    grid_size=(5, 5),
    num_colors=6
)

Integration with Runner

# Via command line
python vmevalkit/runner/create_dataset.py \
    --pairs-per-domain 100 \
    --random-seed 42

# Programmatic
from vmevalkit.runner.create_dataset import generate_domain_to_folders

tasks = generate_domain_to_folders(
    domain_name="match3",
    num_samples=100,
    output_base=Path("data/questions"),
    random_seed=42
)

🔧 Technical Implementation

Dependencies

  • NumPy: Array operations and grid manipulation
  • Pillow (PIL): Image processing and color rendering
  • Matplotlib: Optional visualization for debugging
  • Dataclasses: Structured data representation

File Structure

vmevalkit/tasks/match3_task/
├── __init__.py           # Module exports and initialization
├── PROMPTS.py           # Standardized prompt templates
├── match3_reasoning.py  # Core implementation
└── match3.md           # This documentation

Key Classes

Class Purpose Key Methods
Match3GridGenerator Grid generation logic generate_initial_grid(), find_matches()
Match3TaskGenerator Task orchestration generate_single_task(), generate_dataset()
Match3TaskPair Data structure Dataclass with validation
Match3Dataset Collection management save(), load()

🎯 Model Performance Insights

Expected Behaviors

Successful Solution

Frame 1: Shows initial grid with colored blocks
Frame 2-N: Progressive matching and elimination
  - Matched blocks highlighted or removed
  - Blocks falling down (gravity effect)
  - New blocks appearing from top
  - Cascade reactions triggered
Final Frame: Stable state with no remaining matches

Common Failure Modes

  1. Invalid Matches: Identifying matches with less than 3 elements
  2. Missing Cascades: Not showing gravity effects after elimination
  3. Incorrect Refill: New elements appearing from wrong direction
  4. Incomplete Resolution: Final state still contains matches
  5. Visual Artifacts: Unclear colors or distorted grid structure
  6. No Sequence Logic: Random transitions without cause-effect relationship

Performance Factors

  • Pattern Recognition: Ability to identify matching groups
  • Physics Understanding: Grasping gravity and cascade mechanics
  • Sequential Reasoning: Tracking state changes across frames
  • Spatial Awareness: Understanding grid structure and positions

📈 Future Enhancements

Proposed Improvements

  1. Special Blocks: Power-ups (bombs, rainbow blocks) for advanced gameplay
  2. Larger Grids: Extend to 6×6, 7×7 for increased complexity
  3. Diagonal Matches: Support diagonal matching patterns
  4. Time Constraints: Add time-based challenges
  5. Objective Types: Clear specific colors, reach target score, clear all blocks
  6. Multi-step Planning: Require strategic match selection for optimal clearing
  7. Animation Quality: Smooth transitions and particle effects

Research Questions

  • Can models learn cascade prediction from examples?
  • Do models show consistent understanding of gravity mechanics?
  • How does performance scale with grid size and color count?
  • Can models plan multi-step match sequences strategically?
  • Do models understand chain reaction causality?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions