[Feature] match-3

# Match-3 (Candy Crush Style) Task for VMEvalKit

## 📊 Overview

The Match-3 task evaluates video generation models' capacity for **pattern recognition**, **cascade prediction**, and **spatial reasoning** through simplified match-3 puzzle games. This task tests whether models can:

1. **Identify matching patterns** - Recognize horizontal/vertical groups of 3+ identical elements
2. **Predict cascading effects** - Understand gravity and chain reactions after matches
3. **Generate elimination sequences** - Produce videos showing step-by-step matching and clearing
4. **Demonstrate physics understanding** - Show elements falling and new elements appearing

The task uses **simplified 3×3 to 5×5 grids** with 4-6 color types as a minimal test case for pattern matching and cascade reasoning capabilities.

## 🎯 Task Description

### Input Components
- **First Frame**: A grid (3×3 to 5×5) with colored blocks/tiles arranged randomly
- **Prompt**: Text instruction to match and eliminate groups of 3 or more identical elements
- **Format**: 600×600px PNG image at 150 DPI with clear grid lines and distinct colors

### Expected Output
- **Video Sequence**: Animation showing matching, elimination, gravity effects, and cascades
- **Final Frame**: Grid after all possible matches are resolved (stable state)
- **Solution Path**: Clear demonstration of matching patterns and chain reactions

### Core Rules
- **Match Requirement**: 3 or more identical elements in a row (horizontal) or column (vertical)
- **Elimination**: Matched elements are removed from the grid
- **Gravity**: Elements above empty spaces fall downward
- **Refill**: New random elements appear from the top
- **Cascade**: New matches created by falling elements trigger additional eliminations

## 🎨 Visual Design Specification

### Grid Layout
```
┌───┬───┬───┬───┐
│ 🔴│ 🟡│ 🔴│ 🟢│  ← Row 0: No match
├───┼───┼───┼───┤
│ 🟡│ 🔴│ 🔴│ 🔴│  ← Row 1: Match! (3 reds)
├───┼───┼───┼───┤
│ 🟢│ 🟡│ 🟢│ 🟡│  ← Row 2: No match
├───┼───┼───┼───┤
│ 🔴│ 🟢│ 🟡│ 🟢│  ← Row 3: No match
└───┴───┴───┴───┘
```

### Visual Elements
| Component | Specification | Purpose |
|-----------|--------------|---------|
| **Grid Lines** | Black, 2px width | Clear cell boundaries |
| **Color Blocks** | 4-6 distinct colors (Red, Blue, Green, Yellow, Purple, Orange) | High contrast, easily distinguishable |
| **Cell Size** | 100×100px minimum | Clear visibility |
| **Canvas** | White background | Maximum contrast |
| **Aspect Ratio** | 1:1 square | Uniform cell sizing |

### Color Palette
- **Red**: RGB(255, 0, 0) or #FF0000
- **Blue**: RGB(0, 0, 255) or #0000FF
- **Green**: RGB(0, 255, 0) or #00FF00
- **Yellow**: RGB(255, 255, 0) or #FFFF00
- **Purple**: RGB(128, 0, 128) or #800080
- **Orange**: RGB(255, 165, 0) or #FFA500

## 🧩 Mathematical Foundation

### Match Detection Algorithm
```python
def find_matches(grid):
    matches = []
    rows, cols = len(grid), len(grid[0])
    
    # Check horizontal matches
    for i in range(rows):
        count = 1
        for j in range(1, cols):
            if grid[i][j] == grid[i][j-1]:
                count += 1
            else:
                if count >= 3:
                    matches.extend([(i, k) for k in range(j-count, j)])
                count = 1
        if count >= 3:
            matches.extend([(i, k) for k in range(cols-count, cols)])
    
    # Check vertical matches
    for j in range(cols):
        count = 1
        for i in range(1, rows):
            if grid[i][j] == grid[i-1][j]:
                count += 1
            else:
                if count >= 3:
                    matches.extend([(k, j) for k in range(i-count, i)])
                count = 1
        if count >= 3:
            matches.extend([(k, j) for k in range(rows-count, rows)])
    
    return set(matches)
```

### Gravity Simulation
```python
def apply_gravity(grid):
    rows, cols = len(grid), len(grid[0])
    new_grid = [['EMPTY'] * cols for _ in range(rows)]
    
    # For each column, move non-empty cells down
    for j in range(cols):
        write_idx = rows - 1
        for i in range(rows - 1, -1, -1):
            if grid[i][j] != 'EMPTY':
                new_grid[write_idx][j] = grid[i][j]
                write_idx -= 1
    
    return new_grid
```

### Refill Algorithm
```python
def refill_empty_cells(grid, colors):
    rows, cols = len(grid), len(grid[0])
    
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == 'EMPTY':
                grid[i][j] = random.choice(colors)
    
    return grid
```

### Complexity Analysis
- **Grid Sizes**: 3×3 (9 cells), 4×4 (16 cells), 5×5 (25 cells)
- **Color Count**: 4-6 distinct colors
- **Match Patterns**: Horizontal (3+), Vertical (3+), L-shapes, T-shapes
- **Cascade Depth**: 1-5 levels depending on initial configuration

## 🎚️ Difficulty Levels

### Easy (Level 0)
- **Grid Size**: 3×3
- **Colors**: 4 types
- **Initial Matches**: 0-1 existing matches
- **Cascade Depth**: 0-1 levels
- **Complexity**: Simple single-match scenarios

### Medium (Level 1)
- **Grid Size**: 4×4
- **Colors**: 5 types
- **Initial Matches**: 1-2 existing matches
- **Cascade Depth**: 1-2 levels
- **Complexity**: Multiple matches with simple cascades

### Hard (Level 2)
- **Grid Size**: 5×5
- **Colors**: 6 types
- **Initial Matches**: 2-3 existing matches
- **Cascade Depth**: 2-5 levels
- **Complexity**: Complex cascades with multiple chain reactions

## 🧠 Reasoning Requirements

### Level 1: Pattern Recognition
- **Visual Parsing**: Identify grid structure and color distribution
- **Match Detection**: Recognize groups of 3+ identical elements
- **Spatial Mapping**: Track row/column positions (0-based indexing)

### Level 2: Match Identification
- **Horizontal Scanning**: Find consecutive same-color elements in rows
- **Vertical Scanning**: Find consecutive same-color elements in columns
- **Overlap Handling**: Identify matches that share cells

### Level 3: Cascade Prediction
- **Gravity Understanding**: Predict which elements fall after elimination
- **Refill Prediction**: Anticipate new elements appearing from top
- **Chain Reaction**: Identify new matches created by falling elements
- **Stability Detection**: Recognize when no more matches are possible

### Level 4: Strategic Reasoning
- **Match Selection**: Choose which matches to trigger first for maximum cascade
- **Multi-step Planning**: Plan sequence of matches for optimal clearing
- **Constraint Satisfaction**: Ensure all matches are valid and legal

## 📝 Data Structure

### Core Task Representation
```python
@dataclass
class Match3TaskPair:
    # Identification
    id: str                      # Unique task identifier (e.g., "match3_0042")
    task_category: str           # Always "Match3"
    
    # Task Content  
    prompt: str                  # Instruction text for the model
    first_image_path: str        # Path to initial grid image
    final_image_path: str        # Path to final stable state image
    
    # Grid Data
    initial_grid: List[List[str]]  # 2D grid with color codes
    final_grid: List[List[str]]    # 2D grid after all matches resolved
    grid_size: Tuple[int, int]    # (rows, cols) dimensions
    num_colors: int              # Number of distinct colors used
    
    # Match Data
    initial_matches: List[Tuple[int, int]]  # Positions of initial matches
    total_matches: int           # Total number of matches in solution
    cascade_depth: int          # Maximum cascade level reached
    
    # Metadata
    difficulty: str              # "easy", "medium", or "hard"
    match3_data: Dict[str, Any]  # Additional puzzle metadata
    created_at: str             # ISO timestamp of generation
```

### Grid Representation
Cells are indexed (row, col) with (0,0) at top-left:
```
Grid positions:     Array representation:
┌───┬───┬───┐      
│0,0│0,1│0,2│      grid = [
├───┼───┼───┤        ['R', 'B', 'G'],  # Row 0
│1,0│1,1│1,2│        ['B', 'R', 'R'],  # Row 1
├───┼───┼───┤        ['G', 'B', 'R']   # Row 2
│2,0│2,1│2,2│      ]
└───┴───┴───┘
```

### Color Encoding
- **'R'**: Red
- **'B'**: Blue
- **'G'**: Green
- **'Y'**: Yellow
- **'P'**: Purple
- **'O'**: Orange

## 🔄 Generation Pipeline

### 1. Grid Initialization
```python
def generate_initial_grid(rows, cols, num_colors):
    colors = ['R', 'B', 'G', 'Y', 'P', 'O'][:num_colors]
    grid = [[random.choice(colors) for _ in range(cols)] 
            for _ in range(rows)]
    
    # Ensure at least one match exists
    matches = find_matches(grid)
    if not matches:
        # Force a match by modifying cells
        create_guaranteed_match(grid, colors)
    
    return grid
```

### 2. Match Resolution
```python
def resolve_all_matches(grid):
    steps = []
    cascade_level = 0
    
    while True:
        matches = find_matches(grid)
        if not matches:
            break
        
        # Remove matched cells
        for row, col in matches:
            grid[row][col] = 'EMPTY'
        
        steps.append({
            'level': cascade_level,
            'matches': matches,
            'grid_before': copy.deepcopy(grid)
        })
        
        # Apply gravity
        grid = apply_gravity(grid)
        
        # Refill empty cells
        grid = refill_empty_cells(grid, colors)
        
        cascade_level += 1
    
    return grid, steps
```

### 3. Image Generation
```python
def create_grid_image(grid, filepath):
    rows, cols = len(grid), len(grid[0])
    cell_size = 100
    img_size = (cols * cell_size, rows * cell_size)
    
    img = Image.new('RGB', img_size, 'white')
    draw = ImageDraw.Draw(img)
    
    color_map = {
        'R': (255, 0, 0),
        'B': (0, 0, 255),
        'G': (0, 255, 0),
        'Y': (255, 255, 0),
        'P': (128, 0, 128),
        'O': (255, 165, 0)
    }
    
    # Draw grid lines
    for i in range(rows + 1):
        y = i * cell_size
        draw.line([(0, y), (img_size[0], y)], fill='black', width=2)
    
    for j in range(cols + 1):
        x = j * cell_size
        draw.line([(x, 0), (x, img_size[1])], fill='black', width=2)
    
    # Fill cells with colors
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] != 'EMPTY':
                x1, y1 = j * cell_size + 2, i * cell_size + 2
                x2, y2 = (j + 1) * cell_size - 2, (i + 1) * cell_size - 2
                draw.rectangle([x1, y1, x2, y2], fill=color_map[grid[i][j]])
    
    img.save(filepath, 'PNG', dpi=(150, 150))
```

### 4. Validation
```python
def validate_solution(initial_grid, final_grid):
    # Check grid dimensions match
    if len(initial_grid) != len(final_grid):
        return False
    if len(initial_grid[0]) != len(final_grid[0]):
        return False
    
    # Check no matches remain in final grid
    remaining_matches = find_matches(final_grid)
    if remaining_matches:
        return False
    
    # Check all cells are filled (no EMPTY cells)
    for row in final_grid:
        if 'EMPTY' in row:
            return False
    
    return True
```

## 📊 Evaluation Framework

### Primary Metrics
| Metric | Description | Scoring |
|--------|-------------|---------|
| **Match Correctness** | All matches identified are valid (3+ same color) | Binary (0/1) |
| **Cascade Accuracy** | Gravity and refill effects correctly simulated | Binary (0/1) |
| **Completeness** | Final state has no remaining matches | Binary (0/1) |
| **Visual Clarity** | Clear color distinction and grid structure | Scale (1-5) |
| **Sequence Coherence** | Logical progression from initial to final state | Scale (1-5) |

### Evaluation Methods

#### 1. Automatic (GPT-4O Vision)
```python
# Evaluation prompt for GPT-4O
prompt = """
Compare the video sequence with expected match-3 resolution.
Check if:
1. All matches are valid (3+ same color in row/column)
2. Gravity effects are correctly shown (elements fall down)
3. New elements appear from top after elimination
4. Final state has no remaining matches
5. Cascade reactions are properly demonstrated
Rate 1-5 based on correctness and clarity.
"""
```

#### 2. Human Evaluation
- Visual inspection of generated video
- Manual verification of match validity
- Assessment of cascade demonstration
- Evaluation of physics simulation accuracy

#### 3. Programmatic Validation
```python
def evaluate_solution(generated_video, expected_steps):
    # Extract frames from video
    frames = extract_frames(generated_video)
    
    # Parse grid from each frame
    grids = [parse_grid_from_image(frame) for frame in frames]
    
    # Validate each transition
    for i in range(len(grids) - 1):
        current = grids[i]
        next_grid = grids[i + 1]
        
        # Check if transition is valid
        matches = find_matches(current)
        expected_next = apply_match_elimination(current, matches)
        expected_next = apply_gravity(expected_next)
        expected_next = refill_empty_cells(expected_next, colors)
        
        if expected_next != next_grid:
            return False
    
    return True
```

## 🚀 Usage Examples

### Basic Dataset Generation
```python
from vmevalkit.tasks.match3_task import create_dataset

# Generate 50 puzzles with mixed difficulties
dataset = create_dataset(num_samples=50)

# Access individual tasks
for task in dataset['pairs']:
    print(f"Task {task['id']}: {task['grid_size']} grid")
    print(f"Initial matches: {len(task['initial_matches'])}")
    print(f"Cascade depth: {task['cascade_depth']}")
```

### Custom Generation
```python
from vmevalkit.tasks.match3_task import Match3TaskGenerator

generator = Match3TaskGenerator()

# Generate single task with specific difficulty
task = generator.generate_single_task(
    task_id="custom_001",
    difficulty=2,  # Hard level
    grid_size=(5, 5),
    num_colors=6
)
```

### Integration with Runner
```python
# Via command line
python vmevalkit/runner/create_dataset.py \
    --pairs-per-domain 100 \
    --random-seed 42

# Programmatic
from vmevalkit.runner.create_dataset import generate_domain_to_folders

tasks = generate_domain_to_folders(
    domain_name="match3",
    num_samples=100,
    output_base=Path("data/questions"),
    random_seed=42
)
```

## 🔧 Technical Implementation

### Dependencies
- **NumPy**: Array operations and grid manipulation
- **Pillow (PIL)**: Image processing and color rendering
- **Matplotlib**: Optional visualization for debugging
- **Dataclasses**: Structured data representation

### File Structure
```
vmevalkit/tasks/match3_task/
├── __init__.py           # Module exports and initialization
├── PROMPTS.py           # Standardized prompt templates
├── match3_reasoning.py  # Core implementation
└── match3.md           # This documentation
```

### Key Classes
| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `Match3GridGenerator` | Grid generation logic | `generate_initial_grid()`, `find_matches()` |
| `Match3TaskGenerator` | Task orchestration | `generate_single_task()`, `generate_dataset()` |
| `Match3TaskPair` | Data structure | Dataclass with validation |
| `Match3Dataset` | Collection management | `save()`, `load()` |

## 🎯 Model Performance Insights

### Expected Behaviors

#### Successful Solution
```
Frame 1: Shows initial grid with colored blocks
Frame 2-N: Progressive matching and elimination
  - Matched blocks highlighted or removed
  - Blocks falling down (gravity effect)
  - New blocks appearing from top
  - Cascade reactions triggered
Final Frame: Stable state with no remaining matches
```

#### Common Failure Modes
1. **Invalid Matches**: Identifying matches with less than 3 elements
2. **Missing Cascades**: Not showing gravity effects after elimination
3. **Incorrect Refill**: New elements appearing from wrong direction
4. **Incomplete Resolution**: Final state still contains matches
5. **Visual Artifacts**: Unclear colors or distorted grid structure
6. **No Sequence Logic**: Random transitions without cause-effect relationship

### Performance Factors
- **Pattern Recognition**: Ability to identify matching groups
- **Physics Understanding**: Grasping gravity and cascade mechanics
- **Sequential Reasoning**: Tracking state changes across frames
- **Spatial Awareness**: Understanding grid structure and positions

## 📈 Future Enhancements

### Proposed Improvements
1. **Special Blocks**: Power-ups (bombs, rainbow blocks) for advanced gameplay
2. **Larger Grids**: Extend to 6×6, 7×7 for increased complexity
3. **Diagonal Matches**: Support diagonal matching patterns
4. **Time Constraints**: Add time-based challenges
5. **Objective Types**: Clear specific colors, reach target score, clear all blocks
6. **Multi-step Planning**: Require strategic match selection for optimal clearing
7. **Animation Quality**: Smooth transitions and particle effects

### Research Questions
- Can models learn cascade prediction from examples?
- Do models show consistent understanding of gravity mechanics?
- How does performance scale with grid size and color count?
- Can models plan multi-step match sequences strategically?
- Do models understand chain reaction causality?

Component	Specification	Purpose
Grid Lines	Black, 2px width	Clear cell boundaries
Color Blocks	4-6 distinct colors (Red, Blue, Green, Yellow, Purple, Orange)	High contrast, easily distinguishable
Cell Size	100×100px minimum	Clear visibility
Canvas	White background	Maximum contrast
Aspect Ratio	1:1 square	Uniform cell sizing

Metric	Description	Scoring
Match Correctness	All matches identified are valid (3+ same color)	Binary (0/1)
Cascade Accuracy	Gravity and refill effects correctly simulated	Binary (0/1)
Completeness	Final state has no remaining matches	Binary (0/1)
Visual Clarity	Clear color distinction and grid structure	Scale (1-5)
Sequence Coherence	Logical progression from initial to final state	Scale (1-5)

Class	Purpose	Key Methods
`Match3GridGenerator`	Grid generation logic	`generate_initial_grid()`, `find_matches()`
`Match3TaskGenerator`	Task orchestration	`generate_single_task()`, `generate_dataset()`
`Match3TaskPair`	Data structure	Dataclass with validation
`Match3Dataset`	Collection management	`save()`, `load()`

[Feature] match-3 #128

Description

Match-3 (Candy Crush Style) Task for VMEvalKit

📊 Overview

🎯 Task Description

Input Components

Expected Output

Core Rules

🎨 Visual Design Specification

Grid Layout

Visual Elements

Color Palette

🧩 Mathematical Foundation

Match Detection Algorithm

Gravity Simulation

Refill Algorithm

Complexity Analysis

🎚️ Difficulty Levels

Easy (Level 0)

Medium (Level 1)

Hard (Level 2)

🧠 Reasoning Requirements

Level 1: Pattern Recognition

Level 2: Match Identification

Level 3: Cascade Prediction

Level 4: Strategic Reasoning

📝 Data Structure

Core Task Representation

Grid Representation

Color Encoding

🔄 Generation Pipeline

1. Grid Initialization

2. Match Resolution

3. Image Generation

4. Validation

📊 Evaluation Framework

Primary Metrics

Evaluation Methods

1. Automatic (GPT-4O Vision)

2. Human Evaluation

3. Programmatic Validation

🚀 Usage Examples

Basic Dataset Generation

Custom Generation

Integration with Runner

🔧 Technical Implementation

Dependencies

File Structure

Key Classes

🎯 Model Performance Insights

Expected Behaviors

Successful Solution

Common Failure Modes

Performance Factors

📈 Future Enhancements

Proposed Improvements

Research Questions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions