A local-first application for creating training datasets using Qwen2.5-VL (default) for image/video captioning and reasoning, with Florence-2 as an optional alternative. Build high-quality datasets with drag-and-drop simplicity and AI-powered automation.
- Qwen2.5-VL Integration: Primary VLM for captioning and reasoning (7B default)
- Optional Alternative: Florence-2 Base/Large for perception-first captioning
- Audio Processing: Speaker diarization and isolation using pyannote.audio
- Person Isolation: Face detection with InsightFace + optional SAM refinement
- Smart Conversion: Auto-convert to standard formats (PNG/MP4/MP3)
- ULID Tokens: Unique, sortable identifiers for all processed media
- Web Interface: Gradio-based UI with drag-and-drop and inline editing
- Progress Tracking: Comprehensive logging and project management
# Navigate to CaptionStrike directory
cd D:\Dropbox\SandBox\CaptionStrike
# Create conda environment
conda env create -f environment.yml
# Activate environment
conda activate CaptionStrike

# Start the local web interface
python app.py --root "D:\Datasets" --models_dir ".\models"
# Or specify custom paths
python app.py --root "C:\Your\Dataset\Path" --models_dir "C:\Your\Models\Path" --port 7860
# (Optional) Pre-download Qwen reasoning model to the models directory
python app.py --root "D:\Datasets" --models_dir ".\models" --prefetch-qwen- Create Project: Enter a project name and click "Create Project"
- Add Media: Drag and drop images, videos, or audio files
- Configure Options:
  - Toggle person isolation (face crops)
  - Provide reference voice clip for audio processing
  - Set audio timestamp ranges
- Run Pipeline: Click "RUN pipeline" to process all media
- Review Results: Browse thumbnails and edit captions inline
- Export: Find processed files in `<root>\<project>\processed\`
# If you encounter path issues, use full Windows paths:
python app.py --root "C:\Users\YourName\Documents\Datasets" --models_dir "C:\Users\YourName\Documents\Models"
# To check if conda environment is active:
conda info --envs
# To verify Python and dependencies:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"<root>\
└── <project_name>\
    ├── raw\                 # Original uploaded files
    │   ├── image\
    │   ├── video\
    │   └── audio\
    ├── processed\           # Converted & captioned files
    │   ├── image\           # PNG files with captions
    │   ├── video\           # MP4 files with action tags
    │   ├── audio\           # MP3 files with transcripts
    │   └── thumbs\          # 256px thumbnails for UI
    └── meta\
        ├── project.json     # Configuration & model settings
        └── run_logs.jsonl   # Processing history
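Because `run_logs.jsonl` is plain JSON Lines, the processing history can be inspected with a few lines of Python. A minimal sketch with an illustrative project path (the exact record fields depend on the pipeline version):

```python
import json
from pathlib import Path

# Illustrative location; adjust the dataset root and project name to your setup.
log_path = Path(r"D:\Datasets\my_project\meta\run_logs.jsonl")

for line in log_path.read_text(encoding="utf-8").splitlines():
    record = json.loads(line)   # one JSON object per processing step
    print(record)
```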
Edit `<project>\meta\project.json` to customize:
{
"models": {
"captioner": "Qwen/Qwen2.5-VL-7B-Instruct",
"reasoning": {
"enabled": false,
"model": "Qwen/Qwen2.5-VL-7B-Instruct"
}
},
"action": {
"method": "first_frame",
"rewrite_with_llm": true
},
"isolation": {
"faces": true,
"sam_refine": false
}
}

Supported captioner models:

- `Qwen/Qwen2.5-VL-7B-Instruct` (default)
- `Qwen/Qwen2.5-VL-3B-Instruct` (lighter)
- `Qwen/Qwen2.5-VL-2B-Instruct` (lightest)
- `microsoft/Florence-2-base` (faster)
- `microsoft/Florence-2-large` (more detailed)
- `openbmb/MiniCPM-V-2_6` (all-in-one option): enable via `single_model_mode: true`

Model files are cached under `--models_dir`; use `--prefetch-qwen` to download ahead of time. To change the captioner in an existing project, edit project.json as shown in the sketch below.
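To switch models outside the UI, the `captioner` field can be edited directly. A minimal sketch, assuming the project.json layout shown above and using an illustrative path:

```python
import json
from pathlib import Path

# Illustrative project location; adjust the dataset root and project name to your setup.
cfg_path = Path(r"D:\Datasets\my_project\meta\project.json")

cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
cfg["models"]["captioner"] = "microsoft/Florence-2-base"  # swap in the Florence-2 alternative
cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```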
System requirements (minimum):

- OS: Windows 10/11, Linux, macOS
- RAM: 8GB (16GB recommended)
- Storage: 10GB free space
- Python: 3.10+

Recommended:

- GPU: NVIDIA GPU with 6GB+ VRAM (CUDA support)
- RAM: 16GB+ for large models
- Storage: SSD for faster processing
Key dependencies:

- PyTorch 2.2+
- Transformers 4.42+
- Gradio 4.44+
- FFmpeg (auto-installed via conda)
Supported input formats:

- Images: PNG, JPG, JPEG, WebP, BMP, TIFF, GIF
- Videos: MP4, MOV, MKV, AVI, WMV, FLV, WebM
- Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
Output formats (see the conversion sketch after this list):

- Images: PNG (RGB, optimized)
- Videos: MP4 (H.264, AAC, faststart)
- Audio: MP3 (192kbps)
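These conversions correspond to standard FFmpeg re-encodes. Below is a minimal sketch of roughly equivalent invocations driven from Python; the exact flags are assumptions, not necessarily what CaptionStrike runs:

```python
import subprocess

def to_mp4(src: str, dst: str) -> None:
    """Re-encode a video as H.264/AAC MP4 with the moov atom up front (faststart)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-c:a", "aac",
         "-movflags", "+faststart", dst],
        check=True,
    )

def to_mp3(src: str, dst: str) -> None:
    """Re-encode an audio file as 192 kbps MP3."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:a", "libmp3lame", "-b:a", "192k", dst],
        check=True,
    )
```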
Processing pipeline:

- Media Ingestion: Copy originals to `raw/` folders
- Format Conversion: Convert to standard formats
- AI Analysis:
  - Images: Qwen2.5-VL captioning (default) or Florence-2 captioning
  - Videos: First-frame analysis + action tag inference (see the sketch after this list)
  - Audio: Speaker diarization + transcript generation
- Optional Enhancement: Florence-2 or Qwen-based refinement, depending on the selected model
- Token Assignment: Append unique ULID tokens
- Thumbnail Generation: Create 256px previews
- Logging: Record all processing steps
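As an illustration of the first-frame step for videos, here is a minimal sketch using OpenCV; this is an assumption for illustration, and the app's actual frame extraction may differ:

```python
import cv2  # opencv-python

def first_frame(video_path: str, image_path: str) -> bool:
    """Save the first frame of a video as an image that a captioner can describe."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()   # frame 0
    cap.release()
    if ok:
        cv2.imwrite(image_path, frame)
    return ok
```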
All captions follow this format:
A detailed description of the subject, setting, lighting, and mood [TKN-01HQXYZ123ABC456DEF789]
Video captions include action tags:
A video showing a person walking in a park with natural lighting [ACTION:person_activity] [TKN-01HQXYZ123ABC456DEF789]
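For reference, a ULID is a 128-bit identifier (a 48-bit millisecond timestamp plus 80 random bits) encoded as 26 Crockford base32 characters, which is what makes the tokens sortable by creation time. A minimal sketch of generating such a token; `new_ulid` and `tag_caption` are illustrative helpers, not CaptionStrike's API:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32 alphabet (no I, L, O, U)

def new_ulid() -> str:
    """Generate a 26-character ULID: 48-bit millisecond timestamp + 80 random bits."""
    ts = int(time.time() * 1000)
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def tag_caption(caption: str) -> str:
    """Append a CaptionStrike-style token to a caption string."""
    return f"{caption.rstrip()} [TKN-{new_ulid()}]"

print(tag_caption("A detailed description of the subject, setting, lighting, and mood"))
```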
Run the smoke test to verify installation:
pytest

This will test:

- Environment setup
- Model loading
- Media conversion
- Caption generation
- Token assignment
- File organization
# Pre-download models manually
python -c "from transformers import AutoProcessor; AutoProcessor.from_pretrained('Qwen/Qwen2.5-VL-7B-Instruct', trust_remote_code=True)"# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"# Verify FFmpeg installation
ffmpeg -version# If you get path errors, try using raw strings or forward slashes:
python app.py --root "D:/Datasets" --models_dir "./models"
# Or escape backslashes:
python app.py --root "D:\\Datasets" --models_dir ".\\models"Process multiple projects programmatically:
from src.core.pipeline import Pipeline
from src.core.io import ProjectLayout
# Initialize pipeline
pipeline = Pipeline(models_dir=r".\models")
# Process project (use raw strings for Windows paths)
layout = ProjectLayout(r"D:\Datasets", "my_project")
pipeline.process_project(layout)
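To batch over several projects, the same calls can be wrapped in a loop; the project names and dataset root below are illustrative:

```python
from src.core.pipeline import Pipeline
from src.core.io import ProjectLayout

pipeline = Pipeline(models_dir=r".\models")

# Hypothetical project names under the same dataset root; adjust to your setup.
for name in ["my_project", "another_project"]:
    pipeline.process_project(ProjectLayout(r"D:\Datasets", name))
```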
Add new model adapters in `src\adapters\`:

class CustomCaptioner:
    """Example adapter: implement caption_image to plug a new captioning model into the pipeline."""

    def caption_image(self, image):
        # Run your model on the input image here and return its caption text.
        return {"caption": "Custom caption"}

To contribute:

- Fork the repository
- Create feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft for the Florence-2 model
- Alibaba for the Qwen2.5-VL model
- PyAnnote team for audio diarization
- InsightFace team for face detection
- Gradio team for the web interface framework
This implementation enhances the original scaffold with:
- Qwen2.5-VL Integration: Primary captioning via Qwen VLM with optional Florence-2 alternative
- Modular Architecture: Proper adapter pattern for different AI models
- Enhanced Configuration: Comprehensive project.json with model selection options
- Better Error Handling: Graceful fallbacks when models aren't available
- Comprehensive Testing: Full smoke test suite and acceptance validation
- Professional Documentation: Complete setup guide and troubleshooting section
The core functionality remains true to the original vision while providing a production-ready implementation with proper error handling and extensibility.