A local-first application for creating training datasets using Qwen2.5-VL (default) for image/video captioning and reasoning, with Florence-2 as an optional alternative. Build high-quality datasets with drag-and-drop simplicity and AI-powered automation.
- Qwen2.5-VL Integration: Primary VLM for captioning and reasoning (7B default)
- Optional Alternative: Florence-2 Base/Large for perception-first captioning
- Audio Processing: Speaker diarization and isolation using pyannote.audio
- Person Isolation: Face detection with InsightFace + optional SAM refinement
- Smart Conversion: Auto-convert to standard formats (PNG/MP4/MP3)
- ULID Tokens: Unique, sortable identifiers for all processed media
- Web Interface: Gradio-based UI with drag-and-drop and inline editing
- Progress Tracking: Comprehensive logging and project management
# Navigate to CaptionStrike directory
cd D:\Dropbox\SandBox\CaptionStrike
# Create conda environment
conda env create -f environment.yml
# Activate environment
conda activate CaptionStrike

# Start the local web interface
python app.py --root "D:\Datasets" --models_dir ".\models"
# Or specify custom paths
python app.py --root "C:\Your\Dataset\Path" --models_dir "C:\Your\Models\Path" --port 7860
# (Optional) Pre-download Qwen reasoning model to the models directory
python app.py --root "D:\Datasets" --models_dir ".\models" --prefetch-qwen- Create Project: Enter a project name and click "Create Project"
- Add Media: Drag and drop images, videos, or audio files
- Configure Options:
  - Toggle person isolation (face crops)
  - Provide reference voice clip for audio processing
  - Set audio timestamp ranges
- Run Pipeline: Click "RUN pipeline" to process all media
- Review Results: Browse thumbnails and edit captions inline
- Export: Find processed files in `<root>\<project>\processed\`
# If you encounter path issues, use full Windows paths:
python app.py --root "C:\Users\YourName\Documents\Datasets" --models_dir "C:\Users\YourName\Documents\Models"
# To check if conda environment is active:
conda info --envs
# To verify Python and dependencies:
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"<root>\
└── <project_name>\
    ├── raw\                 # Original uploaded files
    │   ├── image\
    │   ├── video\
    │   └── audio\
    ├── processed\           # Converted & captioned files
    │   ├── image\           # PNG files with captions
    │   ├── video\           # MP4 files with action tags
    │   ├── audio\           # MP3 files with transcripts
    │   └── thumbs\          # 256px thumbnails for UI
    └── meta\
        ├── project.json     # Configuration & model settings
        └── run_logs.jsonl   # Processing history
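Because `run_logs.jsonl` is plain JSON Lines, the processing history can be inspected with a few lines of Python. A minimal sketch with an illustrative project path (the exact record fields depend on the pipeline version):

```python
import json
from pathlib import Path

# Illustrative location; adjust the dataset root and project name to your setup.
log_path = Path(r"D:\Datasets\my_project\meta\run_logs.jsonl")

for line in log_path.read_text(encoding="utf-8").splitlines():
    record = json.loads(line)   # one JSON object per processing step
    print(record)
```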
Edit `<project>\meta\project.json` to customize:
{
"models": {
"captioner": "Qwen/Qwen2.5-VL-7B-Instruct",
"reasoning": {
"enabled": false,
"model": "Qwen/Qwen2.5-VL-7B-Instruct"
}
},
"action": {
"method": "first_frame",
"rewrite_with_llm": true
},
"isolation": {
"faces": true,
"sam_refine": false
}
}

Supported captioner models:

- `Qwen/Qwen2.5-VL-7B-Instruct` (default)
- `Qwen/Qwen2.5-VL-3B-Instruct` (lighter)
- `Qwen/Qwen2.5-VL-2B-Instruct` (lightest)
- `microsoft/Florence-2-base` (faster)
- `microsoft/Florence-2-large` (more detailed)
- `openbmb/MiniCPM-V-2_6` (all-in-one option): enable via `single_model_mode: true`

Model files are cached under `--models_dir`; use `--prefetch-qwen` to download ahead of time. To change the captioner in an existing project, edit project.json as shown in the sketch below.
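To switch models outside the UI, the `captioner` field can be edited directly. A minimal sketch, assuming the project.json layout shown above and using an illustrative path:

```python
import json
from pathlib import Path

# Illustrative project location; adjust the dataset root and project name to your setup.
cfg_path = Path(r"D:\Datasets\my_project\meta\project.json")

cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
cfg["models"]["captioner"] = "microsoft/Florence-2-base"  # swap in the Florence-2 alternative
cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
```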
System requirements (minimum):

- OS: Windows 10/11, Linux, macOS
- RAM: 8GB (16GB recommended)
- Storage: 10GB free space
- Python: 3.10+

Recommended:

- GPU: NVIDIA GPU with 6GB+ VRAM (CUDA support)
- RAM: 16GB+ for large models
- Storage: SSD for faster processing
Key dependencies:

- PyTorch 2.2+
- Transformers 4.42+
- Gradio 4.44+
- FFmpeg (auto-installed via conda)
Supported input formats:

- Images: PNG, JPG, JPEG, WebP, BMP, TIFF, GIF
- Videos: MP4, MOV, MKV, AVI, WMV, FLV, WebM
- Audio: MP3, WAV, M4A, FLAC, AAC, OGG, WMA
Output formats (see the conversion sketch after this list):

- Images: PNG (RGB, optimized)
- Videos: MP4 (H.264, AAC, faststart)
- Audio: MP3 (192kbps)
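These conversions correspond to standard FFmpeg re-encodes. Below is a minimal sketch of roughly equivalent invocations driven from Python; the exact flags are assumptions, not necessarily what CaptionStrike runs:

```python
import subprocess

def to_mp4(src: str, dst: str) -> None:
    """Re-encode a video as H.264/AAC MP4 with the moov atom up front (faststart)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-c:a", "aac",
         "-movflags", "+faststart", dst],
        check=True,
    )

def to_mp3(src: str, dst: str) -> None:
    """Re-encode an audio file as 192 kbps MP3."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:a", "libmp3lame", "-b:a", "192k", dst],
        check=True,
    )
```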
Processing pipeline:

- Media Ingestion: Copy originals to `raw/` folders
- Format Conversion: Convert to standard formats
- AI Analysis:
  - Images: Qwen2.5-VL captioning (default) or Florence-2 captioning
  - Videos: First-frame analysis + action tag inference (see the sketch after this list)
  - Audio: Speaker diarization + transcript generation
- Optional Enhancement: Florence-2 or Qwen-based refinement, depending on the selected model
- Token Assignment: Append unique ULID tokens
- Thumbnail Generation: Create 256px previews
- Logging: Record all processing steps
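As an illustration of the first-frame step for videos, here is a minimal sketch using OpenCV; this is an assumption for illustration, and the app's actual frame extraction may differ:

```python
import cv2  # opencv-python

def first_frame(video_path: str, image_path: str) -> bool:
    """Save the first frame of a video as an image that a captioner can describe."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()   # frame 0
    cap.release()
    if ok:
        cv2.imwrite(image_path, frame)
    return ok
```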
All captions follow this format:
A detailed description of the subject, setting, lighting, and mood [TKN-01HQXYZ123ABC456DEF789]
Video captions include action tags:
A video showing a person walking in a park with natural lighting [ACTION:person_activity] [TKN-01HQXYZ123ABC456DEF789]
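For reference, a ULID is a 128-bit identifier (a 48-bit millisecond timestamp plus 80 random bits) encoded as 26 Crockford base32 characters, which is what makes the tokens sortable by creation time. A minimal sketch of generating such a token; `new_ulid` and `tag_caption` are illustrative helpers, not CaptionStrike's API:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32 alphabet (no I, L, O, U)

def new_ulid() -> str:
    """Generate a 26-character ULID: 48-bit millisecond timestamp + 80 random bits."""
    ts = int(time.time() * 1000)
    value = (ts << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def tag_caption(caption: str) -> str:
    """Append a CaptionStrike-style token to a caption string."""
    return f"{caption.rstrip()} [TKN-{new_ulid()}]"

print(tag_caption("A detailed description of the subject, setting, lighting, and mood"))
```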
Run the smoke test to verify installation:
pytest

This will test:

- Environment setup
- Model loading
- Media conversion
- Caption generation
- Token assignment
- File organization
# Pre-download models manually
python -c "from transformers import AutoProcessor; AutoProcessor.from_pretrained('Qwen/Qwen2.5-VL-7B-Instruct', trust_remote_code=True)"# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"# Verify FFmpeg installation
ffmpeg -version# If you get path errors, try using raw strings or forward slashes:
python app.py --root "D:/Datasets" --models_dir "./models"
# Or escape backslashes:
python app.py --root "D:\\Datasets" --models_dir ".\\models"Process multiple projects programmatically:
from src.core.pipeline import Pipeline
from src.core.io import ProjectLayout
# Initialize pipeline
pipeline = Pipeline(models_dir=r".\models")
# Process project (use raw strings for Windows paths)
layout = ProjectLayout(r"D:\Datasets", "my_project")
pipeline.process_project(layout)
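To batch over several projects, the same calls can be wrapped in a loop; the project names and dataset root below are illustrative:

```python
from src.core.pipeline import Pipeline
from src.core.io import ProjectLayout

pipeline = Pipeline(models_dir=r".\models")

# Hypothetical project names under the same dataset root; adjust to your setup.
for name in ["my_project", "another_project"]:
    pipeline.process_project(ProjectLayout(r"D:\Datasets", name))
```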
Add new model adapters in `src\adapters\`:

class CustomCaptioner:
    """Example adapter: implement caption_image to plug a new captioning model into the pipeline."""

    def caption_image(self, image):
        # Run your model on the input image here and return its caption text.
        return {"caption": "Custom caption"}

To contribute:

- Fork the repository
- Create feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft for the Florence-2 model
- Alibaba for the Qwen2.5-VL model
- PyAnnote team for audio diarization
- InsightFace team for face detection
- Gradio team for the web interface framework
This implementation enhances the original scaffold with:
- Qwen2.5-VL Integration: Primary captioning via Qwen VLM with optional Florence-2 alternative
- Modular Architecture: Proper adapter pattern for different AI models
- Enhanced Configuration: Comprehensive project.json with model selection options
- Better Error Handling: Graceful fallbacks when models aren't available
- Comprehensive Testing: Full smoke test suite and acceptance validation
- Professional Documentation: Complete setup guide and troubleshooting section
The core functionality remains true to the original vision while providing a production-ready implementation with proper error handling and extensibility.