This is a Cog implementation of DeepSeek-OCR, a powerful vision-language model for Optical Character Recognition (OCR) that can extract text from documents, charts, tables, and other images.
- High-Quality OCR: Extract text from various document types including PDFs, scanned documents, charts, and tables
- Grounded OCR: Optional bounding box detection for extracted text regions
- Markdown Output: Convert documents to structured markdown format
- Multiple Resolution Presets: Pre-configured settings for different use cases (Tiny, Small, Base, Large, Gundam)
- GPU Accelerated: Uses CUDA 12.4 with Flash Attention 2 for efficient inference
- Offline Mode: Runs completely offline with local model checkpoints
- Cog installed
- NVIDIA GPU with CUDA support
- Docker
- Clone this repository:
```
git clone https://github.com/lucataco/cog-deepseek-OCR.git
cd cog-deepseek-OCR
```
- Build the Cog image:
```
cog build
```
Note: Model weights (~20GB) are automatically downloaded from Replicate's CDN on first run using pget, a fast parallel downloader. The weights are cached in the `checkpoints/` directory for subsequent runs.
Extract text and convert to markdown format with bounding boxes:
cog predict -i [email protected]Simple text extraction without markdown formatting:
cog predict -i [email protected] -i task_type="Free OCR"Extract and describe chart or figure contents:
cog predict -i [email protected] -i task_type="Parse Figure"Find specific objects or text in the image:
cog predict -i [email protected] -i task_type="Locate Object by Reference" -i reference_text="the teacher"Choose from different resolution presets to balance speed and accuracy:
- Gundam (Recommended): `base_size=1024, image_size=640, crop_mode=True` - Best balance; handles large documents
- Tiny: `base_size=512, image_size=512, crop_mode=False` - Fastest, lower quality
- Small: `base_size=640, image_size=640, crop_mode=False` - Fast with decent quality
- Base: `base_size=1024, image_size=1024, crop_mode=False` - Good quality
- Large: `base_size=1280, image_size=1280, crop_mode=False` - Best quality, slower
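For orientation, the presets above could be represented internally as a simple lookup table. This is an illustrative sketch only; the exact structure in predict.py may differ:

```python
# Illustrative mapping of documented presets to model settings.
# Values are taken from the list above; the dict name is hypothetical.
RESOLUTION_PRESETS = {
    "Gundam (Recommended)": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
    "Tiny":                 {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "Small":                {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "Base":                 {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "Large":                {"base_size": 1280, "image_size": 1280, "crop_mode": False},
}
```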
Example with custom resolution:
cog predict -i [email protected] -i resolution_size="Large"cog predict \
-i [email protected] \
-i task_type="Convert to Markdown" \
-i resolution_size="Gundam (Recommended)"| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | Path | Required | Input image to perform OCR on (supports JPG, PNG, etc.) |
| `task_type` | String | `"Convert to Markdown"` | Task type: "Convert to Markdown", "Free OCR", "Parse Figure", or "Locate Object by Reference" |
| `reference_text` | String | `""` | Reference text to locate (only used with the "Locate Object by Reference" task). Examples: "the teacher", "20-10", "a red car" |
| `resolution_size` | String | `"Gundam (Recommended)"` | Resolution preset: "Gundam (Recommended)", "Tiny", "Small", "Base", or "Large" |
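The parameter table corresponds to a Cog predictor signature roughly like the following sketch. It is illustrative only; the actual predict.py may differ in details such as descriptions and ordering:

```python
# Sketch of the Cog predictor interface implied by the parameter table.
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def predict(
        self,
        image: Path = Input(description="Input image to perform OCR on"),
        task_type: str = Input(
            default="Convert to Markdown",
            choices=["Convert to Markdown", "Free OCR", "Parse Figure",
                     "Locate Object by Reference"],
        ),
        reference_text: str = Input(
            default="",
            description="Only used with 'Locate Object by Reference'",
        ),
        resolution_size: str = Input(
            default="Gundam (Recommended)",
            choices=["Gundam (Recommended)", "Tiny", "Small", "Base", "Large"],
        ),
    ) -> str:
        ...  # run OCR and return the extracted text
```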
The model automatically uses optimized prompts for each task type:
Convert to Markdown (default): Extracts text with structure and converts it to markdown format. Includes bounding box detection for grounded OCR.
- Use for: Documents, articles, papers, structured text
- Output: Markdown with headings, paragraphs, lists, etc.
- Prompt used: `<image>\n<|grounding|>Convert the document to markdown.`
Free OCR: Simple text extraction without markdown formatting or complex structure.
- Use for: Quick text extraction, simple documents
- Output: Plain text
- Prompt used: `<image>\nFree OCR.`
Parse Figure: Analyzes and describes charts, graphs, diagrams, and figures.
- Use for: Charts, graphs, diagrams, infographics
- Output: Description of the figure's content
- Prompt used: `<image>\nParse the figure.`
Locate Object by Reference: Finds and locates specific objects or text mentioned in the reference.
- Use for: Finding specific elements in complex images
- Output: Location and context of the referenced object
- Prompt used: `<image>\nLocate <|ref|>{reference_text}<|/ref|> in the image.`
- Note: Requires the `reference_text` parameter
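Taken together, the task-to-prompt mapping can be pictured as a small lookup table. This sketch uses the documented prompt templates; the function name and structure are illustrative, not necessarily how predict.py organizes them:

```python
# Illustrative: build the model prompt from the documented templates.
def build_prompt(task_type: str, reference_text: str = "") -> str:
    prompts = {
        "Convert to Markdown":
            "<image>\n<|grounding|>Convert the document to markdown.",
        "Free OCR": "<image>\nFree OCR.",
        "Parse Figure": "<image>\nParse the figure.",
        "Locate Object by Reference":
            f"<image>\nLocate <|ref|>{reference_text}<|/ref|> in the image.",
    }
    if task_type == "Locate Object by Reference" and not reference_text:
        raise ValueError("reference_text is required for this task type")
    return prompts[task_type]
```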
- Base Model: DeepSeek-V2 (7B parameters)
- Vision Encoder: Combination of SAM ViT-B and CLIP-L
- Attention: Flash Attention 2 for efficient inference
- Precision: bfloat16 for optimal GPU performance
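Given these specs, loading the model with transformers would look roughly like the following sketch, assuming the standard `trust_remote_code` loading path for the local checkpoints; the actual code in predict.py may differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load from the local checkpoints/ directory (offline), in bfloat16,
# with Flash Attention 2, matching the specs listed above.
tokenizer = AutoTokenizer.from_pretrained("checkpoints", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "checkpoints",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = model.eval().cuda()
```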
The model uses dynamic preprocessing with crop mode to handle large documents efficiently:
- Gundam Mode: ~760 visual tokens for a typical document
- Compression Ratio: Typically 0.6-0.8 for text-heavy documents
- Max Tokens: Up to 8192 new tokens for output
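To make the crop-mode idea concrete, here is a rough, purely illustrative sketch of grid tiling. It is not the model's actual preprocessing, which (as an assumption) typically also keeps a global view at `base_size` and selects the grid by aspect ratio:

```python
from PIL import Image

def tile_image(img: Image.Image, image_size: int = 640, max_tiles: int = 9):
    """Split a large page into image_size x image_size tiles (illustrative)."""
    # Pick a grid that roughly matches the page dimensions
    cols = max(1, round(img.width / image_size))
    rows = max(1, round(img.height / image_size))
    # Cap the tile count so the visual token budget stays bounded (assumed cap)
    while rows * cols > max_tiles:
        cols, rows = max(1, cols - 1), max(1, rows - 1)
    resized = img.resize((cols * image_size, rows * image_size))
    return [
        resized.crop((c * image_size, r * image_size,
                      (c + 1) * image_size, (r + 1) * image_size))
        for r in range(rows) for c in range(cols)
    ]
```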
```
deepseek-ocr/
├── cog.yaml            # Cog configuration
├── predict.py          # Prediction interface
├── requirements.txt    # Python dependencies
├── checkpoints/        # Model checkpoints directory
│   ├── __init__.py     # Package initialization
│   ├── modeling_deepseekocr.py
│   ├── modeling_deepseekv2.py
│   ├── configuration_deepseek_v2.py
│   ├── deepencoder.py
│   └── conversation.py
└── README.md
```
- `predict.py`: Main Cog predictor interface with automatic weight downloading, warning suppression, and parameter handling
- `checkpoints/modeling_deepseekocr.py`: Core model implementation with custom inference logic
- `cog.yaml`: Cog configuration specifying the runtime environment and pget installation
- `requirements.txt`: Python package dependencies
The implementation uses pget, a fast parallel file downloader, to automatically fetch model weights on first run:
- Weights are downloaded from: `https://weights.replicate.delivery/default/deepseek-ai/DeepSeek-OCR/model.tar`
- Downloaded to the `checkpoints/` directory
- Cached for subsequent runs
- Download time: ~3-5 minutes on a typical connection
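In spirit, the download step in the predictor's setup looks something like this sketch; the function name and cache check are illustrative, and `pget -x` is pget's download-and-extract mode:

```python
import subprocess
from pathlib import Path

WEIGHTS_URL = "https://weights.replicate.delivery/default/deepseek-ai/DeepSeek-OCR/model.tar"

def download_weights(dest: str = "checkpoints") -> None:
    # Skip the download if weights are already cached from a previous run
    if any(Path(dest).glob("*.safetensors")):
        return
    # pget -x downloads the tarball in parallel and extracts it into dest
    subprocess.check_call(["pget", "-x", WEIGHTS_URL, dest])
```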
```
# Test default (Convert to Markdown)
cog predict -i [email protected]

# Test different task types
cog predict -i [email protected] -i task_type="Free OCR"
cog predict -i [email protected] -i task_type="Parse Figure"

# Test different resolutions
cog predict -i [email protected] -i resolution_size="Tiny"
cog predict -i [email protected] -i resolution_size="Large"

# Test locate object
cog predict -i [email protected] -i task_type="Locate Object by Reference" -i reference_text="title"
```
If you encounter model loading errors:
1. First Run - Automatic Download: On the first prediction, the model weights (~20GB) are automatically downloaded from Replicate's CDN. This may take several minutes depending on your connection speed.
2. Verify checkpoints: After the download, ensure the checkpoints are present:
   ```
   ls -la checkpoints/
   # Should contain: config.json, model safetensors files, and Python files
   ```
3. Manual Download: If the automatic download fails, you can download the weights manually from Hugging Face:
   ```
   pip install huggingface-hub
   huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir checkpoints
   ```
4. Verify that the `__init__.py` file exists in the checkpoints directory.
If you run out of GPU memory:
- Use a smaller `resolution_size` preset such as "Tiny" or "Small"
- Reduce `base_size` and `image_size` if using a custom configuration
- Disable `crop_mode` for smaller documents
The implementation includes comprehensive warning suppression for known transformers library warnings that cannot be fixed at the application level. If you see unexpected warnings, they may indicate actual issues that need attention.
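As an illustration, suppressing one such known warning could look like the following; the message pattern here is hypothetical, and the actual filters in predict.py may target different warnings:

```python
import warnings

# Silence a known transformers warning that cannot be fixed at the
# application level (the message pattern below is illustrative only)
warnings.filterwarnings(
    "ignore",
    message=".*do_sample.*",
    category=UserWarning,
)
```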
If you use this implementation, please cite the original DeepSeek-OCR paper:
```
@article{deepseek-ocr,
  title={DeepSeek-OCR: A Vision-Language Model for Optical Character Recognition},
  author={DeepSeek AI},
  year={2024},
  url={https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}
```
This implementation follows the license of the original DeepSeek-OCR model. Please refer to the official repository for licensing details.
- DeepSeek AI for the original model
- Replicate for the Cog framework
- Hugging Face for model hosting and transformers library
For issues related to:
- This Cog implementation: Open an issue in this repository
- The original model: Visit the DeepSeek-OCR repository
- Cog framework: Check the Cog documentation