
Nanonets OCR Processor

A batch OCR processing tool that extracts text from images and PDFs using the Nanonets OCR model. It runs on the CPU by default for broad compatibility and writes readable, well-structured output for each document.

Features

  • CPU-First Processing: Uses CPU by default for better compatibility and resource management
  • Multiple Output Formats: Per-document Markdown files plus a CSV results summary
  • Comprehensive Analysis: Detects tables, equations, images, watermarks, and page numbers
  • Progress Tracking: Real-time processing progress with time estimates
  • Error Handling: Robust error handling with detailed failure reporting

Supported File Formats

  • Images: .jpg, .jpeg, .png, .tiff, .tif
  • Documents: .pdf
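
For reference, file discovery can be as simple as filtering the input directory by these extensions. This is a minimal sketch, not necessarily the exact logic in batch_ocr.py:

from pathlib import Path

# Extensions accepted by the processor (see "Supported File Formats" above)
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".pdf"}

def find_documents(input_dir):
    """Return all supported files in input_dir, sorted for a stable processing order."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )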

Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA Toolkit (optional, for GPU acceleration)
  • Git

Local Installation

  1. Clone the repository:
git clone <repository-url>
cd OCR
  2. Create and activate a virtual environment:
# Create virtual environment
python3 -m venv ocr_env

# Activate virtual environment
# On Linux/macOS:
source ocr_env/bin/activate
# On Windows:
ocr_env\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Prepare data directories:
# Place your documents in the example_files directory (or point --input-dir at a custom directory). Some sample files are already included.
cp /path/to/your/documents/* example_files/

Usage

Basic Usage (CPU - Default)

# CPU: conservative defaults (batch size 1, 1080 px max image size)
python batch_ocr.py --cpu --batch-size 1 --pdf-chunk-size 1

# GPU with 8 GB VRAM: conservative defaults (batch size 1, 1080 px max image size)
python batch_ocr.py --gpu --batch-size 1 --pdf-chunk-size 1

# GPU with 16 GB VRAM: larger batch size for better throughput
python batch_ocr.py --gpu --batch-size 5 --pdf-chunk-size 2
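
The --cpu/--gpu switch boils down to a device and data-type choice. A hedged sketch of that decision (batch_ocr.py may differ in detail):

import torch

def select_device(use_gpu=False):
    """CPU is the default; the GPU is used only when requested and actually available."""
    if use_gpu and torch.cuda.is_available():
        # Half precision keeps VRAM usage down during GPU inference
        return torch.device("cuda"), torch.float16
    # float32 is the safe, well-supported choice for CPU inference
    return torch.device("cpu"), torch.float32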

Custom Input/Output Directories

python batch_ocr.py --input-dir /path/to/data --output-dir /path/to/results

Advanced Usage Examples

# Process with GPU, custom batch size, and directory
python batch_ocr.py --gpu --batch-size 8 --input-dir example_files --output-dir results

# Process PDFs with smaller chunks for memory efficiency
python batch_ocr.py --pdf-chunk-size 2 --max-image-width 1024 --max-image-height 1024

# Full custom configuration
python batch_ocr.py \
  --gpu \
  --input-dir "example_files" \
  --output-dir "results" \
  --batch-size 4 \
  --pdf-chunk-size 1 \
  --max-image-width 2048 \
  --max-image-height 2048
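
--pdf-chunk-size and --max-image-width/--max-image-height keep memory bounded by rendering only a few PDF pages at a time and downscaling oversized pages before OCR. A minimal sketch using PyMuPDF and Pillow (both in requirements.txt); the function name is illustrative, not the actual API of batch_ocr.py:

import fitz  # PyMuPDF
from PIL import Image

def iter_pdf_images(pdf_path, chunk_size=1, max_width=1080, max_height=1080):
    """Yield lists of PIL images, chunk_size pages at a time, capped to the given dimensions."""
    doc = fitz.open(pdf_path)
    chunk = []
    for page in doc:
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        img.thumbnail((max_width, max_height))  # downscale in place, preserving aspect ratio
        chunk.append(img)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
    doc.close()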

Command Line Options

  • --cpu: Use CPU for processing (default)
  • --gpu: Use GPU for processing
  • --input-dir: Input directory containing documents (default: example_files)
  • --output-dir: Output directory for results (default: extracted_text)
  • --csv-file: CSV file for results summary (default: ocr_results.csv)
  • --batch-size: Number of files to process per batch (default: 2)
  • --pdf-chunk-size: Number of PDF pages to process at once (default: 1)
  • --max-image-width: Maximum image width in pixels (default: 1080)
  • --max-image-height: Maximum image height in pixels (default: 1080)

Note: The --cpu and --gpu options are mutually exclusive. If neither is specified, CPU processing is used by default.
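
For reference, the option surface above corresponds to an argument parser roughly like this sketch (documented flags and defaults only; the real parser in batch_ocr.py may add more):

import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Batch OCR with the Nanonets OCR model")
    device = parser.add_mutually_exclusive_group()
    device.add_argument("--cpu", action="store_true", help="Use CPU for processing (default)")
    device.add_argument("--gpu", action="store_true", help="Use GPU for processing")
    parser.add_argument("--input-dir", default="example_files")
    parser.add_argument("--output-dir", default="extracted_text")
    parser.add_argument("--csv-file", default="ocr_results.csv")
    parser.add_argument("--batch-size", type=int, default=2)
    parser.add_argument("--pdf-chunk-size", type=int, default=1)
    parser.add_argument("--max-image-width", type=int, default=1080)
    parser.add_argument("--max-image-height", type=int, default=1080)
    return parser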

Project Structure

OCR/
├── batch_ocr.py               # Main OCR processing script
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── .gitignore                 # Git exclusions
├── example_files/             # Sample input files directory
│   ├── a01-000u.png          # Example image files
│   ├── a01-003u.png
│   └── ...
└── extracted_text/            # Output directory (created when processing)
    ├── document1.md           # Individual markdown files for each document
    ├── document2.md
    └── ...

Output Structure

The processor writes one readable Markdown file per document, plus a CSV summary of the run (see --csv-file):

extracted_text/               # Default output directory
├── document1.md              # Individual markdown files for each document
├── document2.md
├── document3.md
└── ...

Output Formats Explained

  1. Individual Markdown Files (extracted_text/*.md)
    • One Markdown file per processed document
    • Formatted with headers and metadata
    • Preserves OCR formatting (tables, equations, etc.)
    • Easy to read and search
  2. CSV Results Summary (ocr_results.csv)
    • Processing summary for all files (path set with --csv-file)

Example Output

Individual Output File Format

================================================================================
OCR EXTRACTION RESULTS
================================================================================

DOCUMENT INFORMATION:
----------------------------------------
Filename: example.pdf
File Format: .pdf
File Size: 1,234,567 bytes
Created: 2025-06-26T22:10:37.102788

PROCESSING INFORMATION:
----------------------------------------
Processed: 2025-06-26T22:11:09.529471
Processing Time: 2.93 seconds
Pages Processed: 3

CONTENT ANALYSIS:
----------------------------------------
Word Count: 1,245
Character Count: 7,890
Contains Tables: Yes
Contains Equations: No
Contains Images: Yes
Contains Watermarks: No
Contains Page Numbers: Yes

EXTRACTED TEXT:
================================================================================

[Extracted text content here...]
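
The CONTENT ANALYSIS block can be produced with simple heuristics over the extracted text. A hedged sketch, assuming the model marks tables, images, watermarks, and page numbers with tags and equations with LaTeX delimiters; the real checks in batch_ocr.py may differ:

def analyze_content(text):
    """Rough heuristics for the CONTENT ANALYSIS section of each output file."""
    def yes_no(flag):
        return "Yes" if flag else "No"
    return {
        "Word Count": len(text.split()),
        "Character Count": len(text),
        "Contains Tables": yes_no("<table>" in text),
        "Contains Equations": yes_no("$" in text),          # assumes LaTeX-style math delimiters
        "Contains Images": yes_no("<img" in text),
        "Contains Watermarks": yes_no("<watermark>" in text),
        "Contains Page Numbers": yes_no("<page_number>" in text),
    }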

Performance

  • CPU Processing: Optimized for CPU with appropriate data types
  • Memory Efficient: Conservative defaults for systems with limited RAM
  • Progress Tracking: Real-time progress updates with time estimates
  • Error Recovery: Continues processing even if individual files fail
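
Error recovery here means wrapping each file in its own try/except so a single failure does not stop the batch, then recording every outcome in the CSV summary. A minimal sketch (process_file is a hypothetical stand-in for the actual per-file OCR call and is assumed to return a metadata dict):

import traceback
import pandas as pd

def process_all(files, process_file, csv_file="ocr_results.csv"):
    """Process every file, continue past failures, and write a results summary CSV."""
    rows = []
    for path in files:
        try:
            result = process_file(path)
            rows.append({"file": str(path), "status": "ok", **result})
        except Exception as exc:
            # Log the failure and keep going with the remaining files
            traceback.print_exc()
            rows.append({"file": str(path), "status": "failed", "error": str(exc)})
    pd.DataFrame(rows).to_csv(csv_file, index=False)
    return rows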

Troubleshooting

Runtime Issues

  1. Out of Memory: Use CPU mode (default), lower --batch-size and --pdf-chunk-size, or reduce --max-image-width/--max-image-height

Performance Tips

  • Use CPU mode (default) for most use cases; it is more stable and broadly compatible
  • Use GPU mode only if you have sufficient VRAM (8 GB+) and CUDA support
  • Process large batches in smaller chunks if memory is limited
  • Use SSD storage for faster I/O when processing many files
  • Monitor memory usage with htop or nvidia-smi during processing
  • Close other applications to free up memory during processing

Requirements

See requirements.txt for complete dependency list. Key requirements:

  • Python 3.8+
  • PyTorch
  • Transformers
  • PIL (Pillow)
  • PyMuPDF
  • pandas

License

This project uses the Nanonets OCR model. Please check the model's license terms for commercial usage.
