htr

Handwritten Text Recognition

Requirements

System Dependencies

  • ImageMagick (required for the htr create command)
    • Used for image processing, word detection, and image manipulation
    • Install via your system package manager (see the example below)
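
For example, typical ImageMagick installs on common platforms (these are standard package-manager commands, not specific to htr):

# macOS (Homebrew)
brew install imagemagick

# Debian/Ubuntu
sudo apt-get install imagemagick

# verify the install (ImageMagick 7; older versions use convert -version)
magick -version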

Install

You can install htr using Homebrew:

brew tap lehigh-university-libraries/homebrew https://github.com/lehigh-university-libraries/homebrew
brew install lehigh-university-libraries/homebrew/htr

Download Binary

Instead of Homebrew, you can download a binary for your system from the latest release.

Then put the binary in a directory that is in your $PATH (see the example below).
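
For example, an install sketch for a Linux x86_64 release archive (the archive name follows the pattern used in the Updating section below; /usr/local/bin is just a common choice):

# extract the downloaded release archive
tar -zxvf htr_Linux_x86_64.tar.gz

# make the binary executable and move it onto your $PATH
chmod +x htr
sudo mv htr /usr/local/bin/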

Usage

The HTR tool supports multiple providers for text extraction from images. Set the appropriate environment variables for your chosen provider or create them in a .env file.
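
For example, a minimal .env sketch for the default OpenAI provider; the variable names come from the provider list below, and the values are placeholders:

# .env
OPENAI_API_KEY=sk-...

# other providers use the variables listed under Supported Providers, e.g.
# GEMINI_API_KEY=...
# AZURE_OCR_ENDPOINT=...
# AZURE_OCR_API_KEY=...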

Supported Providers

OpenAI (default)

  • Provider: openai
  • Environment variable: OPENAI_API_KEY
  • Models: gpt-4o, gpt-4o-mini, gpt-4-vision-preview

Azure OCR

  • Provider: azure
  • Environment variables: AZURE_OCR_ENDPOINT, AZURE_OCR_API_KEY
  • Models: Uses Azure Computer Vision Read API 4.0

Google Gemini

  • Provider: gemini
  • Environment variable: GEMINI_API_KEY
  • Models: gemini-2.5-flash

Ollama (local)

  • Provider: ollama
  • Environment variable: OLLAMA_URL (optional, defaults to http://localhost:11434; see the example below)
  • Models: llava, llava:13b, llava:34b, moondream, etc.
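
For example, a quick local Ollama setup sketch; the ollama commands are standard Ollama CLI usage, not part of htr, and the host URL is a placeholder:

# pull a vision-capable model
ollama pull llava

# only needed if Ollama is not running at the default http://localhost:11434
export OLLAMA_URL=http://my-ollama-host:11434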

Eval

Evaluate OCR/HTR performance by sending images to AI vision models and comparing their output against ground truth transcripts.

OpenAI Example

htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Azure OCR Example

htr eval \
  --provider azure \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Gemini Example

htr eval \
  --provider gemini \
  --model gemini-2.5-flash \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Ollama Example

htr eval \
  --provider ollama \
  --model mistral-small3.2:24b \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Handling Unknown Characters with --ignore

Sometimes ground truth transcripts contain characters that cannot be deciphered. Use the --ignore flag to mark these unknown characters and exclude them from accuracy calculations.

How it works:

  • Mark unknown characters in ground truth with a special pattern (e.g., |)
  • The LLM will still transcribe something for the unknown character in the image
  • HTR will automatically skip the corresponding output in the transcription when calculating metrics
  • If the ignore pattern is a standalone word (surrounded by spaces), the next word in the transcription is skipped
  • If the ignore pattern is within a word, the next character in the transcription is skipped

Examples:

# Use pipe (|) to mark unknown characters
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --ignore '|' \
  --dir ./ground-truth

# Use multiple ignore patterns (pipe and comma)
htr eval \
  --provider gemini \
  --model gemini-1.5-flash \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --ignore '|' \
  --ignore ',' \
  --dir ./ground-truth

Ground truth examples:

# Unknown word (standalone)
Ground truth: "The quick | fox"
LLM output:   "The quick brown fox"
Result:       Compares "The quick fox" vs "The quick fox" (skips "brown")

# Unknown character (within word)
Ground truth: "d|te"
LLM output:   "date"
Result:       Compares "dte" vs "dte" (skips "a")

# Multiple unknowns
Ground truth: "The | cat , jumped"
LLM output:   "The quick cat suddenly jumped"
Result:       Compares "The cat jumped" vs "The cat jumped" (skips "quick" and "suddenly")

Benefits:

  • More accurate evaluation metrics when dealing with damaged or unclear documents
  • Ignored characters are counted separately in results
  • Character and word accuracy rates exclude unknown characters from denominators

Single Line Mode

--single-line: Convert multi-line documents to single-line text

Converts all newlines, carriage returns, and tabs in the ground truth and transcripts to spaces, then normalizes multiple spaces to single spaces. This is useful when:

  • Your ground truth uses line breaks but the model output doesn't (or vice versa)
  • You want to focus on content accuracy regardless of line formatting
  • You need to normalize whitespace for fair comparison

# Evaluate as single-line text
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --single-line \
  --dir ./ground-truth

Examples:

# With --single-line
Ground truth: "Line 1\nLine 2"
Model output: "Line 1 Line 2"
Result:       Perfect match (newlines converted to spaces)

# With tabs and multiple spaces
Ground truth: "Hello\t\tWorld\n\nTest"
Model output: "Hello World Test"
Result:       Perfect match (tabs, newlines, and multiple spaces normalized)

Create

Create hOCR XML files from images using custom word detection and LLM transcription:

# Create hOCR XML from an image (prints to stdout)
htr create --image path/to/image.jpg --provider ollama --model llava

# Save output to a file
htr create --image path/to/image.jpg --provider openai --model gpt-4o -o output.hocr

# Use different providers
htr create --image scan.png --provider gemini --model gemini-1.5-flash -o scan.hocr

Note: The create command requires ImageMagick to be installed on your system.
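
A hedged sketch for batch conversion using only the flags shown above; the directory names and model choice are placeholders:

# convert every JPEG in a directory to hOCR
mkdir -p hocr
for img in scans/*.jpg; do
  htr create --image "$img" --provider openai --model gpt-4o -o "hocr/$(basename "$img" .jpg).hocr"
done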

Eval External

Evaluate transcriptions from external OCR/HTR models (like Loghi, Tesseract, Kraken, etc.) against ground truth transcripts. This command reads pre-generated transcriptions from text files and compares them to ground truth without making any API calls.

Usage

# Evaluate external model transcriptions
htr eval-external \
  --csv loghi_results.csv \
  --name loghi \
  --dir ./transcriptions

CSV Format

The CSV file should have 2 columns:

transcript,transcription
ground-truth-1.txt,loghi-output-1.txt
ground-truth-2.txt,loghi-output-2.txt

Where:

  • transcript: Path to the ground truth transcript file
  • transcription: Path to the external model's transcription output file

Example Workflow

  1. Run your images through an external HTR model (e.g., Loghi):

    # Example: Process images with Loghi
    for img in images/*.jpg; do
      loghi-htr predict --image "$img" --output "transcriptions/$(basename "$img" .jpg).txt"
    done
  2. Create a CSV mapping ground truth to external transcriptions (see the sketch after this list for generating the mapping with a shell loop):

    transcript,transcription
    groundtruth/page1.txt,transcriptions/page1.txt
    groundtruth/page2.txt,transcriptions/page2.txt
    
  3. Evaluate the external model's performance:

    htr eval-external --csv external_model.csv --name loghi --dir ./
  4. View results alongside other model evaluations:

    htr summary loghi
    htr csv  # Compare all models including external ones
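
A sketch for step 2, generating the mapping CSV with a shell loop, assuming ground truth and transcription files share base names (the paths are placeholders):

echo "transcript,transcription" > external_model.csv
for gt in groundtruth/*.txt; do
  echo "$gt,transcriptions/$(basename "$gt")" >> external_model.csv
done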

Testing Specific Rows

# Test just the first few rows
htr eval-external --csv external_model.csv --name loghi --rows 0,1,2 --dir ./

Using Flags with External Models

All evaluation flags work with external model evaluations:

Using --ignore for unknown characters:

# Evaluate with unknown character handling
htr eval-external \
  --csv external_model.csv \
  --name loghi \
  --ignore '|' \
  --dir ./

# Multiple ignore patterns
htr eval-external \
  --csv tesseract_results.csv \
  --name tesseract \
  --ignore '|' \
  --ignore ',' \
  --dir ./transcriptions

Using --single-line for normalization:

# Convert to single-line for comparison
htr eval-external \
  --csv external_model.csv \
  --name loghi \
  --single-line \
  --dir ./

# Combine with ignore patterns
htr eval-external \
  --csv tesseract_results.csv \
  --name tesseract \
  --single-line \
  --ignore '|' \
  --dir ./transcriptions

These flags are useful when:

  • Your ground truth contains markers for unknown/unclear characters (--ignore)
  • External model output has different line break formatting (--single-line)

Summary

View summary statistics from existing evaluation results:

# List all available evaluation files
htr summary

# View summary for a specific evaluation
htr summary eval_2025-07-24_07-44-38.yaml

# Or just use the filename without extension
htr summary eval_2025-07-24_07-44-38

CSV Export

Export aggregated evaluation results from all models as CSV/TSV format, sorted by performance:

# Export all evaluation results as TSV
htr csv

# Export with per-page cost calculation
htr csv --input-price 2.50 --output-price 10.0

Basic Usage

The csv command scans all YAML files in the evals/ directory and aggregates performance metrics for each model:

htr csv

Output columns:

  • Model name and configuration
  • Total evaluations performed
  • Average character similarity (0-1)
  • Average character accuracy (0-1)
  • Average word similarity (0-1)
  • Average word accuracy (0-1)
  • Average word error rate (0-1)

Results are sorted by word similarity (best to worst) and output in tab-separated format for easy import into spreadsheet software.

Cost Analysis

When you provide pricing information, the csv command includes per-page cost estimates:

htr csv --input-price 2.50 --output-price 10.0

Additional columns with pricing:

  • Average input tokens per page
  • Average output tokens per page
  • PageCost: Estimated cost per page in dollars

Example output:

Model                          PageCost    AvgWordAccuracy
gpt-4o                        0.011250    0.605094
claude-sonnet-4-5-20250929    0.009845    0.598710
gemini-2.5-flash              0.003420    0.572504

Example Workflow

# 1. Run evaluations with different providers
htr eval --provider openai --model gpt-4o --prompt "Extract text" --csv images.csv
htr eval --provider claude --model claude-sonnet-4-5 --prompt "Extract text" --csv images.csv
htr eval --provider gemini --model gemini-2.5-flash --prompt "Extract text" --csv images.csv

# 2. Compare all models (performance only)
htr csv

# 3. Compare models with cost analysis
htr csv --input-price 2.50 --output-price 10.0

# 4. Save results to a file
htr csv --input-price 2.50 --output-price 10.0 > model_comparison.tsv

Pricing Notes

  • Prices are specified as cost per million tokens
  • Example: --input-price 2.50 means $2.50 per 1M input tokens
  • PageCost is calculated as: (avgInputTokens / 1,000,000) × inputPrice + (avgOutputTokens / 1,000,000) × outputPrice (see the worked example below)
  • Only evaluations with token data will show cost information (OpenAI, Claude, Gemini, Ollama)
  • Azure OCR evaluations will show 0.00 for tokens and cost (no token tracking)
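
For example, with hypothetical averages of 1,500 input tokens and 400 output tokens per page at the prices above:

PageCost = (1500 / 1,000,000) × 2.50 + (400 / 1,000,000) × 10.00
         = 0.00375 + 0.00400
         = $0.00775 per page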

Cost Estimation

Estimate costs for large-scale document transcription based on token usage data from evaluation runs. The cost command analyzes token consumption from an evaluation file and projects costs for transcribing a larger number of documents.

How It Works

  1. Token Tracking: When you run an evaluation, HTR automatically captures input and output token counts from API responses (OpenAI, Claude, Gemini, Ollama)
  2. Average Calculation: The cost command calculates average tokens per document from your evaluation
  3. Cost Projection: Estimates total cost for transcribing N documents based on your specified pricing

Usage

# Calculate cost estimate for an evaluation
htr cost gpt-4o --input-price 1.25 --output-price 10.0 --doc-count 1000

Required flags:

  • --input-price: Cost per million input tokens (e.g., 1.25 for $1.25/1M tokens)
  • --output-price: Cost per million output tokens (e.g., 10.0 for $10.00/1M tokens)

Optional flags:

  • --doc-count: Number of documents to estimate (default: 1000)

Example Workflow

# 1. Run an evaluation to collect token usage data
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv sample_docs.csv \
  --dir ./images

# 2. Calculate cost for 5000 documents using GPT-4o pricing
# Input: $2.50/1M tokens, Output: $10.00/1M tokens
htr cost gpt-4o --input-price 2.50 --output-price 10.0 --doc-count 5000

Example Output

=== COST ESTIMATION ===
File: gpt-4o.yaml
Provider: openai
Model: gpt-4o

=== Token Usage Statistics ===
Documents analyzed: 50
Average input tokens per document: 1847.32
Average output tokens per document: 456.18
Average total tokens per document: 2303.50

=== Pricing Configuration ===
Input token price: $2.50 per 1M tokens
Output token price: $10.00 per 1M tokens

=== Per Document Cost ===
Input cost: $0.004618
Output cost: $0.004562
Total cost: $0.009180

=== Estimated Cost for 5000 Documents ===
Input cost: $23.09
Output cost: $22.81
Total cost: $45.90

Token Support by Provider

  • OpenAI: ✅ Full token tracking (input/output)
  • Claude: ✅ Full token tracking (input/output)
  • Gemini: ✅ Full token tracking (input/output)
  • Ollama: ✅ Full token tracking (input/output)
  • Azure OCR: ❌ No token data (service doesn't provide usage info)

Notes

  • Token counts are captured directly from API responses, not calculated by HTR
  • Evaluation YAML files store token data as inputtokens and outputtokens fields (see the fragment below)
  • Cost estimates are based on averages across all documents in the evaluation
  • Use a representative sample of documents for more accurate cost projections
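
A hypothetical fragment showing where the token fields appear in an evaluation YAML entry; only the field names inputtokens and outputtokens are confirmed above, and the values and surrounding structure are illustrative:

# illustrative values only
inputtokens: 1847
outputtokens: 456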

Testing Individual Items

You can test individual rows from your CSV to quickly evaluate a single provider:

# Test just the first row (index 0)
htr eval --provider azure --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0 --dir /path/to/images

# Test multiple specific rows
htr eval --provider gemini --model gemini-pro-vision --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0,5,10 --dir /path/to/images

Updating

Homebrew

If you installed via Homebrew, you can simply upgrade the htr formula:

brew update && brew upgrade htr

Download Binary

If the binary was downloaded and added to your $PATH, updating htr could look like the following (requires gh and tar):

# update for your architecture
ARCH="htr_Linux_x86_64.tar.gz"
TAG=$(gh release list --exclude-pre-releases --exclude-drafts --limit 1 --repo lehigh-university-libraries/htr | awk '{print $3}')
gh release download $TAG --repo lehigh-university-libraries/htr --pattern $ARCH
tar -zxvf $ARCH
mv htr /directory/in/path/binary/was/placed
rm $ARCH
