# Handwritten Text Recognition
- ImageMagick (required for the `htr create` command)
  - Used for image processing, word detection, and image manipulation
  - Install via:
    - macOS: `brew install imagemagick`
    - Ubuntu/Debian: `apt-get install imagemagick`
    - Windows: Download from the ImageMagick website
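To confirm ImageMagick is on your `PATH` after installing, a quick version check works (depending on the ImageMagick version, the binary is named `magick` or `convert`):

```bash
# ImageMagick 7 ships the `magick` binary; version 6 uses `convert`
magick -version || convert -version
```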
You can install htr using Homebrew:

```bash
brew tap lehigh-university-libraries/homebrew https://github.com/lehigh-university-libraries/homebrew
brew install lehigh-university-libraries/homebrew/htr
```

Instead of Homebrew, you can download a binary for your system from the latest release, then put the binary in a directory that is in your `$PATH`.
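Either way, a quick way to confirm the binary is discoverable is shown below (the `--help` invocation is an assumption about a typical CLI; use whatever usage output your installed version provides):

```bash
# Confirm htr resolves on your PATH and prints its usage
command -v htr
htr --help
```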
The HTR tool supports multiple providers for text extraction from images. Set the appropriate environment variables for your chosen provider, or place them in a `.env` file (a sample `.env` follows the provider list below).
- Provider: `openai`
  - Environment variable: `OPENAI_API_KEY`
  - Models: `gpt-4o`, `gpt-4o-mini`, `gpt-4-vision-preview`
- Provider: `azure`
  - Environment variables: `AZURE_OCR_ENDPOINT`, `AZURE_OCR_API_KEY`
  - Models: Uses Azure Computer Vision Read API 4.0
- Provider: `gemini`
  - Environment variable: `GEMINI_API_KEY`
  - Models: `gemini-2.5-flash`
- Provider: `ollama`
  - Environment variable: `OLLAMA_URL` (optional, defaults to `http://localhost:11434`)
  - Models: `llava`, `llava:13b`, `llava:34b`, `moondream`, etc.
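For example, a `.env` file covering the providers above could look like the following sketch (the values are placeholders, not real keys; set only the variables for the provider you plan to use):

```bash
# .env: sample values only; replace with your own credentials
OPENAI_API_KEY=sk-your-openai-key
AZURE_OCR_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OCR_API_KEY=your-azure-key
GEMINI_API_KEY=your-gemini-key
# optional; defaults to http://localhost:11434 if unset
OLLAMA_URL=http://localhost:11434
```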
Evaluate OCR/HTR performance by sending images to AI vision models and comparing their output against ground truth transcripts.
```bash
htr eval \
--provider openai \
--model gpt-4o \
--prompt "Extract all text from this image" \
--temperature 0.0 \
--csv fixtures/images.csv \
--dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents
```

```bash
htr eval \
--provider azure \
--prompt "Extract all text from this image" \
--csv fixtures/images.csv \
--dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents
```

```bash
htr eval \
--provider gemini \
--model gemini-2.5-flash \
--prompt "Extract all text from this image" \
--temperature 0.0 \
--csv fixtures/images.csv \
--dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents
```

```bash
htr eval \
--provider ollama \
--model mistral-small3.2:24b \
--prompt "Extract all text from this image" \
--temperature 0.0 \
--csv fixtures/images.csv \
--dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents
```

Sometimes ground truth transcripts contain characters that cannot be deciphered. Use the `--ignore` flag to mark these unknown characters and exclude them from accuracy calculations.
How it works:
- Mark unknown characters in the ground truth with a special pattern (e.g., `|`)
- The LLM will still transcribe the unknown character in the image as something
- HTR will automatically skip the corresponding output in the transcription when calculating metrics
- If the ignore pattern is a standalone word (surrounded by spaces), the next word in the transcription is skipped
- If the ignore pattern is within a word, the next character in the transcription is skipped
Examples:

```bash
# Use pipe (|) to mark unknown characters
htr eval \
--provider openai \
--model gpt-4o \
--prompt "Extract all text from this image" \
--csv fixtures/images.csv \
--ignore '|' \
--dir ./ground-truth

# Use multiple ignore patterns (pipe and comma)
htr eval \
--provider gemini \
--model gemini-1.5-flash \
--prompt "Extract all text from this image" \
--csv fixtures/images.csv \
--ignore '|' \
--ignore ',' \
--dir ./ground-truth
```

Ground truth examples:
```
# Unknown word (standalone)
Ground truth: "The quick | fox"
LLM output: "The quick brown fox"
Result: Compares "The quick fox" vs "The quick fox" (skips "brown")

# Unknown character (within word)
Ground truth: "d|te"
LLM output: "date"
Result: Compares "dte" vs "dte" (skips "a")

# Multiple unknowns
Ground truth: "The | cat , jumped"
LLM output: "The quick cat suddenly jumped"
Result: Compares "The cat jumped" vs "The cat jumped" (skips "quick" and "suddenly")
```
Benefits:
- More accurate evaluation metrics when dealing with damaged or unclear documents
- Ignored characters are counted separately in results
- Character and word accuracy rates exclude unknown characters from denominators
`--single-line`: Convert multi-line documents to single-line text
Removes all newlines, carriage returns, and tabs from ground truth and transcripts, normalizing multiple spaces to single spaces. This is useful when:
- Your ground truth uses line breaks but the model output doesn't (or vice versa)
- You want to focus on content accuracy regardless of line formatting
- You need to normalize whitespace for fair comparison
```bash
# Evaluate as single-line text
htr eval \
--provider openai \
--model gpt-4o \
--prompt "Extract all text from this image" \
--csv fixtures/images.csv \
--single-line \
--dir ./ground-truth
```

Examples:
```
# With --single-line
Ground truth: "Line 1\nLine 2"
Model output: "Line 1 Line 2"
Result: Perfect match (newlines converted to spaces)

# With tabs and multiple spaces
Ground truth: "Hello\t\tWorld\n\nTest"
Model output: "Hello World Test"
Result: Perfect match (tabs, newlines, and multiple spaces normalized)
```
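The effect of this normalization can be approximated with a small shell pipeline; this is only an illustration of the whitespace handling described above, not the tool's actual implementation:

```bash
# Collapse tabs/newlines to spaces and squeeze repeated spaces,
# roughly mirroring what --single-line does before comparison
printf 'Hello\t\tWorld\n\nTest\n' | tr '\n\t' '  ' | tr -s ' ' | sed 's/ *$//'
# => Hello World Test
```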
### Create
Create hOCR XML files from images using custom word detection and LLM transcription:
```bash
# Create hOCR XML from an image (prints to stdout)
htr create --image path/to/image.jpg --provider ollama --model llava

# Save output to a file
htr create --image path/to/image.jpg --provider openai --model gpt-4o -o output.hocr

# Use different providers
htr create --image scan.png --provider gemini --model gemini-1.5-flash -o scan.hocr
```

Note: The `create` command requires ImageMagick to be installed on your system.
Evaluate transcriptions from external OCR/HTR models (like Loghi, Tesseract, Kraken, etc.) against ground truth transcripts. This command reads pre-generated transcriptions from text files and compares them to ground truth without making any API calls.
```bash
# Evaluate external model transcriptions
htr eval-external \
--csv loghi_results.csv \
--name loghi \
--dir ./transcriptions
```

The CSV file should have 2 columns:

```
transcript,transcription
ground-truth-1.txt,loghi-output-1.txt
ground-truth-2.txt,loghi-output-2.txt
```

Where:

- `transcript`: Path to the ground truth transcript file
- `transcription`: Path to the external model's transcription output file
- Run your images through an external HTR model (e.g., Loghi):

  ```bash
  # Example: Process images with Loghi
  for img in images/*.jpg; do
    loghi-htr predict --image "$img" --output "transcriptions/$(basename $img .jpg).txt"
  done
  ```

- Create a CSV mapping ground truth to external transcriptions:

  ```
  transcript,transcription
  groundtruth/page1.txt,transcriptions/page1.txt
  groundtruth/page2.txt,transcriptions/page2.txt
  ```

- Evaluate the external model's performance:

  ```bash
  htr eval-external --csv external_model.csv --name loghi --dir ./
  ```

- View results alongside other model evaluations:

  ```bash
  htr summary loghi
  htr csv # Compare all models including external ones
  ```

```bash
# Test just the first few rows
htr eval-external --csv external_model.csv --name loghi --rows 0,1,2 --dir ./
```

All evaluation flags work with external model evaluations:
Using `--ignore` for unknown characters:

```bash
# Evaluate with unknown character handling
htr eval-external \
--csv external_model.csv \
--name loghi \
--ignore '|' \
--dir ./

# Multiple ignore patterns
htr eval-external \
--csv tesseract_results.csv \
--name tesseract \
--ignore '|' \
--ignore ',' \
--dir ./transcriptions
```

Using `--single-line` for normalization:

```bash
# Convert to single-line for comparison
htr eval-external \
--csv external_model.csv \
--name loghi \
--single-line \
--dir ./

# Combine with ignore patterns
htr eval-external \
--csv tesseract_results.csv \
--name tesseract \
--single-line \
--ignore '|' \
--dir ./transcriptions
```

These flags are useful when:

- Your ground truth contains markers for unknown/unclear characters (`--ignore`)
- External model output has different line break formatting (`--single-line`)
View summary statistics from existing evaluation results:
```bash
# List all available evaluation files
htr summary

# View summary for a specific evaluation
htr summary eval_2025-07-24_07-44-38.yaml

# Or just use the filename without extension
htr summary eval_2025-07-24_07-44-38
```

Export aggregated evaluation results from all models as CSV/TSV format, sorted by performance:

```bash
# Export all evaluation results as TSV
htr csv

# Export with per-page cost calculation
htr csv --input-price 2.50 --output-price 10.0
```

The `csv` command scans all YAML files in the `evals/` directory and aggregates performance metrics for each model:

```bash
htr csv
```

Output columns:
- Model name and configuration
- Total evaluations performed
- Average character similarity (0-1)
- Average character accuracy (0-1)
- Average word similarity (0-1)
- Average word accuracy (0-1)
- Average word error rate (0-1)
Results are sorted by word similarity (best to worst) and output in tab-separated format for easy import into spreadsheet software.
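If you want to skim the results in a terminal rather than a spreadsheet, piping through `column` lines the fields up (this assumes the standard `column` utility is available on your system):

```bash
# Align the tab-separated output into readable columns
htr csv | column -t -s $'\t'
```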
When you provide pricing information, the csv command includes per-page cost estimates:
```bash
htr csv --input-price 2.50 --output-price 10.0
```

Additional columns with pricing:
- Average input tokens per page
- Average output tokens per page
- PageCost: Estimated cost per page in dollars
Example output:
```
Model                       PageCost  AvgWordAccuracy
gpt-4o                      0.011250  0.605094
claude-sonnet-4-5-20250929  0.009845  0.598710
gemini-2.5-flash            0.003420  0.572504
```
```bash
# 1. Run evaluations with different providers
htr eval --provider openai --model gpt-4o --prompt "Extract text" --csv images.csv
htr eval --provider claude --model claude-sonnet-4-5 --prompt "Extract text" --csv images.csv
htr eval --provider gemini --model gemini-2.5-flash --prompt "Extract text" --csv images.csv

# 2. Compare all models (performance only)
htr csv

# 3. Compare models with cost analysis
htr csv --input-price 2.50 --output-price 10.0

# 4. Save results to a file
htr csv --input-price 2.50 --output-price 10.0 > model_comparison.tsv
```

- Prices are specified as cost per million tokens
- Example: `--input-price 2.50` means $2.50 per 1M input tokens
- PageCost is calculated as: `(avgInputTokens / 1,000,000) × inputPrice + (avgOutputTokens / 1,000,000) × outputPrice` (a worked example follows this list)
- Only evaluations with token data will show cost information (OpenAI, Claude, Gemini, Ollama)
- Azure OCR evaluations will show `0.00` for tokens and cost (no token tracking)
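As a worked example of that formula, using the average token counts from the sample cost output later in this README and the GPT-4o prices above (this is just arithmetic, shown with `awk` so it can be pasted into a shell):

```bash
# PageCost for avg 1847.32 input / 456.18 output tokens at $2.50 / $10.00 per 1M
awk 'BEGIN {
  input_cost  = (1847.32 / 1000000) * 2.50    # ≈ $0.004618
  output_cost = (456.18  / 1000000) * 10.00   # ≈ $0.004562
  printf "PageCost ≈ $%.6f\n", input_cost + output_cost   # prints ≈ $0.009180
}'
```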
Estimate costs for large-scale document transcription based on token usage data from evaluation runs. The cost command analyzes token consumption from an evaluation file and projects costs for transcribing a larger number of documents.
- Token Tracking: When you run an evaluation, HTR automatically captures input and output token counts from API responses (OpenAI, Claude, Gemini, Ollama)
- Average Calculation: The cost command calculates average tokens per document from your evaluation
- Cost Projection: Estimates total cost for transcribing N documents based on your specified pricing
```bash
# Calculate cost estimate for an evaluation
htr cost gpt-4o --input-price 1.25 --output-price 10.0 --doc-count 1000
```

Required flags:

- `--input-price`: Cost per million input tokens (e.g., `1.25` for $1.25/1M tokens)
- `--output-price`: Cost per million output tokens (e.g., `10.0` for $10.00/1M tokens)

Optional flags:

- `--doc-count`: Number of documents to estimate (default: `1000`)
```bash
# 1. Run an evaluation to collect token usage data
htr eval \
--provider openai \
--model gpt-4o \
--prompt "Extract all text from this image" \
--csv sample_docs.csv \
--dir ./images

# 2. Calculate cost for 5000 documents using GPT-4o pricing
# Input: $2.50/1M tokens, Output: $10.00/1M tokens
htr cost gpt-4o --input-price 2.50 --output-price 10.0 --doc-count 5000
```

Example output:

```
=== COST ESTIMATION ===
File: gpt-4o.yaml
Provider: openai
Model: gpt-4o

=== Token Usage Statistics ===
Documents analyzed: 50
Average input tokens per document: 1847.32
Average output tokens per document: 456.18
Average total tokens per document: 2303.50

=== Pricing Configuration ===
Input token price: $2.50 per 1M tokens
Output token price: $10.00 per 1M tokens

=== Per Document Cost ===
Input cost: $0.004618
Output cost: $0.004562
Total cost: $0.009180

=== Estimated Cost for 5000 Documents ===
Input cost: $23.09
Output cost: $22.81
Total cost: $45.90
```
- OpenAI: ✅ Full token tracking (input/output)
- Claude: ✅ Full token tracking (input/output)
- Gemini: ✅ Full token tracking (input/output)
- Ollama: ✅ Full token tracking (input/output)
- Azure OCR: ❌ No token data (service doesn't provide usage info)
- Token counts are captured directly from API responses, not calculated by HTR
- Evaluation YAML files store token data as `inputtokens` and `outputtokens` fields
- Cost estimates are based on averages across all documents in the evaluation
- Use a representative sample of documents for more accurate cost projections
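Before running `htr cost`, you can do a quick sanity check that your evaluation file actually recorded token usage (the exact YAML layout may vary; this simply looks for the field names mentioned in the notes above):

```bash
# Quick check that the eval file contains token counts
# (adjust the path to the file under your evals/ directory)
grep -E 'inputtokens|outputtokens' evals/gpt-4o.yaml | head
```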
You can test individual rows from your CSV to quickly evaluate a single provider:
```bash
# Test just the first row (index 0)
htr eval --provider azure --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0 --dir /path/to/images

# Test multiple specific rows
htr eval --provider gemini --model gemini-pro-vision --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0,5,10 --dir /path/to/images
```

If Homebrew was used, you can simply upgrade the Homebrew formula for htr:
```bash
brew update && brew upgrade htr
```
If the binary was downloaded and added to your `$PATH`, updating htr could look as follows (requires `gh` and `tar`):

```bash
# update for your architecture
ARCH="htr_Linux_x86_64.tar.gz"
TAG=$(gh release list --exclude-pre-releases --exclude-drafts --limit 1 --repo lehigh-university-libraries/htr | awk '{print $3}')
gh release download $TAG --repo lehigh-university-libraries/htr --pattern $ARCH
tar -zxvf $ARCH
mv htr /directory/in/path/binary/was/placed
rm $ARCH
```