htr

Handwritten Text Recognition

Requirements

System Dependencies

  • ImageMagick (required for the htr create command)
    • Used for image processing, word detection, and image manipulation
    • Install via your system package manager (see the example below)
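
For example, typical ImageMagick installs on common platforms (these are standard package-manager commands, not specific to htr):

# macOS (Homebrew)
brew install imagemagick

# Debian/Ubuntu
sudo apt-get install imagemagick

# verify the install (ImageMagick 7; older versions use convert -version)
magick -version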

Install

You can install htr using Homebrew:

brew tap lehigh-university-libraries/homebrew https://github.com/lehigh-university-libraries/homebrew
brew install lehigh-university-libraries/homebrew/htr

Download Binary

Instead of Homebrew, you can download a binary for your system from the latest release.

Then put the binary in a directory that is in your $PATH (see the example below).
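
For example, an install sketch for a Linux x86_64 release archive (the archive name follows the pattern used in the Updating section below; /usr/local/bin is just a common choice):

# extract the downloaded release archive
tar -zxvf htr_Linux_x86_64.tar.gz

# make the binary executable and move it onto your $PATH
chmod +x htr
sudo mv htr /usr/local/bin/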

Usage

The HTR tool supports multiple providers for text extraction from images. Set the appropriate environment variables for your chosen provider or create them in a .env file.
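
For example, a minimal .env sketch for the default OpenAI provider; the variable names come from the provider list below, and the values are placeholders:

# .env
OPENAI_API_KEY=sk-...

# other providers use the variables listed under Supported Providers, e.g.
# GEMINI_API_KEY=...
# AZURE_OCR_ENDPOINT=...
# AZURE_OCR_API_KEY=...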

Supported Providers

OpenAI (default)

  • Provider: openai
  • Environment variable: OPENAI_API_KEY
  • Models: gpt-4o, gpt-4o-mini, gpt-4-vision-preview

Azure OCR

  • Provider: azure
  • Environment variables: AZURE_OCR_ENDPOINT, AZURE_OCR_API_KEY
  • Models: Uses Azure Computer Vision Read API 4.0

Google Gemini

  • Provider: gemini
  • Environment variable: GEMINI_API_KEY
  • Models: gemini-2.5-flash

Ollama (local)

  • Provider: ollama
  • Environment variable: OLLAMA_URL (optional, defaults to http://localhost:11434; see the example below)
  • Models: llava, llava:13b, llava:34b, moondream, etc.
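
For example, a quick local Ollama setup sketch; the ollama commands are standard Ollama CLI usage, not part of htr, and the host URL is a placeholder:

# pull a vision-capable model
ollama pull llava

# only needed if Ollama is not running at the default http://localhost:11434
export OLLAMA_URL=http://my-ollama-host:11434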

Eval

Evaluate OCR/HTR performance by sending images to AI vision models and comparing their output against ground truth transcripts.

OpenAI Example

htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Azure OCR Example

htr eval \
  --provider azure \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Gemini Example

htr eval \
  --provider gemini \
  --model gemini-2.5-flash \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Ollama Example

htr eval \
  --provider ollama \
  --model mistral-small3.2:24b \
  --prompt "Extract all text from this image" \
  --temperature 0.0 \
  --csv fixtures/images.csv \
  --dir /Volumes/2025-Lyrasis-Catalyst-Fund/ground-truth-documents

Handling Unknown Characters with --ignore

Sometimes ground truth transcripts contain characters that cannot be deciphered. Use the --ignore flag to mark these unknown characters and exclude them from accuracy calculations.

How it works:

  • Mark unknown characters in ground truth with a special pattern (e.g., |)
  • The LLM will still transcribe something for the unknown character in the image
  • HTR will automatically skip the corresponding output in the transcription when calculating metrics
  • If the ignore pattern is a standalone word (surrounded by spaces), the next word in the transcription is skipped
  • If the ignore pattern is within a word, the next character in the transcription is skipped

Examples:

# Use pipe (|) to mark unknown characters
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --ignore '|' \
  --dir ./ground-truth

# Use multiple ignore patterns (pipe and comma)
htr eval \
  --provider gemini \
  --model gemini-1.5-flash \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --ignore '|' \
  --ignore ',' \
  --dir ./ground-truth

Ground truth examples:

# Unknown word (standalone)
Ground truth: "The quick | fox"
LLM output:   "The quick brown fox"
Result:       Compares "The quick fox" vs "The quick fox" (skips "brown")

# Unknown character (within word)
Ground truth: "d|te"
LLM output:   "date"
Result:       Compares "dte" vs "dte" (skips "a")

# Multiple unknowns
Ground truth: "The | cat , jumped"
LLM output:   "The quick cat suddenly jumped"
Result:       Compares "The cat jumped" vs "The cat jumped" (skips "quick" and "suddenly")

Benefits:

  • More accurate evaluation metrics when dealing with damaged or unclear documents
  • Ignored characters are counted separately in results
  • Character and word accuracy rates exclude unknown characters from denominators

Single Line Mode

--single-line: Convert multi-line documents to single-line text

Converts all newlines, carriage returns, and tabs in the ground truth and transcripts to spaces, then normalizes multiple spaces to single spaces. This is useful when:

  • Your ground truth uses line breaks but the model output doesn't (or vice versa)
  • You want to focus on content accuracy regardless of line formatting
  • You need to normalize whitespace for fair comparison

# Evaluate as single-line text
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv fixtures/images.csv \
  --single-line \
  --dir ./ground-truth

Examples:

# With --single-line
Ground truth: "Line 1\nLine 2"
Model output: "Line 1 Line 2"
Result:       Perfect match (newlines converted to spaces)

# With tabs and multiple spaces
Ground truth: "Hello\t\tWorld\n\nTest"
Model output: "Hello World Test"
Result:       Perfect match (tabs, newlines, and multiple spaces normalized)

Create

Create hOCR XML files from images using custom word detection and LLM transcription:

# Create hOCR XML from an image (prints to stdout)
htr create --image path/to/image.jpg --provider ollama --model llava

# Save output to a file
htr create --image path/to/image.jpg --provider openai --model gpt-4o -o output.hocr

# Use different providers
htr create --image scan.png --provider gemini --model gemini-1.5-flash -o scan.hocr

Note: The create command requires ImageMagick to be installed on your system.
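
A hedged sketch for batch conversion using only the flags shown above; the directory names and model choice are placeholders:

# convert every JPEG in a directory to hOCR
mkdir -p hocr
for img in scans/*.jpg; do
  htr create --image "$img" --provider openai --model gpt-4o -o "hocr/$(basename "$img" .jpg).hocr"
done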

Eval External

Evaluate transcriptions from external OCR/HTR models (like Loghi, Tesseract, Kraken, etc.) against ground truth transcripts. This command reads pre-generated transcriptions from text files and compares them to ground truth without making any API calls.

Usage

# Evaluate external model transcriptions
htr eval-external \
  --csv loghi_results.csv \
  --name loghi \
  --dir ./transcriptions

CSV Format

The CSV file should have 2 columns:

transcript,transcription
ground-truth-1.txt,loghi-output-1.txt
ground-truth-2.txt,loghi-output-2.txt

Where:

  • transcript: Path to the ground truth transcript file
  • transcription: Path to the external model's transcription output file

Example Workflow

  1. Run your images through an external HTR model (e.g., Loghi):

    # Example: Process images with Loghi
    for img in images/*.jpg; do
      loghi-htr predict --image "$img" --output "transcriptions/$(basename "$img" .jpg).txt"
    done
  2. Create a CSV mapping ground truth to external transcriptions (see the sketch after this list for generating the mapping with a shell loop):

    transcript,transcription
    groundtruth/page1.txt,transcriptions/page1.txt
    groundtruth/page2.txt,transcriptions/page2.txt
    
  3. Evaluate the external model's performance:

    htr eval-external --csv external_model.csv --name loghi --dir ./
  4. View results alongside other model evaluations:

    htr summary loghi
    htr csv  # Compare all models including external ones
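
A sketch for step 2, generating the mapping CSV with a shell loop, assuming ground truth and transcription files share base names (the paths are placeholders):

echo "transcript,transcription" > external_model.csv
for gt in groundtruth/*.txt; do
  echo "$gt,transcriptions/$(basename "$gt")" >> external_model.csv
done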

Testing Specific Rows

# Test just the first few rows
htr eval-external --csv external_model.csv --name loghi --rows 0,1,2 --dir ./

Using Flags with External Models

All evaluation flags work with external model evaluations:

Using --ignore for unknown characters:

# Evaluate with unknown character handling
htr eval-external \
  --csv external_model.csv \
  --name loghi \
  --ignore '|' \
  --dir ./

# Multiple ignore patterns
htr eval-external \
  --csv tesseract_results.csv \
  --name tesseract \
  --ignore '|' \
  --ignore ',' \
  --dir ./transcriptions

Using --single-line for normalization:

# Convert to single-line for comparison
htr eval-external \
  --csv external_model.csv \
  --name loghi \
  --single-line \
  --dir ./

# Combine with ignore patterns
htr eval-external \
  --csv tesseract_results.csv \
  --name tesseract \
  --single-line \
  --ignore '|' \
  --dir ./transcriptions

These flags are useful when:

  • Your ground truth contains markers for unknown/unclear characters (--ignore)
  • External model output has different line break formatting (--single-line)

Summary

View summary statistics from existing evaluation results:

# List all available evaluation files
htr summary

# View summary for a specific evaluation
htr summary eval_2025-07-24_07-44-38.yaml

# Or just use the filename without extension
htr summary eval_2025-07-24_07-44-38

CSV Export

Export aggregated evaluation results from all models as CSV/TSV format, sorted by performance:

# Export all evaluation results as TSV
htr csv

# Export with per-page cost calculation
htr csv --input-price 2.50 --output-price 10.0

Basic Usage

The csv command scans all YAML files in the evals/ directory and aggregates performance metrics for each model:

htr csv

Output columns:

  • Model name and configuration
  • Total evaluations performed
  • Average character similarity (0-1)
  • Average character accuracy (0-1)
  • Average word similarity (0-1)
  • Average word accuracy (0-1)
  • Average word error rate (0-1)

Results are sorted by word similarity (best to worst) and output in tab-separated format for easy import into spreadsheet software.

Cost Analysis

When you provide pricing information, the csv command includes per-page cost estimates:

htr csv --input-price 2.50 --output-price 10.0

Additional columns with pricing:

  • Average input tokens per page
  • Average output tokens per page
  • PageCost: Estimated cost per page in dollars

Example output:

Model                          PageCost    AvgWordAccuracy
gpt-4o                        0.011250    0.605094
claude-sonnet-4-5-20250929    0.009845    0.598710
gemini-2.5-flash              0.003420    0.572504

Example Workflow

# 1. Run evaluations with different providers
htr eval --provider openai --model gpt-4o --prompt "Extract text" --csv images.csv
htr eval --provider claude --model claude-sonnet-4-5 --prompt "Extract text" --csv images.csv
htr eval --provider gemini --model gemini-2.5-flash --prompt "Extract text" --csv images.csv

# 2. Compare all models (performance only)
htr csv

# 3. Compare models with cost analysis
htr csv --input-price 2.50 --output-price 10.0

# 4. Save results to a file
htr csv --input-price 2.50 --output-price 10.0 > model_comparison.tsv

Pricing Notes

  • Prices are specified as cost per million tokens
  • Example: --input-price 2.50 means $2.50 per 1M input tokens
  • PageCost is calculated as: (avgInputTokens / 1,000,000) × inputPrice + (avgOutputTokens / 1,000,000) × outputPrice (see the worked example below)
  • Only evaluations with token data will show cost information (OpenAI, Claude, Gemini, Ollama)
  • Azure OCR evaluations will show 0.00 for tokens and cost (no token tracking)
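
For example, with hypothetical averages of 1,500 input tokens and 400 output tokens per page at the prices above:

PageCost = (1500 / 1,000,000) × 2.50 + (400 / 1,000,000) × 10.00
         = 0.00375 + 0.00400
         = $0.00775 per page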

Cost Estimation

Estimate costs for large-scale document transcription based on token usage data from evaluation runs. The cost command analyzes token consumption from an evaluation file and projects costs for transcribing a larger number of documents.

How It Works

  1. Token Tracking: When you run an evaluation, HTR automatically captures input and output token counts from API responses (OpenAI, Claude, Gemini, Ollama)
  2. Average Calculation: The cost command calculates average tokens per document from your evaluation
  3. Cost Projection: Estimates total cost for transcribing N documents based on your specified pricing

Usage

# Calculate cost estimate for an evaluation
htr cost gpt-4o --input-price 1.25 --output-price 10.0 --doc-count 1000

Required flags:

  • --input-price: Cost per million input tokens (e.g., 1.25 for $1.25/1M tokens)
  • --output-price: Cost per million output tokens (e.g., 10.0 for $10.00/1M tokens)

Optional flags:

  • --doc-count: Number of documents to estimate (default: 1000)

Example Workflow

# 1. Run an evaluation to collect token usage data
htr eval \
  --provider openai \
  --model gpt-4o \
  --prompt "Extract all text from this image" \
  --csv sample_docs.csv \
  --dir ./images

# 2. Calculate cost for 5000 documents using GPT-4o pricing
# Input: $2.50/1M tokens, Output: $10.00/1M tokens
htr cost gpt-4o --input-price 2.50 --output-price 10.0 --doc-count 5000

Example Output

=== COST ESTIMATION ===
File: gpt-4o.yaml
Provider: openai
Model: gpt-4o

=== Token Usage Statistics ===
Documents analyzed: 50
Average input tokens per document: 1847.32
Average output tokens per document: 456.18
Average total tokens per document: 2303.50

=== Pricing Configuration ===
Input token price: $2.50 per 1M tokens
Output token price: $10.00 per 1M tokens

=== Per Document Cost ===
Input cost: $0.004618
Output cost: $0.004562
Total cost: $0.009180

=== Estimated Cost for 5000 Documents ===
Input cost: $23.09
Output cost: $22.81
Total cost: $45.90

Token Support by Provider

  • OpenAI: ✅ Full token tracking (input/output)
  • Claude: ✅ Full token tracking (input/output)
  • Gemini: ✅ Full token tracking (input/output)
  • Ollama: ✅ Full token tracking (input/output)
  • Azure OCR: ❌ No token data (service doesn't provide usage info)

Notes

  • Token counts are captured directly from API responses, not calculated by HTR
  • Evaluation YAML files store token data as inputtokens and outputtokens fields (see the fragment below)
  • Cost estimates are based on averages across all documents in the evaluation
  • Use a representative sample of documents for more accurate cost projections
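
A hypothetical fragment showing where the token fields appear in an evaluation YAML entry; only the field names inputtokens and outputtokens are confirmed above, and the values and surrounding structure are illustrative:

# illustrative values only
inputtokens: 1847
outputtokens: 456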

Testing Individual Items

You can test individual rows from your CSV to quickly evaluate a single provider:

# Test just the first row (index 0)
htr eval --provider azure --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0 --dir /path/to/images

# Test multiple specific rows
htr eval --provider gemini --model gemini-pro-vision --prompt "Extract all text from this image" --csv fixtures/images.csv --rows 0,5,10 --dir /path/to/images

Updating

Homebrew

If you installed via Homebrew, you can simply upgrade the htr formula:

brew update && brew upgrade htr

Download Binary

If the binary was downloaded and added to your $PATH, updating htr could look like the following (requires gh and tar):

# update for your architecture
ARCH="htr_Linux_x86_64.tar.gz"
TAG=$(gh release list --exclude-pre-releases --exclude-drafts --limit 1 --repo lehigh-university-libraries/htr | awk '{print $3}')
gh release download $TAG --repo lehigh-university-libraries/htr --pattern $ARCH
tar -zxvf $ARCH
mv htr /directory/in/path/binary/was/placed
rm $ARCH
