Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

This repository contains the research, code, and results for our paper on high-precision, high-throughput sentence boundary detection (SBD) libraries optimized for legal text.

About the Project

Accurate sentence boundary detection is critical for legal document processing, retrieval, and analysis. However, legal text presents unique challenges due to specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors.
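To make the failure mode concrete, the sketch below applies a deliberately naive rule (split on whitespace after `.`, `?`, or `!`) to a contract-style sentence. The rule and the `naive_split` helper are illustrative assumptions. They are not the behavior of NLTK Punkt, spaCy, or either library presented here.

```python
from __future__ import annotations

import re

# Naive rule (illustrative only): a sentence ends wherever terminal
# punctuation is followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def naive_split(text: str) -> list[str]:
    """Split text on whitespace that follows '.', '?', or '!'."""
    return [part for part in NAIVE_BOUNDARY.split(text) if part]


# A single legal sentence containing section references and a citation.
text = (
    "Employee's Annual Bonus shall be calculated pursuant to Sec. 4.3(c), "
    "subject to the limitations of I.R.C. § 409A(a)(2)(B)(i) and the "
    "withholding requirements of Sec. 7.3."
)

fragments = naive_split(text)
for fragment in fragments:
    print(repr(fragment))
```

The single sentence comes back as four fragments, because the periods in `Sec.` and `I.R.C.` are treated as sentence ends. Each spurious split is a false positive boundary.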

We present two new open-source SBD libraries:

NUPunkt

A pure Python implementation that extends the unsupervised Punkt algorithm with legal domain optimizations, trained on the KL3M legal corpus.

Precision: 91.1%
Throughput: 10 million characters per second
Memory: 432 MB
No external dependencies
29-32% precision improvement over standard tools like NLTK Punkt and spaCy
Links: PyPI Package | GitHub Repository

CharBoundary

A family of character-level machine learning models in three sizes (small, medium, large) that offer balanced precision-recall tradeoffs.

Highest F1 score: 0.782 (large model)
Throughput: 518K-748K characters per second depending on model size
Requires only scikit-learn and optional ONNX runtime integration
Links: PyPI Package | GitHub Repository
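As a rough way to read the throughput figures above, the sketch below converts characters per second into processing time for a corpus. The corpus size and average document length are hypothetical values chosen for illustration, not figures from the paper. The estimate assumes a single worker at steady throughput and ignores I/O, model loading, and parallelism.

```python
def estimated_seconds(total_chars: int, chars_per_second: float) -> float:
    """Estimate processing time for one worker at a steady throughput."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return total_chars / chars_per_second


# Hypothetical corpus: neither number comes from the paper.
NUM_DOCUMENTS = 1_000_000
AVG_CHARS_PER_DOCUMENT = 2_000
total_chars = NUM_DOCUMENTS * AVG_CHARS_PER_DOCUMENT

# Throughput figures as reported above, in characters per second.
reported_throughput = {
    "NUPunkt": 10_000_000,
    "CharBoundary (low end of reported range)": 518_000,
    "CharBoundary (high end of reported range)": 748_000,
}

for name, rate in reported_throughput.items():
    minutes = estimated_seconds(total_chars, rate) / 60
    print(f"{name}: about {minutes:.1f} minutes")
```

Substitute your own document count and average length to get an estimate for a specific collection.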

Why This Matters

For legal RAG (Retrieval-Augmented Generation) systems, high precision in sentence boundary detection is essential to prevent fragmentation of related legal concepts, which leads to reasoning failures. Our research shows that the relationship between precision and fragmentation follows an inverse exponential curve, where even small improvements in precision yield significant reductions in downstream errors.

Experimental Results

Our evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that both libraries significantly outperform general-purpose alternatives:

NUPunkt excels in precision-critical applications where minimizing false positives is paramount
CharBoundary models provide the best overall F1 scores with excellent balance between precision and recall
Both libraries enable processing of multi-million document collections in minutes rather than hours on standard CPU hardware
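The precision, recall, and F1 figures quoted in this README can be read with the following minimal sketch. It assumes boundaries are compared as exact character offsets. The matching rule in the evaluation code in `lsp/evaluation.py` may differ, and the `boundary_scores` helper and the offsets below are hypothetical.

```python
from __future__ import annotations


def boundary_scores(predicted: set[int], gold: set[int]) -> dict[str, float]:
    """Score predicted boundary offsets against gold offsets (exact match)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical offsets: three annotated boundaries, four predictions,
# two of which are false positives (93 and 150).
gold = {58, 120, 187}
predicted = {58, 93, 120, 150}

scores = boundary_scores(predicted, gold)
print(scores)
```

In this toy case the two false positives halve precision while recall stays at two thirds. This is the trade-off a precision-oriented detector aims to avoid.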

Datasets

The research utilizes several legal datasets:

ALEA Legal Benchmark
MultiLegalSBD (SCOTUS, Cyber Crime, BVA, IP)

Repository Structure

This repository is organized to help you explore our research, replicate our results, and use our libraries.

`/data` - Annotated datasets used for evaluation:
- `MultiLegalSBD/` - Multiple legal domain datasets with span annotations (SCOTUS, Cyber Crime, BVA, Intellectual Property)
- `alea-legal-benchmark/` - ALEA legal benchmark with sentence boundary annotations

`/paper` - Complete LaTeX source for the research paper:
- `main.tex` - Main paper document
- `sections/` - Individual paper sections (introduction, methods, results, etc.)
- `figures/` - Publication-quality figures and diagrams
- `tables/` - LaTeX tables for benchmark results
- `references/` - Bibliography and citations

`/results` - Comprehensive evaluation results and visualizations:
- `paper_results_20250402_203406/` - Latest evaluation results
- `evaluation_report.html` - Interactive visualization of results
- `charts/` - Performance comparison charts (precision, recall, F1, throughput)
- `publication_charts/` - High-quality charts used in the paper
- `latex/` - Auto-generated LaTeX tables for the paper

`/src` - Source code for libraries and evaluation framework:
- `lsp/` - Main package (Legal Sentence Paper)
- `lsp/tokenizers/` - Implementation of NUPunkt, CharBoundary, and baseline tokenizers
- `lsp/core/` - Core functionality for data loading and processing
- `lsp/examples/` - Example scripts and tools to reproduce experiments
- `lsp/evaluation.py` - Evaluation metrics and benchmark logic

Getting Started

Installation

Project Installation (for reproducing paper results)

# Set up Python virtual environment
uv venv --seed && uv pip install pip && source .venv/bin/activate

# Install the project
pip install -e .

Using the Libraries in Your Projects

Install NUPunkt:

pip install nupunkt

Install CharBoundary:

pip install charboundary

Usage Examples

Test the tokenizers on legal examples:

python -m lsp.examples.test_legal_examples

Run a complete evaluation:

python -m lsp.examples.run_evaluation --charts --html

Process your own text:

python -m lsp --text "Employee's Annual Bonus shall be calculated pursuant to Sec. 4.3(c), subject to the limitations of I.R.C. § 409A(a)(2)(B)(i) and the withholding requirements of Sec. 7.3." --tokenizers nupunkt charboundary-large

Key CLI Commands

# List all available datasets
python -m lsp.examples.list_datasets

# Examine a specific dataset
python -m lsp.examples.examine_dataset DATASET [--example ID] [--random N]

# Run paper workflow (reproduces all results)
python -m lsp.examples.paper_workflow --output results/

For more commands and options, see the CLAUDE.md file.

Demo

Try our interactive demo at https://sentences.aleainstitute.ai/

License

Both libraries are available under the MIT license.

Authors

Michael J. Bommarito II (ALEA Institute, Stanford CodeX)
Daniel Martin Katz (Illinois Tech - Chicago Kent Law, Bucerius Law School, ALEA Institute, Stanford CodeX)
Jillian Bommarito (ALEA Institute)

Links

Models on Hugging Face: https://huggingface.co/models?other=arxiv:2504.04131
Paper on Hugging Face: https://huggingface.co/papers/2504.04131
Dataset on Hugging Face: https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries

Citation

@article{bommarito2025nupunkt,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

Acknowledgments

We drafted and revised this paper with the assistance of large language models. All errors or omissions are our own.
