Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

This repository contains the research, code, and results for our paper on high-precision, high-throughput sentence boundary detection (SBD) libraries optimized for legal text.

About the Project

Accurate sentence boundary detection is critical for legal document processing, retrieval, and analysis. However, legal text presents unique challenges due to specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors.
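To see why general-purpose heuristics fail here, consider a naive splitter that breaks after every terminal punctuation mark followed by whitespace (this toy example is ours, not code from the paper):

```python
import re

# Naive splitter: break after '.', '!' or '?' followed by whitespace.
# This is the kind of general-purpose heuristic that legal text defeats.
def naive_split(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

clause = ("Payments are governed by Sec. 4.3(c) of the Agreement. "
          "See 15 U.S.C. § 78j(b).")

sentences = naive_split(clause)
# The abbreviation "Sec." and the citation "U.S.C." both trigger spurious
# boundaries: two sentences become four fragments.
print(len(sentences))
```

Each spurious boundary splits a citation away from the clause it supports, which is exactly the fragmentation that degrades downstream retrieval.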

We present two new open-source SBD libraries:

NUPunkt

A pure Python implementation that extends the unsupervised Punkt algorithm with legal domain optimizations, trained on the KL3M legal corpus.

  • Precision: 91.1%
  • Throughput: 10 million characters per second
  • Memory: 432 MB
  • No external dependencies
  • 29-32% precision improvement over standard tools like NLTK Punkt and spaCy
  • Links: PyPI Package | GitHub Repository

CharBoundary

A family of character-level machine learning models in three sizes (small, medium, large) that offer balanced precision-recall tradeoffs.

  • Highest F1 score: 0.782 (large model)
  • Throughput: 518K-748K characters per second depending on model size
  • Requires only scikit-learn, with optional ONNX runtime support
  • Links: PyPI Package | GitHub Repository
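A character-level model classifies each candidate terminal character (".", "?", "!") using features of its surrounding character window. The sketch below is illustrative only; the actual CharBoundary feature set and scikit-learn models are not reproduced here:

```python
# Illustrative sketch: character-window features of the kind a character-level
# boundary classifier might consume. Not the real CharBoundary feature set.
def window_features(text: str, i: int, k: int = 3) -> dict:
    """Describe the context around a candidate terminal character at index i."""
    right = text[i + 1:i + 1 + k]
    return {
        "left": text[max(0, i - k):i],   # characters before the candidate
        "right": right,                  # characters after the candidate
        "prev_is_digit": i > 0 and text[i - 1].isdigit(),
        "followed_by_capital": right[:1] == " " and right[1:2].isupper(),
    }

text = "See Sec. 4.3(c). The term applies."
# The '.' in "Sec." looks like an abbreviation; the '.' after "(c)" looks terminal.
abbrev = window_features(text, text.index("Sec.") + 3)
terminal = window_features(text, text.index("(c).") + 3)
```

Operating on characters rather than pre-tokenized words is what lets such models score every candidate position in a single fast pass over the text.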

Why This Matters

For legal RAG (Retrieval-Augmented Generation) systems, high precision in sentence boundary detection is essential to prevent fragmentation of related legal concepts, which leads to reasoning failures. Our research shows that the relationship between precision and fragmentation follows an inverse exponential curve, where even small improvements in precision yield significant reductions in downstream errors.
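A toy independence model (ours, not the paper's analysis) makes the nonlinearity concrete: if a spurious boundary fires independently with rate f at each of n candidate positions in a passage, the chance the passage survives unfragmented is (1 - f) ** n, which decays exponentially in n:

```python
# Toy model (not the paper's analysis): each of n candidate positions inside
# a passage is incorrectly split with independent probability f, so the
# passage survives intact with probability (1 - f) ** n.
def intact_probability(false_positive_rate: float, candidates: int) -> float:
    return (1.0 - false_positive_rate) ** candidates

# With 40 candidate positions (periods in citations, abbreviations, ...),
# halving the false-positive rate multiplies survival odds several-fold.
for f in (0.10, 0.05, 0.01):
    print(f, intact_probability(f, 40))
```

Under this model, a modest precision gain compounds across every candidate position in a document, which is consistent with the inverse exponential relationship reported above.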

Experimental Results

Our evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that both libraries significantly outperform general-purpose alternatives:

  • NUPunkt excels in precision-critical applications where minimizing false positives is paramount
  • CharBoundary models provide the best overall F1 scores with excellent balance between precision and recall
  • Both libraries enable processing of multi-million document collections in minutes rather than hours on standard CPU hardware
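The metrics behind these comparisons follow the standard definitions over predicted versus gold boundary offsets. This sketch shows those definitions; it is not the repository's lsp/evaluation.py:

```python
# Standard precision/recall/F1 over boundary character offsets.
# Generic illustration, not the repository's evaluation code.
def boundary_metrics(predicted: set, gold: set) -> dict:
    tp = len(predicted & gold)                      # correctly placed boundaries
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical offsets: one false positive (52), one miss (140).
metrics = boundary_metrics(predicted={17, 52, 88, 120}, gold={17, 88, 120, 140})
```

Precision counts false positives (spurious splits), which is why NUPunkt's 91.1% precision matters most for fragmentation-sensitive RAG pipelines, while F1 balances both error types.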

Datasets

The research utilizes several legal datasets:

  • ALEA Legal Benchmark
  • MultiLegalSBD (SCOTUS, Cyber Crime, BVA, IP)

Repository Structure

This repository is organized to help you explore our research, replicate our results, and use our libraries.

/data - Annotated datasets used for evaluation:
- MultiLegalSBD/ - Multiple legal domain datasets with span annotations (SCOTUS, Cyber Crime, BVA, Intellectual Property)
- alea-legal-benchmark/ - ALEA legal benchmark with sentence boundary annotations

/paper - Complete LaTeX source for the research paper:
- main.tex - Main paper document
- sections/ - Individual paper sections (introduction, methods, results, etc.)
- figures/ - Publication-quality figures and diagrams
- tables/ - LaTeX tables for benchmark results
- references/ - Bibliography and citations

/results - Comprehensive evaluation results and visualizations:
- paper_results_20250402_203406/ - Latest evaluation results
- evaluation_report.html - Interactive visualization of results
- charts/ - Performance comparison charts (precision, recall, F1, throughput)
- publication_charts/ - High-quality charts used in the paper
- latex/ - Auto-generated LaTeX tables for the paper

/src - Source code for the libraries and evaluation framework:
- lsp/ - Main package (Legal Sentence Paper)
- lsp/tokenizers/ - Implementations of NUPunkt, CharBoundary, and baseline tokenizers
- lsp/core/ - Core functionality for data loading and processing
- lsp/examples/ - Example scripts and tools to reproduce experiments
- lsp/evaluation.py - Evaluation metrics and benchmark logic

Getting Started

Installation

Project Installation (for reproducing paper results)

# Set up Python virtual environment
uv venv --seed && uv pip install pip && source .venv/bin/activate

# Install the project
pip install -e .

Using the Libraries in Your Projects

Install NUPunkt:

pip install nupunkt

Install CharBoundary:

pip install charboundary

Usage Examples

Test the tokenizers on legal examples:

python -m lsp.examples.test_legal_examples

Run a complete evaluation:

python -m lsp.examples.run_evaluation --charts --html

Process your own text:

python -m lsp --text "Employee's Annual Bonus shall be calculated pursuant to Sec. 4.3(c), subject to the limitations of I.R.C. § 409A(a)(2)(B)(i) and the withholding requirements of Sec. 7.3." --tokenizers nupunkt charboundary-large

Key CLI Commands

# List all available datasets
python -m lsp.examples.list_datasets

# Examine a specific dataset
python -m lsp.examples.examine_dataset DATASET [--example ID] [--random N]

# Run paper workflow (reproduces all results)
python -m lsp.examples.paper_workflow --output results/

For more commands and options, see the CLAUDE.md file.

Demo

Try our interactive demo at https://sentences.aleainstitute.ai/

License

Both libraries are available under the MIT license.

Authors

  • Michael J. Bommarito II (ALEA Institute, Stanford CodeX)
  • Daniel Martin Katz (Illinois Tech - Chicago Kent Law, Bucerius Law School, ALEA Institute, Stanford CodeX)
  • Jillian Bommarito (ALEA Institute)

Citation

@article{bommarito2025nupunkt,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
  journal={},
  year={2025}
}

Acknowledgments

We drafted and revised this paper with the assistance of large language models. All errors or omissions are our own.
