License: PolyForm NC 1.0.0

NERxiv

Named Entity Recognition for arXiv papers (NERxiv) is a Python wrapper for extracting structured metadata from scientific papers on arXiv using LLMs and retrieval-augmented generation (RAG) techniques.

Visit the documentation page to learn how to use this tool.

What It Does

  • Uses pyrxiv to fetch, download, and extract text from arXiv papers
  • Chunks and embeds text with SentenceTransformers or LangChain, then categorizes paper content using local LLMs (via Ollama)
  • Includes CLI tools and notebook tutorials for reproducible workflows
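
The chunk-and-retrieve idea behind the second bullet can be illustrated with a minimal, dependency-free sketch. It does not use NERxiv's API: the names `chunk_text`, `bag_of_words`, `cosine`, and `top_chunks` are hypothetical, and a simple word-count vector stands in for the SentenceTransformers embeddings mentioned above.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into chunks of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(bag_of_words(chunk), query_vec),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    paper_text = (
        "We compute the band structure with density functional theory. "
        "Samples were grown by molecular beam epitaxy on sapphire substrates."
    )
    chunks = chunk_text(paper_text, chunk_size=10, overlap=2)
    for chunk in top_chunks(chunks, "Which method was used to compute the band structure?", k=1):
        print(chunk)
```

In a RAG workflow, the highest-ranked chunks would then be passed to the LLM as context together with the question, which keeps the prompt short compared with sending the full paper.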

Installation

Install the core package:

pip install nerxiv

Running LLMs Locally

We recommend running your own models locally using Ollama:

# Install Ollama (follow instructions on their website)
ollama pull <model-name>   # e.g., llama3, deepseek-r1, qwen3:30b

# Start the local server
ollama serve
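
To confirm from Python that the local server is reachable and to see which models have been pulled, a small check along these lines may help. It is a sketch, not part of NERxiv: it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, and the helper names `parse_model_names` and `list_local_models` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default address; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [model["name"] for model in payload.get("models", []) if "name" in model]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the names of locally available models, or None if the server
    cannot be reached or returns something that is not valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        return None
    return parse_model_names(payload)


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama server not reachable; is `ollama serve` running?")
    else:
        print("Available models:", ", ".join(models) or "(none pulled yet)")
```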

Development

To contribute to NERxiv or run it locally, follow these steps:

Clone the Repository

git clone https://github.com/JosePizarro3/NERxiv.git
cd NERxiv

Set Up a Virtual Environment

We recommend Python ≥ 3.10:

python3 -m venv .venv
source .venv/bin/activate
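
To verify that the activated interpreter meets the recommended version, a short check like the following can be run inside the virtual environment. The helper name `meets_minimum` is hypothetical and only serves this illustration.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if meets_minimum():
        print("Python version OK:", sys.version.split()[0])
    else:
        print("Python 3.10 or newer is recommended; found", sys.version.split()[0])
```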

Install Dependencies

Use uv (faster than pip) to install the package in editable mode with dev and docu extras:

pip install --upgrade pip
pip install uv
uv pip install -e '.[dev,docu]'

Run tests

Run all tests with pytest in verbose mode (-v) and with output capturing disabled (-s):

python -m pytest -sv tests

To check code coverage:

python -m pytest --cov=nerxiv tests

Code formatting and linting

We use Ruff for formatting and linting (configured via pyproject.toml).

Check linting issues:

ruff check .

Auto-format code (add --check to only report files that would be reformatted, without changing them):

ruff format .

Manually fix anything Ruff cannot handle automatically.

Documentation writing

To view the documentation locally, make sure you have installed the [docu] extra:

uv pip install -e '.[docu]'

Note: This command installs mkdocs, mkdocs-material, and other documentation-related dependencies.

The first time, build the site:

mkdocs build

Run the documentation server:

mkdocs serve

The output looks like:

INFO    -  Building documentation...
INFO    -  Cleaning site directory
INFO    -  [14:07:47] Watching paths for changes: 'docs', 'mkdocs.yml'
INFO    -  [14:07:47] Serving on http://127.0.0.1:8000/

Open http://127.0.0.1:8000/ in your browser. Changes to the documentation's Markdown files appear as soon as the files are saved, because the local page refreshes automatically.
