Automated pipeline that daily fetches, stores, and indexes arXiv research papers in MinIO.

RAG Scientific Papers

RAG Scientific Papers automatically fetches, processes, and ingests the latest arXiv research papers on a given topic every day. This daily retrieval supports continuous technology monitoring, so you stay up to date with emerging research and trends. The pipeline is orchestrated with Prefect, which handles scheduling and automation, and it stores the retrieved PDFs in MinIO object storage for later management and retrieval.

Thank you to arXiv for use of its open access interoperability.

RAGLight

Features

  • Fetch arXiv Papers: Automatically query the arXiv API for research papers based on a topic and publication date.
  • PDF Ingestion: Download the PDF files and store them in a MinIO bucket.
  • Embedding Extraction: Extract embeddings and store them in a Chroma vector store.
  • Pipeline Orchestration: Use Prefect flows and tasks to schedule and manage the pipelines.
  • UI: Display, read, and filter the stored PDFs.
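
The fetch step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the public arXiv API endpoint at `export.arxiv.org/api/query`, and the helper names `build_query_url` and `parse_feed` are hypothetical. It sorts by submission date rather than filtering on a specific publication date.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(topic: str, max_results: int = 10) -> str:
    """Build an arXiv API URL for the newest submissions matching a topic."""
    params = {
        "search_query": f'all:"{topic}"',
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> list[dict]:
    """Extract title, publication date, and PDF link from an Atom feed."""
    papers = []
    root = ET.fromstring(atom_xml)
    for entry in root.findall("atom:entry", ATOM_NS):
        pdf_url = None
        # arXiv marks the PDF link with title="pdf".
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        papers.append(
            {
                # Collapse the line breaks arXiv inserts into long titles.
                "title": " ".join(title.split()),
                "published": entry.findtext(
                    "atom:published", default="", namespaces=ATOM_NS
                ),
                "pdf_url": pdf_url,
            }
        )
    return papers
```

The response body returned by an HTTP GET on `build_query_url(...)` can be passed to `parse_feed`; each resulting `pdf_url` is what the ingestion step would download and store.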

Installation

  1. Clone the repository
git clone https://github.com/Bessouat40/rag-scientific-papers.git
cd rag-scientific-papers
  2. Configure the .env file

Rename the .env.example file and fill in your own values:

mv .env.example .env
  3. Install the required packages
python -m pip install -r backend/requirements.txt
cd frontend
npm i

Usage

Start the Pipeline with Prefect locally

You can run the pipeline as a scheduled flow using Prefect. For example, to run the pipeline daily at midnight, either create a Prefect deployment or, for testing purposes, serve the flow directly:

python -m backend.main
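
For reference, serving a flow on a daily midnight schedule might look like the sketch below. The contents of `backend.main` are not shown here, so the function names, the flow name, and the `topic` parameter are assumptions for illustration; only `flow` and `Flow.serve(name=..., cron=..., parameters=...)` come from Prefect itself.

```python
DAILY_AT_MIDNIGHT = "0 0 * * *"  # cron fields: minute hour day-of-month month day-of-week


def run_pipeline(topic: str) -> str:
    """Hypothetical stand-in for the real fetch, store, and index steps."""
    return f"ingested papers for topic: {topic}"


def serve_daily(topic: str) -> None:
    """Serve run_pipeline as a Prefect flow scheduled by DAILY_AT_MIDNIGHT."""
    # Imported here so this module can be imported without Prefect installed.
    from prefect import flow

    daily_flow = flow(name="daily-arxiv-ingestion")(run_pipeline)
    # serve() blocks and triggers a run each time the cron schedule fires.
    daily_flow.serve(
        name="daily-arxiv-ingestion",
        cron=DAILY_AT_MIDNIGHT,
        parameters={"topic": topic},
    )
```

Calling `serve_daily("your topic")` starts a long-running process, so it is best suited to local testing; a Prefect deployment is the more usual choice for unattended runs.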

Running Pipelines and UI with Docker

You can also run the Prefect flow and the UI in Docker containers:

docker-compose up -d --build

The Prefect UI is then available at localhost:4200, and the flow runs every day at midnight.

The application UI is available at localhost:3000.
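
If either page does not load, a quick way to check whether the two ports are accepting connections is a small script like this one. It is only a convenience sketch using the standard library; the `SERVICES` mapping simply restates the two ports mentioned above, and the function names are hypothetical.

```python
import socket

# Ports taken from the Docker setup described above.
SERVICES = {"Prefect UI": 4200, "Frontend UI": 3000}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Report which of the known services are reachable on host."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'up' if reachable else 'down'}")
```

An open port only shows that something is listening; it does not confirm that the flow itself has been scheduled, which is visible in the Prefect UI.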

Configuration

Topic

The pipeline fetches articles based on a given topic.

You can modify this parameter in the .env file.
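
As an illustration of how such a setting can be resolved, the sketch below parses `KEY=VALUE` lines and lets a real environment variable override the file. The variable name `TOPIC`, the default value, and both function names are assumptions made for this example; check .env.example for the names the project actually uses.

```python
import os
from collections.abc import Mapping


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_topic(
    file_values: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
    default: str = "machine learning",
) -> str:
    """Pick the topic: environment first, then the .env values, then default."""
    # TOPIC is a hypothetical variable name used only for this sketch.
    return environ.get("TOPIC") or file_values.get("TOPIC", default)
```

For example, `resolve_topic(parse_env(open(".env").read()))` would return the topic from the file unless the same variable is already set in the shell.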

TODO

  • Containerization with Docker: Create a Dockerfile to containerize the application and manage its dependencies.

  • Embedding Extraction: Use a model to extract and store embeddings from the PDFs for later semantic search.

  • Semantic Search: Implement a semantic search feature that leverages the stored embeddings to enable more accurate article search.

  • Add UI
