RAG Scientific Papers

RAG Scientific Papers automatically fetches, processes, and ingests the latest arXiv research papers on a given topic every day. This daily retrieval supports continuous technology monitoring, so you stay up to date with emerging research and trends. The pipeline is orchestrated with Prefect for scheduling and automation, and it stores the retrieved PDFs in MinIO object storage for later management and retrieval.

Thank you to arXiv for use of its open access interoperability.

Features

Fetch arXiv Papers: Automatically query the arXiv API for research papers based on a topic and publication date.
PDF Ingestion: Download the PDF files and store them in a MinIO bucket.
Embedding Extraction: Extract embeddings and store them in a Chroma vector store.
Pipeline Orchestration: Use Prefect flows and tasks to schedule and manage the pipelines.
UI: Display, read, and filter the ingested PDFs.
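
To illustrate the fetch step, here is a minimal sketch of how a topic and a publication date could be turned into an arXiv API query, and how PDF links could be read from the Atom feed the API returns. The function names, the `all:` search field, and the one-day `submittedDate` window are assumptions made for this example; the project's own backend may build its queries differently.

```python
from datetime import date
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace of the Atom feed


def build_query_url(topic: str, day: date, max_results: int = 50) -> str:
    """Build an arXiv API URL for papers on `topic` submitted on `day` (hypothetical helper)."""
    stamp = day.strftime("%Y%m%d")
    # submittedDate takes a [YYYYMMDDHHMM TO YYYYMMDDHHMM] range.
    search = f'all:"{topic}" AND submittedDate:[{stamp}0000 TO {stamp}2359]'
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def extract_pdf_links(atom_xml: str) -> list[tuple[str, str]]:
    """Return (title, pdf_url) pairs from an Atom feed (hypothetical helper)."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        # Titles in the feed may be wrapped over several lines; collapse whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        for link in entry.iter(f"{ATOM}link"):
            if link.get("title") == "pdf":
                papers.append((title, link.get("href")))
    return papers
```

The resulting URL can be fetched with any HTTP client, and each PDF link can then be downloaded and uploaded to the MinIO bucket.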

Installation

Clone the repository

git clone https://github.com/Bessouat40/rag-scientific-papers.git
cd rag-scientific-papers

Configure .env File

Rename the .env.example file to .env and fill in your own values:

mv .env.example .env

Install the required packages

python -m pip install -r backend/requirements.txt
cd frontend
npm i

Usage

Start the Pipeline with Prefect locally

You can run the pipeline as a scheduled flow using Prefect. For example, to run the pipeline daily at midnight, either create a Prefect deployment or serve the flow directly (for testing purposes):

python -m backend.main
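
A daily midnight schedule corresponds to the cron expression `0 0 * * *`, which Prefect's `serve` method accepts through its `cron` argument. The sketch below only illustrates what that schedule means in plain Python: when the next run fires, and which day a run would ingest. The helper names and the "ingest the day that just ended" convention are assumptions for this example, not taken from the project's code.

```python
from datetime import date, datetime, timedelta

# Cron fields: minute hour day-of-month month day-of-week.
DAILY_AT_MIDNIGHT = "0 0 * * *"


def next_run(now: datetime) -> datetime:
    """Return the first midnight strictly after `now`, in the same timezone."""
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, datetime.min.time(), tzinfo=now.tzinfo)


def target_day(run_time: datetime) -> date:
    """Assumed convention: a midnight run ingests the day that just ended."""
    return run_time.date() - timedelta(days=1)
```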

Running Pipelines and UI with Docker

You can also run the Prefect flow and the UI inside Docker containers:

docker-compose up -d --build

The Prefect UI is then available at localhost:4200. The flow runs every day at midnight.

The application UI is available at localhost:3000.

Configuration

Topic

The pipeline fetches articles based on a given topic.

You can modify this parameter in the .env file.
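
As an illustration, a topic setting could be read from the contents of a .env file along these lines. The variable name `TOPIC`, the fallback value, and both helper functions are hypothetical; check .env.example for the names the project actually uses.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_topic(env: dict[str, str], default: str = "machine learning") -> str:
    """Return the configured topic; TOPIC is an assumed variable name."""
    return env.get("TOPIC", default).strip() or default
```

In practice a library such as python-dotenv can load the file into the process environment instead of parsing it by hand.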

TODO

Containerization with Docker: Create a Dockerfile to containerize the application and manage its dependencies.
Embedding Extraction: Use a model to extract and store embeddings from the PDFs for later semantic search.
Semantic Search: Implement a semantic search feature that leverages the stored embeddings to enable more accurate article search.
Add UI
