RAG Flask Starter

A Retrieval-Augmented Generation (RAG) chatbot API built with Flask, LlamaIndex, Pinecone, and MongoDB. This project provides a production-ready foundation for building AI-powered conversational applications with document retrieval capabilities.

πŸš€ Features

  • RAG-powered Chat: Contextual responses using document retrieval with LlamaIndex
  • Multiple Source Types: Support for PDF, CSV, and Q&A pairs as knowledge sources
  • Vector Search: Pinecone integration for efficient similarity search
  • Persistent Storage: MongoDB for chat history, user management, and index storage
  • Streaming Responses: Real-time token streaming for chat responses
  • Rate Limiting: Built-in rate limiting with Flask-Limiter
  • Authentication: JWT-based authentication with admin and user roles
  • reCAPTCHA Support: Google reCAPTCHA validation middleware
  • CORS Enabled: Cross-Origin Resource Sharing support
  • Production Ready: Gunicorn (Linux) and Waitress (Windows) server support

πŸ“‹ Prerequisites

  • Python 3.11+
  • MongoDB instance
  • Pinecone account and API key
  • Perplexity API key (for LLM)
  • (Optional) Google reCAPTCHA keys

πŸ› οΈ Installation

  1. Clone the repository

    git clone https://github.com/marco-bertelli/rag.flask-start.git
    cd rag.flask-start
  2. Create a virtual environment

    python -m venv venv
    
    # Windows
    venv\Scripts\activate
    
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    Create a .env file in the root directory:

    # MongoDB
    MONGODB_URI=mongodb+srv://your-connection-string
    MONGODB_DATABASE=your-database-name
    
    # Security
    SECRET_KEY=your-jwt-secret-key
    
    # Pinecone
    PINECONE_API_KEY=your-pinecone-api-key
    PINECONE_ENV=your-pinecone-environment
    
    # Perplexity (LLM)
    PERPLEXITY_API_KEY=your-perplexity-api-key
    
    # OpenAI (optional)
    OPENAI_API_KEY=your-openai-api-key
    
    # reCAPTCHA (optional)
    RECAPTCHA_SECRET_KEY=your-recaptcha-secret-key
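The conf.py loader presumably reads these variables from the environment; a minimal stdlib sketch of that pattern (the `load_config` helper and its error handling are illustrative, not the project's actual code):

```python
import os

def load_config():
    """Read required and optional settings from the environment."""
    required = ["MONGODB_URI", "MONGODB_DATABASE", "SECRET_KEY",
                "PINECONE_API_KEY", "PERPLEXITY_API_KEY"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.getenv(name) for name in required} | {
        # Optional keys fall back to None when unset
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
        "RECAPTCHA_SECRET_KEY": os.getenv("RECAPTCHA_SECRET_KEY"),
    }
```

Failing fast on missing required keys surfaces configuration mistakes at startup rather than on the first request.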

πŸš€ Running the Application

Development (Windows)

python windows_waitress_start.py

Production (Linux/Heroku)

gunicorn --preload --max-requests 500 --max-requests-jitter 5 -t 3 --worker-class gthread --timeout 120 index:app

The server will start on port 8080.

πŸ“š API Endpoints

Chat Endpoints

| Method | Endpoint | Description | Auth |
| ------ | -------- | ----------- | ---- |
| GET | `/chats/me` | Get current user's chat | User Token |
| GET | `/chats/me/history` | Get chat history | User Token |
| GET | `/chats/guest` | Create a guest chat session | None |
| GET | `/chats/<chatId>/answer?answer=<query>` | Query the chatbot (streaming) | None |
| PUT | `/chats/message/<messageId>/feedback` | Set message feedback (good/bad) | None |
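The answer endpoint takes the question as a query-string parameter, so it must be URL-encoded. A small helper sketch (the base URL and function name are illustrative):

```python
from urllib.parse import quote

def answer_url(base, chat_id, question):
    """Build the streaming answer URL with a URL-encoded question."""
    return f"{base}/chats/{chat_id}/answer?answer={quote(question)}"

# answer_url("http://localhost:8080", "abc", "What is RAG?")
# -> "http://localhost:8080/chats/abc/answer?answer=What%20is%20RAG%3F"
```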

Source Management Endpoints (Admin Only)

| Method | Endpoint | Description | Auth |
| ------ | -------- | ----------- | ---- |
| POST | `/index/source/<sourceType>` | Add a new source to the index | Admin Token |
| DELETE | `/index/source/<sourceId>` | Remove a source from the index | Admin Token |

Source Types

  • qa: Question-Answer pairs
    { "question": "What is RAG?", "answer": "RAG stands for..." }
  • csv: CSV file with questions and answers columns
    { "path": "https://example.com/data.csv" }
  • pdf: PDF document
    { "path": "https://example.com/document.pdf" }
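Adding a source is a JSON POST with the admin token in the Authorization header; a stdlib sketch that builds (but does not send) such a request, assuming the bearer-token scheme described below and a placeholder base URL:

```python
import json
from urllib import request

def add_source_request(base, source_type, payload, admin_token):
    """Build the POST request for /index/source/<sourceType>."""
    return request.Request(
        f"{base}/index/source/{source_type}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {admin_token}"},
        method="POST",
    )

req = add_source_request("http://localhost:8080", "qa",
                         {"question": "What is RAG?", "answer": "RAG stands for..."},
                         "your-admin-token")
```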

πŸ—οΈ Project Structure

rag.flask-start/
β”œβ”€β”€ app.py                 # Flask app configuration
β”œβ”€β”€ index.py               # Application entry point
β”œβ”€β”€ index_manager.py       # LlamaIndex setup and management
β”œβ”€β”€ conf.py                # Environment configuration loader
β”œβ”€β”€ windows_waitress_start.py  # Windows server startup
β”œβ”€β”€ Procfile               # Heroku/Gunicorn configuration
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ data/                  # Sample data files
β”‚   └── rules.pdf          # Initial document for indexing
β”œβ”€β”€ apis/
β”‚   β”œβ”€β”€ chats.py           # Chat API endpoints
β”‚   └── sources.py         # Source management endpoints
β”œβ”€β”€ middlewares/
β”‚   β”œβ”€β”€ auth_middleware.py # JWT authentication
β”‚   └── re_captcha.py      # reCAPTCHA validation
β”œβ”€β”€ mongodb/
β”‚   └── index.py           # MongoDB operations
└── utils/
    β”œβ”€β”€ chat_history_parser.py  # Chat history formatting
    β”œβ”€β”€ mongo_parsers.py        # MongoDB JSON encoder
    β”œβ”€β”€ parsers.py              # Document parsing utilities
    β”œβ”€β”€ validators.py           # Input validation
    └── vector_database.py      # Pinecone/MongoDB vector store setup

βš™οΈ Configuration

LLM Settings

The project uses Perplexity's mixtral-8x7b-instruct model by default. Configuration is in index_manager.py:

import os

# Import path may vary with the installed llama-index version
from llama_index.llms.perplexity import Perplexity

llm = Perplexity(
    api_key=os.getenv("PERPLEXITY_API_KEY"),
    model="mixtral-8x7b-instruct",
    temperature=0.2
)

Embedding Model

Uses the local HuggingFace model BAAI/bge-small-en-v1.5 for embeddings:

Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

Rate Limits

Rate limits are configured in apis/chats.py:

  • /chats/me: 5 requests per minute
  • /chats/me/history: 15 requests per minute
  • /chats/<chatId>/answer: 10 requests per minute
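Flask-Limiter enforces these limits per client; the underlying semantics are a sliding window over recent requests. A stdlib sketch of that idea (an illustration of the concept, not Flask-Limiter's implementation):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds."""
    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.calls = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Discard timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False
```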

πŸ” Authentication

The API uses JWT tokens for authentication. Include the token in the Authorization header:

Authorization: Bearer <your-jwt-token>
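The middleware presumably verifies these tokens with PyJWT; to illustrate what an HS256 JWT actually is, here is a stdlib-only sign/verify sketch (not the project's code, and it skips claims like expiry that production middleware should check):

```python
import base64, hashlib, hmac, json

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: str) -> str:
    """Produce a compact HS256 JWT: header.payload.signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = hmac.new(secret.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(sig)}"

def verify_jwt(token: str, secret: str) -> dict:
    """Check the signature and return the payload (no expiry handling here)."""
    header, body, sig = token.split(".")
    expected = _b64url(hmac.new(secret.encode(), f"{header}.{body}".encode(),
                                hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig):
        raise ValueError("invalid signature")
    return json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))
```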

User Roles

  • User: Can access chat features
  • Admin: Can manage knowledge sources (add/delete)
  • Guest: Limited access with temporary chat sessions

πŸ“¦ Dependencies

Key dependencies include:

  • Flask: Web framework
  • LlamaIndex: RAG framework
  • Pinecone: Vector database
  • PyMongo: MongoDB driver
  • Flask-Limiter: Rate limiting
  • Flask-CORS: CORS support
  • PyJWT: JWT authentication
  • llmsherpa: PDF parsing
  • Transformers & PyTorch: ML models

🚒 Deployment

Heroku

The project includes a Procfile for Heroku deployment:

web: gunicorn --preload --max-requests 500 --max-requests-jitter 5 -t 3 --worker-class gthread --timeout 120 index:app

Docker (Custom)

Create a Dockerfile. Note that gunicorn's default bind (127.0.0.1:8000) is unreachable from outside the container, so bind explicitly to 0.0.0.0:8080:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8080
CMD ["gunicorn", "--preload", "-b", "0.0.0.0:8080", "-t", "120", "index:app"]

πŸ“„ License

This project is open source and available under the MIT License.

πŸ‘€ Author

Marco Bertelli

🀝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.


⭐ Star this repository if you find it helpful!

About

Companion code for a Medium story.
