A Retrieval-Augmented Generation (RAG) chatbot API built with Flask, LlamaIndex, Pinecone, and MongoDB. This project provides a production-ready foundation for building AI-powered conversational applications with document retrieval capabilities.
- RAG-powered Chat: Contextual responses using document retrieval with LlamaIndex
- Multiple Source Types: Support for PDF, CSV, and Q&A pairs as knowledge sources
- Vector Search: Pinecone integration for efficient similarity search
- Persistent Storage: MongoDB for chat history, user management, and index storage
- Streaming Responses: Real-time token streaming for chat responses
- Rate Limiting: Built-in rate limiting with Flask-Limiter
- Authentication: JWT-based authentication with admin and user roles
- reCAPTCHA Support: Google reCAPTCHA validation middleware
- CORS Enabled: Cross-Origin Resource Sharing support
- Production Ready: Gunicorn (Linux) and Waitress (Windows) server support
- Python 3.11+
- MongoDB instance
- Pinecone account and API key
- Perplexity API key (for LLM)
- (Optional) Google reCAPTCHA keys
1. Clone the repository

   ```bash
   git clone https://github.com/marco-bertelli/rag.flask-start.git
   cd rag.flask-start
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Linux/Mac
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables

   Create a `.env` file in the root directory:

   ```env
   # MongoDB
   MONGODB_URI=mongodb+srv://your-connection-string
   MONGODB_DATABASE=your-database-name

   # Security
   SECRET_KEY=your-jwt-secret-key

   # Pinecone
   PINECONE_API_KEY=your-pinecone-api-key
   PINECONE_ENV=your-pinecone-environment

   # Perplexity (LLM)
   PERPLEXITY_API_KEY=your-perplexity-api-key

   # OpenAI (optional)
   OPENAI_API_KEY=your-openai-api-key

   # reCAPTCHA (optional)
   RECAPTCHA_SECRET_KEY=your-recaptcha-secret-key
   ```
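These values are read at startup by `conf.py`. A minimal sketch of such a loader, assuming the python-dotenv package is used (the variable names match the list above; the validation loop is illustrative, not the repository's exact code):

```python
# Illustrative conf.py-style loader (sketch; the real file may differ)
import os

from dotenv import load_dotenv

# Pull the .env file in the project root into os.environ
load_dotenv()

MONGODB_URI = os.getenv("MONGODB_URI")
MONGODB_DATABASE = os.getenv("MONGODB_DATABASE")
SECRET_KEY = os.getenv("SECRET_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PERPLEXITY_API_KEY = os.getenv("PERPLEXITY_API_KEY")

# Fail fast if a required secret is missing
for name in ("MONGODB_URI", "SECRET_KEY", "PINECONE_API_KEY", "PERPLEXITY_API_KEY"):
    if not os.getenv(name):
        raise RuntimeError(f"Missing required environment variable: {name}")
```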
Start the server.

Windows (Waitress):

```bash
python windows_waitress_start.py
```

Linux (Gunicorn):

```bash
gunicorn --preload --max-requests 500 --max-requests-jitter 5 -t 3 --worker-class gthread --timeout 120 index:app
```

The server will start on port 8080.
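For reference, a minimal sketch of what a Waitress startup script like `windows_waitress_start.py` looks like, assuming `index.py` exposes the Flask app as `app` (the same target Gunicorn uses as `index:app`):

```python
# Illustrative sketch of a Waitress startup script; the repository's
# windows_waitress_start.py may differ in detail.
from waitress import serve

from index import app  # index.py exposes the Flask app

if __name__ == "__main__":
    # Waitress is a pure-Python WSGI server that runs on Windows,
    # where Gunicorn is unavailable.
    serve(app, host="0.0.0.0", port=8080)
```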
Chat endpoints:

| Method | Endpoint | Description | Auth |
|---|---|---|---|
| GET | `/chats/me` | Get current user's chat | User Token |
| GET | `/chats/me/history` | Get chat history | User Token |
| GET | `/chats/guest` | Create a guest chat session | None |
| GET | `/chats/<chatId>/answer?answer=<query>` | Query the chatbot (streaming) | None |
| PUT | `/chats/message/<messageId>/feedback` | Set message feedback (good/bad) | None |
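The answer endpoint streams tokens as they are generated. A minimal client sketch using the requests library (the base URL is a placeholder, and the guest-response shape is an assumption; adjust to the actual payload):

```python
import requests

BASE_URL = "http://localhost:8080"  # placeholder; point at your deployment

# Create a guest chat session (no auth required); assumes the response
# JSON carries the chat id under "id" -- adjust to the actual payload.
chat = requests.get(f"{BASE_URL}/chats/guest").json()
chat_id = chat["id"]

# Stream the answer token by token
with requests.get(
    f"{BASE_URL}/chats/{chat_id}/answer",
    params={"answer": "What is RAG?"},
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)
```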
Index management endpoints:

| Method | Endpoint | Description | Auth |
|---|---|---|---|
| POST | `/index/source/<sourceType>` | Add a new source to the index | Admin Token |
| DELETE | `/index/source/<sourceId>` | Remove a source from the index | Admin Token |
Supported source types for `POST /index/source/<sourceType>`:

- `qa`: Question-Answer pairs

  ```json
  { "question": "What is RAG?", "answer": "RAG stands for..." }
  ```

- `csv`: CSV file with `questions` and `answers` columns

  ```json
  { "path": "https://example.com/data.csv" }
  ```

- `pdf`: PDF document

  ```json
  { "path": "https://example.com/document.pdf" }
  ```
```
rag.flask-start/
├── app.py                      # Flask app configuration
├── index.py                    # Application entry point
├── index_manager.py            # LlamaIndex setup and management
├── conf.py                     # Environment configuration loader
├── windows_waitress_start.py   # Windows server startup
├── Procfile                    # Heroku/Gunicorn configuration
├── requirements.txt            # Python dependencies
├── data/                       # Sample data files
│   └── rules.pdf               # Initial document for indexing
├── apis/
│   ├── chats.py                # Chat API endpoints
│   └── sources.py              # Source management endpoints
├── middlewares/
│   ├── auth_middleware.py      # JWT authentication
│   └── re_captcha.py           # reCAPTCHA validation
├── mongodb/
│   └── index.py                # MongoDB operations
└── utils/
    ├── chat_history_parser.py  # Chat history formatting
    ├── mongo_parsers.py        # MongoDB JSON encoder
    ├── parsers.py              # Document parsing utilities
    ├── validators.py           # Input validation
    └── vector_database.py      # Pinecone/MongoDB vector store setup
```
The project uses Perplexity's `mixtral-8x7b-instruct` model by default. Configuration is in `index_manager.py`:

```python
llm = Perplexity(
    api_key=os.getenv("PERPLEXITY_API_KEY"),
    model="mixtral-8x7b-instruct",
    temperature=0.2
)
```

Embeddings use the local HuggingFace model `BAAI/bge-small-en-v1.5`:

```python
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"
```

Rate limits are configured in `apis/chats.py`:
- `/chats/me`: 5 requests per minute
- `/chats/me/history`: 15 requests per minute
- `/chats/<chatId>/answer`: 10 requests per minute
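A sketch of how such per-route limits are declared with Flask-Limiter (the route name and body here are illustrative, not the repository's exact code):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

# Key rate limits by the client's IP address
limiter = Limiter(key_func=get_remote_address, app=app)

@app.route("/chats/me")
@limiter.limit("5 per minute")  # mirrors the limit listed above
def get_my_chat():
    # Illustrative body; the real route returns the user's chat document
    return {"chat": "..."}
```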
The API uses JWT tokens for authentication. Include the token in the Authorization header:
```
Authorization: Bearer <your-jwt-token>
```
- User: Can access chat features
- Admin: Can manage knowledge sources (add/delete)
- Guest: Limited access with temporary chat sessions
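`middlewares/auth_middleware.py` enforces these roles. A minimal sketch of that pattern with PyJWT (the decorator name, the `role` claim, and the error shapes are assumptions, not the repository's exact code):

```python
# Illustrative role-checking decorator (sketch)
from functools import wraps

import jwt
from flask import jsonify, request

SECRET_KEY = "your-jwt-secret-key"  # loaded from SECRET_KEY in .env

def require_role(role):
    """Verify the Bearer token and check its role claim (hypothetical helper)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            header = request.headers.get("Authorization", "")
            if not header.startswith("Bearer "):
                return jsonify({"error": "missing token"}), 401
            try:
                payload = jwt.decode(header[7:], SECRET_KEY, algorithms=["HS256"])
            except jwt.InvalidTokenError:
                return jsonify({"error": "invalid token"}), 401
            # The 'role' claim is an assumption about the token payload
            if role == "admin" and payload.get("role") != "admin":
                return jsonify({"error": "forbidden"}), 403
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```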
Key dependencies include:
- Flask: Web framework
- LlamaIndex: RAG framework
- Pinecone: Vector database
- PyMongo: MongoDB driver
- Flask-Limiter: Rate limiting
- Flask-CORS: CORS support
- PyJWT: JWT authentication
- llmsherpa: PDF parsing
- Transformers & PyTorch: ML models
The project includes a Procfile for Heroku deployment:
```
web: gunicorn --preload --max-requests 500 --max-requests-jitter 5 -t 3 --worker-class gthread --timeout 120 index:app
```
For Docker deployment, create a Dockerfile:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["gunicorn", "--preload", "-t", "120", "index:app"]
```

This project is open source and available under the MIT License.
Marco Bertelli
- GitHub: @marco-bertelli
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
⭐ Star this repository if you find it helpful!