A RAG system that leverages course notes to provide accurate, context-aware responses to questions about DS4300 topics.
Retrieval-Augmented Generation (RAG) is an AI architecture that combines the power of large language models with external knowledge retrieval to provide more accurate and contextual responses. This project implements a local RAG system specifically designed to answer questions about data science topics using course notes as the knowledge base.
The system's primary goal is to demonstrate the effectiveness of different RAG configurations by comparing various embedding models, vector databases, chunking strategies, and LLM models in a controlled environment.
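In outline, each query is embedded, the most similar note chunks are retrieved from a vector store, and those chunks are handed to a local model as context. The sketch below illustrates that loop with Sentence Transformers and Ollama's local REST endpoint; the retrieval step is a placeholder and the function names are illustrative, not this repo's actual API.

```python
from sentence_transformers import SentenceTransformer
import requests

# Illustrative pipeline only; the real implementation lives in main.py and
# delegates retrieval to one of the vector databases listed below.
embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

def retrieve_chunks(query_vector, top_k=3):
    # Placeholder: a similarity search against Redis, Qdrant, or Chroma goes here.
    return ["(retrieved course-note chunk 1)", "(retrieved course-note chunk 2)"]

def answer(question: str) -> str:
    query_vector = embedder.encode(question)
    context = "\n\n".join(retrieve_chunks(query_vector))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Ollama's local REST endpoint; assumes `ollama serve` is running.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral:latest", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]

print(answer("What are the ACID properties of a transaction?"))
```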
- Python 3.8+: Core programming language
- Ollama: Local LLM deployment
  - Mistral: Latest version for primary testing
  - Qwen: 7B model for comparative analysis
- Vector Databases:
  - Redis: High-performance in-memory vector store (fastest for real-time queries)
  - Qdrant: Vector similarity search engine (best for complex similarity metrics)
  - Chroma: Open-source vector database (easiest to set up and maintain)
- Embedding Models:
  - Nomic AI's nomic-embed-text-v1.5
  - MiniLM (multi-qa-MiniLM-L6-cos-v1)
  - MPNet (all-mpnet-base-v2)
- Additional Libraries:
  - Sentence Transformers for embeddings
  - LangChain for text processing
  - Plotly for visualization
  - Pandas for data analysis
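All three embedding models load through the Sentence Transformers library; a minimal sketch is below (the trust_remote_code flag for the Nomic model is an assumption about how that checkpoint is packaged on Hugging Face).

```python
from sentence_transformers import SentenceTransformer

# The three embedding models compared in this project.
minilm = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # 384-dimensional vectors
mpnet = SentenceTransformer("all-mpnet-base-v2")            # 768-dimensional vectors
# The Nomic checkpoint ships custom model code, so loading it typically
# requires trust_remote_code=True.
nomic = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

vectors = minilm.encode(["What is a B+ tree?", "Explain the ACID properties."])
print(vectors.shape)  # (2, 384)
```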
- Document Processing:
  - Automatic document ingestion from various formats
  - Configurable text chunking with overlap control
  - Metadata extraction and storage
- Embedding Pipeline:
  - Support for multiple embedding models
  - Vector dimension validation
  - Efficient batch processing
- Vector Database Integration:
  - Unified interface for multiple vector DBs
  - Automatic collection management
  - Configurable similarity search parameters
- LLM Integration:
  - Local LLM deployment via Ollama
  - Configurable prompt templates
  - Temperature and sampling controls
- Evaluation Framework:
  - Standardized question sets
  - Performance metrics tracking
  - Interactive visualization tools
  - CSV export capabilities
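The configurable chunking listed under Document Processing can be sketched with LangChain's RecursiveCharacterTextSplitter, since LangChain is listed for text processing; which splitter the project actually uses is an assumption, and the notes file path is hypothetical.

```python
# In newer LangChain releases this class lives in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Matches the medium (512/50) chunking configuration evaluated below.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

with open("data/example_notes.txt") as f:  # hypothetical course-notes file
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks; first 200 chars of chunk 0:\n{chunks[0][:200]}")
```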
```
.
├── data/                    # Course notes and evaluation data
├── database/                # Vector database implementations
│   ├── chroma_db.py         # Chroma DB client
│   ├── qdrant_db.py         # Qdrant DB client
│   └── redis_db.py          # Redis DB client
├── embeddings/              # Embedding model implementations
│   ├── sentence_transformer.py
│   └── test_config.py       # Model configurations
├── llm/                     # LLM interface
│   └── llm_interface.py     # Ollama integration
├── evaluation/              # Evaluation scripts and tools
│   ├── evaluate_rag.py      # Main evaluation script
│   └── generate_evaluation_responses_*.py  # Response generators
├── main.py                  # Main RAG system implementation
└── requirements.txt         # Project dependencies
```
- Set up Python Environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up Ollama:

  ```bash
  # Install Ollama (follow instructions at https://ollama.ai)
  # Pull the required models
  ollama pull mistral:latest
  ollama pull qwen:7b
  ```

- Vector Database Setup:

  ```bash
  # Redis (if using)
  docker run -d -p 6379:6379 redis

  # Qdrant (if using)
  docker run -d -p 6333:6333 qdrant/qdrant
  ```

- Run Evaluation Pipeline:

  ```bash
  # Generate responses with Redis
  python generate_evaluation_responses_redis.py

  # Generate responses with Qdrant
  python generate_evaluation_responses_qdrant.py

  # Generate responses with Chroma
  python generate_evaluation_responses_chroma.py
  ```

- Evaluate Results:

  ```bash
  python evaluate_rag.py
  ```

- View Results:
  - Check `evaluation_results/` for JSON outputs
  - View interactive visualizations in `evaluation_results/visualizations/`
  - Access raw data in `evaluation_results/raw_responses.csv`
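If the Docker containers from the Vector Database Setup step are running, a quick connectivity check confirms each store is reachable before indexing. This sketch assumes the redis, qdrant-client, and chromadb Python packages are installed from requirements.txt; the Chroma persistence path is hypothetical.

```python
import redis
import chromadb
from qdrant_client import QdrantClient

# Redis: ping() returns True when the container on port 6379 is reachable.
print(redis.Redis(host="localhost", port=6379).ping())

# Qdrant: list existing collections on the default HTTP port 6333.
print(QdrantClient(host="localhost", port=6333).get_collections())

# Chroma: heartbeat() returns a timestamp; the path is a hypothetical local store.
print(chromadb.PersistentClient(path="./chroma_data").heartbeat())
```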
The system supports various configuration options:
- Chunking Strategies:

  ```json
  {
      "chunk_size": 256,    // or 512, 1024
      "chunk_overlap": 25   // or 50, 100
  }
  ```

- Embedding Models:

  ```python
  EMBEDDING_MODELS = {
      "nomic-ai/nomic-embed-text-v1.5": {...},
      "multi-qa-MiniLM-L6-cos-v1": {...},
      "all-mpnet-base-v2": {...}
  }
  ```

- Vector Database Settings:

  ```python
  vector_db = RedisDB(
      collection_name="eval_config_name",
      embedding_model="model_name"
  )
  ```

- LLM Parameters:

  ```python
  # Mistral configuration
  llm = OllamaLLM(
      model_name="mistral:latest",
      temperature=0.4
  )

  # Qwen configuration
  llm = OllamaLLM(
      model_name="qwen:7b",
      temperature=0.4
  )
  ```
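OllamaLLM above is this project's wrapper. For reference, roughly the same Mistral configuration expressed as a direct call through the ollama Python client looks like the sketch below; whether the project talks to Ollama through this client or the raw REST API is an assumption, and the temperature is passed via the options dict.

```python
import ollama

# Direct Ollama call roughly equivalent to the Mistral configuration above.
reply = ollama.chat(
    model="mistral:latest",
    messages=[{"role": "user", "content": "What is eventual consistency?"}],
    options={"temperature": 0.4},
)
print(reply["message"]["content"])
```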
The system evaluates different configurations:
- Chunking Strategies:
  - Small (256/25): Fine-grained retrieval
  - Medium (512/50): Balanced approach
  - Large (1024/100): Context preservation
- Embedding Models:
  - Nomic AI: Latest-generation embeddings
  - MiniLM: Fast and efficient
  - MPNet: Strong semantic understanding
- Vector Databases:
  - Redis: In-memory performance
  - Qdrant: Advanced similarity search
  - Chroma: Local persistence
- LLM Models:
  - Mistral: Latest version for primary testing
  - Qwen: 7B model for comparative analysis
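Viewed as a grid, these dimensions give 3 chunking strategies × 3 embedding models × 3 vector databases × 2 LLMs, i.e. 54 possible combinations. The sketch below simply enumerates that cross product; the naming convention is illustrative, and whether every combination was actually run is not stated here.

```python
from itertools import product

chunking = [(256, 25), (512, 50), (1024, 100)]
embedders = ["nomic-ai/nomic-embed-text-v1.5", "multi-qa-MiniLM-L6-cos-v1", "all-mpnet-base-v2"]
vector_dbs = ["redis", "qdrant", "chroma"]
llms = ["mistral:latest", "qwen:7b"]

# 3 x 3 x 3 x 2 = 54 candidate configurations in the full grid.
for (size, overlap), model, db, llm in product(chunking, embedders, vector_dbs, llms):
    config_name = f"{db}_{model.split('/')[-1]}_{size}_{overlap}_{llm.split(':')[0]}"
    print(config_name)
```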
Key findings from the evaluation:
- Performance Metrics:
  - Memory usage across configurations
  - Execution time for different components
  - Response quality and relevance
  - LLM model comparison (Mistral vs. Qwen)
- Best Performing Configuration:
  - Optimal chunking strategy
  - Most effective embedding model
  - Preferred vector database
  - Preferred LLM model
- Trade-offs:
  - Speed vs. accuracy
  - Memory usage vs. context size
  - Local vs. distributed storage
  - LLM performance vs. resource usage
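As a rough illustration of how per-stage execution time and memory figures like these can be captured, here is a small helper using psutil and time.perf_counter; the project's actual instrumentation in evaluate_rag.py may differ.

```python
import os
import time
import psutil

def measure(stage_fn, *args, **kwargs):
    """Run one pipeline stage and report wall-clock time and resident-memory growth."""
    process = psutil.Process(os.getpid())
    rss_before = process.memory_info().rss
    start = time.perf_counter()
    result = stage_fn(*args, **kwargs)
    elapsed_s = time.perf_counter() - start
    rss_delta_mb = (process.memory_info().rss - rss_before) / 1e6
    return result, elapsed_s, rss_delta_mb

# Example: time an embedding pass (embed_chunks is a hypothetical stage function).
# chunk_vectors, seconds, mb = measure(embed_chunks, chunks)
```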
Detailed results and visualizations are available in the `evaluation_results/` directory.