A Retrieval-Augmented Generation (RAG) system featuring hybrid search (BM25 + semantic), multimodal support (text + images), and conversational memory. Built with LangChain, ChromaDB, and Gradio.
- Hybrid Search: Combines BM25 keyword search with semantic embeddings using Reciprocal Rank Fusion (RRF)
- BGE Embeddings: Uses `BAAI/bge-small-en-v1.5` for superior semantic understanding
- Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers)
- Incremental Ingestion: Only processes new/modified files, with automatic cleanup of deleted files
- Text + Image Search: Query across documents, code, and visual content simultaneously
- CLIP Embeddings: Vector search for images using OpenAI's CLIP model
- PDF Intelligence: Extracts text, tables (via pdfplumber), and images from PDFs
- Jupyter Notebook Support: Section-aware chunking keeps complete problems together (question + code + outputs)
- Optional AI Captioning: Generate image descriptions using Ollama vision models
- Conversational Memory: Remembers chat history for follow-up questions
- Inline Citations: Automatic source attribution with clickable references
- Context-Only Responses: Prevents hallucinations by restricting answers to retrieved context
- Modern Web UI: Clean Gradio interface with image gallery support
Required Software:
- Python 3.8 or later
  - Check your version: `python --version` or `python3 --version`
- pip (usually included with Python)
  - Check: `pip --version` or `pip3 --version`
- Ollama - Local LLM inference engine
  - Install from: https://ollama.ai
Hardware:
- 4GB+ RAM (8GB+ recommended for image captioning)
- 2GB+ free disk space
```bash
# 1. Clone the repository
git clone https://github.com/shaunbeach/Multi_Modal_RAG.git
cd Multi_Modal_RAG

# 2. (Recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Install and start Ollama (if not already installed)
# Visit https://ollama.ai to download for your platform

# 5. Pull the required LLM model
ollama pull llama3.1-Instruct:8b_IQ4_XS

# 6. (Optional) Pull vision model for AI image captioning
ollama pull llava:7b
```

Note: If you don't have Python installed, see our detailed installation guide for step-by-step instructions for Windows, macOS, and Linux.
- Add your documents to the `documents/` folder (PDFs, Jupyter notebooks, Markdown, code, images, etc.)
- Ingest documents:

  ```bash
  # Update database with new/modified documents (smart & fast)
  python ingest.py
  python ingest.py --update                    # Same as above (explicit)

  # Update with AI-generated image captions (better quality)
  python ingest.py --update --caption-images

  # Force complete rebuild from scratch (only when needed)
  python ingest.py --force-rebuild
  ```

  Note: You do NOT need to delete the database manually! The system automatically tracks changes and only processes new/modified files.

- Launch the web interface:

  ```bash
  python gui.py
  ```

- Open your browser at `http://127.0.0.1:7860` and start querying!
See QUICKSTART.md for detailed ingestion options.
```bash
python ingest.py
python ingest.py --update                      # Same as above (explicit)
```
- Smart: Automatically detects new/modified files
- Fast: Only processes what changed
- Incremental: Preserves existing database
- No manual cleanup needed: System tracks everything for you

```bash
python ingest.py --update --caption-images
```
- Accurate: Generates AI descriptions for images
- Incremental: Still only processes new/modified images
- Requires: Ollama vision model (`ollama pull llava:7b`)

```bash
python ingest.py --force-rebuild
python ingest.py --force-rebuild --caption-images  # Best quality
```
- Fresh start: Deletes all existing data
- Use when: Changing embedding models or corrupted database
- Warning: Processes ALL files again (can be slow)
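The incremental modes above work because ingestion keeps a record of what has already been processed (`chroma_db/file_tracking.json` in the project layout below). The following is only a minimal sketch of how such change detection can work; the function names and JSON layout are illustrative assumptions, not the project's actual implementation:

```python
import hashlib
import json
from pathlib import Path

# Paths taken from the project layout; the JSON format is an illustrative assumption.
TRACKING_FILE = Path("chroma_db/file_tracking.json")
DOCS_DIR = Path("documents")

def file_digest(path):
    """Hash file contents so unchanged files are skipped even if timestamps differ."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes():
    """Return (files to (re)ingest, tracked files that were deleted) and refresh the record."""
    tracked = json.loads(TRACKING_FILE.read_text()) if TRACKING_FILE.exists() else {}
    current = {str(p): file_digest(p) for p in DOCS_DIR.rglob("*") if p.is_file()}

    to_ingest = [p for p, digest in current.items() if tracked.get(p) != digest]
    deleted = [p for p in tracked if p not in current]

    TRACKING_FILE.parent.mkdir(parents=True, exist_ok=True)
    TRACKING_FILE.write_text(json.dumps(current, indent=2))
    return to_ingest, deleted
```

Hashing file contents (rather than trusting timestamps) is what lets renamed or touched-but-unchanged files be skipped, and the deleted list is what drives automatic cleanup of removed files.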
| Category | Extensions | Features |
|---|---|---|
| Documents | .pdf, .md, .txt, .rst | Table extraction, layout preservation |
| Code | .py, .js, .java, .cpp | Syntax-aware chunking |
| Data Science | .ipynb, .json, .yaml, .jsonl | Captures cell outputs, conversations |
| Web | .html, .htm | BeautifulSoup parsing |
| Images | .jpg, .png, .gif, .webp, .bmp, .tiff | CLIP embeddings, optional AI captions |
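For the image formats in the last row, text and images are embedded into a shared vector space with CLIP, which is what lets a text query like "show me architecture diagrams" match pictures. A minimal sketch using the Hugging Face `transformers` CLIP API with the model named in the configuration below; the file path is a placeholder and this is not the project's ingestion code:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed an image (path is a placeholder)
image = Image.open("image_store/figure_01.png")
image_inputs = processor(images=image, return_tensors="pt")
image_vec = model.get_image_features(**image_inputs)      # shape: (1, 512)

# Embed a text query into the same vector space
text_inputs = processor(text=["system architecture diagram"], return_tensors="pt", padding=True)
query_vec = model.get_text_features(**text_inputs)        # shape: (1, 512)

# Cosine similarity between query and image ranks candidate images
score = (image_vec @ query_vec.T) / (image_vec.norm() * query_vec.norm())
```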
Key Innovation: This system uses intelligent, structure-aware chunking that respects the natural organization of each file type:
- Jupyter Notebooks: Section-based chunking groups cells by markdown headers (`## Problem 1`, `### Exercise 2.1`), keeping complete problems together (question + code + outputs). No more fragmented answers!
- Python Files: AST-based chunking extracts complete functions and classes with all their methods intact
- Markdown/HTML: Section-based chunking preserves complete topics with all subsections
- JSON/YAML: Key-based chunking maintains logical configuration groupings
- PDFs: Page-based chunking with preserved table formatting
Why it matters: When you ask "How did I solve Problem 1?", you get the complete problem context - the question, all solution code, and all execution results in a single, cohesive chunk. No more piecing together fragments!
Learn more: See STRUCTURE_AWARE_CHUNKING.md and NOTEBOOK_SECTION_CHUNKING.md
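As a rough illustration of the notebook strategy (not the project's implementation), section-based chunking can be sketched as grouping cells under the most recent markdown header, so each chunk carries the question, its code, and its outputs together:

```python
import nbformat

def chunk_notebook_by_section(path):
    """Group notebook cells under the most recent markdown header (e.g. '## Problem 1')."""
    nb = nbformat.read(path, as_version=4)
    sections, current_title, current_parts = [], "Preamble", []

    for cell in nb.cells:
        first_line = cell.source.lstrip().splitlines()[0] if cell.source.strip() else ""
        if cell.cell_type == "markdown" and first_line.startswith("#"):
            # A header starts a new section; flush the previous one.
            if current_parts:
                sections.append({"title": current_title, "content": "\n\n".join(current_parts)})
            current_title, current_parts = first_line.lstrip("# ").strip(), [cell.source]
        elif cell.cell_type == "code":
            outputs = [o.get("text", "") for o in cell.get("outputs", []) if "text" in o]
            current_parts.append(cell.source + ("\n# Output:\n" + "".join(outputs) if outputs else ""))
        else:
            current_parts.append(cell.source)

    if current_parts:
        sections.append({"title": current_title, "content": "\n\n".join(current_parts)})
    return sections
```

Each returned section then becomes one retrieval chunk, which is why a query about "Problem 1" can come back with the question, the solution code, and the execution results together.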
For a deep dive into how this system works, see ARCHITECTURE.md.
- Hybrid Retrieval: BM25 (keyword precision) + BGE embeddings (semantic understanding), fused with Reciprocal Rank Fusion (sketched after this list)
- Multimodal Search: Parallel text and image retrieval with unified results
- Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers, JSON/YAML by keys)
- Vector Database: ChromaDB with separate collections for text and images
- LLM: Ollama with custom prompts for context-only responses
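To make the hybrid retrieval step concrete, here is a small sketch of Reciprocal Rank Fusion over a BM25 ranking (via `rank-bm25`) and a semantic ranking. The toy corpus, the hard-coded `semantic_ranking` stand-in for a ChromaDB/BGE query, and the constant `k = 60` are illustrative assumptions; the project's actual retriever wiring may differ:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

corpus = [
    "hybrid search combines bm25 keyword matching with dense embeddings",
    "chromadb stores text and image vectors in separate collections",
    "gradio provides the chat interface with an image gallery",
]
query = "how does hybrid search work"

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.split() for doc in corpus])
keyword_scores = bm25.get_scores(query.split())
bm25_ranking = sorted(range(len(corpus)), key=lambda i: keyword_scores[i], reverse=True)

# Semantic side: stand-in for a ChromaDB/BGE similarity query (hypothetical result order).
semantic_ranking = [0, 2, 1]

print(rrf([bm25_ranking, semantic_ranking]))  # fused document indices, best first
```

A common reason to fuse ranks rather than raw scores is that BM25 and cosine-similarity values live on different scales; ranks stay comparable even when scores are not.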
Customize the system by creating a .env file in the project root:
```env
# Model Configuration
OLLAMA_MODEL=llama3.1-Instruct:8b_IQ4_XS
LLM_TEMPERATURE=0.3        # Lower = more focused (0.0-1.0)
LLM_CONTEXT_WINDOW=8192    # Must match your model's context size
LLM_MAX_TOKENS=8192        # Maximum response length

# Embedding Models
EMBEDDING_MODEL_NAME=BAAI/bge-small-en-v1.5
IMAGE_EMBEDDING_MODEL_NAME=openai/clip-vit-base-patch32
VISION_MODEL_NAME=llava:7b

# Retrieval Settings
NUM_RETRIEVAL_DOCS=10      # Text chunks to retrieve (6-20)
NUM_IMAGE_RESULTS=3        # Images to retrieve (2-5)

# Chunking Strategy
CHUNK_SIZE=500             # Code chunk size
PROSE_CHUNK_SIZE=1200      # Prose/document chunk size
CHUNK_OVERLAP=100
PROSE_CHUNK_OVERLAP=200

# Image Processing
MIN_IMAGE_SIZE=100         # Filter small images (pixels)
IMAGE_QUALITY=95           # JPEG quality (1-100)

# Server Settings
SERVER_HOST=127.0.0.1      # Use 0.0.0.0 for network access
SERVER_PORT=7860

# Paths
SOURCE_FOLDER_PATH=./documents
VECTORSTORE_PATH=./chroma_db
IMAGE_STORE_PATH=./image_store
```

See docs/CONFIGURATION.md for the complete configuration guide, including:
- How to change the LLM model
- Temperature and context window tuning
- Embedding model options
- Chunking strategy optimization
- Common configuration scenarios
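Since these settings are plain environment variables, they can be read with `python-dotenv` plus sensible defaults. The variable names below come from the example `.env` above; the loading code itself is only an assumption about how the project consumes them:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the project root if present

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1-Instruct:8b_IQ4_XS")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.3"))
NUM_RETRIEVAL_DOCS = int(os.getenv("NUM_RETRIEVAL_DOCS", "10"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))

print(f"Using {OLLAMA_MODEL} at temperature {LLM_TEMPERATURE}")
```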
Both precise and vague queries work; the hybrid search intelligently combines keyword and semantic matching:
- "What does Section 3.2 say about data validation?" → Finds specific sections via keywords
- "What preprocessing methods were used?" → Semantic search finds related content
- "Show me system architecture diagrams" → Multimodal search retrieves images
- "Explain the API design patterns in Chapter 4" → Source-specific retrieval with code examples
```
.
├── gui.py                    # Main web interface
├── ingest.py                 # Document ingestion pipeline
├── requirements.txt          # Python dependencies
├── documents/                # Your documents go here
├── chroma_db/                # Vector database storage
│   └── file_tracking.json    # Tracks ingested files
├── image_store/              # Extracted images
├── scripts/                  # Utility scripts
│   ├── query_multimodal.py   # CLI query tool
│   ├── test_rag.py           # Testing utilities
│   └── test_search.py        # Search testing
├── QUICKSTART.md             # Quick start guide (root for visibility)
└── docs/                     # Comprehensive documentation
    ├── ARCHITECTURE.md                     # System architecture
    ├── SETUP.md                            # Setup instructions
    ├── CONFIGURATION.md                    # Configuration guide
    ├── STRUCTURE_AWARE_CHUNKING.md         # Complete technical guide
    ├── STRUCTURE_AWARE_CHUNKING_README.md  # Structure-aware overview
    ├── NOTEBOOK_SECTION_CHUNKING.md        # Notebook section chunking details
    ├── GUI_COMPATIBILITY.md                # GUI integration details
    ├── GUI_SECTION_COMPATIBILITY.md        # Section-aware GUI features
    └── PROJECT_SUMMARY.md                  # Project summary
```
Documents are not being found:
- Ensure files are in the `documents/` folder
- Check that the file extensions are supported
- Run with `--force-rebuild` to reset

Images are not appearing in results:
- Verify CLIP dependencies are installed: `pip install transformers torch`
- Check that image files are valid and larger than 100px
- Re-run ingestion if images were added after the initial run

Queries are slow:
- Reduce `NUM_RETRIEVAL_DOCS` (default: 10)
- Reduce `NUM_IMAGE_RESULTS` (default: 3)
- Use GPU for CLIP embeddings (automatic on CUDA/MPS)

Running out of memory:
- Reduce `CHUNK_SIZE` to 300-400
- Decrease `NUM_RETRIEVAL_DOCS` to 6-8
- Lower `LLM_MAX_TOKENS` to 4096
Ingestion Speed (MacBook Pro M1, 82 files):
- Standard mode: ~45 seconds
- With AI captions: ~8 minutes
Query Latency:
- Text-only: 1-3 seconds
- Multimodal: 2-4 seconds
Memory Usage:
- Base: ~2GB RAM
- With CLIP: ~4GB RAM
- With LLaVA captioning: ~6GB RAM
Edit PROMPT_TEMPLATE in gui.py to customize LLM behavior.
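For illustration, a context-only template along the following lines keeps answers grounded in retrieved chunks; the placeholders and wording here are hypothetical, and the actual PROMPT_TEMPLATE in gui.py may differ:

```python
# Hypothetical example of a context-only prompt; adapt to the actual template in gui.py.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the context below.
If the context does not contain the answer, say that you don't know; do not guess.
Cite the source of each fact you use, e.g. [source: filename].

Context:
{context}

Chat history:
{chat_history}

Question: {question}

Answer:"""
```

Keep whatever placeholders you use (here `{context}`, `{chat_history}`, `{question}`) aligned with the values the retrieval chain in gui.py actually injects.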
Extend process_file() in ingest.py with custom extractors.
Replace Gradio interface with FastAPI/Flask for production deployments.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
MIT License - see LICENSE file for details.
Built with:
- LangChain - RAG orchestration
- ChromaDB - Vector database
- Gradio - Web interface
- Ollama - Local LLM inference
- Sentence Transformers - Text embeddings
- OpenAI CLIP - Image embeddings
- rank-bm25 - Keyword search
If you use this project in research, please cite:
```bibtex
@software{multimodal_rag_hybrid,
  title={Multimodal RAG System with Hybrid Search},
  author={Shaun Beach},
  year={2025},
  url={https://github.com/shaunbeach/Multi_Modal_RAG}
}
```