Multimodal RAG System with Hybrid Search

A Retrieval-Augmented Generation (RAG) system featuring hybrid search (BM25 + semantic), multimodal support (text + images), and conversational memory. Built with LangChain, ChromaDB, and Gradio.


Features

Advanced Retrieval

  • Hybrid Search: Combines BM25 keyword search with semantic embeddings using Reciprocal Rank Fusion (RRF); see the sketch after this list
  • BGE Embeddings: Uses BAAI/bge-small-en-v1.5 for superior semantic understanding
  • Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers)
  • Incremental Ingestion: Only processes new/modified files, with automatic cleanup of deleted files
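
The fusion step itself is small. Below is a minimal Python sketch of Reciprocal Rank Fusion, assuming the commonly used constant k = 60; the repository's retrieval code may merge the ranked lists differently.

# Minimal RRF sketch -- illustration only, not the project's exact implementation.
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and embedding results for the same query.
bm25_hits = ["doc_3", "doc_1", "doc_7"]
semantic_hits = ["doc_1", "doc_5", "doc_3"]
print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))  # doc_1 and doc_3 lead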

Multimodal Capabilities

  • Text + Image Search: Query across documents, code, and visual content simultaneously
  • CLIP Embeddings: Vector search for images using OpenAI's CLIP model
  • PDF Intelligence: Extracts text, tables (via pdfplumber), and images from PDFs
  • Jupyter Notebook Support: Section-aware chunking keeps complete problems together (question + code + outputs)
  • Optional AI Captioning: Generate image descriptions using Ollama vision models

User Experience

  • Conversational Memory: Remembers chat history for follow-up questions
  • Inline Citations: Automatic source attribution with clickable references
  • Context-Only Responses: Reduces hallucinations by restricting answers to retrieved context
  • Modern Web UI: Clean Gradio interface with image gallery support

Quick Start

Prerequisites

Required Software:

  • Python 3.8 or later (download from python.org)
    • Check your version: python --version or python3 --version
  • pip (usually included with Python)
    • Check: pip --version or pip3 --version
  • Ollama - Local LLM inference engine

Hardware:

  • 4GB+ RAM (8GB+ recommended for image captioning)
  • 2GB+ free disk space

Installation

# 1. Clone the repository
git clone https://github.com/shaunbeach/Multi_Modal_RAG.git
cd Multi_Modal_RAG

# 2. (Recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Install and start Ollama (if not already installed)
# Visit https://ollama.ai to download for your platform

# 5. Pull the required LLM model
ollama pull llama3.1-Instruct:8b_IQ4_XS

# 6. (Optional) Pull vision model for AI image captioning
ollama pull llava:7b

Note: If you don't have Python installed, see docs/SETUP.md for step-by-step instructions for Windows, macOS, and Linux.

Basic Usage

  1. Add your documents to the documents/ folder (PDFs, Jupyter notebooks, Markdown, code, images, etc.)

  2. Ingest documents:

    # Update database with new/modified documents (smart & fast)
    python ingest.py
    python ingest.py --update         # Same as above (explicit)
    
    # Update with AI-generated image captions (better quality)
    python ingest.py --update --caption-images
    
    # Force complete rebuild from scratch (only when needed)
    python ingest.py --force-rebuild

    Note: You do NOT need to delete the database manually! The system automatically tracks changes and only processes new/modified files.

  3. Launch the web interface:

    python gui.py
  4. Open your browser at http://127.0.0.1:7860 and start querying!

Ingestion Modes

See QUICKSTART.md for detailed ingestion options.

Update Mode (Default - Recommended)

python ingest.py
python ingest.py --update    # Same as above (explicit)
  • Smart: Automatically detects new/modified files
  • Fast: Only processes what changed
  • Incremental: Preserves existing database
  • No manual cleanup needed: System tracks everything for you

Update with AI Captions

python ingest.py --update --caption-images
  • Better image retrieval: Generates AI descriptions so images can be found by their content
  • Incremental: Still only processes new/modified images
  • Requires: Ollama vision model (ollama pull llava:7b)

Force Rebuild Mode

python ingest.py --force-rebuild
python ingest.py --force-rebuild --caption-images  # Best quality
  • Fresh start: Deletes all existing data
  • Use when: Changing embedding models or corrupted database
  • Warning: Processes ALL files again (can be slow)

Supported File Types

Category        Extensions                              Features
Documents       .pdf, .md, .txt, .rst                   Table extraction, layout preservation
Code            .py, .js, .java, .cpp                   Syntax-aware chunking
Data Science    .ipynb, .json, .yaml, .jsonl            Captures cell outputs, conversations
Web             .html, .htm                             BeautifulSoup parsing
Images          .jpg, .png, .gif, .webp, .bmp, .tiff    CLIP embeddings, optional AI captions

Structure-Aware Chunking

Key Innovation: This system uses intelligent, structure-aware chunking that respects the natural organization of each file type:

  • Jupyter Notebooks: Section-based chunking groups cells by markdown headers (## Problem 1, ### Exercise 2.1), keeping complete problems together (question + code + outputs). No more fragmented answers!
  • Python Files: AST-based chunking extracts complete functions and classes with all their methods intact
  • Markdown/HTML: Section-based chunking preserves complete topics with all subsections
  • JSON/YAML: Key-based chunking maintains logical configuration groupings
  • PDFs: Page-based chunking with preserved table formatting

Why it matters: When you ask "How did I solve Problem 1?", you get the complete problem context - the question, all solution code, and all execution results in a single, cohesive chunk. No more piecing together fragments!
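
As a rough illustration of the notebook case, the sketch below groups cells under their nearest markdown header using nbformat. It is a simplification under assumptions; the project's real chunker also captures cell outputs and handles more edge cases (see the documents referenced just below).

import re
import nbformat

def chunk_notebook_by_sections(path):
    """Group notebook cells under their nearest '##'/'###' markdown header.

    Illustrative sketch only; the project's chunker also keeps cell outputs.
    """
    nb = nbformat.read(path, as_version=4)
    sections, current = [], {"header": "Preamble", "cells": []}
    for cell in nb.cells:
        first_line = cell.source.splitlines()[0] if cell.source else ""
        if cell.cell_type == "markdown" and re.match(r"^#{2,}\s", first_line):
            if current["cells"]:
                sections.append(current)
            current = {"header": first_line.lstrip("# ").strip(), "cells": []}
        current["cells"].append(cell.source)
    sections.append(current)
    return sections  # one chunk per section: question, code, and text together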

Learn more: See docs/STRUCTURE_AWARE_CHUNKING.md and docs/NOTEBOOK_SECTION_CHUNKING.md

Architecture

For a deep dive into how this system works, see docs/ARCHITECTURE.md.

Key Components

  1. Hybrid Retrieval: BM25 (keyword precision) + BGE embeddings (semantic understanding)
  2. Multimodal Search: Parallel text and image retrieval with unified results
  3. Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers, JSON/YAML by keys)
  4. Vector Database: ChromaDB with separate collections for text and images
  5. LLM: Ollama with custom prompts for context-only responses
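
To make the multimodal piece concrete, here is a hedged sketch of querying an image collection in ChromaDB with a CLIP text embedding. The collection name, persist path, and overall wiring are assumptions for illustration; the project's actual code in ingest.py and gui.py may differ.

# Hedged sketch: query an image collection with a CLIP text embedding.
# The "images" collection name and ./chroma_db path are assumptions.
import chromadb
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

client = chromadb.PersistentClient(path="./chroma_db")
image_collection = client.get_or_create_collection(name="images")

query = "system architecture diagram"
inputs = processor(text=[query], return_tensors="pt", padding=True)
text_embedding = model.get_text_features(**inputs)[0].detach().tolist()  # 512-dim for CLIP-base

results = image_collection.query(query_embeddings=[text_embedding], n_results=3)
for meta in results["metadatas"][0]:
    print(meta)  # whatever metadata (e.g. file path) was stored at ingestion

The stored image embeddings must come from the same CLIP model so the query and index share one embedding space.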

Configuration

Customize the system by creating a .env file in the project root:

# Model Configuration
OLLAMA_MODEL=llama3.1-Instruct:8b_IQ4_XS
LLM_TEMPERATURE=0.3              # Lower = more focused (0.0-1.0)
LLM_CONTEXT_WINDOW=8192          # Must match your model's context size
LLM_MAX_TOKENS=8192              # Maximum response length

# Embedding Models
EMBEDDING_MODEL_NAME=BAAI/bge-small-en-v1.5
IMAGE_EMBEDDING_MODEL_NAME=openai/clip-vit-base-patch32
VISION_MODEL_NAME=llava:7b

# Retrieval Settings
NUM_RETRIEVAL_DOCS=10            # Text chunks to retrieve (6-20)
NUM_IMAGE_RESULTS=3              # Images to retrieve (2-5)

# Chunking Strategy
CHUNK_SIZE=500                   # Code chunk size
PROSE_CHUNK_SIZE=1200            # Prose/document chunk size
CHUNK_OVERLAP=100
PROSE_CHUNK_OVERLAP=200

# Image Processing
MIN_IMAGE_SIZE=100               # Filter small images (pixels)
IMAGE_QUALITY=95                 # JPEG quality (1-100)

# Server Settings
SERVER_HOST=127.0.0.1            # Use 0.0.0.0 for network access
SERVER_PORT=7860

# Paths
SOURCE_FOLDER_PATH=./documents
VECTORSTORE_PATH=./chroma_db
IMAGE_STORE_PATH=./image_store
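
A typical way to read these values in Python is with python-dotenv, sketched below. The variable names match the block above, but the defaults shown and the project's actual loading code are assumptions.

# Hedged sketch of reading .env values (requires: pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the project root into the environment

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1-Instruct:8b_IQ4_XS")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.3"))
NUM_RETRIEVAL_DOCS = int(os.getenv("NUM_RETRIEVAL_DOCS", "10"))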

See docs/CONFIGURATION.md for the complete configuration guide including:

  • How to change the LLM model
  • Temperature and context window tuning
  • Embedding model options
  • Chunking strategy optimization
  • Common configuration scenarios

Example Queries

Vague queries work! The hybrid search intelligently combines keyword and semantic matching:

  • "What does Section 3.2 say about data validation?" → Finds specific sections via keywords
  • "What preprocessing methods were used?" → Semantic search finds related content
  • "Show me system architecture diagrams" → Multimodal search retrieves images
  • "Explain the API design patterns in Chapter 4" → Source-specific retrieval with code examples

Project Structure

.
├── gui.py                      # Main web interface
├── ingest.py                   # Document ingestion pipeline
├── requirements.txt            # Python dependencies
├── documents/                  # Your documents go here
├── chroma_db/                  # Vector database storage
│   └── file_tracking.json      # Tracks ingested files
├── image_store/                # Extracted images
├── scripts/                    # Utility scripts
│   ├── query_multimodal.py     # CLI query tool
│   ├── test_rag.py             # Testing utilities
│   └── test_search.py          # Search testing
├── QUICKSTART.md               # Quick start guide (root for visibility)
└── docs/                       # Comprehensive documentation
    ├── ARCHITECTURE.md         # System architecture
    ├── SETUP.md                # Setup instructions
    ├── CONFIGURATION.md        # Configuration guide
    ├── STRUCTURE_AWARE_CHUNKING.md # Complete technical guide
    ├── STRUCTURE_AWARE_CHUNKING_README.md  # Structure-aware overview
    ├── NOTEBOOK_SECTION_CHUNKING.md # Notebook section chunking details
    ├── GUI_COMPATIBILITY.md    # GUI integration details
    ├── GUI_SECTION_COMPATIBILITY.md # Section-aware GUI features
    └── PROJECT_SUMMARY.md      # Project summary

Troubleshooting

"No documents found" error

  • Ensure files are in the documents/ folder
  • Check file extensions are supported
  • Run with --force-rebuild to reset

Images not showing in results

  • Verify CLIP is installed: pip install transformers torch
  • Check image files are valid and > 100px
  • Re-run ingestion if images were added after initial run

Slow retrieval

  • Reduce NUM_RETRIEVAL_DOCS (default: 10)
  • Reduce NUM_IMAGE_RESULTS (default: 3)
  • Use GPU for CLIP embeddings (automatic on CUDA/MPS)

"Context too long" error

  • Reduce CHUNK_SIZE to 300-400
  • Decrease NUM_RETRIEVAL_DOCS to 6-8
  • Lower LLM_MAX_TOKENS to 4096
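
For example, a tighter .env based on the suggestions above might look like:

CHUNK_SIZE=350
NUM_RETRIEVAL_DOCS=6
LLM_MAX_TOKENS=4096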

Performance Benchmarks

Ingestion Speed (MacBook Pro M1, 82 files):

  • Standard mode: ~45 seconds
  • With AI captions: ~8 minutes

Query Latency:

  • Text-only: 1-3 seconds
  • Multimodal: 2-4 seconds

Memory Usage:

  • Base: ~2GB RAM
  • With CLIP: ~4GB RAM
  • With LLaVA captioning: ~6GB RAM

Advanced Usage

Custom Prompts

Edit PROMPT_TEMPLATE in gui.py to customize LLM behavior.
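
As a hedged illustration, a context-only template with citation instructions might look like the sketch below; the actual PROMPT_TEMPLATE in gui.py, including its placeholder names, may be worded differently.

# Illustrative template only -- the real PROMPT_TEMPLATE and its
# {context}/{question} placeholders in gui.py may differ.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say so instead of guessing.
Cite sources inline as [source: filename].

Context:
{context}

Question: {question}
Answer:"""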

Adding New File Types

Extend process_file() in ingest.py with custom extractors.
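
For example, a new extractor for .csv files might look like the function below; how process_file() dispatches to it is not shown here and depends on the existing code in ingest.py.

# Hedged example of a custom extractor; hook it into process_file() in ingest.py.
import csv

def extract_csv(path: str) -> str:
    """Flatten a CSV file into plain text, one 'header: value' row per line."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = ["; ".join(f"{key}: {value}" for key, value in row.items())
                for row in csv.DictReader(f)]
    return "\n".join(rows)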

API Integration

Replace Gradio interface with FastAPI/Flask for production deployments.
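
A minimal FastAPI wrapper might look like the sketch below. The answer_question() helper is a hypothetical stand-in for the project's query pipeline (see gui.py and scripts/query_multimodal.py), and the module and endpoint names are assumptions.

# Minimal FastAPI sketch (save as api.py, run with: uvicorn api:app --port 8000).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

def answer_question(question: str):
    # Placeholder: wire this to the project's retrieval + Ollama generation.
    return "not implemented", []

@app.post("/query")
def query(req: QueryRequest):
    answer, sources = answer_question(req.question)
    return {"answer": answer, "sources": sources}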

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with:

  • LangChain - retrieval and orchestration
  • ChromaDB - vector database
  • Gradio - web interface
  • Ollama - local LLM inference
  • BAAI BGE (text) and OpenAI CLIP (image) embeddings

Citation

If you use this project in research, please cite:

@software{multimodal_rag_hybrid,
  title={Multimodal RAG System with Hybrid Search},
  author={Shaun Beach},
  year={2025},
  url={https://github.com/shaunbeach/Multi_Modal_RAG}
}
