Multimodal RAG System with Hybrid Search

A Retrieval-Augmented Generation (RAG) system featuring hybrid search (BM25 + semantic), multimodal support (text + images), and conversational memory. Built with LangChain, ChromaDB, and Gradio.


Features

Advanced Retrieval

  • Hybrid Search: Combines BM25 keyword search with semantic embeddings using Reciprocal Rank Fusion (RRF); see the sketch after this list
  • BGE Embeddings: Uses BAAI/bge-small-en-v1.5 for superior semantic understanding
  • Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers)
  • Incremental Ingestion: Only processes new/modified files, with automatic cleanup of deleted files
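
The fusion step itself is small. Below is a minimal Python sketch of Reciprocal Rank Fusion, assuming the commonly used constant k = 60; the repository's retrieval code may merge the ranked lists differently.

# Minimal RRF sketch -- illustration only, not the project's exact implementation.
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranked_ids in result_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse BM25 and embedding results for the same query.
bm25_hits = ["doc_3", "doc_1", "doc_7"]
semantic_hits = ["doc_1", "doc_5", "doc_3"]
print(reciprocal_rank_fusion([bm25_hits, semantic_hits]))  # doc_1 and doc_3 lead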

Multimodal Capabilities

  • Text + Image Search: Query across documents, code, and visual content simultaneously
  • CLIP Embeddings: Vector search for images using OpenAI's CLIP model
  • PDF Intelligence: Extracts text, tables (via pdfplumber), and images from PDFs
  • Jupyter Notebook Support: Section-aware chunking keeps complete problems together (question + code + outputs)
  • Optional AI Captioning: Generate image descriptions using Ollama vision models

User Experience

  • Conversational Memory: Remembers chat history for follow-up questions
  • Inline Citations: Automatic source attribution with clickable references
  • Context-Only Responses: Reduces hallucinations by restricting answers to retrieved context
  • Modern Web UI: Clean Gradio interface with image gallery support

Quick Start

Prerequisites

Required Software:

  • Python 3.8 or later (download from python.org)
    • Check your version: python --version or python3 --version
  • pip (usually included with Python)
    • Check: pip --version or pip3 --version
  • Ollama - Local LLM inference engine

Hardware:

  • 4GB+ RAM (8GB+ recommended for image captioning)
  • 2GB+ free disk space

Installation

# 1. Clone the repository
git clone https://github.com/shaunbeach/Multi_Modal_RAG.git
cd Multi_Modal_RAG

# 2. (Recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Install and start Ollama (if not already installed)
# Visit https://ollama.ai to download for your platform

# 5. Pull the required LLM model
ollama pull llama3.1-Instruct:8b_IQ4_XS

# 6. (Optional) Pull vision model for AI image captioning
ollama pull llava:7b

Note: If you don't have Python installed, see docs/SETUP.md for step-by-step instructions for Windows, macOS, and Linux.

Basic Usage

  1. Add your documents to the documents/ folder (PDFs, Jupyter notebooks, Markdown, code, images, etc.)

  2. Ingest documents:

    # Update database with new/modified documents (smart & fast)
    python ingest.py
    python ingest.py --update         # Same as above (explicit)
    
    # Update with AI-generated image captions (better quality)
    python ingest.py --update --caption-images
    
    # Force complete rebuild from scratch (only when needed)
    python ingest.py --force-rebuild

    Note: You do NOT need to delete the database manually! The system automatically tracks changes and only processes new/modified files.

  3. Launch the web interface:

    python gui.py
  4. Open your browser at http://127.0.0.1:7860 and start querying!

Ingestion Modes

See QUICKSTART.md for detailed ingestion options.

Update Mode (Default - Recommended)

python ingest.py
python ingest.py --update    # Same as above (explicit)
  • Smart: Automatically detects new/modified files
  • Fast: Only processes what changed
  • Incremental: Preserves existing database
  • No manual cleanup needed: System tracks everything for you

Update with AI Captions

python ingest.py --update --caption-images
  • Better image retrieval: Generates AI descriptions so images can be found by their content
  • Incremental: Still only processes new/modified images
  • Requires: Ollama vision model (ollama pull llava:7b)

Force Rebuild Mode

python ingest.py --force-rebuild
python ingest.py --force-rebuild --caption-images  # Best quality
  • Fresh start: Deletes all existing data
  • Use when: Changing embedding models or corrupted database
  • Warning: Processes ALL files again (can be slow)

Supported File Types

Category        Extensions                              Features
Documents       .pdf, .md, .txt, .rst                   Table extraction, layout preservation
Code            .py, .js, .java, .cpp                   Syntax-aware chunking
Data Science    .ipynb, .json, .yaml, .jsonl            Captures cell outputs, conversations
Web             .html, .htm                             BeautifulSoup parsing
Images          .jpg, .png, .gif, .webp, .bmp, .tiff    CLIP embeddings, optional AI captions

Structure-Aware Chunking

Key Innovation: This system uses intelligent, structure-aware chunking that respects the natural organization of each file type:

  • Jupyter Notebooks: Section-based chunking groups cells by markdown headers (## Problem 1, ### Exercise 2.1), keeping complete problems together (question + code + outputs). No more fragmented answers!
  • Python Files: AST-based chunking extracts complete functions and classes with all their methods intact
  • Markdown/HTML: Section-based chunking preserves complete topics with all subsections
  • JSON/YAML: Key-based chunking maintains logical configuration groupings
  • PDFs: Page-based chunking with preserved table formatting

Why it matters: When you ask "How did I solve Problem 1?", you get the complete problem context - the question, all solution code, and all execution results in a single, cohesive chunk. No more piecing together fragments!
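
As a rough illustration of the notebook case, the sketch below groups cells under their nearest markdown header using nbformat. It is a simplification under assumptions; the project's real chunker also captures cell outputs and handles more edge cases (see the documents referenced just below).

import re
import nbformat

def chunk_notebook_by_sections(path):
    """Group notebook cells under their nearest '##'/'###' markdown header.

    Illustrative sketch only; the project's chunker also keeps cell outputs.
    """
    nb = nbformat.read(path, as_version=4)
    sections, current = [], {"header": "Preamble", "cells": []}
    for cell in nb.cells:
        first_line = cell.source.splitlines()[0] if cell.source else ""
        if cell.cell_type == "markdown" and re.match(r"^#{2,}\s", first_line):
            if current["cells"]:
                sections.append(current)
            current = {"header": first_line.lstrip("# ").strip(), "cells": []}
        current["cells"].append(cell.source)
    sections.append(current)
    return sections  # one chunk per section: question, code, and text together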

Learn more: See docs/STRUCTURE_AWARE_CHUNKING.md and docs/NOTEBOOK_SECTION_CHUNKING.md

Architecture

For a deep dive into how this system works, see docs/ARCHITECTURE.md.

Key Components

  1. Hybrid Retrieval: BM25 (keyword precision) + BGE embeddings (semantic understanding)
  2. Multimodal Search: Parallel text and image retrieval with unified results
  3. Structure-Aware Chunking: Respects natural file structure (notebooks by sections, Python by functions/classes, Markdown by headers, JSON/YAML by keys)
  4. Vector Database: ChromaDB with separate collections for text and images
  5. LLM: Ollama with custom prompts for context-only responses
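
To make the multimodal piece concrete, here is a hedged sketch of querying an image collection in ChromaDB with a CLIP text embedding. The collection name, persist path, and overall wiring are assumptions for illustration; the project's actual code in ingest.py and gui.py may differ.

# Hedged sketch: query an image collection with a CLIP text embedding.
# The "images" collection name and ./chroma_db path are assumptions.
import chromadb
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

client = chromadb.PersistentClient(path="./chroma_db")
image_collection = client.get_or_create_collection(name="images")

query = "system architecture diagram"
inputs = processor(text=[query], return_tensors="pt", padding=True)
text_embedding = model.get_text_features(**inputs)[0].detach().tolist()  # 512-dim for CLIP-base

results = image_collection.query(query_embeddings=[text_embedding], n_results=3)
for meta in results["metadatas"][0]:
    print(meta)  # whatever metadata (e.g. file path) was stored at ingestion

The stored image embeddings must come from the same CLIP model so the query and index share one embedding space.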

Configuration

Customize the system by creating a .env file in the project root:

# Model Configuration
OLLAMA_MODEL=llama3.1-Instruct:8b_IQ4_XS
LLM_TEMPERATURE=0.3              # Lower = more focused (0.0-1.0)
LLM_CONTEXT_WINDOW=8192          # Must match your model's context size
LLM_MAX_TOKENS=8192              # Maximum response length

# Embedding Models
EMBEDDING_MODEL_NAME=BAAI/bge-small-en-v1.5
IMAGE_EMBEDDING_MODEL_NAME=openai/clip-vit-base-patch32
VISION_MODEL_NAME=llava:7b

# Retrieval Settings
NUM_RETRIEVAL_DOCS=10            # Text chunks to retrieve (6-20)
NUM_IMAGE_RESULTS=3              # Images to retrieve (2-5)

# Chunking Strategy
CHUNK_SIZE=500                   # Code chunk size
PROSE_CHUNK_SIZE=1200            # Prose/document chunk size
CHUNK_OVERLAP=100
PROSE_CHUNK_OVERLAP=200

# Image Processing
MIN_IMAGE_SIZE=100               # Filter small images (pixels)
IMAGE_QUALITY=95                 # JPEG quality (1-100)

# Server Settings
SERVER_HOST=127.0.0.1            # Use 0.0.0.0 for network access
SERVER_PORT=7860

# Paths
SOURCE_FOLDER_PATH=./documents
VECTORSTORE_PATH=./chroma_db
IMAGE_STORE_PATH=./image_store
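
A typical way to read these values in Python is with python-dotenv, sketched below. The variable names match the block above, but the defaults shown and the project's actual loading code are assumptions.

# Hedged sketch of reading .env values (requires: pip install python-dotenv).
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the project root into the environment

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1-Instruct:8b_IQ4_XS")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.3"))
NUM_RETRIEVAL_DOCS = int(os.getenv("NUM_RETRIEVAL_DOCS", "10"))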

See docs/CONFIGURATION.md for the complete configuration guide including:

  • How to change the LLM model
  • Temperature and context window tuning
  • Embedding model options
  • Chunking strategy optimization
  • Common configuration scenarios

Example Queries

Vague queries work! The hybrid search intelligently combines keyword and semantic matching:

  • "What does Section 3.2 say about data validation?" → Finds specific sections via keywords
  • "What preprocessing methods were used?" → Semantic search finds related content
  • "Show me system architecture diagrams" → Multimodal search retrieves images
  • "Explain the API design patterns in Chapter 4" → Source-specific retrieval with code examples

Project Structure

.
├── gui.py                      # Main web interface
├── ingest.py                   # Document ingestion pipeline
├── requirements.txt            # Python dependencies
├── documents/                  # Your documents go here
├── chroma_db/                  # Vector database storage
│   └── file_tracking.json      # Tracks ingested files
├── image_store/                # Extracted images
├── scripts/                    # Utility scripts
│   ├── query_multimodal.py     # CLI query tool
│   ├── test_rag.py             # Testing utilities
│   └── test_search.py          # Search testing
├── QUICKSTART.md               # Quick start guide (root for visibility)
└── docs/                       # Comprehensive documentation
    ├── ARCHITECTURE.md         # System architecture
    ├── SETUP.md                # Setup instructions
    ├── CONFIGURATION.md        # Configuration guide
    ├── STRUCTURE_AWARE_CHUNKING.md # Complete technical guide
    ├── STRUCTURE_AWARE_CHUNKING_README.md  # Structure-aware overview
    ├── NOTEBOOK_SECTION_CHUNKING.md # Notebook section chunking details
    ├── GUI_COMPATIBILITY.md    # GUI integration details
    ├── GUI_SECTION_COMPATIBILITY.md # Section-aware GUI features
    └── PROJECT_SUMMARY.md      # Project summary

Troubleshooting

"No documents found" error

  • Ensure files are in the documents/ folder
  • Check file extensions are supported
  • Run with --force-rebuild to reset

Images not showing in results

  • Verify CLIP is installed: pip install transformers torch
  • Check image files are valid and > 100px
  • Re-run ingestion if images were added after initial run

Slow retrieval

  • Reduce NUM_RETRIEVAL_DOCS (default: 10)
  • Reduce NUM_IMAGE_RESULTS (default: 3)
  • Use GPU for CLIP embeddings (automatic on CUDA/MPS)

"Context too long" error

  • Reduce CHUNK_SIZE to 300-400
  • Decrease NUM_RETRIEVAL_DOCS to 6-8
  • Lower LLM_MAX_TOKENS to 4096
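
For example, a tighter .env based on the suggestions above might look like:

CHUNK_SIZE=350
NUM_RETRIEVAL_DOCS=6
LLM_MAX_TOKENS=4096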

Performance Benchmarks

Ingestion Speed (MacBook Pro M1, 82 files):

  • Standard mode: ~45 seconds
  • With AI captions: ~8 minutes

Query Latency:

  • Text-only: 1-3 seconds
  • Multimodal: 2-4 seconds

Memory Usage:

  • Base: ~2GB RAM
  • With CLIP: ~4GB RAM
  • With LLaVA captioning: ~6GB RAM

Advanced Usage

Custom Prompts

Edit PROMPT_TEMPLATE in gui.py to customize LLM behavior.
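
As a hedged illustration, a context-only template with citation instructions might look like the sketch below; the actual PROMPT_TEMPLATE in gui.py, including its placeholder names, may be worded differently.

# Illustrative template only -- the real PROMPT_TEMPLATE and its
# {context}/{question} placeholders in gui.py may differ.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, say so instead of guessing.
Cite sources inline as [source: filename].

Context:
{context}

Question: {question}
Answer:"""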

Adding New File Types

Extend process_file() in ingest.py with custom extractors.
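
For example, a new extractor for .csv files might look like the function below; how process_file() dispatches to it is not shown here and depends on the existing code in ingest.py.

# Hedged example of a custom extractor; hook it into process_file() in ingest.py.
import csv

def extract_csv(path: str) -> str:
    """Flatten a CSV file into plain text, one 'header: value' row per line."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = ["; ".join(f"{key}: {value}" for key, value in row.items())
                for row in csv.DictReader(f)]
    return "\n".join(rows)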

API Integration

Replace Gradio interface with FastAPI/Flask for production deployments.
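
A minimal FastAPI wrapper might look like the sketch below. The answer_question() helper is a hypothetical stand-in for the project's query pipeline (see gui.py and scripts/query_multimodal.py), and the module and endpoint names are assumptions.

# Minimal FastAPI sketch (save as api.py, run with: uvicorn api:app --port 8000).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

def answer_question(question: str):
    # Placeholder: wire this to the project's retrieval + Ollama generation.
    return "not implemented", []

@app.post("/query")
def query(req: QueryRequest):
    answer, sources = answer_question(req.question)
    return {"answer": answer, "sources": sources}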

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with:

  • LangChain - retrieval and orchestration
  • ChromaDB - vector database
  • Gradio - web interface
  • Ollama - local LLM inference
  • BAAI BGE (text) and OpenAI CLIP (image) embeddings

Citation

If you use this project in research, please cite:

@software{multimodal_rag_hybrid,
  title={Multimodal RAG System with Hybrid Search},
  author={Shaun Beach},
  year={2025},
  url={https://github.com/shaunbeach/Multi_Modal_RAG}
}
