A Retrieval-Augmented Generation (RAG) system for intelligent gallery search that combines image and metadata search using multimodal embeddings, with VLM-based evaluation of results.
This project implements a visual search system that:
- Embeds both images and their metadata using a multimodal model
- Stores embeddings in a vector database (ChromaDB)
- Enables semantic search across the gallery
- Evaluates search results using Vision-Language Models (VLMs)
- Model: MMRet-large (JUNJIE99/MMRet-large)
- Architecture: CLIP-based multimodal encoder
- Limitations:
- Text input limited to 77 tokens (CLIP constraint)
- Better alternatives could use VLM-based embeddings for richer text understanding
- Database: ChromaDB (Persistent storage)
- Storage: Images stored locally, embeddings in ChromaDB
- Features: Supports hybrid search combining image and metadata
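
A minimal sketch of how the persistent ChromaDB collection could be set up; the storage path, collection name, and the placeholder record below are illustrative, not the notebook's actual values:

```python
import chromadb

# Persistent client writes the index to disk so it survives notebook restarts.
# The path and collection name are illustrative, not the notebook's actual values.
client = chromadb.PersistentClient(path="./chroma_gallery")
collection = client.get_or_create_collection(
    name="gallery",
    metadata={"hnsw:space": "cosine"},  # cosine distance suits L2-normalized embeddings
)

# Each gallery item stores one fused (image + metadata) embedding plus raw metadata
# fields that can later be shown alongside search results.
collection.add(
    ids=["img_0001"],
    embeddings=[[0.01] * 768],  # placeholder vector; real ones come from the MMRet encoder
    metadatas=[{"path": "gallery/img_0001.jpg", "tags": "beach, sunset", "date": "2012-07-14"}],
)
```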
- Source: YFCC100M_OpenAI_subset
- Features: Rich metadata including:
- Image data
- Capture information (device, date)
- User tags
- Geolocation
- License information
- Streaming: Uses HuggingFace's streaming capability for memory-efficient loading
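
A sketch of the streaming load, assuming the `dalle-mini/YFCC100M_OpenAI_subset` hub id and typical YFCC100M column names; check the notebook for the exact identifiers:

```python
from itertools import islice

from datasets import load_dataset

# Streaming yields records one at a time instead of downloading the full subset.
# The hub id is an assumption -- adjust it to whichever YFCC100M mirror the notebook uses.
ds = load_dataset("dalle-mini/YFCC100M_OpenAI_subset", split="train", streaming=True)

# Peek at a few records; the field names are typical YFCC100M columns and should be
# checked against the actual dataset schema.
for record in islice(ds, 3):
    print({k: record.get(k) for k in ("title", "description", "usertags", "datetaken")})
```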
- SmolVLM-Base
- Lightweight VLM for quick evaluation
- Limited multi-image understanding
- Best suited for single-image evaluation
- Qwen 2.5-VL 3B
- More powerful VLM for detailed evaluation
- Better understanding of image-query relationships
- Provides numerical relevance scoring (1-10)
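
A sketch of single-image relevance scoring with Qwen 2.5-VL 3B through the standard `transformers` + `qwen_vl_utils` interface (the prompt wording, image path, and query are illustrative):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Requires a recent transformers build (see the installation notes at the end).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Ask for a 1-10 relevance score for one retrieved image; path, query, and prompt
# wording are illustrative.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "gallery/img_0001.jpg"},
        {"type": "text", "text": "Query: 'sunset at the beach'. Rate how relevant this image "
                                 "is to the query on a scale of 1-10. Reply with the number only."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=8)
score_text = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print("Relevance score:", score_text.strip())
```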
- Images are stored locally in a gallery directory
- Selected metadata fields are combined with images for embedding
- Embeddings are L2-normalized before storage
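
A minimal indexing sketch, reusing the `collection` from the ChromaDB example above. Because MMRet-large's exact loading code may differ, a vanilla CLIP checkpoint stands in for it here, and the simple average used to fuse image and metadata embeddings is an assumption rather than the notebook's actual strategy:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in encoder: MMRet-large is CLIP-based, but its exact loading code may differ,
# so a vanilla CLIP checkpoint is used here purely to illustrate the shape of the step.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("gallery/img_0002.jpg")                   # illustrative local gallery path
metadata_text = "beach, sunset, taken 2012-07-14, Lisbon"    # selected metadata fields, joined

text_inputs = clip_processor.tokenizer([metadata_text], padding=True, truncation=True,
                                       max_length=77, return_tensors="pt")
image_inputs = clip_processor.image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_emb = clip_model.get_image_features(**image_inputs)
    text_emb = clip_model.get_text_features(**text_inputs)

# Normalize each modality, fuse by simple averaging (an assumption, not necessarily the
# notebook's fusion strategy), then L2-normalize again before writing to ChromaDB.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
fused = (image_emb + text_emb) / 2
fused = fused / fused.norm(dim=-1, keepdim=True)

collection.add(
    ids=["img_0002"],
    embeddings=[fused.squeeze(0).tolist()],
    metadatas=[{"path": "gallery/img_0002.jpg", "tags": "beach, sunset"}],
)
```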
- Query text is embedded using the MMRet model
- ChromaDB performs similarity search
- Top results are evaluated using VLMs
- Results are scored for relevance
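
A query-side sketch that reuses the stand-in encoder and `collection` from the previous examples; only the example query string is new:

```python
def search_gallery(query: str, n_results: int = 5):
    """Embed the text query and retrieve the nearest gallery items from ChromaDB.
    Reuses `clip_model`, `clip_processor`, and `collection` from the sketches above."""
    text_inputs = clip_processor.tokenizer([query], padding=True, truncation=True,
                                           max_length=77, return_tensors="pt")
    with torch.no_grad():
        q = clip_model.get_text_features(**text_inputs)
    q = q / q.norm(dim=-1, keepdim=True)  # L2-normalize, matching index time
    return collection.query(query_embeddings=[q.squeeze(0).tolist()], n_results=n_results)


results = search_gallery("sunset at the beach")
print(results["ids"][0], results["distances"][0])
```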
- Embedding Similarity: Direct comparison of query and result embeddings (a minimal dot-product sketch follows this list)
- VLM Evaluation:
- Single image relevance scoring
- Multi-image comparison (limited by model capabilities)
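
Because both vectors are L2-normalized at embedding time, the embedding-similarity score reduces to a dot product, as in this small sketch:

```python
import numpy as np

def embedding_similarity(query_emb: np.ndarray, result_emb: np.ndarray) -> float:
    """Cosine similarity between a query and a result embedding.
    Both vectors are L2-normalized at embedding time, so this reduces to a dot product."""
    return float(np.dot(query_emb, result_emb))
```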
- Token Limitation: The current CLIP-based embedding allows only 77 tokens of text
- Consider VLM-based embeddings for longer text
- Implement better metadata truncation strategies (one possible approach is sketched after this list)
- Evaluation Challenges:
- SmolVLM struggles with multiple image evaluation
- Need for more robust multi-image comparison
- Potential Improvements:
- Implement hybrid search combining text and image queries
- Add support for more metadata fields
- Explore more advanced VLMs for embedding and evaluation. An ideal use case would be answering queries like "How many people were at my birthday party last year?": the metadata would pin down the date of the party, and the VLM could then reason about the number of people present from the metadata and the image content.
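
One possible truncation strategy for the 77-token limit is sketched below: join metadata fields in a fixed priority order and let the CLIP tokenizer cut the rest. The priority order is an illustrative choice, not the project's actual strategy.

```python
from transformers import CLIPTokenizerFast

tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-large-patch14")

def truncate_metadata(fields: dict, max_length: int = 77) -> str:
    """Join metadata fields in a fixed priority order, then cut at CLIP's token budget.
    The priority order is an illustrative choice, not the project's actual strategy."""
    priority = ["title", "usertags", "datetaken", "description"]
    text = ", ".join(str(fields[k]) for k in priority if fields.get(k))
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```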
The project requires standard ML libraries including transformers, accelerate, datasets, chromadb, and qwen-vl-utils. Note that Qwen recommends installing transformers from source alongside accelerate:
!pip install -r requirements.txt
!pip install git+https://github.com/huggingface/transformers accelerate
The implementation is available as a Jupyter notebook demonstrating the complete workflow from database setup to search and evaluation. All of the code can be run on a free Google Colab GPU runtime.