Visual-RAG Gallery Search

A Retrieval-Augmented Generation (RAG) system for intelligent gallery search that combines image and metadata search using multimodal embeddings, with VLM-based evaluation of the results.

Overview

This project implements a visual search system that:

  1. Embeds both images and their metadata using a multimodal model
  2. Stores embeddings in a vector database (ChromaDB)
  3. Enables semantic search across the gallery
  4. Evaluates search results using Vision-Language Models (VLMs)

Components

Multimodal Embedding Model

  • Model: MMRet-large (JUNJIE99/MMRet-large)
  • Architecture: CLIP-based multimodal encoder
  • Limitations:
    • Text input limited to 77 tokens (CLIP constraint); see the embedding sketch below
    • Better alternatives could use VLM-based embeddings for richer text understanding
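
Since the encoder is CLIP-based, it can be driven through transformers' generic CLIP classes. The sketch below uses a stand-in CLIP checkpoint to illustrate the 77-token truncation; whether MMRet-large loads through these exact classes is an assumption, so defer to its model card for the real loading code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint for illustration; the project uses JUNJIE99/MMRet-large,
# which may require its own loading code from the model card.
MODEL_ID = "openai/clip-vit-large-patch14"

model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_text(text: str) -> torch.Tensor:
    # The CLIP text tower accepts at most 77 tokens, so longer metadata is truncated.
    inputs = processor(text=[text], return_tensors="pt",
                       padding=True, truncation=True, max_length=77)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)  # L2-normalize
```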

Vector Database

  • Database: ChromaDB (Persistent storage)
  • Storage: Images stored locally, embeddings in ChromaDB
  • Features: Supports hybrid search combining image and metadata
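
A minimal sketch of the persistent store; the path and collection name are illustrative.

```python
import chromadb

# Persistent client: embeddings and metadata live in ./chroma_db on disk,
# while the images themselves stay in the local gallery directory.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="gallery",
    metadata={"hnsw:space": "cosine"},  # cosine distance over L2-normalized vectors
)
```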

Dataset

  • Source: YFCC100M_OpenAI_subset
  • Features: Rich metadata including:
    • Image data
    • Capture information (device, date)
    • User tags
    • Geolocation
    • License information
  • Streaming: Uses Hugging Face's streaming mode for memory-efficient loading, as sketched below
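
A sketch of the streaming load; the exact Hub path and field names are assumptions to be checked against the dataset card.

```python
from itertools import islice

from datasets import load_dataset

# Streaming avoids downloading the full dataset up front; records arrive one at a time.
ds = load_dataset("dalle-mini/YFCC100M_OpenAI_subset", split="train", streaming=True)

for sample in islice(ds, 5):
    # Each record carries the image plus metadata such as tags, capture date, and location.
    print(sorted(sample.keys()))
```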

Evaluation Models

  1. SmolVLM-Base

    • Lightweight VLM for quick evaluation
    • Limited multi-image understanding
    • Best suited for single-image evaluation
  2. Qwen 2.5-VL 3B

    • More powerful VLM for detailed evaluation
    • Better understanding of image-query relationships
    • Provides numerical relevance scoring (1-10); a scoring sketch follows below
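
A sketch of how Qwen 2.5-VL 3B could produce the 1-10 relevance score, following the usual transformers + qwen-vl-utils usage pattern; the prompt wording and generation settings are illustrative rather than the project's exact ones.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def score_relevance(image_path: str, query: str) -> str:
    # Ask the VLM for a 1-10 relevance score for a single retrieved image.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": (
                f"On a scale of 1-10, how relevant is this image to the query "
                f"'{query}'? Answer with the number only."
            )},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=8)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```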

Implementation Details

Data Processing

  • Images are stored locally in a gallery directory
  • Selected metadata fields are combined with images for embedding
  • Embeddings are L2-normalized before storage (see the sketch below)
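
A sketch of these indexing conventions; the field names, the `build_metadata_text` helper, and the random vector (standing in for the real MMRet image+metadata embedding) are illustrative, and `collection` is the one created in the ChromaDB sketch above.

```python
import numpy as np

def build_metadata_text(record: dict) -> str:
    # Combine a few selected metadata fields into one short text string;
    # CLIP-style encoders truncate at 77 tokens, so brevity matters here.
    parts = [record.get("title", ""), record.get("user_tags", ""), record.get("date_taken", "")]
    return " | ".join(str(p) for p in parts if p)

def l2_normalize(vec: np.ndarray) -> np.ndarray:
    # Normalizing up front lets cosine similarity reduce to a dot product.
    return vec / (np.linalg.norm(vec) + 1e-12)

record = {"title": "Birthday dinner", "user_tags": "friends, cake", "date_taken": "2016-07-04"}
embedding = l2_normalize(np.random.rand(768))  # placeholder for the MMRet embedding

collection.add(
    ids=["img_0001"],
    embeddings=[embedding.tolist()],
    metadatas=[{"path": "gallery/img_0001.jpg", "text": build_metadata_text(record)}],
)
```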

Search Process

  1. Query text is embedded using the MMRet model
  2. ChromaDB performs similarity search
  3. Top results are evaluated using VLMs
  4. Results are scored for relevance
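
Continuing the sketches above (`embed_text` and `collection` come from the earlier snippets), the query path could look like this:

```python
query = "sunset over the ocean"               # example query text
query_emb = embed_text(query)[0].tolist()     # step 1: embed the query

results = collection.query(                   # step 2: similarity search in ChromaDB
    query_embeddings=[query_emb],
    n_results=5,
    include=["metadatas", "distances"],
)

for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
    # Steps 3-4: each hit's image path can now be handed to a VLM for relevance scoring.
    print(meta["path"], f"cosine distance = {dist:.3f}")
```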

Evaluation Methods

  1. Embedding Similarity: Direct comparison of query and result embeddings (see the sketch after this list)
  2. VLM Evaluation:
    • Single image relevance scoring
    • Multi-image comparison (limited by model capabilities)
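
For the embedding-similarity method, a plain cosine similarity between the query embedding and each result embedding is enough (once both are L2-normalized it is just a dot product); the vectors below are placeholders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

query_emb = np.array([0.10, 0.70, -0.20])
result_emb = np.array([0.05, 0.65, -0.10])
print(f"embedding similarity = {cosine_similarity(query_emb, result_emb):.3f}")
```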

Limitations and Future Work

  1. Token Limitation: Current CLIP-based embedding only allows 77 tokens

    • Consider VLM-based embeddings for longer text
    • Implement better metadata truncation strategies
  2. Evaluation Challenges:

    • SmolVLM struggles with multiple image evaluation
    • Need for more robust multi-image comparison
  3. Potential Improvements:

    • Implement hybrid search combining text and image queries
    • Add support for more metadata fields
    • Explore more advanced VLMs for embedding and evaluation. An ideal use case would be answering queries like "How many people were at my birthday party last year?": the metadata would hint at the date of the party, and the VLM could then reason about the number of people present based on both the metadata and the image content.

Requirements

The project requires standard ML libraries including transformers, accelerate, datasets, chromadb, and qwen-vl-utils. Note that Qwen recommends installing transformers from source, along with accelerate:

!pip install -r requirements.txt

!pip install git+https://github.com/huggingface/transformers accelerate

The implementation is available as a Jupyter notebook demonstrating the complete workflow from database setup to search and evaluation. All of the code can be run on a free Google Colab GPU runtime.
