A Retrieval-Augmented Generation (RAG) system for intelligent gallery search that combines image and metadata search using multimodal embeddings, with VLM-based evaluation of results.
This project implements a visual search system that:
- Embeds both images and their metadata using a multimodal model
- Stores embeddings in a vector database (ChromaDB)
- Enables semantic search across the gallery
- Evaluates search results using Vision-Language Models (VLMs)
- Model: MMRet-large (JUNJIE99/MMRet-large)
- Architecture: CLIP-based multimodal encoder
- Limitations:
- Text input limited to 77 tokens (CLIP constraint)
- Better alternatives could use VLM-based embeddings for richer text understanding
- Database: ChromaDB (Persistent storage)
- Storage: Images stored locally, embeddings in ChromaDB
- Features: Supports hybrid search combining image and metadata
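
A minimal sketch of how the persistent ChromaDB collection could be set up; the storage path, collection name, and the placeholder record below are illustrative, not the notebook's actual values:

```python
import chromadb

# Persistent client writes the index to disk so it survives notebook restarts.
# The path and collection name are illustrative, not the notebook's actual values.
client = chromadb.PersistentClient(path="./chroma_gallery")
collection = client.get_or_create_collection(
    name="gallery",
    metadata={"hnsw:space": "cosine"},  # cosine distance suits L2-normalized embeddings
)

# Each gallery item stores one fused (image + metadata) embedding plus raw metadata
# fields that can later be shown alongside search results.
collection.add(
    ids=["img_0001"],
    embeddings=[[0.01] * 768],  # placeholder vector; real ones come from the MMRet encoder
    metadatas=[{"path": "gallery/img_0001.jpg", "tags": "beach, sunset", "date": "2012-07-14"}],
)
```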
- Source: YFCC100M_OpenAI_subset
- Features: Rich metadata including:
- Image data
- Capture information (device, date)
- User tags
- Geolocation
- License information
- Streaming: Uses HuggingFace's streaming capability for memory-efficient loading
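
A sketch of the streaming load, assuming the `dalle-mini/YFCC100M_OpenAI_subset` hub id and typical YFCC100M column names; check the notebook for the exact identifiers:

```python
from itertools import islice

from datasets import load_dataset

# Streaming yields records one at a time instead of downloading the full subset.
# The hub id is an assumption -- adjust it to whichever YFCC100M mirror the notebook uses.
ds = load_dataset("dalle-mini/YFCC100M_OpenAI_subset", split="train", streaming=True)

# Peek at a few records; the field names are typical YFCC100M columns and should be
# checked against the actual dataset schema.
for record in islice(ds, 3):
    print({k: record.get(k) for k in ("title", "description", "usertags", "datetaken")})
```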
- SmolVLM-Base
- Lightweight VLM for quick evaluation
- Limited multi-image understanding
- Best suited for single-image evaluation
- Qwen 2.5-VL 3B
- More powerful VLM for detailed evaluation
- Better understanding of image-query relationships
- Provides numerical relevance scoring (1-10)
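
A sketch of single-image relevance scoring with Qwen 2.5-VL 3B through the standard `transformers` + `qwen_vl_utils` interface (the prompt wording, image path, and query are illustrative):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Requires a recent transformers build (see the installation notes at the end).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Ask for a 1-10 relevance score for one retrieved image; path, query, and prompt
# wording are illustrative.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "gallery/img_0001.jpg"},
        {"type": "text", "text": "Query: 'sunset at the beach'. Rate how relevant this image "
                                 "is to the query on a scale of 1-10. Reply with the number only."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=8)
score_text = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print("Relevance score:", score_text.strip())
```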
- Images are stored locally in a gallery directory
- Selected metadata fields are combined with images for embedding
- Embeddings are L2-normalized before storage
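
A minimal indexing sketch, reusing the `collection` from the ChromaDB example above. Because MMRet-large's exact loading code may differ, a vanilla CLIP checkpoint stands in for it here, and the simple average used to fuse image and metadata embeddings is an assumption rather than the notebook's actual strategy:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in encoder: MMRet-large is CLIP-based, but its exact loading code may differ,
# so a vanilla CLIP checkpoint is used here purely to illustrate the shape of the step.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("gallery/img_0002.jpg")                   # illustrative local gallery path
metadata_text = "beach, sunset, taken 2012-07-14, Lisbon"    # selected metadata fields, joined

text_inputs = clip_processor.tokenizer([metadata_text], padding=True, truncation=True,
                                       max_length=77, return_tensors="pt")
image_inputs = clip_processor.image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_emb = clip_model.get_image_features(**image_inputs)
    text_emb = clip_model.get_text_features(**text_inputs)

# Normalize each modality, fuse by simple averaging (an assumption, not necessarily the
# notebook's fusion strategy), then L2-normalize again before writing to ChromaDB.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
fused = (image_emb + text_emb) / 2
fused = fused / fused.norm(dim=-1, keepdim=True)

collection.add(
    ids=["img_0002"],
    embeddings=[fused.squeeze(0).tolist()],
    metadatas=[{"path": "gallery/img_0002.jpg", "tags": "beach, sunset"}],
)
```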
- Query text is embedded using the MMRet model
- ChromaDB performs similarity search
- Top results are evaluated using VLMs
- Results are scored for relevance
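
A query-side sketch that reuses the stand-in encoder and `collection` from the previous examples; only the example query string is new:

```python
def search_gallery(query: str, n_results: int = 5):
    """Embed the text query and retrieve the nearest gallery items from ChromaDB.
    Reuses `clip_model`, `clip_processor`, and `collection` from the sketches above."""
    text_inputs = clip_processor.tokenizer([query], padding=True, truncation=True,
                                           max_length=77, return_tensors="pt")
    with torch.no_grad():
        q = clip_model.get_text_features(**text_inputs)
    q = q / q.norm(dim=-1, keepdim=True)  # L2-normalize, matching index time
    return collection.query(query_embeddings=[q.squeeze(0).tolist()], n_results=n_results)


results = search_gallery("sunset at the beach")
print(results["ids"][0], results["distances"][0])
```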
- Embedding Similarity: Direct comparison of query and result embeddings (a minimal dot-product sketch follows this list)
- VLM Evaluation:
- Single image relevance scoring
- Multi-image comparison (limited by model capabilities)
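
Because both vectors are L2-normalized at embedding time, the embedding-similarity score reduces to a dot product, as in this small sketch:

```python
import numpy as np

def embedding_similarity(query_emb: np.ndarray, result_emb: np.ndarray) -> float:
    """Cosine similarity between a query and a result embedding.
    Both vectors are L2-normalized at embedding time, so this reduces to a dot product."""
    return float(np.dot(query_emb, result_emb))
```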
- Token Limitation: The current CLIP-based embedding allows only 77 tokens of text
- Consider VLM-based embeddings for longer text
- Implement better metadata truncation strategies (one possible approach is sketched after this list)
- Evaluation Challenges:
- SmolVLM struggles with multiple image evaluation
- Need for more robust multi-image comparison
- Potential Improvements:
- Implement hybrid search combining text and image queries
- Add support for more metadata fields
- Explore more advanced VLMs for embedding and evaluation. An ideal use case would be answering queries like "How many people were at my birthday party last year?": the metadata would pin down the date of the party, and the VLM could then reason about the number of people present from the metadata and the image content.
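
One possible truncation strategy for the 77-token limit is sketched below: join metadata fields in a fixed priority order and let the CLIP tokenizer cut the rest. The priority order is an illustrative choice, not the project's actual strategy.

```python
from transformers import CLIPTokenizerFast

tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-large-patch14")

def truncate_metadata(fields: dict, max_length: int = 77) -> str:
    """Join metadata fields in a fixed priority order, then cut at CLIP's token budget.
    The priority order is an illustrative choice, not the project's actual strategy."""
    priority = ["title", "usertags", "datetaken", "description"]
    text = ", ".join(str(fields[k]) for k in priority if fields.get(k))
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```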
The project requires standard ML libraries including transformers, accelerate, datasets, chromadb, and qwen-vl-utils. Note that Qwen recommends installing transformers from source alongside accelerate:
!pip install -r requirements.txt
!pip install git+https://github.com/huggingface/transformers accelerate
The implementation is available as a Jupyter notebook demonstrating the complete workflow from database setup to search and evaluation. All of the code can be run on a free Google Colab GPU runtime.