We introduce a new research problem: multimodal search with target modality (MSTM). This problem involves searching for objects in one modality (the target) using inputs from multiple modalities. One of the input modalities is the target modality, and the others are auxiliary modalities that modify or refine aspects of the target-modality input. For example, we can search for videos using a reference video together with an auxiliary image and text. Our paper entitled “MUST: An Effective and Scalable Framework for Multimodal Search with Target Modality” provides an efficient and scalable framework for multimodal search with target modality, called MUST. The evaluation results demonstrate that MUST improves search accuracy by about 50%, is more than 10x faster than the baseline methods, and scales to datasets of more than 10 million objects.
This repo contains the code, datasets, optimal parameters, and other detailed information used for the experiments of our paper.
- Multi-streamed retrieval (MR). MR is a traditional strategy for solving hybrid queries in the IR and DB communities [VLDB'20, SIGMOD'21]. We adapt this framework to handle the MSTM problem and enhance it with advanced unimodal and multimodal encoders such as CLIP [CVPR'22].
- Joint embedding (JE). JE is a mainstream method for addressing multimodal search in the CV community. We use two representative multimodal encoders: TIRG (pioneer) [CVPR'19] and CLIP (SOTA) [CVPR'22].
In MUST, we use three pluggable components: (1) Embedding; (2) Vector weight learning; (3) Indexing and search.
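For intuition, the sketch below shows one way these three components could be composed. The function names and the weighted-concatenation fusion are illustrative assumptions made for this README, not the actual MUST API.

```python
# Hypothetical sketch of how the three pluggable components fit together.
import numpy as np

def embed(modality_inputs, encoders):
    """(1) Embedding: encode each modality with its own pluggable encoder."""
    return [enc(x) for enc, x in zip(encoders, modality_inputs)]

def fuse(embeddings, weights):
    """Combine per-modality vectors into one vector per object, scaled by the
    weights produced by (2) the vector weight learning module.
    (Weighted concatenation is an assumption for illustration.)"""
    return np.concatenate([w * e for w, e in zip(weights, embeddings)], axis=1)

# (3) Indexing and search: build a vector index over the fused object vectors,
# fuse each query in the same way, and run an approximate nearest neighbor search.
```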
Dataset | # Modality | # Object | # Query | Type | Source |
---|---|---|---|---|---|
CelebA (link) | 2 | 191,549 | 34,326 | Image; Text | real-world |
MIT-States (link) | 2 | 53,743 | 72,732 | Image; Text | real-world |
Shopping* | 2 | 96,009 | 47,658 | Image; Text | real-world |
CelebA+ (link) | 4 | 191,549 | 34,326 | Image ×3; Text | real-world |
ImageText1M (link) | 2 | 1,000,000 | 1,000 | Image; Text | semi-synthetic |
AudioText1M (link) | 2 | 992,272 | 200 | Audio; Text | semi-synthetic |
VideoText1M (link) | 2 | 1,000,000 | 10,000 | Video; Text | semi-synthetic |
ImageText16M (link) | 2 | 16,000,000 | 10,000 | Image; Text | semi-synthetic |
*Please contact the author of the dataset to get access to the images.
To obtain embedding vectors, we use the same training hyper-parameters as the original papers of the encoders. The encoder configuration is the same for all three frameworks. For the vector weight learning module, we set the learning rate to 0.2 and train for 20 iterations by default. The appendix contains an analysis of the other parameters and the output weights of the module on different datasets.
PyTorch
Pybind
GCC 4.9+ with OpenMP
CMake 2.8+
(i) Embedding
We convert the vectors of all objects and query inputs to `fvecs` or `ivecs` format, and the ground truth data to `ivecs` format. For a description of the `fvecs` and `ivecs` formats, see here.
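For reference, the `fvecs`/`ivecs` layout stores each vector as a 4-byte integer dimension followed by the components as 4-byte floats (`fvecs`) or 4-byte integers (`ivecs`). The helpers below are a small NumPy sketch of our own (not part of the repo) for writing such files.

```python
# Small NumPy helpers (not part of the repo) for writing fvecs / ivecs files:
# each vector is stored as an int32 dimension followed by its components.
import numpy as np

def write_vecs(path, vectors, dtype):
    vectors = np.asarray(vectors, dtype=dtype)
    n, d = vectors.shape
    with open(path, "wb") as f:
        for row in vectors:
            np.array([d], dtype=np.int32).tofile(f)  # dimension header
            row.tofile(f)                            # vector components

def write_fvecs(path, vectors):
    write_vecs(path, vectors, np.float32)  # float32 components (embeddings)

def write_ivecs(path, vectors):
    write_vecs(path, vectors, np.int32)    # int32 components (ground truth ids)
```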
(ii) Vector weight learning
cd ./vector_weight_learning
python setup.py install
python main.py
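For intuition, here is a rough sketch of what learning per-modality weights with the default settings above (learning rate 0.2, 20 iterations) could look like. The triplet-style hinge objective, the function name, and the input layout are our own assumptions for illustration, not the module's actual implementation.

```python
# Illustrative sketch only: learn one weight per modality by gradient descent so
# that weighted distances to matching objects are smaller than to non-matching ones.
import numpy as np

def learn_weights(d_pos, d_neg, lr=0.2, iters=20, margin=1.0):
    """d_pos, d_neg: (n_queries, n_modalities) per-modality distances from each
    query to a matching / non-matching object."""
    n_modalities = d_pos.shape[1]
    w = np.ones(n_modalities) / n_modalities
    for _ in range(iters):
        # hinge loss: max(0, margin + w.d_pos - w.d_neg), averaged over violating queries
        violated = (margin + d_pos @ w - d_neg @ w) > 0
        if violated.any():
            grad = (d_pos[violated] - d_neg[violated]).mean(axis=0)
            w = np.clip(w - lr * grad, 1e-6, None)  # keep weights positive
    return w / w.sum()  # normalized modality weights
```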
(iii) Indexing and search
cd ./scripts
./run release build_<framework> # index build
./run release search_<framework> # search
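In the commands above, `<framework>` is a placeholder for the framework being evaluated (MUST or one of the baseline frameworks); substitute the corresponding target name used by the build scripts.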
Our embedding implementations are adapted from TIRG and CLIP, and our indexing and search components are built on CGraph. We appreciate their inspiration and the references they provided for this project.