1/6
Proud to present MagicLens: image retrieval models that follow open-ended instructions.
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
open-vision-language.github.io
Highlights of MagicLens:
>Novel Insights: Naturally occurring image pairs on the same web page contain diverse image relations (e.g., inside and outside views of the same building). Modeling these diverse relations enables far richer search intents than the identical-image matching of traditional image retrieval.
>Strong Performance: Trained on 36.7M triplets mined from the web, a single MagicLens model matches or exceeds prior SOTA methods on 10 benchmarks spanning multimodal-to-image, image-to-image, and text-to-image retrieval.
>Efficiency: On multiple benchmarks, MagicLens outperforms the previous SOTA (>14.6B parameters) with a 50X smaller model (267M parameters).
>Open-Ended Search: MagicLens satisfies diverse search intents expressed as open-ended instructions, especially complex intents that go beyond visual similarity, where prior best methods fall short.
Check out our technical report for more details:
[2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
This is a joint work with awesome collaborators: @YiLuan9, @Hexiang_Hu, @kentonctlee, Siyuan Qiao, @WenhuChen, @ysu_nlp, and @mchang21, from @GoogleDeepMind and @osunlp.
2/6
We mine image pairs from the same web pages and express their diverse semantic relations, which can extend beyond visual similarity (e.g., a product and its charger), as open-ended instructions with #LMM and #LLM.
3/6
How do we precisely and explicitly capture the implicit relations between image pairs?
We build a systematic pipeline with heavy mining, cleaning, and pairing. Then we annotate metadata at scale with #LMMs and generate open-ended instructions with #LLMs (rough sketch below).
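For the curious, a rough sketch of what this annotation stage could look like. `caption_with_lmm` and `instruct_with_llm` are hypothetical placeholders, not the paper's actual prompts or models:

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    query_image: str   # path/URL of the query image
    instruction: str   # open-ended text relating query -> target
    target_image: str  # path/URL of the target image

def caption_with_lmm(image_url: str) -> str:
    """Annotate image metadata (e.g., a detailed caption) with a large
    multimodal model. Placeholder: swap in your own LMM call."""
    raise NotImplementedError

def instruct_with_llm(query_caption: str, target_caption: str) -> str:
    """Ask an LLM to verbalize the implicit relation between the two
    images as an open-ended instruction. Placeholder: swap in your
    own LLM call and prompt."""
    raise NotImplementedError

def build_triplet(query_url: str, target_url: str) -> Triplet:
    # Both images were mined from the same web page and survived the
    # cleaning/pairing filters before reaching this stage.
    q_cap = caption_with_lmm(query_url)
    t_cap = caption_with_lmm(target_url)
    instruction = instruct_with_llm(q_cap, t_cap)
    return Triplet(query_url, instruction, target_url)
```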
4/6
After mining 36.7 million triplets (query image, instruction, target image), we train lightweight MagicLens models that take an image and an instruction as input.
At comparable model sizes, a single model achieves the best results on 10 benchmarks across three image retrieval task forms.
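A minimal contrastive-training sketch in PyTorch. Here `encode` is a hypothetical stand-in for the MagicLens dual encoder, and the empty-instruction target encoding and fixed temperature are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: the i-th query should match the i-th
    target; all other targets in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature  # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Training step (pseudo-data): the query side fuses the query image
# with the instruction; the target side encodes the target image,
# e.g., paired with an empty instruction.
# query_emb  = encode(query_images, instructions)   # hypothetical
# target_emb = encode(target_images, "")            # hypothetical
# loss = info_nce_loss(query_emb, target_emb)
```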
5/6
On several benchmarks, MagicLens outperforms 50X larger SOTA methods by a large margin.
6/6
To simulate a more realistic scenario, we hold out an index set with 1.4 million images, the largest retrieval pool to date.
Human evaluation finds that MagicLens excels at all kinds of instructions, especially those that are complex and go beyond visual similarity.
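A minimal sketch of retrieval over such an index, assuming precomputed L2-normalized embeddings; `encode` is the same hypothetical stand-in as above:

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray,
                   index_embs: np.ndarray,
                   k: int = 5) -> np.ndarray:
    """Return indices of the k most similar index images by cosine
    similarity (dot product of normalized embeddings)."""
    scores = index_embs @ query_emb   # (N,) similarities
    return np.argsort(-scores)[:k]

# index_embs = encode(index_images, "")  # (1.4M, D), computed once
# query_emb  = encode(query_image, instruction)
# top5 = retrieve_top_k(query_emb, index_embs)
```

At 1.4M images, exact dot-product search is still feasible; at larger scales, approximate nearest-neighbor libraries such as ScaNN or FAISS are the usual choice.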