A fashion e-commerce company wants to roll out multimodal search so that shoppers can type a natural-language query such as "red leather ankle boots" and instantly retrieve the most relevant product images from a catalog of 50 million pictures. Design constraints include:
End-to-end latency must stay below 100 ms.
Queries are open-ended, not limited to a fixed set of classes.
The image catalog will be stored as dense vectors in an approximate nearest-neighbor (ANN) index.

Which modeling strategy should the data-science team choose to satisfy all of these requirements while preserving strong semantic alignment between text and images?
Deploy separate image and text classifiers and average their softmax probability outputs at query time (late fusion) to rank results.
Train a contrastive dual-encoder (two-tower) model on paired caption-image data so that the text and image encoders produce vectors in the same embedding space, then pre-compute and ANN-index the image embeddings.
Fine-tune a large language model on product captions only and use its [CLS] token embedding as the representation for both queries and images.
Generate synthetic captions for every product image with an image-captioning model and index those captions with a TF-IDF bag-of-words search engine.
Training a contrastive dual-encoder (two-tower) model on paired caption-image data projects both modalities into the same latent space. The image tower's embeddings can be pre-computed offline and placed in an ANN service, so an incoming text query requires only a single forward pass through the text tower followed by a fast vector-similarity lookup, easily meeting the sub-100 ms latency budget at catalog scale.

The language-only approach cannot embed images at all, so image retrieval is impossible. A late-fusion ensemble combines modality-specific classifiers, but its softmax outputs are tied to a fixed label set and require scoring every image at query time, violating both the open-set and latency constraints. Generating synthetic captions and searching them with TF-IDF reduces the problem to text-only retrieval, losing visual nuance and typically trailing dense joint-embedding methods in accuracy; caption generation also adds extra processing overhead.

Therefore the contrastive dual-encoder is the only choice that aligns the modalities, supports ANN indexing, and meets the performance target.
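The training objective behind this approach is the symmetric InfoNCE loss popularized by CLIP-style models. Below is a minimal PyTorch sketch of that loss, assuming both towers emit L2-normalized embeddings; the function name and the temperature value are illustrative assumptions, not part of the question.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matching caption-image pairs."""
    # Both inputs: (batch, dim), already L2-normalized by their towers.
    logits = image_emb @ text_emb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # True pairs lie on the diagonal; every other entry acts as an in-batch negative.
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> matching image
    return (loss_i2t + loss_t2i) / 2
```

The serving path can then be sketched as follows, using FAISS for the ANN index. The `text_tower.encode` call, the 512-dimension embedding width, and the random stand-in catalog are assumptions for illustration; a production system would persist real image-tower embeddings and tune the index parameters.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 512  # embedding width; depends on the chosen towers

# Offline: run the image tower over the catalog once and index the vectors.
image_embeddings = np.random.rand(100_000, dim).astype("float32")  # stand-in for real catalog embeddings
faiss.normalize_L2(image_embeddings)  # unit norm, so inner product == cosine similarity
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # graph-based ANN index
index.add(image_embeddings)

# Online: one forward pass through the text tower, then a sub-linear vector lookup.
query_vec = text_tower.encode("red leather ankle boots")  # hypothetical text-tower call
query_vec = np.asarray(query_vec, dtype="float32").reshape(1, -1)
faiss.normalize_L2(query_vec)
scores, product_ids = index.search(query_vec, k=20)  # top-20 candidate products
```

Because the HNSW graph visits only a small fraction of the vectors per query, the lookup stays fast even at a 50-million-image scale, which is what makes the sub-100 ms budget achievable.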