Your team is building a real-time recommendation service that must match a shopper's free-text query against more than 10 million product descriptions in under 50 ms. The catalog can be processed offline, and several gigabytes of pre-computed representations may be kept in memory, but the online request path should perform at most one neural-network forward pass per query. Relevance should be judged by semantic rather than purely lexical similarity. Which modelling strategy best satisfies the latency, scale, and semantic-matching requirements?
Pre-encode all item descriptions with a Siamese/bi-encoder transformer, store the vectors in an ANN index, and encode the query once at inference to retrieve nearest neighbours.
Run a transformer cross-encoder that concatenates the query with every candidate description and scores each pair on the fly.
Represent each text as a sparse TF-IDF vector and rank candidates with BM25 scoring over an inverted index.
Compute the Levenshtein edit distance between the query string and every item title, selecting the items with the smallest distances.
A dual-encoder (also known as a bi-encoder or Siamese network) is the best approach. Each catalog item is embedded offline, and the resulting vectors are stored in an approximate-nearest-neighbour (ANN) index. At runtime, the query is encoded exactly once, and a fast vector search retrieves the closest items, giving semantic matching in sub-linear (often near-logarithmic) search time and keeping the request path within the 50 ms budget. A cross-encoder would require a separate forward pass for every query-candidate pair, making it far too slow for a catalog of 10 million items. BM25 over an inverted index is computationally efficient, but like character-level edit distance (Levenshtein), it relies on lexical overlap of surface tokens or characters rather than semantic meaning, so both fail the semantic-relevance requirement. The dual-encoder with an ANN index is the only strategy that satisfies all the stated constraints.
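A minimal sketch of this pattern, assuming the sentence-transformers and faiss-cpu packages; the model name, three-item catalog, and query are illustrative placeholders standing in for a production bi-encoder and the full 10M-item catalog:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder checkpoint works

# --- Offline: encode every item once and build the ANN index ---
catalog = [
    "waterproof hiking boots for rough terrain",
    "insulated stainless steel water bottle",
    "wireless noise-cancelling headphones",
]  # placeholder for the 10M+ product descriptions
item_vecs = model.encode(catalog, normalize_embeddings=True)  # (N, d) float32

# HNSW graph index; with unit-normalised vectors, L2 ranking matches cosine.
index = faiss.IndexHNSWFlat(item_vecs.shape[1], 32)
index.add(item_vecs)

# --- Online: exactly one encoder forward pass, then a fast vector search ---
query_vec = model.encode(["bottle that keeps drinks cold"], normalize_embeddings=True)
distances, ids = index.search(query_vec, 3)  # top-3 nearest neighbours
print([catalog[i] for i in ids[0]])  # the insulated bottle should rank first
```

Note that the query shares almost no tokens with "insulated stainless steel water bottle", which is exactly where semantic matching beats BM25 or edit distance. In production, the encoding and index-building step would run as an offline batch job, with the resulting few-gigabyte vector store held in memory, as the question's constraints allow.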