Your analytics team has split 1 million product-manual documents into 512-character chunks and stored them in the table warehouse.doc_chunks (content STRING, updated_ts TIMESTAMP). To enable similarity retrieval for an upcoming retrieval-augmented generation (RAG) feature, you must add a column that holds Vertex AI text embeddings and keep it synchronized without needlessly recomputing unchanged data. The solution must also let you build an HNSW vector index so that VECTOR_SEARCH queries run efficiently. Which BigQuery SQL approach best meets these requirements?
Run a set-based MERGE that upserts only rows whose updated_ts has changed, computing ML.GENERATE_EMBEDDING in the SELECT, and then create or refresh an HNSW vector index on the new embedding column.
Nightly rebuild the entire table with INSERT OVERWRITE AS SELECT *, ML.GENERATE_EMBEDDING(...) FROM doc_chunks, then swap it into production.
Create a materialized view over doc_chunks that calls ML.GENERATE_EMBEDDING and run VECTOR_SEARCH queries against the view.
Use a BigQuery script that fetches each chunk with a cursor and inserts ML.GENERATE_EMBEDDING results row-by-row into a separate embeddings table.
BigQuery materialized views cache query results, but you cannot create a vector index on a (materialized) view, and ML.GENERATE_EMBEDDING is currently unsupported in materialized-view definitions, so this option will not satisfy the need for indexed similarity search. Rebuilding the entire table each night with INSERT OVERWRITE forces ML.GENERATE_EMBEDDING to run on every row, including the unchanged ones, driving up cost and latency. A procedural row-by-row loop issues one inference call per row, bypassing BigQuery's set-based execution, which makes it slow and expensive. A set-based MERGE that inserts or updates only the rows whose updated_ts has changed computes new embeddings only for new or modified chunks, and it leaves the embeddings in a regular table, so you can create or refresh an HNSW vector index on the ARRAY<FLOAT64> embedding column. This minimizes recomputation while keeping similarity queries fast, making it the most efficient and scalable strategy.
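The change-detection idea behind the MERGE option can be illustrated outside BigQuery. The sketch below is a plain-Python simulation, not BigQuery SQL and not the Vertex AI API: `fake_embed` is a made-up stand-in for a batched embedding call, and `chunk_id` is a hypothetical key column (the table in the question lists only content and updated_ts, so some row identifier is assumed here). It only shows that rows with an unchanged updated_ts are never re-embedded.

```python
from datetime import datetime, timezone


def fake_embed(texts):
    """Hypothetical stand-in for one batched embedding call (not a real API)."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def incremental_sync(source, target, embed=fake_embed):
    """Embed only rows that are new or whose updated_ts differs from the target copy.

    source: dict chunk_id -> (content, updated_ts)
    target: dict chunk_id -> (content, updated_ts, embedding), mutated in place
    Returns the chunk_ids that were (re)embedded, in source order.
    """
    stale = [
        cid
        for cid, (_, ts) in source.items()
        if cid not in target or target[cid][1] != ts
    ]
    # One batched call for all stale rows, mirroring a set-based statement
    # rather than a row-by-row loop.
    vectors = embed([source[cid][0] for cid in stale])
    for cid, vec in zip(stale, vectors):
        content, ts = source[cid]
        target[cid] = (content, ts, vec)
    return stale


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)

source = {1: ("install the filter", t0), 2: ("replace the battery", t0)}
target = {}
first = incremental_sync(source, target)   # both rows are new

source[2] = ("replace the battery pack", t1)   # modified chunk
source[3] = ("reset the device", t1)           # new chunk
second = incremental_sync(source, target)  # row 1 is skipped
```

In this toy run the first sync embeds both rows and the second embeds only chunks 2 and 3. The nightly full rebuild corresponds to passing every row to `embed` on every run, and the cursor approach corresponds to calling `embed` once per row.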
GCP Professional Data Engineer
Preparing and using data for analysis