GCP Professional Data Engineer Practice Question

Your analytics team has split 1 million product-manual documents into 512-character chunks and stored them in the table warehouse.doc_chunks (content STRING, updated_ts TIMESTAMP). To enable similarity retrieval for an upcoming retrieval-augmented generation (RAG) feature, you must add a column that holds Vertex AI text embeddings and keep it synchronized without needlessly recomputing unchanged data. The solution must also let you build an HNSW vector index so that VECTOR_SEARCH queries run efficiently. Which BigQuery SQL approach best meets these requirements?

Run a set-based MERGE that upserts only rows whose updated_ts has changed, computing ML.GENERATE_EMBEDDING in the SELECT, and then create or refresh an HNSW vector index on the new embedding column.
Nightly rebuild the entire table with INSERT OVERWRITE AS SELECT *, ML.GENERATE_EMBEDDING(...) FROM doc_chunks, then swap it into production.
Create a materialized view over doc_chunks that calls ML.GENERATE_EMBEDDING and run VECTOR_SEARCH queries against the view.
Use a BigQuery script that fetches each chunk with a cursor and inserts ML.GENERATE_EMBEDDING results row-by-row into a separate embeddings table.

Answer Description

BigQuery materialized views cache query results, but you cannot create a vector index on a view (materialized or otherwise), and ML.GENERATE_EMBEDDING is currently unsupported in materialized-view definitions, so this option cannot deliver indexed similarity search. Rebuilding the entire table each night with INSERT OVERWRITE forces ML.GENERATE_EMBEDDING to run on every row, including unchanged ones, driving up cost and latency. A procedural row-by-row loop issues one inference call per row, bypassing BigQuery's set-based execution, so it is slow and expensive. A set-based MERGE that inserts or updates only the rows whose updated_ts has changed computes embeddings only for new or modified chunks, and you can then create or refresh an HNSW vector index on the ARRAY<FLOAT64> embedding column. This minimizes recomputation while keeping similarity queries fast, making it the most efficient and scalable of the four options.
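The incremental MERGE pattern described in this explanation could be sketched roughly as follows. This is a minimal illustration under several assumptions that are not in the question: a chunk_id key column on warehouse.doc_chunks, a staging table warehouse.doc_chunks_staging that receives new and changed chunks, and a remote embedding model registered as warehouse.embedding_model are all hypothetical names. The question asks for an HNSW index; the sketch uses index_type = 'IVF' only because that is a value CREATE VECTOR INDEX is known to accept, so check the current documentation for the index types available to you.

```sql
-- Hypothetical names (not in the question): chunk_id, warehouse.doc_chunks_staging,
-- warehouse.embedding_model, doc_chunks_embedding_idx.

-- One-time setup: add the embedding column to the existing table.
ALTER TABLE warehouse.doc_chunks
  ADD COLUMN IF NOT EXISTS embedding ARRAY<FLOAT64>;

-- Incremental sync: embed only staged chunks that are new or newer than
-- the copy already stored in doc_chunks, then upsert them in one statement.
MERGE warehouse.doc_chunks AS t
USING (
  SELECT
    chunk_id,
    content,
    updated_ts,
    ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL warehouse.embedding_model,
    (
      -- The text column passed to the model must be named "content".
      SELECT s.chunk_id, s.content, s.updated_ts
      FROM warehouse.doc_chunks_staging AS s
      LEFT JOIN warehouse.doc_chunks AS d
        ON d.chunk_id = s.chunk_id
      WHERE d.chunk_id IS NULL            -- chunk not seen before
         OR s.updated_ts > d.updated_ts   -- chunk changed since last sync
    ),
    STRUCT(TRUE AS flatten_json_output)   -- return the vector as ARRAY<FLOAT64>
  )
  -- An empty status string means the embedding call succeeded for that row.
  WHERE ml_generate_embedding_status = ''
) AS e
ON t.chunk_id = e.chunk_id
WHEN MATCHED THEN
  UPDATE SET
    content = e.content,
    updated_ts = e.updated_ts,
    embedding = e.embedding
WHEN NOT MATCHED THEN
  INSERT (chunk_id, content, updated_ts, embedding)
  VALUES (e.chunk_id, e.content, e.updated_ts, e.embedding);

-- Build the vector index once; index_type here is an assumption (see lead-in).
CREATE VECTOR INDEX IF NOT EXISTS doc_chunks_embedding_idx
ON warehouse.doc_chunks(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

-- Example lookup: embed a query string and fetch the five nearest chunks.
SELECT base.content, distance
FROM VECTOR_SEARCH(
  TABLE warehouse.doc_chunks,
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL warehouse.embedding_model,
      (SELECT 'How do I reset the device?' AS content),
      STRUCT(TRUE AS flatten_json_output)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
);
```

Rows whose embedding call fails are filtered out of the MERGE source, so they remain eligible on the next run because their stored updated_ts is still older than the staged one. Running the MERGE on a schedule keeps inference cost proportional to the number of changed chunks rather than to the full one million rows.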
