GCP Professional Data Engineer Practice Question

Your analytics team has split 1 million product-manual documents into 512-character chunks and stored them in the table warehouse.doc_chunks (content STRING, updated_ts TIMESTAMP). To enable similarity retrieval for an upcoming retrieval-augmented generation (RAG) feature, you must add a column that holds Vertex AI text embeddings and keep it synchronized without needlessly recomputing unchanged data. The solution must also let you create a vector index on the embedding column so that VECTOR_SEARCH queries run efficiently. Which BigQuery SQL approach best meets these requirements?

  • Run a set-based MERGE that upserts only rows whose updated_ts has changed, computing ML.GENERATE_EMBEDDING in the SELECT, and then create or refresh a vector index on the new embedding column.

  • Rebuild the entire table nightly with CREATE OR REPLACE TABLE ... AS SELECT *, ML.GENERATE_EMBEDDING(...) FROM doc_chunks, then swap the new table into production.

  • Use a BigQuery procedural script that loops over each chunk and inserts the ML.GENERATE_EMBEDDING result row by row into a separate embeddings table.

  • Create a materialized view over doc_chunks that calls ML.GENERATE_EMBEDDING and run VECTOR_SEARCH queries against the view.
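The incremental MERGE approach in the first option can be sketched as below. The column names embedding and embedded_ts, the remote model warehouse.text_embed, and the index name doc_chunks_idx are assumptions for illustration; note that BigQuery vector indexes use the IVF or TreeAH algorithms (not HNSW), and that using content as the merge key only works if chunks are unique.

```sql
-- One-time schema change: columns to hold the vector and the timestamp
-- it was computed for (column names are assumptions).
ALTER TABLE warehouse.doc_chunks
  ADD COLUMN IF NOT EXISTS embedding ARRAY<FLOAT64>,
  ADD COLUMN IF NOT EXISTS embedded_ts TIMESTAMP;

-- Incremental refresh: embed only rows that are new or changed since
-- the last run, then merge the vectors back in.
MERGE warehouse.doc_chunks AS tgt
USING (
  SELECT
    content,
    updated_ts,
    ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `warehouse.text_embed`,   -- assumed remote Vertex AI embedding model
    (SELECT content, updated_ts
     FROM warehouse.doc_chunks
     WHERE embedded_ts IS NULL OR updated_ts > embedded_ts),
    STRUCT(TRUE AS flatten_json_output)
  )
) AS src
ON tgt.content = src.content
WHEN MATCHED THEN UPDATE SET
  tgt.embedding   = src.embedding,
  tgt.embedded_ts = src.updated_ts;

-- Index the embedding column so VECTOR_SEARCH can run an approximate
-- nearest-neighbor scan instead of brute force.
CREATE VECTOR INDEX IF NOT EXISTS doc_chunks_idx
ON warehouse.doc_chunks(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');
```

Because the MERGE source query filters on embedded_ts before calling ML.GENERATE_EMBEDDING, unchanged chunks never reach the model, which is what keeps recomputation (and Vertex AI inference cost) proportional to the volume of changed data.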

GCP Professional Data Engineer
Preparing and using data for analysis