GCP Professional Data Engineer Practice Question

Your team ingests thousands of long policy documents (10-50 kB each) into a BigQuery table policy_raw(doc_id INT64, body STRING). You must build a Retrieval-Augmented Generation (RAG) pipeline that uses the Vertex AI textembedding-gecko model through ML.GENERATE_EMBEDDING.

During a proof of concept you see the error "Input text length exceeds model limit". Product management requires that no content be lost during embedding generation, and that the resulting vector store support both efficient similarity search and reconstruction of the original document order.

Which approach best meets these requirements while staying completely inside BigQuery?

  • Apply a TRANSFORM clause with ML.BUCKETIZE in a CREATE MODEL statement so BigQuery ML automatically divides the text into 2,500-character buckets before calling ML.GENERATE_EMBEDDING.

  • Use BigQuery SQL to split each document's body on paragraph delimiters, aggregate the paragraphs in order into chunks that stay under the model's 2,500-character limit, write one row per (doc_id, chunk_index, chunk_text), and then run ML.GENERATE_EMBEDDING on the resulting table.

  • Pass the full document body to ML.GENERATE_EMBEDDING and rely on the model's automatic truncation; store the single truncated embedding per doc_id for similarity search.

  • Export each document to Cloud Storage, invoke a Cloud Function that calls Vertex AI embeddings on the entire file, and save the vectors in Firestore; create an external table in BigQuery that references the stored vectors for search.
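
The split-and-aggregate step named in the second option can be sketched as follows. This is an illustration in Python rather than BigQuery SQL, so it only shows the packing logic; in the pipeline itself the same idea would be written with SQL constructs such as SPLIT and UNNEST ... WITH OFFSET. The function name chunk_document, the blank-line paragraph delimiter, and the hard split of oversized paragraphs are assumptions made for the example, and the 2,500-character budget is simply the figure quoted in the question.

```python
from typing import List, Tuple

MAX_CHARS = 2500   # chunk budget quoted in the question (assumed to be in characters)
PARA_SEP = "\n\n"  # assumed paragraph delimiter; real documents may differ


def chunk_document(doc_id: int, body: str,
                   max_chars: int = MAX_CHARS) -> List[Tuple[int, int, str]]:
    """Greedily pack paragraphs, in order, into chunks of at most max_chars.

    Returns (doc_id, chunk_index, chunk_text) rows. chunk_index starts at 0
    and increases in document order, so sorting by it restores the sequence.
    """
    # Step 1: split into paragraphs, dropping whitespace-only ones.
    # A paragraph longer than the budget is cut into fixed-size pieces
    # so that none of its text is discarded.
    pieces: List[str] = []
    for para in body.split(PARA_SEP):
        if not para.strip():
            continue
        for start in range(0, len(para), max_chars):
            pieces.append(para[start:start + max_chars])

    # Step 2: append pieces to the current chunk while it still fits;
    # otherwise emit the chunk and start a new one with this piece.
    rows: List[Tuple[int, int, str]] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + PARA_SEP + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            rows.append((doc_id, len(rows), current))
            current = piece
    if current:
        rows.append((doc_id, len(rows), current))
    return rows
```

Each returned row maps onto the (doc_id, chunk_index, chunk_text) layout from the option: every chunk is embedded separately, and ordering by chunk_index within a doc_id recovers the original sequence of chunks.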

Objective: Preparing and using data for analysis