Your team ingests thousands of long policy documents (10-50 kB each) into a BigQuery table policy_raw(doc_id INT64, body STRING). You must build a Retrieval-Augmented Generation (RAG) pipeline that uses the Vertex AI textembedding-gecko model through ML.GENERATE_EMBEDDING.
During a proof of concept you see the error "Input text length exceeds model limit". Product management requires that no content be lost during embedding generation and that the resulting vector store support efficient similarity search and reconstruction of the original document order.
Which approach best meets these requirements while staying completely inside BigQuery?
Apply a TRANSFORM clause with ML.BUCKETIZE in a CREATE MODEL statement so BigQuery ML automatically divides the text into 2,500-character buckets before calling ML.GENERATE_EMBEDDING.
Pass the full document body to ML.GENERATE_EMBEDDING and rely on the model's automatic truncation; store the single truncated embedding per doc_id for similarity search.
Export each document to Cloud Storage, invoke a Cloud Function that calls Vertex AI embeddings on the entire file, and save the vectors in Firestore; create an external table in BigQuery that references the stored vectors for search.
Use BigQuery SQL to split each document's body on paragraph delimiters, aggregate consecutive paragraphs into chunks that stay under the model's 2,500-character limit, write one row per (doc_id, chunk_index, chunk_text), and then run ML.GENERATE_EMBEDDING on the resulting table.
The textembedding-gecko model accepts only a few thousand tokens per call, so a 10-50 kB document will not fit. The standard pattern in BigQuery is to split each long document into smaller, length-bounded chunks, keep positional metadata, and then call ML.GENERATE_EMBEDDING on every chunk. The splitting can be done with SQL string functions alone: SPLIT on paragraph breaks, then group consecutive paragraphs until the 2,500-character limit is reached. Storing each chunk with its doc_id and a chunk_index lets you reassemble the text in its original order and run VECTOR_SEARCH at chunk granularity, so no content is lost.

The other options each fail a requirement. Calling ML.GENERATE_EMBEDDING on the full document still exceeds the model limit, and relying on automatic truncation silently discards content. BigQuery ML's TRANSFORM clause and ML.BUCKETIZE perform numeric feature preprocessing and are unrelated to text chunking. Moving the logic to a Cloud Function with vectors stored in Firestore breaks the "stay completely inside BigQuery" constraint and adds unnecessary complexity.
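A minimal sketch of the chunking step, assuming paragraphs are separated by blank lines; the destination table name mydataset.policy_chunks and the bucketing-by-running-character-count approach are illustrative choices, not the only way to stay under the limit:

```sql
-- Split each document into paragraphs, preserving position, then group
-- consecutive paragraphs into roughly 2,500-character chunks.
CREATE OR REPLACE TABLE mydataset.policy_chunks AS
WITH paragraphs AS (
  SELECT doc_id, para, para_index
  FROM policy_raw,
       UNNEST(SPLIT(body, '\n\n')) AS para WITH OFFSET AS para_index
),
running AS (
  SELECT
    doc_id,
    para,
    para_index,
    -- Cumulative character count per document (+2 for the delimiter).
    SUM(LENGTH(para) + 2) OVER (
      PARTITION BY doc_id ORDER BY para_index
    ) AS running_chars
  FROM paragraphs
)
SELECT
  doc_id,
  DIV(running_chars, 2500) AS chunk_index,
  STRING_AGG(para, '\n\n' ORDER BY para_index) AS chunk_text
FROM running
GROUP BY doc_id, chunk_index;
```

Because the bucketing uses a running total, a chunk can slightly overshoot 2,500 characters when a paragraph straddles a boundary, and any single paragraph longer than the limit would need an extra sub-split step. No text is dropped, and (doc_id, chunk_index) preserves the original document order.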
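And a sketch of the embedding and search steps, assuming a remote model mydataset.gecko_embedding has already been created over the textembedding-gecko endpoint (the dataset, model, and query-string values here are placeholders):

```sql
-- One embedding per chunk; ML.GENERATE_EMBEDDING expects the input
-- text in a column named `content`.
CREATE OR REPLACE TABLE mydataset.policy_embeddings AS
SELECT
  doc_id,
  chunk_index,
  content AS chunk_text,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL mydataset.gecko_embedding,
  (SELECT doc_id, chunk_index, chunk_text AS content
   FROM mydataset.policy_chunks),
  STRUCT(TRUE AS flatten_json_output)
);

-- Chunk-granularity similarity search: embed the query text with the
-- same model, then find the nearest chunk vectors.
SELECT base.doc_id, base.chunk_index, base.chunk_text, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.policy_embeddings, 'embedding',
  (SELECT ml_generate_embedding_result AS embedding
   FROM ML.GENERATE_EMBEDDING(
     MODEL mydataset.gecko_embedding,
     (SELECT 'data retention requirements' AS content),
     STRUCT(TRUE AS flatten_json_output))),
  top_k => 5
);
```

VECTOR_SEARCH returns matches at chunk granularity; because each result row carries doc_id and chunk_index, neighboring chunks can be joined back in order to reconstruct surrounding context for the generation step.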