GCP Professional Data Engineer Practice Question

Your analytics team stores historical customer data in BigQuery and wants to build a churn-prediction logistic-regression model with BigQuery ML. Several continuous columns contain occasional NULLs and strong outliers, while the string column plan_type has five distinct values. The team insists that all preprocessing logic be declared inside the model definition so it is version-controlled together with the model and reruns automatically every time the model is retrained. Which approach best meets these requirements while minimizing operational overhead?

Create the model with a TRANSFORM clause that applies ML.IMPUTER, ML.ROBUST_SCALER, and ML.ONE_HOT_ENCODER to the raw columns inside the CREATE MODEL … OPTIONS(model_type = 'logistic_reg') statement.
Run a Cloud Dataflow pipeline to write a cleaned staging table, then point a CREATE MODEL statement (without TRANSFORM) at the staged data.
Issue CREATE MODEL without any preprocessing and rely on BigQuery ML's automatic handling of missing values and scaling.
Build a materialized view that encodes and scales the features, refresh it automatically, and train the logistic-regression model against the view.


Answer Description

BigQuery ML's TRANSFORM clause lets you declare feature-engineering steps inside the CREATE MODEL statement. ML.IMPUTER fills missing values using a chosen strategy (for example, the median), ML.ROBUST_SCALER scales numerical features using the median and a quantile range so that outliers have less influence, and ML.ONE_HOT_ENCODER converts categorical strings like plan_type into sparse binary vectors. Because the transformations are stored with the model, they are re-evaluated every time the model is trained or retrained and are applied automatically to raw input at evaluation and prediction time. The preprocessing logic therefore stays co-located with, and versioned alongside, the model definition without requiring separate pipelines or manually maintained staging tables.
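
As a rough illustration of the correct option, the statement below is a minimal sketch, not a definitive implementation. The dataset, table, and column names (my_dataset, customer_history, monthly_charges, tenure_months, churned) are hypothetical; only plan_type comes from the question. Each preprocessing function is applied to a different column here; whether and how imputation and scaling can be combined on the same column should be checked against the BigQuery ML documentation.

```sql
-- Hypothetical names: my_dataset, customer_history, monthly_charges,
-- tenure_months, churned. Only plan_type appears in the question.
CREATE OR REPLACE MODEL `my_dataset.churn_model`
TRANSFORM(
  -- Fill NULLs with the column median computed over the training data.
  ML.IMPUTER(monthly_charges, 'median') OVER () AS monthly_charges_imputed,
  -- Scale using the median and a quantile range so outliers weigh less.
  ML.ROBUST_SCALER(tenure_months) OVER () AS tenure_months_scaled,
  -- Encode the five plan_type values as a sparse one-hot feature.
  ML.ONE_HOT_ENCODER(plan_type) OVER () AS plan_type_encoded,
  -- The label must be passed through the TRANSFORM clause unchanged.
  churned
)
OPTIONS(
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT
  monthly_charges,
  tenure_months,
  plan_type,
  churned
FROM `my_dataset.customer_history`;
```

At prediction time the query passes raw columns, and the stored transformations are applied by the model itself (new_customers is again a hypothetical table):

```sql
SELECT *
FROM ML.PREDICT(
  MODEL `my_dataset.churn_model`,
  (SELECT monthly_charges, tenure_months, plan_type
   FROM `my_dataset.new_customers`)
);
```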

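An external Dataflow job or a materialized view could prepare the features, but the transformation code would then live outside the model definition and need separate orchestration, which violates the requirement and adds operational overhead. Relying on BigQuery ML's automatic preprocessing is not enough either: by default it imputes missing numeric values with the mean and standardizes numeric columns, both of which are sensitive to strong outliers, and it applies neither median imputation nor robust scaling. Therefore, embedding the preprocessing functions in the TRANSFORM clause is the most efficient and maintainable solution.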
