GCP Professional Data Engineer Practice Question

Your organization needs to lift-and-shift 7,000 existing batch and streaming PySpark jobs that currently run on an on-premises Apache Hadoop cluster. Leadership insists that the cloud solution remain based on open-source processing engines to minimize vendor lock-in and allow future migration to other clouds. The data engineering team also wants the ability to create short-lived clusters that automatically delete themselves once each job finishes to keep costs low. Which Google Cloud service best fulfills these requirements while requiring little or no refactoring of the existing PySpark code?

Migrate the transformations into BigQuery user-defined functions (UDFs) and schedule them as BigQuery jobs.
Replicate the logic in Cloud Data Fusion visual pipelines, which are executed on managed Dataproc clusters.
Spin up on-demand Cloud Dataproc clusters for each job and enable cluster auto-deletion once the Spark or Hadoop job finishes.
Rewrite the PySpark workloads as Apache Beam pipelines and execute them with Cloud Dataflow Flex Templates.

GCP Professional Data Engineer

Designing data processing systems


Answer Description
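The ephemeral-cluster pattern named in the Dataproc option can be sketched as a Dataproc workflow template with a managed cluster: the service creates the cluster, runs the unmodified PySpark file, and deletes the cluster when the workflow ends. The snippet below is a minimal illustration, not a production configuration. The bucket path, cluster name, step ID, and machine types are hypothetical placeholders, and the template is built as a plain dictionary so it can be inspected without any Google Cloud credentials.

```python
def build_ephemeral_workflow(main_py_uri, cluster_name="ephemeral-pyspark", num_workers=2):
    """Build an inline Dataproc workflow template as a plain dict.

    The managed_cluster placement asks Dataproc to create a cluster for this
    workflow and delete it once the workflow's jobs have finished.
    """
    machine_type = "n1-standard-4"  # illustrative choice, not a recommendation
    return {
        "placement": {
            "managed_cluster": {
                "cluster_name": cluster_name,
                "config": {
                    "master_config": {
                        "num_instances": 1,
                        "machine_type_uri": machine_type,
                    },
                    "worker_config": {
                        "num_instances": num_workers,
                        "machine_type_uri": machine_type,
                    },
                },
            }
        },
        "jobs": [
            {
                "step_id": "run-pyspark",
                # The existing PySpark script is referenced as-is; no rewrite needed.
                "pyspark_job": {"main_python_file_uri": main_py_uri},
            }
        ],
    }


# Hypothetical script location, used only for illustration.
template = build_ephemeral_workflow("gs://example-bucket/jobs/etl.py")

# Submitting the template requires the google-cloud-dataproc package and
# credentials, so it is left as a comment here:
#   from google.cloud import dataproc_v1
#   client = dataproc_v1.WorkflowTemplateServiceClient(...)
#   client.instantiate_inline_workflow_template(
#       request={"parent": "projects/<project>/regions/<region>", "template": template}
#   )
```

Because the job definition only points at the existing script, the same file could later run on any other Spark distribution, which is the portability the question asks for.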


