GCP Professional Cloud Architect Practice Question

Your data-science team is iterating on a 24-layer transformer with billions of parameters. The training corpus is several petabytes stored in a Cloud Storage bucket. The team's priorities are:

Finish each training run in the shortest possible wall-clock time.
Avoid manual cluster provisioning or maintenance.
Run experiments that need between 256 and 1,024 hardware accelerators.
Use infrastructure that offers very high bandwidth between accelerator chips and fast access to Cloud Storage.

Which approach best meets these requirements?

Launch Compute Engine C3 virtual machines with PCIe-attached NVIDIA H100 GPUs and orchestrate training with a custom script.
Create a GKE Autopilot cluster with A2 Ultra GPU nodes and manage distributed training with Kubeflow operators.
Submit a Vertex AI custom training job that requests a TPU v4 Pod slice, allowing Vertex AI to provision and tear down the slice automatically for each run.
Run distributed TensorFlow on Cloud Run services backed by preemptible CPU instances that access data via Cloud Storage FUSE.

Answer Description

Submitting a Vertex AI custom training job that requests an appropriately sized TPU v4 Pod slice runs the workload on Google's AI Hypercomputer infrastructure. Vertex AI provisions the slice (for example, 256 or 512 TPU v4 chips) when the job starts and releases it when the run finishes, so the team never provisions or maintains a cluster. Although a slice does not provide the full 1.2 TB/s bidirectional mesh of a complete Pod, its chips are still linked by the dedicated TPU inter-chip interconnect, so it delivers substantially higher inter-chip and storage bandwidth than the GPU-based alternatives listed. Slices also scale across the required range of 256 to 1,024 accelerators, which gives the fastest time-to-train with minimal operational effort.
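
As a rough illustration of how a slice size might map to a job request, the sketch below builds a worker-pool spec as a plain dictionary. It assumes the `ct4p-hightpu-4t` machine type and `tpu_topology` field that Vertex AI documents for TPU v4, a single replica per slice, and that a v4 topology string such as `4x8x8` denotes a chip count equal to the product of its dimensions. The function names, the topology table, and the container image URI are hypothetical, and nothing here calls the Vertex AI API.

```python
from math import prod

# Assumed TPU v4 topologies for the chip counts named in the question
# (chips = product of the three dimensions). Check current docs before use.
TOPOLOGY_BY_CHIPS = {
    256: "4x8x8",
    512: "8x8x8",
    1024: "8x8x16",
}


def chips_in_topology(topology: str) -> int:
    """Return the chip count implied by a topology string like '4x8x8'."""
    return prod(int(dim) for dim in topology.split("x"))


def build_worker_pool_spec(num_chips: int, image_uri: str) -> dict:
    """Build a worker-pool spec dict for a TPU v4 slice (illustrative only)."""
    if num_chips not in TOPOLOGY_BY_CHIPS:
        raise ValueError(f"no topology listed for {num_chips} chips")
    return {
        "machine_spec": {
            "machine_type": "ct4p-hightpu-4t",  # assumed TPU v4 machine type
            "tpu_topology": TOPOLOGY_BY_CHIPS[num_chips],
        },
        "replica_count": 1,  # assumption: one replica describes the whole slice
        "container_spec": {"image_uri": image_uri},  # hypothetical image
    }


if __name__ == "__main__":
    spec = build_worker_pool_spec(512, "us-docker.pkg.dev/example/train:latest")
    print(spec["machine_spec"])
```

A dictionary of this shape could then be passed in the `worker_pool_specs` list of `aiplatform.CustomJob` from the `google-cloud-aiplatform` SDK, which is where Vertex AI would take over provisioning and teardown for each run.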

GKE Autopilot with A2 Ultra GPUs abstracts node operations but still requires the team to configure and manage a distributed training framework such as the Kubeflow operators. NVSwitch connects GPUs only within a single node, so scaling to hundreds of GPUs depends on the network between nodes rather than on a dedicated chip-to-chip fabric like the TPU interconnect. Compute Engine C3 VMs with PCIe-attached H100 GPUs lack the dedicated mesh fabric and would need significant manual orchestration through the custom script. Cloud Run services backed by preemptible CPU instances provide none of the required accelerators and would be orders of magnitude slower.