A machine learning engineering team is migrating its model training environment off of a single, high-memory on-premises server. Their primary challenge is training a complex deep learning model on a 10TB dataset, which requires both data and model parallelism. The team needs a solution that provides high availability, dynamic scaling of compute resources, and efficient distributed processing, while also minimizing the manual effort required for resource management and fault recovery. Which cluster deployment strategy best meets all of these criteria?
Deploy the training workload using a distributed computing framework like Apache Spark or a TensorFlow distribution strategy, running on a containerized cluster managed by Kubernetes.
Provision a cluster of virtual machines and use a shared network file system (NFS) to allow each node to access the data, coordinating the process with custom shell scripts.
Deploy the entire 10TB dataset and training script into a single, massive container image and rely on a container orchestrator to restart the container on another node upon failure.
Implement a high-availability database cluster and perform model training inside the database using distributed stored procedures.
The correct option is to deploy the training workload using a distributed computing framework such as Apache Spark, or TensorFlow with a distribution strategy, running on a containerized cluster managed by Kubernetes. This approach directly addresses every requirement. Kubernetes provides robust cluster management: high availability, dynamic scaling of compute resources, and automated fault recovery (for example, rescheduling failed containers onto healthy nodes). A distributed computing framework such as Apache Spark, or a library with built-in distribution strategies such as TensorFlow, is essential for parallelizing the computation and for handling a dataset far too large for any single machine.

The other options are less suitable. Coordinating a cluster of VMs with custom shell scripts over NFS introduces significant manual overhead for resource management and fault recovery, which contradicts a key requirement. Packing the entire 10TB dataset and training script into a single massive container fails to solve the core problem: one container is still bounded by the resources of the node it runs on and cannot perform distributed processing across the cluster, and restarting it on another node recovers nothing about training progress. Training a complex deep learning model inside a database with distributed stored procedures is generally far less efficient than specialized frameworks for iterative, computationally intensive workloads.
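To make the recommended approach concrete, here is a minimal sketch of synchronous data-parallel training with TensorFlow's MultiWorkerMirroredStrategy, assuming a recent TensorFlow 2.x release. The toy in-memory dataset, layer sizes, and /tmp/backup checkpoint directory are placeholders, not part of the question. In a real Kubernetes deployment, each worker pod would receive a TF_CONFIG environment variable describing the cluster (typically injected by a Job manifest or a training operator), and the strategy reads it to coordinate the workers; with no TF_CONFIG set, the same script falls back to single-worker execution.

```python
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the TF_CONFIG environment variable
# (which a Kubernetes Job or training operator would inject into each
# pod) and mirrors the model across all workers, synchronizing gradients
# with an all-reduce. Without TF_CONFIG it runs as a single worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Placeholder data; a real 10TB job would stream sharded record files
# with tf.data rather than materializing tensors in memory.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 20]), tf.random.normal([1024, 1]))
).batch(64).repeat()

with strategy.scope():
    # Variables created inside the scope are replicated on every worker.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# BackupAndRestore checkpoints training state so that when Kubernetes
# restarts a failed worker pod, the job resumes instead of starting over.
model.fit(
    dataset,
    epochs=2,
    steps_per_epoch=50,
    callbacks=[tf.keras.callbacks.BackupAndRestore(backup_dir="/tmp/backup")],
)
```

Note that this sketch covers only the data-parallel half of the requirement; model parallelism for a model too large for one device would require an additional sharding approach on top of it.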