AWS Certified Solutions Architect Associate SAA-C03 Practice Question
A company stores 50 TB of raw log data in Amazon S3. Each night, the data engineering team must run Apache Spark transformations that filter, aggregate, and join the data. The job must finish within 2 hours, and the team wants to choose specific Amazon EC2 instance types while allowing the cluster to scale automatically to balance cost and performance. Which AWS service or feature should the team use to meet these requirements?
A. AWS Glue ETL job using G.2X workers
B. A single c5n.18xlarge Amazon EC2 instance running Spark in standalone mode
C. AWS Lambda functions coordinated with AWS Step Functions
D. Amazon EMR on Amazon EC2 with EMR Managed Scaling enabled (correct)
Amazon EMR on Amazon EC2 lets the team launch a dedicated Spark cluster on the exact EC2 instance types they choose. With EMR Managed Scaling (or custom automatic scaling policies) enabled, EMR adds or removes instances during the run, so the job can meet the 2-hour SLA while minimizing cost. AWS Glue is serverless and scalable, but it abstracts away the underlying infrastructure: workers come only in fixed sizes such as G.2X, with no choice of EC2 instance type or cluster-level tuning. AWS Lambda's 15-minute execution limit rules it out for long-running Spark transformations, and a single EC2 host, however large, cannot process a 50-TB Spark workload within the required window.
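To make the combination of "choose specific instance types" and "scale automatically" concrete, here is a minimal sketch of a cluster request shaped like the payload that boto3's `emr` client accepts via `run_job_flow(**request)`. The cluster name, instance types, counts, release label, and capacity limits are illustrative assumptions, not values given in the question.

```python
# Sketch of an EMR-on-EC2 cluster request with Managed Scaling enabled.
# A real run would pass this dict to boto3.client("emr").run_job_flow(**request);
# it is built as a plain dict here so the sketch runs without AWS credentials.
# All names, instance types, counts, and limits below are illustrative assumptions.

def build_emr_request():
    return {
        "Name": "nightly-log-transform",       # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",          # an EMR release that bundles Spark
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    # The team picks specific EC2 instance types here --
                    # the control that AWS Glue does not expose.
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "InstanceRole": "CORE",
                    "InstanceType": "r5.4xlarge",  # memory-heavy joins/aggregations
                    "InstanceCount": 4,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the job finishes
        },
        # EMR Managed Scaling: EMR resizes the cluster between these limits
        # during the run to balance cost against the 2-hour deadline.
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 5,    # master + baseline core nodes
                "MaximumCapacityUnits": 40,   # ceiling EMR may scale up to
            }
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EMR instance profile
        "ServiceRole": "EMR_DefaultRole",
    }

request = build_emr_request()
print(request["ManagedScalingPolicy"]["ComputeLimits"]["MaximumCapacityUnits"])  # → 40
```

The key contrast with the other options is the `ManagedScalingPolicy` block next to explicit `InstanceType` fields: scaling stays automatic while the hardware choice stays with the team.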
References:
Amazon EMR distributes large data sets across many compute nodes and allows you to resize clusters.
AWS Glue is a serverless data-integration service that hides underlying infrastructure choices.