AWS Certified Data Engineer Associate Practice Test (DEA-C01)
Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose from 5 to 100 questions and set a time limit.

AWS Certified Data Engineer Associate DEA-C01 Information
The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.
The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.
Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
- 20 Questions
- Unlimited
- Data Ingestion and Transformation, Data Store Management, Data Operations and Support, Data Security and Governance
Free Preview
This test is a free preview; no account required.
Subscribe to unlock all content, keep track of your scores, and access AI features!
Your ecommerce company stores daily order data as Parquet files in Amazon S3 under the prefix s3://sales-data/orders/year=YYYY/month=MM/day=DD/. A Lambda function, triggered every 15 minutes by Amazon EventBridge, submits Amazon Athena queries that must include the most recent files as soon as they arrive. The team wants to minimize query latency and eliminate the operational cost of running AWS Glue crawlers or the MSCK REPAIR TABLE command after each file delivery. Which approach best meets these requirements?
Modify the Lambda function to run the statement MSCK REPAIR TABLE orders before every query submission to refresh partition metadata.
Enable partition projection on the Athena table and specify the year, month, and day ranges; keep the partition columns in the WHERE clause of each query.
Create a new unpartitioned table with a CREATE TABLE AS SELECT (CTAS) statement and query the consolidated data instead of the partitioned source.
Schedule an AWS Glue crawler to run every 15 minutes so that new partitions are added to the Data Catalog before each query executes.
Answer Description
Partition projection lets Athena derive partition values from object paths at run time, so Athena does not need to store a separate metadata row for every partition. By enabling partition projection on the orders table and defining the year, month, and day ranges, new partitions become instantly queryable without invoking Glue crawlers or MSCK REPAIR TABLE. This reduces latency and catalog overhead. Scheduling a Glue crawler or running MSCK REPAIR TABLE would achieve correctness but adds cost and delay. Using an unpartitioned CTAS table removes the need for partition updates but would scan far more data and raise query costs.
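For illustration, partition projection is enabled purely through table properties. The sketch below sets them via the Athena API; the database name, results bucket, and year range are assumptions, while the storage template mirrors the prefix layout from the scenario.

    import boto3

    athena = boto3.client("athena")

    # Hypothetical database and results bucket; ranges are illustrative.
    ddl = """
    ALTER TABLE orders SET TBLPROPERTIES (
      'projection.enabled'='true',
      'projection.year.type'='integer',  'projection.year.range'='2015,2035',
      'projection.month.type'='integer', 'projection.month.range'='1,12', 'projection.month.digits'='2',
      'projection.day.type'='integer',   'projection.day.range'='1,31',  'projection.day.digits'='2',
      'storage.location.template'='s3://sales-data/orders/year=${year}/month=${month}/day=${day}/'
    )
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "sales"},                        # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )

After this change, the Lambda function can keep submitting queries with year/month/day predicates and no catalog maintenance is needed.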
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is partition projection in Athena?
Why does enabling partition projection reduce latency?
How does partition projection compare to using AWS Glue crawlers?
An e-commerce company runs a MySQL 8.0 database on a Single-AZ db.m5.large Amazon RDS instance. The workload peaks at about 300 writes/sec and 3,000 read queries/sec during sales. Management wants to improve read performance and availability while controlling cost and making as few application changes as possible. Which solution meets these requirements?
Create two Amazon RDS MySQL read replicas in different Availability Zones and route read queries to the replicas.
Migrate the database to Amazon Aurora MySQL Serverless v2 and use two Aurora Replicas.
Move frequently read tables to Amazon ElastiCache for Redis and switch the database storage to gp3 volumes.
Enable a Multi-AZ deployment and upgrade the primary instance to db.m6i.2xlarge.
Answer Description
Amazon RDS MySQL read replicas use asynchronous replication to offload read traffic from the primary instance. Placing replicas in separate AZs increases availability for reads, and the application can begin directing SELECT traffic to the replicas with minimal code changes. Multi-AZ deployments improve failover resilience but keep all reads on the primary, so they do not relieve read pressure. Migrating to Aurora or adding ElastiCache introduces higher cost and greater change complexity compared with simply adding read replicas.
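A minimal sketch of adding the replicas with boto3; the instance identifiers and Availability Zones are hypothetical.

    import boto3

    rds = boto3.client("rds")

    # Create two read replicas of a hypothetical primary in different AZs.
    for replica_id, az in [("orders-replica-1", "us-east-1a"), ("orders-replica-2", "us-east-1b")]:
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier="orders-primary",  # hypothetical primary instance ID
            DBInstanceClass="db.m5.large",
            AvailabilityZone=az,
        )

The application then only needs a second, read-only connection string (or a reader-aware driver/proxy) pointed at the replica endpoints.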
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the purpose of Amazon RDS MySQL read replicas?
How does asynchronous replication work in RDS MySQL read replicas?
Why is Multi-AZ deployment not ideal for reducing read traffic?
A gaming company captures real-time session events from Amazon Kinesis Data Streams. The backend must persist each player's most recent 24-hour session data, handle unpredictable spikes to millions of writes per second, and return player records in single-digit milliseconds by primary key. Operations wants a fully managed, auto-scaling or serverless solution with built-in TTL so stale data is deleted automatically. Which AWS data store best meets these requirements?
Amazon S3 bucket storing JSON objects queried through Amazon Athena and S3 Lifecycle rules
Amazon Redshift streaming ingestion into an RA3 cluster with automatic table sort keys
Amazon DynamoDB table with on-demand capacity and TTL enabled
Amazon Aurora MySQL Serverless v2 cluster with auto-scaling read/write endpoints
Answer Description
Amazon DynamoDB with on-demand capacity mode automatically scales to millions of requests per second without capacity planning. It consistently delivers single-digit millisecond latency for key-value access and supports a TTL attribute that automatically removes items after a specified timestamp, satisfying the 24-hour retention goal. Aurora MySQL Serverless cannot sustain millisecond retrieval under extreme write bursts and introduces connection and cold-start latency. Amazon S3 with Athena provides virtually unlimited storage but has query latencies in seconds and no native per-item TTL. Amazon Redshift is optimized for analytic, not per-row, millisecond queries and requires manual cleanup or VACUUM operations. Therefore, DynamoDB is the most suitable choice.
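As a rough sketch, the table, TTL attribute, and item below are hypothetical; the key points are the PAY_PER_REQUEST billing mode and the epoch-seconds expires_at attribute that TTL deletes automatically.

    import time
    import boto3

    dynamodb = boto3.client("dynamodb")

    # Hypothetical table; on-demand mode removes capacity planning.
    dynamodb.create_table(
        TableName="player-sessions",
        AttributeDefinitions=[{"AttributeName": "player_id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "player_id", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )
    dynamodb.get_waiter("table_exists").wait(TableName="player-sessions")

    # Enable TTL on an epoch-seconds attribute; items expire automatically after that time.
    dynamodb.update_time_to_live(
        TableName="player-sessions",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # Write a record that DynamoDB will delete roughly 24 hours from now.
    dynamodb.put_item(
        TableName="player-sessions",
        Item={
            "player_id": {"S": "player-123"},
            "expires_at": {"N": str(int(time.time()) + 24 * 3600)},
        },
    )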
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is AWS DynamoDB's TTL feature?
How does DynamoDB handle millions of writes per second?
Why is Amazon Aurora MySQL Serverless not ideal for this use case?
An analytics team receives hourly CSV files from external vendors. When a file lands in an S3 bucket, it must be validated, transformed with AWS Glue, and loaded into Amazon Redshift. The solution must be serverless, event-driven, include retry logic, and minimize operational overhead. Which architecture best meets these requirements?
Create a CloudWatch Events scheduled rule that runs every 5 minutes and invokes a Lambda function. The function lists recently added objects, kicks off an AWS Batch job to transform the data, and then loads the results into Redshift.
Deploy Apache Airflow on an EC2 Auto Scaling group and build a DAG that polls the S3 bucket every minute, then starts a Glue job and a Redshift COPY task.
Set up Kinesis Data Firehose with the S3 bucket as the data source, enable transformation with a Lambda function, and configure the delivery stream to load directly into Amazon Redshift.
Configure an S3 Event Notification to deliver ObjectCreated events to EventBridge, which triggers a Step Functions state machine. The state machine runs a Glue job for transformation, then uses the Redshift Data API to issue a COPY command. Step Functions built-in retries handle transient failures.
Answer Description
Sending S3 ObjectCreated events to EventBridge and using the event to start a Step Functions state machine provides a fully serverless, event-driven workflow. Step Functions can invoke a Glue job, wait for completion, and then use the Redshift Data API to run a COPY command. Built-in retry and error-handling policies satisfy the resiliency requirement with no servers to manage.
Airflow on EC2 introduces cluster management and relies on polling, not events. A CloudWatch scheduled rule is time-based rather than event-driven and requires custom logic for retries. Kinesis Data Firehose cannot accept S3 objects as a source, so it cannot react to new S3 files. Therefore, only the Step Functions approach aligns with the constraints.
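One possible shape of the state machine definition, written here as a Python dict for readability; the Glue job name, cluster, secret, and COPY statement are assumptions, not values from the scenario.

    import json

    # Sketch of the Amazon States Language definition: run the Glue job synchronously,
    # then issue a COPY through the Redshift Data API, with retries on both tasks.
    definition = {
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "validate-and-transform-csv"},  # hypothetical job
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
                           "MaxAttempts": 3, "BackoffRate": 2.0}],
                "Next": "CopyIntoRedshift",
            },
            "CopyIntoRedshift": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
                "Parameters": {
                    "ClusterIdentifier": "analytics-cluster",  # hypothetical cluster
                    "Database": "dev",
                    "SecretArn": "arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-etl",
                    "Sql": "COPY staging.vendor_files FROM 's3://vendor-drops/' IAM_ROLE DEFAULT FORMAT AS CSV;",
                },
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 3}],
                "End": True,
            },
        },
    }

    print(json.dumps(definition, indent=2))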
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is EventBridge and how does it help in serverless workflows?
How does Step Functions handle retries and error handling in workflows?
What is the role of the Redshift Data API in this architecture?
A data engineering team runs a persistent Amazon EMR cluster that stores intermediate data in HDFS. Each night, about 50 TB of gzip log files arrive in an Amazon S3 bucket and must be copied into HDFS before downstream MapReduce jobs start. The transfer must maximize throughput, minimize S3 request costs, and run by using only the existing EMR cluster resources. Which solution meets these requirements?
Mount the S3 bucket on every core node with s3fs and move the objects to HDFS with the Linux cp command.
Use AWS DataSync to transfer the objects to volumes on each core node, then import the data into HDFS.
Add an EMR step that uses S3DistCp to copy the objects from Amazon S3 to HDFS in parallel.
From the master node, run the AWS CLI command "aws s3 cp --recursive" to copy the objects into HDFS.
Answer Description
S3DistCp is an Amazon EMR utility built on Apache DistCp that runs as a step on the cluster. It launches multiple mapper tasks that copy objects in parallel, optionally combines small files, and uses the cluster's aggregate network bandwidth instead of a single node. This approach delivers the highest throughput while reducing the number of GET requests. Running aws s3 cp from the master node funnels all traffic through a single node and the CLI cannot write directly into HDFS, DataSync adds an external service and also cannot write into HDFS, and mounting the bucket with s3fs provides no parallelism and incurs high per-object overhead.
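A sketch of submitting the copy as an EMR step with boto3; the cluster ID and S3/HDFS paths are placeholders.

    import boto3

    emr = boto3.client("emr")

    # Add an S3DistCp step to the existing cluster (cluster ID and paths are hypothetical).
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "Copy nightly logs from S3 to HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://nightly-log-bucket/2024-06-01/",
                    "--dest", "hdfs:///data/logs/2024-06-01/",
                    # Options such as --groupBy / --targetSize can also combine small files.
                ],
            },
        }],
    )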
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is S3DistCp, and how does it optimize data transfer to HDFS?
Why is running 'aws s3 cp --recursive' from the master node not ideal for this scenario?
How does S3DistCp reduce S3 request costs compared to other methods like mounting S3 with s3fs?
A retailer stores clickstream data as Parquet files in Amazon S3. Analysts query the data with Amazon Athena several times a day, and weekly batch jobs update or delete late-arriving records. The company uses AWS Lake Formation and must enforce row-level security while supporting ACID transactions with the least administration. Which approach meets these requirements?
Load the data into an Amazon Redshift cluster and share secure views through Lake Formation for row-level access.
Convert the dataset to a Lake Formation governed table and use LF tag-based policies to grant analysts SELECT access with row filters.
Enable object-level ACLs on the S3 bucket and restrict rows by forcing analysts to use Athena views containing WHERE clauses.
Create an external table in the AWS Glue Data Catalog and control access only with S3 bucket policies and Athena workgroup-level data filters.
Answer Description
Lake Formation governed tables add ACID compliance, automatic compaction, and concurrency controls on S3 objects. When a dataset is registered as a governed table, Lake Formation can apply fine-grained permissions, including row and column filters, through LF tag-based access control. Analysts receive SELECT on the governed table, and update jobs can use the transactional writes that governed tables provide. Merely using bucket policies, Athena workgroups, or object ACLs does not give row-level security or ACID semantics. Importing the data into Amazon Redshift is unnecessary overhead, and Redshift sharing via Lake Formation does not cover S3 data updates.
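One way to express the row-level restriction is a Lake Formation data cells filter, which is then granted to the analysts (LF-Tag grants can layer on top). The sketch below uses hypothetical account, database, table, role, and filter names.

    import boto3

    lf = boto3.client("lakeformation")

    # Define a row-level filter on the governed table (names and expression are hypothetical).
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": "111122223333",
            "DatabaseName": "clickstream_db",
            "TableName": "events_governed",
            "Name": "us_rows_only",
            "RowFilter": {"FilterExpression": "country = 'US'"},
            "ColumnWildcard": {},  # all columns; only rows are restricted
        }
    )

    # Grant analysts SELECT through the filter rather than on the raw table.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
        Resource={
            "DataCellsFilter": {
                "TableCatalogId": "111122223333",
                "DatabaseName": "clickstream_db",
                "TableName": "events_governed",
                "Name": "us_rows_only",
            }
        },
        Permissions=["SELECT"],
    )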
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are Lake Formation governed tables?
How do LF tag-based permissions work?
Why are ACID transactions important for S3 datasets?
An ecommerce company uses an Amazon Redshift RA3 cluster. A BI query joins two 200-GB Redshift tables with an Aurora PostgreSQL orders table through a federated query. Grafana runs the query every minute, causing 10-second latency and high Aurora CPU. Data may be 5 minutes old, and the team wants the lowest ongoing cost. What should the data engineer do?
Create a materialized view that joins the Redshift and federated tables, and schedule REFRESH MATERIALIZED VIEW every 5 minutes with Amazon EventBridge. Point the dashboard to the materialized view.
Unload the two Redshift tables to Amazon S3, create external tables, and use Redshift Spectrum to join them with the federated orders table.
Use AWS DMS and COPY to load the orders table into Redshift every 5 minutes, then keep the dashboard query unchanged.
Replace the query with a standard Redshift view and rely on the query result cache for most dashboard requests.
Answer Description
Creating a materialized view inside Redshift stores the pre-computed join result locally. Because materialized views that reference federated tables can't use auto-refresh, schedule REFRESH MATERIALIZED VIEW every 5 minutes (for example, with an EventBridge rule or Redshift Scheduler). The dashboard then queries the materialized view, delivering sub-second latency while reducing Aurora workload to one refresh per 5-minute window. Standard views or the result cache would still hit Aurora each minute, COPY/DMS loads replicate data unnecessarily, and a Spectrum approach adds S3 costs without reducing Aurora load.
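A sketch of the refresh target, written as a Lambda handler that an EventBridge rate(5 minutes) schedule invokes; the cluster, database, secret, and view names are hypothetical.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Invoked every 5 minutes by an EventBridge schedule; refreshes the pre-joined view.
    def handler(event, context):
        redshift_data.execute_statement(
            ClusterIdentifier="bi-cluster",      # hypothetical RA3 cluster
            Database="analytics",
            SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-bi",
            Sql="REFRESH MATERIALIZED VIEW mv_orders_dashboard;",
        )

Grafana then queries mv_orders_dashboard, so Aurora is touched only once per refresh window instead of once per dashboard load.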
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a materialized view in Redshift?
How does Amazon EventBridge help with REFRESH MATERIALIZED VIEW?
Why is REFRESH MATERIALIZED VIEW better than relying on query result caching?
An AWS Glue crawler registers daily Parquet files stored under the Amazon S3 prefix s3://datalake/iot/year=YYYY/month=MM/day=DD/. Business analysts query the table from Amazon Athena, but the current day's data is not visible until the crawler's nightly run. As a data engineer, how can you expose new partitions to Athena within minutes of arrival while keeping operational effort low?
Replace the crawler with Athena partition projection and define formulas that generate the year, month, and day partitions.
Trigger an AWS Step Functions workflow from CloudWatch Events that calls ALTER TABLE ADD PARTITION for each new file detected.
Change the crawler to run every five minutes on a fixed schedule.
Enable Amazon S3 event notifications to invoke the crawler in incremental mode whenever new objects are created.
Answer Description
An Amazon S3 event-driven, incremental crawler updates only the partitions that correspond to the newly arrived objects. When an object is written to a monitored prefix, S3 publishes an event notification (delivered through an Amazon SQS queue that the crawler monitors), so the crawler knows exactly which objects changed. The crawler quickly adds the new year/month/day partition to the AWS Glue Data Catalog, making the data immediately queryable from Athena. A frequent schedule works but still polls on a fixed interval. Partition projection removes the dependency on the Data Catalog but requires additional table-property configuration and is not necessary when a catalog is already in use. A custom Lambda or Step Functions workflow achieves the same goal but adds more code and maintenance overhead compared with the built-in event-driven crawler capability.
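A sketch of creating an event-mode crawler with boto3; the crawler name, role, database, and queue ARN are hypothetical, and the SQS queue is the one that receives the S3 ObjectCreated notifications.

    import boto3

    glue = boto3.client("glue")

    # Crawler that consumes S3 event notifications from an SQS queue instead of
    # re-scanning the whole prefix on every run.
    glue.create_crawler(
        Name="iot-incremental-crawler",
        Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",   # hypothetical role
        DatabaseName="datalake",
        Targets={
            "S3Targets": [{
                "Path": "s3://datalake/iot/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:111122223333:iot-object-created",
            }]
        },
        RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
        SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    )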
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Amazon S3 event notification?
What is an AWS Glue incremental crawler?
How does partition projection work in Athena?
A retail company captures clickstream events in an Amazon Kinesis Data Stream. Business analysts need the events to be queryable in Amazon Redshift within one minute of being produced. The data engineering team wants the simplest solution that avoids intermediate storage and minimizes ongoing maintenance. Which approach best meets these requirements?
Trigger an AWS Lambda function from the Kinesis Data Stream to batch records and insert them into Redshift via the Data API.
Create a materialized view in Amazon Redshift that performs streaming ingestion from the Kinesis Data Stream and enables AUTO REFRESH.
Configure an Amazon Kinesis Data Firehose delivery stream to load the data into Amazon Redshift on a 1-minute buffer interval.
Build an AWS Glue streaming ETL job that reads from the Kinesis Data Stream and writes the records to Redshift through a JDBC connection.
Answer Description
Amazon Redshift supports native streaming ingestion from Amazon Kinesis Data Streams. By creating a materialized view that directly references the stream, Redshift continuously and automatically pulls new records without staging them in Amazon S3 or relying on additional services. The view can be set to AUTO REFRESH so data becomes queryable within seconds, satisfying the one-minute SLA with very little operational overhead.
Sending the stream through Kinesis Data Firehose or Glue adds an extra service and writes first to Amazon S3, increasing latency and management effort. A Lambda function that batches INSERT statements introduces custom code, scaling challenges, and higher maintenance compared with the managed Redshift feature.
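A sketch of the two statements involved, submitted here through the Redshift Data API; the IAM role, cluster, secret, and stream names are hypothetical.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # 1) External schema mapped to Kinesis, 2) streaming materialized view with AUTO REFRESH.
    statements = [
        """
        CREATE EXTERNAL SCHEMA kinesis_src
        FROM KINESIS
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftStreamingRole';
        """,
        """
        CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
        SELECT approximate_arrival_timestamp,
               JSON_PARSE(kinesis_data) AS event_payload
        FROM kinesis_src."clickstream-events";
        """,
    ]

    for sql in statements:
        redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",   # hypothetical cluster
            Database="dev",
            SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-etl",
            Sql=sql,
        )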
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a materialized view in Amazon Redshift?
How does Amazon Redshift support streaming ingestion from Kinesis Data Streams?
Why is AUTO REFRESH important for streaming data in Amazon Redshift?
Your organization uses AWS Lake Formation to govern a raw data lake in Amazon S3. You registered the s3://finance-raw bucket and cataloged the transactions table in the finance database. Analysts already have Lake Formation SELECT on the table, yet Athena returns "Access Denied - insufficient Lake Formation permissions." Which additional Lake Formation permission will resolve the error without granting broader S3 or IAM access?
Grant Lake Formation DESCRIBE permission on the default database.
Give the IAM role Lake Formation ALTER permission on the transactions table.
Attach an IAM policy that allows s3:GetObject on the finance-raw bucket.
Grant Lake Formation DATA_LOCATION_ACCESS on the s3://finance-raw location.
Answer Description
For Athena to run a query, Lake Formation must be able to read metadata in the Glue Data Catalog. The service looks in the default database, so the querying principal needs at least DESCRIBE on that database. Without it, Lake Formation blocks the request and Athena reports "Access Denied." Granting DESCRIBE on the default database satisfies the metadata check; DATA_LOCATION_ACCESS is only required for creating resources, and adding direct S3 permissions or ALTER on the table would bypass governance or still fail the metadata lookup.
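A minimal sketch of the grant with boto3; the analyst role ARN is hypothetical.

    import boto3

    lf = boto3.client("lakeformation")

    # Grant DESCRIBE on the default database to the querying principal.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
        Resource={"Database": {"Name": "default"}},
        Permissions=["DESCRIBE"],
    )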
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What does Lake Formation DESCRIBE permission allow?
Why does Athena require DESCRIBE on the default database to query Lake Formation tables?
What is the difference between DATA_LOCATION_ACCESS and DESCRIBE permissions in Lake Formation?
A company's Amazon Redshift RA3 cluster hosts a 5-TB fact table that receives new rows each night. Business analysts issue the same complex aggregation query every morning to populate dashboards, but the query still takes about 40 minutes even after regular VACUUM and ANALYZE operations. As the data engineer, you must cut the runtime dramatically, keep administration effort low, and avoid a large cost increase. Which approach will best meet these requirements?
Increase the WLM queue's slot count and enable short query acceleration to allocate more memory to the query.
Enable Amazon Redshift Concurrency Scaling so the query can execute on additional transient clusters.
Create a materialized view that pre-aggregates the required data, schedule an automatic REFRESH after the nightly load, and direct the dashboard to query the materialized view.
Change the fact table's distribution style to ALL so every node stores a full copy, eliminating data shuffling during joins.
Answer Description
Creating a materialized view lets Amazon Redshift store the pre-computed, aggregated result set on disk. When analysts query the materialized view, Redshift returns the stored result almost immediately instead of re-scanning and joining the 5-TB fact table, yielding a large runtime reduction. Scheduling an automatic refresh immediately after the nightly data load maintains accuracy while requiring minimal ongoing management.
Changing the fact table to an ALL distribution style would duplicate terabytes of data across every node, greatly increasing storage space and load time. Concurrency scaling adds transient clusters to improve throughput when many queries run simultaneously, but it seldom reduces the elapsed time of a single long query. Adjusting WLM queues or enabling short query acceleration allocates resources differently but will not eliminate the heavy table scan and aggregation work that dominates the query's runtime.
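A sketch of the pre-aggregated view (cluster, secret, table, and column names are hypothetical); the REFRESH would be scheduled to run right after the nightly load.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # One-time creation of the pre-aggregated materialized view.
    redshift_data.execute_statement(
        ClusterIdentifier="bi-cluster",      # hypothetical cluster
        Database="analytics",
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-bi",
        Sql="""
            CREATE MATERIALIZED VIEW mv_daily_sales AS
            SELECT sale_date, region, SUM(net_amount) AS total_sales, COUNT(*) AS order_count
            FROM fact_sales
            GROUP BY sale_date, region;
        """,
    )

    # After each nightly load, a scheduled query runs:
    #   REFRESH MATERIALIZED VIEW mv_daily_sales;
    # and the dashboard selects from mv_daily_sales instead of the 5-TB fact table.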
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a materialized view in Amazon Redshift?
How does automatic REFRESH in materialized views work?
Why is ALL distribution style not suitable for large fact tables?
A CloudFormation template will deploy an AWS Glue job that runs in a private subnet. The job only needs to read objects from the S3 bucket named analytics-data. Security insists the template: 1) follows the principle of least privilege and 2) keeps the IAM role definition concise by avoiding a long inline policy block within the role. Which CloudFormation approach best meets these requirements?
Define an AWS::IAM::Role and attach the AWS-managed policy AmazonS3ReadOnlyAccess in the ManagedPolicyArns property.
Attach an AWS::IAM::InstanceProfile to the Glue job so it inherits the default EC2 instance role.
Create an AWS::IAM::ManagedPolicy resource granting s3:GetObject on arn:aws:s3:::analytics-data/* and reference it in the role's ManagedPolicyArns property.
Add an AWS::IAM::Policy inline resource that grants s3:GetObject on the bucket and attach it to the role.
Answer Description
Using an AWS::IAM::ManagedPolicy resource lets you define a separate, reusable policy document that can be limited to the specific S3 bucket. Attaching that managed policy to the role through the ManagedPolicyArns property keeps the role resource small while still granting only s3:GetObject permission on the required bucket, satisfying least-privilege. Relying on the AWS-managed AmazonS3ReadOnlyAccess policy would grant read access to every bucket, which violates least-privilege. An AWS::IAM::Policy inline resource or an inline Policies block within the role would work, but they place the entire JSON policy in the role definition, contradicting the requirement to avoid a large inline block. An instance profile is used for EC2 resources and would not attach permissions to an AWS Glue job.
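A sketch of the template fragment, built as a Python dict and emitted as CloudFormation JSON; the logical IDs and trust policy are illustrative, while the bucket ARN follows the scenario.

    import json

    # Ref on AWS::IAM::ManagedPolicy returns the policy ARN, so it can be listed in
    # the role's ManagedPolicyArns without any inline policy block.
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "AnalyticsReadPolicy": {
                "Type": "AWS::IAM::ManagedPolicy",
                "Properties": {
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Action": "s3:GetObject",
                            "Resource": "arn:aws:s3:::analytics-data/*",
                        }],
                    },
                },
            },
            "GlueJobRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": "glue.amazonaws.com"},
                            "Action": "sts:AssumeRole",
                        }],
                    },
                    "ManagedPolicyArns": [{"Ref": "AnalyticsReadPolicy"}],
                },
            },
        },
    }

    print(json.dumps(template, indent=2))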
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the principle of least privilege in AWS IAM?
Why is an AWS::IAM::ManagedPolicy used instead of an inline policy?
What permissions does AmazonS3ReadOnlyAccess include, and why is it unsuitable here?
A company ingests 50,000 JSON events per second from IoT sensors into an Amazon Kinesis Data Stream. The analytics team needs each record converted to Apache Parquet with sub-second latency and written to Amazon S3. The solution must scale automatically with the unpredictable event rate and require minimal infrastructure management. Which approach meets these requirements most effectively?
Create an AWS Glue streaming ETL job that reads from the Kinesis Data Stream and writes Parquet files to Amazon S3.
Use AWS Lambda with Kinesis Data Streams as the event source; each invocation converts the JSON record to Parquet and writes it to Amazon S3.
Configure an Amazon EMR cluster with Spark Structured Streaming to poll the stream and convert data to Parquet in Amazon S3.
Deliver the stream to Amazon S3 through Kinesis Data Firehose with a Lambda transformation that converts incoming records to Parquet format.
Answer Description
An AWS Glue streaming ETL job is serverless, so there are no clusters to provision or manage. It can read directly from Kinesis Data Streams, perform Spark-based transformations, and write Parquet files to Amazon S3 with micro-batch windows as small as 1 s. Glue streaming jobs support Auto Scaling, so they handle spikes in event volume without manual intervention.
An Amazon EMR cluster running Spark Structured Streaming could work, but the company would still have to size, monitor, and scale the cluster, increasing operational overhead.
An AWS Lambda consumer risks hitting concurrency limits and would struggle to serialize each record to Parquet efficiently at 50,000 TPS.
Kinesis Data Firehose can deliver to S3 and convert formats, but its buffering interval (minimum 60 s) prevents sub-second latency. Therefore, the Glue streaming ETL job best satisfies latency, scalability, and management requirements.
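A sketch of what the Glue streaming script could look like; the stream ARN, output paths, and the 1-second window are assumptions based on the scenario.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read JSON records directly from the Kinesis data stream (ARN is hypothetical).
    events = glue_context.create_data_frame.from_options(
        connection_type="kinesis",
        connection_options={
            "streamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/iot-events",
            "startingPosition": "TRIM_HORIZON",
            "inferSchema": "true",
            "classification": "json",
        },
    )

    def write_batch(batch_df, batch_id):
        # Each micro-batch is written to S3 as Parquet (path is hypothetical).
        if batch_df.count() > 0:
            batch_df.write.mode("append").parquet("s3://analytics-parquet/iot-events/")

    glue_context.forEachBatch(
        frame=events,
        batch_function=write_batch,
        options={
            "windowSize": "1 second",
            "checkpointLocation": "s3://analytics-parquet/checkpoints/iot-events/",
        },
    )
    job.commit()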
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is AWS Glue streaming ETL?
Why is Apache Parquet preferred for storage in this solution?
How does Auto Scaling in AWS Glue Streaming ETL work?
A retail company receives a 10-GB CSV file in an Amazon S3 bucket every night. The file must be loaded into Amazon Redshift as soon as it arrives. The solution must be fully managed, cost-effective, and must avoid re-loading the same file if the job is restarted after a failure. Which approach meets these requirements?
Configure AWS DataSync to move the file into an Amazon Redshift Spectrum external table and run an INSERT statement into the target table.
Create an Amazon EventBridge rule for the s3:ObjectCreated event to start an AWS Glue job that copies the file into Amazon Redshift, and enable AWS Glue job bookmarks.
Schedule an Amazon EMR cluster to start nightly, run a Spark script that uses the COPY command to load the file into Amazon Redshift, and terminate the cluster afterward.
Use Amazon Kinesis Data Analytics with an S3 source and a Redshift destination to stream the file contents into Amazon Redshift.
Answer Description
An Amazon EventBridge rule that listens for the s3:ObjectCreated event can start an AWS Glue job whenever the nightly file lands in the bucket, providing a serverless, pay-as-you-go ingestion workflow. Enabling AWS Glue job bookmarks causes the job to record which S3 objects have already been processed, so a retry does not reload the same file. The combination removes the need to manage EC2 instances and is less expensive and simpler than operating Amazon EMR, AWS Data Pipeline, or streaming services for a once-per-day batch workload.
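A sketch of defining the Glue job with bookmarks enabled; the job name, role, and script location are hypothetical, and the --job-bookmark-option argument is what prevents reprocessing objects that were already loaded.

    import boto3

    glue = boto3.client("glue")

    # Hypothetical job definition; EventBridge starts it when the nightly file arrives.
    glue.create_job(
        Name="nightly-csv-to-redshift",
        Role="arn:aws:iam::111122223333:role/GlueJobRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://etl-scripts/nightly_csv_to_redshift.py",
            "PythonVersion": "3",
        },
        DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
        GlueVersion="4.0",
    )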
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are AWS Glue job bookmarks?
How does Amazon EventBridge handle S3 events?
Why is AWS Glue more cost-effective than Amazon EMR for this task?
A company stores daily .csv transaction files in an Amazon S3 bucket. A data engineer must ensure that every new object triggers a processing Lambda function exactly once, in the same order that the files arrive, and that failed invocations are automatically retried without manual intervention. Which approach meets these requirements with the least operational overhead?
Send S3 event notifications directly to the Lambda function and restrict its reserved concurrency to 1 to enforce sequential execution.
Create an Amazon EventBridge rule for s3:ObjectCreated:Put events and set the Lambda function as the rule's only target.
Configure an S3 event notification with a suffix filter of .csv that publishes to an Amazon SQS FIFO queue, then set the Lambda function to poll the queue with a batch size of 1.
Enable S3 replication to a second bucket and create a Step Functions state machine that the replication process invokes for each replicated object.
Answer Description
Sending S3 event notifications to an Amazon SQS FIFO queue preserves the order in which PutObject events occur. A Lambda function configured with the queue as an event source polls the messages, ensuring the function is invoked in order with exactly one message at a time (batch size 1). If the function returns an error, the message remains in the queue for retry based on the queue's visibility timeout, eliminating the need for custom retry logic. EventBridge or direct Lambda targets cannot guarantee ordered processing, and using replication or Step Functions adds unnecessary complexity.
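A sketch of the Lambda handler side; the processing function is hypothetical, and any unhandled exception simply returns the message to the queue so SQS retries it after the visibility timeout.

    import json

    def handler(event, context):
        # Each SQS message body is an S3 event notification; a batch size of 1 keeps
        # processing strictly in arrival order.
        for message in event["Records"]:
            s3_event = json.loads(message["body"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                process_csv(bucket, key)  # hypothetical processing step

    def process_csv(bucket, key):
        print(f"Processing s3://{bucket}/{key}")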
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why is using an Amazon SQS FIFO queue necessary for processing S3 event notifications in order?
What is the visibility timeout in Amazon SQS, and how does it facilitate retries?
Why is restricting Lambda concurrency insufficient for sequential processing of S3 events?
An ETL pipeline is orchestrated by Amazon EventBridge: a rule starts an AWS Glue job whenever new objects land in an S3 bucket. The data engineering team must alert on-call staff immediately when the Glue job finishes with either SUCCEEDED or FAILED status. Notifications must support email and SMS without introducing custom code. Which solution meets these requirements with minimal operational effort?
Wrap the Glue job in an AWS Step Functions state machine and use a Catch block that calls a webhook to a chat application when the task fails or succeeds.
Configure an Amazon CloudWatch alarm on the job's DPU consumed metric and set the alarm action to push messages to an SQS queue, then invoke a Lambda function to forward notifications.
Create a second EventBridge rule that matches Glue Job State Change events with states SUCCEEDED or FAILED and sends them to an Amazon SNS topic that has email and SMS subscriptions.
Add code at the end of the Glue script to use Amazon Simple Email Service (Amazon SES) to send an email when the job completes.
Answer Description
AWS Glue emits built-in "Glue Job State Change" events to Amazon EventBridge that include the job's final state. Creating a second EventBridge rule that filters for the SUCCEEDED and FAILED states and targets an Amazon SNS topic requires no new code. SNS natively supports multiple subscription protocols, such as email and SMS, so on-call staff can receive alerts through their preferred channel. The other options either require custom libraries, extra polling infrastructure, or do not natively provide both email and SMS delivery, so they introduce additional operational overhead and complexity.
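A sketch of the second rule and its SNS target with boto3; the rule name and topic ARN are hypothetical, and the topic's resource policy must also allow EventBridge to publish.

    import json
    import boto3

    events = boto3.client("events")

    # Rule matching the built-in Glue job state-change events for final states.
    events.put_rule(
        Name="glue-job-completion-alerts",
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {"state": ["SUCCEEDED", "FAILED"]},
        }),
    )

    events.put_targets(
        Rule="glue-job-completion-alerts",
        Targets=[{"Id": "oncall-sns",
                  "Arn": "arn:aws:sns:us-east-1:111122223333:oncall-alerts"}],
    )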
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Amazon EventBridge used for in this solution?
How does Amazon SNS enable email and SMS notifications?
Why is creating a second EventBridge rule better than using a Lambda function or Step Functions here?
Your data engineering team stores daily AWS Glue Apache Spark job logs as compressed JSON files in an Amazon S3 bucket. Analysts must run ad-hoc SQL to find long-running stages and join the result with an existing reference dataset that also resides in S3. The solution must become queryable within minutes of log delivery, require no servers to manage, and minimize operational effort. Which solution best meets these requirements?
Stream the log files from S3 into Amazon CloudWatch Logs and analyze them with CloudWatch Logs Insights queries.
Launch an on-demand Amazon EMR cluster with Trino, mount the S3 buckets, and submit SQL queries through the Trino coordinator.
Run an AWS Glue crawler on the log prefix to update the Data Catalog and query both log and reference tables in Amazon Athena.
Deliver the logs to Amazon OpenSearch Service with Amazon Kinesis Data Firehose and query them alongside the reference data using OpenSearch Dashboards.
Answer Description
Creating a Glue crawler to catalog the new log files and letting analysts query them with Amazon Athena is the only option that is fully serverless, requires no cluster or domain management, and becomes available for SQL queries shortly after the data lands in S3. Athena reads directly from S3, and the Glue Data Catalog provides the schema that analysts can join to their reference table, which is already cataloged. An EMR cluster with Presto/Trino would work functionally but introduces nodes to provision, scale, and patch. Streaming the logs to Amazon OpenSearch Service creates an OpenSearch domain that the team must size and maintain, and its query language differs from standard SQL. CloudWatch Logs Insights cannot natively query objects that are already in S3; the logs would first have to be pushed to CloudWatch Logs, adding delay and extra steps.
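For illustration, an ad-hoc Athena query of the kind analysts might run once the crawler has cataloged the logs; the database, table, and column names are hypothetical.

    import boto3

    athena = boto3.client("athena")

    # Find the longest-running stages and join against the reference dataset.
    athena.start_query_execution(
        QueryString="""
            SELECT l.job_run_id, l.stage_id, l.duration_ms, r.team_owner
            FROM spark_logs l
            JOIN job_reference r ON l.job_name = r.job_name
            WHERE l.duration_ms > 600000
            ORDER BY l.duration_ms DESC
            LIMIT 50
        """,
        QueryExecutionContext={"Database": "log_analytics"},                # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
    )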
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is an AWS Glue crawler?
How does Amazon Athena work with AWS Glue?
Why is OpenSearch not suitable for this use case?
Your team receives unpredictable batches of CSV transaction files in a dedicated Amazon S3 prefix. Every file must be ingested into an Amazon Redshift staging table within five minutes of arrival. The solution must follow an event-driven batch pattern, avoid idle infrastructure, and scale automatically with the daily file count. Which approach meets these requirements while keeping operational overhead low?
Send the files to an Amazon Kinesis Data Firehose delivery stream configured to deliver records to Amazon Redshift.
Configure an Amazon S3 event notification that routes through EventBridge to trigger an AWS Glue job, and have the job run a Redshift COPY command for the new object.
Set up an AWS Database Migration Service task with S3 as the source endpoint and Redshift as the target to perform full load and change data capture.
Create an AWS Glue job with a 5-minute cron schedule that recursively scans the prefix and loads any discovered files into Redshift.
Answer Description
Amazon S3 can emit an event for every new object. Publishing that event to Amazon EventBridge allows a rule to start an AWS Glue job only when a file is written. The Glue job can issue a COPY command that loads the single object into Amazon Redshift, giving near-real-time latency without running servers between arrivals. A cron-based Glue schedule polls rather than reacts to events and could miss the five-minute window or waste resources. AWS DMS cannot use S3 as a change-data-capture source for Redshift in this scenario, and Kinesis Data Firehose expects streaming records, not entire objects already in S3, so it does not satisfy the event-driven batch requirement.
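A sketch of the Glue job body that issues the COPY for the single new object; the argument names, cluster, secret, and staging table are hypothetical, with the bucket and key passed in by the EventBridge target as job arguments.

    import sys
    import boto3
    from awsglue.utils import getResolvedOptions

    # The EventBridge target supplies --bucket and --key for the newly created object.
    args = getResolvedOptions(sys.argv, ["bucket", "key"])

    copy_sql = f"""
    COPY staging.transactions
    FROM 's3://{args['bucket']}/{args['key']}'
    IAM_ROLE DEFAULT
    FORMAT AS CSV IGNOREHEADER 1;
    """

    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-etl",
        Sql=copy_sql,
    )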
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is an S3 event notification and how does it work?
What is the role of EventBridge in this solution?
How does AWS Glue and the Redshift COPY command integrate for this use case?
A data engineer loads transformed sales totals into Amazon Redshift Serverless each night. An external partner needs to query the current day's total over the internet through a low-latency HTTPS endpoint. The partner cannot obtain AWS credentials but can pass an API key for authentication. The solution must remain fully serverless and require the least operational overhead. Which approach satisfies these requirements?
Write the daily total to a JSON file in an Amazon S3 bucket and share a presigned URL with the partner.
Expose the Amazon Redshift Data API endpoint to the partner and store database credentials in AWS Secrets Manager.
Deploy a microservice on Amazon ECS Fargate behind an Application Load Balancer that connects to Amazon Redshift with JDBC and returns results.
Create a REST API in Amazon API Gateway that requires an API key and invokes an AWS Lambda function, which queries Amazon Redshift through the Redshift Data API and returns JSON.
Answer Description
Using Amazon API Gateway with an attached usage plan lets the company require an API key for every request. A Lambda function behind the API runs simple SELECT statements by calling the Amazon Redshift Data API, formats the result as JSON, and returns it. All components are serverless, no network endpoints for Redshift are exposed, and API Gateway handles throttling and key management. Directly exposing the Redshift Data API would require the partner to sign requests with AWS credentials. Running a container service behind an Application Load Balancer introduces additional infrastructure to operate. Publishing a daily file to Amazon S3 does not provide on-demand queries and relies on presigned URLs rather than API-key authentication.
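A sketch of the Lambda handler behind the API; the workgroup, database, table, and SQL are hypothetical, and API Gateway enforces the API key before this function is ever invoked.

    import json
    import time
    import boto3

    redshift_data = boto3.client("redshift-data")

    def handler(event, context):
        # Submit the query against the Redshift Serverless workgroup (names hypothetical).
        stmt = redshift_data.execute_statement(
            WorkgroupName="sales-serverless-wg",
            Database="sales",
            Sql="SELECT sale_date, total_amount FROM daily_totals WHERE sale_date = CURRENT_DATE;",
        )
        statement_id = stmt["Id"]

        # The Data API is asynchronous, so poll briefly for completion.
        while True:
            status = redshift_data.describe_statement(Id=statement_id)["Status"]
            if status in ("FINISHED", "FAILED", "ABORTED"):
                break
            time.sleep(0.25)

        if status != "FINISHED":
            return {"statusCode": 500, "body": json.dumps({"error": status})}

        rows = redshift_data.get_statement_result(Id=statement_id)["Records"]
        return {"statusCode": 200, "body": json.dumps({"result": rows}, default=str)}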
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the Amazon Redshift Data API?
How does Amazon API Gateway authenticate with API keys?
Why is AWS Lambda a good choice for querying Amazon Redshift in this solution?
An Amazon EMR cluster is running an Apache Spark SQL job that joins a 500 GB click-stream DataFrame with a 100 MB reference DataFrame. Shuffle stages dominate the runtime and the team cannot resize the cluster or rewrite the input data. Which Spark-level change will most effectively reduce shuffle traffic and speed up the join?
Apply a broadcast join hint to the 100 MB reference DataFrame so each executor receives a local copy.
Increase the value of spark.sql.shuffle.partitions to create more shuffle tasks.
Persist both DataFrames in memory before executing the join.
Enable speculative execution by setting spark.speculation to true.
Answer Description
Using a broadcast join hint copies the small 100 MB reference DataFrame to every executor, so the larger 500 GB DataFrame can be joined locally without shuffling either dataset across the network. Increasing the number of shuffle partitions will not reduce the amount of data shuffled, and persisting the DataFrames adds memory pressure without eliminating the shuffle. Enabling speculative execution only mitigates slow tasks but does not address the fundamental shuffle cost of a standard hash join. Therefore, broadcasting the small table is the most effective solution.
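A minimal PySpark sketch of the hint; the S3 paths and join key are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("clickstream-join").getOrCreate()

    clicks = spark.read.parquet("s3://clickstream/events/")        # large (~500 GB) DataFrame
    reference = spark.read.parquet("s3://clickstream/reference/")  # small (~100 MB) DataFrame

    # The broadcast hint ships the small DataFrame to every executor, so the large
    # DataFrame is joined locally and never shuffled across the network.
    joined = clicks.join(broadcast(reference), on="product_id", how="left")

    joined.write.mode("overwrite").parquet("s3://clickstream/joined-output/")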
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a broadcast join in Apache Spark?
Why does increasing spark.sql.shuffle.partitions not reduce shuffle traffic?
What is shuffle in Apache Spark and why is it costly?
Gnarly!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.