AWS Certified Data Engineer Associate Practice Test (DEA-C01)
Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

AWS Certified Data Engineer Associate DEA-C01 Information
The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.
The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.
Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
- 20 Questions
- Unlimited
- Data Ingestion and Transformation, Data Store Management, Data Operations and Support, Data Security and Governance
A fintech startup captures tick-level trade events in an Amazon Kinesis Data Stream. Business analysts need to run near-real-time SQL queries in Amazon Redshift with end-to-end latency under 15 seconds. The team wants the simplest, most cost-effective solution and does not want to manage intermediate Amazon S3 staging or custom infrastructure. Which approach should the data engineer implement to meet these requirements?
Create a materialized view in Amazon Redshift that references the Kinesis stream with the KINESIS clause and enable auto-refresh for continuous ingestion.
Configure Amazon Kinesis Data Firehose to deliver the stream to an S3 bucket and schedule a Redshift COPY command to load the files every minute.
Build an AWS Glue streaming job that reads from the Kinesis stream and writes batches to Amazon Redshift using JDBC.
Attach an AWS Lambda function as a stream consumer that buffers events and inserts them into Amazon Redshift through the Data API.
Answer Description
Amazon Redshift supports native streaming ingestion from Amazon Kinesis Data Streams and Amazon MSK. By creating a materialized view that references the stream with the KINESIS clause and enabling auto-refresh, Redshift consumes records directly and makes them available for queries in seconds. This eliminates the S3 staging layer used by Kinesis Data Firehose, avoids the operational overhead of managing AWS Glue or Lambda jobs, and incurs no additional service charges beyond Redshift and the existing Kinesis stream.
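For illustration, a minimal sketch of this setup, assuming a stream named trade-ticks, a view named trade_ticks_mv, placeholder cluster/database/user names, and an IAM role already associated with the cluster; the SQL is submitted here through the Redshift Data API with boto3.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical identifiers -- replace with your cluster, database, user, and role.
CLUSTER_ID = "analytics-cluster"
DATABASE = "dev"
DB_USER = "awsuser"
STREAM_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-kinesis-role"

sql_statements = [
    # External schema that maps Redshift to the Kinesis Data Streams service.
    f"""
    CREATE EXTERNAL SCHEMA kinesis_schema
    FROM KINESIS
    IAM_ROLE '{STREAM_ROLE_ARN}';
    """,
    # Materialized view over the stream; AUTO REFRESH keeps ingesting new records.
    """
    CREATE MATERIALIZED VIEW trade_ticks_mv AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."trade-ticks";
    """,
]

for sql in sql_statements:
    redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )
```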
Your team has registered an Amazon S3 data lake with AWS Lake Formation, and analysts query the data through Amazon Athena. The security team must ensure that any S3 object Amazon Macie flags as containing PII is automatically blocked for the analyst Lake Formation principal while remaining accessible to the governance principal. The solution must rely on AWS-managed integrations and involve as little custom code as possible. Which approach meets these requirements?
Configure an Amazon Macie discovery job and an EventBridge rule that starts a Step Functions workflow. The workflow calls Lake Formation AddLFTagsToResource to tag resources Classification=Sensitive and applies LF-tag policies that block analysts and allow governance users.
Run an AWS Glue crawler with custom classifiers that detect PII and update the Data Catalog, then attach IAM policies that deny analysts access to any tables the crawler marks as sensitive.
Generate daily S3 Inventory reports, use S3 Batch Operations to tag files that contain sensitive keywords, and add bucket policies that block the analyst group from those objects while permitting governance access.
Use S3 Object Lambda with a Lambda function that removes or redacts PII from objects before analysts access them, while governance users read the original objects directly.
Answer Description
Create an Amazon Macie sensitive-data discovery job for the lake buckets. Configure an Amazon EventBridge rule that triggers an AWS Step Functions state machine whenever Macie publishes a sensitive-data finding. In the workflow, use an AWS SDK task to call the Lake Formation AddLFTagsToResource API and attach an LF-tag such as Classification=Sensitive to the object (or its corresponding catalog columns). Lake Formation tag-based access-control policies then deny the analyst principal and allow the governance principal for resources tagged Classification=Sensitive. This uses only managed integrations (Macie, EventBridge, Step Functions, Lake Formation) and requires minimal code; no bespoke parsing is needed beyond the workflow definition.
The other options either rely on custom parsing inside Lambda, do not use Macie for detection, or cannot apply Lake Formation permissions automatically.
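A sketch of the tagging call the workflow would make, with hypothetical database and table names; in practice the Step Functions SDK-integration task supplies these values from the Macie finding.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Hypothetical catalog resource; the workflow derives it from the Macie finding.
lakeformation.add_lf_tags_to_resource(
    Resource={
        "Table": {
            "DatabaseName": "clickstream_db",
            "Name": "raw_events",
        }
    },
    LFTags=[{"TagKey": "Classification", "TagValues": ["Sensitive"]}],
)
```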
An e-commerce company transforms 2 TB of clickstream data stored in Amazon S3 every night by running a PySpark script that is version-controlled in an S3 path. Engineers want to invoke the job from a Jenkins pipeline through API calls, avoid managing any clusters, yet retain access to the Spark UI for detailed job troubleshooting. Which solution best satisfies these requirements?
Package the script in a Docker image and run it with AWS Batch on AWS Fargate; submit the job via the SubmitJob API; inspect the CloudWatch Logs stream for troubleshooting.
Create an AWS Glue Spark job that references the script in Amazon S3; trigger the job by calling the StartJobRun API from Jenkins; use the AWS Glue Spark UI to debug failed runs.
Provision an Amazon EMR cluster on EC2 each night and submit the script as a step by calling the AddJobFlowSteps API; access the Spark UI on the cluster's master node for troubleshooting; terminate the cluster after completion.
Load the script into an Amazon Athena Spark notebook and invoke it by calling the StartQueryExecution API; view execution output in Athena's query editor for debugging.
Answer Description
AWS Glue provides a serverless Spark runtime, so no clusters need to be provisioned or maintained. The script can be referenced directly from Amazon S3, and the Jenkins pipeline can start each run by calling the AWS Glue StartJobRun API. Each run automatically exposes the Spark UI and executor logs in the AWS Glue console, which allows engineers to inspect stages, tasks, and metrics for troubleshooting. Using an EMR cluster would meet the functional need but requires provisioning, configuring, and terminating infrastructure. Athena and AWS Batch do not provide the full Spark UI, making them less suitable for the stated troubleshooting requirement.
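A sketch of the Jenkins-side trigger, assuming a hypothetical job name and argument; the pipeline could equally make the same call through the AWS CLI.

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name and argument; Jenkins would run this and then poll
# get_job_run for the final status.
response = glue.start_job_run(
    JobName="nightly-clickstream-transform",
    Arguments={"--input_path": "s3://example-bucket/clickstream/2024-06-01/"},
)
print("Started run:", response["JobRunId"])
```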
An organization runs nightly Apache Spark ETL jobs with Amazon EMR on EKS. Each executor pod requests 4 vCPU and 32 GiB memory, but its CPU limit is also set to 4 vCPU. CloudWatch shows frequent CpuCfsThrottledSeconds and long task runtimes, while cluster nodes have unused CPU. The team wants faster jobs without adding nodes or instances. Which action meets the requirement?
Replace gp3 root volumes with io2 volumes on worker nodes to increase disk throughput.
Remove the CPU limit or raise it well above the request so executor containers can use idle vCPU on the node.
Migrate the workload to AWS Glue interactive sessions, which automatically scale compute resources.
Enable Spark dynamic allocation so the job can launch additional executor pods during the run.
Answer Description
CPU throttling occurs when a container exhausts the CPU limit defined in its pod specification. For Spark workloads this causes tasks to pause even though idle CPU is still available on the node. Removing the CPU limit (or increasing it well above the request) lets the executor containers borrow spare vCPUs and eliminates cgroup throttling, reducing job runtime without needing additional EC2 capacity. Increasing disk throughput, enabling dynamic allocation, or migrating to a different service may improve performance in some cases but do not directly address the throttling caused by the Kubernetes CPU limit in this scenario.
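A hedged sketch of how the fix might look when submitting the job through the EMR on EKS (emr-containers) API, with placeholder cluster, role, and script identifiers; the relevant Spark-on-Kubernetes settings are spark.kubernetes.executor.request.cores and spark.kubernetes.executor.limit.cores.

```python
import boto3

emr_containers = boto3.client("emr-containers")

# Hypothetical virtual cluster, role, release label, and script location.
emr_containers.start_job_run(
    virtualClusterId="abcdef1234567890",
    name="nightly-etl",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-eks-job-role",
    releaseLabel="emr-6.15.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-bucket/scripts/nightly_etl.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.memory=32g "
                "--conf spark.executor.cores=4 "
                "--conf spark.kubernetes.executor.request.cores=4"
                # No spark.kubernetes.executor.limit.cores is set: with the limit
                # removed (or raised well above the request), executor containers
                # can use idle vCPU on the node instead of being throttled.
            ),
        }
    },
)
```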
An analytics team must build an AWS Glue Spark job that enriches 500 GB of Parquet click-stream data stored in Amazon S3 with a 5 GB customer dimension table that resides in an Amazon RDS for PostgreSQL instance. The solution must minimize infrastructure management, let multiple future jobs reuse the same metadata, and ensure that all traffic stays within the VPC. Which approach meets these requirements?
Set up AWS Database Migration Service to export the RDS table to Amazon S3 each night, crawl the exported files, and join them with the click-stream data in the Glue job.
Configure Amazon Athena with the PostgreSQL federated query connector and have the Glue job retrieve the customer table by querying Athena during each run.
Use AWS DMS to replicate the RDS table into Amazon DynamoDB and query DynamoDB from the Glue Spark job for the customer dimension data.
Create an AWS Glue JDBC connection to the RDS endpoint in the VPC, run a crawler with that connection to catalog the customer table, and have the Glue Spark job read the cataloged JDBC table alongside the Parquet files.
Answer Description
Creating an AWS Glue JDBC connection to the RDS instance keeps network traffic inside the VPC and removes the need to manage custom drivers or endpoints. A crawler that uses this connection can catalog the PostgreSQL table in the AWS Glue Data Catalog. The Spark job can then read both the Parquet dataset and the cataloged JDBC table through the same catalog, allowing other Glue or EMR jobs to reuse the metadata. Exporting to S3, using Athena federation, or replicating into DynamoDB adds extra components, increases management overhead, or changes the data store, so they do not best satisfy the stated constraints.
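A sketch of the job logic under assumed database, table, bucket, and join-key names: the dimension table comes from the Data Catalog entry created by the crawler, and the click-stream data is read directly from S3.

```python
# Inside the Glue Spark job (PySpark):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Cataloged JDBC table created by the crawler (hypothetical names).
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="postgres_customers"
).toDF()

# Parquet click-stream data in S3 (placeholder paths).
clicks = spark.read.parquet("s3://example-bucket/clickstream/")

enriched = clicks.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched/")
```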
An analytics team runs a provisioned Amazon Redshift cluster that loads 3 TB of data nightly and is queried by business analysts. Queries arrive unpredictably, with some days heavy ad-hoc activity and most days almost no usage. The company wants to cut costs and remove cluster management tasks while keeping the existing Redshift schema and SQL. Which solution best meets these requirements?
Resize the cluster to RA3 nodes and enable Redshift Concurrency Scaling.
Query the nightly data files directly from Amazon S3 by using Amazon Athena.
Migrate the workload to an on-demand Amazon EMR cluster running Apache Hive.
Restore a snapshot of the cluster into an Amazon Redshift Serverless workgroup and run the workload there.
Answer Description
Amazon Redshift Serverless eliminates the need to size, scale, or turn on and off a cluster. You can restore the existing cluster snapshot into a serverless workgroup, so the current schema and SQL continue to run unchanged. Because compute capacity automatically spins up only when queries arrive and pauses when idle, the company pays for RPU-seconds consumed instead of for a constantly running cluster. Athena would avoid cluster management but would require moving data to open-format tables and adapting queries that rely on Redshift-specific functions. An EMR on-demand cluster still needs provisioning and administration. Switching to RA3 nodes with concurrency scaling retains fixed baseline node costs, so idle time would still accrue charges. Therefore, Redshift Serverless is the most cost-effective and fully managed choice.
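As a rough sketch only (parameter values are placeholders, and the exact restore parameters should be checked against the redshift-serverless API reference), the restore could be driven programmatically like this:

```python
import boto3

serverless = boto3.client("redshift-serverless")

# Hypothetical snapshot ARN, namespace, and workgroup names.
serverless.restore_from_snapshot(
    namespaceName="analytics-ns",
    workgroupName="analytics-wg",
    snapshotArn=(
        "arn:aws:redshift:us-east-1:123456789012:"
        "snapshot:analytics-cluster/nightly-2024-06-01"
    ),
)
```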
Your company runs several Amazon EMR clusters that execute nightly Spark jobs. The engineering team wants a managed solution to aggregate application and step logs from every cluster, retain the data for 30 days, and provide near-real-time search and interactive dashboards to troubleshoot performance issues. Which approach meets these requirements with the least operational overhead?
Stream logs from the EMR master node to Amazon Kinesis Data Streams, invoke AWS Lambda to load the records into Amazon DynamoDB, and build Amazon QuickSight analyses on the table.
Enable log archiving to Amazon S3, run Amazon Athena queries against the logs, and visualize the results in Amazon QuickSight with a 30-day lifecycle policy on the S3 bucket.
Configure each EMR cluster to publish its logs to CloudWatch Logs, create a CloudWatch Logs subscription that streams the logs to an Amazon OpenSearch Service domain, and set a 30-day retention policy on the log groups.
Install Filebeat on every EMR node to forward logs to an ELK stack running on a separate always-on EMR cluster and delete indices older than 30 days.
Answer Description
Streaming each EMR cluster's application and step logs to Amazon CloudWatch Logs is fully managed and eliminates node-level agents. A CloudWatch Logs subscription can forward the data to an Amazon OpenSearch Service domain, which automatically indexes the events so engineers can query them and build dashboards in OpenSearch Dashboards. Retention policies on the CloudWatch log groups enforce the 30-day requirement. The other options either require self-managed analytics infrastructure, lack near-real-time search, or add unnecessary data-flow components.
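A sketch of the retention and subscription pieces, with a placeholder log group and a placeholder forwarding function ARN (the console-managed OpenSearch integration uses a Lambda destination for the subscription).

```python
import boto3

logs = boto3.client("logs")

LOG_GROUP = "/aws/emr/nightly-spark"  # hypothetical log group name

# Enforce the 30-day retention requirement on the log group.
logs.put_retention_policy(logGroupName=LOG_GROUP, retentionInDays=30)

# Stream every event to the forwarding destination (placeholder Lambda ARN).
logs.put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="emr-to-opensearch",
    filterPattern="",  # empty pattern forwards all events
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:LogsToOpenSearch",
)
```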
An ecommerce company stores hundreds of Parquet datasets in Amazon S3. The analytics team catalogs the data in AWS Glue. They must indicate for each table and column whether the data is public, internal only, or contains customer PII, and they must enforce different Athena permissions based on these classifications. Which solution requires the least ongoing administration?
Maintain separate AWS Glue databases for Public, Internal, and PII data and restrict Athena users to the corresponding database.
Create Lake Formation LF-tags for each sensitivity level, attach them to the relevant tables and columns, and grant tag-based permissions to the appropriate IAM principals.
Configure custom classifiers in AWS Glue crawlers to label tables and use Glue column-level IAM policies to restrict Athena access.
Enable Amazon Macie on the S3 buckets and use Macie findings to automatically block unauthorized Athena queries against sensitive data.
Answer Description
Lake Formation LF-tags let administrators assign business-defined classifications (for example, Public, Internal, PII) to databases, tables, and even individual columns in the AWS Glue Data Catalog. Tag-based access control policies can then be granted to roles or users, and Athena automatically honors those permissions. Custom Glue classifiers only identify file formats, separate catalogs add operational overhead, and Amazon Macie does not provide fine-grained permission enforcement for Athena queries.
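A sketch of the tag and grant calls, using assumed tag keys, tag values, and a hypothetical analyst role ARN.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Define the sensitivity tag once (hypothetical key and values).
lakeformation.create_lf_tag(
    TagKey="Sensitivity", TagValues=["Public", "Internal", "PII"]
)

# Grant SELECT on everything tagged Public or Internal to the analyst role;
# resources tagged PII are never matched by this expression.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "Sensitivity", "TagValues": ["Public", "Internal"]}
            ],
        }
    },
    Permissions=["SELECT"],
)
```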
A data engineering team receives hourly CSV files in an Amazon S3 bucket. Each time a file arrives they must 1) launch an AWS Glue ETL job, 2) run an Amazon Athena CTAS query to aggregate the transformed data, and 3) send an Amazon SNS notification. The solution must provide built-in retries, visual workflow monitoring, JSON-based infrastructure-as-code definitions, and minimal operational overhead. Which service should orchestrate this pipeline?
Create an Amazon EventBridge Scheduler cron expression that invokes three Lambda functions in sequence to run Glue, Athena, and SNS.
Deploy an Amazon Managed Workflows for Apache Airflow environment and implement a DAG that calls Glue and Athena operators, then publishes an SNS message.
Define an AWS Step Functions state machine triggered by an EventBridge rule that invokes the Glue job, runs the Athena query with the SDK integration, and publishes to SNS.
Use an AWS Glue Workflow to run the Glue job, followed by a crawler and a trigger that starts an Athena query via Lambda, then send an SNS notification.
Answer Description
AWS Step Functions is a serverless, fully managed workflow engine defined in JSON (the Amazon States Language). It offers visual execution graphs in the console, integrated CloudWatch metrics, and configurable retry and error-handling policies. Step Functions provides direct service integrations to invoke AWS Glue StartJobRun, run Amazon Athena StartQueryExecution, and publish Amazon SNS messages, so no custom code is necessary. EventBridge Scheduler supports only a single target per schedule, so it cannot chain the three required steps. Amazon MWAA could orchestrate the tasks but introduces additional effort to provision and maintain an Airflow environment. AWS Glue Workflows can coordinate Glue jobs and crawlers but cannot directly invoke Athena queries or publish SNS notifications without additional Lambda code. Therefore, Step Functions best meets all stated requirements.
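A condensed sketch of such a state machine, with placeholder job, query, workgroup, topic, and role names; the definition is Amazon States Language expressed here as a Python dict and registered with boto3.

```python
import json
import boto3

# Skeleton definition using the optimized service integrations (names are placeholders).
definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-hourly-csv"},
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
            "Next": "RunAthenaCtas",
        },
        "RunAthenaCtas": {
            "Type": "Task",
            "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
            "Parameters": {
                "QueryString": (
                    "CREATE TABLE analytics.hourly_agg AS "
                    "SELECT item_id, COUNT(*) AS orders "
                    "FROM analytics.transformed GROUP BY item_id"
                ),
                "WorkGroup": "primary",
            },
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-updates",
                "Message": "Hourly load complete",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="hourly-csv-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-pipeline-role",
)
```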
An ecommerce platform streams purchase events to an Amazon Kinesis Data Stream that contains three shards. A Lambda function is configured as the only consumer through an event source mapping. CloudWatch shows the IteratorAge metric growing to several minutes even though the function successfully processes each batch in less than 200 ms. The team must reduce the lag without changing code or adding shards. Which action should the data engineer take?
Reduce the BatchSize value to invoke the function with fewer records more frequently.
Increase the Lambda function's memory allocation to provide more CPU and shorten runtime.
Increase the ParallelizationFactor setting on the event source mapping so multiple batches from each shard are processed concurrently.
Enable enhanced fan-out on the stream and register the Lambda function as an enhanced consumer.
Answer Description
The correct action is to increase the ParallelizationFactor. Lambda polls each Kinesis shard and, by default, invokes only one concurrent function per shard. When IteratorAge is high despite fast function execution, the bottleneck is the rate of processing. Increasing the ParallelizationFactor from the default of 1 (up to 10) allows Lambda to process multiple batches from each shard concurrently, which directly increases throughput and reduces IteratorAge. Reducing the BatchSize would trigger more frequent, smaller invocations but would not increase the number of concurrent executions per shard, so it would not solve the throughput bottleneck. While Lambda can be configured as an enhanced fan-out (EFO) consumer, EFO's main benefit is providing dedicated throughput to each consumer, which is most useful when multiple applications are reading from the same stream. For a single consumer, increasing the ParallelizationFactor is a more direct and simpler solution. Increasing the Lambda function's memory would not help, because the function's 200 ms runtime is already very low, indicating that compute resources are not the limiting factor.
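The change is a single update to the event source mapping; a sketch with a placeholder mapping UUID:

```python
import boto3

lambda_client = boto3.client("lambda")

# UUID identifies the existing Kinesis event source mapping (placeholder value).
lambda_client.update_event_source_mapping(
    UUID="14e0db71-1234-4c1e-9a1b-example",
    ParallelizationFactor=10,  # default is 1; up to 10 concurrent batches per shard
)
```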
A company stores multiple datasets in a single Amazon S3 bucket. Objects are tagged Team=<team-name>, and each team's IAM role carries a matching Team principal tag. A data engineer must ensure that each team can read only its own team's objects, without policies having to be rewritten as new teams and datasets are added. Which solution meets these requirements with the least ongoing administration?
Apply S3 object ACLs that grant read permission to each team's IAM role whenever new data is uploaded.
Implement ABAC by attaching one IAM policy that allows s3:GetObject when the principal's Team tag matches the object's Team tag.
Create a dedicated IAM role and managed policy for each team that grants access to that team's S3 prefix.
Provision an S3 Access Point per team and use access point resource policies to restrict read access to the corresponding role.
Answer Description
Attribute-based access control (ABAC) allows a single IAM policy to enforce that the Team tag on the calling principal matches the Team tag on the S3 object. A condition that compares "s3:ExistingObjectTag/Team" with "aws:PrincipalTag/Team" dynamically authorizes access whenever matching tags are present, so new teams can be onboarded simply by tagging roles and objects. Creating separate roles or managed policies, maintaining ACLs, or configuring individual S3 Access Points would all require ongoing manual updates and therefore do not meet the stated goal.
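An illustrative single ABAC policy (the bucket name is assumed); for object reads, the S3 condition key that carries the object's tag is s3:ExistingObjectTag/<key>.

```python
import json

# One reusable policy: access is allowed only when the object's Team tag
# matches the calling principal's Team tag.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-bucket/*",
            "Condition": {
                "StringEquals": {
                    "s3:ExistingObjectTag/Team": "${aws:PrincipalTag/Team}"
                }
            },
        }
    ],
}

print(json.dumps(abac_policy, indent=2))
```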
An AWS Glue ETL job processes files that contain PII. The source and destination Amazon S3 buckets must enforce encryption at rest with customer-managed keys. Security forbids use of the default aws/s3 KMS key and wants other AWS accounts to read the output. Which approach meets these requirements with the least operational effort?
Enable SSE-KMS with a customer-managed key, configure bucket default encryption to use that key, and add the external accounts to the key policy and bucket policy.
Enable SSE-KMS with the AWS managed key (aws/s3) and create S3 Access Points for the external accounts.
Enable SSE-S3 on both buckets and add a bucket policy that denies uploads without encryption.
Implement client-side encryption in the Glue job using a key stored in AWS Secrets Manager, then upload the encrypted objects.
Answer Description
Using SSE-KMS with a customer-managed key satisfies the requirement for encryption at rest while avoiding the default aws/s3 key. Setting bucket default encryption to that customer-managed key ensures every object written by the Glue job is encrypted without code changes. The key policy for the customer-managed key can grant decrypt permission to the external AWS accounts, and the bucket policy grants object access, so no per-object ACLs or manual key distribution are needed. SSE-S3 lacks customer control of keys. Client-side encryption adds significant key-management overhead. The aws/s3 managed key is explicitly disallowed by the security team and cannot be shared cross-account directly.
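A sketch of the bucket-side configuration with placeholder names; the cross-account kms:Decrypt grant lives in the customer-managed key's policy, which is edited separately.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and customer-managed key ARN.
s3.put_bucket_encryption(
    Bucket="etl-output-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": (
                        "arn:aws:kms:us-east-1:123456789012:"
                        "key/1111aaaa-2222-bbbb-3333-cccc4444dddd"
                    ),
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```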
A company's Amazon Redshift RA3 cluster hosts a 5-TB fact table that receives new rows each night. Business analysts issue the same complex aggregation query every morning to populate dashboards, but the query still takes about 40 minutes even after regular VACUUM and ANALYZE operations. As the data engineer, you must cut the runtime dramatically, keep administration effort low, and avoid a large cost increase. Which approach will best meet these requirements?
Enable Amazon Redshift Concurrency Scaling so the query can execute on additional transient clusters.
Increase the WLM queue's slot count and enable short query acceleration to allocate more memory to the query.
Change the fact table's distribution style to ALL so every node stores a full copy, eliminating data shuffling during joins.
Create a materialized view that pre-aggregates the required data, schedule an automatic REFRESH after the nightly load, and direct the dashboard to query the materialized view.
Answer Description
Creating a materialized view lets Amazon Redshift store the pre-computed, aggregated result set on disk. When analysts query the materialized view, Redshift returns the stored result almost immediately instead of re-scanning and joining the 5-TB fact table, yielding a large runtime reduction. Scheduling an automatic refresh immediately after the nightly data load maintains accuracy while requiring minimal ongoing management.
Changing the fact table to an ALL distribution style would duplicate terabytes of data across every node, greatly increasing storage space and load time. Concurrency scaling adds transient clusters to improve throughput when many queries run simultaneously, but it seldom reduces the elapsed time of a single long query. Adjusting WLM queues or enabling short query acceleration allocates resources differently but will not eliminate the heavy table scan and aggregation work that dominates the query's runtime.
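A sketch of the materialized view and its post-load refresh, using hypothetical table, cluster, and user names and the Redshift Data API; the refresh statement would run from the same orchestration step that finishes the nightly load.

```python
import boto3

redshift_data = boto3.client("redshift-data")

CLUSTER_ID = "bi-cluster"   # placeholder
DATABASE = "analytics"
DB_USER = "awsuser"

create_mv = """
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date, region, SUM(amount) AS total_amount
FROM fact_sales
GROUP BY sale_date, region;
"""

refresh_mv = "REFRESH MATERIALIZED VIEW daily_sales_mv;"

for sql in (create_mv, refresh_mv):
    redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )
```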
An e-commerce company ingests about 800 GB of product images and related JSON metadata each day. The data must be stored with 11 nines durability, read by Spark jobs on Amazon EMR, and later queried using Amazon Athena. The solution should scale automatically, require minimal administration, and cut storage costs because the images are seldom accessed after the first few days. Which AWS storage option best meets these requirements?
Save the images as binary attributes in an Amazon DynamoDB table and scan the table from Amazon EMR.
Store the images and metadata in an Amazon S3 bucket and apply an S3 Lifecycle rule that transitions objects to S3 Glacier Instant Retrieval after 30 days.
Load the images and metadata into an Amazon Redshift RA3 cluster and query the data with Redshift Spectrum.
Mount an Amazon EFS One Zone-IA file system on the EMR cluster and place the images and metadata there.
Answer Description
Amazon S3 is an object store that delivers 11 nines of durability, scales without user intervention, and is the native storage layer for both Amazon EMR and Amazon Athena. Because objects are infrequently accessed after upload, an S3 Lifecycle rule can transition them to a lower-cost storage class such as S3 Glacier Instant Retrieval to reduce cost while still allowing occasional access.
Amazon EFS provides POSIX file storage but does not integrate directly with Athena and is generally more expensive for large, rarely accessed datasets. Loading binary images into Amazon Redshift is inefficient and costly because Redshift is optimized for structured, columnar data, not large unstructured files. DynamoDB cannot store multi-megabyte images because each item is limited to 400 KB and would still require another service for Athena queries. Therefore, storing the data in Amazon S3 with an appropriate Lifecycle policy is the most suitable choice.
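A sketch of the lifecycle rule with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Transition all objects to Glacier Instant Retrieval 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="product-image-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-glacier-ir-after-30-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
            }
        ]
    },
)
```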
A workload must ingest 20 MB/s of 20 KB JSON messages produced by thousands of IoT devices and make each record available to a downstream analytics application within a few hundred milliseconds. Which solution meets the throughput and latency requirements in the most cost-effective way?
Send the data to an Amazon Kinesis Data Firehose delivery stream with default buffering and deliver it to the analytics application.
Publish the events to an Amazon EventBridge bus and have a rule invoke the analytics application for each event.
Send the messages to an Amazon Kinesis Data Streams stream sized with at least 20 shards, then have the analytics application consume from the stream.
Buffer records on each device and write multipart objects directly to an Amazon S3 bucket, then trigger processing with S3 event notifications.
Answer Description
Amazon Kinesis Data Streams is designed for high-throughput, low-latency streaming ingestion. Each shard supports up to 1 MB/s or 1,000 records per second, and records are typically readable by consumers in less than one second. Provisioning 20 shards therefore supplies 20 MB/s of write capacity while meeting the sub-second availability goal.
Amazon Data Firehose (formerly Kinesis Data Firehose) adds a delivery buffer: 300 seconds by default for S3 and 0-60 seconds for most other destinations. Even with the newer zero-buffering mode, AWS states that most deliveries occur within about five seconds, so it cannot guarantee availability within a few hundred milliseconds. Writing individual objects directly to Amazon S3 and reacting with S3 event notifications typically introduces seconds or more of lag, because notifications are usually delivered in seconds and can occasionally take a minute or longer. Amazon EventBridge charges $1 per million events, has default PutEvents rate limits of 600-10,000 TPS depending on the Region, and therefore costs significantly more than a provisioned Kinesis stream sized for this workload. For these reasons, a properly sized Kinesis Data Streams solution is the most cost-effective way to meet both throughput and latency requirements.
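A quick sizing check plus stream creation, using the per-shard limits quoted above and a placeholder stream name:

```python
import math
import boto3

# Per-shard write limits: 1 MB/s and 1,000 records/s.
throughput_mb_per_s = 20
record_size_kb = 20
records_per_s = throughput_mb_per_s * 1024 // record_size_kb   # ~1,024 records/s

shards_for_throughput = math.ceil(throughput_mb_per_s / 1)      # 20 shards
shards_for_records = math.ceil(records_per_s / 1000)            # 2 shards
shard_count = max(shards_for_throughput, shards_for_records)    # 20 shards

kinesis = boto3.client("kinesis")
kinesis.create_stream(StreamName="iot-events", ShardCount=shard_count)
```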
Your company stores raw transactional data with credit-card and SSN columns in an Amazon S3 data lake. Business analysts query the data using Amazon Athena. Compliance mandates that analysts see all columns except those with PII. The solution must avoid duplicating data, follow least privilege, and require minimal maintenance. Which approach satisfies these needs?
Encrypt PII columns client-side before uploading to S3 and withhold the encryption key from analysts so that ciphertext values appear unreadable when they query the data.
Schedule Amazon Macie to classify objects daily and move any files containing PII to an encrypted quarantine bucket that analysts cannot access; analysts query the remaining bucket with Athena.
Register the S3 location with AWS Lake Formation, tag PII columns in the Data Catalog, and grant the analyst group column-level permissions that exclude columns tagged as PII.
Use an AWS Glue job to copy the dataset into a new Parquet table that omits PII columns, and direct analysts to query the new table instead of the raw data.
Answer Description
AWS Lake Formation integrates with Amazon Athena and supports fine-grained permissions down to the column level. By registering the S3 location in Lake Formation, adding the tables to the AWS Glue Data Catalog, and using LF-Tags to identify PII columns, administrators can grant analysts SELECT access to the table while explicitly denying access to columns tagged as PII. Because the data stays in place and permissions are enforced at query time, no additional copies of the dataset or ongoing ETL jobs are required.
Running Amazon Macie and moving objects to a different bucket protects data but introduces daily jobs and creates multiple data copies. Encrypting PII client-side still exposes ciphertext to analysts, does not prevent access, and requires application changes. Creating a separate table without PII columns duplicates data and adds maintenance overhead. Therefore, using Lake Formation column-level security is the most efficient and compliant solution.
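A sketch of a column-level grant that excludes assumed PII column names for a hypothetical analyst role:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Analysts get SELECT on every column except the PII columns (placeholder names).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "transactions_db",
            "Name": "raw_transactions",
            "ColumnWildcard": {"ExcludedColumnNames": ["credit_card_number", "ssn"]},
        }
    },
    Permissions=["SELECT"],
)
```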
A data engineering team runs a managed Apache Airflow environment on Amazon MWAA to orchestrate nightly ETL pipelines. Company policy states that no task may use the MWAA execution role; each task must assume a job-specific IAM role automatically. The team wants to satisfy the policy without refactoring the existing DAG code. Which solution will meet these requirements with the LEAST operational overhead?
Create a new Docker image that includes custom Airflow configuration with job-specific credentials and attach it to the MWAA environment.
Edit the aws_default Airflow connection in the MWAA environment and set the role_arn extra field to the IAM role that the pipeline should assume.
Store long-lived access keys for each job-specific IAM user in separate Airflow connections and reference them from every task.
Transform each task into an AWS Lambda function that first calls STS:AssumeRole and then performs the workload.
Answer Description
Amazon MWAA exposes the standard Airflow connection named "aws_default". By editing this connection in the MWAA console (or with Airflow CLI) and adding the ARN of an IAM role in the role_arn extra field, Airflow's AWSHook automatically calls AWS STS to assume that role. All built-in AWS operators and any custom code that relies on AWSHook or Boto3 inherit those temporary credentials, so no DAG code changes are needed.
Building a custom container image is not supported by MWAA. Adding static credentials directly in a connection violates security best practices and still leaves the execution role in use for other hooks. Wrapping every operator with Lambda vastly increases code and operational overhead. Therefore, updating the existing aws_default connection with the required role_arn is the simplest and most compliant approach.
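The fix is configuration rather than code; shown below, as a sketch, is the JSON that would go in the aws_default connection's Extra field, with a placeholder role ARN.

```python
import json

# Value for the "Extra" field of the aws_default connection (placeholder ARN).
# The Amazon provider's AWS hooks read role_arn and call sts:AssumeRole with it,
# so tasks automatically run with the job-specific role's temporary credentials.
aws_default_extra = {
    "role_arn": "arn:aws:iam::123456789012:role/nightly-etl-job-role",
    "region_name": "us-east-1",
}

print(json.dumps(aws_default_extra))
```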
A data engineering team schedules an AWS Glue Spark job through Amazon EventBridge to transform and load daily CSV files from an S3 landing prefix into a partitioned analytics bucket. The job writes with append mode, and Athena reports sometimes reveal duplicate rows for the same day even though the source files are never modified. Which change will most effectively prevent these duplicates while keeping the pipeline fully automated and cost-effective?
Enable AWS Glue job bookmarks so the job automatically ignores files it has already processed.
Add an AWS Step Functions state machine that calls Athena to delete duplicate records after each load completes.
Configure an S3 lifecycle rule to delete files in the landing prefix immediately after the job finishes.
Change the Spark write operation to overwrite the existing date partition each day.
Answer Description
AWS Glue job bookmarks record the state of the input data that a job has already processed. When bookmarks are enabled, the job automatically skips files it has successfully loaded in earlier runs, preventing the same records from being appended twice. Overwrite mode could remove duplicates but risks data loss if the job fails midway and is less efficient. Deleting landing files with a lifecycle rule still lets duplicates through if the job re-reads already copied data before deletion, and it provides no guard against partial reruns. Adding a Step Functions task to run an Athena DELETE query introduces extra cost and complexity and only corrects duplicates after they occur rather than preventing them.
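A sketch of enabling bookmarks at run time with a placeholder job name; the ETL script must also pass transformation_ctx to its sources and call job.init()/job.commit() so Glue can persist the bookmark state.

```python
import boto3

glue = boto3.client("glue")

# Bookmarks can also be set as a default job argument instead of per run.
glue.start_job_run(
    JobName="daily-csv-to-parquet",  # placeholder job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```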
A company receives CSV files in an Amazon S3 bucket that is owned by another AWS account. A data engineer must copy any new files to the company's central data-lake bucket every hour between 08:00 and 18:00, Monday through Friday. The solution must be serverless, easy to adjust for future schedule changes, and incur the lowest possible operational cost. Which approach meets these requirements MOST effectively?
Deploy Apache Airflow in Amazon Managed Workflows for Apache Airflow (MWAA) and create an hourly DAG that runs an AWS Data Pipeline task to replicate the files.
Configure an hourly AWS Glue crawler on the source bucket and trigger an AWS Glue job to copy the files into the destination bucket.
Launch an Amazon EC2 instance and configure a Linux cron job that runs the "aws s3 sync" command every hour to copy the objects between buckets.
Create an Amazon EventBridge rule with a cron expression that invokes an AWS Lambda function every hour during business hours; the function assumes a cross-account role and copies any new objects to the data-lake bucket.
Answer Description
An Amazon EventBridge rule can use a cron expression to run on an exact hourly schedule limited to specific days and times. The rule invokes an AWS Lambda function, which is serverless and priced per request. The function can use an IAM role that has cross-account S3 permissions to list and copy only new objects. This pattern requires no provisioned infrastructure and schedule changes are made by simply updating the EventBridge rule.
Running Apache Airflow on Amazon MWAA introduces continuous environment costs and additional operational overhead. Maintaining an EC2 instance with a cron job violates the serverless constraint and adds management responsibilities. Scheduling an AWS Glue crawler and job is possible but is more expensive and unnecessary for a straightforward file copy operation, making it less cost-effective than the EventBridge-and-Lambda combination.
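A sketch of the schedule and target wiring with placeholder names; the Lambda function also needs a resource-based permission that lets EventBridge invoke it.

```python
import boto3

events = boto3.client("events")

# Hourly on the hour, 08:00-18:00 UTC, Monday through Friday (adjust for local time).
events.put_rule(
    Name="hourly-business-hours-copy",
    ScheduleExpression="cron(0 8-18 ? * MON-FRI *)",
    State="ENABLED",
)

# Placeholder Lambda ARN; the function assumes the cross-account role and
# copies any new objects into the data-lake bucket.
events.put_targets(
    Rule="hourly-business-hours-copy",
    Targets=[
        {
            "Id": "copy-function",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:CopyNewObjects",
        }
    ],
)
```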
A data lake on Amazon S3 contains a raw table with customer email addresses. Compliance requires downstream analytics to receive a deterministic pseudonym for each address so that joins are possible, while the original email can never be inferred without an internal secret key. As the data engineer, which solution most simply applies a keyed salt during anonymization by relying only on managed services?
Enable S3 Bucket Keys with SSE-KMS and configure an S3 Object Lambda access point to rewrite objects on the fly.
Deploy an AWS Lambda function triggered by S3 PUT to read each object, prepend a random value to every email before hashing, store the mapping in Amazon DynamoDB, and write the redacted file back to S3.
Use server-side encryption with customer-provided keys (SSE-C) on the raw bucket and rotate the keys daily.
Create an AWS Glue DataBrew recipe that applies the HMAC-SHA256 transformation to the email column using a secret key retrieved from AWS Secrets Manager, then write the output to a curated S3 prefix.
Answer Description
AWS Glue DataBrew includes a built-in HMAC-SHA256 transformation that hashes a column by combining the field value with a secret key. The key can be stored in AWS Secrets Manager and referenced directly from the recipe. The result is a consistent, non-reversible pseudonym that supports deterministic joins. S3 bucket keys, SSE-C, or Lambda code either do not perform deterministic hashing or require custom code and state management, making them less suitable.
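To make the behavior concrete, a small illustration of what an HMAC-SHA256 keyed pseudonym does; the secret here is a placeholder for the value stored in Secrets Manager.

```python
import hashlib
import hmac

def pseudonymize(email: str, secret_key: bytes) -> str:
    """Deterministic keyed pseudonym: same input and key always yield the same
    digest, but the email cannot be recovered without the secret key."""
    return hmac.new(secret_key, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()

key = b"retrieved-from-aws-secrets-manager"  # placeholder for the managed secret
print(pseudonymize("jane.doe@example.com", key))
print(pseudonymize("jane.doe@example.com", key))  # identical digest, so joins still work
```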
That's It!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.