AWS Certified Data Engineer Associate Practice Test (DEA-C01)
Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

AWS Certified Data Engineer Associate DEA-C01 Information
The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.
The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.
Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
- 20 Questions
- Unlimited
- Data Ingestion and Transformation, Data Store Management, Data Operations and Support, Data Security and Governance
A data engineering team runs a persistent Amazon EMR cluster that stores intermediate data in HDFS. Each night, about 50 TB of gzip log files arrive in an Amazon S3 bucket and must be copied into HDFS before downstream MapReduce jobs start. The transfer must maximize throughput, minimize S3 request costs, and run by using only the existing EMR cluster resources. Which solution meets these requirements?
Mount the S3 bucket on every core node with s3fs and move the objects to HDFS with the Linux cp command.
From the master node, run the AWS CLI command "aws s3 cp --recursive" to copy the objects into HDFS.
Use AWS DataSync to transfer the objects to volumes on each core node, then import the data into HDFS.
Add an EMR step that uses S3DistCp to copy the objects from Amazon S3 to HDFS in parallel.
Answer Description
S3DistCp is an Amazon EMR utility built on Apache DistCp that runs as a step on the cluster. It launches multiple mapper tasks that copy objects in parallel, can optionally combine small files, and uses the aggregate network bandwidth of the cluster instead of a single node. This approach delivers the highest throughput while keeping S3 request overhead low. Running aws s3 cp from the master node is limited to a single node's bandwidth and cannot write directly into HDFS, so the data would need an extra local hop; DataSync adds an external service and also cannot write directly into HDFS; and mounting the bucket with s3fs provides no parallelism and incurs high per-object overhead.
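For reference, a minimal sketch of adding such a step with the boto3 EMR client; the cluster ID, bucket, and HDFS path below are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add an S3DistCp step to the running cluster (IDs and paths are hypothetical).
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",
    Steps=[
        {
            "Name": "Copy nightly logs from S3 to HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://example-log-bucket/nightly/",
                    "--dest", "hdfs:///data/nightly/",
                ],
            },
        }
    ],
)
```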
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is S3DistCp, and how does it optimize data transfer to HDFS?
Why is running 'aws s3 cp --recursive' from the master node not ideal for this scenario?
How does S3DistCp reduce S3 request costs compared to other methods like mounting S3 with s3fs?
What is S3DistCp and why is it used in Amazon EMR?
How does parallelism work in S3DistCp compared to single-threaded alternatives like AWS CLI?
Why is using alternatives like DataSync or s3fs not suitable for this scenario?
A company stores operational data in an Amazon Aurora PostgreSQL cluster. Analysts need to join this data with large fact tables that already reside in Amazon Redshift for near-real-time ad-hoc reporting. The solution must minimize data movement and ongoing maintenance while allowing analysts to run standard SQL joins from their Redshift data warehouse. Which approach meets these requirements with the least operational overhead?
Set up an AWS Database Migration Service task with change data capture (CDC) to replicate the Aurora tables into Redshift and run joins on the replicated tables.
Create an external schema in Amazon Redshift that references the Aurora PostgreSQL database and use Amazon Redshift federated queries to join the remote tables with local fact tables.
Schedule an AWS Glue ETL job to load the Aurora data into Redshift staging tables every 15 minutes and join the staging tables with the fact tables.
Export the Aurora tables to Amazon S3 and use Redshift Spectrum external tables to join the exported data with Redshift fact tables.
Answer Description
Amazon Redshift federated queries let you create an external schema that maps directly to tables in Amazon Aurora PostgreSQL and Amazon RDS. When analysts run a query in Redshift, the service automatically pushes down predicates to the remote database and streams only the required rows, so no periodic ETL jobs or data replication pipelines are needed. Redshift Spectrum is designed for data in Amazon S3, not for live Aurora tables. AWS Glue ETL jobs or AWS DMS change-data-capture tasks would satisfy the requirement but require building and maintaining pipelines that copy data into Redshift, introducing additional latency and operational overhead.
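A sketch of what this could look like when submitted through the Redshift Data API with boto3; the schema name, Aurora endpoint, cluster identifier, IAM role, and secret ARNs are placeholder values, and the CREATE EXTERNAL SCHEMA statement only needs to run once.

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# External schema that maps to the Aurora PostgreSQL database (names/ARNs are placeholders).
create_schema_sql = """
CREATE EXTERNAL SCHEMA aurora_ops
FROM POSTGRES
DATABASE 'operations' SCHEMA 'public'
URI 'example-aurora.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-creds-EXAMPLE';
"""

# Analysts can then join the live Aurora table with a local Redshift fact table.
join_sql = """
SELECT f.order_id, f.revenue, o.status
FROM analytics.fact_orders AS f
JOIN aurora_ops.orders AS o ON o.order_id = f.order_id
WHERE o.updated_at > DATEADD(minute, -15, GETDATE());
"""

for sql in (create_schema_sql, join_sql):
    rsd.execute_statement(
        ClusterIdentifier="example-redshift-cluster",
        Database="analytics",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin-EXAMPLE",
        Sql=sql,
    )
```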
Ask Bash
What are Amazon Redshift federated queries?
How do Redshift Spectrum and federated queries differ?
Why is Amazon Redshift federated queries better suited for this use case compared to AWS Glue or AWS DMS?
How does Redshift Spectrum differ from federated queries?
What is the benefit of using federated queries over AWS Glue or DMS solutions?
Your company receives hourly comma-separated value (CSV) log files in an Amazon S3 prefix. Data analysts use Amazon Athena for ad-hoc queries, but scan costs and runtimes are increasing as the dataset grows. As a data engineer, you must convert both existing and future files to an optimized columnar format, partition the data by event_date, and avoid managing any servers or long-running clusters.
Which solution MOST cost-effectively meets these requirements?
Create an AWS Glue crawler to catalog the CSV files, then schedule an AWS Glue Spark job that reads the crawler's table, writes Snappy-compressed Parquet files partitioned by event_date to a new S3 prefix, and updates the Data Catalog.
Provision an Amazon EMR cluster with Apache Hive, run a CREATE EXTERNAL TABLE … STORED AS ORC statement to convert the CSV data to ORC, and keep the cluster running to process new hourly files.
Enable S3 Storage Lens and apply Lifecycle rules to transition the CSV objects to the S3 Glacier Flexible Retrieval storage class after 30 days to reduce storage and Athena scan costs.
Modify the source application to write Parquet files directly to the target S3 prefix and drop the existing CSV files once verified.
Answer Description
AWS Glue is a serverless ETL service; you pay only while the job runs and do not manage clusters. A Glue crawler can infer the schema of the incoming CSV files and store it in the Data Catalog. A Glue Spark job can then read the CSV data from the source prefix, write compressed Parquet files partitioned by event_date to a separate S3 prefix, and can be placed on an hourly schedule so new files are converted as they arrive. Athena automatically benefits because it can query the partitioned Parquet files with far less data scanned, lowering query costs and improving performance.
Launching a persistent Amazon EMR cluster introduces infrastructure you must configure and pay for even when idle, making it less cost-effective for this workload. S3 Storage Lens and S3 Lifecycle policies reduce storage cost but do not transform file formats, and objects transitioned to Glacier Flexible Retrieval are no longer directly queryable by Athena. Writing Parquet directly from the source application eliminates the need for conversion but requires changing the upstream producer, which is outside the stated scope.
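A minimal Glue job script along these lines might look as follows, assuming hypothetical database, table, and bucket names registered by the crawler.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV table that the crawler registered (database/table names are placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="raw_csv_logs",
    transformation_ctx="source",  # enables job bookmarks so only new files are read each hour
)

# Write Snappy-compressed Parquet partitioned by event_date to a separate curated prefix.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://example-bucket/curated/logs/",
        "partitionKeys": ["event_date"],
    },
    format="glueparquet",
    format_options={"compression": "snappy"},
    transformation_ctx="sink",
)

job.commit()
```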
Ask Bash
What is AWS Glue and how does it work?
What are the benefits of using Parquet files over CSV files in AWS Athena?
How does partitioning by event_date improve Athena performance?
What is AWS Glue, and why is it suitable for this use case?
What are Parquet files, and why are they better than CSV for Athena?
Why is partitioning by event_date important in this solution?
An application writes 2 TB of structured transactional data as comma-separated files to an S3 bucket each day. Analysts query the data with Amazon Athena and experience long runtimes and high scan charges. A data engineer will add a nightly AWS Glue Spark job to transform the data. Which transformation will best address the volume characteristics while retaining the relational schema?
Merge all daily CSV files into a single uncompressed file to reduce S3 object overhead.
Compress the existing CSV files with Gzip and remove all header rows.
Split each CSV file into chunks no larger than 128 MB to increase Athena parallelism.
Convert the files to Apache Parquet, apply Snappy compression, and partition the dataset by transaction_date.
Answer Description
Columnar formats such as Apache Parquet store values together by column rather than by row, so Athena can read only the columns referenced in a query instead of every field in every record. Snappy compression further reduces the amount of data stored and scanned without adding excessive CPU overhead. Adding a partition key such as transaction_date lets Athena read only the partitions that match a predicate, which sharply limits the amount of data that must be scanned each day. Compressing CSV, combining files, or simply splitting them into smaller objects still forces Athena to read every column of every row, so they do not significantly reduce scan costs or latency for large structured datasets.
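Stripped to its essentials, the nightly transformation is a short piece of Spark code; the paths and column name below are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the daily CSV drop (path and header handling are illustrative).
df = spark.read.option("header", "true").csv("s3://example-bucket/raw/transactions/")

# Columnar format + Snappy + a date partition key keeps Athena scans small.
(
    df.write.mode("append")
    .partitionBy("transaction_date")
    .option("compression", "snappy")
    .parquet("s3://example-bucket/curated/transactions/")
)
```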
Ask Bash
Why is Apache Parquet better than CSV for Athena queries?
How does partitioning data in S3 improve Athena performance?
What is Snappy compression, and why is it suitable for Parquet?
Why is Apache Parquet better for Athena queries than CSV?
What is Snappy compression and why is it used here?
How does partitioning by transaction_date improve Athena query performance?
You run an AWS Glue 3.0 Spark job written in Python that reads 50,000 gzip-compressed JSON files (about 100 KB each) from one Amazon S3 prefix, transforms the data, and writes Parquet files back to S3. The job uses the default 10 G.1X DPUs and currently completes in eight hours while average CPU utilization stays under 30 percent. Which modification will most improve performance without increasing cost?
Use create_dynamic_frame_from_options with connection_options {"groupFiles": "inPartition", "groupSize": "134217728"} so Glue combines many small objects before processing.
Write the Parquet output with the Zstandard compression codec to shrink the file sizes.
Enable AWS Glue job bookmarking so previously processed files are skipped.
Add --conf spark.executor.memory=16g to the job parameters to increase executor heap size.
Answer Description
When a Spark job must open and schedule tens of thousands of very small objects, task-startup overhead, network calls, and driver pressure dominate run time even though CPU usage is low. AWS Glue lets you reduce that overhead by grouping files as they are read. Setting the S3 connection options "groupFiles" to "inPartition" and specifying an appropriate "groupSize" causes the Glue library to combine many small objects into larger logical partitions before they reach the executors, decreasing the number of tasks that must be scheduled and allowing each task to perform more useful work. Because this change does not request additional DPUs, cost remains the same.
- Increasing executor memory does not address the scheduling overhead that is the primary bottleneck, and Glue fixes executor memory per DPU.
- Changing the Parquet compression codec affects the write phase, not the excessive read-side task creation.
- Job bookmarking only helps skip files that were processed in earlier runs; it does not speed up processing of the current data set.
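A sketch of the grouped read inside a Glue script, assuming a placeholder S3 path; the groupSize value is 128 MB expressed in bytes.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Combine many small S3 objects into roughly 128 MB groups as they are read,
# so far fewer Spark tasks are scheduled.
datasource = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/raw-json/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # 128 MB in bytes
    },
    format="json",
)
```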
Ask Bash
What is the purpose of grouping files in AWS Glue?
What is the difference between DPUs and the groupFiles option?
Why doesn't increasing Spark executor memory improve performance in this case?
What is a dynamic frame in AWS Glue?
What is the purpose of the 'groupFiles' and 'groupSize' connection options?
Why does increasing executor memory not improve performance in this scenario?
A company stores application logs as compressed JSON files in an Amazon S3 location that is partitioned by the prefix logs/region/date=YYYY-MM-DD. A data engineer created an AWS Glue crawler that builds an Athena table so analysts can run ad-hoc queries. The crawler runs on a daily schedule, but after several months it spends most of its run time re-processing unchanged folders, delaying data availability for the most recent partition.
Which crawler configuration change will minimize the crawl time without requiring code changes to the ingest process?
Enable partition projection in the Athena table and delete the crawler.
Change the crawler's recrawl behavior to CRAWL_NEW_FOLDERS_ONLY so it processes only folders that were added since the last run.
Switch the crawler trigger to Amazon S3 event notifications so it runs once for every new object.
Configure the crawler to create a separate table for each region/date folder.
Answer Description
AWS Glue crawlers keep track of the folders they have already processed. Setting the crawler's recrawl policy to CRAWL_NEW_FOLDERS_ONLY turns the crawler into an incremental crawler: on each run it compares the current S3 prefix to its internal state and inspects only folders that appeared since the previous crawl. Existing partitions and their schemas are left untouched, so the crawler finishes quickly while still creating or updating the catalog entry for the newest date partition. The other options either continue to scan all folders, rely on S3 event notifications that are not configured, or require changing the folder naming convention used by the ingestion jobs.
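With boto3, the change is a single UpdateCrawler call; the crawler name below is a placeholder. Note that incremental crawls expect the schema change policy to log rather than update or delete existing catalog entries.

```python
import boto3

glue = boto3.client("glue")

# Switch the existing crawler to incremental crawls of new folders only.
glue.update_crawler(
    Name="daily-logs-crawler",
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```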
Ask Bash
What is a recrawl policy in AWS Glue and how does it affect crawler efficiency?
How does partitioning in Amazon S3 improve query performance in Athena?
What is the difference between AWS Glue and Athena in terms of functionality?
What is AWS Glue and its main purpose?
What does the AWS Glue recrawl behavior option 'CRAWL_NEW_FOLDERS_ONLY' do?
What is partition projection in Amazon Athena, and why wasn't it correct in the provided solution?
A retail company runs nightly AWS Glue ETL jobs that load data into an Amazon Redshift cluster. The job script currently hard-codes the database user name and password. Security now requires removing plaintext credentials, rotating the password automatically every 30 days, and making no changes to the ETL code. Which solution meets these requirements most securely?
Store the database credentials as SecureString parameters in AWS Systems Manager Parameter Store and schedule an Amazon EventBridge rule that invokes a Lambda function every 30 days to update the parameters; grant the Glue job role ssm:GetParameters permission.
Save the credentials in the AWS Glue Data Catalog connection properties and enable automatic rotation in the connection settings.
Encrypt the user name and password with AWS KMS and place the ciphertext in environment variables of the Glue job; configure KMS key rotation every 30 days.
Create an AWS Secrets Manager secret for the Redshift cluster, enable automatic rotation, update the existing AWS Glue connection to reference the secret's ARN, and add secretsmanager:GetSecretValue permission to the Glue job role.
Answer Description
AWS Secrets Manager can create a managed secret for an Amazon Redshift cluster whose password is rotated automatically every 30 days. An AWS Glue connection can reference the secret's ARN, so the job continues to run without code changes; the only additional step is to grant the Glue job role permission to call secretsmanager:GetSecretValue. Systems Manager Parameter Store has no built-in rotation, encrypting environment variables with KMS rotates keys rather than credentials, and AWS Glue connections do not provide automatic credential rotation. Therefore the Secrets Manager approach is the only option that satisfies all stated requirements.
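A sketch of the two API calls involved, assuming placeholder names, ARNs, and an existing rotation Lambda (Secrets Manager can also set up managed rotation for Redshift from the console).

```python
import boto3

secrets = boto3.client("secretsmanager")
glue = boto3.client("glue")

# Turn on 30-day automatic rotation for the Redshift secret (ARNs are placeholders).
secrets.rotate_secret(
    SecretId="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-etl-EXAMPLE",
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:ExampleRedshiftRotation",
    RotationRules={"AutomaticallyAfterDays": 30},
)

# Point the existing Glue connection at the secret instead of embedded credentials.
glue.update_connection(
    Name="redshift-etl-connection",
    ConnectionInput={
        "Name": "redshift-etl-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "SECRET_ID": "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-etl-EXAMPLE",
        },
    },
)
```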
Ask Bash
What is AWS Secrets Manager?
How does automatic secret rotation work in AWS Secrets Manager?
What is an ARN and how is it used in AWS Glue?
How does AWS Glue integrate with AWS Secrets Manager?
Why is Secrets Manager better than Parameter Store for automatic credential rotation?
A data engineering team keeps the Python script for an AWS Glue ETL job in an AWS CodeCommit repository. The team wants every commit to automatically: 1. package the script, 2. update a development Glue job, 3. pause for manager approval, and 4. promote the change to the production Glue job. Which approach delivers this CI/CD workflow with the least custom code and operational overhead?
Configure an Amazon EventBridge rule to start an AWS Glue workflow that copies the latest script to both development and production jobs, then ask engineers to manually trigger the production job after testing.
Create an AWS CodePipeline with a CodeCommit source stage, a CodeBuild stage that packages the script to Amazon S3, a CloudFormation deploy action for the development Glue job, a manual approval action, and a second CloudFormation deploy action for the production Glue job.
Add an S3 trigger to both Glue job script locations that invokes a Lambda function; the function pulls the latest commit from CodeCommit and updates the jobs without any intermediate steps.
Use AWS CodeDeploy to create deployment groups for the Glue job and set up a deployment pipeline that pushes the script to development and production, inserting a wait step before the production deployment.
Answer Description
AWS CodePipeline can orchestrate the entire release with almost no custom code. A source stage detects new commits in CodeCommit, a build stage (CodeBuild) packages the script and places the artifact in Amazon S3, a deploy stage updates the development Glue job by applying a CloudFormation stack update, a manual approval stage enforces the manager sign-off, and a final deploy stage updates the production Glue job.
Using EventBridge and Glue workflows would automate job execution rather than deployment and provides no approval gate. Triggering Lambda from S3 demands custom code for packaging, promotion logic, and approvals. CodeDeploy supports EC2, Lambda, and Amazon ECS targets, not AWS Glue, so additional wrappers would be necessary. Therefore, the CodePipeline solution offers the lowest operational overhead and the fewest components to maintain.
Ask Bash
What is AWS CodePipeline and how does it enable CI/CD?
What role does AWS CloudFormation play in maintaining AWS Glue jobs?
Why are manual approval actions important in a CI/CD workflow for AWS Glue jobs?
Your analytics team plans to land about 2 TB of new, structured sales data in AWS each day. They must run complex SQL joins across 100 TB of historical data, support roughly 200 concurrent dashboard users, and load new data continuously without locking running queries. Queries should complete within seconds. Which managed AWS data store is the most appropriate?
Create an Amazon Redshift cluster with RA3 nodes and enable Concurrency Scaling.
Run an Amazon EMR cluster and execute Apache Hive queries on Parquet files stored in Amazon S3.
Deploy Amazon RDS for PostgreSQL on db.r6g.16xlarge with provisioned IOPS and multiple read replicas.
Store the data in Amazon DynamoDB using on-demand capacity and query it with PartiQL.
Answer Description
Amazon Redshift is a managed, petabyte-scale columnar data warehouse purpose-built for complex analytic SQL workloads. RA3 nodes let you separate compute from storage so hundreds of terabytes can be kept at low cost, while Concurrency Scaling automatically adds transient capacity to serve hundreds of simultaneous users with sub-second latency. The COPY command can stream data into staging tables without blocking queries. Amazon RDS for PostgreSQL is limited to 64 TB and becomes I/O-constrained under heavy analytic concurrency. DynamoDB is a NoSQL key-value store; large joins and ad-hoc SQL analytics are inefficient. Apache Hive on EMR performs well for batch processing, but interactive queries across 100 TB typically take minutes and do not scale to hundreds of concurrent dashboard users.
Ask Bash
What are RA3 nodes in Amazon Redshift?
How does Concurrency Scaling work in Amazon Redshift?
Why is DynamoDB not suitable for complex SQL analytics?
What are RA3 nodes in Amazon Redshift and why are they important?
How does Amazon Redshift's Concurrency Scaling work?
Why is DynamoDB not suitable for SQL analytics with large joins?
A data engineering team uses AWS Step Functions to launch a transient Amazon EMR 6.x cluster nightly to run a PySpark ETL step, after which the cluster terminates automatically. When a step fails, the cluster shuts down before engineers can view Spark driver and executor logs. The team must retain detailed logs and the Spark history UI for post-mortem analysis while adding minimal EC2 cost. Which action meets these requirements?
Enable termination protection and disable auto-termination so the cluster remains available for manual log retrieval via SSH.
Configure EMRFS Consistent View so logs are automatically synchronized to Amazon S3 after each task.
Specify an Amazon S3 log URI and enable persistent application user interfaces for Spark when creating the EMR cluster.
Enable CloudTrail data events on the input data bucket to capture Spark driver logs for later review.
Answer Description
Specifying an Amazon S3 log URI causes Amazon EMR to copy YARN, Spark, and system logs to the bucket as each step completes. Enabling the persistent application user interfaces feature stores Spark History Server data off-cluster in the same S3 location, so the UI can be opened from the Amazon EMR console even after the cluster terminates. This preserves all troubleshooting information without incurring additional EC2 costs. Leaving the cluster up with termination protection meets the troubleshooting goal but accrues ongoing instance charges. EMRFS Consistent View only improves S3 list consistency and does not archive logs. Enabling CloudTrail data events on the input bucket records object-level API activity, not Spark application logs, so it does not provide the required diagnostics.
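A sketch of launching the transient cluster with a log URI, using placeholder names, roles, and paths; the persistent Spark History Server UI is then opened from the EMR console rather than through a separate API parameter.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch the transient cluster with an S3 log URI so YARN and Spark logs survive termination.
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.15.0",
    LogUri="s3://example-bucket/emr-logs/",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step finishes
    },
    Steps=[
        {
            "Name": "pyspark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/scripts/etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```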
Ask Bash
What is the Amazon S3 log URI in the context of Amazon EMR?
What are persistent application user interfaces in Amazon EMR?
How does EMRFS Consistent View differ from enabling persistent logs in Amazon EMR?
What is an Amazon S3 log URI and how does it help in log retention for EMR clusters?
What are persistent application user interfaces for Spark, and why are they important?
Why is termination protection or disabling auto-termination not an optimal choice for log retrieval?
A data engineering team processes log files stored in Amazon S3. Nightly AWS Glue ETL jobs write curated data back to S3, while analysts run ad-hoc queries with Amazon Athena and Apache Spark on Amazon EMR. Maintaining separate metastores for each service has resulted in schema drift and extra administration. The team needs a single, serverless data catalog that all three services can reference directly, with the least operational overhead. Which approach satisfies these requirements?
Run an Apache Hive metastore on the EMR primary node and connect Athena to it with AWS Glue connectors.
Create external schemas in Amazon Redshift and have Athena and EMR issue federated queries against them.
Store table metadata in an Amazon DynamoDB table and update Athena and EMR Spark jobs to read from it using custom code.
Use the AWS Glue Data Catalog as the unified metastore and configure both Athena and EMR to reference it.
Answer Description
The AWS Glue Data Catalog is a fully managed, serverless metastore used by Amazon Athena by default and can also be configured as the Hive metastore for Amazon EMR. Pointing both Athena and EMR to the same Glue Data Catalog gives all services a consistent view of table definitions without running or maintaining additional infrastructure. Storing metadata in DynamoDB would require custom integration logic. External schemas in Amazon Redshift do not act as a central Hive metastore. Running a self-managed Hive metastore on the EMR primary node introduces operational overhead and Athena cannot natively query it without Glue.
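The EMR side of this is a pair of configuration classifications passed at cluster launch (Athena uses the Glue Data Catalog by default, so no change is needed there). A sketch:

```python
# Configuration classifications that point Hive and Spark SQL on EMR at the
# AWS Glue Data Catalog; pass this list in the Configurations parameter of run_job_flow.
GLUE_CATALOG_CONFIGURATIONS = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": (
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            )
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": (
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            )
        },
    },
]
```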
Ask Bash
What is the AWS Glue Data Catalog?
How do Athena and EMR use the AWS Glue Data Catalog?
Why is running a self-managed Hive metastore on EMR not optimal?
Why is the AWS Glue Data Catalog preferred over Amazon DynamoDB for metadata storage?
How does Amazon Athena use the AWS Glue Data Catalog?
Can Amazon EMR use the AWS Glue Data Catalog as a Hive metastore?
A data engineer must explore a 200 GB CSV data lake on Amazon S3, remove duplicate rows, and check for malformed records. Company policy prohibits long-running clusters, and the engineer wants to perform the work from an existing Jupyter notebook in Amazon SageMaker Studio with minimal infrastructure to manage. Which approach meets these requirements while keeping costs low?
Run ad-hoc Amazon Athena SQL queries from the notebook with the Boto3 SDK to identify and delete bad or duplicate rows.
Use the Athena for Apache Spark notebook interface to open a new serverless Spark session and connect the SageMaker Studio notebook to it with a JDBC driver.
Create an Amazon EMR cluster with JupyterHub enabled, attach the notebook to the cluster, and terminate the cluster after processing.
Launch an AWS Glue interactive session from the SageMaker Studio notebook by switching to the Glue PySpark kernel and process the data with Apache Spark.
Answer Description
AWS Glue interactive sessions let a SageMaker Studio notebook start a temporary, serverless Spark environment by selecting the Glue PySpark kernel. The session starts in seconds, bills by the second for DPUs that are actually used, and shuts down automatically when idle, satisfying the no-persistent-cluster policy. An EMR cluster or self-managed EC2 instance requires manual provisioning and ongoing management. Standard Athena SQL cannot easily perform row-level data cleansing for malformed records, and Athena for Apache Spark notebooks are only available in the Athena console, not from a SageMaker Studio Jupyter environment.
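A sketch of how such a notebook session might look; the magic values, bucket path, and column name are illustrative, and the spark object is provided by the Glue interactive session itself.

```python
# Cells in a SageMaker Studio notebook using the AWS Glue PySpark kernel.
# Session-sizing magics, run as their own cell before any Spark code:
#   %glue_version 4.0
#   %worker_type G.1X
#   %number_of_workers 5
#   %idle_timeout 30
# The kernel then provisions a serverless Spark session and exposes `spark`.

df = spark.read.option("header", "true").csv("s3://example-bucket/datalake/raw/")

# Remove exact duplicate rows.
deduped = df.dropDuplicates()

# Flag rows whose numeric column fails to parse as potentially malformed
# (the column name `amount` is illustrative).
malformed = deduped.filter(deduped["amount"].cast("double").isNull())

print("rows:", deduped.count(), "suspect rows:", malformed.count())
```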
Ask Bash
What is AWS Glue and how does it support Spark processing?
Why is the Glue PySpark kernel a better option than EMR for this task?
Why can’t standard Amazon Athena SQL queries handle row-level data cleansing efficiently?
A retail company stores clickstream records in Amazon S3 using the prefix structure s3://bucket/events/year=YYYY/month=MM/day=DD/hour=HH/. An AWS Glue Data Catalog table exposes the data to Amazon Athena. Hundreds of new hour-level partitions arrive each day, and analysts must query the most recent data within minutes while keeping maintenance cost low. Which solution best meets these requirements?
Schedule an AWS Glue crawler to run every 5 minutes to discover and add new partitions.
Enable partition projection on the Glue Data Catalog table and define templates for year, month, day, and hour.
Instruct analysts to execute MSCK REPAIR TABLE before each Athena query to refresh partition metadata.
Configure Amazon S3 event notifications to trigger an AWS Lambda function that calls BatchCreatePartition for every new object.
Answer Description
Enabling partition projection on the AWS Glue Data Catalog table lets Athena derive partition values from the S3 prefix pattern at query runtime. No crawler, DDL statement, or Lambda function is required, so new partitions are available immediately with virtually no additional cost. Running a Glue crawler every few minutes or invoking BatchCreatePartition on every object adds unnecessary expense. Requiring analysts to run MSCK REPAIR TABLE introduces manual effort and delays.
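A sketch of the projection properties for this prefix layout, applied with boto3; the database and table names are placeholders and the ranges are illustrative.

```python
import boto3

glue = boto3.client("glue")

# Table properties that enable partition projection for the hour-partitioned clickstream table.
projection_parameters = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2020,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "projection.hour.type": "integer",
    "projection.hour.range": "0,23",
    "projection.hour.digits": "2",
    "storage.location.template": "s3://bucket/events/year=${year}/month=${month}/day=${day}/hour=${hour}/",
}

# Merge the properties into the existing table definition and update it in place.
table = glue.get_table(DatabaseName="clickstream_db", Name="events")["Table"]
table_input = {
    key: value
    for key, value in table.items()
    if key in ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
               "PartitionKeys", "TableType", "Parameters")
}
table_input["Parameters"] = {**table.get("Parameters", {}), **projection_parameters}
glue.update_table(DatabaseName="clickstream_db", TableInput=table_input)
```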
Ask Bash
What is partition projection in AWS Glue?
How does MSCK REPAIR TABLE work in Athena?
What are the limitations of using AWS Glue crawlers for frequent partition updates?
Why is running an AWS Glue crawler every 5 minutes not ideal for this scenario?
How does the MSCK REPAIR TABLE command work, and why is it not an optimal solution here?
A security team needs to audit API activity across 50 AWS accounts that belong to a single AWS Organization. They must aggregate all CloudTrail management events in near-real time, keep the logs immutable for 365 days, and let analysts run ad-hoc SQL queries without exporting the data to another service. Which solution requires the LEAST ongoing operational effort?
In each member account, stream CloudTrail events to CloudWatch Logs and subscribe the log groups to an Amazon OpenSearch Service domain for search and analysis.
Enable Amazon Security Lake across the organization to collect CloudTrail management events and query the Parquet files in the Security Lake S3 buckets with Athena.
Configure an organization CloudTrail trail that delivers logs to an S3 bucket protected with S3 Object Lock, catalog the logs with AWS Glue, and query them using Amazon Athena.
Create an organization event data store in AWS CloudTrail Lake from the delegated administrator account, set one-year extendable retention, and grant analysts permission to run Lake SQL queries.
Answer Description
An organization event data store in AWS CloudTrail Lake automatically ingests management events from every account when created by the management or delegated administrator account. Event data stores are immutable collections, and the one-year extendable retention option meets the 365-day requirement without additional storage configuration. CloudTrail Lake provides a built-in SQL interface, so analysts can query the data directly; no Object Lock configuration, Glue cataloging, or streaming pipelines are needed. The S3/Object Lock and Security Lake options satisfy immutability but add Glue/Athena setup and data-movement overhead. Streaming logs to CloudWatch Logs and OpenSearch requires per-account trail configuration, subscription filters, and OpenSearch management, increasing operational burden and not guaranteeing immutability.
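A sketch with boto3, using placeholder names; the retention value is in days and the sample query is illustrative.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Create an organization-wide event data store with one-year retention
# (run from the management or delegated administrator account).
store = cloudtrail.create_event_data_store(
    Name="org-management-events",
    OrganizationEnabled=True,
    MultiRegionEnabled=True,
    RetentionPeriod=365,  # days
)

# Analysts can then run SQL directly against the event data store ID.
eds_id = store["EventDataStoreArn"].split("/")[-1]
query = cloudtrail.start_query(
    QueryStatement=(
        "SELECT eventTime, eventName, userIdentity.arn "
        f"FROM {eds_id} "
        "WHERE eventName = 'CreateUser'"
    )
)
print(query["QueryId"])
```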
Ask Bash
What is AWS CloudTrail Lake?
What does 'immutable' mean in the context of CloudTrail Lake?
How does CloudTrail Lake reduce operational effort compared to other solutions?
What is CloudTrail Lake, and how does it differ from a standard CloudTrail trail?
What does it mean for event data to be immutable, and why is it important in this solution?
How does the SQL interface in CloudTrail Lake simplify the querying process for analysts?
A fintech startup captures tick-level trade events in an Amazon Kinesis Data Stream. Business analysts need to run near-real-time SQL queries in Amazon Redshift with end-to-end latency under 15 seconds. The team wants the simplest, most cost-effective solution and does not want to manage intermediate Amazon S3 staging or custom infrastructure. Which approach should the data engineer implement to meet these requirements?
Build an AWS Glue streaming job that reads from the Kinesis stream and writes batches to Amazon Redshift using JDBC.
Create a materialized view in Amazon Redshift that references the Kinesis stream with the KINESIS clause and enable auto-refresh for continuous ingestion.
Configure Amazon Kinesis Data Firehose to deliver the stream to an S3 bucket and schedule a Redshift COPY command to load the files every minute.
Attach an AWS Lambda function as a stream consumer that buffers events and inserts them into Amazon Redshift through the Data API.
Answer Description
Amazon Redshift supports native streaming ingestion from Amazon Kinesis Data Streams and Amazon MSK. By creating a materialized view that references the stream with the KINESIS clause and enabling auto-refresh, Redshift consumes records directly and makes them available for queries in seconds. This eliminates the S3 staging layer used by Kinesis Data Firehose, avoids the operational overhead of managing AWS Glue or Lambda jobs, and incurs no additional service charges beyond Redshift and the existing Kinesis stream.
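A sketch of the two statements, submitted here through the Redshift Data API; the stream name, cluster identifier, role, and secret ARNs are placeholders, and how the payload is parsed depends on the record format.

```python
import boto3

rsd = boto3.client("redshift-data")

statements = [
    # Map the Kinesis stream into Redshift (role ARN and schema name are placeholders).
    """
    CREATE EXTERNAL SCHEMA trades_stream
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftStreamingRole';
    """,
    # Materialized view over the stream; AUTO REFRESH keeps it current within seconds.
    """
    CREATE MATERIALIZED VIEW trade_ticks AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payload
    FROM trades_stream."tick-trade-events";
    """,
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="example-redshift-cluster",
        Database="analytics",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin-EXAMPLE",
        Sql=sql,
    )
```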
Ask Bash
What is a materialized view in Amazon Redshift?
How does the KINESIS clause work in Amazon Redshift?
What are the key benefits of using Amazon Redshift for streaming ingestion?
What is a materialized view in Redshift and how does it work?
How does Redshift integrate with Kinesis Data Streams natively?
What are the advantages of using a materialized view over other solutions like Glue or Lambda?
Your company stores JSON transaction logs in Amazon S3 using the prefix s3://company-logs/year=YYYY/month=MM/day=DD/. A scheduled AWS Glue crawler maintains the table's partitions in the Data Catalog. The crawler must register newly added date folders as partitions, remove partitions whose folders have been deleted from S3, and avoid re-reading data that has not changed. Which crawler configuration meets these requirements?
Set RecrawlPolicy RecrawlBehavior = CRAWL_EVENT_MODE and SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE (UpdateBehavior = LOG).
Set RecrawlPolicy RecrawlBehavior = CRAWL_NEW_FOLDERS_ONLY and SchemaChangePolicy DeleteBehavior = LOG.
Set RecrawlPolicy RecrawlBehavior = CRAWL_EVERYTHING and SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE.
Schedule a nightly full crawl with SchemaChangePolicy UpdateBehavior = UPDATE_IN_DATABASE and DeleteBehavior = LOG.
Answer Description
RecrawlPolicy with RecrawlBehavior = CRAWL_EVENT_MODE enables the crawler to use Amazon S3 event notifications so each run lists only the folders mentioned in new PUT or DELETE events, giving fast incremental crawls. Setting SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE tells the crawler to drop the partition from the Glue Data Catalog when an object-removal event indicates that the underlying S3 folder no longer exists. This combination therefore (1) registers new YYYY/MM/DD folders as partitions, (2) removes partitions whose folders are deleted, and (3) avoids rereading data that hasn't changed.
CRAWL_NEW_FOLDERS_ONLY cannot delete partitions because the service forces DeleteBehavior to LOG. CRAWL_EVERYTHING with DELETE_FROM_DATABASE does remove partitions, but it must list the entire dataset on every run, increasing runtime and cost. Scheduling a nightly full crawl likewise rescans all of the data and is unnecessary.
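A sketch of the corresponding crawler settings with boto3; the crawler name and the SQS queue ARN (which receives the bucket's event notifications) are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Event-mode crawler: S3 event notifications land in an SQS queue that the crawler
# consumes, so each run inspects only the folders that actually changed.
glue.update_crawler(
    Name="transaction-logs-crawler",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://company-logs/",
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:company-logs-events",
            }
        ]
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "DELETE_FROM_DATABASE"},
)
```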
Ask Bash
What does RecrawlBehavior = CRAWL_EVENT_MODE do?
What is the role of SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE?
Why is CRAWL_NEW_FOLDERS_ONLY not suitable for this use case?
What is the purpose of the "RecrawlPolicy" in AWS Glue?
How does SchemaChangePolicy impact partition management in AWS Glue?
Why is CRAWL_EVENT_MODE better than CRAWL_NEW_FOLDERS_ONLY for this use case?
An analytics team stores click-stream data as Parquet files in Amazon S3, partitioned by year/month/day (for example, s3://datalake/events/year=2025/month=10/day=07/). A daily AWS Glue crawler adds partitions to the AWS Glue Data Catalog so analysts can query the table in Amazon Athena. After two years the crawler's runtime and cost have increased significantly. The team wants to keep automatic partition discovery while minimizing ongoing cost and administration. What should they do?
Switch to Amazon S3 event notifications that invoke an AWS Glue job calling the batchCreatePartition API to add each new partition to the Data Catalog.
Change the existing crawler's recrawl policy to crawl new folders only and enable partition indexes on the Data Catalog table.
Enable partition projection for the Athena table, configure the year, month, and day keys, and stop scheduling the AWS Glue crawler.
Create an AWS Lambda function that runs MSCK REPAIR TABLE after each crawler run to update the Data Catalog incrementally.
Answer Description
Athena partition projection lets you define the partition keys (year, month, day) as template variables. Athena then resolves the partitions at query time instead of reading them from the AWS Glue Data Catalog. Once projection is configured, the table no longer needs explicit partition objects, so the daily crawler can be disabled, eliminating both the crawl time and the related cost.
Using an S3 event-driven AWS Glue job or Lambda function would still require authoring and maintaining custom code. Setting the crawler to recrawl only new folders reduces, but does not eliminate, the growing scan cost. Glue partition indexes accelerate certain lookups but do not shorten crawler runtime or remove the need to maintain partitions.
Ask Bash
What is Athena partition projection?
How do S3 event notifications work with AWS Glue?
What is MSCK REPAIR TABLE and how does it differ from partition projection?
Why is enabling Partition Projection more efficient than using AWS Glue crawlers?
How does using S3 event-driven AWS Glue jobs differ from Partition Projection?
An Amazon Redshift cluster runs in private subnets without a NAT gateway. The cluster must query only the objects in the s3://dept-finance/raw/ prefix by using Redshift Spectrum. A VPC interface endpoint (AWS PrivateLink) for Amazon S3 already exists in the subnets. Which action enforces this restriction while leaving other VPC workloads unaffected?
Replace the interface endpoint with an S3 gateway endpoint, associate it with the private subnets, and create a bucket policy that limits access to the raw/ prefix.
Add a bucket policy on the dept-finance bucket that allows GetObject only from the specified VPC endpoint and raw/ prefix while denying all other access paths.
Modify the Redshift cluster's IAM role to allow s3:GetObject on dept-finance/raw/* and s3:ListBucket on the dept-finance bucket, leaving the endpoint configuration unchanged.
Attach a custom IAM endpoint policy to the S3 interface VPC endpoint that permits s3:GetObject on arn:aws:s3:::dept-finance/raw/*, s3:ListBucket on arn:aws:s3:::dept-finance, and denies all other S3 actions.
Answer Description
Attaching a custom endpoint policy to the S3 interface endpoint restricts the actions that can be performed through that endpoint only. By allowing s3:GetObject on arn:aws:s3:::dept-finance/raw/* and s3:ListBucket on the bucket, and denying other S3 permissions, the Redshift Spectrum query is limited to the required prefix. Redshift Spectrum requires both ListBucket and GetObject permissions to function. Other workloads that reach S3 through public endpoints or a different gateway are not affected because the endpoint policy is evaluated only when the interface endpoint is used. A bucket policy would impact every caller, changing the behavior for other workloads. Replacing the endpoint is unnecessary and costly. Changing only the Redshift role is a less secure option because the endpoint policy creates a network-level boundary that cannot be bypassed, even by a principal with overly permissive IAM credentials.
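A sketch of such an endpoint policy applied with boto3; the endpoint ID is a placeholder. Because attaching a custom policy replaces the default full-access policy, anything not explicitly allowed here is implicitly denied through this endpoint.

```python
import json
import boto3

ec2 = boto3.client("ec2")

# Endpoint policy scoping the S3 interface endpoint to the raw/ prefix.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dept-finance/raw/*",
        },
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::dept-finance",
        },
    ],
}

ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",
    PolicyDocument=json.dumps(endpoint_policy),
)
```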
Ask Bash
What is the difference between an S3 interface endpoint and S3 gateway endpoint?
Why are both `s3:GetObject` and `s3:ListBucket` permissions required for Redshift Spectrum?
How does an IAM endpoint policy differ from a bucket policy?
What is an S3 interface VPC endpoint?
Why is an IAM endpoint policy better than a bucket policy in this scenario?
Why is `ListBucket` permission necessary for Redshift Spectrum?
A data engineer is generating an AWS Step Functions workflow from a dependency table containing up to 10,000 tasks, each with at most 30 downstream dependencies. The engineer must store the directed acyclic graph in memory inside a 512 MB Lambda function and run a topological sort in O(V+E) time. Which in-memory representation best meets these requirements?
A 10,000 × 10,000 boolean adjacency matrix stored in memory.
An adjacency list implemented as a dictionary that maps each task ID to a list of its dependent task IDs.
A nested dictionary that maps each source task ID to a dictionary of destination IDs set to true.
A single list containing one JSON object for every edge, scanned each time the graph is traversed.
Answer Description
A sparse graph with far fewer edges than V² is most memory-efficient when stored as an adjacency list. Implementing the list as a single dictionary whose keys are task IDs and whose values are Python lists of neighboring task IDs requires O(V+E) space, about 10,000 keys and at most 300,000 integers, which fits comfortably within the 512 MB limit. A depth-first search or Kahn's algorithm can then produce a topological order in O(V+E) time. An adjacency matrix allocates O(V²) space (roughly 100 million booleans) and would risk exhausting the function's memory. A list of JSON edge objects or a nested dict-of-dicts adds heavy per-edge object overhead, wasting memory without improving traversal speed.
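A small illustration of the structure and the sort, using a toy dependency graph:

```python
from collections import deque

# Adjacency list: task ID -> list of downstream task IDs (toy example).
graph = {
    "extract": ["transform"],
    "transform": ["load", "validate"],
    "validate": ["load"],
    "load": [],
}

def topological_sort(adjacency):
    """Kahn's algorithm: O(V + E) time and space."""
    in_degree = {node: 0 for node in adjacency}
    for neighbors in adjacency.values():
        for neighbor in neighbors:
            in_degree[neighbor] += 1

    queue = deque(node for node, degree in in_degree.items() if degree == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adjacency[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)

    if len(order) != len(adjacency):
        raise ValueError("Cycle detected; the dependency table is not a DAG")
    return order

print(topological_sort(graph))  # ['extract', 'transform', 'validate', 'load']
```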
Ask Bash
What is an adjacency list and why is it efficient for sparse graphs?
How does topological sorting work in a directed acyclic graph (DAG)?
Why is an adjacency matrix not suitable for this scenario?
A data engineer must catalog tables from an Amazon RDS for MySQL database that sits in a private subnet with no NAT or internet gateway. The engineer is creating a new AWS Glue crawler to read the schema. Which configuration will allow the crawler to reach the database without exposing it publicly or adding extra network infrastructure?
Do not create any connection; selecting Amazon RDS as the data store is sufficient because Glue can connect to all regional RDS endpoints by default.
Create a JDBC connection with the default Glue security group; the crawler will automatically route through the account's NAT gateway.
Create a network connection that uses a public subnet with an internet gateway so the crawler can reach the database over its public endpoint.
Create a JDBC AWS Glue connection that specifies the RDS endpoint, references credentials in AWS Secrets Manager, and selects the same VPC, private subnet, and a security group allowing port 3306.
Answer Description
A JDBC-type AWS Glue connection is required for a crawler that targets an RDS database. By selecting the same VPC, one of the private subnets that already hosts the DB instance, and a security group that permits inbound TCP 3306 traffic from the Glue-managed ENIs, the service can place ENIs inside that subnet and reach the database directly. Storing the credentials in AWS Secrets Manager lets the crawler authenticate without hard-coding passwords. Because the connection remains inside the VPC, no internet gateway or NAT device is needed. The other options either place the database in a public subnet, rely on a NAT gateway, or assume that no connection is required, all of which conflict with best practices for a private RDS instance.
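A sketch of such a connection created with boto3; the endpoint, subnet, security group, and secret values are placeholders.

```python
import boto3

glue = boto3.client("glue")

# JDBC connection that places Glue-managed ENIs in the database's private subnet.
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-private-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://example-db.abc123.us-east-1.rds.amazonaws.com:3306/appdb",
            "SECRET_ID": "arn:aws:secretsmanager:us-east-1:123456789012:secret:mysql-creds-EXAMPLE",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```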
Ask Bash
What is a JDBC connection in AWS Glue?
How does AWS Secrets Manager help with database credentials in Glue?
Why is it best practice to keep an RDS instance in a private subnet?
Why does the Glue crawler need a JDBC connection for an RDS database?
What is the purpose of AWS Secrets Manager in this configuration?
How does the Glue crawler reach a private RDS instance within the same VPC?
Looks like that's it! You can go back and review your answers or click the button below to grade your test.