AWS Certified Data Engineer Associate Practice Test (DEA-C01)
Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

AWS Certified Data Engineer Associate DEA-C01 Information
The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.
The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.
Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
- 20 Questions
- Unlimited
- Data Ingestion and Transformation, Data Store Management, Data Operations and Support, Data Security and Governance
Your team needs a managed, serverless workflow that starts when an object arrives under s3://sales/landing/. The workflow must invoke a Lambda function to validate each file, run an AWS Glue Spark job to transform the data, then call another Lambda to load the result into Amazon Redshift. It must provide automatic per-step retries, execution history, and one-click resume from failures. Which solution is most cost-effective?
Set up an Amazon EventBridge pipe to invoke the first Lambda function; have that function synchronously call the Glue job and second Lambda while implementing all retries in code.
Deploy an Amazon MWAA environment and author an Apache Airflow DAG that coordinates the two Lambda tasks and the Glue job.
Build an AWS Glue Workflow that runs the Glue job and add the two Lambda steps as Python shell jobs inside the workflow.
Create an AWS Step Functions state machine that invokes the two Lambda functions and the AWS Glue job, and trigger the state machine with an Amazon EventBridge rule for the S3 prefix.
Answer Description
AWS Step Functions is purpose-built for orchestrating serverless workflows. It offers native integrations with both AWS Lambda and AWS Glue jobs, records a complete execution history, and supports configurable retry logic or catch paths for each state, allowing failed executions to be restarted at the failed step. An EventBridge rule can invoke the state machine when new S3 objects arrive, so no custom scheduling code is needed. EventBridge Pipes alone is an event-routing service; the first Lambda would need to manage the Glue and Redshift steps as well as error handling in code. Glue Workflows orchestrate only Glue jobs, crawlers, and triggers; adding Lambda requires workarounds such as Python shell jobs and still lacks cross-service state tracking. MWAA can meet the functional need but introduces continuous environment costs and additional operational overhead, making it less cost-effective than the fully managed Step Functions solution.
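As a rough illustration of this pattern (Python with boto3; the function names validate-file and load-to-redshift, the Glue job name sales-transform, and the role ARN are placeholders rather than values from the question), a minimal state machine definition chaining the three steps with per-step retries might look like this:

```python
import json
import boto3

# Hypothetical resource names used only for illustration.
definition = {
    "StartAt": "ValidateFile",
    "States": {
        "ValidateFile": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "validate-file", "Payload.$": "$"},
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "BackoffRate": 2.0}],
            "Next": "TransformWithGlue",
        },
        "TransformWithGlue": {
            "Type": "Task",
            # The .sync integration waits for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "sales-transform"},
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "BackoffRate": 2.0}],
            "Next": "LoadToRedshift",
        },
        "LoadToRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "load-to-redshift", "Payload.$": "$"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="sales-landing-pipeline",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsPipelineRole",  # placeholder
)
```

An EventBridge rule scoped to the s3://sales/landing/ prefix would then target this state machine, so the pipeline starts automatically whenever a new object lands.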
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are AWS Step Functions, and why are they suited for serverless workflows?
What is the role of Amazon EventBridge in triggering workflows?
Why is an AWS Glue Workflow less suitable for this use case?
An AWS Glue ETL job writes driver logs to the log group /aws-glue/jobs/output in JSON format. Each log event contains the fields level, message, and jobRunId. You must use CloudWatch Logs Insights to quickly show a count of unique jobRunId values that logged the string "ERROR TimeoutException" during the last 24 hours, while minimizing query cost. Which query meets these requirements?
fields @timestamp, jobRunId, message | filter message like /TimeoutException/ | stats count_distinct(message)
fields @timestamp, jobRunId, message | sort @timestamp desc | filter message like /ERROR TimeoutException/ | limit 1000 | stats count_distinct(jobRunId)
filter message like /ERROR TimeoutException/ | stats count(jobRunId)
fields message, jobRunId | filter message like /ERROR TimeoutException/ | stats count_distinct(jobRunId) as affectedRuns
Answer Description
The correct query filters on the exact error text, selects only the required fields (message and jobRunId), and then uses count_distinct(jobRunId) to report the number of affected job runs in a single pass. Alternative queries either count every matching event instead of distinct jobRunId values, apply sort or limit before aggregation (which can omit relevant events and add processing overhead), or aggregate the wrong field.
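If you want to run the winning query programmatically, a minimal boto3 sketch might look like the following; the log group name comes from the question, while the polling loop is simplified for illustration:

```python
import time
import boto3

logs = boto3.client("logs")

query = (
    "fields message, jobRunId "
    "| filter message like /ERROR TimeoutException/ "
    "| stats count_distinct(jobRunId) as affectedRuns"
)

now = int(time.time())
resp = logs.start_query(
    logGroupName="/aws-glue/jobs/output",
    startTime=now - 24 * 3600,  # last 24 hours
    endTime=now,
    queryString=query,
)

# Poll until the query completes, then print the distinct-run count.
while True:
    results = logs.get_query_results(queryId=resp["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(results["results"])
```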
Ask Bash
What is CloudWatch Logs Insights?
What does count_distinct(jobRunId) do in CloudWatch Logs Insights?
How does filtering with message like /ERROR TimeoutException/ work?
A company has 5 TB of structured sales data that analysts query using complex joins, window functions, and aggregations. The queries must return results within seconds during business hours, and the team wants automatic columnar storage compression without managing infrastructure. Which AWS storage platform should be used to host the dataset to meet these performance characteristics?
Amazon DynamoDB
Amazon Redshift
Amazon RDS for MySQL
An AWS Lake Formation data lake on Amazon S3 queried with Amazon Athena
Answer Description
Amazon Redshift is a fully managed, petabyte-scale data warehouse designed for analytic workloads. It stores data in columnar format, applies automatic compression, and distributes processing across nodes, allowing complex SQL queries with joins and aggregations to return in seconds. Amazon RDS is optimized for transactional workloads and can struggle with multi-terabyte analytics at this latency. DynamoDB is a NoSQL key-value store that does not support relational joins. An S3-based data lake managed through AWS Lake Formation provides durable storage but relies on external engines (for example, Amazon Athena) whose performance is typically slower for highly concurrent, complex relational queries compared with a dedicated data warehouse.
Ask Bash
What is columnar storage in Amazon Redshift?
How does Amazon Redshift handle complex queries efficiently?
Why is Amazon Redshift better for analytics compared to Amazon DynamoDB?
A data engineer is developing a production ML workflow that uses Amazon SageMaker Pipelines to read raw files from Amazon S3, perform data preprocessing, train a model, and deploy the model to a SageMaker endpoint. The company must keep an auditable, end-to-end record of every dataset, processing job, model version, and endpoint created by the pipeline while writing as little custom tracking code as possible. Which solution meets these requirements?
Refactor the workflow into AWS Step Functions and enable AWS X-Ray tracing so that each state transition captures lineage information for audit queries.
Run an AWS Glue crawler after every pipeline step and store the results in the AWS Glue Data Catalog to represent lineage between datasets, jobs, and models.
Enable SageMaker ML Lineage Tracking in the SageMaker Pipeline so that each step automatically registers its artifacts and relationships, then query the lineage graph through the SageMaker Lineage API.
Turn on AWS CloudTrail for all SageMaker API calls and analyze the resulting logs with Amazon Athena to reconstruct the lineage of artifacts.
Answer Description
Amazon SageMaker ML Lineage Tracking is natively integrated with SageMaker Pipelines. When lineage tracking is enabled, each pipeline execution automatically records artifacts such as datasets, processing jobs, training jobs, models, and endpoints, and registers the relationships among them. These artifacts and their dependencies can be queried through the SageMaker Lineage API, used in automated compliance reports, or visualized in SageMaker Studio. AWS Glue crawlers catalog only data locations and cannot track transformations or model artifacts. CloudTrail logs must be parsed and correlated manually and do not provide semantic lineage. Step Functions with X-Ray trace requests but do not capture domain-specific ML artifacts without extensive custom instrumentation.
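A minimal sketch of reading recorded lineage with boto3 is shown below; it assumes lineage tracking has already registered artifacts, and the artifact ARN is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical artifact ARN for a dataset registered by a pipeline step.
dataset_arn = "arn:aws:sagemaker:us-east-1:123456789012:artifact/example-dataset"

# List downstream associations recorded by lineage tracking
# (e.g. dataset -> processing job -> model -> endpoint).
resp = sm.list_associations(SourceArn=dataset_arn)
for assoc in resp["AssociationSummaries"]:
    print(assoc["AssociationType"], "->", assoc["DestinationArn"])
```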
Ask Bash
What is SageMaker ML Lineage Tracking?
What is the role of the SageMaker Lineage API?
How does SageMaker Pipelines integrate with ML Lineage Tracking?
A data engineering team manages a MySQL database hosted on Amazon RDS. Compliance requires that the application password be rotated automatically every 30 days without manual scripting. The analytics pipeline runs on AWS Lambda functions in the same account. Which approach meets the requirement while minimizing operational overhead?
Encrypt the password with AWS KMS, save it in a Lambda environment variable, and update the variable manually through a CI/CD pipeline each month.
Store the password in AWS Systems Manager Parameter Store as a SecureString and use an EventBridge rule to trigger a custom Lambda function to rotate it every 30 days.
Set the master password in Amazon RDS to the keyword AWS_ROTATE to enable automatic rotation and allow Lambda to read the password from the DB instance endpoint.
Store the password in AWS Secrets Manager, enable the built-in RDS MySQL rotation schedule, and grant the Lambda execution role permission to retrieve the secret.
Answer Description
AWS Secrets Manager offers a built-in rotation feature for Amazon RDS databases. Enabling a rotation schedule creates an AWS-managed Lambda function that updates the database password and stores the new value in the same secret, eliminating the need for custom scripts. The Lambda functions in the pipeline can fetch the current password at run time by using an execution role that has secretsmanager:GetSecretValue permission.
AWS Systems Manager Parameter Store SecureString cannot rotate credentials automatically; implementing rotation would require a custom rule and script. Storing a KMS-encrypted value in Lambda environment variables still requires a manual update process, and Amazon RDS does not support a keyword such as AWS_ROTATE to trigger automatic password changes. Therefore, using AWS Secrets Manager with built-in rotation is the only option that satisfies the 30-day rotation requirement with the least operational effort.
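A minimal sketch of how one of the pipeline's Lambda functions could read the rotated credential at run time is shown below; the secret name is hypothetical:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # Fetch the current version so the function always picks up rotated passwords.
    secret = secrets.get_secret_value(SecretId="prod/analytics/mysql")  # hypothetical name
    creds = json.loads(secret["SecretString"])
    user, password = creds["username"], creds["password"]
    # ... connect to the RDS MySQL database with user/password ...
    return {"status": "ok"}
```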
Ask Bash
What is AWS Secrets Manager?
How does AWS Secrets Manager perform automatic rotation for RDS passwords?
Why is AWS Systems Manager Parameter Store not suitable for automatic credential rotation?
A data engineer is building an AWS Step Functions Standard workflow that will invoke an AWS Glue job for each of 200 daily S3 partitions. No more than 10 Glue jobs should run at the same time, each invocation must automatically retry twice with exponential backoff for transient errors, and the workflow must fail immediately on a custom "DATA_VALIDATION_FAILED" error returned by the job. Which Step Functions design will meet these requirements with the least custom code?
Run an Express Step Functions workflow triggered by Amazon EventBridge rules that submit Glue jobs in batches of 10 until all partitions are processed.
Use a Parallel state with 10 static branches; each branch invokes the Glue job for a subset of partitions.
Invoke the Glue job from a Lambda function in a Task state and write custom code in the function to iterate through partitions, manage retries, and enforce a 10-job concurrency limit.
Create a Map state that passes the array of partition prefixes, set MaxConcurrency to 10, and configure Retry with backoffRate and a Catch clause for the DATA_VALIDATION_FAILED error.
Answer Description
A Map state natively iterates over a JSON array and can control parallelism with the MaxConcurrency field, ensuring that no more than the specified number of iterations (10) run simultaneously. Inside the Map state's Item processor, you can add a Task state that starts the Glue job, apply a Retry clause that specifies a maximum of two attempts with an exponential backoffRate, and add a Catch clause that matches the custom "DATA_VALIDATION_FAILED" error to fail the workflow immediately. This solution uses only Step Functions features and requires no additional Lambda code or complex branching logic.
Parallel states launch a fixed number of branches and therefore cannot dynamically scale to 200 partitions while still limiting concurrency to 10 without extra logic. Iterating in a Lambda function shifts the retry and concurrency control to application code, adding operational overhead. Express workflows cannot directly throttle concurrent Glue invocations and would still require external coordination.
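An illustrative fragment of such a Map state, expressed as a Python dictionary in Amazon States Language terms, might look like this (the Glue job name and argument key are placeholders):

```python
# Illustrative Amazon States Language fragment for the Map state.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.partitionPrefixes",  # array of 200 daily partition prefixes
    "MaxConcurrency": 10,                # at most 10 Glue jobs run at once
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "INLINE"},
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {
                    "JobName": "partition-etl",          # hypothetical job name
                    "Arguments": {"--prefix.$": "$"},
                },
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Catch": [{
                    "ErrorEquals": ["DATA_VALIDATION_FAILED"],
                    "Next": "FailWorkflow",
                }],
                "End": True,
            },
            "FailWorkflow": {"Type": "Fail", "Error": "DATA_VALIDATION_FAILED"},
        },
    },
    "End": True,
}
```

Because the Catch clause routes the custom error to a Fail state, the iteration (and therefore the Standard workflow) stops immediately, while transient errors are retried twice with exponential backoff.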
Ask Bash
What is a Map state in AWS Step Functions?
How does MaxConcurrency work in Step Functions?
What is exponential backoff in Step Functions retry policy?
A data engineer must enable analysts to run ad hoc SQL queries from Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR Presto against semi-structured JSON files stored in an S3 data lake. The solution must avoid duplicating table definitions and should automatically detect new daily partitions that land in the same S3 prefix. Which approach meets these requirements with minimal operational overhead?
Embed the JSON schema in every Spark job and instruct analysts to load the data into temporary views before running SQL queries.
Create separate external tables with identical names in Athena, Redshift Spectrum, and the EMR Hive metastore, updating each table manually when partitions arrive.
Configure an AWS Glue crawler on the S3 prefix to populate an AWS Glue Data Catalog table and have all query engines reference that catalog.
Store Avro schema definition files alongside the data in S3 and rely on each engine's SerDe to discover new partitions at query time.
Answer Description
AWS Glue Data Catalog provides a centralized Hive-compatible metastore that is natively supported by Athena, Redshift Spectrum, and EMR. Creating an AWS Glue crawler on the S3 prefix automatically infers the schema and adds or updates partitions on a schedule, so all three query engines can immediately consume the new data without additional DDL. Defining external tables separately, embedding schemas in application code, or storing inline schema files would require manual updates for every new partition and would not give the services a shared catalog.
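A minimal boto3 sketch of creating such a crawler is shown below; the crawler name, IAM role, database, S3 path, and schedule are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Names, role, and paths are illustrative.
glue.create_crawler(
    Name="clickstream-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://example-datalake/clickstream/"}]},
    Schedule="cron(0 1 * * ? *)",  # run daily to pick up new partitions
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)
```

Athena, Redshift Spectrum, and EMR then all reference the same Glue Data Catalog table, so no table definition is duplicated.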
Ask Bash
What is AWS Glue and how do crawlers work?
What is the AWS Glue Data Catalog and why is it important?
How does partition detection work with AWS Glue crawlers?
An insurance company keeps policy documents in an Amazon S3 bucket that has versioning enabled. Regulations require that every object, including all previous versions, must be permanently deleted exactly 7 years (2,555 days) after its creation. The solution must prove compliance while minimizing operational overhead and maintenance work. Which action will meet these requirements?
Configure an AWS Backup plan for the bucket with a 7-year retention rule so that the original objects are deleted after the backups expire.
Create an S3 Lifecycle rule with two expiration actions that permanently delete current object versions and noncurrent object versions after 2,555 days, and enable removal of expired object delete markers.
Set up an EventBridge rule that invokes an AWS Lambda function daily to list objects older than 2,555 days and delete each version individually.
Enable S3 Object Lock in compliance mode with a 7-year retention period so that objects are automatically removed when the retention period ends.
Answer Description
An S3 Lifecycle configuration can natively expire both current and noncurrent object versions after a specified number of days. By adding separate expiration actions for current versions and for noncurrent versions set to 2,555 days, and enabling the option to remove expired object delete markers, every copy of the object is removed automatically when it reaches the mandated age. This solution is completely managed, requires no code to maintain, and produces an auditable lifecycle rule that demonstrates compliance.
S3 Object Lock prevents deletion until the retention period ends, but it does not automatically remove objects when the period expires; manual deletion would still be required. AWS Backup retention applies only to backup copies, not to the original S3 objects, so the source data would remain. A scheduled EventBridge-Lambda process could work, but it introduces custom code and ongoing operational overhead, making it less suitable than a native lifecycle policy.
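A minimal boto3 sketch of such a lifecycle configuration is shown below; the bucket name is a placeholder, and the expired-delete-marker cleanup sits in its own rule because it cannot share an Expiration action that also specifies Days:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-policy-documents",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-all-versions-after-7-years",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": 2555},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 2555},
            },
            {
                "ID": "clean-up-expired-delete-markers",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
        ]
    },
)
```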
Ask Bash
What is Amazon S3 versioning?
How does an S3 Lifecycle rule work?
What are expired object delete markers in S3?
An analytics team ingests clickstream logs into Amazon S3 and uses nightly AWS Glue Spark jobs to aggregate the data and load it into Amazon Redshift. Auditors must be able to trace each Redshift column back to the exact S3 objects that produced it to verify data accuracy. Which approach delivers automatic column-level data lineage with minimal operational overhead?
Run the transformations with AWS Glue ETL jobs and use the AWS Glue Data Catalog's built-in lineage features to track sources and targets.
Enable AWS CloudTrail data events for S3 and Redshift and analyze the logs in Amazon Athena to reconstruct lineage.
Schedule Amazon Inspector assessments of the Redshift cluster to generate data provenance reports.
Attach custom S3 object tags that identify lineage and propagate the tags through each Glue job using job parameters.
Answer Description
AWS Glue records lineage information every time a supported crawler or ETL job runs. When Glue jobs read objects in S3 and write tables in Amazon Redshift, the Glue Data Catalog is updated with table- and column-level lineage that shows the relationship between the S3 sources and the Redshift targets. This information can be explored in the Glue console or queried through the Glue API, requiring no custom tagging or additional services. Adding custom tags (object metadata) or parsing CloudTrail logs would require manual maintenance and do not provide column-level lineage, while Amazon Inspector does not offer data lineage capabilities.
Ask Bash
What is AWS Glue Data Catalog?
How does AWS Glue track data lineage automatically?
Why are custom tags and CloudTrail logs unsuitable for column-level lineage?
An e-commerce startup ingests clickstream events into an Amazon DynamoDB table. Traffic is highly unpredictable: most of the day only a few hundred writes per minute occur, but flash-sale campaigns generate short spikes of up to 50,000 writes per second. The team wants the simplest configuration that keeps costs low during idle periods while automatically absorbing the spikes without throttling. Which solution satisfies these requirements?
Enable on-demand capacity mode and turn on TTL so write capacity automatically drops to zero when items expire.
Create the table in on-demand capacity mode; rely on its automatic scaling for write traffic.
Configure the table with 5,000 provisioned WCUs and attach a multi-node DynamoDB Accelerator (DAX) cluster to absorb burst writes.
Use provisioned capacity mode with 50,000 write capacity units and enable auto scaling between 1,000 and 50,000 WCUs.
Answer Description
DynamoDB on-demand capacity mode charges only for the read and write requests that are actually made and can instantly scale to thousands of requests per second, so it handles sudden flash-sale bursts without advance capacity planning. Provisioned capacity with auto scaling can lag several minutes before increasing limits, risking throttling and unnecessary cost if pre-provisioned at peak. DAX accelerates reads, not writes, and DynamoDB TTL affects storage cost, not write throughput. Therefore, creating the table in on-demand capacity mode is the most cost-effective and operationally simple choice for this workload.
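A minimal boto3 sketch of creating the table in on-demand mode is shown below; the table and key names are illustrative, since the question does not specify a schema:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Table and attribute names are illustrative.
dynamodb.create_table(
    TableName="clickstream-events",
    AttributeDefinitions=[
        {"AttributeName": "sessionId", "AttributeType": "S"},
        {"AttributeName": "eventTime", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "sessionId", "KeyType": "HASH"},
        {"AttributeName": "eventTime", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity mode; no WCU/RCU planning
)
```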
Ask Bash
What is DynamoDB on-demand capacity mode?
How does auto-scaling in provisioned capacity mode differ from on-demand capacity mode?
Why is DynamoDB Accelerator (DAX) not applicable for absorbing write spikes?
A data engineering team launches a transient Amazon EMR cluster each night through an AWS Step Functions workflow. Before any Spark job runs, the cluster must have a proprietary JDBC driver installed on every node. After installation, a PySpark ETL script stored in Amazon S3 must be executed. What is the most operationally efficient way to meet these requirements using native EMR scripting capabilities?
Configure a bootstrap action that downloads and installs the driver on all nodes, then add an EMR step that runs spark-submit on the PySpark script in Amazon S3.
Schedule an EMR Notebook that first installs the driver with pip commands and then executes the PySpark code, triggered nightly by a cron expression.
Pass a shell script to a Hadoop Streaming step that both installs the driver and calls the PySpark script in a single command.
Build a custom AMI with the driver pre-installed and specify the PySpark ETL through classification properties when creating the cluster.
Answer Description
Bootstrap actions are executed on every node as the cluster is provisioning, making them ideal for installing additional software such as a JDBC driver before any jobs start. After the cluster is ready, an EMR step can invoke spark-submit to run a PySpark script that resides in Amazon S3. This combination uses built-in EMR scripting features, requires no custom AMI maintenance, and fits well into an automated Step Functions orchestration. Notebooks do not install software on all nodes automatically and are harder to schedule. Custom AMIs achieve the goal but add ongoing image-management overhead. Using Hadoop Streaming for software installation and Spark execution is possible but not intended for this scenario and complicates the workflow.
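A minimal boto3 sketch of launching such a transient cluster is shown below; the release label, instance counts, S3 paths, and roles are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Bucket paths, script names, and roles are placeholders.
emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster terminates after steps
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    BootstrapActions=[{
        "Name": "install-jdbc-driver",
        "ScriptBootstrapAction": {
            "Path": "s3://example-bucket/bootstrap/install_driver.sh",
        },
    }],
    Steps=[{
        "Name": "run-pyspark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/scripts/etl.py"],
        },
    }],
)
```

A Step Functions task could issue this same call (or use the optimized EMR integration) as part of the nightly workflow.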
Ask Bash
What is a bootstrap action in Amazon EMR?
How does EMR integrate with AWS Step Functions?
Why is using an EMR step with `spark-submit` operationally efficient?
You manage an Amazon EKS cluster that runs containerized Apache Spark batch jobs that transform data in Amazon S3. The cluster uses a fixed managed node group of twenty m5.xlarge On-Demand instances. During nightly runs CPU utilization exceeds 80 percent and jobs slow, but daytime utilization is under 10 percent. You must boost performance and cut idle costs with minimal operations effort. Which approach meets these goals?
Install the Kubernetes Cluster Autoscaler on the EKS cluster, create a managed node group that mixes On-Demand and Spot Instances, and set CPU and memory requests for all Spark pods.
Create an EKS Fargate profile for the Spark namespace so every Spark pod runs on Fargate while keeping the existing node group for system pods.
Migrate the Spark containers to Amazon ECS and enable Service Auto Scaling based on average CPU utilization across tasks.
Increase the existing node group to forty m5.xlarge instances and enable vertical pod autoscaling for Spark executors to remove resource contention.
Answer Description
Using the Kubernetes Cluster Autoscaler with an EKS managed node group that contains a mix of On-Demand and Spot Instances allows the cluster to add capacity when Spark jobs need it and automatically scale in when demand falls. Defining CPU and memory requests for the Spark driver and executor pods gives the autoscaler the information it needs to schedule additional nodes only when required. This improves job throughput at night while avoiding unutilized compute during the day, and Spot Instances lower cost further. Moving all workloads to Fargate would simplify management but is typically more expensive for long-running, compute-intensive Spark jobs. Migrating to ECS changes the platform and offers no inherent cost benefit for Spark. Simply doubling the node group and relying on vertical pod autoscaling removes the performance bottleneck but increases costs and does not reclaim idle capacity.
Ask Bash
What is the Kubernetes Cluster Autoscaler?
What are Spot Instances and how do they reduce costs?
Why are CPU and memory requests important in Spark pods?
Every day at 02:00 UTC, a healthcare company must ingest the previous day's CSV file from an Amazon S3 bucket into a staging table in Amazon Redshift. The team wants a fully managed, serverless solution that minimizes cost and ongoing administration while reliably running at the scheduled time. Which approach best meets these requirements?
Create a cron-based Amazon EventBridge rule that starts an AWS Glue ETL job, which reads the CSV file from S3 and writes it to Amazon Redshift.
Deploy an Amazon Managed Workflows for Apache Airflow (MWAA) environment and schedule a DAG that issues a Redshift COPY command for the file in S3.
Launch a transient Amazon EMR cluster each night that runs a Spark job to copy the file from S3 to Redshift, then terminates the cluster.
Configure an Amazon Kinesis Data Firehose delivery stream with a Lambda transformation to send data to Redshift and enable the stream at 02:00 UTC using the AWS CLI.
Answer Description
An Amazon EventBridge rule can invoke an AWS Glue job on a cron schedule without provisioning or managing servers. Glue provides a native Redshift connection that issues a COPY operation under the hood, so the job can efficiently load the CSV file from S3 into the staging table. EventBridge and Glue are both serverless and incur charges only when rules fire or jobs run, meeting the cost-effectiveness and low-ops goals.
Launching an EMR cluster requires cluster provisioning, node management, and longer billing intervals. MWAA introduces additional infrastructure to manage and a higher baseline cost for the environment. Kinesis Data Firehose is optimized for continuous streaming, cannot be "scheduled" to start at a specific time, and would run (and bill) continuously, increasing cost and complexity.
Ask Bash
What is Amazon EventBridge?
How does AWS Glue connect to Amazon Redshift?
Why is Kinesis Data Firehose not ideal for scheduled tasks?
A data engineering team must allow an AWS Glue job running in account A to write objects to an Amazon S3 bucket that belongs to account B. The solution must prevent storage of long-lived credentials inside the job code and must operate without human interaction. Which authentication method should the team use?
Create an IAM user in account B, store its access keys in AWS Secrets Manager, and retrieve them from the job at runtime.
Configure an IAM role in account B and allow the AWS Glue job to assume that role by using AWS STS.
Upload an X.509 client certificate so the Glue job can use mutual TLS authentication with Amazon S3.
Generate a pre-signed S3 URL and embed it in the Glue job parameters before each run.
Answer Description
Assuming an IAM role in account B provides temporary AWS Security Token Service (STS) credentials to the Glue job. Because the role is trusted by account A, the job can call STS to obtain short-lived keys whenever it runs, eliminating any need to embed or rotate long-term secrets. This is a role-based authentication mechanism. A pre-signed URL would require someone or some process to generate and inject the URL before each run. Storing permanent access keys in AWS Secrets Manager still relies on long-lived credentials that must be rotated. Amazon S3 does not support mutual TLS with customer-provided X.509 client certificates, so a certificate-based approach is not feasible.
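A minimal boto3 sketch of the role-assumption flow is shown below; the role ARN, bucket, and key are placeholders for the account-B resources:

```python
import boto3

sts = boto3.client("sts")

# Role ARN and bucket are placeholders for account B.
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222233334444:role/CrossAccountS3WriteRole",
    RoleSessionName="glue-job-write",
)["Credentials"]

# Temporary credentials expire automatically; no long-lived secret is stored.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.put_object(Bucket="account-b-bucket", Key="output/data.parquet", Body=b"...")
```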
Ask Bash
What is AWS STS and how does it work?
Why is assuming an IAM role considered more secure than pre-signed URLs or storing credentials?
What is the importance of cross-account access in AWS?
A data engineer is configuring a Spark job on an existing Amazon EMR cluster that periodically connects to an Amazon Redshift database. The job must retrieve the database user name and password at runtime. Security mandates that the credentials are encrypted at rest, automatically rotated every 30 days, and accessed through IAM roles without code changes. Which solution meets these requirements?
Store credentials as SecureString parameters in AWS Systems Manager Parameter Store encrypted with a customer managed KMS key. Grant the EMR instance profile role permission to read the parameters.
Embed the credentials in the cluster bootstrap action script and restrict script access with an EMR security configuration; create an IAM role that allows reading the script.
Store credentials in AWS Secrets Manager, enable built-in rotation with an AWS Lambda function scheduled every 30 days, and allow the EMR instance profile role to read the secret.
Place a JSON file containing the credentials in an Amazon S3 bucket encrypted with SSE-KMS and rotate the object every 30 days using a CloudWatch Events rule and Lambda.
Answer Description
AWS Secrets Manager encrypts secrets at rest with AWS KMS, integrates natively with IAM roles so the EMR instance profile can retrieve the secret with no code modification, and provides built-in automatic rotation through an AWS-managed schedule that invokes a Lambda function. Systems Manager Parameter Store SecureString parameters satisfy encryption and IAM integration but lack native rotation. Storing credentials in Amazon S3 or embedding them in bootstrap scripts requires manual rotation and increases the risk of exposure, so these options do not fulfill the security team's requirements.
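A minimal boto3 sketch of turning on the 30-day rotation schedule is shown below; the secret name and rotation Lambda ARN are placeholders:

```python
import boto3

secrets = boto3.client("secretsmanager")

# Secret name and rotation Lambda ARN are placeholders.
secrets.rotate_secret(
    SecretId="prod/redshift/etl-user",
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRotation",
    RotationRules={"AutomaticallyAfterDays": 30},
)
```

At run time the Spark job simply calls get_secret_value through the EMR instance profile role, so rotation itself requires no application code changes.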
Ask Bash
What is AWS Secrets Manager?
How does AWS Secrets Manager enable automated rotation?
What is the difference between Secrets Manager and Systems Manager Parameter Store?
An Amazon Athena table stores clickstream events as Parquet files in an S3 location partitioned by year, month, and day. A nightly ETL job currently runs the following query and is incurring high scan costs:
SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31';
How should you rewrite the SQL to scan the least amount of data without changing the table definition?
Append a LIMIT clause so the statement becomes:
SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31' LIMIT 100000;
Add a filter on the partition columns, for example:
SELECT user_id, page, event_time FROM clickstream WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31;
Include an ORDER BY year, month, day clause to ensure the data is read in partition order.
Create a common table expression (CTE) that selects all columns and then filter the CTE on event_time within the main query.
Answer Description
Athena partitions are stored as separate folders in Amazon S3. When a query's WHERE clause references the partition columns, Athena prunes the unrelated partitions and reads only the relevant files, which reduces the amount of data scanned and lowers cost. Filtering solely on event_time does not use partition pruning because that column is stored inside the files, not in the partition path. A LIMIT clause, ORDER BY, or a common table expression does not affect how much data is read from S3, so they provide no scan-cost benefit.
Ask Bash
What is partition pruning in Amazon Athena?
Why doesn’t filtering on non-partitioned columns reduce scan costs?
How does partitioning in Athena improve query performance?
A data engineer loads transformed sales totals into Amazon Redshift Serverless each night. An external partner needs to query the current day's total over the internet through a low-latency HTTPS endpoint. The partner cannot obtain AWS credentials but can pass an API key for authentication. The solution must remain fully serverless and require the least operational overhead. Which approach satisfies these requirements?
Write the daily total to a JSON file in an Amazon S3 bucket and share a presigned URL with the partner.
Expose the Amazon Redshift Data API endpoint to the partner and store database credentials in AWS Secrets Manager.
Deploy a microservice on Amazon ECS Fargate behind an Application Load Balancer that connects to Amazon Redshift with JDBC and returns results.
Create a REST API in Amazon API Gateway that requires an API key and invokes an AWS Lambda function, which queries Amazon Redshift through the Redshift Data API and returns JSON.
Answer Description
Using Amazon API Gateway with an attached usage plan lets the company require an API key for every request. A Lambda function behind the API runs simple SELECT statements by calling the Amazon Redshift Data API, formats the result as JSON, and returns it. All components are serverless, no network endpoints for Redshift are exposed, and API Gateway handles throttling and key management. Directly exposing the Redshift Data API would require the partner to sign requests with AWS credentials. Running a container service behind an Application Load Balancer introduces additional infrastructure to operate. Publishing a daily file to Amazon S3 does not provide on-demand queries and relies on presigned URLs rather than API-key authentication.
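A minimal sketch of the Lambda handler behind the API is shown below; the workgroup, database, and SQL statement are placeholders, and the polling loop is simplified for illustration:

```python
import json
import time
import boto3

client = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Workgroup, database, and SQL are placeholders for the partner-facing query.
    stmt = client.execute_statement(
        WorkgroupName="analytics-wg",
        Database="sales",
        Sql="SELECT total FROM daily_sales_totals WHERE sale_date = CURRENT_DATE",
    )
    # Simplified polling; production code would bound the wait and handle FAILED runs.
    while True:
        status = client.describe_statement(Id=stmt["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(0.2)
    rows = client.get_statement_result(Id=stmt["Id"])["Records"]
    return {"statusCode": 200, "body": json.dumps({"records": rows})}
```

API Gateway sits in front of this function with a usage plan, so the partner authenticates with an API key while the Lambda execution role handles Redshift access.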
Ask Bash
What is the Amazon Redshift Data API?
How does Amazon API Gateway authenticate with API keys?
Why is AWS Lambda a good choice for querying Amazon Redshift in this solution?
The analytics team stores PII in an Amazon S3 data lake in us-east-2 and protects it with AWS Backup. Company policy mandates that no backups or object replicas may ever leave us-east-2. You need an organization-wide control that prevents any engineer from configuring cross-Region replication or AWS Backup copy jobs to other Regions while still allowing normal operations in us-east-2. Which approach meets the requirement with minimal ongoing maintenance?
Encrypt all recovery points with a customer-managed AWS KMS key that exists solely in us-east-2 and rotate the key quarterly.
Enable Amazon S3 Same-Region Replication on every bucket and remove all cross-Region copy rules from existing AWS Backup plans.
Attach an AWS Organizations SCP that denies s3:PutBucketReplication, s3:CreateBucket, and backup:StartCopyJob whenever aws:RequestedRegion or s3:LocationConstraint is not "us-east-2", and apply the policy to the OU that contains all data accounts.
Create VPC interface endpoints for Amazon S3 and AWS Backup only in us-east-2 and delete the endpoints in all other AWS Regions.
Answer Description
An AWS Organizations service control policy (SCP) is evaluated before IAM policies in every member account, so an explicit Deny cannot be overridden. A Deny statement that fires when aws:RequestedRegion is not "us-east-2" blocks any API call aimed at another Region. Adding conditions such as s3:LocationConstraint and denying critical calls like s3:PutBucketReplication, s3:CreateBucket, and backup:StartCopyJob ensures engineers cannot create resources or start copy jobs that would place data outside the permitted Region. Because the SCP is attached to the organizational unit, it automatically applies to new accounts, buckets, and backup plans with no further action.
The Same-Region Replication answer relies on every bucket and backup plan being configured correctly and could be changed by developers. Restricting VPC interface endpoints only limits private network access; S3 replication and AWS Backup can use public endpoints that would still succeed. A Region-specific KMS key controls access to existing backups but does not stop a copy job from storing data in a vault in another Region, even if that data is encrypted.
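A minimal boto3 sketch of creating and attaching such an SCP is shown below; the OU ID is a placeholder, and the additional s3:LocationConstraint condition mentioned above is omitted for brevity:

```python
import json
import boto3

org = boto3.client("organizations")

# Condition key and denied actions mirror the answer; the OU ID is a placeholder.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDataLeavingUsEast2",
        "Effect": "Deny",
        "Action": ["s3:PutBucketReplication", "s3:CreateBucket", "backup:StartCopyJob"],
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": "us-east-2"}},
    }],
}

policy = org.create_policy(
    Content=json.dumps(scp),
    Description="Keep backups and replicas in us-east-2",
    Name="restrict-region-us-east-2",
    Type="SERVICE_CONTROL_POLICY",
)
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-example-12345678",  # placeholder OU ID
)
```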
Ask Bash
What is an AWS Organizations Service Control Policy (SCP)?
How does the `aws:RequestedRegion` condition help enforce Region-specific policies?
Why is attaching an SCP to an Organizational Unit (OU) beneficial for long-term policy enforcement?
After launching a new mobile game, a company ingests 20,000 player-event records per second through Amazon Kinesis Data Streams. An in-game personalization microservice must retrieve the most recent statistics for an individual player in less than 10 ms. Events older than 24 hours will be queried ad-hoc in Amazon Athena. Which data-storage approach best meets these requirements while minimizing cost?
Store each event in an Amazon DynamoDB table keyed by playerId with a 24-hour TTL; process the DynamoDB stream with AWS Lambda to batch write expired and changed items to Amazon S3 for Athena.
Use Amazon Kinesis Data Analytics to aggregate events and load them into an Amazon Redshift cluster; have the microservice query Redshift for personalization and analysts run reports on the same cluster.
Publish events to an Amazon MSK topic; have the microservice read the topic for player statistics and use MSK Connect to continuously sink the stream to Amazon S3 for Athena.
Configure Amazon Kinesis Data Firehose to deliver events directly to Amazon S3 in Parquet format and have both the microservice and analysts query the data with Amazon Athena.
Answer Description
Amazon DynamoDB delivers predictable single-digit millisecond latency, satisfying the <10 ms lookups required by the personalization microservice. Each record includes a TTL value set to 24 hours; DynamoDB automatically deletes items within the next couple of days after their expiration timestamp, which keeps the hot working set small and low-latency without consuming write capacity. With DynamoDB Streams enabled, every insert, update, and the eventual TTL delete record is sent to an AWS Lambda function that writes the data to Amazon S3. This provides a complete history for Amazon Athena at very low storage cost. The alternative solutions either cannot guarantee sub-10 ms access (Athena over S3), introduce higher operational cost and latency (Redshift), or are optimized for sequential stream consumption rather than on-demand key lookups (MSK).
Ask Bash
What is DynamoDB TTL and how does it work?
What does DynamoDB Streams do?
Why is Amazon S3 used for ad-hoc queries with Amazon Athena?
A company ingests 50,000 IoT sensor readings per second. Each record is less than 1 KB of JSON. Data must be available for dashboards that query individual device readings with single-digit millisecond latency. Records are retained for 30 days, after which they should be automatically removed without administrator intervention. Which AWS storage service best meets these requirements while minimizing operational overhead?
Amazon Redshift cluster using automatic table vacuum and retention policies
Amazon Aurora MySQL with read replica autoscaling
Amazon S3 bucket storing gzip-compressed JSON objects
Amazon DynamoDB with TTL enabled on the ingestion timestamp
Answer Description
Amazon DynamoDB delivers consistent single-digit millisecond latency at any scale, making it suitable for real-time lookups of individual sensor readings. The service is fully managed and automatically scales throughput to handle sustained ingestion rates like 50,000 writes per second. DynamoDB Time to Live (TTL) can be enabled on the timestamp attribute so items expire automatically after 30 days, eliminating manual cleanup. Amazon S3 is cost-effective but cannot provide millisecond point-read latency for individual items. Amazon Aurora MySQL can achieve sub-second queries but would require careful sharding and manual capacity management to sustain 50,000 TPS, increasing operational effort. Amazon Redshift is optimized for analytic scans, not high-velocity, per-item transactions, and it offers second-level rather than millisecond response times. Therefore, DynamoDB is the most appropriate choice.
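A minimal boto3 sketch of enabling TTL on the table is shown below; the table and attribute names are illustrative, and the attribute must hold an epoch-seconds timestamp set 30 days after ingestion:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Table and attribute names are illustrative.
dynamodb.update_time_to_live(
    TableName="sensor-readings",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)
```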
Ask Bash
How does DynamoDB achieve single-digit millisecond latency?
What is Time to Live (TTL) in DynamoDB and how does it work?
Why is DynamoDB more suitable than Aurora or Redshift for high-velocity IoT data ingestion?