CompTIA DataX DY0-001 (V1) Practice Question

A data engineering team is building a daily batch processing pipeline that ingests terabytes of semi-structured JSON web server logs. The processed data must be stored to serve two primary use cases:

Enable fast, columnar-based analytical queries for business intelligence dashboards.
Provide a cost-effective, long-term archive for retraining machine learning models.

Given these requirements, which of the following persistence strategies for the processed data offers the best balance of query performance, cost efficiency, and schema evolution support?

Convert the data to Parquet format and persist it in a cloud-based object storage service.
Persist the raw JSON data directly in a distributed file system without any format conversion.
Load the flattened JSON data into a transactional, row-oriented relational database.
Store the final processed data in a distributed in-memory cache.

Answer Description

The correct answer is to convert the data to Parquet format and persist it in a cloud-based object storage service, because this strategy addresses all of the stated requirements. Parquet is a columnar storage format, so query engines can read only the columns a query needs. That sharply reduces I/O and speeds up analytical queries compared with row-oriented storage such as raw JSON files or a transactional relational database, which satisfies the first requirement. For the second requirement, Parquet compresses well because values of the same type are stored together, which reduces the storage footprint and cost and suits long-term archiving of large data volumes in inexpensive cloud object storage. Finally, Parquet files carry their own schema, and most engines that read Parquet can reconcile files whose schemas differ by added or dropped columns. This support for schema evolution lets the log format change over time without breaking downstream processes, which matters for long-term archival and model retraining.
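As a rough illustration of why a columnar layout helps, the sketch below pivots a few hypothetical log records into per-column lists in plain Python and then reads a single column. It is a toy model of the idea, not Parquet's actual file format; the field names (`status`, `path`, `bytes`, `region`) and the helpers `to_columns` and `read_columns` are assumptions made for this example. Records that lack a newer field are padded with `None`, which loosely mirrors how a reader can handle a column that was added later.

```python
from collections import Counter


def to_columns(records):
    """Pivot row-oriented dicts into a column-oriented dict of lists.

    Fields missing from a record are filled with None, so a column that
    appears later in the data's life does not break older records.
    """
    # Collect field names in first-seen order across all records.
    field_names = []
    for record in records:
        for name in record:
            if name not in field_names:
                field_names.append(name)
    return {name: [record.get(name) for record in records] for name in field_names}


def read_columns(columns, wanted):
    """Return only the requested columns, leaving the others untouched."""
    return {name: columns[name] for name in wanted}


# Hypothetical web server log records; the last one carries a newer field.
records = [
    {"status": 200, "path": "/home", "bytes": 512},
    {"status": 404, "path": "/missing", "bytes": 128},
    {"status": 200, "path": "/home", "bytes": 512, "region": "eu"},
]

columns = to_columns(records)
status_only = read_columns(columns, ["status"])
status_counts = Counter(status_only["status"])
```

A dashboard query that counts responses by status only needs `columns["status"]`; the `path` and `bytes` values are never touched. In a real pipeline a Parquet library would handle the layout, compression, and schema metadata rather than hand-written code like this.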

The other options are incorrect for the following reasons:

Loading data into a transactional, row-oriented relational database is inefficient for this scenario. Row-oriented databases are optimized for transactional workloads (OLTP), not for large-scale analytical (OLAP) queries that scan columns. Querying terabytes of data this way would be slow and expensive.
Storing data in a distributed in-memory cache is not a persistence strategy for long-term archival. While caches provide extremely fast access, they are designed for temporary storage of smaller, frequently accessed data and are not cost-effective or suitable for storing terabytes of historical data.
Persisting raw JSON in a distributed file system fails to meet the performance requirement. Querying raw, row-based JSON files for columnar analytics is very inefficient, as the entire file must be scanned and parsed even if only a few columns are needed. This results in slow query performance and higher computational costs.
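To make the cost of querying raw JSON concrete, here is a minimal sketch that answers a one-column question over newline-delimited JSON while counting how many field values had to be decoded along the way. The log lines, field names, and the helper `count_status_from_json` are hypothetical and exist only for this illustration.

```python
import json

# Hypothetical newline-delimited JSON log lines.
raw_lines = [
    '{"status": 200, "path": "/home", "bytes": 512, "agent": "curl"}',
    '{"status": 500, "path": "/api", "bytes": 64, "agent": "bot"}',
    '{"status": 200, "path": "/home", "bytes": 512, "agent": "curl"}',
]


def count_status_from_json(lines, status):
    """Count records with the given status and tally decoded field values."""
    matches = 0
    values_decoded = 0
    for line in lines:
        record = json.loads(line)  # decodes every field, not just "status"
        values_decoded += len(record)
        if record.get("status") == status:
            matches += 1
    return matches, values_decoded


matches, values_decoded = count_status_from_json(raw_lines, 200)
```

Here the query cares about one field per record, yet all four fields of every record are decoded. A columnar reader would only need the three `status` values, and that gap grows with the number of fields and the volume of data.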

CompTIA DataX DY0-001 (V1)

Operations and Processes
