A data engineering team is building a daily batch processing pipeline that ingests terabytes of semi-structured JSON web server logs. The processed data must be stored to serve two primary use cases:
Enable fast, columnar-based analytical queries for business intelligence dashboards.
Provide a cost-effective, long-term archive for retraining machine learning models.
Given these requirements, which of the following persistence strategies for the processed data offers the best balance of query performance, cost efficiency, and schema evolution support?
Persist the raw JSON data directly in a distributed file system without any format conversion.
Load the flattened JSON data into a transactional, row-oriented relational database.
Convert the data to Parquet format and persist it in a cloud-based object storage service.
Store the final processed data in a distributed in-memory cache.
The correct answer is to convert the data to Parquet format and persist it in a cloud-based object storage service. This strategy addresses all of the requirements. Parquet is a columnar storage format that is highly optimized for the fast analytical queries in the first requirement: because query engines can read only the columns a query needs, I/O drops dramatically and queries run far faster than they would against row-oriented data such as raw JSON or rows in a relational table. For the second requirement, Parquet's efficient encoding and compression significantly reduce the storage footprint, making long-term archival of large data volumes in inexpensive cloud object storage cost-effective. Finally, Parquet supports schema evolution, so columns can be added or removed over time without breaking downstream consumers, which is crucial for long-term archival and periodic model retraining.
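As a rough illustration of this approach, the sketch below uses PySpark to read a day's semi-structured JSON logs, derive a date partition column, and write snappy-compressed Parquet to object storage. The bucket URIs, the `timestamp` field, and the `log_date` partition column are hypothetical placeholders, not details from the scenario.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-log-batch").getOrCreate()

# Read one day's semi-structured JSON logs; Spark infers a schema from the records.
# (Bucket and path are illustrative assumptions.)
logs = spark.read.json("s3://example-raw-logs/dt=2024-01-15/*.json.gz")

# Derive a partition column from an assumed 'timestamp' field, then write
# snappy-compressed Parquet partitioned by date, so analytical queries can
# prune partitions and read only the columns they need.
(logs
    .withColumn("log_date", F.to_date("timestamp"))
    .write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("log_date")
    .parquet("s3://example-processed-logs/web_logs/"))
```

Partitioning by date is a common convention for daily batch output; BI engines that understand partitioned Parquet layouts can skip entire partitions when a dashboard query filters on date.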
The other options are incorrect for the following reasons:
Loading data into a transactional, row-oriented relational database is a poor fit for this scenario. Row-oriented databases are optimized for transactional (OLTP) workloads, not for large-scale analytical (OLAP) queries that scan a few columns across many rows, and bulk-loading terabytes of flattened JSON every day would be slow and expensive.
Storing data in a distributed in-memory cache is not a persistence strategy for long-term archival. While caches provide extremely fast access, they are designed for temporary storage of smaller, frequently accessed data and are not cost-effective or suitable for storing terabytes of historical data.
Persisting raw JSON in a distributed file system fails to meet the performance requirement. Querying raw, row-based JSON files for columnar analytics is very inefficient, as the entire file must be scanned and parsed even if only a few columns are needed. This results in slow query performance and higher computational costs.
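To make that contrast concrete, here is a small, hypothetical pandas comparison: the Parquet read fetches only the two columns a dashboard needs, while the JSON read must parse every field of every record before those columns can be selected. The paths and column names are illustrative assumptions.

```python
import pandas as pd

# Columnar read: only the two requested columns are fetched from the Parquet
# partition, so I/O scales with the columns used, not the full row width.
daily_traffic = pd.read_parquet(
    "s3://example-processed-logs/web_logs/log_date=2024-01-15/",
    columns=["url_path", "response_time_ms"],
)

# Row-based read: every newline-delimited JSON record is fully parsed, even
# though only two fields are needed afterwards.
raw = pd.read_json(
    "s3://example-raw-logs/dt=2024-01-15/part-0000.json.gz",
    lines=True,
)[["url_path", "response_time_ms"]]
```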