A data science team is designing a data lake architecture on a distributed file system to store terabytes of structured event data for analytical querying. The primary use case involves running complex, read-heavy queries for feature engineering, which frequently select a small subset of columns from a wide table containing over 200 columns. The system must also support schema evolution as new event properties are added over time. Given these requirements, which data format is the most appropriate for storing the processed data in the data lake to optimize query performance and storage efficiency?
The correct answer is Parquet. Parquet is a columnar storage format specifically designed for efficient data storage and retrieval in analytical workloads. Its columnar nature allows query engines to read only the necessary columns to satisfy a query, which drastically reduces I/O and improves performance, especially for wide tables where only a subset of columns is accessed. Parquet also offers excellent compression and supports schema evolution, making it the ideal choice for this scenario.
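The column-pruning benefit described above can be illustrated with a small plain-Python sketch (this is a conceptual model of row-oriented vs. columnar layout, not the Parquet file format or any Parquet library API):

```python
# Row-oriented layout: each record's fields are stored together, so even a
# single-column query must touch every field of every row.
rows = [
    {"user_id": 1, "event": "click", "ts": 100, "duration_ms": 35},
    {"user_id": 2, "event": "view",  "ts": 101, "duration_ms": 90},
    {"user_id": 1, "event": "click", "ts": 102, "duration_ms": 12},
]

# Column-oriented layout: each column's values are stored contiguously,
# so a query can read only the columns it needs.
columns = {
    "user_id":     [1, 2, 1],
    "event":       ["click", "view", "click"],
    "ts":          [100, 101, 102],
    "duration_ms": [35, 90, 12],
}

def avg_duration_row_store(rows):
    # Scans whole records even though only one field is needed.
    return sum(r["duration_ms"] for r in rows) / len(rows)

def avg_duration_column_store(columns):
    # Reads exactly one contiguous column.
    col = columns["duration_ms"]
    return sum(col) / len(col)
```

With a 200-column table, the row store's waste grows in proportion to the table width, while the columnar read stays fixed to the columns the query actually selects; this is the I/O reduction Parquet's layout provides.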
Avro is an incorrect choice because it is a row-based storage format. While it is efficient for write-heavy workloads and data serialization (like in streaming pipelines), its row-based nature requires reading entire rows of data, which is inefficient for analytical queries that only need a few columns from a wide table.
JSON is incorrect because, although it supports schema flexibility and nested data, it is a text-based, row-oriented format. It is more verbose and significantly less performant for large-scale analytical queries compared to binary, columnar formats like Parquet.
CSV is incorrect as it is a simple, text-based, row-oriented format. It is inefficient for querying subsets of columns from large, wide datasets and lacks robust support for schema evolution or data typing.
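The schema-evolution requirement from the question can also be sketched conceptually in plain Python (the helper name and defaulting behavior are illustrative assumptions, not a Parquet API; Parquet readers expose newly added columns as nulls for files written under the old schema):

```python
# "region" was added to the event schema after some data was already written.
NEW_SCHEMA = ["user_id", "event", "ts", "region"]

old_records = [{"user_id": 1, "event": "click", "ts": 100}]           # pre-change
new_records = [{"user_id": 2, "event": "view", "ts": 101, "region": "eu"}]

def read_with_schema(records, schema, default=None):
    """Hypothetical reader: project every record onto the current schema,
    filling fields that predate a record with a default (null)."""
    return [{field: r.get(field, default) for field in schema} for r in records]

merged = read_with_schema(old_records + new_records, NEW_SCHEMA)
```

Older records now expose the new `region` field as `None`, so queries written against the evolved schema work across old and new data alike.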