A data architect at a major e-commerce company is designing an ingestion and storage solution for a new analytics platform. The platform will process high-velocity user clickstream data, which arrives as semi-structured JSON objects. The primary requirements are to support fast, complex analytical queries on specific columns while minimizing storage costs and providing data that is refreshed every few minutes. Which of the following approaches best meets all of these requirements?
A. Ingest the data in micro-batches, converting the nested JSON into a flattened, columnar Parquet format for storage.
B. Stream the incoming JSON data directly into a structured, relational database, normalizing the data into multiple tables.
C. Set up a daily batch process to collect all clickstream events, flatten them, and store them as compressed CSV files.
D. Implement a real-time streaming pipeline that writes the raw, nested JSON data directly to object storage as individual files.
The correct approach is option A: ingest the data in micro-batches and store it as flattened, columnar Parquet files. Parquet is a columnar storage format, which is highly efficient for analytical queries that read only a subset of columns, as is common in data science workloads. Its superior compression also helps minimize storage costs compared to row-oriented formats like JSON or CSV.
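As an illustration, here is a minimal sketch of the flatten-and-convert step using pandas; the file name and the assumption of newline-delimited JSON input are hypothetical:

```python
# Minimal sketch: flatten nested clickstream JSON and write columnar Parquet.
# "events.json" (newline-delimited JSON, one event per line) is a hypothetical input.
import json
import pandas as pd  # Parquet output requires the pyarrow package

with open("events.json") as f:
    records = [json.loads(line) for line in f]

# json_normalize flattens nested objects into dotted column names,
# e.g. {"user": {"id": 1}} becomes a "user.id" column.
flat = pd.json_normalize(records)

# Snappy-compressed Parquet: the columnar layout lets query engines read
# only the columns a query touches, and it compresses far better than JSON.
flat.to_parquet("events.parquet", compression="snappy")
```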
Clickstream data is high-velocity, and writing each event as a separate file creates a 'small file problem' in data lakes, which severely degrades query performance due to metadata overhead. Micro-batching, where data is collected for a short interval (e.g., a few minutes) before being written as a larger file, effectively solves this issue while still providing near-real-time data availability.
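A sketch of such a micro-batch pipeline using Spark Structured Streaming is below; the Kafka broker, topic name, event schema, and S3 paths are all assumptions for illustration, not a prescribed implementation:

```python
# Hypothetical micro-batch pipeline: consume clickstream events from Kafka
# and write them as Parquet every few minutes instead of per event.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Assumed event schema for illustration.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# Assumed Kafka source: broker address and topic are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "clickstream")
       .load())

# Parse the JSON payload into typed columns.
events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Each 5-minute trigger collects a micro-batch and writes a small number of
# larger Parquet files, avoiding one tiny file per event.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://analytics/clickstream/")
         .option("checkpointLocation", "s3://analytics/_checkpoints/clickstream/")
         .trigger(processingTime="5 minutes")
         .start())

query.awaitTermination()
```

The trigger interval is the knob that trades freshness against file size: shorter intervals mean fresher data but smaller files, so a few minutes is a common compromise for near-real-time analytics.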
Writing the raw, nested JSON directly to object storage as individual files (option D) fails on two counts: row-oriented JSON forces query engines to parse and scan entire documents even when a query needs only a few fields, and one file per event recreates the small file problem described above.
A daily batch process using compressed CSV files (option C) does not meet the requirement for data refreshed every few minutes, and CSV's row-based layout cannot match Parquet's columnar scan performance or compression for analytical queries.
A relational database (option B) is a poor fit for the high velocity and semi-structured nature of clickstream data: normalizing every JSON event into multiple tables makes ingestion a write bottleneck, and a rigid relational schema is difficult to evolve as event payloads change.