A data science team is redesigning the data pipeline for a financial services company's fraud detection system. The system must support two distinct operational needs:
A fraud analytics team requires a dashboard with a data latency of less than one minute to take immediate action on suspicious transactions.
A risk modeling team analyzes historical data to identify long-term trends and build predictive models, for which daily data freshness is sufficient.
The current architecture refreshes a single, large data warehouse via a continuous streaming feed, leading to high operational costs and resource contention.
Which of the following strategies provides the most cost-effective and operationally efficient solution that meets the refresh cycle requirements for both teams?
Maintain the single data warehouse with streaming ingestion but significantly increase the cluster's processing resources to eliminate performance contention.
Implement a two-tiered data architecture. Use a low-latency streaming pipeline to populate a dedicated data store for the real-time dashboard, and use a separate, daily batch process to load data into the main data warehouse for the risk modeling team.
Transition the entire data warehouse refresh cycle to a micro-batch process that runs every 10 minutes to balance latency and the high cost of continuous streaming.
Implement on-demand refresh triggers for the single data warehouse, allowing analysts from both teams to manually update the data as needed for their specific tasks.
The correct answer is to implement a two-tiered data architecture. This approach, often aligned with principles of the Lambda architecture, addresses the distinct needs of both teams. It reserves the low-latency (and typically more expensive) streaming pipeline for the one use case that truly requires it: the real-time fraud dashboard. For the risk modeling team, which does not need sub-minute data freshness, a daily batch refresh is more appropriate and significantly more cost-effective for processing large historical datasets. This separation of workloads optimizes resource utilization and cost without compromising the critical requirements of either team.
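A minimal sketch of how the two tiers might be wired, using simple in-memory stand-ins for the hot store, the warehouse, and the batch staging area. All names here (HotStore, on_transaction, nightly_batch_load, and so on) are hypothetical illustrations rather than part of any specific platform: the streaming path writes each transaction to the hot store immediately, while the batch path loads staged transactions into the warehouse once per day.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Hypothetical in-memory stand-ins for the two data stores.
@dataclass
class HotStore:          # low-latency store backing the fraud dashboard
    rows: List[dict] = field(default_factory=list)

@dataclass
class Warehouse:         # main warehouse used by the risk modeling team
    rows: List[dict] = field(default_factory=list)

hot_store = HotStore()
warehouse = Warehouse()
staging: List[dict] = []  # transactions staged for the nightly batch load

def on_transaction(txn: dict) -> None:
    """Streaming path: push each transaction to the hot store immediately
    so the fraud dashboard sees it within seconds."""
    hot_store.rows.append(txn)
    staging.append(txn)          # also staged for the daily batch

def nightly_batch_load() -> None:
    """Batch path: run once per day (e.g., from a scheduler) to load the
    staged transactions into the warehouse for historical modeling."""
    warehouse.rows.extend(staging)
    staging.clear()

# Example: one transaction flows through both tiers.
on_transaction({"id": 1, "amount": 950.0,
                "ts": datetime.now(timezone.utc).isoformat()})
nightly_batch_load()
print(len(hot_store.rows), len(warehouse.rows))  # 1 1
```

In a real deployment the hot store would be a low-latency serving system fed by a streaming pipeline and the nightly load would be triggered by a scheduler; the key design choice is that only the fraud-dashboard path pays for continuous streaming, while the warehouse is refreshed in a single, cheaper daily bulk load.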
Upgrading resources for a single streaming pipeline is a brute-force solution that addresses the performance contention but not the underlying cost inefficiency. Switching entirely to a 10-minute micro-batch fails to meet the stringent sub-minute latency requirement for the fraud analytics team. Implementing an on-demand refresh strategy is unsuitable for a high-frequency, automated alerting system and introduces unpredictable system load.