A data science team at a financial institution is experiencing a reproducibility crisis with their real-time fraud detection model. The model's performance is degrading, and new team members cannot replicate past experiments. The project involves streaming data, frequent pipeline updates, and continuous feature engineering. The team maintains a data dictionary, but it seems insufficient. Which of the following data dictionary components is MOST crucial for addressing the root cause of this model decay and lack of reproducibility?
A comprehensive list of all table and column names with their base data types (e.g., INTEGER, VARCHAR, TIMESTAMP).
Summary statistics (e.g., mean, median, cardinality) for all features, updated on a weekly basis.
Detailed data lineage mapping and versioned documentation of all feature transformation logic.
Business definitions and permissible value ranges for each variable as defined by domain stakeholders.
The correct answer is the inclusion of data lineage and transformation logic. In a dynamic environment with evolving data pipelines and feature engineering, reproducibility depends on the ability to trace how data and features have changed over time. Data lineage tracks the data's journey from source to model, and documenting the transformation logic explains exactly how raw data was converted into features for any given model version. This allows analysts to reconstruct the exact dataset used in a past experiment, which is essential for diagnosing performance degradation and ensuring reproducibility. While the other options are important components of a data dictionary, they do not directly address the problem of tracking changes in a complex, evolving data environment. Basic data types, business definitions, and summary statistics are snapshots, but data lineage provides the full history of change.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is data lineage, and why is it important in machine learning pipelines?
Open an interactive chat with Bash
What is 'feature transformation logic,' and how does it impact a model?
Open an interactive chat with Bash
How does versioned documentation improve reproducibility in data science projects?