A data science team's customer churn model, which performed well on historical data, shows a significant drop in accuracy when evaluated on a new customer cohort acquired through a corporate partnership. An initial investigation confirms there are no data quality issues and no significant drift in the distributions of individual features. Which of the following statements provides the most fundamental explanation for the model's performance degradation?
The feature engineering process failed to create predictors that are relevant to the new customer segment.
The model suffers from high variance, leading to overfitting on the original training data.
Systematic errors were introduced during the data ingestion pipeline for the new cohort's transactional data.
The data-generating process for the new cohort is different from the process that generated the training data.
The correct answer explains that the model's failure is due to a change in the underlying data-generating process (DGP). The DGP refers to the true, real-world process that creates the data, including the underlying mechanisms and relationships between variables. In this scenario, the new customer cohort, acquired through a corporate partnership, likely has different behaviors, motivations, and relationships between their characteristics and their decision to churn compared to the historical customer base. This constitutes a different DGP. A model trained on the historical DGP will not generalize well to this new process, even if individual features appear stable.
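The distinction can be made concrete with a minimal, deliberately extreme sketch (hypothetical data, not the scenario's actual model): two cohorts whose feature distributions are identical, but whose relationship between the feature and churn is reversed. A model trained on the first cohort's DGP fails badly on the second, even though a per-feature drift check would find nothing amiss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Historical DGP: feature x ~ N(0, 1); customers churn when x > 0.
X_train = rng.normal(size=(2000, 1))
y_train = (X_train[:, 0] > 0).astype(int)

# New cohort's DGP: the SAME feature distribution x ~ N(0, 1),
# but the relationship is inverted -- customers churn when x < 0.
X_new = rng.normal(size=(2000, 1))
y_new = (X_new[:, 0] < 0).astype(int)

# The marginal feature distributions match closely, so a drift
# check on individual features passes.
print(f"mean diff: {abs(X_train.mean() - X_new.mean()):.3f}")
print(f"std diff:  {abs(X_train.std() - X_new.std()):.3f}")

# Yet a model trained on the historical DGP collapses on the new one.
model = LogisticRegression().fit(X_train, y_train)
print(f"historical accuracy: {model.score(X_train, y_train):.2f}")  # high
print(f"new cohort accuracy: {model.score(X_new, y_new):.2f}")      # low
```

Real cohort differences are rarely this stark, but the mechanism is the same: what changed is P(churn | features), not P(features), so feature-level monitoring cannot detect it.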
The option suggesting the model suffers from high variance (overfitting) is less likely to be the fundamental cause. While overfitting can hurt generalization, the problem is specifically tied to a new, distinct group, which strongly points to a change in the population's underlying characteristics rather than the model merely fitting noise in the original training data.
The option regarding feature engineering is a symptom, not the root cause. The engineered features may indeed be less relevant for the new cohort, but this is because the underlying relationships (the DGP) have changed.
The option concerning systematic errors in data ingestion is incorrect because the scenario explicitly states that an investigation found no data quality issues.