A data science team is developing a predictive model for equipment failure using a single, unpruned decision tree. During testing, they observe two phenomena:
The model achieves near-perfect accuracy on the training dataset but performs poorly on the unseen validation dataset.
Minor changes to the training data, such as removing a small number of data points, result in a drastically different tree structure and predictions.
Which underlying characteristic of decision trees is the primary cause of both of these observations?
The correct answer is high variance. High variance in a model means it is highly sensitive to fluctuations in the training data. This sensitivity causes two primary effects seen in unpruned decision trees. First, the model learns the training data, including its noise, too well, which leads to overfitting. This explains why the model has high accuracy on the training set but generalizes poorly to new, unseen data. Second, because the model is so closely fitted to the specific training data, even small changes to that data can lead to significant changes in the model's structure and predictions, a behavior known as instability.
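Both observations can be reproduced with a small experiment. The following is a minimal sketch, not a production implementation: it grows a tiny unpruned tree in pure Python on synthetic one-feature data. The data generator, the 20% label-noise rate, the class boundary at 500, and all function names are assumptions made for illustration; none of them come from the scenario.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(points):
    """Grow a tree on (x, y) pairs with no depth limit and no pruning."""
    labels = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if len(set(labels)) == 1 or len(xs) == 1:
        return int(2 * sum(labels) >= len(labels))  # leaf: majority label
    best_score, best_t = None, None
    for t in xs[:-1]:  # every candidate keeps both sides non-empty
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    left_pts = [p for p in points if p[0] <= best_t]
    right_pts = [p for p in points if p[0] > best_t]
    return (best_t, build_tree(left_pts), build_tree(right_pts))

def predict(tree, x):
    """Walk from the root to a leaf; leaves are ints, internal nodes tuples."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def accuracy(tree, points):
    return sum(predict(tree, x) == y for x, y in points) / len(points)

def make_data(rng, xs, noise=0.2):
    """Hypothetical data: true label is x >= 500, flipped with prob. `noise`."""
    return [(x, int(x >= 500) ^ int(rng.random() < noise)) for x in xs]

rng = random.Random(0)
xs = rng.sample(range(1000), 400)   # distinct feature values
train = make_data(rng, xs[:200])
val = make_data(rng, xs[200:])

# Observation 1: the tree memorizes the training set, noise included.
tree = build_tree(train)
train_acc = accuracy(tree, train)
val_acc = accuracy(tree, val)

# Observation 2: retrain after dropping 10 training points and measure
# how often the two trees disagree across the whole feature range.
reduced = train[10:]
tree2 = build_tree(reduced)
changed = sum(predict(tree, x) != predict(tree2, x) for x in range(1000)) / 1000

print(f"train accuracy:      {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
print(f"predictions changed: {changed:.1%}")
```

Because every training point has a distinct feature value, the unpruned tree can always split until each leaf is pure, so training accuracy is perfect by construction. Validation accuracy should come out noticeably lower, since the tree has also fitted the flipped labels. The last figure reports how much of the prediction surface moved after a small deletion from the training set.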
High bias is incorrect. High bias is associated with underfitting, where the model is too simple to capture the underlying patterns in the data. An underfit model would perform poorly on both the training and validation sets, which contradicts the near-perfect training accuracy in the scenario.
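The contrast between bias and variance is often summarized by the standard decomposition of expected squared error. It is stated here for regression as an illustration, and the scenario's classification setting follows the same intuition only informally. The symbols are assumptions introduced for this sketch: $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, $\hat{f}$ is the model fitted on a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An unpruned tree typically makes the first term small and the second term large: averaged over many training sets it tracks $f$ well, but any single fitted tree swings widely around that average.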
The curse of dimensionality is incorrect. It refers to problems that arise when working with high-dimensional data, such as data sparsity and increased computational cost. While it can degrade model performance, it is not the direct cause of the overfitting and instability described in the scenario in the way high variance is.
Multicollinearity is incorrect. Correlation between predictor variables can affect the stability of a decision tree's feature selection and its interpretability, but it is not the fundamental reason for overfitting and sensitivity to data changes. The core issue described is high variance, for which decision trees are well known.