You are preparing data for an ordinary least-squares regression model that is sensitive to multicollinearity. After centering and scaling the predictors, you compute the absolute Pearson correlation matrix for the four numeric features shown below (upper-triangle values only).

|           | Feature_B | Feature_C | Feature_D |
|-----------|-----------|-----------|-----------|
| Feature_A | 0.85      | 0.30      | 0.10      |
| Feature_B |           | 0.35      | 0.15      |
| Feature_C |           |           | 0.20      |
A common heuristic, used by the caret findCorrelation algorithm, flags any variable whose mean absolute correlation with all other predictors exceeds 0.40 and removes the variable with the highest such mean first.
According to this rule, which feature should you drop first to reduce redundancy in the feature set?
For each feature, compute the mean of the absolute correlations with the other three predictors (the R sketch after these calculations reproduces the arithmetic).
Feature_A: (0.85 + 0.30 + 0.10) ÷ 3 ≈ 0.42
Feature_B: (0.85 + 0.35 + 0.15) ÷ 3 = 0.45
Feature_C: (0.30 + 0.35 + 0.20) ÷ 3 ≈ 0.28
Feature_D: (0.10 + 0.15 + 0.20) ÷ 3 = 0.15
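As a quick check, here is a minimal base-R sketch that rebuilds the full symmetric correlation matrix from the question's upper-triangle values and computes each feature's mean absolute correlation with the other three predictors:

```r
# Rebuild the symmetric correlation matrix from the question's
# upper-triangle values (feature names are the question's own).
feats <- c("Feature_A", "Feature_B", "Feature_C", "Feature_D")
corr <- matrix(c(1.00, 0.85, 0.30, 0.10,
                 0.85, 1.00, 0.35, 0.15,
                 0.30, 0.35, 1.00, 0.20,
                 0.10, 0.15, 0.20, 1.00),
               nrow = 4, dimnames = list(feats, feats))

# Subtract the diagonal (each feature's correlation with itself is 1),
# then average over the remaining three predictors.
mean_abs <- (colSums(abs(corr)) - 1) / (nrow(corr) - 1)
round(mean_abs, 2)
#> Feature_A Feature_B Feature_C Feature_D
#>      0.42      0.45      0.28      0.15
```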
Only Feature_A and Feature_B exceed the 0.40 threshold, and Feature_B has the larger mean absolute correlation (0.45 vs. ≈ 0.42), so Feature_B is the strongest candidate for elimination. Feature_A would be re-evaluated in the next iteration, but with Feature_B removed its mean absolute correlation drops to (0.30 + 0.10) ÷ 2 = 0.20, well below the cutoff. Features C and D remain throughout because their average correlations never exceed the cut-off.
This approach is consistent with caret's findCorrelation procedure, which scans for pairwise correlations above the cutoff and, within each offending pair, removes the variable with the higher mean absolute correlation to the other predictors, thereby mitigating multicollinearity without discarding more information than necessary.
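A short sketch of the library call itself, assuming the caret package is installed; note that findCorrelation expects a correlation matrix (the corr object built above), not raw data:

```r
# Assumes caret is installed; corr is the matrix from the earlier sketch.
library(caret)

# With cutoff = 0.40, the only pair above the cutoff is A-B (0.85);
# Feature_B has the larger mean absolute correlation, so it is flagged.
findCorrelation(corr, cutoff = 0.40, names = TRUE)
#> [1] "Feature_B"
```

After Feature_B is removed, no remaining pair exceeds 0.40, so caret stops after a single removal; the question's simpler mean-based rule reaches the same first answer.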