A data scientist is working with a financial dataset containing 200 correlated features to predict stock prices. The primary goal is to reduce the dimensionality while creating a new set of uncorrelated, interpretable features that capture the maximum possible variance from the original feature set. The new features will be used in an Ordinary Least Squares (OLS) regression model. Which dimensionality reduction technique is most appropriate for this scenario?
The correct answer is Principal Component Analysis (PCA). PCA is a linear dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. Its primary goal is to capture the maximum amount of variance from the original data in the first few components. This makes it ideal for the scenario: it directly satisfies the requirements of producing uncorrelated, variance-maximizing features, and because the components are uncorrelated by construction, it removes the multicollinearity that would otherwise destabilize the coefficient estimates of an OLS regression model.
t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are non-linear techniques used primarily for data visualization. Their main goal is to preserve the local neighborhood structure of the data to reveal clusters, not to maximize variance or produce uncorrelated components suitable for a linear regression model.
k-means is a clustering algorithm, not a dimensionality reduction technique. Its purpose is to partition data into distinct groups, not to create a new, smaller set of features that represent the original data's variance.