A data scientist is tasked with building a linear regression model to predict housing prices. Initial exploratory data analysis reveals that the target variable, Price, is continuous, strictly positive, and exhibits significant right-skew. Additionally, the dataset includes a key predictor, Geographic_Region, which is a nominal categorical feature with five unique, non-ordinal values. To meet the assumptions of linear regression and properly incorporate the categorical data, which of the following data enrichment strategies is the most appropriate for the data scientist to implement?
Apply binning to the Price variable and normalization to the Geographic_Region variable.
Apply a log transformation to the Price variable and label encoding to the Geographic_Region variable.
Apply a Box-Cox transformation to the Price variable and one-hot encoding to the Geographic_Region variable.
Apply standardization to the Price variable and create cross-terms for the Geographic_Region variable.
The correct approach addresses both the skewed target variable and the nominal categorical feature appropriately for a linear regression model. A Box-Cox transformation is a power transformation specifically designed to make data more closely resemble a normal distribution and stabilize variance, which is ideal for a right-skewed, positive target variable in a linear regression context. For the nominal categorical predictor Geographic_Region, one-hot encoding is the correct method because it converts the feature into a series of binary columns, allowing the model to interpret each region as a distinct entity without imposing a false and misleading ordinal relationship that would be created by label encoding.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why is a Box-Cox transformation suitable for the `Price` variable in this scenario?
Open an interactive chat with Bash
What is one-hot encoding, and why is it preferred over label encoding for the `Geographic_Region` variable?
Open an interactive chat with Bash
What are the assumptions of linear regression that make these transformations necessary?