A data scientist is developing an ordinary least-squares model to predict daily revenue, a strictly positive continuous variable. The revenue distribution is highly right-skewed and, after an initial linear fit, the residual-versus-fitted plot shows a wedge-shaped pattern that widens as fitted values increase, indicating heteroscedasticity. The scientist needs a single data transformation on the response variable that (1) can stabilize the variance and approximate normality and (2) lets the optimal transformation be chosen from a continuum of power functions using maximum-likelihood estimation. Which transformation should be applied before refitting the model?
Apply a Box-Cox power transformation to the revenue variable.
Take the natural logarithm (ln) of the revenue variable.
Standardize the revenue variable with a z-score (mean 0, standard deviation 1).
Rescale the revenue variable to the 0-1 range with min-max normalization.
The Box-Cox transformation is a family of power transforms, \(y^{(\lambda)} = (y^{\lambda} - 1)/\lambda\) for \(\lambda \neq 0\), with the limiting case \(\lambda = 0\) equal to the natural log, \(\ln y\). It requires a strictly positive response, which revenue satisfies. Because the family has a tunable parameter \(\lambda\), practitioners can estimate the value that maximizes the likelihood of the transformed data under a normal, constant-variance error model, producing a response that is closer to normal with more stable variance. This directly addresses the wedge-shaped heteroscedasticity evident in the residual plot. A plain logarithmic transform is a special case of Box-Cox but fixes \(\lambda\) at 0, removing the ability to search for a better power. Z-score standardization and min-max scaling are linear rescalings that only change location and scale; they leave the shape of the distribution unchanged, so they neither normalize a skewed distribution nor correct non-constant error variance.
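As a rough illustration of the maximum-likelihood step, the sketch below uses SciPy's `scipy.stats.boxcox`, which returns both the transformed values and the estimated \(\lambda\) when no \(\lambda\) is supplied. The data are synthetic: the log-normal `revenue` array, its parameters, and the variable names are assumptions made for the example, not part of the question.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)

# Hypothetical stand-in for daily revenue: log-normal draws are
# strictly positive and right-skewed, as the question describes.
revenue = rng.lognormal(mean=8.0, sigma=0.9, size=500)

# With no lambda supplied, boxcox returns the transformed values and
# the maximum-likelihood estimate of lambda.
revenue_bc, lam = stats.boxcox(revenue)

# Compare sample skewness before and after the transformation.
skew_before = stats.skew(revenue)
skew_after = stats.skew(revenue_bc)
print(f"estimated lambda: {lam:.3f}")
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# Values on the transformed scale (for example, model predictions)
# map back to the original units with the inverse transform.
revenue_back = inv_boxcox(revenue_bc, lam)
```

In a real workflow the model would be refit on the transformed response and the residual-versus-fitted plot rechecked; an estimated \(\lambda\) close to 0 would suggest that a plain log transform is nearly as good.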