A data team must forecast daily gross-merchandise value (GMV) for an e-commerce platform with gradient-boosting regression. Exploratory analysis shows (a) an exponential growth trend and (b) weekly seasonality whose absolute swing widens as GMV rises, indicating a multiplicative relationship. The model performs best when the target variable exhibits an approximately linear, additive structure with constant variance.
Which feature-engineering step should be applied to the GMV series before training to satisfy these requirements without losing predictive information?
Standardize the original GMV values to zero mean and unit variance using z-scores.
One-hot encode the day-of-week indicator while leaving the raw GMV values unchanged.
Bucket the GMV values into deciles and use the resulting ordinal codes as the new target.
Take the natural logarithm of GMV and then difference the logged series at a 7-day lag.
Applying the natural logarithm converts the multiplicative structure (Trend × Seasonality) into an additive one and reduces heteroscedasticity, so variance is more constant across the range of values. Differencing the logged series at the 7-day (weekly) lag then removes the repeating seasonal pattern and the remaining exponential trend, yielding a nearly stationary series that gradient-boosting models can learn with linear-like relationships.
One-hot encoding day-of-week leaves the raw target untouched-seasonality is captured but the multiplicative amplitude and exponential trend remain, so variance is still non-constant. Standardizing the original GMV with z-scores rescales the data but does nothing to linearize the exponential growth or to address the multiplicative seasonality. Binning GMV into deciles discards granularity and also fails to correct the non-linear, multiplicative structure. Therefore, the combined log transformation followed by seasonal differencing is the most appropriate feature-engineering step.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why is applying the natural logarithm effective for addressing multiplicative relationships?
Open an interactive chat with Bash
What is the purpose of differencing at a 7-day lag in time series data?
Open an interactive chat with Bash
Why is stationarity important for time series forecasting models like gradient boosting?