A data scientist is performing exploratory data analysis (EDA) on a dataset for a real estate valuation model. The dataset includes a feature named view_quality, which is rated by professional assessors on a custom scale: "No View", "Partial Obstruction", "Standard", "Good", and "Excellent". The team is debating the most appropriate way to handle this feature for a multiple linear regression model versus a gradient boosting machine (GBM).
Which of the following statements most accurately describes the view_quality feature and the implications for its use in modeling?
view_quality is a continuous variable that has been binned. For use in a linear regression model, the mid-points of the implied continuous range for each category should be calculated and used as the feature value. For a GBM, this feature can be used directly.
view_quality is an ordinal variable. For a linear regression model, treating it as a continuous integer (e.g., 0-4) assumes equidistant spacing between categories, which is likely false and could violate model assumptions. For a GBM, integer encoding is generally effective as the model can create splits at any point along the ordered values.
view_quality is a nominal variable. It must be one-hot encoded for both linear regression and GBMs to avoid introducing a false sense of order, which would negatively impact the performance of both model types.
view_quality is a discrete variable. It can be used directly in a linear regression model without transformation because the model will interpret the integer values as distinct points. For a GBM, it should be treated as a categorical feature to allow for optimal splits.
The correct answer identifies view_quality as an ordinal variable and accurately describes the differing implications for linear versus tree-based models. view_quality is ordinal because its categories have a meaningful, inherent order, but the intervals between them (e.g., the improvement from "Good" to "Excellent" vs. "No View" to "Partial Obstruction") are not uniform or quantifiable. For a linear regression model, directly converting these categories to integers (e.g., 0-4) and treating it as a continuous variable is problematic because it incorrectly assumes the intervals are equidistant, which violates a core assumption of linearity and can lead to a misinterpretation of the feature's impact. For a tree-based model like a Gradient Boosting Machine (GBM), simple integer encoding is often effective. Tree models partition data by creating splits (e.g., view_quality < 3), which naturally handles the ordered nature of the variable without assuming linearity or equal intervals.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is an ordinal variable?
Open an interactive chat with Bash
Why is integer encoding problematic for ordinal variables in linear regression?
Open an interactive chat with Bash
How do tree-based models like GBMs handle ordinal variables?