During feature engineering for a multinomial logistic-regression model that predicts customer churn, a data scientist converts the nominal column "EmployerIndustry" (≈50 unique string values) to numeric form by assigning each category an integer from 0 through 49 with a basic label-encoding step. The single encoded column is then supplied to the model. Cross-validation reveals unstable coefficients and erratic performance.
Which statement best explains why this encoding choice is inappropriate for the selected algorithm?
The integer codes impose a false ordinal relationship among industries, leading logistic regression to treat higher codes as having proportionally larger (or smaller) effects on churn probability.
Label encoding expands the feature into dozens of sparse columns, and the resulting high dimensionality destabilizes the optimizer.
LabelEncoder in scikit-learn supports at most 32 distinct categories; using it with ≈50 levels causes excessive variance in the estimated coefficients.
Label encoding internally converts the column to a single binary flag, discarding most of the information needed by the model.
Label encoding simply substitutes arbitrary integers for string categories. A linear or logistic-regression model interprets those integers as magnitudes on a numeric scale, so it assumes that a category encoded as 49 lies "further" from one encoded as 0 and tries to fit a monotonic relationship along that scale. Because the 50 industries are purely nominal, any implied ordering is artificial: there is no real trend for the coefficient to capture, so it chases sampling noise and shifts from fold to fold, producing the unstable coefficients and erratic performance seen in cross-validation. The problem is not higher dimensionality, binary collapsing, or a library limit on the number of categories; it is the spurious ordinal structure introduced by the integer codes.
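As an illustrative sketch (the column and category values below are hypothetical stand-ins, not data from the question), the snippet contrasts the problematic single-integer encoding with a one-hot encoding, which represents a nominal feature without imposing any ordering:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; the category names are invented for illustration.
df = pd.DataFrame({
    "EmployerIndustry": ["Retail", "Mining", "Finance", "Retail", "Healthcare"]
})

# Problematic: one integer column. Logistic regression reads these codes as
# magnitudes, implying an ordering such as Finance < Healthcare < Mining < Retail.
df["industry_label"] = LabelEncoder().fit_transform(df["EmployerIndustry"])

# Better for a nominal feature: one indicator column per category, so no
# artificial scale is introduced (drop_first avoids perfect collinearity).
industry_dummies = pd.get_dummies(df["EmployerIndustry"],
                                  prefix="industry", drop_first=True)

print(df)
print(industry_dummies)
```

With roughly 50 levels, the one-hot approach adds about 49 sparse indicator columns, which a regularized logistic regression handles without any spurious ordinal structure.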