An engineering team is clustering a year's worth of telemetry from industrial boilers. Each record contains temperature (°C), pressure (kPa), and relative humidity (%). A preliminary K-means (Euclidean distance) run produces clusters that vary almost entirely by pressure, because that feature's numeric range dwarfs the others. The domain experts want to:
Give every feature comparable influence in the distance calculations.
Avoid compressing the effect of rare extreme readings.
Convert the final cluster centroids back to the original physical units once the model is trained.
Which single preprocessing step best satisfies all three requirements before fitting the model?
Standardize each feature using z-score scaling (subtract the mean and divide by the standard deviation).
Discretize each numerical feature into decile bins and one-hot encode the resulting categories.
Log-transform the pressure and temperature features while leaving humidity unchanged.
Apply min-max normalization to map every feature linearly into the range [0, 1].
Standardizing each feature with the z-score (subtracting its mean and dividing by its standard deviation) rescales temperature, pressure, and humidity so they contribute comparably to the Euclidean distances K-means computes. Because the transformation is based on the mean and standard deviation rather than the absolute minimum and maximum, rare extreme readings are not squeezed into a narrow range; they simply receive larger absolute z-scores. Libraries such as scikit-learn store the fitted mean and standard deviation, so the scaler's inverse_transform method can map the cluster centroids back to the original engineering units.
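A minimal sketch of this workflow with scikit-learn, using synthetic telemetry (the distribution parameters and cluster count are illustrative, not from the scenario):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic readings: temperature (°C), pressure (kPa), humidity (%)
X = np.column_stack([
    rng.normal(180, 15, 500),   # temperature
    rng.normal(900, 200, 500),  # pressure dominates the raw scale
    rng.normal(45, 10, 500),    # humidity
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z-score each column

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Map centroids back from z-score space to physical units
centroids_physical = scaler.inverse_transform(kmeans.cluster_centers_)
```

After fitting, each column of `X_scaled` has mean 0 and standard deviation 1, so no single feature dominates the distance computation, while `centroids_physical` is directly interpretable by the domain experts.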
Min-max normalization would satisfy the first requirement, but when outliers are present it compresses the bulk of the data into a narrow slice of the fixed [0, 1] range, violating the second. Log-transforming only pressure and temperature changes those features' distributions without guaranteeing comparable scale across all three variables, so one feature could still dominate the distance. Discretizing into decile bins with one-hot encoding discards the continuous information entirely and still yields distances with no clear physical interpretation.