A data scientist is preparing a manufacturing data set for a k-nearest neighbors (k-NN) model that uses Euclidean distance. The data contain two continuous variables: AnnualEnergy_kWh, with a range of 0 to 12,000, and MaintenanceDowntime_min, with a range of 0 to 7,200.
During pilot runs, the distance metric is dominated by AnnualEnergy_kWh, causing records with high downtime to be misclassified. According to best practice for normalization, which preprocessing step should the data scientist apply before training so that both variables contribute proportionally to the distance calculation?
Generate polynomial cross-terms between the two features and include them in the model.
Rescale each feature to the 0-1 interval using min-max normalization.
Standardize each feature to zero mean and unit variance (z-score).
Apply a natural logarithm transform (log1p) to every value in both features.
The k-nearest neighbors (k-NN) algorithm computes Euclidean distances directly on feature values, so a feature with a larger numeric range will dominate the distance metric. Min-max normalization addresses this by rescaling each continuous feature to a common bounded interval (typically 0 to 1), giving AnnualEnergy_kWh and MaintenanceDowntime_min equal potential influence on the distance calculation. Standardization (z-score) is another common scaling technique, but it only centers each feature at a mean of 0 with a standard deviation of 1; it does not enforce a strict bounded range, so the two features could still span different intervals. Min-max scaling can itself be distorted by extreme outliers, but the fixed ranges given here (0 to 12,000 and 0 to 7,200) make that less of a concern. A log transform (log1p) changes a feature's distribution shape to reduce skewness and is not designed to equalize features of different magnitudes. Creating polynomial cross-terms is a feature-engineering technique that introduces new, unscaled variables and would likely worsen the imbalance. Therefore, min-max normalization is the most appropriate preprocessing step in this scenario.
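As a minimal sketch of the idea (assuming scikit-learn and NumPy are available; the two sample records are hypothetical), the snippet below shows how min-max rescaling balances the two features' contributions to a Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical records: [AnnualEnergy_kWh, MaintenanceDowntime_min].
# Each differs from the origin by 50% of its feature's range, so
# intuitively both features should contribute equally to the distance.
X = np.array([
    [0.0,    0.0],
    [6000.0, 3600.0],
])

# On the raw scale, the energy term (6000^2) supplies ~74% of the
# squared Euclidean distance simply because kWh spans a wider range.
raw_dist = np.linalg.norm(X[0] - X[1])

# Min-max normalization: x' = (x - min) / (max - min).
# Fit on the known feature ranges from the scenario (0-12,000 and 0-7,200).
scaler = MinMaxScaler().fit([[0.0, 0.0], [12000.0, 7200.0]])
X_scaled = scaler.transform(X)

# After rescaling, both differences are 0.5, so each feature
# contributes exactly half of the squared distance.
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"raw distance:    {raw_dist:,.0f}")    # ~6,997, dominated by kWh
print(f"scaled distance: {scaled_dist:.3f}")  # ~0.707, balanced
```

The same rescaling would be fit on the training data and applied to any new record before querying the k-NN model, so that neighbors are found in the normalized space.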