While building a logistic-regression model to predict loan default, you find that 8% of the values for the numeric attribute debt_to_income_ratio are missing from your training data. Exploratory analysis reveals that the probability of a value being missing is higher for borrowers who are younger than 25 and who have less than one year of employment, but within those strata the missingness appears random. The feature is continuous, right-skewed, and has a strong influence on the target. Regulation requires that the chosen imputation technique preserve the variable's variance and explicitly propagate the extra uncertainty introduced by the missing data to any downstream parameter estimates. Which imputation approach is the most appropriate to meet these constraints?
k-nearest-neighbors imputation using Euclidean distance on standardized predictors
Listwise deletion of all records that lack debt_to_income_ratio
Multiple imputation with pooled estimates across several completed data sets
Single mean imputation calculated within each cross-validation fold
Multiple imputation (for example, multiple imputation by chained equations) stochastically draws several plausible values for each missing observation conditional on the observed data, creating multiple completed data sets. Model parameters are estimated in each data set and then pooled, so between-imputation variability is carried forward and reflected in the standard errors. This satisfies the requirement to propagate uncertainty under a Missing-at-Random mechanism, which is the mechanism described here because missingness depends only on observed covariates (age and employment length). Listwise deletion simply removes affected rows, reducing sample size and potentially yielding biased coefficients when missingness depends on observed covariates. Single mean imputation is deterministic; it underestimates variance and ignores imputation uncertainty. k-nearest-neighbors imputation generates only one completed data set and likewise fails to account for the sampling variability of the imputed values. Therefore, multiple imputation is the most suitable choice.
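The pooling step described above can be sketched in a few lines. This is a minimal illustration, assuming the usual combining rules (Rubin's rules), which the explanation does not name explicitly. The function name `pool_rubin` and the coefficient values are hypothetical, chosen only to show how between-imputation variability inflates the pooled variance.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m completed data sets (Rubin's rules).

    estimates: the coefficient estimated in each completed data set
    variances: the squared standard error of that coefficient in each data set
    Returns (pooled_estimate, total_variance, within_variance, between_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)        # average of the per-data-set estimates
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total, within, between


# Hypothetical coefficients for debt_to_income_ratio from m = 5 completed data sets.
coefs = [0.50, 0.54, 0.46, 0.52, 0.48]
sq_std_errors = [0.010, 0.010, 0.010, 0.010, 0.010]

pooled, total, within, between = pool_rubin(coefs, sq_std_errors)
print(f"pooled estimate = {pooled:.3f}")
print(f"within = {within:.4f}, between = {between:.4f}, total = {total:.4f}")
```

The total variance exceeds the within-imputation variance whenever the estimates differ across completed data sets. A single-imputation method such as mean or k-nearest-neighbors imputation produces only one data set, so it has no between-imputation term to report.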