You are validating a 10-million-row transaction table exported from a retail point-of-sale system. The quantity_sold column must hold non-negative integers, but a quick scan shows that about 0.08% of the rows contain strings such as "ten", "five", or "three", the result of occasional cashier keying errors. No store, cashier, or date is consistently affected.
Which remediation best addresses this idiosyncratic data error while preserving the analytic usefulness of the column?
Map the spelled-out numerals to integers with a dictionary (e.g., Series.replace()) and then cast quantity_sold to an integer dtype.
Delete every record whose quantity_sold value is not already numeric to enforce column integrity.
Overwrite the entire quantity_sold column with its global median so every row shares a consistent numeric value.
Convert the whole quantity_sold column to string so it can store both numeric and text values unchanged.
Idiosyncratic errors are rare, record-specific mistakes with no consistent pattern. The goal is to correct only the affected records and leave the vast majority of valid data untouched.
Mapping the few spelled-out numerals to their numeric equivalents (e.g., {"five": 5}) and then casting the column to an integer type fixes the erroneous rows and preserves every valid value. No information is lost, and downstream numeric operations continue to work.
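As a rough illustration of the mapping approach, the sketch below uses pandas on a tiny made-up sample. The DataFrame contents, the WORD_TO_INT dictionary, and the to_number helper are assumptions for demonstration only; a dictionary passed to Series.replace() would work the same way for exact matches, while the helper here also tolerates stray whitespace and capitalization.

```python
import pandas as pd

# Hypothetical sample standing in for the exported table.
df = pd.DataFrame({"quantity_sold": [3, "ten", 7, "five", 0, "three"]})

# Assumed lookup table; extend it with whatever spellings the scan turns up.
WORD_TO_INT = {"three": 3, "five": 5, "ten": 10}


def to_number(value):
    """Translate a spelled-out numeral; leave every other value as it is."""
    if isinstance(value, str):
        return WORD_TO_INT.get(value.strip().lower(), value)
    return value


cleaned = df["quantity_sold"].map(to_number)

# Anything still unmapped becomes NaN here, so it can be reviewed
# instead of being silently kept or dropped.
numeric = pd.to_numeric(cleaned, errors="coerce")
unresolved = df.loc[numeric.isna(), "quantity_sold"]

# Cast only once every value has been resolved to a number.
if unresolved.empty:
    df["quantity_sold"] = numeric.astype("int64")
```

Only the affected cells change, and the cast to an integer dtype happens after confirming that nothing was left unmapped.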
Dropping all rows that contain text in the column sacrifices potentially good data in other columns of those rows. Global imputation with the median overwrites all correct observations, which distorts distributional statistics. Re-typing the entire column as a string keeps the errors but removes critical numeric behavior from a core metric, rendering it useless for analysis.
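The cost of each rejected option can be seen on a small made-up sample. This is only a sketch: the store column is a hypothetical stand-in for the other fields in each record, and the values are invented.

```python
import pandas as pd

# Hypothetical sample; "store" represents the other columns in each record.
df = pd.DataFrame({
    "store": ["A", "B", "C", "D", "E"],
    "quantity_sold": [2, "five", 4, 9, "ten"],
})

# Non-numeric entries become NaN so they can be identified.
numeric = pd.to_numeric(df["quantity_sold"], errors="coerce")

# Deleting non-numeric rows: the store values for B and E are lost too.
dropped = df[numeric.notna()]
rows_lost = len(df) - len(dropped)

# Overwriting with the global median: every row ends up with one value.
median_filled = pd.Series(numeric.median(), index=df.index)
distinct_after_median = median_filled.nunique()

# Casting to string: the column is no longer a numeric dtype.
as_text = df["quantity_sold"].astype(str)
still_numeric = pd.api.types.is_numeric_dtype(as_text)
```

In this sample, rows_lost is 2, distinct_after_median is 1, and still_numeric is False, which mirrors the three drawbacks described above.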