You are preparing a large e-commerce transactions table for a sales-forecasting model. A validation query reveals that every record whose client_app_version equals "3.2.1-legacy" shows order_amount values roughly 100× larger than comparable orders (for example, a typical $75 purchase is stored as 7,500).
The mobile engineering team confirms that this specific app version sent monetary values in cents instead of dollars; no other rows are affected.
To correct the data while preserving information and maintaining data-lineage metadata, which data-wrangling action should you take?
Replace the affected order_amount values with NULL and later impute them with the overall median order value.
Drop all rows generated by client_app_version = '3.2.1-legacy' to remove the corrupted records completely.
Treat the inflated values as idiosyncratic errors and winsorize order_amount at the 99th percentile across the entire dataset to cap extreme values.
Identify the issue as a scale-factor systematic error and divide order_amount by 100 only for rows where client_app_version = '3.2.1-legacy', recording the transformation in the pipeline metadata.
The inflation affects all rows from one known app version in a consistent, proportional way, so it is a scale-factor systematic error. Because both the cause and the factor (100) are known, the most accurate fix is a deterministic transformation that rescales only the affected subset. Winsorizing, deleting, or null-and-impute approaches would either distort valid observations or discard usable data, and none of them addresses the systematic nature (a consistent proportional bias) of the error. Dividing the amounts by 100 for the identified rows corrects the measurements, retains every transaction, and documenting the step in the pipeline metadata satisfies the data-lineage requirement.
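A minimal sketch of this correction in pandas is shown below. The DataFrame, column names, and the structure of the lineage entry are illustrative assumptions; a production pipeline might instead write the lineage record to a metadata store or a tool such as OpenLineage.

```python
import pandas as pd

# Illustrative data: legacy-version rows carry amounts in cents.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003],
    "client_app_version": ["3.2.1-legacy", "4.0.0", "3.2.1-legacy"],
    "order_amount": [7500.0, 75.0, 12000.0],
})

# Identify only the rows produced by the faulty app version.
affected = orders["client_app_version"] == "3.2.1-legacy"

# Deterministic rescale: cents -> dollars for the affected subset only.
orders.loc[affected, "order_amount"] = orders.loc[affected, "order_amount"] / 100

# Record the transformation as a lineage entry (structure is an assumption).
lineage_log = [{
    "step": "scale_factor_correction",
    "column": "order_amount",
    "filter": "client_app_version == '3.2.1-legacy'",
    "operation": "divide by 100 (cents to dollars)",
    "rows_affected": int(affected.sum()),
}]

print(orders)
print(lineage_log)
```

Keying the fix on the boolean mask rather than on the inflated values themselves ensures that legitimately large orders from other app versions are left untouched.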