An e-commerce analyst is auditing a 3-million-row payments table. Transaction amounts are strictly positive and strongly right-skewed: the median is USD 40, the mean is USD 120, and the standard deviation is USD 600. Legitimate orders sometimes exceed USD 10 000, but data-entry errors occasionally add extra zeroes, producing impossible amounts above USD 100 000. The analyst must flag only the erroneous records while keeping genuine high-value transactions. Which statistical technique offers the most robust approach for this task?
Log-transform the amounts and remove points lying more than 2 standard deviations above the mean in log space.
Calculate modified z-scores using the median absolute deviation (MAD) and flag observations whose |modified z| exceeds 3.5.
Apply the Tukey rule and mark any value greater than Q3 + 1.5 × IQR as an outlier.
Compute standard z-scores from the mean and standard deviation and flag observations with |z| > 3.
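Because the distribution is highly skewed and contains a long but legitimate upper tail, methods built on the mean and standard deviation are distorted by that tail. With a mean of USD 120 and a standard deviation of USD 600, the |z| > 3 rule puts its cutoff at roughly USD 1 920 (120 + 3 × 600), so every legitimate order above that amount, including the genuine ones exceeding USD 10 000, would be flagged. A 2-standard-deviation cutoff in log space is also too aggressive: even if the log amounts were perfectly normal, it would flag about 2 % of a 3-million-row table, which amounts to tens of thousands of valid purchases. The classical 1.5 × IQR rule likewise over-identifies high values when the upper tail is long. A modified z-score based on the median absolute deviation (MAD) replaces the mean with the median and the standard deviation with the MAD; both statistics are resistant to extreme values, so the erroneous records cannot inflate the yardstick used to judge them. Of the four options, the conventional |modified z| > 3.5 threshold is therefore the most robust way to isolate the impossible amounts above USD 100 000, making it the most appropriate choice.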
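The modified z-score described above can be sketched in a few lines of Python. This is a minimal illustration, not a production audit: the function names and the sample amounts are made up for the example, and the scaling constant 0.6745 is the conventional factor that makes the MAD-based score comparable to a standard z-score under normality.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified z-score of each value, using the median and MAD."""
    med = median(values)
    # MAD: the median of the absolute deviations from the median
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return True for each value whose |modified z| exceeds the threshold."""
    return [abs(z) > threshold for z in modified_z_scores(values)]


# Hypothetical sample: eight ordinary amounts plus one data-entry error
amounts = [25, 30, 35, 40, 40, 45, 50, 60, 250000]
flags = flag_outliers(amounts)
flagged = [a for a, is_outlier in zip(amounts, flags) if is_outlier]
print(flagged)  # [250000]
```

In this toy sample the median is 40 and the MAD is 10, and neither moves when the erroneous 250000 is added, which is the robustness property the explanation relies on. On a real payments table it is worth checking the 3.5 cutoff against known legitimate high-value orders before deleting anything.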