A data scientist is analyzing a large dataset of customer order transactions for an e-commerce company. They identify a single transaction with a 'quantity ordered' value that is several orders of magnitude higher than any other transaction in the dataset. This value significantly skews the distribution. Which of the following is the most appropriate initial step to determine if this outlier is a valid data point or an error?
Calculate the Z-score for all 'quantity ordered' values and immediately remove any data point with a score greater than 3, as it is a statistical outlier.
Conclude it is a data entry error and replace the value using median imputation to normalize the distribution.
Apply Winsorization to the 'quantity ordered' column, capping the extreme value at the 99th percentile to reduce its influence on subsequent analysis.
Cross-reference the transaction ID with related datasets, such as inventory logs or customer purchase history, to verify the order's legitimacy.
The correct answer is to cross-reference the transaction with associated data, such as inventory logs, customer purchase history, or shipping records. This approach uses contextual information to validate the transaction before anything is changed. An unusually large order could be legitimate (e.g., a corporate bulk purchase) or an error (e.g., a data entry mistake), and the value alone cannot tell you which. Removing the point based on its Z-score, overwriting it with the median, or capping it through Winsorization without investigation risks discarding or distorting valid, albeit rare, information. Those techniques may be reasonable later, once the nature of the value is known. Cross-validating against other internal data sources, and consulting domain experts where needed, is the best practice for distinguishing an error from a true outlier.
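As a rough illustration of the idea, the sketch below flags the extreme quantity and then checks it against related records instead of removing it. Everything in it is an assumption made for the example: the sample orders, the `inventory_log` and `customer_accounts` lookups, the field names, and the helper functions `flag_candidates` and `verify` are all hypothetical. The flagging step uses a median-based modified Z-score with the commonly cited 3.5 cutoff, which is one reasonable choice among several and is not something the question prescribes.

```python
from statistics import median

# Hypothetical sample data; all IDs, fields, and values are illustrative.
orders = [
    {"transaction_id": "T001", "customer_id": "C1", "quantity": 2},
    {"transaction_id": "T002", "customer_id": "C2", "quantity": 1},
    {"transaction_id": "T003", "customer_id": "C3", "quantity": 3},
    {"transaction_id": "T004", "customer_id": "C4", "quantity": 5000},
    {"transaction_id": "T005", "customer_id": "C1", "quantity": 4},
]

# Hypothetical inventory log: units recorded as leaving stock per transaction.
inventory_log = {"T001": 2, "T002": 1, "T003": 3, "T004": 5000, "T005": 4}

# Hypothetical account types taken from customer records.
customer_accounts = {"C1": "retail", "C2": "retail", "C3": "retail", "C4": "corporate"}


def flag_candidates(orders, threshold=3.5):
    """Flag orders whose quantity has a large modified Z-score.

    Uses the median and median absolute deviation (MAD), which are less
    distorted by the extreme value than the mean and standard deviation.
    Flagging only marks a row for investigation; nothing is removed.
    """
    quantities = [o["quantity"] for o in orders]
    med = median(quantities)
    mad = median(abs(q - med) for q in quantities)
    if mad == 0:
        return []  # no spread to measure against; handle separately
    return [
        o for o in orders
        if 0.6745 * abs(o["quantity"] - med) / mad > threshold
    ]


def verify(order, inventory_log, customer_accounts):
    """Cross-reference one flagged order against related records."""
    shipped = inventory_log.get(order["transaction_id"])
    account = customer_accounts.get(order["customer_id"])
    if shipped is None:
        return "needs review: no inventory record found"
    if shipped != order["quantity"]:
        return "likely error: inventory log disagrees with the order"
    if account == "corporate":
        return "plausibly valid: inventory matches and account is corporate"
    return "needs review: inventory matches but account is not corporate"


for order in flag_candidates(orders):
    print(order["transaction_id"], "->", verify(order, inventory_log, customer_accounts))
```

The point of the design is that the statistical step only nominates a transaction for review, while the decision about whether it is valid comes from the related records. In practice the verdicts would likely feed a manual review or a conversation with a domain expert, not an automatic correction.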