A data scientist is working with a time-series dataset from a network of environmental sensors monitoring river water temperature. Upon analysis, it is discovered that one specific sensor began consistently reporting temperatures 1.5°C higher than three co-located, recently calibrated sensors. The deviation began abruptly on a specific date and has persisted in every subsequent reading from that sensor alone. Prior to this date, all sensors exhibited highly correlated measurements. Which of the following data-wrangling techniques is the most appropriate initial method to address this specific type of data error?
Remove all data points from the malfunctioning sensor for the period after the identified anomaly date.
Apply Winsorization at the 95th percentile to the entire dataset to limit the influence of extreme high temperature readings.
Impute the anomalous readings by replacing them with the rolling mean calculated from the co-located sensors' data.
Apply a constant offset correction of -1.5°C to all measurements from the anomalous sensor starting from the date the deviation began.
The correct answer is to apply a constant offset correction. The scenario describes a classic systematic error: a step-change instrumental bias in which the sensor consistently reports values offset by a fixed amount (here, +1.5°C). Note that this is a bias rather than drift; drift grows gradually over time, whereas the deviation here appears abruptly and then stays constant. Applying a simple arithmetic correction (subtracting 1.5°C from the affected readings) is the most direct and efficient initial step. This method removes the bias while preserving the original variance and trends within that sensor's data after the point of failure.
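In pandas, such a correction is a single vectorized operation on the affected date range. The sketch below uses made-up readings and an assumed anomaly date purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; the +1.5°C bias and its start date are
# assumptions matching the scenario, not real sensor data.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
anomaly_start = pd.Timestamp("2024-01-06")

true_temp = pd.Series(np.linspace(10.0, 12.0, 10), index=dates)
sensor = true_temp.copy()
sensor[sensor.index >= anomaly_start] += 1.5  # step-change bias begins

# Constant offset correction: subtract the known bias from readings
# on or after the anomaly date; earlier readings are left untouched.
corrected = sensor.copy()
corrected[corrected.index >= anomaly_start] -= 1.5

print(np.allclose(corrected.values, true_temp.values))  # True
```

Because only a constant is subtracted, the sensor's own variance and trend after the failure date are preserved exactly.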
Removing all data points from the malfunctioning sensor is too drastic for an initial step, as it leads to significant data loss, which can be detrimental in time-series analysis. This is a last resort if the data is deemed unrecoverable.
Imputing the anomalous readings with a rolling mean of the co-located sensors would erase the unique, valid fluctuations the problematic sensor captured and replace them with a smoothed, averaged signal. For a constant bias, this is less precise than an offset correction.
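A small sketch (with made-up readings and hypothetical sensor names) makes the loss concrete: the cross-sensor mean discards whatever real local variation the faulty sensor recorded.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from the three co-located reference sensors.
ref = pd.DataFrame({
    "s1": [10.0, 10.3, 10.6],
    "s2": [10.1, 10.2, 10.7],
    "s3": [9.9, 10.4, 10.5],
})

# Imputation replaces the faulty sensor's values with the averaged
# reference signal; the sensor's own micro-fluctuations are gone.
imputed = ref.mean(axis=1)
print(imputed.tolist())  # [10.0, 10.3, 10.6]
```

An offset correction, by contrast, keeps the faulty sensor's actual waveform and only shifts its level.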
Winsorization is a technique used to handle outliers (sporadic extreme values) by capping them at a certain percentile. It is not appropriate for a systematic error that affects all data points in a consistent direction, not just the extremes.
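A minimal NumPy sketch (with made-up readings, one of which is a sporadic spike) shows why percentile capping cannot undo a uniform bias:

```python
import numpy as np

# Hypothetical readings: mostly normal values plus one sporadic spike.
readings = np.array([10.0, 10.2, 11.5, 11.5, 11.6, 11.7, 25.0])

# Winsorization at the 95th percentile: cap values above the percentile.
cap = np.percentile(readings, 95)
winsorized = np.clip(readings, None, cap)

# Only the spike is pulled down to the cap; all other values are
# unchanged. A uniform +1.5°C bias would shift the percentile itself,
# so every biased reading would pass through uncorrected.
print(winsorized[-1] < 25.0)                        # True
print((winsorized[:-1] == readings[:-1]).all())     # True
```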