A data science team is training a deep neural network to forecast intraday cryptocurrency returns. The target occasionally exhibits extreme spikes that are genuine (not data errors) but occur less than 1% of the time. When the team optimizes the model with mean-squared error, the network overfits to those rare points and produces unstable forecasts. They want a loss function that remains smooth and differentiable near zero error for standard gradient-based optimizers, but also sharply reduces the influence of very large residuals by changing the penalty from quadratic to linear once a tunable threshold is exceeded.
Which loss function best satisfies these requirements, and why?
Quantile (pinball) loss - it applies asymmetric linear penalties based on a chosen quantile, emphasizing under- or over-predictions rather than specifically limiting outliers.
Log-cosh loss - it smoothly approximates squared error near zero and absolute error for larger residuals but does not switch to a strict linear penalty at any threshold.
Mean-squared error - it squares every residual, increasing the penalty for large errors so the model focuses on the extreme spikes.
Huber loss - it is quadratic for small residuals and becomes linear once the error magnitude exceeds a chosen threshold, limiting outlier influence while preserving differentiability.
The Huber loss is specifically designed for robust regression. For residuals whose absolute value is below a threshold δ, it behaves like mean-squared error (quadratic), preserving a smooth, differentiable surface that works well with gradient descent. Once the absolute value of the residual exceeds δ, the penalty switches to a linear form that grows like mean-absolute error, so large residuals no longer dominate the objective. This piecewise definition dampens the influence of outliers without completely discarding them.
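A minimal sketch of this piecewise definition, assuming a NumPy implementation with an illustrative threshold parameter `delta` (names and values are not from the question itself):

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Huber loss: quadratic where |r| <= delta, linear where |r| > delta."""
    r = np.asarray(residuals, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2                  # smooth penalty for small residuals
    linear = delta * (abs_r - 0.5 * delta)    # capped growth for large residuals
    return np.where(abs_r <= delta, quadratic, linear)

# Small residuals are penalized quadratically, the large spike only linearly.
print(huber_loss([0.1, 0.5, 5.0], delta=1.0))   # [0.005, 0.125, 4.5]
```

The `0.5 * delta` offset in the linear branch keeps the two pieces and their derivatives continuous at the threshold, which is what preserves smoothness for gradient-based optimizers.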
Mean-squared error squares every residual, so extreme values still dominate the gradient and pull the fit toward the rare spikes. Log-cosh loss does become approximately linear for large errors, but the transition is gradual and there is no tunable threshold controlling where the quadratic behavior ends, so it cannot provide the configurable quadratic-to-linear switch the team asked for. Quantile (pinball) loss applies asymmetric linear penalties that target specific conditional quantiles rather than the conditional mean, and therefore also fails to meet the stated requirement.
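To make the contrast concrete, here is an illustrative comparison of the gradients with respect to a single residual (the 0.5 scaling on the squared term is assumed so the two losses match inside the threshold):

```python
import numpy as np

def mse_grad(r):
    # d/dr of 0.5 * r**2: grows without bound as the residual grows
    return r

def huber_grad(r, delta=1.0):
    # d/dr of the Huber loss: equals r inside the threshold, then saturates at +/- delta
    return np.clip(r, -delta, delta)

for r in [0.5, 5.0, 50.0]:
    print(f"residual={r:5.1f}  mse_grad={mse_grad(r):5.1f}  huber_grad={huber_grad(r):.1f}")
# residual=  0.5 -> identical gradients (quadratic region)
# residual=  5.0 -> MSE gradient 5.0 vs Huber gradient capped at 1.0
# residual= 50.0 -> MSE gradient 50.0 vs Huber gradient still 1.0
```

The bounded Huber gradient is why the rare extreme spikes stop destabilizing the updates, while ordinary residuals are trained on exactly as they would be under MSE.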