During training you notice that a deep multilayer perceptron that uses tanh(x) in every hidden layer begins to learn extremely slowly after the first few epochs. You suspect the gradients are vanishing as they are back-propagated. From a mathematical standpoint, which property of the tanh activation most directly explains why its use can drive gradients toward zero when neuron inputs have large magnitude?
Its first derivative is 1 − tanh²(x), which tends to zero as |x| becomes large, so back-propagated gradients are repeatedly attenuated.
Its output range is strictly 0 to 1, so activations stay positive and bias the gradient toward zero.
Its second derivative is a constant 1, so there is no curvature change and gradients get stuck at saddle points instead of vanishing.
Its first derivative equals x for |x| > 1, causing gradients to grow without bound and leading to exploding rather than vanishing gradients.
Back-propagation multiplies the upstream gradient by the local derivative of each activation. For tanh that derivative is tanh′(x) = 1 − tanh²(x). When a neuron's pre-activation |x| becomes large, tanh(x) saturates near ±1, so tanh′(x) is almost zero. Repeated multiplication by these near-zero factors across many layers quickly shrinks the gradient, producing the vanishing-gradient problem. The other options fail because the derivative does not equal x, the second derivative is not constant, and tanh outputs values in the open interval −1 to 1, not 0 to 1.
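A minimal NumPy sketch can make both points concrete (the depth of 20 layers and the saturated pre-activation value of 3.0 are illustrative assumptions, not taken from the question): the local derivative 1 − tanh²(x) collapses as |x| grows, and multiplying one such factor per layer drives the back-propagated gradient toward zero.

import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, which tends to 0 as |x| grows.
    return 1.0 - np.tanh(x) ** 2

# The local derivative shrinks rapidly once the pre-activation saturates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  tanh'(x) = {tanh_grad(x):.2e}")

# Back-propagation multiplies one such factor per layer. With saturated
# pre-activations (all set to 3.0 here purely for illustration), the
# product collapses toward zero as depth grows.
depth = 20
pre_activations = np.full(depth, 3.0)   # hypothetical saturated pre-activations
grad = 1.0                              # upstream gradient at the output
for x in pre_activations:
    grad *= tanh_grad(x)
print(f"gradient after {depth} saturated tanh layers: {grad:.2e}")

Running this prints per-layer derivatives on the order of 1e-2 for x = 3 and a final gradient on the order of 1e-40 after 20 layers, which is the repeated attenuation described above.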