During an ablation study you train two otherwise identical multilayer perceptrons on the same data set:
Network A uses the logistic sigmoid (σ) activation in every hidden layer.
Network B uses the hyperbolic tangent (tanh) activation in every hidden layer. With the same optimizer, learning-rate schedule, batch size, and weight initialization, Network B reaches the target validation loss in roughly half the epochs required by Network A.
Which intrinsic property of the logistic sigmoid most plausibly explains the slower convergence of Network A?
Its derivative equals one at zero, leading to gradient magnitudes that explode during early training.
It is not differentiable for negative inputs, so back-propagation cannot adjust weights efficiently.
The exponential operations in its formula require more floating-point instructions, and this computational overhead dominates training time.
Its outputs are strictly positive, so hidden activations are not zero-centered; the resulting weight updates share a sign and cause gradient descent to zig-zag, slowing learning.
The logistic sigmoid squashes all hidden-layer outputs into the range (0, 1), so the average activation is positive (≈ 0.5). Because the error term from the next layer is multiplied by these always-positive activations during back-propagation, the partial derivatives of the loss with respect to all weights feeding into a given neuron share the same sign for each training example. Gradient descent therefore moves in a zig-zag path toward the optimum and needs many small steps, which appears as slow convergence. Tanh, by contrast, produces zero-centered outputs in (−1, 1), enabling more balanced (positive and negative) weight updates that reach the optimum faster.
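A minimal NumPy sketch (with hypothetical values, not taken from the question) illustrating the same-sign effect: when a neuron's inputs are all positive, as with sigmoid activations, every per-example weight gradient into that neuron shares the sign of the upstream error term, whereas tanh activations can yield mixed signs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=5)             # pre-activations of the previous hidden layer
delta = -0.8                       # upstream error term for one neuron (single example)

# Gradient of the loss w.r.t. the weights into that neuron: dL/dw_i = delta * x_i
grad_sigmoid = delta * sigmoid(z)  # inputs in (0, 1)  -> all gradients share delta's sign
grad_tanh    = delta * np.tanh(z)  # inputs in (-1, 1) -> gradients can have mixed signs

print("sigmoid inputs :", np.round(sigmoid(z), 3))
print("grad (sigmoid) :", np.round(grad_sigmoid, 3))   # uniformly negative here
print("tanh inputs    :", np.round(np.tanh(z), 3))
print("grad (tanh)    :", np.round(grad_tanh, 3))      # mixed signs
```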
The other statements are incorrect:
The derivative of σ at zero is 0.25, not 1, so it cannot cause exploding gradients (see the quick check after this list).
σ is smooth and differentiable for all real inputs, including negative values.
Although σ contains an exponential, the extra floating-point operations are negligible compared with matrix multiplies and cannot by themselves double the number of training epochs.
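As a quick check of the derivative claim above, a short sketch using the identity σ'(x) = σ(x)(1 − σ(x)), which peaks at 0.25 at x = 0 and shrinks toward zero in the tails:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the logistic sigmoid

print(sigmoid_grad(0.0))                                # 0.25 -- the maximum, well below 1
print(sigmoid_grad(np.array([-4.0, -2.0, 2.0, 4.0])))   # even smaller away from zero
```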