A machine learning engineer is training a deep neural network on a non-stationary problem and notices that learning has effectively halted. They determine that their current optimizer, Adagrad, has caused the effective learning rate to shrink to a near-zero value. To mitigate this, they decide to switch to the Root Mean Square Propagation (RMSprop) optimizer. What is the key mechanism in RMSprop that directly addresses this vanishing effective learning rate seen in Adagrad?
It computes adaptive learning rates by storing an exponentially decaying average of past gradients (first moment) and past squared gradients (second moment).
It introduces a penalty term to the loss function based on the magnitude of the model's weights to prevent overfitting.
It adds a fraction of the previous weight update vector to the current one, helping to accelerate convergence and dampen oscillations.
It calculates a moving average of the squared gradients using a decay parameter, which prevents the denominator of the update rule from monotonically increasing.
The correct answer explains the core mechanism of RMSprop that solves a key limitation of Adagrad. RMSprop maintains an exponentially decaying average of past squared gradients. Unlike Adagrad, which accumulates all past squared gradients, RMSprop's use of a moving average (controlled by a decay parameter, rho) prevents the denominator in the learning rate update from growing indefinitely. This ensures the learning rate does not become too small, allowing the model to continue learning effectively, especially in non-stationary settings.
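To make the contrast concrete, here is a minimal NumPy sketch (not part of the original question) comparing the two accumulator updates. The function names and hyperparameter defaults (lr, rho, eps) are illustrative, not taken from any particular framework.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """Adagrad: accumulate ALL past squared gradients; accum only grows."""
    accum = accum + grad ** 2                # monotonically increasing sum
    w = w - lr * grad / (np.sqrt(accum) + eps)  # effective step shrinks toward zero
    return w, accum

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: exponentially decaying average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2   # old gradients fade out
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)     # denominator stays bounded
    return w, avg_sq
```

Because `avg_sq` is a weighted average rather than a running sum, it stays on the scale of recent squared gradients, so the denominator does not grow without bound and the step size remains useful even late in training.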
The distractor describing the use of both first and second moment estimates refers to the Adam optimizer, which combines the adaptive learning rate mechanism of RMSprop with momentum. The option describing the addition of a fraction of the previous weight update vector refers to the Momentum optimizer, a separate technique for accelerating gradient descent and damping oscillations. The remaining incorrect option describes L2 regularization (weight decay), a technique for preventing overfitting by penalizing large weights; it is unrelated to the adaptive learning rate mechanism of RMSprop. The sketch below contrasts these mechanisms.
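The following NumPy sketch illustrates each distractor's mechanism under the same illustrative conventions as above; names and defaults are assumptions for demonstration only.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Momentum: add a fraction of the previous update to the current one."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of gradients (first moment, m) AND squared
    gradients (second moment, v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def l2_regularized_grad(grad, w, lam=1e-4):
    """L2 regularization: penalizes large weights via the loss gradient;
    it does not adapt the step size at all."""
    return grad + lam * w
```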