While implementing stochastic gradient descent with Momentum, a developer maintains a velocity vector v that is initialized to zeros. At iteration t the gradient is denoted g_t = ∇L(θ_{t−1}), the learning rate is η, and the momentum coefficient is β. Which pair of update equations corresponds to the classical Momentum algorithm as implemented in major deep-learning libraries such as PyTorch and TensorFlow?
Classical Momentum accumulates an exponentially decaying sum of past gradients. First the velocity is updated as a weighted sum of the previous velocity and the current gradient; then the parameters are moved in the direction of this velocity, scaled by the learning rate. The two-line rule is therefore v_t = β·v_{t−1} + g_t followed by θ_t = θ_{t−1} − η·v_t (see the sketch after the list below). The other options deviate from this:
The second choice embeds the learning rate inside the velocity and then omits it from the parameter step, changing the effective step size and breaking the typical interpretation of β.
The third choice subtracts both the raw gradient and the momentum buffer, resembling a form of Nesterov look-ahead rather than classical Momentum.
The fourth choice replaces the running average of gradients with a running average of squared gradients and divides by its square root; this is the core idea of RMSprop/Adam, not Momentum.
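To make the comparison concrete, here is a minimal Python/NumPy sketch of the correct two-line rule, with the fourth option's squared-gradient rule included only for contrast. The function names, the toy quadratic loss, and the hyperparameter values are illustrative assumptions, not part of the question.

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.1, beta=0.9, steps=100):
    """Classical momentum: v_t = beta*v_{t-1} + g_t, then theta_t = theta_{t-1} - lr*v_t."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)              # velocity buffer, initialized to zeros
    for _ in range(steps):
        g = grad_fn(theta)                # g_t = gradient of L at theta_{t-1}
        v = beta * v + g                  # weighted sum of past gradients
        theta = theta - lr * v            # step along the velocity, scaled by the learning rate
    return theta

def squared_gradient_update(grad_fn, theta0, lr=0.1, beta=0.9, eps=1e-8, steps=100):
    """The fourth option's idea (RMSprop-style): track squared gradients, divide by their root."""
    theta = np.asarray(theta0, dtype=float).copy()
    s = np.zeros_like(theta)              # running average of squared gradients (not a velocity)
    for _ in range(steps):
        g = grad_fn(theta)
        s = beta * s + (1 - beta) * g**2  # second-moment estimate
        theta = theta - lr * g / (np.sqrt(s) + eps)
    return theta

# Toy check on L(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
print(sgd_momentum(lambda th: th, [5.0, -3.0]))             # approaches the minimizer [0, 0]
print(squared_gradient_update(lambda th: th, [5.0, -3.0]))  # a different update family, shown for contrast
```

For comparison, torch.optim.SGD(params, lr=lr, momentum=beta) maintains the same kind of momentum buffer (buf = momentum·buf + grad with the default dampening of 0, followed by param −= lr·buf when nesterov=False), which is why the first pair of equations matches what these libraries implement.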