A data scientist is hand-coding the backward pass for a multi-class logistic regression model. For a logit vector z ∈ ℝᴷ the softmax function is defined as
σ_k(z) = exp(z_k) / Σ_{j=1}^{K} exp(z_j).
During backpropagation they must compute the Jacobian element ∂σ_k(z) / ∂z_i. Which of the following expressions is mathematically correct for this partial derivative (δ denotes the Kronecker delta)?
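For reference, this definition is only a few lines of NumPy. The sketch below is illustrative: the `softmax` helper name and the max-subtraction stabilization are additions, not part of the question.

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector z in R^K.

    Shifting by max(z) before exponentiating avoids overflow and does
    not change the result, since the shift cancels in the ratio.
    """
    e = np.exp(z - np.max(z))
    return e / e.sum()
```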
Because the softmax function involves an exponential term divided by a sum of exponential terms, its derivative requires the application of the quotient rule. The derivative of the numerator, exp(z_k), with respect to z_i is exp(z_k) when i=k and 0 otherwise, which can be written as exp(z_k) * δ_ik. The derivative of the denominator, Σ_j exp(z_j), with respect to z_i is exp(z_i).
Applying the quotient rule gives

∂σ_k/∂z_i = [exp(z_k) * δ_ik * Σ_j exp(z_j) − exp(z_k) * exp(z_i)] / (Σ_j exp(z_j))²

Dividing each term of the numerator by (Σ_j exp(z_j))² and substituting the definition of softmax, σ_k = exp(z_k)/Σ_j exp(z_j), yields

∂σ_k/∂z_i = (exp(z_k)/Σ_j exp(z_j)) * δ_ik − (exp(z_k)/Σ_j exp(z_j)) * (exp(z_i)/Σ_j exp(z_j))

∂σ_k/∂z_i = σ_k * δ_ik − σ_k * σ_i

∂σ_k/∂z_i = σ_k (δ_ik − σ_i)
Therefore, the choice showing σ_k(z) (δ_{ik} − σ_i(z)) is correct.
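In matrix form this Jacobian is diag(σ) − σσᵀ. Here is a minimal NumPy sketch (the function names and test logits are illustrative) that builds it and checks one column against a central finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """K x K Jacobian with entries J[k, i] = sigma_k * (delta_ik - sigma_i)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)  # diag(sigma) - sigma sigma^T

z = np.array([0.5, -1.0, 2.0])  # arbitrary test logits
J = softmax_jacobian(z)

# Central finite-difference check of column i (all partials w.r.t. z_i).
i, eps = 0, 1e-6
e_i = np.zeros_like(z)
e_i[i] = eps
fd_col = (softmax(z + e_i) - softmax(z - e_i)) / (2 * eps)
assert np.allclose(J[:, i], fd_col, atol=1e-8)

# The diagonal reduces to sigma_k * (1 - sigma_k), the i = k case noted below.
s = softmax(z)
assert np.allclose(np.diag(J), s * (1 - s))
```

In practice, when softmax is paired with a cross-entropy loss, this full Jacobian is rarely materialized: the combined backward pass collapses to the much simpler gradient σ(z) − y.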
The option σ_k(z) (1 − σ_k(z)) is only valid for the special case where i = k (the diagonal elements of the Jacobian) and is analogous to the derivative of the simpler sigmoid function.
The option σ_i(z) (δ_{ik} − σ_k(z)) swaps the outer σ_k factor with σ_i relative to the derived form. (Strictly speaking, because δ_{ik} is nonzero only when i = k, where σ_i = σ_k, this expression evaluates to the same values as the correct one; it simply does not match the form the derivation produces.)
The option δ_{ik} − σ_k(z) σ_i(z) is an incorrect expansion of the derivative, as it is missing the σ_k(z) factor on the δ_{ik} term.