ECE 275AB Lecture 11, Fall 2008, V1.1, (c) K. Kreutz-Delgado, UC San Diego

Lecture 11, ECE 275A: Generalized Gradient Descent Algorithms
Continuous Dynamic Minimization of l(x)

Let $l(\cdot): \mathcal{X} = \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with respect to $x \in \mathbb{R}^n$ and bounded from below by 0, $l(x) \ge 0$ for all $x$. Recalling that
$$\dot{l}(x) = \frac{\partial l(x)}{\partial x}\,\dot{x},$$
set
$$\dot{x} = -Q(x)\,\nabla^c_x l(x)$$
with
$$\nabla^c_x l(x) = \left(\frac{\partial l(x)}{\partial x}\right)^T = \text{Cartesian gradient of } l(x)$$
and $Q(x) = Q(x)^T > 0$ for all $x$. This yields
$$\dot{l}(x) = -\left\|\nabla^c_x l(x)\right\|^2_{Q(x)} \le 0, \quad \text{with } \dot{l}(x) = 0 \text{ if and only if } \nabla^c_x l(x) = 0.$$
Thus continuously evolving the value of $x$ according to $\dot{x} = -Q(x)\,\nabla^c_x l(x)$ ensures that the cost $l(x)$ dynamically decreases in value until a stationary point turns off learning.
Generalized Gradient Descent Algorithm

A family of algorithms for discrete-step dynamic minimization of $l(x)$ can be developed as follows. Setting $x_k = x(t_k)$, approximate $\dot{x}_k = \dot{x}(t_k)$ by the first-order forward difference
$$\dot{x}_k \approx \frac{x_{k+1} - x_k}{t_{k+1} - t_k}.$$
This yields
$$x_{k+1} \approx x_k + \alpha_k\,\dot{x}_k \quad \text{with} \quad \alpha_k = t_{k+1} - t_k,$$
which suggests the Generalized Gradient Descent Algorithm
$$\hat{x}_{k+1} = \hat{x}_k - \alpha_k\,Q(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k).$$
A vast body of literature in Mathematical Optimization Theory (aka Mathematical Programming) exists which gives conditions on the step size $\alpha_k$ to guarantee that a generalized gradient descent algorithm will converge to a stationary value of $l(x)$ for various choices of $Q(x) = Q(x)^T > 0$. Note that a Generalized Gradient Algorithm turns off once the sequence of estimates has converged to a stationary point of $l(x)$: $\hat{x}_k \to \hat{x}$ with $\nabla^c_x l(\hat{x}) = 0$.
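As a concrete illustration (not part of the lecture), the discrete update above can be sketched in NumPy. The quadratic loss, function names, and step size below are all illustrative choices:

```python
import numpy as np

def generalized_gradient_descent(grad, Q, x0, alpha=0.1, tol=1e-8, max_iter=5000):
    """Iterate x_{k+1} = x_k - alpha * Q(x_k) @ grad(x_k) until the
    Cartesian gradient (numerically) vanishes, i.e. learning turns off."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stationary point reached
            break
        x = x - alpha * Q(x) @ g
    return x

# Illustrative quadratic loss l(x) = 1/2 x^T A x - b^T x, so grad l(x) = A x - b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

# Q = I recovers naive (Cartesian) gradient descent
x_hat = generalized_gradient_descent(lambda x: A @ x - b,
                                     lambda x: np.eye(2),
                                     x0=np.zeros(2))
# x_hat approximately solves the stationarity condition A x = b
```

Because the gradient of this quadratic loss vanishes exactly at the solution of $Ax = b$, the iterate converges to that solution for any sufficiently small fixed step size.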
Generalized Gradient Descent Algorithm Cont.

IMPORTANT SPECIAL CASES:

$Q(x) = I$ : Gradient Descent Algorithm
$Q(x) = H^{-1}(x)$ : Newton Algorithm
$Q(x) = \Omega_x^{-1}$ : Natural Gradient Algorithm

The (Cartesian or naive) Gradient Descent Algorithm is the simplest to implement, but the slowest to converge. The Newton Algorithm is the most difficult to implement, due to the difficulty of constructing and inverting the Hessian, but the fastest to converge. The Natural (or true) Gradient Descent Algorithm provides improved convergence speed over (naive) gradient descent.
Derivation of the Newton Algorithm

A Taylor series expansion of $l(x)$ about a current estimate $\hat{x}_k$ of a minimizing point, to second order in $\Delta x = x - \hat{x}_k$, yields
$$l(\hat{x}_k + \Delta x) \approx l(\hat{x}_k) + \frac{\partial l(\hat{x}_k)}{\partial x}\,\Delta x + \frac{1}{2}\,\Delta x^T H(\hat{x}_k)\,\Delta x.$$
Minimizing the above with respect to $\Delta x$ results in
$$\Delta x_k = -H^{-1}(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k).$$
Finally, updating the minimizing-point estimate via
$$\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,\Delta x_k = \hat{x}_k - \alpha_k\,H^{-1}(\hat{x}_k)\,\nabla^c_x l(\hat{x}_k)$$
yields the Newton Algorithm. As mentioned, the Newton Algorithm generally yields fast convergence. This is particularly true if it can be stabilized using the so-called Newton step size $\alpha_k = 1$.
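A minimal Newton iteration can be sketched as follows, assuming the gradient and Hessian are available in closed form; the smooth convex loss and all names below are illustrative, not from the lecture:

```python
import numpy as np

def newton_minimize(grad, hess, x0, alpha=1.0, tol=1e-12, max_iter=50):
    """Newton iteration x_{k+1} = x_k - alpha * H^{-1}(x_k) grad(x_k),
    here with the 'Newton step size' alpha = 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H(x) dx = g rather than explicitly inverting the Hessian
        x = x - alpha * np.linalg.solve(hess(x), g)
    return x

# Illustrative loss l(x) = exp(x1) - x1 + x2^2, minimized at (0, 0)
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0],
                           [0.0,          2.0]])

x_hat = newton_minimize(grad, hess, x0=np.array([1.0, 1.0]))
```

Near the minimum the iteration converges quadratically, which is why only a handful of steps are needed here.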
Nonlinear Least-Squares

As in the Linear Gaussian model, with known covariance matrix $C = W^{-1}$, the negative log-likelihood is (up to an additive constant)
$$l(x) = -\log p_x(y) \doteq \frac{1}{2}\,\|y - h(x)\|^2_W.$$
The (Cartesian) gradient is given by $\nabla^c_x l(x) = \left(\frac{\partial l(x)}{\partial x}\right)^T$, or
$$\nabla^c_x l(x) = -H^T(x)\,W\,(y - h(x)),$$
where $H(x)$ is the Jacobian (linearization) of $h(x)$,
$$H(x) = \frac{\partial h(x)}{\partial x}.$$
This yields the Nonlinear Least-Squares Generalized Gradient Descent Algorithm
$$\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,Q(\hat{x}_k)\,H^T(\hat{x}_k)\,W\,(y - h(\hat{x}_k)).$$
Nonlinear Least-Squares Cont.

Note that the Natural (true) gradient of $l(x)$ is given by
$$\nabla_x l(x) = -\Omega_x^{-1} H^T(x)\,W\,(y - h(x)) = -H^*(x)\,(y - h(x)),$$
where $H^*(x)$ is the adjoint of $H(x)$. A stationary point $x$, $\nabla_x l(x) = 0$, must satisfy the Nonlinear Normal Equation
$$H^*(x)\,h(x) = H^*(x)\,y,$$
and the prediction error $e(x) = y - h(x)$ must be in $\mathcal{N}(H^*(x)) = \mathcal{R}(H(x))^{\perp}$:
$$H^*(x)\,e(x) = H^*(x)\,(y - h(x)) = 0.$$
Provided that the step size $\alpha_k$ is chosen to stabilize the nonlinear least-squares generalized gradient descent algorithm, the sequence of estimates $\hat{x}_k$ will converge to a stationary point of $l(x)$: $\hat{x}_k \to \hat{x}$ with $\nabla_x l(\hat{x}) = 0 = \nabla^c_x l(\hat{x})$.
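The orthogonality condition $H^*(x)\,e(x) = 0$ can be checked numerically in the linear special case $h(x) = Hx$, where the Nonlinear Normal Equation reduces to the ordinary weighted normal equation. Since $\Omega_x$ is invertible, $H^*(x)\,e(x) = 0$ is equivalent to $H^T W\,e(x) = 0$. The data below are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 2))            # tall matrix, full column rank
W = np.diag([1.0, 2.0, 1.0, 0.5, 1.0])     # positive definite weighting
y = rng.standard_normal(5)

# Stationary point: solve the weighted normal equation H^T W H x = H^T W y
x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# The prediction error is W-orthogonal to R(H): H^T W e = 0
e = y - H @ x_hat
residual = H.T @ W @ e    # should be numerically zero
```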
Nonlinear Least-Squares Cont.

IMPLEMENTING THE NEWTON ALGORITHM

One computes the Hessian of $l(x)$ (written $\mathcal{H}(x)$ here to distinguish it from the Jacobian $H(x)$) from
$$\mathcal{H}(x) = \frac{\partial}{\partial x}\left(\frac{\partial l(x)}{\partial x}\right)^T = \frac{\partial}{\partial x}\,\nabla^c_x l(x).$$
This yields the Hessian of the Weighted Least-Squares Loss Function
$$\mathcal{H}(x) = H^T(x)\,W\,H(x) - \sum_{i=1}^{m} \mathcal{H}_i(x)\,[W\,(y - h(x))]_i,$$
where
$$\mathcal{H}_i(x) = \frac{\partial}{\partial x}\left(\frac{\partial h_i(x)}{\partial x}\right)^T$$
denotes the Hessian of the $i$-th scalar-valued component of the vector function $h(x)$. Note that all terms on the right-hand side of the Hessian expression are symmetric, as required if $\mathcal{H}(x)$ is to be symmetric.
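The two-term Hessian expression can be verified on a small hand-built example. The model $h$ and every name below are illustrative, chosen so the Jacobian and component Hessians can be written by hand:

```python
import numpy as np

# Illustrative model h(x) = [x1^2, x1*x2, x2]
def h(x):
    return np.array([x[0]**2, x[0]*x[1], x[1]])

def jac(x):                                       # Jacobian H(x) = dh(x)/dx
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    1.0]])

def comp_hessians(x):                             # H_i(x), Hessian of h_i(x)
    return [np.array([[2.0, 0.0], [0.0, 0.0]]),   # h_1 = x1^2
            np.array([[0.0, 1.0], [1.0, 0.0]]),   # h_2 = x1*x2
            np.zeros((2, 2))]                     # h_3 = x2

def ls_hessian(x, y, W):
    """Hessian of l(x) = 1/2 ||y - h(x)||_W^2:
       J^T W J - sum_i H_i(x) [W (y - h(x))]_i"""
    J = jac(x)
    We = W @ (y - h(x))
    return J.T @ W @ J - sum(Hi * wi for Hi, wi in zip(comp_hessians(x), We))

W = np.eye(3)
x = np.array([1.0, 2.0])
y = h(x)                          # zero prediction error, so the second
Hess = ls_hessian(x, y, W)        # term vanishes and Hess = J^T W J
```

With zero prediction error the result reduces to the first (Gauss-Newton) term alone, anticipating the approximation on the next slide.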
Nonlinear Least-Squares Cont.

Evidently, the Hessian matrix of the least-squares loss function $l(x)$ can be quite complex. Also note that, because of the second term on the right-hand side of the Hessian expression, the Hessian $\mathcal{H}(x)$ of $l(x)$ can become singular or indefinite. However, as we have seen, in the special case when $h(x)$ is linear, $h(x) = Hx$, we have $H(x) = H$ and $\mathcal{H}_i(x) = 0$, $i = 1, \dots, m$, yielding
$$\mathcal{H}(x) = H^T W H,$$
which for full column-rank $H$ and positive definite $W$ is always symmetric and invertible. Also note that if we have a good model and a value $\hat{x}$ such that the prediction error $e(\hat{x}) = y - h(\hat{x}) \approx 0$, then
$$\mathcal{H}(\hat{x}) \approx H^T(\hat{x})\,W\,H(\hat{x}).$$
Nonlinear Least-Squares Gauss-Newton Algorithm

Linearizing $h(x)$ about a current estimate $\hat{x}_k$, with $\Delta x = x - \hat{x}_k$, we have
$$h(x) \approx h(\hat{x}_k) + \frac{\partial h(\hat{x}_k)}{\partial x}\,\Delta x = h(\hat{x}_k) + H(\hat{x}_k)\,\Delta x.$$
This yields the loss-function approximation
$$l(x) = l(\hat{x}_k + \Delta x) \approx \frac{1}{2}\,\left\|\left(y - h(\hat{x}_k)\right) - H(\hat{x}_k)\,\Delta x\right\|^2_W.$$
Assuming that $H(\hat{x}_k)$ has full column rank (which is guaranteed if $h(x)$ is one-to-one in a neighborhood of $\hat{x}_k$), we can minimize $l(\hat{x}_k + \Delta x)$ with respect to $\Delta x$ to obtain
$$\Delta x_k = \left(H^T(\hat{x}_k)\,W\,H(\hat{x}_k)\right)^{-1} H^T(\hat{x}_k)\,W\,(y - h(\hat{x}_k)).$$
The update rule $\hat{x}_{k+1} = \hat{x}_k + \alpha_k\,\Delta x_k$ yields the nonlinear least-squares generalized gradient descent algorithm known as the Gauss-Newton Algorithm.
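A minimal Gauss-Newton sketch on a toy exponential-fit problem; the model, sample points, and starting guess are illustrative, and with noiseless data the prediction error goes to zero, so the Newton step size $\alpha_k = 1$ works well:

```python
import numpy as np

def gauss_newton(h, jac, y, W, x0, alpha=1.0, tol=1e-12, max_iter=100):
    """dx_k = (H^T W H)^{-1} H^T W (y - h(x));  x <- x + alpha * dx_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac(x)
        r = y - h(x)                                 # prediction error e(x)
        dx = np.linalg.solve(J.T @ W @ J, J.T @ W @ r)
        x = x + alpha * dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Toy model h(x) = x1 * exp(x2 * t) sampled at a few points t
t = np.linspace(0.0, 1.0, 5)
h   = lambda x: x[0] * np.exp(x[1] * t)
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])

x_true = np.array([2.0, -1.0])
y = h(x_true)                     # noiseless data: e(x_hat) -> 0
x_hat = gauss_newton(h, jac, y, np.eye(len(t)), x0=np.array([1.0, 0.0]))
```

Note that each step solves a weighted linear least-squares problem in place of inverting the full Hessian, which is precisely what makes Gauss-Newton cheaper than Newton.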
Gauss-Newton Algorithm Cont.

The Gauss-Newton algorithm corresponds to taking the weighting matrix $Q(x) = Q^T(x) > 0$ in the nonlinear least-squares generalized gradient descent algorithm to be (under the assumption that $h(x)$ is locally one-to-one)
$$Q(x) = \left(H^T(x)\,W\,H(x)\right)^{-1} \quad \text{(Gauss-Newton Algorithm)}.$$
Note that when $e(x) = y - h(x) \approx 0$, the Newton and Gauss-Newton algorithms become essentially equivalent. Thus it is not surprising that the Gauss-Newton algorithm can result in very fast convergence, assuming that $e(\hat{x}_k)$ becomes asymptotically small. This is particularly true if the Gauss-Newton algorithm can be stabilized using the Newton step size $\alpha_k = 1$.