Matrix differential calculus
10-725 Optimization
Geoff Gordon and Ryan Tibshirani
Review
- Matrix differentials: solution to matrix-calculus pain
- Compact way of writing Taylor expansions, or definition:
    df = a(x; dx) [+ r(dx)]
    a(x; .) linear in 2nd arg;  r(dx)/‖dx‖ → 0 as dx → 0
- d(.) is linear: passes through +, scalar *
- Generalizes Jacobian, Hessian, gradient, velocity
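As a concrete illustration of this definition (added here, not from the slides), the differential of a quadratic form, written in the df = a(x; dx) + r(dx) pattern:

    f(x + dx) = (x + dx)^T A (x + dx)
              = x^T A x + x^T (A + A^T)\, dx + dx^T A\, dx
    \Rightarrow\quad
    df = \underbrace{x^T (A + A^T)\, dx}_{a(x;\,dx)\ \text{linear in } dx}
       + \underbrace{dx^T A\, dx}_{r(dx) = o(\|dx\|)}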
Review
- Chain rule; product rule
- Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, ...
- Identities: rules for working with them; tr() trace rotation
- Identification theorems
Finding a maximum or minimum, or saddle point
Identification theorem for df(x), scalar f:
    scalar x:  df = a dx
    vector x:  df = a^T dx
    matrix X:  df = tr(a^T dX)
[Figure: 1-d function with maximum, minimum, and saddle point marked]
And so forth
- Can't draw it for X a matrix, tensor, ...
- But the same principle holds: set the coefficient of dx to 0 to find a min, max, or saddle point:
    if df = c(a; dx) [+ r(dx)], then max/min/saddle point iff a = 0
    (for c(.;.) any of the products above)
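A standard example of this recipe (added for illustration, not on the slide): minimizing a least-squares objective by setting the coefficient of dx to zero.

    f(x) = \tfrac12 \|Ax - b\|^2
    \quad\Rightarrow\quad
    df = (Ax - b)^T A\, dx
    \quad\Rightarrow\quad
    \text{stationarity: } A^T (Ax - b) = 0 \ \ \text{(the normal equations)}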
Ex: Infomax ICA
- Training examples x_i in R^d, i = 1:n
- Transformation y_i = g(W x_i),  W in R^{d x d},  g(z) =
- Want:
[Figure: scatter plots of x_i, W x_i, and y_i]
Volume rule
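The slide title refers to the change-of-variables rule for densities; a standard statement (added here, since the slide body is blank):

    % For an invertible, differentiable map y = f(x) with Jacobian J = \partial y / \partial x:
    p_Y(y)\, |\det J| = p_X(x)
    \qquad\text{i.e.}\qquad
    p_Y(y) = \frac{p_X(x)}{|\det J|}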
Ex: Infomax ICA
- y_i = g(W x_i),  dy_i =
- Method: max_W Σ_i ln p(y_i), where p(y_i) =
[Figure: scatter plots of x_i, W x_i, and y_i]
Gradient
- L = Σ_i ln det J_i
- y_i = g(W x_i),  dy_i = J_i dx_i
Gradient
- J_i = diag(u_i) W
- dJ_i = diag(u_i) dW + diag(v_i) diag(dW x_i) W
- dL =
Natural gradient
- L(W): R^{d x d} → R,  dL = tr(G^T dW)
- Step: S = arg max_S M(S) = tr(G^T S) - ‖S W^{-1}‖_F² / 2
- Scalar case: M = g s - s² / (2 w²),  dM =
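Setting the differential of M to zero gives the natural-gradient step; a short derivation (added, following the definitions on this slide):

    % M(S) = tr(G^T S) - \tfrac12 \|S W^{-1}\|_F^2
    dM = tr(G^T dS) - tr\big(W^{-T} S^T (dS)\, W^{-1}\big)
       = tr\big[(G^T - W^{-1} W^{-T} S^T)\, dS\big]
    % setting the coefficient of dS to zero:
    G = S W^{-1} W^{-T}
    \quad\Rightarrow\quad
    S = G\, W^T W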
ICA natural gradient
- Natural gradient step: [W^{-T} + C] W^T W =   (written in terms of the y_i and W x_i)
- Start with W_0 = I
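A minimal NumPy sketch of this update (added; the function name and step-size handling are mine, and it assumes the logistic sigmoid g(z) = 1/(1+e^{-z}) since the slides leave g unspecified; under that choice d/dz ln g'(z) = 1 - 2 g(z)):

    import numpy as np

    def infomax_ica_natural_gradient(X, n_iters=200, lr=0.01):
        """Infomax ICA by natural-gradient ascent (sketch).

        X: (n, d) array of training examples x_i.
        Assumes g is the logistic sigmoid, so d/dz ln g'(z) = 1 - 2 g(z).
        """
        n, d = X.shape
        W = np.eye(d)                      # start with W_0 = I, as on the slide
        for _ in range(n_iters):
            Z = X @ W.T                    # rows are W x_i
            Y = 1.0 / (1.0 + np.exp(-Z))   # y_i = g(W x_i)
            # Ordinary gradient of sum_i ln p(y_i) is  n W^{-T} + C,
            # with C = sum_i (1 - 2 y_i) x_i^T (sigmoid assumption).
            # Multiplying by W^T W (natural gradient) removes the inverse:
            #   [n W^{-T} + C] W^T W = n W + (1 - 2Y)^T Z W
            grad_nat = n * W + (1.0 - 2.0 * Y).T @ Z @ W
            W = W + (lr / n) * grad_nat
        return W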
ICA on natural image patches
[Figures: ICA results on natural image patches]
More info
- Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
- Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd ed., Wiley, 1999. http://www.amazon.com/differential-calculus-Applications-Statistics-Econometrics/dp/047198633X
- Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1995.
Newton's method
10-725 Optimization
Geoff Gordon and Ryan Tibshirani
Nonlinear equations
- f: R^d → R^d, differentiable;  x in R^d
- Solve: f(x) = 0
- Taylor: f(x + dx) = f(x) + J dx [+ r(dx)],  J: Jacobian of f at x
- Newton: solve J dx = -f(x), set x ← x + dx
[Figure: 1-d root-finding illustration]
Error analysis
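For reference, the standard one-dimensional error analysis (added; not copied from the lecture), which explains the roughly doubling number of correct digits in the next slide's iterates:

    % With e_k = x_k - x^* and f(x^*) = 0, Taylor-expand f around x^*:
    e_{k+1} = e_k - \frac{f(x_k)}{f'(x_k)}
            \approx \frac{f''(x^*)}{2 f'(x^*)}\, e_k^2
    % so near the root the error is squared at every step (quadratic convergence),
    % provided f'(x^*) \neq 0 and the starting point is close enough.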
dx = x*(1-x*phi)
  0: 0.7500000000000000
  1: 0.5898558813281841
  2: 0.6167492604787597
  3: 0.6180313181415453
  4: 0.6180339887383547
  5: 0.6180339887498948
  6: 0.6180339887498949
  7: 0.6180339887498948
  8: 0.6180339887498949
  *: 0.6180339887498948
Bad initialization
   1.3000000000000000
  -0.1344774409873226
  -0.2982157033270080
  -0.7403273854022190
  -2.3674743431148597
  -13.8039236412225819
  -335.9214859516196157
  -183256.0483360671496484
  -54338444778.1145248413085938
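These iterates match the division-free Newton update x ← x + x*(1 - x*phi) for solving 1/x = phi, whose root is 1/phi = 0.618... A short script (added; phi is taken to be the golden ratio, which reproduces both the converging and the diverging run):

    import numpy as np

    phi = (1 + np.sqrt(5)) / 2          # golden ratio; root of f(x) = 1/x - phi is 1/phi

    def newton_reciprocal(x, n_iters=8):
        """Newton's method for f(x) = 1/x - phi, written without division:
        x_{k+1} = x_k - f(x_k)/f'(x_k) = x_k + x_k*(1 - x_k*phi)."""
        print(f"0: {x:.16f}")
        for k in range(1, n_iters + 1):
            x = x + x * (1 - x * phi)
            print(f"{k}: {x:.16f}")
        return x

    newton_reciprocal(0.75)   # converges quadratically to 1/phi = 0.6180339887498949
    newton_reciprocal(1.3)    # diverges: the first step overshoots and the iterates run away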
Minimization
- f: R^d → R, twice differentiable;  x in R^d
- Find: x with ∇f(x) = 0 (min, max, or saddle point)
- Newton: apply the root-finding method to ∇f, i.e. solve [∇²f(x)] dx = -∇f(x)
Descent
- Newton step: d = -[∇²f(x)]^{-1} ∇f(x)
- Gradient step: g = -∇f(x)
- Taylor: df = ∇f(x)^T dx [+ r(dx)]
- Let t > 0, set dx = t d; then df =
- So:
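Filling in the blank lines (added; this is the standard argument, not copied from the lecture): plugging dx = t d into the Taylor expansion shows the Newton direction is a descent direction whenever the Hessian is positive definite.

    df = \nabla f(x)^T dx
       = t\, \nabla f(x)^T d
       = -t\, \nabla f(x)^T [\nabla^2 f(x)]^{-1} \nabla f(x)
    % If \nabla^2 f(x) \succ 0, the quadratic form is positive, so df < 0:
    % a small step along the Newton direction decreases f.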
Steepest descent
- g = ∇f(x),  H = ∇²f(x)
- d_H =
[Figure: normalized steepest-descent step x + Δx_nsd vs. Newton step x + Δx_nt]
Newton w/ line search
Pick x_1
For k = 1, 2, ...
  g_k = ∇f(x_k);  H_k = ∇²f(x_k)        (gradient & Hessian)
  d_k = -H_k \ g_k                        (Newton direction)
  t_k = 1                                 (backtracking line search, β < 1)
  while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 2
    t_k = β t_k
  x_{k+1} = x_k + t_k d_k                 (step)
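A direct NumPy transcription of this loop (added as a sketch; the helper name and the small test problem are mine, and np.linalg.solve stands in for the backslash solve):

    import numpy as np

    def damped_newton(f, grad, hess, x, beta=0.5, alpha=0.5, n_iters=50, tol=1e-10):
        """Newton's method with backtracking line search, following the slide."""
        for _ in range(n_iters):
            g, H = grad(x), hess(x)                 # gradient & Hessian
            if np.linalg.norm(g) < tol:
                break
            d = -np.linalg.solve(H, g)              # Newton direction (H \ g)
            t = 1.0
            while f(x + t * d) > f(x) + alpha * t * g @ d:   # backtracking line search
                t *= beta
            x = x + t * d                           # damped Newton step
        return x

    # Hypothetical test problem: a smooth strictly convex function with minimum at 0
    f    = lambda x: np.sum(np.exp(x) + np.exp(-x))
    grad = lambda x: np.exp(x) - np.exp(-x)
    hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
    print(damped_newton(f, grad, hess, x=np.array([2.0, -3.0])))   # -> approx [0, 0]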
Properties of damped Newton
- Affine invariant: suppose g(x) = f(Ax + b)
    x_1, x_2, ... from Newton on g();  y_1, y_2, ... from Newton on f()
    If y_1 = A x_1 + b, then: y_k = A x_k + b for all k
- Convergent:
    if f bounded below, f(x_k) converges
    if f strictly convex with bounded level sets, x_k converges
    typically quadratic rate in a neighborhood of x*
Equality constraints
min f(x)  s.t.  h(x) = 0
[Figure: contours of f with the constraint set h(x) = 0]
Optimality w/ equality
min f(x)  s.t.  h(x) = 0
- f: R^d → R,  h: R^d → R^k  (k ≤ d),  g: R^d → R^d (gradient of f)
- Useful special case: min f(x)  s.t.  Ax = 0
Picture
max c^T x  s.t.  x_1² + x_2² + x_3² = 1,  a^T x = b
[Figure: 3-d view of the sphere intersected with the plane a^T x = b]
Optimality w/ equality
min f(x)  s.t.  h(x) = 0
- f: R^d → R,  h: R^d → R^k  (k ≤ d),  g: R^d → R^d (gradient of f)
- Now suppose: dg =    dh =
- Optimality:
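For reference (added; the slide's blanks were filled in during lecture), the standard first-order conditions for this problem: at a constrained stationary point the gradient of f is a combination of the constraint gradients.

    % Lagrangian view of  min f(x) s.t. h(x) = 0:
    \nabla f(x) + [\nabla h(x)]^T \lambda = 0, \qquad h(x) = 0
    % equivalently, g(x) = \nabla f(x) is orthogonal to every feasible direction dx
    % with \nabla h(x)\, dx = 0  (special case Ax = 0: directions with A dx = 0).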
Example: bundle adjustment
Latent:
- Robot positions x_t, θ_t
- Landmark positions y_k
Observed: odometry, landmark vectors
- v_t = R_{θ_t}[x_{t+1} - x_t] + noise
- w_t = [θ_{t+1} - θ_t + noise]_π
- d_{kt} = R_{θ_t}[y_k - x_t] + noise
- O = {observed (k,t) pairs}
[Figure: robot trajectory and landmarks in the plane]
Bundle adjustment
min over x_t, u_t, y_k of
  Σ_t ‖v_t - R(u_t)[x_{t+1} - x_t]‖² + Σ_t ‖R_{w_t} u_t - u_{t+1}‖² + Σ_{(t,k)∈O} ‖d_{k,t} - R(u_t)[y_k - x_t]‖²
s.t.  u_t^T u_t = 1
Ex: MLE in exponential family
- L = ln Π_k P(x_k | θ)
- P(x_k | θ) =
- g(θ) =
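The standard fills (added; this is the usual natural-parameter exponential family, consistent with the Fisher-scoring formula a few slides later):

    P(x \mid \theta) = h(x)\,\exp\!\big(\theta^T x - g(\theta)\big),
    \qquad
    g(\theta) = \ln \int h(x)\, e^{\theta^T x}\, dx
    % Then, with \bar{x} the sample mean of the x_k:
    \nabla_\theta L = \sum_k \big(x_k - \mathbb{E}[x \mid \theta]\big),
    \qquad
    \nabla^2_\theta L = -n\,\mathrm{Cov}[x \mid \theta]
    % so the Newton step d\theta solves  Cov[x | \theta]\, d\theta = \bar{x} - E[x | \theta].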
MLE Newton interpretation
Comparison of methods for minimizing a convex function
                    Newton    FISTA    (sub)grad    stoch. (sub)grad
  convergence
  cost/iter
  smoothness
Variations
- Trust region:
    [H(x) + tI] dx = -g(x)
    [H(x) + tD] dx = -g(x)
- Quasi-Newton:
    use only gradients, but build an estimate of the Hessian
    in R^d, d gradient estimates at nearby points determine an approx. Hessian (think finite differences)
    can often get a good enough estimate w/ fewer
    can even forget old info to save memory (L-BFGS)
Variations: Gauss-Newton
L = min_θ Σ_k (1/2) ‖y_k - f(x_k, θ)‖²
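A minimal sketch of the Gauss-Newton iteration for this objective (added; the residual/Jacobian interface and the exponential-decay fitting example are made up for illustration):

    import numpy as np

    def gauss_newton(residual, jacobian, theta, n_iters=20):
        """Gauss-Newton for min_theta 0.5 * ||r(theta)||^2, r_k = y_k - f(x_k, theta).

        Replaces the Hessian by J^T J (dropping the second-derivative term),
        so each step solves the linear least-squares problem  J d = -r.
        """
        for _ in range(n_iters):
            r = residual(theta)                    # stacked residuals y_k - f(x_k, theta)
            J = jacobian(theta)                    # d r / d theta
            d, *_ = np.linalg.lstsq(J, -r, rcond=None)
            theta = theta + d
        return theta

    # Hypothetical example: fit y ~ a * exp(-b x) to noisy data
    rng = np.random.default_rng(0)
    x = np.linspace(0, 3, 30)
    y = 2.0 * np.exp(-1.5 * x) + 0.01 * rng.standard_normal(30)
    residual = lambda th: y - th[0] * np.exp(-th[1] * x)
    jacobian = lambda th: np.column_stack([-np.exp(-th[1] * x),
                                           th[0] * x * np.exp(-th[1] * x)])
    print(gauss_newton(residual, jacobian, np.array([1.0, 1.0])))   # approx [2.0, 1.5]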
Variations: Fisher scoring
- Recall Newton in exponential family: E[xx^T | θ] dθ = x̄ - E[x | θ]
- Can use this formula in place of Newton, even if not an exponential family:
    descent direction, even w/ no regularization
    Hessian is independent of the data
    often a wider radius of convergence than Newton
    can be superlinearly convergent