Matrix differential calculus Optimization Geoff Gordon Ryan Tibshirani

1 Matrix differential calculus Optimization Geoff Gordon Ryan Tibshirani

2 Review. Matrix differentials: a solution to matrix calculus pain. A compact way of writing Taylor expansions, or, as a definition: $df = a(x; dx)\ [+\, r(dx)]$, where $a(x; \cdot)$ is linear in its second argument and $r(dx)/\|dx\| \to 0$ as $dx \to 0$. $d(\cdot)$ is linear: it passes through $+$ and scalar multiplication. Generalizes the Jacobian, Hessian, gradient, and velocity.
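The definition can be checked numerically. The sketch below is not from the slides: it assumes the illustrative function $f(x) = x^\top x$, whose linear part is $a(x; dx) = 2 x^\top dx$ and whose remainder is $r(dx) = dx^\top dx$, and confirms that $|r(dx)| / \|dx\|$ shrinks in proportion to $\|dx\|$. All function names are hypothetical.

```python
import math

def f(x):
    """Illustrative function f(x) = x^T x (an assumption, not from the slides)."""
    return sum(xi * xi for xi in x)

def linear_part(x, dx):
    """a(x; dx) = 2 x^T dx: the part of f(x + dx) - f(x) that is linear in dx."""
    return 2.0 * sum(xi * di for xi, di in zip(x, dx))

def remainder_ratio(x, direction, eps):
    """Return |r(dx)| / ||dx|| for dx = eps * direction."""
    dx = [eps * di for di in direction]
    x_new = [xi + di for xi, di in zip(x, dx)]
    r = f(x_new) - f(x) - linear_part(x, dx)
    norm_dx = math.sqrt(sum(di * di for di in dx))
    return abs(r) / norm_dx

x = [1.0, -2.0, 0.5]
direction = [0.3, 0.4, -1.2]  # has Euclidean norm 1.3
ratios = [remainder_ratio(x, direction, eps) for eps in (1e-1, 1e-2, 1e-3)]
```

For this $f$ the ratio equals $\|dx\|$ exactly, so each tenfold reduction in `eps` reduces the ratio tenfold.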

3 Review. Chain rule. Product rule. Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, ... Identities: rules for working with these products and with tr(), including trace rotation. Identification theorems.

4 Finding a maximum, minimum, or saddle point. Identification (ID) for $df(x)$ with scalar $f$: for scalar $x$, $df = a\, dx$; for vector $x$, $df = a^\top dx$; for matrix $X$, $df = \mathrm{tr}(A^\top dX)$.

6 And so forth. Can't draw it for $X$ a matrix, tensor, ... But the same principle holds: set the coefficient of $dx$ to 0 to find a min, max, or saddle point. If $df = c(a; dx)\ [+\, r(dx)]$ then ... so: max/min/saddle point iff ..., for $c(\cdot; \cdot)$ any product, ...
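As a worked instance of "set the coefficient of $dx$ to 0", consider least squares. The function and the symbols $M$ and $b$ are illustrative assumptions, not taken from the slides.

```latex
\begin{aligned}
f(x) &= \tfrac{1}{2}\, \|Mx - b\|^2 \\
df   &= (Mx - b)^\top M\, dx = \left[ M^\top (Mx - b) \right]^\top dx \\
\text{coefficient of } dx = 0
     &\iff M^\top (Mx - b) = 0
      \iff M^\top M\, x = M^\top b
\end{aligned}
```

Here the identification rule for vector $x$ gives $a = M^\top(Mx - b)$, and setting $a = 0$ recovers the normal equations.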

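7 Ex: Infomax ICA. Training examples $x_i \in \mathbb{R}^d$, $i = 1{:}n$. Transformation $y_i = g(W x_i)$, $W \in \mathbb{R}^{d \times d}$, $g(z) = \ldots$ Want: ... [Figure: scatter plots labeled $x_i$, $W x_i$, and $y_i$.]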

8 Volume rule

9 Ex: Infomax ICA. $y_i = g(W x_i)$, $dy_i = \ldots$ Method: $\max_W \sum_i -\ln P(y_i)$, where $P(y_i) = \ldots$ [Figure: scatter plots labeled $W x_i$ and $y_i$.]

10 Gradient. $L = \sum_i \ln \det J_i$, where $y_i = g(W x_i)$ and $dy_i = J_i\, dx_i$.

11 Gradient. $J_i = \mathrm{diag}(u_i)\, W$; $dJ_i = \mathrm{diag}(u_i)\, dW + \mathrm{diag}(v_i)\, \mathrm{diag}(dW\, x_i)\, W$; $dL = \ldots$
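One way to carry the differential through, sketched under stated assumptions: the objective is taken to be the output entropy $-\sum_i \ln P(y_i)$, which by the volume rule $P(y_i) = P(x_i) / |\det J_i|$ equals $\sum_i \ln \det J_i$ up to a term that does not depend on $W$. With $g$ applied elementwise, $u_i$ and $v_i$ are read as $g'(W x_i)$ and $g''(W x_i)$, and $v_i / u_i$ denotes elementwise division. The standard identity $d \ln \det J = \mathrm{tr}(J^{-1} dJ)$ then gives:

```latex
\begin{aligned}
dL &= \sum_i \mathrm{tr}\!\left( J_i^{-1}\, dJ_i \right) \\
   &= \sum_i \mathrm{tr}\!\left( W^{-1}\, \mathrm{diag}(u_i)^{-1}
      \left[ \mathrm{diag}(u_i)\, dW
           + \mathrm{diag}(v_i)\, \mathrm{diag}(dW\, x_i)\, W \right] \right) \\
   &= \sum_i \left[ \mathrm{tr}\!\left( W^{-1}\, dW \right)
      + (v_i / u_i)^\top dW\, x_i \right] \\
   &= \mathrm{tr}\!\left( \Big[\, n\, W^{-\top}
      + \sum_i (v_i / u_i)\, x_i^\top \Big]^\top dW \right)
\end{aligned}
```

By the identification rule for a matrix argument, the bracketed matrix is the gradient $G$. Up to normalization by $n$ it has the form $W^{-\top} + C$; identifying $C$ with the data-dependent sum is an assumption.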

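12 Natural gradient. $L(W): \mathbb{R}^{d \times d} \to \mathbb{R}$, $dL = \mathrm{tr}(G^\top dW)$. Step: $S = \arg\max_S M(S)$, with $M(S) = \mathrm{tr}(G^\top S) - \|S W^{-1}\|_F^2 / 2$. Scalar case: $M = gs - s^2 / (2w^2)$; $M = \ldots$; $dM = \ldots$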
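The maximizer of $M$ follows from the same "coefficient of the differential equals zero" principle. The derivation below is a sketch that assumes $W$ is invertible and uses only the trace-rotation identity.

```latex
\begin{aligned}
\text{scalar: } dM &= \left( g - s / w^2 \right) ds = 0
  \;\Longrightarrow\; s = g\, w^2 \\[4pt]
\text{matrix: } dM &= \mathrm{tr}\!\left( G^\top dS \right)
  - \mathrm{tr}\!\left( W^{-\top} S^\top dS\, W^{-1} \right) \\
  &= \mathrm{tr}\!\left( \left[ G^\top - W^{-1} W^{-\top} S^\top \right] dS \right) = 0 \\
  &\Longrightarrow\; S^\top = W^\top W\, G^\top
  \;\Longrightarrow\; S = G\, W^\top W
\end{aligned}
```

The matrix result reduces to the scalar one when $d = 1$, and it matches the pattern "gradient times $W^\top W$" in the ICA natural gradient update.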

13 ICA natural gradient. $[W^{-\top} + C]\, W^\top W = \ldots$ Start with $W_0 = I$. [Figure: panels labeled $y_i$ and $W x_i$.]

15 ICA on natural image patches [figure]

16 ICA on natural image patches [figure]

17 More info. Minka's cheat sheet: papers/matrix/. Magnus & Neudecker. Matrix Differential Calculus. Wiley, nd ed. Applications-Statistics-Econometrics/dp/ X. Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7,

18 Newton's method Optimization Geoff Gordon Ryan Tibshirani

19 Nonlinear equations. $x \in \mathbb{R}^d$; $f: \mathbb{R}^d \to \mathbb{R}^d$, differentiable. Solve: ... Taylor: ... J: ... Newton: ...
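A minimal sketch of Newton's method for a system of nonlinear equations, in the standard textbook form: linearize $f(x + dx) \approx f(x) + J\, dx$ with $J$ the Jacobian, then solve $J\, dx = -f(x)$ and update $x$. The two-equation test system (a unit circle intersected with the line $x = y$), the starting point, and all function names are illustrative assumptions.

```python
import math

def f(x, y):
    """Illustrative system: unit circle and the line x = y."""
    return (x * x + y * y - 1.0, x - y)

def jacobian(x, y):
    """Jacobian of f, as rows (df1/dx, df1/dy) and (df2/dx, df2/dy)."""
    return ((2.0 * x, 2.0 * y), (1.0, -1.0))

def newton_system(x, y, iters=20):
    """Repeat: solve J [dx, dy]^T = -f(x, y), then step."""
    for _ in range(iters):
        f1, f2 = f(x, y)
        (a, b), (c, d) = jacobian(x, y)
        det = a * d - b * c
        # 2x2 linear solve by Cramer's rule with right-hand side (-f1, -f2).
        dx = (-f1 * d + f2 * b) / det
        dy = (-a * f2 + c * f1) / det
        x, y = x + dx, y + dy
    return x, y

root_x, root_y = newton_system(1.0, 0.5)
```

From this starting point the iterates approach $(\sqrt{1/2}, \sqrt{1/2})$. The sketch has no safeguard against a singular Jacobian.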

20 Error analysis

21 dx = x*(1-x*phi) [Table of iterates, indexed from 0 and ending at the limit *; the values were not recovered.]
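The update `dx = x*(1-x*phi)` is what Newton's method gives for $f(x) = 1/x - \varphi$, so the iterates approach $1/\varphi$ using only multiplication and subtraction. The sketch below runs it; the value of `phi` (the golden ratio) and both starting points are assumptions, since the slide's table of iterates did not survive.

```python
def reciprocal_newton(phi, x0, iters):
    """Iterate x <- x + x*(1 - x*phi) and return the list of iterates."""
    xs = [x0]
    for _ in range(iters):
        x = xs[-1]
        xs.append(x + x * (1.0 - x * phi))
    return xs

phi = 1.6180339887498949  # assumed value; the slide does not state it

# Start inside the basin 0 < x0 < 2/phi: converges to 1/phi.
good = reciprocal_newton(phi, x0=1.0, iters=8)
errors = [abs(1.0 - x * phi) for x in good]

# Start outside the basin (x0 > 2/phi): the iterates run away.
bad = reciprocal_newton(phi, x0=1.5, iters=8)
```

Writing $e_k = 1 - x_k \varphi$, the update gives $e_{k+1} = e_k^2$, so the error squares at every step. That is quadratic convergence when $|e_0| < 1$ and divergence when $|e_0| > 1$.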

22 Bad initialization

23 Minimization. $x \in \mathbb{R}^d$; $f: \mathbb{R}^d \to \mathbb{R}$, twice differentiable. Find: ... Newton: ...

24 Descent. Newton step: $d = -(f''(x))^{-1} f'(x)$. Gradient step: $g = f'(x)$. Taylor: $df = \ldots$ Let $t > 0$, set $dx = \ldots$; $df = \ldots$ So: ...
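A sketch of why the Newton step is a descent direction, writing $g = f'(x)$ and $H = f''(x)$ and assuming $H$ is positive definite and $g \ne 0$ (the positive-definiteness assumption is added here):

```latex
\begin{aligned}
df &= g^\top dx && \text{(first-order Taylor)} \\
dx &= t\, d = -t\, H^{-1} g, \quad t > 0 \\
df &= -t\, g^\top H^{-1} g \;<\; 0 && \text{since } H^{-1} \succ 0
\end{aligned}
```

So for small enough $t$ the step decreases $f$. The same argument with $dx = -t\, g$ gives $df = -t\, g^\top g < 0$ for the gradient step.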

25 Steepest descent. $g = f'(x)$, $H = f''(x)$, $\|d\|_H = \ldots$ [Figure: points labeled $x$, $x + \Delta x_{\mathrm{nsd}}$, and $x + \Delta x_{\mathrm{nt}}$.]

26 Newton w/ line search
Pick x_1
For k = 1, 2, ...
    g_k = f'(x_k); H_k = f''(x_k)        (gradient & Hessian)
    d_k = -H_k \ g_k                      (Newton direction)
    t_k = 1                               (backtracking line search)
    while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 2
        t_k = β t_k                       (β < 1)
    x_{k+1} = x_k + t_k d_k               (step)
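The loop above can be sketched in plain Python for a two-variable problem. The objective $f(x, y) = e^{x+y} + x^2 + y^2$, the starting point, the tolerance, and the guard on very small step sizes are illustrative assumptions. The direction is taken as $d_k = -H_k^{-1} g_k$, so that $g_k^\top d_k < 0$ and the sufficient-decrease test is meaningful.

```python
import math

def f(x, y):
    """Illustrative strictly convex objective (an assumption)."""
    return math.exp(x + y) + x * x + y * y

def grad(x, y):
    e = math.exp(x + y)
    return (e + 2.0 * x, e + 2.0 * y)

def hess(x, y):
    e = math.exp(x + y)
    return ((e + 2.0, e), (e, e + 2.0))

def damped_newton(x, y, beta=0.5, tol=1e-12, max_iter=100):
    """Newton's method with backtracking line search (sufficient decrease 1/2)."""
    for _ in range(max_iter):
        g1, g2 = grad(x, y)
        if math.hypot(g1, g2) < tol:
            break
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c
        # Newton direction: solve H d = -g by Cramer's rule.
        d1 = (-g1 * d + g2 * b) / det
        d2 = (-a * g2 + c * g1) / det
        slope = g1 * d1 + g2 * d2  # g^T d, negative for a descent direction
        t = 1.0
        # Shrink t until the sufficient-decrease test passes; the t > 1e-12
        # guard stops the search once the step is negligible.
        while t > 1e-12 and f(x + t * d1, y + t * d2) > f(x, y) + t * slope / 2.0:
            t *= beta
        x, y = x + t * d1, y + t * d2
    return x, y

x_min, y_min = damped_newton(2.0, -1.0)
```

For this objective the minimizer has $x = y$ with $e^{2x} + 2x = 0$, roughly $x \approx -0.2836$.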

27 Properties of damped Newton. Affine invariant: suppose $g(x) = f(Ax + b)$; $x_1, x_2, \ldots$ from Newton on $g(\cdot)$; $y_1, y_2, \ldots$ from Newton on $f(\cdot)$. If $y_1 = A x_1 + b$, then: ... Convergent: if $f$ is bounded below, $f(x_k)$ converges; if $f$ is strictly convex with bounded level sets, $x_k$ converges; typically quadratic rate in a neighborhood of $x^*$.
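A sketch of the affine-invariance argument, assuming $A$ is square and invertible (an assumption added here) and writing $y = Ax + b$:

```latex
\begin{aligned}
g'(x)  &= A^\top f'(y), \qquad g''(x) = A^\top f''(y)\, A \\
d_x    &= -\left( A^\top f''(y)\, A \right)^{-1} A^\top f'(y)
        = -A^{-1} f''(y)^{-1} f'(y) = A^{-1} d_y \\
g(x + t\, d_x) &= f(y + t\, d_y), \qquad
g'(x)^\top d_x = f'(y)^\top d_y
\end{aligned}
```

Both sides of the backtracking test therefore agree for every $t$, so the two runs accept the same step size and $y + t\, d_y = A(x + t\, d_x) + b$. By induction, $y_k = A x_k + b$ for all $k$.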

28 Equality constraints. $\min f(x)$ s.t. $h(x) = 0$

29 Optimality w/ equality. $\min f(x)$ s.t. $h(x) = 0$; $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$); $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of $f$). Useful special case: $\min f(x)$ s.t. $Ax = 0$.

30 Picture. $\max\ c^\top (x, y, z)^\top$ s.t. $x^2 + y^2 + z^2 = 1$, $a^\top x = b$.

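31 Optimality w/ equality. $\min f(x)$ s.t. $h(x) = 0$; $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$); $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of $f$). Now suppose: $dg = \ldots$, $dh = \ldots$ Optimality: ...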
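A sketch of the standard first-order condition in differential form. The symbols $J = h'(x)$ (the $k \times d$ Jacobian of $h$), $H = g'(x)$ (the Hessian of $f$), and the multiplier $\lambda$ are introduced here for illustration, and $J$ is assumed to have full row rank.

```latex
\begin{aligned}
df &= g^\top dx, \qquad dh = J\, dx \\
\text{feasible moves: } & J\, dx = 0 \\
\text{optimality: } & g^\top dx = 0 \ \text{ for all } dx \text{ with } J\, dx = 0
  \iff g = J^\top \lambda \ \text{ for some } \lambda \in \mathbb{R}^k,
  \quad h(x) = 0 \\[4pt]
\text{special case } h(x) = Ax: \quad & g(x) = A^\top \lambda, \quad Ax = 0 \\
\text{Newton step, using } dg = H\, dx: \quad &
  g + H\, dx = A^\top \lambda, \quad A\,(x + dx) = 0
\end{aligned}
```

The sign of $\lambda$ is a convention. The last line is a linear system in $(dx, \lambda)$.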

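32 Example: bundle adjustment. Latent: robot positions $x_t$, $\theta_t$; landmark positions $y_k$. Observed: odometry, landmark vectors. $v_t = R_{\theta_t}[x_{t+1} - x_t] + \text{noise}$; $w_t = [\theta_{t+1} - \theta_t + \text{noise}]_\pi$; $d_{kt} = R_{\theta_t}[y_k - x_t] + \text{noise}$; $O = \{\text{observed } kt \text{ pairs}\}$. [Figure: plot; contents not recovered.]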

34 Bundle adjustment. $\min_{x_t, u_t, y_k} \sum_t \|v_t - R(u_t)[x_{t+1} - x_t]\|^2 + \sum_t \|R_{w_t} u_t - u_{t+1}\|^2 + \sum_{(t,k) \in O} \|d_{k,t} - R(u_t)[y_k - x_t]\|^2$ s.t. $u_t^\top u_t = 1$

35 Ex: MLE in exponential family. $L = \ln \prod_k P(x_k \mid \theta)$; $P(x_k \mid \theta) = \ldots$; $g(\theta) = \ldots$

36 MLE Newton interpretation

37 Comparison of methods for minimizing a convex function. Methods: Newton, FISTA, (sub)gradient, stochastic (sub)gradient. Compared on: convergence, cost per iteration, smoothness.

38 Variations. Trust region: $[H(x) + tI]\, dx = -g(x)$ or $[H(x) + tD]\, dx = -g(x)$. Quasi-Newton: use only gradients, but build an estimate of the Hessian. In $\mathbb{R}^d$, $d$ gradient estimates at nearby points determine the approximate Hessian (think finite differences). Can often get a good enough estimate with fewer; can even forget old information to save memory (L-BFGS).
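The "think finite differences" remark can be made concrete. The sketch below is not a quasi-Newton method such as BFGS; it only illustrates, for $d = 2$, how gradients at $d$ nearby points (plus the current one) determine an approximate Hessian. The test function, step size `h`, and function names are assumptions.

```python
import math

def grad(x, y):
    """Gradient of the illustrative function f(x, y) = exp(x + y) + x**2 + y**2."""
    e = math.exp(x + y)
    return (e + 2.0 * x, e + 2.0 * y)

def fd_hessian(grad_fn, x, y, h=1e-6):
    """Approximate the 2x2 Hessian from gradient differences.

    Column j is (grad(x + h * e_j) - grad(x)) / h, which approximates H e_j.
    """
    g0 = grad_fn(x, y)
    gx = grad_fn(x + h, y)
    gy = grad_fn(x, y + h)
    col_x = [(gx[i] - g0[i]) / h for i in range(2)]
    col_y = [(gy[i] - g0[i]) / h for i in range(2)]
    # Return the matrix as a list of rows.
    return [[col_x[0], col_y[0]], [col_x[1], col_y[1]]]

H_approx = fd_hessian(grad, 0.0, 0.0)
```

At the origin the exact Hessian of this function is $\begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$, which `H_approx` matches to within the finite-difference error.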

39 Variations: Gauss-Newton. $L = \min_\theta \sum_k \tfrac{1}{2} \|y_k - f(x_k, \theta)\|^2$
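A minimal sketch of the usual Gauss-Newton idea for this objective: replace the Hessian with $J^\top J$, where $J$ is the Jacobian of the predictions with respect to $\theta$, and solve $J^\top J\, d\theta = J^\top r$ for the residuals $r$. The one-parameter model $f(x, b) = e^{bx}$, the noise-free data, and the starting value are illustrative assumptions.

```python
import math

def gauss_newton_exp(xs, ys, b, iters=20):
    """Fit y = exp(b * x) by Gauss-Newton on the single parameter b."""
    for _ in range(iters):
        preds = [math.exp(b * x) for x in xs]
        resid = [y - p for y, p in zip(ys, preds)]
        jac = [x * p for x, p in zip(xs, preds)]  # d pred / d b
        jtj = sum(j * j for j in jac)              # J^T J (a scalar here)
        jtr = sum(j * r for j, r in zip(jac, resid))
        b += jtr / jtj                             # solve J^T J db = J^T r
    return b

xs = [0.5, 1.0, 1.5, 2.0]
ys = [math.exp(-1.0 * x) for x in xs]  # synthetic data generated with b = -1
b_hat = gauss_newton_exp(xs, ys, b=0.0)
```

Because the synthetic data have zero residual at the solution, the dropped second-derivative term vanishes there and the iteration behaves like Newton's method near $b = -1$.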

40 Variations: Fisher scoring. Recall Newton in the exponential family: $E[x x^\top \mid \theta]\, d\theta = \bar{x} - E[x \mid \theta]$. Can use this formula in place of Newton, even if not an exponential family. Properties: a descent direction, even with no regularization; the Hessian is independent of the data; often a wider radius of convergence than Newton; can be superlinearly convergent.
