Matrix differential calculus
10-725 Optimization
Geoff Gordon, Ryan Tibshirani

Review
Matrix differentials: a solution to matrix-calculus pain; a compact way of writing Taylor expansions. Definition:
  df = a(x; dx) [+ r(dx)]
- a(x; ·) linear in its 2nd argument
- r(dx)/||dx|| → 0 as dx → 0
- d(·) is linear: passes through +, scalar *
- Generalizes Jacobian, Hessian, gradient, velocity
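The definition above is easy to check numerically. A small sketch (my own example, not from the slides): for f(X) = tr(XᵀX), the differential is df = 2 tr(Xᵀ dX), and for a small perturbation dX the remainder r(dX) = tr(dXᵀ dX) is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
dX = 1e-6 * rng.standard_normal((4, 4))

# f(X) = tr(X^T X); its differential is df = 2 tr(X^T dX)
f = lambda M: np.trace(M.T @ M)
df_exact = 2 * np.trace(X.T @ dX)
df_fd = f(X + dX) - f(X)      # finite-difference approximation

# remainder r(dX) = tr(dX^T dX) is O(||dX||^2), so the two agree closely
assert abs(df_exact - df_fd) < 1e-9
```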

Review
- Chain rule, product rule
- Bilinear functions: cross product, Kronecker, Frobenius, Hadamard, Khatri-Rao, ...
- Identities: rules for working with products and tr(); trace rotation
- Identification theorems

Finding a maximum or minimum, or saddle point
Identification for df(x):

            scalar x         vector x          matrix X
scalar f    df = a dx        df = aᵀ dx        df = tr(Aᵀ dX)

[figure: 1-d function with maximum, minimum, and saddle points marked]


And so forth
Can't draw it for X a matrix, tensor, ... But the same principle holds: set the coefficient of dx to 0 to find a min, max, or saddle point:
  if df = c(a; dx) [+ r(dx)]
then x is a max/min/saddle point iff a = 0, for c(·;·) any of the products above.
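A concrete instance of "set the coefficient of dx to 0" (my example, not the slide's): for f(x) = ½xᵀQx − bᵀx we get df = (Qx − b)ᵀ dx, so the coefficient vanishes exactly when Qx = b.

```python
import numpy as np

# f(x) = x^T Q x / 2 - b^T x  =>  df = (Qx - b)^T dx.
# Setting the coefficient of dx to zero gives the stationary point Qx = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite => a minimum
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)

grad = Q @ x_star - b                     # coefficient of dx at x_star
assert np.allclose(grad, 0)
```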

Ex: Infomax ICA
Training examples xi ∈ Rᵈ, i = 1:n
Transformation yi = g(Wxi), W ∈ Rᵈˣᵈ
g(z) =
[scatter plots: xi, Wxi, yi]

Volume rule

Ex: Infomax ICA
yi = g(Wxi)
dyi =
Method: max_W Σi ln P(yi), where P(yi) =
[scatter plots: xi, Wxi, yi]

Gradient
L = Σi ln det Ji
yi = g(Wxi), dyi = Ji dxi
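Differentiating L relies on the identity d ln|det J| = tr(J⁻¹ dJ). A quick numerical check, with an illustrative well-conditioned random J (my choice, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # shift keeps J well-conditioned
dJ = 1e-6 * rng.standard_normal((5, 5))

# differential identity: d ln|det J| = tr(J^{-1} dJ)
lhs = np.log(abs(np.linalg.det(J + dJ))) - np.log(abs(np.linalg.det(J)))
rhs = np.trace(np.linalg.solve(J, dJ))

assert abs(lhs - rhs) < 1e-9
```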

Gradient
Ji = diag(ui) W
dJi = diag(ui) dW + diag(vi) diag(dW xi) W
dL =

Natural gradient
L(W): Rᵈˣᵈ → R, dL = tr(Gᵀ dW)
step: S = arg max_S M(S) = tr(Gᵀ S) − ||S W⁻¹||²_F / 2
scalar case: M = g s − s² / (2w²)
  dM =

ICA natural gradient
natural gradient step: [W⁻ᵀ + C] WᵀW
yi = g(Wxi)
start with W0 = I


ICA on natural image patches


More info
- Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
- Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd ed. Wiley, 1999. http://www.amazon.com/differential-calculus-Applications-Statistics-Econometrics/dp/047198633X
- Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1995.

Newton's method
10-725 Optimization
Geoff Gordon, Ryan Tibshirani

Nonlinear equations
f: Rᵈ → Rᵈ, differentiable
solve: f(x) = 0
Taylor: f(x + dx) ≈ f(x) + J(x) dx
J: Jacobian of f, Jij = ∂fi/∂xj
Newton: x ← x − J(x)⁻¹ f(x)
[figure: 1-d Newton step as the root of the tangent line]
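A minimal sketch of the multivariate Newton iteration above, on a toy system of my choosing (not from the lecture): f(x, y) = (x² + y² − 1, x − y), with roots at ±(1, 1)/√2.

```python
import numpy as np

# Newton's method for f(x) = 0: x <- x - J(x)^{-1} f(x)
def f(v):
    x, y = v
    return np.array([x**2 + y**2 - 1.0, x - y])

def J(v):                      # Jacobian of f
    x, y = v
    return np.array([[2*x, 2*y], [1.0, -1.0]])

v = np.array([2.0, 0.5])       # initial guess
for _ in range(20):
    v = v - np.linalg.solve(J(v), f(v))

assert np.allclose(f(v), 0, atol=1e-10)   # converged to a root
```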

Error analysis

dx = x*(1 - x*phi)
0: 0.7500000000000000
1: 0.5898558813281841
2: 0.6167492604787597
3: 0.6180313181415453
4: 0.6180339887383547
5: 0.6180339887498948
6: 0.6180339887498949
7: 0.6180339887498948
8: 0.6180339887498949
*: 0.6180339887498948

Bad initialization
 1.3000000000000000
-0.1344774409873226
-0.2982157033270080
-0.7403273854022190
-2.3674743431148597
-13.8039236412225819
-335.9214859516196157
-183256.0483360671496484
-54338444778.1145248413085938
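Both runs can be reproduced: the update x ← x + x(1 − xφ) is Newton's method for f(x) = 1/x − φ, i.e. computing 1/φ = φ − 1 without division. It converges quadratically from 0.75 but diverges from 1.3 (once x > 2/φ the next iterate goes negative and the iterates blow up).

```python
# Newton for f(x) = 1/x - phi:  x <- x - f(x)/f'(x) = x + x*(1 - x*phi)
phi = (1 + 5 ** 0.5) / 2          # golden ratio; the root is 1/phi = phi - 1

def iterate(x, n):
    for _ in range(n):
        x = x + x * (1 - x * phi)
    return x

good = iterate(0.75, 8)            # converges quadratically (good run above)
bad = iterate(1.3, 8)              # diverges (bad run above)

assert abs(good - (phi - 1)) < 1e-15
assert bad < -1e9
```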

Minimization
x ∈ Rᵈ, f: Rᵈ → R, twice differentiable
find: x minimizing f, i.e. ∇f(x) = 0
Newton: x ← x − (∇²f(x))⁻¹ ∇f(x)

Descent
Newton step: d = −(∇²f(x))⁻¹ ∇f(x)
Gradient step: g = −∇f(x)
Taylor: df = ∇f(x)ᵀ dx [+ r(dx)]
Let t > 0, set dx = t d:
  df = −t ∇f(x)ᵀ (∇²f(x))⁻¹ ∇f(x) < 0 when ∇²f(x) ≻ 0
So: the Newton step is a descent direction whenever the Hessian is positive definite.

Steepest descent
g = ∇f(x), H = ∇²f(x)
||d||_H = (dᵀ H d)^{1/2}
[figure: level sets with x, x + Δx_nsd (normalized steepest descent step), x + Δx_nt (Newton step)]

Newton w/ line search
Pick x1
For k = 1, 2, ...
  gk = ∇f(xk); Hk = ∇²f(xk)                       (gradient & Hessian)
  dk = −Hk \ gk                                    (Newton direction)
  tk = 1
  while f(xk + tk dk) > f(xk) + tk gkᵀ dk / 2:     (backtracking line search)
      tk = β tk                                    (β < 1)
  xk+1 = xk + tk dk                                (step)
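The loop above, transcribed into Python. The test function f(x) = e^{x1+x2−1} + e^{x1−x2−1} + e^{−x1} is a standard smooth convex example of my choosing, with gradient and Hessian worked out by hand.

```python
import numpy as np

def f(x):
    return np.exp(x[0] + x[1] - 1) + np.exp(x[0] - x[1] - 1) + np.exp(-x[0])

def grad(x):
    a, b, c = np.exp(x[0] + x[1] - 1), np.exp(x[0] - x[1] - 1), np.exp(-x[0])
    return np.array([a + b - c, a - b])

def hess(x):
    a, b, c = np.exp(x[0] + x[1] - 1), np.exp(x[0] - x[1] - 1), np.exp(-x[0])
    return np.array([[a + b + c, a - b], [a - b, a + b]])

x, beta = np.array([1.0, 1.0]), 0.5
for _ in range(50):
    g, H = grad(x), hess(x)
    d = -np.linalg.solve(H, g)                       # Newton direction
    t = 1.0
    while f(x + t * d) > f(x) + 0.5 * t * g @ d:     # backtracking line search
        t = beta * t
    x = x + t * d                                    # damped Newton step

assert np.linalg.norm(grad(x)) < 1e-10               # converged to the minimizer
```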

Properties of damped Newton
Affine invariant: suppose g(x) = f(Ax + b)
  x1, x2, ... from Newton on g(·); y1, y2, ... from Newton on f(·)
  If y1 = Ax1 + b, then yk = Axk + b for all k
Convergent:
  if f bounded below, f(xk) converges
  if f strictly convex with bounded level sets, xk converges
  typically quadratic rate in a neighborhood of x*

Equality constraints
min f(x) s.t. h(x) = 0
[figure: level sets of f with the constraint curve h(x) = 0]

Optimality w/ equality
min f(x) s.t. h(x) = 0
f: Rᵈ → R, h: Rᵈ → Rᵏ (k ≤ d)
g: Rᵈ → Rᵈ (gradient of f)
Useful special case: min f(x) s.t. Ax = 0

Picture
max cᵀx  s.t.  x² + y² + z² = 1,  aᵀx = b
[figure: sphere intersected with the plane aᵀx = b]

Optimality w/ equality
min f(x) s.t. h(x) = 0
f: Rᵈ → R, h: Rᵈ → Rᵏ (k ≤ d)
g: Rᵈ → Rᵈ (gradient of f)
Now suppose: dg = H dx, dh = A dx (H = Hessian of f, A = Jacobian of h)
Optimality: h(x) = 0 and g(x) + Aᵀλ = 0 for some λ ∈ Rᵏ
  (the gradient of f lies in the row space of A)
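For the special case min f(x) s.t. Ax = 0 with quadratic f, the two optimality conditions form one linear (KKT) system. A sketch with illustrative Q, c, A of my choosing:

```python
import numpy as np

# min x^T Q x / 2 - c^T x  s.t.  Ax = 0
# stationarity:  Qx - c + A^T nu = 0   feasibility:  Ax = 0
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
A = np.array([[1.0, -1.0]])               # constraint x1 = x2

# assemble and solve the KKT system [[Q, A^T], [A, 0]] [x; nu] = [c; 0]
K = np.block([[Q, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([c, np.zeros(1)]))
x, nu = sol[:2], sol[2:]

assert np.allclose(A @ x, 0)                   # feasible
assert np.allclose(Q @ x - c + A.T @ nu, 0)    # stationary
```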

Example: bundle adjustment
Latent: robot poses xt, θt; landmark positions yk
Observed: odometry, landmark vectors
  vt = Rθt [xt+1 − xt] + noise
  wt = [θt+1 − θt + noise]π
  dkt = Rθt [yk − xt] + noise
  O = {observed (k, t) pairs}
[figure: robot trajectory and landmarks in the plane]


Bundle adjustment

min_{x_t, u_t, y_k}  Σ_t ||v_t − R(u_t)[x_{t+1} − x_t]||² + Σ_t ||R_{w_t} u_t − u_{t+1}||² + Σ_{(t,k) ∈ O} ||d_{k,t} − R(u_t)[y_k − x_t]||²
s.t.  u_tᵀ u_t = 1

Ex: MLE in exponential family
L = ln Π_k P(x_k | θ) = Σ_k θᵀ x_k − n ln Z(θ)
P(x_k | θ) = exp(θᵀ x_k) / Z(θ)
g(θ) = ∇L / n = x̄ − E[x | θ]

MLE Newton interpretation

Comparison of methods for minimizing a convex function

              Newton    FISTA    (sub)grad    stoch. (sub)grad
convergence
cost/iter
smoothness

Variations
Trust region:
  [H(x) + tI] dx = −g(x)
  [H(x) + tD] dx = −g(x)
Quasi-Newton:
  use only gradients, but build an estimate of the Hessian
  in Rᵈ, d gradient estimates at nearby points determine an approx. Hessian (think finite differences)
  can often get a good enough estimate with fewer
  can even forget old info to save memory (L-BFGS)
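The trust-region variant [H(x) + tI] dx = −g(x) can be sketched directly. The indefinite H below is my own example, chosen to show why the regularization helps: the pure Newton step is unsafe, but for t large enough the shifted system gives a descent direction.

```python
import numpy as np

# Levenberg-style regularized Newton step: solve [H + tI] dx = -g.
# t = 0 recovers the Newton step; large t approaches a scaled gradient step.
def regularized_step(H, g, t):
    return np.linalg.solve(H + t * np.eye(len(g)), -g)

H = np.array([[2.0, 0.0], [0.0, -1.0]])   # indefinite Hessian: pure Newton unsafe
g = np.array([1.0, 1.0])

d = regularized_step(H, g, t=3.0)          # H + 3I is positive definite
assert g @ d < 0                           # regularized step is a descent direction
```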

Variations: Gauss-Newton

L = min_θ Σ_k (1/2) ||y_k − f(x_k, θ)||²
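A minimal Gauss-Newton iteration for the objective above, with a one-parameter model f(x, θ) = e^{θx} and noiseless data (model and data are illustrative, not from the slides): each step solves the linearized least-squares problem, θ ← θ + (JᵀJ)⁻¹Jᵀr.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * x)                        # noiseless data; true theta = 0.7

theta = 0.0
for _ in range(20):
    pred = np.exp(theta * x)
    r = y - pred                           # residuals r_k = y_k - f(x_k, theta)
    Jc = x * pred                          # Jacobian column d f / d theta
    theta = theta + (Jc @ r) / (Jc @ Jc)   # Gauss-Newton step (J^T J)^{-1} J^T r

assert abs(theta - 0.7) < 1e-10            # zero-residual problem: fast convergence
```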

Variations: Fisher scoring
Recall Newton in the exponential family:
  E[xxᵀ | θ] dθ = x̄ − E[x | θ]
Can use this formula in place of Newton, even if not an exponential family:
- descent direction, even with no regularization
- Hessian is independent of data
- often a wider radius of convergence than Newton
- can be superlinearly convergent