Lecture 11: October 2

10-725: Optimization                                                   Fall 2012
Lecture 11: October 2
Lecturer: Geoff Gordon/Ryan Tibshirani          Scribes: Tongbo Huang, Shoou-I Yu

Note: LaTeX template courtesy of UC Berkeley EECS dept.

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

11.1 Matrix Differential Calculus

11.1.1 Review of Previous Class

The matrix differential is a remedy for the pain of matrix calculus. It can be understood as a compact way of writing a Taylor expansion. Definition:

    df = a(x; dx) + r(dx),

where r is the residual term, a(x, .) is linear in its second argument, and r(dx)/||dx|| -> 0 as dx -> 0. The derivative is linear, so it passes through addition and scalar multiplication, and it generalizes the Jacobian, Hessian, gradient, and velocity. Other topics covered previously: the chain rule, the product rule, bilinear functions, identities, and identification theorems. Please refer to the previous scribed notes for details.

11.1.2 Finding a Maximum, Minimum, or Saddle Point

The principle: set the coefficient of dx to zero to find a min, max, or saddle point. If

    df = c(A; dx) + r(dx),

then taking dx = tA gives df ≈ c(A; tA) = t c(A; A). The function is at a min/max/saddle point iff c(A; A) = 0; if c is an inner product, this is equivalent to A = 0.
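To make the definition concrete, here is a small numerical sketch (not from the lecture) for the matrix function f(X) = tr(X^T X), whose differential is df = 2 tr(X^T dX); the dimensions and variable names are chosen arbitrarily for illustration. It checks that the residual r(dX) vanishes faster than ||dX||.

    import numpy as np

    # Minimal sketch (assumed example): for f(X) = tr(X^T X) the differential is
    # df = 2 tr(X^T dX), i.e. a(X; dX) = 2 tr(X^T dX). We check that the residual
    # r(dX) = f(X + dX) - f(X) - a(X; dX) vanishes faster than ||dX||.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 3))
    D = rng.standard_normal((3, 3))          # a fixed direction for dX

    f = lambda X: np.trace(X.T @ X)
    a = lambda X, dX: 2 * np.trace(X.T @ dX) # the linear term a(X; dX)

    for t in [1e-1, 1e-2, 1e-3, 1e-4]:
        dX = t * D
        r = f(X + dX) - f(X) - a(X, dX)      # residual of the Taylor expansion
        print(t, r / np.linalg.norm(dX))     # ratio -> 0 as dX -> 0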

11.1.3 Infomax ICA

Suppose we have n training examples x_i ∈ R^d and a scalar-valued, component-wise function g. We would like to find the d x d matrix W that maximizes the entropy of y_i = g(W x_i).

Detour, the volume rule: vol(AS) = |det(A)| vol(S). Interpretation: a small determinant means A has a small eigenvalue, which squashes the volume flat; a large determinant stretches it out.

Back to infomax ICA. We have y_i = g(W x_i), so dy_i = J(x_i; W) dx_i = J_i dx_i. We want to maximize the entropy of the distribution of y:

    max_W Σ_i (-ln P(y_i)),    where P(y_i) = P(x_i) / |det J(x_i; W)|.

Since max H(P(y)) = -∫ P(y) ln P(y) dy = E[-ln P(y)] and P(x) does not depend on W, this is equivalent to

    max_W Σ_i ln |det J(x_i; W)|.

11.1.4 Solving for the ICA Gradient

Define u_i = g'(W x_i) and v_i = g''(W x_i), applied component-wise. Differentiating y_i = g(W x_i) with respect to x_i:

    dy_i = g'(W x_i) ∘ d(W x_i) = u_i ∘ (W dx_i) = diag(u_i) W dx_i,

so J_i = diag(u_i) W. Differentiating J_i = diag(u_i) W with respect to W:

    dJ_i = d(diag(u_i)) W + diag(u_i) dW
         = diag(v_i ∘ d(W x_i)) W + diag(u_i) dW
         = diag(u_i) dW + diag(v_i) diag(d(W x_i)) W,

where d(W x_i) = dW x_i since x_i is held fixed.
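As a concrete check of J_i = diag(u_i) W, here is a small numerical sketch (not from the lecture); the nonlinearity g = tanh, the dimensions, and all variable names are assumptions made only for illustration.

    import numpy as np

    # Minimal sketch (assumed example): verify J_i = diag(u_i) W for y = g(W x),
    # with g = tanh chosen arbitrarily as the component-wise nonlinearity.
    rng = np.random.default_rng(0)
    d = 4
    W = rng.standard_normal((d, d))
    x = rng.standard_normal(d)

    g = np.tanh
    gprime = lambda z: 1.0 - np.tanh(z) ** 2   # g'(z) for g = tanh

    u = gprime(W @ x)                 # u = g'(W x)
    J_analytic = np.diag(u) @ W       # J = diag(u) W

    # Finite-difference Jacobian of the map x -> g(W x)
    eps = 1e-6
    J_numeric = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J_numeric[:, j] = (g(W @ (x + e)) - g(W @ (x - e))) / (2 * eps)

    print(np.max(np.abs(J_analytic - J_numeric)))   # should be ~1e-9 or smaller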

Finally, define diag(α_i) = diag(u_i)^{-1} diag(v_i). Taking the differential of L = Σ_i ln |det J(x_i; W)|:

    dL = Σ_i d( ln |det J(x_i; W)| )
       = Σ_i tr( J_i^{-1} dJ_i )
       = Σ_i tr( W^{-1} dW + W^{-1} diag(u_i)^{-1} diag(v_i) diag(dW x_i) W )
       = Σ_i [ tr(W^{-1} dW) + tr(diag(α_i) diag(dW x_i)) ]
       = n tr(W^{-1} dW) + Σ_i tr(α_i^T dW x_i)
       = n tr(W^{-1} dW) + tr( (Σ_i x_i α_i^T) dW ),

so the gradient is G = n W^{-T} + Σ_i α_i x_i^T = n W^{-T} + C.

11.1.5 Natural Gradient

View L(W) as a function from R^{d x d} to R, so that dL = tr(G^T dW). The natural-gradient step S maximizes

    M(S) = tr(G^T S) - (1/2) ||S W^{-1}||_F^2,        S = argmax_S M(S),

which in the scalar case reads M = g s - s^2 / (2 w^2). To find the max/min/saddle point, expand the penalty and take the differential:

    M(S) = tr(G^T S) - (1/2) tr(S W^{-1} W^{-T} S^T),
    dM = tr(G^T dS) - tr(dS W^{-1} W^{-T} S^T).

Setting the coefficient of dS to zero gives G = S W^{-1} W^{-T}, and thus S = G W^T W. Using the gradient derived above (up to the factor of n), the natural-gradient step is

    [W^{-T} + C] W^T W = W + C W^T W.

(A numerical sanity check of this gradient and step appears after the references below.)

11.1.6 More Info

Minka's cheat sheet: http://research.microsoft.com/en-us/um/people/minka/papers/matrix/

Magnus & Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, 1999, 2nd ed. http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X

Bell & Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, v7, 1995.
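As promised above, here is a small numerical sanity check (not from the lecture) of the gradient G = n W^{-T} + Σ_i α_i x_i^T and of the natural-gradient step S = G W^T W. The choice g = tanh, the dimensions, and the variable names are assumptions for illustration, and the factor of n is kept explicit.

    import numpy as np

    # Minimal sketch (assumed example): check the infomax ICA gradient
    # G = n W^{-T} + sum_i alpha_i x_i^T against finite differences of
    # L(W) = sum_i ln |det J(x_i; W)|, using g = tanh as the nonlinearity.
    rng = np.random.default_rng(1)
    n, d = 20, 3
    X = rng.standard_normal((n, d))          # rows are the examples x_i
    W = rng.standard_normal((d, d))

    gp  = lambda z: 1.0 - np.tanh(z) ** 2                          # g'(z)
    gpp = lambda z: -2.0 * np.tanh(z) * (1.0 - np.tanh(z) ** 2)    # g''(z)

    def L(W):
        Z = X @ W.T                          # rows are z_i = W x_i
        # ln|det(diag(u_i) W)| = sum_j ln|u_ij| + ln|det W|
        return np.sum(np.log(np.abs(gp(Z)))) + n * np.log(np.abs(np.linalg.det(W)))

    Z = X @ W.T
    alpha = gpp(Z) / gp(Z)                   # rows are alpha_i = u_i^{-1} v_i
    G = n * np.linalg.inv(W).T + alpha.T @ X # n W^{-T} + sum_i alpha_i x_i^T

    # Finite-difference check of one entry of the gradient
    eps = 1e-6
    E = np.zeros((d, d)); E[0, 1] = eps
    print((L(W + E) - L(W - E)) / (2 * eps), G[0, 1])   # should agree closely

    S = G @ W.T @ W                          # natural-gradient step S = G W^T W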

11.2 Newton's Method

Newton's method has two main applications: solving nonlinear equations, and finding minima, maxima, and saddle points.

11.2.1 Solving Nonlinear Equations

For x ∈ R^d and a differentiable f : R^d -> R^d, we want to solve f(x) = 0. The first-order Taylor approximation of f around x is

    f(y) ≈ f(x) + J(x)(y - x) = f̂(y),

where J(x) is the Jacobian. Solving the linearized equation f̂(y) = 0, that is f(x) + J(x)(y - x) = 0, gives

    dx = y - x = -J(x)^{-1} f(x),                                        (11.1)

which is the update step for Newton's method; the update rule is x^+ = x + dx.

As an example, consider approximating the reciprocal of the golden ratio φ. The function and its derivative are

    f(x) = 1/x - φ,        f'(x) = -1/x^2,

so the Newton step becomes

    dx = -J(x)^{-1} f(x) = x^2 (1/x - φ) = x - x^2 φ.

Figure 11.1 shows one iteration of Newton's method starting at x = 1.

Figure 11.1: Example of one iteration of Newton's method for solving a nonlinear equation.
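To see the behavior numerically, here is a short sketch (not from the lecture) that runs the division-free update x^+ = x + dx = 2x - φ x^2 from x = 1 and prints the error ε = xφ - 1 at each step; the starting points are chosen only for illustration.

    import numpy as np

    # Minimal sketch (assumed example): Newton's method for f(x) = 1/x - phi,
    # whose root is 1/phi. The update x+ = x + x^2 (1/x - phi) = 2x - phi x^2
    # needs no division, and the error eps = x*phi - 1 squares at every step.
    phi = (1 + np.sqrt(5)) / 2

    x = 1.0                          # starting point, as in Figure 11.1
    for k in range(6):
        eps = x * phi - 1            # error of the current iterate
        print(f"k={k}  x={x:.12f}  eps={eps:+.3e}")
        x = 2 * x - phi * x ** 2     # Newton update x+ = x + dx

    # Starting too far away (|eps| > 1, e.g. x = 1.5) makes the iterates diverge.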

We now perform an error analysis of Newton's method. For the iterate x, the error is ε = xφ - 1. For the next iterate x^+, the error ε^+ is

    ε^+ = x^+ φ - 1
        = (x + x - x^2 φ) φ - 1
        = (x + x(1 - xφ)) φ - 1
        = (x - xε) φ - 1
        = xφ - 1 - xεφ
        = ε - xεφ
        = ε(1 - xφ)
        = -ε^2.

So |ε^+| = ε^2: if |ε| < 1, Newton's method converges quadratically; if |ε| > 1, it diverges.

11.2.2 Finding Minima/Maxima/Saddles

For x ∈ R^d and a twice-differentiable f : R^d -> R, we want to find min_x f(x). We focus on minimization; finding maxima and saddle points works the same way. Define g = ∇f. Minimizing f amounts to finding x such that g = ∇f = 0, so we can apply the update (11.1) to g. The Newton step is

    d = -J^{-1} g = -(g')^{-1} g = -(f'')^{-1} f' = -H^{-1} g,

where H is the Hessian.

We now show that Newton's method is a descent method when H ≻ 0. Set dx = td for t > 0, and let r(dx) be the residual. Using a first-order Taylor expansion,

    df = g^T dx + r(dx)
       = g^T ( -t (f'')^{-1} f' ) + r(dx)
       = -t f'^T (f'')^{-1} f' + r(dx)
       = -t f'^T H^{-1} f' + r(dx).

If H ≻ 0, then H^{-1} ≻ 0, so the first term is always negative (when f' ≠ 0), making Newton's method a descent method.

11.2.3 Newton's Method and Steepest Descent

Newton's method is a special case of steepest descent in which the norm used is the Hessian norm. The steepest-descent step solves

    min_d  g^T d + (1/2) ||d||_H^2,        ||d||_H^2 = d^T H d,          (11.2)

whose solution is d = -H^{-1} g. Steepest descent with a constraint on the step and steepest descent with a penalty in the objective are equivalent; the equivalence will be covered when duality is covered in class. Figure 11.2 shows the difference between the step directions of gradient descent and Newton's method.

Figure 11.2: Direction of the step for gradient descent and Newton's method.
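A quick numerical sketch (not from the lecture) of the claim in (11.2): the minimizer of g^T d + (1/2) d^T H d satisfies g + H d = 0, i.e. it is the Newton step -H^{-1} g, and it generally points in a different direction than the gradient-descent step -g. The matrix H, the vector g, and the dimensions below are illustrative assumptions.

    import numpy as np

    # Minimal sketch (assumed example): the steepest-descent step in the Hessian
    # norm, argmin_d g^T d + (1/2) d^T H d, is the Newton step d = -H^{-1} g.
    rng = np.random.default_rng(2)
    n = 5
    A = rng.standard_normal((n, n))
    H = A @ A.T + n * np.eye(n)            # a positive definite "Hessian"
    g = rng.standard_normal(n)             # a "gradient"

    d_newton = -np.linalg.solve(H, g)      # -H^{-1} g

    # Optimality condition for min_d g^T d + (1/2) d^T H d is g + H d = 0:
    print(np.allclose(g + H @ d_newton, 0))            # True

    # The gradient-descent direction -g is generally different:
    cos = (d_newton @ -g) / (np.linalg.norm(d_newton) * np.linalg.norm(g))
    print(cos)   # equals 1 only if g happens to be an eigenvector of H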

11.2.4 Damped Newton

Damped Newton combines Newton's method with a backtracking line search, to make sure the objective value does not increase:

    Initialize x_1
    for k = 1, 2, ...
        g_k = f'(x_k)                                  (gradient)
        H_k = f''(x_k)                                 (Hessian)
        d_k = -H_k \ g_k                               (Newton direction)
        t_k = 1                                        (backtracking line search)
        while f(x_k + t_k d_k) > f(x_k) + t_k g_k^T d_k / 3    (divide by 3 to keep the constant < 1/2 for later proofs)
            t_k = β t_k                                (β < 1)
        x_{k+1} = x_k + t_k d_k                        (step)

Damped Newton is affine invariant: suppose g(x) = f(Ax + b), and we obtain Newton iterates x_1, x_2, ... on g and y_1, y_2, ... on f. If y_1 = A x_1 + b, then y_i = A x_i + b for all i. For damped Newton, if f is bounded below, then f(x_k) converges. If f is strictly convex with bounded level sets, then x_k converges. Finally, damped Newton typically converges at a quadratic rate in a neighborhood of the optimum x*.
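A minimal runnable sketch of the damped Newton pseudocode above, under assumptions: the 2-D test function f(x) = exp(x_1 + 3x_2 - 0.1) + exp(x_1 - 3x_2 - 0.1) + exp(-x_1 - 0.1), the backtracking factor β = 0.5, and the stopping tolerance are illustrative choices, not part of the lecture.

    import numpy as np

    # Minimal sketch (assumed example): damped Newton with backtracking line
    # search, following the pseudocode above, on an illustrative strictly
    # convex 2-D test function.
    A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
    b = -0.1 * np.ones(3)

    f    = lambda x: np.sum(np.exp(A @ x + b))
    grad = lambda x: A.T @ np.exp(A @ x + b)
    hess = lambda x: A.T @ (np.exp(A @ x + b)[:, None] * A)

    def damped_newton(x, beta=0.5, max_iter=50, tol=1e-12):
        for k in range(max_iter):
            g = grad(x)                       # g_k = f'(x_k)
            if np.linalg.norm(g) < tol:
                break
            H = hess(x)                       # H_k = f''(x_k)
            d = -np.linalg.solve(H, g)        # d_k = -H_k \ g_k  (Newton direction)
            t = 1.0                           # start the backtracking line search
            while f(x + t * d) > f(x) + t * (g @ d) / 3:
                t *= beta                     # shrink the step, beta < 1
            x = x + t * d                     # x_{k+1} = x_k + t_k d_k
        return x

    x_star = damped_newton(np.array([-1.0, 1.0]))
    print(x_star, np.linalg.norm(grad(x_star)))   # gradient norm should be ~0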