Beyond Newton s method Thomas P. Minka

Similar documents
Topic 8c Multi Variable Optimization

Unconstrained Multivariate Optimization

is a maximizer. However this is not the case. We will now give a graphical argument explaining why argue a further condition must be satisfied.

Power EP. Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR , October 4, Abstract

Selected Topics in Optimization. Some slides borrowed from

1 Kernel methods & optimization

CHAPTER 1-2: SHADOW PRICES

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

Intro to Nonlinear Optimization

Machine Learning Basics III

Why should you care about the solution strategies?

Numerical Methods. Root Finding

9. Interpretations, Lifting, SOS and Moments

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

Machine Learning Lecture 3

Variational Bayes. A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M

1. Sets A set is any collection of elements. Examples: - the set of even numbers between zero and the set of colors on the national flag.

10-725/36-725: Convex Optimization Spring Lecture 21: April 6

Chapter 9. Derivatives. Josef Leydold Mathematical Methods WS 2018/19 9 Derivatives 1 / 51. f x. (x 0, f (x 0 ))

A Practitioner s Guide to Generalized Linear Models

Economics 205 Exercises

Lecture 23: November 19

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25

Nonlinear Optimization for Optimal Control

Numerical Integration (Quadrature) Another application for our interpolation tools!

Chapter 6. Nonlinear Equations. 6.1 The Problem of Nonlinear Root-finding. 6.2 Rate of Convergence

Part 2: NLP Constrained Optimization

Gaussian Process Vine Copulas for Multivariate Dependence

APPLICATIONS OF DIFFERENTIATION

Notes on Discriminant Functions and Optimal Classification

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review

Part 1: Expectation Propagation

MACHINE LEARNING ADVANCED MACHINE LEARNING

Chapter 2. Optimization. Gradients, convexity, and ALS

Gradient Descent. Dr. Xiaowei Huang

The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA

Lecture 14: Newton s Method

Lecture 3 September 1

Type II variational methods in Bayesian estimation

STATIC LECTURE 4: CONSTRAINED OPTIMIZATION II - KUHN TUCKER THEORY

Bayesian Inference of Noise Levels in Regression

8 Numerical methods for unconstrained problems

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Today. Introduction to optimization Definition and motivation 1-dimensional methods. Multi-dimensional methods. General strategies, value-only methods

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

A Primer on Multidimensional Optimization

Language and Statistics II

17 Solution of Nonlinear Systems

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning

Learning Targets: Standard Form: Quadratic Function. Parabola. Vertex Max/Min. x-coordinate of vertex Axis of symmetry. y-intercept.

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

Geometric Modeling Summer Semester 2010 Mathematical Tools (1)

Bregman Divergence and Mirror Descent

Approximate Message Passing with Built-in Parameter Estimation for Sparse Signal Recovery

ebay/google short course: Problem set 2

Machine Learning Lecture 2

2.3 The Fixed-Point Algorithm

Where now? Machine Learning and Bayesian Inference

Optimization Tutorial 1. Basic Gradient Descent

QUADRATIC FUNCTIONS. ( x 7)(5x 6) = 2. Exercises: 1 3x 5 Sum: 8. We ll expand it by using the distributive property; 9. Let s use the FOIL method;

Iterative Reweighted Least Squares

Optimization Methods: Optimization using Calculus - Equality constraints 1. Module 2 Lecture Notes 4

Machine Learning 4771

Review of Optimization Basics

Comparative study of Optimization methods for Unconstrained Multivariable Nonlinear Programming Problems

Lecture 23: Conditional Gradient Method

Example - Newton-Raphson Method

NUMERICAL COMPUTATION OF THE CAPACITY OF CONTINUOUS MEMORYLESS CHANNELS

MACHINE LEARNING ADVANCED MACHINE LEARNING

Review of Classical Optimization

Lecture 7: Minimization or maximization of functions (Recipes Chapter 10)

GRADIENT = STEEPEST DESCENT

M.S. Project Report. Efficient Failure Rate Prediction for SRAM Cells via Gibbs Sampling. Yamei Feng 12/15/2011

Regression with Numerical Optimization. Logistic

Math 273a: Optimization Lagrange Duality

Water Resources Systems: Modeling Techniques and Analysis

Lecture 3: Linesearch methods (continued). Steepest descent methods

Numerical optimization

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Lecture 25: November 27

LECTURE # - NEURAL COMPUTATION, Feb 04, Linear Regression. x 1 θ 1 output... θ M x M. Assumes a functional form

NON-NEGATIVE MATRIX FACTORIZATION FOR PARAMETER ESTIMATION IN HIDDEN MARKOV MODELS. Balaji Lakshminarayanan and Raviv Raich

Color Scheme. swright/pcmi/ M. Figueiredo and S. Wright () Inference and Optimization PCMI, July / 14

Applied Computational Economics Workshop. Part 3: Nonlinear Equations

Calculus in Business. By Frederic A. Palmliden December 7, 1999

Math 171 Calculus I Spring, 2019 Practice Questions for Exam IV 1

1 Computing with constraints

Math Midterm Solutions

Lecture 16: October 22

Artificial Neural Networks 2

Expectation Maximization Deconvolution Algorithm

CSC 576: Gradient Descent Algorithms

How to Plan Experiments

A Robust State Estimator Based on Maximum Exponential Square (MES)

Lecture 3: Pattern Classification. Pattern classification

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725

= V I = Bus Admittance Matrix. Chapter 6: Power Flow. Constructing Ybus. Example. Network Solution. Triangular factorization. Let

Nonlinear Equations. Chapter The Bisection Method

Transcription:

Beyond Newton s method Thomas P. Minka 2000 (revised 7/21/2017) Abstract Newton s method for optimization is equivalent to iteratively maimizing a local quadratic approimation to the objective function. But some functions are not well-approimated by a quadratic, leading to slow convergence, and some have turning points where the curvature changes sign, leading to failure. To fi this, we can use a more appropriate choice of local approimation than quadratic, based on the type of function we are optimizing. This paper demonstrates three such generalized Newton rules. Like Newton s method, they only involve the first two derivatives of the function, yet converge faster and fail less often. 1 Newton s method Newton s method is usually described in terms of root-finding, but it can also be understood as maimizing a local quadratic approimation to the objective function. Let the objective be f(). A quadratic approimation about 0 has the form g() = k + b 2 ( a)2 (1) g () = b( a) (2) g () = b (3) With three free parameters, we can make g and its first two derivatives match those of f at 0. That is, (k,b,a) are given by whose solution is k + b 2 ( 0 a) 2 = f( 0 ) (4) b( 0 a) = f ( 0 ) (5) b = f ( 0 ) (6) b = f ( 0 ) (7) a = 0 f ( 0 )/f ( 0 ) (8) k = f( 0 ) f ( 0 ) 2 2f ( 0 ) (9) 1

If b < 0, then the maimum of g is a, leading to the update rule new = f ()/f () (10) which is eactly Newton s method for optimization. Unfortunately, if b = f () 0, then the method fails and we must use some other optimization technique to get closer to the optimum of f. 2 A non-quadratic variation In maimum-likelihood estimation of a Dirichlet distribution, we encounter the objective f() = log Γ()ep(s) K k=1 Γ(m k) > 0 (11) This objective is conve, but Newton s method may still fail by producing a negative value. We can get a faster and more stable algorithm by using a local approimation of the form Matching f and its derivatives at 0 requires g() = k +alog()+b (12) g () = a/+b (13) g () = a/ 2 (14) a = 2 0f ( 0 ) (15) b = f ( 0 ) a/ 0 (16) Figure 1 demonstrates the high quality of this approimation compared to a quadratic. Since f () 0 for all, we know a 0. If in addition b < 0, then the maimum of g is a/b, giving the update rule 1 = 1 new + f () (17) 2 f () If it happens that b 0, then some other method must be used. Note that this update cannot be achieved by simply changing the parameterization of f and applying Newton s method to the new parameterization. For eample, if we write f() = f(1/y) and use a quadratic approimation in y, we get the Newton update or y new f (1/y)( 1/y 2 ) = y f (1/y)(1/y 4 )+f (1/y)(2/y 3 ) 1 = 1 new + f () 2 f ()+2f () (18) (19) 2

4 3 5 4 6 5 f() 7 f() 6 7 0 2 4 6 8 10 0 2 4 6 8 10 Figure 1: Approimations to the objective (11) at = 1 (left) and = 6 (right). Here K = 3, m = [1/2,1/6,1/3], and s = 1.5. Iteration Newton y = 1/ (19) Newton y = log() (20) method (17) 0 = 6 = 6 = 6 1 4.37629402002204 3.32635155274492 2.46069962590123 2 3.37583855493481 2.65619730648832 2.59239008053976 3 2.83325972215703 2.59371556140329 2.59311961991097 4 2.62645752464592 2.59311969529076 2.59311964068914 5 2.59390440938786 2.59311964068914 2.59311964068914 Figure2: Convergenceofthreeiteration schemesstartingfrom = 6ontheobjective infigure1. The proposed method reaches machine precision in 4 iterations. which is as close as you can get to (17) using reparameterization. Since must be positive, it is also tempting to reparameterize with f() = f(e y ). Applying Newton s method to y gives ( ) new f () = ep (20) f ()+f () Figure 2 compares the convergence rate of the three updates (17), (19), and (20). 3

3 Another variation In maimum-likelihood estimation of certain Gaussian models, we encounter the objective f() = n i=1 log(+v i ) s i +v i 0 (21) This objective is not conve and has turning points which confound Newton s method. Instead of quadratic g, consider a local approimation of the form Matching f and its derivatives at 0 requires g() = k nlog(+a) b +a (22) g () = n +a + b (+a) 2 (23) g n () = (+a) 2b 2 (+a) 3 (24) b = n( 0 +a)+f ( 0 )( 0 +a) 2 (25) { f ( 0 ) 2 nf ( 0 )+f ( 0 ) a = f ( 0 ) 0 if f ( 0 ) 0 n (26) 2f ( 0 ) 0 if f ( 0 ) = 0 Figure 3 demonstrates the high quality of this approimation compared to a quadratic. If a 0 and b > 0, the unconstrained maimum of g is b/n a, so the update rule is new = ma(0,+ f () n (+a)2 ) (27) If the conditions aren t met, which can happen when we are far from the maimum, then some other scheme must be used, just as with Newton s method. However, the method does not fail when f () crosses zero. Figure 4 compares the convergence rate of the updates (10), (20), and (27). 4

11 11 f() 12 f() 12 13 13 14 14 15 0 20 40 60 80 100 15 0 20 40 60 80 100 Figure 3: Approimations to the objective (21) at = 2 (left) and = 40 (right). Here n = 3 and v = [1,1/2,1/3], s = [2,5,10]. In both cases, the proposed approimation (22) is nearly identical to the true objective. Iteration Newton (10) Newton y = log() (20) method (27) 0 = 2 = 2 = 2 1 2.88957666375356 4.45604707043312 5.33228321620505 2 3.88206057335333 5.30261828233160 5.36035707890124 3 4.75937270469894 5.36013962130173 5.36035751320994 4 5.24746847271009 5.36035751012362 5.36035751320994 Figure 4: Convergence of three iteration schemes starting from = 2 on the objective in figure 3. Update (19) fails and is not shown. The proposed method reaches machine precision in 3 iterations. Starting from = 40, the proposed method converges equally fast, while the updates (10) and (20) fail. 5

4 Cauchy approimation In certain maimum-likelihood and robust estimation problems, we encounter the objective f() = n log(1+a i ( b i ) 2 ) (28) i=1 For consistency with previous sections, it is written as a maimization problem. Following the eample of the last section, consider a local approimation which has the form of one term: Matching f and its derivatives at 0 requires g() = k log(1+a( b) 2 ) (29) g () = 2a( b) 1+a( b) 2 (30) g () = 2a 1 a( b)2 (1+a( b) 2 ) 2 (31) b = 0 f ( 0 ) f ( 0 ) f ( 0 ) 2 (32) a = (f ( 0 ) f ( 0 ) 2 ) 2 f ( 0 ) 2 2f ( 0 ) Figure 5 compares this approimation with a quadratic. It doesn t fit particularly well, but it is robust. The approimation has a maimum when a > 0, i.e. 2f ( 0 ) < f ( 0 ) 2, which is a weaker constraint than f ( 0 ) < 0 required by Newton s method. When the constraint is satisfied, the maimum of g() is b, so the update rule is new = (33) f () f () f () 2 (34) This can be used as a general replacement for Newton s method, when maimizing any function. For the function in figure 5, the convergence rate is comparable to Newton when started at a point where Newton doesn t fail (points near the maimum). 6

f() 3 4 5 6 7 11 12 0 1 2 3 f() 3 4 5 6 7 11 12 0 1 2 3 Figure 5: Approimations to the objective (28) at = 2 (left) and = 0.6 (right). Here n = 3 and a = [10,20,30], b = [2,1.5,1]. 7

5 Multivariate eample The approach readily generalizes to multivariate functions. For eample, suppose (28) was a function of a vector : f() = n log(1+( b i ) T A i ( b i )) (35) i=1 Instead of using a univariate Cauchy-type approimation, we can use a multivariate Cauchy: The gradient and Hessian of g are: g() = k log(1+( b) T A( b)) (36) g() = g() = 2A( b) 1+( b) T A( b) (37) 2A 1+( b) T A( b) + 4A( b)( b)t A (1+( b) T A( b)) 2 (38) We set A and b by matching the gradient and Hessian of f at 0. This gives: H( 0 ) = f( 0 ) (39) b = 0 ( H( 0 ) f( 0 ) f( 0 ) T) 1 f(0 ) (40) s = f( 0 ) T H( 0 ) 1 f( 0 ) (41) A = (H( 0 ) f( 0 ) f( 0 ) T ) s 1 2 s (42) The approimation has a maimum when A is positive definite, which is a weaker constraint than H( 0 ) being negative definite as required by Newton s method. When the constraint is satisfied, the maimum of g() is b, so the update rule is new = ( H( 0 ) f( 0 ) f( 0 ) T) 1 f(0 ) (43) = H( 0 ) 1 f( 0 ) 1 f( 0 ) T H( 0 ) 1 f( 0 ) This can be used as a general replacement for Newton s method, when maimizing any multivariate function. From (44) we see that it moves in the same direction that Newton would, but with a different stepsize. This stepsize adjustment allows it to work even when Newton would fail. (44) 8