Lecture 17: Numerical Optimization (36-350, 22 October 2014)


Agenda
- Basics of optimization
- Gradient descent
- Newton's method
- Curve-fitting
- R: optim, nls

Reading: Recipes 13.1 and 13.2 in The R Cookbook
Optional reading: 1.1, 2.1 and 2.2 in Red Plenty

Examples of Optimization Problems
- Minimize the mean-squared error of a regression surface (Gauss, c. 1800)
- Maximize the likelihood of a distribution (Fisher, c. 1918)
- Maximize output of plywood from given supplies and factories (Kantorovich, 1939)
- Maximize output of tanks from given supplies and factories; minimize the number of bombing runs needed to destroy a factory (c. 1939-1945)
- Maximize return of a portfolio for given volatility (Markowitz, 1950s)
- Minimize cost of an airline flight schedule (Kantorovich...)
- Maximize reproductive fitness of an organism (Maynard Smith)

Optimization Problems
Given an objective function f : D → R, find

  θ* = argmin_θ f(θ)

Basics:
- Maximizing f is minimizing -f:

  argmax_θ f(θ) = argmin_θ -f(θ)

- If h is strictly increasing (e.g., log), then

  argmin_θ f(θ) = argmin_θ h(f(θ))
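To see the maximizing-is-minimizing identity in action, here is a one-line check in R with a toy function (illustrative, not part of the lecture's examples):

f <- function(x) { -(x - 2)^2 + 5 }  # toy function with its maximum at x = 2
# Maximize f directly:
optimize(f, interval = c(-10, 10), maximum = TRUE)$maximum
# Same answer by minimizing -f:
optimize(function(x) { -f(x) }, interval = c(-10, 10))$minimum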

Considerations
- Approximation: how close can we get to θ*, and/or to f(θ*)?
- Time complexity: how many computer steps does that take?
- Both vary with the precision of the approximation, the niceness of f, the size of D, the size of the data, the method...
- Most optimization algorithms use successive approximation, so distinguish the number of iterations from the cost of each iteration

You remember calculus, right?
Suppose x is one-dimensional and f is smooth. If x* is an interior minimum / maximum / extremum point,

  df/dx |_{x = x*} = 0

If x* is a minimum,

  d^2 f/dx^2 |_{x = x*} > 0

You remember calculus, right?
This all carries over to multiple dimensions. At an interior extremum,

  ∇f(θ*) = 0

At an interior minimum,

  ∇^2 f(θ*) ≥ 0

meaning that for any vector v,

  v^T ∇^2 f(θ*) v ≥ 0

∇^2 f is the Hessian, H. (θ* might just be a local minimum.)

Gradients and Changes to f

  f'(x_0) = df/dx |_{x = x_0} = lim_{x → x_0} [f(x) - f(x_0)] / (x - x_0)

so

  f(x) ≈ f(x_0) + (x - x_0) f'(x_0)

Locally, the function looks linear; to minimize a linear function, move down the slope.
Multivariate version:

  f(θ) ≈ f(θ_0) + (θ - θ_0) · ∇f(θ_0)

∇f(θ_0) points in the direction of fastest ascent at θ_0.
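These optimality conditions are easy to check numerically. A small sketch using the numDeriv package (which this lecture loads later anyway); the toy function and its known minimum are illustrative assumptions:

library(numDeriv)
f <- function(theta) { (theta[1] - 1)^2 + 3*(theta[2] + 2)^2 }  # toy; minimum at (1, -2)
theta.star <- c(1, -2)
grad(func = f, x = theta.star)     # ~ c(0, 0): the gradient vanishes at the minimum
hessian(func = f, x = theta.star)  # ~ diag(c(2, 6)): positive definite, so a minimum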

Gradient Descent
1. Start with an initial guess for θ and a step-size η
2. While ((not too tired) and (making adequate progress)):
   - Find the gradient ∇f(θ)
   - Set θ ← θ - η ∇f(θ)
3. Return the final θ as the approximate θ*

Variations: adaptively adjust η to make sure of improvement, or search along the gradient direction for the minimum. (A minimal implementation appears after the scaling slide below.)

Pros and Cons of Gradient Descent
Pros:
- Moves in the direction of greatest immediate improvement
- If η is small enough, gets to a local minimum eventually, and then stops
Cons:
- "Small enough" η can be really, really small
- Slowness or zig-zagging if the components of ∇f are of very different sizes

How much work do we need? Scaling
Big-O notation:

  h(x) = O(g(x)) means lim_{x → ∞} h(x)/g(x) = c for some c ≠ 0

e.g., x^2 - 5000x + 123456778 = O(x^2)
e.g., e^x / (1 + e^x) = O(1)
Useful for looking at the overall scaling, hiding the details. (The same notation is also used when the limit is x → 0.)
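As promised two slides back, a minimal R sketch of gradient descent with a fixed step-size; the toy objective, step-size, and stopping rule are illustrative choices, not the lecture's example:

gradient.descent <- function(grad.f, theta0, eta = 0.01, max.iter = 1e4, tol = 1e-8) {
  theta <- theta0
  for (i in 1:max.iter) {
    g <- grad.f(theta)
    if (sqrt(sum(g^2)) < tol) { break }  # stop once the gradient is essentially zero
    theta <- theta - eta * g             # take a step of size eta down the slope
  }
  theta
}
grad.f <- function(theta) { c(2*(theta[1] - 1), 6*(theta[2] + 2)) }  # gradient of a toy quadratic
gradient.descent(grad.f, theta0 = c(0, 0))  # converges to ~ c(1, -2)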

How Much Work is Gradient Descent?
Pros:
- For nice f, gradient descent gets f(θ) ≤ f(θ*) + ε in O(ε^{-2}) iterations; for very nice f, only O(log(1/ε)) iterations
- To get ∇f(θ) we take p derivatives, so each iteration costs O(p)
Con:
- Taking derivatives can slow down as the data grows: each iteration might really be O(np)

Taylor Series
What if we do a quadratic approximation to f?

  f(x) ≈ f(x_0) + (x - x_0) f'(x_0) + (1/2)(x - x_0)^2 f''(x_0)

(a special case of the general idea of Taylor approximation)
This simplifies if x_0 is a minimum, since then f'(x_0) = 0:

  f(x) ≈ f(x_0) + (1/2)(x - x_0)^2 f''(x_0)

Near a minimum, smooth functions look like parabolas. This carries over to the multivariate case:

  f(θ) ≈ f(θ_0) + (θ - θ_0) · ∇f(θ_0) + (1/2)(θ - θ_0)^T H(θ_0)(θ - θ_0)

Minimizing a Quadratic
If we know

  f(x) = ax^2 + bx + c

we can minimize exactly:

  f'(x) = 2ax + b = 0  ⇒  x* = -b/(2a)

If instead

  f(x) = (1/2) a (x - x_0)^2 + b (x - x_0) + c

then

  x* = x_0 - a^{-1} b
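A quick numerical sanity check of the closed form x* = -b/(2a), with arbitrary illustrative coefficients:

a <- 3; b <- -12; c0 <- 7
f <- function(x) { a*x^2 + b*x + c0 }
-b/(2*a)                                       # closed-form minimizer: 2
optimize(f, interval = c(-100, 100))$minimum   # numerical minimizer: ~ 2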

Newton's Method
Taylor-expand for the value at the minimum θ*:

  f(θ*) ≈ f(θ) + (θ* - θ) · ∇f(θ) + (1/2)(θ* - θ)^T H(θ)(θ* - θ)

Take the gradient, set it to zero, and solve for θ*:

  0 = ∇f(θ) + H(θ)(θ* - θ)
  θ* = θ - H(θ)^{-1} ∇f(θ)

This works exactly if f is quadratic and H^{-1} exists, etc. If f isn't quadratic, keep pretending it is until we get close to θ*, when it will be nearly true.

Newton's Method: The Algorithm
1. Start with a guess for θ
2. While ((not too tired) and (making adequate progress)):
   - Find the gradient ∇f(θ) and the Hessian H(θ)
   - Set θ ← θ - H(θ)^{-1} ∇f(θ)
3. Return the final θ as the approximation to θ*

Like gradient descent, but with the inverse Hessian giving the step-size: this is about how far you can go with that gradient. (A minimal implementation appears after the next two slides.)

Advantages and Disadvantages of Newton's Method
Pros:
- Step-sizes are chosen adaptively through the 2nd derivatives, so it is much harder to get zig-zagging, over-shooting, etc.
- Also guaranteed to need only O(ε^{-2}) steps to get within ε of the optimum, and only O(log log(1/ε)) for very nice functions
- Typically many fewer iterations than gradient descent

Advantages and Disadvantages of Newton's Method
Cons:
- Hopeless if H doesn't exist or isn't invertible
- Need to take O(p^2) second derivatives, plus p first derivatives
- Need to solve H(θ_old) θ_new = H(θ_old) θ_old - ∇f(θ_old) for θ_new: inverting H is O(p^3), but cleverness gives O(p^2) for solving for θ_new
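For concreteness, a minimal R sketch of the Newton iteration above, using numDeriv for the derivatives and solve() to solve the linear system rather than explicitly inverting H; the toy objective, iteration budget, and tolerance are illustrative assumptions:

library(numDeriv)
newton.method <- function(f, theta0, max.iter = 100, tol = 1e-8) {
  theta <- theta0
  for (i in 1:max.iter) {
    g <- grad(func = f, x = theta)
    if (sqrt(sum(g^2)) < tol) { break }  # stop once the gradient is essentially zero
    H <- hessian(func = f, x = theta)
    theta <- theta - solve(H, g)         # Newton step: solve H %*% step = g
  }
  theta
}
f <- function(theta) { (theta[1] - 1)^4 + 3*(theta[2] + 2)^2 }  # toy; minimum at (1, -2)
newton.method(f, theta0 = c(0, 0))  # converges to ~ c(1, -2)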

Getting Around the Hessian
We want to use the Hessian to improve convergence, but we don't want to have to keep computing the Hessian at each step. Approaches:
- Use knowledge of the system to get some approximation to the Hessian, and use that instead of taking derivatives ("Fisher scoring")
- Use only the diagonal entries (the p unmixed 2nd derivatives)
- Use H(θ) at the initial guess, and hope that H changes very slowly with θ
- Re-compute H(θ) only every k steps, k > 1
- Make fast, approximate updates to the Hessian at each step (BFGS)

Other Methods
Lots! See the bonus slides at the end for Nelder-Mead, a.k.a. the simplex method, which doesn't need any derivatives, and for the meta-method coordinate descent.

Curve-Fitting by Optimizing
We have data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
We also have possible curves r(x; θ), e.g.,
- r(x) = x·θ
- r(x) = θ_1 x^{θ_2}
- r(x) = Σ_{j=1}^q θ_j b_j(x) for fixed basis functions b_j

Curve-Fitting by Optimizing
Least-squares curve fitting:

  θ̂ = argmin_θ (1/n) Σ_{i=1}^n (y_i - r(x_i; θ))^2

"Robust" curve fitting:

  θ̂ = argmin_θ (1/n) Σ_{i=1}^n ψ(y_i - r(x_i; θ))
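For concreteness, a sketch of robust curve fitting for the power-law curve r(x; θ) = θ_1 x^{θ_2}; the slide leaves ψ generic, so taking ψ to be the Huber loss, along with the cutoff k and the simulated data, is an illustrative assumption:

# Huber loss: quadratic for small errors, linear for large ones (downweights outliers)
huber <- function(u, k = 1.345) { ifelse(abs(u) <= k, u^2/2, k*abs(u) - k^2/2) }
set.seed(17)
x <- runif(100, 1, 10)
y <- 2 * x^0.5 + rt(100, df = 2)  # power-law signal plus heavy-tailed noise
robust.obj <- function(theta) { mean(huber(y - theta[1] * x^theta[2])) }
optim(c(1, 1), robust.obj)$par    # roughly c(2, 0.5), despite the outliers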

Optimization in R: optim()

optim(par, fn, gr, method, control, hessian)

- fn: function to be minimized; mandatory
- par: initial parameter guess; mandatory
- gr: gradient function; only needed for some methods
- method: defaults to a gradient-free method ("Nelder-Mead"); could be "BFGS" (Newton-ish)
- control: optional list of control settings (maximum iterations, scaling, tolerance for convergence, etc.)
- hessian: should the final Hessian be returned? Default FALSE

The return value contains the location ($par) and the value ($value) of the optimum, diagnostics, and possibly $hessian.

Optimization in R: optim()

gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
library(numDeriv)
mse <- function(theta) { mean((gmp$pcgmp - theta[1]*gmp$pop^theta[2])^2) }
grad.mse <- function(theta) { grad(func=mse, x=theta) }
theta0 <- c(5000, 0.15)
fit1 <- optim(theta0, mse, grad.mse, method="BFGS", hessian=TRUE)
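For contrast, the same objective can be handed to optim() with no gradient function and no method argument at all; optim() then falls back on its derivative-free Nelder-Mead default, which should land close to the BFGS answer shown next:

fit0 <- optim(theta0, mse)  # no gr, no method: defaults to Nelder-Mead
fit0$par                    # should be near fit1$par below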

fit1: Newton-ish BFGS method

fit1[1:3]

$par
[1] 6493.2564 0.1277

$value
[1] 61853983

$counts
function gradient
      63       11

fit1: Newton-ish BFGS method

fit1[4:6]

$convergence
[1] 0

$message
NULL

$hessian
          [,1]      [,2]
[1,]      52.5 4.422e+06
[2,] 4422070.4 3.757e+11

nls
- optim is a general-purpose optimizer, and so is nlm; try them both if one doesn't work
- nls is for nonlinear least squares

nls

nls(formula, data, start, control, [[many other options]])

- formula: mathematical expression with the response variable, predictor variable(s), and unknown parameter(s)
- data: data frame with variable names matching the formula
- start: guess at the parameters (optional)
- control: like with optim (optional)

Returns an nls object, with fitted values, prediction methods, etc. The default optimization is a version of Newton's method.

fit2: Fitting the Same Model with nls()

fit2 <- nls(pcgmp ~ y0*pop^a, data=gmp, start=list(y0=5000, a=0.1))
summary(fit2)

Formula: pcgmp ~ y0 * pop^a

Parameters:
   Estimate Std. Error t value Pr(>|t|)
y0 6.49e+03   8.57e+02    7.58  2.9e-13 ***
a  1.28e-01   1.01e-02   12.61  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7890 on 364 degrees of freedom

Number of iterations to convergence: 5
Achieved convergence tolerance: 1.75e-07

fit2: Fitting the Same Model with nls()

plot(pcgmp ~ pop, data=gmp)
pop.order <- order(gmp$pop)
lines(gmp$pop[pop.order], fitted(fit2)[pop.order])
curve(fit1$par[1]*x^fit1$par[2], add=TRUE, lty="dashed", col="blue")

[Figure: scatterplot of pcgmp against pop, with the nls fit drawn as a solid line and the optim/BFGS fit as a dashed blue curve; the two fits nearly coincide.]
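Because fit2 is an nls object, the usual generic functions work on it; for instance (the population value below is purely illustrative):

coef(fit2)                                       # estimated y0 and a
predict(fit2, newdata = data.frame(pop = 1e6))   # predicted pcgmp for a metro area of one million people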

Summary
1. Trade-offs: complexity of each iteration vs. number of iterations vs. precision of approximation
   - Gradient descent: less complex iterations, more guarantees, less adaptive
   - Newton: more complex iterations, but few of them for good functions; more adaptive, less robust
2. Start with pre-built code like optim or nls; implement your own as needed

Nelder-Mead, a.k.a. the Simplex Method
Try to "cage" θ* with a simplex of p + 1 points
Order the trial points so that f(θ_1) ≤ f(θ_2) ≤ ... ≤ f(θ_{p+1})
θ_{p+1} is the worst guess; try to improve it
Center of the not-worst points: θ_0 = (1/p) Σ_{i=1}^p θ_i

Nelder-Mead, a.k.a. the Simplex Method
Try to improve the worst guess θ_{p+1}:
1. Reflection: try θ_0 - (θ_{p+1} - θ_0), across the center from θ_{p+1}; if it's better than θ_p but not better than θ_1, replace the old θ_{p+1} with it
   Expansion: if the reflected point is the new best, try θ_0 - 2(θ_{p+1} - θ_0); replace the old θ_{p+1} with the better of the reflected and the expanded point
2. Contraction: if the reflected point is worse than θ_p, try θ_0 + (θ_{p+1} - θ_0)/2; if the contracted value is better, replace θ_{p+1} with it
3. Reduction: if all else fails, set θ_i ← (θ_1 + θ_i)/2 for every point
4. Go back to (1) until we stop improving or run out of time

Making Sense of Nelder-Mead
The moves:
- Reflection: try the opposite of the worst point
- Expansion: if that really helps, try it some more
- Contraction: see if we overshot when trying the opposite
- Reduction: if all else fails, try making each point more like the best point

Making Sense of Nelder-Mead
Pros:
- Each iteration needs at most 4 values of f, plus sorting (and sorting is at most O(p log p), usually much better)
- No derivatives used; it can even work for discontinuous f
Con:
- Can need many more iterations than gradient methods
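Since Nelder-Mead is what optim() uses by default, we can watch it handle a function where gradient-based methods would struggle; the non-smooth toy objective is an illustrative assumption:

f.nonsmooth <- function(theta) { abs(theta[1] - 3) + abs(theta[2] + 1) }  # kinked at the optimum
optim(c(0, 0), f.nonsmooth)$par  # default Nelder-Mead: ~ c(3, -1), no derivatives needed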

Coordinate Descent
Gradient descent, Newton's method, simplex, etc., adjust all the coordinates of θ at once; this gets harder as the number of dimensions p grows
Coordinate descent: never do more than one-dimensional optimization
- Start with an initial guess θ
- While ((not too tired) and (making adequate progress)):
  - For i in (1 : p):
    - Do 1D optimization over the i-th coordinate of θ, holding the other coordinates fixed
    - Update the i-th coordinate to this optimal value
- Return the final value of θ

Coordinate Descent
Cons:
- Needs a good 1D optimizer
- Can bog down for very tricky functions, especially with lots of interactions among variables
Pros:
- Can be extremely fast and simple
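To make the loop concrete, a minimal R sketch of coordinate descent, using optimize() as the 1D optimizer; the bracketing interval, sweep count, and toy objective are illustrative assumptions:

coordinate.descent <- function(f, theta0, n.sweeps = 50, lo = -10, hi = 10) {
  theta <- theta0
  for (sweep in 1:n.sweeps) {
    for (i in seq_along(theta)) {
      # 1D optimization over coordinate i, holding the others fixed
      f.1d <- function(z) { th <- theta; th[i] <- z; f(th) }
      theta[i] <- optimize(f.1d, interval = c(lo, hi))$minimum
    }
  }
  theta
}
f <- function(theta) { (theta[1] - 1)^2 + 3*(theta[2] + 2)^2 + theta[1]*theta[2] }  # toy, with an interaction term
coordinate.descent(f, theta0 = c(0, 0))  # converges to ~ c(24/11, -26/11)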