Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods. Jorge Nocedal

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods
Jorge Nocedal, Northwestern University
Huatulco, Jan 2018

Collaborators: Albert Berahas (Northwestern University), Richard Byrd (University of Colorado)

Discussion
1. The BFGS method continues to surprise
2. One of the best methods for nonsmooth optimization (Lewis-Overton)
3. Leading approach for (deterministic) DFO: derivative-free optimization
4. This talk: a very good method for the minimization of noisy functions

We had not fully recognized the power and generality of quasi-Newton updating until we tried to find alternatives!

Subject of this talk:
1. Black-box noisy functions
2. No known structure
3. Not the finite-sum loss functions arising in machine learning, where cheap approximate gradients are available

Outline

Problem 1:  min f(x)
1. f contains no noise
2. Scalability, parallelism
3. Robustness
f smooth but derivatives not available: DFO

Problem 2:  min f(x; ξ),   min f(x) = φ(x) + ε(x),   f(·; ξ) smooth,   f(x) = φ(x)(1 + ε(x))

Proposed method, built upon classical quasi-Newton updating using finite-difference gradients:
- Estimate a good finite-difference interval h
- Use noise estimation techniques (Moré-Wild)
- Deal with noise adaptively
- Can solve problems with thousands of variables
- Novel convergence results to a neighborhood of the solution (Richard Byrd)

DFO: derivative-free deterministic optimization (no noise)

min f(x),   f is smooth

Direct search / pattern search methods: not scalable.
Much better idea: interpolation-based models with trust regions (Powell, Conn, Scheinberg, ...)

min m(x) = x^T B x + g^T x   s.t.  ||x||_2 <= Δ

1. Need (n+1)(n+2)/2 function values to define a quadratic model by pure interpolation (see the sketch below)
2. Can use O(n) points and assume a minimum-norm change in the Hessian
3. Arithmetic costs are high: n^4 (limits scalability)
4. Placement of interpolation points is important
5. Correcting the model may require many function evaluations
6. Parallelizable?
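As an illustration of point 1 (not code from the talk), here is a minimal NumPy sketch that fits a fully determined quadratic model by pure interpolation. The function name and the m(x) = c + g^T x + 0.5 x^T B x parameterization are my own choices; production codes such as Powell's manage the interpolation set and its geometry far more carefully.

import numpy as np

def quadratic_model_coeffs(points, fvals):
    # Fit m(x) = c + g^T x + 0.5 x^T B x (B symmetric) by pure interpolation.
    # points: (p, n) array with p = (n+1)(n+2)/2 sample points; fvals: (p,) values.
    p, n = points.shape
    assert p == (n + 1) * (n + 2) // 2, "need (n+1)(n+2)/2 interpolation points"
    rows = []
    for x in points:
        # Basis: 1, the x_i, and the distinct quadratic monomials (i <= j).
        quad = [x[i] * x[j] * (0.5 if i == j else 1.0)
                for i in range(n) for j in range(i, n)]
        rows.append(np.concatenate(([1.0], x, quad)))
    coeffs = np.linalg.solve(np.array(rows), fvals)   # requires well-placed points
    c, g = coeffs[0], coeffs[1:n + 1]
    B = np.zeros((n, n))
    k = n + 1
    for i in range(n):
        for j in range(i, n):
            B[i, j] = B[j, i] = coeffs[k]
            k += 1
    return c, g, B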

Why not simply BFGS with finite-difference gradients?

x_{k+1} = x_k - α_k H_k ∇f(x_k),      ∂f(x)/∂x_i ≈ [f(x + h e_i) - f(x)] / h

- Invest significant effort in the estimation of the gradient; delegate the construction of the model to BFGS
- Interpolating gradients
- Modest linear algebra costs: O(n) for L-BFGS
- Placement of sample points on an orthogonal set
- BFGS is an overwriting process: no inconsistencies or ill conditioning with an Armijo-Wolfe line search
- Gradient evaluation parallelizes easily

Why now?
- Perception that n function evaluations per step is too high
- Derivative-free literature rarely compares with FD quasi-Newton
- Already used extensively: fminunc (MATLAB), black-box competition, and KNITRO
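For concreteness, a minimal NumPy sketch (not the authors' implementation) of the finite-difference gradient that stands in for ∇f(x_k); the interval h would come from the noise-aware formulas on the later slides.

import numpy as np

def fd_gradient(f, x, h, central=False):
    # Forward- or central-difference gradient estimate of f at x.
    # Costs n (forward) or 2n (central) function evaluations in addition to f(x);
    # the coordinate directions e_i are independent, so this parallelizes easily.
    n = x.size
    g = np.zeros(n)
    fx = f(x)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        else:
            g[i] = (f(x + e) - fx) / h
    return g

Central differences cost 2n evaluations instead of n but reduce the truncation error from O(h) to O(h^2); either way the evaluations are independent and parallelize trivially.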

Some numerical results

Compare the model-based trust-region code DFOtr by Conn, Scheinberg, and Vicente vs. FD-L-BFGS with forward and central differences. Plot function decrease vs. total number of function evaluations.

Comparison: function decrease vs. total number of function evaluations

[Figure: four log-scale plots of F(x) - F* vs. number of function evaluations on the smooth deterministic problems s271 (quadratic), s334, s293, and s289, comparing DFOtr, L-BFGS FD (forward differences), and L-BFGS FD (central differences).]

Conclusion: DFO without noise

Finite-difference BFGS is a real competitor to DFO methods based on function evaluations. It can solve problems with thousands of variables, but really nothing new.

Optimization of Noisy Functions

min f(x; ξ),   min f(x) = φ(x) + ε(x),   where f(·; ξ) is smooth,   f(x) = φ(x)(1 + ε(x))

Focus on additive noise.

Finite-difference BFGS should not work!
1. Differences of noisy functions are dangerous
2. Just one bad update once in a while: disastrous
3. Not done, to the best of our knowledge

[Figure: the smooth function and the noisy function f(x) = sin(x) + cos(x) + 10^{-3} U(0, 2√3) plotted on x in [-0.03, 0.03].]

Finite Differences of Noisy Functions,   f(x) = φ(x) + ε(x)

h too small:  true derivative at x = 2.5: 1.6;  finite-difference estimate at x = 2.5: 1.33
h too big:    true derivative at x = 3.5: 0.5;  finite-difference estimate at x = 3.5: 0.5

[Figure: two panels illustrating the finite-difference interval h on the noisy function f(x) = φ(x) + ε(x), one with h too small and one with h too big.]
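A tiny experiment (illustrative only, not the slide's exact figure) showing the same tradeoff on the talk's noisy test function; the numbers will not match the panels above, but the blow-up for small h is the point.

import numpy as np

rng = np.random.default_rng(0)

def noisy_f(x):
    # phi(x) = sin(x) + cos(x) plus uniform noise with standard deviation about 1e-3,
    # the noisy test function used in the talk.
    return np.sin(x) + np.cos(x) + 1e-3 * rng.uniform(0.0, 2.0 * np.sqrt(3.0))

x0 = 2.5
true_deriv = np.cos(x0) - np.sin(x0)
for h in (1e-8, 1e-5, 1e-3, 1e-1):
    fd = (noisy_f(x0 + h) - noisy_f(x0)) / h
    print(f"h = {h:7.1e}   forward-difference estimate = {fd:10.3f}   true = {true_deriv:7.3f}")

For tiny h the noise term of size about 10^{-3} is divided by h and swamps the estimate; for this slowly varying φ the truncation error at large h is comparatively mild, so the damage from choosing h too small dominates.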

A Practical Algorithm

min f(x) = φ(x) + ε(x),     f(x) = φ(x)(1 + ε(x))

Outline of the adaptive finite-difference BFGS method:
1. Estimate the noise ε(x) at every iteration (Moré-Wild)
2. Estimate h
3. Compute the finite-difference gradient
4. Perform a line search (?!)
5. Corrective procedure when the line search fails (need to modify the line search): re-estimate the noise level

Will require very few extra f evaluations per iteration, even none.

Noise estimation (Moré-Wild, 2011)

min f(x) = φ(x) + ε(x)
Noise level:  σ = [Var(ε(x))]^{1/2}        Noise estimate:  ε_f

At x, choose a random direction v and evaluate f at q + 1 equally spaced points x + iβv, i = 0, ..., q.

Compute function differences:
Δ^0 f(x) = f(x),     Δ^{j+1} f(x) = Δ^j[Δf(x)] = Δ^j[f(x + βv)] - Δ^j[f(x)]

Compute the finite-difference table:
T_{i,j} = Δ^j f(x + iβv),    1 <= j <= q,   0 <= i <= q - j

σ_j^2 = (γ_j / (q + 1 - j)) Σ_{i=0}^{q-j} T_{i,j}^2,      γ_j = (j!)^2 / (2j)!
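A sketch of this difference-table computation, assuming NumPy; the function name and arguments are illustrative, and a practical implementation (Moré-Wild's ECNoise) also applies sign and consistency tests before selecting a single σ_j as ε_f.

import numpy as np
from math import factorial

def estimate_noise(f, x, v, beta=1e-2, q=6):
    # Evaluate f at the q+1 equally spaced points x + i*beta*v, build the
    # difference table T[i, j] = Delta^j f(x + i*beta*v), and return the
    # estimates sigma_j = sqrt( gamma_j / (q+1-j) * sum_i T[i, j]^2 ), j = 1..q.
    fvals = np.array([f(x + i * beta * v) for i in range(q + 1)])
    T = np.zeros((q + 1, q + 1))
    T[:, 0] = fvals
    for j in range(1, q + 1):
        for i in range(q + 1 - j):
            T[i, j] = T[i + 1, j - 1] - T[i, j - 1]
    sigma = np.zeros(q + 1)
    for j in range(1, q + 1):
        gamma = factorial(j) ** 2 / factorial(2 * j)
        sigma[j] = np.sqrt(gamma * np.mean(T[:q + 1 - j, j] ** 2))
    return sigma[1:]

On the example of the next slide, a call such as estimate_noise(f, 0.0, 1.0, beta=1e-2, q=6) would be expected to return higher-order σ_j near the true noise level of 10^{-3}.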

Noise estimation (Moré-Wild, 2011): example

min f(x) = sin(x) + cos(x) + 10^{-3} U(0, 2√3),     q = 6,    β = 10^{-2}

     x        f        Δf        Δ^2f      Δ^3f      Δ^4f      Δ^5f      Δ^6f
  -0.03     1.003    7.54e-3   2.15e-3   1.87e-4   5.87e-3   1.46e-2   2.49e-2
  -0.02     1.011    9.69e-3   2.33e-3   5.68e-3   8.73e-3   1.03e-3
  -0.01     1.021    1.20e-2   3.35e-3   3.05e-3   1.61e-3
   0        1.033    8.67e-3   2.96e-3   1.44e-3
   0.01     1.041    8.38e-3   1.14e-3
   0.02     1.050    9.52e-3
   0.03     1.059

  σ_k                6.65e-3   8.69e-4   7.39e-4   7.34e-4   7.97e-4   8.20e-4

High-order differences of a smooth function tend to zero rapidly, while differences in noise are bounded away from zero. Changes in sign are useful. The procedure is scale invariant!

Finite-difference intervals

Once the noise estimate ε_f has been chosen:

Forward difference:  h = 8^{1/4} (ε_f / μ_2)^{1/2},      μ_2 = max_{x in I} |f''(x)|
Central difference:  h = 3^{1/3} (ε_f / μ_3)^{1/3},      μ_3 = max_{x in I} |f'''(x)|

Bad estimates of the second and third derivatives can cause problems (not often).
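These two formulas translate directly into code; a small sketch, where μ_2 and μ_3 (bounds on |f''| and |f'''| near the current point) must be supplied by the caller, e.g. from a second-difference estimate.

def forward_difference_interval(eps_f, mu2):
    # h = 8^(1/4) * (eps_f / mu2)^(1/2), with mu2 a bound on |f''| near x.
    return 8.0 ** 0.25 * (eps_f / mu2) ** 0.5

def central_difference_interval(eps_f, mu3):
    # h = 3^(1/3) * (eps_f / mu3)^(1/3), with mu3 a bound on |f'''| near x.
    return 3.0 ** (1.0 / 3.0) * (eps_f / mu3) ** (1.0 / 3.0)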

Adaptive Finite-Difference L-BFGS Method

Estimate noise ε_f
Compute h for forward or central differences      [4-8 function evaluations]
Compute g_k
While convergence test not satisfied:
    d_k = -H_k g_k                                [L-BFGS procedure]
    (x+, f+, flag) = LineSearch(x_k, f_k, g_k, d_k, f_s)
    If flag = 1                                   [line search failed]
        (x+, f+, h) = Recovery(x_k, f_k, g_k, d_k, max_iter)
    End if
    x_{k+1} = x+,   f_{k+1} = f+
    Compute g_{k+1}                               [finite differences using h]
    s_k = x_{k+1} - x_k,   y_k = g_{k+1} - g_k
    Discard (s_k, y_k) if s_k^T y_k <= 0
    k = k + 1
End while
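A rough Python skeleton of this loop, for illustration only: it reuses the fd_gradient sketch from above, stands in a dense BFGS inverse-Hessian update for the talk's L-BFGS procedure, and assumes line_search, recovery, and estimate_noise_and_h callbacks implementing the pieces on the surrounding slides.

import numpy as np

def adaptive_fd_bfgs(f, x0, line_search, recovery, estimate_noise_and_h,
                     max_iter=100, tol=1e-6):
    # Skeleton of the loop above. Assumed callback signatures:
    #   estimate_noise_and_h(x) -> (eps_f, h)
    #   line_search(f, x, g, d, eps_f) -> (x_new, f_new, ok)
    #   recovery(f, x, g, d, h) -> (x_new, f_new, h)
    # fd_gradient is the finite-difference gradient sketched earlier.
    x = np.asarray(x0, dtype=float).copy()
    eps_f, h = estimate_noise_and_h(x)
    n = x.size
    H = np.eye(n)                          # inverse Hessian approximation
    g = fd_gradient(f, x, h)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g
        x_new, f_new, ok = line_search(f, x, g, d, eps_f)
        if not ok:                         # line search failed: recovery procedure
            x_new, f_new, h = recovery(f, x, g, d, h)
        g_new = fd_gradient(f, x_new, h)
        s, y = x_new - x, g_new - g
        if s @ y > 0:                      # discard the pair if s^T y <= 0
            rho = 1.0 / (s @ y)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x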

Line Search

The BFGS method requires an Armijo-Wolfe line search:
f(x_k + αd) <= f(x_k) + α c_1 ∇f(x_k)^T d          [Armijo]
∇f(x_k + αd)^T d >= c_2 ∇f(x_k)^T d                [Wolfe]

Deterministic case: always possible if f is bounded below. Can be problematic in the noisy case.

Strategy: try to satisfy both, but limit the number of attempts. If the first trial point (unit steplength) is not acceptable, relax:
f(x_k + αd) <= f(x_k) + α c_1 ∇f(x_k)^T d + 2ε_f   [relaxed Armijo]

Three outcomes: a) both satisfied; b) only Armijo; c) none.
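The three tests translate directly into code; a minimal sketch, with conventional default values c_1 = 1e-4 and c_2 = 0.9 that are my assumption rather than values given on the slide.

def relaxed_armijo(f_new, f_old, alpha, g_dot_d, eps_f, c1=1e-4):
    # Sufficient decrease relaxed by twice the noise level:
    # f(x + alpha d) <= f(x) + alpha*c1*grad f(x)^T d + 2*eps_f
    return f_new <= f_old + alpha * c1 * g_dot_d + 2.0 * eps_f

def wolfe_curvature(g_new_dot_d, g_dot_d, c2=0.9):
    # Curvature condition: grad f(x + alpha d)^T d >= c2 * grad f(x)^T d
    return g_new_dot_d >= c2 * g_dot_d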

Corrective Procedure

Compute a new noise estimate ε_f along the search direction d_k, and the corresponding interval ĥ.
If ĥ differs significantly from h: use the new estimate, h ← ĥ, and return without changing x_k.
Else: compute a new iterate (various options): small perturbation; stencil point.

[Figure: finite-difference stencil (Kelley) around a stencil point x_s.]

Some quotes

"I believe that eventually the better methods will not use derivative approximations." [Powell, 1972]

"f is somewhat noisy, which renders most methods based on finite differences of little or no use [X, X, X]." [Rios & Sahinidis, 2013]

END