Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods: Experiments and Theory. Richard Byrd

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods: Experiments and Theory
Richard Byrd, University of Colorado, Boulder
Albert Berahas, Northwestern University
Jorge Nocedal, Northwestern University
Huatulco, Jan 2018

My thanks to the organizers. Greetings and thanks to Don Goldfarb.

Numerical Results

We compare our Finite Difference L-BFGS Method (FD-LM) to the model-based interpolation trust-region method (MB) of Conn, Scheinberg, and Vicente. Their method, as used here, is:
- a simple implementation
- not designed for fast execution
- does not include a geometry phase

Our goal is not to determine which method wins. Rather:
1. Show that the FD-LM method is robust.
2. Show that FD-LM is not wasteful in function evaluations.

Adaptive Finite Difference L-BFGS Method

    Estimate noise ε_f
    Compute h by forward or central differences        [4-8 function evaluations]
    Compute g_k
    While convergence test not satisfied:
        d_k = -H_k g_k                                  [L-BFGS procedure]
        (x+, f+, flag) = LineSearch(x_k, f_k, g_k, d_k, f_s)
        If flag = 1                                     [line search failed]
            (x+, f+, h) = Recovery(x_k, f_k, g_k, d_k, max_iter)
        endif
        x_{k+1} = x+,  f_{k+1} = f+
        Compute g_{k+1}                                 [finite differences using h]
        s_k = x_{k+1} - x_k,  y_k = g_{k+1} - g_k
        Discard (s_k, y_k) if s_k^T y_k <= 0
        k = k + 1
    endwhile
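A minimal Python sketch of this outer loop (not the authors' implementation): the two-loop recursion is the standard L-BFGS procedure, the backtracking step uses the relaxed Armijo condition discussed later in the talk, and the noise estimation and Recovery phases are omitted. The names fd_gradient, two_loop, and fdlm are mine.

    import numpy as np

    def fd_gradient(f, x, h):
        """Forward-difference gradient with a common interval h (simplified)."""
        fx, g = f(x), np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x); e[i] = h
            g[i] = (f(x + e) - fx) / h
        return g

    def two_loop(g, S, Y):
        """Standard L-BFGS two-loop recursion returning an approximation of H_k g."""
        q, alphas = g.copy(), []
        for s, y in zip(reversed(S), reversed(Y)):
            rho = 1.0 / (y @ s)
            a = rho * (s @ q)
            q = q - a * y
            alphas.append((a, rho, s, y))
        if S:
            s, y = S[-1], Y[-1]
            q = q * ((s @ y) / (y @ y))        # scaling of the initial matrix
        for a, rho, s, y in reversed(alphas):
            b = rho * (y @ q)
            q = q + (a - b) * s
        return q

    def fdlm(f, x, h, eps_f, tau=0.5, c1=1e-4, eps_A=None, max_iter=200, m=10):
        """Sketch of the finite-difference L-BFGS loop (no Recovery phase)."""
        x = np.asarray(x, dtype=float)
        eps_A = 2.1 * eps_f if eps_A is None else eps_A   # relaxation term, > 2*C_f
        S, Y = [], []
        g = fd_gradient(f, x, h)
        for _ in range(max_iter):
            d = -two_loop(g, S, Y)
            fx, a = f(x), 1.0
            # relaxed Armijo backtracking: f(x + a d) <= f(x) + c1 a g'd + eps_A
            while f(x + a * d) > fx + c1 * a * (g @ d) + eps_A and a > 1e-8:
                a *= tau
            x_new = x + a * d
            g_new = fd_gradient(f, x_new, h)
            s, y = x_new - x, g_new - g
            if s @ y > 0:                      # discard the pair otherwise
                S.append(s); Y.append(y)
                if len(S) > m:
                    S.pop(0); Y.pop(0)
            x, g = x_new, g_new
        return x

    # illustrative use on a noisy quadratic
    x_sol = fdlm(lambda z: float(z @ z) + 1e-3 * np.random.uniform(-1, 1),
                 np.ones(5), h=1e-4, eps_f=1e-3)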

Test problems

Plotting f(x_k) - φ* vs. number of function evaluations. We show results for 4 representative problems.

Numerical Results: Stochastic Additive Noise

f(x) = φ(x) + ε(x),   ε(x) ~ U(-ξ, ξ),   ξ ∈ [10^-8, ..., 10^-1]

[Plots: problems s271 and s334 with stochastic additive noise at levels 1e-08 and 1e-02; optimality gap vs. number of function evaluations.]
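For concreteness, a noisy instance of this form can be generated as follows; the quadratic φ is only an illustrative stand-in for the actual test problems (s271, s334, etc.).

    import numpy as np

    def add_uniform_noise(phi, xi, rng=np.random.default_rng(0)):
        """Wrap a smooth objective phi with additive noise drawn from U(-xi, xi)."""
        return lambda x: phi(x) + rng.uniform(-xi, xi)

    phi = lambda x: 0.5 * float(np.dot(x, x))    # illustrative smooth objective
    f = add_uniform_noise(phi, xi=1e-2)          # noisy objective seen by the solver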

Numerical Results: Stochastic Additive Noise (continued)

f(x) = φ(x) + ε(x),   ε(x) ~ U(-ξ, ξ),   ξ ∈ [10^-8, ..., 10^-1]

[Plots: problems s293 and s289 with stochastic additive noise at levels 1e-08 and 1e-02; optimality gap vs. number of function evaluations.]

Numerical Results: Stochastic Additive Noise, Performance Profiles

[Performance profiles for accuracy level τ = 10^-5, plotted against the performance ratio.]

Numerical Results: Stochastic Multiplicative Noise, Performance Profiles

[Performance profiles for accuracy level τ = 10^-5, plotted against the performance ratio.]

Numerical Results: Hybrid Method and Recovery Mechanism

As Jorge mentioned in Part I, our algorithm has a recovery mechanism. This procedure is very important for the stable performance of the method. The principal recovery action is to re-estimate h.

HYBRID METHOD: if h is acceptable, then we switch from forward to central differences.
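A sketch of the two gradient approximations involved in the forward-to-central switch. The differencing intervals below follow the usual noise-balancing heuristics (h of order sqrt(eps_f) for forward differences and eps_f^(1/3) for central differences, with L2 and L3 standing for assumed bounds on the second and third derivatives); the precise rules used to set and re-estimate h in the method on the slides may differ.

    import numpy as np

    def forward_diff_grad(f, x, eps_f, L2=1.0):
        """Forward differences; h balances noise error ~eps_f/h vs. truncation ~L2*h/2."""
        h = 2.0 * np.sqrt(eps_f / max(L2, 1e-16))
        x = np.asarray(x, dtype=float)
        fx, g = f(x), np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x); e[i] = h
            g[i] = (f(x + e) - fx) / h
        return g, h

    def central_diff_grad(f, x, eps_f, L3=1.0):
        """Central differences; twice the cost per gradient, truncation error O(h^2)."""
        h = (3.0 * eps_f / max(L3, 1e-16)) ** (1.0 / 3.0)
        x = np.asarray(x, dtype=float)
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x); e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        return g, h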

Numerical Results: Hybrid FC Method, Stochastic Additive Noise

[Plots comparing FDLM and FDLM (HYBRID) on problems s208 and s246 under stochastic additive noise; optimality gap vs. number of function evaluations.]

Numerical Results: Hybrid FC Method, Stochastic Multiplicative Noise

[Plots comparing FDLM and FDLM (HYBRID) on problems s241 and s267 with stochastic multiplicative noise at level 1e-02; optimality gap vs. number of function evaluations.]

Numerical Results: Conclusions

- Both methods are fairly reliable.
- The FD-LM method is not wasteful in terms of function evaluations.
- Neither method dominates.
- Central differences appear to be more reliable, but are twice as expensive per iteration.
- The hybrid approach shows promise.

Convergence analysis

1. What can we prove about the algorithm proposed here?
2. We first note that there is a theory for the Implicit Filtering Method of Kelley, which is a finite difference BFGS method. He establishes deterministic convergence guarantees to the solution. This is possible because it is assumed that the noise can be diminished as needed at every iteration, similar to results on sampling methods for stochastic objectives.
3. In our analysis we assume that the noise does not go to zero. We prove convergence to a neighborhood of the solution whose radius depends on the noise level in the function. Results of this type were pioneered by Nedic and Bertsekas for the incremental gradient method with constant steplengths.
4. We prove two sets of results for strongly convex functions: fixed steplength, and Armijo line search.
5. Up to now, there has been little analysis of line searches with noise.

Discussion

1. The algorithm proposed here is complex, particularly if the recovery mechanism is included.
2. The effect that noisy function evaluations and finite difference gradient approximations have on the line search is difficult to analyze.
3. In fact, the study of stochastic line searches is one of our current research projects.
4. How should results be stated: in expectation? in probability? What assumptions on the noise are realistic? Some results in the literature assume the true function value φ(x) is available.

This field is emerging.

Context of our analysis

1. We will bypass these thorny issues by assuming that the noise in the function and in the gradient is bounded:
   |ε(x)| <= C_f,   ||e(x)|| <= C_g
2. We consider a general gradient method with errors,
   x_{k+1} = x_k - α_k H_k g_k,
   where g_k is any approximation to the gradient; it could stand for a finite difference approximation or some other estimate. The treatment is general. To highlight the novel aspects of this analysis, we assume H_k = I.

Fixed Steplength Analysis

Recall f(x) = φ(x) + ε(x). Define g_k = ∇φ(x_k) + e(x_k).

Iteration: x_{k+1} = x_k - α g_k.

Assume μI ⪯ ∇²φ(x) ⪯ LI and ||e(x)|| <= C_g.

Theorem. If α < 1/L, then for all k,
   φ(x_{k+1}) - φ_N <= (1 - αμ)[φ(x_k) - φ_N],
where φ_N = φ* + C_g²/(2μ) is the best possible objective value.

Therefore,
   φ(x_k) - φ* <= (1 - αμ)^k [φ(x_0) - φ_N] + C_g²/(2μ).
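As a quick sanity check (not part of the slides), the bound can be illustrated on a strongly convex quadratic with an artificially bounded gradient error: the optimality gap contracts linearly and then stalls at a small positive level, which the theorem bounds by C_g²/(2μ).

    import numpy as np

    rng = np.random.default_rng(1)
    mu, L, C_g = 1.0, 10.0, 1e-3
    alpha = 0.9 / L
    A = np.diag(np.linspace(mu, L, 20))          # Hessian with spectrum in [mu, L]
    phi = lambda x: 0.5 * x @ A @ x              # phi* = 0, attained at x = 0

    x = rng.standard_normal(20)
    for k in range(500):
        e = rng.standard_normal(20)
        e *= C_g / np.linalg.norm(e)             # gradient error with ||e|| = C_g
        g = A @ x + e
        x = x - alpha * g

    print(phi(x), C_g**2 / (2 * mu))             # stalled gap vs. theoretical level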

Idea behind the proof
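The slide itself shows only a picture; the following is a sketch of the standard one-step argument under the stated assumptions (L-Lipschitz gradient, μ-strong convexity, ||e(x_k)|| <= C_g), reconstructing how the contraction factor and the noise level φ_N arise. It is a summary of the usual argument, not necessarily the authors' exact proof.

\begin{align*}
\phi(x_{k+1}) &\le \phi(x_k) - \alpha\,\nabla\phi(x_k)^T g_k + \tfrac{L\alpha^2}{2}\,\|g_k\|^2 \\
              &\le \phi(x_k) - \alpha\,\nabla\phi(x_k)^T g_k + \tfrac{\alpha}{2}\,\|g_k\|^2
                 \qquad (\text{using } \alpha \le 1/L) \\
              &=   \phi(x_k) - \tfrac{\alpha}{2}\,\|\nabla\phi(x_k)\|^2
                   + \tfrac{\alpha}{2}\,\|g_k - \nabla\phi(x_k)\|^2 \\
              &\le \phi(x_k) - \tfrac{\alpha}{2}\,\|\nabla\phi(x_k)\|^2 + \tfrac{\alpha}{2}\,C_g^2 .
\end{align*}
Strong convexity gives $\|\nabla\phi(x_k)\|^2 \ge 2\mu\,(\phi(x_k)-\phi^*)$, hence
\[
\phi(x_{k+1}) - \phi^* \le (1-\alpha\mu)\,(\phi(x_k)-\phi^*) + \tfrac{\alpha}{2}\,C_g^2 ,
\]
and subtracting $\phi_N - \phi^* = C_g^2/(2\mu)$ from both sides makes the additive noise term cancel, leaving
$\phi(x_{k+1}) - \phi_N \le (1-\alpha\mu)\,(\phi(x_k)-\phi_N)$.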

Line Search

Our algorithm uses a line search. We want to move away from fixed steplengths and exploit the power of line searches. There has been very little work on noisy line searches. How should sufficient decrease be defined?

We introduce a new (relaxed) Armijo condition:
   f(x_k + α d_k) <= f(x_k) + c_1 α g_k^T d_k + ε_A,
where α is the largest member of {1, τ, τ², ...} satisfying this condition, and ε_A > 2 C_f.
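A standalone sketch of the line search step with this relaxed condition, including the failure flag that triggers the Recovery procedure in the algorithm shown earlier; the interface and constants are illustrative, not the authors' exact code.

    import numpy as np

    def relaxed_armijo_line_search(f, x, fx, g, d, eps_A, c1=1e-4, tau=0.5,
                                   max_backtracks=30):
        """Backtrack over alpha in {1, tau, tau^2, ...} using the relaxed Armijo
        condition f(x + a d) <= f(x) + c1 a g'd + eps_A, with eps_A > 2*C_f.
        Returns (x_plus, f_plus, flag); flag = 1 signals failure (run Recovery)."""
        a, gd = 1.0, g @ d
        for _ in range(max_backtracks):
            x_plus = x + a * d
            f_plus = f(x_plus)
            if f_plus <= fx + c1 * a * gd + eps_A:
                return x_plus, f_plus, 0
            a *= tau
        return x, fx, 1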

Line Search Analysis

New Armijo condition:
   f(x_k + α d_k) <= f(x_k) + c_1 α g_k^T d_k + ε_A,
where α is the largest member of {1, τ, τ², ...} satisfying this condition, and ε_A > 2 C_f.

Because of the relaxation term, the Armijo condition is always satisfied for α << 1. But how long will the accepted step be? Consider two sets of iterates:

Case 1: The gradient error is small relative to the gradient. A step of length 1/L is accepted, and good progress is made.
Case 2: The gradient error is large relative to the gradient. The step could be poor, but the size of the step is only of order C_g.

Line Search Analysis

Iteration: x_{k+1} = x_k - α_k g_k.

Assume μI ⪯ ∇²φ(x) ⪯ LI, ||e(x)|| <= C_g, and |ε(x)| <= C_f.

Theorem: The above algorithm with the relaxed Armijo condition and c_1 < 1/2 gives
   φ(x_{k+1}) - φ_N <= ρ [φ(x_k) - φ_N],
where
   ρ = 1 - 2μ c_1 τ (1 - β)² / L
and
   φ_N = φ* + (1/(1 - ρ)) [ (c_1 τ (1 - β)² / (L β²)) C_g² + ε_A + 2 C_f ].
Here β is a free parameter in (0, (1 - 2c_1)/(1 + 2c_1)].

THANK YOU.