Zero-Order Methods for the Optimization of Noisy Functions. Jorge Nocedal

Size: px

Start display at page:

Download "Zero-Order Methods for the Optimization of Noisy Functions. Jorge Nocedal"

Norma Morris
5 years ago
Views:

1 Zero-Order Methods for the Optimization of Noisy Functions Jorge Nocedal Northwestern University Simons Institute, October

2 Collaborators Albert Berahas Northwestern University Richard Byrd University of Colorado 2

3 Motivation Problem 1: min f (x) f smooth but derivatives not available 1. Scalability 2. Parallelism 3. Beyond linear models 4. But should not aim for fully quadratic model 5. Spread function evaluations effectively Problem 2: min f (x;ξ) f ( ;ξ) smooth 1. Use noise estimation techniques (Hamming 1960s) 2. Estimate good finite-difference interval h 3. Classical quasi-newton updating using finite-difference gradients 4. Deal with noise adaptively 5. Can solve problems with thousands of variables 6. Convergence to a neighborhood of solution 3

4 Nonsmooth Optimization Lewis and Overton BFGS method with the right line search is more effective in practice than bundle methods or any other approach they tried (Curtis) The Wolfe line search ensures that a convex model can be created. Only assume function is bounded below x k+1 = x k α k H k f (x k ) f (x) = x Gradient exists almost everywhere: f (x k + αd) f (x k ) + αc 1 f (x k ) T d Armijo f (x k + αd) T d c 2 f (x k ) T d Wolfe 0 < c 1 < c 2 < 1 4

Nonsmooth Optimization Lewis and Overton: The BFGS matrix captures the U-V structure of the objective x k+1 = x k α k H k f (x k ) Hessian approximation blows up (good thing) Never observed failures

5 Nonsmooth Optimization Lewis and Overton: The BFGS matrix captures the U-V structure of the objective x k+1 = x k α k H k f (x k ) Hessian approximation blows up (good thing) Never observed failures Very limited convergence results Where do we go from here? Power of Armijo-Wolfe line search not appreciated by the convex analysis community Rather than constructing a majorizing function, one constructs a convex model along the search direction 5

6 Discussion 1. The BFGS method continues to surprise 2. One of the leading algorithms for nonsmooth optimization 3. Leading approach for (deterministic) derivative-free optimization 4. This talk: Leading method for the minimization of noisy functions These observations do not apply to: 1. Structured nonsmooth optimization (e.g. lasso) 2. Stochastic objectives with cheap gradient, as in machine learning 3. Nonlinear least squares objectives; Gauss-Newton is the right approach We had not fully recognized the power and generality of quasi-newton updating 6

7 Derivative free deterministic optimization (no noise) min f (x) f is smooth Interpolation based models with trust regions (Katya) min m(x) = xt Bx + g T x s.t. x 2 Δ 1. Need (n+1)(n+2)/2 function values to define quadratic model by pure interpolation 2. Can use n points and assume minimum norm change in the Hessian 3. Arithmetic costs high: n 4 4. Placement of interpolation points is important 5. Trust region constraint needed and natural 6. Parallelizable? 7

8 BFGS with finite difference gradients: deterministic case x k+1 = x k α k H k f (x k ) f (x) x i f (x + he i ) f (x) h Invest significant effort in estimation of gradient Delegate construction of model to BFGS Interpolating gradients Modest linear algebra costs O(n) Placement of sample points on an orthogonal set BFGS is an overwriting process: no inconsistencies or ill conditioning with Armijo-Wolfe line search Gradient evaluation parallelizes easily Why now? Perception that n function evaluations per step is too high Derivative-free literature rarely compares with FD quasi-newton Already used extensively: fminunc MATLAB 8

9 Comparison: function decrease vs total # of function evaluations Schittkowski: s201 Stochastic Additive, Noise Level = 0 DFOtr FDLBFGS (FD) FDLBFGS (CD) 10 0 Schittkowski: s207 Stochastic Additive, Noise Level = 0 DFOtr FDLBFGS (FD) FDLBFGS (CD) Schittkowski: s287 Stochastic Additive, Noise Level = 0 DFOtr FDLBFGS (FD) FDLBFGS (CD) F(x)-F * F(x)-F * F(x)-F * Number of function evaluations Number of function evaluations Number of function evaluations

10 Optimization of Noisy Functions min f (x;ξ) min f (x) =φ(x) + ε(x) where f ( ;ξ) is smooth f (x) = φ(x)(1+ ε(x)) Additive and multiplicative noise. Focus on additive Outline of adaptive finite-difference BFGS method 1. Estimate noise e(x) at every iteration, 2. Possibly change h 3. Corrective Procedure in case line search fails 4. (need to modify line search) f(x) f(x) = sin(x) + cos(x) U(0,2sqrt(3)) Smooth Noisy x 10

11 Noise estimation More -Wild (2011) Noise level: σ = [var(ε(x))] 1/2 Noise estimate: ε f min f (x) =φ(x) + ε(x) At x choose a random direction v evaluate f at q +1 equally spaced points x + iβv, i = 0,...,q Compute function differences: Δ 0 f (x) = f (x) Δ j+1 f (x) = Δ j [Δf (x)] = Δ j [ f (x + β)] Δ j [ f (x)]] Compute finite diverence table: T ij = Δ j f (x + iβv) 1 < j < q 0 < i < j q σ j = γ j q 1 j q j i=0 2 T i, j γ j = ( j!)2 (2 j)! 11

12 min f (x) =sin(x) + cos(x) U(0,2 3 ) q = 6 β = 10 2 x f f 2 f 3 f 4 f 5 f 6 f e e e e e e e e e e e e e e e e e e e e e k 6.65e e e e e e 4 High order differences of a smooth function tend to zero rapidly, while differences in noise are bounded away from zero. Changes in sign, useful. Procedure is scale invariant!

13 Finite difference itervals Once noise estimate ε f has been chosen: Forward difference: h = 8 1/4 ( ε f µ 2 ) 1/2 Central difference: h = 3 1/3 ( ε f µ 3 ) 1/3 µ 2 = max x I f (x) µ 3 f (x) Bad estimates of second and third derivatives can make cause problems (not often) 13

14 Adaptive Finite Difference L-BFGS Method Estimate noise ε f Compute h by forward or central differences [(4-8) function evaluations] Compute g k While convergence test not satisfied: d = H k g k [L-BFGS procedure] (x +, f +, flag) = LineSearch(x k, f k,g k,d k, f s ) IF flag=1 endif [line search failed] (x +, f +,h) = Recovery(x k, f k,g k,d k,max iter ) x k+1 = x +, f k+1 = f + Compute g k+1 [finite differences using h] s k = x k+1 x k, y k = g k+1 g k Discard (s k, y k ) if s k T y k 0 k = k +1 endwhile 14

15 Adaptive Finite Difference L-BFGS Method Estimate noise ε f Compute h by forward or central differences [(4-8) function evaluations] Compute g k While convergence test not satisfied: d = H k g k [L-BFGS procedure] (x +, f +, flag) = LineSearch(x k, f k,g k,d k, f s ) IF flag=1 endif [line search failed] (x +, f +,h) = Recovery(x k, f k,g k,d k,max iter ) x k+1 = x +, f k+1 = f + Compute g k+1 [finite differences using h] s k = x k+1 x k, y k = g k+1 g k Discard (s k, y k ) if s k T y k 0 k = k +1 endwhile 15

16 Corrective Procedure Compute new noise estimate ε f along search direction d k ; Compute corresponding h IF h [0.7h,1.5h] h = h, x + = x k, f + = f k [update h; do not move] ELSE x + = x k + hd k / d k, f + = f (x + ) [perturbation] If x + satisfies the relaxed Armijo condition return x +,h else Finite difference Stencil (Kelley) x s if f + f s and f + f k accept x + else if f k > f s and f + > f s x + = x s, f + = f s else x + = x k f + = f k compute new ε f, h [random v] end if end if ENDIF 16

17 Line Search BFGS method requires Armijo-Wolfe line search f (x k + αd) f (x k ) + αc 1 f (x k )d f (x k + αd) T d c 2 f (x k ) T d Wolfe Armijo Deterministic case: always possible if f is bounded below Can be problematic in the noisy case. Direction d may not be a descent direction for smooth underlying function Strategy: try to satisfy both but limit the number of attempts If first trial point (unit steplength) is not acceptable relax: f (x k + αd) f (x k ) + αc 1 f (x k )d + 2ε f relaxed Armijo Three outcomes: a) both satisfied; b) only Armijo; c) none 17

18 10 3 Schittkowski: s286 Stochastic Additive, Noise Level = 1e-02 DFOtr FDLBFGS (FD) FDLBFGS (CD) 10 5 Schittkowski: s287 Stochastic Additive, Noise Level = 1e-02 DFOtr FDLBFGS (FD) FDLBFGS (CD) F(x)-F * 10 1 F(x)-F * Number of function evaluations Number of function evaluations

19 A simple convergence result: constant steplength Assumptions: 1. f (x) = φ(x) + ε(x). Function φ is twice differentiable 2. Strong convexity. µi 2 φ(x) LI x R n 3. H k has bounded eigenvalues 4. Bounded noise. ε(x) ε x R n Theorem. If α < Then for all k 1 β (1 β)l + β 2 L for any β (0,1) φ(x k ) φ * (1 ρ) k (φ(x 0 ) φ * η / ρ) + η / ρ ρ = 1 αµ(1 β) η = α[ 1 α L β + α L 2 ]ε 2 19

20 END 20

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods. Jorge Nocedal

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods Jorge Nocedal Northwestern University Huatulco, Jan 2018 1 Collaborators Albert Berahas Northwestern University Richard Byrd University