Interpolation-Based Trust-Region Methods for DFO

Interpolation-Based Trust-Region Methods for DFO
Luis Nunes Vicente, University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg)
July 27, 2010, ICCOPT, Santiago
http://www.mat.uc.pt/~lnv

Why Derivative-Free Optimization? 2/54
Some of the reasons to apply derivative-free optimization are the following:
- Nowadays computer hardware and mathematical algorithms allow increasingly large simulations.
- Functions are noisy (one cannot trust derivatives or approximate them by finite differences).
- Binary codes (source code not available) and random simulations make automatic differentiation impossible to apply.
- Legacy codes (written in the past and not maintained by the original authors).
- Lack of sophistication of the user (users need improvement but want to use something simple).

Limitations of Derivative-Free Optimization 3/54
In DFO, convergence/stopping is typically slow (per function evaluation).

Recent advances in direct-search methods 4/54
For a recent talk on Direct Search (8th EUROPT, 2010) see: http://www.mat.uc.pt/~lnv/talks
Ana Luisa Custodio: Talk WA04, 10:30. Ismael Vaz: Talk WB04, 13:30.

The book! 5/54 A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 2009.

Model-based trust-region methods 6/54
One typically minimizes a model m in a trust region B_p(x; Δ):
Trust-region subproblem: $\min_{y \in B_p(x;\Delta)} m(y)$
In derivative-based optimization, one could use:
- 1st order Taylor: $m(y) = f(x) + \nabla f(x)^\top (y-x)$
- A quadratic model with an approximate Hessian H: $m(y) = f(x) + \nabla f(x)^\top (y-x) + \tfrac{1}{2}(y-x)^\top H (y-x)$
- 2nd order Taylor: $m(y) = f(x) + \nabla f(x)^\top (y-x) + \tfrac{1}{2}(y-x)^\top \nabla^2 f(x) (y-x)$
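As a concrete illustration of how such a trust-region subproblem is often handled approximately, here is a minimal Python sketch of the Cauchy point (minimizing the quadratic model along the steepest-descent direction within the 2-norm ball). The function name `cauchy_point` and the example data are illustrative, not part of the talk.

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Approximately minimize m(s) = g.T s + 0.5 s.T H s over ||s||_2 <= delta
    by minimizing along the steepest-descent direction (the Cauchy point)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    d = -g / gnorm                        # unit steepest-descent direction
    gHg = g @ H @ g
    if gHg <= 0.0:
        tau = delta                       # model not convex along -g: go to the boundary
    else:
        tau = min(gnorm**3 / gHg, delta)  # unconstrained minimizer along -g, capped at the radius
    return tau * d

# tiny usage example with an illustrative quadratic model
g = np.array([1.0, -2.0])
H = np.array([[3.0, 0.0], [0.0, 1.0]])
s = cauchy_point(g, H, delta=0.5)
print(s, np.linalg.norm(s))
```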

Fully linear models 7/54
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully linear if
- It is C^1 with Lipschitz continuous gradient.
- The following error bounds hold:
  $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta \quad \forall y \in B(x;\Delta)$
  $|f(y) - m(y)| \le \kappa_{ef}\,\Delta^2 \quad \forall y \in B(x;\Delta).$
For a class of fully linear models, the (unknown) constants κ_ef, κ_eg > 0 must be independent of x and Δ.
Fully linear models can be quadratic (or even nonlinear).

Fully quadratic models 8/54
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully quadratic if
- It is C^2 with Lipschitz continuous Hessian.
- The following error bounds hold:
  $\|\nabla^2 f(y) - \nabla^2 m(y)\| \le \kappa_{eh}\,\Delta \quad \forall y \in B(x;\Delta)$
  $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta^2 \quad \forall y \in B(x;\Delta)$
  $|f(y) - m(y)| \le \kappa_{ef}\,\Delta^3 \quad \forall y \in B(x;\Delta).$
For a class of fully quadratic models, the (unknown) constants κ_ef, κ_eg, κ_eh > 0 must be independent of x and Δ.
Fully quadratic models are only necessary for global convergence to 2nd order stationary points.

Polynomial interpolation models 9/54
Given a sample set Y = {y^0, y^1, ..., y^p}, a polynomial basis φ, and a polynomial model m(y) = α^T φ(y), the interpolating conditions form the linear system
$M(\phi, Y)\,\alpha = f(Y),$
where
$M(\phi, Y) = \begin{bmatrix} \phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_p(y^0) \\ \phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_p(y^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_p(y^p) \end{bmatrix}, \qquad
f(Y) = \begin{bmatrix} f(y^0) \\ f(y^1) \\ \vdots \\ f(y^p) \end{bmatrix}.$
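The following Python sketch assembles M(φ, Y) for the natural quadratic basis (defined on the next slide) and solves the determined system. The names `natural_basis` and `interpolation_model` and the sample points are illustrative choices, not from the talk.

```python
import itertools
import numpy as np

def natural_basis(y):
    """Evaluate the natural/canonical quadratic basis
    {1, y_1,...,y_n, y_1^2/2,...,y_n^2/2, y_1 y_2,...,y_{n-1} y_n} at a point y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    vals = [1.0] + list(y) + [0.5 * yi**2 for yi in y]
    vals += [y[i] * y[j] for i, j in itertools.combinations(range(n), 2)]
    return np.array(vals)

def interpolation_model(Y, fY):
    """Solve M(phi, Y) alpha = f(Y) in the determined case |Y| = (n+1)(n+2)/2."""
    M = np.array([natural_basis(y) for y in Y])   # one row per sample point
    return np.linalg.solve(M, fY)

# usage: recover a quadratic exactly (n = 2, so 6 points are needed)
f = lambda y: 1.0 + 2*y[0] - y[1] + 0.5*y[0]**2 + 3*y[0]*y[1]
Y = np.array([(0, 0), (1, 0), (0, 1), (1, 1), (-1, 0.5), (0.3, -0.7)], dtype=float)
alpha = interpolation_model(Y, np.array([f(y) for y in Y]))
print(alpha)   # coefficients of f in the natural basis
```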

Natural/canonical basis 10/54
The natural/canonical basis appears in a Taylor expansion and is given by:
$\bar\phi = \{1,\; y_1,\ldots,y_n,\; \tfrac{1}{2}y_1^2,\ldots,\tfrac{1}{2}y_n^2,\; y_1 y_2,\ldots,y_{n-1}y_n\}.$
Under appropriate smoothness, the second order Taylor model (for n = 2), centered at 0, is:
$f(0)[1] + \frac{\partial f}{\partial x_1}(0)[y_1] + \frac{\partial f}{\partial x_2}(0)[y_2] + \frac{\partial^2 f}{\partial x_1^2}(0)[y_1^2/2] + \frac{\partial^2 f}{\partial x_1 \partial x_2}(0)[y_1 y_2] + \frac{\partial^2 f}{\partial x_2^2}(0)[y_2^2/2].$

Well poisedness (Λ-poisedness) 11/54
Λ is a poisedness constant related to the geometry of Y.
- The original definition of Λ-poisedness says that the maximum absolute value of the Lagrange polynomials in B(x; Δ) is bounded by Λ.
- An equivalent definition of Λ-poisedness (in the square case, |Y| = |α|) is
  $\|M(\bar\phi, Y_{\mathrm{scaled}})^{-1}\| \le \Lambda,$
  with Y_scaled obtained from Y such that Y_scaled ⊂ B(0; 1).
- Non-square cases are defined analogously (IDFO).
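A minimal numerical sketch of the equivalent (square-case) definition, assuming the set is shifted and scaled into the unit ball by its largest distance to the center; the helper names, the 2-norm choice, and the linear-basis example are illustrative.

```python
import numpy as np

def poisedness_estimate(Y, basis, center):
    """Estimate Lambda-poisedness of a square sample set:
    shift/scale Y into B(0; 1) and take ||M(phi_bar, Y_scaled)^{-1}||."""
    Y = np.asarray(Y, dtype=float)
    shifted = Y - np.asarray(center, dtype=float)
    radius = np.max(np.linalg.norm(shifted, axis=1))
    Y_scaled = shifted / radius                    # now every point lies in B(0; 1)
    M = np.array([basis(y) for y in Y_scaled])
    return np.linalg.norm(np.linalg.inv(M), 2)     # large value = badly poised

# linear-interpolation example (basis {1, y_1, ..., y_n}); nearly collinear points are badly poised
linear_basis = lambda y: np.concatenate(([1.0], y))
good = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bad = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.01)]
print(poisedness_estimate(good, linear_basis, center=(0.0, 0.0)))   # small
print(poisedness_estimate(bad, linear_basis, center=(0.0, 0.0)))    # large
```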

A badly poised set 12/54 Λ = 21296.

A not so badly poised set 13/54 Λ = 440.

Another badly poised set 14/54 Λ = 524982.

An ideal set 15/54 Λ = 1.

Quadratic interpolation models 16/54
The system M(φ̄, Y)α = f(Y) can be:
- Overdetermined, when |Y| > |α|: regression (see the talk on Direct Search!).
- Determined, when |Y| = |α|: for M(φ̄, Y) to be square, one needs N = (n + 1)(n + 2)/2 evaluations of f (often too expensive). Leads to fully quadratic models when Y is well poised (the constants κ in the error bounds will depend on Λ).
- Underdetermined, when |Y| < |α|: minimum Frobenius norm models (Powell, IDFO book). Other approaches?...

Underdetermined quadratic models 17/54
Let m be an underdetermined quadratic model (with Hessian H) built with fewer than N = O(n^2) points.
Theorem (IDFO book)
If Y is Λ_L-poised for linear interpolation or regression then
$\|\nabla f(y) - \nabla m(y)\| \le \Lambda_L\,[\,C_f + \|H\|\,]\,\Delta \quad \forall y \in B(x;\Delta).$
One should build models by minimizing the norm of H.

Minimum Frobenius norm models 18/54
Using φ̄ and separating the quadratic terms, write
$m(y) = \alpha_L^\top \bar\phi_L(y) + \alpha_Q^\top \bar\phi_Q(y).$
Then, build models by minimizing the entries of the Hessian ("Frobenius norm"):
$\min \tfrac{1}{2}\|\alpha_Q\|_2^2 \quad \text{s.t.} \quad M(\bar\phi, Y)\,\alpha = f(Y).$
The solution of this convex QP requires a linear solve with
$\begin{bmatrix} M_Q M_Q^\top & M_L \\ M_L^\top & 0 \end{bmatrix}, \qquad \text{where } M(\bar\phi, Y) = \begin{bmatrix} M_L & M_Q \end{bmatrix}.$
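A Python sketch of this minimum Frobenius norm solve via the saddle-point system above, assuming the natural basis split into linear and quadratic blocks; the names `mfn_model`, `ML`, `MQ` and the example data are illustrative.

```python
import itertools
import numpy as np

def mfn_model(Y, fY):
    """Minimum Frobenius norm quadratic model:
    min 0.5 ||alpha_Q||^2  s.t.  M_L alpha_L + M_Q alpha_Q = f(Y),
    solved via the KKT system [[M_Q M_Q^T, M_L], [M_L^T, 0]]."""
    Y = np.asarray(Y, dtype=float)
    p, n = Y.shape
    # linear block {1, y_1,...,y_n}; quadratic block {y_i^2/2, y_i y_j (i<j)}
    ML = np.hstack([np.ones((p, 1)), Y])
    quad_cols = [0.5 * Y[:, i] ** 2 for i in range(n)]
    quad_cols += [Y[:, i] * Y[:, j] for i, j in itertools.combinations(range(n), 2)]
    MQ = np.column_stack(quad_cols)
    K = np.block([[MQ @ MQ.T, ML], [ML.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([fY, np.zeros(n + 1)])
    sol = np.linalg.solve(K, rhs)
    lam, alpha_L = sol[:p], sol[p:]
    alpha_Q = MQ.T @ lam          # quadratic coefficients from the multipliers
    return alpha_L, alpha_Q

# usage: p = 5 points in R^2 (fewer than N = 6), i.e. underdetermined quadratic interpolation
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
fY = np.array([1.0, 2.0, 0.5, 0.3, 1.7])
alpha_L, alpha_Q = mfn_model(Y, fY)
print(alpha_L, alpha_Q)
```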

Minimum Frobenius norm models 19/54
Theorem (IDFO book)
If Y is Λ_F-poised in the minimum Frobenius norm sense then
$\|H\| \le C_f\,\Lambda_F,$
where H is, again, the Hessian of the model.
Putting the two theorems together yields:
$\|\nabla f(y) - \nabla m(y)\| \le \Lambda_L\,[\,C_f + C_f \Lambda_F\,]\,\Delta \quad \forall y \in B(x;\Delta).$
MFN models are fully linear.

Sparsity on the Hessian 20/54
In many problems, pairs of variables have no correlation, leading to zero second order partial derivatives in f.
Thus, the Hessian $\nabla^2 m(x = 0)$ of the model (i.e., the vector α_Q in the basis φ̄) should be sparse.

Our question 21/54
Is it possible to build fully quadratic models by quadratic underdetermined interpolation (i.e., using fewer than N = O(n^2) points) in the SPARSE case?

Compressed sensing: sparse recovery 22/54
Objective: find a sparse α subject to a highly underdetermined linear system Mα = f.
$\min \|\alpha\|_0 = |\mathrm{supp}(\alpha)| \quad \text{s.t.} \quad M\alpha = f$  is NP-hard.
$\min \|\alpha\|_1 \quad \text{s.t.} \quad M\alpha = f$  often recovers sparse solutions.
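A generic sketch of the equality-constrained l1 problem written as a linear program (splitting α into nonnegative parts), using scipy.optimize.linprog as a stand-in solver; this is not the solver used in the talk, and the random test data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(M, f):
    """Solve min ||alpha||_1 s.t. M alpha = f as an LP via the split alpha = a_plus - a_minus."""
    p, N = M.shape
    c = np.ones(2 * N)                 # sum a_plus + a_minus equals ||alpha||_1 at an optimal vertex
    A_eq = np.hstack([M, -M])
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * N), method="highs")
    x = res.x
    return x[:N] - x[N:]

# usage: random underdetermined system with a 3-sparse ground truth
rng = np.random.default_rng(0)
p, N, s = 30, 80, 3
M = rng.standard_normal((p, N))
alpha_true = np.zeros(N)
alpha_true[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
alpha_hat = l1_recover(M, M @ alpha_true)
print(np.max(np.abs(alpha_hat - alpha_true)))   # typically near solver precision in this regime
```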

Restricted isometry property 23/54
Definition (RIP)
The RIP constant of order s of M (p × N) is the smallest δ_s such that
$(1-\delta_s)\|\alpha\|_2^2 \le \|M\alpha\|_2^2 \le (1+\delta_s)\|\alpha\|_2^2$
for all s-sparse α (i.e., ‖α‖_0 ≤ s).
Theorem (Candès, Tao, 2005, 2006)
If ᾱ is s-sparse and $2\delta_{2s} + \delta_s < 1$, then it can be recovered by l1-minimization:
$\min \|\alpha\|_1 \quad \text{s.t.} \quad M\alpha = M\bar\alpha,$
i.e., the optimal solution α* of this problem is unique and given by α* = ᾱ.

Random matrices 24/54
It is hard to find deterministic matrices that satisfy the RIP for large s.
Using random matrix theory it is possible to prove the RIP for p = O(s log N):
- Matrices with Gaussian entries.
- Matrices with Bernoulli entries.
- Uniformly chosen subsets of the discrete Fourier transform.

Bounded orthonormal expansions (Rauhut) 25/54
Question: How to find a basis φ and a sample set Y such that M(φ, Y) satisfies the RIP?
- Choose orthonormal bases.
- Avoid localized functions (‖φ_i‖_∞ should be uniformly bounded).
- Select Y randomly.

Sparse orthonormal bounded expansion recovery 26/54
Theorem (Rauhut, 2010)
If
- φ is orthonormal in a probability measure µ and ‖φ_i‖_∞ ≤ K,
- each point of Y is drawn independently according to µ,
- $\dfrac{p}{\log p} \ge c_1 K^2 s (\log s)^2 \log N,$
then, with high probability, for every s-sparse vector ᾱ:
Given noisy samples f = M(φ, Y)ᾱ + ε with ‖ε‖_2 ≤ η, let α* be the solution of
$\min \|\alpha\|_1 \quad \text{s.t.} \quad \|M(\phi, Y)\alpha - f\|_2 \le \eta.$
Then
$\|\bar\alpha - \alpha^*\|_2 \le \frac{C}{\sqrt{p}}\,\eta.$

What basis do we need for sparse Hessian recovery? 27/54
Remember the second order Taylor model (for n = 2):
$f(0)[1] + \frac{\partial f}{\partial x_1}(0)[y_1] + \frac{\partial f}{\partial x_2}(0)[y_2] + \frac{\partial^2 f}{\partial x_1^2}(0)[y_1^2/2] + \frac{\partial^2 f}{\partial x_1 \partial x_2}(0)[y_1 y_2] + \frac{\partial^2 f}{\partial x_2^2}(0)[y_2^2/2].$
So, we want something like the natural/canonical basis:
$\bar\phi = \{1,\; y_1,\ldots,y_n,\; \tfrac{1}{2}y_1^2,\ldots,\tfrac{1}{2}y_n^2,\; y_1 y_2,\ldots,y_{n-1}y_n\}.$

An orthonormal basis for quadratics (appropriate for sparse Hessian recovery) 28/54
Proposition (Bandeira, Scheinberg, and Vicente, 2010)
The following basis ψ for quadratics is orthonormal (w.r.t. the uniform measure on B_∞(0; Δ)) and satisfies ‖ψ_ι‖_∞ ≤ 3:
$\psi_0(u) = 1, \quad \psi_{1,i}(u) = \frac{\sqrt{3}}{\Delta}\,u_i, \quad \psi_{2,ij}(u) = \frac{3}{\Delta^2}\,u_i u_j \ (i<j), \quad \psi_{2,i}(u) = \frac{3\sqrt{5}}{2}\,\frac{u_i^2}{\Delta^2} - \frac{\sqrt{5}}{2}.$
ψ is very similar to the canonical basis, and preserves the sparsity of the Hessian (at 0).
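A Monte Carlo sanity check of the proposition, under the reconstruction of the formulas given above (the Δ-scalings are my reading of the garbled transcription): it estimates the Gram matrix of ψ on B_∞(0; Δ) and the sup-norm bound. The sample size and names are illustrative.

```python
import numpy as np

def psi_basis(u, delta):
    """Reconstructed orthonormal quadratic basis on B_inf(0; delta), uniform measure.
    Ordering: [1] + [linear terms] + [square terms] + [cross terms i<j]."""
    t = np.asarray(u, dtype=float) / delta
    n = len(t)
    vals = [1.0]
    vals += [np.sqrt(3.0) * ti for ti in t]
    vals += [1.5 * np.sqrt(5.0) * ti**2 - 0.5 * np.sqrt(5.0) for ti in t]
    vals += [3.0 * t[i] * t[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(vals)

# Monte Carlo check: Gram matrix ~ identity, and sup norm bounded by 3
rng = np.random.default_rng(1)
n, delta, samples = 3, 0.5, 50_000
U = rng.uniform(-delta, delta, size=(samples, n))
Psi = np.array([psi_basis(u, delta) for u in U])
gram = Psi.T @ Psi / samples
print(np.max(np.abs(gram - np.eye(gram.shape[0]))))   # small (Monte Carlo error, ~1e-2)
print(np.max(np.abs(Psi)))                            # at most about 3
```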

Hessian sparse recovery 29/54
Let us look again at
$\min \|\alpha\|_1 \quad \text{s.t.} \quad \|M(\psi, Y)\alpha - f\|_2 \le \eta, \qquad \text{where } f = M(\psi, Y)\bar\alpha + \varepsilon.$
- So, the noisy data is f = f(Y).
- What we are trying to recover is the 2nd order Taylor model ᾱᵀψ(y).
- Thus, in ‖ε‖ ≤ η, one has η = O(Δ³).

Hessian sparse recovery 30/54
Theorem (Bandeira, Scheinberg, and Vicente, 2010)
If
- the Hessian of f at 0 is s-sparse,
- Y is a random sample set chosen w.r.t. the uniform measure on B_∞(0; Δ),
- $\dfrac{p}{\log p} \ge 9 c_1 (s + n + 1)\,\log^2(s + n + 1)\,\log N, \qquad N = O(n^2),$
then, with high probability, the quadratic $q^* = \sum_\iota \alpha^*_\iota \psi_\iota$ obtained by solving the noisy l1-minimization problem is a fully quadratic model for f (with error constants not depending on Δ).

An answer to our question 31/54
For instance, when the number of non-zeros of the Hessian is s = O(n), we are able to construct fully quadratic models with O(n log^4 n) points.
Also, we recover both the function and its sparsity structure.

Open question 32/54
Generalize the result above when minimizing only the l1-norm of the Hessian part (α_Q) rather than of the whole α. Numerical simulations have shown that such an approach is (slightly) advantageous.

Remarks 33/54
However, the theorem only provides motivation because, in a practical approach, we:
- Solve $\min \|\alpha_Q\|_1 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\alpha_L + M(\bar\phi_Q, Y)\alpha_Q = f(Y).$
- Deal with small n (from the DFO setting), and the bound we obtain is asymptotic.
- Use deterministic sampling.
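A sketch of the practical problem just stated (minimize ‖α_Q‖₁ subject to the interpolation conditions), cast as an LP with α_L free. The talk's implementation uses lipsol; this stand-in uses scipy's linprog, and the test function and sample points are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def l1_min_norm_model(Y, fY):
    """min ||alpha_Q||_1  s.t.  M_L alpha_L + M_Q alpha_Q = f(Y),
    written as an LP with alpha_Q = q_plus - q_minus and alpha_L free."""
    Y = np.asarray(Y, dtype=float)
    p, n = Y.shape
    ML = np.hstack([np.ones((p, 1)), Y])
    quad_cols = [0.5 * Y[:, i] ** 2 for i in range(n)]
    quad_cols += [Y[:, i] * Y[:, j] for i, j in itertools.combinations(range(n), 2)]
    MQ = np.column_stack(quad_cols)
    nL, nQ = ML.shape[1], MQ.shape[1]
    c = np.concatenate([np.zeros(nL), np.ones(2 * nQ)])     # only the quadratic part is penalized
    A_eq = np.hstack([ML, MQ, -MQ])
    bounds = [(None, None)] * nL + [(0, None)] * (2 * nQ)
    res = linprog(c, A_eq=A_eq, b_eq=fY, bounds=bounds, method="highs")
    x = res.x
    return x[:nL], x[nL:nL + nQ] - x[nL + nQ:]              # alpha_L, alpha_Q

# usage: a function with a sparse Hessian sampled at p = 7 points in R^3 (N would be 10)
f = lambda y: 1.0 + y[0] - 2*y[2] + 4*y[1]*y[2]
Y = np.array([[0,0,0],[1,0,0],[0,1,0],[0,0,1],[1,1,0],[0,1,1],[1,0,1]], dtype=float)
alpha_L, alpha_Q = l1_min_norm_model(Y, np.array([f(y) for y in Y]))
print(alpha_L, np.round(alpha_Q, 6))   # alpha_L ~ [1, 1, 0, -2]; alpha_Q has a single nonzero (4 on y_2*y_3)
```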

Interpolation-based trust-region methods 34/54
Trust-region methods for DFO typically:
- Attempt to form quadratic models (by interpolation/regression, using polynomials or radial basis functions)
  $m_k(x_k + s) = f(x_k) + g_k^\top s + \tfrac{1}{2} s^\top H_k s$
  based on (well poised) sample sets. Well poisedness ensures fully linear or fully quadratic models.
- Calculate a step s_k by approximately solving the trust-region subproblem
  $\min_{s \in B_2(x_k; \Delta_k)} m_k(x_k + s).$

Interpolation-based trust-region methods 35/54
- Set x_{k+1} to x_k + s_k (success) or to x_k (unsuccess) and update Δ_k depending on the value of
  $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$
- Attempt to accept steps based on simple decrease, i.e., if
  $\rho_k > 0 \iff f(x_k + s_k) < f(x_k).$
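A stripped-down Python sketch of the ρ_k test and radius update just described, using simple linear models built from a coordinate sample set and omitting the geometry management, criticality step, and model-improvement that the actual algorithms require. All names and parameter values are illustrative, not from the talk.

```python
import numpy as np

def dfo_trust_region(f, x0, delta0=1.0, tol=1e-6, max_iter=200,
                     eta1=0.1, gamma_dec=0.5, gamma_inc=2.0):
    """Bare-bones model-based trust-region loop: linear interpolation model from
    coordinate samples, full trust-region step, rho-test, radius update."""
    x, delta = np.asarray(x0, dtype=float), delta0
    fx = f(x)
    for _ in range(max_iter):
        n = len(x)
        # linear model m(x+s) = fx + g^T s from the sample points x + delta*e_i
        g = np.array([(f(x + delta * e) - fx) / delta for e in np.eye(n)])
        if np.linalg.norm(g) <= tol and delta <= tol:
            break
        s = -delta * g / max(np.linalg.norm(g), 1e-16)   # TR step for a linear model
        f_trial = f(x + s)
        pred = -(g @ s)                                  # predicted (model) decrease, > 0 here
        rho = (fx - f_trial) / pred if pred > 0 else -np.inf
        if rho > eta1:                                   # successful: accept and expand
            x, fx, delta = x + s, f_trial, gamma_inc * delta
        else:                                            # unsuccessful: reject and shrink
            delta = gamma_dec * delta
    return x, fx

# usage on a smooth convex test function (the method never uses derivatives of f)
quad = lambda z: (z[0] - 1.0)**2 + 10.0 * (z[1] + 2.0)**2
print(dfo_trust_region(quad, [0.0, 0.0]))   # should move toward the minimizer (1, -2)
```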

Interpolation-based trust-region methods (IDFO) 36/54
- Reduce Δ_k only if ρ_k is small and the model is FL/FQ: unsuccessful iterations.
- Accept new iterates based on simple decrease (ρ_k > 0) as long as the model is FL/FQ: acceptable iterations.
- Allow for model-improving iterations (when ρ_k is not large enough and the model is not certifiably FL/FQ). Do not reduce Δ_k in these iterations.

Interpolation-based trust-region methods (IDFO) 37/54
- Incorporate a criticality step (1st or 2nd order) when the stationarity measure of the model is small: an internal cycle of reductions of Δ_k until the model is well poised in B(x_k; ‖g_k‖).

Behavior of the trust-region radius 38/54
Due to the criticality step, one has for successful iterations:
$f(x_k) - f(x_{k+1}) \ge O(\|g_k\| \min\{\|g_k\|, \Delta_k\}) \ge O(\Delta_k^2).$
Thus:
Theorem (Conn, Scheinberg, and Vicente, 2009)
The trust-region radius converges to zero: $\lim_{k \to +\infty} \Delta_k = 0.$
Similar to direct-search methods, where $\liminf_{k \to +\infty} \alpha_k = 0.$

Analysis of TR methods (1st order) 39/54
Using fully linear models:
Theorem (Conn, Scheinberg, and Vicente, 2009)
If ∇f is Lipschitz continuous and f is bounded below on L(x_0) then
$\lim_{k \to +\infty} \|\nabla f(x_k)\| = 0.$
Valid also for simple decrease (acceptable iterations).

Analysis of TR methods (2nd order) 40/54
Using fully quadratic models:
Theorem (Conn, Scheinberg, and Vicente, 2009)
If ∇²f is Lipschitz continuous and f is bounded below on L(x_0) then
$\lim_{k \to +\infty} \max\{\|\nabla f(x_k)\|,\; -\lambda_{\min}[\nabla^2 f(x_k)]\} = 0.$
Valid also for simple decrease (acceptable iterations).
Going from lim inf to lim requires changing the update of Δ_k.

Sample set management 41/54
Recently, Fasano, Morales, and Nocedal (2009) suggested a one-point exchange:
- In successful iterations:
  $Y_{k+1} = Y_k \cup \{x_k + s_k\} \setminus \{y_{out}\}, \qquad y_{out} = \arg\max_{y \in Y_k} \|y - x_k\|_2.$
- In the unsuccessful case:
  $Y_{k+1} = Y_k \cup \{x_k + s_k\} \setminus \{y_{out}\} \quad \text{if } \|y_{out} - x_k\| \ge \|s_k\|.$
- Do not perform model-improving iterations.
They observed sample sets that are not badly poised!
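A small Python sketch of this one-point exchange rule as read from the slide; the data structures, function name, and example points are illustrative.

```python
import numpy as np

def update_sample_set(Y, x_k, s_k, successful):
    """One-point exchange in the spirit of Fasano, Morales, and Nocedal (2009):
    swap in the trial point x_k + s_k for the sample point farthest from x_k."""
    x_k = np.asarray(x_k, dtype=float)
    s_k = np.asarray(s_k, dtype=float)
    Y = [np.asarray(y, dtype=float) for y in Y]
    x_new = x_k + s_k
    dists = [np.linalg.norm(y - x_k) for y in Y]
    i_out = int(np.argmax(dists))
    if successful:
        Y[i_out] = x_new                               # always exchange on success
    elif dists[i_out] >= np.linalg.norm(s_k):          # on failure, exchange only if it improves locality
        Y[i_out] = x_new
    return Y

# usage with an illustrative sample set in R^2
Y = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 3.0])]
Y = update_sample_set(Y, x_k=[0.0, 0.0], s_k=[0.2, 0.1], successful=False)
print(Y)   # the far-away point (0, 3) is replaced by the trial point (0.2, 0.1)
```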

Self-correcting geometry 42/54
Later, Scheinberg and Toint (2009) proposed:
- A self-correcting geometry approach based on one-point exchanges, globally convergent to first-order stationary points (lim inf).
- In the unsuccessful case, y_out is chosen based not only on ‖y − x_k‖_2 but also on the values of the Lagrange polynomials at x_k + s_k.
- They showed that, if Δ_k is small compared to ‖g_k‖, then the step either improves the function or the geometry/poisedness of the model.
- In their approach, model-improving iterations are not needed.
- They showed, however, that the criticality step is indeed necessary.

Without criticality step... (Scheinberg and Toint) 43/54-46/54 (figures only)

A practical interpolation-based trust-region method 47/54
Model building:
- If |Y_k| = N = O(n^2), use determined quadratic interpolation.
- Otherwise use l1 (p = 1) or Frobenius (p = 2) minimum-norm quadratic interpolation:
  $\min \tfrac{1}{p}\|\alpha_Q\|_p^p \quad \text{s.t.} \quad M(\bar\phi_L, Y_{\mathrm{scaled}})\alpha_L + M(\bar\phi_Q, Y_{\mathrm{scaled}})\alpha_Q = f(Y_{\mathrm{scaled}}).$
Sample set update (one starts with |Y_0| = O(n)):
- If |Y_k| < N = O(n^2), set Y_{k+1} = Y_k ∪ {x_k + s_k}.
- Otherwise, as in Fasano et al., but with y_out = argmax ‖y − x_{k+1}‖_2.
Criticality step: if Δ_k is very small, discard points far away from the trust region.
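An illustrative Python sketch combining the sample-set update and the "discard far points" criticality rule just described; the thresholds (crit_tol, crit_factor) are made-up parameters, not from the talk.

```python
import numpy as np

def practical_sample_update(Y, x_new, delta_k, n, crit_tol=1e-5, crit_factor=100.0):
    """Sample-set management sketch for the practical method:
    append the trial point while |Y| < N = (n+1)(n+2)/2, otherwise swap it for the
    point farthest from x_new (= x_{k+1}); when delta_k is tiny, drop far-away points."""
    N = (n + 1) * (n + 2) // 2
    Y = [np.asarray(y, dtype=float) for y in Y]
    x_new = np.asarray(x_new, dtype=float)
    if len(Y) < N:
        Y.append(x_new)                                   # growing phase: from |Y_0| = O(n) up to N
    else:
        dists = [np.linalg.norm(y - x_new) for y in Y]
        Y[int(np.argmax(dists))] = x_new                  # one-point exchange w.r.t. x_{k+1}
    if delta_k < crit_tol:                                # "criticality step": keep only nearby points
        Y = [y for y in Y if np.linalg.norm(y - x_new) <= crit_factor * delta_k]
    return Y
```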

Performance profiles (accuracy of 10^{-4} in function values) 48/54
Figure: Performance profiles comparing DFO-TR (l1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).

Performance profiles (accuracy of 10^{-6} in function values) 49/54
Figure: Performance profiles comparing DFO-TR (l1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).

Concluding remarks 50/54
- Optimization is a fundamental tool in Compressed Sensing. However, this work shows that CS can also be applied to Optimization.
- In a sparse scenario, we were able to construct fully quadratic models with samples of size O(n log^4 n) instead of the classical O(n^2).
- We proposed a practical DFO method (using l1-minimization) that was able to outperform state-of-the-art methods in several numerical tests (in the already tough DFO scenario where n is small).

Open questions 51/54
- Improve the efficiency of the model l1-minimization by properly warmstarting it (currently we solve it as an LP using lipsol by Y. Zhang).
- Study the convergence properties of possibly stochastic interpolation-based trust-region methods.
- Investigate l1-minimization techniques in statistical models (like Kriging for interpolating sparse data sets), but applied to Optimization.
- Develop a globally convergent model-based trust-region method for non-smooth functions.

References 52/54
- A. Bandeira, K. Scheinberg, and L. N. Vicente, Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization, in preparation, 2010.
- A. R. Conn, K. Scheinberg, and L. N. Vicente, Global convergence of general derivative-free trust-region algorithms to first and second order critical points, SIAM J. Optim., 20 (2009) 387-415.
- G. Fasano, J. L. Morales, and J. Nocedal, On the geometry phase in model-based algorithms for derivative-free optimization, Optim. Methods Softw., 24 (2009) 145-154.
- K. Scheinberg and Ph. L. Toint, Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization, 2009.

Other recent work 53/54
- S. Gratton, Ph. L. Toint, and A. Tröltzsch, An active-set trust-region method for derivative-free nonlinear bound-constrained optimization, 2010. Talk WA02, 11:30.
- S. Gratton and L. N. Vicente, A surrogate management framework using rigorous trust-region steps, 2010.
- M. J. D. Powell, The BOBYQA algorithm for bound constrained optimization without derivatives, 2009.
- S. M. Wild and C. Shoemaker, Global convergence of radial basis function trust region derivative-free algorithms, 2009.

Optimization 2011 (July 24-27, Portugal) 54/54