Interpolation-Based Trust-Region Methods for DFO

Interpolation-Based Trust-Region Methods for DFO
Luis Nunes Vicente, University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg)
July 27, 2010, ICCOPT, Santiago
http://www.mat.uc.pt/~lnv

Why Derivative-Free Optimization? 2/54
Some of the reasons to apply derivative-free optimization are the following:
- Nowadays computer hardware and mathematical algorithms allow increasingly large simulations.
- Functions are noisy (one cannot trust derivatives or approximate them by finite differences).
- Binary codes (source code not available) and random simulations make automatic differentiation impossible to apply.
- Legacy codes (written in the past and not maintained by the original authors).
- Lack of sophistication of the user (users need improvement but want to use something simple).

Limitations of Derivative-Free Optimization 3/54
In DFO, convergence/stopping is typically slow (per function evaluation).

Recent advances in direct-search methods 4/54
For a recent talk on Direct Search (8th EUROPT, 2010) see: http://www.mat.uc.pt/~lnv/talks
Ana Luisa Custodio: Talk WA04, 10:30. Ismael Vaz: Talk WB04, 13:30.

The book! 5/54 A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 2009.

Model-based trust-region methods 6/54
One typically minimizes a model m in a trust region B_p(x; Δ):
Trust-region subproblem: $\min_{y \in B_p(x;\Delta)} m(y)$
In derivative-based optimization, one could use:
- 1st order Taylor: $m(y) = f(x) + \nabla f(x)^\top (y-x)$
- A quadratic model with an approximate Hessian H: $m(y) = f(x) + \nabla f(x)^\top (y-x) + \tfrac{1}{2}(y-x)^\top H (y-x)$
- 2nd order Taylor: $m(y) = f(x) + \nabla f(x)^\top (y-x) + \tfrac{1}{2}(y-x)^\top \nabla^2 f(x) (y-x)$
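As a concrete illustration of how such a trust-region subproblem is often handled approximately, here is a minimal Python sketch of the Cauchy point (minimizing the quadratic model along the steepest-descent direction within the 2-norm ball). The function name `cauchy_point` and the example data are illustrative, not part of the talk.

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Approximately minimize m(s) = g.T s + 0.5 s.T H s over ||s||_2 <= delta
    by minimizing along the steepest-descent direction (the Cauchy point)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    d = -g / gnorm                        # unit steepest-descent direction
    gHg = g @ H @ g
    if gHg <= 0.0:
        tau = delta                       # model not convex along -g: go to the boundary
    else:
        tau = min(gnorm**3 / gHg, delta)  # unconstrained minimizer along -g, capped at the radius
    return tau * d

# tiny usage example with an illustrative quadratic model
g = np.array([1.0, -2.0])
H = np.array([[3.0, 0.0], [0.0, 1.0]])
s = cauchy_point(g, H, delta=0.5)
print(s, np.linalg.norm(s))
```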

Fully linear models 7/54
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully linear if
- It is C^1 with Lipschitz continuous gradient.
- The following error bounds hold:
  $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta \quad \forall y \in B(x;\Delta)$
  $|f(y) - m(y)| \le \kappa_{ef}\,\Delta^2 \quad \forall y \in B(x;\Delta).$
For a class of fully linear models, the (unknown) constants κ_ef, κ_eg > 0 must be independent of x and Δ.
Fully linear models can be quadratic (or even nonlinear).

Fully quadratic models 8/54
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully quadratic if
- It is C^2 with Lipschitz continuous Hessian.
- The following error bounds hold:
  $\|\nabla^2 f(y) - \nabla^2 m(y)\| \le \kappa_{eh}\,\Delta \quad \forall y \in B(x;\Delta)$
  $\|\nabla f(y) - \nabla m(y)\| \le \kappa_{eg}\,\Delta^2 \quad \forall y \in B(x;\Delta)$
  $|f(y) - m(y)| \le \kappa_{ef}\,\Delta^3 \quad \forall y \in B(x;\Delta).$
For a class of fully quadratic models, the (unknown) constants κ_ef, κ_eg, κ_eh > 0 must be independent of x and Δ.
Fully quadratic models are only necessary for global convergence to 2nd order stationary points.

Polynomial interpolation models 9/54
Given a sample set Y = {y^0, y^1, ..., y^p}, a polynomial basis φ, and a polynomial model m(y) = α^T φ(y), the interpolating conditions form the linear system
$M(\phi, Y)\,\alpha = f(Y),$
where
$M(\phi, Y) = \begin{bmatrix} \phi_0(y^0) & \phi_1(y^0) & \cdots & \phi_p(y^0) \\ \phi_0(y^1) & \phi_1(y^1) & \cdots & \phi_p(y^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(y^p) & \phi_1(y^p) & \cdots & \phi_p(y^p) \end{bmatrix}, \qquad
f(Y) = \begin{bmatrix} f(y^0) \\ f(y^1) \\ \vdots \\ f(y^p) \end{bmatrix}.$
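The following Python sketch assembles M(φ, Y) for the natural quadratic basis (defined on the next slide) and solves the determined system. The names `natural_basis` and `interpolation_model` and the sample points are illustrative choices, not from the talk.

```python
import itertools
import numpy as np

def natural_basis(y):
    """Evaluate the natural/canonical quadratic basis
    {1, y_1,...,y_n, y_1^2/2,...,y_n^2/2, y_1 y_2,...,y_{n-1} y_n} at a point y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    vals = [1.0] + list(y) + [0.5 * yi**2 for yi in y]
    vals += [y[i] * y[j] for i, j in itertools.combinations(range(n), 2)]
    return np.array(vals)

def interpolation_model(Y, fY):
    """Solve M(phi, Y) alpha = f(Y) in the determined case |Y| = (n+1)(n+2)/2."""
    M = np.array([natural_basis(y) for y in Y])   # one row per sample point
    return np.linalg.solve(M, fY)

# usage: recover a quadratic exactly (n = 2, so 6 points are needed)
f = lambda y: 1.0 + 2*y[0] - y[1] + 0.5*y[0]**2 + 3*y[0]*y[1]
Y = np.array([(0, 0), (1, 0), (0, 1), (1, 1), (-1, 0.5), (0.3, -0.7)], dtype=float)
alpha = interpolation_model(Y, np.array([f(y) for y in Y]))
print(alpha)   # coefficients of f in the natural basis
```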

Natural/canonical basis 10/54
The natural/canonical basis appears in a Taylor expansion and is given by:
$\bar\phi = \{1,\; y_1,\ldots,y_n,\; \tfrac{1}{2}y_1^2,\ldots,\tfrac{1}{2}y_n^2,\; y_1 y_2,\ldots,y_{n-1}y_n\}.$
Under appropriate smoothness, the second order Taylor model (for n = 2), centered at 0, is:
$f(0)[1] + \frac{\partial f}{\partial x_1}(0)[y_1] + \frac{\partial f}{\partial x_2}(0)[y_2] + \frac{\partial^2 f}{\partial x_1^2}(0)[y_1^2/2] + \frac{\partial^2 f}{\partial x_1 \partial x_2}(0)[y_1 y_2] + \frac{\partial^2 f}{\partial x_2^2}(0)[y_2^2/2].$

Well poisedness (Λ-poisedness) 11/54
Λ is a poisedness constant related to the geometry of Y.
- The original definition of Λ-poisedness says that the maximum absolute value of the Lagrange polynomials in B(x; Δ) is bounded by Λ.
- An equivalent definition of Λ-poisedness (in the square case, |Y| = |α|) is
  $\|M(\bar\phi, Y_{\mathrm{scaled}})^{-1}\| \le \Lambda,$
  with Y_scaled obtained from Y such that Y_scaled ⊂ B(0; 1).
- Non-square cases are defined analogously (IDFO).
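A minimal numerical sketch of the equivalent (square-case) definition, assuming the set is shifted and scaled into the unit ball by its largest distance to the center; the helper names, the 2-norm choice, and the linear-basis example are illustrative.

```python
import numpy as np

def poisedness_estimate(Y, basis, center):
    """Estimate Lambda-poisedness of a square sample set:
    shift/scale Y into B(0; 1) and take ||M(phi_bar, Y_scaled)^{-1}||."""
    Y = np.asarray(Y, dtype=float)
    shifted = Y - np.asarray(center, dtype=float)
    radius = np.max(np.linalg.norm(shifted, axis=1))
    Y_scaled = shifted / radius                    # now every point lies in B(0; 1)
    M = np.array([basis(y) for y in Y_scaled])
    return np.linalg.norm(np.linalg.inv(M), 2)     # large value = badly poised

# linear-interpolation example (basis {1, y_1, ..., y_n}); nearly collinear points are badly poised
linear_basis = lambda y: np.concatenate(([1.0], y))
good = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bad = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.01)]
print(poisedness_estimate(good, linear_basis, center=(0.0, 0.0)))   # small
print(poisedness_estimate(bad, linear_basis, center=(0.0, 0.0)))    # large
```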

A badly poised set 12/54 Λ = 21296.

A not so badly poised set 13/54 Λ = 440.

Another badly poised set 14/54 Λ = 524982.

An ideal set 15/54 Λ = 1.

Quadratic interpolation models 16/54
The system M(φ̄, Y)α = f(Y) can be:
- Overdetermined, when |Y| > |α|: regression (see the talk on Direct Search!).
- Determined, when |Y| = |α|: for M(φ̄, Y) to be square, one needs N = (n + 1)(n + 2)/2 evaluations of f (often too expensive). Leads to fully quadratic models when Y is well poised (the constants κ in the error bounds will depend on Λ).
- Underdetermined, when |Y| < |α|: minimum Frobenius norm models (Powell, IDFO book). Other approaches?...

Underdetermined quadratic models 17/54
Let m be an underdetermined quadratic model (with Hessian H) built with fewer than N = O(n^2) points.
Theorem (IDFO book)
If Y is Λ_L-poised for linear interpolation or regression then
$\|\nabla f(y) - \nabla m(y)\| \le \Lambda_L\,[\,C_f + \|H\|\,]\,\Delta \quad \forall y \in B(x;\Delta).$
One should build models by minimizing the norm of H.

Minimum Frobenius norm models 18/54
Using φ̄ and separating the quadratic terms, write
$m(y) = \alpha_L^\top \bar\phi_L(y) + \alpha_Q^\top \bar\phi_Q(y).$
Then, build models by minimizing the entries of the Hessian ("Frobenius norm"):
$\min \tfrac{1}{2}\|\alpha_Q\|_2^2 \quad \text{s.t.} \quad M(\bar\phi, Y)\,\alpha = f(Y).$
The solution of this convex QP requires a linear solve with
$\begin{bmatrix} M_Q M_Q^\top & M_L \\ M_L^\top & 0 \end{bmatrix}, \qquad \text{where } M(\bar\phi, Y) = \begin{bmatrix} M_L & M_Q \end{bmatrix}.$
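A Python sketch of this minimum Frobenius norm solve via the saddle-point system above, assuming the natural basis split into linear and quadratic blocks; the names `mfn_model`, `ML`, `MQ` and the example data are illustrative.

```python
import itertools
import numpy as np

def mfn_model(Y, fY):
    """Minimum Frobenius norm quadratic model:
    min 0.5 ||alpha_Q||^2  s.t.  M_L alpha_L + M_Q alpha_Q = f(Y),
    solved via the KKT system [[M_Q M_Q^T, M_L], [M_L^T, 0]]."""
    Y = np.asarray(Y, dtype=float)
    p, n = Y.shape
    # linear block {1, y_1,...,y_n}; quadratic block {y_i^2/2, y_i y_j (i<j)}
    ML = np.hstack([np.ones((p, 1)), Y])
    quad_cols = [0.5 * Y[:, i] ** 2 for i in range(n)]
    quad_cols += [Y[:, i] * Y[:, j] for i, j in itertools.combinations(range(n), 2)]
    MQ = np.column_stack(quad_cols)
    K = np.block([[MQ @ MQ.T, ML], [ML.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([fY, np.zeros(n + 1)])
    sol = np.linalg.solve(K, rhs)
    lam, alpha_L = sol[:p], sol[p:]
    alpha_Q = MQ.T @ lam          # quadratic coefficients from the multipliers
    return alpha_L, alpha_Q

# usage: p = 5 points in R^2 (fewer than N = 6), i.e. underdetermined quadratic interpolation
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
fY = np.array([1.0, 2.0, 0.5, 0.3, 1.7])
alpha_L, alpha_Q = mfn_model(Y, fY)
print(alpha_L, alpha_Q)
```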

Minimum Frobenius norm models 19/54
Theorem (IDFO book)
If Y is Λ_F-poised in the minimum Frobenius norm sense then
$\|H\| \le C_f\,\Lambda_F,$
where H is, again, the Hessian of the model.
Putting the two theorems together yields:
$\|\nabla f(y) - \nabla m(y)\| \le \Lambda_L\,[\,C_f + C_f \Lambda_F\,]\,\Delta \quad \forall y \in B(x;\Delta).$
MFN models are fully linear.

Sparsity on the Hessian 20/54
In many problems, pairs of variables have no correlation, leading to zero second order partial derivatives in f.
Thus, the Hessian $\nabla^2 m(x = 0)$ of the model (i.e., the vector α_Q in the basis φ̄) should be sparse.

Our question 21/54
Is it possible to build fully quadratic models by quadratic underdetermined interpolation (i.e., using fewer than N = O(n^2) points) in the SPARSE case?

Compressed sensing: sparse recovery 22/54
Objective: find a sparse α subject to a highly underdetermined linear system Mα = f.
$\min \|\alpha\|_0 = |\mathrm{supp}(\alpha)| \quad \text{s.t.} \quad M\alpha = f$  is NP-hard.
$\min \|\alpha\|_1 \quad \text{s.t.} \quad M\alpha = f$  often recovers sparse solutions.
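A generic sketch of the equality-constrained l1 problem written as a linear program (splitting α into nonnegative parts), using scipy.optimize.linprog as a stand-in solver; this is not the solver used in the talk, and the random test data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def l1_recover(M, f):
    """Solve min ||alpha||_1 s.t. M alpha = f as an LP via the split alpha = a_plus - a_minus."""
    p, N = M.shape
    c = np.ones(2 * N)                 # sum a_plus + a_minus equals ||alpha||_1 at an optimal vertex
    A_eq = np.hstack([M, -M])
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * N), method="highs")
    x = res.x
    return x[:N] - x[N:]

# usage: random underdetermined system with a 3-sparse ground truth
rng = np.random.default_rng(0)
p, N, s = 30, 80, 3
M = rng.standard_normal((p, N))
alpha_true = np.zeros(N)
alpha_true[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
alpha_hat = l1_recover(M, M @ alpha_true)
print(np.max(np.abs(alpha_hat - alpha_true)))   # typically near solver precision in this regime
```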

Restricted isometry property 23/54
Definition (RIP)
The RIP constant of order s of M (p × N) is the smallest δ_s such that
$(1-\delta_s)\|\alpha\|_2^2 \le \|M\alpha\|_2^2 \le (1+\delta_s)\|\alpha\|_2^2$
for all s-sparse α (i.e., ‖α‖_0 ≤ s).
Theorem (Candès, Tao, 2005, 2006)
If ᾱ is s-sparse and $2\delta_{2s} + \delta_s < 1$, then it can be recovered by l1-minimization:
$\min \|\alpha\|_1 \quad \text{s.t.} \quad M\alpha = M\bar\alpha,$
i.e., the optimal solution α* of this problem is unique and given by α* = ᾱ.

Random matrices 24/54
It is hard to find deterministic matrices that satisfy the RIP for large s.
Using random matrix theory it is possible to prove the RIP for p = O(s log N):
- Matrices with Gaussian entries.
- Matrices with Bernoulli entries.
- Uniformly chosen subsets of the discrete Fourier transform.

Bounded orthonormal expansions (Rauhut) 25/54
Question: How to find a basis φ and a sample set Y such that M(φ, Y) satisfies the RIP?
- Choose orthonormal bases.
- Avoid localized functions (‖φ_i‖_∞ should be uniformly bounded).
- Select Y randomly.

Sparse orthonormal bounded expansion recovery 26/54
Theorem (Rauhut, 2010)
If
- φ is orthonormal in a probability measure µ and ‖φ_i‖_∞ ≤ K,
- each point of Y is drawn independently according to µ,
- $\dfrac{p}{\log p} \ge c_1 K^2 s (\log s)^2 \log N,$
then, with high probability, for every s-sparse vector ᾱ:
Given noisy samples f = M(φ, Y)ᾱ + ε with ‖ε‖_2 ≤ η, let α* be the solution of
$\min \|\alpha\|_1 \quad \text{s.t.} \quad \|M(\phi, Y)\alpha - f\|_2 \le \eta.$
Then
$\|\bar\alpha - \alpha^*\|_2 \le \frac{C}{\sqrt{p}}\,\eta.$

What basis do we need for sparse Hessian recovery? 27/54
Remember the second order Taylor model (for n = 2):
$f(0)[1] + \frac{\partial f}{\partial x_1}(0)[y_1] + \frac{\partial f}{\partial x_2}(0)[y_2] + \frac{\partial^2 f}{\partial x_1^2}(0)[y_1^2/2] + \frac{\partial^2 f}{\partial x_1 \partial x_2}(0)[y_1 y_2] + \frac{\partial^2 f}{\partial x_2^2}(0)[y_2^2/2].$
So, we want something like the natural/canonical basis:
$\bar\phi = \{1,\; y_1,\ldots,y_n,\; \tfrac{1}{2}y_1^2,\ldots,\tfrac{1}{2}y_n^2,\; y_1 y_2,\ldots,y_{n-1}y_n\}.$

An orthonormal basis for quadratics (appropriate for sparse Hessian recovery) 28/54
Proposition (Bandeira, Scheinberg, and Vicente, 2010)
The following basis ψ for quadratics is orthonormal (w.r.t. the uniform measure on B_∞(0; Δ)) and satisfies ‖ψ_ι‖_∞ ≤ 3:
$\psi_0(u) = 1, \quad \psi_{1,i}(u) = \frac{\sqrt{3}}{\Delta}\,u_i, \quad \psi_{2,ij}(u) = \frac{3}{\Delta^2}\,u_i u_j \ (i<j), \quad \psi_{2,i}(u) = \frac{3\sqrt{5}}{2}\,\frac{u_i^2}{\Delta^2} - \frac{\sqrt{5}}{2}.$
ψ is very similar to the canonical basis, and preserves the sparsity of the Hessian (at 0).
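A Monte Carlo sanity check of the proposition, under the reconstruction of the formulas given above (the Δ-scalings are my reading of the garbled transcription): it estimates the Gram matrix of ψ on B_∞(0; Δ) and the sup-norm bound. The sample size and names are illustrative.

```python
import numpy as np

def psi_basis(u, delta):
    """Reconstructed orthonormal quadratic basis on B_inf(0; delta), uniform measure.
    Ordering: [1] + [linear terms] + [square terms] + [cross terms i<j]."""
    t = np.asarray(u, dtype=float) / delta
    n = len(t)
    vals = [1.0]
    vals += [np.sqrt(3.0) * ti for ti in t]
    vals += [1.5 * np.sqrt(5.0) * ti**2 - 0.5 * np.sqrt(5.0) for ti in t]
    vals += [3.0 * t[i] * t[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(vals)

# Monte Carlo check: Gram matrix ~ identity, and sup norm bounded by 3
rng = np.random.default_rng(1)
n, delta, samples = 3, 0.5, 50_000
U = rng.uniform(-delta, delta, size=(samples, n))
Psi = np.array([psi_basis(u, delta) for u in U])
gram = Psi.T @ Psi / samples
print(np.max(np.abs(gram - np.eye(gram.shape[0]))))   # small (Monte Carlo error, ~1e-2)
print(np.max(np.abs(Psi)))                            # at most about 3
```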

Hessian sparse recovery 29/54
Let us look again at
$\min \|\alpha\|_1 \quad \text{s.t.} \quad \|M(\psi, Y)\alpha - f\|_2 \le \eta, \qquad \text{where } f = M(\psi, Y)\bar\alpha + \varepsilon.$
- So, the noisy data is f = f(Y).
- What we are trying to recover is the 2nd order Taylor model ᾱᵀψ(y).
- Thus, in ‖ε‖ ≤ η, one has η = O(Δ³).

Hessian sparse recovery 30/54
Theorem (Bandeira, Scheinberg, and Vicente, 2010)
If
- the Hessian of f at 0 is s-sparse,
- Y is a random sample set chosen w.r.t. the uniform measure on B_∞(0; Δ),
- $\dfrac{p}{\log p} \ge 9 c_1 (s + n + 1)\,\log^2(s + n + 1)\,\log N, \qquad N = O(n^2),$
then, with high probability, the quadratic $q^* = \sum_\iota \alpha^*_\iota \psi_\iota$ obtained by solving the noisy l1-minimization problem is a fully quadratic model for f (with error constants not depending on Δ).

An answer to our question 31/54
For instance, when the number of non-zeros of the Hessian is s = O(n), we are able to construct fully quadratic models with O(n log^4 n) points.
Also, we recover both the function and its sparsity structure.

Open question 32/54
Generalize the result above when minimizing only the l1-norm of the Hessian part (α_Q) rather than of the whole α. Numerical simulations have shown that such an approach is (slightly) advantageous.

Remarks 33/54
However, the theorem only provides motivation because, in a practical approach, we:
- Solve $\min \|\alpha_Q\|_1 \quad \text{s.t.} \quad M(\bar\phi_L, Y)\alpha_L + M(\bar\phi_Q, Y)\alpha_Q = f(Y).$
- Deal with small n (from the DFO setting), and the bound we obtain is asymptotic.
- Use deterministic sampling.
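A sketch of the practical problem just stated (minimize ‖α_Q‖₁ subject to the interpolation conditions), cast as an LP with α_L free. The talk's implementation uses lipsol; this stand-in uses scipy's linprog, and the test function and sample points are illustrative.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def l1_min_norm_model(Y, fY):
    """min ||alpha_Q||_1  s.t.  M_L alpha_L + M_Q alpha_Q = f(Y),
    written as an LP with alpha_Q = q_plus - q_minus and alpha_L free."""
    Y = np.asarray(Y, dtype=float)
    p, n = Y.shape
    ML = np.hstack([np.ones((p, 1)), Y])
    quad_cols = [0.5 * Y[:, i] ** 2 for i in range(n)]
    quad_cols += [Y[:, i] * Y[:, j] for i, j in itertools.combinations(range(n), 2)]
    MQ = np.column_stack(quad_cols)
    nL, nQ = ML.shape[1], MQ.shape[1]
    c = np.concatenate([np.zeros(nL), np.ones(2 * nQ)])     # only the quadratic part is penalized
    A_eq = np.hstack([ML, MQ, -MQ])
    bounds = [(None, None)] * nL + [(0, None)] * (2 * nQ)
    res = linprog(c, A_eq=A_eq, b_eq=fY, bounds=bounds, method="highs")
    x = res.x
    return x[:nL], x[nL:nL + nQ] - x[nL + nQ:]              # alpha_L, alpha_Q

# usage: a function with a sparse Hessian sampled at p = 7 points in R^3 (N would be 10)
f = lambda y: 1.0 + y[0] - 2*y[2] + 4*y[1]*y[2]
Y = np.array([[0,0,0],[1,0,0],[0,1,0],[0,0,1],[1,1,0],[0,1,1],[1,0,1]], dtype=float)
alpha_L, alpha_Q = l1_min_norm_model(Y, np.array([f(y) for y in Y]))
print(alpha_L, np.round(alpha_Q, 6))   # alpha_L ~ [1, 1, 0, -2]; alpha_Q has a single nonzero (4 on y_2*y_3)
```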

Interpolation-based trust-region methods 34/54
Trust-region methods for DFO typically:
- Attempt to form quadratic models (by interpolation/regression, using polynomials or radial basis functions)
  $m_k(x_k + s) = f(x_k) + g_k^\top s + \tfrac{1}{2} s^\top H_k s$
  based on (well poised) sample sets. Well poisedness ensures fully linear or fully quadratic models.
- Calculate a step s_k by approximately solving the trust-region subproblem
  $\min_{s \in B_2(x_k; \Delta_k)} m_k(x_k + s).$

Interpolation-based trust-region methods 35/54
- Set x_{k+1} to x_k + s_k (success) or to x_k (unsuccess) and update Δ_k depending on the value of
  $\rho_k = \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)}.$
- Attempt to accept steps based on simple decrease, i.e., if
  $\rho_k > 0 \iff f(x_k + s_k) < f(x_k).$
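A stripped-down Python sketch of the ρ_k test and radius update just described, using simple linear models built from a coordinate sample set and omitting the geometry management, criticality step, and model-improvement that the actual algorithms require. All names and parameter values are illustrative, not from the talk.

```python
import numpy as np

def dfo_trust_region(f, x0, delta0=1.0, tol=1e-6, max_iter=200,
                     eta1=0.1, gamma_dec=0.5, gamma_inc=2.0):
    """Bare-bones model-based trust-region loop: linear interpolation model from
    coordinate samples, full trust-region step, rho-test, radius update."""
    x, delta = np.asarray(x0, dtype=float), delta0
    fx = f(x)
    for _ in range(max_iter):
        n = len(x)
        # linear model m(x+s) = fx + g^T s from the sample points x + delta*e_i
        g = np.array([(f(x + delta * e) - fx) / delta for e in np.eye(n)])
        if np.linalg.norm(g) <= tol and delta <= tol:
            break
        s = -delta * g / max(np.linalg.norm(g), 1e-16)   # TR step for a linear model
        f_trial = f(x + s)
        pred = -(g @ s)                                  # predicted (model) decrease, > 0 here
        rho = (fx - f_trial) / pred if pred > 0 else -np.inf
        if rho > eta1:                                   # successful: accept and expand
            x, fx, delta = x + s, f_trial, gamma_inc * delta
        else:                                            # unsuccessful: reject and shrink
            delta = gamma_dec * delta
    return x, fx

# usage on a smooth convex test function (the method never uses derivatives of f)
quad = lambda z: (z[0] - 1.0)**2 + 10.0 * (z[1] + 2.0)**2
print(dfo_trust_region(quad, [0.0, 0.0]))   # should move toward the minimizer (1, -2)
```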

Interpolation-based trust-region methods (IDFO) 36/54
- Reduce Δ_k only if ρ_k is small and the model is FL/FQ: unsuccessful iterations.
- Accept new iterates based on simple decrease (ρ_k > 0) as long as the model is FL/FQ: acceptable iterations.
- Allow for model-improving iterations (when ρ_k is not large enough and the model is not certifiably FL/FQ). Do not reduce Δ_k in these iterations.

Interpolation-based trust-region methods (IDFO) 37/54
- Incorporate a criticality step (1st or 2nd order) when the stationarity measure of the model is small: an internal cycle of reductions of Δ_k until the model is well poised in B(x_k; ‖g_k‖).

Behavior of the trust-region radius 38/54
Due to the criticality step, one has for successful iterations:
$f(x_k) - f(x_{k+1}) \ge O(\|g_k\| \min\{\|g_k\|, \Delta_k\}) \ge O(\Delta_k^2).$
Thus:
Theorem (Conn, Scheinberg, and Vicente, 2009)
The trust-region radius converges to zero: $\lim_{k \to +\infty} \Delta_k = 0.$
Similar to direct-search methods, where $\liminf_{k \to +\infty} \alpha_k = 0.$

Analysis of TR methods (1st order) 39/54
Using fully linear models:
Theorem (Conn, Scheinberg, and Vicente, 2009)
If ∇f is Lipschitz continuous and f is bounded below on L(x_0) then
$\lim_{k \to +\infty} \|\nabla f(x_k)\| = 0.$
Valid also for simple decrease (acceptable iterations).

Analysis of TR methods (2nd order) 40/54
Using fully quadratic models:
Theorem (Conn, Scheinberg, and Vicente, 2009)
If ∇²f is Lipschitz continuous and f is bounded below on L(x_0) then
$\lim_{k \to +\infty} \max\{\|\nabla f(x_k)\|,\; -\lambda_{\min}[\nabla^2 f(x_k)]\} = 0.$
Valid also for simple decrease (acceptable iterations).
Going from lim inf to lim requires changing the update of Δ_k.

Sample set management 41/54
Recently, Fasano, Morales, and Nocedal (2009) suggested a one-point exchange:
- In successful iterations:
  $Y_{k+1} = Y_k \cup \{x_k + s_k\} \setminus \{y_{out}\}, \qquad y_{out} = \arg\max_{y \in Y_k} \|y - x_k\|_2.$
- In the unsuccessful case:
  $Y_{k+1} = Y_k \cup \{x_k + s_k\} \setminus \{y_{out}\} \quad \text{if } \|y_{out} - x_k\| \ge \|s_k\|.$
- Do not perform model-improving iterations.
They observed sample sets that are not badly poised!
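A small Python sketch of this one-point exchange rule as read from the slide; the data structures, function name, and example points are illustrative.

```python
import numpy as np

def update_sample_set(Y, x_k, s_k, successful):
    """One-point exchange in the spirit of Fasano, Morales, and Nocedal (2009):
    swap in the trial point x_k + s_k for the sample point farthest from x_k."""
    x_k = np.asarray(x_k, dtype=float)
    s_k = np.asarray(s_k, dtype=float)
    Y = [np.asarray(y, dtype=float) for y in Y]
    x_new = x_k + s_k
    dists = [np.linalg.norm(y - x_k) for y in Y]
    i_out = int(np.argmax(dists))
    if successful:
        Y[i_out] = x_new                               # always exchange on success
    elif dists[i_out] >= np.linalg.norm(s_k):          # on failure, exchange only if it improves locality
        Y[i_out] = x_new
    return Y

# usage with an illustrative sample set in R^2
Y = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 3.0])]
Y = update_sample_set(Y, x_k=[0.0, 0.0], s_k=[0.2, 0.1], successful=False)
print(Y)   # the far-away point (0, 3) is replaced by the trial point (0.2, 0.1)
```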

Self-correcting geometry 42/54
Later, Scheinberg and Toint (2009) proposed:
- A self-correcting geometry approach based on one-point exchanges, globally convergent to first-order stationary points (lim inf).
- In the unsuccessful case, y_out is chosen based not only on ‖y − x_k‖_2 but also on the values of the Lagrange polynomials at x_k + s_k.
- They showed that, if Δ_k is small compared to ‖g_k‖, then the step either improves the function or the geometry/poisedness of the model.
- In their approach, model-improving iterations are not needed.
- They showed, however, that the criticality step is indeed necessary.

Without criticality step... (Scheinberg and Toint) 43/54-46/54 (figures only)

A practical interpolation-based trust-region method 47/54
Model building:
- If |Y_k| = N = O(n^2), use determined quadratic interpolation.
- Otherwise use l1 (p = 1) or Frobenius (p = 2) minimum-norm quadratic interpolation:
  $\min \tfrac{1}{p}\|\alpha_Q\|_p^p \quad \text{s.t.} \quad M(\bar\phi_L, Y_{\mathrm{scaled}})\alpha_L + M(\bar\phi_Q, Y_{\mathrm{scaled}})\alpha_Q = f(Y_{\mathrm{scaled}}).$
Sample set update (one starts with |Y_0| = O(n)):
- If |Y_k| < N = O(n^2), set Y_{k+1} = Y_k ∪ {x_k + s_k}.
- Otherwise, as in Fasano et al., but with y_out = argmax ‖y − x_{k+1}‖_2.
Criticality step: if Δ_k is very small, discard points far away from the trust region.
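An illustrative Python sketch combining the sample-set update and the "discard far points" criticality rule just described; the thresholds (crit_tol, crit_factor) are made-up parameters, not from the talk.

```python
import numpy as np

def practical_sample_update(Y, x_new, delta_k, n, crit_tol=1e-5, crit_factor=100.0):
    """Sample-set management sketch for the practical method:
    append the trial point while |Y| < N = (n+1)(n+2)/2, otherwise swap it for the
    point farthest from x_new (= x_{k+1}); when delta_k is tiny, drop far-away points."""
    N = (n + 1) * (n + 2) // 2
    Y = [np.asarray(y, dtype=float) for y in Y]
    x_new = np.asarray(x_new, dtype=float)
    if len(Y) < N:
        Y.append(x_new)                                   # growing phase: from |Y_0| = O(n) up to N
    else:
        dists = [np.linalg.norm(y - x_new) for y in Y]
        Y[int(np.argmax(dists))] = x_new                  # one-point exchange w.r.t. x_{k+1}
    if delta_k < crit_tol:                                # "criticality step": keep only nearby points
        Y = [y for y in Y if np.linalg.norm(y - x_new) <= crit_factor * delta_k]
    return Y
```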

Performance profiles (accuracy of 10^{-4} in function values) 48/54
Figure: Performance profiles comparing DFO-TR (l1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).

Performance profiles (accuracy of 10^{-6} in function values) 49/54
Figure: Performance profiles comparing DFO-TR (l1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).

Concluding remarks 50/54
- Optimization is a fundamental tool in Compressed Sensing. However, this work shows that CS can also be applied to Optimization.
- In a sparse scenario, we were able to construct fully quadratic models with samples of size O(n log^4 n) instead of the classical O(n^2).
- We proposed a practical DFO method (using l1-minimization) that was able to outperform state-of-the-art methods in several numerical tests (in the already tough DFO scenario where n is small).

Open questions 51/54
- Improve the efficiency of the model l1-minimization by properly warmstarting it (currently we solve it as an LP using lipsol by Y. Zhang).
- Study the convergence properties of possibly stochastic interpolation-based trust-region methods.
- Investigate l1-minimization techniques in statistical models (like Kriging for interpolating sparse data sets), but applied to Optimization.
- Develop a globally convergent model-based trust-region method for non-smooth functions.

References 52/54
- A. Bandeira, K. Scheinberg, and L. N. Vicente, Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization, in preparation, 2010.
- A. R. Conn, K. Scheinberg, and L. N. Vicente, Global convergence of general derivative-free trust-region algorithms to first and second order critical points, SIAM J. Optim., 20 (2009) 387-415.
- G. Fasano, J. L. Morales, and J. Nocedal, On the geometry phase in model-based algorithms for derivative-free optimization, Optim. Methods Softw., 24 (2009) 145-154.
- K. Scheinberg and Ph. L. Toint, Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization, 2009.

Other recent work 53/54
- S. Gratton, Ph. L. Toint, and A. Tröltzsch, An active-set trust-region method for derivative-free nonlinear bound-constrained optimization, 2010. Talk WA02, 11:30.
- S. Gratton and L. N. Vicente, A surrogate management framework using rigorous trust-region steps, 2010.
- M. J. D. Powell, The BOBYQA algorithm for bound constrained optimization without derivatives, 2009.
- S. M. Wild and C. Shoemaker, Global convergence of radial basis function trust region derivative-free algorithms, 2009.

Optimization 2011 (July 24-27, Portugal) 54/54