Interpolation-Based Trust-Region Methods for DFO
Luis Nunes Vicente, University of Coimbra (joint work with A. Bandeira, A. R. Conn, S. Gratton, and K. Scheinberg). July 27, 2010, ICCOPT, Santiago.
Why Derivative-Free Optimization?
Some of the reasons to apply derivative-free optimization are the following:
- Nowadays computer hardware and mathematical algorithms allow increasingly large simulations.
- Functions are noisy (one cannot trust derivatives or approximate them by finite differences).
- Binary codes (source code not available) and random simulations make automatic differentiation impossible to apply.
- Legacy codes (written in the past and not maintained by the original authors).
- Lack of sophistication of the user (users need improvement but want to use something simple).
Limitations of Derivative-Free Optimization
In DFO, convergence/stopping is typically slow (per function evaluation).
Recent advances in direct-search methods
For a recent talk on Direct Search (8th EUROPT, 2010) see:
- Ana Luisa Custodio, Talk WA04, 10:30.
- Ismael Vaz, Talk WB04, 13:30.
The book!
A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 2009.
Model-based trust-region methods
One typically minimizes a model m in a trust region B_p(x; Δ):
Trust-region subproblem:  min m(y)  s.t.  y ∈ B_p(x; Δ).
In derivative-based optimization, one could use:
- 1st-order Taylor:  m(y) = f(x) + ∇f(x)ᵀ(y − x).
- Quasi-Newton (with H ≈ ∇²f(x)):  m(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀH(y − x).
- 2nd-order Taylor:  m(y) = f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ∇²f(x)(y − x).
Fully linear models
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully linear if:
- It is C¹ with Lipschitz continuous gradient.
- The following error bounds hold:
  ‖∇f(y) − ∇m(y)‖ ≤ κ_eg Δ,   ∀y ∈ B(x; Δ),
  |f(y) − m(y)| ≤ κ_ef Δ²,   ∀y ∈ B(x; Δ).
For a class of fully linear models, the (unknown) constants κ_ef, κ_eg > 0 must be independent of x and Δ.
Fully linear models can be quadratic (or even nonlinear).
Fully quadratic models
Given a point x and a trust-region radius Δ, a model m(y) around x is called fully quadratic if:
- It is C² with Lipschitz continuous Hessian.
- The following error bounds hold:
  ‖∇²f(y) − ∇²m(y)‖ ≤ κ_eh Δ,   ∀y ∈ B(x; Δ),
  ‖∇f(y) − ∇m(y)‖ ≤ κ_eg Δ²,   ∀y ∈ B(x; Δ),
  |f(y) − m(y)| ≤ κ_ef Δ³,   ∀y ∈ B(x; Δ).
For a class of fully quadratic models, the (unknown) constants κ_ef, κ_eg, κ_eh > 0 must be independent of x and Δ.
Fully quadratic models are only necessary for global convergence to 2nd-order stationary points.
Polynomial interpolation models
Given a sample set Y = {y⁰, y¹, ..., y^p}, a polynomial basis φ, and a polynomial model m(y) = αᵀφ(y), the interpolating conditions form the linear system
M(φ, Y) α = f(Y),
where
M(φ, Y) = [ φ₀(y⁰) φ₁(y⁰) ... φ_p(y⁰) ; φ₀(y¹) φ₁(y¹) ... φ_p(y¹) ; ... ; φ₀(y^p) φ₁(y^p) ... φ_p(y^p) ],
f(Y) = ( f(y⁰), f(y¹), ..., f(y^p) )ᵀ.
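As a concrete illustration of the interpolation system (not from the slides: the sample points and test function below are made up), here is a minimal sketch that builds M(φ, Y) for the quadratic natural basis in n = 2 and solves the determined system for the model coefficients.

```python
import numpy as np

def phi_quad_2d(y):
    """Natural quadratic basis in n = 2: {1, y1, y2, y1^2/2, y1*y2, y2^2/2}."""
    y1, y2 = y
    return np.array([1.0, y1, y2, 0.5 * y1**2, y1 * y2, 0.5 * y2**2])

def interpolation_model(Y, fY):
    """Solve M(phi, Y) alpha = f(Y) for the model coefficients alpha."""
    M = np.array([phi_quad_2d(y) for y in Y])   # rows: phi evaluated at each y^i
    return np.linalg.solve(M, fY)

# A poised set of N = (n + 1)(n + 2)/2 = 6 points (principal lattice of degree 2).
Y = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (0, 2)]
f = lambda y: 1 + 2 * y[0] + 3 * y[1] + y[0]**2 + y[0] * y[1] + 0.5 * y[1]**2
alpha = interpolation_model(Y, np.array([f(y) for y in Y]))
# In this basis the exact coefficients of f are (1, 2, 3, 2, 1, 1).
```

Since f itself is quadratic and Y is poised, the solve reproduces its coefficients exactly.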
Natural/canonical basis
The natural/canonical basis appears in a Taylor expansion and is given by:
φ̄ = {1, y₁, ..., y_n, (1/2)y₁², ..., (1/2)y_n², y₁y₂, ..., y_{n−1}y_n}.
Under appropriate smoothness, the second-order Taylor model, centered at 0 (here for n = 2), is:
f(0)[1] + ∂f/∂x₁(0)[y₁] + ∂f/∂x₂(0)[y₂] + ∂²f/∂x₁²(0)[y₁²/2] + ∂²f/∂x₁∂x₂(0)[y₁y₂] + ∂²f/∂x₂²(0)[y₂²/2].
Well poisedness (Λ-poisedness)
Λ is a poisedness constant related to the geometry of Y.
The original definition of Λ-poisedness says that the maximum absolute value of the Lagrange polynomials in B(x; Δ) is bounded by Λ.
An equivalent definition of Λ-poisedness (in the determined case |Y| = |α|) is
‖M(φ̄, Y_scaled)⁻¹‖ ≤ Λ,
with Y_scaled obtained from Y such that Y_scaled ⊂ B(0; 1).
Non-square cases are defined analogously (IDFO).
Examples (figures omitted): a badly poised set; a not so badly poised set (Λ = 440); another badly poised set; an ideal set (Λ = 1).
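Poisedness can be estimated numerically from the matrix characterization ‖M(φ̄, Y_scaled)⁻¹‖ ≤ Λ. A sketch for linear interpolation (basis {1, y₁, y₂}); the two example sets below are made up for illustration and already scaled into the unit ball.

```python
import numpy as np

def poisedness_estimate(Y):
    """Estimate Lambda as ||M(phi, Y)^{-1}||_2 for the linear basis {1, y1, y2}.

    Y is assumed to already lie in the unit ball B(0; 1)."""
    M = np.array([[1.0, y[0], y[1]] for y in Y])
    return np.linalg.norm(np.linalg.inv(M), 2)

good = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]     # well-spread points
bad = [(0.0, 0.0), (1.0, 0.0), (0.99, 0.01)]    # nearly collinear points
lam_good = poisedness_estimate(good)
lam_bad = poisedness_estimate(bad)
# lam_bad is orders of magnitude larger than lam_good.
```

The nearly collinear set produces an almost singular M, hence a huge Λ, matching the badly poised pictures.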
Quadratic interpolation models
The system M(φ, Y)α = f(Y) can be:
- Overdetermined, when |Y| > |α|. See talk on Direct Search!
- Determined, when |Y| = |α|. For M(φ, Y) to be square one needs N = (n + 2)(n + 1)/2 evaluations of f (often too expensive). Leads to fully quadratic models when Y is well poised (the constants κ in the error bounds will depend on Λ).
- Underdetermined, when |Y| < |α|. Minimum Frobenius norm models (Powell, IDFO book). Other approaches?...
Underdetermined quadratic models
Let m be an underdetermined quadratic model (with Hessian H) built with fewer than N = O(n²) points.
Theorem (IDFO book). If Y is Λ_L-poised for linear interpolation or regression, then
‖∇f(y) − ∇m(y)‖ ≤ Λ_L [C_f + ‖H‖] Δ,   ∀y ∈ B(x; Δ).
One should build models by minimizing the norm of H.
Minimum Frobenius norm models
Using φ̄ and separating the quadratic terms, write m(y) = α_Lᵀ φ̄_L(y) + α_Qᵀ φ̄_Q(y).
Then, build models by minimizing the entries of the Hessian ("Frobenius norm"):
min (1/2)‖α_Q‖₂²  s.t.  M(φ̄, Y)α = f(Y).
The solution of this convex QP problem requires a linear solve with
[ M_Q M_Qᵀ   M_L ]
[ M_Lᵀ        0  ]
where M(φ̄, Y) = [ M_L  M_Q ].
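A minimal sketch of that solve (n = 2, with made-up sample points and test function): form M_L and M_Q, solve the KKT system above for the multipliers λ and α_L, and recover α_Q = M_Qᵀλ from the stationarity condition of the QP.

```python
import numpy as np

def mfn_model(Y, fY):
    """Minimum Frobenius norm model: min (1/2)||alpha_Q||^2 s.t. interpolation.

    phi_L = {1, y1, y2}, phi_Q = {y1^2/2, y1*y2, y2^2/2}."""
    ML = np.array([[1.0, y[0], y[1]] for y in Y])
    MQ = np.array([[0.5 * y[0]**2, y[0] * y[1], 0.5 * y[1]**2] for y in Y])
    p1, qL = ML.shape
    # KKT system: [M_Q M_Q^T  M_L; M_L^T  0] [lam; alpha_L] = [f(Y); 0].
    K = np.block([[MQ @ MQ.T, ML], [ML.T, np.zeros((qL, qL))]])
    sol = np.linalg.solve(K, np.concatenate([fY, np.zeros(qL)]))
    lam, alpha_L = sol[:p1], sol[p1:]
    alpha_Q = MQ.T @ lam          # stationarity: alpha_Q = M_Q^T lam
    return alpha_L, alpha_Q, ML, MQ

Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # 4 points < 6 coefficients
fY = np.array([y[0] * y[1] + y[0] for y in Y])
alpha_L, alpha_Q, ML, MQ = mfn_model(Y, fY)
residual = ML @ alpha_L + MQ @ alpha_Q - fY            # interpolation conditions
```

Even though the system is underdetermined, the model interpolates f on all of Y while keeping the Hessian entries small.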
Minimum Frobenius norm models
Theorem (IDFO book). If Y is Λ_F-poised in the minimum Frobenius norm sense, then
‖H‖ ≤ C_f Λ_F,
where H is, again, the Hessian of the model.
Putting the two theorems together yields:
‖∇f(y) − ∇m(y)‖ ≤ Λ_L [C_f + C_f Λ_F] Δ,   ∀y ∈ B(x; Δ).
MFN models are fully linear.
Sparsity in the Hessian
In many problems, pairs of variables have no correlation, leading to zero second-order partial derivatives in f. Thus, the Hessian ∇²m(0) of the model (i.e., the vector α_Q in the basis φ̄) should be sparse.
Our question
Is it possible to build fully quadratic models by quadratic underdetermined interpolation (i.e., using fewer than N = O(n²) points) in the SPARSE case?
Compressed sensing / sparse recovery
Objective: find a sparse α subject to a highly underdetermined linear system Mα = f.
- min ‖α‖₀ = |supp(α)|  s.t.  Mα = f   is NP-hard.
- min ‖α‖₁  s.t.  Mα = f   often recovers sparse solutions.
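The ℓ₁ problem is a linear program: writing α = u − v with u, v ≥ 0 turns min ‖α‖₁ s.t. Mα = f into min Σ(uᵢ + vᵢ) subject to M(u − v) = f. A sketch using SciPy's linprog; the dimensions, seed, and sparse ground truth below are made up for the example.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(M, b):
    """min ||alpha||_1 s.t. M alpha = b, via the standard LP reformulation."""
    p, N = M.shape
    c = np.ones(2 * N)                    # objective: sum(u) + sum(v)
    A_eq = np.hstack([M, -M])             # M u - M v = b
    res = linprog(c, A_eq=A_eq, b_eq=b)   # default bounds give u, v >= 0
    assert res.success
    return res.x[:N] - res.x[N:]          # alpha = u - v

rng = np.random.default_rng(0)
N, p = 20, 10
abar = np.zeros(N)
abar[3], abar[11] = 1.5, -2.0             # 2-sparse ground truth
M = rng.standard_normal((p, N))           # Gaussian measurements (see next slides)
alpha = l1_min(M, M @ abar)
# alpha is feasible, and its l1 norm is no larger than that of abar.
```

With Gaussian M and s small relative to p, the ℓ₁ minimizer typically coincides with the sparse ᾱ, which is exactly the phenomenon the RIP theory below explains.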
Restricted isometry property
Definition (RIP). The RIP constant of order s of M (p × N) is the smallest δ_s such that
(1 − δ_s)‖α‖₂² ≤ ‖Mα‖₂² ≤ (1 + δ_s)‖α‖₂²
for all s-sparse α (‖α‖₀ ≤ s).
Theorem (Candès and Tao, 2005, 2006). If ᾱ is s-sparse and 2δ_{2s} + δ_s < 1, then it can be recovered by ℓ₁-minimization:
min ‖α‖₁  s.t.  Mα = Mᾱ,
i.e., the optimal solution α* of this problem is unique and given by α* = ᾱ.
Random matrices
It is hard to find deterministic matrices that satisfy the RIP for large s. Using random matrix theory it is possible to prove the RIP for p = O(s log N):
- Matrices with Gaussian entries.
- Matrices with Bernoulli entries.
- Uniformly chosen subsets of the discrete Fourier transform.
Bounded orthonormal expansions (Rauhut)
Question: how to find a basis φ and a sample set Y such that M(φ, Y) satisfies the RIP?
- Choose orthonormal bases.
- Avoid localized functions (‖φ_i‖_{L∞} should be uniformly bounded).
- Select Y randomly.
Sparse orthonormal bounded expansion recovery
Theorem (Rauhut, 2010). If
- φ is orthonormal in a probability measure µ and ‖φ_i‖_{L∞} ≤ K,
- each point of Y is drawn independently according to µ,
- p / log p ≥ c₁ K² s (log s)² log N,
then, with high probability, the following holds for every s-sparse vector ᾱ: given noisy samples f = M(φ, Y)ᾱ + ε with ‖ε‖₂ ≤ η, let α* be the solution of
min ‖α‖₁  s.t.  ‖M(φ, Y)α − f‖₂ ≤ η.
Then ‖ᾱ − α*‖₂ ≤ (C/√p) η.
What basis do we need for sparse Hessian recovery?
Remember the second-order Taylor model (for n = 2):
f(0)[1] + ∂f/∂x₁(0)[y₁] + ∂f/∂x₂(0)[y₂] + ∂²f/∂x₁²(0)[y₁²/2] + ∂²f/∂x₁∂x₂(0)[y₁y₂] + ∂²f/∂x₂²(0)[y₂²/2].
So, we want something like the natural/canonical basis:
φ̄ = {1, y₁, ..., y_n, (1/2)y₁², ..., (1/2)y_n², y₁y₂, ..., y_{n−1}y_n}.
An orthonormal basis for quadratics (appropriate for sparse Hessian recovery)
Proposition (Bandeira, Scheinberg, and Vicente, 2010). The following basis ψ for quadratics is orthonormal (w.r.t. the uniform measure on B∞(0; Δ)) and satisfies ‖ψ_ι‖_{L∞} ≤ 3:
ψ₀(u) = 1,
ψ_{1,i}(u) = (√3/Δ) u_i,
ψ_{2,ij}(u) = (3/Δ²) u_i u_j   (i < j),
ψ_{2,i}(u) = (3√5/(2Δ²)) u_i² − √5/2.
ψ is very similar to the canonical basis, and preserves the sparsity of the Hessian (at 0).
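A quick Monte Carlo sanity check of this basis in n = 2. Caveat: the explicit normalization constants used below (√3/Δ, 3/Δ², (√5/2)(3u_i²/Δ² − 1)) are a reconstruction from a garbled transcription, so treat them as assumptions; the check is that the empirical Gram matrix under the uniform measure on B∞(0; Δ) is close to the identity.

```python
import numpy as np

n, Delta = 2, 0.5
rng = np.random.default_rng(0)
U = rng.uniform(-Delta, Delta, size=(400_000, n))   # uniform on B_inf(0; Delta)
T = U / Delta                                       # rescale to [-1, 1]^n

cols = [np.ones(len(U))]                            # psi_0
cols += [np.sqrt(3) * T[:, i] for i in range(n)]    # psi_{1,i}
cols += [3 * T[:, i] * T[:, j]                      # psi_{2,ij}, i < j
         for i in range(n) for j in range(i + 1, n)]
cols += [(np.sqrt(5) / 2) * (3 * T[:, i]**2 - 1)    # psi_{2,i}
         for i in range(n)]
Psi = np.column_stack(cols)

gram = Psi.T @ Psi / len(U)   # Monte Carlo estimate of E[psi_a psi_b]
# gram is approximately the 6 x 6 identity matrix.
```

The diagonal entries estimate the squared L² norms (all 1) and the off-diagonal entries the pairwise inner products (all 0), consistent with orthonormality.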
Hessian sparse recovery
Let us look again at
min ‖α‖₁  s.t.  ‖M(ψ, Y)α − f‖₂ ≤ η,   where f = M(ψ, Y)ᾱ + ε.
- So, the noisy data is f = f(Y).
- What we are trying to recover is the 2nd-order Taylor model ᾱᵀψ(y).
- Thus, in ‖ε‖₂ ≤ η, one has η = O(Δ³).
Hessian sparse recovery
Theorem (Bandeira, Scheinberg, and Vicente, 2010). If
- the Hessian of f at 0 is s-sparse,
- Y is a random sample set chosen w.r.t. the uniform measure on B∞(0; Δ),
- p / log p ≥ 9 c₁ (s + n + 1) log²(s + n + 1) log O(n²),
then, with high probability, the quadratic q = Σ_ι α*_ι ψ_ι obtained by solving the noisy ℓ₁-minimization problem is a fully quadratic model for f (with error constants not depending on Δ).
An answer to our question
For instance, when the number of non-zeros of the Hessian is s = O(n), we are able to construct fully quadratic models with O(n log⁴ n) points. Also, we recover both the function and its sparsity structure.
Open question
Generalize the result above when minimizing only the ℓ₁-norm of the Hessian part (α_Q) rather than of the whole α. Numerical simulations have shown that such an approach is (slightly) advantageous.
Remarks
However, the theorem only provides motivation because, in a practical approach, we:
- Solve min ‖α_Q‖₁  s.t.  M(φ̄_L, Y)α_L + M(φ̄_Q, Y)α_Q = f(Y).
- Deal with small n (from the DFO setting), while the bound we obtain is asymptotic.
- Use deterministic sampling.
Interpolation-based trust-region methods
Trust-region methods for DFO typically:
- Attempt to form quadratic models (by interpolation/regression, using polynomials or radial basis functions)
  m_k(x_k + s) = f(x_k) + g_kᵀ s + (1/2) sᵀ H_k s
  based on (well poised) sample sets. Well poisedness ensures fully linear or fully quadratic models.
- Calculate a step s_k by approximately solving the trust-region subproblem
  min m_k(x_k + s)  s.t.  s ∈ B₂(x_k; Δ_k).
- Set x_{k+1} to x_k + s_k (success) or to x_k (unsuccess) and update Δ_k depending on the value of
  ρ_k = [f(x_k) − f(x_k + s_k)] / [m_k(x_k) − m_k(x_k + s_k)].
- Attempt to accept steps based on simple decrease, i.e., accept if ρ_k > 0 ⟺ f(x_k + s_k) < f(x_k).
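The ρ-test and radius update can be sketched as follows. For illustration only, the model here is the exact quadratic of f(x) = ‖x‖² and the step is the model minimizer in the ball; in DFO the model would instead come from interpolation, with the FL/FQ safeguards, model-improving iterations, and criticality step discussed next.

```python
import numpy as np

def tr_sketch(x0, delta0=1.0, eta=0.1, max_iter=50):
    """Bare-bones trust-region loop: step, rho test, accept/reject, radius update."""
    f = lambda z: z @ z
    x = np.asarray(x0, dtype=float)
    delta = delta0
    for _ in range(max_iter):
        if np.linalg.norm(x) < 1e-12:
            break                         # model gradient ~ 0: stop
        # Minimizer of the (here exact) model within B_2(x; delta): move toward 0.
        s = -x if np.linalg.norm(x) <= delta else -delta * x / np.linalg.norm(x)
        pred = f(x) - f(x + s)            # predicted decrease (= actual, model exact)
        rho = (f(x) - f(x + s)) / pred    # rho_k as on the slide
        if rho > eta:
            x, delta = x + s, 2 * delta   # success: accept the step, expand
        else:
            delta *= 0.5                  # unsuccess: reject, shrink
    return x

x_star = tr_sketch([2.0, 1.0])
```

With an exact model ρ_k = 1 on every iteration, so the loop accepts each step and drives x to the minimizer.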
Interpolation-based trust-region methods (IDFO)
- Reduce Δ_k only if ρ_k is small and the model is FL/FQ: unsuccessful iterations.
- Accept new iterates based on simple decrease (ρ_k > 0) as long as the model is FL/FQ: acceptable iterations.
- Allow for model-improving iterations (when ρ_k is not large enough and the model is not certifiably FL/FQ). Do not reduce Δ_k.
- Incorporate a criticality step (1st or 2nd order) when the stationarity of the model is small: an internal cycle of reductions of Δ_k until the model is well poised in B(x_k; ‖g_k‖).
Behavior of the trust-region radius
Due to the criticality step, one has for successful iterations:
f(x_k) − f(x_{k+1}) ≥ O(‖g_k‖ min{‖g_k‖, Δ_k}) ≥ O(Δ_k²).
Thus:
Theorem (Conn, Scheinberg, and Vicente, 2009). The trust-region radius converges to zero: lim_{k→+∞} Δ_k = 0.
Similar to direct-search methods, where lim inf_{k→+∞} α_k = 0.
Analysis of TR methods (1st order)
Using fully linear models:
Theorem (Conn, Scheinberg, and Vicente, 2009). If ∇f is Lipschitz continuous and f is bounded below on L(x₀), then
lim_{k→+∞} ‖∇f(x_k)‖ = 0.
Valid also for simple decrease (acceptable iterations).
Analysis of TR methods (2nd order)
Using fully quadratic models:
Theorem (Conn, Scheinberg, and Vicente, 2009). If ∇²f is Lipschitz continuous and f is bounded below on L(x₀), then
lim_{k→+∞} max{ ‖∇f(x_k)‖, −λ_min[∇²f(x_k)] } = 0.
Valid also for simple decrease (acceptable iterations).
Going from lim inf to lim requires changing the update of Δ_k.
Sample set management
Recently, Fasano, Morales, and Nocedal (2009) suggested a one-point exchange:
- In successful iterations:
  Y_{k+1} = Y_k ∪ {x_k + s_k} \ {y_out},   where y_out = argmax_{y ∈ Y_k} ‖y − x_k‖₂.
- In the unsuccessful case:
  Y_{k+1} = Y_k ∪ {x_k + s_k} \ {y_out}   if ‖y_out − x_k‖ ≥ ‖s_k‖.
- Do not perform model-improving iterations.
- They observed sample sets that were not badly poised!
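A sketch of this exchange rule; the function name and data layout are illustrative, and the unsuccessful-case condition (exchange only if the farthest point is at least ‖s_k‖ from x_k) is read off the slide above.

```python
import numpy as np

def one_point_exchange(Y, x_k, s_k, success):
    """Fasano-Morales-Nocedal style update: swap the farthest point for x_k + s_k."""
    x_k, s_k = np.asarray(x_k, dtype=float), np.asarray(s_k, dtype=float)
    Y = [np.asarray(y, dtype=float) for y in Y]
    dists = [np.linalg.norm(y - x_k) for y in Y]
    i_out = int(np.argmax(dists))          # y_out = argmax ||y - x_k||_2
    if success or dists[i_out] >= np.linalg.norm(s_k):
        Y = Y[:i_out] + Y[i_out + 1:] + [x_k + s_k]
    return Y

Y0 = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Y1 = one_point_exchange(Y0, [0.0, 0.0], [0.1, 0.0], success=True)
# Y1 keeps |Y| = 3 and contains the trial point [0.1, 0.0].
```

On an unsuccessful step with every sample point closer to x_k than ‖s_k‖, the set is left unchanged, so nearby geometry is not destroyed by short rejected steps.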
Self-correcting geometry
Later, Scheinberg and Toint (2009) proposed:
- A self-correcting geometry approach based on one-point exchanges, globally convergent to first-order stationary points (lim inf).
- In the unsuccessful case, y_out is selected based not only on ‖y − x_k‖₂ but also on the values of the Lagrange polynomials at x_k + s_k.
- They showed that, if Δ_k is small compared to ‖g_k‖, then the step either improves the function or the geometry/poisedness of the model.
- In their approach, model-improving iterations are not needed.
- They showed, however, that the criticality step is indeed necessary.
129 Without criticality step... (Scheinberg and Toint) 43/54
130 Without criticality step... (Scheinberg and Toint) 44/54
131 Without criticality step... (Scheinberg and Toint) 45/54
132 Without criticality step... (Scheinberg and Toint) 46/54
133 A practical interpolation-based trust-region method 47/54
Model building:
If |Y_k| = N = O(n^2), use determined quadratic interpolation.
Otherwise use l1 (p = 1) or Frobenius (p = 2) minimum-norm quadratic interpolation:
min (1/p) ||α_Q||_p^p  s.t.  M(φ_L, Y_scaled) α_L + M(φ_Q, Y_scaled) α_Q = f(Y_scaled).
Sample set update (one starts with |Y_0| = O(n)):
If |Y_k| < N = O(n^2), set Y_{k+1} = Y_k ∪ {x_k + s_k}.
Otherwise as in Fasano et al., but with y_out = argmax_{y ∈ Y_k} ||y − x_{k+1}||_2.
Criticality step: if Δ_k is very small, discard points far away from the trust region.
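The p = 2 (minimum Frobenius norm) subproblem above is an equality-constrained least-norm problem and can be sketched through its KKT system. A minimal sketch, assuming the interpolation matrices `ML` and `MQ` (mirroring M(φ_L, ·) and M(φ_Q, ·) from the slide) are given:

```python
import numpy as np

def min_frobenius_model(ML, MQ, f):
    """Sketch of minimum-norm quadratic interpolation (the p = 2 case):
        min (1/2)||alpha_Q||_2^2   s.t.   ML @ alpha_L + MQ @ alpha_Q = f.
    The KKT conditions give alpha_Q = MQ.T @ lam, leaving a linear
    system in the multipliers lam and the free linear part alpha_L."""
    m, nL = ML.shape
    K = np.block([[MQ @ MQ.T, ML],
                  [ML.T, np.zeros((nL, nL))]])
    rhs = np.concatenate([f, np.zeros(nL)])
    sol = np.linalg.solve(K, rhs)
    lam, alpha_L = sol[:m], sol[m:]
    return alpha_L, MQ.T @ lam

# 1D check: interpolating f(y) = y^2 at Y = {0, 1, 2} with basis
# phi_L = (1, y), phi_Q = (y^2/2) should recover alpha_Q = 2, alpha_L = 0.
Yv = np.array([0.0, 1.0, 2.0])
ML = np.column_stack([np.ones(3), Yv])
MQ = (Yv ** 2 / 2).reshape(-1, 1)
aL, aQ = min_frobenius_model(ML, MQ, Yv ** 2)
```

Minimizing only the quadratic block while leaving α_L free is what distinguishes this from a plain least-norm solve over all coefficients.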
140 Performance profiles (accuracy of 10 4 in function values) 48/54 Figure: Performance profiles comparing DFO-TR (l 1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).
141 Performance profiles (accuracy of 10 6 in function values) 49/54 Figure: Performance profiles comparing DFO-TR (l 1 and Frobenius) and NEWUOA (Powell) in a test set from CUTEr (Fasano et al.).
142 Concluding remarks 50/54
Optimization is a fundamental tool in Compressed Sensing. However, this work shows that CS can also be applied to Optimization.
In a sparse scenario, we were able to construct fully quadratic models with samples of size O(n log^4 n) instead of the classical O(n^2).
We proposed a practical DFO method (using l1-minimization) that was able to outperform state-of-the-art methods in several numerical tests (in the already tough DFO scenario where n is small).
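The l1-minimization used to build these sparse models is a linear program. A minimal sketch with SciPy's `linprog` standing in for lipsol (an assumption; the talk solves the same LP with lipsol), using the standard split α_Q = u − v with u, v ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def l1_quadratic_model(ML, MQ, f):
    """Sketch of l1 minimum-norm quadratic interpolation (p = 1 case):
        min ||alpha_Q||_1   s.t.   ML @ alpha_L + MQ @ alpha_Q = f,
    posed as an LP via alpha_Q = u - v with u, v >= 0."""
    m, nL = ML.shape
    nQ = MQ.shape[1]
    c = np.concatenate([np.zeros(nL), np.ones(2 * nQ)])     # min sum(u) + sum(v)
    A_eq = np.hstack([ML, MQ, -MQ])                         # ML aL + MQ (u - v) = f
    bounds = [(None, None)] * nL + [(0, None)] * (2 * nQ)   # alpha_L is free
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=bounds)
    alpha_L = res.x[:nL]
    alpha_Q = res.x[nL:nL + nQ] - res.x[nL + nQ:]
    return alpha_L, alpha_Q

# 2D example: f(x) = x1^2 sampled at 5 points; the quadratic block has a
# single nonzero coefficient, which the l1 solution recovers.
Y = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0]], dtype=float)
ML = np.column_stack([np.ones(len(Y)), Y])                          # basis 1, x1, x2
MQ = np.column_stack([Y[:, 0]**2 / 2, Y[:, 0]*Y[:, 1], Y[:, 1]**2 / 2])
f = Y[:, 0] ** 2
aL, aQ = l1_quadratic_model(ML, MQ, f)
```

With fewer interpolation points than quadratic basis functions, the l1 objective promotes exactly the Hessian sparsity the concluding remarks exploit.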
145 Open questions 51/54
Improve the efficiency of the model l1-minimization by properly warmstarting it (currently we solve it as an LP using lipsol by Y. Zhang).
Study the convergence properties of possibly stochastic interpolation-based trust-region methods.
Investigate l1-minimization techniques in statistical models (like Kriging for interpolating sparse data sets), but applied to Optimization.
Develop a globally convergent model-based trust-region method for non-smooth functions.
149 References 52/54
A. Bandeira, K. Scheinberg, and L. N. Vicente, Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization, in preparation.
A. R. Conn, K. Scheinberg, and L. N. Vicente, Global convergence of general derivative-free trust-region algorithms to first and second order critical points, SIAM J. Optim., 20 (2009).
G. Fasano, J. L. Morales, and J. Nocedal, On the geometry phase in model-based algorithms for derivative-free optimization, Optim. Methods Softw., 24 (2009).
K. Scheinberg and Ph. L. Toint, Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization, 2009.
150 Other recent work 53/54
S. Gratton, Ph. L. Toint, and A. Tröltzsch, An active-set trust-region method for derivative-free nonlinear bound-constrained optimization, Talk WA02 11:30.
S. Gratton and L. N. Vicente, A surrogate management framework using rigorous trust-region steps, 2010.
M. J. D. Powell, The BOBYQA algorithm for bound constrained optimization without derivatives, 2009.
S. M. Wild and C. Shoemaker, Global convergence of radial basis function trust region derivative-free algorithms, 2009.
154 Optimization 2011 (July 24 27, Portugal) 54/54
More informationThree Generalizations of Compressed Sensing
Thomas Blumensath School of Mathematics The University of Southampton June, 2010 home prev next page Compressed Sensing and beyond y = Φx + e x R N or x C N x K is K-sparse and x x K 2 is small y R M or
More informationAlgorithms for Constrained Optimization
1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic
More informationA globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications
A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationLarge-Scale L1-Related Minimization in Compressive Sensing and Beyond
Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Yin Zhang Department of Computational and Applied Mathematics Rice University, Houston, Texas, U.S.A. Arizona State University March
More informationCompressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery
Compressibility of Infinite Sequences and its Interplay with Compressed Sensing Recovery Jorge F. Silva and Eduardo Pavez Department of Electrical Engineering Information and Decision Systems Group Universidad
More information