A First-Order Framework for Solving Ellipsoidal Inclusion and Optimal Design Problems. Selin Damla Ahipaşaoğlu.
1 A First-Order Framework for Solving Ellipsoidal Inclusion and Optimal Design Problems Singapore University of Technology and Design November 23, 2012
2 1. Statistical Motivation Given m regression vectors $\{x_1, x_2, \dots, x_m\} \subset \mathbb{R}^n$ which span $\mathbb{R}^n$, we want to estimate the parameter $\theta \in \mathbb{R}^n$ in a linear model such as $y = X^T \theta + \epsilon$, where $X := [x_1, x_2, \dots, x_m] \in \mathbb{R}^{n \times m}$ and $\epsilon \sim N(0, \sigma^2 I) \in \mathbb{R}^m$. An estimator $\hat\theta$ is unbiased if $E(\hat\theta) = \theta$.
3 2. Motivation It is well known that the optimal unbiased estimator is $\hat\theta = (XX^T)^{-1} X y$, and $D := \sigma^2 (XX^T)^{-1}$, the dispersion matrix, is a measure of the variance of the model. In classical estimation, X and y are known. In optimal design, we can choose X so that the dispersion is minimized with respect to some criterion, i.e., the accuracy of the model can be increased by designing the experiments carefully at the start.
4-5 3. The Optimal Design Problem An experimental design is a set of vectors (support vectors) $\{x_1, \dots, x_m\} \subset \mathcal{X}$ and positive integers $n_1, \dots, n_m$ such that $\sum_{i=1}^m n_i = N$, where $n_i$ is the number of repetitions at regression point $x_i$. The corresponding dispersion matrix is
$D = \sigma^2 \Big(\sum_{i=1}^m n_i x_i x_i^T\Big)^{-1} = \frac{\sigma^2}{N} \Big(\sum_{i=1}^m \frac{n_i}{N} x_i x_i^T\Big)^{-1}.$
6-7 4. The Optimal Design Problem The optimal design problem in very general form is to find a distribution which in some sense maximizes the information matrix
$M = \sum_{i=1}^m u_i x_i x_i^T \propto D^{-1}.$
We call $\phi : \mathcal{S}^n_+ \to \mathbb{R}$ an information function if it is positively homogeneous, superadditive, nonnegative, non-constant, and upper semi-continuous. Information functions provide real-valued criteria w.r.t. which we can evaluate designs.
8 5. Matrix means Definition Let $\lambda(C)$ denote the eigenvalues of a matrix C. If C is a positive definite matrix, i.e., $C \succ 0$, the matrix mean $\phi_p$ is defined as
$\phi_p(C) = \begin{cases} \lambda_{\max}(C) & \text{for } p = \infty; \\ \big(\frac{1}{n}\operatorname{Tr} C^p\big)^{1/p} & \text{for } p \neq 0, \pm\infty; \\ (\det C)^{1/n} & \text{for } p = 0; \\ \lambda_{\min}(C) & \text{for } p = -\infty. \end{cases}$
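The case analysis above is easy to evaluate numerically. A minimal sketch (the function name `matrix_mean_diag` and the restriction to diagonal matrices are mine, not from the talk): for a diagonal positive definite C, the eigenvalues are just the diagonal entries, and each branch of the definition reduces to a one-liner.

```python
import math

def matrix_mean_diag(eigs, p):
    """phi_p of a positive definite matrix given its eigenvalues
    (for a diagonal C these are simply the diagonal entries)."""
    n = len(eigs)
    if p == math.inf:
        return max(eigs)                       # lambda_max
    if p == -math.inf:
        return min(eigs)                       # lambda_min
    if p == 0:
        prod = 1.0
        for lam in eigs:
            prod *= lam
        return prod ** (1.0 / n)               # (det C)^(1/n)
    return (sum(lam ** p for lam in eigs) / n) ** (1.0 / p)  # ((1/n) Tr C^p)^(1/p)

eigs = [1.0, 4.0]
print(matrix_mean_diag(eigs, math.inf))   # 4.0 (lambda_max)
print(matrix_mean_diag(eigs, 1))          # 2.5 (arithmetic mean)
print(matrix_mean_diag(eigs, 0))          # 2.0 (geometric mean)
print(matrix_mean_diag(eigs, -math.inf))  # 1.0 (lambda_min)
```

Note the monotonicity in p visible in the output: the means interpolate between $\lambda_{\min}$ and $\lambda_{\max}$.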
9 6. The Optimal Design Problem For $p \le 1$, the general optimal design problem can be written as follows:
$(D_p)\qquad \max_u\ g_p(u) := \ln \phi_p(XUX^T) \quad \text{s.t.}\quad e^T u = 1,\ u \ge 0,$
where $U := \operatorname{Diag}(u)$. Each value of the parameter p gives rise to a different criterion with different applications.
10-11 7. Geometric Motivation The set $E(\bar x, H) := \{x \in \mathbb{R}^n : (x - \bar x)^T H (x - \bar x) \le n\}$ for $\bar x \in \mathbb{R}^n$ and $H \succ 0$ is an ellipsoid in $\mathbb{R}^n$ with center $\bar x$ and shape defined by H. We have $\operatorname{vol}(E(\bar x, H)) = \operatorname{const}(n)/\sqrt{\det H}$, and minimizing the volume of $E(\bar x, H)$ is equivalent to minimizing $-\ln \det H$. This is equivalent to minimizing the matrix mean $\phi_0(H^{-1})$. Each parameter corresponds to a different geometric feature.
12-14 8. The Fritz-John Theorem Theorem (John, 1948) For any point set $\mathcal{X} = \{x_1, \dots, x_m\} \subset \mathbb{R}^n$, there is an ellipsoid E which satisfies
$\bar x + \frac{1}{n} E \subseteq \operatorname{conv}(\mathcal{X}) \subseteq \bar x + E;$
furthermore, if $\mathcal{X} = -\mathcal{X}$,
$\frac{1}{\sqrt{n}} E \subseteq \operatorname{conv}(\mathcal{X}) \subseteq E.$
15 9. The Ellipsoidal Inclusion Problem For $q \le 1$, consider the following problem:
$(P_q)\qquad \min_H\ f_q(H) := -\ln \phi_q(H) \quad \text{s.t.}\quad x_i^T H x_i \le n,\ i = 1, \dots, m,\quad H \succ 0.$
H defines an ellipsoid which encloses the points $x_1, \dots, x_m$. $(P_q)$ is a geometric optimization problem.
16 10. Weak Duality Lemma Let p and q be conjugate numbers in $(-\infty, 1]$. Then we have $f_q(H) \ge g_p(u)$ for any H and u feasible in $(P_q)$ and $(D_p)$, respectively:
$f_q(H) - g_p(u) = -\ln \phi_q(H) - \ln \phi_p(XUX^T) = -\ln\big(\phi_q(H)\,\phi_p(XUX^T)\big) \ge -\ln\Big(\frac{1}{n}\, H \bullet XUX^T\Big) \ge -\ln 1 = 0.$
17 11. Strong Duality and Optimality Conditions Let p and q be conjugate numbers in $(-\infty, 1]$. Then $(P_q)$ and $(D_p)$ are dual problems. Let $H^*$ and $u^*$ be optimal solutions for $(P_q)$ and $(D_p)$, respectively; then we must have
1. $H^* = \dfrac{n}{\operatorname{Tr}(XU^*X^T)^p}\,(XU^*X^T)^{p-1}$,
2. if $u_i^* > 0$ then $x_i^T H^* x_i = n$.
18 12. Approximate Solutions Let $\omega_i(u) := x_i^T (XUX^T)^{p-1} x_i$. Definition Given a positive ɛ, we call a dual feasible point u
1. an ɛ-primal feasible solution if $\omega_i(u) \le u^T \omega(u)\,(1 + \epsilon)$ for all i,
2. and say that it is an ɛ-approximate optimal solution if moreover $\omega_i(u) \ge u^T \omega(u)\,(1 - \epsilon)$ whenever $u_i > 0$.
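For p = 0 these quantities are cheap to compute directly. The sketch below (helper names `omega_d` and `eps_primal_feasible` are mine; it is restricted to 2-dimensional points so the matrix inverse can be written out by hand) evaluates $\omega_i(u) = x_i^T (XUX^T)^{-1} x_i$ and tests condition 1. For p = 0 the weighted average satisfies $u^T\omega(u) = \operatorname{Tr}\big((XUX^T)^{-1} XUX^T\big) = n$.

```python
def omega_d(points, u):
    """For p = 0 (D-criterion), omega_i(u) = x_i^T (X U X^T)^{-1} x_i,
    written out for 2-dimensional points with an explicit 2x2 inverse."""
    a = sum(ui * x[0] * x[0] for ui, x in zip(u, points))
    b = sum(ui * x[0] * x[1] for ui, x in zip(u, points))
    d = sum(ui * x[1] * x[1] for ui, x in zip(u, points))
    det = a * d - b * b
    return [(x[0] * (d * x[0] - b * x[1])
             + x[1] * (-b * x[0] + a * x[1])) / det for x in points]

def eps_primal_feasible(points, u, eps):
    """Condition 1: omega_i(u) <= (1 + eps) * u^T omega(u) for all i
    (for p = 0 the weighted average u^T omega(u) equals n)."""
    w = omega_d(points, u)
    bar = sum(ui * wi for ui, wi in zip(u, w))
    return all(wi <= (1 + eps) * bar for wi in w)

pts = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
u = [1 / 3, 1 / 3, 1 / 3]
print(omega_d(pts, u))                    # all three values are ~2.0 = n
print(eps_primal_feasible(pts, u, 0.01))  # True: u is in fact optimal here
```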
19 13. Quality of Approximate Solutions Lemma Let p and q be a pair of conjugate numbers in $(-\infty, 1]$. Given a dual feasible solution u which is ɛ-primal feasible,
1. $H = \dfrac{n}{(1+\epsilon)\operatorname{Tr}(XUX^T)^p}\,(XUX^T)^{p-1}$ is feasible in $(P_q)$,
2. $0 \le g_p^* - g_p(u) \le \ln(1 + \epsilon)$, where $g_p^*$ is the optimal objective function value of $(D_p)$.
20 14. Initial Solutions with Provable Quality Lemma $\hat u = \frac{1}{m}(1, 1, \dots, 1)$ is an $(m-1)$-primal feasible solution for $(D_p)$ for $p < 1$. Lemma If $u^0$ is a δ-primal feasible solution for $(D_0)$, then it is also an $(n + n\delta - 1)$-primal feasible solution for $(D_p)$ for $p < 1$. There is an $O(n \log n)$ algorithm by Kumar-Yıldırım that produces a 1-primal feasible solution for the $(D_0)$ problem.
21-23 15. A First-Order Framework Note that the objective function $g_p$ of $(D_p)$ is a concave function with gradient
$\omega(u) := \nabla g_p(u) = \big(x_i^T (XUX^T)^{p-1} x_i\big)_{i=1}^m.$
Consider the following update
$u_+ := (1 - \tau) u + \tau e_j,$
where
- $j := \arg\max_i \omega_i(u)$ and $\tau > 0$, or
- $j := \arg\min_i \omega_i(u)$ and $\tau < 0$ (s.t. $u_+ \ge 0$).
24-28 16. A First-Order Framework [figures illustrating the iterations]
29 17. T-criterion When $p = 1$ and $q = -\infty$, the design problem (D) and its dual (P) become:
$\max_u\ \ln \operatorname{Tr}(XUX^T)\ \text{ s.t. } e^T u = 1,\ u \ge 0; \qquad \min_H\ -\ln \lambda_n(H)\ \text{ s.t. } x_i^T H x_i \le n\ \forall i,\ H \succ 0,$
where $\lambda_1(H) \ge \lambda_2(H) \ge \dots \ge \lambda_n(H)$ are the eigenvalues of H. (P) corresponds to the Minimum Enclosing Ball problem. Trivial when centered at the origin.
30 18. A-criterion When $p = -1$ and $q = 1/2$, the design problem (D) and its dual (P) become:
$\max_u\ -\ln \operatorname{Tr}\big((XUX^T)^{-1}\big)\ \text{ s.t. } e^T u = 1,\ u \ge 0; \qquad \min_H\ -2 \ln \operatorname{Tr} H^{1/2}\ \text{ s.t. } x_i^T H x_i \le n\ \forall i,\ H \succ 0.$
(P) corresponds to the problem of finding an enclosing ellipsoid which has the greatest sum of the inverses of the semi-axes. (D) generates a design with the least average dispersion.
31 19. D-criterion When $p = 0$ and $q = 0$, the design problem (D) and its dual (P) become:
$\max_u\ \ln \det(XUX^T)\ \text{ s.t. } e^T u = 1,\ u \ge 0; \qquad \min_H\ -\ln \det H\ \text{ s.t. } x_i^T H x_i \le n\ \forall i,\ H \succ 0.$
(P) corresponds to the Minimum-Volume Enclosing Ellipsoid problem. The MVEE is also the Fritz-John ellipsoid! (D) generates a design with the least maximum dispersion.
32-35 20. D-Criterion When $p = 0$, the problem is the well-known MVEE problem, with gradient
$\omega(u) := \nabla g(u) = \big(x_i^T (XUX^T)^{-1} x_i\big)_{i=1}^m.$
Consider the following update
$u_+ := (1 - \tau) u + \tau e_i;$
then it is easy to update $\omega(u)$ and $g(u)$, as in
$\det XU_+X^T = (1 - \tau)^{n-1}\,[1 - \tau + \tau\,\omega_i(u)]\,\det XUX^T,$
and the optimal step size is (Khachiyan (1996))
$\tau^* = \frac{\omega_i(u)/n - 1}{\omega_i(u) - 1}.$
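Both formulas can be checked numerically. The sketch below (2-dimensional points with a hand-written 2x2 inverse; all helper names are mine) computes $\omega_i(u)$, takes Khachiyan's step, and compares the rank-one determinant update against a direct recomputation.

```python
def design_matrix(points, u):
    """M(u) = X U X^T for 2-dimensional points, stored as (a, b, d)."""
    a = sum(ui * x[0] * x[0] for ui, x in zip(u, points))
    b = sum(ui * x[0] * x[1] for ui, x in zip(u, points))
    d = sum(ui * x[1] * x[1] for ui, x in zip(u, points))
    return a, b, d

def det2(m):
    a, b, d = m
    return a * d - b * b

def omega_i(points, u, i):
    """omega_i(u) = x_i^T M(u)^{-1} x_i via the explicit 2x2 inverse."""
    a, b, d = design_matrix(points, u)
    det = a * d - b * b
    x = points[i]
    return (x[0] * (d * x[0] - b * x[1]) + x[1] * (-b * x[0] + a * x[1])) / det

n = 2
pts = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
u = [1 / 3, 1 / 3, 1 / 3]
i = 2
w = omega_i(pts, u, i)              # here w = 8/3 > n, so the step is positive
tau = (w / n - 1) / (w - 1)         # Khachiyan's optimal step size (0.2 here)
u_plus = [(1 - tau) * uj for uj in u]
u_plus[i] += tau

lhs = det2(design_matrix(pts, u_plus))
rhs = (1 - tau) ** (n - 1) * (1 - tau + tau * w) * det2(design_matrix(pts, u))
print(lhs, rhs)                     # the two sides of the update formula agree
```

The step also strictly increases $\det XUX^T$, i.e., the dual objective, as the Frank-Wolfe interpretation predicts.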
36-37 21. First-Order Framework Khachiyan (1996) developed and analyzed the following algorithm:
1. Start with $u = (1/m)e$ and calculate $\omega(u)$.
2. Check for ɛ-primal feasibility.
3. Let $i := \arg\max_j \omega_j(u)$ and calculate the best step size $\tau^* > 0$.
4. Update u to $u_+$ with $\tau^*$.
5. Update ω and go to step 2.
Each iteration takes O(mn) operations. The total number of iterations is $N(\epsilon) = O\big(n(\epsilon^{-1} + \log n + \log\log m)\big)$. This algorithm was also proposed by Fedorov (1972) and is very similar to that of Wynn (1970). It is a special case of the Frank-Wolfe (1956) algorithm on the dual problem.
38 22. First-Order Framework Kumar and Yıldırım (2005) proposed an initialization scheme which improved the complexity for $m \gg n$. Finally, Todd and Yıldırım (2007) modified the algorithm by also considering $i := \arg\min_{\{j : u_j > 0\}} \omega_j(u)$ and possibly updating with $\tau < 0$. They seek ɛ-approximate optimality. This version was also proposed by Atwood (1973) and coincides with the Frank-Wolfe algorithm with Wolfe's away steps (1970). $N(\epsilon) = O\big(n(\epsilon^{-1} + \log n)\big)$. This algorithm guarantees the construction of a small core set.
39 23. First-Order Framework In Ahipaşaoğlu et al. (2008), we have shown that (for data-dependent constants M and Q) $N(\epsilon) = O\big(Q + M \log(\epsilon^{-1})\big)$. A similar result was proven by Wolfe (1970) and Guélat and Marcotte (1986) when g is strongly and boundedly concave, but this assumption does not hold for (D). We work with a perturbation of (P) and use Robinson's second-order constraint qualification. For the general concave problem over the unit simplex, we prove local linear convergence if g is twice differentiable and there exists an optimal solution of (D) which satisfies the second-order sufficient condition.
40 24. First-Order Framework For any ɛ-approximate solution u, no point $x_i$ such that
$\omega_i(u) < n\Big[1 + \frac{\epsilon}{2} - \frac{\sqrt{\epsilon(4 + \epsilon - 4/n)}}{2}\Big]$
can be a support point (Harman-Pronzato (2005)). We incorporate this elimination technique and active-set strategies into the Todd-Yıldırım algorithm, which becomes very fast compared to the original. This inspired a similar algorithm for the related Minimum Enclosing Ball problem which decreases the run time by 90% (Ahipaşaoğlu and Yıldırım, 2008).
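The elimination threshold is cheap to evaluate. A small sketch (the function name `elimination_threshold` is mine, and the formula follows my reading of the bound as $n[1 + \epsilon/2 - \sqrt{\epsilon(4+\epsilon-4/n)}/2]$, so treat it as an illustration rather than the paper's exact statement):

```python
import math

def elimination_threshold(n, eps):
    """Harman-Pronzato-style bound: a point with omega_i(u) below this
    value cannot be a support point of the D-optimal design."""
    return n * (1 + eps / 2 - math.sqrt(eps * (4 + eps - 4 / n)) / 2)

# the threshold always sits below n, so only points strictly inside the
# current trial ellipsoid can be eliminated; it tends to n as eps -> 0
for n in (2, 5, 10):
    for eps in (0.1, 0.01, 0.001):
        t = elimination_threshold(n, eps)
        assert t < n
        print(n, eps, round(t, 4))
```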
41 25. Computational Study Table: Average Running Time of Different Versions of the Todd-Yıldırım Algorithm on Exponentially Distributed Data Sets (columns: n, m, FO, FO+ACT, FO+ELIM, FO+ELIM+ACT) [table entries not preserved]
42 26. Computational Study [figure]
43 27. Minimum Area Enclosing Ellipsoidal Cylinders Given m points $\{x_1, x_2, \dots, x_m\} \subset \mathbb{R}^n$ which span $\mathbb{R}^n$ and $k \le n$, the Minimum Area Enclosing Ellipsoidal Cylinder (MAEC) problem finds an ellipsoidal cylinder which is centered at the origin, covers all points, and has minimum-area intersection with
$\Pi := \Big\{ \begin{bmatrix} y \\ z \end{bmatrix} \in \mathbb{R}^k \times \mathbb{R}^{n-k} : z = 0 \Big\}.$
44-45 27. Geometry The set $C(E, H) := \{[y; z] \in \mathbb{R}^n : (y + Ez)^T H (y + Ez) \le k\}$ for $E \in \mathbb{R}^{k \times (n-k)}$ and $H \succ 0$ is a cylinder in $\mathbb{R}^n$ defined by shape matrix H and axis direction matrix E. Note that $C(E, H) \cap \Pi$ is an ellipsoid in $\mathbb{R}^k$ with $\operatorname{vol}(C(E, H) \cap \Pi) = \operatorname{const}(k)/\sqrt{\det H}$, and minimizing the volume of $C(E, H) \cap \Pi$ is equivalent to minimizing $-\ln \det H$.
46-47 28. MAEC Formulation The MAEC problem can be formulated as follows:
$\min\ \bar f(\bar H) := -\ln \det \bar H \quad \text{s.t.}\quad (y_i + Ez_i)^T \bar H (y_i + Ez_i) \le k,\ i = 1, \dots, m,$
or equivalently
$(P)\qquad \min\ f(H) := -\ln \det H_{YY} \quad \text{s.t.}\quad x_i^T H x_i \le k,\ i = 1, \dots, m,\quad H \succeq 0,$
where $H = \begin{pmatrix} H_{YY} & H_{YZ} \\ H_{YZ}^T & H_{ZZ} \end{pmatrix}$.
48-49 29. The $D_k$-optimal Design Problem The dual problem can be stated as
$(D)\qquad \max_{u,K}\ g(u, K) := \ln \det K \quad \text{s.t.}\quad XUX^T \succeq \begin{pmatrix} K & 0 \\ 0 & 0 \end{pmatrix}, \quad e^T u = 1,\ u \ge 0.$
(D) is the statistical problem of finding a $D_k$-optimal design measure on the columns of X that maximizes the determinant of a Schur complement in the Fisher information matrix, which is related to estimating the first k parameters $\theta_1, \dots, \theta_k$ in the linear model $\tilde y \approx X^T \theta$.
50-51 30. Duality Lemma For any H feasible for (P) and (u, K) feasible for (D), we have $g(u, K) \le f(H)$. Furthermore, optimal solutions $\hat H$, $\hat u$, and $\hat K$ exist and satisfy the following necessary and sufficient conditions:
(a) $\hat H \bullet \Big(X\hat U X^T - \begin{pmatrix} \hat K & 0 \\ 0 & 0 \end{pmatrix}\Big) = 0$,
(b) $\hat u_i > 0$ only if $x_i^T \hat H x_i = (y_i + \hat E z_i)^T \hat K^{-1} (y_i + \hat E z_i) = k$,
(c) $\hat H_{YY} = \hat K^{-1}$.
52-55 31. Optimality Conditions We have strong duality if
(a) $H \bullet \Big(XUX^T - \begin{pmatrix} K & 0 \\ 0 & 0 \end{pmatrix}\Big) = 0$,
(b) $u_i > 0$ only if $x_i^T H x_i = (y_i + Ez_i)^T K^{-1} (y_i + Ez_i) = k$,
(c) $H_{YY} = K^{-1}$.
For optimal (u, E, K), condition (a) implies $E(ZUZ^T) = -(YUZ^T)$ and $K = YUY^T - E(ZUZ^T)E^T$.
Definition We say (u, E, K) is an ɛ-primal feasible solution if
(a) $(y_i + Ez_i)^T K^{-1} (y_i + Ez_i) \le (1 + \epsilon) k$ for $i = 1, \dots, m$.
Furthermore, it is an ɛ-approximate optimal solution if also
(b) $u_i > 0$ implies $(y_i + Ez_i)^T K^{-1} (y_i + Ez_i) \ge (1 - \epsilon) k$.
56 32. A First-Order Algorithm Using $u_+ := (1 - \tau) u + \tau e_i$ and rank-one update formulae leads to an algorithm:
1. Find a feasible u, E, and K, and calculate $\omega^k(u)$, where $\frac{\partial g(u)}{\partial u_i} = \omega_i^k(u) := (y_i + Ez_i)^T K^{-1} (y_i + Ez_i)$.
2. Check for ɛ-approximate optimality.
3. Choose i that improves the objective function or the optimality conditions.
4. Update u to $u_+$, where the step size τ is a solution of a quadratic equation.
5. Update E, K, and $\omega^k$, and go to step 2.
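Step 1 needs a feasible (u, E, K). Whenever $ZUZ^T \succ 0$, these can be recovered from u via the relations $E(ZUZ^T) = -(YUZ^T)$ and $K = YUY^T - E(ZUZ^T)E^T$ (a Schur complement). The sketch below does this for the scalar case n = 2, k = 1; the function names `maec_e_k` and `omega_k` are mine.

```python
def maec_e_k(points, u):
    """Recover E and K from u for the scalar case n = 2, k = 1, using
    E (Z U Z^T) = -(Y U Z^T) and K = Y U Y^T - E (Z U Z^T) E^T
    (only valid while Z U Z^T is positive definite)."""
    yuy = sum(ui * y * y for ui, (y, z) in zip(u, points))
    yuz = sum(ui * y * z for ui, (y, z) in zip(u, points))
    zuz = sum(ui * z * z for ui, (y, z) in zip(u, points))
    if zuz <= 0.0:
        raise ValueError("Z U Z^T is singular: E is not determined")
    e = -yuz / zuz
    k_mat = yuy - e * zuz * e       # the Schur complement
    return e, k_mat

def omega_k(points, u):
    """omega^k_i(u) = (y_i + E z_i)^T K^{-1} (y_i + E z_i), the quantity
    monitored in step 2 of the algorithm."""
    e, k_mat = maec_e_k(points, u)
    return [(y + e * z) ** 2 / k_mat for (y, z) in points]

pts = [(1.0, 1.0), (-1.0, 1.0), (0.0, -1.0)]   # columns x_i = (y_i, z_i)
u = [0.5, 0.5, 0.0]
print(maec_e_k(pts, u))    # E = 0, K = 1 for this symmetric example
print(omega_k(pts, u))     # points with u_i > 0 have omega^k_i = k = 1
```

For this symmetric instance, condition (b) of the optimality conditions already holds: $\omega^k_i = k$ exactly where $u_i > 0$, so u is optimal. The guard on $ZUZ^T$ illustrates the singularity difficulty discussed on the next slide.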
57-62 33. Why is the MAEC harder than the MVEE? Example: Let $X = \begin{bmatrix} Y \\ Z \end{bmatrix}$, $k = 1$, and $u = [0, 0, 1]$. We have $XUX^T = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, and $E(ZUZ^T) = -(YUZ^T)$ becomes $E \cdot 0 = 0$, so E is not determined. For $E \le 1$, the cylinder contains $\mathcal{X}$, but for $E > 1$, it does not. [figures: the cylinder for $0 < E \le 1$ and for $E > 1$]
63 34. Why is the MAEC harder than the MVEE? For a given iterate u, when $ZUZ^T$ is not positive definite, it is hard to choose a matrix E which satisfies $E(ZUZ^T) = -(YUZ^T)$. This causes computational and theoretical complications. We modify the algorithm so that $ZUZ^T$ never becomes singular until the last iteration. Unlike the MVEE, choosing the right pivot is not trivial.
64 35. Complexity Analysis Assuming $ZUZ^T \succ 0$, $\omega(u) < C_1$, and $\omega^k(u) < C_2$, we have:
- $O\big(k(\ln k + k \ln\ln m + \epsilon^{-1}) + m\big)$ iterations.
- Each iteration takes O(nm) operations.
- $O\big(Q + M \log(\epsilon^{-1})\big)$ iterations under technical assumptions.
- Away steps are necessary for rapid convergence.
65 36. Computational Study Table: Geometric Mean of Running Time and Average Number of Iterations Required by the Algorithm (with away steps) to Obtain an Approximate Solution for Sun-Freund Data Sets (columns: n, k, m, iter, time (sec.)) [table entries not preserved]
66-67 37. Computational Study MAEC → MVEE as $k \to n$: Can we find a good warm-start strategy? Can we prove any non-trivial core-set results? Can we identify and eliminate non-support points?
68 38. State of the Art First-order algorithms are very efficient in solving optimal design and ellipsoidal inclusion problems! A summary table compares the criteria (T-optimal, D-optimal, A-optimal) and their geometric counterparts (MEB (uncentered), MVEE) on global convergence, local convergence, warm-start, and the elimination technique [table entries not preserved]. Cylindrical inclusion?
More informationOptimization for Communications and Networks. Poompat Saengudomlert. Session 4 Duality and Lagrange Multipliers
Optimization for Communications and Networks Poompat Saengudomlert Session 4 Duality and Lagrange Multipliers P Saengudomlert (2015) Optimization Session 4 1 / 14 24 Dual Problems Consider a primal convex
More informationPrimal-Dual Interior-Point Methods
Primal-Dual Interior-Point Methods Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 Outline Today: Primal-dual interior-point method Special case: linear programming
More informationA Brief Review on Convex Optimization
A Brief Review on Convex Optimization 1 Convex set S R n is convex if x,y S, λ,µ 0, λ+µ = 1 λx+µy S geometrically: x,y S line segment through x,y S examples (one convex, two nonconvex sets): A Brief Review
More information3.1 Basic properties of real numbers - continuation Inmum and supremum of a set of real numbers
Chapter 3 Real numbers The notion of real number was introduced in section 1.3 where the axiomatic denition of the set of all real numbers was done and some basic properties of the set of all real numbers
More information1 Number Systems and Errors 1
Contents 1 Number Systems and Errors 1 1.1 Introduction................................ 1 1.2 Number Representation and Base of Numbers............. 1 1.2.1 Normalized Floating-point Representation...........
More informationCSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming
CSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming January 26, 2018 1 / 38 Liability/asset cash-flow matching problem Recall the formulation of the problem: max w c 1 + p 1 e 1 = 150
More informationLecture Note 18: Duality
MATH 5330: Computational Methods of Linear Algebra 1 The Dual Problems Lecture Note 18: Duality Xianyi Zeng Department of Mathematical Sciences, UTEP The concept duality, just like accuracy and stability,
More informationPrimal-Dual Interior-Point Methods. Javier Peña Convex Optimization /36-725
Primal-Dual Interior-Point Methods Javier Peña Convex Optimization 10-725/36-725 Last time: duality revisited Consider the problem min x subject to f(x) Ax = b h(x) 0 Lagrangian L(x, u, v) = f(x) + u T
More informationChapter 3 Least Squares Solution of y = A x 3.1 Introduction We turn to a problem that is dual to the overconstrained estimation problems considered s
Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A. Dahleh George Verghese Department of Electrical Engineering and Computer Science Massachuasetts Institute of Technology 1 1 c Chapter
More information3. Linear Programming and Polyhedral Combinatorics
Massachusetts Institute of Technology 18.433: Combinatorial Optimization Michel X. Goemans February 28th, 2013 3. Linear Programming and Polyhedral Combinatorics Summary of what was seen in the introductory
More informationNew concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space
Lesson 6: Linear independence, matrix column space and null space New concepts: Span of a vector set, matrix column space (range) Linearly dependent set of vectors Matrix null space Two linear systems:
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationMath 118, Fall 2014 Final Exam
Math 8, Fall 4 Final Exam True or false Please circle your choice; no explanation is necessary True There is a linear transformation T such that T e ) = e and T e ) = e Solution Since T is linear, if T
More informationCO350 Linear Programming Chapter 8: Degeneracy and Finite Termination
CO350 Linear Programming Chapter 8: Degeneracy and Finite Termination 22th June 2005 Chapter 8: Finite Termination Recap On Monday, we established In the absence of degeneracy, the simplex method will
More informationMathematical Optimisation, Chpt 2: Linear Equations and inequalities
Introduction Gauss-elimination Orthogonal projection Linear Inequalities Integer Solutions Mathematical Optimisation, Chpt 2: Linear Equations and inequalities Peter J.C. Dickinson p.j.c.dickinson@utwente.nl
More informationMatrix Support Functional and its Applications
Matrix Support Functional and its Applications James V Burke Mathematics, University of Washington Joint work with Yuan Gao (UW) and Tim Hoheisel (McGill), CORS, Banff 2016 June 1, 2016 Connections What
More informationOptimization Methods. Lecture 18: Optimality Conditions and. Gradient Methods. for Unconstrained Optimization
5.93 Optimization Methods Lecture 8: Optimality Conditions and Gradient Methods for Unconstrained Optimization Outline. Necessary and sucient optimality conditions Slide. Gradient m e t h o d s 3. The
More informationThe dual simplex method with bounds
The dual simplex method with bounds Linear programming basis. Let a linear programming problem be given by min s.t. c T x Ax = b x R n, (P) where we assume A R m n to be full row rank (we will see in the
More informationAPPROXIMATING THE COMPLEXITY MEASURE OF. Levent Tuncel. November 10, C&O Research Report: 98{51. Abstract
APPROXIMATING THE COMPLEXITY MEASURE OF VAVASIS-YE ALGORITHM IS NP-HARD Levent Tuncel November 0, 998 C&O Research Report: 98{5 Abstract Given an m n integer matrix A of full row rank, we consider the
More informationHandout 2: Elements of Convex Analysis
ENGG 5501: Foundations of Optimization 2018 19 First Term Handout 2: Elements of Convex Analysis Instructor: Anthony Man Cho So September 10, 2018 As briefly mentioned in Handout 1, the notion of convexity
More informationLinear Programming: Simplex
Linear Programming: Simplex Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Linear Programming: Simplex IMA, August 2016
More informationIntroduction to Applied Linear Algebra with MATLAB
Sigam Series in Applied Mathematics Volume 7 Rizwan Butt Introduction to Applied Linear Algebra with MATLAB Heldermann Verlag Contents Number Systems and Errors 1 1.1 Introduction 1 1.2 Number Representation
More informationHW1 solutions. 1. α Ef(x) β, where Ef(x) is the expected value of f(x), i.e., Ef(x) = n. i=1 p if(a i ). (The function f : R R is given.
HW1 solutions Exercise 1 (Some sets of probability distributions.) Let x be a real-valued random variable with Prob(x = a i ) = p i, i = 1,..., n, where a 1 < a 2 < < a n. Of course p R n lies in the standard
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationA Primal-Dual Second-Order Cone Approximations Algorithm For Symmetric Cone Programming
A Primal-Dual Second-Order Cone Approximations Algorithm For Symmetric Cone Programming Chek Beng Chua Abstract This paper presents the new concept of second-order cone approximations for convex conic
More informationChapter 3, Operations Research (OR)
Chapter 3, Operations Research (OR) Kent Andersen February 7, 2007 1 Linear Programs (continued) In the last chapter, we introduced the general form of a linear program, which we denote (P) Minimize Z
More informationLecture Simplex Issues: Number of Pivots. ORIE 6300 Mathematical Programming I October 9, 2014
ORIE 6300 Mathematical Programming I October 9, 2014 Lecturer: David P. Williamson Lecture 14 Scribe: Calvin Wylie 1 Simplex Issues: Number of Pivots Question: How many pivots does the simplex algorithm
More informationCSCI : Optimization and Control of Networks. Review on Convex Optimization
CSCI7000-016: Optimization and Control of Networks Review on Convex Optimization 1 Convex set S R n is convex if x,y S, λ,µ 0, λ+µ = 1 λx+µy S geometrically: x,y S line segment through x,y S examples (one
More informationThe proximal mapping
The proximal mapping http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/37 1 closed function 2 Conjugate function
More informationSecond-order cone programming
Outline Second-order cone programming, PhD Lehigh University Department of Industrial and Systems Engineering February 10, 2009 Outline 1 Basic properties Spectral decomposition The cone of squares The
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationA Linearly Convergent Linear-Time First-Order Algorithm for Support Vector Classification with a Core Set Result
A Linearly Convergent Linear-Time First-Order Algorithm for Support Vector Classification with a Core Set Result Piyush Kumar Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530,
More informationtoo, of course, but is perhaps overkill here.
LUNDS TEKNISKA HÖGSKOLA MATEMATIK LÖSNINGAR OPTIMERING 018-01-11 kl 08-13 1. a) CQ points are shown as A and B below. Graphically: they are the only points that share the same tangent line for both active
More informationCS711008Z Algorithm Design and Analysis
CS711008Z Algorithm Design and Analysis Lecture 8 Linear programming: interior point method Dongbo Bu Institute of Computing Technology Chinese Academy of Sciences, Beijing, China 1 / 31 Outline Brief
More information3. Linear Programming and Polyhedral Combinatorics
Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans April 5, 2017 3. Linear Programming and Polyhedral Combinatorics Summary of what was seen in the introductory
More information"SYMMETRIC" PRIMAL-DUAL PAIR
"SYMMETRIC" PRIMAL-DUAL PAIR PRIMAL Minimize cx DUAL Maximize y T b st Ax b st A T y c T x y Here c 1 n, x n 1, b m 1, A m n, y m 1, WITH THE PRIMAL IN STANDARD FORM... Minimize cx Maximize y T b st Ax
More informationDepartment of Mathematics Comprehensive Examination Option I 2016 Spring. Algebra
Comprehensive Examination Option I Algebra 1. Let G = {τ ab : R R a, b R and a 0} be the group under the usual function composition, where τ ab (x) = ax + b, x R. Let R be the group of all nonzero real
More informationA Simpler and Tighter Redundant Klee-Minty Construction
A Simpler and Tighter Redundant Klee-Minty Construction Eissa Nematollahi Tamás Terlaky October 19, 2006 Abstract By introducing redundant Klee-Minty examples, we have previously shown that the central
More informationOptimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method
Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method Zhaosong Lu November 21, 2015 Abstract We consider the problem of minimizing a Lipschitz dierentiable function over a
More informationOptimization: Then and Now
Optimization: Then and Now Optimization: Then and Now Optimization: Then and Now Why would a dynamicist be interested in linear programming? Linear Programming (LP) max c T x s.t. Ax b αi T x b i for i
More informationAn Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization
An Infeasible Interior-Point Algorithm with full-newton Step for Linear Optimization H. Mansouri M. Zangiabadi Y. Bai C. Roos Department of Mathematical Science, Shahrekord University, P.O. Box 115, Shahrekord,
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More information