On the Minimization Over Sparse Symmetric Sets: Projections, Optimality Conditions and Algorithms

1 On the Minimization Over Sparse Symmetric Sets: Projections, Optimality Conditions and Algorithms Amir Beck Technion - Israel Institute of Technology Haifa, Israel Based on joint work with Nadav Hallak and Yakov Vaisbourd OPT2014: Workshop on Optimization for Machine Learning (NIPS 2014), Montreal, December 12, 2014

2 Problem Formulation The sparse optimization problem (P): min f(x) s.t. x ∈ C_s ∩ B, where (1) f is continuously differentiable, (2) B is closed and convex, and C_s = {x ∈ R^n : ‖x‖_0 ≤ s}. Difficulties: (a) C_s ∩ B is non-convex; (b) C_s ∩ B induces a combinatorial constraint. No global optimality conditions; solution methods are heuristic in nature.

3 Example - Compressed Sensing Linear CS: recover a sparse signal x from a sampling matrix A and measurements b. (CS) min ‖Ax - b‖_2^2 s.t. x ∈ C_s ∩ R^n

4 Literature - CS Linear: 1 Conditions for reconstruction: RIP (Candes and Tao 05), SRIP (Beck and Teboulle 10), spark (Donoho and Elad 03; Gorodnitsky and Rao 97), mutual coherence (Donoho et al. 03; Donoho and Huo 99; Mallat and Zhang 93). 2 Reviews: Bruckstein et al. 09, Davenport et al. 11, Tropp and Wright. 3 Iterative algorithms: IHT (Blumensath and Davies 08, 09, 12; Beck and Teboulle 10), CoSaMP (Needell and Tropp 09). Nonlinear: 1 Phase retrieval: Shechtman et al. 13; Ohlsson and Eldar 13; Eldar and Mendelson 13; Eldar et al. 13; Hurt. 2 Nonlinear: optimality conditions (Beck and Eldar 13), GraSP (Bahmani et al. 13)

5 Example - Sparse Index Tracking Track an index b with at most s assets, with return matrix A. (IT) min ‖Ax - b‖_2^2 s.t. x ∈ C_s ∩ Δ_n. Example: Finance - track the S&P500 with a small number of assets (Takeda et al. 12)

6 Example - Sparse Principal Component Analysis Find the first principal eigenvector of a matrix A. (PCA) max x^T Ax s.t. x ∈ C_s ∩ {y ∈ R^n : ‖y‖_2 ≤ 1}. Example: Finance - identify the group which explains most of the variance in the S&P500. Sample of works: Moghaddam, Weiss, Avidan 06; d'Aspremont, Bach, El Ghaoui 08; d'Aspremont, El Ghaoui, Jordan, Lanckriet 07; recent review: Luss and Teboulle 13

7 Objectives The sparse optimization problem (P): min f(x) s.t. x ∈ C_s ∩ B, B closed and convex. Main Objectives: define necessary optimality conditions; develop corresponding algorithms; establish a hierarchy between algorithms and conditions. The case B = R^n: Beck, Eldar 13

8 Objectives The sparse optimization problem (P): min f(x) s.t. x ∈ C_s ∩ B, B closed and convex. Main Objectives: define necessary optimality conditions; develop corresponding algorithms; establish a hierarchy between algorithms and conditions. The case B = R^n: Beck, Eldar 13. However, we will also need to study and compute orthogonal projections onto B ∩ C_s.

9 Recap of Necessary First Order Opt. Conditions over Convex Sets: Stationarity (*) min{f(x) : x ∈ S}, S closed and convex, f continuously differentiable. Equivalent definitions of stationarity: x* is a stationary point iff Projection form: x* = P_S(x* - (1/L)∇f(x*)) for some L > 0; Variational form: ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ S

10 Recap of Necessary First Order Opt. Conditions over Convex Sets: Stationarity (*) min{f(x) : x ∈ S}, S closed and convex, f continuously differentiable. Equivalent definitions of stationarity: x* is a stationary point iff Projection form: x* = P_S(x* - (1/L)∇f(x*)) for some L > 0; Variational form: ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ S. The conditions are equivalent and independent of L. Most algorithms that use first order information converge to stationary points. The projection-form condition relies on the properties/computability of P_S(·), where P_S(y) = argmin{‖y - x‖_2 : x ∈ S}.

11 Why Study Orthogonal Projections? P_{C_s∩B}(x) = argmin{‖z - x‖_2^2 : z ∈ C_s ∩ B}. To define optimality conditions, we need to compute and analyze properties of the orthogonal projection P_{C_s∩B}. Computing P_{C_s∩B} is in general a difficult task, but in fact tractable under symmetry assumptions on B

12 Why Study Orthogonal Projections? P_{C_s∩B}(x) = argmin{‖z - x‖_2^2 : z ∈ C_s ∩ B}. To define optimality conditions, we need to compute and analyze properties of the orthogonal projection P_{C_s∩B}. Computing P_{C_s∩B} is in general a difficult task, but in fact tractable under symmetry assumptions on B. Revised Layout: Projections, Optimality Conditions, Algorithms

13 Projection Onto Symmetric Sets

14 Definitions - Basics Σ_n = the permutation group of [n]; x_σ = the reordering of x according to σ ∈ Σ_n, (x_σ)_i = x_{σ(i)}. Example (permutation): for σ(1) = 3, σ(2) = 1, σ(3) = 2, we get x_σ = (x_3, x_1, x_2)^T.

15 Definitions - sorting permutations σ ∈ Σ_n is a sorting permutation of x if x_{σ(1)} ≥ x_{σ(2)} ≥ ... ≥ x_{σ(n-1)} ≥ x_{σ(n)}. Σ(x) is the set of all the sorting permutations of x. Example (sorting permutation): if x_2 = x_4 ≥ x_3 ≥ x_1, then σ with σ(1) = 2, σ(2) = 4, σ(3) = 3, σ(4) = 1 is a sorting permutation of x, and so is σ' with σ'(1) = 4, σ'(2) = 2, σ'(3) = 3, σ'(4) = 1; both belong to Σ(x).
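A minimal numpy sketch (function name and example vector are mine) of picking one sorting permutation σ ∈ Σ(x) via argsort; with ties, several permutations qualify, echoing the example above. Indices are 0-based here, unlike the slides' 1-based notation.

```python
import numpy as np

def sorting_permutation(x):
    # One sigma in Sigma(x): indices that order x in nonincreasing fashion.
    # np.argsort is ascending, so we sort -x; kind="stable" fixes the tie-break.
    return np.argsort(-np.asarray(x, dtype=float), kind="stable")

x = np.array([1.0, 5.0, 3.0, 5.0])      # entries 2 and 4 (1-based) are tied
sigma = sorting_permutation(x)
print(sigma, x[sigma])                   # [1 3 2 0] and [5. 5. 3. 1.]; [3 1 2 0] would also be valid
```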

16 Definitions - Type-1 Symmetric Set D is a type-1 symmetric set if x ∈ D ⟹ x_σ ∈ D for every σ ∈ Σ_n.
set | description | symmetry
Δ̄_n = {x ∈ R^n : 1^T x = 1} | unit sum | type-1
[l, u]^n (l < u) | box | type-1

17 Definitions - Nonnegative Type-1 Symmetric Set D is nonnegative if x ∈ D ⟹ x ≥ 0.
set | description | symmetry
R^n_+ | nonnegative orthant | nonnegative type-1
Δ_n | unit simplex | nonnegative type-1

18 Definitions - Type-2 Symmetric Set D is a type-2 symmetric set if it is type-1 symmetric and for any x ∈ D and y ∈ {-1, 1}^n, x ⊙ y ≡ (x_i y_i)_{i=1}^n ∈ D.
set | description | symmetry
R^n | entire space | type-2
B_p[0, 1] (p ≥ 1) | p-ball | type-2

19 Summary of symmetry properties of simple sets
set | description | type-1 | nonneg. type-1 | type-2
R^n | entire space | ✓ | | ✓
R^n_+ | nonnegative orthant | ✓ | ✓ |
Δ_n | unit simplex | ✓ | ✓ |
Δ̄_n | unit sum | ✓ | |
B_p[0, 1] (p ≥ 1) | p-ball | ✓ | | ✓
[l, u]^n (l < u) | box | ✓ | |

20 Symmetric Projection Monotonicity Lemma. Let D be a type-1 symmetric set, x ∈ R^n, and y ∈ P_D(x). Then (y_i - y_j)(x_i - x_j) ≥ 0 for any i, j ∈ [n].

21 Order Preservation Property Theorem. Let D be a type-1 symmetric set and σ ∈ Σ(x). Suppose that P_D(x) ≠ ∅. Then there exists y ∈ P_D(x) s.t. σ ∈ Σ(y). That is, x_{σ(1)} ≥ x_{σ(2)} ≥ ... ≥ x_{σ(n)} and y_{σ(1)} ≥ y_{σ(2)} ≥ ... ≥ y_{σ(n)}. Example: D = C_2, P_D((3, 2, 2, 0)) = {(3, 2, 0, 0), (3, 0, 2, 0)}.

22 Sparse Projection Onto Symmetric Sets

23 The Sparse Projection Problem Find an element of the orthogonal projection of x ∈ R^n onto B ∩ C_s: P_{C_s∩B}(x) = argmin{‖z - x‖_2^2 : z ∈ C_s ∩ B}. (1) B ∩ C_s closed ⟹ P_{C_s∩B}(x) ≠ ∅. (2) B ∩ C_s nonconvex ⟹ P_{C_s∩B}(x) may contain more than one element. A DIFFICULT NONCONVEX PROBLEM IN GENERAL (known for B = R^n, R^n_+, Δ_n)

24 Supports, Super Supports Let x ∈ R^n, s ∈ [n] = {1,..., n}. 1 Support of x: I_1(x) ≡ {i ∈ [n] : x_i ≠ 0}. 2 Super support of x: any set T s.t. I_1(x) ⊆ T and |T| = s. 3 x has full support if ‖x‖_0 = |I_1(x)| = s. 4 Off-support of x: I_0(x) ≡ {i ∈ [n] : x_i = 0}. Example: s = 3, n = 5 and x = (-3, 4, 0, 0, 0)^T. 1 Support: I_1(x) = {1, 2}. 2 Super supports: T ∈ {{1, 2, 3}, {1, 2, 4}, {1, 2, 5}}. 3 Incomplete support: ‖x‖_0 < s. 4 Off-support: I_0(x) = {3, 4, 5}
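A small numpy sketch of these definitions (helper names are mine; indices are 0-based, so the slide's {1, 2} becomes {0, 1}), reproducing the example above.

```python
import numpy as np
from itertools import combinations

def support(x):
    # I_1(x): indices of the nonzero entries.
    return set(np.flatnonzero(x))

def super_supports(x, s):
    # All super supports of x: index sets T with I_1(x) a subset of T and |T| = s.
    I1 = support(x)
    rest = [i for i in range(len(x)) if i not in I1]
    return [I1 | set(extra) for extra in combinations(rest, s - len(I1))]

x = np.array([-3.0, 4.0, 0.0, 0.0, 0.0])
print(support(x))            # {0, 1}
print(super_supports(x, 3))  # [{0, 1, 2}, {0, 1, 3}, {0, 1, 4}]
```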

25 Restriction on Index Sets x ∈ R^n, T ⊆ [n] an index set. 1 x_T ∈ R^|T| is the restriction of x to T. 2 U_T is the submatrix of I_n constructed from the columns in T. 3 B_T = {x ∈ R^|T| : U_T x ∈ B} is the restriction of B to T. 4 ∇_T f(x) = U_T^T ∇f(x) is the restriction of ∇f(x) to T. Example: x = (8, 7, 6, 5)^T ⟹ x_{{1,3}} = (8, 6)^T. B = {(x_1, x_2, x_3, x_4) : x_1 + 2x_2 + 3x_3 + 4x_4 = 1} ⟹ B_{{1,2}} = {(x_1, x_2)^T : x_1 + 2x_2 = 1}. f(x) = x_1 x_2 + x_3^3 ⟹ ∇_{{1,3}} f(x) = (x_2, 3x_3^2)^T.

26 The Order Set For any permutation σ ∈ Σ_n, the order set S^σ_[j_1,j_2] is defined as S^σ_[j_1,j_2] = {σ(j_1), σ(j_1 + 1),..., σ(j_2)} if 0 < j_1 ≤ j_2 ≤ n, and S^σ_[j_1,j_2] = ∅ otherwise. Example (order set): for σ(1) = 4, σ(2) = 1, σ(3) = 3, σ(4) = 2: S^σ_[1,2] = {σ(1), σ(2)} = {4, 1}, S^σ_[3,4] = {σ(3), σ(4)} = {3, 2}

27 Phases in Computing the Projection To find y ∈ P_{C_s∩B}(x): (1) find its super support S; (2) compute y_S = P_{B_S}(x_S), y_{S^c} = 0. Naive approach: go over all (n choose s) possible super supports, compute the corresponding projections, and find the sparse projection vector. TOO EXPENSIVE

28 Type-1 Symmetric Sparse Projection Theorem Type-1 Symmetric Projection Theorem. Let B be a type-1 symmetric set and σ ∈ Σ(x). Then there exist y ∈ P_{C_s∩B}(x) and k ∈ {0,..., s} for which I_1(y) ⊆ S^σ_[1,k] ∪ S^σ_[n+k-(s-1), n]. Result: to find the super support, find k ∈ {0,..., s} splitting the sorted entries x_{σ(1)} ≥ ... ≥ x_{σ(k)} ≥ x_{σ(k+1)} ≥ ... ≥ x_{σ(n+k-s)} ≥ x_{σ(n+k-(s-1))} ≥ ... ≥ x_{σ(n)}, so that y is supported on the first k and the last s - k sorted positions (y_{σ(1)},..., y_{σ(k)} and y_{σ(n+k-(s-1))},..., y_{σ(n)}) and is zero in between.

29 Type-1 Symmetric Sparse Projection Algorithm Algorithm 1: Projection onto a type-1 symmetric sparse set. Input: x ∈ R^n. Output: u ∈ P_{C_s∩B}(x). 1 Find σ ∈ Σ(x). 2 For k = s, s-1,..., 0: 2.1 set T_k = S^σ_[1,k] ∪ S^σ_[n+k-(s-1), n]; 2.2 compute g^k = P_{B_{T_k}}(x_{T_k}) and define z^k = U_{T_k} g^k. 3 Return u ∈ argmin{‖z - x‖_2 : z ∈ {z^k : k = s, s-1,..., 0}}
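A sketch of Algorithm 1 for one concrete type-1 symmetric choice, the box B = [l, u]^n, where each restricted projection P_{B_{T_k}} is just a clip; the function name and the test vector are my own, and indices are 0-based.

```python
import numpy as np

def sparse_box_projection(x, s, l, u):
    # Algorithm 1 sketch for B = [l, u]^n: try every split k = s,...,0 and keep the best candidate.
    n = len(x)
    sigma = np.argsort(-x, kind="stable")            # a sorting permutation of x
    best, best_dist = None, np.inf
    for k in range(s, -1, -1):
        # T_k: the k largest entries of x together with the s - k smallest entries
        T = np.concatenate([sigma[:k], sigma[n - (s - k):]])
        z = np.zeros(n)
        z[T] = np.clip(x[T], l, u)                   # P_{B_{T_k}} for a box is a clip
        dist = np.linalg.norm(z - x)
        if dist < best_dist:
            best, best_dist = z, dist
    return best

x = np.array([0.9, -2.0, 0.2, 1.5])
print(sparse_box_projection(x, s=2, l=-1.0, u=1.0))  # [ 0. -1.  0.  1.]
```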

30 Nonnegative Type-1 Symmetric Sparse Projection Theorem Nonnegative Type-1 Symmetric Sparse Projection Theorem. Let B be nonnegative type-1 symmetric and σ ∈ Σ(x). Then there exists y ∈ P_{C_s∩B}(x) s.t. I_1(y) ⊆ S^σ_[1,s]. Example: B = Δ_4, s = 2, x = (1, 0.5, -0.5, -1)^T, S^σ_[1,2] = {1, 2}, y = (0.75, 0.25, 0, 0)^T. The super support can be found by a simple sort operation

31 Type-2 Symmetric Sparse Projection Theorem Type-2 Symmetric Sparse Projection Theorem. Let B be type-2 symmetric and σ ∈ Σ(|x|). Then there exists y ∈ P_{C_s∩B}(x) s.t. I_1(y) ⊆ S^σ_[1,s]. Example: B = B_2[0, 1], s = 2, x = (3, 0.5, 0.7, 4)^T, S^σ_[1,2] = {4, 1}, y = (0.6, 0, 0, 0.8)^T. The super support can be found by a simple sort operation

32 Unified Symmetric Projection Theorem The symmetry function p : R^n → R^n: p(x) ≡ x if B is nonnegative type-1, p(x) ≡ |x| if B is type-2 symmetric. Unified Symmetric Projection Theorem. Let B be closed, convex, and either nonnegative type-1 or type-2 symmetric, and let σ ∈ Σ(p(x)). Then there exists y ∈ P_{C_s∩B}(x) s.t. I_1(y) ⊆ S^σ_[1,s]

33 Unified Symmetric Sparse Projection Algorithm Algorithm 2: Unified symmetric sparse projection algorithm. Input: x ∈ R^n. Output: u ∈ P_{B∩C_s}(x). 1 Compute T = S^σ_[1,s] for some σ ∈ Σ(p(x)). 2 Return u = U_T P_{B_T}(x_T).
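A sketch of Algorithm 2 for the nonnegative type-1 case B = Δ_n (so p(x) = x): keep the s largest entries of x and project them onto the s-dimensional unit simplex. The simplex projection below is the standard sort-based routine; all names are mine.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the unit simplex {z >= 0, sum(z) = 1} (sort-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def sparse_simplex_projection(x, s):
    # Algorithm 2 for B = Delta_n: T = s largest entries of x, then project x_T onto Delta_s.
    T = np.argsort(-x, kind="stable")[:s]
    z = np.zeros_like(x, dtype=float)
    z[T] = project_simplex(x[T])
    return z

x = np.array([1.0, 0.5, -0.5, -1.0])
print(sparse_simplex_projection(x, s=2))   # [0.75 0.25 0.   0.  ], the slide 30 example
```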

34 Super supports for sparse projection onto simple sets
B | P_{B∩C_s}(x) | super support set(s)
R^n | U_T x_T | T = S^σ_[1,s], σ ∈ Σ(|x|); B_T = R^s
R^n_+ | U_T [x_T]_+ | T = S^σ_[1,s], σ ∈ Σ(x); B_T = R^s_+
Δ_n | U_T P_{B_T}(x_T) | T = S^σ_[1,s], σ ∈ Σ(x); B_T = Δ_s
Δ̄_n | U_{T_k} P_{B_{T_k}}(x_{T_k}) | T_k = S^σ_[1,k] ∪ S^σ_[n+k-(s-1),n], k = 0, 1,..., s, σ ∈ Σ(x); B_{T_k} = Δ̄_s
B_p^n[0, 1] (p ≥ 1) | U_T P_{B_T}(x_T) | T = S^σ_[1,s], σ ∈ Σ(|x|); B_T = B_p^s[0, 1]
[l, u]^n (l < u) | U_{T_k} P_{B_{T_k}}(x_{T_k}) | T_k = S^σ_[1,k] ∪ S^σ_[n+k-(s-1),n], k = 0, 1,..., s, σ ∈ Σ(x); B_{T_k} = [l, u]^s

35 Optimality Conditions and Algorithms

36 Back to the Sparse Optimization Problem The sparse optimization problem (P): min f(x) s.t. x ∈ C_s ∩ B, C_s = {x ∈ R^n : ‖x‖_0 ≤ s}. Assumptions: [A] f : R^n → R is lower bounded and continuously differentiable. [B] B is a closed and convex set. In some cases also [C] f ∈ C^{1,1}_{L(f)} (∇f is Lipschitz continuous with constant L(f)).

37 Road Map of Optimality Conditions (P) min f(x) s.t. x ∈ C_s ∩ B. Basic Feasibility - optimality over the support. L-Stationarity - extension of stationarity over convex sets. CW-optimality

38 Basic Feasibility (BF) x ∈ C_s ∩ B is a basic feasible (BF) point of (P) if for any super support set S of x and some L > 0: x_S = P_{B_S}(x_S - (1/L)∇_S f(x)). Remarks: (a) If |I_1(x)| = s, then the only super support set is the support itself. (b) If |I_1(x)| = k < s, then there are (n-k choose s-k) possible super supports. (c) x is BF ⟺ x_T is a stationary point of f over B_T for any super support T of x. (d) Optimality ⟹ BF

39 Basic Feasibility - Examples
B | BF conditions (full support)
R^n | ∇_{I_1(x*)} f(x*) = 0
R^n_+ | ∇_{I_1(x*)} f(x*) = 0
Δ_n | ∃ µ ∈ R : ∇_i f(x*) = µ for all i ∈ I_1(x*)
Δ̄_n | ∃ µ ∈ R : ∇_i f(x*) = µ for all i ∈ I_1(x*)
B_2^n[0, 1] | ∇_{I_1(x*)} f(x*) = 0, or ‖x*‖ = 1 and ∃ λ ≤ 0 : ∇_{I_1(x*)} f(x*) = λ x*_{I_1(x*)}
[l, u]^n (l < u) | for all i ∈ I_1(x*): ∂f/∂x_i(x*) = 0 if l < x*_i < u, ≥ 0 if x*_i = l, ≤ 0 if x*_i = u
What if |I_1(x)| < s?

40 Basic Feasibility - Examples
B | BF conditions (full support)
R^n | ∇_{I_1(x*)} f(x*) = 0
R^n_+ | ∇_{I_1(x*)} f(x*) = 0
Δ_n | ∃ µ ∈ R : ∇_i f(x*) = µ for all i ∈ I_1(x*)
Δ̄_n | ∃ µ ∈ R : ∇_i f(x*) = µ for all i ∈ I_1(x*)
B_2^n[0, 1] | ∇_{I_1(x*)} f(x*) = 0, or ‖x*‖ = 1 and ∃ λ ≤ 0 : ∇_{I_1(x*)} f(x*) = λ x*_{I_1(x*)}
[l, u]^n (l < u) | for all i ∈ I_1(x*): ∂f/∂x_i(x*) = 0 if l < x*_i < u, ≥ 0 if x*_i = l, ≤ 0 if x*_i = u
What if |I_1(x)| < s? In all of the above cases, basic feasibility is exactly like stationarity over B

41 Characterization of BF points Theorem. x ∈ C_s ∩ B is a BF point if and only if x_T = P_{B_T}(x_T - (1/L)∇_T f(x)), where 1 Symmetry: B is nonnegative type-1 or type-2 symmetric; 2 Fill the support: i ∈ [n + 1] is s.t. |S^σ_[i,n] ∪ I_1(x)| = s, with σ ∈ Σ(-p(∇f(x))); 3 Into a super support: T = I_1(x) ∪ S^σ_[i,n]. Important: when |I_1(x)| < s, only one super support needs to be checked

42 Basic Feasible Search - Find a BF point Algorithm 3: BAsic Feasible Search (BAFS). Initialization: x^0 ∈ C_s ∩ B, k = 0. Output: u ∈ C_s ∩ B which is a basic feasible point. Repeat: 1 k ← k + 1; 2 let σ ∈ Σ(-p(∇f(x^{k-1}))); 3 set i ∈ {1,..., n + 1} such that |S^σ_[i,n] ∪ I_1(x^{k-1})| = s; 4 set T_k = I_1(x^{k-1}) ∪ S^σ_[i,n]; 5 take x^k ∈ argmin{f(y) : y ∈ B, I_1(y) ⊆ T_k}; until f(x^{k-1}) ≤ f(x^k). Set u = x^{k-1}. Finite termination. Requires the ability to minimize f over B restricted to a support set.
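A minimal sketch of BAFS, assuming f(x) = ‖Ax - b‖^2 and B = R^n (so p = |·| and the restricted minimization is a small least-squares solve); function names are mine and x0 is assumed to have at most s nonzeros.

```python
import numpy as np

def restricted_ls(A, b, T, n):
    # argmin_y { ||Ay - b||^2 : I_1(y) subset of T }, solved on the columns indexed by T.
    y = np.zeros(n)
    y[T], *_ = np.linalg.lstsq(A[:, T], b, rcond=None)
    return y

def bafs_ls(A, b, s, x0, max_iter=100):
    # BAFS sketch: fill the support with the off-support indices of largest |grad f|,
    # minimize over that super support, repeat until no decrease.
    n = A.shape[1]
    x = x0.copy()
    for _ in range(max_iter):
        f_old = np.linalg.norm(A @ x - b) ** 2
        grad = 2 * A.T @ (A @ x - b)
        I1 = list(np.flatnonzero(x))
        fill = [i for i in np.argsort(np.abs(grad))[::-1] if i not in I1][: s - len(I1)]
        T = I1 + fill
        x = restricted_ls(A, b, T, n)
        if np.linalg.norm(A @ x - b) ** 2 >= f_old - 1e-12:
            break
    return x
```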

43 Road Map of Optimality Conditions (P) min f(x) s.t. x ∈ C_s ∩ B. Basic Feasibility - optimality over the support. L-Stationarity - extension of stationarity over convex sets. CW-optimality

44 L-Stationarity Unfortunately, the variational form ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ B ∩ C_s is not a necessary optimality condition (in general...)

45 L-Stationarity Unfortunately, the variational form ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ B ∩ C_s is not a necessary optimality condition (in general...). Let L > 0. A vector x* ∈ C_s ∩ B is an L-stationary point of (P) if x* ∈ P_{C_s∩B}(x* - (1/L)∇f(x*)). L-Stationarity in the hierarchy: 1 L-stationarity ⟹ BF. 2 If f ∈ C^{1,1}_{L(f)}, optimality ⟹ L-stationarity for any L > L(f). The condition depends on L and is more restrictive as L gets smaller

46 Gradient Projection Method x^{k+1} ∈ P_{C_s∩B}(x^k - (1/L)∇f(x^k)). For B = R^n this is the Iterative Hard Thresholding (IHT) method (Blumensath and Davies 08, 09, 12). Makes sense only when f ∈ C^{1,1}. Only guarantees convergence to an L-stationary point for L > L(f).
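A sketch of this iteration for B = R^n, where P_{C_s∩R^n} is the hard-thresholding operator H_s (keep the s largest-magnitude entries); f is assumed to be the least-squares objective ‖Ax - b‖^2 and L should exceed L(f) = 2λ_max(A^T A). Names are mine.

```python
import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries of x, zero out the rest.
    z = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]
    z[keep] = x[keep]
    return z

def iht_ls(A, b, s, L, n_iter=200):
    # Gradient projection / IHT for min ||Ax - b||^2 s.t. ||x||_0 <= s, B = R^n.
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)
        x = hard_threshold(x - grad / L, s)
    return x
```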

47 L-stationarity characterization Theorem. Let B be a nonnegative type-1 or type-2 symmetric set. Then a BF point x* is an L-stationary point if and only if min_{i∈I_1(x*)} p(Lx*_i - ∇_i f(x*)) ≥ max_{j∈I_0(x*)} p(Lx*_j - ∇_j f(x*)). Example (B = R^n): let σ ∈ Σ(|x*|). Then x* is an L-stationary point of (P) if and only if |∇_i f(x*)| ≤ L |x*_{σ(s)}| for i ∈ I_0(x*) and ∇_i f(x*) = 0 for i ∈ I_1(x*). [Beck, A. & Eldar, Y. C., SIOPT, 2013]
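A small checker for the B = R^n case of this characterization (a hypothetical helper with 0-based numpy indexing).

```python
import numpy as np

def is_L_stationary_Rn(x, grad, L, s, tol=1e-8):
    # B = R^n test: grad = 0 on the support, |grad_i| <= L * (s-th largest |x|) off the support.
    xs = np.sort(np.abs(x))[::-1][s - 1]
    on = np.flatnonzero(x)
    off = np.setdiff1d(np.arange(len(x)), on)
    return bool(np.all(np.abs(grad[on]) <= tol) and np.all(np.abs(grad[off]) <= L * xs + tol))

x = np.array([0.0, 2.0, 0.0, -1.0])
grad = np.array([1.5, 0.0, -0.5, 0.0])
print(is_L_stationary_Rn(x, grad, L=2.0, s=2))   # True
```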

48 Back to the Variational Form of Stationarity In general, the variational form is not a necessary optimality condition. However, when f is concave, it is in fact a necessary optimality condition.

49 Back to the Variational Form of Stationarity In general, the variational form is not a necessary optimality condition. However, when f is concave, it is in fact a necessary optimality condition. Theorem. Suppose that f is concave and continuously differentiable. If x* is an optimal solution of (P), then ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ B ∩ C_s. (A direct consequence of Krein-Milman plus attainment of the optimal solution at extreme points.)

50 Back to the Variational Form of Stationarity In general, the variational form is not a necessary optimality condition. However, when f is concave, it is in fact a necessary optimality condition. Theorem. Suppose that f is concave and continuously differentiable. If x* is an optimal solution of (P), then ⟨∇f(x*), x - x*⟩ ≥ 0 for all x ∈ B ∩ C_s. (A direct consequence of Krein-Milman plus attainment of the optimal solution at extreme points.) We will call this type of stationarity co-stationarity

51 co-stationarity ⟹ L-stationarity Theorem. For any L > 0: x* co-stationary ⟹ x* is L-stationary

52 co-stationarity ⟹ L-stationarity Theorem. For any L > 0: x* co-stationary ⟹ x* is L-stationary. Consequence: for concave f, L-stationarity is a necessary optimality condition for any L > 0.

53 co-stationarity ⟹ L-stationarity Theorem. For any L > 0: x* co-stationary ⟹ x* is L-stationary. Consequence: for concave f, L-stationarity is a necessary optimality condition for any L > 0. Sparse PCA: max{x^T Ax : ‖x‖_2 ≤ 1, ‖x‖_0 ≤ s} (A ⪰ 0). Most algorithms converge to a co-stationary point, e.g., the conditional gradient method: x^{k+1} = y^k/‖y^k‖_2, y^k = H_s(Ax^k), where H_s keeps the s largest-magnitude entries. Gradient projection is never employed (since L-stationarity is a weak condition). The above is only correct for concave f.
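A sketch of this conditional gradient iteration for sparse PCA, assuming A is symmetric PSD; the hard-thresholding helper and the random initialization are my own choices.

```python
import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries, zero out the rest.
    z = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]
    z[keep] = x[keep]
    return z

def cond_grad_sparse_pca(A, s, n_iter=100, seed=0):
    # x^{k+1} = y^k / ||y^k||_2 with y^k = H_s(A x^k), as quoted on the slide.
    rng = np.random.default_rng(seed)
    x = hard_threshold(rng.standard_normal(A.shape[0]), s)
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = hard_threshold(A @ x, s)
        nrm = np.linalg.norm(y)
        if nrm == 0:
            break
        x = y / nrm
    return x
```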

54 Back to L-stationarity - Example min{f(x_1, x_2) : ‖(x_1, x_2)^T‖_0 ≤ 1}, with f a convex quadratic (terms 12x_1^2 and 32x_2^2 plus lower-order terms). Two BF vectors: (0, 9/16) - the optimal solution; (-1/12, 0) - non-optimal, yet L-stationary for sufficiently large L. [Figure: the two BF points illustrated for L = 196 and L = 250.]

55 Main Questions Are there stronger (more restrictive) optimality conditions? Can we define algorithms that (1) do not depend on the Lipschitz property and (2) converge to better optimality conditions?

56 Main Questions Are there stronger (more restrictive) optimality conditions? Can we define algorithms that (1) do not depend on the Lipschitz property and (2) converge to better optimality conditions? The answer is YES

57 Road Map of Optimality Conditions (P) min f(x) s.t. x ∈ C_s ∩ B. Basic Feasibility - optimality over the support. L-Stationarity - extension of stationarity over convex sets. CW-optimality

58 Simple-CW optimality Let B be nonnegative type-1 or type-2 symmetric. A BF point x is a simple-CW point of (P) if f(x) ≤ f(x - x_i e_i ± x_i e_j) when B is type-2, and f(x) ≤ f(x - x_i e_i + x_i e_j) when B is nonnegative type-1, where i ∈ argmin_{l∈D(x)} p(∇_l f(x)) with D(x) = argmin_{k∈I_1(x)} p(x_k), and j ∈ argmax_{l∈I_0(x)} p(∇_l f(x))

59 Simple-CW optimality in the Hierarchy Results: if B is a nonnegative type-1 or a type-2 symmetric set, 1 Optimality ⟹ Simple-CW. 2 If f ∈ C^{1,1}_{L(f)}, then Simple-CW ⟹ L-stationarity for any L ≥ L_2(f). L_2(f) - the Lipschitz constant of ∇f restricted to two coordinates - is smaller than L(f), so Simple-CW is more restrictive than L(f)-stationarity

60 Local Lipschitz Constant Under the Lipschitz assumption, for any i ≠ j there exists L_{i,j}(f) for which ‖∇_{i,j} f(x) - ∇_{i,j} f(x + d)‖ ≤ L_{i,j}(f) ‖d‖ for any d ∈ R^n satisfying d_k = 0 for all k ∉ {i, j}.

61 Local Lipschitz Constant Under the Lipschitz assumption, for any i ≠ j there exists L_{i,j}(f) for which ‖∇_{i,j} f(x) - ∇_{i,j} f(x + d)‖ ≤ L_{i,j}(f) ‖d‖ for any d ∈ R^n satisfying d_k = 0 for all k ∉ {i, j}. The local Lipschitz constant is L_2(f) = max_{i≠j} L_{i,j}(f)

62 Local Lipschitz Constant Under the Lipschitz assumption, for any i ≠ j there exists L_{i,j}(f) for which ‖∇_{i,j} f(x) - ∇_{i,j} f(x + d)‖ ≤ L_{i,j}(f) ‖d‖ for any d ∈ R^n satisfying d_k = 0 for all k ∉ {i, j}. The local Lipschitz constant is L_2(f) = max_{i≠j} L_{i,j}(f), and L_2(f) ≤ L(f). Example: f(x) = x^T Qx + 2b^T x with Q_n = I_n + J_n (I_n - identity, J_n - all ones)

63 Local Lipschitz Constant Under the Lipschitz assumption, for any i ≠ j there exists L_{i,j}(f) for which ‖∇_{i,j} f(x) - ∇_{i,j} f(x + d)‖ ≤ L_{i,j}(f) ‖d‖ for any d ∈ R^n satisfying d_k = 0 for all k ∉ {i, j}. The local Lipschitz constant is L_2(f) = max_{i≠j} L_{i,j}(f), and L_2(f) ≤ L(f). Example: f(x) = x^T Qx + 2b^T x with Q_n = I_n + J_n (I_n - identity, J_n - all ones). L(f) = 2λ_max(Q_n) = 2(n + 1). On the other hand, L_{i,j}(f) = 2λ_max([[2, 1], [1, 2]]) = 6

64 Local Lipschitz Constant Under the Lipschitz assumption, for any i ≠ j there exists L_{i,j}(f) for which ‖∇_{i,j} f(x) - ∇_{i,j} f(x + d)‖ ≤ L_{i,j}(f) ‖d‖ for any d ∈ R^n satisfying d_k = 0 for all k ∉ {i, j}. The local Lipschitz constant is L_2(f) = max_{i≠j} L_{i,j}(f), and L_2(f) ≤ L(f). Example: f(x) = x^T Qx + 2b^T x with Q_n = I_n + J_n (I_n - identity, J_n - all ones). L(f) = 2λ_max(Q_n) = 2(n + 1). On the other hand, L_{i,j}(f) = 2λ_max([[2, 1], [1, 2]]) = 6. We get: L(f) = 2(n + 1), L_2(f) = 6
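A quick numeric check of this example with numpy (n = 8 is my choice):

```python
import numpy as np

n = 8
Q = np.eye(n) + np.ones((n, n))                  # Q_n = I_n + J_n
L_f = 2 * np.linalg.eigvalsh(Q)[-1]              # 2 * lambda_max(Q_n) = 2(n + 1)
L2_f = 2 * np.linalg.eigvalsh(Q[:2, :2])[-1]     # any 2x2 principal submatrix is [[2, 1], [1, 2]]
print(round(L_f, 6), round(L2_f, 6))             # 18.0 6.0
```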

65 Zero-CW Optimality Assumption: the function can be minimized over any super support; B nonnegative type-1 or type-2. x is called a zero-CW optimal point if f(x) ≤ min{f(y) : y ∈ B, I_1(y) ⊆ T}, where T = (I_1(x) ∪ {j}) \ {i} (with i, j as in the simple-CW definition) if ‖x‖_0 = s; otherwise additional (specific) indices are added

66 Full-CW Optimality Assumption: the function can be minimized over any super support; B nonnegative type-1 or type-2. x is a full-CW optimal point if for all i ∈ I_1(x), j ∈ I_0(x), it holds that f(x) ≤ min{f(y) : y ∈ B, I_1(y) ⊆ T_{i,j}}, where T_{i,j} = (I_1(x) ∪ {j}) \ {i}. In the non-full-support case, additional indices are added to T_{i,j}.

67 Hierarchy - Summary Full-CW ⟹ Zero-CW ⟹ Simple-CW ⟹ L_2(f)-Stationarity ⟹ Basic Feasibility

68 Concave f x is called a CW-minimum point if f(x) ≤ f(z) for any z ∈ S_2(x) ≡ {z : ‖z - x‖_0 ≤ 2, z ∈ C_s ∩ B}.

69 Concave f x is called a CW-minimum point if f(x) ≤ f(z) for any z ∈ S_2(x) ≡ {z : ‖z - x‖_0 ≤ 2, z ∈ C_s ∩ B}. CW-optimality is a necessary optimality condition (of a combinatorial flavor...). Theorem. If f is concave and B = {x : ‖x‖_2 ≤ 1}, then any CW-minimum point is a co-stationary point. Quite difficult to prove since co-stationarity is a rather strong condition.

70 pit-prop data 13 variables measured on 180 pitprops. 715 BF points, 28 co-stationary points, 3 CW-optimal points. Supports of the co-stationary points (CW-optimal ones marked *): {1,2,9,10}*, {1,2,7,10}, {1,2,7,9}, {1,2,8,9}, {1,2,8,10}, {1,2,6,7}, {2,7,9,10}, {2,6,7,10}, {1,6,7,10}, {1,2,3,4}*, {7,8,9,10}, {6,7,9,10}, {6,7,10,13}, {6,7,8,10}, {5,6,7,10}, {7,8,10,12}, {7,8,10,13}, {5,6,7,13}, {3,4,6,7}, {4,5,6,7}, {7,10,12,13}, {3,4,8,12}, {3,4,10,12}, {3,10,11,12}, {3,5,12,13}, {1,5,12,13}, {2,5,12,13}, {3,5,11,13} (objective value 1.382).

71 LS over sparse l_1-ball min ‖Ax - b‖_2^2 s.t. x ∈ C_2 ∩ B_1^4[0, 1]. The six candidate supports {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4} are compared by objective value and by which conditions hold at the corresponding points: BF, L(f)-stationary, simple-CW, zero-CW, full-CW.

72 Zero-CW search method Algorithm 4: Zero-CW search method (ZCWS). Initialization: x^0 ∈ C_s ∩ B a BF point, k = 0. Output: u ∈ C_s ∩ B which is a zero-CW point. General Step (k = 0, 1, 2,...): 1 D(x^k) = argmin_{l∈I_1(x^k)} p(x^k_l); 2 i ∈ argmin_{l∈D(x^k)} p(∇_l f(x^k)); 3 j ∈ argmax_{l∈I_0(x^k)} p(∇_l f(x^k))

73 Zero-CW search method - Find a Zero-CW point Algorithm 5: Zero-CW search method (ZCWS), continued. 4 Let σ ∈ Σ(-p(∇f(x^k))) and let l be such that |(S^σ_[l,n] ∪ I_1(x^k) ∪ {j}) \ {i}| = s. 5 Define T_k = (S^σ_[l,n] ∪ I_1(x^k) ∪ {j}) \ {i}. 6 Set x ∈ argmin{f(y) : y ∈ B, I_1(y) ⊆ T_k}. 7 x^{k+1} = BAFS(x). 8 If f(x^k) ≤ f(x^{k+1}), then STOP and output u = x^k; otherwise k ← k + 1 and go back to step 1. ZCWS generates a point that is a zero-CW point, and thus converges to better points than gradient projection/IHT.

74 Full-CW search method - Find a Full-CW point Algorithm 6: Full-CW search method. Initialization: x^0 ∈ C_s ∩ B - a basic feasible point, k = 0. Output: u ∈ C_s ∩ B which is a full-CW point. General Step (k = 0, 1, 2,...): 1 x^{k+1} = ZCWS(x^k) and set k ← k + 1. 2 Let σ ∈ Σ(-p(∇f(x^k))), and for any i ∈ I_1(x^k), j ∈ I_0(x^k), let l_{i,j} be such that |(S^σ_[l_{i,j},n] ∪ I_1(x^k) ∪ {j}) \ {i}| = s. 3 Define T^{i,j}_k = (S^σ_[l_{i,j},n] ∪ I_1(x^k) ∪ {j}) \ {i}

75 Full-CW search method Algorithm 7: Full-CW search method, continued. 4 Take z^{i,j} ∈ argmin{f(y) : y ∈ B, I_1(y) ⊆ T^{i,j}_k}. 5 Set (i_0, j_0) ∈ argmin{f(z^{i,j}) : i ∈ I_1(x^k), j ∈ I_0(x^k)}. 6 Define x^{k+1} = BAFS(z^{i_0,j_0}). 7 If f(x^k) ≤ f(x^{k+1}), then STOP and output u = x^k; otherwise, k ← k + 1 and go back to step 1. The full-CW search method obviously finds a full-CW point in a finite number of steps.

76 Hierarchy of Algorithms (Best to Worst) Full-CW search; ZCWS; Gradient projection/IHT; BAFS

77 Numerical Experiments

78 Compared Method - TGA Greedily build the super support. Algorithm 8: TGA. Initialization: x = 0_n, S = ∅. Output: x ∈ C_s ∩ B. 1 While |S| < s do: 1.1 (j, x) ∈ argmin_{l∈S^c, z∈B} {f(z) : I_1(z) ⊆ S ∪ {l}}; 1.2 set S ← S ∪ {j}. 2 Return x
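A sketch of TGA under the same least-squares / B = R^n assumptions used in the earlier sketches (so each candidate evaluation is a restricted least-squares solve); names are mine.

```python
import numpy as np

def tga_ls(A, b, s):
    # Greedily grow the super support: at each step add the single index whose
    # restricted least-squares fit gives the smallest objective value.
    n = A.shape[1]
    S, x = [], np.zeros(n)
    for _ in range(s):
        best = None
        for l in range(n):
            if l in S:
                continue
            T = S + [l]
            z = np.zeros(n)
            z[T], *_ = np.linalg.lstsq(A[:, T], b, rcond=None)
            val = np.linalg.norm(A @ z - b) ** 2
            if best is None or val < best[0]:
                best = (val, l, z)
        S.append(best[1])
        x = best[2]
    return x
```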

79 Numerical Experiment - Sparse index tracking Objective: track an index with a small number of assets, using a weights vector x and a return matrix A. Problem: min ‖Ax - b‖_2^2 s.t. x ∈ C_s ∩ Δ_n. A ∈ R^{72×54} - daily returns matrix; x - weights vector; b ∈ R^{72} - S&P500 daily returns vector. Data: 180 random sets of stocks from the NYSE, 60 for each sparsity level

80 Numerical Experiment - Sparse index tracking [Table: number of instances in which one method (the improver: FCWS or ZCWS) found a better point than another (the improved: ZCWS, FCWS, IHT, TGA), for s = 9, 18, 27 and in total.] 1 The IHT never reached a zero-CW or full-CW point. 2 The ZCWS reached a full-CW point in 66% of the instances. 3 The TGA was improved in most of the instances when s > 9.

81 sparse PCA - gene expression data (GeneChip oncology) 20 data sets. [Figure: probability of obtaining the best solution as a function of the sparsity level and the number of variables; methods in the legend: PCW, cont. PCW, CGU, EM, agr, Trsh.]

82 THANK YOU FOR YOUR ATTENTION References: Beck and Hallak, On the minimization over sparse symmetric sets. Beck and Vaisbourd, Optimization methods for solving the sparse PCA problem.
