Generalized Power Method for Sparse Principal Component Analysis

Peter Richtárik
CORE/INMA, Catholic University of Louvain, Belgium

VOCAL 2008, Veszprém, Hungary

CORE Discussion Paper #2008/70
Joint work with M. Journée, Yu. Nesterov and R. Sepulchre

1. Outline

- Sparse PCA
- Optimization reformulations
- Algorithm and complexity analysis
- Numerical experiments

2. Sparse PCA (spca)

Input: matrix $A = [a_1, \dots, a_n] \in \mathbb{R}^{p \times n}$, $p \le n$.

Goal: find a unit-norm vector $z \in \mathbb{R}^n$ which simultaneously
1. maximizes the variance $z^T A^T A z$,
2. is sparse.

If sparsity is not required, $z$ is the dominant right singular vector of $A$:

$$\max_{z^T z \le 1} z^T A^T A z = \lambda_{\max}(A^T A) = (\sigma_{\max}(A))^2.$$

Extracting more components: the discussion above concerns the single-unit case ($m = 1$). Often more components (sparse dominant singular directions) are needed: this is the block case ($m > 1$).

Applications: gene expression, finance, data visualization, signal processing, vision, ...
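The non-sparse baseline is easy to check numerically. The following is a minimal sketch, assuming NumPy and a small random matrix chosen only for illustration: the dominant right singular vector attains the variance $\sigma_{\max}(A)^2 = \lambda_{\max}(A^T A)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))          # illustrative sizes: p = 5, n = 8

# Rows of Vt are the right singular vectors; the first one is dominant.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
z = Vt[0]

variance = z @ A.T @ A @ z               # z^T A^T A z
lam_max = np.linalg.eigvalsh(A.T @ A)[-1]  # largest eigenvalue of A^T A
```

Here `variance`, `lam_max` and `s[0] ** 2` agree up to rounding, and `z` is generally dense, which is what motivates adding a sparsity penalty.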

3. Our approach to spca

1. Formulate spca as an optimization problem with a sparsity-inducing penalty ($\ell_1$ or $\ell_0$) controlled by a single parameter.
2. Reformulate to get a problem of a suitable form:
   - suitable for analysis,
   - suitable for computation.
3. Solve the reformulation using a simple gradient scheme.
4. Recover a solution of the original problem.

We will illustrate steps 1), 2) and 4) on the single-unit $\ell_1$-penalized case and then jump to the general analysis of step 3).

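4. Three observations about the $\ell_1$ penalty

Notation: $\|z\|_1 = \sum_i |z_i|$.

Penalty formulation of single-unit spca:

$$\phi_{\ell_1}(\gamma) \overset{\text{def}}{=} \max_{z^T z \le 1} \sqrt{z^T A^T A z} - \gamma \|z\|_1. \qquad (1)$$

Observations:
1. $\gamma = 0$ $\Rightarrow$ no reason to expect zero coordinates in $z^*$.
2. If $\gamma \ge \|a_{i^*}\|_2 \overset{\text{def}}{=} \max_i \|a_i\|_2$, then $z^* = 0$. Indeed,

$$\max_{z \ne 0} \frac{\|Az\|_2}{\|z\|_1} = \max_{z \ne 0} \frac{\|\sum_i z_i a_i\|_2}{\|z\|_1} \le \max_{z \ne 0} \frac{\sum_i |z_i| \, \|a_i\|_2}{\sum_i |z_i|} = \max_i \|a_i\|_2.$$

3. In fact, $\gamma \ge \|a_i\|_2 \Rightarrow z_i^*(\gamma) = 0$, for all $i$.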
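Observation 2 can be sanity-checked numerically. The sketch below assumes NumPy and a small random matrix; the names `l1_objective` and `gamma_max` are illustrative, not from the talk. It evaluates the objective of (1) at random unit vectors with $\gamma$ set to the largest column norm, where no value should be positive.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 10))                 # illustrative sizes
gamma_max = np.linalg.norm(A, axis=0).max()      # max_i ||a_i||_2

def l1_objective(z, gamma):
    """Objective of (1): ||A z||_2 - gamma * ||z||_1."""
    return np.linalg.norm(A @ z) - gamma * np.abs(z).sum()

# Random points on the unit sphere of R^n.
Z = rng.standard_normal((1000, 10))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

values = np.array([l1_objective(z, gamma_max) for z in Z])
```

Since the objective equals 0 at $z = 0$ and is nonpositive everywhere else for this $\gamma$, the zero vector is optimal, as the observation states.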

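5. Reformulation

Note that ($B^k$ denotes the unit Euclidean ball in $\mathbb{R}^k$):

$$\phi_{\ell_1}(\gamma) = \max_{z \in B^n} \|Az\|_2 - \gamma \|z\|_1 = \max_{z \in B^n} \max_{x \in B^p} x^T A z - \gamma \|z\|_1 = \max_{x \in B^p} \max_{z \in B^n} \sum_{i=1}^n z_i (a_i^T x) - \gamma |z_i|.$$

For fixed $x$, the inner max-problem has the closed-form solution

$$\bar{z}_i = \operatorname{sign}(a_i^T x) \, [\,|a_i^T x| - \gamma\,]_+, \qquad z^* = \bar{z} / \|\bar{z}\|_2.$$

Hence to solve (1), we only need to solve this reformulation:

$$\phi_{\ell_1}^2(\gamma) = \max_{x \in \mathbb{R}^p,\; x^T x = 1} \; \sum_{i=1}^n [\,|a_i^T x| - \gamma\,]_+^2. \qquad (2)$$

Note: the objective function of (2) is convex and smooth, and the feasible region is in $\mathbb{R}^p$ instead of $\mathbb{R}^n$ ($p \ll n$).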
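The closed-form inner solution is a soft-thresholding step followed by normalization. The following sketch assumes NumPy; the function names and the choice of `gamma` are illustrative only. It evaluates the objective of (2) and recovers $z^*$ from a given $x$. For a fixed unit $x$, the inner value $x^T A z^* - \gamma \|z^*\|_1$ should equal the square root of that objective.

```python
import numpy as np

def l1_reformulated_objective(A, x, gamma):
    """Objective of (2): sum_i [|a_i^T x| - gamma]_+^2."""
    t = np.maximum(np.abs(A.T @ x) - gamma, 0.0)
    return float(np.sum(t ** 2))

def recover_z_l1(A, x, gamma):
    """Soft-threshold A^T x, then normalize (returns 0 if nothing survives)."""
    c = A.T @ x
    z_bar = np.sign(c) * np.maximum(np.abs(c) - gamma, 0.0)
    norm = np.linalg.norm(z_bar)
    return z_bar / norm if norm > 0 else z_bar

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 10))
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
gamma = 0.3 * np.abs(A.T @ x).max()     # keeps at least one coordinate active

z = recover_z_l1(A, x, gamma)
inner_value = x @ A @ z - gamma * np.abs(z).sum()
```

Coordinates with $|a_i^T x| \le \gamma$ are set exactly to zero, which is where the sparsity of $z^*$ comes from.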

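6. Single-unit spca via $\ell_0$ penalty

Similar story as in the $\ell_1$ case, so only briefly.

Notation: $\|z\|_0 = \operatorname{Card}\{i : z_i \ne 0\}$.

Penalty formulation:

$$\phi_{\ell_0}(\gamma) \overset{\text{def}}{=} \max_{z^T z \le 1} z^T A^T A z - \gamma \|z\|_0. \qquad (3)$$

To solve (3), first solve this reformulation:

$$\phi_{\ell_0}(\gamma) = \max_{x \in \mathbb{R}^p,\; x^T x = 1} \; \sum_{i=1}^n [\,(a_i^T x)^2 - \gamma\,]_+, \qquad (4)$$

and then set

$$\bar{z}_i = [\operatorname{sign}((a_i^T x)^2 - \gamma)]_+ \, a_i^T x, \qquad z^* = \bar{z} / \|\bar{z}\|_2.$$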
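In the $\ell_0$ case the recovery step is a hard threshold: coordinate $i$ is kept only if $(a_i^T x)^2 > \gamma$. This sketch assumes NumPy, with illustrative function names and an illustrative `gamma`. For a fixed unit $x$, it checks that $(x^T A z^*)^2 - \gamma \|z^*\|_0$ matches the objective of (4).

```python
import numpy as np

def l0_reformulated_objective(A, x, gamma):
    """Objective of (4): sum_i [(a_i^T x)^2 - gamma]_+."""
    c = A.T @ x
    return float(np.sum(np.maximum(c ** 2 - gamma, 0.0)))

def recover_z_l0(A, x, gamma):
    """Hard-threshold A^T x, then normalize (returns 0 if nothing survives)."""
    c = A.T @ x
    z_bar = np.where(c ** 2 > gamma, c, 0.0)
    norm = np.linalg.norm(z_bar)
    return z_bar / norm if norm > 0 else z_bar

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 10))
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
gamma = 0.3 * np.max((A.T @ x) ** 2)    # keeps at least one coordinate active

z = recover_z_l0(A, x, gamma)
inner_value = (x @ A @ z) ** 2 - gamma * np.count_nonzero(z)
```

Unlike soft thresholding, the surviving coordinates are not shrunk; they keep the value $a_i^T x$ before normalization.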

7. Maximizing convex functions

Problems (2) and (4) (and their block generalizations) are of the form

$$f^* = \max_{x \in Q} f(x), \qquad \text{(P)}$$

where $E$ is a finite-dimensional vector space, $f : E \to \mathbb{R}$ is a convex function, and $Q \subset E$ is compact. In particular:
- Single-unit case ($m = 1$): $Q$ = unit Euclidean sphere in $\mathbb{R}^p$.
- Block case ($m > 1$): $Q$ = Stiefel manifold in $\mathbb{R}^{p \times m}$, i.e. the set of $p \times m$ matrices with orthonormal columns.

How to solve (P)?

8. Gradient algorithm

We solve (P) using this simple gradient method:
1. Input: initial iterate $x_0 \in Q$.
2. For $k \ge 0$ repeat:
   $x_{k+1} \in \operatorname{Arg\,max} \{ f(x_k) + \langle f'(x_k), y - x_k \rangle \mid y \in Q \}$,
   $k \leftarrow k + 1$.

This algorithm generalizes the power method for computing the largest eigenvalue of a symmetric positive definite matrix $C$:

$$f(x) = \tfrac{1}{2} x^T C x \quad \Rightarrow \quad x_{k+1} = \frac{C x_k}{\|C x_k\|_2}.$$

Hence the name Generalized Power Method (GPower).
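On the unit Euclidean sphere, the linearized subproblem is solved by normalizing the gradient, so the scheme takes only a few lines. The sketch below assumes NumPy. The helper names, the iteration count, the value of `gamma` and the starting point (the normalized largest-norm column of $A$) are choices made for this illustration, not details given in the talk. It runs the iteration once on $f(x) = \tfrac12 x^T C x$, which is the classical power method, and once on the $\ell_1$ reformulation (2).

```python
import numpy as np

def gpower_sphere(grad, x0, iters=200):
    """GPower on the unit sphere: x_{k+1} = f'(x_k) / ||f'(x_k)||_2."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:          # linearization is constant on Q: stop
            break
        x = g / norm
    return x

rng = np.random.default_rng(4)

# Special case f(x) = 0.5 x^T C x with a known spectrum (largest eigenvalue 5).
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
C = Q @ np.diag([5.0, 3.0, 2.0, 1.0, 1.0, 0.5]) @ Q.T
x_pm = gpower_sphere(lambda x: C @ x, rng.standard_normal(6))
rayleigh = x_pm @ C @ x_pm

# l1 reformulation (2): f(x) = sum_i [|a_i^T x| - gamma]_+^2.
A = rng.standard_normal((5, 20))
col_norms = np.linalg.norm(A, axis=0)
gamma = 0.1 * col_norms.max()

def f_l1(x):
    return float(np.sum(np.maximum(np.abs(A.T @ x) - gamma, 0.0) ** 2))

def grad_l1(x):
    c = A.T @ x
    return 2.0 * A @ (np.sign(c) * np.maximum(np.abs(c) - gamma, 0.0))

i_star = int(np.argmax(col_norms))
x0 = A[:, i_star] / col_norms[i_star]    # assumed initialization
x_l1 = gpower_sphere(grad_l1, x0)
```

In the first run `rayleigh` approaches the largest eigenvalue of `C`; in the second, `f_l1(x_l1)` is no smaller than `f_l1(x0)`, in line with the monotonicity of the scheme for convex $f$.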

9. Iteration complexity: basic result

At any point $x \in Q$ we introduce a measure for the first-order optimality conditions:

$$\Delta(x) \overset{\text{def}}{=} \max_{y \in Q} \langle f'(x), y - x \rangle.$$

Clearly, $\Delta(x) \ge 0$, and it vanishes only at the points where the gradient $f'(x)$ belongs to the normal cone to $\operatorname{Conv}(Q)$ at $x$. Denote

$$\Delta_k \overset{\text{def}}{=} \min_{0 \le i \le k} \Delta(x_i).$$

Theorem. Let the sequence $\{x_k\}_{k=0}^{\infty}$ be generated by GPower as applied to a convex function $f$. Then the sequence $\{f(x_k)\}_{k=0}^{\infty}$ is monotonically increasing and $\lim_{k \to \infty} \Delta(x_k) = 0$. Moreover,

$$\Delta_k \le \frac{f^* - f(x_0)}{k + 1}. \qquad (5)$$
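The bound (5) follows from a short telescoping argument, sketched here as a reading aid rather than as the proof given in the paper. Convexity gives a lower bound on each increase, and the choice of $x_{k+1}$ makes the linear term equal to $\Delta(x_k)$:

```latex
\begin{align*}
f(x_{k+1}) &\ge f(x_k) + \langle f'(x_k),\, x_{k+1} - x_k \rangle
            && \text{(convexity of } f\text{)} \\
           &= f(x_k) + \Delta(x_k)
            && \text{(} x_{k+1} \text{ maximizes the linearization over } Q\text{)}, \\
f^* - f(x_0) &\ge f(x_{k+1}) - f(x_0)
             = \sum_{i=0}^{k} \bigl( f(x_{i+1}) - f(x_i) \bigr)
             \ge \sum_{i=0}^{k} \Delta(x_i)
             \ge (k+1)\, \Delta_k .
\end{align*}
```

The first line also shows monotonicity, since $\Delta(x_k) \ge 0$.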

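10. Strong convexity of functions and sets

A function $f$ is strongly convex if there exists a constant $\sigma_f > 0$ such that for any $x, y \in E$

$$f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\sigma_f}{2} \|y - x\|^2.$$

The set $\operatorname{Conv}(Q)$ is strongly convex if there exists a constant $\sigma_Q > 0$ such that for any $x, y \in \operatorname{Conv}(Q)$ and $\alpha \in [0, 1]$ the following inclusion holds:

$$\alpha x + (1 - \alpha) y + \frac{\sigma_Q}{2} \alpha (1 - \alpha) \|x - y\|^2 S \subset \operatorname{Conv}(Q),$$

where $S$ denotes the unit sphere in $E$.

Theorem. If $f : E \to \mathbb{R}$ is nonnegative and strongly convex with parameter $\sigma_f > 0$, and $f'$ is $L_f$-Lipschitz, then for any $\omega > 0$ the level set $Q_\omega \overset{\text{def}}{=} \{x \mid f(x) \le \omega\}$ is strongly convex with parameter

$$\sigma_{Q_\omega} = \frac{\sigma_f}{\sqrt{2 \omega L_f}}.$$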
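As a worked instance of the level-set theorem (an illustration added here, with the Euclidean norm assumed), take $f(x) = \tfrac12 \|x\|_2^2$. Its level sets are Euclidean balls, and the formula gives their strong convexity parameter directly:

```latex
\begin{align*}
f(x) &= \tfrac{1}{2}\|x\|_2^2, \qquad \sigma_f = L_f = 1, \qquad \omega = \tfrac{r^2}{2}, \\
Q_\omega &= \bigl\{ x \mid \tfrac{1}{2}\|x\|_2^2 \le \tfrac{r^2}{2} \bigr\}
          = \{ x \mid \|x\|_2 \le r \}, \\
\sigma_{Q_\omega} &= \frac{\sigma_f}{\sqrt{2 \omega L_f}}
                   = \frac{1}{\sqrt{r^2}} = \frac{1}{r}.
\end{align*}
```

For $r = 1$ this is the unit ball, i.e. the convex hull of the unit sphere used in the single-unit case, so there $\sigma_Q = 1$.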

11. Refined analysis under strong convexity

Theorem. Let $f$ be convex with strong convexity parameter $\sigma_f \ge 0$, and let $\operatorname{Conv}(Q)$ be convex with strong convexity parameter $\sigma_Q \ge 0$. If

$$0 < \delta_f = \inf_{x \in Q} \|f'(x)\|$$

and either $\sigma_f > 0$ or $\sigma_Q > 0$, then

$$\sum_{k=0}^{N} \|x_{k+1} - x_k\|^2 \le \frac{2 (f^* - f(x_0))}{\sigma_Q \delta_f + \sigma_f}.$$

Note: if $f$ is not minimized on $Q$, then $\delta_f > 0$.
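The bound can be checked on the classical power method. The sketch below assumes NumPy, the Euclidean norm, and a synthetic matrix $C$ with a known spectrum. For $f(x) = \tfrac12 x^T C x$ on the unit sphere it takes $\sigma_f = \delta_f = \lambda_{\min}(C)$ and $\sigma_Q = 1$; these constants are worked out for this example and are not stated in the talk.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigs = np.array([5.0, 3.0, 2.0, 1.0, 1.0, 0.5])
C = Q @ np.diag(eigs) @ Q.T              # symmetric positive definite

x = rng.standard_normal(6)
x /= np.linalg.norm(x)
f0 = 0.5 * x @ C @ x                     # f(x_0)
f_star = 0.5 * eigs.max()                # max of f over the unit sphere

# Accumulate squared step lengths of the power iteration.
sum_sq_steps = 0.0
for _ in range(100):
    x_next = C @ x
    x_next /= np.linalg.norm(x_next)
    sum_sq_steps += float(np.sum((x_next - x) ** 2))
    x = x_next

sigma_f = eigs.min()    # strong convexity parameter of f
sigma_Q = 1.0           # unit Euclidean ball
delta_f = eigs.min()    # min over the sphere of ||C x||_2
bound = 2.0 * (f_star - f0) / (sigma_Q * delta_f + sigma_f)
```

A finite sum of squared steps forces the steps themselves to shrink, which is the practical content of the theorem.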

12. Computational experiments

We compare the following sparse PCA algorithms:
- GPower$_{\ell_1}$: single-unit sparse PCA via $\ell_1$-penalty [1]
- GPower$_{\ell_0}$: single-unit sparse PCA via $\ell_0$-penalty [1]
- GPower$_{\ell_1,m}$: block sparse PCA via $\ell_1$-penalty [1]
- GPower$_{\ell_0,m}$: block sparse PCA via $\ell_0$-penalty [1]
- SPCA: SPCA algorithm [2]
- Greedy: greedy method [3]
- rsvd$_{\ell_1}$: method [4] with $\ell_1$-penalty ("soft thresholding")
- rsvd$_{\ell_0}$: method [4] with $\ell_0$-penalty ("hard thresholding")

Greedy slows down dramatically, compared to the other methods, when aimed at obtaining a component of higher cardinality.

Test problems:
- Randomly generated $A$: Gaussian entries with zero mean and unit variance
- Gene-expression data

13. Trade-off curves

Trade-off between explained variance and cardinality. The algorithms aggregate in two groups: GPower$_{\ell_1}$, GPower$_{\ell_0}$, Greedy and rsvd$_{\ell_0}$ do better (black solid lines), while SPCA and rsvd$_{\ell_1}$ do worse (red dashed lines). Based on 100 random test problems of size $p = 100$, $n = 300$.

14. Controlling sparsity with $\gamma$

Dependence of cardinality on the value of the sparsity-inducing parameter $\gamma$. The horizontal axis shows a normalized interval of reasonable values of $\gamma$. The vertical axis shows the percentage of nonzero coefficients of the resulting sparse loading vector $z$. Based on 100 random test problems of size $p = 100$, $n = 300$.
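The effect of $\gamma$ on cardinality can be reproduced in miniature. The sketch below assumes NumPy and is not the experimental code behind the figure. The function `gpower_l1`, its initialization at the largest-norm column, the problem size and the grid of $\gamma$ values (fractions of $\max_i \|a_i\|_2$) are all illustrative assumptions.

```python
import numpy as np

def gpower_l1(A, gamma, iters=200):
    """Single-unit l1 sketch: GPower on (2), then soft-threshold to get z."""
    col_norms = np.linalg.norm(A, axis=0)
    i_star = int(np.argmax(col_norms))
    x = A[:, i_star] / col_norms[i_star]          # assumed initialization
    for _ in range(iters):
        c = A.T @ x
        g = A @ (np.sign(c) * np.maximum(np.abs(c) - gamma, 0.0))
        norm = np.linalg.norm(g)
        if norm == 0.0:                           # all coordinates thresholded
            break
        x = g / norm
    c = A.T @ x
    z_bar = np.sign(c) * np.maximum(np.abs(c) - gamma, 0.0)
    norm = np.linalg.norm(z_bar)
    return z_bar / norm if norm > 0 else z_bar

rng = np.random.default_rng(6)
A = rng.standard_normal((20, 60))                 # illustrative p = 20, n = 60
gamma_max = np.linalg.norm(A, axis=0).max()

# The last fraction slightly exceeds 1 to show the all-zero regime.
fractions = [0.0, 0.25, 0.5, 0.75, 1.05]
cardinality = {t: int(np.count_nonzero(gpower_l1(A, t * gamma_max)))
               for t in fractions}
```

At $\gamma = 0$ the loading vector is dense, and once $\gamma$ exceeds the largest column norm it is identically zero; the values in between trace out a curve of the kind the slide describes.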

15. How does the trade-off evolve in time?

Evolution of the explained variance (solid lines, left axis) and cardinality (dashed lines, right axis) in time for the methods GPower$_{\ell_1}$ and rsvd$_{\ell_1}$. Based on a random test problem of size $p = 250$, $n = 2500$.

16. Random data: speed

Two tables compare the speed of GPower$_{\ell_1}$, GPower$_{\ell_0}$, SPCA, rsvd$_{\ell_1}$ and rsvd$_{\ell_0}$ over problem sizes $p \times n$:
- Fixed $n/p$ ratio
- Fixed $p$, growing $n$

(Table entries not available.)

17. Gene expression data: speed

Data sets (breast cancer cohorts), listed by study with samples ($p$), genes ($n$) and reference:
- Vijver: van de Vijver et al. [2002]
- Wang: Wang et al. [2005]
- Naderi: Naderi et al. [2006]
- JRH-2: Sotiriou et al. [2006]

Speed (in seconds) on Vijver, Wang, Naderi and JRH-2 for GPower$_{\ell_1}$, GPower$_{\ell_0}$, GPower$_{\ell_1,m}$, GPower$_{\ell_0,m}$, SPCA, rsvd$_{\ell_1}$ and rsvd$_{\ell_0}$.

(Table entries not available.)

18. Gene expression data: content

PEI values based on 536 cancer-related pathways, on Vijver, Wang, Naderi and JRH-2, for PCA, GPower$_{\ell_1}$, GPower$_{\ell_0}$, GPower$_{\ell_1,m}$, GPower$_{\ell_0,m}$, SPCA, rsvd$_{\ell_1}$ and rsvd$_{\ell_0}$.

(Table entries not available.)

The Pathway Enrichment Index (PEI) measures the statistical significance of the overlap between two kinds of gene sets.

19. Summary

We have
- developed 4 reformulations (single-unit/block, $\ell_1$/$\ell_0$) of the spca problem, which enabled us to devise a very fast method (we work in dimension $p \ll n$ and use only gradients) and to analyze the iteration complexity of the method;
- analyzed a simple gradient method (Generalized Power Method) for maximizing convex functions on compact sets;
- applied GPower to the 4 reformulations and ended up with 4 algorithms for spca;
- tested our algorithms on random and gene expression data: they outperform other methods significantly in speed (they finish before some other algorithms initialize), and for the biological data they produce solutions of slightly higher quality in terms of PEI.

20. References

[1] M. Journée, Yu. Nesterov, P. Richtárik, R. Sepulchre. Generalized Power Method for Sparse Principal Component Analysis (this talk). Submitted to Journal of Machine Learning Research, November.
[2] H. Zou, T. Hastie, R. Tibshirani. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2).
[3] A. d'Aspremont, F. R. Bach, L. El Ghaoui. Optimal Solutions for Sparse Principal Component Analysis. Journal of Machine Learning Research, 9.
[4] H. Shen, J. Z. Huang. Sparse Principal Component Analysis via Regularized Low Rank Matrix Approximation. Journal of Multivariate Analysis, 99(6), 2008.

CORE DISCUSSION PAPER 2008/70 Generalized power method for sparse principal component analysis M. JOURNÉE, Yu. NESTEROV, P. RICHTÁRIK and R. SEPULCHRE November 2008 Abstract In this paper we develop a