Sparse Principal Component Analysis for multiblock data and its extension to Sparse Multiple Correspondence Analysis


Anne Bernard (1,5), Hervé Abdi (2), Arthur Tenenhaus (3), Christiane Guinot (4), Gilbert Saporta (5)

(1) CE.R.I.E.S., Neuilly-sur-Seine, France
(2) Department of Brain and Behavioral Sciences, The University of Texas at Dallas, Richardson, TX, USA
(3) SUPELEC, Gif-sur-Yvette, France
(4) Université François Rabelais, Département d'Informatique, Tours, France
(5) Laboratoire Cedric, CNAM, Paris, France

June 13, 2013

Background

Variable selection and dimensionality reduction are necessary in many application domains, and more specifically in genetics.
- Quantitative data analysis: PCA
- Categorical data analysis: MCA

With high-dimensional data ($n \ll p$), the principal components (PCs) become difficult to interpret.

Two ways of obtaining simple structures:
1. Sparse methods, such as sparse PCA for quantitative data: PCs are combinations of a few original variables. Extended here to a multiblock data structure (Group Sparse PCA, GSPCA).
2. Extension to categorical variables: development of a new sparse method (Sparse MCA).

Data and methods

Uniblock data structure

- Principal Component Analysis (PCA): each PC is a linear combination of all the original variables, so all loadings are nonzero; PCA results are difficult to interpret.
- Multiple Correspondence Analysis (MCA): the counterpart of PCA for categorical data.
- Sparse PCA (SPCA): modified PCs with sparse loadings, obtained via a regularized SVD.

Principal Component Analysis (PCA)

$\mathbf{X}$: matrix of quantitative variables. PCA can be computed via the SVD of $\mathbf{X}$.

Singular value decomposition (SVD) of $\mathbf{X}$:
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T} \quad \text{with} \quad \mathbf{U}^{T}\mathbf{U} = \mathbf{V}^{T}\mathbf{V} = \mathbf{I} \qquad (1)$$
where $\mathbf{Z} = \mathbf{U}\mathbf{D}$ gives the PCs and $\mathbf{V}$ the corresponding (nonzero) loadings. PCA results are difficult to interpret.

SVD as low-rank approximation of matrices:
$$\min_{\mathbf{X}^{(1)}} \|\mathbf{X} - \mathbf{X}^{(1)}\|_{2}^{2} = \min_{\tilde{u},\tilde{v}} \|\mathbf{X} - \tilde{u}\tilde{v}^{T}\|_{2}^{2} \qquad (2)$$
where $\mathbf{X}^{(1)}$ ranges over rank-one matrices. $\mathbf{X}^{(1)} = \tilde{u}\tilde{v}^{T}$ is the best rank-one approximation of $\mathbf{X}$, with $\tilde{u} = \delta_{1} u_{1}$ the first component and $\tilde{v} = v_{1}$ the first loading vector.
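To make equations (1) and (2) concrete, here is a minimal NumPy sketch (the random data and variable names are ours, not from the slides) that computes the PCs as Z = UD and checks the rank-one approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
X -= X.mean(axis=0)                    # column-center before PCA

# SVD: X = U D V^T, with U^T U = V^T V = I   (equation (1))
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d                              # principal components Z = U D
V = Vt.T                               # loadings: in general every entry is nonzero

# Best rank-one approximation X(1) = delta_1 u_1 v_1^T   (equation (2))
X1 = d[0] * np.outer(U[:, 0], V[:, 0])
print(np.linalg.norm(X - X1, "fro"))   # minimal error over all rank-one matrices
```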

Sparse Principal Component Analysis (SPCA)

Challenge: facilitate the interpretation of PCA results. How? PCs with many zero loadings (sparse loadings).

Several approaches:
- SCoTLASS, Jolliffe et al. (2003)
- SPCA, Zou et al. (2006)
- SPCA-rSVD, Shen and Huang (2008)

Principle: use the connection of PCA with the SVD, i.e. a low-rank matrix approximation problem with a regularization penalty on the loadings.

Regularization in regression

Lasso penalty (Tibshirani, 1996): a variable selection technique that produces sparse models. It imposes the $L_{1}$ norm on the linear regression coefficients.

Regression with $L_{1}$ penalization:
$$\hat{\beta}_{lasso} = \arg\min_{\beta} \|y - \mathbf{X}\beta\|^{2} + \lambda \sum_{j=1}^{J} |\beta_{j}| \quad \text{with} \quad P_{\lambda}(\beta) = \lambda \sum_{j=1}^{J} |\beta_{j}|$$

- If $\lambda = 0$: usual regression (all coefficients nonzero).
- As $\lambda$ increases: some coefficients are set to zero, hence a selection of variables; but the number of selected variables is bounded by the number of units.

Regularization in regression

Other kinds of penalties:

- Hard thresholding penalty (Antoniadis, 1997):
$$P_{\lambda}(|\beta|) = \lambda^{2} - (|\beta| - \lambda)^{2}\, I(|\beta| < \lambda)$$

- SCAD penalty (Smoothly Clipped Absolute Deviation; Fan and Li, 2001), defined through its derivative for $\beta > 0$, with $a > 2$:
$$P'_{\lambda}(\beta) = \lambda \left( I(\beta \le \lambda) + \frac{(a\lambda - \beta)_{+}}{(a-1)\lambda}\, I(\beta > \lambda) \right)$$

- Group lasso penalty (Yuan and Lin, 2006):
$$P_{\lambda}(\beta) = \lambda \sum_{k=1}^{K} \sqrt{J_{[k]}}\, \|\beta_{k}\|_{2} \qquad (3)$$
with $J_{[k]}$ the number of variables in group $k$.
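For illustration, a small hedged sketch (plain NumPy; the function names and the test vector are ours) evaluating these penalties; $a = 3.7$ is the default value suggested by Fan and Li:

```python
import numpy as np

def lasso_penalty(beta, lam):
    # L1 penalty: lam * sum_j |beta_j|
    return lam * np.sum(np.abs(beta))

def hard_threshold_penalty(beta, lam):
    # Antoniadis (1997): lam^2 - (|beta| - lam)^2 * I(|beta| < lam), summed over j
    b = np.abs(beta)
    return np.sum(lam**2 - (b - lam)**2 * (b < lam))

def scad_penalty_derivative(beta, lam, a=3.7):
    # Fan & Li (2001), derivative form; a = 3.7 is their suggested default
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1) * lam) * (b > lam))

beta = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(lasso_penalty(beta, 0.5), hard_threshold_penalty(beta, 0.5))
print(scad_penalty_derivative(beta, 0.5))
```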

Sparse PCA via regularized SVD (SPCA-rSVD)

Low-rank matrix approximation problem, plus a regularization penalty applied to the loading vector $\tilde{v}$:
$$\min_{\tilde{u},\tilde{v}} \|\mathbf{X} - \tilde{u}\tilde{v}^{T}\|_{2}^{2} + P_{\lambda}(\tilde{v}) \qquad (4)$$
with, assuming $\tilde{u}^{T}\tilde{u} = 1$,
$$\begin{aligned}
\|\mathbf{X} - \tilde{u}\tilde{v}^{T}\|_{2}^{2} + P_{\lambda}(\tilde{v})
&= \operatorname{tr}\!\left((\mathbf{X} - \tilde{u}\tilde{v}^{T})^{T}(\mathbf{X} - \tilde{u}\tilde{v}^{T})\right) + P_{\lambda}(\tilde{v}) \\
&= \|\mathbf{X}\|_{2}^{2} + \operatorname{tr}(\tilde{v}^{T}\tilde{v}) - 2\operatorname{tr}(\mathbf{X}^{T}\tilde{u}\tilde{v}^{T}) + P_{\lambda}(\tilde{v}) \\
&= \|\mathbf{X}\|_{2}^{2} + \sum_{j=1}^{J}\left(\tilde{v}_{j}^{2} - 2(\mathbf{X}^{T}\tilde{u})_{j}\tilde{v}_{j} + P_{\lambda}(\tilde{v}_{j})\right)
\end{aligned} \qquad (5)$$

Sparse PCA via regularized SVD (SPCA-rSVD)

Iterative procedure:
1. $\tilde{v}$ fixed: the minimizer of (4) is $\tilde{u} = \mathbf{X}\tilde{v} / \|\mathbf{X}\tilde{v}\|$.
2. $\tilde{u}$ fixed: the minimizer of (4) is $\tilde{v} = h_{\lambda}(\mathbf{X}^{T}\tilde{u})$ with
$$h_{\lambda}(\mathbf{X}^{T}\tilde{u}) = \arg\min_{\tilde{v}_{j}} \left( \tilde{v}_{j}^{2} - 2(\mathbf{X}^{T}\tilde{u})_{j}\tilde{v}_{j} + P_{\lambda}(\tilde{v}_{j}) \right) \qquad (6)$$

- For the lasso penalty: $h_{\lambda}(\mathbf{X}^{T}\tilde{u}) = \operatorname{sign}(\mathbf{X}^{T}\tilde{u})\,(|\mathbf{X}^{T}\tilde{u}| - \lambda)_{+}$
- For the hard thresholding penalty: $h_{\lambda}(\mathbf{X}^{T}\tilde{u}) = I(|\mathbf{X}^{T}\tilde{u}| > \lambda)\,\mathbf{X}^{T}\tilde{u}$

The minimum is obtained by applying $h_{\lambda}$ to the vector $\mathbf{X}^{T}\tilde{u}$ componentwise, which yields $v_{1}$. Subsequent sparse loadings $v_{i}$ ($i > 1$) are obtained via rank-one approximation of residual matrices.
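The following is a minimal sketch of this alternating procedure with the lasso (soft-thresholding) rule, assuming a column-centered X; the function names and stopping rule are ours, not from the slides:

```python
import numpy as np

def soft_threshold(z, lam):
    # Lasso solution of (6): sign(z) * (|z| - lam)_+
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def spca_rank_one(X, lam, n_iter=200, tol=1e-8):
    """One sparse rank-one factor of X via alternating updates (SPCA-rSVD style)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], d[0] * Vt[0]               # warm start from the plain SVD
    for _ in range(n_iter):
        v_new = soft_threshold(X.T @ u, lam)   # step 2: u fixed
        Xv = X @ v_new
        if np.linalg.norm(Xv) == 0.0:
            return u, v_new                    # everything was thresholded away
        u_new = Xv / np.linalg.norm(Xv)        # step 1: v fixed
        if np.linalg.norm(v_new - v) < tol:
            return u_new, v_new
        u, v = u_new, v_new
    return u, v

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
u, v = spca_rank_one(X - X.mean(axis=0), lam=2.0)
print(v)   # sparse loading vector: exact zeros for large enough lam
```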

Multiblock data structure

Group Sparse PCA (GSPCA):
- Challenge: create sparseness in a multiblock data structure.
- Principle: modified PCs with sparse loadings, using the group lasso penalty.
- Result: all the variables of a block are selected or removed together.

Group Sparse PCA

Group Sparse PCA: an extension of SPCA-rSVD to blocks of quantitative variables.
- Challenge: select groups of continuous variables, i.e. assign zero coefficients to entire blocks of variables.
- How? As a compromise between SPCA-rSVD and the group lasso.

Group Sparse PCA

$\mathbf{X}$: matrix of quantitative variables divided into $K$ sub-matrices. SVD of $\mathbf{X}$:
$$\mathbf{X} = \left[\mathbf{X}_{[1]} \cdots \mathbf{X}_{[k]} \cdots \mathbf{X}_{[K]}\right] = \mathbf{U}\mathbf{D}\mathbf{V}^{T} = \mathbf{U}\mathbf{D}\left[\mathbf{V}_{[1]}^{T} \cdots \mathbf{V}_{[k]}^{T} \cdots \mathbf{V}_{[K]}^{T}\right]$$
where each row of $\mathbf{V}^{T}$ is a vector divided into $K$ blocks.

Group Sparse PCA

Recall SPCA-rSVD:
$$\min_{\tilde{u},\tilde{v}} \|\mathbf{X} - \tilde{u}\tilde{v}^{T}\|_{2}^{2} + P_{\lambda}(\tilde{v}) \qquad (7)$$
For $\tilde{u}$ fixed: $\tilde{v} = h_{\lambda}(\mathbf{X}^{T}\tilde{u}) = \arg\min_{\tilde{v}_{j}}\left(\tilde{v}_{j}^{2} - 2(\mathbf{X}^{T}\tilde{u})_{j}\tilde{v}_{j} + P_{\lambda}(\tilde{v}_{j})\right)$.

In GSPCA:
$$\tilde{v} = h_{\lambda}(\mathbf{X}^{T}\tilde{u}) = \arg\min_{\tilde{v}_{[k]}} \left( \operatorname{tr}(\tilde{v}_{[k]}\tilde{v}_{[k]}^{T}) - 2\operatorname{tr}(\tilde{v}_{[k]}\tilde{u}^{T}\mathbf{X}_{[k]}) + P_{\lambda}(\tilde{v}_{[k]}) \right)$$
Here $P_{\lambda}$ is the group lasso regularizer:
$$P_{\lambda}(\tilde{v}) = \lambda \sum_{k=1}^{K} \sqrt{J_{[k]}}\, \|\tilde{v}_{[k]}\|_{2} \qquad (8)$$
The solution is
$$h_{\lambda}(\mathbf{X}_{[k]}^{T}\tilde{u}) = \left(1 - \frac{\lambda \sqrt{J_{[k]}}}{2\,\|\mathbf{X}_{[k]}^{T}\tilde{u}\|_{2}}\right)_{+} \mathbf{X}_{[k]}^{T}\tilde{u} \qquad (9)$$
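Equation (9) is a blockwise soft-thresholding operator: the whole block is shrunk by a common factor and is zeroed out once that factor reaches zero. A minimal sketch (our names, not from the slides):

```python
import numpy as np

def group_threshold(z_block, lam):
    # Equation (9): (1 - lam * sqrt(J_k) / (2 * ||z_block||_2))_+ * z_block
    norm = np.linalg.norm(z_block)
    if norm == 0.0:
        return np.zeros_like(z_block)
    shrink = max(0.0, 1.0 - lam * np.sqrt(z_block.size) / (2.0 * norm))
    return shrink * z_block   # the whole block survives or dies together

z = np.array([0.8, -0.1, 0.3])
print(group_threshold(z, lam=0.5))   # shrunk block
print(group_threshold(z, lam=5.0))   # zeroed block
```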

Group Sparse PCA

Group Sparse PCA Algorithm
1. Apply the SVD to $\mathbf{X}$ and obtain the best rank-one approximation of $\mathbf{X}$ as $\delta\tilde{u}\tilde{v}^{T}$. Set $\tilde{v}_{old} = \tilde{v} = [\tilde{v}_{[1]}, \ldots, \tilde{v}_{[K]}]$ and $\tilde{u}_{old} = \delta\tilde{u}$.
2. Update:
   a) $\tilde{v}_{new} = [h_{\lambda}(\mathbf{X}_{[1]}^{T}\tilde{u}_{old}), \ldots, h_{\lambda}(\mathbf{X}_{[K]}^{T}\tilde{u}_{old})]$
   b) $\tilde{u}_{new} = \mathbf{X}\tilde{v}_{new} / \|\mathbf{X}\tilde{v}_{new}\|$
3. Repeat Step 2, replacing $\tilde{u}_{old}$ and $\tilde{v}_{old}$ by $\tilde{u}_{new}$ and $\tilde{v}_{new}$, until convergence.

Sparse loadings $v_{i}$ ($i > 1$) are obtained via rank-one approximation of residual matrices. The tuning parameter $\lambda$ is chosen by cross-validation or by an ad hoc approach.
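A hedged sketch of the full loop, assuming contiguous blocks of columns; the data, block layout, and stopping rule are ours for illustration:

```python
import numpy as np

def group_threshold(z_block, lam):
    # Block thresholding of equation (9), repeated so this sketch runs on its own
    norm = np.linalg.norm(z_block)
    if norm == 0.0:
        return np.zeros_like(z_block)
    return max(0.0, 1.0 - lam * np.sqrt(z_block.size) / (2.0 * norm)) * z_block

def gspca_rank_one(X, blocks, lam, n_iter=200, tol=1e-8):
    # Step 1: SVD start, u_old = delta * u, v_old = v
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = d[0] * U[:, 0], Vt[0].copy()
    for _ in range(n_iter):
        # Step 2a: group thresholding of X_[k]^T u_old, block by block
        v_new = np.concatenate([group_threshold(X[:, b].T @ u, lam) for b in blocks])
        Xv = X @ v_new
        if np.linalg.norm(Xv) == 0.0:
            return u, v_new                  # every block was zeroed out
        u_new = Xv / np.linalg.norm(Xv)      # step 2b
        if np.linalg.norm(v_new - v) < tol:
            return u_new, v_new              # step 3: converged
        u, v = u_new, v_new
    return u, v

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 9))
X -= X.mean(axis=0)
blocks = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]  # three blocks of 3
u, v = gspca_rank_one(X, blocks, lam=1.5)
print(v.round(3))   # whole blocks are zero or nonzero together
```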

Multiblock data structure

Selecting one column of the original table (a categorical variable of $\mathbf{X}$) amounts to selecting the corresponding block of indicator variables in the complete disjunctive table.

Sparse MCA (SMCA): an extension of GSPCA to blocks of indicator variables.

MCA

Generalized SVD (GSVD) of $\mathbf{F}$, with:
- $r = \mathbf{F}\mathbf{1}$ and $c = \mathbf{F}^{T}\mathbf{1}$ the row and column margins, $\mathbf{D}_{r} = \operatorname{diag}(r)$, $\mathbf{D}_{c} = \operatorname{diag}(c)$
- $p_{[j]}$ = number of modalities of variable $j$

$$\mathbf{F} = \mathbf{P}\boldsymbol{\Delta}\mathbf{Q}^{T} \quad \text{with} \quad \mathbf{P}^{T}\mathbf{M}\mathbf{P} = \mathbf{Q}^{T}\mathbf{W}\mathbf{Q} = \mathbf{I}$$
where $\mathbf{F} = [\mathbf{F}_{[1]} \cdots \mathbf{F}_{[j]} \cdots \mathbf{F}_{[J]}]$ and $\mathbf{Q} = [\mathbf{Q}_{[1]} \cdots \mathbf{Q}_{[j]} \cdots \mathbf{Q}_{[J]}]$.

- In the case of PCA: $\mathbf{M} = \mathbf{W} = \mathbf{I}$.
- In the case of MCA: $\mathbf{M} = \mathbf{D}_{r}$, $\mathbf{W} = \mathbf{D}_{c}$.

From MCA to SMCA

GSVD as low-rank matrix approximation:
$$\min_{\mathbf{F}^{(1)}} \|\mathbf{F} - \mathbf{F}^{(1)}\|_{\mathbf{W}}^{2} = \min_{\tilde{p},\tilde{q}} \|\mathbf{F} - \tilde{p}\tilde{q}^{T}\|_{\mathbf{W}}^{2} \qquad (10)$$
where $\|\mathbf{F}\|_{\mathbf{W}}^{2} = \operatorname{tr}(\mathbf{M}^{\frac{1}{2}}\mathbf{F}\mathbf{W}\mathbf{F}^{T}\mathbf{M}^{\frac{1}{2}})$ is the $\mathbf{W}$-generalized norm; the norm is taken under the constraints given by $\mathbf{M}$ and $\mathbf{W}$.
$$\|\mathbf{F} - \tilde{p}\tilde{q}^{T}\|_{\mathbf{W}}^{2} = \operatorname{tr}\left(\mathbf{M}^{\frac{1}{2}}(\mathbf{F} - \tilde{p}\tilde{q}^{T})\mathbf{W}(\mathbf{F} - \tilde{p}\tilde{q}^{T})^{T}\mathbf{M}^{\frac{1}{2}}\right) \qquad (11)$$

Sparse MCA problem:
$$\min_{\tilde{p},\tilde{q}} \|\mathbf{F} - \tilde{p}\tilde{q}^{T}\|_{\mathbf{W}}^{2} + P_{\lambda}(\tilde{q}) \quad \text{subject to} \quad \tilde{p}^{T}\mathbf{M}\tilde{p} = \tilde{q}^{T}\mathbf{W}\tilde{q} = 1 \qquad (12)$$
$$\min_{\tilde{p},\tilde{q}} \operatorname{tr}\left(\mathbf{M}^{\frac{1}{2}}(\mathbf{F} - \tilde{p}\tilde{q}^{T})\mathbf{W}(\mathbf{F} - \tilde{p}\tilde{q}^{T})^{T}\mathbf{M}^{\frac{1}{2}}\right) + P_{\lambda}(\tilde{q}) \qquad (13)$$

From MCA to SMCA

Recall the GSPCA solution:
$$\arg\min_{\tilde{v}_{[k]}}\left( \operatorname{tr}(\tilde{v}_{[k]}\tilde{v}_{[k]}^{T}) - 2\operatorname{tr}(\tilde{v}_{[k]}\tilde{u}^{T}\mathbf{X}_{[k]}) + P_{\lambda}(\tilde{v}_{[k]}) \right)$$
$$\tilde{v} = h_{\lambda}(\mathbf{X}_{[k]}^{T}\tilde{u}) = \left(1 - \frac{\lambda\sqrt{J_{[k]}}}{2\,\|\mathbf{X}_{[k]}^{T}\tilde{u}\|_{2}}\right)_{+}\mathbf{X}_{[k]}^{T}\tilde{u}$$

Sparse MCA solution:
$$\arg\min_{\tilde{q}_{[k]}}\left( \operatorname{tr}(\tilde{q}_{[k]}^{T}\mathbf{W}_{[k]}\tilde{q}_{[k]}) - 2\operatorname{tr}(\mathbf{M}\mathbf{F}_{[k]}\mathbf{W}_{[k]}\tilde{q}_{[k]}\tilde{p}^{T}) + P_{\lambda}(\tilde{q}_{[k]}) \right)$$
$$\tilde{q} = h_{\lambda}(\mathbf{F}_{[k]}^{T}\mathbf{M}\tilde{p}) = \left(1 - \frac{\lambda\sqrt{J_{[k]}}}{2\,\|\mathbf{F}_{[k]}^{T}\mathbf{M}\tilde{p}\|_{\mathbf{W}}}\right)_{+}\mathbf{F}_{[k]}^{T}\mathbf{M}\tilde{p}$$

Sparse MCA

Sparse MCA Algorithm
1. Apply the GSVD to $\mathbf{F}$ and obtain the best rank-one approximation of $\mathbf{F}$ as $\delta\tilde{p}\tilde{q}^{T}$. Set $\tilde{p}_{old} = \delta\tilde{p}$ and $\tilde{q}_{old} = \tilde{q}$.
2. Update:
   a) $\tilde{q}_{new} = [h_{\lambda}(\mathbf{F}_{[1]}^{T}\mathbf{M}\tilde{p}_{old}), \ldots, h_{\lambda}(\mathbf{F}_{[J]}^{T}\mathbf{M}\tilde{p}_{old})]$
   b) $\tilde{p}_{new} = \mathbf{F}\tilde{q}_{new} / \|\mathbf{F}\tilde{q}_{new}\|$
3. Repeat Step 2, replacing $\tilde{p}_{old}$ and $\tilde{q}_{old}$ by $\tilde{p}_{new}$ and $\tilde{q}_{new}$, until convergence.

Sparse loadings $q_{i}$ ($i > 1$) are obtained via rank-one approximation of residual matrices.
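A minimal sketch of this loop for diagonal metrics M and W, computing the GSVD start through the plain SVD of M^(1/2) F W^(1/2) (so that P = M^(-1/2) U and Q = W^(-1/2) V); all names and the toy masses are ours, not from the slides:

```python
import numpy as np

def block_threshold_W(z, lam, w_diag):
    # Group thresholding in the W-norm: ||z||_W = sqrt(z^T W z), W diagonal here
    norm_w = np.sqrt(z @ (w_diag * z))
    if norm_w == 0.0:
        return np.zeros_like(z)
    return max(0.0, 1.0 - lam * np.sqrt(z.size) / (2.0 * norm_w)) * z

def smca_rank_one(F, blocks, m_diag, w_diag, lam, n_iter=200, tol=1e-8):
    # GSVD start: F = P Delta Q^T with P^T M P = Q^T W Q = I
    ms, ws = np.sqrt(m_diag), np.sqrt(w_diag)
    U, d, Vt = np.linalg.svd((ms[:, None] * F) * ws[None, :], full_matrices=False)
    p = d[0] * (U[:, 0] / ms)            # p_old = delta * p, with P = M^(-1/2) U
    q = Vt[0] / ws                       # q_old = q, with Q = W^(-1/2) V
    for _ in range(n_iter):
        # Step 2a: blockwise thresholding of F_[j]^T M p_old
        q_new = np.concatenate([block_threshold_W(F[:, b].T @ (m_diag * p),
                                                  lam, w_diag[b]) for b in blocks])
        Fq = F @ q_new
        if np.linalg.norm(Fq) == 0.0:
            return p, q_new
        p_new = Fq / np.linalg.norm(Fq)  # step 2b
        if np.linalg.norm(q_new - q) < tol:
            return p_new, q_new          # step 3: converged
        p, q = p_new, q_new
    return p, q

# Toy usage: random nonnegative F, two blocks, toy row/column masses (assumptions)
rng = np.random.default_rng(3)
F = rng.random((10, 6))
blocks = [np.arange(0, 3), np.arange(3, 6)]
p, q = smca_rank_one(F, blocks, np.full(10, 0.1), F.sum(axis=0) / F.sum(), lam=0.5)
print(q.round(3))
```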

Properties of the SMCA

Property                     MCA                          Sparse MCA
Barycentric properties       TRUE                         TRUE
Non-correlated components    TRUE                         FALSE
Orthogonal loadings          TRUE                         FALSE
Inertia                      (δ_j / total) × 100          Adjusted variance

Sparse MCA: example on breeds of dogs

From Tenenhaus, M. (2007).

Data:
- I = 27 breeds of dogs
- J = 6 variables
- Q = 16 (total number of modalities)
- X: 27 × 6 matrix of categorical variables
- D: complete disjunctive table, D = (D_[1], ..., D_[6])
- 1 block = 1 descriptive variable = 1 sub-matrix D_[j]
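To make the block structure concrete, a small pandas sketch (the toy data are ours, not the actual 27 breeds) building a complete disjunctive table in which each categorical variable yields one block of indicator columns:

```python
import pandas as pd

# Toy stand-in for the dogs data: each column is one categorical variable
X = pd.DataFrame({
    "size":   ["large", "small", "medium", "large"],
    "weight": ["heavy", "lightweight", "lightweight", "very heavy"],
})

# Complete disjunctive table: one 0/1 indicator column per modality
D = pd.get_dummies(X)
print(D)

# One block per original variable: the indicator columns derived from it
blocks = {var: [c for c in D.columns if c.startswith(var + "_")] for var in X.columns}
print(blocks)
```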

Sparse MCA: example on breeds of dogs

Result of the MCA

[Figure: MCA factor map of the breeds and modalities; not reproduced in this transcription.]

Sparse MCA: example on breeds of dogs

Result of the Sparse MCA

Beyond λ = 0.25, the CPEV (cumulative percentage of explained variance) drops rapidly. We choose λ = 0.25 as a compromise between the number of variables selected and the percentage of variance lost.

[Figures: CPEV as a function of λ, and the resulting Sparse MCA results; not reproduced in this transcription.]

Sparse MCA: example on breeds of dogs

Comparison of the loadings

[Table: loadings of the 16 modalities (large, medium, small; lightweight, heavy, very heavy; slow, fast, very fast; unintelligent, average intelligence, very intelligent; unloving, very affectionate; aggressive, non-aggressive) on components 1-4, for MCA and for Sparse MCA, together with the number of non-zero loadings and the adjusted variance (%). The numeric values were lost in this transcription.]

Sparse MCA: application to SNP data

Single nucleotide polymorphism (SNP): a single base-pair mutation at a specific locus; the most frequent type of polymorphism.

Data:
- I = 502 women
- J = 537 SNPs
- Q = 1554 (2 or 3 modalities per SNP)
- X: matrix of categorical variables (SNPs)
- D: complete disjunctive matrix, D = (D_[1], ..., D_[537])
- 1 block = 1 SNP = 1 sub-matrix D_[j]

Sparse MCA: application to SNP data

For a smaller value of λ (lost in this transcription): CPEV = 0.36%, with 300 columns selected on component 1. For λ = 0.005: CPEV = 0.32%, with 174 columns selected on component 1. We choose λ = 0.005.

[Figure: CPEV as a function of λ for the SNP data; not reproduced in this transcription.]

Sparse MCA: application to SNP data

Comparison of the loadings

[Table: loadings of the modalities of the first four SNPs (SNP1.AA, SNP1.AG, SNP1.GG; SNP2.AA, SNP2.AG, SNP2.GG; SNP3.CC, SNP3.CG, SNP3.GG; SNP4.AA, SNP4.AG, SNP4.GG) on components 1-2, for MCA and SMCA, with the number of non-zero loadings, the variance (%), and the cumulative variance (%). The numeric values were lost in this transcription.]


Conclusions

In an unsupervised multiblock data context: GSPCA for continuous variables, SMCA for categorical variables.
- Both produce sparse loading structures (with a limited loss of explained variance), which makes the results easier to interpret and understand.
- Very powerful for variable selection in high-dimensional problems: they reduce noise as well as computation time.

Research in progress: extending Sparse MCA to select both groups and predictors within a group, i.e. sparsity at both the group and the individual feature level, as a compromise between Sparse MCA and the sparse-group lasso developed by Simon et al. (2012).

References

[1] Jolliffe, I. T. et al. (2003), A modified principal component technique based on the LASSO, Journal of Computational and Graphical Statistics, 12, 531-547.
[2] Shen, H. and Huang, J. Z. (2008), Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis, 99, 1015-1034.
[3] Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2012), A Sparse-Group Lasso, Journal of Computational and Graphical Statistics.
[4] Tenenhaus, M. (2007), Statistique : méthodes pour décrire, expliquer et prévoir, Dunod.
[5] Yuan, M. and Lin, Y. (2006), Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B, 68, 49-67.
[6] Zou, H., Hastie, T., and Tibshirani, R. (2006), Sparse Principal Component Analysis, Journal of Computational and Graphical Statistics, 15, 265-286.
