

Some Sparse Methods for High Dimensional Data Gilbert Saporta CEDRIC, Conservatoire National des Arts et Métiers, Paris gilbert.saporta@cnam.fr

A joint work with Anne Bernard (QFAB Bioinformatics), Stéphanie Bougeard (ANSES), Ndeye Niang (CNAM) and Cristian Preda (Lille). H2DM, Napoli, June 7, 2016.

Outline
1. Introduction
2. Sparse regression
3. Sparse PCA
4. Group sparse PCA and sparse MCA
5. Clusterwise sparse PLS regression
6. Conclusion and perspectives

1. Introduction
High dimensional data (p >> n): gene expression data, chemometrics, etc. Several solutions exist for regression problems that keep all the variables (PLS, ridge, PCR), but the resulting combinations are difficult to interpret. Sparse methods provide combinations of only a few variables.

This talk: a survey of sparse methods for supervised (regression) and unsupervised (PCA) problems, and new proposals for large n and large p when there is unobserved heterogeneity among the observations: clusterwise sparse PLS regression.

2. Sparse regression
Keeping all predictors is a drawback for high dimensional data: combinations of too many variables cannot be interpreted. Sparse methods simultaneously shrink coefficients and select variables, which generally yields better predictions and more interpretable models.

2.1 Lasso and elastic net
The Lasso (Tibshirani, 1996) is a shrinkage and selection method. It minimizes the usual sum of squared errors with a bound on the sum of the absolute values of the coefficients (L1 penalty):

$$\min_{\beta} \|y - X\beta\|^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le c,$$

or equivalently, in Lagrangian form,

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$

This looks similar to ridge regression (L2 penalty):

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \quad \text{i.e.} \quad \min_{\beta} \|y - X\beta\|^2 \ \text{subject to} \ \sum_{j=1}^{p} \beta_j^2 \le c.$$

The Lasso continuously shrinks the coefficients towards zero as c decreases. It is a convex optimisation problem with no explicit solution. If $c \ge \sum_{j=1}^{p} |\hat{\beta}_{j}^{OLS}|$, the Lasso solution is identical to OLS.
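As an illustration (not part of the talk), here is a minimal Python sketch contrasting the two penalties on simulated data; the data and the penalty values are arbitrary. The point is that the L1 penalty sets some coefficients exactly to zero, whereas the L2 penalty only shrinks them.

```python
# Minimal sketch (not from the talk): L1 (lasso) vs L2 (ridge) shrinkage on simulated data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]           # only 3 truly active predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)    # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso: non-zero coefficients =", np.sum(lasso.coef_ != 0))   # some are exactly zero
print("ridge: non-zero coefficients =", np.sum(ridge.coef_ != 0))   # all p remain non-zero
```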

Constraints and log-priors
Like ridge regression, the Lasso can be interpreted as a Bayesian regression, but with a Laplace (double exponential) prior,

$$f(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|),$$

so that the L1 penalty is proportional to the minus log-prior.

Finding the optimal parameter
Cross-validation if optimal prediction is needed; BIC when sparsity is the main concern:

$$\lambda_{opt} = \arg\min_{\lambda} \left\{ \frac{\|y - X\hat{\beta}(\lambda)\|^2}{n\,\hat{\sigma}^2} + \frac{\log(n)}{n}\, df(\lambda) \right\}.$$

A good unbiased estimate of df is the number of nonzero coefficients (Zou et al., 2007).
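A minimal sketch of this BIC criterion, assuming scikit-learn's Lasso as the solver; the noise variance estimate and the λ grid are placeholders to be supplied by the user.

```python
# Sketch: BIC-based choice of the lasso penalty, with df(lambda) estimated by the
# number of non-zero coefficients (Zou et al., 2007). sigma2_hat is assumed known.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_bic(X, y, lambdas, sigma2_hat):
    n = len(y)
    bics = []
    for lam in lambdas:
        model = Lasso(alpha=lam).fit(X, y)
        rss = np.sum((y - model.predict(X)) ** 2)
        df = np.sum(model.coef_ != 0)            # unbiased estimate of the degrees of freedom
        bics.append(rss / (n * sigma2_hat) + np.log(n) / n * df)
    return lambdas[int(np.argmin(bics))]

# Example call (arbitrary grid and noise variance):
# lam_opt = lasso_bic(X, y, np.logspace(-3, 0, 30), sigma2_hat=0.25)
```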

A few drawbacks
The Lasso produces a sparse model, but the number of selected variables cannot exceed the number of units: it is ill-suited to DNA arrays, where n (arrays) << p (genes). It also selects only one variable among a set of correlated predictors.

Elastic net
Combines the ridge and lasso penalties to select more predictors than the number of observations (Zou & Hastie, 2005):

$$\hat{\beta}_{en} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1.$$

The L1 part leads to a sparse model; the L2 part removes the limitation on the number of retained predictors and favours grouped selection.
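A hedged illustration of this grouping effect: with three nearly identical columns, the lasso typically keeps only one of them, whereas the elastic net tends to keep the whole group. Data and penalty values are arbitrary.

```python
# Sketch (not from the talk): elastic net vs lasso on strongly correlated predictors.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n) for _ in range(3)]   # 3 nearly identical columns
                    + [rng.normal(size=n) for _ in range(5)])           # 5 noise columns
y = 2 * z + rng.normal(scale=0.5, size=n)

print("lasso       :", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print("elastic net :", np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```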

2.2 Group lasso
The X matrix is divided into J sub-matrices X_j of p_j variables. The group lasso is an extension of the lasso for selecting groups of variables (Yuan & Lin, 2006):

$$\hat{\beta}_{GL} = \arg\min_{\beta} \left\| y - \sum_{j=1}^{J} X_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{J} \sqrt{p_j}\, \|\beta_j\|_2.$$

If p_j = 1 for all j, the group lasso reduces to the lasso.

Drawback: no sparsity within groups. A solution is the sparse group lasso (Simon et al., 2012):

$$\min_{\beta} \left\| y - \sum_{j=1}^{J} X_j \beta_j \right\|^2 + \lambda_1 \sum_{j=1}^{J} \|\beta_j\|_2 + \lambda_2 \sum_{j=1}^{J} \sum_{i=1}^{p_j} |\beta_{ij}|.$$

Two tuning parameters, chosen by grid search.
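Neither the group lasso nor the sparse group lasso is available in scikit-learn; as an illustration of the mechanism only, here is a sketch of the block soft-thresholding (proximal) operator that underlies proximal-gradient algorithms for these penalties. It is not the algorithm used in the talk.

```python
# Sketch of the group-lasso proximal operator (block soft-thresholding): a whole group
# of coefficients is set to zero when its Euclidean norm is small enough.
import numpy as np

def prox_group_lasso(beta, groups, lam, step):
    """Apply block soft-thresholding to each group of coefficients."""
    out = beta.copy()
    for idx in groups:                       # idx: array of column indices of one group
        norm = np.linalg.norm(beta[idx])
        shrink = max(0.0, 1.0 - step * lam * np.sqrt(len(idx)) / norm) if norm > 0 else 0.0
        out[idx] = shrink * beta[idx]        # the whole group is zeroed when its norm is small
    return out

# Example: the second group is dropped entirely.
beta = np.array([2.0, -1.5, 0.1, 0.05, 1.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_lasso(beta, groups, lam=0.5, step=1.0))
```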

2.3 Sparse PLS
Several solutions: Lê Cao et al. (2008), Chun & Keles (2010).

3. Sparse PCA
In PCA each principal component is a linear combination of all the original variables, which makes the results difficult to interpret. The challenge of sparse PCA (SPCA) is to obtain easily interpretable components, with many zero loadings in the principal factors. The principle is to modify PCA by imposing lasso/elastic-net constraints so as to construct modified PCs with sparse loadings. Warning: sparse PCA does not provide a global selection of variables but a selection dimension by dimension; this differs from the regression context (lasso, elastic net, ...).

3.1 First attempts
Simple PCA by Vines (2000): integer loadings. Rousson & Gasser (2004): loadings restricted to (+, 0, -). SCoTLASS (Simplified Component Technique - LASSO) by Jolliffe et al. (2003): PCA with an extra L1 constraint,

$$\max_{u} \ u'Vu \quad \text{subject to} \quad u'u = 1 \ \text{and} \ \sum_{j=1}^{p} |u_j| \le t,$$

with V the covariance matrix.

SCoTLASS properties:
- t ≥ √p: usual PCA
- t < 1: no solution
- t = 1: only one nonzero coefficient
- sparse solutions for 1 < t < √p
Non-convex problem.

3.2 S-PCA via regression and SVD (Zou et al., 2006)
Let the SVD of X be X = UDV', with Z = UD the principal components. Loadings can be recovered by regressing (ridge regression) the PCs on the p variables:

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \|Z_i - X\beta\|^2 + \lambda \|\beta\|^2.$$

Since $X'X = VD^2V'$ with $V'V = I$,

$$\hat{\beta}_{i,ridge} = (X'X + \lambda I)^{-1} X'X\, v_i = \frac{d_{ii}^2}{d_{ii}^2 + \lambda}\, v_i \ \propto\ v_i.$$

Hence PCA can be written as a regression-type optimization problem.
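A quick numerical check of this identity (illustrative only, with arbitrary data and λ): the ridge regression of a principal component on X returns a vector proportional to the corresponding loading.

```python
# Numerical check: regressing the i-th principal component Z_i on X with a ridge penalty
# gives a coefficient vector proportional to the loading v_i.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
X -= X.mean(axis=0)                      # PCA on centred data
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U * d                                # principal components, Z = UD

lam, i = 5.0, 0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Z[:, i])
print(beta_ridge / Vt[i])                # constant vector
print(d[i] ** 2 / (d[i] ** 2 + lam))     # equals d_ii^2 / (d_ii^2 + lambda)
```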

Sparse PCA adds a new penalty (a lasso penalty) to produce sparse loadings:

$$\hat{\beta} = \arg\min_{\beta} \|Z_i - X\beta\|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1.$$

$\hat{v}_i = \hat{\beta}/\|\hat{\beta}\|$ is an approximation of $v_i$, and $X\hat{v}_i$ is the i-th approximated component. This produces sparse loadings with zero coefficients that facilitate interpretation. The algorithm alternates between an elastic net step and an SVD step.
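For a practical illustration of sparse loadings, scikit-learn's SparsePCA can be used; note that it solves a related L1-penalised matrix factorisation, not exactly the alternated elastic-net/SVD algorithm described here. Data and penalty are arbitrary.

```python
# Sketch: many-zero-loadings behaviour of sparse PCA compared with ordinary PCA.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
X -= X.mean(axis=0)

pca = PCA(n_components=3).fit(X)
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

print("zero loadings, PCA        :", np.sum(pca.components_ == 0))
print("zero loadings, sparse PCA :", np.sum(spca.components_ == 0))
```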

Loss of orthogonality
SCoTLASS: orthogonal loadings but correlated components. S-PCA: neither the loadings nor the components are orthogonal. Hence the need to adjust the % of explained variance.

4. Group sparse PCA and sparse MCA (Bernard, Guinot, Saporta, 2012)
Industrial context and motivation: funded by the R&D department of the Chanel cosmetics company, to relate gene expression data to skin aging measures. n = 500, p = 800,000 SNPs, 15,000 genes.

4.1 Group Sparse PCA
The data matrix X is divided into J groups X_j of p_j variables. Group sparse PCA is a compromise between SPCA and the group lasso. Goal: select groups of continuous variables (zero coefficients for entire blocks of variables). Principle: replace the penalty function in the SPCA algorithm,

$$\hat{\beta} = \arg\min_{\beta} \|Z - X\beta\|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1,$$

by the group-lasso penalty:

$$\hat{\beta}_{GL} = \arg\min_{\beta} \left\| Z - \sum_{j=1}^{J} X_j \beta_j \right\|^2 + \lambda \sum_{j=1}^{J} \sqrt{p_j}\, \|\beta_j\|_2.$$

4.2 Sparse MCA
In MCA the original table of categorical variables is replaced by the complete disjunctive table: selecting one column of the original table (a categorical variable X_j) amounts to selecting a block of p_j indicator variables in the complete disjunctive table. The challenge of sparse MCA is to select categorical variables, not categories. Principle: a straightforward extension of group sparse PCA to groups of indicator variables, with the chi-square metric. It uses the sPCA-rSVD algorithm (Shen & Huang, 2008).
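A small sketch of the coding this relies on: each categorical variable becomes a block of indicator columns in the complete disjunctive table, and the group penalty acts on these blocks. The data are invented for illustration.

```python
# Sketch of the complete disjunctive (indicator) coding used by MCA: one block of
# indicator columns per categorical variable; Sparse MCA selects whole blocks.
import pandas as pd

X = pd.DataFrame({"snp1": ["AA", "AG", "GG", "AA"],
                  "snp2": ["CC", "CT", "CC", "TT"]})
K = pd.get_dummies(X)                 # complete disjunctive table
blocks = {var: [c for c in K.columns if c.startswith(var + "_")] for var in X.columns}
print(K)
print(blocks)                         # one block of indicator columns per SNP
```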

Application on genetic data: Single Nucleotide Polymorphisms
Data: n = 502 individuals, p = 537 SNPs (among more than 800,000 in the original data base, 15,000 genes), q = 1554 (total number of columns).
X: 502 x 537 matrix of qualitative variables. K: 502 x 1554 complete disjunctive table, K = (K_1, ..., K_1554). One block = one SNP = one K_j matrix.


Application on genetic data: with λ = 0.005, CPEV = 0.32% and 174 columns selected on the first component.

Application on genetic data: comparison of the loadings (figure).

Application on genetic data: comparison between MCA and Sparse MCA on the first factorial plane (figure).

Properties (MCA vs Sparse MCA):
- Uncorrelated components: TRUE / FALSE
- Orthogonal loadings: TRUE / FALSE
- Barycentric property: TRUE / partly true
- % of inertia: $\frac{\lambda_j}{\lambda_{tot}} \times 100$ for MCA; $\frac{\|Z_{j\cdot 1,\dots,j-1}\|^2}{\text{total inertia}}$ for Sparse MCA, with total inertia $\frac{1}{p}\sum_{j=1}^{p}(p_j - 1)$,
where $Z_{j\cdot 1,\dots,j-1}$ are the residuals of $Z_j$ after adjusting for (regression projection on) $Z_1, \dots, Z_{j-1}$.

5. Clusterwise sparse PLS regression
Context: high dimensional explanatory data X (with p >> n); one or several variables Y to explain; unknown clusters of the n observations.
Twofold aim: clustering (find the optimal clustering of the observations) and sparse regression (compute the sparse regression coefficients within each cluster) in order to improve the prediction of Y and the interpretation of high-dimensional data.

Clusterwise (typological) regression: Diday (1974), Charles (1977), Späth (1979). It optimizes simultaneously the partition and the local models:

$$\min \sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}_{\{k(i)=k\}}\,\big(y_i - (\beta_{0k} + x_i'\beta_k)\big)^2.$$

When p > n_k this is not feasible by ordinary least squares; our proposal: local sparse PLS regressions.

Criterion (for a single dependent variable and G clusters):

$$\arg\min \sum_{g=1}^{G} \left\{ \left\| X_g' y_g - w_g v_g' \right\|^2 + \lambda_g \|w_g\|_1 + \left\| y_g - X_g w_g c_g \right\|^2 \right\} \quad \text{with } \|w_g\| = 1,$$

combining (cluster) sparse components and (cluster) residuals, where w is the loading weight of the PLS component t = Xw, v is the loading weight with u = Yv, and c is the regression coefficient of Y on t: c = (T'T)^{-1} T'y.

A simulation study
n = 500, p = 100. The X vector is normally distributed with mean 0 and covariance matrix S with elements $s_{ij} = \mathrm{cov}(X_i, X_j) = \rho^{|i-j|}$, $\rho \in [0,1]$. The response variable Y is defined as follows: there exists a latent categorical variable G, G ∈ {1, ..., K}, such that, conditionally on G = i and X = x,

$$Y \mid G = i,\ X = x \ \sim\ N\big(\beta_0^i + \beta_1^i x_1 + \dots + \beta_p^i x_p,\ \sigma_i^2\big).$$
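A sketch of this simulation design; ρ, the cluster-specific coefficients and the noise levels below are arbitrary choices, not the values used in the talk.

```python
# Sketch of the simulation design: X ~ N(0, S) with s_ij = rho^|i-j|, a latent cluster
# label G, and Y generated from a cluster-specific linear model.
import numpy as np

rng = np.random.default_rng(4)
n, p, K, rho = 500, 100, 2, 0.5
S = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))    # AR(1)-type covariance
X = rng.multivariate_normal(np.zeros(p), S, size=n)

G = rng.integers(0, K, size=n)                                      # equiprobable latent clusters
betas = rng.normal(size=(K, p)) * (rng.random((K, p)) < 0.05)       # sparse cluster-specific models
sigma = np.array([0.5, 0.5])                                        # arbitrary noise levels
y = np.array([X[i] @ betas[G[i]] + sigma[G[i]] * rng.normal() for i in range(n)])
```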

Signal/noise: $\sigma_i^2 / \mathrm{Var}(Y \mid G = i) = 0.1$. The n subjects are equiprobably distributed into the K clusters. For K = 2 we have chosen:

(figure)

A wrong number of groups (figure).

Application to real data
Data features (liver toxicity data from the mixOmics R package): X: 3116 genes (mRNA extracted from liver); y: a clinical chemistry marker for liver injury (serum enzyme level); observations: 64 rats. Each rat received a more or less toxic acetaminophen dose (50, 150, 1500 or 2000), and necropsy was performed 6, 18, 24 or 48 hours after exposure (i.e., the time when the mRNA is extracted from the liver).
Data pre-processing: y and X are centered and scaled.
Twofold aim: improve the sparse PLS link between the genes (X) and the liver injury marker (y) while seeking G unknown clusters of the 64 rats.

Batch clusterwise algorithm (a sketch follows below).
For several random allocations of the n observations into G clusters:
- compute G sparse PLS regressions, one within each cluster (the sparsity parameter and the number of dimensions are selected through 10-fold CV);
- assign each observation to the cluster with the minimum criterion value;
- iterate until convergence.
Select the best initialization (i.e., the one that minimizes the criterion) and obtain: the allocation of the n observations into G clusters, and the sparse regression coefficients within each cluster.
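A skeleton of this batch procedure, with scikit-learn's PLSRegression standing in for the sparse PLS used by the authors (for example the spls of the mixOmics R package); the 10-fold CV step is omitted and each cluster is assumed to keep enough observations.

```python
# Skeleton of the batch clusterwise algorithm. PLSRegression is a stand-in for sparse PLS;
# the CV choice of the sparsity parameter and of the number of dimensions is omitted.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def clusterwise_pls(X, y, G, n_components=2, n_starts=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):                                # several random initialisations
        labels = rng.integers(0, G, size=len(y))
        for _ in range(n_iter):
            # local models (assumes every cluster keeps enough observations; no safeguard here)
            models = [PLSRegression(n_components=n_components).fit(X[labels == g], y[labels == g])
                      for g in range(G)]
            resid = np.column_stack([(y - m.predict(X).ravel()) ** 2 for m in models])
            new_labels = resid.argmin(axis=1)                # reassign to the best-fitting cluster
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        crit = sum(resid[labels == g, g].sum() for g in range(G))
        if best is None or crit < best[0]:
            best = (crit, labels, models)
    return best

# Example call: crit, labels, models = clusterwise_pls(X, y, G=3)
```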

Clusterwise sparse PLS: performance vs standard sparse PLS
- No cluster (N = 64): R² = 0.968, RMSE = 0.313, sparsity parameter 0.9, H = 3 dimensions.
- G = 3 clusters of rats, N = (12; 42; 10): R² = (0.99; 0.928; 0.995), RMSE = (0.0272; 0.0111; 0.0186), sparsity parameters (0.9; 0.8; 0.9), H = (3; 3; 3).

Clusterwise sparse PLS: regression coefficients
Interpretation: the links between the genes (X) and the liver injury marker (y) are specific to each cluster. Cluster 1: 12 selected genes; cluster 2: 86 selected genes; cluster 3: 15 selected genes (no cluster: 31 selected genes). The cluster structure is significantly associated with the acetaminophen dose (p-value of the chi-square/Fisher test = 0.0057): cluster 1 contains all the doses, cluster 2 mainly doses 50 and 150, cluster 3 mainly doses 1500 and 2000.

6. Conclusions and perspectives
Sparse techniques provide elegant and efficient solutions to problems posed by high-dimensional data: a new generation of data analysis methods with few restrictive hypotheses. Sparse PCA and the group lasso have been extended to sparse MCA, and clusterwise extensions handle large p and large n.

Research in progress:
- criteria for choosing the tuning parameter λ;
- extension of sparse MCA: a compromise between sparse MCA and the sparse group lasso developed by Simon et al. (2012), which selects groups and predictors within a group, in order to produce sparsity at both levels;
- extension to the case where the variables have a block structure;
- writing an R package.

References
Bernard, A., Guinot, C. and Saporta, G. (2012). Sparse principal component analysis for multiblock data and its extension to sparse multiple correspondence analysis. Compstat 2012, pp. 99-106, Limassol, Cyprus.
Charles, C. (1977). Régression Typologique et Reconnaissance des Formes. Thèse de doctorat, Université Paris IX.
Chun, H. and Keles, S. (2010). Sparse partial least squares for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, Series B, 72, 3-25.
Diday, E. (1974). Introduction à l'analyse factorielle typologique. Revue de Statistique Appliquée, 22(4), 29-38.
Jolliffe, I.T., Trendafilov, N.T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12, 531-547.
Rousson, V. and Gasser, T. (2004). Simple component analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 53, 539-555.
Shen, H. and Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99, 1015-1034.

Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2012). A Sparse-Group Lasso. Journal of Computational and Graphical Statistics.
Späth, H. (1979). Clusterwise linear regression. Computing, 22, 367-373.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Vines, S.K. (2000). Simple principal components. Journal of the Royal Statistical Society, Series C (Applied Statistics), 49, 441-451.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.
Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15, 265-286.
Zou, H., Hastie, T. and Tibshirani, R. (2007). On the degrees of freedom of the lasso. The Annals of Statistics, 35(5), 2173-2192.