Sparse statistical modelling
Slide 1: Sparse statistical modelling. Tom Bartlett.
Slide 2: Introduction

"A sparse statistical model is one having only a small number of nonzero parameters or weights." [1]

- The number of features or variables measured on a person or object can be very large (e.g., expression levels of genes).
- These measurements are often highly correlated, i.e., they contain much redundant information.
- This scenario is particularly relevant in the age of big data.

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
Slide 3: Outline

- Sparse linear models
- Sparse PCA
- Sparse SVD
- Sparse CCA
- Sparse LDA
- Sparse clustering
Slide 4: Sparse linear models

A linear model can be written as
$$ y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i = \alpha + \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i, \quad i = 1, \dots, n. $$
Hence, the model can be fit by minimising the objective function
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 \right\}. $$
Adding a penalisation term to the objective function makes the solution more sparse:
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2. $$
Slide 5: Sparse linear models

The penalty term $\lambda\|\boldsymbol\beta\|_q^q$ means that only the bare minimum is used of all the information available in the $p$ predictor variables $x_{ij}$, $j = 1, \dots, p$:
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}. $$

- $q$ is typically chosen as $q = 1$ or $q = 2$, because these produce convex optimisation problems and hence are computationally much nicer!
- $q = 1$ is called the "lasso"; it tends to set as many elements of $\boldsymbol\beta$ as possible exactly to zero.
- $q = 2$ is called "ridge regression", and it tends to minimise the size of all the elements of $\boldsymbol\beta$.
- Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
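As a concrete illustration, here is a minimal scikit-learn sketch on synthetic data (an addition for this transcription; the slides themselves work in R) showing that the $q = 1$ penalty zeroes coefficients while the $q = 2$ penalty only shrinks them:

```python
# Minimal sketch: lasso (q = 1) zeroes coefficients, ridge (q = 2) only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only 3 of the 20 true coefficients are nonzero
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # q = 1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # q = 2 penalty
print("lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))  # typically close to 3
print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))  # typically all 20, shrunk
```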
Slide 6: Sparse linear models - simple example

[Figure: coefficient paths for the lasso (plotted against $\|\hat\beta\|_1 / \|\beta\|_1$) and for ridge regression (plotted against $\|\hat\beta\|_2 / \|\beta\|_2$), for the five predictors funding, hs, not-hs, college and college4.]

Crime rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).
Slide 7: Sparse linear models - genomics example

- Gene expression data, for p genes, for n_c = 530 cancer samples + n_h = 61 healthy tissue samples.
- Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation.
- Out of the p possible genes for prediction, the lasso chooses just these 25 (shown on the slide with their fitted model coefficients; numeric gene-name suffixes and coefficient values were lost in transcription): ADAMTS, HPD, NUP, ADH, HS3ST, PAFAH1B, CA, IGSF, TACC, CCDC, LRRTM, TESC, CDH, LRRC3B, TRPM, CES, MEG, TSLP, COL10A, MMP, WDR51A, DPP, NUAK, WISP, HHATL.
- Caveat: these are not necessarily the only predictive genes. If we removed these genes from the data-set and fitted the model again, the lasso would choose an entirely new set of genes, which might be almost as good at predicting!
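The slides fit this model with R's glmnet; the following is a hedged Python sketch of the same kind of analysis (cross-validated l1-penalised logistic regression), on illustrative random data rather than the actual expression matrix:

```python
# Sketch: l1-penalised logistic regression with CV-chosen penalty, analogous in
# spirit to glmnet with family = "binomial"; shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.normal(size=(591, 1000))          # 530 + 61 samples; p = 1000 is an assumption
y = np.r_[np.ones(530), np.zeros(61)]     # 1 = cancer, 0 = healthy

fit = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_.ravel())
print(len(selected), "genes selected by the cross-validated lasso")
```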
Slide 8: Sparse PCA

Ordinary PCA finds $v$ by carrying out the optimisation:
$$ \underset{\|v\|_2 = 1}{\text{maximise}} \left\{ \frac{v^\top X^\top X v}{n} \right\}, $$
with $X \in \mathbb{R}^{n \times p}$ (i.e., $n$ samples and $p$ variables). With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X / n$ are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of $v$ are encouraged to be zero, by finding $v$ via the optimisation:
$$ \underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top X^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t. $$
In effect this discards some variables such that $p$ is closer to $n$.

[2] Iain M. Johnstone. "On the distribution of the largest eigenvalue in principal components analysis." Annals of Statistics (2001).
Slide 9: Sparse SVD

The SVD of a matrix $X \in \mathbb{R}^{n \times p}$, with $n > p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ are orthogonal and $D \in \mathbb{R}^{p \times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:
$$ \underset{U \in \mathbb{R}^{n \times p},\, V \in \mathbb{R}^{p \times p},\, D \in \mathbb{R}^{p \times p}}{\text{minimise}} \ \left\| X - UDV^\top \right\|^2. $$
Hence, a sparse SVD with rank $r$ can be obtained by carrying out the optimisation:
$$ \underset{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{p \times r},\, D \in \mathbb{R}^{r \times r}}{\text{minimise}} \left\{ \left\| X - UDV^\top \right\|^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1 \right\}. $$
This allows SVD to be applied to the $p > n$ scenario.
Slide 10: Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem. The sparse PCA can thus be re-formulated as:
$$ \underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \left\{ u^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t, $$
which is biconvex in $u$ and $v$ and can be solved by alternating between the updates:
$$ u \leftarrow \frac{Xv}{\|Xv\|_2}, \qquad v \leftarrow \frac{S_\lambda(X^\top u)}{\|S_\lambda(X^\top u)\|_2}, \tag{1} $$
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+$, applied elementwise, with $\lambda$ chosen so that the constraint on $\|v\|_1$ is satisfied.
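A minimal numpy sketch of the alternating updates (1) for one sparse component, assuming a fixed penalty level lam rather than one tuned to meet the bound $\|v\|_1 \le t$ exactly:

```python
# Sketch of the alternating soft-thresholding algorithm for a rank-1 sparse SVD/PCA.
import numpy as np

def soft_threshold(a, lam):
    """S_lam(a) = sign(a) * (|a| - lam)_+, applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_rank1(X, lam, n_iter=100):
    n, p = X.shape
    v = np.ones(p) / np.sqrt(p)              # start v on the unit sphere
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)               # u <- Xv / ||Xv||_2
        v = soft_threshold(X.T @ u, lam)
        nv = np.linalg.norm(v)
        if nv == 0:                          # lam too large: all loadings thresholded away
            break
        v /= nv                              # v <- S_lam(X'u) / ||S_lam(X'u)||_2
    d = u @ X @ v                            # corresponding singular value
    return u, d, v

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))
u, d, v = sparse_rank1(X, lam=0.5)
print("nonzero loadings in v:", int(np.sum(v != 0)))
```

Further components can be obtained by deflating $X$ (subtracting $d\,uv^\top$) and repeating.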
Slide 11: Sparse PCA - simulation study

- Define $\Sigma$ as a $p \times p$ block-diagonal matrix, with $p = 200$ and 10 blocks of 1s of size $20 \times 20$. Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
- Generate $n$ samples $x \sim \text{Normal}(0, \Sigma)$.
- Estimate $\hat\Sigma = (x - \bar{x})(x - \bar{x})^\top / n$.
- Correlate the eigenvectors of $\hat\Sigma$ with the eigenvectors of $\Sigma$.
- Repeat 100 times for each different value of $n$.

[Plot: eigenvector correlation for the top 10 PCs against $n/p$; the plot shows the means of these correlations over the 100 repetitions for different values of $n$.]
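A sketch of this simulation follows (illustrative values of $n$ and fewer repetitions, to keep it fast; the exact correlation metric is not fully recoverable from the transcription, so the best absolute correlation of each sample PC with a population eigenvector is used here):

```python
# Sketch: block-diagonal covariance, accuracy of sample-covariance eigenvectors.
import numpy as np

p, n_blocks, block = 200, 10, 20
# The leading population eigenvectors of the block-diagonal Sigma are the
# normalised block indicators:
V = np.kron(np.eye(n_blocks), np.ones((block, 1)) / np.sqrt(block))

def top_pc_corr(n, rng):
    # x ~ Normal(0, Sigma) with Sigma = kron(I_10, ones(20, 20)), via latent factors
    X = rng.standard_normal((n, n_blocks)) @ (np.sqrt(block) * V).T
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc / n)     # sample covariance eigenvectors
    top = vecs[:, -n_blocks:]                   # top 10 sample PCs
    # best absolute correlation of each sample PC with a population eigenvector
    return np.abs(V.T @ top).max(axis=0).mean()

rng = np.random.default_rng(3)
for n in (20, 100, 400):
    print(f"n/p = {n / p:.2f}:", np.mean([top_pc_corr(n, rng) for _ in range(20)]))
```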
Slide 12: Sparse PCA - simulation study

An implementation of sparse PCA is available in the R package PMA as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3]. I applied this function to the same simulation as described on the previous slide. The scale of the penalisation is in terms of $\|u\|_1$, with $\|u\|_1 = \sqrt{p}$ corresponding to the minimum and $\|u\|_1 = 1$ to the maximum permissible penalisation.

[Plot: eigenvector correlation for the top 10 PCs against $n/p$, with $\|u\|_1 = \sqrt{p}$.]

[3] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis." Biostatistics (2009), kxp008.
Slide 13: Sparse PCA - simulation study

[Plots: eigenvector correlation for the top 10 PCs against $n/p$, with $\|u\|_1 = \sqrt{p}/2$ and with $\|u\|_1 = \sqrt{p}/3$.]
Slide 14: Sparse PCA - real data example

- I carried out PCA on expression levels of genes in individual cells from developing brains.
- There are many different cell types in the data: some mature, some immature, and some in between.
- Different cell types are characterised by different gene-expression profiles.
- We would therefore expect to be able to visualise some separation of the cell types by dimensionality reduction to three dimensions.

[Plot: the cells plotted in terms of the top three (standard) PCA components.]
Slide 15: Sparse PCA - real data example

[Plots: the cells plotted in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation) and with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).]
Slide 16: Sparse CCA

In CCA, the aim is to find coefficient vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ which project the data-matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ so as to maximise the correlation between these projections. Whereas PCA aims to find the direction of maximum variance in a single data-matrix, CCA aims to find the directions in the two data-matrices in which the variances best explain each other. The CCA problem can be solved by carrying out the optimisation:
$$ \underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \ \text{Cor}(Xu, Yv). $$
This problem is not well posed for $n < \max(p, q)$, in which case $u$ and $v$ can be found which trivially give $\text{Cor}(Xu, Yv) = 1$. Sparse CCA solves this problem by carrying out the optimisation:
$$ \underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \ \text{Cor}(Xu, Yv), \quad \text{subject to } \|u\|_1 < t_1 \text{ and } \|v\|_1 < t_2. $$
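A hedged numpy sketch of sparse CCA in the diagonal-within-set-covariance style of the penalised matrix decomposition [3], with fixed soft-thresholding levels lam1, lam2 standing in for the bounds $t_1$, $t_2$, on synthetic data with one shared latent signal:

```python
# Sketch: sparse CCA via alternating soft-thresholding of the cross-covariance.
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam1, lam2, n_iter=100):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y / X.shape[0]                      # cross-covariance estimate
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam1)
        u /= max(np.linalg.norm(u), 1e-12)        # keep ||u||_2 <= 1
        v = soft_threshold(C.T @ u, lam2)
        v /= max(np.linalg.norm(v), 1e-12)        # keep ||v||_2 <= 1
    return u, v

rng = np.random.default_rng(4)
z = rng.normal(size=(100, 1))                     # shared latent signal
X = z @ rng.normal(size=(1, 50)) + rng.normal(size=(100, 50))
Y = z @ rng.normal(size=(1, 30)) + rng.normal(size=(100, 30))
u, v = sparse_cca(X, Y, lam1=0.3, lam2=0.3)
print("correlation of projections:", np.corrcoef(X @ u, Y @ v)[0, 1])
```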
Slide 17: Sparse CCA - real data example

- Cell cycle is a biological process involved in the replication of cells.
- Cell cycle can be thought of as a latent process which is not directly observable in genomics data.
- It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred.
- It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes.
- Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.
Slide 18: Sparse LDA

LDA assigns item $i$ to a group $G$ based on a corresponding data-vector $x_i$, according to the posterior probability:
$$ P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with} \quad f_k(x_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\}, $$
with prior $\pi_k$ and mean $\mu_k$ for group $k$, and covariance $\Sigma$. This assignment takes place by constructing decision boundaries between classes $k$ and $l$:
$$ \log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l). $$
Because this boundary is linear in $x_i$, we get the name LDA (linear discriminant analysis).
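A worked numpy sketch of the resulting discriminant, computing $\delta_k(x) = \log \pi_k + x^\top \hat\Sigma^{-1} \hat\mu_k - \tfrac{1}{2} \hat\mu_k^\top \hat\Sigma^{-1} \hat\mu_k$ from a pooled covariance estimate on toy data:

```python
# Toy sketch of the LDA decision rule, fitted directly from class means,
# class priors, and a pooled covariance estimate (2 classes, p = 3).
import numpy as np

rng = np.random.default_rng(5)
mu_true = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, -1.0]])
X = np.vstack([rng.normal(size=(40, 3)) + mu_true[0],
               rng.normal(size=(60, 3)) + mu_true[1]])
y = np.r_[np.zeros(40, int), np.ones(60, int)]

pi = np.bincount(y) / len(y)                          # class priors
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
Xc = X - mu[y]                                        # within-class centring
Sigma = Xc.T @ Xc / (len(y) - 2)                      # pooled covariance
P = np.linalg.inv(Sigma)

def delta(x):
    """Discriminant scores for both classes; argmax gives the assignment."""
    return np.array([np.log(pi[k]) + x @ P @ mu[k] - 0.5 * mu[k] @ P @ mu[k]
                     for k in (0, 1)])

x_new = np.array([1.5, 0.5, -0.5])
print("assigned class:", int(np.argmax(delta(x_new))))
```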
Slide 19: Sparse LDA

The decision boundary
$$ \log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l) $$
then naturally leads to the decision rule:
$$ \hat{G}(x_i) = \underset{k}{\text{argmax}} \left\{ \log \pi_k + x_i^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k \right\}. $$
By assuming $\Sigma$ is diagonal, i.e., that there is no covariance between the $p$ dimensions, this decision rule can be reduced to the nearest-centroids classifier:
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\}. $$
Typically, $\Sigma$ (or $\sigma$) is estimated from the data as $\hat\Sigma$ (or $\hat\sigma$), and the $\mu_k$ are estimated as $\hat\mu_k$, whilst training the classifier.
Slide 20: Sparse LDA

The nearest-centroids classifier
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\} $$
will typically use all $p$ variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue. Define $\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
$$ \underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1, \dots, K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\}, $$
where $C_k$ and $n_k$ are the set and number of samples in group $k$.
Slide 21: Sparse LDA

Hence, the $\hat\alpha_k$ estimated from the optimisation
$$ \underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1, \dots, K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\} $$
can be used to estimate the shrunken centroids $\hat\mu_k = \bar{x} + \hat\alpha_k$, thus training the classifier:
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - 2 \log \hat\pi_k \right\}. $$
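The slides use R's pamr for this; scikit-learn's NearestCentroid with a shrink_threshold is a rough Python analogue, sketched here on synthetic data (not the original analysis):

```python
# Sketch: plain vs shrunken nearest centroids on data where only a few
# of many features carry class information.
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 500))
y = rng.integers(0, 3, size=120)
X[y == 1, :10] += 1.5                 # only 10 features carry class information

plain = NearestCentroid().fit(X, y)                       # uses all 500 features
shrunk = NearestCentroid(shrink_threshold=0.5).fit(X, y)  # shrinks centroids to the mean
print("training accuracy:", plain.score(X, y), shrunk.score(X, y))
```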
Slide 22: Sparse LDA - real data example

- I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16 wk, 26; 21 wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).
- Used the R packages MASS and pamr [4].
- Carried out 100 repetitions of 3-fold CV.

[Plots: normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy against sparsity threshold, with quantiles (0%, 25%, 50%, 75%, 100%, over 300 predictions) for sparse LDA and regular LDA.]

[4] Robert Tibshirani et al. "Class prediction by nearest shrunken centroids, with applications to DNA microarrays." Statistical Science (2003).
Slide 23: Sparse clustering

Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples $i$ and $i'$. One popular choice of dissimilarity measure is the Euclidean distance. In high dimensions, it is often unnecessary to use information from all of the $p$ dimensions. A weighted dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:
$$ \underset{u \in \mathbb{R}^{n^2},\, w \in \mathbb{R}^{p}}{\text{maximise}} \left\{ u^\top \Delta w \right\}, \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \|w\|_1 \le t,\ \text{and } w_j \ge 0 \ \forall j \in \{1, \dots, p\}, $$
where $w$ is the vector of the weights $w_j$, $j \in \{1, \dots, p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ holds the dissimilarity components, arranged such that each row of $\Delta$ corresponds to the $d_{i,i',j}$, $j \in \{1, \dots, p\}$, for a pair of samples $(i, i')$. This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.
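A hedged numpy sketch of obtaining the weights $w$ by this rank-1 decomposition, with a fixed penalty lam in place of the exact bound $\|w\|_1 \le t$, and positivity of $w$ kept by thresholding at zero:

```python
# Sketch: feature weights for a weighted dissimilarity via a rank-1
# penalised matrix decomposition of the dissimilarity components.
import numpy as np

def sparse_dissimilarity_weights(X, lam, n_iter=50):
    n, p = X.shape
    # Delta: one row per ordered sample pair (i, i'), one column per feature,
    # holding d_{i,i',j} = (x_ij - x_i'j)^2
    Delta = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(n * n, p)
    w = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        u = Delta @ w
        u /= np.linalg.norm(u)                      # keep ||u||_2 <= 1
        w = np.maximum(Delta.T @ u - lam, 0.0)      # soft-threshold, w_j >= 0
        w /= max(np.linalg.norm(w), 1e-12)          # keep ||w||_2 <= 1
    D = (Delta * w).sum(axis=1).reshape(n, n)       # weighted dissimilarity matrix
    return w, D

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 40))
X[:15, :5] += 3.0                                   # two groups differing in 5 features
w, D = sparse_dissimilarity_weights(X, lam=100.0)
print("features with nonzero weight:", np.flatnonzero(w))
```

The matrix D can then be passed to a hierarchical clustering routine in place of the unweighted distances.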
Slide 24: Sparse clustering

Some clustering methods, such as K-means, need a slightly modified approach. K-means seeks to minimise the within-cluster sum of squares
$$ \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|_2^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \| x_i - x_{i'} \|_2^2, $$
where $C_k$ is the set of samples in cluster $k$ and $\bar{x}_k$ is the corresponding centroid. Hence, a weighted K-means could proceed according to the optimisation:
$$ \underset{w \in \mathbb{R}^p}{\text{minimise}} \ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j}, $$
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster $k$.
Slide 25: Sparse clustering

However, for the optimisation
$$ \underset{w \in \mathbb{R}^p}{\text{minimise}} \ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j}, $$
it is not possible to choose a set of constraints which guarantees a non-pathological solution as well as convexity. Instead, the between-cluster sum of squares can be maximised:
$$ \underset{w \in \mathbb{R}^p}{\text{maximise}} \ \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j} \right), $$
subject to $\|w\|_2 \le 1$, $\|w\|_1 \le t$, and $w_j \ge 0 \ \forall j \in \{1, \dots, p\}$.
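A hedged sketch of the alternation this suggests (in the spirit of what sparcl implements): cluster with the current weights, then refresh $w$ by soft-thresholding each feature's between-cluster sum of squares, with a fixed penalty lam standing in for the exact L1 bound:

```python
# Sketch of sparse K-means by alternating clustering and weight updates.
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans(X, K, lam, n_iter=10, seed=0):
    n, p = X.shape
    w = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        km = KMeans(n_clusters=K, n_init=10, random_state=seed)
        labels = km.fit_predict(X * np.sqrt(w))           # weighted K-means step
        total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)   # total sum of squares per feature
        within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum(axis=0)
                     for k in range(K))
        between = total - within                          # between-cluster sum of squares
        w = np.maximum(between - lam, 0.0)                # soft-threshold, w_j >= 0
        w /= max(np.linalg.norm(w), 1e-12)                # keep ||w||_2 <= 1
    return labels, w

rng = np.random.default_rng(8)
X = rng.normal(size=(90, 50))
X[:30, :5] += 3.0                                         # three clusters that differ
X[30:60, :5] -= 3.0                                       # only in the first 5 features
labels, w = sparse_kmeans(X, K=3, lam=100.0)
print("features selected:", np.flatnonzero(w))
```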
Slide 26: Sparse clustering - real data examples

- Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Used the R package sparcl [5] for the sparse clustering.

[Plots: normalised mutual information (NMI) and adjusted Rand index (ARI) against L1 bound, comparing sparse hierarchical clustering with standard hierarchical clustering.]

[5] Daniela M. Witten and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association (2010).
Slide 27: Sparse clustering - real data examples

- Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Used the R package sparcl for the sparse clustering.

[Plots: NMI and ARI against L1 bound, comparing sparse k-means with standard k-means.]
Slide 28: Sparse clustering - real data examples

- Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
- Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Offers computational advantages, running in 9 seconds on a 2.8 GHz MacBook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.

[Plots: NMI and ARI against L1 bound / sqrt(n), comparing sparse spectral k-means with standard k-means.]