Sparse statistical modelling

Tom Bartlett

Introduction
- A sparse statistical model is one having only a small number of nonzero parameters or weights. [1]
- The number of features or variables measured on a person or object can be very large (e.g., expression levels of 30000 genes).
- These measurements are often highly correlated, i.e., they contain much redundant information.
- This scenario is particularly relevant in the age of big data.

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Outline
- Sparse linear models
- Sparse PCA
- Sparse SVD
- Sparse CCA
- Sparse LDA
- Sparse clustering

Sparse linear models
A linear model can be written as
\[ y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i = \alpha + \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i, \qquad i = 1,\dots,N. \]
Hence, the model can be fit by minimising the objective function
\[ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \sum_{i=1}^{N} (y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta)^2 \right\}. \]
Adding a penalisation term to the objective function makes the solution more sparse:
\[ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2N}\sum_{i=1}^{N} (y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2.
\]

Sparse linear models
The penalty term $\lambda\|\boldsymbol\beta\|_q^q$ means that only the bare minimum is used of all the information available in the p predictor variables $x_{ij}$, $j = 1,\dots,p$:
\[ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2N}\sum_{i=1}^{N} (y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}. \]
- q is typically chosen as q = 1 or q = 2, because these choices give convex optimisation problems and hence are computationally much nicer!
- q = 1 is called the lasso; it tends to set as many elements of β as possible to zero.
- q = 2 is called ridge regression; it tends to shrink the size of all the elements of β.
- Penalisation is equally applicable to other types of linear model: logistic regression, generalised linear models, etc.
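As an illustrative sketch (not from the slides), fits of this form can be obtained in R with the glmnet package, where alpha = 1 gives the lasso penalty and alpha = 0 gives ridge; the simulated data below are placeholders.

```r
# Sketch only: comparing lasso (q = 1) and ridge (q = 2) fits with glmnet.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)               # simulated predictors
beta <- c(3, -2, 1.5, rep(0, p - 3))          # sparse true coefficients
y <- x %*% beta + rnorm(n)

lasso_fit <- glmnet(x, y, alpha = 1)          # q = 1: lasso penalty
ridge_fit <- glmnet(x, y, alpha = 0)          # q = 2: ridge penalty

# Coefficient paths against the L1 norm of the coefficients,
# analogous to the plots on the next slide
plot(lasso_fit, xvar = "norm")
plot(ridge_fit, xvar = "norm")
```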

Sparse linear models - simple example
[Figure: lasso and ridge regression coefficient paths; the coefficients for funding, not-hs, college4, college and hs are plotted against $\|\hat\beta\|_1/\|\beta\|_1$ (lasso) and $\|\hat\beta\|_2/\|\beta\|_2$ (ridge).]
Crime rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).

Sparse linear models - genomics example
- Gene expression data, for p = 17280 genes, for n_c = 530 cancer samples + n_h = 61 healthy-tissue samples.
- Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation.
- Out of 17280 possible genes for prediction, the lasso chooses just these 25 (shown with their fitted model coefficients):

ADAMTS5 -0.0666    HPD     -0.00679   NUP210    0.00582
ADH4    -0.165     HS3ST4  -0.0863    PAFAH1B3  0.297
CA4     -0.151     IGSF10  -0.356     TACC3     0.128
CCDC36  -0.335     LRRTM2  -0.0711    TESC     -0.0568
CDH12   -0.253     LRRC3B  -0.211     TRPM3    -1.24
CES1    -0.302     MEG3    -0.022     TSLP     -0.0841
COL10A1  0.747     MMP11    0.22      WDR51A    0.0722
DPP6    -0.107     NUAK2    0.0354    WISP1     0.14
HHATL   -0.0665

Caveat: these are not necessarily the only predictive genes. If we removed these genes from the data set and fitted the model again, the lasso would choose an entirely new set of genes which might be almost as good at predicting!
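A minimal sketch of the workflow just described, assuming a gene-expression matrix expr (samples in rows, genes in columns) and a binary label vector status; both names are hypothetical placeholders.

```r
# Sketch of the cross-validated logistic lasso workflow described above.
# `expr` (n x p expression matrix) and `status` (factor: "cancer"/"healthy")
# are hypothetical placeholders for the data used on the slide.
library(glmnet)

cv_fit <- cv.glmnet(expr, status, family = "binomial", alpha = 1)

# Coefficients at the cross-validation-selected lambda;
# most entries are exactly zero, the remainder (plus the intercept) are the selected genes
coefs <- coef(cv_fit, s = "lambda.min")
selected <- coefs[coefs[, 1] != 0, , drop = FALSE]
selected
```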

Sparse PCA
Ordinary PCA finds v by carrying out the optimisation:
\[ \underset{\|\mathbf{v}\|_2 = 1}{\text{maximise}} \left\{ \mathbf{v}^\top \frac{\mathbf{X}^\top\mathbf{X}}{n} \mathbf{v} \right\}, \]
with $\mathbf{X} \in \mathbb{R}^{n \times p}$ (i.e., n samples and p variables).
With $p \gg n$, the eigenvectors of the sample covariance matrix $\mathbf{X}^\top\mathbf{X}/n$ are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context.
This motivates sparse PCA, in which many entries of v are encouraged to be zero, by finding v by carrying out the optimisation:
\[ \underset{\|\mathbf{v}\|_2 = 1}{\text{maximise}} \left\{ \mathbf{v}^\top \mathbf{X}^\top\mathbf{X}\, \mathbf{v} \right\}, \quad \text{subject to: } \|\mathbf{v}\|_1 \le t. \]
In effect this discards some variables, such that the effective p is closer to n.

[2] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics (2001), pp. 295-327.

Sparse SVD
The SVD of a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, with $n > p$, can be expressed as $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$, where $\mathbf{U} \in \mathbb{R}^{n \times p}$ and $\mathbf{V} \in \mathbb{R}^{p \times p}$ are orthogonal and $\mathbf{D} \in \mathbb{R}^{p \times p}$ is diagonal.
The SVD can hence be found by carrying out the optimisation:
\[ \underset{\mathbf{U}\in\mathbb{R}^{n\times p},\,\mathbf{V}\in\mathbb{R}^{p\times p},\,\mathbf{D}\in\mathbb{R}^{p\times p}}{\text{minimise}} \; \|\mathbf{X} - \mathbf{U}\mathbf{D}\mathbf{V}^\top\|^2. \]
Hence, a sparse SVD with rank r can be obtained by carrying out the optimisation:
\[ \underset{\mathbf{U}\in\mathbb{R}^{n\times r},\,\mathbf{V}\in\mathbb{R}^{p\times r},\,\mathbf{D}\in\mathbb{R}^{r\times r}}{\text{minimise}} \left\{ \|\mathbf{X} - \mathbf{U}\mathbf{D}\mathbf{V}^\top\|^2 + \lambda_1\|\mathbf{U}\|_1 + \lambda_2\|\mathbf{V}\|_1 \right\}. \]
This allows SVD to be applied to the p > n scenario.

Sparse PCA and SVD - an algorithm
SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem.
The sparse PCA problem can thus be re-formulated as:
\[ \underset{\|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1}{\text{maximise}} \left\{ \mathbf{u}^\top \mathbf{X} \mathbf{v} \right\}, \quad \text{subject to: } \|\mathbf{v}\|_1 \le t, \]
which is biconvex in u and v and can be solved by alternating between the updates:
\[ \mathbf{u} \leftarrow \frac{\mathbf{X}\mathbf{v}}{\|\mathbf{X}\mathbf{v}\|_2}, \quad \text{and} \quad \mathbf{v} \leftarrow \frac{S_\lambda(\mathbf{X}^\top\mathbf{u})}{\|S_\lambda(\mathbf{X}^\top\mathbf{u})\|_2}, \tag{1} \]
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$.
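As a hedged illustration of update (1) (not code from the slides), the alternating soft-thresholding scheme can be written directly in R; here λ is fixed by hand rather than tuned to meet an exact L1 bound t.

```r
# Sketch: rank-1 sparse PCA / SVD via the alternating updates in (1).
# lambda is fixed by hand here; in practice it would be tuned (e.g. by
# binary search so that ||v||_1 hits a target bound t).
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

sparse_rank1 <- function(X, lambda, n_iter = 100) {
  v <- svd(X, nu = 0, nv = 1)$v[, 1]      # warm start from the ordinary SVD
  for (it in seq_len(n_iter)) {
    u <- drop(X %*% v)
    u <- u / sqrt(sum(u^2))               # u <- Xv / ||Xv||_2
    v <- soft_threshold(drop(crossprod(X, u)), lambda)
    if (all(v == 0)) stop("lambda too large: v shrunk to zero")
    v <- v / sqrt(sum(v^2))               # v <- S_lambda(X'u) / ||S_lambda(X'u)||_2
  }
  list(u = u, v = v, d = drop(crossprod(u, X %*% v)))
}

# Example on simulated data
set.seed(1)
X <- matrix(rnorm(50 * 200), 50, 200)
fit <- sparse_rank1(X, lambda = 2)
sum(fit$v != 0)                           # number of nonzero loadings
```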

Sparse PCA - simulation study
- Define Σ as a p × p block-diagonal matrix, with p = 200 and 10 blocks of 1s of size 20 × 20. Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
- Generate n samples $\mathbf{x} \sim \text{Normal}(\mathbf{0}, \Sigma)$.
- Estimate $\hat\Sigma = (\mathbf{x} - \bar{\mathbf{x}})(\mathbf{x} - \bar{\mathbf{x}})^\top / n$.
- Correlate the eigenvectors of $\hat\Sigma$ with the eigenvectors of $\Sigma$.
- Repeat 100 times for each different value of n.
[Figure: eigenvector correlation (0 to 1) for the top 10 PCs, plotted against n/p (0.2 to 1.0).]
The plot shows the means of these correlations over the 100 repetitions for different values of n.
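A sketch of this simulation in R, under the stated dimensions; the "eigenvector correlation" is computed here as the best absolute correlation between each true eigenvector and the top 10 estimated eigenvectors, which is one reasonable reading of the slide rather than necessarily the exact measure used.

```r
# Sketch of the eigenvector-recovery simulation described above.
set.seed(1)
p <- 200; block <- 20; K <- 10
B <- kronecker(diag(K), matrix(1, block, 1))   # p x K block-membership matrix
true_vecs <- B / sqrt(block)                   # true top-10 eigenvectors of Sigma = B B'

eig_corr <- function(n) {
  x <- matrix(rnorm(n * K), n, K) %*% t(B)     # x ~ N(0, Sigma), Sigma block-diagonal of 1s
  xc <- scale(x, center = TRUE, scale = FALSE)
  Sigma_hat <- crossprod(xc) / n               # sample covariance
  est_vecs <- eigen(Sigma_hat, symmetric = TRUE)$vectors[, 1:K]
  cors <- abs(crossprod(true_vecs, est_vecs))  # K x K matrix of |correlations|
  mean(apply(cors, 1, max))                    # best match per true eigenvector
}

n_grid <- seq(40, 200, by = 40)
results <- sapply(n_grid, function(n) mean(replicate(100, eig_corr(n))))
plot(n_grid / p, results, type = "b", ylim = c(0, 1),
     xlab = "n / p", ylab = "Eigenvector correlation (top 10 PCs)")
```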

Sparse PCA - simulation study
An implementation of sparse PCA is available in the R package PMA as the function spca. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3].
I applied this function to the same simulation as described on the previous slide.
The scale of the penalisation is in terms of $\|\mathbf{u}\|_1$, with $\|\mathbf{u}\|_1 = \sqrt{p}$ corresponding to the minimum and $\|\mathbf{u}\|_1 = 1$ to the maximum permissible amount of penalisation.
[Figure: eigenvector correlation (0 to 1) for the top 10 PCs, plotted against n/p (0.2 to 1.0).]
The plot shows the result with $\|\mathbf{u}\|_1 = \sqrt{p}$.

[3] Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics (2009), kxp008.
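For reference, a hedged usage sketch: the slide calls the routine spca, but in the PMA versions I have seen the sparse-PCA function is exported as SPC, with the L1 bound on the sparse vector passed as sumabsv; treat these names as assumptions to check against your installed version.

```r
# Sketch: sparse PCA on one simulated data set via the PMA package.
# Function name (SPC) and argument names (sumabsv, K) are assumptions about
# the installed PMA version; the slide refers to the routine as spca.
library(PMA)

set.seed(1)
p <- 200; block <- 20; K <- 10; n <- 100
B <- kronecker(diag(K), matrix(1, block, 1))
x <- matrix(rnorm(n * K), n, K) %*% t(B)          # same simulation as above

fit <- SPC(scale(x, center = TRUE, scale = FALSE),
           sumabsv = sqrt(p),                     # minimum penalisation (no sparsity)
           K = 10)                                # top 10 sparse components
str(fit$v)                                        # sparse loading vectors
```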

Sparse PCA - simulation study
[Figure: two panels of eigenvector correlation (0 to 1) for the top 10 PCs, plotted against n/p (0.2 to 1.0).]
One plot shows the result with $\|\mathbf{u}\|_1 = \sqrt{p}/2$; the other shows the result with $\|\mathbf{u}\|_1 = \sqrt{p}/3$.

Sparse PCA - real data example
- I carried out PCA on expression levels of 10138 genes in individual cells from developing brains.
- There are many different cell types in the data: some mature, some immature, and some in between.
- Different cell types are characterised by different gene expression profiles.
- We would therefore expect to be able to visualise some separation of the cell types by dimensionality reduction to three dimensions.
The plot shows the cells plotted in terms of the top three (standard) PCA components.
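A minimal sketch of this kind of visualisation with standard PCA, assuming an expression matrix expr (cells in rows, genes in columns) and a label vector cell_type; both are hypothetical placeholders.

```r
# Sketch: visualising cells on the top three standard principal components.
# `expr` (cells x genes) and `cell_type` are hypothetical placeholders.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]                       # top three PC scores per cell

# Pairwise views of the three components, coloured by cell type
pairs(scores, col = as.integer(factor(cell_type)),
      labels = c("PC1", "PC2", "PC3"), pch = 20)
```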

Sparse PCA - real data example
One plot shows the cells in terms of the top three sparse PCA components, with $\|\mathbf{u}\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation).
The other plot shows the cells in terms of the top three sparse PCA components, with $\|\mathbf{u}\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).

Sparse CCA
In CCA, the aim is to find coefficient vectors $\mathbf{u} \in \mathbb{R}^p$ and $\mathbf{v} \in \mathbb{R}^q$ which project the data matrices $\mathbf{X} \in \mathbb{R}^{n\times p}$ and $\mathbf{Y} \in \mathbb{R}^{n\times q}$ so as to maximise the correlation between these projections.
Whereas PCA aims to find the direction of maximum variance in a single data matrix, CCA aims to find the directions in the two data matrices in which the variances best explain each other.
The CCA problem can be solved by carrying out the optimisation:
\[ \underset{\mathbf{u}\in\mathbb{R}^p,\,\mathbf{v}\in\mathbb{R}^q}{\text{maximise}} \; \mathrm{Cor}(\mathbf{X}\mathbf{u}, \mathbf{Y}\mathbf{v}). \]
This problem is not well posed for $n < \max(p, q)$, in which case u and v can be found which trivially give $\mathrm{Cor}(\mathbf{X}\mathbf{u}, \mathbf{Y}\mathbf{v}) = 1$.
Sparse CCA solves this problem by carrying out the optimisation:
\[ \underset{\mathbf{u}\in\mathbb{R}^p,\,\mathbf{v}\in\mathbb{R}^q}{\text{maximise}} \; \mathrm{Cor}(\mathbf{X}\mathbf{u}, \mathbf{Y}\mathbf{v}), \quad \text{subject to } \|\mathbf{u}\|_1 \le t_1 \text{ and } \|\mathbf{v}\|_1 \le t_2. \]
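As a hedged sketch (not from the slides), sparse CCA of this form is implemented in the PMA package; the data matrices X and Y are placeholders, and the argument names shown are assumptions to check against the package documentation.

```r
# Sketch: sparse CCA via the PMA package.
# `X` and `Y` are hypothetical n x p and n x q data matrices; argument names
# (typex, typez, penaltyx, penaltyz, K) are assumptions about the installed
# PMA version and should be checked against its documentation.
library(PMA)

fit <- CCA(X, Y, typex = "standard", typez = "standard",
           penaltyx = 0.3, penaltyz = 0.3,   # smaller values give sparser u and v
           K = 1)                            # first canonical pair only
fit$u                                        # sparse weights for the columns of X
fit$v                                        # sparse weights for the columns of Y
cor(X %*% fit$u, Y %*% fit$v)                # achieved canonical correlation
```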

Sparse CCA - real data example
- Cell cycle is a biological process involved in the replication of cells.
- Cell cycle can be thought of as a latent process which is not directly observable in genomics data.
- It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred.
- It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes.
- I used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.

Sparse LDA
LDA assigns item i to a group G based on a corresponding data vector $\mathbf{x}_i$, according to the posterior probability:
\[ P(G = k \mid \mathbf{x}_i) = \frac{\pi_k f_k(\mathbf{x}_i)}{\sum_{l=1}^{K} \pi_l f_l(\mathbf{x}_i)}, \quad \text{with} \quad f_k(\mathbf{x}_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}_i - \boldsymbol\mu_k)^\top \Sigma^{-1} (\mathbf{x}_i - \boldsymbol\mu_k) \right\}, \]
with prior $\pi_k$ and mean $\boldsymbol\mu_k$ for group k, and covariance $\Sigma$.
This assignment takes place by constructing decision boundaries between classes k and l:
\[ \log\frac{P(G = k \mid \mathbf{x}_i)}{P(G = l \mid \mathbf{x}_i)} = \log\frac{\pi_k}{\pi_l} + \mathbf{x}_i^\top \Sigma^{-1}(\boldsymbol\mu_k - \boldsymbol\mu_l) - \frac{1}{2}(\boldsymbol\mu_k + \boldsymbol\mu_l)^\top \Sigma^{-1}(\boldsymbol\mu_k - \boldsymbol\mu_l). \]
Because this boundary is linear in $\mathbf{x}_i$, we get the name LDA.

Sparse LDA
The decision boundary
\[ \log\frac{P(G = k \mid \mathbf{x}_i)}{P(G = l \mid \mathbf{x}_i)} = \log\frac{\pi_k}{\pi_l} + \mathbf{x}_i^\top \Sigma^{-1}(\boldsymbol\mu_k - \boldsymbol\mu_l) - \frac{1}{2}(\boldsymbol\mu_k + \boldsymbol\mu_l)^\top \Sigma^{-1}(\boldsymbol\mu_k - \boldsymbol\mu_l) \]
then naturally leads to the decision rule:
\[ G(\mathbf{x}_i) = \underset{k}{\text{argmax}} \left\{ \log\pi_k + \mathbf{x}_i^\top \Sigma^{-1}\boldsymbol\mu_k - \frac{1}{2}\boldsymbol\mu_k^\top \Sigma^{-1}\boldsymbol\mu_k \right\}. \]
By assuming Σ is diagonal, i.e., that there is no covariance between the p dimensions, this decision rule can be reduced to the nearest-centroids classifier:
\[ G(\mathbf{x}_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2\log\pi_k \right\}. \]
Typically, Σ (or σ) and the $\boldsymbol\mu_k$ are estimated from the data as $\hat\Sigma$ (or $\hat\sigma$) and $\hat{\boldsymbol\mu}_k$ whilst training the classifier.
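As an illustrative sketch (not from the slides), the diagonal-covariance nearest-centroids rule above can be written in a few lines of base R:

```r
# Sketch: nearest-centroids classifier with a diagonal (pooled) covariance,
# i.e. the decision rule above with Sigma assumed diagonal.
nearest_centroid_fit <- function(x, y) {
  y <- factor(y)
  centroids <- t(sapply(levels(y), function(k) colMeans(x[y == k, , drop = FALSE])))
  pooled_var <- apply(x - centroids[y, ], 2, function(r) sum(r^2)) / (nrow(x) - nlevels(y))
  list(centroids = centroids, var = pooled_var,
       prior = as.numeric(table(y)) / length(y), levels = levels(y))
}

nearest_centroid_predict <- function(fit, xnew) {
  scores <- apply(xnew, 1, function(xi) {
    # discriminant score per class: sum_j (x_j - mu_jk)^2 / sigma_j^2 - 2 log pi_k
    sapply(seq_along(fit$levels), function(k)
      sum((xi - fit$centroids[k, ])^2 / fit$var) - 2 * log(fit$prior[k]))
  })
  fit$levels[apply(scores, 2, which.min)]
}

# Example on Fisher's iris data (training accuracy only)
fit <- nearest_centroid_fit(as.matrix(iris[, 1:4]), iris$Species)
mean(nearest_centroid_predict(fit, as.matrix(iris[, 1:4])) == iris$Species)
```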

Sparse LDA
The nearest-centroids classifier
\[ \hat G(\mathbf{x}_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2\log\pi_k \right\} \]
will typically use all p variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue.
Define $\boldsymbol\mu_k = \bar{\mathbf{x}} + \boldsymbol\alpha_k$, where $\bar{\mathbf{x}}$ is the data mean across all classes, and $\boldsymbol\alpha_k$ is the class-specific deviation of the mean from $\bar{\mathbf{x}}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
\[ \underset{\boldsymbol\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K}\sum_{i \in C_k}\sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K}\sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\}, \]
where $C_k$ and $n_k$ are the set and the number of samples in group k.

Sparse LDA
Hence, the $\hat{\boldsymbol\alpha}_k$ estimated from the optimisation
\[ \underset{\boldsymbol\alpha_k \in \mathbb{R}^p,\, k \in \{1,\dots,K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K}\sum_{i \in C_k}\sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K}\sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\} \]
can be used to estimate the shrunken centroids $\hat{\boldsymbol\mu}_k = \bar{\mathbf{x}} + \hat{\boldsymbol\alpha}_k$, thus training the classifier:
\[ \hat G(\mathbf{x}_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - 2\log\pi_k \right\}. \]

Sparse LDA - real data example
- I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16wk, 26; 21wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).
- Used the R packages MASS and pamr [4].
- Carried out 100 repetitions of 3-fold CV.
- Plots show normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy.
[Figure: NMI, ARI and accuracy (0 to 0.8+) against the sparsity threshold (0 to 30), shown as quantiles over 300 predictions (0%, 25%, 50%, 75%, 100%) for sparse LDA and regular LDA.]

[4] Robert Tibshirani et al. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science (2003), pp. 104-117.
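A hedged sketch of the pamr workflow referred to above; expr (genes in rows, cells in columns) and cell_type are placeholders, and the exact helper names should be checked against the pamr documentation.

```r
# Sketch: nearest shrunken centroids with the pamr package.
# `expr` (genes x cells) and `cell_type` are hypothetical placeholders;
# note that pamr expects features in rows and samples in columns.
library(pamr)

dat <- list(x = expr, y = cell_type)
fit <- pamr.train(dat)                     # fits over a grid of shrinkage thresholds
cv  <- pamr.cv(fit, dat)                   # cross-validated error at each threshold
pamr.plotcv(cv)

best <- cv$threshold[which.min(cv$error)]  # pick the threshold with lowest CV error
pred <- pamr.predict(fit, expr, threshold = best)
table(pred, cell_type)
```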

Sparse clustering
Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples i and i'. One popular choice of dissimilarity measure is the Euclidean distance.
In high dimensions, it is often unnecessary to use information from all of the p dimensions. A weighted dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem.
This can be obtained by the sparse matrix decomposition:
\[ \underset{\mathbf{u}\in\mathbb{R}^{n^2},\,\mathbf{w}\in\mathbb{R}^{p}}{\text{maximise}} \; \mathbf{u}^\top \boldsymbol\Delta\, \mathbf{w}, \quad \text{subject to } \|\mathbf{u}\|_2 \le 1,\ \|\mathbf{w}\|_2 \le 1,\ \|\mathbf{w}\|_1 \le t,\ \text{and } w_j \ge 0,\ \forall j \in \{1,\dots,p\}, \]
where $\mathbf{w}$ is the vector of weights $w_j$, $j \in \{1,\dots,p\}$, and $\boldsymbol\Delta \in \mathbb{R}^{n^2 \times p}$ is the matrix of dissimilarity components, arranged such that each row of $\boldsymbol\Delta$ corresponds to the $d_{i,i',j}$, $j \in \{1,\dots,p\}$, for a pair of samples $i, i'$.
This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.

Sparse clustering
Some clustering methods, such as K-means, need a slightly modified approach.
K-means seeks to minimise the within-cluster sum of squares
\[ \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|_2^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i,i' \in C_k} \|\mathbf{x}_i - \mathbf{x}_{i'}\|_2^2, \]
where $C_k$ is the set of samples in cluster k and $\bar{\mathbf{x}}_k$ is the corresponding centroid.
Hence, a weighted K-means could proceed according to the optimisation:
\[ \underset{\mathbf{w}\in\mathbb{R}^p}{\text{minimise}} \; \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j}, \]
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster k.

Sparse clustering
However, for the optimisation
\[ \underset{\mathbf{w}\in\mathbb{R}^p}{\text{minimise}} \; \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j}, \]
it is not possible to choose a set of constraints which guarantees a non-pathological solution as well as convexity.
Instead, the between-cluster sum of squares can be maximised:
\[ \underset{\mathbf{w}\in\mathbb{R}^p}{\text{maximise}} \; \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n}\sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,i' \in C_k} d_{i,i',j} \right), \]
subject to $\|\mathbf{w}\|_2 \le 1$, $\|\mathbf{w}\|_1 \le t$, and $w_j \ge 0$, $\forall j \in \{1,\dots,p\}$.

Sparse clustering - real data examples
- Applied (sparse) hierarchical clustering to the same benchmark expression data set (14349 genes, for 347 cells of different types).
- Used the R package sparcl [5] for the sparse clustering.
- Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.
[Figure: NMI and ARI (0 to 0.8+) against the L1 bound (2 to 1000), for sparse hierarchical clustering versus standard hierarchical clustering.]

[5] Daniela M Witten and Robert Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association (2012).
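A hedged usage sketch for sparcl's sparse hierarchical clustering; expr (cells in rows, genes in columns) is a placeholder, and the function and argument names (HierarchicalSparseCluster, wbound, the permutation-based tuning helper) are assumptions to verify against the installed package.

```r
# Sketch: sparse hierarchical clustering with the sparcl package.
# `expr` (cells x genes) is a hypothetical placeholder; function and argument
# names are assumptions about the installed sparcl version.
library(sparcl)

# Choose the L1 bound on the feature weights by the package's permutation approach
perm <- HierarchicalSparseCluster.permute(expr, wbounds = c(2, 5, 10, 20, 50))
fit <- HierarchicalSparseCluster(expr, wbound = perm$bestw, method = "complete")

plot(fit$hc)              # dendrogram from the weighted dissimilarity
sum(fit$ws != 0)          # number of genes given nonzero weight
```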

Sparse clustering - real data examples
- Applied (sparse) k-means to the same benchmark expression data set (14349 genes, for 347 cells of different types).
- Used the R package sparcl for the sparse clustering.
- Plots show normalised mutual information (NMI) and adjusted Rand index (ARI) comparing sparse with standard clustering.
[Figure: NMI and ARI (0 to 0.8+) against the L1 bound (2 to 1000), for sparse k-means versus standard k-means.]
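A hedged usage sketch for sparse k-means in sparcl; expr and the choice K = 9 (one cluster per cell type listed earlier) are placeholders.

```r
# Sketch: sparse k-means with the sparcl package.
# `expr` (cells x genes) and K = 9 are hypothetical placeholders;
# argument names are assumptions about the installed sparcl version.
library(sparcl)

perm <- KMeansSparseCluster.permute(expr, K = 9, wbounds = c(2, 5, 10, 20, 50))
fit <- KMeansSparseCluster(expr, K = 9, wbounds = perm$bestw)

cluster_labels <- fit[[1]]$Cs        # cluster assignments at the chosen L1 bound
feature_weights <- fit[[1]]$ws       # per-gene weights; many are exactly zero
sum(feature_weights != 0)
```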

Sparse clustering - real data examples
- Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
- Applied standard k-means in sparse-PCA space to the same benchmark expression data set (14349 genes, for 347 cells of different types).
- This offers computational advantages, running in 9 seconds on a 2.8 GHz MacBook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.
[Figure: NMI and ARI (0 to 0.8+) against the L1 bound / sqrt(n) (0.1 to 1.0), for sparse spectral k-means versus standard k-means.]
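A hedged sketch of this "sparse spectral" approach: run a sparse PCA first and then standard k-means on the resulting scores. The use of PMA::SPC for the sparse PCA step, the placeholder matrix expr, and the particular L1 bound are all assumptions, not the author's exact pipeline.

```r
# Sketch: "sparse spectral" clustering, i.e. standard k-means on the scores
# from a sparse PCA. `expr` (cells x genes) is a hypothetical placeholder,
# and PMA::SPC for the sparse PCA step is an assumption.
library(PMA)

xc <- scale(expr, center = TRUE, scale = FALSE)
spc <- SPC(xc, sumabsv = 0.5 * sqrt(ncol(xc)), K = 10)   # top 10 sparse components

scores <- xc %*% spc$v                  # project cells onto the sparse loadings
km <- kmeans(scores, centers = 9, nstart = 20)
table(km$cluster)
```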