Sparse statistical modelling
Slide 1: Sparse statistical modelling. Tom Bartlett.
Slide 2: Introduction

"A sparse statistical model is one having only a small number of nonzero parameters or weights." [1]

- The number of features or variables measured on a person or object can be very large (e.g., expression levels of genes).
- These measurements are often highly correlated, i.e., they contain much redundant information.
- This scenario is particularly relevant in the age of big data.

[1] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
Slide 3: Outline

- Sparse linear models
- Sparse PCA
- Sparse SVD
- Sparse CCA
- Sparse LDA
- Sparse clustering
Slide 4: Sparse linear models

A linear model can be written as
$$ y_i = \alpha + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i = \alpha + \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i, \quad i = 1, \dots, n. $$
Hence, the model can be fit by minimising the objective function
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 \right\}. $$
Adding a penalisation term to the objective function makes the solution more sparse:
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}, \quad \text{where } q = 1 \text{ or } 2. $$
Slide 5: Sparse linear models

The penalty term $\lambda\|\boldsymbol\beta\|_q^q$ means that only the bare minimum is used of all the information available in the $p$ predictor variables $x_{ij}$, $j = 1, \dots, p$:
$$ \underset{\alpha,\,\boldsymbol\beta}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \alpha - \mathbf{x}_i^\top\boldsymbol\beta \right)^2 + \lambda \|\boldsymbol\beta\|_q^q \right\}. $$

- $q$ is typically chosen as $q = 1$ or $q = 2$, because these produce convex optimisation problems and hence are computationally much nicer!
- $q = 1$ is called the "lasso"; it tends to set as many elements of $\boldsymbol\beta$ as possible exactly to zero.
- $q = 2$ is called "ridge regression", and it tends to minimise the size of all the elements of $\boldsymbol\beta$.
- Penalisation is equally applicable to other types of linear models: logistic regression, generalised linear models, etc.
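As a concrete illustration, here is a minimal scikit-learn sketch on synthetic data (an addition for this transcription; the slides themselves work in R) showing that the $q = 1$ penalty zeroes coefficients while the $q = 2$ penalty only shrinks them:

```python
# Minimal sketch: lasso (q = 1) zeroes coefficients, ridge (q = 2) only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]          # only 3 of the 20 true coefficients are nonzero
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # q = 1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # q = 2 penalty
print("lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))  # typically close to 3
print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))  # typically all 20, shrunk
```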
Slide 6: Sparse linear models - simple example

[Figure: coefficient paths for the lasso (plotted against $\|\hat\beta\|_1 / \|\beta\|_1$) and for ridge regression (plotted against $\|\hat\beta\|_2 / \|\beta\|_2$), for the five predictors funding, hs, not-hs, college and college4.]

Crime rate modelled according to 5 predictors: annual police funding in dollars per resident (funding), percent of people 25 years and older with four years of high school (hs), percent of 16- to 19-year-olds not in high school and not high school graduates (not-hs), percent of 18- to 24-year-olds in college (college), and percent of people 25 years and older with at least four years of college (college4).
Slide 7: Sparse linear models - genomics example

- Gene expression data, for p genes, for n_c = 530 cancer samples + n_h = 61 healthy tissue samples.
- Fit a logistic (i.e., 2-class, cancer/healthy) lasso model using the R package glmnet, selecting λ by cross-validation.
- Out of the p possible genes for prediction, the lasso chooses just these 25 (shown on the slide with their fitted model coefficients; numeric gene-name suffixes and coefficient values were lost in transcription): ADAMTS, HPD, NUP, ADH, HS3ST, PAFAH1B, CA, IGSF, TACC, CCDC, LRRTM, TESC, CDH, LRRC3B, TRPM, CES, MEG, TSLP, COL10A, MMP, WDR51A, DPP, NUAK, WISP, HHATL.
- Caveat: these are not necessarily the only predictive genes. If we removed these genes from the data-set and fitted the model again, the lasso would choose an entirely new set of genes, which might be almost as good at predicting!
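The slides fit this model with R's glmnet; the following is a hedged Python sketch of the same kind of analysis (cross-validated l1-penalised logistic regression), on illustrative random data rather than the actual expression matrix:

```python
# Sketch: l1-penalised logistic regression with CV-chosen penalty, analogous in
# spirit to glmnet with family = "binomial"; shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.normal(size=(591, 1000))          # 530 + 61 samples; p = 1000 is an assumption
y = np.r_[np.ones(530), np.zeros(61)]     # 1 = cancer, 0 = healthy

fit = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_.ravel())
print(len(selected), "genes selected by the cross-validated lasso")
```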
Slide 8: Sparse PCA

Ordinary PCA finds $v$ by carrying out the optimisation:
$$ \underset{\|v\|_2 = 1}{\text{maximise}} \left\{ \frac{v^\top X^\top X v}{n} \right\}, $$
with $X \in \mathbb{R}^{n \times p}$ (i.e., $n$ samples and $p$ variables). With $p \gg n$, the eigenvectors of the sample covariance matrix $X^\top X / n$ are not necessarily close to those of the population covariance matrix [2]. Hence ordinary PCA can fail in this context. This motivates sparse PCA, in which many entries of $v$ are encouraged to be zero, by finding $v$ via the optimisation:
$$ \underset{\|v\|_2 = 1}{\text{maximise}} \left\{ v^\top X^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t. $$
In effect this discards some variables such that $p$ is closer to $n$.

[2] Iain M. Johnstone. "On the distribution of the largest eigenvalue in principal components analysis." Annals of Statistics (2001).
Slide 9: Sparse SVD

The SVD of a matrix $X \in \mathbb{R}^{n \times p}$, with $n > p$, can be expressed as $X = UDV^\top$, where $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ are orthogonal and $D \in \mathbb{R}^{p \times p}$ is diagonal. The SVD can hence be found by carrying out the optimisation:
$$ \underset{U \in \mathbb{R}^{n \times p},\, V \in \mathbb{R}^{p \times p},\, D \in \mathbb{R}^{p \times p}}{\text{minimise}} \ \left\| X - UDV^\top \right\|^2. $$
Hence, a sparse SVD with rank $r$ can be obtained by carrying out the optimisation:
$$ \underset{U \in \mathbb{R}^{n \times r},\, V \in \mathbb{R}^{p \times r},\, D \in \mathbb{R}^{r \times r}}{\text{minimise}} \left\{ \left\| X - UDV^\top \right\|^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1 \right\}. $$
This allows SVD to be applied to the $p > n$ scenario.
Slide 10: Sparse PCA and SVD - an algorithm

SVD is a generalisation of PCA. Hence, algorithms to solve the SVD problem can be applied to the PCA problem. The sparse PCA can thus be re-formulated as:
$$ \underset{\|u\|_2 = \|v\|_2 = 1}{\text{maximise}} \left\{ u^\top X v \right\}, \quad \text{subject to: } \|v\|_1 \le t, $$
which is biconvex in $u$ and $v$ and can be solved by alternating between the updates:
$$ u \leftarrow \frac{Xv}{\|Xv\|_2}, \qquad v \leftarrow \frac{S_\lambda(X^\top u)}{\|S_\lambda(X^\top u)\|_2}, \tag{1} $$
where $S_\lambda$ is the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+$, applied elementwise, with $\lambda$ chosen so that the constraint on $\|v\|_1$ is satisfied.
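A minimal numpy sketch of the alternating updates (1) for one sparse component, assuming a fixed penalty level lam rather than one tuned to meet the bound $\|v\|_1 \le t$ exactly:

```python
# Sketch of the alternating soft-thresholding algorithm for a rank-1 sparse SVD/PCA.
import numpy as np

def soft_threshold(a, lam):
    """S_lam(a) = sign(a) * (|a| - lam)_+, applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_rank1(X, lam, n_iter=100):
    n, p = X.shape
    v = np.ones(p) / np.sqrt(p)              # start v on the unit sphere
    for _ in range(n_iter):
        u = X @ v
        u /= np.linalg.norm(u)               # u <- Xv / ||Xv||_2
        v = soft_threshold(X.T @ u, lam)
        nv = np.linalg.norm(v)
        if nv == 0:                          # lam too large: all loadings thresholded away
            break
        v /= nv                              # v <- S_lam(X'u) / ||S_lam(X'u)||_2
    d = u @ X @ v                            # corresponding singular value
    return u, d, v

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))
u, d, v = sparse_rank1(X, lam=0.5)
print("nonzero loadings in v:", int(np.sum(v != 0)))
```

Further components can be obtained by deflating $X$ (subtracting $d\,uv^\top$) and repeating.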
Slide 11: Sparse PCA - simulation study

- Define $\Sigma$ as a $p \times p$ block-diagonal matrix, with $p = 200$ and 10 blocks of 1s of size $20 \times 20$. Hence, we would expect there to be 10 independent components of variation in the corresponding distribution.
- Generate $n$ samples $x \sim \text{Normal}(0, \Sigma)$.
- Estimate $\hat\Sigma = (x - \bar{x})(x - \bar{x})^\top / n$.
- Correlate the eigenvectors of $\hat\Sigma$ with the eigenvectors of $\Sigma$.
- Repeat 100 times for each different value of $n$.

[Plot: eigenvector correlation for the top 10 PCs against $n/p$; the plot shows the means of these correlations over the 100 repetitions for different values of $n$.]
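A sketch of this simulation follows (illustrative values of $n$ and fewer repetitions, to keep it fast; the exact correlation metric is not fully recoverable from the transcription, so the best absolute correlation of each sample PC with a population eigenvector is used here):

```python
# Sketch: block-diagonal covariance, accuracy of sample-covariance eigenvectors.
import numpy as np

p, n_blocks, block = 200, 10, 20
# The leading population eigenvectors of the block-diagonal Sigma are the
# normalised block indicators:
V = np.kron(np.eye(n_blocks), np.ones((block, 1)) / np.sqrt(block))

def top_pc_corr(n, rng):
    # x ~ Normal(0, Sigma) with Sigma = kron(I_10, ones(20, 20)), via latent factors
    X = rng.standard_normal((n, n_blocks)) @ (np.sqrt(block) * V).T
    Xc = X - X.mean(axis=0)
    _, vecs = np.linalg.eigh(Xc.T @ Xc / n)     # sample covariance eigenvectors
    top = vecs[:, -n_blocks:]                   # top 10 sample PCs
    # best absolute correlation of each sample PC with a population eigenvector
    return np.abs(V.T @ top).max(axis=0).mean()

rng = np.random.default_rng(3)
for n in (20, 100, 400):
    print(f"n/p = {n / p:.2f}:", np.mean([top_pc_corr(n, rng) for _ in range(20)]))
```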
Slide 12: Sparse PCA - simulation study

An implementation of sparse PCA is available in the R package PMA as the function SPC. It proceeds similarly to the algorithm described earlier, which is presented in more detail by Witten, Tibshirani and Hastie [3]. I applied this function to the same simulation as described on the previous slide. The scale of the penalisation is in terms of $\|u\|_1$, with $\|u\|_1 = \sqrt{p}$ corresponding to the minimum and $\|u\|_1 = 1$ to the maximum permissible penalisation.

[Plot: eigenvector correlation for the top 10 PCs against $n/p$, with $\|u\|_1 = \sqrt{p}$.]

[3] Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis." Biostatistics (2009), kxp008.
Slide 13: Sparse PCA - simulation study

[Plots: eigenvector correlation for the top 10 PCs against $n/p$, with $\|u\|_1 = \sqrt{p}/2$ and with $\|u\|_1 = \sqrt{p}/3$.]
Slide 14: Sparse PCA - real data example

- I carried out PCA on expression levels of genes in individual cells from developing brains.
- There are many different cell types in the data: some mature, some immature, and some in between.
- Different cell types are characterised by different gene-expression profiles.
- We would therefore expect to be able to visualise some separation of the cell types by dimensionality reduction to three dimensions.

[Plot: the cells plotted in terms of the top three (standard) PCA components.]
Slide 15: Sparse PCA - real data example

[Plots: the cells plotted in terms of the top three sparse PCA components, with $\|u\|_1 = 0.1\sqrt{p}$ (i.e., a high level of regularisation) and with $\|u\|_1 = 0.8\sqrt{p}$ (i.e., a low level of regularisation).]
Slide 16: Sparse CCA

In CCA, the aim is to find coefficient vectors $u \in \mathbb{R}^p$ and $v \in \mathbb{R}^q$ which project the data-matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ so as to maximise the correlation between these projections. Whereas PCA aims to find the direction of maximum variance in a single data-matrix, CCA aims to find the directions in the two data-matrices in which the variances best explain each other. The CCA problem can be solved by carrying out the optimisation:
$$ \underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \ \text{Cor}(Xu, Yv). $$
This problem is not well posed for $n < \max(p, q)$, in which case $u$ and $v$ can be found which trivially give $\text{Cor}(Xu, Yv) = 1$. Sparse CCA solves this problem by carrying out the optimisation:
$$ \underset{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}{\text{maximise}} \ \text{Cor}(Xu, Yv), \quad \text{subject to } \|u\|_1 < t_1 \text{ and } \|v\|_1 < t_2. $$
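A hedged numpy sketch of sparse CCA in the diagonal-within-set-covariance style of the penalised matrix decomposition [3], with fixed soft-thresholding levels lam1, lam2 standing in for the bounds $t_1$, $t_2$, on synthetic data with one shared latent signal:

```python
# Sketch: sparse CCA via alternating soft-thresholding of the cross-covariance.
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam1, lam2, n_iter=100):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y / X.shape[0]                      # cross-covariance estimate
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam1)
        u /= max(np.linalg.norm(u), 1e-12)        # keep ||u||_2 <= 1
        v = soft_threshold(C.T @ u, lam2)
        v /= max(np.linalg.norm(v), 1e-12)        # keep ||v||_2 <= 1
    return u, v

rng = np.random.default_rng(4)
z = rng.normal(size=(100, 1))                     # shared latent signal
X = z @ rng.normal(size=(1, 50)) + rng.normal(size=(100, 50))
Y = z @ rng.normal(size=(1, 30)) + rng.normal(size=(100, 30))
u, v = sparse_cca(X, Y, lam1=0.3, lam2=0.3)
print("correlation of projections:", np.corrcoef(X @ u, Y @ v)[0, 1])
```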
Slide 17: Sparse CCA - real data example

- Cell cycle is a biological process involved in the replication of cells.
- Cell cycle can be thought of as a latent process which is not directly observable in genomics data.
- It is driven by a small set of genes (particularly cyclins and cyclin-dependent kinases) from which it may be inferred.
- It has an effect on the expression of very many genes: hence it can also tend to act as a confounding factor when modelling many other biological processes.
- Used CCA here as an exploratory tool, with Y the data for the cell-cycle genes, and X the data for all the other genes.
Slide 18: Sparse LDA

LDA assigns item $i$ to a group $G$ based on a corresponding data-vector $x_i$, according to the posterior probability:
$$ P(G = k \mid x_i) = \frac{\pi_k f_k(x_i)}{\sum_{l=1}^{K} \pi_l f_l(x_i)}, \quad \text{with} \quad f_k(x_i) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\}, $$
with prior $\pi_k$ and mean $\mu_k$ for group $k$, and covariance $\Sigma$. This assignment takes place by constructing decision boundaries between classes $k$ and $l$:
$$ \log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l). $$
Because this boundary is linear in $x_i$, we get the name LDA (linear discriminant analysis).
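A worked numpy sketch of the resulting discriminant, computing $\delta_k(x) = \log \pi_k + x^\top \hat\Sigma^{-1} \hat\mu_k - \tfrac{1}{2} \hat\mu_k^\top \hat\Sigma^{-1} \hat\mu_k$ from a pooled covariance estimate on toy data:

```python
# Toy sketch of the LDA decision rule, fitted directly from class means,
# class priors, and a pooled covariance estimate (2 classes, p = 3).
import numpy as np

rng = np.random.default_rng(5)
mu_true = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, -1.0]])
X = np.vstack([rng.normal(size=(40, 3)) + mu_true[0],
               rng.normal(size=(60, 3)) + mu_true[1]])
y = np.r_[np.zeros(40, int), np.ones(60, int)]

pi = np.bincount(y) / len(y)                          # class priors
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
Xc = X - mu[y]                                        # within-class centring
Sigma = Xc.T @ Xc / (len(y) - 2)                      # pooled covariance
P = np.linalg.inv(Sigma)

def delta(x):
    """Discriminant scores for both classes; argmax gives the assignment."""
    return np.array([np.log(pi[k]) + x @ P @ mu[k] - 0.5 * mu[k] @ P @ mu[k]
                     for k in (0, 1)])

x_new = np.array([1.5, 0.5, -0.5])
print("assigned class:", int(np.argmax(delta(x_new))))
```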
Slide 19: Sparse LDA

The decision boundary
$$ \log \frac{P(G = k \mid x_i)}{P(G = l \mid x_i)} = \log \frac{\pi_k}{\pi_l} + x_i^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l) $$
then naturally leads to the decision rule:
$$ \hat{G}(x_i) = \underset{k}{\text{argmax}} \left\{ \log \pi_k + x_i^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k \right\}. $$
By assuming $\Sigma$ is diagonal, i.e., that there is no covariance between the $p$ dimensions, this decision rule can be reduced to the nearest-centroids classifier:
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\}. $$
Typically, $\Sigma$ (or $\sigma$) is estimated from the data as $\hat\Sigma$ (or $\hat\sigma$), and the $\mu_k$ are estimated as $\hat\mu_k$, whilst training the classifier.
Slide 20: Sparse LDA

The nearest-centroids classifier
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \mu_{jk})^2}{\sigma_j^2} - 2 \log \pi_k \right\} $$
will typically use all $p$ variables. This is often unnecessary and can lead to overfitting in high-dimensional contexts. The nearest shrunken centroids classifier deals with this issue. Define $\mu_k = \bar{x} + \alpha_k$, where $\bar{x}$ is the data-mean across all classes, and $\alpha_k$ is the class-specific deviation of the mean from $\bar{x}$. Then, the nearest shrunken centroids classifier proceeds with the optimisation:
$$ \underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1, \dots, K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\}, $$
where $C_k$ and $n_k$ are the set and number of samples in group $k$.
Slide 21: Sparse LDA

Hence, the $\hat\alpha_k$ estimated from the optimisation
$$ \underset{\alpha_k \in \mathbb{R}^p,\; k \in \{1, \dots, K\}}{\text{minimise}} \left\{ \frac{1}{2n} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \frac{(x_{ij} - \bar{x}_j - \alpha_{jk})^2}{\hat\sigma_j^2} + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \sqrt{\frac{n_k}{\hat\sigma_j^2}}\, |\alpha_{jk}| \right\} $$
can be used to estimate the shrunken centroids $\hat\mu_k = \bar{x} + \hat\alpha_k$, thus training the classifier:
$$ \hat{G}(x_i) = \underset{k}{\text{argmin}} \left\{ \sum_{j=1}^{p} \frac{(x_{ij} - \hat\mu_{jk})^2}{\hat\sigma_j^2} - 2 \log \hat\pi_k \right\}. $$
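The slides use R's pamr for this; scikit-learn's NearestCentroid with a shrink_threshold is a rough Python analogue, sketched here on synthetic data (not the original analysis):

```python
# Sketch: plain vs shrunken nearest centroids on data where only a few
# of many features carry class information.
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 500))
y = rng.integers(0, 3, size=120)
X[y == 1, :10] += 1.5                 # only 10 features carry class information

plain = NearestCentroid().fit(X, y)                       # uses all 500 features
shrunk = NearestCentroid(shrink_threshold=0.5).fit(X, y)  # shrinks centroids to the mean
print("training accuracy:", plain.score(X, y), shrunk.score(X, y))
```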
Slide 22: Sparse LDA - real data example

- I applied nearest (shrunken) centroids to expression data for 14349 genes, for 347 cells of different types: leukocytes (54); lymphoblastic cells (88); fetal brain cells (16 wk, 26; 21 wk, 24); fibroblasts (37); ductal carcinoma (22); keratinocytes (40); B lymphoblasts (17); iPS cells (24); neural progenitors (15).
- Used the R packages MASS and pamr [4].
- Carried out 100 repetitions of 3-fold CV.

[Plots: normalised mutual information (NMI), adjusted Rand index (ARI) and prediction accuracy against sparsity threshold, with quantiles (0%, 25%, 50%, 75%, 100%, over 300 predictions) for sparse LDA and regular LDA.]

[4] Robert Tibshirani et al. "Class prediction by nearest shrunken centroids, with applications to DNA microarrays." Statistical Science (2003).
Slide 23: Sparse clustering

Many clustering methods, such as hierarchical clustering, are based on a dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} d_{i,i',j}$ between samples $i$ and $i'$. One popular choice of dissimilarity measure is the Euclidean distance. In high dimensions, it is often unnecessary to use information from all of the $p$ dimensions. A weighted dissimilarity measure $D_{i,i'} = \sum_{j=1}^{p} w_j d_{i,i',j}$ can be a useful approach to this problem. This can be obtained by the sparse matrix decomposition:
$$ \underset{u \in \mathbb{R}^{n^2},\, w \in \mathbb{R}^{p}}{\text{maximise}} \left\{ u^\top \Delta w \right\}, \quad \text{subject to } \|u\|_2 \le 1,\ \|w\|_2 \le 1,\ \|w\|_1 \le t,\ \text{and } w_j \ge 0 \ \forall j \in \{1, \dots, p\}, $$
where $w$ is the vector of the weights $w_j$, $j \in \{1, \dots, p\}$, and $\Delta \in \mathbb{R}^{n^2 \times p}$ holds the dissimilarity components, arranged such that each row of $\Delta$ corresponds to the $d_{i,i',j}$, $j \in \{1, \dots, p\}$, for a pair of samples $(i, i')$. This weighted dissimilarity measure can then be used for sparse clustering, such as sparse hierarchical clustering.
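A hedged numpy sketch of obtaining the weights $w$ by this rank-1 decomposition, with a fixed penalty lam in place of the exact bound $\|w\|_1 \le t$, and positivity of $w$ kept by thresholding at zero:

```python
# Sketch: feature weights for a weighted dissimilarity via a rank-1
# penalised matrix decomposition of the dissimilarity components.
import numpy as np

def sparse_dissimilarity_weights(X, lam, n_iter=50):
    n, p = X.shape
    # Delta: one row per ordered sample pair (i, i'), one column per feature,
    # holding d_{i,i',j} = (x_ij - x_i'j)^2
    Delta = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(n * n, p)
    w = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        u = Delta @ w
        u /= np.linalg.norm(u)                      # keep ||u||_2 <= 1
        w = np.maximum(Delta.T @ u - lam, 0.0)      # soft-threshold, w_j >= 0
        w /= max(np.linalg.norm(w), 1e-12)          # keep ||w||_2 <= 1
    D = (Delta * w).sum(axis=1).reshape(n, n)       # weighted dissimilarity matrix
    return w, D

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 40))
X[:15, :5] += 3.0                                   # two groups differing in 5 features
w, D = sparse_dissimilarity_weights(X, lam=100.0)
print("features with nonzero weight:", np.flatnonzero(w))
```

The matrix D can then be passed to a hierarchical clustering routine in place of the unweighted distances.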
Slide 24: Sparse clustering

Some clustering methods, such as K-means, need a slightly modified approach. K-means seeks to minimise the within-cluster sum of squares
$$ \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|_2^2 = \sum_{k=1}^{K} \frac{1}{2 n_k} \sum_{i, i' \in C_k} \| x_i - x_{i'} \|_2^2, $$
where $C_k$ is the set of samples in cluster $k$ and $\bar{x}_k$ is the corresponding centroid. Hence, a weighted K-means could proceed according to the optimisation:
$$ \underset{w \in \mathbb{R}^p}{\text{minimise}} \ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j}, $$
where $d_{i,i',j} = (x_{ij} - x_{i'j})^2$, and $n_k$ is the number of samples in cluster $k$.
Slide 25: Sparse clustering

However, for the optimisation
$$ \underset{w \in \mathbb{R}^p}{\text{minimise}} \ \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j}, $$
it is not possible to choose a set of constraints which guarantees a non-pathological solution as well as convexity. Instead, the between-cluster sum of squares can be maximised:
$$ \underset{w \in \mathbb{R}^p}{\text{maximise}} \ \sum_{j=1}^{p} w_j \left( \frac{1}{n} \sum_{i=1}^{n} \sum_{i'=1}^{n} d_{i,i',j} - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, i' \in C_k} d_{i,i',j} \right), $$
subject to $\|w\|_2 \le 1$, $\|w\|_1 \le t$, and $w_j \ge 0 \ \forall j \in \{1, \dots, p\}$.
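A hedged sketch of the alternation this suggests (in the spirit of what sparcl implements): cluster with the current weights, then refresh $w$ by soft-thresholding each feature's between-cluster sum of squares, with a fixed penalty lam standing in for the exact L1 bound:

```python
# Sketch of sparse K-means by alternating clustering and weight updates.
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans(X, K, lam, n_iter=10, seed=0):
    n, p = X.shape
    w = np.ones(p) / np.sqrt(p)
    for _ in range(n_iter):
        km = KMeans(n_clusters=K, n_init=10, random_state=seed)
        labels = km.fit_predict(X * np.sqrt(w))           # weighted K-means step
        total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)   # total sum of squares per feature
        within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum(axis=0)
                     for k in range(K))
        between = total - within                          # between-cluster sum of squares
        w = np.maximum(between - lam, 0.0)                # soft-threshold, w_j >= 0
        w /= max(np.linalg.norm(w), 1e-12)                # keep ||w||_2 <= 1
    return labels, w

rng = np.random.default_rng(8)
X = rng.normal(size=(90, 50))
X[:30, :5] += 3.0                                         # three clusters that differ
X[30:60, :5] -= 3.0                                       # only in the first 5 features
labels, w = sparse_kmeans(X, K=3, lam=100.0)
print("features selected:", np.flatnonzero(w))
```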
Slide 26: Sparse clustering - real data examples

- Applied (sparse) hierarchical clustering to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Used the R package sparcl [5] for the sparse clustering.

[Plots: normalised mutual information (NMI) and adjusted Rand index (ARI) against L1 bound, comparing sparse hierarchical clustering with standard hierarchical clustering.]

[5] Daniela M. Witten and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association (2010).
Slide 27: Sparse clustering - real data examples

- Applied (sparse) k-means to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Used the R package sparcl for the sparse clustering.

[Plots: NMI and ARI against L1 bound, comparing sparse k-means with standard k-means.]
Slide 28: Sparse clustering - real data examples

- Spectral clustering essentially uses k-means clustering (or similar) in a dimensionally-reduced (e.g., PCA) space.
- Applied standard k-means in sparse-PCA space to the same benchmark expression data-set (14349 genes, for 347 cells of different types).
- Offers computational advantages, running in 9 seconds on a 2.8 GHz MacBook, compared with 19 seconds for standard k-means, and 35 seconds for sparse k-means.

[Plots: NMI and ARI against L1 bound / sqrt(n), comparing sparse spectral k-means with standard k-means.]