Statistical Learning with the Lasso, spring 2017: The Lasso

Statistical Learning with the Lasso, spring 2017. Two motivating examples: (1) Yeast, understanding basic life functions: p = 11,904 gene values, n = number of experiments ≈ 10 (Blomberg et al. 2003, 2010). (2) fMRI brain scans, the function of the brain's language network: p ≈ 3 million voxels, n = number of subjects ≈ 50 (Taylor et al. 2006).

The module: the Lasso for
- linear models
- generalized linear models
- the group Lasso
- smooth functions

Module literature:
- Statistics for High-Dimensional Data: Methods, Theory and Applications, P. Bühlmann and S. van de Geer, Springer, 2011.
- Computer Age Statistical Inference, B. Efron and T. Hastie, Cambridge University Press, 2016.
- Statistical Learning with Sparsity: the Lasso and Generalizations, T. Hastie, R. Tibshirani, and M. Wainwright, CRC Press, 2015.

Slides cover B&vdG 1-2.7 and E&H 12.1, 12.2 (it is also good to read E&H 16.1, 16.2). Exercise: B&vdG 2.1. Project: find an interesting data set and do a Lasso analysis of it. Exercises and the project should be done in groups of 2 or 3 persons and presented in week 10.

The linear model (responses, covariates, parameters, errors):
$Y_1 = X_1^{(1)}\beta_1 + X_1^{(2)}\beta_2 + \dots + X_1^{(p)}\beta_p + \varepsilon_1$
$Y_2 = X_2^{(1)}\beta_1 + X_2^{(2)}\beta_2 + \dots + X_2^{(p)}\beta_p + \varepsilon_2$
$\vdots$
$Y_n = X_n^{(1)}\beta_1 + X_n^{(2)}\beta_2 + \dots + X_n^{(p)}\beta_p + \varepsilon_n$
or in matrix form $Y = X\beta + \varepsilon$. The OLS (ordinary least squares) estimate of the parameters is
$\hat\beta = \operatorname{argmin}_\beta \sum_{i=1}^n \big(Y_i - (X_i^{(1)}\beta_1 + X_i^{(2)}\beta_2 + \dots + X_i^{(p)}\beta_p)\big)^2 = \operatorname{argmin}_\beta \|Y - X\beta\|_2^2.$

OLS: $\hat\beta_{ols} = \operatorname{argmin}_\beta \|Y - X\beta\|_2^2$ can be obtained as follows: write $f(\beta) = \|Y - X\beta\|_2^2 = (Y - X\beta)^t(Y - X\beta)$, so that
$\frac{df(\beta)}{d\beta} = -2X^t(Y - X\beta) = 0 \;\Rightarrow\; X^tY = X^tX\beta \;\Rightarrow\; \hat\beta_{ols} = (X^tX)^{-1}X^tY$ (if the matrix is invertible).
If $p$ is large then OLS often overfits (i.e. adapts too much to the noise and gives unreliable predictions and unreliable indications of which covariates are important).
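
As a quick illustration (not part of the slides), here is a minimal numpy sketch of the OLS formula on synthetic data; np.linalg.lstsq is included as the numerically safer alternative to forming $(X^tX)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: beta_ols = (X^t X)^{-1} X^t Y (requires X^t X to be invertible)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically preferable: least squares via SVD
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_ols, beta_lstsq)
```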

E&H 16.1, 16.2. E&H 16.1: Forward stepwise regression, the standard statistical way to do regression on many covariates: first do all 1-dimensional regressions and choose the covariate which gives the smallest loss, then do all 2-dimensional regressions and choose the 2 covariates which give the smallest loss, then ... up to some predetermined dimension M, and finally choose the regression which is best. Computationally impossible in high dimensions. Greedy version: instead of doing all 2-dimensional regressions, in step 2 keep the first selected covariate and just add the best one of the remaining covariates, and so on. Computationally much easier, but the first version can lead to a better model. E&H 16.2: the Lasso. Read.
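
A possible sketch of the greedy version described above (a hypothetical helper, not code from E&H): at each step keep the covariates already chosen and add the one remaining covariate that most reduces the residual sum of squares.

```python
import numpy as np

def greedy_forward_stepwise(X, Y, M):
    """Greedy forward selection: add one covariate at a time, up to M covariates (M <= p)."""
    n, p = X.shape
    selected = []
    for _ in range(M):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            rss = np.sum((Y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```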

B&vdG 2.2. Simplest case: the same linear model
$Y_i = X_i^{(1)}\beta_1 + X_i^{(2)}\beta_2 + \dots + X_i^{(p)}\beta_p + \varepsilon_i$, $i = 1, \dots, n$, i.e. $Y = X\beta + \varepsilon$,
but now with $p \gg n$ (a "wide" data set). Standardize (cheating?) so that $\bar Y = \sum_{i=1}^n Y_i/n = 0$, $\bar X^{(j)} = 0$ and $\hat\sigma_j^2 = \sum_{i=1}^n (X_i^{(j)} - \bar X^{(j)})^2/n = 1$, for $j = 1, \dots, p$.
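
A small helper for this standardization, assuming the 1/n convention for the empirical variance (the exact normalization on the slide did not survive the transcription):

```python
import numpy as np

def standardize(X, Y):
    """Center Y and the columns of X; scale each column of X to empirical variance 1 (1/n convention)."""
    Yc = Y - Y.mean()
    Xc = X - X.mean(axis=0)
    sigma = np.sqrt((Xc ** 2).mean(axis=0))  # empirical standard deviations
    return Xc / sigma, Yc
```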

The Lasso (Least Absolute Shrinkage and Selection Operator). Lagrange formulation:
$\hat\beta(\lambda) = \operatorname{argmin}_\beta \big( \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \big)$.
This has the same solution as the constrained (primal) formulation, for $R$ determined by $\lambda$ and the data, $R = \|\hat\beta(\lambda)\|_1$:
$\hat\beta_{primal}(R) = \operatorname{argmin}_{\beta;\, \|\beta\|_1 \le R} \|Y - X\beta\|_2^2/n$.
(Ridge regression: $\hat\beta(\lambda) = \operatorname{argmin}_\beta \big( \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_2^2 \big)$, which in the same way, for $R$ determined by $\lambda$, equals $\hat\beta_{primal}(R) = \operatorname{argmin}_{\beta;\, \|\beta\|_2 \le R} \|Y - X\beta\|_2^2/n$.)
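
A minimal sketch with scikit-learn's Lasso on synthetic data. Note that sklearn minimizes $\|Y - X\beta\|_2^2/(2n) + \alpha\|\beta\|_1$, so its alpha corresponds to $\lambda/2$ in the formulation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 200                                   # p >> n, a "wide" data set
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, 4.0]       # sparse truth
Y = X @ beta_true + rng.normal(size=n)

lam = 0.5                                        # lambda in the slide's formulation
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, Y)
print("number of non-zero coefficients:", np.sum(fit.coef_ != 0))
```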

Ordinary least squares (OLS): $\hat\beta_{ols} = \operatorname{argmin}_\beta \|Y - X\beta\|_2^2$. If $p \ge n$ there are more parameters than observations and OLS overfits. Estimation and understanding are only possible if the parameters $\beta$ are sparse or continuous, and to obtain sparse models when $p$ is (very) large one has to select which covariates to use in a good way. One way to do this is to penalize as on the previous slide.

(Figure) Contour lines of the residual sum of squares. (Left) The $\ell_1$-ball corresponding to the Lasso. (Right) The $\ell_2$-ball corresponding to ridge regression.

B&vdG 2.3: Soft thresholding. Simplest case:
$Y_1 = \beta_1 + \varepsilon_1,\; Y_2 = \beta_2 + \varepsilon_2,\; \dots,\; Y_n = \beta_n + \varepsilon_n$,
so that $\|Y - X\beta\|_2^2 = \sum_{i=1}^n (Y_i - \beta_i)^2$ and $\|\beta\|_1 = \sum_{i=1}^n |\beta_i|$. Hence
$\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 = \sum_{i=1}^n \big( (Y_i - \beta_i)^2 + \lambda|\beta_i| \big)$
can be minimized term by term, with solution
$\hat\beta_j = 0$ if $|Y_j| \le \lambda/2$, and $\hat\beta_j = \operatorname{sign}(Y_j)(|Y_j| - \lambda/2)$ otherwise, i.e. $\hat\beta_j =: g_{soft,\lambda/2}(Y_j)$.

(Figure: the soft-thresholding function $g_{soft,1}(z)$.)
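
A minimal numpy version of the soft-thresholding function; the threshold argument plays the role of $\lambda/2$ above.

```python
import numpy as np

def g_soft(z, t):
    """Soft thresholding: 0 if |z| <= t, otherwise shrink z towards 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# In the simplest case Y_i = beta_i + eps_i, the Lasso solution is g_soft(Y, lam / 2).
```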

Soft thresholding: orthonormal designs. The design is orthonormal if $X$ is $n \times n$ and $X^tX = nI_n$. Example (a designed experiment with 3 factors): $Y = \beta_1 + X^{(2)}\beta_2 + X^{(3)}\beta_3 + X^{(4)}\beta_4 + \varepsilon$, where $\beta_1$ is the mean, $X^{(2)} = 1$ if the first factor is high and $X^{(2)} = -1$ if the first factor is low, etc. (The slide shows the corresponding $\pm 1$ design matrix $X$.)

If $X^tX = nI_n$ then $X^tY/n = X^tX\beta/n + X^t\varepsilon/n = \beta + \varepsilon'$, so we are back in the simple situation, but with $Y_j$ replaced by $Z_j = (X^tY)_j/n$, and we get that
$\hat\beta_j = g_{soft,\lambda/2}\big((X^tY)_j/n\big) = g_{soft,\lambda/2}(Z_j)$
(note that $X^tY/n = (X^tX)^{-1}X^tY = \hat\beta_{ols}$ for an orthonormal design). Homework: problem 2.1.
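
A small sketch on a hypothetical $\pm 1$ orthonormal design (Hadamard-type, not the design from the slide), showing that soft-thresholding $Z = X^tY/n$ at $\lambda/2$ gives the Lasso solution in this case:

```python
import numpy as np

# A 4 x 4 design with X^t X = n I_n
X = np.array([[1,  1,  1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1, -1, -1,  1]], dtype=float)
n = X.shape[0]
assert np.allclose(X.T @ X, n * np.eye(n))

rng = np.random.default_rng(2)
beta_true = np.array([2.0, 0.0, -1.5, 0.0])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

lam = 0.8
Z = X.T @ Y / n                                                  # equals beta_ols here
beta_lasso = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2, 0.0)   # g_soft,lam/2(Z)
print(beta_lasso)
```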

B&vdG 2.4: Prediction. The goal of prediction is to find $f(x) = E(Y \mid X = x) = \sum_{j=1}^p x^{(j)}\beta_j$, and hence it is closely related to estimation of $\beta$. If we use the Lasso for estimation we get the estimated predictor
$\hat f_\lambda(x) = x^t\hat\beta(\lambda) = \sum_{j=1}^p x^{(j)}\hat\beta_j(\lambda)$.
But how should one choose $\lambda$?

E&H 12.1, 12.2: (10-fold) cross-validation.
1: Split (somehow) the data into 10 approximately equally sized groups.
2: For each $Y_i$, use the Lasso to estimate $\beta$ from the 9 groups which do not contain $Y_i$; call this estimate $\hat\beta^{-i}(\lambda)$, and use it to compute the predictor $\hat f_\lambda^{-i}(x_i) = \sum_{j=1}^p x_i^{(j)}\hat\beta_j^{-i}(\lambda)$.
3: Compute the cross-validated mean square prediction error $e(\lambda) = \sum_{i=1}^n \big(Y_i - \hat f_\lambda^{-i}(x_i)\big)^2$ (or use some other appropriate error measure).
4: Repeat this for a grid of $\lambda$-values and choose the $\lambda$ which gives the smallest $e(\lambda)$.
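
A sketch of steps 1-3 with scikit-learn (again with alpha = lambda/2 to match sklearn's parameterization); step 4 is indicated in the comments. sklearn's LassoCV automates the whole procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_error(X, Y, lam, n_folds=10, seed=0):
    """Cross-validated mean square prediction error e(lambda), steps 1-3 above."""
    err = 0.0
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X[train], Y[train])
        err += np.sum((Y[test] - fit.predict(X[test])) ** 2)
    return err

# Step 4: repeat over a grid of lambda values and keep the minimizer, e.g.
# lambdas = np.logspace(-3, 1, 30)
# best_lam = min(lambdas, key=lambda lam: cv_error(X, Y, lam))
```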

B&vdG 2.4.1.1: Lymph node status of cancer patients. $Y = 0$ or $1$; $n = 49$ breast cancer tumor samples; $p = 7129$ gene expression measurements per tumor sample. The data don't follow the linear model, but one can still use the (cross-validated) Lasso to compute $\hat f_\lambda(x) = \sum_{j=1}^p x^{(j)}\hat\beta_j(\lambda)$ and then use it for classification: classify the status of an observation with design vector $x$ as 1 if $\hat f_\lambda(x) > 1/2$ and as 0 if $\hat f_\lambda(x) \le 1/2$. This was compared with another method, also through cross-validation: randomly divide the sample into 2/3 training data and 1/3 test data, do the entire model fitting on the training data, count misclassifications on the test data, repeat 100 times, and compute average misclassification percentages: Lasso 21%, the alternative 35%. The average number of non-zero $\beta$'s for the Lasso was 13.
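
A hedged sketch of this evaluation scheme on synthetic stand-in data (the breast cancer gene expression data are not reproduced here), using LassoCV for the cross-validated choice of $\lambda$ and the 1/2 threshold for classification:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gene expression data
rng = np.random.default_rng(3)
n, p = 49, 500
X = rng.normal(size=(n, p))
Y = (X[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(float)  # 0/1 labels

errors = []
for rep in range(100):                                   # 100 repetitions, as on the slide
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=1 / 3, random_state=rep)
    fit = LassoCV(cv=10).fit(Xtr, Ytr)                   # lambda chosen by CV on training data only
    pred = (fit.predict(Xte) > 0.5).astype(float)
    errors.append(np.mean(pred != Yte))
print("average misclassification rate:", np.mean(errors))
```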

Choice of estimation method. How does one choose between different estimation methods: the Lasso, ridge regression, OLS, ...? Sometimes, e.g. in signal processing, one can try out the methods in situations where one can see how well they work (but still, it is impossible to try all methods which have been suggested). Other routes: understanding, simulation, asymptotics.

B&vdG 2.4.2: Asymptotics. In basic statistical asymptotics the model is fixed and $n$, the number of observations, tends to infinity (and one then assumes that in a practical situation with large $n$ the distributions are well described by the asymptotic distribution). This is not possible for $p \gg n$, since if the model (and thus $p$, the number of parameters) is kept fixed, then as $n$ gets large one eventually gets $p < n$. One instead has to use triangular arrays, where $p = p_n$ tends to infinity at the same time as $n$ tends to infinity.

Triangular arrays. In asymptotic statements B&vdG typically assume that $p = p_n$, $X = X_n = (X_{n;i}^{(j)})$, and $\beta = \beta_n = (\beta_{n;j})$ all change as $n \to \infty$, typically with $p_n$ growing faster than $n$, i.e. with $p_n/n \to \infty$ as $n \to \infty$. Thus one considers a sequence of models $Y_n = X_n\beta_n + \varepsilon_n$, or in more detail
$Y_{n;i} = \sum_{j=1}^{p_n} X_{n;i}^{(j)}\beta_{n;j} + \varepsilon_i$, for $i = 1, \dots, n$,
and derives asymptotic inequalities and limit distributions for such triangular arrays, as $n \to \infty$.

Prediction consistency:
$(\hat\beta_n(\lambda_n) - \beta_n^0)^t\,\Sigma_{X_n}\,(\hat\beta_n(\lambda_n) - \beta_n^0) = o_p(1)$,
where $\Sigma_{X_n} = X_n^tX_n/n$ for a fixed design, and $\Sigma_{X_n}$ equals the estimated covariance matrix of the covariate vector $X_n$ for a random design with the covariate vectors in the different rows i.i.d. ($Z_n = o_p(1)$ means that $P(|Z_n| > \varepsilon) \to 0$ as $n \to \infty$, for any $\varepsilon > 0$, i.e. that $Z_n$ tends to zero in probability.)
Conditions needed to obtain prediction consistency include $\|\beta_n^0\|_1 = o\big(\sqrt{n/\log p_n}\big)$, $K^{-1}\sqrt{\log p_n/n} < \lambda_n < K\sqrt{\log p_n/n}$ for some constant $K$, and that the $\varepsilon_{n;i}$, $i = 1, \dots, n$, are i.i.d. with finite variance.

The quantity $(\hat\beta_n(\lambda_n) - \beta_n^0)^t\,\Sigma_{X_n}\,(\hat\beta_n(\lambda_n) - \beta_n^0)$ equals $\|X_n(\hat\beta_n(\lambda_n) - \beta_n^0)\|_2^2/n$ for a fixed design, and $E\big[(X_{n;new}(\hat\beta_n(\lambda_n) - \beta_n^0))^2\big]$ for a random design, where $X_n(\hat\beta_n(\lambda_n) - \beta_n^0)$ is the difference between the estimated and the true predictor (= regression function).

Oracle inequality:
$E\big[\|X_n(\hat\beta_n(\lambda_n) - \beta_n^0)\|_2^2/n\big] = O\Big(\frac{s_{n;0}\log p_n}{n\,\phi_n^2}\Big)$,
where $s_{n;0} = \operatorname{card}(S_{n;0}) = |S_{n;0}|$ is the number of non-zero elements of $\beta_n^0$, i.e. the size of the set of active variables $S_{0;n} = \{j;\ \beta_{n;j}^0 \neq 0,\ j = 1, \dots, p_n\}$, and $\phi_n^2$ is a compatibility constant determined by the design matrix $X_n$. This means roughly that using the estimated predictor instead of the true one adds a (random) error of a size slightly larger than $1/n$. Requires further conditions.

B&vdG 2.5: Estimation consistency: $\|\hat\beta_n(\lambda_n) - \beta_n^0\|_q = o_P(1)$, for $q = 1$ or $2$. Requires similar conditions as for prediction consistency.

Variable screening. Let $\hat S_{0;n}(\lambda) = \{j;\ \hat\beta_{n;j}(\lambda) \neq 0,\ j = 1, \dots, p_n\}$. Typically the Lasso solutions $\hat\beta_n(\lambda)$ are not unique (e.g. they are not unique if $p > n$), but $\hat S_{0;n}(\lambda)$ is still unique (Lemma 2.1). Further, $|\hat S_{0;n}(\lambda)| \le \min(n, p_n)$. The true set of relevant variables is $S_{0;n}^{relevant(C)} = \{j;\ \beta_{n;j}^0 \neq 0,\ j = 1, \dots, p_n\}$. Asymptotically, under suitable conditions, $\hat S_{0;n}(\lambda)$ will contain $S_{0;n}^{relevant(C)}$, but often, depending on the value of $\lambda$, it will contain many more variables. For the choice of $\lambda$, read B&vdG 2.5.1.

B&vdG 2.5.2: Motif regression for DNA binding sites. $Y_i$ = binding intensity of HIF1α in coarse DNA segment $i$; $X_i^{(j)}$ = abundance score of candidate motif $j$ in DNA segment $i$; $i = 1, \dots, n = 287$; $j = 1, \dots, p = 195$.
$Y_i = \mu + \sum_{j=1}^{195} X_i^{(j)}\beta_j + \varepsilon_i$, for $i = 1, \dots, 287$.
Scale the covariates $X$ to the same empirical variance, subtract $\bar Y$, and do the Lasso with $\lambda = \hat\lambda_{CV}$ chosen by 10-fold cross-validation for optimal prediction. This gives $|\hat S(\hat\lambda_{CV})| = 26$ non-zero $\hat\beta(\hat\lambda_{CV})$ estimates. B&vdG believe these include the true active variables.

B&vdG 2.6: Variable selection.
$\hat\beta(\lambda) = \operatorname{argmin}_\beta \big( \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_0 \big)$, where $\|\beta\|_0 = \sum_{j=1}^p 1(\beta_j \neq 0)$.
This is not convex: the naive solution would be to go through all possible sub-models (i.e. models with some of the $\beta$'s set to 0), compute the OLS for each of these models, and then choose the model which minimizes the right-hand side. However, there are $2^p$ submodels, and if $p$ is not small this isn't feasible. More efficient algorithms exist, but it is still a difficult/infeasible problem if $p$ is not small.
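
For small $p$ the naive enumeration is easy to write down; a sketch (hypothetical helper, exponential in $p$, so only usable for $p$ of at most 20 or so):

```python
import numpy as np
from itertools import combinations

def best_subset_l0(X, Y, lam):
    """Exhaustive l0-penalized least squares over all 2^p submodels (small p only)."""
    n, p = X.shape
    best_val, best_beta = np.sum(Y ** 2) / n, np.zeros(p)   # the empty model
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            cols = list(S)
            beta_S, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            val = np.sum((Y - X[:, cols] @ beta_S) ** 2) / n + lam * k
            if val < best_val:
                best_val = val
                best_beta = np.zeros(p)
                best_beta[cols] = beta_S
    return best_beta
```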

One possibility is to first use (some variant of) the Lasso to select a set of variables which with high probability includes the set of active variables, and then do the minimization of the $\ell_0$-penalized mean square error only among these variables. This typically reduces the number of operations needed for the minimization to $O(np\min(n,p))$. However, it requires that the Lasso in fact selects all active variables/parameters. Asymptotic conditions to ensure this include that $\inf_{j \in S_0}|\beta_j^0|$ is of larger order than $\sqrt{s_0\log p/n}$, and that the neighbourhood stability (or irrepresentable) condition holds.
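
A sketch of this two-stage idea, reusing the hypothetical best_subset_l0 helper from the previous sketch and LassoCV for the screening step; it is only sensible if few variables survive the screening.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_then_l0(X, Y, lam_l0):
    """Stage 1: Lasso screening. Stage 2: exhaustive l0 search among the selected variables."""
    screened = np.flatnonzero(LassoCV(cv=10).fit(X, Y).coef_)
    beta = np.zeros(X.shape[1])
    if screened.size > 0:
        beta[screened] = best_subset_l0(X[:, screened], Y, lam_l0)
    return beta
```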

Examples of sufficient conditions for neighbourhood stability (recall that $\Sigma = X^tX/n$ for a fixed design and that $\Sigma$ is the correlation matrix of $X$ for a random design):
- Equal positive correlation: $\Sigma_{j,j} = 1$ for all $j$, and $\Sigma_{j,k} = \rho > 0$ for $j \neq k$, for some $\rho < 1$.
- Markov (or Toeplitz) structure: $\Sigma_{j,k} = \theta^{|j-k|}$, with $|\theta| < 1$.
- Bounded pairwise correlation: $s_0 \max_{j \notin S_0,\, k \in S_0} \Sigma_{j,k}^2\,/\,\Lambda_{\min}^2(\Sigma_{1,1}) < 1/2$, where $\Lambda_{\min}(\Sigma_{1,1})$ is the smallest eigenvalue of the covariance matrix of the active variables.
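
For simulation studies, the first two correlation structures are easy to construct; a small sketch with arbitrary parameter values:

```python
import numpy as np
from scipy.linalg import toeplitz

p, rho, theta = 10, 0.3, 0.5

# Equal positive correlation: 1 on the diagonal, rho off the diagonal
Sigma_equi = np.full((p, p), rho)
np.fill_diagonal(Sigma_equi, 1.0)

# Markov (Toeplitz) structure: Sigma[j, k] = theta ** |j - k|
Sigma_toep = toeplitz(theta ** np.arange(p))

# e.g. draw a random design with the Toeplitz correlation structure
rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(p), Sigma_toep, size=50)
```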