Statistical Learning with the Lasso, spring 2017

The Lasso

Yeast: understanding basic life functions. p = 11,904 gene values, n = number of experiments ≈ 10. Blomberg et al. 2003, 2010.

fMRI brain scans: function of the brain's language network. p ≈ 3 million voxels, n = number of subjects ≈ 50. Taylor et al. 2006.
The module: the Lasso for linear models, generalized linear models, groups, and smooth functions.

Module literature:
Statistics for High-Dimensional Data: Methods, Theory and Applications, P. Bühlmann and S. van de Geer, Springer 2011.
Computer-Age Statistical Inference, B. Efron and T. Hastie, Cambridge University Press 2016.
Statistical Learning with Sparsity: the Lasso and Generalizations, T. Hastie, R. Tibshirani, and M. Wainwright, CRC Press.
Slides for B&vdG 1-2.7 and E&H 12.1, 12.2 (also good to read E&H 16.1, 16.2). Exercise: B&vdG 2.1. Project: find an interesting data set and do a Lasso analysis on it. Exercises and the project should be done in groups of 2 or 3 persons and presented in week 10.
The linear model (responses Y, covariates X, parameters β, errors ε):

Y_1 = X_1^{(1)} β_1 + X_1^{(2)} β_2 + ... + X_1^{(p)} β_p + ε_1
Y_2 = X_2^{(1)} β_1 + X_2^{(2)} β_2 + ... + X_2^{(p)} β_p + ε_2
...
Y_n = X_n^{(1)} β_1 + X_n^{(2)} β_2 + ... + X_n^{(p)} β_p + ε_n

or in matrix form Y = Xβ + ε.

ols (ordinary least squares) estimate of the parameters:

β̂ = argmin_β Σ_{i=1}^n (Y_i − (X_i^{(1)} β_1 + X_i^{(2)} β_2 + ... + X_i^{(p)} β_p))^2 = argmin_β ||Y − Xβ||_2^2
ols: β̂_ols = argmin_β ||Y − Xβ||_2^2

Can be obtained as follows:

f(β) = ||Y − Xβ||_2^2 = (Y − Xβ)^t (Y − Xβ)
df(β)/dβ = −2 X^t (Y − Xβ) = 0
X^t Y = X^t X β
β̂_ols = (X^t X)^{-1} X^t Y   (if the matrix is invertible)

If p is large then ols often overfits (i.e. adapts too much to the noise and gives unreliable predictions and indications of which covariates are important).
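As a concrete illustration (not from the slides), a minimal numpy sketch of the ols estimate via the normal equations; the simulated data and all variable names are my own choices:

# ols on simulated data: beta_ols solves X^t X beta = X^t Y
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + rng.standard_normal(n)

# solve the normal equations rather than inverting X^t X explicitly
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)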
E&H 16.1, 16.2

E&H 16.1: Forward stepwise regression. The standard statistical way to do regression on many covariates: first do all 1-dimensional regressions and choose the covariate which gives the smallest loss, then do all 2-dimensional regressions and choose the 2 covariates which give the smallest loss, and so on, up to some predetermined dimension M. Choose the regression which is best. Computationally impossible in high dimensions.

Greedy version: instead of doing all 2-dimensional regressions, in step 2 keep the first selected covariate and just add the best one of the remaining covariates, and so on (a sketch follows below). Computationally much easier, but the first version can lead to a better model.

E&H 16.2: Lasso. Read.
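A minimal sketch of the greedy version, assuming squared-error loss; the function name and interface are my own, not from E&H:

# Greedy forward stepwise regression: add one covariate at a time.
import numpy as np

def forward_stepwise(X, Y, M):
    n, p = X.shape
    selected = []
    for _ in range(M):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            # ols fit on the current candidate set
            coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            rss = np.sum((Y - X[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)   # keep the covariate that reduced RSS most
    return selected

Each pass fits at most p small least-squares problems, which is why the greedy version scales to large p while full stepwise search does not.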
B&vdG 2.2

Simplest case:

Y_1 = X_1^{(1)} β_1 + X_1^{(2)} β_2 + ... + X_1^{(p)} β_p + ε_1
Y_2 = X_2^{(1)} β_1 + X_2^{(2)} β_2 + ... + X_2^{(p)} β_p + ε_2
...
Y_n = X_n^{(1)} β_1 + X_n^{(2)} β_2 + ... + X_n^{(p)} β_p + ε_n

with p >> n (a "wide" data set), i.e. Y = Xβ + ε.

Standardize (cheating?) so that Ȳ = Σ_{i=1}^n Y_i / n = 0, X̄^{(j)} = 0, and σ̂_j^2 = Σ_{i=1}^n (X_i^{(j)} − X̄^{(j)})^2 / n = 1, for j = 1, ..., p.
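A minimal sketch of this standardization step (my own helper, not from B&vdG):

# Center Y, center each covariate, and scale each covariate to empirical
# variance 1 (np.std uses the 1/n convention by default, matching the slide).
import numpy as np

def standardize(X, Y):
    Yc = Y - Y.mean()
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0)
    return Xs, Yc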
The Lasso (Least Absolute Shrinkage and Selection Operator)

Lagrange formulation:

β̂(λ) = argmin_β ( ||Y − Xβ||_2^2 / n + λ ||β||_1 )

Same loss as (for R determined by λ and the data: R = ||β̂(λ)||_1)

β̂_primal(R) = argmin_{β: ||β||_1 ≤ R} ||Y − Xβ||_2^2 / n

(Ridge regression: β̂(λ) = argmin_β ( ||Y − Xβ||_2^2 / n + λ ||β||_2^2 ).
Same as (for R determined by λ)
β̂_primal(R) = argmin_{β: ||β||_2 ≤ R} ||Y − Xβ||_2^2 / n.)
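A minimal sketch of the Lagrange formulation in scikit-learn on simulated wide data (the data are my own; note that sklearn minimizes ||Y − Xβ||_2^2 / (2n) + α||β||_1, so its α corresponds to λ/2 in the slides' normalization):

# Lasso in the Lagrange formulation via scikit-learn (illustration only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 200                      # p >> n, a "wide" data set
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]         # sparse truth
Y = X @ beta + rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, Y)    # alpha = lambda/2
print(np.flatnonzero(fit.coef_))    # indices of non-zero estimates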
Ordinary least squares (ols): β̂_ols = argmin_β ||Y − Xβ||_2^2

There are more parameters than observations if p ≥ n, so ols overfits. Estimation & understanding are only possible if the parameters β are sparse or continuous, and to obtain sparse models when p is (very) large one has to select which covariates to use in a good way. One way to do this is to penalize as on the previous slide.
[Figure: contour lines of the residual sum of squares. (Left) the ℓ1-ball constraint corresponding to the Lasso. (Right) the ℓ2-ball corresponding to ridge regression.]
B&vdG 2.3: Soft thresholding

Simplest case:

Y_1 = β_1 + ε_1
Y_2 = β_2 + ε_2
...
Y_n = β_n + ε_n

Then ||Y − β||_2^2 = Σ_{i=1}^n (Y_i − β_i)^2 and ||β||_1 = Σ_{i=1}^n |β_i|, so

||Y − β||_2^2 + λ ||β||_1 = Σ_{i=1}^n ( (Y_i − β_i)^2 + λ |β_i| )

is minimized coordinate-wise by

β̂_j = 0 if |Y_j| ≤ λ/2, and β̂_j = sign(Y_j)(|Y_j| − λ/2) otherwise, i.e. β̂_j = g_{soft,λ/2}(Y_j).
[Figure: the soft thresholding function g_{soft,1}(z).]
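A direct implementation of the soft thresholding function (names mine):

# g_soft(z, c) = sign(z) * max(|z| - c, 0)
import numpy as np

def g_soft(z, c):
    return np.sign(z) * np.maximum(np.abs(z) - c, 0.0)

# In the simplest case Y_i = beta_i + eps_i the Lasso estimate is
# beta_hat_j = g_soft(Y_j, lambda/2):
Y = np.array([-2.0, -0.3, 0.1, 1.5])
print(g_soft(Y, 0.5))   # lambda = 1; small coordinates are set exactly to 0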
Soft thresholding: orthonormal designs

Orthonormal design: X is n × n and X^t X = n I_n.

Ex (designed experiment with 3 factors):
Y = β_1 + X^{(2)} β_2 + X^{(3)} β_3 + X^{(4)} β_4 + ε,
where β_1 is the mean, X^{(2)} = 1 if the first factor is high, X^{(2)} = −1 if the first factor is low, etc., and X is the corresponding ±1 design matrix.
If X^t X = n I_n then

X^t Y / n = X^t X β / n + X^t ε / n = β + ε′,

so we are back in the simple situation, but with Y_j replaced by Z_j = (X^t Y)_j / n, and we get that

β̂_j = g_{soft,λ/2}((X^t Y)_j / n) = g_{soft,λ/2}(Z_j).

(Note that X^t Y / n = (X^t X)^{-1} X^t Y = β̂_ols for an orthonormal design.)

Homework: problem 2.1.
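A numeric sketch with one possible 4×4 ±1 design of the kind the example describes (the slide's own matrix was lost in extraction, so this particular matrix is my assumption):

# Orthonormal design: X^t X = n I, so Z = X^t Y / n plays the role of Y
# in the soft thresholding formula.
import numpy as np

X = np.array([[ 1,  1,  1,  1],
              [ 1,  1, -1, -1],
              [ 1, -1,  1, -1],
              [ 1, -1, -1,  1]], dtype=float)
n = X.shape[0]
assert np.allclose(X.T @ X, n * np.eye(n))      # orthonormal design

rng = np.random.default_rng(2)
Y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

lam = 0.5
Z = X.T @ Y / n                                  # equals beta_ols here
beta_hat = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2, 0.0)
print(beta_hat)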
B&vdG 2.4: Prediction

The prediction problem is to find f(x) = E(Y | X = x) = Σ_{j=1}^p x^{(j)} β_j, and hence is closely related to estimation of β. If we use the Lasso for estimation we get the estimated predictor

f̂_λ(x) = x^t β̂(λ) = Σ_{j=1}^p x^{(j)} β̂_j(λ).

But how should one choose λ?
E&H 12.1, 12.2: (10-fold) cross-validation

1: split (somehow) the data into 10 approximately equally sized groups.
2: for each Y_i, use the Lasso to estimate β from the 9 groups which do not contain Y_i; call this estimate β̂^{−i}(λ), and use it to compute the predictor
f̂_λ^{−i}(x_i) = Σ_{j=1}^p x_i^{(j)} β̂_j^{−i}(λ).
3: compute the cross-validated mean square prediction error
ê(λ) = Σ_{i=1}^n (Y_i − f̂_λ^{−i}(x_i))^2
(or use some other appropriate error measure).
4: repeat this for a grid of λ-values, and choose the λ which gives the smallest ê(λ).
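A minimal sketch of this recipe with scikit-learn (helper name mine; sklearn's alpha corresponds to λ/2 in the slides' normalization):

# 10-fold cross-validated prediction error for one penalty value.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_error(X, Y, alpha, n_splits=10, seed=0):
    err = 0.0
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        fit = Lasso(alpha=alpha).fit(X[train], Y[train])
        err += np.sum((Y[test] - fit.predict(X[test])) ** 2)
    return err

# step 4: evaluate over a grid and keep the minimizer, e.g.
# best_alpha = min(alpha_grid, key=lambda a: cv_error(X, Y, a))

scikit-learn's LassoCV automates essentially this loop.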
B&vdG 2.4.1.1

Lymph node status of cancer patients: Y = 0 or 1
n = 49 breast cancer tumor samples
p = 7129 gene expression measurements per tumor sample

This doesn't follow the linear model, but one can still use the (cross-validated) Lasso to compute f̂_λ(x) = Σ_{j=1}^p x^{(j)} β̂_j(λ) and then use it for classification: classify the status of an observation with design vector x as 1 if f̂_λ(x) > 1/2 and as 0 if f̂_λ(x) ≤ 1/2.

Compared with another method, also through cross-validation: randomly divide the sample into 2/3 training data and 1/3 test data, do the entire model fitting on the training data, count misclassifications on the test data, repeat 100 times, and compute average misclassification percentages: Lasso 21%, alternative 35%. The average number of non-zero β̂-s for the Lasso was 13.
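A sketch of this evaluation protocol on synthetic 0/1 data (the real gene-expression data are not reproduced here; for brevity the penalty is fixed, whereas the slide tunes λ by cross-validation on each training set):

# 2/3 train, 1/3 test, threshold the Lasso predictor at 1/2, repeat, average.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 49, 200
X = rng.standard_normal((n, p))
Y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(float)

rates = []
for rep in range(100):
    Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=1/3, random_state=rep)
    f = Lasso(alpha=0.05).fit(Xtr, Ytr)
    pred = (f.predict(Xte) > 0.5).astype(float)
    rates.append(np.mean(pred != Yte))
print(np.mean(rates))   # average misclassification rate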
Choice of estimation method

How does one choose between different estimation methods: Lasso, ridge regression, ols, ...? Sometimes, e.g. in signal processing, one can try out the methods in situations where one can see how well they work (but still, it's impossible to try all methods which have been suggested).

- Understanding
- Simulation
- Asymptotics
B&vdG 2.4.2: Asymptotics

In basic statistics asymptotics, the model is kept fixed and n, the number of observations, tends to infinity (and one then assumes that in practical situations with large n, distributions are well described by the asymptotic distribution). This is not possible for p ≥ n, since if the model (and thus p, the number of parameters) is kept fixed, then as n gets large one gets that p < n. One instead has to use triangular arrays, where p = p_n tends to infinity at the same time as n tends to infinity.
Triangular arrays

In asymptotic statements B&vdG typically assume that p = p_n, X = X_n = (X_{n;i}^{(j)}), and β = β_n = (β_{n;j}) all change as n → ∞, typically with p_n growing faster than n, i.e. with p_n / n → ∞ as n → ∞. Thus one considers a sequence of models

Y_n = X_n β_n + ε_n,

or in more detail

Y_{n;i} = Σ_{j=1}^{p_n} X_{n;i}^{(j)} β_{n;j} + ε_i,   for i = 1, ..., n,

and derives asymptotic inequalities and limit distributions for such triangular arrays, for n → ∞.
Prediction consistency

(β̂_n(λ_n) − β_n^0)^t Σ̂_{X_n} (β̂_n(λ_n) − β_n^0) = o_p(1),

for Σ̂_{X_n} = X_n^t X_n / n for a fixed design, and Σ̂_{X_n} equal to the estimated covariance matrix of the covariate vector X_n for a random design with the covariate vectors in the different rows i.i.d.

(Z_n = o_p(1) means that P(|Z_n| > ε) → 0 as n → ∞, for any ε > 0, i.e. that Z_n tends to zero in probability.)

Conditions needed to obtain prediction consistency include ||β_n^0||_1 = o(√(n / log p_n)), K^{-1} √(log p_n / n) ≤ λ_n ≤ K √(log p_n / n) for some constant K, and that the ε_{n;i}, i = 1, ..., n are i.i.d. with finite variance.
The quantity (β̂_n(λ_n) − β_n^0)^t Σ̂_{X_n} (β̂_n(λ_n) − β_n^0) equals

||X_n(β̂_n(λ_n) − β_n^0)||_2^2 / n for a fixed design, and
E[(X_{n;new}(β̂_n(λ_n) − β_n^0))^2] for a random design,

where X_n(β̂_n(λ_n) − β_n^0) is the difference between the estimated and the true predictor (= regression function).
Oracle inequality

E||X_n(β̂_n(λ_n) − β_n^0)||_2^2 / n = O(s_{n;0} log p_n / (n φ_n^2)),

where s_{n;0} = card(S_{n;0}) = |S_{n;0}| is the number of elements in the set S_{n;0} = {j: β_{n;j}^0 ≠ 0, j = 1, ..., p_n} of active variables, and φ_n^2 is a compatibility constant determined by the design matrix X_n. This roughly means that using the estimated predictor instead of the true one adds a (random) error of a size only slightly larger than s_{n;0}/n. Requires further conditions.
B&vdG 2.5: Estimation consistency

||β̂_n(λ_n) − β_n^0||_q = o_P(1), for q = 1 or 2.

Similar conditions as for prediction consistency.
Variable screening

Ŝ_n(λ) = {j: β̂_{n;j}(λ) ≠ 0, j = 1, ..., p_n}

Typically the Lasso solutions β̂_n(λ) are not unique (e.g. they are not unique if p > n), but Ŝ_n(λ) still is unique (Lemma 2.1). Further, |Ŝ_n(λ)| ≤ min(n, p_n).

S_{0;n} = {j: β_{n;j}^0 ≠ 0, j = 1, ..., p_n} is the true set of relevant variables. Asymptotically, under suitable conditions, Ŝ_n(λ) will contain the set S_{0;n}^{relevant(C)} of relevant variables (those whose coefficients are not too small), but often, depending on the value of λ, it will contain many more variables. For the choice of λ, read B&vdG 2.5.1.
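A sketch of how the selected set Ŝ(λ) grows as λ decreases, using scikit-learn's lasso_path on simulated data (data and names mine; sklearn's alphas correspond to λ/2 in the slides' normalization):

# |S-hat(lambda)| along the Lasso regularization path.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, p = 100, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
Y = X @ beta + rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, Y)      # coefs has shape (p, n_alphas)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha = {a:.3f}: |S-hat| = {np.count_nonzero(c)}")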
B&vdG 2.5.2: Motif regression for DNA binding sites

Y_i = binding intensity of HIF1α in coarse DNA segment i
X_i^{(j)} = abundance score of candidate motif j in DNA segment i
i = 1, ..., n = 287; j = 1, ..., p = 195

Y_i = μ + Σ_{j=1}^{195} X_i^{(j)} β_j + ε_i,   for i = 1, ..., 287.

Scale the covariates X^{(j)} to the same empirical variance, subtract Ȳ, and do the Lasso with λ = λ̂_CV chosen by 10-fold cross-validation for optimal prediction. This gives |Ŝ(λ̂_CV)| = 26 non-zero β̂(λ̂_CV) estimates. B&vdG believe these include the true active variables.
B&vdG 2.6: Variable selection

β̂(λ) = argmin_β ( ||Y − Xβ||_2^2 / n + λ ||β||_0 ),   where ||β||_0 = Σ_{j=1}^p 1(β_j ≠ 0).

This is not convex: the naive solution would be to go through all possible sub-models (i.e. models with some of the β-s set to 0), compute the ols for each of these models, and then choose the model which minimizes the right-hand side. However, there are 2^p sub-models, and if p is not small, this isn't feasible. More efficient algorithms exist, but it is still a difficult/infeasible problem if p is not small.
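To make the 2^p point concrete, a brute-force sketch of the naive solution (function name mine; only feasible for tiny p):

# l0-penalized least squares by exhaustive search over all 2^p sub-models.
import itertools
import numpy as np

def best_subset(X, Y, lam):
    n, p = X.shape
    best, best_val = (), np.sum(Y ** 2) / n          # the empty model
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            coef, *_ = np.linalg.lstsq(X[:, list(S)], Y, rcond=None)
            val = np.sum((Y - X[:, list(S)] @ coef) ** 2) / n + lam * k
            if val < best_val:
                best, best_val = S, val
    return best

Already at p = 30 this loop visits about 10^9 sub-models, which is the infeasibility the slide refers to.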
One possibility is to first use (some variant of) the Lasso to select a set of variables which with high probability includes the set of active variables, and then do the minimization of the ℓ0-penalized mean square error only among these variables. This typically reduces the number of operations needed for the minimization to O(np min(n, p)). However, it requires that the Lasso in fact selects all active variables/parameters. Asymptotic conditions which ensure this include that

inf_{j ∈ S_0} |β_j| >> √(s_0 log p / n)

and that the neighbourhood stability or irrepresentable condition holds.
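A sketch of this two-stage idea on simulated data (names and data mine; the subset size is capped to keep the illustration fast, and the ℓ0 penalty weight lam is a separate tuning parameter from the Lasso's λ):

# Stage 1: cross-validated Lasso screening. Stage 2: exhaustive l0 search
# restricted to the screened variables.
import itertools
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
Y = X @ beta + rng.standard_normal(n)

screened = np.flatnonzero(LassoCV(cv=10).fit(X, Y).coef_)

lam, best, best_val = 0.5, (), np.sum(Y ** 2) / n
for k in range(1, min(len(screened), 4) + 1):        # cap for illustration
    for S in itertools.combinations(screened, k):
        coef, *_ = np.linalg.lstsq(X[:, list(S)], Y, rcond=None)
        val = np.sum((Y - X[:, list(S)] @ coef) ** 2) / n + lam * k
        if val < best_val:
            best, best_val = S, val
print(best)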
Examples of sufficient conditions for neighbourhood stability (recall that Σ = X^t X / n for a fixed design and that Σ is the correlation matrix of X for a random design):

Equal positive correlation: Σ_{j,j} = 1 for all j, and Σ_{j,k} = ρ > 0 for all j ≠ k, for some ρ < 1.

Markov (or Toeplitz) structure: Σ_{j,k} = θ^{|j−k|}, with |θ| < 1.

Bounded pairwise correlation: s_0 max_{j ∉ S_0, k ∈ S_0} Σ_{j,k}^2 / Λ_min^2(Σ_{1,1}) < 1/2, where Λ_min(Σ_{1,1}) is the smallest eigenvalue of the covariance matrix Σ_{1,1} of the active variables.
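A numeric sketch checking one common form of the irrepresentable condition, ||Σ_{S_0^c,S_0} Σ_{S_0,S_0}^{-1} sign(β_{S_0})||_∞ < 1 (this form is from the wider literature, e.g. Zhao & Yu 2006, not stated on the slide), for the equal-positive-correlation design:

# Irrepresentable condition check for an equal-correlation design (sketch).
import numpy as np

p, s0, rho = 50, 5, 0.4
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)
S0 = np.arange(s0)                    # active set
Sc = np.arange(s0, p)                 # inactive set
signs = np.ones(s0)                   # assumed sign pattern of the active betas

val = np.abs(Sigma[np.ix_(Sc, S0)] @
             np.linalg.solve(Sigma[np.ix_(S0, S0)], signs)).max()
print(val, val < 1)                   # here approx. 0.77 < 1: condition holds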