Chapter 3. Linear Models for Regression

Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475
© Wei Pan

Linear Model and Least Squares
Data: (Y_i, X_i), X_i = (X_i1, ..., X_ip), i = 1, ..., n; Y_i continuous.
LM: Y_i = β_0 + Σ_{j=1}^p X_ij β_j + ɛ_i, with the ɛ_i's iid, E(ɛ_i) = 0 and Var(ɛ_i) = σ^2.
RSS(β) = Σ_{i=1}^n (Y_i − β_0 − Σ_{j=1}^p X_ij β_j)^2 = ||Y − Xβ||_2^2.
LSE (OLSE): ˆβ = arg min_β RSS(β) = (X'X)^{-1} X'Y.
Nice properties: under the true model, E(ˆβ) = β, Var(ˆβ) = σ^2 (X'X)^{-1}, ˆβ ~ N(β, Var(ˆβ));
Gauss-Markov Theorem: ˆβ has minimum variance among all linear unbiased estimates.
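A minimal sketch in base R of the LSE and its estimated variance on simulated data, checked against lm(); the variable names here are illustrative, not from the course code.

# Minimal sketch (base R, simulated data): the LSE (X'X)^{-1} X'Y "by hand",
# its estimated variance, and a check against lm().
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))

Xd <- cbind(1, X)                               # design matrix with intercept
betahat <- solve(t(Xd) %*% Xd, t(Xd) %*% y)     # (X'X)^{-1} X'Y
rss <- sum((y - Xd %*% betahat)^2)
sigma2hat <- rss / (n - p - 1)                  # RSS(betahat) / (n - p - 1)
varbetahat <- sigma2hat * solve(t(Xd) %*% Xd)   # estimated Var(betahat)

cbind(byhand = drop(betahat), lm = coef(lm(y ~ X)))   # the two should agree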

Some questions:
ˆσ^2 = RSS(ˆβ)/(n − p − 1). Q: what happens if the denominator is n instead?
Q: what happens if X'X is (nearly) singular? What if p is large relative to n?
Variable selection: forward, backward, stepwise: fast, but may miss good submodels; best-subset: too time-consuming.
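A minimal sketch of forward/backward/stepwise selection with base R's step() (AIC-based) on simulated data; purely illustrative.

# Minimal sketch (base R): forward, backward and stepwise selection via step().
set.seed(7)
n <- 100
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

full <- lm(y ~ ., data = dat)   # all candidate predictors
null <- lm(y ~ 1, data = dat)   # intercept only
step(null, scope = formula(full), direction = "forward", trace = 0)   # forward
step(full, direction = "backward", trace = 0)                         # backward
step(null, scope = formula(full), direction = "both", trace = 0)      # stepwise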

[Figure 3.6 (ESL 2nd ed., Chap 3): Comparison of four subset-selection techniques (best subset, forward stepwise, backward stepwise, forward stagewise) on a simulated linear regression problem Y = X'β + ε, plotting E||ˆβ(k) − β||^2 against subset size k. There are N = 300 observations on p = 31 standard Gaussian variables, with pairwise correlations all equal to 0.85. For 10 of the variables the coefficients are drawn at random from a N(0, 0.4) distribution; the rest are zero.]

Shrinkage or regularization methods
Use a regularized or penalized RSS: PRSS(β) = RSS(β) + λ J(β).
λ: penalization parameter to be determined (think of the p-value threshold in stepwise selection, or the subset size in best-subset selection).
J(·): the penalty; it has both a loose and a Bayesian interpretation (as a negative log prior density).
Ridge: J(β) = Σ_{j=1}^p β_j^2; prior: β_j ~ N(0, τ^2).
ˆβ_R = (X'X + λI)^{-1} X'Y.
Properties: biased but with smaller variances:
E(ˆβ_R) = (X'X + λI)^{-1} X'X β,
Var(ˆβ_R) = σ^2 (X'X + λI)^{-1} X'X (X'X + λI)^{-1} ≤ Var(ˆβ),
df(λ) = tr[X (X'X + λI)^{-1} X'] ≤ df(0) = tr[X (X'X)^{-1} X'] = tr[(X'X)^{-1} X'X] = p.
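A minimal sketch in base R of the ridge closed form and its effective degrees of freedom df(λ), assuming centered and scaled simulated data.

# Minimal sketch (base R): ridge estimate (X'X + lambda I)^{-1} X'Y and df(lambda).
set.seed(8)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n)); y <- y - mean(y)

ridge_beta <- function(X, y, lambda)
  drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))          # closed form
df_ridge <- function(X, lambda)
  sum(diag(X %*% solve(t(X) %*% X + lambda * diag(ncol(X)), t(X))))     # tr[X (X'X + lambda I)^{-1} X']

cbind(lambda0 = ridge_beta(X, y, 0), lambda10 = ridge_beta(X, y, 10))   # coefficients shrink
c(df_ridge(X, 0), df_ridge(X, 10))                                      # df: p down toward 0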

Lasso: J(β) = Σ_{j=1}^p |β_j|. Prior: β_j ~ Laplace, or DE(0, τ^2); no closed form for ˆβ_L in general.
Properties: biased but with smaller variances; df(ˆβ_L) = # of non-zero ˆβ_j^L's (Zou et al.).
Special case: for X'X = I, or simple regression (p = 1),
ˆβ_j^L = ST(ˆβ_j, λ) = sign(ˆβ_j)(|ˆβ_j| − λ)_+,
compared to: ˆβ_j^R = ˆβ_j/(1 + λ), and ˆβ_j^B = HT(ˆβ_j, M) = ˆβ_j I(rank(|ˆβ_j|) ≤ M).
A key property of the Lasso: ˆβ_j^L = 0 for large λ, but not ˆβ_j^R: simultaneous parameter estimation and selection.
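A minimal sketch in base R comparing the three rules in the orthonormal special case: soft-thresholding (Lasso), ridge scaling, and hard thresholding (best subset).

# Minimal sketch: the three shrinkage rules when X'X = I.
soft  <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)   # lasso: ST
ridge <- function(b, lambda) b / (1 + lambda)                     # ridge scaling
hard  <- function(b, M) b * (rank(-abs(b)) <= M)                  # best subset: keep M largest |b_j|

b <- c(3, -1.5, 0.8, -0.3, 0.1)   # pretend these are the OLS estimates
soft(b, lambda = 1)               # some coefficients set exactly to 0
ridge(b, lambda = 1)              # all shrunk toward 0, none exactly 0
hard(b, M = 2)                    # keep only the 2 largest in magnitude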

Note: for a convex J(β) (as for the Lasso and Ridge), minimizing PRSS is equivalent to:
min RSS(β) s.t. J(β) ≤ t.
This offers an intuitive explanation of why we can have ˆβ_j^L = 0; see Fig 3.11.
Theory: |β_j| is singular at 0; Fan and Li (2001).
How to choose λ? Obtain a solution path ˆβ(λ), then, as before, use tuning data or CV or a model selection criterion (e.g. AIC or BIC).
Example: R code ex3.1.r
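A minimal sketch of choosing λ by cross-validation with the glmnet package on simulated data; illustrative only, not the course's ex3.1.r.

# Minimal sketch: lasso solution path and CV choice of lambda with glmnet.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))

fit   <- glmnet(X, y)                  # lasso path over a grid of lambda values
cvfit <- cv.glmnet(X, y, nfolds = 10)  # 10-fold CV over the same grid
plot(fit, xvar = "lambda"); plot(cvfit)
coef(cvfit, s = "lambda.min")          # coefficients at the CV-chosen lambda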

[Figure 3.11 (ESL 2nd ed., Chap 3): Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β_1| + |β_2| ≤ t and β_1^2 + β_2^2 ≤ t^2, respectively, while the red ellipses are the contours of the RSS.]

[Figure (ESL 2nd ed., Chap 3): Profiles of the ridge regression coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), plotted against the effective degrees of freedom df(λ).]

[Figure (ESL 2nd ed., Chap 3): Profiles of the lasso coefficients for the prostate cancer example (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45), plotted against the shrinkage factor s.]

Lasso: biased estimates; alternatives:
Relaxed lasso: 1) use the Lasso for variable selection; 2) then use the LSE or MLE on the selected model.
Use a non-convex penalty:
SCAD: eq (3.82) on p.92;
Bridge: J(β) = Σ_j |β_j|^q with 0 < q < 1;
Adaptive Lasso (Zou 2006): J(β) = Σ_j |β_j| / |ˆβ_j,0|, with ˆβ_j,0 an initial estimate (e.g. the LSE);
Truncated Lasso Penalty (Shen, Pan & Zhu 2012, JASA): J(β; τ) = Σ_j min(|β_j|, τ), or J(β; τ) = Σ_j min(|β_j|/τ, 1).
Choice between Lasso and Ridge: bet on a sparse model? e.g. risk prediction for GWAS (Austin, Pan & Shen 2013, SADM).
Elastic net (Zou & Hastie 2005): J(β) = Σ_j [α|β_j| + (1 − α)β_j^2]; may select correlated X_j's together.
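A minimal sketch of the elastic net via glmnet's alpha argument (α = 1: Lasso; α = 0: Ridge) on simulated data with two highly correlated relevant predictors; MASS::mvrnorm is used only to simulate them.

# Minimal sketch: elastic net tends to keep both correlated relevant predictors.
library(glmnet)
library(MASS)
set.seed(2)
n <- 100
Sigma <- matrix(0.9, 2, 2); diag(Sigma) <- 1        # two predictors with correlation 0.9
X <- cbind(mvrnorm(n, c(0, 0), Sigma), matrix(rnorm(n * 8), n, 8))
y <- drop(X[, 1] + X[, 2] + rnorm(n))

cv_en <- cv.glmnet(X, y, alpha = 0.5)               # elastic net (mix of L1 and L2)
coef(cv_en, s = "lambda.min")
# (SCAD/MCP fits are available from the ncvreg package listed later.)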

[Figure 3.20 (ESL 2nd ed., Chap 3): panels |β|, SCAD and |β|^{1−ν}. Caption: The lasso and two alternative non-convex penalties designed to penalize large coefficients less. For SCAD we use λ = 1 and a = 4, and ν = 1/2 in the last panel.]

Group Lasso: a group of variables are to be set to 0 (or not) at the same time:
J(β) = ||β||_2 applied to each group of coefficients, i.e. use the L_2-norm, not the L_1-norm (Lasso) or the squared L_2-norm (Ridge);
better for variable selection (but worse for parameter estimation?); see the block soft-thresholding sketch below.
Grouping/fusion penalties: encourage equalities between the β_j's (or the |β_j|'s).
Fused Lasso: J(β) = Σ_{j=1}^{p−1} |β_j − β_{j+1}|, or J(β) = Σ_{j<k} |β_j − β_k|.
Ridge penalty: groups implicitly; why?
(8000) Grouping pursuit (Shen & Huang 2010, JASA): J(β; τ) = Σ_{j=1}^{p−1} TLP(β_j − β_{j+1}; τ).
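A minimal sketch in base R of the group-lasso analogue of soft-thresholding (block soft-thresholding), which zeroes out or keeps a whole coefficient block jointly; the function name and data are illustrative.

# Minimal sketch: block soft-thresholding, the group-lasso update for one group
# (with an orthonormal within-group design).
group_soft <- function(b, lambda) {
  nb <- sqrt(sum(b^2))                        # L2 norm of the group's coefficient block
  if (nb <= lambda) rep(0, length(b)) else (1 - lambda / nb) * b
}
group_soft(c(0.3, -0.2, 0.1), lambda = 1)     # whole group set to 0 jointly
group_soft(c(3, -2, 1), lambda = 1)           # whole group shrunk but kept nonzero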

Grouping penalties:
(8000) Zhu, Shen & Pan (2013, JASA):
J_2(β; τ) = Σ_{j=1}^{p−1} TLP(|β_j| − |β_{j+1}|; τ);
J(β; τ_1, τ_2) = Σ_{j=1}^p TLP(β_j; τ_1) + J_2(β; τ_2).
(8000) Kim, Pan & Shen (2013, Biometrics):
J_2(β) = Σ_{j<k} |I(β_j ≠ 0) − I(β_k ≠ 0)|;
J_2(β; τ) = Σ_{j<k} |TLP(β_j; τ) − TLP(β_k; τ)|.
(8000) Dantzig Selector (§3.8).
(8000) Theory (§3.8.5); Greenshtein & Ritov (2004) (persistence); Zou 2006 (non-consistency)...
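A minimal sketch in base R plotting the TLP, min(|β_j|, τ), against the Lasso penalty it truncates.

# Minimal sketch: the truncated lasso penalty (TLP) for a single coefficient.
tlp <- function(b, tau) pmin(abs(b), tau)
b <- seq(-4, 4, by = 0.01)
plot(b, abs(b), type = "l", ylab = "penalty")   # lasso: |b|
lines(b, tlp(b, tau = 1), lty = 2)              # TLP: min(|b|, tau), flat beyond tau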

R packages for penalized GLMs (and the Cox PHM):
glmnet: Ridge, Lasso and Elastic net;
ncvreg: SCAD, MCP;
TLP: https://github.com/chongwu-biostat/glmtlp (vignette: http://www.tc.umn.edu/~wuxx0845/glmtlp);
FGSG: grouping/fusion penalties (based on Lasso, TLP, etc.) for LMs;
more general convex programming: the Matlab CVX package.

(8000) Computational Algorithms for the Lasso
Quadratic programming: the original approach; slow.
LARS (§3.8): the solution path is piecewise linear and is obtained at about the cost of fitting a single LM; not general?
Incremental Forward Stagewise Regression (§3.8): approximate; related to boosting.
A simple (and general) way: write |β_j| = β_j^2 / |β_j|, plug in a current estimate |ˆβ_j^(r)| for the denominator (a ridge-like reweighting), and truncate ˆβ_j^(r) ≈ 0 at a small ɛ.
Coordinate-descent algorithm (§3.8.6): update each β_j while fixing the others at their current estimates; recall we have a closed-form solution for a single β_j! Simple and general, but not applicable to grouping penalties; see the sketch below.
ADMM (Boyd et al. 2011): http://stanford.edu/~boyd/admm.html
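A minimal sketch in base R of coordinate descent for the Lasso, cycling through the closed-form single-coordinate update; illustrative only (glmnet's scaling of λ differs).

# Minimal sketch: coordinate descent for (1/(2n))||y - X beta||^2 + lambda * sum |beta_j|.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  y <- y - mean(y); X <- scale(X, scale = FALSE)     # center Y and the X_j's
  xsq <- colSums(X^2) / n
  beta <- rep(0, p)
  r <- y                                             # current residual y - X beta
  for (it in 1:n_iter) {
    for (j in 1:p) {
      zj <- sum(X[, j] * r) / n + xsq[j] * beta[j]   # (1/n) X_j' (partial residual)
      bj <- soft(zj, lambda) / xsq[j]                # closed-form single-coordinate solution
      r <- r - X[, j] * (bj - beta[j])               # update the residual
      beta[j] <- bj
    }
  }
  beta
}

set.seed(3)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X[, 1:2] %*% c(3, -2) + rnorm(100))
round(lasso_cd(X, y, lambda = 0.5), 3)               # many coefficients are exactly 0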

Sure Independence Screening (SIS)
Q: penalized (or stepwise, ...) regression can do automatic variable selection; why not just do it?
Key: there is a cost/limit in performance/speed/theory.
Q2: some methods (e.g. LDA/QDA/RDA) have no built-in variable selection; then what?
Going back to basics: first conduct marginal variable selection (see the sketch below):
1) fit Y ~ X_1, Y ~ X_2, ..., Y ~ X_p marginally;
2) choose a few top ones, say p_1 of them; p_1 can be chosen somewhat arbitrarily, or treated as a tuning parameter;
3) then apply penalized regression (or another variable selection method) to the selected p_1 variables.
This is SIS, which comes with theory (Fan & Lv, 2008, JRSS-B). R package SIS; iterative SIS (ISIS); why? a limitation of SIS...
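A minimal sketch in base R of marginal screening by absolute correlation with Y, followed by the Lasso on the retained variables (glmnet assumed available); this mimics the idea of SIS, not the SIS package itself.

# Minimal sketch: screen to the top p1 marginal correlations, then penalized regression.
library(glmnet)
set.seed(4)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))

score <- abs(cor(X, y))                      # marginal association of each X_j with Y
p1 <- 50                                     # number of top variables kept (a tuning choice)
keep <- order(score, decreasing = TRUE)[1:p1]

cvfit <- cv.glmnet(X[, keep], y)             # lasso on the screened set
coef(cvfit, s = "lambda.min")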

Using Derived Input Directions
PCR (principal components regression): run PCA on X, then use the first few PCs as the predictors.
Use a few top PCs explaining a majority (e.g. 85% or 95%) of the total variance; the number of components is a tuning parameter: use (genuine) CV.
Used in genetic association studies, even for p < n, to improve power.
+: simple; -: the PCs may not be related to Y.
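A minimal sketch in base R of PCR with prcomp() and lm(), keeping the smallest number of PCs explaining at least 85% of the total variance.

# Minimal sketch: PCA on X, then regress Y on the leading PC scores.
set.seed(5)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1] - X[, 2] + rnorm(n))

pc <- prcomp(X, scale. = TRUE)
cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
k <- which(cumvar >= 0.85)[1]              # smallest k explaining >= 85% of total variance
Z <- pc$x[, 1:k, drop = FALSE]             # first k PC scores as derived inputs
fit_pcr <- lm(y ~ Z)
summary(fit_pcr)$r.squared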

Partial least squares (PLS): multiple versions; see Alg 3.3. Main idea:
1) regress Y on each X_j univariately to obtain coefficient estimates φ_1j;
2) the first component is Z_1 = Σ_j φ_1j X_j;
3) regress each X_j on Z_1 and use the residuals as the new X_j;
4) repeat the above process to obtain Z_2, ...;
5) regress Y on Z_1, Z_2, ...
Choice of the number of components: tuning data or CV (or AIC/BIC?).
Contrast PCR and PLS: PCA: max_α Var(Xα) s.t. ...; PLS: max_α Cov(Y, Xα) s.t. ...
Continuum regression (Stone & Brooks 1990, JRSS-B).
Penalized PCA (...) and penalized PLS (Huang et al 2004, BI; Chun & Keles 2012, JRSS-B; R packages ppls, spls).
Example code: ex3.2.r
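A minimal sketch in base R of the first two PLS components along the lines of the algorithm above, on simulated data; illustrative only, not the course's ex3.2.r.

# Minimal sketch: build PLS directions by univariate regressions, then orthogonalize.
set.seed(6)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1] + 0.5 * X[, 2] + rnorm(n)); y <- y - mean(y)

pls_component <- function(X, y) {
  phi <- drop(crossprod(X, y))          # univariate regression coefficients (X standardized)
  z <- drop(X %*% phi)                  # derived direction Z = sum_j phi_j X_j
  Xres <- X - outer(z, drop(crossprod(X, z)) / sum(z^2))   # residualize each X_j on Z
  list(z = z, Xres = Xres)
}

c1 <- pls_component(X, y)
c2 <- pls_component(c1$Xres, y)
coef(lm(y ~ c1$z + c2$z))               # regress Y on Z_1, Z_2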

[Figure 3.7 (ESL 2nd ed., Chap 3): Estimated prediction error (CV error) curves and their standard errors for the various selection and shrinkage methods on the prostate data: all subsets (vs subset size), ridge regression (vs degrees of freedom), lasso (vs shrinkage factor s), principal components regression and partial least squares (vs number of directions).]