Statistics 203: Introduction to Regression and Analysis of Variance
Penalized models
Jonathan Taylor

Today's class

Bias-variance tradeoff. Penalized regression. Cross-validation.

Bias-variance tradeoff

Arguably, the goal of a regression analysis is to build a model that predicts well. We saw in model selection that $C_p$ and AIC were trying to estimate the MSE of each model, which includes a bias contribution. One way to measure this performance is the population mean squared error of the model $M$:
$$\mathrm{MSE}_{\mathrm{pop}}(M) = E\left(Y_{\mathrm{new}} - \Bigl(\widehat{\beta}_0 + \sum_{j=1}^{p-1} \widehat{\beta}_j X_{\mathrm{new},j}\Bigr)\right)^2 = \mathrm{Var}\left(Y_{\mathrm{new}} - \Bigl(\widehat{\beta}_0 + \sum_{j=1}^{p-1} \widehat{\beta}_j X_{\mathrm{new},j}\Bigr)\right) + \mathrm{Bias}(\widehat{\beta})^2.$$
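
As a concrete illustration, $\mathrm{MSE}_{\mathrm{pop}}$ can be approximated by simulation: fit the model once, then average squared prediction errors over many fresh draws of $(X_{\mathrm{new}}, Y_{\mathrm{new}})$. The following is a minimal Monte Carlo sketch of that idea; the true coefficients, noise level, and design are assumptions made up for illustration.

```python
# A minimal Monte Carlo sketch of MSE_pop: fit a linear model once, then average
# squared prediction errors over many fresh (X_new, Y_new) draws.  The true
# coefficients, noise level, and design below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4                                       # k = p - 1 predictors
beta_true = np.array([1.0, 2.0, 0.0, 0.0, -1.5])   # intercept + k slopes (assumed)
sigma = 1.0

def simulate(m):
    X = rng.normal(size=(m, k))
    Y = beta_true[0] + X @ beta_true[1:] + sigma * rng.normal(size=m)
    return X, Y

# Fit ordinary least squares (with an intercept) on one training sample.
X, Y = simulate(n)
design = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(design, Y, rcond=None)[0]

# Averaging squared prediction error over many new observations approximates MSE_pop.
X_new, Y_new = simulate(200_000)
pred = beta_hat[0] + X_new @ beta_hat[1:]
print("estimated MSE_pop:", np.mean((Y_new - pred) ** 2))
```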

Bias-variance tradeoff

In choosing a model automatically, even if the full model is correct (unbiased), our resulting model may be biased, a fact we have ignored so far. Inference ($F$, $\chi^2$ tests, etc.) is not quite exact for biased models. Sometimes it is possible to find a model with lower MSE than an unbiased model! This is called the bias-variance tradeoff. It is generic in statistics: almost always, introducing some bias yields a decrease in MSE, followed by a later increase.

Bias-Variance Tradeoff

[Figure: Bias^2, Variance, and MSE plotted as a function of the amount of shrinkage.]
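
For a concrete version of these curves, consider a shrinkage estimator of the form $\widehat{\mu} = (1-s)\bar{Y}$ (the same family as the $K_C\bar{Y}$ estimator derived on the next slide). The sketch below computes its squared bias, variance, and MSE as functions of the shrinkage amount $s$, using assumed values of $\mu$, $\sigma^2$, and $n$; only the shape of the curves matters.

```python
# A sketch reproducing the curves in the figure, for a shrinkage estimator of the
# form mu_hat = (1 - s) * Ybar.  The true mean mu, variance sigma2, and sample
# size n are assumed values chosen for illustration.
import numpy as np

mu, sigma2, n = 1.0, 5.0, 10
s = np.linspace(0.0, 0.5, 101)          # amount of shrinkage toward zero

bias2 = (s * mu) ** 2                   # Bias^2 of (1 - s) * Ybar
var = (1 - s) ** 2 * sigma2 / n         # Variance of (1 - s) * Ybar
mse = bias2 + var                       # MSE = Variance + Bias^2

print("MSE is minimized at shrinkage s =", round(float(s[np.argmin(mse)]), 2))
```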

Shrinkage & Penalties

Shrinkage can be thought of as constrained minimization. Minimize
$$\sum_{i=1}^n (Y_i - \mu)^2 \quad \text{subject to } \mu^2 \le C.$$
Lagrange: equivalent to minimizing
$$\sum_{i=1}^n (Y_i - \mu)^2 + \lambda_C\, \mu^2.$$
Differentiating:
$$-2 \sum_{i=1}^n (Y_i - \widehat{\mu}_C) + 2 \lambda_C\, \widehat{\mu}_C = 0.$$
Finally,
$$\widehat{\mu}_C = \frac{\sum_{i=1}^n Y_i}{n + \lambda_C} = K_C\, \bar{Y}, \qquad K_C < 1.$$
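
A quick numerical sanity check of this closed form, with arbitrary illustrative data and penalty: minimizing the penalized sum of squares over a fine grid recovers $\sum_i Y_i/(n+\lambda_C)$.

```python
# A quick numerical check of the closed form above: the minimizer of
# sum_i (Y_i - mu)^2 + lambda * mu^2 equals sum_i Y_i / (n + lambda).
# The data and the value of lambda are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(loc=2.0, scale=1.0, size=25)
lam = 10.0

grid = np.linspace(-5.0, 5.0, 200_001)
objective = ((Y[None, :] - grid[:, None]) ** 2).sum(axis=1) + lam * grid ** 2
mu_grid = grid[np.argmin(objective)]
mu_closed = Y.sum() / (len(Y) + lam)
print(mu_grid, mu_closed)               # agree up to the grid resolution
```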

Shrinkage & Penalties

The precise form of $\lambda_C$ is unimportant: as $\lambda_C \to 0$, $\widehat{\mu}_C \to \bar{Y}$, and as $\lambda_C \to \infty$, $\widehat{\mu}_C \to 0$.

Penalties & Priors

Minimizing
$$\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2$$
is similar to computing the MLE of $\mu$ if the likelihood were proportional to
$$\exp\left(-\frac{1}{2\sigma^2}\Bigl(\sum_{i=1}^n (Y_i - \mu)^2 + \lambda \mu^2\Bigr)\right).$$
This is not a likelihood function, but it is proportional to a posterior density for $\mu$ if $\mu$ has a $N(0, \sigma^2/\lambda)$ prior. Hence penalized estimation with this penalty is equivalent to using the MAP (Maximum A Posteriori) estimator of $\mu$ under a Gaussian prior.
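
The correspondence can be checked numerically: with a $N(0, \tau^2)$ prior the implied penalty is $\lambda = \sigma^2/\tau^2$, and the posterior mode of the conjugate normal-normal model coincides with the penalized estimate. The sketch below uses made-up values of $\sigma$, $\tau$, and the data.

```python
# A sketch of the penalty <-> prior correspondence on this slide: with a
# N(0, tau^2) prior on mu and N(mu, sigma^2) observations, the posterior mode
# equals the penalized estimate with lambda = sigma^2 / tau^2.  All numbers
# below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
sigma, tau = 1.5, 0.8
Y = rng.normal(loc=1.0, scale=sigma, size=40)
n = len(Y)

lam = sigma**2 / tau**2                 # penalty implied by the prior
mu_penalized = Y.sum() / (n + lam)      # minimizer of sum (Y_i - mu)^2 + lam * mu^2

# Posterior mode (= posterior mean) in the conjugate normal-normal model.
mu_map = (Y.sum() / sigma**2) / (n / sigma**2 + 1 / tau**2)
print(mu_penalized, mu_map)             # identical up to floating point
```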

Biased regression: penalties

Not all biased models are better; we need a way to find good biased models. Generalized one-sample problem: penalize large values of $\beta$. This should lead to multivariate shrinkage of the vector $\beta$. Heuristically, a large $\beta$ is interpreted as a complex model, and the goal is really to penalize complex models, i.e. Occam's razor. There is an equivalent Bayesian interpretation. If the truth really is complex, this may not work! (But it will then be hard to build a good model anyway... (statistical lore).)

Ridge regression

Assume that the columns $(X_j)_{1 \le j \le p-1}$ have zero mean and length 1 (to distribute the penalty equally; not strictly necessary) and that $Y$ has zero mean, i.e. no intercept in the model. This is called the standardized model. Minimize
$$SSE_\lambda(\beta) = \sum_{i=1}^n \Bigl(Y_i - \sum_{j=1}^{p-1} X_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p-1} \beta_j^2.$$
This corresponds (through a Lagrange multiplier) to a quadratic constraint on the $\beta$'s. The LASSO, another penalized regression, instead uses the penalty $\sum_{j=1}^{p-1} |\beta_j|$. Normal equations:
$$\frac{\partial}{\partial \beta_l} SSE_\lambda(\beta) = -2\langle Y - X\beta,\, X_l\rangle + 2\lambda \beta_l.$$
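
A small sketch of the standardization step described above; the raw $X$ and $Y$ here are placeholders for whatever data is actually being analyzed.

```python
# A sketch of the standardized model: center Y, and center and scale each column
# of X to have mean zero and Euclidean length one.  The raw X and Y below are
# simulated placeholders.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5)) * np.array([1.0, 10.0, 0.1, 3.0, 1.0])  # assumed raw design
Y = rng.normal(size=60)

Xc = X - X.mean(axis=0)
Xs = Xc / np.linalg.norm(Xc, axis=0)    # each column now has length exactly 1
Yc = Y - Y.mean()

print(np.allclose(Xs.mean(axis=0), 0), np.allclose(np.linalg.norm(Xs, axis=0), 1))
```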

Solving the normal equations

Setting the derivatives to zero,
$$-2\langle Y - X\widehat{\beta}_\lambda,\, X_l\rangle + 2\lambda \widehat{\beta}_{l,\lambda} = 0, \qquad 1 \le l \le p-1.$$
In matrix form,
$$-Y^T X + \widehat{\beta}_\lambda^T (X^T X + \lambda I) = 0,$$
or
$$\widehat{\beta}_\lambda = (X^T X + \lambda I)^{-1} X^T Y.$$
This is identical to the previous $\widehat{\mu}_C$ in matrix form. It is essentially equivalent to putting a $N(0, cI)$ prior on the standardized coefficients, with $c$ determined by $\lambda$.
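
A sketch of this closed-form solution on simulated, standardized data; the true coefficients are assumptions, and the linear system is solved directly rather than by forming the inverse.

```python
# The closed-form ridge estimate from this slide, applied to a standardized
# design.  The data are simulated for illustration; solve() avoids forming
# (X^T X + lambda I)^{-1} explicitly.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
Y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=60)   # assumed truth

Xc = X - X.mean(axis=0)
Xs = Xc / np.linalg.norm(Xc, axis=0)
Yc = Y - Y.mean()

def ridge(Xs, Yc, lam):
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ Yc)

for lam in [0.0, 1.0, 10.0]:
    print(lam, np.round(ridge(Xs, Yc, lam), 3))   # coefficients shrink toward 0 as lambda grows
```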

LASSO regression

Another popular penalized regression technique. Use the standardized model and minimize
$$SSE_\lambda(\beta) = \sum_{i=1}^n \Bigl(Y_i - \sum_{j=1}^{p-1} X_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^{p-1} |\beta_j|.$$
This corresponds (through a Lagrange multiplier) to an $\ell_1$ constraint on the $\beta$'s. In theory it works well when many of the $\beta_j$'s are 0, and it gives sparse solutions, unlike ridge. It corresponds to a Laplace prior on the standardized coefficients.
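
One simple way to compute the LASSO solution, shown below as a rough sketch, is coordinate descent: with unit-length columns, each coordinate update of this objective is a soft-thresholding step at $\lambda/2$. The data and $\lambda$ are illustrative, and this is not the only (or fastest) algorithm.

```python
# A minimal coordinate-descent sketch for the LASSO objective on this slide,
# assuming the standardized model (columns of X have mean 0 and length 1, Y is
# centered).  Not a production solver; the data and lambda are illustrative.
import numpy as np

def lasso_cd(X, Y, lam, n_iter=200):
    # Minimize sum_i (Y_i - X_i beta)^2 + lam * sum_j |beta_j| by cycling over
    # coordinates; with unit-length columns each update is a soft-thresholding step.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = Y - X @ beta + X[:, j] * beta[j]                  # partial residual without j
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam / 2.0, 0.0)   # soft threshold
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)
Y = X @ np.array([4.0, 0, 0, -3.0, 0, 0, 0, 1.0]) + rng.normal(size=100)
Y = Y - Y.mean()

print(np.round(lasso_cd(X, Y, lam=2.0), 3))   # typically several coefficients are exactly 0
```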

Choosing λ: cross-validation

If we knew MSE as a function of λ, we would simply choose the λ that minimizes MSE. To do this we need to estimate MSE. A popular method is cross-validation: break the data up into smaller groups and use part of the data to predict the rest. We saw this idea in diagnostics, e.g. Cook's distance measured the fit with and without each point in the data set.
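
A sketch of the grid-plus-cross-validation recipe for ridge regression; the number of folds, the $\lambda$ grid, and the simulated data are all illustrative choices.

```python
# A sketch of choosing lambda for ridge regression by K-fold cross-validation.
# The fold count, lambda grid, and simulated data are all illustrative choices.
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
Y = X @ np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)]) + rng.normal(size=n)

def ridge_fit(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def cv_error(X, Y, lam, folds):
    errs = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(Y)), hold)
        beta = ridge_fit(X[train], Y[train], lam)
        errs.append(np.mean((Y[hold] - X[hold] @ beta) ** 2))
    return np.mean(errs)

folds = np.array_split(rng.permutation(n), 5)     # one 5-fold split, reused for every lambda
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [cv_error(X, Y, lam, folds) for lam in lambdas]
print("CV-chosen lambda:", lambdas[int(np.argmin(scores))])
```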

Generalized Cross-Validation

A computational shortcut for n-fold (leave-one-out) cross-validation; later we will talk about K-fold cross-validation. Let $S_\lambda = X(X^T X + \lambda I)^{-1} X^T$ be the smoothing matrix in ridge regression, so that $\widehat{Y} = S_\lambda Y$. Then
$$GCV(\lambda) = \frac{\|Y - S_\lambda Y\|^2}{n - \mathrm{Tr}(S_\lambda)}.$$
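
A direct computation of this formula on simulated data, with $S_\lambda$ built explicitly (fine at this scale, though one would not form an $n \times n$ matrix for large $n$).

```python
# Computing GCV(lambda) = ||Y - S_lambda Y||^2 / (n - Tr(S_lambda)) with the
# ridge smoother S_lambda = X (X^T X + lambda I)^{-1} X^T, on illustrative data.
import numpy as np

rng = np.random.default_rng(7)
n, p = 80, 6
X = rng.normal(size=(n, p))
Y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

def gcv(X, Y, lam):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = Y - S @ Y
    return (resid @ resid) / (len(Y) - np.trace(S))

for lam in [0.1, 1.0, 10.0, 100.0]:
    print(lam, round(gcv(X, Y, lam), 3))
```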

Effective degrees of freedom

If λ = 0, then $S_0$ is a projection matrix onto a (p-1)-dimensional space, so $n - \mathrm{Tr}(S_0) = n - p + 1$ is basically the degrees of freedom of the error (ignoring the fact that we have dropped the intercept term). For any linear smoother $\widehat{Y} = SY$, the quantity $n - \mathrm{Tr}(S)$ can therefore be thought of as the effective degrees of freedom of the error.
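
This can be seen numerically: $\mathrm{Tr}(S_\lambda)$ equals $p-1$ at $\lambda = 0$ and decreases toward 0 as $\lambda$ grows, so $n - \mathrm{Tr}(S_\lambda)$ increases. The design below is simulated for illustration.

```python
# Tr(S_lambda) falls from p - 1 (a projection, at lambda = 0) toward 0 as lambda
# grows, so the residual degrees of freedom n - Tr(S_lambda) grow accordingly.
# The standardized design below is simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(8)
n, k = 50, 4                                   # k = p - 1 standardized predictors
X = rng.normal(size=(n, k))
X = (X - X.mean(axis=0)) / np.linalg.norm(X - X.mean(axis=0), axis=0)

def trace_S(X, lam):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(S)

for lam in [0.0, 1.0, 10.0, 100.0]:
    tr = trace_S(X, lam)
    print(f"lambda={lam:6.1f}  Tr(S)={tr:5.2f}  n - Tr(S)={n - tr:5.2f}")
```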