Day 4: Shrinkage Estimators


Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015

n versus p (aka k). Classical regression framework: n > p. Without this inequality, the OLS coefficients have no unique solution. The variance of the estimates increases as p approaches n. For prediction problems where n < p, we need new strategies - OLS and variants of OLS will not work. Note: these are predictive methods, not methods for explanation!
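
A small simulated illustration (not from the lecture) of why OLS breaks down when p > n: lm() cannot identify all the coefficients and reports NA for the ones it drops. The X and y below are hypothetical simulated data.

set.seed(1)
n <- 10; p <- 20                   # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
fit <- lm(y ~ X)                   # rank-deficient fit: no unique OLS solution
sum(is.na(coef(fit)))              # several coefficients come back as NA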

Strategies for coping with n < p. Subset selection: identifying a relevant subset (of size less than n) of the p predictors, and fitting an OLS model on the reduced set of variables. Shrinkage: fitting a model involving all p predictors, but penalizing (regularizing) the coefficients in such a way that they are shrunken towards zero relative to the least squares estimates; this has the effect of reducing variance, and may also perform variable selection (with the lasso). Dimension reduction: replacing the p predictors with projections (linear combinations) of the predictors onto an M-dimensional subspace, where M < p, and then fitting an OLS model on the reduced set of (combination) variables.

Subset selection through sequential testing. Such a model can be found using a series of significance tests: the usual t or F tests of the coefficients, all using the same significance level (e.g. 5%). Two basic versions are: Forward selection: start with a model with no explanatory variables, and add new ones one at a time, until none of the omitted ones are significant. Backward selection: start with a model with all the variables included, and remove nonsignificant ones, one at a time, until the remaining ones are all significant. In practice, the most careful (and safest) procedure is a combination of these two: stepwise selection.

Stepwise model selection. Pick a significance level, say α = 0.05. Start forwards (from a model with nothing in it) or backwards (from a model with everything in it). Then, repeat the following: add to the current model the omitted variable for which the P-value would be smallest, if this is smaller than α; remove the variable with the largest P-value, if this is bigger than α. Continue until all the variables in the current model have P < α, and all the ones out of it would have P ≥ α. This process can even be done automatically in one go, but usually shouldn't be - it needs more care than that.
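
A minimal sketch of the backward-elimination part of this procedure, assuming a hypothetical data frame dat with response y and candidate predictors x1, x2, x3 (illustrative names, not the lecture's data). Base R's step() selects by AIC rather than by significance tests, so this loop uses drop1() with F-tests to mimic the rule described above; the forward/add step would use add1() in the same way.

alpha <- 0.05
fit <- lm(y ~ x1 + x2 + x3, data = dat)          # start backwards, from the full model
repeat {
  drops <- drop1(fit, test = "F")                # F-test for dropping each term
  pvals <- drops[-1, "Pr(>F)"]                   # skip the "<none>" row
  if (max(pvals) < alpha) break                  # all remaining terms significant: stop
  worst <- rownames(drops)[-1][which.max(pvals)] # term with the largest P-value
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
summary(fit)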

Example from HIE data. Response variable: General Health Index at entry, n = 1113. Potential explanatory variables: sex (dummy for men), age, log of family income, weight, blood pressure and smoking (as two dummy variables, for current and ex-smokers). A haphazard collection of variables with no theoretical motivation, purely for illustration of the stepwise procedure. For simplicity, no interactions or nonlinear effects are considered. F-tests are used for the smoking variable (with two dummies), t-tests for the rest. Start backwards, i.e. from a full model with all candidate variables included.

Example: Health Index Experiment. Response variable: General Health Index. Coefficient estimates across models (1), (2), (4), (5), (6), with P-values in parentheses:
Age: 0.138 (< 0.001), 0.089 (0.004), 0.128 (< 0.001), 0.142 (< 0.001)
Education: 1.157 (< 0.001), 0.990 (< 0.001), 0.981 (< 0.001), 1.117 (< 0.001)
Income: 0.275 (< 0.001), 0.277 (< 0.001), 0.219 (< 0.001)
Work experience: 0.002 (0.563), 0.007 (0.045)
(Constant): 74.777, 58.801, 59.417, 59.723, 54.666
R²: 0.012, 0.051, 0.061, 0.061, 0.054

Example from HIE data. 1. In the full model, Blood pressure (P = 0.71), Smoking (P = 0.30) and Sex (P = 0.19) are not significant at the 5% level; remove Blood pressure. 2. Now Smoking (P = 0.30) and Sex (P = 0.19) are not significant; remove Smoking. 3. If added back to this model, Blood pressure would not be significant (P = 0.71), so it can stay out. 4. In this model, Sex (P = 0.21) is the only nonsignificant variable, so remove it. 5. Added (one at a time) to this model, neither Blood pressure (P = 0.77) nor Smoking (P = 0.31) is significant, so they can stay out.

Example from HIE data. So the final model includes Age, Log-income and Weight, all of which are significant at the 5% level. Here the nonsignificant variables were clear and unchanging throughout, but this is definitely not always the case.

Comments and caveats on stepwise model selection. Often some variables are central to the research hypothesis, and are treated differently from other control variables; e.g. in the Health Insurance Experiment, the insurance plan was the variable of main interest. Such variables are not dropped during a stepwise search, but tested separately at the end. Variables are added or removed one at a time, not several at once. For categorical variables with more than two categories, this means adding or dropping all the corresponding dummy variables at once. Individual dummy variables (i.e. differences between particular categories) may be tested separately (e.g. at the end).

Comments and caveats on stepwise model selection. The models should always be hierarchical: if an interaction (e.g. the coefficient of X1 × X2) is significant, the main effects (X1 and X2) may not be dropped; if the coefficient of X² is significant, X may not be dropped. In practice, the possible interactions and nonlinear terms are often not all considered in model selection, only those with some a priori plausibility. Because it involves a sequence of multiple tests, the overall stepwise procedure is not a significance test with significance level α. It is not guaranteed to find a single best model, because one may not exist: there may be several models satisfying the conditions stated earlier.

Changes of scale. In short: linear rescaling of variables will not change the key statistics for inference, just their scale. Suppose we re-express x_i as (x_i + a)/b. Then: t, F, σ̂², R² are unchanged, and β̂_i becomes b·β̂_i. Suppose we rescale y_i as (y_i + a)/b. Then: t, F, R² are unchanged, and σ̂² and β̂_i are rescaled (β̂_i becomes β̂_i/b, σ̂² becomes σ̂²/b²). Standardized variables and standardized coefficients: we replace the variables (all x and y) by their standardized values, e.g. (x_i − x̄)/SD_x for x. Standardized coefficients are sometimes called "betas".
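
A quick numerical check (not from the lecture) of these rescaling claims, using R's built-in cars data: re-expressing the predictor as (x + a)/b multiplies its coefficient by b and leaves R² unchanged.

f1 <- lm(dist ~ speed, data = cars)
f2 <- lm(dist ~ I((speed + 10) / 2), data = cars)        # x -> (x + a)/b with a = 10, b = 2
coef(f1)["speed"] * 2                                    # equals the slope in f2: beta scaled by b
all.equal(summary(f1)$r.squared, summary(f2)$r.squared)  # R^2 unchanged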

More on standardized coefficients. Consider a standardized coefficient b* on a single variable x. Formula: b* = b · SD_x / SD_y. Interpretation: the increase in standard deviations of y associated with a one standard deviation increase in x. Motivation: standardizes units so we can compare the magnitude of different variables' effects. In practice: serious people never use these and you should not either - they are too tricky to interpret, misleading since they suggest we can compare apples and oranges, and too dependent on sample variation (just another version of R²). We can illustrate this in R if we use the scale()[,1] command to standardize the variables, which transforms them into z_i = (x_i − x̄)/SD_x.

Standardized coefficients illustrated
> dc <- d[complete.cases(d$votes1st, d$spend_total), ]   # remove missing
> # m1 is the corresponding unstandardized fit, presumably m1 <- lm(votes1st ~ spend_total, data = dc) (not shown in the original)
> m1.std <- lm(scale(dc$votes1st)[,1] ~ scale(dc$spend_total)[,1])
> coef(m1)
 (Intercept) spend_total
 683.7550298   0.2336056
> coef(m1.std)
                (Intercept) scale(dc$spend_total)[, 1]
              -1.100854e-16                7.395996e-01
> coef(m1)[2] * sd(dc$spend_total) / sd(dc$votes1st)
spend_total
  0.7395996

Collinearity. When some variables are exact linear combinations of others, we have exact collinearity, and there is no unique least squares estimate of β. When X variables are correlated, we have (multi)collinearity. Detecting (multi)collinearity: look at the correlation matrix of the predictors for large pairwise correlations; regress x_k on all the other predictors to produce R²_k, and look for high values (close to 1.0); examine the eigenvalues of X′X.

Collinearity continued. Define S_{x_j x_j} = Σ_i (x_ij − x̄_j)². Then Var(β̂_j) = σ² · [1/(1 − R²_j)] · [1/S_{x_j x_j}]. So collinearity's main consequence is to reduce the efficiency of our estimates of β. If x_j does not vary much, then Var(β̂_j) will be large, and we can maximize S_{x_j x_j} by spreading out X as much as possible. We call the factor 1/(1 − R²_j) a variance inflation factor (the faraway package for R has a function called vif() you can use to compute it). Orthogonality means R²_j = 0, so the variance is minimized.
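
A minimal sketch of computing a variance inflation factor by hand, assuming a hypothetical data frame dat with predictors x1, x2, x3 (illustrative names, not from the lecture); the faraway package's vif() mentioned above returns the same quantity for every predictor at once.

r2_x1  <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared  # R_j^2 from regressing x1 on the others
vif_x1 <- 1 / (1 - r2_x1)                                  # variance inflation factor for x1
vif_x1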

Model fit: Revisiting the OLS formulas. For the three parameters (simple regression):
the regression coefficient: β̂_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
the intercept: β̂_0 = ȳ − β̂_1 x̄
and the residual variance σ²: σ̂² = [1/(n − 2)] Σ [y_i − (β̂_0 + β̂_1 x_i)]²

OLS formulas continued. Things to note:
the prediction line is ŷ = β̂_0 + β̂_1 x
the value ŷ_i = β̂_0 + β̂_1 x_i is the predicted value for x_i
the residual is e_i = y_i − ŷ_i
the residual sum of squares is RSS = Σ_i e_i²
the estimate for σ² is the same as σ̂² = RSS/(n − 2)
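
A short numerical check of these formulas (not from the lecture), using R's built-in cars data; the hand-computed slope, intercept and σ̂² should match what lm() and summary() report.

x <- cars$speed; y <- cars$dist; n <- length(y)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
e  <- y - (b0 + b1 * x)                                          # residuals
c(b0, b1)                  # matches coef(lm(dist ~ speed, data = cars))
sum(e^2) / (n - 2)         # sigma^2-hat = RSS/(n - 2), matches summary(lm(...))$sigma^2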

Components of least squares model fit:
TSS, total sum of squares: Σ(y_i − ȳ)²
ESS, estimation (or regression) sum of squares: Σ(ŷ_i − ȳ)²
RSS, residual sum of squares: Σ e_i² = Σ(ŷ_i − y_i)²
The key to remember is that TSS = ESS + RSS.

R². How much of the variance did we explain?
R² = 1 − RSS/TSS = 1 − Σ_{i=1}^N (y_i − ŷ_i)² / Σ_{i=1}^N (y_i − ȳ)² = Σ_{i=1}^N (ŷ_i − ȳ)² / Σ_{i=1}^N (y_i − ȳ)²
Can be interpreted as the proportion of total variance explained by the model.

R². A much over-used statistic: it may not be what we are interested in at all. Interpretation: the proportion of the variation in y that is explained linearly by the independent variables. Defined in terms of sums of squares:
R² = ESS/TSS = 1 − RSS/TSS = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
Alternatively, R² is the squared correlation coefficient between y and ŷ.

R² continued. When a model has no intercept, it is possible for R² to lie outside the interval (0, 1). R² rises with the addition of more explanatory variables; for this reason we often report the adjusted R²: 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the total number of regressors in the linear model (excluding the constant). Whether R² is high or not depends a lot on the overall variance in Y, so R² values from different Y samples cannot be compared.

R² continued. [Figure] Solid arrow: variation in y when X is unknown (TSS, total sum of squares, Σ(y_i − ȳ)²). Dashed arrow: variation in y when X is known (ESS, estimation sum of squares, Σ(ŷ_i − ȳ)²).

R² decomposed:
y = ŷ + e
Var(y) = Var(ŷ) + Var(e) + 2Cov(ŷ, e)
Var(y) = Var(ŷ) + Var(e) + 0
Σ(y_i − ȳ)²/N = Σ(ŷ_i − ȳ)²/N + Σ(e_i − ē)²/N   (the mean of ŷ equals ȳ)
Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ e_i²              (the mean residual ē is 0)
TSS = ESS + RSS
TSS/TSS = ESS/TSS + RSS/TSS
1 = R² + unexplained variance
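
A numerical check (not from the lecture) of this decomposition, again using the built-in cars data.

fit  <- lm(dist ~ speed, data = cars)
y    <- cars$dist; yhat <- fitted(fit); e <- resid(fit)
TSS  <- sum((y - mean(y))^2)
ESS  <- sum((yhat - mean(y))^2)
RSS  <- sum(e^2)
all.equal(TSS, ESS + RSS)                      # TSS = ESS + RSS
all.equal(summary(fit)$r.squared, ESS / TSS)   # R^2 = ESS/TSS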

Other model fit statistics. Where d is the number of predictors and σ̂² is the estimated residual error variance:
Mallows's C_p: C_p = (1/n)(RSS + 2dσ̂²)
Akaike information criterion: AIC = [1/(nσ̂²)](RSS + 2dσ̂²) (note: perfectly correlated with C_p for OLS)
BIC: BIC = (1/n)(RSS + log(n)dσ̂²); penalizes the number of parameters (d) more than AIC
Adjusted R²: 1 − (1 − R²)(n − 1)/(n − d − 1)
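
A minimal sketch computing these criteria for an OLS fit on the built-in cars data; note these follow the textbook forms given above, which differ by constants from what R's own AIC() and BIC() functions return.

fit   <- lm(dist ~ speed, data = cars)
n     <- nobs(fit)
d     <- length(coef(fit)) - 1               # number of predictors (excluding the intercept)
rss   <- sum(resid(fit)^2)
s2    <- summary(fit)$sigma^2                # estimated residual error variance
Cp    <- (rss + 2 * d * s2) / n
aic   <- (rss + 2 * d * s2) / (n * s2)
bic   <- (rss + log(n) * d * s2) / n
adjR2 <- 1 - (1 - summary(fit)$r.squared) * (n - 1) / (n - d - 1)
c(Cp = Cp, AIC = aic, BIC = bic, adjR2 = adjR2)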

Penalized regression. Provides a way to shrink the coefficient estimates (toward zero), reducing the variance inflation problem that occurs as p approaches n. Also solves the non-uniqueness of the β estimates when n < p. Some methods (e.g. the lasso) even shrink estimates to exactly zero, performing a type of variable selection. The two most common methods are ridge regression and lasso regression. Both involve a tuning parameter λ whose value must be set by optimizing some criterion (usually predictive fit).

Ridge regression. OLS minimizes the residual sum of squares, defined as:
RSS = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)²
Ridge regression instead minimizes:
Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)² + λ Σ_{j=1}^p β_j² = RSS + λ Σ_{j=1}^p β_j²

Ridge regression continued. The second term, λ Σ_{j=1}^p β_j², is called a shrinkage penalty, and serves to shrink the estimates of β_j toward zero: when λ is large, β̂_j shrinks closer to zero; as λ → ∞, β̂_j → 0; and when λ = 0, β̂_j is the same as the OLS solution. Ridge regression will produce a different estimate of β_j for each value of λ, so λ must be chosen carefully.
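
A minimal sketch of fitting ridge regression and choosing λ by cross-validation with the glmnet package, on simulated (hypothetical) data; in glmnet, alpha = 0 requests the ridge penalty.

library(glmnet)
set.seed(1)
n <- 50; p <- 60                              # an n < p problem
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
cv_ridge <- cv.glmnet(X, y, alpha = 0)        # cross-validation over a grid of lambda values
cv_ridge$lambda.min                           # lambda with the best CV prediction error
coef(cv_ridge, s = "lambda.min")[1:6, ]       # intercept and first few ridge estimates at that lambda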

The lasso. Lasso minimizes:
Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p β_j x_ij)² + λ Σ_{j=1}^p |β_j| = RSS + λ Σ_{j=1}^p |β_j|
The lasso uses an ℓ1 penalty: the ℓ1 norm of a coefficient vector β is given by ||β||_1 = Σ|β_j|. (Note: the ridge penalty is also known as the ℓ2 norm.)

Differences. The lasso can actually shrink some β values to exactly zero, while ridge regression always includes them with some penalty. This property makes interpreting the lasso simpler. There is no steadfast rule as to which performs better in applications: it depends on the number of predictors actually related to the outcome, and on λ.
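
A minimal sketch (using the glmnet package on the same kind of simulated, hypothetical data as above) contrasting the two penalties: at a cross-validated λ the lasso sets many coefficients exactly to zero, while ridge keeps all of them nonzero.

library(glmnet)
set.seed(1)
n <- 50; p <- 60
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)
cv_lasso <- cv.glmnet(X, y, alpha = 1)              # alpha = 1: lasso (l1) penalty
cv_ridge <- cv.glmnet(X, y, alpha = 0)              # alpha = 0: ridge (l2) penalty
sum(coef(cv_lasso, s = "lambda.min")[-1, ] != 0)    # lasso: only a few nonzero coefficients
sum(coef(cv_ridge, s = "lambda.min")[-1, ] != 0)    # ridge: all p coefficients nonzero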