Model Selection. Frank Wood. December 10, 2009


Standard Linear Regression Recipe

1. Identify the explanatory variables.
2. Decide the functional forms in which the explanatory variables can enter the model.
3. Decide which interactions should be in the model.
4. Reduce the set of explanatory variables (!?).
5. Refine the model.

Trouble and Strife

From any set of p - 1 predictors, 2^(p-1) alternative linear regression models can be constructed: consider all binary inclusion vectors of length p - 1. Search in that space is exponentially difficult, so greedy strategies are typically employed. Is this the only way?
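To make the exponential blow-up concrete, here is a brute-force sketch (my own illustration, not from the slides; all names are assumptions): it enumerates every subset of the columns of a NumPy design matrix and fits ordinary least squares to each. Note that raw SSE always favors the largest model, so in practice each fit would be scored with a penalized criterion such as the AIC/BIC or PRESS_p discussed below.

```python
# Hypothetical illustration: exhaustive enumeration of all 2^(p-1) models.
# X is an n x (p-1) design matrix without intercept, y the response.
from itertools import combinations

import numpy as np


def all_subset_sse(X, y):
    """Fit OLS with an intercept on every subset of columns; return
    {subset: SSE}. SSE alone always favors the full model, so compare
    subsets with a penalized score (AIC, BIC, PRESS) instead."""
    n, k = X.shape
    scores = {}
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            scores[cols] = float(np.sum((y - Xs @ beta) ** 2))
    return scores  # 2^(p-1) entries: exponential in the predictor count
```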

Selecting between models

In order to select between models, some score must be given to each model. The likelihood of the data under each model is not sufficient because, as we have seen, the likelihood can always be improved by adding more parameters until the model effectively memorizes the data. Accordingly, some penalty that is a function of the complexity of the model must be included in the selection procedure. There are four choices for how to do this:

1. Explicit penalization of the number of parameters in the model (AIC, BIC, etc.).
2. Implicit penalization through cross validation.
3. Bayesian regularization / Occam's razor.
4. Use a fixed model of unbounded complexity (Bayesian nonparametrics).

Penalized Likelihood

Akaike's information criterion (AIC) and the Bayesian information criterion (BIC, also called the Schwarz criterion) are two criteria that penalize model complexity. In the linear regression setting

    AIC_p = n ln(SSE_p) - n ln(n) + 2p
    BIC_p = n ln(SSE_p) - n ln(n) + (ln n) p

Roughly, you can think of these two criteria as penalizing models with many parameters (p in the case of linear regression).[1]

[1] For a good derivation of the BIC see http://myweb.uiowa.edu/cavaaugh/ms lec 6 ho.pdf
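A minimal sketch of these two formulas (function and variable names are mine), taking the error sum of squares sse of a fitted model with p parameters and n observations:

```python
# Minimal sketch of the AIC/BIC formulas above; names are illustrative.
import numpy as np


def aic(sse, n, p):
    """AIC_p = n ln(SSE_p) - n ln(n) + 2p; lower is better."""
    return n * np.log(sse) - n * np.log(n) + 2 * p


def bic(sse, n, p):
    """BIC_p = n ln(SSE_p) - n ln(n) + (ln n) p; penalizes p more
    heavily than AIC once ln(n) > 2, i.e. n > 8 or so."""
    return n * np.log(sse) - n * np.log(n) + np.log(n) * p
```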

Cross Validation

A way of doing model selection that implicitly penalizes models of high complexity is cross-validation. Fit a model to a subset of the data, then predict the rest of the data using the model you just fit. The average score on the held-out data over different folds can be used to compare models. If the held-out data are consistently (over the various folds) well explained by the model, then one could conclude that the model is a good model. If one model performs better on average when predicting held-out data, there is reason to prefer that model. Overly complex models will not generalize well and will therefore not be selected.
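A hedged k-fold sketch for ordinary least squares (the fold construction and squared-error score are standard choices; the helper name and defaults are assumptions of mine):

```python
# Sketch of k-fold cross-validation for comparing OLS models.
# X: n x k design matrix without intercept; y: response vector.
import numpy as np


def cv_score(X, y, k=10, seed=0):
    """Average held-out squared error over k folds; lower is better."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    errors = []
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(held_out)), X[held_out]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean((y[held_out] - Xte @ beta) ** 2))
    return float(np.mean(errors))
```

Comparing two candidate models then amounts to calling cv_score on each design matrix and preferring the lower score.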

PRESS_p or Leave-One-Out Cross Validation

The PRESS_p or prediction sum of squares criterion measures how well a subset model can predict the observed responses Y_i. Let Ŷ_i(i) be the fitted value for case i from a model in which case i was left out during training. The PRESS_p criterion is then given by summing over all n cases:

    PRESS_p = sum_{i=1}^{n} (Y_i - Ŷ_i(i))^2

PRESS_p values can be calculated without doing n separate regression runs.

PRESS_p or Leave-One-Out Cross Validation

If we let d_i be the deleted residual for the i-th case,

    d_i = Y_i - Ŷ_i(i)

then we can rewrite

    d_i = e_i / (1 - h_ii)

where e_i is the ordinary residual for the i-th case and h_ii is the i-th diagonal element of the hat matrix. We can obtain h_ii directly from

    h_ii = X_i' (X'X)^(-1) X_i
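The identity above yields PRESS_p from a single fit; a short sketch (assuming X already includes the intercept column):

```python
# Closed-form PRESS: one regression, no n re-fits.
import numpy as np


def press(X, y):
    H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
    e = y - H @ y                         # ordinary residuals e_i
    d = e / (1 - np.diag(H))              # deleted residuals d_i
    return float(np.sum(d ** 2))          # PRESS_p = sum of d_i^2
```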

PRESS_p or Leave-One-Out Cross Validation

PRESS_p is useful for choosing between models. Take two models, one M_p with p-dimensional input and one M_(p-1) with (p-1)-dimensional input. Calculate the PRESS_p criterion for both; whichever model has the lowest PRESS_p should be preferred. Why? Unfortunately the PRESS_p criterion can't tell us which variables to include; for that, general search is still required. More general cross-validation procedures can be designed. Note that the PRESS_p criterion is very similar to a log-likelihood.

Detecting Outliers Via Studentized Deleted Residuals

The studentized deleted residual, denoted by t_i, is

    t_i = d_i / s{d_i},    where s^2{d_i} = MSE_(i) / (1 - h_ii)

and d_i / s{d_i} follows a t(n - p - 1) distribution. Fortunately, again, t_i can be calculated without fitting n different regression models. It can be shown that

    t_i = e_i [ (n - p - 1) / (SSE (1 - h_ii) - e_i^2) ]^(1/2)

These t_i's can be used to formally test (e.g. using a Bonferroni test procedure) whether the largest absolute studentized deleted residual is an outlier.
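A sketch of that test (the Bonferroni cutoff t(1 - alpha/(2n); n - p - 1) is the standard choice; scipy supplies the t quantile, and X is assumed to include the intercept column, so p = X.shape[1]):

```python
# Studentized deleted residuals and a Bonferroni outlier test.
import numpy as np
from scipy import stats


def outlier_flags(X, y, alpha=0.05):
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y
    sse = float(e @ e)
    h = np.diag(H)
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
    cutoff = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
    return np.abs(t) > cutoff  # True where case i is flagged as an outlier
```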

Stepwise Regression Methods

A (greedy) procedure for identifying variables to include in the regression model is as follows (a sketch of the forward step appears after this list). Repeat until finished:

1. Fit a simple linear regression model for each of the p - 1 X variables considered for inclusion. For each, compute the t statistic for testing whether or not the slope is zero:[2]

       t_k = b_k / s{b_k}

2. Pick the largest of the p - 1 t_k's (in the first step k = 1) and include the corresponding X variable in the regression model if t_k exceeds some arbitrary threshold.

3. If the number of X variables included in the regression model is greater than one, check to see if the model would be improved by dropping variables (using the t-test and a threshold again).

[2] Remember b_k is the estimate of β_k and s{b_k} is its estimated standard deviation.
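A hedged sketch of one forward step (my own rendering: after the first step, each candidate's t statistic is computed from the fit that includes the already-selected variables; the names and threshold default are assumptions):

```python
# One forward step of greedy stepwise selection.
# X: n x (p-1) design matrix without intercept; included: set of
# column indices already in the model.
import numpy as np


def forward_step(X, y, included, threshold=2.0):
    """Return the index of the best variable to add, or None."""
    n = len(y)
    best_t, best_j = threshold, None
    for j in range(X.shape[1]):
        if j in included:
            continue
        cols = sorted(included) + [j]
        Xs = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        mse = resid @ resid / (n - Xs.shape[1])
        se = np.sqrt((mse * np.linalg.inv(Xs.T @ Xs))[-1, -1])
        t = abs(beta[-1]) / se  # t statistic for the candidate's slope
        if t > best_t:
            best_t, best_j = t, j
    return best_j
```

Step 3's drop check would interleave an analogous backward pass that removes any included variable whose t statistic falls below the threshold.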

Stepwise Regression Methods

Big question: does this ensure the best possible model? Is this procedure guaranteed to pick the best possible linear regression model, in any sense? There is a tension between including variables (why not include everything?) and needing to reliably estimate many parameters. Here sharp philosophical differences emerge (partially driven by the needs of the user / application):

1. Try to identify which inputs are linearly related to the output.
2. Include everything and regularize.

Finding the Best Model, Case 1

When there are many parameters and few observations, this is known as the "big p, little n" problem. One might actually want to know (the inference goal) which inputs are linearly related to the output. It is tempting, but dangerous and wrong, to conclude that if formal testing procedures find a particular input feature not to be linearly related to the output, then there is no relationship between the variables. A converse is also potentially dangerous and wrong: if you find that a particular feature has a statistically significant linear effect on the output, you cannot necessarily conclude causality. Carefully controlled experiments are required to establish probable causality.

Finding the Best Model, Case 2

Philosophically it makes sense to include all possible features in a regression model. Why not? Unfortunately, in the big p, small n setting model estimation is usually difficult. In linear regression the sample covariance matrix is low rank and not invertible, so the small n, big p problem is degenerate. Regularization can be employed, but it is then difficult to interpret and assign physical meaning to the resulting regression coefficients. This may not matter if prediction is the goal rather than describing or examining the relationship between the predictors and the output.

Finding the Best Model, Alternative

It is possible to include all the variables in the model and automatically learn which should be included. Techniques for doing this go by different names (LASSO, L1-penalized regression, etc.). They use a different regularization term (L1 instead of L2) that encourages sparsity (i.e. parameters that are exactly zero). Fitting such a model becomes significantly more difficult and requires a constrained optimization solver.
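An illustrative fit (scikit-learn is not part of the lecture, and the synthetic data and alpha penalty weight are made-up values; its coordinate-descent solver handles the constrained optimization internally):

```python
# Hypothetical lasso example: 20 candidate variables, 2 true signals.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

fit = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_))  # indices of variables kept nonzero
```

The L1 penalty typically zeros out the irrelevant coefficients, performing variable selection and estimation in a single step.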