STAT 100C: Linear models


STAT 100C: Linear models
Arash A. Amini
June 9, 2018

Model selection
Choosing the best model among a collection of models {M_1, M_2, ..., M_N}. What is a good model?
1. It fits the data well (model fit), measured e.g. by the least-squares criterion.
2. It is simple (model simplicity or parsimony), measured by the number of parameters.
These two goals are in conflict; there is a trade-off between them. We are looking for a criterion that allows us to balance the two.

Model selection in regression
Which covariates should we include in the model? A model corresponds to a subset of covariates that we include. Suppose we have a pool {x_1, ..., x_q} of covariates. Then there are 2^q possible models: all possible subsets. For example, if we have {x_1, x_2, x_3}, there are 2^3 = 8 possible models:

p = 0:  y = β_0 + ε                                (M_0)
p = 1:  y = β_0 + β_1 x_1 + ε                      (M_1)
        y = β_0 + β_2 x_2 + ε                      (M_2)
        y = β_0 + β_3 x_3 + ε                      (M_3)
p = 2:  y = β_0 + β_1 x_1 + β_2 x_2 + ε            (M_12)
        y = β_0 + β_1 x_1 + β_3 x_3 + ε            (M_13)
        y = β_0 + β_2 x_2 + β_3 x_3 + ε            (M_23)
p = 3:  y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε  (M_123)
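As a quick illustration of the count above, the 2^q candidate subsets can be enumerated with the standard library; the covariate names x1, x2, x3 are just placeholders:

```python
# Enumerate all 2^q candidate models for q = 3 covariates.
from itertools import combinations

covariates = ["x1", "x2", "x3"]
models = []
for p in range(len(covariates) + 1):        # subset sizes p = 0, 1, 2, 3
    for subset in combinations(covariates, p):
        models.append(subset)

print(len(models))  # 2^3 = 8 candidate models
for m in models:
    print("intercept" + "".join(" + " + x for x in m))
```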

We can fit all possible models and try to pick the best one based on a criterion. R^2 is in general not a good model selection criterion, since it increases (or at least does not decrease) when we increase the number of parameters:

R^2(M_0) ≤ R^2(M_1) ≤ R^2(M_12) ≤ R^2(M_123)

Model selection criteria
Adjusted R^2 (R^2_adj) or mean-squared error. Let R^2_p = 1 − SSE_p / SST, assuming we have p covariates in the model. SST = Σ_i (y_i − ȳ)^2 does not depend on p. The adjusted R^2 is

R^2_adj,p = 1 − [SSE_p / (n − p − 1)] / [SST / (n − 1)] = 1 − s_p^2 / s_y^2,   where s_p^2 = SSE_p / (n − p − 1) = MSE_p.

Note also that

R^2_adj,p = 1 − (n − 1)/(n − p − 1) · (1 − R^2_p).

Increasing p does not necessarily increase R^2_adj,p: R^2_adj,p takes the number of parameters into account. Maximizing R^2_adj,p is equivalent to minimizing s_p^2.
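A minimal sketch of the identity R^2_adj,p = 1 − (n − 1)/(n − p − 1)(1 − R^2_p), with made-up values of R^2 and n, showing that a covariate which barely raises R^2 can lower the adjusted version:

```python
# Adjusted R^2 from R^2, n, and p (number of covariates).
def adjusted_r2(r2, n, p):
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

# Hypothetical fits: adding a 4th covariate raises R^2 from
# 0.800 to 0.805, but the adjusted R^2 goes down.
n = 30
print(round(adjusted_r2(0.800, n, 3), 4))  # 0.7769
print(round(adjusted_r2(0.805, n, 4), 4))  # 0.7738
```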

Akaike's information criterion (AIC)
An example of a penalized criterion. In general, if l(θ) is the log-likelihood of the model M and θ̂_M is the maximum likelihood estimator, the AIC of the model is

AIC(M) = −2 l(θ̂_M) + 2 p(M),

where p(M) = number of parameters of the model (this penalizes complex models). Pick the model with the smallest AIC. In regression, θ = (β, σ^2), and

l(β, σ^2) = −(n/2) log(2π) − S(β)/(2σ^2) − (n/2) log σ^2,

where S(β) = Σ_{i=1}^n (y_i − (Xβ)_i)^2.

The MLE of σ^2 is σ̂^2 = S(β̂)/n = SSE/n. Thus,

l(β̂, σ̂^2) = −(n/2) log(2π) − S(β̂)/(2σ̂^2) − (n/2) log σ̂^2
           = −(n/2) log(2π) − n/2 − (n/2) log(SSE/n),

where the first two terms are constant (they do not depend on the model). Ignoring the constant (as in your book):

AIC_p = n log(SSE_p / n) + 2(p + 1).

A related criterion is the Bayesian information criterion (BIC).
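A small sketch of the constant-free formula AIC_p = n log(SSE_p/n) + 2(p + 1); the SSE values below are hypothetical, chosen so that the third covariate barely reduces SSE:

```python
import math

# AIC for a regression model with p covariates (constant dropped).
def aic(sse, n, p):
    return n * math.log(sse / n) + 2 * (p + 1)

n = 50
for p, sse in [(1, 120.0), (2, 95.0), (3, 93.0)]:
    print(p, round(aic(sse, n, p), 2))
# The drop from p = 2 to p = 3 barely lowers SSE, so the
# 2(p + 1) penalty makes the larger model lose: AIC picks p = 2.
```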

Mallows's C_p statistic
True model: y = Xβ + ε, where X ∈ R^{n×q}. For the true β, some of the coefficients are zero. Let S ⊆ [q] be the index set of the nonzero coefficients of β. Then

y = X_S β_S + ε,

where X_S is the reduction of X to the columns in S. For example, with q = 5 and S = {1, 2, 4},

Xβ = (x_1 x_2 x_3 x_4 x_5) (β_1, β_2, 0, β_4, 0)^T = (x_1 x_2 x_4) (β_1, β_2, β_4)^T = X_S β_S.

S is called the support of β: S is the true set of covariates in the model.


Mallows's C_p statistic
Pick some S ⊆ [q] with |S| = p + 1, and fit the model

y = X_S α + ε,   α ∈ R^{p+1},   (1)

where X_S ∈ R^{n×(p+1)} is X restricted to the columns in S. Get the mean vector estimate µ̂_S = X_S α̂ after fitting (1). A good measure of performance is

d(µ, µ̂_S) := (1/σ^2) E‖µ − µ̂_S‖^2,

a form of rescaled MSE for the parameter µ. Here µ is the true mean: µ = Xβ. We should choose the S that minimizes this.

Let H_S^⊥ := I − H_S. One can show (try it!) that

d(µ, µ̂_S) := (1/σ^2) E‖µ − µ̂_S‖^2 = (1/σ^2) µ^T H_S^⊥ µ + (p + 1).

If the model is adequate, then µ ∈ Im(X_S), so µ^T H_S^⊥ µ = 0 and d(µ, µ̂_S) = p + 1. Otherwise d(µ, µ̂_S) > p + 1. We can do model selection by comparing d(µ, µ̂_S) to p + 1. However, it depends on the unknown µ. (Note: µ ∈ Im(X_S) is equivalent to S containing the true support.)

C_p tries to approximate d(µ, µ̂_S). It is given by (if we know σ^2)

C_p = SSE_p / σ^2 − n + 2(p + 1).

With e_S = y − µ̂_S,

E[SSE_p] = E‖e_S‖^2 = µ^T H_S^⊥ µ + (n − p − 1)σ^2.

This implies that C_p is an unbiased estimator of d(µ, µ̂_S):

E[C_p] = d(µ, µ̂_S).

In practice we do not know σ^2; replace it with s^2 based on the full model (all covariates):

C_p = SSE_p / s^2 − n + 2(p + 1).

Choose the smallest model for which C_p ≈ p + 1.
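A sketch of the plug-in version C_p = SSE_p/s^2 − n + 2(p + 1), where s^2 is the MSE of the full model; all SSE values below are hand-made, not from a real fit:

```python
# Mallows' C_p with sigma^2 replaced by the full-model MSE s^2.
def mallows_cp(sse_p, s2, n, p):
    return sse_p / s2 - n + 2 * (p + 1)

n, q = 40, 5
sse_full = 60.0
s2 = sse_full / (n - q - 1)     # s^2 from the full model (q covariates)

for p, sse in [(2, 75.0), (3, 63.0)]:
    print(p, round(mallows_cp(sse, s2, n, p), 2))
# p = 2 gives C_p = 8.5, well above p + 1 = 3; p = 3 gives
# C_p = 3.7, close to p + 1 = 4, so the 3-covariate model looks adequate.
```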


PRESS statistic
Recall e_(i) = y_i − ŷ_(i), where ŷ_(i) = x_i^T β̂_(i). Define

PRESS_p = Σ_{i=1}^n e_(i)^2 = Σ_{i=1}^n (e_i / (1 − h_ii))^2.

That is:
1. Leave one data point out,
2. fit the model,
3. try to predict the deleted data point.

PRESS_p is an empirical measure of the prediction error of the model (called generalization error in machine learning). We can use PRESS_p as a model selection criterion.
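The shortcut formula above needs only the ordinary residuals e_i and the leverages h_ii, with no refitting. A minimal sketch with hand-made residuals and leverages:

```python
# PRESS_p = sum_i (e_i / (1 - h_ii))^2, computed from one fit.
def press(residuals, leverages):
    return sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverages))

# Hypothetical residuals and hat-matrix diagonals for n = 4 points.
e = [0.5, -1.2, 0.3, 0.9]
h = [0.10, 0.40, 0.25, 0.15]
print(round(press(e, h), 4))
# Note how the high-leverage point (h = 0.40) dominates the sum.
```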


A general principle of model selection, called cross-validation:
1. Hold out part of the data and try to predict it with the fitted model.
2. Choose the model that has the smallest prediction error.

Using PRESS_p is equivalent to leave-one-out cross-validation. (Also known as jackknifing in this case.)

Automatic methods
Instead of looking at all possible regressions, we can use a greedy approach, usually called stepwise regression. Pick a preset significance level α (alpha-to-enter).

Forward selection:
1. Start with no variables in the model.
2. From the pool of available covariates, pick the most significant one, say x_j, and add it to the model, provided it is significant at level α. Remove x_j from the pool.
3. Repeat until no remaining variable is significant.

Backward selection:
1. Start with the full model,
2. recursively drop the least significant covariate,
3. until all remaining covariates are significant.

To decide whether to keep or drop a variable at each stage, it is common to use a t or F test at level α = 0.05 or 0.10 (alpha-to-enter). In forward selection, once a variable enters the model it cannot leave at a later stage. Similarly, in backward selection, once a variable leaves the model it cannot re-enter at a later stage.

Bidirectional (stepwise) selection: a combination of forward and backward selection.

There are criticisms of these approaches: effectively, we are generating hypotheses from the data and then testing them on the same data, which is generally not advisable.
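The forward-selection loop can be sketched as follows; the p-value table here is a hand-made stand-in for real partial t/F tests, used only to show the control flow (most significant candidate enters if its p-value is below α, and the loop stops otherwise):

```python
# Sketch of forward selection with a fabricated p-value lookup.
ALPHA = 0.05

# Hypothetical p-values for adding each candidate to the current model.
P_VALUES = {
    (): {"x1": 0.001, "x2": 0.300, "x3": 0.040},
    ("x1",): {"x2": 0.020, "x3": 0.600},
    ("x1", "x2"): {"x3": 0.450},
}

def forward_select(pool):
    model = ()
    while True:
        table = P_VALUES.get(model, {})
        candidates = {x: p for x, p in table.items() if x in pool}
        if not candidates:
            break
        best = min(candidates, key=candidates.get)   # most significant
        if candidates[best] >= ALPHA:
            break            # nothing left is significant at level alpha
        pool = [x for x in pool if x != best]        # remove from pool
        model = tuple(list(model) + [best])          # enter the model
    return model

print(forward_select(["x1", "x2", "x3"]))  # ('x1', 'x2')
```

Note that x3 was the second-most significant at the first step, yet never enters: the greedy path x1, x2 blocks it, which is one reason stepwise methods need not find the best subset.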
