LINEAR REGRESSION ANALYSIS
MODULE XIII
Lecture - 39
Variable Selection and Model Building
Dr. Shalabh
Department of Mathematics and Statistics
Indian Institute of Technology Kanpur

5. Akaike's information criterion (AIC)

The AIC statistic for a subset model with p parameters is given as

AIC_p = n ln( SS_res(p) / n ) + 2p

where

SS_res(p) = y'[I - X1(X1'X1)^(-1) X1']y

is the residual sum of squares based on the subset model y = X1 β1 + δ derived from the full model y = X1 β1 + X2 β2 + ε = X β + ε.

Now we derive the given expression for AIC. In general, the AIC is defined as

AIC = -2 (maximized log-likelihood) + 2 (number of parameters).

In the linear regression model with ε ~ N(0, σ² I), the likelihood function is

L(y; β, σ²) = (2πσ²)^(-n/2) exp[ -(1/(2σ²)) (y - Xβ)'(y - Xβ) ]

and the log-likelihood is

ln L(y; β, σ²) = -(n/2) ln(2π) - (n/2) ln(σ²) - (1/(2σ²)) (y - Xβ)'(y - Xβ).

The log-likelihood is maximized at

β̃ = (X'X)^(-1) X'y,   σ̃² = ((n - p)/n) σ̂²,

where β̃ is the maximum likelihood estimate of β, which is the same as the OLSE, σ̃² is the maximum likelihood estimate of σ², and σ̂² is the OLS-based estimator of σ².

So

AIC = -2 ln L(y; β̃, σ̃²) + 2p = n ln( SS_res / n ) + 2p + n[ ln(2π) + 1 ]

where

SS_res = y'[I - X(X'X)^(-1) X']y.

The term n[ln(2π) + 1] remains the same for all the models under comparison when the same observations y are used, so it is irrelevant for AIC. Hence we get the required expression for AIC. A model with a smaller value of AIC is preferable.

6. Bayesian information criterion (BIC)

Similar to AIC, the Bayesian information criterion is based on maximizing the posterior probability of the model given the observations y. In the case of the linear regression model with p selected explanatory variables, it is defined as

BIC_p = n ln( SS_res(p) ) + (p - n) ln n,

which is equivalent to n ln( SS_res(p)/n ) + p ln n. A model with a smaller value of BIC is preferable.
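To make the criteria concrete, here is a minimal sketch in Python/NumPy of how AIC_p and BIC_p can be computed for candidate subset models. The data and the comparison at the end are hypothetical placeholders; the formulas are those derived above (with the constant n[ln(2π) + 1] dropped from AIC).

```python
import numpy as np

def ss_res(X, y):
    """Residual sum of squares y'[I - X(X'X)^(-1)X']y for the given design matrix."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLSE of the coefficients
    e = y - X @ beta
    return float(e @ e)

def aic(X, y):
    """AIC_p = n ln(SS_res(p)/n) + 2p (the constant n[ln(2*pi) + 1] is dropped)."""
    n, p = X.shape
    return n * np.log(ss_res(X, y) / n) + 2 * p

def bic(X, y):
    """BIC_p = n ln(SS_res(p)) + (p - n) ln n, equivalent to n ln(SS_res(p)/n) + p ln n."""
    n, p = X.shape
    return n * np.log(ss_res(X, y)) + (p - n) * np.log(n)

# Hypothetical example: compare a two-variable subset with the full model.
rng = np.random.default_rng(0)
n, k = 50, 4
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

print(aic(X_full[:, :3], y), aic(X_full, y))   # smaller AIC is preferable
print(bic(X_full[:, :3], y), bic(X_full, y))   # smaller BIC is preferable
```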

7. PRESS statistic

Since the residuals and the residual sum of squares act as a criterion for subset model selection, the PRESS residuals and the prediction sum of squares can similarly be used as a basis for subset model selection. The usual residuals and the PRESS residuals have their own characteristics which are used in regression modelling. The PRESS statistic based on a subset model with p explanatory variables is given by

PRESS(p) = sum_{i=1}^{n} [ y_i - ŷ_(i) ]² = sum_{i=1}^{n} [ e_i / (1 - h_ii) ]²

where ŷ_(i) is the prediction of y_i from the model fitted without the i-th observation, e_i is the ordinary residual, and h_ii is the i-th diagonal element of H = X(X'X)^(-1) X'. This criterion is used on similar lines as in the case of SS_res(p). A subset regression model with a smaller value of PRESS(p) is preferable.
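As a computational note, the PRESS residuals do not require refitting the model n times: they follow from the ordinary residuals and the leverages h_ii, as in the second form of the formula above. A minimal sketch (hypothetical X and y, continuing the NumPy setup from the earlier sketch):

```python
import numpy as np

def press(X, y):
    """PRESS(p) = sum_i [ e_i / (1 - h_ii) ]^2, with h_ii the diagonal of
    the hat matrix H = X(X'X)^(-1)X' and e the ordinary residuals."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix of the subset model
    e = y - H @ y                           # ordinary residuals e = (I - H)y
    h = np.diag(H)                          # leverages h_ii
    return float(np.sum((e / (1.0 - h)) ** 2))
```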

Partial F-statistic

The partial F-statistic is used to test hypotheses about a subvector of the regression coefficients. Consider the model

y = X β + ε,   y: n × 1,  X: n × k,  β: k × 1,  ε: n × 1,

which includes an intercept term and (k - 1) explanatory variables. Suppose a subset of r < (k - 1) explanatory variables is to be obtained which contributes significantly to the regression model. So partition

X = [X1  X2],   β = [β1', β2']'

where X1 and X2 are matrices of order n × (k - r) and n × r respectively, and β1 and β2 are vectors of order (k - r) × 1 and r × 1 respectively. The objective is to test the null hypothesis

H0: β2 = 0  against  H1: β2 ≠ 0.

Then

y = X β + ε = X1 β1 + X2 β2 + ε

is the full model, and application of least squares gives the OLSE of β as

b = (X'X)^(-1) X'y.

The corresponding sum of squares due to regression with k degrees of freedom is

SS_reg = b'X'y

and the sum of squares due to residuals with (n - k) degrees of freedom is

SS_res = y'y - b'X'y,

so

MS_res = (y'y - b'X'y) / (n - k)

is the mean square due to residuals.

The contribution of the explanatory variables in β2 to the regression can be found by considering the full model under H0: β2 = 0. Assume that H0: β2 = 0 is true; then the full model becomes

y = X1 β1 + δ,   E(δ) = 0,  Var(δ) = σ² I,

which is the reduced model. Application of least squares to the reduced model yields the OLSE of β1 as

b1 = (X1'X1)^(-1) X1'y

and the corresponding sum of squares due to regression with (k - r) degrees of freedom is

SS_reg(β1) = b1'X1'y.

The sum of squares of regression due to β2, given that β1 is already in the model, can be found as

SS_reg(β2 | β1) = SS_reg(β) - SS_reg(β1)

where SS_reg(β) and SS_reg(β1) respectively are the sum of squares due to regression with all explanatory variables (corresponding to β) in the model and the sum of squares due to the explanatory variables corresponding to β1 in the model. The term SS_reg(β2 | β1) is called the extra sum of squares due to β2 and has degrees of freedom k - (k - r) = r. It is independent of MS_res and is a measure of the regression sum of squares that results from adding the explanatory variables X_{k-r+1},..., X_k to a model that already contains X_1, X_2,..., X_{k-r}. The null hypothesis H0: β2 = 0 can be tested using the statistic

F_0 = [ SS_reg(β2 | β1) / r ] / MS_res

which follows an F-distribution with r and (n - k) degrees of freedom under H0. The decision rule is to reject H0 whenever F_0 > F_α(r, n - k). This is known as the partial F-test. It measures the contribution of the explanatory variables in X2 given that the other explanatory variables in X1 are already in the model.
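A small sketch of the partial F-test, following the derivation above; X1, X2, and y are hypothetical NumPy arrays, with X1 containing the intercept column. Whether to reject H0 is then a comparison of the returned value with the tabulated F_α(r, n - k).

```python
import numpy as np

def reg_ss(X, y):
    """Sum of squares due to regression, SS_reg = b'X'y with b = (X'X)^(-1)X'y."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return float(b @ (X.T @ y))

def partial_f(X1, X2, y):
    """Partial F-statistic for H0: beta_2 = 0 in y = X1 beta_1 + X2 beta_2 + eps."""
    n = len(y)
    X = np.hstack([X1, X2])                            # full design matrix
    k, r = X.shape[1], X2.shape[1]
    extra_ss = reg_ss(X, y) - reg_ss(X1, y)            # SS_reg(beta_2 | beta_1), r d.f.
    ms_res = (float(y @ y) - reg_ss(X, y)) / (n - k)   # residual mean square
    return (extra_ss / r) / ms_res                     # ~ F(r, n - k) under H0
```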

Computational techniques for variable selection

In order to select a subset model, several techniques based on computational procedures and algorithms are available. They are essentially based on two ideas: use all possible explanatory variables, or select the explanatory variables stepwise.

1. Use all possible explanatory variables

This methodology is based on the following steps:
- Fit a model with one explanatory variable.
- Fit a model with two explanatory variables.
- Fit a model with three explanatory variables, and so on.
- Choose a suitable criterion for model selection and evaluate each of the fitted regression equations with the selection criterion.

The total number of models to be fitted rises sharply with an increase in k, so such models are evaluated against the model selection criterion with the help of an efficient computational algorithm (a sketch of this enumeration appears at the end of this overview).

2. Stepwise regression techniques

This methodology is based on choosing the explanatory variables of the subset model in steps, either adding one variable at a time or deleting one variable at a time. Based on this, there are three procedures:
- forward selection,
- backward elimination, and
- stepwise regression.

These procedures are computer-intensive and are executed using software.
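As mentioned for the all-possible-regressions idea, for moderate k the enumeration can be carried out directly. A minimal sketch, reusing the aic (or bic, press) functions from the earlier sketches, with a hypothetical column layout in which the intercept occupies column 0:

```python
from itertools import combinations

def best_subset(X, y, criterion):
    """Fit every non-empty subset of the non-intercept columns of X and return
    (best score, best subset) under the chosen criterion (aic, bic, press, ...)."""
    k = X.shape[1]
    best = None
    for size in range(1, k):                      # subsets of 1, ..., k-1 variables
        for cols in combinations(range(1, k), size):
            Xs = X[:, (0,) + cols]                # intercept plus chosen variables
            score = criterion(Xs, y)
            if best is None or score < best[0]:
                best = (score, cols)
    return best

# e.g. best_subset(X_full, y, aic) with the hypothetical data generated earlier
```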

Forward selection procedure

This methodology assumes that there is no explanatory variable in the model except an intercept term. It adds variables one by one and tests the fitted model at each step using some suitable criterion. It has the following steps:

- Consider only the intercept term and insert one variable at a time.
- Calculate the simple correlations of the x_i's (i = 1, 2,..., k) with y.
- Choose the x_i which has the largest correlation with y. Suppose x_1 is the variable with the highest correlation with y. Since the F-statistic for the significance of regression is given by

  F = [(n - k)/(k - 1)] R² / (1 - R²),

  x_1 will also produce the largest value of F_0 in testing the significance of regression.
- Choose a prespecified value of F, say F_IN (F-to-enter). If F > F_IN, then accept x_1, and so x_1 enters the model.
- Adjust the effect of x_1 on y, re-compute the correlations of the remaining x_i's with y, and obtain the partial correlations as follows:
  - Fit the regression ŷ = β̂_0 + β̂_1 x_1 and obtain the residuals.
  - Fit the regressions of the other candidate explanatory variables on x_1 as x̂_j = α̂_{0j} + α̂_{1j} x_1, j = 2, 3,..., k, and obtain the residuals.
  - Find the simple correlations between the two sets of residuals. These give the partial correlations.
- Choose the x_i with the second largest correlation with y, i.e., the variable with the highest partial correlation with y. Suppose this variable is x_2. Then the largest partial F-statistic is

  F = SS_reg(x_2 | x_1) / MS_res(x_1, x_2).

If F > F_IN, then x_2 enters the model. These steps are repeated. At each step, the partial correlations are computed, and the explanatory variable with the highest partial correlation with y is chosen to be added to the model. Equivalently, the partial F-statistics are calculated, the largest partial F-statistic given the other explanatory variables already in the model is chosen, and the corresponding explanatory variable is added to the model if its partial F-statistic exceeds F_IN. Such selection continues until either, at a particular step, the largest partial F-statistic does not exceed F_IN, or the last explanatory variable has been added to the model.

Note: The SAS software chooses F_IN by choosing a type I error rate α, so that the explanatory variable with the highest partial correlation coefficient with y is added to the model if its partial F-statistic exceeds F_α(1, n - p).
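The following sketch of the forward selection loop uses the partial_f function from the partial F-test sketch and a hypothetical cut-off F_IN (the popular value 4 is used as the default). It illustrates the procedure described here rather than the exact implementation in any particular software.

```python
def forward_selection(X, y, F_IN=4.0):
    """Start with the intercept only (column 0) and, at each step, add the
    candidate variable with the largest partial F-statistic if it exceeds F_IN."""
    k = X.shape[1]
    selected = [0]                               # intercept term
    remaining = list(range(1, k))
    while remaining:
        scores = [(partial_f(X[:, selected], X[:, [j]], y), j) for j in remaining]
        best_F, best_j = max(scores)
        if best_F <= F_IN:
            break                                # no candidate qualifies to enter
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```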

Backward elimination procedure

This methodology is contrary to the forward selection procedure. The forward selection procedure starts with no explanatory variable in the model and keeps adding one variable at a time until a suitable model is obtained. The backward elimination methodology begins with all the explanatory variables and keeps deleting one variable at a time until a suitable model is obtained. It is based on the following steps:

- Consider all k explanatory variables and fit the model.
- Compute the partial F-statistic for each explanatory variable as if it were the last variable to enter the model.
- Choose a preselected value F_OUT (F-to-remove).
- Compare the smallest of the partial F-statistics with F_OUT. If it is less than F_OUT, then remove the corresponding explanatory variable from the model. The model will now have (k - 1) explanatory variables.
- Fit the model with these (k - 1) explanatory variables, compute the partial F-statistics for the new model, and compare the smallest of them with F_OUT. If it is less than F_OUT, then remove the corresponding variable from the model.
- Repeat this procedure. Stop when the smallest partial F-statistic exceeds F_OUT.
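A matching sketch of backward elimination, again with the partial_f function from above and a hypothetical cut-off F_OUT:

```python
def backward_elimination(X, y, F_OUT=4.0):
    """Start with all explanatory variables and, at each step, delete the variable
    with the smallest partial F-statistic if it falls below F_OUT."""
    k = X.shape[1]
    selected = list(range(1, k))                 # all non-intercept columns
    while selected:
        scores = [(partial_f(X[:, [0] + [i for i in selected if i != j]],
                             X[:, [j]], y), j)
                  for j in selected]
        worst_F, worst_j = min(scores)
        if worst_F >= F_OUT:
            break                                # every remaining variable stays
        selected.remove(worst_j)
    return [0] + selected
```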

Stepwise regression procedure

A combination of the forward selection and backward elimination procedures is the stepwise regression. It is a modification of the forward selection procedure and has the following steps:

- Consider all the explanatory variables entered into the model at the previous steps.
- Add a new variable and reassess all the variables in the model via their partial F-statistics. An explanatory variable that was added at an earlier step may now become insignificant due to its relationship with the explanatory variables currently present in the model. If the partial F-statistic for an explanatory variable is smaller than F_OUT, then this variable is deleted from the model.
- Stepwise regression needs two cut-off values, F_IN and F_OUT. Sometimes F_IN = F_OUT or F_IN > F_OUT is used. The choice F_IN > F_OUT makes it relatively more difficult to add an explanatory variable than to delete one.

General comments
1. None of the methods among forward selection, backward elimination and stepwise regression guarantees the best subset model.
2. The order in which the explanatory variables enter or leave the model does not indicate the order of importance of the explanatory variables.
3. In forward selection, no explanatory variable can be removed once it has entered the model. Similarly, in backward elimination, no explanatory variable can be added back once it has been removed from the model.
4. The procedures may lead to different models.
5. Different model selection criteria may give different subset models.
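Stepwise regression can be sketched by following each forward step with a backward re-check of the variables already in the model, combining the two routines above (hypothetical cut-offs, with F_IN >= F_OUT assumed so that entering and leaving do not alternate indefinitely):

```python
def stepwise_regression(X, y, F_IN=4.0, F_OUT=4.0):
    """Forward step: add the candidate with the largest partial F if it exceeds F_IN.
    Backward step: delete any variable whose partial F, given the others in the
    current model, has fallen below F_OUT."""
    k = X.shape[1]
    selected, remaining = [0], list(range(1, k))
    while remaining:
        # forward step
        scores = [(partial_f(X[:, selected], X[:, [j]], y), j) for j in remaining]
        best_F, best_j = max(scores)
        if best_F <= F_IN:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        # backward step: re-examine the variables already in the model
        while len(selected) > 2:                 # keep the intercept and at least one variable
            checks = [(partial_f(X[:, [i for i in selected if i != j]],
                                 X[:, [j]], y), j)
                      for j in selected[1:]]     # never test the intercept for removal
            worst_F, worst_j = min(checks)
            if worst_F >= F_OUT:
                break
            selected.remove(worst_j)
            remaining.append(worst_j)
    return selected
```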

Comments about stopping rules

The choice of F_IN and/or F_OUT provides stopping rules for the algorithms. Some computer software allows the analyst to specify these values directly. Some algorithms require type I error rates to generate F_IN and/or F_OUT. Sometimes, taking α as the level of significance can be misleading, because several correlated partial F-variables are considered at each step and the maximum among them is examined. Some analysts prefer small values of F_IN and F_OUT, whereas others prefer extreme values. A popular choice is F_IN = F_OUT = 4, which corresponds roughly to the 5% level of significance of the F-distribution.