School of Mathematical Sciences
MTH5120 Statistical Modelling I
Practical 9 and Assignment 8 Solutions

Question 1. Best Subsets Regression

Response is Crime. The candidate predictors are Age, Ed, PE, PE-1, LF, M, Pop, UE1, UE2, Wealth and IncInequ; in the Minitab output an X marks the predictors included in the best model of each size.

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1   47.3        46.1         36.1   28.393
   2   58.0        56.1         22.0   25.619
   3   66.6        64.2         11.2   23.131
   4   70.0        67.2          8.0   22.154
   5   73.0        69.7          5.6   21.301
   6   74.8        71.0          4.9   20.827
   7   75.4        71.0          6.0   20.843
   8   76.4        71.4          6.5   20.680
   9   76.6        70.9          8.1   20.847
  10   76.7        70.2         10.0   21.110
  11   76.7        69.4         12.0   21.409

Figure 1: Plots of the model evaluation measures.

The model with six variables (Age, Ed, PE, UE2, Wealth and IncInequ) seems to have the best values of the measures.
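The same search can be reproduced outside Minitab. Below is a minimal sketch in Python with statsmodels, assuming the data sit in a hypothetical pandas DataFrame `crime_df` whose columns carry the names used above; the DataFrame name and the helper `best_subsets` are illustrative, not part of the practical.

```python
# Minimal best-subsets sketch. Assumes a hypothetical DataFrame
# `crime_df` with a response column "Crime" and the 11 predictors above.
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.api as sm

PREDICTORS = ["Age", "Ed", "PE", "PE-1", "LF", "M",
              "Pop", "UE1", "UE2", "Wealth", "IncInequ"]

def best_subsets(df: pd.DataFrame, response: str = "Crime") -> pd.DataFrame:
    y = df[response]
    n = len(df)
    # MSE of the full model is needed for Mallows' Cp.
    mse_full = sm.OLS(y, sm.add_constant(df[PREDICTORS])).fit().mse_resid
    rows = []
    for k in range(1, len(PREDICTORS) + 1):
        # Keep the best (largest R-sq) model of each size k.
        best_fit, best_vars = None, None
        for subset in combinations(PREDICTORS, k):
            fit = sm.OLS(y, sm.add_constant(df[list(subset)])).fit()
            if best_fit is None or fit.rsquared > best_fit.rsquared:
                best_fit, best_vars = fit, subset
        cp = best_fit.ssr / mse_full - (n - 2 * (k + 1))  # Mallows' Cp
        rows.append({"Vars": k,
                     "R-Sq": 100 * best_fit.rsquared,
                     "R-Sq(adj)": 100 * best_fit.rsquared_adj,
                     "Cp": cp,
                     "S": np.sqrt(best_fit.mse_resid),
                     "Model": " ".join(best_vars)})
    return pd.DataFrame(rows)
```

Plotting R-Sq(adj), Cp and S from the returned table against Vars reproduces the kind of display summarised in Figure 1.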

The regression fit for the full model.

Figure 2: There are no clear indications against the model assumptions of normality and constant variance of the errors.

The regression equation is

Crime = - 684 + 1.01 Age + 1.79 Ed + 1.62 PE - 0.67 PE-1 - 0.001 LF + 0.147 M
        - 0.035 Pop - 0.512 UE1 + 1.68 UE2 + 0.130 Wealth + 0.736 IncInequ

Predictor      Coef  SE Coef      T      P     VIF
Constant     -684.2    151.2  -4.52  0.000
Age          1.0061   0.3803   2.65  0.012   2.293
Ed           1.7872   0.6250   2.86  0.007   4.905
PE            1.624    1.027   1.58  0.123  93.572
PE-1         -0.675    1.098  -0.61  0.543  94.566
LF          -0.0006   0.1286  -0.00  0.996   2.711
M            0.1472   0.1973   0.75  0.461   3.393
Pop         -0.0350   0.1260  -0.28  0.783   2.308
UE1         -0.5116   0.3966  -1.29  0.206   5.131
UE2          1.6841   0.8153   2.07  0.046   4.758
Wealth       0.1300   0.1004   1.29  0.204   9.416
IncInequ     0.7364   0.2070   3.56  0.001   6.846

S = 21.4095   R-Sq = 76.7%   R-Sq(adj) = 69.4%
PRESS = 30579.5   R-Sq(pred) = 55.56%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression      11  52766.5  4797.0  10.47  0.000
Residual Error  35  16042.8   458.4
Total           46  68809.3

Source    DF   Seq SS
Age        1    550.8
Ed         1   7259.7
PE         1  31738.5
PE-1       1   1243.6
LF         1    225.6
M          1   1056.1
Pop        1    409.7
UE1        1      1.2
UE2        1   3697.7
Wealth     1    783.1
IncInequ   1   5800.4

The variables PE and PE-1 have very large VIFs, and Wealth and IncInequ also have somewhat enlarged VIFs. Another interesting point is that the sequential sum of squares for UE1 is very small, while for UE2 it is very large. This suggests that the unemployment of younger men is not as important for fitting the crime rate model as the unemployment of middle-aged men.
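The VIF column can be recomputed directly from the design matrix. A minimal sketch, again using the hypothetical `crime_df` and the `PREDICTORS` list from the earlier sketch:

```python
# Sketch: variance inflation factors for the full model.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(crime_df[PREDICTORS])  # design matrix with intercept
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=PREDICTORS,
)
print(vifs.round(3))  # PE and PE-1 should stand out with VIF > 90
```

(Index 0 is skipped because it corresponds to the constant column.)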

Matrix Plot

Figure 3: The matrix plot suggests that PE and PE-1 are strongly positively linearly related; Wealth and IncInequ are quite strongly negatively related; some positive relationship is also seen between UE1 and UE2. There are mild relationships among many other pairs of predictors.

The predicted R² = 55.56% is much smaller than the adjusted R² = 69.4%. This means that the model may be over-fitted.

From MINITAB's Help:

Predicted R² is used in regression analysis to indicate how well the model predicts responses for new observations, whereas R² indicates how well the model fits your data. Predicted R² can prevent overfitting the model and can be more useful than adjusted R² for comparing models because it is calculated using observations not included in model estimation. Overfitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations.

Predicted R² is calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation. Predicted R² ranges between 0 and 100% and is calculated from the PRESS statistic. Larger values of predicted R² suggest models of greater predictive ability.

For example, you work for a financial consulting firm and are developing a model to predict future market conditions. The model you settle on looks promising because it has an R² of 87%. However, when you calculate the predicted R², you see that it drops to 52%. This may indicate an overfitted model and suggests that your model will not predict new observations nearly as well as it fits your existing data.
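Predicted R² does not in fact require refitting the model n times: for a linear model the leave-one-out residual has the closed form e_i / (1 - h_ii), where h_ii is the i-th leverage. A minimal sketch, assuming `fit` is a statsmodels OLS results object:

```python
# Sketch: PRESS and predicted R-sq via the leave-one-out shortcut.
import numpy as np

def predicted_r2(fit) -> float:
    h = fit.get_influence().hat_matrix_diag        # leverages h_ii
    press = np.sum((fit.resid / (1.0 - h)) ** 2)   # PRESS statistic
    y = fit.model.endog
    sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
    return 1.0 - press / sst                       # as a proportion of 1
```

For the full model this gives 1 - 30579.5/68809.3 ≈ 0.556, matching Minitab's R-Sq(pred) = 55.56%.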

The regression fit for the model with the six best variables.

Figure 4: There are no clear indications against the model assumptions of normality and constant variance of the errors.

The regression equation is

Crime = - 619 + 1.13 Age + 1.82 Ed + 1.05 PE + 0.828 UE2 + 0.160 Wealth
        + 0.824 IncInequ

Predictor      Coef  SE Coef      T      P    VIF
Constant     -618.5    108.2  -5.71  0.000
Age          1.1252   0.3509   3.21  0.003  2.062
Ed           1.8179   0.4803   3.79  0.001  3.061
PE           1.0507   0.1752   6.00  0.000  2.876
UE2          0.8282   0.4274   1.94  0.060  1.382
Wealth      0.15956  0.09390   1.70  0.097  8.706
IncInequ     0.8236   0.1815   4.54  0.000  5.560

S = 20.8273   R-Sq = 74.8%   R-Sq(adj) = 71.0%
PRESS = 25837.8   R-Sq(pred) = 62.45%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       6  51458.2  8576.4  19.77  0.000
Residual Error  40  17351.1   433.8
Total           46  68809.3

Source    DF   Seq SS
Age        1    550.8
Ed         1   7259.7
PE         1  31738.5
UE2        1   2173.9
Wealth     1    803.0
IncInequ   1   8932.3

Unusual Observations

Obs  Age   Crime     Fit  SE Fit  Residual  St Resid
 11  124  167.40  112.86    8.24     54.54     2.85R
 29  119  104.30  142.61   11.55    -38.31    -2.21R

The variable Wealth has a rather large VIF (although not bigger than 10), and its p-value suggests that it may not be strongly significant in the presence of the other variables. Both the adjusted and predicted R² have improved, and the predicted R² = 62.45% is now a bit closer to the adjusted R² = 71.0%.
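Minitab's "Unusual Observations" table marks with an R any case whose internally studentized (standardized) residual exceeds 2 in absolute value. A minimal sketch of the same check, with `fit` a statsmodels OLS results object as before:

```python
# Sketch: flag observations with |standardized residual| > 2,
# mirroring Minitab's "R" marker.
import numpy as np

def unusual_observations(fit, threshold: float = 2.0):
    st_resid = fit.get_influence().resid_studentized_internal
    idx = np.flatnonzero(np.abs(st_resid) > threshold)  # 0-based row indices
    return idx, st_resid[idx]
```

For the six-variable fit this should pick out the two observations listed above (Minitab numbers observations from 1, so they appear here as rows 10 and 28).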

The regression fit for the model with the best five variables.

Figure 5: There is no apparent contradiction to the model assumptions of normality and constant variance.

The regression equation is

Crime = - 524 + 1.02 Age + 2.03 Ed + 1.23 PE + 0.914 UE2 + 0.635 IncInequ

Predictor      Coef  SE Coef      T      P    VIF
Constant    -524.37    95.12  -5.51  0.000
Age          1.0198   0.3532   2.89  0.006  1.998
Ed           2.0308   0.4742   4.28  0.000  2.853
PE           1.2331   0.1416   8.71  0.000  1.796
UE2          0.9136   0.4341   2.10  0.041  1.363
IncInequ     0.6349   0.1468   4.32  0.000  3.480

S = 21.3013   R-Sq = 73.0%   R-Sq(adj) = 69.7%
PRESS = 24971.3   R-Sq(pred) = 63.71%

Analysis of Variance

Source          DF     SS     MS      F      P
Regression       5  50206  10041  22.13  0.000
Residual Error  41  18604    454
Total           46  68809

Source     DF  Seq SS
Age         1     551
Ed          1    7260
PE          1   31739
UE2         1    2174
IncInequ    1    8483

Unusual Observations

Obs  Age   Crime     Fit  SE Fit  Residual  St Resid
 11  124  167.40  104.44    6.74     62.96     3.12R
 19  130   75.00  118.39    6.22    -43.39    -2.13R
 29  119  104.30  149.64   11.02    -45.34    -2.49R

Here we see that each explanatory variable is significant in the presence of the other variables, and the variance inflation factors are not large, although the VIF for IncInequ is slightly elevated. The adjusted R² = 69.7% is not much smaller than for the model with six regressors, and the predicted R² = 63.71% is still better. We may say that the crime rate in the USA in 1960 depended on the age distribution, years of schooling, police expenditure, the unemployment rate of middle-aged men and income inequality. Other variables, not included in the model, are either not relevant or are correlated with those which are in the model. For example, Wealth is not included in the model because it does not contribute much to the regression sum of squares given that IncInequ is already in the model. This does not mean, however, that the crime rate is unrelated to the variable Wealth.
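For completeness, the chosen model itself is straightforward to fit. A minimal sketch, reusing the hypothetical `crime_df` and the `predicted_r2` helper from the earlier sketches:

```python
# Sketch: fit the final five-variable model and summarise it.
import statsmodels.api as sm

final_vars = ["Age", "Ed", "PE", "UE2", "IncInequ"]
X = sm.add_constant(crime_df[final_vars])
fit = sm.OLS(crime_df["Crime"], X).fit()
print(fit.summary())                          # coefficients, t-tests, R-sq, F
print("Predicted R-sq:", predicted_r2(fit))   # helper defined earlier
```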

Question 2

2.1 The regression equation is

Crime = - 524 + 1.02 Age + 2.03 Ed + 1.23 PE + 0.914 UE2 + 0.635 IncInequ

Predictor      Coef  SE Coef      T      P    VIF
Constant    -524.37    95.12  -5.51  0.000
Age          1.0198   0.3532   2.89  0.006  1.998
Ed           2.0308   0.4742   4.28  0.000  2.853
PE           1.2331   0.1416   8.71  0.000  1.796
UE2          0.9136   0.4341   2.10  0.041  1.363
IncInequ     0.6349   0.1468   4.32  0.000  3.480

(a) The predictor PE shows a strong positive linear relationship with PE-1. This results in near-linearly dependent columns of the design matrix X, and so makes the determinant of X^T X close to zero. This inflates the standard errors of the parameter estimates and hence decreases the values of the T-statistics, which have the standard errors in their denominators. Eliminating the main near-linear column of X, namely PE-1, reduces the multicollinearity effect, so the standard errors are no longer so large; that is, the T-statistics are larger.

(b) From MINITAB:

Predicted Values for New Observations

NewObs    Fit  SE Fit          95% CI           95% PI
     1  91.70    3.12  (85.40, 97.99)  (48.22, 135.17)

Values of Predictors for New Observations

NewObs  Age   Ed    PE   UE2  IncInequ
     1  139  106  85.0  34.0       194

We may say with 95% confidence that the expected crime rate in an average state lies in (85.40, 97.99); that is, the expected number of offences per 1 million population known to the police is between 85 and 98. The 95% prediction interval is (48.22, 135.17): we predict, with 95% confidence, that the observed crime rate in an average state may be between 48 and 136 offences per 1 million population.

(c) The formulae for the PI and CI are as follows.

A 100(1 - α)% prediction interval for a new observation Y_0 at x_0 is given by

    Ŷ_0 ± t_{α/2, n-p} √( S² { 1 + x_0^T (X^T X)^{-1} x_0 } ).

A 100(1 - α)% confidence interval for the expected response E(Y_0) at x_0 is given by

    Ŷ_0 ± t_{α/2, n-p} √( S² x_0^T (X^T X)^{-1} x_0 ).

Here Ŷ_0 is the fitted response at x_0, S² is the MS_E of the fitted model, x_0 is the vector of predictor values and X is the design matrix. Also, t_{α/2, n-p} denotes the upper α/2 point of the t-distribution with n - p degrees of freedom, where n is the number of observations and p is the number of model parameters.
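The two formulae translate directly into code. A minimal sketch, with `fit` the statsmodels results object for the five-variable model and x_0 including the leading 1 for the intercept; the vector below echoes the new observation in part (b):

```python
# Sketch: 95% CI for E(Y0) and 95% PI for Y0 at a new point x0,
# computed from the formulae above.
import numpy as np
from scipy import stats

def intervals_at(fit, x0: np.ndarray, alpha: float = 0.05):
    X = fit.model.exog                            # design matrix (with constant)
    n, p = X.shape
    s2 = fit.mse_resid                            # S^2 = MS_E
    leverage = x0 @ np.linalg.inv(X.T @ X) @ x0   # x0^T (X^T X)^{-1} x0
    y0_hat = float(x0 @ np.asarray(fit.params))   # fitted response at x0
    t = stats.t.ppf(1 - alpha / 2, n - p)         # upper alpha/2 point
    half_ci = t * np.sqrt(s2 * leverage)
    half_pi = t * np.sqrt(s2 * (1 + leverage))
    return (y0_hat,
            (y0_hat - half_ci, y0_hat + half_ci),   # CI for E(Y0)
            (y0_hat - half_pi, y0_hat + half_pi))   # PI for Y0

# New observation from part (b): [1, Age, Ed, PE, UE2, IncInequ].
x0 = np.array([1.0, 139, 106, 85.0, 34.0, 194])
```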

2.2 The graph in Figure 1 can help to choose a candidate set of explanatory variables. It presents the values of the measures described above for the best model out of each subset of variables of fixed size. The point where the values of these measures level off, or attain a minimum (or maximum), indicates a potentially good set of candidate explanatory variables. This graph cannot show which variables will be significant in the fitted model or which variables may be related to each other, nor can it tell us anything about the residual diagnostics.