Model Building Chap 5 p251

Models with one qualitative variable, 5.7 p277

Example 4
Colours: Blue, Green, Lemon Yellow and White

Row  Blue  Green  Lemon  Insects trapped
  1     0      0      1        45
  2     0      0      1        59
  3     0      0      1        48
  4     0      0      1        46
  5     0      0      1        38
  6     0      0      1        47
  7     0      0      0        21
  8     0      0      0        12
  9     0      0      0        14
 10     0      0      0        17
 11     0      0      0        13
 12     0      0      0        17
 13     0      1      0        37
 14     0      1      0        32
 15     0      1      0        15
 16     0      1      0        25
 17     0      1      0        39
 18     0      1      0        41
 19     1      0      0        16
 20     1      0      0        11
 21     1      0      0        20
 22     1      0      0        21
 23     1      0      0        14
 24     1      0      0         7

Descriptive Statistics: Insects

Variable  Colour  N  N*   Mean
Insects   B       6   0  14.83
          G       6   0  31.50
          L       6   0  47.17
          W       6   0  15.67

The regression equation is
Insects trapped = 15.7 - 0.83 Blue + 15.8 Green + 31.5 Lemon

Predictor    Coef  StDev      T      P
Constant   15.667  2.770   5.66  0.000
Blue       -0.833  3.917  -0.21  0.834
Green      15.833  3.917   4.04  0.001
Lemon      31.500  3.917   8.04  0.000

S = 6.784   R-Sq = 82.1%   R-Sq(adj) = 79.4%

Analysis of Variance

Source          DF      SS      MS      F      P
Regression       3  4218.5  1406.2  30.55  0.000
Residual Error  20   920.5    46.0

Estimate the betas using the means (descriptive statistics).

State whether the following statements are true or false.
a) The value of the F-statistic for testing any differences among the colours is 30.55.
b) We have evidence at p < 0.01 that the means for green and white are different.
c) We have evidence at p < 0.01 that the means for blue and white are different.
d) A 95% confidence interval for the difference between the means for lemon yellow and white is (23.3, 39.7).
e) We may say that 82.1% of the variation in the number of insects trapped has been accounted for by the above model.
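The exercise of estimating the betas from the means can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes White is the baseline level (all three dummies equal 0), uses the insect counts from the data display, and takes the tabulated critical value t(0.025, 20) = 2.086 as a constant. The variable names are illustrative.

```python
from statistics import mean

# Insect counts by trap colour, copied from the data display (6 traps each).
counts = {
    "White": [21, 12, 14, 17, 13, 17],
    "Blue": [16, 11, 20, 21, 14, 7],
    "Green": [37, 32, 15, 25, 39, 41],
    "Lemon": [45, 59, 48, 46, 38, 47],
}

means = {colour: mean(values) for colour, values in counts.items()}

# White is the baseline, so the constant is the White mean and each
# coefficient is the difference between a colour mean and the White mean.
b0 = means["White"]
betas = {colour: means[colour] - b0 for colour in ("Blue", "Green", "Lemon")}

# 95% CI for the Lemon - White difference, using StDev = 3.917 from the
# regression output and the tabulated value t(0.025, 20) = 2.086 (assumed).
T_CRIT = 2.086
SE_LEMON = 3.917
ci_lemon = (
    betas["Lemon"] - T_CRIT * SE_LEMON,
    betas["Lemon"] + T_CRIT * SE_LEMON,
)

print(f"Constant = {b0:.3f}")
for colour, beta in betas.items():
    print(f"{colour:6s} = {beta:.3f}")
print(f"95% CI for Lemon - White: ({ci_lemon[0]:.1f}, {ci_lemon[1]:.1f})")
```

The differences of means reproduce the Coef column of the Minitab output, which is what the exercise is designed to show.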

Models with two qualitative variables, 5.8 p282

Data Display

Row  C1    perform  F2  F3  B2  F2B2  F3B2  F   B
  1  F1B1       65   0   0   0     0     0  F1  B1
  2  F1B1       73   0   0   0     0     0  F1  B1
  3  F1B1       68   0   0   0     0     0  F1  B1
  4  F1B2       36   0   0   1     0     0  F1  B2
  5  F2B1       78   1   0   0     0     0  F2  B1
  6  F2B1       82   1   0   0     0     0  F2  B1
  7  F2B2       50   1   0   1     1     0  F2  B2
  8  F2B2       43   1   0   1     1     0  F2  B2
  9  F3B1       48   0   1   0     0     0  F3  B1
 10  F3B1       46   0   1   0     0     0  F3  B1
 11  F3B2       61   0   1   1     0     1  F3  B2
 12  F3B2       62   0   1   1     0     1  F3  B2

Example 5.10 p286

Main effects model

The regression equation is
perform = 64.5 + 6.70 F2 - 2.30 F3 - 15.8 B2

Predictor     Coef  StDev      T      P
Constant    64.455  7.180   8.98  0.000
F2           6.705  9.941   0.67  0.519
F3          -2.295  9.941  -0.23  0.823
B2         -15.818  8.291  -1.91  0.093

S = 13.75   R-Sq = 36.2%   R-Sq(adj) = 12.3%

Analysis of Variance

Source          DF      SS     MS     F      P
Regression       3   858.3  286.1  1.51  0.284
Residual Error   8  1512.4  189.1
Total           11  2370.7

Source  DF  Seq SS
F2       1    92.0
F3       1    78.1
B2       1   688.1
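The indicator columns in the data display follow a fixed coding rule, sketched below under the assumption that F1 and B1 are the baseline levels and that each interaction column is the product of the corresponding main-effect dummies. The function name `dummy_code` is hypothetical and not part of the original notes.

```python
def dummy_code(fuel: str, brand: str) -> dict:
    """Return the 0/1 indicator columns for one observation.

    Baseline levels are F1 and B1 (all indicators 0), matching the
    data display. Interaction columns are products of the main dummies.
    """
    f2 = int(fuel == "F2")
    f3 = int(fuel == "F3")
    b2 = int(brand == "B2")
    return {"F2": f2, "F3": f3, "B2": b2, "F2B2": f2 * b2, "F3B2": f3 * b2}


# Reproduce the indicator columns for the six fuel-by-brand cells.
for fuel in ("F1", "F2", "F3"):
    for brand in ("B1", "B2"):
        print(fuel + brand, dummy_code(fuel, brand))
```

The main effects model uses only F2, F3 and B2; the interaction model adds F2B2 and F3B2.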

Interaction model

Descriptive Statistics: perform

Variable  C1    N  N*    Mean  StDev
perform   F1B1  3   0   68.67   4.04
          F1B2  1   0  36.000      *
          F2B1  2   0   80.00   2.83
          F2B2  2   0   46.50   4.95
          F3B1  2   0   47.00   1.41
          F3B2  2   0  61.500  0.707

The regression equation is
perform = 68.7 + 11.3 F2 - 21.7 F3 - 32.7 B2 - 0.83 F2B2 + 47.2 F3B2

Predictor     Coef  StDev      T      P
Constant    68.667  1.939  35.42  0.000
F2          11.333  3.066   3.70  0.010
F3         -21.667  3.066  -7.07  0.000
B2         -32.667  3.878  -8.42  0.000
F2B2        -0.833  5.130  -0.16  0.876
F3B2        47.167  5.130   9.19  0.000

S = 3.358   R-Sq = 97.1%   R-Sq(adj) = 94.8%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       5  2303.00  460.60  40.84  0.000
Residual Error   6    67.67   11.28
Total           11  2370.67

Source  DF  Seq SS
F2       1   92.04
F3       1   78.13
B2       1  688.09
F2B2     1  491.30
F3B2     1  953.44
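Because the interaction model has one parameter per cell, its coefficients can be recovered from the six cell means in the descriptive statistics. The sketch below is an illustration under that assumption, with F1B1 as the baseline cell; the dictionary and variable names are hypothetical.

```python
# Cell means of perform, copied from the descriptive statistics.
cell_mean = {
    ("F1", "B1"): 68.67, ("F1", "B2"): 36.00,
    ("F2", "B1"): 80.00, ("F2", "B2"): 46.50,
    ("F3", "B1"): 47.00, ("F3", "B2"): 61.50,
}

base = cell_mean[("F1", "B1")]

coef = {
    # Constant: mean of the baseline cell F1B1.
    "Constant": base,
    # Fuel effects, measured at brand B1.
    "F2": cell_mean[("F2", "B1")] - base,
    "F3": cell_mean[("F3", "B1")] - base,
    # Brand effect, measured at fuel F1.
    "B2": cell_mean[("F1", "B2")] - base,
}
# Interactions: how much the B2 effect at F2 (or F3) differs from the
# B2 effect at F1.
coef["F2B2"] = (cell_mean[("F2", "B2")] - cell_mean[("F2", "B1")]) - coef["B2"]
coef["F3B2"] = (cell_mean[("F3", "B2")] - cell_mean[("F3", "B1")]) - coef["B2"]

for name, value in coef.items():
    print(f"{name:8s} {value:8.2f}")
```

The results agree with the Coef column of the Minitab output up to the rounding of the printed means.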

[Figure: Interaction Plot - Data Means for perform. Mean of perform against F (F1, F2, F3), with one line for each level of B (B1, B2).]

Estimate the regression equation using the means (descriptive statistics).

Test whether there is an interaction between brand and fuel type.
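One way to test for interaction is a nested-model F-test that compares the main effects model (reduced) with the interaction model (complete). The sketch below uses the residual sums of squares and degrees of freedom from the two Minitab outputs; the critical value F(0.05; 2, 6) = 5.14 is taken from standard tables and is an assumption of this sketch, as is the choice of a 5% level.

```python
# Residual sums of squares and error degrees of freedom from the outputs.
sse_reduced, df_reduced = 1512.4, 8    # main effects model
sse_complete, df_complete = 67.67, 6   # interaction model

# Number of interaction terms being tested (F2B2 and F3B2).
num_df = df_reduced - df_complete

# F = [(SSE_R - SSE_C) / number of terms tested] / MSE_C
mse_complete = sse_complete / df_complete
f_stat = ((sse_reduced - sse_complete) / num_df) / mse_complete

# Tabulated critical value F(0.05; 2, 6), assumed from standard tables.
F_CRIT = 5.14

print(f"F = {f_stat:.2f} on ({num_df}, {df_complete}) df")
print("Reject H0 of no interaction" if f_stat > F_CRIT else "Do not reject H0")
```

The numerator sum of squares, 1512.4 - 67.67, matches the sum of the sequential sums of squares for F2B2 and F3B2 (491.30 + 953.44) up to rounding.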

Variable Screening methods, Chap 6 p321

Stepwise regression p323

A hospital surgical unit was interested in predicting the survival times of patients undergoing a particular type of liver operation. A random sample of patients was available for analysis. From each patient record, the following information was extracted from the preoperation evaluation:

X1 = blood clotting score
X2 = prognostic index
X3 = enzyme function test score
X4 = liver function test score
X5 = age in years
X6 = indicator variable for gender (0 = M, 1 = F)
X7 and X8 = indicator variables for history of alcohol use (categorical: none, moderate, severe)
     X7 = indicator of moderate
     X8 = indicator of severe

Data Display

Row   X1  X2   X3    X4  X5  X6  X7  X8     Y    lny
  1  6.7  62   81  2.59  50   0   1   0   695  6.544
  2  5.1  59   66  1.70  39   0   0   0   403  5.999
  3  7.4  57   83  2.16  55   0   0   0   710  6.565
  4  6.5  73   41  2.01  48   0   0   0   349  5.854
  5  7.8  65  115  4.30  45   0   0   1  2343  7.759
  6  5.8  38   72  1.42  65   1   1   0   348  5.852
  7  5.7  46   63  1.91  49   1   0   1   518  6.25
...
 50  3.9  82  103  4.55  50   0   1   0  1078  6.983
 51  6.6  77   46  1.95  50   0   1   0   405  6.005
 52  6.4  85   40  1.21  58   0   0   1   579  6.361
 53  6.4  59   85  2.33  63   0   1   0   550  6.310
 54  8.8  78   72  3.20  56   0   0   0   651  6.478

The regression equation is
Y = - 1149 + 62.4 X1 + 8.97 X2 + 9.89 X3 + 50.4 X4 - 0.95 X5 + 15.9 X6 + 7.7 X7 + 321 X8

Predictor     Coef  StDev      T      P
Constant   -1148.8  242.3  -4.74  0.000
X1           62.39  24.47   2.55  0.014
X2           8.973  1.874   4.79  0.000
X3           9.888  1.742   5.68  0.000
X4           50.41  44.96   1.12  0.268
X5          -0.951  2.649  -0.36  0.721
X6           15.87  58.47   0.27  0.787
X7            7.71  64.96   0.12  0.906
X8          320.70  85.07   3.77  0.000

S = 201.4   R-Sq = 78.2%   R-Sq(adj) = 74.3%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       8  6543615  817952  20.16  0.000
Residual Error  45  1825906   40576
Total           53  8369521

[Figure: Residuals Versus the Fitted Values (response is Y). Residual against Fitted Value.]

[Figure: Normal Probability Plot of the Residuals (response is Y). Normal Score against Residual.]

[Figure: Histogram of the Residuals (response is Y). Frequency against Residual.]

The regression equation is
lny = 4.05 + 0.0685 X1 + 0.0135 X2 + 0.0150 X3 + 0.0080 X4 - 0.00357 X5 + 0.0842 X6 + 0.0579 X7 + 0.388 X8

Predictor       Coef     StDev      T      P
Constant      4.0505    0.2518  16.09  0.000
X1           0.06851   0.02542   2.70  0.010
X2          0.013452  0.001947   6.91  0.000
X3          0.014954  0.001809   8.26  0.000
X4           0.00802   0.04671   0.17  0.865
X5         -0.003566  0.002752  -1.30  0.202
X6           0.08421   0.06075   1.39  0.173
X7           0.05786   0.06748   0.86  0.396
X8           0.38838   0.08838   4.39  0.000

S = 0.2093   R-Sq = 84.6%   R-Sq(adj) = 81.9%

Analysis of Variance

Source          DF       SS      MS      F      P
Regression       8  10.8370  1.3546  30.93  0.000
Residual Error  45   1.9707  0.0438
Total           53  12.8077

[Figure: Residuals Versus the Fitted Values (response is lny). Residual against Fitted Value.]

[Figure: Normal Probability Plot of the Residuals (response is lny). Normal Score against Residual.]

[Figure: Histogram of the Residuals (response is lny). Frequency against Residual.]

Stepwise Regression

F-to-Enter: 4.00   F-to-Remove: 4.00

Response is lny on 8 predictors, with N = 54

Step           1       2       3       4
Constant   5.264   4.351   4.291   3.852

X3        0.0151  0.0154  0.0145  0.0155
T-Value     6.23    8.19    9.33   11.07

X2                0.0141  0.0149  0.0142
T-Value             5.98    7.68    8.20

X8                         0.429   0.353
T-Value                     5.08    4.57

X1                                 0.073
T-Value                             3.86

S          0.375   0.291   0.238   0.211
R-Sq       42.76   66.33   77.80   82.99
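The F-to-Enter and F-to-Remove thresholds can be related to the printed T-Values, since for a single added predictor the partial F statistic equals the square of its t statistic. The short sketch below checks each step of the output against the 4.00 thresholds; the dictionary names are illustrative.

```python
F_TO_ENTER = 4.00
F_TO_REMOVE = 4.00

# T-Value of each predictor at the step where it entered the model.
t_at_entry = {"X3": 6.23, "X2": 5.98, "X8": 5.08, "X1": 3.86}

# T-Values of all predictors in the final (step 4) model.
t_final = {"X3": 11.07, "X2": 8.20, "X8": 4.57, "X1": 3.86}

# For one predictor, partial F = t squared.
f_at_entry = {name: t ** 2 for name, t in t_at_entry.items()}
f_final = {name: t ** 2 for name, t in t_final.items()}

for name, f_value in f_at_entry.items():
    print(f"{name}: F at entry = {f_value:.2f}, enters: {f_value > F_TO_ENTER}")

# No predictor in the final model falls below F-to-Remove, so none is dropped.
kept = all(f_value > F_TO_REMOVE for f_value in f_final.values())
print("All predictors retained at step 4:", kept)
```

The procedure stops after step 4, which implies that none of the remaining predictors (X4 to X7) reached the F-to-Enter threshold.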

Minitab commands for stepwise regression


All possible Regressions Selection Procedure (6.3) p327

R-sq Criterion:

$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

Response is lny

Vars  R-Sq  R-Sq(adj)    C-p        S
   1  42.8       41.7  117.4  0.37549
   1  42.2       41.0  119.2  0.37746
   1  22.1       20.6  177.9  0.43807
   1  13.9       12.2  201.8  0.46052
   1   6.1        4.3  224.7  0.48101
   2  66.3       65.0   50.5  0.29079
   2  59.9       58.4   69.1  0.31715
   2  54.9       53.1   84.0  0.33668
   2  51.6       49.7   93.4  0.34850
   2  50.8       48.9   95.9  0.35157
   3  77.8       76.5   18.9  0.23845
   3  75.7       74.3   25.0  0.24934
   3  71.8       70.1   36.5  0.26885
   3  68.1       66.2   47.3  0.28587
   3  67.6       65.7   48.7  0.28802
   4  83.0       81.6    5.8  0.21087
   4  81.4       79.9   10.3  0.22023
   4  78.9       77.2   17.8  0.23498
   4  78.4       76.6   19.3  0.23785
   4  78.0       76.2   20.4  0.23982
   5  83.7       82.1    5.5  0.20827
   5  83.6       81.9    6.0  0.20931
   5  83.3       81.6    6.8  0.21100
   5  83.2       81.4    7.2  0.21193
   5  81.8       79.9   11.3  0.22044
   6  84.3       82.3    5.8  0.20655
   6  83.9       81.9    7.0  0.20934
   6  83.9       81.8    7.2  0.20964
   6  83.8       81.8    7.2  0.20982
   6  83.7       81.6    7.6  0.21066
   7  84.6       82.3    7.0  0.20705
   7  84.4       82.0    7.7  0.20867
   7  84.0       81.6    8.7  0.21081
   7  84.0       81.5    8.9  0.21136
   7  82.1       79.4   14.3  0.22306
   8  84.6       81.9    9.0  0.20927

(The X marks showing which of X1 to X8 enter each model lost their column positions in extraction and are omitted. Vars gives the number of predictors in each model.)
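As a quick check of the R-sq criterion, the two equivalent forms of R-sq can be evaluated with the sums of squares from the Analysis of Variance table for the full lny model (Regression 10.8370, Residual Error 1.9707, Total 12.8077). This is an illustrative sketch with hypothetical variable names.

```python
# Sums of squares from the ANOVA table of the full model for lny.
SSR = 10.8370   # Regression
SSE = 1.9707    # Residual Error
SST = 12.8077   # Total

# The two equivalent forms of the R-sq criterion.
r2_from_ssr = SSR / SST
r2_from_sse = 1 - SSE / SST

print(f"SSR/SST     = {100 * r2_from_ssr:.1f}%")
print(f"1 - SSE/SST = {100 * r2_from_sse:.1f}%")
```

Both forms give the 84.6% reported for the 8-variable model, since SSR + SSE = SST.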

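Best Subsets Regression

Response is lny

Vars  R-Sq  R-Sq(adj)    C-p        S
   1  42.8       41.7  117.4  0.37549
   2  66.3       65.0   50.5  0.29079
   3  77.8       76.5   18.9  0.23845
   4  83.0       81.6    5.8  0.21087
   5  83.7       82.1    5.5  0.20827
   6  84.3       82.3    5.8  0.20655
   7  84.6       82.3    7.0  0.20705
   8  84.6       81.9    9.0  0.20927

(The X marks showing which predictors enter each model lost their column positions in extraction. Matching R-Sq and S against the stepwise output identifies the 1- to 4-variable models as X3; X2, X3; X2, X3, X8; and X1, X2, X3, X8. The 8-variable model contains all predictors. The 5-, 6- and 7-variable sets are not recoverable.)

[Figure: R-sq against number of variables (vars), 1 to 8.]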

Ex: Response is crimes

Vars  R-Sq  R-Sq(adj)   C-p      S
   1  75.4       75.3  23.6  39995
   2  78.3       78.1  -0.2  37660
   3  78.4       78.1   1.0  37671
   4  78.5       78.0   2.6  37732
   5  78.5       78.0   4.1  37784
   6  78.5       77.9   6.1  37875
   7  78.5       77.8   8.0  37968

(The seven predictor names were printed vertically above the X marks and are not recoverable from the extraction.)

[Figure: R-sq against number of variables (Vars), 1 to 7.]

Other Criteria

2) R-sq (Adj):

$R^2_{adj} = 1 - \frac{MSE}{SST/(n-1)}$

3) C p criterion p328:

$C_p = \frac{SSE_p}{MSE_k} + 2(p+1) - n$

The C p criterion selects as the best model the subset model with
1) a small value of C p
2) a value of C p near p + 1 (p is the number of predictors)
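The two formulas can be checked against the Minitab output for lny. The sketch below assumes the usual reading of the symbols, which the notes do not spell out: SSE_p is the error sum of squares of the subset model with p predictors, and MSE_k is the mean square error of the full model with all k = 8 predictors. The function names are hypothetical.

```python
def adjusted_r2(mse: float, sst: float, n: int) -> float:
    """R-sq (Adj) = 1 - MSE / (SST / (n - 1))."""
    return 1 - mse / (sst / (n - 1))


def cp_criterion(sse_p: float, mse_k: float, p: int, n: int) -> float:
    """C p = SSE_p / MSE_k + 2(p + 1) - n."""
    return sse_p / mse_k + 2 * (p + 1) - n


# Values from the ANOVA table of the full (8-predictor) model for lny.
N = 54
SST = 12.8077
SSE_FULL = 1.9707
MSE_FULL = 0.0438

adj_full = adjusted_r2(MSE_FULL, SST, N)
cp_full = cp_criterion(SSE_FULL, MSE_FULL, p=8, n=N)
print(f"R-Sq(adj), full model: {100 * adj_full:.1f}%")
print(f"C p, full model:       {cp_full:.1f}")

# C p values from the Best Subsets output, keyed by number of predictors.
best_subsets_cp = {1: 117.4, 2: 50.5, 3: 18.9, 4: 5.8,
                   5: 5.5, 6: 5.8, 7: 7.0, 8: 9.0}

# Distance of each C p from its target value p + 1.
gaps = {p: round(cp - (p + 1), 1) for p, cp in best_subsets_cp.items()}
for p, gap in gaps.items():
    print(f"p = {p}: C p = {best_subsets_cp[p]:6.1f}, C p - (p + 1) = {gap:6.1f}")
```

For the full model C p always equals k + 1, so the second rule carries no information there. Among the smaller models, the 4- and 5-variable subsets combine a small C p with a value close to p + 1.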

MINITAB commands
