MBA Statistics COURSE #4

MBA Statistics 51-651-00
COURSE #4
Simple and multiple linear regression
What should be the sales of ice cream?

Example: Before building a movie theater, one must estimate the daily number of people who will enter the building. How can we estimate it? There are 2 million people in the city.

Possible solutions:
- One could conduct a local market study. However, it is often imprecise, especially for new projects.
- One could get data from similar projects in other cities.

City                  1    2    3    4    5    6    7    8    9   10
Attendance (x1000)   10   12    8   10   14   20   30   16    4   12

What do you think? Can we do better?

Probably, by taking into account the size of the city:

City                   1     2     3     4     5     6     7     8     9    10
Attendance (x1000)    10    12     8    10    14    20    30    16     4    12
Size (millions)     0.70  0.90  0.50  0.75  1.40  1.50  2.30  1.40  0.25  0.95

[Scatter plot: attendance (x1000) versus city size (millions)]
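As a rough illustration of how city size could be used, the sketch below fits a least-squares line to the ten cities above and evaluates it for a city of 2 million people. The helper name `fit_line` is hypothetical, and the straight-line form is an assumption at this point in the course; it is the model introduced formally later on.

```python
# Least-squares line for the movie theater data (values copied from the table above).
size = [0.70, 0.90, 0.50, 0.75, 1.40, 1.50, 2.30, 1.40, 0.25, 0.95]   # millions
attendance = [10, 12, 8, 10, 14, 20, 30, 16, 4, 12]                    # thousands


def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(size, attendance)
naive_guess = sum(attendance) / len(attendance)   # ignores city size
prediction = intercept + slope * 2.0              # city of 2 million people

print(f"mean attendance, ignoring size: {naive_guess:.1f}")
print(f"fitted line: attendance = {intercept:.2f} + {slope:.2f} * size")
print(f"prediction for 2 million people: {prediction:.1f}")
```

The plain average ignores the fact that the new city is larger than most of the cities in the sample, whereas the fitted line uses that information.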

Case study: Ice Cream Sales
The file icecream.xls contains pairs of data representing ice cream sales and the temperature recorded that day, for 30 days. Is there a relation between temperature and sales? Can temperature be used to predict ice cream sales? If so, what is the prediction when the temperature is 25?

Introduction
One of the principal objectives of statistics is to explain the variability that we observe in data. Linear regression (or linear models) is a widely used statistical tool for studying the presence of a linear relation between a dependent variable Y (quantitative and continuous) and one or more variables X₁, X₂, ..., Xₚ (qualitative and/or quantitative), called independent or explanatory variables.

For example, a manager could be interested in seeing whether he can explain a good part of the variability that he observes in the sales of his different branches over the last 12 months (dependent variable Y) by the area, number of employees, number of paid overtime hours, quality of customer service, number of promotions, etc. (independent or explanatory variables).

A regression model can be used to meet one of the following three objectives:
- Describe (data coming from non-experimental studies, i.e. we observe reality as it is).
- Test hypotheses (data coming from controlled experimental studies).
- Predict (if we like to take risks!).

Example: We want to know which factors influence or determine the value of a property, and we want to build a model that helps us evaluate this value from those factors. To do this, we have obtained the total value for a sample of 79 properties in a given region. The following variables have also been collected for each property:

Brief glimpse of the data file house.xls:

OBS   Total value   Land value   # of acres   # of square feet (first floor)   Outdoor condition   Heating type
1          199657        63247        1.63                             1726   Good                NatGas
2           78482        38091        0.495                            1184   Good                NatGas
3          119962        37665        0.375                            1014   Good                Electric
4          116492        54062        0.981                            1260   Average             Electric
5          131263        61546        1.14                             1314   Average             NatGas
...
78         253480        57948        0.862                            1720   Good                Electric
79         257037        57489        0.95                             2004   Excellnt            Electric

OBS   # of rooms   # of bedrooms   # of completed bathrooms   # of non-completed bathrooms   # of fire-places   Garage
1              8               4                          2                              1                  2   Garage
2              6               2                          1                              0                  0   NoGarage
3              7               3                          2                              0                  1   Garage
4              6               3                          2                              0                  1   Garage
5              8               4                          2                              1                  2   NoGarage
...
78            10               5                          5                              1                  1   Garage
79             9               4                          2                              2                  2   Garage

Is there a link between the total value and the different factors?

[Scatter plot: Total versus Land]

[Scatter plots: Total versus Acre, Sq.Feet, Rooms, and Bedroom]

[Scatter plots: Total versus Completed Bathrooms, Bathrooms, Fire-place, and Garage]

The Pearson correlation coefficient r measures the strength of the linear relation between two quantitative variables. It takes values between -1 and 1. If a perfect linear relation exists between X and Y, then r = ±1 (r = 1 if X and Y vary in the same direction and r = -1 if X varies in the opposite direction of Y). If r = 0, there is no linear link between X and Y. The further r is from 0 and the closer it is to ±1, the stronger the linear link between X and Y.

[Three scatter plots of Y versus X, for X from 4 to 14: r = 0.035 (no visible linear pattern), r = 1 (points exactly on an increasing line), and r = -1 (points exactly on a decreasing line)]
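The definition of r can be made concrete with a few lines of code. This is a minimal sketch: the function name `pearson_r` and the two sets of Y values are hypothetical, chosen only to reproduce the two extreme cases r = 1 and r = -1 over the same X range as the plots.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


x = list(range(4, 15))                    # X = 4, 5, ..., 14

# Hypothetical Y values lying exactly on a line.
y_increasing = [2 * xi + 3 for xi in x]
y_decreasing = [-2.5 * xi + 2 for xi in x]

r_up = pearson_r(x, y_increasing)
r_down = pearson_r(x, y_decreasing)
print(f"increasing line: r = {r_up:.3f}")
print(f"decreasing line: r = {r_down:.3f}")
```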

Descriptive statistics

Variable     N     Mean   Median   Std.Deviation   Minimum   Maximum
Total       79   187253   156761           84401     74365    453744
Land        79    65899    59861           22987     35353    131224
Acre        79    1.579    1.040           1.324     0.290     5.880
Sq.Feet     79     1678     1628             635       672      3501
Rooms       79    8.519    8.000           2.401         5        18
Bedrooms    79    3.987    4.000           1.266         2         8
C.Bathro    79    2.241    2.000           1.283         1         7
Bathro      79   0.7215    1.000           0.715         0         3
Fire-pl.    79    1.975    2.000           1.368         0         7

Pearson Correlation Coefficients

            Total    Land    Acre  Sq.Feet   Rooms  Bedroom  C.Bathro  Bathro
Land        0.815
Acre        0.608   0.918
Sq.Feet     0.767   0.516   0.301
Rooms       0.626   0.518   0.373    0.563
Bedrooms    0.582   0.497   0.382    0.431   0.791
C.Bathro    0.626   0.506   0.376    0.457   0.479    0.586
Bathro      0.436   0.236   0.074    0.354   0.489    0.166     0.172
Fire-pl.    0.548   0.497   0.391    0.365   0.394    0.400     0.486   0.386

BE CAREFUL! It is important to interpret the correlation coefficient together with the graph. r = 0.816 in all cases below.

[Four scatter plots, Y1, Y2, Y3, and Y4 versus X, with very different shapes but the same correlation r = 0.816]

Simple linear regression
To describe a linear relation between two quantitative variables, or to predict Y for a given value of X, we use a regression line:

Y = β₀ + β₁X + ε

Since any statistical model is only an approximation (the best possible one, we hope!) and because the linear link is never perfect, the model always contains an error term, denoted ε. If there were a perfect linear relation between Y and X, the error term would always be equal to 0, and all the variability of Y would be explained by the independent variable X.

So, for a given value of X, we would like to estimate Y. Using the sample data, we estimate the regression model parameters β₀ and β₁ so as to minimize the sum of squares of the residuals (errors). The squared correlation coefficient is called the coefficient of determination; it is the percentage of the variability of Y explained by X:

R² = 1 - [(n-2)/(n-1)] (S_e/S_y)²

where S_e is the standard deviation of the errors and S_y is the standard deviation of Y.

We can also use the adjusted coefficient of determination to indicate the percentage of the variability of Y explained by X:

adjusted R² = 1 - (S_e/S_y)²

Simple linear regression example:

MODEL 1. Regression Analysis

The regression equation is
Total = 16209 + 102 Sq.Feet

Predictor       Coef    StDev       T       P
Constant       16209    17447    0.93   0.356
Sq.Feet      101.939    9.734   10.47   0.000

S = 54556   R-Sq = 58.8%   R-Sq(adj) = 58.2%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         1   3.26460E+11   3.26460E+11   109.68   0.000
Residual Error    77   2.29181E+11    2976374177
Total             78   5.55641E+11
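The two formulas for the coefficient of determination given earlier can be checked against this output. The sketch below plugs in S = 54556 from the Model 1 output and the standard deviation of Total (84401) from the descriptive statistics; the variable names are illustrative only.

```python
n = 79          # number of properties in the sample
s_e = 54556     # S in the Model 1 output: standard deviation of the errors
s_y = 84401     # standard deviation of Total (descriptive statistics)

ratio = (s_e / s_y) ** 2
r2 = 1 - (n - 2) / (n - 1) * ratio    # coefficient of determination
r2_adjusted = 1 - ratio               # adjusted coefficient of determination

print(f"R-Sq      = {100 * r2:.1f}%")
print(f"R-Sq(adj) = {100 * r2_adjusted:.1f}%")
```

Both values agree with the R-Sq and R-Sq(adj) lines of the output, up to rounding.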

MODEL 2.

The regression equation is
Total = - 347 + 22021 Rooms

Predictor       Coef    StDev       T       P
Constant        -347    27621   -0.01   0.990
Rooms          22021     3122    7.05   0.000

S = 66210   R-Sq = 39.3%   R-Sq(adj) = 38.5%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         1   2.18090E+11   2.18090E+11   49.75   0.000
Residual Error    77   3.37551E+11    4383775699
Total             78   5.55641E+11

MODEL 3.

The regression equation is
Total = 32428 + 38829 Bedrooms

Predictor       Coef    StDev       T       P
Constant       32428    25826    1.26   0.213
Bedrooms       38829     6177    6.29   0.000

S = 69056   R-Sq = 33.9%   R-Sq(adj) = 33.1%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         1   1.88445E+11   1.88445E+11   39.52   0.000
Residual Error    77   3.67196E+11    4768775127
Total             78   5.55641E+11

Model 1: total value = 16209 + 102*(# of square feet). R² = 58.8%. Thus 58.8% of the variability of the total value is explained by the # of square feet.

Model 2: total value = -347 + 22021*(# of rooms). R² = 39.3%. Thus 39.3% of the variability of the total value is explained by the # of rooms.

Model 3: total value = 32428 + 38829*(# of bedrooms). R² = 33.9%. Thus 33.9% of the variability of the total value is explained by the # of bedrooms.

Which one of the 3 previous models would you choose and why?
Model 1, because it has the largest value of R².

1-α confidence interval for the mean of the values of Y for a specific value of X:
For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
est. total value = 16 209 + 101.939*1500 = 169 117$
95% confidence interval for the mean of the total value for properties of 1500 sq.ft: [156 418, 181 817], as calculated by CI-regression.xls.

1-α confidence interval for a new value of Y (prediction) given a specific value of X:
For model 1 and a value of X = 1500 sq.ft, we obtain the following point estimate:
est. total value = 16 209 + 101.939*1500 = 169 117$
95% confidence interval for a predicted total value when the area of the first floor is 1500 sq.ft: [59 742, 278 492]
The confidence interval for a predicted value is always wider than the one for the mean of Y at the same X.
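The course obtains these intervals from CI-regression.xls, whose internals are not shown. The sketch below is one way to reconstruct them from the printed Model 1 output using the standard textbook formulas for simple regression; it is not necessarily what the spreadsheet does. The critical value `t_crit` is an assumption (approximate 97.5th percentile of Student's t with 77 degrees of freedom), and S_xx is recovered from the relation StDev(b1) = S / sqrt(S_xx).

```python
from math import sqrt

n = 79
b0, b1 = 16209, 101.939   # Model 1 coefficients
s = 54556                 # S from the Model 1 output
se_b1 = 9.734             # StDev of the Sq.Feet coefficient
x_bar = 1678              # mean of Sq.Feet (descriptive statistics)
t_crit = 1.9913           # ASSUMPTION: approx. t quantile 0.975 with n - 2 = 77 df

x0 = 1500
y_hat = b0 + b1 * x0

s_xx = (s / se_b1) ** 2                   # since StDev(b1) = S / sqrt(S_xx)
core = 1 / n + (x0 - x_bar) ** 2 / s_xx

half_mean = t_crit * s * sqrt(core)       # half-width, mean of Y at x0
half_pred = t_crit * s * sqrt(1 + core)   # half-width, new value of Y at x0

ci_mean = (y_hat - half_mean, y_hat + half_mean)
ci_pred = (y_hat - half_pred, y_hat + half_pred)

print(f"point estimate          : {y_hat:,.0f}")
print(f"95% CI for the mean     : [{ci_mean[0]:,.0f}, {ci_mean[1]:,.0f}]")
print(f"95% prediction interval : [{ci_pred[0]:,.0f}, {ci_pred[1]:,.0f}]")
```

The extra "1 +" under the square root is what makes the prediction interval wider: it accounts for the variability of an individual property around the mean.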

Inference on regression model parameters:
If there is no linear link between Y and X, then β₁ = 0. So we want to test the following hypotheses:

H₀: β₁ = 0 vs H₁: β₁ ≠ 0

We reject H₀ when the p-value is too small. This test is valid if:
- the relation between X and Y is linear;
- the data are independent;
- the variance of Y is the same for every value of X;
- Y has a normal distribution for every value of X, or the sample size n is large.

Multiple linear regression
It is more likely that the variability of the dependent variable Y will be explained not by a single independent variable X, but by a linear combination of several independent variables X₁, X₂, ..., Xₚ. In this case, the multiple regression model is given by:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Again, using the sample data, we estimate the regression model parameters β₀, β₁, ..., βₚ so as to minimize the sum of squares of the residuals (errors).

The squared multiple correlation coefficient R², also called the coefficient of determination, represents the percentage of the variability of Y explained by the independent variables X₁, X₂, ..., Xₚ. When we add one or more independent variables to the model, R² increases (it can never decrease). The question is whether R² increases to a significant degree. Note that we cannot have more independent variables in the model than there are observations in the sample (general rule: n ≥ 5p).

Example:

MODEL 1.

The regression equation is
Total = - 89131 + 3.05 Land - 20730 Acre + 43.3 Sq.Feet - 4352 Rooms + 10049 Bedroom + 7606 C.Bathro + 18725 Bathro + 882 Fire-pl.

Predictor       Coef    StDev       T       P
Constant      -89131    18302   -4.87   0.000
Land          3.0518   0.5260    5.80   0.000
Acre          -20730     7907   -2.62   0.011
Sq.Feet       43.336    7.670    5.65   0.000
Rooms          -4352     3036   -1.43   0.156
Bedroom        10049     5307    1.89   0.062
C.Bathro        7606     3610    2.11   0.039
Bathro         18725     6585    2.84   0.006
Fire-pl.         882     3184    0.28   0.783

S = 29704   R-Sq = 88.9%   R-Sq(adj) = 87.6%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         8   4.93877E+11   61734659810   69.97   0.000
Residual Error    70   61763515565     882335937
Total             78   5.55641E+11

MODEL 2. Regression Analysis

The regression equation is
Total = - 97512 + 3.11 Land - 21880 Acre + 40.2 Sq.Feet + 4411 Bedroom + 8466 C.bathro + 14328 Bathro

Predictor       Coef    StDev       T       P
Constant      -97512    17466   -5.58   0.000
Land          3.1103   0.5236    5.94   0.000
Acre          -21880     7884   -2.78   0.007
Sq.Feet       40.195    7.384    5.44   0.000
Bedroom         4411     3469    1.27   0.208
C.bathro        8466     3488    2.43   0.018
Bathro         14328     5266    2.72   0.008

S = 29763   R-Sq = 88.5%   R-Sq(adj) = 87.6%

Analysis of Variance
Source            DF            SS            MS       F       P
Regression         6   4.91859E+11   81976430646   92.54   0.000
Residual Error    72   63782210167     885864030
Total             78   5.55641E+11

MODEL 3. Regression Analysis

The regression equation is
Total = - 90408 + 3.20 Land - 22534 Acre + 41.1 Sq.Feet + 10234 C.bathro + 14183 Bathro

Predictor       Coef    StDev       T       P
Constant      -90408    16618   -5.44   0.000
Land          3.2045   0.5205    6.16   0.000
Acre          -22534     7901   -2.85   0.006
Sq.Feet       41.060    7.383    5.56   0.000
C.bathro       10234     3213    3.19   0.002
Bathro         14183     5287    2.68   0.009

S = 29889   R-Sq = 88.3%   R-Sq(adj) = 87.5%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         5   4.90426E+11   98085283380   109.80   0.000
Residual Error    73   65214377146     893347632
Total             78   5.55641E+11

Model without the area of the land (# of acres), because of its multicollinearity with the land value.

MODEL 4.

The regression equation is
Total = - 55533 + 1.82 Land + 49.8 Sq.Feet + 11696 C.bathro + 18430 Bathro

Predictor       Coef    StDev       T       P
Constant      -55533    11783   -4.71   0.000
Land          1.8159   0.1929    9.42   0.000
Sq.Feet       49.833    7.028    7.09   0.000
C.bathro       11696     3321    3.52   0.001
Bathro         18430     5312    3.47   0.001

S = 31297   R-Sq = 87.0%   R-Sq(adj) = 86.3%

Analysis of Variance
Source            DF            SS            MS        F       P
Regression         4   4.83160E+11   1.20790E+11   123.32   0.000
Residual Error    74   72481137708     979474834
Total             78   5.55641E+11
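The R-Sq, R-Sq(adj), and F values in these outputs all follow from the Analysis of Variance tables. The sketch below recomputes them from the printed sums of squares, assuming the usual definitions (R² = SSR/SST, adjusted R² = 1 - MSE/(SST/(n-1)), F = MSR/MSE); the function name `fit_summary` is hypothetical.

```python
def fit_summary(ss_regression, ss_error, n, p):
    """Return (R2, adjusted R2, F) for a model with p explanatory variables."""
    ss_total = ss_regression + ss_error
    ms_error = ss_error / (n - p - 1)
    r2 = ss_regression / ss_total
    r2_adj = 1 - ms_error / (ss_total / (n - 1))
    f_stat = (ss_regression / p) / ms_error
    return r2, r2_adj, f_stat


n = 79
# (SS Regression, SS Residual Error, number of explanatory variables)
anova = {
    "MODEL 1": (4.93877e11, 61763515565, 8),
    "MODEL 2": (4.91859e11, 63782210167, 6),
    "MODEL 3": (4.90426e11, 65214377146, 5),
    "MODEL 4": (4.83160e11, 72481137708, 4),
}

results = {name: fit_summary(ssr, sse, n, p) for name, (ssr, sse, p) in anova.items()}
for name, (r2, r2_adj, f_stat) in results.items():
    print(f"{name}: R-Sq = {100 * r2:.1f}%  R-Sq(adj) = {100 * r2_adj:.1f}%  F = {f_stat:.2f}")
```

Dropping variables from Model 1 to Model 4 lowers R-Sq by about two percentage points, while the adjusted value, which penalizes the number of variables, moves even less.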

Which one of the 4 previous models would you choose and why?
Probably model 4, because all the independent variables are significant at the 5% level (i.e. for each β in the model, p-value < 5%) and, although R² is smaller, it is only marginally smaller. Moreover, all the model coefficients make «sense»!
In model 1, the variables # of rooms and # of fire-places are not statistically significant at the 5% level (p-value > 5%). The variable # of bedrooms is borderline, with a p-value = 0.0624.

Which one of the 4 previous models would you choose and why? (continued)
In model 2, the variable # of bedrooms is not statistically significant at the 5% level.
In model 3 (and the previous models), the coefficient of the variable # of acres is negative, which is contrary to «common sense», to what we observed in the scatter plot, and to the positive Pearson correlation coefficient (r = 0.608).
In models 1 to 3, the negative coefficient for the variable # of acres is due to the strong linear relation between the value of the land and the area of the land (r = 0.918): a multicollinearity problem.

Multicollinearity
If two or more explanatory variables are strongly correlated (> 0.85 in absolute value), one says that there is multicollinearity. It influences the estimation of the parameters in the model. If two explanatory variables are highly correlated, one can drop one of them: because of the strong correlation, the second variable contributes little once the first is in the model. The correlations between several pairs of variables can be calculated in Excel using Correlation in the Data Analysis toolbox.
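The 0.85 rule can be applied mechanically to a correlation table. The sketch below does so for a few pairs of explanatory variables copied from the Pearson table of the house data; the dictionary layout and the name `flagged` are illustrative choices, not part of the course material.

```python
# Correlations between explanatory variables, copied from the Pearson table.
correlations = {
    ("Land", "Acre"): 0.918,
    ("Land", "Sq.Feet"): 0.516,
    ("Acre", "Sq.Feet"): 0.301,
    ("Rooms", "Bedrooms"): 0.791,
    ("Bedrooms", "C.Bathro"): 0.586,
}

THRESHOLD = 0.85   # rule of thumb stated above

# Keep only the pairs whose correlation exceeds the threshold in absolute value.
flagged = [(pair, r) for pair, r in correlations.items() if abs(r) > THRESHOLD]

for (var_a, var_b), r in flagged:
    print(f"{var_a} and {var_b}: r = {r} -> candidate multicollinearity")
```

Only the pair Land and Acre is flagged, which is the pair removed from consideration in Model 4.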

How can we choose a particular linear regression model among all the possible ones? There are several techniques:
- Step-by-step selection, adding one variable at a time, starting with the most significant one (stepwise, forward).
- Selection starting from the model in which all the variables are included and removing one variable at a time, starting with the least significant (backward).
- Construct all possible models and choose the best subset of variables according to specific criteria (e.g. adjusted R², Mallows' Cp).

Example of selection among the best subsets:

Best Subsets Regression: Response is Total

Candidate variables: Land, Acre, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
[In the original table, an X in each variable's column marks the variables included in each subset; the column positions of the X marks are not reproduced here.]

Vars   R-Sq   Adj. R-Sq     C-p       s
1      66.4        65.9   136.8   49262
1      58.8        58.2   184.7   54556
1      39.3        38.5   307.6   66210
2      82.7        82.2    35.9   35564
2      78.8        78.3    60.3   39343
2      74.4        73.7    88.1   43244
3      85.6        85.0    19.5   32637
3      84.8        84.2    24.5   33521
3      84.8        84.2    24.9   33591
4      87.1        86.4    12.2   31115
4      87.0        86.3    13.1   31297
4      86.6        85.9    15.2   31682
5      88.3        87.5     6.9   29889
5      87.6        86.7    11.2   30744
5      87.4        86.5    12.4   30979
6      88.5        87.6     7.3   29763
6      88.3        87.3     8.6   30030
6      88.3        87.3     8.9   30096
7      88.9        87.8     7.1   29510
7      88.6        87.4     9.1   29924
7      88.3        87.2    10.6   30240
8      88.9        87.6     9.0   29704

Selection of the model without the variable # of acres

Best Subsets Regression: Response is Total

Candidate variables: Land, Sq.Feet, Rooms, Bedroom, C.Bathro, Bathro, Fire-pl.
[In the original table, an X in each variable's column marks the variables included in each subset; the column positions of the X marks are not reproduced here.]

Vars   R-Sq   Adj. R-Sq     C-p       s
1      66.4        65.9   120.6   49262
1      58.8        58.2   164.9   54556
1      39.3        38.5   278.3   66210
2      82.7        82.2    27.6   35564
2      72.7        71.9    86.0   44704
2      72.5        71.8    86.8   44813
3      84.8        84.2    17.2   33521
3      84.8        84.2    17.6   33591
3      84.0        83.3    22.3   34467
4      87.0        86.3     6.9   31297
4      86.1        85.3    12.1   32352
4      85.3        84.5    16.5   33226
5      87.3        86.4     6.9   31100
5      87.0        86.1     8.5   31439
5      87.0        86.1     8.9   31509
6      87.8        86.8     6.1   30707
6      87.3        86.3     8.7   31264
6      87.0        85.9    10.5   31656
7      87.8        86.6     8.0   30908

The selection of the best model is based on a combination of criteria:
- The greatest value of R² adjusted for the number of variables in the model.
- The smallest value of Cp.
- Among models with comparable adjusted R² and Cp, the model that makes the most «common sense» according to the experts in the field.
- Among models with comparable adjusted R² and Cp, the model whose independent variables are the easiest and least expensive to measure.
- The validity of the model.
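The course does not spell out how Cp is computed. Assuming the usual definition of Mallows' Cp, Cp = SSE_p / MSE_full - (n - 2(p + 1)), where p is the number of explanatory variables in the subset and MSE_full comes from the model with all candidate variables, the C-p column of the first best-subsets table can be reproduced from the ANOVA tables shown earlier. The function name `mallows_cp` is hypothetical.

```python
def mallows_cp(sse_subset, p, mse_full, n):
    """Mallows' Cp for a subset with p explanatory variables plus a constant."""
    return sse_subset / mse_full - (n - 2 * (p + 1))


n = 79
mse_full = 882335937   # MS Residual Error of the 8-variable model

# SS Residual Error taken from the Analysis of Variance tables.
cp_8_vars = mallows_cp(61763515565, 8, mse_full, n)   # all 8 variables
cp_5_vars = mallows_cp(65214377146, 5, mse_full, n)   # Land, Acre, Sq.Feet, C.bathro, Bathro
cp_4_vars = mallows_cp(72481137708, 4, mse_full, n)   # Land, Sq.Feet, C.bathro, Bathro

print(f"8 variables: Cp = {cp_8_vars:.1f}")
print(f"5 variables: Cp = {cp_5_vars:.1f}")
print(f"4 variables: Cp = {cp_4_vars:.1f}")
```

By construction the full model always has Cp equal to its number of parameters (here 9), so a subset is attractive when its Cp is small and close to p + 1.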

1-α confidence interval for the mean of Y and for a new value of Y (prediction), given a specific combination of values for X₁, X₂, ..., Xₚ.
For model 4 and a property with land value = 65 000$, sq.ft = 1500, 2 completed bathrooms and 1 non-completed bathroom, we obtain the following point estimate:
est. total value = -55 533 + 1.816*65 000 + 49.833*1 500 + 11 696*2 + 18 430*1 ≈ 179 074$
95% confidence interval for the mean of the total value: [170 842, 187 306]
95% confidence interval for a predicted total value: [116 173, 241 974]
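The point estimate is simply the regression equation evaluated at the property's characteristics. A minimal sketch, using the Model 4 coefficients as printed in the output (so the result can differ from the quoted figure by a few dollars of rounding); the dictionary names are illustrative.

```python
# Model 4 coefficients as printed in the regression output.
coefficients = {
    "Constant": -55533,
    "Land": 1.8159,
    "Sq.Feet": 49.833,
    "C.bathro": 11696,
    "Bathro": 18430,
}

# Characteristics of the property to evaluate.
house = {"Land": 65000, "Sq.Feet": 1500, "C.bathro": 2, "Bathro": 1}

estimate = coefficients["Constant"] + sum(
    coefficients[name] * value for name, value in house.items()
)
print(f"estimated total value: {estimate:,.0f}$")
```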

Notes:
- For a 1500 sq.ft property, the multiple regression model gives narrower 95% confidence intervals than the simple regression model. The addition of several other variables to the model therefore helped to better explain the variability of the total value and to improve our estimates.
- If two or more independent variables are correlated, we say that there is multicollinearity. This can influence the value of the parameters in the model. Also, if two independent variables are strongly correlated, then only one of the two should be included in the model, the other one bringing very little additional information.
- Certain conditions are required for the validity of the model and the corresponding inference (similar to those for simple linear regression).

Dummy variables
How can one take into account qualitative information in a regression?
Application: Test on two or more means

Trick
If a qualitative variable takes two values, one defines a single dummy variable taking values 0 or 1.
Examples:
Sex: 1 if male, 0 otherwise.
Garage: 1 if garage, 0 if not.

Trick (continued)
More generally, if a qualitative variable can take m values, one defines (m-1) dummy variables, all taking values 0 or 1.
Example: Sex and job category (executive, white-collar, blue-collar)
X₁ = 1 if male, 0 otherwise.
X₂ = 1 if executive, 0 otherwise.
X₃ = 1 if white-collar, 0 otherwise.

Example
One wants to explain the salary of an employee (Y) with the following variables: sex, job category, and experience.
X₁ = 1 if male, 0 otherwise.
X₂ = 1 if executive, 0 otherwise.
X₃ = 1 if white-collar, 0 otherwise.
X₄ = years of experience.

Example (continued)
Regression model:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + ε

Question: Interpret β₀, β₁, β₂, β₃, β₄. How do we know whether women have a lower salary?
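The (m-1) coding rule can be sketched as a small helper that turns each employee into the row (X₁, X₂, X₃, X₄) used by the model above. The function names and the three employees are hypothetical; the last category of each list (female, blue-collar) plays the role of the baseline, coded with zeros only.

```python
def dummy_code(value, categories):
    """Return m-1 dummies (0/1) for a variable with m categories.

    The last category in the list is the baseline: all its dummies are 0.
    """
    return [1 if value == category else 0 for category in categories[:-1]]


SEX = ["male", "female"]                                # m = 2 -> 1 dummy (X1)
JOB = ["executive", "white-collar", "blue-collar"]      # m = 3 -> 2 dummies (X2, X3)


def design_row(sex, job, experience):
    """Build [X1, X2, X3, X4] for one employee."""
    return dummy_code(sex, SEX) + dummy_code(job, JOB) + [experience]


# Hypothetical employees, for illustration only.
rows = [
    design_row("male", "executive", 12),
    design_row("female", "blue-collar", 5),
    design_row("female", "white-collar", 8),
]
for row in rows:
    print(row)
```

With this coding, an employee whose dummies are all 0 is a female blue-collar worker, so every dummy coefficient is read as a difference relative to that baseline group at equal experience.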

P-value for one-tailed tests in Excel
In general, the p-value for a one-tailed alternative hypothesis H₁ is not given; only the p-value of the two-tailed test is. For example, in regression, Excel calculates the p-value P corresponding to H₀: βᵢ = 0 vs H₁: βᵢ ≠ 0. How can we calculate the p-value corresponding to a one-tailed hypothesis H₁?

Rules: Let P be the p-value for the two-tailed test.
- If H₁ is of the form βᵢ > 0 and bᵢ > 0, then the p-value of the right-tailed test is P/2. Otherwise it is 1 - P/2.
- If H₁ is of the form βᵢ < 0 and bᵢ < 0, then the p-value of the left-tailed test is P/2. Otherwise it is 1 - P/2.
In other words, the one-tailed p-value is half of the two-tailed p-value when the estimated coefficient has the same sign as the coefficient in H₁. Otherwise, it is 1 - P/2.

Question: One wants to know if having a garage increases the total value of the property. The hypotheses to be tested should be:

H₀: β_garage ≤ 0 vs H₁: β_garage > 0

Since b_garage = 22372 > 0, the p-value corresponding to H₁: β_garage > 0 is 0.058/2 = 0.029 < 0.05. The answer is yes, because we reject H₀ in favor of H₁. Does the decision depend on the coding?

If the dummy had been defined as 0 if there is a garage and 1 otherwise, we would have obtained:

Total = - 72080 + 1.83 Land + 47.2 Sq.Feet + 11535 C.Bathro + 18899 Bathro - 22372 Garage

Predictor       Coef    StDev       T       P
Constant      -72080    14175   -5.08   0.000
Land          1.8342   0.1892    9.69   0.000
Sq.Feet       47.175    7.013    6.73   0.000
C.Bathro       11535     3256    3.54   0.001
Bathro         18899     5211    3.63   0.001
Garage        -22372    11116   -2.01   0.058

S = 30671   R-Sq = 87.6%   R-Sq(adj) = 86.8%

In that case, the right choice of hypotheses would have been:

H₀: β_garage ≥ 0 vs H₁: β_garage < 0

The corresponding p-value stays 0.029 = 0.058/2, because b_garage = -22372 < 0 has the same sign as β_garage in H₁.
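The rule for converting a two-tailed p-value into a one-tailed one fits in a short function. This is a sketch of the rule as stated above; the function name `one_tailed_p` and its argument names are hypothetical. It is applied to the garage coefficient under both codings.

```python
def one_tailed_p(two_tailed_p, estimate, alternative):
    """One-tailed p-value from a two-tailed one.

    alternative: '>' for H1: beta > 0, '<' for H1: beta < 0.
    """
    if alternative not in (">", "<"):
        raise ValueError("alternative must be '>' or '<'")
    if alternative == ">":
        same_sign = estimate > 0
    else:
        same_sign = estimate < 0
    # Half of P when the estimate points in the direction of H1, else 1 - P/2.
    return two_tailed_p / 2 if same_sign else 1 - two_tailed_p / 2


p_coding_1 = one_tailed_p(0.058, 22372, ">")    # Garage = 1 if garage
p_coding_2 = one_tailed_p(0.058, -22372, "<")   # Garage = 1 if no garage
p_wrong_side = one_tailed_p(0.058, -22372, ">") # estimate contradicts H1

print(f"coding 1: {p_coding_1:.3f}")
print(f"coding 2: {p_coding_2:.3f}")
print(f"estimate on the wrong side of H1: {p_wrong_side:.3f}")
```

Both codings lead to the same p-value, so the decision does not depend on how the dummy is coded, provided H₁ is written consistently with the coding.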

Comparison of several means
Suppose one wants to compare the respective means of a quantitative variable Y for two groups: µ₁ = mean of group 1, µ₂ = mean of group 2. One can use regression by defining X = 1 for group 1 and X = 0 for group 2. In this case, the slope is β = µ₁ - µ₂.

Hypothesis H₁: µ₁ > µ₂ corresponds to H₁: β > 0.
Hypothesis H₁: µ₁ < µ₂ corresponds to H₁: β < 0.
Hypothesis H₁: µ₁ ≠ µ₂ corresponds to H₁: β ≠ 0.
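The identity between the slope and the difference of the two group means can be checked numerically. In the sketch below the measurements are hypothetical, invented only to illustrate the identity; the least-squares slope and intercept are computed by hand.

```python
# Hypothetical measurements, for illustration only.
group1 = [5.4, 5.1, 5.8, 5.5]   # coded X = 1
group2 = [5.0, 5.3, 4.9, 5.2]   # coded X = 0

x = [1] * len(group1) + [0] * len(group2)
y = group1 + group2

# Least-squares fit of Y on the dummy variable X.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)

print(f"slope     = {slope:.3f}   mean1 - mean2 = {mean1 - mean2:.3f}")
print(f"intercept = {intercept:.3f}   mean2         = {mean2:.3f}")
```

The intercept equals the mean of the group coded 0 and the slope equals the difference between the two means, which is why a test on β is a test on µ₁ - µ₂.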

Example
A manager has some doubts about the (positive) effect of a course intended to improve the speed at which employees perform a given task. To check his belief, he asked a technician to choose 10 employees at random and to measure the time (in hours) each took to complete a task. The same employees then attended the course. After the course, the employees had to perform a similar task. The results are summarized in the following table: manager.xls

Questions:
a) Should the company maintain the training program? Take α = 5%.
b) The technician in charge of the measurements forgot to identify the employees on the measurement form. What is the conclusion using that data set?
Unfortunately, case b) is based on a real case.

Solution
For situation a), the data are paired and we have to check whether the differences «Before - After» are significantly positive. The p-value is 0.0003 < 0.05 = α. We reject H₀ in favor of H₁, and the manager concludes that the program should be maintained.

In the second case, the data are not paired. One can use regression with Y = time of execution, X = 1 for measurements before the course, and X = 0 for measurements after the course. In that case, the right choice for H₁ is:

H₁: β > 0

Results are given by:

            Coefficients   Standard Error       t Stat       P-value
Intercept          5.217      0.129989316   40.1340677    4.5838E-19
X                  0.244      0.183832653   1.32729412    0.20100167

Since H₁: β > 0 (which is equivalent to H₁: µ_before > µ_after) and b = 0.244 > 0, the p-value is 0.201/2 = 0.1005 > 0.05. We do not reject H₀, so the training program should not be maintained. This is a very good example of the consequence of the greater variability of two independent samples compared with a paired sample.

Remark: Comparing several means
If one needs to compare the means of k groups for some variable Y, one can also use regression. For i = 1, 2, ..., k-1, set:
Xᵢ = 1 for group i, 0 otherwise.
Then β₀ = mean of group k = µₖ and βᵢ = µᵢ - µₖ, 1 ≤ i ≤ k-1.

Therefore, the regression test where H₀ is given by

H₀: β₁ = β₂ = ... = βₖ₋₁ = 0

is equivalent to a test where H₀ is given by

H₀: µ₁ = µ₂ = ... = µₖ

If H₀ is rejected, then we conclude that at least two means are different.