UNIT 11 MULTIPLE LINEAR REGRESSION


Hannah Thompson
5 years ago

Structure
11.1 Introduction
     Objectives
11.2 Multiple Linear Regression Model
11.3 Estimation of Model Parameters
     Use of Matrix Notation
     Properties of Least Squares Estimates
11.4 Test of Significance in Multiple Regression
11.5 Coefficient of Determination ($R^2$) and Adjusted $R^2$
11.6 Regression with Dummy Variables
11.7 Summary
11.8 Solutions/Answers

11.1 INTRODUCTION

In previous units, we discussed the linear relationship between the dependent variable Y and an independent variable X. The coefficients a and b were unknown and, for the given data on Y and X, we obtained the least squares estimates of these parameters, i.e., $\hat{a}$ and $\hat{b}$. We also went through the inferential study to examine whether or not there exists a significant linear relationship between Y and X. We discussed the simple linear regression model, estimated the model parameters and determined their standard errors.

In this unit, we discuss the multiple linear regression model along with the estimation of its parameters in Secs. 11.2 and 11.3. In multiple linear regression, the basic concept is the same as that of simple regression. However, instead of one independent variable, there are several independent variables, say, $X_1, X_2, X_3, \dots, X_p$. For example, the number of units sold by a car manufacturing company per year may depend not only on one independent variable such as price, but also on mileage per unit of fuel, appearance of the car, comfort level, durability, money spent on advertising, etc. Here we may like to identify the important independent variables, which contribute more to the variation in the dependent variable. For this purpose, a mathematical relationship between the dependent and independent variables is established and this relation is further used for prediction.

We also discuss the inferential study in multiple linear regression in Sec. 11.4. Since the model may involve several independent variables affecting the dependent variable, it may be of interest to assess their importance by estimating the regression coefficients along with their standard errors.
The adequacy of model fit may be examined through the overall fit of the model with the help of the coefficient of determination ($R^2$). In this unit, we also discuss a method for calculating $R^2$ and adjusted $R^2$ in Sec. 11.5. Regression analysis with dummy variables is discussed in Sec. 11.6.

In the next unit, we shall discuss how to calculate the extra sum of squares explained by the regressor variables on the response variable. We shall also discuss methods of selecting the important regressor variables, which play an important role in the selection of the best-fitted models.

Objectives

After studying this unit, you should be able to:
  explain the concept of multiple linear regression;
  formulate a multiple linear regression model;
  estimate the regression coefficients and their standard errors;
  calculate the coefficient of determination ($R^2$) and adjusted $R^2$; and
  predict the dependent variable for given values of independent variables.

11.2 MULTIPLE LINEAR REGRESSION MODEL

In this section, we generalise the simple regression model considered in Unit 9. We assumed in Unit 9 that $(Y_1, X_1), (Y_2, X_2), \dots, (Y_n, X_n)$ are n pairs of values. The equation of the simple linear regression model may be written as

$$Y = a + bX + e$$

where e represents the error term, which arises from the difference between the observed Y and the straight line $Y = a + bX$. To estimate a and b, we use the method of least squares, which minimises the sum of squares of the errors. From the above equation, we may write the simple regression model as

$$Y_i = a + bX_i + e_i, \qquad i = 1, 2, \dots, n$$

for the sample data of n pairs $(Y_i, X_i)$, $i = 1, 2, \dots, n$.

In agriculture, the crop yield depends on more than one variable such as fertility of the soil, amount of rainfall, amount of fertilisers, etc. A multiple regression model that might describe this relationship is

$$Y = B_0 + B_1X_1 + B_2X_2 + B_3X_3 + e$$

where Y denotes the yield, $X_1$ denotes the fertility of soil, $X_2$ denotes the rainfall and $X_3$ denotes the amount of fertilisers used. This is called the multiple linear regression model with three independent/regressor variables. The term linear is used because the dependent/response variable Y is a linear function of the unknown parameters $B_0$, $B_1$, $B_2$ and $B_3$.

In general, the response variable may be related to p regressors or independent variables. Let Y be the dependent variable and $X_1, X_2, \dots, X_p$ be p independent variables.
Then the multiple regression model can be written as:

$$Y = B_0 + B_1X_1 + B_2X_2 + \dots + B_pX_p + e \qquad (1)$$

The parameters $B_0, B_1, \dots, B_p$ are called the regression coefficients. $B_0$ is the intercept, and each $B_j$ ($j = 1, 2, \dots, p$) represents the expected change in the response variable Y per unit change in $X_j$ when the remaining regressor variables are held constant. For the sake of simplicity, we shall attach a dummy variable $X_0$ to the intercept $B_0$; $X_0$ takes the value 1 for all observations. Now the model in equation (1) can be written as:

$$Y = B_0X_0 + B_1X_1 + B_2X_2 + \dots + B_pX_p + e \qquad (2)$$

The simple regression model considered in Unit 9 becomes a particular case of this model with $X_0 = 1$, $B_0 = a$, $B_1 = b$ and $B_i = 0$ ($i \ge 2$). The interpretation of the coefficients $B_j$ ($j = 1, 2, \dots, p$) is that $B_j$ represents the amount of change in Y for a unit change in $X_j$, keeping the other independent variables $X_k$ ($k \ne j$) fixed. These coefficients are known as partial regression coefficients because the effect of one independent variable on the dependent variable is studied while the other variables are held fixed or constant. We use the term multiple linear regression for this model because two or more variables are included in the regression analysis and the parameters $B_0, B_1, \dots, B_p$ appear in a linear form. Moreover, the effect of these variables can be studied jointly.

Here $X_i$ can be any continuous function of a variable X, such as $\log X$, $X^2$, $X^3$, etc. However, it is necessary that the equation is linear in the parameters. Let us consider a polynomial model

$$Y = B_0 + B_1X + B_2X^2 + \dots + B_pX^p + e$$

If we let $X_1 = X$, $X_2 = X^2$, $X_3 = X^3$ and so on, the above model can be written in the linear form given in equation (1).

As in the case of simple linear regression (Unit 9), here too we assume that e is normally and independently distributed with mean zero and constant variance $\sigma^2$.

11.3 ESTIMATION OF MODEL PARAMETERS

Recall that in Unit 9, we estimated the parameters a and b of a simple linear regression equation using the method of least squares. In this method, we minimise the sum of the squares of the differences between the observed values and their expected values, i.e., the sum of squares of the error terms. We also use the method of least squares to estimate the regression coefficients given in equation (1).

Let the number of observations be n ($n > p$). Let $y_i$ denote the i-th observed value of Y and $x_{ji}$ denote the i-th observation of the regressor variable $X_j$. The data is represented as given in the table below:

  Response Variable Y    Regressor Variables
                         X1      X2      ...     Xp
  y1                     x11     x21     ...     xp1
  y2                     x12     x22     ...     xp2
  y3                     x13     x23     ...     xp3
  ...                    ...     ...     ...     ...
  yn                     x1n     x2n     ...     xpn

Then the multiple regression model for the i-th observation $Y_i$ can be written as:

$$Y_i = B_0 + B_1X_{1i} + B_2X_{2i} + \dots + B_pX_{pi} + e_i, \qquad i = 1, 2, \dots, n$$

where $X_{1i}, X_{2i}, \dots, X_{pi}$ are the corresponding values of the p independent variables, $B_0$ is the intercept, and $B_1, B_2, \dots, B_p$ are the p regression coefficients corresponding to the independent variables $X_1, X_2, \dots, X_p$, respectively.
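As a rough illustration of the model for a single observation, the sketch below evaluates the systematic part $B_0X_0 + B_1X_1 + \dots + B_pX_p$ with $X_0 = 1$. It shows that a unit change in one regressor changes the result by its partial regression coefficient, and that a polynomial in a single X has the same linear form. The function names and all numbers are hypothetical and are not taken from this unit.

```python
def design_row(xs):
    """Prefix the dummy regressor X0 = 1 to one observation's regressor values."""
    return [1.0] + [float(x) for x in xs]


def systematic_part(coeffs, xs):
    """Return B0*X0 + B1*X1 + ... + Bp*Xp for one observation (no error term)."""
    row = design_row(xs)
    if len(row) != len(coeffs):
        raise ValueError("need one coefficient per regressor plus the intercept")
    return sum(b * x for b, x in zip(coeffs, row))


def polynomial_regressors(x, degree):
    """Turn a single X into the regressors X, X^2, ..., X^degree."""
    return [x ** d for d in range(1, degree + 1)]


# Hypothetical coefficients B0, B1, B2, B3 (for illustration only).
B = [2.0, 0.5, -1.0, 0.25]

# Raising X1 by one unit, with X2 and X3 held fixed, changes the
# systematic part by exactly B1.
base = systematic_part(B, [4, 3, 8])
shifted = systematic_part(B, [5, 3, 8])
print(shifted - base)  # 0.5

# A cubic in X = 2 is the same linear form with X1 = X, X2 = X^2, X3 = X^3.
print(systematic_part(B, polynomial_regressors(2, 3)))  # 2 + 1 - 4 + 2 = 1.0
```

The model stays linear because the coefficients enter only as multipliers of known regressor values, whatever functions of X those regressors are.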

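We now minimise $\sum e_i^2$, the sum of squares of errors in the model given in equation (2):

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - B_0X_{0i} - B_1X_{1i} - \dots - B_pX_{pi}\right)^2 \qquad (3)$$

with respect to $B_0, B_1, \dots, B_p$ to obtain their least squares estimates. For estimating the model parameters, we differentiate E with respect to $B_0, B_1, B_2, \dots, B_p$, respectively, and equate the results to zero. If we differentiate E with respect to $B_j$, we obtain the j-th normal equation:

$$\frac{\partial E}{\partial B_j} = -2\sum_{i=1}^{n}\left(Y_i - B_0X_{0i} - B_1X_{1i} - \dots - B_pX_{pi}\right)X_{ji} = 0, \qquad j = 0, 1, 2, \dots, p \qquad (4)$$

Simplifying equation (4) and using $X_{0i} = 1$, we obtain the least squares normal equations:

$$\begin{aligned}
nB_0 + B_1\sum X_{1i} + B_2\sum X_{2i} + \dots + B_p\sum X_{pi} &= \sum Y_i \\
B_0\sum X_{1i} + B_1\sum X_{1i}^2 + B_2\sum X_{1i}X_{2i} + \dots + B_p\sum X_{1i}X_{pi} &= \sum X_{1i}Y_i \\
B_0\sum X_{2i} + B_1\sum X_{1i}X_{2i} + B_2\sum X_{2i}^2 + \dots + B_p\sum X_{2i}X_{pi} &= \sum X_{2i}Y_i \\
&\;\;\vdots \\
B_0\sum X_{pi} + B_1\sum X_{pi}X_{1i} + B_2\sum X_{pi}X_{2i} + \dots + B_p\sum X_{pi}^2 &= \sum X_{pi}Y_i
\end{aligned} \qquad (5)$$

where all sums run over $i = 1, 2, \dots, n$. These are p + 1 normal equations and can be solved using the methods of solving simultaneous linear equations. The solutions of the normal equations, called the least squares estimates, are $\hat{B}_0, \hat{B}_1, \hat{B}_2, \dots, \hat{B}_p$, respectively.

For simplicity, we shall rewrite the model in equation (1) by centralising the independent variables $X_1, X_2, \dots, X_p$, i.e., by taking differences from their means:

$$\begin{aligned}
Y &= B_0 + B_1\bar{X}_1 + \dots + B_p\bar{X}_p + B_1(X_1 - \bar{X}_1) + \dots + B_p(X_p - \bar{X}_p) + e \\
  &= B_0'X_0 + B_1(X_1 - \bar{X}_1) + \dots + B_p(X_p - \bar{X}_p) + e
\end{aligned}$$

where $B_0' = B_0 + B_1\bar{X}_1 + \dots + B_p\bar{X}_p$. Here $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p$ are the means of the p independent/regressor variables. With this, the normal equations become

$$\frac{\partial E}{\partial B_j} = -2\sum_{i=1}^{n}\left[Y_i - B_0'X_{0i} - B_1(X_{1i} - \bar{X}_1) - \dots - B_p(X_{pi} - \bar{X}_p)\right](X_{ji} - \bar{X}_j) = 0 \qquad (6)$$

Note that $X_{0i} = 1$ for all i. The coefficients $B_1, B_2, \dots, B_p$ remain the same, but the intercept changes from $B_0$ to $B_0'$. Since the deviations from the means sum to zero, the normal equation for $B_0'$ gives $\hat{B}_0' = \bar{Y}$. Once we have obtained the estimates of $B_0', B_1, B_2, \dots, B_p$, we can obtain $\hat{B}_0$ from the following equation:

$$\hat{B}_0 = \hat{B}_0' - \hat{B}_1\bar{X}_1 - \dots - \hat{B}_p\bar{X}_p \qquad (7)$$

Let us consider an application of these results.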
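Before turning to a worked example, the centred normal equations for the special case of two regressors can be sketched as follows. The code forms the sums of squares and products of the deviations, solves the two equations in $B_1$ and $B_2$ by Cramer's rule, and recovers the intercept from the means as in equation (7). The function name and the data are hypothetical; the responses are generated exactly as $1 + 2X_1 + 3X_2$ so that the answer is known in advance.

```python
def fit_two_regressors(x1, x2, y):
    """Least squares fit of Y = B0 + B1*X1 + B2*X2 using centred regressors."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]          # X1 as deviations from its mean
    d2 = [b - m2 for b in x2]          # X2 as deviations from its mean

    # Sums of squares and products in the centred normal equations.
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, y))
    s2y = sum(b * c for b, c in zip(d2, y))

    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("X1 and X2 are linearly related; no unique solution")

    # Cramer's rule for  s11*B1 + s12*B2 = s1y  and  s12*B1 + s22*B2 = s2y.
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det

    # The centred intercept estimate is the mean of Y; subtracting the
    # mean terms gives the original intercept, as in equation (7).
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical data, generated without error from Y = 1 + 2*X1 + 3*X2.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

b0, b1, b2 = fit_two_regressors(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the data contain no error term, the estimates should reproduce the generating coefficients up to floating-point rounding.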

Example 1: A statistical analyst is analysing the vending machine routes in a distribution system. He/she is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. The company manager responsible for the study has suggested that the two most important variables affecting the delivery time Y (in minutes) are (i) the number of cases ($X_1$) and (ii) the distance travelled (in m) by the route driver ($X_2$). The delivery time data collected by the statistical analyst is given below:

  Time (Y)    No. of Cases (X1)    Distance (X2)

Check whether there is a linear relationship between Y (time) and the two independent variables $X_1$ (number of cases) and $X_2$ (distance). Calculate the values of the regression coefficients and fit the regression equation.

Solution: To find the values of the regression coefficients and fit the regression equation for the given data, we form the following table:

  Time (Y)    No. of Cases (X1)    Distance (X2)    Y^2    X1^2    X2^2    X1*Y    X2*Y    X1*X2

The column totals are $\sum Y_i$, $\sum X_{1i}$, $\sum X_{2i} = 35$, $\sum Y_i^2 = 38$, $\sum X_{1i}^2 = 95$, $\sum X_{2i}^2 = 5$, $\sum X_{1i}Y_i = 85$, $\sum X_{2i}Y_i = 68$ and $\sum X_{1i}X_{2i} = 35$.

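On putting p = 2 in the normal equations (5) and noting that $X_{0i} = 1$, we get

$$\begin{aligned}
n\hat{B}_0 + \hat{B}_1\sum X_{1i} + \hat{B}_2\sum X_{2i} &= \sum Y_i \\
\hat{B}_0\sum X_{1i} + \hat{B}_1\sum X_{1i}^2 + \hat{B}_2\sum X_{1i}X_{2i} &= \sum X_{1i}Y_i \\
\hat{B}_0\sum X_{2i} + \hat{B}_1\sum X_{1i}X_{2i} + \hat{B}_2\sum X_{2i}^2 &= \sum X_{2i}Y_i
\end{aligned}$$

Putting the values calculated in the table into these equations gives three equations, labelled (i), (ii) and (iii), in $\hat{B}_0$, $\hat{B}_1$ and $\hat{B}_2$. From equation (i), we have

$$\hat{B}_0 = \frac{1}{n}\left(\sum Y_i - \hat{B}_1\sum X_{1i} - \hat{B}_2\sum X_{2i}\right) \qquad \text{(iv)}$$

On putting this value of $\hat{B}_0$ in equations (ii) and (iii) and simplifying, we get

$$14\hat{B}_1 + 28\hat{B}_2 = 22 \qquad \text{(v)}$$
$$28\hat{B}_1 + 587\hat{B}_2 = 116 \qquad \text{(vi)}$$

On solving equations (v) and (vi), we get $\hat{B}_1 = 1.3002$ and $\hat{B}_2 = 0.1356$, and equation (iv) then gives $\hat{B}_0 = 1.8765$. Hence, the fitted equation is

$$\hat{Y} = 1.8765 + 1.3002\,X_1 + 0.1356\,X_2$$

So we can conclude that there is a linear relationship between Y (time in minutes) and the two independent variables $X_1$ (number of cases) and $X_2$ (distance). As the regression coefficients for both variables are positive, delivery time increases with each of them. The numerical value of the regression coefficient $\hat{B}_1$ associated with $X_1$ is higher than the value of $\hat{B}_2$ associated with $X_2$. It shows that a unit increase in the number of cases affects the delivery time more than a unit increase in the distance travelled; this comparison depends on the units in which the two variables are measured.

11.3.1 Use of Matrix Notation

When p is greater than 2, it is more convenient to write the normal equations in matrix form. The regression equations in matrix notation can be written as:

$$Y = XB + e \qquad (8)$$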

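where

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{p1} \\ 1 & X_{12} & \cdots & X_{p2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{pn} \end{pmatrix}, \quad
B = \begin{pmatrix} B_0 \\ B_1 \\ \vdots \\ B_p \end{pmatrix} \quad \text{and} \quad
e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

In general, Y is an $n \times 1$ vector of the observed values of the response variable, X is an $n \times (p+1)$ matrix of the values of the regressor variables, B is a $(p+1) \times 1$ vector of regression coefficients and e is an $n \times 1$ vector of random errors. In matrix notation, the (p + 1) normal equations can be written as follows:

$$X'X\hat{B} = X'Y \qquad (9a)$$

Equation (9a) represents the least squares normal equations. Written out in full, they are

$$\begin{pmatrix}
n & \sum x_{1i} & \cdots & \sum x_{pi} \\
\sum x_{1i} & \sum x_{1i}^2 & \cdots & \sum x_{1i}x_{pi} \\
\vdots & \vdots & & \vdots \\
\sum x_{pi} & \sum x_{pi}x_{1i} & \cdots & \sum x_{pi}^2
\end{pmatrix}
\begin{pmatrix} \hat{B}_0 \\ \hat{B}_1 \\ \vdots \\ \hat{B}_p \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_{1i}y_i \\ \vdots \\ \sum x_{pi}y_i \end{pmatrix} \qquad (9b)$$

To solve the normal equations given in equation (9a), we multiply both sides by the inverse of $X'X$. Thus, the estimates of the regression coefficients are given by

$$\hat{B} = (X'X)^{-1}X'Y \qquad (10)$$

Note on notation: When the regressors are centralised, we use $Y = (Y_1, \dots, Y_n)'$ and the $n \times (p+1)$ matrix X whose i-th row is $(X_{0i},\; X_{1i} - \bar{X}_1,\; \dots,\; X_{pi} - \bar{X}_p)$. The $X_{0i}$'s are all unity and the other variables are deviations from their means. The (p + 1) normal equations can again be written as $X'X\hat{B} = X'Y$, where $\hat{B}' = (\hat{B}_0', \hat{B}_1, \dots, \hat{B}_p)$. In case $X'X$ is non-singular, i.e., $X'X$ is of rank (p + 1), the least squares estimate of B is $\hat{B} = (X'X)^{-1}X'Y$.

On putting the values of the estimates in equation (1), we get the fitted regression model corresponding to the observations of the regressor variables $X_1, X_2, \dots, X_p$ as

$$\hat{Y} = \hat{B}_0 + \hat{B}_1X_1 + \hat{B}_2X_2 + \dots + \hat{B}_pX_p \qquad (11)$$

The fitted model contains no error term. The matrix representation of the fitted values corresponding to the observed values follows from equation (10):

$$\hat{Y} = X\hat{B} = X(X'X)^{-1}X'Y \qquad (12)$$

The difference between the observed value $y_i$ and the corresponding estimated value $\hat{y}_i$ is called the i-th residual $r_i$, i.e.,

$$r_i = y_i - \hat{y}_i, \qquad i = 1, 2, 3, \dots, n$$

The residuals may be written in matrix notation as

$$r = Y - \hat{Y} = Y - X\hat{B} \qquad (13)$$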
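The matrix form of the estimates, fitted values and residuals can be sketched in plain Python as follows. The code builds $X'X$ and $X'Y$, solves the normal equations by Gauss-Jordan elimination (a substitute for forming the inverse explicitly), and then computes $\hat{Y} = X\hat{B}$ and $r = Y - \hat{Y}$. The helper names and the data are hypothetical.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: regressors are linearly related")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def least_squares(X, y):
    """Return B_hat solving the normal equations X'X B = X'Y."""
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)


# Hypothetical data: each row of X starts with X0 = 1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9.1, 7.9, 19.2, 17.8, 29.0, 27.9]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

b_hat = least_squares(X, y)
fitted = [sum(b * x for b, x in zip(b_hat, row)) for row in X]   # Y_hat = X B_hat
residuals = [yi - fi for yi, fi in zip(y, fitted)]               # r = Y - Y_hat

# The normal equations are equivalent to X'r = 0, so each column of X
# is orthogonal to the residual vector.
xtr = [sum(row[a] * r for row, r in zip(X, residuals)) for a in range(3)]
print([round(b, 4) for b in b_hat])
print([round(v, 8) for v in xtr])
```

Since the first column of X is all ones, the condition $X'r = 0$ also implies that the residuals sum to zero, which is a quick check on any fitted model with an intercept.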

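11.3.2 Properties of Least Squares Estimates

We now describe the statistical properties of the least squares estimates. When the $X_j$ ($j = 1, 2, \dots, p$) are linearly related, $X'X$ is not invertible. In this case we cannot obtain unique estimates of B. We shall not consider this case any further.

It is to be noted that $\hat{B}$ is an unbiased estimate of B because

$$E(\hat{B}) = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'E(XB + e) = (X'X)^{-1}(X'X)B = B$$

since $E(e) = 0$ and $(X'X)^{-1}(X'X) = I$. This shows that $\hat{B}$ is unbiased.

The variance of Y (which is actually the variance-covariance matrix, as Y is a vector) is given as

$$V(Y) = \sigma^2 I$$

where I is an identity matrix of order n. The variance-covariance matrix of $\hat{B}$ is given by

$$V(\hat{B}) = (X'X)^{-1}X'\,V(Y)\,X(X'X)^{-1} = (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1} \qquad (14)$$

where

$$X'X = \begin{pmatrix}
\sum X_{0i}^2 & \sum X_{0i}X_{1i} & \cdots & \sum X_{0i}X_{pi} \\
\sum X_{1i}X_{0i} & \sum X_{1i}^2 & \cdots & \sum X_{1i}X_{pi} \\
\vdots & \vdots & & \vdots \\
\sum X_{pi}X_{0i} & \sum X_{pi}X_{1i} & \cdots & \sum X_{pi}^2
\end{pmatrix}$$

and its (j, k)-th element is $s_{jk} = \sum_i X_{ji}X_{ki}$.

Here $V(\hat{B}) = \sigma^2(X'X)^{-1}$ is a $(p+1) \times (p+1)$ matrix; its diagonal elements give the variances of the coefficients and its off-diagonal elements give the covariances. If we use the notation

$$V(\hat{B}) = \sigma^2(X'X)^{-1} = (\sigma_{jk}), \qquad j, k = 0, 1, \dots, p$$

we can write

$$V(\hat{B}_j) = \sigma_{jj} \quad \text{and} \quad \mathrm{Cov}(\hat{B}_j, \hat{B}_k) = \sigma_{jk} \qquad (15)$$

The standard error of $\hat{B}_j$ is given by

$$\mathrm{S.E.}(\hat{B}_j) = \sqrt{\sigma_{jj}} \qquad (16)$$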
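A minimal sketch of equations (14) to (16) follows. In practice $\sigma^2$ is unknown, so the sketch assumes it is replaced by the estimate obtained from the residual sum of squares divided by $(n - p - 1)$. The function names and the data are hypothetical.

```python
import math


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X'X is not invertible")
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]


def coefficient_covariance(X, y):
    """Return (B_hat, estimated V(B_hat), standard errors of B_hat)."""
    n, k = len(X), len(X[0])          # k = p + 1 columns, including X0 = 1
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    c = inverse(xtx)                  # (X'X)^{-1}
    b_hat = [sum(c[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * x for b, x in zip(b_hat, r)) for r, yi in zip(X, y)]
    sigma2_hat = sum(e * e for e in resid) / (n - k)   # SS_Res / (n - p - 1)
    cov = [[sigma2_hat * c[a][b] for b in range(k)] for a in range(k)]
    se = [math.sqrt(cov[a][a]) for a in range(k)]      # square roots of the diagonal
    return b_hat, cov, se


# Hypothetical data: each row of X starts with X0 = 1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [9.1, 7.9, 19.2, 17.8, 29.0, 27.9]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

b_hat, cov, se = coefficient_covariance(X, y)
print([round(b, 4) for b in b_hat])
print([round(s, 4) for s in se])
```

The diagonal of `cov` holds the estimated variances of the coefficients and the off-diagonal entries hold the estimated covariances, so the matrix is symmetric.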

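The residual sum of squares $SS_{Res}$ is obtained by substituting the least squares estimates of $B_0, B_1, \dots, B_p$ in equation (3):

$$SS_{Res} = \sum_{i=1}^{n}\left(Y_i - \hat{B}_0X_{0i} - \hat{B}_1X_{1i} - \dots - \hat{B}_pX_{pi}\right)^2$$

This is the sum of squares not accounted for by the regression model. In matrix notation, this can be written as

$$SS_{Res} = Y'Y - Y'X\hat{B} = \sum Y_i^2 - \hat{B}_0'(Y'X_0) - \hat{B}_1(Y'X_1) - \hat{B}_2(Y'X_2) - \dots - \hat{B}_p(Y'X_p) \qquad (17)$$

Note that here $X_1, X_2, \dots, X_p$ are deviations from their respective means. As we have fitted (p + 1) parameters, the degrees of freedom of the residual sum of squares are $(n - p - 1)$. An unbiased estimate of $\sigma^2$ is obtained by dividing the residual sum of squares $SS_{Res}$ by its degrees of freedom $(n - p - 1)$. Thus

$$\hat{\sigma}^2 = \frac{SS_{Res}}{n - p - 1} \qquad (18)$$

If we are interested in predicting the mean value of Y for a given set of independent variables $X_1, \dots, X_p$, then we use the fitted model. The predicted mean value of Y for given $X_1, \dots, X_p$ is given by

$$\hat{Y} = \hat{B}_0 + \hat{B}_1X_1 + \dots + \hat{B}_pX_p$$

Let us explain the matrix method with the help of an example.

Example 2: Using the data of Example 1, find the estimates of the regression coefficients and $SS_{Res}$ by using the matrix method. Also predict the expected time Y at $X_1 = 7$ and the given value of $X_2$.

Solution: Using the matrix notation, we form from the data the $12 \times 1$ vector Y of delivery times, the matrices $X'X$ and $X'Y$ of sums of squares and products from the table in Example 1, and the inverse $(X'X)^{-1}$. The estimates are then obtained from equation (10):

$$\hat{B} = \begin{pmatrix} \hat{B}_0 \\ \hat{B}_1 \\ \hat{B}_2 \end{pmatrix} = (X'X)^{-1}X'Y$$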

Hence the fitted equation is

$$\hat{Y} = 1.8765 + 1.3002\,X_1 + 0.1356\,X_2$$

Now, we calculate the value of the residual sum of squares to obtain an estimate of $\sigma^2$ as follows:

$$SS_{Res} = Y'Y - Y'X\hat{B} = \sum Y_i^2 - \hat{B}_0\sum Y_i - \hat{B}_1\sum X_{1i}Y_i - \hat{B}_2\sum X_{2i}Y_i$$

Putting the value of $SS_{Res}$ in equation (18), with $n - p - 1 = 12 - 3 = 9$ degrees of freedom, gives the estimate $\hat{\sigma}^2 = SS_{Res}/9$.

Using the above results and putting $X_1 = 7$ and the given value of $X_2$ in the fitted equation for multiple regression, we get the predicted time $\hat{Y}$.

As far as the interpretation of the coefficients is concerned, there is an increase of 1.3002 minutes in time for one unit increase in $X_1$ when $X_2$ is held fixed. Similarly, for one unit increase in $X_2$ there is an increase of 0.1356 minutes in time when $X_1$ is held fixed.

You may like to pause here and solve the following exercises to check your understanding.

E1) In a study of 10 firms, the dependent variable was the total delivery time (Y) and the independent variables were the distance covered ($X_1$) and the packaging time ($X_2$). The delivery time data collected by the statistical analyst is given below:

  Time (Y)    Distance (X1)    Packaging Time (X2)

with totals $\sum Y_i = 66$, $\sum X_{1i} = 747$ and $\sum X_{2i} = 6$.

Estimate the parameters $B_0$, $B_1$ and $B_2$ by solving the normal equations and find the estimated multiple linear regression equation.

E2) Use the matrix method to estimate the parameters from the data given in E1).
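As a quick check on equations (17) and (18), the sketch below uses a small hypothetical data set whose regressors are mutually orthogonal, so that each normal equation contains a single unknown and the estimates can be written down directly. It compares the shortcut form of the residual sum of squares with the direct sum of squared residuals, and then computes the estimate of $\sigma^2$ and a predicted mean value. None of the numbers come from the examples in this unit.

```python
# Hypothetical data with sum(x1) = sum(x2) = sum(x1*x2) = 0, so X'X is diagonal.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [1, 2, 4, 7]
n, p = len(y), 2

assert sum(x1) == 0 and sum(x2) == 0
assert sum(a * b for a, b in zip(x1, x2)) == 0

# With a diagonal X'X, each normal equation has a single unknown.
sum_y = sum(y)
sum_x1y = sum(a * c for a, c in zip(x1, y))
sum_x2y = sum(b * c for b, c in zip(x2, y))
b0 = sum_y / n
b1 = sum_x1y / sum(a * a for a in x1)
b2 = sum_x2y / sum(b * b for b in x2)


def predict(v1, v2):
    """Predicted mean value of Y from the fitted equation."""
    return b0 + b1 * v1 + b2 * v2


# Direct form: sum of squared residuals.
ss_direct = sum((c - predict(a, b)) ** 2 for a, b, c in zip(x1, x2, y))

# Shortcut form (17): Y'Y minus B_hat' X'Y.
ss_shortcut = sum(c * c for c in y) - (b0 * sum_y + b1 * sum_x1y + b2 * sum_x2y)

# Equation (18): divide by the residual degrees of freedom n - p - 1.
sigma2_hat = ss_shortcut / (n - p - 1)

print(b0, b1, b2)              # 3.5 2.0 1.0
print(ss_direct, ss_shortcut)  # 1.0 1.0
print(sigma2_hat)              # 1.0
print(predict(1, 0))           # 5.5
```

The two forms of the residual sum of squares agree, which is the identity behind the shortcut calculation used in worked examples of this kind.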

11.4 TEST OF SIGNIFICANCE IN MULTIPLE REGRESSION

So far you have learnt how to estimate the parameters and fit the multiple regression model. You may now like to test the adequacy of the fitted model and examine whether or not the independent variables contribute significantly in explaining the variability in Y. For this purpose, we use the test of significance of regression, which is based on the analysis of variance. It is a test to determine whether there is a linear relationship between the response variable Y and any of the regressor variables $X_1, X_2, \dots, X_p$, and it is often used to examine the adequacy of the model.

In order to test whether or not the contribution of the independent variables $X_1, \dots, X_p$ is significant, we test whether $B_1, B_2, \dots, B_p$ are all zero in the model or at least one of them is not zero. This hypothesis can be written as:

  $H_0$: $B_1 = B_2 = \dots = B_p = 0$
  $H_1$: At least one of the regression coefficients is not zero

It can be tested by considering the following F-ratio:

$$F = \frac{SS_{Reg}/p}{SS_{Res}/(n - p - 1)} \qquad (19)$$

In this test, the total sum of squares $SS_T$ is partitioned into a sum of squares due to the contribution of the regressor variables ($SS_{Reg}$) and a residual sum of squares ($SS_{Res}$). From equation (17), the residual sum of squares is:

$$SS_{Res} = \sum Y_i^2 - \hat{B}_0(Y'X_0) - \hat{B}_1(Y'X_1) - \dots - \hat{B}_p(Y'X_p)$$

or

$$SS_{Res} = Y'Y - Y'X\hat{B} \qquad (20)$$

If $B_1, B_2, \dots, B_p$ are all zero, i.e., the independent variables do not contribute to the variability in Y, then the total sum of squares, denoted by $SS_T$, is given as:

$$SS_T = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = Y'Y - \frac{\left(\sum Y_i\right)^2}{n} = Y'Y - n\bar{Y}^2 \qquad (21)$$

This is the total variability present in Y around the mean $\bar{Y}$. We can rewrite equation (20) as

$$SS_{Res} = \left(Y'Y - n\bar{Y}^2\right) - \left(\hat{B}'X'Y - n\bar{Y}^2\right)$$

that is,

$$SS_{Res} = SS_T - SS_{Reg}$$

Hence, the difference $SS_T - SS_{Res}$ gives the contribution of the independent variables $X_1, X_2, \dots, X_p$ in explaining the variability in Y, i.e.,

$$SS_T - SS_{Res} = \hat{B}_0(Y'X_0) + \hat{B}_1(Y'X_1) + \dots + \hat{B}_p(Y'X_p) - n\bar{Y}^2$$

so that

$$SS_{Reg} = Y'X\hat{B} - n\bar{Y}^2 \quad \text{or} \quad SS_{Reg} = \sum_{j=0}^{p}\hat{B}_j\,Y'X_j - n\bar{Y}^2 \qquad (22)$$

We now summarise these results in the following ANOVA table:

ANOVA TABLE

  Source of Variation | Degrees of Freedom (d.f.) | Sum of Squares (S.S.) | Mean Sum of Squares | Variance Ratio
  Independent Variables (X1, X2, ..., Xp) | p | $SS_{Reg} = \sum_{j=0}^{p}\hat{B}_j Y'X_j - n\bar{Y}^2$ | $SS_{Reg}/p$ | $F = \dfrac{SS_{Reg}/p}{SS_{Res}/(n-p-1)}$
  Residuals | n - p - 1 | $SS_{Res} = Y'Y - \sum_{j=0}^{p}\hat{B}_j Y'X_j$ | $SS_{Res}/(n-p-1)$ |
  Total | n - 1 | $Y'Y - n\bar{Y}^2$ | |

Under the null hypothesis, i.e., when $B_1 = B_2 = \dots = B_p = 0$, F is distributed as Fisher's F-distribution with p and $(n - p - 1)$ degrees of freedom, i.e.,

$$F \sim F_{p,\,(n-p-1)} \qquad (23)$$

If the calculated F is less than the tabulated $F_{p,\,(n-p-1)}$ at the α level of significance, then we conclude that the contribution of $X_1, X_2, \dots, X_p$ to the variability in Y is not significant, and they are of no help in prediction.

It may be of further interest to examine whether any one coefficient (say $B_j$) corresponding to the independent variable $X_j$ is different from zero, after accounting for the other variables $X_k$ (all $k \ne j$). This can be tested by considering the statistic t:

$$t = \frac{\hat{B}_j}{\mathrm{S.E.}(\hat{B}_j)} \qquad (24)$$

where $\mathrm{S.E.}(\hat{B}_j)$ uses the estimated value $\hat{\sigma}^2$ given in equation (18). Under the null hypothesis, i.e., $B_j = 0$, the statistic t follows Student's t-distribution with $(n - p - 1)$ d.f. Thus, if

$$|t| \le t_{\alpha/2,\; n-p-1} \qquad (25)$$

we do not reject $H_0$. Otherwise, we reject it. If $B_j$ is significantly different from zero, $X_j$ contributes significantly to the variability in Y after taking into account the contribution of the other variables. If $B_j$ is not significantly different from zero, its contribution is not significant after accounting for the other variables in the model.

Example 3: Using the data of Example 1 and the results of Example 2, construct the ANOVA table, apply a relevant test of hypothesis and interpret the results.

Solution: From the data given in Example 1 and the results of Example 2, we have the values of $SS_{Res}$ and $\sum_{j=0}^{p}\hat{B}_j\,Y'X_j$. Using these values, we construct the ANOVA table as follows:

ANOVA TABLE

  Source of Variation | Degrees of Freedom (d.f.) | Sum of Squares (S.S.) | Mean Sum of Squares | Variance Ratio
  Independent Variables (X1, X2) | 2 | $SS_{Reg} = \sum\hat{B}_j Y'X_j - n\bar{Y}^2$ | $SS_{Reg}/2$ | $F = \dfrac{SS_{Reg}/2}{SS_{Res}/(12-2-1)} = 17.105$
  Residuals | 9 | $SS_{Res} = Y'Y - \sum\hat{B}_j Y'X_j$ | $SS_{Res}/9$ |
  Total | 11 | $Y'Y - n\bar{Y}^2$ | |

We have obtained the variance ratio F = 17.10, whereas the tabulated value of $F_{2,\,9}$ at α = 0.05 is 4.26. Hence, we reject $H_0$ and conclude that $X_1$ and $X_2$ contribute significantly to the variability in Y.

It may be of further interest to examine whether the coefficient $B_j$ corresponding to the independent variable $X_j$ is different from zero, after accounting for the other variables $X_k$ (all $k \ne j$). This can be tested by considering the statistic t:

$$t = \frac{\hat{B}_j}{\mathrm{S.E.}(\hat{B}_j)}$$

From the results of Example 2, we also have $\hat{B}_0 = 1.8765$, $\hat{B}_1 = 1.3002$ and $\hat{B}_2 = 0.1356$. The variance-covariance matrix is

$$V(\hat{B}) = \hat{\sigma}^2(X'X)^{-1}$$

Using equation (15), we obtain from its diagonal elements

$$V(\hat{B}_0) = 7.7112, \quad V(\hat{B}_1) = 0.1024 \quad \text{and} \quad V(\hat{B}_2) = 0.0024$$

and therefore, $\mathrm{S.E.}(\hat{B}_0) = \sqrt{7.7112} = 2.7769$,

$\mathrm{S.E.}(\hat{B}_1) = 0.3199$ and $\mathrm{S.E.}(\hat{B}_2) = 0.0494$.

Therefore, the statistic t is given as:

$$t = \frac{\hat{B}_0}{\mathrm{S.E.}(\hat{B}_0)} = \frac{1.8765}{2.7769} = 0.6758$$

$$t = \frac{\hat{B}_1}{\mathrm{S.E.}(\hat{B}_1)} = \frac{1.3002}{0.3199} = 4.064$$

$$t = \frac{\hat{B}_2}{\mathrm{S.E.}(\hat{B}_2)} = \frac{0.1356}{0.0494} = 2.7444$$

The tabulated value of the t-statistic for α = 0.05 is $t_{0.025,\,9} = 2.262$. The t-values for $\hat{B}_1$ and $\hat{B}_2$ exceed it. Hence, both variables contribute significantly to the variability in Y.

You may now like to solve the following exercise.

E3) Make the ANOVA table, calculate the standard errors of the estimates and test their significance using the data in E1. Interpret the results.

11.5 COEFFICIENT OF DETERMINATION ($R^2$) AND ADJUSTED $R^2$

We define the coefficient of determination, $R^2$, in the same way as for simple regression. It gives a measure of the adequacy of model fit. We define $R^2$ as follows:

$$R^2 = \frac{\text{Variability accounted for by the independent variables}}{\text{Total variability around the mean}} = \frac{\sum_{j=0}^{p}\hat{B}_j\,Y'X_j - n\bar{Y}^2}{Y'Y - n\bar{Y}^2} \qquad (26)$$

Its value always lies between 0 and 1. When the fit is good, $R^2$ is close to 1. Otherwise, $R^2$ is close to 0. The value of $R^2$ never decreases when p increases. The increase may be negligible, but $R^2$ never decreases. When we compare two models with different values of p, the model with larger p is preferable if the $R^2$ corresponding to it is significantly larger than the $R^2$ with smaller p. A model with smaller p and large $R^2$ is always preferable, as it is a simpler model. Hence, you should choose a model with small p if its $R^2$ is not much smaller than the $R^2$ for a model with a larger p. For this, we define an adjusted $R^2$, viz., $R^2_{Adj}$, which penalises $R^2$ when p increases but $R^2$ does not increase significantly. We know that

$$R^2 = \frac{SS_{Reg}}{SS_T} = \frac{SS_T - SS_{Res}}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} \qquad (27)$$

Then we define $R^2_{Adj}$ as

$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n - p - 1)}{SS_T/(n - 1)} = 1 - \frac{(n - 1)(1 - R^2)}{(n - p - 1)} \qquad (28)$$

Here, we have divided the numerator and denominator by their degrees of freedom. $SS_{Res}/(n - p - 1)$ may increase with an increase in p when there is no appreciable increase in $R^2$, because the degrees of freedom fall while $SS_{Res}$ hardly changes. Hence,

$$R^2_{Adj} = 1 - \frac{(n - 1)(1 - R^2)}{(n - p - 1)} \qquad (29)$$

can decrease when a term is added. Therefore, we should stop including terms in the model if $R^2_{Adj}$ starts decreasing. We prefer a model with larger $R^2_{Adj}$ and smaller p to a model with smaller $R^2_{Adj}$ but larger p.

Example 5: Using the data of Example 1 and the results of Examples 2 and 3, calculate $R^2$ and $R^2_{Adj}$ and interpret the results.

Solution: Using the data of Example 1 and the results of Examples 2 and 3, and on putting the values in equation (27), we get

$$R^2 = \frac{SS_{Reg}}{SS_T} = 0.7917$$

Therefore, the adjusted $R^2$ is obtained as follows:

$$R^2_{Adj} = 1 - \frac{(12 - 1)(1 - 0.7917)}{(12 - 2 - 1)} = 0.7454$$

From the coefficient of determination, $R^2$, we see that 79% of the variability in Y is explained by $X_1$ and $X_2$. This is quite a good fit. Adjusted $R^2$ is 0.7454, which is quite large. Hence we conclude that both $X_1$ and $X_2$ contribute adequately to the model fit.

You may now like to calculate $R^2$ and adjusted $R^2$ yourself. Try the following exercise.

E4) Calculate $R^2$ and adjusted $R^2$ and comment on the goodness of fit of the model, for the data given in E1.

11.6 REGRESSION WITH DUMMY VARIABLES

In previous sections, we have dealt with multiple linear regression when the independent/regressor variables are quantitative. Quantitative variables such as height, distance, temperature, time, income, pressure, etc. have a well-defined scale of measurement. However, sometimes the independent variables include qualitative variables such as sex (male/female), region (north, south, east, west, etc.) or religion (Hindu, Muslim, Christian, etc.). Such variables, called categorical variables, are not measured on a numerical scale and hence no quantitative value can be assigned to them directly. We define dummy variables to account for the effect that the qualitative variables may have on the response variable. Dummy variables are also known as indicator variables. Suppose k is the number of levels a categorical variable takes. Then we define (k - 1) dummy variables. For example, if we have the two categories male and female in the data, then k = 2 and we define one dummy variable.

Suppose that a statistical analyst is analysing a vending machine's efficiency in the distribution of a product. She/he is interested in relating the time required to service the consumer with the distance travelled by the product in the vending machine, for machines of two types, A and B. The second regressor variable, machine type, is qualitative and has two levels: Type A and Type B. A dummy variable allows us to code the types of machines used. Therefore, we define a dummy variable $X_2$, which takes the values 0 and 1 to identify the types of machines as follows:

$$X_2 = \begin{cases} 0, & \text{if distribution is done by machine A} \\ 1, & \text{if distribution is done by machine B} \end{cases}$$

The variable $X_2$ is called an indicator variable because it is used to indicate the presence or absence of Machine A or B. For such situations, we have a multiple linear regression model given by

$$Y = B_0 + B_1X_1 + B_2X_2 + e \qquad (30)$$

To determine the regression coefficients in this model, we first consider machine type A, for which $X_2$ takes the value 0. Then the regression model is given by:

$$Y = B_0 + B_1X_1 + B_2(0) + e$$
$$Y = B_0 + B_1X_1 + e \qquad (31)$$

The relationship between the response variable Y and the regressor variable $X_1$, i.e., the distance travelled by the product in the machine, is a straight line with intercept $B_0$ and slope $B_1$. For a machine of type B, we have $X_2 = 1$.
Then the regression model becomes

$$Y = B_0 + B_1X_1 + B_2(1) + e$$
$$Y = (B_0 + B_2) + B_1X_1 + e \qquad (32)$$

which shows that the relationship between Y and $X_1$ is also a straight line with slope $B_1$ but intercept $(B_0 + B_2)$.

Note that these models are linear with the same slope $B_1$ but different intercepts. Hence, these two models describe two parallel regression lines, i.e., two lines with a common slope and different intercepts. The vertical distance between these two lines is the difference in the intercepts, i.e., $B_2$. The two parallel regression lines formed by the models given in equations (31) and (32) are shown in Fig. 11.1.
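The coding of a categorical regressor by (k - 1) dummy variables, and the resulting parallel lines, can be sketched as follows. The function names and the coefficient values are hypothetical; the first level in the list is treated as the baseline, for which all dummy variables are 0.

```python
def dummy_code(levels, value):
    """Code one observation of a k-level category as k - 1 dummy variables.

    The first entry of `levels` is the baseline (all dummies 0).
    """
    if value not in levels:
        raise ValueError("unknown level: %r" % (value,))
    return [1 if value == level else 0 for level in levels[1:]]


def predict(b0, b1, b_dummies, x1, dummies):
    """Fitted value B0 + B1*X1 + (sum of dummy coefficients times dummies)."""
    return b0 + b1 * x1 + sum(b * d for b, d in zip(b_dummies, dummies))


# Two machine types: one dummy variable X2 (0 for A, 1 for B).
print(dummy_code(["A", "B"], "A"))  # [0]
print(dummy_code(["A", "B"], "B"))  # [1]

# Three levels need two dummy variables.
print(dummy_code(["A", "B", "C"], "C"))  # [0, 1]

# Hypothetical coefficients: B0 = 10, B1 = 0.5, B2 = 3.
b0, b1, b2 = 10.0, 0.5, 3.0
for x in (0, 20, 40):
    line_a = predict(b0, b1, [b2], x, dummy_code(["A", "B"], "A"))
    line_b = predict(b0, b1, [b2], x, dummy_code(["A", "B"], "B"))
    # The lines share the slope B1, so their vertical gap is always B2.
    print(x, line_a, line_b, line_b - line_a)
```

Using k dummy variables together with the intercept would make the columns of X linearly related, which is why one level is left out as the baseline.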

Fig. 11.1

For three machine types A, B and C, two dummy variables $X_2$ and $X_3$ are used. The model becomes

$$Y = B_0 + B_1X_1 + B_2X_2 + B_3X_3 + e \qquad (33)$$

The levels of the dummy variables would be:

  $X_2 = 0$, $X_3 = 0$    for Machine Type A
  $X_2 = 1$, $X_3 = 0$    for Machine Type B
  $X_2 = 0$, $X_3 = 1$    for Machine Type C

In general, a categorical variable with k categories is represented by (k - 1) dummy variables.

Let us try to understand regression analysis using dummy variables with the help of an example.

Example 6: A statistical analyst is analysing the performance of washing machines in a distribution system. He/she is interested in predicting the amount of time required by the driver to service washing machines of two types: (i) Type A and (ii) Type B. The data on the required time collected by the statistical analyst is given below:

  Time (Y)    Distance (X1)    Machine Type (X2)

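Check whether there is a linear relationship between Y (time) and the two independent variables $X_1$ (distance) and $X_2$ (machine type). Calculate the values of the coefficients and fit the regression equation.

Solution: Since two types of washing machines, A and B, have been used, k = 2. Here we have to define one dummy variable $X_2$, which takes two values:

  $X_2 = 0$ if the observation is from machine A
  $X_2 = 1$ if the observation is from machine B

We form the following table from the given data to fit the regression equation:

  Time (Y)    Distance (X1)    Machine Type (X2)    Y^2    X1^2    X2^2    X1*Y    X2*Y    X1*X2

The column totals are $\sum Y_i$, $\sum X_{1i} = 35$, $\sum X_{2i} = 5$, $\sum Y_i^2 = 38$, $\sum X_{1i}^2 = 5$, $\sum X_{2i}^2 = 5$, $\sum X_{1i}Y_i = 68$, $\sum X_{2i}Y_i = 8$ and $\sum X_{1i}X_{2i}$.

The normal equations (5) for p = 2 and $X_{0i} = 1$ are:

$$\begin{aligned}
n\hat{B}_0 + \hat{B}_1\sum X_{1i} + \hat{B}_2\sum X_{2i} &= \sum Y_i \\
\hat{B}_0\sum X_{1i} + \hat{B}_1\sum X_{1i}^2 + \hat{B}_2\sum X_{1i}X_{2i} &= \sum X_{1i}Y_i \\
\hat{B}_0\sum X_{2i} + \hat{B}_1\sum X_{1i}X_{2i} + \hat{B}_2\sum X_{2i}^2 &= \sum X_{2i}Y_i
\end{aligned}$$

On putting the values of the sums calculated in the above table, we get three equations, labelled (i), (ii) and (iii), in $\hat{B}_0$, $\hat{B}_1$ and $\hat{B}_2$. From equation (iii), we have

$$\hat{B}_2 = \frac{1}{\sum X_{2i}^2}\left(\sum X_{2i}Y_i - \hat{B}_1\sum X_{1i}X_{2i} - \hat{B}_0\sum X_{2i}\right) \qquad \text{(iv)}$$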

On putting the value of $\hat{B}_2$ in equations (i) and (ii) and simplifying, we get two equations, (v) and (vi), in $\hat{B}_0$ and $\hat{B}_1$. On solving equations (v) and (vi), we get $\hat{B}_1 = .3498$, and equation (iv) and the remaining equations then give $\hat{B}_2 = -.38$ and $\hat{B}_0 = .646$. Hence, the fitted regression equation is

$$\hat{Y} = \hat{B}_0 + \hat{B}_1X_1 + \hat{B}_2X_2 \qquad \text{(vii)}$$

with these values of the coefficients.

We conclude that there is a linear relationship between Y (time in seconds) and the two independent variables $X_1$ (distance) and $X_2$ (type of machine). Since the regression coefficient for the variable $X_2$ is negative, the time is lower for machine type B than for machine type A at the same distance. The numerical value of the regression coefficient associated with $X_2$ is higher than that of the other regressor variable. It shows that a unit change in the distance travelled (in m) affects the delivery time less than the change in the type of machine.

To determine the regression equation for each type of machine, we first consider machine A, for which $X_2$ takes the value 0. We put the values of the regression coefficients in equation (31). Then the regression model becomes

$$\hat{Y} = \hat{B}_0 + \hat{B}_1X_1 \qquad \text{(viii)}$$

For machine B, we put the values of the regression coefficients and $X_2 = 1$. Then the regression model becomes

$$\hat{Y} = (\hat{B}_0 + \hat{B}_2) + \hat{B}_1X_1 \qquad \text{(ix)}$$

Note that, as discussed in Sec. 11.6, these estimated regression lines have the same slope, i.e., .3498, but have different intercepts, i.e., .646 and .66.

You may now like to solve the following problem to check your understanding:

E5) Using the data given in the following table, find the regression coefficients and obtain the estimated regression equations for the models given in equations (30), (31) and (32):

  Time (hour) Y    Distance (feet) X1    Machine Type X2
  8                6                     A
  4                95                    A
  7                7                     A
  4                84                    A
  3                98                    A
  4                53                    B
  3                68                    B
                   54                    B
                   89                    B
  9                73                    B

Check whether there is a linear relationship between Y (time) and the two independent variables $X_1$ (distance) and $X_2$ (machine type). Calculate the values of the coefficients and fit the regression equation.

We now summarise the concepts that we have discussed in this unit.

11.7 SUMMARY

1. The basic concept of multiple linear regression is the same as that of simple regression. However, instead of one independent variable, there are several independent variables, say, $X_1, X_2, X_3, \dots, X_p$.

2. A multiple regression model is given by $Y = B_0 + B_1X_1 + B_2X_2 + \dots + B_pX_p + e$, where Y is the dependent variable and $X_1, X_2, \dots, X_p$ are p independent variables. This is called the multiple linear regression model with p independent/regressor variables. The term linear is used because the dependent/response variable Y is a linear function of the unknown parameters $B_0, B_1, B_2, \dots, B_p$.

3. The simple regression model considered in Unit 9 becomes a particular case of this model with $X_0 = 1$, $B_0 = a$, $B_1 = b$ and $B_i = 0$ ($i \ge 2$). The interpretation of the coefficients $B_j$ ($j = 1, 2, \dots, p$) is that $B_j$ represents the amount of change in Y for a unit change in $X_j$, keeping the other independent variables $X_k$ ($k \ne j$) fixed. These coefficients are known as partial regression coefficients because the effect of one independent variable on the dependent variable is studied while the other variables are held fixed or constant. We use the term multiple linear regression for this model because several variables are included in the regression and the parameters $B_0, B_1, \dots, B_p$ appear in a linear form.

4. We estimate the parameters of a multiple linear regression equation using the method of least squares. In this method, we minimise the sum of the squares of the differences between the observed values $Y_i$ and their expected values, i.e., the sum of squares of the error terms. When p is greater than 2, it is more convenient to write the normal equations in matrix form.
The regression equations in matrix notation can be written as $Y = XB + e$, where Y is an $n \times 1$ vector of the observed values of the response variable, X is an $n \times (p+1)$ matrix of the values of the regressor variables, B is a $(p+1) \times 1$ vector of regression coefficients and e is an $n \times 1$ vector of random errors. In matrix notation, the (p + 1) normal equations can be written as $X'X\hat{B} = X'Y$.

5. The variance-covariance matrix of $\hat{B}$ is given by $V(\hat{B}) = \sigma^2(X'X)^{-1}$, which is a $(p+1) \times (p+1)$ matrix; its diagonal elements give the variances of the coefficients and its off-diagonal elements give the covariances. If we use the notation $V(\hat{B}) = \sigma^2(X'X)^{-1} = (\sigma_{jk})$, $j, k = 0, 1, \dots, p$,

we can write $V(\hat{B}_j) = \sigma_{jj}$ and $\mathrm{Cov}(\hat{B}_j, \hat{B}_k) = \sigma_{jk}$. The standard error of $\hat{B}_j$ is given by $\mathrm{S.E.}(\hat{B}_j) = \sqrt{\sigma_{jj}}$.

6. To examine whether there is a linear relationship between the response variable Y and any of the independent variables $X_1, X_2, \dots, X_p$, we use the test of significance of regression. It is a test to determine the linear relationship between the response variable and the regressor variables and is often used to examine the adequacy of the model.

7. The coefficient of determination, $R^2$, and adjusted $R^2$ are measures of goodness of fit of the multiple regression model. The value of $R^2$ never decreases when p increases; the increase may be negligible. When we compare two models with different values of p, the model with larger p is preferable if the $R^2$ corresponding to it is significantly larger than the $R^2$ with smaller p. A model with smaller p and large $R^2$ is always preferable, as it is a simpler model. Hence, one should choose a model with small p if its $R^2$ is not much smaller than the $R^2$ for a model with a larger p.

8. We define dummy variables to account for the effect that qualitative variables may have on the response variable. Dummy variables are also known as indicator variables. Suppose k represents the number of levels a categorical variable takes; then we define (k - 1) dummy variables. For example, if we have the two categories male and female in the data, then k = 2 and we define one dummy variable.

11.8 SOLUTIONS/ANSWERS

E1) We do the following calculations for the given data:

  Time (Y)    Distance (X1)    Vending Time (X2)    Y^2    X1^2    X2^2    X1*Y    X2*Y    X1*X2

The column totals are $\sum Y_i = 66$, $\sum X_{1i} = 747$, $\sum X_{2i} = 6$, $\sum Y_i^2 = 98$, $\sum X_{1i}^2 = 5889$, $\sum X_{2i}^2 = 75$, $\sum X_{1i}Y_i = 9$, $\sum X_{2i}Y_i = 4535$ and $\sum X_{1i}X_{2i} = 86$.

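From the above table, putting the values of $\sum Y_i$, $\sum X_{1i}$, $\sum X_{2i}$, $\sum X_{1i}^2$, $\sum X_{2i}^2$, $\sum X_{1i}Y_i$, $\sum X_{2i}Y_i$ and $\sum X_{1i}X_{2i}$ in the normal equations, we get
$10\hat{B}_0 + 747\hat{B}_1 + 6\hat{B}_2 = 66$ (i)
$747\hat{B}_0 + 5889\hat{B}_1 + 86\hat{B}_2 = 9$ (ii)
$6\hat{B}_0 + 86\hat{B}_1 + 75\hat{B}_2 = 4535$ (iii)
Solving these equations, we get the estimates $\hat{B}_0$, $\hat{B}_1 = .79$ and $\hat{B}_2 = .6$. The fitted regression equation is $Y = \hat{B}_0 + \hat{B}_1X_1 + .6X_2$ with these estimates.

E2) Using the matrix notation, we have from the data $Y' = [8, 4, 7, 4, 3, 4, 3, , , 9]$ together with the matrices $X'X$ and $X'Y$. Then $\hat{B} = (X'X)^{-1}X'Y$, which gives $\hat{B}_1 = .79$ and $\hat{B}_2 = .6$. Hence, the fitted equation is $Y = \hat{B}_0 + \hat{B}_1X_1 + .6X_2$ with these estimates.
We now calculate the value of the residual sum of squares to obtain an estimate of $\sigma^2$ as follows: $SS_{Res} = Y'Y - \hat{B}'X'Y$, which on substituting the terms (6.7569), 9(.79) and 4535(.6) gives $SS_{Res} = 42.45$. Therefore, we put the value $\hat{\sigma}^2 = SS_{Res}/(n - 3) = 42.45/(10 - 3) = 6.06$ in equation (8).

E3) Using the data of E1) and the results of E2), we get the estimates $\hat{B}_0$, $\hat{B}_1 = .79$ and $\hat{B}_2 = .6$.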

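As per the data and the results obtained earlier, we have $SS_{Res} = 42.45$ and $p = 2$, along with the value of $\sum_{j=0}^{p} \hat{B}_j Y'X_j$. Using these values, we construct the ANOVA table as follows:

ANOVA TABLE
Source of Variation: Independent variables ($X_1$, $X_2$). Degrees of Freedom (d.f.): $p = 2$. Sum of Squares (S.S.): $SS_{Reg} = \sum_{j=0}^{p} \hat{B}_j Y'X_j - n\bar{Y}^2$. Mean Sum of Squares: $SS_{Reg}/p$. Variance Ratio: $F = \dfrac{SS_{Reg}/p}{SS_{Res}/(n - p - 1)} = 9.74$.
Source of Variation: Residuals. Degrees of Freedom (d.f.): $n - p - 1 = 7$. Sum of Squares (S.S.): $SS_{Res} = Y'Y - \sum_{j=0}^{p} \hat{B}_j Y'X_j = 42.45$. Mean Sum of Squares: $SS_{Res}/(n - p - 1) = 6.06$.
Source of Variation: Total. Degrees of Freedom (d.f.): $n - 1 = 9$. Sum of Squares (S.S.): $Y'Y - n\bar{Y}^2 = 5.4$.

The calculated value of the variance ratio, $F = 9.74$, exceeds the tabulated value of $F$ at $\alpha = 0.05$. Hence, we reject $H_0$ and conclude that $X_1$ and $X_2$ contribute significantly in explaining the variability.
It may be of further interest to examine whether the coefficient $B_j$, corresponding to the independent variable $X_j$, is different from zero after accounting for the other variables $X_k$ (all $k \neq j$). This can be tested by considering the statistic
$t_j = \dfrac{\hat{B}_j}{\mathrm{S.E.}(\hat{B}_j)}$.
From the results obtained earlier, we have $\hat{B}_1 = .79$ and $\hat{B}_2 = .6$. The variance-covariance matrix is $V(\hat{B}) = \hat{\sigma}^2 (X'X)^{-1}$. Using equation (5), we obtain $V(\hat{B}_0)$, $V(\hat{B}_1) = .4$ and $V(\hat{B}_2) = .3$ and therefore,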

$\mathrm{S.E.}(\hat{B}_0) = 7.56$, $\mathrm{S.E.}(\hat{B}_1) = \sqrt{.4} = .64$ and $\mathrm{S.E.}(\hat{B}_2) = .489$.
Therefore, the statistic $t$ is given as:
$t_0 = \dfrac{\hat{B}_0}{\mathrm{S.E.}(\hat{B}_0)} = 3.65$, with $\mathrm{S.E.}(\hat{B}_0) = 7.56$
$t_1 = \dfrac{\hat{B}_1}{\mathrm{S.E.}(\hat{B}_1)} = \dfrac{.79}{.64} = .74$
$t_2 = \dfrac{\hat{B}_2}{\mathrm{S.E.}(\hat{B}_2)} = \dfrac{.6}{.489} = .94$
The tabulated value of the t-statistic for $\alpha = 0.05$ is $t_{0.025,\,7} = 2.37$. Hence, one of the variables $X_1$ and $X_2$ contributes significantly in explaining the variability in $Y$ but the other does not.
As far as the interpretation of the coefficients is concerned, there is an increase of .6 seconds in time for one unit change in cases ($X_2$). Similarly, for one unit increase in $X_1$, there is a .79 seconds decrease in time.

E4) Using the data of E1) and the results of E2) and E3), we get
$R^2 = \dfrac{\text{Sum of Squares due to } X_1, X_2}{\text{Total Sum of Squares}} = \dfrac{9.58}{5.4} = .79$
and
$R^2_{Adj} = 1 - \dfrac{(n - 1)(1 - R^2)}{(n - p - 1)}$.
This indicates that only 7% of the variability in $Y$ is explained by $X_1$ and $X_2$.

E5) Two types of washing machines, A and B, have been used. Hence, $k = 2$. Here we have to define one dummy variable $X_2$, which takes two values, 0 and 1: one for observations from machine type A and the other for observations from machine type B.
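The coding of a qualitative variable with $k$ levels by $(k - 1)$ dummy variables can be sketched in Python as follows. This is only an illustration: the labels, the helper name `dummy_columns` and the choice of 0 for type A and 1 for type B are assumptions, not the data or the coding of the exercise.

```python
import numpy as np

# Hypothetical machine labels (not the data of E5).
machine = np.array(["A", "B", "A", "B", "B", "A"])

# Two levels (k = 2), so one dummy variable is enough.
# Assumed coding: 0 for type A (reference level), 1 for type B.
x2 = (machine == "B").astype(int)


def dummy_columns(labels, reference):
    """Return the k - 1 indicator columns of a categorical variable,
    leaving out the chosen reference level."""
    levels = [lev for lev in sorted(set(labels.tolist())) if lev != reference]
    columns = np.column_stack([(labels == lev).astype(int) for lev in levels])
    return levels, columns


# A hypothetical three-level variable (k = 3) needs two dummy columns.
grade = np.array(["low", "mid", "high", "mid"])
levels, D = dummy_columns(grade, reference="low")
print(x2, levels, D.shape)
```

Leaving out one reference level keeps the dummy columns from adding up to the intercept column, which would otherwise make $X'X$ singular.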

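From the given data, we form the following table to find the sums needed to fit the regression equation, tabulating Time ($Y$), Distance ($X_1$), $X_2$, $Y^2$, $X_1^2$, $X_2^2$, $X_1Y$, $X_2Y$ and $X_1X_2$. The column totals are:
$\sum Y_i = 66$, $\sum X_{1i} = 747$, $\sum X_{2i} = 5$, $\sum Y_i^2 = 98$, $\sum X_{1i}^2 = 5889$, $\sum X_{2i}^2 = 5$, $\sum X_{1i}Y_i = 9$, $\sum X_{2i}Y_i = 9$, $\sum X_{1i}X_{2i} = 337$.
From the above table, putting the values in the normal equations (5) for $p = 2$ and noting that $X_{2i}^2 = X_{2i}$, we get
$10\hat{B}_0 + 747\hat{B}_1 + 5\hat{B}_2 = 66$ (i)
$747\hat{B}_0 + 5889\hat{B}_1 + 337\hat{B}_2 = 9$ (ii)
$5\hat{B}_0 + 337\hat{B}_1 + 5\hat{B}_2 = 9$ (iii)
From equation (iii), we have
$\hat{B}_0 = \dfrac{9 - 337\hat{B}_1 - 5\hat{B}_2}{5}$ (iv)
On putting the value of $\hat{B}_0$ in equations (i) and (ii) and simplifying, we get two equations in $\hat{B}_1$ and $\hat{B}_2$: equation (v), with coefficients 365 and 5 and right-hand side 7, and equation (vi), with coefficients 396 and 5 and right-hand side 735.
On solving equations (v) and (vi), we get $\hat{B}_1 = -.4$ and $\hat{B}_2 = .344$, and then $\hat{B}_0$ from equation (iv).
Hence the fitted equation for the model given in equation (7) is obtained by substituting these estimates in $Y = \hat{B}_0 + \hat{B}_1X_1 + \hat{B}_2X_2$. (vii)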
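The elimination carried out by hand above can also be done numerically by solving the normal equations $X'X\hat{B} = X'Y$ directly. The following is a minimal Python sketch on hypothetical data (not the data of E5); the values of `x1`, `x2` and `y` are assumptions chosen so that the true coefficients are known.

```python
import numpy as np

# Hypothetical data: one quantitative regressor x1 and one 0/1 dummy x2.
x1 = np.array([2.0, 4.0, 6.0, 8.0, 3.0, 5.0, 7.0, 9.0])
x2 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
# Exact linear relation, so the solution should recover 10, -0.5 and 2.
y = 10.0 - 0.5 * x1 + 2.0 * x2

# Design matrix with a leading column of ones for the intercept B0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# X'X holds n, sum(x1), sum(x2), sum(x1^2), sum(x1*x2) and sum(x2^2);
# X'Y holds sum(y), sum(x1*y) and sum(x2*y).
XtX = X.T @ X
XtY = X.T @ y

# Solve the (p + 1) = 3 normal equations X'X B = X'Y.
B_hat = np.linalg.solve(XtX, XtY)
print(np.round(B_hat, 4))
```

Because the dummy variable takes only the values 0 and 1, the entries of $X'X$ for $\sum X_{2i}$ and $\sum X_{2i}^2$ coincide, which is the simplification used in the hand calculation.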

Now we can conclude that there is a linear relationship between $Y$ (time) and the two independent variables $X_1$ (distance) and $X_2$ (type of machine). As the regression coefficient for the variable $X_1$ is negative, the fitted delivery time decreases as $X_1$ increases for a given type of machine.
To check the regression coefficients in this model, we first consider machine A. We put the corresponding value of $X_2$ and the values of the regression coefficients in equation (8). The regression model then becomes a straight line in $X_1$. (viii)
For machine B, we put the values of the regression coefficients and the corresponding value of $X_2$. Then the regression model becomes
$Y = 3.6 - .4X_1$ (ix)
Note that, as discussed in Sec. 11.5, these estimated regression lines have the same slope, i.e., $-.4$, but different intercepts.
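The quantities used in the solutions above, namely the residual and regression sums of squares, the estimate of $\sigma^2$, the variance-covariance matrix of $\hat{B}$, the standard errors, the t and F statistics, $R^2$ and adjusted $R^2$, can be computed together in a few lines. The following is a minimal Python sketch on hypothetical data (not the data of the exercises); the data values and all variable names are assumptions introduced for illustration.

```python
import numpy as np

# Hypothetical data: n = 10 observations, p = 2 regressors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
x2 = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1])
y = 5.0 + 1.5 * x1 - 2.0 * x2 + noise

n, p = len(y), 2
X = np.column_stack([np.ones(n), x1, x2])

# Least squares estimates B_hat = (X'X)^{-1} X'Y.
XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ y

# Sums of squares as in the ANOVA table.
ss_res = y @ y - B_hat @ (X.T @ y)   # Y'Y - B_hat' X'Y
ss_tot = y @ y - n * y.mean() ** 2   # Y'Y - n * Ybar^2
ss_reg = ss_tot - ss_res

# Estimate of sigma^2 and the variance-covariance matrix of B_hat.
sigma2_hat = ss_res / (n - p - 1)
cov_B = sigma2_hat * XtX_inv
se_B = np.sqrt(np.diag(cov_B))       # standard errors of B_hat_j
t_stats = B_hat / se_B               # t_j = B_hat_j / S.E.(B_hat_j)

# Variance ratio, R^2 and adjusted R^2.
F = (ss_reg / p) / sigma2_hat
r2 = ss_reg / ss_tot
adj_r2 = 1 - (n - 1) * (1 - r2) / (n - p - 1)

print(np.round(B_hat, 3), np.round(se_B, 3), np.round(t_stats, 2))
print(round(F, 2), round(r2, 4), round(adj_r2, 4))
```

The t and F values obtained in this way are then compared with the tabulated values for $(n - p - 1)$ and $(p,\, n - p - 1)$ degrees of freedom, respectively, exactly as in the hand calculations.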
