Chapter 14: Simple Linear Regression


1. Introduction to regression analysis
The Regression Equation
2. Linear functions
3. Estimation and interpretation of model parameters
4. Inference on the model parameters
5. Sums of squares and the ANOVA table
6. An example
7. Estimation and prediction
8. Standardized regression coefficients
9. Additional concerns and observations

The Regression Equation

1. Overview of regression analysis

Regression analysis is generally used when both the independent and the dependent variables are continuous. (But modifications exist to handle categorical independent variables and dichotomous dependent variables.)

  Type of Analysis                                     Independent Variable         Dependent Variable
  ANOVA                                                Categorical                  Continuous
  Regression                                           Continuous or Categorical    Continuous
  Categorical Analysis (Contingency Table Analysis)    Categorical                  Categorical

Goals of regression analysis:
o To describe the relationship between two variables
o To model responses on a dependent variable
o To predict a dependent variable using one or more independent variables
o To statistically control the effects of other variables while examining the relationship between the independent and dependent variable

Regression analysis is usually performed on observational data. In these cases, we can describe, model, predict, and control, but we cannot make any causal claims regarding these relationships.

Terminology in regression analysis

o As in ANOVA, we will develop a model to explain the data:

  DATA = MODEL + ERROR

o The model assumes greater importance in regression. Unlike ANOVA, we are usually interested in the model parameters.

o The goal of most regression models is to use the information contained in a set of variables to predict a response. As a result, we use slightly different terminology in regression, compared to ANOVA:

  ANOVA        Dependent variable                                          Independent variables
  REGRESSION   Dependent variable / Response variable / Outcome variable   Independent variables / Predictor variables

Simple Linear Regression: The Regression Equation

2. Linear Functions

The goal of simple linear regression is to describe an outcome variable (Y) as a linear function of a predictor variable (X). The end result will be a model that defines the equation of a straight line:

  Y = b + aX

where b = the y-intercept and a = the slope.

o Let's consider a simple example (the line graphed below):
  - The y-intercept is the value of y where the line crosses the y-axis.
  - The slope is a measure of the steepness of the line.
  - The slope is the change in y associated with a one-unit change in x.

[Figures: two data points, and a straight line drawn through the two points; both panels show an x-axis and a y-axis.]

o Let's review the method covered in high school algebra for determining the line that falls through two points, (2, 1) and (5, 2):

  First, we compute the slope of the line:

    slope = (y2 - y1) / (x2 - x1)
    slope = (2 - 1) / (5 - 2) = 1/3 = .333

  We interpret the slope as the change in y associated with a one-unit change in x. In this example, for every one-unit increase in x, y will increase by .333.

  We compute the y-intercept by finding the value of y when x = 0. We can use the equation for the slope of a line and the coordinates of either known point to solve for y. Let's use (5, 2):

    .333 = (2 - y) / (5 - 0)
    2 - y = 1.667
    y = .333

  Finally, we use the slope and the intercept to write the equation of the line through the points:

    Y = b + aX
    Y = .333 + .333X
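The same arithmetic is easy to script. Below is a minimal Python sketch (the notes themselves use SPSS; this and the later sketches are added illustrations only), using the two points above:

  # Slope and intercept of the line through the two points (2, 1) and (5, 2)
  x1, y1 = 2, 1
  x2, y2 = 5, 2

  slope = (y2 - y1) / (x2 - x1)      # change in y per one-unit change in x
  intercept = y1 - slope * x1        # value of y when x = 0

  print(round(slope, 3), round(intercept, 3))   # 0.333 0.333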

3. Estimation and interpretation of model parameters

With real data, the points rarely fall directly on a straight line. Regression is a technique to estimate the slope and the y-intercept from noisy data. Because not every point will fall on the regression line, there will be error in our model:

  DATA = MODEL + ERROR

o The DATA, or the outcome we want to predict, is the Y variable.
o The MODEL is the equation of the regression line, b0 + b1*X:
    b0 = the population value of the intercept
    b1 = the population value of the slope
    X  = the predictor variable
o The ERROR is the deviation of the observed data from our regression line. We refer to the individual error terms as residuals.
o The full simple linear regression model is given by the following equation:

  DATA = MODEL + ERROR
  Yi = b0 + b1*Xi + εi

Some key characteristics of this model:
o We can only model linear relationships between the outcome variable and the predictor variable.
o The model can be expanded to include the linear relationships between multiple predictor variables and a single outcome:

  Yi = b0 + b1*X1i + b2*X2i + ... + bk*Xki + εi
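To make the DATA = MODEL + ERROR decomposition concrete, here is a small Python sketch that simulates data from the simple linear regression model; the parameter values (b0 = 10, b1 = 5, sigma = 2) and the sample size are hypothetical, not taken from the notes:

  import numpy as np

  rng = np.random.default_rng(1)

  # Hypothetical population values for the intercept, slope, and error SD
  b0, b1, sigma = 10.0, 5.0, 2.0

  X = rng.uniform(0, 6, size=15)            # predictor values
  error = rng.normal(0, sigma, size=15)     # ERROR
  Y = b0 + b1 * X + error                   # DATA = MODEL + ERROR
  print(Y[:3])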

Predicted values and residuals

o With real data, we need to estimate the value of the slope and the intercept. (Details on the estimation process will follow shortly.)

  Yi = b̂0 + b̂1*Xi + εi

    b̂0 = the estimated value of the intercept
    b̂1 = the estimated value of the slope

o Based on the model, we have a best guess as to the participant's response on the outcome variable:

  Ŷi = b̂0 + b̂1*Xi

  In other words, we use the equation of the line we have developed to estimate how each participant responded on the outcome variable. Ŷi is called the predicted value or fitted value for the ith participant.

o If the actual response of the participant deviates from our predicted value, then we have some ERROR in the model. We define the residual to be the deviation of the observed value from the predicted value:

  DATA = MODEL + ERROR
  Yi = (b̂0 + b̂1*Xi) + ei
  Yi = Ŷi + ei
  ei = Yi - Ŷi

o If we want to know whether our model is a good model, we can examine the residuals.
  - If we have many large residuals, then there are many observations that are not predicted well by the model. We say that the model has a poor fit.
  - If most of the residuals are small, then our model is very good at explaining responses on the Y variable. This model would have a good fit.

o Let's consider a simple example to illustrate these points. We notice that a straight line can be drawn that goes directly through three of the five observed data points. Let's use this line as our best-guess ("eyeball") line, Ỹ.

  Now we can calculate predicted values and residuals for each observation.

  [Scatterplot of the five observations with the eyeball line drawn through them, and a table listing X, Y, the predicted value Ŷ, and the residual e for each observation.]
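A small Python sketch of this step, computing predicted values and residuals for a candidate line; the data and the eyeballed intercept and slope are hypothetical, not the values from the example above:

  import numpy as np

  # Hypothetical data and a hypothetical "eyeball" line
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])
  b0_guess, b1_guess = 1.0, 1.0           # eyeballed intercept and slope

  y_hat = b0_guess + b1_guess * x         # predicted (fitted) values
  e = y - y_hat                           # residuals = observed - predicted

  print(e, e.sum(), (e ** 2).sum())       # residuals, their sum, and the SSE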

In the previous example, we eyeballed a regression line. We would like to have a better method of estimating the regression line. Let's consider two desirable properties of a good regression line.

(1) The sum of the residuals should be zero:

  Σ(yi - ŷi) = 0

  If we have this property, then the average residual would be zero. In other words, the average deviation from the predicted line would be zero.

(2) Overall, we would like the residuals to be as small as possible. We already require the residuals to sum to zero, by property (1). So, let's require the sum of the squared residuals to be as small as possible. This approach has the added benefit of penalizing large residuals more than small residuals:

  Σ(yi - ŷi)² = minimum

o Estimating a regression line using these two properties is called the ordinary least squares (OLS) estimation procedure.
o Estimates of the intercept and slope are called the ordinary least squares (OLS) estimates.
o To solve for these estimates, we can use the following procedure. We want to minimize

    SSE = Σ(Yi - Ŷi)² = Σ(Yi - b0 - b1*Xi)²

  We take the derivatives of SSE with respect to b0 and b1, set each equal to zero, and solve for b0 and b1:

    ∂SSE/∂b0 = 0  and  ∂SSE/∂b1 = 0

  We'll skip the details and jump to the final estimates:

    b̂1 = SS_XY / SS_XX        b̂0 = Ȳ - b̂1*X̄

  where

    SS_XX = Σ(Xi - X̄)²
    SS_XY = Σ(Xi - X̄)(Yi - Ȳ)
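The OLS formulas above can be checked numerically. A minimal Python sketch on hypothetical data (np.polyfit is used only as an independent check of the hand formulas):

  import numpy as np

  # OLS estimates from the sums-of-squares formulas, on hypothetical data
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])

  ss_xx = np.sum((x - x.mean()) ** 2)               # SS_XX
  ss_xy = np.sum((x - x.mean()) * (y - y.mean()))   # SS_XY

  b1_hat = ss_xy / ss_xx                    # OLS slope
  b0_hat = y.mean() - b1_hat * x.mean()     # OLS intercept

  # np.polyfit returns [slope, intercept] for a degree-1 fit; it should agree
  print(b0_hat, b1_hat, np.polyfit(x, y, 1))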

Now, let's return to our example and examine the least squares regression line.

  [Scatterplot of the five observations showing both the eyeball line and the least squares (LS) line.]

Let's compare the least squares regression line to our eyeball regression line:

  Eyeball line:        Ỹ (the line we drew by eye)
  Least squares line:  Ŷ = b̂0 + b̂1*X

  [Table listing, for each observation: the data Y; the eyeball-line predicted value Ỹ, residual ẽ, and squared residual ẽ²; and the least squares predicted value Ŷ, residual ê, and squared residual ê².]

o For both models, we satisfy the condition that the residuals sum to zero.
o But the least squares regression line produces the model with the smallest sum of squared residuals.

Note that other regression lines are possible:
o We could minimize the absolute value of the residuals.
o We could minimize the shortest (perpendicular) distance to the regression line.

4. Inference on the model parameters

We have learned how to estimate the model parameters, but we also want to perform statistical tests on those parameters:

  b̂1 = SS_XY / SS_XX        b̂0 = Ȳ - b̂1*X̄

First, let's estimate the amount of error in the model, σ².

o Intuitively, the greater the amount of error in a sample, the more difficult it will be to estimate the model parameters.
o The error in the model is captured in the residuals, ei. We need to calculate the variance of the residuals.

  Recall that a variance is the average squared deviation from the mean. When applied to residuals, we obtain

    Var(ε) = Σ(εi - ε̄)² / N

  But we know that ε̄ = 0, so we estimate

    σ̂² = Var̂(ε) = Σ(ε̂i)² / (N - 2) = Σ(Yi - Ŷi)² / (N - 2) = SS_Residual / (N - 2)

  Why use N - 2? A general heuristic is to use N - (number of parameters fitted).
  - In this case, we have estimated two parameters: the slope and the intercept.
  - Recall that for Var(X), we divided by N - 1. We only estimated one parameter (the grand mean).
  - This heuristic also applied in ANOVA.
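A short Python sketch of this estimate, using the same hypothetical data as in the earlier sketches; note the N - 2 in the denominator:

  import numpy as np

  # Residual variance with N - 2 in the denominator (two fitted parameters)
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])
  n = len(y)

  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  b0_hat = y.mean() - b1_hat * x.mean()
  resid = y - (b0_hat + b1_hat * x)

  ms_resid = np.sum(resid ** 2) / (n - 2)       # SS_residual / (N - 2)
  print(ms_resid, np.sqrt(ms_resid))            # MS_resid and its square root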

And so we are left with

  Var̂(ε) = Σ(ε̂i)² / (N - 2) = SS_resid / (N - # of parameters) = MS_resid

And we are justified in using MS_resid as the error term for tests involving the regression model.

o Interpreting MS_resid:
  - Residuals measure deviation from the regression line (the predicted values).
  - The variance of the residuals captures the average squared deviation from the regression line.
  - So we can interpret MS_resid as a measure of the average squared deviation from the regression line.
  - SPSS labels the square root of MS_resid the "standard error of the estimate."

Now that we have an estimate of the error variance, we can proceed with statistical tests of the model parameters. We can perform a t-test using our familiar t-test formula:

  t ~ estimate / (standard error of the estimate)

o We know how to calculate the estimates of the slope and the intercept. All we need are standard errors of the estimates.

Inferences about the slope, b̂1

o Deriving the sampling distribution of b̂1 is tedious. We'll skip the details (see an advanced regression textbook, if interested); the end result is:

  std. error(b̂1) = sqrt[ MS_resid / Σ(Xi - X̄)² ]

o Thus, we can conduct the following statistical test:

  H0: b1 = 0
  H1: b1 ≠ 0

  t(N - 2) ~ b̂1 / standard error(b̂1)

o We can also easily compute confidence intervals around b̂1:

  estimate ± t(α/2, df) * standard error of the estimate
  b̂1 ± t(α/2, df) * sqrt[ MS_resid / Σ(Xi - X̄)² ]

o Conclusions:
  - If the test is significant, then we conclude that there is a significant linear relationship between X and Y: for every one-unit change in X, there is a b̂1-unit change in Y.
  - If the test is not significant, then there is no significant linear relationship between X and Y. Utilizing the linear relationship between X and Y does not significantly improve our ability to predict Y, compared to using the grand mean. There may still exist a significant non-linear relationship between X and Y.
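A minimal Python sketch of the slope test and confidence interval, on the hypothetical data from the earlier sketches (scipy.stats supplies the t distribution):

  import numpy as np
  from scipy import stats

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])
  n = len(y)

  ss_xx = np.sum((x - x.mean()) ** 2)
  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / ss_xx
  b0_hat = y.mean() - b1_hat * x.mean()
  ms_resid = np.sum((y - (b0_hat + b1_hat * x)) ** 2) / (n - 2)

  se_b1 = np.sqrt(ms_resid / ss_xx)                 # std. error of the slope
  t_stat = b1_hat / se_b1                           # tests H0: b1 = 0
  p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

  t_crit = stats.t.ppf(0.975, df=n - 2)             # for a 95% CI
  ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
  print(t_stat, p_value, ci)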

Inferences about the intercept, b̂0

o b0 tells us the predicted value of Y when X = 0.
o The test of b0 is automatically computed and displayed, but be careful not to misinterpret its significance!
o Only rarely are we interested in the value of the intercept.
o Again, we'll skip the details concerning the derivation of the sampling distribution of b̂0 (see an advanced regression textbook, if interested); the end result is:

  std. error(b̂0) = sqrt[ MS_resid * ( 1/N + X̄² / Σ(Xi - X̄)² ) ]

o Thus, we can conduct the following statistical test:

  H0: b0 = 0
  H1: b0 ≠ 0

  t(N - 2) ~ b̂0 / standard error(b̂0)

o We can also easily compute confidence intervals around b̂0:

  estimate ± t(α/2, df) * standard error of the estimate
  b̂0 ± t(α/2, df) * sqrt[ MS_resid * ( 1/N + X̄² / Σ(Xi - X̄)² ) ]

5. Sums of Squares in Regression and the ANOVA table

Total Sum of Squares (SST)

o In ANOVA, the total sum of squares was the sum of the squared deviations from the grand mean.
o We will use this same definition in regression. SST is the sum of the squared deviations from the grand mean of Y:

  SST = Σ (Yi - Ȳ)²

  [Scatterplot showing the deviations of the observations from the horizontal line at Mean(Y).]

Sum of Squares Regression

o In ANOVA, we had a sum of squares for the model. This SS captured the improvement in our prediction of Y based on all the terms in the model.
o In regression, we can also examine how much we improve our prediction (compared to the grand mean) by using the regression line to predict new observations.
  - If we had not conducted a regression, then our best guess for a new value of Y would be the mean of Y, Ȳ.
  - But we can use the regression line to make better predictions of new observations:

    Ŷi = b̂0 + b̂1*Xi

The deviation of the regression best guess (the predicted value) from the grand mean is the SS Regression:

  SSReg = Σ (Ŷi - Ȳ)²

  [Scatterplot showing the deviations of the LS line from the horizontal line at Mean(Y).]

Sum of Squares Error / Residual

o The residuals are the deviations of the observed values from the predicted values:

  ei = Yi - Ŷi

o The SS Residual is the amount of the total SS that we cannot predict from the regression model:

  SSResid = Σ (Yi - Ŷi)²

  [Scatterplot showing the deviations of the observations from the LS line.]

Sums of Squares partitioning

o We have three SS components, and we can partition them in the following manner:

  SST = SSreg + SSresid
  Σ (Yi - Ȳ)² = Σ (Ŷi - Ȳ)² + Σ (Yi - Ŷi)²

o In ANOVA, we had a similar partition:

  SST = SSmodel + SSerror

  It turns out that ANOVA is a special case of regression. If we set up a regression with categorical predictors, then we will find

    SSreg = SSmodel
    SSresid = SSerror

  Every analysis we conducted in ANOVA can be conducted in regression. But regression provides a much more general statistical framework (and thus is frequently called the "general linear model").

Where there are sums of squares, there is an ANOVA table.

o Based on the SS decomposition, we can construct an ANOVA table:

  Source       SS                       df                        MS              F
  Regression   SSReg = Σ(Ŷi - Ȳ)²       (# of parameters) - 1     SSreg / df      MSreg / MSresid
  Residual     SSResid = Σ(Yi - Ŷi)²    N - (# of parameters)     SSresid / df
  Total        SST = Σ(Yi - Ȳ)²         N - 1
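A short Python sketch verifying the SS partition and computing the regression F-test, on the hypothetical data used in the earlier sketches:

  import numpy as np
  from scipy import stats

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])
  n = len(y)

  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  b0_hat = y.mean() - b1_hat * x.mean()
  y_hat = b0_hat + b1_hat * x

  ss_total = np.sum((y - y.mean()) ** 2)       # SST
  ss_reg = np.sum((y_hat - y.mean()) ** 2)     # SS Regression
  ss_resid = np.sum((y - y_hat) ** 2)          # SS Residual; SST = SSreg + SSresid

  f_stat = (ss_reg / 1) / (ss_resid / (n - 2))     # MSreg / MSresid
  p_value = stats.f.sf(f_stat, 1, n - 2)
  print(ss_total, ss_reg + ss_resid, f_stat, p_value)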

o The Regression test examines all of the slope parameters in the model simultaneously. Do these parameters significantly improve our ability to predict Y, compared to using the grand mean to predict Y?

  H0: b1 = b2 = ... = bk = 0
  H1: Not all bj's = 0

o For simple linear regression, we only have one slope parameter. This test becomes a test of the slope b1:

  H0: b1 = 0
  H1: b1 ≠ 0

  In other words, for simple linear regression, the Regression F-test will be identical to the t-test of the b1 parameter. This relationship will not hold for multiple regression, when more than one predictor is entered into the model.

Calculating a measure of the variance in Y accounted for by X

o SS Total is a measure of the total variability in Y:

  SST = Σ (Yi - Ȳ)²        Var(Y) = SST / (N - 1)

o The SS Regression is the part of the total variability that we can explain using our regression line.
o As a result, we can consider the following ratio, R², to be a measure of the proportion of the sample variance in Y that is explained by X:

  R² = SSReg / SSTotal

  R² is analogous to η² in ANOVA.

o But in ANOVA, we preferred a measure of variance accounted for in the population (ω²) rather than in the sample (η²).
o The regression equivalent of ω² is called the Adjusted R².
  - Any variable (even a completely random variable) is unlikely to have SSReg exactly equal to zero. Thus, any variable we use will explain some of the variance in the sample.
  - Adjusted R² corrects for this overestimation by penalizing R² for the number of variables in the regression equation.

What happens if we take the square root of R²?

  R = sqrt( SSReg / SSTotal )

o R is interpreted as the overall correlation between all the predictor variables and the outcome variable.
o When only one predictor is in the model, R is the correlation between X and Y, r_XY.
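A minimal Python sketch of these quantities on the hypothetical data from the earlier sketches; the adjusted R² line uses the usual 1 - (1 - R²)(N - 1)/(N - k - 1) correction:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])
  n, k = len(y), 1                              # k = number of predictors

  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  b0_hat = y.mean() - b1_hat * x.mean()
  y_hat = b0_hat + b1_hat * x

  r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
  adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
  r_xy = np.corrcoef(x, y)[0, 1]

  print(r_squared, adj_r_squared, r_xy ** 2)    # r_xy**2 equals R-squared here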

6. An example

Predicting the amount of damage caused by a fire from the distance of the fire from the nearest fire station.

  Fire Damage Data
  [Table: Distance from Station (Miles) and Fire Damage (Thousands of Dollars) for each observed fire.]

Always plot the data first!!!

  [Scatterplot of fire damage (dollars, in thousands) against distance from the station (miles).]

In SPSS, we use the Regression command to obtain a regression analysis:

  REGRESSION
    /DEPENDENT dollars
    /METHOD=ENTER miles.

Variables Entered/Removed
  Model 1: Variables Entered = MILES; Method = Enter.
  a. All requested variables entered.
  b. Dependent Variable: DOLLARS
  -> This box tells us that MILES was entered as the only predictor.

Model Summary
  Model 1: R = .961, along with R Square, Adjusted R Square, and the Std. Error of the Estimate (the square root of MS residual).
  a. Predictors: (Constant), MILES
  -> This box gives us measures of the variance accounted for by the model.

ANOVA
  Model 1: Sum of Squares, df, Mean Square, F, and Sig. for the Regression, Residual, and Total rows.
  a. Predictors: (Constant), MILES
  b. Dependent Variable: DOLLARS
  -> Here is our old friend, the ANOVA table.

Coefficients
  Model 1: Unstandardized Coefficients (B and Std. Error), Standardized Coefficients (Beta), t, and Sig. for the (Constant) and MILES rows; for MILES, B = 4.919 with Std. Error = .393.
  a. Dependent Variable: DOLLARS
  -> These are the tests of the intercept and the slope.

o From this table, we read that b̂0 = 10.278 and that b̂1 = 4.919. Using this information we can write the regression equation:

  Ŷ = 10.278 + 4.919*X

o To test the slope:

  H0: b1 = 0
  H1: b1 ≠ 0

  We find a significant linear relationship between the distance of the fire from the fire station and the amount of damage caused by the fire, t(13) = 12.53, p < .01. For every mile from the fire station, the fire caused an additional $4,919 in damage.

o Note that the t-test for b̂1 is identical to the Regression test in the ANOVA table, because we only have one predictor in this case.
o In this case, the test of the intercept is not meaningful.

You can also easily obtain 95% confidence intervals around the parameter estimates:

  REGRESSION
    /STATISTICS coeff r anova ci
    /DEPENDENT dollars
    /METHOD=ENTER miles.

o COEFF, R, and ANOVA are defaults:
  - COEFF prints the estimates of b0 and b1.
  - R prints R² and Adjusted R².
  - ANOVA prints the regression ANOVA table.
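The slope test and its 95% confidence interval can be reproduced from the estimate and standard error in the coefficients table alone. A minimal Python sketch, assuming df = 13 (i.e., N = 15 fires, consistent with the t-test above):

  from scipy import stats

  # Values as read from the coefficients table above; df = 13 assumes N = 15
  b1_hat, se_b1, df = 4.919, 0.393, 13

  t_stat = b1_hat / se_b1                       # compare to the t in the output
  t_crit = stats.t.ppf(0.975, df)               # two-sided 95% critical value
  ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
  print(round(t_stat, 2), tuple(round(v, 2) for v in ci))   # 12.52 (4.07, 5.77)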

o Adding CI to the STATISTICS command will print the confidence intervals for all model parameters:

  Coefficients
    Model 1: Unstandardized Coefficients (B and Std. Error), Standardized Coefficients (Beta), t, Sig., and the 95% Confidence Interval for B (Lower Bound, Upper Bound) for the (Constant) and MILES rows.
    a. Dependent Variable: DOLLARS

  b̂1 = 4.919, t(13) = 12.53, p < .01

  95% CI for b̂1:
    b̂1 ± t(α/2, df) * std. error(b̂1)
    4.919 ± (2.160)(.393)
    (4.07, 5.77)

7. Estimation and prediction

One of the goals of regression analysis is to allow us to estimate or predict new values of Y based on observed values of X. There are two kinds of Y values we may want to predict.

o Case I: We may want to estimate the mean value of Y, Ŷ, for a specific value of X.
  - In this case, we are attempting to estimate the mean result of many events at a single value of X.
  - For example, what is the average damage caused by (all) fires that are 5.8 miles from a fire station?

o Case II: We may also want to predict a particular value of Y, Ŷ, for a specific value of X.
  - In this case, we are attempting to predict the outcome of a single event at a single value of X.
  - For example, what would be the predicted damage caused by a (single) fire that is 5.8 miles from a fire station?

In either case, we can use our regression equation to obtain an estimated mean value or a predicted particular value of Y:

  Ŷ = 10.278 + 4.919*X

o For a fire 5.8 miles from a station, we substitute X = 5.8 into the regression equation:

  Ŷ = 10.278 + 4.919*5.8
  Ŷ = 38.81

The difference between these two uses of the regression model lies in the accuracy (variance) of the estimate or prediction.

Case I: Variance of the estimate of the mean value of Y, Ŷ, at Xp

o When we attempt to estimate a mean value, there is one source of variability: the variability due to the regression line.

  We know the equation of the regression line, Ŷ = b̂0 + b̂1*X, so

    Var(Ŷ) = Var(b̂0 + b̂1*Xp)

  Skipping a few details, we arrive at the following equation:

    σ̂²_Ŷ = Var(Ŷ) = MSE * [ 1/N + (Xp - X̄)² / Σ(Xi - X̄)² ]

o And thus, the equation for the confidence interval for the estimate of the mean value of Y, Ŷ, is:

  Ŷ ± t(α/2, N-2) * σ̂_Ŷ
  Ŷ ± t(α/2, N-2) * sqrt{ MSE * [ 1/N + (Xp - X̄)² / Σ(Xi - X̄)² ] }

Case II: Variance of the prediction of a particular value of Y, Ŷ, at Xp

o When we attempt to predict a single value, there are now two sources of variability: the variability due to the regression line and the variability of Y around its mean.

  [Figure: the confidence interval for the mean value of Y, and the wider prediction interval for a single value of Y; the prediction limits allow for the mean value of Y falling at its lower or upper bound.]

o The variance for the prediction interval of a single value must include these two forms of variability:

  σ̂²_Ŷ(new) = σ̂²_Ŷ + σ̂²_ε
  σ̂²_Ŷ(new) = MSE * [ 1 + 1/N + (Xp - X̄)² / Σ(Xi - X̄)² ]

o And thus, the equation for the prediction interval for a particular value of Y, Ŷ, is:

  Ŷ ± t(α/2, N-2) * σ̂_Ŷ(new)
  Ŷ ± t(α/2, N-2) * sqrt{ MSE * [ 1 + 1/N + (Xp - X̄)² / Σ(Xi - X̄)² ] }

Luckily, we can get SPSS to perform most of the intermediate calculations for us, but we need to be sneaky:

o Add a new line to the data file with a missing value for Y and X = Xp.
o Ask SPSS to save the predicted value and the standard error of the predicted value when you run the regression:

  REGRESSION
    /MISSING LISTWISE
    /DEPENDENT dollars
    /METHOD=ENTER miles
    /SAVE PRED (pred) SEPRED (sepred).

We will have two new variables in the data file:

  PRED   = Ŷ for each value of X
  SEPRED = σ̂_Ŷ for each value of X

  [Listing of the data file showing the DOLLARS, MILES, PRED, and SEPRED columns.]

  For Xp = 5.8:  Ŷ = 38.81 and σ̂_Ŷ = 1.156

o Use the formulas to compute the confidence and prediction intervals.

  To calculate a 95% confidence interval around the mean value, Ŷ:

    Ŷ ± t(α/2, N-2) * σ̂_Ŷ
    38.81 ± t(.025, 13) * (1.156)
    38.81 ± (2.160)(1.156)
    (36.3, 41.3)

  To calculate a 95% prediction interval around the single value, Ŷ:

    Ŷ ± t(α/2, N-2) * sqrt( σ̂²_ε + σ̂²_Ŷ )
    38.81 ± t(.025, 13) * (2.589)
    38.81 ± (2.160)(2.589)
    (33.2, 44.4)
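The same confidence and prediction intervals can be computed in a few lines. A minimal Python sketch, assuming only the values recovered above (Ŷ = 38.81, SEPRED = 1.156, and the single-value standard error 2.589, from which MSE is backed out):

  import numpy as np
  from scipy import stats

  df = 13
  y_hat_p = 38.81                      # PRED at x_p = 5.8
  se_mean = 1.156                      # SEPRED at x_p = 5.8
  mse = 2.589 ** 2 - se_mean ** 2      # residual variance, recovered from the two SEs

  t_crit = stats.t.ppf(0.975, df)

  ci = (y_hat_p - t_crit * se_mean, y_hat_p + t_crit * se_mean)   # mean of Y at x_p
  se_single = np.sqrt(mse + se_mean ** 2)                          # single new Y at x_p
  pi = (y_hat_p - t_crit * se_single, y_hat_p + t_crit * se_single)
  print(tuple(round(v, 1) for v in ci), tuple(round(v, 1) for v in pi))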

The regression line can be used for prediction and estimation, but not for extrapolation. In other words, the regression line is only valid for X's within the range of the observed X's.

SPSS can be used to graph confidence intervals and prediction intervals:

  [Two scatterplots of dollars against miles: one with 95% confidence bands for the mean of Y, and one with the wider 95% prediction bands for individual values of Y.]

8. Standardized regression coefficients

To interpret the slope parameter, we must return to the original scale of the data. For example, a slope of b1 = 56 suggests that for every one-unit change in the X variable, Y changes by 56 units.

This dependence on units can make it difficult to compare the effects of X on Y across different studies:
o If one researcher measures self-esteem using a 7-point scale and another uses a 4-point scale, they will obtain different estimates of b1.
o If one researcher measures length in centimeters and another uses inches, they will obtain different estimates of b1.

One solution to this problem is to use standardized regression coefficients:

  β = b1 * (σ_X / σ_Y)

To understand how to interpret standardized regression coefficients, it is helpful to see how they can be obtained directly.

o Transform both Y and X into z-scores, z_Y and z_X:

  compute zmiles = (miles - 3.28)/1.576.
  compute zdollar = (dollars - 26.41)/9.8098.

o Regress z_Y on z_X:

  REGRESSION
    /DEPENDENT zdollar
    /METHOD=ENTER zmiles.

  Coefficients (Dependent Variable: ZDOLLAR)
    (Constant): B ≈ 0
    ZMILES:     B = .961 = Beta

  b̂1 = β = .961

o Compare this result to the regression on the raw data:

  REGRESSION
    /DEPENDENT dollars
    /METHOD=ENTER miles.

  Coefficients (Dependent Variable: DOLLARS)
    (Constant): B = 10.278
    MILES:      B = 4.919, Beta = .961

  β = .961
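A minimal Python sketch of the two routes to the standardized coefficient, on hypothetical data (with a single predictor, both equal the correlation r_XY):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])

  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  beta = b1_hat * x.std(ddof=1) / y.std(ddof=1)     # beta = b1 * (s_X / s_Y)

  # Same answer from regressing z-scores of y on z-scores of x
  zx = (x - x.mean()) / x.std(ddof=1)
  zy = (y - y.mean()) / y.std(ddof=1)
  beta_from_z = np.sum(zx * zy) / np.sum(zx ** 2)

  print(beta, beta_from_z, np.corrcoef(x, y)[0, 1])   # all equal with one predictor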

o To interpret standardized beta coefficients, we need to think in terms of z-scores:
  - A one standard deviation change in X (miles) is associated with a .961 standard deviation change in Y (dollars).
  - For simple linear regression (with only 1 predictor), β = r_XY.
  - With more than 1 predictor, standardized coefficients should not be interpreted as correlations. It is possible to have standardized coefficients greater than 1.

9. Additional concerns and observations

Standard assumptions of regression analysis:

  εi ~ NID(0, σ²)

o All observations are independent and randomly selected from the population (or equivalently, the residual terms, εi's, are independent).
o The residuals are normally distributed at each level of X.
o The variance of the residuals is constant across all levels of X.

Additionally, we assume that the regression model is a suitable proxy for the correct (but unknown) model:
o The relationship between X and Y must be linear.
o No important variables have been omitted from the model.
o There are no outliers or influential observations.

These assumptions can be examined by looking at the residuals.
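As a closing illustration, a minimal Python sketch of one such residual check, on hypothetical data (matplotlib is used only for the plot):

  import numpy as np
  import matplotlib.pyplot as plt

  # Plot residuals against the predictor: look for curvature (nonlinearity),
  # a fan shape (non-constant variance), and outliers
  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.0, 2.0, 4.0, 5.0, 5.0])

  b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
  b0_hat = y.mean() - b1_hat * x.mean()
  resid = y - (b0_hat + b1_hat * x)

  plt.scatter(x, resid)
  plt.axhline(0, linestyle="--")
  plt.xlabel("X")
  plt.ylabel("residual")
  plt.show()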

More information