Simple Linear Regression and Correlation

Introduction

Previously, our attention has been focused on one variable, which we designated by x. Frequently, it is desirable to learn something about the relationship between two (or more) variables. For example, we might be interested in studying the relationship between

o cholesterol level and age;
o blood pressure and age;
o height and weight;
o the amount of exercise and heart rate;
o the concentration of an injected drug and heart rate;
o the consumption level of some nutrient and weight gain.

The nature and strength of the relationship between two variables may be examined by regression and correlation analyses, two related statistical techniques that serve different purposes.

Regression is used to discover the probable form of the relationship between two variables x and y by finding an appropriate equation. The ultimate objective of this method of analysis is usually to predict or estimate the value of one variable corresponding to a given value of another variable, i.e. to predict or estimate the value of y for a given value of x.

Correlation analysis, on the other hand, is concerned with measuring how strong the relationship between two variables x and y is, i.e. the degree of correlation between the two variables.

SIMPLE LINEAR REGRESSION

In simple linear regression the variable x is usually referred to as the independent variable, since frequently it is controlled by the investigator; that is, values of x may be selected by the investigator and, corresponding to each preselected value of x, one or more values of y are obtained. The other variable, y, accordingly, is called the dependent variable, and we speak of the regression of y on x. In the above examples, the investigator could control the age but not the cholesterol level, the weight but not the height, the concentration of injected drug but not the heart rate, and so on.

We assume that for each value of x there is a whole population of y values which is normally distributed, and that all of these populations have equal variances.
In simple linear regression the object of the researcher's interest is the regression equation that describes the true relationship between the dependent variable y and the independent variable x.

Scatter diagram

A first step that is usually useful in studying the relationship between two variables is to prepare a scatter diagram of the data. The points are plotted by assigning values of the independent variable x to the horizontal axis and values of the dependent variable y to the vertical axis. The pattern made by the points plotted on the scatter diagram usually suggests the basic nature and the strength of the relationship between the two variables.
Example

Relationship between x and optical density

    x      Optical density (y)
    3.0    0.10
    4.0    0.20
    4.5    0.25
    5.0    0.32
    5.5    0.33
    6.0    0.35
    6.5    0.47
    7.0    0.49
    7.5    0.53

[Figure: scatter diagram of optical density (y) against x]

In our example we can see that, in general, as x increases the optical density also increases, so the two variables have a positive relationship.

The least-squares line

We can also see that the points seem to be scattered around an invisible line which would describe the relationship between x and y. These impressions suggest that the relationship between the two variables may be described by a straight line crossing the y-axis near the origin and making approximately a 45 degree angle with the x-axis.

Thinking Challenge

It looks as if this line would be easy to draw by hand, but it is doubtful that the lines drawn by any two people would be exactly the same. In other words, for every person drawing such a line by eye, or freehand, we would expect a different line. Which line best describes the relationship between the variables? What is needed to obtain the desired line?

[Figure: two scatter diagrams of optical density (y) against x]
Answer

If the scatter diagram has a linear trend, we need a mathematical way to obtain the best line through the data. We employ a method known as the method of least squares to obtain the desired line, and the resulting line is called the least-squares line. The reason for calling the method by this name will be explained in the discussion that follows.

Equation for a straight line (Linear Equation)

Recall from algebra that the general equation for a straight line is given by

\[ y = a + bx \]

where y is a value on the vertical axis and x is a value on the horizontal axis; a is the point where the line crosses the vertical axis, referred to as the y-intercept; and b shows the amount by which y changes for each unit change in x, referred to as the slope of the line.

[Figure: the line y = a + bx, with slope b = (change in y)/(change in x) and intercept a]

To draw a line based on the equation, we need the numerical values of the constants a and b. Given these constants, we may substitute various values of x into the equation y = a + bx to obtain the corresponding values of y. The resulting points may then be plotted.

Computation

Finding the b-value

\[ b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \]

\[ b = \frac{(9)(18.20) - (49)(3.04)}{(9)(284) - (49)^2} = \frac{14.84}{155} = 0.09574 \approx 0.0957 \]
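Finding the y-intercept

\[ a = \bar{y} - b\bar{x} \]

where $\bar{y}$ is the mean of the y values and $\bar{x}$ is the mean of the x values.

    x       Optical density (y)   x^2      y^2      xy
    3.0     0.10                   9.00    0.0100    0.300
    4.0     0.20                  16.00    0.0400    0.800
    4.5     0.25                  20.25    0.0625    1.125
    5.0     0.32                  25.00    0.1024    1.600
    5.5     0.33                  30.25    0.1089    1.815
    6.0     0.35                  36.00    0.1225    2.100
    6.5     0.47                  42.25    0.2209    3.055
    7.0     0.49                  49.00    0.2401    3.430
    7.5     0.53                  56.25    0.2809    3.975
    Total   Σx = 49   Σy = 3.04   Σx^2 = 284   Σy^2 = 1.1882   Σxy = 18.20
    Mean    x̄ = 5.444   ȳ = 0.3378

\[ \bar{y} = \frac{3.04}{9} = 0.3378, \qquad \bar{x} = \frac{49}{9} = 5.444 \]

\[ a = 0.3378 - (0.09574)(5.444) = -0.1835 \]

Alternatively,

\[ a = \frac{\sum y - b\sum x}{n} \]

The equation for the least-squares line is:

\[ \hat{y} = a + bx = -0.1835 + 0.0957x = 0.0957x - 0.1835 \]

Note that we use the symbol $\hat{y}$ because this value is computed from the equation and is not an observed value of y. Now we can substitute various values of x into the equation to obtain the corresponding values of $\hat{y}$. The resulting points may be plotted.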
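The slope and intercept computations above can be checked with a few lines of code. The following is a minimal sketch in Python, assuming the nine (x, y) pairs as read from the example's table; the function name `least_squares` is introduced here purely for illustration.

```python
# Data from the worked example: x is the independent variable,
# y is the observed optical density.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x (illustrative helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares(x, y)
print(f"b = {b:.4f}, a = {a:.4f}")
```

Carrying the unrounded slope into the intercept, as the function does, avoids the small drift that appears when a rounded b is reused in a hand calculation.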
Example: Predicting y for a given x using the regression equation

Choose a value for x (within the range of the x values): x = 6.8.

Substitute the selected x into the regression equation and determine the corresponding value of $\hat{y}$:

\[ \hat{y} = 0.0957x - 0.1835 = (0.0957)(6.8) - 0.1835 \approx 0.467 \]

According to the equation, an x of 6.8 would have an optical density of about 0.467.

Drawing the least-squares line

Since any two such coordinates determine a straight line, we may select any two values in the range of x, compute the two corresponding $\hat{y}$ values, locate them on a graph, and connect them with a straight line to obtain the line corresponding to the equation.

The following point will always be on the least-squares line: $(\bar{x}, \bar{y})$. Use 5.444 and 0.3378, the averages of the x's and the y's, respectively.

Try x = 4. Compute:

\[ \hat{y} = 0.0957(4) - 0.1835 = 0.1993 \]

Sketching the line using the points (5.444, 0.3378) and (4, 0.1993)

[Figure: scatter diagram of optical density (y) against x with the least-squares line $\hat{y} = 0.0957x - 0.1835$]

What we have obtained is what is called the best line for describing the relationship between our two variables. By what criterion is it considered best? Before the criterion is stated, let us examine the figure obtained. Note that the least-squares line does not pass through most of the observed points plotted on the scatter diagram. In other words, the observed points deviate from the line by varying amounts.
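The prediction step can be expressed as a small function. This is only a sketch in Python: the coefficients are the rounded values of the fitted line shown in the figure, the function name `predict` is hypothetical, and the range check reflects the advice above to choose x within the range of the observed x values (3 to 7.5 in this example).

```python
def predict(x_value, a=-0.1835, b=0.0957, x_min=3.0, x_max=7.5):
    """Predicted optical density for a given x (illustrative helper).

    a and b are the rounded intercept and slope of the fitted line;
    x_min and x_max are the smallest and largest observed x values.
    """
    if not x_min <= x_value <= x_max:
        # Outside the observed range the fitted line is not supported by data.
        raise ValueError("x is outside the range of the observed x values")
    return a + b * x_value


print(round(predict(6.8), 3))  # prediction at x = 6.8
print(round(predict(4.0), 4))  # second point used to sketch the line
```

Refusing to extrapolate is a design choice of this sketch, not a requirement of the method; a warning could be used instead of an exception.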
[Figure: scatter diagram of optical density (y) against x with the least-squares line; the vertical deviations of several observed points from the line are marked]

The line that we have drawn is best in this sense: the sum of the squared vertical deviations of the observed data points $(y_i)$ from the least-squares line is smaller than the sum of the squared vertical deviations of the observed data points from any other line. In other words, if we square the vertical distance from each observed point $(y_i)$ to the least-squares line and add these squared values for all points, the resulting total will be smaller than the similarly computed total for any other line that can be drawn through the points. For this reason the line we have drawn is called the least-squares line.

Evaluating the strength of the regression equation

One way to evaluate the strength of the regression equation in describing the relationship between two variables is to compare the scatter of the points about the regression line with the scatter about $\bar{y}$, the mean of the values of y. To do that, draw through the points a line that intersects the y-axis at $\bar{y}$ and is parallel to the x-axis. By doing so, we may obtain a visual impression of the relative magnitudes of the scatter of the points about this line and about the regression line. This has been done in the next figure.

[Figure: scatter diagram of optical density (y) against x with the regression line $\hat{y} = 0.0957x - 0.1835$ and the horizontal line $\bar{y} = 0.3378$]

It appears from the figure that the scatter of the points about the regression line is much less than the scatter of the points about the $\bar{y}$ line. But the situation may not always be this clear-cut, so some sort of calculation to evaluate the strength of the regression equation is necessary, that is, the coefficient of determination $r^2$.
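The least-squares criterion can be illustrated numerically by comparing the sum of squared vertical deviations for the fitted line with the same total for a few other lines. The sketch below, in Python, assumes the nine (x, y) pairs of the optical density example; the helper names `fit` and `sse` and the particular shifts applied to the intercept and slope are arbitrary choices made for illustration.

```python
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def fit(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    sxx = sum(xi * xi for xi in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return sy / n - b * sx / n, b


def sse(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


a_ls, b_ls = fit(x, y)
best = sse(a_ls, b_ls, x, y)

# Try eight nearby lines obtained by shifting the intercept and/or the slope.
others = {
    (da, db): sse(a_ls + da, b_ls + db, x, y)
    for da in (-0.05, 0.0, 0.05)
    for db in (-0.01, 0.0, 0.01)
    if (da, db) != (0.0, 0.0)
}

print(f"least-squares line: {best:.5f}")
for shift, value in others.items():
    print(f"shift {shift}: {value:.5f}")
```

Every shifted line gives a larger total than the least-squares line, which is the sense in which that line is "best".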
The logic behind the computation of the "coefficient of determination"

We begin by considering the point corresponding to any observed value, $y_i$, and by measuring its vertical distance from the $\bar{y}$ line. We call this the total deviation, $(y_i - \bar{y})$.

If we measure the vertical distance from the regression line to the $\bar{y}$ line, we obtain $(\hat{y}_i - \bar{y})$, which is called the explained deviation, since it shows by how much the total deviation is reduced when the regression line is fitted to the points.

Finally, we measure the vertical distance of the observed point from the regression line to obtain $(y_i - \hat{y}_i)$, which is called the unexplained deviation, since it represents the portion of the total deviation not explained or accounted for by the introduction of the regression line.

[Figure: scatter diagram of optical density (y) against x with the regression line $\hat{y} = 0.0957x - 0.1835$ and the line $\bar{y} = 0.3378$, showing the total deviation $(y_i - \bar{y})$, the explained deviation and the unexplained deviation for one observed point]

It is seen, then, that the total deviation for a particular $y_i$ is equal to the sum of the explained and unexplained deviations:

\[ (y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \]

total deviation = explained deviation + unexplained deviation

If we measure these deviations for each value of $y_i$ and $\hat{y}_i$, square each deviation, and add up the squared deviations, we have

\[ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 \]

total sum of squares (SST) = explained sum of squares (SSR) + unexplained sum of squares (SSE)

We may express the relationship between the three sums of squares as

SST = SSR + SSE
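It may not be obvious why the squared deviations add up so neatly, since squaring a sum normally produces a cross-product term. The short derivation below sketches the missing step; it relies on two standard properties of the least-squares line that are not stated in the text above, namely that the residuals $e_i = y_i - \hat{y}_i$ satisfy $\sum e_i = 0$ and $\sum x_i e_i = 0$.

```latex
\begin{align*}
\sum (y_i - \bar{y})^2
  &= \sum \bigl[(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\bigr]^2 \\
  &= \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2
     + 2 \sum (\hat{y}_i - \bar{y})(y_i - \hat{y}_i).
\end{align*}
% Since \hat{y}_i = a + b x_i and \bar{y} = a + b \bar{x},
% we have \hat{y}_i - \bar{y} = b (x_i - \bar{x}), so the cross term is
\begin{align*}
\sum (\hat{y}_i - \bar{y})\, e_i
  = b \sum x_i e_i - b \bar{x} \sum e_i = 0 .
\end{align*}
% Hence SST = SSR + SSE.
```

The identity therefore holds for the least-squares line specifically; for an arbitrary line through the points the cross term need not vanish.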
It is intuitively appealing to speculate that if a regression equation does a good job of describing the relationship between two variables, the explained sum of squares should constitute a large proportion of the total sum of squares. The next figure illustrates a case in which the explained deviation constitutes a small proportion of the total deviation, as compared with the previous figure.

[Figure: scatter diagram of optical density (y) against x with the line $\bar{y} = 0.3378$, showing a large unexplained deviation and a small explained deviation within the total deviation $(y_i - \bar{y})$]

It would be of interest, then, to determine the magnitude of this proportion by computing the ratio of the explained sum of squares to the total sum of squares. This is exactly what is done in evaluating a regression equation based on sample data, and the result is called the sample coefficient of determination, $r^2$. In other words,

\[ r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{\text{Explained sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} = \frac{0.15787}{0.16136} = 0.9784 \]

[Table: for each of the nine observations, the optical density $y_i$, the fitted value $\hat{y}_i$, and the deviations $(y_i - \bar{y})$, $(y_i - \hat{y}_i)$ and $(\hat{y}_i - \bar{y})$ with their squares. Column totals: SST = 0.16136, SSE = 0.00349, SSR = 0.15787.]
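The three sums of squares and the coefficient of determination can be computed directly from the data. The following Python sketch assumes the nine (x, y) pairs of the optical density example; all variable names are introduced here for illustration, and the slope is computed in its equivalent deviation form.

```python
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (deviation form of the same formula).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
r_squared = ssr / sst

print(f"SST = {sst:.5f}, SSR = {ssr:.5f}, SSE = {sse:.5f}")
print(f"r^2 = {r_squared:.4f}")
```

Because the fitted values are kept at full precision, SSR and SSE add up to SST apart from floating-point error; totals built from rounded fitted values can differ in the last digits.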
[Figure: scatter diagram of optical density (y) against x with the fitted line $\hat{y} = 0.0957x - 0.1835$ and $R^2 = 0.9784$]

Thus, the coefficient of determination measures the closeness of fit of the regression equation to the observed values of y.

Interpretation of $r^2$

If $r^2 = 0.9784$, approximately 98 percent of the variation in optical density (y) is explained by the linear relationship with x. Less than three percent is explained by other causes.

From the table, we see that when the quantities $(y_i - \hat{y}_i)$, the vertical distances of the observed values of y from the equation, are small, the unexplained sum of squares (SSE) is small. This leads to a large explained sum of squares, which leads in turn to a large value of $r^2$. This is illustrated in the following figures.

In this figure, we see that the observations all lie close to the regression line, and we would expect $r^2$ to be large.

[Figure: small unexplained deviation, large explained deviation]
In this figure, we see that the observations are widely scattered about the regression line, and we would expect $r^2$ to be small.

[Figure: large unexplained deviation, small explained deviation]

In this figure, we see that all the observations fall on the regression line, and we would expect the largest value of $r^2$, which is equal to 1.

[Figure: explained deviation = total deviation]

In this figure, we see that the regression line and the line drawn through $\bar{y}$ coincide, and we would expect the lowest value of $r^2$, which is close to 0.

[Figure: explained deviation = 0]
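Coefficient of Determination Examples

[Figure: scatter plots of Y against X illustrating values of $r^2$ ranging from 1 down to 0]

CORRELATION

Correlation coefficient r

1. Measures the strength of the linear relationship between the two variables represented as x and y.
2. Takes values ranging from -1 to +1.

The coefficient of correlation is the square root of the coefficient of determination, taking the sign of the slope b:

\[ r = \pm\sqrt{\text{Coefficient of Determination}} = \pm\sqrt{r^2} \]

An alternative formula for computing the coefficient of correlation r is

\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} \]

Computation Table

    x       Optical density (y)   xy       x^2      y^2
    3.0     0.10                  0.300     9.00    0.0100
    4.0     0.20                  0.800    16.00    0.0400
    4.5     0.25                  1.125    20.25    0.0625
    5.0     0.32                  1.600    25.00    0.1024
    5.5     0.33                  1.815    30.25    0.1089
    6.0     0.35                  2.100    36.00    0.1225
    6.5     0.47                  3.055    42.25    0.2209
    7.0     0.49                  3.430    49.00    0.2401
    7.5     0.53                  3.975    56.25    0.2809
    Total   Σx = 49   Σy = 3.04   Σxy = 18.20   Σx^2 = 284   Σy^2 = 1.1882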
\[ r = \frac{(9)(18.20) - (49)(3.04)}{\sqrt{\left[(9)(284) - (49)^2\right]\left[(9)(1.1882) - (3.04)^2\right]}} = \frac{14.84}{\sqrt{(155)(1.4522)}} = 0.9891 \]

so that $r \approx 0.99$.

Coefficient of Correlation Values

The statistic r has the following properties:

1. r measures the extent of linear association between two variables.
2. r has a value between -1 and 1.
3. r = 1 if and only if all the observations are on a straight line with positive slope.
4. r = -1 if and only if all the observations are on a straight line with negative slope.
5. r tends to be close to zero if there is no linear association between x and y.

[Figure: scale of r from -1.0 (perfect negative correlation) through -0.5, 0 (no correlation) and +0.5 to +1.0 (perfect positive correlation); the degree of negative correlation increases toward -1.0 and the degree of positive correlation increases toward +1.0]

Coefficient of Correlation Examples

Although there is no fixed rule for interpreting the strength of a correlation, we will say that the correlation is

Strong if $|r| \ge 0.8$
Moderate if $0.5 < |r| < 0.8$
Weak if $|r| \le 0.5$

We will also add the words positive or negative to indicate the type of correlation.

Warning

The correlation coefficient (r) measures the strength of the relationship between two variables. Just because two variables are related does not imply that there is a cause-and-effect relationship between them.
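The computational formula for r can be sketched in a few lines of Python. The data are the nine (x, y) pairs of the optical density example; the function name `pearson_r` is introduced here for illustration only.

```python
import math

x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.10, 0.20, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def pearson_r(xs, ys):
    """Correlation coefficient from the sums n, Sx, Sy, Sxy, Sxx, Syy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    sxx = sum(xi * xi for xi in xs)
    syy = sum(yi * yi for yi in ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

Squaring the result recovers the coefficient of determination, which gives a quick consistency check between the two calculations.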
Figure: Scatter plots illustrating how the correlation coefficient, r, is a measure of the linear association between two variables.
Spearman rank correlation

Spearman's rank correlation is a nonparametric correlation coefficient test. To perform this procedure, we first arrange the x values from smallest (rank = 1) to largest (rank = n); let $R_i$ be the rank of value $x_i$. Similarly, we arrange the y values from smallest to largest and assign a rank from 1 to n to each value; let $S_i$ be the rank of value $y_i$. Tied values each receive the average of the ranks they occupy. Find the difference $R_i - S_i$ for each pair, and then square it. The last step is to calculate Spearman's correlation coefficient rho ($\rho$) from

\[ \rho = 1 - \frac{6\sum (R_i - S_i)^2}{n(n^2 - 1)} \]

The Spearman rank correlation may be used for qualitative variables which can be ordered. An example is a variable which gives an opinion about something: answer choices such as dislike very much, dislike, no opinion, like, and like very much have a logical order to them as they are listed. An example of a qualitative variable which cannot be ordered is eye color, which has values such as blue, green, and brown. There is no way to logically order such categories, and so we could not use the ranking procedure for them. Note that Pearson correlation is not appropriate in either of these cases.

Example

Consider again the data on age and SBP. The next table contains the preliminary calculations necessary to find the Spearman rank correlation.

    Age (x)   Rank R   SBP (y)   Rank S   R-S     (R-S)^2
    42         2.5     130        4       -1.5     2.25
    46         4       115        2        2       4
    42         2.5     148        5       -2.5     6.25
    71         8       100        1        7      49
    80        12.5     156        9.5      3       9
    74        10       162       13.5     -3.5    12.25
    70         7       151        7        0       0
    80        12.5     156        9.5      3       9
    85        15       162       13.5      1.5     2.25
    72         9       158       11       -2       4
    64         6       155        8       -2       4
    81        14       160       12        2       4
    41         1       125        3       -2       4
    61         5       150        6       -1       1
    75        11       165       15       -4      16
    Total                                         127

From the table we find that

\[ \rho = 1 - \frac{6\sum (R_i - S_i)^2}{n(n^2 - 1)} = 1 - \frac{6(127)}{15(224)} = 1 - \frac{762}{3360} = 1 - 0.2268 = 0.7732 \]

which is not the same as what is obtained for the Pearson correlation. This difference is due to the presence of extreme observations. When such extreme values are few or absent, the results of the two tests will be very close to each other.
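The ranking procedure, including the averaging of tied ranks, can be sketched as follows. This is an illustrative Python version rather than a reference implementation: the helper names `average_ranks` and `spearman_rho` are hypothetical, and the age and SBP values are those read from the example's table. It applies the same simple formula as the text, which is commonly treated as an approximation when ties are present.

```python
age = [42, 46, 42, 71, 80, 74, 70, 80, 85, 72, 64, 81, 41, 61, 75]
sbp = [130, 115, 148, 100, 156, 162, 151, 156, 162, 158, 155, 160, 125, 150, 165]


def average_ranks(values):
    """Rank values from smallest (1) to largest (n); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Return (rho, sum of squared rank differences) using the simple formula."""
    n = len(xs)
    r_ranks = average_ranks(xs)
    s_ranks = average_ranks(ys)
    d2 = sum((r - s) ** 2 for r, s in zip(r_ranks, s_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1)), d2


rho, d2 = spearman_rho(age, sbp)
print(f"sum of squared differences = {d2}, rho = {rho:.4f}")
```

The two ages of 42 occupy ranks 2 and 3 and so each receive 2.5, which matches the Rank R column of the table.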