Section 12.5 Correlation

In the previous sections, we looked at regression, and the value $r^2$ was a measurement of how much of the variation in y can be attributed to the linear relationship between y and x. In this section, we look at a related concept, correlation, which goes beyond regression and measures how strong a linear relationship there is between two random variables, X and Y.

The Sample Correlation Coefficient r

Given n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$: if y is large when x is large, then they have a strong positive relationship. If y is small when x is large, then they have a strong negative relationship. Consider then

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

Here is the rationale: if x and y have a strong positive relationship to one another, then $(x_i - \bar{x})$ will tend to be positive when $(y_i - \bar{y})$ is also positive (when the value of x is above its mean, the value of y will
likewise be above its mean); on the flip side, when $(x_i - \bar{x})$ is negative, $(y_i - \bar{y})$ will also tend to be negative (so when the x's fall below the mean of x, the y's likewise fall below the mean of y). Either way, most of the products are positive, so $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) > 0$, suggesting a positive linear relationship. On the other hand, if the variables are strongly negatively related, then when $(x_i - \bar{x}) < 0$, typically $(y_i - \bar{y}) > 0$, and vice versa, so $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) < 0$.

So $S_{xy}$ is a way to measure how strong a linear relationship between two variables is, but there is one flaw: its magnitude depends on the units. If the units of measurement are small, $S_{xy}$ is small, and if the units of measurement are big, $S_{xy}$ will be big. So we want to re-scale it so that we can objectively decide whether the relationship is strong or weak, regardless of the units of measurement. Instead of using $S_{xy}$ we use

$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}}, \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

r is called the sample correlation coefficient.

Properties of r

1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. $-1 \le r \le 1$
4. r = 1 iff all (x, y) pairs lie on a straight line with positive slope, and r = -1 iff all (x, y) pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model. In other words, the r's are the same: r = sample correlation coefficient; $r^2$ = the coefficient of determination.

When is correlation strong/weak?

Weak: $0 \le |r| \le .5$
Moderate: $.5 < |r| < .8$
Strong: $.8 \le |r| \le 1$

Example 1: Page 49 #59

Toughness and fibrousness of asparagus are major determinants of quality. This was the focus of a study reported in the Journal of the American Society of Horticultural Science. The article reported the accompanying data on x = shear force (in kg) and y = percent fiber dry weight.

x:   46    48    55    57    60    72    81    85    94
y:  2.18  2.10  2.13  2.28  2.34  2.53  2.28  2.62  2.63

x:  109   121   132   137   148   149   184   185   187
y:  2.50  2.66  2.79  2.80  3.01  2.98  3.34  3.49  3.26
$n = 18$, $\sum x_i = 1950$, $\sum x_i^2 = 251{,}970$, $\sum y_i = 47.92$, $\sum y_i^2 = 130.6074$, $\sum x_i y_i = 5530.92$

a. Calculate the value of the sample correlation coefficient. Based on this value, how would you describe the nature of the relationship between the two variables?

b. If a first specimen has a larger value of shear force than does a second specimen, what tends to be true of the percent dry fiber weight for the two specimens?
   Answer: It also tends to be bigger.

c. If shear force is expressed in pounds, what happens to the value of r? Why?
   Answer: No change. This is the purpose of using the correlation coefficient: it is not affected by a change of units.

d. If the simple linear regression model were fit to this data, what proportion of observed variation in percent fiber dry weight could be explained by the model relationship?
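Parts a, c, and d can be checked numerically. The following is a minimal sketch, assuming the summary values n = 18, Σx = 1950, Σx² = 251,970, Σy = 47.92, Σy² = 130.6074, Σxy = 5530.92; the function name `sample_correlation` and the kg-to-lb factor `c` are illustrative choices, not part of the original notes.

```python
import math

def sample_correlation(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Return (Sxx, Syy, Sxy, r) from summary statistics via the shortcut formulas."""
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    r = sxy / math.sqrt(sxx * syy)
    return sxx, syy, sxy, r

# Summary values for the asparagus data, with x measured in kg.
n = 18
sxx, syy, sxy, r = sample_correlation(n, 1950, 251970, 47.92, 130.6074, 5530.92)
print(f"Sxx = {sxx:.1f}, Syy = {syy:.4f}, Sxy = {sxy:.4f}")
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")

# Re-express x in pounds (assumed factor c): sums in x scale by c,
# the sum of squares by c^2, and the cross-product sum by c.
c = 2.2046
_, _, sxy_lb, r_lb = sample_correlation(
    n, c * 1950, c ** 2 * 251970, 47.92, 130.6074, c * 5530.92
)
print(f"Sxy in lb units = {sxy_lb:.4f}, r in lb units = {r_lb:.4f}")
```

Under these inputs r comes out at about 0.966 and r² at about 0.934, which would count as a strong positive relationship on the scale above. Converting to pounds rescales $S_{xy}$ but leaves r unchanged, which is the point of part c.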
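The Population Correlation Coefficient ρ and Inferences about Correlation

The correlation coefficient r is a measure of how strongly related x and y are in the observed sample. We can think of the pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ as being drawn from a bivariate population of pairs, with (X, Y) having joint pdf f(x, y). This goes back to the definition of the correlation coefficient in Chapter 5:

$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

$$\mathrm{Cov}(X, Y) = \begin{cases} \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y) & (x, y) \text{ discrete} \\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy & (x, y) \text{ continuous} \end{cases}$$

The formula for the population correlation coefficient looks very similar to the formula we just developed for the sample correlation coefficient. Thus r is an estimator for ρ:

$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$

We are going to make inferences about ρ, so we will need to discuss a new distribution, the bivariate normal distribution. The joint pdf of (X, Y) is specified by

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\, \frac{(x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2 \right] \right\}, \qquad -\infty < x < \infty,\ -\infty < y < \infty$$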
We only need to know some of its properties:

1. It is a three-dimensional bell-shaped surface that lies entirely above the xy plane.
2. If we slice through the surface with any plane perpendicular to the xy plane, the cross-section has the shape of a normal curve.
   a. In regression, all of the work we did was based on x being fixed in advance, with only y random. That corresponds to slicing the bivariate surface with the plane X = x. This leads to the conditional distribution of Y given X = x, which is normal with mean
      $$\mu_{Y \cdot x} = \mu_Y + \rho\, \sigma_Y\, \frac{x - \mu_X}{\sigma_X},$$
      a linear function of x. This implies that if the ordered pairs (x, y) are drawn from a bivariate normal distribution, then the simple linear regression model is an appropriate way of studying the behavior of Y for fixed x.
3. Assuming that the pairs are drawn from a bivariate normal distribution allows us to test hypotheses about ρ and to construct CIs. The tests we will learn cannot be done when n is small.
4. Bivariate normality implies that the marginal distributions of both X and Y are normal.
5. There is no real way to test for bivariate normality, other than doing normal probability plots for x and y individually. If either is far off from being normal, you cannot use the bivariate normal distribution.

Testing for Absence of Correlation

When $H_0: \rho = 0$ is true, the test statistic

$$T = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}}$$

has a t distribution with n - 2 degrees of freedom.

Alternative hypothesis          Rejection region (level α)
$H_a: \rho > 0$                 $t \ge t_{\alpha, n-2}$
$H_a: \rho < 0$                 $t \le -t_{\alpha, n-2}$
$H_a: \rho \ne 0$               $|t| \ge t_{\alpha/2, n-2}$

When the null hypothesis $H_0: \rho = 0$ is true, there is no linear relationship between the two random variables. In Section 12.3, we did a hypothesis test using $H_0: \beta_1 = 0$ as the null hypothesis, with test statistic $T = \hat{\beta}_1 / s_{\hat{\beta}_1}$. If that null hypothesis is true, then the slope $\beta_1$ of the true regression line is 0, so the line is horizontal and y does not depend on x at all (in other words, the two variables are independent). This is actually the same test as the one above, because

$$T = \frac{R\sqrt{n - 2}}{\sqrt{1 - R^2}} = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$$

Example 1 continued

For the asparagus data (x = shear force in kg, y = percent fiber dry weight):

$n = 18$, $\sum x_i = 1950$, $\sum x_i^2 = 251{,}970$, $\sum y_i = 47.92$, $\sum y_i^2 = 130.6074$, $\sum x_i y_i = 5530.92$

e. Carry out a test at significance level .01 to decide whether there is a positive linear association between the two variables.

Other Inferences Concerning ρ

To test $H_0: \rho = \rho_0$ with $\rho_0 \ne 0$, we must use a transformed random variable, called the Fisher transformation. When $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate normal distribution, the random variable
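$$V = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right)$$

has approximately a normal distribution with mean and variance

$$\mu_V = \frac{1}{2} \ln\left(\frac{1 + \rho}{1 - \rho}\right) \qquad \text{and} \qquad \sigma_V^2 = \frac{1}{n - 3}$$

This cannot be used if n is small.

Hypothesis tests of the form $H_0: \rho = \rho_0$

Test statistic:

$$Z = \frac{V - \frac{1}{2} \ln\left(\frac{1 + \rho_0}{1 - \rho_0}\right)}{1 / \sqrt{n - 3}}$$

Alternative hypothesis          Rejection region (level α)
$H_a: \rho > \rho_0$            $z \ge z_{\alpha}$
$H_a: \rho < \rho_0$            $z \le -z_{\alpha}$
$H_a: \rho \ne \rho_0$          $|z| \ge z_{\alpha/2}$

Confidence Intervals for ρ

Just as we did before, we obtain a CI by beginning with our random variable, V in this case, standardizing it, and using the normal table to find the value of $z_{\alpha/2}$. This gives a $100(1 - \alpha)\%$ CI for $\mu_V$:

$$(c_1, c_2) = \left( v - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ v + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right)$$

Inverting the transformation then gives the CI for ρ:

$$\left( \frac{e^{2c_1} - 1}{e^{2c_1} + 1},\ \frac{e^{2c_2} - 1}{e^{2c_2} + 1} \right)$$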
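The t test for ρ = 0 and the Fisher-transformation procedures can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, the value r = 0.9662 with n = 18 is the Example 1 result computed from its summary statistics, and the 95% level in the interval is chosen only for demonstration.

```python
import math
from statistics import NormalDist

def t_statistic_zero_corr(r, n):
    """T = r*sqrt(n-2)/sqrt(1-r^2); compare with a t critical value on n-2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def fisher_v(r):
    """Fisher transformation v = (1/2) ln((1+r)/(1-r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def fisher_z_statistic(r, n, rho0):
    """Z = (v - mu_V under H0) / (1/sqrt(n-3)) for testing H0: rho = rho0."""
    return (fisher_v(r) - fisher_v(rho0)) * math.sqrt(n - 3)

def fisher_ci(r, n, level=0.95):
    """Approximate CI for rho: build a CI for mu_V, then back-transform."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    c1 = fisher_v(r) - half_width
    c2 = fisher_v(r) + half_width

    def back(c):
        return (math.exp(2 * c) - 1) / (math.exp(2 * c) + 1)

    return back(c1), back(c2)

# Example 1, part e: H0: rho = 0 versus Ha: rho > 0 at level .01.
r, n = 0.9662, 18
t = t_statistic_zero_corr(r, n)
print(f"t = {t:.2f} on {n - 2} df")   # compare with the table value t_{.01,16} = 2.583

# Illustrative interval for rho using the same r and n.
low, high = fisher_ci(r, n, level=0.95)
print(f"approximate 95% CI for rho: ({low:.3f}, {high:.3f})")
```

For Example 1 the statistic is roughly 15, far above the tabled critical value 2.583, so $H_0$ would be rejected in favor of a positive linear association. The critical value is taken from a t table here because the Python standard library has no t distribution.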
Example 2: The article "Increases in Steroid Binding Globulins Induced by Tamoxifen in Patients with Carcinoma of the Breast" reports data on the effects of the drug tamoxifen on change in the level of cortisol-binding globulin (CBG) of patients during treatment. With age = x and change in CBG = y, summary values are

a. Compute a 90% CI for the true correlation coefficient ρ.
b. Test $H_0: \rho = .5$ vs. $H_a: \rho < .5$.

c. In a regression analysis of y on x, what proportion of variation in the change in cortisol-binding globulin level could be explained by variation in patient age within the sample?

d. If you decide to perform a regression analysis with age as the dependent variable, what proportion of variation in age is explainable by variation in the change in CBG?
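Parts c and d rest on two properties of r: its square is the coefficient of determination, and it does not depend on which variable is labeled x. Since no raw data are listed for this example, the sketch below checks the idea on the asparagus measurements from Example 1 instead; the function name `r_squared` is a hypothetical helper, and the fit is ordinary least squares.

```python
def r_squared(x, y):
    """Coefficient of determination, 1 - SSE/SST, from regressing y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # least squares slope
    b0 = ybar - b1 * xbar        # least squares intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

# Example 1 data: shear force (kg) and percent fiber dry weight.
shear = [46, 48, 55, 57, 60, 72, 81, 85, 94,
         109, 121, 132, 137, 148, 149, 184, 185, 187]
fiber = [2.18, 2.10, 2.13, 2.28, 2.34, 2.53, 2.28, 2.62, 2.63,
         2.50, 2.66, 2.79, 2.80, 3.01, 2.98, 3.34, 3.49, 3.26]

r2_y_on_x = r_squared(shear, fiber)   # fiber regressed on shear force
r2_x_on_y = r_squared(fiber, shear)   # shear force regressed on fiber
print(f"y on x: {r2_y_on_x:.4f}, x on y: {r2_x_on_y:.4f}")
```

The two fitted lines differ, but both regressions explain the same proportion of variation, so the answers to parts c and d are the same number, $r^2$.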