Chapter 13: The Correlation Coefficient and the Regression Line

We begin with some useful facts about straight lines. Recall the x, y coordinate system, as pictured below.

[Figure: the x, y coordinate system showing the lines y = 2.5, y = 0.5x and y = 1 - x.]

We say that y is a linear function of x if y = a + bx, for some numbers a and b. If y = a + bx then the graph of the function is a straight line with y-intercept equal to a and slope equal to b. The line is horizontal if, and only if, b = 0; otherwise it slopes up if b > 0 and slopes down if b < 0. The only lines not covered by the above are the vertical lines, e.g. x = 6. Vertical lines are not interesting in Statistics.

In math class we learn that lines extend forever. In statistical applications, as we will see, they never extend forever. This distinction is very important. In fact, it would be more accurate to say that statisticians study line segments, not lines, but everybody says lines.

It will be very important for you to understand lines in two ways, what I call visually and analytically.

Here is what I mean. Consider the line y = 5 + 2x. We will want to substitute (plug in) values for x to learn what we get for y. For example, x = 3. We do this analytically by substituting in the equation: y = 5 + 2(3) = 11.

But we can also do this visually, by graphing the function. Walk along the x axis until we get to x = 3 and then climb up a rope (slide down a pole) until we hit the line. Our height when we hit the line is y = 11. (Draw picture on board.)

The Scatterplot

We are interested in situations in which we obtain two numbers per subject. For example, if the subjects are college students, the numbers could be:

X = height and Y = weight.
X = score on ACT and Y = first year GPA.
X = number of AP credits and Y = first year GPA.

Law schools are interested in: X = LSAT score and Y = first year law school GPA. And so on.

In each of these examples, the Y is considered more important by the researcher and is called the response. The X is important because its value might help us understand Y better and it is called the predictor.
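Returning to the line y = 5 + 2x discussed above: the analytic "plug in" step can be sketched in a few lines of Python. This is only an illustration; the helper names `make_line` and `direction` are my own and do not come from the text.

```python
def make_line(a, b):
    """Return the function x -> a + b*x (y-intercept a, slope b)."""
    def y(x):
        return a + b * x
    return y


def direction(b):
    """Describe a line by the sign of its slope b."""
    if b == 0:
        return "horizontal"
    return "slopes up" if b > 0 else "slopes down"


line = make_line(5, 2)      # the line y = 5 + 2x
y_at_3 = line(3)            # analytic substitution: 5 + 2(3)
print(y_at_3, direction(2))
```

The visual method (walk to x = 3, then climb to the line) gives the same height; the code simply does the substitution for us.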
For some studies, reasonable people can disagree on which variable to call Y. Here are two examples: The subjects are married couples and the variables are: wife's IQ and husband's IQ. The subjects are identical twins and the variables are: first born's IQ and second born's IQ.

We study two big topics in Chapter 13. For the first of these, the correlation coefficient, it does not matter which variable is called Y.

For the second of these, the regression line, changing the assignment of Y and X will change the answer. Thus, if you are uncertain on the assignment, you might choose to do the regression line analysis twice, once for each assignment.

The material in Chapter 13 differs substantially from what we have done in this class. In Chapter 13, we impose fairly strict structure on how we view the data. This structure allows researchers to obtain very elaborate answers from a small amount of data. Perhaps surprisingly, these answers have a history of working very well in science. But it will be important to have a healthy skepticism about the answers we get and to examine the data carefully to decide whether the imposed structure seems reasonable.

We begin with an example with n = 124 subjects, a very large number of subjects for these problems. As we will see, often n is 10 or smaller.

The subjects are 124 men who played major league baseball in both 1985 and 1986. This set contains every man who had at least 200 official at-bats in the American League in both years. The variables are: Y = 1986 Batting Average (BA) and X = 1985 BA. The idea is that, as a baseball executive, you might be interested in learning how effectively offensive performance one year (1985) can predict offensive performance the next year (1986).

In case you are not a baseball fan, here is all you need to know about this example. BA is a measure of offensive performance, with larger values better. BA is not really an average; it is a proportion: BA equals number of hits divided by number of official at-bats. BA is always reported to three digits of precision and a BA of, say, 0.300 is referred to as "hitting 300." BTW, 300 is the threshold for good hitting and 200 is the threshold for really bad hitting.

The names and data for the 124 men are on pp. 442-3. Behaving like the MITK, we first study the variables individually, following the ideas of Chapter 12.

[Figure: histograms of X and Y, each on an axis running from 0.180 to 0.360.]

These histograms suggest small and large outliers both years.
In addition, both histograms are close to symmetry and bell-shape. Also, the means and sd's changed little from X to Y.

Year    Mean      St.Dev.
1985    0.2664    0.0280
1986    0.2636    0.0320

Below is the scatterplot of these BA data. The first thing we look for are isolated cases (IC). I see two, possibly three, IC identified by initials below: WB, DM and FR.

[Figure: scatterplot of 1986 Batting Ave. (vertical axis) versus 1985 Batting Ave. (horizontal axis), both axes running from 0.170 to 0.370, with the cases DM, WB and FR labeled.]

Now, ignore the outliers and look for a pattern in the remaining data. For the BA data, the data describe an ellipse that is tilted upwards (lower to the left, higher to the right). This is an example of a linear relationship between X and Y; i.e. as X grows larger (sweep your eyes from left to right in the picture), the Y values tend to increase (become higher).

In Chapter 13, we limit attention to data sets that reveal a linear relationship between X and Y. If your data do not follow a linear relationship, you should not use the methods of Chapter 13. Thus, your analysis should always begin with a scatterplot to investigate whether a linear relationship is reasonable.

Page 447 of the text presents five hypothetical scatterplots: one reveals an increasing linear pattern; one reveals a decreasing linear pattern; and the remaining three show various curved relationships between X and Y. Thus, to reiterate: if your scatterplot is curved, do not use the methods of Chapter 13.

Page 448 of the text presents four scatterplots for data sets for small values of n (the n's are 9, 6, 12 and 13, typical sizes in practice). The subjects are spiders and the four scatterplots correspond to four categories of spiders. For each spider, Y is heart rate and X is weight. Above each scatterplot is the numerical value of r, the correlation coefficient of the data. At this time, it suffices to note that r > 0 indicates (reflects?) an increasing linear relationship and r < 0 indicates a decreasing linear relationship between Y and X.

There are two important ideas revealed by these scatterplots. First, for small n it can be difficult to decide whether a case is isolated; whenever possible, use your scientific knowledge to help with this decision. Second, especially for a small n, the presence of one or two isolated cases can drastically change our view of the data. For example, consider the n = 9 small hunters.

The two spiders in the lower left of the scatterplot might be labeled isolated.
Including these cases, the text states that r > 0, but if they are deleted from the data set (which could be a deliberate action by the researcher, or perhaps these guys were stepped on during their commute to the lab) then r < 0. Scientists typically get very excited about whether r is positive or negative, so it is noteworthy that its sign can change so easily.

Thus far, we have been quite casual about looking at scatterplots. We say, "The pattern is linear and looks increasing (decreasing, flat)." It will remain (in this course) the job of our eyes and brain to decide on linearity, but the matter of increasing or decreasing will be usurped by the statisticians.

Furthermore, using my eyes and brain, I can say that the pattern is decreasing for tarantulas and for web weavers (r agrees with me), and I can say that the linear pattern is stronger for the tarantulas.

The correlation coefficient agrees with me on the issue of strength and has the further benefit of quantifying the notion of "stronger" in a manner that is useful to scientists.

I am not very good at motivating the formula for the correlation coefficient. In addition, the end of the semester is near, so time is limited. The interested student is referred to pp. 450-3 of the text for a (partial) explanation of the formula.

Here is the briefest of presentations of the formula. Each subject has an x and a y. We standardize these values into x' and y': $x' = (x - \bar{x})/s_X$; $y' = (y - \bar{y})/s_Y$.

We then form the product $z = x'y'$. The idea is that z > 0 provides evidence of an increasing relationship and z < 0 provides evidence of a decreasing relationship.

(The product is positive if both terms are positive or both are negative. Both positive means a large x is matched with a large y; both negative means a small x is matched with a small y.)

The correlation coefficient combines the z's by almost computing their mean: $r = \frac{\sum z}{n - 1}$.

The next slide presents 12 prototypes of the correspondence between a scatterplot and its correlation coefficient. These 12 scatterplots illustrate six important facts about correlation coefficients. These six facts appear on pages 454 and 456 of the text and will not be reprinted here.

[Figure: 12 prototype scatterplots, each labeled with its correlation coefficient; the magnitudes shown are 0.00, 0.20, 0.40, 0.60, 0.80 and 1.00.]

13.3: The regression line.

[Figure: scatterplot of Air Temp. (60 to 90) versus Chirps per Minute (100 to 200), with the line ẏ = 37.5 + 0.25x.]

[Figure: scatterplot of Air Temp. (60 to 90) versus Chirps per Minute (100 to 200), with the line ŷ = 56.2 + 0.136x.]

    x       y       ẏ      y - ẏ    (y - ẏ)^2
145.0    62.6    73.75   -11.15     124.32
172.0    81.5    80.50     1.00       1.00
155.0    77.9    76.25     1.65       2.72
137.0    84.2    71.75    12.45     155.00
179.5    92.8    82.37    10.43     108.68
192.0    86.9    85.50     1.40       1.96
207.0    87.8    89.25    -1.45       2.10
165.5    69.8    78.87    -9.07      82.36
193.0    71.6    85.75   -14.15     200.22
100.0    71.6    62.50     9.10      82.81
189.0    80.4    84.75    -4.35      18.92
SSE(ẏ) = 780.10

    x       y       ŷ      y - ŷ    (y - ŷ)^2
145.0    62.6    75.92   -13.32     177.42
172.0    81.5    79.59     1.91       3.64
155.0    77.9    77.28     0.62       0.38
137.0    84.2    74.83     9.37      87.76
179.5    92.8    80.61    12.19     148.55
192.0    86.9    82.31     4.59      21.05
207.0    87.8    84.35     3.45      11.89
165.5    69.8    78.71    -8.91      79.35
193.0    71.6    82.45   -10.85     117.68
100.0    71.6    69.80     1.80       3.24
189.0    80.4    81.90    -1.50       2.26
SSE(ŷ) = 653.23

On n = 11 occasions, Susan Robords determined two values for different crickets: Y is the air temperature and X is the cricket's chirp rate in chirps per minute. In her campcraft class, she was told that one can calculate the air temperature with the following equation: ẏ = 37.5 + 0.25x.

Above, we have a scatterplot of Susan's data with this line. The most obvious fact is that "calculate" was way too optimistic!
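Now that the cricket data are on the table, the recipe for r given earlier (standardize each x and y, multiply, and "almost" average the products) can be sketched in Python. This is a minimal sketch, not the text's own computation: the function names are mine, and I am assuming the standard deviations use the n - 1 divisor of Chapter 12.

```python
from math import sqrt

# Cricket data from the tables above: x = chirps per minute, y = air temperature.
chirps = [145.0, 172.0, 155.0, 137.0, 179.5, 192.0, 207.0, 165.5, 193.0, 100.0, 189.0]
temps = [62.6, 81.5, 77.9, 84.2, 92.8, 86.9, 87.8, 69.8, 71.6, 71.6, 80.4]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (assumed n - 1 divisor)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = (sum of the products z = x'y') / (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sd(xs), sd(ys)
    z = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z) / (len(xs) - 1)


r = correlation(chirps, temps)
print(round(mean(chirps), 1), round(sd(chirps), 1), round(r, 3))
```

Swapping the roles of X and Y in the call leaves r unchanged (up to floating-point rounding), which matches the earlier remark that, for the correlation coefficient, it does not matter which variable is called Y.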

As Yogi Berra once said, "You can observe a lot by just watching." Let's follow his advice and examine the scatterplot and table above. We see that on some occasions, ẏ provides an accurate prediction of y. Visually, this is represented by circles that are on, touching, or nearly touching the line. But on many other occasions, the predictions are poor: the line is either far lower than the circle (the prediction is too small) or the line is far higher than the circle (the prediction is too large).

Next, we do something very strange. We change perspective and instead of saying that the prediction is too small (large) we say that the observation is too large (small). Egocentric? Yes, but there are two reasons.

First, look at the scatterplot and line again. It is easier to focus on the line and see how the points deviate from it, than it is to focus on all the points (n could be large) and see how the line deviates.

Second, we plan to compare ẏ and y by subtraction. We could use ẏ - y or y - ẏ. The former takes y as the standard and the latter reverses the roles. For circles below the line, I want this error to be a negative number; to get that I must subtract in the order y - ẏ; that is, I take the prediction as the standard and the observation errs by not agreeing.

Look at the table again. The ideal for the error y - ẏ is 0. As the error moves away from 0, in either direction, the inadequacy of the prediction becomes more and more serious. For math reasons (and often it makes sense scientifically; at least approximately) we consider an error of, say, -5 to be exactly as serious as an error of +5. As in Chapter 12, we might be tempted to achieve this by taking the absolute value of each error, but, again, we get much better math results by squaring the errors. Finally, we sum all of the squared errors to obtain: SSE(ẏ) = 780.10.

Ideally, SSE = 0 and the larger it is, the worse the prediction. You are probably thinking that we need to adjust SSE to account for sample size, but we won't bother with that. Instead, we pose the following question: Can we improve on Susan's line?
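The error and SSE bookkeeping described above can be sketched in Python for Susan's line ẏ = 37.5 + 0.25x. The function names `predict`, `errors` and `sse` are my own labels for illustration; the data are the 11 (x, y) pairs from the tables.

```python
# Cricket data from the tables: x = chirps per minute, y = air temperature.
chirps = [145.0, 172.0, 155.0, 137.0, 179.5, 192.0, 207.0, 165.5, 193.0, 100.0, 189.0]
temps = [62.6, 81.5, 77.9, 84.2, 92.8, 86.9, 87.8, 69.8, 71.6, 71.6, 80.4]


def predict(x, intercept, slope):
    """Height of the line at x."""
    return intercept + slope * x


def errors(xs, ys, intercept, slope):
    """Observation minus prediction, so circles below the line give negative errors."""
    return [y - predict(x, intercept, slope) for x, y in zip(xs, ys)]


def sse(xs, ys, intercept, slope):
    """Sum of the squared errors for the given line."""
    return sum(e ** 2 for e in errors(xs, ys, intercept, slope))


susan_errors = errors(chirps, temps, 37.5, 0.25)
susan_sse = sse(chirps, temps, 37.5, 0.25)
print([round(e, 2) for e in susan_errors])
print(round(susan_sse, 2))
```

Because the errors are squared before summing, an error of -5 and an error of +5 contribute the same amount, and the total should agree with the table's SSE of about 780.10.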
Or, put another way: Can we find another prediction line which has an SSE that is smaller than Susan's 780.10? I suggest the line ŷ = 56.2 + 0.136x. From the table, we see that SSE(ŷ) = 653.23. Thus, according to The Principle of Least Squares, ŷ is superior to ẏ. Can we do better than my ŷ? No.

Major Result: There is always a unique line that minimizes SSE over all possible lines. The equation of the line is given as $\hat{y} = b_0 + b_1 x$, where $b_1 = r(s_Y/s_X)$ and $b_0 = \bar{y} - b_1\bar{x}$.

For the cricket data, for example, it can be shown that $\bar{x} = 166.8$, $s_X = 31.0$, $\bar{y} = 78.83$, $s_Y = 9.11$, and $r = 0.461$. Substituting these values into the above yields $b_1 = 0.461(9.11/31.0) = 0.1355$, and $b_0 = 78.83 - 0.1355(166.8) = 56.23$. Thus, the equation of the best prediction line is ŷ = 56.23 + 0.1355x, which I rounded in my earlier presentation of it.

The means and sd's of the BA data were given earlier, in the table of yearly means and standard deviations, and the data have r = 0.554. Thus, $b_1 = 0.554(0.032/0.028) = 0.633$, and $b_0 = 0.2636 - 0.633(0.2664) = 0.095$. Thus, the equation of the regression line is ŷ = 0.095 + 0.633x. This line appears on page 471 of the text.
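The Major Result can be checked numerically. The sketch below builds the line from the five summary numbers (two means, two sd's and r), first for the cricket data and then for the BA summary values quoted above. The helper names are mine, the sd's are assumed to use the n - 1 divisor, and the intercept computed from unrounded data may differ slightly from the hand-rounded 56.23.

```python
from math import sqrt

chirps = [145.0, 172.0, 155.0, 137.0, 179.5, 192.0, 207.0, 165.5, 193.0, 100.0, 189.0]
temps = [62.6, 81.5, 77.9, 84.2, 92.8, 86.9, 87.8, 69.8, 71.6, 71.6, 80.4]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (assumed n - 1 divisor)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    x_bar, y_bar, s_x, s_y = mean(xs), mean(ys), sd(xs), sd(ys)
    z = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z) / (len(xs) - 1)


def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) with b1 = r * (s_Y / s_X) and b0 = y_bar - b1 * x_bar."""
    b1 = r * (s_y / s_x)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(xs, ys, intercept, slope):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Cricket data: all five numbers computed from the raw pairs.
b0, b1 = line_from_summary(mean(chirps), sd(chirps), mean(temps), sd(temps),
                           correlation(chirps, temps))
best_sse = sse(chirps, temps, b0, b1)
rounded_sse = sse(chirps, temps, 56.2, 0.136)   # my rounded line
susan_sse = sse(chirps, temps, 37.5, 0.25)      # Susan's line

# BA data: only the summary values are available here.
ba_b0, ba_b1 = line_from_summary(0.2664, 0.028, 0.2636, 0.032, 0.554)

print(round(b0, 2), round(b1, 4), round(best_sse, 2))
print(round(ba_b0, 3), round(ba_b1, 3))
```

The three SSE values should come out ordered as the text claims: the least squares line is no worse than my rounded version of it, and both beat Susan's line.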

We have seen that it is easy to calculate ŷ and it is the best line possible (based on the principle of least squares), but is it any good? (Is Sylvester Stallone's best performance any good? Is there a reason he has never done Shakespeare?)

First, note that we can see why r is so important. We need five numbers to calculate ŷ: two numbers that tell us about x only; two numbers that tell us about y only; and one number (r) that tells us how x and y relate to each other. In other words, r tells us all we need to know about the association between x and y.

We obtain the regression line by calculating two numbers: $b_0$ and $b_1$. Thus, obviously, this pair of numbers is important. Also, $b_1$, the slope, is important by itself; it tells us how a change in x affects our prediction ŷ. Unlike in mathematics, however, the intercept, $b_0$, alone usually is not of interest.

Now in math, the intercept is interpreted as the value of y when x = 0. Consider our examples. For the Cricket study, x = 0 gives us ŷ = 56.2. But we have no data at or near x = 0; thus, we really don't know what it means for x to equal 0. (Discuss.) Similarly, for the BA study, x = 0 predicts a 1986 BA of 0.095. But nobody batted at or near 0.000 in 1985. In fact, I conjecture that in the history of baseball there has never been a position player with at least 200 at-bats who batted 0.000.

Consider the following scatterplot of fish activity versus water temperature for fish in an aquarium. (Should we use these data to predict fish activity for x = 32? For x = 212?)

[Figure: scatterplot of Fish Activity (300 to 500) versus Water Temp. (F) (70 to 80).]

The above considerations have resulted in some statisticians advocating a second way to write the equation for ŷ: $\hat{y} = \bar{y} + b_1(x - \bar{x})$.

For the cricket study: $\hat{y} = 78.83 + 0.461(9.11/31.0)(x - 166.8) = 78.83 + 0.1355(x - 166.8)$.

This second formula contains three numbers and they all have meaning: the mean of the predictor; the mean of the response and the slope. For better or worse, this formulation has not become popular and you are not responsible for it on the final.
It does, however, give us an easy proof of one of the most important features of the regression line, something I like to call: The law of preservation of mediocrity!

Suppose that a subject is mediocre on x; that is, the subject's $x = \bar{x}$. What is the predicted response for this subject? Plugging $x = \bar{x}$ into $\hat{y} = \bar{y} + b_1(x - \bar{x})$ we get $\hat{y} = \bar{y} + b_1(\bar{x} - \bar{x}) = \bar{y} + b_1(0) = \bar{y}$.
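The law can also be seen numerically. The sketch below uses the rounded cricket values quoted in these notes (mean chirp rate 166.8, mean temperature 78.83, slope 0.1355); the function names are mine, chosen only for illustration.

```python
X_BAR, Y_BAR, B1 = 166.8, 78.83, 0.1355   # rounded cricket summary values
B0 = Y_BAR - B1 * X_BAR                    # intercept implied by the same numbers


def predict_centered(x):
    """Second form: y-hat = y-bar + b1 * (x - x-bar)."""
    return Y_BAR + B1 * (x - X_BAR)


def predict_intercept(x):
    """Usual form: y-hat = b0 + b1 * x."""
    return B0 + B1 * x


# A subject who is mediocre on x (x equals the mean) is predicted to be mediocre on y.
mediocre_prediction = predict_centered(X_BAR)
print(mediocre_prediction)

# The two forms describe the same line, so they agree at every x (up to rounding).
gap = max(abs(predict_centered(x) - predict_intercept(x)) for x in (100.0, 150.0, 207.0))
print(gap < 1e-9)
```

In the centered form the slope term vanishes exactly at the mean of x, so the prediction there is the mean of y whatever the value of the slope.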