
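MATH 482 Least-Squares Regression
Dr. Neal, WKU

As well as finding the correlation of paired sample data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we also can plot the data with a scatterplot and find the least-squares regression line through the data. This line, denoted $y = a x + b$, provides an approximate linear functional relationship between the values of $x_i$ and $y_i$. Of course, it is only a good fit if the correlation is near $\pm 1$, which means that there is a strong linear dependence between the measurements $X$ and $Y$.

The slope $a$ is given by

$$a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \quad \text{where} \quad \overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i \quad \text{and} \quad \overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2.$$

After the slope $a$ is calculated, the intercept $b$ is given by $b = \bar{y} - a\,\bar{x}$. Then we have $y = a x + b$.

Theorem. The least-squares line is the line that minimizes the sum of squared errors between the actual $y_i$ values and the linear approximations $a x_i + b$. That is, the sum $\sum_{i=1}^{n} (a x_i + b - y_i)^2$ is minimized with these choices of $a$ and $b$.

Proof. Let $f(a, b) = \sum_{i=1}^{n} (a x_i + b - y_i)^2$. We take the first partial derivatives with respect to $a$ and $b$, equate them to 0, then solve for $a$ and $b$. First,

$$0 = \frac{\partial f}{\partial b} = \sum_{i=1}^{n} 2(a x_i + b - y_i),$$

which, after dividing by $2n$, gives

$$0 = \frac{a}{n}\sum_{i=1}^{n} x_i + b - \frac{1}{n}\sum_{i=1}^{n} y_i = a\,\bar{x} + b - \bar{y}.$$

Solving for $b$ gives $b = \bar{y} - a\,\bar{x}$. Note that the 2nd derivative of $f$ with respect to $b$ equals $2n$, which is always positive. Thus the critical point $b = \bar{y} - a\,\bar{x}$ does yield a minimum value by the 2nd derivative test.

Next, we have

$$0 = \frac{\partial f}{\partial a} = \sum_{i=1}^{n} 2(a x_i + b - y_i)\,x_i = 2a\sum_{i=1}^{n} x_i^2 + 2b\sum_{i=1}^{n} x_i - 2\sum_{i=1}^{n} x_i y_i = 2an\,\overline{x^2} + 2bn\,\bar{x} - 2n\,\overline{xy}.$$

Thus, dividing by $2n$ and substituting $b = \bar{y} - a\,\bar{x}$,

$$0 = a\,\overline{x^2} + b\,\bar{x} - \overline{xy} = a\,\overline{x^2} + (\bar{y} - a\,\bar{x})\,\bar{x} - \overline{xy}.$$

Solving for $a$ gives

$$a = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}.$$

Now the 2nd derivative of $f$ with respect to $a$ equals $2\sum_{i=1}^{n} x_i^2 = 2n\,\overline{x^2}$, which is always positive. So the critical point value for $a$ does yield a minimum value.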
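The slope and intercept formulas can be tried out directly. The following is a minimal sketch in plain Python, not part of the original handout; the function names and the five data points are made up for illustration. It computes $a$ and $b$ from the means and then checks that nudging either coefficient does not lower the sum of squared errors.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y = a*x + b using the mean formulas for a and b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n   # mean of x_i * y_i
    x2_bar = sum(x * x for x in xs) / n               # mean of x_i ** 2
    a = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    b = y_bar - a * x_bar
    return a, b


def sum_squared_errors(a, b, xs, ys):
    """The function f(a, b) from the theorem."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (an assumption, not taken from the handout).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
best = sum_squared_errors(a, b, xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, SSE = {best:.4f}")

# Nudging either coefficient should never give a smaller SSE.
for da, db in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sum_squared_errors(a + da, b + db, xs, ys) >= best
```

The perturbation loop is only a spot check at four nearby points; the proof is what establishes the minimum.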

To compute both the sample correlation $r$ and the least-squares line on a TI, enter paired data into lists L1 and L2 (or some other pair of lists), press STAT, scroll to CALC, press 4 for LinReg(ax+b), then enter the command LinReg(ax+b) L1, L2.

Some Other Quick Facts

1. The point $(\bar{x}, \bar{y})$ is always on the least-squares regression line.

2. The slope of the least-squares regression line also is given by $a = r\,\dfrac{\sigma_Y}{\sigma_X}$, where

$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_X\,\sigma_Y} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\overline{x^2} - \bar{x}^2}\,\sqrt{\overline{y^2} - \bar{y}^2}}, \quad \sigma_X = \sqrt{\overline{x^2} - \bar{x}^2}, \quad \text{and} \quad \sigma_Y = \sqrt{\overline{y^2} - \bar{y}^2}.$$

3. The value $r^2$ is the coefficient of determination. It measures the proportion of the variation in the observed $y$ values that is accounted for by the regression fit. Because $0 \le r^2 \le 1$, an $r^2$ near 1 means that there is a strong fit, and an $r^2$ near 0 means that there is virtually no fit of the data.

Examples. (i) Make a scatterplot; (ii) Compute $r$ and explain what it means; (iii) Find the equation of the least-squares regression line and graph it through the scatterplot; (iv) Explain what $r^2$ means in each case.

1. Analyze the relationship between the tar and nicotine levels in cigarettes.

Brand             Tar (mg)   Nicotine (mg)
Alpine            14.1       0.86
Benson & Hedges   16.0       1.06
Bull Durham       29.8       2.03
Camel Lights       8.0       0.67
Carlton            4.1       0.40
Chesterfield      15.0       1.04
Golden Lights      8.8       0.76
Kent              12.4       0.95
Kool              16.6       1.12
L&M               14.9       1.02

[Calculator screens: STAT EDIT, STAT PLOT, Adjust Settings, ZOOM 9]

We see a general trend: As the tar level increases, the nicotine level seems to increase.
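The tar and nicotine data can also be analyzed without a calculator. The sketch below, in plain Python, is an illustration rather than the handout's own procedure; the variable names are assumptions. It computes $r$ from the mean formulas, obtains the slope as $a = r\,\sigma_Y / \sigma_X$ (Quick Fact 2), and takes the intercept as $b = \bar{y} - a\,\bar{x}$, which forces $(\bar{x}, \bar{y})$ onto the line (Quick Fact 1).

```python
import math

# Tar (mg) and nicotine (mg) for the ten brands listed in Example 1.
tar = [14.1, 16.0, 29.8, 8.0, 4.1, 15.0, 8.8, 12.4, 16.6, 14.9]
nicotine = [0.86, 1.06, 2.03, 0.67, 0.40, 1.04, 0.76, 0.95, 1.12, 1.02]


def mean(values):
    return sum(values) / len(values)


x_bar, y_bar = mean(tar), mean(nicotine)
xy_bar = mean([x * y for x, y in zip(tar, nicotine)])
sigma_x = math.sqrt(mean([x * x for x in tar]) - x_bar ** 2)
sigma_y = math.sqrt(mean([y * y for y in nicotine]) - y_bar ** 2)

r = (xy_bar - x_bar * y_bar) / (sigma_x * sigma_y)
a = r * sigma_y / sigma_x   # slope via Quick Fact 2
b = y_bar - a * x_bar       # intercept; puts (x_bar, y_bar) on the line

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"y = {a:.7f} x + {b:.6f}")
print(f"prediction at 20 mg tar: {a * 20 + b:.3f} mg nicotine")
```

Note that $\sigma_X$ and $\sigma_Y$ here divide by $n$, matching the formulas in Quick Fact 2; the ratio $\sigma_Y / \sigma_X$ and the value of $r$ would be the same with an $n - 1$ divisor.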

[Calculator screens: STAT CALC, Enter command]

$r \approx 0.9871$

Because $r \approx 0.9871$ is so close to $+1$, there is a strong positive linear relationship between tar and nicotine. As the tar level increases, the nicotine level tends to increase linearly.

The least-squares regression line is given by $y = a x + b \approx 0.0610475\,x + 0.138167$. Due to the close fit, this linear function could be used to predict a nicotine level $y$ for a given tar level $x$. For instance, if $x = 20$ mg of tar, then $y \approx 1.359$ mg of nicotine.

[Calculator screens: Y=; from VARS, Statistics, EQ; press ENTER; from CALC (2nd TRACE), value, X = 20]

Here $r^2 \approx 0.974$. Using the line $y = 0.0610475\,x + 0.138167$ as a predictor, nicotine level is 97.4% determined by tar level and 2.6% determined by other factors.

2. If a person has high body density, then they should have less body fat. The following data lists measurements of body densities and percentages of body fat from a random sample of people. Is the relationship observable?

Person   Body Density   Body Fat
1        1.0708         12.6%
2        1.0751         10.9%
3        1.0549         19%
4        1.0722         12%
5        1.0622         16.1%
6        1.0668         14.2%
7        1.079           9.4%
8        1.0775          9.9%
9        1.055          19%
10       1.0416         24.5%

[Calculator screens: STAT EDIT, STAT PLOT, Adjust Settings, ZOOM 9]

We see the strong trend: As body density increases, the body fat decreases.

[Calculator screens: STAT CALC, Enter command]

$r \approx -0.999948$

Because $r$ is so close to $-1$, there is a strong negative linear relationship between body density and body fat. As body density increases, the body fat tends to decrease linearly.

The least-squares regression line is given by $y = a x + b \approx -404.5927\,x + 445.8576$, which in this case gives an almost perfect linear fit.

[Calculator screens: Y=; from VARS, Statistics, EQ; press ENTER]

Because $r^2 \approx 0.9999$, we can say that when using $y = -404.5927\,x + 445.8576$ as a predictor, body fat is 99.99% determined by one's body density.

3. Is there a relationship between driving speed and MPG for your gas-guzzling SUV?

Speed (mph)   MPG
10            18
20            20
40            22
60            20
70            18

[Calculator screens: STAT EDIT, STAT PLOT, Adjust Settings, ZOOM 9]

There clearly seems to be some sort of relationship (perhaps quadratic). The mpg increases as speed increases to a certain point; but then as speed increases further, the mpg drops off.
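A quick numerical look at the linear association in the speed and MPG data, sketched in plain Python with assumed variable names. It evaluates $\overline{xy} - \bar{x}\,\bar{y}$, which is the numerator of both the slope $a$ and the correlation $r$, and then lists how far each MPG value sits from the fitted line.

```python
speed = [10, 20, 40, 60, 70]   # mph
mpg = [18, 20, 22, 20, 18]

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(mpg) / n

# Shared numerator of the slope a and the correlation r.
numerator = sum(x * y for x, y in zip(speed, mpg)) / n - x_bar * y_bar

x2_bar = sum(x * x for x in speed) / n
a = numerator / (x2_bar - x_bar ** 2)
b = y_bar - a * x_bar
print(f"numerator = {numerator:.6f}, a = {a:.6f}, b = {b:.1f}")

# Deviations of each observation from the fitted line a*x + b.
residuals = [round(y - (a * x + b), 1) for x, y in zip(speed, mpg)]
print("residuals:", residuals)
```

The residuals fall below the line at both ends of the speed range and rise above it in the middle, which is the curved pattern visible in the scatterplot.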

[Calculator screens: STAT CALC]

There is no correlation! $r = 0$.

The correlation measures the strength and direction of the linear relationship between the variables. Just because the correlation equals 0, it does not mean that there is no relationship. In this case, there simply is not a permanent linear relationship between speed and mpg; but there certainly is a relationship.

Here, the least-squares regression line is constant and is given by $y = 19.6$. In this case, the least-squares regression line is not a good fit of the data. However, it is the best linear approximation of the data, which really does no good here because there is no permanent linear relationship between speed and mpg.

Because $r^2 = 0$, the line does not fit the data at all. Using the line $y = 19.6$ as a predictor, the mpg is not at all determined by the speed.

4. Is there a relationship between height and GPA? The following data is a collection of measurements from a random sample of WKU students.

Student   Height (inches)   GPA
1         60.5              3.95
2         69.5              2.51
3         69.5              2.85
4         71.5              3.41
5         73                1.91
6         65.5              2.70
7         75                3.26
8         68                3.34
9         64                2.20
10        61                2.28

[Calculator screens: STAT EDIT, STAT PLOT, Adjust Settings, ZOOM 9]

There does not appear to be any relationship between height and grade point average. High and low (and middle) GPAs are attained by students of all heights.

[Calculator screens: STAT CALC]

$r \approx -0.059468$

The correlation is nearly 0. When there is no relationship whatsoever between the variables, then we say that they are independent. When variables are independent, then the true correlation will equal 0. So the correlation coefficient from a random sample of measurements should be very close to 0 when the two variables have no association between them.

Here, the least-squares regression line is given by $y = -0.007754\,x + 3.3663446$. Because the correlation is nearly 0, there is no linear relationship between height and GPA. (In fact, there is no relationship at all because GPA is probably independent of height.) So again, the least-squares regression line is not a good fit of the data.

Because $r^2 \approx 0.0035$, when using the line $y = -0.007754\,x + 3.3663446$ as a predictor, a person's GPA is 0.35% determined by height and 99.65% determined by other factors.

5. Sixteen binge drinkers at Ohio State University had a beer party. Thirty minutes later, campus police measured their blood alcohol content (BAC). Here are the data:

Student   1     2     3     4     5     6      7     8
# Beers   5     2     9     8     3     7      3     5
BAC       0.10  0.03  0.19  0.12  0.04  0.095  0.07  0.06

Student   9     10    11    12    13     14    15    16
# Beers   3     5     4     6     5      7     1     4
BAC       0.02  0.05  0.07  0.10  0.085  0.09  0.01  0.05

Is there a correlation between BAC and the number of beers drunk by the student? What would you predict the average BAC to be for people having 6 beers?

[Calculator screens: STAT EDIT, STAT PLOT, Adjust Settings, ZOOM 9]

The BAC can vary from person to person. For instance, the BACs of the students having 3 beers were 0.02, 0.04, and 0.07. But a general trend exists: As the number of beers increases, the blood alcohol content tends to increase.

[Calculator screens: STAT CALC]

$r \approx 0.894338$

The high correlation shows that there is a relatively strong positive linear relationship. But the relationship is not precisely linear, perhaps due to the varying effects of alcohol on different people.

Here, the least-squares regression line is given by $y \approx 0.018\,x - 0.0127$, which can be interpreted as giving the average BAC for the various amounts of beer. Evaluating the line for $x = 6$ beers, we obtain an average BAC of about 0.095, which is well above the legal limit! So don't drink and drive!

[Calculator screen: from CALC (2nd TRACE), value]

Here $r^2 \approx 0.80$. Using $y \approx 0.018\,x - 0.0127$ as a predictor, the blood alcohol content is 80% determined by the number of beers, and 20% determined by other factors such as body weight, whether male or female, amount of food in the stomach, etc.
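The prediction for 6 beers can be reproduced in a few lines. This is a sketch in plain Python; the helper name `fit_line` is an assumption. It works with sums of deviations from the means, which is algebraically the same as the mean formulas for $a$, $b$, and $r$ after multiplying numerator and denominator by $n$.

```python
import math

# Number of beers and measured BAC for the sixteen students.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def fit_line(xs, ys):
    """Return slope a, intercept b, and correlation r for y = a*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    a = sxy / sxx
    b = y_bar - a * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


a, b, r = fit_line(beers, bac)
predicted = a * 6 + b   # average BAC predicted for 6 beers

print(f"y = {a:.4f} x + ({b:.4f})")
print(f"r = {r:.4f}, r^2 = {r * r:.2f}")
print(f"predicted BAC at 6 beers: {predicted:.3f}")
```

The two students who actually had 6 and 7 beers recorded BACs between 0.09 and 0.10, so the fitted value sits close to nearby observations rather than being an extrapolation.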