Dealing with Data and Fitting Empirically

Size: px

Start display at page:

Download "Dealing with Data and Fitting Empirically"

Blanche Beasley
5 years ago
Views:

1 Fittig Fuctios to Data: Regressio Dealig with Data ad Fittig Empirically Notes by Holly Hirst Whe workig with a situatio for which we do t have a physical law (ad hece a equatio, fuctio or iequality), we observe the system, recordig data from our observatios. We are the left with the task of makig predictios ad drawig coclusios from the data. To do this we start by determiig the tred of the data, ad the fittig a appropriate fuctio. If the data from the observatios are ordered pairs, we ca look at a graph of the data to determie the tred, ad hece the form of the fuctio to fit. Commo practice puts the variable we are selectig to measure (iput) o the x axis statisticias call it the predictor ad the variable we are hopig to predict (output) o the y axis the respose. I the graph below o the left, the tred appears to be roughly liear with a dowward slope. We ca fit the lie by eye as i the graph o the right, simply drawig the lie so that it looks like o oe data poit is too far from the lie. If a equatio is eeded, we could choose two poits o the lie ad use simple algebra to fid the equatio. While fittig by eye is appealig because it is so simple, a slightly more mathematical approach yields a lie for which we ca quatify the fact that o poit is too far from the lie. Let the formula for the lie we are lookig for be y = Ax+B, where A is the ukow slope ad B is the ukow y- itercept. Look at the graph i the picture below. The actual data poit value for x 1 is y 1 ad the respose value for the predictor x 1 from the lie is Ax 1 +B. We ll call the differece betwee these values the deviatio. Oe way to fit a lie to the data is to miimize the sum of these deviatios, miimize " Ax i + B! y i, Data Fittig Notes 1

2 where we have assumed that there are +1 data poits umbered through. Also ote that we have used the absolute values of the deviatios. We do this so that deviatios for poits above the lie do t cacel out deviatios for poits below the lie. So what ext? To miimize this expressio, we eed to take the partial derivatives with respect to A ad B (the ukows), set the derivatives equal to zero, ad the solve the resultig equatios simultaeously. Ufortuately, the absolute values are difficult to deal with whe takig derivatives, sice absolute values have discotiuous derivatives. Note, however, that a liear programmig approach ca be take to get a Chebychev lie of best fit. We will ot pursue that here. How do we fix this? Istead of miimizig the sum of the deviatios, we will miimize the sum of the squares of the deviatios: miimize "( Ax i + B! y i ) The squares of the deviatios are always o-egative which is why we used the absolute values but avoid the problems with derivatives. I additio, miimizer for this sum is equal to the miimizer for the square root of this sum ad the square root of the sum is closely related to Euclidea distace. To fiish this, we eed to take the two partials ad set them equal to zero:! ( Ax i + B " y i ) # = " ( Ax i + B! y i )( x i )="( Ax i + Bx i! x i y i ) =!A (1)! ( Ax i + B " y i ) # = " ( Ax i + B! y i )( 1)=" ( Ax i + B! y i ) =!B () Yuk! Now what? We eed to simplify these two equatios ad the solve them for A ad B. To do this we eed to use the fact that we ca rearrage sums of sums as follows:!( T i + R i + S i ) = T i +!! R i +! S i The left side of this equatio idicates to add the i th T, R ad S values first ad the add those sums; the right side says to add the T s, add the R s, add the S s ad the sum those sums. Either way, we have added all the T s, R s ad S s together. Usig this fact, we ca simplify (1) to get: Ax i! +! Bx i "! x i y i = A! x i + B! x i "! x i y i = (3) Where did the factor of go? We divided both sides of the equatio by. Similarly, () simplifies to:! Ax i +! B "! y i = A! x i + B "! y i = (4) We are fially ready to solve these two equatios for A ad B. We will rewrite these without the subscripts ad limits for simplicity as: A! x + B! x =! xy ad A! x + B =! y Data Fittig Notes

3 Multiplyig the first oe by ad multiplyig secod oe by Solvig for A gives: ( ) = xy A! x "! x! x! x ad the subtractig gives:! "! x! y (5) A =! x i y i "! x i! y i #! x i " % &! x $ i ' i= Pluggig this expressio for A ito (4) gives a formula for B: B =! y " A i! x i i =1 i =1 Here is a example doe usig these formulas. We have used a table to orgaize the calculatios of all of the sums. (FYI: These data are from a physics experimet o sprigs.) X Y X^ XY sums: A = ( 6*94.5 5*59.3) / (6*145 5*5) = B = ( *5) / 6 = So the lie we might use to predict y usig a x value is: y = 1.61 x Goodess of Fit So how ca we tell the quality of the fit? First we eed to look at the data ad a graph showig the lie ad the data, askig whether the coditios eeded for liear regressio hold. If we are satisfied with the fit, the we ca look at a statistical measure of the stregth of the fit called the coefficiet of determiatio. As usual, let the actual data poits be (x i, y i ), i = 1,...,. Also let the predicted value for x i from the fitted lie be y ˆ i. So we have y ˆ i = Ax i + B. Coditios for Regressio The assumptios we eed for regressio iformatio to make sese (be ubiased estimates of coefficiets ad give reliable estimates of variace) are: 1. The x values are fixed, ot radom. Data Fittig Notes 3

4 . The y data values are idepedet of each other ad ormally distributed. This is usually the case if the data were collected carefully. Oe otable exceptio is values that are serially collected for which the idepedet variable is ot time. These time series data may deped o time ad hece o each other. 3. The idividual residuals ( y ˆ i y i ) are idepedet of each other ad ormally distributed with mea. How do we recogize whe these assumptios are violated? 1. The sigle best way to tell if the regressio is good is: LOOK AT THE GRAPH! Look for the followig thigs: Does the lie go through the middle of the data? o = bad (might be iflueced by a outlier) Is there a patter to the deviatios? yes = bad (might be o-liear) Is the shape of the data rectagular (rather tha wedge shaped)? o = bad (might eed a trasformatio of the y values or a weighted regressio).. Oe ca also look at some other graphs: Plot the y values o a histogram to look for ormality Plot the residuals ( y ˆ i y i ) to look for ormality with mea =. Plot residuals agaist fitted values (Y = ( y ˆ i y i ) versus X = y ˆ i ). This should be a horizotal bad across the graph. If there s a (sloped) liear patter there might be a uderlyig idepedet variable that should replace the x that was used. If there is a wedge shape, a trasformatio of y or weighted regressio might be eeded. Coefficiet of Determiatio Oce we are sure that oe of the coditios above are violated, we ca look at a value called the coefficiet of determiatio, aka R-Squared. This will give us a statistical measure of the quality of the fit. What did we do to fid the fits? We solved the problem or, usig the hat otatio: miimize miimize " ( ) " Ax i + b! y i ( ) ˆ y i! y i If we wat to fid out how well the lie captures the relatioship i the data, we ca examie the followig quatities that measure variace: Data Fittig Notes 4

5 SSTO (total sum of squares): " ( y i! y ), which measures the overall deviatio from the mea ( y ) of the respose variables without regard to the predictor variable. If this were, the there would be o eed to kow ay x values, i.e., o depedece o the x values. SSE (error sum of squares): " ( y i! y ˆ ) i, which measures the deviatio of the lie values from the observed values for the respose, i.e., takig ito cosideratio the predictor variable. If this were, the we would kow that the relatioship was exactly liear, i.e., all of the data poits are o the regressio lie. SSR (regressio sum of squares): " ( y ˆ i! y ), which measures the overall deviatio from the mea ( y ) of the fitted variables. If this were, the there would be o eed to kow ay x values, i.e., o depedece o the x values ad the regressio lie would be horizotal. Usig basic ideas from liear algebra, we ca show that " ( y i! y ) = " ( y i! y ˆ ) i + " ( y ˆ i! y ), (1) i.e., SSTO = SSE + SSR. This equality is the fudametal reaso for choosig to miimize the sum of the squared deviatios. I words, this equality says the followig: The overall variatio i the data is equal to the sum of the overall deviatio of the data from the fitted lie plus the deviatio of the lie values from the mea. The coefficiet of determiatio is defied as the proportio of the error ot attributable to the lie SSTO! SSE (SSTO-SSE) out of the total error:, ad thus calculated as SSTO " ( y! y ˆ i i ) 1! " ( y i!y ) or, i words, this is the proportio of the total variatio that ca be accouted for by the lie fit. It ca also be calculated as: " ( y ˆ i! y ) " ( y i! y ) That these two values are the same ca be show easily from the fudametal relatioship (1). Also from (1) we kow that this umber is always betwee ad 1, ad the closer to 1 it is, the Data Fittig Notes 5

6 more the lie "explais" the overall variatio i the data. I fact, the coefficiet of determiatio ca be thought of as the %-goodess of fit of the lie. We ca calculate this by had, but luckily for us, may software packages ca calculate this automatically whe the tredlie calculatio is doe y = 1.161x R = Series1 Liear (Series1) Notice that the lie is a pretty good fit (.9919 is very close to 1). We eed to use cautio whe lookig at the coefficiet of determiatio aka R-Squared. Whe fittig a lie, we ca iterpret the umber as a percet goodess of fit. So for the lie fitted above, 99% of the variatio i the data is explaied by the liear relatioship y = 1.16 x 5.4. What should t we say? Noe of the followig statemets are true! The lie will predict the y value 99% of the time. The y value predicted by the lie will be 99% of the actual value. No-Liear Fits Cosider the example i Chapter 5 of the Edward s ad Hamso s Guide to Modellig data collected to see if there is a relatioship betwee height i meters ad weight i kilograms. (Table 6.1 o page 13) x(height) y(weight) Data Fittig Notes 6

7 These data, whe plotted, give the followig graph: Height versus Weight weight (kg) height (m) These data seem a little curved, but let s fit a lie to it ad take a look: Height versus Weight y = x weight (kg) height (m) The liear fit is t bad, but otice that there is a patter to the deviatios below 1 meter ad above 1.75 meters, the data poits are above the lie; i betwee 1 ad 1.75 meters, the data poits lie below the lie. As metioed i the previous sectio, patters to the deviatios are idicative of a o-liear fit. So what o-liear fuctio should we try to fit? If we take the approach from MAT 595, we would thik about what the data represet: Height versus weight i humas. Let s make some assumptios: Humas are geometrically similar Weight is proportioal to volume Volume is a 3-D measuremet whereas height is 1-D. From these assumptios, it seems reasoable to try weight! height 3. So let s repeat the steps we used to fid the formula for liear regressio to fid a formula for simple cubic regressio. We start with miimizig the sum of the squared deviatios: miimize " ( Ax 3 i! y i ) Next we take the derivative with respect to the variable (A) ad set the result equal to zero. d da "( Ax 3 i! y i ) = Ax 3 3 " ( i! y i ) x i ( ) = A x i 6 i= Data Fittig Notes 7 "! " x 3 i y i = (1)

8 Solvig this for A gives: Orgaizig the calculatios i table form: A =!! i= x i 3 y i x i 6 Buildig a ew table with x, y ad A*x^3 gives: x(height) y(weight) x^6 x^3*y sum: A= Plottig this: x(height) y(weight) A*x^ Data Fittig Notes 8

9 data with cubic fit w e i g h t h e i g h t y(weight) A*x^3 Note from the graph that this fit appears to be bad at the bottom. Why might our assumptios have let us dow? The lower values i our data set come from people uder 1 meter tall amely childre. Are childre geometrically similar to adults? Not really. Let s try some other models. I the last attempt we chose the power o the height to be 3. What if we let that be ukow as well lettig the data drive the power empirically? miimize " ( Ax B i! y i ) Next we take the derivative with respect to the variables (A ad B) ad set the results equal to zero. Here is the partial derivative with respect to A.! #( Ax B!A i " y i ) = Ax B B " ( i! y i ) x i ( ) = A x i B Data Fittig Notes 9 "! " x B i y i = (1) Notice that the resultig equatio is ot goig to be liear (B is i the expoet!), leadig to a system that is ot easy to hadle. Is there aother way? Yes! We ca use a log trick to get the B out of the expoet: y = Ax B! l( y) = l( Ax B )! l( y) = l( A) + l( x B )! l( y) = l( A) + Bl( x) This says that Y = l(y) ad X = l(x) have a liear relatioship wheever y ad x have a power relatioship, with the slope equal to the power, B, ad the y-itercept equal to l(a) -- the log of the costat. So what should we do? Fit a lie betwee l(x) ad l(y)... x(height) y(weight) l(x) l(y)

10 Here is a lie fit to l(x) ad l(y): log-log plot l ( w e i g h t ) 5 y =.347x l ( h e i g h t ) This is ofte referred to as log-log regressio. This looks liear, ad the deviatios are more radomly placed. How do we recover the origial fuctio from this? We'll use the statemet we made two paragraphs up: Y = l(y) ad X = l(x) have a liear relatioship wheever y ad x have a power relatioship, with the slope equal to power, B, ad the y-itercept equal to l(a) -- the log of the costat. So.347 = B = power ad.86 = l(a) or exp(.86) = = costat, givig Here is the graph usig the power curve: y = 16.79x power fit usig excel y = x.347 weight (y) height (x) y(weight) Power (y(weight)) Oe last thig to try o this height-weight data set is expoetial regressio. A*exp(Bx) is aother fuctio that has a similar shape for x >. If we tried agai with the aive approach to regressio miimizig the sum of the squared deviatios, we would ru ito algebraic problems agai, just like with the power curve calculatios above. We ca trasform this i a similar way: Data Fittig Notes 1

11 y = Ae Bx! l( y) = l( Ae Bx )! l( y) = l( A) + l( e Bx )! l y ( ) = l( A) + Bxl( e) ( ) = l( A) + Bx! l y This says that Y = l(y) ad X = x have a liear relatioship wheever y ad x have a expoetial relatioship, with the slope equal to the power, B, ad the y-itercept equal to l(a) -- the log of the costat. So what should we do? Fit a lie betwee x ad l(y)... Here is a graph of the liearized data: x(height) y(weight) x(height) l(y) l(y) -- l(weight) log plot y = x l(y).5 Liear (l(y)) x -- height This is ofte referred to as log regressio. This looks liear, ad as i the power fit the deviatios are more radomly placed. How do we recover the origial fuctio from this? We'll use the statemet we made two paragraphs up: Y = l(y) ad X = x have a liear relatioship wheever y ad x have a expoetial relatioship, with the slope equal to power, B, ad the y-itercept equal to l(a) -- the log of the costat. So = B = power ad.913 = l(a) or exp(.913) =.513 = costat, givig y =.513e 1.856x Data Fittig Notes 11

12 Here is the graph usig the power curve ad the expoetial curve ad showig the fuctios together: height versus weight weight -- y height -- x y(weight) Power (y(weight)) Expo. (y(weight)) y =.515e x y = x.347 Checkig for Goodess of Fit i the No-Liear Case Sice these fits are based upo a liearizatio of the origial data, the same rules apply whe decidig if the fit is good. Look at the trasformed data ad check: 1. The x values are fixed, ot radom.. The y data values are idepedet of each other ad ormally distributed. This is usually the case if the data were collected carefully. Oe otable exceptio is values that are serially collected for which the idepedet variable is ot time. These time series data may deped o time ad hece o each other. 3. The residuals are idepedet of each other ad ormally distributed with mea. How do we recogize whe these assumptios are violated? The sigle best way to tell if the regressio is good is LOOK AT THE GRAPH! Look for the followig thigs both with the origial ad with the trasformed data: Does the curve go through the middle of the data? o = bad Is there a patter to the deviatios? yes = bad Is the shape of the data rectagular (rather tha wedge shaped)? I the case of the origial data, look for a rectagle that is bet to follow the curve. o = bad Coefficiet of Determiatio as used with No-Liear Fits Note that i each case we could get a R-squared value. Whe fittig o-liear curves to the data, i particular the power ad expoetial fits we have see already, cautio must be used whe lookig at this umber. It tells us how good the fit is for the trasformed data NOT the origial data. I particular, logarithmic trasformatios chage the distaces betwee the y data values ad the mea ( y ) i a o-uiform way. This implies that the traditioal explaatio for the meaig of the coefficiet of determiatio does t work if the data have bee trasformed i this way. We ca still look at the coefficiet of determiatio i a comparative way to choose betwee two curves of the same kid, such as two differet expoetials, two power fits, etc., but ot as a way to choose betwee the fits from differet kids of curves. This mis-use is commo i applicatios of statistics to the social scieces remember to avoid this!!! Data Fittig Notes 1

13 So, bottom lie, what is the best way to tell if we have chose the best curve? Use your eyes! Look at the data with the various fits ad choose the oe that appears to come closest to all of the poits ad gives a radom patter i the deviatios. Problems 1. The followig data table cotais a sample of 4 idividuals out of over 7 who participated i a study i Hoolulu, HI i Aswer each questio, ad explai your coclusios. Is there a lik betwee weight ad cholesterol levels? Please ote: Lower total cholesterol is cosidered healthier. Does age matter? Does physical activity matter? Does smokig affect blood pressure? Some experts propose that comparig weight aloe to blood pressure is ot as good as comparig weight per cm of height to blood pressure, i.e., create a ew variable that is weight divided by height for each perso. Ca this ew variable be used to predict blood pressure? Educatio (Highest completed) Hoolulu Study Data (197) Systolic Blood Pressure Weight (kg) Height (cm) Age Smoker Physical Activity Cholesterol primary y moderate oe heavy oe y moderate 7 19 primary y moderate primary heavy high school moderate oe heavy itermediate y moderate college heavy primary moderate high school y heavy oe moderate 9 15 oe heavy primary moderate itermediate heavy high school heavy itermediate y moderate college heavy oe moderate high school moderate oe y moderate itermediate y moderate primary y moderate primary moderate itermediate y heavy itermediate moderate primary y heavy 11 1 high school moderate 1 98 primary moderate itermediate moderate high school heavy oe y heavy primary moderate oe heavy primary moderate itermediate heavy primary heavy college moderate oe y heavy oe y moderate Wild black bears were aesthetized, ad their bodies were measured ad weighed. Oe goal of the study was to look at whether the distributio of the weights is differet for male ad female bears. Data Fittig Notes 13

14 The secod goal of the study was to determie the kid of relatioship that explaied best the associatio betwee legth ad weight. The third goal of the study was to fid a liear equatio for weight i terms of some other measurable characteristic for forest ragers, so they could estimate the weight of a bear based o that oe measuremet. This would be useful because i the field it is easier to measure a legth tha it is to weigh a bear with a scale. Use the dataset to ivestigate these goals. lik to data: 3. I order to avoid the expese ad icoveiece of usig a water tak to determie body fat, we wish to fid a measuremet that ca predict body fat. The data set provided below cosists of estimates of body fat determied by uderwater weighig alog with various body circumferece measuremets for 5 me. Which of these measuremets is the best predictor of body fat? lik to data: 4. Is there a relatioship betwee brai ad body weight i mammals? Use the data provided for 53 species average adult male body weight to ivestigate this questio. lik to data: 5. World Rakigs: Data Source: This problem is desiged to let you grub aroud with real data. Real data are ot always ice, so do t be surprised if you ru ito less tha satisfyig results. Your assigmet is to try various fits ad use the ideas from class to choose the best relatioship or to say why you thik there is o relatioship. Some examples of relatioships you might ivestigate: Explore the relatioship betwee life expectacy ad GDP. Explore the relatioship betwee eergy use ad GDP. Explore the relatioship betwee literacy ad ifat mortality rate. Explore the relatioship betwee CO emissio ad GDP. Explore the relatioship betwee populatio ad eergy use. Explore the relatioship betwee literacy ad life expectacy. Explore the relatioship betwee CO emissios ad eergy use. Explore the relatioship betwee fertility rate ad % over 6. Data Fittig Notes 14

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet