Dealing with Data and Fitting Empirically

Similar documents
1 Inferential Methods for Correlation and Regression Analysis

Response Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable

II. Descriptive Statistics D. Linear Correlation and Regression. 1. Linear Correlation

Linear Regression Models

11 Correlation and Regression

Correlation Regression

Statistical Properties of OLS estimators

Simple Linear Regression

Properties and Hypothesis Testing

Frequentist Inference

Regression, Part I. A) Correlation describes the relationship between two variables, where neither is independent or a predictor.

Linear Regression Demystified

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Algebra of Least Squares

This is an introductory course in Analysis of Variance and Design of Experiments.

Statistics 511 Additional Materials

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

University of California, Los Angeles Department of Statistics. Simple regression analysis

ECON 3150/4150, Spring term Lecture 3

SNAP Centre Workshop. Basic Algebraic Manipulation

Regression and Correlation

Understanding Samples

3/3/2014. CDS M Phil Econometrics. Types of Relationships. Types of Relationships. Types of Relationships. Vijayamohanan Pillai N.

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

Simple Regression. Acknowledgement. These slides are based on presentations created and copyrighted by Prof. Daniel Menasce (GMU) CS 700

STA Learning Objectives. Population Proportions. Module 10 Comparing Two Proportions. Upon completing this module, you should be able to:

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

10-701/ Machine Learning Mid-term Exam Solution

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

September 2012 C1 Note. C1 Notes (Edexcel) Copyright - For AS, A2 notes and IGCSE / GCSE worksheets 1

Power and Type II Error

a is some real number (called the coefficient) other

Polynomial Functions and Their Graphs

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

PH 425 Quantum Measurement and Spin Winter SPINS Lab 1

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Regression and correlation

Section 14. Simple linear regression.

A statistical method to determine sample size to estimate characteristic value of soil parameters

Read through these prior to coming to the test and follow them when you take your test.

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.

Mathematical Notation Math Introduction to Applied Statistics

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Estimation for Complete Data

Parameter, Statistic and Random Samples

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A)

Random Variables, Sampling and Estimation

WHAT IS THE PROBABILITY FUNCTION FOR LARGE TSUNAMI WAVES? ABSTRACT

Number of fatalities X Sunday 4 Monday 6 Tuesday 2 Wednesday 0 Thursday 3 Friday 5 Saturday 8 Total 28. Day

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc.

STP 226 EXAMPLE EXAM #1

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Stat 139 Homework 7 Solutions, Fall 2015

A quick activity - Central Limit Theorem and Proportions. Lecture 21: Testing Proportions. Results from the GSS. Statistics and the General Population

3.2 Properties of Division 3.3 Zeros of Polynomials 3.4 Complex and Rational Zeros of Polynomials

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures

If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ

Regression, Inference, and Model Building

Geometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

REVISION SHEET FP1 (MEI) ALGEBRA. Identities In mathematics, an identity is a statement which is true for all values of the variables it contains.

Chapter 23: Inferences About Means

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

CS322: Network Analysis. Problem Set 2 - Fall 2009

Example: Find the SD of the set {x j } = {2, 4, 5, 8, 5, 11, 7}.

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

Chapter 6 Part 5. Confidence Intervals t distribution chi square distribution. October 23, 2008

ENGI 4421 Confidence Intervals (Two Samples) Page 12-01

AP Statistics Review Ch. 8

Some examples of vector spaces

Linear Regression Analysis. Analysis of paired data and using a given value of one variable to predict the value of the other

Chapter 8: STATISTICAL INTERVALS FOR A SINGLE SAMPLE. Part 3: Summary of CI for µ Confidence Interval for a Population Proportion p

Economics Spring 2015

Simple Linear Regression

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Summary: CORRELATION & LINEAR REGRESSION. GC. Students are advised to refer to lecture notes for the GC operations to obtain scatter diagram.

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Chapter two: Hypothesis testing

Chapter 6 Sampling Distributions

Lecture 2: Monte Carlo Simulation

INSTRUCTIONS (A) 1.22 (B) 0.74 (C) 4.93 (D) 1.18 (E) 2.43

Introducing Sample Proportions

Recall the study where we estimated the difference between mean systolic blood pressure levels of users of oral contraceptives and non-users, x - y.

Chapters 5 and 13: REGRESSION AND CORRELATION. Univariate data: x, Bivariate data (x,y).

GG313 GEOLOGICAL DATA ANALYSIS

MCT242: Electronic Instrumentation Lecture 2: Instrumentation Definitions

Lesson 11: Simple Linear Regression

CURRICULUM INSPIRATIONS: INNOVATIVE CURRICULUM ONLINE EXPERIENCES: TANTON TIDBITS:

(X i X)(Y i Y ) = 1 n

Open book and notes. 120 minutes. Cover page and six pages of exam. No calculators.

Machine Learning Brett Bernstein

Final Examination Solutions 17/6/2010

Chapter 8: Estimating with Confidence

IP Reference guide for integer programming formulations.

Section 1.1. Calculus: Areas And Tangents. Difference Equations to Differential Equations

Confidence Intervals for the Population Proportion p

4.3 Growth Rates of Solutions to Recurrences

Transcription:

Fittig Fuctios to Data: Regressio Dealig with Data ad Fittig Empirically Notes by Holly Hirst Whe workig with a situatio for which we do t have a physical law (ad hece a equatio, fuctio or iequality), we observe the system, recordig data from our observatios. We are the left with the task of makig predictios ad drawig coclusios from the data. To do this we start by determiig the tred of the data, ad the fittig a appropriate fuctio. If the data from the observatios are ordered pairs, we ca look at a graph of the data to determie the tred, ad hece the form of the fuctio to fit. Commo practice puts the variable we are selectig to measure (iput) o the x axis statisticias call it the predictor ad the variable we are hopig to predict (output) o the y axis the respose. I the graph below o the left, the tred appears to be roughly liear with a dowward slope. We ca fit the lie by eye as i the graph o the right, simply drawig the lie so that it looks like o oe data poit is too far from the lie. If a equatio is eeded, we could choose two poits o the lie ad use simple algebra to fid the equatio. While fittig by eye is appealig because it is so simple, a slightly more mathematical approach yields a lie for which we ca quatify the fact that o poit is too far from the lie. Let the formula for the lie we are lookig for be y = Ax+B, where A is the ukow slope ad B is the ukow y- itercept. Look at the graph i the picture below. The actual data poit value for x 1 is y 1 ad the respose value for the predictor x 1 from the lie is Ax 1 +B. We ll call the differece betwee these values the deviatio. Oe way to fit a lie to the data is to miimize the sum of these deviatios, miimize " Ax i + B! y i, Data Fittig Notes 1

where we have assumed that there are +1 data poits umbered through. Also ote that we have used the absolute values of the deviatios. We do this so that deviatios for poits above the lie do t cacel out deviatios for poits below the lie. So what ext? To miimize this expressio, we eed to take the partial derivatives with respect to A ad B (the ukows), set the derivatives equal to zero, ad the solve the resultig equatios simultaeously. Ufortuately, the absolute values are difficult to deal with whe takig derivatives, sice absolute values have discotiuous derivatives. Note, however, that a liear programmig approach ca be take to get a Chebychev lie of best fit. We will ot pursue that here. How do we fix this? Istead of miimizig the sum of the deviatios, we will miimize the sum of the squares of the deviatios: miimize "( Ax i + B! y i ) The squares of the deviatios are always o-egative which is why we used the absolute values but avoid the problems with derivatives. I additio, miimizer for this sum is equal to the miimizer for the square root of this sum ad the square root of the sum is closely related to Euclidea distace. To fiish this, we eed to take the two partials ad set them equal to zero:! ( Ax i + B " y i ) # = " ( Ax i + B! y i )( x i )="( Ax i + Bx i! x i y i ) =!A (1)! ( Ax i + B " y i ) # = " ( Ax i + B! y i )( 1)=" ( Ax i + B! y i ) =!B () Yuk! Now what? We eed to simplify these two equatios ad the solve them for A ad B. To do this we eed to use the fact that we ca rearrage sums of sums as follows:!( T i + R i + S i ) = T i +!! R i +! S i The left side of this equatio idicates to add the i th T, R ad S values first ad the add those sums; the right side says to add the T s, add the R s, add the S s ad the sum those sums. Either way, we have added all the T s, R s ad S s together. Usig this fact, we ca simplify (1) to get: Ax i! +! Bx i "! x i y i = A! x i + B! x i "! x i y i = (3) Where did the factor of go? We divided both sides of the equatio by. Similarly, () simplifies to:! Ax i +! B "! y i = A! x i + B "! y i = (4) We are fially ready to solve these two equatios for A ad B. We will rewrite these without the subscripts ad limits for simplicity as: A! x + B! x =! xy ad A! x + B =! y Data Fittig Notes

Multiplyig the first oe by ad multiplyig secod oe by Solvig for A gives: ( ) = xy A! x "! x! x! x ad the subtractig gives:! "! x! y (5) A =! x i y i "! x i! y i #! x i " % &! x $ i ' i= Pluggig this expressio for A ito (4) gives a formula for B: B =! y " A i! x i i =1 i =1 Here is a example doe usig these formulas. We have used a table to orgaize the calculatios of all of the sums. (FYI: These data are from a physics experimet o sprigs.) X Y X^ XY 5.3 7. 4 14 4 9.4 16 37.6 5 11.1 5 55.5 6 1.3 36 73.8 8 14. 64 113.6 sums: 5 59.3 145 94.5 A = ( 6*94.5 5*59.3) / (6*145 5*5) = 1.611 B = (94.5 1.611*5) / 6 = 5.44898 So the lie we might use to predict y usig a x value is: y = 1.61 x + 5.4. Goodess of Fit So how ca we tell the quality of the fit? First we eed to look at the data ad a graph showig the lie ad the data, askig whether the coditios eeded for liear regressio hold. If we are satisfied with the fit, the we ca look at a statistical measure of the stregth of the fit called the coefficiet of determiatio. As usual, let the actual data poits be (x i, y i ), i = 1,...,. Also let the predicted value for x i from the fitted lie be y ˆ i. So we have y ˆ i = Ax i + B. Coditios for Regressio The assumptios we eed for regressio iformatio to make sese (be ubiased estimates of coefficiets ad give reliable estimates of variace) are: 1. The x values are fixed, ot radom. Data Fittig Notes 3

. The y data values are idepedet of each other ad ormally distributed. This is usually the case if the data were collected carefully. Oe otable exceptio is values that are serially collected for which the idepedet variable is ot time. These time series data may deped o time ad hece o each other. 3. The idividual residuals ( y ˆ i y i ) are idepedet of each other ad ormally distributed with mea. How do we recogize whe these assumptios are violated? 1. The sigle best way to tell if the regressio is good is: LOOK AT THE GRAPH! Look for the followig thigs: Does the lie go through the middle of the data? o = bad (might be iflueced by a outlier) Is there a patter to the deviatios? yes = bad (might be o-liear) Is the shape of the data rectagular (rather tha wedge shaped)? o = bad (might eed a trasformatio of the y values or a weighted regressio).. Oe ca also look at some other graphs: Plot the y values o a histogram to look for ormality Plot the residuals ( y ˆ i y i ) to look for ormality with mea =. Plot residuals agaist fitted values (Y = ( y ˆ i y i ) versus X = y ˆ i ). This should be a horizotal bad across the graph. If there s a (sloped) liear patter there might be a uderlyig idepedet variable that should replace the x that was used. If there is a wedge shape, a trasformatio of y or weighted regressio might be eeded. Coefficiet of Determiatio Oce we are sure that oe of the coditios above are violated, we ca look at a value called the coefficiet of determiatio, aka R-Squared. This will give us a statistical measure of the quality of the fit. What did we do to fid the fits? We solved the problem or, usig the hat otatio: miimize miimize " ( ) " Ax i + b! y i ( ) ˆ y i! y i If we wat to fid out how well the lie captures the relatioship i the data, we ca examie the followig quatities that measure variace: Data Fittig Notes 4

SSTO (total sum of squares): " ( y i! y ), which measures the overall deviatio from the mea ( y ) of the respose variables without regard to the predictor variable. If this were, the there would be o eed to kow ay x values, i.e., o depedece o the x values. SSE (error sum of squares): " ( y i! y ˆ ) i, which measures the deviatio of the lie values from the observed values for the respose, i.e., takig ito cosideratio the predictor variable. If this were, the we would kow that the relatioship was exactly liear, i.e., all of the data poits are o the regressio lie. SSR (regressio sum of squares): " ( y ˆ i! y ), which measures the overall deviatio from the mea ( y ) of the fitted variables. If this were, the there would be o eed to kow ay x values, i.e., o depedece o the x values ad the regressio lie would be horizotal. Usig basic ideas from liear algebra, we ca show that " ( y i! y ) = " ( y i! y ˆ ) i + " ( y ˆ i! y ), (1) i.e., SSTO = SSE + SSR. This equality is the fudametal reaso for choosig to miimize the sum of the squared deviatios. I words, this equality says the followig: The overall variatio i the data is equal to the sum of the overall deviatio of the data from the fitted lie plus the deviatio of the lie values from the mea. The coefficiet of determiatio is defied as the proportio of the error ot attributable to the lie SSTO! SSE (SSTO-SSE) out of the total error:, ad thus calculated as SSTO " ( y! y ˆ i i ) 1! " ( y i!y ) or, i words, this is the proportio of the total variatio that ca be accouted for by the lie fit. It ca also be calculated as: " ( y ˆ i! y ) " ( y i! y ) That these two values are the same ca be show easily from the fudametal relatioship (1). Also from (1) we kow that this umber is always betwee ad 1, ad the closer to 1 it is, the Data Fittig Notes 5

more the lie "explais" the overall variatio i the data. I fact, the coefficiet of determiatio ca be thought of as the %-goodess of fit of the lie. We ca calculate this by had, but luckily for us, may software packages ca calculate this automatically whe the tredlie calculatio is doe. 16 14 y = 1.161x + 5.449 R =.9919 1 1 8 Series1 Liear (Series1) 6 4 4 6 8 1 Notice that the lie is a pretty good fit (.9919 is very close to 1). We eed to use cautio whe lookig at the coefficiet of determiatio aka R-Squared. Whe fittig a lie, we ca iterpret the umber as a percet goodess of fit. So for the lie fitted above, 99% of the variatio i the data is explaied by the liear relatioship y = 1.16 x 5.4. What should t we say? Noe of the followig statemets are true! The lie will predict the y value 99% of the time. The y value predicted by the lie will be 99% of the actual value. No-Liear Fits Cosider the example i Chapter 5 of the Edward s ad Hamso s Guide to Modellig data collected to see if there is a relatioship betwee height i meters ad weight i kilograms. (Table 6.1 o page 13) x(height) y(weight).75 1.86 1.95 15 1.8 17 1.1 1.6 7 1.35 35 1.51 41 1.55 48 1.6 5 1.63 51 1.67 54 1.71 59 1.78 66 1.85 75 Data Fittig Notes 6

These data, whe plotted, give the followig graph: Height versus Weight 8 7 6 weight (kg) 5 4 3 1.5 1 1.5 height (m) These data seem a little curved, but let s fit a lie to it ad take a look: Height versus Weight 8 7 6 y = 57.871x - 41.79 weight (kg) 5 4 3 1.5 1 1.5 height (m) The liear fit is t bad, but otice that there is a patter to the deviatios below 1 meter ad above 1.75 meters, the data poits are above the lie; i betwee 1 ad 1.75 meters, the data poits lie below the lie. As metioed i the previous sectio, patters to the deviatios are idicative of a o-liear fit. So what o-liear fuctio should we try to fit? If we take the approach from MAT 595, we would thik about what the data represet: Height versus weight i humas. Let s make some assumptios: Humas are geometrically similar Weight is proportioal to volume Volume is a 3-D measuremet whereas height is 1-D. From these assumptios, it seems reasoable to try weight! height 3. So let s repeat the steps we used to fid the formula for liear regressio to fid a formula for simple cubic regressio. We start with miimizig the sum of the squared deviatios: miimize " ( Ax 3 i! y i ) Next we take the derivative with respect to the variable (A) ad set the result equal to zero. d da "( Ax 3 i! y i ) = Ax 3 3 " ( i! y i ) x i ( ) = A x i 6 i= Data Fittig Notes 7 "! " x 3 i y i = (1)

Solvig this for A gives: Orgaizig the calculatios i table form: A =!! i= x i 3 y i x i 6 Buildig a ew table with x, y ad A*x^3 gives: x(height) y(weight) x^6 x^3*y.75 1.1779785 4.1875.86 1.445674 7.6367.95 15.7359189 1.8665 1.8 17 1.5868743 1.41514 1.1 1.973869 8.9856 1.6 7 4.15414 54.115 1.35 35 6.5344514 86.11315 1.51 41 11.8539116 141.16991 1.55 48 13.86745 178.746 1.6 5 16.77716 4.8 1.63 51 18.7553696.86897 1.67 54 1.6919616 51.53 1.71 59 5.11 95.1449 1.78 66 31.8686 37.363 1.85 75 4.894751 474.871875 sum: 194.777376 353.5353 A= 1.8354 Plottig this: x(height) y(weight) A*x^3.75 1 5.9768.86 1 7.6855959.95 15 1.359838 1.8 17 15.13588 1.1 16.976336 1.6 7 4.179541 1.35 35 9.79165 1.51 41 41.618841 1.55 48 44.9963465 1.6 5 49.49893 1.63 51 5.39355 1.67 54 56.7781 1.71 59 6.4185766 1.78 66 68.146818 1.85 75 76.56354 Data Fittig Notes 8

data with cubic fit w e i g h t 9 8 7 6 5 4 3 1.5 1 1.5 h e i g h t y(weight) A*x^3 Note from the graph that this fit appears to be bad at the bottom. Why might our assumptios have let us dow? The lower values i our data set come from people uder 1 meter tall amely childre. Are childre geometrically similar to adults? Not really. Let s try some other models. I the last attempt we chose the power o the height to be 3. What if we let that be ukow as well lettig the data drive the power empirically? miimize " ( Ax B i! y i ) Next we take the derivative with respect to the variables (A ad B) ad set the results equal to zero. Here is the partial derivative with respect to A.! #( Ax B!A i " y i ) = Ax B B " ( i! y i ) x i ( ) = A x i B Data Fittig Notes 9 "! " x B i y i = (1) Notice that the resultig equatio is ot goig to be liear (B is i the expoet!), leadig to a system that is ot easy to hadle. Is there aother way? Yes! We ca use a log trick to get the B out of the expoet: y = Ax B! l( y) = l( Ax B )! l( y) = l( A) + l( x B )! l( y) = l( A) + Bl( x) This says that Y = l(y) ad X = l(x) have a liear relatioship wheever y ad x have a power relatioship, with the slope equal to the power, B, ad the y-itercept equal to l(a) -- the log of the costat. So what should we do? Fit a lie betwee l(x) ad l(y)... x(height) y(weight) l(x) l(y).75 1 -.87681.35859.86 1 -.1589.4849665.95 15 -.51933.785 1.8 17.769614.8331334 1.1.1133869.995737 1.6 7.311117 3.9583687 1.35 35.31459 3.5553486 1.51 41.411965 3.713577 1.55 48.4385493 3.87111 1.6 5.47363 3.9131

Here is a lie fit to l(x) ad l(y): 1.63 51.488581 3.9318563 1.67 54.518363 3.9889845 1.71 59.53649337 4.7753744 1.78 66.57661336 4.18965474 1.85 75.61518564 4.31748811 log-log plot l ( w e i g h t ) 5 y =.347x +.86 4.5 4 3.5 3.5 1.5 1.5 -.4 -...4.6.8 l ( h e i g h t ) This is ofte referred to as log-log regressio. This looks liear, ad the deviatios are more radomly placed. How do we recover the origial fuctio from this? We'll use the statemet we made two paragraphs up: Y = l(y) ad X = l(x) have a liear relatioship wheever y ad x have a power relatioship, with the slope equal to power, B, ad the y-itercept equal to l(a) -- the log of the costat. So.347 = B = power ad.86 = l(a) or exp(.86) = 16.79 = costat, givig Here is the graph usig the power curve: y = 16.79x.35 8 7 6 power fit usig excel y = 16.788x.347 weight (y) 5 4 3 1.5 1 1.5 height (x) y(weight) Power (y(weight)) Oe last thig to try o this height-weight data set is expoetial regressio. A*exp(Bx) is aother fuctio that has a similar shape for x >. If we tried agai with the aive approach to regressio miimizig the sum of the squared deviatios, we would ru ito algebraic problems agai, just like with the power curve calculatios above. We ca trasform this i a similar way: Data Fittig Notes 1

y = Ae Bx! l( y) = l( Ae Bx )! l( y) = l( A) + l( e Bx )! l y ( ) = l( A) + Bxl( e) ( ) = l( A) + Bx! l y This says that Y = l(y) ad X = x have a liear relatioship wheever y ad x have a expoetial relatioship, with the slope equal to the power, B, ad the y-itercept equal to l(a) -- the log of the costat. So what should we do? Fit a lie betwee x ad l(y)... Here is a graph of the liearized data: x(height) y(weight) x(height) l(y).75 1.75.35859.86 1.86.4849665.95 15.95.785 1.8 17 1.8.8331334 1.1 1.1.995737 1.6 7 1.6 3.9583687 1.35 35 1.35 3.5553486 1.51 41 1.51 3.713577 1.55 48 1.55 3.87111 1.6 5 1.6 3.9131 1.63 51 1.63 3.9318563 1.67 54 1.67 3.9889845 1.71 59 1.71 4.7753744 1.78 66 1.78 4.18965474 1.85 75 1.85 4.31748811 l(y) -- l(weight) log plot y = 1.8558x +.913 5 4.5 4 3.5 3 l(y).5 Liear (l(y)) 1.5 1.5.5 1 1.5 x -- height This is ofte referred to as log regressio. This looks liear, ad as i the power fit the deviatios are more radomly placed. How do we recover the origial fuctio from this? We'll use the statemet we made two paragraphs up: Y = l(y) ad X = x have a liear relatioship wheever y ad x have a expoetial relatioship, with the slope equal to power, B, ad the y-itercept equal to l(a) -- the log of the costat. So 1.8558 = B = power ad.913 = l(a) or exp(.913) =.513 = costat, givig y =.513e 1.856x Data Fittig Notes 11

Here is the graph usig the power curve ad the expoetial curve ad showig the fuctios together: height versus weight 9 8 7 weight -- y 6 5 4 3 1.5 1 1.5 height -- x y(weight) Power (y(weight)) Expo. (y(weight)) y =.515e 1.8558x y = 16.788x.347 Checkig for Goodess of Fit i the No-Liear Case Sice these fits are based upo a liearizatio of the origial data, the same rules apply whe decidig if the fit is good. Look at the trasformed data ad check: 1. The x values are fixed, ot radom.. The y data values are idepedet of each other ad ormally distributed. This is usually the case if the data were collected carefully. Oe otable exceptio is values that are serially collected for which the idepedet variable is ot time. These time series data may deped o time ad hece o each other. 3. The residuals are idepedet of each other ad ormally distributed with mea. How do we recogize whe these assumptios are violated? The sigle best way to tell if the regressio is good is LOOK AT THE GRAPH! Look for the followig thigs both with the origial ad with the trasformed data: Does the curve go through the middle of the data? o = bad Is there a patter to the deviatios? yes = bad Is the shape of the data rectagular (rather tha wedge shaped)? I the case of the origial data, look for a rectagle that is bet to follow the curve. o = bad Coefficiet of Determiatio as used with No-Liear Fits Note that i each case we could get a R-squared value. Whe fittig o-liear curves to the data, i particular the power ad expoetial fits we have see already, cautio must be used whe lookig at this umber. It tells us how good the fit is for the trasformed data NOT the origial data. I particular, logarithmic trasformatios chage the distaces betwee the y data values ad the mea ( y ) i a o-uiform way. This implies that the traditioal explaatio for the meaig of the coefficiet of determiatio does t work if the data have bee trasformed i this way. We ca still look at the coefficiet of determiatio i a comparative way to choose betwee two curves of the same kid, such as two differet expoetials, two power fits, etc., but ot as a way to choose betwee the fits from differet kids of curves. This mis-use is commo i applicatios of statistics to the social scieces remember to avoid this!!! Data Fittig Notes 1

So, bottom lie, what is the best way to tell if we have chose the best curve? Use your eyes! Look at the data with the various fits ad choose the oe that appears to come closest to all of the poits ad gives a radom patter i the deviatios. Problems 1. The followig data table cotais a sample of 4 idividuals out of over 7 who participated i a study i Hoolulu, HI i 1969. Aswer each questio, ad explai your coclusios. Is there a lik betwee weight ad cholesterol levels? Please ote: Lower total cholesterol is cosidered healthier. Does age matter? Does physical activity matter? Does smokig affect blood pressure? Some experts propose that comparig weight aloe to blood pressure is ot as good as comparig weight per cm of height to blood pressure, i.e., create a ew variable that is weight divided by height for each perso. Ca this ew variable be used to predict blood pressure? Educatio (Highest completed) Hoolulu Study Data (197) Systolic Blood Pressure Weight (kg) Height (cm) Age Smoker Physical Activity Cholesterol primary 7 165 61 y moderate 199 1 oe 6 16 5 heavy 67 138 oe 6 15 53 y moderate 7 19 primary 66 165 51 y moderate 166 1 primary 7 16 51 heavy 39 18 high school 59 165 53 moderate 189 11 oe 47 16 61 heavy 38 18 itermediate 66 17 48 y moderate 3 116 college 56 155 54 heavy 79 134 primary 6 167 48 moderate 19 14 high school 68 165 49 y heavy 4 116 oe 65 166 48 moderate 9 15 oe 56 157 55 heavy 1 134 primary 8 161 49 moderate 171 13 itermediate 66 16 5 heavy 55 13 high school 91 17 5 heavy 3 118 itermediate 71 17 48 y moderate 147 136 college 66 15 59 heavy 68 18 oe 73 159 59 moderate 31 18 high school 59 161 5 moderate 199 18 oe 64 16 5 y moderate 55 118 itermediate 55 161 5 y moderate 199 134 primary 78 175 5 y moderate 8 178 primary 59 16 54 moderate 4 134 itermediate 51 167 48 y heavy 184 16 itermediate 83 171 55 moderate 19 16 primary 66 157 49 y heavy 11 1 high school 61 165 51 moderate 1 98 primary 65 16 53 moderate 3 144 itermediate 75 17 49 moderate 43 118 high school 61 164 49 heavy 181 118 oe 73 157 53 y heavy 38 138 primary 66 157 5 moderate 186 134 oe 73 155 48 heavy 198 18 primary 61 16 53 moderate 165 96 itermediate 68 16 5 heavy 19 14 primary 5 157 5 heavy 196 1 college 73 16 5 moderate 39 146 oe 5 165 61 y heavy 59 16 oe 56 16 53 y moderate 16 176. Wild black bears were aesthetized, ad their bodies were measured ad weighed. Oe goal of the study was to look at whether the distributio of the weights is differet for male ad female bears. Data Fittig Notes 13

The secod goal of the study was to determie the kid of relatioship that explaied best the associatio betwee legth ad weight. The third goal of the study was to fid a liear equatio for weight i terms of some other measurable characteristic for forest ragers, so they could estimate the weight of a bear based o that oe measuremet. This would be useful because i the field it is easier to measure a legth tha it is to weigh a bear with a scale. Use the dataset to ivestigate these goals. lik to data: http://mathsci.appstate.edu/~hph/sagemath/bears.csv 3. I order to avoid the expese ad icoveiece of usig a water tak to determie body fat, we wish to fid a measuremet that ca predict body fat. The data set provided below cosists of estimates of body fat determied by uderwater weighig alog with various body circumferece measuremets for 5 me. Which of these measuremets is the best predictor of body fat? lik to data: http://mathsci.appstate.edu/~hph/sagemath/bodyfat.csv 4. Is there a relatioship betwee brai ad body weight i mammals? Use the data provided for 53 species average adult male body weight to ivestigate this questio. lik to data: http://mathsci.appstate.edu/~hph/sagemath/brai-body.csv 5. World Rakigs: Data Source: http://www.photius.com/rakigs/ This problem is desiged to let you grub aroud with real data. Real data are ot always ice, so do t be surprised if you ru ito less tha satisfyig results. Your assigmet is to try various fits ad use the ideas from class to choose the best relatioship or to say why you thik there is o relatioship. Some examples of relatioships you might ivestigate: Explore the relatioship betwee life expectacy ad GDP. Explore the relatioship betwee eergy use ad GDP. Explore the relatioship betwee literacy ad ifat mortality rate. Explore the relatioship betwee CO emissio ad GDP. Explore the relatioship betwee populatio ad eergy use. Explore the relatioship betwee literacy ad life expectacy. Explore the relatioship betwee CO emissios ad eergy use. Explore the relatioship betwee fertility rate ad % over 6. Data Fittig Notes 14