Correlation and Regression


Topic 3: Correlation and Regression

In this section, we shall take a careful look at the nature of linear relationships found in the data used to construct a scatterplot. The first of these, correlation, examines this relationship in a symmetric manner. The second, regression, considers the relationship of a response variable as determined by one or more explanatory variables. Correlation focuses primarily on association, while regression is designed to help make predictions. Consequently, the first does not attempt to establish any cause and effect; the second is often used as a tool to establish causality.

3.1 Covariance and Correlation

The covariance measures the linear relationship between a pair of quantitative measures $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ on the same sample of $n$ individuals. Beginning with the definition of variance, the definition of covariance is similar to the relationship between the square of the norm $\|v\|^2$ of a vector $v$ and the inner product $\langle v, w \rangle$ of two vectors $v$ and $w$:

$$\text{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).$$

Table I: Analogies between vectors and quantitative observations.

  vectors                                                      quantitative observations
  $v = (v_1, \ldots, v_n)$                                     $x = (x_1, \ldots, x_n)$
  $w = (w_1, \ldots, w_n)$                                     $y = (y_1, \ldots, y_n)$
  norm-squared $\|v\|^2 = \sum_{i=1}^n v_i^2$                  variance $s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
  inner product $\langle v, w\rangle = \sum_{i=1}^n v_i w_i$   covariance $\text{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$
  cosine $\cos\theta = \frac{\langle v,w\rangle}{\|v\|\,\|w\|}$   correlation $r = \frac{\text{cov}(x,y)}{s_x s_y}$

A positive covariance means that the terms $(x_i - \bar{x})(y_i - \bar{y})$ in the sum are more likely to be positive than negative. This occurs whenever the $x$ and $y$ variables are more often both above or below the mean in tandem than not. Just like the situation in which the inner product of a vector with itself yields the square of the norm, the covariance of $x$ with itself, $\text{cov}(x,x) = s_x^2$, is the variance of $x$.

Exercise 3.1. Explain in words what a negative covariance signifies, and what a covariance near 0 signifies.

We next look at several exercises that call for algebraic manipulations of the formula for covariance or closely related functions.

Exercise 3.2. Derive the alternative expression for the covariance:

$$\text{cov}(x,y) = \frac{1}{n-1}\left(\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}\right).$$
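The formulas above are easy to check numerically. Here is a minimal sketch with two made-up vectors (the data are hypothetical, chosen only for illustration); R's built-in cov and cor agree with the definition and with the standardized-covariance form of the correlation.

> x <- c(1, 3, 4, 6)                            # hypothetical data
> y <- c(2, 3, 6, 8)
> n <- length(x)
> sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # covariance from the definition
[1] 5.5
> cov(x, y)                                     # built-in check
[1] 5.5
> cov(x, y) / (sd(x) * sd(y))                   # correlation as standardized covariance
[1] 0.9594482
> cor(x, y)
[1] 0.9594482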

Exercise 3.3. Show that $\text{cov}(ax+b, cy+d) = ac\,\text{cov}(x,y)$. How does a change in units (say from centimeters to meters) affect the covariance?

Thus, covariance as a measure of association has the drawback that its value depends on the units of measurement. This shortcoming is remedied by using the correlation.

Definition 3.4. The correlation, $r$, is the covariance of the standardized versions of $x$ and $y$:

$$r(x,y) = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{\text{cov}(x,y)}{s_x s_y}.$$

The observations $x$ and $y$ are called uncorrelated if $r(x,y) = 0$.

Exercise 3.5. Show that $r(ax+b, cy+d) = \pm r(x,y)$. How does a change in units (say from centimeters to meters) affect the correlation? The plus sign occurs if $ac > 0$ and the minus sign occurs if $ac < 0$.

Sometimes we will drop $(x,y)$ if there is no ambiguity and simply write $r$ for the correlation.

Exercise 3.6. Show that

$$s_{x+y}^2 = s_x^2 + s_y^2 + 2\,\text{cov}(x,y) = s_x^2 + s_y^2 + 2r s_x s_y. \qquad (3.1)$$

Give the analogy between this formula and the law of cosines.

[Figure 3.1: The analogy of the sample standard deviations $s_x$, $s_y$, $s_{x+y}$ and the law of cosines in equation (3.1). Here, the correlation $r = -\cos\theta$.]

In particular, if the two observations are uncorrelated, we have the Pythagorean identity

$$s_{x+y}^2 = s_x^2 + s_y^2. \qquad (3.2)$$

We will now look to uncover some of the properties of correlation. The next steps are to show that the correlation is always a number between $-1$ and $1$ and to determine the relationship between the two variables in the case that the correlation takes on one of the two possible extreme values.

Exercise 3.7 (Cauchy-Schwarz inequality). For two sequences $v_1, \ldots, v_n$ and $w_1, \ldots, w_n$, show that

$$\left(\sum_{i=1}^n v_i w_i\right)^2 \le \left(\sum_{i=1}^n v_i^2\right)\left(\sum_{i=1}^n w_i^2\right). \qquad (3.3)$$

Written in terms of norms and inner products, the Cauchy-Schwarz inequality becomes $\langle v, w\rangle^2 \le \|v\|^2 \|w\|^2$. (Hint: Consider the expression $\sum_{i=1}^n (v_i + w_i\zeta)^2 \ge 0$ as a quadratic expression in the variable $\zeta$ and consider the discriminant in the quadratic formula.) If the discriminant is zero, then we have equality in (3.3) and we have that $\sum_{i=1}^n (v_i + w_i\zeta)^2 = 0$ for exactly one value of $\zeta$.

We shall use inequality (3.3) by choosing $v_i = x_i - \bar{x}$ and $w_i = y_i - \bar{y}$ to obtain

$$\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2 \le \left(\sum_{i=1}^n (x_i - \bar{x})^2\right)\left(\sum_{i=1}^n (y_i - \bar{y})^2\right),$$

$$\left(\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2 \le \left(\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\right)\left(\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2\right),$$

$$\text{cov}(x,y)^2 \le s_x^2 s_y^2, \qquad \frac{\text{cov}(x,y)^2}{s_x^2 s_y^2} \le 1.$$
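The invariance properties in Exercises 3.3 and 3.5 can be spot-checked in a line or two of R. The sketch below uses simulated data (an illustrative assumption, not one of the text's data sets); the covariance scales by the factor $ac = -200$, while the correlation only changes sign.

> x <- rnorm(20)
> y <- rnorm(20)
> cov(100*x + 3, -2*y + 7) / cov(x, y)   # equals ac = 100*(-2) = -200
> cor(100*x + 3, -2*y + 7) / cor(x, y)   # equals -1, since ac < 0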

[Figure 3.2: Scatterplots showing differing levels of the correlation $r$: panels with $r = 0.9$, $0.7$, $0.3$, $0.0$, $-0.5$, and $-0.8$.]

Consequently, we find that $r^2 \le 1$, or $-1 \le r \le 1$. When we have $|r| = 1$, then we have equality in (3.3). In addition, for some value of $\zeta$ we have that

$$\sum_{i=1}^n ((x_i - \bar{x}) + (y_i - \bar{y})\zeta)^2 = 0.$$

The only way for a sum of nonnegative terms to add to give zero is for each term in the sum to be zero, i.e.,

$$(x_i - \bar{x}) + (y_i - \bar{y})\zeta = 0, \quad \text{for all } i = 1, \ldots, n. \qquad (3.4)$$

Thus, $x_i$ and $y_i$ are linearly related:

$$y_i = \alpha + \beta x_i.$$

In this case, the sign of $r$ is the same as the sign of $\beta$.

Exercise 3.8. For an alternative derivation that $-1 \le r \le 1$, use equation (3.1) with $x$ and $y$ standardized observations. Use this to determine $\zeta$ in equation (3.4). (Hint: Consider the separate cases $s_{x+y}^2$ for the case $r = -1$ and $s_{x-y}^2$ for the case $r = 1$.)

We can see how this looks for simulated data. Choose a value for $r$ between $-1$ and $+1$.

> x<-rnorm(100)
> z<-rnorm(100)
> y<-r*x + sqrt(1-r^2)*z

Examples of plots of the output of this simulation are given in Figure 3.2. For the moment, the object of this simulation is to obtain an intuitive feeling for differing values of the correlation. We shall soon see that this is the simulation of pairs of normal random variables with the desired correlation. From the discussion above, we can see that the scatterplot would lie on a straight line for the values $r = \pm 1$.

For the Archaeopteryx data on bone lengths, we have the correlation

> cor(femur, humerus)
[1] 0.9941486

Thus, the data land very nearly on a line with positive slope. For the banks in 1974, we have the correlation

> cor(income,assets)
[1] 0.93259

3.2 Linear Regression

Covariance and correlation are measures of linear association. For the Archaeopteryx measurements, we learn that the relationship between the length of the femur and the humerus is very nearly linear. We now turn to situations in which the value of the first variable $x_i$ will be considered to be explanatory or predictive. The corresponding observation $y_i$, taken from the input $x_i$, is called the response. For example, can we explain or predict the income of banks from their assets? In this case, assets is the explanatory variable and income is the response.

In linear regression, the response variable is linearly related to the explanatory variable, but is subject to deviation, or to error. We write

$$y_i = \alpha + \beta x_i + \epsilon_i. \qquad (3.5)$$

Our goal is, given the data, the $x_i$'s and $y_i$'s, to find $\alpha$ and $\beta$ that determine the line having the best fit to the data.

The principle of least squares regression states that the best choice of this linear relationship is the one that minimizes the square of the vertical distances between the $y$ values in the data and the $y$ values on the regression line. This choice reflects the fact that the values of $x$ are set by the experimenter and are thus assumed known. Thus, the error appears in the value of the response variable $y$. This principle leads to a minimization problem for

$$SS(\alpha,\beta) = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \qquad (3.6)$$

In other words, given the data, determine the values for $\alpha$ and $\beta$ that minimize the sum of squares $SS$. Let's denote by $\hat{\alpha}$ and $\hat{\beta}$ the values for $\alpha$ and $\beta$ that minimize $SS$. Take the partial derivative with respect to $\alpha$:

$$\frac{\partial}{\partial\alpha} SS(\alpha,\beta) = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i).$$

At the values $\hat{\alpha}$ and $\hat{\beta}$, this partial derivative is 0. Consequently,

$$0 = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta}x_i), \qquad \sum_{i=1}^n y_i = \sum_{i=1}^n (\hat{\alpha} + \hat{\beta}x_i).$$

Now, divide by $n$:

$$\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x}. \qquad (3.7)$$

Thus, we see that the center of mass point $(\bar{x}, \bar{y})$ is on the regression line. To emphasize this fact, we rewrite (3.5) in slope-point form:

$$y_i - \bar{y} = \beta(x_i - \bar{x}) + \epsilon_i. \qquad (3.8)$$

We then apply this to the sums of squares criterion (3.6) to obtain a condition that depends on $\beta$:

$$SS(\beta) = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n ((y_i - \bar{y}) - \beta(x_i - \bar{x}))^2. \qquad (3.9)$$

Now, differentiate with respect to $\beta$ and set this equation to zero for the value $\hat{\beta}$:

$$\frac{d}{d\beta} SS(\beta) = -2\sum_{i=1}^n ((y_i - \bar{y}) - \hat{\beta}(x_i - \bar{x}))(x_i - \bar{x}) = 0.$$

Now solve for $\hat{\beta}$:

$$\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\beta}\sum_{i=1}^n (x_i - \bar{x})^2,$$

$$\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\beta}\,\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2,$$

$$\text{cov}(x,y) = \hat{\beta}\,\text{var}(x), \qquad \hat{\beta} = \frac{\text{cov}(x,y)}{\text{var}(x)}. \qquad (3.10)$$
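Equations (3.7) and (3.10) translate directly into R. The following sketch, on simulated data (the model y = 2x + noise is an illustrative assumption), checks the hand formulas against lm.

> x <- rnorm(50)
> y <- 2*x + rnorm(50)
> beta <- cov(x, y) / var(x)           # slope from equation (3.10)
> alpha <- mean(y) - beta * mean(x)    # intercept from equation (3.7)
> c(alpha, beta)
> coef(lm(y ~ x))                      # the least squares fit agrees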

[Figure 3.3: Scatterplot and the regression line for the six point data set below. The regression line is the choice that minimizes the square of the vertical distances from the observation values to the line, indicated here in green. Notice that the total length of the positive residuals (the lengths of the green line segments above the regression line) is equal to the total length of the negative residuals. This property is derived in equation (3.11).]

In summary, to determine the regression line, we use (3.10) to determine $\hat{\beta}$ and then (3.7) to solve for

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.$$

We call $\hat{y}_i = \hat{\alpha} + \hat{\beta}x_i$ the fit for the value $x_i$.

Example 3.9. Let's begin with 6 points and derive by hand the equation for the regression line.

  x   -2  -1   0   1   2   3
  y   -3  -1  -2   0   4   2

Add the $x$ and $y$ values and divide by $n = 6$ to see that $\bar{x} = 0.5$ and $\bar{y} = 0$.

  $x_i$   $y_i$   $x_i-\bar{x}$   $y_i-\bar{y}$   $(x_i-\bar{x})(y_i-\bar{y})$   $(x_i-\bar{x})^2$
   -2      -3       -2.5            -3               7.5                          6.25
   -1      -1       -1.5            -1               1.5                          2.25
    0      -2       -0.5            -2               1.0                          0.25
    1       0        0.5             0               0.0                          0.25
    2       4        1.5             4               6.0                          2.25
    3       2        2.5             2               5.0                          6.25
  sum  3    0        0               0              21.0                         17.50

$$\text{cov}(x,y) = 21/5, \qquad \text{var}(x) = 17.50/5,$$

$$\hat{\beta} = \frac{21/5}{17.5/5} = 1.2 \quad\text{and}\quad 0 = \hat{\alpha} + 1.2 \times 0.5 = \hat{\alpha} + 0.6, \text{ or } \hat{\alpha} = -0.6.$$

As seen in this example, the fit is, however, rarely perfect. The difference between the fit and the data is an estimate $\hat{\epsilon}_i$ for the error $\epsilon_i$. This difference is called the residual. So,

$$\text{RESIDUAL}_i = \text{DATA}_i - \text{FIT}_i = y_i - \hat{y}_i,$$

or, by rearranging terms,

$$\text{DATA}_i = \text{FIT}_i + \text{RESIDUAL}_i, \quad\text{or}\quad y_i = \hat{y}_i + \hat{\epsilon}_i.$$

[Margin figure: a data point $y_i$ (DATA) and the corresponding point $\hat{y}_i$ (FIT) on the regression line; the gap $y_i - \hat{y}_i$ is the RESIDUAL.]

We can rewrite equation (3.6) with $\hat{\epsilon}_i$ estimating the error in (3.5):

$$0 = \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta}x_i) = \sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n \hat{\epsilon}_i \qquad (3.11)$$

to see that the sum of the residuals is 0. Thus, we started with a criterion for a line of best fit, namely least squares, and discover that a consequence of this criterion is that the regression line has the property that the sum of the residual values is 0. This is illustrated in Figure 3.3. Let's check this property for the example above.

        DATA     FIT          RESIDUAL
  $x_i$  $y_i$   $\hat{y}_i$   $y_i - \hat{y}_i$
   -2     -3      -3.0           0
   -1     -1      -1.8           0.8
    0     -2      -0.6          -1.4
    1      0       0.6          -0.6
    2      4       1.8           2.2
    3      2       3.0          -1.0
  total                          0

Generally speaking, we will look at a residual plot, the plot of the residuals versus the explanatory variable, to assess the appropriateness of a regression line. Specifically, we will look for circumstances in which the explanatory variable and the residuals have no systematic pattern.

Exercise 3.10. Use R to perform the following operations on the data set in Example 3.9.

1. Enter the data and make a scatterplot.
2. Use the lm command to find the equation of the regression line.
3. Use the abline command to draw the regression line on the scatterplot.
4. Use the resid and the predict commands to find the residuals and place them in a data.frame with x and y.

5. Draw the residual plot and use abline to add the horizontal line at 0.

We next show three examples of the residuals plotted against the value of the explanatory variable.

[Three schematic residual plots:]
- The regression fits the data well - homoscedasticity.
- Prediction is less accurate for large $x$ - an example of heteroscedasticity.
- The data has a curve; a straight line fits the data poorly.

For any value of $x$, we can use the regression line to estimate or predict a value for $y$. We must be careful in using this prediction outside the range of $x$. This extrapolation will not be valid if the relationship between $x$ and $y$ is not known to be linear in this extended region.
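The second pattern, heteroscedasticity, is easy to reproduce in simulation. In the sketch below (simulated data, chosen only for illustration), the noise standard deviation grows with x, and the residual plot fans out accordingly.

> x <- seq(1, 10, length.out = 100)
> y <- 1 + 2*x + rnorm(100, sd = 0.5*x)   # error spread grows with x
> fit <- lm(y ~ x)
> plot(x, resid(fit))                     # fan shape: heteroscedastic residuals
> abline(h = 0)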

Example 3.11. For the 1974 bank data set, the regression line is

$$\widehat{\text{income}} = 7.680 + 4.975 \times \text{assets}.$$

So, each dollar in assets brings in about \$5 in income. For a bank having 10 billion dollars in assets, the predicted income is 56.430 billion dollars. However, if we extrapolate this down to very small banks, we would predict, nonsensically, that a bank with no assets would have an income of 7.68 billion dollars. This illustrates the caution necessary to perform a reliable prediction through an extrapolation.

In addition, for this data set, we see that three banks have assets much greater than the others. Thus, we should consider examining the regression lines omitting the information from these three banks. If a small number of observations has a large impact on our results, we call these points influential.

Obtaining the regression line in R is straightforward:

> lm(income~assets)

Call:
lm(formula = income ~ assets)

Coefficients:
(Intercept)       assets
      7.680        4.975

Example 3.12 (regression line in standardized coordinates). Sir Francis Galton was the first to use the term regression in his study "Regression towards mediocrity in hereditary stature." The rationale for this term and the relationship between regression and correlation can best be seen if we convert the observations into standardized form. First, write the regression line in point-slope form:

$$\hat{y}_i - \bar{y} = \hat{\beta}(x_i - \bar{x}).$$

Because the slope

$$\hat{\beta} = \frac{\text{cov}(x,y)}{\text{var}(x)} = \frac{r s_x s_y}{s_x^2} = \frac{r s_y}{s_x},$$

we can rewrite the point-slope form as

$$\hat{y}_i - \bar{y} = \frac{r s_y}{s_x}(x_i - \bar{x}) \quad\text{or}\quad \frac{\hat{y}_i - \bar{y}}{s_y} = r\,\frac{x_i - \bar{x}}{s_x}, \quad \hat{y}_i^* = r x_i^*, \qquad (3.12)$$

where the asterisk is used to indicate that we are stating our observations in standardized form. In words, if we use this standardized form, then the slope of the regression line is the correlation.

For Galton's example, let's use the height of a male as the explanatory variable and the height of his adult son as the response. If we observe a correlation $r = 0.6$ and consider a man whose height is 1 standard deviation above the mean, then we predict that the son's height is 0.6 standard deviations above the mean. For a man whose height is 0.5 standard deviations below the mean, we predict that the son's height is 0.3 standard deviations below the mean. In either case, our prediction for the son is a height that is closer to the mean than the father's height. This is the regression that Galton had in mind.

From the discussion above, we can see that if we reverse the roles of the explanatory and response variables, then we change the regression line. This should be intuitively obvious since, in the first case, we are minimizing the total square vertical distance and, in the second, the total square horizontal distance. In the most extreme circumstance, $\text{cov}(x,y) = 0$. In this case, the value $x_i$ of an observation is no help in predicting the response variable.

[Figure 3.4: Scatterplots of standardized variables and their regression lines for $r = 0.8$, $r = 0.0$, and $r = -0.8$. The red lines show the case in which $x$ is the explanatory variable and the blue lines show the case in which $y$ is the explanatory variable. Thus, as the formula states, in the uncorrelated panel, when $x$ is the explanatory variable the regression line has slope 0 - it is a horizontal line through $\bar{y}$. Correspondingly, when $y$ is the explanatory variable, the regression line is a vertical line through $\bar{x}$.]

Intuitively, if $x$ and $y$ are uncorrelated, then the best prediction we can make for $y_i$ given the value of $x_i$ is just the sample mean $\bar{y}$, and the best prediction we can make for $x_i$ given the value of $y_i$ is the sample mean $\bar{x}$.

More formally, the two regression equations are

$$\hat{y}_i^* = r x_i^* \quad\text{and}\quad \hat{x}_i^* = r y_i^*.$$

These equations have slopes $r$ and $1/r$. This is shown by example in Figure 3.4.

Exercise 3.13. Compute the regression line for the 6 pairs of observations above assuming that $y$ is the explanatory variable. Show that the two regression lines differ by showing that the product of the slopes is not equal to one.

Exercise 3.14. Continuing the previous example, let $\hat{\beta}_x$ be the slope of the regression line obtained from regressing $y$ on $x$ and $\hat{\beta}_y$ be the slope of the regression line obtained from regressing $x$ on $y$. Show that the product of the slopes $\hat{\beta}_x\hat{\beta}_y = r^2$, the square of the correlation.

Because the point $(\bar{x}, \bar{y})$ is on the regression line, we see from the exercise above that the two regression lines coincide precisely when the slopes are reciprocals, namely precisely when $r^2 = 1$. This occurs for the values $r = 1$ and $r = -1$.

Exercise 3.15. Show that the FIT, $\hat{y}$, and the RESIDUALS, $y - \hat{y}$, are uncorrelated.

Let's again write the regression line in point-slope form:

$$\text{FIT}_i - \bar{y} = \hat{y}_i - \bar{y} = r\,\frac{s_y}{s_x}(x_i - \bar{x}).$$

Using the quadratic identity for variance, we find that

$$s_{\text{FIT}}^2 = r^2\,\frac{s_y^2}{s_x^2}\,s_x^2 = r^2 s_y^2 = r^2 s_{\text{DATA}}^2.$$

Thus, the variation in the FIT is reduced from the variation in the DATA by a factor of $r^2$, and

$$r^2 = \frac{s_{\text{FIT}}^2}{s_{\text{DATA}}^2}.$$

[Figure 3.5: The relationship of the standard deviations of the DATA, the FIT, and the RESIDUALS: $s_{\text{DATA}}^2 = s_{\text{FIT}}^2 + s_{\text{RESID}}^2 = r^2 s_{\text{DATA}}^2 + (1 - r^2)s_{\text{DATA}}^2$. In this case, we say that $r^2$ of the variation in the response variable is due to the fit and the rest, $1 - r^2$, is due to the residuals.]
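Exercises 3.13 and 3.14 can be previewed numerically: regressing y on x and x on y gives different lines, but the product of their slopes is $r^2$. A sketch with simulated data (an illustrative assumption):

> x <- rnorm(100)
> y <- 0.7*x + rnorm(100)
> b.yx <- coef(lm(y ~ x))[2]     # slope regressing y on x
> b.xy <- coef(lm(x ~ y))[2]     # slope regressing x on y
> c(b.yx * b.xy, cor(x, y)^2)    # the two values agree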

Exercise 3.16. Use the equation above to show that

$$r^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$

When the straight line fits the data well, the FIT and the RESIDUAL are uncorrelated and the magnitude of the residual does not depend on the value of the explanatory variable. We have, in this circumstance, from equation (3.2), the Pythagorean identity, that

$$s_{\text{DATA}}^2 = s_{\text{FIT}}^2 + s_{\text{RESID}}^2 = r^2 s_{\text{DATA}}^2 + s_{\text{RESIDUAL}}^2,$$

$$s_{\text{RESIDUAL}}^2 = (1 - r^2)\,s_{\text{DATA}}^2.$$

Thus, $r^2$ of the variance in the data can be explained by the fit. As a consequence of this computation, many statistical software tools report $r^2$ as part of the linear regression analysis. In this case, the remaining $1 - r^2$ of the variance in the data is found in the residuals.

Exercise 3.17. For some situations, the circumstances dictate that the line contain the origin ($\alpha = 0$). Use a least squares criterion to show that the slope of the regression line is

$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$

R accommodates this circumstance with the commands lm(y~x-1) or lm(y~0+x). Note that in this case, the sum of the residuals is not necessarily equal to zero. For least squares regression, this property followed from $\partial SS(\alpha,\beta)/\partial\alpha = 0$, where $\alpha$ is the $y$-intercept.
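A quick numerical check of the formula in Exercise 3.17, on simulated data (an illustrative assumption, not from the text):

> x <- rnorm(40)
> y <- 3*x + rnorm(40)
> sum(x*y) / sum(x^2)         # slope through the origin, from Exercise 3.17
> coef(lm(y ~ x - 1))         # R's no-intercept fit agrees
> sum(resid(lm(y ~ x - 1)))   # the residuals need not sum to zero here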

3.2.1 Transformed Variables

For pairs of observations $(x_1,y_1), \ldots, (x_n,y_n)$, the linear relationship may exist not with these variables, but rather with a transformation of the variables. In this case we have

$$\psi(y_i) = \alpha + \beta g(x_i) + \epsilon_i. \qquad (3.13)$$

We then perform linear regression on the variables $\tilde{y} = \psi(y)$ and $\tilde{x} = g(x)$ using the least squares criterion. For example, if we take logarithms in

$$y_i = A e^{k x_i + \epsilon_i}, \qquad \ln y_i = \ln A + k x_i + \epsilon_i.$$

So, in (3.13), $\psi(y_i) = \ln y_i$ and $g(x_i) = x_i$ is the transformation of the data. The parameters are $\alpha = \ln A$ and $\beta = k$.

Before we look at an example, let's review a few basic properties of logarithms.

Remark 3.18 (logarithms). We will use both log, the base 10 common logarithm, and ln, the base $e$ natural logarithm. Common logarithms help us see orders of magnitude. For example, if $\log y = 5$, then we know that $y = 10^5 = 100{,}000$; if $\log y = -1$, then we know that $y = 10^{-1} = 1/10$. We will use natural logarithms to show instantaneous rates of growth. Consider the differential equation

$$\frac{dy}{dt} = ky.$$

We are saying that the instantaneous rate of growth of $y$ is proportional to $y$ with constant of proportionality $k$. The solution to this equation is

$$y = y_0 e^{kt} \quad\text{or}\quad \ln y = \ln y_0 + kt,$$

where $y_0$ is the initial value for $y$. This gives a linear relationship between $\ln y$ and $t$.

The two kinds of logarithm have a simple relationship. If we write $x = 10^a$, then $\log x = a$ and $\ln x = a \ln 10$. Thus, by substituting for $a$, we find that

$$\ln x = \log x \cdot \ln 10 = 2.3026\,\log x.$$

In R, the command for the natural logarithm of x is log(x). For the common logarithm, it is log(x,10).

Example 3.19. In the data on world oil production, the relationship between the explanatory variable and the response variable is nonlinear but can be made to be linear with a simple transformation, the common logarithm. Call the new response variable logbarrel. The explanatory variable remains year. With these variables, we can use a regression line to help describe the data. Here the model is

$$\log y_i = \alpha + \beta x_i + \epsilon_i. \qquad (3.14)$$

Regression is the first example of a class of statistical models called linear models. At this point we emphasize that linear refers to the appearance of the parameters $\alpha$ and $\beta$ linearly in the function (3.14). This acknowledges that, in this circumstance, the values $x_i$ and $y_i$ are known. Indeed, they are the data. Our goal is to give an estimate $\hat{\alpha}$ and $\hat{\beta}$ for the values of $\alpha$ and $\beta$.

Thus, R uses the command lm. Here is the output.

> summary(lm(logbarrel~year))

Call:
lm(formula = logbarrel ~ year)

Residuals:
     Min       1Q   Median       3Q      Max
-0.25562 -0.03390  0.03149  0.07220  0.12922

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.159e+01  1.301e+00  -39.64   <2e-16 ***
year         2.675e-02  6.678e-04   40.05   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.115 on 127 degrees of freedom
Multiple R-Squared: 0.9834, Adjusted R-squared: 0.9828
F-statistic: 1604 on 1 and 127 DF, p-value: < 2.2e-16

Note that the output states $r^2 = 0.9828$. Thus, the correlation $r = 0.9914$ is very nearly one, and so the data lie very close to the regression line. For world oil production, we obtained the relationship

$$\widehat{\log(\text{barrel})} = -51.59 + 0.02675 \times \text{year}.$$

If we rewrite the equation in exponential form, we obtain

$$\widehat{\text{barrel}} = A\,10^{0.02675 \times \text{year}} = A e^{\hat{k} \times \text{year}}.$$

Thus, $\hat{k}$ gives the instantaneous growth rate that best fits the data. This is obtained by converting from the common logarithm to the natural logarithm:

$$\hat{k} = 0.02675 \times \ln 10 = 0.0616.$$

Consequently, the use of oil sustained a growth of about 6% per year over a span of a hundred years.

Next, we will look for finer scale structure by examining the residual plot.

> use<-lm(logbarrel~year)
> plot(year,resid(use))

[Residual plot: resid(use) versus year, 1880 to 1980.]

Exercise 3.20. Remove the data points after the oil crisis of the mid 1970s, find the regression line and the instantaneous growth rate that best fits the data. Look at the residual plot and use a fact about American history to explain why the residuals increase until the 1920s, decrease until the early 1940s, and increase again until the early 1970s.

Example 3.21 (Michaelis-Menten Kinetics). In this example, we will have to use a more sophisticated line of reasoning to create a linear relationship between the explanatory and response variables. Consider the chemical reaction in which an enzyme catalyzes the action on a substrate. Here

$$E + S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} ES \overset{k_2}{\longrightarrow} E + P \qquad (3.15)$$

where

- $E_0$ is the total amount of enzyme,
- $E$ is the free enzyme,
- $S$ is the substrate,
- $ES$ is the substrate-bound enzyme,
- $P$ is the product, and
- $V = d[P]/dt$ is the production rate.

The rate constants above and below the arrows give the reaction rates. Using the symbol $[\,\cdot\,]$ to indicate concentration, notice that the enzyme is either free or bound to the substrate. Its total concentration is, therefore,

$$[E_0] = [E] + [ES], \quad\text{and thus}\quad [E] = [E_0] - [ES]. \qquad (3.16)$$

Our goal is to relate the production rate $V$ to the substrate concentration $[S]$.

The law of mass action turns the chemical reactions in (3.15) into differential equations. In particular, the reactions, focusing on the substrate-bound enzyme and the product, give the equations

$$\frac{d[ES]}{dt} = k_1[E][S] - [ES](k_{-1} + k_2) \quad\text{and}\quad V = \frac{d[P]}{dt} = k_2[ES]. \qquad (3.17)$$

We can meet our goal if we can find an equation for $V = k_2[ES]$ that depends only on $[S]$, the substrate concentration. Let's look at the data:

  $[S]$ (mM)        1    2    5   10   20
  $V$ (nmol/sec)  1.0  1.5  2.2  2.5  2.9

[Plot: production rate versus concentration of substrate.]

If we wish to use linear regression, then we will have to transform the data. In this case, we will develop the Michaelis-Menten transformation, applied to situations in which the concentration of the substrate-bound enzyme (and hence also the unbound enzyme) changes much more slowly than those of the product and substrate:

$$0 \approx \frac{d[ES]}{dt}.$$

In words, the substrate-bound enzyme is nearly in a steady state. Using the law of mass action equation (3.17) for $d[ES]/dt$, we can rearrange terms to conclude that

[Figure 3.6: Lineweaver-Burk double reciprocal plot for the data presented above. The $y$-intercept gives the reciprocal of the maximum production rate. The dotted line indicates that negative concentrations are not physical; nevertheless, the $x$-intercept gives the negative reciprocal of the Michaelis constant.]

$$[ES] \approx \frac{k_1[E][S]}{k_{-1} + k_2} = \frac{[E][S]}{K_m}. \qquad (3.18)$$

The ratio $K_m = (k_{-1} + k_2)/k_1$ of the rate of loss of the substrate-bound enzyme to its rate of production is called the Michaelis constant. We have now met our goal part way: $V$ is a function of $[S]$, but it is also stated as a function of $[E]$. However, (3.16) gives $[E]$ as a function of $[ES]$. Now, if we combine this with (3.18) and solve for $[ES]$, then

$$[ES] \approx \frac{([E_0] - [ES])[S]}{K_m}, \qquad [ES] \approx [E_0]\,\frac{[S]}{K_m + [S]}.$$

Under this approximation, the production rate of the product is

$$V = \frac{d[P]}{dt} = k_2[ES] \approx k_2[E_0]\,\frac{[S]}{K_m + [S]} = V_m\,\frac{[S]}{K_m + [S]}.$$

Here, $V_m = k_2[E_0]$ is the maximum production rate. (To see this, let the substrate concentration $[S] \to \infty$.)

To perform linear regression, we need to have a function of $V$ be linearly related to a function of $[S]$. This is achieved by taking the reciprocal of both sides of this equation:

$$\frac{1}{V} = \frac{K_m + [S]}{V_m[S]} = \frac{1}{V_m} + \frac{K_m}{V_m}\cdot\frac{1}{[S]}. \qquad (3.19)$$

Thus, we have a linear relationship between $1/V$, the response variable, and $1/[S]$, the explanatory variable, subject to experimental error. The Lineweaver-Burk double reciprocal plot provides a useful method for the analysis of the Michaelis-Menten equation. See Figure 3.6. For the data,

  $[S]$ (mM)        1    2    5   10   20
  $V$ (nmol/sec)  1.0  1.5  2.2  2.5  2.9

the regression line is

$$\widehat{\frac{1}{V}} = 0.3211 + 0.6813\cdot\frac{1}{[S]}.$$

Here are the R commands:

> S<-c(1,2,5,10,20)
> V<-c(1.0,1.5,2.2,2.5,2.9)
> Sinv<-1/S
> Vinv<-1/V
> lm(Vinv~Sinv)

Call:
lm(formula = Vinv ~ Sinv)

Coefficients:
(Intercept)         Sinv
     0.3211       0.6813

Using (3.19), the intercept estimates $1/V_m$ and the slope estimates $K_m/V_m$, so we find that

$$\hat{V}_m = \frac{1}{0.3211} = 3.114 \quad\text{and}\quad \hat{K}_m = \frac{0.6813}{0.3211} = 2.122.$$
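In R, the two kinetic parameters can be read off the fitted coefficients. A short sketch continuing the session above (the helper variable b is illustrative):

> b <- unname(coef(lm(Vinv~Sinv)))
> 1/b[1]          # V_m is the reciprocal of the intercept, about 3.114
> b[2]/b[1]       # K_m is the slope divided by the intercept, about 2.122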

With more access to computational software, this method is not used as much as before. The measurements for small values of the concentration (and thus large values of $1/[S]$) are more variable, and consequently the residuals are likely to be heteroscedastic. We look in the next section for an alternative approach, namely, nonlinear regression.

Example 3.22 (Frank Anscombe). Consider the three data sets:

  x  10     8     13    9     11    14    6     4     12    7     5
  y  8.04   6.95  7.58  8.81  8.33  9.96  7.24  4.26  10.84 4.82  5.68

  x  10     8     13    9     11    14    6     4     12    7     5
  y  9.14   8.14  8.74  8.77  9.26  8.10  6.13  3.10  9.13  7.26  4.74

  x  8      8     8     8     8     8     8     8     8     8     19
  y  6.58   5.76  7.71  8.84  8.47  7.04  5.25  5.56  7.91  6.89  12.50

[Three scatterplots of the data sets, each with its fitted regression line.]

Each of these data sets has regression line $\hat{y} = 3 + 0.5x$ and correlations between 0.806 and 0.816. However, only the first is a suitable data set for linear regression. This example is meant to emphasize the point that software will happily compute a regression line and an $r^2$ value, but further examination of the data is required to see if this method is appropriate for any given data set.

3.3 Extensions

We will discuss briefly two extensions - the first is a least squares criterion between $x$ and $y$ that is nonlinear in the parameters $\beta = (\beta_0, \ldots, \beta_k)$. Thus, the model is

$$y_i = g(x_i \mid \beta) + \epsilon_i$$

for $g$, a nonlinear function of the parameters. The second considers situations with more than one explanatory variable:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i. \qquad (3.20)$$

This brief discussion does not have the detail necessary to begin to use these methods. It serves primarily as an invitation to begin to consult resources that more fully develop these ideas.

3.3.1 Nonlinear Regression

Here, we continue the estimation of parameters using a least squares criterion:

$$SS(\beta) = \sum_{i=1}^n (y_i - g(x_i \mid \beta))^2.$$

For most choices of $g(x \mid \beta)$, the solution to the nonlinear least squares criterion cannot be expressed in closed form. Thus, a numerical strategy for the solution $\hat{\beta}$ is necessary. This generally begins with some initial guess of parameter values and an iteration scheme to minimize $SS(\beta)$. Such a scheme is likely to use local information about the first and second partial derivatives of $g$ with respect to the parameters $\beta_i$. For example, gradient descent (also known as steepest descent, or the method of steepest descent) is an iterative method that produces a sequence of parameter values. The increment of the parameter values for an iteration is proportional to the negative of the gradient of $SS(\beta)$ evaluated at the current point. The hope is that the sequence converges to give the desired minimum value for $SS(\beta)$. The R command gnls, for general nonlinear least squares, can be used to accomplish this.

As above, you should examine the residual plot to see that it has no structure. For example, if the Lineweaver-Burk method for Michaelis-Menten kinetics yields structure in the residuals, then linear regression is not considered a good method. Under these circumstances, one can next try to use the parameter estimates derived from Lineweaver-Burk as an initial guess in a nonlinear least squares regression using a least squares criterion based on the sum of squares

$$SS(V_m, K_m) = \sum_{j=1}^n \left(V_j - V_m\frac{[S]_j}{K_m + [S]_j}\right)^2$$

for data $(V_1,[S]_1), (V_2,[S]_2), \ldots, (V_n,[S]_n)$.
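To make the gradient descent idea concrete, here is a minimal sketch for the Michaelis-Menten sum of squares above, assuming the S and V vectors from Example 3.21 and starting from the Lineweaver-Burk estimates. The step size and iteration count are illustrative choices, not tuned values; in practice one would call a built-in routine such as nls.

> S <- c(1,2,5,10,20); V <- c(1.0,1.5,2.2,2.5,2.9)
> Vm <- 3.114; Km <- 2.122                # initial guess from the double reciprocal plot
> rate <- 0.01                            # step size for the descent
> for (iter in 1:5000) {
+   res <- V - Vm*S/(Km + S)              # current residuals
+   gVm <- -2*sum(res * S/(Km + S))       # partial derivative of SS with respect to V_m
+   gKm <-  2*sum(res * Vm*S/(Km + S)^2)  # partial derivative of SS with respect to K_m
+   Vm <- Vm - rate*gVm                   # step against the gradient
+   Km <- Km - rate*gKm
+ }
> c(Vm, Km)                               # compare with nls(V ~ Vm*S/(Km+S), start=list(Vm=3.1, Km=2.1))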

3.3.2 Multiple Linear Regression

Before we start with multiple linear regression, we first recall a couple of concepts and results from linear algebra. Let $C_{ij}$ denote the entry in the $i$-th row and $j$-th column of a matrix $C$.

A matrix $A$ with $r_A$ rows and $c_A$ columns and a matrix $B$ with $r_B$ rows and $c_B$ columns can be multiplied to form a matrix $AB$ provided that $c_A = r_B$, i.e., the number of columns in $A$ equals the number of rows in $B$. In this case,

$$(AB)_{ij} = \sum_{k=1}^{c_A} A_{ik}B_{kj}.$$

- The $d$-dimensional identity matrix $I$ is the matrix with the value 1 for all entries on the diagonal ($I_{jj} = 1$, $j = 1, \ldots, d$) and 0 for all other entries. Notice, for any $d$-dimensional vector $x$, $Ix = x$.

- A $d \times d$ matrix $C$ is called invertible, with inverse $C^{-1}$, provided that $CC^{-1} = C^{-1}C = I$. Only one matrix can have this property.

- Suppose we have a $d$-dimensional vector $a$ of known values and a $d \times d$ matrix $C$, and we want to determine the vectors $x$ that satisfy $a = Cx$. This equation could have no solutions, a single solution, or an infinite number of solutions. If the matrix $C$ is invertible, then we have a single solution, $x = C^{-1}a$.

- The transpose of a matrix is obtained by reversing the rows and columns of the matrix. We use a superscript $T$ to indicate the transpose. Thus, the $ij$ entry of a matrix $C$ is the $ji$ entry of its transpose, $C^T$.

Example 3.23.

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 2 & 7 \end{pmatrix}^T = \begin{pmatrix} 1 & 4 \\ 2 & 2 \\ 3 & 7 \end{pmatrix}$$

- A square matrix $C$ is invertible if and only if its determinant $\det(C) \ne 0$. For a $2\times 2$ matrix

$$C = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad \det(C) = ad - bc,$$

and the matrix inverse is

$$C^{-1} = \frac{1}{\det(C)}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.$$

Exercise 3.24. Show that $(Cx)^T = x^T C^T$.

Exercise 3.25. For

$$C = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix},$$

find $\det(C)$ and $C^{-1}$.
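These operations all have direct R counterparts, which gives a quick check of Exercise 3.25 (note that matrix fills entries column by column):

> C <- matrix(c(1,2,3,4), nrow=2)   # builds the matrix with rows (1 3) and (2 4)
> det(C)
[1] -2
> solve(C)                          # the matrix inverse
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5
> t(C)                              # the transpose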

In multiple linear regression, we have more than one predictor or explanatory random variable. Thus, we can write (3.20) in matrix form as

$$y = X\beta + \epsilon \qquad (3.21)$$

where $y = (y_1, y_2, \ldots, y_n)^T$ is a column vector of responses, $X$ is a matrix of predictors,

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \qquad (3.22)$$

$\beta = (\beta_0, \beta_1, \ldots, \beta_k)^T$ is a column vector of parameters, and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T$ is a column vector of errors. The column of ones on the left gives the constant term in a multilinear equation. This matrix $X$ is an example of what is known as a design matrix.

Exercise 3.26. Show that the least squares criterion

$$SS(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 \qquad (3.23)$$

can be written in matrix form as

$$SS(\beta) = (y - X\beta)^T(y - X\beta).$$

To minimize $SS$, we take the gradient and set it equal to 0.

Exercise 3.27. Check that the gradient is

$$\nabla_\beta SS(\beta) = -2(y - X\beta)^T X. \qquad (3.24)$$

Based on the exercise above, the value $\hat{\beta}$ that minimizes $SS$ satisfies

$$(y - X\hat{\beta})^T X = 0, \qquad y^T X = \hat{\beta}^T X^T X.$$

Taking the transpose of this last equation,

$$X^T X\hat{\beta} = X^T y.$$

If $X^T X$ is invertible, then we can multiply both sides of the equation above by $(X^T X)^{-1}$ to obtain an equation for the parameter values $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)$ in the least squares regression:

$$\hat{\beta} = (X^T X)^{-1} X^T y. \qquad (3.25)$$

Thus, the estimates $\hat{\beta}$ are a linear transformation of the responses $y$ through the so-called hat matrix $H = (X^T X)^{-1} X^T$, i.e., $\hat{\beta} = Hy$.

Exercise 3.28. Verify that the hat matrix $H$ is a left inverse of the design matrix $X$.

This gives the regression equation

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k.$$
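Equation (3.25) can be verified directly by solving the normal equations in R. A sketch with simulated predictors (the model and variable names are illustrative assumptions):

> x1 <- rnorm(50); x2 <- rnorm(50)
> y <- 1 + 2*x1 - x2 + rnorm(50)
> X <- cbind(1, x1, x2)                     # design matrix with a column of ones
> betahat <- solve(t(X) %*% X, t(X) %*% y)  # solves X^T X betahat = X^T y
> cbind(betahat, coef(lm(y ~ x1 + x2)))     # the two sets of estimates agree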

Example 3.29 (ordinary least squares regression). In this case,

$$X^T X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}$$

and

$$X^T y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix}.$$

The determinant of $X^T X$ is

$$n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 = n(n-1)\,\text{var}(x),$$

and thus

$$(X^T X)^{-1} = \frac{1}{n(n-1)\,\text{var}(x)}\begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}$$

and

$$\hat{\beta} = (X^T X)^{-1}X^T y = \frac{1}{n(n-1)\,\text{var}(x)}\begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}\begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix}.$$

For example, for the second row, we obtain

$$\frac{1}{n(n-1)\,\text{var}(x)}\left(-\sum_{i=1}^n x_i\sum_{i=1}^n y_i + n\sum_{i=1}^n x_i y_i\right) = \frac{n(n-1)\,\text{cov}(x,y)}{n(n-1)\,\text{var}(x)} = \frac{\text{cov}(x,y)}{\text{var}(x)},$$

as seen in equation (3.10).

Example 3.30. The choice $x_{ij} = x_i^j$ in (3.22) results in polynomial regression,

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \epsilon_i,$$

in equation (3.20).

Example 3.31 (US population). Below are the census populations.

  census year  population    census year  population    census year  population    census year  population
  1790     3,929,214         1850    23,191,876         1910    92,228,496         1970   203,211,926
  1800     5,236,631         1860    31,443,321         1920   106,021,537         1980   226,545,805
  1810     7,239,881         1870    38,558,371         1930   123,202,624         1990   248,709,873
  1820     9,638,453         1880    49,371,340         1940   132,164,569         2000   281,421,906
  1830    12,866,020         1890    62,979,766         1950   151,325,798         2010   308,745,538
  1840    17,069,453         1900    76,212,168         1960   179,323,175

To analyze this in R, we enter the data:

> uspop<-c(3929214,5236631,7239881,9638453,12866020,17069453,23191876,31443321,
+ 38558371,49371340,62979766,76212168,92228496,106021537,123202624,132164569,
+ 151325798,179323175,203211926,226545805,248709873,281421906,308745538)
> year<-c(0:22)*10+1790
> plot(year,uspop)
> loguspop<-log(uspop,10)
> plot(year,loguspop)

[Figure 3.7: (a) United States census population from 1790 to 2010 and (b) its base 10 logarithm.]

Note that the logarithm of the population still has a bend to it, so we will perform a quadratic regression on the logarithm of the population. In order to keep the numbers smaller, we shall use the year minus 1790, the year of the first census, for our explanatory variable.

> year<-year-1790
> year2<-year^2

$$\log(\text{uspopulation}) = \beta_0 + \beta_1(\text{year} - 1790) + \beta_2(\text{year} - 1790)^2.$$

Thus, loguspop is the response variable. The + sign is used in the case of more than one explanatory variable and here is placed between the explanatory variables year and year2.

> lm.uspop<-lm(loguspop~year+year2)
> summary(lm.uspop)

Call:
lm(formula = loguspop ~ year + year2)

Residuals:
      Min        1Q    Median        3Q       Max
-0.037387 -0.013453 -0.000912  0.015281  0.029782

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.582e+00  1.137e-02  578.99   <2e-16 ***
year         1.471e-02  2.394e-04   61.46   <2e-16 ***
year2       -2.808e-05  1.051e-06  -26.72   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[Figure 3.8: Residual plot for the US population regression.]

Residual standard error: 0.01978 on 20 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.9989
F-statistic: 9781 on 2 and 20 DF, p-value: < 2.2e-16

The R output shows us that

$$\hat{\beta}_0 = 6.582, \qquad \hat{\beta}_1 = 0.01471, \qquad \hat{\beta}_2 = -0.00002808.$$

So, taking the regression line to the power 10, we have that

$$\widehat{\text{uspopulation}} = 10^{6.582} \times 10^{0.01471(\text{year}-1790) - 0.00002808(\text{year}-1790)^2},$$

where $10^{6.582} \approx 3.82 \times 10^6$. In Figure 3.8, we show the residual plot for the logarithm of the US population.

> resid.uspop<-resid(lm.uspop)
> plot(year,resid.uspop)

3.4 Answers to Selected Exercises

3.1. A negative covariance means that the terms $(x_i - \bar{x})(y_i - \bar{y})$ in the sum are more likely to be negative than positive. This occurs whenever one of the $x$ and $y$ variables is above the mean, and then the other is likely to be below.

3.2. We expand the product inside the sum:

$$\text{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left(\sum_{i=1}^n x_i y_i - \bar{y}\sum_{i=1}^n x_i - \bar{x}\sum_{i=1}^n y_i + n\bar{x}\bar{y}\right)$$

$$= \frac{1}{n-1}\left(\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}\right) = \frac{1}{n-1}\left(\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}\right).$$

The change in measurements from centimeters to meters would divide the covariance by 10,000.

3.3. We rearrange the terms and simplify:

$$\text{cov}(ax+b, cy+d) = \frac{1}{n-1}\sum_{i=1}^n ((ax_i + b) - (a\bar{x} + b))((cy_i + d) - (c\bar{y} + d))$$

$$= \frac{1}{n-1}\sum_{i=1}^n a(x_i - \bar{x})\cdot c(y_i - \bar{y}) = ac\cdot\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = ac\,\text{cov}(x,y).$$

3.5. Assume that $a \ne 0$ and $c \ne 0$. If $a = 0$ or $c = 0$, then the covariance is 0 and so is the correlation.

$$r(ax+b, cy+d) = \frac{\text{cov}(ax+b, cy+d)}{s_{ax+b}\,s_{cy+d}} = \frac{ac\,\text{cov}(x,y)}{|a|s_x\,|c|s_y} = \pm r(x,y).$$

We take the plus sign if the signs of $a$ and $c$ agree and the minus sign if they differ.

3.6. First we rearrange terms:

$$s_{x+y}^2 = \frac{1}{n-1}\sum_{i=1}^n ((x_i + y_i) - (\bar{x} + \bar{y}))^2 = \frac{1}{n-1}\sum_{i=1}^n ((x_i - \bar{x}) + (y_i - \bar{y}))^2$$

$$= \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{2}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) + \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$$

$$= s_x^2 + 2\,\text{cov}(x,y) + s_y^2 = s_x^2 + 2r s_x s_y + s_y^2.$$

For a triangle with sides $a$, $b$, and $c$, the law of cosines states that

$$c^2 = a^2 + b^2 - 2ab\cos\theta,$$

where $\theta$ is the measure of the angle opposite side $c$. Thus the analogy is: $s_x$ corresponds to $a$, $s_y$ corresponds to $b$, $s_{x+y}$ corresponds to $c$, and $r$ corresponds to $-\cos\theta$. Notice that both $r$ and $-\cos\theta$ take values between $-1$ and $1$.

3.7. Using the hint,

$$0 \le \sum_{i=1}^n (v_i + w_i\zeta)^2 = \sum_{i=1}^n v_i^2 + 2\zeta\sum_{i=1}^n v_i w_i + \zeta^2\sum_{i=1}^n w_i^2 = A + B\zeta + C\zeta^2.$$

For a quadratic expression to always take on non-negative values, we must have a non-positive discriminant, i.e.,

$$0 \ge B^2 - 4AC = 4\left(\sum_{i=1}^n v_i w_i\right)^2 - 4\left(\sum_{i=1}^n v_i^2\right)\left(\sum_{i=1}^n w_i^2\right).$$

Now, divide by 4 and rearrange terms:

$$\left(\sum_{i=1}^n v_i^2\right)\left(\sum_{i=1}^n w_i^2\right) \ge \left(\sum_{i=1}^n v_i w_i\right)^2.$$

3.8. The value of the correlation is the same for pairs of observations and for their standardized versions. Thus, we take $x$ and $y$ to be standardized observations. Then $s_x = s_y = 1$. Now, using equation (3.1), we have that

$$0 \le s_{x+y}^2 = 1 + 1 + 2r = 2 + 2r.$$

Simplifying, we have $-2 \le 2r$ and $r \ge -1$. For the second inequality, use the identity similar to (3.1) for the difference of the observations,

$$s_{x-y}^2 = s_x^2 + s_y^2 - 2r s_x s_y.$$

Then,

$$0 \le s_{x-y}^2 = 1 + 1 - 2r = 2 - 2r.$$

Simplifying, we have $2r \le 2$ and $r \le 1$. Thus, the correlation must always be between $-1$ and $1$.

In the case that $r = -1$, we have that $s_{x+y}^2 = 0$ and thus, using the standardized coordinates,

$$\frac{x_i - \bar{x}}{s_x} + \frac{y_i - \bar{y}}{s_y} = 0.$$

Comparing with (3.4), $\zeta = s_x/s_y$. In the case that $r = 1$, we have that $s_{x-y}^2 = 0$ and thus, using the standardized coordinates,

$$\frac{x_i - \bar{x}}{s_x} - \frac{y_i - \bar{y}}{s_y} = 0.$$

Thus, $\zeta = -s_x/s_y$.

3.10. 1. First the data and the scatterplot, preparing by using mfrow to have side-by-side plots:

> x<-c(-2:3)
> y<-c(-3,-1,-2,0,4,2)
> par(mfrow=c(1,2))
> plot(x,y)

2. Then the regression line and its summary:

> regress.lm<-lm(y~x)
> summary(regress.lm)

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3          4          5          6
-2.776e-16  8.000e-01 -1.400e+00 -6.000e-01  2.200e+00 -1.000e+00

Coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.6000     0.6309  -0.951   0.3955
x              1.2000     0.3546   3.384   0.0277 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.483 on 4 degrees of freedom
Multiple R-squared: 0.7412, Adjusted R-squared: 0.6765
F-statistic: 11.45 on 1 and 4 DF, p-value: 0.02767

3. Add the regression line to the scatterplot:

> abline(regress.lm)

4. Make a data frame to show the predictions and the residuals:

> residuals<-resid(regress.lm)
> predictions<-predict(regress.lm,newdata=data.frame(x=c(-2:3)))
> data.frame(x,y,predictions,residuals)

   x  y predictions     residuals
1 -2 -3        -3.0 -2.775558e-16
2 -1 -1        -1.8  8.000000e-01
3  0 -2        -0.6 -1.400000e+00
4  1  0         0.6 -6.000000e-01
5  2  4         1.8  2.200000e+00
6  3  2         3.0 -1.000000e+00

5. Finally, the residual plot and the horizontal line at 0:

> plot(x,residuals)
> abline(h=0)

3.13. Use the subscript $y$ in $\hat{\alpha}_y$ and $\hat{\beta}_y$ to emphasize that $y$ is the explanatory variable. We still have $\bar{x} = 0.5$, $\bar{y} = 0$.

  $y_i$   $x_i$   $y_i-\bar{y}$   $x_i-\bar{x}$   $(x_i-\bar{x})(y_i-\bar{y})$   $(y_i-\bar{y})^2$
   -3      -2      -3               -2.5             7.5                           9
   -1      -1      -1               -1.5             1.5                           1
   -2       0      -2               -0.5             1.0                           4
    0       1       0                0.5             0.0                           0
    4       2       4                1.5             6.0                          16
    2       3       2                2.5             5.0                           4
  total 0   3       0                0              21.0                          34

$$\text{cov}(x,y) = 21/5, \qquad \text{var}(y) = 34/5.$$

So the slope $\hat{\beta}_y = 21/34$. In addition, $\bar{x} = \hat{\alpha}_y + \hat{\beta}_y\bar{y}$, so $1/2 = \hat{\alpha}_y$. Thus, to predict $x$ from $y$, the regression line is $\hat{x}_i = 1/2 + (21/34)y_i$. Because the product of the slopes

$$\frac{6}{5}\cdot\frac{21}{34} = \frac{63}{85} \ne 1,$$

[Figure 3.9: (left) scatterplot and regression line; (right) residual plot and horizontal line at 0.]

this line differs from the line used to predict $y$ from $x$.

3.14. Recall that the covariance of $x$ and $y$ is symmetric, i.e., $\text{cov}(x,y) = \text{cov}(y,x)$. Thus,

$$\hat{\beta}_x\hat{\beta}_y = \frac{\text{cov}(x,y)}{s_x^2}\cdot\frac{\text{cov}(y,x)}{s_y^2} = \frac{\text{cov}(x,y)^2}{s_x^2 s_y^2} = \left(\frac{\text{cov}(x,y)}{s_x s_y}\right)^2 = r^2.$$

In the example above,

$$r^2 = \frac{\text{cov}(x,y)^2}{s_x^2 s_y^2} = \frac{(21/5)^2}{(17.5/5)(34/5)} = \frac{21^2}{17.5\times 34} = \frac{441}{595} = \frac{63}{85}.$$

3.15. To show that the correlation is zero, we show that the numerator in its definition, the covariance, is zero. First,

$$\text{cov}(\hat{y}, y - \hat{y}) = \text{cov}(\hat{y}, y) - \text{cov}(\hat{y}, \hat{y}).$$

The first term in this difference is

$$\text{cov}(\hat{y}, y) = \text{cov}\left(\frac{\text{cov}(x,y)}{s_x^2}\,x,\; y\right) = \frac{\text{cov}(x,y)^2}{s_x^2} = \frac{r^2 s_x^2 s_y^2}{s_x^2} = r^2 s_y^2.$$

For the second, $\text{cov}(\hat{y}, \hat{y}) = s_{\hat{y}}^2 = r^2 s_y^2$. So the difference is 0.

3.16. For the denominator,

$$s_{\text{DATA}}^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2.$$

For the numerator, recall that $(\bar{x}, \bar{y})$ is on the regression line. Consequently, $\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x}$. Thus, the mean of the fits is

$$\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^n \hat{y}_i = \frac{1}{n}\sum_{i=1}^n (\hat{\alpha} + \hat{\beta}x_i) = \hat{\alpha} + \hat{\beta}\bar{x} = \bar{y}.$$

This could also be seen by using the fact (3.11) that the sum of the residuals is 0. So, for the numerator,

$$s_{\text{FIT}}^2 = \frac{1}{n-1}\sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2 = \frac{1}{n-1}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2.$$

Now, take the ratio and notice that the fractions $1/(n-1)$ in the numerator and denominator cancel.

3.17. The least squares criterion becomes

$$S(\beta) = \sum_{i=1}^n (y_i - \beta x_i)^2.$$

The derivative with respect to $\beta$ is

$$S'(\beta) = -2\sum_{i=1}^n x_i(y_i - \beta x_i).$$

$S'(\hat{\beta}) = 0$ for the value

$$\hat{\beta} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$

3.24. The $i$-th component of $(Cx)^T$ is

$$\sum_{j} C_{ij}x_j.$$

Now, the $i$-th component of $x^T C^T$ is

$$\sum_{j} x_j C_{ji}^T = \sum_{j} x_j C_{ij}.$$

3.25. $\det(C) = 4 - 6 = -2$ and

$$C^{-1} = -\frac{1}{2}\begin{pmatrix} 4 & -3 \\ -2 & 1 \end{pmatrix} = \begin{pmatrix} -2 & 3/2 \\ 1 & -1/2 \end{pmatrix}.$$

3.26. Using equation (3.22), the $i$-th component of $y - X\beta$ is

$$(y - X\beta)_i = y_i - \sum_{j=0}^k \beta_j x_{ij} = y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik}.$$

Now, $(y - X\beta)^T(y - X\beta)$ is the dot product of $y - X\beta$ with itself. This gives (3.23).

3.27. Write $x_{i0} = 1$ for all $i$; then we can write (3.23) as

$$SS(\beta) = \sum_{i=1}^n (y_i - \beta_0 x_{i0} - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2.$$

Then,

$$\frac{\partial}{\partial\beta_j} SS(\beta) = -2\sum_{i=1}^n (y_i - \beta_0 x_{i0} - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})x_{ij} = -2\sum_{i=1}^n (y_i - (X\beta)_i)x_{ij} = -2((y - X\beta)^T X)_j.$$

This is the $j$-th coordinate of (3.24).

3.28. $HX = (X^TX)^{-1}X^TX = (X^TX)^{-1}(X^TX) = I$, the identity matrix.