Regression and Covariance
James K. Peterson
Department of Biological Sciences and Department of Mathematical Sciences
Clemson University
April 16, 2014

Outline
A Review of Regression
Regression and Covariance
Abstract This lecture redoes regression in terms of covariances. We begin with a collection of data pairs {(x_i, y_i) : 1 ≤ i ≤ N}. The line we want to pick has the form y = mx + b for some choice of slope m and intercept b. The vertical distance from a given data point (x_i, y_i) to our line is d_i = |mx_i + b − y_i|. If we want to minimize the sum of all these individual errors, we get the same result by minimizing the sum of all the errors squared. Define an error function E by

E(m, b) = Σ_{i=1}^{N} d_i² = Σ_{i=1}^{N} (mx_i + b − y_i)².

We see the error function E is really a function of the two independent variables m and b.
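The error function above is easy to evaluate directly. Here is a small Python sketch of our own (the lecture itself works in MatLab, and the name `error` is our own illustration):

```python
# Sum-of-squared-errors function E(m, b) for data pairs (x_i, y_i).
def error(m, b, xs, ys):
    # E(m, b) = sum_i (m*x_i + b - y_i)^2
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))

# For points lying exactly on y = 2x + 1, the error at (m, b) = (2, 1) is zero.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
print(error(2.0, 1.0, xs, ys))  # 0.0
print(error(1.0, 0.0, xs, ys))  # a worse fit gives a strictly positive error
```

Minimizing E over all choices of (m, b) is exactly the optimization problem solved on the next slide.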
The optimal slope is

m = (E(XY) − E(X)E(Y)) / (E(X²) − E(X)E(X)).

The optimal intercept is

b = (E(Y)E(X²) − E(X)E(XY)) / (E(X²) − E(X)E(X)).

An equivalent solution that is easier to find is

b = E(Y) − m E(X).

Now the terms E(X²) − E(X)E(X) and E(XY) − E(X)E(Y) occur a lot in this kind of work. We call this kind of calculation a covariance and use the symbol Cov for them. The formal definitions are

Cov(X, X) = E(X²) − E(X)E(X)
Cov(X, Y) = E(XY) − E(X)E(Y).

We thus know

m = (E(XY) − E(X)E(Y)) / (E(X²) − E(X)E(X)) = Cov(X, Y) / Cov(X, X).

Thus, Cov(X, Y) = m Cov(X, X), which tells us the covariance of X and Y is proportional to the slope of the regression line of Y on X with proportionality constant given by Cov(X, X).
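To see these formulas in action, here is a short Python check of our own (a sketch, not code from the lecture) that computes the optimal slope and intercept directly from the raw expectations:

```python
# Optimal slope and intercept from the expectation formulas:
#   m = (E(XY) - E(X)E(Y)) / (E(X^2) - E(X)E(X)),  b = E(Y) - m E(X).
def fit_line(xs, ys):
    n = len(xs)
    ex = sum(xs) / n                                  # E(X)
    ey = sum(ys) / n                                  # E(Y)
    exy = sum(x * y for x, y in zip(xs, ys)) / n      # E(XY)
    exx = sum(x * x for x in xs) / n                  # E(X^2)
    m = (exy - ex * ey) / (exx - ex * ex)
    b = ey - m * ex
    return m, b

# Points that lie exactly on y = 2x + 1 should recover m = 2, b = 1.
m, b = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(m, b)  # m is approximately 2, b approximately 1
```

When the data lie exactly on a line, the formulas recover that line; for noisy data they give the line minimizing E.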
Hence, the optimal slope and intercept are given by

m = Cov(X, Y) / Cov(X, X)
b = (E(Y)E(X²) − E(X)E(XY)) / Cov(X, X).

Finally, there is one other idea that is useful here: the idea of how our data varies from the expected value E(X). We can calculate the expected squared total difference as follows:

E((X − E(X))²) = (1/N) Σ (x_i − E(X))²
= (1/N) Σ ( x_i² − 2 x_i E(X) + (E(X))² )
= (1/N) Σ x_i² − 2 E(X) (1/N) Σ x_i + (E(X))²
= E(X²) − 2 (E(X))² + (E(X))²
= E(X²) − (E(X))².

This calculation gives us what is called the variance of our data, and you should learn more about this tool in other courses as it is extremely useful. Alas, our needs are quite limited in this course, so we just need to mention it. So we have another definition. The variance is denoted by the symbol Var and defined by

Var(X) = E((X − E(X))²) = E(X²) − (E(X))².

Note that the variance Var(X) is exactly the same as the covariance Cov(X, X)! We can now see that there is an interesting connection between the covariance of X and Y and the variance of X.
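As a quick numerical sanity check (a Python sketch of our own, not part of the lecture), the two forms of the variance agree on the X data used in the example below:

```python
# Variance two ways: mean squared deviation vs. E(X^2) - (E(X))^2.
def var_direct(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n       # E((X - E(X))^2)

def var_shortcut(xs):
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2   # E(X^2) - (E(X))^2

xs = [1.2, 2.4, 3.0, 3.7, 4.1, 5.0]
print(var_direct(xs), var_shortcut(xs))  # both are approximately 1.4956
```

The shortcut form is the one the derivation above produces, and it is the one used in the MatLab session that follows.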
Let's denote the slope m obtained by our optimization strategy by m(X, Y), so we always remember it is a function of our data pairs. A summary of our work is in order. We have found the optimal slope and y intercept for our data are given by m and b, respectively, where

m(X, Y) = Cov(X, Y) / Var(X)
b(X, Y) = (E(Y)E(X²) − E(X)E(XY)) / Var(X) = E(Y) − m(X, Y) E(X).

We call the optimal slope m(X, Y) the slope of the regression of Y on X. This is something we can measure, so it is an estimate of how the variable y changes with respect to x. We now know a fair bit of calculus, so we can think of m(X, Y) as an estimate of either dy/dx or Δy/Δx, which is a really useful idea. Then, we notice that

Cov(X, Y) = Var(X) m(X, Y).

Thus, the covariance of X and Y is proportional to the slope of the regression line of Y on X with proportionality constant given by the variance Var(X) which, of course, is the same as the covariance of X with itself, Cov(X, X).

Example Let's find Var(X) and Cov(X, Y) for the data

D = {(1.2, 2.3), (2.4, 1.9), (3.0, 4.5), (3.7, 5.2), (4.1, 3.2), (5.0, 7.2)}.

Solution We do this in MatLab.

% set up the data as X and Y vectors
X = [1.2; 2.4; 3.0; 3.7; 4.1; 5.0];
Y = [2.3; 1.9; 4.5; 5.2; 3.2; 7.2];
% get length of data
N = length(X);
% Find E(X), called EX here
EX = sum(X)/N;
% Find E(Y), called EY here
EY = sum(Y)/N;
% find E(XY), called EXY here
EXY = sum(X.*Y)/N;
% find E(X^2), called EXX here
EXX = sum(X.*X)/N;
% find Cov(X,X), here COVX
COVX = EXX - EX*EX;
% here COVX = 1.4956
% Find Cov(X,Y), here COVXY
COVXY = EXY - EX*EY;
% here COVXY = 1.7683

Homework 75

Again, these problems are taken from ones you can find in R. R. Sokal and F. J. Rohlf, Introduction to Biostatistics, published by Dover, in the chapter on regression. Your results need to be placed in a Word doc in the usual way, nicely commented with embedded plots. For these problems, calculate Var(X) and Cov(X, Y) in MatLab.

75.1 The data here has the form (Time, Temperature) where the time is the amount of time that has elapsed since a rabbit was inoculated with a virus and the temperature is the rabbit's temperature at that time. Find the covariances for this data.

D = {(24, 102.8), (32, 104.5), (48, 106.5), (56.0, 107.0), (72.0, 103.9), (80.0, 103.2)}.
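The MatLab session stops at the two covariances. As a cross-check (a Python sketch of our own, with our own variable names), we can also finish the job and recover the regression line m = Cov(X, Y)/Var(X), b = E(Y) − m E(X) for the same data:

```python
# Cross-check of the MatLab example, plus the regression line itself.
X = [1.2, 2.4, 3.0, 3.7, 4.1, 5.0]
Y = [2.3, 1.9, 4.5, 5.2, 3.2, 7.2]
N = len(X)

EX = sum(X) / N
EY = sum(Y) / N
EXY = sum(x * y for x, y in zip(X, Y)) / N
EXX = sum(x * x for x in X) / N

COVX = EXX - EX * EX        # Var(X) = Cov(X, X), about 1.4956
COVXY = EXY - EX * EY       # Cov(X, Y), about 1.7683

m = COVXY / COVX            # slope of the regression of Y on X
b = EY - m * EX             # intercept
print(COVX, COVXY, m, b)    # m is approximately 1.1824, b approximately 0.2269
```

The same pattern, with the X and Y vectors swapped out, handles each of the homework data sets below.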
75.2 The data here has the form (Larval Density, Weight) where the larval density is the number of fly larvae per unit area and the weight is the adult fly weight. Find the covariances for this data.

D = {(1, 1.356), (3, 1.356), (5, 1.284), (6, 1.252), (10, 0.989), (20, 0.664)}.

75.3 The data here has the form (Temperature, Calorie Expenditure) where temperature is the environmental temperature a sparrow is living in and calorie expenditure is the amount of energy the sparrow used at that temperature. Find the regression line for this data.

D = {(0, 24.9), (4, 23.4), (10, 24.2), (18, 18.7), (26, 15.2), (34, 13.7)}.

75.4 The data here has the form (Temperature, Developmental Time) where temperature is the environmental temperature the leaf hopper is living in and the developmental time is a measurement of the time it takes for the leaf hopper to develop at this temperature. Find the covariances for this data.

D = {(59.8, 58.1), (67.6, 27.3), (70.0, 26.8), (74.0, 19.1), (78.0, 16.5), (91.4, 14.6)}.

75.5 The data here has the form (Depth, Temperature) where depth is the depth in meters at which the water temperature in a lake is measured. Find the covariances for this data.

D = {(0, 24.8), (1, 23.2), (2, 22.2), (3, 21.2), (5, 13.8), (7, 8.2)}.