Regression and Covariance


Regression and Covariance
James K. Peterson
Department of Biological Sciences and Department of Mathematical Sciences, Clemson University
April 16, 2014

Outline: A Review of Regression; Regression and Covariance

Abstract: This lecture redoes regression in terms of covariances. We begin with a collection of data pairs {(x_i, y_i) : 1 <= i <= N}. The line we want to pick has the form y = mx + b for some choice of slope m and intercept b. The vertical distance between a given data point (x_i, y_i) and our line is the usual Euclidean distance d_i given by d_i = sqrt( (m x_i + b - y_i)^2 ). If we want to minimize the sum of all these individual errors, we get the same result by minimizing the sum of all the errors squared. Define an error function E by

E = sum_{i=1}^{N} d_i^2 = sum_{i=1}^{N} (m x_i + b - y_i)^2.

We see the error function E is really a function of the two independent variables m and b.
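As a quick illustration (my addition, not part of the original lecture), here is a minimal MatLab sketch of this error function for a trial slope and intercept; the name LineError and the sample data are hypothetical:

% LineError: sum of squared vertical distances from the trial line
% y = m*x + b to the data pairs (X(i), Y(i)).
LineError = @(m, b, X, Y) sum((m*X + b - Y).^2);

% try it on some made-up data with the trial line y = 2x
X = [1; 2; 3]; Y = [2.1; 3.9; 6.2];
E = LineError(2, 0, X, Y);

Minimizing E over all choices of m and b is exactly the optimization the derivation below solves in closed form.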

The optimal slope is

m = ( E(XY) - E(X) E(Y) ) / ( E(X^2) - E(X) E(X) ).

The optimal intercept is

b = ( E(Y) E(X^2) - E(X) E(XY) ) / ( E(X^2) - E(X) E(X) ).

An equivalent solution that is easier to find is

b = E(Y) - m E(X).

Now the terms E(X^2) - E(X) E(X) and E(XY) - E(X) E(Y) occur a lot in this kind of work. We call this kind of calculation a covariance and use the symbol Cov for it. The formal definitions are

Cov(X,X) = E(X^2) - E(X) E(X)
Cov(X,Y) = E(XY) - E(X) E(Y).

We thus know

m = ( E(XY) - E(X) E(Y) ) / ( E(X^2) - E(X) E(X) ) = Cov(X,Y) / Cov(X,X).

Thus,

Cov(X,Y) = m Cov(X,X),

which tells us the covariance of X and Y is proportional to the slope of the regression line of Y on X, with proportionality constant given by Cov(X,X).
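A minimal MatLab sketch of these closed-form formulas (my addition, not from the lecture); the helper name regress_moments is hypothetical, and it computes the moments exactly as defined above:

% regress_moments: slope and intercept from the moment formulas
% m = Cov(X,Y)/Cov(X,X) and b = E(Y) - m*E(X).
% Save as regress_moments.m.
function [m, b] = regress_moments(X, Y)
  N   = length(X);
  EX  = sum(X)/N;       % E(X)
  EY  = sum(Y)/N;       % E(Y)
  EXY = sum(X.*Y)/N;    % E(XY)
  EXX = sum(X.*X)/N;    % E(X^2)
  m   = (EXY - EX*EY)/(EXX - EX*EX);  % optimal slope
  b   = EY - m*EX;                    % optimal intercept
end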

Hence, the optimal slope and intercept are given by

m = Cov(X,Y) / Cov(X,X)
b = ( E(Y) E(X^2) - E(X) E(XY) ) / Cov(X,X).

Finally, there is one other idea that is useful here: the idea of how our data varies from the expected value E(X). We can calculate the expected squared total difference as follows:

E( (X - E(X))^2 ) = (1/N) sum_i (x_i - E(X))^2
                  = (1/N) sum_i ( x_i^2 - 2 x_i E(X) + (E(X))^2 )
                  = (1/N) sum_i x_i^2 - 2 E(X) (1/N) sum_i x_i + (E(X))^2
                  = E(X^2) - 2 (E(X))^2 + (E(X))^2
                  = E(X^2) - (E(X))^2
                  = Cov(X,X).

This calculation gives us what is called the variance of our data, and you should learn more about this tool in other courses as it is extremely useful. Alas, our needs are quite limited in this course, so we just mention it. So we have another definition. The variance is denoted by the symbol Var and defined by

Var(X) = E( (X - E(X))^2 ) = E(X^2) - (E(X))^2 = Cov(X,X).

Note that the variance Var(X) is exactly the same as the covariance Cov(X,X)! We can now see that there is an interesting connection between the covariance of X and Y and the variance of X.
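As a quick numerical check (my addition), the two expressions for the variance can be compared in MatLab on the example data used below. Note that MatLab's built-in var normalizes by N-1 by default, so var(X, 1) is used to match the 1/N convention in the derivation:

X = [1.2; 2.4; 3.0; 3.7; 4.1; 5.0];
N = length(X);
EX  = sum(X)/N;
EXX = sum(X.*X)/N;
VX1 = EXX - EX*EX;          % E(X^2) - (E(X))^2
VX2 = sum((X - EX).^2)/N;   % E( (X - E(X))^2 )
VX3 = var(X, 1);            % built-in, with 1/N normalization
% all three agree: approximately 1.4956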

Let's denote the slope m obtained by our optimization strategy by m(X, Y) so we always remember it is a function of our data pairs. A summary of our work is in order. We have found that the optimal slope and y intercept for our data are given by m and b, respectively, where

m(X, Y) = Cov(X,Y) / Cov(X,X)
b(X, Y) = ( E(Y) E(X^2) - E(X) E(XY) ) / Cov(X,X) = E(Y) - m(X, Y) E(X).

We call the optimal slope m(X, Y) the slope of the regression of Y on X. This is something we can measure, so it is an estimate of how the variable y changes with respect to x. We now know a fair bit of calculus, so we can think of m(X, Y) as an estimate of either dy/dx or Δy/Δx, which is a really useful idea. Then, we notice that

Cov(X, Y) = Var(X) ( Cov(X, Y) / Var(X) ) = Var(X) m(X, Y).

Thus, the covariance of x and y is proportional to the slope of the regression line of Y on X, with proportionality constant given by the variance Var(X), which, of course, is the same as the covariance of X with itself, Cov(X, X).

Example: Let's find Cov(X,X) = Var(X) and Cov(X,Y) for the data

D = {(1.2, 2.3), (2.4, 1.9), (3.0, 4.5), (3.7, 5.2), (4.1, 3.2), (5.0, 7.2)}.

Solution: We do this in MatLab.

% set up the data as X and Y vectors
X = [1.2; 2.4; 3.0; 3.7; 4.1; 5.0];
Y = [2.3; 1.9; 4.5; 5.2; 3.2; 7.2];
% get length of data
N = length(X);
% Find E(X), called EX here
EX = sum(X)/N;

Solution (continued):

% Find E(Y), called EY here
EY = sum(Y)/N;
% find E(XY), called EXY here
EXY = sum(X.*Y)/N;
% find E(X^2), called EXX here
EXX = sum(X.*X)/N;
% find Cov(X,X), here COVX
COVX = EXX - EX*EX;    % here COVX = 1.4956
% Find Cov(X,Y), here COVXY
COVXY = EXY - EX*EY;   % here COVXY = 1.7683

Homework 75

Again, these problems are taken from ones you can find in R. Sokal and F. J. Rohlf, Introduction to Biostatistics, published by Dover, in the chapter on regression. Your results need to be placed in a Word doc in the usual way, nicely commented with embedded plots. For these problems, calculate Cov(X,X) and Cov(X,Y) in MatLab.

75.1 The data here has the form (Time, Temperature), where the time is the amount of time that has elapsed since a rabbit was inoculated with a virus and the temperature is the rabbit's temperature at that time. Find the covariances for this data.

D = {(24, 102.8), (32, 104.5), (48, 106.5), (56.0, 107.0), (72.0, 103.9), (80.0, 103.2)}.
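A natural follow-up to the example (my addition, not in the original solution) is to finish the regression line from these covariances and plot it against the data; the plotting lines are an optional illustration:

% slope and intercept of the regression line of Y on X
m = COVXY/COVX;    % approximately 1.1824
b = EY - m*EX;     % approximately 0.2269
% plot the data points and the fitted line
plot(X, Y, 'o', X, m*X + b, '-');
xlabel('x'); ylabel('y');

The same few lines, run after computing COVX and COVXY for any of the homework data sets below, produce the corresponding regression line.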

75.2 The data here has the form (Larval Density, Weight), where the larval density is the number of fly larvae per unit area and the weight is the adult fly weight. Find the covariances for this data.

D = {(1, 1.356), (3, 1.356), (5, 1.284), (6, 1.252), (10, 0.989), (20, 0.664)}.

75.3 The data here has the form (Temperature, Calorie Expenditure), where temperature is the environmental temperature a sparrow is living in and calorie expenditure is the amount of energy the sparrow used at that temperature. Find the regression line for this data.

D = {(0, 24.9), (4, 23.4), (10, 24.2), (18, 18.7), (26, 15.2), (34, 13.7)}.

75.4 The data here has the form (Temperature, Developmental Time), where temperature is the environmental temperature the leaf hopper is living in and the developmental time is a measurement of the time it takes for the leaf hopper to develop at this temperature. Find the covariances for this data.

D = {(59.8, 58.1), (67.6, 27.3), (70.0, 26.8), (74.0, 19.1), (78.0, 16.5), (91.4, 14.6)}.

75.5 The data here has the form (Depth, Temperature), where depth is the depth in meters at which the water temperature in a lake is measured. Find the covariances for this data.

D = {(0, 24.8), (1, 23.2), (2, 22.2), (3, 21.2), (5, 13.8), (7, 8.2)}.