Correlation & Regression

Correlation & Regression

While correlation methods measure the strength of a linear relationship between two variables, we might wish to go a little further: How much does one variable change for a given change in another variable? How accurately can the value of one variable be predicted from knowledge of the other? Regression analysis refers to the process of studying the causal relationship between a dependent variable and a set of independent explanatory variables.

Two Sorts of Bivariate Relationships

Generally, we can classify the nature of the relationship between a pair of variables into two types: a bivariate relationship can be deterministic, where knowledge of one of the variables entails perfect knowledge of the other, OR a bivariate relationship can be probabilistic, where knowledge of one of the variables allows you to estimate the value of the other variable, but not with absolute accuracy and/or certainty.

A Deterministic Relationship

Suppose we are traveling from one place to another on the Interstate, and we travel at a constant speed. There is a deterministic relationship between the time spent driving and the distance traveled that we can express graphically, or using an equation:

s = s₀ + vt

where s is the distance traveled, s₀ is the initial distance (the intercept), v is the speed (the slope), and t is the time traveled. Unfortunately, few relationships are truly deterministic.

A Probabilistic Relationship

More often, we find relationships between two variables that have a probabilistic nature. For example, suppose we compare the ages and heights of a sample of young people between 2 and 20 years old, plotting height (meters) against age (years). Here, we cannot predict height from age as we could distance from time in the previous example. There is a relationship here, but there is an element of unpredictability, or error, contained in this model.

Sampling and Regression

When we are comparing a pair of variables using a sampled data set, we expect to find a relationship that is less than perfect (i.e. probabilistic and not deterministic) because: we expect that in the process of collecting the data there will be some measurement error, which is another source of variation; and we might find that there are other factors exerting some control over the relationship (which of course are not accounted for in our simple bivariate model).

Simple vs. Multiple Regression

Today, we are going to examine simple linear regression, where we estimate the values of a dependent variable (y) using the values of an independent variable (x). This concept can be extended to multiple linear regression, where more explanatory independent variables (x₁, x₂, x₃, …, xₙ) are used to develop estimates of the dependent variable's values. For purposes of clarity, we will first look at the simple case, so we can more easily grasp the mathematics involved.

Simple Linear Regression

Simple linear regression models the relationship between an independent variable (x) and a dependent variable (y) using an equation that expresses y as a linear function of x, plus an error term:

y = a + bx + e

where x is the independent variable, y is the dependent variable, b is the slope of the fitted line, a is the intercept of the fitted line, and e is the error term.
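To make the model concrete, here is a minimal sketch in Python (not from the original slides; the parameter values and sample size are arbitrary choices for illustration) that generates data from exactly this kind of model:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

a, b = 0.6, -0.5                     # illustrative intercept and slope
x = rng.uniform(0.0, 1.0, size=50)   # independent variable
e = rng.normal(0.0, 0.05, size=50)   # error term: random scatter, mean zero
y = a + b * x + e                    # dependent variable = linear signal + error
```

Because of the error term e, a scatterplot of y against x shows points scattered around the line rather than lying exactly on it.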

Fitting a Line to a Set of Points

When we have a data set consisting of an independent and a dependent variable, and we plot these using a scatterplot, then to construct our model of the relationship between the variables, we need to select a line that represents the relationship. We can choose a line that fits best using a least squares method. The least squares line is the line that minimizes the vertical distances between the points and the line, i.e. it minimizes the error term ε when it is considered over all points in the data set.

Sampling and Regression II

We usually operate using sampled data, and while we are building a model of the form

y = a + bx + e

from our sample, in doing so we are attempting to estimate a true regression line, describing the relationship between the independent variable (x) and dependent variable (y) for the entire population:

y = α + βx + ε

Multiple samples would yield several similar regression lines, which should approximate the population regression line.

Least Squares Method

The least squares method operates mathematically, minimizing the error term e over all points. We can describe the line of best fit we will find using the equation ŷ = a + bx; you'll recall from a previous slide that the formula for our linear model was expressed using y = a + bx + e. We use the value ŷ on the line to estimate the true value, y. The difference between the two is (y - ŷ) = e. This difference is positive for points above the line, and negative for points below it.

Estimates and Residuals

Our simple linear regression models take the form:

y = a + bx + e

which can alternatively be expressed as:

ŷ = a + bx

where ŷ is the estimate of y produced by the regression. We can rearrange these equations to show e = y - ŷ. The errors in the estimation of y using the regression equation are known as residuals, and they express, for any given value in the data set, the extent to which the regression line is either underestimating or overestimating the true value of y.

Minimizing the Error Term

In a linear model, the error in estimating the true value of the dependent variable y is expressed by the difference between the true value and the estimated value ŷ, e = (y - ŷ) (i.e. the residuals). Sometimes this difference will be positive (when the line underestimates the value of y) and sometimes it will be negative (when the line overestimates the value of y), because there will be points above and below the line. If we were to simply sum these error terms, the positive and negative values would cancel out. Instead, we can square the differences and then sum them up to create a useful estimate of the overall error.

Error Sum of Squares

By squaring the differences between y and ŷ, and summing these values for all points in the data set, we calculate the error sum of squares (usually denoted by SSE):

SSE = Σ (y - ŷ)²

The least squares method of selecting a line of best fit works by finding the parameters of a line (intercept a and slope b) that minimize the error sum of squares, i.e. it is known as the least squares method because it finds the line that makes the SSE as small as it can possibly be, minimizing the vertical distances between the line and the points.
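As a small illustration (a sketch, not part of the original slides), the SSE of any candidate line can be computed directly from its intercept and slope:

```python
import numpy as np

def sse(a, b, x, y):
    """Error sum of squares for the candidate line y-hat = a + b*x."""
    y_hat = a + b * x                  # estimates of y from the line
    return np.sum((y - y_hat) ** 2)    # squared residuals, summed
```

The least squares line is simply the (a, b) pair that makes this quantity as small as possible.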

Minimizing the SSE

We need the values of a and b that minimize the error sum of squares:

min(a,b) Σ (y - ŷ)² = min(a,b) Σ (yᵢ - a - bxᵢ)²

Solving this problem requires calculus: take the derivative of the expression with respect to a and b, set each to 0, and solve for the 2 unknowns. It is graphically equivalent to finding the minimum of a 3-dimensional paraboloid (a bowl-shaped surface).
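Writing out that calculus step explicitly (added here for completeness; this is the standard derivation), setting the two partial derivatives to zero gives the so-called normal equations:

```latex
\frac{\partial}{\partial a}\sum_i (y_i - a - b x_i)^2 = -2\sum_i (y_i - a - b x_i) = 0
\qquad
\frac{\partial}{\partial b}\sum_i (y_i - a - b x_i)^2 = -2\sum_i x_i\,(y_i - a - b x_i) = 0
```

Solving this pair of equations for the two unknowns yields exactly the slope and intercept formulas on the next slide.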

Finding Regression Coefficients

The equations used to find the values for the slope (b) and intercept (a) of the line of best fit using the least squares method are:

b = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²        a = ȳ - bx̄

where xᵢ is the i-th independent variable value, yᵢ is the i-th dependent variable value, x̄ is the mean of all the xᵢ values, and ȳ is the mean of all the yᵢ values.
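These two formulas translate directly into code; the following sketch (illustrative, assuming x and y are NumPy arrays of equal length) computes both coefficients:

```python
import numpy as np

def least_squares(x, y):
    """Slope b and intercept a of the least squares line."""
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar    # intercept follows from the slope and the two means
    return a, b
```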

Interpreting Slope (b)

The slope of the line (b) gives the change in y (the dependent variable) due to a unit change in x (the independent variable). When b > 0 there is a positive relationship: as the values of x increase, the values of y increase too. When b < 0 there is a negative (a.k.a. inverse) relationship: as the values of x increase, the values of y decrease.

Regression Slope and Correlation

The interpretation of the sign of the slope parameter and of the correlation coefficient is identical, and this is no coincidence: the numerator of the slope expression is identical to that of the correlation coefficient,

r = Σ (xᵢ - x̄)(yᵢ - ȳ) / [(n - 1) sX sY]

The regression slope can be expressed in terms of the correlation coefficient:

b = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)² = r (sY / sX)
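The equivalence takes one line to verify once we note that Σ(xᵢ - x̄)² = (n - 1)sX²; spelled out (a brief derivation added here):

```latex
b = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
  = \frac{(n-1)\, r\, s_X s_Y}{(n-1)\, s_X^2}
  = r\,\frac{s_Y}{s_X}
```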

Coefficient of Determination (r²)

For example, suppose we have two datasets, (a) and (b), and we fit a regression line to each using the least squares method. While the same approach (the least squares method) has been used to select the line of best fit for both data sets, the relationship between x and y is clearly stronger in (a) than in (b), because the points are closer to the line. We have a numerical measure to express the strength of the relationship: the coefficient of determination (r²).

Coefficient of Determination (r²)

If we use ȳ to estimate y, the error is (y - ȳ). If we use ŷ to estimate y, the error is (y - ŷ). Thus, (ŷ - ȳ) is the improvement in our model. To account for the total improvement for the model, we can calculate this distance and sum it for all points in the data set, first taking the square of the difference (ŷ - ȳ).

Coefficient of Determination (r²)

The regression sum of squares (SSR) expresses the improvement made in estimating y by using the regression line:

SSR = Σ (ŷᵢ - ȳ)²

The total sum of squares (SST) expresses the overall variation between the values of y and their mean ȳ:

SST = Σ (yᵢ - ȳ)²

The coefficient of determination (r²) expresses the amount of variation in y explained by the regression line (the strength of the relationship):

r² = SSR / SST
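Putting the three quantities together in code (a sketch, assuming the least squares fit has already produced a and b):

```python
import numpy as np

def r_squared(a, b, x, y):
    """Coefficient of determination r^2 = SSR / SST."""
    y_hat = a + b * x
    y_bar = y.mean()
    ssr = np.sum((y_hat - y_bar) ** 2)   # variation explained by the line
    sst = np.sum((y - y_bar) ** 2)       # total variation in y about its mean
    return ssr / sst
```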

Partitioning the Total Sum of Squares

We can also think of regression as a way to partition the variation in the values of the dependent variable y. We can take the total variation and divide it into two components: the component explained by the regression line, and the component that remains unexplained. We can characterize the total variability in y using the sum of the squared deviations of the yᵢ values from their mean. The total variability is expressed by the total sum of squares:

SST = Σ (yᵢ - ȳ)²

Partitioning the Total Sum of Squares

We can decompose the total sum of squares into those two components:

SST = Σ (yᵢ - ȳ)² = Σ (ŷᵢ - ȳ)² + Σ (yᵢ - ŷᵢ)²

In other words, SST = SSR + SSE, and the coefficient of determination expresses the portion of the total variation in y explained by the regression line.
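Why the decomposition is exact deserves one extra step (added here; it is not spelled out on the slides): expanding the square produces a cross term that vanishes, because by the normal equations the residuals are uncorrelated with the fitted values:

```latex
\sum_i (y_i-\bar{y})^2
  = \sum_i \left[(y_i-\hat{y}_i) + (\hat{y}_i-\bar{y})\right]^2
  = \underbrace{\sum_i (y_i-\hat{y}_i)^2}_{SSE}
  + \underbrace{\sum_i (\hat{y}_i-\bar{y})^2}_{SSR}
  + 2\underbrace{\sum_i (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})}_{=\,0}
```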

Regression ANOVA Table

We can create an analysis of variance table that allows us to display the sums of squares, their degrees of freedom, mean square values (for the regression and error sums of squares), and an F-statistic:

Component          Sum of Squares   df      Mean Square     F
Regression (SSR)   Σ (ŷᵢ - ȳ)²      1       SSR / 1         MSSR / MSSE
Error (SSE)        Σ (yᵢ - ŷᵢ)²     n - 2   SSE / (n - 2)
Total (SST)        Σ (yᵢ - ȳ)²      n - 1

Regression Example

We can use the data set we used to illustrate covariance and correlation: a set of 10 values of TVDI for remotely sensed pixels containing the Glyndon catchment in Baltimore County, with accompanying soil moisture measurements taken in the catchment on matching dates. [Scatterplot: Glyndon field-sampled volumetric soil moisture versus TVDI from a 3x3 kernel.]

TVDI     Soil Moisture
0.274    0.414
0.542    0.359
0.419    0.396
0.286    0.458
0.374    0.350
0.489    0.357
0.623    0.255
0.506    0.189
0.768    0.171
0.725    0.119

Regression Example

To find the optimal values for the slope (b) and the intercept (a), we must first calculate the mean values of the independent variable (TVDI) and the dependent variable (soil moisture): mean TVDI = 0.501, mean soil moisture = 0.307. We can now use these values to calculate the optimal slope according to the formula:

b = Σ (xᵢ - x̄)(yᵢ - ȳ) / Σ (xᵢ - x̄)²

Regression Example

TVDI (x)   Soil Moisture (y)   (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)   (x - x̄)²
0.274      0.414               -0.227     0.107     -0.02431          0.05137
0.542      0.359                0.042     0.052      0.00216          0.00173
0.419      0.396               -0.082     0.090     -0.00732          0.00668
0.286      0.458               -0.215     0.151     -0.03242          0.04618
0.374      0.350               -0.127     0.044     -0.00555          0.01609
0.489      0.357               -0.011     0.050     -0.00057          0.00013
0.623      0.255                0.122    -0.052     -0.00637          0.01499
0.506      0.189                0.005    -0.118     -0.00062          0.00003
0.768      0.171                0.267    -0.136     -0.03628          0.07147
0.725      0.119                0.225    -0.188     -0.04229          0.05057

Means: x̄ = 0.501, ȳ = 0.307. Sums: Σ(x - x̄)(y - ȳ) = -0.15357 and Σ(x - x̄)² = 0.25924, giving the slope b = -0.15357 / 0.25924 = -0.59239.

We can now substitute the slope value into the intercept equation to calculate the intercept:

a = ȳ - bx̄ = 0.307 - (-0.592 × 0.501) = 0.603
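The same arithmetic can be reproduced in a few lines of Python (a sketch added to check the hand calculation):

```python
import numpy as np

tvdi = np.array([0.274, 0.542, 0.419, 0.286, 0.374,
                 0.489, 0.623, 0.506, 0.768, 0.725])
moist = np.array([0.414, 0.359, 0.396, 0.458, 0.350,
                  0.357, 0.255, 0.189, 0.171, 0.119])

# Slope and intercept from the least squares formulas
b = np.sum((tvdi - tvdi.mean()) * (moist - moist.mean())) \
    / np.sum((tvdi - tvdi.mean()) ** 2)
a = moist.mean() - b * tvdi.mean()
print(b, a)   # approximately -0.59239 and 0.603, matching the table
```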

Regression Example

We can now use our regression equation ŷ = 0.603 - 0.592x to calculate estimates for each of the values of x in the dataset, and then proceed to calculate the SSR, SSE & SST:

TVDI (x)   Soil Moisture (y)   Estimate (ŷ)   (ŷ - ȳ)   (ŷ - ȳ)²   (y - ŷ)    (y - ŷ)²   (y - ȳ)   (y - ȳ)²
0.274      0.414               0.441           0.134     0.01803    -0.02703   0.00073     0.107    0.01150
0.542      0.359               0.282          -0.025     0.00061     0.07664   0.00587     0.052    0.00271
0.419      0.396               0.355           0.048     0.00234     0.04116   0.00169     0.090    0.00803
0.286      0.458               0.434           0.127     0.01621     0.02358   0.00056     0.151    0.02277
0.374      0.350               0.382           0.075     0.00565    -0.03138   0.00098     0.044    0.00192
0.489      0.357               0.313           0.007     0.00004     0.04329   0.00187     0.050    0.00250
0.623      0.255               0.234          -0.073     0.00526     0.02047   0.00042    -0.052    0.00271
0.506      0.189               0.304          -0.003     0.00001    -0.11453   0.01312    -0.118    0.01384
0.768      0.171               0.148          -0.158     0.02508     0.02265   0.00051    -0.136    0.01842
0.725      0.119               0.173          -0.133     0.01775    -0.05484   0.00301    -0.188    0.03536

With slope = -0.592, intercept = 0.603, x̄ = 0.501 and ȳ = 0.307, the column sums give SSR = 0.09097, SSE = 0.02877, and SST = 0.11974; as expected, SSR + SSE = 0.11974 = SST.
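Continuing the check in Python (again a sketch; the data arrays are repeated so the snippet runs on its own):

```python
import numpy as np

tvdi = np.array([0.274, 0.542, 0.419, 0.286, 0.374,
                 0.489, 0.623, 0.506, 0.768, 0.725])
moist = np.array([0.414, 0.359, 0.396, 0.458, 0.350,
                  0.357, 0.255, 0.189, 0.171, 0.119])

y_hat = 0.603 - 0.592 * tvdi                 # regression estimates
ssr = np.sum((y_hat - moist.mean()) ** 2)    # approx. 0.09097
sse = np.sum((moist - y_hat) ** 2)           # approx. 0.02877
sst = np.sum((moist - moist.mean()) ** 2)    # approx. 0.11974 = SSR + SSE
print(ssr, sse, sst, ssr / sst)              # r^2 approx. 0.76
```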

Regression Example

Now that we have all the necessary values, we can fill in the ANOVA table:

Component          Sum of Squares   Degrees of Freedom   Mean Square   F-Test
Regression (SSR)   0.09097          1                    0.09097       25.296
Error (SSE)        0.02877          8                    0.0035962
Total (SST)        0.11974          9

We can also calculate the coefficient of determination: r² = SSR / SST = 0.09097 / 0.11974 = 0.76

A Significance Test for r²

We can test to see if the regression line has been successful in explaining a significant portion of the variation in y by performing an F-test. This operates in a similar fashion to how we used the F-test in ANOVA, this time testing the null hypothesis that the true coefficient of determination of the population, ρ², equals 0, using an F-test formulated as:

F_test = r²(n - 2) / (1 - r²) = MSSR / MSSE

which has an F-distribution with degrees of freedom df = (1, n - 2).

Hypothesis Testing - Significance of r² (F-test Example)

Research question: is the regression line explaining a significant proportion of the variation in y (soil moisture)?

1. H₀: ρ² = 0 (explanation of variation not significant)
2. Hₐ: ρ² ≠ 0 (significant variation explained)
3. Select α = 0.05, one-tailed, because we are using an F-test
4. To compute the F-test statistic, we first need to calculate either the coefficient of determination or the mean sums of squares for both the regression and error terms (in this case we have already done both):

F_test = 0.76 × 8 / (1 - 0.76) = 0.09097 / 0.0035962 = 25.296

Hypothesis Testing - Significance of r² (F-test Example)

5. We now need to find the critical F-value, first calculating the degrees of freedom: df = (1, n - 2) = (1, 10 - 2) = (1, 8). We can now look up the F_crit value for our α (0.05 in one tail) and df = (1, 8): F_crit = 5.32
6. F_test > F_crit, therefore we reject H₀ and accept Hₐ, finding that the regression explains a significant portion of the variation in y (i.e. the population coefficient of determination ρ², which we have estimated using the sample coefficient of determination r², is not equal to 0)
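The table lookup and the decision can also be reproduced with SciPy (a sketch; scipy.stats.f is SciPy's F-distribution):

```python
from scipy import stats

r2, n = 0.76, 10
f_test = r2 * (n - 2) / (1 - r2)                # approx. 25.3, as computed by hand
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)    # approx. 5.32, the tabled value
p_value = stats.f.sf(f_test, dfn=1, dfd=n - 2)  # upper-tail probability
print(f_test > f_crit, p_value)                 # True; p is well below 0.05
```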