Correlation and Regression
While correlation methods measure the strength of a linear relationship between two variables, we might wish to go a little further: How much does one variable change for a given change in another variable? How accurately can the value of one variable be predicted from knowledge of the other? Regression analysis refers to the process of studying the causal relationship between a dependent variable and a set of independent explanatory variables.
Two Sorts of Bivariate Relationships
Generally, we can classify the nature of the relationship between a pair of variables into two types:
- A bivariate relationship can be deterministic, where knowledge of one of the variables entails a perfect knowledge of the other, OR
- A bivariate relationship can be probabilistic, where knowledge of one of the variables can allow you to estimate the value of the other variable, but not with absolute accuracy and/or certainty
A Deterministic Relationship
Suppose we are traveling from one place to another on the Interstate, and we travel at a constant speed. There is a deterministic relationship between the time spent driving and the distance traveled that we can express graphically, or using an equation:
s = s0 + vt
where s is the distance traveled, s0 is the initial distance (the intercept), v is the speed (the slope), and t is the time traveled.
[Figure: distance (s) plotted against time (t), a straight line with intercept s0 and slope v]
Unfortunately, few relationships are truly deterministic.
A Probabilistic Relationship
More often, we find relationships between two variables that have a probabilistic nature. For example, suppose we compare the ages and heights of a sample of young people between 2 and 20 years old:
[Figure: scatterplot of height (meters) against age (years)]
Here, we cannot predict height from age as we could distance from time in the previous example. There is a relationship here, but there is an element of unpredictability or error contained in this model.
Sampling and Regression
When we are comparing a pair of variables using a sampled data set, we expect to find a relationship that is less than perfect (i.e. probabilistic and not deterministic) because:
- We expect that in the process of collecting the data there will be some measurement errors, which are another source of variation
- We might find that there are other factors exerting some control over the relationship (which of course are not accounted for in our simple bivariate model)
Simple vs. Multiple Regression
Today, we are going to examine simple linear regression, where we estimate the values of a dependent variable (y) using the values of an independent variable (x). This concept can be extended to multiple linear regression, where more explanatory independent variables (x1, x2, x3, ..., xn) are used to develop estimates of the dependent variable's values. For purposes of clarity, we will first look at the simple case, so we can more easily grasp the mathematics involved.
Simple Linear Regression
Simple linear regression models the relationship between an independent variable (x) and a dependent variable (y) using an equation that expresses y as a linear function of x, plus an error term:
y = a + bx + e
where x is the independent variable, y is the dependent variable, b is the slope of the fitted line, a is the intercept of the fitted line, and e is the error term.
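The probabilistic model above can be illustrated with a short simulation. This is a sketch in Python; the intercept, slope, and error spread below are invented purely for illustration:

```python
import random

# Hypothetical "true" line y = a + b*x, with invented coefficients
a_true, b_true = 2.0, 0.5
random.seed(1)  # reproducible draws

x_vals = [float(i) for i in range(20)]
# each observation deviates from the line by a random error term e
y_vals = [a_true + b_true * x + random.gauss(0, 1.0) for x in x_vals]

# the residual e = y - (a + b*x) recovers the error for each point
residuals = [y - (a_true + b_true * x) for x, y in zip(x_vals, y_vals)]
```

Because of the error term, the simulated points scatter around the line rather than lying exactly on it, just like the height-versus-age example.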
Fitting a Line to a Set of Points
When we have a data set consisting of an independent and a dependent variable, and we plot these using a scatterplot, to construct our model of the relationship between the variables we need to select a line that represents the relationship. We can choose a line that fits best using a least squares method. The least squares line is the line that minimizes the vertical distances between the points and the line, i.e. it minimizes the error term e when it is considered for all points in the data set.
Sampling and Regression II
We usually operate using sampled data, and while we are building a model of the form y = a + bx + e from our sample, in doing so we are attempting to estimate a true regression line, describing the relationship between the independent variable (x) and dependent variable (y) for the entire population:
y = α + βx + ε
Multiple samples would yield several similar regression lines, which should approximate the population regression line.
Least Squares Method
The least squares method operates mathematically, minimizing the error term e over all points. We can describe the line of best fit we will find using the equation ŷ = a + bx, and you'll recall from a previous slide that the formula for our linear model was expressed as y = a + bx + e. We use the value ŷ on the line to estimate the true value y; the difference between the two is (y - ŷ) = e. This difference is positive for points above the line, and negative for points below it.
Estimates and Residuals
Our simple linear regression models take the form y = a + bx + e, which can alternatively be expressed as ŷ = a + bx, where ŷ is the estimate of y produced by the regression. We can rearrange these equations to show: e = y - ŷ. The errors in the estimation of y using the regression equation are known as residuals, and express, for any given value in the data set, to what extent the regression line is either underestimating or overestimating the true value of y.
Minimizing the Error Term
In a linear model, the error in estimating the true value of the dependent variable y is expressed by the difference between the true value and the estimated value ŷ: e = (y - ŷ) (i.e. the residuals). Sometimes this difference will be positive (when the line underestimates the value of y) and sometimes it will be negative (when the line overestimates the value of y), because there will be points above and below the line. If we were to simply sum these error terms, the positive and negative values would cancel out. Instead, we can square the differences and then sum them up to create a useful estimate of the overall error.
Error Sum of Squares
By squaring the differences between y and ŷ, and summing these values for all points in the data set, we calculate the error sum of squares (usually denoted by SSE):
SSE = Σ (yi - ŷi)²
The least squares method of selecting a line of best fit functions by finding the parameters of the line (intercept a and slope b) that minimize the error sum of squares; it is known as the least squares method because it finds the line that makes the SSE as small as it can possibly be, minimizing the vertical distances between the line and the points.
Minimizing the SSE
We need to find the values of a and b that minimize the error sum of squares:
min(a,b) Σ (yi - ŷi)² = min(a,b) Σ (yi - a - bxi)²
Solving this problem requires calculus: take the partial derivative of the expression with respect to each of a and b, set each to 0, and solve for the 2 unknowns. It is graphically equivalent to finding the minimum of a 3-dimensional parabolic surface (a paraboloid).
Finding Regression Coefficients
The equations used to find the values for the slope (b) and intercept (a) of the line of best fit using the least squares method are:
b = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²
a = ȳ - b x̄
where xi is the i-th independent variable value, yi is the i-th dependent variable value, x̄ is the mean value of all the xi values, and ȳ is the mean value of all the yi values.
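These two formulas translate directly into code. The sketch below (plain Python; the function name is ours, not from the slides) computes b and a from lists of points, and recovers the exact coefficients when the points lie perfectly on a line:

```python
# Closed-form least-squares coefficients from the slope/intercept formulas
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # numerator and denominator of the slope formula
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = ybar - b * xbar      # intercept
    return a, b

# points lying exactly on y = 1 + 2x recover a = 1, b = 2
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For a deterministic relationship the fit is exact (SSE = 0); for probabilistic data the same formulas give the line with the smallest possible SSE.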
Interpreting Slope (b)
The slope of the line (b) gives the change in y (the dependent variable) due to a unit change in x (the independent variable):
- b > 0: positive relationship; as the values of x increase, the values of y increase too
- b < 0: negative (a.k.a. inverse) relationship; as the values of x increase, the values of y decrease
Regression Slope and Correlation
The interpretation of the sign of the slope parameter and the correlation coefficient is identical, and this is no coincidence: the numerator of the slope expression is identical to that of the correlation coefficient:
r = Σ (xi - x̄)(yi - ȳ) / [(n - 1) sx sy]
The regression slope can be expressed in terms of the correlation coefficient:
b = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²  =  r (sy / sx)
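The identity b = r(sy/sx) can be checked numerically on the soil-moisture data used later in these slides. A sketch using Python's statistics module (stdev gives the sample standard deviation with the n - 1 divisor used in the formula for r):

```python
import statistics as st

tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
moisture = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]
n = len(tvdi)
xbar, ybar = st.mean(tvdi), st.mean(moisture)

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(tvdi, moisture))
r = sxy / ((n - 1) * st.stdev(tvdi) * st.stdev(moisture))    # correlation
b_direct = sxy / sum((x - xbar) ** 2 for x in tvdi)          # slope formula
b_from_r = r * st.stdev(moisture) / st.stdev(tvdi)           # slope via r

# both routes give the same slope (about -0.592 for this data)
```

Note that r is negative here, and so is b: the sign of the slope always matches the sign of the correlation coefficient.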
Coefficient of Determination (r²)
For example, suppose we have two datasets, and we fit a regression line to each using the least squares method:
[Figure: two scatterplots, (a) and (b), each with a fitted line; the points in (a) lie much closer to the line]
While the same approach (the least squares method) has been used to select the line of best fit for both data sets, the relationship between x and y is clearly stronger in (a) than in (b), because the points are closer to the line. We have a numerical measure to express the strength of the relationship: the coefficient of determination (r²).
Coefficient of Determination (r²)
If we use the mean ȳ to estimate y, the error is (y - ȳ). If we use ŷ to estimate y, the error is (y - ŷ). Thus, (ŷ - ȳ) is the improvement in our model. To account for the total improvement for the model, we can calculate this distance and sum it for all points in the data set, first taking the square of the difference (ŷ - ȳ).
Coefficient of Determination (r²)
The regression sum of squares (SSR) expresses the improvement made in estimating y by using the regression line:
SSR = Σ (ŷi - ȳ)²
The total sum of squares (SST) expresses the overall variation between the values of y and their mean ȳ:
SST = Σ (yi - ȳ)²
The coefficient of determination (r²) expresses the amount of variation in y explained by the regression line (the strength of the relationship):
r² = SSR / SST
Partitioning the Total Sum of Squares
We can also think of regression as a way to partition the variation in the values of the dependent variable y. We can take the total variation and divide it into two components: the component explained by the regression line, and the component that remains unexplained. We can characterize the total variability in y using the sum of the squared deviations of the yi values from their mean. The total variability is expressed by the total sum of squares:
SST = Σ (yi - ȳ)²
Partitioning the Total Sum of Squares
We can decompose the total sum of squares into those two components:
SST = Σ (yi - ȳ)² = Σ (ŷi - ȳ)² + Σ (yi - ŷi)²
In other words: SST = SSR + SSE, and the coefficient of determination expresses the portion of the total variation in y explained by the regression line.
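The partition SST = SSR + SSE can be verified numerically. A sketch in Python, fitting the least-squares line to some illustrative points (the data values below are invented) and checking that the two components sum to the total:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 4.4, 5.1]   # invented, roughly linear values
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# least-squares slope and intercept
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained variation
sst = sum((yi - ybar) ** 2 for yi in y)               # total variation

# the partition holds: SSR + SSE equals SST (up to floating-point error)
```

The identity holds exactly for a least-squares fit; it is a consequence of the line passing through (x̄, ȳ) with residuals that are uncorrelated with x.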
Regression ANOVA Table
We can create an analysis of variance table that allows us to display the sums of squares, their degrees of freedom, mean square values (for the regression and error sums of squares), and an F-statistic:

Component          Sum of Squares   df      Mean Square     F
Regression (SSR)   Σ (ŷi - ȳ)²      1       SSR / 1         MSSR / MSSE
Error (SSE)        Σ (yi - ŷi)²     n - 2   SSE / (n - 2)
Total (SST)        Σ (yi - ȳ)²      n - 1
Regression Example
We can use the data set we used to illustrate covariance and correlation: a set of 10 values of TVDI for remotely sensed pixels containing the Glyndon catchment in Baltimore County, and accompanying soil moisture measurements taken in the catchment on matching dates:

[Figure: Glyndon field-sampled volumetric soil moisture versus TVDI from a 3x3 kernel]

TVDI     Soil Moisture
0.274    0.414
0.542    0.359
0.419    0.396
0.286    0.458
0.374    0.350
0.489    0.357
0.623    0.255
0.506    0.189
0.768    0.171
0.725    0.119
Regression Example
To find the optimal values for the slope (b) and the intercept (a), we must first calculate the mean values of the independent variable (TVDI) and the dependent variable (soil moisture):
Mean TVDI = 0.501, mean soil moisture = 0.307
We can now use these values to calculate the optimal slope according to the formula:
b = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²
Regressio Example TVDI (x) Soil Moisture (y) (x - xbar) (y - ybar) (x - xbar) * (y - ybar) (x - xbar)^2 0.274 0.414-0.227 0.107-0.02431 0.05137 0.542 0.359 0.042 0.052 0.00216 0.00173 0.419 0.396-0.082 0.090-0.00732 0.00668 0.286 0.458-0.215 0.151-0.03242 0.04618 0.374 0.350-0.127 0.044-0.00555 0.01609 0.489 0.357-0.011 0.050-0.00057 0.00013 0.623 0.255 0.122-0.052-0.00637 0.01499 0.506 0.189 0.005-0.118-0.00062 0.00003 0.768 0.171 0.267-0.136-0.03628 0.07147 0.725 0.119 0.225-0.188-0.04229 0.05057 Mea 0.501 0.307 Sum -0.15357 0.25924 Slope -0.59239 We ca ow substitute the slope value ito the itercept equatio to calculate the itercept: a = y - bx a = 0.307 - (-0.592 * 0.501) = 0.603
Regression Example
We can now use our regression equation ŷ = 0.603 - 0.592x to calculate estimates for each of the values of x in the dataset, and then proceed to calculate the SSR, SSE & SST:

TVDI (x)   Soil Moisture (y)   Estimate (ŷ)   (ŷ - ȳ)   (ŷ - ȳ)²   (y - ŷ)    (y - ŷ)²   (y - ȳ)   (y - ȳ)²
0.274      0.414               0.441           0.134     0.01803    -0.02703   0.00073     0.107     0.01150
0.542      0.359               0.282          -0.025     0.00061     0.07664   0.00587     0.052     0.00271
0.419      0.396               0.355           0.048     0.00234     0.04116   0.00169     0.090     0.00803
0.286      0.458               0.434           0.127     0.01621     0.02358   0.00056     0.151     0.02277
0.374      0.350               0.382           0.075     0.00565    -0.03138   0.00098     0.044     0.00192
0.489      0.357               0.313           0.007     0.00004     0.04329   0.00187     0.050     0.00250
0.623      0.255               0.234          -0.073     0.00526     0.02047   0.00042    -0.052     0.00271
0.506      0.189               0.304          -0.003     0.00001    -0.11453   0.01312    -0.118     0.01384
0.768      0.171               0.148          -0.158     0.02508     0.02265   0.00051    -0.136     0.01842
0.725      0.119               0.173          -0.133     0.01775    -0.05484   0.00301    -0.188     0.03536

Mean: x̄ = 0.501, ȳ = 0.307; Slope = -0.592; Intercept = 0.603
Sums: SSR = 0.09097, SSE = 0.02877, SST = 0.11974, SSR + SSE = 0.11974
Regression Example
Now that we have all the necessary values, we can fill in the ANOVA table:

Component          Sum of Squares   Degrees of Freedom   Mean Square   F-Test
Regression (SSR)   0.09097          1                    0.09097       25.296
Error (SSE)        0.02877          8                    0.0035962
Total (SST)        0.11974          9

We can also calculate the coefficient of determination: r² = SSR / SST = 0.09097 / 0.11974 = 0.76
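The filled-in ANOVA entries can be regenerated from the raw data (a sketch in Python; the F value lands near 25.2 rather than exactly 25.296 because the slide's table carries rounded intermediate values):

```python
tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
moisture = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]
n = len(tvdi)
xbar, ybar = sum(tvdi) / n, sum(moisture) / n

# least-squares fit
b = sum((x - xbar) * (y - ybar) for x, y in zip(tvdi, moisture)) / \
    sum((x - xbar) ** 2 for x in tvdi)
a = ybar - b * xbar
yhat = [a + b * x for x in tvdi]

# sums of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)
sse = sum((y - yh) ** 2 for y, yh in zip(moisture, yhat))
sst = sum((y - ybar) ** 2 for y in moisture)

msr = ssr / 1            # regression mean square, df = 1
mse = sse / (n - 2)      # error mean square, df = n - 2
f_stat = msr / mse       # F statistic, about 25 for this data
r2 = ssr / sst           # coefficient of determination, about 0.76
```

Checking SSR + SSE against SST here reproduces the partition shown in the table.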
A Significance Test for r²
We can test to see if the regression line has been successful in explaining a significant portion of the variation in y by performing an F-test. This operates in a similar fashion to how we used the F-test in ANOVA, this time testing the null hypothesis that the true coefficient of determination of the population, ρ², equals 0, using an F-test formulated as:
F_test = r²(n - 2) / (1 - r²) = MSSR / MSSE
which has an F-distribution with degrees of freedom df = (1, n - 2).
Hypothesis Testing - Significance of r² (F-test Example)
Research question: Is the regression line explaining a significant proportion of the variation in y (soil moisture)?
1. H0: ρ² = 0 (explanation of variation not significant)
2. HA: ρ² ≠ 0 (significant variation explained)
3. Select α = 0.05, one-tailed because we are using an F-test
4. In order to compute the F-test statistic, we need to first calculate either the coefficient of determination or the mean sums of squares for both the regression and error terms (in this case we have already done both):
F_test = 0.76 × 8 / (1 - 0.76) = 0.09097 / 0.0035962 = 25.296
Hypothesis Testing - Significance of r² (F-test Example)
5. We now need to find the critical F-value, first calculating the degrees of freedom:
df = (1, n - 2) = (1, 10 - 2) = (1, 8)
We can now look up the F_crit value for our α (0.05 in one tail) and df = (1, 8): F_crit = 5.32
6. F_test > F_crit, therefore we reject H0 and accept HA, finding that the regression explains a significant portion of the variation in y (i.e. the population coefficient of determination ρ², which we have estimated using the sample coefficient of determination r², is not equal to 0).
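The whole test can be scripted end to end. A sketch in Python; the critical value 5.32 is taken from an F-table for α = 0.05 and df = (1, 8), as on the slide, rather than computed:

```python
tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
moisture = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]
n = len(tvdi)
xbar, ybar = sum(tvdi) / n, sum(moisture) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(tvdi, moisture))
sxx = sum((x - xbar) ** 2 for x in tvdi)
syy = sum((y - ybar) ** 2 for y in moisture)

r2 = sxy ** 2 / (sxx * syy)          # coefficient of determination
f_test = r2 * (n - 2) / (1 - r2)     # F statistic with df = (1, n - 2)

F_CRIT = 5.32                        # tabulated, alpha = 0.05, df = (1, 8)
reject_h0 = f_test > F_CRIT          # True here: the regression is significant
```

The computed F comfortably exceeds 5.32, reproducing the rejection of H0 reached on the slide.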