11. Regression and Least Squares

Size: px

Start display at page:

Download "11. Regression and Least Squares"

Prosper Miller
5 years ago
Views:

1 11. Regression and Least Squares Prof. Tesler Math 186 Winter 2016 Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

2 Regression Given n points ( 1, 1 ), ( 2, 2 ),..., we want to determine a function = f () that is close to them. Scatter plot of data (,) Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

3 Regression Based on knowledge of the underling problem or on plotting the data, ou have an idea of the general form of the function, such as: Line = β 0 β 1 Polnomial = β + β 1 β β " = " Eponential = Ae (B) Deca = Ae B Logistic Curve = A (1 = + BA/(1 C ) + B/C ) Goal: Compute the parameters (β 0, β 1,... or A, B, C,...) that give a best fit to the data. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

4 Regression The methods we consider require the parameters to occur linearl. It is fine if (, ) do not occur linearl. E.g., plugging (, ) = (2, 3) into = β 0 + β 1 + β β 3 3 gives 3 = β 0 + 2β 1 + 4β 2 + 8β 3. For eponential deca, = Ae B, parameter B does not occur linearl. Transform the equation to: ln = ln(a) B = A B When we plug in (, ) values, the parameters A, B occur linearl. Transform the logistic curve = A/(1 + B/C ) to: ( ) A ln 1 = ln(b) ln(c) = B + C where A is determined from A = lim (). Now B, C occur linearl. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

5 Least squares fit to a line = " Given n points ( 1, 1 ), ( 2, 2 ),..., we will fit them to a line ŷ = β 0 + β 1 : Independent variable:. We assume the s are known eactl or have negligible measurement errors. Dependent variable:. We assume the s depend on the s but fluctuate due to a random process. We do not have = f (), but instead, = f () + error. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

6 Least squares fit to a line = " Given n points ( 1, 1 ), ( 2, 2 ),..., we will fit them to a line ŷ = β 0 + β 1 : Predicted value (on the line): ŷ i = β 0 + β 1 i Actual data ( ): i = β 0 + β 1 i + ɛ i Residual (actual minus prediction): ɛ i = i ŷ i = i (β 0 + β 1 i ) Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

7 Least squares fit to a line = " We will use the least squares method: pick parameters β 0, β 1 that minimize the sum of squares of the residuals. n L = ( i (β 0 + β 1 i )) 2 Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

8 Least squares fit to a line L = n ( i (β 0 + β 1 i )) 2 To find β 0, β 1 that minimize this, solve L = ( L β 0, L β 1 ) = (0, 0): L β 0 = 2 L β 1 = 2 n ( i (β 0 + β 1 i )) = 0 nβ 0 + n ( i (β 0 + β 1 i )) i = 0 ( n ) i β 0 + ( n ) i β 1 = ( n i 2 ) β 1 = n i n i i which has solution (all sums are i = 1 to n) β 1 = n ( i i i ) ( i i) ( i i) n ( i i 2 ) ( i i) 2 = i ( i )( i ȳ) i ( i ) 2 β 0 = ȳ β 1 Not shown: use 2nd derivatives to confirm it s a minimum rather than a maimum or saddle point. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

9 Best fitting line = β 0 + β 1 + ε = α 0 + α 1 + ε = slope = = slope = The best fit for = β 0 + β 1 + error or = α 0 + α 1 + error give different lines = β 0 + β 1 + error assumes the s are known eactl with no errors, while the s have errors. = α 0 + α 1 + error is the other wa around. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

10 Total Least Squares / Principal Components Analsis = " = # 0 + # 1 + " = slope = = slope = First principal component of centered data All three slope = = = In man eperiments, both and have measurement errors. Use Total Least Squares or Principal Components Analsis, in which the residuals are measured perpendicular to the line. Details require advanced linear algebra, beond Math 20F. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

11 Confidence intervals = " true line sample data best fit line 95% prediction interval r 2 = The best fit line is different than the true line. We found point estimates of β 0 and β 1. Assuming errors are independent of and normall distributed gives Confidence intervals for β 0, β 1. A prediction interval to etrapolate = f () at other s. Warning: it ma diverge from the true line when we go out too far. Not shown: one can also do hpothesis tests on the values of β 0 and β 1, and on whether two samples give the same line. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

12 Confidence intervals The method of least squares gave point estimates of β 0 and β 1 : ˆβ 1 = n i i i ( i i) ( i i) n ( i i 2 ) ( i i i) 2 = ( i )( i ȳ) i ( ˆβ i ) 2 0 = ȳ ˆβ 1 The sample variance of the residuals is s 2 = 1 n ( i ( ˆβ 0 + ˆβ 1 i )) 2 (with df = n 2). n 2 100(1 α)% confidence intervals: ( s i β 0 : ˆβ 0 t i 2 α/2,n 2, ˆβ 0 + t n i ( α/2,n 2 i ) β 1 : ( ˆβ 1 t α/2,n 2 s, ˆβ 1 + t i ( α/2,n 2 i ) s i i 2 n i ( i ) s i ( i ) ) ) at new : (ŷ w, ŷ + w) with ŷ = β 0 + β 1 and w = t α/2,n 2 s n + ( )2 i ( i ) 2 Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

13 Covariance Let X and Y be random variables, possibl dependent. Let µ X = E(X), µ Y = E(Y) ( ((X Var(X + Y) = E((X + Y µ X µ Y ) 2 ) ( )) ) 2 ) = E µx + Y µy ( (X ) ) ( 2 (Y ) ) 2 = E µx + E µy + 2E ( (X µ X )(Y µ Y ) ) = Var(X) + Var(Y) + 2 Cov(X, Y) where the covariance of X and Y is defined as Cov(X, Y) = E ( (X µ X )(Y µ Y ) ) = E(XY) E(X)E(Y) Independent variables have E(XY) = E(X)E(Y), so Cov(X, Y) = 0. But Cov(X, Y) = 0 does not guarantee X and Y are independent. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

14 Covariance and independence Independent variables have E(XY) = E(X)E(Y), so Cov(X, Y) = 0. But Cov(X, Y) = 0 does not guarantee X and Y are independent. Consider the standard normal distribution, Z. Z and Z 2 are dependent. Cov(Z, Z 2 ) = E(Z 3 ) E(Z)E(Z 2 ). The standard normal distribution has mean 0: E(Z) = 0. E(Z 3 ) = 0 since Z 3 is an odd function and the pdf of Z is smmetric around Z = 0. So Cov(Z, Z 2 ) = 0. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

15 Covariance properties Covariance properties Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) Cov(X, X) = Var(X) Cov(X, Y) = Cov(Y, X) Cov(aX + b, cy + d) = ac Cov(X, Y) Sign of covariance Cov(X, Y) = E((X µ X )(Y µ Y )) When Cov(X, Y) is positive: There is a tendenc to have X > µ X when Y > µ Y and vice-versa, and X < µ X when Y < µ Y and vice-versa. When Cov(X, Y) is negative: There is a tendenc to have X > µ X when Y < µ Y and vice-versa, and X < µ X when Y > µ Y and vice-versa. When Cov(X, Y) = 0: a) X and Y might be independent, but it s not guaranteed. b) Var(X + Y) = Var(X) + Var(Y) Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

16 Sample variance and covariance Variance of a random variable: σ 2 = Var(X) = E((X µ X ) 2 ) = E(X 2 ) (E(X)) 2 Sample variance from data 1,..., n to estimate σ 2 : ( s 2 = var() = 1 n n ) ( i ) 2 = 1 2 i n 1 n 1 n n 1 2 Covariance between random variables X, Y: σ XY = Cov(X, Y) = E((X µ X )(Y µ Y )) = E(XY) E(X)E(Y) Sample covariance from data ( 1, 1 ),..., ( n, n ) to estimate σ XY : ( s XY = cov(, ) = 1 n n ) ( i )( i ȳ) = 1 i i n n 1 n 1 n 1 ȳ Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

17 Correlation coefficient Let X and Y be two random variables. Their correlation coefficient is ρ(x, Y) = Cov(X, Y) Var(X) Var(Y) This is a normalized version of covariance, and is between ±1. For a line Y = ax + b with a, b constants (a 0), ρ(x, Y) = a Var(X) Var(X) Var(aX) = aσ2 σ a σ = a a = ±1 (sign of a) ρ(x, Y) = ±1 iff Y = ax + b with a, b constants (a 0). Closer to ±1: more linear. Closer to 0: less linear. If X and Y are independent then ρ(x, Y)=0. The converse is not valid: dependent variables can have ρ(x, Y)=0. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

18 Correlation coefficient ρ(x,y) is estimated from data b the sample correlation coefficient (a.k.a. Pearson product-moment correlation coefficient): cov(, ) i r(, ) = = ( i )( i ȳ) var() var() i ( i ) 2 i ( i ȳ) 2 = n i i i ( i i)( i i) n i i 2 ( i i) 2 n i i 2 ( i i) 2 People often report r 2 (between 0 and 1) instead of r. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

19 Sample correlation coefficient r Middle row: Perfect linear relation Y = ax + b gives r = 1 for lines with positive slope (a > 0) r = 1 for lines with negative slope (a < 0) r undefined for horizontal line (Y = b) Other rows: coming up Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

20 Interpretation of r 2 Let ŷ i = ˆβ 1 i + ˆβ 0 be the predicted -value for i based on the least squares line. Write the deviation of i from ȳ as i ȳ = ( i ŷ i ) + (ŷ i ȳ) Total Uneplained Eplained deviation b line b line It can be shown that the sum of squared deviations for all s is i ( i ȳ) 2 = i ( i ŷ i ) 2 + i (ŷ i ȳ) i ( i ŷ i )(ŷ i ȳ) Total Uneplained Eplained = 0 b a miracle variation variation variation (Tedious algebra not shown) and that i r 2 = (ŷ i ȳ) 2 Eplained variation i ( = i ȳ) 2 Total variation r = 1: 100% of the variation is eplained b the line and 0% is due to other factors, and the slope is positive. r =.8: 64% of the variation is eplained b the line and 36% is due to other factors, and the slope is negative. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

21 Sample correlation coefficient r Top row: Linear relations with varing r. Bottom: r = 0, et X and Y are dependent in all of these (ecept possibl the last). Each plot shows a nonlinear relationship eists. Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

Correlation does not impl causation High correlation between X and Y doesn t mean X causes Y or vice-versa. It could be a coincidence. Or the could both be caused b a third variable.

22 Correlation does not impl causation High correlation between X and Y doesn t mean X causes Y or vice-versa. It could be a coincidence. Or the could both be caused b a third variable. Website tlervigen.com plots man data sets (various quantities b ear) against each other to find spurious correlations: Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23

23 More about interpretation of correlation Low r 2 does NOT guarantee independence; it just means that a line = β 0 + β 1 is not a good fit to the data. r is an estimate of ρ. The estimate improves with higher n. With additional assumptions on the underling joint distribution of X, Y, we can use r to test H 0 : ρ = 0 vs. H 1 : ρ 0 (or other values). Best fits and correlation generalize to other models, including Polnomial regression Multiple linear regression Weighted versions = β 0 + β 1 + β β p p = β 0 + β 1 t + β 2 u + + β p w t, u,..., w: multiple independent variables : dependent variable When the variance is different at each value of the independent variables Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter / 23