Simple linear regression. Minimum mean square error prediction. Univariate regression. Calculating intercept and slope.

Oct 2017

Minimum MSE

Y is the response variable, X the predictor variable, and E(X) = E(Y) = 0. The BLUP of Y minimizes the average squared discrepancy

$\operatorname{var}(Y - uX) = C_{YY} - 2u\,C_{XY} + u^2 C_{XX}$.

This is minimized when u takes the value $b_1 = C_{XY}/C_{XX}$. $\hat{Y} = b_1 X$ is the regression of Y on X, and $b_1$ is the regression coefficient. The error variance is

$\operatorname{var}(Y - \hat{Y}) = C_{YY} - C_{XY}^2/C_{XX}$.

If we relax the assumption that E(X) = E(Y) = 0, then $\hat{Y} = m_Y + b_1 (X - m_X)$.
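The minimizing value of u can be checked by differentiating the quadratic in u. A short worked step, assuming $C_{XX} > 0$ (so that the second derivative $2C_{XX}$ is positive and the stationary point is a minimum):

```latex
\begin{align*}
\frac{d}{du}\left(C_{YY} - 2u\,C_{XY} + u^2 C_{XX}\right)
  &= -2C_{XY} + 2u\,C_{XX} = 0
  \quad\Longrightarrow\quad u = b_1 = \frac{C_{XY}}{C_{XX}}, \\
\operatorname{var}(Y - b_1 X)
  &= C_{YY} - 2\,\frac{C_{XY}^2}{C_{XX}} + \frac{C_{XY}^2}{C_{XX}}
   = C_{YY} - \frac{C_{XY}^2}{C_{XX}}.
\end{align*}
```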

[Figure: error variance, with minimum value $C_{yy} - C_{xy}^2/C_{xx}$ at $b = C_{xy}/C_{xx}$.]

Parent-offspring regression

Trait is measured on offspring (Y) and parents ($X_1$, $X_2$). Mid-parent value $X = (X_1 + X_2)/2$. According to genetic theory,

$C_{XY} = \tfrac{1}{2} V_A$, $C_{XX} = \tfrac{1}{2}(V_A + V_E)$.

Regression coefficient (offspring on mid-parent) is $b_1 = C_{XY}/C_{XX} = V_A/(V_A + V_E)$, the heritability of the trait.

Francis Galton

Tall fathers tend to have tall sons, but the average height of sons of tall fathers is less than the average height of the fathers. There is regression (falling back) to the overall mean.

Galton's data

[Figure: counts of children by child height (vertical axis, 62 to 74) and parent height (horizontal axis, 64 to 74).]

Bivariate normal distribution

In general, the best possible predictor is E(Y | X). For the bivariate normal distribution, E(Y | X) and the BLUP $\hat{Y}$ coincide. When we are dealing with normally distributed random variables, the BLUP is not just best among all linear predictors, but best among all predictors.

Sampling

Usually (co)variances are not known but are estimated from a sample $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ from the bivariate distribution. Sample variances obtained by dividing $S_{xx}$ and $S_{yy}$ by $n - 1$ provide unbiased estimates of $C_{XX}$ and $C_{YY}$. The sample covariance obtained by dividing $S_{xy}$ by $n - 1$ is an unbiased estimate of $C_{XY}$, where $S_{xy}$ is the corrected sum of products. The regression coefficient of Y on X is then estimated by $\hat{b}_1 = S_{xy}/S_{xx}$.

Simple example

Blood pressure was measured on a sample of women of different ages. Ages were grouped into 10-year classes, and mean b.p. calculated for each age class.

Age class (yrs)    35   45   55   65   75
b.p. (mmHg)       114  124  143  158  166

Here age is fixed by experiment design, not random. Model for the dependence of Y (b.p.) on X (age):

$Y_i = b_0 + b_1 X_i + e_i, \quad i = 1, \ldots, n$

Errors (residuals) $e_1, \ldots, e_n$ are independently distributed with zero mean and constant variance $\sigma^2$. Residuals $e_i$ are errors, and $\sigma^2$ is the error (residual) variance.

Blood pressure data

[Figure: scatter plot of bp (110 to 170) against age (35 to 75).]

Method I (not recommended). Calculate $\bar{X} = 55$, $\bar{Y} = 141$, $d_x = X - \bar{X}$, $d_y = Y - \bar{Y}$.

  X      Y    d_x   d_y   d_x^2   d_x d_y   d_y^2
 35    114    -20   -27     400       540     729
 45    124    -10   -17     100       170     289
 55    143      0     2       0         0       4
 65    158     10    17     100       170     289
 75    166     20    25     400       500     625
Total           0     0    1000      1380    1936

$S_{xx} = \sum d_x^2 = 1000$, $S_{xy} = \sum d_x d_y = 1380$, $S_{yy} = \sum d_y^2 = 1936$.

Calculations

Method II (recommended). Calculate the uncorrected sum of squares or products, then subtract a correction factor. The result is the corrected sum of squares or products:

N    ΣX    ΣY    ΣX^2     ΣY^2     ΣXY
5   275   705   16125   101341   40155

$S_{xx} = 16125 - 275^2/5 = 1000$
$S_{xy} = 40155 - 275 \times 705/5 = 1380$
$S_{yy} = 101341 - 705^2/5 = 1936$
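The two methods can be compared directly in R. This is only a sketch for the blood pressure data; the variable names (`Sxx_dev`, `Sxx`, and so on) are illustrative choices, not part of the slides.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
n   <- length(age)

# Method I: sums of squared (or cross-multiplied) deviations from the means
Sxx_dev <- sum((age - mean(age))^2)
Sxy_dev <- sum((age - mean(age)) * (bp - mean(bp)))
Syy_dev <- sum((bp - mean(bp))^2)

# Method II: uncorrected sums minus a correction factor
Sxx <- sum(age^2)    - sum(age)^2 / n
Sxy <- sum(age * bp) - sum(age) * sum(bp) / n
Syy <- sum(bp^2)     - sum(bp)^2 / n

print(c(Sxx = Sxx, Sxy = Sxy, Syy = Syy))

# Dividing by n - 1 gives the sample variance and covariance
print(all.equal(Sxx / (n - 1), var(age)))
print(all.equal(Sxy / (n - 1), cov(age, bp)))
```

Both methods give the same corrected sums; in floating-point arithmetic Method II can lose accuracy when the correction factor is very close to the uncorrected sum, which is not an issue for these data.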

Calculations

Equation of the line is $Y - 141 = 1.38\,(X - 55)$, or $Y = 65.1 + 1.38\,X$.

Slope of the line is $\hat{b}_1 = 1.38$ mmHg/year, an average increase of 13.8 mmHg per decade. The intercept ($\hat{b}_0 = 65.1$) is the predicted value of Y when X = 0 (an extrapolation far outside the range of the data).
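As a check on the arithmetic, the slope and intercept can be computed from the corrected sums and compared with `lm()`. A minimal sketch; the names `b0` and `b1` are chosen here for illustration.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))

b1 <- Sxy / Sxx                    # slope, mmHg per year
b0 <- mean(bp) - b1 * mean(age)    # intercept: the line passes through the means
print(c(intercept = b0, slope = b1))

# lm() fits the same line by least squares
print(coef(lm(bp ~ age)))
```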

Fitted line

[Figure: bp (110 to 170) against age (35 to 75), with the fitted line.]

Residuals, fitted values

Values of Y predicted by the equation at the data values $X_1, \ldots, X_n$ are called fitted values ($\hat{Y}$). Differences between observed and fitted values ($Y - \hat{Y}$) are called residuals. For the blood pressure data, fitted values and residuals are:

  X      Y    Fitted   Residual
 35    114     113.4        0.6
 45    124     127.2       -3.2
 55    143     141.0        2.0
 65    158     154.8        3.2
 75    166     168.6       -2.6
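The same table can be produced in R with `fitted()` and `resid()`. A short sketch; the data frame name `tab` is an illustrative choice.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)

# One row per observation: observed value, fitted value, residual
tab <- data.frame(age, bp, fitted = fitted(fit), residual = resid(fit))
print(tab)

# Least-squares residuals sum to zero (up to rounding error)
print(sum(resid(fit)))
```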

Fitted values and residuals

[Figure: bp (110 to 170) against age (35 to 75), showing fitted values and residuals.]

Deviation from the mean can be split into two components:

$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$

Correspondingly, the total sum of squares splits into two components:

$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

Total = Regression + Residual

Regression sum of squares is the corrected sum of squares of the fitted values. Residual sum of squares is the sum of squared residuals.
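The decomposition can be verified numerically from the fitted values and residuals. A sketch for the blood pressure data; the `ss_*` names are illustrative.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)

ss_total      <- sum((bp - mean(bp))^2)            # corrected SS of Y
ss_regression <- sum((fitted(fit) - mean(bp))^2)   # corrected SS of fitted values
ss_residual   <- sum(resid(fit)^2)                 # sum of squared residuals

print(c(total = ss_total, regression = ss_regression, residual = ss_residual))
print(all.equal(ss_total, ss_regression + ss_residual))
```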

Anova calculation

Total sum of squares is $S_{yy}$. Regression sum of squares is $S_{xy}^2/S_{xx}$. Residual sum of squares is obtained by subtraction.

$S_{yy} = S_{xy}^2/S_{xx} + (S_{yy} - S_{xy}^2/S_{xx})$

Total = Regression + Residual

Anova table

Source       Df   Sum Sq   Mean Sq   F ratio
Regression    1   1904.4   1904.4      180.8
Residual      3     31.6     10.53
Total         4   1936.0

Regression sum of squares $S_{xy}^2/S_{xx}$ has one degree of freedom. For the blood pressure data, total sum of squares has 4 d.f., leaving the residual sum of squares with 3 d.f. In general with a sample of size n (n pairs of X and Y values), total sum of squares has $n - 1$ d.f., and residual sum of squares has $n - 2$ d.f.

Mean squares are obtained by dividing the corresponding sum of squares by the number of degrees of freedom. Residual mean square (10.53) estimates the residual variance $\sigma^2$.
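The entries of the table follow from $S_{xx}$, $S_{xy}$ and $S_{yy}$ alone. A sketch of the calculation in R, with illustrative variable names:

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
n   <- length(age)

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))
Syy <- sum((bp - mean(bp))^2)

ss_reg <- Sxy^2 / Sxx        # regression sum of squares, 1 d.f.
ss_res <- Syy - ss_reg       # residual sum of squares by subtraction, n - 2 d.f.
ms_reg <- ss_reg / 1
ms_res <- ss_res / (n - 2)   # estimates the residual variance
F_ratio <- ms_reg / ms_res

print(round(c(ss_reg = ss_reg, ss_res = ss_res, ms_res = ms_res, F = F_ratio), 2))
```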

Variance explained

The proportion of variance explained is $R^2$ = (regression sum of squares)/(total sum of squares). The sample correlation coefficient is the positive square root of $R^2$ multiplied by ±1 (the sign of the regression coefficient or $S_{xy}$).
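For the blood pressure data, $R^2$ and the correlation coefficient can be computed as follows. A minimal sketch; `R2` and `r` are illustrative names.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))
Syy <- sum((bp - mean(bp))^2)

R2 <- (Sxy^2 / Sxx) / Syy      # regression SS / total SS
r  <- sign(Sxy) * sqrt(R2)     # attach the sign of Sxy to the square root

print(c(R2 = R2, r = r))
print(all.equal(r, cor(age, bp)))
```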

Is the apparent increase in b.p. with age real, or due to chance?

Null hypothesis $H_0: b_1 = 0$ (no relationship between X and Y). Under $H_0$, the F statistic has an F distribution with 1 and $n - 2$ d.f. $H_0$ is rejected for large values of F (a one-sided test).

For the b.p. data, F = 180.8 with 1 and 3 d.f. Tables of the F distribution show this to be highly significant (P < 0.001).
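Instead of tables, the upper-tail probability can be obtained from `pf()`. A sketch assuming the same blood pressure data; `F_ratio` and `p_value` are illustrative names.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
n   <- length(age)

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))
Syy <- sum((bp - mean(bp))^2)

ss_reg  <- Sxy^2 / Sxx
F_ratio <- ss_reg / ((Syy - ss_reg) / (n - 2))

# One-sided test: probability of an F value at least this large under H0
p_value <- pf(F_ratio, df1 = 1, df2 = n - 2, lower.tail = FALSE)
print(p_value)
```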

Alternatively, the null distribution of $\hat{b}_1/E$ is t with $n - 2$ d.f., where $E = \sqrt{S^2/S_{xx}}$ is the estimated s.e. of $\hat{b}_1$ and $S^2$ is the residual mean square.

For the b.p. data, the s.e. of $\hat{b}_1$ is $\sqrt{10.53/1000} = 0.1026$, and t = 1.38/0.1026 = 13.45 with 3 d.f. Tables of the t distribution give P < 0.001 (two-sided test).

Note that the t statistic is the square root of the F statistic. (When F has 1 and ν d.f., $\pm\sqrt{F}$ is t with ν d.f.)
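The t statistic and its two-sided P value can be reproduced in R. A sketch with illustrative names (`se_b1`, `t_stat`):

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
n   <- length(age)

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))
Syy <- sum((bp - mean(bp))^2)

b1    <- Sxy / Sxx
S2    <- (Syy - Sxy^2 / Sxx) / (n - 2)   # residual mean square
se_b1 <- sqrt(S2 / Sxx)                  # estimated s.e. of the slope
t_stat <- b1 / se_b1

# Two-sided P value from the t distribution with n - 2 d.f.
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
print(c(se = se_b1, t = t_stat, p = p_value))

# The squared t statistic equals the F ratio
F_ratio <- (Sxy^2 / Sxx) / S2
print(all.equal(t_stat^2, F_ratio))
```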

Inspect residuals for evidence that model assumptions do not hold. Plot residual against predictor variable or fitted value. Plots may show evidence of systematic discrepancy, due to inadequacies in the model, or an isolated discrepancy, due to an outlier. An outlier has an unusually large residual. If possible, a reason should be found. Outliers may sometimes be rejected, cautiously.
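A basic residual plot can be drawn as below. This is a sketch using the blood pressure data; the axis labels and the dashed reference line are presentation choices made here, not something the slides prescribe.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)

# Residuals against fitted values; look for curvature, funnel shapes, or isolated points
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)   # reference line at zero

# Residuals against the predictor variable
plot(age, resid(fit), xlab = "Age (years)", ylab = "Residual")
abline(h = 0, lty = 2)
```

With only five points, such plots can flag gross problems but cannot say much about subtler departures from the model.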

A correlation between X and Y does not necessarily imply that a change in X causes a change in Y. The link may be between X and Z, and between Z and Y, where Z is a third (unobserved) variable. For example, a correlation between birth rate and tractor sales may arise simply because both variables are increasing over time.

The prediction $\hat{Y} = \bar{Y} + \hat{b}_1 (X - \bar{X})$ has variance

$\sigma^2 \left[ 1/n + (X - \bar{X})^2/S_{xx} \right]$

In the example, $\hat{Y}$ predicts the mean b.p. of a large number of women whose age is X.

The prediction for one particular woman of age X is $\hat{Y}$ plus an error term with zero mean and variance $\sigma^2$. The prediction is unchanged, but its variance is now

$\sigma^2 \left[ 1 + 1/n + (X - \bar{X})^2/S_{xx} \right]$

Confidence limits are $\hat{Y} \pm kE$, where E is the square root of the appropriate variance, and k comes from tables of the normal distribution ($\sigma^2$ known) or tables of the t distribution (when $\sigma^2$ is estimated by $S^2$).
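Both sets of limits can be computed by hand and compared with `predict()`. In this sketch the age `x0 = 50` and the 95% level are illustrative assumptions, not values from the slides, and $\sigma^2$ is estimated by the residual mean square, so k comes from the t distribution.

```r
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)
n   <- length(age)

x0  <- 50                                # illustrative age (an assumption)
Sxx <- sum((age - mean(age))^2)
S2  <- sum(resid(fit)^2) / (n - 2)       # residual mean square estimates sigma^2
k   <- qt(0.975, df = n - 2)             # t multiplier for 95% limits

y0       <- unname(coef(fit)[1] + coef(fit)[2] * x0)
var_mean <- S2 * (1 / n + (x0 - mean(age))^2 / Sxx)       # mean b.p. at age x0
var_one  <- S2 * (1 + 1 / n + (x0 - mean(age))^2 / Sxx)   # one woman of age x0

limits_mean <- y0 + c(-1, 1) * k * sqrt(var_mean)
limits_one  <- y0 + c(-1, 1) * k * sqrt(var_one)
print(rbind(mean = limits_mean, one = limits_one))

# predict() returns the same limits in columns "lwr" and "upr"
new     <- data.frame(age = x0)
ci_mean <- predict(fit, new, interval = "confidence", level = 0.95)
pi_one  <- predict(fit, new, interval = "prediction", level = 0.95)
print(ci_mean)
print(pi_one)
```

The limits for a single woman are always wider than those for the mean, because of the extra $\sigma^2$ term.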

    age <- c(35, 45, 55, 65, 75)
    bp <- c(114, 124, 143, 158, 166)
    fit <- lm(bp ~ age)    # regress bp on age
    summary(fit)           # coefficients, standard errors, R-squared
    anova(fit)             # analysis of variance table
    plot(fit)

plot(fit) produces diagnostic plots. To graph points and line:

    plot(bp ~ age)
    abline(fit)

summary output

> summary(fit)

Residuals:
   1    2    3    4    5
 0.6 -3.2  2.0  3.2 -2.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  65.1000     5.8284   11.17 0.001538 **
age           1.3800     0.1026   13.45 0.000889 ***

Multiple R-squared: 0.9837
F-statistic: 180.8 on 1 and 3 DF, p-value: 0.0008894

anova output

> anova(fit)
Analysis of Variance Table

Response: bp
          Df Sum Sq Mean Sq F value    Pr(>F)
age        1 1904.4 1904.40   180.8 0.0008894 ***
Residuals  3   31.6   10.53