Simple linear regression. Minimum mean square error prediction. Univariate regression. Calculating intercept and slope.

Oct 2017 1 / 28

Minimum MSE. Y is the response variable, X the predictor variable, with E(X) = E(Y) = 0. The best linear unbiased predictor (BLUP) of Y, Ŷ = uX, minimizes the average discrepancy
var(Y - uX) = C_YY - 2u C_XY + u² C_XX.
This is minimized when u takes the value b_1 = C_XY / C_XX. Ŷ = b_1 X is the regression of Y on X, and b_1 is the regression coefficient. The prediction error variance is
var(Y - Ŷ) = C_YY - C²_XY / C_XX.
If we relax the assumption that E(X) = E(Y) = 0, then Ŷ = m_Y + b_1 (X - m_X). 2 / 28

Prediction error variance: C_YY - C²_XY / C_XX. Regression coefficient: b = C_XY / C_XX. 3 / 28
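
A quick numerical check of this minimization (a sketch with made-up covariance values, not part of the slides; the variable names are ours):

## Sketch: check that u = C_XY / C_XX minimizes the mean squared error E[(Y - uX)^2].
set.seed(42)
nsim <- 1e5
x <- rnorm(nsim, 0, sqrt(2))              # C_XX = 2
y <- 0.7 * x + rnorm(nsim)                # C_XY = 1.4
b1 <- cov(x, y) / var(x)                  # closed-form minimizer
u_opt <- optimize(function(u) mean((y - u * x)^2), c(-5, 5))$minimum
c(b1 = b1, u_opt = u_opt)                 # both close to 0.7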

Parent-offspring. A trait is measured on offspring (Y) and parents (X_1, X_2). Mid-parent value X = (X_1 + X_2)/2. According to genetic theory,
C_XY = (1/2) V_A,   C_XX = (1/2)(V_A + V_E).
The regression coefficient (offspring on mid-parent) is b_1 = C_XY / C_XX = V_A / (V_A + V_E), the heritability of the trait. 4 / 28
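
As an illustrative check (not from the slides), one can simulate parent and offspring phenotypes under a simple additive model with assumed variance components V_A and V_E, and confirm that the offspring-on-mid-parent slope recovers the heritability:

## Sketch under a simple additive model; the values of VA and VE are assumptions.
set.seed(1)
nsim <- 50000
VA <- 0.6; VE <- 0.4
a1 <- rnorm(nsim, 0, sqrt(VA))            # breeding value, parent 1
a2 <- rnorm(nsim, 0, sqrt(VA))            # breeding value, parent 2
x1 <- a1 + rnorm(nsim, 0, sqrt(VE))       # parent phenotypes
x2 <- a2 + rnorm(nsim, 0, sqrt(VE))
xm <- (x1 + x2) / 2                       # mid-parent value
y  <- (a1 + a2) / 2 +                     # mean parental breeding value
      rnorm(nsim, 0, sqrt(VA / 2 + VE))   # Mendelian sampling + environment
coef(lm(y ~ xm))[2]                       # close to VA / (VA + VE) = 0.6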

Francis Galton. Tall fathers tend to have tall sons, but the average height of sons of tall fathers is less than the average height of the fathers. There is regression (falling back) towards the overall mean. 5 / 28

[Galton's data: two-way frequency table of child height (about 62-74 in) against parent height (about 64-74 in).] 6 / 28

Bivariate normal distribution. In general, the best possible predictor is E(Y | X). For the bivariate normal distn, E(Y | X) and the BLUP Ŷ coincide. When we are dealing with normally distd r.v.s, the BLUP is not just best among all linear predictors, but best among all predictors. 7 / 28

Sampling. Usually the (co)variances are not known but are estimated from a sample (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) from the bivariate distn. The sample variances obtained by dividing S_xx and S_yy by n - 1 provide unbiased estimates of C_XX and C_YY. The sample covariance obtained by dividing S_xy by n - 1 is an unbiased estimate of C_XY, where S_xy is the corrected sum of products. The regression coefficient of Y on X is then estimated by b̂_1 = S_xy / S_xx. 8 / 28

Simple example. Blood pressure was measured on a sample of women of different ages. Ages were grouped into 10-year classes, and the mean b.p. calculated for each age class.

Age class (yrs)   35   45   55   65   75
b.p. (mmHg)      114  124  143  158  166

Here age is fixed by the experimental design, not random. Model for the dependence of Y (b.p.) on X (age):
Y_i = b_0 + b_1 X_i + e_i,   i = 1 ... n
The errors (residuals) e_1 ... e_n are independently distd with zero mean and constant variance σ². The e_i are the errors, and σ² is the error (residual) variance. 9 / 28

[Blood pressure data: scatter plot of b.p. (110-170 mmHg) against age (35-75 years).] 10 / 28

Method I (not recommended). Calculate X̄ = 55, Ȳ = 141, d_x = X - X̄, d_y = Y - Ȳ.

   X    Y   d_x   d_y   d_x²  d_x d_y   d_y²
  35  114   -20   -27    400      540    729
  45  124   -10   -17    100      170    289
  55  143     0     2      0        0      4
  65  158    10    17    100      170    289
  75  166    20    25    400      500    625
Total         0     0   1000     1380   1936

S_xx = Σ d_x² = 1000,  S_xy = Σ d_x d_y = 1380,  S_yy = Σ d_y² = 1936. 11 / 28

Calculations. Method II (recommended). Calculate the uncorrected sums of squares or products, then subtract a correction factor. The result is the corrected sum of squares or products:

 N   ΣX   ΣY    ΣX²     ΣY²     ΣXY
 5  275  705  16125  101341   40155

S_xx = 16125 - 275²/5 = 1000
S_xy = 40155 - 275 × 705/5 = 1380
S_yy = 101341 - 705²/5 = 1936
12 / 28
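
The same quantities can be checked in R (a small sketch; the variable names are ours, not from the slides):

## Corrected sums of squares and products for the blood-pressure data.
age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)

## Method I: deviations from the means.
Sxx <- sum((age - mean(age))^2)                    # 1000
Syy <- sum((bp  - mean(bp))^2)                     # 1936
Sxy <- sum((age - mean(age)) * (bp - mean(bp)))    # 1380

## Method II: uncorrected sums minus a correction factor.
n <- length(age)
c(Sxx = sum(age^2)      - sum(age)^2 / n,
  Sxy = sum(age * bp)   - sum(age) * sum(bp) / n,
  Syy = sum(bp^2)       - sum(bp)^2 / n)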

Calculations. The equation of the line is Y - 141 = 1.38 (X - 55), or Y = 65.1 + 1.38 X. The slope of the line is b̂_1 = S_xy / S_xx = 1.38 mmHg/year, an average increase of 13.8 mmHg per decade. The intercept is b̂_0 = Ȳ - b̂_1 X̄ = 141 - 1.38 × 55 = 65.1, the predicted value of Y when X = 0 (an extrapolation far outside the range of the data). 13 / 28
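
Continuing the sketch above, the slope and intercept follow directly from the corrected sums of squares:

## Slope and intercept for the blood-pressure data.
b1 <- Sxy / Sxx                    # 1.38 mmHg per year
b0 <- mean(bp) - b1 * mean(age)    # 65.1 mmHg
c(intercept = b0, slope = b1)      # matches coef(lm(bp ~ age))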

[Fitted line: the blood-pressure scatter plot with the line Y = 65.1 + 1.38 X superimposed.] 14 / 28

Fitted values and residuals. The values of Y predicted by the equation at the data values X_1 ... X_n are called fitted values (Ŷ). Differences between observed and fitted values (Y - Ŷ) are called residuals. For the blood pressure data, the fitted values and residuals are:

   X    Y   Fitted  Residual
  35  114    113.4      0.6
  45  124    127.2     -3.2
  55  143    141.0      2.0
  65  158    154.8      3.2
  75  166    168.6     -2.6
15 / 28
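
These can be reproduced in R (again a sketch, continuing from the vectors defined above):

## Fitted values and residuals from the fitted line.
fitted_bp <- b0 + b1 * age       # 113.4 127.2 141.0 154.8 168.6
resid_bp  <- bp - fitted_bp      #   0.6  -3.2   2.0   3.2  -2.6
data.frame(age, bp, fitted = fitted_bp, residual = resid_bp)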

[Fitted values and residuals: the same scatter plot with the fitted line, the fitted values, and the residuals shown as vertical distances from each point to the line.] 16 / 28

The deviation from the mean can be split into two components:
Y_i - Ȳ = (Y_i - Ŷ_i) + (Ŷ_i - Ȳ)
Correspondingly, the total sum of squares splits into two components:
Σ(Y_i - Ȳ)² = Σ(Ŷ_i - Ȳ)² + Σ(Y_i - Ŷ_i)²
Total = Regression + Residual
The regression sum of squares is the corrected sum of squares of the fitted values. The residual sum of squares is the sum of squared residuals. 17 / 28

Anova calculation. The total sum of squares is S_yy. The regression sum of squares is S²_xy / S_xx. The residual sum of squares is obtained by subtraction:
S_yy = S²_xy/S_xx + (S_yy - S²_xy/S_xx)
Total = Regression + Residual
18 / 28

Anova table.

Source       Df  Sum Sq  Mean Sq  F ratio
Regression    1  1904.4  1904.4     180.8
Residual      3    31.6    10.53
Total         4  1936.0

The regression sum of squares S²_xy/S_xx has one degree of freedom. For the blood pressure data, the total sum of squares has 4 d.f., leaving the residual sum of squares with 3 d.f. In general, with a sample of size n (n pairs of X and Y values), the total sum of squares has n - 1 d.f. and the residual sum of squares has n - 2 d.f. Mean squares are obtained by dividing the corresponding sum of squares by its number of degrees of freedom. The residual mean square (10.53) estimates the residual variance σ². 19 / 28
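
Continuing the sketch, the anova quantities can be computed directly from the corrected sums of squares:

## Anova decomposition for the blood-pressure regression.
SSreg <- Sxy^2 / Sxx             # 1904.4, regression SS (1 d.f.)
SSres <- Syy - SSreg             #   31.6, residual SS   (n - 2 = 3 d.f.)
MSres <- SSres / (n - 2)         #  10.53, estimates sigma^2
Fstat <- SSreg / MSres           #  180.8
c(SSreg = SSreg, SSres = SSres, MSres = MSres, F = Fstat)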

Variance explained. The proportion of variance explained is R² = (regression sum of squares)/(total sum of squares). The sample correlation coefficient is the square root of R², taken with the sign of the regression coefficient (or, equivalently, of S_xy). For the blood pressure data, R² = 1904.4/1936 = 0.9837 and r = +0.992. 20 / 28
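
A quick check in R (sketch, reusing the quantities defined earlier):

## R-squared and the sample correlation coefficient.
R2 <- SSreg / Syy                        # 0.9837
r  <- sign(Sxy) * sqrt(R2)
c(R2 = R2, r = r, cor = cor(age, bp))    # r equals cor(age, bp)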

Is the apparent increase in b.p. with age real, or due to chance? Null hypothesis H_0: b_1 = 0 (no relationship between X and Y). Under H_0, the F statistic has an F distn with 1 and n - 2 d.f. H_0 is rejected for large values of F (a one-sided test). For the b.p. data, F = 180.8 with 1 and 3 d.f. Tables of the F distn show this to be highly significant (P < 0.001). 21 / 28

Alternatively, the null distn of b̂_1 / E is t with n - 2 d.f., where E = √(S²/S_xx) is the estimated s.e. of b̂_1. For the b.p. data, the s.e. of b̂_1 is √(10.53/1000) = 0.1026, and t = 1.38/0.1026 = 13.45 with 3 d.f. Tables of the t distn give P < 0.001 (two-sided test). Note that the t statistic is the square root of the F statistic. (When F has 1 and ν d.f., ±√F is distributed as t with ν d.f.) 22 / 28
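
The tail probabilities can be obtained in R rather than from tables (a sketch, continuing from the quantities above):

## P-values for the F and t tests of H0: b1 = 0.
se_b1 <- sqrt(MSres / Sxx)                      # 0.1026
tstat <- b1 / se_b1                             # 13.45
pf(Fstat, 1, n - 2, lower.tail = FALSE)         # about 0.00089, one-sided F test
2 * pt(abs(tstat), n - 2, lower.tail = FALSE)   # same value, two-sided t test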

Inspect residuals for evidence that model assumptions do not hold. Plot residual against predictor variable or fitted value. Plots may show evidence of systematic discrepancy, due to inadequacies in the model, or an isolated discrepancy, due to an outlier. An outlier has an unusually large residual. If possible, a reason should be found. Outliers may sometimes be rejected, cautiously. 23 / 28

A correlation between X and Y does not necessarily imply that a change in X causes a change in Y. The link may be between X and Z, and between Z and Y, where Z is a third (unobserved) variable. For example, a correlation between birth rate and tractor sales may arise simply because both variables are increasing over time. 24 / 28

Prediction. The prediction Ŷ = Ȳ + b̂_1 (X - X̄) has variance σ² [ 1/n + (X - X̄)²/S_xx ]. In the example, Ŷ predicts the mean b.p. of a large number of women whose age is X. The prediction for one particular woman of age X is Ŷ plus an error term with zero mean and variance σ². The prediction itself is unchanged, but its variance is now σ² [ 1 + 1/n + (X - X̄)²/S_xx ]. Confidence limits are Ŷ ± k E, where E is the square root of the appropriate variance, and k comes from tables of the normal distn (σ² known) or tables of the t distn (when σ² is estimated by S²). 25 / 28
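
In R these intervals come from predict() applied to an lm fit. A sketch for the blood-pressure data, reusing the age and bp vectors defined earlier (the age value 60 is just an example, not from the slides):

## Interval for the mean b.p. at a given age, and for one individual of that age.
fit <- lm(bp ~ age)
new <- data.frame(age = 60)
predict(fit, newdata = new, interval = "confidence")   # variance sigma^2 (1/n + (X - Xbar)^2/Sxx)
predict(fit, newdata = new, interval = "prediction")   # adds the extra sigma^2 term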

age <- c(35, 45, 55, 65, 75)
bp  <- c(114, 124, 143, 158, 166)
fit <- lm(bp ~ age)
summary(fit)
anova(fit)
plot(fit)

plot(fit) produces diagnostic plots. To graph the points and the fitted line:

plot(bp ~ age)
abline(fit)
26 / 28

Summary output

> summary(fit)

Residuals:
   1    2    3    4    5
 0.6 -3.2  2.0  3.2 -2.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  65.1000     5.8284   11.17 0.001538 **
age           1.3800     0.1026   13.45 0.000889 ***

Multiple R-squared: 0.9837
F-statistic: 180.8 on 1 and 3 DF,  p-value: 0.0008894
27 / 28

Anova output

> anova(fit)
Analysis of Variance Table

Response: bp
          Df Sum Sq Mean Sq F value    Pr(>F)
age        1 1904.4 1904.40   180.8 0.0008894 ***
Residuals  3   31.6   10.53
28 / 28