Simple Linear Regression: A Model for the Mean (Chapter 7)

An Intermediate Model (if the groups are defined by values of a numeric variable)
- Separate means model
- Intermediate model: the means fall on a straight-line function of the group values
- Equal means model

Meaning of the Word Regression
- The statistical meaning is not related to the usual English definition of regression.
- The term is a misnomer, but there's a historical reason for it.
- It refers to finding the best-fitting (straight-line) relationship between the mean of Y and the values of the variable defining the groups (X).

Galton's data
[Figure: scatterplot of height of son versus height of father]

Case Study 1
- Hubble data (from the 1920s)
- observational data
- each point represents a nebula
- Y: distance of the nebula
- X: recession velocity of the nebula

Case Study 1
- Don't worry about the geometry, but the Big Bang theory implies that Y = const. × X (i.e. a straight-line relationship between Y and X, with intercept 0).
- Furthermore, const. should be the age of the universe.

Case Study 1
[Figure: scatterplot of DISTANCE versus VELOCITY]

Case Study 1
Regression methods can help answer big cosmological questions:
- Is the Big Bang theory correct? (Is the intercept of the straight-line relationship 0?)
- How old is the universe? (What is the slope of the straight-line relationship?)

Case Study 2
- designed experiment
- response variable (Y): pH of steer carcass
- explanatory variable (X): log time (hours) after slaughter
- What is the approximate pH of a particular steer carcass 3 hours after slaughter?
- About how long do you have to wait after slaughter for the pH to reach 6.0?

Case Study 2
[Figure: two scatterplots, PH versus TIME and PH versus log(time)]

Regression Analysis
- statistical methods based on describing the distribution of values of one variable (Y, the response variable) as a function of the other variable (X, the explanatory variable)
- simple linear regression: the function is a straight-line function

Regression Analysis
Specifically:
- the mean of Y is a straight-line function of X: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$, where $\beta_0$ is the intercept and $\beta_1$ is the slope
- the standard deviation of Y is constant: $\sigma\{Y \mid X\} = \sigma$
- the distribution of values of Y for any value of X is normal

Simple Linear Regression Model: ASSUMPTIONS
- LINEARITY: The mean response, $\mu$, has a straight-line relationship with X ($\mu\{Y \mid X\} = \beta_0 + \beta_1 X$).
- NORMALITY: For any given X, Y is normally distributed around $\mu\{Y \mid X\}$.
- EQUAL STANDARD DEVIATION: The standard deviation ($\sigma$) of Y is the same for all values of X.
- INDEPENDENCE: Observations are independent of each other.
[Figure: plot of y against x showing the line $\mu_y = \beta_0 + \beta_1 x$]
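One way to see what the four assumptions say is to simulate data that satisfy them. The sketch below is illustrative only: the parameter values, the X value, and the function name `simulate_responses` are made up for the example and come from neither case study.

```python
import random

def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw one Y for each X from the simple linear regression model."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    # LINEARITY: the mean of Y at x is beta0 + beta1 * x.
    # NORMALITY and EQUAL SD: Y is normal around that mean, same sigma at every x.
    # INDEPENDENCE: each value is drawn separately from the others.
    return [rng.gauss(beta0 + beta1 * x, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0

# Many observations at the same X value, so their average should sit
# close to the model mean beta0 + beta1 * 4.0 = 4.0.
xs = [4.0] * 20000
ys = simulate_responses(xs, beta0, beta1, sigma)
sample_mean = sum(ys) / len(ys)
print(round(sample_mean, 2))
```

Changing the X value moves the centre of the simulated Y values along the line but leaves their spread unchanged, which is what the equal standard deviation assumption requires.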

Simple Linear Regression Model
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
- Y: the response
- X: the explanatory variable (regressor)
- $\beta_0$ and $\beta_1$: the regression coefficients ($\beta_0$ is the intercept, $\beta_1$ is the slope)

Interpretation of coefficients
- slope: for a 1-unit change in X, Y changes by $\beta_1$ units on average
- intercept: when X is 0, the mean value of Y is $\beta_0$

Simple Linear Regression Model
- Advantage relative to the separate means model: interpolation, i.e. making inferences about the distribution of Y for X values not actually included in the data set but within the range of the observed X values (e.g. estimating the mean pH of carcasses 2.5 hours after slaughter).
- Danger relative to the separate means model: extrapolation, i.e. making inferences about the distribution of Y for X values outside the range of the observed X values (e.g. estimating the mean pH of carcasses at a time later than any observed in the experiment).
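A simple guard against accidental extrapolation is to compare the requested X value with the observed range before using the fitted line. This is only a sketch: the function name `classify_inference` and the list of observation times are hypothetical and are not the case study data.

```python
def classify_inference(x0, observed_xs):
    """Say whether an inference at x0 is interpolation or extrapolation."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x0 <= highest:
        return "interpolation"
    return "extrapolation"

# Hypothetical observation times in hours (illustration only).
observed_times = [1, 2, 4, 6, 8]

inside = classify_inference(2.5, observed_times)   # within the observed range
outside = classify_inference(20, observed_times)   # beyond the largest time
print(inside, outside)
```

Because the logarithm is increasing, checking the range on the time scale gives the same answer as checking it on the log(time) scale.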

Estimating $\beta_0$, $\beta_1$ and $\sigma$
- We can't just calculate means to estimate $\beta_0$ and $\beta_1$ (like we did in ANOVA).
- Strategy: use least squares to find the best-fitting line; then use the slope and intercept of the best-fitting line as estimates of the true slope and intercept.

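Least squares
- Find the straight line that minimizes $\sum_i (\mathrm{res}_i)^2$, where $\mathrm{fit}_i = \hat\beta_0 + \hat\beta_1 X_i$ and $\mathrm{res}_i = Y_i - \mathrm{fit}_i$.
[Figure: the line $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$ with an observation $Y_i$, its fitted value $\mathrm{fit}_i$, and its residual $\mathrm{res}_i$ marked]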

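Formulas for estimators
(the least squares principle is just for background)
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X, \qquad \hat\sigma = \sqrt{\frac{\sum_{i=1}^{n} \mathrm{res}_i^2}{n-2}}$$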
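The estimator formulas translate directly into a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `least_squares_fit` and the five data points are invented for illustration.

```python
import math

def least_squares_fit(xs, ys):
    """Return (b0, b1, sigma_hat) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope estimate
    b0 = y_bar - b1 * x_bar        # intercept estimate
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    # Residual sum of squares divided by n - 2 degrees of freedom.
    sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return b0, b1, sigma_hat

# Made-up data, not from either case study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, sigma_hat = least_squares_fit(xs, ys)
print(round(b0, 3), round(b1, 3), round(sigma_hat, 3))
```

In practice a statistics package does this for you; the sketch is only meant to show that nothing beyond sums and one square root is involved.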

Degrees of Freedom
- n − 2 here
- n − I in one-way ANOVA
- n − 1 in one-sample t-tools
- in general, df = (number of observations) − (number of parameters in the model for the means)

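Standard Errors
$$\mathrm{SE}(\hat\beta_1) = \hat\sigma \sqrt{\frac{1}{(n-1)\, s_X^2}}, \qquad \mathrm{SE}(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar X^2}{(n-1)\, s_X^2}}$$
$$\text{where } s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}$$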
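As a rough illustration of the two standard error formulas, the sketch below computes them for a small made-up set of X values. The function name `coefficient_standard_errors`, the X values, and the value used for the estimate of sigma are assumptions made for the example.

```python
import math

def coefficient_standard_errors(xs, sigma_hat):
    """Return (SE of intercept, SE of slope) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    s2_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)   # sample variance of X
    se_b1 = sigma_hat * math.sqrt(1 / ((n - 1) * s2_x))
    se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / ((n - 1) * s2_x))
    return se_b0, se_b1

# Hypothetical inputs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma_hat = 0.19
se_b0, se_b1 = coefficient_standard_errors(xs, sigma_hat)
print(round(se_b0, 4), round(se_b1, 4))
```

Both standard errors shrink as the X values spread out, since a larger $s_X^2$ appears in the denominator of each.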

Inferences
- With these estimators and standard errors, you can test hypotheses about and compute confidence intervals for $\beta_0$ and $\beta_1$ in the usual way, e.g. $\text{est.} \pm t_{df}(1-\alpha/2)\, \mathrm{SE}(\text{est.})$
- based on the t-distribution with n − 2 degrees of freedom
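The interval recipe can be sketched as a small helper. Everything here is illustrative: `t_interval` is a hypothetical name, the estimate and standard error are made up, and the multiplier 3.182 is the 0.975 quantile of the t-distribution with 3 degrees of freedom taken from a t table (the standard library has no t quantile function).

```python
def t_interval(estimate, se, t_multiplier):
    """Return (lower, upper) for estimate +/- t_multiplier * se."""
    half_width = t_multiplier * se
    return estimate - half_width, estimate + half_width

# Hypothetical slope estimate and standard error from a fit with n = 5,
# so df = n - 2 = 3 and the 95% multiplier is t_3(0.975) = 3.182.
slope_hat = 1.99
se_slope = 0.06
lower, upper = t_interval(slope_hat, se_slope, 3.182)
print(round(lower, 3), round(upper, 3))
```

The same helper works for the intercept; only the estimate and its standard error change, while the multiplier depends on n − 2 alone.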

Regression Software
- In practice you never need to hand-calculate these estimators or their standard errors,
- but you might have to hand-calculate tests or confidence intervals.
- MINITAB and S-Plus
- Hubble data: test the null hypothesis that $\beta_0 = 0$; estimate $\beta_1$

Hubble data
Regression Analysis: DISTANCE versus VELOCITY
The regression equation is
DISTANCE = 0.399 + 0.00137 VELOCITY
Predictor       Coef    SE Coef     T      P
Constant      0.3991     0.1185  3.37  0.003
VELOCITY   0.0013729  0.0002274  6.04  0.000
S = 0.405   R-Sq = 62.4%   R-Sq(adj) = 60.6%
- Constant row: Coef is $\hat\beta_0$ and SE Coef is $\mathrm{SE}(\hat\beta_0)$.
- VELOCITY row: Coef is $\hat\beta_1$ and SE Coef is $\mathrm{SE}(\hat\beta_1)$.
- S is $\hat\sigma$.
- Since p = 0.003 for the test that $\beta_0 = 0$, the data suggest that the simple Big Bang theory (intercept 0) is not correct.
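The T column in regression output of this kind is each coefficient divided by its standard error, which gives a quick check on the table. The sketch below uses the Hubble coefficients and standard errors as I read them from the output (0.3991 and 0.1185 for the intercept, 0.0013729 and 0.0002274 for the slope); the function name `t_ratio` is made up for the example.

```python
def t_ratio(coef, se_coef, null_value=0.0):
    """t-statistic for testing that a coefficient equals null_value."""
    return (coef - null_value) / se_coef

# Values read from the Hubble regression output.
t_intercept = t_ratio(0.3991, 0.1185)        # test of beta_0 = 0
t_slope = t_ratio(0.0013729, 0.0002274)      # test of beta_1 = 0
print(round(t_intercept, 2), round(t_slope, 2))
```

Each ratio is then compared with the t-distribution on n − 2 degrees of freedom to get the P column.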

Hubble data
- Hand-calculate a confidence interval for $\beta_1$: 0.0013729 ± 2.07 (0.0002274) = (0.000908, 0.001844), or roughly (0.89, 1.80) billion years

pH data
Regression Analysis: PH versus log(time)
The regression equation is
PH = 6.98 - 0.726 log(time)
Predictor      Coef  SE Coef       T      P
Constant    6.98363  0.04853  143.90  0.000
log(time)  -0.72566  0.03443  -21.08  0.000
S = 0.08226   R-Sq = 98.2%   R-Sq(adj) = 98.0%
(but these statistics don't answer the questions of interest to us)

Three New Kinds of Inference
1. Estimating the mean value of Y at some specified value $X = X_0$ (e.g. What is the mean pH of steer carcasses 3 hours after slaughter?)
$$\hat\mu\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0, \qquad \mathrm{SE}(\hat\mu\{Y \mid X_0\}) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1)\, s_X^2}}$$

Three New Kinds of Inference
2. Predicting the value of Y for a specific individual at some specified value $X = X_0$ (e.g. What is the pH of a particular steer's (Garth's) carcass 3 hours after slaughter?)
$$\mathrm{pred}\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0$$
$$\mathrm{SE}(\mathrm{pred}\{Y \mid X_0\}) = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1)\, s_X^2}} = \sqrt{\hat\sigma^2 + \mathrm{SE}(\hat\mu\{Y \mid X_0\})^2}$$
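The first two kinds of inference use the same point estimate but different standard errors. The sketch below computes both standard errors for a small made-up example; `se_mean_and_pred`, the X values, and the value of the sigma estimate are assumptions for illustration only.

```python
import math

def se_mean_and_pred(x0, xs, sigma_hat):
    """Return (SE of estimated mean, SE of prediction) at X = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)   # equals (n - 1) * s_X^2
    se_mean = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    # Prediction adds the variability of one new observation about its mean.
    se_pred = math.sqrt(sigma_hat ** 2 + se_mean ** 2)
    return se_mean, se_pred

# Hypothetical inputs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma_hat = 0.19
se_mean_center, se_pred_center = se_mean_and_pred(3.0, xs, sigma_hat)  # at the mean of X
se_mean_far, se_pred_far = se_mean_and_pred(5.0, xs, sigma_hat)        # at the edge of the data
print(round(se_mean_center, 4), round(se_mean_far, 4), round(se_pred_far, 4))
```

Both standard errors are smallest at the mean of the X values and grow as $X_0$ moves away from it, and the prediction standard error can never fall below the estimate of $\sigma$.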

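Three New Kinds of Inference
3. Estimating the value of X that results in $Y = Y_0$ (inverse prediction or calibration) (e.g. About how long do you have to wait after slaughter for the mean pH to reach 6.0?)
$$\hat X = \frac{Y_0 - \hat\beta_0}{\hat\beta_1}, \qquad \mathrm{SE}(\hat X) = \frac{\mathrm{SE}(\hat\mu\{Y \mid \hat X\})}{|\hat\beta_1|} \ \text{ or } \ \frac{\mathrm{SE}(\mathrm{pred}\{Y \mid \hat X\})}{|\hat\beta_1|}$$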

Software for first two inferences
- MINITAB: easy
- S-Plus: a little harder (for me)

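Predicted Values for New Observations
New Obs     Fit  SE Fit      95.0% CI          95.0% PI
      1  6.1861  0.0262  (6.1257, 6.2465)  (5.9870, 6.3852)
Values of Predictors for New Observations
New Obs  log(time)
      1       1.10
- Fit is $\hat\mu\{Y \mid X_0\}$ and SE Fit is $\mathrm{SE}(\hat\mu\{Y \mid X_0\})$, with $X_0 = \ln(3.0) = 1.099$.
- 95% CI: $\hat\mu\{Y \mid X_0\} \pm t_{n-2}(0.975)\, \mathrm{SE}(\hat\mu\{Y \mid X_0\})$
- 95% PI: $\mathrm{pred}\{Y \mid X_0\} \pm t_{n-2}(0.975)\, \mathrm{SE}(\mathrm{pred}\{Y \mid X_0\})$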

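Hand calculations for inverse estimation for $Y_0 = 6.0$
$$\hat X = \frac{Y_0 - \hat\beta_0}{\hat\beta_1} = \frac{6.0 - 6.98}{-0.726} = 1.35, \qquad \exp(1.35) = 3.86 \text{ hours}$$
$$\mathrm{SE}(\hat X) = \frac{\mathrm{SE}(\hat\mu\{Y \mid \hat X\})}{|\hat\beta_1|} = \frac{0.0266}{0.726} = 0.037$$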
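The hand calculation can be reproduced in a few lines. This sketch takes the rounded values used in the hand calculation (intercept 6.98, slope −0.726, standard error of the estimated mean 0.0266) as given; the function name `inverse_estimate` is hypothetical.

```python
import math

def inverse_estimate(y0, b0, b1, se_at_x_hat):
    """Return (x_hat, SE of x_hat) for the X at which the fitted line equals y0."""
    x_hat = (y0 - b0) / b1
    # Divide by the size of the slope so the standard error stays positive.
    se_x_hat = se_at_x_hat / abs(b1)
    return x_hat, se_x_hat

# Rounded values from the pH hand calculation; X is log(time).
x_hat, se_x_hat = inverse_estimate(y0=6.0, b0=6.98, b1=-0.726, se_at_x_hat=0.0266)
hours = math.exp(x_hat)   # back-transform from log(time) to hours
print(round(x_hat, 2), round(hours, 2), round(se_x_hat, 3))
```

Passing the prediction standard error instead of the standard error of the estimated mean gives the version for an individual carcass rather than for the mean pH.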

Related ideas
- computer centering trick
- confidence and prediction bands
- multiple comparison issues

Regression Plot
[Figure: PH versus log(time), showing the fitted regression line with 95% CI and 95% PI bands]

Standard Errors
We have discussed six standard errors:
- SE of the slope
- SE of the intercept
- SE of the estimated mean of Y at X
- SE of the predicted value of Y (new individual) at X
- SE of the inverse estimate of X for a mean Y
- SE of the inverse estimate of X for an individual Y

Correlation
- Correlation is a measure of linear association between two variables.
- Formula on page 94
- between −1 and +1
- zero correlation implies no linear relation
- +1 implies all points fall exactly on a straight line with positive slope
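These properties can be checked numerically with the usual sample (Pearson) correlation coefficient. The slides only cite the formula by page, so treating it as the Pearson coefficient is an assumption; the function name `correlation` and the data points are made up.

```python
import math

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points exactly on a line with positive slope, then with negative slope.
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])
# A symmetric pattern with a clear relationship but no linear association.
r_none = correlation([-1, 0, 1], [1, 0, 1])
print(r_up, r_down, r_none)
```

The last case is a reminder that zero correlation rules out a linear relation only; Y can still depend strongly on X in a nonlinear way.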