Lecture 18 - Regression: Inference, Outliers, and Intervals
Sta102 / BME102 - Colin Rundel - November 9, 2014


Understanding regression output from software

Nature vs. nurture?

In 1966 Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart". The data consist of IQ scores for [an assumed random sample of] 27 identical twins, one raised by foster parents, the other by the biological parents.

[Scatterplot of foster twin IQ vs. biological twin IQ, both ranging from roughly 70 to 130, annotated R = 0.88.]

Regression Output

summary(lm(twins$foster ~ twins$biological))

Call:
lm(formula = twins$foster ~ twins$biological)

Residuals:
     Min       1Q   Median       3Q      Max
-11.3512  -5.7311   0.0574   4.3244  16.3531

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        9.2076     9.2999    0.99    0.332
twins$biological   0.9014     0.0963    9.36  1.2e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-squared: 0.7779, Adjusted R-squared: 0.769
F-statistic: 87.56 on 1 and 25 DF,  p-value: 1.204e-09

Conditions for inference

In order to perform inference, the following conditions must be met:
1. Linearity
2. Nearly normal residuals
3. Constant variability
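This output can be reproduced in a couple of lines of R; a minimal sketch, assuming the data live in a file named twins.csv (an assumed file name) with the column names used above:

    # Fit the foster-IQ ~ biological-IQ model and print the full summary.
    twins <- read.csv("twins.csv")               # assumed file name
    m <- lm(foster ~ biological, data = twins)   # column names follow the slides
    summary(m)   # coefficients, residual SE, R-squared, F-statistic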

Conditions: (1) Linearity

The relationship between the explanatory and the response variable should be linear. Methods for fitting a model to non-linear relationships exist, but are beyond the scope of this class. Check using a scatterplot or a residuals plot.

Anatomy of a residuals plot

[Scatterplot of % in poverty vs. % HS grad with the least squares line and the residuals for two highlighted states:]

Rhode Island:
  % HS grad = 81, % in poverty = 10.3
  predicted % in poverty = 64.68 - 0.62 × 81 = 14.46
  e_RI = % in poverty - predicted % in poverty = 10.3 - 14.46 = -4.16

Washington, DC:
  % HS grad = 86, % in poverty = 16.8
  predicted % in poverty = 64.68 - 0.62 × 86 = 11.36
  e_DC = % in poverty - predicted % in poverty = 16.8 - 11.36 = 5.44

Conditions: (2) Nearly normal residuals

The residuals should be nearly normal. This condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data. Check using a histogram or normal probability plot of the residuals.

[Histogram and normal Q-Q plot of the residuals.]

Conditions: (3) Constant variability

The variability of points around the least squares line should be roughly constant. This implies that the variability of residuals around the line should be roughly constant as well. This is also called homoscedasticity. Check using a residuals plot.

[Scatterplot with the least squares line and the corresponding residuals plot.]
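Once the model is fit, each condition check is a single plotting call; a sketch using base R and the twins model m from above:

    # Residuals vs. explanatory variable: check linearity and constant variability
    plot(twins$biological, resid(m), xlab = "biological IQ", ylab = "residual")
    abline(h = 0, lty = 2)
    # Nearly normal residuals: histogram and normal probability (Q-Q) plot
    hist(resid(m))
    qqnorm(resid(m)); qqline(resid(m))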

Checking conditions

What condition is this linear model violating?

[Scatterplot of y vs. x and the corresponding residuals plot.]

Checking conditions (II)

What condition is this linear model obviously violating?

[Scatterplot and residuals plot for a second model.]

HT for the slope: Back to Nature vs. nurture

[Scatterplot of the twins data, R = 0.88.]

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     9.2076      9.2999     0.99    0.3316
bioIQ           0.9014      0.0963     9.36    0.0000

Testing for the slope

Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of the IQ of the foster twin. What are the appropriate hypotheses?

First consider what the null hypothesis should be: if there is no relationship between the two variables, what value of the slope would we expect to see?

H0: β1 = 0
HA: β1 ≠ 0

Testing for the slope (cont.)

We always use a t-test in inference for regression parameters.

Remember: the test statistic is T = (point estimate - null value) / SE

- The point estimate is b1, the calculated slope for the observed data.
- SE_b1 is the standard error associated with that slope.
- The degrees of freedom associated with the slope are df = n - 2, where n is the sample size.

Remember: We lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters, β0 and β1.

For the twins data:

T = (0.9014 - 0) / 0.0963 = 9.36
df = 27 - 2 = 25
p-value = P(|T| > 9.36) < 0.01

Confidence interval for the slope

Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression are n - 2. What is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     9.2076      9.2999     0.99    0.3316
bioIQ           0.9014      0.0963     9.36    0.0000

Recap: inference for the slope of a SLR model (only one explanatory variable):

Hypothesis test: T = (b1 - null value) / SE_b1, with df = n - 2
Confidence interval: b1 ± t*_(df = n-2) × SE_b1

- The null value is usually 0, since we are usually checking for any relationship between the explanatory and the response variable.
- The regression output gives b1, SE_b1, and the two-tailed p-value for the t-test of the slope where the null value is 0.
- We rarely do inference on the intercept (since it is rarely meaningful), so we will focus on the estimates and inference for the slope.
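A sketch of the same test and interval in R, plugging in the rounded numbers from the slides; confint() recovers the interval directly from the fitted model m:

    b1 <- 0.9014; se_b1 <- 0.0963; n <- 27
    t_stat <- (b1 - 0) / se_b1                            # 9.36
    2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # p-value, about 1.2e-09
    b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1         # 95% CI: (0.70, 1.10)
    confint(m, "biological", level = 0.95)                # same interval from the model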

Caution

Always be aware of the type of data you're working with: random sample, non-random sample, or population.

- Statistical inference, and the resulting p-values, are meaningless when you have population data.
- If you have a sample that is non-random (biased), the results will be unreliable.
- The ultimate goal is to have independent observations, and you know how to check for those by now.

An alternative statistic: variability partitioning

We considered the t-test as a way to evaluate the strength of evidence for a hypothesis test for the slope of the relationship between x and y. However, we can also consider the variability in y explained by x, compared to the unexplained variability. Partitioning the variability in y into explained and unexplained variability requires analysis of variance (ANOVA).

ANOVA output - Sum of squares

            Df   Sum Sq  Mean Sq  F value  Pr(>F)
bioIQ        1  5231.13  5231.13    87.56  0.0000
Residuals   25  1493.53    59.74
Total       26  6724.66

Sum of squares:
SS_Tot = Σ(y_i - ȳ)² = 6724.66 (total variability in y)
SS_Err = Σ(y_i - ŷ_i)² = Σ e_i² = 1493.53 (unexplained variability in residuals)
SS_Reg = Σ(ŷ_i - ȳ)² = SS_Tot - SS_Err = 6724.66 - 1493.53 = 5231.13 (explained variability in y)

Degrees of freedom:
df_Tot = n - 1 = 27 - 1 = 26
df_Reg = 2 - 1 = 1 (there are 2 coefficients)
df_Res = df_Tot - df_Reg = 26 - 1 = 25

ANOVA output - F-test

Mean squares:
MS_Reg = SS_Reg / df_Reg = 5231.13 / 1 = 5231.13
MS_Err = SS_Err / df_Err = 1493.53 / 25 = 59.74

F-statistic:
F(1, 25) = MS_Reg / MS_Err = 87.56 (the ratio of explained to unexplained variability)

The null hypothesis is that all slope coefficients are 0 (here, β1 = 0), and the alternative is that βj ≠ 0 for some j. With a large F-statistic, and a small p-value, we reject H0 and conclude that the linear model is significant.
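In R, anova() on the fitted model reproduces this table (minus the Total row); a sketch that also rebuilds each piece by hand, with m and twins as before:

    anova(m)                                              # Df, Sum Sq, Mean Sq, F, Pr(>F)
    SS_Err <- sum(resid(m)^2)                             # 1493.53, unexplained
    SS_Tot <- sum((twins$foster - mean(twins$foster))^2)  # 6724.66, total
    SS_Reg <- SS_Tot - SS_Err                             # 5231.13, explained
    (SS_Reg / 1) / (SS_Err / (27 - 2))                    # F = 87.56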

Regression Output (revisited)

The F-test also appears in the summary() output shown earlier, alongside R²:

Multiple R-squared: 0.7779, Adjusted R-squared: 0.769
F-statistic: 87.56 on 1 and 25 DF,  p-value: 1.204e-09

ANOVA output - R² calculation

R² = explained variability / total variability = SS_Reg / SS_Tot = 5231.13 / 6724.66 = 0.7779

Types of outliers

How do one or more outliers influence the least squares line? To answer this question, think of where the regression line would be with and without the outlier(s).

[Scatterplot exercise: data with outliers and the least squares line.]

How does the following outlier influence the least squares line?

[Scatterplot exercise: data with a single outlier and the least squares line.]
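The same ratio in R, reusing the sums of squares from the sketch above; summary() also stores R² in the model object:

    SS_Reg / SS_Tot        # 0.7779
    summary(m)$r.squared   # identical value from the fitted model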

Some terminology

- Outliers are points that fall away from the cloud of points.
- Outliers that fall horizontally away from the center of the cloud are called leverage points.
- High leverage points that actually influence the slope of the regression line are called influential points.
- In order to determine if a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, then the point is influential. If not, then it is not.

Influential points

Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1.

[Scatterplot of log(light intensity) vs. log(temp) with least squares lines fit with and without the outliers, alongside a Hertzsprung-Russell Diagram.]

Types of outliers

Which type of outlier is displayed below?

[Scatterplot exercise.]
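Base R can quantify these ideas for any fitted model m; a sketch where the flagging cutoffs are common rules of thumb, not something from the slides:

    lev <- hatvalues(m)           # leverage: distance of x_i from the bulk of the x's
    cd  <- cooks.distance(m)      # influence: how much deleting point i moves the fit
    which(lev > 2 * mean(lev))    # flag high leverage points (rule-of-thumb cutoff)
    which(cd > 4 / length(cd))    # flag candidate influential points (rule of thumb)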

Types of outliers (cont.)

Which type of outlier is displayed below?

[Scatterplot exercise.]

Recap

Are the following statements true or false?

(1) Influential points always change the intercept of the regression line.
(2) Influential points always reduce R².
(3) It is much more likely for a high leverage point to be influential than for a low leverage point.
(4) When the data set includes an influential point, the relationship between the explanatory variable and the response variable is always nonlinear.

Recap (cont.)

[Two scatterplots annotated with their R and R² values, illustrating the effect of an influential point.]

Confidence intervals for average values

A confidence interval for the average (expected) value of y for a given x is

ŷ ± t*_(n-2) · s_e · √( 1/n + (x - x̄)² / ((n - 1) s_x²) )

where s_e is the standard deviation of the residuals,

s_e = √( Σ(y_i - ŷ_i)² / (n - 2) )

Note that when x = x̄ this reduces to

ŷ ± t*_(n-2) · s_e / √n
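Note that s_e is what summary() reports as the residual standard error; a minimal check in R for the twins model m:

    sqrt(sum(resid(m)^2) / (27 - 2))   # 7.729, matches "Residual standard error"
    summary(m)$sigma                   # same value, stored in the model summary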

Example calculation

Calculate a 95% confidence interval for the average IQ score of foster twins whose biological twins have IQ scores of 100 points. Note that the average IQ score of the biological twins in the sample is 95.3 points, with a standard deviation of 15.74 points.

ŷ = 9.2076 + 0.9014 × 100 ≈ 99.3
df = 27 - 2 = 25, so t*_25 = 2.06
(Residual standard error: 7.729 on 25 degrees of freedom)

ME = 2.06 × 7.729 × √( 1/27 + (100 - 95.3)² / (26 × 15.74²) ) = 3.2

CI = 99.3 ± 3.2 = (96.1, 102.5)

Distance from the mean

How would you expect the width of the 95% confidence interval for the average IQ score of foster twins whose biological twins have IQ scores of 130 points (x = 130) to compare to the previous confidence interval (where x = 100)?

[Scatterplot of the twins data highlighting the predictions at x = 100 and x = 130.]

How do the confidence intervals where x = 100 and x = 130 compare in terms of their widths?

ME_100 = 2.06 × 7.729 × √( 1/27 + (100 - 95.3)² / (26 × 15.74²) ) = 3.2
ME_130 = 2.06 × 7.729 × √( 1/27 + (130 - 95.3)² / (26 × 15.74²) ) = 7.5

Recap: confidence intervals

The width of the confidence interval for E(y) increases as x moves away from the center.

- Conceptually: we are much more certain of our predictions at the center of the data than at the edges (and our level of certainty decreases even further when predicting outside the range of the data, i.e. extrapolation).
- Mathematically: as the (x - x̄)² term increases, the margin of error of the confidence interval increases as well.
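predict() carries out this calculation directly; a sketch assuming m and twins as before:

    new <- data.frame(biological = 100)
    predict(m, new, interval = "confidence", level = 0.95)
    # fit about 99.3, lwr about 96.1, upr about 102.5, matching the hand calculation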

Prediction intervals: predicting a value, not an average

Earlier we learned how to calculate a confidence interval for the average y, E(y), for a given x. Suppose that we are not interested in the average, but instead we want to predict a future value of y for a given x. Would you expect there to be more uncertainty around an average or around a specific predicted value?

A prediction interval for y for a given x is

ŷ ± t*_(n-2) · s_e · √( 1 + 1/n + (x - x̄)² / ((n - 1) s_x²) )

where s_e is the standard deviation of the residuals. The formula is very similar to the confidence interval formula, except the variability is higher, since there is a "1" added inside the square root.

Prediction level: if we repeat the study of obtaining a regression data set many times, each time forming a XX% prediction interval at x, and wait to see what the future value of y is at x, then roughly XX% of the prediction intervals will contain the corresponding actual value of y.

Recap - CI vs. PI

[Plots of the twins data showing the confidence band for E(y) and, around it, the wider prediction band for y.]
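The prediction interval is the same predict() call with interval = "prediction"; a sketch reusing new from the previous example:

    predict(m, new, interval = "prediction", level = 0.95)
    # same center, but a much wider interval: it must cover a single new y, not E(y)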

CI for E(y) vs. PI for y - differences

A prediction interval is similar in spirit to a confidence interval, except that

- the prediction interval is designed to cover a moving target, the random future value of y, while
- the confidence interval is designed to cover the fixed target, the average (expected) value of y, E(y).

Although both are centered at ŷ, the prediction interval is wider than the confidence interval, for a given x and confidence level. This makes sense, since

- the prediction interval must take account of the tendency of y to fluctuate from its mean value, while
- the confidence interval simply needs to account for the uncertainty in estimating the mean value.

CI for E(y) vs. PI for y - similarities

- For a given data set, the error in estimating E(y) and ŷ grows as x moves away from x̄. Thus, the further x is from x̄, the wider the confidence and prediction intervals will be.
- If any of the conditions underlying the model are violated, then the confidence intervals and prediction intervals may be invalid as well. This is why it is so important to check the conditions by examining the residuals, etc.
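A final sketch contrasting the two widths across a grid of x values, with m as before: both widths grow as x moves away from its mean, and the prediction interval is wider everywhere:

    grid <- data.frame(biological = seq(70, 130, by = 10))
    ci_tab <- predict(m, grid, interval = "confidence")
    pi_tab <- predict(m, grid, interval = "prediction")
    data.frame(biological = grid$biological,
               ci_width = ci_tab[, "upr"] - ci_tab[, "lwr"],
               pi_width = pi_tab[, "upr"] - pi_tab[, "lwr"])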