Department of Statistics
The Wharton School, University of Pennsylvania
Statistics 621, Fall 2003

Module 3: Inference about the SRM

Mini-Review: Inference for a Mean

An ideal setup for inference about a mean assumes normality, where $\bar y$ is used to estimate $\mu_y$:

$$y_1, \ldots, y_n \;\overset{iid}{\sim}\; N(\mu_y, \sigma_y^2)$$

What is meant by the sampling distribution of $\bar y$?

Astonishing Fact #1: The sampling distribution of $\bar y$ is

$$\bar y \sim N(\mu_y, \; \sigma_y^2 / n)$$

The standard error of $\bar y$ is $SE(\bar y) = s_y / \sqrt{n}$, and $\bar y \pm 2\,SE(\bar y)$ gives approximate 95% CI limits for $\mu_y$. For testing $H_0: \mu_y = c$ vs $H_1: \mu_y \neq c$, the t-ratio is $(\bar y - c)/SE(\bar y)$; if |t-ratio| > 2, or the p-value < .05, or the 95% CI does not contain $c$, reject $H_0$ at the .05 level of significance.

Sampling Distributions in Regression

For data $(x_1, y_1), \ldots, (x_n, y_n)$ generated with the SRM,

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \qquad \varepsilon_1, \ldots, \varepsilon_n \;\overset{iid}{\sim}\; N(0, \sigma_\varepsilon^2)$$

What is meant by the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$? How could you use the simulation in utopia.jmp to generate these sampling distributions?

Astonishing Fact #2: The sampling distribution¹ of $\hat\beta_1$

1) has mean $E(\hat\beta_1) = \beta_1$,
2) has standard deviation
$$SD(\hat\beta_1) = \frac{\sigma_\varepsilon}{\sqrt{n-1}\; SD(x)}$$
3a) is exactly normal, and
3b) is approximately normal even if the errors $\varepsilon_1, \ldots, \varepsilon_n$ are not normally distributed.

Note: The sampling distribution of $\hat\beta_0$ has the same properties, but with a slightly different formula for $SD(\hat\beta_0)$.

¹ More precisely, this result refers to the sampling distribution when $y_1, \ldots, y_n$ vary, but the values of $x_1, \ldots, x_n$ are treated as fixed constants that are the same for every sample. Things work out similarly even if the $x_i$ are random.
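A utopia.jmp-style simulation can also be sketched in code. The following is a minimal Python sketch, not the course's JMP simulation; the parameter values (beta0, beta1, sigma_eps, n) are made up for illustration. It holds the x's fixed across samples, as in footnote 1, and checks the simulated slopes against Astonishing Fact #2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SRM parameters, chosen only for illustration
beta0, beta1, sigma_eps = 0.0, 2.0, 1.0

# Fixed x's, reused in every simulated sample (see footnote 1)
n = 25
x = np.linspace(0, 10, n)

slopes = []
for _ in range(10_000):
    y = beta0 + beta1 * x + rng.normal(0, sigma_eps, n)
    b1_hat, b0_hat = np.polyfit(x, y, 1)   # least squares slope and intercept
    slopes.append(b1_hat)

slopes = np.array(slopes)
print(slopes.mean())   # close to beta1, per property (1)
print(slopes.std())    # close to the theoretical SD, per property (2)
print(sigma_eps / (np.sqrt(n - 1) * x.std(ddof=1)))
```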

Inference about β₀ and β₁

Typically the intercept β₀ and slope β₁ are estimated by $\hat\beta_0$ and $\hat\beta_1$, and the standard errors of $\hat\beta_0$ and $\hat\beta_1$ are given by JMP.² We'll denote them by $SE(\hat\beta_0)$ and $SE(\hat\beta_1)$.

Confidence Intervals

Approximate 95% CIs for β₀ and β₁ are given by
$$\hat\beta_0 \pm 2\,SE(\hat\beta_0) \qquad \text{and} \qquad \hat\beta_1 \pm 2\,SE(\hat\beta_1)$$

Hypothesis Tests

For $H_0: \beta_0 = c$ vs $H_1: \beta_0 \neq c$, the t-ratio is $\dfrac{\hat\beta_0 - c}{SE(\hat\beta_0)}$.

For $H_0: \beta_1 = c$ vs $H_1: \beta_1 \neq c$, the t-ratio is $\dfrac{\hat\beta_1 - c}{SE(\hat\beta_1)}$.

If |t-ratio| > 2, or the p-value < .05, or the 95% CI does not contain c, reject H₀ at the .05 level of significance. H₀: β₁ = 0 is the usual null hypothesis of interest. Why?

JMP provides the details: estimates, SE's, t statistics, and p-values for inference about β₀ and β₁.

Example: Consider inference for β₀ and β₁ in the diamond regression.

Bivariate Fit of Price By Weight
[Scatterplot: Price (300 to 1100) vs Weight (.10 to .35 carats), with the LS line]

Linear Fit: Price = -259.6259 + 3721.0249 Weight

Summary of Fit
  RSquare                     .978261
  RSquare Adj                 .977788
  Root Mean Square Error      31.84052
  Mean of Response            500.0833
  Observations (or Sum Wgts)  48

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  -259.6259   17.31886   -14.99   <.0001
  Weight     3721.0249   81.78588    45.50   <.0001

² These are obtained by simply substituting RMSE for σ_ε in the formulas for the standard deviations of their sampling distributions.
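Outside JMP, the same table can be produced with standard regression software. Here is a hedged Python sketch using statsmodels (assuming it is installed); the data are a synthetic stand-in for diamond.jmp, not the actual course data, so the numbers will only resemble the output above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Synthetic stand-in for diamond.jmp (NOT the course data):
# 48 rings, weight in carats, price in Singapore dollars
weight = rng.uniform(0.12, 0.35, 48)
price = -260 + 3720 * weight + rng.normal(0, 32, 48)

X = sm.add_constant(weight)      # intercept column plus Weight
fit = sm.OLS(price, X).fit()

print(fit.params)                # estimates of beta_0, beta_1
print(fit.bse)                   # their standard errors
print(fit.tvalues, fit.pvalues)  # t ratios and p-values, as in JMP
print(fit.conf_int(alpha=0.05))  # exact 95% CIs (t-based, close to +/- 2 SE)
```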

Here, β₀ and β₁ are estimated by $\hat\beta_0 \approx -259.6$ and $\hat\beta_1 \approx 3721.0$. The standard errors of these estimates are $SE(\hat\beta_0) \approx 17.3$ and $SE(\hat\beta_1) \approx 81.8$.

Approximate 95% confidence interval limits for β₀ and β₁ are

-259.6 ± 2(17.3) and 3721.0 ± 2(81.8)

Should the hypothesis H₀: β₁ = 0 be rejected? Yes, because |t| = 45.5 > 2 and because the p-value < .0001.

Should the hypothesis H₀: β₁ = 3800 be rejected? No, because |t| = |(3721.0 - 3800)/81.8| = 0.97 < 2.

Why is it interesting that H₀: β₀ = 0 can be rejected?

Suppose that instead of the full diamond.jmp data set, we only had the 28 observations for which Weight ≤ .18. The LS regression with these 28 observations yields

Bivariate Fit of Price By Weight
[Scatterplot: Price vs Weight for the 28 lightest diamonds, with the LS line]

Linear Fit: Price = -182.0635 + 3247.55 Weight

Summary of Fit
  RSquare                     .68711
  RSquare Adj                 .67518
  Root Mean Square Error      23.5833
  Mean of Response            348.8571
  Observations (or Sum Wgts)  28

Parameter Estimates
  Term       Estimate   Std Error  t Ratio  Prob>|t|
  Intercept  -182.0635  70.5896    -2.58    .0157
  Weight     3247.55    429.6156    7.56    <.0001

How has the regression output changed?
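The two slope tests reduce to one-line calculations from the full-data output; a quick check in Python, using the estimates as read from the JMP table above:

```python
b1, se_b1 = 3721.0249, 81.78588      # slope and its SE from the diamond fit

t_zero = (b1 - 0) / se_b1            # H0: beta_1 = 0
t_3800 = (b1 - 3800) / se_b1         # H0: beta_1 = 3800

print(round(t_zero, 2))              # about 45.5: |t| > 2, reject H0
print(round(t_3800, 2))              # about -0.97: |t| < 2, do not reject H0
```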

Confidence Intervals for the Regression Line

Where does the true population regression line lie? What is the average price for all diamonds of a chosen weight?

After running a regression based on $(x_1, y_1), \ldots, (x_n, y_n)$, each point $(x, \hat y_x)$ on the LS regression line
$$\hat y_x = \hat\beta_0 + \hat\beta_1 x$$
is an estimate of the corresponding point $(x, \mu_{y|x})$ on the true regression line
$$\mu_{y|x} = \beta_0 + \beta_1 x$$

The estimate $\hat y_x$ is a statistic with a sampling distribution. An astonishing fact again comes to the rescue: the sampling distribution of $\hat y_x$ is approximately normal with mean $\mu_{y|x}$ and a standard error $SE(\hat y_x)$ that can be computed. An approximate 95% CI for $\mu_{y|x}$ is obtained as
$$\hat y_x \pm 2\,SE(\hat y_x)$$

JMP provides³ a graph of the exact 95% CIs for $\mu_{y|x}$ over the whole line. For the regression on the smaller diamond data, these confidence bands are seen to be

[Scatterplot: Price vs Weight for the smaller diamonds, with the LS line and 95% confidence bands that widen away from the center of the data]

What happens to the 95% CI for the true regression line $\mu_{y|x}$ as you get farther away from $\bar x$? This phenomenon can be thought of as a Statistical Extrapolation Penalty.⁴

Note that the LS line for the full data set is contained within the confidence bands. Why is this reasonable?

³ After executing the Fit Line subcommand, right-click next to the Linear Fit label and select Confid Curves Fit from the pop-up menu to obtain this plot.
⁴ Even though the intervals widen as we extrapolate, this penalty is rather optimistic because it assumes that the model we have fit is correct for all values of x. If you price a big diamond with this model, you'll see that the interval is not nearly wide enough!
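The notes leave the formula for $SE(\hat y_x)$ to JMP. As a rough sketch, the standard simple-regression formula can be coded directly; this is a generic illustration under the usual SRM assumptions, with ±2 standing in for the exact t multiplier that JMP uses:

```python
import numpy as np

def line_ci(x_new, x, y, mult=2.0):
    """Approximate 95% CI for the true regression line at x_new, using
    SE(y_hat_x) = RMSE * sqrt(1/n + (x_new - xbar)^2 / sum((x_i - xbar)^2))."""
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)
    rmse = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
    xbar = x.mean()
    se_fit = rmse * np.sqrt(1/n + (x_new - xbar)**2 / np.sum((x - xbar)**2))
    y_hat = b0 + b1 * x_new
    return y_hat - mult * se_fit, y_hat + mult * se_fit

# The band is narrowest at xbar and widens as x_new moves away:
# the statistical extrapolation penalty in formula form.
```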

Predicting Individual Values with a Regression

Where will a future value of the response y lie? How much might I pay for a specific 1/4 carat diamond?

After running a regression based on $(x_1, y_1), \ldots, (x_n, y_n)$, each point $(x, \hat y_x)$ on the LS regression line $\hat y_x = \hat\beta_0 + \hat\beta_1 x$ is an estimate of the corresponding future point $(x, y_x)$ generated by the SRM:
$$y_x = \beta_0 + \beta_1 x + \varepsilon_x$$

What's the difference between $y_x$ and $\mu_{y|x}$ above?

To accommodate the extra variation of $y_x$ due to $\varepsilon_x$, an approximate 95% prediction interval (PI) for $y_x$ is obtained as
$$\hat y_x \pm 2\sqrt{SE(\hat y_x)^2 + RMSE^2}$$
This interval has the interpretation that a new observation at x will fall inside it with probability approximately 95%.

JMP provides⁵ a graph of the exact 95% PIs for $y_x$ over the whole line. For the regression on the smaller diamond data, these prediction bands are seen to be

[Scatterplot: Price vs Weight for the smaller diamonds, with the LS line and 95% prediction bands]

These prediction bands are wider than the confidence bands for the true regression line shown above. Why is this reasonable?

Extrapolate with caution! If x is not in the range of the data, predicting $y_x$ is especially dangerous because the linear model may fail. Consider pricing the Hope Diamond (at 45.5 carats) with this model. Another example: average systolic blood pressure in people is well approximated by $y \approx 118 + .43x$ for $x \le 60$, where y = blood pressure and x = age. But what does this line predict when x is well beyond 60?

⁵ After executing the Fit Line subcommand, right-click next to the Linear Fit label and select Confid Curves Indiv from the pop-up menu to obtain this plot.
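The PI differs from the CI only by the extra RMSE² term under the square root. A minimal sketch, reusing se_fit and rmse as computed in the previous sketch:

```python
import numpy as np

def prediction_interval(y_hat, se_fit, rmse, mult=2.0):
    """Approximate 95% PI: y_hat +/- mult * sqrt(SE(y_hat)^2 + RMSE^2)."""
    se_pred = np.sqrt(se_fit**2 + rmse**2)   # extra RMSE term for epsilon_x
    return y_hat - mult * se_pred, y_hat + mult * se_pred

# Because of the added RMSE^2 term, the PI is always wider than the CI for
# the line at the same x, and never narrower than +/- 2 * RMSE.
```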

R²: Index of Performance

The next piece of output that we'll consider from the full diamond regression is RSquare = .978:

Summary of Fit
  RSquare                     .978261
  RSquare Adj                 .977788
  Root Mean Square Error      31.84052
  Mean of Response            500.0833
  Observations (or Sum Wgts)  48

This number is called R² and is widely interpreted as the proportion of variation explained by the regression.

The intuition behind R² is based on the decomposition⁶

  Response_i = Signal_i + Noise_i
or
  $y_i = \hat y_i + e_i$

An Amazing Identity: When the signal comes from a regression, it turns out that
$$\sum_i (y_i - \bar y)^2 = \sum_i (\hat y_i - \bar y)^2 + \sum_i (y_i - \hat y_i)^2$$

This identity is often written as

  Total SS = Model SS + Residual SS

where SS reads "Sum of Squares." Sometimes⁷ the Residual SS is called the Error SS, but since it comes from the residuals, the name Residual SS seems better. What do these SS quantities measure?

⁶ This language comes from electrical engineering. In that context, the signal might come from a radio or TV station. The unwanted things that contaminate your reception are called noise.
⁷ In particular, JMP calls the Residual SS the Error SS.
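The Amazing Identity is easy to verify numerically. A small Python check on synthetic data (any data would do; these values are made up); it also confirms the two R² formulas and the R² = r² fact stated next:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
y = 3 + 2 * x + rng.normal(0, 0.5, 50)     # illustrative data only

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x                        # signal (fitted values)
e = y - y_hat                              # noise (residuals)

total_ss = np.sum((y - y.mean())**2)
model_ss = np.sum((y_hat - y.mean())**2)
resid_ss = np.sum(e**2)

print(np.isclose(total_ss, model_ss + resid_ss))      # the Amazing Identity
r2 = model_ss / total_ss
print(np.isclose(r2, 1 - resid_ss / total_ss))        # both R^2 formulas agree
print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))   # R^2 = r^2 (simple regression)
```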

R² is defined by either of the two expressions
$$R^2 = \frac{\text{Model SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}$$
which is the proportion of the total variation explained by the regression. Note how R² compares the Total SS and Residual SS to capture the usefulness of using x to predict y. (p 9)⁸

R² is often used as a measure of the effectiveness of a regression. Advice: resist the temptation to think of R² in absolute terms. Regression is a statistical tool for extracting information from data. Its value depends on the value of the information provided. RMSE also provides useful information about the effectiveness of a model. Do R² and RMSE answer the same question about a model?

Curious (but useful) fact: in simple regression R² = r², where, as in Stat 603, r is the sample correlation.

⁸ These page numbers refer to the BAR casebook.

The Impact of Outliers: Another Example

Outliers can impact inferences from regression in a dramatic fashion.

Value of Housing Construction (p 89)

The data set cottage.jmp gives the profits obtained by a construction firm and the square footage of the properties. The scatterplot shows that the firm has built one rather large cottage. This is an outlier in the sense that it is very different from the rest of the points. (p 9)

Bivariate Fit of Profit By Sq_Feet
[Scatterplot: Profit (10,000 to 50,000) vs Sq_Feet (500 to 3500), with the LS line; one very large cottage sits far to the right]

Linear Fit: Profit = -416.8594 + 9.7505469 Sq_Feet

Summary of Fit
  RSquare                     .779667
  RSquare Adj                 .765896
  Root Mean Square Error      3572.379
  Mean of Response            8347.799
  Observations (or Sum Wgts)  18

Parameter Estimates
  Term       Estimate    Std Error  t Ratio  Prob>|t|
  Intercept  -416.8594   1437.15    -0.29    .7755
  Sq_Feet     9.7505469  1.295848    7.52    <.0001

What do R² and RMSE tell us about this model? What is the interpretation of the 95% CI for the slope? What is the interpretation of the 95% CI for the intercept?

Without the Large Property

How does the fitted model change when we set aside the large cottage and refit the model without this one? (p 94)

Bivariate Fit of Profit By Sq_Feet
[Scatterplot: Profit vs Sq_Feet with the large property excluded, with the LS line]

Linear Fit: Profit = 2245.45 + 6.13746 Sq_Feet

Summary of Fit
  RSquare                     .075205
  RSquare Adj                 .013552
  Root Mean Square Error      3633.591
  Mean of Response            6820.899
  Observations (or Sum Wgts)  17

Parameter Estimates
  Term       Estimate  Std Error  t Ratio  Prob>|t|
  Intercept  2245.45   4237.49     0.53    .6039
  Sq_Feet    6.13746   5.55667     1.10    .2868

What has happened to R²? To RMSE? To the CI for the slope?

Which version of this model should the firm use to estimate profits on the next large cottage it is considering building? What additional information about these construction projects would you like to have in order to make a decision?

Leverage and Outliers

The dramatic effect of removing the outlying large cottage in this last example illustrates the impact of a leveraged outlier.

Leverage: points that are far from the rest of the data along the x-axis are said to be leveraged. BAR gives the formula for leverage on page 63.

Heuristic: moving away from the center of the predictor increases the possible effect of a single observation on the fit, much like moving your weight out to the end of a see-saw. As your weight moves farther from the fulcrum, you can lift more weight on the other side.

Leverage is a property of the values of the predictor, not the response. Leveraged points are not necessarily bad; in fact, they improve the accuracy of your estimate of the slope.⁹ Just recognize that you are giving some observations a bigger role than others.

⁹ In particular, leveraged observations are those that contribute the most to the variation in the predictor. Since these points spread out the values of the predictor, they make it easier to estimate the slope of the regression.
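Leverage depends only on the predictor values, so it is easy to compute directly. Below is a minimal Python sketch on synthetic stand-in data (not cottage.jmp itself), using the usual simple-regression leverage formula $h_i = 1/n + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$, and showing how the fit can move when the leveraged point is set aside:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in: 17 modest properties plus one very large one
sqft = np.append(rng.uniform(500, 1500, 17), 3500.0)
profit = 2000 + 6.0 * sqft + rng.normal(0, 3500, 18)

# Leverage uses only the x values
n = len(sqft)
h = 1 / n + (sqft - sqft.mean())**2 / np.sum((sqft - sqft.mean())**2)
print(h.round(2))      # the large cottage has by far the highest leverage

# Refit with and without the leveraged point; the slope can swing widely
print(np.polyfit(sqft, profit, 1))              # slope, intercept with it
print(np.polyfit(sqft[:-1], profit[:-1], 1))    # slope, intercept without it
```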

Take-Away Review

Inference for regression benefits from the same sort of astonishing facts that made inference for means possible in Stat 603. In particular, the sampling distribution of the slope is approximately

$$\hat\beta_1 \sim N\!\left(\beta_1, \; \frac{\sigma_\varepsilon^2}{(n-1)\,SD(x)^2}\right)$$

So we can form confidence intervals as before, namely as estimate ± 2 SE(estimate). We can use these same ideas to construct confidence intervals for the average of the response at any value of the predictor, as well as prediction intervals for a specific response. R² summarizes the proportion of the variation of the response explained by the model; RMSE shows the SD of the noise that remains.

Next Time

Getting more data gives you better estimates of the slope and intercept, but has little impact on the accuracy of prediction. The only way to improve R² and reduce RMSE is to add more predictors. This is the domain of multiple regression.