Inference for Regression


Inference for Regression
Section 9.4

Cathy Poliak, Ph.D.
cathy@math.uh.edu
Office in Fleming 11c
Department of Mathematics, University of Houston

Lecture 13b - 3339

Outline

1. Inference
2. Inferences for β1
3. F-test

Introduction Questions

Because elderly people may have difficulty standing to have their heights measured, a study looked at predicting overall height from height to the knee. Here are data (in centimeters, cm) for five elderly men:

Knee Height (cm):     57.7   47.4   43.5   44.8   55.2
Overall Height (cm): 192.1  153.3  146.4  162.7  169.1

1. Which variable is the response (dependent) variable?
a) Knee Height
b) Overall Height
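Below is a sketch of how these data might be entered and plotted in R; the variable names knee and overall match the lm() output shown on a later slide.

# Enter the five paired observations from the table above
knee    <- c(57.7, 47.4, 43.5, 44.8, 55.2)       # knee height, cm
overall <- c(192.1, 153.3, 146.4, 162.7, 169.1)  # overall height, cm

# Scatterplot of overall height against knee height
plot(knee, overall, xlab = "Knee Height (cm)", ylab = "Overall Height (cm)")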

Introduction Questions

The following is a scatterplot of the data. [Scatterplot: Knee Height (cm) on the horizontal axis, Overall Height (cm) on the vertical axis.]

2. Which is the best description based on the scatterplot?
a) Positive, strong, linear relationship.
b) Negative, strong, linear relationship.
c) Negative, weak relationship.
d) None of the above.

Introduction Questions

3. The following is output from R for the lm (linear models) function. Give the LSRL equation for the relationship between knee height and overall height.

> height.lm = lm(overall ~ knee)
> summary(height.lm)

Call:
lm(formula = overall ~ knee)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  44.1315    38.3986   1.149   0.3338
knee          2.4254     0.7673   3.161   0.0508 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.766 on 3 degrees of freedom
Multiple R-squared: 0.7691, Adjusted R-squared: 0.6921
F-statistic: 9.992 on 1 and 3 DF, p-value: 0.05083

a) ŷ = 44.1315 + 2.4254x
b) ŷ = 38.3986 + 0.7673x
c) ŷ = 2.4254 + 44.1315x
d) ŷ = 1.149 + 3.161x

Is this good at predicting the response?

R² is the fraction (often reported as a percent) of the variability in the response variable (Y) that is explained by the least-squares regression on the explanatory variable. It measures how successful the regression equation is at predicting the response: the closer R² is to one (100%), the better our equation is at predicting the response variable. We will look later at how this is calculated. In the R output it is the Multiple R-squared value.
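As a sketch, the value can also be pulled directly out of the summary of the height model fit on the previous slide:

summary(height.lm)$r.squared   # 0.7691, the "Multiple R-squared" in the output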

Is this good at predicting the response?

A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

residual = observed y − predicted y

We can determine a residual for each observation; the closer the residuals are to zero, the better we are at predicting the response variable. Plotting the residuals for each observation gives a residual plot.
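A minimal sketch of computing and plotting the residuals for the height model fit earlier:

res <- overall - fitted(height.lm)   # residual = observed y - predicted y
res                                  # same values as residuals(height.lm)

# Residual plot: residuals against the fitted values
plot(fitted(height.lm), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)                        # reference line at zero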

Estimating the Regression Parameters

In the simple linear regression setting, we use the slope b1 and intercept b0 of the least-squares regression line to estimate the slope β1 and intercept β0 of the population regression line. The standard deviation σ in the model is estimated by the regression standard error

s = √( Σ(yᵢ − ŷᵢ)² / (n − 2) ) = √( Σ residuals² / (n − 2) )

Recall that yᵢ is the observed value from the data set and ŷᵢ is the predicted value from the equation. In R, s is called the Residual Standard Error, reported in the last paragraph of the summary.
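A sketch of computing s by hand for the height model and checking it against the summary output (here n = 5, so n − 2 = 3 degrees of freedom):

n <- length(overall)                          # n = 5
sqrt(sum(residuals(height.lm)^2) / (n - 2))   # 9.766, the regression standard error
summary(height.lm)$sigma                      # same value, extracted from the summary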

Determining if the Model is Good

For the sample we can use R² and the residuals to determine whether the equation is a good way of predicting the response variable. Another way is to determine whether the explanatory variable is needed (significant) in the equation. These tests of significance and confidence intervals in regression analysis are based on assumptions about the error term ε.

Assumptions about the error term ε

1. The error term ε is a random variable with a mean (expected value) of zero, that is, E(ε) = 0. An estimate of ε at each value of the X variable is the residual, residual = observed y − predicted y.
2. The variance of ε, denoted σ², is the same for all values of x. The estimate of σ² is s² = MSE = SSE/(n − 2) = Σ(yᵢ − ŷᵢ)²/(n − 2).
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable.

Residual plots help us assess the fit of a regression line and determine whether these assumptions are met.
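One common way to check these assumptions in R (a sketch, not part of the original slides) is the built-in set of diagnostic plots for a fitted lm object:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(height.lm)        # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # restore the default layout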

Definitions of Regression Output

1. The error sum of squares, denoted SSE: SSE = Σ(yᵢ − ŷᵢ)²
2. A quantitative measure of the total amount of variation in the observed values is the total sum of squares, denoted SST: SST = Σ(yᵢ − ȳ)²
3. The regression sum of squares, denoted SSR, is the amount of the total variation that is explained by the model: SSR = Σ(ŷᵢ − ȳ)²
4. The coefficient of determination: r² = SSR/SST
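A sketch of computing these quantities directly for the height model and confirming that r² = SSR/SST matches the Multiple R-squared value:

y    <- overall
yhat <- fitted(height.lm)

SSE <- sum((y - yhat)^2)        # error sum of squares
SST <- sum((y - mean(y))^2)     # total sum of squares
SSR <- sum((yhat - mean(y))^2)  # regression sum of squares

SSR / SST                       # coefficient of determination, 0.7691
SSE + SSR                       # equals SST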

Finding these values using R

> anova(shelf.lm)
Analysis of Variance Table

Response: sold
          Df Sum Sq Mean Sq F value    Pr(>F)
space      1  20535   20535  21.639 0.0009057 ***
Residuals 10   9490     949
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conditions for regression inference

- The sample is an SRS from the population.
- There is a linear relationship in the population.
- The standard deviation of the responses about the population line is the same for all values of the explanatory variable.
- The response varies Normally about the population regression line.

t Test for Significance of β1

Hypotheses: H0: β1 = β10 versus Ha: β1 ≠ β10 (usually β10 = 0).

Test statistic:

t = (observed − hypothesized) / (standard error of the observed) = (b1 − β10) / SE_b1

where observed = b1, hypothesized = β10 = 0, and

SE_b1 = s / √( Σ(xᵢ − x̄)² )

with degrees of freedom df = n − 2.

P-value: based on a t distribution with n − 2 degrees of freedom.
Decision: Reject H0 if p-value ≤ α.
Conclusion: If H0 is rejected, we conclude that the explanatory variable x can be used to predict the response variable y.

Testing β1

1. We want to test H0: β1 = 0 versus Ha: β1 ≠ 0 for the coffee sales.
2. Test statistic: t = (7.4 − 0)/1.591 = 4.652
3. P-value: 2 P(T > 4.652) = 2 × 0.000453 = 0.000906
4. Decision: Reject the null hypothesis.
5. Conclusion: β1 is significantly different from zero, thus shelf space can be used to predict the number of units sold.
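A sketch reproducing these numbers from the fitted shelf-space model (shown on the next slide) using pt():

b1  <- coef(shelf.lm)["space"]                                # 7.4
se1 <- summary(shelf.lm)$coefficients["space", "Std. Error"]  # 1.591

t.stat <- (b1 - 0) / se1                          # 4.652
2 * pt(abs(t.stat), df = 10, lower.tail = FALSE)  # 0.000906, two-sided p-value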

R code

> shelf.lm = lm(sold ~ space)
> summary(shelf.lm)

Call:
lm(formula = sold ~ space)

Residuals:
   Min     1Q Median     3Q    Max
-42.00 -26.75   5.50  21.75  41.00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  145.000     21.783   6.657 5.66e-05 ***
space          7.400      1.591   4.652 0.000906 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.81 on 10 degrees of freedom
Multiple R-squared: 0.6839, Adjusted R-squared: 0.6523
F-statistic: 21.64 on 1 and 10 DF, p-value: 0.0009057

Height

Because elderly people may have difficulty standing to have their heights measured, a study looked at predicting overall height from height to the knee. Here are data (in centimeters, cm) for five elderly men:

Knee Height (cm):     57.7   47.4   43.5   44.8   55.2
Overall Height (cm): 192.1  153.3  146.4  162.7  169.1

1. Test H0: β1 = 0 versus Ha: β1 ≠ 0.
2. Give a conclusion about using knee height to predict overall height.

Confidence Intervals for β1

If we want a range of plausible values for the slope we can use a confidence interval. Remember that confidence intervals have the form

estimate ± t* × (standard error of the estimate)

The confidence interval for β1 is b1 ± t(α/2, n−2) × SE_b1, where t* is from Table D with degrees of freedom n − 2 and n = number of observations. In R we can get this with confint(name.lm, level = 0.95):

> confint(shelf.lm)
                 2.5 %    97.5 %
(Intercept) 96.464405 193.53560
space        3.855461  10.94454
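A sketch of the hand computation behind confint() for the slope (n = 12 observations in the shelf-space data, so df = 10):

b1    <- coef(shelf.lm)["space"]
se1   <- summary(shelf.lm)$coefficients["space", "Std. Error"]
tcrit <- qt(1 - 0.05/2, df = 12 - 2)   # 95% confidence, df = n - 2

b1 + c(-1, 1) * tcrit * se1            # (3.855, 10.945), matching confint()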

Inferences Concerning μ̂_y

Let Ŷ = β̂0 + β̂1 x*, where x* is some fixed value of x. Then:

1. The mean value of Ŷ is E(Ŷ) = β0 + β1 x*.
2. The variance of Ŷ is V(Ŷ) = σ² ( 1/n + (x* − x̄)²/Sxx ).
3. Ŷ has a normal distribution.
4. The 100(1 − α)% confidence interval for μ_Y, the expected value of Y at a specific value x*, is

μ̂_y(x*) ± t(α/2, n−2) √( s² ( 1/n + (x* − x̄)²/Sxx ) )
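A sketch of this interval computed by hand for the shelf-space model at x* = 12 with 90% confidence (assuming the predictor vector space used to fit shelf.lm is still available); the next slide gets the same interval from predict():

x.star <- 12
n      <- 12
s2     <- summary(shelf.lm)$sigma^2       # estimate of sigma^2
Sxx    <- sum((space - mean(space))^2)    # S_xx

fit <- sum(coef(shelf.lm) * c(1, x.star)) # b0 + b1 * x.star = 233.8
me  <- qt(1 - 0.10/2, df = n - 2) *
       sqrt(s2 * (1/n + (x.star - mean(space))^2 / Sxx))
fit + c(-1, 1) * me                       # about (217.6, 250.0)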

R code

> predict(shelf.lm, newdata = data.frame(space = 12), interval = "c", level = 0.9)
    fit      lwr      upr
1 233.8 217.6177 249.9823

F-distribution

The F distribution with ν1 degrees of freedom in the numerator and ν2 degrees of freedom in the denominator is the distribution of a random variable

F = (U/ν1) / (V/ν2),

where U ~ χ²(df = ν1) and V ~ χ²(df = ν2) are independent. That F has this distribution is indicated by F ~ F(ν1, ν2).

Notice that U = SSR/σ² ~ χ²(df = 1) and V = SSE/σ² ~ χ²(df = n − 2) are independent. Let MSE = SSE/(n − 2) and MSR = SSR/1. Then

F = MSR/MSE = (SSR/1) / (SSE/(n − 2)) ~ F(1, n − 2),

so we can use the F distribution to test the hypotheses H0: β1 = 0 versus Ha: β1 ≠ 0.
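A sketch of this computation using the sums of squares from the shelf-space ANOVA table on the next slide:

SSR <- 20535; SSE <- 9490; n <- 12
F.stat <- (SSR / 1) / (SSE / (n - 2))     # MSR/MSE = 21.64; note 4.652^2 = 21.64
pf(F.stat, 1, n - 2, lower.tail = FALSE)  # 0.0009057, same p-value as the t test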

F-test for Shelf Space

Analysis of Variance Table

Response: sold
          Df Sum Sq Mean Sq F value    Pr(>F)
space      1  20535   20535  21.639 0.0009057 ***
Residuals 10   9490     949
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: F = t² and the p-value is the same as in the t test for the slope.

Example

The following data were collected comparing score on a measure of test anxiety with exam score:

Measure of test anxiety: 23 14 14  0  7 20 20 15 21
Exam score:              43 59 48 77 50 52 46 51 51

We will use R to:
- Construct a scatterplot.
- Find the LSRL and fit it to the scatterplot.
- Find r and r². Does there appear to be a linear relationship between the two variables? Based on what you found, would you characterize the relationship as positive or negative? Strong or weak?
- Draw the residual plot. What does the residual plot reveal?
- Test H0: β1 = 0 versus Ha: β1 ≠ 0.
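A sketch of how this analysis might run in R (the object name anx.lm is an arbitrary choice):

anxiety <- c(23, 14, 14, 0, 7, 20, 20, 15, 21)
score   <- c(43, 59, 48, 77, 50, 52, 46, 51, 51)

plot(anxiety, score)              # scatterplot
anx.lm <- lm(score ~ anxiety)     # least-squares regression line
abline(anx.lm)                    # add the LSRL to the scatterplot

cor(anxiety, score)               # r
summary(anx.lm)                   # r^2 and the t test of H0: beta1 = 0

plot(fitted(anx.lm), residuals(anx.lm))   # residual plot
abline(h = 0)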
