STAT 515 -- Chapter 11: Regression

Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship between the two r.v.'s? Can we use the values of one r.v. to predict the other r.v.? Often we assume a straight-line relationship between the two variables; this is known as simple linear regression.

Probabilistic vs. Deterministic Models

If there is an exact relationship between two (or more) variables that can be predicted with certainty, without any random error, this is known as a deterministic relationship. Examples:

In statistics, we usually deal with situations having random error, so exact predictions are not possible. This implies a probabilistic relationship between the two variables.

Example: Y = breathalyzer reading, X = amount of alcohol consumed (fl. oz.)

We typically assume the random errors balance out, i.e., they average zero. Then this is equivalent to assuming the mean of Y, denoted E(Y), equals the deterministic component.

Straight-Line Regression Model

Y = β₀ + β₁X + ε

Y = response variable (dependent variable)
X = predictor variable (independent variable)
ε = random error component
β₀ = Y-intercept of the regression line
β₁ = slope of the regression line

Note that the deterministic component of this model is E(Y) = β₀ + β₁X. Typically, in practice, β₀ and β₁ are unknown parameters; we estimate them using the sample data.

Response Variable (Y): Measures the major outcome of interest in the study.
Predictor Variable (X): Another variable whose value explains, predicts, or is associated with the value of the response variable.

Fitting the Model (Least Squares Method)

If we gather data (X, Y) for several individuals, we can use these data to estimate β₀ and β₁ and thus estimate the linear relationship between Y and X.

First step: Decide whether a straight-line relationship between Y and X makes sense. Plot the bivariate data using a scattergram (scatterplot).

Once we settle on the best-fitting regression line, its equation gives a predicted Y-value for any new X-value. How do we decide, given a data set, which line is the best-fitting one?

Note that usually no line will go through all the points in the data set. For each point, the error is the difference between the observed Y-value and the Y-value predicted by the line. (Some errors are positive, some negative.) We want the line that makes these errors as small as possible, so that the line is close to the points.

Least-squares method: We choose the line that minimizes the sum of all the squared errors (SSE).

Least squares regression line: Ŷ = β̂₀ + β̂₁X, where β̂₀ and β̂₁ are the estimates of β₀ and β₁ that produce the best-fitting line in the least squares sense.

Formulas for β̂₀ and β̂₁

Estimated slope and intercept:

β̂₁ = SSxy / SSxx   and   β̂₀ = Ȳ − β̂₁X̄

where

SSxy = Σ XᵢYᵢ − (Σ Xᵢ)(Σ Yᵢ)/n,
SSxx = Σ Xᵢ² − (Σ Xᵢ)²/n,

and n = the number of observations.

Example (Table 11.3): Ȳ =   X̄ =   SSxy =   SSxx =
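A minimal sketch of these computations in Python. The Table 11.3 values are not reproduced in these notes, so the data below are hypothetical, chosen so that the slope works out to the 0.7 quoted later for the drug/reaction example:

```python
import numpy as np

# Hypothetical (x, y) data standing in for Table 11.3 (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

# Sums of squares from the formulas above
SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n

beta1_hat = SS_xy / SS_xx                     # estimated slope (0.7 here)
beta0_hat = y.mean() - beta1_hat * x.mean()   # estimated intercept (-0.1 here)
print(beta1_hat, beta0_hat)
```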

Interpretations:
Slope:
Intercept:
Example:

Avoid extrapolation: predicting/interpreting the regression line for X-values outside the range of X in the data set.

Model Assumptions

Recall the model equation: Y = β₀ + β₁X + ε. To perform inference about our regression line, we need to make certain assumptions about the random error component ε. We assume:

(1) The mean of the probability distribution of ε is 0. (In the long run, the values of the random error part average zero.)
(2) The variance of the probability distribution of ε is constant for all values of X. We denote the variance of ε by σ².
(3) The probability distribution of ε is normal.
(4) The values of ε for any two observed Y-values are independent: the value of ε for one Y-value has no effect on the value of ε for another Y-value.

Picture:
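The picture is not reproduced here; as a stand-in, a short simulation sketch of data generated under these four assumptions (all parameter values below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = -0.1, 0.7, 0.6   # arbitrary illustrative parameter values

x = np.linspace(1, 5, 50)
# Assumptions (1)-(4): errors have mean 0, constant variance sigma^2,
# a normal distribution, and are drawn independently of one another.
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps            # Y = beta0 + beta1*X + eps
```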

Estimating σ²

Typically the error variance σ² is unknown. An unbiased estimate of σ² is the mean squared error (MSE), sometimes also denoted s²:

MSE = SSE / (n − 2), where SSE = SSyy − β̂₁ SSxy and SSyy = Σ Yᵢ² − (Σ Yᵢ)²/n.

Note that an estimate of σ is s = √MSE = √(SSE / (n − 2)).

Since ε has a normal distribution, we can say, for example, that about 95% of the observed Y-values fall within 2s units of the corresponding fitted values Ŷ.
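Continuing the sketch with the same hypothetical data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same hypothetical data as above
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
SS_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
beta1_hat = SS_xy / SS_xx

SSE = SS_yy - beta1_hat * SS_xy   # sum of squared errors
MSE = SSE / (n - 2)               # unbiased estimate of sigma^2
s = np.sqrt(MSE)                  # estimate of sigma
print(SSE, MSE, s)
```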

Testing the Usefulness of the Model

For the SLR model, Y = β₀ + β₁X + ε. Note: X is completely useless in helping to predict Y if and only if β₁ = 0. So to test the usefulness of the model for predicting Y, we test

H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0.

If we reject H₀ and conclude Hₐ is true, then we conclude that X does provide information for the prediction of Y.

Picture:

Recall that the estimate β̂₁ is a statistic that depends on the sample data, so β̂₁ has a sampling distribution. If our four SLR assumptions hold, the sampling distribution of β̂₁ is normal with mean β₁ and standard deviation σ/√SSxx, which we estimate by s/√SSxx. Under H₀: β₁ = 0, the statistic

t = β̂₁ / (s/√SSxx)

has a t-distribution with n − 2 d.f.

Test for Model Usefulness

One-tailed tests:
H₀: β₁ = 0 vs. Hₐ: β₁ < 0; rejection region: t < −t_α; P-value: left-tail area outside t.
H₀: β₁ = 0 vs. Hₐ: β₁ > 0; rejection region: t > t_α; P-value: right-tail area outside t.

Two-tailed test:
H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0; rejection region: t > t_{α/2} or t < −t_{α/2}; P-value: 2 × (tail area outside t).

In every case the test statistic is t = β̂₁ / (s/√SSxx).
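A sketch of the two-tailed test in Python, again on the hypothetical data from the earlier sketches (scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same hypothetical data as above
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
SS_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
beta1_hat = SS_xy / SS_xx
s = np.sqrt((SS_yy - beta1_hat * SS_xy) / (n - 2))

t_stat = beta1_hat / (s / np.sqrt(SS_xx))         # test statistic
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed P-value
print(t_stat, p_value)
```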

Example: In the drug reaction example, recall β̂₁ = 0.7. Is the real β₁ significantly different from 0? (Use α = .05.)

A 100(1 − α)% Confidence Interval for the true slope β₁ is given by

β̂₁ ± t_{α/2} · s/√SSxx,

where t_{α/2} is based on n − 2 d.f.

In our example, a 95% CI for β₁ is:
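A sketch of this interval, plugging in rounded values from the earlier hypothetical sketches, so the resulting numbers are only illustrative:

```python
import numpy as np
from scipy import stats

# Rounded quantities from the earlier sketches (hypothetical data)
beta1_hat, s, SS_xx, n = 0.7, 0.6055, 10.0, 5
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # t_{alpha/2} with n-2 d.f.
half_width = t_crit * s / np.sqrt(SS_xx)
print(beta1_hat - half_width, beta1_hat + half_width)
```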

Correlation

The scatterplot gives us a general idea about whether there is a linear relationship between two variables. More precise: the coefficient of correlation (denoted r) is a numerical measure of the strength and direction of the linear relationship between two variables.

Formula for r (the correlation coefficient between two variables X and Y):

r = SSxy / √(SSxx · SSyy)

Most computer packages will also calculate the correlation coefficient.

Interpreting the correlation coefficient:
Positive r => the two variables are positively associated (large values of one variable correspond to large values of the other variable).
Negative r => the two variables are negatively associated (large values of one variable correspond to small values of the other variable).
r = 0 => no linear association between the two variables.

Note: −1 ≤ r ≤ 1 always.
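A quick sketch checking the formula against numpy's built-in correlation, on the same hypothetical data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same hypothetical data as above
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
SS_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n

r = SS_xy / np.sqrt(SS_xx * SS_yy)   # formula above
print(r, np.corrcoef(x, y)[0, 1])    # the built-in gives the same value
```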

How far r is from 0 measures the strength of the linear relationship:
r near 1 => strong positive relationship between the two variables.
r near −1 => strong negative relationship between the two variables.
r near 0 => weak linear relationship between the two variables.

Pictures:

Example (drug/reaction time data): Interpretation?

Notes:
(1) Correlation makes no distinction between predictor and response variables.
(2) Variables must be numerical to calculate r.

Examples: What would we expect the correlation to be if our two variables were:
(1) Work experience & salary?
(2) Weight of a car & gas mileage?

Some Cautions

Example:
Speed of a car (X):   20  30  40  50  60
Mileage in mpg (Y):   24  28  30  28  24

Scatterplot of these data:

Calculation will show that r = 0 for these data. Are the two variables related?
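A quick check of this caution example (the five data points are taken from the table above):

```python
import numpy as np

speed = np.array([20, 30, 40, 50, 60], dtype=float)   # X from the table above
mpg = np.array([24, 28, 30, 28, 24], dtype=float)     # Y from the table above

r = np.corrcoef(speed, mpg)[0, 1]
print(r)   # 0.0: no *linear* association, yet the scatterplot shows a clear
           # curved (roughly quadratic) relationship between speed and mileage
```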

Another caution: Correlation between two variables does not automatically imply that there is a cause-and-effect relationship between them.

Note: The population correlation coefficient between two variables is denoted ρ. To test H₀: ρ = 0, we simply use the equivalent test of H₀: β₁ = 0 in the SLR model. If this null hypothesis is rejected, we conclude there is a significant correlation between the two variables.

The square of the correlation coefficient is called the coefficient of determination, r². Interpretation: r² represents the proportion of sample variability in Y that is explained by its linear relationship with X.

r² = 1 − SSE/SSyy   (r² is always between 0 and 1)

For the drug/reaction time example, r² =    Interpretation:
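In code (same hypothetical data as before), r² can be computed either from SSE or by squaring r; the two routes agree in simple linear regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same hypothetical data as above
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
SS_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
SSE = SS_yy - (SS_xy / SS_xx) * SS_xy

r_squared = 1 - SSE / SS_yy                 # coefficient of determination
r = SS_xy / np.sqrt(SS_xx * SS_yy)
print(r_squared, r ** 2)                    # identical values
```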

Estimation and Prediction with the Regression Model

Major goals in using the regression model:
(1) Determining the linear relationship between Y and X (accomplished through inferences about β₁).
(2) Estimating the mean value of Y, denoted E(Y), for a particular value of X. Example: Among all people with drug amount 3.5%, what is the estimated mean reaction time?
(3) Predicting the value of Y for a particular value of X. Example: For a new individual having drug amount 3.5%, what is the predicted reaction time?

The point estimate for these last two quantities is the same; it is the fitted value Ŷ = β̂₀ + β̂₁xₚ. Example:

However, the variability associated with these point estimates is very different. Which quantity has more variability, a single Y-value or the mean of many Y-values?

This is seen in the following formulas.

100(1 − α)% Confidence Interval for the mean value of Y at X = xₚ:

Ŷ ± t_{α/2} · s · √(1/n + (xₚ − X̄)²/SSxx), where t_{α/2} is based on n − 2 d.f.

100(1 − α)% Prediction Interval for an individual new value of Y at X = xₚ:

Ŷ ± t_{α/2} · s · √(1 + 1/n + (xₚ − X̄)²/SSxx), where t_{α/2} is based on n − 2 d.f.

The extra 1 inside the square root shows the prediction interval is wider than the CI, although they have the same center.

Note: A prediction interval attempts to contain a random quantity, while a confidence interval attempts to contain a (fixed) parameter value.

The variability in our estimate of E(Y) reflects the fact that we are merely estimating the unknown β₀ and β₁. The variability in our prediction of the new Y includes that variability, plus the natural variation in the Y-values.

Example (drug/reaction time data):
95% CI for E(Y) with X = 3.5:
95% PI for a new Y having X = 3.5:
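A sketch of both intervals at X = 3.5, using the same hypothetical data as the earlier sketches (scipy assumed available), mainly to show that the two intervals share a center while the PI is wider:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same hypothetical data as above
y = np.array([1.0, 1.0, 2.0, 2.0, 4.0])
n = len(x)

SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
SS_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
SS_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
beta1_hat = SS_xy / SS_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
s = np.sqrt((SS_yy - beta1_hat * SS_xy) / (n - 2))

x_p = 3.5
y_hat = beta0_hat + beta1_hat * x_p         # point estimate for both intervals
t_crit = stats.t.ppf(0.975, df=n - 2)       # 95% intervals

se_mean = s * np.sqrt(1 / n + (x_p - x.mean()) ** 2 / SS_xx)      # for E(Y)
se_new = s * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / SS_xx)   # for a new Y
print(y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)   # CI for E(Y)
print(y_hat - t_crit * se_new, y_hat + t_crit * se_new)     # PI, wider
```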