Ordinary Least Squares Regression Explained: Vartanian

When to Use Ordinary Least Squares Regression Analysis

A. Variable types
1. When you have an interval/ratio scale dependent variable.
2. When your independent variables are either interval/ratio scale or dummy variables.

B. Types of relationships
We use ordinary least squares regression when we are interested in determining cause-and-effect relationships. Thus, if we believe that there is a relationship between the unemployment rate in a community and wages (for example, that high unemployment depresses wages), we would use ordinary least squares regression analysis.

The Process of Using OLS Regression Analysis

When examining the relationship between an independent and a dependent variable in a scattergram, the line that fits these points best is known as the least squares line. This line is chosen by minimizing the distance between all of these points and the line. In other words, we're choosing the line that is closest to all the data points. For example, let's say we have the following two variables, x and y:

x   y
0   0
3   3
3   4
4   5
5   5
5   5
5   5
6   6
7   6
9   6

From this we get a scattergram and the best-fitting line through that scattergram.

[Scatterplot: y against x with the fitted (least squares) line.]

How do we form the line that goes through the data points in the scattergram? We do this by minimizing the sum of the squared deviations from any line we could draw through the points. We thus choose the line that minimizes the following expression:

$\sum (y_i - \hat{y})^2$

Here, $y_i$ are the actual values of y (for each of the sample members) and $\hat{y}$ is the predicted value of y (the line we'll be drawing through the scattering of points; note: I will sometimes refer to this as $y_p$, where p stands for the predicted value of y). We're trying to minimize the sum of the squared deviations of the actual (sample) values of y from the best line we can draw through all of the points. The expression $\sum (y_i - \hat{y})^2$ is known as the unexplained sums of squares or the error sums of squares. The total sums of squares can be broken up into explained and unexplained sums of squares:

$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$

The expression to the left of the equals sign is the total sums of squares. The first expression after the equals sign is the unexplained sums of squares, and the second expression after the equals sign is the explained sums of squares.

Unexplained: our error in predicting what y will be by using the regression line.
Explained: what we gain by using $\hat{y}$ instead of $\bar{y}$.

What we're trying to do is predict the value of y, the dependent variable, given that we know something about the person through the independent variable, x. If we knew nothing about the person, our best guess of y would be $\bar{y}$, the mean of y. We are trying to improve on $\bar{y}$ in predicting the value of y. We'll do this with our knowledge of the independent variable, x.
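To make the decomposition concrete, here is a minimal Python sketch (numpy assumed; the code is not part of the original notes) that fits the least squares line to the scattergram data above and checks that the total sums of squares equals the unexplained plus the explained sums of squares:

```python
import numpy as np

# Scattergram data from the example above.
x = np.array([0, 3, 3, 4, 5, 5, 5, 6, 7, 9], dtype=float)
y = np.array([0, 3, 4, 5, 5, 5, 5, 6, 6, 6], dtype=float)

b1, b0 = np.polyfit(x, y, 1)            # least squares slope and intercept
y_hat = b0 + b1 * x                     # predicted values on the fitted line

sst = np.sum((y - y.mean()) ** 2)       # total sums of squares
sse = np.sum((y - y_hat) ** 2)          # unexplained (error) sums of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained (regression) sums of squares
print(np.isclose(sst, sse + ssr))       # True: SST = SSE + SSR
```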

The $\hat{y}$ line will allow us to predict the value of the dependent variable, y, for any value of x, the independent variable. For example, we may know a particular state's unemployment rate, and we may wish to predict how long a person will stay unemployed if they live in that state. By knowing the $\hat{y}$ line, we'll be able to predict how long a person stays unemployed. We may not be perfectly right in our prediction, for instance, if the points around the line are highly dispersed. But if the points are concentrated around the line, then we can predict fairly accurately how long someone will spend unemployed for a given unemployment rate within the state.

If we were examining the effect of income (the independent variable) on expenditures (the dependent variable), we would examine the scatter of points from a sample drawn from the population. We would then find the line, the regression line, that best fits these points. In what we are doing now, we are looking only at linear relationships; we can also look at non-linear relationships. Not all of the sample points will be located on the ordinary least squares regression line: some will be below the line and some will be above the line. The closer the points are to this line, the better a predictor of the dependent variable the independent variable will be.

We can determine the $\hat{y}$ line by the following equation:

$\hat{y} = b_0 + b_1 x$

Here, $b_0$ is the intercept, $b_1$ is the slope coefficient, and x is the independent variable. $\hat{y}$ is the predicted value of y for a given value of x. The formulas for determining the intercept ($b_0$) and the slope ($b_1$) are given below. We can define the $b_0$ and $b_1$ coefficients as follows:

$b_0$, or the intercept, is the point where we cross the y axis when the value of x is 0. We know this because if we give x a value of 0, $\hat{y} = b_0$.

$b_1$, or the slope coefficient, tells us how much y changes for a one-unit change in x. A positive value for $b_1$ indicates that there is a positive relationship between the independent and dependent variables; a negative value for $b_1$ indicates a negative relationship. A value of 1 for $b_1$ indicates that for every one-unit increase in the independent variable, the dependent variable is predicted to increase by 1 unit. If $b_1 = 2$, then for a one-unit increase in the independent variable, the dependent variable is predicted to increase by 2 units. If $b_1 = -9$, then for every one-unit increase in the independent variable, the dependent variable is predicted to decrease by 9 units. Thus,

$b_1 = \dfrac{\text{change in } y}{\text{one-unit increase in } x}$

The slope is generally defined as $\dfrac{\Delta y}{\Delta x} = \dfrac{y_2 - y_1}{x_2 - x_1}$.

Let's say we have the following 5 observations, where x, the independent variable, is the number of children in the household, and y, the dependent variable, is the time in months unemployed.

x   y
1   1
2   2
3   3
4   4
5   5

The formula for determining the slope, or the $b_1$ coefficient estimate, is

$b_1 = \dfrac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2}$

The formula for the intercept, or the $b_0$ coefficient estimate, is

$b_0 = \dfrac{\sum y - b_1 \sum x}{n}$, or $b_0 = \bar{y} - b_1 \bar{x}$

In the example given, n = 5, $\sum xy = 55$, $\sum x = 15$, $\sum y = 15$, $\sum x^2 = 55$, and $(\sum x)^2 = 225$.

$b_1 = \dfrac{5(55) - 15(15)}{5(55) - 225} = \dfrac{50}{50} = 1$

and

$b_0 = \dfrac{15 - 1(15)}{5} = \dfrac{0}{5} = 0$

So, $\hat{y} = 0 + 1(x)$.
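As a check on this arithmetic, here is a short Python sketch (numpy assumed; not part of the original notes) that applies the two formulas to these five observations:

```python
import numpy as np

# Worked example: x = number of children, y = months unemployed.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 3, 4, 5], dtype=float)

n = len(x)
# Slope: b1 = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
# Intercept: b0 = [sum(y) - b1*sum(x)] / n, equivalently y-bar - b1*x-bar
b0 = (np.sum(y) - b1 * np.sum(x)) / n
print(b1, b0)  # 1.0 0.0
```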

The $b_1$ coefficient estimate tells us that for every one-unit increase in x, the predicted value of the dependent variable increases by 1 unit. The $b_0$ coefficient estimate tells us that when x = 0, the predicted value of the dependent variable is 0. When x = 1, $\hat{y} = 1$. We could graph this line to see the relationship between the two variables, the independent and the dependent, which is given above. It turns out that in this case we have a perfect relationship, because all of the points lie on the $\hat{y}$ line. If we were to determine a correlation coefficient (r), it would equal 1. To graph this relationship, we could determine the value of $\hat{y}$ for each x:

x   ŷ
0   0
1   1
2   2
3   3
4   4

[Scatterplot: the data points and the fitted values lie on the same line.]

Let's say we have the following 5 cases for a second example:

x   y
1   5
2   4
3   3
4   2
5   1

Here n = 5, $\sum xy = 35$, $\sum x = 15$, $\sum y = 15$, $(\sum x)^2 = 225$, $\sum x^2 = 55$, $(\sum y)^2 = 225$, $\sum y^2 = 55$, $\bar{y} = 3$, and $\bar{x} = 3$.

To determine $b_1$:

$b_1 = \dfrac{5(35) - 15(15)}{5(55) - 225} = \dfrac{-50}{50} = -1$

and $b_0 = 3 - (-1)(3) = 6$.

The regression equation is therefore $\hat{y} = 6 - (1)x$, or $\hat{y} = 6 - x$.

The $b_1$ coefficient estimate, or the slope coefficient, for this example is -1. The $b_0$ coefficient estimate, or the intercept, is 6. Thus, when x = 0, $\hat{y}$, the predicted value of y, is 6. If x = 1, then the predicted value of y is 5. When x = 6, $\hat{y} = 0$. In this second situation, we again find a perfect relationship between the two variables: all of the points are on the regression line. If we were to determine the correlation coefficient (r) for this example, it would equal -1. To graph this, we could determine the value of $\hat{y}$ for each x value, again using the $\hat{y}$ equation from above:

x   ŷ
0   6
1   5
2   4
3   3
4   2
5   1
6   0

[Scatterplot: the second example's data points and fitted line.]

We will rarely find a perfect relationship between two variables as we have in the two examples above. For example, if we had the following 5 cases, we would not find a perfect relationship between the two variables.

x   y
3   4
1   8
4   6
5   10
2   2

Here n = 5, $\sum xy = 98$, $\sum x = 15$, $\sum y = 30$, $(\sum x)^2 = 225$, $\sum x^2 = 55$, $(\sum y)^2 = 900$, $\sum y^2 = 220$, $\bar{y} = 6$, and $\bar{x} = 3$.

To determine $b_1$:

$b_1 = \dfrac{5(98) - 15(30)}{5(55) - 225} = \dfrac{40}{50} = .80$

and

$b_0 = 6 - .8(3) = 3.6$

The regression equation is therefore $\hat{y} = 3.6 + .80(x)$, where $b_1 = .8$ and $b_0 = 3.6$.
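A quick Python check of this example's arithmetic (numpy assumed; not part of the original notes), including the predictions discussed next:

```python
import numpy as np

# Third example: the points no longer fall exactly on a line.
x = np.array([3, 1, 4, 5, 2], dtype=float)
y = np.array([4, 8, 6, 10, 2], dtype=float)

n = len(x)
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)                                 # 0.8 3.6
print(b0 + b1 * np.array([0.0, 1.0, 10.0]))   # [ 3.6  4.4 11.6]
```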

Thus, when x = 0, the predicted value for y, $\hat{y}$, is 3.6 (replace x with a value of 0 in the $\hat{y}$ equation above). When x = 1, the predicted value is 4.4 (replace x with a value of 1). When x = 10, the predicted value is 11.6.

[Scatterplot: the third example's data points and fitted line.]

A final example examines a sample of people who have been in job training programs, to determine the relationship between time in these job training programs (in months) and their wage after they find work. We come up with the following $b_0$ and $b_1$ coefficients:

$b_0 = 3$, $b_1 = 4$

In other words, $\hat{y} = 3 + 4x$,

where x = time in months in the job training program. What we can do is put in different values of x to see what we predict about the dependent variable. If x = 0 (time in the job training program is 0), we would predict that the person will have a wage of $3/hour: $\hat{y} = 3 + 4(0) = 3$. If x = 1, we would predict that wages would be $7/hour: $\hat{y} = 3 + 4(1) = 7$. If x = 2 (months in the job training program), we would predict that wages would be $11/hour: $\hat{y} = 3 + 4(2) = 11$.
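The same prediction logic as a tiny Python function (a hypothetical helper, not part of the original notes):

```python
# Fitted equation from the job-training example: y_hat = 3 + 4x.
def predicted_wage(months_in_training: float) -> float:
    """Predicted hourly wage ($) for a given time (months) in job training."""
    return 3 + 4 * months_in_training

print(predicted_wage(0), predicted_wage(1), predicted_wage(2))  # 3 7 11
```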

Testing to Determine if the Relationship Between the Independent and Dependent Variables is Significant, or Testing the Significance of the $b_1$ Coefficient Estimate

You will generally be testing a null hypothesis that states that there is no relationship between the independent and dependent variables. In other words, you'll be testing the following:

$H_0: \beta_1 = 0$

If you're testing for a positive relationship between the independent and dependent variables, your one-tailed research hypothesis will be $H_R: \beta_1 > 0$. A negative research hypothesis will be $H_R: \beta_1 < 0$. A two-tailed research hypothesis will be $H_R: \beta_1 \neq 0$.

In order to test for the significance of the $b_1$ coefficient, you will have to know the standard error of the $b_1$ coefficient. The standard error of the coefficient is very similar to a standard deviation: it measures the spread of the distribution. We will use a Student t distribution to test the $b_1$ coefficient, to determine if there is in all likelihood a relationship between the independent and dependent variables. As we've learned with the difference-of-means test, the Student t distribution value is very similar to a z value. The t is telling us how many standard error units we are away from our null hypothesized value, here $\beta_1 = 0$. We found that for the normal distribution, when we were 1.96 units away from the mean of the distribution (where z = 1.96), we were in the .05 tails of the normal distribution. When sample sizes get relatively large, it will again take around 1.96 units (now standard error units measured in t values rather than z values) for us to be in the .05 tail ends of the distribution. In other words, when sample sizes get large, the Student t distribution turns into a normal distribution. The t value is determined by the formula below:

$t_{n-k-1} = \dfrac{b_1}{s_{b_1}}$

where the standard error of the $b_1$ estimate is given by

$s_{b_1} = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$, or $s_{b_1} = \dfrac{s}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}}$

with $s = \sqrt{\dfrac{SSE}{n-k-1}}$. Here $s_{b_1}$ is the standard error of the $b_1$ coefficient estimate, and SSE stands for the error sums of squares, or the unexplained sums of squares.

The n - k - 1 part of the t formula indicates the degrees of freedom, where n is equal to the number of observations and k is equal to the number of independent variables. If we had 5 observations and 1 independent variable, we would have 3 degrees of freedom. We would use these degrees of freedom in a table of critical values for t to determine whether the t value is greater than or equal to the critical value. If the t value is greater than the critical value, you will reject the null hypothesis. If the t value is less than the critical value, you will accept the null hypothesis.

Let's say that you determine that the $b_1$ coefficient estimate is 4. You also determine that the standard error of the $b_1$ coefficient estimate is 2, with n = 42 (you're examining 42 cases). Let's also say you're examining a one-tailed hypothesis at the .05 level of significance. Your t statistic would be the following:

$t_{40} = \dfrac{4}{2} = 2$

This indicates that the t value is 2, with 40 degrees of freedom. The critical value is 1.684. Because the t value is greater than the critical value, you would reject the null hypothesis at the .05 level for a one-tailed test. If you were testing this hypothesis at the .05 level for a two-tailed test, the critical value would be 2.021. Because the t value is less than this critical value, you would accept the null hypothesis.
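A small Python sketch of this test (scipy assumed; not part of the original notes), pulling the critical values from scipy.stats rather than a printed t table:

```python
from scipy import stats

b1, se_b1 = 4.0, 2.0                     # slope estimate and its standard error
n, k = 42, 1                             # 42 cases, 1 independent variable
df = n - k - 1                           # 40 degrees of freedom

t = b1 / se_b1                           # t = 2.0
cv_one_tailed = stats.t.ppf(0.95, df)    # ~1.684
cv_two_tailed = stats.t.ppf(0.975, df)   # ~2.021
print(t > cv_one_tailed)                 # True: reject H0 in the one-tailed .05 test
print(t > cv_two_tailed)                 # False: accept H0 in the two-tailed .05 test
```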

AN EXAMPLE

You're examining the relationship between age and wage. You have the following 4 observations:

Obs   Age (X)   Wage (Y)
1     20        5.50
2     30        6.50
3     40        7.50
4     50        8.00

From this information, we could determine the $b_0$ and $b_1$ coefficients: $b_0 = 3.9$ and $b_1 = .085$, so

$\hat{y} = 3.9 + .085x$

$s_{b_1} = \sqrt{\dfrac{.0375}{5400 - 4900}} = .00866$

We can then determine whether the t coefficient is significant by using the t formula:

$t = \dfrac{.085}{.00866} = 9.8$

At two degrees of freedom for a .05, two-tailed test, the critical value is 4.303. Because the t value is greater than the critical value, we reject the null hypothesis.
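A Python sketch (numpy and scipy assumed; not part of the original notes) that reproduces this t test from the raw data:

```python
import numpy as np
from scipy import stats

x = np.array([20.0, 30.0, 40.0, 50.0])   # age
y = np.array([5.50, 6.50, 7.50, 8.00])   # hourly wage

n, k = len(x), 1
b1, b0 = np.polyfit(x, y, 1)             # 0.085, 3.9
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - k - 1)    # SSE/(n-k-1) = 0.075/2 = 0.0375
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # ~0.00866

t = b1 / se_b1                           # ~9.8
cv = stats.t.ppf(0.975, n - k - 1)       # ~4.303 with df = 2
print(t > cv)                            # True: reject H0
```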

Using the F test to determine statistical significance

The F test will determine whether your regression model (including all of the covariates) is statistically significant. In the single-covariate case, you will be testing whether the single covariate is statistically significant. We will use the Mean Square Regression (MSR) and Mean Square Error (MSE) in an F test:

$F_{k,\, n-k-1} = \dfrac{MSR}{MSE}$

where we are testing the following hypotheses:

$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$

In our previous example, we determined that SSE = .075. We could then use the formula for the total sums of squares, $\sum (y_i - \bar{y})^2 = 3.688$, or determine the SSR, or regression sums of squares, $\sum (\hat{y} - \bar{y})^2 = 3.613$. To then determine F, we need to determine the Mean Square Regression and the Mean Square Error:

MSR = SSR/k = 3.613/1 = 3.613
MSE = SSE/(n - k - 1) = .075/2 = .0375

$F_{1,2} = \dfrac{3.613}{.0375} = 96.35$

If we look on an F table with 1 and 2 degrees of freedom, we find that the critical value is 18.5. Because the F value is greater than the critical value, we will reject the null hypothesis.

Confidence Intervals for $\beta_1$

Our estimate is $b_1 = .085$, and the interval is

$b_1 \pm s_{b_1} \times CV$

That is, the margin of error is the standard error of the estimate multiplied by the critical value, and we use the t table to determine critical values. In this example, our estimate for $\beta_1$ is .085 and $s_{b_1} = .00866$. The critical value (CV) for the t test is 4.303 for a .05 test (given our small degrees of freedom). So the 95% CI for the coefficient estimate is

$.085 \pm .00866 \times 4.303 = .048$ to $.122$

We are 95% confident that the $\beta_1$ coefficient in the population lies between these two values. Or: for every additional year of age, predicted wages increase by between 4.8 cents and 12.2 cents per hour.
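A companion Python sketch (numpy and scipy assumed; not part of the original notes) for the F test and the confidence interval, again taking critical values from scipy.stats:

```python
import numpy as np
from scipy import stats

x = np.array([20.0, 30.0, 40.0, 50.0])       # age
y = np.array([5.50, 6.50, 7.50, 8.00])       # hourly wage

n, k = len(x), 1
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)               # ~0.075
ssr = np.sum((y_hat - y.mean()) ** 2)        # ~3.6125
msr, mse = ssr / k, sse / (n - k - 1)        # 3.6125, 0.0375

F = msr / mse                                # ~96.3
f_cv = stats.f.ppf(0.95, k, n - k - 1)       # ~18.5
print(F > f_cv)                              # True: reject H0

# 95% confidence interval for the slope.
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t_cv = stats.t.ppf(0.975, n - k - 1)         # ~4.303
print(b1 - t_cv * se_b1, b1 + t_cv * se_b1)  # ~0.048 to ~0.122
```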