Ordinary Least Squares Regression Explained: Vartanian


When to Use Ordinary Least Squares Regression Analysis

A. Variable types.
1. When you have an interval/ratio scale dependent variable.
2. When your independent variables are either interval/ratio scale or dummy variables.

B. Types of relationships. We use ordinary least squares regression when we are interested in determining cause-and-effect relationships. Thus, if we believe that there is a negative relationship between the unemployment rate in a community and wages (we believe that high unemployment depresses wages), then we use ordinary least squares regression analysis.

The Process of Using OLS Regression Analysis

When examining the relationship between an independent and a dependent variable in a scattergram, the line that fits these points best is known as the least squares line. This line is chosen by minimizing the sum of the squared vertical distances between the data points and the line. In other words, we're choosing the line that is closest to all the data points. For example, let's say we have a set of paired observations on two variables, x and y, and from them we get a scattergram and the best fitting line through that scattergram.

[Table of paired x and y values and the accompanying scattergram; the specific values are garbled in the transcription. The plot shows y and the fitted values against x, with y running from about 0 to 15 and x from 1 to 6.]

How do we form the line that goes through the data points (in the scattergram)?

We do this by minimizing the sum of the squared deviations from any line we could draw through the points. That is, we choose the line that minimizes

Σ(yᵢ − ŷ)²

Here, the yᵢ are the actual values of y (for each of the sample members) and ŷ is the predicted value of y (the line we'll be drawing through the scattering of points; note: I will sometimes refer to this as yₚ, where p stands for the predicted value of y). We're trying to minimize the sum of the squared deviations of the actual (sample) values of y (the yᵢ) from the best line we can draw through all of the yᵢ points. The expression Σ(yᵢ − ŷ)² is known as the unexplained sum of squares or the error sum of squares. The total sum of squares, to the left of the equals sign below, can be broken up into explained and unexplained sums of squares:

Σ(yᵢ − ȳ)² = Σ(yᵢ − ŷ)² + Σ(ŷ − ȳ)²

The first expression after the equals sign is the unexplained sum of squares, and the second expression after the equals sign is the explained sum of squares.

Unexplained: our error in predicting what y will be by using the regression line.

Explained: what we gain by using ŷ instead of ȳ.

What we're trying to do is predict the value of y, the dependent variable, given that we know something about the person on the independent variable, x. If we knew nothing about the person, our best guess of what y would be is ȳ. We are trying to improve on ȳ in predicting the value of y, and we'll do this with our knowledge of the independent variable, x. The ŷ line will allow us to predict the value of the dependent variable, y, for any value of x, the independent variable. For example, we may know a particular state's unemployment rate, and we may wish to predict how long a person will stay unemployed if they live in such a state. By knowing the ŷ line, we'll be able to predict how long a person stays unemployed. We may not be perfectly right in our prediction, for instance, if the points around the line are highly dispersed. But if the points are concentrated around the line, then we can predict fairly accurately how long someone will spend unemployed for a given unemployment rate within the state.

If we were examining the effect of income (the independent variable) on expenditures (the dependent variable), we would examine the scatter of points from a sample drawn from the population. We would then find the line, the regression line, that best fits these points. In what we are doing now, we are looking only at linear relationships; we can also look at non-linear relationships.
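Returning to the decomposition above: here is a minimal Python sketch (my own illustration, with a made-up data set that is not from the handout) that fits the least squares line and confirms that the total sum of squares equals the unexplained plus the explained sums of squares:

    # Sum-of-squares decomposition for simple OLS: total = unexplained + explained.
    x = [1, 2, 3, 4, 5]          # illustrative data only
    y = [2, 1, 4, 3, 5]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least squares slope and intercept.
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar

    y_hat = [b0 + b1 * xi for xi in x]                      # predicted values
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (error)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained

    print(sst, sse + ssr)   # 10.0 and 10.0: the decomposition holds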

Not all of the sample points will be located on the ordinary least squares regression line; some will be below the line and some will be above it. The closer the points are to this line, the better a predictor of the dependent variable the independent variable will be.

We can determine the ŷ line by the following equation:

ŷ = b₀ + b₁x

Here, b₀ is the intercept, b₁ is the slope coefficient, and x is the independent variable. ŷ is the predicted value of y for a given value of x. The formulas for determining the intercept (b₀) and the slope (b₁) are given below. We can define the b₀ and b₁ coefficients as follows:

b₀, or the intercept, is the point where we cross the y axis when the value of x is 0. We know this because if we give x a value of 0, then ŷ = b₀.

b₁, or the slope coefficient, tells us how much ŷ changes for a one-unit change in x. A positive value for b₁ indicates that there is a positive relationship between the independent and dependent variables. A negative value for b₁ indicates that there is a negative relationship between the independent and dependent variables. A value of 1 for b₁ indicates that for every 1-unit increase in the independent variable, the dependent variable is predicted to increase by 1 unit. If b₁ = 2, this indicates that for a 1-unit increase in the independent variable, the dependent variable is predicted to increase by 2 units. If b₁ = −9, this indicates that for every 1-unit increase in the independent variable, the dependent variable is predicted to decrease by 9 units. Thus,

b₁ = (change in y) / (1-unit increase in x)

The slope is generally defined as Δy/Δx = (y₂ − y₁) / (x₂ − x₁).

Let's say we have the following 5 observations, where x, the independent variable, is the number of children in the household, and y, the dependent variable, is the time in months unemployed.

x  y
1  1
2  2
3  3
4  4
5  5

The formula for determining the slope, or the b₁ coefficient estimate, is

b₁ = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]

The formula for the intercept, or the b₀ coefficient estimate, is

b₀ = (Σy − b₁Σx) / n, or b₀ = ȳ − b₁x̄

In the example given, n = 5, Σxy = 55, Σx = 15, Σy = 15, Σx² = 55, and (Σx)² = 225, so

b₁ = [5(55) − 15(15)] / [5(55) − 225] = 50/50 = 1

and

b₀ = [15 − 1(15)] / 5 = 0/5 = 0

So ŷ = 0 + 1(x). (You will not need to know how to calculate b₀ or b₁, but you will need to know how to interpret them.) The b₁ coefficient estimate tells us that for every 1-unit increase in x, the predicted value of the dependent variable will increase by 1 unit. The b₀ coefficient estimate tells us that when x = 0, the predicted value of the dependent variable is 0. When x = 1, ŷ = 1. We could graph this line to see the relationship between the two variables, the independent and the dependent, which is given above. It turns out that in this case we have a perfect relationship, because all of the points lie on the ŷ line. If we were to determine a correlation coefficient (r), it would be r = 1. To graph this relationship, we could determine the value of ŷ for each x.

x  ŷ
0  0
1  1
2  2
3  3
4  4
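As an arithmetic check, a short Python sketch (my own, not part of the handout) applies these two formulas to the five observations above:

    # Slope and intercept via the computational formulas.
    x = [1, 2, 3, 4, 5]   # number of children in the household
    y = [1, 2, 3, 4, 5]   # months unemployed

    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 55
    sum_x2 = sum(xi ** 2 for xi in x)               # 55

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n

    print(b1, b0)   # 1.0 0.0, so the fitted line is y-hat = 0 + 1(x)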

[Scattergram for this first example: all of the points lie exactly on the fitted line ŷ = x (legend: Fitted values, y).]

Let's say we have the following 5 cases for a second example.

x  y
1  5
2  4
3  3
4  2
5  1

n = 5. To determine b₁: Σxy = 35, Σx = 15, Σy = 15, (Σx)² = 225, Σx² = 55, (Σy)² = 225, Σy² = 55, ȳ = 3, and x̄ = 3.

b₁ = [5(35) − 15(15)] / [5(55) − 225] = −50/50 = −1

and

b₀ = 3 − (−1)(3) = 6

The regression equation is therefore ŷ = 6 − 1(x), or ŷ = 6 − x. The b₁ coefficient estimate, or the slope coefficient, for this example is −1. The b₀ coefficient estimate, or the intercept, is 6. Thus, when x = 0, ŷ, the predicted value of y, is 6. If x = 1, then the predicted value of y (ŷ) is 5. When x = 6, ŷ = 0.
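The predicted values in the table below come from plugging each value of x into this equation; a tiny Python loop (my own sketch) reproduces them:

    # Predictions from the second example's fitted line, y-hat = 6 - x.
    b0, b1 = 6, -1
    for x in range(0, 7):
        print(x, b0 + b1 * x)   # (0, 6), (1, 5), ..., (6, 0)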

In this second situation, we again find a perfect relationship between the two variables: all of the points are on the regression line. If we were to determine the correlation coefficient (r) for this example, it would be r = −1. To graph this, we could determine the value of ŷ for each value of x, again using the ŷ equation from above.

x  ŷ
0  6
1  5
2  4
3  3
4  2
5  1
6  0

[Scattergram for this second example: all of the points lie exactly on the downward-sloping fitted line ŷ = 6 − x (legend: Fitted values, y).]

We will rarely find a perfect relationship between two variables as we have in the two examples above. For example, if we had the following 5 cases below, we would not find a perfect relationship between the two variables.

n = 5

[Table of five x and y pairs; x runs from 1 to 5, but the y values are garbled in the transcription. The sums needed for the calculation are listed below.]

To determine b₁: Σxy = 98, Σx = 15, Σy = 30, (Σx)² = 225, Σx² = 55, (Σy)² = 900, Σy² = 200, ȳ = 6, and x̄ = 3.

b₁ = [5(98) − 15(30)] / [5(55) − 225] = 40/50 = .80, and

b₀ = 6 − .8(3) = 3.6

The regression equation is therefore ŷ = 3.6 + .80(x), where b₁ = .8 and b₀ = 3.6. Thus, when x = 0, the predicted value for y, ŷ, is 3.6 (replace x with a value of 0 in the ŷ equation above). When x = 1, the predicted value for y, ŷ, is 4.4 (replace x with a value of 1 in the ŷ equation above). When x = 10, the predicted value for y, ŷ, is 11.6.

[Scattergram for this third example: the points are scattered around the fitted line ŷ = 3.6 + .80x rather than lying exactly on it (legend: Fitted values, y).]
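Here is a small Python sketch (my own, using only the sums printed above rather than the raw data) that reproduces the coefficients and the three predictions:

    # Third example: coefficients from the summary sums, then predictions.
    n = 5
    sum_x, sum_y = 15, 30
    sum_xy, sum_x2 = 98, 55

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # 0.8
    b0 = sum_y / n - b1 * (sum_x / n)                              # 3.6

    for x in (0, 1, 10):
        print(x, b0 + b1 * x)   # 3.6, 4.4, 11.6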

Testing to determine if the relationship between the independent and dependent variables is statistically significant, or testing the significance of the b₁ coefficient estimate

You will generally be testing a null hypothesis that states that there is no relationship between the independent and dependent variables. In other words, you'll be testing the following: H₀: β₁ = 0. If you're testing for a positive relationship between the independent and dependent variables, your one-tailed research hypothesis will be Hᵣ: β₁ > 0. A negative research hypothesis will be Hᵣ: β₁ < 0. A two-tailed research hypothesis will be Hᵣ: β₁ ≠ 0.

In order to test for the significance of the b₁ coefficient, you will have to know the standard error of the b₁ coefficient. The standard error of the coefficient is very similar to a standard deviation: it measures the spread of the distribution. We will use a Student t distribution to test the b₁ coefficient, to determine if there is in all likelihood a relationship between the independent and dependent variables. The Student t value is very similar to a z value: the t is telling us how many standard-error units we are away from our null hypothesized value. The hypothesized value we're examining is the null hypothesis, a value of β₁ = 0. We found that for the normal distribution, when we were 1.96 units away from the mean of the distribution (where z = 1.96), we were in the .05 tail of the normal distribution. When sample sizes get relatively large, it will again take around 1.96 units (now standard-error units measured in t values rather than z values) for us to be in the .05 tail end of the distribution. In other words, when sample sizes get large, the Student t distribution turns into a normal distribution. The t value is determined by the formula below:

t = b₁ / s(b₁), with n − k − 1 degrees of freedom

where the standard error of the estimate is given by

s(b₁) = s / √(Σ(x − x̄)²), with s² = SSE / (n − k − 1)

(You will not need to use these formulas; they are here for those who wish to know them. You will need to know about the standard error and how to use it, but not how to calculate it.) Here, s(b₁) is the standard error of the b₁ coefficient estimate, and SSE stands for the error sum of squares, or the unexplained sum of squares. The n − k − 1 part of the t formula indicates the degrees of freedom, where n is equal to the number of observations and k is equal to the number of independent variables. If we had 5 observations and 1 independent variable, we would have 3 degrees of freedom. We would use these degrees of freedom in a table of critical values for t to determine if the t value is greater than or equal to the critical value. If the t value is greater than the critical value, you will reject the null hypothesis. If the t value is less than the critical value, you will accept the null hypothesis.
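To see this decision rule in action, here is a Python sketch (my own; it assumes the scipy library is available for the critical values, and it previews the numbers used in the next example):

    from scipy import stats

    b1 = 4.0        # slope coefficient estimate
    se_b1 = 2.0     # standard error of the slope
    n, k = 42, 1    # observations and independent variables
    df = n - k - 1  # 40 degrees of freedom

    t = b1 / se_b1                          # 2.0
    crit_one = stats.t.ppf(1 - 0.05, df)    # one-tailed .05 critical value, about 1.684
    crit_two = stats.t.ppf(1 - 0.025, df)   # two-tailed .05 critical value, about 2.021

    print(t > crit_one)   # True: reject the null hypothesis, one-tailed test
    print(t > crit_two)   # False: accept the null hypothesis, two-tailed test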

Let's say that you determine that the b₁ coefficient estimate is 4. You also determine that the standard error of the b₁ coefficient estimate is 2, with an n of 42 (i.e., you're examining 42 cases). Let's also say you're examining a one-tailed hypothesis at the .05 level of significance. Your t statistic would be the following:

t = 4/2 = 2

This indicates that the t value is 2, with 40 degrees of freedom. The critical value is 1.684. Because the t value is greater than the critical value, you would reject the null hypothesis at the .05 level for a one-tailed test. If you were testing this hypothesis at the .05 level for a two-tailed test, the critical value would be 2.021. Because the t value is less than the critical value, you would accept the null hypothesis.

AN EXAMPLE

You're examining the relationship between age and wage. You have the following 4 observations:

Obs  Age (X)  Wage (Y)
1    20       5.50
2    30       6.50
3    40       7.50
4    50       8.00

From this information, we could determine the b₀ and b₁ coefficients: b₀ = 3.9 and b₁ = .085, so

ŷ = 3.9 + .085x

s(b₁) = √(.0375 / (5400 − 4900)) = .00866

We can then determine whether the t coefficient is significant by using the t formula:

t = .085 / .00866 = 9.8

At two degrees of freedom for a .05, two-tailed test, the critical value is 4.3. Because the t value is greater than the critical value, reject the null hypothesis.
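As a check on this example, scipy's linregress function reproduces these numbers (a sketch assuming scipy is installed; expect small rounding differences):

    from scipy import stats

    age = [20, 30, 40, 50]
    wage = [5.50, 6.50, 7.50, 8.00]

    res = stats.linregress(age, wage)
    print(res.intercept)           # 3.9    (b0)
    print(res.slope)               # 0.085  (b1)
    print(res.stderr)              # about .00866, the standard error of b1
    print(res.slope / res.stderr)  # t of about 9.8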

Using the F test to determine statistical significance

The F test will determine whether your regression model (including all of the covariates) is statistically significant. In the single-covariate case, you will be testing whether the single covariate is statistically significant. We will use the Mean Square Regression and the Mean Square Error (or the Mean Square Residual) in an F test. You will see each of these in your computer output.

F(k, n−k−1) = MSR / MSE

Here, we are testing the following hypothesis:

H₀: R² = 0
Hₐ: R² > 0

Let's say that we're examining the effects of age on the number of cigarettes smoked.

ANOVA(b)

Model        Sum of Squares     df      Mean Square    F        Sig.
Regression         2054.532         1      2054.532    60.352   .000(a)
Residual         476392.993     13994        34.043
Total            478447.525     13995

a. Predictors: (Constant), age
b. Dependent Variable: cigsperday

Here, the mean square regression is 2054.532 and the mean square error is 34.043. F = 2054.532 / 34.043 = 60.35. From the table we see that this is significant at the .000 level.

We can also determine the slope by examining the unstandardized coefficient for age. This is given below.

Coefficients(a)

             Unstandardized Coefficients    Standardized Coefficients
Model        B        Std. Error            Beta                         t        Sig.
(Constant)   3.93     .47                                                .360     .000
age          -.024    .003                  -.066                        -7.769   .000

a. Dependent Variable: cigsperday

This indicates that the b₁ for age is −.024. This means that for each additional year of age, people are predicted to smoke .024 fewer cigarettes per day.
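To connect this output back to the F formula, a final Python sketch (my own; the degrees of freedom are taken from the ANOVA table above, and scipy is assumed for the p-value):

    from scipy import stats

    # F test from the ANOVA table: F = MSR / MSE.
    msr, mse = 2054.532, 34.043
    df_reg, df_res = 1, 13994

    F = msr / mse                      # about 60.35
    p = stats.f.sf(F, df_reg, df_res)  # upper-tail p-value; effectively 0,
    print(F, p)                        # which SPSS reports as Sig. = .000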