Ordinary Least Squares Regression Explained: Vartanian

When to Use Ordinary Least Squares Regression Analysis

A. Variable types. When you have an interval/ratio scale dependent variable, and when your independent variables are either interval/ratio scale or dummy variables.

B. Types of relationships. We use ordinary least squares regression when we are interested in determining cause-and-effect relationships. Thus, if we believe that there is a relationship between the unemployment rate in a community and wages (we believe that high unemployment depresses wages), then we use ordinary least squares regression analysis.

The Process of Using OLS Regression Analysis

When examining the relationship between an independent and dependent variable in a scattergram, the line that fits these points best is known as the least squares line. This line is chosen by minimizing the distance between all of these points and the line. In other words, we're choosing the line that is closest to all the data points. For example, let's say we have the following two variables, x and y.

x  y
0  0
3  3
3  4
4  5
5  5
5  5
5  5
6  6
7  6
9  6

And from this we get a scattergram and the best-fitting line through that scattergram.

D:\Word\Lect.mss\OLSregres\Ordinary Least Squares Regression 00.doc
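As a quick numerical illustration, the least squares line for the ten points above can be computed in Python (a sketch, assuming NumPy is installed); np.polyfit with degree 1 performs an ordinary least squares fit.

```python
import numpy as np

# The ten (x, y) observations from the table above.
x = np.array([0, 3, 3, 4, 5, 5, 5, 6, 7, 9], dtype=float)
y = np.array([0, 3, 4, 5, 5, 5, 5, 6, 6, 6], dtype=float)

# Degree-1 polyfit returns the least squares (slope, intercept).
b1, b0 = np.polyfit(x, y, 1)
print(f"y_hat = {b0:.3f} + {b1:.3f} x")  # roughly y_hat = 1.329 + 0.675 x
```

The fitted slope and intercept are the same numbers the hand formulas later in these notes would produce.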
[Scattergram: y plotted against x, with the fitted least squares line]

How do we form the line that goes through the data points (in the scattergram)? We do this by minimizing the sum of the squared deviations from any line we could draw through the points. We thus will choose a line that minimizes the following expression: Σ(yi − ŷ)². Here, yi are the actual values of y (for each of the sample members) and ŷ is the predicted value of y (or the line we'll be drawing through the scattering of points; note: I will sometimes refer to this as yp, where p stands for the predicted value of y). We're trying to minimize the sum of the squared deviations of the actual (sample) values of y (yi) from the best line we can draw through all of the yi points. This Σ(yi − ŷ)² expression is known as the unexplained sums of squares or the error sums of squares. The total sums of squares, given below, can be broken up into explained and unexplained sums of squares:

Σ(yi − ȳ)² = Σ(yi − ŷ)² + Σ(ŷ − ȳ)²

The expression to the left of the equals sign is the total sums of squares. The first expression after the equals sign is the unexplained sums of squares, and the second expression after the equals sign is the explained sums of squares.

Unexplained: our error in predicting what y will be by using the regression line.

Explained: what we gain by using ŷ instead of ȳ.

What we're trying to do is predict the value of y, the dependent variable, given that we know something about the person, the independent variable x. If we knew nothing about the person, our best guess of what y would be is ȳ, the mean of y. We are trying to improve on ȳ in predicting the value of y. We'll do this with our knowledge of the independent variable, x.
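This identity is easy to verify numerically. The sketch below (plain Python) refits the ten sample points from the scattergram example with the textbook sum formulas and checks that the total sums of squares equals the unexplained plus the explained sums of squares.

```python
# The ten sample observations from the scattergram example.
xs = [0, 3, 3, 4, 5, 5, 5, 6, 7, 9]
ys = [0, 3, 4, 5, 5, 5, 5, 6, 6, 6]
n = len(xs)

# Least squares slope and intercept from the usual sum formulas.
sx, sy = sum(xs), sum(ys)
sxy = sum(xi * yi for xi, yi in zip(xs, ys))
sxx = sum(xi * xi for xi in xs)
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = (sy - b1 * sx) / n

ybar = sy / n
yhat = [b0 + b1 * xi for xi in xs]

sst = sum((yi - ybar) ** 2 for yi in ys)               # total
sse = sum((yi - yh) ** 2 for yi, yh in zip(ys, yhat))  # unexplained (error)
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # explained

# Total = unexplained + explained, up to floating point rounding.
assert abs(sst - (sse + ssr)) < 1e-9
print(sst, sse, ssr)
```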
The ŷ line will allow us to predict the value of the dependent variable, y, for any value of x, the independent variable. For example, we may know that a particular state has a given unemployment rate. We may wish to predict how long a person will stay unemployed if they live in such a state. By knowing the ŷ line, we'll be able to predict how long a person stays unemployed. We may not be very accurate in our prediction if the points around the line are highly dispersed. But if the points are concentrated around the line, then we can predict fairly accurately how long someone will spend unemployed for a given unemployment rate within the state. If we were examining the effect of income (the independent variable) on expenditures (the dependent variable), we would examine the scatter of points from a sample drawn from the population. We then find the line, the regression line, that best fits these points. In what we are doing now, we are looking only at linear relationships. We can also look at non-linear relationships. Not all of the sample points will be located on the ordinary least squares regression line; some will be below the line and some will be above the line. The closer the points are to this line, the better a predictor of the dependent variable the independent variable will be. We can determine the ŷ line by the following equation:

ŷ = b0 + b1x

Here, b0 is the intercept, b1 is the slope coefficient, and x is the independent variable. ŷ is the predicted value of y for a given value of x. The formulas for determining the intercept (b0) and the slope (b1) are given below. We can define the b0 and b1 coefficients as follows:

b0, or the intercept, is the point where the line crosses the y axis when the value of x is 0. We know this because if we give x a value of 0, ŷ = b0.

b1, or the slope coefficient, tells us how much ŷ changes for a one-unit change in x.
A positive value for b1 indicates that there is a positive relationship between the independent and dependent variables. A negative value for b1 indicates that there is a negative relationship between the independent and dependent variables. A value of 1 for b1 indicates that for every one-unit increase in the independent variable, the dependent variable is predicted to increase by 1 unit. If b1 = 2, this indicates that for a one-unit increase in the independent variable, the dependent variable is predicted to increase by 2 units. If b1 = −9, this indicates that for every one-unit increase in the independent variable, the dependent variable is predicted to decrease by 9 units. Thus,
b1 = change in y per one-unit increase in x

The slope is generally defined as Δy/Δx = (y2 − y1)/(x2 − x1).

Let's say we have the following 5 observations, where x, the independent variable, is the number of children in the household, and y, the dependent variable, is the time in months unemployed.

x  y
1  1
2  2
3  3
4  4
5  5

The formula for determining the slope, or the b1 coefficient estimate, is

b1 = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

The formula for the intercept, or the b0 coefficient estimate, is

b0 = (Σy − b1Σx) / n,  or  b0 = ȳ − b1x̄

In the example given, n = 5, Σxy = 55, Σx = 15, Σy = 15, Σx² = 55, and (Σx)² = 225.

b1 = (5(55) − 15(15)) / (5(55) − 225) = 50/50 = 1

and

b0 = (15 − 1(15)) / 5 = 0/5 = 0

So, ŷ = 0 + 1(x).
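A minimal Python sketch confirms the hand calculation; the lines below mirror the slope and intercept formulas just given.

```python
# x = number of children, y = months unemployed (5 observations).
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxy = sum(xi * yi for xi, yi in zip(xs, ys))
sxx = sum(xi * xi for xi in xs)

# b1 = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
# b0 = (Sum(y) - b1*Sum(x)) / n
b0 = (sy - b1 * sx) / n

print(b0, b1)  # 0.0 1.0
```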
The b1 coefficient estimate tells us that for every one-unit increase in x, the predicted value of the dependent variable will increase by 1 unit. The b0 coefficient estimate tells us that when x = 0, the predicted value of the dependent variable is 0. When x = 1, then ŷ = 1. We could graph this line to see the relationship between the two variables, the independent and the dependent. It turns out that in this case we have a perfect relationship, because all of the points lie on the ŷ line. If we were to determine a correlation coefficient (r), it would equal 1. To graph this relationship, we could determine the value of ŷ for each x.

x  ŷ
0  0
1  1
2  2
3  3
4  4

[Scattergram: y and the fitted values plotted against x]

Let's say we have the following 5 cases for a second example.

x  y
1  5
2  4
3  3
4  2
5  1

n = 5, Σxy = 35, Σx = 15, Σy = 15, Σx² = 55, (Σx)² = 225, Σy² = 55, ȳ = 3, x̄ = 3
To determine b1:

b1 = (5(35) − 15(15)) / (5(55) − 225) = −50/50 = −1

and

b0 = 3 − (−1)(3) = 6

The regression equation is therefore ŷ = 6 − 1(x), or ŷ = 6 − x. The b1 coefficient estimate, or the slope coefficient, for this example is −1. The b0 coefficient estimate, or the intercept, is 6. Thus, when x = 0, ŷ, the predicted value of y, is 6. If x = 1, then the predicted value of y (ŷ) is 5. When x = 6, ŷ = 0. In this second situation, we again find a perfect relationship between the two variables; all of the points are on the regression line. If we were to determine the correlation coefficient (r) for this example, it would equal −1. To graph this we could determine the value of ŷ for each x value. We again use the ŷ equation from above.

x  ŷ
0  6
1  5
2  4
3  3
4  2
5  1
6  0
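The same arithmetic, wrapped in a small helper function (a sketch; the name ols_fit is just for illustration), reproduces the second example's perfect negative relationship.

```python
def ols_fit(xs, ys):
    """Simple OLS via the textbook sum formulas; returns (b0, b1)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    sxx = sum(xi * xi for xi in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Second example: y falls by one unit for each unit increase in x.
b0, b1 = ols_fit([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(b0, b1)  # 6.0 -1.0
```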
[Scattergram: y and the fitted values (ŷ = 6 − x) plotted against x]

We will rarely find a perfect relationship between two variables as we have in the two examples above. For example, if we had the following 5 cases below, we would not find a perfect relationship between the two variables.

x  y
1  6
2  5
3  3
4  7
5  9

n = 5, Σxy = 98, Σx = 15, Σy = 30, Σx² = 55, (Σx)² = 225, (Σy)² = 900, Σy² = 200, ȳ = 6, x̄ = 3

To determine b1:

b1 = (5(98) − 15(30)) / (5(55) − 225) = 40/50 = .80, and

b0 = 6 − .8(3) = 3.6

The regression equation is therefore ŷ = 3.6 + .80(x), where b1 = .8 and b0 = 3.6.
Thus, when x = 0, the predicted value for y, ŷ, is 3.6 (replace x with a value of 0 in the ŷ equation above). When x = 1, the predicted value for y, ŷ, is 4.4 (replace x with a value of 1 in the ŷ equation above). When x = 10, the predicted value for y, ŷ, is 11.6.

[Scattergram: y and the fitted values (ŷ = 3.6 + .80x) plotted against x]

A final example examines a sample of people who have been in job training programs to determine the relationship between time in these job training programs (in months) and their wage after they find work. We come up with the following b0 and b1 coefficients:

b0 = 3, b1 = 4

In other words, ŷ = 3 + 4x.

Here, x = time in months in the job training program. What we can do is put in different values of x to see what we predict about the dependent variable. If x = 0 (or the time in the job training program is 0), we would predict that the person will have a wage of $3/hour: ŷ = 3 + 4(0) = 3. If x = 1, we would predict that wages would be $7/hour: ŷ = 3 + 4(1) = 7. If x = 2 (the time in job training in months), we would predict that wages would be $11/hour.
ŷ = 3 + 4(2) = 11.

Testing to Determine if the Relationship Between the Independent and Dependent Variables is Significant, or Testing the Significance of the b1 Coefficient Estimate

You will generally be testing a null hypothesis that states that there is no relationship between the independent and dependent variables. In other words, you'll be testing the following: H0: β1 = 0. If you're testing for a positive relationship between the independent and dependent variables, your one-tailed research hypothesis will be: HR: β1 > 0. A negative research hypothesis will be: HR: β1 < 0. A two-tailed research hypothesis will be: HR: β1 ≠ 0.

In order to test for the significance of the b1 coefficient, you will have to know the standard error for the b1 coefficient. The standard error for the coefficient is very similar to a standard deviation; it measures the spread of the distribution. We will use a student t distribution to test the b1 coefficient, to determine if there is in all likelihood a relationship between the independent and dependent variables. As we've learned with the difference of means test, the student t distribution value is very similar to a z value. The t is telling us how many standard error units we are away from our null hypothesized value. The hypothesized value we're examining is the null hypothesis, a value of β1 = 0. We found that for the normal distribution, when we were 1.96 units away from the mean of the distribution (where z = 1.96), we were in the .05 tail of the normal distribution. When sample sizes get relatively large, it will again take around 1.96 units (now standard error units measured in t values rather than z values) for us to be in the .05 tail-end of the distribution. In other words, when sample sizes get large, the student t distribution turns into a normal distribution. The t value is determined by the formula below.

t(n−k−1) = b1 / sb1
where the standard error for the b1 estimate is given by

sb1 = s / √(Σ(xi − x̄)²),  or  sb1 = s / √(Σx² − (Σx)²/n)

where s = √(SSE / (n − k − 1)), and SSE stands for the error sums of squares, or the unexplained sums of squares. The n − k − 1 part of the t formula indicates the degrees of freedom. Here, n is equal to the number of observations, k is equal to the number of independent variables, and sb1 is the standard error for the b1 coefficient estimate. If we had 5 observations and 1 independent variable, we would have 3 degrees of freedom. We would use these degrees of freedom in a table of critical values for t to determine if the t value is greater than or equal to the critical value. If the t value is greater than the critical value, you will reject the null hypothesis. If the t value is less than the critical value, you will accept the null hypothesis.

Let's say that you determine that the b1 coefficient estimate = 4. You also determine that the standard error for the b1 coefficient estimate is 2, with an n = 42 (or you're examining 42 cases). Let's also say you're examining a one-tailed hypothesis at the .05 level of significance. Your t statistic would be the following:

t(42−1−1) = t(40) = 4/2 = 2

This indicates that the t value = 2, with 40 degrees of freedom. The critical value is 1.684. Because the t value is greater than the critical value, you would reject the null hypothesis at the .05 level for a one-tailed test. If you were testing this hypothesis at the .05 level for a two-tailed test, the critical value = 2.021. Because the t value is less than the critical value, you would accept the null hypothesis.
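The worked t test can be sketched as follows (plain Python; the critical values are transcribed from a t table rather than computed).

```python
# Numbers from the example above: b1 = 4, standard error 2, n = 42, k = 1.
b1, se_b1 = 4.0, 2.0
n, k = 42, 1

df = n - k - 1           # 40 degrees of freedom
t = b1 / se_b1           # 2.0

# .05-level critical values for 40 df, from a t table.
cv_one_tailed = 1.684
cv_two_tailed = 2.021

print(t > cv_one_tailed)   # True: reject H0 with a one-tailed test
print(t > cv_two_tailed)   # False: fail to reject with a two-tailed test
```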
AN EXAMPLE

You're examining the relationship between age and wage. You have the following 4 observations:

Obs  Age (X)  Wage (Y)
1    20       5.50
2    30       6.50
3    40       7.50
4    50       8.00

From this information, we could determine the b0 and b1 coefficients: b0 = 3.9, b1 = .085.

ŷ = 3.9 + .085x

sb1 = √(.0375 / (5400 − 4900)) = .00866

We can then determine whether the t coefficient is significant by using the t formula:

t = .085/.00866 = 9.81

At two degrees of freedom for a .05, two-tailed test, the critical value is 4.303. Because the t value is greater than the critical value, we reject the null hypothesis.

Using the F test to determine statistical significance

The F test will determine whether your regression model (including all of the covariates) is statistically significant. In the single-covariate case, you will be testing whether the single covariate is statistically significant. We will use the Mean Square Regression and the Mean Square Error in an F test.

F(k, n−k−1) = MSR/MSE

where we are testing the following hypothesis:

H0: β1 = 0
Ha: β1 ≠ 0

In our previous example, we determined that SSE = .075. We could then use the formula for the
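Running the age/wage numbers through the formulas from the previous section (a Python sketch using only the standard library) reproduces the coefficient, standard error, and t value.

```python
import math

ages  = [20, 30, 40, 50]
wages = [5.50, 6.50, 7.50, 8.00]
n, k = len(ages), 1

sx, sy = sum(ages), sum(wages)
sxy = sum(x * y for x, y in zip(ages, wages))
sxx = sum(x * x for x in ages)

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = sy / n - b1 * sx / n

# SSE, then s^2 = SSE / (n - k - 1), then the standard error of b1.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, wages))
s2 = sse / (n - k - 1)
se_b1 = math.sqrt(s2 / (sxx - sx ** 2 / n))

t = b1 / se_b1
print(b0, b1, se_b1, t)  # roughly 3.9, 0.085, 0.00866, 9.81
```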
total sums of squares, Σ(yi − ȳ)² = 3.688, or determine the SSR, or regression sums of squares, Σ(ŷ − ȳ)² = 3.613. To then determine F, we need to determine the Mean Square Regression and the Mean Square Error.

MSR = SSR/k
MSE = SSE/(n − k − 1)

MSR = 3.613/1 = 3.613
MSE = .075/2 = .0375

F(1, 2) = 3.613/.0375 = 96.35

If we look on an F table with 1 and 2 degrees of freedom, we find that the critical value is 18.51. Because the F value is greater than the critical value, we will reject the null hypothesis.

Confidence Intervals for β1

b1 = .085

CI = b1 ± sb1 × CV

Or, the margin of error will be the standard error for the estimate multiplied by the critical value. We will use the t table to determine critical values. In this example, our estimate for β1 = .085 and sb1 = .00866. The critical value (CV) for the t test is 4.303 for a .05 test (given our small degrees of freedom). So the 95% CI for the coefficient estimate is:

.085 ± .0086 × 4.3 = .048 to .122

We are 95% confident that the β1 coefficient in the population lies between these two values. Or, for every additional year of age, wages are predicted to increase by between 4.8 cents and 12.2 cents per hour.
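The F statistic and the confidence interval can be verified the same way. This sketch recomputes the sums of squares for the age/wage data; exact arithmetic gives slightly different final digits than the rounded hand calculation above.

```python
import math

ages  = [20, 30, 40, 50]
wages = [5.50, 6.50, 7.50, 8.00]
n, k = len(ages), 1

sx, sy = sum(ages), sum(wages)
sxy = sum(x * y for x, y in zip(ages, wages))
sxx = sum(x * x for x in ages)
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = sy / n - b1 * sx / n

ybar = sy / n
yhat = [b0 + b1 * x for x in ages]
sst = sum((y - ybar) ** 2 for y in wages)                # 3.6875
sse = sum((y - yh) ** 2 for y, yh in zip(wages, yhat))   # 0.075
ssr = sum((yh - ybar) ** 2 for yh in yhat)               # 3.6125

F = (ssr / k) / (sse / (n - k - 1))                      # MSR / MSE

# 95% CI for beta1: b1 +/- CV * se, with CV = 4.303 for 2 df from a t table.
se_b1 = math.sqrt((sse / (n - k - 1)) / (sxx - sx ** 2 / n))
cv = 4.303
lo, hi = b1 - cv * se_b1, b1 + cv * se_b1
print(F, lo, hi)
```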