sociology 36: regression

Regression is a means of modeling how the conditional distribution of a response variable (say, Y) varies across the values of one or more independent, explanatory variables (say, X). The feature of the response variable's distribution that has attracted the most interest in the past is the mean. The response variable is frequently quantitative and measured on a true metric, but it doesn't have to be; similarly, the independent variables are frequently quantitative, but they don't have to be. For the time being we'll work exclusively with regression models in which both the dependent variable and the independent variables are quantitative.

Below we use data from respondents to the Current Population Survey (CPS) to look at how the mean of the sample conditional distribution of hourly wage varies across distinct values of schooling. Let's begin by graphing Y against X, i.e., wages (vertical axis) against schooling (horizontal axis).

[figure 1. conditional distributions of wage by schooling]

Model 1

We'll start with a model for the mean of wages that ignores schooling entirely. Write this model as

    M(y_ij) = a

where x_j (8 ≤ x_j ≤ 18) is a schooling value and a is a constant that is calculated from sample data. Let the calculated value of a be written â. Then the predicted or fitted value of wage for the ith person at the jth value of schooling can be written as

    ŷ_ij = â
So the equation for the observed value of wage for the ith person at the jth year of schooling can be written as

    y_ij = â + ê_ij

where the term on the end is the residual: the difference between the observed value of the response variable and the fitted value from the model. To render all this operational, the constant a must be calculated from sample data. For that purpose we use the function of sample data that minimizes the sum of the squared residuals:

    Σ ê_ij² = Σ (y_ij − ŷ_ij)² = Σ (y_ij − â)²

The value of â can be found by running

    . regress hrwage

          Source |       SS       df       MS           Number of obs =
    -------------+------------------------------        F(   ,   14) =
           Model |                                      Prob > F     =
        Residual |   1374.963     14     4.783          R-squared    =
    -------------+------------------------------        Adj R-squared =
           Total |   1374.963     14     4.783          Root MSE     =  4.967

          hrwage |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
    -------------+-----------------------------------------------------------
           _cons |   9.88874       .16     4.36           8.66499    9.13649

which yields the least-squares value of â = ȳ = 9.88874. This will be our predicted or fitted value of wage for everyone in the sample, no matter how many years of schooling they have, since the model ignores schooling. Here's the graph of the fitted line against schooling.

[figure 2. fitting constant function]
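The least-squares logic behind Model 1 can be sketched in a few lines of Python. This is an illustrative toy example, not the CPS sample above: it checks that the grand mean is the constant that minimizes the sum of squared residuals.

```python
# Constant model M(y) = a: the least-squares value of a is the grand mean.
# Toy hourly wages (hypothetical values, not the CPS data used in the text).
wages = [6.0, 8.5, 9.0, 11.5, 15.0]

def ssr(a):
    """Sum of squared residuals when every fitted value is the constant a."""
    return sum((y - a) ** 2 for y in wages)

a_hat = sum(wages) / len(wages)  # grand mean; the fitted wage for everyone

# Nudging a away from the mean in either direction raises the SSR.
assert ssr(a_hat) < ssr(a_hat + 0.1)
assert ssr(a_hat) < ssr(a_hat - 0.1)
print(a_hat)  # 10.0
```

As in the text, every person gets the same fitted wage under this model, whatever their schooling.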
Model 2

Now let's fit a model in which the fitted values of y are equal to the mean wage at each distinct value of schooling. In contrast to the previous model, in which there was the same mean wage at every value of schooling, this model accommodates a possibly different value of the mean at every value of X. Hence there will be as many different, distinct predictions as there are different values of schooling: in this case, eleven. You can see from the scatter diagram that this makes more sense. So the second model for the mean of y is

    M(y_ij) = a_j

Then the predicted or fitted value of wage for the ith person at the jth value of schooling can be written as

    ŷ_ij = â_j

where the values of â_j that minimize the sum of squared residuals are the conditional sample means at each value of schooling, i.e., â_j = ȳ_j. The equation for the ith observation at the jth value of schooling is then

    y_ij = â_j + ê_ij

To find the eleven fitted wage values, I issue the following Stata command:

    . oneway hrwage edyrs, tab

                 Summary of hrwage
    edyrs |     Mean   Std. Dev.    Freq.
    ------+------------------------------
        8 |      .98        .43        9
          |     7.33        4.         1
          |     7.3         .66       17
       11 |     6.8        3.33        7
       1  |     7.89       3.69
       13 |     8.         4.         36
       14 |      .41        .3
          |      .67        .4        13
       16 |      .84        .3         7
       17 |    13.61       6.98        4
       18 |    13.3        6.9        31
    ------+------------------------------
    Total |     9.9        4.91

                        Analysis of Variance
        Source            SS       df        MS         F     Prob > F
    ------------------------------------------------------------------
    Between groups      1.983             .1983        .91
    Within groups      17.9798     4      .1844837
    ------------------------------------------------------------------
        Total          1374.963   14      4.783

Here's the graph of this sample fitted conditional mean function:
[figure 3. conditional mean function]

Model 3

Instead of a sample conditional mean function that exactly fits the mean of wage at every distinct value of schooling, perhaps we would prefer, or be satisfied with, a linear approximation to it. To get the best linear predictor of wage given schooling, we do a linear regression of wage on schooling. The model for the mean wage is then

    M(y_ij) = a + b·x_j

which yields the equation for the fitted line:

    ŷ_ij = â + b̂·x_j

So the equation for the ijth observation is

    y_ij = â + b̂·x_j + ê_ij

To render all this operational, the constants â and b̂ must be calculated from sample data. For that purpose we again use the function of sample data that minimizes the sum of the squared residuals:

    Σ ê_ij² = Σ (y_ij − ŷ_ij)² = Σ (y_ij − â − b̂·x_j)²

The values of â and b̂ can be found by running
    . regress hrwage edyrs

          Source |       SS       df       MS           Number of obs =
    -------------+------------------------------        F(  1,   13) =   97.7
           Model |   198.6338      1   198.6338         Prob > F     =
        Residual |   394.8996     13      .696          R-squared    =    .16
    -------------+------------------------------        Adj R-squared =    .84
           Total |   1374.963     14     4.783          Root MSE     =   4.14

          hrwage |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
    -------------+-----------------------------------------------------------
           edyrs |    .8347      .8333     9.88            .698174    .987137
           _cons | -1.77461     1.1167    -1.89    .113  -3.968497    .419961

Here's the graph of the fitted values of wage from the linear regression.

[figure 4. best linear predictor]

Below is the graph of all the fitted models. The linear regression does a good job of tracking the exact fitted conditional mean function. To see how good, compare the mean square residuals from the different models as given in the table.
[figure 5. constant, mean, and blp functions]

model comparisons

                                 constant model     conditional mean    linear regression
    SST (total sum of squares)      1374.96             1374.96             1374.96
    SSresidual                      1374.96              17.98               394.9
    SSregression                                          1.98               198.6
    df residual                   (n-1) = 14          (n-11) = 4          (n-2) = 13
    MSresidual                 (1374.9/14) = 4.7    (17.98/4) = .18    (394.9/13) = .6
    Root MSresidual             sqrt(4.7) = 4.9     sqrt(.18) = 4.49    sqrt(.6) = 4.
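The comparison table can be reproduced in miniature. The Python sketch below uses toy data (hypothetical, not the CPS sample summarized above) to fit all three models and confirm the pattern the table shows: the residual sum of squares can only fall as the model becomes more flexible, since the conditional-mean fit nests the linear fit, which in turn nests the constant fit.

```python
# Fit the three models from the text to toy data and compare residual SS.
from collections import defaultdict

xs = [12, 12, 12, 16, 16, 18]            # years of schooling (toy values)
ys = [7.0, 9.0, 8.0, 12.0, 14.0, 15.0]   # hourly wage (toy values)
n = len(xs)

def ssr(fitted):
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))

# Model 1: constant (grand mean). Its SSR is also the total sum of squares.
ybar = sum(ys) / n
ss_constant = ssr([ybar] * n)

# Model 2: conditional mean at each distinct schooling value.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
means = {x: sum(v) / len(v) for x, v in groups.items()}
ss_condmean = ssr([means[x] for x in xs])

# Model 3: best linear predictor via least squares.
xbar = sum(xs) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
ss_linear = ssr([a + b * x for x in xs])

# More flexible models cannot fit worse: SST >= SS_linear >= SS_condmean.
assert ss_constant >= ss_linear >= ss_condmean
```

This mirrors the SSresidual row of the table, where the conditional-mean model has the smallest residual sum of squares and the constant model's residual sum equals SST.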
Other statistics for wages and schooling

    total variation in y:       Σ(y − ȳ)² = 1374.963
    standard deviation of y:    s_y = sqrt(1374.963/14) = 4.91
    mean of y:                  ȳ = 9.88
    total variation in x:       Σ(x − x̄)² = 919.967
    standard deviation of x:    s_x = sqrt(919.967/14) = .38
    mean of x:                  x̄ = 13.19
    covariation of y and x:     Σ(x − x̄)(y − ȳ) = 44.8
    covariance of y and x:      s_xy = 44.8/14 = 4.68
    correlation of x and y:     r_xy = s_xy/(s_x·s_y) = 4.68/((.38)(4.91)) = .4
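These summary quantities fit together mechanically. A short Python sketch (again toy data, not the CPS values reported above) computes the standard deviations, covariance, and correlation, and checks the identity tying the correlation to the regression slope, b̂ = r_xy·(s_y/s_x).

```python
# Summary statistics and the slope/correlation identity b̂ = r * (s_y / s_x).
# Toy data (hypothetical, not the CPS values in the text).
from math import sqrt

xs = [12, 12, 12, 16, 16, 18]
ys = [7.0, 9.0, 8.0, 12.0, 14.0, 15.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

s_x = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))   # sd of schooling
s_y = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))   # sd of wages
s_xy = sum((x - xbar) * (y - ybar)
           for x, y in zip(xs, ys)) / (n - 1)            # covariance

r_xy = s_xy / (s_x * s_y)     # correlation of x and y
b_hat = s_xy / s_x ** 2       # least-squares slope of wage on schooling

assert -1.0 <= r_xy <= 1.0
assert abs(b_hat - r_xy * s_y / s_x) < 1e-9
```

The identity holds because both sides reduce to s_xy/s_x², which is why a correlation near zero implies a flat best linear predictor.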