sociology 36: regression

Regression is a means of studying how the conditional distribution of a response variable (say, Y) varies for different values of one or more independent explanatory variables (say, X). The feature of the response-variable distribution that most work on regression looks at is the mean. The response variable is frequently quantitative and measured on a true metric, but it doesn't have to be: we'll do regression with qualitative, categorical response variables. The independent variables (aka regressors) are frequently quantitative, but they don't have to be: we'll do regressions with qualitative, categorical independent variables. But for the time being we'll work exclusively with regression models in which the dependent variable and the independent variable are both quantitative.

Below we use data from respondents to the Current Population Survey (CPS) to look at how the mean of the sample conditional distribution of hourly wage varies across distinct values of schooling. Let's begin by graphing Y against X, wages (vertical axis) against schooling (horizontal axis).

figure 1. conditional distributions of wage by schooling

Let's begin by looking at a model for the mean of wages that totally ignores schooling. Write this model as

    M(y | x_j) = a

where a is a constant that is calculated from sample data. Let the calculated value of a be written as â. Then the predicted or fitted value of wage for the ith person at the jth value of schooling can be written as

    ŷ_ij = â
So the equation for the observed value of wage for the ith person at the jth year of schooling is

    y_ij = â + ê_ij

where the term on the end is the residual, the difference between the observed value of the response variable and the fitted value from the model. To render all this operational, the constant â must be calculated from sample data. For that purpose we use the value of â that minimizes the sum of the squared residuals:

    Σ ê_ij² = Σ (y_ij − ŷ_ij)² = Σ (y_ij − â)²

The value of â can be found by running:

1. regress hrwage

          Source |       SS       df       MS        Number of obs =       1
    -------------+------------------------------     F(   ,    14) =       .
           Model |         .       .        .        Prob > F      =       .
        Residual |   1374.963     14     4.783       R-squared     =       .
    -------------+------------------------------     Adj R-squared =       .
           Total |   1374.963     14     4.783       Root MSE      =   4.967

          hrwage |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
    -------------+---------------------------------------------------------
           _cons |   9.88874      .161    4.36     .       8.66499   9.13649

which yields the least-squares value of â = ȳ = 9.889. This will be our predicted or fitted value of wage for everyone in the sample, no matter how many years of schooling they have, since the model ignores schooling.

2. predict grand

Here's the graph of the fitted line against schooling.
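As a quick numerical check of the claim that the sum-of-squared-residuals criterion picks out the sample mean, here is a minimal Python sketch (the data are made-up toy numbers, not the CPS sample):

```python
import numpy as np

# Toy hourly-wage data (illustrative only, not the CPS sample).
y = np.array([5.0, 8.0, 9.0, 12.0, 16.0])

# Least-squares constant: the a that minimizes sum((y - a)^2)
# is claimed to be the sample mean.
a_hat = y.mean()

# Check by brute force over a grid of candidate constants.
candidates = np.linspace(0.0, 20.0, 2001)
ssr = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
best = candidates[ssr.argmin()]

print(a_hat)   # 10.0
print(best)    # 10.0, matching the sample mean
```

The grid search and the analytic answer agree: no constant produces a smaller residual sum of squares than ȳ.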
figure 2. fitting the constant function

Now let's fit a model in which the fitted/predicted values of y are equal to the mean wage at each value of schooling. In contrast to the previous model, in which there was the same mean wage at every value of schooling, let's consider a model in which there's a possibly different value of the mean at every value of X, a different fitted value. Hence, there will be as many different, distinct predictions as there are different values of schooling: in this case, eleven. So the second model for the mean of y is

    M(y | x_j) = a_j

Then the predicted or fitted value of wage for the ith person at the jth value of schooling can be written as

    ŷ_ij = â_j

where the values of â_j that minimize the sum of squared residuals are the conditional sample means at each value of schooling, ȳ_j. Then the equation for the ith observation at the jth value of schooling is

    y_ij = â_j + ê_ij

To find the eleven fitted wage values, I issue the following command:
3. oneway hrwage edyrs, tab

                |          Summary of hrwage
          edyrs |       Mean   Std. Dev.      Freq.
    ------------+------------------------------------
              8 |       8.98        .43          1
              9 |       7.33        4.           1
             10 |        7.3        .66         17
             11 |        6.8       3.33          7
             12 |       7.89       3.69          1
             13 |        8.         4.          36
             14 |       1.41        .3           1
             15 |       1.67        .4          13
             16 |       1.84        .3           7
             17 |      13.61       6.98          4
             18 |       13.3        6.9         31
    ------------+------------------------------------
          Total |        9.9       4.91          1

                          Analysis of Variance
        Source              SS       df       MS          F     Prob > F
    ------------------------------------------------------------------------
    Between groups        1.983      1      .1983       1.91        .
    Within groups      117.9798      4    .1844837
    ------------------------------------------------------------------------
        Total          1374.963     14     4.783

Here's the graph of this sample fitted conditional mean function:

figure 3. conditional mean function

Instead of a sample conditional mean function that fits exactly the mean of wage for every distinct value of schooling, perhaps we would prefer, or be satisfied with, a linear approximation to it. To get the best linear predictor of wage given schooling, we do a linear regression of wage on schooling. The model for the mean wage is then

    M(y | x_j) = a + b x_j
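What oneway reports can be mimicked by computing group means directly. The sketch below (Python, with made-up toy numbers rather than the CPS data) fits the conditional-mean model and verifies the analysis-of-variance identity: the total sum of squares splits into between-groups and within-groups pieces.

```python
import numpy as np

# Toy data: x = years of schooling, y = hourly wage (illustrative values only).
x = np.array([12, 12, 12, 16, 16, 16])
y = np.array([8.0, 10.0, 9.0, 14.0, 12.0, 16.0])

# Conditional-mean model: the fitted value is the mean wage at each value of x.
cond_means = {int(v): float(y[x == v].mean()) for v in np.unique(x)}
fitted = np.array([cond_means[int(v)] for v in x])

# ANOVA identity: total SS = between-groups SS + within-groups SS.
sst = ((y - y.mean()) ** 2).sum()
ss_within = ((y - fitted) ** 2).sum()           # residual SS of this model
ss_between = ((fitted - y.mean()) ** 2).sum()

print(cond_means)                     # {12: 9.0, 16: 14.0}
print(sst, ss_between + ss_within)    # 47.5 47.5
```

The within-groups sum of squares is exactly the residual sum of squares of the conditional-mean model, which is why it reappears in the model-comparison table later on.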
which yields the equation for the fitted line:

    ŷ_ij = â + b̂ x_j

So the equation for the observation is

    y_ij = â + b̂ x_j + ê_ij

The least-squares values of â and b̂ can be found by running:

4. regress hrwage edyrs

          Source |       SS       df       MS        Number of obs =       1
    -------------+------------------------------     F(  1,    13) =    97.7
           Model |   198.6338      1   198.6338      Prob > F      =       .
        Residual |  1394.8996     13      .696       R-squared     =     .16
    -------------+------------------------------     Adj R-squared =    .184
           Total |   1374.963     14     4.783       Root MSE      =    4.14

          hrwage |     Coef.   Std. Err.     t    P>|t|    [95% Conf. Interval]
    -------------+---------------------------------------------------------
           edyrs |     .8347    .83333     9.88     .      .698174   .987137
           _cons |  -1.77461   1.11671    -1.89    .113  -3.968497   .419961

5. predict blp

Here's the graph of the fitted values of wage from the linear regression.

figure 4. best linear predictor
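The least-squares values have closed forms: b̂ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and â = ȳ − b̂ x̄. A small Python check on toy numbers (not the CPS sample):

```python
import numpy as np

# Toy data (illustrative, not the CPS sample).
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([6.0, 7.0, 9.0, 10.0, 13.0])

# Least-squares slope: covariation of x and y over variation of x.
b_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# Intercept: the fitted line passes through the point of means (x-bar, y-bar).
a_hat = y.mean() - b_hat * x.mean()

residuals = y - (a_hat + b_hat * x)
print(round(b_hat, 4), round(a_hat, 4))    # 0.85 -1.2
print(round(abs(residuals.sum()), 10))     # 0.0 (residuals sum to zero)
```

The zero-sum residual check is a property of any least-squares fit that includes an intercept, and it holds for the constant-only model above as well.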
Here's the graph of all the fitted models. The linear regression does a good job of tracking the exact fitted conditional mean function. To see how good, compare the mean square residuals from the different models.

figure 5. constant, mean, and blp functions

model comparisons

                                       constant model  conditional mean  linear regression
    SST    total sum of squares            1374.96         1374.96          1374.96
    SSres  residual sum of squares         1374.96          117.98          1394.9
    SSreg  regression sum of squares                          1.98           198.6
    df     residual degrees of freedom    (n-1) 14        (n-11) 4         (n-2) 13
    MSres  mean square residual       1374.9/14 = 4.7  117.98/4 = .18  1394.9/13 = .6
    Root MSres                        sqrt(4.7) = 4.9  sqrt(.18) = 4.49  sqrt(.6) = 4.
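The ranking of the residual sums of squares in the comparison is no accident: the three models are nested (the constant is a special case of the line, and the line is a special case of the conditional-mean function), so a more flexible model can never have a larger residual sum of squares. A Python sketch on toy numbers (not the CPS data) illustrating the ordering:

```python
import numpy as np

# Toy data with repeated schooling values (illustrative, not the CPS sample).
x = np.array([8.0, 8.0, 12.0, 12.0, 16.0, 16.0])
y = np.array([6.0, 8.0, 9.0, 11.0, 12.0, 16.0])

# Model 1: constant (grand mean). Its residual SS equals the total SS.
ss_const = ((y - y.mean()) ** 2).sum()

# Model 2: least-squares line.
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
ss_lin = ((y - (a + b * x)) ** 2).sum()

# Model 3: conditional means at each distinct value of x.
fitted_cm = np.array([y[x == v].mean() for v in x])
ss_cm = ((y - fitted_cm) ** 2).sum()

# Nested models: extra flexibility never increases the residual SS.
print(ss_cm <= ss_lin <= ss_const)   # True
```

On these toy numbers the line tracks the conditional means closely (residual SS of about 12.33 versus 12.0), just as the best linear predictor tracks the conditional mean function in the figure.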
Other statistics for wages and schooling

    total variation in y:       Σ (y − ȳ)² = 1374.963
    standard deviation of y:    s_y = sqrt(1374.963 / 14) = 4.91
    mean of y:                  ȳ = 9.88
    total variation in x:       Σ (x − x̄)² = 919.967
    standard deviation of x:    s_x = sqrt(919.967 / 14) = .38
    mean of x:                  x̄ = 13.19
    covariation of y and x:     Σ (x − x̄)(y − ȳ) = 44.8
    covariance of y and x:      s_xy = 44.8 / 14 = 4.68
    correlation of x and y:     r_xy = s_xy / (s_x s_y) = 4.68 / ((.38)(4.91)) = .4
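These pieces fit together: the regression slope is b̂ = s_xy / s_x², and the correlation is r_xy = s_xy / (s_x s_y), so r_xy = b̂ (s_x / s_y). A Python check on toy numbers (not the CPS figures):

```python
import numpy as np

# Toy data (illustrative, not the CPS sample).
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([6.0, 7.0, 9.0, 10.0, 13.0])
n = len(x)

s_x = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))      # sd of x
s_y = np.sqrt(((y - y.mean()) ** 2).sum() / (n - 1))      # sd of y
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)  # covariance

r_xy = s_xy / (s_x * s_y)     # correlation of x and y
b_hat = s_xy / s_x ** 2       # regression slope of y on x

# Cross-checks: numpy's built-in correlation, and the slope/correlation identity.
print(np.isclose(r_xy, np.corrcoef(x, y)[0, 1]))   # True
print(np.isclose(r_xy, b_hat * s_x / s_y))         # True
```

Note that the (n − 1) divisor cancels out of both r_xy and b̂, which is why the slope from the regression output and the ratio of these summary statistics agree.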