Project Report for STAT7 Statistical Methods
Instructor: Dr. Ramon V. Leon
Wage Data Analysis
Yuanlei Zhang
November
Part 1: Introduction

Data Set

The data set contains a random sample of observations on variables sampled from the Current Population Survey of 1985. It provides information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. This data set was obtained from StatLib; its original source is: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley. (The JMP file containing this data set is attached as raw_data.jmp.)

Variables

The variables contained in the data set are summarized in the table below:

Variable Name   Description                                                Data Type
Response Variable:
WAGE            Wage (dollars per hour).                                   Continuous
Predictor Variables:
EDUCATION       Number of years of education.                              Continuous
SOUTH           Indicator variable for Southern region                     Nominal
                (1 = person lives in South, 0 = person lives elsewhere)
SEX             Indicator variable for sex (1 = Female, 0 = Male)          Nominal
EXPERIENCE      Number of years of work experience.                        Continuous
UNION           Indicator variable for union membership                    Nominal
                (1 = union member, 0 = not a union member)
AGE             Age (years).                                               Continuous
RACE            Race (1 = Other, 2 = Hispanic, 3 = White)                  Nominal
OCCUPATION      Occupational category (1 = Management, 2 = Sales,          Nominal
                3 = Clerical, 4 = Service, 5 = Professional, 6 = Other)
SECTOR          Sector (0 = Other, 1 = Manufacturing, 2 = Construction)    Nominal
MARR            Marital status (0 = Unmarried, 1 = Married)                Nominal

NOTE: The nominal variables SOUTH, SEX, UNION and MARR already serve as dummy variables, since they can only take the values 0 and 1; for the nominal variables RACE, OCCUPATION and SECTOR, dummy variables will be introduced automatically by JMP when doing regressions.

Objective

According to common sense, wages may depend more or less on the other characteristics of the workers. The objectives of this project are to find out whether wages are indeed related to these characteristics and, if so, what are possible ways to model such relationships and how good these models are. A final model will be selected that best models the relationship between wages and these characteristics on the given data set, which might help to predict a worker's wage based on some relevant characteristics of the worker. However, since the data were collected in 1985, it might not be appropriate for us to make general inferences about a worker's wage at the current time.
Part 2: Data Analysis

Problems with the Data

As an initial attempt, WAGE is fitted against all predictor variables.

Model 1:
WAGE = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 AGE + β7 RACE + β8 OCCUPATION + β9 SECTOR + β10 MARR

(NOTE: The nominal variables RACE, OCCUPATION and SECTOR should be replaced with corresponding dummy variables. For simplicity, the model expression above doesn't reflect this, but the conversions are done when actually carrying out the regression.)

The output of the least squares regression immediately helps to identify two major problems to be corrected before any further analyses can be carried out.

Problem 1: Unstable variance

The problem of unstable variance is identified by examining the JMP residual plot against the predicted values:

[Figure: Residual by Predicted Plot, WAGE Residual vs. WAGE Predicted]

From the above residual plot, we can easily see that the variance increases with the predicted value. This indicates that Var(Y) is a function of E(Y) instead of being constant, and Y therefore needs to be transformed to stabilize the variance. Applying the Box-Cox transformation technique in JMP, we find that the best transformation for stabilizing Var(Y) is the log transformation. Therefore, the new model becomes:
Model 1+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 AGE + β7 RACE + β8 OCCUPATION + β9 SECTOR + β10 MARR

After refitting Log(WAGE) against all possible predictor variables, the residual plot is as follows:

[Figure: Residual by Predicted Plot, Log(WAGE) Residual vs. Log(WAGE) Predicted]

The residual plot now suggests a much more constant variance, and from this step on, we will use Log(WAGE) instead of WAGE in all the least squares regressions.

Problem 2: Multicollinearity

The problem of multicollinearity is identified by examining the VIF values in the JMP least squares regression output shown below:

Parameter Estimates
Term            VIF
Intercept       .
EDUCATION       .8
SOUTH           .
SEX             .79997
EXPERIENCE      .87
UNION           .88
AGE             9.887
RACE[]          .7977
RACE[]          .8
OCCUPATION[]    .79
OCCUPATION[]    .97
OCCUPATION[]    .78
OCCUPATION[]    .889
OCCUPATION[]    .9899
SECTOR[]        .9
SECTOR[]        .99
MARR            .9
We see that the VIF values of EDUCATION, EXPERIENCE and AGE are much greater than 10, which suggests a serious multicollinearity problem. Usually, multicollinearity problems are caused by correlated predictor variables. Therefore, pairwise correlations between EDUCATION, EXPERIENCE and AGE are calculated to identify the cause of this multicollinearity problem.

Correlations
             EDUCATION   EXPERIENCE   AGE
EDUCATION    1.          -.7          -.
EXPERIENCE   -.7         1.           .978
AGE          -.          .978         1.

[Figure: Scatterplot matrix of EDUCATION, EXPERIENCE and AGE]

From the pairwise correlations and the scatterplot matrix, we see that AGE and EXPERIENCE are almost perfectly correlated, with a correlation coefficient r = .978. This explains the multicollinearity problem identified above; to solve it, we need to remove either AGE or EXPERIENCE. The models evaluated below start with removing the AGE variable.
Models without the AGE Variable

Removing the AGE variable from Model 1+, we get the model below:

Model 1-AGE:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 RACE + β7 OCCUPATION + β8 SECTOR + β9 MARR

Fitting the above model in JMP, we get the following least squares regression output.

Summary of Fit
RSquare                      .8
RSquare Adj                  .889
Observations (or Sum Wgts)

Parameter Estimates
Term            Estimate   VIF
Intercept       .7         .
EDUCATION       .          .88
SOUTH           -.98       .89
SEX             -.9        .89
EXPERIENCE      .97        .889
UNION           .7         .89
RACE[]          -.8        .79
RACE[]          -.9        .89
OCCUPATION[]    .88        .
OCCUPATION[]    -.         .78
OCCUPATION[]    .          .9
OCCUPATION[]    -.7        .888
OCCUPATION[]    .77        .997
SECTOR[]        -.9        .99
SECTOR[]        .          .77
MARR            .8         .997

Effect Tests
Source       Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION    8.8 .8 <.0001
SOUTH        .9777 .999 .
SEX          .9899 .877 <.0001
EXPERIENCE   .9878 9.9 <.0001
UNION        .88 7. <.0001
RACE         .87 . .9
OCCUPATION   .9 .9 <.0001
SECTOR       .87 .7 .
MARR         .87 .99 .

Removing the AGE variable results in a drop in the R² value from .7 to .8, which is negligible. However, the multicollinearity problem has been solved, as there is now no VIF value greater than 10. The above output also suggests that the coefficients of the RACE, SECTOR and MARR variables are not significant at the α = .05 level, since their corresponding P-values in the partial F-tests are greater than .05. Therefore, in the next model, we will remove these variables to see how the regression results are affected.
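The dummy-variable expansion that JMP performs automatically for RACE, OCCUPATION and SECTOR can be sketched with statsmodels' `C()` operator. Note one difference: JMP uses effect (sum-to-zero) coding while statsmodels defaults to treatment coding, so individual coefficients differ even though the overall fit is identical. The data and the occupation premium below are synthetic stand-ins:

```python
# A sketch of dummy-variable expansion for a nominal predictor, which JMP
# does automatically; statsmodels' C() does the same (with treatment
# coding rather than JMP's effect coding). Synthetic stand-in data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "EDUCATION": rng.integers(8, 19, n).astype(float),
    "SEX": rng.integers(0, 2, n),
    "OCCUPATION": rng.integers(1, 7, n),   # six occupational categories
})
# Synthetic log-wage with a premium for occupation category 1.
df["LOGWAGE"] = (0.8 + 0.08 * df["EDUCATION"] - 0.2 * df["SEX"]
                 + np.where(df["OCCUPATION"] == 1, 0.3, 0.0)
                 + rng.normal(0, 0.3, n))

# C(OCCUPATION) expands the 6-level factor into 5 dummy columns (one level
# serves as the baseline), mirroring the OCCUPATION[...] terms in the
# JMP Parameter Estimates tables.
fit = smf.ols("LOGWAGE ~ EDUCATION + SEX + C(OCCUPATION)", data=df).fit()
occ_terms = [t for t in fit.params.index if "OCCUPATION" in t]
print(occ_terms)
```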
Model 2:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION

Summary of Fit
RSquare                      .877
RSquare Adj                  .9
Observations (or Sum Wgts)

Analysis of Variance
Source     DF  Sum of Squares  Mean Square  F Ratio
Model      .7 .7 7.97
Error      9.77 .89
C. Total   8.8
Prob > F   <.0001

Parameter Estimates
Term           Estimate  VIF
Intercept      .79       .
EDUCATION      .97       .9899
SOUTH          -.        .79
SEX            -.88      .7
EXPERIENCE     .9        .8
UNION          .8        .8
OCCUPATION[]   .8        .
OCCUPATION[]   -.79      .77
OCCUPATION[]   -.        .89
OCCUPATION[]   -.89      .8
OCCUPATION[]   .         .9

Effect Tests
Source       Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION    9. 8.8 <.0001
SOUTH        .988 .79 .
SEX          .9 .9 <.0001
EXPERIENCE   7.89 .89 <.0001
UNION        .999 .7 <.0001
OCCUPATION   7.88 7. <.0001

Cp      9.89
Press   .77

The F-test of the hypothesis H0: β1 = β2 = β3 = β4 = β5 = β6 = 0 indicates rejection of H0, since the P-value of the pooled F-test is less than .0001. Also, each of the coefficients in the model is statistically significant, as the P-values of the partial F-tests are very small. All VIF values are less than 10, which suggests no multicollinearity problem in the model. The R² value of this model is .877 (a slight drop from .8 in the previous model), and this is the share of the variability in the wages of workers explained by the regression on the predictors. This may not seem a satisfactory result, but considering the uncertainties of reality, it may still be acceptable. The R²adj, Cp and PRESS statistics are also calculated for comparison with the other competing models. In all, Model 2 seems quite reasonable and suggests a good candidate model to consider. The next step is to consider all possible interactions between the predictors as well as the corresponding quadratic terms (only for EDUCATION and EXPERIENCE), and see
if we can have an improvement in the regression results. Since there are C(6,2) = 15 possible interaction terms in total, plus 2 quadratic terms, the JMP output is very lengthy and is therefore omitted. As the final result, only EXPERIENCE² turns out to be significant; EDUCATION² as well as all the interaction terms can be discarded, as they neither are significant nor add much information to the model. Introducing EXPERIENCE² to Model 2, we get the following model:

Model 2+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

Summary of Fit
RSquare                      .88
RSquare Adj                  .
Observations (or Sum Wgts)

Analysis of Variance
Source     DF  Sum of Squares  Mean Square  F Ratio
Model      .777 .977 7.798
Error      9.9 .799
C. Total   8.8
Prob > F   <.0001

Parameter Estimates
Term                                 Estimate  VIF
Intercept                            .9987     .
EDUCATION                            .8        .9977
SOUTH                                -.97      .9
SEX                                  -.        .7
EXPERIENCE                           .88       .8
UNION                                .9        .988
OCCUPATION[]                         .97       .7
OCCUPATION[]                         -.987     .79
OCCUPATION[]                         -.877     .8
OCCUPATION[]                         -.889     .98
OCCUPATION[]                         .         .9
(EXPERIENCE-7.8)*(EXPERIENCE-7.8)    -.7       .897

Effect Tests
Source                  Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION               7.88 . <.0001
SOUTH                   .9 .8 .8
SEX                     .87 7.87 <.0001
EXPERIENCE              . 8.78 <.0001
UNION                   .89 .9 <.0001
OCCUPATION              .98 7.79 <.0001
EXPERIENCE*EXPERIENCE   . .879 <.0001

Cp      8.8889
Press   98.7

This model gives better results on the R², R²adj, Cp and PRESS statistics than Model 2 does. Also, each of the coefficients in the model is highly significant, and all the VIF values are less than 10. Therefore, Model 2+ suggests another good candidate model to consider.
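JMP centers polynomial terms by default, which is why the quadratic appears as (EXPERIENCE - mean)*(EXPERIENCE - mean) in the output above; centering also reduces the correlation between the linear and squared terms. A sketch of fitting such a centered quadratic, where the data and every coefficient are synthetic stand-ins rather than the report's estimates:

```python
# A sketch of adding a centered quadratic in experience, the form JMP
# reports as (EXPERIENCE - mean)*(EXPERIENCE - mean). Synthetic data;
# all coefficients below are illustrative, not the report's estimates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
experience = rng.uniform(0, 45, n)
education = rng.integers(8, 19, n).astype(float)
# Log-wage that is concave in experience (diminishing returns).
logwage = (0.7 + 0.09 * education + 0.03 * experience
           - 0.0005 * experience ** 2 + rng.normal(0, 0.3, n))

df = pd.DataFrame({"LOGWAGE": logwage, "EDUCATION": education,
                   "EXPERIENCE": experience})
# Centering the squared term reduces its correlation with the linear term.
df["EXP_SQ_C"] = (df["EXPERIENCE"] - df["EXPERIENCE"].mean()) ** 2

fit = smf.ols("LOGWAGE ~ EDUCATION + EXPERIENCE + EXP_SQ_C", data=df).fit()
print(fit.params[["EXPERIENCE", "EXP_SQ_C"]].round(5))
```

As in the report's Model 2+, the coefficient on the centered squared term comes out negative, reflecting diminishing returns to experience.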
Models without the EXPERIENCE Variable

As mentioned in previous sections, we can also remove the EXPERIENCE variable from Model 1+ to get the model below:

Model 1-EXPERIENCE:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 RACE + β7 OCCUPATION + β8 SECTOR + β9 MARR

Fitting the above model in JMP, we get the following least squares regression output.

Summary of Fit
RSquare                      .7
RSquare Adj                  .7
Observations (or Sum Wgts)

Parameter Estimates
Term            Estimate   VIF
Intercept       .          .
EDUCATION       .7         .88
SOUTH           -.977      .8
SEX             -.7        .8
UNION           .8         .87
AGE             .97        .99
RACE[]          -.         .797
RACE[]          -.9        .899
OCCUPATION[]    .97        .78
OCCUPATION[]    -.         .78
OCCUPATION[]    .997       .889
OCCUPATION[]    -.7        .8779
OCCUPATION[]    .7         .988
SECTOR[]        -.9        .9
SECTOR[]        .7         .78
MARR            .7         .997

Effect Tests
Source       Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION    .888 .977 <.0001
SOUTH        .9977 .99 .
SEX          .9977 .8 <.0001
UNION        .99 7. <.0001
AGE          .79998 9.88 <.0001
RACE         .798 .7 .9
OCCUPATION   .978 .97 <.0001
SECTOR       .879 .7 .
MARR         .998 .9 .

Removing the EXPERIENCE variable results in a drop in the R² value from .7 to .7, which is negligible. However, the multicollinearity problem has also been solved, as there is no VIF value greater than 10. Just as in the case of removing the AGE variable, the RACE, SECTOR and MARR variables should be excluded from the model, since they are not significant at the α = .05 level. Therefore, these variables are omitted in the next model to see how the regression results are affected.
Model 3:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 OCCUPATION

Summary of Fit
RSquare                      .8
RSquare Adj                  .89
Observations (or Sum Wgts)

Analysis of Variance
Source     DF  Sum of Squares  Mean Square  F Ratio
Model      .7 .7 7.98
Error      9.7 .89
C. Total   8.8
Prob > F   <.0001

Parameter Estimates
Term           Estimate  VIF
Intercept      .989      .
EDUCATION      .88       .777
SOUTH          -.8       .79
SEX            -.78      .7
UNION          .8        .877
AGE            .7        .98
OCCUPATION[]   .         .97
OCCUPATION[]   -.7       .79
OCCUPATION[]   -.        .8
OCCUPATION[]   -.        .87
OCCUPATION[]   .8        .979

Effect Tests
Source       Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION    7. 8.988 <.0001
SOUTH        .8 .9 .
SEX          .977 .89 <.0001
UNION        .9 .7 <.0001
AGE          7.88 .7 <.0001
OCCUPATION   7.79 7. <.0001

Cp      9.97
Press   .799

This model also gives fairly good results on the R², R²adj, Cp and PRESS statistics. All its regression coefficients are highly significant, and no VIF value is greater than 10. Therefore, Model 3 also suggests a good candidate model to consider. As in the development of Model 2+, the next step is to consider all possible interactions between the predictors as well as the corresponding quadratic terms (only for EDUCATION and AGE), and see whether we can improve the regression results. As the final result, only AGE² turns out to be significant; EDUCATION² as well as all the interaction terms can be discarded. Again, the lengthy JMP output is omitted. Introducing AGE² to Model 3, we get the following model:
Model 3+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 OCCUPATION + β7 AGE²

Summary of Fit
RSquare                      .87
RSquare Adj                  .8
Observations (or Sum Wgts)

Analysis of Variance
Source     DF  Sum of Squares  Mean Square  F Ratio
Model      .78 .979 7.77
Error      9.7 .79
C. Total   8.8
Prob > F   <.0001

Parameter Estimates
Term                 Estimate  VIF
Intercept            .97       .
EDUCATION            .         .7978
SOUTH                -.7       .78
SEX                  -.9       .97
UNION                .98       .887
AGE                  .8        .88
OCCUPATION[]         .98       .878
OCCUPATION[]         -.7       .78
OCCUPATION[]         -.        .88
OCCUPATION[]         -.89      .987
OCCUPATION[]         .9        .9
(AGE-.8)*(AGE-.8)    -.        .

Effect Tests
Source       Nparm  DF  Sum of Squares  F Ratio  Prob > F
EDUCATION    .797 .88 <.0001
SOUTH        .9 .97 .9
SEX          .788 .8 <.0001
UNION        .797 . <.0001
AGE          .8 7.78 <.0001
OCCUPATION   .7 .9 <.0001
AGE*AGE      .7 .89 <.0001

Cp      8.9879
Press   98.897

This model gives even better results on the R², R²adj, Cp and PRESS statistics than Model 3 does, while all the regression coefficients in the model are still highly significant and the VIF values are all less than 10. Therefore, Model 3+ suggests one more good candidate model to consider.
Model Comparisons

Based on the above analyses, we have altogether four candidate models to consider. Summarizing them by their R², R²adj, Cp and PRESS values, we get the following table:

Model   Variables in Model                                       R²     R²adj   Cp       PRESS
2       EDUCATION, SOUTH, SEX, EXPERIENCE, UNION,                .877   .9      9.89     .77
        OCCUPATION
2+      EDUCATION, SOUTH, SEX, EXPERIENCE, UNION,                .88    .       8.8889   98.7
        OCCUPATION, EXPERIENCE²
3       EDUCATION, SOUTH, SEX, UNION, AGE, OCCUPATION            .8     .89     9.97     .799
3+      EDUCATION, SOUTH, SEX, UNION, AGE, OCCUPATION, AGE²      .87    .8      8.9879   98.897

According to the above table, both Model 2+ and Model 3+ are better than Model 2 and Model 3, and between the two better ones, Model 2+ is just slightly better than Model 3+. Actually, the differences between Model 2+ and Model 3+ can be neglected, which means that EXPERIENCE and AGE can be used interchangeably in fitting models, due to their almost perfect correlation with each other. For decision-making purposes, I will choose Model 2+ as the final model, given that it passes the following checks for violations of the model assumptions.
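The Cp and PRESS statistics used in the comparison above can be computed directly from the hat matrix of an OLS fit. A sketch under the usual definitions (PRESS as the sum of squared leave-one-out residuals, and Mallows' Cp = SSE_p/s² - n + 2p with s² estimated from the full model), on synthetic data with illustrative names:

```python
# A sketch of PRESS (sum of squared leave-one-out residuals) and Mallows'
# Cp = SSE_p / s^2_full - n + 2p, on synthetic data. The function and
# variable names are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p_full = 200, 6
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p_full - 1))])
beta = np.array([1.0, 0.8, -0.5, 0.3, 0.0, 0.0])   # last two predictors inert
y = X_full @ beta + rng.normal(0, 1.0, n)

def press_and_cp(X, y, s2_full):
    """PRESS via the hat-matrix identity e_(i) = e_i / (1 - h_ii), plus Cp."""
    H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix
    e = y - H @ y                                    # ordinary residuals
    press = np.sum((e / (1.0 - np.diag(H))) ** 2)    # leave-one-out residuals
    sse = np.sum(e ** 2)
    p = X.shape[1]
    cp = sse / s2_full - len(y) + 2 * p
    return press, cp

# s^2 is estimated from the full model, then a reduced model is scored.
e_full = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
s2_full = np.sum(e_full ** 2) / (n - p_full)

press_red, cp_red = press_and_cp(X_full[:, :4], y, s2_full)
print(round(press_red, 1), round(cp_red, 1))        # Cp should be near p = 4
```

For a reduced model that drops only irrelevant predictors, Cp lands near its parameter count p, the same "Cp close to p" criterion used when comparing the four candidate models.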
Model Assumption Verifications

Doing regression diagnostics on Model 2+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

we get the following JMP results:

a) Checking for outliers and influential observations

By calculating the standardized residuals and the h_ii values, one large outlier with standardized residual e_i = -.78 is detected. This was a male in a management position, who lived in the North and was not a union member; as an outlier, he had much lower wages than his years of experience and education would suggest. This outlier is simply omitted in the following analyses.

b) Residual plots against all predictor variables (checking for linearity)

[Figure: Residual vs. EDUCATION]
[Figure: Residual vs. SOUTH]
[Figure: Residual vs. SEX]
[Figure: Residual vs. EXPERIENCE]
[Figure: Residual vs. UNION]
[Figure: Residual vs. AGE]
[Figure: Residual vs. RACE (one-way analysis)]
[Figure: Residual vs. OCCUPATION (one-way analysis)]
[Figure: Residual vs. SECTOR (one-way analysis)]
[Figure: Residual vs. MARR]

Except for the large outlier identified in (a), all of the residuals exhibit random scatter around zero, and there are no unusual patterns in these residual plots. Therefore, no further transformations are needed, and the omitted variables can remain excluded.
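The outlier screen from step (a), standardized residuals together with the leverages h_ii, can be sketched as follows. The data are synthetic, with one planted outlier standing in for the low-wage manager flagged above:

```python
# A sketch of the outlier screen in step (a): standardized residuals and
# leverages h_ii from the hat matrix. Synthetic data with one planted
# outlier standing in for the observation flagged in the report.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 0.5, n)
y[10] -= 4.0                            # plant one gross outlier at row 10

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix; its diagonal = leverages
h = np.diag(H)
e = y - H @ y                           # ordinary residuals
s = np.sqrt(np.sum(e ** 2) / (n - p))   # root mean squared error
std_resid = e / (s * np.sqrt(1.0 - h))  # internally studentized residuals

flagged = np.where(np.abs(std_resid) > 3)[0]
print(flagged)                          # the planted row should be flagged
```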
c) Residual plot against the predicted values (checking for constant variance)

[Figure: Residual vs. Predicted Log(WAGE)]

Except for the large outlier identified in (a), the variance appears stable and the dispersion of the residuals is approximately constant with respect to the predicted values. Therefore, the constant variance assumption is satisfied.

d) Normal plot of residuals (checking for normality)

[Figure: Normal quantile plot of residuals]
Except for the large outlier identified in (a), the normal plot of residuals is very close to a straight line, which indicates that the normality assumption is satisfied.

e) Run chart of residuals (checking for independence)

[Figure: Residual by Row Plot]

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
.98                              .8                .9

There are no signs of correlation introduced by time order in the above run chart of residuals. Also, the autocorrelation statistic is calculated as .8 with an associated P-value of .9, which indicates that we do not have an autocorrelation problem. Therefore, the assumption of independence is satisfied.
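The Durbin-Watson check can be reproduced with statsmodels. For independent residuals the statistic sits near 2, while values toward 0 or 4 indicate positive or negative first-order autocorrelation; the residuals below are synthetic:

```python
# A sketch of the Durbin-Watson independence check on synthetic,
# independent residuals; the statistic should come out near 2.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
resid = rng.normal(0.0, 1.0, 500)   # independent residuals, in row order
dw = durbin_watson(resid)
print(round(dw, 2))
```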
Part 3: Final Conclusion

Based on all the above analyses, the final model that has been decided on is:

Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

(The EXPERIENCE² term is centered, and OCCUPATION should be replaced with dummy variables.)

Substituting the actual estimates for the βs, we get:

Log(WAGE) = .9987 + .8 EDUCATION - .97 SOUTH - . SEX + .88 EXPERIENCE + .9 UNION + .97 OCCUPATION[1] - .987 OCCUPATION[2] - .877 OCCUPATION[3] - .889 OCCUPATION[4] + . OCCUPATION[5] - .7 (EXPERIENCE - 7.8)²

According to this final model, more education and more experience helped workers earn more; people living in the South earned less than people living elsewhere; females earned less than males; union members earned more than non-union members; management and professional positions were paid the most, while service and clerical positions were paid the least. The overall share of the variability in the wages of workers explained by this model is R² = .88. As mentioned in the beginning, since the data used to build this model were collected in 1985, we might not be able to use this model to make general inferences about a worker's wage at the current time.
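Because the model is fitted on the log scale, turning it into a wage prediction requires back-transforming with exp(). The sketch below evaluates the final model's functional form; every coefficient, and the centering mean `exp_mean`, is a hypothetical placeholder rather than one of the report's actual estimates:

```python
# A sketch of using the final model's form to predict a wage. Every
# coefficient here is a hypothetical placeholder, NOT the report's
# estimate; exp_mean stands for the sample mean used to center the
# quadratic term.
import math

def predict_wage(education, south, sex, experience, union, occ_effect,
                 exp_mean=17.8, b0=1.0, b_edu=0.08, b_south=-0.10,
                 b_sex=-0.22, b_exp=0.01, b_union=0.20, b_exp2=-0.0005):
    """Evaluate Log(WAGE) = b0 + ... + b_exp2*(EXPERIENCE - mean)^2,
    then back-transform from the log scale to dollars per hour."""
    log_wage = (b0 + b_edu * education + b_south * south + b_sex * sex
                + b_exp * experience + b_union * union + occ_effect
                + b_exp2 * (experience - exp_mean) ** 2)
    return math.exp(log_wage)

# A hypothetical non-union male professional in the North:
print(round(predict_wage(16, south=0, sex=0, experience=10, union=0,
                         occ_effect=0.15), 2))
```

Note that exp(fitted log-wage) estimates the median wage rather than the mean; a mean prediction would need an additional correction for the residual variance.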