Project Report for STAT571 Statistical Methods Instructor: Dr. Ramon V. Leon. Wage Data Analysis. Yuanlei Zhang


Part 1: Introduction

Data Set

The data set contains a random sample of 534 observations on 11 variables drawn from the 1985 Current Population Survey. It provides information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. The data set was obtained from StatLib; its original source is: Berndt, E.R. (1991). The Practice of Econometrics. New York: Addison-Wesley. (The JMP file containing this data set is attached as raw_data.jmp.)

Variables

The variables contained in the data set are summarized below:

Response variable:
WAGE: Wage (dollars per hour). Continuous.

Predictor variables:
EDUCATION: Number of years of education. Continuous.
SOUTH: Indicator for Southern region (1 = lives in South, 0 = lives elsewhere). Nominal.
SEX: Indicator for sex (1 = female, 0 = male). Nominal.
EXPERIENCE: Number of years of work experience. Continuous.
UNION: Indicator for union membership (1 = union member, 0 = not a union member). Nominal.
AGE: Age (years). Continuous.
RACE: Race (1 = Other, 2 = Hispanic, 3 = White). Nominal.

OCCUPATION: Occupational category (1 = Management, 2 = Sales, 3 = Clerical, 4 = Service, 5 = Professional, 6 = Other). Nominal.
SECTOR: Sector (0 = Other, 1 = Manufacturing, 2 = Construction). Nominal.
MARR: Marital status (1 = married, 0 = unmarried). Nominal.

NOTE: The nominal variables SOUTH, SEX, UNION and MARR already serve as dummy variables, since they take only the values 0 and 1. For the nominal variables RACE, OCCUPATION and SECTOR, dummy variables are introduced automatically by JMP when the regressions are run.

Objective

Common sense suggests that wages depend, more or less, on the other characteristics of the workers. The objectives of this project are to determine whether wages are indeed related to these characteristics and, if so, to find possible ways of modeling the relationship and to assess how good those models are. A final model will be selected that best describes the relationship between wages and these characteristics on the given data set; such a model might help to predict a worker's wage from the relevant characteristics. However, since the data were collected in 1985, it might not be appropriate to make general inferences about a worker's wage at the current time.
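The dummy-variable expansion that JMP performs for RACE, OCCUPATION and SECTOR can be sketched by hand. The helper below is hypothetical (it is not JMP's actual implementation): each nominal level except a baseline becomes a 0/1 indicator column.

```python
def dummy_encode(values, levels):
    """One-hot encode `values`, dropping the last level as the baseline.

    Returns one indicator column per non-baseline level, mirroring how a
    nominal predictor is expanded before least-squares fitting.
    """
    return [[1 if v == lev else 0 for lev in levels[:-1]] for v in values]

# RACE coded 1 = Other, 2 = Hispanic, 3 = White; White (3) is the baseline here
race = [3, 1, 2, 3]
print(dummy_encode(race, levels=[1, 2, 3]))
# [[0, 0], [1, 0], [0, 1], [0, 0]]
```

A variable that is already 0/1, such as SOUTH or UNION, needs no expansion: it is its own dummy column.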

Part 2: Data Analysis

Problems with the data

As an initial attempt, WAGE is fitted against all predictor variables.

Model 1:
WAGE = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 AGE + β7 RACE + β8 OCCUPATION + β9 SECTOR + β10 MARR

(NOTE: The nominal variables RACE, OCCUPATION and SECTOR should be replaced with their corresponding dummy variables. For simplicity, the model expressions in this report do not show this, but the conversion is made whenever a regression is actually carried out.)

The output of the least-squares regression immediately reveals two major problems that must be corrected before any further analysis.

Problem 1: Unstable variance

The problem of unstable variance is identified by examining the plot of residuals against predicted values in the JMP output. [Residual by Predicted Plot: WAGE residual versus WAGE predicted; numeric values lost in transcription.]

The residual plot clearly shows the variance increasing with the predicted value. This indicates that Var(Y) is a function of E(Y) rather than constant, so the response must be transformed to stabilize the variance. Applying the Box-Cox transformation technique in JMP, we find that the best variance-stabilizing transformation is the log transformation. Therefore, the new model becomes:

Model 1+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 AGE + β7 RACE + β8 OCCUPATION + β9 SECTOR + β10 MARR

After refitting Log(WAGE) against all predictor variables, the residual plot is as follows. [Residual by Predicted Plot: Log(WAGE) residual versus Log(WAGE) predicted; numeric values lost in transcription.]

The residual plot now suggests a much more constant variance, so from this step on all least-squares regressions use Log(WAGE) in place of WAGE.

Problem 2: Multicollinearity

The problem of multicollinearity is identified by examining the VIF values in the JMP least-squares output. [Parameter Estimates table listing VIF values for the intercept, EDUCATION, SOUTH, SEX, EXPERIENCE, UNION, AGE, RACE, OCCUPATION, SECTOR and MARR; numeric values lost in transcription.]

We see that the VIF values of EDUCATION, EXPERIENCE and AGE are much greater than 10, which suggests a serious multicollinearity problem. Multicollinearity is usually caused by correlated predictor variables, so the pair-wise correlations between EDUCATION, EXPERIENCE and AGE are calculated to pinpoint the cause. [Correlation matrix and scatterplot matrix for EDUCATION, EXPERIENCE and AGE; most numeric values lost in transcription.]

From the pair-wise correlations and the scatterplot matrix, we see that AGE and EXPERIENCE are almost perfectly correlated, with a correlation coefficient r = 0.978. This explains the multicollinearity problem identified above; to solve it, we must remove either AGE or EXPERIENCE. The models evaluated next start by removing the AGE variable.
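The VIF for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A small simulated sketch (not the report's data) shows why EDUCATION, EXPERIENCE and AGE inflate together: in survey data of this kind, work experience is often constructed roughly as age minus education minus 6, so the three are nearly collinear.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2 is
    from the least-squares regression of column j on the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.column_stack(
            [np.ones(X.shape[0])] + [X[:, k] for k in range(X.shape[1]) if k != j])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Simulated sketch: AGE built (almost) exactly from EDUCATION and EXPERIENCE
rng = np.random.default_rng(1)
educ = rng.integers(8, 18, 300).astype(float)
exper = rng.integers(0, 40, 300).astype(float)
age = educ + exper + 6 + rng.normal(0, 0.5, 300)
print(vif(np.column_stack([educ, exper, age])))  # all far above the cutoff of 10
```

Dropping any one of the three near-collinear columns brings the remaining VIFs back toward 1, which is exactly what removing AGE (or EXPERIENCE) does in the report.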

Models without the AGE variable

Removing the AGE variable from Model 1+, we get the model below:

Model 1-AGE:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 RACE + β7 OCCUPATION + β8 SECTOR + β9 MARR

Fitting this model in JMP gives the following least-squares output. [Summary of Fit, Parameter Estimates (with VIF values) and Effect Tests tables; numeric values lost in transcription.]

Removing the AGE variable produces only a negligible drop in the R² value, while the multicollinearity problem is solved: no VIF value now exceeds 10. The output also shows that the coefficients of the RACE, SECTOR and MARR variables are not significant at the α = 0.05 level, since their P-values in the partial F-tests exceed 0.05. The next model therefore drops these variables to see how the regression results are affected.
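The effect tests used here are partial F-tests: each compares the full model with the nested model that omits one term, via the standard formula F = ((SSE_r - SSE_f)/df_num) / (SSE_f/df_den). A sketch on simulated data (not the report's data):

```python
import numpy as np

def partial_f(X_full, X_red, y):
    """Partial F statistic comparing a full least-squares model with a
    nested reduced model (both fitted with an intercept)."""
    def fit_sse(X):
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        e = y - X1 @ beta
        return e @ e, X1.shape[1]
    sse_f, p_f = fit_sse(X_full)
    sse_r, p_r = fit_sse(X_red)
    return ((sse_r - sse_f) / (p_f - p_r)) / (sse_f / (len(y) - p_f))

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1, 300)  # only the first column matters
print(partial_f(X, X[:, :1], y))           # small: the last two columns add nothing
print(partial_f(X, X[:, 1:], y))           # large: the first column is essential
```

A small F (large P-value) is the situation RACE, SECTOR and MARR are in above: dropping them barely increases the error sum of squares.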

Model 2:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION

[Summary of Fit, Analysis of Variance, Parameter Estimates (with VIF values), Effect Tests, and Cp and PRESS statistics; numeric values lost in transcription.]

The pooled F-test of the hypothesis H0: β1 = β2 = β3 = β4 = β5 = β6 = 0 rejects H0, since its P-value is less than 0.05. Each coefficient in the model is also statistically significant, as the P-values of the partial F-tests are very small, and all VIF values are below 10, indicating no multicollinearity problem. The R² value shows only a slight drop from the previous model; the fraction of variability in wages explained by the regression may not seem satisfactory, but given the many real-world sources of variation in wages it is still acceptable. The R²adj, Cp and PRESS statistics are also calculated for comparison with competing models. In all, Model 2 is quite reasonable and is a good candidate model to consider. The next step is to consider all possible interactions between the predictors as well as their quadratic terms (only for EDUCATION and EXPERIENCE), and see

whether the regression results can be improved. Since there are C(6,2) = 15 possible interaction terms plus the two quadratic terms, the JMP output is very lengthy and therefore omitted. In the end, only EXPERIENCE² turns out to be significant; EDUCATION² and all the interaction terms can be discarded, as they are neither significant nor add much information to the model. Introducing EXPERIENCE² into Model 2, we get the following model:

Model 2+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

[Summary of Fit, Analysis of Variance, Parameter Estimates (the quadratic term enters as the centered product (EXPERIENCE - 17.8)*(EXPERIENCE - 17.8)), Effect Tests, and Cp and PRESS statistics; numeric values lost in transcription.]

This model gives better results on the R², R²adj, Cp and PRESS statistics than Model 2 does. Each coefficient in the model is highly significant and all VIF values are below 10. Therefore, Model 2+ is another good candidate model to consider.
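JMP centers the quadratic term at the sample mean of EXPERIENCE. One reason is numerical: a raw square is nearly collinear with the linear term, while the centered square is almost uncorrelated with it. A simulated sketch (not the report's data):

```python
import numpy as np

def corr(a, b):
    """Pearson correlation of two one-dimensional arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(2)
exper = rng.uniform(0, 40, 400)            # simulated years of experience

raw_sq = exper ** 2                        # raw quadratic term
centered_sq = (exper - exper.mean()) ** 2  # JMP-style centered quadratic

print(corr(exper, raw_sq))       # close to 1: collinear with the linear term
print(corr(exper, centered_sq))  # close to 0: centering removes the collinearity
```

This is why the VIF values in Model 2+ can stay low even though both EXPERIENCE and its square appear in the model.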

Models without the EXPERIENCE variable

As mentioned in previous sections, we can instead remove the EXPERIENCE variable from Model 1+ to get the model below:

Model 1-EXPERIENCE:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 RACE + β7 OCCUPATION + β8 SECTOR + β9 MARR

Fitting this model in JMP gives the following least-squares output. [Summary of Fit, Parameter Estimates (with VIF values) and Effect Tests tables; numeric values lost in transcription.]

Removing the EXPERIENCE variable likewise produces only a negligible drop in the R² value, and again the multicollinearity problem is solved, as no VIF value exceeds 10. Just as in the case of removing AGE, the RACE, SECTOR and MARR variables are not significant at the α = 0.05 level, so they are omitted in the next model to see how the regression results are affected.

Model 3:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 OCCUPATION

[Summary of Fit, Analysis of Variance, Parameter Estimates (with VIF values), Effect Tests, and Cp and PRESS statistics; numeric values lost in transcription.]

This model also gives fairly good results on the R², R²adj, Cp and PRESS statistics. All its regression coefficients are highly significant and no VIF value exceeds 10. Therefore, Model 3 is also a good candidate model to consider. As in the development of Model 2+, the next step is to consider all possible interactions between the predictors as well as their quadratic terms (only for EDUCATION and AGE), and see whether the regression results improve. In the end, only AGE² turns out to be significant; EDUCATION² and all the interaction terms can be discarded. Again, the lengthy JMP output is omitted. Introducing AGE² into Model 3, we get the following model:

Model 3+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 UNION + β5 AGE + β6 OCCUPATION + β7 AGE²

[Summary of Fit, Analysis of Variance, Parameter Estimates (the quadratic term enters as the centered product (AGE - mean)*(AGE - mean)), Effect Tests, and Cp and PRESS statistics; numeric values lost in transcription.]

This model gives even better results on the R², R²adj, Cp and PRESS statistics than Model 3 does, while all the regression coefficients remain highly significant and all VIF values are below 10. Therefore, Model 3+ is one more good candidate model to consider.

Model Comparisons

Based on the above analyses, we have four candidate models in all. Summarizing them by their R², R²adj, Cp and PRESS values gives the following table. [Comparison table of R², R²adj, Cp and PRESS for the four models; numeric values lost in transcription.]

Model 2: EDUCATION, SOUTH, SEX, EXPERIENCE, UNION, OCCUPATION
Model 2+: EDUCATION, SOUTH, SEX, EXPERIENCE, UNION, OCCUPATION, EXPERIENCE²
Model 3: EDUCATION, SOUTH, SEX, UNION, AGE, OCCUPATION
Model 3+: EDUCATION, SOUTH, SEX, UNION, AGE, OCCUPATION, AGE²

According to the table, both Model 2+ and Model 3+ are better than Model 2 and Model 3, and of the two better ones, Model 2+ is just slightly better than Model 3+. In fact the differences between Model 2+ and Model 3+ are negligible, which means EXPERIENCE and AGE can be used interchangeably in fitting models, owing to their almost-perfect correlation with each other. For decision-making purposes, I will choose Model 2+ as the final model, provided it passes the following checks for violations of the model assumptions.
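The PRESS statistic used in the comparison can be computed without refitting the model n times: the leave-one-out residual satisfies the identity e_(i) = e_i / (1 - h_ii), where h_ii is the leverage from the hat matrix. A sketch on simulated data (not the report's data):

```python
import numpy as np

def press(X, y):
    """PRESS = sum of squared leave-one-out prediction errors, computed
    from the ordinary residuals e_i and leverages h_ii (no refitting)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T      # hat matrix
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ beta
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0, 1, 100)
print(press(X, y))   # smaller is better when comparing candidate models
```

Because PRESS scores each observation with a fit that excluded it, it rewards models that predict well rather than models that merely fit well, which is why it appears alongside R²adj and Cp in the table.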

Model Assumption Verification

Running regression diagnostics on Model 2+:
Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

we get the following JMP results:

a) Checking for outliers and influential observations

Calculating the standardized residuals and the leverage values h_ii reveals one large outlier with a large negative standardized residual. This was a male in a management position who lived in the North and was not a union member; as an outlier, he earned much lower wages than expected. This observation is simply omitted in the analyses that follow.

b) Residual plots against all predictor variables (checking for linearity)

[Plots of residuals versus EDUCATION and versus SOUTH; numeric values lost in transcription.]

[Plots of residuals versus SEX, EXPERIENCE and UNION; numeric values lost in transcription.]

[Plot of residuals versus AGE, and one-way analyses of the residuals by RACE and by OCCUPATION; numeric values lost in transcription.]

[One-way analysis of the residuals by SECTOR, and plot of residuals versus MARR; numeric values lost in transcription.]

Except for the large outlier identified in (a), all of the residuals exhibit random scatter around zero, and there are no unusual patterns in these residual plots. Therefore no further transformations are needed, and the omitted variables can remain excluded.
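The outlier screening in (a) rests on standardized residuals built from the leverages h_ii: e_i / (s * sqrt(1 - h_ii)), with values beyond roughly plus or minus 3 flagged for inspection. A sketch on simulated data with one planted low outlier (not the report's data):

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized residuals e_i / (s * sqrt(1 - h_ii));
    values beyond about +/-3 are commonly flagged as outliers."""
    X1 = np.column_stack([np.ones(len(y)), X])
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    h = np.diag(H)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ beta
    s2 = (e @ e) / (len(y) - X1.shape[1])   # mean squared error
    return e / np.sqrt(s2 * (1.0 - h))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, 200)
y[0] -= 8.0                                 # plant one "much lower than expected" case
r = standardized_residuals(X, y)
print(int(np.argmin(r)), float(r.min()))    # the planted observation stands out
```

Dividing by sqrt(1 - h_ii) matters: high-leverage points pull the fit toward themselves, so their raw residuals understate how unusual they are.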

c) Residual plot against predicted values (checking for constant variance)

[Plot of residuals versus predicted Log(WAGE); numeric values lost in transcription.]

Except for the large outlier identified in (a), the dispersion of the residuals is approximately constant across the predicted values. Therefore the constant-variance assumption is satisfied.

d) Normal plot of residuals (checking for normality)

[Normal quantile plot of the residuals; numeric values lost in transcription.]

Except for the large outlier identified in (a), the normal plot of the residuals lies very close to a straight line, which indicates that the normality assumption is satisfied.

e) Run chart of residuals (checking for independence)

[Residual-by-row plot and Durbin-Watson output; numeric values lost in transcription.]

The run chart of the residuals shows no sign of correlation introduced by time order. The autocorrelation statistic is small and the P-value associated with the Durbin-Watson statistic is large, which indicates that we do not have an autocorrelation problem. Therefore the independence assumption is satisfied.
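The Durbin-Watson statistic reported by JMP is simple to compute directly: DW = sum of squared successive differences of the residuals over their sum of squares. Values near 2 indicate no lag-1 autocorrelation; values near 0 indicate strong positive autocorrelation. A sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum (e_t - e_{t-1})^2 / sum e_t^2 over the residual series."""
    d = np.diff(resid)
    return float((d @ d) / (resid @ resid))

rng = np.random.default_rng(5)
white = rng.normal(size=1000)     # independent residuals
print(durbin_watson(white))       # close to 2

ar = np.zeros(1000)               # strongly autocorrelated residuals
for t in range(1, 1000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
print(durbin_watson(ar))          # well below 2
```

For cross-sectional survey data like this, row order carries no time meaning, so a DW near 2 is exactly what one expects.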

Part 3: Final Conclusion

Based on all of the above analyses, the final model is:

Log(WAGE) = β0 + β1 EDUCATION + β2 SOUTH + β3 SEX + β4 EXPERIENCE + β5 UNION + β6 OCCUPATION + β7 EXPERIENCE²

(where the quadratic EXPERIENCE term is centered at the sample mean of 17.8 years, and OCCUPATION is replaced with its dummy variables).

Substituting the actual least-squares estimates for the β's gives the fitted equation. [Fitted equation with numeric coefficient estimates; values lost in transcription.]

According to this final model, more education and more experience helped workers earn more; people living in the South earned less than people living elsewhere; females earned less than males; union members earned more than non-union members; and management and professional positions were paid the most, while service and clerical positions were paid the least. The overall variability in wages explained by this model is the R² value reported for Model 2+. As mentioned at the beginning, since the data used to build this model were collected in 1985, we might not be able to use this model to make general inferences about a worker's wage at the current time.
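To use a model of this form for prediction, the fitted Log(WAGE) must be back-transformed with the exponential. The sketch below uses made-up placeholder coefficients (the report's actual estimates are not reproduced here), so only the model form, with its log response and centered quadratic EXPERIENCE term, is meaningful.

```python
import math

# ALL coefficient values below are hypothetical placeholders, not the
# report's least-squares estimates; only the model form follows the text.
def predict_wage(education, south, sex, experience, union, occ_effect):
    log_wage = (1.0
                + 0.08 * education     # years of education
                - 0.10 * south         # 1 = lives in South
                - 0.22 * sex           # 1 = female
                + 0.03 * experience    # years of experience
                + 0.20 * union         # 1 = union member
                + occ_effect           # occupation dummy contribution
                - 0.0005 * (experience - 17.8) ** 2)  # centered quadratic
    return math.exp(log_wage)          # back-transform from the log scale

print(predict_wage(education=12, south=0, sex=0, experience=10,
                   union=0, occ_effect=0.0))  # predicted wage in dollars/hour
```

Note that exp of a predicted mean log wage estimates the median wage rather than the mean, one more reason the model suits comparisons better than point forecasts.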