ECON 482 / WH Hong Binary or Dummy Variables 1. Qualitative Information

1. Qualitative Information Qualitative Information Up to now, we assume that all the variables has quantitative meaning. But often in empirical work, we must incorporate qualitative factor into regression model. Examples: gender, race, industry, region, rating grade,... A way to incorporate qualitative information is to use binary variables which take either zero or one. Binary variables are most commonly called dummy variables. They may appear as the dependent or as independent variables. 1

2. A Single Dummy Independent Variable (Example) Wage equation wage = β + δ female + β educ + u 0 0 1 female is a dummy variable: female = 1 if the person is a woman; female = 0 if the person is a man δ 0: the wage gain/loss if the person is a woman rather than a man (holding other factors fixed) Holding educ and other factors constant, If female ( female = 1), the mean wage is β0 + δ0 If male ( female = 0), the mean wage is β 0 δ 0 = E ( wage female = 1, educ) E ( wage female = 0, educ) Thus, δ 0 measures the difference in mean wage between men and women with the same level of education. 2

Graphically, This is an intercept shift. 3

Dummy variable trap Why not using two dummy variables instead of one when we have two categories in a qualitative variable? In the wage equation example, wage = β + γ male + δ female + β educ + u 0 0 0 1 This model cannot be estimated because of perfect collinearity. This is called the "dummy variable trap" When using dummy variables, one category always has to be omitted: wage = β + δ female + β educ + u => The base category is men 0 0 1 wage = β + γ male + β educ + u 0 0 1 Alternatively, one could omit the intercept: wage = γ male + δ female + β educ + u Disadvantages: 0 0 1 => The base category is women i. More difficult to test for differences between the parameters ii. R-squared formula is only valid when regression contains intercept 4

(Example) Wage equation (cont'd) Estimated equation wage = 1.57 1.81 female + 0.572 educ + 0.025 exper + 0.141 tenure (0.72) (0.26) (0.049) (0.012) (0.021) 2 n = 526, R = 0.364 The t-statistic said that the coefficient on female is statistically significant. Interpretation: Holding education, experience, and tenure fixed, women earn $1.81 less per hour than men Does that mean that women are discriminated against? Not necessarily. Being female may be correlated with other productivity characteristics that have not been controlled for. => Omitted variable bias 5

(Example) Effect of training grants on hours of training Estimated equation frsemp = 46.67 + 26.25grant 0.98log( sales) 6.07log( employ) (43.41) (5.59) (3.54) (3.88) n = 105, R 2 = 0.237 frsemp: hours training per employee; grant : =1 if firm received training grant, =0 otherwise The coefficient on grant is statistically significant, implying that holding other factors constant, the receipt of training grant increases hours of training by 26.25 hours. This is an example of program evaluation Treatment group (= grant receiver) vs. control group (= no grant) Is the effect of treatment on the outcome of interest causal? => possibly not, because the event of giving grant is not randomly chosen. Therefore, the event might be correlated with other factor that have not been controlled for. => Omitted variable bias 6

3. Using Dummy Variables for Multiple Categories (Example) City credit ratings and municipal bond interest reates Estimation model 1: MBR = β + β CR + other factors 0 1 MBR : Municipal bond rate; CR : Credit ratings for a firm: from 0 to 4 (0=worst, 4= best) β 1 measures the percentage point change in MBR when CR increases by one unit. Is the difference between four and three the same as the difference between zero and one? Unfortunately, it is hard to interpret the meaning of one unit increase in CR, because indeed CR has ordinal meaning rather than cardinal. Therefore, a better way to incorporate this information is to define dummies Estimation model 2: MBR = β + δ CR + δ CR + δ CR + δ CR + other factors 0 1 1 2 2 3 3 4 4 where, for example, CR = if CR = 1 and zero otherwise, etc. Now, 1 1 δ i for i = 1,...,4 are measured in comparison to the worst rating ( = base category) 7

4. Interactions Involving Dummy Variables (1) Interaction involving dummy variables (Example) Wage equation Estimation model ( ) log wage = β + δ female + β educ + δ female educ + u 0 0 1 1 ( β δ ) ( β δ ) = + female + + female educ + u 0 0 1 1 β 0 = intercept for men; β 1 = slope for men β0 + δ0 = intercept for women; β1+ δ1 = slope for women Interesting Hypothesis o 1 H : δ = 0 => the return to education is the same for men and women. H : δ = 0, δ = 0 0 0 1 => The whole wage equation is the same for men and women. 8

Graphically Graph (a): δ 0 < 0 and δ 1 < 0 => women earn less than men at all levels of education, and the gap increases as gets larger. Graph (b): δ 0 < 0 and δ 1 > 0 educ => women earn less than men at low levels of education, but the gap narrows ad education increase 9

(2) Testing for difference in regression functions across groups It is possible that functional forms might be different depending on a group. To test this follow the step. Unrestricted model (contains full set of interaction, which allow different functional forms) cumgpa = β + δ female + β sat + δ female sat + β hsperc + δ female hsperc 0 0 1 1 2 2 + β tothrs + δ female tothrs + u 3 3 cumgpa: college grade point average; sat : sat score hsperc: high school rank perentile; tothrs : total hours spent in college courses Restricted model cumgpa = β + β sat + β hsperc + β tothrs + u 0 1 2 3 Null hypothesis: H0 : δ0 = 0, δ1 = 0, δ2 = 0, δ3 = 0 => All interaction effects are zero, i.e., the same regression coefficients apply to men and women. (No different functional form) 10

Estimation of the unrestricted model cumgpa = 1.48 0.353 female + 0.0011 sat + 0.00075 female sat (0.21) (0.411) (0.0002) (0.00039) 0.0085hsperc + 0.00055 female hsperc+ 0.0023tothrs 0.00012 female tothrs (0.0014) (0.00316) (0.0009) (0.00163) Tested individually, the hypothesis that the interaction effects are zero cannot be rejected Joint test with F-statistic ( SSRr SSRur) / q ( 85.515 78.355 ) / 4 SSR / n k 1 78.355 / ( 366 7 1) F = = 8.18 ur The null is rejected. Alternative way to compute F-statistic in the given case Run separate regressions for men and for women; the unrestricted SSR is given by the sum of the SSR of these two regression: SSRur = SSRmen + SSRwomen Run regression for the restricted model and store SSR: SSR r If the test is computed in this way, it is called the Chow-Test. 11

5. A Binary Dependent Variable: The Linear Probability Model Linear regression when the dependent variable is binary y β0 β1x1... βkxk u = + + + + => E ( y x ) = β0 + β1x1 +... + βkxk ( x) = ( = x) + ( = x) = ( = x) When y is binary, E y 1 P y 1 0 P y 0 P y 1. Therefore, P ( ) β0 β1 1 y = 1 x = + x +... + β kxk; Linear Probability Model (LPM) Interpretation: β j = ( 1 x) P y = x j The coefficient measures the effect of the explanatory variables on the probability that y =1. 12

(Example) Labor force participation of married women Estimated equation 2 inlf = 0.586 0.0034nwifeinc + 0.038educ + 0.039exper 0.00060exper (0.154) (0.0014) (0.007) (0.006) (0.00018) 0.016age 0.262kidslt6 + 0.0130kidage6 (0.002) (0.034) (0.0132) n = 763, R 2 = 0.264 inlf : =1 if in labor force, =0 otherwise; nwifeinc: non-wife income (in thousands); kidslt 6: number of kids under six years; kidsage 6: number of kids between 6 and 18 The coefficient on kidslt 6 is statistically significant, indicating that the probability that the woman works falls by 26.2% when the number of kids under six increase by one. The coefficient on kidsage6 is insignificant. 13

(Example) Labor force participation of married women (cont'd) Graph for nwifein = 50, exper = 5, age = 30, kidslt 6= 1, kidsage 6= 0 the max of educ is 17. For the given case, this leads to a predicted probability to be in the labor force of about 50% Note that there is negative probability. But no problem because no woman in the sample has educ < 5 14

Disadvantages of the linear probability model Predicted probability may be larger than one or smaller than zero. Marginal probability sometimes logically impossible. => A probability cannot be linearly related to the independent variables for all their possible value. Δ inlf = 0.262Δ kidslt6 implies that the effect of the variable from zero to one and from three to four are the same. But we would expect the marginal effect would decreases as the number of kids under 6 increases. The linear probablity model is necessarily heteroskedastic ( y x) = P( y = X) P( y = x ) var 1 1 1 Heteroskedastic consistent standard errors need to be computed. (We'll see later) Advantages of the linear probability model Easy estimation and interpretation Estimated effects and predictions often reasonably good in practice. 15