
Part VI: Regression with Qualitative Information (as of Oct 17, 2017)

1 Regression with Qualitative Information
   Single Dummy Independent Variable
   Multiple Categories
      Ordinal Information
   Interaction Involving Dummy Variables
      Interaction among Dummy Variables
      Different Slopes
      Chow Test
   Binary Dependent Variable
      The Linear Probability Model
      The Logit and Probit Model

Qualitative information: a family owns a car or not, a person smokes or not, a firm is in bankruptcy or not, the industry of a firm, the gender of a person, etc. Some of these can appear either as a background (independent) variable or as the dependent variable. Technically such variables are coded with binary values (1 = "yes", 0 = "no"). In econometrics these variables are generally called dummy variables. Remark 6.1 The usual practice is to name the dummy variable after one of the categories. For example, instead of using gender one can define the variable as, e.g., female, which equals 1 if the gender is female and 0 if male.

Exploring Further (W 5ed 7.1, p. 216) Suppose, in a study comparing election outcomes between Democratic and Republican candidates, the researcher wishes to indicate the party of each candidate. Is a name such as party a wise choice for a binary variable in this case? What would be a better name?


Single Dummy Independent Variable Dummy variables can be incorporated into a regression model like any other variables. Consider the simple regression y = β 0 + δ 0 D + β 1 x + u, (1) where D = 1 if the individual has the property and D = 0 otherwise, and E[u | D, x] = 0. The parameter δ 0 indicates the difference with respect to the reference group (D = 0), for which the intercept is β 0. Then δ 0 = E[y | D = 1, x] − E[y | D = 0, x]. (2) The value of x is the same in both expectations, so the difference is due only to the property D.

Single Dummy Independent Variable Figure 6.1: E[y | D, x] = β 0 + δ 0 D + β 1 x with δ 0 > 0, i.e. the two parallel lines E[y | D = 1, x] = β 0 + δ 0 + β 1 x and E[y | D = 0, x] = β 0 + β 1 x. The category with D = 0 forms the reference category, and δ 0 indicates the change in the intercept with respect to the reference group.
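As a quick illustration of (1) and (2), the following minimal R sketch (simulated data, all names hypothetical) fits a regression with a single dummy; the coefficient on D estimates the intercept shift δ 0.

set.seed(123)
n <- 500
D <- rbinom(n, 1, 0.5)                  # D = 1: has the property, D = 0: reference group
x <- rnorm(n)
y <- 1 + 0.8 * D + 0.5 * x + rnorm(n)   # true beta0 = 1, delta0 = 0.8, beta1 = 0.5
fit <- lm(y ~ D + x)
coef(fit)                               # the coefficient on D estimates delta0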

Single Dummy Independent Variable From an interpretation point of view it may also be beneficial to associate the categories directly with the regression coefficients. Consider the wage example, where log(w) = β 0 + β 1 educ + β 2 exper + β 3 tenure + u. (3) Suppose we are interested in the difference in wage levels between men and women. Then we can model β 0 as a function of gender as β 0 = β m + δ f female, (4) where subscripts m and f refer to male and female, respectively. Model (3) can then be written as log(w) = β m + δ f female + β 1 educ + β 2 exper + β 3 tenure + u. (5)

Single Dummy Independent Variable In this model the female dummy is zero for men. All other factors remain the same. Thus the expected difference between wages in terms of the model is, according to (2), equal to δ f. Because the left-hand side is in logs, 100 δ f indicates the difference approximately in percentages.

Single Dummy Independent Variable The log approximation may be inaccurate if the percentage difference is big. A more accurate approximation is obtained by using the fact that log(w f ) − log(w m ) = log(w f / w m ), where w f and w m refer to a woman's and a man's wage, respectively, thus 100 · (w f − w m ) / w m % = 100 · (exp(δ f ) − 1) %. (6)

Single Dummy Independent Variable Example 1 Augment the wage example with squared exper and squared tenure to allow for a possibly diminishing incremental effect of experience and tenure, and account for the possible wage difference with the female dummy. Thus the model is log(w) = β m + δ f female + β 1 educ + β 2 exper + β 3 tenure + β 4 (exper)^2 + β 5 (tenure)^2 + u. (7)

Single Dummy Independent Variable Example 1 continues

Dependent Variable: LOG(WAGE)
Included observations: 526
=======================================================
Variable     Coefficient   Std. Error   t-statistic   Prob.
-------------------------------------------------------
C               0.416691     0.098928     4.212066    0.0000
FEMALE         -0.296511     0.035805    -8.281169    0.0000
EDUC            0.080197     0.006757    11.86823     0.0000
EXPER           0.029432     0.004975     5.915866    0.0000
TENURE          0.031714     0.006845     4.633036    0.0000
EXPER^2        -0.000583     0.000107    -5.430528    0.0000
TENURE^2       -0.000585     0.000235    -2.493365    0.0130
=======================================================
R-squared            0.440769   Mean dependent var      1.623268
Adjusted R-squared   0.434304   S.D. dependent var      0.531538
S.E. of regression   0.399785   Akaike info criterion   1.017438
Sum squared resid   82.95065    Schwarz criterion       1.074200
F-statistic         68.17659    Prob(F-statistic)       0.000000
=======================================================

Single Dummy Independent Variable Example 1 continues Using (6), with δ̂ f = −0.296511, 100 · (ŵ f − ŵ m ) / ŵ m = 100 · [exp(δ̂ f ) − 1] ≈ −25.7%, (8) which suggests that, given the other factors, women's wages (w f ) are on average 25.7 percent lower than men's wages (w m ). It is notable that exper and tenure squared have statistically significant negative coefficient estimates, which supports the idea of a diminishing marginal increase due to these factors.
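A sketch of how Example 1 and the computation in (8) could be replicated in R, assuming the wage1 data frame from the wooldridge package (the data source is an assumption; the output shown above comes from EViews):

library(wooldridge)                     # assumed source of the wage1 data
data("wage1")
fit1 <- lm(log(wage) ~ female + educ + exper + tenure + I(exper^2) + I(tenure^2),
           data = wage1)
summary(fit1)
d_f <- coef(fit1)["female"]
100 * (exp(d_f) - 1)                    # exact percentage difference, eq. (6)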


Multiple Categories Additional dummy variables can be included in the regression model as well. In the wage example, if married (married = 1 if married, and 0 otherwise) is included, we have the following possibilities:

female   married   characterization
  1         0      single woman
  1         1      married woman
  0         1      married man
  0         0      single man

and the intercept parameter becomes β 0 = β sm + δ f female + δ ma married. (9) Coefficient δ ma is the wage marriage premium.

Multiple Categories Example 2 Including the married dummy in the wage model yields

Dependent Variable: LOG(WAGE)
Method: Least Squares
Sample: 1 526
Included observations: 526
======================================================
Variable     Coefficient   Std. Error   t-statistic   Prob.
------------------------------------------------------
C               0.41778      0.09887      4.226       0.0000
FEMALE         -0.29018      0.03611     -8.036       0.0000
MARRIED         0.05292      0.04076      1.299       0.1947
EDUC            0.07915      0.00680     11.640       0.0000
EXPER           0.02695      0.00533      5.061       0.0000
TENURE          0.03130      0.00685      4.570       0.0000
EXPER^2        -0.00054      0.00011     -4.813       0.0000
TENURE^2       -0.00057      0.00023     -2.448       0.0147
======================================================
R-squared            0.443    Mean dependent var    1.623
Adjusted R-squared   0.435    S.D. dependent var    0.532
S.E. of regression   0.400    Akaike info crit.     1.018
Sum squared resid   82.682    Schwarz criterion     1.083
Log likelihood    -259.731    F-statistic          58.755
Durbin-Watson stat   1.798    Prob(F-statistic)     0.000
======================================================

Multiple Categories Example 2 continues The estimate of the marriage premium is about 5.3%, but it is not statistically significant. A major limitation of this model is that it assumes that the marriage premium is the same for men and women. We relax this next.

Multiple Categories By generating the dummies singfem, marrfem, and marrmale we can investigate the marriage premiums for men and women separately. The intercept term becomes β 0 = β sm + δ mm marrmale + δ mf marrfem + δ sf singfem. (10) The needed dummy variables can be generated as cross-products of the female and married dummies. For example, the singfem dummy is singfem = (1 − married) × female. So, how do you generate marrmale (married man)?
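A sketch of this construction in R, assuming the wage1 data from the wooldridge package, in which female and married are 0/1 dummies (the data source is an assumption); the last line answers the question above:

library(wooldridge)
data("wage1")
wage1$singfem  <- (1 - wage1$married) * wage1$female   # single woman
wage1$marrfem  <- wage1$married * wage1$female         # married woman
wage1$marrmale <- wage1$married * (1 - wage1$female)   # married man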

Multiple Categories Example 3 Estimating the model with the intercept modeled as (10) gives

Dependent Variable: LOG(WAGE)
Method: Least Squares
Sample: 1 526
Included observations: 526
======================================================
Variable     Coefficient   Std. Error   t-statistic   Prob.
------------------------------------------------------
C               0.3214       0.1000       3.213       0.0014
MARRMALE        0.2127       0.0554       3.842       0.0001
MARRFEM        -0.1983       0.0578      -3.428       0.0007
SINGFEM        -0.1104       0.0557      -1.980       0.0483
EDUC            0.0789       0.0067      11.787       0.0000
EXPER           0.0268       0.0052       5.112       0.0000
TENURE          0.0291       0.0068       4.302       0.0000
EXPER^2        -0.0005       0.0001      -4.847       0.0000
TENURE^2       -0.0005       0.0002      -2.306       0.0215
======================================================
R-squared            0.461    Mean dependent var    1.623
Adjusted R-squared   0.453    S.D. dependent var    0.532
S.E. of regression   0.393    Akaike info crit.     0.988
Sum squared resid   79.968    Schwarz criterion     1.061
Log likelihood    -250.955    F-statistic          55.246
Durbin-Watson stat   1.785    Prob(F-statistic)     0.000
======================================================

Multiple Categories Example 3 continues The reference group is single men. The estimated wage difference between single women and single men is −11.0%, which is just borderline statistically significant in a two-sided test. The marriage premium for men is 21.3% (more accurately, using (6), 23.7%). Married women are estimated to earn 19.8% less than single men. The marriage premium for women is therefore δ̂ mf − δ̂ sf = −0.198 − (−0.110) = −0.088, or −8.8%. The statistical significance of this can be tested either by redefining the dummies so that single women become the base group.

Multiple Categories Example 3 continues Another option is to use advanced econometric software. The EViews Wald test for H 0 : δ mf = δ sf (11) produces F = 2.821 (df 1 and 517) with p-value 0.0937, which is not statistically significant at the 5% level, so there is no strong empirical evidence of a wage difference between married and single women. In the same manner, testing H 0 : δ mm = δ mf produces F = 80.61 with p-value 0.0000, i.e., highly statistically significant. Thus, there is strong empirical evidence of a marriage premium for men.
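The same Wald tests can be run in R with car::linearHypothesis; a sketch assuming the wage1 data from the wooldridge package and the group dummies constructed as in the previous sketch (package and data names are assumptions):

library(wooldridge); library(car)
data("wage1")
wage1$singfem  <- (1 - wage1$married) * wage1$female
wage1$marrfem  <- wage1$married * wage1$female
wage1$marrmale <- wage1$married * (1 - wage1$female)
fit3 <- lm(log(wage) ~ marrmale + marrfem + singfem + educ + exper + tenure +
             I(exper^2) + I(tenure^2), data = wage1)
linearHypothesis(fit3, "marrfem = singfem")    # H0: delta_mf = delta_sf, eq. (11)
linearHypothesis(fit3, "marrmale = marrfem")   # H0: delta_mm = delta_mf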

Multiple Categories Remark 6.2: If there are q categories, then q − 1 dummy variables are needed. The category which does not have a dummy variable becomes the base category or benchmark. Remark 6.3: Dummy variable trap. If the model includes the intercept term, defining q dummies for q categories leads to an exact linear dependence, because 1 = D 1 + ⋯ + D q. Note also that D^2 = D, which again leads to an exact linear dependence if a squared dummy is added to the model. All such cases of exact linear dependence among dummy variables are called the dummy variable trap.

Multiple Categories Exploring Further (W 5ed 7.2, p. 227) In the baseball salary data MLB1 (available at www.econometrics.com), players are given one of six positions: frstbase, scndbase, thrdbase, shrtstop, outfield, or catcher. To allow for salary differentials across positions, with outfielders as the base group, which dummy variables would you include as independent variables?

Ordinal Information

Multiple Categories If the categories contain ordinal information (e.g. 1 = good, 2 = better, 3 = best), these variables are sometimes used as such in regressions. However, interpretation may then be a problem, because a one-unit change implies a constant partial effect. That is, the difference between better and good is assumed to be as big as the difference between best and better.

Multiple Categories The usual alternative is to use dummy variables. In the above example two dummies are needed: D 1 = 1 if better, and 0 otherwise; D 2 = 1 if best, and 0 otherwise. As a consequence, the reference group is good.
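In R, storing the ordinal variable as a factor generates such dummies automatically, with the first level as the reference group; a minimal sketch with simulated data (all names hypothetical):

set.seed(1)
rating <- factor(sample(c("good", "better", "best"), 200, replace = TRUE),
                 levels = c("good", "better", "best"))
x <- rnorm(200)
y <- 1 + 0.5 * (rating == "better") + 1.2 * (rating == "best") + 0.8 * x + rnorm(200)
lm(y ~ rating + x)   # coefficients ratingbetter and ratingbest play the roles of D1 and D2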

Multiple Categories Exploring Further (Adapted from W 5ed 7.3, p. 228) Consider the effect of credit rating (CR) on the municipal bond rate (MBR). Credit ratings range from zero to four, with zero being the worst and four the best. Consider the model MBR = β 0 + δ 1 CR 1 + δ 2 CR 2 + δ 3 CR 3 + δ 4 CR 4 + other factors, where CR 1 = 1 if CR = 1, and CR 1 = 0 otherwise, CR 2 = 1 if CR = 2, and CR 2 = 0 otherwise, and so on. What can you expect about the signs of the coefficients δ i? How would you test that credit rating has no effect on MBR?

Multiple Categories The constant partial effect can be tested by testing the restricted model y = β 0 + δ(D 1 + 2D 2 ) + β 1 x + u (12) against the unrestricted alternative y = β 0 + δ 1 D 1 + δ 2 D 2 + β 1 x + u. (13) Exploring Further How would you test with EViews whether the credit rating affects the municipal bond rate linearly?
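One way to carry out this F-test in R (a sketch with simulated data; all names hypothetical) is to fit both models and compare them with anova():

set.seed(2)
n   <- 200
grp <- sample(0:2, n, replace = TRUE)          # 0 = good, 1 = better, 2 = best
D1  <- as.numeric(grp == 1); D2 <- as.numeric(grp == 2)
x   <- rnorm(n)
y   <- 1 + 0.5 * D1 + 1.5 * D2 + 0.8 * x + rnorm(n)
ur  <- lm(y ~ D1 + D2 + x)                     # unrestricted model (13)
r   <- lm(y ~ I(D1 + 2 * D2) + x)              # restricted model (12)
anova(r, ur)                                   # F-test of the restriction delta2 = 2*delta1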

Multiple Categories Example 4 Effects of law school ranking on starting salaries. Dummy variables: top10, r11_25, r26_40, r41_60, and r61_100. The reference group is the schools ranked below 100.

Multiple Categories Example 4 continues Below are estimation results with some additional covariates (Wooldridge, Example 7.8).

Dependent Variable: LOG(SALARY)
Sample (adjusted): 1 155
Included observations: 136 after adjustments
======================================================
Variable       Coefficient   Std. Error   t-statistic   Prob.
------------------------------------------------------
C                 9.1653       0.4114      22.277       0.0000
TOP10             0.6996       0.0535      13.078       0.0000
R11_25            0.5935       0.0394      15.049       0.0000
R26_40            0.3751       0.0341      11.005       0.0000
R41_60            0.2628       0.0280       9.399       0.0000
R61_100           0.1316       0.0210       6.254       0.0000
LSAT              0.0057       0.0031       1.858       0.0655
GPA               0.0137       0.0742       0.185       0.8535
LOG(LIBVOL)       0.0364       0.0260       1.398       0.1647
LOG(COST)         0.0008       0.0251       0.033       0.9734
======================================================
R-squared            0.911    Mean dependent var   10.541
Adjusted R-squared   0.905    S.D. dependent var    0.277
S.E. of regression   0.086    Akaike info crit.    -2.007
Sum squared resid    0.924    Schwarz criterion    -1.792
F-statistic        143.199    Prob(F-statistic)     0.000
======================================================

Multiple Categories Example 4 continues The estimation results indicate that the ranking has a big influence on the starting salary. The estimated median salary at a law school ranked between 61 and 100 is about 13% higher than at those ranked below 100. The coefficient estimate for the top 10 schools is 0.6996; using (6) we get 100 · [exp(0.6996) − 1] ≈ 101.4%, that is, median starting salaries in top-10 schools tend to be about double those in schools ranked below 100.

Multiple Categories Example 5 Although not fully relevant, let us, just for illustration purposes, test the constant partial effect hypothesis, i.e., H 0 : δ top10 = 5 δ 61_100 , δ 11_25 = 4 δ 61_100 , δ 26_40 = 3 δ 61_100 , δ 41_60 = 2 δ 61_100 . (14) Using the Wald test for coefficient restrictions in EViews gives F = 1.456 with df 1 = 4 and df 2 = 126 and p-value 0.2196. This indicates that there is not much empirical evidence against a constant partial effect for the starting salary increment. The estimated constant partial coefficient is 0.139782, i.e., at each ranking class the starting median salary is estimated to increase by approximately 14%.

Multiple Categories Example 5 continues Figure 6.2: Starting median salary (LOG(SALARY)) against law school ranking (RANK).


Interaction among Dummy Variables

Interaction Involving Dummy Variables In Example 3 additional dummy variables were generated by multiplying other dummy variables. In fact, these define interaction terms among the dummy variables. Example 6 The same model as in Example 3 can be specified as log(wage) = β 0 + δ f female + δ mar married + δ mf female × married + β 1 educ + β 2 exper + β 3 tenure + β 4 (exper)^2 + β 5 (tenure)^2 + u. (15)

Interaction Involving Dummy Variables Example 6 continues

lm(formula = log(wage) ~ female + married + I(female * married) + educ +
    exper + tenure + I(exper^2) + I(tenure^2), data = wdfr)

Residuals:
     Min       1Q   Median       3Q      Max
-1.89697 -0.24060 -0.02689  0.23144  1.09197

Coefficients:
                      Estimate  Std. Error t value Pr(>|t|)
(Intercept)          0.3213781   0.1000090   3.213 0.001393 **
female              -0.1103502   0.0557421  -1.980 0.048272 *
married              0.2126757   0.0553572   3.842 0.000137 ***
I(female * married) -0.3005931   0.0717669  -4.188 3.30e-05 ***
educ                 0.0789103   0.0066945  11.787  < 2e-16 ***
exper                0.0268006   0.0052428   5.112 4.50e-07 ***
tenure               0.0290875   0.0067620   4.302 2.03e-05 ***
I(exper^2)          -0.0005352   0.0001104  -4.847 1.66e-06 ***
I(tenure^2)         -0.0005331   0.0002312  -2.306 0.021531 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared: 0.4609, Adjusted R-squared: 0.4525
F-statistic: 55.25 on 8 and 517 DF, p-value: < 2.2e-16

Interaction Involving Dummy Variables Example 6 continues This specification allows us to estimate the wage differential for all four groups (single men, single women, married men, married women), but we need to be careful in deriving the differentials. It is best to compute each differential as the difference of the group intercepts. For example, the marriage premium for women (compared to single women): the intercept for married women is β 0 + δ f + δ mar + δ mf and the intercept for single women is β 0 + δ f , so the difference is δ mar + δ mf with estimate 0.2126757 − 0.3005931 ≈ −0.088, as in Example 3. We observe that the marriage premium for men relative to single men in this specification is directly δ mar (= β 0 + δ mar − β 0 ).

Different Slopes

Interaction Involving Dummy Variables Consider y = β 0 + β 1 x 1 + u. (16) If the slope also depends on the group, then in addition to β 0 = β 00 + δ 0 D (17) we get similarly for the slope coefficient β 1 = β 11 + δ 1 D. (18) The regression equation is then y = β 00 + δ 0 D + β 11 x 1 + δ 1 x 2 + u, (19) where x 2 = D × x 1.
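In R formula syntax the specification in (19) is obtained with the interaction operator *; a minimal sketch with simulated data (all names hypothetical):

set.seed(3)
n  <- 300
D  <- rbinom(n, 1, 0.5)
x1 <- rnorm(n)
y  <- 1 + 0.5 * D + 0.8 * x1 + 0.4 * D * x1 + rnorm(n)   # group-specific intercept and slope
lm(y ~ D * x1)   # expands to D + x1 + D:x1; the D:x1 coefficient estimates delta1 in (19)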

Interaction Involving Dummy Variables Example 7 Wage example. Test whether the return to education differs between women and men. This can be tested by defining β educ = β meduc + δ feduc female. (20) The null hypothesis is H 0 : δ feduc = 0.

Interaction Involving Dummy Variables Example 7 continues

Dependent Variable: LOG(WAGE)
Method: Least Squares
Sample: 1 526
Included observations: 526
======================================================
Variable       Coefficient   Std. Error   t-statistic   Prob.
------------------------------------------------------
C                 0.31066      0.11831      2.626       0.0089
MARRMALE          0.21228      0.05546      3.828       0.0001
MARRFEM          -0.17093      0.17100     -1.000       0.3180
SINGFEM          -0.08340      0.16815     -0.496       0.6201
FEMALE*EDUC      -0.00219      0.01288     -0.170       0.8652
EDUC              0.07976      0.00838      9.521       0.0000
EXPER             0.02676      0.00525      5.095       0.0000
TENURE            0.02916      0.00678      4.299       0.0000
EXPER^2          -0.00053      0.00011     -4.829       0.0000
TENURE^2         -0.00054      0.00023     -2.309       0.0213
======================================================
R-squared            0.461    Mean dependent var    1.623
Adjusted R-squared   0.452    S.D. dependent var    0.532
S.E. of regression   0.394    Akaike info crit.     0.992
Sum squared resid   79.964    Schwarz criterion     1.073
Log likelihood    -250.940    F-statistic          49.018
Durbin-Watson stat   1.785    Prob(F-statistic)     0.000
======================================================

Interaction Involving Dummy Variables Example 7 continues δ̂ feduc = −0.00219 with p-value 0.8652. Thus there is no empirical evidence that the return to education differs between men and women.

Chow Test

Interaction Involving Dummy Variables Suppose there are two populations (e.g. men and women) and we want to test whether the same regression function applies to both groups. This can be handled by introducing a dummy variable D, with D = 1 for group 1 and D = 0 for group 2.

Interaction Involving Dummy Variables Suppose the regression in group g (g = 1, 2) is y g,i = β g,0 + β g,1 x g,i,1 + ⋯ + β g,k x g,i,k + u g,i , (21) i = 1, ..., n g , where n g is the number of observations from group g. Using the group dummy, we can write β g,j = β j + δ j D, (22) j = 0, 1, ..., k. An important assumption is that in both groups var[u g,i ] = σ u ². The null hypothesis is H 0 : δ 0 = δ 1 = ⋯ = δ k = 0. (23)

Interaction Involving Dummy Variables The null hypothesis (23) can be tested with the F-test given in (4.20). In the first step the unrestricted model is estimated over the pooled sample with coefficients of the form in equation (21) (thus 2(k + 1) coefficients). Next the restricted model, with all δ-coefficients set to zero, is estimated again over the pooled sample. Using the SSRs from the restricted and unrestricted models, test statistic (4.20) becomes F = [(SSR r − SSR ur ) / (k + 1)] / [SSR ur / (n − 2(k + 1))], (24) which has the F-distribution under the null hypothesis with k + 1 and n − 2(k + 1) degrees of freedom.

Interaction Involving Dummy Variables Exactly the same result is obtained if one estimates the regression equation separately for each group and sums up the SSRs. That is, SSR ur = SSR 1 + SSR 2 , (25) where SSR g is from the regression estimated on group g, g = 1, 2. Thus, statistic (24) can be written alternatively as F = {[SSR r − (SSR 1 + SSR 2 )] / (k + 1)} / {(SSR 1 + SSR 2 ) / [n − 2(k + 1)]}, (26) which is known as the Chow statistic (or Chow test).
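A sketch of the Chow test (26) in R, assuming the wage1 data from the wooldridge package, with women and men as the two groups and k = 3 regressors (the data source and grouping are assumptions, not part of the lecture):

library(wooldridge)
data("wage1")
f   <- log(wage) ~ educ + exper + tenure
ssr <- function(m) sum(resid(m)^2)
r   <- lm(f, data = wage1)                        # restricted: pooled, common coefficients
g1  <- lm(f, data = subset(wage1, female == 1))   # unrestricted, group 1 (women)
g2  <- lm(f, data = subset(wage1, female == 0))   # unrestricted, group 2 (men)
k <- 3; n <- nrow(wage1)
Fchow <- ((ssr(r) - (ssr(g1) + ssr(g2))) / (k + 1)) /
         ((ssr(g1) + ssr(g2)) / (n - 2 * (k + 1)))
pf(Fchow, k + 1, n - 2 * (k + 1), lower.tail = FALSE)    # p-value
# The same F is obtained from the pooled regression with full dummy interactions:
# anova(r, lm(log(wage) ~ female * (educ + exper + tenure), data = wage1))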


The Linear Probability Model

Binary Dependent Variable Up until now, in the regression y = x'β + u, (27) where x'β = β 0 + β 1 x 1 + ⋯ + β k x k , y has had a quantitative meaning (e.g. wage). What if y indicates a qualitative event (e.g., the firm has gone bankrupt), such that y = 1 indicates the occurrence of the event ("success") and y = 0 non-occurrence ("failure"), and we want to explain it by some explanatory variables?

Binary Dependent Variable What is the meaning of the regression y = x'β + u when y is a binary variable? Because E[u | x] = 0, E[y | x] = x'β. (28) Because y is a random variable that can take only the values 0 and 1, we can define the probabilities P(y = 1 | x) and P(y = 0 | x) = 1 − P(y = 1 | x), such that E[y | x] = 0 · P(y = 0 | x) + 1 · P(y = 1 | x) = P(y = 1 | x).

Binary Dependent Variable Thus, E[y | x] = P(y = 1 | x) indicates the success probability, and the regression in equation (28) models P(y = 1 | x) = β 0 + β 1 x 1 + ⋯ + β k x k , (29) the probability of success. This is called the linear probability model (LPM). The slope coefficients indicate the marginal effect of the corresponding x-variable on the success probability, i.e., the change in the probability as x changes: ΔP(y = 1 | x) = β j Δx j . (30)

Binary Dependent Variable In the OLS estimated model ŷ = β̂ 0 + β̂ 1 x 1 + ⋯ + β̂ k x k , (31) ŷ is the estimated or predicted probability of success. In order to specify the binary variable clearly, it may be useful to name the variable according to the success category (e.g., in a bankruptcy study, bankrupt = 1 for bankrupt firms and bankrupt = 0 for non-bankrupt firms; thus "success" is just a generic term).

Binary Dependent Variable Example 8 Married women's participation in the labor force (year 1975). (See the R snippet for the R commands.)

Call:
lm(formula = inlf ~ huswage + educ + exper + I(exper^2) + age +
    kidslt6 + kidsge6, data = mdfr)

Residuals:
     Min       1Q   Median       3Q      Max
-0.95304 -0.37143  0.08112  0.33820  0.92459

Coefficients:
              Estimate  Std. Error t value Pr(>|t|)
(Intercept)  0.6067070   0.1536916   3.948 8.64e-05 ***
huswage     -0.0081511   0.0039107  -2.084   0.0375 *
educ         0.0370734   0.0073371   5.053 5.48e-07 ***
exper        0.0404174   0.0056619   7.138 2.24e-12 ***
I(exper^2)  -0.0006120   0.0001852  -3.304   0.0010 **
age         -0.0165940   0.0024592  -6.748 3.02e-11 ***
kidslt6     -0.2636923   0.0335068  -7.870 1.25e-14 ***
kidsge6      0.0110821   0.0131915   0.840   0.4011
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4275 on 745 degrees of freedom
Multiple R-squared: 0.2631, Adjusted R-squared: 0.2561
F-statistic: 37.99 on 7 and 745 DF, p-value: < 2.2e-16

Binary Dependent Variable Example 8 continues All coefficients but that of kidsge6 are statistically significant, with signs as might be expected. The coefficients indicate the marginal effects of the variables on the probability that inlf = 1. Thus, e.g., an additional year of educ increases the probability by 0.037 (other variables held fixed). Figure: Marginal effects of experience (left panel) and education (right panel) on the married women's labor force participation probability.

Binary Dependent Variable Some issues associated with the LPM:
- The left-hand side is a probability restricted to [0, 1], while the right-hand side can take any value in (−∞, ∞), which may result in probability predictions less than zero or larger than one.
- Heteroskedasticity of u: denoting p(x) = P(y = 1 | x) = x'β, var[u | x] = p(x)(1 − p(x)), (32) which is not constant but depends on x, hence violating Assumption 2.
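Both points can be inspected in R; a sketch assuming the mroz data from the wooldridge package as a counterpart of the data used in Example 8 (data and package names are assumptions), with heteroskedasticity-robust standard errors from the sandwich and lmtest packages:

library(wooldridge); library(lmtest); library(sandwich)
data("mroz")
lpm <- lm(inlf ~ huswage + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
          data = mroz)
sum(fitted(lpm) < 0 | fitted(lpm) > 1)            # number of predictions outside [0, 1]
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))   # heteroskedasticity-robust inference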

The Logit and Probit Model

Binary Dependent Variable The first of the above problems can technically be solved easily by mapping the linear function on the right-hand side of equation (29) through a non-linear function onto the range (0, 1). Such a function is generally called a link function. That is, instead of (29) we write P(y = 1 | x) = G(x'β). (33) Although any function G : R → [0, 1] applies in principle, the so-called logit and probit transformations are in practice the most popular (the former is based on the logistic distribution and the latter on the normal distribution). Economists often favor the probit transformation, in which G is the cumulative distribution function of the standard normal, i.e., G(z) = Φ(z) = ∫_(−∞)^z (1 / √(2π)) e^(−v²/2) dv. (34)

Binary Dependent Variable In the logit transformation G(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z)) = ∫_(−∞)^z e^(−v) / (1 + e^(−v))² dv. (35) Both are S-shaped. Figure: the probit (left) and logit (right) transformations G(z).
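The two transformations correspond to pnorm() and plogis() in base R; a minimal sketch reproducing the S-shaped curves of the figure:

z <- seq(-4, 4, by = 0.01)
plot(z, pnorm(z), type = "l", ylab = "G(z)", main = "Probit and logit transformations")
lines(z, plogis(z), lty = 2)                        # logit link, dashed
legend("topleft", legend = c("probit", "logit"), lty = 1:2)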

Binary Dependent Variable The price, however, is that the interpretation of the marginal effects is no longer as straightforward as in the LPM. Nevertheless, a negative sign indicates a decreasing effect on the probability and a positive sign an increasing effect.

Binary Dependent Variable Example 9 Probit and logit estimation of the married women's labor force participation probability. Probit: (family = binomial(link = "probit") in glm)

glm(formula = inlf ~ huswage + educ + exper + I(exper^2) + age +
    kidslt6 + kidsge6, family = binomial(link = "probit"), data = wkng)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2495  -0.9127   0.4288   0.8502   2.2237

Coefficients:
              Estimate  Std. Error z value Pr(>|z|)
(Intercept)  0.3606454   0.5045248   0.715  0.47472
huswage     -0.0255438   0.0126530  -2.019  0.04351 *
educ         0.1248347   0.0249191   5.010 5.45e-07 ***
exper        0.1260736   0.0187091   6.739 1.60e-11 ***
I(exper^2)  -0.0019245   0.0005987  -3.214  0.00131 **
age         -0.0547298   0.0083918  -6.522 6.95e-11 ***
kidslt6     -0.8650224   0.1177382  -7.347 2.03e-13 ***
kidsge6      0.0285102   0.0438420   0.650  0.51550
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 1029.75 on 752 degrees of freedom
Residual deviance: 804.69 on 745 degrees of freedom
AIC: 820.69

Binary Dependent Variable Example 9 continues Logit: (family = binomial(link = "logit") in glm)

glm(formula = inlf ~ huswage + educ + exper + I(exper^2) + age +
    kidslt6 + kidsge6, family = binomial(link = "logit"), data = wkng)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2086  -0.8990   0.4481   0.8447   2.1953

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.592500   0.853042   0.695  0.48732
huswage     -0.043803   0.021312  -2.055  0.03985 *
educ         0.209272   0.042568   4.916 8.83e-07 ***
exper        0.209668   0.031963   6.560 5.39e-11 ***
I(exper^2)  -0.003181   0.001013  -3.141  0.00168 **
age         -0.091342   0.014460  -6.317 2.67e-10 ***
kidslt6     -1.431959   0.202233  -7.081 1.43e-12 ***
kidsge6      0.047089   0.074284   0.634  0.52614
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 1029.75 on 752 degrees of freedom
Residual deviance: 805.85 on 745 degrees of freedom
AIC: 821.85

Qualitatively the results are similar to those of the LPM. (R exercise: create graphs similar to those of the linear case for the marginal effects.)
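To make the probit or logit coefficients comparable to the LPM slopes, one can compute average partial effects (APEs); a sketch for the probit fit, assuming the mroz data from the wooldridge package as the counterpart of the lecture data (data and variable names are assumptions):

library(wooldridge)
data("mroz")
pr <- glm(inlf ~ huswage + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
          family = binomial(link = "probit"), data = mroz)
xb <- predict(pr, type = "link")                  # fitted index x'beta
mean(dnorm(xb)) * coef(pr)["educ"]                # APE of educ, comparable to the LPM's 0.037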