14. Binary Outcomes. A. Colin Cameron Pravin K. Trivedi Copyright 2006

These slides cover material similar to sections of our subsequent book Microeconometrics: Methods and Applications, Cambridge University Press, 2005.

INTRODUCTION
Discrete choice or qualitative response models are for $y$ that takes only a finite number of discrete values.
Here we consider binary outcome models, where only two values are taken, $0$ and $1$, particularly logit and probit models.
These are relatively straightforward nonlinear models.
Later we consider complications such as discrete variable endogeneity, panel data, etc. These are the reason for this course.

OUTLINE
General results.
Probit, logit, LPM and OLS models.
Latent variable formulations, especially random utility model.
Choice-based samples.
Semi-parametric estimation.
Grouped or aggregate data.

STATISTICAL MOTIVATION
The coin toss example of introductory statistics.
Let $p$ denote the probability of a head ($y = 1$) on one coin toss. Then $\Pr[y = 1] = p$ and $\Pr[y = 0] = 1 - p$.
For $N$ tosses, $y_i$ is the $i$th of $N$ independent realizations of head or tail.
The MLE for $p$ is the sample mean $\bar{y}$, i.e. the proportion of tosses that are heads.

BERNOULLI DISTRIBUTION
$\Pr[y = 1] = p$ and $\Pr[y = 0] = 1 - p$.
Compact expression for density:
$$f(y) = p^{y}(1-p)^{1-y}.$$
This is the Bernoulli density, which is the binomial with one trial per observation.
Moments:
$$E[y] = 1 \times p + 0 \times (1-p) = p,$$
$$V[y] = (1-p)^2\, p + (0-p)^2 (1-p) = p(1-p).$$
Note that $p$ can be interpreted as $E[y]$ or as $\Pr[y = 1]$.
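A short worked derivation of the coin-toss MLE from the Bernoulli density, assuming independent tosses with a common probability of a head (the log-likelihood symbol $\ln L$ is notation introduced here for illustration):

```latex
\begin{aligned}
\ln L(p) &= \sum_{i=1}^{N}\left[\, y_i \ln p + (1-y_i)\ln(1-p) \,\right], \\
\frac{\partial \ln L(p)}{\partial p}
  &= \frac{\sum_{i} y_i}{p} - \frac{N-\sum_{i} y_i}{1-p} = 0
  \;\Longrightarrow\;
  \hat{p} = \frac{1}{N}\sum_{i=1}^{N} y_i = \bar{y}.
\end{aligned}
```

So the proportion of heads solves the first-order condition, and the second derivative is negative for $0 < p < 1$, which makes it a maximum.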

ECONOMIC EXAMPLES
Leading economic applications are
labor supply: $y = 1$ if work and $y = 0$ otherwise;
transportation mode choice: $y = 1$ if commute to work by public transit and $y = 0$ otherwise.
Independent trials may be reasonable.
Assuming a constant probability $p$ for each trial is not. It will vary with an individual's characteristics.
So generalize to let $p$ be a function of regressors $\mathbf{x}$.

BINARY OUTCOME MODELS
Regression model formed by parameterizing $p$ to depend on regressors $\mathbf{x}$ and parameters $\boldsymbol{\beta}$.
Usually specify single-index model
$$E[y|\mathbf{x}] = p = F(\mathbf{x}'\boldsymbol{\beta}).$$
Usually choose $F(\cdot)$ to be a cumulative distribution function (cdf). Then $0 \le F(\cdot) \le 1$, so $0 \le p \le 1$.
Logistic cdf gives logit model.
Standard normal cdf gives probit model.

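MLE
Density: $f(y_i) = p_i^{y_i}(1-p_i)^{1-y_i}$ with $p_i = F(\mathbf{x}_i'\boldsymbol{\beta})$, so
$$\ln f(y_i) = y_i \ln F(\mathbf{x}_i'\boldsymbol{\beta}) + (1-y_i)\ln\left(1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right).$$
Log-likelihood function is
$$\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left\{ y_i \ln F(\mathbf{x}_i'\boldsymbol{\beta}) + (1-y_i)\ln\left(1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right)\right\}.$$
Let $F'(z) = \partial F(z)/\partial z$. MLE solves
$$\sum_{i=1}^{N}\left[\frac{y_i}{F(\mathbf{x}_i'\boldsymbol{\beta})}F'(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i - \frac{1-y_i}{1-F(\mathbf{x}_i'\boldsymbol{\beta})}F'(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i\right] = \mathbf{0}.$$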

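DISTRIBUTION OF MLE
The MLE f.o.c. simplify to
$$\sum_{i=1}^{N}\frac{y_i - F(\mathbf{x}_i'\boldsymbol{\beta})}{F(\mathbf{x}_i'\boldsymbol{\beta})\left(1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right)}\,F'(\mathbf{x}_i'\boldsymbol{\beta})\,\mathbf{x}_i = \mathbf{0}.$$
General ML result if density correctly specified:
$$\hat{\boldsymbol{\beta}}_{ML} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(-E\left[\left.\frac{\partial^2 \mathcal{L}}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0}\right]\right)^{-1}\right].$$
For binary outcome MLE:
$$\hat{\boldsymbol{\beta}}_{ML} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(\sum_{i=1}^{N}\frac{1}{F(\mathbf{x}_i'\boldsymbol{\beta}_0)\left(1 - F(\mathbf{x}_i'\boldsymbol{\beta}_0)\right)}\,F'(\mathbf{x}_i'\boldsymbol{\beta}_0)^2\,\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\right].$$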

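MISSPECIFIED MODEL
For binary data the dgp density is always Bernoulli, as $\Pr[y = 1] = p$ and $\Pr[y = 0] = 1 - \Pr[y = 1]$, while the model specifies $p = F(\mathbf{x}'\boldsymbol{\beta})$.
Therefore the only possible misspecification of the dgp is if $p \neq F(\mathbf{x}'\boldsymbol{\beta})$.
The estimator is clearly inconsistent if $p \neq F(\mathbf{x}'\boldsymbol{\beta})$, as then $E[y - F(\mathbf{x}'\boldsymbol{\beta})] \neq 0$, so the left-hand side of the f.o.c. does not have expected value $0$.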

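WEIGHTED NLS INTERPRETATION
Since
$$E[y|\mathbf{x}] = F(\mathbf{x}'\boldsymbol{\beta}), \qquad V[y|\mathbf{x}] = F(\mathbf{x}'\boldsymbol{\beta})\left(1 - F(\mathbf{x}'\boldsymbol{\beta})\right), \qquad \frac{\partial E[y|\mathbf{x}]}{\partial\boldsymbol{\beta}} = F'(\mathbf{x}'\boldsymbol{\beta})\,\mathbf{x},$$
the MLE first-order conditions imply
$$\sum_{i=1}^{N}\frac{y_i - E[y_i|\mathbf{x}_i]}{V[y_i|\mathbf{x}_i]}\,\frac{\partial E[y_i|\mathbf{x}_i]}{\partial\boldsymbol{\beta}} = \mathbf{0}.$$
Residuals are orthogonal to regressors upon weighting to adjust for heteroskedasticity, i.e. nonlinear WLS.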

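Similarly
$$\hat{\boldsymbol{\beta}} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(\sum_{i=1}^{N}\frac{1}{V[y_i|\mathbf{x}_i]}\left.\frac{\partial E[y_i|\mathbf{x}_i]}{\partial\boldsymbol{\beta}}\frac{\partial E[y_i|\mathbf{x}_i]}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0}\right)^{-1}\right]$$
is of nonlinear WLS form.
These properties are more generally those of the quasi-MLE where the specified density is in the linear exponential family.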

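LOGIT MODEL
The logit model specifies
$$p = \Lambda(\mathbf{x}'\boldsymbol{\beta}) = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}},$$
where $\Lambda(z) = e^{z}/(1 + e^{z}) = 1/(1 + e^{-z})$ is the logistic cdf.
The derivative $\Lambda'(z) = \Lambda(z)\left(1 - \Lambda(z)\right)$ is the logistic density.
For this reason it is also called the logistic regression model.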

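The logit ML conditions simplify to
$$\sum_{i=1}^{N}\left(y_i - \Lambda(\mathbf{x}_i'\boldsymbol{\beta})\right)\mathbf{x}_i = \mathbf{0}.$$
The logit MLE has
$$\hat{\boldsymbol{\beta}}_{Logit} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(\sum_{i=1}^{N}\Lambda(\mathbf{x}_i'\boldsymbol{\beta}_0)\left(1 - \Lambda(\mathbf{x}_i'\boldsymbol{\beta}_0)\right)\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\right].$$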
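The logit first-order conditions and variance formula can be sketched numerically. The following is a minimal illustration rather than the authors' code: it assumes simulated data, NumPy only, and hypothetical names (`logistic_cdf`, `fit_logit`), and solves the f.o.c. by Newton-Raphson.

```python
import numpy as np

def logistic_cdf(z):
    """Logistic cdf: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit MLE; returns (beta_hat, estimated covariance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic_cdf(X @ beta)
        score = X.T @ (y - p)                       # f.o.c.: sum_i (y_i - Lambda_i) x_i
        info = (X * (p * (1 - p))[:, None]).T @ X   # sum_i Lambda_i (1 - Lambda_i) x_i x_i'
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = logistic_cdf(X @ beta)
    info = (X * (p * (1 - p))[:, None]).T @ X
    return beta, np.linalg.inv(info)

# Simulated data (illustrative assumption): intercept plus one regressor.
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=N) < logistic_cdf(X @ beta_true)).astype(float)

beta_hat, cov_hat = fit_logit(X, y)
se_hat = np.sqrt(np.diag(cov_hat))
print("estimates:", beta_hat, "standard errors:", se_hat)
```

Because the model includes an intercept, the first element of the f.o.c. forces the average fitted probability to equal the sample mean of y.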

PROBIT MODEL
The probit model specifies
$$p = \Phi(\mathbf{x}'\boldsymbol{\beta}),$$
where $\Phi(z) = \int_{-\infty}^{z}\phi(s)\,ds$ is the standard normal cdf.
The derivative $\Phi'(z) = \phi(z) = (1/\sqrt{2\pi})\exp(-z^2/2)$ is the standard normal density function.
There is no simplification to the f.o.c., unlike the logit case.

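The probit MLE has distribution
$$\hat{\boldsymbol{\beta}}_{Probit} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(\sum_{i=1}^{N}\frac{\phi(\mathbf{x}_i'\boldsymbol{\beta}_0)^2}{\Phi(\mathbf{x}_i'\boldsymbol{\beta}_0)\left(1 - \Phi(\mathbf{x}_i'\boldsymbol{\beta}_0)\right)}\,\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\right].$$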
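A comparable sketch for probit, again on simulated data with hypothetical helper names (`norm_cdf`, `norm_pdf`, `fit_probit`). It uses Fisher scoring, i.e. the general binary f.o.c. with the expected information matrix that appears in the variance formula, which is one of several ways to compute the MLE.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def fit_probit(X, y, tol=1e-10, max_iter=100):
    """Fisher scoring for the probit MLE; returns (beta_hat, estimated covariance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = X @ beta
        Phi = np.clip(norm_cdf(z), 1e-12, 1 - 1e-12)  # guard against division by zero
        phi = norm_pdf(z)
        score = X.T @ ((y - Phi) * phi / (Phi * (1 - Phi)))
        info = (X * (phi**2 / (Phi * (1 - Phi)))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    z = X @ beta
    Phi = np.clip(norm_cdf(z), 1e-12, 1 - 1e-12)
    info = (X * (norm_pdf(z)**2 / (Phi * (1 - Phi)))[:, None]).T @ X
    return beta, np.linalg.inv(info)

# Simulated data (illustrative assumption): latent index with standard normal error.
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.3, 0.8])
y = (X @ beta_true + rng.normal(size=N) > 0).astype(float)

beta_hat, cov_hat = fit_probit(X, y)
se_hat = np.sqrt(np.diag(cov_hat))
print("estimates:", beta_hat, "standard errors:", se_hat)
```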

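LINEAR PROBABILITY MODEL (LPM)
The LPM specifies
$$p = \mathbf{x}'\boldsymbol{\beta}.$$
The LPM MLE f.o.c. are
$$\sum_{i=1}^{N}\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{\mathbf{x}_i'\boldsymbol{\beta}\left(1 - \mathbf{x}_i'\boldsymbol{\beta}\right)}\,\mathbf{x}_i = \mathbf{0}.$$
The LPM MLE has distribution
$$\hat{\boldsymbol{\beta}}_{LPM} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ \left(\sum_{i=1}^{N}\frac{1}{\mathbf{x}_i'\boldsymbol{\beta}_0\left(1 - \mathbf{x}_i'\boldsymbol{\beta}_0\right)}\,\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\right].$$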

The LPM has the obvious weakness of permitting probabilities outside the (0, 1) interval.
Furthermore, the estimator can be numerically unstable if $\mathbf{x}_i'\boldsymbol{\beta}$ is close to $0$ or $1$.

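OLS
The LPM is better estimated by OLS, which also specifies $E[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$.
Allow for the intrinsic heteroskedasticity of binary data:
$$\hat{\boldsymbol{\beta}}_{OLS} \overset{a}{\sim} N\left[\boldsymbol{\beta}_0,\ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Sigma}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right],$$
where for $\boldsymbol{\Sigma}$ use
$$\hat{\boldsymbol{\Sigma}} = \mathrm{Diag}\left[(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2\right] \quad\text{or}\quad \hat{\boldsymbol{\Sigma}} = \mathrm{Diag}\left[\mathbf{x}_i'\hat{\boldsymbol{\beta}}\left(1 - \mathbf{x}_i'\hat{\boldsymbol{\beta}}\right)\right].$$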
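The sandwich variance for OLS on a binary outcome can be sketched as follows, using the squared-residual choice for the diagonal weighting matrix. The function name `ols_robust` and the five-observation data set are hypothetical, chosen so the result can be checked by hand.

```python
import numpy as np

def ols_robust(X, y):
    """OLS coefficients with the heteroskedasticity-robust (sandwich) covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    meat = (X * (resid**2)[:, None]).T @ X   # X' Diag[(y_i - x_i'b)^2] X
    cov = XtX_inv @ meat @ XtX_inv
    return beta, cov

# Tiny intercept-only example (illustrative assumption): three ones out of five.
y = np.array([1.0, 0.0, 0.0, 1.0, 1.0])
X = np.ones((5, 1))
beta_hat, cov_hat = ols_robust(X, y)
print(beta_hat, cov_hat)
```

In this intercept-only case the coefficient is the sample proportion 0.6 and the robust variance is 0.6 × 0.4 / 5 = 0.048, which is also what the second choice of the weighting matrix gives.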

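COEFFICIENT INTERPRETATION
Coefficients in different models are not directly comparable due to different scaling.
Instead compare across models the effect of a one unit change in regressors on $\Pr[y = 1|\mathbf{x}] = E[y|\mathbf{x}]$.
Now
$$E[y|\mathbf{x}] = F(\mathbf{x}'\boldsymbol{\beta}) \;\Rightarrow\; \frac{\partial E[y|\mathbf{x}]}{\partial\mathbf{x}} = F'(\mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\beta},$$
where $F'(z) = \partial F(z)/\partial z$.
Thus the effect depends on the functional form of $F$ and the evaluation point $\mathbf{x}$, in addition to the parameter $\boldsymbol{\beta}$.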

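For $\partial E[y|\mathbf{x}]/\partial\mathbf{x} = F'(\mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\beta}$ consider $F'(\mathbf{x}'\boldsymbol{\beta})$.
Logit: $F'(z) \le 0.25$, since $F'(z) = \Lambda(z)(1 - \Lambda(z))$ is maximized over $0 \le \Lambda(z) \le 1$ when $\Lambda(z) = 0.5$.
Probit: $F'(z) \le 0.4$, since $F'(z) = \phi(z)$, which is at its maximum of $1/\sqrt{2\pi} \simeq 0.4$ at $z = 0$.
LPM or OLS: $F'(z) = 1$.
This suggests for slope parameters the rule of thumb
$$\hat{\beta}_{Logit} \simeq 4\,\hat{\beta}_{OLS}, \qquad \hat{\beta}_{Probit} \simeq 2.5\,\hat{\beta}_{OLS}, \qquad \hat{\beta}_{Logit} \simeq 1.6\,\hat{\beta}_{Probit}.$$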

Amemiya (1981, p.1488) demonstrates this works quite well for $0.1 \le F(\mathbf{x}'\boldsymbol{\beta}) \le 0.9$.
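The rule-of-thumb multipliers follow from the maximum values of the three derivatives. A small arithmetic check (variable names are illustrative):

```python
import math

# Maximum of F'(z) in each model.
max_logit_density = 0.5 * (1 - 0.5)                 # Lambda(z)(1 - Lambda(z)) at Lambda = 0.5
max_probit_density = 1.0 / math.sqrt(2 * math.pi)   # phi(0)
max_lpm_density = 1.0                               # F'(z) = 1 for LPM / OLS

# Equal marginal effects at the peak imply these coefficient ratios.
logit_over_ols = max_lpm_density / max_logit_density
probit_over_ols = max_lpm_density / max_probit_density
logit_over_probit = max_probit_density / max_logit_density

print(logit_over_ols, probit_over_ols, logit_over_probit)
```

The three ratios are 4, about 2.5 and about 1.6, matching the rule of thumb. They are exact only at the peak of each density, which is why the approximation works best for probabilities away from 0 and 1.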

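ODDS RATIO FOR LOGIT
For the logit model
$$p = \frac{\exp(\mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}'\boldsymbol{\beta})} \;\Rightarrow\; \frac{p}{1-p} = \exp(\mathbf{x}'\boldsymbol{\beta}) \;\Rightarrow\; \ln\frac{p}{1-p} = \mathbf{x}'\boldsymbol{\beta}.$$
$p/(1-p)$ is the odds ratio, which measures the probability that $y = 1$ relative to the probability that $y = 0$.
E.g. a pharmaceutical drug study where $y = 1$ denotes survival and $y = 0$ denotes death. An odds ratio of 2 means that the odds of survival are twice those of death.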

Statistical analyses and packages use $p/(1-p) = \exp(\mathbf{x}'\boldsymbol{\beta})$.
Suppose the $j$th regressor increases by one unit. Then $\mathbf{x}'\boldsymbol{\beta}$ increases to $\mathbf{x}'\boldsymbol{\beta} + \beta_j$, and $\exp(\mathbf{x}'\boldsymbol{\beta})$ increases to $\exp(\mathbf{x}'\boldsymbol{\beta} + \beta_j) = \exp(\mathbf{x}'\boldsymbol{\beta})\exp(\beta_j)$.
Thus the odds ratio has increased by a multiple $\exp(\beta_j)$.
E.g. a logit slope parameter of $0.1$ means that a one unit change in the regressor increases the odds ratio by a multiple $\exp(0.1) \simeq 1.105$. The relative probability of survival has increased by 10.5 percent.
This interpretation is widely used in applied biostatistics.

For economists it is more natural to interpret $\beta_j$ as a semi-elasticity for the odds ratio, since $\ln\left(p/(1-p)\right) = \mathbf{x}'\boldsymbol{\beta}$.
Then a logit slope parameter of $0.1$ means that a one unit change in the regressor increases the odds ratio by a proportion $0.1$, i.e. by 10 percent.
This coincides with the interpretation used in statistics for very small $\beta_j$, since then $\exp(\beta_j) - 1 \simeq \beta_j$.
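A quick numerical check of the two interpretations, using a slope of 0.1 and an arbitrary starting index (the value 0.3 and the function names are illustrative assumptions):

```python
import math

def logit_prob(index):
    """Logit probability for a given value of the index x'beta."""
    return 1.0 / (1.0 + math.exp(-index))

def odds(p):
    """Odds ratio p / (1 - p)."""
    return p / (1.0 - p)

slope = 0.1          # logit slope parameter
index_before = 0.3   # hypothetical starting value of x'beta

p_before = logit_prob(index_before)
p_after = logit_prob(index_before + slope)   # regressor increased by one unit

multiple = odds(p_after) / odds(p_before)    # statisticians' multiplicative effect
semi_elasticity_approx = slope               # economists' approximation

print(multiple, multiple - 1.0, semi_elasticity_approx)
```

The multiple equals exp(0.1), about 1.105, whatever starting index is used, while the semi-elasticity reading gives 0.10 in place of the exact 0.105 proportionate change.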

WHICH BINARY CHOICE MODEL?
Which model: logit, probit or linear probability?
Theoretically it depends on the data generating process (dgp).
Unlike other applications of ML there is no problem in specifying the distribution: the only possible distribution for a (0, 1) variable is the Bernoulli.
The problem lies in specifying a functional form for the parameter of this distribution.
If the dgp has $p = \Lambda(\mathbf{x}'\boldsymbol{\beta}_0)$ then a logit model should be used, and estimators based on other models such as probit are potentially inconsistent.
Similar conclusions hold if instead the dgp has $p = \Phi(\mathbf{x}'\boldsymbol{\beta}_0)$ or $p = \mathbf{x}'\boldsymbol{\beta}_0$.

Aside: Misspecification consequences, however, are not as large as this.
As long as in the true model the probability or mean is of the single-index form $p = F(\mathbf{x}'\boldsymbol{\beta}_0)$, then choosing the wrong function $F$ affects all slope parameters equally, and the ratio of slope parameters is constant across the models. See Ruud (1986, 1993).

WHY LOGIT MODEL?
Logit model is the binary model used by statisticians:
F.o.c. and asymptotic distribution are relatively simple.
Logit model corresponds to use of the canonical link function for the binomial in a generalized linear model.
Coefficients can be interpreted in terms of the log-odds ratio.
Easy generalization to multinomial logit.
A discriminant analysis interpretation can be given.

DISCRIMINANT ANALYSIS
Aside: In discriminant analysis
both $y$ and $\mathbf{x}$ are random variables;
$\mathbf{x}$ is observed but $y$ is not observed;
given only $\mathbf{x}$ we determine whether $y$ equals 0 or 1.
E.g. classify what type of humanoid ($y = 0$ or $1$) a skull belongs to given various dimensions ($\mathbf{x}$) of the skull.
If $\mathbf{x}|y$ is multivariate normal distributed, the posterior probability of $y|\mathbf{x}$ is similar to the logit model probability.

WHY PROBIT MODEL?
The probit model is often used by economists.
It is motivated by a latent normal random variable, so it ties in with tobit models and multinomial probit.
Empirically, either logit or probit can be used: there is little difference between results from probit and logit analysis once parameter estimates are rescaled.
The greatest difference is in prediction of probabilities close to $0$ or $1$.

WHY OLS?
The LPM should not be used, as it permits probabilities outside the (0, 1) interval and can be numerically unstable.
Nonetheless OLS can be useful for preliminary data analysis.
In practice standard errors of slope coefficients are often quite similar across logit, probit and OLS (even using the incorrect $s^2(\mathbf{X}'\mathbf{X})^{-1}$ in the case of OLS).
Final results should, however, use probit or logit.

DETERMINING MODEL ADEQUACY
Several measures of model adequacy have been proposed. Many are very specific to binary outcome models.
There is no single best measure. See Amemiya (1981) and Maddala (1983).
Approaches:
R-squared measures.
Compare $\hat{y}$ with $y$.
Compare predicted $\widehat{\Pr}[y = 1]$ with actual $\Pr[y = 1]$.

McFADDEN'S R-SQUARED
There are many R-squareds for binary models, as $R^2$ in the linear model has many interpretations.
McFadden proposed two. We favor McFadden (1974), where
$$R^2 = 1 - \frac{\mathcal{L}_{fit}}{\mathcal{L}_0},$$
$\mathcal{L}_{fit}$ is the log-likelihood in the fitted model and $\mathcal{L}_0$ is the log-likelihood in the intercept-only model.
This $R^2$ should only be used for discrete choice models.

In other nonlinear models instead use
$$R^2 = 1 - \frac{\mathcal{L}_{max} - \mathcal{L}_{fit}}{\mathcal{L}_{max} - \mathcal{L}_0},$$
where $\mathcal{L}_{max}$ is the maximum possible value of the log-likelihood.
For binary outcome models $\mathcal{L}_{max} = 0$.
For some other models $\mathcal{L}_{max}$ can be unbounded, restricting use of this.
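A minimal sketch of McFadden's measure computed from fitted probabilities. The function names and the four-observation example are hypothetical, and the intercept-only fit is taken to be the sample proportion, which is what an intercept-only logit or probit delivers.

```python
import numpy as np

def bernoulli_loglik(y, p):
    """Bernoulli log-likelihood: sum of y ln p + (1 - y) ln(1 - p)."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_r2(y, p_hat):
    """1 - L_fit / L_0, with L_0 evaluated at the sample proportion of ones."""
    ll_fit = bernoulli_loglik(y, p_hat)
    ll_0 = bernoulli_loglik(y, np.full_like(p_hat, y.mean()))
    return 1.0 - ll_fit / ll_0

# Hypothetical outcomes and fitted probabilities.
y = np.array([0.0, 0.0, 1.0, 1.0])
p_hat = np.array([0.2, 0.3, 0.7, 0.8])
r2 = mcfadden_r2(y, p_hat)
print(r2)
```

The measure is 0 when the fitted probabilities all equal the sample proportion and approaches 1 as the fitted log-likelihood approaches its maximum of 0.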

PREDICTION THAT $y = 1$
Many measures compare predicted $\hat{y}$ with $y$.
The problem is in defining a rule for when $\hat{y} = 1$.
The obvious rule is $\hat{y} = 1$ when $\hat{p} = F(\mathbf{x}'\hat{\boldsymbol{\beta}}) > 0.5$.
But this can e.g. yield $\hat{y} = 0$ all the time if most of the sample has $y = 0$.
The receiver operating characteristics (ROC) curve considers what happens as the cutoff $c$ in the rule $\hat{p} > c$ varies.

PREDICTION OF $\Pr[y = 1]$
Can compare predicted $\widehat{\Pr}[y = 1]$ with $\Pr[y = 1]$.
But testing whether on average the predicted probabilities equal the sample frequencies is not helpful over the entire sample, since for the logit model with an intercept the f.o.c. imply $\sum_{i}\left(y_i - \Lambda(\mathbf{x}_i'\hat{\boldsymbol{\beta}})\right) = 0$, so that $N^{-1}\sum_{i}\hat{p}_i = \bar{y}$.
Useful for subsamples.

LATENT VARIABLE APPROACHES
Let $y^*$ be an unobserved underlying continuous latent variable.
When $y^*$ crosses a threshold, $y = 1$ is generated.
The observed binary outcome $y$ is still Bernoulli.
The latent variable $y^*$ determines the functional form of the parameter of this Bernoulli distribution.
Two latent variable approaches:
index function;
random utility.

INDEX FUNCTION
Interest is in explaining the underlying unobserved continuous random variable $y^*$.
The natural regression model for $y^*$ is the index function model
$$y^* = \mathbf{x}'\boldsymbol{\beta} + u.$$
This cannot be estimated as $y^*$ is not observed. Instead we observe
$$y = \begin{cases} 1 & \text{if } y^* > 0, \\ 0 & \text{if } y^* \le 0. \end{cases}$$
The choice of 0 as the threshold is a normalization. If instead $y^* > c$ is used as the threshold, exactly the same slope parameter estimates are obtained if $\mathbf{x}$ includes a constant term.
Examples:
$y^*$ is a person's tendency to work and we observe only whether or not the person works ($y = 1$);
$y^*$ is a person's propensity to commute by public transit and we observe only whether or not public transit is used ($y = 1$).

INDEX FUNCTION...
Then
$$\Pr[y = 1] = \Pr[y^* > 0] = \Pr[\mathbf{x}'\boldsymbol{\beta} + u > 0] = \Pr[-u < \mathbf{x}'\boldsymbol{\beta}] = F(\mathbf{x}'\boldsymbol{\beta}),$$
where $F$ is the cdf of $-u$ (which equals the cdf of $u$ in the usual case of a density symmetric about $0$).
The index function therefore gives a way to interpret the function $F$.

The probit model arises from standard normal errors in the index model.
The logit model arises from logistic errors in the index model.
Index function approach:
Gives a way to interpret the parameter $\boldsymbol{\beta}$ as the change in $y^*$ when $\mathbf{x}$ changes by one unit, but this is of limited use as $y^*$ is essentially unitless.
Can generalize to ordered models, multivariate models and limited dependent variable models.

RANDOM UTILITY MODELS
In the random utility formulation a consumer selects the choice with highest utility.
The discrete variable $y$
takes value $1$ if choice $1$ has higher utility;
takes value $0$ if choice $0$ has higher utility.

The random utility model specifies the utilities of alternatives $0$ and $1$ to be
$$U_0 = V_0 + \varepsilon_0, \qquad U_1 = V_1 + \varepsilon_1,$$
where $V_0$ and $V_1$ are deterministic components of utility, whose dependence on regressors is detailed below, and $\varepsilon_0$ and $\varepsilon_1$ are random components of utility.

The alternative with highest utility is chosen. So the observed choice is
$$\Pr[y = 1] = \Pr[U_1 > U_0] = \Pr[V_1 + \varepsilon_1 > V_0 + \varepsilon_0] = \Pr[\varepsilon_0 - \varepsilon_1 < V_1 - V_0] = F(V_1 - V_0),$$
where $F$ is the cdf of $(\varepsilon_0 - \varepsilon_1)$.
Different distributions of $\varepsilon_0$ and $\varepsilon_1$ give different discrete choice models.

Binary probit arises if $\varepsilon_0$ and $\varepsilon_1$ are normal, as is readily seen by noting that then $(\varepsilon_0 - \varepsilon_1)$ is normally distributed, upon normalization of the variance of $(\varepsilon_0 - \varepsilon_1)$ to unity.
Binary logit model arises if $\varepsilon_0$ and $\varepsilon_1$ are type I extreme value distributed, defined soon, as then the difference $(\varepsilon_0 - \varepsilon_1)$ can be shown to be logistic distributed.
The random component $\varepsilon$ in the utility model is needed. Otherwise, choice would be deterministic, with alternative 1 always chosen if $V_1 > V_0$.

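REGRESSORS
The simplest applications let $V_0 = \mathbf{w}'\boldsymbol{\gamma}_0$ and $V_1 = \mathbf{w}'\boldsymbol{\gamma}_1$, where $\mathbf{w}$ are individual characteristics that do not vary with the choice.
More generally we let
$$V_j = \mathbf{x}_j'\boldsymbol{\beta} + \mathbf{w}'\boldsymbol{\gamma}_j, \qquad j = 0, 1,$$
where $\mathbf{x}_j$ are characteristics that vary with alternatives and $\mathbf{w}$ are characteristics that do not vary with the alternatives.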

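This yields
$$\Pr[y = 1] = F\left((\mathbf{x}_1 - \mathbf{x}_0)'\boldsymbol{\beta} + \mathbf{w}'(\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_0)\right).$$
For alternative-invariant regressors only the difference $(\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_0)$ can be identified.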

BRUTE FORCE
In the binary case it is sufficient to work with the distribution of $\varepsilon_0 - \varepsilon_1$.
But in more general multinomial models presented later it is necessary to work directly with the distributions of $\varepsilon_0$ and $\varepsilon_1$.
We present the algebra here for completeness.

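In general
$$\begin{aligned}
\Pr[y = 1] &= \Pr[\varepsilon_0 - \varepsilon_1 < V_1 - V_0] \\
&= \Pr[\varepsilon_0 < \varepsilon_1 + V_1 - V_0] \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0, \varepsilon_1)\,d\varepsilon_0\,d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\left\{\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0)\,d\varepsilon_0\right\}d\varepsilon_1,
\end{aligned}$$
where in the last line $\varepsilon_0$ and $\varepsilon_1$ are assumed to be independent.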

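BRUTE FORCE: LOGIT
Assume $\varepsilon_0$ and $\varepsilon_1$ are iid log Weibull (or type I extreme value) distributed with density
$$f(\varepsilon) = e^{-\varepsilon}\exp(-e^{-\varepsilon}).$$
Then substituting for $f(\varepsilon_0)$ yields
$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} f(\varepsilon_1)\left\{\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} e^{-\varepsilon_0}\exp(-e^{-\varepsilon_0})\,d\varepsilon_0\right\}d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\left[\exp(-e^{-\varepsilon_0})\right]_{-\infty}^{\varepsilon_1 + V_1 - V_0}d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\exp\left(-e^{-(\varepsilon_1 + V_1 - V_0)}\right)d\varepsilon_1.
\end{aligned}$$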

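Now substituting for $f(\varepsilon_1)$ yields
$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp(-e^{-\varepsilon_1})\exp\left(-e^{-(\varepsilon_1 + V_1 - V_0)}\right)d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left(-e^{-\varepsilon_1} - e^{-(\varepsilon_1 + V_1 - V_0)}\right)d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left(-e^{-\varepsilon_1} - e^{-\varepsilon_1}e^{-(V_1 - V_0)}\right)d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left(-e^{-\varepsilon_1}\left(1 + e^{-(V_1 - V_0)}\right)\right)d\varepsilon_1.
\end{aligned}$$
Since $\int_{-\infty}^{\infty} e^{-\varepsilon}\exp(-e^{-\varepsilon})\,d\varepsilon = 1$, it follows that $\int_{-\infty}^{\infty} e^{-\varepsilon}\exp(-a e^{-\varepsilon})\,d\varepsilon = 1/a$.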

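Using this result with $a = 1 + e^{-(V_1 - V_0)}$ yields
$$\Pr[y = 1] = \frac{1}{1 + e^{-(V_1 - V_0)}} = \frac{e^{V_1}}{e^{V_0} + e^{V_1}} = \frac{e^{V_1 - V_0}}{1 + e^{V_1 - V_0}}.$$
This last result implies the logit model
$$\Pr[y = 1] = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}},$$
where the index $\mathbf{x}'\boldsymbol{\beta}$ stands for $V_1 - V_0 = (\mathbf{x}_1 - \mathbf{x}_0)'\boldsymbol{\beta} + \mathbf{w}'(\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_0)$.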
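The brute-force result can be checked by simulation: with independent type I extreme value errors, the share choosing alternative 1 should be close to the logistic cdf evaluated at the utility difference. The utility levels, sample size and seed below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical deterministic utilities of alternatives 0 and 1.
V0, V1 = 0.2, 0.9

# NumPy's gumbel draws have cdf exp(-exp(-e)), the type I extreme value used here.
eps0 = rng.gumbel(size=n)
eps1 = rng.gumbel(size=n)

# Alternative 1 is chosen when its total utility is higher.
share_choose_1 = np.mean(V1 + eps1 > V0 + eps0)

# Logit probability implied by the derivation.
logit_prob = 1.0 / (1.0 + np.exp(-(V1 - V0)))

print(share_choose_1, logit_prob)
```

The two numbers should agree up to simulation noise, which is of order 0.001 at this sample size.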

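BRUTE FORCE: PROBIT
To instead obtain the probit model assume $(\varepsilon_0, \varepsilon_1)$ are bivariate normal with means zero, variances $\sigma_0^2$ and $\sigma_1^2$ and covariance $\sigma_{01}$.
Then similar analysis yields
$$\Pr[y = 1] = \Pr[V_1 + \varepsilon_1 > V_0 + \varepsilon_0] = \Phi\left(\frac{V_1 - V_0}{\sqrt{\sigma_0^2 + \sigma_1^2 - 2\sigma_{01}}}\right).$$
This implies the probit model
$$\Pr[y = 1] = \Phi(\mathbf{x}'\boldsymbol{\beta}/\sigma),$$
where the index $\mathbf{x}'\boldsymbol{\beta}$ stands for $V_1 - V_0 = (\mathbf{x}_1 - \mathbf{x}_0)'\boldsymbol{\beta} + \mathbf{w}'(\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_0)$ and $\sigma^2 = \sigma_0^2 + \sigma_1^2 - 2\sigma_{01}$.
It is simpler to use the result that $(\varepsilon_0 - \varepsilon_1)$ is normally distributed with mean zero and variance $\sigma^2$.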

CHOICE-BASED SAMPLING
Choice-based sampling arises when one choice is over-sampled.
This occurs when relatively few people make a choice, so a random sample would pick up too few $y = 1$.
E.g. oversampling bus riders by sampling at bus stops.
MLE based on the conditional density of $y$ given $\mathbf{x}$ leads to inconsistent estimates.
Instead MLE needs to be based on the joint density of $(y, \mathbf{x})$.

SEMI-PARAMETRIC REGRESSION
Here as usual $y$ is Bernoulli with parameter $p$ modelled in the single-index form $F(\mathbf{x}'\boldsymbol{\beta})$.
But a functional form for $F$ is not specified.
Instead both $\boldsymbol{\beta}$ and $F$ are estimated from the data.
This is a very active area, and a leading example of semiparametric estimation.

GROUPED DATA
Suppose the regressor vector $\mathbf{x}_i$, $i = 1, \ldots, N$, takes only $T$ distinct values, where $T$ is much smaller than $N$.
Then for each value of the regressors we have multiple observations on $y$.
This type of grouped data is called many observations per cell.
It can arise particularly in experimental data.
Economics data are rarely of grouped form, unless the regressors are just a few indicator variables.

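Let
$\mathbf{x}_t$, $t = 1, \ldots, T$, be the $T$ distinct values,
$N_t$ be the number of observations on $y$ for the $t$th distinct value of $\mathbf{x}$, so $\sum_{t=1}^{T} N_t = N$,
$\bar{p}_t$ be the proportion of times $y = 1$ when $\mathbf{x} = \mathbf{x}_t$,
$p_t = \Pr[y = 1|\mathbf{x} = \mathbf{x}_t] = F(\mathbf{x}_t'\boldsymbol{\beta})$.
Berkson's minimum chi-square estimator estimates the transformed model
$$F^{-1}(\bar{p}_t) = \mathbf{x}_t'\boldsymbol{\beta} + v_t$$
by weighted least squares, weighting each cell by the inverse of the error variance
$$\sigma_t^2 = \frac{\bar{p}_t(1 - \bar{p}_t)}{N_t\left[F'\left(F^{-1}(\bar{p}_t)\right)\right]^2}.$$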

Then
$$\hat{\boldsymbol{\beta}}_{MC} = \left[\sum_{t=1}^{T}\sigma_t^{-2}\,\mathbf{x}_t\mathbf{x}_t'\right]^{-1}\sum_{t=1}^{T}\sigma_t^{-2}\,\mathbf{x}_t\,F^{-1}(\bar{p}_t),$$
where the asymptotic theory requires $N_t \to \infty$.
This is simple to implement, as it only requires an OLS package.
Yet it is fully efficient, as it has the same asymptotic distribution as the MLE, which would treat each observation separately.

For the logit model this estimator is especially simple.
The transformed model becomes
$$\ln\frac{\bar{p}_t}{1 - \bar{p}_t} = \mathbf{x}_t'\boldsymbol{\beta} + v_t,$$
which is estimated by weighted least squares, weighting each cell by the inverse of
$$\sigma_t^2 = \frac{1}{N_t\,\bar{p}_t(1 - \bar{p}_t)}.$$
The transformed dependent variable is the log-odds ratio.
The advantage of the minimum chi-square estimator is computational simplicity. This is moot now.
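A minimal sketch of the minimum chi-square logit estimator as weighted least squares on cell log-odds. The function name `berkson_logit` and the four-cell data set are hypothetical. The cell proportions are set exactly equal to the logit probabilities, so the estimator should recover the assumed coefficients.

```python
import numpy as np

def berkson_logit(X, p_bar, n_obs):
    """WLS of cell log-odds on cell regressors with weights N_t p_t (1 - p_t)."""
    z = np.log(p_bar / (1 - p_bar))     # log-odds ratio in each cell
    w = n_obs * p_bar * (1 - p_bar)     # inverse of the error variance sigma_t^2
    XtWX = (X * w[:, None]).T @ X
    XtWz = (X * w[:, None]).T @ z
    beta = np.linalg.solve(XtWX, XtWz)
    cov = np.linalg.inv(XtWX)           # estimated asymptotic covariance
    return beta, cov

# Hypothetical grouped data: four cells, an intercept and one regressor.
X = np.column_stack([np.ones(4), np.array([-1.0, 0.0, 1.0, 2.0])])
n_obs = np.array([50.0, 80.0, 60.0, 40.0])
beta_assumed = np.array([-0.5, 1.0])
p_bar = 1.0 / (1.0 + np.exp(-(X @ beta_assumed)))  # noise-free proportions

beta_mc, cov_mc = berkson_logit(X, p_bar, n_obs)
print(beta_mc, np.sqrt(np.diag(cov_mc)))
```

With real data the cell proportions are noisy, and any cell with a proportion of exactly 0 or 1 has an undefined log-odds ratio, so it would need separate handling.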

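AGGREGATE DATA
Example: Suppose $\bar{p}_r$ equals the unemployment rate in region $r$ and $\bar{\mathbf{x}}_r$ equals the average level of schooling in region $r$.
One possible model is LS regression of $\bar{p}_r$ on $\bar{\mathbf{x}}_r$.
Because $0 \le \bar{p}_r \le 1$, many studies instead use the log-odds ratio to transform to a dependent variable which lies in $(-\infty, \infty)$, estimating the model
$$\ln\frac{\bar{p}_r}{1 - \bar{p}_r} = \bar{\mathbf{x}}_r'\boldsymbol{\beta} + \text{error}.$$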

This looks similar to the minimum chi-square estimator, but is not.
Berkson's estimator is appropriate if all regressors in the cell take the same value.
Here instead regressors can take different values, as different people in region $r$ have different schooling.
If we apply minimum chi-square the resulting estimator cannot be interpreted as being an estimate of the parameter $\boldsymbol{\beta}$ in an underlying binary choice model, say $\Pr[y = 1] = F(\mathbf{x}'\boldsymbol{\beta})$, as ignoring the heterogeneity leads to an inconsistent estimator of $\boldsymbol{\beta}$.

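To see this suppose $y_i = 1$ if $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + u_i > 0$ with $u_i \sim N[0, \sigma^2]$, data are only observed by region, and $\mathbf{x}_i \sim N[\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r]$ in region $r$.
Then in region $r$
$$\begin{aligned}
\Pr[y_i = 1|\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r] &= \Pr[\mathbf{x}_i'\boldsymbol{\beta} + u_i > 0\,|\,\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r] \\
&= \Pr\left[\left.\frac{\mathbf{x}_i'\boldsymbol{\beta} + u_i - \boldsymbol{\mu}_r'\boldsymbol{\beta}}{\sqrt{\sigma^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_r\boldsymbol{\beta}}} > \frac{-\boldsymbol{\mu}_r'\boldsymbol{\beta}}{\sqrt{\sigma^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_r\boldsymbol{\beta}}}\,\right|\,\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r\right] \\
&= \Phi\left(\frac{\boldsymbol{\mu}_r'\boldsymbol{\beta}}{\sqrt{\sigma^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_r\boldsymbol{\beta}}}\right),
\end{aligned}$$
since $\mathbf{x}_i'\boldsymbol{\beta} + u_i \sim N[\boldsymbol{\mu}_r'\boldsymbol{\beta},\ \sigma^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_r\boldsymbol{\beta}]$.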

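This suggests estimating the model
$$\Phi^{-1}(\bar{y}_r) = \frac{\bar{\mathbf{x}}_r'\boldsymbol{\beta}}{\sqrt{\sigma^2 + \boldsymbol{\beta}'\boldsymbol{\Sigma}_r\boldsymbol{\beta}}}.$$
Also see Allenby and Berry, Levinsohn and Pakes.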
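The attenuation caused by within-region heterogeneity can be illustrated by simulation for a single scalar regressor. All numbers below (slope, regional mean and spread, sample size, seed) are arbitrary assumptions.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(2)
n = 200_000

beta, sigma = 0.8, 1.0    # hypothetical slope and error standard deviation
mu_r, sd_r = 0.5, 1.5     # hypothetical regional mean and spread of the regressor

x = rng.normal(mu_r, sd_r, size=n)
u = rng.normal(0.0, sigma, size=n)
regional_rate = np.mean(x * beta + u > 0)   # observed regional frequency of y = 1

# Probability allowing for heterogeneity of x within the region.
attenuated = norm_cdf(mu_r * beta / math.sqrt(sigma**2 + beta**2 * sd_r**2))
# Probability that wrongly treats everyone as having x equal to the regional mean.
naive = norm_cdf(mu_r * beta / sigma)

print(regional_rate, attenuated, naive)
```

The simulated regional rate tracks the attenuated probability rather than the naive one, which is why a regression on regional averages does not recover the individual-level parameter without the variance adjustment.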

APPLICATION: LABOR SUPPLY
Use data of Mroz (1987) on 753 married women from the 1976 Panel Study of Income Dynamics (PSID).
Dependent variable LFP is an indicator variable that equals 1 if worked in previous year, and 0 if no work.
For this sample 428 (or 57%) worked.

The regressors are a constant term and
1. KL6: Number of children less than six
2. K618: Number of children aged six to eighteen
3. AGE: Age
4. ED: Education (years of schooling completed)
5. NLINCOME: Annual nonlabor income of wife, measured in $10,000s.

[Table: coefficient estimates and t-statistics from the OLS, logit and probit models, one row per regressor, with the log-likelihood ln L in the final row. The numeric entries did not survive extraction.]

Consider, for example, the effect of education.
The logit coefficient is 1.7 times the probit coefficient (vs. rule of thumb multiplier of 1.6).
The OLS coefficient is 0.3 times the probit coefficient (vs. rule of thumb multiplier of 0.4).
All models give similar results regarding statistical significance (though the reported OLS standard errors here are the incorrect ones).
