Economics 472. Lecture 16. Binary Dependent Variable Models

University of Illinois, Department of Economics, Fall 1998
Roger Koenker

Let's begin with a model for an observed proportion, or frequency. We would like to explain variation in the proportion $p_i$ as a function of covariates $x_i$. We could simply specify that

$$p_i = x_i'\beta + \text{error}$$

and run it as an OLS regression. But this has certain problems. For example, we might find that $\hat p_i \notin [0,1]$. So we typically consider transformations

$$g(p_i) = x_i'\beta + \text{error}$$

where $g$ is usually called the "link" function. A typical example of $g$ is the logit function

$$g(p) = \text{logit}(p) = \log(p/(1-p)),$$

which corresponds to the logistic df. The transformation may be seen to induce a certain degree of heteroscedasticity into the model. Suppose each observation $\hat p_i$ is based on a moderately large sample of $n_i$ observations with $\hat p_i \to p_i$. We may then use the $\delta$-method to compute the variability of $\text{logit}(\hat p_i)$:

$$V(g(\hat p_i)) = (g'(p_i))^2 \, V(\hat p_i)$$

$$g(p) = \log(p/(1-p)) \;\Rightarrow\; g'(p) = \frac{1-p}{p} \cdot \frac{d}{dp}\left(\frac{p}{1-p}\right) = \frac{1}{p(1-p)}$$

$$V(\hat p_i) = \frac{p_i(1-p_i)}{n_i}$$

$$V(\text{logit}(\hat p_i)) = \frac{1}{n_i \, p_i (1-p_i)}$$

Thus GLS would suggest running the weighted regression of $\text{logit}(\hat p_i)$ on $x_i$ with weights $n_i p_i (1-p_i)$. Of course, we could, based on considerations so far, replace $\text{logit}(\hat p_i)$ with any other quantile-type transformation from $[0,1]$ to $\mathbb{R}$. For example, we might use $\Phi^{-1}(\hat p_i)$, in which case the same logic suggests regressing $\Phi^{-1}(\hat p_i)$ on $x_i$ with weights

$$\frac{n_i \, \phi^2(\Phi^{-1}(p_i))}{p_i(1-p_i)}.$$

An immediate problem presents itself, however, if we would like to apply the foregoing to data in which some of the observed $p_i$ are either 0 or 1. Since the foregoing approach seems rather ad hoc anyway, based as it is on approximate normality of the $\hat p_i$, we might as well leap into the briar patch of mle. But to keep things quite close to the regression setting we will posit the following latent variable model.

We posit the model for the latent (unobserved) variable $y_i^*$ and assume that the observed binary response variable $y_i$ is generated as

$$y_i^* = x_i'\beta + u_i$$

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases}$$

so, letting the df of $u_i$ be denoted by $F$,

$$P(y_i = 1) = P(u_i > -x_i'\beta) = 1 - F(-x_i'\beta)$$

$$P(y_i = 0) = F(-x_i'\beta)$$

For $F$ symmetric, $F(z) + F(-z) = 1$, so $f(z) = f(-z)$ and we have

$$P(y_i = 1) = F(x_i'\beta)$$

$$P(y_i = 0) = 1 - F(x_i'\beta)$$

and we may write the likelihood of seeing the sample $\{(y_i, x_i): i = 1, \ldots, n\}$ as

$$L(\beta) = \prod_{i: y_i = 0} (1 - F(x_i'\beta)) \prod_{i: y_i = 1} F(x_i'\beta) = \prod_{i=1}^n F_i^{y_i} (1 - F_i)^{1 - y_i}$$

where $F_i = F(x_i'\beta)$. Now we need to make some choice of $F$. There are several popular choices:

(i) Logit: $p = F(z) = e^z/(1 + e^z)$, so $\log(p/(1-p)) = z$, and $E y_i = p_i = F(x_i'\beta) \Rightarrow \text{logit}(p_i) = x_i'\beta$.

(ii) Probit: $F(z) = \Phi(z) = \int_{-\infty}^{z} \phi(x)\,dx$, so $\Phi^{-1}(p_i) = x_i'\beta$.

(iii) Cauchy: $F(z) = \frac{1}{2} + \pi^{-1} \tan^{-1}(z)$, so $F^{-1}(p_i) = \tan(\pi(p_i - \frac{1}{2})) = x_i'\beta$.

(iv) Complementary log-log: $F^{-1}(p_i) = \log(-\log(1 - p_i)) = x_i'\beta$.

(v) Log-log: $F^{-1}(p_i) = -\log(-\log(p_i)) = x_i'\beta$.
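To make the likelihood concrete, here is a minimal Python sketch that evaluates $\log L(\beta)$ for three of the symmetric choices of $F$ listed in the notes (logit, probit, Cauchy). The function names and the toy sample are hypothetical and chosen only for illustration; nothing here is taken from the course materials.

```python
import math

def logistic_cdf(z):
    """Logit choice: F(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Probit choice: F(z) = Phi(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cauchy_cdf(z):
    """Cauchy choice: F(z) = 1/2 + arctan(z) / pi."""
    return 0.5 + math.atan(z) / math.pi

def log_likelihood(beta, xs, ys, cdf):
    """log L(beta) = sum_i [ y_i log F_i + (1 - y_i) log(1 - F_i) ], F_i = F(x_i'beta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        index = sum(b * xj for b, xj in zip(beta, x))
        f = cdf(index)
        total += y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return total

# Hypothetical toy sample: each x_i is (1, covariate), each y_i is 0 or 1.
xs = [(1.0, -1.5), (1.0, -0.5), (1.0, 0.0), (1.0, 0.7), (1.0, 1.8)]
ys = [0, 0, 1, 0, 1]
beta = (0.0, 1.0)

for name, cdf in [("logit", logistic_cdf), ("probit", normal_cdf), ("cauchy", cauchy_cdf)]:
    print(name, round(log_likelihood(beta, xs, ys, cdf), 4))
```

Because the three dfs have different scales, the same $\beta$ gives different likelihood values under each; comparing fits across links only makes sense after each has been maximized separately.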

[Figure 1: the logit, probit, Cauchy, complementary log-log and log-log links plotted against the logit scale. The horizontal axis carries both a logit scale (ticks at 2, 4, 6 and their negatives) and the corresponding probability scale (0.001 to 0.999).]

Figure 1. Comparison of five link functions. The horizontal axis is on the logistic scale, so the logit link appears as the 45 degree line. Symmetry of a curve about the origin indicates symmetry of the distribution corresponding to the link, as in the logit, probit and Cauchy cases. The log-log forms are asymmetric in this respect. Note that while the probit and logit are quite similar, the Cauchy link is much more long tailed.

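    # Plots of link functions for binary dependent variable models.
    # The S listing is written here with "<-" assignment so it also runs in R.
    # Digits lost in extraction (e.g. pointsize, 1:999, 1000, 1.3) are restored by inference.
    # The S device option font=7 is dropped, since R's postscript() may not accept it.
    postscript("fig.ps", horizontal = FALSE, width = 6.5, height = 6.5, pointsize = 12)
    eps <- .2
    u <- (1:999)/1000
    # logit against logit: the 45 degree line
    plot(log(u/(1-u)), log(u/(1-u)), type = "l", axes = FALSE, xlab = "", ylab = "")
    # probability-scale ticks, drawn above the horizontal axis
    tics <- c(.001, .01, .1)
    tics <- c(tics, 1 - tics)
    ytics <- 0 * tics
    segments(log(tics/(1-tics)), ytics, log(tics/(1-tics)), ytics + eps)
    text(log(tics/(1-tics)), ytics + 3*eps, paste(format(round(tics, 3))))
    text(log(tics[2]/(1-tics[2])), 1.3, "probability scale")
    # logit-scale ticks, drawn below the horizontal axis
    tics <- c(2, 4, 6)
    tics <- c(tics, -tics)
    ytics <- 0 * tics
    segments(tics, ytics, tics, ytics - eps)
    text(tics, ytics - 3*eps, paste(format(round(tics))))
    text(tics[2], -1.3, "logit scale")
    # vertical-axis ticks (the original passed ytics here, which put every tick at y = 0)
    segments(-eps, tics, eps, tics)
    text(-3*eps, tics, paste(format(round(tics))))
    abline(h = 0)
    abline(v = 0)
    lines(log(u/(1-u)), qnorm(u), lty = 2)            # probit
    lines(log(u/(1-u)), log(-log(1-u)), lty = 3)      # complementary log-log
    lines(log(u/(1-u)), -log(-log(u)), lty = 4)       # log-log
    lines(log(u/(1-u)), tan(pi*(u - .5)), lty = 5)    # Cauchy
    text(-c(6, 6, 5, 5, 2), -c(1, 3, 4, 6, 4),
         c("log-log", "probit", "logit", "c-loglog", "cauchy"))
    dev.off()   # closes the device; the S original called frame()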
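As a numerical companion to the figure, the following Python sketch tabulates the five inverse dfs $F^{-1}(p)$ at the probabilities marked on the probability scale and checks which links are symmetric, in the sense $g(1-p) = -g(p)$. It is only an illustration: the dictionary `links` and the helper `is_symmetric` are hypothetical names, and the probit quantile comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

# The five links, each written as the inverse df F^{-1}(p).
links = {
    "logit": lambda p: math.log(p / (1.0 - p)),
    "probit": NormalDist().inv_cdf,
    "cauchy": lambda p: math.tan(math.pi * (p - 0.5)),
    "cloglog": lambda p: math.log(-math.log(1.0 - p)),
    "loglog": lambda p: -math.log(-math.log(p)),
}

def is_symmetric(g, grid=(0.01, 0.1, 0.3)):
    """True if g(1 - p) = -g(p) on the grid, i.e. the underlying df is symmetric about 0."""
    return all(abs(g(p) + g(1.0 - p)) < 1e-9 for p in grid)

probs = [0.001, 0.01, 0.1, 0.9, 0.99, 0.999]
for name, g in links.items():
    values = " ".join(f"{g(p):9.3f}" for p in probs)
    print(f"{name:8s} {values}  symmetric={is_symmetric(g)}")
```

The Cauchy row grows far faster than the others in the tails, which is the long-tailedness noted in the caption, and the two log-log links are mirror images of one another: $-\log(-\log p)$ equals minus the complementary log-log link evaluated at $1-p$.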

Interpretation of the coefficients

In regression we are used to the idea that

$$\frac{\partial E(y|x)}{\partial x_i} = \beta_i$$

provided we really have a linear model in $x_i$, but under our symmetry assumption here the situation is slightly more complicated. Now,

$$E(y_i | x_i) = 1 \cdot P(y_i = 1) + 0 \cdot P(y_i = 0) = F(x_i'\beta)$$

so

$$\frac{\partial E(y|x)}{\partial x_j} = f(x'\beta)\,\beta_j.$$

For logit we have

$$F(z) = \frac{e^z}{1 + e^z} \quad \text{so} \quad f(z) = F(z)(1 - F(z)),$$

while for probit we have

$$f(z) = \phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

and for Cauchy

$$f(z) = \frac{1}{\pi(1 + z^2)}.$$

We can compare these, for example, at $z = 0$, where we get

    link      factor from f(0)
    logit     1/4
    probit    1/\sqrt{2\pi}
    Cauchy    1/\pi

and roughly speaking the whole $\hat\beta$-vector should scale by these factors so, e.g.,

$$\tfrac{1}{4}\,\hat\beta_j^{\text{logit}} \approx \tfrac{1}{\sqrt{2\pi}}\,\hat\beta_j^{\text{probit}} \quad \text{so} \quad \hat\beta_j^{\text{logit}} \approx 1.60\,\hat\beta_j^{\text{probit}}.$$

Diagnostic for the Logistic Link Function

Let $g(p) = \text{logit}(p)$ in the usual one observation per cell logit model, and suppose we've fitted the model

$$\text{logit}(p_i) = x_i'\beta$$

but we'd like to know if there is some more general form for the density which works better. Pregibon suggests, following Box-Cox,

$$g(p) = \frac{p^{\alpha - \delta} - 1}{\alpha - \delta} - \frac{(1-p)^{\alpha + \delta} - 1}{\alpha + \delta}.$$

Note that as $\alpha, \delta \to 0$ we get

$$g(p) = \log p - \log(1-p) = \log(p/(1-p)).$$

Here $\delta = 0$ implies symmetry, and $\alpha$ governs fatness of tails. Expanding $g$ in $\alpha, \delta$ we get (with diligence)

$$g(p) = g_0(p) + \alpha\, g_0^{\alpha}(p) + \delta\, g_0^{\delta}(p)$$

$$g_0^{\alpha}(p) = \tfrac{1}{2}\left[\log^2(p) - \log^2(1-p)\right]$$

$$g_0^{\delta}(p) = -\tfrac{1}{2}\left[\log^2(p) + \log^2(1-p)\right]$$

LM tests of significance of $g^{\alpha}, g^{\delta}$ in an expanded model in which we include $g_0^{\alpha}(\hat p)$ and $g_0^{\delta}(\hat p)$, where these variables are constructed from a preliminary logistic regression, can be used to evaluate the reasonableness of the logit specification.

Semiparametric Methods for Binary Choice Models

It is worthwhile to explore what happens when we relax the assumptions of the prior analysis, in particular the assumptions that (a) we know the form of the df $F$, and (b) the $u_i$'s in the latent variable formulation are iid. Recall that in ordinary linear regression we can justify OLS methods with the minimal assumption that $u$ is mean independent of the covariates $x$, i.e., that $E(u|x) = 0$. We will see that this condition is not sufficient to identify the parameters $\beta$ in the latent variable form of the binary choice model.

The following example is taken from Horowitz (1998). Suppose we have the simple logistic model,

$$y_i^* = x_i'\beta + u_i$$

where $u_i$ is iid logistic, i.e., has df

$$F(u) = 1/(1 + e^{-u}).$$

It is clear that multiplying the latent variable equation through by a positive scalar $\sigma$ leaves observable choices unchanged, so the first observation about identification in this model is that we can only identify $\beta$ "up to scale". This is essentially the reason we are entitled to impose the assumption that $u$ has a df with known scale. Now let $\gamma$ be another parameter vector such that $\gamma \neq \lambda\beta$ for any choice of the scalar $\lambda$. It is easy to construct new random variables, say $v$, whose dfs will now depend upon $x$, and for which

$$F_{v|x}(x'\gamma) = 1/(1 + \exp(-x'\beta)) \qquad (*)$$

and

$$E(v|x) = 0.$$

Thus, $\gamma$ and the $v$'s would generate the same observable probabilities as $\beta$ and the $u$'s, and both would have mean independent errors with respect to $x$. The argument is most easily seen by drawing a picture. Suppose we have the original $(\beta, u)$ model with nice logistic densities at each $x$, and a line representing $x'\beta$, and imagine recentering the logistic densities so that they are centered with respect to the $x'\gamma$ line. Now on the left side of the picture imagine stretching the right tail of the density until the mean matches $x'\gamma$; similarly, we can stretch the left tail on the right side of the picture. As long as the stretching doesn't move mass across the $x'\beta$ line, (*) is satisfied.

What this shows is that mean independence is the wrong idea for thinking about binary choice models. What is appropriate? The example illustrates that the right concept is median independence. As long as

$$\text{median}(u|x) = 0$$

we do get identification under two rather mild conditions. A simple way to see how to exploit this is to recall that under the general quantile regression model,

$$Q_y(\tau|x) = x'\beta,$$

equivariance to monotone transformations implies that for the rather drastic transformation $y \mapsto I(y > 0)$ we have

$$Q_{I(y>0)}(\tau|x) = I(x'\beta > 0).$$

But $I(y > 0)$ is just the observable binary variable, so this suggests the following estimation strategy:

$$\min_{\|\beta\| = 1} \sum_i \rho_\tau\left(y_i - I(x_i'\beta > 0)\right)$$

where $y_i$ is the binary variable. This problem is rather tricky computationally but it has a natural interpretation: we want to choose $\beta$ so that as often as possible $I(x_i'\beta > 0)$ predicts correctly. Manski (1975) introduced this idea under the rather unfortunate name "maximum score" estimator, writing it as

$$\max_{\|\beta\| = 1} \sum_i (2y_i - 1)\left(2 I(x_i'\beta \geq 0) - 1\right).$$

In this form we try to maximize the number of matches, rather than minimizing the number of mismatches, but the two problems are equivalent.
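The equivalence between maximizing matches and minimizing mismatches can be checked directly. The Python sketch below is an illustration only, under several assumptions that are not in the notes: the data are simulated with a hypothetical $\beta_0 = (0.6, 0.8)$ and median-zero heteroscedastic errors, there is an intercept and a single covariate, and the maximization is a crude grid search over the unit circle, which is feasible only in very low dimension and is not how the estimator is computed in practice.

```python
import math
import random

def predict(b, x):
    """Indicator I(x'b >= 0) for x = (1, x1)."""
    return 1 if b[0] * x[0] + b[1] * x[1] >= 0 else 0

def score(b, xs, ys):
    """Manski's objective: sum of (2 y_i - 1)(2 I(x_i'b >= 0) - 1)."""
    return sum((2 * y - 1) * (2 * predict(b, x) - 1) for x, y in zip(xs, ys))

def mismatches(b, xs, ys):
    """Number of observations where I(x_i'b >= 0) differs from y_i."""
    return sum(1 for x, y in zip(xs, ys) if predict(b, x) != y)

def max_score(xs, ys, steps=720):
    """Grid search over the unit circle ||b|| = 1 (an assumption made for simplicity)."""
    grid = [(math.cos(2 * math.pi * k / steps), math.sin(2 * math.pi * k / steps))
            for k in range(steps)]
    return max(grid, key=lambda b: score(b, xs, ys))

# Hypothetical simulated data: median(u | x) = 0, but the scale of u depends on x.
rng = random.Random(472)
beta0 = (0.6, 0.8)  # already normalized: 0.36 + 0.64 = 1
xs = [(1.0, rng.uniform(-3.0, 3.0)) for _ in range(500)]
ys = [1 if beta0[0] * x[0] + beta0[1] * x[1]
           + (0.5 + abs(x[1])) * rng.gauss(0.0, 1.0) > 0 else 0
      for x in xs]

b_hat = max_score(xs, ys)
print("estimate:", b_hat)
print("score:", score(b_hat, xs, ys), "mismatches:", mismatches(b_hat, xs, ys))
```

Each observation contributes $+1$ to the score when the prediction matches and $-1$ when it does not, so the score equals $n$ minus twice the number of mismatches for every candidate $b$. The objective is a step function of $b$, which is one source of the computational difficulty mentioned above.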
The large sample theory of the maximum score estimator is rather complicated, but an interesting aspect of the quantile regression formulation is that it enables us, by estimating the model for various values of $\tau$, to explore the problem of "heteroscedasticity" in the binary choice model.

Discrete Choice Models: Some Theory

The theory of discrete choice has a long history in both psychology and economics. McFadden's version of the Thurstone (1927) model may be very concisely expressed as follows:

There are $m$ choices, with

$$y_i^* = \text{utility of the } i\text{th choice}$$

$$y_i = \begin{cases} 1 & \text{if } y_i^* = \max\{y_1^*, \ldots, y_m^*\} \\ 0 & \text{otherwise} \end{cases}$$

Suppose we can express the "utility of the $i$th choice" as

$$y_i^* = v(x_i) + u_i$$

where $x_i$ is a vector of attributes of the $i$th choice, and $\{u_i\}$ are iid draws from some df $F$. Note that in contrast to classical economic models of choice, here utility has a random component. This randomness has an important role to play, because it allows us to develop simple models with "common tastes" in which not everyone makes exactly the same choices.

Theorem. If the $u_i$ are iid with $F(u) = P(u_i < u) = e^{-e^{-u}}$, then

$$P(y_i = 1 | x_i) = \frac{e^{v_i}}{\sum_{j} e^{v_j}}$$

where $v_i \equiv v(x_i)$.

Remark. $F(\cdot)$ is often called the Type 1 extreme value distribution.

Proof. $y_i^* = \max\{\cdot\}$ if and only if $u_i + v_i > u_j + v_j$ for all $j \neq i$, or $u_j < u_i + v_i - v_j$. So, conditioning on $u_i$ and then integrating with respect to the marginal density of $u_i$,

$$P(y_i = 1 | x_i) = \int \prod_{j \neq i} F(u_i + v_i - v_j)\, f(u_i)\, du_i.$$

Note that if $F(u)$ takes the hypothesized form, then

$$f(u) = e^{-e^{-u}} \cdot e^{-u} = e^{-u - e^{-u}}$$

so

$$\prod_{j \neq i} F(u_i + v_i - v_j)\, f(u_i) = \prod_{j \neq i} \exp(-\exp(-u_i - v_i + v_j)) \, \exp(-u_i - \exp(-u_i)) = \exp\left(-u_i - e^{-u_i}\left(1 + \sum_{j \neq i} e^{v_j}/e^{v_i}\right)\right).$$

Let

$$\lambda_i = \log\left(1 + \sum_{j \neq i} e^{v_j}/e^{v_i}\right) = \log\left(\sum_{j=1}^{m} e^{v_j}/e^{v_i}\right)$$

so, with the change of variable $\tilde u_i = u_i - \lambda_i$,

$$P(y_i = 1 | x_i) = \int \exp\left(-u_i - e^{-(u_i - \lambda_i)}\right) du_i = e^{-\lambda_i} \int \exp\left(-\tilde u_i - e^{-\tilde u_i}\right) d\tilde u_i = e^{-\lambda_i} = \frac{e^{v_i}}{\sum_{j=1}^{m} e^{v_j}}.$$

Extensions: let

$$y_{ij}^* = \text{utility of the } i\text{th person for the } j\text{th choice}$$

$$y_{ij}^* = x_{ij}\beta + z_i\gamma + u_{ij}$$

Then, assuming $u_{ij}$ iid $F$ yields

$$p_{ij} = P(y_{ij} = 1 | x_{ij}, z_i) = \frac{e^{x_{ij}\beta + z_i\gamma}}{\sum_{k} e^{x_{ik}\beta + z_i\gamma}}.$$

E.g., here $x_{ij}$ is a vector of choice specific individual characteristics like travel time to work by the $j$th mode of transport, and $z_i$ is a vector of individual characteristics, like income, age, etc. (As written, with a common $\gamma$, the factor $e^{z_i\gamma}$ cancels from numerator and denominator; for individual characteristics to affect the choice probabilities their coefficients must vary with the alternative.)

Critique of IIA: Independence of Irrelevant Alternatives

Luce derived a version of the above model from the assumption that the odds of choosing alternatives $i$ and $j$ shouldn't depend on the characteristics of a third alternative $k$. Clearly here

$$\frac{P_i}{P_j} = \frac{P(y_i = 1)}{P(y_j = 1)} = \frac{e^{v_i}}{e^{v_j}}$$

which is independent of $v_k$. One should resist the temptation to relate this to similarly named concepts in the theory of voting. For some purposes this is a desirable feature of a choice model, but in other circumstances it might be considered a "bug." Debreu, in a famous critique of the IIA property, suggested that it might be unreasonable to think that the choice between car and bus transportation would be invariant to the introduction of a new form of bus which differed from the original one only in terms of color. In this red-bus-blue-bus example we would expect that the draws of $u_i$'s for the two bus modes would be highly correlated, not independent. Recently, there has been considerable interest in multinomial probit models of this type, in which correlation can be easily incorporated.

References

Pregibon, D. (1980). Goodness of link tests for generalized linear models. Applied Statistics, 29, 15-24.
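Both the theorem and the IIA property can be checked numerically. The Python sketch below is a hedged illustration, not part of the original notes: `choice_probs`, `gumbel_draw` and `simulate_shares` are hypothetical helper names, and the utility values are made up. It computes the closed-form probabilities $e^{v_i}/\sum_j e^{v_j}$, compares them with simulated choice shares when the $u_i$ are drawn from $F(u) = e^{-e^{-u}}$, and then shows the red-bus-blue-bus effect.

```python
import math
import random

def choice_probs(v):
    """Closed-form probabilities exp(v_i) / sum_j exp(v_j)."""
    top = max(v)  # subtracting the max leaves the ratios unchanged and avoids overflow
    e = [math.exp(vi - top) for vi in v]
    total = sum(e)
    return [ei / total for ei in e]

def gumbel_draw(rng):
    """Draw from F(u) = exp(-exp(-u)) by inverting the df."""
    p = rng.random()
    while p == 0.0:  # guard against log(0)
        p = rng.random()
    return -math.log(-math.log(p))

def simulate_shares(v, draws, seed=0):
    """Share of draws in which each alternative has the largest utility v_i + u_i."""
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(draws):
        utils = [vi + gumbel_draw(rng) for vi in v]
        counts[utils.index(max(utils))] += 1
    return [c / draws for c in counts]

# Hypothetical systematic utilities for three alternatives.
v = [1.0, 0.2, -0.5]
exact = choice_probs(v)
simulated = simulate_shares(v, draws=20000)
print("closed form:", [round(p, 3) for p in exact])
print("simulated:  ", [round(p, 3) for p in simulated])

# IIA: the odds of alternative 0 against 1 do not change when a third is added.
p2 = choice_probs(v[:2])
print("odds with 2 alternatives:", p2[0] / p2[1])
print("odds with 3 alternatives:", exact[0] / exact[1])

# Red bus / blue bus: car and red bus equally attractive, then an identical blue bus is added.
print("P(car), car vs red bus:     ", choice_probs([0.0, 0.0])[0])
print("P(car), plus a blue bus:    ", choice_probs([0.0, 0.0, 0.0])[0])
```

In the last two lines the model lowers the probability of choosing the car from one half to one third when the blue bus appears, whereas intuition suggests the two buses should simply split the bus riders. That is the behaviour Debreu's critique targets.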