Nonseparable Unobserved Heterogeneity and Partial Identification in IV models for Count Outcomes

Dongwoo Kim
Department of Economics, University College London
[Latest update: March 29, 2017]

Abstract

This paper studies count data instrumental variable (IV) models in which explanatory variables are endogenous and unobserved heterogeneity is nonseparable. Prevailing models in the literature are shown to suffer from undesirable specification problems. I propose a single equation count data model in which neither parametric restrictions nor strong separability is required. This model explicitly accommodates the discreteness of count data by modifying an ordered choice model. Structural features of interest are set identified, and the characterisation of the identified set is provided by the generalised IV model framework introduced in Chesher and Rosen (2016). Identified sets can be rather small since count data often have a rich support. Numerical examples and an application to the effect of supplemental insurance on doctor visits are provided. State-of-the-art inference methods are employed to find confidence regions for set estimates. The empirical application shows that the set estimation framework delivers useful information about structural features. It also examines misspecification.

Keywords: Count data; Poisson regression; negative binomial regression; endogeneity; instrumental variables; single equation models; partial identification; set identification; intersection bounds; incomplete models; confidence regions

JEL Classification Numbers: C25, C26, I12

Address: Room G01, Department of Economics, University College London, 30 Gordon Street, London, UK, dongwoo.kim.13@ucl.ac.uk. The author is deeply grateful to Andrew Chesher and Toru Kitagawa for their supervision. I also thank Ivan Canay, Matias Cattaneo, Sokbae Lee, Jeff Rowley, Alexander Torgovitsky, Daniel Wilhelm, and seminar participants at UCL and LSE for helpful discussions.

1 Introduction

This paper introduces a new approach to count data¹ instrumental variable (IV) models in which explanatory variables are potentially endogenous and unobserved heterogeneity is nonseparable. The proposed approach is widely applicable in applied studies, as many outcomes of interest are count-measured. For instance, in health economics, the numbers of doctor visits and other types of health care utilisation, occupational injuries, and illnesses are all count outcomes. Other examples are widely found in labour economics, empirical IO, and even finance: absenteeism in the workplace, recreational or shopping trips, entries to and exits from industries, mortgage prepayments and loan defaults, bank failures, patent registrations in connection with industrial R&D, and the frequency of airline accidents (see Cameron and Trivedi (2013) - CT2013 henceforth - for more detailed examples). Endogeneity may arise in count data models. In the context of doctor visits, some observable characteristics can be correlated with unobserved heterogeneity. Suppose that individuals in a survey self-report their current health status but do not report whether they have private health insurance. Explanatory variables such as income would then be endogenous, as this unobserved factor is probably correlated with both health conditions and income. In this case, the OLS estimator fails to deliver correct information about the causal effects of interest. IV models are the usual ploy to cope with this problem. I propose a single equation count IV model by modifying an ordered choice model² suggested in Chesher and Smolinski (2012). Structural features of interest are set identified, and a parsimonious characterisation of the sharp identified set is provided by the generalised IV model framework introduced in Chesher and Rosen (2016).
The use of this model is beneficial because popular count data models with endogeneity in the current literature tend to be misspecified, as the discreteness of the count outcome is often ignored. Using simulated data, I demonstrate that popular approaches such as the control function approach and moment based models deliver misleading information about the causal effects of interest. The proposed model respects the discreteness of count data and hence is more robust to misspecification. Partially identifying models often provide uninformative (possibly very large) identified sets, so their usefulness in practice is a concern in empirical studies (Ho and Rosen (2015), Section 7.2). In the context of count data models, this concern can be overcome since the count outcome

1 Count data are a type of discrete data in which observations only take non-negative integer values.
2 Ordered choice models are sometimes used for count data, as count outcomes are also ordered. CT2013 suggests that parametric models such as logit and probit are particularly suitable when the support of the outcome is very limited, such as binary or {0, 1, 2}, or if the outcome is generated from threshold crossing of a latent continuous variable.

may have a rich support, depending upon the duration over which the count is aggregated. A richer support of the outcome in general leads to a smaller identified set. I show that identified sets of structural features are very close to points in some numerical examples where the IV is strong or the support of the outcome is rich. A simple algorithm is introduced to numerically implement the characterisation of the identified set. As a simple grid search with a selection of conditional moment inequalities provides a good approximation of the identified set, the problem at hand becomes substantially more tractable and computationally feasible. Estimation in a finite sample is also studied. An empirical example is provided on a data set used in CT2013, and the set estimates of structural parameters are compared to point estimation results. Recent developments in the partial identification literature provide inference methods for identified sets. Chernozhukov et al. (2013) develop a novel inference method for identified sets characterised by intersection bounds. Kaido et al. (2016) introduce a bootstrap based inference method for projections of high dimensional identified sets. These state-of-the-art methods are employed to find confidence regions for set estimates. Therefore, this paper documents a unified framework for partially identifying count data IV models, from identification to inference. In the current literature, two branches of IV estimation for count outcomes are particularly prevalent. The first is a full information (FI) approach in which data generating processes (DGPs) are specified for all endogenous variables. The control function method is a representative example. Terza et al. (2008) implement this approach in the context of count models, namely two stage residual inclusion (2SRI) estimation. The control function approach is widely used in applied studies but is known to have several problems.
For example, the recursive structure rules out full simultaneity. Moreover, endogenous variables are required to be continuously distributed. Otherwise, structural features of interest are generally set identified, as shown in Chesher (2005). On the contrary, a limited information (LI) method does not specify the DGPs of some or all endogenous variables. LI models are more robust to misspecification as fewer restrictions are imposed. This robustness is, however, often obtained at the cost of identification power or efficiency of an estimator. Moment based LI approaches are suggested in Windmeijer and Santos Silva (1997) (WS1997 henceforth) and Mullahy (1997). These single equation models are argued to be point identifying under strong separability. However, they ignore the discreteness of count outcomes. Therefore, even though the parameters in their moment conditions are point identified, the models explain nothing about the DGPs of the outcome variables. Furthermore, count outcomes are discrete but a continuous structural function is imposed in their specifications. Hence the separable errors absorb the discreteness of the

outcomes. Consequently, the conditional support of the separable error depends on the given values of the explanatory variables. It can be shown that no instrument satisfies the strong independence condition if endogenous variables are discrete. Therefore, a more flexible form of the error term is required to avoid this specification problem. It may give rise to partial identification, as Chesher (2010) points out that IV models for discrete outcomes are generally not point identifying but set identifying unless strong restrictions are invoked. The importance of model specification cannot be emphasised enough in applied economic studies. Applied researchers often impose simplifying assumptions which are not based on economic theory in order to make identification and estimation more tractable. In many cases, these become the primary source of misspecification. Misspecified models deliver misleading information on the causal relationships of interest. Therefore, econometricians have tried to minimise redundant and unjustifiable restrictions. Partial identification is a brainchild of this philosophy. It imposes a minimal set of restrictions to read useful information from data and hence is less vulnerable to attacks on econometric assumptions. LI often induces partial identification, but even with FI, point identification is not always guaranteed. This paper is structured as follows. Section 2 points out potential flaws of prevailing approaches in the literature. Section 3 introduces count data IV models with a nonseparable error and the characterisation of identified sets. Section 4 demonstrates identified sets in numerical examples employing parametric restrictions. Section 5 shows estimation and inference results for an empirical example. Section 6 concludes. All proofs are provided in Appendix I.

2 Prevailing approaches and potential problems

Standard linear regression models are incoherent for count outcomes as fitted values can be negative.
The most widely used count data method is Poisson regression. Let Y be a scalar count outcome and X a vector of explanatory variables. The conditional mean function of Y given X is E(Y | X) = exp(X'β). The vector of structural parameters β governs both the conditional mean and the variance of Y given X. Equidispersion is the main feature of this model. It is too simple and highly restrictive, because count data are in general over- or underdispersed. The negative binomial (NB) model, in which a shape parameter controls the degree of dispersion of Y, is a popular ploy in such a case. Now suppose that there is an unobserved characteristic U which would be included in the outcome equation if it were observable. The endogeneity problem arises when U is correlated with X, in which case the parameters are not identified. Instrumental variable models address this problem. The control function approach and moment based methods are the most popular in the literature.
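The equidispersion restriction mentioned above is easy to see in simulation. The following sketch, with hypothetical parameter values not taken from the paper, contrasts a Poisson outcome with a gamma-Poisson (negative binomial) mixture:

```python
import numpy as np

# A minimal simulation sketch (hypothetical parameters): under the Poisson
# model the conditional variance equals the conditional mean (equidispersion),
# while a gamma-Poisson mixture (the negative binomial) is overdispersed.
rng = np.random.default_rng(0)
n = 200_000
lam = 3.0  # conditional mean exp(x'beta) at some fixed x

y_pois = rng.poisson(lam, size=n)

# NB as a gamma-Poisson mixture: lambda_i = lam * V_i with E[V] = 1, so the
# mean is preserved but the variance inflates to lam + lam**2 / shape.
shape = 2.0
v = rng.gamma(shape, 1.0 / shape, size=n)
y_nb = rng.poisson(lam * v)

print(y_pois.mean(), y_pois.var())  # both close to 3.0
print(y_nb.mean(), y_nb.var())      # mean close to 3.0, variance close to 7.5
```

The mixture keeps the conditional mean intact while inflating the variance, which is exactly why the NB shape parameter can absorb overdispersion that the Poisson model cannot.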

2.1 Control function approach

Terza et al. (2008) introduce the control function approach in the context of count data models. The model is specified as

Y ~ Poisson[λ(X) = exp(X'β + U)]  (1)
X = g(Z'δ) + V  (2)
U = αV + e  (3)

Z is assumed to be independent of e and V, e and V are mutually independent, and E[exp(e)] is normalised to 1. Then

E[λ(X) | X, Z, V] = E[exp(e) | X, Z, V] exp(X'β + αV) = exp(X'β + αV)

and V is identified by the second equation. Therefore, in the first stage, the regression of X on Z yields V̂. Secondly, the Poisson regression of Y on X and V̂ gives the IV estimate of β. This method is widely used as it is very tractable, but it is somewhat restrictive in the sense that the recursive structure rules out full simultaneity³ (Koenker (2005), Section 8.8.3). Moreover, there are additional sources of misspecification due to the auxiliary first stage equation, for which simple linear models are usually employed in practice. If the true function g is nonlinear, then all the estimation results are invalid. Furthermore, the endogenous variable X is generally required to be continuously distributed. Otherwise, the error term in the first stage is not point identified. For instance, if X is an ordered choice, standard parametric models do not provide a single value of V given Z and X. The instrument Z is also required to be continuous unless the first stage is linear. Chesher (2005) suggests that set identification is possible when X is discrete and the error term is nonseparable, but his method is not applicable if X is a single binary variable.

2.2 Moment based approaches with strong separability

Moment based approaches do not rely on the recursive structure. Suppose that unobserved heterogeneity U is additively separable. Then the model is specified as follows.

Y = exp(X'β) + U, E[U | Z] = 0  (4)

3 In simultaneous equation models, endogenous variables might affect each other.
Therefore, variation in Y could in principle induce changes in X. The recursive system in the control function approach rules out this relationship, as Y is restricted to have no effect on X.
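The two stages of 2SRI can be sketched on simulated data. Everything in the snippet (normal errors, a linear first stage, the coefficient values) is an illustrative choice of mine rather than the paper's specification; the second stage is a Poisson regression fitted by maximum likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# A sketch of 2SRI on simulated data (hypothetical DGP, not from the paper).
rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
e = rng.normal(size=n)
v = rng.normal(size=n)
x = 0.8 * z + v          # first stage: X = g(Z) + V with linear g
u = 0.5 * v + e          # U = alpha*V + e, so X is endogenous
y = rng.poisson(np.exp(0.3 + 0.7 * x + u))   # true slope beta = 0.7

def nll(b, yv, D):
    # Poisson negative log-likelihood (up to a constant in y)
    eta = D @ b
    return np.sum(np.exp(eta) - yv * eta)

def grad(b, yv, D):
    return D.T @ (np.exp(D @ b) - yv)

# Stage 1: OLS of X on Z, keep the residuals V-hat.
Z1 = np.column_stack([np.ones(n), z])
vhat = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]

# Stage 2: Poisson regression of Y on (1, X, V-hat).
D = np.column_stack([np.ones(n), x, vhat])
fit = minimize(nll, np.zeros(3), args=(y, D), jac=grad, method="BFGS")
print(fit.x)  # slope fit.x[1] close to 0.7; fit.x[2] close to alpha = 0.5
```

With this recursive DGP the slope estimate is consistent, while the intercept converges to the true intercept plus log E[exp(e)]; nonlinearity in g or a discrete X would break both stages, as discussed above.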

Then, under the existence of a relevant instrument, WS1997 show that β is point identified by the moment condition E[Z(Y − exp(X'β))] = 0. The generalised method of moments (GMM) estimator with an appropriate weight matrix consistently estimates β. However, Mullahy (1997) points out that this specification treats X and U asymmetrically without a particular reason. Suppose now that unobserved heterogeneity W is an omitted characteristic and U is a regression error such that E[U | X, W, Z] = 0. Then the structural equation is written as

Y = exp(X'β + W'δ) + U = exp(X'β)V + U  (5)

where V = exp(W'δ). V is multiplicatively separable, and X and V are treated symmetrically. Normalise E[V | Z] = 1. The following moment condition point identifies β, as shown in Mullahy (1997).

E[Y / exp(X'β) − 1 | Z] = 0  ⟹  E[Z(Y / exp(X'β) − 1)] = 0  (6)

The two specifications (4) and (5) are observationally equivalent (see Wooldridge (1992)). The moment based approaches involve a fundamental problem when unobserved heterogeneity is given an economic interpretation. In econometric models with endogeneity, unobserved heterogeneity generally has a clear economic meaning. In the returns-to-schooling example, years of education (X) are supposed to be correlated with unobservable ability (U), which affects X as well as income (Y) for an individual. Therefore, an instrument Z is necessary in order to separately identify the causal effect of education on earnings from that of unobserved ability, and a persuasive explanation of the relationship between Z and U should be presented, as it is untestable. Now suppose that a model specification per se highly restricts the distribution of U, so that endowing U with an economic interpretation is hard. If one cannot devise an economic example of such unobserved heterogeneity, then it is also impossible to argue that good instruments Z exist. For example, suppose that one writes a linear probability model (LPM) when Y and X are binary.
Y = α + βX + U, E[U | Z] = 0  (7)

Then the conditional support of U given X is binary, as U only takes either 1 − α − βX or −α − βX, and hence X and U are not independent. This arises from the attempt to fit a discrete outcome with a continuous (linear) function: U absorbs the discreteness of Y. However, it is seldom justified to impose such discreteness on U. Can unobserved heterogeneity whose discrete conditional support varies with X be found in any economic

example? How can one endow it with an economic meaning? These questions are very hard to answer, even though this model specification is not uncommon. The more fundamental problem is that there exists no instrument which is independent of U but correlated with X. Suppose that Z ⊥ U and X is binary. The conditional support and the probability mass of U given Z are shown in Table 1. As P[U | Z] = P[U] by the independence assumption, P[Y = y, X = x | Z] = P[Y = y, X = x] for any x, y ∈ {0, 1}, and hence X and Z are also independent. Therefore, the rank condition is never satisfied. This result extends to the more general case where X is continuous.

Y   X   U            P[U | Z]
0   0   −α           P[Y = 0, X = 0 | Z]
1   0   1 − α        P[Y = 1, X = 0 | Z]
0   1   −α − β       P[Y = 0, X = 1 | Z]
1   1   1 − α − β    P[Y = 1, X = 1 | Z]

Table 1: Conditional support of U given Z

Proposition 1 Under the LPM (7), suppose that Y is binary and X is continuous. Assume that all structures admitted by the model satisfy 0 < α + βx < 1 for all x ∈ R_X. If there exists an instrument Z which is independent of U, then X and Z are independent of each other.

Remark 1 The crucial condition for Proposition 1 is 0 < α + βx < 1. This is sensible, as α + βx is the conditional probability of Y = 1 given X. If it is violated, then variation of X with Z cannot be ruled out, because a value of u matches two pairs of (Y, X) in the area where the conditional supports overlap. This is paradoxical, as more extreme behaviour of the conditional probability is the key to resolving the problem. Conditional mean independence of U given Z is required for identification of α and β. It is slightly weaker than the strong independence condition in Proposition 1 above. However, in many applied economic studies, it is rarely justifiable to argue that Z satisfies conditional mean independence but is not independent of U. As neither is testable, most applied researchers argue that their instruments are completely exogenous to unobserved heterogeneity.
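The mechanics behind Table 1 can be made concrete in a few lines of code; α and β below are hypothetical values satisfying the support restriction:

```python
# Each support point of U maps to exactly one (Y, X) cell, so Z independent
# of U forces P[Y, X | Z] = P[Y, X], and hence X independent of Z.
alpha, beta = 0.3, 0.25  # hypothetical, with 0 < alpha + beta*x < 1
support = {
    (0, 0): -alpha,            # U = Y - alpha - beta*X
    (1, 0): 1 - alpha,
    (0, 1): -alpha - beta,
    (1, 1): 1 - alpha - beta,
}
# With beta != 0 and 0 < alpha + beta*x < 1, the four u-values are distinct,
# so the map u -> (y, x) is one-to-one.
assert len(set(support.values())) == 4
print("u -> (y, x) is one-to-one: Z independent of U implies Z independent of X")
```

Because the map from u to the (Y, X) cell is invertible, fixing the distribution of U given Z fixes the joint distribution of (Y, X) given Z, which is the rank-condition failure stated in the text.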
(For example, see Angrist and Krueger (1991).) However, this argument is fundamentally impossible under the specification of the LPM. A partial identification approach to this problem is proposed in Chesher and Rosen (2013) for binary outcomes. Models (4) and (5) have a similar problem. Suppose the model is Y = exp(α + βX) + U. There is no support restriction on α + βX. Define the support of Y as M ≡ {0, 1, 2, …}. M

is possibly unbounded. If X is continuous and unbounded, then for any u = m − exp(α + βx) with m ∈ M and x ∈ R_X, there exist m′ ∈ M and x′ ∈ R_X such that m − exp(α + βx) = m′ − exp(α + βx′). Then the probability distribution of X can vary with Z even if Z ⊥ U. This can be shown easily. Suppose that β > 0. For any given u ∈ R_U, a level set of pairs (m, x_m) is defined as follows.

C(u) ≡ {(m, x_m) : u = m − exp(α + βx_m), x_m ∈ R_X, m ∈ M}  (8)

As the exponential term is always positive, if m ≤ u, then u > m − exp(α + βx) regardless of the value of X. Thus the cumulative distribution function of U, F_U(·), satisfies

F_U(u) = P[Y ≤ u] + Σ_{m>u} P[Y = m, X ≥ x_m].

The strict independence condition requires that F_{U|Z}(u | z) = F_U(u) for all z ∈ R_Z, and hence

F_{U|Z}(u | z) = P[Y ≤ u | z] + Σ_{m>u} P[Y = m, X ≥ x_m | z] = F_U(u), ∀z ∈ R_Z.  (9)

Neither P[Y ≤ u | z] = P[Y ≤ u] nor P[Y = m, X ≥ x_m | z] = P[Y = m, X ≥ x_m] is required for (9) to hold. Therefore, the possibility of variation of X with Z cannot be ruled out. If X is discrete and bounded, however, the same problem occurs. The following proposition shows that the existence of a good IV is rarely assured.

Proposition 2 Suppose that Y is a count and X is discrete and finite, i.e. R_X ≡ {x_1, x_2, …, x_n}. Under the model Y = exp(α + βX) + U, only a particular set of pairs (α, β), whose Lebesgue measure is zero, allows for an instrument Z that is independent of U but correlated with X.

The true parameter values are never known, so it is never assured that a proper instrument exists. Even if the true parameters indeed lie in the particular set of Proposition 2, only limited variation between certain values of X is allowed. The result in Proposition 2 extends to the model (5) with the multiplicative error; the additive error U is omitted there as it is redundant.

Proposition 3 Suppose that Y is a count and X is discrete and finite.
Under the model Y = exp(α + βX)V, only a particular set of pairs (α, β), whose Lebesgue measure is zero, allows for an instrument Z that is independent of V but correlated with X.
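To see the measure-zero statement in Propositions 2 and 3 concretely, one can enumerate the conditional supports of U numerically. The intercept, the slopes, and the knife-edge value log 2 below are illustrative:

```python
import math

# Support points of U given x in the model Y = exp(a + b*x) + U are
# u = m - exp(a + b*x), m = 0, 1, 2, ...  Supports at x = 0 and x = 1 overlap
# only if exp(a + b) - exp(a) is an integer: a measure-zero set of (a, b).
def u_support(a, b, x, m_max=50):
    return {round(m - math.exp(a + b * x), 10) for m in range(m_max)}

a = 0.0
generic = u_support(a, 0.6, 0) & u_support(a, 0.6, 1)            # generic slope
knife = u_support(a, math.log(2.0), 0) & u_support(a, math.log(2.0), 1)
print(len(generic), len(knife))  # -> 0 49
```

For a generic slope the supports at x = 0 and x = 1 are disjoint, so any instrument that moves X must also move the distribution of U; overlap, which Z ⊥ U with Z correlated with X requires, occurs only at knife-edge parameters such as b = log 2 here.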

The moment based approaches ignore the discreteness of count outcomes. Even though the parameters in the models are point identified, they do not necessarily say anything about the underlying DGP. Furthermore, the specifications cannot accommodate more complex interactions between X and U. For instance, polynomial regression models involving interaction terms between explanatory variables are often employed in applied studies. Suppose that the true model is Y = exp(α + βX + γW + δXW) + U and W is unobservable. Then the multiplicative unobserved heterogeneity V in the model (5) is exp(γW + δXW), and hence there exists no instrument satisfying the moment condition E[exp(γW + δXW) | Z] = 1 and the rank condition simultaneously⁴. Therefore, the models (4) and (5) cannot handle more complex interactions between X and U. Needless to say, the moment based approach does not work when unobserved heterogeneity is nonseparable.

3 Count data IV Models with nonseparable error

A nonparametric count data IV model is built upon an ordered outcome model suggested by Chesher (2010). Define M as the set of all non-negative integers, M ≡ {0, 1, 2, …}, with typical element m ∈ M. M is possibly unbounded. Y is a random count outcome, X is a vector of potentially endogenous explanatory variables, and U is a random scalar. The model is

Y = h(X, U) = 0 if p_0(X) ≤ U ≤ p_1(X)
            = 1 if p_1(X) < U ≤ p_2(X)
            ⋮
            = m if p_m(X) < U ≤ p_{m+1}(X)
            ⋮                                  (10)

where p_0(X) = 0, 0 ≤ p_m(X) ≤ 1, and p_m(X) ≤ p_{m+1}(X) for all X and m. U is normalised to U ~ Unif(0, 1) without loss of generality. The threshold functions {p_m(X)}_{m=1}^∞ are the objects of interest. Suppose that X is discrete and independent of U. Then it is reasonable to define the conditional distribution function of Y given X as p_{m+1}(X) = P[Y ≤ m | X] = F_{Y|X}(m | X).
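Under exogeneity this identification argument is easy to verify by simulation. The Poisson threshold structure below is an illustrative parametric choice of mine, not part of the model:

```python
import numpy as np
from scipy.stats import poisson

# Model (10) with an exogenous binary X: inverting the Poisson cdf at a
# Unif(0,1) draw is exactly the threshold-crossing structure h(x, u).
rng = np.random.default_rng(3)
n = 100_000
x = rng.integers(0, 2, size=n)   # exogenous binary regressor
u = rng.uniform(size=n)          # U independent of X
y = poisson.ppf(u, np.exp(0.2 + 0.8 * x)).astype(int)

# The empirical conditional cdf recovers p_{m+1}(x) = F_{Y|X}(m | x).
for xv in (0, 1):
    for m in range(3):
        est = np.mean(y[x == xv] <= m)
        true = poisson.cdf(m, np.exp(0.2 + 0.8 * xv))
        print(xv, m, round(est, 3), round(true, 3))
```

With X independent of U, the sample analogue of the conditional cdf consistently estimates every threshold; the next paragraphs show how this breaks down once X and U are dependent.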
Since U | X ~ Unif(0, 1), the thresholds {{p_m(x)}_{m=1}^∞}_{x ∈ R_X} are all point⁴

4 In a standard linear regression model with no endogeneity, Y = α + βX + γW + δXW + U, if X is independent of U and W, the moment condition E[X(γW + δXW + U)] = 0 can be satisfied. Therefore, the parameters of interest are identified. Models (4) and (5), by contrast, cannot identify the parameters even if X is independent of W, as X and the unobserved variable are nonseparable.

identified by the cumulative distribution function (cdf) of Y conditional on X. Therefore, the full conditional distribution of Y given X is nonparametrically identified, and it provides useful insight about the causal relationship between X and Y, as the distributional shift of Y with respect to X is captured. Structural features of interest such as average treatment effects are also identified. In a finite sample, the thresholds are consistently estimated by the sample analogue estimator. If X and U are not independent, the thresholds are not identified by F_{Y|X}. Suppose that X is binary, i.e. R_X ≡ {0, 1}, and that F_{U|X}(τ | X = 1) first order stochastically dominates F_{U|X}(τ | X = 0):

F_{U|X}(τ | X = 1) ≤ F_U(τ) ≤ F_{U|X}(τ | X = 0), ∀τ ∈ [0, 1]

Then F_{Y|X}(m | X = 1) ≤ p_{m+1}(1) and p_{m+1}(0) ≤ F_{Y|X}(m | X = 0). Therefore, without additional information, {p_m(0), p_m(1)}_{m=1}^∞ are not identified. What one can identify are lower bounds for {p_m(1)}_{m=1}^∞ and upper bounds for {p_m(0)}_{m=1}^∞, which might not be very informative. Without the first order stochastic dominance assumption, one may be able to identify no-assumption bounds as in Manski and Pepper (2000). The main question of this paper is how to identify the threshold functions under the existence of an instrument Z. Strong separability has been the key source of point identification in count data models. As Chesher (2010) shows, point identification is generally not achievable in single equation nonseparable IV models for discrete outcomes, even with parametric restrictions. However, under the existence of a relevant instrument, one is able to identify bounds that are more informative than the no-assumption bounds.

3.1 Generalised Instrumental Variable Model

The characterisation of identified sets in count data IV models is provided under the generalised instrumental variable (GIV) model restrictions in Chesher and Rosen (2016). Let G_{U|Z} denote the collection of conditional distributions of U given Z.
G_{U|Z} ≡ {G_{U|Z}(· | z) : z ∈ R_Z}

Under the GIV Restrictions 1-6 in their paper, the identified set for the structural function h and G_{U|Z} is characterised. The following Restrictions 1-3 satisfy the GIV restrictions, so they facilitate the use of the same characterisation of the identified set in the model (10).

Restriction 1 Y and U are random scalars and X and Z are random vectors defined on a probability space (Ω, L, P), endowed with the Borel sets on Ω.

Restriction 2 The support of Y is a subset of the non-negative integers, M ≡ {0, 1, 2, …}, and the support of (X, Z) is a subset of Euclidean space. A collection of conditional distributions F_{YX|Z} ≡ {F_{YX|Z}(· | z) : z ∈ R_Z} is identified by the sampling process, where F_{YX|Z}(T | z) ≡ P[(Y, X) ∈ T | z] for all T ⊆ {(y, x) : y ∈ R_Y, x ∈ R_X}.

Restriction 3 U is uniformly distributed on the unit interval [0, 1] and G_{U|Z}(· | z) = G_U(·) for all z ∈ R_Z, where G_U denotes the marginal distribution function of U.

As G_{U|Z} is a singleton by Restriction 3, the object of identification is only the structural function h, which is fully characterised by the threshold functions {{p_m(x)}_{m ∈ M}}_{x ∈ R_X}. To use the theory of random sets, define two level sets:

Y(U; h) ≡ {(y, x) ∈ R_{YX} : h(x, U) = y},  U(Y, X; h) ≡ {u ∈ R_U : h(X, u) = Y}

Then under the model (10), these two level sets become

Y(u; h) = {(m, x) ∈ R_{YX} : p_m(x) < u ≤ p_{m+1}(x)},  U(m, x; h) = [p_m(x), p_{m+1}(x)]

To be more precise, U(m, x; h) should be left-open, but I stick to the closed set definition henceforth. This permits the use of random set theory, which characterises distributions of random closed sets, and it causes no problem, because the difference between the level set U(m, x; h) and its closure always has probability zero, as U is continuously distributed⁵. Let S be a closed subset of [0, 1]. The containment functional of U(Y, X; h) is

C_h(S | z) ≡ P[U(Y, X; h) ⊆ S | z].

A set function G_U(S) ≡ P[U ∈ S] is also defined. Let H denote the identified set of the structural function h, and let F(A) be the collection of all closed subsets of a set A. Then Corollary 1 provides the sharp characterisation of the identified set.

Corollary 1 (Chesher and Rosen (2016)) Under Restrictions 1-3, the sharp identified set of the structural function h in the model (10) is

H ≡ {h : ∀S ∈ F([0, 1]), C_h(S | z) ≤ G_U(S), a.e. z ∈ R_Z}.

3.2 Core determining test sets

The number of closed subsets of [0, 1] is infinite.
Computation of the sharp identified set is thus often infeasible in practice. To find a practically implementable characterisation of

5 This is because the left end point of the level set U has Lebesgue measure zero. See Chesher and Rosen (2016), p. 9, for a detailed discussion.

the identified set, a notion of core determining classes is employed, as Galichon and Henry (2011) suggest. In the context of the model (10), a collection of core determining test sets (CDTS) is defined as follows.

Definition 1 (Core determining test sets) Let Q_h denote a subcollection of F([0, 1]) such that

C_h(S | z) ≤ G_U(S), ∀S ∈ Q_h  (11)

for almost every z ∈ R_Z. Q_h is a collection of core determining test sets if the same inequality then also holds for every closed S ⊆ [0, 1].

Therefore, finding the smallest possible subcollection of F([0, 1]) which satisfies the definition of CDTS is essential to reduce the computational burden of identifying H. Let U_h denote the support of the random level set U(Y, X; h):

U_h ≡ {[0, p_1(x)], [p_1(x), p_2(x)], …, [p_m(x), p_{m+1}(x)], … : x ∈ R_X}

Theorem 3 in Chesher and Rosen (2016) (TH3 henceforth) suggests the collection of all connected unions⁶ of elements of U_h as the collection of CDTS. But under the model (10), the number of elements of U_h is possibly infinite, as M may be unbounded. Thus the number of core determining test sets is also infinite. Let Q̄_h be the collection of all connected unions of elements of U_h. To make identification feasible in practice, a finite subcollection of Q̄_h should be selected. The objective is to construct a finite collection which has as few elements as possible without loss of information. Further refinement of Q̄_h is achievable under a certain condition.

Condition 1 For every interval [p_m(x), p_{m+1}(x)] ∈ U_h, there exists k ∈ M such that p_k(x′) ∈ [p_m(x), p_{m+1}(x)] for all x′ ≠ x.

By exploiting Condition 1, I propose a refinement of Q̄_h and show in the following theorem that the suggested collection loses no information relative to Q̄_h.

Theorem 1 Suppose that X is discrete. Under the model (10) and Condition 1, the collection Q*_h with

Q*_h ≡ {[0, p_m(x)], [p_m(x), 1] : m ∈ M\{0}, x ∈ R_X}  (12)

is a collection of core determining test sets.

6 All disconnected unions and [0, 1] are excluded by TH3.
The inequality (11) is trivially satisfied by [0, 1], so there is no need to check this interval.
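A sketch of how the Q*_h criterion can be checked on data: for candidate thresholds P and each test set S = [0, p_m(x)] or [p_m(x), 1], compare the empirical containment functional with Leb(S). The Poisson threshold structure, the instrument design, and the sampling tolerance below are illustrative choices of mine, not the paper's algorithm:

```python
import numpy as np
from scipy.stats import poisson

# Membership test in the spirit of Theorem 1 / Corollary 3: keep a candidate
# P in the outer region iff C_P(S | z) <= G_U(S) for all S in Q*_h and all z,
# up to sampling tolerance.  U(y, x; h) = [p_y(x), p_{y+1}(x)] is contained
# in [0, t] iff p_{y+1}(x) <= t, and in [t, 1] iff p_y(x) >= t.
def in_outer_region(y, x, z, p, m_max, x_vals, tol=0.025):
    for zv in np.unique(z):
        ys, xs = y[z == zv], x[z == zv]
        lo, hi = p(ys, xs), p(ys + 1, xs)       # endpoints of U(y, x; h)
        for xv in x_vals:
            for m in range(1, m_max + 1):
                t = p(m, xv)
                if np.mean(hi <= t) > t + tol:          # S = [0, t]
                    return False
                if np.mean(lo >= t) > (1.0 - t) + tol:  # S = [t, 1]
                    return False
    return True

def make_p(a, b):
    # Parametric thresholds p_m(x) = F_Poisson(m - 1; exp(a + b*x)), p_0 = 0.
    return lambda m, x: poisson.cdf(np.asarray(m) - 1, np.exp(a + b * x))

# Simulated data with exogenous X = Z, so the true (a, b) must survive.
rng = np.random.default_rng(4)
n = 50_000
z = rng.integers(0, 2, size=n)
x = z.copy()
y = poisson.ppf(rng.uniform(size=n), np.exp(0.1 + 0.9 * x)).astype(int)

print(in_outer_region(y, x, z, make_p(0.1, 0.9), 8, (0, 1)))    # truth kept
print(in_outer_region(y, x, z, make_p(0.1, -0.9), 8, (0, 1)))   # rejected
```

Embedding this check in a grid search over (a, b) traces out the outer region used in the numerical discussion below.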

Theorem 1 also works for an outcome with a bounded support, so it is applicable to other ordered choice models. Identification of the structural function is now straightforward. Corollary 1 with Q̄_h gives the sharp characterisation of the identified set for h. Under Condition 1, Q*_h suffices to deliver the sharp identified set. Define a set of threshold functions P ≡ {p_m(x) : x ∈ R_X, m ∈ M}. A function h is then fully characterised by P, so I substitute P for h in the notation henceforth. Let P denote the collection of all admissible P:

P ≡ {P : p_m(x) ∈ [0, 1], p_m(x) ≤ p_{m+1}(x) for all x ∈ R_X, m ∈ M}

Then the identified set P*, a subcollection of P, is found by the following corollaries.

Corollary 2 Given the joint distribution of (Y, X, Z), the identified set for the structural function h is characterised as

P* = {P : ∀S ∈ Q̄_h, C_P(S | z) ≤ G_U(S) a.e. z ∈ R_Z}.

Corollary 3 Given the joint distribution of (Y, X, Z) and under Condition 1, the identified set for the structural function h is characterised as

P* = {P : ∀S ∈ Q*_h, C_P(S | z) ≤ G_U(S) a.e. z ∈ R_Z}.

Corollaries 2 and 3 are direct applications of TH3 and Theorem 1. They provide a parsimonious characterisation of the sharp identified set, and the identification result here is fully nonparametric. The value of the containment functional is determined by the ordering of the elements of P, so all possible orderings need to be considered for identification. Given a particular set of threshold functions P, its ordering gives upper and lower bounds for each of its elements. If all the elements lie between their bounds, P ∈ P*. However, as the supports of Y and X become richer, the number of admissible orderings increases explosively. The number of admissible orderings is computed in Chesher and Smolinski (2012):

L = (K(m̄ − 1))! / ((m̄ − 1)!)^K

As M is possibly unbounded, computation for identification is highly cumbersome in count data IV models.
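The explosion of L is easy to appreciate numerically; the (K, m̄) values below are arbitrary:

```python
from math import factorial

# Number of admissible orderings L = (K(m - 1))! / ((m - 1)!)**K from
# Chesher and Smolinski (2012); it explodes as the supports of X and Y grow.
def n_orderings(K, m_bar):
    return factorial(K * (m_bar - 1)) // factorial(m_bar - 1) ** K

print(n_orderings(2, 3))  # -> 6
print(n_orderings(2, 6))  # -> 252
print(n_orderings(3, 6))  # -> 756756
```

Already with three support points for X and thresholds up to m̄ = 6 there are close to a million orderings, which is what motivates the shape and parametric restrictions discussed next.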
Therefore, appropriate shape restrictions or parametric restrictions might be imposed in practice to reduce the computational burden. In particular, parametric restrictions generate a large number of threshold functions from a few structural parameters. This motivates the use of appropriate parametric specifications which fit the data well.

Remark 2 Condition 1 is highly restrictive, as it is only satisfied for a limited subset of

P. In the process of identification, therefore, Condition 1 should not be imposed unless the impact of X on Y is expected to be very small. Nonetheless, the use of Q*_h can still be beneficial without Condition 1 if it provides a good approximation of the sharp identified set, as discussed below. If a parametric restriction is imposed, the use of Q*_h is particularly beneficial even without Condition 1 being satisfied. As the condition is not very intuitive, it is helpful to find more intuitive sufficient conditions under which Condition 1 is satisfied.

Condition 2 For all m ∈ M,
(i) Complete separation: max{p_m(x_1), p_m(x_2), …, p_m(x_K)} ≤ min{p_{m+1}(x_1), p_{m+1}(x_2), …, p_{m+1}(x_K)}
(ii) Monotonicity: p_m(x_1) ≤ p_m(x_2) ≤ … ≤ p_m(x_K) or p_m(x_K) ≤ p_m(x_{K−1}) ≤ … ≤ p_m(x_1)

Lemma 1 Under Condition 2, Condition 1 is satisfied.

If the threshold functions p_m(x) are generated from a parametric structure, i.e. p_m(x) = F(m, λ(x)) with λ(x) = exp(α + βx), where F belongs to a known class of parametric count distribution functions, complete separation means that X has only a weak impact on the thresholds; in other words, β is close enough to zero. Under this type of parametric restriction, a set of threshold functions P(α, β) is generated by a finite number of structural parameters. Thus identification of α and β is provided by the conditional moment inequalities from Corollary 3 given P(α, β). A pair (α, β) is included in the outer region if P(α, β) satisfies all the conditional moment inequalities. As one can always find a small enough β to ensure Condition 2, Q*_h provides the strongest possible criterion for values of β around 0. If the true β is close to 0, the main interest of identification is whether the identified set for β includes 0. Therefore, when the outer region provided by Q*_h contains 0, the core determining collection Q̄_h is also unable to exclude 0.
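For thresholds generated from a Poisson cdf as above, the complete separation part of Condition 2 can be checked directly; the parameter values below are illustrative:

```python
import numpy as np
from scipy.stats import poisson

# Check Condition 2(i) for p_m(x) = F_Poisson(m - 1; exp(alpha + beta*x)):
# max_x p_m(x) <= min_x p_{m+1}(x) for every m up to m_max.
def complete_separation(alpha, beta, x_vals, m_max=20):
    lam = np.exp(alpha + beta * np.asarray(x_vals, dtype=float))
    for m in range(1, m_max):
        if poisson.cdf(m - 1, lam).max() > poisson.cdf(m, lam).min():
            return False
    return True

x_vals = (0, 1, 2)
print(complete_separation(0.5, 0.05, x_vals))  # weak effect of X: holds
print(complete_separation(0.5, 1.00, x_vals))  # strong effect: fails
```

A small β keeps every p_m(x) below every p_{m+1}(x′), as complete separation requires, while a large β breaks the condition already at m = 1.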
In the case where $\beta$ is large (so that Condition 2 is severely violated), the outer region delivered by $\tilde{\mathcal{Q}}_h$ can still be very close to the sharp identified set. The proximity of the outer region to the sharp identified set depends on the data generating process of $Y$.

Theorem 2 As $K$ or $\mathrm{var}(Y)$ increases, the outer region delivered by $\tilde{\mathcal{Q}}_h$ converges to the sharp identified set of the thresholds $\{p_m(x)\}_{m \in \mathcal{M}, x \in \mathcal{R}_X}$.

Theorem 2 implies that the use of $\tilde{\mathcal{Q}}_h$ is particularly beneficial if the supports of $Y$ and $X$ are rich or the impact of $X$ on $Y$ is large. In such cases the outer region is a good approximation of the identified set. As $X$ or $Y$ becomes less discrete, the identified set tends to be smaller, and so does the outer region.

Remark 3 If a large outer region excluding $\beta = 0$ is obtained, further refinement of the outer region is available via Corollary 2. How much it can be refined is a question left to numerical exercises.

For computational feasibility, attention is restricted to a finite subset of $\mathcal{M}$, as $\mathcal{M}$ is infinite. Since $p_m(x)$ converges to 1 as $m$ goes to infinity, one can find a large enough $m$ at which $p_m(x)$ is very close to 1 for all $x$. For identification in the numerical examples, $\bar{m}$ is defined as follows:

$$\bar{m} \equiv \min\{n : \min_{x \in \mathcal{R}_X} p_n(x) > 1 - \epsilon\} \quad \text{for a small } \epsilon > 0.$$

Values greater than $\bar{m}$ are almost never realised, so ignoring them has a negligible effect on the size of the identified set. When it comes to estimation, one can use the largest realisation of $Y$ in the data. The subset of $\mathcal{M}$ of interest is now $\bar{\mathcal{M}} \equiv \{0, 1, 2, \dots, \bar{m}\}$, and $\mathcal{Q}_h$ and $\tilde{\mathcal{Q}}_h$ are redefined with respect to $\bar{\mathcal{M}}$ rather than $\mathcal{M}$. The following examples show the improvement in computation achieved by using $\tilde{\mathcal{Q}}_h$ rather than $\mathcal{Q}_h$. The number of elements of $\tilde{\mathcal{Q}}_h$ is $2\bar{m}K$, where $K = |\mathcal{R}_X|$.

Example 1 Suppose that $\bar{m} = 2$. Then the support of the $U$-level set is
$$\bar{\mathcal{U}}_h \equiv \{[0, p_1(x)],\, [p_1(x), p_2(x)],\, [p_2(x), 1] : x \in \mathcal{R}_X\}.$$
All disconnected unions and the unit interval are excluded from $\mathcal{Q}_h$. Therefore, all possible unions in $\mathcal{Q}_h$ are
$$[0, p_1(x)],\; [0, p_2(x)],\; [p_1(x), p_2(x')],\; \dots,\; [p_1(x), 1],\; [p_2(x), 1], \qquad x, x' \in \mathcal{R}_X.$$
The number of intervals in $\mathcal{Q}_h$ is then $5K + K(K-1)$. Suppose that the computing time for an information bound induced by each interval is $T$.
Let $\bar{T}$ and $\tilde{T}$ denote the total computing times with $\mathcal{Q}_h$ and $\tilde{\mathcal{Q}}_h$ respectively. Then
$$\bar{T} = [5K + K(K-1)]T, \qquad \tilde{T} = 4KT, \qquad \bar{T}/\tilde{T} = \frac{4 + K}{4} = 1 + \frac{K}{4}.$$
$X$ is not constant, so $K \ge 2$ and $\bar{T} > \tilde{T}$. As $K$ increases, the speed-up achieved by $\tilde{\mathcal{Q}}_h$ grows. If $K = 4$, the computation with $\tilde{\mathcal{Q}}_h$ is 2 times faster than that with $\mathcal{Q}_h$.

Example 2 Suppose that $\bar{m} = 3$. Then
$$\bar{\mathcal{U}}_h \equiv \{[0, p_1(x)],\, [p_1(x), p_2(x)],\, [p_2(x), p_3(x)],\, [p_3(x), 1] : x \in \mathcal{R}_X\}.$$
The number of unions in $\mathcal{Q}_h$ is $9K + 3K(K-1)$. Then
$$\bar{T} = [9K + 3K(K-1)]T, \qquad \tilde{T} = 6KT, \qquad \bar{T}/\tilde{T} = 1 + \frac{K}{2}.$$
Therefore $\bar{T} > \tilde{T}$ for all $K \ge 2$. As $K$ increases, the improvement achieved by $\tilde{\mathcal{Q}}_h$ grows. If $K = 4$, the computation with $\tilde{\mathcal{Q}}_h$ is 3 times faster.

Given $\bar{m}$ and $K$, the number of elements of $\mathcal{Q}_h$ is $2\bar{m}K + K^2 \frac{\bar{m}(\bar{m}-1)}{2}$. Accordingly, the ratio of $\bar{T}$ to $\tilde{T}$ is $\frac{K(\bar{m}-1)}{4} + 1$, which increases rapidly as $K$ and $\bar{m}$ go up. Therefore, the computational gain from using $\tilde{\mathcal{Q}}_h$ becomes greater when $K$ and $\bar{m}$ are large. $\tilde{\mathcal{Q}}_h$ is also very straightforward to apply, as one need not keep track of the ordering of the threshold functions.

Remark 4 Using $\tilde{\mathcal{Q}}_h$ may be beneficial even when it is not core determining. It is plausible that the primary identifying power comes from only a subcollection of the CDTS. In some numerical examples, more than half of the sets in $\mathcal{Q}_h$ are shown to deliver only negligible information about the identified set. Therefore, $\tilde{\mathcal{Q}}_h$ can be numerically non-dominated by $\mathcal{Q}_h$.

Remark 5 Estimation of the identified set sometimes involves only a subcollection of the CDTS because the full collection would deliver an empty set due to finite sample bias. In count data models this is more likely to happen, as the support of the outcome is rich and the number of CDTS is large. Therefore, the use of $\tilde{\mathcal{Q}}_h$ is justifiable even though it does not always guarantee the core determining property.

3.3 Additional heterogeneity

Unobserved heterogeneity $U$ has so far been assumed to be a random scalar. Suppose that some elements of $X$ are unobserved. As these are relevant variables in the structural function of $Y$, an omitted variable problem arises.
This may happen in practice, so it needs to be accommodated in the model (10). Let the scalar unobserved heterogeneity assumption be relaxed: there is now two-dimensional unobserved heterogeneity $(U, V)$, where $U$ is a uniformly distributed latent variable as before and $V$ is a continuously or discretely distributed characteristic which is unobservable and hence omitted. The model (10) is modified as follows:

$$Y = h(X, U, V) = \begin{cases} 0 & \text{if } p_0(X, V) \le U \le p_1(X, V) \\ 1 & \text{if } p_1(X, V) < U \le p_2(X, V) \\ \ \vdots & \\ m & \text{if } p_m(X, V) < U \le p_{m+1}(X, V) \\ \ \vdots & \end{cases} \tag{13}$$

where $p_0(X, V)$ is normalised to 0. If $V$ were observable, the set of threshold functions $\{p_m(x, v)\}_{m \in \mathcal{M}, v \in \mathcal{R}_V, x \in \mathcal{R}_X}$ would be the object of identification. Suppose that $X$ is independent of $(U, V)$ and that $U$ and $V$ are mutually independent. Under these assumptions, without observing $V$, there is no hope of identifying $p_m(x, v)$ itself. As $U \sim \mathrm{Unif}(0, 1)$, one may specify $p_m(x, v) = \sum_{y=0}^{m-1} P[Y = y \mid X = x, V = v]$. This probability cannot be identified without conditioning on $V$, but one can identify the averaged threshold functions
$$\bar{p}_m(x) \equiv \int_{\mathcal{R}_V} p_m(x, v) f_V(v)\, dv.$$
As $X$ and $V$ are independent of each other, $f_{V|X} = f_V$ and $f_{XV} = f_X f_V$. Therefore $\bar{p}_m(x)$ equals $P[Y \le m-1 \mid X = x]$:

$$\bar{p}_m(x) = \int_{\mathcal{R}_V} \sum_{y=0}^{m-1} P[Y = y \mid X = x, V = v]\, f_{V|X}(v \mid x)\, dv = \sum_{y=0}^{m-1} \int_{\mathcal{R}_V} \frac{f_{YXV}(y, x, v)}{f_{XV}(x, v)}\, \frac{f_{XV}(x, v)}{f_X(x)}\, dv = \sum_{y=0}^{m-1} \int_{\mathcal{R}_V} f_{YV|X}(y, v \mid x)\, dv = \sum_{y=0}^{m-1} P[Y = y \mid X = x].$$

In the case where $X$ is not independent of $V$, the marginal response of $Y$ to $X$ is not separately identified from the effect of $V$, as $\bar{p}_m(x) \ne P[Y \le m-1 \mid X = x]$; $\bar{p}_m(x)$ is now

the counterfactual cdf of $Y$ conditional on $X$. However, one can set-identify $\bar{p}_m(x)$ as before given the existence of a relevant instrument $Z$. Given the values of $Y$ and $X$, the location of $(V, U)$ is partially identified on $\mathcal{R}_V \times [0, 1]$:

$$\mathcal{U}(m, x; h) = \{(v, [p_{m-1}(x, v), p_m(x, v)]) : v \in \mathcal{R}_V\}.$$

Therefore $\mathcal{U}_h$, the support of the $U$-level set, is defined as follows:

$$\mathcal{U}_h = \{(v, [0, p_1(x, v)]), \dots, (v, [p_{m-1}(x, v), p_m(x, v)]), \dots : x \in \mathcal{R}_X,\, v \in \mathcal{R}_V\}.$$

Let $\mathcal{Q}_h$ be the collection of all connected unions of elements of $\mathcal{U}_h$. Then $\mathcal{Q}_h$ is the collection of CDTS. Under the condition that $Z \perp (V, U)$:

Corollary 4 Given the joint distribution of $(Y, X, Z)$ and the model (13), the identified set for the structural function $h$ is characterised as
$$\mathcal{H} = \{h : \forall S \in \mathcal{Q}_h,\; C_h(S \mid z) \le G_{VU}(S) \text{ a.e. } z \in \mathcal{R}_Z\}.$$

However, this characterisation is hard to use in practice. As $V$ is possibly continuously distributed, there are infinitely many elements in $\mathcal{Q}_h$ even when $\mathcal{M}$ is bounded and small. Furthermore, the location of $V$ cannot be inferred from the values of $Y$, $X$ and $U$, since no distributional assumption is imposed. To make identification feasible, let $\tilde{\mathcal{Q}}_h$ be the subcollection of $\mathcal{Q}_h$ given by
$$\tilde{\mathcal{Q}}_h \equiv \{S^V_1(m, x),\, S^V_2(m, x) : m \in \mathcal{M},\, x \in \mathcal{R}_X\},$$
where $S^V_1(m, x) \equiv \{(v, [0, p_m(x, v)]) : v \in \mathcal{R}_V\}$ and $S^V_2(m, x) \equiv \{(v, [p_m(x, v), 1]) : v \in \mathcal{R}_V\}$. Then the following lemma characterises the outer region for the threshold functions $\bar{p}_m(x)$.

Lemma 2 For all $m \in \mathcal{M}$ and $x \in \mathcal{R}_X$, the outer region for $\bar{p}_m(x)$ is
$$\sup_{z \in \mathcal{R}_Z} P[Y \le m-1 \mid X = x, Z = z] \;\le\; \bar{p}_m(x) \;\le\; \inf_{z \in \mathcal{R}_Z} P[Y \le m \mid X = x, Z = z].$$

4 Identified sets with parametric restrictions

Identification analysis is now demonstrated on different data generating processes. To avoid dealing with the tremendous number of orderings, parametric restrictions (Poisson and negative binomial) are imposed. The model remains set identifying under these restrictions, but

the size of the identified set in the numerical examples tends to be small enough. A parametric restriction allows all the threshold functions $\{\{p_m(x)\}_{m=1}^{\bar{m}}\}_{x \in \mathcal{R}_X}$ to be generated by a small number of structural parameters, and the ordering is given. Let $\hat{\mathcal{P}}$ denote the approximation of the identified set (I call it the identified set henceforth, though it is not necessarily sharp) delivered by $\tilde{\mathcal{Q}}_h$. Suppose that $r$ is the number of structural parameters. Then the algorithm for identification is as follows.

1. Define dense grid points on $\mathbb{R}^r$. Let $\Theta$ denote the set of grid points, $\Theta \equiv \{\theta_1, \theta_2, \dots, \theta_J\}$, where $J$ is the number of grid points in $\Theta$.
2. Generate the thresholds $\{\{p_m(x)\}_{m=1}^{\bar{m}}\}_{x \in \mathcal{R}_X}$ using $\theta_i$. The ordering $\ell_i$ of the threshold values is then given.
3. Compute upper and lower bounds of the threshold functions using the given ordering $\ell_i$ and Corollary 3.
4. Check whether all the threshold functions lie between their lower and upper bounds. If so, include $\theta_i$ in $\hat{\mathcal{P}}$; otherwise $\theta_i \notin \hat{\mathcal{P}}$.
5. Repeat the above steps for all $i = 1, \dots, J$.

This algorithm is used to deliver the identified sets for the various data generating processes considered throughout this paper.

4.1 Poisson restriction

Triangular DGP

A recursive data generating process (DGP) is specified. This triangular system is particularly useful for identification analysis in the sense that it involves less computational burden.[7]

$$Z^* \sim N(0, 1), \qquad \begin{pmatrix} \varepsilon \\ V \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \gamma \\ \gamma & 1 \end{pmatrix}\right),$$

and $Z^*$ is independent of $\varepsilon$ and $V$. $X$ and $Z$ are binary variables generated as follows:
$$Z = 1[Z^* \ge 0], \qquad X = 1[X^* \ge 0], \qquad X^* = \delta_1 + \delta_2 Z + V.$$

[7] An alternative DGP is a joint Gaussian system where $X^*$, $Z^*$ and $\varepsilon$ are all generated by a joint normal distribution. This DGP implements full simultaneity between the variables. It has more variables in the multivariate normal distribution, so computation of the identified set takes much longer than under the triangular DGP.
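The grid-search algorithm above, specialised to the Poisson restriction of this section, can be sketched as follows. The bound functions `lower_bound` and `upper_bound` are placeholders for the intersection bounds of Corollary 3, which depend on the observed distribution and are left abstract here:

```python
import math

def poisson_cdf(m, lam):
    """P[Y <= m] for Y ~ Poisson(lam): the threshold p_{m+1}(x) in equation (14)."""
    return sum(math.exp(-lam) * lam**y / math.factorial(y) for y in range(m + 1))

def thresholds(alpha, beta, x, m_bar):
    """Threshold values p_1(x), ..., p_{m_bar}(x) with lam = exp(alpha + beta * x)."""
    lam = math.exp(alpha + beta * x)
    return [poisson_cdf(m - 1, lam) for m in range(1, m_bar + 1)]

def identified_set(grid, support_x, m_bar, lower_bound, upper_bound):
    """Steps 2-5: keep theta = (alpha, beta) whose implied thresholds satisfy
    lower_bound(m, x) <= p_m(x) <= upper_bound(m, x) for all m and x."""
    kept = []
    for alpha, beta in grid:
        ok = all(lower_bound(m, x) <= p <= upper_bound(m, x)
                 for x in support_x
                 for m, p in enumerate(thresholds(alpha, beta, x, m_bar), start=1))
        if ok:
            kept.append((alpha, beta))
    return kept
```

With the trivial bounds 0 and 1 every grid point is retained; informative intersection bounds computed from the data shrink the retained set toward the identified set.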

Unobserved heterogeneity $\varepsilon$ is marginalised, i.e. $U \equiv \Phi(\varepsilon)$. To generate a count outcome, the Poisson cumulative distribution function (cdf) is used for $p_m(X)$. The threshold values $\{p_m(X)\}_{m \in \mathcal{M}}$ are generated as follows:

$$p_{m+1}(X) = \sum_{y=0}^{m} \exp(-\exp(\alpha + \beta X))\, \frac{\exp(\alpha + \beta X)^y}{y!} \tag{14}$$

Then a function $g(\cdot \mid X)$ generates $Y$ by taking $U$ as an argument. This function is distinct from the conditional quantile function of $Y$ given $X$, even though the expression is very similar, because $p_m(X)$ is not the conditional cdf of $Y$ but the counterfactual conditional cdf:

$$g(\tau \mid X) \equiv \inf\{m : p_{m+1}(X) \ge \tau\}, \qquad Y = g(U \mid X).$$

For identification of the parameter values, the Poisson restriction is imposed within the algorithm introduced above.

This DGP is convenient for understanding identification at infinity. If $Z$ is a perfect predictor for $X$, the identified set collapses to a point as the endogeneity of $X$ disappears. The size of the identified set varies with the values of $\delta_1$ and $\delta_2$. Let $J$ be a positive real number. When $\delta_1 = -J$ and $\delta_2 = 2J$, $P[X = z \mid Z = z] \to 1$ as $J \to \infty$; $Z$ becomes a near-perfect predictor for $X$ already at values of $J$ smaller than 5. The first IV, classified as moderate, has $\delta_1 = 0$ and $\delta_2 = 1$. The strong and super strong IVs have $J$ equal to 2 and 4 respectively.

[Figure 1: Identified sets under triangular structures when E[Y] = 3.9]

Figure 1 shows the identified sets associated with these instruments. Considering the scale of the figure, the sets are very small. For the moderate IV, $\alpha$ and $\beta$ lie in $[0.497, 0.536]$ and $[0.965, 1.003]$ respectively. Each point of the identified set is linked to a collection of counterfactual conditional cdfs of $Y$ given $X$, so all the features of interest can be computed. The interval identified ATE is $[1.754, 1.860]$. For the strong and super strong IVs, the identified sets are extremely small: the ATEs lie in $[1.763, 1.766]$ and $[1.7634,\ ]$ respectively. For larger values of $J$, e.g. 5, the parameters are point identified.

The moment based approaches do not provide correct information about the true parameters.

[Figure 2: Moment based results and the identified set with the moderate IV]

Figure 2 shows the point estimates delivered by the moment conditions on a large sample ($n = 100{,}000$) generated by the triangular DGP. They all lie outside the identified sets and far away from the true point. This is natural, because the moment conditions are in general not satisfied under the DGP specified here.

The identified sets are small even if the IV is very weak. The strong identification power

is from the rich support of $Y$: under the true parameter values, $Y$ takes values from 0 to 17. In the case where $\alpha = 0.1$ and $\beta = 0.1$, the mean of $Y$ is fairly small and so is the variance ($E[Y] = 1.18$). The support of $Y$ is now $\{0, \dots, 9\}$. This represents the more unfavourable cases in practice where the support of $Y$ is not rich. Note that under these parameter values Restriction 5 is satisfied, and hence $\tilde{\mathcal{Q}}_h$ is core determining. Figure 3 shows the identified sets. They are in general large unless the instruments are very strong. However, even with $Z$ having no correlation with $X$, the sign of the ATE is correctly identified. When $\delta_1 = \delta_2 = 0$, the identified sets for the parameters and the ATE are
$$\alpha \in [-0.25, 0.135], \qquad \beta \in [0.06, 0.59], \qquad \exp(\alpha + \beta) - \exp(\alpha) \in [0.071, 0.626],$$
where the true ATE is .

[Figure 3: Identified sets under triangular structures when E[Y] = 1.18]

This result looks surprising. The GIV framework employed here does not require the rank condition, so it is applicable in cases where an instrument is independent of $U$ but has no predictive power for $X$. However, this identification power does not entirely

come from this framework. In the example, $X$ is positively correlated with $U$. Therefore, the observable joint distribution of $Y$, $X$ and $Z$ does not allow for negative values of $\beta$. If $X$ is negatively correlated with $U$, the identification power disappears: let the correlation parameter $\gamma$ be $-0.5$; then the identified set contains grid points on which $\beta$ is negative, so in such a case the identified set is not informative about the ATE at all.

One other interesting experiment is to evaluate the identifying power of each interval in $\tilde{\mathcal{Q}}_h$. This exercise answers the question of where the identifying power primarily comes from. Under the triangular DGP with $\alpha = 1$ and $\beta = 0.5$, $\bar{m}$ is 17. For numerical identification, I use $\tilde{\mathcal{Q}}_h = \{[0, p_m(x)], [p_m(x), 1] : 1 \le m \le 17,\, x \in \{0, 1\}\}$, which includes 68 intervals in total. Is it possible to deliver the same approximation of the identified set with a smaller number of intervals in $\tilde{\mathcal{Q}}_h$? The answer is shown in Figure 4.

[Figure 4: Identifying power of the intervals in $\tilde{\mathcal{Q}}_h$]

A collection of intervals $\{[0, p_m(X)], [p_m(X), 1]\}_{m=1}^{t}$ is defined, and the figure demonstrates the

outer regions delivered by various values of $t$. As $t$ increases, the outer region converges to the identified set. Note that, given the scale of the figure, convergence is achieved at $t = 9$. This means that essentially all the identifying power comes from the first 9 values of $\mathcal{M}$, and the additional information provided by higher values is very marginal. This experiment shows that the identifying power of $\tilde{\mathcal{Q}}_h$ is strong enough to deliver a good approximation of the identified set even when $K = 2$ and $\beta$ is not large; $\tilde{\mathcal{Q}}_h$ does contain some redundant intervals which provide only marginal information on the parameters, even though we are not aware of them ex ante.

Lastly, the discreteness of $Y$ is pivotal for the size of the identified set. As the number of points in $\mathrm{supp}(Y)$ increases, the discreteness disappears.

[Figure 5: Size variation of identified sets]

Figure 5 demonstrates the size variation of the identified sets with respect to the richness of $\mathrm{supp}(Y)$. The IV used in identification is moderate ($\delta_1 = 0$, $\delta_2 = 1$). The size of the identified set shrinks with the mean of $Y$. When $E[Y] = 8.17$, the identified set is effectively a point. Therefore, if there is high dispersion in the count data, the identified set is close to a point.
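The link between the mean of $Y$ and the richness of its effective support can be illustrated directly from the Poisson thresholds. In the sketch below, the cutoff 0.9999 is illustrative, standing in for the "very close to 1" criterion defining $\bar{m}$:

```python
import math

def effective_support_size(lam, cutoff=0.9999):
    """Smallest n with P[Y <= n] > cutoff for Y ~ Poisson(lam):
    values above n are almost never realised."""
    cdf, n = 0.0, 0
    while True:
        cdf += math.exp(-lam) * lam**n / math.factorial(n)
        if cdf > cutoff:
            return n
        n += 1
```

The effective support grows with $\lambda = E[Y]$, so a larger mean of $Y$ means a richer support and, per Figure 5, a smaller identified set.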

Three points in the support of X

The size of the identified set tends to be smaller as the support of $X$ becomes richer. The reason is straightforward: the number of conditional moment inequalities increases. The set geometry is therefore studied here with a higher dimensional $X$. The previous DGP is preserved, but now $X$ is trinary:

$$X = \begin{cases} -1 & \text{if } X^* \in (-\infty, -1) \\ \ \ 0 & \text{if } X^* \in [-1, 1) \\ \ \ 1 & \text{if } X^* \in [1, \infty) \end{cases}$$

[Figure 6: The identified set when X is trinary]

Figure 6 shows the identification results with the IVs. The shapes of the identified sets are rather different from the binary $X$ case, where the sets are in general parallelograms; now the shapes are more like general polygons. The projections of the identified set show that, for a given strength of the instrument, the richer support of $X$ delivers a much smaller set. For the moderate IV, $\alpha$ and $\beta$ lie in $[0.488, 0.504]$ and $[0.996, 1.010]$ respectively, which are indeed
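The trinary discretisation above can be simulated directly. A minimal sketch follows; the helper name and the moderate-IV parameter values ($\delta_1 = 0$, $\delta_2 = 1$) are illustrative:

```python
import random

def draw_trinary_x(delta1, delta2, z, rng):
    """Discretise the latent index X* = delta1 + delta2*Z + V into {-1, 0, 1}
    using the cutoffs -1 and 1, with V ~ N(0, 1)."""
    x_star = delta1 + delta2 * z + rng.gauss(0.0, 1.0)
    if x_star < -1.0:
        return -1
    if x_star < 1.0:
        return 0
    return 1

# Draw Z = 1[Z* >= 0] with Z* ~ N(0, 1), then X given Z.
rng = random.Random(0)
draws = [draw_trinary_x(0.0, 1.0, int(rng.gauss(0.0, 1.0) >= 0), rng)
         for _ in range(2000)]
```

All three support points occur with non-negligible probability under these parameter values, which is what generates the additional conditional moment inequalities relative to the binary case.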


More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory

More information

New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation

New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation New Developments in Econometrics Lecture 11: Difference-in-Differences Estimation Jeff Wooldridge Cemmap Lectures, UCL, June 2009 1. The Basic Methodology 2. How Should We View Uncertainty in DD Settings?

More information

Lecture 8. Roy Model, IV with essential heterogeneity, MTE

Lecture 8. Roy Model, IV with essential heterogeneity, MTE Lecture 8. Roy Model, IV with essential heterogeneity, MTE Economics 2123 George Washington University Instructor: Prof. Ben Williams Heterogeneity When we talk about heterogeneity, usually we mean heterogeneity

More information

On IV estimation of the dynamic binary panel data model with fixed effects

On IV estimation of the dynamic binary panel data model with fixed effects On IV estimation of the dynamic binary panel data model with fixed effects Andrew Adrian Yu Pua March 30, 2015 Abstract A big part of applied research still uses IV to estimate a dynamic linear probability

More information

Lectures on Identi cation 2

Lectures on Identi cation 2 Lectures on Identi cation 2 Andrew Chesher CeMMAP & UCL April 16th 2008 Andrew Chesher (CeMMAP & UCL) Identi cation 2 4/16/2008 1 / 28 Topics 1 Monday April 14th. Motivation, history, de nitions, types

More information

Graduate Econometrics I: What is econometrics?

Graduate Econometrics I: What is econometrics? Graduate Econometrics I: What is econometrics? Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: What is econometrics?

More information

Economics 536 Lecture 21 Counts, Tobit, Sample Selection, and Truncation

Economics 536 Lecture 21 Counts, Tobit, Sample Selection, and Truncation University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 21 Counts, Tobit, Sample Selection, and Truncation The simplest of this general class of models is Tobin s (1958)

More information

An Instrumental Variable Model of Multiple Discrete Choice

An Instrumental Variable Model of Multiple Discrete Choice An Instrumental Variable Model of Multiple Discrete Choice Andrew Chesher y UCL and CeMMAP Adam M. Rosen z UCL and CeMMAP February, 20 Konrad Smolinski x UCL and CeMMAP Abstract This paper studies identi

More information

Using all observations when forecasting under structural breaks

Using all observations when forecasting under structural breaks Using all observations when forecasting under structural breaks Stanislav Anatolyev New Economic School Victor Kitov Moscow State University December 2007 Abstract We extend the idea of the trade-off window

More information

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics

ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics ESTIMATING AVERAGE TREATMENT EFFECTS: REGRESSION DISCONTINUITY DESIGNS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. The Sharp RD Design 3.

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II Jeff Wooldridge IRP Lectures, UW Madison, August 2008 5. Estimating Production Functions Using Proxy Variables 6. Pseudo Panels

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

A Course in Applied Econometrics. Lecture 10. Partial Identification. Outline. 1. Introduction. 2. Example I: Missing Data

A Course in Applied Econometrics. Lecture 10. Partial Identification. Outline. 1. Introduction. 2. Example I: Missing Data Outline A Course in Applied Econometrics Lecture 10 1. Introduction 2. Example I: Missing Data Partial Identification 3. Example II: Returns to Schooling 4. Example III: Initial Conditions Problems in

More information

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit R. G. Pierse 1 Introduction In lecture 5 of last semester s course, we looked at the reasons for including dichotomous variables

More information

ON THE EQUIVALENCE OF CONGLOMERABILITY AND DISINTEGRABILITY FOR UNBOUNDED RANDOM VARIABLES

ON THE EQUIVALENCE OF CONGLOMERABILITY AND DISINTEGRABILITY FOR UNBOUNDED RANDOM VARIABLES Submitted to the Annals of Probability ON THE EQUIVALENCE OF CONGLOMERABILITY AND DISINTEGRABILITY FOR UNBOUNDED RANDOM VARIABLES By Mark J. Schervish, Teddy Seidenfeld, and Joseph B. Kadane, Carnegie

More information

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure A Robust Approach to Estimating Production Functions: Replication of the ACF procedure Kyoo il Kim Michigan State University Yao Luo University of Toronto Yingjun Su IESR, Jinan University August 2018

More information

Binary Choice Models with Discrete Regressors: Identification and Misspecification

Binary Choice Models with Discrete Regressors: Identification and Misspecification Binary Choice Models with Discrete Regressors: Identification and Misspecification Tatiana Komarova London School of Economics May 24, 2012 Abstract In semiparametric binary response models, support conditions

More information

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be

Quantile methods. Class Notes Manuel Arellano December 1, Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Quantile methods Class Notes Manuel Arellano December 1, 2009 1 Unconditional quantiles Let F (r) =Pr(Y r). Forτ (0, 1), theτth population quantile of Y is defined to be Q τ (Y ) q τ F 1 (τ) =inf{r : F

More information

Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects

Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects Weak Stochastic Increasingness, Rank Exchangeability, and Partial Identification of The Distribution of Treatment Effects Brigham R. Frandsen Lars J. Lefgren December 16, 2015 Abstract This article develops

More information

Censored quantile instrumental variable estimation with Stata

Censored quantile instrumental variable estimation with Stata Censored quantile instrumental variable estimation with Stata Victor Chernozhukov MIT Cambridge, Massachusetts vchern@mit.edu Ivan Fernandez-Val Boston University Boston, Massachusetts ivanf@bu.edu Amanda

More information

Generated Covariates in Nonparametric Estimation: A Short Review.

Generated Covariates in Nonparametric Estimation: A Short Review. Generated Covariates in Nonparametric Estimation: A Short Review. Enno Mammen, Christoph Rothe, and Melanie Schienle Abstract In many applications, covariates are not observed but have to be estimated

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Generating p-extremal graphs

Generating p-extremal graphs Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum

More information

Identifying Effects of Multivalued Treatments

Identifying Effects of Multivalued Treatments Identifying Effects of Multivalued Treatments Sokbae Lee Bernard Salanie The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP72/15 Identifying Effects of Multivalued Treatments

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Nonparametric Instrumental Variables Identification and Estimation of Nonseparable Panel Models

Nonparametric Instrumental Variables Identification and Estimation of Nonseparable Panel Models Nonparametric Instrumental Variables Identification and Estimation of Nonseparable Panel Models Bradley Setzler December 8, 2016 Abstract This paper considers identification and estimation of ceteris paribus

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Poisson Regression. Ryan Godwin. ECON University of Manitoba

Poisson Regression. Ryan Godwin. ECON University of Manitoba Poisson Regression Ryan Godwin ECON 7010 - University of Manitoba Abstract. These lecture notes introduce Maximum Likelihood Estimation (MLE) of a Poisson regression model. 1 Motivating the Poisson Regression

More information

Empirical approaches in public economics

Empirical approaches in public economics Empirical approaches in public economics ECON4624 Empirical Public Economics Fall 2016 Gaute Torsvik Outline for today The canonical problem Basic concepts of causal inference Randomized experiments Non-experimental

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Identification and Estimation of Marginal Effects in Nonlinear Panel Models 1

Identification and Estimation of Marginal Effects in Nonlinear Panel Models 1 Identification and Estimation of Marginal Effects in Nonlinear Panel Models 1 Victor Chernozhukov Iván Fernández-Val Jinyong Hahn Whitney Newey MIT BU UCLA MIT February 4, 2009 1 First version of May 2007.

More information

Supplementary material to: Tolerating deance? Local average treatment eects without monotonicity.

Supplementary material to: Tolerating deance? Local average treatment eects without monotonicity. Supplementary material to: Tolerating deance? Local average treatment eects without monotonicity. Clément de Chaisemartin September 1, 2016 Abstract This paper gathers the supplementary material to de

More information

A Guide to Modern Econometric:

A Guide to Modern Econometric: A Guide to Modern Econometric: 4th edition Marno Verbeek Rotterdam School of Management, Erasmus University, Rotterdam B 379887 )WILEY A John Wiley & Sons, Ltd., Publication Contents Preface xiii 1 Introduction

More information

BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS

BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS BAYESIAN INFERENCE IN A CLASS OF PARTIALLY IDENTIFIED MODELS BRENDAN KLINE AND ELIE TAMER UNIVERSITY OF TEXAS AT AUSTIN AND HARVARD UNIVERSITY Abstract. This paper develops a Bayesian approach to inference

More information

Khinchin s approach to statistical mechanics

Khinchin s approach to statistical mechanics Chapter 7 Khinchin s approach to statistical mechanics 7.1 Introduction In his Mathematical Foundations of Statistical Mechanics Khinchin presents an ergodic theorem which is valid also for systems that

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Endogeneity and Discrete Outcomes. Andrew Chesher Centre for Microdata Methods and Practice, UCL & IFS. Revised April 2nd 2008

Endogeneity and Discrete Outcomes. Andrew Chesher Centre for Microdata Methods and Practice, UCL & IFS. Revised April 2nd 2008 Endogeneity and Discrete Outcomes Andrew Chesher Centre for Microdata Methods and Practice, UCL & IFS Revised April 2nd 2008 Abstract. This paper studies models for discrete outcomes which permit explanatory

More information

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Maria Ponomareva University of Western Ontario May 8, 2011 Abstract This paper proposes a moments-based

More information

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern

More information

IV estimators and forbidden regressions

IV estimators and forbidden regressions Economics 8379 Spring 2016 Ben Williams IV estimators and forbidden regressions Preliminary results Consider the triangular model with first stage given by x i2 = γ 1X i1 + γ 2 Z i + ν i and second stage

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Simulation-based robust IV inference for lifetime data

Simulation-based robust IV inference for lifetime data Simulation-based robust IV inference for lifetime data Anand Acharya 1 Lynda Khalaf 1 Marcel Voia 1 Myra Yazbeck 2 David Wensley 3 1 Department of Economics Carleton University 2 Department of Economics

More information

The Instability of Correlations: Measurement and the Implications for Market Risk

The Instability of Correlations: Measurement and the Implications for Market Risk The Instability of Correlations: Measurement and the Implications for Market Risk Prof. Massimo Guidolin 20254 Advanced Quantitative Methods for Asset Pricing and Structuring Winter/Spring 2018 Threshold

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

An instrumental variable model of multiple discrete choice

An instrumental variable model of multiple discrete choice Quantitative Economics 4 (2013), 157 196 1759-7331/20130157 An instrumental variable model of multiple discrete choice Andrew Chesher Department of Economics, University College London and CeMMAP Adam

More information

EMERGING MARKETS - Lecture 2: Methodology refresher

EMERGING MARKETS - Lecture 2: Methodology refresher EMERGING MARKETS - Lecture 2: Methodology refresher Maria Perrotta April 4, 2013 SITE http://www.hhs.se/site/pages/default.aspx My contact: maria.perrotta@hhs.se Aim of this class There are many different

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Econ 2148, fall 2017 Instrumental variables II, continuous treatment

Econ 2148, fall 2017 Instrumental variables II, continuous treatment Econ 2148, fall 2017 Instrumental variables II, continuous treatment Maximilian Kasy Department of Economics, Harvard University 1 / 35 Recall instrumental variables part I Origins of instrumental variables:

More information

Linear Models in Econometrics

Linear Models in Econometrics Linear Models in Econometrics Nicky Grant At the most fundamental level econometrics is the development of statistical techniques suited primarily to answering economic questions and testing economic theories.

More information

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances Discussion Paper: 2006/07 Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances J.S. Cramer www.fee.uva.nl/ke/uva-econometrics Amsterdam School of Economics Department of

More information

STAT5044: Regression and Anova

STAT5044: Regression and Anova STAT5044: Regression and Anova Inyoung Kim 1 / 18 Outline 1 Logistic regression for Binary data 2 Poisson regression for Count data 2 / 18 GLM Let Y denote a binary response variable. Each observation

More information