Identification and shape restrictions in nonparametric instrumental variables estimation


Identification and shape restrictions in nonparametric instrumental variables estimation

Joachim Freyberger and Joel Horowitz

The Institute for Fiscal Studies
Department of Economics, UCL

cemmap working paper CWP31/13

IDENTIFICATION AND SHAPE RESTRICTIONS IN NONPARAMETRIC INSTRUMENTAL VARIABLES ESTIMATION

by

Joachim Freyberger and Joel L. Horowitz
Department of Economics, Northwestern University, Evanston, IL, USA

April 2013

Abstract

This paper is concerned with inference about an unidentified linear functional, L(g), where the function g satisfies the relation Y = g(X) + U; E(U | W) = 0. In this relation, Y is the dependent variable, X is a possibly endogenous explanatory variable, W is an instrument for X, and U is an unobserved random variable. The data are an independent random sample of (Y, X, W). In much applied research, X and W are discrete, and W has fewer points of support than X. Consequently, neither g nor L(g) is nonparametrically identified. Indeed, L(g) can have any value in (−∞, ∞). In applied research, this problem is typically overcome, and point identification is achieved, by assuming that g is a linear function of X. However, the assumption of linearity is arbitrary. It is untestable if W is binary, as is the case in many applications. This paper explores the use of shape restrictions, such as monotonicity or convexity, for achieving interval identification of L(g). Economic theory often provides such shape restrictions. This paper shows that they restrict L(g) to an interval whose upper and lower bounds can be obtained by solving linear programming problems. Inference about the identified interval and the functional L(g) can be carried out by using the bootstrap. An empirical application illustrates the usefulness of shape restrictions for carrying out nonparametric inference about L(g). An extension to nonseparable quantile IV models is described.

JEL Classification: C13, C14, C26

Key Words: Partial identification, linear programming, bootstrap

We thank Ivan Canay, Xiaohong Chen, Sokbae Lee, Chuck Manski, and Elie Tamer for helpful comments. This research was supported in part by NSF grant SES.

IDENTIFICATION AND SHAPE RESTRICTIONS IN NONPARAMETRIC INSTRUMENTAL VARIABLES ESTIMATION

1. INTRODUCTION

This paper is about estimation of the linear functional L(g), where the unknown function g obeys the relation

(1a)  Y = g(X) + U,
(1b)  E(U | W = w) = 0 for almost every w.

Equivalently,

(2)  E[Y − g(X) | W = w] = 0.

In (1a), (1b), and (2), Y is the dependent variable, X is a possibly endogenous explanatory variable, W is an instrument for X, and U is an unobserved random variable. The data consist of an independent random sample {Y_i, X_i, W_i : i = 1, ..., n} from the distribution of (Y, X, W). In this paper, it is assumed that X and W are discretely distributed random variables with finitely many mass points. Discretely distributed explanatory variables and instruments occur frequently in applied research, as is discussed two paragraphs below. When X is discrete, g can be identified only at mass points of X. Linear functionals that may be of interest in this case are the value of g at a single mass point and the difference between the values of g at two different mass points.

Although model (1)-(2) is the main focus of this paper, we also present an extension of our methods to quantile IV models in which the mean-independence condition (1b) and (2) is replaced by a quantile independence condition. As will be explained later, the quantile IV model includes a class of nonseparable models.

In much applied research, W has fewer mass points than X does. For example, in a study of returns to schooling, Card (1995) used a binary instrument for the endogenous variable years of schooling. Moran and Simon (2006) used a binary instrument for income in a study of the effects of the Social Security notch on the usage of prescription drugs by the elderly. Other studies in which an instrument has fewer mass points than the endogenous explanatory variable are Angrist and Krueger (1991), Bronars and Grogger (1994), and Lochner and Moretti (2004). The function g is not identified nonparametrically when W has fewer mass points than X does. The linear functional L(g) is unidentified except in special cases. Indeed, as will be shown in Section 2 of this paper, except in special cases, L(g) can have any value in (−∞, ∞) when W has fewer points of support than X does. Thus, except in special cases, the data are uninformative about L(g) in the

absence of further information. In the applied research cited in the previous paragraph, this problem is dealt with by assuming that g is a linear function. The assumption of linearity enables g and L(g) to be identified, but it is problematic in other respects. In particular, the assumption of linearity is not testable if W is binary. Moreover, any other two-parameter specification is observationally equivalent to linearity and untestable, though it might yield substantive conclusions that are very different from those obtained under the assumption of linearity. For example, the assumptions that g(x) = β_0 + β_1 x² or g(x) = β_0 + β_1 sin x for some constants β_0 and β_1 are observationally equivalent to g(x) = β_0 + β_1 x and untestable if W is binary.

This paper explores the use of restrictions on the shape of g, such as monotonicity, convexity, or concavity, to achieve interval identification of L(g) when X and W are discretely distributed and W has fewer mass points than X has. Specifically, the paper uses shape restrictions on g to establish an identified interval that contains L(g). Shape restrictions are less restrictive than a parametric specification such as linearity. They are often plausible in applications and may be prescribed by economic theory. For example, demand and cost functions are monotonic, and cost functions are convex. It is shown in this paper that under shape restrictions, such as monotonicity, convexity, or concavity, that impose linear inequality restrictions on the values of g(x) at points of support of X, L(g) is restricted to an interval whose upper and lower bounds can be obtained by solving linear programming problems. The bounds can be estimated by solving sample-analog versions of the linear programming problems. The estimated bounds are asymptotically distributed as the minima or maxima of multivariate normal random variables. Under certain conditions, the bounds are asymptotically normally distributed, but calculation of the analytic asymptotic distribution is difficult in general. We present a bootstrap procedure that can be used to estimate the asymptotic distribution of the estimated bounds in applications. The asymptotic distribution can be used to carry out inference about the identified interval that contains L(g), using methods like those of Imbens and Manski (2004) and Stoye (2009), and inference about the parameter L(g).

Interval identification of g in (1a) has been investigated previously by Chesher (2004) and Manski and Pepper (2000, 2009). Chesher (2004) considered partial identification of g in (1a) but replaced (1b) with assumptions like those used in the control-function approach to estimating models with an endogenous explanatory variable. He gave conditions under which the difference between the values of g at two different mass points of X is contained in an identified interval. Manski and Pepper (2000, 2009) replaced (1b) with monotonicity restrictions on what they called treatment selection and treatment response. They derived an identified interval that contains the difference between the values of g at two different mass points of X under their assumptions. Neither Chesher (2004) nor Manski and

Pepper (2000, 2009) treated restrictions on the shape of g under (1a) and (1b). The approach described in this paper is non-nested with those of Chesher (2004) and Manski and Pepper (2000, 2009). The approach described here is also distinct from that of Chernozhukov, Lee, and Rosen (2009), who treated estimation of the interval [sup_v θ_l(v), inf_v θ_u(v)], where θ_l and θ_u are unknown functions and v ranges over a possibly infinite set.

The remainder of this paper is organized as follows. In Section 2, it is shown that except in special cases, L(g) can have any value in (−∞, ∞) if the only information about g is that it satisfies (1a) and (1b). It is also shown that under shape restrictions on g that take the form of linear inequalities, L(g) is contained in an identified interval whose upper and lower bounds can be obtained by solving linear programming problems. The bounds obtained by solving these problems are sharp. Section 3 shows that the identified bounds can be estimated consistently by replacing unknown population quantities in the linear programs with sample analogs. The asymptotic distributions of the estimators of the bounds are obtained. Methods for obtaining confidence intervals and for testing certain hypotheses about the bounds are presented. Section 4 presents a bootstrap procedure for estimating the asymptotic distributions of the estimators of the bounds. Section 4 also presents the results of a Monte Carlo investigation of the performance of the bootstrap in finite samples. Section 5 presents an empirical example that illustrates the usefulness of shape restrictions for achieving interval identification of L(g). Section 6 presents the extension to quantile IV models as well as models with exogenous covariates. Section 7 presents concluding comments.

It is assumed throughout this paper that X and W are discretely distributed with finitely many mass points. The ideas in the paper can be extended to models in which X and/or W are continuously distributed, but continuously distributed variables present technical issues that are very different from those arising in the discrete case and require a separate treatment. Consequently, the use of shape restrictions with continuously distributed variables is beyond the scope of this paper.

2. INTERVAL IDENTIFICATION OF L(g)

This section begins by defining notation that will be used in the rest of the paper. Then it is shown that, except in special cases, the data are uninformative about L(g) if the only restrictions on g are those of (1a) and (1b). It is also shown that when linear shape restrictions are imposed on g, L(g) is contained in an identified interval whose upper and lower bounds are obtained by solving linear programming problems. Finally, some properties of the identified interval are obtained.

Denote the supports of X and W, respectively, by {x_j : j = 1, ..., J} and {w_k : k = 1, ..., K}. In this paper, it is assumed that K < J. Order the support points so that x_1 < x_2 < ... < x_J and w_1 < w_2 < ... < w_K. Define g_j = g(x_j), π_jk = P(X = x_j, W = w_k), and m_k = E(Y | W = w_k) P(W = w_k). Then (2) is equivalent to

(3)  m_k = Σ_{j=1}^{J} g_j π_jk ;  k = 1, ..., K.

Let m = (m_1, ..., m_K)′ and g = (g_1, ..., g_J)′. Define Π as the J × K matrix whose (j, k) element is π_jk. Then (3) is equivalent to

(4)  m = Π′g.

Note that rank(Π) < J, because K < J. Therefore, (4) does not point identify g. Write the linear functional L(g) as L(g) = c′g, where c = (c_1, ..., c_J)′ is a vector of known constants. The following proposition shows that, except in special cases, the data are uninformative about L(g) when K < J.

Proposition 1: Assume that K < J and that c is not orthogonal to the null space of Π′. Then any value of L(g) in (−∞, ∞) is consistent with (1a) and (1b).

Proof: Let g_1 be a vector in the row space of Π′ that satisfies Π′g_1 = m. Let g_2 be a vector in the null space of Π′ (the orthogonal complement of the row space) such that c′g_2 ≠ 0. For any real γ, Π′(g_1 + γg_2) = m and L(g_1 + γg_2) = c′g_1 + γc′g_2. Then L(g_1 + γg_2) is consistent with (1a)-(1b), and by choosing γ appropriately, L(g_1 + γg_2) can be made to have any value in (−∞, ∞).

We now impose the linear shape restriction

(5)  Sg ≤ 0,

where S is an M × J matrix of known constants for some integer M > 0. For example, if g is monotone non-increasing, then S is the (J − 1) × J matrix whose j-th row has −1 in position j, 1 in position j + 1, and zeros elsewhere. We assume that g satisfies the shape restriction.

Assumption 1: The unknown function g satisfies (1a)-(1b) with Sg ≤ 0.
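As a concrete illustration of how such linear shape restrictions can be encoded (this sketch is ours, not the paper's; the function names and the Python setting are illustrative), the following builds S for a monotone non-increasing g and for a convex g on the ordered support points, and stacks the two:

```python
import numpy as np

def monotone_nonincreasing_S(J):
    """Rows encode g_{j+1} - g_j <= 0, so S @ g <= 0 means g is non-increasing."""
    S = np.zeros((J - 1, J))
    for j in range(J - 1):
        S[j, j] = -1.0
        S[j, j + 1] = 1.0
    return S

def convex_S(x):
    """Rows encode non-decreasing slopes, so S @ g <= 0 means g is convex
    on the ordered support points x_1 < ... < x_J."""
    x = np.asarray(x, dtype=float)
    J = len(x)
    S = np.zeros((J - 2, J))
    for j in range(1, J - 1):
        d0, d1 = x[j] - x[j - 1], x[j + 1] - x[j]
        # slope(j-1, j) - slope(j, j+1) <= 0
        S[j - 1, j - 1] = -1.0 / d0
        S[j - 1, j] = 1.0 / d0 + 1.0 / d1
        S[j - 1, j + 1] = -1.0 / d1
    return S

# Example: monotone non-increasing and convex g on the support {2, 3, 4, 5}
x_support = [2, 3, 4, 5]
S = np.vstack([monotone_nonincreasing_S(len(x_support)), convex_S(x_support)])
```

With S constructed this way, Sg ≤ 0 holds exactly when g is non-increasing (and, for the stacked matrix, also convex) at the support points of X.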

Sharp bounds on L(g) are the optimal values of the objective functions of the linear programming problems

(6)  minimize (maximize)_h :  c′h
     subject to:  Π′h = m  and  Sh ≤ 0.

Let L_lo and L_up, respectively, denote the optimal values of the objective functions of the minimization and maximization versions of (6). It is clear that under (1a) and (1b), L(g) cannot be less than L_lo or greater than L_up. The following proposition shows that L(g) can also have any value between L_lo and L_up. Therefore, the interval [L_lo, L_up] is the sharp identification set for L(g).

Proposition 2: The identification set of L(g) is convex. In particular, it contains λL_lo + (1 − λ)L_up for any λ ∈ [0, 1].

Proof: Let d = λL_lo + (1 − λ)L_up, where 0 < λ < 1. Let g_lo and g_up be feasible solutions of (6) such that c′g_lo = L_lo and c′g_up = L_up. Then d = c′[λg_lo + (1 − λ)g_up]. The feasible region of a linear programming problem is convex, so λg_lo + (1 − λ)g_up is a feasible solution of (6). Therefore, d is a possible value of L(g) and is in the identified set of L(g).

The values of L_lo and L_up need not be finite. Moreover, there are no simple, intuitively straightforward conditions under which L_lo and L_up are finite. Accordingly, we assume that:

Assumption 2: L_lo > −∞ and L_up < ∞.

Assumption 2 can be tested empirically. A method for doing this is outlined in Section 3.4. However, a test of Assumption 2 is unlikely to be useful in applied research. To see one reason for this, let L̂_lo and L̂_up, respectively, denote the estimates of L_lo and L_up that are described in Section 3.1. The hypothesis that Assumption 2 holds can be rejected only if L̂_lo = −∞ or L̂_up = ∞. These estimates cannot be improved under the assumptions made in this paper, even if it is known that L_lo and L_up are finite. If L̂_lo = −∞ or L̂_up = ∞, then a finite estimate of L_lo or L_up can be obtained only by imposing stronger restrictions on g than are imposed in this paper. A further problem is that a test of boundedness of L_lo or L_up has unavoidably low power because, as is explained in Section 3.4, it amounts to a test of multiple one-sided hypotheses about a population mean vector. Low power makes it unlikely that a false hypothesis of boundedness of L_lo or L_up can be rejected even if L_lo and L_up are infinite.
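The bounds in (6) are ordinary linear programs, so any LP solver can compute them once Π, m, S, and c are supplied. The following is a minimal sketch using scipy.optimize.linprog; the function name and interface are our own, and an unbounded program is reported as an infinite bound, which corresponds to a failure of Assumption 2.

```python
import numpy as np
from scipy.optimize import linprog

def identification_bounds(c, Pi, m, S):
    """Solve the two linear programs in (6):
       min / max  c'h   subject to  Pi' h = m,  S h <= 0,  h unrestricted in sign.
       Returns (L_lo, L_up); an unbounded program yields -inf or +inf,
       and an infeasible program yields nan."""
    c = np.asarray(c, dtype=float)
    S = np.asarray(S, dtype=float)
    J = Pi.shape[0]
    common = dict(A_eq=Pi.T, b_eq=m, A_ub=S, b_ub=np.zeros(S.shape[0]),
                  bounds=[(None, None)] * J, method="highs")

    lo = linprog(c, **common)        # minimization gives the lower bound
    up = linprog(-c, **common)       # maximize c'h by minimizing -c'h

    L_lo = lo.fun if lo.status == 0 else (-np.inf if lo.status == 3 else np.nan)
    L_up = -up.fun if up.status == 0 else (np.inf if up.status == 3 else np.nan)
    return L_lo, L_up
```

In practice Π and m are unknown and are replaced by the sample analogs introduced in Section 3.1.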

We also assume:

Assumption 3: There is a vector h satisfying Π′h − m = 0 and Sh ≤ −ε for some vector ε > 0. (The inequality holds component-wise, and each component of ε exceeds zero.)

This assumption ensures that problem (6) has a feasible solution with probability approaching 1 as n → ∞ when Π and m are replaced by consistent estimators. It also implies that L_lo < L_up, so L(g) is not point identified. The methods and results of this paper do not apply to settings in which L(g) is point identified. A method for testing Assumption 3 is described in Section 3.4.

2.1 Further Properties of Problem (6)

This section presents properties of problem (6) that will be used later in this paper. These are well-known properties of linear programs. Their proofs are available in many references on linear programming, such as Hadley (1962).

We begin by putting problem (6) into standard LP form. In standard form, the objective function is minimized, all constraints are equalities, and all variables of optimization are non-negative. Problem (6) can be put into standard form by adding slack variables to the inequality constraints and writing each component of h as the difference between its positive and negative parts. Denote the resulting vector of variables of optimization by z. The dimension of z is 2J + M. There are J variables for the positive parts of the components of h, J variables for the negative parts of the components of h, and M slack variables for the inequality constraints. The (2J + M) × 1 vector of objective function coefficients is c̃ = (c′, −c′, 0_M′)′, where 0_M is an M × 1 vector of zeros. The corresponding constraint matrix has dimension (K + M) × (2J + M) and is

A = [ Π′   −Π′   0_{K×M} ;  S   −S   I_{M×M} ],

where I_{M×M} is the M × M identity matrix. The vector of right-hand sides of the constraints is the (K + M) × 1 vector

m_s = (m′, 0_M′)′.

With this notation, the standard form of (6) is

(7)  minimize_z :  c̃′z  or  −c̃′z
     subject to:  Az = m_s  and  z ≥ 0.

Maximizing c̃′z is equivalent to minimizing −c̃′z.
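The construction just described is mechanical, and a direct transcription into code may help fix the dimensions. The sketch below (assumed names; not from the paper) builds c̃, A, and m_s from (c, Π, m, S):

```python
import numpy as np

def to_standard_form(c, Pi, m, S):
    """Build the standard-form arrays of problem (7):
       variables z = (h_plus, h_minus, slack), objective c_tilde' z,
       constraints A z = m_s with z >= 0."""
    c = np.asarray(c, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    S = np.asarray(S, dtype=float)
    J, K = Pi.shape
    M = S.shape[0]
    c_tilde = np.concatenate([c, -c, np.zeros(M)])            # (c', -c', 0_M')'
    A = np.block([[Pi.T, -Pi.T, np.zeros((K, M))],
                  [S,    -S,    np.eye(M)]])                  # (K+M) x (2J+M)
    m_s = np.concatenate([np.asarray(m, dtype=float), np.zeros(M)])
    return c_tilde, A, m_s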

Make the following assumption.

Assumption 4: Let Ā be any K + M columns of A corresponding to elements of a basic solution to Az = m_s. Let λ be the smallest eigenvalue of (Ā, m_s)(Ā, m_s)′. There is a δ > 0, not depending on the basic solution or the choice of columns of A, such that λ ≥ δ.

Assumption 4 ensures that the basic optimal solution(s) to (6) and (7) are nondegenerate. The assumption can be tested using methods like those described by Härdle and Hart (1992).

Let z_opt be an optimal solution to either version of (7). Let z_{B,opt} denote the (K + M) × 1 vector of basic variables in the optimal solution. Let A_B denote the (K + M) × (K + M) matrix formed by the columns of A corresponding to the basic variables. Then z_{B,opt} = A_B^{-1} m_s, and under Assumption 4, z_{B,opt} > 0. Now let c_B be the (K + M) × 1 vector of components of c̃ corresponding to the components of z_{B,opt}. The optimal value of the objective function corresponding to the basic solution z_{B,opt} is

(8a)  Z = c_B′ A_B^{-1} m_s

for the minimization version of (6) and

(8b)  Z = −c_B′ A_B^{-1} m_s

for the maximization version.

In standard form, the dual of problem (6) is

(9)  maximize_q :  m_d′q  or  −m_d′q
     subject to:  A_d q = c  and  q ≥ 0,

where q is a (2K + M) × 1 vector, m_d = (0_M′, m′, −m′)′, and A_d is the J × (2K + M) matrix A_d = (−S′, Π, −Π).

Under Assumptions 1-3, (6) and (9) both have feasible solutions. The optimal solutions of (6) and (9) are bounded, and the optimal values of the objective functions of (6) and (9) are the same. The dual problem is used in Section 3.4 to form a test of Assumption 2.
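For a given choice of K + M columns of A, the objective value of the associated basic solution in (8a) is a simple linear-algebra computation. The following sketch is our own helper (reusing the arrays produced by the to_standard_form sketch above); it evaluates that value and reports singular or infeasible bases as None.

```python
import numpy as np

def basic_solution_value(cols, c_tilde, A, m_s):
    """Objective value c_B' A_B^{-1} m_s at the basic solution defined by the
       column index set `cols` (length K + M), as in (8a).  The maximization
       version (8b) is the same quantity computed with -c_tilde in place of
       c_tilde.  Returns None if the basis matrix is singular or the basic
       solution violates z >= 0."""
    A_B = A[:, cols]
    c_B = c_tilde[cols]
    try:
        z_B = np.linalg.solve(A_B, m_s)
    except np.linalg.LinAlgError:
        return None
    if np.any(z_B < 0):              # a basic *feasible* solution needs z >= 0
        return None
    return float(c_B @ z_B)
```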

3. ESTIMATION OF L_lo AND L_up

This section presents consistent estimators of L_lo and L_up. The asymptotic distributions of these estimators are presented, and methods for obtaining confidence intervals are described. Tests of Assumptions 2 and 3 are outlined.

3.1 Consistent Estimators of L_lo and L_up

L_lo and L_up can be estimated consistently by replacing Π and m in (6) with consistent estimators. To this end, define

m̂_k = n^{-1} Σ_{i=1}^{n} Y_i I(W_i = w_k) ;  k = 1, ..., K,

π̂_jk = n^{-1} Σ_{i=1}^{n} I(X_i = x_j) I(W_i = w_k) ;  j = 1, ..., J;  k = 1, ..., K.

Then m̂_k and π̂_jk, respectively, are strongly consistent estimators of m_k and π_jk. Define m̂ = (m̂_1, ..., m̂_K)′. Define Π̂ as the J × K matrix whose (j, k) element is π̂_jk. Define L̂_lo and L̂_up as the optimal values of the objective functions of the linear programs

(10)  minimize (maximize)_h :  c′h
      subject to:  Π̂′h = m̂  and  Sh ≤ 0.

Assumptions 2 and 3 ensure that (10) has a feasible solution and a bounded optimal solution with probability approaching 1 as n → ∞. The standard form of (10) is

(11)  minimize_z :  c̃′z  or  −c̃′z
      subject to:  Âz = m̂_s  and  z ≥ 0,

where

Â = [ Π̂′   −Π̂′   0_{K×M} ;  S   −S   I_{M×M} ]   and   m̂_s = (m̂′, 0_M′)′.
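The estimators m̂ and Π̂ are simple sample averages. A minimal sketch (illustrative names; it feeds the estimates into the identification_bounds helper shown earlier) is:

```python
import numpy as np

def estimate_Pi_m(Y, X, W, x_support, w_support):
    """Sample analogs: pi_hat[j, k] = mean of 1{X = x_j} 1{W = w_k},
       m_hat[k] = mean of Y 1{W = w_k}."""
    Y, X, W = map(np.asarray, (Y, X, W))
    Pi_hat = np.array([[np.mean((X == xj) & (W == wk)) for wk in w_support]
                       for xj in x_support])
    m_hat = np.array([np.mean(Y * (W == wk)) for wk in w_support])
    return Pi_hat, m_hat

# Plug the estimates into the linear programs (10), for example:
# Pi_hat, m_hat = estimate_Pi_m(Y, X, W, x_support=[2, 3, 4, 5], w_support=[0, 1])
# L_lo_hat, L_up_hat = identification_bounds(c, Pi_hat, m_hat, S)
```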

As a consequence of (8a)-(8b) and the strong consistency of Π̂ and m̂ for Π and m, respectively, we have

Theorem 1: Let Assumptions 1-3 hold. As n → ∞, L̂_lo → L_lo almost surely and L̂_up → L_up almost surely.

3.2 The Asymptotic Distributions of L̂_lo and L̂_up

This section obtains the asymptotic distributions of L̂_lo and L̂_up and shows how to use these to obtain confidence regions for the identification interval [L_lo, L_up] and the linear functional L(g). We assume that

Assumption 5: E(Y² | W = w_k) < ∞ for each k = 1, ..., K.

We begin by deriving the asymptotic distribution of L̂_lo. The derivation of the asymptotic distribution of L̂_up is similar. Let B_lo denote the set of optimal basic solutions to the minimization version of (6), and let |B_lo| denote the number of basic solutions in B_lo. The basic solutions are at vertices of the feasible region. Because there are only finitely many vertices, the difference between the optimal value of the objective function of (6) and the value of the objective function at any non-optimal feasible vertex is bounded away from zero. Moreover, the law of the iterated logarithm ensures that Π̂ and m̂, respectively, are in arbitrarily small neighborhoods of Π and m with probability 1 for all sufficiently large n. Therefore, for all sufficiently large n, the probability is zero that a basic solution is optimal in (10) but not in (6). Let k = 1, 2, ... index the basic solutions to (10). Let the random variable Ẑ_k denote the value of the objective function of (10) corresponding to basic solution k. Let Â_k and c_k, respectively, be the versions of A_B and c_B associated with the k-th basic solution of (6) or (10). Then

(12)  Ẑ_k = c_k′ Â_k^{-1} m̂_s.

Moreover, with probability 1 for all sufficiently large n,

L̂_lo = min_{k ∈ B_lo} Ẑ_k   and   n^{1/2}(L̂_lo − L_lo) = min_{k ∈ B_lo} n^{1/2}(Ẑ_k − L_lo).

Let Z_k denote the value of the objective function of (6) at the k-th basic solution. Then Z_k = L_lo if basic solution k is optimal. Because B_lo contains the optimal basic solution(s) to (6), with probability 1 for all sufficiently large n,

n^{1/2}(L̂_lo − L_lo) = min_{k ∈ B_lo} n^{1/2}(Ẑ_k − L_lo) + o_p(1) = min_{k ∈ B_lo} n^{1/2}(Ẑ_k − Z_k) + o_p(1).

An application of the delta method yields

n^{1/2}(Ẑ_k − Z_k) = n^{1/2} c_k′(Â_k^{-1} m̂_s − A_k^{-1} m_s)
                  = c_k′[A_k^{-1} n^{1/2}(m̂_s − m_s) − A_k^{-1} n^{1/2}(Â_k − A_k) A_k^{-1} m_s] + o_p(1),

where A_k is the version of A_B that is associated with basic solution k. The elements of Â_k and m̂_s are sample moments or constants, depending on the basic solution, and not all are constants. In addition, E(Â_k) = A_k and E(m̂_s) = m_s. Therefore, it follows from the Lindeberg-Levy and Cramér-Wold theorems that the random components of n^{1/2}(Ẑ_k − Z_k) (k ∈ B_lo) are asymptotically multivariate normally distributed with mean 0. There may be some values of k for which n^{1/2}(Ẑ_k − Z_k) is deterministically 0. This can happen, for example, if the objective function of (6) is proportional to the left-hand side of one of the shape constraints. In such cases, the entire vector n^{1/2}(Ẑ_k − Z_k) (k ∈ B_lo) has asymptotically a degenerate multivariate normal distribution. Thus, n^{1/2}(L̂_lo − L_lo) is asymptotically distributed as the minimum of a random vector with a possibly degenerate multivariate normal distribution whose mean is zero. Denote this random vector by Z_lo and its covariance matrix by Σ_lo. In general, Σ_lo is a large matrix whose elements are algebraically complex and tedious to enumerate. Section 4 presents bootstrap methods for estimating the asymptotic distribution of n^{1/2}(L̂_lo − L_lo) that do not require knowledge of Σ_lo or B_lo.

Now consider L̂_up. Let B_up denote the set of optimal basic solutions to the maximization version of (6), and let |B_up| denote the number of basic solutions in B_up. Define Ẑ_k = −c_k′ Â_k^{-1} m̂_s and Z_k = −c_k′ A_k^{-1} m_s for k ∈ B_up. Then arguments like those made for L̂_lo show that

n^{1/2}(L̂_up − L_up) = −min_{k ∈ B_up} n^{1/2}(Ẑ_k + L_up) + o_p(1) = −min_{k ∈ B_up} n^{1/2}(Ẑ_k − Z_k) + o_p(1).

The asymptotic distributional arguments made for n^{1/2}(L̂_lo − L_lo) also apply to n^{1/2}(L̂_up − L_up). Therefore, n^{1/2}(L̂_up − L_up) is asymptotically distributed as the maximum of a random vector with a possibly degenerate multivariate normal distribution whose mean is zero. Denote this vector by Z_up and its covariance matrix by Σ_up. Like Σ_lo, Σ_up is a large matrix whose elements are algebraically complex. Section 4 presents bootstrap methods for estimating the asymptotic distribution of n^{1/2}(L̂_up − L_up) that do not require knowledge of Σ_up or B_up.

It follows from the foregoing discussion that [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)] is asymptotically distributed as (min Z_lo, max Z_up). min Z_lo and max Z_up are not independent of one another. The bootstrap procedure described in Section 4 consistently estimates the asymptotic distribution of [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)]. The foregoing results are summarized in the following theorem.

Theorem 2: Let Assumptions 1-5 hold. As n → ∞, (i) n^{1/2}(L̂_lo − L_lo) converges in distribution to the minimum of a random vector Z_lo with a possibly degenerate multivariate normal distribution, mean zero, and covariance matrix Σ_lo; (ii) n^{1/2}(L̂_up − L_up) converges in distribution to the maximum of a random vector Z_up with a possibly degenerate multivariate normal distribution, mean zero, and covariance matrix Σ_up; (iii) [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)] converges in distribution to (min Z_lo, max Z_up).

The asymptotic distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)] are simpler if the minimization and maximization versions of (6) have unique optimal solutions. Specifically, n^{1/2}(L̂_lo − L_lo) and n^{1/2}(L̂_up − L_up) are asymptotically univariate normally distributed, and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)] is asymptotically bivariate normally distributed. Let σ²_lo and σ²_up, respectively, denote the variances of the asymptotic distributions of n^{1/2}(L̂_lo − L_lo) and n^{1/2}(L̂_up − L_up). Let ρ denote the correlation coefficient of

the asymptotic bivariate normal distribution of [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)]. Let N_2(0, ρ) denote the bivariate normal distribution with variances of 1 and correlation coefficient ρ. Then the following corollary to Theorem 2 holds.

Corollary 1: Let Assumptions 1-5 hold. If the optimal solution to the minimization version of (6) is unique, then n^{1/2}(L̂_lo − L_lo)/σ_lo →_d N(0, 1). If the optimal solution to the maximization version of (6) is unique, then n^{1/2}(L̂_up − L_up)/σ_up →_d N(0, 1). If the optimal solutions to both versions of (6) are unique, then [n^{1/2}(L̂_lo − L_lo)/σ_lo, n^{1/2}(L̂_up − L_up)/σ_up] →_d N_2(0, ρ).

Theorem 2 and Corollary 1 can be used to obtain asymptotic confidence intervals for [L_lo, L_up] and L(g). A symmetrical asymptotic 1 − α confidence interval for [L_lo, L_up] is [L̂_lo − n^{-1/2}c_α, L̂_up + n^{-1/2}c_α], where c_α satisfies

lim_{n→∞} P(L̂_lo − n^{-1/2}c_α ≤ L_lo, L̂_up + n^{-1/2}c_α ≥ L_up) = 1 − α.

Equal-tailed and minimum-length asymptotic confidence intervals can be obtained in a similar way. A confidence interval for L(g) can be obtained by using ideas described by Imbens and Manski (2004) and Stoye (2009). In particular, as is discussed by Imbens and Manski (2004), an asymptotically valid pointwise 1 − α confidence interval for L(g) can be obtained as the intersection of one-sided confidence intervals for L_lo and L_up. Thus, [L̂_lo − n^{-1/2}c_{α,lo}, L̂_up + n^{-1/2}c_{α,up}] is an asymptotic 1 − α confidence interval for L(g), where c_{α,lo} and c_{α,up}, respectively, satisfy

lim_{n→∞} P[n^{1/2}(L̂_lo − L_lo) ≤ c_{α,lo}] = 1 − α   and   lim_{n→∞} P[n^{1/2}(L̂_up − L_up) ≥ −c_{α,up}] = 1 − α.

Estimating the critical values c_{α,lo} and c_{α,up}, like estimating the asymptotic distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)], is difficult because Σ_lo and Σ_up are complicated unknown matrices, and B_lo and B_up are unknown sets. Section 4 presents bootstrap methods for estimating c_α, c_{α,lo}, and c_{α,up} without knowledge of Σ_lo, Σ_up, B_lo, or B_up.
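To fix ideas about what such a bootstrap involves computationally, the following is a rough sketch of a naive resampling scheme that simply re-estimates (Π̂, m̂) and re-solves the sample linear programs on each bootstrap draw, then reads off one-sided critical values. This is a simplification for illustration only: the procedures developed in Section 4 instead fix the optimal basic solutions and bootstrap the associated linear-in-moments statistics, which is what delivers the formal results stated there. The helpers are the hypothetical ones sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_bootstrap_interval(Y, X, W, c, S, x_support, w_support,
                             B=999, alpha=0.05):
    """Naive sketch: resample the data, re-estimate (Pi_hat, m_hat), re-solve the
       two LPs, and take one-sided critical values.  Infinite draws indicate an
       unbounded bootstrap program (a symptom that Assumption 2 may fail)."""
    Y, X, W = map(np.asarray, (Y, X, W))
    n = len(Y)
    Pi_hat, m_hat = estimate_Pi_m(Y, X, W, x_support, w_support)
    L_lo_hat, L_up_hat = identification_bounds(c, Pi_hat, m_hat, S)

    lo_draws, up_draws = np.empty(B), np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        Pi_b, m_b = estimate_Pi_m(Y[idx], X[idx], W[idx], x_support, w_support)
        lo_b, up_b = identification_bounds(c, Pi_b, m_b, S)
        lo_draws[b] = np.sqrt(n) * (lo_b - L_lo_hat)  # mimics sqrt(n)(L_lo_hat - L_lo)
        up_draws[b] = np.sqrt(n) * (up_b - L_up_hat)  # mimics sqrt(n)(L_up_hat - L_up)

    c_lo = np.quantile(lo_draws, 1 - alpha)           # one-sided critical values
    c_up = np.quantile(-up_draws, 1 - alpha)
    return L_lo_hat - c_lo / np.sqrt(n), L_up_hat + c_up / np.sqrt(n)
```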

3.3 Uniformly Correct Confidence Intervals

The asymptotic distributional results presented in Section 3.2 do not hold uniformly over all distributions of (Y, X, W) satisfying Assumptions 1-5. This is because the asymptotic distributions of n^{1/2}(L̂_lo − L_lo) and n^{1/2}(L̂_up − L_up) are discontinuous when the number of optimal basic solutions changes. This section explains how to overcome this problem for n^{1/2}(L̂_lo − L_lo). Similar arguments apply to n^{1/2}(L̂_up − L_up) and to the joint distribution of [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)]. The results for these quantities are presented without further explanation.

We now consider n^{1/2}(L̂_lo − L_lo). Let k be any basic solution to problem (6). Let Ẑ_k denote the value of the objective function of the minimization version of (10) corresponding to basic solution k. Let B̃_lo be a set that contains every optimal basic solution to the minimization version of (6). Let L_{lo,k} be the value of the objective function of the minimization version of (6) corresponding to basic solution k. Then n^{1/2}(L_{lo,k} − L_lo) > 0 for every non-optimal k and n^{1/2}(L_{lo,k} − L_lo) = 0 for every optimal k. Therefore, for any c > 0,

P[n^{1/2}(L̂_lo − L_lo) > c] ≤ P{ min_{k ∈ B̃_lo} [n^{1/2}(Ẑ_k − L_{lo,k}) + n^{1/2}(L_{lo,k} − L_lo)] > c }.

Moreover, for any optimal basic solution k ∈ B̃_lo,

P[n^{1/2}(L̂_lo − L_lo) > c] ≤ P[n^{1/2}(Ẑ_k − L_{lo,k}) > c].

For any α ∈ (0, 1), let c_{k,α,lo} satisfy

P[n^{1/2}(Ẑ_k − L_{lo,k}) > c_{k,α,lo}] = α.

Define c̄_{α,lo} = max_{k ∈ B̃_lo} c_{k,α,lo}. Then P[n^{1/2}(L̂_lo − L_lo) > c̄_{α,lo}] ≤ α uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5.

Now define

B̂_{n,lo} = {k : Ẑ_k − L̂_lo ≤ n^{-1/2} log n}.

For all sufficiently large n, B̂_{n,lo} contains all optimal basic solutions to the minimization version of (10) with probability 1. Define ĉ_{α,lo} = max_{k ∈ B̂_{n,lo}} c_{k,α,lo}. Then

P[n^{1/2}(L̂_lo − L_lo) > ĉ_{α,lo}] ≤ α + ε_n

uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5, where ε_n → 0 with probability 1 as n → ∞. Moreover, [L̂_lo − n^{-1/2}ĉ_{α,lo}, ∞) is a confidence interval for L_lo whose asymptotic coverage probability is at least 1 − α uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5.

To obtain a uniform confidence interval for L_up, let Ẑ_k denote the value of the objective function of the maximization version of (10) corresponding to basic solution k, and let L_{up,k} be the value of the objective function of the maximization version of (6) corresponding to basic solution k. Define

B̂_{n,up} = {k : L̂_up − Ẑ_k ≤ n^{-1/2} log n}.

Define c_{k,α,up} by

P[n^{1/2}(L_{up,k} − Ẑ_k) > c_{k,α,up}] = α.

Finally, define ĉ_{α,up} = max_{k ∈ B̂_{n,up}} c_{k,α,up}. Then arguments like those made for L_lo show that

P[n^{1/2}(L_up − L̂_up) > ĉ_{α,up}] ≤ α + ε_n

uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5, where ε_n → 0 with probability 1 as n → ∞. Moreover, (−∞, L̂_up + n^{-1/2}ĉ_{α,up}] is a confidence interval for L_up whose asymptotic coverage probability is at least 1 − α uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5.

To obtain a confidence region for [L_lo, L_up], for each k ∈ B̂_{n,lo} and k′ ∈ B̂_{n,up}, let c_{k,k′,α} satisfy

P[n^{1/2}(Ẑ_k − L_{lo,k}) ≤ c_{k,k′,α} ;  n^{1/2}(L_{up,k′} − Ẑ_{k′}) ≤ c_{k,k′,α}] = 1 − α,

where Ẑ_k and Ẑ_{k′}, respectively, are the values of the objective functions of the minimization and maximization versions of (10) corresponding to basic solutions k and k′. Define c̄_α = max_{k ∈ B̂_{n,lo}, k′ ∈ B̂_{n,up}} c_{k,k′,α}. Then

P(L̂_lo − n^{-1/2}c̄_α ≤ L_lo ≤ L_up ≤ L̂_up + n^{-1/2}c̄_α) ≥ 1 − α − ε_n

uniformly over distributions of (Y, X, W) that satisfy Assumptions 1-5, where ε_n → 0 with probability 1 as n → ∞. Moreover, [L̂_lo − n^{-1/2}c̄_α, L̂_up + n^{-1/2}c̄_α] is an asymptotic 1 − α confidence region for [L_lo, L_up].

Section 4 outlines bootstrap methods for estimating the critical values ĉ_{α,lo}, ĉ_{α,up}, and c̄_α in applications.

3.4 Testing Assumptions 2 and 3

We begin this section by outlining a test of Assumption 2. A linear program has a bounded solution if and only if its dual has a feasible solution, and a linear program has a basic feasible solution if it has a feasible solution. Therefore, Assumption 2 can be tested by testing the hypothesis that the dual problem (9) has a basic feasible solution. Let k = 1, 2, ... index the basic solutions to (9). A basic solution is q_k = (A_{d,k})^{-1}c for the dual of the minimization version of (6) or q_k = −(A_{d,k})^{-1}c for the dual of the maximization version, where A_{d,k} is the J × J matrix consisting of the columns of A_d corresponding to the k-th basic solution of (9). The dual problem has a basic feasible solution if (A_{d,k})^{-1}c ≥ 0 for some k for the minimization version of (6) and −(A_{d,k})^{-1}c ≥ 0 for some k for the maximization version. Therefore, testing boundedness of L_lo (L_up) is equivalent to testing the hypothesis

H_0: (A_{d,k})^{-1}c ≥ 0 (−(A_{d,k})^{-1}c ≥ 0) for some k.

To test either hypothesis, define Â_{d,k} as the matrix that is obtained by replacing the components of Π with the corresponding components of Π̂ in A_{d,k}. Then an application of the delta method yields

(13)  (Â_{d,k})^{-1}c = (A_{d,k})^{-1}c − (A_{d,k})^{-1}(Â_{d,k} − A_{d,k})(A_{d,k})^{-1}c + o_p(n^{-1/2}).

Equation (13) shows that the hypothesis H_0: (A_{d,k})^{-1}c ≥ 0 (−(A_{d,k})^{-1}c ≥ 0) is asymptotically equivalent to a one-sided hypothesis about a vector of population means. Testing H_0: (A_{d,k})^{-1}c ≥ 0 (−(A_{d,k})^{-1}c ≥ 0) for some k is asymptotically equivalent to testing a one-sided hypothesis about a vector of non-independent population means. Methods for carrying out such tests and issues associated with tests of multiple hypotheses are discussed by Lehmann and Romano (2005) and Romano, Shaikh, and Wolf (2010), among others. The hypothesis of boundedness of L_lo is rejected if H_0: (A_{d,k})^{-1}c ≥ 0 is rejected for at least one component of (A_{d,k})^{-1}c for each k. The hypothesis of boundedness of

L_up is rejected if H_0: −(A_{d,k})^{-1}c ≥ 0 is rejected for at least one component of −(A_{d,k})^{-1}c for each k.

We now consider Assumption 3. Specifically, we describe a test of the null hypothesis, H_0, that there is a vector g satisfying Π′g − m = 0 and Sg ≤ −ε for some M × 1 vector ε > 0. A test can be carried out by solving the quadratic programming problem

(14)  minimize_g :  Q_n(g) ≡ ‖Π̂′g − m̂‖²
      subject to:  Sg ≤ −ε,

where ‖·‖ denotes the Euclidean norm on R^K. Let Q̂_opt denote the optimal value of the objective function in (14). Under H_0, Q̂_opt →_p 0. Therefore, the result Q̂_opt ≈ 0 is consistent with H_0, whereas a large value of Q̂_opt is inconsistent with H_0. An asymptotically valid critical value can be obtained by using the bootstrap.

Let j = 1, 2, ... index the sets of constraints that may be binding in the optimal solution to (14), including the empty set, which represents the possibility that no constraints are binding. Let S_j denote the matrix consisting of the rows of S corresponding to constraint set j. Set S_j = 0 if no constraints bind. The bootstrap procedure is:

(i) Use the estimation data {Y_i, X_i, W_i} to solve problem (14). The optimal g is not unique. Use some procedure to choose a unique g from the optimal ones. One possibility is to set ĝ = ĝ_ĵ, where

(15)  ĝ_ĵ = A_11^+ Π̂ m̂ − A_12^+ ε_ĵ,

ĵ is the set of constraints that are binding in (14), A_11^+ is a J × J matrix, A_12^+ is a matrix with J rows and as many columns as there are binding constraints, A_12^+ = 0 if there are no binding constraints, and

(16)  [ A_11^+  A_12^+ ;  A_21^+  A_22^+ ]  is the Moore-Penrose generalized inverse of  [ Π̂Π̂′  S_ĵ′ ;  S_ĵ  0 ].

Define P̂_opt = Π̂′ĝ_ĵ − m̂ and Q̂_opt = P̂_opt′P̂_opt.

(ii) Generate a bootstrap sample {Y_i*, X_i*, W_i* : i = 1, ..., n} by sampling the estimation data {Y_i, X_i, W_i : i = 1, ..., n} randomly with replacement. Compute the bootstrap versions of m̂ and π̂_jk. These are

(17)  m̂_k* = n^{-1} Σ_{i=1}^{n} Y_i* I(W_i* = w_k),

(18)  π̂_jk* = n^{-1} Σ_{i=1}^{n} I(X_i* = x_j) I(W_i* = w_k).

Compute P̂_opt* = Π̂*′ĝ_ĵ* − m̂* and Q̂_opt* = P̂_opt*′P̂_opt*, where ĝ_ĵ* is obtained by replacing Π̂ and m̂ with Π̂* and m̂*, respectively, in (15) and (16).

(iii) Estimate the asymptotic distribution of Q̂_opt by the empirical distribution of (P̂_opt* − P̂_opt)′(P̂_opt* − P̂_opt) that is obtained by repeating steps (i) and (ii) many times (the bootstrap distribution). Estimate the asymptotic α-level critical value of Q̂_opt by the 1 − α quantile of the bootstrap distribution of (P̂_opt* − P̂_opt)′(P̂_opt* − P̂_opt). Denote this critical value by q̂_α.

If H_0 is true and the set of binding constraints in the population quadratic programming problem,

(19)  minimize_g :  ‖Π′g − m‖²   subject to   Sg ≤ −ε,

is unique (say, j = j_0), then ĵ = j_0 with probability 1 for all sufficiently large n. The bootstrap estimates the asymptotic distribution of nQ̂_opt consistently because P̂_opt is a smooth function of sample moments. Therefore, P(Q̂_opt > q̂_α) → α as n → ∞.

Now suppose that (19) has several optimal solutions with distinct sets of binding constraints, a circumstance that we consider unlikely because it requires a very special combination of values of Π, m, and ε. Let C_opt denote the sets of optimal binding constraints. Then nQ̂_opt is asymptotically distributed as min_{j ∈ C_opt} n‖Π̂′ĝ_j − m̂‖². But min_{j ∈ C_opt} ‖Π̂′ĝ_j − m̂‖² ≤ ‖Π̂′ĝ_ĵ − m̂‖². The bootstrap critical value is obtained under the incorrect assumption that ĵ is asymptotically the only set of binding

constraints in (19). Therefore, lim_{n→∞} P(Q̂_opt > q̂_α) ≤ α. The probability of rejecting a correct H_0 is less than or equal to the nominal probability.

4. BOOTSTRAP ESTIMATION OF THE ASYMPTOTIC DISTRIBUTIONS OF L̂_lo AND L̂_up

This section presents two bootstrap procedures that estimate the asymptotic distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)] without requiring knowledge of Σ_lo, Σ_up, B_lo, or B_up. The procedures also estimate the critical values c_α, c_{α,lo}, and c_{α,up}. The first procedure yields confidence regions for [L_lo, L_up] and L(g) with asymptotically correct coverage probabilities. That is, the asymptotic coverage probabilities of these regions equal the nominal coverage probabilities. However, this procedure has the disadvantage of requiring a user-selected tuning parameter. The procedure's finite-sample performance can be sensitive to the choice of the tuning parameter, and a poor choice can cause the true coverage probabilities to be considerably lower than the nominal ones. The second procedure does not require a user-selected tuning parameter. It yields confidence regions with asymptotically correct coverage probabilities if the optimal solutions to the minimization and maximization versions of problem (6) are unique (that is, if B_lo and B_up each contain only one basic solution). Otherwise, the asymptotic coverage probabilities are equal to or greater than the nominal coverage probabilities. The procedures are described in Section 4.1. Section 4.2 presents the results of a Monte Carlo investigation of the numerical performance of the procedures.

4.1 The Bootstrap Procedures

This section describes the two bootstrap procedures. Both assume that the optimal solutions to the minimization and maximization versions of problem (10) are random. The procedures are not needed for deterministic optimal solutions. Let {c_n : n = 1, 2, ...} be a sequence of positive constants such that c_n → 0 and c_n [n/(log log n)]^{1/2} → ∞ as n → ∞. Let P* denote the probability measure induced by bootstrap sampling. The first bootstrap procedure is as follows.

(i) Generate a bootstrap sample {Y_i*, X_i*, W_i* : i = 1, ..., n} by sampling the estimation data {Y_i, X_i, W_i : i = 1, ..., n} randomly with replacement. Use (17) and (18) to compute the bootstrap versions of m̂ and π̂_jk, which are m̂* and π̂_jk*. Define Π̂* and m̂*, respectively, as the matrix and vector that are obtained by replacing the estimation sample with the bootstrap sample in Π̂ and m̂. For any basic

solution k to problem (6), define Â_k* and m̂_s* by replacing the estimation sample with the bootstrap sample in Â_k and m̂_s.

(ii) Define problem (B10) as problem (10) with Π̂* and m̂* in place of Π̂ and m̂. Solve (B10), and let k̂* denote the resulting optimal basic solution. Let L̂_{lo,k} and L̂_{up,k}, respectively, denote the values of the objective function of the minimization and maximization versions of (10) at basic solution k. For basic solution k, define

Δ_{1k}* = n^{1/2}(c_k′ Â_k*^{-1} m̂_s* − c_k′ Â_k^{-1} m̂_s)   and   Δ_{2k}* = −n^{1/2}(c_k′ Â_k*^{-1} m̂_s* − c_k′ Â_k^{-1} m̂_s).

(iii) Repeat steps (i) and (ii) many times. Define

B̂_1* = {k : L̂_{lo,k} − L̂_lo ≤ c_n}   and   B̂_2* = {k : L̂_up − L̂_{up,k} ≤ c_n}.

(iv) Estimate the distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)], respectively, by the bootstrap distributions of Δ_1* ≡ min_{k ∈ B̂_1*} Δ_{1k}*, −Δ_2* ≡ max_{k ∈ B̂_2*} Δ_{1k}*, and (Δ_1*, −Δ_2*), where Δ_2* ≡ min_{k ∈ B̂_2*} Δ_{2k}*. Estimate c_{α,lo} and c_{α,up}, respectively, by ĉ_{α,lo}* and ĉ_{α,up}*, which solve

P*(Δ_1* ≤ ĉ_{α,lo}*) = 1 − α   and   P*(Δ_2* ≤ ĉ_{α,up}*) = 1 − α.

Asymptotically, n^{1/2}(c_k′Â_k*^{-1}m̂_s* − c_k′Â_k^{-1}m̂_s) is a linear function of sample moments. Therefore, the bootstrap distributions of Δ_{1k}* and Δ_{2k}* uniformly consistently estimate the asymptotic distributions of ±n^{1/2}(c_k′Â_k^{-1}m̂_s − c_k′A_k^{-1}m_s) for each basic solution k (Mammen 1992). In addition, the foregoing procedure consistently estimates B_lo and B_up. Asymptotically, every basic solution that is feasible in problem (6) has a non-zero probability of being optimal in (B10). Therefore, with probability approaching 1 as n → ∞, every feasible basic solution will be realized in sufficiently many bootstrap repetitions. Moreover, it follows from the law of the iterated logarithm that, with probability 1 for all sufficiently large n, only basic solutions k ∈ B_lo satisfy L̂_{lo,k} − L̂_lo ≤ c_n

and only basic solutions k ∈ B_up satisfy L̂_up − L̂_{up,k} ≤ c_n. Therefore, B̂_1* = B_lo and B̂_2* = B_up with probability 1 for all sufficiently large n. It follows that the bootstrap distributions of Δ_1*, −Δ_2*, and (Δ_1*, −Δ_2*) uniformly consistently estimate the asymptotic distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)], respectively. These distributions are continuous, so ĉ_α* is a consistent estimator of c_α. These results are summarized in the following theorem.

Theorem 3: Let Assumptions 1-5 hold. Let n → ∞. Under the first bootstrap procedure,
(i)  sup_{−∞ < z < ∞} | P*(Δ_1* ≤ z) − P[n^{1/2}(L̂_lo − L_lo) ≤ z] | →_p 0;
(ii)  sup_{−∞ < z < ∞} | P*(−Δ_2* ≤ z) − P[n^{1/2}(L̂_up − L_up) ≤ z] | →_p 0;
(iii)  sup_{−∞ < z_1, z_2 < ∞} | P*(Δ_1* ≤ z_1, −Δ_2* ≤ z_2) − P[n^{1/2}(L̂_lo − L_lo) ≤ z_1, n^{1/2}(L̂_up − L_up) ≤ z_2] | →_p 0;
(iv)  |ĉ_α* − c_α| →_p 0.

The theory of the bootstrap assumes that there are infinitely many bootstrap repetitions, but only finitely many are possible in practice. With finitely many repetitions, it is possible that the first bootstrap procedure does not find all basic solutions for which L̂_{lo,k} − L̂_lo ≤ c_n or L̂_up − L̂_{up,k} ≤ c_n. However, when n is large, basic solutions satisfying one of these inequalities have high probabilities of being found, and basic solutions for which neither of the inequalities holds have low probabilities. Therefore, a large number of bootstrap repetitions is unlikely to be needed to find all basic solutions for which one of the inequalities holds. In addition, arguments like those used to prove Theorem 4 below show that if not all basic solutions satisfying L̂_{lo,k} − L̂_lo ≤ c_n or L̂_up − L̂_{up,k} ≤ c_n are found, then the resulting confidence regions have asymptotic coverage probabilities that equal or exceed their nominal coverage probabilities. The error made by not finding all basic solutions satisfying the inequalities is in the direction of overcoverage, not undercoverage.

The second bootstrap procedure is as follows. Note that the optimal solution to the minimization or maximization version of (10) is unique if it is random.

(i) Generate a bootstrap sample {Y_i*, X_i*, W_i* : i = 1, ..., n} by sampling the estimation data {Y_i, X_i, W_i : i = 1, ..., n} randomly with replacement. Use (17) and (18) to compute the bootstrap versions of m̂ and π̂_jk, which are m̂* and π̂_jk*. Define Π̂* and m̂*, respectively, as the matrix and vector that are obtained by replacing the estimation sample with the bootstrap sample in Π̂ and m̂. For any basic solution k to problem (6), define Â_k* and m̂_s* by replacing the estimation sample with the bootstrap sample in Â_k and m̂_s.

(ii) Let k̂_lo and k̂_up, respectively, denote the optimal basic solutions of the minimization and maximization versions of problem (10). Define

Δ_lo* = n^{1/2}(c_{k̂_lo}′ Â_{k̂_lo}*^{-1} m̂_s* − c_{k̂_lo}′ Â_{k̂_lo}^{-1} m̂_s)   and   Δ_up* = −n^{1/2}(c_{k̂_up}′ Â_{k̂_up}*^{-1} m̂_s* − c_{k̂_up}′ Â_{k̂_up}^{-1} m̂_s).

(iii) Repeat steps (i) and (ii) many times. Estimate the distributions of n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up), and [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L̂_up − L_up)], respectively, by the empirical distributions of Δ_lo*, −Δ_up*, and (Δ_lo*, −Δ_up*). Estimate c_{α,lo} and c_{α,up}, respectively, by ĉ_{α,lo}* and ĉ_{α,up}*, which solve

P*(Δ_lo* ≤ ĉ_{α,lo}*) = 1 − α   and   P*(Δ_up* ≤ ĉ_{α,up}*) = 1 − α.

If the minimization version of (6) has a unique optimal basic solution, k_{lo,opt}, then k̂_lo = k_{lo,opt} with probability 1 for all sufficiently large n. Therefore, the second bootstrap procedure estimates the asymptotic distribution of n^{1/2}(L̂_lo − L_lo) uniformly consistently, and ĉ_{α,lo}* is a consistent estimator of c_{α,lo}. Similarly, if the maximization version of (6) has a unique optimal basic solution, then the second bootstrap procedure estimates the asymptotic distribution of n^{1/2}(L̂_up − L_up) uniformly consistently, and ĉ_{α,up}* is a consistent estimator of c_{α,up}.

If the minimization version of (6) has two or more optimal basic solutions that produce non-deterministic values of the objective function of (10), then the limiting bootstrap distribution of Δ_lo* depends on k̂_lo and is random. In this case, the second bootstrap procedure does not

provide a consistent estimator of the distribution of n^{1/2}(L̂_lo − L_lo) or of c_{α,lo}. Similarly, if the maximization version of (6) has two or more optimal basic solutions that produce non-deterministic values of the objective function of (10), then the second bootstrap procedure does not provide a consistent estimator of the distribution of n^{1/2}(L̂_up − L_up) or of c_{α,up}. However, the following theorem shows that the asymptotic coverage probabilities of confidence regions based on the inconsistent estimators of c_{α,lo} and c_{α,up} equal or exceed the nominal coverage probabilities. Thus, the error made by the second bootstrap procedure is in the direction of overcoverage.

Theorem 4: Let Assumptions 1-5 hold. Let n → ∞. Under the second bootstrap procedure, (i) P(L_lo ≥ L̂_lo − n^{-1/2}ĉ_{α,lo}*) ≥ 1 − α + o_p(1), and (ii) P(L_up ≤ L̂_up + n^{-1/2}ĉ_{α,up}*) ≥ 1 − α + o_p(1).

Proof: Only part (i) is proved. The proof of part (ii) is similar. With probability 1 for all sufficiently large n, k̂_lo ∈ B_lo, so

1 − α = P*(Δ_lo* ≤ ĉ_{α,lo}*) ≤ P*(Δ_1* ≤ ĉ_{α,lo}*).

Therefore, by Theorem 3(i), P[n^{1/2}(L̂_lo − L_lo) ≤ ĉ_{α,lo}*] ≥ 1 − α + o_p(1).

The bootstrap can also be used to obtain the conservative, uniformly consistent critical values described in Section 3.3. We now outline the bootstrap procedure for n^{1/2}(L̂_lo − L_lo). For basic solution k, estimate the distribution of n^{1/2}(Ẑ_k − L_{lo,k}) by the bootstrap distribution of Δ_{1k}*. Estimate the critical value c_{k,α,lo} by the bootstrap critical value ĉ_{k,α,lo}*, which is the solution to P*(Δ_{1k}* > ĉ_{k,α,lo}*) = α. Estimate ĉ_{α,lo} by max_{k ∈ B̂_{n,lo}} ĉ_{k,α,lo}*. Similar bootstrap procedures apply to n^{1/2}(L_up − L̂_up) and to [n^{1/2}(L̂_lo − L_lo), n^{1/2}(L_up − L̂_up)].

4.2 Monte Carlo Experiments

This section reports the results of Monte Carlo experiments that investigate the numerical performance of the bootstrap procedures of Section 4.1. The design of the experiments mimics the

empirical application presented in Section 5. The experiments investigate the finite-sample coverage probabilities of nominal 95% confidence intervals for [L_lo, L_up] and L(g).

In the experiments, the support of W is {0, 1}, and J = 4 or J = 6, depending on the experiment. In experiments with J = 6, X ∈ {2, 3, 4, 5, 6, 7} and

(20)  Π = [matrix entries not recoverable from the transcription].

In experiments with J = 4, X ∈ {2, 3, 4, 5}, and Π is obtained from (20) by reassigning the probability mass that (20) places on X = 6 and X = 7 [the reassignment formula is not recoverable from the transcription].

In experiments with J = 6, g = (23, 17, 13, 11, 9, 8)′. Thus, g(x) is decreasing and convex. We also impose the restriction g_1 − g_J ≤ 52. In experiments with J = 4, g = (23, 17, 13, 11)′. The functionals L(g) are g(3) − g(2), g(5) − g(2), and g(4). The data are generated by sampling (X, W) from the distribution given by Π with the specified value of J. Then Y is generated from Y = g(X) + U, where U = [X − E(X | W)]Z and Z ~ N(0, 1). There are 1000 Monte Carlo replications per experiment. The sample sizes are n = 1000 and n = [value not legible in the transcription]. We show the results of experiments using bootstrap procedure 1 with c_n = 1 and bootstrap procedure 2, which corresponds to c_n = 0. The results of experiments using bootstrap procedure 1 with larger values of c_n were similar to those with c_n = 1.

The results of the experiments are shown in Tables 1 and 2, which give empirical coverage probabilities of nominal 95% confidence intervals for [L_lo, L_up]. The empirical coverage probabilities of nominal 95% confidence intervals for L(g) are similar and are not shown. The empirical coverage probabilities are close to the nominal ones except when J = 4 and L(g) = g(4). In this case, the variance of Π̂ is large, which produces a large error in the asymptotic linear approximation to c_k′Â_k^{-1}m̂_s.

5. AN EMPIRICAL APPLICATION

This section presents an empirical application that illustrates the use of the methods described in Sections 2-4. The application is motivated by Angrist and Evans (1998), who investigated the effects of children on several labor-market outcomes of women. We use the data and instrument of Angrist and Evans (1998) to estimate the relation between the number of children a woman has and the number of weeks she works in a year. The model is that of (1a)-(1b), where Y is the number of weeks a woman works in a year, X is the number of children the woman

has, and W is an instrument for the possibly endogenous explanatory variable X. X can have the values 2, 3, 4, and 5. As in Angrist and Evans (1998), W is a binary random variable, with W = 1 if the woman's first two children have the same sex and W = 0 otherwise. We investigate the reductions in weeks worked when the number of children increases from 2 to 3 and from 2 to 5. In the first case, L(g) = g(3) − g(2). In the second case, L(g) = g(5) − g(2). The binary instrument W does not point identify L(g) in either case.

We estimate L_lo and L_up under each of two assumptions about the shape of g. The first assumption is that g is monotone non-increasing. The second is that g is monotone non-increasing and convex. Both are reasonable assumptions about the shape of g(x) in this application. We also estimate L(g) under the assumption that g is the linear function g(x) = β_0 + β_1 x, where β_0 and β_1 are constants. The binary instrument W point identifies β_0 and β_1. Therefore, L(g) is also point identified under the assumption of linearity. With data {Y_i, X_i, W_i : i = 1, ..., n}, the instrumental variables estimate of β_1 is

β̂_1 = [ Σ_{i=1}^{n} (Y_i − Ȳ)(W_i − W̄) ] / [ Σ_{i=1}^{n} (X_i − X̄)(W_i − W̄) ],

where Ȳ = n^{-1} Σ_{i=1}^{n} Y_i, X̄ = n^{-1} Σ_{i=1}^{n} X_i, and W̄ = n^{-1} Σ_{i=1}^{n} W_i. The estimate of L(g) is

L̂(g) = β̂_1 Δx,

where Δx = 1 for L(g) = g(3) − g(2) and Δx = 3 for L(g) = g(5) − g(2).

The data are a subset of those of Angrist and Evans (1998). They are taken from the 1980 Census Public Use Micro Samples (PUMS). Our subset consists of 50,68 white women who are 21-35 years old, have 2-5 children, and whose oldest child is between 8 and 12 years old.

The estimation results are shown in Tables 3 and 4. Table 3 shows the estimated identification intervals [L̂_lo, L̂_up] and bootstrap 95% confidence intervals for [L_lo, L_up] and L(g) under the two sets of shape assumptions. Table 4 shows point estimates and 95% confidence intervals for L(g) under the assumption that g is linear. It can be seen from Table 3 that the bounds on L(g) are very wide when g is required to be monotonic but is not otherwise restricted. The change in the number of weeks worked per year must be in the interval [−52, 0], so the estimated upper bound of the identification interval [L_lo, L_up] is uninformative if L(g) = g(3) − g(2), and the estimated lower bound is

uninformative if L(g) = g(5) − g(2). The estimated bounds are much narrower when g is required to be convex as well as monotonic. In particular, the 95% confidence intervals for [L_lo, L_up] and L(g) under the assumption that g is monotonic and convex are only slightly wider than the 95% confidence interval for L(g) under the much stronger assumption that g is linear.

6. QUANTILE IV AND EXOGENOUS COVARIATES

6.1 Quantile IV

We now consider a quantile version of model (1a)-(1b). The quantile model is

(21)  Y = g(X) + U,
(22)  P(U ≤ 0 | W = w) = q for all w ∈ supp(W),

where 0 < q < 1, X is discretely distributed with mass points {x_j : j = 1, ..., J} for some integer J < ∞, W is discretely distributed with mass points {w_k : k = 1, ..., K} for some integer K < ∞, and K < J. Horowitz and Lee (2007) show that model (21)-(22) is equivalent to the nonseparable model of Chernozhukov, Imbens, and Newey (2007),

(23)  Y = H(X, V),

where H is strictly increasing in its second argument and V is a continuously distributed random variable that is independent of the instrument W. As before, let g_j = g(x_j). Define π_k = P(W = w_k). Then (21)-(22) is equivalent to

(24)  Σ_{j=1}^{J} P(Y − g_j ≤ 0, X = x_j, W = w_k) = q π_k ;  k = 1, ..., K.

Thus, model (21)-(22) is equivalent to K nonlinear equations in J > K unknowns. As is shown by example in Appendix A, (21)-(22) and (24) do not point identify g = (g_1, ..., g_J)′ or the linear functional L(g) = c′g except, possibly, in special cases. Under the shape restriction Sh ≤ 0, sharp upper and lower bounds on L(g) are given by the optimal values of the objective functions of the nonlinear programming problems

(25)  minimize (maximize)_h :  c′h
      subject to

(26)  Σ_{j=1}^{J} P(Y − h_j ≤ 0, X = x_j, W = w_k) = q π_k ;  k = 1, ..., K,

(27)  Sh ≤ 0.
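For a candidate vector h, the K restrictions in (26) become sample averages once the unknown probabilities are replaced by their empirical counterparts. The following sketch is our own illustrative helper (not the paper's estimator) that evaluates the vector of empirical restrictions:

```python
import numpy as np

def quantile_moments(h, q, Y, X, W, x_support, w_support):
    """Empirical version of the K restrictions in (26): the k-th entry is
       sum_j P_hat(Y - h_j <= 0, X = x_j, W = w_k) - q * pi_hat_k."""
    Y, X, W, h = map(np.asarray, (Y, X, W, h))
    out = np.empty(len(w_support))
    for k, wk in enumerate(w_support):
        pi_k = np.mean(W == wk)
        total = 0.0
        for j, xj in enumerate(x_support):
            total += np.mean((Y - h[j] <= 0) & (X == xj) & (W == wk))
        out[k] = total - q * pi_k
    return out
```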

Denote the optimal values of the minimization and maximization versions of (25) by L_lo and L_up, respectively. Then L(g) is contained in the interval [L_lo, L_up] and cannot be outside of this interval. However, the feasible region of (25) may be non-convex. Consequently, some points within [L_lo, L_up] may correspond to points g that are not in the feasible region. When this happens, L_up and L_lo are sharp upper and lower bounds on L(g), but [L_lo, L_up] is not the identification region for L(g). The identification region is the union of disconnected subintervals of [L_lo, L_up]. The example in Appendix A illustrates this situation.

We now consider inference about L_lo and L_up in model (21)-(22). In applications, the probabilities P(Y − h_j ≤ 0, X = x_j, W = w_k) and π_k in (26) are unknown. The probabilities π_k are estimated consistently by

π̂_k = n^{-1} Σ_{i=1}^{n} I(W_i = w_k).

The most obvious estimator of P(Y − h_j ≤ 0, X = x_j, W = w_k) is the empirical probability

(28)  P̂(Y − h_j ≤ 0, X = x_j, W = w_k) = n^{-1} Σ_{i=1}^{n} I(Y_i − h_j ≤ 0, X_i = x_j, W_i = w_k) ;  k = 1, ..., K.

This estimator is a step function of the h_j's, so there may be no h_j's that satisfy (26) when P and π_k are replaced with P̂ and π̂_k. One way of dealing with this problem is to smooth P̂ so that it is continuous on −∞ < h_j < ∞. This can be done by using the estimator

P̃(Y − h_j ≤ 0, X = x_j, W = w_k) = n^{-1} Σ_{i=1}^{n} K̄((h_j − Y_i)/s_n) I(X_i = x_j, W_i = w_k) ;  k = 1, ..., K,

where s_n is a bandwidth parameter and K̄ is a cumulative distribution function corresponding to a probability density function that is bounded, symmetrical around 0, and supported on [−1, 1]. In other words, K̄ is the integral of a kernel function for nonparametric density estimation or mean regression. This approach has the disadvantage of requiring a user-selected tuning parameter, s_n. There is no good empirical way of selecting s_n in applications. We do not pursue this approach further in this paper.

Under Assumptions QIV1-QIV4, which are stated below, the need for a user-selected tuning parameter can be avoided by observing that (25)-(27) is equivalent to the unconstrained optimization problem

(29)  minimize_h :  ±c′h + C Σ_{m=1}^{M} [(Sh)_m]_+ + C Σ_{k=1}^{K} | Σ_{j=1}^{J} P(Y − h_j ≤ 0, X = x_j, W = w_k) − q π_k |.
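A sample-analog sketch of the penalized objective in (29), using the empirical probabilities from (28) through the quantile_moments helper above, is given below. The penalty constant C, the sign (minimization versus maximization of c′h), and the choice of optimizer are user decisions, not prescriptions from the paper; the paper's own estimator is developed on the pages following this excerpt and may differ in details.

```python
import numpy as np

def penalized_objective(h, c, S, q, Y, X, W, x_support, w_support,
                        C=100.0, maximize=False):
    """Sample version of the penalty-function objective in (29):
       +/- c'h  +  C * sum_m [(S h)_m]_+  +  C * sum_k | empirical restriction k |."""
    h = np.asarray(h, dtype=float)
    c = np.asarray(c, dtype=float)
    S = np.asarray(S, dtype=float)
    sign = -1.0 if maximize else 1.0
    shape_pen = np.sum(np.clip(S @ h, 0.0, None))      # positive parts of S h
    moment_pen = np.sum(np.abs(
        quantile_moments(h, q, Y, X, W, x_support, w_support)))
    return sign * float(c @ h) + C * (shape_pen + moment_pen)
```

Because the empirical probabilities are step functions of h, a derivative-free global optimizer is a natural choice for minimizing an objective of this form.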


More information

Inference for Identifiable Parameters in Partially Identified Econometric Models

Inference for Identifiable Parameters in Partially Identified Econometric Models Inference for Identifiable Parameters in Partially Identified Econometric Models Joseph P. Romano Department of Statistics Stanford University romano@stat.stanford.edu Azeem M. Shaikh Department of Economics

More information

A Discontinuity Test for Identification in Nonparametric Models with Endogeneity

A Discontinuity Test for Identification in Nonparametric Models with Endogeneity A Discontinuity Test for Identification in Nonparametric Models with Endogeneity Carolina Caetano 1 Christoph Rothe 2 Nese Yildiz 1 1 Department of Economics 2 Department of Economics University of Rochester

More information

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance

Identi cation of Positive Treatment E ects in. Randomized Experiments with Non-Compliance Identi cation of Positive Treatment E ects in Randomized Experiments with Non-Compliance Aleksey Tetenov y February 18, 2012 Abstract I derive sharp nonparametric lower bounds on some parameters of the

More information

A Course in Applied Econometrics. Lecture 10. Partial Identification. Outline. 1. Introduction. 2. Example I: Missing Data

A Course in Applied Econometrics. Lecture 10. Partial Identification. Outline. 1. Introduction. 2. Example I: Missing Data Outline A Course in Applied Econometrics Lecture 10 1. Introduction 2. Example I: Missing Data Partial Identification 3. Example II: Returns to Schooling 4. Example III: Initial Conditions Problems in

More information

Testing for Rank Invariance or Similarity in Program Evaluation: The Effect of Training on Earnings Revisited

Testing for Rank Invariance or Similarity in Program Evaluation: The Effect of Training on Earnings Revisited Testing for Rank Invariance or Similarity in Program Evaluation: The Effect of Training on Earnings Revisited Yingying Dong and Shu Shen UC Irvine and UC Davis Sept 2015 @ Chicago 1 / 37 Dong, Shen Testing

More information

What s New in Econometrics. Lecture 13

What s New in Econometrics. Lecture 13 What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments

More information

Testing continuity of a density via g-order statistics in the regression discontinuity design

Testing continuity of a density via g-order statistics in the regression discontinuity design Testing continuity of a density via g-order statistics in the regression discontinuity design Federico A. Bugni Ivan A. Canay The Institute for Fiscal Studies Department of Economics, UCL cemmap working

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

On completeness and consistency in nonparametric instrumental variable models

On completeness and consistency in nonparametric instrumental variable models On completeness and consistency in nonparametric instrumental variable models Joachim Freyberger March 6, 2015 Abstract This paper provides a first test for the identification condition in a nonparametric

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II Jeff Wooldridge IRP Lectures, UW Madison, August 2008 5. Estimating Production Functions Using Proxy Variables 6. Pseudo Panels

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

QED. Queen s Economics Department Working Paper No Hypothesis Testing for Arbitrary Bounds. Jeffrey Penney Queen s University

QED. Queen s Economics Department Working Paper No Hypothesis Testing for Arbitrary Bounds. Jeffrey Penney Queen s University QED Queen s Economics Department Working Paper No. 1319 Hypothesis Testing for Arbitrary Bounds Jeffrey Penney Queen s University Department of Economics Queen s University 94 University Avenue Kingston,

More information

Binary choice 3.3 Maximum likelihood estimation

Binary choice 3.3 Maximum likelihood estimation Binary choice 3.3 Maximum likelihood estimation Michel Bierlaire Output of the estimation We explain here the various outputs from the maximum likelihood estimation procedure. Solution of the maximum likelihood

More information

Limited Information Econometrics

Limited Information Econometrics Limited Information Econometrics Walras-Bowley Lecture NASM 2013 at USC Andrew Chesher CeMMAP & UCL June 14th 2013 AC (CeMMAP & UCL) LIE 6/14/13 1 / 32 Limited information econometrics Limited information

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

Inference for identifiable parameters in partially identified econometric models

Inference for identifiable parameters in partially identified econometric models Journal of Statistical Planning and Inference 138 (2008) 2786 2807 www.elsevier.com/locate/jspi Inference for identifiable parameters in partially identified econometric models Joseph P. Romano a,b,, Azeem

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Control Functions in Nonseparable Simultaneous Equations Models 1

Control Functions in Nonseparable Simultaneous Equations Models 1 Control Functions in Nonseparable Simultaneous Equations Models 1 Richard Blundell 2 UCL & IFS and Rosa L. Matzkin 3 UCLA June 2013 Abstract The control function approach (Heckman and Robb (1985)) in a

More information

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract

More information

Program Evaluation with High-Dimensional Data

Program Evaluation with High-Dimensional Data Program Evaluation with High-Dimensional Data Alexandre Belloni Duke Victor Chernozhukov MIT Iván Fernández-Val BU Christian Hansen Booth ESWC 215 August 17, 215 Introduction Goal is to perform inference

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

University of California San Diego and Stanford University and

University of California San Diego and Stanford University and First International Workshop on Functional and Operatorial Statistics. Toulouse, June 19-21, 2008 K-sample Subsampling Dimitris N. olitis andjoseph.romano University of California San Diego and Stanford

More information

TESTING REGRESSION MONOTONICITY IN ECONOMETRIC MODELS

TESTING REGRESSION MONOTONICITY IN ECONOMETRIC MODELS TESTING REGRESSION MONOTONICITY IN ECONOMETRIC MODELS DENIS CHETVERIKOV Abstract. Monotonicity is a key qualitative prediction of a wide array of economic models derived via robust comparative statics.

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms

Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms Inference in Nonparametric Series Estimation with Data-Dependent Number of Series Terms Byunghoon ang Department of Economics, University of Wisconsin-Madison First version December 9, 204; Revised November

More information

14.30 Introduction to Statistical Methods in Economics Spring 2009

14.30 Introduction to Statistical Methods in Economics Spring 2009 MIT OpenCourseWare http://ocw.mit.edu 4.0 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Testing instrument validity for LATE identification based on inequality moment constraints

Testing instrument validity for LATE identification based on inequality moment constraints Testing instrument validity for LATE identification based on inequality moment constraints Martin Huber* and Giovanni Mellace** *Harvard University, Dept. of Economics and University of St. Gallen, Dept.

More information

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity

Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity Testing Overidentifying Restrictions with Many Instruments and Heteroskedasticity John C. Chao, Department of Economics, University of Maryland, chao@econ.umd.edu. Jerry A. Hausman, Department of Economics,

More information

How Revealing is Revealed Preference?

How Revealing is Revealed Preference? How Revealing is Revealed Preference? Richard Blundell UCL and IFS April 2016 Lecture II, Boston University Richard Blundell () How Revealing is Revealed Preference? Lecture II, Boston University 1 / 55

More information

Testing Homogeneity Of A Large Data Set By Bootstrapping

Testing Homogeneity Of A Large Data Set By Bootstrapping Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp

More information

Rank Estimation of Partially Linear Index Models

Rank Estimation of Partially Linear Index Models Rank Estimation of Partially Linear Index Models Jason Abrevaya University of Texas at Austin Youngki Shin University of Western Ontario October 2008 Preliminary Do not distribute Abstract We consider

More information

What do instrumental variable models deliver with discrete dependent variables?

What do instrumental variable models deliver with discrete dependent variables? What do instrumental variable models deliver with discrete dependent variables? Andrew Chesher Adam Rosen The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP10/13 What

More information

A Conditional Approach to Modeling Multivariate Extremes

A Conditional Approach to Modeling Multivariate Extremes A Approach to ing Multivariate Extremes By Heffernan & Tawn Department of Statistics Purdue University s April 30, 2014 Outline s s Multivariate Extremes s A central aim of multivariate extremes is trying

More information

Bootstrapping The Order Selection Test

Bootstrapping The Order Selection Test Bootstrapping The Order Selection Test Chien-Feng Chen, Jeffrey D. Hart, and Suojin Wang ABSTRACT We consider bootstrap versions of the order selection test of Euban and Hart (992) and Kuchibhatla and

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano, 02LEu1 ttd ~Lt~S Testing Statistical Hypotheses Third Edition With 6 Illustrations ~Springer 2 The Probability Background 28 2.1 Probability and Measure 28 2.2 Integration.........

More information

Robust Backtesting Tests for Value-at-Risk Models

Robust Backtesting Tests for Value-at-Risk Models Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society

More information

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Bootstrap Tests: How Many Bootstraps?

Bootstrap Tests: How Many Bootstraps? Bootstrap Tests: How Many Bootstraps? Russell Davidson James G. MacKinnon GREQAM Department of Economics Centre de la Vieille Charité Queen s University 2 rue de la Charité Kingston, Ontario, Canada 13002

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Generalized instrumental variable models, methods, and applications

Generalized instrumental variable models, methods, and applications Generalized instrumental variable models, methods, and applications Andrew Chesher Adam M. Rosen The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP43/18 Generalized

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Probability. Carlo Tomasi Duke University

Probability. Carlo Tomasi Duke University Probability Carlo Tomasi Due University Introductory concepts about probability are first explained for outcomes that tae values in discrete sets, and then extended to outcomes on the real line 1 Discrete

More information

EM Algorithms for Ordered Probit Models with Endogenous Regressors

EM Algorithms for Ordered Probit Models with Endogenous Regressors EM Algorithms for Ordered Probit Models with Endogenous Regressors Hiroyuki Kawakatsu Business School Dublin City University Dublin 9, Ireland hiroyuki.kawakatsu@dcu.ie Ann G. Largey Business School Dublin

More information

Locally Robust Semiparametric Estimation

Locally Robust Semiparametric Estimation Locally Robust Semiparametric Estimation Victor Chernozhukov Juan Carlos Escanciano Hidehiko Ichimura Whitney K. Newey The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Identification of Panel Data Models with Endogenous Censoring

Identification of Panel Data Models with Endogenous Censoring Identification of Panel Data Models with Endogenous Censoring S. Khan Department of Economics Duke University Durham, NC - USA shakeebk@duke.edu M. Ponomareva Department of Economics Northern Illinois

More information

Identification with Latent Choice Sets: The Case of the Head Start Impact Study

Identification with Latent Choice Sets: The Case of the Head Start Impact Study Identification with Latent Choice Sets: The Case of the Head Start Impact Study Vishal Kamat University of Chicago 22 September 2018 Quantity of interest Common goal is to analyze ATE of program participation:

More information

Power Analysis. Thursday, February 23, :56 PM. 6 - Testing Complex Hypotheses and Simulation Page 1

Power Analysis. Thursday, February 23, :56 PM. 6 - Testing Complex Hypotheses and Simulation Page 1 6 - Testing Complex Hypotheses and Simulation Page 1 Power Analysis Thursday, February 23, 2012 6:56 PM Review from last week: Power and Size The sizeof a test: the probability that a test yields a false

More information

Econometrics of causal inference. Throughout, we consider the simplest case of a linear outcome equation, and homogeneous

Econometrics of causal inference. Throughout, we consider the simplest case of a linear outcome equation, and homogeneous Econometrics of causal inference Throughout, we consider the simplest case of a linear outcome equation, and homogeneous effects: y = βx + ɛ (1) where y is some outcome, x is an explanatory variable, and

More information

Additional Material for Estimating the Technology of Cognitive and Noncognitive Skill Formation (Cuttings from the Web Appendix)

Additional Material for Estimating the Technology of Cognitive and Noncognitive Skill Formation (Cuttings from the Web Appendix) Additional Material for Estimating the Technology of Cognitive and Noncognitive Skill Formation (Cuttings from the Web Appendix Flavio Cunha The University of Pennsylvania James Heckman The University

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Identifying Multiple Marginal Effects with a Single Instrument

Identifying Multiple Marginal Effects with a Single Instrument Identifying Multiple Marginal Effects with a Single Instrument Carolina Caetano 1 Juan Carlos Escanciano 2 1 Department of Economics, University of Rochester 2 Department of Economics, Indiana University

More information

A Note on Demand Estimation with Supply Information. in Non-Linear Models

A Note on Demand Estimation with Supply Information. in Non-Linear Models A Note on Demand Estimation with Supply Information in Non-Linear Models Tongil TI Kim Emory University J. Miguel Villas-Boas University of California, Berkeley May, 2018 Keywords: demand estimation, limited

More information

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern

More information

Final Exam. Economics 835: Econometrics. Fall 2010

Final Exam. Economics 835: Econometrics. Fall 2010 Final Exam Economics 835: Econometrics Fall 2010 Please answer the question I ask - no more and no less - and remember that the correct answer is often short and simple. 1 Some short questions a) For each

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

Lecture 2: Convex Sets and Functions

Lecture 2: Convex Sets and Functions Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are

More information

Nonseparable Unobserved Heterogeneity and Partial Identification in IV models for Count Outcomes

Nonseparable Unobserved Heterogeneity and Partial Identification in IV models for Count Outcomes Nonseparable Unobserved Heterogeneity and Partial Identification in IV models for Count Outcomes Dongwoo Kim Department of Economics, University College London [Latest update: March 29, 2017] Abstract

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Identification and Estimation of Regression Models with Misclassification

Identification and Estimation of Regression Models with Misclassification Identification and Estimation of Regression Models with Misclassification Aprajit Mahajan 1 First Version: October 1, 2002 This Version: December 1, 2005 1 I would like to thank Han Hong, Bo Honoré and

More information

Simple Estimators for Semiparametric Multinomial Choice Models

Simple Estimators for Semiparametric Multinomial Choice Models Simple Estimators for Semiparametric Multinomial Choice Models James L. Powell and Paul A. Ruud University of California, Berkeley March 2008 Preliminary and Incomplete Comments Welcome Abstract This paper

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1

Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1 Conditions for the Existence of Control Functions in Nonseparable Simultaneous Equations Models 1 Richard Blundell UCL and IFS and Rosa L. Matzkin UCLA First version: March 2008 This version: October 2010

More information

Identification of Nonparametric Simultaneous Equations Models with a Residual Index Structure

Identification of Nonparametric Simultaneous Equations Models with a Residual Index Structure Identification of Nonparametric Simultaneous Equations Models with a Residual Index Structure Steven T. Berry Yale University Department of Economics Cowles Foundation and NBER Philip A. Haile Yale University

More information

2 Chance constrained programming

2 Chance constrained programming 2 Chance constrained programming In this Chapter we give a brief introduction to chance constrained programming. The goals are to motivate the subject and to give the reader an idea of the related difficulties.

More information

Unconditional Quantile Regression with Endogenous Regressors

Unconditional Quantile Regression with Endogenous Regressors Unconditional Quantile Regression with Endogenous Regressors Pallab Kumar Ghosh Department of Economics Syracuse University. Email: paghosh@syr.edu Abstract This paper proposes an extension of the Fortin,

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:

More information