Estimation of the Binary Response Model using a Mixture of Distributions Estimator (MOD)
Mark Coppejans
Department of Economics


Mark Coppejans
Department of Economics, Duke University, Durham, NC, USA
Phone: (919)   Fax: (919)   mtc@econ.duke.edu
July

I would like to thank Rosa Matzkin, Torben Andersen, and especially Ian Domowitz for their comments. This and other papers by the author are available at

ABSTRACT

This paper develops a semiparametric sieve estimator, termed a mixture of distributions estimator (MOD), to estimate a binary response model when the distribution of the errors is unknown. The estimator for the distribution function is composed of a mixture of smooth distributions, where the number of mixture components increases with the sample size. The model is semiparametric because it is assumed that a parametric index-type restriction holds. Optimal rates of convergence are established for the distribution function under the $L_2$ norm, and conditions are derived under which estimates of the parametric component are asymptotically normal. An appealing feature of MOD is that the estimator of the distribution function can be restricted, a priori, to be smooth, non-negative, increasing, and to integrate to one. This has important practical and theoretical implications.

KEY WORDS: Binary Response Model, Mixture of Distributions, Sieve Estimator, Index Restriction.

1 Introduction

This paper develops a semiparametric sieve estimator, termed a mixture of distributions estimator (MOD), to estimate a binary response model. The underlying form is $y_i^* = x_i'\beta_0 + \epsilon_i$, $i = 1, \dots, n$, where $\{x_i', \epsilon_i\}'$ are i.i.d. random vectors, $x_i \in \Re^d$, $d \ge 1$, and $\epsilon_i \in \Re$, but the distribution of $\epsilon$, $F(\cdot)$, is unknown. One observes the sequence $\{y_i, x_i'\}'$, where $y_i$ is defined as
$$y_i = \begin{cases} 1, & \text{if } y_i^* > 0, \\ 0, & \text{if } y_i^* \le 0. \end{cases}$$
In many economic settings, the statistic of main interest is $E[y|x]$; this is not specific to the problem addressed here, nor to the estimation approach considered in this study. When the regressors and errors are mutually independent, $E[y|x] = 1 - F(-x'\beta_0) = E[y|x'\beta_0]$, and because the conditional expectation is completely specified as a function of $x'\beta_0$, the model is said to satisfy a single index restriction.

We propose to estimate $E[y|x]$ by estimating $F(\cdot)$ and $\beta_0$ simultaneously. The estimator for $F(\cdot)$ is a type of mixture of distributions in which the number of mixture components increases with the sample size. Given that the number of mixture terms grows at a suitable rate, optimal rates of convergence in $L_2$ for $E[y|x]$ and $F(\cdot)$ are obtained. We also provide some results on the estimates of the density of the errors, $f(\cdot)$, and on the asymptotic distribution of the estimates of $\beta_0$.

The estimation procedure is called the method of sieves, where a sieve is a sequence of finite-dimensional parameter spaces constructed so that, in the limit, the function of interest, $E[y|x]$, lies within it. Useful restrictions will be imposed on the sieve as well. For example, suppose a mixture of normals is used to estimate $F(\cdot)$, $\sum_{j=1}^{k} \lambda_j \Phi[(\cdot - \mu_j)/\sigma]$, where $\Phi(\cdot)$ is the normal c.d.f., the $\lambda_j$, $\mu_j$, and $\sigma$ are parameters, $k = n^{\kappa}$, and $\kappa > 0$ is to be determined later. The sieve then is this sequence of mixtures. Under unconstrained optimization, for moderate values of $k$, $\sigma$ will typically be set to zero. The result is a step function, which, in general, will lead to poor estimates of $F(\cdot)$ if we believe, as we do here, that $F(\cdot)$ is smooth. The key is to bound $\sigma$

from below, allowing it to decrease with $n$. One must use care in placing this bound, however. If it is too small, the estimated distribution will be close to a step function. If the bound is too large, the sieve might be a poor approximation to the underlying function, and as a result, optimal rates may not be obtainable.¹ The theory developed in this paper in part alleviates this problem by placing bounds on the rate at which $\sigma$ can decrease and $k$ can increase, where the choice of $k$ has important consequences. As Shen and Wong (1994) have shown in a general setting, if $k$ is too small, then optimal rates may not be obtainable, and if $k$ is too large, then even consistency may not be achieved.

Using a finite mixture model to estimate a binary response model is not new; it is, in fact, prevalent throughout economics. Nor is the idea of using a mixture model in which the number of components increases with the sample size new. One of the first theoretical papers on sieve estimation, Geman and Hwang (1983), proved that estimating a density by a mixture of densities is consistent under appropriate conditions. As another example, Heckman and Singer (1984) showed that estimating a distribution by a mixture of distributions is consistent, but their estimator is a step function, and as a result, it is not surprising that they obtained poor estimates.² What is novel in this paper is that we rigorously outline the conditions under which optimal estimates of $F(\cdot)$ in $L_2$ are guaranteed, asymptotically. This is a much stronger result than consistency alone. As a consequence, optimal estimates of $E[y|x]$ are also obtained.

It is well known that as the number of terms increases, mixtures of, say, normals can approximate any continuous distribution function arbitrarily well (e.g., Zubov, 1995). Nonetheless, this alone provides nothing more than consistency, so a stronger result is required for the rate of convergence calculations. This is obtained by bounding the sieve approximation error, which is defined here as the rate, in terms of $k$, at which $|F_k(\cdot) - F(\cdot)|$ tends to zero under the $L_2$ norm, where $F_k$ can be thought of as the distribution function composed of $k$ mixture components that is closest to $F(\cdot)$. In nonparametric estimation,

¹ The intuition is similar to that of a kernel estimator, where $\sigma$ plays the role of the bandwidth parameter.
² Heckman and Singer (1984) were also interested in estimates of structural parameters and $F(t) = \int G(t|\theta)\, d\nu(\theta)$, where, in their notation, $G(t|\theta)$ is known up to the scalar $\theta$ and $\nu(\cdot)$ is an unknown distribution. They found that even though they do not estimate $\nu(\cdot)$ very well, they did estimate the structural parameter and $F(t)$ well, which is not surprising because the functional form of $G(t|\theta)$ is known and integration acts as a smoother.

there is often a variance and bias² tradeoff, and the bias here can be thought of as the sieve approximation error. Once this is obtained, rates of convergence follow easily from the results in Shen and Wong (1994) or Chen and Shen (1998), both of whom treat the sieve approximation error as mainly given, focusing instead on the variance component of the tradeoff. The main theoretical contribution of this paper, therefore, is in obtaining a lower bound on the sieve approximation error. For example, suppose that $F(\cdot)$ is twice continuously differentiable. Then, on a compact support, the sieve approximation error is of order $k^{-6/7+\delta}$, $\delta > 0$ arbitrarily small, under the $L_2$ norm. In addition, $F_k(\cdot)$ and its first two derivatives converge to $F(\cdot)$ and its first two derivatives, respectively, in the strong norm.

There are many different approaches in the literature for estimating this type of binary response model. However, unlike most of the other methods, MOD has the property that the estimator for $F(\cdot)$ can be restricted, a priori, to be a proper distribution function that is also smooth. This has both theoretical and practical advantages. From a statistical standpoint, it provides an estimator for those who would prefer to estimate a distribution with a distribution. From a microeconomic standpoint, in which the underlying model is based on individual preferences, an estimator for $F(\cdot)$ that is not strictly monotonic implies that consumers are not necessarily utility maximizing, a violation of basic economic principles.³ Enforcing this type of restriction, which is derived from economic theory, is in line with Matzkin (1992), for example. From a practical standpoint, non-monotonic estimates of $E[y|x'\beta_0]$ are hard to interpret in an economic context.⁴ As an example, Stern (1996) estimated supply and demand functions, and in doing so illustrated the difficulties involved in analyzing non-monotonic estimates of $E[y|x'\beta_0]$. Stern (1996) proposed an alternative estimator that is monotonic, but it tends to have "kinks" and "flat spots", which are also difficult to interpret in an economic context. Another nice feature of MOD is that when it is composed of a mixture of normals, it can be viewed as a natural extension

³ For example, let $U_{i,j} = x_i'\beta_{0,j} + \epsilon_{i,j}$ be the $i$th person's indirect utilities associated with goods $j = 1, 2$. Set $\beta_0 = \beta_{0,2} - \beta_{0,1}$ and $\epsilon_i = \epsilon_{i,2} - \epsilon_{i,1}$. Then, defining $y_i = 1$ as the event that the $i$th person chooses good $j = 2$, we have $P(y_i = 1) = P(U_{i,2} > U_{i,1}) = 1 - F(-x'\beta_0)$, which is necessarily non-increasing in $-x'\beta_0$. Therefore, if an estimator for $F(\cdot)$ is somewhere decreasing, then it is possible to construct a situation such that $y_i = 1$ but $U_{i,2} < U_{i,1}$.
⁴ Of course there are ways of fixing this ex post, but the method is often arbitrary.

of the paradigm case for purely parametric settings, the probit model.

Given the importance of $E[y|x]$, it is somewhat surprising that most of the econometric literature in this area has focused almost entirely on estimates of $\beta_0$. This was fueled, in large part, by the results in Ruud (1983), who derived conditions under which consistent estimation of $\beta_0$ is possible even if $F(\cdot)$ is incorrectly specified as a normal. For example, Manski (1985), Stoker (1986), and Han (1987) have all constructed consistent estimates of $\beta_0$.⁵ Two more recent approaches, by Ichimura (1993) and Klein and Spady (1993), concentrated on the asymptotic distributional properties of the estimates of $\beta_0$, but their methods can also be used to obtain optimal estimates of $E[y|x]$ and $F(\cdot)$. Both estimators are set up as Nadaraya–Watson type kernel regressions, where the former uses a nonparametric nonlinear least squares objective function and the latter a nonparametric maximum likelihood objective function. The estimators also share the property that neither is necessarily monotonic.

The model developed in this paper is most similar to Cosslett's (1983) nonparametric MLE, because Cosslett's estimator for $F(\cdot)$ is a proper distribution function. However, Cosslett's estimator for $F(\cdot)$ is a step function, and rates of convergence are not established. Our method is also similar to Gallant and Nychka's (1987) technique for sample selection problems in that they estimate the unknown density function using a Hermite-type polynomial expansion. Gallant and Nychka established consistency results for their estimator, but they did not establish general results on rates of convergence. This paper instead uses the conditions developed in Shen and Wong (1994) to derive the rate results and the conditions in Shen (1997) to derive the asymptotic normality result. Horowitz (1996) has developed an $n^{-1/2}$-consistent, asymptotically normal, nonparametric estimator for $F(\cdot)$ for single index models with an unknown transformation of the dependent variable. Horowitz's results, however, do not extend to the binary response model, and the estimator is not necessarily monotonic.

It should also be noted that other types of sieve estimators can also be used to construct

⁵ Their work has generated many other papers, most of which provide asymptotic normality results for the estimates of $\beta_0$. These include Horowitz's (1992) extension of Manski's results; Powell et al.'s (1989), Härdle and Stoker's (1989), and Newey and Stoker's (1993) extensions of Stoker's (1986) results; and Sherman's (1993) extension of Han's results. See Powell (1994) for a thorough review.

optimal estimates of $E[y|x]$ and $F(\cdot)$. For example, let $s_k(x)$ be a polynomial of degree $k$ or a spline with $k$ knots. A possible estimator for $F(\cdot)$ that is also a proper distribution function is $\int_{-\infty}^{\cdot} s_k(x)^2\, dx \big/ \int_{-\infty}^{\infty} s_k(x)^2\, dx$. Nonetheless, this paper focuses on estimators formed as mixtures of distributions because their use is widespread throughout economics.

The organization of this paper is as follows. Section 2 begins by heuristically describing MOD. The rest of that section is broken into four parts. Subsection 2.1 covers identification, whereas Subsection 2.2 describes the sieve approximation error in more detail. The next subsection formally defines MOD and provides the asymptotic convergence results for $E[y|x]$ and for $F(\cdot)$ and its derivatives. Subsection 2.4 outlines the conditions needed for estimates of $\theta_0$ to be asymptotically normal. Section 3 is a Monte Carlo experiment, and the last section outlines a few extensions. All proofs are given in the Appendix.

2 Binary Response Model

If the errors and regressors are mutually independent, then
$$E[y|x] = P(\epsilon > -x'\beta_0) = 1 - F(-x'\beta_0),$$
where $F(\cdot)$ is the common c.d.f. of the errors. Hence the model reduces to a single index restriction, $E[y|x] = E[y|x'\beta_0]$. Klein and Spady (1993), among others, have shown that mutual independence of $x$ and $\epsilon$ is not necessary for the index restriction to hold; the assumption is used below only to identify $F(\cdot)$. If one is interested only in estimates of $E[y|x]$, then the mutual independence assumption can be relaxed as in Klein and Spady.

A linear transformation of a mixture of distributions is used to estimate $F(\cdot)$,
$$\Psi(\cdot\,; a, b, \{\lambda_j, \mu_j\}, \sigma) = a + (b - a) \sum_{j=1}^{k} \lambda_j H\!\left(\frac{\cdot - \mu_j}{\sigma}\right), \qquad (2.1)$$
where $k$ is the number of mixing components, $H(\cdot)$ is a smooth distribution function, $\lambda_j$, $0 \le \lambda_j \le 1$, $\sum_{j=1}^{k} \lambda_j = 1$, is a mixing weight, $\mu_j$ is a translation parameter, $\sigma$ is a scaling constant, $a$ is an intercept term, and $b - a$ is a slope term, $0 \le a \le b \le 1$. By construction, $\Psi$ is a mixture of smooth distributions.⁶

⁶ See Titterington et al. (1985) for an analysis of finite mixtures.
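The estimator in (2.1) is straightforward to evaluate numerically. The sketch below uses the logistic c.d.f. for $H(\cdot)$ purely to keep the example dependency-free; the paper's lead example uses the standard normal c.d.f. instead, and any smooth distribution function satisfying the assumptions below can be substituted.

```python
import numpy as np

def mod_cdf(z, a, b, lam, mu, sigma):
    """Evaluate Psi(z; a, b, {lam_j, mu_j}, sigma) from (2.1):
    a + (b - a) * sum_j lam_j * H((z - mu_j) / sigma).
    H is the logistic c.d.f. here (an illustrative choice of smooth
    distribution function; the paper's example uses the normal c.d.f.)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    lam, mu = np.asarray(lam, dtype=float), np.asarray(mu, dtype=float)
    # (n_points, k) matrix of component c.d.f. values H((z - mu_j)/sigma)
    H = 1.0 / (1.0 + np.exp(-(z[:, None] - mu[None, :]) / sigma))
    return a + (b - a) * (H @ lam)
```

Because the weights are non-negative and sum to one and each component is an increasing c.d.f., the resulting function is automatically increasing from $a$ to $b$, which is exactly the a priori shape restriction that distinguishes MOD.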

In most parametric applications one would set $a = 0$ and $b = 1$, but as will be shown below, we need this additional parameterization when the support of $x'\beta_0$ is a strict subset of the support of $\epsilon$. This is partly due to the fact that if the support of $x'\beta_0$ is $[-M, M]$, $0 < M < \infty$, and the support of $\epsilon$ is $(-M_0, M_0)$, $M < M_0$, for example, then $F(\cdot)$ will not be identified over the regions $(-M_0, -M)$ and $(M, M_0)$, because there is no information there. The information in this setup comes from the possible values of $-x'\beta_0$.

The idea here is to increase the number of mixture components, $k$, as the sample size, $n$, increases. In this sense, MOD can be viewed as a natural extension of the purely parametric binary response model, where $k$ is held fixed. As $k$ gets larger, the estimator defined in (2.1) becomes more "flexible", enabling it to give a better approximation of the true distribution function $F(\cdot)$. This statistical procedure is called the method of sieves.

Given $(\beta, a, b, \{\lambda_j, \mu_j\}, \sigma)$ and $k$, an estimator for $E[y|x]$ is
$$1 - \Psi(-x'\beta; a, b, \{\lambda_j, \mu_j\}, \sigma) = 1 - \left[ a + (b - a) \sum_{j=1}^{k} \lambda_j H\!\left(\frac{-x'\beta - \mu_j}{\sigma}\right) \right]. \qquad (2.2)$$
Since $E\{y - E[y|x'\beta_0]\} = 0$, we can estimate $\beta_0$ and $F(\cdot)$ by maximizing
$$Q_n(\omega(x; \beta, \Psi)) = -\frac{1}{n} \sum_{i=1}^{n} \left\{ y_i - [1 - \Psi(-x_i'\beta; a, b, \{\lambda_j, \mu_j\}, \sigma)] \right\}^2 \qquad (2.3)$$
with respect to $(\beta, a, b, \{\lambda_j, \mu_j\}, \sigma)$, where $\omega(x; \beta, \Psi) \equiv 1 - \Psi(-x'\beta; a, b, \{\lambda_j, \mu_j\}, \sigma)$.

For a fixed $k$, the method described above is just (misspecified) nonlinear least squares. To ensure consistency, it is necessary that $k \to \infty$. We will also impose additional restrictions, such as placing a bound on the second derivative, which will be described in more detail below. Ichimura (1993) also exploited a similar type of objective function, but in that paper kernels were used to estimate the conditional expectation; hence estimation is performed only over $\beta$, given that the bandwidth parameter is some predetermined sequence. One could view this as advantageous because there are fewer parameters to estimate, but the cost is that the estimator for the distribution function is not necessarily a distribution itself.
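A minimal sketch of the objective (2.3) is given below, under some illustrative implementation choices that are not part of the paper: the constraints on the mixing weights, on $(a, b)$, and on $\sigma$ are enforced by reparameterization (softmax weights, sigmoid transforms for $a \le b$, and a hard lower bound on $\sigma$), and $H(\cdot)$ is again the logistic c.d.f. Any off-the-shelf optimizer can then maximize `qn` over the unconstrained vector.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def unpack(free, k, d, sigma_min):
    """Map an unconstrained vector of length (d-1) + 2k + 3 to sieve-constrained
    parameters: softmax weights on the simplex, 0 <= a <= b <= 1, sigma >= sigma_min.
    (An illustrative reparameterization, not the paper's algorithm.)"""
    theta = free[:d - 1]                                   # index coefficients (beta_1 = 1)
    lam = np.exp(free[d - 1:d - 1 + k]); lam = lam / lam.sum()
    mu = free[d - 1 + k:d - 1 + 2 * k]
    u, v = sigmoid(free[-3]), sigmoid(free[-2])
    a, b = u * v, v                                        # guarantees 0 <= a <= b <= 1
    sigma = sigma_min + np.exp(free[-1])                   # lower bound on the scale
    return theta, lam, mu, a, b, sigma

def qn(free, y, X, k, sigma_min):
    """Sample objective (2.3): minus the mean squared error of y against
    1 - Psi(-x'beta), with logistic H as the smooth mixing distribution."""
    d = X.shape[1]
    theta, lam, mu, a, b, sigma = unpack(free, k, d, sigma_min)
    index = X[:, 0] + X[:, 1:] @ theta                     # x'beta with beta_1 normalized to 1
    H = sigmoid((-index[:, None] - mu[None, :]) / sigma)
    psi = a + (b - a) * (H @ lam)
    return -np.mean((y - (1.0 - psi)) ** 2)
```

The reparameterization keeps every candidate in the sieve by construction, so the optimizer never has to handle the simplex or ordering constraints explicitly; the remaining sieve restriction, the bound on the second derivative, would be handled separately (e.g., by a penalty or a constrained solver).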

A nonlinear least squares type of objective function is preferred to a likelihood-based objective function, such as in Klein and Spady (1993), because (2.3) is bounded without any form of truncation. This makes maximization easier. The method developed here is also robust to certain types of temporal dependency, which will be further noted in Section 4.

The function to be estimated is $\omega_0(x; F, \beta_0) \equiv 1 - F(-x'\beta_0)$, where $\omega_0 \in \Theta_0$, a possibly infinite dimensional parameter space. For notational simplicity, we will usually denote $\Psi(\cdot\,; a, b, \{\lambda_j, \mu_j\}, \sigma)$ and $\omega(\cdot\,; \beta, \Psi)$ as just $\Psi(\cdot)$ and $\omega(\cdot)$. The pseudometric is the $L_2$ norm, $\rho(\omega_1, \omega_2)$, defined as
$$\rho(\omega_1, \omega_2) = \left\{ E[\omega_1(x) - \omega_2(x)]^2 \right\}^{1/2},$$
where the expectation is with respect to the distribution of $x$. Rates of convergence will be determined by the order at which $\rho(\hat\omega_n, \omega_0)$ is bounded in probability, where $\hat\omega_n$ is our estimate of $\omega_0$. In other words, the pseudometric is used as a measure of how close our estimates of $E[y|x]$ are to the conditional expectation itself.

2.1 Identification

Let $F(\cdot) \in \mathcal{F}$, where $\mathcal{F}$ is some family of distributions. As in Cosslett (1983), the model is considered identified if the following definition holds.

DEFINITION 1 Let $\beta_1 \in \Theta$ and $F_1(\cdot) \in \mathcal{F}$. The model is identified if, for all $\beta_1 \in \Theta$ and $F_1(\cdot) \in \mathcal{F}$ such that $F_1(-x'\beta_1) = F(-x'\beta_0)$ for almost all $x$, it follows that $\beta_1 = \beta_0$.

The following assumptions will be used in part to identify both $F(\cdot)$ and $\beta_0$.

ASSUMPTION 1 $\beta_0 = (\beta_{01}, \dots, \beta_{0d})'$, $1 \le d < \infty$, is an element in the interior of $\Theta \subset \Re^d$, where $\Theta$ is compact, $|\beta_{0j}| \le C_0 < \infty$, $j = 1, \dots, d$, and $C_0$ is some constant.

ASSUMPTION 2 The index, $x'\beta$, is known up to the parameter $\beta$.

ASSUMPTION 3 $\{\epsilon_i\}$ is a sequence of i.i.d. random variables with distribution function $F(\cdot)$.

ASSUMPTION 4 $\{x_i\}$, $x_i \in \Re^d$, $1 \le d < \infty$, is a sequence of i.i.d. random vectors with finite support.

ASSUMPTION 5 $x$ and $\epsilon$ are mutually independent.

The first two assumptions are standard for purely parametric models. The assumption that the regressors have finite support is made for simplicity. Assumptions 3 and 5 are used to identify $F(\cdot)$ and $\beta_0$. We know from Cosslett (1983) that, without further restrictions, the constant term (if there is one) is not identified and the slope coefficients are identified only up to scale. As a result, we use the normalization $\beta_{01} = 1$ and, slightly abusing notation, rewrite the index as
$$x'\beta = x_1 + \theta_1 x_2 + \dots + \theta_{d-1} x_d = x_1 + \tilde{x}'\theta.$$

2.2 Sieve Approximation Error

Denote a sequence of spaces $\Theta_1, \dots, \Theta_n$ as approximations to $\Theta_0$, the underlying parameter space. In the next subsection, these parameter spaces will be defined in more detail. Heuristically, the sieve approximation error measures how close $\Theta_n$ is to $\Theta_0$ with respect to the pseudometric $\rho(\cdot, \cdot)$. Formally, suppose that for any $\omega_0 \in \Theta_0$ there exists $\pi_n(\omega_0) \in \Theta_n$ such that $\rho(\pi_n(\omega_0), \omega_0) \to 0$ as $n \to \infty$. The rate at which this tends to zero is the sieve approximation error.

A bound on the sieve approximation error is calculated here by combining results from the kernel and neural network literatures. The calculation itself is quite involved, so the details are provided in the Appendix. However, some of the restrictions imposed in this paper are a direct result of this approximation error, and without some understanding of how it is bounded, the restrictions may appear unnecessary. Two examples already given are the $a$ and $b$ parameters defined in (2.2). With this in mind, a sketch of the main ideas behind the calculation is provided below.

Observe first that the sieve approximation error directly associated with $\beta_0$ is zero, because it will be assumed that $\beta_0$ lies in $\Theta_n$ for all $n$. But this is not the case for general $F(\cdot)$, since we can think of it as depending on an infinite number of mixture terms. Hence, in most

cases, $F(\cdot) \in \Theta_n$ only as $n \to \infty$. As the number of mixture terms increases, there does exist some unknown sequence of distribution functions, $F_k(\cdot)$, depending on $k$ mixture components, such that $F_k(\cdot)$ gets closer to $F(\cdot)$. A lower bound on the rate at which $\rho(F_k(\cdot), F(\cdot)) \to 0$ bounds the sieve approximation error here.

Again denoting the support of $-x'\beta_0$ by $[-M, M]$, suppose that we observe a sequence of i.i.d. random variables, $\{\tilde\epsilon_j\}_{j=1}^{\tilde{k}}$, generated from the conditional distribution of $F(\cdot)$ given $|\epsilon| \le M$. We can then estimate the conditional distribution function, $\tilde{F}(\cdot)$, using the standard Rosenblatt–Parzen type kernel estimator for distribution functions (Reiss, 1981). Denote this kernel estimator for $\tilde{F}(\cdot)$ as $K_{\tilde{k}}(\cdot) = (1/\tilde{k}) \sum_{j=1}^{\tilde{k}} H[(\cdot - \tilde\epsilon_j)/h]$, where $h$ is the bandwidth parameter. Using standard techniques, we first show that $K_{\tilde{k}}(\cdot)$ approximates $\tilde{F}(\cdot)$ at a certain rate under the $L_2$ norm. For reasons described below, we also need to show that the $q$th derivative of $K_{\tilde{k}}(\cdot)$ converges to the $q$th derivative of $\tilde{F}(\cdot)$ under the strong norm, where we will assume that $\tilde{F}(\cdot)$ is $q$ times continuously differentiable. Observe that $\tilde{F}(\cdot)$ is related to $F(\cdot)$ by
$$\tilde{F}(z) = \frac{F(z) - F(-M)}{F(M) - F(-M)}, \qquad -M \le z \le M,$$
or, in terms of $F(\cdot)$,
$$F(z) = [F(M) - F(-M)]\, \tilde{F}(z) + F(-M), \qquad -M \le z \le M.$$
To relate this to the estimator $\Psi(\cdot)$ in (2.1), set $a = F(-M)$, $b = F(M)$, $\lambda_j = 1/\tilde{k}$, $\mu_j = \tilde\epsilon_j$, $\sigma = h$, and $k = \tilde{k}$. Clearly, $a + (b - a) K_{\tilde{k}}(\cdot)$ approximates $F(\cdot)$ on $[-M, M]$ with the same accuracy as $K_{\tilde{k}}(\cdot)$ approximates $\tilde{F}(\cdot)$.

The parameter values for $\{\mu_j\}$ are random, but by an argument similar to Barron's (1993), this is enough to show that there exist nonstochastic values for $\{\mu_j\}$, call them $\{\mu_j^*\}$, such that $(1/\tilde{k}) \sum_{j=1}^{\tilde{k}} H[(\cdot - \mu_j^*)/\sigma]$ has the same approximating properties as $K_{\tilde{k}}(\cdot)$ does in probability. By also allowing the $\{\lambda_j\}$ to vary, we can use a result in Makovoz (1996) to prove that an approximation with the same accuracy as above can be constructed with $k < \tilde{k}$ components, $\sum_{j=1}^{k} \lambda_j H[(\cdot - \mu_j^*)/\sigma]$, where $k$ is considerably smaller than $\tilde{k}$. Define the final approximation to $F(\cdot)$ on $[-M, M]$, $F_k(\cdot) = a + (b - a) \sum_{j=1}^{k} \lambda_j H[(\cdot - \mu_j^*)/\sigma]$, as the sieve approximation, $F_k = \pi_n(F)$. This is a nonstochastic and unknown sequence of functions. Then $[E(F_k - F)^2]^{1/2}$ bounds the sieve approximation error, which will depend on the conditions imposed on $H(\cdot)$ and $F(\cdot)$.
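The Rosenblatt–Parzen type smoothed empirical c.d.f. used in this construction is simple to write down; the sketch below again uses the logistic c.d.f. as the smooth kernel distribution $H(\cdot)$, an illustrative choice rather than the paper's.

```python
import numpy as np

def kernel_cdf(z, eps, h):
    """Smoothed empirical c.d.f. (Rosenblatt-Parzen type, as in Reiss, 1981):
    K(z) = (1/m) * sum_j H((z - eps_j) / h),
    where eps is the observed sample and H is a smooth c.d.f.
    (logistic here, for illustration)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    eps = np.asarray(eps, dtype=float)
    H = 1.0 / (1.0 + np.exp(-(z[:, None] - eps[None, :]) / h))
    return H.mean(axis=1)
```

Note that `kernel_cdf` is itself a mixture of the form (2.1) with $a = 0$, $b = 1$, equal weights $\lambda_j = 1/m$, locations $\mu_j = \epsilon_j$, and $\sigma = h$, which is exactly the correspondence the approximation argument exploits.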

2.3 Convergence Results

The idea behind the consistency argument is to obtain an estimate of $\omega_0$ on a bounded parameter space, $\Theta_n$, called a sieve. Even in the parametric case, theory often requires that estimation be performed over a compact parameter space (e.g., Amemiya, 1984, Theorem 4.1.1). The difference here is that the parameter space depends on $n$, which is a result of estimating $F(\cdot)$, the nonparametric component. For finite $n$, the true parameter space, $\Theta_0$, is generally too large to obtain consistent estimates over it. As a result, estimation is performed over $\Theta_n$, which is a more manageable parameter space. We require that $\Theta_n$ be a "reasonable" approximation to $\Theta_0$, so that in the limit the sieve is dense in the true parameter space, $\lim_{n\to\infty} \Theta_n \supseteq \Theta_0$. In other words, the sieve approximation error must tend to zero, which means that the sieve approximation must lie in $\Theta_n$ for all large $n$. Finally, we need to control the "size" of $\Theta_n$, where size is defined here as the metric entropy, the log of the minimum number of $\epsilon$-balls it takes to cover $\Theta_n$. The metric entropy plays an integral role in deriving rates of convergence.⁷ The assumptions below are used, in part, to construct a sieve with these desirable properties.

Denote the $s$th derivative of a function $G(x)$, $(\partial^s/\partial x^s) G(x)$, as $G^{(s)}(x)$, where $G^{(0)}(x) = G(x)$.

ASSUMPTION 6 $F(\cdot)$ is twice continuously differentiable, $|F^{(s)}(\cdot)| \le M_s < \infty$, $s = 1, 2$, and $\int |F^{(2)}(\epsilon)|\, d\epsilon < \infty$.

ASSUMPTION 7 $H(\cdot)$ in (2.1) is a four times continuously differentiable increasing function such that i) $\lim_{\epsilon\to-\infty} H(\epsilon) = 0$ and $\lim_{\epsilon\to\infty} H(\epsilon) = 1$; ii) $|H^{(s)}(\cdot)| \le C_s < \infty$, $s = 1, \dots, 4$; iii) $\int \epsilon H^{(1)}(\epsilon)\, d\epsilon = 0$; iv) $\int \epsilon^2 H^{(1)}(\epsilon)\, d\epsilon < \infty$; and v) $|\epsilon H^{(1)}(\epsilon)| \to 0$ as $|\epsilon| \to \infty$, where the $C_s$'s are constants.

ASSUMPTION 8 Denote by $r(\cdot)$ the density of $-x'\beta_0$. Then there exist constants $c_1, c_2, c_3, c_4$ such that $\int_{c_1}^{c_2} r(z)\, dz = 1$, $\inf_{c_1 \le z \le c_2} r(z) \ge c_3 > 0$, and $\sup_{c_1 \le z \le c_2} r(z) \le c_4 < \infty$.

⁷ See Alexander (1984), Section 3, for a good example of how the metric entropy is used in deriving rates of convergence.

Placing a smoothness requirement on $F(\cdot)$, as in Assumption 6, is common in nonparametric curve estimation. The added requirement that $\int |F^{(2)}(\epsilon)|\, d\epsilon < \infty$ is used to show that the second derivative of the sieve approximation converges uniformly to $F^{(2)}(\cdot)$, which helps in bounding the metric entropy. As Kolmogorov and Tikhomirov (1959) have shown, the metric entropy of a class of twice continuously differentiable functions on a bounded support is of order $\epsilon^{-1/2}$. Assumption 7 is similar to conditions in the kernel distribution literature (see Reiss, 1981); the normal distribution, for example, satisfies these requirements. We also place restrictions on the third and fourth derivatives, which are likewise used to show that the second derivative of our sieve approximation converges uniformly to $F^{(2)}(\cdot)$. Variants of Assumption 8 are common in single index settings because bounding the density of $-x'\beta_0$ from above and below makes some of the proofs less cumbersome.⁸ The conditions are primarily used here in the sieve approximation error calculation, which requires a Gabushin (1967) type bound (see the Appendix). Gabushin's result bounds derivatives of functions in terms of the function itself and a higher derivative. The problem is that this bound uses the non-weighted $L_2$ norm,
$$\left[ \int_{c_1}^{c_2} [\omega_1(z) - \omega_2(z)]^2\, dz \right]^{1/2},$$
so we use Assumption 8 to relate it to the weighted $L_2$ norm used in the sieve approximation error calculation,
$$\left[ \int_{c_1}^{c_2} [\omega_1(z) - \omega_2(z)]^2\, r(z)\, dz \right]^{1/2}.$$
The assumption could be weakened, as in Gallant and Nychka (1987), at the expense of making the proofs much more tedious.

The next definition formally defines the sieve by restricting the class of functions in (2.2).

⁸ For a concrete example, suppose $\tilde{x}'\theta_0 = (j-1)/J$, $j = 1, \dots, J+1$, $J < \infty$, with probability $p_j > 0$, and $x_1$ is distributed independently of $\tilde{x}'\theta_0$ with density $r_1(x_1)$, $x'\beta_0 = x_1 + \tilde{x}'\theta_0$. Let $0 < a_1 \le r_1(x_1) \le A_1 < \infty$ for all $x_1$, and let $r_1(\cdot)$ have support on $[0, b_1]$ (e.g., uniformly distributed on $[0, b_1]$). The density of $x'\beta_0$ is $\sum_j p_j\, r_1(x'\beta_0 - (j-1)/J)$. Hence the density of $-x'\beta_0$, $r(\cdot)$, has support on $[-b_1 - 1, 0]$. If $b_1 \ge 1/J$, then the conditions in Assumption 8 are satisfied with $c_1 = -b_1 - 1$, $c_2 = 0$, $c_3 = a_1 \min_j p_j$, and $c_4 = A_1$. If $b_1 < 1/J$, then the conditions are not satisfied, because $\inf r(\cdot) = 0$. As another example, suppose $\tilde{x}'\theta_0$ is uniformly distributed on $[-1, 1]$, and given $\tilde{x}'\theta_0$, $x_1$ is distributed as a standard truncated normal with values greater than $b_1 - \tilde{x}'\theta_0$ or smaller than $-b_1 - \tilde{x}'\theta_0$ truncated, where $b_1 > 1$. Then the density of $-x'\beta_0$ has support on $[-b_1, b_1]$ and $c_3 = [\Phi(b_1 + 1) - \Phi(b_1 - 1)]/\{2[\Phi(b_1) - \Phi(-b_1)]\}$, where $\Phi(\cdot)$ is the c.d.f. of the standard normal.

DEFINITION 2 The sieve, $\Theta_n$, is defined as the set of functions
$$\{\omega(x; \beta, \Psi)\} = \{1 - \Psi(-x'\beta; a, b, \{\lambda_j, \mu_j\}, \sigma)\} \qquad (2.4)$$
subject to
i) $k = O(n^{\kappa})$;
ii) $|\theta_j| \le C_\theta$, $j = 1, \dots, d-1$;
iii) $0 \le a < b \le 1$;
iv) $\sum_{j=1}^{k} \lambda_j = 1$, $0 \le \lambda_j \le 1$, $j = 1, \dots, k$;
v) $C_L \le \mu_j \le C_U$, $j = 1, \dots, k$;
vi) $c\, n^{-\gamma} \le \sigma \le C_\sigma$, $0 < c \le C_\sigma < \infty$;
vii) $\sup_{C_L \le z \le C_U} |\Psi^{(2)}(z)| \le C^{(2)}$;
where $\kappa, \gamma > 0$ will be determined below and $C_\theta \ge C_0$, $C_L < c_1$, $C_U > c_2$, $C^{(2)} \ge M_2$ are finite constants, with $C_0$, $c_1$, $c_2$, $M_2$ defined as in Assumptions 1, 8, and 6.

The constraints imposed on the sieve ensure that it is dense in $\Theta_0$. The restrictions on $\Psi^{(2)}(\cdot)$ directly bound the order of the metric entropy by $\epsilon^{-1/2}$. The bounds above are assumed to be constant, but this is not necessary; we could instead have them increase slowly with the sample size, at the cost of considerable notational complexity.⁹ We are now ready to state the main theorem.

THEOREM 1 Suppose that Assumptions 1–8 hold and that the model is identified. Let $\hat\omega_n = 1 - \hat\Psi(-x'\hat\beta_n) = 1 - \Psi(-x'\hat\beta_n; \hat{a}, \hat{b}, \{\hat\lambda_j, \hat\mu_j\}, \hat\sigma)$ be the estimate of $\omega_0$ that maximizes $Q_n(\omega)$ defined in (2.3) subject to the sieve defined in (2.4). For some arbitrarily small $\delta > 0$, put
$$\kappa = \frac{7}{15(1 - \delta)} \quad \text{and} \quad \gamma = \frac{1 - \delta/2}{5(1 - \delta)}.$$

⁹ See Shen and Wong (1994) for the necessary adjustments.

Then
$$\rho(\hat\omega_n, \omega_0) = O_p\!\left(n^{-2/5}\right).$$

This theorem states that the estimate of $E[y|x]$, $\hat\omega_n$, converges in probability at rate $n^{-2/5}$ under the $L_2$ norm. Stone (1982) has shown that this is the optimal rate for univariate functions that are twice continuously differentiable. Faster rates of convergence are possible if one assumes that $F(\cdot)$ is more than twice continuously differentiable. This requires extending Assumptions 6 and 7, and as a consequence, MOD unfortunately will no longer necessarily be monotonic. A related problem, namely that the estimator is no longer necessarily positive, occurs in standard nonparametric kernel density estimation when higher order kernels are used. Again, optimal rates of convergence are possible under the $L_2$ norm.

ASSUMPTION 9 $F(\cdot)$ is $q$ times continuously differentiable, $|F^{(s)}(\cdot)| \le M_s < \infty$, $s = 1, \dots, q$, $q \ge 3$, and $\int |F^{(q)}(\epsilon)|\, d\epsilon < \infty$.

ASSUMPTION 10 $H(\cdot)$ in (2.1) is $q + 2$ times continuously differentiable such that i) $\lim_{\epsilon\to-\infty} H(\epsilon) = 0$ and $\lim_{\epsilon\to\infty} H(\epsilon) = 1$; ii) $|H^{(s)}(\cdot)| \le C_s < \infty$, $s = 1, \dots, q+2$; iii) $\int \epsilon^s H^{(1)}(\epsilon)\, d\epsilon = 0$, $s = 1, \dots, q-1$; iv) $\int |\epsilon^q H^{(1)}(\epsilon)|\, d\epsilon < \infty$; and v) $|\epsilon^s H^{(s)}(\epsilon)| \to 0$ as $|\epsilon| \to \infty$, $s = 1, \dots, q-1$, where the $C_s$'s are constants.

As in Assumption 7, the restrictions imposed on $H(\cdot)$ in Assumption 10 are reasonable, so the usual methods for constructing higher order kernels can be used. The sieve restriction (2.4 vii) must also be adjusted: replace $\sup_{C_L \le z \le C_U} |\Psi^{(2)}(z)| \le C^{(2)}$ with
$$\sup_{C_L \le z \le C_U} |\Psi^{(q)}(z)| \le C^{(q)},$$
where $C^{(q)} \ge M_q$ and $M_q$ is as in Assumption 9.

COROLLARY 1 Suppose that the same setup as in Theorem 1 holds, with Assumptions 6 and 7 replaced by Assumptions 9 and 10. Put
$$\kappa = \frac{2q + 3}{3(2q + 1)(1 - \delta)} \quad \text{and} \quad \gamma = \frac{1 - \delta/2}{(2q + 1)(1 - \delta)}.$$

Then
$$\rho(\hat\omega_n, \omega_0) = O_p\!\left(n^{-q/(2q+1)}\right).$$

In some cases, such as estimation of quantiles, a stronger norm may be desired. The next result states the convergence rates of the estimates of $F(\cdot)$ and its derivatives, $\hat\Psi^{(s)}(\cdot)$, $s = 0, \dots, q-1$, under various norms. The convergence rate of
$$\left[ \int_{c_1}^{c_2} \left[ \hat\Psi(z) - F(z) \right]^2 r(z)\, dz \right]^{1/2}$$
is optimal, but the other rates, in general, are not.

COROLLARY 2 Given the conditions in Theorem 1 or Corollary 1, if $j \ge 2q/(q - i)$ for some integers $j$ and $i$, then
$$\left\| \hat\Psi^{(i)}(-x'\beta_0) - F^{(i)}(-x'\beta_0) \right\|_j = O_p(n^{-\nu}), \qquad \nu = \frac{2q\,(q - i + 1/j)}{(2q + 1)^2},$$
where $0 \le i \le q - 1$, $\|\cdot\|_j = (E|\cdot|^j)^{1/j}$, and $j = \infty$ is the strong norm ($q = 2$ with regard to Theorem 1).

2.4 Asymptotic Normality

A desirable feature of any semiparametric estimator is an asymptotic distributional result for the estimate of the parametric component. This is shown here by verifying the conditions in Shen (1997). In this subsection, let $\omega = (\theta, 1 - \Psi(-x'\beta_0))$, and define $\omega_0$, $\Theta_n$, and $\Theta_0$ analogously. The key is to choose an appropriate inner product (recall that $x = (x_1, \tilde{x}')'$),
$$\langle \omega_1, \omega_2 \rangle = E\left\{ \left[\tilde{x}'\theta_1 f(-x'\beta_0) + (1 - \Psi_1(-x'\beta_0))\right] \left[\tilde{x}'\theta_2 f(-x'\beta_0) + (1 - \Psi_2(-x'\beta_0))\right] \right\}.$$
Define the norm $\|\cdot\|$ such that $\langle \omega, \omega \rangle = \|\omega\|^2$. A main step in the proof is to find a $v^*$ such that
$$(\theta - \theta_0)'\tau = \langle \omega - \omega_0, v^* \rangle$$

17 for all! 2 n, where is an arbitrary unit vector in < d?1. By the Riesz representation theorem, v 2 V exists, where V is the completion of V, the space spanned by 0?! 0. Similar to Example 2 in Shen (1997) and Case 2.2 in Chen and Shen (1998), v =?1 ;?(?1 ) 0 f(?x 0 0 )E[~xj x 0 0 ] if f(?x 0 0 )E[~xj x 0 0 ] is smooth enough, where E n f(?x 0 0 ) 2 [~x? E(~xj x 0 0 )] [~x? E(~xj x 0 0 )] 0o : The next assumption guarantees that v 2 V, and it is in the same spirit as in Shen (1997) Example 2. ASSUMPTION 11 f(?x 0 0 )E[~xj x 0 0 ] is at least as smooth as F (?x 0 0 ). Unfortunately, Assumption 11 imposes a rather high-level condition. It is satised if E[~xj? x 0 0 ] = G(?x0 0 ) f(?x 0 0 ) ; where G() is some very smooth function. 10 Another case of when it will hold is if E[~xjx 0 0 ] and f(?x 0 0 ) are innitely times dierentiable with respect to?x 0 0. normal distribution is innitely times dierentiable. For example, the THEOREM 2 Suppose that the conditions in Theorem 1 or Corollary 1 hold and that Assumption 11 is satised. If is positive denite and x 1 is not a function of ~x 0 0, then p n(^n? 0 )! N(0;?1 J?1 ) in distribution, where J = EfF (?x 0 0 )[1? F (?x 0 0 )]f(?x 0 0 ) 2 [~x? E[~xjx 0 0 ]][~x? E[~xjx 0 0 ]] 0 g: As in Klein and Spady (1993) and Ichimura (1993), estimation of the covariance matrix can be performed in the usual way. The conditional expectations piece can be estimated by standard regression ernels with the order of the bandwidth set to n?1=3. This follows by the uniform convergence results in Klein and Spady. 10 As an example of when this might hold, let d = 2 and 0 = 1 for simplicity. Dene r(x 1 ; z) and r(z) as the respective joint and marginal densities, where z =?x 1?x 2. Suppose that on [c 1 ; c 2 ], r(z) is proportional to f(z). Then f(z)e[x 2 jz] = c R r(x 1 ; z) dx 1 for some positive constant c. In this case, Assumption 11 will be satised if r(; z) is very smooth with respect to z. 15

3 Simulations

The primary purpose of this section is to show that MOD performs reasonably well. The experiment is performed 1000 times at each of two sample sizes: n = 250 and n = 1000. There are two regressors, x_1 and x_2, with index x_1 + \beta_0 x_2, \beta_0 = 1, where the support of x_1 + \beta_0 x_2 is [-10, 10]. The errors are distributed as one of two mixtures of independent gammas. It is often the case in simulations that a mixture of normals is used instead, but that would not be a meaningful comparison here, since we will be using a type of mixture of normals as one of our estimators. Denote by G(\kappa, \theta) a gamma distribution with shape parameter \kappa and scale parameter \theta, by G(-\kappa, \theta) the distribution whose density is that of G(\kappa, \theta) reflected about the y-axis, and by [G(\kappa, \theta) + \mu] the distribution whose density is that of G(\kappa, \theta) shifted to the right by \mu. The error distributions considered in this experiment are

G_1 \equiv \pi_1 [G(6, 0.15) - 0.5] + \pi_2 [G(-6, 0.15) + 0.5] + \pi_3 [G(8, 1) - 2] + \pi_4 [G(-8, 1) + 2]

and

G_2 \equiv \pi_1 [G(4, 0.45) + 1] + \pi_2 [G(-4, 0.45) + 5] + \pi_3 [G(4, 0.75) - 1] + \pi_4 [G(-4, 0.75) - 5],

where the mixing weights \pi_j sum to one (the individual weights, fractions in eighths, are not legible in the source). The first mixture is symmetric and trimodal, and its form is not that atypical, whereas the shape of the second distribution is nonstandard. Even though it is unlikely that G_2 occurs in an economic setting, it is nonetheless a good test; if MOD performs well under this distribution, then it is likely to perform just as well or better under more "reasonable" distributions. Both G_1 and G_2 are two times continuously differentiable, their second derivatives are bounded by 0.5 and 0.7, respectively, in absolute value, and their first four moments are 0.0, 22.15, 0.0, ..., and 0.0, 22.60, ..., ..., respectively.

The first regressor, x_1, is uniformly distributed on [-1, 1], and the second regressor, x_2, is a truncated normal with support [-9, 9]. Let TN(0, \sigma^2, tr) represent a normal distribution with mean zero and variance \sigma^2, where values greater than tr in absolute value have been truncated. Then x_2 is distributed as either TN(0, 9^2, 9) or TN(0, 4.5^2, 9). Typically, the nonparametric component at any given point will only be estimated well if there are many observations around that point. Since the distribution function is defined over the range [-10, 10], estimates of it near the tails will only do well if the mass there, with respect to the distribution of -x'\beta_0, is large. To get a feel for the sensitivity of this, results for two different distributions are provided, with the latter, TN(0, 4.5^2, 9), having less mass in the tails.[11] In all, four models are studied,

Model 1: x_2 ~ TN(0, 9^2, 9), G_1;
Model 2: x_2 ~ TN(0, 4.5^2, 9), G_1;
Model 3: x_2 ~ TN(0, 9^2, 9), G_2;
Model 4: x_2 ~ TN(0, 4.5^2, 9), G_2;

across four different estimators. The estimators are the standard probit, two variants of MOD, and Ichimura's (1993) estimator. In all cases, the estimate of \beta_0 was restricted to be less than 10 in absolute value. For the probit model, the estimator is \Phi[(x_1 + \beta x_2 - \mu)/\sigma], where \mu, \sigma, and \beta are parameters to be estimated and \Phi(\cdot) is the c.d.f. of the standard normal.

The first version of MOD, called MOD-1, sets H(\cdot) as the standard normal distribution. As a practical matter, there are two crucial choices one must make before implementing MOD: a lower bound for \sigma and an upper bound on the second derivative of our estimator. Both of these bounds will be obtained from the results of a probit model. Define the \sigma estimated from the probit model as \hat{\sigma}_p. A lower bound for \sigma with respect to MOD-1 was set here as

min{ 0.5, 0.5 \hat{\sigma}_p n^{-0.2} },[11]

where the lower bound must decrease at a rate of about n^{-0.2} by the results from the last section.[12] One can use any positive constant besides 0.5 in the above calculation (since for large n it becomes irrelevant), but it is a reasonable upper bound to the lower bound because it deviates only slightly from the case of standard normality. Likewise, noting that the bound on the second derivative with regard to the probit model is exp(-0.5)(\hat{\sigma}_p^2 \sqrt{2\pi})^{-1}, we set the bound on the second derivative of MOD as

C^{(2)} \equiv min{ \frac{exp(-0.5)}{\sqrt{2\pi}} max[ 0.5^{-2}, (0.5 \hat{\sigma}_p n^{-0.2})^{-1.5} ], 1000 },

where 1000 is some predetermined upper bound.[13] The latter bound is constructed so that for moderate sample sizes, the lower bound on the second derivative, (\hat{\sigma}_p n^{-0.2})^{-1.5}, is decreasing at a slower rate than the lower bound under normality, (\hat{\sigma}_p n^{-0.2})^{-2}. This can be viewed as a transition to the case of larger sample sizes, where the derivative bound will be fixed at 1000, whereas the lower bound for \sigma keeps decreasing with n. In order to satisfy the second-derivative bound, we impose the following constraints,[14]

\frac{b - a}{\sigma^2} \sum_{j=1}^{K} \pi_j | H^{(2)}( (z_{j,l} - \mu_j)/\sigma ) | <= C^{(2)}, l = 1, 2; z_{j,1} = \mu_j - \sigma, z_{j,2} = \mu_j + \sigma.

Observe that |H^{(2)}[(z - \mu_j)/\sigma]| obtains a maximum value at \mu_j - \sigma and \mu_j + \sigma. These bounds do not guarantee that the overall second derivative is less than C^{(2)}, since we are only constraining the derivative at certain points. Nonetheless it works well, especially when \sigma is small, because the normal has exponential tails. After estimation, it was checked whether the overall derivative bounds were less than 1.5 C^{(2)}, and in all cases they were. Hence, with some abuse of notation, denote the overall derivative bound as 1.5 C^{(2)}.

[11] The distribution of the regressors was chosen here because both the uniform and the normal are commonplace in simulation studies. Hence this makes the analysis of the estimates of the error density and distribution easier. However, note that the density of z = -x_1 - x_2, r(z), when x_2 ~ TN(0, 9^2, 9), is [\Phi(min{z+1, 9}/9) - \Phi(max{z-1, -9}/9)] / {2[\Phi(1) - \Phi(-1)]} for -10 <= z <= 10. This implies that r(-10) = r(10) = 0, a violation of Assumption 8. The condition that the density has to be positive was only made for technical convenience, as stated earlier. Perhaps more problematic, in terms of the theory, is that the density is not differentiable at z = -8 and z = 8, though it is infinitely differentiable elsewhere. Thus Assumption 11 will also be violated. Nonetheless, MOD performing well under these conditions will provide some evidence of its robustness.

[12] The upper bound for \sigma was set to 50. The average lower bound across the models and sample sizes was fairly constant, taking on a value of about 0.47 in each case. The median estimate of \hat{\sigma}_p ranged from 4.5 to 5.0. In the larger sample size, the lower bound for \sigma was less than 0.5 about a third of the time, and likewise it occurred about a quarter of the time in the smaller sample size.

[13] Like the lower bound on \sigma, the average upper bound on the second derivative is fairly constant, with a value of about 1.0 in each case.

[14] In practice, these constraints may not be necessary and could instead be checked after optimization, but in a large-scale simulation they speed up the exercise. For example, one could estimate a range of mixtures, K = 1, ..., \bar{K}, without imposing the derivative bounds in the computer program itself, and select the largest model, in terms of K, that satisfies the derivative bounds, where the bounds are checked after estimation. This is much easier to program.
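The two practical bounds above can be computed directly. The sketch below is a hedged illustration, assuming the bound formulas as reconstructed here (min{0.5, 0.5 \hat{\sigma}_p n^{-0.2}} for the \sigma lower bound, and the capped second-derivative bound C^{(2)}); `mod_bounds` is an illustrative name and \hat{\sigma}_p comes from a first-stage probit.

```python
import math

def mod_bounds(sigma_hat_p, n, c=0.5, deriv_cap=1000.0):
    """Practical bounds for MOD-1, both driven by a first-stage probit sigma.
    Returns (lower bound for sigma, upper bound C2 on the second derivative)."""
    sigma_lb = min(c, c * sigma_hat_p * n ** -0.2)   # decreases at rate n^(-0.2)
    # max |d^2/dz^2 Phi((z - mu)/s)| = exp(-0.5) / (s^2 * sqrt(2*pi)), so the
    # bound plugs candidate scales into that expression and caps the result
    peak = math.exp(-0.5) / math.sqrt(2.0 * math.pi)
    c2 = min(peak * max(c ** -2.0, (c * sigma_hat_p * n ** -0.2) ** -1.5), deriv_cap)
    return sigma_lb, c2
```

For \hat{\sigma}_p near the reported median of 4.5 to 5.0, the resulting C^{(2)} is close to 1.0, which is consistent with footnote 13.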

The last step is to determine the number of terms, K, to use. The results from the last section state that it increases at an order of about n^{7/15}, but this leaves a wide range of choices. Even in moderately large samples, setting K = n^{7/15} seems unreasonably ambitious. Unfortunately, a trial of model selection procedures, such as BIC and a variant of generalized cross-validation, did not work very well, since they tended to under-predict the optimal number of mixtures (i.e., the number of mixtures that minimized the pseudometric on a per-sample basis). However, for moderately large K, the objective function did not perceptibly decrease as K increased. The reason is that the model is nested in terms of K, and if K was too large, the optimizer converged to a smaller model by either setting some of the \pi_i's to zero and/or by equating some of the \mu_i's.[15] So the key is to find a large enough K. Given that this is extremely time consuming if done per sample, we instead did a search over the first ten samples and chose K large enough so that in each sample the same result could have been achieved with a smaller number of terms. For n = 250, K was chosen as seven across all models; for n = 1000, K was set to nine for the first two models and to eight for the remaining two.[16]

Except for the nonlinear constraints, the optimization problem is the same as standard nonlinear least squares. There are a variety of computer packages that handle these types of constraints, and the one used here is NPSOL by Gill et al. (1986).[17]

As a means of comparison, we also estimated the model using Ichimura's (1993) approach, where the kernel is defined as

k(x) = 0 if |x| >= 1, and k(x) = 35(1 - x^2)^3 / 32 if |x| < 1,

which is constructed to be twice continuously differentiable and nonnegative everywhere. This kernel is similar to the one used in Lee (1995), who examined the simulated performance of \hat{\beta}_n under a closely related model. If the bandwidth is of order n^{-0.2}, then the distribution function converges at a rate of n^{-0.4}, which is the same as MOD. The rate n^{-0.2} gives little guidance for actual implementation, so the approach taken here is the same as in the empirical work of Stern (1996), who treated the bandwidth as a parameter to be estimated along with \beta_0. We have restricted the bandwidth to lie in [1 n^{-0.2}, 15 n^{-0.2}], where the constants 1 and 15 encompass the range used by Lee (1995). The Simplex method, outlined in Press et al. (1994), is used for optimization, as in Lee (1995).

To show that the results are not due to differences in kernels (i.e., H^{(1)}(\cdot) versus k(\cdot)), even though k(\cdot) is very similar ("bell shaped") to a normal density with a variance of 0.12, we used the following alternative function for H(\cdot), which is just H(x) = \int_{-1}^{x} k(\xi) d\xi,

H(x) = 0 if x <= -1; H(x) = 35(x - x^3 + 3x^5/5 - x^7/7)/32 + 0.5 if |x| < 1; H(x) = 1 if x >= 1,

and this model is called MOD-2. This function is only three times continuously differentiable, whereas the theory from the last section requires H(\cdot) to be four times continuously differentiable. It is easy to construct a kernel with bounded support satisfying the stronger smoothness conditions, but for comparison purposes the function above is used. The analogous strategy for setting and imposing the bounds, as in the case of MOD-1, is adopted here.

Because E[y|x] and (\partial/\partial x) E[y|x] depend on F(\cdot), f(\cdot), and \beta_0, only results for \hat{\Phi}_K(\cdot), \hat{\Phi}_K^{(1)}(\cdot), and \hat{\beta}_n are reported. A summary of the results for the distribution and density estimates is reported in Table 1, where \Delta(\hat{\Phi}_K, F) and \Delta(\hat{\Phi}_K^{(1)}, f) are the corresponding average (across the 1000 simulations) L_2 distances for the estimates of the distribution and density (i.e., \Delta(\hat{\Phi}_K, F) \equiv { \sum_{j=1}^{1000} \int_{-10}^{10} [\hat{\Phi}_{K,j}(z) - F(z)]^2 r(z) dz / 1000 }^{1/2}, where \hat{\Phi}_{K,j} is the estimate with respect to the jth simulation).

[15] Both Geman and Hwang (1983) and Heckman and Singer (1984) also commented on this result, with the latter paper calling it the "clustering phenomenon".

[16] The theory in the last section is for the case when K is a predetermined sequence such as K = \lfloor n^{7/15} \rfloor, where \lfloor \cdot \rfloor denotes the largest integer no bigger than its argument, but the theory easily extends to the case where the search is done over the set {max(\lfloor n^{7/15} \rfloor - C_1, 1), ..., min(\lfloor n^{7/15} \rfloor + C_2, \lfloor n \rfloor)}, where C_1 and C_2 are positive constants. The reason the theory carries over is that the order of the metric entropy remains the same.

[17] Optimization was performed on a Sun Ultra 2, and each run with a sample size of 1000 observations and K = 9 took about a minute to converge.
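The closed form for H(\cdot) above can be checked against a direct numerical integral of the kernel. The sketch below is an illustration only; the names `k_tri` and `H2` are not from the paper.

```python
def k_tri(x):
    """The bounded-support kernel from the text: 35(1-x^2)^3/32 on |x| < 1, else 0."""
    return 35.0 * (1.0 - x * x) ** 3 / 32.0 if abs(x) < 1.0 else 0.0

def H2(x):
    """MOD-2's H(x): the closed-form integral of k_tri from -1 to x."""
    if x <= -1.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 35.0 * (x - x**3 + 3.0 * x**5 / 5.0 - x**7 / 7.0) / 32.0 + 0.5
```

H2 is monotone with H2(0) = 0.5, and a Riemann sum of k_tri reproduces it, confirming the polynomial coefficients.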
A summary of the results for the coefficient estimates is provided in Table 2, where \hat{\beta}_n, SD, and ASE are the average estimate, the standard deviation of the estimates, and the average of the estimated asymptotic standard errors, respectively. For both MOD models, E[x_2 | x_1 + \beta x_2] was estimated using standard kernel regression with a normal density kernel and a bandwidth set to n^{-1/3}. In Table 2, results for the exact MLE (i.e., the constant term and the slope coefficient are the only parameters being estimated) are also given as a point of comparison. Not surprisingly, the efficiency loss of the semiparametric methods relative to the exact MLE is quite considerable. Plots of the average estimates of the distribution and density and their corresponding standard deviations are given in Figures 1 and 2 for Models 1 and 3.[18] Finally, to get a sense of the distribution of the estimated coefficients, some histograms are provided in Figure 3.

Overall, MOD performs very well, especially in the case of 1000 observations, where it estimates the distribution and \beta_0, relative to the probit and Ichimura's model, with good accuracy. It is surprising how similar MOD-1 and MOD-2 are, the only noticeable difference being in the standard deviations and standard errors for a sample size of 250 observations. MOD clearly outperforms the probit model by all measures in the case of the larger sample size. In the case of the smaller sample size, the only negative outcome is that there is little improvement (in L_2) in the estimates of the density. This should not be surprising, however, since it is typical in nonparametric estimation that derivatives are estimated less precisely. To compound matters here, we only observe an indicator, versus a continuum of values, which is very little information. The similarities between MOD's performance in Models 1 and 2, as well as in Models 3 and 4, suggest that the results are not overly sensitive to the underlying distribution of -x'\beta_0. The main reason the analogous pseudometrics are sometimes smaller in Models 2 and 4 is that the density of -x'\beta_0 under this second regime is smaller in the tails, and this is where much of the error lies. As expected, MOD does not estimate the second error distribution, G_2, as well as the first. In the case of such a complicated function, it appears that we need a relatively large sample size before we are able to obtain good estimates of it.
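The tail-mass point above can be made concrete by evaluating the density r(z) of z = -x_1 - x_2 given in footnote 11 (for the case x_2 ~ TN(0, 9^2, 9)). The sketch below simply transcribes that formula and checks that it behaves like a density; the function names are illustrative.

```python
import math

def Phi(t):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def r(z):
    """Density of z = -x1 - x2 with x1 ~ U[-1,1] and x2 ~ TN(0, 9^2, 9),
    as given in footnote 11; zero outside [-10, 10]."""
    if abs(z) > 10.0:
        return 0.0
    num = Phi(min(z + 1.0, 9.0) / 9.0) - Phi(max(z - 1.0, -9.0) / 9.0)
    return num / (2.0 * (Phi(1.0) - Phi(-1.0)))
```

Note that r(-10) = r(10) = 0 and that r is small near the endpoints, which is exactly why tail estimates of F and f carry little weight in the L_2 pseudometric.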
Even though probit does a better job of estimating the density in L_2, it misses, on average, important shape characteristics, such as the mass to the right of center and the fact that the mass at the left tail is larger than at the right tail. The histograms provided in Figure 3 give some visual evidence that the distribution of the estimates is approximately normal for moderately large sample sizes. As n increases, the sampling distribution appears more "bell shaped", even though it is still asymmetric.

Somewhat surprising is the performance of Ichimura's estimator. Except for the estimates

[18] MOD-2 is not shown since its averages are almost identical to those of MOD-1. Models 2 and 4 are similar to Models 1 and 3, and hence omitted.

Table 1: Average L_2 Distances. Columns: \Delta(\hat{\Phi}_K, F) and \Delta(\hat{\Phi}_K^{(1)}, f), each for n = 1000 and for n = 250. Rows: Models 1-4, each with the Probit, MOD-1, Ichimura, and MOD-2 estimators. (The numerical entries are not legible in the source.)

Table 2: Coefficient Estimates. Columns: \hat{\beta}_n, SD, and ASE, each for n = 1000 and for n = 250. Rows: Models 1-4, each with the Probit, MOD-1, Ichimura, MOD-2, and MLE estimators. (The numerical entries are not legible in the source.)

of \beta_0, there is little improvement over the standard probit case. It should be kept in mind that the averages of the distributions and densities in Figures 1 and 2 can be somewhat misleading for this estimator, since the non-monotonic regions, which can be very troublesome on a per-sample basis, get averaged out across the 1000 simulations. The estimates of the density can also be quite jagged, which is evident from the plots of the standard deviations. In summary, MOD performs very well relative to both probit and Ichimura's model, especially when the errors are distributed under G_1.

Given that the support of -x'\beta_0 was fixed at [-10, 10], the variances of both the errors and -x'\beta_0 were constructed to be large so that the results here would be meaningful. For example, if x_2 ~ TN(0, 1, 9), then P(|x'\beta_0| > 5) is approximately zero. This causes two problems for analysis: 1) in the L_2 calculations, the tail estimates of the distribution and density are relatively unimportant, since the density of -x'\beta_0 is close to zero there; 2) tail estimates will be very imprecise, since there is almost no information (i.e., observations) there. In practice, therefore, the researcher should also estimate the density of -x'\beta_0 and use caution with estimates of F(\cdot) and f(\cdot) where the estimated density of -x'\beta_0 is small.

4 Extensions

A useful extension is an ordered probit type model. For example, suppose that one observes

y_i = 1 if y_i^* > C_i; y_i = 0 if c_i <= y_i^* <= C_i; y_i = -1 if y_i^* < c_i,

where {c_i, C_i} is a known sequence of constants. Then

E[y_i | x_i] = -1 \cdot P(y_i^* < c_i | x_i) + 1 \cdot P(y_i^* > C_i | x_i) = 1 - F(c_i - x_i'\beta_0) - F(C_i - x_i'\beta_0).

Given this moment, the problem is in the same format as in Section 2. It is also noted here that MOD can estimate a general class of univariate monotone functions, or monotone functions that satisfy an index type restriction. To do this, simply relax the bounds on a and b in (2.4). As an example, suppose you want to estimate g(x),


ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2014 Instructor: Victor guirregabiria SOLUTION TO FINL EXM Monday, pril 14, 2014. From 9:00am-12:00pm (3 hours) INSTRUCTIONS:

More information

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ

CV-NP BAYESIANISM BY MCMC. Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUEZ CV-NP BAYESIANISM BY MCMC Cross Validated Non Parametric Bayesianism by Markov Chain Monte Carlo CARLOS C. RODRIGUE Department of Mathematics and Statistics University at Albany, SUNY Albany NY 1, USA

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

1/sqrt(B) convergence 1/B convergence B

1/sqrt(B) convergence 1/B convergence B The Error Coding Method and PICTs Gareth James and Trevor Hastie Department of Statistics, Stanford University March 29, 1998 Abstract A new family of plug-in classication techniques has recently been

More information

Likelihood Ratio Tests and Intersection-Union Tests. Roger L. Berger. Department of Statistics, North Carolina State University

Likelihood Ratio Tests and Intersection-Union Tests. Roger L. Berger. Department of Statistics, North Carolina State University Likelihood Ratio Tests and Intersection-Union Tests by Roger L. Berger Department of Statistics, North Carolina State University Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series Number 2288

More information

Antonietta Mira. University of Pavia, Italy. Abstract. We propose a test based on Bonferroni's measure of skewness.

Antonietta Mira. University of Pavia, Italy. Abstract. We propose a test based on Bonferroni's measure of skewness. Distribution-free test for symmetry based on Bonferroni's Measure Antonietta Mira University of Pavia, Italy Abstract We propose a test based on Bonferroni's measure of skewness. The test detects the asymmetry

More information

Rank Estimation of Partially Linear Index Models

Rank Estimation of Partially Linear Index Models Rank Estimation of Partially Linear Index Models Jason Abrevaya University of Texas at Austin Youngki Shin University of Western Ontario October 2008 Preliminary Do not distribute Abstract We consider

More information

Function Approximation

Function Approximation 1 Function Approximation This is page i Printer: Opaque this 1.1 Introduction In this chapter we discuss approximating functional forms. Both in econometric and in numerical problems, the need for an approximating

More information

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Nonparametric Identi cation of Regression Models Containing a Misclassi ed Dichotomous Regressor Without Instruments

Nonparametric Identi cation of Regression Models Containing a Misclassi ed Dichotomous Regressor Without Instruments Nonparametric Identi cation of Regression Models Containing a Misclassi ed Dichotomous Regressor Without Instruments Xiaohong Chen Yale University Yingyao Hu y Johns Hopkins University Arthur Lewbel z

More information

Calculation of maximum entropy densities with application to income distribution

Calculation of maximum entropy densities with application to income distribution Journal of Econometrics 115 (2003) 347 354 www.elsevier.com/locate/econbase Calculation of maximum entropy densities with application to income distribution Ximing Wu Department of Agricultural and Resource

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han Victoria University of Wellington New Zealand Robert de Jong Ohio State University U.S.A October, 2003 Abstract This paper considers Closest

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

A characterization of consistency of model weights given partial information in normal linear models

A characterization of consistency of model weights given partial information in normal linear models Statistics & Probability Letters ( ) A characterization of consistency of model weights given partial information in normal linear models Hubert Wong a;, Bertrand Clare b;1 a Department of Health Care

More information

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring,

Richard DiSalvo. Dr. Elmer. Mathematical Foundations of Economics. Fall/Spring, The Finite Dimensional Normed Linear Space Theorem Richard DiSalvo Dr. Elmer Mathematical Foundations of Economics Fall/Spring, 20-202 The claim that follows, which I have called the nite-dimensional normed

More information

A Local Generalized Method of Moments Estimator

A Local Generalized Method of Moments Estimator A Local Generalized Method of Moments Estimator Arthur Lewbel Boston College June 2006 Abstract A local Generalized Method of Moments Estimator is proposed for nonparametrically estimating unknown functions

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

ON STATISTICAL INFERENCE UNDER ASYMMETRIC LOSS. Abstract. We introduce a wide class of asymmetric loss functions and show how to obtain

ON STATISTICAL INFERENCE UNDER ASYMMETRIC LOSS. Abstract. We introduce a wide class of asymmetric loss functions and show how to obtain ON STATISTICAL INFERENCE UNDER ASYMMETRIC LOSS FUNCTIONS Michael Baron Received: Abstract We introduce a wide class of asymmetric loss functions and show how to obtain asymmetric-type optimal decision

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

A SEMIPARAMETRIC MODEL FOR BINARY RESPONSE AND CONTINUOUS OUTCOMES UNDER INDEX HETEROSCEDASTICITY

A SEMIPARAMETRIC MODEL FOR BINARY RESPONSE AND CONTINUOUS OUTCOMES UNDER INDEX HETEROSCEDASTICITY JOURNAL OF APPLIED ECONOMETRICS J. Appl. Econ. 24: 735 762 (2009) Published online 24 April 2009 in Wiley InterScience (www.interscience.wiley.com).1064 A SEMIPARAMETRIC MODEL FOR BINARY RESPONSE AND CONTINUOUS

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Economics 241B Estimation with Instruments

Economics 241B Estimation with Instruments Economics 241B Estimation with Instruments Measurement Error Measurement error is de ned as the error resulting from the measurement of a variable. At some level, every variable is measured with error.

More information

APPROXIMATING CONTINUOUS FUNCTIONS: WEIERSTRASS, BERNSTEIN, AND RUNGE

APPROXIMATING CONTINUOUS FUNCTIONS: WEIERSTRASS, BERNSTEIN, AND RUNGE APPROXIMATING CONTINUOUS FUNCTIONS: WEIERSTRASS, BERNSTEIN, AND RUNGE WILLIE WAI-YEUNG WONG. Introduction This set of notes is meant to describe some aspects of polynomial approximations to continuous

More information

Contents. 6 Systems of First-Order Linear Dierential Equations. 6.1 General Theory of (First-Order) Linear Systems

Contents. 6 Systems of First-Order Linear Dierential Equations. 6.1 General Theory of (First-Order) Linear Systems Dierential Equations (part 3): Systems of First-Order Dierential Equations (by Evan Dummit, 26, v 2) Contents 6 Systems of First-Order Linear Dierential Equations 6 General Theory of (First-Order) Linear

More information

Economics 620, Lecture 20: Generalized Method of Moment (GMM)

Economics 620, Lecture 20: Generalized Method of Moment (GMM) Economics 620, Lecture 20: Generalized Method of Moment (GMM) Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 20: GMM 1 / 16 Key: Set sample moments equal to theoretical

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

12 - Nonparametric Density Estimation

12 - Nonparametric Density Estimation ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6

More information

STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER. Department of Mathematics. University of Wisconsin.

STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER. Department of Mathematics. University of Wisconsin. STOCHASTIC DIFFERENTIAL EQUATIONS WITH EXTRA PROPERTIES H. JEROME KEISLER Department of Mathematics University of Wisconsin Madison WI 5376 keisler@math.wisc.edu 1. Introduction The Loeb measure construction

More information

Testing for Regime Switching: A Comment

Testing for Regime Switching: A Comment Testing for Regime Switching: A Comment Andrew V. Carter Department of Statistics University of California, Santa Barbara Douglas G. Steigerwald Department of Economics University of California Santa Barbara

More information

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i, A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

The best expert versus the smartest algorithm

The best expert versus the smartest algorithm Theoretical Computer Science 34 004 361 380 www.elsevier.com/locate/tcs The best expert versus the smartest algorithm Peter Chen a, Guoli Ding b; a Department of Computer Science, Louisiana State University,

More information

PARAMETER IDENTIFICATION IN THE FREQUENCY DOMAIN. H.T. Banks and Yun Wang. Center for Research in Scientic Computation

PARAMETER IDENTIFICATION IN THE FREQUENCY DOMAIN. H.T. Banks and Yun Wang. Center for Research in Scientic Computation PARAMETER IDENTIFICATION IN THE FREQUENCY DOMAIN H.T. Banks and Yun Wang Center for Research in Scientic Computation North Carolina State University Raleigh, NC 7695-805 Revised: March 1993 Abstract In

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE

Chapter 6. Panel Data. Joan Llull. Quantitative Statistical Methods II Barcelona GSE Chapter 6. Panel Data Joan Llull Quantitative Statistical Methods II Barcelona GSE Introduction Chapter 6. Panel Data 2 Panel data The term panel data refers to data sets with repeated observations over

More information

Generated Covariates in Nonparametric Estimation: A Short Review.

Generated Covariates in Nonparametric Estimation: A Short Review. Generated Covariates in Nonparametric Estimation: A Short Review. Enno Mammen, Christoph Rothe, and Melanie Schienle Abstract In many applications, covariates are not observed but have to be estimated

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

Spurious Chaotic Solutions of Dierential. Equations. Sigitas Keras. September Department of Applied Mathematics and Theoretical Physics

Spurious Chaotic Solutions of Dierential. Equations. Sigitas Keras. September Department of Applied Mathematics and Theoretical Physics UNIVERSITY OF CAMBRIDGE Numerical Analysis Reports Spurious Chaotic Solutions of Dierential Equations Sigitas Keras DAMTP 994/NA6 September 994 Department of Applied Mathematics and Theoretical Physics

More information

ARTIFICIAL INTELLIGENCE LABORATORY. and CENTER FOR BIOLOGICAL INFORMATION PROCESSING. A.I. Memo No August Federico Girosi.

ARTIFICIAL INTELLIGENCE LABORATORY. and CENTER FOR BIOLOGICAL INFORMATION PROCESSING. A.I. Memo No August Federico Girosi. MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL INFORMATION PROCESSING WHITAKER COLLEGE A.I. Memo No. 1287 August 1991 C.B.I.P. Paper No. 66 Models of

More information

Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis

Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis Glenn Heller Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center,

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

arxiv: v1 [physics.comp-ph] 22 Jul 2010

arxiv: v1 [physics.comp-ph] 22 Jul 2010 Gaussian integration with rescaling of abscissas and weights arxiv:007.38v [physics.comp-ph] 22 Jul 200 A. Odrzywolek M. Smoluchowski Institute of Physics, Jagiellonian University, Cracov, Poland Abstract

More information

Estimation of Treatment Effects under Essential Heterogeneity

Estimation of Treatment Effects under Essential Heterogeneity Estimation of Treatment Effects under Essential Heterogeneity James Heckman University of Chicago and American Bar Foundation Sergio Urzua University of Chicago Edward Vytlacil Columbia University March

More information

Extension of continuous functions in digital spaces with the Khalimsky topology

Extension of continuous functions in digital spaces with the Khalimsky topology Extension of continuous functions in digital spaces with the Khalimsky topology Erik Melin Uppsala University, Department of Mathematics Box 480, SE-751 06 Uppsala, Sweden melin@math.uu.se http://www.math.uu.se/~melin

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Statistical Learning Theory

Statistical Learning Theory Statistical Learning Theory Fundamentals Miguel A. Veganzones Grupo Inteligencia Computacional Universidad del País Vasco (Grupo Inteligencia Vapnik Computacional Universidad del País Vasco) UPV/EHU 1

More information

A Note on Demand Estimation with Supply Information. in Non-Linear Models

A Note on Demand Estimation with Supply Information. in Non-Linear Models A Note on Demand Estimation with Supply Information in Non-Linear Models Tongil TI Kim Emory University J. Miguel Villas-Boas University of California, Berkeley May, 2018 Keywords: demand estimation, limited

More information

Estimating Semi-parametric Panel Multinomial Choice Models

Estimating Semi-parametric Panel Multinomial Choice Models Estimating Semi-parametric Panel Multinomial Choice Models Xiaoxia Shi, Matthew Shum, Wei Song UW-Madison, Caltech, UW-Madison September 15, 2016 1 / 31 Introduction We consider the panel multinomial choice

More information