On The Asymptotics of Minimum Disparity Estimation


Noname manuscript No. (will be inserted by the editor)

On The Asymptotics of Minimum Disparity Estimation

Arun Kumar Kuchibhotla · Ayanendranath Basu

Received: date / Accepted: date

Abstract Inference procedures based on the minimization of divergences are popular statistical tools. Beran (1977) proved consistency and asymptotic normality of the minimum Hellinger distance (MHD) estimator. This method was later extended to the large class of disparities in discrete models by Lindsay (1994), who proved the existence of a sequence of roots of the estimating equation which is consistent and asymptotically normal. However, the current literature does not provide a general asymptotic result about the minimizer of a generic disparity. In this paper we prove, under very general conditions, an asymptotic representation of the minimum disparity estimator itself (and not just of a root of the estimating equation), thus generalizing the results of Beran (1977) and Lindsay (1994). This leads to a general framework for minimum disparity estimation encompassing both discrete and continuous models.

Keywords Disparity · Quadratic Approximation · Non-parametric Density Estimation

1 Introduction

Different types of divergence measures have been used in the literature to measure the dissimilarity between two distributions. A prominent subclass of density-based divergences is the family of disparities, which will be described in detail in Section 2. Given a density g and a family of parametric densities, a natural way of obtaining a best fitting parameter is to minimize a disparity measure between g and a density from the (parametric) family over the parameter space. When dealing with point estimation in parametric models, maximum likelihood is the most popular method of estimation, but other alternatives like the method of moments and M-estimators are also available.

University of Pennsylvania and Indian Statistical Institute
arunku@wharton.upenn.edu, ayanbasu@isical.ac.in

Considering the efficiency of the estimator to be the criterion for comparison, the maximum likelihood estimator is one of the best under some regularity conditions. Rao (1961), Robertson (1972) and Fryer and Robertson (1972) have noted that there is a class of estimators containing the maximum likelihood estimator such that each estimator in the class is asymptotically efficient, or asymptotically equivalent to the maximum likelihood estimator (up to order $n^{-1/2}$). Many authors have followed this up by considering various other criteria, like higher order efficiency, in order to single out the maximum likelihood estimator as the best. But in the current era of big data, some errors in the generation, recording and transmission of data are not unexpected. Thus it appears justifiable that one should consider the asymptotic robustness of the estimators together with their asymptotic efficiency when comparing estimators. Note, however, that while there is a well established concept of asymptotic efficiency of an estimator, there is no universal way of proving asymptotic robustness of an estimator or claiming that some estimator is the best robust estimator.

Beran (1977) considered the minimum Hellinger distance estimator in continuous models. He appears to be the first to prove that there are estimators which are asymptotically fully efficient while enjoying strong robustness properties. Beran's (1977) approach required a non-parametric estimator of the data density. The Hellinger distance was then replaced by a general disparity by Lindsay (1994), who considered discrete models and used sample proportions as estimates of the actual density. A focal point of his work was the study of the properties of zeros of an estimating function obtained as the derivative of a disparity. The main result of Lindsay (1994) states that there exists a sequence of roots which is consistent and asymptotically normal, with asymptotic variance coinciding with the inverse of the Fisher information when the true density is an element of the parametric family. Later the results of Lindsay (1994) were extended by Basu and Lindsay (1994), Park and Basu (2004) and Kuchibhotla and Basu (2015) to continuous models under different conditions on the model, the kernel density estimate and the disparity generating function. However, these authors also consider the roots of an estimating equation rather than the minimum disparity estimator itself. As noted by Ferguson (1982), proving the asymptotic results for some sequence of roots of the disparity based estimating equation need not prove the same for the minimum disparity estimator. Also, the results of the previous authors only state that there exists a good sequence of roots and do not prescribe how to obtain such a sequence when there are multiple roots of the estimating equation. In light of this discussion, we feel that one should derive the asymptotic results for the minimum disparity estimator itself. Also, an approach which parallels the framework of Lindsay (1994) in the case of continuous models in terms of the conditions on the disparity does not exist in the literature. Although Kuchibhotla and Basu (2015) considered a set up where the disparity conditions are milder than those of Lindsay (1994), they have stronger conditions on the density estimator.

In this paper, we first prove a grand consistency theorem for the minimum disparity estimator under minimal conditions. We then develop an asymptotic representation of the minimum disparity estimator in a general framework. Our results are applicable whenever the densities exist with respect to a σ-finite base measure, rather than being specific to the case of the Lebesgue measure. Also, the conditions on the disparity are exactly the same as those in Lindsay (1994). The specific achievements of this paper may be listed as follows.

1. Consistency is proved under minimal conditions for a suitable subclass of disparities; even the differentiability of the probability density function with respect to the parameter or the smoothness of the disparity generating function is not required.
2. All the results proved in this paper relate to the minimizer of the disparity itself, and not just a suitable sequence of roots of the estimating equation. This is unlike most of the previous work done in this area; Beran (1977) is an exception.
3. The grand consistency theorem and the asymptotic representation of the disparity do not require the observations to be independent; neither is it necessary for the density estimator to be a kernel density estimator.
4. Theorem 4.1, together with Remark 10, establishes a general framework for minimum disparity estimation encompassing both discrete and continuous models. The results of Lindsay (1994) emerge as a special case.
5. The development described in the previous items establishes the legitimacy of the disparity based analogue of the likelihood ratio test considered in the theorems of Section 5, which depends explicitly on the minimizer of the disparity. This also avoids the possibility of having a negative statistic due to the use of a root which is not a global minimizer.

We now outline the remaining sections of the paper. In Section 2, we present the grand consistency theorem of the minimum disparity estimator. In Section 3, we prove the quadratic approximation of the disparity which leads to an asymptotic representation of the minimum disparity estimator. In Section 4, we prove asymptotic normality of the estimating function which, combined with the asymptotic representation of the estimator, leads to the asymptotic normality of the minimum disparity estimator. In Section 5, we consider testing of hypotheses using disparities. Finally, we conclude with some remarks in Section 6. We try to present our results step-by-step so that the assumptions required for each step become transparent and the generalization of the results currently available only for kernel density estimators becomes easier. In this paper we deal with the asymptotic efficiency results of the minimum disparity estimator, and do not re-emphasize the well known robustness properties of these estimators. However, see Remark 5 and Theorem 5.3. Although we primarily follow the approach of Lindsay (1994) in defining the disparities, the class of disparities also coincides with the class of φ-divergences of Csiszár (1963) and Ali and Silvey (1966). Other authors have worked with the φ-divergence formulation and independently determined the properties of the

corresponding minimum distance procedures, primarily in discrete models. See, for example, Morales et al. (1995) and Pardo (2006). However, the literature is deficient in general results based on φ-divergences in continuous models, where the results are usually scattered, corresponding to specific divergences, as in Beran (1977) or Basu et al. (1997).

2 Consistency

Let $\mathcal{G}$ represent the class of all probability distributions having densities with respect to some σ-finite base measure µ on some measurable space $(\Omega, \Lambda, \mu)$, with Λ representing a σ-field on Ω. We assume that the true distribution G and the model $\mathcal{F}_\Theta = \{F_\theta : \theta \in \Theta\}$ belong to $\mathcal{G}$. Let $g$ and $f_\theta$ be the corresponding densities (with respect to µ). Let $X_1, X_2, \ldots, X_n$ be a random sample from G which is modelled by $\mathcal{F}_\Theta$. We do not necessarily assume that the observations are independent, although we require them to be identically distributed. Our aim is to estimate the parameter θ by choosing the model density which gives the closest fit to the data.

Let C be a real valued strictly convex function with $C(0) = 0$. Consider the divergence given by the form
$$\rho_C(g, f_\theta) = \int C\left(\frac{g(x)}{f_\theta(x)} - 1\right) f_\theta(x)\,d\mu(x).$$
This form describes the class of all disparities (Lindsay, 1994) between the densities $g$ and $f_\theta$. For $g(x) = 0$ or $f_\theta(x) = 0$, we use the following convention:
$$0\,C\left(\frac{0}{0} - 1\right) = 0, \qquad 0\,C\left(\frac{a}{0} - 1\right) = a \lim_{d \to \infty} \frac{C(d)}{d}.$$
The function C in the disparity $\rho_C(g, f_\theta)$ is called the disparity generating function. An application of Jensen's inequality shows that $\rho_C(g, f_\theta) \ge 0$ with equality if and only if $g = f_\theta$ identically. If the base measure is the counting measure on the set $\{a_1, a_2, \ldots\}$, then the disparity can be written as
$$\rho_C(g, f_\theta) = \sum_{i=1}^{\infty} C\left(\frac{g(a_i)}{f_\theta(a_i)} - 1\right) f_\theta(a_i).$$
If the base measure is the Lebesgue measure (λ) on the Euclidean space $\mathbb{R}^m$, then the disparity can be written as
$$\rho_C(g, f_\theta) = \int_{\mathbb{R}^m} C\left(\frac{g}{f_\theta} - 1\right) f_\theta\,d\lambda.$$
The residual $\delta(x) = (g(x)/f_\theta(x)) - 1$ has been called the Pearson residual in Lindsay (1994) and we follow this nomenclature here; the range of the Pearson residual is $[-1, \infty)$.
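The definition above is easy to evaluate numerically. The following minimal sketch (ours, not part of the original paper) computes $\rho_C(g, f_\theta)$ by quadrature for two fixed normal densities, under two standard generating functions: the likelihood disparity $C(\delta) = (\delta + 1)\log(\delta + 1) - \delta$, for which $\rho_C$ reduces to the Kullback-Leibler divergence, and the (twice-squared) Hellinger distance $C(\delta) = 2(\sqrt{\delta + 1} - 1)^2$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

g = lambda x: norm.pdf(x, loc=0.5)          # "true" density g
f = lambda x: norm.pdf(x, loc=0.0)          # model density f_theta

def rho(C, lo=-10.0, hi=10.0):
    # integrand C(delta) * f_theta with Pearson residual delta = g/f - 1;
    # both densities are positive here, so the 0/0 conventions never trigger
    return quad(lambda x: C(g(x) / f(x) - 1.0) * f(x), lo, hi)[0]

LD = lambda d: (d + 1.0) * np.log1p(d) - d            # likelihood disparity
HD = lambda d: 2.0 * (np.sqrt(d + 1.0) - 1.0) ** 2    # Hellinger distance

print(rho(LD))   # Kullback-Leibler divergence; here 0.5**2 / 2 = 0.125
print(rho(HD))
```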

Remark 1 The disparity generating function can be changed to $C_1(\delta) = C(\delta) - t\delta$ (for any $t \in \mathbb{R}$) without changing the disparity $\rho_C$, since $\int \delta f_\theta\,d\mu = 0$. If C is differentiable, we can take $t = C'(0)$, so that $C_1'(0) = 0$. Since the redefined disparity generating function $C_1$ is also strictly convex, any local minimum of $C_1$ represents a unique global minimum. Thus the condition $C_1'(0) = 0$ actually makes the disparity generating function $C_1$ itself non-negative, and therefore so is the integrand of the disparity. (Non-negativity of the disparity generating function, however, is not necessary for the disparity to be non-negative.) Even when the disparity generating function C is not differentiable, one can still redefine it as $C_1(\delta) = C(\delta) - t\delta$ for any $t$ that is a sub-gradient of C at 0, so that $C_1$ is non-negative. See Proposition 8.5 of Vajda (1989) for more details. In our subsequent proofs, we will choose the disparity generating function to be non-negative.

We denote by $\theta_g = T(G)$ the best fitting parameter, which minimizes $\rho_C(g, f_\theta)$ over all $\theta \in \Theta$. To get an estimate of $T(G)$, we consider minimizing an estimate of the disparity $\rho_C(g, f_\theta)$ based on the random sample $X_1, X_2, \ldots, X_n$, which are identically distributed with density $g$. A natural estimate of the disparity can be obtained by replacing $g$ by a suitable density estimator $g_n$. Thus, we consider the minimum disparity estimator $\hat{\theta}_n$ of $\theta_g$ defined by
$$\hat{\theta}_n := \arg\min_{\theta \in \Theta} \rho_C(g_n, f_\theta).$$
The minimizer $\hat{\theta}_n$ may not be unique, and in such a case $\hat{\theta}_n$ represents any one of the minimizers. We prove that any minimizer is strongly consistent under some conditions. The main component in the proof of the grand consistency theorem which follows is the uniform convergence of the objective function $\rho_C(g_n, f_\theta)$ to $\rho_C(g, f_\theta)$ over all $\theta \in \Theta$; we then use van der Vaart (1998, Theorem 5.7) to get the strong consistency of any minimizer $\hat{\theta}_n$.

Theorem 2.1 Suppose that the following assumptions hold:

(C1) The parameter space Θ is compact;
(C2) $C(-1) + C'(\infty) < \infty$, where $C'(\infty) = \lim_{u \to \infty} C(u)/u$;
(C3) For each $\theta \in \Theta$ and any sequence $\theta_n \to \theta$, $\lim_{n \to \infty} f_{\theta_n}(x) = f_\theta(x)$ for all $x$, except possibly on a set (which might depend on θ but not on the sequence $\{\theta_n\}$) of µ-measure zero. Also, $\theta_g$ is the unique minimizer of the disparity $\rho_C(g, f_\theta)$;
(C4) $g_n$ is strongly consistent for $g$, i.e., $g_n(x)$ converges almost surely to $g(x)$ for µ-almost all $x$.

Then the minimum disparity estimator $\hat{\theta}_n$ is strongly consistent for $\theta_g$.

Proof First, we prove that continuity of $f_\theta(x)$ in θ for almost all $x$ implies that $\rho_C(g, f_\theta)$ is also continuous in θ for every density $g$. This proves that under assumptions (C1) and (C3) the minimizers $\hat{\theta}_n$ and $\theta_g$ exist, since a continuous function on a compact set attains its minimum. Suppose $\theta_n \to \theta \in \Theta$ as $n \to \infty$. By Lemma 11.1 of Basu et al. (2011) and Remark 1, we have
$$0 \le C\left(\frac{g}{f_{\theta_n}} - 1\right) f_{\theta_n} \le \frac{C(-1) + C'(\infty)}{2}\,|g - f_{\theta_n}|. \quad (2.1)$$
Clearly, the upper bound is integrable, and
$$\int |g - f_{\theta_n}|\,d\mu = 2 - 2\int \min\{g, f_{\theta_n}\}\,d\mu.$$
Since $\min\{g, f_{\theta_n}\} \to \min\{g, f_\theta\}$ as $n \to \infty$ and $\min\{g, f_{\theta_n}\} \le g$, which is integrable, we have by the dominated convergence theorem that $\int \min\{g, f_{\theta_n}\}\,d\mu \to \int \min\{g, f_\theta\}\,d\mu$, and hence
$$\int |g - f_{\theta_n}|\,d\mu \to \int |g - f_\theta|\,d\mu. \quad (2.2)$$
Now using Pratt's lemma (Theorem 5.5, Gut (2013)) with Equations (2.1) and (2.2), we get, as $n \to \infty$,
$$\rho_C(g, f_{\theta_n}) \to \rho_C(g, f_\theta),$$
implying continuity of $\rho_C(g, f_\theta)$ with respect to θ.

Next we prove that the estimate $\rho_C(g_n, f_\theta)$ of the disparity $\rho_C(g, f_\theta)$ is uniformly strongly consistent over $\theta \in \Theta$, in the sense that
$$\sup_{\theta \in \Theta} |\rho_C(g_n, f_\theta) - \rho_C(g, f_\theta)| \xrightarrow{a.s.} 0,$$
as $n \to \infty$. Since Θ is assumed to be compact and $\rho_C(g, f_\theta)$ is continuous in θ, proving continuous convergence suffices to get uniform convergence (see Theorem 1 of Iséki (1957)). Take a sequence $\{\theta_n\} \subset \Theta$ converging to $\theta \in \Theta$ as $n \to \infty$. Then we need to prove that
$$\rho_C(g_n, f_{\theta_n}) - \rho_C(g, f_{\theta_n}) \xrightarrow{a.s.} 0,$$
as $n \to \infty$. We will prove, using Pratt's lemma, that $\rho_C(g_n, f_{\theta_n})$ and $\rho_C(g, f_{\theta_n})$ converge (a.s.) to $\rho_C(g, f_\theta)$ as $n \to \infty$. By Lemma 11.1 of Basu et al. (2011) with $g_n$ replacing $g$, we have
$$0 \le C\left(\frac{g_n}{f_{\theta_n}} - 1\right) f_{\theta_n} \le \frac{C(-1) + C'(\infty)}{2}\,|g_n - f_{\theta_n}|.$$
By the triangle inequality, we get that
$$\int |g - f_\theta|\,d\mu \le \int |g_n - g|\,d\mu + \int |g_n - f_{\theta_n}|\,d\mu + \int |f_\theta - f_{\theta_n}|\,d\mu, \quad (2.3)$$
$$\int |g_n - f_{\theta_n}|\,d\mu \le \int |g_n - g|\,d\mu + \int |g - f_\theta|\,d\mu + \int |f_{\theta_n} - f_\theta|\,d\mu. \quad (2.4)$$

We know that $g_n(x)$ converges almost surely to $g(x)$ for almost all $x$. Thus, by Glick's theorem (Devroye and Györfi, 1985, page 10), outside of a set B of measure 0,
$$\int |g_n - g|\,d\mu \to 0, \quad \text{as } n \to \infty.$$
Fix an $\omega \in B^c$. Taking the lim inf and lim sup on both sides of inequalities (2.3) and (2.4) respectively, and using Glick's theorem, we get
$$\int |g - f_\theta|\,d\mu \le \liminf_{n} \int |g_n - f_{\theta_n}|\,d\mu, \quad \text{and} \quad \limsup_{n} \int |g_n - f_{\theta_n}|\,d\mu \le \int |g - f_\theta|\,d\mu.$$
Hence we get $\int |g_n - f_{\theta_n}|\,d\mu \xrightarrow{a.s.} \int |g - f_\theta|\,d\mu$. Now applying Pratt's lemma for each $\omega \in B^c$ and using the above relations, we get, as $n \to \infty$,
$$\rho_C(g_n, f_{\theta_n}) \xrightarrow{a.s.} \rho_C(g, f_\theta).$$
Similarly, one gets that $\rho_C(g, f_{\theta_n})$ converges almost surely to $\rho_C(g, f_\theta)$ as $n \to \infty$. Therefore, by Theorem 1 of Iséki (1957), we have, as $n \to \infty$,
$$\sup_{\theta \in \Theta} |\rho_C(g_n, f_\theta) - \rho_C(g, f_\theta)| \xrightarrow{a.s.} 0.$$
By Theorem 5.7 of van der Vaart (1998), we get that any minimizer $\hat{\theta}_n$ is strongly consistent for $\theta_g$.

Remark 2 Under the assumption that $g = f_{\theta_0}$ and that the model is identifiable (i.e., $f_{\theta_1} = f_{\theta_2}$ if and only if $\theta_1 = \theta_2$), the minimizer $\theta_g$ is unique and is given by $\theta_0$. In general, it is easy to modify the proof of the above theorem to show that an approximate minimizer $\tilde{\theta}_n$ satisfying
$$\rho_C(g_n, f_{\tilde{\theta}_n}) < \varepsilon_n + \inf_{\theta \in \Theta} \rho_C(g_n, f_\theta), \quad \varepsilon_n \xrightarrow{a.s.} 0,$$
is also strongly consistent for $\theta_g$.

Remark 3 The above theorem is based on minimal assumptions. It does not even require the disparity generating function to be differentiable or the non-parametric density estimator to be a kernel density estimator. It only requires the density estimator to satisfy pointwise strong consistency. An elaborate discussion of different density estimators can be found in Prakasa Rao (1983). Various consistent estimators of density are available in the literature in many settings, including i.i.d., censored and different types of dependent data. Some of these methods and references are mentioned in Table 1 (presented in the Supplementary Material); an examination of these settings demonstrates the generality of the grand consistency theorem, which is applicable in all these cases. It is applicable even when Wald's consistency theorem for the maximum likelihood estimator, which requires $\log f_\theta(x) \le K(x)$ for some integrable function K, is not applicable. See Chapter 17 of Ferguson (1996).
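To make the preceding theory concrete, the following is a minimal computational sketch (ours; the paper contains no code) of the minimum disparity estimator $\hat{\theta}_n$ of Section 2. The Hellinger disparity, the Gaussian kernel density estimate for $g_n$ and the $N(\theta, 1)$ model are all illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=200)
x[:10] = 15.0                                 # plant a few gross outliers
g_n = gaussian_kde(x)                         # pointwise consistent estimate of g

def rho(theta):
    # rho_C(g_n, f_theta) for the Hellinger disparity, by quadrature
    def integrand(t):
        f = norm.pdf(t, loc=theta)
        d = g_n(t)[0] / f - 1.0               # Pearson residual delta_n
        return 2.0 * (np.sqrt(d + 1.0) - 1.0) ** 2 * f
    return quad(integrand, x.min() - 5.0, x.max() + 5.0, limit=200)[0]

theta_hat = minimize_scalar(rho, bounds=(-10.0, 20.0), method="bounded").x
print(theta_hat, x.mean())  # theta_hat stays near 2; the mean is dragged upward
```

In keeping with the robustness properties alluded to above, $\hat{\theta}_n$ here is far less sensitive than the sample mean to the planted outliers.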

Remark 4 Theorem 2.1 only requires the parameter space to be a metric space (not necessarily a Euclidean space). Also, we do not require the densities to exist with respect to the Lebesgue measure, and so the theorem is generally applicable for any class of densities with respect to any (a priori known) σ-finite measure. This theorem also treats the case of dependent observations, provided a strongly consistent density estimator is available. In this respect, we observe that the existence of a (weakly) consistent density estimator implies that the minimum disparity estimator is (weakly) consistent. This theorem also does not require the observations to be real or $\mathbb{R}^d$ valued.

Remark 5 The only limitation (if at all) of the conditions above is the requirement that the parameter space be compact, which can be relaxed to locally compact spaces if the actual disparity is a metric in the mathematical sense of the term. In this case, using the triangle inequality one gets the uniform convergence of the objective function, proving the result. See Cheng and Vidyashankar (2006) for details in the case of the minimum Hellinger distance estimator. Also see the remarks following Theorem 3.1 and Corollary 3.3 of Park and Basu (2004) for a discussion on accommodating some non-compact parameter spaces. Condition (C2) on the disparity generating function leads to an integrable upper bound on the integrand of the disparity. This condition is the only requirement on the disparity for Theorem 4.1 of Park and Basu (2004), which proves that the minimum disparity estimator under (C2) has an asymptotic breakdown point of at least 1/2. Thus all the minimum disparity estimators under assumption (C2) are strongly consistent and have asymptotic breakdown point of at least 1/2.

Remark 6 Another condition under which consistency of the minimum disparity estimator can be proved was presented in Park and Basu (2004). These authors require the boundedness of the derivative of C. One can, instead, also use the boundedness of the sub-gradient of C and allow non-differentiability of the disparity generating function. This is because of the inequalities
$$C(\delta_1) - C(\delta_2) \ge \dot{C}(\delta_2)(\delta_1 - \delta_2), \qquad C(\delta_2) - C(\delta_1) \ge \dot{C}(\delta_1)(\delta_2 - \delta_1),$$
where $\dot{C}$ represents any element of the sub-differential set. Thus, by strong consistency of the non-parametric density estimator and Glick's theorem, we get uniform convergence of the objective function, implying the strong consistency of the minimum disparity estimator. Thus the differentiability condition of Park and Basu (2004) is avoidable, and the boundedness requirement on the derivative can be replaced by the boundedness of a sub-gradient.

The disparity between the distributions $G, F \in \mathcal{G}$ can also be calculated directly, without requiring the dominating measure, by using the alternative expression
$$\varphi_C(G, F) = \sup_{\mathcal{D}} \sum_{A \in \mathcal{D}} C\left(\frac{G(A)}{F(A)} - 1\right) F(A), \quad (2.5)$$

where $\mathcal{A}$ is a countably generated σ-algebra and the supremum extends over all finite partitions $\mathcal{D} \subset \mathcal{A}$ of the observation space. Here we have used $G, F$ to represent the corresponding probability measures. See Liese and Vajda (1987) for more details. Note that $\varphi_C(G, F)$ reduces to $\rho_C(g, f)$ when the densities $g, f$ exist with respect to a common dominating measure.

Remark 7 If we consider an increasing sequence of measurable decompositions $\mathcal{D}_m$ of $\mathbb{R}^d$, then Equation (2.5) can be written as
$$\rho_C(G, F) = \lim_{m \to \infty} \sum_{A \in \mathcal{D}_m} C\left(\frac{G(A)}{F(A)} - 1\right) F(A).$$
This leads to an approximation of the disparity which can be used to avoid the non-parametric density estimation. Consider, for example, the decomposition sequence $\mathcal{D}^{(n)}$ of $\mathbb{R}$ defined by the intervals $[X_i - h_n, X_i + h_n)$, $i = 1, 2, \ldots, n$, where $h_n = \min_{i \ne j} |X_i - X_j|/2$. In this case,
$$\sum_{A \in \mathcal{D}^{(n)}} C\left(\frac{G_n(A)}{F(A)} - 1\right) F(A) = \frac{1}{n} \sum_{i=1}^{n} C\left(\frac{1}{T_{i,n}} - 1\right) T_{i,n} = \frac{1}{n} \sum_{i=1}^{n} \xi(T_{i,n}),$$
where $\xi(x) = xC(\{1/x\} - 1)$, which is also a convex function, and $T_{i,n} = nF[X_i - h_n, X_i + h_n)$. One can minimize this quantity instead of the integral form of the disparity. This method coincides exactly with the generalized spacings based estimation methodology of Ghosh and Jammalamadaka (2001). This remark is similar in spirit to Remark 1 of Györfi et al. (1994). See Remark 13 for some further comments in this connection. Also see Ekström (2001).

Under differentiability of the model, thrice differentiability of the function C and the assumption that $\hat{\theta}_n$ lies in the interior of Θ, $\hat{\theta}_n$ can be obtained as a root of the equation
$$\int A(\delta_n(x)) \nabla f_\theta(x)\,d\mu(x) = 0, \quad (2.6)$$
where $\nabla$ represents the gradient with respect to θ and $A(\delta) = C'(\delta)(\delta + 1) - C(\delta)$. Let $\nabla_2$ denote the second derivative with respect to θ. The function $A(\cdot)$ is called the residual adjustment function (RAF) of the disparity, and $\delta_n = (g_n/f_\theta) - 1$ is the Pearson residual. An RAF is regular if $A'(\delta)$ and $A''(\delta)(\delta + 1)$ are bounded for $\delta \in [-1, \infty)$. The RAF plays a vital role in the robustness properties of the estimator. The minimum disparity estimator need not always satisfy the estimating equation, for example, when the minimum is attained on the boundary of the parameter space or when C is not differentiable.
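As a concrete instance of the map from C to A (our illustration, not from the original text), take the twice-squared Hellinger distance, for which $C(\delta) = 2(\sqrt{\delta + 1} - 1)^2$ and $\rho_C(g, f_\theta) = 2\int (\sqrt{g} - \sqrt{f_\theta})^2\,d\mu$. Then $C'(\delta) = 2(1 - (\delta + 1)^{-1/2})$, so that
$$A(\delta) = C'(\delta)(\delta + 1) - C(\delta) = \left[2(\delta + 1) - 2\sqrt{\delta + 1}\right] - 2\left[\delta + 2 - 2\sqrt{\delta + 1}\right] = 2\left(\sqrt{\delta + 1} - 1\right),$$
which is the well known residual adjustment function of the Hellinger distance (Lindsay, 1994).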

3 Quadratic Approximation

A popular and powerful technique for proving asymptotic normality (or, more generally, for finding the asymptotic distribution) of maximizers or minimizers is to prove a uniform quadratic approximation of the objective function. In the case of maximum likelihood estimation, local asymptotic normality (LAN) of the parametric model leads to a quadratic approximation of the log-likelihood and thus gives asymptotic normality of the maximum likelihood estimator. We will now follow the same approach in proving the asymptotic normality of the minimum disparity estimator. In this section, we prove an asymptotic representation of the estimator under the condition that $\theta_g$ is an interior point of Θ. In case $\theta_g \in \Theta^0$, the best fitting parameter $\theta_g$ satisfies, under differentiability of $f_\theta$ with respect to θ and differentiability of C, the equation
$$\int A\left(\frac{g(x)}{f_{\theta_g}(x)} - 1\right) \nabla f_{\theta_g}(x)\,d\mu(x) = 0. \quad (3.1)$$
To establish the uniform quadratic approximation of the disparity objective function, assume that $\Theta \subseteq \mathbb{R}^p$ and define
$$\theta_n = \theta_g + n^{-1/2} w \in \Theta, \quad \text{and} \quad \Lambda_n(w) = n\rho_C(g_n, f_{\theta_n}) - n\rho_C(g_n, f_{\theta_g}).$$
Let $u_\theta(x) = \nabla \log f_\theta(x)$ denote the usual likelihood score function. Consider the following assumptions.

(A1) The residual adjustment function A is regular, i.e., the functions $A'(\cdot)$ and $A''(\cdot)(\cdot + 1)$ are both bounded, say, by M;
(A2) The parameter space $\Theta \subseteq \mathbb{R}^p$ (for some $p \ge 1$) is bounded and $\theta_g \in \Theta^0$;
(A3) The densities in the family $\mathcal{F}_\Theta$ are all twice continuously differentiable with respect to θ. Also, there exists a compact neighbourhood $\Theta_g$ of $\theta_g$ and a function $M_0$ such that the functions $\|u_\theta\|$, $\|\nabla u_\theta\|$ and $\|u_\theta u_\theta^\top\|$ are all bounded by $M_0$ uniformly in $\theta \in \Theta_g$, and $M_0(X)$ has finite expectation. The matrix $B(\theta)$ is positive definite and finite, where
$$B(\theta) := \int A'(\delta)(\delta + 1) u_\theta u_\theta^\top f_\theta\,d\mu - \int A(\delta) \nabla_2 f_\theta\,d\mu;$$
(A4) The density estimator sequence $\{g_n\}$ also satisfies, almost surely as $n \to \infty$,
$$\int g_n(x) M_0(x)\,d\mu(x) \to \int g(x) M_0(x)\,d\mu(x). \quad (3.2)$$

These assumptions are not very restrictive. Assumption (A1) is the very condition under which Lindsay (1994) presented his disparity framework in discrete models. Assumption (A2) is needed for applying a Taylor series expansion around $\theta_g$. Assumption (A3) is required to prove uniform convergence of a quantity $B_n(\theta)$ defined in the proof below. This assumption is one of the Cramér-Rao conditions required to prove asymptotic normality of the maximum likelihood estimator and is satisfied by the exponential family of densities. One could use separate bounds in (A3) for the three functions involved, but this provides no theoretical advantage. Assumption (A4) can be seen as a generalization of the strong law of large numbers, and holds by Kolmogorov's strong law of large numbers if the base measure is the counting measure and $g_n$ is replaced by the usual proportions based estimator of a discrete density. Some simpler conditions under which assumption (A4) holds were presented in van der Vaart and Wellner (1996) and in Zapała (2008). The right hand side of Equation (3.2) is finite by assumption (A3). In particular cases, like the kernel density estimator, an orthogonal series based estimator or a delta sequence based estimator, assumption (A4) is not difficult to verify.
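Condition (A1) can be checked directly for a given C. As an illustration (ours), the sketch below verifies it numerically on a grid for the negative exponential disparity of Basu et al. (1997), for which $C(\delta) = e^{-\delta} - 1 + \delta$, so that $A(\delta) = 2 - (\delta + 2)e^{-\delta}$, $A'(\delta) = (\delta + 1)e^{-\delta}$ and $A''(\delta) = -\delta e^{-\delta}$.

```python
import numpy as np

# Grid over the range [-1, infinity) of the Pearson residual; both functions
# decay to zero in the right tail, so a long finite grid suffices here.
d = np.linspace(-1.0, 50.0, 200001)
A1 = (d + 1.0) * np.exp(-d)                  # A'(d) for the NED
A2 = -d * np.exp(-d)                         # A''(d) for the NED

print(np.abs(A1).max())                      # = 1.0, attained at d = 0
print(np.abs(A2 * (d + 1.0)).max())          # ~0.84, so (A1) holds
```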

Theorem 3.1 Under the assumptions (A1)-(A4),
$$\Lambda_n(w) = -n^{1/2} w^\top \int A(\delta_n) \nabla f_{\theta_g}\,d\mu + \frac{1}{2} w^\top B(\theta_g) w + o_p(1)$$
holds uniformly in $w \in \{w : \|w\| \le K\}$ for every finite K. Here $\|\cdot\|$ represents the Euclidean norm in $\mathbb{R}^p$.

Proof First note that
$$\nabla\left[C\left(\frac{g_n}{f_\theta} - 1\right) f_\theta\right] = C\left(\frac{g_n}{f_\theta} - 1\right) \nabla f_\theta - C'\left(\frac{g_n}{f_\theta} - 1\right) \frac{g_n}{f_\theta} \nabla f_\theta = -A(\delta_n) \nabla f_\theta,$$
and
$$-\nabla\left[A(\delta_n) \nabla f_\theta\right] = A'(\delta_n)(\delta_n + 1) u_\theta u_\theta^\top f_\theta - A(\delta_n) \nabla_2 f_\theta.$$
Define
$$B_n(\theta) = \int A'(\delta_n)(\delta_n + 1) u_\theta u_\theta^\top f_\theta\,d\mu - \int A(\delta_n) \nabla_2 f_\theta\,d\mu =: S_2 - S_1.$$
We will first prove that $B_n(\theta)$ converges in probability to $B(\theta)$ uniformly in $\theta \in \Theta_g$, which will prove that
$$\sup_{w : \|w\| \le K} \left\|B_n(\theta_g + n^{-1/2} w) - B_n(\theta_g)\right\| \le 2 \sup_{\theta \in \Theta_g} \left\|B_n(\theta) - B(\theta)\right\| + o(1) = o_p(1).$$
Here we use the fact that for any finite K, there exists a large enough $n$ so that $\theta_g + n^{-1/2} w \in \Theta_g$ for all $\|w\| \le K$. To deal with $S_1$, by assumption (A1) and Lemma 25 of Lindsay (1994),
$$\left\|\int [A(\delta_n) - A(\delta)] \nabla_2 f_\theta\,d\mu\right\| \le \int \left|A(\delta_n) - A(\delta) - (\delta_n - \delta) A'(\delta)\right| \left\|\nabla_2 f_\theta\right\| d\mu + \int |A'(\delta)|\,\frac{|g - g_n|}{f_\theta}\,\left\|\nabla_2 f_\theta\right\| d\mu$$
$$\le B \int \left(g_n^{1/2} - g^{1/2}\right)^2 \frac{\|\nabla_2 f_\theta\|}{f_\theta}\,d\mu + M \int |g_n - g|\,\frac{\|\nabla_2 f_\theta\|}{f_\theta}\,d\mu \le (B + M) \int |g_n - g|\,\frac{\|\nabla_2 f_\theta\|}{f_\theta}\,d\mu, \quad (3.3)$$

because for any $a, b > 0$, $(\sqrt{a} - \sqrt{b})^2 \le |a - b|$. Since
$$|g_n(x) - g(x)|\,\frac{\|\nabla_2 f_\theta\|}{f_\theta} \le (g_n(x) + g(x)) M_0(x), \quad \text{and} \quad \int (g_n(x) + g(x)) M_0(x)\,d\mu(x) \xrightarrow{a.s.} 2E[M_0(X)],$$
the quantity in inequality (3.3) converges almost surely to zero uniformly as $n \to \infty$, by the generalized Pratt's lemma for random functions, Theorem 1 of Iséki (1957), and assumptions (A3) and (A4). Now consider $S_2$. Observe that
$$\left\|\int \{A'(\delta_n)(\delta_n + 1) - A'(\delta)(\delta + 1)\} u_\theta u_\theta^\top f_\theta\,d\mu\right\| \le \int \left\|[A'(\delta_n) - A'(\delta)](\delta_n + 1) u_\theta u_\theta^\top\right\| f_\theta\,d\mu + \int \left\|A'(\delta)(\delta_n - \delta) u_\theta u_\theta^\top\right\| f_\theta\,d\mu. \quad (3.4)$$
Note that, using assumption (A1), we have
$$\left\|[A'(\delta_n) - A'(\delta)](\delta_n + 1) u_\theta u_\theta^\top\right\| f_\theta \le 2M g_n(x) M_0(x), \qquad \left\|A'(\delta)(g_n - g) u_\theta u_\theta^\top\right\| \le M (g_n(x) + g(x)) M_0(x).$$
By (A1), Theorem 1 of Iséki (1957) and the generalized Pratt's lemma, along with assumptions (A3) and (A4), the terms on the right hand side of inequality (3.4) converge almost surely to zero uniformly on $\Theta_g$ as $n \to \infty$.

Getting back to $\Lambda_n(w)$, we have, by a Taylor series expansion,
$$\Lambda_n(w) = n \left\{\int C\left(\frac{g_n}{f_{\theta_n}} - 1\right) f_{\theta_n}\,d\mu - \int C\left(\frac{g_n}{f_{\theta_g}} - 1\right) f_{\theta_g}\,d\mu\right\} = -n^{1/2} w^\top \int A(\delta_n) \nabla f_{\theta_g}\,d\mu + \frac{1}{2} w^\top B_n(\theta^*) w,$$
where $\theta^*$ lies on the line joining $\theta_g$ and $\theta_g + n^{-1/2} w$, and so converges to $\theta_g$ as $n \to \infty$. By the uniform convergence of $B_n(\theta)$ to $B(\theta)$, we get
$$\Lambda_n(w) = -n^{1/2} w^\top \int A(\delta_n) \nabla f_{\theta_g}\,d\mu + \frac{1}{2} w^\top B(\theta_g) w + o_p(1), \quad (3.5)$$
uniformly on $\{w \in \mathbb{R}^p : \|w\| \le K\}$.

This quadratic approximation can be used to prove $n^{1/2}$-consistency of the minimizer $\hat{\theta}_n$, leading to an asymptotic representation.

Theorem 3.2 If $\int A(\delta_n(x)) \nabla f_{\theta_g}(x)\,d\mu(x) = O_p(n^{-1/2})$ and $\hat{\theta}_n \xrightarrow{P} \theta_g$, then the minimizer $\hat{\theta}_n$ satisfies the equation
$$n^{1/2}(\hat{\theta}_n - \theta_g) = B^{-1}(\theta_g) \left[n^{1/2} \int A(\delta_n(x)) \nabla f_{\theta_g}(x)\,d\mu(x)\right] + o_p(1). \quad (3.6)$$

Proof Note that the minimizer of $\Lambda_n(w)$ is $n^{1/2}(\hat{\theta}_n - \theta_g)$. Equation (3.5) proves that $\Lambda_n(w)$ and the quadratic term on its right hand side are close as stochastic processes indexed by $w \in \mathbb{R}^p$. Intuitively, this implies Equation (3.6). A rigorous argument follows the approach of Lemma 5.4 of Ichimura (1993). A close examination of the proof of Theorem 3.1 shows that
$$\rho_C(g_n, f_{\hat{\theta}_n}) - \rho_C(g_n, f_{\theta_g}) = -\left[\int A(\delta_n(x)) \nabla f_{\theta_g}(x)\,d\mu(x)\right]^\top (\hat{\theta}_n - \theta_g) + \frac{1}{2} (\hat{\theta}_n - \theta_g)^\top B(\theta_g)(\hat{\theta}_n - \theta_g) + o_p(\|\hat{\theta}_n - \theta_g\|^2).$$
Here $\|\cdot\|$ represents the Euclidean norm, and by the convergence in probability of $\hat{\theta}_n$, there exists an N such that $\hat{\theta}_n \in \Theta_g$ for all $n \ge N$. Now, following the proof of Lemma 5.4 of Ichimura (1993), we get that $\hat{\theta}_n - \theta_g = O_p(n^{-1/2})$ and Equation (3.6).

Remark 8 We do not require densities with respect to the Lebesgue measure. Nor do we require the observations to be independent.

Remark 9 While we believe that our exposition adds substantially to the literature on disparities and minimum disparity estimators, it is important to recognize what has been left out. Our conditions (C2) and (A1) are not satisfied by the members of the Cressie-Read (Cressie and Read, 1984) family of disparities. We trust that this is compensated by the large number of disparities, including many that generate extremely robust estimators and tests, which do satisfy our conditions. We provide a more detailed discussion of this issue, including a partial list of subclasses of disparities which satisfy assumptions (C2) and (A1), in the Supplementary Material.

4 Asymptotic Normality

In this section, we present a general theorem proving asymptotic normality of the estimating function, $\int A(\delta_n(x)) \nabla f_\theta(x)\,d\mu(x)$, at $\theta = \theta_g$, under some rate assumptions on the density estimator and without assuming any particular form of the density estimator. Consider the following assumption.

(B1) The density estimator, in conjunction with the parametric model $\mathcal{F}_\Theta$, satisfies
$$n^{1/2} \int A'(\delta) \{g_n(x) - g(x)\}\,u_{\theta_g}(x)\,d\mu(x) \xrightarrow{L} N(0, V(\theta_g)),$$
for some positive definite matrix $V(\theta_g)$, and
$$n^{1/2} \int \{g_n^{1/2}(x) - g^{1/2}(x)\}^2\,\|u_{\theta_g}(x)\|\,d\mu(x) = o_p(1).$$

In the case of i.i.d. data, the first part of assumption (B1) readily holds if the base measure is the counting measure and the density estimator $g_n$ is the usual proportion based estimator; the conditions under which the second part holds were given in Lindsay (1994). If the base measure is the Lebesgue measure and the density estimator is the kernel density estimator, then assumption (B1) was verified through the proofs presented in Park and Basu (2004).

Theorem 4.1 Under the assumptions (A1) and (B1),
$$n^{1/2} \left(\int A(\delta_n) \nabla f_{\theta_g}\,d\mu - \int A(\delta) \nabla f_{\theta_g}\,d\mu\right) \xrightarrow{L} N(0, V(\theta_g)).$$

Proof The centering on the left hand side is actually zero by (3.1). Define
$$S_n(x) = A(\delta_n(x)) - A(\delta(x)) - (\delta_n(x) - \delta(x)) A'(\delta(x)).$$
Observe that
$$\int A(\delta_n) \nabla f_\theta\,d\mu - \int A(\delta) \nabla f_\theta\,d\mu = \int A'(\delta)(\delta_n - \delta) \nabla f_\theta\,d\mu + \int S_n(x) \nabla f_\theta(x)\,d\mu.$$
By Lemma 25 of Lindsay (1994), we have that
$$\left\|\int S_n(x) \nabla f_{\theta_g}(x)\,d\mu(x)\right\| \le \int |S_n(x)|\,\|\nabla f_{\theta_g}(x)\|\,d\mu(x) \le B \int \left(g_n^{1/2} - g^{1/2}\right)^2 \|u_{\theta_g}\|\,d\mu = o_p(n^{-1/2}),$$
by assumption (B1). Another application of (B1) yields
$$n^{1/2} \left(\int A(\delta_n) \nabla f_{\theta_g}\,d\mu - \int A(\delta) \nabla f_{\theta_g}\,d\mu\right) \xrightarrow{L} N(0, V(\theta_g)).$$

Remark 10 This theorem, combined with Equation (3.1), Theorem 3.2 and the conditions implied therein, implies that
$$n^{1/2}(\hat{\theta}_n - \theta_g) \xrightarrow{L} N(0, B^{-1}(\theta_g) V(\theta_g) B^{-1}(\theta_g)). \quad (4.1)$$
Thus our approach essentially subsumes that of Lindsay (1994) and provides a unified framework of minimum disparity estimation in discrete and continuous models. This approach also includes the regression case solved recently by Hooker (2016), with relaxed assumptions, in a more compact manner.

Remark 11 When the base measure is the Lebesgue measure, the non-parametric density estimator involved is a kernel density estimator with bandwidth sequence $h_n$, given by
$$g_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h_n}\right),$$
and the random sample is an i.i.d. sample from a univariate distribution, we get, using the assumptions of Theorem 3.4 of Park and Basu (2004), that
$$V(\theta_g) = \mathrm{Var}\left(A'(\delta(X))\,u_{\theta_g}(X)\right).$$
See also Cheng and Vidyashankar (2006) and Remark 13. Note that if $g = f_{\theta_0}$, then $\theta_g = \theta_0$ and $V(\theta_0) = B(\theta_0) = I(\theta_0)$, which is the Fisher information matrix. This proves that
$$n^{1/2}(\hat{\theta}_n - \theta_0) \xrightarrow{L} N(0, I^{-1}(\theta_0)),$$
which implies first order efficiency of the minimum disparity estimator. Actually, if the non-parametric density estimator is the kernel density estimator, one can show more, namely that under $g = f_{\theta_0}$,
$$n^{1/2}(\hat{\theta}_n - \theta_0) = n^{-1/2} I^{-1}(\theta_0) \sum_{i=1}^{n} u_{\theta_0}(X_i) + o_p(1), \quad (4.2)$$
which in turn shows that the maximum likelihood estimator and the minimum disparity estimator are asymptotically equivalent at the parametric model.

Remark 12 The limit theorem (4.1) has an implicit advantage over the previous asymptotic normality results of Lindsay (1994), Park and Basu (2004) and Kuchibhotla and Basu (2015), in the sense that these authors prove that there exists a sequence of roots of the estimating equations which is strongly consistent and asymptotically normal, but do not specify which roots. Our theorem proves that the minimizer of the disparity, which is also a root of the estimating equation, is strongly consistent and asymptotically normal.

Remark 13 We thus obtain the first order efficiency of the minimum disparity estimator obtained from the integral form of the disparity. Note that the approximation method mentioned in Remark 7 leads to an inefficient estimator in general. See Theorem 3.2 of Ghosh and Jammalamadaka (2001).
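The identity $V(\theta_0) = B(\theta_0) = I(\theta_0)$ at the model can also be verified numerically. The sketch below (ours; the disparity, model and quadrature limits are illustrative assumptions) does so for the negative exponential disparity in the $N(\theta, 1)$ model, where $\delta(x) = 0$ for all $x$, so that $A(0) = 0$ and $A'(0) = 1$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

theta0 = 0.0
f = lambda x: norm.pdf(x, loc=theta0)        # model density f_theta0
u = lambda x: x - theta0                     # score u_theta(x) for N(theta, 1)
A  = lambda d: 2.0 - (d + 2.0) * np.exp(-d)  # RAF: A(d) = C'(d)(d+1) - C(d)
A1 = lambda d: (d + 1.0) * np.exp(-d)        # A'(d)

# At the model, delta(x) = g(x)/f(x) - 1 = 0 everywhere, so A(delta) = 0 and
# the second term of B(theta) vanishes; also A'(0) = 1.
B, _ = quad(lambda x: A1(0.0) * (0.0 + 1.0) * u(x) ** 2 * f(x), -10, 10)
m, _ = quad(lambda x: A1(0.0) * u(x) * f(x), -10, 10)
s, _ = quad(lambda x: (A1(0.0) * u(x)) ** 2 * f(x), -10, 10)
V = s - m ** 2                               # V = Var(A'(delta(X)) u(X)) under g = f

print(B, V)  # both approximately 1.0 = I(theta0) for the N(theta, 1) model
```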

5 Tests of Hypothesis

A popular and widely used statistical tool for the hypothesis testing problem is the likelihood ratio test. The likelihood ratio test statistic can be viewed as the difference between the minimum of the likelihood disparity under the null and that without any constraint. Under certain regularity conditions, the likelihood ratio test enjoys some asymptotic optimality properties. However, as in the case of the maximum likelihood estimator, the likelihood ratio test exhibits poor robustness properties in many cases. As an alternative to the likelihood ratio test, Simpson (1989) introduced the Hellinger deviance test, which was later generalized to disparity difference tests in a unified way; see, e.g., Lindsay (1994) and Basu et al. (2011). However, the disparity difference test statistic can become potentially negative when using an arbitrary root of the estimating equation rather than the global minimizer.

The set up under which we deal with the problem of hypothesis testing is as follows. We assume the parametric set up of Section 2 and let identically distributed random variables $X_1, X_2, \ldots, X_n$ be available from the true distribution G. We assume the equivalence presented in Equation (4.2), which is satisfied, for example, if the non-parametric density estimator is the kernel density estimator. The hypothesis testing problem under consideration is
$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta \setminus \Theta_0,$$
for a proper subset $\Theta_0$ of Θ. As an analogue of the likelihood ratio test, define the test statistic
$$W_C(g_n) := 2n \left[\rho_C(g_n, f_{\hat{\theta}_0}) - \rho_C(g_n, f_{\hat{\theta}})\right], \quad (5.1)$$
where $\hat{\theta}$ and $\hat{\theta}_0$ denote the unrestricted minimizer of $\rho_C(g_n, f_\theta)$ and the minimizer under the constraint $\theta \in \Theta_0$, respectively, and $g_n$ is the kernel density estimate. We will now present the main theorem of this section, which establishes the asymptotic distribution of $W_C$.

Theorem 5.1 Under the model $f_{\theta_0}$, $\theta_0 \in \Theta_0$, and assumptions (A1)-(A4) and (B1), the limiting null distribution of the test statistic $W_C(g_n)$ is $\chi^2_r$, where $r$ is the number of restrictions imposed by the null hypothesis $H_0$.

Proof A Taylor series expansion of $\rho_C(g_n, f_{\hat{\theta}_0})$ in θ around $\hat{\theta}$ gives
$$W_C(g_n) = 2n \left[\rho_C(g_n, f_{\hat{\theta}_0}) - \rho_C(g_n, f_{\hat{\theta}})\right] = 2n \left[(\hat{\theta}_0 - \hat{\theta})^\top \nabla \rho_C(g_n, f_{\hat{\theta}})\right] + 2n \left[\frac{1}{2} (\hat{\theta}_0 - \hat{\theta})^\top \nabla_2 \rho_C(g_n, f_{\theta^*})(\hat{\theta}_0 - \hat{\theta})\right],$$

where $\theta^*$ belongs to the line joining $\hat{\theta}_0$ and $\hat{\theta}$. Note that the first term in the last expression is zero, as $\hat{\theta}$ is the minimizer of $\rho_C$ over Θ. So we only need to deal with the second term in the expansion. Now
$$W_C(g_n) = n \left[(\hat{\theta}_0 - \hat{\theta})^\top I(\theta_0)(\hat{\theta}_0 - \hat{\theta})\right] + n \left[(\hat{\theta}_0 - \hat{\theta})^\top \{\nabla_2 \rho_C(g_n, f_{\theta^*}) - I(\theta_0)\}(\hat{\theta}_0 - \hat{\theta})\right]. \quad (5.2)$$
Under the model $f_{\theta_0}$, $n^{1/2}(\hat{\theta}_0 - \theta_0)$ and $n^{1/2}(\hat{\theta} - \theta_0)$ are both $O_p(1)$ provided the null is true. Thus, $n^{1/2}(\hat{\theta}_0 - \hat{\theta}) = O_p(1)$ in this case. As in Theorem 3.1, $\nabla_2 \rho_C(g_n, f_\theta) = B_n(\theta)$ converges to $B(\theta)$ uniformly in $\theta \in \Theta_g$. Note that $B(\theta_0) = I(\theta_0)$ under $g = f_{\theta_0}$. Since $\hat{\theta}_0 - \theta_0 = o_p(1)$ and $\hat{\theta} - \theta_0 = o_p(1)$, we have $\theta^* \in \Theta_g$ for large enough $n$, and so
$$\left\|\nabla_2 \rho_C(g_n, f_{\theta^*}) - I(\theta_0)\right\| \le \left\|\nabla_2 \rho_C(g_n, f_{\theta^*}) - B(\theta^*)\right\| + \left\|B(\theta^*) - I(\theta_0)\right\| \le \sup_{\theta \in \Theta_g} \left\|\nabla_2 \rho_C(g_n, f_\theta) - B(\theta)\right\| + \left\|B(\theta^*) - I(\theta_0)\right\| \xrightarrow{P} 0.$$
Hence, by the arguments above, the second term on the right hand side of Equation (5.2) converges in probability to zero. By Equation (4.2), we have
$$n^{1/2}(\hat{\theta} - \hat{\theta}_0) = n^{1/2}(\hat{\theta}_{ML} - \hat{\theta}_{0,ML}) + o_p(1),$$
where $\hat{\theta}_{ML}$ and $\hat{\theta}_{0,ML}$ are the unrestricted and constrained maximum likelihood estimators. Hence, $W_C(g_n)$ is equivalent to the likelihood ratio test statistic under the model $f_{\theta_0}$, in the sense that
$$\left|W_C(g_n) - n (\hat{\theta}_{0,ML} - \hat{\theta}_{ML})^\top I(\theta_0)(\hat{\theta}_{0,ML} - \hat{\theta}_{ML})\right| = o_p(1). \quad (5.3)$$
From the theory of the likelihood ratio test, we conclude that $W_C$ converges in distribution to a $\chi^2_r$ as $n \to \infty$, as stated. See Serfling (1980, Section 4.4.4) for a complete discussion of the likelihood ratio test.

Theorem 5.2 The conditions of Theorem 5.1, together with the additional assumption that the parametric family $\mathcal{F}_\Theta$ satisfies the local asymptotic normality (LAN) condition, imply that under $f_{\theta_n}$,
$$W_C(g_n) - 2 \sum_{i=1}^{n} \left[\log f_{\hat{\theta}_{ML}}(X_i) - \log f_{\hat{\theta}_{0,ML}}(X_i)\right] \xrightarrow{P} 0,$$
as $n \to \infty$, where $\theta_n = \theta_0 + \tau n^{-1/2}$.

Proof Under the assumptions of Theorem 5.1, Equation (5.3) implies the stated claim under $f_{\theta_0}$, since the Wald test statistic is equivalent to the likelihood ratio test statistic under the null. See Serfling (1980) for details. By the LAN condition, we have that $f_{\theta_n}$ is contiguous to $f_{\theta_0}$, and so convergence in probability under $f_{\theta_0}$ implies convergence in probability under $f_{\theta_n}$. Thus the test given by the statistic $W_C(g_n)$ has the same asymptotic power under contiguous alternatives as the likelihood ratio test under the conditions of this theorem.
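For a concrete illustration of the statistic (5.1) (ours, not from the paper), the sketch below computes $W_C(g_n)$ for the simple null $H_0 : \theta = 0$ in the $N(\theta, 1)$ model, with the Hellinger disparity and a Gaussian kernel density estimate, and refers it to the $\chi^2_1$ distribution; in finite samples the bandwidth choice matters, so the $\chi^2$ calibration is only approximate here.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm, gaussian_kde, chi2

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=500)     # data; true theta = 0.3
g_n = gaussian_kde(x)                            # kernel density estimate of g

def rho(theta):
    # rho_C(g_n, f_theta) with the Hellinger C(d) = 2(sqrt(d+1) - 1)^2
    def integrand(t):
        f = norm.pdf(t, loc=theta)
        d = g_n(t)[0] / f - 1.0
        return 2.0 * (np.sqrt(d + 1.0) - 1.0) ** 2 * f
    return quad(integrand, x.min() - 5.0, x.max() + 5.0, limit=200)[0]

theta_hat = minimize_scalar(rho, bounds=(-5.0, 5.0), method="bounded").x
W = 2.0 * len(x) * (rho(0.0) - rho(theta_hat))   # simple null, so r = 1
print(f"theta_hat = {theta_hat:.3f}, W_C = {W:.2f}, p = {chi2.sf(W, df=1):.4f}")
```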

The following theorem explores the stability of the limiting distribution of the test statistic $W_C(g_n)$ under contamination. For this theorem, the null hypothesis under consideration is $H_0 : \theta_g = \theta_0$, where the unknown true distribution G may or may not be in the model.

Theorem 5.3 Under assumptions (A1)-(A4) and (B1), under the null hypothesis, the representation
$$W_C(g_n) - Y_{2n} = Y_1 + o_p(1)$$
holds, where $Y_1 \sim \chi^2_p$ and $\lim_{g \to f_{\theta_0}} Y_{2n} = 0$ for any C. Here, by $g \to f_{\theta_0}$, we mean convergence in the $L_1$ sense. The rate at which $Y_{2n}$ converges to 0 depends on the form of C.

Proof The proof closely follows the proof of Theorem 5.1. As in Theorem 5.1, we get, by a Taylor series expansion of the test statistic around $\hat{\theta}_n$,
$$W_C(g_n) = 2n \left[\rho_C(g_n, f_{\theta_0}) - \rho_C(g_n, f_{\hat{\theta}_n})\right] = n(\theta_0 - \hat{\theta}_n)^\top \nabla_2 \rho_C(g_n, f_{\theta^*})(\theta_0 - \hat{\theta}_n),$$
where $\theta^*$ belongs to the line joining $\hat{\theta}_n$ and $\theta_0$. As in Theorem 3.1, $\nabla_2 \rho_C(g_n, f_{\theta^*})$ converges in probability to $B(\theta_0)$ under the null hypothesis. Hence, we have
$$W_C(g_n) = n(\theta_0 - \hat{\theta}_n)^\top B(\theta_0)(\theta_0 - \hat{\theta}_n) + o_p(1).$$
Note that
$$B(\theta_0) = B(\theta_0) V^{-1}(\theta_0) B(\theta_0) - B(\theta_0) \left[V^{-1}(\theta_0) - B^{-1}(\theta_0)\right] B(\theta_0).$$
By Remark 10, we get
$$n(\hat{\theta}_n - \theta_0)^\top B(\theta_0) V^{-1}(\theta_0) B(\theta_0)(\hat{\theta}_n - \theta_0) = Y_1 + o_p(1),$$
where $Y_1 \sim \chi^2_p$. The remaining term, given by
$$Y_{2n} = -n(\hat{\theta}_n - \theta_0)^\top B(\theta_0) \left[V^{-1}(\theta_0) - B^{-1}(\theta_0)\right] B(\theta_0)(\hat{\theta}_n - \theta_0),$$
becomes zero if $g = f_{\theta_0}$ and stays close to zero as $g \to f_{\theta_0}$.

Remark 14 This result extends Theorem 6 of Lindsay (1994), which considered the case of a scalar parameter. In our case, if $p = 1$, both $B = B(\theta_0)$ and $V = V(\theta_0)$ are scalars, so that
$$W_C(g_n) = \frac{V}{B} X_n + o_p(1),$$
where $X_n \xrightarrow{L} \chi^2_1$ under $H_0$. Thus $V/B$, as a function of the true density $g$ and the disparity generating function $C(\cdot)$, represents the inflation in the $\chi^2$ distribution, and can be legitimately called the $\chi^2$ inflation factor. This is exactly the same as the inflation factor described in Theorem 6, part (ii), of Lindsay (1994). When $g = f_{\theta_0}$ is the true distribution, $V = B$, so that there is

no inflation. However, when the true distribution is a point mass mixture contamination, Lindsay (1994) demonstrated, using the binomial model, that the inflation factor for the likelihood ratio test rises sharply with the contamination proportion, whereas for the Hellinger deviance test this rise is significantly dampened in comparison. Our calculations in the normal mean model exhibit improvements of similar order between the likelihood ratio test and other robust tests, although we do not present the actual numbers here. In the multidimensional case, however, the relation is not so simple, as it now requires a comparison between the matrices $B(\theta_0) V^{-1}(\theta_0) B(\theta_0)$ and $B(\theta_0)$. It could be of interest to develop a single quantitative measure of inflation for the multidimensional case in the future.

6 Conclusions

We have proved, under different sets of regularity conditions, strong consistency, an asymptotic representation of the minimum disparity estimator, asymptotic normality of the estimating function and asymptotic normality of the estimator. All these results except the grand consistency theorem require a thrice differentiable C. It is possible to prove an asymptotic representation for non-differentiable C. Using the equality
$$|x - y| - |x| = -y \left[\mathbb{1}_{\{x > 0\}} - \mathbb{1}_{\{x < 0\}}\right] + 2 \int_0^y \left[\mathbb{1}_{\{x \le s\}} - \mathbb{1}_{\{x \le 0\}}\right] ds \quad \text{for } x \ne 0,$$
it is not difficult to prove that
$$n \int |g_n(x) - f_{\theta_n}(x)|\,d\mu(x) - n \int |g_n(x) - f_\theta(x)|\,d\mu(x) = -w^\top \left[n^{1/2} \int \nabla f_\theta(x) \left[\mathbb{1}_{\{g_n(x) - f_\theta(x) > 0\}} - \mathbb{1}_{\{g_n(x) - f_\theta(x) < 0\}}\right] d\mu(x)\right] - \frac{1}{2} w^\top \int \nabla_2 f_\theta(x) \left[\mathbb{1}_{\{g_n(x) - f_\theta(x) > 0\}} - \mathbb{1}_{\{g_n(x) - f_\theta(x) < 0\}}\right] d\mu(x)\, w + o_p(1).$$
Here $\theta_n = \theta + n^{-1/2} w$ and the $o_p(1)$ is uniform in $\{w : \|w\| < K\}$ for every finite K. Thus, proving asymptotic normality of the quantity in brackets proves asymptotic normality of the $L_1$-based parameter estimate. Another method of proving asymptotic normality of an estimator based on a non-differentiable objective function is to approximate it by a smooth objective function, as in Amemiya (1982). Further exploration of these methods may lead to the development of a set up which accommodates non-differentiable C functions within the fold of minimum disparity estimation. Another possible generalization is to use quadratic mean differentiability of the densities $f_\theta$ and include non-regular families into the framework. However, the commonly used kernel density estimators may not be good estimators in this case. Also, one can consider various dependent data settings; see Table 1 in

the Supplementary Material. Using these estimators and the asymptotic representation derived in this paper, one might derive the asymptotic distribution of the minimum disparity estimator. Finally, we mention that the above results can be proved in a similar manner if one uses a Bayesian non-parametric density estimator, following the techniques of Wu and Hooker (2013).

Acknowledgements. The authors dedicate this work to the memory of Professor Bruce G. Lindsay. The authors also thank four anonymous reviewers whose comments led to an improved version of the manuscript.

References

Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B, 28.
Amemiya, T. (1982). Two stage least absolute deviations estimators. Econometrica, 50(3).
Basu, A. and Lindsay, B. G. (1994). Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Statist. Math., 46(4).
Basu, A., Sarkar, S., and Vidyashankar, A. N. (1997). Minimum negative exponential disparity estimation in parametric models. J. Statist. Plann. Inference, 58(2).
Basu, A., Shioya, H., and Park, C. (2011). Statistical inference: the minimum distance approach. CRC Press, Boca Raton, FL.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Ann. Statist., 5(3).
Cheng, A.-L. and Vidyashankar, A. N. (2006). Minimum Hellinger distance estimation for randomized play the winner design. J. Statist. Plann. Inference, 136(6).
Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B, 46(3).
Csiszár, I. (1963). Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl., 8.
Devroye, L. and Györfi, L. (1985). Nonparametric density estimation: the L1 view. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York.
Ekström, M. (2001). Consistency of generalized maximum spacing estimates. Scand. J. Statist., 28(2).
Ferguson, T. S. (1982). An inconsistent maximum likelihood estimate. J. Amer. Statist. Assoc., 77(380).
Ferguson, T. S. (1996). A course in large sample theory. Texts in Statistical Science Series. Chapman & Hall, London.
Fryer, J. G. and Robertson, C. A. (1972). A comparison of some methods for estimating mixed normal distributions. Biometrika, 59(3).

Ghosh, K. and Jammalamadaka, S. R. (2001). A general estimation method using spacings. J. Statist. Plann. Inference, 93(1-2).
Gut, A. (2013). Probability: a graduate course. Springer Texts in Statistics. Springer, New York, second edition.
Györfi, L., Vajda, I., and van der Meulen, E. (1994). Minimum Hellinger distance point estimates consistent under weak family regularity. Math. Methods Statist., 3(1).
Hooker, G. (2016). Consistency, efficiency and robustness of conditional disparity methods. Bernoulli, 22(2).
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J. Econometrics, 58(1-2).
Iséki, K. (1957). A theorem on continuous convergence. Proc. Japan Acad., 33.
Kuchibhotla, A. K. and Basu, A. (2015). A general set up for minimum disparity estimation. Statist. Probab. Lett., 96.
Liese, F. and Vajda, I. (1987). Convex statistical distances, volume 95 of Teubner Texts in Mathematics. BSB B. G. Teubner Verlagsgesellschaft.
Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist., 22(2).
Morales, D., Pardo, L., and Vajda, I. (1995). Asymptotic divergence of estimates of discrete distributions. J. Statist. Plann. Inference, 48(3).
Pardo, L. (2006). Statistical inference based on divergence measures, volume 185 of Statistics: Textbooks and Monographs. Chapman & Hall/CRC.
Park, C. and Basu, A. (2004). Minimum disparity estimation: asymptotic normality and breakdown point results. Bull. Inform. Cybernet., 36.
Prakasa Rao, B. L. S. (1983). Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press, Inc., New York.
Rao, C. R. (1961). Asymptotic efficiency and limiting information. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I. Univ. California Press, Berkeley, Calif.
Robertson, C. A. (1972). On minimum discrepancy estimators. Sankhyā: The Indian Journal of Statistics, Series A, 34(2).
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. John Wiley & Sons, Inc., New York.
Simpson, D. G. (1989). Hellinger deviance tests: efficiency, breakdown points, and examples. J. Amer. Statist. Assoc., 84(405).
Vajda, I. (1989). Theory of statistical inference and information. Theory and Decision Library: Mathematical and Statistical Methods.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press, Cambridge.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York.
Wu, Y. and Hooker, G. (2013). Hellinger distance and Bayesian nonparametrics: hierarchical models for robust and efficient Bayesian inference. ArXiv e-prints.

Zapała, A. M. (2008). Unbounded mappings and weak convergence of measures. Statist. Probab. Lett., 78(6).


More information

2.1.3 The Testing Problem and Neave s Step Method

2.1.3 The Testing Problem and Neave s Step Method we can guarantee (1) that the (unknown) true parameter vector θ t Θ is an interior point of Θ, and (2) that ρ θt (R) > 0 for any R 2 Q. These are two of Birch s regularity conditions that were critical

More information

Computation of an efficient and robust estimator in a semiparametric mixture model

Computation of an efficient and robust estimator in a semiparametric mixture model Journal of Statistical Computation and Simulation ISSN: 0094-9655 (Print) 1563-5163 (Online) Journal homepage: http://www.tandfonline.com/loi/gscs20 Computation of an efficient and robust estimator in

More information

THEOREM AND METRIZABILITY

THEOREM AND METRIZABILITY f-divergences - REPRESENTATION THEOREM AND METRIZABILITY Ferdinand Österreicher Institute of Mathematics, University of Salzburg, Austria Abstract In this talk we are first going to state the so-called

More information

A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis

A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis A note on the asymptotic distribution of Berk-Jones type statistics under the null hypothesis Jon A. Wellner and Vladimir Koltchinskii Abstract. Proofs are given of the limiting null distributions of the

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky EMPIRICAL ENVELOPE MLE AND LR TESTS Mai Zhou University of Kentucky Summary We study in this paper some nonparametric inference problems where the nonparametric maximum likelihood estimator (NPMLE) are

More information

Statistics 581 Revision of Section 4.4: Consistency of Maximum Likelihood Estimates Wellner; 11/30/2001

Statistics 581 Revision of Section 4.4: Consistency of Maximum Likelihood Estimates Wellner; 11/30/2001 Statistics 581 Revision of Section 4.4: Consistency of Maximum Likelihood Estimates Wellner; 11/30/2001 Some Uniform Strong Laws of Large Numbers Suppose that: A. X, X 1,...,X n are i.i.d. P on the measurable

More information

Verifying Regularity Conditions for Logit-Normal GLMM

Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

RATES OF CONVERGENCE OF ESTIMATES, KOLMOGOROV S ENTROPY AND THE DIMENSIONALITY REDUCTION PRINCIPLE IN REGRESSION 1

RATES OF CONVERGENCE OF ESTIMATES, KOLMOGOROV S ENTROPY AND THE DIMENSIONALITY REDUCTION PRINCIPLE IN REGRESSION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2493 2511 RATES OF CONVERGENCE OF ESTIMATES, KOLMOGOROV S ENTROPY AND THE DIMENSIONALITY REDUCTION PRINCIPLE IN REGRESSION 1 By Theodoros Nicoleris and Yannis

More information

A Statistical Distance Approach to Dissimilarities in Ecological Data

A Statistical Distance Approach to Dissimilarities in Ecological Data Clemson University TigerPrints All Dissertations Dissertations 5-2015 A Statistical Distance Approach to Dissimilarities in Ecological Data Dominique Jerrod Morgan Clemson University Follow this and additional

More information

arxiv: v4 [math.st] 16 Nov 2015

arxiv: v4 [math.st] 16 Nov 2015 Influence Analysis of Robust Wald-type Tests Abhik Ghosh 1, Abhijit Mandal 2, Nirian Martín 3, Leandro Pardo 3 1 Indian Statistical Institute, Kolkata, India 2 Univesity of Minnesota, Minneapolis, USA

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

Graduate Econometrics I: Maximum Likelihood I

Graduate Econometrics I: Maximum Likelihood I Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Econometrica Supplementary Material

Econometrica Supplementary Material Econometrica Supplementary Material SUPPLEMENT TO USING INSTRUMENTAL VARIABLES FOR INFERENCE ABOUT POLICY RELEVANT TREATMENT PARAMETERS Econometrica, Vol. 86, No. 5, September 2018, 1589 1619 MAGNE MOGSTAD

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

Czechoslovak Mathematical Journal

Czechoslovak Mathematical Journal Czechoslovak Mathematical Journal Oktay Duman; Cihan Orhan µ-statistically convergent function sequences Czechoslovak Mathematical Journal, Vol. 54 (2004), No. 2, 413 422 Persistent URL: http://dml.cz/dmlcz/127899

More information

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University Introduction to the Mathematical and Statistical Foundations of Econometrics 1 Herman J. Bierens Pennsylvania State University November 13, 2003 Revised: March 15, 2004 2 Contents Preface Chapter 1: Probability

More information

Comparison of Estimators in GLM with Binary Data

Comparison of Estimators in GLM with Binary Data Journal of Modern Applied Statistical Methods Volume 13 Issue 2 Article 10 11-2014 Comparison of Estimators in GLM with Binary Data D. M. Sakate Shivaji University, Kolhapur, India, dms.stats@gmail.com

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

A strong consistency proof for heteroscedasticity and autocorrelation consistent covariance matrix estimators

A strong consistency proof for heteroscedasticity and autocorrelation consistent covariance matrix estimators A strong consistency proof for heteroscedasticity and autocorrelation consistent covariance matrix estimators Robert M. de Jong Department of Economics Michigan State University 215 Marshall Hall East

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and

More information

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007 Hypothesis Testing Daniel Schmierer Econ 312 March 30, 2007 Basics Parameter of interest: θ Θ Structure of the test: H 0 : θ Θ 0 H 1 : θ Θ 1 for some sets Θ 0, Θ 1 Θ where Θ 0 Θ 1 = (often Θ 1 = Θ Θ 0

More information

ECE 275B Homework # 1 Solutions Version Winter 2015

ECE 275B Homework # 1 Solutions Version Winter 2015 ECE 275B Homework # 1 Solutions Version Winter 2015 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

ON BOUNDEDNESS OF MAXIMAL FUNCTIONS IN SOBOLEV SPACES

ON BOUNDEDNESS OF MAXIMAL FUNCTIONS IN SOBOLEV SPACES Annales Academiæ Scientiarum Fennicæ Mathematica Volumen 29, 2004, 167 176 ON BOUNDEDNESS OF MAXIMAL FUNCTIONS IN SOBOLEV SPACES Piotr Haj lasz and Jani Onninen Warsaw University, Institute of Mathematics

More information

The Hilbert Transform and Fine Continuity

The Hilbert Transform and Fine Continuity Irish Math. Soc. Bulletin 58 (2006), 8 9 8 The Hilbert Transform and Fine Continuity J. B. TWOMEY Abstract. It is shown that the Hilbert transform of a function having bounded variation in a finite interval

More information

ECE 275B Homework # 1 Solutions Winter 2018

ECE 275B Homework # 1 Solutions Winter 2018 ECE 275B Homework # 1 Solutions Winter 2018 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2 < < x n Thus,

More information

An inverse of Sanov s theorem

An inverse of Sanov s theorem An inverse of Sanov s theorem Ayalvadi Ganesh and Neil O Connell BRIMS, Hewlett-Packard Labs, Bristol Abstract Let X k be a sequence of iid random variables taking values in a finite set, and consider

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

ENTROPY-BASED GOODNESS OF FIT TEST FOR A COMPOSITE HYPOTHESIS

ENTROPY-BASED GOODNESS OF FIT TEST FOR A COMPOSITE HYPOTHESIS Bull. Korean Math. Soc. 53 (2016), No. 2, pp. 351 363 http://dx.doi.org/10.4134/bkms.2016.53.2.351 ENTROPY-BASED GOODNESS OF FIT TEST FOR A COMPOSITE HYPOTHESIS Sangyeol Lee Abstract. In this paper, we

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

Institut für Mathematik

Institut für Mathematik U n i v e r s i t ä t A u g s b u r g Institut für Mathematik Christoph Gietl, Fabian P. Reffel Continuity of f-projections and Applications to the Iterative Proportional Fitting Procedure Preprint Nr.

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

Minimum distance tests and estimates based on ranks

Minimum distance tests and estimates based on ranks Minimum distance tests and estimates based on ranks Authors: Radim Navrátil Department of Mathematics and Statistics, Masaryk University Brno, Czech Republic (navratil@math.muni.cz) Abstract: It is well

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland Frederick James CERN, Switzerland Statistical Methods in Experimental Physics 2nd Edition r i Irr 1- r ri Ibn World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI CONTENTS

More information

Optimal Sequential Procedures with Bayes Decision Rules

Optimal Sequential Procedures with Bayes Decision Rules International Mathematical Forum, 5, 2010, no. 43, 2137-2147 Optimal Sequential Procedures with Bayes Decision Rules Andrey Novikov Department of Mathematics Autonomous Metropolitan University - Iztapalapa

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

ON A MAXIMAL OPERATOR IN REARRANGEMENT INVARIANT BANACH FUNCTION SPACES ON METRIC SPACES

ON A MAXIMAL OPERATOR IN REARRANGEMENT INVARIANT BANACH FUNCTION SPACES ON METRIC SPACES Vasile Alecsandri University of Bacău Faculty of Sciences Scientific Studies and Research Series Mathematics and Informatics Vol. 27207), No., 49-60 ON A MAXIMAL OPRATOR IN RARRANGMNT INVARIANT BANACH

More information

Citation Osaka Journal of Mathematics. 41(4)

Citation Osaka Journal of Mathematics. 41(4) TitleA non quasi-invariance of the Brown Authors Sadasue, Gaku Citation Osaka Journal of Mathematics. 414 Issue 4-1 Date Text Version publisher URL http://hdl.handle.net/1194/1174 DOI Rights Osaka University

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

University of Pavia. M Estimators. Eduardo Rossi

University of Pavia. M Estimators. Eduardo Rossi University of Pavia M Estimators Eduardo Rossi Criterion Function A basic unifying notion is that most econometric estimators are defined as the minimizers of certain functions constructed from the sample

More information

A NEW PROOF OF THE WIENER HOPF FACTORIZATION VIA BASU S THEOREM

A NEW PROOF OF THE WIENER HOPF FACTORIZATION VIA BASU S THEOREM J. Appl. Prob. 49, 876 882 (2012 Printed in England Applied Probability Trust 2012 A NEW PROOF OF THE WIENER HOPF FACTORIZATION VIA BASU S THEOREM BRIAN FRALIX and COLIN GALLAGHER, Clemson University Abstract

More information

A New Quantum f-divergence for Trace Class Operators in Hilbert Spaces

A New Quantum f-divergence for Trace Class Operators in Hilbert Spaces Entropy 04, 6, 5853-5875; doi:0.3390/e65853 OPEN ACCESS entropy ISSN 099-4300 www.mdpi.com/journal/entropy Article A New Quantum f-divergence for Trace Class Operators in Hilbert Spaces Silvestru Sever

More information

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Statistica Sinica 19 (2009), 71-81 SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Song Xi Chen 1,2 and Chiu Min Wong 3 1 Iowa State University, 2 Peking University and

More information

Understanding Ding s Apparent Paradox

Understanding Ding s Apparent Paradox Submitted to Statistical Science Understanding Ding s Apparent Paradox Peter M. Aronow and Molly R. Offer-Westort Yale University 1. INTRODUCTION We are grateful for the opportunity to comment on A Paradox

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Asymptotic efficiency of simple decisions for the compound decision problem

Asymptotic efficiency of simple decisions for the compound decision problem Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com

More information

MINIMUM HELLINGER DISTANCE ESTIMATION IN A SEMIPARAMETRIC MIXTURE MODEL SIJIA XIANG. B.S., Zhejiang Normal University, China, 2010 A REPORT

MINIMUM HELLINGER DISTANCE ESTIMATION IN A SEMIPARAMETRIC MIXTURE MODEL SIJIA XIANG. B.S., Zhejiang Normal University, China, 2010 A REPORT MINIMUM HELLINGER DISTANCE ESTIMATION IN A SEMIPARAMETRIC MIXTURE MODEL by SIJIA XIANG B.S., Zhejiang Normal University, China, 2010 A REPORT submitted in partial fulfillment of the requirements for the

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

DEPARTMENT MATHEMATIK ARBEITSBEREICH MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE

DEPARTMENT MATHEMATIK ARBEITSBEREICH MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE Estimating the error distribution in nonparametric multiple regression with applications to model testing Natalie Neumeyer & Ingrid Van Keilegom Preprint No. 2008-01 July 2008 DEPARTMENT MATHEMATIK ARBEITSBEREICH

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information