Maximum Entropy Approximations for Asymptotic Distributions of Smooth Functions of Sample Means
Scandinavian Journal of Statistics, Vol. 38. Published by Blackwell Publishing Ltd.

XIMING WU, Department of Agricultural Economics, Texas A&M University

SUOJIN WANG, Department of Statistics, Texas A&M University

ABSTRACT. We propose an information-theoretic approach to approximate asymptotic distributions of statistics using maximum entropy (ME) densities. Conventional ME densities are typically defined on a bounded support. For distributions defined on unbounded supports, we use an asymptotically negligible dampening function for the ME approximation such that it is well defined on the real line. We establish order n^{-1} asymptotic equivalence between the proposed method and the classical Edgeworth approximation for general statistics that are smooth functions of sample means. Numerical examples are provided to demonstrate the efficacy of the proposed method.

Key words: asymptotic approximation, Edgeworth approximation, maximum entropy density

1. Introduction

Asymptotic approximations to the distributions of estimators or test statistics are commonly used in statistical inference, for example to obtain approximate confidence limits for parameters of interest or p-values for hypothesis tests. According to the central limit theorem, a statistic often has a limiting normal distribution under mild regularity conditions. The normal approximation, however, can be inadequate, especially when the sample size is small. In this study we are concerned with higher-order methods for approximating the distributions of statistics with limiting normal distributions. The most commonly used higher-order approximation is that of Edgeworth, which takes a polynomial form weighted by a normal distribution or density function. The Edgeworth approximation is known to be quite accurate near the mean of a distribution.
As a result of its polynomial form, however, it can sometimes produce spurious oscillations and give poor fits to the exact distributions in parts of the tails. Moreover, density and tail approximations can sometimes produce negative values. Suppose that we are interested in the exact density f_n of a statistic S_n. In this study we consider an alternative higher-order approximation of the form

f̂_n(x) = exp{ Σ_{j=1}^K γ_{nj} x^j − ψ(γ_n) },   (1)

where γ_n = {γ_{nj}}_{j=1}^K are the coefficients of the canonical exponential density and ψ(γ_n) is a normalization term such that f̂_n integrates to unity. The proposed approximation has an appealing information-theoretic interpretation as the maximum entropy (ME) density, which is obtained by maximizing the information entropy subject to some moment conditions. The ME approximation is strictly positive and well defined on a bounded support. When
the underlying distribution has an unbounded support, the ME approximation in its conventional form is not necessarily integrable. Instead of assuming a bounded support, we propose a modified ME approximation with a dampening function that ensures the integrability of the approximation and, at the same time, is asymptotically negligible. Most existing studies on the ME density focus on its use as a density estimator. This study is conceptually different in that it uses the ME density to approximate asymptotic distributions of statistics. For a given number of polynomial terms, we investigate the performance of the approximation as n → ∞. We focus on approximations up to order n^{-1}, as approximations beyond order n^{-1} are rarely used in practice; higher-order results can be obtained in a similar manner. We establish, in the general framework of smooth functions of sample means, that the ME approximations based on the first four moments of f_n are asymptotically equivalent to the two-term Edgeworth approximation up to order n^{-1}. We provide some numerical examples whose results suggest that the proposed method is a useful complement to the Edgeworth approximation, especially when the sample size is small. The rest of the article is organized as follows. Section 2 reviews the literature on the Edgeworth approximation and the ME density. Section 3 proposes a higher-order approximation based on modified ME densities and establishes its asymptotic properties. Section 4 provides some numerical examples. Some concluding remarks are given in section 5. Proofs of the theorems and some technical details are collected in the appendices.

2. Background and motivation

In this section we first review the literature on the Edgeworth approximation and some relevant variants.
We then introduce the ME density, based on which we construct our alternative higher-order approximations.

2.1. Edgeworth approximation

In his seminal work, Edgeworth (1905) applies the Gram–Charlier expansion to the distribution of a sample mean and collects terms in powers of n^{-1/2}. The resulting asymptotic approximation has become known as the Edgeworth approximation. It has been extensively studied in the literature; see, for example, Barndorff-Nielsen & Cox (1989), Hall (1992) and references therein. Suppose that {Y_i}_{i=1}^n is an i.i.d. random sample with mean zero and variance one. Let X_n = n^{-1/2} Σ_{i=1}^n Y_i and let f_n be the density of X_n. The two-term Edgeworth approximation to f_n takes the form

g_n(x) = φ(x) { 1 + κ₃H₃(x)/(6√n) + κ₄H₄(x)/(24n) + κ₃²H₆(x)/(72n) },

where φ(x) is the standard normal density, κ_j is the jth cumulant of Y₁ and H_j is the jth Hermite polynomial. The Edgeworth approximation applies not only to sample means but also to more general statistics. In this study we focus on smooth functions of sample means, as does Hall (1992). Additional notation is required for statistics more general than sample means. We use bold face to denote a d-dimensional vector v, and v^{(j)} denotes its jth element. Let S_n be an asymptotically normal statistic with distribution F_n and density f_n. Define X̄ = n^{-1} Σ X_i, where X₁, ..., X_n are i.i.d. random column d-vectors with mean μ. Suppose that we have a smooth function B: R^d → R satisfying B(μ) = 0 and S_n = n^{1/2} B(X̄). For example, B(x) = {τ(x) − τ(μ)}/δ(μ), where
τ(μ) is estimated by τ(X̄) and δ(μ)² is the asymptotic variance of n^{1/2} τ(X̄); or B(x) = {τ(x) − τ(μ)}/δ(x), where τ(·) and δ(·) are smooth functions and δ(X̄) is an estimate of δ(μ). For j = 1, 2, ..., denote the jth moment and cumulant of S_n by μ_{nj} and κ_{nj}, respectively. In general, as the exact moments and cumulants of S_n are not available, we construct instead corresponding approximate moments and cumulants accurate to O(n^{-1}), denoted by μ̃_{nj} and κ̃_{nj}, respectively, in the following. Define Z = n^{1/2}(X̄ − μ) and b_{i₁...i_j} = ∂^j B(x)/∂x^{(i₁)} ··· ∂x^{(i_j)} evaluated at x = μ. By a Taylor expansion and since B(μ) = 0,

S_n = n^{1/2} B(X̄)
    = Σ_{i₁=1}^d b_{i₁} Z^{(i₁)} + n^{-1/2} (1/2) Σ_{i₁=1}^d Σ_{i₂=1}^d b_{i₁i₂} Z^{(i₁)} Z^{(i₂)}
      + n^{-1} (1/6) Σ_{i₁=1}^d Σ_{i₂=1}^d Σ_{i₃=1}^d b_{i₁i₂i₃} Z^{(i₁)} Z^{(i₂)} Z^{(i₃)} + O_p(n^{-3/2}).

The (approximate) moments of S_n can then be computed from this expansion. In principle, one can compute its moments to an arbitrary degree of accuracy by taking a sufficiently large number of terms. Next we give general formulae that allow approximations accurate to order n^{-1}. In particular, we have

μ̃_{n1} = n^{-1/2} s₁₂,  μ̃_{n2} = s₂₁ + n^{-1} s₂₂,  μ̃_{n3} = n^{-1/2} s₃₁,  μ̃_{n4} = s₄₁ + n^{-1} s₄₂,   (2)

where s₁₂, ..., s₄₂ are defined in appendix 1. It can be seen that E[S_n^j] = μ̃_{nj} + o(n^{-1}), j = 1, ..., 4. Define

κ̃_{n1} = n^{-1/2} k₁₂,  κ̃_{n2} = 1 + n^{-1} k₂₂,  κ̃_{n3} = n^{-1/2} k₃₁,  κ̃_{n4} = n^{-1} k₄₁,   (3)

where k₁₂, ..., k₄₁ are given in appendix 1. One can show that κ̃_{nj} = κ_{nj} + o(n^{-1}), j = 1, ..., 4. The Edgeworth approximation of the distribution of S_n can be constructed using its approximate cumulants; see, for example, Bhattacharya & Ghosh (1978) and Hall (1992). In particular, the two-term Edgeworth approximation to f_n takes the form

g_n(x) = φ(x) { 1 + Q₁(x)/√n + Q₂(x)/n },

Q₁(x) = k₁₂ H₁(x) + k₃₁ H₃(x)/6,

Q₂(x) = (k₂₂ + k₁₂²) H₂(x)/2 + (k₄₁ + 4k₁₂k₃₁) H₄(x)/24 + k₃₁² H₆(x)/72.

Note that g_n(x) involves a polynomial of degree 6, which can cause the so-called tail difficulties discussed in Wallace (1958). The tail of the Edgeworth approximation exhibits increasing oscillations as |x| increases, owing to the high-degree polynomial. Moreover, the Edgeworth approximation is not necessarily a bona fide density, because it can produce negative density approximations in the tails of distributions. Aiming to restrain the tail difficulties caused by polynomial oscillations, Niki & Konishi (1986) propose a transformation method for the Edgeworth approximation. Let a general statistic be S_n = √n(T_n − μ_T)/σ_T, where μ_T and σ_T² are the mean and variance of T_n. Niki &
Konishi (1986) note that the tail difficulties can be mitigated through a transformation φ(T_n) such that the third cumulant of the standardized φ(T_n) is zero, in which case the Edgeworth approximation reduces to a degree-4 polynomial. The transformed statistic takes the form

z_n = √n {φ(T_n) − φ(μ_T)} / {σ_T φ′(μ_T)},

where φ′ is the first derivative of φ. For example, let T_n be the sample average of n i.i.d. chi-squared variates with k degrees of freedom; then φ(x) = x^{1/3}. It is, however, generally rather difficult to derive the proposed transformation except in some special cases. Moreover, the Edgeworth approximation for the transformed statistic can still be negative. Being a higher-order expansion around the mean, the Edgeworth approximation is known to work well near the mean of a distribution, but its performance sometimes deteriorates at the tails. Hampel (1973) introduces the so-called small sample asymptotic method, which is essentially a local Edgeworth approximation. Although generally more accurate than the Edgeworth approximation, Hampel's method requires knowledge of f_n, which limits its applicability in practice. Field & Hampel (1982) briefly discuss an approximation to Hampel's small sample approximation that requires only the first few cumulants of S_n (see also Field & Ronchetti, 1990). Alternatively, this approximation can be obtained by putting the Edgeworth approximation back into the exponent and collecting terms such that its first-order Taylor approximation is equivalent to the Edgeworth approximation. Therefore, we call this approximation the exponential Edgeworth (EE) approximation. Denote the EE approximation by φ(x) exp(K(x)). It can be constructed such that the first-order Taylor approximation of φ(x) exp(K(x)) at K(x) = 0 matches g_n(x) up to order n^{-1}.
We then have

h_n(x) = φ(x) exp{ Q̃₁(x)/√n + (Q̃_{2a}(x) + Q̃_{2b}(x))/n },

Q̃₁(x) = k₁₂ H₁(x) + k₃₁ H₃(x)/6,

Q̃_{2a}(x) = k₂₂ H₂(x)/2 + k₄₁ H₄(x)/24,

Q̃_{2b}(x) = k₁₂² {H₂(x) − H₁²(x)}/2 + k₁₂k₃₁ {H₄(x) − H₁(x)H₃(x)}/6 + k₃₁² {H₆(x) − H₃²(x)}/72.   (4)

Clearly, the EE approximation is strictly positive. In addition, the exponent of h_n(x) is a polynomial of degree 4, as H₆(x) − H₃²(x) is a degree-4 polynomial. Thus, the EE approximation is more likely to restrain polynomial oscillations. For a given x ∈ (−∞, ∞), h_n(x) is equivalent to the Edgeworth approximation g_n(x) up to order n^{-1}. One problem with the EE approximation is that it is generally not integrable on the real line. To see this, note that we can rewrite h_n(x) in the canonical exponential form exp(Σ_{j=0}^4 γ_{nj} x^j), where the γ_{nj}, j = 0, ..., 4, are functions of k₁₂, ..., k₄₁. Obviously, if γ_{n4} > 0, h_n(x) goes to infinity as |x| → ∞. A second problem concerns the preservation of information. Both the two-term Edgeworth approximation g_n and the EE approximation h_n are based solely on {κ̃_{nj}}_{j=1}^4. The g_n preserves information in the sense that the first four cumulants of g_n are identical to {κ̃_{nj}}_{j=1}^4. However, the first four cumulants of h_n, if they exist, are not exactly the same as {κ̃_{nj}}_{j=1}^4. In this study we propose an alternative approximation that has the same functional form exp(Σ_{j=0}^4 γ_{nj} x^j) and at the same time preserves information, in that its first four moments match {μ̃_{nj}}_{j=1}^4. As is shown next, it has an appealing information-theoretic interpretation.
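For a standardized sample mean (so that k₁₂ = k₂₂ = 0, k₃₁ = κ₃, k₄₁ = κ₄), both g_n and the EE form h_n in (4) are simple to evaluate. The sketch below is our own illustration, not code from the paper, using the probabilists' Hermite polynomials:

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def hermites(x):
    """Probabilists' Hermite polynomials H1, H2, H3, H4, H6."""
    return (x, x**2 - 1.0, x**3 - 3.0*x, x**4 - 6.0*x**2 + 3.0,
            x**6 - 15.0*x**4 + 45.0*x**2 - 15.0)

def edgeworth_density(x, k3, k4, n):
    """Two-term Edgeworth approximation g_n for a standardised mean;
    k3 and k4 are the third and fourth cumulants of Y_1."""
    x = np.asarray(x, dtype=float)
    _, _, H3, H4, H6 = hermites(x)
    phi = np.exp(-0.5 * x**2) / SQRT2PI
    return phi * (1.0 + k3*H3/(6.0*np.sqrt(n))
                  + k4*H4/(24.0*n) + k3**2*H6/(72.0*n))

def ee_density(x, k3, k4, n):
    """Exponential Edgeworth h_n of (4) with k12 = k22 = 0; note the
    degree-4 exponent term H6 - H3**2 and the strict positivity."""
    x = np.asarray(x, dtype=float)
    _, _, H3, H4, H6 = hermites(x)
    phi = np.exp(-0.5 * x**2) / SQRT2PI
    expo = k3*H3/(6.0*np.sqrt(n)) + (k4*H4/24.0 + k3**2*(H6 - H3**2)/72.0)/n
    return phi * np.exp(expo)
```

With κ₃ = κ₄ = 0 both reduce to φ(x); near the mean the two agree closely, while in the tails h_n stays positive but, as discussed above, need not be integrable.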
2.2. ME density

The exponential polynomial form of density approximation based on a set of moments arises naturally from the principle of ME, or minimum relative entropy. Given a continuous random variable X with density f, the differential entropy is defined as

W(f) = − ∫ f(x) log f(x) dx.   (5)

The principle of ME provides a method of constructing a density from known moment conditions. Suppose all that is known about f is its first K moments. There might exist an infinite number of densities satisfying the given moment conditions. Jaynes' (1957) principle of maximum entropy suggests selecting, among all candidates satisfying the moment conditions, the density that maximizes the entropy. The resulting ME density is uniquely determined as the one that is maximally noncommittal with regard to missing information: it agrees with what is known but expresses maximum uncertainty with respect to all other matters. Formally, the ME density is defined as

f* = arg max_f W(f)

subject to the moment constraints

∫ f(x) dx = 1,  ∫ x^j f(x) dx = μ_j,  j = 1, ..., K,   (6)

where μ_j = E[X^j]. The problem can be solved using the calculus of variations. Denote the Lagrangian by

L = − ∫ f(x) log f(x) dx − γ₀ { ∫ f(x) dx − 1 } + Σ_{j=1}^K γ_j { ∫ x^j f(x) dx − μ_j }.

The necessary condition for a stationary value is given by the Euler–Lagrange equation

− log f(x) − 1 − γ₀ + Σ_{j=1}^K γ_j x^j = 0,   (7)

plus the constraints in (6). Thus, from (7), the ME density takes the general form

f*(x) = exp{ Σ_{j=1}^K γ_j x^j − (1 + γ₀) } = exp{ Σ_{j=1}^K γ_j x^j − ψ(γ) },   (8)

where γ = {γ_j}_{j=1}^K and ψ(γ) = log{ ∫ exp(Σ_{j=1}^K γ_j x^j) dx } is the normalizing factor, under the assumption that ∫ exp(Σ_{j=1}^K γ_j x^j) dx < ∞. The Lagrange multipliers γ can be solved from the K moment conditions in (6) by plugging f = f* into (6). For distributions defined on R, the necessary conditions for the integrability of f* are that K is even and γ_K < 0. A closely related concept is the relative entropy, also called cross entropy or Kullback–Leibler distance.
Consider two densities f and f₀ defined on a common support. The relative entropy between f and f₀ is defined as

D(f ‖ f₀) = ∫ f(x) log{ f(x)/f₀(x) } dx.   (9)

It is known that D(f ‖ f₀) ≥ 0, and D(f ‖ f₀) = 0 if and only if f(x) = f₀(x) almost everywhere. Minimizing (9) subject to the same moment constraints (6) yields the minimum relative entropy density:
f*(x; f₀) = f₀(x) exp{ Σ_{j=1}^K γ_{j,f₀} x^j − ψ(γ_{f₀}) },   (10)

where γ_{f₀} = {γ_{j,f₀}}_{j=1}^K and ψ(γ_{f₀}) = log{ ∫ f₀(x) exp(Σ_{j=1}^K γ_{j,f₀} x^j) dx } is the normalizing factor, under the assumption that ∫ f₀(x) exp(Σ_{j=1}^K γ_{j,f₀} x^j) dx < ∞. Among all densities satisfying the given moment conditions, f*(x; f₀) is the unique one that is closest to the reference density f₀ in terms of the relative entropy. Shore & Johnson (1981) provide an axiomatic justification of this approach. The ME density is a special case of the minimum relative entropy density wherein the reference density f₀ is set to be constant; see, for example, Zellner & Highfield (1988) and Wu (2003). Existing studies on ME densities mostly focus on their use as density estimators. The ME approximation is equivalent to approximating the logarithm of a density by polynomials, which has long been studied. Earlier studies on the approximation of log-densities using polynomials include Neyman (1937) and Good (1963). When sample moments rather than population moments are used, the maximum likelihood method gives efficient estimates of this canonical exponential family. Crain (1974, 1977) establishes existence and consistency of the maximum likelihood estimator of ME densities. Gilula & Haberman (2000) use a maximin argument based on information theory to derive the ME density based on summary statistics. Barron & Sheu (1991) allow the number of moments to increase with the sample size and formally establish the ME approximation as a nonparametric density estimator. Their results are extended to multivariate densities in Wu (forthcoming). This study is conceptually different: the ME densities are used to approximate distributions of statistics that are asymptotically normal. Instead of letting K, the degree of the exponential polynomial, increase with the sample size, we study the accuracy of the approximation as the sample size n → ∞ for some given K.
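In practice the multipliers γ in (8) are found numerically. A standard route, sketched below on a bounded grid where the ME density always exists, minimizes the convex dual ψ(γ) − Σ_j γ_j μ_j, whose gradient is the vector of fitted-minus-target moments. This is our own sketch of the technique, not code from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def fit_me_density(moments, grid):
    """Fit the ME density exp{sum_j g_j x^j - psi(g)} of (8) on the bounded
    support covered by `grid` (a uniform mesh), matching the first K power
    moments, by minimising the convex dual psi(g) - g.mu."""
    mu = np.asarray(moments, dtype=float)
    K = mu.size
    dx = grid[1] - grid[0]
    powers = np.vstack([grid**(j + 1) for j in range(K)])   # (K, m)

    def density(g):
        expo = g @ powers
        w = np.exp(expo - expo.max())      # stabilised, unnormalised
        return w / (w.sum() * dx)          # normalised on the grid

    def dual(g):
        expo = g @ powers
        m = expo.max()
        psi = m + np.log(np.exp(expo - m).sum() * dx)
        return psi - g @ mu

    def dual_grad(g):                      # fitted moments minus targets
        return (powers * density(g)).sum(axis=1) * dx - mu

    res = minimize(dual, np.zeros(K), jac=dual_grad, method="BFGS")
    return density(res.x), res.x

# sketch: recover N(0,1) from its first four moments (0, 1, 0, 3)
grid = np.linspace(-6.0, 6.0, 2001)
dens, gamma = fit_me_density([0.0, 1.0, 0.0, 3.0], grid)
```

On the real line the same program need not have a solution, which is exactly the integrability issue addressed in section 3; on a bounded grid it always does.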
Related work can be found in Harremoës (2005a,b) and references therein. The former establishes lower bounds in terms of the relative entropy to determine the rate of convergence in the central limit theorem; the latter shows that the Edgeworth approximation can be considered a linear extrapolation of the ME approximation.

3. ME approximation

In this section we introduce an ME approximation to distributions of statistics that are asymptotically normal. As discussed above, the ME density is not necessarily well defined for distributions on an unbounded support. We therefore propose a modified ME density that overcomes this constraint. This is particularly useful for asymptotic approximations to distributions of statistics that are defined on unbounded supports. We then establish the asymptotic properties of higher-order approximations based on modified ME densities.

3.1. Approximation for distributions on an unbounded support

Whether an ME approximation is integrable on an unbounded support depends on the moment conditions. For example, consider the case where the first four moments, with μ₁ = 0 and μ₂ = 1, are used to construct an ME density. Rockinger & Jondeau (2002) compute numerically a boundary in the (μ₃, μ₄) space beyond which an ME approximation is not obtainable. Under the same conditions, Tagliani (2003) shows that an ME approximation is not integrable if μ₃ = 0 and μ₄ > 3.
We propose to use a dampening function in the exponent of an ME density to ensure its integrability. A similar approach is used in Wang (1992) to improve the numerical stability of general saddlepoint approximations. Define the dampening function

θ_n(x) = c exp(|x|)/n²,   (11)

where c is a positive constant. The modified ME density based on {μ̃_{nj}}_{j=1}^4 takes the form

f̂_{nθ_n}(x) = exp{ Σ_{j=1}^4 γ_{njθ_n} x^j − θ_n(x) − ψ(γ_{nθ_n}) },   (12)

where the normalization term is defined as

ψ(γ_{nθ_n}) = log ∫ exp{ Σ_{j=1}^4 γ_{njθ_n} x^j − θ_n(x) } dx.

This ME density can be obtained by solving

min_f ∫ f(x) log[ f(x)/exp{−θ_n(x)} ] dx  such that  ∫ f(x) dx = 1,  ∫ x^j f(x) dx = μ̃_{nj},  j = 1, ..., 4.

Thus, the modified ME approximation is essentially a minimum cross entropy density based on the given moment conditions and with a non-constant reference density proportional to exp{−θ_n(x)}. The dampening function ensures that the exponent of the ME density is bounded above and goes to −∞ as |x| → ∞, so that the ME density is integrable on R. One might be tempted to use a term like θ_n(x) = cx^r/n² as the dampening function, where r is an even integer larger than 4. Integrability of the ME approximation, however, is not warranted with this choice. As is shown below, the coefficient of x⁴ in the exponent, γ_{n4θ_n}, is O(n^{-1}). Suppose that r = 6. For x = O(n^{1/3}), it follows that γ_{n4θ_n} x⁴ = O(n^{1/3}) and θ_n(x) = O(1). Thus, if γ_{n4θ_n} > 0, f̂_{nθ_n}(x) goes to infinity as |x| → ∞ and is not integrable on R. Similarly, we use the same dampening function to ensure the integrability of the EE approximation on R, resulting in

h_{nθ_n}(x) = φ(x) exp{ Q̃₁(x)/√n + (Q̃_{2a}(x) + Q̃_{2b}(x))/n − θ_n(x) },   (13)

where Q̃₁(x), Q̃_{2a}(x) and Q̃_{2b}(x) are defined in (4) and θ_n(x) is given by (11). We note that the ME approximation adapts to the dampening function in the sense that its coefficients are determined through an optimization process that incorporates the dampening function as a reference density.
In contrast, the EE approximation does not have this appealing property, because its coefficients are invariant to the introduction of a dampening function. This noteworthy difference is explored numerically in section 4. Although it does not affect the asymptotic result of order n^{-1}, the dampening function may influence the small sample performance of the ME and EE approximations. In practice we use the following rule: the constant c in the dampening function is chosen as the smallest positive number that ensures that both tails of the approximation decrease
monotonically. More precisely, let f̂_{nθ_n}(·; c) be the ME approximation with dampening function θ_n(·; c). We select c according to

c* = inf{ a > 0 : (d/dx) f̂_{nθ_n}(x; a) > 0 for x < −2 and (d/dx) f̂_{nθ_n}(x; a) < 0 for x > 2 }.   (14)

In other words, c* is chosen such that the density approximation decays monotonically for |x| > 2 as |x| → ∞, which corresponds to approximately the 5 per cent tail regions of the normal approximation. We use the lower bound c = 10^{-6} when c* is arbitrarily close to zero. The same rule is used to select the dampening function for the EE approximation (13).

3.2. Asymptotic properties

In this section we derive the asymptotic properties of the ME approximation. As noted by Wallace (1958), the asymptotic orders of several alternative higher-order approximations can be established through their relationships with the Edgeworth approximation. The same approach is employed here. More specifically, we use the EE approximation to facilitate our analysis, noting that an appropriately constructed EE approximation is asymptotically equivalent to the Edgeworth approximation and, at the same time, shares a common canonical exponential form with the ME approximation. The asymptotic properties of the Edgeworth approximation have been extensively studied. Recall that we are interested in approximating the distribution of S_n, a smooth function of sample means as defined in section 2.1. We start with the following result for the Edgeworth approximation g_n to f_n.

Lemma 1 (Hall, 1992, p. 78). Assume that the function B has four continuous derivatives in a neighbourhood of μ and satisfies

sup_{−∞ < x < ∞} n^{-1} ∫_{S(n,x)} (1 + ‖y‖)⁴ dy < ∞,

where S(n, x) = {y ∈ R^d : n^{1/2} B(μ + n^{-1/2} y)/σ = x}. Furthermore, assume that E‖X‖⁴ < ∞ and that the characteristic function φ of X satisfies ∫_{R^d} |φ(t)|^λ dt < ∞ for some λ ≥ 1. Then, as n → ∞,

f_n(x) − g_n(x) = o(n^{-1}) uniformly in x.
We next show that the EE approximation (13) is equivalent to the two-term Edgeworth approximation up to order n^{-1}.

Theorem 1. Let h_{nθ_n} be the (modified) EE approximation (13) to f_n. Under the conditions of lemma 1,

f_n(x) − h_{nθ_n}(x) = o(n^{-1}) uniformly in x as n → ∞.

A proof of theorem 1 and all other proofs are given in appendix 2. Next, to make the ME approximation f̂_{nθ_n} comparable with the Edgeworth and EE approximations, we factor out φ(x). Reparameterizing γ_{n2θ_n} + 1/2 in (12) and denoting it by γ_{n2θ_n} without confusion, we obtain the following slightly different form:
f̂_{nθ_n}(x) = φ(x) exp{ Σ_{j=1}^4 γ_{njθ_n} x^j − θ_n(x) − ψ(γ_{nθ_n}) },   (15)

where

ψ(γ_{nθ_n}) = log ∫ φ(x) exp{ Σ_{j=1}^4 γ_{njθ_n} x^j − θ_n(x) } dx.

The asymptotic properties of f̂_{nθ_n} are established with the aid of the EE approximation (13). Heuristically, using theorem 1, one can show that μ_j(h_{nθ_n}) = μ̃_{nj} + o(n^{-1}) for j = 1, ..., 4. Let μ(f) = (μ_j(f))_{j=1}^4 and let ‖·‖ be the Euclidean norm. As μ(f_n) = μ(f̂_{nθ_n}) + o(n^{-1}), it follows that ‖μ(h_{nθ_n}) − μ(f̂_{nθ_n})‖ = o(n^{-1}). Under this condition we are then able to establish order n^{-1} equivalence between h_{nθ_n} and f̂_{nθ_n}, which share a common exponential polynomial form. By lemma 1 and theorem 1, we demonstrate formally the following result.

Theorem 2. Let f̂_{nθ_n} be the ME approximation (15) to f_n. Under the conditions of lemma 1,

f_n(x) − f̂_{nθ_n}(x) = o(n^{-1}) uniformly in x as n → ∞.

The asymptotic properties of the approximations to distribution functions follow readily from the aforementioned results. Let F_n(x) = P(S_n ≤ x). Denote by G_n(x) the two-term Edgeworth approximation of F_n. It is known that F_n(x) − G_n(x) = o(n^{-1}); see, for example, Feller (1971). Define H_{nθ_n}(x) = ∫_{−∞}^x h_{nθ_n}(t) dt and F̂_{nθ_n}(x) = ∫_{−∞}^x f̂_{nθ_n}(t) dt, respectively. Next we show that the same asymptotic order applies to H_{nθ_n} and F̂_{nθ_n}.

Theorem 3. Under the conditions of lemma 1, (a) the EE approximation to the distribution function satisfies F_n(x) − H_{nθ_n}(x) = o(n^{-1}) uniformly in x as n → ∞; (b) the ME approximation to the distribution function satisfies F_n(x) − F̂_{nθ_n}(x) = o(n^{-1}) uniformly in x as n → ∞.

4. Numerical examples

In this section we provide some numerical examples to illustrate the finite sample performance of the proposed method. Our experiments compare the Edgeworth, the EE and the ME approximations. We also include the results of the normal approximation, which is nested in all three higher-order approximations.
Alternative higher-order asymptotic methods, such as the bootstrap and the saddlepoint approximation, are not considered. We focus our comparison on the ME and Edgeworth approximations because they share the same information requirement: only the first four moments or cumulants are required for approximations up to order n^{-1}. The other methods are conceptually different: the bootstrap uses re-sampling, and the saddlepoint approximation requires knowledge of the moment-generating function. We leave the comparison with these alternative methods for future study. As by construction both the Edgeworth and ME approximations integrate to unity, we normalize the EE approximation so that it is directly comparable with the other two. We are interested in F_n, the distribution of S_n, a smooth function of sample means. Define s_α by Pr(S_n ≤ s_α) = α. Let Ψ_n be some approximation to F_n. In each experiment we report the one- and two-sided tail probabilities defined by

α₁(Ψ_n) = ∫_{−∞}^{s_α} dΨ_n(s),   α₂(Ψ_n) = 1 − ∫_{s_{α/2}}^{s_{1−α/2}} dΨ_n(s).
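These tail probabilities can be computed for any density approximation by numerical integration. A small helper of our own, with a sanity check against the normal distribution:

```python
import numpy as np
from scipy.stats import norm

def tail_probs(pdf, s_alpha, s_lo, s_hi, lo=-15.0, hi=15.0, m=300001):
    """One-sided mass of `pdf` below s_alpha (alpha_1) and two-sided mass
    outside (s_lo, s_hi) (alpha_2), via Riemann sums on a fine grid."""
    x = np.linspace(lo, hi, m)
    dx = x[1] - x[0]
    f = pdf(x)
    alpha1 = f[x <= s_alpha].sum() * dx
    alpha2 = 1.0 - f[(x > s_lo) & (x < s_hi)].sum() * dx
    return alpha1, alpha2

# with the exact normal critical values, both should recover alpha = 0.05
a1, a2 = tail_probs(norm.pdf, norm.ppf(0.05), norm.ppf(0.025), norm.ppf(0.975))
```

Substituting an Edgeworth, EE or ME density for `norm.pdf`, and exact or simulated critical values for the normal quantiles, reproduces the quantities reported in the tables below.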
We also report the relative error, defined as |α_i(Ψ_n) − α|/α, i = 1, 2. In the first example, suppose that we are interested in the distribution of the standardized mean S_n = Σ_{i=1}^n (Y_i − 1)/√(2n), where {Y_i} follows a chi-squared distribution with one degree of freedom. The exact density of S_n is given by

f_n(s) = √(2n) (√(2n) s + n)^{(n−2)/2} e^{−(√(2n)s + n)/2} / {Γ(n/2) 2^{n/2}},  s > −√(n/2).

We set n = 10. The constant c in the dampening function is chosen to be 10^{-6} for both the ME and EE approximations. The results are reported in Table 1. The accuracy of all approximations generally improves as α increases. The three higher-order approximations provide comparable performance and dominate the normal approximation. Although the magnitudes of the relative errors are similar, it is noted that both the one- and two-sided tail probabilities of the Edgeworth approximation are negative at extreme tails. We next examine a case where f_n is symmetric and fat-tailed. Suppose that S_n is the standardized mean of random variables from the t-distribution with five degrees of freedom, where n = 10. Although the exact distribution of S_n is unknown, we can calculate its moments analytically (μ_{n3} = 0 and μ_{n4} = 3.6). As discussed in the previous section, for symmetric fat-tailed distributions both the ME and EE approximations, without a proper dampening function, are not integrable on the real line. In this experiment the constant c in the dampening function is chosen to be 0.56 and 4.05 for the ME and EE approximations, respectively, using the automatic selection rule (14). We used 1 million simulations to obtain the critical values of the distribution of S_n. The approximation results are reported in Table 2. The overall performance of the ME approximation is considerably better than that of the Edgeworth approximation, which in turn outperforms the normal approximation.
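The first example is easy to reproduce because the exact distribution is available in closed form: Σ_i Y_i is chi-squared with n degrees of freedom, so P(S_n ≤ s) = P(χ²_n ≤ √(2n)s + n). A sketch of ours, using scipy, of the normal approximation's lower-tail error at α = 0.05:

```python
import numpy as np
from scipy.stats import chi2, norm

n = 10

def exact_cdf(s):
    """Exact CDF of S_n = (sum of n chi-squared(1) variables - n)/sqrt(2n)."""
    return chi2.cdf(np.sqrt(2.0 * n) * s + n, df=n)

s05 = norm.ppf(0.05)            # the normal approximation's 5% critical value
alpha1_normal = exact_cdf(s05)  # true lower-tail mass at that point
```

With n = 10 the true mass below the normal 5 per cent point is only about 0.011, illustrating how badly the skewness of the chi-squared distribution defeats the plain normal approximation in this example.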
Unlike in the first example, the performance of the EE approximation is essentially dominated by that of the other three. Note that a non-trivial c is chosen for both the ME and EE approximations in this case. A relatively heavy dampening function is chosen for the EE approximation to ensure its integrability, and its performance is clearly affected by the dampening function. In contrast, the dampening function required for the ME approximation is considerably weaker.

Table 1. Tail probabilities α₁(Ψ_n) = ∫_{−∞}^{s_α} dΨ_n(s) and α₂(Ψ_n) = 1 − ∫_{s_{α/2}}^{s_{1−α/2}} dΨ_n(s) of approximated distributions of the standardized mean of n = 10 χ²₁ variables (NORM, normal; ED, Edgeworth; EE, exponential Edgeworth; ME, maximum entropy; RE, relative error).
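Rule (14) can be imitated numerically. The sketch below is our own illustration: it applies the rule to the dampened EE exponent of (13), specialised to a symmetric standardized mean (k₁₂ = k₂₂ = k₃₁ = 0), and scans a grid of candidate values of c:

```python
import numpy as np

def dampened_ee_logdensity(x, k31, k41, n, c):
    """Log of the dampened EE density (13), up to its normalising constant,
    specialised to k12 = k22 = 0 (a standardised mean)."""
    x = np.asarray(x, dtype=float)
    H3 = x**3 - 3.0*x
    H4 = x**4 - 6.0*x**2 + 3.0
    H6 = x**6 - 15.0*x**4 + 45.0*x**2 - 15.0
    Q1 = k31*H3/6.0
    Q2 = k41*H4/24.0 + k31**2*(H6 - H3**2)/72.0
    theta = c*np.exp(np.abs(x))/n**2            # dampening function (11)
    return -0.5*x**2 + Q1/np.sqrt(n) + Q2/n - theta

def select_c(k31, k41, n, c_step=0.01, c_max=20.0):
    """Smallest c on a grid such that the dampened EE density decays
    monotonically for |x| > 2: a discretised stand-in for rule (14)."""
    x = np.linspace(2.0, 10.0, 4001)
    for c in np.arange(c_step, c_max, c_step):
        right = dampened_ee_logdensity(x, k31, k41, n, c)
        left = dampened_ee_logdensity(-x, k31, k41, n, c)
        if (np.diff(right) < 0).all() and (np.diff(left) < 0).all():
            return float(c)
    return c_max
```

For the t₅ example (k₃₁ = 0, k₄₁ = 6, n = 10, since the fourth cumulant of a t₅ variate is 6) this returns approximately 4.05, in line with the automatically selected value reported above.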
Table 2. Tail probabilities of approximated distributions of the standardized mean of n = 10 t₅ variables (NORM, normal; ED, Edgeworth; EE, exponential Edgeworth; ME, maximum entropy; RE, relative error).

Generally, the selection of the constant in the dampening function is expected to have little effect on the ME approximation, because the latter adapts to the dampening function by adjusting its coefficients through an optimization process. However, it has a non-negligible effect on the EE approximation, whose coefficients are invariant to the introduction of the dampening function. To investigate how sensitive the reported results are to the specification of the dampening function, we conducted some further experiments. We used a grid search to locate the optimal c for the ME and EE approximations, where the optimal c is defined as the minimizer of the sum of the eight relative errors reported in Table 2. The optimal c for the ME approximation is 0.6, obtained through a grid search on the interval [0.1, 5] with an increment of 0.1. The optimal c for the EE approximation, which is 5, is obtained in a similar fashion through a grid search on [1, 20] with an increment of 1. Thus, the values of c selected by the automatic rule, 0.56 and 4.05 for the ME and EE approximations, respectively, are quite close to their optimal values, which are generally infeasible in practice. In the third experiment we used sample moments rather than population moments in the approximation of the distribution of a sample mean. Suppose that the variance is unknown. We then have the studentized statistic

S_n = √n (Ȳ − EY₁) / √{ n^{-1} Σ_{t=1}^n (Y_t − Ȳ)² },  Ȳ = n^{-1} Σ_{t=1}^n Y_t,

which is a smooth function of sample means. Without loss of generality, suppose that {Y_t} follows a distribution with mean zero and variance one.
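The studentized statistic can be formed directly from data; a minimal sketch of ours, using the n-divisor variance estimator as in the definition above:

```python
import numpy as np

def studentized_mean(y, mean0=0.0):
    """S_n = sqrt(n) (Ybar - E Y_1) / sigma_hat, where
    sigma_hat**2 = n^{-1} * sum (Y_t - Ybar)**2."""
    y = np.asarray(y, dtype=float)
    n = y.size
    sigma_hat = np.sqrt(np.mean((y - y.mean())**2))
    return np.sqrt(n) * (y.mean() - mean0) / sigma_hat
```

For example, for the two-point sample (0, 2) the statistic equals √2 · (1 − 0)/1 = √2.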
Using the general formulae given in appendix 1, we obtain the following approximate moments for S_n:

μ̃_{n1} = −μ₃/(2√n),  μ̃_{n2} = 1 + (3 + 2μ₃²)/n,

μ̃_{n3} = −7μ₃/(2√n),  μ̃_{n4} = 3 + (30 + 28μ₃² − 2μ₄)/n,

where μ_j = E(Y₁^j), j = 3, 4. The distribution of S_n was obtained by 1 million Monte Carlo simulations. In our experiment we set n = 20 and generated {Y_t} from the standard normal distribution. The population moments μ₃ and μ₄ are replaced by sample moments in the approximations. The dampening function is selected automatically according to the rule in (14). The experiment is repeated 500 times. Table 3 reports the average tail probabilities and corresponding
standard errors. The results are qualitatively similar to those in Table 2. The ME and Edgeworth approximations dominate the EE and normal approximations. The overall performance of the ME approximation is better than that of the Edgeworth approximation, especially for the smallest tail probabilities, and the standard errors of the ME approximation are smaller than those of the Edgeworth approximation in most cases.

Table 3. Empirical tail probabilities of approximated distributions of the studentized mean of n = 20 standard normal random variables (NORM, normal; ED, Edgeworth; EE, exponential Edgeworth; ME, maximum entropy; SE, standard error).

5. Conclusion

In this article we have proposed an ME approach for approximating the asymptotic distributions of smooth functions of sample means. Conventional ME densities are typically defined on a bounded support. The proposed ME approximation, equipped with an asymptotically negligible dampening function, is well defined on the real line. We have established order n^{-1} asymptotic equivalence between the proposed method and the classical Edgeworth approximation. Our numerical experiments show that the proposed method is comparable with, and sometimes outperforms, the Edgeworth approximation. Finally, we note that the proposed method may be generalized to approximate bootstrap distributions of statistics and distributions of high dimensional statistics.

References

Barndorff-Nielsen, O. E. & Cox, D. R. (1989). Asymptotic techniques for use in statistics. Chapman and Hall, London.

Barron, A. R.
& Sheu, C. (1991). Approximation of density functions by sequences of exponential families. Ann. Statist.
Bhattacharya, R. N. & Ghosh, J. K. (1978). On the validity of the formal Edgeworth expansion. Ann. Statist.
Crain, B. R. (1974). Estimation of distributions using orthogonal expansions. Ann. Statist.
Crain, B. R. (1977). An information theoretic approach to approximating a probability distribution. SIAM J. Appl. Math.
Edgeworth, F. Y. (1905). The law of error. Trans. Camb. Philos. Soc.
Feller, W. (1971). An introduction to probability theory and its applications, vol. II. Wiley, New York.
Field, C. A. & Hampel, F. R. (1982). Small-sample asymptotic distributions of M-estimators of location. Biometrika.
Field, C. A. & Ronchetti, E. (1990). Small sample asymptotics. Institute of Mathematical Statistics, Hayward, CA.
Good, I. J. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann. Math. Statist.
Gilula, Z. & Haberman, S. J. (2000). Density approximation by summary statistics: an information theoretic approach. Scand. J. Statist.
Hall, P. (1992). The bootstrap and Edgeworth expansion. Springer-Verlag, New York.
Hampel, F. R. (1973). Some small-sample asymptotics. In Proceedings of the Prague Symposium on Asymptotic Statistics (ed. J. Hajek), Charles University, Prague.
Harremoës, P. (2005a). Lower bounds for divergence in central limit theorem. Electron. Notes Discrete Math.
Harremoës, P. (2005b). Maximum entropy and the Edgeworth approximation. In Proceedings of IEEE ISOC ITW 2005 on Coding and Complexity.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev.
Neyman, J. (1937). Smooth test for goodness of fit. Skand. Aktuar.
Niki, N. & Konishi, S. (1986). Effects of transformations in higher order asymptotic expansions. Ann. Inst. Statist. Math.
Rockinger, M. & Jondeau, E. (2002). Entropy densities with an application to autoregressive conditional skewness and kurtosis. J. Econometrics.
Shore, J. E. & Johnson, R. W. (1981). Properties of cross-entropy minimization. IEEE Trans. Inform. Theory.
Tagliani, A. (2003). A note on proximity of distributions in terms of coinciding moments. Appl. Math. Comput.
Wallace, D. (1958). Asymptotic approximations to distributions. Ann. Math. Statist.
Wang, S. (1992). General saddlepoint approximation in the bootstrap. Statist. Probab. Lett.
Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. J. Econometrics.
Wu, X. (forthcoming). Multivariate exponential series density estimator. J. Econometrics.
Zellner, A. & Highfield, R. A. (1988). Calculation of maximum entropy distribution and approximation of marginal posterior distributions. J. Econometrics.

Received December 2008, in final form January 2010

Ximing Wu, Department of Agricultural Economics, Texas A&M University, College Station, TX, USA.
E-mail: xwu@tamu.edu

Appendix 1: Approximate moments and cumulants

We first define μ_{i₁...i_j} = E[(X − μ)^{(i₁)} ⋯ (X − μ)^{(i_j)}], j ≥ 1, and

ν₁ = μ_{i₁},  ν₂ = μ_{i₁i₂},  ν₃ = μ_{i₁i₂i₃},  ν₄ = μ_{i₁i₂i₃i₄},

ν₂₂ = μ_{i₁i₂}μ_{i₃i₄} + μ_{i₁i₃}μ_{i₂i₄} + μ_{i₁i₄}μ_{i₂i₃},

ν₂₃ = μ_{i₁i₂}μ_{i₃i₄i₅} + μ_{i₁i₃}μ_{i₂i₄i₅} + μ_{i₁i₄}μ_{i₂i₃i₅} + μ_{i₁i₅}μ_{i₂i₃i₄} + μ_{i₂i₃}μ_{i₁i₄i₅} + μ_{i₂i₄}μ_{i₁i₃i₅} + μ_{i₂i₅}μ_{i₁i₃i₄} + μ_{i₃i₄}μ_{i₁i₂i₅} + μ_{i₃i₅}μ_{i₁i₂i₄} + μ_{i₄i₅}μ_{i₁i₂i₃},

ν₂₂₂ = μ_{i₁i₂}μ_{i₃i₄}μ_{i₅i₆} + μ_{i₁i₂}μ_{i₃i₅}μ_{i₄i₆} + μ_{i₁i₂}μ_{i₃i₆}μ_{i₄i₅} + μ_{i₁i₃}μ_{i₂i₄}μ_{i₅i₆} + μ_{i₁i₃}μ_{i₂i₅}μ_{i₄i₆} + μ_{i₁i₃}μ_{i₂i₆}μ_{i₄i₅} + μ_{i₁i₄}μ_{i₂i₃}μ_{i₅i₆} + μ_{i₁i₄}μ_{i₂i₅}μ_{i₃i₆} + μ_{i₁i₄}μ_{i₂i₆}μ_{i₃i₅} + μ_{i₁i₅}μ_{i₂i₃}μ_{i₄i₆} + μ_{i₁i₅}μ_{i₂i₄}μ_{i₃i₆} + μ_{i₁i₅}μ_{i₂i₆}μ_{i₃i₄} + μ_{i₁i₆}μ_{i₂i₃}μ_{i₄i₅} + μ_{i₁i₆}μ_{i₂i₄}μ_{i₃i₅} + μ_{i₁i₆}μ_{i₂i₅}μ_{i₃i₄},

where the dependence of the ν's on i₁,...,i₆ is suppressed for simplicity. One can then show that

E[Z^{(i₁)}] = ν₁ = 0,
E[Z^{(i₁)}Z^{(i₂)}] = ν₂,
E[Z^{(i₁)}Z^{(i₂)}Z^{(i₃)}] = n^{−1/2}ν₃,
E[Z^{(i₁)} ⋯ Z^{(i₄)}] = ν₂₂ + n^{−1}(ν₄ − ν₂₂) + o(n^{−1}),
E[Z^{(i₁)} ⋯ Z^{(i₅)}] = n^{−1/2}ν₂₃ + o(n^{−1}),
E[Z^{(i₁)} ⋯ Z^{(i₆)}] = ν₂₂₂ + O(n^{−1}).
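In the scalar case these relations reduce to familiar identities; for instance, with Z = √n(X̄ − μ) the third relation says E[Z³] = n^{−1/2}μ₃ exactly. The following quick simulation illustrates that scaling (the identity itself is standard; the simulation setup and function name are ours, chosen for illustration only):

```python
import random

def z_third_moment(n, draws, rng):
    """Monte Carlo estimate of E[Z^3] for Z = sqrt(n) * (Xbar - mu),
    with X exponential(1) shifted to mean zero, so that mu3 = 2."""
    total = 0.0
    for _ in range(draws):
        z = sum(rng.expovariate(1.0) - 1.0 for _ in range(n)) / n ** 0.5
        total += z ** 3
    return total / draws

rng = random.Random(7)
# Theory (the n^{-1/2} nu_3 relation above): E[Z^3] = mu3 / sqrt(n),
# here 2 / sqrt(25) = 0.4, up to Monte Carlo error.
print(z_third_moment(25, 150_000, rng))
```

The estimate shrinks like n^{−1/2}, which is precisely why the third-moment contribution enters the approximate moments of S_n at order n^{−1/2}.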
Expanding E[S_n^j], j = 1,...,4, and collecting terms by order of n^{−1/2} then yields the first four approximate moments of S_n. Denoting the j-fold summation Σ_{i₁=1}^d ⋯ Σ_{i_j=1}^d by Σ_{i₁...i_j} to ease notation, we obtain the approximate moments of S_n as follows:

μ_{n1} = (1/√n)(1/2)Σ_{i₁i₂} b_{i₁i₂}ν₂ = (1/√n)s₁₂,

μ_{n2} = Σ_{i₁i₂} b_{i₁}b_{i₂}ν₂ + (1/n)(Σ_{i₁i₂i₃} b_{i₁}b_{i₂i₃}ν₃ + (1/3)Σ_{i₁...i₄} b_{i₁}b_{i₂i₃i₄}ν₂₂ + (1/4)Σ_{i₁...i₄} b_{i₁i₂}b_{i₃i₄}ν₂₂) = s₂₁ + (1/n)s₂₂,

μ_{n3} = (1/√n)(Σ_{i₁i₂i₃} b_{i₁}b_{i₂}b_{i₃}ν₃ + (3/2)Σ_{i₁...i₄} b_{i₁}b_{i₂}b_{i₃i₄}ν₂₂) = (1/√n)s₃₁,

μ_{n4} = Σ_{i₁...i₄} b_{i₁}b_{i₂}b_{i₃}b_{i₄}ν₂₂ + (1/n)(Σ_{i₁...i₄} b_{i₁}b_{i₂}b_{i₃}b_{i₄}(ν₄ − ν₂₂) + 2Σ_{i₁...i₅} b_{i₁}b_{i₂}b_{i₃}b_{i₄i₅}ν₂₃ + (2/3)Σ_{i₁...i₆} b_{i₁}b_{i₂}b_{i₃}b_{i₄i₅i₆}ν₂₂₂ + (3/2)Σ_{i₁...i₆} b_{i₁}b_{i₂}b_{i₃i₄}b_{i₅i₆}ν₂₂₂) = s₄₁ + (1/n)s₄₂,

where the definitions of s₁₂,...,s₄₂ are self-evident. As usually only a few of the b_{i₁...i_j} terms are non-zero, the actual formulae for the approximate moments of S_n are often simpler than they appear above.

To construct approximate cumulants, we first write the cumulants as a power series in n^{−1/2}:

κ_{n1} = E[S_n] = n^{−1/2}s₁₂ + o(n^{−1}) = n^{−1/2}k₁₂ + o(n^{−1}),

κ_{n2} = E[S_n²] − E[S_n]² = s₂₁ + n^{−1}(s₂₂ − s₁₂²) + o(n^{−1}) = k₂₁ + n^{−1}k₂₂ + o(n^{−1}),

κ_{n3} = E[S_n³] − 3E[S_n]E[S_n²] + 2E[S_n]³ = n^{−1/2}(s₃₁ − 3s₁₂s₂₁) + o(n^{−1}) = n^{−1/2}k₃₁ + o(n^{−1}),

κ_{n4} = E[S_n⁴] − 4E[S_n]E[S_n³] − 3E[S_n²]² + 12E[S_n]²E[S_n²] − 6E[S_n]⁴ = n^{−1}(12s₁₂²s₂₁ − 6s₂₂s₂₁ − 4s₃₁s₁₂ + s₄₂) + o(n^{−1}) = n^{−1}k₄₁ + o(n^{−1}),

where the definitions of k₁₂,...,k₄₁ are self-evident. The second-to-last equality holds because s₄₁ − 3s₂₁² = 0. When S_n is a sample mean, its cumulants can be easily calculated by κ_{nj} = κ_j/n^{j/2−1}. For general statistics, however, this simple identity does not hold and the cumulants have to be constructed from the moments given above.

Appendix 2: Proofs of theorems

Proof of theorem 1.
Denote the exponent of (13) by

K(x) = P₃(x)/(6√n) + P₄(x)/(24n) − c·exp(|x|)/n²,

where P₃(x) and P₄(x) are polynomials of degrees 3 and 4, respectively, whose coefficients are constants that do not depend on n.
Without loss of generality, assume that K(x) > 0. By Taylor's theorem, there exists a z such that 0 < K(z) < K(x) and

h_{n,θ_n}(x) = φ(x){1 + K(x) + K²(x)/2! + (K³(x)/3!)exp(K(z))}
 = φ(x){1 + P₃(x)/(6√n) + P₄(x)/(24n) + P₃²(x)/(72n)} + φ(x)Q(x) + φ(x)(K³(x)/3!)exp(K(z)),   (16)

where

Q(x) = P₄²(x)/(2(24n)²) + c²exp(2|x|)/(2n⁴) + P₃(x)P₄(x)/(144n^{3/2}) − P₃(x)c·exp(|x|)/(6n^{5/2}) − P₄(x)c·exp(|x|)/(24n³) − c·exp(|x|)/n².   (17)

Note that the first term of (16) equals the Edgeworth approximation g_n(x). As sup_{x∈R} φ(x)|x|^r < ∞ for all non-negative integers r, we have sup_{x∈R} φ(x)|P_r(x)| < ∞. Similarly, one can show that sup_{x∈R} |P_r(x)|φ(x)exp(|x|) < ∞. It follows that φ(x)Q(x) = O(n^{−3/2}). Note that

K³(x) = {P₃(x)/(6√n) + P₄(x)/(24n)}³ − 3{P₃(x)/(6√n) + P₄(x)/(24n)}²{c·exp(|x|)/n²} + 3{P₃(x)/(6√n) + P₄(x)/(24n)}{c·exp(|x|)/n²}² − {c·exp(|x|)/n²}³.

Using the fact that

sup_{x∈R} φ(x)|P_r(x)|(exp(|x|))^j < ∞,  j, r = 0, 1,...,

we have φ(x)K³(x)/3! = O(n^{−3/2}). It follows that

h_{n,θ_n} = g_n + O(n^{−3/2}) + O(n^{−3/2})exp(K(z)).   (18)

It now suffices to show that exp(K(z)) < ∞, so that h_{n,θ_n} = g_n + o(n^{−1}). Note that when z = o(n^{1/6}) we have P₃(z)/√n = o(1) and P₄(z)/n = o(1); thus K(z) is bounded above because its third term is negative. When z = O(n^{1/6}), P₃(z)/√n = O(1) and P₄(z)/n = o(1). But exp(|z|)/n² is increasing at a faster rate than any polynomial of z; thus the third term of K(z) dominates and K(z) is bounded above. Lastly, for z = O(n^{1/6+ε}), ε > 0, the third term of K(z) dominates by the same argument. It follows that

sup_{z∈R} exp(P₃(z)/(6√n) + P₄(z)/(24n) − c·exp(|z|)/n²) < ∞.

Thus exp(K(z)) < ∞, and from (18) we have |g_n(x) − h_{n,θ_n}(x)| = o(n^{−1}). By lemma 1 it then follows that

|f_n(x) − h_{n,θ_n}(x)| ≤ |f_n(x) − g_n(x)| + |g_n(x) − h_{n,θ_n}(x)| = o(n^{−1}).

As this result is valid for all x, it holds uniformly in x.
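The role of the dampening term −c·exp(|x|)/n² can also be seen numerically: it drives the exponent to −∞ in the tails, so the dampened exponential density integrates to a finite constant even though the polynomial part of the exponent alone need not. A toy check, where the specific choices P₃(x) = x³, P₄(x) = x⁴, c = 1, and the integration grid are ours and purely illustrative:

```python
import math

def dampened_exponent(x, n, c=1.0):
    """K(x) = P3(x)/(6 sqrt(n)) + P4(x)/(24 n) - c*exp(|x|)/n**2, with the
    illustrative (hypothetical) polynomials P3(x) = x**3, P4(x) = x**4."""
    p3, p4 = x ** 3, x ** 4
    return (p3 / (6 * math.sqrt(n)) + p4 / (24 * n)
            - c * math.exp(abs(x)) / n ** 2)

def normalizing_integral(n, lo=-25.0, hi=25.0, steps=60_000):
    """Midpoint Riemann-sum approximation of int phi(x) exp(K(x)) dx."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += phi * math.exp(dampened_exponent(x, n)) * h
    return total

# Finite and close to 1 for moderate n: even though the quartic term
# pushes the exponent up, exp(|x|)/n**2 eventually dominates any
# polynomial, so the tails are integrable.
print(normalizing_integral(20))
```

Without the dampening term, φ(x)exp(x⁴/(24n) + ...) would diverge; with it, the same Riemann sum stabilizes, which is the computational content of the boundedness argument above.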
Proof of theorem 2. Using an argument similar to the proof of theorem 1, we have, for an arbitrary non-negative integer r,

|μ_r(g_n) − μ_r(h_{n,θ_n})| ≤ ∫ |x|^r |g_n(x) − h_{n,θ_n}(x)| dx = ∫ |x|^r φ(x)|Q(x) + (K³(x)/3!)exp(K(z))| dx = o(n^{−1}),   (19)

where Q(x) is given by (17). The last equality uses the fact that ∫ |x|^r φ(x) dx < ∞ for all non-negative integers r, together with the boundedness of exp(K(z)) established in the proof of theorem 1. Because μ_r(g_n) = μ_r(f_{n,θ_n}) for r = 0, 1,..., 4, it follows that μ_r(f_{n,θ_n}) − μ_r(h_{n,θ_n}) = o(n^{−1}). We then have ‖μ(f_{n,θ_n}) − μ(h_{n,θ_n})‖ = o(n^{−1}), where ‖v‖ = (Σ_{j=1}^d v_j²)^{1/2} is the Euclidean norm of a d-dimensional vector v.

Let f(x) = φ(x)exp(Σ_{j=0}^4 γ_j x^j − c·exp(|x|)/n²) for arbitrary {γ_j}_{j=0}^4. Suppose that the moments μ(f) = {μ_j(f)}_{j=1}^4 exist. There is a unique correspondence between μ(f) and {γ_j}_{j=0}^4. Denote the coefficients {γ_j}_{j=0}^4 of f by γ(μ(f)). According to lemma 5 of Barron & Sheu (1991), ‖γ(μ(f)) − γ(μ(g))‖ and ‖μ(f) − μ(g)‖ are of the same asymptotic order for μ(f) and μ(g) sufficiently close. Denote the coefficients of the ME and EE approximations by γ_{n,θ_n} and γ_n, respectively. (Recall that the coefficients of the EE approximation are invariant to the dampening function θ_n; thus no subscript θ_n is required for its coefficients.) As ‖μ(f_{n,θ_n}) − μ(h_{n,θ_n})‖ = o(n^{−1}), we have ‖γ_{n,θ_n} − γ_n‖ = o(n^{−1}).

Define K̃(x) = Σ_{j=0}^4 γ_{nj,θ_n} x^j − c·exp(|x|)/n², and define K̄(x) similarly with γ_{nj} in place of γ_{nj,θ_n}. Without loss of generality, assume that K̃(x) > K̄(x). By the mean value theorem, there exists a K(z) between K̄(x) and K̃(x) such that

|f_{n,θ_n}(x) − h_{n,θ_n}(x)| = φ(x)|Σ_{j=0}^4 (γ_{nj,θ_n} − γ_{nj})x^j| exp(K(z)) ≤ exp(K(z)) φ(x) Σ_{j=0}^4 |x|^j |γ_{nj,θ_n} − γ_{nj}|.   (20)

From the proof of theorem 1, we have exp(K(z)) < ∞ and sup_{x∈R} φ(x)|x|^j < ∞ for all non-negative integers j. It follows that |f_{n,θ_n}(x) − h_{n,θ_n}(x)| = o(n^{−1}). Lastly, we have

|f_n(x) − f_{n,θ_n}(x)| ≤ |f_n(x) − h_{n,θ_n}(x)| + |h_{n,θ_n}(x) − f_{n,θ_n}(x)| = o(n^{−1}).

As the result is valid for all x, it holds uniformly in x.
Proof of theorem 3. Note that

|F_n(x) − H_{n,θ_n}(x)| = |∫_{−∞}^x {f_n(t) − h_{n,θ_n}(t)} dt| ≤ ∫ |f_n(t) − h_{n,θ_n}(t)| dt ≤ ∫ |f_n(t) − g_n(t)| dt + ∫ |g_n(t) − h_{n,θ_n}(t)| dt.
According to Hall (1992), the first integral is of order o(n^{−1}). The second integral is of order o(n^{−1}) according to (19) in the previous proof. It follows that |F_n(x) − H_{n,θ_n}(x)| = o(n^{−1}) uniformly in x. It then follows that

|F_n(x) − F_{n,θ_n}(x)| ≤ |F_n(x) − H_{n,θ_n}(x)| + |H_{n,θ_n}(x) − F_{n,θ_n}(x)| ≤ o(n^{−1}) + ∫ |h_{n,θ_n}(t) − f_{n,θ_n}(t)| dt.

Integrating the right-hand side of inequality (20) yields ∫ |h_{n,θ_n}(t) − f_{n,θ_n}(t)| dt = o(n^{−1}). Note that the presence of exp(K(z)) in (20) ensures that the n^{−1} order is preserved in the integration. Therefore |F_n(x) − F_{n,θ_n}(x)| = o(n^{−1}) uniformly in x.
More information4 Sums of Independent Random Variables
4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables
More informationThe Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series
The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series Willa W. Chen Rohit S. Deo July 6, 009 Abstract. The restricted likelihood ratio test, RLRT, for the autoregressive coefficient
More informationNonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix
Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationEstimation of Dynamic Regression Models
University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.
More informationTHE information capacity is one of the most important
256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,
More informationA Bootstrap Test for Conditional Symmetry
ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School
More informationBickel Rosenblatt test
University of Latvia 28.05.2011. A classical Let X 1,..., X n be i.i.d. random variables with a continuous probability density function f. Consider a simple hypothesis H 0 : f = f 0 with a significance
More information1 Lyapunov theory of stability
M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability
More informationZeros of lacunary random polynomials
Zeros of lacunary random polynomials Igor E. Pritsker Dedicated to Norm Levenberg on his 60th birthday Abstract We study the asymptotic distribution of zeros for the lacunary random polynomials. It is
More informationAsymptotic inference for a nonstationary double ar(1) model
Asymptotic inference for a nonstationary double ar() model By SHIQING LING and DONG LI Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong maling@ust.hk malidong@ust.hk
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions
More informationPrediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate
www.scichina.com info.scichina.com www.springerlin.com Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate WEI Chen & CHEN ZongJi School of Automation
More informationOptimal global rates of convergence for interpolation problems with random design
Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289
More informationGeometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling
Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling Paul Marriott 1, Radka Sabolova 2, Germain Van Bever 2, and Frank Critchley 2 1 University of Waterloo, Waterloo, Ontario,
More informationTesting Goodness-of-Fit for Exponential Distribution Based on Cumulative Residual Entropy
This article was downloaded by: [Ferdowsi University] On: 16 April 212, At: 4:53 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer
More informationTHEOREM OF OSELEDETS. We recall some basic facts and terminology relative to linear cocycles and the multiplicative ergodic theorem of Oseledets [1].
THEOREM OF OSELEDETS We recall some basic facts and terminology relative to linear cocycles and the multiplicative ergodic theorem of Oseledets []. 0.. Cocycles over maps. Let µ be a probability measure
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationOn large deviations of sums of independent random variables
On large deviations of sums of independent random variables Zhishui Hu 12, Valentin V. Petrov 23 and John Robinson 2 1 Department of Statistics and Finance, University of Science and Technology of China,
More information