arxiv:math/ v1 [math.st] 29 Mar 2005

The Annals of Statistics
2004, Vol. 32, No. 5
© Institute of Mathematical Statistics, 2004
arXiv:math/0503674v1 [math.ST] 29 Mar 2005

EQUIVALENCE THEORY FOR DENSITY ESTIMATION, POISSON PROCESSES AND GAUSSIAN WHITE NOISE WITH DRIFT

By Lawrence D. Brown¹, Andrew V. Carter, Mark G. Low² and Cun-Hui Zhang³

University of Pennsylvania, University of California, Santa Barbara, University of Pennsylvania and Rutgers University

This paper establishes the global asymptotic equivalence between a Poisson process with variable intensity and white noise with drift under sharp smoothness conditions on the unknown function. This equivalence is also extended to density estimation models by Poissonization. The asymptotic equivalences are established by constructing explicit equivalence mappings. The impact of such asymptotic equivalence results is that an investigation in one of these nonparametric models automatically yields asymptotically analogous results in the other models.

1. Introduction. The purpose of this paper is to give an explicit construction of global asymptotic equivalence in the sense of Le Cam (1964) between a Poisson process with variable intensity and white noise with drift. The construction is extended to density estimation models. It yields asymptotic solutions to both density estimation and Poisson process problems based on asymptotic solutions to white noise with drift problems, and vice versa.

Density estimation model. A random vector $V^*_n$ of length $n$ is observed such that $V^*_n \equiv (V_1,\dots,V_n)$ is a sequence of i.i.d. variables with a common density $f \in \mathcal{F}$.

Received April 2002; revised February.
¹Supported by NSF Grant DMS.
²Supported by NSF Grant DMS.
³Supported by NSF Grants DMS and DMS.
AMS 2000 subject classifications. Primary 62B15; secondary 62G07, 62G20.
Key words and phrases. Asymptotic equivalence, decision theory, local limit theorem, quantile transform, white noise model.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2004, Vol. 32, No. 5. This reprint differs from the original in pagination and typographic detail.

Poisson process. A random vector of random length $\{N, X^*_N\}$ is observed such that $N \equiv N_n$ is a Poisson variable with $EN = n$ and that, given $N = m$, $X^*_N = X^*_m \equiv (X_1,\dots,X_m)$ is a sequence of i.i.d. variables with a common density $f \in \mathcal{F}$. The resulting observations are then distributed as a Poisson process with intensity function $nf$.

White noise. A Gaussian process $Z^*_n \equiv \{Z^*_n(t),\ 0 \le t \le 1\}$ is observed such that

$$Z^*_n(t) \equiv \int_0^t \sqrt{f(x)}\,dx + \frac{B^*(t)}{2\sqrt{n}}, \qquad 0 \le t \le 1, \tag{1.1}$$

with a standard Brownian motion $B^*(t)$ and an unknown probability density function $f \in \mathcal{F}$ in $[0,1]$.

Asymptotic equivalence. For any two experiments $\xi_1$ and $\xi_2$ with a common parameter space $\Theta$, $\Delta(\xi_1,\xi_2;\Theta)$ denotes Le Cam's distance [cf., e.g., Le Cam (1986) or Le Cam and Yang (1990)], defined as

$$\Delta(\xi_1,\xi_2;\Theta) \equiv \sup_{L}\ \max_{j=1,2}\ \sup_{\delta^{(j)}}\ \inf_{\delta^{(k)}}\ \sup_{\theta\in\Theta} \bigl| E^{(j)}_\theta L(\theta,\delta^{(j)}) - E^{(k)}_\theta L(\theta,\delta^{(k)}) \bigr|,$$

where (a) the first supremum is taken over all decision problems with loss function $\|L\|_\infty \le 1$, (b) given the decision problem and $j = 1,2$, $k \equiv 3 - j$ ($k = 2$ for $j = 1$ and $k = 1$ for $j = 2$), the "maximin" value of the maximum difference in risks over $\Theta$ is computed over all (randomized) statistical procedures $\delta^{(l)}$ for $\xi_l$ and (c) the expectations $E^{(l)}_\theta$ are evaluated in experiments $\xi_l$ with parameter $\theta$, $l = j,k$.

The statistical interpretation of the Le Cam distance is as follows: if $\Delta(\xi_1,\xi_2;\Theta) < \varepsilon$, then for any decision problem with $\|L\|_\infty \le 1$ and any statistical procedure $\delta^{(j)}$ with the experiment $\xi_j$, $j = 1,2$, there exists a (randomized) procedure $\delta^{(k)}$ with $\xi_k$, $k = 3 - j$, such that the risk of $\delta^{(k)}$ evaluated in $\xi_k$ nearly matches (within $\varepsilon$) that of $\delta^{(j)}$ evaluated in $\xi_j$.

Two sequences of experiments $\{\xi_{1,n},\ n \ge 1\}$ and $\{\xi_{2,n},\ n \ge 1\}$, with a common parameter space $\mathcal{F}$, are asymptotically equivalent if $\Delta(\xi_{1,n},\xi_{2,n};\mathcal{F}) \to 0$ as $n \to \infty$. The interpretation is that the risks of corresponding procedures converge. A key result of Le Cam (1964) is that this equivalence of experiments can be characterized using random transformations between the probability spaces.
A random transformation $T(X,U)$, which maps observations $X$ into the space of observations $Y$ (with possible dependence on an independent, uninformative random component $U$), also maps distributions in $\xi_1$ to approximations of the distributions in $\xi_2$ via $P^{(1)}_\theta T \approx P^{(2)}_\theta$. For the mapping between the Poisson and Gaussian processes we shall restrict ourselves to transformations $T$ with deterministic inverses, $T^{-1}(T(X,U)) = X$. The experiments are asymptotically equivalent if the total-variation distance between $P^{(2)}_\theta$ and the distribution of $T$ under $P^{(1)}_\theta$ converges to 0 uniformly in $\theta$. As explained in Brown and Low (1996) and Brown, Cai, Low and Zhang (2002), knowing an appropriate $T$ allows explicit construction of estimation procedures in $\xi_1$ by applying statistical procedures from $\xi_2$ to $T(X,U)$.

In general, asymptotic equivalence also implies a transformation from the $P^{(2)}_\theta$ to the $P^{(1)}_\theta$ and the corresponding total-variation distance bound. However, in the case of the equivalence between the Poisson process and white noise with drift, by requiring that the transformation be invertible, we have saved ourselves a step. The transformation in the other direction is $T^{-1}$, and

$$\|P^{(1)}_\theta T - P^{(2)}_\theta\| \ge \|P^{(1)}_\theta T T^{-1} - P^{(2)}_\theta T^{-1}\| = \|P^{(1)}_\theta - P^{(2)}_\theta T^{-1}\|.$$

Therefore, it is sufficient if $\sup_\theta \|P^{(1)}_\theta T - P^{(2)}_\theta\| \to 0$.

The equivalence mappings $T_n$ constructed in this paper from the sample space of the Poisson process to the sample space of the white noise are invertible randomized mappings such that

$$\sup_{f\in\mathcal{F}} H_f\bigl(T_n(N,X^*_N),\, Z^*_n\bigr) \to 0 \tag{1.2}$$

under certain conditions on the family $\mathcal{F}$. Here $H_f(Z_1,Z_2)$ denotes the Hellinger distance of stochastic processes or random vectors $Z_1$ and $Z_2$ living in the same sample space, when the true unknown density is $f$. Since $T_n$ are invertible randomized mappings, $T_n(N,X^*_N)$ are sufficient statistics for the Poisson processes and their inverses $T_n^{-1}$ are necessarily many-to-one deterministic mappings.

Similar considerations apply for the mapping of the density estimation problem to the white noise with drift problem, although in that case there are two mappings, one from the density estimation to the white noise with drift model and another from the white noise with drift model back to the density estimation model. These mappings are given in Section 2.
There have recently been several papers on the global asymptotic equivalence of nonparametric experiments. Brown and Low (1996) established global asymptotic equivalence of the white noise problem with unknown drift $f$ to a nonparametric regression problem with deterministic design and unknown regression $f$ when $f$ belongs to a Lipschitz class with smoothness index $\alpha > 1/2$. It has also been demonstrated that such nonparametric problems are typically asymptotically nonequivalent when the unknown $f$ belongs to larger classes, for example, with smoothness index $\alpha \le 1/2$. Brown and Low (1996) showed the asymptotic nonequivalence between the white noise problem and nonparametric regression with deterministic design for $\alpha \le 1/2$, and Efromovich and Samarov (1996) showed that the asymptotic equivalence may fail when $\alpha < 1/4$. Brown and Zhang (1998) showed the asymptotic nonequivalence for $\alpha \le 1/2$ between any pair of the following four experiments: white noise, density problem, nonparametric regression with random design, and nonparametric regression with deterministic design. In Brown, Cai, Low and Zhang (2002) the asymptotic equivalence for nonparametric regression with random design was shown under Besov constraints which include Lipschitz classes with any smoothness index $\alpha > 1/2$. Grama and Nussbaum (1998) solved the fixed-design nonparametric regression problem for nonnormal errors. Milstein and Nussbaum (1998) showed that some diffusion problems can be approximated by discrete versions that are nonparametric autoregression models, and Golubev and Nussbaum (1998) established a discrete Gaussian approximation to the problem of estimating the spectral density of a stationary process.

Most closely related to this paper is the work in Nussbaum (1996), where global asymptotic equivalence of the white noise problem to the nonparametric density problem with unknown density $g = f^2/4$ is shown. In that paper the global asymptotic equivalence was established under the following smoothness assumption: $f$ belongs to the Lipschitz classes with smoothness index $\alpha > 1/2$.

The parameter spaces. The class of functions $\mathcal{F}$ will be assumed throughout to be densities with respect to Lebesgue measure on $[0,1]$ that are uniformly bounded away from 0. The smoothness conditions on $\mathcal{F}$ can be described in terms of Haar basis functions of the densities. Let

$$\theta_{k,l} \equiv \theta_{k,l}(f) \equiv \int f\phi_{k,l}, \qquad l = 0,\dots,2^k-1,\quad k = 0,1,\dots, \tag{1.3}$$

be the Haar coefficients of $f$, where

$$\phi_{k,l} \equiv 2^{k/2}\bigl(1_{I_{k+1,2l}} - 1_{I_{k+1,2l+1}}\bigr) \tag{1.4}$$

are the Haar basis functions with $I_{k,l} \equiv [l/2^k, (l+1)/2^k)$.
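As a minimal sketch of definitions (1.3) and (1.4), the Haar coefficient $\theta_{k,l}$ can be computed from the cumulative distribution function of $f$, since $\phi_{k,l}$ is constant on each half of $I_{k,l}$. The function names and the example density $f(x) = 0.5 + x$ are illustrative assumptions, not part of the paper.

```python
def haar_coefficient(cdf, k, l):
    """theta_{k,l} = integral of f * phi_{k,l}, computed from the CDF of f.

    phi_{k,l} equals +2^{k/2} on the left half of I_{k,l} = [l/2^k, (l+1)/2^k)
    and -2^{k/2} on the right half, as in (1.4).
    """
    left = l / 2 ** k
    mid = (2 * l + 1) / 2 ** (k + 1)
    right = (l + 1) / 2 ** k
    left_mass = cdf(mid) - cdf(left)     # integral of f over I_{k+1,2l}
    right_mass = cdf(right) - cdf(mid)   # integral of f over I_{k+1,2l+1}
    return 2 ** (k / 2) * (left_mass - right_mass)


def linear_cdf(x):
    # CDF of the hypothetical example density f(x) = 0.5 + x on [0, 1]
    return 0.5 * x + 0.5 * x * x


theta_00 = haar_coefficient(linear_cdf, 0, 0)
theta_21 = haar_coefficient(linear_cdf, 2, 1)
```

For a density with constant slope 1, the two half-interval masses differ by the squared half-width, so under this assumption $\theta_{k,l} = -2^{-3k/2-2}$ for every $l$; the coefficients decay geometrically in the resolution level $k$.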
The convergence of the Hellinger distance in (1.2) is established via an inequality in Theorem 3 in terms of the tails of the Besov norms $\|f\|_{1/2,2,2}$ and $\|f\|_{1/2,4,4}$ of the Haar coefficients $\theta_{k,l} \equiv \theta_{k,l}(f)$ in (1.3). The Besov norms $\|f\|_{\alpha,p,q}$ for the Haar coefficients, with smoothness index $\alpha$ and shape parameters $p$ and $q$, are defined by

$$\|f\|_{\alpha,p,q} \equiv \left[\,\left|\int_0^1 f\right|^q + \sum_{k=0}^{\infty}\left\{2^{k(\alpha+1/2-1/p)}\left(\sum_{l=0}^{2^k-1}|\theta_{k,l}(f)|^p\right)^{1/p}\right\}^q\,\right]^{1/q}. \tag{1.5}$$
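The level-by-level structure of (1.5) can be sketched numerically. The following is a minimal illustration, assuming the Haar coefficients are supplied as a dictionary mapping each level $k$ to its list of $2^k$ coefficients and truncated at a finite maximal level; the function name `besov_tail_power` and the synthetic coefficients are hypothetical. It sums the level terms of (1.5) from a starting level `k0` onward, which is the quantity entering the bounds with $(\alpha,p,q) = (1/2,2,2)$ and $(1/2,4,4)$.

```python
def besov_tail_power(theta, k0, alpha, p, q):
    """Sum over levels k >= k0 of {2^{k(alpha+1/2-1/p)} * (sum_l |theta_{k,l}|^p)^{1/p}}^q.

    `theta` maps a level k to the list of its 2^k Haar coefficients; only the
    levels present in the dictionary are summed (a finite truncation).
    """
    total = 0.0
    for k, coeffs in theta.items():
        if k < k0:
            continue
        lp_norm = sum(abs(t) ** p for t in coeffs) ** (1.0 / p)
        total += (2.0 ** (k * (alpha + 0.5 - 1.0 / p)) * lp_norm) ** q
    return total


# Synthetic coefficients |theta_{k,l}| = 2^{-3k/2-2}, the magnitudes produced by
# a density with constant slope 1 (an illustrative assumption), levels 0..5.
theta = {k: [2.0 ** (-1.5 * k - 2)] * (2 ** k) for k in range(6)}

tail_22 = besov_tail_power(theta, 2, 0.5, 2, 2)  # levels 2..5 contribute 2^{-k-4}
tail_44 = besov_tail_power(theta, 2, 0.5, 4, 4)  # levels 2..5 contribute 2^{-2k-8}
```

With these coefficients each level contributes $2^{-k-4}$ to the $(1/2,2,2)$ sum and $2^{-2k-8}$ to the $(1/2,4,4)$ sum, so both tails shrink as the starting level grows.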

Let $\bar f_k$ be the piecewise average of $f$ at resolution level $k$, that is, the piecewise constant function defined by

$$\bar f_k \equiv \bar f_k(t) \equiv \sum_{l=0}^{2^k-1} 1\{t \in I_{k,l}\}\, 2^k \int_{I_{k,l}} f. \tag{1.6}$$

Since $\|\bar f_k - \bar f_{k+1}\|_p^p = \|\sum_l \theta_{k,l}\phi_{k,l}\|_p^p = \sum_l |\theta_{k,l}|^p\, 2^{k(p/2-1)}$, (1.5) can be written as

$$\|f\|_{\alpha,p,q} \equiv \left\{\left|\int_0^1 f\right|^q + \sum_{k=0}^{\infty}\bigl(2^{k\alpha}\|\bar f_k - \bar f_{k+1}\|_p\bigr)^q\right\}^{1/q},$$

and its tail at resolution level $k_0$ is $\|f - \bar f_{k_0}\|_{\alpha,p,q}$, $k_0 \ge 0$, with

$$\|f - \bar f_{k_0}\|^q_{\alpha,p,q} = \sum_{k=k_0}^{\infty}\left\{2^{k(\alpha+1/2-1/p)}\left(\sum_{l=0}^{2^k-1}|\theta_{k,l}|^p\right)^{1/p}\right\}^q. \tag{1.7}$$

Let $B(\alpha,p,q)$ be the Besov space

$$B(\alpha,p,q) = \{f : \|f\|_{\alpha,p,q} < \infty\}.$$

The following two theorems on the equivalence of white noise with drift, density estimation and Poisson estimation models are corollaries of our main result, Theorem 3, which bounds the squared Hellinger distance between particular invertible randomized mappings of the Poisson process and white noise with drift models. The randomized mappings are given in Section 2. Proofs of these theorems are given in the Appendix.

Theorem 1. Let $Z^*_n$, $\{N, X^*_N\}$ and $V^*_n$ be the Gaussian process, Poisson process and density estimation experiments, respectively. Suppose that $\mathcal{H}$ is compact in both $B(1/2,2,2)$ and $B(1/2,4,4)$ and that $\mathcal{H} \subseteq \{f : \inf_{0<x<1} f(x) \ge \varepsilon_0\}$ for some $\varepsilon_0 > 0$. Then

$$\lim_{n\to\infty} \Delta(Z^*_n, \{N, X^*_N\}; \mathcal{H}) = 0 \tag{1.8}$$

and

$$\lim_{n\to\infty} \Delta(Z^*_n, V^*_n; \mathcal{H}) = 0. \tag{1.9}$$

Our construction also shows that asymptotic equivalence holds for a class $\mathcal{F}$ if $\mathcal{F}$ is bounded in the Lipschitz norm with smoothness index $\beta$ and compact in the Sobolev norm with smoothness index $\alpha \ge \beta$ such that $\alpha + \beta \ge 1$, $\alpha \ge 3/4$ or $\beta > 1/2$.

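For $0 < \beta \le 1$ the Lipschitz norm $\|f\|^{(L)}_\beta$ and Sobolev norm $\|f\|^{(S)}_\alpha$ are defined by

$$\|f\|^{(L)}_\beta \equiv \sup_{0 \le x < y \le 1} \frac{|f(x) - f(y)|}{|x - y|^\beta}, \qquad \|f\|^{(S)}_\alpha \equiv \sum_{n=-\infty}^{\infty} |n|^{2\alpha}|c_n(f)|^2,$$

where $c_n(f) \equiv \int_0^1 f(x)e^{-in2\pi x}\,dx$ are the Fourier coefficients of $f$.

Theorem 2. Let $Z^*_n$, $\{N, X^*_N\}$ and $V^*_n$ be the Gaussian process, Poisson process and density estimation experiments, respectively, and let $\mathcal{F}$ be bounded in the Lipschitz norm with smoothness index $\beta$ and compact in the Sobolev norm with smoothness index $\alpha \ge \beta$. Suppose $\mathcal{F} \subseteq \{f : \inf_{0<x<1} f(x) \ge \varepsilon_0\}$ for some $\varepsilon_0 > 0$. Then, if $\alpha + \beta \ge 1$, $\alpha \ge 3/4$ or $\beta > 1/2$,

$$\lim_{n\to\infty} \Delta(Z^*_n, \{N, X^*_N\}; \mathcal{F}) = 0 \quad\text{and}\quad \lim_{n\to\infty} \Delta(Z^*_n, V^*_n; \mathcal{F}) = 0.$$

2. The equivalence mappings. This section describes in detail the mappings which provide the asymptotic equivalence claimed in this paper. The fact that these mappings yield asymptotic equivalence is a consequence of our major result, Theorem 3. The construction is broken into several stages.

From observations of the white noise (1.1), define random vectors

$$\bar Z^*_k \equiv \{\bar Z^*_{k,l},\ 0 \le l < 2^k\}, \qquad \bar Z^*_{k,l} \equiv 2^k\left\{Z^*_n\!\left(\frac{l+1}{2^k}\right) - Z^*_n\!\left(\frac{l}{2^k}\right)\right\}, \tag{2.1}$$

$$W^*_k \equiv \{W^*_{k,l},\ 0 \le l < 2^k\}, \qquad W^*_{k,2l} \equiv -W^*_{k,2l+1} \equiv (\bar Z^*_{k,2l} - \bar Z^*_{k,2l+1})/2. \tag{2.2}$$

Let $k_0 \equiv k_{0,n}$ be suitable integers with $\lim_{n\to\infty} k_{0,n} = \infty$. Following Brown, Cai, Low and Zhang (2002), we construct equivalence mappings by finding the counterparts of $\bar Z^*_{k_0}$ and $W^*_k$, $k > k_0$, with the Poisson process $(N, X^*_N)$, to strongly approximate the Gaussian variables. It can be easily verified from (1.1) that $\{\bar Z^*_{k_0,l},\ 0 \le l < 2^{k_0},\ W^*_{k,2l},\ 0 \le l < 2^{k-1},\ k > k_0\}$ are uncorrelated normal random variables with

$$E\bar Z^*_{k,l} = h_{k,l} \equiv 2^k\int_{I_{k,l}} h, \quad h \equiv \sqrt{f}, \qquad \mathrm{Var}(\bar Z^*_{k,l}) = \sigma_k^2 \equiv \frac{2^k}{4n}, \tag{2.3}$$

for $l = 0,\dots,2^k-1$, and for $l = 0,\dots,2^{k-1}-1$,

$$EW^*_{k,2l} = \tfrac12(h_{k,2l} - h_{k,2l+1}) = 2^{(k-1)/2}\int h\phi_{k-1,l}, \qquad \mathrm{Var}(W^*_{k,2l}) = \sigma^2_{k-1}. \tag{2.4}$$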
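The moment identities in (2.3) and (2.4) can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the helper names, the midpoint-rule quadrature, the example density $f(x) = 0.5 + x$ and the values of $k$, $l$, $n$ are all hypothetical choices, not part of the paper. It compares the two expressions for the mean of the half-difference of adjacent block averages and verifies that its variance equals $\sigma^2_{k-1}$ when $\sigma^2_k = 2^k/(4n)$.

```python
import math


def midpoint_integral(g, a, b, grid=4000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / grid
    return sum(g(a + (i + 0.5) * step) for i in range(grid)) * step


def block_mean(h, k, l):
    """h_{k,l} = 2^k * integral of h over I_{k,l} = [l/2^k, (l+1)/2^k)."""
    return 2 ** k * midpoint_integral(h, l / 2 ** k, (l + 1) / 2 ** k)


def haar_inner(h, k, l):
    """Integral of h * phi_{k,l} for the Haar function phi_{k,l}."""
    left, mid, right = l / 2 ** k, (2 * l + 1) / 2 ** (k + 1), (l + 1) / 2 ** k
    return 2 ** (k / 2) * (midpoint_integral(h, left, mid)
                           - midpoint_integral(h, mid, right))


def sigma_sq(k, n):
    """Variance of a level-k block average of the white noise: 2^k / (4n)."""
    return 2 ** k / (4 * n)


def h(x):
    # Hypothetical example: h = sqrt(f) with f(x) = 0.5 + x on [0, 1]
    return math.sqrt(0.5 + x)


k, l, n = 3, 2, 1000
mean_w_from_blocks = 0.5 * (block_mean(h, k, 2 * l) - block_mean(h, k, 2 * l + 1))
mean_w_from_haar = 2 ** ((k - 1) / 2) * haar_inner(h, k - 1, l)
# Half-difference of two uncorrelated variables, each with variance sigma_k^2
var_w = (sigma_sq(k, n) + sigma_sq(k, n)) / 4
```

Both expressions for the mean agree up to rounding, and the variance of the half-difference collapses to the block variance one level coarser, which is what lets the level-$k$ differences be treated as a multiresolution refinement of the level-$(k-1)$ averages.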

Let $\tilde U = \{\tilde U_{k,l},\ k \ge k_0,\ l \ge 0\}$ be a sequence of i.i.d. uniform variables in $[-1/2, 1/2)$ independent of $(N, X^*_N)$. For $k = 0,1,\dots$ and $l = 0,\dots,2^k-1$ define

$$N_k \equiv \{N_{k,l},\ 0 \le l < 2^k\}, \qquad N_{k,l} \equiv \#\{X_i : X_i \in I_{k,l}\}. \tag{2.5}$$

We shall approximate $\bar Z^*_k$ in (2.1) in distribution by

$$\bar Z_k \equiv \{\bar Z_{k,l},\ 0 \le l < 2^k\}, \qquad \bar Z_{k,l} \equiv 2\sigma_k\,\mathrm{sgn}(N_{k,l} + \tilde U_{k,l})\sqrt{|N_{k,l} + \tilde U_{k,l}|}, \tag{2.6}$$

at the initial resolution level $k = k_0$. Since $N_{k,l}$ are Poisson variables with

$$\lambda_{k,l} \equiv EN_{k,l} = \frac{n}{2^k}\bar f_{k,l} = \frac{\bar f_{k,l}}{4\sigma_k^2}, \qquad \bar f_{k,l} \equiv 2^k\int_{I_{k,l}} f, \tag{2.7}$$

by Taylor expansion and central limit theory

$$\bar Z_{k,l} \approx 2\sigma_k\left(\sqrt{\lambda_{k,l}} + \frac{N_{k,l} - \lambda_{k,l}}{2\lambda_{k,l}^{1/2}}\right) \approx N\bigl(\sqrt{\bar f_{k,l}},\, \sigma_k^2\bigr)$$

as $\lambda_{k,l} \to \infty$, compared with (2.3). Note that $\sqrt{\bar f_{k,l}} \approx h_{k,l}$ under suitable smoothness conditions on $f$, in view of (2.3) and (2.7). The Poisson variables $N_{k,l}$ can be fully recovered from $\bar Z_{k,l}$, while the randomization turns $N_{k,l}$ into continuous variables.

Approximation of $W^*_{k,l}$ for $k > k_0$ is more delicate, since the central limit theorem is not sufficiently accurate at high resolution levels. Let $F_m$ be the cumulative distribution function of the independent sum of a binomial variable $\tilde X_{m,1/2}$ with parameter $(m, 1/2)$ and a uniform variable $\tilde U$ in $[-1/2, 1/2)$,

$$F_m(x) \equiv P\{\tilde X_{m,1/2} + \tilde U \le x\}, \tag{2.8}$$

with $F_0$ being the uniform distribution in $[-1/2, 1/2)$. Let $\Phi$ be the $N(0,1)$ cumulative distribution. We shall approximate $W^*_k$ by using a quantile transformation of randomized versions of the Poisson random variables. More specifically, let

$$W_k \equiv \{W_{k,l},\ 0 \le l < 2^k\}, \qquad W_{k,2l} \equiv \sigma_{k-1}\Phi^{-1}\bigl(F_{N_{k-1,l}}(N_{k,2l} + \tilde U_{k,2l})\bigr), \tag{2.9}$$

with $W_{k,2l} \equiv -W_{k,2l+1}$, $l = 0,\dots,2^{k-1}-1$, and the $\sigma_k$ in (2.3). Given $N_{k-1,l} = m$,

$$N_{k,2l} \sim \mathrm{Bin}(m, p_{k,2l}), \qquad p_{k,2l} \equiv \frac{\int_{I_{k,2l}} f}{\int_{I_{k-1,l}} f} = \frac{\bar f_{k,2l}}{\bar f_{k,2l} + \bar f_{k,2l+1}}, \tag{2.10}$$

so that $W_{k,2l}$ is distributed exactly according to $N(0, \sigma^2_{k-1})$ for $p_{k,2l} = 1/2$, compared with (2.4). Thus, the distributions of $W_{k,2l}$ and $W^*_{k,2l}$ are close at high resolution levels as long as $f$ is sufficiently smooth, even for small $N_{k-1,l} = m$.

The equivalence mappings $T_n$, with randomization through $\tilde U$, are defined by

$$T_n : \{N, X^*_N, \tilde U\} \to W_{[k_0,\infty)} \to Z_n \equiv \{Z_n(t) : 0 \le t \le 1\},$$

where for $k > k_0$, $W_{[k_0,k)} \equiv \{\bar Z_{k_0}, W_j,\ k_0 < j < k\}$, and $\bar Z_{k_0}$ and $W_k$ are as in (2.6) and (2.9). The inverse of $T_n$ is a deterministic many-to-one mapping defined by

$$T_n^{-1} : Z^*_n \to W^*_{[k_0,\infty)} \to (N, X^*_N),$$

where for $k > k_0$, $W^*_{[k_0,k)} \equiv \{\bar Z^*_{k_0}, W^*_j,\ k_0 < j < k\}$.

Remark 1. One need only carry out the above construction to $k = k_1 : 2^{k_1} > \varepsilon n$, since we shall assume that $f \in B(1/2,2,2)$ and then the observations $W^*_{[k_0,k_1)} \equiv \{\bar Z^*_{k_0}, W^*_j,\ k_0 < j < k_1\}$ and $W_{[k_0,k_1)} \equiv \{\bar Z_{k_0}, W_j,\ k_0 < j < k_1\}$ are asymptotically sufficient for the Gaussian process and Poisson process experiments. See Brown and Low (1996) for a detailed argument in the context of nonparametric regression.

Mappings for the density estimation model. The constructive asymptotic equivalence between density estimation experiments and Gaussian experiments is established by first randomizing the density estimation experiment to an approximation of the Poisson process and then applying the randomized mapping as given above.

Set $\gamma_k = \sup_{f\in\mathcal{H}} \|f - \bar f_k\|^2_{1/2,2,2}$ and note that since $\mathcal{H}$ is compact in $B(1/2,2,2)$, $\gamma_k \to 0$. Now let $k_0$ be the smallest integer such that $4^{k_0}/n \ge \gamma_{k_0}$ and divide the unit interval into subintervals of equal length with length equal to $2^{-k_0}$. Let $\tilde f_n$ be the corresponding histogram estimate based on $V^*_n$. Now note that since functions $f \in \mathcal{H}$ are bounded below by $\varepsilon_0 > 0$ it follows that

$$\int_0^1 \Bigl(\sqrt{\tilde f_n} - \sqrt{f}\Bigr)^2 \le \int_0^1 \frac{\bigl(\sqrt{\tilde f_n} - \sqrt{f}\bigr)^2\bigl(\sqrt{\tilde f_n} + \sqrt{f}\bigr)^2}{\varepsilon_0} = \frac{1}{\varepsilon_0}\int_0^1 (\tilde f_n - f)^2. \tag{2.11}$$

Now

$$E\int_0^1 (\tilde f_n - f)^2 = E\int_0^1 (\tilde f_n - \bar f_{k_0})^2 + \int_0^1 (f - \bar f_{k_0})^2, \tag{2.12}$$

and simple calculations show that the histogram estimate $\tilde f_n$ satisfies $E\tilde f_n(x) = \bar f_{k_0}(x)$ and $\mathrm{Var}\,\tilde f_n(x) \le \bar f_{k_0}(x)\,2^{k_0}/n$. Hence,

$$n^{1/2} E\int_0^1 (\tilde f_n - \bar f_{k_0})^2 \le n^{1/2}\,\frac{2^{k_0}}{n} \le 2\gamma^{1/2}_{k_0-1}. \tag{2.13}$$

Now $n^{1/2} \le 2^{k_0}\gamma^{-1/2}_{k_0}$ and hence, from (1.7),

$$n^{1/2}\int_0^1 (f - \bar f_{k_0})^2 \le \gamma^{-1/2}_{k_0}\|f - \bar f_{k_0}\|^2_{1/2,2,2} \le \gamma^{1/2}_{k_0}. \tag{2.14}$$

It thus follows from (2.11) to (2.14) that

$$n^{1/2}\sup_{f\in\mathcal{H}} E\int_0^1 \Bigl(\sqrt{\tilde f_n} - \sqrt{f}\Bigr)^2 \to 0. \tag{2.15}$$

Hence the density estimate is squared Hellinger consistent at a rate faster than the square root of $n$. Now generate $\tilde N$, a Poisson random variable with expectation $n$ and independent of $V^*_n$. If $\tilde N > n$, generate $\tilde N - n$ conditionally independent observations $V_{n+1},\dots,V_{\tilde N}$ with common density $\tilde f_n$. Finally let $(\tilde N, X^*_{\tilde N}) = (\tilde N, V_1, V_2, \dots, V_{\tilde N})$ and write $R^1_n$ for this randomization from $V^*_n$ to $(\tilde N, X^*_{\tilde N})$,

$$R^1_n : V^*_n \to (\tilde N, X^*_{\tilde N}).$$

A map from the Poisson number of independent observations back to the fixed number of observations is obtained similarly. This time let $\hat f_n$ be the histogram estimator based on $(N, X^*_N)$. If $N < n$, generate $n - N$ additional conditionally independent observations with common density $\hat f_n$. It is also easy to check that

$$n^{1/2}\sup_{f\in\mathcal{H}} E\int_0^1 \Bigl(\sqrt{\hat f_n} - \sqrt{f}\Bigr)^2 \to 0. \tag{2.16}$$

Now label these observations $V^*_n = (V_1,\dots,V_n)$ and write $R^2_n$ for this randomization from $(N, X^*_N)$ to $V^*_n$,

$$R^2_n : (N, X^*_N) \to V^*_n.$$

Remark 2. It should also be possible to map the density estimation problem directly into an approximation of the white noise with drift model. Dividing the interval into $2^{k_0}$ subintervals and conditioning on the number of observations falling in each subinterval, the conditional distribution within each subinterval is the same as for the Poisson process. Therefore, it is only necessary to have a version of Theorem 4 for a $2^{k_0}$-dimensional multinomial experiment. Carter (2002) provides a transformation from a $2^{k_0}$-dimensional multinomial to a multivariate normal as in Theorem 4 such that the total-variation distance between the distributions is $O(k_0 2^{k_0} n^{-1/2})$. The transformation is similar to ours in that it adds uniform noise and then uses the square root as a variance-stabilizing transformation. However, the covariance structure of the multinomial complicates the issue and necessitates using a multiresolution structure similar to the one applied here to the conditional experiments. The Carter (2002) result can be used in place of Theorem 4 to get a slightly weaker bound on the error in the approximation in Theorem 3 (because of the extra $k_0$ factor) when the total number of observations is fixed. This is enough to establish Theorem 2 if the inequalities bounding $\alpha$ and $\beta$ are changed to strictly greater than. It is also enough to establish Theorem 1 if $\mathcal{H}$ is a Besov space with $\alpha > 1/2$. Carter (2000) also showed that a somewhat more complicated transformation leads to a deficiency bound on the normal approximation to the multinomials without the added $k_0$ factor.

3. Main theorem. The theorems in Section 1 on the equivalence of white noise with drift experiments and Poisson process experiments are consequences of the following theorem, which uniformly bounds the Hellinger distance between the randomized mappings described in Section 2.

Theorem 3. Suppose $\inf_{0<x<1} f(x) \ge \varepsilon_0 > 0$. Let $W^*_{[k_0,k)} \equiv \{\bar Z^*_{k_0}, W^*_j,\ k_0 < j < k\}$ with the variables in (2.1) and (2.2), and $W_{[k_0,k)} \equiv \{\bar Z_{k_0}, W_j,\ k_0 < j < k\}$ with the variables in (2.6) and (2.9). Then there exist universal constants $C$, $D_1$ and $D_2$ such that for all $k_1 > k_0$,

$$H_f^2\bigl(W_{[k_0,k_1)}, W^*_{[k_0,k_1)}\bigr) \le \frac{C}{\varepsilon_0}\frac{4^{k_0}}{n} + \frac{D_1}{\varepsilon_0^2}\sum_{k=k_0}^{k_1} 2^k\sum_{l=0}^{2^k-1}\theta^2_{k,l} + \frac{D_2}{\varepsilon_0^3}\frac{n}{4^{k_0}}\sum_{k=k_0}^{k_1} 2^{3k}\sum_{l=0}^{2^k-1}\theta^4_{k,l}$$

$$\le \frac{C}{\varepsilon_0}\frac{4^{k_0}}{n} + \frac{D_1}{\varepsilon_0^2}\|f - \bar f_{k_0}\|^2_{1/2,2,2} + \frac{D_2}{\varepsilon_0^3}\frac{n}{4^{k_0}}\|f - \bar f_{k_0}\|^4_{1/2,4,4},$$

where $\theta_{k,l}$ are the Haar coefficients of $f$ as in (1.3), $\bar f_k$ is as in (1.6) and $\|\cdot\|_{1/2,p,p}$ are the Besov norms in (1.5).

Remark 3. Here the universal constant $C$ is the same as the one in Theorem 4, while $D_1 = 3D$ and $D_2 = D$ for the $D$ in Theorem 5.

The proof of Theorem 3 is based on the inequalities established in Sections 4 and 5 for the normal approximation of Poisson and binomial variables. Some additional technical lemmas are given in the Appendix.
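The two normal approximations just mentioned can be illustrated numerically. The following is a minimal sketch, assuming standardized versions of the transforms (the factor $\sigma_k$ is dropped) and hypothetical function names. The first function computes the exact mean and variance of the randomized root transform $2\,\mathrm{sgn}(X + U)\sqrt{|X + U|}$ of a Poisson variable $X$ with an independent uniform $U$ on $[-1/2, 1/2)$; the other two implement the quantile transform $\Phi^{-1}(F_m(\cdot))$ built from a symmetric binomial plus the same uniform. It uses only the Python standard library.

```python
import math
from statistics import NormalDist


def root_transform_moments(lam, jmax=None):
    """Exact mean and variance of Z = 2*sgn(X+U)*sqrt|X+U| for X ~ Poisson(lam)
    and independent U ~ Uniform[-1/2, 1/2), by integrating over each unit cell."""
    if jmax is None:
        jmax = int(lam + 40 * math.sqrt(lam) + 40)  # truncation far in the tail
    pmf = math.exp(-lam)   # P(X = 0)
    mean = 0.0             # the cell around 0 contributes 0 by symmetry
    second = pmf * 1.0     # integral of 4|x| over [-1/2, 1/2) equals 1
    for j in range(1, jmax + 1):
        pmf *= lam / j     # P(X = j) from P(X = j - 1)
        a, b = j - 0.5, j + 0.5
        mean += pmf * (4.0 / 3.0) * (b ** 1.5 - a ** 1.5)  # integral of 2*sqrt(x)
        second += pmf * 4.0 * j                            # integral of 4x
    return mean, second - mean ** 2


def binom_uniform_cdf(m, x):
    """F_m(x) = P(Bin(m, 1/2) + U <= x) for U ~ Uniform[-1/2, 1/2)."""
    if x <= -0.5:
        return 0.0
    if x >= m + 0.5:
        return 1.0
    j = math.floor(x + 0.5)  # x lies in the cell [j - 1/2, j + 1/2)
    below = sum(math.comb(m, i) for i in range(j)) / 2 ** m
    return below + math.comb(m, j) / 2 ** m * (x - (j - 0.5))


def quantile_coupling(m, x):
    """Phi^{-1}(F_m(x)); x must lie strictly inside (-1/2, m + 1/2)."""
    return NormalDist().inv_cdf(binom_uniform_cdf(m, x))


mean50, var50 = root_transform_moments(50.0)
w_center = quantile_coupling(10, 5.0)
```

Under these assumptions the root transform has mean close to $2\sqrt{\lambda}$ and variance close to 1 already for moderate $\lambda$, while the quantile transform sends the center of a symmetric binomial to 0 and is antisymmetric about it.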
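Let $\tilde X_{m,p}$ be a $\mathrm{Bin}(m,p)$ variable, $\tilde X_\lambda$ be a Poisson variable with mean $\lambda$, and $\tilde U$ be a uniform variable in $[-1/2, 1/2)$ independent of $\tilde X_{m,p}$ and $\tilde X_\lambda$. Define

$$\tilde g_{m,p}(x) \equiv \frac{d}{dx}P\bigl\{\Phi^{-1}\bigl(F_m(\tilde X_{m,p} + \tilde U)\bigr) \le x\bigr\} \tag{3.1}$$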

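with the $F_m$ in (2.8) and the $N(0,1)$ distribution function $\Phi$, and define

$$\tilde g_\lambda(x) \equiv \frac{d}{dx}P\Bigl\{2\,\mathrm{sgn}(\tilde X_\lambda + \tilde U)\sqrt{|\tilde X_\lambda + \tilde U|} \le x\Bigr\}. \tag{3.2}$$

Write $\varphi_b$ for the density of $N(b,1)$ variables.

Proof of Theorem 3. Let $g_{[k_0,k)}(w_{[k_0,k)})$ and $g^*_{[k_0,k)}(w_{[k_0,k)})$ be the joint densities of $W_{[k_0,k)}$ and $W^*_{[k_0,k)}$, $g^*_k(w_k)$ be the joint density of $W^*_k$, and $g_k(w_k \mid w_{[k_0,k)})$ be the conditional joint density of $W_k$ given $W_{[k_0,k)}$. Since $W^*_k$ is independent of $W^*_{[k_0,k)}$,

$$\sqrt{g_{[k_0,k)}\,g^*_{[k_0,k)}} - \sqrt{g_{[k_0,k+1)}\,g^*_{[k_0,k+1)}} = \sqrt{g_{[k_0,k)}\,g^*_{[k_0,k)}}\Bigl(1 - \sqrt{g_k\,g^*_k}\Bigr),$$

so that the Hellinger distance can be written as

$$H_f^2\bigl(W_{[k_0,k_1)}, W^*_{[k_0,k_1)}\bigr) = 2\left(1 - \int\sqrt{g_{[k_0,k_1)}\,g^*_{[k_0,k_1)}}\right)$$

$$= 2\left(1 - \int\sqrt{g_{[k_0,k_0+1)}\,g^*_{[k_0,k_0+1)}}\right) + 2\sum_{k_0<k<k_1}\int\sqrt{g_{[k_0,k)}\,g^*_{[k_0,k)}}\left(1 - \int\sqrt{g_k\,g^*_k}\right)$$

$$= H_f^2(\bar Z_{k_0}, \bar Z^*_{k_0}) + \sum_{k_0<k<k_1}\int\sqrt{g_{[k_0,k)}\,g^*_{[k_0,k)}}\;H^2(g_k, g^*_k). \tag{3.3}$$

At the initial resolution level $k_0$, $N_{k_0,l}$ are independent Poisson variables by (2.5), so that $\bar Z_{k_0,l}$ are independent. This and the independence of $\bar Z^*_{k_0,l}$ from (2.1) imply

$$H_f^2(\bar Z_{k_0}, \bar Z^*_{k_0}) \le \sum_{l=0}^{2^{k_0}-1} H_f^2(\bar Z_{k_0,l}, \bar Z^*_{k_0,l}).$$

By (2.6) and (3.2), $\bar Z_{k_0,l}/\sigma_{k_0}$ have densities $\tilde g_{\lambda_{k_0,l}}$, while $\bar Z^*_{k_0,l}/\sigma_{k_0}$ are $N(h_{k_0,l}/\sigma_{k_0}, 1)$ variables by (2.3). Thus, Theorem 4 can be used to obtain

$$H_f^2(\bar Z_{k_0,l}, \bar Z^*_{k_0,l}) = H^2\bigl(\tilde g_{\lambda_{k_0,l}}, \varphi_{h_{k_0,l}/\sigma_{k_0}}\bigr) \le \frac{C}{\lambda_{k_0,l}} + \frac12\left(2\sqrt{\lambda_{k_0,l}} - \frac{h_{k_0,l}}{\sigma_{k_0}}\right)^2.$$

Since $\lambda_{k,l} = \bar f_{k,l}/(4\sigma_k^2)$ by (2.7) and $\sigma_k^2 = 2^{k-2}/n$ by (2.3), the above calculation yields

$$H_f^2(\bar Z_{k_0}, \bar Z^*_{k_0}) \le C\sum_{l=0}^{2^{k_0}-1}\frac{2^{k_0}}{n\bar f_{k_0,l}} + \sum_{l=0}^{2^{k_0}-1}\frac{2n}{2^{k_0}}\Bigl(\sqrt{\bar f_{k_0,l}} - h_{k_0,l}\Bigr)^2$$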

12 12 L. D. BROWN, A. V. CARTER, M. G. LOW AND C.-H. ZHANG (3.4 C 22k 2 k 1 + nε l= by Lemma 1(i and the bound f ε. For k > k and l < 2 k 1 1, define (3.5 n2 k ( 2 2ε 3 (f f k,l 2 I k,l µ k,2l m k,2l (2p k,2l 1, β k,2l λ k 1,l (2p k,2l 1, where p k,2l are as in (2.1, λ k,l = f k,l n/2 k are as in (2.7, and the functions m k,2l m k,2l (w [k,k are defined by N k 1,l = m k,2l (W [k,k. At a fixed resolution level k > k, and for l =,...,2 k 1 1, N k,2l are independent binomial variables conditionally on W [k,k, so that by (2.9 and (3.1 W k,2l /σ k 1 are independent variables with densities g mk,2l,p k,2l under the conditional density g k. In addition, Wk,2l are independent normal variables with variance σk 1 2 under g k. Thus, (3.6 2 k 1 H 2 (gk,g 1 k H 2 ( g mk,2l,p k,2l,ϕ β k,2l, l= by (2.4, where β k,2l EW k,2l /σ k 1 = 4n hφ k 1,l. It follows from Theorem 5 and (3.5 that for fixed w [k,k, H 2 ( g mk,2l,p k,2l,ϕ β k,2l (3.7 {[ D p k,2l 2] 1 2 [ + m k,2l p k,2l 1 ] 4 } + (µ k,2l βk,2l Furthermore, it follows from Lemma 3 that g [k,k g [k,k( m k,2l λ k 1,l 2 g [k,k( m k,2l λ k 1,l 4 = E( N k 1,l λ k 1,l 4 2, so that by (3.5, g [k,k g [k,k(µ k,2l β k,2l 2 4(2p k,2l (β k,2l β k,2l 2. Similarly, g [k,k g [k,km k,2l EN 2 k 1,l λ k 1,l + 1/2. Thus, by (3.7, g [k,k g [k,kh 2 ( g mk,2l,p k,2l,ϕ β k,2l (3.8 4D 1 [p k,2l 1 2 ]2 + Dλ k 1,l [p k,2l 1 2 ]4 + (β k,2l β k,2l 2,

13 EQUIVALENCE THEORY FOR DENSITY ESTIMATION 13 with D 1 = 3D/ Now, by (2.1 and (1.3, p k,2l 1 2 = I k,2l f I k,2l+1 f 2 2 = k 1 θ k 1,l (3.9, I k 1,l f 2f k 1,l so that by (3.5, (2.7, the definition of βk,2l in (3.6 and Lemma 1(ii, β k,2l βk,2l = nf k 1,l 2 k 1 θ k 1,l 2 k 1 4n hφ k 1,l (3.1 = 4n θ k 1,l 2 f k 1,l 4n2 (k 1/2 1 f 3/2 k 1,l f k 1,l hφ k 1,l I k 1,l (f f k 1,l 2. Inserting (3.9 and (3.1 into (3.8 and summing over l via (3.6, we find g [k,k g [k,kh 2 (g k,g k ( k 1 1 l= 2 k 1 1 l= 2 k 1 1 l= g [k,k g [k,kh 2 ( g mk,2l,p k,2l,ϕ β k,2l [ 4D 1 2 k θ 2 k 1,l + Dλ k 1,l 4 k θ 4 k 1,l 8fk 1,l 2 64fk 1,l 4 ( 2 ] + n2k 2fk 1,l 3 (f f k 1,l 2 I k 1,l [ D1 ε 2 ( D n2 2 k 1 θk 1,l 2 k 1( 2 ] ε 3 (f f k 1,l 2, I k 1,l due to λ k,l = nf k,l /2 k in (2.7 and θ 2 k,l I k,l (f f k,l 2. Finally, inserting (3.4 and (3.11 into (3.3 and then using Lemma 2 yields H 2 f(w [k,k 1 +1,W [k,k 1 +1 C 22k + D k nε ε 2 2 k k=k ( D k1 2 k=k 2 k 1 l= 2 k 1 l= θ 2 k,l n2 k ( 2 ε 3 (f f k,l 2 I k,l

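$$\le \frac{C}{\varepsilon_0}\frac{4^{k_0}}{n} + \frac{D_1}{\varepsilon_0^2}\sum_{k=k_0}^{k_1} 2^k\sum_{l=0}^{2^k-1}\theta^2_{k,l} + \frac{D_2}{\varepsilon_0^3}\frac{n}{4^{k_0}}\sum_{k=k_0}^{k_1} 2^{3k}\sum_{l=0}^{2^k-1}\theta^4_{k,l},$$

with $D_2$ a fixed multiple of the constant $D$, and the theorem follows.

4. Approximation of Poisson variables. Let $\tilde X_\lambda$ be a Poisson random variable with mean $\lambda$ and $\tilde U$ be a uniform variable on $[-1/2, 1/2)$ independent of $\tilde X_\lambda$. Define

$$\tilde Z_\lambda \equiv 2\,\mathrm{sgn}(\tilde X_\lambda + \tilde U)\sqrt{|\tilde X_\lambda + \tilde U|}, \qquad \tilde g_\lambda(y) \equiv \frac{d}{dy}P\{\tilde Z_\lambda \le y\}. \tag{4.1}$$

The main result of this section is a local limit theorem which bounds the squared Hellinger distance between this transformed Poisson random variable and a normal random variable.

Theorem 4. Let $\tilde Z_\lambda$ and $\tilde g_\lambda$ be as in (4.1). Let $Z^*_\lambda \sim N(2\sqrt{\lambda}, 1)$ and $\varphi_\mu$ be the density of $N(\mu, 1)$. Let $H(\cdot,\cdot)$ be the Hellinger distance. Then, as $\lambda \to \infty$,

$$H^2(\tilde Z_\lambda, Z^*_\lambda) = H^2\bigl(\tilde g_\lambda, \varphi_{2\sqrt{\lambda}}\bigr) = \bigl(7 + o(1)\bigr)\frac{1}{96\lambda}. \tag{4.2}$$

Consequently, there exists a universal constant $C < \infty$ such that

$$H^2(\tilde g_\lambda, \varphi_\mu) \le \frac{C}{\lambda} + \frac{(2\sqrt{\lambda} - \mu)^2}{2} \qquad \forall\, \lambda > 0,\ \mu. \tag{4.3}$$

Remark 4. The theorem remains valid if $\tilde Z_\lambda$ is replaced by $\tilde Z'_\lambda \equiv 2\sqrt{\tilde X_\lambda + \tilde U + 1/2}$, since $H^2(\tilde Z_\lambda, \tilde Z'_\lambda)$ is bounded by

$$2 - 2\int\sqrt{f_{\tilde X_\lambda+\tilde U}\,f_{\tilde X_\lambda+\tilde U+1/2}} = 1 - \sum_{j=0}^{\infty} e^{-\lambda}\sqrt{\frac{\lambda^{2j+1}}{j!\,(j+1)!}} = E\left\{1 - \sqrt{\frac{\lambda}{\tilde X_\lambda + 1}}\right\} \le \min\left(1, \frac{C}{\lambda}\right).$$

Proof of Theorem 4. The inequality (4.3) follows immediately from (4.2), since $H^2(\varphi_{\mu_1}, \varphi_{\mu_2}) \le (\mu_1 - \mu_2)^2/4$ [cf. Brown, Cai, Low and Zhang (2002), Lemma 3] and $H^2(\tilde g_\lambda, \varphi_\mu) \le 2$. Let $t(x) \equiv 2\,\mathrm{sgn}(x)\sqrt{|x|}$, a strictly increasing function. Define

$$X^*_\lambda \equiv t^{-1}(Z^*_\lambda) = \mathrm{sgn}(Z^*_\lambda)(Z^*_\lambda)^2/4. \tag{4.4}$$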

15 EQUIVALENCE THEORY FOR DENSITY ESTIMATION 15 Let f λ and fλ denote the densities of X λ + Ũ and X, respectively. Since t( is invertible, H 2 ( Z λ,zλ = H( X λ + Ũ, X λ = 2 2 f λ fλ, so that it suffices to show fλ (4.5 A λ fλ = 1 C λ λ, lim C λ = 7 λ 192. Since Ũ is uniform, f λ(x = e λ λ j /j! on [j 1/2,j + 1/2, so that j+1/2 (4.6 A λ = f λ (j {fλ (x/f λ(j} 1/2 dx. j= j 1/2 Since t (x = x 1/2, by (4.4 fλ (x = x 1/2 ϕ(t(x 2 λ. This gives fλ (x f λ (j = exp{ (2 x 2 λ 2 /2} = exp[2ψ 2πxe λ λ j j (x], j 1 2 /j! x < j + 1 2, for j 1, in view of the Stirling formula j! = 2πj j+1/2 exp( j +ε j, where ψ j (x ( x λ 2 logx + λ j [ ] j 2 log + logj j λ ε j (4.7 2 with 1/(12j + 1 < ε j < 1/(12j, for j = 1,2,... Now, by the mean-value theorem, j+1/2 { f } λ (x 1/2 dx j 1/2 f λ (j j+1/2 = exp [ψ j (j + ψ j(j(x j + ψ j (x j (x j ]dx 2 2 j 1/2 for some x j j 1 2, with (4.8 ψ j (x = λ x 1 1 4x, ψ j (x = λ 2x 3/ x 2. Since exp[ψ j (j + ψ j (x j(x j 2 /2] is symmetric about j, it follows that j+1/2 { f } λ (x 1/2 dx j 1/2 f λ (j (4.9 j+1/2 = exp [ψ j (j + ψ j (x j(x j 2 ] (ψ j (j(x j2k dx. j 1/2 2 (2k! Now, we shall take uniform Taylor expansions of ψ j and their derivatives in J λ {j : j/λ 1 λ 2/5 }. k=

16 16 L. D. BROWN, A. V. CARTER, M. G. LOW AND C.-H. ZHANG By (4.7, ψ j (j = λψ(j/λ + ε j /2 with ψ(x ( x x 2 + x 2 logx. Since ψ(1 = ψ (1 = ψ (1 =, ψ (1 = 1/4 and ψ (1 = 7 8, ( j λψ = λ λ (j λ 3 4 3!λ 3 7λ 8 (j λ 4 4!λ 4 (1 + o(1 = o(1. Since 1/(12j + 1 < ε j < 1/(12j, ε j /2 = (1 + o(1/(24λ = o(1. Thus, ψ j (j = (j λ3 24λ (j λ 4 24λ 3 (1 + o( o(1 = o(1 24λ uniformly in J λ as λ. Similarly, by (4.8 and x j j 1 2, {ψ j(j} 2 (j λ2 = (1 + o(1 4λ 2 + o(1 λ = o(1, ψ j (x 1 + o(1 j = = o(1. 2λ These expansions and (4.9 imply that uniformly in J λ, j+1/2 { f } λ (x 1/2 dx j 1/2 f λ (j j+1/2 [ ] = 1 + ψ j (j + {ψ j (x j + (ψ j (x (j2 j2 } dx 2 j 1/2 + o(1 = o(1 2 k= (j λ 2k λ k+1 (j λ3 24λ 2 7 (j λ λ λ k= (j λ 2k λ k+1, [ 1 2λ ] (j λ2 + 4λ 2 as j+1/2 j 1/2 (x j2 dx = Since f λ(j is the Poisson probability mass function of X λ, j+1/2 { f } f λ (j λ (x 1/2 dx j J j 1/2 f λ (j λ (4.1 = [ ] λ 8 24λ λ 1 96λ + o(1 λ = o(1 192λ

17 EQUIVALENCE THEORY FOR DENSITY ESTIMATION 17 as j J λ f λ (j = 1 + o(1/λ. Note that E( X λ λ 3 = λ and E( X λ λ 4 = 3λ 2 + λ. Hence, (4.5 follows from (4.6, (4.1 and the fact that j+1/2 { f } f λ (j λ (x 1/2 ( dx P { j 1/2 f j / J λ (j X λ / J λ }P { X 1 λ / J λ} = o. λ λ 5. Approximation of binomial variables. The strong approximation of a normal by a binomial depends on the cumulative distribution function F m in (2.8. The addition of the independent uniform Ũ in (2.8 to the binomial X m,1/2 makes the c.d.f. continuous and thus Φ 1 F m is a oneto-one function on ( 1 2,m that maps symmetric binomials to standard normals. Let ϕ b be the N(b,1 density and g m,p be the probability density of (5.1 Φ 1 (F m [ X m,p + Ũ], X m,p Bin(m,p, as in (3.1, where Ũ is an independent uniform on [ 1 2, 1 2. Theorem 5. There is a constant C 1 > such that, for all m, H 2 ( g m,p,ϕ b = ( g m,p ( ϕ b 2 b 2 (5.2 dz C 1 m + b8 m 2, where b = ( m/2log(p/(1 p. Consequently, [( H 2 ( g m,p,ϕ β D p ( + m p 1 4 ] + ( m(2p 1 β 2 ( Proof. The case when m = is trivial because X = with probability 1 and therefore g,p is exactly an N(,1. Thus, the following assumes that m 1. It follows from (3.1 that (5.4 g m,p (z = p j (1 p m j 2 m ϕ (z, where j = j(z is the integer between and m such that (5.5 Φ 1 [F m (j 1 2 ] z < Φ 1 [F m (j ]. Let θ = log(p/q so that log g ( m,p(z ϕ (z = θ j m 2 + mlog(4pq, 2 and the second term can be approximated by [ θ2 4 θ4 2 + e θ 24 log(4pq = log + e θ ] (5.6 θ θ4 32.

18 18 L. D. BROWN, A. V. CARTER, M. G. LOW AND C.-H. ZHANG Let h 1 (θ = (2 + e θ + e θ /4. The second inequality in (5.6 follows from log(h 1 (θ log(1+θ 2 /4 θ 2 /4 θ 4 /32. The first inequality in (5.6 follows from h 1 (θ 1 + θ 2 /4 + θ 4 /24 for θ 4, and from log(h 1 (θ θ θ 2 /4 for θ > 4. Now, let (5.7 z = z (z = j(z m/2 m/2 m and b = θ 2. Then for some 1/24 h 2 (θ 1/32 the log ratio is log g m,p(z ϕ (z = z b b2 2 + h 2(θmθ 4. The log ratio of normals with different means is log(ϕ /ϕ b = zb + b 2 /2. Therefore the ratio with respect to the normal with mean b is (5.8 log g m,p ϕ b = h 2 (θmθ 4 b(z z, h 2 (θ Since y log(x/y x y xlog(x/y, for all positive x and y, so that by (5.8, (5.9 ( 1 ϕb gm,p log ϕ b g m,p 1 ϕb log 2 g m,p 2 H 2 ( g p,m,ϕ b 1 4 { ( ϕb log g m,p ( mθ b } 2 (ϕ b + g m,pdz ( ϕb g m,p, (z z 2 (ϕ b + g m,p dz. It follows from Carter and Pollard (24 that the difference between z and z = z (z is bounded by { z z C2 (m 1/2 + m 1 z 3, for all z, (5.1 C 2 (m 1/2 + m 1 z 3, if z 2m, for some constant C 2. Thus, (z z 2 g 2 (5.11 m,p dz 2C 2 ( 1 m + z 6 m 2 g m,p dz + z 2 >2m z 6 m 2 g m,p dz. Since g m,p I{z = (j m/2/ m }dz = P { X m,p = j}, z 6 g ( X m,p m/2 6 ( ( p 1 6 m,p dz = E = O 1 + m 3 = O(1 + b 6 m 2 uniformly in (m,p. It follows from (5.4 that z 6 g m,p dz 2 m z 6 ϕ dz = O(2 m m 6 e m = O(m 1. z 2 >2m z 2 >2m

19 EQUIVALENCE THEORY FOR DENSITY ESTIMATION 19 The above two inequalities and (5.11 imply (z z 2 g m,p dz 2C 2 2 O(1/m + b 6 /m 2. Similarly, (z z 2 ϕ b dz 2C 2 2 O(1/m + b 6 /m 2. Inserting these two inequalities into (5.9 yields (5.2 in view of (5.7. Now let us prove (5.3. The Hellinger distance is bounded by 2, so that b 8 /m 2 in (5.2 can be replaced by b 4 /m and it suffices to consider p for the proof of (5.3. By inspecting the infinite series expansion of log( p q = log(1 + x log(1 x for x = 2p 1, we find that for p , log( p q 8 3 2p 1 and log(p q 4(p p 1 3. These inequalities, respectively, imply b 2 m + b4 m (2p m(2p and b m(2p m 2p m 2p 1 4, in view of the definition of b, which then imply (5.3 via (5.2 and the fact that H 2 (ϕ b ϕ β = (b β 2 /4. APPENDIX A.1. The Tusnády inequality. The coupling of symmetric binomials and normals maps the integers j onto intervals [β j, β j+1 ] such that the normal(m/2,m/4 probability in the interval is equal to the binomial probability at ( m j 2 j. Taking the standardized values z j = 2(β j m/2 2(j 1/2 m/2, u j =, m m Carter and Pollard (24 showed that for m/2 < j < m and certain universal finite constants C ± u j + 1 C m z j u j u2 ( j m γ uj log(1 u2 j /m u j + logm C + m 2cu j m where c = 2log2 and γ is an increasing function with γ( = 1/12 and γ(1 = log2 1/2. This immediately implies that (A.1 z j u j C m ( u j 3 + logm u2 j m 1 2 for a certain universal constant C <. We shall prove (5.1 here based on (A.1. Because of the symmetry in both distributions, it is only necessary to consider z >.

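It follows from (5.5) and (5.7) that

$$z_j \le z < z_{j+1} \iff u_j \le \tilde z = \tilde z(z) < u_{j+1}.$$

Let $z_j \le z < z_{j+1}$. Since $u_{j+1} - u_j = 2/\sqrt{m}$, for $u_{j+1}^2 \le m/2$ (A.1) implies

$$|z - \tilde z| \le |z_j - u_{j+1}| \vee |z_{j+1} - u_j| \le C\Bigl(\frac{1}{\sqrt{m}} + \frac{z^3 \wedge \tilde z^3}{m}\Bigr). \tag{A.2}$$

Since $u_j$ and $z_j$ are both increasing in $j$, it follows that $(z \wedge \tilde z)/\sqrt{m}$ are uniformly bounded away from zero for $u_{j+1} \ge \sqrt{m/2}$, so that

$$|z - \tilde z| \le |z_j - u_{j+1}| \vee |z_{j+1} - u_j| \le C\,\frac{z^3 \wedge \tilde z^3}{m} \tag{A.3}$$

for $(m + 1)/\sqrt{m} = u_{m+1} \ge u_{j+1} \ge \sqrt{m/2}$ and $z \le \sqrt{2m}$. Since $|z - \tilde z| \le z \vee u_{m+1} \le 2z$ for $z > \sqrt{2m}$, (A.2) and (A.3) imply

$$|z - \tilde z| \le \begin{cases} C_2\bigl(m^{-1/2} + m^{-1}|z|^3\bigr), & \text{for all } z,\\ C_2\bigl(m^{-1/2} + m^{-1}|\tilde z|^3\bigr), & \text{if } |z| \le \sqrt{2m}, \end{cases}$$

for a certain universal $C_2 < \infty$, that is, (5.10).

A.2. Technical lemmas. The following three lemmas simplify the rest of the proof of Theorem 3.

Lemma 1. (i) Let $f_{k,l}$ and $h_{k,l}$ be as in (2.7) and (2.3). Then

$$\bigl|\sqrt{f_{k,l}} - h_{k,l}\bigr| \le 2^{k-1} f_{k,l}^{-3/2}\int_{I_{k,l}} (f - f_{k,l})^2. \tag{A.4}$$

(ii) Let $\theta_{k,l}$ be the Haar coefficients of $f$ as in (1.3). Then

$$\Bigl|\int h\phi_{k,l} - \frac{\theta_{k,l}}{2\sqrt{f_{k,l}}}\Bigr| \le 2^{k/2-1} f_{k,l}^{-3/2}\int_{I_{k,l}} (f - f_{k,l})^2. \tag{A.5}$$

Proof. Let $T = (f - f_{k,l})/f_{k,l} \ge -1$. By algebra,

$$\sqrt{1 + T} - 1 = \frac{T}{1 + \sqrt{1 + T}} = \frac{T}{2} - \frac{T^2}{2(1 + \sqrt{1 + T})^2}.$$

It follows from (2.3) and (2.7) that

$$h_{k,l} = 2^k\sqrt{f_{k,l}}\int_{I_{k,l}}\sqrt{1 + T} = 2^k\sqrt{f_{k,l}}\int_{I_{k,l}}\Bigl(1 + \frac{f - f_{k,l}}{2f_{k,l}} - \frac{(f - f_{k,l})^2}{2f_{k,l}^2(1 + \sqrt{1 + T})^2}\Bigr),$$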

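which implies (A.4) as $2^k|I_{k,l}| = 1$ and by (2.7) $\int_{I_{k,l}} (f - f_{k,l}) = 0$. For (ii) we have

$$\int h\phi_{k,l} = \sqrt{f_{k,l}}\int \phi_{k,l}\sqrt{1 + T} = \sqrt{f_{k,l}}\int \phi_{k,l}\Bigl(1 + \frac{f - f_{k,l}}{2f_{k,l}} - \frac{(f - f_{k,l})^2}{2f_{k,l}^2(1 + \sqrt{1 + T})^2}\Bigr),$$

which implies (A.5) as $\int \phi_{k,l} = 0$ and $|\phi_{k,l}| \le 2^{k/2}$ by (1.4).

Lemma 2. Let $\theta_{k,l}$ be the Haar coefficients in (1.3) and $f_{k,l}$ be as in (2.7). Then

$$\sum_{k=k_0}^{\infty} 2^k\sum_{l=0}^{2^k-1}\Bigl(\int_{I_{k,l}} (f - f_{k,l})^2\Bigr)^2 \le \frac{2^{-ck_0}}{(1 - 1/2^c)^2}\sum_{k=k_0}^{\infty} 2^{k(1+c)}\sum_{l=0}^{2^k-1}\theta_{k,l}^4, \qquad c > 0.$$

Proof. Define

$$\delta_{i,j,k,l} = \begin{cases} 1, & \text{if } I_{i,j} \subseteq I_{k,l},\\ 0, & \text{otherwise.} \end{cases}$$

Since $\sum_{j=0}^{2^i-1}\delta_{i,j,k,l} = 2^{i-k}$ for $i \ge k$, using Cauchy–Schwarz twice yields

$$\Bigl(\int_{I_{k,l}} (f - f_{k,l})^2\Bigr)^2 = \Bigl(\sum_{i=k}^{\infty}\sum_{j=0}^{2^i-1}\delta_{i,j,k,l}\,\theta_{i,j}^2\Bigr)^2 \le \Bigl[\sum_{i=k}^{\infty} 2^{-ic/2}\Bigl(2^{ic}\, 2^{i-k}\sum_{j=0}^{2^i-1}\delta_{i,j,k,l}\,\theta_{i,j}^4\Bigr)^{1/2}\Bigr]^2$$

$$\le \sum_{i=k}^{\infty} 2^{-ic}\;\sum_{i=k}^{\infty} 2^{ic}\, 2^{i-k}\sum_{j=0}^{2^i-1}\delta_{i,j,k,l}\,\theta_{i,j}^4 = \frac{2^{-k(1+c)}}{1 - 1/2^c}\sum_{i=k}^{\infty} 2^{i(1+c)}\sum_{j=0}^{2^i-1}\delta_{i,j,k,l}\,\theta_{i,j}^4.$$

Since $\sum_{l=0}^{2^k-1}\delta_{i,j,k,l} = 1$ for $i \ge k$, the above inequality implies

$$\sum_{k=k_0}^{\infty} 2^k\sum_{l=0}^{2^k-1}\Bigl(\int_{I_{k,l}} (f - f_{k,l})^2\Bigr)^2 \le \sum_{k=k_0}^{\infty} 2^k\,\frac{2^{-k(1+c)}}{1 - 1/2^c}\sum_{i=k}^{\infty} 2^{i(1+c)}\sum_{j=0}^{2^i-1}\sum_{l=0}^{2^k-1}\delta_{i,j,k,l}\,\theta_{i,j}^4$$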

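$$= \sum_{i=k_0}^{\infty}\Bigl(\sum_{k=k_0}^{i}\frac{2^{-ck}}{1 - 1/2^c}\Bigr) 2^{i(1+c)}\sum_{j=0}^{2^i-1}\theta_{i,j}^4 \le \frac{2^{-ck_0}}{(1 - 1/2^c)^2}\sum_{i=k_0}^{\infty} 2^{i(1+c)}\sum_{j=0}^{2^i-1}\theta_{i,j}^4.$$

Lemma 3. Let $X_\lambda$ be a Poisson random variable with mean $\lambda$. Then

$$E\bigl(\sqrt{X_\lambda} - \sqrt{\lambda}\bigr)^4 \le 4.$$

Proof. Since $E(X_\lambda - \lambda)^4 = \lambda(3\lambda + 1)$ and $(\sqrt{x} - \sqrt{\lambda})^4 = (x - \lambda)^4/(\sqrt{x} + \sqrt{\lambda})^4 \le (x - \lambda)^4/\lambda^2$,

$$E\bigl(\sqrt{X_\lambda} - \sqrt{\lambda}\bigr)^4 \le \frac{E(X_\lambda - \lambda)^4}{\lambda^2} = \frac{3\lambda + 1}{\lambda},$$

which is at most 4 for $\lambda \ge 1$. For $\lambda < 1$, the bound $(\sqrt{x} - \sqrt{\lambda})^4 \le (x \vee \lambda)^2 \le x^2 + \lambda^2$ gives

$$E\bigl(\sqrt{X_\lambda} - \sqrt{\lambda}\bigr)^4 \le E X_\lambda^2 + \lambda^2 = \lambda + 2\lambda^2 \le 3.$$

A.3. Proof of Theorem 1. First note that

$$H(T_n R_n^1 V_n, Z_n) \le H(T_n R_n^1 V_n, T_n(N, X_N)) + H(T_n(N, X_N), Z_n),$$

$$H(V_n, R_n^2 T_n^{-1} Z_n) \le H(V_n, R_n^2(N, X_N)) + H(R_n^2(N, X_N), R_n^2 T_n^{-1} Z_n).$$

Note also that since for any randomization $T$ and random $X$ and $Y$, $H(TX, TY) \le H(X, Y)$, it follows that

$$H(T_n R_n^1 V_n, T_n(N, X_N)) \le H(R_n^1 V_n, (N, X_N))$$

and

$$H(R_n^2(N, X_N), R_n^2 T_n^{-1} Z_n) \le H((N, X_N), T_n^{-1} Z_n) = H(T_n(N, X_N), Z_n).$$

For the class $\mathcal{H}$ and the randomizations $R_n^1$ and $R_n^2$ it follows from (2.15), (2.16) and the proof of Proposition 3 on page 58 of Le Cam (1986) that

$$\sup_{f \in \mathcal{H}} H(R_n^1 V_n, (N, X_N)) \to 0 \quad\text{and}\quad \sup_{f \in \mathcal{H}} H(V_n, R_n^2(N, X_N)) \to 0.$$

Hence (1.9) and (1.8) will follow once

$$\sup_{f \in \mathcal{H}} H(T_n(N, X_N), Z_n) \to 0 \tag{A.6}$$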

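is established. By Theorem 3, for (A.6) to hold it is sufficient to show that

$$\sup_{f \in \mathcal{H}}\Bigl(\frac{4^{k_0}}{n} + \|f - f_{k_0}\|^2_{1/2,2,2} + \frac{n}{4^{k_0}}\|f - f_{k_0}\|^4_{1/2,4,4}\Bigr) \to 0.$$

If the class of functions $\mathcal{H}$ is a compact set in the Besov spaces, then the partial sums converge uniformly to 0,

$$\sup_{f \in \mathcal{H}}\|f - f_k\|_{1/2,p,p} \to 0$$

for $p = 2$ or 4 as $k \to \infty$. This implies that there is a sequence $\gamma_k \to 0$ such that $\gamma_k^{-1}\sup_{f \in \mathcal{H}}\|f - f_k\|^4_{1/2,4,4} \to 0$. To be specific, let

$$\gamma_k = \sup_{f \in \mathcal{H}}\|f - f_k\|^2_{1/2,4,4}.$$

It is necessary to choose the sequence of integers $k_0(n)$ that will be the critical dimension that divides the two techniques. Let $k_0$ be the smallest integer such that $4^{k_0}/n \ge \gamma_{k_0}$, so that $4^{k_0}/n < 4\gamma_{k_0-1}$ and $n/4^{k_0} \le 1/\gamma_{k_0}$. Therefore, $k_0(n) \to \infty$, and as $n \to \infty$,

$$\sup_{f \in \mathcal{H}}\Bigl(\frac{4^{k_0}}{n} + \|f - f_{k_0}\|^2_{1/2,2,2} + \frac{n}{4^{k_0}}\|f - f_{k_0}\|^4_{1/2,4,4}\Bigr) \le \sup_{f \in \mathcal{H}}\Bigl(4\gamma_{k_0-1} + \|f - f_{k_0}\|^2_{1/2,2,2} + \frac{1}{\gamma_{k_0}}\|f - f_{k_0}\|^4_{1/2,4,4}\Bigr) \to 0.$$

A.4. Proof of Theorem 2. Theorem 2 follows from Theorem 1 and the fact that the Lipschitz and Sobolev spaces described are compact in the Besov spaces. The Lipschitz class is equivalent to $B_{\beta,\infty,\infty}$, and therefore is compact in $B_{1/2,p,p}$ if $\beta > 1/2$. The Sobolev class is equivalent to $B_{\alpha,2,2}$ and

$$\|f - f_k\|^2_{\alpha,2,2} \le C_\alpha\sum_n c_n(f)^2\, n^{2\alpha},$$

where $C_\alpha$ depends only on $\alpha$. Thus if $\mathcal{F}$ is compact in Sobolev($\alpha$) for $\alpha \ge 1/2$ then it is compact in $B_{1/2,2,2}$.

Further restrictions are required to show that the Sobolev($\alpha$) class is compact in $B_{1/2,4,4}$. If $\|f\|^{(L)}_\beta \le C^{(L)}$, then $|f_k - f_{k+1}| \le C^{(L)} 2^{-k\beta}$, so that

$$\|f - f_{k_0}\|^4_{1/2,4,4} \le \bigl(C^{(L)}\bigr)^2\sum_{k=k_0}^{\infty} 2^{2k(1-\beta)}\int |f_k - f_{k+1}|^2\,dx = \bigl(C^{(L)}\bigr)^2\|f - f_{k_0}\|^2_{(1-\beta),2,2}.$$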
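The step $|f_k - f_{k+1}| \le C^{(L)} 2^{-k\beta}$ can be illustrated on a grid. The sketch below is only an illustration: it assumes that $f_k$ is the piecewise-constant approximation obtained by averaging $f$ over the $2^k$ dyadic intervals of $[0,1]$ (consistent with the averages $f_{k,l}$ and the Haar expansion used in this paper), and it uses a hypothetical test function $f(x) = |x - 0.3|^{\beta}$, which satisfies $|f(x) - f(y)| \le |x - y|^{\beta}$. The function names and grid resolution are illustrative choices, not taken from the paper.

```python
def dyadic_averages(values, k):
    """Average the grid values over each of the 2**k dyadic intervals of [0, 1]."""
    block = len(values) // (2**k)
    return [
        sum(values[l * block:(l + 1) * block]) / block
        for l in range(2**k)
    ]


def max_level_gap(values, k):
    """sup over the grid of |f_k - f_{k+1}| for the piecewise-constant averages."""
    coarse = dyadic_averages(values, k)
    fine = dyadic_averages(values, k + 1)
    # The fine interval with index l sits inside the coarse interval l // 2.
    return max(abs(fine[l] - coarse[l // 2]) for l in range(len(fine)))


BETA = 0.75            # smoothness index of the hypothetical test function
GRID_LEVEL = 12        # the grid has 2**12 midpoints
N = 2**GRID_LEVEL

# Hypothetical test function sampled at grid midpoints.
f_values = [abs((i + 0.5) / N - 0.3) ** BETA for i in range(N)]

gaps = {k: max_level_gap(f_values, k) for k in range(9)}

if __name__ == "__main__":
    for k, gap in gaps.items():
        print(f"k={k}  sup|f_k - f_(k+1)|={gap:.5f}  2^(-k*beta)={2 ** (-k * BETA):.5f}")
```

Each gap is at most the oscillation of $f$ over a dyadic interval of length $2^{-k}$, which is why the printed values fall below $2^{-k\beta}$ with constant 1 for this particular test function.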

Therefore, for $\mathcal{F}$ bounded in Lipschitz($\beta$), a compact Sobolev($\alpha$) set is also compact in $B_{1/2,4,4}$ if $\alpha \ge 1 - \beta$. Finally, if $\mathcal{F}$ is compact in Sobolev($\alpha$), $\alpha \ge 3/4$, then it immediately follows from the Sobolev embedding theorem that the function is bounded in Lipschitz(1/4) [e.g., Folland (1984), pages 27 and 273], and it follows that $\mathcal{F}$ is compact in $B_{1/2,4,4}$.

Acknowledgments. We thank the referees and an Associate Editor for several suggestions which led to improvements in the final manuscript.

REFERENCES

Brown, L. D., Cai, T., Low, M. G. and Zhang, C.-H. (2002). Asymptotic equivalence theory for nonparametric regression with random design. Ann. Statist.

Brown, L. D. and Low, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist.

Brown, L. D. and Zhang, C.-H. (1998). Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2. Ann. Statist.

Carter, A. V. (2000). Asymptotic equivalence of nonparametric experiments. Ph.D. dissertation, Yale Univ.

Carter, A. V. (2002). Deficiency distance between multinomial and multivariate normal experiments. Ann. Statist.

Carter, A. V. and Pollard, D. (2004). Tusnády's inequality revisited. Ann. Statist. 32. To appear.

Efromovich, S. and Samarov, A. (1996). Asymptotic equivalence of nonparametric regression and white noise has its limits. Statist. Probab. Lett.

Folland, G. B. (1984). Real Analysis. Wiley, New York.

Golubev, G. and Nussbaum, M. (1998). Asymptotic equivalence of spectral density and regression estimation. Technical report, Weierstrass Institute, Berlin.

Gramma, I. and Nussbaum, M. (1998). Asymptotic equivalence for nonparametric generalized linear models. Probab. Theory Related Fields.

Le Cam, L. (1964). Sufficiency and approximate sufficiency. Ann. Math. Statist. MR2793

Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.

Le Cam, L. and Yang, G. (1990). Asymptotics in Statistics. Springer, New York.

Milstein, G. and Nussbaum, M. (1998). Diffusion approximation for nonparametric autoregression. Probab. Theory Related Fields.

Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist.

L. D. Brown
M. G. Low
Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania
USA

C.-H. Zhang
Department of Statistics
54 Hill Center, Busch Campus
Rutgers University
Piscataway, New Jersey
USA

A. V. Carter
Department of Statistics and Applied Probability
University of California, Santa Barbara
Santa Barbara, California
USA
