arxiv: v1 [stat.me] 21 Mar 2016

Size: px

Start display at page:

Download "arxiv: v1 [stat.me] 21 Mar 2016"

Brandon Walton
5 years ago
Views:

1 Sharp sup-norm Bayesian curve estimation Catia Scricciolo Department of Economics, University of Verona, Via Cantarane 24, Verona, Italy arxiv: v1 [stat.me] 21 Mar 216 Abstract Sup-norm curve estimation is a fundamental statistical problem and, in principle, a premise for the construction of confidence bands for infinite-dimensional parameters. In a Bayesian framework, the issue of whether the sup-normconcentration-of-posterior-measure approach proposed by Giné and Nickl (211), which involves solving a testing problem exploiting concentration properties of kernel and projection-type density estimators around their expectations, can yield minimax-optimal rates is herein settled in the affirmative beyond conjugate-prior settings obtaining sharp rates for common prior-model pairs like random histograms, Dirichlet Gaussian or Laplace mixtures, which can be employed for density, regression or quantile estimation. Keywords: McDiarmind s inequality, Nonparametric hypothesis testing, Posterior distributions, Sup-norm rates 1. Introduction The study of the frequentist asymptotic behaviour of Bayesian nonparametric (BNP) procedures has initially focused on the Hellinger or L 1 -distance loss, see Shen and Wasserman (21) and Ghosal et al. (2), but an extension and generalization of the results to L r -distance losses, 1 r, has been the object of two recent contributions by Giné and Nickl (211) and Castillo (214). Sup-norm estimation has particularly attracted attention as it may constitute the premise for the construction of confidence bands whose geometric structure can be easily visualized and interpreted. Furthermore, as shown in the example of Section 3.2, the study of sup-norm posterior contraction rates for density estimation can be motivated as being an intermediate step for the final assessment of convergence rates for quantile estimation. While the contribution of Castillo (214) has a more prior-model specific flavour, the article by Giné and Nickl (211) aims at a unified understanding of the drivers of the asymptotic behaviour of BNP procedures by developing a new approach to the involved testing problem constructing nonparametric tests that have good exponential bounds on the type-one and type-two error probabilities that rely on concentration properties of kernel and projection-type density estimators around their expectations. Even if Giné and Nickl (211) s approach can only be useful if a fine control of the approximation properties of the prior support is possible, it has the merit of replacing the entropy condition for sieve sets with approximating conditions. However, the result, as presented in their Theorem 2 (Theorem 3), can only retrieve minimax-optimal rates for L r -losses when 1 r 2, while rates deteriorate by a genuine power of n, in fact n 1/2, for r>2. Thus, the open question remains whether their approach can give the right rates for 2<r for non-conjugate priors and sub-optimal rates are possibly only an artifact of the proof. We herein settle this issue in the affirmative by refining their result and proof and showing in concrete examples that this approach retrieves the right rates. The paper is organized as follows. In Section 2, we state the main result whose proof is postponed to Appendix A. Examples concerning different statistical settings like density and quantile estimation are presented in Section 3. Corresponding author. address: catia.scricciolo@univr.it (Catia Scricciolo)

2 2. Main result In this section, we describe the set-up and present the main contribution of this note. Let ( (X,A, P), P P ) be a collection of probability measures on a measurable space (X, A) that possess densities with respect to some σ-finite dominating measureµ. LetΠ n be a sequence of priors on (P,B), wherebis aσ-field onpfor which the maps x p(x) are jointly measurable relative toa B. Let X 1,..., X n be i.i.d. (independent, identically distributed) observations from a common law P P with density p onxwith respect toµ, p = dp /dµ. For a probability measure P on (X,A) and ana-measurable function f :X R k, k 1, let P f denote the integral f dp, where, unless otherwise specified, the set of integration is understood to be the whole domain. When this notation is applied to the empirical measure P n associated with a sample X (n) := (X 1,..., X n ), namely the discrete uniform measure on the sample values, this yields P n f = n 1 n i=1 f (X i ). For each n N, let ˆp n ( j)( )=n 1 n i=1 K j (, X i ) be a kernel or projection-type density estimator based on X 1,..., X n at resolution level j, with K j as in Definition (1) below. Its expectation is then equal to P n ˆp n( j)( )=P K j (, X 1 )=K j (p )( ), where we have used the notation K j (p )( )= K j (, y)p (y) dy. In order to refine Giné and Nickl (211) s result, we use concentration properties of ˆp n ( j) K j (p ) 1 around its expectation by applying McDiarmind s inequality for bounded differences functions. The following definition, which corresponds to Condition in Giné and Nickl (215), is essential for the main result. Definition 1. LetX=R,X=[, 1] orx=(, 1]. The sequence of operators K j (x, y) := 2 j K(2 j x, 2 j y), x, y X, j, is called an admissible approximating sequence if it satisfies one of the following conditions: a) convolution kernel case,x=r: K(x, y)=k(x y), where K L 1 (R) L (R), integrates to 1 and is of bounded p-variation for some finite p 1 and right (left)-continuous; b) multi-resolution projection case,x=r: K(x, y)= k Zφ(x k)φ(y k), with K j as above or K j (x, y)= K(x, y)+ j 1 l= kψ lk (x)ψ lk (y), whereφ,ψ L 1 (R) L (R) define an S -regular wavelet basis, have bounded p-variation for some p 1 and are uniformly continuous, or define the Haar basis, see Chapter 4, ibidem; c) multi-resolution case,x=[, 1]: K j,bc (x, y) is the projection kernel at resolution j of a Cohen-Daubechies-Vial (CDV) wavelet basis, see Chapter 4, ibidem; d) multi-resolution case,x=(, 1]: K j,per (x, y) is the projection kernel at resolution j of the periodization of a scaling function satisfying b), see (4.126) and (4.127), ibidem. Remark 1. A useful property of S -regular wavelet bases is the following: there exists a non-negative measurable functionφ L 1 (R) L (R) such that K(x, y) Φ( x y ) for all x, y R, that is, K is dominated by a bounded and integrable convolution kernel Φ. In order to state the main result, we recall that a sequence of positive real numbers L n is slowly varying at if, for eachλ>, it holds that lim n (L [λn] /L n )=1. Also, for s, let L 1 (µ s ) be the space ofµ s -integrable functions, dµ s (x) := (1+ x ) s dx, equipped with the norm f L1 (µ s ) := f (x) (1+ x ) s dx. Theorem 1. Letǫ n and J n be sequences of positive real numbers such thatǫ n, nǫn 2 and 2 J n = O(nǫn). 2 For each r {1, } and a slowly varying sequence L n,r, letǫ n,r := L n,r ǫ n. Suppose that, for K as in Definition (1), with K 2,Φ 2 and p that integrate (1+ x ) s for some s>1in cases a) and b), K Jn (p ) p r = O(ǫ n,r ) (1) and, for a constant C>, setsp n {P P: K Jn (p) p r C K ǫ n,r }, where C K > only depends on K, we have (i)π n (P\P n ) exp ( (C+ 4)nǫ 2 n), (ii)π n ( P P: P log(p/p ) ǫ 2 n, P log 2 (p/p ) ǫ 2 n) exp ( Cnǫ 2 n ). 2

3 Then, for sufficiently large M r >, P n Π n( P P: p p r M r ǫ n,r X (n)). (2) If the convergence in (2) holds for r {1, }, then, for each 1< s<. P n Π n( P P: p p s M s ǫ n X (n)), where ǫ n := (L n,1 L n, )ǫ n. The assertion, whose proof is reported in Appendix A, is an in-probability statement that the posterior mass outside a sup-norm ball of radius a large multiple M ofǫ n is negligible. The theorem provides the same sufficient conditions for deriving sup-norm posterior contraction rates that are minimax-optimal, up to logarithmic factors, as in Giné and Nickl (211). Condition (ii), which is mutuated from Ghosal et al. (2), is the essential one: the prior concentration rate is the only determinant of the posterior contraction rate at densities p having sup-norm approximation error of the same order against a kernel-type approximant, provided the prior support is almost the set of densities with the same approximation property. 3. Examples In this section, we apply Theorem 1 to some prior-model pairs used for (conditional) density or regression estimation, including random histograms, Dirichlet Gaussian or Laplace mixtures, that have been selected in an attempt to reflect cases for which the issue of obtaining sup-norm posterior rates was still open. We do not consider Gaussian priors or wavelets series because these examples have been successfully worked out in Castillo (214) taking a different approach. We furthermore exhibit an example with the aim of illustrating that obtaining sup-norm posterior contraction rates for density estimation can be motivated as being an intermediate step for the final assessment of convergence rates for estimating single quantiles Density estimation Example 1 (Random dyadic histograms). For J n N, consider a partition of [, 1] into 2 J n intervals (bins) of equal length A 1,2 Jn= [, 2 J n ] and A j,2 Jn= (( j 1)2 J n, j2 J n ], j=2,..., 2 J n. Let Dir 2 Jn denote the Dirichlet distribution on the (2 J n 1)-dimensional unit simplex with all parameters equal to 1. Consider the random histogram 2 Jn w j,2 Jn 2 J n 1 A ( ), (w j,2 Jn 1,2 Jn,..., w 2 Jn,2 Jn ) Dir 2 Jn. j=1 Denote byπ 2 Jn the induced law on the space of probability measures with Lebesgue density on [, 1]. Let X 1,..., X n be i.i.d. observations from a density p on [, 1]. Then, the Bayes density estimator, that is the posterior expected histogram, has expression ˆp n (x)= 2 Jn j=1 1+N l(x) 2 J 2 J n 1 A n + n j,2 (x), x [, 1], Jn where l(x) identifies the bin containing x, i.e., A l(x),2 Jn x, and N l(x) stands for the number of observations falling into A l(x),2 Jn. LetC α ([, 1]) denote the class of Hölder continuous functions on [, 1] with exponentα>. Let ǫ n,α := ( n/ log n ) α/(2α+1) be the minimax rate of convergence over (C α ([, 1]), ). Proposition 1. Let X 1,..., X n be i.i.d. observations from a density p C α ([, 1]), withα (, 1], satisfying p > on [, 1]. Let J n be such that 2 J n ǫn,α 1/α. Then, for sufficiently large M>, P n Π 2 Jn (P : p p Mǫ n,α X (n) ). Consequently, P n ˆp n p ǫ n,α. The first part of the assertion, which concerns posterior contraction rates, immediately follows from Theorem (1) combined with the proof of Proposition 3 of Giné and Nickl (211), whose result, together with that of Theorem 3 in Castillo (214), is herein improved to the minimax-optimal rate ( n/ log n ) α/(2α+1) for every <α 1. The second part of the assertion, which concerns convergence rates for the histogram density estimator, is a consequence of Jensen s inequality and convexity of p p p, combined with the fact that the priorπ 2 Jn is supported on densities uniformly bounded above by 2 J n and that the proof of Theorem 1 yields the exponential order exp ( Bnǫn,α) 2 for the convergence of the posterior probability of the complement of an (Mǫ n,α )-ball around p, in symbols, P n ˆp n p < Mǫ n,α + 2 J n P n Π 2 Jn (P : p p Mǫ n,α X (n) ) Mǫ n,α + 2 J n exp ( Bnǫn,α), 2 whence P n ˆp n p = O(ǫ n,α ). 3

4 Example 2 (Dirichlet-Laplace mixtures). Consider, as in Scricciolo (211), Gao and van der Vaart (215), a Laplace mixture priorπthus defined. Forϕ(x) := 1 2 exp ( x ), x R, the density of a Laplace (, 1) distribution, let p G ( ) := ϕ( θ) dg(θ) denote a mixture of Laplace densities with mixing distribution G, G D α, the Dirichlet process with base measureα :=α R ᾱ, for <α R < andᾱaprobability measure onr. Proposition 2. Let X 1,..., X n be i.i.d. observations from a density p G, with G supported on a compact interval [ a, a]. Ifαhas support on [ a, a] with continuous Lebesgue density bounded below away from and above from, then, for sufficiently large M>, P n Π(P : p p M(n/ log n) 3/8 X (n) ). Consequently, for the Bayes estimator ˆp n ( )= p G ( )Π(dG X (n) ) we have P n ˆp n p (n/ log n) 3/8. Proof. It is known from Proposition 4 in Gao and van der Vaart (215) that the small-ball probability estimate in condition (ii) of Theorem 1 is satisfied forǫ n = (n/ log n) 3/8. For the bias condition, we takep n to be the support ofπand show that, for 2 J n ǫn 1/3 = (n/ log n) 1/8 and any symmetric density K with finite second moment, we have K Jn (p G ) p G = O(ǫ n ) uniformly over the support ofπ. Indeed, by applying Lemma 1 withβ=2, for each x R it results K Jn (p G )(x) p G (x) 2 K Jn (p G ) p G 2 2 ϕ(t) 2 K(2 J n t) 1 2 dt (2π) 1 (B 2 ϕ I 2 [ K])(2 2J n ) 3, which implies that both conditions (1) and (i) are satisfied. The assertion on the Bayes estimator follows from the same arguments laid out for random histograms together with the fact that p G 1/2 uniformly in G. Example 3 (Dirichlet-Gaussian mixtures). Consider, as in Ghosal and van der Vaart (21, 27), Shen et al. (213), Scricciolo (214), a Gaussian mixture prior Π G thus defined. For φ the standard normal density, let p F,σ ( ) := φ σ ( θ) df(θ) denote a mixture of Gaussian densities with mixing distribution F, F D α, the Dirichlet process with base measureα :=α R ᾱ, for <α R < andᾱaprobability measure on R, which has continuous and positive densityα (θ) e b θ δ as θ, for some constants <b< and <δ 2, σ G which has continuous and positive density g on (, ) such that, for constants <C 1, C 2, D 1, D 2 <, s, t<, C 1 σ s exp ( D 1 σ 1 log t (1/σ) g(σ) C 2 σ s exp ( D 2 σ 1 log t (1/σ)) for allσin a neighborhood of. LetC β (R) denote the class of Hölder continuous functions onrwith exponentβ>. Letǫ n,β := ( n/ log n ) β/(2β+1) be the minimax rate of convergence over (C β (R), ). For any realβ>, let β stand for the largest integer strictly smaller thanβ. Proposition 3. Let X 1,..., X n be i.i.d. observations from a density p L (R) C β (R) such that condition (ii) is satisfied forǫ n,β. Then, for sufficiently large M>, P n (Π G)((F,σ) : p F,σ p Mǫ n,β X (n) ). Proof. Let K L 1 (R) be a convolution kernel such that x k K(x) dx=1 {} (k), k=,..., β, and x β K(x) dx<, the Fourier transform K has supp( K) [ 1, 1]. Let 2 J n ǫ 1/β n,β. For every x R, K J n (p )(x) p (x) C 1 (2 J n ) β ǫ n,β, where the constant C 1 (1/ β!) x β K(x) dx does not depend on x. Thus, K Jn (p ) p = O(ǫ n,β ). For the bias condition, letσ n := E(nǫn,β 2 ) 1 (log n) ψ, with 1/2<ψ<tand a suitable constant <E<. For everyσ σ n and uniformly in F, K Jn (p F,σ ) p F,σ = sup K Jn (u)[φ σ (x v u) φ σ (x v)] du df(v) x R 1 2π sup e it(x v) φ σ (t) K(2 J n t) 1 dt df(v) x R 1 φ σ (t) dt σ 1 n π exp ( (ρσ n 2J n ) 2 ) n 1 <ε n,β t >2 Jn because (σ n 2 J n ) 2 (log n) 2ψ (log n) asψ > 1/2. Now, G(σ < σ n ) σ s n exp ( [D 2σ 1 n logt (1/σ n )]) exp ( (C+ 4)nǫn) 2 becauseψ<t, which implies that the remaining mass condition (ii) is satisfied. 4

5 Remark 2. Conditions on the density p under which assumption (ii) of Theorem 1 is satisfied can be found, for instance, in Shen et al. (213) and Scricciolo (214) Quantile estimation Forτ (, 1), consider the problem of estimating theτ-quantile q τ of the population distribution function F from observations X 1,..., X n. For any (possibly unbounded) interval I R and function g on I, define the Hölder norm as α g Cα (I) := g (k) g α (x) g α (y) L (I)+ sup. x, y I: x y x y α α k= LetC (I) denote the space of continuous and bounded functions on I andc α (I, R) :={g C (I) : g Cα (I) R}, R>. Proposition 4. Suppose that, givenτ (, 1), there are constants r,ζ> so that p ( +q τ ) Cα ([ ζ,ζ], R) and inf [q τ ζ, qτ +ζ] p (x) r. (3) Consider a priorπconcentrated on probability measures having densities p( +q τ ) Cα ([ ζ,ζ], R). If, for sufficiently large M, the posterior probability P n Π(P : p p Mǫ n,α X (n) ), then, there exists M > so that P n Π( qτ q τ M ǫ 1+1/α n,α X (n) ). Proof. We preliminarily make the following remark. Let F(x) := x p(y) dy, x R. Forτ (, 1), let qτ be theτquantile of F. By Lagrange s theorem, there exists a point q τ between qτ and q τ so that F(qτ ) F(q τ )= p(qτ )(qτ q τ ). Consequently, =τ τ= If p(q τ )>, then q q τ p(x) dx p (x) dx= p(x) dx+ q τ τ q τ = [F(qτ ) F (q τ )] p(q τ ) [p(x) p (x)] dx= p(q τ )(qτ q τ )+[F(qτ ) F (q τ )]. = [F(qτ ) τ] p(q τ. (4) ) In order to upper bound q τ q τ, by appealing to relationship (4), we can separately control F(qτ ) F (q τ ) and p(qτ ). Let the kernel function K L 1 (R) be such that x k K(x)dx=1 {} (k), k=,..., α +1, and x α+1 K(x) dx<, its Fourier transform K has supp( K) [ 1, 1]. By Lemma 5.2 in Dattner et al. (213), sup p ( +q τ ) Cα ([ ζ,ζ], R) with D := [R/( α +1)!+ 2ζ (α+1) ] x α+1 K(x) dx. Write F(q τ ) F (q τ )= [K b p p ](x) dx+ [K b p p ](x) dx Dbα+1, (5) [K b (p p )](x) dx+ [p K b p](x) dx=: T 1 + T 2 + T 3. By inequality (5), we have T 1 =O(b α+1 ). By the same reasoning, T 3 =O(b α+1 ). We now consider T 2. Taking into account that K(x) dx=1and ( 1 q τ ) T 2 := [K b (F F )](q τ )= b K u (F F )(u) du= b 5 K(z)(F F )(q τ bz) dz= K(z)(F F)(q τ bz) dz,

6 for some pointξ between q τ bz and qτ (clearly,ξ depends on qτ, z, b), T 2 = [K b (F F )](q τ ) (F F)(q τ )= K(z)[(F F)(q τ bz) (F F)(q τ )] dz+(f F)(q τ ) = K(z)( bz)[d 1 (F F)(ξ)] dz+ (F F)(q τ ) = ( b) zk(z)[(p p)(ξ)] dz+(f F)(q τ ). Then, F(q τ ) F (q τ )=T 1+ T 3 + ( b) zk(z)[(p p)(ξ)] dz [(F F )(q τ )], which implies that 2[F(qτ ) F (q τ )]= T 1 +T 3 +( b) zk(z)[(p p)(ξ)] dz. It follows that 2 F(q τ ) F (q τ ) T 1 + T 3 +b p p z K(z) dz. Taking into account that z K(z) dz<, T 1 =O(b α+1 ) and T 3 =O(b α+1 ), choosing b=o(ǫn,α 1/α ), we have F(q τ ) F (q τ ) T 1 + T 3 +b p p b α+1 + b p p ǫn,α 1+1/α. If p p ǫ n,α then, under condition (3), p(q τ )>r η> for every <η<r. In fact, for any interval I [q τ ζ, qτ +ζ] that includes the point qτ so that it also includes the intermediate point q τ between qτ and q τ, for anyη>we haveη p p sup I p(x) p (x) p( x) p ( x) for every x I. It follows that p(q τ )> p (q τ ) η inf x [q τ ζ, qτ +ζ] p (x) η r η. Conclude the proof by noting that, in virtue of (4), P n Π(P : p p < Mǫ n,α X (n) ) P n Π( qτ q τ < M ǫn,α 1+1/α X (n) ). The assertion then follows. Remark 3. Proposition 4 considers local Hölder regularity of p, which seems natural for estimating single quantiles. Clearly, requirements on p are automatically satisfied if p is globally Hölder regular and, in this case, the minimax-optimal sup-norm rate isǫ n,α = (n/ log n) α/(2α+1) so that the rate for estimating single quantiles is ǫn,α 1+1/α = (n/ log n) (α+1)/(2α+1). The conditions on the random density p are automatically satisfied if the prior is concentrated on probability measures possessing globally Hölder regular densities. Appendix A. Proof of Theorem 1 Proof. Using the remaining mass condition (i) and the small-ball probability estimate (ii), by the proof of Theorem 2.1 in Ghosal et al. (2), it is enough to construct, for each r {1, }, a testψ n,r for the hypothesis H : P=P vs. H 1 :{P P n : p p r M r ǫ n,r }, with M r > large enough, whereψ n,r Ψ n,r (X (n) ; P ) :X n {, 1} is the indicator function of the rejection region of H, such that P n Ψ n,r as n and sup P n (1 Ψ n,r ) exp ( K r Mr 2 nǫ 2 ) n,r for sufficiently large n, P P n : p p r M r ǫ n,r where K r Mr 2 (C+ 4), the constant C> being that appearing in (i) and (ii). By assumption (1), there exists a constant C,r > such that P n ˆp n p r = K Jn (p ) p r C,r ǫ n,r. Define T n,r := ˆp n p r. For a constant M,r > C,r, define the event A n,r := (T n,r > M,r ǫ n,r ) and the testψ n,r := 1 An,r. For r = 1, the triangular inequality T n,1 ˆp n P n ˆp n 1 + P n ˆp n p 1 implies that, when T n,1 > M,1 ǫ n,1, ˆp n P n ˆp n 1 T n,1 P n ˆp n p 1 > M,1 ǫ n,1 P n ˆp n p 1 (M,1 C,1 )ǫ n,1 ; r=, we have ˆp n (x) p (x) ˆp n (x) P n ˆp n(x) + P n ˆp n(x) p (x) ˆp n P n ˆp n 1 + P n ˆp n p for every x R. It follows that T n, ˆp n P n ˆp n 1 + P n ˆp n p, which implies that, when T n, > M, ǫ n,, ˆp n P n ˆp n 1 T n, P n ˆp n p > M, ǫ n, P n ˆp n p (M, C, )ǫ n,. Let h :X n [, 2] be the function defined as h(x (n) ) := ˆp n P n ˆp n 1. Thus, for each r {1, }, when T n,r > M,r ǫ n,r, the inequality h(x (n) )>(M,r C,r )ǫ n,r holds. Therefore, to control the type-one error probability, it is enough to bound above the probability on the right-hand side of the following display P n Ψ n,r P n ( h(x (n) )>(M,r C,r )ǫ n,r ), (A.1) 6

7 which can be done using McDiarmind s inequality, McDiarmid (1989). Given any x (n) := (x 1,..., x n ) X n, for each 1 i n, let x i be the ith component of x (n) and x i := (x i +δ) a perturbation of the ith variable withδ R so that x i X. Letting e i be the canonical vector with all zeros except for a 1 in the ith position, the vector with the perturbed ith variable can be expressed as x (n) +δe i. If (a) the function h has bounded differences: for some non-negative constants c 1,..., c n, (b) P n h(x(n) )=O(ǫ n ), sup h(x (n) ) h(x (n) +δe i ) c i, 1 i n, x (n), x i then, for C := n i=1 c 2 i, by McDiarmind s bounded differences inequality, We show that (a) and (b) are verified. t>, P n ( h(x (n) ) P n h(x(n) ) t ) 2 exp ( 2t 2 /C ). (a) Using the inequality a b a b, setting Φ = K under condition a) of Definition (1), [ i {1,..., n}, sup h(x (n) ) h(x (n) 1 n +δe i ) = sup K Jn (x, x i ) K Jn (p )(x) x (n), x i x (n), x n i i=1 1 n K Jn (x, x i )+ 1 ] n n K J n (x, x i ) K J n (p )(x) dx i i 1 sup x i, x n K J n (, x i ) K Jn (, x i ) 1 2 n Φ 1. i Hence, h has bounded differences with c i = 2 Φ 1 /n, 1 i n. (b) By Theorem in Giné and Nickl (215), P n h(x(n) ) L 2 J n /n=o(ǫn ), with the following upper bounds for the constant L: under conditions a) and b) of Definition (1), settingφ=kin case a), L 2/(s 1) Φ 2 1/2 L 1 (µ s ) p 1/2 L 1 (µ s ) ; under conditions c) and d), L C(φ)(1 p 1/2 ) 1/2, where the constant C(φ) only depends onφ. Forα (, 1), taking t= 2α(M,r C,r )ǫ n,r, P n ( h(x (n) ) P n h(x(n) ) 2α(M,r C,r )ǫ n,r ) 2 exp ( α 2 (M,r C,r ) 2 nǫ 2 n,r/ Φ 2 1). By (b), there exists a constant L L so that P n h(x(n) ) L ǫ n = (L /L n,r )ǫ n,r. Hence, h(x (n) ) P n h(x(n) ) h(x (n) ) P n h(x(n) ) h(x (n) ) (L /L n,r )ǫ n,r. Thus, for sufficiently large L n,r so that [(M,r C,r ) (L /L n,r )] 2α(M,r C,r ), P n ( ˆpn P n ˆp ) n 1 (M,r C,r )ǫ n,r P n( h(x (n) ) P n h(x(n) ) ) 2α(M,r C,r )ǫ n,r 2 exp ( α 2 (M,r C,r ) 2 nǫ 2 n,r/ Φ 2 1). We now provide an exponential upper bound on the type-two error probability. For r {1, }, let P P n be such that p p r M r ǫ n,r. For r=1, when T n,1 M,1 ǫ n,1, p p 1 p P n ˆp n 1 + ˆp n P n ˆp n 1 + T n,1 p P n ˆp n 1 + ˆp n P n ˆp n 1 + M,1 ǫ n,1, 7

8 r=, when T n, M, ǫ n,, x X, p(x) p (x) p P n ˆp n + ˆp n P n ˆp n 1 + T n, p P n ˆp n + ˆp n P n ˆp n 1 + M, ǫ n,, which implies that p p p P n ˆp n + ˆp n P n ˆp n 1 + M, ǫ n,. Summarizing, for r {1, }, when T n,r M,r ǫ n,r, we have p p r p P n ˆp n r + ˆp n P n ˆp n 1 + M,r ǫ n,r. If sup P Pn p P n ˆp n r = sup P Pn p K Jn (p) r C K ǫ n,r, we have ˆp n P n ˆp n 1 p p r p P n ˆp n r M,r ǫ n,r [M r (C K + M,r )]ǫ n,r. Using, as before, McDiarmind s inequality with P playing the same role as P, we get that for a constantα (, 1) small enough and [M r (C K + M,r )]>, sup P n (1 φ n,r )=P n ( ˆp n p r M,r ǫ n,r )= P ( ˆp n P n ) ˆp n 1 [M r (C K + M,r )]ǫ n,r P P n : p p r M r ǫ n,r 2 exp ( α 2 [M r (C K + M,r )] 2 nǫ 2 n,r/ Φ 2 1 ). We need thatα 2 [M r (C K + M,r )] 2 / Φ 2 1 (C+ 4), which implies that [M r (C K + M,r )] α 1 Φ 1 C+ 4. This concludes the proof of the first assertion. If the convergence in (2) holds for r=1 and r=, then the last assertion of the statement follows from the interpolation inequality: for every 1< s<, p p s max{ p p 1, p p }. Appendix B. Auxiliary results for Proposition 3 Following Parzen (1962), Watson and Leadbetter (1963), we adopt the subsequent definition. Definition 2. The Fourier transform or characteristic function of a Lebesgue probability density function p on R, denoted by p, is said to decrease algebraically of degreeβ> if lim t t β p(t) = B p, < B p <. The following lemma is essentially contained in the first theorem of section 3B in Watson and Leadbetter (1963). Lemma 1. Let p L 2 (R) be a probability density with characteristic function that decreases algebraically of degree β>1/2. Let h L 1 (R) have Fourier transform h satisfying 1 h(t) 2 I β [ h] := t 2β dt<. Then,δ 2(β 1/2) p p h δ 2 2 (2π) 1 B 2 p I β [ h] asδ. Proof. Since p L 1 (R) L 2 (R), then p h δ q p q h δ 1 <, for q=1, 2. Thus, p h δ L 1 (R) L 2 (R). It follows that (p p h δ ) L 1 (R) L 2 (R). Hence, { B 2 p I β[ h]+ p p h δ 2 2 =δ2β 1 2π 1 h(z) 2 z 2β [ z/δ 2β p(z/δ) 2 B 2 p ] dz }, where the second integral tends to by the dominated convergence theorem because of assumption (B.1). In the next remark, which is essentially due to Davis (1977), section 3, we consider a sufficient condition for a function h L 1 (R) to satisfy requirement (B.1). Remark 4. If h L 1 (R), then t 2β 1 h(t) 2 dt< forβ>1/2. Suppose further that there exists an integer r 2 1 such that x m h(x) dx=, for m=1,..., r 1, and x r h(x) dx. Then, [1 h(t)] t r [ = t r e itx r 1 (itx) j j= j! ] h(x) i r dx= (r 1)! x r h(x) 1 (1 u) r 1 e itxu du dx ir r! (B.1) x r h(x) dx, as t. For r β, the integral 1 t 2β 1 h(t) 2 dt<. Conversely, for r<β, the integral diverges. Therefore, for 1/2<β 2, any symmetric probability density h with finite second moment is such that I β [ h]< and condition (B.1) is verified. 8

9 References Barron, A., Schervish, M.J., Wasserman, L., The consistency of posterior distributions in nonparametric problems. The Annals of Statistics 27 (2), Castillo, I., 214. On Bayesian supremum norm contraction rates. The Annals of Statistics, 42 (5), Dattner, I., Reiß, M., Trabs, M., 213. Adaptive quantile estimation in deconvolution with unknown error distribution. Technical Report. URL< Bernoulli, to appear Davis, K.B., Mean integrated square error properties of density estimates. The Annals of Statistics 5 (3), Gao, F., van der Vaart, A., 215. Posterior contraction rates for deconvolution of Dirichlet-Laplace mixtures. Technical Report. URL< Ghosal, S., Ghosh, J.K., van der Vaart, A.W., 2. Convergence rates of posterior distributions. The Annals of Statistics 28 (2), Ghosal, S., van der Vaart, A.W., 21. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics 29 (5), Ghosal, S., van der Vaart, A., 27. Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics 35 (2), Giné, E., Nickl, R., 211. Rates of contraction for posterior distributions in L r -metrics, 1 r. The Annals of Statistics 39 (6), Giné, E., Nickl, R., 215. Mathematical Foundations of Infinite-dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. McDiarmid, C., On the method of bounded differences. In: Surveys in Combinatorics, Cambridge University Press, Cambridge, pp Parzen, E., On estimation of a probability density function and mode. Annals of Mathematical Statistics 33 (3), Scricciolo, C., 27. On rates of convergence for Bayesian density estimation. Scandinavian Journal of Statistics 34 (3), Scricciolo, C., 211. Posterior rates of convergence for Dirichlet mixtures of exponential power densities. Electronic Journal of Statistics 5, Scricciolo, C., 214. Adaptive Bayesian density estimation in L p -metrics with Pitman-Yor or normalized inverse- Gaussian process kernel mixtures. Bayesian Analysis 9 (2), Shen, W., Tokdar, S.T., Ghosal, S., 213. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika 1 (3), Shen, X., Wasserman, L., 21. Rates of convergence of posterior distributions. The Annals of Statistics 29 (3), Watson, G.S., Leadbetter, M.R., On the estimation of the probability density, I. Annals of Mathematical Statistics 34 (2),

On Deconvolution of Dirichlet-Laplace Mixtures

Working Paper Series Department of Economics University of Verona On Deconvolution of Dirichlet-Laplace Mixtures Catia Scricciolo WP Number: 6 May 2016 ISSN: 2036-2919 (paper), 2036-4679 (online) On Deconvolution