Convergence of latent mixing measures in nonparametric and mixture models

XuanLong Nguyen (xuanlong@umich.edu)
Department of Statistics, University of Michigan
Technical Report 527 (Revised, Jan 25, 2012)

AMS 2000 subject classification: Primary 62F15, 62G05; secondary 62G20. Key words and phrases: mixture distributions, hierarchical models, Bayesian nonparametrics, Wasserstein distance, f-divergence, rates of convergence, Dirichlet processes, Gaussian processes. This work was supported in part by NSF grants CCF and OCI.

Abstract

We consider Wasserstein distance functionals for assessing the convergence of latent discrete measures, which serve as mixing distributions in hierarchical and nonparametric mixture models. We clarify the relationships between Wasserstein distances of mixing distributions and f-divergence functionals, such as the Hellinger distance and the Kullback-Leibler divergence, on the space of mixture distributions, using various identifiability conditions. Convergence in Wasserstein metrics for discrete measures has a natural interpretation as the convergence of the individual atoms that provide support for the discrete measure. It is also typically stronger than the weak convergence induced by standard f-divergence metrics. We establish rates of convergence of posterior distributions for latent discrete measures in several mixture models, including finite mixtures of multivariate distributions, finite mixtures of Gaussian processes and infinite mixtures based on the Dirichlet process.

1 Introduction

A notable feature in the development of hierarchical and Bayesian nonparametric models is the role of discrete probability measures, which serve as mixing distributions to combine relatively simple models into richer classes of statistical models [18, 20]. In recent years the mixture modeling methodology has been significantly extended by many authors taking the mixing measure to be random and infinite dimensional, via suitable priors constructed in a nested, hierarchical and nonparametric manner. This results in rich models that can fit more complex and high dimensional data (see, e.g., [8, 27, 25, 23, 21] for several examples of such models, as well as a recent book [14]).

The focus of this paper is to analyze the convergence behavior of the posterior distribution of latent mixing measures as they arise in several mixture models, including infinite mixtures and other nonparametric models.

Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ denote a discrete probability measure. The atoms $\theta_i$ are elements of a space $\Theta$, while the vector of probabilities $p = (p_1,\ldots,p_k)$ lies in the $(k-1)$-dimensional probability simplex. In a mixture setting, $G$ is combined with a likelihood density $f(\cdot|\theta)$ with respect to a dominating measure $\mu$ on $\mathcal{X}$ to yield the mixture density:

$$p_G(x) = \int f(x|\theta)\, dG(\theta) = \sum_{i=1}^{k} p_i f(x|\theta_i).$$

In a clustering application, the atoms $\theta_i$ represent distinct behaviors in a heterogeneous data population, while the mixing probabilities $p_i$ are the associated proportions of such behaviors. Under this interpretation, there is a need for comparing and assessing the quality of the discrete measure $\hat G$ estimated on the basis of available data. An important work in this direction is by Chen [3], who used the $L_1$ metric on the cumulative distribution functions on the real line to study convergence rates of the mixing measure $G$. Building upon Chen's work, Ishwaran, James and Sun [15] established the posterior consistency of a finite dimensional Dirichlet prior for Bayesian mixture models. Their analysis is specific to univariate finite mixture models, with $k$ bounded by a known constant, while our interest is in cases where $k$ may be unbounded and/or $\Theta$ has high or infinite dimensions. For instance, $\Theta$ may be a subset of a function space as in the work of [8, 21], or a space of probability measures [25].

The analysis of consistency and convergence rates of posterior distributions for Bayesian estimation has seen much progress in the past decade. Key recent references include [1, 11, 26, 34, 12, 35]. Specific mixture models in a Bayesian setting have also been studied extensively [10, 9, 16, 13]. All these works focus primarily on convergence in the topology of the Hellinger or a comparable distance metric on the space of data densities $p_G$. On the other hand, there is far less work concerning the convergence behavior of the latent mixing measures $G$. Notably, the analysis of convergence for mixing (smooth) densities has often arisen in the context of frequentist estimation for deconvolution problems, mainly with kernel density estimation methods [36, 6].

The primary contribution of this paper is to show that Wasserstein distances provide a natural and useful metric for the analysis of convergence for latent and discrete mixing measures in mixture models, and to establish convergence rates of posterior distributions in a number of well-known Bayesian nonparametric and mixture models. Wasserstein distances originally arose in the problem of optimal transportation [32]. They have been utilized in a number of statistical contexts (e.g., [5, 19, 2, 4]). For discrete probability measures, they can be obtained by a minimum matching (or moving) procedure between the sets of atoms that provide support for the measures under comparison, and consequently are simple to compute. Suppose that $\Theta$ is equipped with a metric $\rho$. Let $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$. Then the $L_r$ Wasserstein distance metric on the space of discrete probability measures with support in $\Theta$, namely $\bar{\mathcal{G}}(\Theta)$, is:

$$d_\rho(G, G') = \left[\inf_q \sum_{i,j} q_{ij}\, \rho^r(\theta_i, \theta'_j)\right]^{1/r},$$

where the infimum is taken over all joint probability distributions $q$ on $\{1,\ldots,k\} \times \{1,\ldots,k'\}$ such that $\sum_j q_{ij} = p_i$ and $\sum_i q_{ij} = p'_j$.
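Since the infimum above ranges over a finite set of linear constraints, computing $d_\rho$ for two discrete measures is a small linear program over couplings $q$. The following minimal sketch (our own illustration; all names are ours, not the paper's) evaluates the $L_r$ Wasserstein distance with scipy's LP solver:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, atoms, p2, atoms2, r=1):
    """L_r Wasserstein distance between G = sum_i p[i] delta_{atoms[i]}
    and G' = sum_j p2[j] delta_{atoms2[j]}, via the transportation LP."""
    k, k2 = len(p), len(p2)
    # cost[i, j] = rho(theta_i, theta'_j)^r with rho the Euclidean metric
    cost = np.linalg.norm(atoms[:, None, :] - atoms2[None, :, :], axis=2) ** r
    # marginal constraints: sum_j q_ij = p_i and sum_i q_ij = p'_j
    A_eq = np.zeros((k + k2, k * k2))
    for i in range(k):
        A_eq[i, i * k2:(i + 1) * k2] = 1.0
    for j in range(k2):
        A_eq[k + j, j::k2] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, p2]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / r)

# two discrete measures on R^2 with different numbers of atoms
G_p, G_atoms = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 0.0]])
H_p, H_atoms = np.array([0.3, 0.3, 0.4]), np.array([[0.0, 0.1], [1.0, 0.0], [0.5, 0.5]])
print(wasserstein_discrete(G_p, G_atoms, H_p, H_atoms, r=1))
```

The LP scales with the product of the numbers of atoms, which is what makes these distances simple to compute for the discrete measures considered here.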

As is clear from this definition, the Wasserstein distances inherit directly the metric of the space of atomic support $\Theta$, suggesting that they can be useful for assessing estimation procedures for discrete measures in hierarchical models. It is worth noting that if $(G_n)_{n \ge 1}$ is a sequence of discrete probability measures with $k$ distinct atoms and $G_n$ tends to some $G_0$ in the $d_\rho$ metric, then $G_n$'s ordered set of atoms must converge to $G_0$'s atoms in $\rho$, after some permutation of the atom labels. Thus, in the clustering application illustrated above, convergence of the mixing measure $G$ may be interpreted as convergence of the distinct typical behaviors $\theta_i$ that characterize the heterogeneous data population. A hint of the relevance of Wasserstein distances can be drawn from the observation that the $L_1$ distance between the CDFs of univariate random variables, as studied by Chen [3], is in fact a special case of the $L_1$ Wasserstein metric when $\Theta = \mathbb{R}$.

The plan for the paper is as follows. Section 2 explores the relationship between Wasserstein distances for mixing measures and well-known divergence functionals for mixture densities in a mixture model. We produce a simple lemma which bounds f-divergences between mixture densities from above by certain Wasserstein distances between mixing measures. This implies that the $d_\rho$ topology can be stronger than those induced by divergences between mixture densities. Next, we consider various identifiability conditions under which convergence of mixture densities entails convergence of mixing measures in the Wasserstein metric. We present two theorems, which provide upper bounds on $d_\rho(G, G')$ in terms of divergences between $p_G$ and $p_{G'}$. Theorem 1 is applicable to mixing measures with a bounded number of support atoms, generalizing a result from [3]. Theorem 2 is applicable to mixing measures with an unbounded number of support points, but is restricted to convolution mixture models.

Section 3 focuses on the convergence of posterior distributions of latent mixing measures in a Bayesian nonparametric setting. Here, the mixing measure $G$ is endowed with a prior distribution $\Pi$. Assuming an $n$-sample $X_1,\ldots,X_n$ generated according to $p_{G_0}$, we study conditions under which the posterior distribution of $G$, namely $\Pi(\cdot|X_1,\ldots,X_n)$, contracts to the truth $G_0$ under the $d_\rho$ metric, and provide the contraction rates. In Theorems 3 and 4 of Section 3, we establish convergence rates for the posterior distribution of $G$ in terms of the $d_\rho$ metric. These results are proved using the standard approach of Ghosal, Ghosh and van der Vaart [11]. Our convergence theorems have several notable features. They rely on separate conditions for the prior $\Pi$ and the likelihood function $f$, which are typically simpler to verify than conditions formulated in terms of mixture densities. The claim of convergence in Wasserstein metrics is typically stronger than the weak convergence induced by the Hellinger metric in the existing work mentioned above.

In Section 4, posterior consistency and convergence rates of latent mixing measures are derived for a number of well-known mixture models in the literature, including finite mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and finite mixtures of Gaussian processes. For finite mixtures with a bounded number of support atoms, the posterior convergence rate for the mixing measure is the minimax optimal $n^{-1/4}$ under suitable identifiability conditions. For Dirichlet process mixtures defined on $\mathbb{R}^d$, specific rates are established under smoothness conditions on the likelihood density function $f$.

In particular, for ordinary smooth likelihood densities with smoothness $\beta$ (e.g., Laplace), the rate achieved is $(\log n/n)^{\gamma/2}$ for any $\gamma < 4/\bigl((d+2)(4+(2\beta+1)d)\bigr)$. For supersmooth likelihood densities with smoothness $\beta$ (e.g., normal), the rate achieved is $(\log n)^{-1/\beta}$. Finally, for finite mixtures of Gaussian processes, we are also able to establish a convergence rate by utilizing a result on (single) Gaussian process priors by van der Vaart and van Zanten [31].

Notation. For ease of notation, we also use $f_i$ in place of $f(\cdot|\theta_i)$, and $f'_j$ in place of $f(\cdot|\theta'_j)$, for likelihood density functions. Divergences studied in the paper include the total variational distance, $d_V(p_G, p_{G'}) := \frac{1}{2}\int |p_G(x) - p_{G'}(x)|\, d\mu$; the Hellinger distance, $d_h^2(p_G, p_{G'}) := \frac{1}{2}\int (\sqrt{p_G(x)} - \sqrt{p_{G'}(x)})^2\, d\mu$; and the Kullback-Leibler divergence, $d_K(p_G, p_{G'}) = \int p_G(x) \log(p_G(x)/p_{G'}(x))\, d\mu$. These divergences are related by $d_V^2/2 \le d_h^2 \le d_V$ and $d_h^2 \le d_K/2$. $N(\epsilon, \Theta, \rho)$ denotes the covering number of the metric space $(\Theta, \rho)$, i.e., the minimum number of $\epsilon$-balls needed to cover the entire space $\Theta$. $D(\epsilon, \Theta, \rho)$ denotes the packing number of $(\Theta, \rho)$, i.e., the maximum number of points that are mutually separated by at least $\epsilon$ in distance. They are related by $N(\epsilon, \Theta, \rho) \le D(\epsilon, \Theta, \rho) \le N(\epsilon/2, \Theta, \rho)$. $\mathrm{Diam}(\Theta)$ denotes the diameter of $\Theta$.

2 Wasserstein distances for mixing measures

2.1 Definition and a basic inequality

Let $(\Theta, \rho)$ be a space equipped with a non-negative distance function $\rho$ such that $\rho(\theta_1, \theta_2) = 0$ if and only if $\theta_1 = \theta_2$. If, in addition, $\rho$ is symmetric ($\rho(\theta_1, \theta_2) = \rho(\theta_2, \theta_1)$) and satisfies the triangle inequality, then it is a proper metric. A discrete probability measure $G$ on a measure space equipped with the Borel sigma algebra takes the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ for some $k \in \mathbb{N} \cup \{+\infty\}$, where $p = (p_1, p_2, \ldots, p_k)$ denotes the proportion vector, while $\theta = (\theta_1, \ldots, \theta_k)$ are the associated atoms in $\Theta$. The vector $p$ has to satisfy $0 \le p_i \le 1$ and $\sum_{i=1}^{k} p_i = 1$. Likewise, $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$ is another discrete probability measure that has at most $k'$ distinct atoms. Let $\mathcal{G}_k(\Theta)$ denote the space of all discrete probability measures with at most $k$ atoms, and let $\mathcal{G}(\Theta) = \bigcup_{k \ge 1} \mathcal{G}_k(\Theta)$ be the set of all discrete measures with finite support. Finally, $\bar{\mathcal{G}}(\Theta)$ denotes the space of all discrete measures (including those with countably infinite support).

Let $q = (q_{ij})_{i \le k;\, j \le k'} \in [0,1]^{k \times k'}$ denote a joint probability distribution on $\mathbb{N}_+ \times \mathbb{N}_+$ that satisfies the marginal constraints $\sum_{i=1}^{k} q_{ij} = p'_j$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for any $i = 1,\ldots,k$ and $j = 1,\ldots,k'$. Let $\mathcal{Q}(p, p')$ denote the space of all such joint distributions. We start with the $L_1$ Wasserstein distance:

Definition 1. Let $\rho$ be a distance function on $\Theta$. The Wasserstein distance functional for two discrete measures $G(p, \theta)$ and $G'(p', \theta')$ is:

$$d_\rho(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, \rho(\theta_i, \theta'_j). \qquad (1)$$
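For concreteness, here is a small worked instance of (1) (our own example). If $G = \frac{1}{2}\delta_{\theta_1} + \frac{1}{2}\delta_{\theta_2}$ and $G' = \delta_{\theta'}$, the marginal constraints force the unique coupling $q_{11} = q_{21} = 1/2$, so

$$d_\rho(G, G') = \tfrac{1}{2}\rho(\theta_1, \theta') + \tfrac{1}{2}\rho(\theta_2, \theta'):$$

all the mass at each atom of $G$ must be transported to the single atom of $G'$, and the distance is the total transport cost.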

We focus mainly on the $L_1$ Wasserstein distance $d_\rho$ and the $L_2$ Wasserstein distance. The latter corresponds to the square root of $d_{\rho^2}$ in our definition, where $\rho(\theta_i, \theta'_j)$ is replaced by $\rho^2(\theta_i, \theta'_j)$. Note that $d_\rho^2(G, G') \le d_{\rho^2}(G, G')$ by an application of the Cauchy-Schwarz inequality. We will consider a variety of choices of the distance $\rho$ in the sequel.

From here on, a discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ plays the role of the mixing distribution in a mixture model. Let $f(x|\theta)$ denote the density (with respect to a dominating measure $\mu$) of a random variable $X$ taking values in $\mathcal{X}$, given parameter $\theta \in \Theta$. For ease of notation, we also write $f_i(x)$ for $f(x|\theta_i)$. Combining $G$ with the likelihood function $f$ yields a mixture distribution for $X$ with density $p_G(x) = \int f(x|\theta)\, dG(\theta) = \sum_{i=1}^{k} p_i f_i(x)$.

First we state a general result that relates Wasserstein distances between mixing measures $G, G'$, namely $d_\rho(G, G')$, and divergences between the mixture densities $p_G, p_{G'}$. Divergence functionals that play an important role in this paper are the total variational distance, the Hellinger distance, and the Kullback-Leibler divergence. All of these are in fact instances of a broader class of divergence functionals known as the f-divergences (Csiszar, 1966; Ali & Silvey, 1967):

Definition 2. Let $\phi : \mathbb{R} \to \mathbb{R}$ denote a convex function. An f-divergence (or Ali-Silvey distance) between two probability densities $f_i$ and $f'_j$ is defined as $d_\phi(f_i, f'_j) = \int \phi(f'_j/f_i)\, f_i\, d\mu$. Likewise, the f-divergence between $p_G$ and $p_{G'}$ is $d_\phi(p_G, p_{G'}) = \int \phi(p_{G'}/p_G)\, p_G\, d\mu$.

For $\phi(u) = \frac{1}{2}(\sqrt{u} - 1)^2$ we obtain the squared Hellinger distance. For $\phi(u) = \frac{1}{2}|u - 1|$ we obtain the variational distance. For $\phi(u) = -\log u$ we obtain the Kullback-Leibler divergence. f-divergence functionals can be used as a distance function on $\Theta$, motivating the following definition.

Definition 3. When $\rho$ is taken to be an f-divergence, $\rho(\theta_i, \theta'_j) = d_\phi(f_i, f'_j)$ for a convex function $\phi$, the corresponding Wasserstein distance functional is called a composite Wasserstein distance:

$$d_{\rho_\phi}(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, d_\phi(f_i, f'_j).$$

In particular, $d_V, d_h, d_K$ induce the composite Wasserstein distances $d_{\rho_V}, d_{\rho_h}, d_{\rho_K}$, respectively. Let $d_{\rho_{h^2}}(G, G')$ denote the composite Wasserstein distance obtained by taking $\rho(\theta_i, \theta'_j) := d_h^2(f_i, f'_j)$.

The following result shows that any f-divergence between the mixture distributions $p_G, p_{G'}$ is dominated by a Wasserstein distance for a suitable choice of distance $\rho$.

Lemma 1. Let $G, G' \in \bar{\mathcal{G}}(\Theta)$ be such that both $d_\phi(p_G, p_{G'})$ and $d_{\rho_\phi}(G, G')$ are finite for some convex function $\phi$. Then $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.
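Lemma 1 can be checked numerically on a toy case. The sketch below (our own illustration; the names are not from the paper) compares the squared Hellinger distance between two univariate Gaussian location mixtures, computed on a grid, against the composite Wasserstein bound $d_{\rho_{h^2}}(G, G')$, using the closed form $d_h^2(f_i, f'_j) = 1 - \exp(-(\theta_i - \theta'_j)^2/8)$ for unit-variance normals:

```python
import numpy as np

# squared Hellinger between unit-variance normals N(a,1) and N(b,1)
hell2 = lambda a, b: 1.0 - np.exp(-(a - b) ** 2 / 8.0)

p,  th  = np.array([0.4, 0.6]), np.array([-1.0, 2.0])   # G
p2, th2 = np.array([0.7, 0.3]), np.array([0.0, 2.5])    # G'

# composite Wasserstein d_{rho_{h^2}}(G, G'): for two 2-atom measures the
# coupling has one free parameter t = q_11, the objective is linear in t,
# so it suffices to check the two endpoints of the feasible interval.
cost = np.array([[hell2(a, b) for b in th2] for a in th])
def obj(t):
    q = np.array([[t, p[0] - t], [p2[0] - t, 1.0 - p[0] - p2[0] + t]])
    return np.sum(q * cost)
lo, hi = max(0.0, p[0] + p2[0] - 1.0), min(p[0], p2[0])
rhs = min(obj(lo), obj(hi))

# d_h^2(p_G, p_G') between the two mixture densities, on a grid
x = np.linspace(-12.0, 12.0, 4001)
gauss = lambda t: np.exp(-(x - t) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
pG  = p  @ np.array([gauss(t) for t in th])
pG2 = p2 @ np.array([gauss(t) for t in th2])
lhs = 0.5 * np.sum((np.sqrt(pG) - np.sqrt(pG2)) ** 2) * (x[1] - x[0])
print(lhs, "<=", rhs)  # Lemma 1 in action
```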

As will be evident in the sequel, this lemma is also handy in enabling us to obtain lower bounds on small Kullback-Leibler ball probabilities in the space of mixture densities $p_G$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$. The latter quantities are typically easier to estimate than the former.

Example 1. Suppose that $\Theta = \mathbb{R}^d$, $\rho$ is the Euclidean metric, and $f(x|\theta)$ is the multivariate normal density $N(\theta, I_{d\times d})$ with mean $\theta$ and identity covariance matrix. Then

$$d_h^2(f_i, f'_j) = 1 - \exp\Bigl(-\tfrac{1}{8}\|\theta_i - \theta'_j\|^2\Bigr) \le \tfrac{1}{8}\|\theta_i - \theta'_j\|^2.$$

So $d_{\rho_{h^2}}(G, G') \le d_{\rho^2}(G, G')/8$, and the above lemma then entails that $d_h^2(p_G, p_{G'}) \le d_{\rho^2}(G, G')/8$. Similarly, for the Kullback-Leibler divergence, since $d_K(f_i, f'_j) = \frac{1}{2}\|\theta_i - \theta'_j\|^2$, by Lemma 1, $d_K(p_G, p_{G'}) \le d_{\rho_K}(G, G') = \frac{1}{2} d_{\rho^2}(G, G')$. For another example, if $f(x|\theta)$ is a Gamma density with location parameter $\theta$, and $\Theta$ is a compact subset of $\mathbb{R}$ bounded away from 0, then $d_K(f_i, f'_j) = O(|\theta_i - \theta'_j|)$. This entails that $d_K(p_G, p_{G'}) \le O(d_\rho(G, G'))$.

2.2 Wasserstein metric identifiability in finite mixture models

Lemma 1 shows that for many choices of $\rho$, $d_\rho$ yields a stronger topology on $\bar{\mathcal{G}}(\Theta)$ than the topology induced by f-divergences on the space of mixture distributions $p_G$. In other words, convergence of $p_G$ may not imply convergence of $G$ in the $d_\rho$ metric. To ensure this property, additional conditions are needed on the space of discrete measures $\bar{\mathcal{G}}(\Theta)$, along with identifiability conditions for the family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$.

The classical definition of Teicher [29] specifies the family $\{f(\cdot|\theta), \theta \in \Theta\}$ to be identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $p_G = p_{G'}$ implies $G = G'$. We need a slightly stronger version, allowing for the inclusion of discrete measures with infinite support:

Definition 4. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is finitely identifiable if for any $G \in \mathcal{G}(\Theta)$ and $G' \in \bar{\mathcal{G}}(\Theta)$, $p_G(x) = p_{G'}(x)$ for almost all $x \in \mathcal{X}$ implies that $G = G'$.

To obtain convergence rates, we also need the notion of strong identifiability of [3], herein adapted to a multivariate setting.

Definition 5. Assume that $\Theta \subseteq \mathbb{R}^d$ and $\rho$ is the Euclidean metric. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice differentiable in $\theta$, and for any finite $k$ and any $k$ distinct $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Bigl| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + \beta_i^T Df(x|\theta_i) + \gamma_i^T D^2 f(x|\theta_i)\gamma_i \Bigr| = 0 \qquad (2)$$

implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \mathbb{R}^d$ for $i = 1,\ldots,k$. Here, for each $x$, $Df(x|\theta_i)$ and $D^2 f(x|\theta_i)$ denote the gradient and the Hessian at $\theta_i$ of the function $f(x|\cdot)$, respectively.

Finite identifiability is satisfied by the family of Gaussian distributions [28]; see also Theorem 1 of [16]. Chen identified a broad class of families, including the Gaussian family, for which the strong identifiability condition holds [3].
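Returning to Example 1 above: as a quick check (our own derivation), the quoted Kullback-Leibler formula for unit-covariance Gaussians follows from a one-line computation,

$$d_K(f_i, f'_j) = \mathbb{E}_{X \sim N(\theta_i, I)}\Bigl[\tfrac{1}{2}\|X - \theta'_j\|^2 - \tfrac{1}{2}\|X - \theta_i\|^2\Bigr] = \tfrac{1}{2}\|\theta_i - \theta'_j\|^2,$$

since the log-density ratio is quadratic and $\mathbb{E}(X - \theta_i) = 0$.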

7 Define ψ(g, G ) = sup x p G (x) p G (x) /d ρ 2(G, G ) if G G and otherwise. Also define ψ 1 (G, G ) = d V (p G, p G )/d ρ 2(G, G ) if G G and otherwise. The notion of strong identifiability is useful via the following key result, which generalizes Chen s result to Θ of arbitrary dimensions. Theorem 1. (Strong identifiability). Suppose that Θ is compact subset of R d, the family {f( θ), θ Θ} is strongly identifiable, and for all x X, the Hessian matrix D 2 f(x θ) satisfies a uniform Lipschitz condition γ T (D 2 f(x θ 1 ) D 2 f(x θ 2 ))γ Cρ(θ 1, θ 2 ) δ γ 2, (3) for all x, θ 1, θ 2 and some fixed C and δ > 0. Then, for fixed G 0 G k (Θ), where k < : { } lim inf ψ(g, G ) : d ρ (G 0, G) d ρ (G 0, G ) ǫ > 0. (4) ǫ 0 G,G G k (Θ) The assertion also holds with ψ being replaced by ψ 1. Remarks. (i) In Section 5.2 the notion of strong identifiability is extended to an infinite dimensional setting via first and second order Fréchet derivatives in normed spaces. (ii) Suppose that G 0 has exactly k support points in Θ. Then, an examination of the proof reveals that the requirement that Θ be compact is not needed. Indeed, if there is a sequence of G n G k (Θ) such that d ρ (G 0, G n ) 0, then it is simple to show that there is a subsequence of G n that also has k distinct atoms, which converge in ρ metric to the set of k atoms of G 0 (up to some permutation of the labels). The proof of the theorem proceeds as before. For the rest of this paper, by strong identifiability we always mean conditions specified in Theorem 1 so that Eq. (4) can be deduced. This practically means that the conditions specified by Eq. (2) and Eq. (3) be given, while the compactness of Θ may sometimes be required. 2.3 Wasserstein metric identifiability in infinite mixture models Next, we state a counterpart of Theorem 1 for G, G Ḡ(Θ), i.e., mixing measures with infinitely many support points. We restrict our attention to convolution mixture models on R d. That is, the likelihood density function f(x θ), with respect to Lebesgue, takes the form f(x θ) for some multivariate density function f on R d. Thus, p G (x) = G f(x) = k i=1 p if(x θ i ) and p G (x) = G f(x) = k j=1 p j f(x θ j ). As before, we need a compactness condition for Θ. Additional key assumptions concern with the smoothness of density function f. This is characterized in terms of the tail behavior of the Fourier transform of f. We consider both ordinary smooth densities (e.g., Laplace and Gamma), and supersmooth densities (e.g., normal). The following result does not require that G, G be discrete. 7

Theorem 2. Suppose that $G, G'$ are probability measures that place full support on a compact set $\Theta \subset \mathbb{R}^d$, and $f$ is a density function on $\mathbb{R}^d$ that is symmetric (around 0), i.e., $\int_A f\,dx = \int_{-A} f\,dx$ for any Borel set $A \subseteq \mathbb{R}^d$. Moreover, assume that $\tilde f(\omega) \ne 0$ for all $\omega \in \mathbb{R}^d$.

(1) Ordinary smooth likelihood. Suppose $|\tilde f(\omega)| \prod_{j=1}^{d} |\omega_j|^{\beta} \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constant $d_0$ and constant $\beta$. Then for any $m < 4/(4 + (2\beta+1)d)$ there is some constant $C(d, \beta, m)$, depending only on $d$, $\beta$ and $m$, such that

$$d_{\rho^2}(G, G') \le C(d, \beta, m)\, d_V(p_G, p_{G'})^{m}$$

as $d_V(p_G, p_{G'}) \to 0$.

(2) Supersmooth likelihood. Suppose $|\tilde f(\omega)| \prod_{j=1}^{d} \exp(|\omega_j|^{\beta}/\gamma) \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constants $\beta, \gamma, d_0$. Then there is some constant $C(d, \beta)$, depending only on $d$ and $\beta$, such that

$$d_{\rho^2}(G, G') \le C(d, \beta)\bigl(-\log d_V(p_G, p_{G'})\bigr)^{-2/\beta}$$

as $d_V(p_G, p_{G'}) \to 0$.

Example 2. For the standard normal density on $\mathbb{R}^d$, $\tilde f(\omega) = \prod_{j=1}^{d} \exp(-\omega_j^2/2)$, and we obtain that $d_{\rho^2}(G, G') \lesssim (-\log d_V(p_G, p_{G'}))^{-1}$ as $d_\rho(G, G') \to 0$ (so that $d_V(p_G, p_{G'}) \to 0$, by Lemma 1). For a Laplace density on $\mathbb{R}$, e.g., $\tilde f(\omega) = \frac{1}{1+\omega^2}$, we have $d_{\rho^2}(G, G') \lesssim d_V(p_G, p_{G'})^{m}$ for any $m < 4/(4 + 5d)$, as $d_\rho(G, G') \to 0$.

3 Convergence of posterior distributions of mixing measures

We are ready to study the convergence of discrete mixing measures in a Bayesian setting. Let $X_1, \ldots, X_n$ be an iid sample from the mixture density $p_G(x) = \int f(x|\theta)\, dG(\theta)$, where $f$ is known, while $G = G_0$ for some unknown mixing measure in $\mathcal{G}_k(\Theta)$. The number of support points $k$ may not be known. In the Bayesian framework, $G$ is endowed with a prior distribution $\Pi$ on a suitable measure space of discrete probability measures in $\bar{\mathcal{G}}(\Theta)$. The posterior distribution of $G$ is given by, for any measurable set $B$:

$$\Pi(B|X_1,\ldots,X_n) = \int_B \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G) \Bigm/ \int \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G).$$

We shall study conditions under which the posterior distribution is consistent, i.e., it concentrates on arbitrarily small $d_\rho$ neighborhoods of $G_0$, and establish the rates of this convergence. The analysis is based upon the general framework of Ghosal, Ghosh and van der Vaart [11], who analyzed the convergence behavior of posterior distributions in terms of f-divergences, such as the Hellinger and variational distances, on the mixture densities of the data.
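As a toy illustration of the posterior formula above (our own sketch, not an algorithm from the paper): with a prior supported on finitely many candidate mixing measures, the posterior is just a likelihood reweighting of the prior, and its mass accumulates on the candidate generating the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# truth: G0 = 0.5 delta_{-1} + 0.5 delta_{+1}, normal location mixture
n = 200
X = rng.choice([-1.0, 1.0], size=n) + rng.standard_normal(n)

# a hypothetical prior supported on three candidate mixing measures
candidates = [                                       # (weights p, atoms theta)
    (np.array([0.5, 0.5]), np.array([-1.0, 1.0])),   # the truth G0
    (np.array([0.5, 0.5]), np.array([-2.0, 2.0])),
    (np.array([1.0]), np.array([0.0])),
]
prior = np.full(3, 1.0 / 3.0)

def log_lik(p, th):
    # log prod_i p_G(X_i), where p_G(x) = sum_i p_i N(x | theta_i, 1)
    dens = (p[None, :] * np.exp(-(X[:, None] - th[None, :]) ** 2 / 2.0)
            / np.sqrt(2.0 * np.pi)).sum(axis=1)
    return np.log(dens).sum()

logw = np.log(prior) + np.array([log_lik(p, th) for p, th in candidates])
post = np.exp(logw - logw.max()); post /= post.sum()
print(post)  # posterior probabilities; the first entry -> 1 as n grows
```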

In the following we formulate two convergence theorems for the mixture model setting (which can be viewed as counterparts of Theorems 2.1 and 2.4 of [11]). A notable feature of our theorems is that the conditions (e.g., entropy and prior concentration) are stated in terms of the Wasserstein metric, as opposed to f-divergences on the mixture densities. They can typically be separated into independent conditions on the prior for $G$ and on the likelihood family, and are simpler to verify for mixture models. In addition, the convergence of the posterior distribution of mixing measures is established in terms of Wasserstein distance metrics. The following notion plays a central role in the theorems' formulation.

Definition 6. Let $\mathcal{G}$ be a subset of $\bar{\mathcal{G}}(\Theta)$. For each $k < \infty$, define the Hellinger information of the $d_\rho$ metric as the real-valued function $C_k(\mathcal{G}, \cdot) : \mathbb{R} \to \mathbb{R}$,

$$C_k(\mathcal{G}, r) = \inf_{G_0 \in \mathcal{G}_k(\Theta),\ G \in \mathcal{G}:\ d_\rho(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)/2. \qquad (5)$$

It is obvious that $C_k$ is a non-negative and non-decreasing function. The following characterization of $C_k$ follows immediately from the results obtained in the previous section.

Proposition 1. (a) If $\mathcal{G}$ and $\mathcal{G}_k(\Theta)$ are both compact in the Wasserstein topology and the family of likelihood functions is finitely identifiable, then $C_k(\mathcal{G}, r) > 0$ for any $r > 0$.
(b) If $\Theta \subset \mathbb{R}^d$ is compact and the family of likelihood functions is strongly identifiable as specified in Theorem 1, then for each $k$ there is a constant $c(k) > 0$ such that $C_k(\mathcal{G}_k(\Theta), r) \ge c(k) r^4$ for all $r > 0$.
(c) If $\Theta \subset \mathbb{R}^d$ is compact and the family of likelihood functions is ordinary smooth with parameter $\beta$, as specified in Theorem 2, then for any $\epsilon > 0$ there is some constant $c(d, \beta)$ such that $C_k(\bar{\mathcal{G}}(\Theta), r) \ge c(d, \beta)\, r^{4 + (2\beta+1)d + \epsilon}$ for any $r > 0$ and any $k > 0$. For a supersmooth likelihood family, we have $C_k(\bar{\mathcal{G}}(\Theta), r) \ge \exp[-c(d, \beta)\, r^{-\beta}]$ for any $r > 0$ and any $k > 0$.

The following two theorems rely on three types of conditions. The first is concerned with the size of the support of $\Pi$, often quantified in terms of its entropy number. Estimates of entropy numbers defined in terms of Wasserstein metrics for several measure classes of interest are given in Lemma 2. The second is on the Kullback-Leibler support of $\Pi$, which is related both to the space of discrete measures $\bar{\mathcal{G}}(\Theta)$ and to the family of likelihood functions $f(x|\theta)$. The Kullback-Leibler neighborhood is defined as:

$$B(\epsilon) = \Bigl\{ G \in \bar{\mathcal{G}}(\Theta) : P_{G_0}\Bigl(\log \frac{p_{G_0}}{p_G}\Bigr) \le \epsilon^2,\ P_{G_0}\Bigl(\log \frac{p_{G_0}}{p_G}\Bigr)^2 \le \epsilon^2 \Bigr\}. \qquad (6)$$

The third condition is on the Hellinger information of the $d_\rho$ metric, the function $C_k(\mathcal{G}, r)$, a characterization of which is given above.
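To get a feel for the Hellinger information, one can approximate the infimum in (5) by random search over pairs of mixing measures. The sketch below (our own illustration, for unit-variance Gaussian location mixtures on $\Theta = [-1, 1]$) produces values roughly consistent with the $r^4$ lower bound of Proposition 1(b); being a random search, it only yields an upper estimate of the true infimum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-8.0, 8.0, 1001); dx = x[1] - x[0]
gauss = lambda t: np.exp(-(x - t) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def mix(p, th):
    return p @ np.array([gauss(t) for t in th])

def hell2(f, g):
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx

def w1(p, th, p2, th2):
    """1-D L1 Wasserstein between discrete measures via the CDF difference."""
    a = np.concatenate([th, th2]); o = np.argsort(a)
    gap = np.cumsum(np.concatenate([p, -p2])[o])[:-1]
    return np.sum(np.abs(gap) * np.diff(a[o]))

def hellinger_information(r, k=2, trials=5000):
    """Random-search approximation of C_k(G_k(Theta), r), Theta = [-1, 1]."""
    best = np.inf
    for _ in range(trials):
        p, p2 = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
        th, th2 = rng.uniform(-1, 1, k), rng.uniform(-1, 1, k)
        if w1(p, th, p2, th2) >= r / 2.0:
            best = min(best, hell2(mix(p, th), mix(p2, th2)) / 2.0)
    return best

for r in (0.2, 0.4, 0.8):
    print(r, hellinger_information(r))
```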

Theorem 3. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $(\epsilon_n)_{n \ge 1}$ tending to a constant (or to 0) such that $n\epsilon_n^2 \to \infty$, sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, and a constant $C > 0$, we have

$$\log D(\epsilon_n, \mathcal{G}_n, d_\rho) \le n\epsilon_n^2, \qquad (7)$$
$$\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) \le \exp[-n\epsilon_n^2 (C+4)], \qquad (8)$$
$$\Pi(B(\epsilon_n)) \ge \exp(-n\epsilon_n^2 C). \qquad (9)$$

Moreover, suppose $M_n$ is a sequence such that

$$C_k(\mathcal{G}_n, M_n \epsilon_n) \ge \epsilon_n^2 (C+4), \qquad (10)$$
$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (11)$$

Then $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

A stronger theorem (using a substantially weaker condition on the covering number) can be formulated as follows.

Theorem 4. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $\epsilon_n \to 0$ such that $n\epsilon_n^2$ is bounded away from 0 or tends to infinity, and sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, we have

$$\log D(\epsilon/2, \{G \in \mathcal{G}_n : \epsilon \le d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le n\epsilon_n^2 \quad \text{for every } \epsilon \ge \epsilon_n, \qquad (12)$$
$$\frac{\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n)}{\Pi(B(\epsilon_n))} = o(\exp(-2n\epsilon_n^2)), \qquad (13)$$
$$\frac{\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)}{\Pi(B(\epsilon_n))} \le \exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \quad \text{for any } j \ge M_n, \qquad (14)$$

where $M_n$ is a sequence such that

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \to 0. \qquad (15)$$

Then $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

Remarks. (i) The above statement continues to hold if conditions (14) and (15) are replaced by the following condition:

$$\frac{\exp(2n\epsilon_n^2)}{\Pi(B(\epsilon_n))} \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (16)$$

(ii) The above theorems are stated for the $L_1$ Wasserstein metric $d_\rho$, but they also hold for the $L_2$ Wasserstein metric $d_{\rho^2}^{1/2}$, with a slight modification of the definition of the Hellinger information function to $C_k(\mathcal{G}, r) = \inf_{d_{\rho^2}^{1/2}(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)$.

Before moving to specific examples, we state a simple lemma which provides estimates of the entropy under the $d_\rho$ metric for a number of classes of discrete measures of interest. Because $d_\rho$ inherits directly the $\rho$ metric on $\Theta$, the entropy of classes in $(\bar{\mathcal{G}}(\Theta), d_\rho)$ can typically be bounded in terms of covering numbers of subsets of $(\Theta, \rho)$. These bounds will be used extensively in the sequel. (A worked instance of part (a) is given after Assumptions A below.)

Lemma 2. (a) $\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \le k\bigl(\log N(\epsilon, \Theta, \rho) + \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)\bigr)$.
(b) $\log N(2\epsilon, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon, \Theta, \rho)\log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)$.
(c) Let $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. Assume that $M = \max_{i \le k} 1/p_i^* < \infty$ and $m = \min_{i \ne j \le k} \rho(\theta_i^*, \theta_j^*) > 0$. Then

$$\log N(\epsilon/2, \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le k\Bigl(\sup_{\Theta'} \log N(\epsilon/4, \Theta', \rho) + \log(32k\,\mathrm{Diam}(\Theta)/m)\Bigr),$$

where the supremum on the right side is taken over all bounded subsets $\Theta' \subseteq \Theta$ such that $\mathrm{Diam}(\Theta') \le 4M\epsilon$.

4 Examples

In this section the general theory is illustrated on specific mixture models, including finite mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and finite mixtures of Gaussian processes.

4.1 Finite mixtures of multivariate distributions

Let $\Theta$ be a subset of $\mathbb{R}^d$, let $\rho$ be the Euclidean metric, and let $\Pi$ be a prior distribution for discrete measures in $\mathcal{G}_k(\Theta)$, where $k < \infty$ is known. Suppose that the truth is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. To obtain the convergence rate of the posterior distribution of $G$, we need:

Assumptions A.
(A1) $\Theta$ is compact and the family of likelihood functions $f(\cdot|\theta)$ is strongly identifiable.
(A2) For some positive constant $C_1$, $d_K(f_i, f'_j) \le C_1 \rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)$ for some constant $C_2 > 0$.
(A3) Under the prior $\Pi$, for small $\delta > 0$, $c_3 \delta^k \le \Pi(|p_i - p_i^*| \le \delta,\ i = 1,\ldots,k) \le C_3 \delta^k$ and $c_3 \delta^{kd} \le \Pi(\rho(\theta_i, \theta_i^*) \le \delta,\ i = 1,\ldots,k) \le C_3 \delta^{kd}$ for some constants $c_3, C_3 > 0$.
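As promised above, here is a worked instance of Lemma 2(a) (our own calculation). For $\Theta = [0,1]^d$ with the Euclidean metric, $\log N(\epsilon, \Theta, \rho) \asymp d\log(1/\epsilon)$, so

$$\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \lesssim k\bigl(d\log(1/\epsilon) + \log(e + e\sqrt{d}/\epsilon)\bigr) \asymp k(d+1)\log(1/\epsilon);$$

that is, entropy-wise the class of $k$-atom mixing measures behaves like a parametric set of dimension of order $k(d+1)$, matching (up to constants) its $k$ weights and $kd$ atom coordinates.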

Remarks. (A1) and (A2) hold for the family of Gaussian densities with mean parameter $\theta$. (A3) holds when the prior distribution on the relevant parameters behaves like a uniform distribution, up to a multiplicative constant.

Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$. Combining Lemma 1 with Assumption (A2), if $\rho(\theta_i^*, \theta_i) \le \epsilon$ and $|p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2)$ for $i = 1,\ldots,k$, then $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 \sum_{1 \le i,j \le k} q_{ij}\,\rho^2(\theta_i^*, \theta_j)$ for any $q \in \mathcal{Q}$. Thus,

$$d_K(p_{G_0}, p_G) \le C_1 d_{\rho^2}(G_0, G) \le C_1 \sum_{i=1}^{k} (p_i^* \wedge p_i)\, \rho^2(\theta_i^*, \theta_i) + C_1 \sum_{i=1}^{k} |p_i^* - p_i|\, \mathrm{Diam}(\Theta)^2 \le 2C_1 \epsilon^2.$$

Hence, under the prior $\Pi$,

$$\Pi\bigl(G : d_K(p_{G_0}, p_G) \lesssim \epsilon^2\bigr) \ge \Pi\bigl(G : \rho(\theta_i, \theta_i^*) \le \epsilon,\ |p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2),\ i = 1,\ldots,k\bigr).$$

In view of Assumptions (A2) and (A3), we have $\Pi(B(\epsilon)) \gtrsim \epsilon^{k(d+2)}$. Conversely, for sufficiently small $\epsilon$, if $d_{\rho^2}(G_0, G) \le \epsilon^2$ then, after reordering the indices of the atoms, we must have $\rho(\theta_i, \theta_i^*) = O(\epsilon)$ and $|p_i - p_i^*| = O(\epsilon^2)$ for all $i = 1,\ldots,k$ (see the argument in the proof of Lemma 2(c)). This entails that, under the prior $\Pi$,

$$\Pi(G : d_{\rho^2}(G_0, G) \le \epsilon^2) \le \Pi\bigl(G : \rho(\theta_i, \theta_i^*) \le O(\epsilon),\ |p_i - p_i^*| \le O(\epsilon^2),\ i = 1,\ldots,k\bigr) \lesssim \epsilon^{k(d+2)}.$$

Let $\epsilon_n = n^{-1/2}$. We proceed by verifying the conditions of Theorem 4, as this theorem provides the right rate for parametric mixture models under the $L_2$ Wasserstein distance metric $d_{\rho^2}^{1/2}$. Let $\mathcal{G}_n := \mathcal{G}_k(\Theta)$. Then $\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) = 0$, so Eq. (13) trivially holds. Next, we show that $D(\epsilon/2, S, d_{\rho^2}^{1/2})$, where $S = \{G \in \mathcal{G}_n : \epsilon \le d_{\rho^2}^{1/2}(G_0, G) \le 2\epsilon\}$, is bounded above by a constant, so that (12) is satisfied. Indeed, for any $\epsilon > 0$, $\log D(\epsilon/2, S, d_{\rho^2}^{1/2}) \le \log N(\epsilon/4, S, d_{\rho^2}^{1/2})$. By Lemma 2(c), this is bounded in terms of $\sup_{\Theta'} \log N(\epsilon/8, \Theta', \rho)$, which is bounded above by a constant when the $\Theta'$ are subsets of $\Theta$ whose diameter is bounded by a multiple of $\epsilon$. Thus Eq. (12) holds.

By Proposition 1(b) and Assumption (A1), we have $C_k(\mathcal{G}_n, j\epsilon_n) = \inf_{d_{\rho^2}(G_0, G) \ge (j\epsilon_n/2)^2} d_h^2(p_{G_0}, p_G) \ge c(j\epsilon_n)^4$ for some constant $c > 0$. To ensure condition (15), note that:

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \le \exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nc(j\epsilon_n)^4/2] \lesssim \exp(2n\epsilon_n^2 - ncM_n^4\epsilon_n^4/2).$$

This upper bound goes to zero if $ncM_n^4\epsilon_n^4 \gtrsim n\epsilon_n^2$, which is satisfied by taking $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$. Thus we need $M_n\epsilon_n \asymp \epsilon_n^{1/2} = n^{-1/4}$. Under the assumptions specified above, $\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)/\Pi(B(\epsilon_n)) = O(1)$. On the other hand, for $j \ge M_n$, $\exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \ge \exp[nc(j\epsilon_n)^4/2]$, which is bounded below by an arbitrarily large constant once $M_n$ is a large multiple of $\epsilon_n^{-1/2}$, thereby ensuring (14). Thus, by Theorem 4, the rate of contraction for the posterior distribution of $G$ under the $d_{\rho^2}^{1/2}$ distance metric is $n^{-1/4}$, which matches the minimax rate proved in the univariate case by [3]. Our calculation is summarized by:

Theorem 5. Under Assumptions (A1-A3), the contraction rate in the $L_2$ Wasserstein distance metric of the posterior distribution of $G$ is $n^{-1/4}$.

4.2 Infinite mixtures based on the Dirichlet process

Suppose the true discrete measure is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$, where $\Theta$ is a metric space but $k$ is unknown. To estimate $G_0$, the prior distribution $\Pi$ on the discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ is taken to be a Dirichlet process $\mathrm{DP}(\nu, P_0)$ centered at $P_0$ with concentration parameter $\nu > 0$ [7]. Here, the parameter $P_0$ is a probability measure on $\Theta$. For any $m \ge 1$, the following lemma provides a lower bound on small ball probabilities in the metric space $(\bar{\mathcal{G}}(\Theta), d_{\rho^m}^{1/m})$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$.

Lemma 3. Let $G \sim \mathrm{DP}(\nu, P_0)$, where $P_0$ is a non-atomic base probability measure on a compact set $\Theta$. For a small $\epsilon > 0$, let $D = D(\epsilon, \Theta, \rho)$ denote the packing number of $\Theta$ under the $\rho$ metric. Then, under the Dirichlet process distribution,

$$\Pi\bigl(G : d_{\rho^m}(G_0, G) \le (2^m + 1)\epsilon^m\bigr) \ge \Gamma(\nu)\bigl(\epsilon^m D^{-1}/\mathrm{Diam}(\Theta)^m\bigr)^{D-1}\, \nu^D \prod_{i=1}^{D} P_0(S_i).$$

Here, $(S_1,\ldots,S_D)$ denotes the $D$ disjoint $\epsilon/2$-balls that form a maximal packing of $\Theta$, and $\Gamma(\cdot)$ is the gamma function.

Proof. Since every point in $\Theta$ is of distance at most $\epsilon$ from one of the centers of $S_1,\ldots,S_D$, there is a $D$-partition $(S'_1,\ldots,S'_D)$ of $\Theta$ such that $S_i \subseteq S'_i$ and $\mathrm{Diam}(S'_i) \le 2\epsilon$ for each $i = 1,\ldots,D$. Let $m_i = G(S'_i)$, $\mu_i = P_0(S'_i)$, and $\hat p_i = G_0(S'_i)$. From the definition of Dirichlet processes, $m = (m_1,\ldots,m_D) \sim \mathrm{Dir}(\nu\mu_1,\ldots,\nu\mu_D)$. Note that

$$d_{\rho^m}(G_0, G) \le (2\epsilon)^m + \|m - \hat p\|_1\, [\mathrm{Diam}(\Theta)]^m.$$

Due to the non-atomicity of $P_0$, for $\epsilon$ sufficiently small, $\nu\mu_i \le 1$ for all $i = 1,\ldots,D$. Let $\delta = \epsilon/\mathrm{Diam}(\Theta)$. Then, under $\Pi$,

$$\Pr\bigl(d_{\rho^m}(G_0, G) \le (2^m + 1)\epsilon^m\bigr) \ge \Pr(\|m - \hat p\|_1 \le \delta^m) \ge \Pr(|m_i - \hat p_i| \le \delta^m/D,\ i = 1,\ldots,D)$$
$$= \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \int_{|m_i - \hat p_i| \le \delta^m/D} \prod_{i=1}^{D-1} m_i^{\nu\mu_i - 1}\Bigl(1 - \sum_{i=1}^{D-1} m_i\Bigr)^{\nu\mu_D - 1}\, dm_1 \cdots dm_{D-1}$$
$$\ge \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \prod_{i=1}^{D-1} \int_{\max(\hat p_i - \delta^m/D,\, 0)}^{\min(\hat p_i + \delta^m/D,\, 1)} m_i^{\nu\mu_i - 1}\, dm_i \ge \Gamma(\nu)(\delta^m/D)^{D-1} \prod_{i=1}^{D} (\nu\mu_i).$$

The second inequality is due to $(1 - \sum_{i=1}^{D-1} m_i)^{\nu\mu_D - 1} = m_D^{\nu\mu_D - 1} \ge 1$, since $\nu\mu_D \le 1$ and $0 < m_D < 1$ almost surely. The third inequality is due to the fact that $\Gamma(\alpha) \le 1/\alpha$ for $0 < \alpha \le 1$. This gives the desired claim.
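The small-ball probability bounded in Lemma 3 is easy to probe by simulation. Below is a minimal sketch (ours, using a truncated stick-breaking representation and a hypothetical choice $P_0 = \mathrm{Uniform}[0,1]$) that Monte Carlo estimates $\Pi(G : d_\rho(G_0, G) \le \epsilon)$, exploiting the fact that on the real line the $L_1$ Wasserstein distance is the integral of the CDF difference:

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_stick_breaking(nu, n_atoms=500):
    """Truncated stick-breaking draw from DP(nu, Uniform[0,1])."""
    v = rng.beta(1.0, nu, size=n_atoms)
    w = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    w /= w.sum()                     # fold the truncation remainder back in
    return w, rng.uniform(0.0, 1.0, size=n_atoms)

def w1_on_line(w1, a1, w2, a2):
    """L1 Wasserstein on R: the integral of |F_G - F_G'| over the line."""
    a = np.concatenate([a1, a2])
    o = np.argsort(a)
    cdf_gap = np.cumsum(np.concatenate([w1, -w2])[o])[:-1]
    return np.sum(np.abs(cdf_gap) * np.diff(a[o]))

# G0 with two atoms; Monte Carlo estimate of the small-ball probability
w0, a0 = np.array([0.5, 0.5]), np.array([0.25, 0.75])
eps, draws = 0.1, 2000
hits = sum(w1_on_line(*dp_stick_breaking(nu=1.0), w0, a0) <= eps
           for _ in range(draws))
print(hits / draws)   # Pi(G : d_rho(G0, G) <= eps); cf. the bound in Lemma 3
```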

Assumptions C.
(C1) The non-atomic base measure $P_0$ places full support on a compact set $\Theta$.
(C2) For some constants $C_1, m_1 > 0$, $d_K(f_i, f'_j) \le C_1 \rho^{m_1}(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)^{m_2}$ for some constants $C_2, m_2 > 0$.
(C3) $P_0$ places sufficient probability mass on all small balls that pack $\Theta$. Specifically, there is a universal constant $c_3 > 0$ such that the probabilities of the $D$-partition $(S_1,\ldots,S_D)$ specified in Lemma 3 satisfy, for any $\epsilon > 0$:

$$\log \prod_{i=1}^{D} P_0(S_i) \ge -c_3 D \log D.$$

(C4) $\Theta \subset \mathbb{R}^d$ is compact, so that the packing number satisfies $D(\epsilon, \Theta, \rho) \lesssim [\mathrm{Diam}(\Theta)/\epsilon]^d$.

Theorem 6. Given Assumptions (C1-C4) and the smoothness conditions for the likelihood family as specified in Theorem 2, there is a sequence $\beta_n \searrow 0$ such that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability. Specifically:
(1) For ordinary smooth likelihood functions, take $\beta_n \asymp (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$, for any small $\delta > 0$.
(2) For supersmooth likelihood functions, take $\beta_n \asymp (\log n)^{-1/\beta}$.

Proof. The proof consists of two main steps. First, we show that under Assumptions (C1-C4), the conditions specified by Eqs. (7), (8) and (9) of Theorem 3 are satisfied by taking $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$ and $\epsilon_n$ a large multiple of $(\log n/n)^{1/(d+2)}$. The second step constructs a sequence $M_n$, and subsequently $\beta_n = M_n\epsilon_n$, for which Theorem 3 can be applied.

Step 1. By Lemma 1 and (C2), $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 d_{\rho^{m_1}}(G_0, G)$. Also, $P_{G_0}[\log(p_{G_0}/p_G)]^2 \lesssim d_{\rho^{m_1}}(G_0, G)^{m_2}$. Without loss of generality, assume that $m_1 \ge m_2$. We obtain that $\Pi(G \in B(\epsilon_n)) \ge \Pi\bigl(G : d_{\rho^{m_1}}(G_0, G) \le C_3\, \epsilon_n^{2 \vee (2/m_2)}\bigr)$ for some constant $C_3$. Combining this bound with (C3) and (C4), which are applied to Lemma 3, we have:

$$\log \Pi(G \in B(\epsilon_n)) \gtrsim (D-1)\log(\epsilon_n/\mathrm{Diam}(\Theta)) + (2D-1)\log(1/D) + D\log\nu,$$

where the approximation constant depends on $m_1, m_2$. Note that $D \lesssim [\mathrm{Diam}(\Theta)/\epsilon_n]^d$. It is simple to check that condition (9) holds, i.e., $\log \Pi(G \in B(\epsilon_n)) \ge -Cn\epsilon_n^2$ for any constant $C > 0$, by the given rate of $\epsilon_n$: the right side of the preceding display is of order $-\epsilon_n^{-d}\log(1/\epsilon_n)$, and $\epsilon_n^{-d}\log(1/\epsilon_n) \lesssim n\epsilon_n^2$ precisely when $\epsilon_n^{d+2} \gtrsim \log(1/\epsilon_n)/n$, which holds for $\epsilon_n$ a large multiple of $(\log n/n)^{1/(d+2)}$. Since $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$, (8) trivially holds. Turning to condition (7), by Lemma 2(b) we have

$$\log N(2\epsilon_n, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon_n, \Theta, \rho)\log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \lesssim (\mathrm{Diam}(\Theta)/\epsilon_n)^d \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \le n\epsilon_n^2$$

by the specified rate of $\epsilon_n$.

Step 2. For any $\mathcal{G} \subseteq \bar{\mathcal{G}}(\Theta)$, let $R_k(\mathcal{G}, \cdot)$ be the inverse of the Hellinger information function of the $d_\rho$ metric. Specifically, for any $t \ge 0$,

$$R_k(\mathcal{G}, t) = \inf\{r \ge 0 : C_k(\mathcal{G}, r) \ge t\}.$$

Note that $R_k(\mathcal{G}, 0) = 0$, and $R_k(\mathcal{G}, \cdot)$ is non-decreasing because $C_k(\mathcal{G}, \cdot)$ is. Let $(\epsilon_n)_{n \ge 1}$ be the sequence determined in the previous step of the proof. Let $M_n = R_k(\bar{\mathcal{G}}(\Theta), \epsilon_n^2(C+4))/\epsilon_n$, so that $\beta_n = M_n\epsilon_n = R_k(\bar{\mathcal{G}}(\Theta), \epsilon_n^2(C+4))$. Condition (10) holds by the definition of $R_k$, i.e., $C_k(\bar{\mathcal{G}}(\Theta), M_n\epsilon_n) \ge \epsilon_n^2(C+4)$. To verify (11), note that the running sum over $j$ has no more than $\mathrm{Diam}(\Theta)/\epsilon_n$ terms, so by the monotonicity of $C_k$ we have

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \le \bigl(\mathrm{Diam}(\Theta)/\epsilon_n\bigr)\exp\bigl(2n\epsilon_n^2 - n C_k(\mathcal{G}_n, M_n\epsilon_n)\bigr) \to 0.$$

Hence, Theorem 3 can be applied to conclude that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability.

Under the ordinary smoothness condition (as specified in Theorem 2), $R_k(\bar{\mathcal{G}}(\Theta), t) \asymp t^{1/(4+(2\beta+1)d+\delta)}$, where $\delta$ is an arbitrarily small positive constant. So $\beta_n \asymp \epsilon_n^{2/(4+(2\beta+1)d+\delta)} = (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$. On the other hand, under the supersmoothness condition, $R_k(\bar{\mathcal{G}}(\Theta), t) \asymp (\log(1/t))^{-1/\beta}$. So $\beta_n \asymp (\log(1/\epsilon_n))^{-1/\beta} \asymp (\log n)^{-1/\beta}$.

Remark. For supersmooth likelihood functions, the rate obtained is comparable to the optimal rate in the deconvolution problem [6, 36], although the latter problem deals with smooth mixing densities evaluated in the $L_2$ metric. For ordinary smooth likelihood functions, we do not know whether the rate obtained here is optimal (up to a logarithmic factor).

4.3 Finite mixtures of Gaussian processes

We now study an example in which $\Theta$ has infinite dimensions. Specifically, let $T = [0, 1]$ and let $\Theta = \ell^\infty(T)$ be the Banach space of bounded functions $\theta : T \to \mathbb{R}$, equipped with the uniform norm $\|\theta\| = \sup\{|\theta(t)| : t \in T\}$. Suppose that the true $G_0$ has $k$ distinct atoms in $\Theta$, $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*}$, where $k$ is known. We shall consider a mixture-of-Gaussian-processes prior $\Pi$ on $\mathcal{G}_k(\Theta)$. Specifically, a random draw $G$ from $\Pi$ is a discrete measure of the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$, where the $\theta_i$ are independent random sample paths distributed according to a zero-mean Gaussian process. Let $K : T \times T \to \mathbb{R}$ be the covariance function that defines the Gaussian process; we assume further that the Gaussian process has bounded sample paths and that it is Borel measurable. It is a known fact that the support of the defined zero-mean Gaussian measure is equal to the closure of the reproducing kernel Hilbert space (RKHS) of the covariance kernel $K$ of the process, to be denoted by $\mathbb{H}(K)$, or simply $\mathbb{H}$ [17, 30]. Assume that $\theta_i^* \in \bar{\mathbb{H}}$ for all $i = 1,\ldots,k$, so that the true $G_0$ is contained within the support of the prior $\Pi$.
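To make the prior concrete, here is a minimal simulation sketch (ours; the squared-exponential kernel and all names are our choices, not the paper's) of one draw $G = \sum_i p_i \delta_{\theta_i}$ from such a mixture-of-Gaussian-processes prior on a grid of $T = [0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(2)

# grid of T = [0, 1] and a squared-exponential covariance kernel
t = np.linspace(0.0, 1.0, 101)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * 0.1 ** 2))

def draw_prior_G(k=3, jitter=1e-8):
    """One draw G = sum_i p_i delta_{theta_i} from the mixture-of-GP prior."""
    p = rng.dirichlet(np.ones(k))                       # mixing proportions
    L = np.linalg.cholesky(K + jitter * np.eye(len(t)))
    thetas = (L @ rng.standard_normal((len(t), k))).T   # k zero-mean GP paths
    return p, thetas

p, thetas = draw_prior_G()
# the metric rho on Theta is the uniform norm between atoms (functions of t)
print(p, np.max(np.abs(thetas[0] - thetas[1])))
```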

An ingredient of our analysis is drawn from a recent result of van der Vaart and van Zanten, who studied the asymptotic behavior of priors based on Gaussian processes [31]. Define $\rho(\theta_i, \theta_j) = \sup\{|\theta_i(t) - \theta_j(t)| : t \in T\}$. A key notion in their analysis is the concentration function. For each $i = 1,\ldots,k$, define

$$\phi_{\theta_i^*}(\epsilon) = \inf_{h \in \mathbb{H} :\ \rho(h, \theta_i^*) < \epsilon} \|h\|_{\mathbb{H}}^2 - \log \Pr(\|\theta\| < \epsilon),$$

where $\Pr$ denotes the probability under the Gaussian process. Another ingredient is the extension of the notion of strong identifiability to the function space $\Theta$ (see Section 5.2), so that the result of Theorem 1 continues to hold for infinite dimensional $\Theta$. Finally, we need additional assumptions:

Assumptions B.
(B1) The family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable.
(B2) For some positive constant $C_1$, $d_K(f_i, f'_j) \le C_1\rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)$ for some constant $C_2 > 0$.
(B3) Under the prior $\Pi$, for small $\delta > 0$, $\Pi(|p_i - p_i^*| \le \delta,\ i = 1,\ldots,k) \ge c_3\delta^{k\alpha}$ for some constants $c_3, \alpha > 0$.
(B4) Under the prior $\Pi$, $C_4 = \mathbb{E}\|\theta\|^2 < \infty$.

Theorem 7. Given Assumptions (B1-B4), let $(\epsilon_n)_{n \in \mathbb{N}}$ be a sequence of positive numbers tending to 0 such that $\log n = o(n\epsilon_n^2)$ and such that, for any $i = 1,\ldots,k$ and any $n$,

$$\phi_{\theta_i^*}(\epsilon_n) \le n\epsilon_n^2. \qquad (17)$$

Then, for a sufficiently large constant $M > 0$, $\Pi(d_\rho(G_0, G) \ge M\epsilon_n^{1/2} \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability.

Remark. The reader is referred to [31] for examples of convergence rates $\epsilon_n$ that satisfy condition (17) for different choices of $\Theta$. In particular, if under the Gaussian process prior the supporting atoms $\theta_i$ of $G \sim \Pi$ are functions on $T = [0,1]$ with smoothness $\gamma_1 > 0$, while the true support points $\theta_i^*$ of $G_0$ are functions with smoothness $\gamma_2 > 0$, then the concentration functions satisfy $\phi_{\theta_i^*}(\epsilon) \asymp \epsilon^{-1/(\gamma_1 \wedge \gamma_2)}$ for each $i = 1,\ldots,k$. Accordingly, the rate $\epsilon_n$ for which Eq. (17) holds is $\epsilon_n \asymp n^{-\frac{\gamma_1\wedge\gamma_2}{2(\gamma_1\wedge\gamma_2)+1}}$, and the contraction rate for the posterior distribution of $G$ is $n^{-\frac{\gamma_1\wedge\gamma_2}{2(2(\gamma_1\wedge\gamma_2)+1)}}$.

5 Proofs of main results

5.1 Identifiability results

Proof of Theorem 1. Suppose that Eq. (4) is not true. Then there exist sequences $G_n$ and $G'_n$ tending to $G_0$ in the $d_\rho$ metric such that $\psi(G_n, G'_n) \to 0$. We write $G_n = \sum_{i} p_{n,i}\delta_{\theta_{n,i}}$, where $p_{n,i} = 0$ for indices $i$ greater than $k_n$, the number of atoms of $G_n$. Similar notation applies to $G'_n$. Since both $G_n$ and $G'_n$ have a finite number of atoms, there is $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$ such that $d_{\rho^2}(G_n, G'_n) = \sum_{i,j} q^{(n)}_{ij}\rho^2(\theta_{n,i}, \theta'_{n,j})$. Note that $d_\rho^2(G_n, G'_n) \le d_{\rho^2}(G_n, G'_n) = O(d_\rho(G_n, G'_n))$, where the latter bound is due to the boundedness of $\Theta$.

Let $O_n = \{(i,j) : \rho^2(\theta_{n,i}, \theta'_{n,j}) \le d_{\rho^2}^{1-\delta'}(G_n, G'_n)\}$ for some $\delta' \in (0,1)$. Then

$$\sum_{(i,j)\notin O_n} q^{(n)}_{ij} \le d_{\rho^2}(G_n, G'_n)/d_{\rho^2}^{1-\delta'}(G_n, G'_n) \to 0$$

as $n \to \infty$. Since $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$, we can express

$$\psi(G_n, G'_n) = \sup_x \Bigl| \sum_{i=1}^{k_n} p_{n,i}f(x|\theta_{n,i}) - \sum_{j=1}^{k'_n} p'_{n,j}f(x|\theta'_{n,j}) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n) = \sup_x \Bigl| \sum_{i,j} q^{(n)}_{ij}\bigl(f(x|\theta_{n,i}) - f(x|\theta'_{n,j})\bigr) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n),$$

and, by Taylor's expansion applied to the pairs in $O_n$,

$$\psi(G_n, G'_n) = \sup_x \Bigl| \sum_{(i,j)\notin O_n} q^{(n)}_{ij}\bigl(f(x|\theta'_{n,j}) - f(x|\theta_{n,i})\bigr) + \sum_{(i,j)\in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T Df(x|\theta_{n,i})$$
$$+ \frac{1}{2}\sum_{(i,j)\in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T D^2 f(x|\theta_{n,i})(\theta'_{n,j} - \theta_{n,i}) + R_n(x) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n) =: \sup_x |A_n(x) + B_n(x) + C_n(x) + R_n(x)| / D_n,$$

where

$$R_n(x) = O\Bigl(\sum_{(i,j)\in O_n} q^{(n)}_{ij}\rho^{2+\delta}(\theta_{n,i}, \theta'_{n,j})\Bigr) = O\Bigl(\sum_{(i,j)\in O_n} q^{(n)}_{ij}\rho^{2}(\theta_{n,i}, \theta'_{n,j})\, d_{\rho^2}^{(1-\delta')\delta/2}(G_n, G'_n)\Bigr)$$

due to Eq. (3) and the definition of $O_n$. So $R_n(x)/d_{\rho^2}(G_n, G'_n) \to 0$. The quantities $A_n(x)$, $B_n(x)$ and $C_n(x)$ are linear functionals of $f(x|\theta)$, $Df(x|\theta)$ and $D^2 f(x|\theta)$ for various $\theta$, respectively. Since $\Theta$ is compact, subsequences of $G_n$ and $G'_n$ can be chosen so that each of their support points converges to one of a fixed set of atoms $\theta^*_l$, $l = 1,\ldots,k^*$, where $k^* \le k$.

After being properly rescaled, the limits of $A_n(x)$, $B_n(x)$ and $C_n(x)$ are still linear functionals with constant coefficients not depending on $x$. In particular, $C_n(x)/D_n \to \sum_{j=1}^{k^*}\gamma_j^T D^2 f(x|\theta_j^*)\gamma_j$ for some $\gamma_j$, not all of which vanish, since $\sum_{j=1}^{k^*}\|\gamma_j\|^2 = 1$. The coefficients in $A_n(x)/D_n$ and $B_n(x)/D_n$ can go either to infinity or to a constant, after further selecting subsequences of $G_n$ and $G'_n$. If they go to infinity, a sequence $d_n \to 0$ can be found such that $d_n A_n(x)/D_n$ converges to $\sum_{j=1}^{k^*}\alpha_j f(x|\theta_j^*)$ and $d_n B_n(x)/D_n$ converges to $\sum_{j=1}^{k^*}\beta_j^T Df(x|\theta_j^*)$ for some finite $\alpha_j$ and $\beta_j$. Thus, we have $d_n$ and coefficients $\alpha_j, \beta_j, \gamma_j$, not all zero, such that

$$d_n\, |p_{G_n}(x) - p_{G'_n}(x)| \big/ d_{\rho^2}(G_n, G'_n) \to \Bigl| \sum_{j=1}^{k^*} \alpha_j f(x|\theta_j^*) + \beta_j^T Df(x|\theta_j^*) + \gamma_j^T D^2 f(x|\theta_j^*)\gamma_j \Bigr| \qquad (18)$$

for all $x$. Since $\psi(G_n, G'_n) \to 0$, the right side of the preceding display must be 0 for almost all $x$. By strong identifiability, all coefficients must be 0, which leads to a contradiction.

With respect to $\psi_1(G, G')$: suppose that the claim is not true, which implies the existence of subsequences $G_n, G'_n$ such that $\psi_1(G_n, G'_n) \to 0$. Going through the same argument as above, we obtain $\alpha_j, \beta_j, \gamma_j$, not all of which are zero, such that Eq. (18) holds. An application of Fatou's lemma yields

$$\int \Bigl| \sum_{j=1}^{k^*} \alpha_j f(x|\theta_j^*) + \beta_j^T Df(x|\theta_j^*) + \gamma_j^T D^2 f(x|\theta_j^*)\gamma_j \Bigr|\, d\mu = 0.$$

Thus the integrand must be 0 for almost all $x$, leading to a contradiction.

Proof of Theorem 2. To obtain an upper bound on $d_{\rho^2}(G, G')$ in terms of $d_V(p_G, p_{G'})$ under the condition that $d_V(p_G, p_{G'}) \to 0$, our strategy is to approximate $G$ and $G'$ by convolving them with a mollifier $K_\delta$. By the triangle inequality, $d_{\rho^2}(G, G')$ can be bounded in terms of $d_{\rho^2}(G, G * K_\delta)$, $d_{\rho^2}(G', G' * K_\delta)$, and $d_{\rho^2}(G * K_\delta, G' * K_\delta)$. The first two terms are simple to bound, while the last term can be handled by expressing $G * K_\delta$ as the convolution of the mixture density $p_G$ with another function. This trick has been widely exploited in the kernel density estimation method for deconvolution problems (e.g., [36, 6]). We also need the following elementary lemma.

Lemma 4. Assume that $p$ and $p'$ are two probability density functions on $\mathbb{R}^d$ with bounded $s$-th moments.
(a) For $t$ such that $0 < t < s$,

$$\int |p(x) - p'(x)|\, \|x\|^t\, dx \le 2\|p - p'\|_{L_1}^{(s-t)/s}\bigl(\mathbb{E}_p\|X\|^s + \mathbb{E}_{p'}\|X\|^s\bigr)^{t/s}.$$

(b) Let $V_d = \pi^{d/2}/\Gamma(d/2 + 1)$ denote the volume of the $d$-dimensional unit sphere. Then

$$\|p - p'\|_{L_1} \le 2V_d^{s/(d+2s)}\bigl(\mathbb{E}_p\|X\|^s + \mathbb{E}_{p'}\|X\|^s\bigr)^{\frac{d}{d+2s}}\|p - p'\|_{L_2}^{\frac{2s}{d+2s}}.$$

Take any $s > 0$, and let $K : \mathbb{R}^d \to (0, \infty)$ be a symmetric density function on $\mathbb{R}^d$ whose Fourier transform $\tilde K$ is a continuous function with support bounded in $[-1, 1]^d$, and which has bounded moments up to order $s$. Consider the mollifiers $K_\delta(x) = \frac{1}{\delta^d}K(x/\delta)$ for $\delta > 0$. Let $\tilde K_\delta$ and $\tilde f$ be the Fourier transforms of $K_\delta$ and $f$, respectively. Define $g_\delta$ to be the inverse Fourier transform of $\tilde K_\delta/\tilde f$:

$$g_\delta(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} e^{i\langle\omega, x\rangle}\, \frac{\tilde K_\delta(\omega)}{\tilde f(\omega)}\, d\omega.$$

The function $\tilde K_\delta(\omega)/\tilde f(\omega)$ has bounded support, so $g_\delta \in L_1(\mathbb{R}^d)$, and $\tilde g_\delta := \tilde K_\delta/\tilde f$ is the Fourier transform of $g_\delta$. By the convolution theorem, $f * g_\delta = K_\delta$, and as a result, $G * K_\delta = G * f * g_\delta = p_G * g_\delta$. (For instance, for the Laplace kernel on $\mathbb{R}$ with $\tilde f(\omega) = 1/(1+\omega^2)$, one has $\tilde g_\delta(\omega) = (1+\omega^2)\tilde K_\delta(\omega)$, so $g_\delta = K_\delta - K''_\delta$ explicitly.) The second moment under $K_\delta$ is $O(\delta^2)$; it entails that $d_{\rho^2}(G, G * K_\delta) = O(\delta^2)$. By the triangle inequality,

$$d_{\rho^2}^{1/2}(G, G') \le d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) + d_{\rho^2}^{1/2}(G, G * K_\delta) + d_{\rho^2}^{1/2}(G', G' * K_\delta) \le d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) + O(\delta),$$

so for some constant $C(K) > 0$ depending only on the kernel $K$,

$$d_{\rho^2}(G, G') \le 2 d_{\rho^2}(G * K_\delta, G' * K_\delta) + C(K)\delta^2. \qquad (19)$$

Theorem 6.15 of [33] provides an upper bound for the Wasserstein distance: for any two probability measures $\mu$ and $\nu$, $d_{\rho^2}(\mu, \nu) \le 2\int \|x\|^2\, d|\mu - \nu|(x)$, where $|\mu - \nu|$ is the total variation of the measure $\mu - \nu$. Thus,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le 2\int \|x\|^2\, |G * K_\delta(x) - G' * K_\delta(x)|\, dx. \qquad (20)$$

We note that since the density function $K$ has bounded $s$-th moment,

$$\int \|x\|^s\, G * K_\delta(dx) \le 2^s\Bigl[\int \|\theta\|^s\, dG(\theta) + \int \|x\|^s K_\delta(x)\, dx\Bigr] = 2^s\Bigl[\int \|\theta\|^s\, dG(\theta) + \delta^s\int \|x\|^s K(x)\, dx\Bigr] < \infty,$$

because $G$'s support points lie in a compact set. Applying Lemma 4 to Eq. (20), we obtain that for $\delta < 1$,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le C(d, K, s)\, \|G * K_\delta - G' * K_\delta\|_{L_1}^{(s-2)/s} \le C(d, K, s)\, \|G * K_\delta - G' * K_\delta\|_{L_2}^{2(s-2)/(d+2s)}. \qquad (21)$$

Here, the constants $C(d, K, s)$ differ from line to line, and depend only on $d$, $s$ and the $s$-th moment of the density function $K$. Next, we use the known fact that for an arbitrary (signed) measure $\mu$ on $\mathbb{R}^d$ and a function $g \in L_2(\mathbb{R}^d)$, there holds $\|\mu * g\|_{L_2} \le |\mu|\, \|g\|_{L_2}$, where $|\mu|$ denotes the total variation of $\mu$:

$$\|G * K_\delta - G' * K_\delta\|_{L_2} = \|p_G * g_\delta - p_{G'} * g_\delta\|_{L_2} = \|(p_G - p_{G'}) * g_\delta\|_{L_2} \le 2 d_V(p_G, p_{G'})\, \|g_\delta\|_{L_2}. \qquad (22)$$

By Plancherel's identity,

$$\|g_\delta\|_{L_2}^2 = \frac{1}{(2\pi)^d}\int \frac{|\tilde K_\delta(\omega)|^2}{|\tilde f(\omega)|^2}\, d\omega = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \frac{|\tilde K(\omega\delta)|^2}{|\tilde f(\omega)|^2}\, d\omega \le C\int_{[-1/\delta, 1/\delta]^d} |\tilde f(\omega)|^{-2}\, d\omega.$$

The last bound holds because $\tilde K$ has support in $[-1, 1]^d$ and is bounded by a constant. Collecting Eqs. (19)-(22) and the preceding display, we have:

$$d_{\rho^2}(G, G') \le C(d, K, s)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}\Bigl[\int_{[-1/\delta, 1/\delta]^d} |\tilde f(\omega)|^{-2}\, d\omega\Bigr]^{\frac{s-2}{d+2s}}\Bigr\}.$$

If $|\tilde f(\omega)|\prod_{j=1}^{d}|\omega_j|^{\beta} \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constant $d_0$, then

$$d_{\rho^2}(G, G') \le C(d, K, s, \beta)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}(1/\delta)^{\frac{(2\beta+1)d(s-2)}{d+2s}}\Bigr\} \le C(d, K, s, \beta)\, d_V(p_G, p_{G'})^{\frac{4(s-2)}{2(d+2s)+(2\beta+1)d(s-2)}}.$$

Since the exponent tends to $4/(4 + (2\beta+1)d)$ as $s \to \infty$, we obtain that $d_{\rho^2}(G, G') \le C(d, \beta, r)\, d_V(p_G, p_{G'})^{r}$ for any constant $r < 4/(4 + (2\beta+1)d)$, as $d_V(p_G, p_{G'}) \to 0$.

If $|\tilde f(\omega)|\prod_{j=1}^{d}\exp(|\omega_j|^{\beta}/\gamma) \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constants $\beta, \gamma, d_0$, then

$$d_{\rho^2}(G, G') \le C(d, K, s, \beta)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{2(s-2)/(d+2s)}\exp\Bigl(\frac{2d\,\delta^{-\beta}}{\gamma}\cdot\frac{s-2}{d+2s}\Bigr)\Bigr\}.$$

Taking $\delta^{-\beta} = -\frac{\gamma}{2d}\log d_V(p_G, p_{G'})$, we obtain that $d_{\rho^2}(G, G') \le C(d, \beta)\bigl(-\log d_V(p_G, p_{G'})\bigr)^{-2/\beta}$.

Proof of Lemma 1. We exploit the variational characterization of f-divergences [22]:

$$d_\phi(f_i, f'_j) = \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu,$$

where the supremum is taken over all measurable functions $\varphi_{ij}$ on $\mathcal{X}$, and $\phi^*$ denotes the Legendre-Fenchel conjugate dual of the convex function $\phi$ ($\phi^*$ is again a convex function on $\mathbb{R}$, defined by $\phi^*(v) = \sup_{u \in \mathbb{R}}(uv - \phi(u))$). Thus,

$$d_{\rho_\phi}(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu.$$

On the other hand, for any $q \in \mathcal{Q}(p, p')$,

$$d_\phi(p_G, p_{G'}) = \sup_{\varphi} \int \varphi\, p_{G'} - \phi^*(\varphi)\, p_G\, d\mu = \sup_{\varphi} \int \varphi \sum_j p'_j f'_j - \phi^*(\varphi)\sum_i p_i f_i\, d\mu$$
$$= \sup_{\varphi} \sum_{ij} \int q_{ij}\bigl(\varphi f'_j - \phi^*(\varphi) f_i\bigr)\, d\mu \le \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu,$$

where the last inequality holds because the supremum is taken over a larger set of functions. Moreover, the bound holds for any $q \in \mathcal{Q}(p, p')$, so $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.

Proof of Proposition 1. (a) Suppose that the claim is not true. Then there is a sequence of pairs $(G_0^{(n)}, G^{(n)}) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$ such that $d_\rho(G_0^{(n)}, G^{(n)}) \ge r/2 > 0$ always holds and $d_h(p_{G_0^{(n)}}, p_{G^{(n)}}) \to 0$. By the compactness of both $\mathcal{G}_k(\Theta)$ and $\mathcal{G}$, we may pass to a subsequence converging in the $d_\rho$ metric to some $G'_0 \in \mathcal{G}_k(\Theta)$ and $G' \in \mathcal{G}$, respectively. We must have $d_\rho(G'_0, G') \ge r/2 > 0$, so $G'_0 \ne G'$. At the same time, $d_h(p_{G'_0}, p_{G'}) = 0$, which implies that $p_{G'_0}(x) = p_{G'}(x)$ for almost all $x \in \mathcal{X}$. By the finite identifiability condition, $G'_0 = G'$, which is a contradiction.

(b) is an immediate consequence of Theorem 1, by noting that under the given hypotheses there is $c(k) > 0$ depending on $k$ such that

$$d_h^2(p_{G_0}, p_G) \ge d_V^2(p_{G_0}, p_G)/2 \ge c(k)\, d_{\rho^2}^2(G_0, G) \ge c(k)\, d_\rho^4(G_0, G)$$

for sufficiently small $d_\rho(G_0, G)$. The boundedness of $\Theta$ implies the boundedness of $d_\rho(G_0, G)$, thereby extending the claim to the entire admissible range of $d_\rho(G_0, G)$.

(c) is obtained in a similar way from Theorem 2.

5.2 Strong identifiability conditions in normed spaces

Let $(\Theta, \|\cdot\|)$ be a real normed space. A continuous function $f : \Theta \to \mathbb{R}$ is twice Fréchet differentiable at a point $\theta^* \in \Theta$ if it is Fréchet differentiable at $\theta^*$, with Fréchet derivative $D_\theta f(\theta^*)$ (a bounded linear functional from $\Theta$ into $\mathbb{R}$), and there is a continuous bilinear function $D_\theta^2 f(\theta^*; \cdot, \cdot)$ from $\Theta \times \Theta$ into $\mathbb{R}$, called the second Fréchet differential of $f$, with the property that

$$\lim_{\|\gamma\| \to 0} \bigl| f(\theta^* + \gamma) - f(\theta^*) - D_\theta f(\theta^*)\gamma - \tfrac{1}{2}D_\theta^2 f(\theta^*; \gamma, \gamma) \bigr| \big/ \|\gamma\|^2 = 0.$$

We say that the family of density functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice Fréchet differentiable in $\theta$ (with Fréchet derivative $D_\theta f(x|\theta)(\cdot)$ and second Fréchet differential $D_\theta^2 f(x|\theta; \cdot, \cdot)$), and if for any finite $k$ and any $k$ distinct $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Bigl| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + D_\theta f(x|\theta_i)\beta_i + D_\theta^2 f(x|\theta_i; \gamma_i, \gamma_i) \Bigr| = 0 \qquad (23)$$

implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \Theta$ for $i = 1,\ldots,k$.

Note that if for each $x \in \mathcal{X}$, $f(x|\cdot)$ is twice continuously Fréchet differentiable, then $f$ admits the following Taylor expansion (cf. pg. 659, [24]):

$$f(x|\theta_2) - f(x|\theta_1) = D_\theta f(x|\theta_1)(\theta_2 - \theta_1) + \frac{1}{2}D_\theta^2 f\bigl(x|\theta_1 + s(\theta_2 - \theta_1); (\theta_2 - \theta_1), (\theta_2 - \theta_1)\bigr)$$

for some $s \in [0, 1]$. Assume further that the second Fréchet differential of $f(x|\cdot)$ satisfies a uniform Lipschitz condition:

$$\bigl| D_\theta^2 f(x|\theta_1; \gamma, \gamma) - D_\theta^2 f(x|\theta_2; \gamma, \gamma) \bigr| \le C\|\theta_1 - \theta_2\|^{\delta}\|\gamma\|^2$$

for all $x, \theta_1, \theta_2 \in \Theta$ and some fixed $C$ and $\delta > 0$. It is then simple to observe that Theorem 1 and its proof extend line-by-line to the normed space setting.


More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation

More information

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem Chapter 8 General Countably dditive Set Functions In Theorem 5.2.2 the reader saw that if f : X R is integrable on the measure space (X,, µ) then we can define a countably additive set function ν on by

More information

Lectures on. Sobolev Spaces. S. Kesavan The Institute of Mathematical Sciences, Chennai.

Lectures on. Sobolev Spaces. S. Kesavan The Institute of Mathematical Sciences, Chennai. Lectures on Sobolev Spaces S. Kesavan The Institute of Mathematical Sciences, Chennai. e-mail: kesh@imsc.res.in 2 1 Distributions In this section we will, very briefly, recall concepts from the theory

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space 1 Professor Carl Cowen Math 54600 Fall 09 PROBLEMS 1. (Geometry in Inner Product Spaces) (a) (Parallelogram Law) Show that in any inner product space x + y 2 + x y 2 = 2( x 2 + y 2 ). (b) (Polarization

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Ordinary differential equations. Recall that we can solve a first-order linear inhomogeneous ordinary differential equation (ODE) dy dt

Ordinary differential equations. Recall that we can solve a first-order linear inhomogeneous ordinary differential equation (ODE) dy dt MATHEMATICS 3103 (Functional Analysis) YEAR 2012 2013, TERM 2 HANDOUT #1: WHAT IS FUNCTIONAL ANALYSIS? Elementary analysis mostly studies real-valued (or complex-valued) functions on the real line R or

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing. 5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint

More information

INTEGRATION ON MANIFOLDS and GAUSS-GREEN THEOREM

INTEGRATION ON MANIFOLDS and GAUSS-GREEN THEOREM INTEGRATION ON MANIFOLS and GAUSS-GREEN THEOREM 1. Schwarz s paradox. Recall that for curves one defines length via polygonal approximation by line segments: a continuous curve γ : [a, b] R n is rectifiable

More information

A note on some approximation theorems in measure theory

A note on some approximation theorems in measure theory A note on some approximation theorems in measure theory S. Kesavan and M. T. Nair Department of Mathematics, Indian Institute of Technology, Madras, Chennai - 600 06 email: kesh@iitm.ac.in and mtnair@iitm.ac.in

More information

l(y j ) = 0 for all y j (1)

l(y j ) = 0 for all y j (1) Problem 1. The closed linear span of a subset {y j } of a normed vector space is defined as the intersection of all closed subspaces containing all y j and thus the smallest such subspace. 1 Show that

More information

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:

More information

A VERY BRIEF REVIEW OF MEASURE THEORY

A VERY BRIEF REVIEW OF MEASURE THEORY A VERY BRIEF REVIEW OF MEASURE THEORY A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and

More information

L p -boundedness of the Hilbert transform

L p -boundedness of the Hilbert transform L p -boundedness of the Hilbert transform Kunal Narayan Chaudhury Abstract The Hilbert transform is essentially the only singular operator in one dimension. This undoubtedly makes it one of the the most

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

MAT 578 FUNCTIONAL ANALYSIS EXERCISES

MAT 578 FUNCTIONAL ANALYSIS EXERCISES MAT 578 FUNCTIONAL ANALYSIS EXERCISES JOHN QUIGG Exercise 1. Prove that if A is bounded in a topological vector space, then for every neighborhood V of 0 there exists c > 0 such that tv A for all t > c.

More information

i. v = 0 if and only if v 0. iii. v + w v + w. (This is the Triangle Inequality.)

i. v = 0 if and only if v 0. iii. v + w v + w. (This is the Triangle Inequality.) Definition 5.5.1. A (real) normed vector space is a real vector space V, equipped with a function called a norm, denoted by, provided that for all v and w in V and for all α R the real number v 0, and

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Math The Laplacian. 1 Green s Identities, Fundamental Solution

Math The Laplacian. 1 Green s Identities, Fundamental Solution Math. 209 The Laplacian Green s Identities, Fundamental Solution Let be a bounded open set in R n, n 2, with smooth boundary. The fact that the boundary is smooth means that at each point x the external

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

4 Expectation & the Lebesgue Theorems

4 Expectation & the Lebesgue Theorems STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

CHAPTER 7. Connectedness

CHAPTER 7. Connectedness CHAPTER 7 Connectedness 7.1. Connected topological spaces Definition 7.1. A topological space (X, T X ) is said to be connected if there is no continuous surjection f : X {0, 1} where the two point set

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

1. Bounded linear maps. A linear map T : E F of real Banach

1. Bounded linear maps. A linear map T : E F of real Banach DIFFERENTIABLE MAPS 1. Bounded linear maps. A linear map T : E F of real Banach spaces E, F is bounded if M > 0 so that for all v E: T v M v. If v r T v C for some positive constants r, C, then T is bounded:

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Functional Analysis. Martin Brokate. 1 Normed Spaces 2. 2 Hilbert Spaces The Principle of Uniform Boundedness 32

Functional Analysis. Martin Brokate. 1 Normed Spaces 2. 2 Hilbert Spaces The Principle of Uniform Boundedness 32 Functional Analysis Martin Brokate Contents 1 Normed Spaces 2 2 Hilbert Spaces 2 3 The Principle of Uniform Boundedness 32 4 Extension, Reflexivity, Separation 37 5 Compact subsets of C and L p 46 6 Weak

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Combinatorics in Banach space theory Lecture 12

Combinatorics in Banach space theory Lecture 12 Combinatorics in Banach space theory Lecture The next lemma considerably strengthens the assertion of Lemma.6(b). Lemma.9. For every Banach space X and any n N, either all the numbers n b n (X), c n (X)

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION

USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION YI WANG Abstract. We study Banach and Hilbert spaces with an eye towards defining weak solutions to elliptic PDE. Using Lax-Milgram

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n Econ 204 2011 Lecture 3 Outline 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n 1 Metric Spaces and Metrics Generalize distance and length notions

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Exercises to Applied Functional Analysis

Exercises to Applied Functional Analysis Exercises to Applied Functional Analysis Exercises to Lecture 1 Here are some exercises about metric spaces. Some of the solutions can be found in my own additional lecture notes on Blackboard, as the

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Solutions to Problem Set 5 for , Fall 2007

Solutions to Problem Set 5 for , Fall 2007 Solutions to Problem Set 5 for 18.101, Fall 2007 1 Exercise 1 Solution For the counterexample, let us consider M = (0, + ) and let us take V = on M. x Let W be the vector field on M that is identically

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers Page 1 Theorems Wednesday, May 9, 2018 12:53 AM Theorem 1.11: Greatest-Lower-Bound Property Suppose is an ordered set with the least-upper-bound property Suppose, and is bounded below be the set of lower

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

Normed and Banach spaces

Normed and Banach spaces Normed and Banach spaces László Erdős Nov 11, 2006 1 Norms We recall that the norm is a function on a vectorspace V, : V R +, satisfying the following properties x + y x + y cx = c x x = 0 x = 0 We always

More information

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43 INDEX Abel s identity, 131 Abel s test, 131 132 Abel s theorem, 463 464 absolute convergence, 113 114 implication of conditional convergence, 114 absolute value, 7 reverse triangle inequality, 9 triangle

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

A Brunn Minkowski theory for coconvex sets of finite volume

A Brunn Minkowski theory for coconvex sets of finite volume A Brunn Minkowski theory for coconvex sets of finite volume Rolf Schneider Abstract Let C be a closed convex cone in R n, pointed and with interior points. We consider sets of the form A = C \ K, where

More information

Math212a1413 The Lebesgue integral.

Math212a1413 The Lebesgue integral. Math212a1413 The Lebesgue integral. October 28, 2014 Simple functions. In what follows, (X, F, m) is a space with a σ-field of sets, and m a measure on F. The purpose of today s lecture is to develop the

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

02. Measure and integral. 1. Borel-measurable functions and pointwise limits

02. Measure and integral. 1. Borel-measurable functions and pointwise limits (October 3, 2017) 02. Measure and integral Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2017-18/02 measure and integral.pdf]

More information

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS Series Logo Volume 00, Number 00, Xxxx 19xx COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS RICH STANKEWITZ Abstract. Let G be a semigroup of rational functions of degree at least two, under composition

More information

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm Chapter 13 Radon Measures Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm (13.1) f = sup x X f(x). We want to identify

More information

Measurable Choice Functions

Measurable Choice Functions (January 19, 2013) Measurable Choice Functions Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/choice functions.pdf] This note

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

The Lebesgue Integral

The Lebesgue Integral The Lebesgue Integral Brent Nelson In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the iemann integral. For more details see [1, Chapters

More information

MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION. Extra Reading Material for Level 4 and Level 6

MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION. Extra Reading Material for Level 4 and Level 6 MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION Extra Reading Material for Level 4 and Level 6 Part A: Construction of Lebesgue Measure The first part the extra material consists of the construction

More information