Convergence of latent mixing measures in nonparametric and mixture models

XuanLong Nguyen (xuanlong@umich.edu)
Department of Statistics, University of Michigan
Technical Report 527 (Revised, Jan 25, 2012)

AMS 2000 subject classification: Primary 62F15, 62G05; secondary 62G20. Key words and phrases: mixture distributions, hierarchical models, Bayesian nonparametrics, Wasserstein distance, f-divergence, rates of convergence, Dirichlet processes, Gaussian processes. This work was supported in part by NSF grants CCF and OCI.

Abstract

We consider Wasserstein distance functionals for assessing the convergence of latent discrete measures, which serve as mixing distributions in hierarchical and nonparametric mixture models. We clarify the relationships between Wasserstein distances of mixing distributions and f-divergence functionals, such as the Hellinger distance and the Kullback-Leibler divergence, on the space of mixture distributions, using various identifiability conditions. Convergence in Wasserstein metrics for discrete measures has a natural interpretation as the convergence of the individual atoms that provide support for the discrete measure. It is also typically stronger than the weak convergence induced by standard f-divergence metrics. We establish rates of convergence of posterior distributions for latent discrete measures in several mixture models, including finite mixtures of multivariate distributions, finite mixtures of Gaussian processes and infinite mixtures based on the Dirichlet process.

1 Introduction

A notable feature in the development of hierarchical and Bayesian nonparametric models is the role of discrete probability measures, which serve as mixing distributions to combine relatively simple models into richer classes of statistical models [18, 20]. In recent years the mixture modeling methodology has been significantly extended by many authors taking the mixing measure to be random and infinite dimensional, via suitable priors constructed in a nested, hierarchical and nonparametric manner. This results in rich models that can fit more complex and high dimensional data (see, e.g., [8, 27, 25, 23, 21] for several examples of such models, as well as a recent book [14]).

The focus of this paper is to analyze the convergence behavior of the posterior distribution of latent mixing measures as they arise in several mixture models, including infinite mixtures and other nonparametric models.

Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ denote a discrete probability measure. The atoms $\theta_i$ are elements of a space $\Theta$, while the vector of probabilities $p = (p_1,\ldots,p_k)$ lies in the $(k-1)$-dimensional probability simplex. In a mixture setting, $G$ is combined with a likelihood density $f(\cdot|\theta)$ with respect to a dominating measure $\mu$ on $\mathcal{X}$ to yield the mixture density:

$$p_G(x) = \int f(x|\theta)\, dG(\theta) = \sum_{i=1}^{k} p_i f(x|\theta_i).$$

In a clustering application, the atoms $\theta_i$ represent distinct behaviors in a heterogeneous data population, while the mixing probabilities $p_i$ are the associated proportions of such behaviors. Under this interpretation, there is a need for comparing and assessing the quality of the discrete measure $\hat G$ estimated on the basis of available data. An important work in this direction is by Chen [3], who used the $L_1$ metric on the cumulative distribution functions on the real line to study convergence rates of the mixing measure $G$. Building upon Chen's work, Ishwaran, James and Sun [15] established the posterior consistency of a finite dimensional Dirichlet prior for Bayesian mixture models. Their analysis is specific to univariate finite mixture models, with $k$ bounded by a known constant, while our interest is in cases where $k$ may be unbounded and/or $\Theta$ has high or infinite dimensions. For instance, $\Theta$ may be a subset of a function space as in the work of [8, 21], or a space of probability measures [25].

The analysis of consistency and convergence rates of posterior distributions for Bayesian estimation has seen much progress in the past decade. Key recent references include [1, 11, 26, 34, 12, 35]. Specific mixture models in a Bayesian setting have also been studied extensively [10, 9, 16, 13]. All these works focus primarily on convergence in the topology of the Hellinger or a comparable distance metric on the space of data densities $p_G$. On the other hand, there is far less work concerning the convergence behavior of the latent mixing measures $G$. Notably, the analysis of convergence for mixing (smooth) densities has often arisen in the context of frequentist estimation for deconvolution problems, mainly with kernel density estimation methods [36, 6].

The primary contribution of this paper is to show that Wasserstein distances provide a natural and useful metric for the analysis of convergence for latent and discrete mixing measures in mixture models, and to establish convergence rates of posterior distributions in a number of well-known Bayesian nonparametric and mixture models. Wasserstein distances originally arose in the problem of optimal transportation [32]. They have been utilized in a number of statistical contexts (e.g., [5, 19, 2, 4]). For discrete probability measures, they can be obtained by a minimum matching (or moving) procedure between the sets of atoms that provide support for the measures under comparison, and consequently are simple to compute. Suppose that $\Theta$ is equipped with a metric $\rho$. Let $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$. Then the $L_r$ Wasserstein distance metric on the space of discrete probability measures with support in $\Theta$, namely $\bar{\mathcal{G}}(\Theta)$, is:

$$d_\rho(G, G') = \left[\inf_q \sum_{i,j} q_{ij}\, \rho^r(\theta_i, \theta'_j)\right]^{1/r},$$

where the infimum is taken over all joint probability distributions $q$ on $\{1,\ldots,k\} \times \{1,\ldots,k'\}$ such that $\sum_j q_{ij} = p_i$ and $\sum_i q_{ij} = p'_j$.
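Since the infimum above ranges over a finite set of linear constraints, computing $d_\rho$ for two discrete measures is a small linear program over couplings $q$. The following minimal sketch (our own illustration; all names are ours, not the paper's) evaluates the $L_r$ Wasserstein distance with scipy's LP solver:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, atoms, p2, atoms2, r=1):
    """L_r Wasserstein distance between G = sum_i p[i] delta_{atoms[i]}
    and G' = sum_j p2[j] delta_{atoms2[j]}, via the transportation LP."""
    k, k2 = len(p), len(p2)
    # cost[i, j] = rho(theta_i, theta'_j)^r with rho the Euclidean metric
    cost = np.linalg.norm(atoms[:, None, :] - atoms2[None, :, :], axis=2) ** r
    # marginal constraints: sum_j q_ij = p_i and sum_i q_ij = p'_j
    A_eq = np.zeros((k + k2, k * k2))
    for i in range(k):
        A_eq[i, i * k2:(i + 1) * k2] = 1.0
    for j in range(k2):
        A_eq[k + j, j::k2] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, p2]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / r)

# two discrete measures on R^2 with different numbers of atoms
G_p, G_atoms = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 0.0]])
H_p, H_atoms = np.array([0.3, 0.3, 0.4]), np.array([[0.0, 0.1], [1.0, 0.0], [0.5, 0.5]])
print(wasserstein_discrete(G_p, G_atoms, H_p, H_atoms, r=1))
```

The LP scales with the product of the numbers of atoms, which is what makes these distances simple to compute for the discrete measures considered here.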

As is clear from this definition, the Wasserstein distances inherit directly the metric of the space of atomic support $\Theta$, suggesting that they can be useful for assessing estimation procedures for discrete measures in hierarchical models. It is worth noting that if $(G_n)_{n \ge 1}$ is a sequence of discrete probability measures with $k$ distinct atoms and $G_n$ tends to some $G_0$ in the $d_\rho$ metric, then $G_n$'s ordered set of atoms must converge to $G_0$'s atoms in $\rho$, after some permutation of the atom labels. Thus, in the clustering application illustrated above, convergence of the mixing measure $G$ may be interpreted as convergence of the distinct typical behaviors $\theta_i$ that characterize the heterogeneous data population. A hint of the relevance of Wasserstein distances can be drawn from the observation that the $L_1$ distance between the CDFs of univariate random variables, as studied by Chen [3], is in fact a special case of the $L_1$ Wasserstein metric when $\Theta = \mathbb{R}$.

The plan for the paper is as follows. Section 2 explores the relationship between Wasserstein distances for mixing measures and well-known divergence functionals for mixture densities in a mixture model. We produce a simple lemma which bounds f-divergences between mixture densities from above by certain Wasserstein distances between mixing measures. This implies that the $d_\rho$ topology can be stronger than those induced by divergences between mixture densities. Next, we consider various identifiability conditions under which convergence of mixture densities entails convergence of mixing measures in the Wasserstein metric. We present two theorems, which provide upper bounds on $d_\rho(G, G')$ in terms of divergences between $p_G$ and $p_{G'}$. Theorem 1 is applicable to mixing measures with a bounded number of support atoms, generalizing a result from [3]. Theorem 2 is applicable to mixing measures with an unbounded number of support points, but is restricted to convolution mixture models.

Section 3 focuses on the convergence of posterior distributions of latent mixing measures in a Bayesian nonparametric setting. Here, the mixing measure $G$ is endowed with a prior distribution $\Pi$. Assuming an $n$-sample $X_1,\ldots,X_n$ generated according to $p_{G_0}$, we study conditions under which the posterior distribution of $G$, namely $\Pi(\cdot|X_1,\ldots,X_n)$, contracts to the truth $G_0$ under the $d_\rho$ metric, and provide the contraction rates. In Theorems 3 and 4 of Section 3, we establish convergence rates for the posterior distribution of $G$ in terms of the $d_\rho$ metric. These results are proved using the standard approach of Ghosal, Ghosh and van der Vaart [11]. Our convergence theorems have several notable features. They rely on separate conditions for the prior $\Pi$ and the likelihood function $f$, which are typically simpler to verify than conditions formulated in terms of mixture densities. The claim of convergence in Wasserstein metrics is typically stronger than the weak convergence induced by the Hellinger metric in the existing work mentioned above.

In Section 4, posterior consistency and convergence rates of latent mixing measures are derived for a number of well-known mixture models in the literature, including finite mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and finite mixtures of Gaussian processes. For finite mixtures with a bounded number of support atoms, the posterior convergence rate for the mixing measure is the minimax optimal $n^{-1/4}$ under suitable identifiability conditions. For Dirichlet process mixtures defined on $\mathbb{R}^d$, specific rates are established under smoothness conditions on the likelihood density function $f$.

In particular, for ordinary smooth likelihood densities with smoothness $\beta$ (e.g., Laplace), the rate achieved is $(\log n/n)^{\gamma/2}$ for any $\gamma < 4/\bigl((d+2)(4+(2\beta+1)d)\bigr)$. For supersmooth likelihood densities with smoothness $\beta$ (e.g., normal), the rate achieved is $(\log n)^{-1/\beta}$. Finally, for finite mixtures of Gaussian processes, we are also able to establish a convergence rate by utilizing a result on (single) Gaussian process priors by van der Vaart and van Zanten [31].

Notation. For ease of notation, we also use $f_i$ in place of $f(\cdot|\theta_i)$, and $f'_j$ in place of $f(\cdot|\theta'_j)$, for likelihood density functions. Divergences studied in the paper include the total variational distance, $d_V(p_G, p_{G'}) := \frac{1}{2}\int |p_G(x) - p_{G'}(x)|\, d\mu$; the Hellinger distance, $d_h^2(p_G, p_{G'}) := \frac{1}{2}\int (\sqrt{p_G(x)} - \sqrt{p_{G'}(x)})^2\, d\mu$; and the Kullback-Leibler divergence, $d_K(p_G, p_{G'}) = \int p_G(x) \log(p_G(x)/p_{G'}(x))\, d\mu$. These divergences are related by $d_V^2/2 \le d_h^2 \le d_V$ and $d_h^2 \le d_K/2$. $N(\epsilon, \Theta, \rho)$ denotes the covering number of the metric space $(\Theta, \rho)$, i.e., the minimum number of $\epsilon$-balls needed to cover the entire space $\Theta$. $D(\epsilon, \Theta, \rho)$ denotes the packing number of $(\Theta, \rho)$, i.e., the maximum number of points that are mutually separated by at least $\epsilon$ in distance. They are related by $N(\epsilon, \Theta, \rho) \le D(\epsilon, \Theta, \rho) \le N(\epsilon/2, \Theta, \rho)$. $\mathrm{Diam}(\Theta)$ denotes the diameter of $\Theta$.

2 Wasserstein distances for mixing measures

2.1 Definition and a basic inequality

Let $(\Theta, \rho)$ be a space equipped with a non-negative distance function $\rho$ such that $\rho(\theta_1, \theta_2) = 0$ if and only if $\theta_1 = \theta_2$. If, in addition, $\rho$ is symmetric ($\rho(\theta_1, \theta_2) = \rho(\theta_2, \theta_1)$) and satisfies the triangle inequality, then it is a proper metric. A discrete probability measure $G$ on a measure space equipped with the Borel sigma algebra takes the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ for some $k \in \mathbb{N} \cup \{+\infty\}$, where $p = (p_1, p_2, \ldots, p_k)$ denotes the proportion vector, while $\theta = (\theta_1, \ldots, \theta_k)$ are the associated atoms in $\Theta$. The vector $p$ has to satisfy $0 \le p_i \le 1$ and $\sum_{i=1}^{k} p_i = 1$. Likewise, $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$ is another discrete probability measure that has at most $k'$ distinct atoms. Let $\mathcal{G}_k(\Theta)$ denote the space of all discrete probability measures with at most $k$ atoms, and let $\mathcal{G}(\Theta) = \bigcup_{k \ge 1} \mathcal{G}_k(\Theta)$ be the set of all discrete measures with finite support. Finally, $\bar{\mathcal{G}}(\Theta)$ denotes the space of all discrete measures (including those with countably infinite support).

Let $q = (q_{ij})_{i \le k;\, j \le k'} \in [0,1]^{k \times k'}$ denote a joint probability distribution on $\mathbb{N}_+ \times \mathbb{N}_+$ that satisfies the marginal constraints $\sum_{i=1}^{k} q_{ij} = p'_j$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for any $i = 1,\ldots,k$ and $j = 1,\ldots,k'$. Let $\mathcal{Q}(p, p')$ denote the space of all such joint distributions. We start with the $L_1$ Wasserstein distance:

Definition 1. Let $\rho$ be a distance function on $\Theta$. The Wasserstein distance functional for two discrete measures $G(p, \theta)$ and $G'(p', \theta')$ is:

$$d_\rho(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, \rho(\theta_i, \theta'_j). \qquad (1)$$
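For concreteness, here is a small worked instance of (1) (our own example). If $G = \frac{1}{2}\delta_{\theta_1} + \frac{1}{2}\delta_{\theta_2}$ and $G' = \delta_{\theta'}$, the marginal constraints force the unique coupling $q_{11} = q_{21} = 1/2$, so

$$d_\rho(G, G') = \tfrac{1}{2}\rho(\theta_1, \theta') + \tfrac{1}{2}\rho(\theta_2, \theta'):$$

all the mass at each atom of $G$ must be transported to the single atom of $G'$, and the distance is the total transport cost.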

We focus mainly on the $L_1$ Wasserstein distance $d_\rho$ and the $L_2$ Wasserstein distance. The latter corresponds to the square root of $d_{\rho^2}$ in our definition, where $\rho(\theta_i, \theta'_j)$ is replaced by $\rho^2(\theta_i, \theta'_j)$. Note that $d_\rho^2(G, G') \le d_{\rho^2}(G, G')$ by an application of the Cauchy-Schwarz inequality. We will consider a variety of choices of the distance $\rho$ in the sequel.

From here on, a discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ plays the role of the mixing distribution in a mixture model. Let $f(x|\theta)$ denote the density (with respect to a dominating measure $\mu$) of a random variable $X$ taking values in $\mathcal{X}$, given parameter $\theta \in \Theta$. For ease of notation, we also write $f_i(x)$ for $f(x|\theta_i)$. Combining $G$ with the likelihood function $f$ yields a mixture distribution for $X$ with density $p_G(x) = \int f(x|\theta)\, dG(\theta) = \sum_{i=1}^{k} p_i f_i(x)$.

First we state a general result that relates Wasserstein distances between mixing measures $G, G'$, namely $d_\rho(G, G')$, and divergences between the mixture densities $p_G, p_{G'}$. Divergence functionals that play an important role in this paper are the total variational distance, the Hellinger distance, and the Kullback-Leibler divergence. All of these are in fact instances of a broader class of divergence functionals known as the f-divergences (Csiszar, 1966; Ali & Silvey, 1967):

Definition 2. Let $\phi : \mathbb{R} \to \mathbb{R}$ denote a convex function. An f-divergence (or Ali-Silvey distance) between two probability densities $f_i$ and $f'_j$ is defined as $d_\phi(f_i, f'_j) = \int \phi(f'_j/f_i)\, f_i\, d\mu$. Likewise, the f-divergence between $p_G$ and $p_{G'}$ is $d_\phi(p_G, p_{G'}) = \int \phi(p_{G'}/p_G)\, p_G\, d\mu$.

For $\phi(u) = \frac{1}{2}(\sqrt{u} - 1)^2$ we obtain the squared Hellinger distance. For $\phi(u) = \frac{1}{2}|u - 1|$ we obtain the variational distance. For $\phi(u) = -\log u$ we obtain the Kullback-Leibler divergence. f-divergence functionals can be used as a distance function on $\Theta$, motivating the following definition.

Definition 3. When $\rho$ is taken to be an f-divergence, $\rho(\theta_i, \theta'_j) = d_\phi(f_i, f'_j)$ for a convex function $\phi$, the corresponding Wasserstein distance functional is called a composite Wasserstein distance:

$$d_{\rho_\phi}(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, d_\phi(f_i, f'_j).$$

In particular, $d_V, d_h, d_K$ induce the composite Wasserstein distances $d_{\rho_V}, d_{\rho_h}, d_{\rho_K}$, respectively. Let $d_{\rho_{h^2}}(G, G')$ denote the composite Wasserstein distance obtained by taking $\rho(\theta_i, \theta'_j) := d_h^2(f_i, f'_j)$.

The following result shows that any f-divergence between the mixture distributions $p_G, p_{G'}$ is dominated by a Wasserstein distance for a suitable choice of distance $\rho$.

Lemma 1. Let $G, G' \in \bar{\mathcal{G}}(\Theta)$ be such that both $d_\phi(p_G, p_{G'})$ and $d_{\rho_\phi}(G, G')$ are finite for some convex function $\phi$. Then $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.
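Lemma 1 can be checked numerically on a toy case. The sketch below (our own illustration; the names are not from the paper) compares the squared Hellinger distance between two univariate Gaussian location mixtures, computed on a grid, against the composite Wasserstein bound $d_{\rho_{h^2}}(G, G')$, using the closed form $d_h^2(f_i, f'_j) = 1 - \exp(-(\theta_i - \theta'_j)^2/8)$ for unit-variance normals:

```python
import numpy as np

# squared Hellinger between unit-variance normals N(a,1) and N(b,1)
hell2 = lambda a, b: 1.0 - np.exp(-(a - b) ** 2 / 8.0)

p,  th  = np.array([0.4, 0.6]), np.array([-1.0, 2.0])   # G
p2, th2 = np.array([0.7, 0.3]), np.array([0.0, 2.5])    # G'

# composite Wasserstein d_{rho_{h^2}}(G, G'): for two 2-atom measures the
# coupling has one free parameter t = q_11, the objective is linear in t,
# so it suffices to check the two endpoints of the feasible interval.
cost = np.array([[hell2(a, b) for b in th2] for a in th])
def obj(t):
    q = np.array([[t, p[0] - t], [p2[0] - t, 1.0 - p[0] - p2[0] + t]])
    return np.sum(q * cost)
lo, hi = max(0.0, p[0] + p2[0] - 1.0), min(p[0], p2[0])
rhs = min(obj(lo), obj(hi))

# d_h^2(p_G, p_G') between the two mixture densities, on a grid
x = np.linspace(-12.0, 12.0, 4001)
gauss = lambda t: np.exp(-(x - t) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
pG  = p  @ np.array([gauss(t) for t in th])
pG2 = p2 @ np.array([gauss(t) for t in th2])
lhs = 0.5 * np.sum((np.sqrt(pG) - np.sqrt(pG2)) ** 2) * (x[1] - x[0])
print(lhs, "<=", rhs)  # Lemma 1 in action
```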

As will be evident in the sequel, this lemma is also handy in enabling us to obtain lower bounds on small Kullback-Leibler ball probabilities in the space of mixture densities $p_G$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$. The latter quantities are typically easier to estimate than the former.

Example 1. Suppose that $\Theta = \mathbb{R}^d$, $\rho$ is the Euclidean metric, and $f(x|\theta)$ is the multivariate normal density $N(\theta, I_{d\times d})$ with mean $\theta$ and identity covariance matrix. Then

$$d_h^2(f_i, f'_j) = 1 - \exp\Bigl(-\tfrac{1}{8}\|\theta_i - \theta'_j\|^2\Bigr) \le \tfrac{1}{8}\|\theta_i - \theta'_j\|^2.$$

So $d_{\rho_{h^2}}(G, G') \le d_{\rho^2}(G, G')/8$, and the above lemma then entails that $d_h^2(p_G, p_{G'}) \le d_{\rho^2}(G, G')/8$. Similarly, for the Kullback-Leibler divergence, since $d_K(f_i, f'_j) = \frac{1}{2}\|\theta_i - \theta'_j\|^2$, by Lemma 1, $d_K(p_G, p_{G'}) \le d_{\rho_K}(G, G') = \frac{1}{2} d_{\rho^2}(G, G')$. For another example, if $f(x|\theta)$ is a Gamma density with location parameter $\theta$, and $\Theta$ is a compact subset of $\mathbb{R}$ bounded away from 0, then $d_K(f_i, f'_j) = O(|\theta_i - \theta'_j|)$. This entails that $d_K(p_G, p_{G'}) \le O(d_\rho(G, G'))$.

2.2 Wasserstein metric identifiability in finite mixture models

Lemma 1 shows that for many choices of $\rho$, $d_\rho$ yields a stronger topology on $\bar{\mathcal{G}}(\Theta)$ than the topology induced by f-divergences on the space of mixture distributions $p_G$. In other words, convergence of $p_G$ may not imply convergence of $G$ in the $d_\rho$ metric. To ensure this property, additional conditions are needed on the space of discrete measures $\bar{\mathcal{G}}(\Theta)$, along with identifiability conditions for the family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$.

The classical definition of Teicher [29] specifies the family $\{f(\cdot|\theta), \theta \in \Theta\}$ to be identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $p_G = p_{G'}$ implies $G = G'$. We need a slightly stronger version, allowing for the inclusion of discrete measures with infinite support:

Definition 4. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is finitely identifiable if for any $G \in \mathcal{G}(\Theta)$ and $G' \in \bar{\mathcal{G}}(\Theta)$, $p_G(x) = p_{G'}(x)$ for almost all $x \in \mathcal{X}$ implies that $G = G'$.

To obtain convergence rates, we also need the notion of strong identifiability of [3], herein adapted to a multivariate setting.

Definition 5. Assume that $\Theta \subseteq \mathbb{R}^d$ and $\rho$ is the Euclidean metric. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice differentiable in $\theta$, and for any finite $k$ and any $k$ distinct $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Bigl| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + \beta_i^T Df(x|\theta_i) + \gamma_i^T D^2 f(x|\theta_i)\gamma_i \Bigr| = 0 \qquad (2)$$

implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \mathbb{R}^d$ for $i = 1,\ldots,k$. Here, for each $x$, $Df(x|\theta_i)$ and $D^2 f(x|\theta_i)$ denote the gradient and the Hessian at $\theta_i$ of the function $f(x|\cdot)$, respectively.

Finite identifiability is satisfied by the family of Gaussian distributions [28]; see also Theorem 1 of [16]. Chen identified a broad class of families, including the Gaussian family, for which the strong identifiability condition holds [3].
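Returning to Example 1 above: as a quick check (our own derivation), the quoted Kullback-Leibler formula for unit-covariance Gaussians follows from a one-line computation,

$$d_K(f_i, f'_j) = \mathbb{E}_{X \sim N(\theta_i, I)}\Bigl[\tfrac{1}{2}\|X - \theta'_j\|^2 - \tfrac{1}{2}\|X - \theta_i\|^2\Bigr] = \tfrac{1}{2}\|\theta_i - \theta'_j\|^2,$$

since the log-density ratio is quadratic and $\mathbb{E}(X - \theta_i) = 0$.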

7 Define ψ(g, G ) = sup x p G (x) p G (x) /d ρ 2(G, G ) if G G and otherwise. Also define ψ 1 (G, G ) = d V (p G, p G )/d ρ 2(G, G ) if G G and otherwise. The notion of strong identifiability is useful via the following key result, which generalizes Chen s result to Θ of arbitrary dimensions. Theorem 1. (Strong identifiability). Suppose that Θ is compact subset of R d, the family {f( θ), θ Θ} is strongly identifiable, and for all x X, the Hessian matrix D 2 f(x θ) satisfies a uniform Lipschitz condition γ T (D 2 f(x θ 1 ) D 2 f(x θ 2 ))γ Cρ(θ 1, θ 2 ) δ γ 2, (3) for all x, θ 1, θ 2 and some fixed C and δ > 0. Then, for fixed G 0 G k (Θ), where k < : { } lim inf ψ(g, G ) : d ρ (G 0, G) d ρ (G 0, G ) ǫ > 0. (4) ǫ 0 G,G G k (Θ) The assertion also holds with ψ being replaced by ψ 1. Remarks. (i) In Section 5.2 the notion of strong identifiability is extended to an infinite dimensional setting via first and second order Fréchet derivatives in normed spaces. (ii) Suppose that G 0 has exactly k support points in Θ. Then, an examination of the proof reveals that the requirement that Θ be compact is not needed. Indeed, if there is a sequence of G n G k (Θ) such that d ρ (G 0, G n ) 0, then it is simple to show that there is a subsequence of G n that also has k distinct atoms, which converge in ρ metric to the set of k atoms of G 0 (up to some permutation of the labels). The proof of the theorem proceeds as before. For the rest of this paper, by strong identifiability we always mean conditions specified in Theorem 1 so that Eq. (4) can be deduced. This practically means that the conditions specified by Eq. (2) and Eq. (3) be given, while the compactness of Θ may sometimes be required. 2.3 Wasserstein metric identifiability in infinite mixture models Next, we state a counterpart of Theorem 1 for G, G Ḡ(Θ), i.e., mixing measures with infinitely many support points. We restrict our attention to convolution mixture models on R d. That is, the likelihood density function f(x θ), with respect to Lebesgue, takes the form f(x θ) for some multivariate density function f on R d. Thus, p G (x) = G f(x) = k i=1 p if(x θ i ) and p G (x) = G f(x) = k j=1 p j f(x θ j ). As before, we need a compactness condition for Θ. Additional key assumptions concern with the smoothness of density function f. This is characterized in terms of the tail behavior of the Fourier transform of f. We consider both ordinary smooth densities (e.g., Laplace and Gamma), and supersmooth densities (e.g., normal). The following result does not require that G, G be discrete. 7

Theorem 2. Suppose that $G, G'$ are probability measures that place full support on a compact set $\Theta \subset \mathbb{R}^d$, and $f$ is a density function on $\mathbb{R}^d$ that is symmetric (around 0), i.e., $\int_A f\,dx = \int_{-A} f\,dx$ for any Borel set $A \subseteq \mathbb{R}^d$. Moreover, assume that $\tilde f(\omega) \ne 0$ for all $\omega \in \mathbb{R}^d$.

(1) Ordinary smooth likelihood. Suppose $|\tilde f(\omega)| \prod_{j=1}^{d} |\omega_j|^{\beta} \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constant $d_0$ and constant $\beta$. Then for any $m < 4/(4 + (2\beta+1)d)$ there is some constant $C(d, \beta, m)$, depending only on $d$, $\beta$ and $m$, such that

$$d_{\rho^2}(G, G') \le C(d, \beta, m)\, d_V(p_G, p_{G'})^{m}$$

as $d_V(p_G, p_{G'}) \to 0$.

(2) Supersmooth likelihood. Suppose $|\tilde f(\omega)| \prod_{j=1}^{d} \exp(|\omega_j|^{\beta}/\gamma) \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constants $\beta, \gamma, d_0$. Then there is some constant $C(d, \beta)$, depending only on $d$ and $\beta$, such that

$$d_{\rho^2}(G, G') \le C(d, \beta)\bigl(-\log d_V(p_G, p_{G'})\bigr)^{-2/\beta}$$

as $d_V(p_G, p_{G'}) \to 0$.

Example 2. For the standard normal density on $\mathbb{R}^d$, $\tilde f(\omega) = \prod_{j=1}^{d} \exp(-\omega_j^2/2)$, and we obtain that $d_{\rho^2}(G, G') \lesssim (-\log d_V(p_G, p_{G'}))^{-1}$ as $d_\rho(G, G') \to 0$ (so that $d_V(p_G, p_{G'}) \to 0$, by Lemma 1). For a Laplace density on $\mathbb{R}$, e.g., $\tilde f(\omega) = \frac{1}{1+\omega^2}$, we have $d_{\rho^2}(G, G') \lesssim d_V(p_G, p_{G'})^{m}$ for any $m < 4/(4 + 5d)$, as $d_\rho(G, G') \to 0$.

3 Convergence of posterior distributions of mixing measures

We are ready to study the convergence of discrete mixing measures in a Bayesian setting. Let $X_1, \ldots, X_n$ be an iid sample from the mixture density $p_G(x) = \int f(x|\theta)\, dG(\theta)$, where $f$ is known, while $G = G_0$ for some unknown mixing measure in $\mathcal{G}_k(\Theta)$. The number of support points $k$ may not be known. In the Bayesian framework, $G$ is endowed with a prior distribution $\Pi$ on a suitable measure space of discrete probability measures in $\bar{\mathcal{G}}(\Theta)$. The posterior distribution of $G$ is given by, for any measurable set $B$:

$$\Pi(B|X_1,\ldots,X_n) = \int_B \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G) \Bigm/ \int \prod_{i=1}^{n} p_G(X_i)\, d\Pi(G).$$

We shall study conditions under which the posterior distribution is consistent, i.e., it concentrates on arbitrarily small $d_\rho$ neighborhoods of $G_0$, and establish the rates of this convergence. The analysis is based upon the general framework of Ghosal, Ghosh and van der Vaart [11], who analyzed the convergence behavior of posterior distributions in terms of f-divergences, such as the Hellinger and variational distances, on the mixture densities of the data.
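As a toy illustration of the posterior formula above (our own sketch, not an algorithm from the paper): with a prior supported on finitely many candidate mixing measures, the posterior is just a likelihood reweighting of the prior, and its mass accumulates on the candidate generating the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# truth: G0 = 0.5 delta_{-1} + 0.5 delta_{+1}, normal location mixture
n = 200
X = rng.choice([-1.0, 1.0], size=n) + rng.standard_normal(n)

# a hypothetical prior supported on three candidate mixing measures
candidates = [                                       # (weights p, atoms theta)
    (np.array([0.5, 0.5]), np.array([-1.0, 1.0])),   # the truth G0
    (np.array([0.5, 0.5]), np.array([-2.0, 2.0])),
    (np.array([1.0]), np.array([0.0])),
]
prior = np.full(3, 1.0 / 3.0)

def log_lik(p, th):
    # log prod_i p_G(X_i), where p_G(x) = sum_i p_i N(x | theta_i, 1)
    dens = (p[None, :] * np.exp(-(X[:, None] - th[None, :]) ** 2 / 2.0)
            / np.sqrt(2.0 * np.pi)).sum(axis=1)
    return np.log(dens).sum()

logw = np.log(prior) + np.array([log_lik(p, th) for p, th in candidates])
post = np.exp(logw - logw.max()); post /= post.sum()
print(post)  # posterior probabilities; the first entry -> 1 as n grows
```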

In the following we formulate two convergence theorems for the mixture model setting (which can be viewed as counterparts of Theorems 2.1 and 2.4 of [11]). A notable feature of our theorems is that the conditions (e.g., entropy and prior concentration) are stated in terms of the Wasserstein metric, as opposed to f-divergences on the mixture densities. They can typically be separated into independent conditions on the prior for $G$ and on the likelihood family, and are simpler to verify for mixture models. In addition, the convergence of the posterior distribution of mixing measures is established in terms of Wasserstein distance metrics. The following notion plays a central role in the theorems' formulation.

Definition 6. Let $\mathcal{G}$ be a subset of $\bar{\mathcal{G}}(\Theta)$. For each $k < \infty$, define the Hellinger information of the $d_\rho$ metric as the real-valued function $C_k(\mathcal{G}, \cdot) : \mathbb{R} \to \mathbb{R}$,

$$C_k(\mathcal{G}, r) = \inf_{G_0 \in \mathcal{G}_k(\Theta),\ G \in \mathcal{G}:\ d_\rho(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)/2. \qquad (5)$$

It is obvious that $C_k$ is a non-negative and non-decreasing function. The following characterization of $C_k$ follows immediately from the results obtained in the previous section.

Proposition 1. (a) If $\mathcal{G}$ and $\mathcal{G}_k(\Theta)$ are both compact in the Wasserstein topology and the family of likelihood functions is finitely identifiable, then $C_k(\mathcal{G}, r) > 0$ for any $r > 0$.
(b) If $\Theta \subset \mathbb{R}^d$ is compact and the family of likelihood functions is strongly identifiable as specified in Theorem 1, then for each $k$ there is a constant $c(k) > 0$ such that $C_k(\mathcal{G}_k(\Theta), r) \ge c(k) r^4$ for all $r > 0$.
(c) If $\Theta \subset \mathbb{R}^d$ is compact and the family of likelihood functions is ordinary smooth with parameter $\beta$, as specified in Theorem 2, then for any $\epsilon > 0$ there is some constant $c(d, \beta)$ such that $C_k(\bar{\mathcal{G}}(\Theta), r) \ge c(d, \beta)\, r^{4 + (2\beta+1)d + \epsilon}$ for any $r > 0$ and any $k > 0$. For a supersmooth likelihood family, we have $C_k(\bar{\mathcal{G}}(\Theta), r) \ge \exp[-c(d, \beta)\, r^{-\beta}]$ for any $r > 0$ and any $k > 0$.

The following two theorems rely on three types of conditions. The first is concerned with the size of the support of $\Pi$, often quantified in terms of its entropy number. Estimates of entropy numbers defined in terms of Wasserstein metrics for several measure classes of interest are given in Lemma 2. The second is on the Kullback-Leibler support of $\Pi$, which is related both to the space of discrete measures $\bar{\mathcal{G}}(\Theta)$ and to the family of likelihood functions $f(x|\theta)$. The Kullback-Leibler neighborhood is defined as:

$$B(\epsilon) = \Bigl\{ G \in \bar{\mathcal{G}}(\Theta) : P_{G_0}\Bigl(\log \frac{p_{G_0}}{p_G}\Bigr) \le \epsilon^2,\ P_{G_0}\Bigl(\log \frac{p_{G_0}}{p_G}\Bigr)^2 \le \epsilon^2 \Bigr\}. \qquad (6)$$

The third condition is on the Hellinger information of the $d_\rho$ metric, the function $C_k(\mathcal{G}, r)$, a characterization of which is given above.
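To get a feel for the Hellinger information, one can approximate the infimum in (5) by random search over pairs of mixing measures. The sketch below (our own illustration, for unit-variance Gaussian location mixtures on $\Theta = [-1, 1]$) produces values roughly consistent with the $r^4$ lower bound of Proposition 1(b); being a random search, it only yields an upper estimate of the true infimum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-8.0, 8.0, 1001); dx = x[1] - x[0]
gauss = lambda t: np.exp(-(x - t) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def mix(p, th):
    return p @ np.array([gauss(t) for t in th])

def hell2(f, g):
    return 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx

def w1(p, th, p2, th2):
    """1-D L1 Wasserstein between discrete measures via the CDF difference."""
    a = np.concatenate([th, th2]); o = np.argsort(a)
    gap = np.cumsum(np.concatenate([p, -p2])[o])[:-1]
    return np.sum(np.abs(gap) * np.diff(a[o]))

def hellinger_information(r, k=2, trials=5000):
    """Random-search approximation of C_k(G_k(Theta), r), Theta = [-1, 1]."""
    best = np.inf
    for _ in range(trials):
        p, p2 = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
        th, th2 = rng.uniform(-1, 1, k), rng.uniform(-1, 1, k)
        if w1(p, th, p2, th2) >= r / 2.0:
            best = min(best, hell2(mix(p, th), mix(p2, th2)) / 2.0)
    return best

for r in (0.2, 0.4, 0.8):
    print(r, hellinger_information(r))
```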

Theorem 3. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $(\epsilon_n)_{n \ge 1}$ tending to a constant (or to 0) such that $n\epsilon_n^2 \to \infty$, sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, and a constant $C > 0$, we have

$$\log D(\epsilon_n, \mathcal{G}_n, d_\rho) \le n\epsilon_n^2, \qquad (7)$$
$$\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) \le \exp[-n\epsilon_n^2 (C+4)], \qquad (8)$$
$$\Pi(B(\epsilon_n)) \ge \exp(-n\epsilon_n^2 C). \qquad (9)$$

Moreover, suppose $M_n$ is a sequence such that

$$C_k(\mathcal{G}_n, M_n \epsilon_n) \ge \epsilon_n^2 (C+4), \qquad (10)$$
$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (11)$$

Then $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

A stronger theorem (using a substantially weaker condition on the covering number) can be formulated as follows.

Theorem 4. Let $G_0 \in \mathcal{G}_k(\Theta) \subset \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $\epsilon_n \to 0$ such that $n\epsilon_n^2$ is bounded away from 0 or tends to infinity, and sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, we have

$$\log D(\epsilon/2, \{G \in \mathcal{G}_n : \epsilon \le d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le n\epsilon_n^2 \quad \text{for every } \epsilon \ge \epsilon_n, \qquad (12)$$
$$\frac{\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n)}{\Pi(B(\epsilon_n))} = o(\exp(-2n\epsilon_n^2)), \qquad (13)$$
$$\frac{\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)}{\Pi(B(\epsilon_n))} \le \exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \quad \text{for any } j \ge M_n, \qquad (14)$$

where $M_n$ is a sequence such that

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \to 0. \qquad (15)$$

Then $\Pi(G : d_\rho(G_0, G) \ge M_n \epsilon_n \mid X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

Remarks. (i) The above statement continues to hold if conditions (14) and (15) are replaced by the following condition:

$$\frac{\exp(2n\epsilon_n^2)}{\Pi(B(\epsilon_n))} \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \to 0. \qquad (16)$$

(ii) The above theorems are stated for the $L_1$ Wasserstein metric $d_\rho$, but they also hold for the $L_2$ Wasserstein metric $d_{\rho^2}^{1/2}$, with a slight modification of the definition of the Hellinger information function to $C_k(\mathcal{G}, r) = \inf_{d_{\rho^2}^{1/2}(G_0, G) \ge r/2} d_h^2(p_{G_0}, p_G)$.

Before moving to specific examples, we state a simple lemma which provides estimates of the entropy under the $d_\rho$ metric for a number of classes of discrete measures of interest. Because $d_\rho$ inherits directly the $\rho$ metric on $\Theta$, the entropy of classes in $(\bar{\mathcal{G}}(\Theta), d_\rho)$ can typically be bounded in terms of covering numbers of subsets of $(\Theta, \rho)$. These bounds will be used extensively in the sequel. (A worked instance of part (a) is given after Assumptions A below.)

Lemma 2. (a) $\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \le k\bigl(\log N(\epsilon, \Theta, \rho) + \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)\bigr)$.
(b) $\log N(2\epsilon, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon, \Theta, \rho)\log(e + e\,\mathrm{Diam}(\Theta)/\epsilon)$.
(c) Let $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. Assume that $M = \max_{i \le k} 1/p_i^* < \infty$ and $m = \min_{i \ne j \le k} \rho(\theta_i^*, \theta_j^*) > 0$. Then

$$\log N(\epsilon/2, \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le k\Bigl(\sup_{\Theta'} \log N(\epsilon/4, \Theta', \rho) + \log(32k\,\mathrm{Diam}(\Theta)/m)\Bigr),$$

where the supremum on the right side is taken over all bounded subsets $\Theta' \subseteq \Theta$ such that $\mathrm{Diam}(\Theta') \le 4M\epsilon$.

4 Examples

In this section the general theory is illustrated on specific mixture models, including finite mixtures of multivariate distributions, infinite mixtures based on Dirichlet processes, and finite mixtures of Gaussian processes.

4.1 Finite mixtures of multivariate distributions

Let $\Theta$ be a subset of $\mathbb{R}^d$, let $\rho$ be the Euclidean metric, and let $\Pi$ be a prior distribution for discrete measures in $\mathcal{G}_k(\Theta)$, where $k < \infty$ is known. Suppose that the truth is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$. To obtain the convergence rate of the posterior distribution of $G$, we need:

Assumptions A.
(A1) $\Theta$ is compact and the family of likelihood functions $f(\cdot|\theta)$ is strongly identifiable.
(A2) For some positive constant $C_1$, $d_K(f_i, f'_j) \le C_1 \rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)$ for some constant $C_2 > 0$.
(A3) Under the prior $\Pi$, for small $\delta > 0$, $c_3 \delta^k \le \Pi(|p_i - p_i^*| \le \delta,\ i = 1,\ldots,k) \le C_3 \delta^k$ and $c_3 \delta^{kd} \le \Pi(\rho(\theta_i, \theta_i^*) \le \delta,\ i = 1,\ldots,k) \le C_3 \delta^{kd}$ for some constants $c_3, C_3 > 0$.
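As promised above, here is a worked instance of Lemma 2(a) (our own calculation). For $\Theta = [0,1]^d$ with the Euclidean metric, $\log N(\epsilon, \Theta, \rho) \asymp d\log(1/\epsilon)$, so

$$\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \lesssim k\bigl(d\log(1/\epsilon) + \log(e + e\sqrt{d}/\epsilon)\bigr) \asymp k(d+1)\log(1/\epsilon);$$

that is, entropy-wise the class of $k$-atom mixing measures behaves like a parametric set of dimension of order $k(d+1)$, matching (up to constants) its $k$ weights and $kd$ atom coordinates.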

Remarks. (A1) and (A2) hold for the family of Gaussian densities with mean parameter $\theta$. (A3) holds when the prior distribution on the relevant parameters behaves like a uniform distribution, up to a multiplicative constant.

Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$. Combining Lemma 1 with Assumption (A2), if $\rho(\theta_i^*, \theta_i) \le \epsilon$ and $|p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2)$ for $i = 1,\ldots,k$, then $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 \sum_{1 \le i,j \le k} q_{ij}\,\rho^2(\theta_i^*, \theta_j)$ for any $q \in \mathcal{Q}$. Thus,

$$d_K(p_{G_0}, p_G) \le C_1 d_{\rho^2}(G_0, G) \le C_1 \sum_{i=1}^{k} (p_i^* \wedge p_i)\, \rho^2(\theta_i^*, \theta_i) + C_1 \sum_{i=1}^{k} |p_i^* - p_i|\, \mathrm{Diam}(\Theta)^2 \le 2C_1 \epsilon^2.$$

Hence, under the prior $\Pi$,

$$\Pi\bigl(G : d_K(p_{G_0}, p_G) \lesssim \epsilon^2\bigr) \ge \Pi\bigl(G : \rho(\theta_i, \theta_i^*) \le \epsilon,\ |p_i - p_i^*| \le \epsilon^2/(k\,\mathrm{Diam}(\Theta)^2),\ i = 1,\ldots,k\bigr).$$

In view of Assumptions (A2) and (A3), we have $\Pi(B(\epsilon)) \gtrsim \epsilon^{k(d+2)}$. Conversely, for sufficiently small $\epsilon$, if $d_{\rho^2}(G_0, G) \le \epsilon^2$ then, after reordering the indices of the atoms, we must have $\rho(\theta_i, \theta_i^*) = O(\epsilon)$ and $|p_i - p_i^*| = O(\epsilon^2)$ for all $i = 1,\ldots,k$ (see the argument in the proof of Lemma 2(c)). This entails that, under the prior $\Pi$,

$$\Pi(G : d_{\rho^2}(G_0, G) \le \epsilon^2) \le \Pi\bigl(G : \rho(\theta_i, \theta_i^*) \le O(\epsilon),\ |p_i - p_i^*| \le O(\epsilon^2),\ i = 1,\ldots,k\bigr) \lesssim \epsilon^{k(d+2)}.$$

Let $\epsilon_n = n^{-1/2}$. We proceed by verifying the conditions of Theorem 4, as this theorem provides the right rate for parametric mixture models under the $L_2$ Wasserstein distance metric $d_{\rho^2}^{1/2}$. Let $\mathcal{G}_n := \mathcal{G}_k(\Theta)$. Then $\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) = 0$, so Eq. (13) trivially holds. Next, we show that $D(\epsilon/2, S, d_{\rho^2}^{1/2})$, where $S = \{G \in \mathcal{G}_n : \epsilon \le d_{\rho^2}^{1/2}(G_0, G) \le 2\epsilon\}$, is bounded above by a constant, so that (12) is satisfied. Indeed, for any $\epsilon > 0$, $\log D(\epsilon/2, S, d_{\rho^2}^{1/2}) \le \log N(\epsilon/4, S, d_{\rho^2}^{1/2})$. By Lemma 2(c), this is bounded in terms of $\sup_{\Theta'} \log N(\epsilon/8, \Theta', \rho)$, which is bounded above by a constant when the $\Theta'$ are subsets of $\Theta$ whose diameter is bounded by a multiple of $\epsilon$. Thus Eq. (12) holds.

By Proposition 1(b) and Assumption (A1), we have $C_k(\mathcal{G}_n, j\epsilon_n) = \inf_{d_{\rho^2}(G_0, G) \ge (j\epsilon_n/2)^2} d_h^2(p_{G_0}, p_G) \ge c(j\epsilon_n)^4$ for some constant $c > 0$. To ensure condition (15), note that:

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)/2] \le \exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nc(j\epsilon_n)^4/2] \lesssim \exp(2n\epsilon_n^2 - ncM_n^4\epsilon_n^4/2).$$

This upper bound goes to zero if $ncM_n^4\epsilon_n^4 \gtrsim n\epsilon_n^2$, which is satisfied by taking $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$. Thus we need $M_n\epsilon_n \asymp \epsilon_n^{1/2} = n^{-1/4}$. Under the assumptions specified above, $\Pi(G : j\epsilon_n < d_\rho(G, G_0) \le 2j\epsilon_n)/\Pi(B(\epsilon_n)) = O(1)$. On the other hand, for $j \ge M_n$, $\exp[n C_k(\mathcal{G}_n, j\epsilon_n)/2] \ge \exp[nc(j\epsilon_n)^4/2]$, which is bounded below by an arbitrarily large constant once $M_n$ is a large multiple of $\epsilon_n^{-1/2}$, thereby ensuring (14). Thus, by Theorem 4, the rate of contraction for the posterior distribution of $G$ under the $d_{\rho^2}^{1/2}$ distance metric is $n^{-1/4}$, which matches the minimax rate proved in the univariate case by [3]. Our calculation is summarized by:

Theorem 5. Under Assumptions (A1-A3), the contraction rate in the $L_2$ Wasserstein distance metric of the posterior distribution of $G$ is $n^{-1/4}$.

4.2 Infinite mixtures based on the Dirichlet process

Suppose the true discrete measure is $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*} \in \mathcal{G}_k(\Theta)$, where $\Theta$ is a metric space but $k$ is unknown. To estimate $G_0$, the prior distribution $\Pi$ on the discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ is taken to be a Dirichlet process $\mathrm{DP}(\nu, P_0)$ centered at $P_0$ with concentration parameter $\nu > 0$ [7]. Here, the parameter $P_0$ is a probability measure on $\Theta$. For any $m \ge 1$, the following lemma provides a lower bound on small ball probabilities in the metric space $(\bar{\mathcal{G}}(\Theta), d_{\rho^m}^{1/m})$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$.

Lemma 3. Let $G \sim \mathrm{DP}(\nu, P_0)$, where $P_0$ is a non-atomic base probability measure on a compact set $\Theta$. For a small $\epsilon > 0$, let $D = D(\epsilon, \Theta, \rho)$ denote the packing number of $\Theta$ under the $\rho$ metric. Then, under the Dirichlet process distribution,

$$\Pi\bigl(G : d_{\rho^m}(G_0, G) \le (2^m + 1)\epsilon^m\bigr) \ge \Gamma(\nu)\bigl(\epsilon^m D^{-1}/\mathrm{Diam}(\Theta)^m\bigr)^{D-1}\, \nu^D \prod_{i=1}^{D} P_0(S_i).$$

Here, $(S_1,\ldots,S_D)$ denotes the $D$ disjoint $\epsilon/2$-balls that form a maximal packing of $\Theta$, and $\Gamma(\cdot)$ is the gamma function.

Proof. Since every point in $\Theta$ is of distance at most $\epsilon$ from one of the centers of $S_1,\ldots,S_D$, there is a $D$-partition $(S'_1,\ldots,S'_D)$ of $\Theta$ such that $S_i \subseteq S'_i$ and $\mathrm{Diam}(S'_i) \le 2\epsilon$ for each $i = 1,\ldots,D$. Let $m_i = G(S'_i)$, $\mu_i = P_0(S'_i)$, and $\hat p_i = G_0(S'_i)$. From the definition of Dirichlet processes, $m = (m_1,\ldots,m_D) \sim \mathrm{Dir}(\nu\mu_1,\ldots,\nu\mu_D)$. Note that

$$d_{\rho^m}(G_0, G) \le (2\epsilon)^m + \|m - \hat p\|_1\, [\mathrm{Diam}(\Theta)]^m.$$

Due to the non-atomicity of $P_0$, for $\epsilon$ sufficiently small, $\nu\mu_i \le 1$ for all $i = 1,\ldots,D$. Let $\delta = \epsilon/\mathrm{Diam}(\Theta)$. Then, under $\Pi$,

$$\Pr\bigl(d_{\rho^m}(G_0, G) \le (2^m + 1)\epsilon^m\bigr) \ge \Pr(\|m - \hat p\|_1 \le \delta^m) \ge \Pr(|m_i - \hat p_i| \le \delta^m/D,\ i = 1,\ldots,D)$$
$$= \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \int_{|m_i - \hat p_i| \le \delta^m/D} \prod_{i=1}^{D-1} m_i^{\nu\mu_i - 1}\Bigl(1 - \sum_{i=1}^{D-1} m_i\Bigr)^{\nu\mu_D - 1}\, dm_1 \cdots dm_{D-1}$$
$$\ge \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \prod_{i=1}^{D-1} \int_{\max(\hat p_i - \delta^m/D,\, 0)}^{\min(\hat p_i + \delta^m/D,\, 1)} m_i^{\nu\mu_i - 1}\, dm_i \ge \Gamma(\nu)(\delta^m/D)^{D-1} \prod_{i=1}^{D} (\nu\mu_i).$$

The second inequality is due to $(1 - \sum_{i=1}^{D-1} m_i)^{\nu\mu_D - 1} = m_D^{\nu\mu_D - 1} \ge 1$, since $\nu\mu_D \le 1$ and $0 < m_D < 1$ almost surely. The third inequality is due to the fact that $\Gamma(\alpha) \le 1/\alpha$ for $0 < \alpha \le 1$. This gives the desired claim.
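The small-ball probability bounded in Lemma 3 is easy to probe by simulation. Below is a minimal sketch (ours, using a truncated stick-breaking representation and a hypothetical choice $P_0 = \mathrm{Uniform}[0,1]$) that Monte Carlo estimates $\Pi(G : d_\rho(G_0, G) \le \epsilon)$, exploiting the fact that on the real line the $L_1$ Wasserstein distance is the integral of the CDF difference:

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_stick_breaking(nu, n_atoms=500):
    """Truncated stick-breaking draw from DP(nu, Uniform[0,1])."""
    v = rng.beta(1.0, nu, size=n_atoms)
    w = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    w /= w.sum()                     # fold the truncation remainder back in
    return w, rng.uniform(0.0, 1.0, size=n_atoms)

def w1_on_line(w1, a1, w2, a2):
    """L1 Wasserstein on R: the integral of |F_G - F_G'| over the line."""
    a = np.concatenate([a1, a2])
    o = np.argsort(a)
    cdf_gap = np.cumsum(np.concatenate([w1, -w2])[o])[:-1]
    return np.sum(np.abs(cdf_gap) * np.diff(a[o]))

# G0 with two atoms; Monte Carlo estimate of the small-ball probability
w0, a0 = np.array([0.5, 0.5]), np.array([0.25, 0.75])
eps, draws = 0.1, 2000
hits = sum(w1_on_line(*dp_stick_breaking(nu=1.0), w0, a0) <= eps
           for _ in range(draws))
print(hits / draws)   # Pi(G : d_rho(G0, G) <= eps); cf. the bound in Lemma 3
```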

Assumptions C.
(C1) The non-atomic base measure $P_0$ places full support on a compact set $\Theta$.
(C2) For some constants $C_1, m_1 > 0$, $d_K(f_i, f'_j) \le C_1 \rho^{m_1}(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)^{m_2}$ for some constants $C_2, m_2 > 0$.
(C3) $P_0$ places sufficient probability mass on all small balls that pack $\Theta$. Specifically, there is a universal constant $c_3 > 0$ such that the probabilities of the $D$-partition $(S_1,\ldots,S_D)$ specified in Lemma 3 satisfy, for any $\epsilon > 0$:

$$\log \prod_{i=1}^{D} P_0(S_i) \ge -c_3 D \log D.$$

(C4) $\Theta \subset \mathbb{R}^d$ is compact, so that the packing number satisfies $D(\epsilon, \Theta, \rho) \lesssim [\mathrm{Diam}(\Theta)/\epsilon]^d$.

Theorem 6. Given Assumptions (C1-C4) and the smoothness conditions for the likelihood family as specified in Theorem 2, there is a sequence $\beta_n \searrow 0$ such that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability. Specifically:
(1) For ordinary smooth likelihood functions, take $\beta_n \asymp (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$, for any small $\delta > 0$.
(2) For supersmooth likelihood functions, take $\beta_n \asymp (\log n)^{-1/\beta}$.

Proof. The proof consists of two main steps. First, we show that under Assumptions (C1-C4), the conditions specified by Eqs. (7), (8) and (9) of Theorem 3 are satisfied by taking $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$ and $\epsilon_n$ a large multiple of $(\log n/n)^{1/(d+2)}$. The second step constructs a sequence $M_n$, and subsequently $\beta_n = M_n\epsilon_n$, for which Theorem 3 can be applied.

Step 1. By Lemma 1 and (C2), $d_K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 d_{\rho^{m_1}}(G_0, G)$. Also, $P_{G_0}[\log(p_{G_0}/p_G)]^2 \lesssim d_{\rho^{m_1}}(G_0, G)^{m_2}$. Without loss of generality, assume that $m_1 \ge m_2$. We obtain that $\Pi(G \in B(\epsilon_n)) \ge \Pi\bigl(G : d_{\rho^{m_1}}(G_0, G) \le C_3\, \epsilon_n^{2 \vee (2/m_2)}\bigr)$ for some constant $C_3$. Combining this bound with (C3) and (C4), which are applied to Lemma 3, we have:

$$\log \Pi(G \in B(\epsilon_n)) \gtrsim (D-1)\log(\epsilon_n/\mathrm{Diam}(\Theta)) + (2D-1)\log(1/D) + D\log\nu,$$

where the approximation constant depends on $m_1, m_2$. Note that $D \lesssim [\mathrm{Diam}(\Theta)/\epsilon_n]^d$. It is simple to check that condition (9) holds, i.e., $\log \Pi(G \in B(\epsilon_n)) \ge -Cn\epsilon_n^2$ for any constant $C > 0$, by the given rate of $\epsilon_n$: the right side of the preceding display is of order $-\epsilon_n^{-d}\log(1/\epsilon_n)$, and $\epsilon_n^{-d}\log(1/\epsilon_n) \lesssim n\epsilon_n^2$ precisely when $\epsilon_n^{d+2} \gtrsim \log(1/\epsilon_n)/n$, which holds for $\epsilon_n$ a large multiple of $(\log n/n)^{1/(d+2)}$. Since $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$, (8) trivially holds. Turning to condition (7), by Lemma 2(b) we have

$$\log N(2\epsilon_n, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon_n, \Theta, \rho)\log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \lesssim (\mathrm{Diam}(\Theta)/\epsilon_n)^d \log(e + e\,\mathrm{Diam}(\Theta)/\epsilon_n) \le n\epsilon_n^2$$

by the specified rate of $\epsilon_n$.

Step 2. For any $\mathcal{G} \subseteq \bar{\mathcal{G}}(\Theta)$, let $R_k(\mathcal{G}, \cdot)$ be the inverse of the Hellinger information function of the $d_\rho$ metric. Specifically, for any $t \ge 0$,

$$R_k(\mathcal{G}, t) = \inf\{r \ge 0 : C_k(\mathcal{G}, r) \ge t\}.$$

Note that $R_k(\mathcal{G}, 0) = 0$, and $R_k(\mathcal{G}, \cdot)$ is non-decreasing because $C_k(\mathcal{G}, \cdot)$ is. Let $(\epsilon_n)_{n \ge 1}$ be the sequence determined in the previous step of the proof. Let $M_n = R_k(\bar{\mathcal{G}}(\Theta), \epsilon_n^2(C+4))/\epsilon_n$, so that $\beta_n = M_n\epsilon_n = R_k(\bar{\mathcal{G}}(\Theta), \epsilon_n^2(C+4))$. Condition (10) holds by the definition of $R_k$, i.e., $C_k(\bar{\mathcal{G}}(\Theta), M_n\epsilon_n) \ge \epsilon_n^2(C+4)$. To verify (11), note that the running sum over $j$ has no more than $\mathrm{Diam}(\Theta)/\epsilon_n$ terms, so by the monotonicity of $C_k$ we have

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-n C_k(\mathcal{G}_n, j\epsilon_n)] \le \bigl(\mathrm{Diam}(\Theta)/\epsilon_n\bigr)\exp\bigl(2n\epsilon_n^2 - n C_k(\mathcal{G}_n, M_n\epsilon_n)\bigr) \to 0.$$

Hence, Theorem 3 can be applied to conclude that $\Pi(d_\rho(G_0, G) \ge \beta_n \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability.

Under the ordinary smoothness condition (as specified in Theorem 2), $R_k(\bar{\mathcal{G}}(\Theta), t) \asymp t^{1/(4+(2\beta+1)d+\delta)}$, where $\delta$ is an arbitrarily small positive constant. So $\beta_n \asymp \epsilon_n^{2/(4+(2\beta+1)d+\delta)} = (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$. On the other hand, under the supersmoothness condition, $R_k(\bar{\mathcal{G}}(\Theta), t) \asymp (\log(1/t))^{-1/\beta}$. So $\beta_n \asymp (\log(1/\epsilon_n))^{-1/\beta} \asymp (\log n)^{-1/\beta}$.

Remark. For supersmooth likelihood functions, the rate obtained is comparable to the optimal rate in the deconvolution problem [6, 36], although the latter problem deals with smooth mixing densities evaluated in the $L_2$ metric. For ordinary smooth likelihood functions, we do not know whether the rate obtained here is optimal (up to a logarithmic factor).

4.3 Finite mixtures of Gaussian processes

We now study an example in which $\Theta$ has infinite dimensions. Specifically, let $T = [0, 1]$ and let $\Theta = \ell^\infty(T)$ be the Banach space of bounded functions $\theta : T \to \mathbb{R}$, equipped with the uniform norm $\|\theta\| = \sup\{|\theta(t)| : t \in T\}$. Suppose that the true $G_0$ has $k$ distinct atoms in $\Theta$, $G_0 = \sum_{i=1}^{k} p_i^* \delta_{\theta_i^*}$, where $k$ is known. We shall consider a mixture-of-Gaussian-processes prior $\Pi$ on $\mathcal{G}_k(\Theta)$. Specifically, a random draw $G$ from $\Pi$ is a discrete measure of the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$, where the $\theta_i$ are independent random sample paths distributed according to a zero-mean Gaussian process. Let $K : T \times T \to \mathbb{R}$ be the covariance function that defines the Gaussian process; we assume further that the Gaussian process has bounded sample paths and that it is Borel measurable. It is a known fact that the support of the defined zero-mean Gaussian measure is equal to the closure of the reproducing kernel Hilbert space (RKHS) of the covariance kernel $K$ of the process, to be denoted by $\mathbb{H}(K)$, or simply $\mathbb{H}$ [17, 30]. Assume that $\theta_i^* \in \bar{\mathbb{H}}$ for all $i = 1,\ldots,k$, so that the true $G_0$ is contained within the support of the prior $\Pi$.
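To make the prior concrete, here is a minimal simulation sketch (ours; the squared-exponential kernel and all names are our choices, not the paper's) of one draw $G = \sum_i p_i \delta_{\theta_i}$ from such a mixture-of-Gaussian-processes prior on a grid of $T = [0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(2)

# grid of T = [0, 1] and a squared-exponential covariance kernel
t = np.linspace(0.0, 1.0, 101)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * 0.1 ** 2))

def draw_prior_G(k=3, jitter=1e-8):
    """One draw G = sum_i p_i delta_{theta_i} from the mixture-of-GP prior."""
    p = rng.dirichlet(np.ones(k))                       # mixing proportions
    L = np.linalg.cholesky(K + jitter * np.eye(len(t)))
    thetas = (L @ rng.standard_normal((len(t), k))).T   # k zero-mean GP paths
    return p, thetas

p, thetas = draw_prior_G()
# the metric rho on Theta is the uniform norm between atoms (functions of t)
print(p, np.max(np.abs(thetas[0] - thetas[1])))
```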

An ingredient of our analysis is drawn from a recent result of van der Vaart and van Zanten, who studied the asymptotic behavior of priors based on Gaussian processes [31]. Define $\rho(\theta_i, \theta_j) = \sup\{|\theta_i(t) - \theta_j(t)| : t \in T\}$. A key notion in their analysis is the concentration function. For each $i = 1,\ldots,k$, define

$$\phi_{\theta_i^*}(\epsilon) = \inf_{h \in \mathbb{H} :\ \rho(h, \theta_i^*) < \epsilon} \|h\|_{\mathbb{H}}^2 - \log \Pr(\|\theta\| < \epsilon),$$

where $\Pr$ denotes the probability under the Gaussian process. Another ingredient is the extension of the notion of strong identifiability to the function space $\Theta$ (see Section 5.2), so that the result of Theorem 1 continues to hold for infinite dimensional $\Theta$. Finally, we need additional assumptions:

Assumptions B.
(B1) The family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable.
(B2) For some positive constant $C_1$, $d_K(f_i, f'_j) \le C_1\rho^2(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $P_{G_0}\bigl(\log(p_{G_0}/p_G)\bigr)^2 \le C_2\, d_K(p_{G_0}, p_G)$ for some constant $C_2 > 0$.
(B3) Under the prior $\Pi$, for small $\delta > 0$, $\Pi(|p_i - p_i^*| \le \delta,\ i = 1,\ldots,k) \ge c_3\delta^{k\alpha}$ for some constants $c_3, \alpha > 0$.
(B4) Under the prior $\Pi$, $C_4 = \mathbb{E}\|\theta\|^2 < \infty$.

Theorem 7. Given Assumptions (B1-B4), let $(\epsilon_n)_{n \in \mathbb{N}}$ be a sequence of positive numbers tending to 0 such that $\log n = o(n\epsilon_n^2)$ and such that, for any $i = 1,\ldots,k$ and any $n$,

$$\phi_{\theta_i^*}(\epsilon_n) \le n\epsilon_n^2. \qquad (17)$$

Then, for a sufficiently large constant $M > 0$, $\Pi(d_\rho(G_0, G) \ge M\epsilon_n^{1/2} \mid X_1,\ldots,X_n) \to 0$ in $P_{G_0}$-probability.

Remark. The reader is referred to [31] for examples of convergence rates $\epsilon_n$ that satisfy condition (17) for different choices of $\Theta$. In particular, if under the Gaussian process prior the supporting atoms $\theta_i$ of $G \sim \Pi$ are functions on $T = [0,1]$ with smoothness $\gamma_1 > 0$, while the true support points $\theta_i^*$ of $G_0$ are functions with smoothness $\gamma_2 > 0$, then the concentration functions satisfy $\phi_{\theta_i^*}(\epsilon) \asymp \epsilon^{-1/(\gamma_1 \wedge \gamma_2)}$ for each $i = 1,\ldots,k$. Accordingly, the rate $\epsilon_n$ for which Eq. (17) holds is $\epsilon_n \asymp n^{-\frac{\gamma_1\wedge\gamma_2}{2(\gamma_1\wedge\gamma_2)+1}}$, and the contraction rate for the posterior distribution of $G$ is $n^{-\frac{\gamma_1\wedge\gamma_2}{2(2(\gamma_1\wedge\gamma_2)+1)}}$.

5 Proofs of main results

5.1 Identifiability results

Proof of Theorem 1. Suppose that Eq. (4) is not true. Then there exist sequences $G_n$ and $G'_n$ tending to $G_0$ in the $d_\rho$ metric such that $\psi(G_n, G'_n) \to 0$. We write $G_n = \sum_{i} p_{n,i}\delta_{\theta_{n,i}}$, where $p_{n,i} = 0$ for indices $i$ greater than $k_n$, the number of atoms of $G_n$. Similar notation applies to $G'_n$. Since both $G_n$ and $G'_n$ have a finite number of atoms, there is $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$ such that $d_{\rho^2}(G_n, G'_n) = \sum_{i,j} q^{(n)}_{ij}\rho^2(\theta_{n,i}, \theta'_{n,j})$. Note that $d_\rho^2(G_n, G'_n) \le d_{\rho^2}(G_n, G'_n) = O(d_\rho(G_n, G'_n))$, where the latter bound is due to the boundedness of $\Theta$.

Let $O_n = \{(i,j) : \rho^2(\theta_{n,i}, \theta'_{n,j}) \le d_{\rho^2}^{1-\delta'}(G_n, G'_n)\}$ for some $\delta' \in (0,1)$. Then

$$\sum_{(i,j)\notin O_n} q^{(n)}_{ij} \le d_{\rho^2}(G_n, G'_n)/d_{\rho^2}^{1-\delta'}(G_n, G'_n) \to 0$$

as $n \to \infty$. Since $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$, we can express

$$\psi(G_n, G'_n) = \sup_x \Bigl| \sum_{i=1}^{k_n} p_{n,i}f(x|\theta_{n,i}) - \sum_{j=1}^{k'_n} p'_{n,j}f(x|\theta'_{n,j}) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n) = \sup_x \Bigl| \sum_{i,j} q^{(n)}_{ij}\bigl(f(x|\theta_{n,i}) - f(x|\theta'_{n,j})\bigr) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n),$$

and, by Taylor's expansion applied to the pairs in $O_n$,

$$\psi(G_n, G'_n) = \sup_x \Bigl| \sum_{(i,j)\notin O_n} q^{(n)}_{ij}\bigl(f(x|\theta'_{n,j}) - f(x|\theta_{n,i})\bigr) + \sum_{(i,j)\in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T Df(x|\theta_{n,i})$$
$$+ \frac{1}{2}\sum_{(i,j)\in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T D^2 f(x|\theta_{n,i})(\theta'_{n,j} - \theta_{n,i}) + R_n(x) \Bigr| \Big/ d_{\rho^2}(G_n, G'_n) =: \sup_x |A_n(x) + B_n(x) + C_n(x) + R_n(x)| / D_n,$$

where

$$R_n(x) = O\Bigl(\sum_{(i,j)\in O_n} q^{(n)}_{ij}\rho^{2+\delta}(\theta_{n,i}, \theta'_{n,j})\Bigr) = O\Bigl(\sum_{(i,j)\in O_n} q^{(n)}_{ij}\rho^{2}(\theta_{n,i}, \theta'_{n,j})\, d_{\rho^2}^{(1-\delta')\delta/2}(G_n, G'_n)\Bigr)$$

due to Eq. (3) and the definition of $O_n$. So $R_n(x)/d_{\rho^2}(G_n, G'_n) \to 0$. The quantities $A_n(x)$, $B_n(x)$ and $C_n(x)$ are linear functionals of $f(x|\theta)$, $Df(x|\theta)$ and $D^2 f(x|\theta)$ for various $\theta$, respectively. Since $\Theta$ is compact, subsequences of $G_n$ and $G'_n$ can be chosen so that each of their support points converges to one of a fixed set of atoms $\theta^*_l$, $l = 1,\ldots,k^*$, where $k^* \le k$.

After being properly rescaled, the limits of $A_n(x)$, $B_n(x)$ and $C_n(x)$ are still linear functionals with constant coefficients not depending on $x$. In particular, $C_n(x)/D_n \to \sum_{j=1}^{k^*}\gamma_j^T D^2 f(x|\theta_j^*)\gamma_j$ for some $\gamma_j$, not all of which vanish, since $\sum_{j=1}^{k^*}\|\gamma_j\|^2 = 1$. The coefficients in $A_n(x)/D_n$ and $B_n(x)/D_n$ can go either to infinity or to a constant, after further selecting subsequences of $G_n$ and $G'_n$. If they go to infinity, a sequence $d_n \to 0$ can be found such that $d_n A_n(x)/D_n$ converges to $\sum_{j=1}^{k^*}\alpha_j f(x|\theta_j^*)$ and $d_n B_n(x)/D_n$ converges to $\sum_{j=1}^{k^*}\beta_j^T Df(x|\theta_j^*)$ for some finite $\alpha_j$ and $\beta_j$. Thus, we have $d_n$ and coefficients $\alpha_j, \beta_j, \gamma_j$, not all zero, such that

$$d_n\, |p_{G_n}(x) - p_{G'_n}(x)| \big/ d_{\rho^2}(G_n, G'_n) \to \Bigl| \sum_{j=1}^{k^*} \alpha_j f(x|\theta_j^*) + \beta_j^T Df(x|\theta_j^*) + \gamma_j^T D^2 f(x|\theta_j^*)\gamma_j \Bigr| \qquad (18)$$

for all $x$. Since $\psi(G_n, G'_n) \to 0$, the right side of the preceding display must be 0 for almost all $x$. By strong identifiability, all coefficients must be 0, which leads to a contradiction.

With respect to $\psi_1(G, G')$: suppose that the claim is not true, which implies the existence of subsequences $G_n, G'_n$ such that $\psi_1(G_n, G'_n) \to 0$. Going through the same argument as above, we obtain $\alpha_j, \beta_j, \gamma_j$, not all of which are zero, such that Eq. (18) holds. An application of Fatou's lemma yields

$$\int \Bigl| \sum_{j=1}^{k^*} \alpha_j f(x|\theta_j^*) + \beta_j^T Df(x|\theta_j^*) + \gamma_j^T D^2 f(x|\theta_j^*)\gamma_j \Bigr|\, d\mu = 0.$$

Thus the integrand must be 0 for almost all $x$, leading to a contradiction.

Proof of Theorem 2. To obtain an upper bound on $d_{\rho^2}(G, G')$ in terms of $d_V(p_G, p_{G'})$ under the condition that $d_V(p_G, p_{G'}) \to 0$, our strategy is to approximate $G$ and $G'$ by convolving them with a mollifier $K_\delta$. By the triangle inequality, $d_{\rho^2}(G, G')$ can be bounded in terms of $d_{\rho^2}(G, G * K_\delta)$, $d_{\rho^2}(G', G' * K_\delta)$, and $d_{\rho^2}(G * K_\delta, G' * K_\delta)$. The first two terms are simple to bound, while the last term can be handled by expressing $G * K_\delta$ as the convolution of the mixture density $p_G$ with another function. This trick has been widely exploited in the kernel density estimation method for deconvolution problems (e.g., [36, 6]). We also need the following elementary lemma.

Lemma 4. Assume that $p$ and $p'$ are two probability density functions on $\mathbb{R}^d$ with bounded $s$-th moments.
(a) For $t$ such that $0 < t < s$,

$$\int |p(x) - p'(x)|\, \|x\|^t\, dx \le 2\|p - p'\|_{L_1}^{(s-t)/s}\bigl(\mathbb{E}_p\|X\|^s + \mathbb{E}_{p'}\|X\|^s\bigr)^{t/s}.$$

(b) Let $V_d = \pi^{d/2}/\Gamma(d/2 + 1)$ denote the volume of the $d$-dimensional unit sphere. Then

$$\|p - p'\|_{L_1} \le 2V_d^{s/(d+2s)}\bigl(\mathbb{E}_p\|X\|^s + \mathbb{E}_{p'}\|X\|^s\bigr)^{\frac{d}{d+2s}}\|p - p'\|_{L_2}^{\frac{2s}{d+2s}}.$$

Take any $s > 0$, and let $K : \mathbb{R}^d \to (0, \infty)$ be a symmetric density function on $\mathbb{R}^d$ whose Fourier transform $\tilde K$ is a continuous function with support bounded in $[-1, 1]^d$, and which has bounded moments up to order $s$. Consider the mollifiers $K_\delta(x) = \frac{1}{\delta^d}K(x/\delta)$ for $\delta > 0$. Let $\tilde K_\delta$ and $\tilde f$ be the Fourier transforms of $K_\delta$ and $f$, respectively. Define $g_\delta$ to be the inverse Fourier transform of $\tilde K_\delta/\tilde f$:

$$g_\delta(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} e^{i\langle\omega, x\rangle}\, \frac{\tilde K_\delta(\omega)}{\tilde f(\omega)}\, d\omega.$$

The function $\tilde K_\delta(\omega)/\tilde f(\omega)$ has bounded support, so $g_\delta \in L_1(\mathbb{R}^d)$, and $\tilde g_\delta := \tilde K_\delta/\tilde f$ is the Fourier transform of $g_\delta$. By the convolution theorem, $f * g_\delta = K_\delta$, and as a result, $G * K_\delta = G * f * g_\delta = p_G * g_\delta$. (For instance, for the Laplace kernel on $\mathbb{R}$ with $\tilde f(\omega) = 1/(1+\omega^2)$, one has $\tilde g_\delta(\omega) = (1+\omega^2)\tilde K_\delta(\omega)$, so $g_\delta = K_\delta - K''_\delta$ explicitly.) The second moment under $K_\delta$ is $O(\delta^2)$; it entails that $d_{\rho^2}(G, G * K_\delta) = O(\delta^2)$. By the triangle inequality,

$$d_{\rho^2}^{1/2}(G, G') \le d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) + d_{\rho^2}^{1/2}(G, G * K_\delta) + d_{\rho^2}^{1/2}(G', G' * K_\delta) \le d_{\rho^2}^{1/2}(G * K_\delta, G' * K_\delta) + O(\delta),$$

so for some constant $C(K) > 0$ depending only on the kernel $K$,

$$d_{\rho^2}(G, G') \le 2 d_{\rho^2}(G * K_\delta, G' * K_\delta) + C(K)\delta^2. \qquad (19)$$

Theorem 6.15 of [33] provides an upper bound for the Wasserstein distance: for any two probability measures $\mu$ and $\nu$, $d_{\rho^2}(\mu, \nu) \le 2\int \|x\|^2\, d|\mu - \nu|(x)$, where $|\mu - \nu|$ is the total variation of the measure $\mu - \nu$. Thus,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le 2\int \|x\|^2\, |G * K_\delta(x) - G' * K_\delta(x)|\, dx. \qquad (20)$$

We note that since the density function $K$ has bounded $s$-th moment,

$$\int \|x\|^s\, G * K_\delta(dx) \le 2^s\Bigl[\int \|\theta\|^s\, dG(\theta) + \int \|x\|^s K_\delta(x)\, dx\Bigr] = 2^s\Bigl[\int \|\theta\|^s\, dG(\theta) + \delta^s\int \|x\|^s K(x)\, dx\Bigr] < \infty,$$

because $G$'s support points lie in a compact set. Applying Lemma 4 to Eq. (20), we obtain that for $\delta < 1$,

$$d_{\rho^2}(G * K_\delta, G' * K_\delta) \le C(d, K, s)\, \|G * K_\delta - G' * K_\delta\|_{L_1}^{(s-2)/s} \le C(d, K, s)\, \|G * K_\delta - G' * K_\delta\|_{L_2}^{2(s-2)/(d+2s)}. \qquad (21)$$

Here, the constants $C(d, K, s)$ differ from line to line, and depend only on $d$, $s$ and the $s$-th moment of the density function $K$. Next, we use the known fact that for an arbitrary (signed) measure $\mu$ on $\mathbb{R}^d$ and a function $g \in L_2(\mathbb{R}^d)$, there holds $\|\mu * g\|_{L_2} \le |\mu|\, \|g\|_{L_2}$, where $|\mu|$ denotes the total variation of $\mu$:

$$\|G * K_\delta - G' * K_\delta\|_{L_2} = \|p_G * g_\delta - p_{G'} * g_\delta\|_{L_2} = \|(p_G - p_{G'}) * g_\delta\|_{L_2} \le 2 d_V(p_G, p_{G'})\, \|g_\delta\|_{L_2}. \qquad (22)$$

By Plancherel's identity,

$$\|g_\delta\|_{L_2}^2 = \frac{1}{(2\pi)^d}\int \frac{|\tilde K_\delta(\omega)|^2}{|\tilde f(\omega)|^2}\, d\omega = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \frac{|\tilde K(\omega\delta)|^2}{|\tilde f(\omega)|^2}\, d\omega \le C\int_{[-1/\delta, 1/\delta]^d} |\tilde f(\omega)|^{-2}\, d\omega.$$

The last bound holds because $\tilde K$ has support in $[-1, 1]^d$ and is bounded by a constant. Collecting Eqs. (19)-(22) and the preceding display, we have:

$$d_{\rho^2}(G, G') \le C(d, K, s)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}\Bigl[\int_{[-1/\delta, 1/\delta]^d} |\tilde f(\omega)|^{-2}\, d\omega\Bigr]^{\frac{s-2}{d+2s}}\Bigr\}.$$

If $|\tilde f(\omega)|\prod_{j=1}^{d}|\omega_j|^{\beta} \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constant $d_0$, then

$$d_{\rho^2}(G, G') \le C(d, K, s, \beta)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{\frac{2(s-2)}{d+2s}}(1/\delta)^{\frac{(2\beta+1)d(s-2)}{d+2s}}\Bigr\} \le C(d, K, s, \beta)\, d_V(p_G, p_{G'})^{\frac{4(s-2)}{2(d+2s)+(2\beta+1)d(s-2)}}.$$

Since the exponent tends to $4/(4 + (2\beta+1)d)$ as $s \to \infty$, we obtain that $d_{\rho^2}(G, G') \le C(d, \beta, r)\, d_V(p_G, p_{G'})^{r}$ for any constant $r < 4/(4 + (2\beta+1)d)$, as $d_V(p_G, p_{G'}) \to 0$.

If $|\tilde f(\omega)|\prod_{j=1}^{d}\exp(|\omega_j|^{\beta}/\gamma) \ge d_0$ as $\omega_j \to \infty$ ($j = 1,\ldots,d$) for some positive constants $\beta, \gamma, d_0$, then

$$d_{\rho^2}(G, G') \le C(d, K, s, \beta)\inf_{\delta \in (0,1)}\Bigl\{\delta^2 + d_V(p_G, p_{G'})^{2(s-2)/(d+2s)}\exp\Bigl(\frac{2d\,\delta^{-\beta}}{\gamma}\cdot\frac{s-2}{d+2s}\Bigr)\Bigr\}.$$

Taking $\delta^{-\beta} = -\frac{\gamma}{2d}\log d_V(p_G, p_{G'})$, we obtain that $d_{\rho^2}(G, G') \le C(d, \beta)\bigl(-\log d_V(p_G, p_{G'})\bigr)^{-2/\beta}$.

Proof of Lemma 1. We exploit the variational characterization of f-divergences [22]:

$$d_\phi(f_i, f'_j) = \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu,$$

where the supremum is taken over all measurable functions $\varphi_{ij}$ on $\mathcal{X}$, and $\phi^*$ denotes the Legendre-Fenchel conjugate dual of the convex function $\phi$ ($\phi^*$ is again a convex function on $\mathbb{R}$, defined by $\phi^*(v) = \sup_{u \in \mathbb{R}}(uv - \phi(u))$). Thus,

$$d_{\rho_\phi}(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu.$$

On the other hand, for any $q \in \mathcal{Q}(p, p')$,

$$d_\phi(p_G, p_{G'}) = \sup_{\varphi} \int \varphi\, p_{G'} - \phi^*(\varphi)\, p_G\, d\mu = \sup_{\varphi} \int \varphi \sum_j p'_j f'_j - \phi^*(\varphi)\sum_i p_i f_i\, d\mu$$
$$= \sup_{\varphi} \sum_{ij} \int q_{ij}\bigl(\varphi f'_j - \phi^*(\varphi) f_i\bigr)\, d\mu \le \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\, d\mu,$$

where the last inequality holds because the supremum is taken over a larger set of functions. Moreover, the bound holds for any $q \in \mathcal{Q}(p, p')$, so $d_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.

Proof of Proposition 1. (a) Suppose that the claim is not true. Then there is a sequence of pairs $(G_0^{(n)}, G^{(n)}) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$ such that $d_\rho(G_0^{(n)}, G^{(n)}) \ge r/2 > 0$ always holds and $d_h(p_{G_0^{(n)}}, p_{G^{(n)}}) \to 0$. By the compactness of both $\mathcal{G}_k(\Theta)$ and $\mathcal{G}$, we may pass to a subsequence converging in the $d_\rho$ metric to some $G'_0 \in \mathcal{G}_k(\Theta)$ and $G' \in \mathcal{G}$, respectively. We must have $d_\rho(G'_0, G') \ge r/2 > 0$, so $G'_0 \ne G'$. At the same time, $d_h(p_{G'_0}, p_{G'}) = 0$, which implies that $p_{G'_0}(x) = p_{G'}(x)$ for almost all $x \in \mathcal{X}$. By the finite identifiability condition, $G'_0 = G'$, which is a contradiction.

(b) is an immediate consequence of Theorem 1, by noting that under the given hypotheses there is $c(k) > 0$ depending on $k$ such that

$$d_h^2(p_{G_0}, p_G) \ge d_V^2(p_{G_0}, p_G)/2 \ge c(k)\, d_{\rho^2}^2(G_0, G) \ge c(k)\, d_\rho^4(G_0, G)$$

for sufficiently small $d_\rho(G_0, G)$. The boundedness of $\Theta$ implies the boundedness of $d_\rho(G_0, G)$, thereby extending the claim to the entire admissible range of $d_\rho(G_0, G)$.

(c) is obtained in a similar way from Theorem 2.

5.2 Strong identifiability conditions in normed spaces

Let $(\Theta, \|\cdot\|)$ be a real normed space. A continuous function $f : \Theta \to \mathbb{R}$ is twice Fréchet differentiable at a point $\theta^* \in \Theta$ if it is Fréchet differentiable at $\theta^*$, with Fréchet derivative $D_\theta f(\theta^*)$ (a bounded linear functional from $\Theta$ into $\mathbb{R}$), and there is a continuous bilinear function $D_\theta^2 f(\theta^*; \cdot, \cdot)$ from $\Theta \times \Theta$ into $\mathbb{R}$, called the second Fréchet differential of $f$, with the property that

$$\lim_{\|\gamma\| \to 0} \bigl| f(\theta^* + \gamma) - f(\theta^*) - D_\theta f(\theta^*)\gamma - \tfrac{1}{2}D_\theta^2 f(\theta^*; \gamma, \gamma) \bigr| \big/ \|\gamma\|^2 = 0.$$

We say that the family of density functions $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice Fréchet differentiable in $\theta$ (with Fréchet derivative $D_\theta f(x|\theta)(\cdot)$ and second Fréchet differential $D_\theta^2 f(x|\theta; \cdot, \cdot)$), and if for any finite $k$ and any $k$ distinct $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname*{ess\,sup}_{x \in \mathcal{X}} \Bigl| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + D_\theta f(x|\theta_i)\beta_i + D_\theta^2 f(x|\theta_i; \gamma_i, \gamma_i) \Bigr| = 0 \qquad (23)$$

implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \Theta$ for $i = 1,\ldots,k$.

Note that if for each $x \in \mathcal{X}$, $f(x|\cdot)$ is twice continuously Fréchet differentiable, then $f$ admits the following Taylor expansion (cf. pg. 659, [24]):

$$f(x|\theta_2) - f(x|\theta_1) = D_\theta f(x|\theta_1)(\theta_2 - \theta_1) + \frac{1}{2}D_\theta^2 f\bigl(x|\theta_1 + s(\theta_2 - \theta_1); (\theta_2 - \theta_1), (\theta_2 - \theta_1)\bigr)$$

for some $s \in [0, 1]$. Assume further that the second Fréchet differential of $f(x|\cdot)$ satisfies a uniform Lipschitz condition:

$$\bigl| D_\theta^2 f(x|\theta_1; \gamma, \gamma) - D_\theta^2 f(x|\theta_2; \gamma, \gamma) \bigr| \le C\|\theta_1 - \theta_2\|^{\delta}\|\gamma\|^2$$

for all $x, \theta_1, \theta_2 \in \Theta$ and some fixed $C$ and $\delta > 0$. It is then simple to observe that Theorem 1 and its proof extend line-by-line to the normed space setting.


More information

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction

SHARP BOUNDARY TRACE INEQUALITIES. 1. Introduction SHARP BOUNDARY TRACE INEQUALITIES GILES AUCHMUTY Abstract. This paper describes sharp inequalities for the trace of Sobolev functions on the boundary of a bounded region R N. The inequalities bound (semi-)norms

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation

More information

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem Chapter 8 General Countably dditive Set Functions In Theorem 5.2.2 the reader saw that if f : X R is integrable on the measure space (X,, µ) then we can define a countably additive set function ν on by

More information

Lectures on. Sobolev Spaces. S. Kesavan The Institute of Mathematical Sciences, Chennai.

Lectures on. Sobolev Spaces. S. Kesavan The Institute of Mathematical Sciences, Chennai. Lectures on Sobolev Spaces S. Kesavan The Institute of Mathematical Sciences, Chennai. e-mail: kesh@imsc.res.in 2 1 Distributions In this section we will, very briefly, recall concepts from the theory

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space

PROBLEMS. (b) (Polarization Identity) Show that in any inner product space 1 Professor Carl Cowen Math 54600 Fall 09 PROBLEMS 1. (Geometry in Inner Product Spaces) (a) (Parallelogram Law) Show that in any inner product space x + y 2 + x y 2 = 2( x 2 + y 2 ). (b) (Polarization

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Ordinary differential equations. Recall that we can solve a first-order linear inhomogeneous ordinary differential equation (ODE) dy dt

Ordinary differential equations. Recall that we can solve a first-order linear inhomogeneous ordinary differential equation (ODE) dy dt MATHEMATICS 3103 (Functional Analysis) YEAR 2012 2013, TERM 2 HANDOUT #1: WHAT IS FUNCTIONAL ANALYSIS? Elementary analysis mostly studies real-valued (or complex-valued) functions on the real line R or

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing. 5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint

More information

INTEGRATION ON MANIFOLDS and GAUSS-GREEN THEOREM

INTEGRATION ON MANIFOLDS and GAUSS-GREEN THEOREM INTEGRATION ON MANIFOLS and GAUSS-GREEN THEOREM 1. Schwarz s paradox. Recall that for curves one defines length via polygonal approximation by line segments: a continuous curve γ : [a, b] R n is rectifiable

More information

A note on some approximation theorems in measure theory

A note on some approximation theorems in measure theory A note on some approximation theorems in measure theory S. Kesavan and M. T. Nair Department of Mathematics, Indian Institute of Technology, Madras, Chennai - 600 06 email: kesh@iitm.ac.in and mtnair@iitm.ac.in

More information

l(y j ) = 0 for all y j (1)

l(y j ) = 0 for all y j (1) Problem 1. The closed linear span of a subset {y j } of a normed vector space is defined as the intersection of all closed subspaces containing all y j and thus the smallest such subspace. 1 Show that

More information

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:

More information

A VERY BRIEF REVIEW OF MEASURE THEORY

A VERY BRIEF REVIEW OF MEASURE THEORY A VERY BRIEF REVIEW OF MEASURE THEORY A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and

More information

L p -boundedness of the Hilbert transform

L p -boundedness of the Hilbert transform L p -boundedness of the Hilbert transform Kunal Narayan Chaudhury Abstract The Hilbert transform is essentially the only singular operator in one dimension. This undoubtedly makes it one of the the most

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

MAT 578 FUNCTIONAL ANALYSIS EXERCISES

MAT 578 FUNCTIONAL ANALYSIS EXERCISES MAT 578 FUNCTIONAL ANALYSIS EXERCISES JOHN QUIGG Exercise 1. Prove that if A is bounded in a topological vector space, then for every neighborhood V of 0 there exists c > 0 such that tv A for all t > c.

More information

i. v = 0 if and only if v 0. iii. v + w v + w. (This is the Triangle Inequality.)

i. v = 0 if and only if v 0. iii. v + w v + w. (This is the Triangle Inequality.) Definition 5.5.1. A (real) normed vector space is a real vector space V, equipped with a function called a norm, denoted by, provided that for all v and w in V and for all α R the real number v 0, and

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Math The Laplacian. 1 Green s Identities, Fundamental Solution

Math The Laplacian. 1 Green s Identities, Fundamental Solution Math. 209 The Laplacian Green s Identities, Fundamental Solution Let be a bounded open set in R n, n 2, with smooth boundary. The fact that the boundary is smooth means that at each point x the external

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

4 Expectation & the Lebesgue Theorems

4 Expectation & the Lebesgue Theorems STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

CHAPTER 7. Connectedness

CHAPTER 7. Connectedness CHAPTER 7 Connectedness 7.1. Connected topological spaces Definition 7.1. A topological space (X, T X ) is said to be connected if there is no continuous surjection f : X {0, 1} where the two point set

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

1. Bounded linear maps. A linear map T : E F of real Banach

1. Bounded linear maps. A linear map T : E F of real Banach DIFFERENTIABLE MAPS 1. Bounded linear maps. A linear map T : E F of real Banach spaces E, F is bounded if M > 0 so that for all v E: T v M v. If v r T v C for some positive constants r, C, then T is bounded:

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Functional Analysis. Martin Brokate. 1 Normed Spaces 2. 2 Hilbert Spaces The Principle of Uniform Boundedness 32

Functional Analysis. Martin Brokate. 1 Normed Spaces 2. 2 Hilbert Spaces The Principle of Uniform Boundedness 32 Functional Analysis Martin Brokate Contents 1 Normed Spaces 2 2 Hilbert Spaces 2 3 The Principle of Uniform Boundedness 32 4 Extension, Reflexivity, Separation 37 5 Compact subsets of C and L p 46 6 Weak

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Combinatorics in Banach space theory Lecture 12

Combinatorics in Banach space theory Lecture 12 Combinatorics in Banach space theory Lecture The next lemma considerably strengthens the assertion of Lemma.6(b). Lemma.9. For every Banach space X and any n N, either all the numbers n b n (X), c n (X)

More information

Part II Probability and Measure

Part II Probability and Measure Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION

USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION USING FUNCTIONAL ANALYSIS AND SOBOLEV SPACES TO SOLVE POISSON S EQUATION YI WANG Abstract. We study Banach and Hilbert spaces with an eye towards defining weak solutions to elliptic PDE. Using Lax-Milgram

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n Econ 204 2011 Lecture 3 Outline 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n 1 Metric Spaces and Metrics Generalize distance and length notions

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Exercises to Applied Functional Analysis

Exercises to Applied Functional Analysis Exercises to Applied Functional Analysis Exercises to Lecture 1 Here are some exercises about metric spaces. Some of the solutions can be found in my own additional lecture notes on Blackboard, as the

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Solutions to Problem Set 5 for , Fall 2007

Solutions to Problem Set 5 for , Fall 2007 Solutions to Problem Set 5 for 18.101, Fall 2007 1 Exercise 1 Solution For the counterexample, let us consider M = (0, + ) and let us take V = on M. x Let W be the vector field on M that is identically

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers

Theorems. Theorem 1.11: Greatest-Lower-Bound Property. Theorem 1.20: The Archimedean property of. Theorem 1.21: -th Root of Real Numbers Page 1 Theorems Wednesday, May 9, 2018 12:53 AM Theorem 1.11: Greatest-Lower-Bound Property Suppose is an ordered set with the least-upper-bound property Suppose, and is bounded below be the set of lower

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

Normed and Banach spaces

Normed and Banach spaces Normed and Banach spaces László Erdős Nov 11, 2006 1 Norms We recall that the norm is a function on a vectorspace V, : V R +, satisfying the following properties x + y x + y cx = c x x = 0 x = 0 We always

More information

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43 INDEX Abel s identity, 131 Abel s test, 131 132 Abel s theorem, 463 464 absolute convergence, 113 114 implication of conditional convergence, 114 absolute value, 7 reverse triangle inequality, 9 triangle

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

A Brunn Minkowski theory for coconvex sets of finite volume

A Brunn Minkowski theory for coconvex sets of finite volume A Brunn Minkowski theory for coconvex sets of finite volume Rolf Schneider Abstract Let C be a closed convex cone in R n, pointed and with interior points. We consider sets of the form A = C \ K, where

More information

Math212a1413 The Lebesgue integral.

Math212a1413 The Lebesgue integral. Math212a1413 The Lebesgue integral. October 28, 2014 Simple functions. In what follows, (X, F, m) is a space with a σ-field of sets, and m a measure on F. The purpose of today s lecture is to develop the

More information

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3

1 Topology Definition of a topology Basis (Base) of a topology The subspace topology & the product topology on X Y 3 Index Page 1 Topology 2 1.1 Definition of a topology 2 1.2 Basis (Base) of a topology 2 1.3 The subspace topology & the product topology on X Y 3 1.4 Basic topology concepts: limit points, closed sets,

More information

02. Measure and integral. 1. Borel-measurable functions and pointwise limits

02. Measure and integral. 1. Borel-measurable functions and pointwise limits (October 3, 2017) 02. Measure and integral Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2017-18/02 measure and integral.pdf]

More information

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS

COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS Series Logo Volume 00, Number 00, Xxxx 19xx COMPLETELY INVARIANT JULIA SETS OF POLYNOMIAL SEMIGROUPS RICH STANKEWITZ Abstract. Let G be a semigroup of rational functions of degree at least two, under composition

More information

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm Chapter 13 Radon Measures Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm (13.1) f = sup x X f(x). We want to identify

More information

Measurable Choice Functions

Measurable Choice Functions (January 19, 2013) Measurable Choice Functions Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/choice functions.pdf] This note

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

The Lebesgue Integral

The Lebesgue Integral The Lebesgue Integral Brent Nelson In these notes we give an introduction to the Lebesgue integral, assuming only a knowledge of metric spaces and the iemann integral. For more details see [1, Chapters

More information

MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION. Extra Reading Material for Level 4 and Level 6

MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION. Extra Reading Material for Level 4 and Level 6 MATH41011/MATH61011: FOURIER SERIES AND LEBESGUE INTEGRATION Extra Reading Material for Level 4 and Level 6 Part A: Construction of Lebesgue Measure The first part the extra material consists of the construction

More information