Lectures on Nonparametric Bayesian Statistics


1 Lectures on Nonparametric Bayesian Statistics Aad van der Vaart Universiteit Leiden, Netherlands Bad Belzig, March 2013

2 Contents Introduction Dirichlet process Consistency and rates Gaussian process priors Dirichlet mixtures All the rest

3 Introduction

4 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x).

5 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). We assume whatever is needed (e.g. Θ Polish and Π a probability distribution on its Borel σ-field; Polish sample space) to make this well defined.

6 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives Π(Θ ∈ B| X = x) = ∫_B p_θ(x) dΠ(θ) / ∫_Θ p_θ(x) dΠ(θ).

7 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives dΠ(θ| X) ∝ p_θ(X) dΠ(θ).

8 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). Pr(a ≤ Θ ≤ b) = b − a, 0 < a < b < 1, Pr(X = x| Θ = θ) = (n choose x) θ^x (1−θ)^{n−x}, x = 0,1,...,n, Pr(a ≤ Θ ≤ b| X = x) = ∫_a^b θ^x (1−θ)^{n−x} dθ / B(x+1, n−x+1).

9 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). Pr(a ≤ Θ ≤ b) = b − a, 0 < a < b < 1, Pr(X = x| Θ = θ) = (n choose x) θ^x (1−θ)^{n−x}, x = 0,1,...,n, dΠ(θ| X) ∝ θ^X (1−θ)^{n−X} · 1.

10 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). [Figure: posterior density for n = 20.]
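
A minimal numerical sketch of Bayes's computation (my illustrative example, not from the slides; it assumes numpy and scipy are available): under the uniform prior the posterior of Θ given X = x is Beta(x+1, n−x+1), and interval probabilities follow from its cdf.

import numpy as np
from scipy.stats import beta

n, x = 20, 15                       # illustrative data: x successes in n Bernoulli(theta) trials
posterior = beta(x + 1, n - x + 1)  # uniform prior => posterior Beta(x+1, n-x+1)

a, b = 0.5, 0.9                     # Pr(a <= Theta <= b | X = x) via the Beta cdf
print("posterior mean:", posterior.mean())
print("Pr(a <= Theta <= b | X = x):", posterior.cdf(b) - posterior.cdf(a))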

13 Parametric Bayes Pierre-Simon Laplace (1749-1827) rediscovered Bayes's argument and applied it to general parametric models: models smoothly indexed by a Euclidean parameter θ. For instance, the linear regression model, where one observes (x_1,Y_1),...,(x_n,Y_n) following Y_i = θ_0 + θ_1 x_i + e_i, for e_1,...,e_n independent normal errors with zero mean.

14 Nonparametric Bayes If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes's formula does not change: dΠ(θ| X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions.


20 Subjectivism A philosophical Bayesian statistician views the prior distribution as an expression of his personal beliefs on the state of the world, before gathering the data. After seeing the data he updates his beliefs into the posterior distribution. Most scientists do not like dependence on subjective priors. One can opt for objective or noninformative priors. One can also mathematically study the role of the prior, and hope to find that it is small.

21 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X.

22 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X. Asymptotic setting: data X n where the information increases as n. We like the posterior Π n ( X n ) to contract to {θ 0 }, at a good rate.

23 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X. Asymptotic setting: data X n where the information increases as n. We like the posterior Π n ( X n ) to contract to {θ 0 }, at a good rate. Desirable properties: Consistency + rate Adaptation Distributional approximations Uncertainty quantification

24 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(θ ∈ ·| X) as a random measure on the parameter set dependent on X. We like Π(·| X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n where the information increases as n → ∞. We like the posterior Π_n(·| X^n) to contract to {θ_0}, at a good rate. Desirable properties: Consistency + rate Adaptation Distributional approximations Uncertainty quantification We assume that P_{θ_0} ≪ ∫ P_θ dΠ(θ) to make these questions well posed.

25 Parametric models Suppose the data are a random sample X_1,...,X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem (Laplace, Bernstein, von Mises, Le Cam 1989). Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(·| X_1,...,X_n) − N_d(θ̂_n, n^{−1} I^{−1}_{θ_0})‖ → 0. Here θ̂_n is any efficient estimator of θ. [Figures: posterior densities for increasing n.]

26 Parametric models Suppose the data are a random sample X_1,...,X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem (Laplace, Bernstein, von Mises, Le Cam 1989). Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(·| X_1,...,X_n) − N_d(θ̂_n, n^{−1} I^{−1}_{θ_0})‖ → 0. Here θ̂_n is any efficient estimator of θ. In particular, the posterior distribution concentrates most of its mass on balls of radius O(1/√n) around θ_0, and a central set of posterior probability 95% is equivalent to the usual Wald confidence set. The prior washes out completely.
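
A small simulation sketch of the Bernstein-von Mises phenomenon in the Bernoulli model (the model, prior and sample size are my illustrative choices): the Beta posterior is compared with the normal approximation N_d(θ̂_n, n^{−1} I^{−1}_{θ_0}), here with d = 1, θ̂_n the MLE and I_θ = 1/(θ(1−θ)).

import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(0)
theta0, n = 0.3, 500
x = rng.binomial(n, theta0)                      # data: number of successes
theta_hat = x / n                                # efficient estimator (MLE)
sd = np.sqrt(theta_hat * (1 - theta_hat) / n)    # sqrt of I^{-1}/n, estimated at the MLE

posterior = beta(x + 1, n - x + 1)               # exact posterior under the uniform prior
approx = norm(theta_hat, sd)                     # Bernstein-von Mises normal approximation

grid = np.linspace(1e-3, 1 - 1e-3, 4000)         # compare the densities on a grid
tv = 0.5 * np.sum(np.abs(posterior.pdf(grid) - approx.pdf(grid))) * (grid[1] - grid[0])
print("approximate total variation distance:", tv)   # small, and -> 0 as n grows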

27 Support Definition. The support of a prior Π is the smallest closed set F with Π(F) = 1. In nonparametrics we like priors with big (or even full) support, equal to an infinite-dimensional set. Full support means that every open set has positive (prior) probability.

28 Support Definition. The support of a prior Π is the smallest closed set F with Π(F) = 1. In nonparametrics we like priors with big (or even full) support, equal to an infinite-dimensional set. Full support means that every open set has positive (prior) probability. Support depends on the topology. It is well defined, e.g. if the parameter space is Polish.

29 Dirichlet process

30 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ).

31 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.)

32 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.) The law of P is a prior on (M, M).

33 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.) The law of P is a prior on (M, M). Definition. The mean measure of P is the measure A EP(A).

34 Discrete random measures W_1,W_2,... nonnegative variables with ∑_{i=1}^∞ W_i = 1, independent of θ_1,θ_2,... iid ∼ G, random variables with values in X. P = ∑_{i=1}^∞ W_i δ_{θ_i}.

35 Discrete random measures W 1,W 2,... nonnegative variables with i=1 W i = 1, independent of θ 1,θ 2,... iid G, random variables with values in X. P = W i δ θi. i=1 Lemma. If (W 1,...,W n ) has (full) support the unit simplex for every n and the law of θ 1 has (full) support X, then P has full support M relative to the weak topology.

36 Discrete random measures W_1,W_2,... nonnegative variables with ∑_{i=1}^∞ W_i = 1, independent of θ_1,θ_2,... iid ∼ G, random variables with values in X. P = ∑_{i=1}^∞ W_i δ_{θ_i}. Lemma. If (W_1,...,W_n) has (full) support the unit simplex for every n and the law of θ_1 has (full) support X, then P has full support M relative to the weak topology. Proof. Finitely discrete distributions are weakly dense in M. It suffices to show that P gives positive probability to any weak neighbourhood of a distribution of the form P* = ∑_{i=1}^k w*_i δ_{θ*_i}. The set {∑_{i>k} W_i < ǫ, max_{i≤k} |W_i − w*_i| ∨ d(θ_i, θ*_i) < ǫ} is open and hence has positive probability.


38 Stick breaking Given i.i.d. Y_1,Y_2,... in [0,1], W_1 = Y_1, W_2 = (1−Y_1)Y_2, W_3 = (1−Y_1)(1−Y_2)Y_3, ...

39 Stick breaking Given i.i.d. Y 1,Y 2,... in [0,1], W 1 = Y 1,W 2 = (1 Y 1 )Y 2,W 3 = (1 Y 1 )(1 Y 2 )Y 3,... Lemma. (W 1,W 2,...) is a random probability measure on N iff P(Y 1 > 0) > 0, and has full support if Y 1 has support [0,1].

40 Stick breaking Given i.i.d. Y 1,Y 2,... in [0,1], W 1 = Y 1,W 2 = (1 Y 1 )Y 2,W 3 = (1 Y 1 )(1 Y 2 )Y 3,... Lemma. (W 1,W 2,...) is a random probability measure on N iff P(Y 1 > 0) > 0, and has full support if Y 1 has support [0,1]. Proof. The remaining length of the stick at stage j is 1 j l=1 W l = j l=1 (1 Y l), and tends to zero a.s. iff j l=1 (1 EY l) 0. (W 1,...,W n ) is a continuous function of (Y 1,...,Y n ) with full range, for every k.

41 Stick breaking Given i.i.d. Y_1,Y_2,... in [0,1], W_1 = Y_1, W_2 = (1−Y_1)Y_2, W_3 = (1−Y_1)(1−Y_2)Y_3, ... Lemma. (W_1,W_2,...) is a random probability measure on N iff P(Y_1 > 0) > 0, and has full support if Y_1 has support [0,1]. Proof. The remaining length of the stick at stage j is 1 − ∑_{l=1}^j W_l = ∏_{l=1}^j (1−Y_l), and tends to zero a.s. iff ∏_{l=1}^j (1−EY_l) → 0. (W_1,...,W_k) is a continuous function of (Y_1,...,Y_k) with full range, for every k. EXAMPLE OF PARTICULAR INTEREST: Y_1,Y_2,... iid ∼ Be(1,M).

42 Random measures as stochastic processes A random measure P induces the distributions on R k of the random vectors ( P(A1 ),...,P(A k ) ), A 1,...,A k X.

43 Random measures as stochastic processes A random measure P induces the distributions on R k of the random vectors ( P(A1 ),...,P(A k ) ), A 1,...,A k X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process ( P(A):A X ) with these distributions (e.g. by Kolmogorov s consistency theorem).

44 Random measures as stochastic processes A random measure P induces the distributions on R^k of the random vectors (P(A_1),...,P(A_k)), A_1,...,A_k ∈ X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process (P(A): A ∈ X) with these distributions (e.g. by Kolmogorov's consistency theorem). It will be true that (i). P(∅) = 0, P(X) = 1, a.s. (ii). P(A_1 ∪ A_2) = P(A_1) + P(A_2), a.s., for any disjoint A_1, A_2.

45 Random measures as stochastic processes A random measure P induces the distributions on R^k of the random vectors (P(A_1),...,P(A_k)), A_1,...,A_k ∈ X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process (P(A): A ∈ X) with these distributions (e.g. by Kolmogorov's consistency theorem). It will be true that (i). P(∅) = 0, P(X) = 1, a.s. (ii). P(A_1 ∪ A_2) = P(A_1) + P(A_2), a.s., for any disjoint A_1, A_2. However, we do not automatically have that P is a random measure. Theorem. If (P(A): A ∈ X) is a stochastic process that satisfies (i) and (ii) and whose mean A ↦ EP(A) is a Borel measure on X, then there exists a version of P that is a random measure on (X, X).

46 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}.

47 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}. We extend to α_i = 0 for one or more i on the understanding that X_i = 0.

48 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}. We extend to α_i = 0 for one or more i on the understanding that X_i = 0. EXAMPLES For k = 2 we have X_1 ∼ Be(α_1,α_2) and X_2 = 1 − X_1 ∼ Be(α_2,α_1). The Dir(k;1,...,1)-distribution is the uniform distribution on the simplex.

49 Dirichlet distribution properties Proposition (Gamma representation). If Y_i ind∼ Ga(α_i,1), then (Y_1/Y,...,Y_k/Y) ∼ Dir(k;α_1,...,α_k) and is independent of Y := ∑_{i=1}^k Y_i. Proposition (Aggregation). If X ∼ Dir(k;α_1,...,α_k) and Z_j = ∑_{i∈I_j} X_i for a given partition I_1,...,I_m of {1,...,k}, then (i). (Z_1,...,Z_m) ∼ Dir(m;β_1,...,β_m), where β_j = ∑_{i∈I_j} α_i. (ii). (X_i/Z_j: i ∈ I_j) ind∼ Dir(#I_j; α_i, i ∈ I_j), for j = 1,...,m. (iii). (Z_1,...,Z_m) and (X_i/Z_j: i ∈ I_j, j = 1,...,m) are independent. Conversely, if X is a random vector such that (i)-(iii) hold, for a given partition I_1,...,I_m and Z_j = ∑_{i∈I_j} X_i, then X ∼ Dir(k;α_1,...,α_k). Proposition. E(X_i) = α_i/|α| and var(X_i) = α_i(|α| − α_i)/(|α|^2(|α|+1)), for |α| = ∑_{i=1}^k α_i. Proposition (Conjugacy). If p ∼ Dir(k;α) and N| p ∼ MN(n,k;p), then p| N ∼ Dir(k;α+N).
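
A quick numerical check of the Gamma representation and the moment formulas (a sketch; the parameter values are arbitrary): normalized independent Ga(α_i, 1) variables are Dir(k; α_1,...,α_k), so their empirical moments should match E(X_i) = α_i/|α| and var(X_i) = α_i(|α| − α_i)/(|α|^2(|α|+1)).

import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])                            # arbitrary illustrative parameters
m = 100_000

Y = rng.gamma(shape=alpha, scale=1.0, size=(m, alpha.size))  # Y_i ~ Ga(alpha_i, 1), independent
X = Y / Y.sum(axis=1, keepdims=True)                         # (Y_1/Y,...,Y_k/Y) ~ Dir(k; alpha)

a = alpha.sum()
print("empirical means:", X.mean(axis=0), " theory:", alpha / a)
print("empirical vars: ", X.var(axis=0), " theory:", alpha * (a - alpha) / (a**2 * (a + 1)))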

50 Dirichlet process Definition. A random measure P on (X, X) is a Dirichlet process with base measure α, if for every partition A_1,...,A_k of X, (P(A_1),...,P(A_k)) ∼ Dir(k; α(A_1),...,α(A_k)). We write P ∼ DP(α), |α| := α(X) and ᾱ := α/|α|.

51 Dirichlet process Definition. A random measure P on (X, X) is a Dirichlet process with base measure α, if for every partition A_1,...,A_k of X, (P(A_1),...,P(A_k)) ∼ Dir(k; α(A_1),...,α(A_k)). We write P ∼ DP(α), |α| := α(X) and ᾱ := α/|α|. EP(A) = ᾱ(A), var P(A) = ᾱ(A)ᾱ(A^c)/(1+|α|).


53 Dirichlet process existence Theorem. For any Borel measure α the Dirichlet process exists as a Borel measure on M.

54 Dirichlet process existence Theorem. For any Borel measure α the Dirichlet process exists as a Borel measure on M. Proof. An arbitrary collection of sets A_1,...,A_k can be partitioned into 2^k atoms B_j = A*_1 ∩ A*_2 ∩ ··· ∩ A*_k, where A* stands for A or A^c. The distribution of (P(B_j): j = 1,...,2^k) must be Dirichlet. Define the distribution of (P(A_1),...,P(A_k)) corresponding to the fact that each P(A_i) must be a sum of some set of P(B_j). Using properties of finite-dimensional Dirichlets, check that this is consistent in the sense of Kolmogorov, so that a version of the stochastic process (P(A): A ∈ X) exists. Apply the general theorem on existence of random measures.

55 Sethuraman representation Theorem. If θ_1,θ_2,... iid ∼ ᾱ and Y_1,Y_2,... iid ∼ Be(1,M) are independent random variables and W_j = Y_j ∏_{l=1}^{j−1} (1−Y_l), then ∑_{j=1}^∞ W_j δ_{θ_j} ∼ DP(Mᾱ).

56 Sethuraman representation Theorem. If θ_1,θ_2,... iid ∼ ᾱ and Y_1,Y_2,... iid ∼ Be(1,M) are independent random variables and W_j = Y_j ∏_{l=1}^{j−1} (1−Y_l), then ∑_{j=1}^∞ W_j δ_{θ_j} ∼ DP(Mᾱ). Proof. P := W_1 δ_{θ_1} + ∑_{j=2}^∞ W_j δ_{θ_j} = Y_1 δ_{θ_1} + (1−Y_1)P′, where P′ = ∑_{j=2}^∞ (Y_j ∏_{l=2}^{j−1}(1−Y_l)) δ_{θ_j}. Hence Q = (P(A_1),...,P(A_k)) and N = (δ_{θ_1}(A_1),...,δ_{θ_1}(A_k)) satisfy Q =_d Y N + (1−Y)Q. For given Y ∼ Be(1,M) and independent θ_1 ∼ ᾱ there is at most one solution in distribution Q. A Dirichlet vector Q is a solution. The second follows by properties of the Dirichlet (not obvious!). The first: see next slide.

57 Sethuraman representation Proof. (Continued) Q =_d Y N + (1−Y)Q. Given i.i.d. copies (Y_n, N_n) and given independent solutions Q and Q′: Q_0 = Q, Q′_0 = Q′, Q_n = Y_n N_n + (1−Y_n)Q_{n−1}, Q′_n = Y_n N_n + (1−Y_n)Q′_{n−1}. Then Q_n =_d Q and Q′_n =_d Q′ for every n, and ‖Q_n − Q′_n‖ = |1−Y_n| ‖Q_{n−1} − Q′_{n−1}‖ = ∏_{i=1}^n |1−Y_i| ‖Q_0 − Q′_0‖ → 0. Hence Q =_d Q′.
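
A sketch of drawing (approximately) from DP(Mᾱ) via the Sethuraman representation, truncating the stick-breaking series at a finite number of atoms (the truncation level, M and the standard normal base measure ᾱ are illustrative choices):

import numpy as np

def dp_draw(M, base_sampler, n_atoms=2000, rng=None):
    """Truncated Sethuraman draw: weights by stick breaking with Be(1, M), atoms iid from abar."""
    rng = rng or np.random.default_rng()
    Y = rng.beta(1.0, M, size=n_atoms)                            # stick-breaking proportions
    W = Y * np.cumprod(np.concatenate(([1.0], 1.0 - Y[:-1])))     # W_j = Y_j prod_{l<j}(1 - Y_l)
    theta = base_sampler(n_atoms, rng)                            # atoms theta_j iid ~ abar
    return W, theta

rng = np.random.default_rng(2)
W, theta = dp_draw(M=5.0, base_sampler=lambda k, r: r.standard_normal(k), rng=rng)
print("weight captured by the truncation:", W.sum())              # close to 1
print("P((-inf, 0]) for this draw:", W[theta <= 0.0].sum())       # random; its mean is abar((-inf,0]) = 0.5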

58 Tail-free processes Let X = A_0 ∪ A_1 = (A_00 ∪ A_01) ∪ (A_10 ∪ A_11) = ··· be nested partitions, rich enough that they generate the Borel σ-field. [Figure: binary tree of the sets A_ε with splitting variables V_ε.]

59 Tail-free processes Let X = A_0 ∪ A_1 = (A_00 ∪ A_01) ∪ (A_10 ∪ A_11) = ··· be nested partitions, rich enough that they generate the Borel σ-field. [Figure: binary tree of the sets A_ε with splitting variables V_ε.] Splitting variables: V_{ε0} = P(A_{ε0}| A_ε), and V_{ε1} = P(A_{ε1}| A_ε). P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m.

60 Tail-free processes (2) P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. (1) Theorem. Suppose A ε = {A εδ :A εδ compact,a εδ A ε } (V ε :ε E ) stochastic process with 0 V ε 1 and V ε0 +V ε1 = 1. There is a Borel measure with µ(a ε ):= EV ε1 V ε1 ε 2 V ε1 ε m. Then there exists a random Borel measure P satisfying (1)

61 Tail-free processes (2) P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m. (1) Theorem. Suppose A_ε = ∪{A_εδ: A_εδ compact, A_εδ ⊂ A_ε}, and let (V_ε: ε ∈ E*) be a stochastic process with 0 ≤ V_ε ≤ 1 and V_{ε0} + V_{ε1} = 1 such that there is a Borel measure µ with µ(A_ε) := E V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}. Then there exists a random Borel measure P satisfying (1). SPECIAL CASE: Polya tree prior: all V_ε independent Beta variables.
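
A sketch of the special case just mentioned: a Polya tree on [0,1] with dyadic partitions and independent splitting variables V_{ε0} ∼ Be(a_m, a_m) at level m (the choice a_m = m^2 is illustrative). The code computes the random masses P(A_ε) of the level-m dyadic intervals via the product formula (1).

import numpy as np

def polya_tree_masses(levels, a=lambda m: m**2, rng=None):
    """Random masses P(A_eps) of the 2**levels dyadic subintervals of [0,1]."""
    rng = rng or np.random.default_rng()
    masses = np.array([1.0])
    for m in range(1, levels + 1):
        V0 = rng.beta(a(m), a(m), size=masses.size)   # V_{eps0} ~ Be(a_m, a_m), independent
        # A_eps splits its mass into V_{eps0} P(A_eps) and (1 - V_{eps0}) P(A_eps)
        masses = np.column_stack((masses * V0, masses * (1.0 - V0))).ravel()
    return masses

rng = np.random.default_rng(3)
masses = polya_tree_masses(levels=8, rng=rng)
print("number of intervals:", masses.size, " total mass:", masses.sum())
print("random P([0, 1/2]):", masses[:masses.size // 2].sum())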

62 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m.

63 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. Notation: U V means U and V are independent U V Z means U and V are conditionally independent given Z.

64 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. Notation: U V means U and V are independent U V Z means U and V are conditionally independent given Z. Definition (Tail-free). The random measure P is a tail-free process with respect to the sequence of partitions if {V 0 } {V 00,V 10 } {V ε0 :ε E m }.

65 Tail-free processes (3) V_{ε0} = P(A_{ε0}| A_ε), and V_{ε1} = P(A_{ε1}| A_ε). P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m. Notation: U ⊥ V means U and V are independent; U ⊥ V | Z means U and V are conditionally independent given Z. Definition (Tail-free). The random measure P is a tail-free process with respect to the sequence of partitions if {V_0} ⊥ {V_00, V_10} ⊥ ··· ⊥ {V_{ε0}: ε ∈ E^m} ⊥ ··· Theorem. The DP(α) prior is tail-free. All splitting variables V_{ε0} are independent and V_{ε0} ∼ Be(α(A_{ε0}), α(A_{ε1})). Proof. This follows from properties of the finite-dimensional Dirichlet.

66 Posterior distribution For X 1,...,X n P iid P define count variables: N ε := #{1 i n:x i A ε }.

67 Posterior distribution For X 1,...,X n P iid P define count variables: N ε := #{1 i n:x i A ε }. Theorem. If P is tail-free, then for every m and n the posterior distribution of ( P(A ε ):ε E m) given X 1,...,X n depends only on (N ε :ε E m ).

68 Posterior distribution For X_1,...,X_n | P iid ∼ P define count variables: N_ε := #{1 ≤ i ≤ n: X_i ∈ A_ε}. Theorem. If P is tail-free, then for every m and n the posterior distribution of (P(A_ε): ε ∈ E^m) given X_1,...,X_n depends only on (N_ε: ε ∈ E^m). Proof. We may generate the variables P, X_1,...,X_n in four steps: (a) Generate θ := (P(A_ε): ε ∈ E^m) from its prior. (b) Given θ generate N = (N_ε: ε ∈ E^m) multinomial (n,θ). (c) Generate η := (P(A| A_ε): A ∈ X, ε ∈ E^m). (d) Given (N,η) generate for every ε ∈ E^m a random sample of size N_ε from P(·| A_ε), independently across ε ∈ E^m; let X_1,...,X_n be the n values in a random order. Then η ⊥ θ and N ⊥ η | θ and X ⊥ θ | (N,η). Thus θ ⊥ X | N.

69 Posterior distribution (continued) Theorem. If P is tail-free, then the posterior P X 1,...,X n is tail-free.

70 Posterior distribution (continued) Theorem. If P is tail-free, then the posterior P| X_1,...,X_n is tail-free. Proof. It suffices to show, for every level: (V_{ε0}: ε ∈ E^m) ⊥ (P(A_ε): ε ∈ E^m) | X_1,...,X_n. In view of the preceding theorem, it suffices: (V_{ε0}: ε ∈ E^m) ⊥ (P(A_ε): ε ∈ E^m) | (N_{εδ}: ε ∈ E^m, δ ∈ E). The likelihood for (V,θ,N), where θ_ε = P(A_ε), takes the form (n choose N) ∏_{ε∈E^m, δ∈E} (θ_ε V_{εδ})^{N_{εδ}} dΠ_1(V) dΠ_2(θ). This factorizes in parts involving (V,N) and involving (θ,N).

71 Conjugacy of Dirichlet process P DP(α), X 1,X 2,... P iid P.

72 Conjugacy of Dirichlet process P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. P| X_1,...,X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}.

73 Conjugacy of Dirichlet process P DP(α), X 1,X 2,... P iid P. Theorem. P X 1,...,X n DP(α+nP n ), for P n = n 1 n i=1 δ X i. Proof. ( P(A 1 ),...,P(A k ) ) X 1,...,X n ( P(A 1 ),...,P(A k ) ) N. Apply result for finite-dimensional Dirichlet.

74 Conjugacy of Dirichlet process P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. P| X_1,...,X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}. Proof. (P(A_1),...,P(A_k)) | X_1,...,X_n =_d (P(A_1),...,P(A_k)) | N. Apply the result for the finite-dimensional Dirichlet. E(P(A)| X_1,...,X_n) = |α|/(|α|+n) ᾱ(A) + n/(|α|+n) P_n(A), var(P(A)| X_1,...,X_n) = P_n(A)P_n(A^c)/(1+|α|+n) ≤ 1/(4(1+|α|+n)). Corollary. P(A)| X_1,...,X_n →_d δ_{P_0(A)} as n → ∞, a.s. [P_0].
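
A sketch of the conjugacy in use (base measure ᾱ = N(0,1), total mass M = |α| and the simulated data are illustrative choices): the posterior mean of P(A) is the stated convex combination of the prior mean ᾱ(A) and the empirical measure P_n(A).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
M = 5.0                                  # |alpha|; base measure abar = N(0,1)
X = rng.standard_normal(50) + 1.0        # observed data (true distribution N(1,1))
n = len(X)

abar_A = norm.cdf(0.0)                   # prior mean E P(A) = abar(A) for A = (-inf, 0]
Pn_A = np.mean(X <= 0.0)                 # empirical measure P_n(A)

post_mean = M / (M + n) * abar_A + n / (M + n) * Pn_A
# exact Dirichlet variance, using the posterior base measure alpha + n P_n
post_var = post_mean * (1.0 - post_mean) / (1.0 + M + n)
print("posterior mean of P(A):", post_mean, " posterior sd:", np.sqrt(post_var))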


76 Predictive distribution P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. X_i| X_1,...,X_{i−1} ∼ δ_{X_1} with probability 1/(|α|+i−1), ..., δ_{X_{i−1}} with probability 1/(|α|+i−1), ᾱ with probability |α|/(|α|+i−1).

77 Predictive distribution P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. X_i| X_1,...,X_{i−1} ∼ δ_{X_1} with probability 1/(|α|+i−1), ..., δ_{X_{i−1}} with probability 1/(|α|+i−1), ᾱ with probability |α|/(|α|+i−1). Proof. (i). Pr(X_1 ∈ A) = E Pr(X_1 ∈ A| P) = EP(A) = ᾱ(A). (ii). The preceding step means: X_1| P ∼ P and P ∼ DP(α) imply X_1 ∼ ᾱ. Hence X_2| (P,X_1) ∼ P and P| X_1 ∼ DP(α + δ_{X_1}) imply X_2| X_1 ∼ (α + δ_{X_1})/(|α|+1). (iii). Etc.
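
A sketch of sampling X_1, X_2,... directly from these predictive distributions (the generalized Polya urn, with P integrated out); M and the base measure ᾱ = N(0,1) are illustrative choices. Ties occur with positive probability, reflecting the discreteness of the Dirichlet process.

import numpy as np

def polya_urn(n, M, base_sampler, rng=None):
    """X_i equals a uniformly chosen earlier X_j w.p. (i-1)/(M+i-1), else a fresh draw from abar."""
    rng = rng or np.random.default_rng()
    X = []
    for i in range(1, n + 1):
        if rng.random() < M / (M + i - 1):
            X.append(base_sampler(rng))           # new value from the base measure, prob M/(M+i-1)
        else:
            X.append(X[rng.integers(len(X))])     # copy X_j, each with prob 1/(M+i-1)
    return np.array(X)

rng = np.random.default_rng(5)
X = polya_urn(200, M=2.0, base_sampler=lambda r: r.standard_normal(), rng=rng)
print("number of distinct values among 200 draws:", np.unique(X).size)   # O_P(log n) of them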

78 Dirichlet process mixtures Given a probability density x ↦ ψ(x;θ) consider data X_1,...,X_n | F iid ∼ p_F, where p_F(x) := ∫ ψ(x;θ) dF(θ).

79 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α).

80 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α). Lemma. For any θ ψ(θ) (e.g. ψ(x, )), ( ) E ψdf θ 1,...,θ n,x 1,...,X n = 1 [ ψdα+ α +n n j=1 ] ψ(θ j ).

81 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α). Lemma. For any θ ψ(θ) (e.g. ψ(x, )), ( ) E ψdf θ 1,...,θ n,x 1,...,X n = 1 [ ψdα+ α +n n j=1 ] ψ(θ j ). Proof. F X 1,...,X n θ 1,...,θ n ; F θ 1,...,θ n DP(α+ n i=1 δ θ i ).

82 Dirichlet process mixtures Given a probability density x ↦ ψ(x;θ) consider data X_1,...,X_n | F iid ∼ p_F, where p_F(x) := ∫ ψ(x;θ) dF(θ). For F ∼ DP(α), this gives the Bayesian model: X_1,...,X_n | θ_1,...,θ_n, F ind ∼ ψ(·;θ_i), θ_1,...,θ_n | F iid ∼ F, F ∼ DP(α). Lemma. For any θ ↦ ψ(θ) (e.g. ψ(x,·)), E(∫ψ dF | θ_1,...,θ_n, X_1,...,X_n) = (1/(|α|+n)) [∫ψ dα + ∑_{j=1}^n ψ(θ_j)]. Proof. F ⊥ X_1,...,X_n | θ_1,...,θ_n; F| θ_1,...,θ_n ∼ DP(α + ∑_{i=1}^n δ_{θ_i}). Compute the conditional expectation given X_1,...,X_n by generating samples θ_1,...,θ_n from θ_1,...,θ_n | X_1,...,X_n, and averaging.

83 Gibbs sampler X_i | θ_i, F ind ∼ ψ(·;θ_i), θ_i | F iid ∼ F, F ∼ DP(α). Theorem (Gibbs sampler). θ_i | θ_{−i}, X_1,...,X_n ∼ ∑_{j≠i} q_{i,j} δ_{θ_j} + q_{i,0} G_{b,i}, where (q_{i,j}: j ∈ {0,1,...,n}\{i}) is the probability vector satisfying q_{i,j} ∝ ψ(X_i;θ_j) for j ≥ 1, j ≠ i, and q_{i,0} ∝ ∫ ψ(X_i;θ) dα(θ), and G_{b,i} is the baseline posterior measure given by dG_{b,i}(θ| X_i) ∝ ψ(X_i;θ) dα(θ).

84 Gibbs sampler proof Proof. E(1l_A(X_i) 1l_B(θ_i)| θ_{−i}, X_{−i}) = E(E(1l_A(X_i) 1l_B(θ_i)| F, θ_{−i}, X_{−i})| θ_{−i}, X_{−i}) = E(∫∫ 1l_A(x) 1l_B(θ) ψ(x;θ) dµ(x) dF(θ)| θ_{−i}) = (1/(|α|+n−1)) ∫∫ 1l_A(x) 1l_B(θ) ψ(x;θ) dµ(x) d(α + ∑_{j≠i} δ_{θ_j})(θ). By Bayes's rule (applied conditionally given (θ_{−i}, X_{−i})), Pr(θ_i ∈ B| X_i, θ_{−i}, X_{−i}) = ∫_B ψ(X_i;θ) d(α + ∑_{j≠i} δ_{θ_j})(θ) / ∫ ψ(X_i;θ) d(α + ∑_{j≠i} δ_{θ_j})(θ).
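
A sketch of this Gibbs sampler for a Dirichlet process mixture of normals, ψ(x;θ) = N(x; θ, 1), with base measure α = M · N(0, τ^2); the kernel, base measure and simulated data are my illustrative choices. For this conjugate pair q_{i,0} and the baseline posterior G_{b,i} are available in closed form.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(-2, 1, 60), rng.normal(2, 1, 60)])  # illustrative two-component data
n = len(X)
M, tau2 = 1.0, 4.0              # total mass |alpha| and variance of the N(0, tau2) base measure

theta = X.copy()                # initialize the latent theta_i
for sweep in range(200):
    for i in range(n):
        others = np.delete(theta, i)
        w = norm.pdf(X[i], loc=others, scale=1.0)                      # q_{i,j} prop to psi(X_i; theta_j)
        w0 = M * norm.pdf(X[i], loc=0.0, scale=np.sqrt(1.0 + tau2))    # q_{i,0} prop to int psi(X_i;t) dalpha(t)
        probs = np.concatenate(([w0], w))
        probs /= probs.sum()
        j = rng.choice(probs.size, p=probs)
        if j == 0:
            post_var = tau2 / (1.0 + tau2)                             # baseline posterior G_{b,i}
            theta[i] = rng.normal(post_var * X[i], np.sqrt(post_var))
        else:
            theta[i] = others[j - 1]

print("number of distinct cluster locations:", np.unique(theta).size)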

85 Further properties The number of distinct values in (X_1,...,X_n) is O_P(log n). The pattern of equal values induces the same random partition of the set {1,2,...,n} as the Kingman coalescent. The Dirichlet distribution has full support relative to the weak topology. DP(α_1) ⊥ DP(α_2) (mutually singular) as soon as α_1^c ≠ α_2^c or α_1^d and α_2^d have different supports. In particular the prior DP(α) and the posterior DP(α + nP_n) are typically orthogonal. The cdf of P ∼ DP(α) is a normalized Gamma process. The tails of P ∼ DP(α) are much thinner than the tails of α. The Dirichlet is the only prior that is tail-free relative to any partition. The splitting variables of a Polya tree can be defined so that the prior is absolutely continuous.

86 Consistency and rates

87 Consistency X^{(n)} observation in sample space (X^{(n)}, X^{(n)}) with distribution P^{(n)}_θ. θ belongs to metric space (Θ,d). Definition. The posterior distribution is consistent at θ_0 ∈ Θ if Π_n(θ: d(θ,θ_0) > ǫ| X^{(n)}) → 0 in P^{(n)}_{θ_0}-probability, as n → ∞, for every ǫ > 0.

88 Point estimator Proposition. If the posterior distribution is consistent at θ 0 then ˆθ n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(ˆθ n,θ 0 ) 0 in P (n) θ 0 -probability.

89 Point estimator Proposition. If the posterior distribution is consistent at θ 0 then ˆθ n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(ˆθ n,θ 0 ) 0 in P (n) θ 0 -probability. Proof. For B(θ,r) = {s Θ:d(s,θ) r} let Then ˆr n (ˆθ n ) inf θ ˆr n (θ). ˆr n (θ) = inf{r:π n ( B(θ,r) X (n) ) 1/2}. Π n ( B(θ0,ǫ) X (n)) 1 in probability. ˆr n (θ 0 ) ǫ with probability tending to 1, whence ˆr n (ˆθ n ) ˆr n (θ 0 ) ǫ. B(θ 0,ǫ) and B (ˆθn,ˆr n (ˆθ n ) ) cannot be disjoint. d(θ 0,ˆθ n ) ǫ+ ˆr n (ˆθ n ) 2ǫ.

90 Point estimator Proposition. If the posterior distribution is consistent at θ_0 then θ̂_n, defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2, satisfies d(θ̂_n,θ_0) → 0 in P^{(n)}_{θ_0}-probability. Proof. For B(θ,r) = {s ∈ Θ: d(s,θ) ≤ r} let r̂_n(θ) = inf{r: Π_n(B(θ,r)| X^{(n)}) ≥ 1/2}. Then r̂_n(θ̂_n) ≤ inf_θ r̂_n(θ). Π_n(B(θ_0,ǫ)| X^{(n)}) → 1 in probability. r̂_n(θ_0) ≤ ǫ with probability tending to 1, whence r̂_n(θ̂_n) ≤ r̂_n(θ_0) ≤ ǫ. B(θ_0,ǫ) and B(θ̂_n, r̂_n(θ̂_n)) cannot be disjoint. d(θ_0,θ̂_n) ≤ ǫ + r̂_n(θ̂_n) ≤ 2ǫ. Alternative: posterior mean ∫ θ dΠ_n(θ| X^{(n)}).

91 Doob s theorem Theorem (Doob). Let (X, X,P θ :θ Θ) be experiments with (X, X ) a standard Borel space and Θ a Borel subset of a Polish space such that θ P θ (A) is Borel measurable for every A X and the map θ P θ is one-to-one. Then for any prior Π on the Borel sets of Θ the posterior Π n ( X 1,...,X n ) in the model X 1,...,X n θ iid p θ and θ Π is consistent at θ, for Π-almost every θ.

92 Kullback-Leibler property Parameter p: ν-density on sample space (X, X ). True value p 0. Kullback-Leibler divergence: K(p 0 ;p) = p 0 log(p 0 /p)dν, K(p 0 ;P 0 ) = inf K(p 0 ;p). p P 0

93 Kullback-Leibler property Parameter p: ν-density on sample space (X, X ). True value p 0. Kullback-Leibler divergence: K(p 0 ;p) = p 0 log(p 0 /p)dν, K(p 0 ;P 0 ) = inf K(p 0 ;p). p P 0 Definition. p 0 is said to possess the Kullback-Leibler property relative to Π if Π ( p:k(p 0 ;p) < ǫ ) > 0 for every ǫ > 0.

94 Kullback-Leibler property Parameter p: ν-density on sample space (X, X). True value p_0. Kullback-Leibler divergence: K(p_0;p) = ∫ p_0 log(p_0/p) dν, K(p_0;P_0) = inf_{p∈P_0} K(p_0;p). Definition. p_0 is said to possess the Kullback-Leibler property relative to Π if Π(p: K(p_0;p) < ǫ) > 0 for every ǫ > 0. EXAMPLES Polya tree prior with dyadic partition and splitting variables V_{ε0} ∼ Be(a_m, a_m) at level m, for ∑_m a_m^{−1} < ∞ and K(p_0; λ) < ∞. Dirichlet mixtures ∫ψ(·,θ) dF(θ) with F ∼ DP(α), under some regularity conditions.

95 Schwartz s theorem Bayesian model: X 1,...,X n p iid p, p Π.

96 Schwartz s theorem Bayesian model: X 1,...,X n p iid p, p Π. Theorem. If p 0 has KL-property, and for every neighbourhood U of p 0 there exist tests φ n such that P n 0 φ n 0, sup P n (1 φ n ) 0, p U c then Π n ( X 1,...,X n ) is consistent at p 0.

97 Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property, and for every neighbourhood U of p_0 there exist tests φ_n such that P_0^n φ_n → 0, sup_{p∈U^c} P^n(1−φ_n) → 0, then Π_n(·| X_1,...,X_n) is consistent at p_0. Proof. By grouping the observations and using Hoeffding's inequality we can find tests ψ_n with P_0^n ψ_n ≤ e^{−Cn}, sup_{p∈U^c} P^n(1−ψ_n) ≤ e^{−Cn}. Then apply the theorem later on.

98 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P.

99 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P. Theorem. The posterior distribution is consistent for the weak topology at any p 0 with the Kullback-Leibler property.

100 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P. Theorem. The posterior distribution is consistent for the weak topology at any p_0 with the Kullback-Leibler property. Proof. Consistent tests always exist: A subbasis for the weak neighbourhoods are sets of the type U = {p: Pψ < P_0ψ + ǫ}, for ψ: X → [0,1] continuous and ǫ > 0. Given a test for each neighbourhood the maximum of the tests works for a finite intersection. Use Hoeffding's inequality to bound the error probabilities of the test φ_n = 1l{ (1/n) ∑_{i=1}^n ψ(X_i) > P_0ψ + ǫ/2 }.

101 Extended Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property and for every neighbourhood U of p_0 there exist C > 0, sets P_n ⊂ P and tests φ_n such that Π(P \ P_n) < e^{−Cn}, P_0^n φ_n ≤ e^{−Cn}, sup_{p∈P_n∩U^c} P^n(1−φ_n) ≤ e^{−Cn}, then the posterior distribution Π_n(·| X_1,...,X_n) is consistent at p_0.

102 Extended Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property and for every neighbourhood U of p_0 there exist C > 0, sets P_n ⊂ P and tests φ_n such that Π(P \ P_n) < e^{−Cn}, P_0^n φ_n ≤ e^{−Cn}, sup_{p∈P_n∩U^c} P^n(1−φ_n) ≤ e^{−Cn}, then the posterior distribution Π_n(·| X_1,...,X_n) is consistent at p_0. Proof. Follow steps 1-4. Π_n(U^c| X_1,...,X_n) = ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) / ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p).

103 Extended Schwartz s theorem proof Proof. continued. Step 1: for any ǫ > 0 eventually a.s. [P 0 ]: n i=1 p p 0 (X i )dπ(p) Π ( p:k(p 0 ;p) < ǫ)e nǫ. (2)

104 Extended Schwartz's theorem proof Proof. (continued). Step 1: for any ǫ > 0 eventually a.s. [P_0]: ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p: K(p_0;p) < ǫ) e^{−nǫ}. (2) Proof: for Π_ǫ(·) = Π(· ∩ P_ǫ)/Π(P_ǫ), and P_ǫ = {p: K(p_0;p) < ǫ}, log ∫_{P_ǫ} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) − log Π(P_ǫ) = log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_ǫ(p) ≥ ∫ ∑_{i=1}^n log(p/p_0)(X_i) dΠ_ǫ(p) = −n ∫ K(p_0;p) dΠ_ǫ(p) + o(n), a.s., by Jensen's inequality and the law of large numbers.

105 Extended Schwartz s theorem proof (2) Proof. continued. Step 2: n Π n (U c U X 1,...,X n ) φ n +(1 φ n ) c i=1 (p/p 0)(X i )dπ(p) n i=1 (p/p 0)(X i )dπ(p) φ n +Π ( n p:k(p 0 ;p) < ǫ)e nǫ (1 φ n ) (p/p 0 )(X i )dπ(p) U c i=1

106 Extended Schwartz s theorem proof (2) Proof. continued. Step 2: n Π n (U c U X 1,...,X n ) φ n +(1 φ n ) c i=1 (p/p 0)(X i )dπ(p) n i=1 (p/p 0)(X i )dπ(p) φ n +Π ( n p:k(p 0 ;p) < ǫ)e nǫ (1 φ n ) (p/p 0 )(X i )dπ(p) Step 3: P n 0 ( (1 φ n ) U c n i=1 p ) (X i )dπ(p) p 0 = U c P n 0 U c i=1 [ (1 φ n ) n i=1 U c P n (1 φ n )dπ(p). p ] (X i ) dπ(p) p 0

107 Extended Schwartz's theorem proof (2) Proof. (continued). Step 2: Π_n(U^c| X_1,...,X_n) ≤ φ_n + (1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) / ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≤ φ_n + Π(p: K(p_0;p) < ǫ)^{−1} e^{nǫ} (1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p). Step 3: P_0^n((1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p)) = ∫_{U^c} P_0^n[(1−φ_n) ∏_{i=1}^n (p/p_0)(X_i)] dΠ(p) ≤ ∫_{U^c} P^n(1−φ_n) dΠ(p). Step 4: Split U^c into U^c ∩ P_n and U^c ∩ P_n^c and use that P^n(1−φ_n) ≤ e^{−Cn} on the first set, while Π(U^c ∩ P_n^c) ≤ e^{−Cn}.

108 Strong consistency and entropy Definition (Covering number). N(ǫ, P, d) is the minimal number of d-balls of radius ǫ needed to cover P. Theorem. The posterior distribution is consistent relative to the L 1 -distance at every p 0 with the KL-property if for every ǫ > 0 there exist a partition P = P n,1 P n,2 (which may depend on ǫ) such that, for C > 0, (i) Π(P n,2 ) e Cn. (ii) logn ( ǫ,p n,1, 1 ) nǫ 2 /3.

109 Strong consistency and entropy Definition (Covering number). N(ǫ, P, d) is the minimal number of d-balls of radius ǫ needed to cover P. Theorem. The posterior distribution is consistent relative to the L_1-distance at every p_0 with the KL-property if for every ǫ > 0 there exists a partition P = P_{n,1} ∪ P_{n,2} (which may depend on ǫ) such that, for some C > 0, (i) Π(P_{n,2}) ≤ e^{−Cn}, (ii) log N(ǫ, P_{n,1}, ‖·‖_1) ≤ nǫ^2/3. Proof. Entropy gives tests. See below. Apply the extended Schwartz theorem.

112 Tests the two Luciens Lucien le Cam Lucien Birgé

113 Tests minimax theorem minimax risk for testing P versus Q : π(p,q) = inf φ ( Pφ+ sup Q(1 φ) Q Q ).

114 Tests minimax theorem minimax risk for testing P versus Q : Hellinger affinity: π(p,q) = inf φ ( Pφ+ sup Q(1 φ) Q Q ). ρ 1/2 (p,q) = p qdµ = 1 h 2 (p,q)/2, for h 2 (p,q) = ( p q) 2 dµ square Hellinger distance

115 Tests minimax theorem minimax risk for testing P versus Q: π(P,Q) = inf_φ (Pφ + sup_{Q∈Q} Q(1−φ)). Hellinger affinity: ρ_{1/2}(p,q) = ∫ √(pq) dµ = 1 − h^2(p,q)/2, for h^2(p,q) = ∫ (√p − √q)^2 dµ the square Hellinger distance. Proposition. For dominated probability measures P and Q, π(P,Q) = 1 − ‖P − conv(Q)‖ ≤ sup_{Q∈conv(Q)} ρ_{1/2}(p,q).

116 Tests minimax risk Proof. π(P,Q) = inf_φ sup_{Q∈conv(Q)} (Pφ + Q(1−φ)) = sup_{Q∈conv(Q)} inf_φ (Pφ + Q(1−φ)) = sup_{Q∈conv(Q)} (P1l{p < q} + Q1l{p ≥ q}) (= sup_{Q∈conv(Q)} ‖p ∧ q‖_1). P1l{p < q} + Q1l{p ≥ q} = ∫_{p<q} p dµ + ∫_{p≥q} q dµ ≤ ∫ √(pq) dµ.

117 Tests product measures ρ 1/2 (p 1 p 2,q 1 q 2 ) = ρ 1/2 (p 1,q 1 )ρ 1/2 (p 2,q 2 ).

118 Tests product measures ρ 1/2 (p 1 p 2,q 1 q 2 ) = ρ 1/2 (p 1,q 1 )ρ 1/2 (p 2,q 2 ). Lemma. For any probability measures P i and Q i ρ 1/2 ( i P i,conv( i Q i ) ) i ρ 1/2 ( Pi,conv(Q i ) ).

119 Tests product measures ρ_{1/2}(p_1 × p_2, q_1 × q_2) = ρ_{1/2}(p_1,q_1) ρ_{1/2}(p_2,q_2). Lemma. For any probability measures P_i and Q_i, ρ_{1/2}(⊗_i P_i, conv(⊗_i Q_i)) ≤ ∏_i ρ_{1/2}(P_i, conv(Q_i)). Proof. It suffices to consider products of 2. If q(x,y) = ∑_j κ_j q_{1j}(x) q_{2j}(y), then ρ_{1/2}(p_1 × p_2, q) = ∫ p_1(x)^{1/2} (∑_j κ_j q_{1j}(x))^{1/2} [∫ p_2(y)^{1/2} (∑_j κ_j q_{1j}(x) q_{2j}(y))^{1/2} / (∑_j κ_j q_{1j}(x))^{1/2} dµ_2(y)] dµ_1(x).

120 Tests product measures (2) Corollary. π(p n,q n ) ρ 1/2 (P n,conv(q n )) ρ 1/2 (P,conv(Q)) n.

121 Tests product measures (2) Corollary. π(p n,q n ) ρ 1/2 (P n,conv(q n )) ρ 1/2 (P,conv(Q)) n. Theorem. For any probability measure P and convex set of dominated probability measures Q with h(p,q) > ǫ for every q Q and any n N, there exists a test φ such that P n φ e nǫ2 /2, sup Q n (1 φ) e nǫ2 /2. Q Q

122 Tests product measures (2) Corollary. π(P^n, Q^n) ≤ ρ_{1/2}(P^n, conv(Q^n)) ≤ ρ_{1/2}(P, conv(Q))^n. Theorem. For any probability measure P and convex set of dominated probability measures Q with h(P,q) > ǫ for every q ∈ Q and any n ∈ N, there exists a test φ such that P^n φ ≤ e^{−nǫ^2/2}, sup_{Q∈Q} Q^n(1−φ) ≤ e^{−nǫ^2/2}. Proof. ρ_{1/2}(P,Q) = 1 − h^2(P,Q)/2 ≤ 1 − ǫ^2/2. π(P^n, Q^n) ≤ (1 − ǫ^2/2)^n ≤ e^{−nǫ^2/2}.

123 Tests nonconvex alternatives Definition (Covering number). N(ǫ, Q, d) is the minimal number of d-balls of radius ǫ needed to cover Q. Proposition. Let d ≤ h be a metric whose balls are convex. If N(ǫ/4, Q, d) ≤ N(ǫ) for every ǫ > ǫ_n > 0 and some nonincreasing function N: (0,∞) → (0,∞), then for every ǫ > ǫ_n and n there exists a test φ such that, for all j ∈ N, P^n φ ≤ N(ǫ) e^{−nǫ^2/8}/(1 − e^{−nǫ^2/8}), sup_{Q∈Q: d(P,Q)>jǫ} Q^n(1−φ) ≤ e^{−nǫ^2 j^2/8}.

124 Tests nonconvex alternatives Proof. For j ∈ N, choose a maximal set of jǫ/2-separated points Q_{j,1},...,Q_{j,N_j} in Q_j := {Q ∈ Q: jǫ < d(P,Q) ≤ 2jǫ}. (i). N_j ≤ N(jǫ/4, Q_j, d). (ii). The N_j balls B_{j,l} of radius jǫ/2 around the Q_{j,l} cover Q_j. (iii). h(P, B_{j,l}) ≥ d(P, B_{j,l}) > jǫ/2 for every ball B_{j,l}. For every ball take a test φ_{j,l} of P versus B_{j,l}. Let φ be their supremum. P^n φ ≤ ∑_{j=1}^∞ ∑_{l=1}^{N_j} e^{−nj^2ǫ^2/8} ≤ ∑_{j=1}^∞ N(jǫ/4, Q_j, d) e^{−nj^2ǫ^2/8} ≤ N(ǫ) e^{−nǫ^2/8}/(1 − e^{−nǫ^2/8}) and, for every j ∈ N, sup_{Q∈∪_{l>j} Q_l} Q^n(1−φ) ≤ sup_{l>j} e^{−nl^2ǫ^2/8} ≤ e^{−nj^2ǫ^2/8}.

125 Rate of contraction Definition. The posterior distribution Π n ( X (n) ) contracts at rate ǫ n 0 ( at θ 0 Θ if Π n θ:d(θ,θ0 ) > M n ǫ n X (n)) 0 in P (n) θ 0 -probability, for every M n as n.

126 Rate of contraction Definition. The posterior distribution Π_n(·| X^{(n)}) contracts at rate ǫ_n → 0 at θ_0 ∈ Θ if Π_n(θ: d(θ,θ_0) > M_nǫ_n| X^{(n)}) → 0 in P^{(n)}_{θ_0}-probability, for every M_n → ∞ as n → ∞. Proposition (Point estimator). If the posterior distribution contracts at rate ǫ_n at θ_0, then θ̂_n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(θ̂_n,θ_0) = O_P(ǫ_n) under P^{(n)}_{θ_0}.

127 Basic contraction theorem K(p_0;p) = P_0 log(p_0/p), V(p_0;p) = P_0 (log(p_0/p))^2. Theorem. Given d ≤ h whose balls are convex suppose that there exist P_n ⊂ P and C > 0, such that (i) Π_n(p: K(p_0;p) < ǫ_n^2, V(p_0;p) < ǫ_n^2) ≥ e^{−Cnǫ_n^2}, (ii) log N(ǫ_n, P_n, d) ≤ nǫ_n^2, (iii) Π_n(P_n^c) ≤ e^{−(C+4)nǫ_n^2}. Then the posterior rate of contraction for d is ǫ_n ∨ n^{−1/2}.

128 Basic contraction theorem proof Proof. There exist tests φ_n with P_0^n φ_n ≤ e^{nǫ_n^2} e^{−nM^2ǫ_n^2/8}/(1 − e^{−nM^2ǫ_n^2/8}), sup_{p∈P_n: d(p,p_0)>Mǫ_n} P^n(1−φ_n) ≤ e^{−nM^2ǫ_n^2/8}. For A_n = {∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) ≥ e^{−(2+C)nǫ_n^2}}: Π_n(p: d(p,p_0) > Mǫ_n| X_1,...,X_n) ≤ φ_n + 1l{A_n^c} + e^{(2+C)nǫ_n^2} ∫_{d(p,p_0)>Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) (1−φ_n). P_0^n(A_n^c) → 0. See further on.

129 Basic contraction theorem proof continued Proof. (Continued) P_0^n((1−φ_n) ∫_{p∈P_n: d(p,p_0)>Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p)) ≤ ∫_{p∈P_n: d(p,p_0)>Mǫ_n} P^n(1−φ_n) dΠ_n(p) ≤ e^{−nM^2ǫ_n^2/8}, and P_0^n(∫_{P\P_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p)) ≤ Π_n(P \ P_n).

130 Bounding the denominator Lemma. For any probability measure Π on P, and positive constant ǫ, with P n 0 -probability at least 1 (nǫ2 ) 1, n i=1 p p 0 (X i )dπ(p) Π ( p:k(p 0 ;p) < ǫ 2,V(p 0 ;p) < ǫ 2) e 2nǫ2.

131 Bounding the denominator Lemma. For any probability measure Π on P, and positive constant ǫ, with P_0^n-probability at least 1 − (nǫ^2)^{−1}, ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p: K(p_0;p) < ǫ^2, V(p_0;p) < ǫ^2) e^{−2nǫ^2}. Proof. B := {p: K(p_0;p) < ǫ^2, V(p_0;p) < ǫ^2}, and let Π_B := Π(· ∩ B)/Π(B). By Jensen's inequality, log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_B(p) ≥ ∫ ∑_{i=1}^n log(p/p_0)(X_i) dΠ_B(p) =: Z. EZ = −n ∫ K(p_0;p) dΠ_B(p) > −nǫ^2, var Z ≤ n P_0 (∫ log(p_0/p) dΠ_B(p))^2 ≤ n P_0 ∫ (log(p_0/p))^2 dΠ_B(p) ≤ nǫ^2. Apply Chebyshev's inequality.

132 Interpretation Consider a maximal set of points p_1,...,p_N in P_n with d(p_i,p_j) ≥ ǫ_n. Maximality implies N ≥ N(ǫ_n, P_n, d), which may be as large as e^{c_1 nǫ_n^2} under the entropy bound. The balls of radius ǫ_n/2 around the points are disjoint and hence the sum of their prior masses will be less than 1. If the prior mass were evenly distributed over these balls, then each would have no more mass than e^{−c_1 nǫ_n^2}. This is of the same order as the prior mass bound. This argument suggests that the conditions can only be satisfied for every p_0 in the model if the prior distributes its mass uniformly, at discretization level ǫ_n.

133 General observations Experiments (X^{(n)}, X^{(n)}, P^{(n)}_θ: θ ∈ Θ_n), with observations X^{(n)}, and true parameters θ_{n,0} ∈ Θ_n. d_n and e_n semi-metrics on Θ_n such that: there exist ξ, K > 0 such that for every ǫ > 0 and every θ_{n,1} ∈ Θ_n with d_n(θ_{n,1},θ_{n,0}) > ǫ, there exists a test φ_n such that P^{(n)}_{θ_{n,0}} φ_n ≤ e^{−Knǫ^2}, sup_{θ∈Θ_n: e_n(θ,θ_{n,1})<ξǫ} P^{(n)}_θ(1−φ_n) ≤ e^{−Knǫ^2}.

134 General observations rate of contraction B_{n,k}(θ_{n,0},ǫ) = {θ ∈ Θ_n: K(p^{(n)}_{θ_{n,0}}; p^{(n)}_θ) ≤ nǫ^2, V_{k,0}(p^{(n)}_{θ_{n,0}}; p^{(n)}_θ) ≤ n^{k/2} ǫ^k}. Theorem. If for arbitrary Θ_{n,1} ⊂ Θ_n and k > 1, nǫ_n^2 ≥ 1, and every j ∈ N, (i) Π_n(θ ∈ Θ_{n,1}: jǫ_n < d_n(θ,θ_{n,0}) ≤ 2jǫ_n) / Π_n(B_{n,k}(θ_{n,0},ǫ_n)) ≤ e^{Knǫ_n^2 j^2/2}, (ii) sup_{ǫ>ǫ_n} log N(ξǫ, {θ ∈ Θ_{n,1}: d_n(θ,θ_{n,0}) < 2ǫ}, e_n) ≤ nǫ_n^2, then Π_n(θ ∈ Θ_{n,1}: d_n(θ,θ_{n,0}) ≥ M_nǫ_n| X^{(n)}) → 0, in P^{(n)}_{θ_{n,0}}-probability, for every M_n → ∞. Theorem. If for arbitrary Θ_{n,2} ⊂ Θ_n and some k > 1, (iii) Π_n(Θ_{n,2}) / Π_n(B_{n,k}(θ_{n,0},ǫ_n)) = o(e^{−2nǫ_n^2}), (3) then Π_n(Θ_{n,2}| X^{(n)}) → 0, in P^{(n)}_{θ_{n,0}}-probability.

135 Gaussian process priors

136 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W t :t T) such that (W t1,...,w tk ) is multivariate normal, for every t 1,...,t k T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW t, K(s,t) = EW s W t, s,t T.

137 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W t :t T) such that (W t1,...,w tk ) is multivariate normal, for every t 1,...,t k T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW t, K(s,t) = EW s W t, s,t T. The law of a Gaussian process is a prior for a function.

138 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W_t: t ∈ T) such that (W_{t_1},...,W_{t_k}) is multivariate normal, for every t_1,...,t_k ∈ T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW_t, K(s,t) = EW_sW_t, s,t ∈ T. The law of a Gaussian process is a prior for a function. Gaussian process priors have been found useful, because they offer great variety, they are easy (?) to understand through their covariance function, and they can be computationally attractive (e.g. ...).
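
A sketch of simulating such a prior on a grid directly from its covariance function (the squared-exponential kernel and its length scale are illustrative choices): draw from the multivariate normal with covariance matrix (K(s_i, s_j)).

import numpy as np

def sample_gp_paths(grid, cov, n_paths=3, rng=None, jitter=1e-6):
    """Sample paths of a zero-mean Gaussian process with covariance function cov on a grid."""
    rng = rng or np.random.default_rng()
    K = cov(grid[:, None], grid[None, :]) + jitter * np.eye(grid.size)   # covariance matrix (+ jitter)
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((grid.size, n_paths))

grid = np.linspace(0.0, 1.0, 200)
squared_exp = lambda s, t, ell=0.2: np.exp(-((s - t) ** 2) / (2 * ell**2))  # K(s,t), length scale ell
paths = sample_gp_paths(grid, squared_exp, rng=np.random.default_rng(7))
print("one sample path at t = 0, 0.5, 1:", paths[[0, 100, -1], 0])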

139 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.

140 Brownian density estimation [Figures: a Brownian motion sample path t ↦ W_t and the corresponding prior density t ↦ c·exp(W_t).]


142 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy. Theorem. If w_0 := log p_0 ∈ C^α[0,1], then the L_2-rate is: n^{−1/4} if α ≥ 1/2; n^{−α/2} if α ≤ 1/2.

143 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy. Theorem. If w_0 := log p_0 ∈ C^α[0,1], then the L_2-rate is: n^{−1/4} if α ≥ 1/2; n^{−α/2} if α ≤ 1/2. This is optimal if and only if α = 1/2. The rate does not improve if α increases from 1/2. Consistency for any α > 0.
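
A sketch of one draw from this prior: simulate a Brownian motion path W on a grid of [0,1] (random-walk approximation) and normalize x ↦ e^{W_x}.

import numpy as np

rng = np.random.default_rng(8)
m = 1000
t = np.linspace(0.0, 1.0, m + 1)
W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(m)) * np.sqrt(1.0 / m)))  # Brownian path on the grid

density = np.exp(W)
density /= density.mean()        # Riemann approximation of the normalizing integral over [0,1]
print("prior density draw at x = 0, 0.5, 1:", density[[0, m // 2, -1]])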

144 Integrated Brownian density estimation [Figures: integrated Brownian motion sample paths and the corresponding prior densities.]

145 Integrated Brownian motion: Riemann-Liouville process α − 1/2 times integrated Brownian motion, released at 0: W_t = ∫_0^t (t−s)^{α−1/2} dB_s + ∑_{k=0}^{[α]+1} Z_k t^k [B Brownian motion, α > 0, (Z_k) iid N(0,1), fractional integral]. Theorem. IBM gives an appropriate model for α-smooth functions: consistency if w_0 ∈ C^β[0,1] for any β > 0, but the optimal rate n^{−β/(2β+1)} if and only if α = β.

146 Settings Density estimation X_1,...,X_n iid in [0,1], p_θ(x) = e^{θ(x)} / ∫_0^1 e^{θ(t)} dt. Distance on parameter: Hellinger on p_θ. Norm on W: uniform. Classification (X_1,Y_1),...,(X_n,Y_n) iid in [0,1] × {0,1}, Pr_θ(Y = 1| X = x) = 1/(1 + e^{−θ(x)}). Distance on parameter: L_2(G) on Pr_θ. (G marginal of X_i.) Norm on W: L_2(G). Regression Y_1,...,Y_n independent N(θ(x_i),σ^2), for fixed design points x_1,...,x_n. Distance on parameter: empirical L_2-distance on θ. Norm on W: empirical L_2-distance. Ergodic diffusions (X_t: t ∈ [0,n]), ergodic, recurrent: dX_t = θ(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ‖·/σ‖_{µ_0,2}). Norm on W: L_2(µ_0). (µ_0 stationary measure.)

147 Other Gaussian processes Brownian sheet. Fractional Brownian motion [figures for two values of α]. Series prior: θ(x) = ∑_i θ_i e_i(x), θ_i indep ∼ N(0,λ_i).

148 Stationary processes A stationary Gaussian field (W_t: t ∈ R^d) is characterized through a spectral measure µ, by cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} dµ(λ). [Figures: sample paths and spectral measures of the radial basis (squared-exponential) and Matérn (3/2) processes.]

149 Stationary processes radial basis Stationary Gaussian field (W_t: t ∈ R^d) characterized through cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} e^{−‖λ‖^2} dλ. Theorem. Let ŵ_0 be the Fourier transform of the true parameter w_0: [0,1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|^2 dλ < ∞, then the rate of contraction is near 1/√n. If |ŵ_0(λ)| ≍ (1+‖λ‖^2)^{−β}, then the rate is a power of 1/log n. Excellent if the truth is supersmooth; disastrous otherwise.

150 Stretching or shrinking: length scale Sample paths can be smoothed by stretching

151 Stretching or shrinking: length scale Sample paths can be smoothed by stretching and roughened by shrinking

152 Rescaled Brownian motion W t = B t/cn for B Brownian motion, and c n n (2α 1)/(2α+1) α < 1/2: c n 0 (shrink) α (1/2,1]: c n (stretch) Theorem. The prior W t = B t/cn gives optimal rate for w 0 C α [0,1], α (0,1].

153 Rescaled Brownian motion W t = B t/cn for B Brownian motion, and c n n (2α 1)/(2α+1) α < 1/2: c n 0 (shrink) α (1/2,1]: c n (stretch) Theorem. The prior W t = B t/cn gives optimal rate for w 0 C α [0,1], α (0,1]. Surprising? (Brownian motion is self-similar!)

154 Rescaled Brownian motion W_t = B_{t/c_n} for B Brownian motion, and c_n ≍ n^{(2α−1)/(2α+1)}. α < 1/2: c_n → 0 (shrink); α ∈ (1/2,1]: c_n → ∞ (stretch). Theorem. The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0,1], α ∈ (0,1]. Surprising? (Brownian motion is self-similar!) Appropriate rescaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k+1].
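
A sketch of the rescaling W_t = B_{t/c}: simulating B on [0, 1/c] and reading it off on the unit-time grid shows that shrinking (c small) roughens and stretching (c large) smooths the sample paths; the values of c below are illustrative.

import numpy as np

def rescaled_bm(c, m=2000, rng=None):
    """W_t = B_{t/c} on a uniform grid of [0,1]: simulate B on a grid of [0, 1/c] with the same m steps."""
    rng = rng or np.random.default_rng()
    T = 1.0 / c
    B = np.concatenate(([0.0], np.cumsum(rng.standard_normal(m)) * np.sqrt(T / m)))
    return B        # B at the grid of [0, T] equals W at the uniform grid of [0, 1]

rng = np.random.default_rng(9)
for c in (0.1, 1.0, 10.0):
    W = rescaled_bm(c, rng=rng)
    print("c = %5.1f   mean |increment| on [0,1]: %.4f" % (c, np.mean(np.abs(np.diff(W)))))
# small c (shrink): larger increments, rougher paths; large c (stretch): smaller increments, smoother paths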

155 Rescaled smooth stationary process A Gaussian field with infinitely-smooth sample paths is obtained with EG_sG_t = ψ(s−t), ∫ e^{‖λ‖} |ψ̂(λ)| dλ < ∞. Theorem. The prior W_t = G_{t/c_n} for c_n ≍ n^{−1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0,1], any α > 0.

156 Gaussian elements in a Banach space Definition. A Gaussian random variable in a (separable) Banach space B is a Borel measurable map W:(Ω, U,Pr) B such that b W is normally distributed for every b in the dual space B. Many Gaussian processes (W t :t T) can be viewed as a Gaussian variable in a space of functions w:t R d. EXAMPLES Brownian motion can be viewed as a map in C[0,1], equipped with the uniform norm w = sup t [0,1] w(t).

157 Gaussian elements in a Banach space Definition. A Gaussian random variable in a (separable) Banach space B is a Borel measurable map W: (Ω, U, Pr) → B such that b*(W) is normally distributed for every b* in the dual space B*. Many Gaussian processes (W_t: t ∈ T) can be viewed as a Gaussian variable in a space of functions w: T → R. EXAMPLES Brownian motion can be viewed as a map in C[0,1], equipped with the uniform norm ‖w‖ = sup_{t∈[0,1]} |w(t)|. Brownian motion is also a map in L_2[0,1], or C^{1/4}[0,1], or some Besov space.

158 RKHS definition W zero-mean Gaussian in Banach space (B, ‖·‖). S: B* → B, Sb* = E W b*(W). Definition. The reproducing kernel Hilbert space (H, ‖·‖_H) of W is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E b*_1(W) b*_2(W).

159 RKHS definition (2) W = (W t :t T) Gaussian process that can be seen as tight, Borel measurable map in l (T) = {f:t R: f := sup t f(t) < }. with covariance function K(s,t) = EW s W t. Theorem. Then RKHS is completion of the set of functions t i α i K(s i,t) relative to inner product α i K(r i, ), j i β j K(s j, ) = H i α i β j K(r i,t j ). j

160 RKHS definition (2) W = (W_t: t ∈ T) Gaussian process that can be seen as a tight, Borel measurable map in ℓ^∞(T) = {f: T → R: ‖f‖_∞ := sup_t |f(t)| < ∞}, with covariance function K(s,t) = EW_sW_t. Theorem. Then the RKHS is the completion of the set of functions t ↦ ∑_i α_i K(s_i,t) = E(∑_i α_i W_{s_i}) W_t relative to the inner product ⟨∑_i α_i K(r_i,·), ∑_j β_j K(s_j,·)⟩_H = ∑_i ∑_j α_i β_j K(r_i,s_j), i.e. all functions t ↦ h_L(t) := E L W_t, where L ∈ L_2(W), with inner product ⟨h_{L_1}, h_{L_2}⟩_H = E L_1L_2.

161 RKHS definition (3) Any Gaussian random element in a separable Banach space can be represented (in many ways, e.g. spectral decomposition) as for µ i 0 Z 1,Z 2,... i.i.d. N(0,1) e 1 = e 2 = = 1 W = µ i Z i e i i=1

162 RKHS definition (3) Any Gaussian random element in a separable Banach space can be represented (in many ways, e.g. spectral decomposition) as W = ∑_{i=1}^∞ µ_i Z_i e_i, for µ_i ≥ 0, Z_1,Z_2,... i.i.d. N(0,1), ‖e_1‖ = ‖e_2‖ = ··· = 1. Theorem. The RKHS consists of all elements h := ∑_i h_i e_i with ‖h‖^2_H := ∑_i h_i^2/µ_i^2 < ∞.

163 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is { f:f (k+1) L 2 [0,1],f(0) = = f (k) (0) = 0 }, f H = f (k+1) 2.

164 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is { f:f (k+1) L 2 [0,1],f(0) = = f (k) (0) = 0 }, f H = f (k+1) 2. Proof. For k = 0: EW s W t = s t = t 0 1l [0,s]dλ. The set of all linear combinations i α i1l [0,si ] is dense in L 2 [0,1]. For k > 0: use the general result that the RKHS is equivariant under continous linear transformations, like integration.

165 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is {f: f^{(k+1)} ∈ L_2[0,1], f(0) = ··· = f^{(k)}(0) = 0}, ‖f‖_H = ‖f^{(k+1)}‖_2. Proof. For k = 0: EW_sW_t = s ∧ t = ∫_0^t 1l_{[0,s]} dλ. The set of all linear combinations ∑_i α_i 1l_{[0,s_i]} is dense in L_2[0,1]. For k > 0: use the general result that the RKHS is equivariant under continuous linear transformations, like integration. Theorem. The RKHS of the sum of k times IBM and t ↦ ∑_{i=0}^k Z_i t^i is {f: f^{(k+1)} ∈ L_2[0,1]}, ‖f‖^2_H = ‖f^{(k+1)}‖_2^2 + ∑_{i=0}^k f^{(i)}(0)^2.

166 Example stationary processes A stationary Gaussian process is characterized through a spectral measure µ, by cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} dµ(λ). Theorem. The RKHS of (W_t: t ∈ T) is the set of real parts of the functions t ↦ ∫ e^{iλ^T t} ψ(λ) dµ(λ), ψ ∈ L_2(µ), with RKHS-norm ‖h‖_H = inf{‖ψ‖_2: h_ψ = h}. If the interior of T is nonempty and ∫ e^{‖λ‖} µ(dλ) < ∞, then ψ is unique and ‖h‖_H = ‖ψ‖_2. Proof. EW_sW_t = ⟨e_s, e_t⟩_{2,µ}, e_s(λ) = e^{iλ^T s}.

167 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ).

168 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2.

169 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2. α 1/2 times integrated BM: φ 0 (ǫ) (1/ǫ) 1/α.

170 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2. α 1/2 times integrated BM: φ 0 (ǫ) (1/ǫ) 1/α. Radial basis: φ 0 (ǫ) ( log(1/ǫ) ) 1+d.
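
A crude Monte Carlo sketch of the Brownian-motion example: estimate Pr(‖W‖_∞ < ǫ) from simulated paths and compare −log of it with the (1/ǫ)^2 order (grid size, replications and the values of ǫ are illustrative; the estimate degrades quickly as ǫ decreases).

import numpy as np

rng = np.random.default_rng(10)
m, reps = 500, 20000
paths = np.cumsum(rng.standard_normal((reps, m)) * np.sqrt(1.0 / m), axis=1)  # Brownian motion on a grid
sup_norm = np.max(np.abs(paths), axis=1)

for eps in (1.0, 0.7, 0.5):
    p = np.mean(sup_norm < eps)
    print("eps = %.1f   -log Pr(||W|| < eps) ~ %.2f   (1/eps)^2 = %.2f" % (eps, -np.log(p), 1.0 / eps**2))
# the ratio of the two printed numbers stabilizes as eps decreases (around pi^2/8 for the uniform norm)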

171 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ‖·‖) is Pr(‖W‖ < ǫ), and the small ball exponent is φ_0(ǫ) = −log Pr(‖W‖ < ǫ). Small ball probabilities can be computed either by probabilistic arguments, or analytically from the RKHS. Theorem. φ_0(ǫ) ≍ log N(ǫ/√φ_0(ǫ), H_1, ‖·‖). EXAMPLE The RKHS of Brownian motion is the Sobolev space of first order. Its unit ball has entropy ≍ 1/ǫ for the uniform norm, so φ_0(ǫ) ≍ √φ_0(ǫ)/ǫ, i.e. φ_0(ǫ) ≍ (1/ǫ)^2.

172 Posterior contraction rates for Gaussian priors Prior W is centered Gaussian map in Banach space (B, ) with RKHS (H, H ) and small ball exponent φ 0 (ǫ) = logπ( W < ǫ). Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ǫ n if φ 0 (ǫ n ) nǫ n 2 AND inf h 2 H nǫ n 2. h H: h w 0 <ǫ n

173 Posterior contraction rates for Gaussian priors Prior W is a centered Gaussian map in a Banach space (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ǫ) = −log Π(‖W‖ < ǫ). Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ǫ_n if φ_0(ǫ_n) ≤ nǫ_n^2 AND inf_{h∈H: ‖h−w_0‖<ǫ_n} ‖h‖^2_H ≤ nǫ_n^2. Both inequalities give a lower bound on ǫ_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ǫ_n ≍ 1/√n.

174 Density estimation As prior on density p use p W for: p w (x) = ew x 1 0 ew t dt.

175 Density estimation As prior on density p use p_W for: p_w(x) = e^{w(x)} / ∫_0^1 e^{w(t)} dt. Lemma. For all v,w: h(p_v,p_w) ≤ ‖v−w‖_∞ e^{‖v−w‖_∞/2}, K(p_v;p_w) ≲ ‖v−w‖_∞^2 e^{‖v−w‖_∞}(1+‖v−w‖_∞), V(p_v;p_w) ≲ ‖v−w‖_∞^2 e^{‖v−w‖_∞}(1+‖v−w‖_∞)^2.

176 Settings
Density estimation: X_1, ..., X_n iid in [0,1], p_θ(x) = e^{θ(x)} / ∫_0^1 e^{θ(t)} dt. Distance on parameter: Hellinger on p_θ. Norm on W: uniform.
Classification: (X_1,Y_1), ..., (X_n,Y_n) iid in [0,1] × {0,1}, Pr_θ(Y = 1 | X = x) = 1/(1 + e^{−θ(x)}). Distance on parameter: L_2(G) on Pr_θ (G the marginal of X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(θ(x_i), σ^2), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on θ. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t : t ∈ [0,n]), ergodic, recurrent, dX_t = θ(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ||·/σ||_{µ_0,2}). Norm on W: L_2(µ_0). (µ_0 the stationary measure.)

177 Brownian motion: rate calculation
Small ball probability: φ_0(ε) ≍ (1/ε)^2; solving (1/ε)^2 ≤ n ε^2 gives ε_n ≍ n^{−1/4}.
Approximation: if w_0 ∈ C^β[0,1], β ≤ 1,
  inf{ ||h||_H^2 : h ∈ H, ||h − w_0||_∞ < ε } ≲ ε^{−(2−2β)/β}
(attained for h = w_0 * φ_σ with σ ≍ ε^{1/β}); solving ε^{−(2−2β)/β} ≤ n ε^2 gives ε_n ≍ n^{−β/2}.
The contraction rate is the slowest of the two rates.
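A small sympy check of the exponent algebra above (not in the slides; it treats the ≍ relations as equalities and solves on the log scale).

import sympy as sp

beta = sp.symbols('beta', positive=True)
log_n, log_eps = sp.symbols('log_n log_eps')

# Small-ball condition: (1/eps)^2 = n * eps^2, i.e. -2*log(eps) = log(n) + 2*log(eps).
sol1 = sp.solve(sp.Eq(-2 * log_eps, log_n + 2 * log_eps), log_eps)[0]
print(sp.simplify(sol1 / log_n))  # -1/4, i.e. eps_n = n^(-1/4)

# Approximation condition for w_0 in C^beta: eps^(-(2-2*beta)/beta) = n * eps^2.
sol2 = sp.solve(sp.Eq(-(2 - 2 * beta) / beta * log_eps, log_n + 2 * log_eps), log_eps)[0]
print(sp.simplify(sol2 / log_n))  # -beta/2, i.e. eps_n = n^(-beta/2)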

178 Example radial basis stationary process
Small ball probability: φ_0(ε) ≍ (log(1/ε))^2; solving (log(1/ε))^2 ≤ n ε^2 gives ε_n ≍ n^{−1/2}(log n)^2.
Approximation: since dµ(λ) = e^{−λ^2} dλ,
  w_0(t) = ∫ e^{itλ} ŵ_0(λ) dλ = ∫ e^{itλ} ŵ_0(λ) e^{λ^2} dµ(λ).
If the function ŵ_0(λ) e^{λ^2} is in L_2(µ), then w_0 ∈ H. Otherwise approximate it by ψ(λ) = ŵ_0(λ) e^{λ^2} 1{|λ| ≤ M}, and optimize over M.
The contraction rate is the slowest of the two rates, typically the second.

179 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.

180 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.
Proof. It suffices to find sets B_n ⊂ B with
  log N(ε_n, B_n, ||·||) ≤ n ε_n^2    (complexity)
  Π_n(B_n) = 1 − o(e^{−n ε_n^2})    (remaining mass)
  Π_n(w : ||w − w_0|| < ε_n) ≥ e^{−n ε_n^2}    (prior mass)

181 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.
Proof. It suffices to find sets B_n ⊂ B with
  log N(ε_n, B_n, ||·||) ≤ n ε_n^2    (complexity)
  Π_n(B_n) = 1 − o(e^{−n ε_n^2})    (remaining mass)
  Π_n(w : ||w − w_0|| < ε_n) ≥ e^{−n ε_n^2}    (prior mass)
Take B_n = M_n H_1 + ε_n B_1 for an appropriate M_n.

182 Prior mass = decentered small ball probability
W a centered Gaussian map in (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε).
  φ_{w_0}(ε) := φ_0(ε) + (1/2) inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε }.

183 Prior mass = decentered small ball probability
W a centered Gaussian map in (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε).
  φ_{w_0}(ε) := φ_0(ε) + (1/2) inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε }.
Theorem. Pr(||W − w_0|| < 2ε) ≥ e^{−φ_{w_0}(ε)}.

184 Prior mass = decentered small ball probability: proof
Proof. (Sketch) For h ∈ H the distribution of W + h is absolutely continuous relative to that of W, and
  Pr(||W − h|| < ε) = E e^{Uh − (1/2)||h||_H^2} 1{||W|| < ε},
where Uh is the zero-mean Gaussian variable with variance ||h||_H^2 associated with h. The left side does not change if −h replaces h. Take the average:
  Pr(||W − h|| < ε) = E (1/2)(e^{Uh} + e^{−Uh}) e^{−(1/2)||h||_H^2} 1{||W|| < ε} ≥ e^{−(1/2)||h||_H^2} Pr(||W|| < ε).

185 Prior mass = decentered small ball probability: proof
Proof. (Sketch) For h ∈ H the distribution of W + h is absolutely continuous relative to that of W, and
  Pr(||W − h|| < ε) = E e^{Uh − (1/2)||h||_H^2} 1{||W|| < ε},
where Uh is the zero-mean Gaussian variable with variance ||h||_H^2 associated with h. The left side does not change if −h replaces h. Take the average:
  Pr(||W − h|| < ε) = E (1/2)(e^{Uh} + e^{−Uh}) e^{−(1/2)||h||_H^2} 1{||W|| < ε} ≥ e^{−(1/2)||h||_H^2} Pr(||W|| < ε).
For a general w_0: if h ∈ H with ||w_0 − h|| < ε, then ||W − h|| < ε implies ||W − w_0|| < 2ε.

186 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).

187 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).
Theorem (Borell '75). For H_1 and B_1 the unit balls of the RKHS and of B,
  Pr(W ∉ M H_1 + ε B_1) ≤ 1 − Φ( Φ^{−1}(e^{−φ_0(ε)}) + M ).

188 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).
Theorem (Borell '75). For H_1 and B_1 the unit balls of the RKHS and of B,
  Pr(W ∉ M H_1 + ε B_1) ≤ 1 − Φ( Φ^{−1}(e^{−φ_0(ε)}) + M ).
Corollary. For M(W) a median of ||W|| and σ^2(W) = sup_{||b*|| ≤ 1} var b*(W),
  Pr(||W|| ≥ M(W) + x) ≤ 1 − Φ(x/σ(W)) ≤ e^{−x^2/(2σ^2(W))}.
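Not from the slides: a Monte Carlo sanity check of the corollary for Brownian motion on [0,1] with the uniform norm, where σ(W) = 1 (the variance of b*(W) over the dual unit ball is maximal for evaluation at t = 1); sample sizes and grid are arbitrary choices.

import numpy as np
from math import erf, sqrt

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(2)
n_paths, n_grid = 40_000, 250
incr = rng.normal(scale=np.sqrt(1.0 / n_grid), size=(n_paths, n_grid))
sup_norm = np.abs(np.cumsum(incr, axis=1)).max(axis=1)  # ||W|| for the uniform norm

median = np.median(sup_norm)  # empirical estimate of M(W)
sigma = 1.0                   # sd of W_1, the maximal-variance linear functional here
for x in [0.5, 1.0, 1.5, 2.0]:
    empirical = float(np.mean(sup_norm >= median + x))
    print(x, empirical, "<=", 1.0 - Phi(x / sigma))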

189 Adaptation
Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by adapting the prior to the data by
hierarchical Bayes: putting a prior on the regularity, or on a scaling;
empirical Bayes: using a regularity or scaling determined by maximum likelihood on the marginal distribution of the data.
The first is known to work in some generality. For the second there are some, but not many, results.

190 Adaptation by random scaling: example
Choose A^d from a Gamma distribution. Choose (G_t : t ∈ R_+^d) a radial basis stationary Gaussian process. Set W_t = G_{At}.
Theorem.
If w_0 ∈ C^β[0,1]^d, then the rate of contraction is nearly n^{−β/(2β+d)}.
If w_0 is supersmooth, then the rate is nearly n^{−1/2}.
Proof. Use the basic contraction theorem (and careful estimates).
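A sketch, not from the slides, of one draw from this randomly rescaled prior for d = 1 (so A itself is Gamma distributed); the Gamma parameters, grid size and jitter are arbitrary choices.

import numpy as np

def scaled_squared_exp_path(n_grid=200, a_shape=1.0, a_rate=1.0, seed=3):
    # Draw W_t = G_{A t} on [0,1]: G a squared-exponential ("radial basis") GP,
    # A ~ Gamma (the case d = 1 of A^d ~ Gamma). Discretized illustration only.
    rng = np.random.default_rng(seed)
    A = rng.gamma(shape=a_shape, scale=1.0 / a_rate)
    t = np.linspace(0.0, 1.0, n_grid)
    s = A * t                                        # rescaled time points
    cov = np.exp(-np.subtract.outer(s, s) ** 2)      # squared-exponential kernel
    cov += 1e-8 * np.eye(n_grid)                     # jitter for numerical stability
    w = rng.multivariate_normal(np.zeros(n_grid), cov)
    return t, w, A

t, w, A = scaled_squared_exp_path()
print("scaling A =", A, "path range:", w.min(), w.max())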

191 Acknowledgement Harry van Zanten

192 Dirichlet mixtures

193 Acknowledgement Subhashis Ghosal
