Lectures on Nonparametric Bayesian Statistics


1 Lectures on Nonparametric Bayesian Statistics Aad van der Vaart Universiteit Leiden, Netherlands Bad Belzig, March 2013

2 Contents Introduction Dirichlet process Consistency and rates Gaussian process priors Dirichlet mixtures All the rest

3 Introduction

4 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x).

5 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). We assume whatever is needed (e.g. Θ Polish and Π a probability distribution on its Borel σ-field; Polish sample space) to make this well defined.

6 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives Π(Θ ∈ B| X = x) = ∫_B p_θ(x) dΠ(θ) / ∫_Θ p_θ(x) dΠ(θ).

7 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X,Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(Θ ∈ B| X = x). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives dΠ(θ| X) ∝ p_θ(X) dΠ(θ).

8 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). Pr(a ≤ Θ ≤ b) = b − a, 0 < a < b < 1, Pr(X = x| Θ = θ) = (n choose x) θ^x (1−θ)^{n−x}, x = 0,1,...,n, Pr(a ≤ Θ ≤ b| X = x) = ∫_a^b θ^x (1−θ)^{n−x} dθ / B(x+1, n−x+1).

9 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). Pr(a ≤ Θ ≤ b) = b − a, 0 < a < b < 1, Pr(X = x| Θ = θ) = (n choose x) θ^x (1−θ)^{n−x}, x = 0,1,...,n, dΠ(θ| X) ∝ θ^X (1−θ)^{n−X} · 1.

10 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n,θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). [Figure: posterior density for n = 20.]
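
A minimal numerical sketch of Bayes's computation (my illustrative example, not from the slides; it assumes numpy and scipy are available): under the uniform prior the posterior of Θ given X = x is Beta(x+1, n−x+1), and interval probabilities follow from its cdf.

import numpy as np
from scipy.stats import beta

n, x = 20, 15                       # illustrative data: x successes in n Bernoulli(theta) trials
posterior = beta(x + 1, n - x + 1)  # uniform prior => posterior Beta(x+1, n-x+1)

a, b = 0.5, 0.9                     # Pr(a <= Theta <= b | X = x) via the Beta cdf
print("posterior mean:", posterior.mean())
print("Pr(a <= Theta <= b | X = x):", posterior.cdf(b) - posterior.cdf(a))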

13 Parametric Bayes Pierre-Simon Laplace (1749-1827) rediscovered Bayes's argument and applied it to general parametric models: models smoothly indexed by a Euclidean parameter θ. For instance, the linear regression model, where one observes (x_1,Y_1),...,(x_n,Y_n) following Y_i = θ_0 + θ_1 x_i + e_i, for e_1,...,e_n independent normal errors with zero mean.

14 Nonparametric Bayes If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes's formula does not change: dΠ(θ| X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions.


20 Subjectivism A philosophical Bayesian statistician views the prior distribution as an expression of his personal beliefs on the state of the world, before gathering the data. After seeing the data he updates his beliefs into the posterior distribution. Most scientists do not like dependence on subjective priors. One can opt for objective or noninformative priors. One can also mathematically study the role of the prior, and hope to find that it is small.

21 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X.

22 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X. Asymptotic setting: data X n where the information increases as n. We like the posterior Π n ( X n ) to contract to {θ 0 }, at a good rate.

23 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ 0 and consider the posterior Π(θ X) as a random measure on the parameter set dependent on X. We like Π(θ X) to put most of its mass near θ 0 for most X. Asymptotic setting: data X n where the information increases as n. We like the posterior Π n ( X n ) to contract to {θ 0 }, at a good rate. Desirable properties: Consistency + rate Adaptation Distributional approximations Uncertainty quantification

24 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(θ ∈ ·| X) as a random measure on the parameter set dependent on X. We like Π(·| X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n where the information increases as n → ∞. We like the posterior Π_n(·| X^n) to contract to {θ_0}, at a good rate. Desirable properties: Consistency + rate Adaptation Distributional approximations Uncertainty quantification We assume that P_{θ_0} ≪ ∫ P_θ dΠ(θ) to make these questions well posed.

25 Parametric models Suppose the data are a random sample X_1,...,X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem (Laplace, Bernstein, von Mises, Le Cam 1989). Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(·| X_1,...,X_n) − N_d(θ̂_n, n^{−1} I^{−1}_{θ_0})‖ → 0. Here θ̂_n is any efficient estimator of θ. [Figures: posterior densities for increasing n.]

26 Parametric models Suppose the data are a random sample X_1,...,X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem (Laplace, Bernstein, von Mises, Le Cam 1989). Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(·| X_1,...,X_n) − N_d(θ̂_n, n^{−1} I^{−1}_{θ_0})‖ → 0. Here θ̂_n is any efficient estimator of θ. In particular, the posterior distribution concentrates most of its mass on balls of radius O(1/√n) around θ_0, and a central set of posterior probability 95% is equivalent to the usual Wald confidence set. The prior washes out completely.
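
A small simulation sketch of the Bernstein-von Mises phenomenon in the Bernoulli model (the model, prior and sample size are my illustrative choices): the Beta posterior is compared with the normal approximation N_d(θ̂_n, n^{−1} I^{−1}_{θ_0}), here with d = 1, θ̂_n the MLE and I_θ = 1/(θ(1−θ)).

import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(0)
theta0, n = 0.3, 500
x = rng.binomial(n, theta0)                      # data: number of successes
theta_hat = x / n                                # efficient estimator (MLE)
sd = np.sqrt(theta_hat * (1 - theta_hat) / n)    # sqrt of I^{-1}/n, estimated at the MLE

posterior = beta(x + 1, n - x + 1)               # exact posterior under the uniform prior
approx = norm(theta_hat, sd)                     # Bernstein-von Mises normal approximation

grid = np.linspace(1e-3, 1 - 1e-3, 4000)         # compare the densities on a grid
tv = 0.5 * np.sum(np.abs(posterior.pdf(grid) - approx.pdf(grid))) * (grid[1] - grid[0])
print("approximate total variation distance:", tv)   # small, and -> 0 as n grows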

27 Support Definition. The support of a prior Π is the smallest closed set F with Π(F) = 1. In nonparametrics we like priors with big (or even full) support, equal to an infinite-dimensional set. Full support means that every open set has positive (prior) probability.

28 Support Definition. The support of a prior Π is the smallest closed set F with Π(F) = 1. In nonparametrics we like priors with big (or even full) support, equal to an infinite-dimensional set. Full support means that every open set has positive (prior) probability. Support depends on the topology. It is well defined, e.g. if the parameter space is Polish.

29 Dirichlet process

30 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ).

31 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.)

32 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.) The law of P is a prior on (M, M).

33 Random measures M: all probability measures on (Polish) sample space (X, X ). M : σ-field generated by all maps M M(A), A X. Lemma. M is also the Borel σ-field on M equipped with the weak topology ( of convergence in distribution ). Definition. A random probability measure on (X, X ) is a map P:(Ω, U,Pr) M such that P(A) is a random variable for every A X. (Equivalently, a Borel measurable map in M.) The law of P is a prior on (M, M). Definition. The mean measure of P is the measure A EP(A).

34 Discrete random measures W_1,W_2,... nonnegative variables with ∑_{i=1}^∞ W_i = 1, independent of θ_1,θ_2,... iid ∼ G, random variables with values in X. P = ∑_{i=1}^∞ W_i δ_{θ_i}.

35 Discrete random measures W 1,W 2,... nonnegative variables with i=1 W i = 1, independent of θ 1,θ 2,... iid G, random variables with values in X. P = W i δ θi. i=1 Lemma. If (W 1,...,W n ) has (full) support the unit simplex for every n and the law of θ 1 has (full) support X, then P has full support M relative to the weak topology.

36 Discrete random measures W_1,W_2,... nonnegative variables with ∑_{i=1}^∞ W_i = 1, independent of θ_1,θ_2,... iid ∼ G, random variables with values in X. P = ∑_{i=1}^∞ W_i δ_{θ_i}. Lemma. If (W_1,...,W_n) has (full) support the unit simplex for every n and the law of θ_1 has (full) support X, then P has full support M relative to the weak topology. Proof. Finitely discrete distributions are weakly dense in M. It suffices to show that P gives positive probability to any weak neighbourhood of a distribution of the form P* = ∑_{i=1}^k w*_i δ_{θ*_i}. The set {∑_{i>k} W_i < ǫ, max_{i≤k} |W_i − w*_i| ∨ d(θ_i, θ*_i) < ǫ} is open and hence has positive probability.


38 Stick breaking Given i.i.d. Y_1,Y_2,... in [0,1], W_1 = Y_1, W_2 = (1−Y_1)Y_2, W_3 = (1−Y_1)(1−Y_2)Y_3, ...

39 Stick breaking Given i.i.d. Y 1,Y 2,... in [0,1], W 1 = Y 1,W 2 = (1 Y 1 )Y 2,W 3 = (1 Y 1 )(1 Y 2 )Y 3,... Lemma. (W 1,W 2,...) is a random probability measure on N iff P(Y 1 > 0) > 0, and has full support if Y 1 has support [0,1].

40 Stick breaking Given i.i.d. Y 1,Y 2,... in [0,1], W 1 = Y 1,W 2 = (1 Y 1 )Y 2,W 3 = (1 Y 1 )(1 Y 2 )Y 3,... Lemma. (W 1,W 2,...) is a random probability measure on N iff P(Y 1 > 0) > 0, and has full support if Y 1 has support [0,1]. Proof. The remaining length of the stick at stage j is 1 j l=1 W l = j l=1 (1 Y l), and tends to zero a.s. iff j l=1 (1 EY l) 0. (W 1,...,W n ) is a continuous function of (Y 1,...,Y n ) with full range, for every k.

41 Stick breaking Given i.i.d. Y_1,Y_2,... in [0,1], W_1 = Y_1, W_2 = (1−Y_1)Y_2, W_3 = (1−Y_1)(1−Y_2)Y_3, ... Lemma. (W_1,W_2,...) is a random probability measure on N iff P(Y_1 > 0) > 0, and has full support if Y_1 has support [0,1]. Proof. The remaining length of the stick at stage j is 1 − ∑_{l=1}^j W_l = ∏_{l=1}^j (1−Y_l), and tends to zero a.s. iff ∏_{l=1}^j (1−EY_l) → 0. (W_1,...,W_k) is a continuous function of (Y_1,...,Y_k) with full range, for every k. EXAMPLE OF PARTICULAR INTEREST: Y_1,Y_2,... iid ∼ Be(1,M).

42 Random measures as stochastic processes A random measure P induces the distributions on R k of the random vectors ( P(A1 ),...,P(A k ) ), A 1,...,A k X.

43 Random measures as stochastic processes A random measure P induces the distributions on R k of the random vectors ( P(A1 ),...,P(A k ) ), A 1,...,A k X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process ( P(A):A X ) with these distributions (e.g. by Kolmogorov s consistency theorem).

44 Random measures as stochastic processes A random measure P induces the distributions on R^k of the random vectors (P(A_1),...,P(A_k)), A_1,...,A_k ∈ X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process (P(A): A ∈ X) with these distributions (e.g. by Kolmogorov's consistency theorem). It will be true that (i). P(∅) = 0, P(X) = 1, a.s. (ii). P(A_1 ∪ A_2) = P(A_1) + P(A_2), a.s., for any disjoint A_1, A_2.

45 Random measures as stochastic processes A random measure P induces the distributions on R^k of the random vectors (P(A_1),...,P(A_k)), A_1,...,A_k ∈ X. Conversely suppose we want a measure with particular distributions, and can construct a stochastic process (P(A): A ∈ X) with these distributions (e.g. by Kolmogorov's consistency theorem). It will be true that (i). P(∅) = 0, P(X) = 1, a.s. (ii). P(A_1 ∪ A_2) = P(A_1) + P(A_2), a.s., for any disjoint A_1, A_2. However, we do not automatically have that P is a random measure. Theorem. If (P(A): A ∈ X) is a stochastic process that satisfies (i) and (ii) and whose mean A ↦ EP(A) is a Borel measure on X, then there exists a version of P that is a random measure on (X, X).

46 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}.

47 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}. We extend to α_i = 0 for one or more i on the understanding that X_i = 0.

48 Finite-dimensional Dirichlet distribution Definition. (X_1,...,X_k) possesses a Dirichlet(k; α_1,...,α_k) distribution, for α_i > 0, if it has (Lebesgue) density on the unit simplex proportional to x ↦ x_1^{α_1−1} ··· x_k^{α_k−1}. We extend to α_i = 0 for one or more i on the understanding that X_i = 0. EXAMPLES For k = 2 we have X_1 ∼ Be(α_1,α_2) and X_2 = 1 − X_1 ∼ Be(α_2,α_1). The Dir(k;1,...,1)-distribution is the uniform distribution on the simplex.

49 Dirichlet distribution properties Proposition (Gamma representation). If Y_i ind∼ Ga(α_i,1), then (Y_1/Y,...,Y_k/Y) ∼ Dir(k;α_1,...,α_k) and is independent of Y := ∑_{i=1}^k Y_i. Proposition (Aggregation). If X ∼ Dir(k;α_1,...,α_k) and Z_j = ∑_{i∈I_j} X_i for a given partition I_1,...,I_m of {1,...,k}, then (i). (Z_1,...,Z_m) ∼ Dir(m;β_1,...,β_m), where β_j = ∑_{i∈I_j} α_i. (ii). (X_i/Z_j: i ∈ I_j) ind∼ Dir(#I_j; α_i, i ∈ I_j), for j = 1,...,m. (iii). (Z_1,...,Z_m) and (X_i/Z_j: i ∈ I_j, j = 1,...,m) are independent. Conversely, if X is a random vector such that (i)-(iii) hold, for a given partition I_1,...,I_m and Z_j = ∑_{i∈I_j} X_i, then X ∼ Dir(k;α_1,...,α_k). Proposition. E(X_i) = α_i/|α| and var(X_i) = α_i(|α| − α_i)/(|α|^2(|α|+1)), for |α| = ∑_{i=1}^k α_i. Proposition (Conjugacy). If p ∼ Dir(k;α) and N| p ∼ MN(n,k;p), then p| N ∼ Dir(k;α+N).
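
A quick numerical check of the Gamma representation and the moment formulas (a sketch; the parameter values are arbitrary): normalized independent Ga(α_i, 1) variables are Dir(k; α_1,...,α_k), so their empirical moments should match E(X_i) = α_i/|α| and var(X_i) = α_i(|α| − α_i)/(|α|^2(|α|+1)).

import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])                            # arbitrary illustrative parameters
m = 100_000

Y = rng.gamma(shape=alpha, scale=1.0, size=(m, alpha.size))  # Y_i ~ Ga(alpha_i, 1), independent
X = Y / Y.sum(axis=1, keepdims=True)                         # (Y_1/Y,...,Y_k/Y) ~ Dir(k; alpha)

a = alpha.sum()
print("empirical means:", X.mean(axis=0), " theory:", alpha / a)
print("empirical vars: ", X.var(axis=0), " theory:", alpha * (a - alpha) / (a**2 * (a + 1)))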

50 Dirichlet process Definition. A random measure P on (X, X) is a Dirichlet process with base measure α, if for every partition A_1,...,A_k of X, (P(A_1),...,P(A_k)) ∼ Dir(k; α(A_1),...,α(A_k)). We write P ∼ DP(α), |α| := α(X) and ᾱ := α/|α|.

51 Dirichlet process Definition. A random measure P on (X, X) is a Dirichlet process with base measure α, if for every partition A_1,...,A_k of X, (P(A_1),...,P(A_k)) ∼ Dir(k; α(A_1),...,α(A_k)). We write P ∼ DP(α), |α| := α(X) and ᾱ := α/|α|. EP(A) = ᾱ(A), var P(A) = ᾱ(A)ᾱ(A^c)/(1+|α|).


53 Dirichlet process existence Theorem. For any Borel measure α the Dirichlet process exists as a Borel measure on M.

54 Dirichlet process existence Theorem. For any Borel measure α the Dirichlet process exists as a Borel measure on M. Proof. An arbitrary collection of sets A_1,...,A_k can be partitioned into 2^k atoms B_j = A*_1 ∩ A*_2 ∩ ··· ∩ A*_k, where A* stands for A or A^c. The distribution of (P(B_j): j = 1,...,2^k) must be Dirichlet. Define the distribution of (P(A_1),...,P(A_k)) corresponding to the fact that each P(A_i) must be a sum of some set of P(B_j). Using properties of finite-dimensional Dirichlets, check that this is consistent in the sense of Kolmogorov, so that a version of the stochastic process (P(A): A ∈ X) exists. Apply the general theorem on existence of random measures.

55 Sethuraman representation Theorem. If θ_1,θ_2,... iid ∼ ᾱ and Y_1,Y_2,... iid ∼ Be(1,M) are independent random variables and W_j = Y_j ∏_{l=1}^{j−1} (1−Y_l), then ∑_{j=1}^∞ W_j δ_{θ_j} ∼ DP(Mᾱ).

56 Sethuraman representation Theorem. If θ_1,θ_2,... iid ∼ ᾱ and Y_1,Y_2,... iid ∼ Be(1,M) are independent random variables and W_j = Y_j ∏_{l=1}^{j−1} (1−Y_l), then ∑_{j=1}^∞ W_j δ_{θ_j} ∼ DP(Mᾱ). Proof. P := W_1 δ_{θ_1} + ∑_{j=2}^∞ W_j δ_{θ_j} = Y_1 δ_{θ_1} + (1−Y_1)P′, where P′ = ∑_{j=2}^∞ (Y_j ∏_{l=2}^{j−1}(1−Y_l)) δ_{θ_j}. Hence Q = (P(A_1),...,P(A_k)) and N = (δ_{θ_1}(A_1),...,δ_{θ_1}(A_k)) satisfy Q =_d Y N + (1−Y)Q. For given Y ∼ Be(1,M) and independent θ_1 ∼ ᾱ there is at most one solution in distribution Q. A Dirichlet vector Q is a solution. The second follows by properties of the Dirichlet (not obvious!). The first: see next slide.

57 Sethuraman representation Proof. (Continued) Q =_d Y N + (1−Y)Q. Given i.i.d. copies (Y_n, N_n) and given independent solutions Q and Q′: Q_0 = Q, Q′_0 = Q′, Q_n = Y_n N_n + (1−Y_n)Q_{n−1}, Q′_n = Y_n N_n + (1−Y_n)Q′_{n−1}. Then Q_n =_d Q and Q′_n =_d Q′ for every n, and ‖Q_n − Q′_n‖ = |1−Y_n| ‖Q_{n−1} − Q′_{n−1}‖ = ∏_{i=1}^n |1−Y_i| ‖Q_0 − Q′_0‖ → 0. Hence Q =_d Q′.
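
A sketch of drawing (approximately) from DP(Mᾱ) via the Sethuraman representation, truncating the stick-breaking series at a finite number of atoms (the truncation level, M and the standard normal base measure ᾱ are illustrative choices):

import numpy as np

def dp_draw(M, base_sampler, n_atoms=2000, rng=None):
    """Truncated Sethuraman draw: weights by stick breaking with Be(1, M), atoms iid from abar."""
    rng = rng or np.random.default_rng()
    Y = rng.beta(1.0, M, size=n_atoms)                            # stick-breaking proportions
    W = Y * np.cumprod(np.concatenate(([1.0], 1.0 - Y[:-1])))     # W_j = Y_j prod_{l<j}(1 - Y_l)
    theta = base_sampler(n_atoms, rng)                            # atoms theta_j iid ~ abar
    return W, theta

rng = np.random.default_rng(2)
W, theta = dp_draw(M=5.0, base_sampler=lambda k, r: r.standard_normal(k), rng=rng)
print("weight captured by the truncation:", W.sum())              # close to 1
print("P((-inf, 0]) for this draw:", W[theta <= 0.0].sum())       # random; its mean is abar((-inf,0]) = 0.5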

58 Tail-free processes Let X = A_0 ∪ A_1 = (A_00 ∪ A_01) ∪ (A_10 ∪ A_11) = ··· be nested partitions, rich enough that they generate the Borel σ-field. [Figure: binary tree of the sets A_ε with splitting variables V_ε.]

59 Tail-free processes Let X = A_0 ∪ A_1 = (A_00 ∪ A_01) ∪ (A_10 ∪ A_11) = ··· be nested partitions, rich enough that they generate the Borel σ-field. [Figure: binary tree of the sets A_ε with splitting variables V_ε.] Splitting variables: V_{ε0} = P(A_{ε0}| A_ε), and V_{ε1} = P(A_{ε1}| A_ε). P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m.

60 Tail-free processes (2) P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. (1) Theorem. Suppose A ε = {A εδ :A εδ compact,a εδ A ε } (V ε :ε E ) stochastic process with 0 V ε 1 and V ε0 +V ε1 = 1. There is a Borel measure with µ(a ε ):= EV ε1 V ε1 ε 2 V ε1 ε m. Then there exists a random Borel measure P satisfying (1)

61 Tail-free processes (2) P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m. (1) Theorem. Suppose A_ε = ∪{A_εδ: A_εδ compact, A_εδ ⊂ A_ε}, and let (V_ε: ε ∈ E*) be a stochastic process with 0 ≤ V_ε ≤ 1 and V_{ε0} + V_{ε1} = 1 such that there is a Borel measure µ with µ(A_ε) := E V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}. Then there exists a random Borel measure P satisfying (1). SPECIAL CASE: Polya tree prior: all V_ε independent Beta variables.
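
A sketch of the special case just mentioned: a Polya tree on [0,1] with dyadic partitions and independent splitting variables V_{ε0} ∼ Be(a_m, a_m) at level m (the choice a_m = m^2 is illustrative). The code computes the random masses P(A_ε) of the level-m dyadic intervals via the product formula (1).

import numpy as np

def polya_tree_masses(levels, a=lambda m: m**2, rng=None):
    """Random masses P(A_eps) of the 2**levels dyadic subintervals of [0,1]."""
    rng = rng or np.random.default_rng()
    masses = np.array([1.0])
    for m in range(1, levels + 1):
        V0 = rng.beta(a(m), a(m), size=masses.size)   # V_{eps0} ~ Be(a_m, a_m), independent
        # A_eps splits its mass into V_{eps0} P(A_eps) and (1 - V_{eps0}) P(A_eps)
        masses = np.column_stack((masses * V0, masses * (1.0 - V0))).ravel()
    return masses

rng = np.random.default_rng(3)
masses = polya_tree_masses(levels=8, rng=rng)
print("number of intervals:", masses.size, " total mass:", masses.sum())
print("random P([0, 1/2]):", masses[:masses.size // 2].sum())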

62 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m.

63 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. Notation: U V means U and V are independent U V Z means U and V are conditionally independent given Z.

64 Tail-free processes (3) V ε0 = P(A ε0 A ε ), and V ε1 = P(A ε1 A ε ). P(A ε1 ε m ) = V ε1 V ε1 ε 2 V ε1 ε m, ε = ε 1 ε m {0,1} m. Notation: U V means U and V are independent U V Z means U and V are conditionally independent given Z. Definition (Tail-free). The random measure P is a tail-free process with respect to the sequence of partitions if {V 0 } {V 00,V 10 } {V ε0 :ε E m }.

65 Tail-free processes (3) V_{ε0} = P(A_{ε0}| A_ε), and V_{ε1} = P(A_{ε1}| A_ε). P(A_{ε_1···ε_m}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1···ε_m}, ε = ε_1···ε_m ∈ {0,1}^m. Notation: U ⊥ V means U and V are independent; U ⊥ V | Z means U and V are conditionally independent given Z. Definition (Tail-free). The random measure P is a tail-free process with respect to the sequence of partitions if {V_0} ⊥ {V_00, V_10} ⊥ ··· ⊥ {V_{ε0}: ε ∈ E^m} ⊥ ··· Theorem. The DP(α) prior is tail-free. All splitting variables V_{ε0} are independent and V_{ε0} ∼ Be(α(A_{ε0}), α(A_{ε1})). Proof. This follows from properties of the finite-dimensional Dirichlet.

66 Posterior distribution For X 1,...,X n P iid P define count variables: N ε := #{1 i n:x i A ε }.

67 Posterior distribution For X 1,...,X n P iid P define count variables: N ε := #{1 i n:x i A ε }. Theorem. If P is tail-free, then for every m and n the posterior distribution of ( P(A ε ):ε E m) given X 1,...,X n depends only on (N ε :ε E m ).

68 Posterior distribution For X_1,...,X_n | P iid ∼ P define count variables: N_ε := #{1 ≤ i ≤ n: X_i ∈ A_ε}. Theorem. If P is tail-free, then for every m and n the posterior distribution of (P(A_ε): ε ∈ E^m) given X_1,...,X_n depends only on (N_ε: ε ∈ E^m). Proof. We may generate the variables P, X_1,...,X_n in four steps: (a) Generate θ := (P(A_ε): ε ∈ E^m) from its prior. (b) Given θ generate N = (N_ε: ε ∈ E^m) multinomial (n,θ). (c) Generate η := (P(A| A_ε): A ∈ X, ε ∈ E^m). (d) Given (N,η) generate for every ε ∈ E^m a random sample of size N_ε from P(·| A_ε), independently across ε ∈ E^m; let X_1,...,X_n be the n values in a random order. Then η ⊥ θ and N ⊥ η | θ and X ⊥ θ | (N,η). Thus θ ⊥ X | N.

69 Posterior distribution (continued) Theorem. If P is tail-free, then the posterior P X 1,...,X n is tail-free.

70 Posterior distribution (continued) Theorem. If P is tail-free, then the posterior P| X_1,...,X_n is tail-free. Proof. It suffices to show, for every level: (V_{ε0}: ε ∈ E^m) ⊥ (P(A_ε): ε ∈ E^m) | X_1,...,X_n. In view of the preceding theorem, it suffices: (V_{ε0}: ε ∈ E^m) ⊥ (P(A_ε): ε ∈ E^m) | (N_{εδ}: ε ∈ E^m, δ ∈ E). The likelihood for (V,θ,N), where θ_ε = P(A_ε), takes the form (n choose N) ∏_{ε∈E^m, δ∈E} (θ_ε V_{εδ})^{N_{εδ}} dΠ_1(V) dΠ_2(θ). This factorizes in parts involving (V,N) and involving (θ,N).

71 Conjugacy of Dirichlet process P DP(α), X 1,X 2,... P iid P.

72 Conjugacy of Dirichlet process P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. P| X_1,...,X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}.

73 Conjugacy of Dirichlet process P DP(α), X 1,X 2,... P iid P. Theorem. P X 1,...,X n DP(α+nP n ), for P n = n 1 n i=1 δ X i. Proof. ( P(A 1 ),...,P(A k ) ) X 1,...,X n ( P(A 1 ),...,P(A k ) ) N. Apply result for finite-dimensional Dirichlet.

74 Conjugacy of Dirichlet process P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. P| X_1,...,X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}. Proof. (P(A_1),...,P(A_k)) | X_1,...,X_n =_d (P(A_1),...,P(A_k)) | N. Apply the result for the finite-dimensional Dirichlet. E(P(A)| X_1,...,X_n) = |α|/(|α|+n) ᾱ(A) + n/(|α|+n) P_n(A), var(P(A)| X_1,...,X_n) = P_n(A)P_n(A^c)/(1+|α|+n) ≤ 1/(4(1+|α|+n)). Corollary. P(A)| X_1,...,X_n →_d δ_{P_0(A)} as n → ∞, a.s. [P_0].
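
A sketch of the conjugacy in use (base measure ᾱ = N(0,1), total mass M = |α| and the simulated data are illustrative choices): the posterior mean of P(A) is the stated convex combination of the prior mean ᾱ(A) and the empirical measure P_n(A).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
M = 5.0                                  # |alpha|; base measure abar = N(0,1)
X = rng.standard_normal(50) + 1.0        # observed data (true distribution N(1,1))
n = len(X)

abar_A = norm.cdf(0.0)                   # prior mean E P(A) = abar(A) for A = (-inf, 0]
Pn_A = np.mean(X <= 0.0)                 # empirical measure P_n(A)

post_mean = M / (M + n) * abar_A + n / (M + n) * Pn_A
# exact Dirichlet variance, using the posterior base measure alpha + n P_n
post_var = post_mean * (1.0 - post_mean) / (1.0 + M + n)
print("posterior mean of P(A):", post_mean, " posterior sd:", np.sqrt(post_var))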


76 Predictive distribution P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. X_i| X_1,...,X_{i−1} ∼ δ_{X_1} with probability 1/(|α|+i−1), ..., δ_{X_{i−1}} with probability 1/(|α|+i−1), ᾱ with probability |α|/(|α|+i−1).

77 Predictive distribution P ∼ DP(α), X_1,X_2,... | P iid ∼ P. Theorem. X_i| X_1,...,X_{i−1} ∼ δ_{X_1} with probability 1/(|α|+i−1), ..., δ_{X_{i−1}} with probability 1/(|α|+i−1), ᾱ with probability |α|/(|α|+i−1). Proof. (i). Pr(X_1 ∈ A) = E Pr(X_1 ∈ A| P) = EP(A) = ᾱ(A). (ii). The preceding step means: X_1| P ∼ P and P ∼ DP(α) imply X_1 ∼ ᾱ. Hence X_2| (P,X_1) ∼ P and P| X_1 ∼ DP(α + δ_{X_1}) imply X_2| X_1 ∼ (α + δ_{X_1})/(|α|+1). (iii). Etc.
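
A sketch of sampling X_1, X_2,... directly from these predictive distributions (the generalized Polya urn, with P integrated out); M and the base measure ᾱ = N(0,1) are illustrative choices. Ties occur with positive probability, reflecting the discreteness of the Dirichlet process.

import numpy as np

def polya_urn(n, M, base_sampler, rng=None):
    """X_i equals a uniformly chosen earlier X_j w.p. (i-1)/(M+i-1), else a fresh draw from abar."""
    rng = rng or np.random.default_rng()
    X = []
    for i in range(1, n + 1):
        if rng.random() < M / (M + i - 1):
            X.append(base_sampler(rng))           # new value from the base measure, prob M/(M+i-1)
        else:
            X.append(X[rng.integers(len(X))])     # copy X_j, each with prob 1/(M+i-1)
    return np.array(X)

rng = np.random.default_rng(5)
X = polya_urn(200, M=2.0, base_sampler=lambda r: r.standard_normal(), rng=rng)
print("number of distinct values among 200 draws:", np.unique(X).size)   # O_P(log n) of them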

78 Dirichlet process mixtures Given a probability density x ↦ ψ(x;θ) consider data X_1,...,X_n | F iid ∼ p_F, where p_F(x) := ∫ ψ(x;θ) dF(θ).

79 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α).

80 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α). Lemma. For any θ ψ(θ) (e.g. ψ(x, )), ( ) E ψdf θ 1,...,θ n,x 1,...,X n = 1 [ ψdα+ α +n n j=1 ] ψ(θ j ).

81 Dirichlet process mixtures Given a probability density x ψ(x;θ) consider data X 1,...,X n F p iid F (x):= ψ(x;θ)df(θ). For F DP(α), this gives Bayesian model: X 1,...,X n θ 1,...,θ n,f ind ψ( ;θ i ), θ 1,...,θ n F iid F, F DP(α). Lemma. For any θ ψ(θ) (e.g. ψ(x, )), ( ) E ψdf θ 1,...,θ n,x 1,...,X n = 1 [ ψdα+ α +n n j=1 ] ψ(θ j ). Proof. F X 1,...,X n θ 1,...,θ n ; F θ 1,...,θ n DP(α+ n i=1 δ θ i ).

82 Dirichlet process mixtures Given a probability density x ↦ ψ(x;θ) consider data X_1,...,X_n | F iid ∼ p_F, where p_F(x) := ∫ ψ(x;θ) dF(θ). For F ∼ DP(α), this gives the Bayesian model: X_1,...,X_n | θ_1,...,θ_n, F ind ∼ ψ(·;θ_i), θ_1,...,θ_n | F iid ∼ F, F ∼ DP(α). Lemma. For any θ ↦ ψ(θ) (e.g. ψ(x,·)), E(∫ψ dF | θ_1,...,θ_n, X_1,...,X_n) = (1/(|α|+n)) [∫ψ dα + ∑_{j=1}^n ψ(θ_j)]. Proof. F ⊥ X_1,...,X_n | θ_1,...,θ_n; F| θ_1,...,θ_n ∼ DP(α + ∑_{i=1}^n δ_{θ_i}). Compute the conditional expectation given X_1,...,X_n by generating samples θ_1,...,θ_n from θ_1,...,θ_n | X_1,...,X_n, and averaging.

83 Gibbs sampler X_i | θ_i, F ind ∼ ψ(·;θ_i), θ_i | F iid ∼ F, F ∼ DP(α). Theorem (Gibbs sampler). θ_i | θ_{−i}, X_1,...,X_n ∼ ∑_{j≠i} q_{i,j} δ_{θ_j} + q_{i,0} G_{b,i}, where (q_{i,j}: j ∈ {0,1,...,n}\{i}) is the probability vector satisfying q_{i,j} ∝ ψ(X_i;θ_j) for j ≥ 1, j ≠ i, and q_{i,0} ∝ ∫ ψ(X_i;θ) dα(θ), and G_{b,i} is the baseline posterior measure given by dG_{b,i}(θ| X_i) ∝ ψ(X_i;θ) dα(θ).

84 Gibbs sampler proof Proof. E(1l_A(X_i) 1l_B(θ_i)| θ_{−i}, X_{−i}) = E(E(1l_A(X_i) 1l_B(θ_i)| F, θ_{−i}, X_{−i})| θ_{−i}, X_{−i}) = E(∫∫ 1l_A(x) 1l_B(θ) ψ(x;θ) dµ(x) dF(θ)| θ_{−i}) = (1/(|α|+n−1)) ∫∫ 1l_A(x) 1l_B(θ) ψ(x;θ) dµ(x) d(α + ∑_{j≠i} δ_{θ_j})(θ). By Bayes's rule (applied conditionally given (θ_{−i}, X_{−i})), Pr(θ_i ∈ B| X_i, θ_{−i}, X_{−i}) = ∫_B ψ(X_i;θ) d(α + ∑_{j≠i} δ_{θ_j})(θ) / ∫ ψ(X_i;θ) d(α + ∑_{j≠i} δ_{θ_j})(θ).
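
A sketch of this Gibbs sampler for a Dirichlet process mixture of normals, ψ(x;θ) = N(x; θ, 1), with base measure α = M · N(0, τ^2); the kernel, base measure and simulated data are my illustrative choices. For this conjugate pair q_{i,0} and the baseline posterior G_{b,i} are available in closed form.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(-2, 1, 60), rng.normal(2, 1, 60)])  # illustrative two-component data
n = len(X)
M, tau2 = 1.0, 4.0              # total mass |alpha| and variance of the N(0, tau2) base measure

theta = X.copy()                # initialize the latent theta_i
for sweep in range(200):
    for i in range(n):
        others = np.delete(theta, i)
        w = norm.pdf(X[i], loc=others, scale=1.0)                      # q_{i,j} prop to psi(X_i; theta_j)
        w0 = M * norm.pdf(X[i], loc=0.0, scale=np.sqrt(1.0 + tau2))    # q_{i,0} prop to int psi(X_i;t) dalpha(t)
        probs = np.concatenate(([w0], w))
        probs /= probs.sum()
        j = rng.choice(probs.size, p=probs)
        if j == 0:
            post_var = tau2 / (1.0 + tau2)                             # baseline posterior G_{b,i}
            theta[i] = rng.normal(post_var * X[i], np.sqrt(post_var))
        else:
            theta[i] = others[j - 1]

print("number of distinct cluster locations:", np.unique(theta).size)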

85 Further properties The number of distinct values in (X_1,...,X_n) is O_P(log n). The pattern of equal values induces the same random partition of the set {1,2,...,n} as the Kingman coalescent. The Dirichlet distribution has full support relative to the weak topology. DP(α_1) ⊥ DP(α_2) (mutually singular) as soon as α_1^c ≠ α_2^c or α_1^d and α_2^d have different supports. In particular the prior DP(α) and the posterior DP(α + nP_n) are typically orthogonal. The cdf of P ∼ DP(α) is a normalized Gamma process. The tails of P ∼ DP(α) are much thinner than the tails of α. The Dirichlet is the only prior that is tail-free relative to any partition. The splitting variables of a Polya tree can be defined so that the prior is absolutely continuous.

86 Consistency and rates

87 Consistency X^{(n)} observation in sample space (X^{(n)}, X^{(n)}) with distribution P^{(n)}_θ. θ belongs to metric space (Θ,d). Definition. The posterior distribution is consistent at θ_0 ∈ Θ if Π_n(θ: d(θ,θ_0) > ǫ| X^{(n)}) → 0 in P^{(n)}_{θ_0}-probability, as n → ∞, for every ǫ > 0.

88 Point estimator Proposition. If the posterior distribution is consistent at θ 0 then ˆθ n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(ˆθ n,θ 0 ) 0 in P (n) θ 0 -probability.

89 Point estimator Proposition. If the posterior distribution is consistent at θ 0 then ˆθ n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(ˆθ n,θ 0 ) 0 in P (n) θ 0 -probability. Proof. For B(θ,r) = {s Θ:d(s,θ) r} let Then ˆr n (ˆθ n ) inf θ ˆr n (θ). ˆr n (θ) = inf{r:π n ( B(θ,r) X (n) ) 1/2}. Π n ( B(θ0,ǫ) X (n)) 1 in probability. ˆr n (θ 0 ) ǫ with probability tending to 1, whence ˆr n (ˆθ n ) ˆr n (θ 0 ) ǫ. B(θ 0,ǫ) and B (ˆθn,ˆr n (ˆθ n ) ) cannot be disjoint. d(θ 0,ˆθ n ) ǫ+ ˆr n (ˆθ n ) 2ǫ.

90 Point estimator Proposition. If the posterior distribution is consistent at θ_0 then θ̂_n, defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2, satisfies d(θ̂_n,θ_0) → 0 in P^{(n)}_{θ_0}-probability. Proof. For B(θ,r) = {s ∈ Θ: d(s,θ) ≤ r} let r̂_n(θ) = inf{r: Π_n(B(θ,r)| X^{(n)}) ≥ 1/2}. Then r̂_n(θ̂_n) ≤ inf_θ r̂_n(θ). Π_n(B(θ_0,ǫ)| X^{(n)}) → 1 in probability. r̂_n(θ_0) ≤ ǫ with probability tending to 1, whence r̂_n(θ̂_n) ≤ r̂_n(θ_0) ≤ ǫ. B(θ_0,ǫ) and B(θ̂_n, r̂_n(θ̂_n)) cannot be disjoint. d(θ_0,θ̂_n) ≤ ǫ + r̂_n(θ̂_n) ≤ 2ǫ. Alternative: posterior mean ∫ θ dΠ_n(θ| X^{(n)}).

91 Doob s theorem Theorem (Doob). Let (X, X,P θ :θ Θ) be experiments with (X, X ) a standard Borel space and Θ a Borel subset of a Polish space such that θ P θ (A) is Borel measurable for every A X and the map θ P θ is one-to-one. Then for any prior Π on the Borel sets of Θ the posterior Π n ( X 1,...,X n ) in the model X 1,...,X n θ iid p θ and θ Π is consistent at θ, for Π-almost every θ.

92 Kullback-Leibler property Parameter p: ν-density on sample space (X, X ). True value p 0. Kullback-Leibler divergence: K(p 0 ;p) = p 0 log(p 0 /p)dν, K(p 0 ;P 0 ) = inf K(p 0 ;p). p P 0

93 Kullback-Leibler property Parameter p: ν-density on sample space (X, X ). True value p 0. Kullback-Leibler divergence: K(p 0 ;p) = p 0 log(p 0 /p)dν, K(p 0 ;P 0 ) = inf K(p 0 ;p). p P 0 Definition. p 0 is said to possess the Kullback-Leibler property relative to Π if Π ( p:k(p 0 ;p) < ǫ ) > 0 for every ǫ > 0.

94 Kullback-Leibler property Parameter p: ν-density on sample space (X, X). True value p_0. Kullback-Leibler divergence: K(p_0;p) = ∫ p_0 log(p_0/p) dν, K(p_0;P_0) = inf_{p∈P_0} K(p_0;p). Definition. p_0 is said to possess the Kullback-Leibler property relative to Π if Π(p: K(p_0;p) < ǫ) > 0 for every ǫ > 0. EXAMPLES Polya tree prior with dyadic partition and splitting variables V_{ε0} ∼ Be(a_m, a_m) at level m, for ∑_m a_m^{−1} < ∞ and K(p_0; λ) < ∞. Dirichlet mixtures ∫ψ(·,θ) dF(θ) with F ∼ DP(α), under some regularity conditions.

95 Schwartz s theorem Bayesian model: X 1,...,X n p iid p, p Π.

96 Schwartz s theorem Bayesian model: X 1,...,X n p iid p, p Π. Theorem. If p 0 has KL-property, and for every neighbourhood U of p 0 there exist tests φ n such that P n 0 φ n 0, sup P n (1 φ n ) 0, p U c then Π n ( X 1,...,X n ) is consistent at p 0.

97 Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property, and for every neighbourhood U of p_0 there exist tests φ_n such that P_0^n φ_n → 0, sup_{p∈U^c} P^n(1−φ_n) → 0, then Π_n(·| X_1,...,X_n) is consistent at p_0. Proof. By grouping the observations and using Hoeffding's inequality we can find tests ψ_n with P_0^n ψ_n ≤ e^{−Cn}, sup_{p∈U^c} P^n(1−ψ_n) ≤ e^{−Cn}. Then apply the theorem later on.

98 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P.

99 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P. Theorem. The posterior distribution is consistent for the weak topology at any p 0 with the Kullback-Leibler property.

100 Weak consistency Consider the topology induced on p by the weak topology on the probability measures P. Theorem. The posterior distribution is consistent for the weak topology at any p_0 with the Kullback-Leibler property. Proof. Consistent tests always exist: A subbasis for the weak neighbourhoods are sets of the type U = {p: Pψ < P_0ψ + ǫ}, for ψ: X → [0,1] continuous and ǫ > 0. Given a test for each neighbourhood the maximum of the tests works for a finite intersection. Use Hoeffding's inequality to bound the error probabilities of the test φ_n = 1l{ (1/n) ∑_{i=1}^n ψ(X_i) > P_0ψ + ǫ/2 }.

101 Extended Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property and for every neighbourhood U of p_0 there exist C > 0, sets P_n ⊂ P and tests φ_n such that Π(P \ P_n) < e^{−Cn}, P_0^n φ_n ≤ e^{−Cn}, sup_{p∈P_n∩U^c} P^n(1−φ_n) ≤ e^{−Cn}, then the posterior distribution Π_n(·| X_1,...,X_n) is consistent at p_0.

102 Extended Schwartz's theorem Bayesian model: X_1,...,X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property and for every neighbourhood U of p_0 there exist C > 0, sets P_n ⊂ P and tests φ_n such that Π(P \ P_n) < e^{−Cn}, P_0^n φ_n ≤ e^{−Cn}, sup_{p∈P_n∩U^c} P^n(1−φ_n) ≤ e^{−Cn}, then the posterior distribution Π_n(·| X_1,...,X_n) is consistent at p_0. Proof. Follow steps 1-4. Π_n(U^c| X_1,...,X_n) = ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) / ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p).

103 Extended Schwartz s theorem proof Proof. continued. Step 1: for any ǫ > 0 eventually a.s. [P 0 ]: n i=1 p p 0 (X i )dπ(p) Π ( p:k(p 0 ;p) < ǫ)e nǫ. (2)

104 Extended Schwartz's theorem proof Proof. (continued). Step 1: for any ǫ > 0 eventually a.s. [P_0]: ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p: K(p_0;p) < ǫ) e^{−nǫ}. (2) Proof: for Π_ǫ(·) = Π(· ∩ P_ǫ)/Π(P_ǫ), and P_ǫ = {p: K(p_0;p) < ǫ}, log ∫_{P_ǫ} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) − log Π(P_ǫ) = log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_ǫ(p) ≥ ∫ ∑_{i=1}^n log(p/p_0)(X_i) dΠ_ǫ(p) = −n ∫ K(p_0;p) dΠ_ǫ(p) + o(n), a.s., by Jensen's inequality and the law of large numbers.

105 Extended Schwartz s theorem proof (2) Proof. continued. Step 2: n Π n (U c U X 1,...,X n ) φ n +(1 φ n ) c i=1 (p/p 0)(X i )dπ(p) n i=1 (p/p 0)(X i )dπ(p) φ n +Π ( n p:k(p 0 ;p) < ǫ)e nǫ (1 φ n ) (p/p 0 )(X i )dπ(p) U c i=1

106 Extended Schwartz s theorem proof (2) Proof. continued. Step 2: n Π n (U c U X 1,...,X n ) φ n +(1 φ n ) c i=1 (p/p 0)(X i )dπ(p) n i=1 (p/p 0)(X i )dπ(p) φ n +Π ( n p:k(p 0 ;p) < ǫ)e nǫ (1 φ n ) (p/p 0 )(X i )dπ(p) Step 3: P n 0 ( (1 φ n ) U c n i=1 p ) (X i )dπ(p) p 0 = U c P n 0 U c i=1 [ (1 φ n ) n i=1 U c P n (1 φ n )dπ(p). p ] (X i ) dπ(p) p 0

107 Extended Schwartz's theorem proof (2) Proof. (continued). Step 2: Π_n(U^c| X_1,...,X_n) ≤ φ_n + (1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) / ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≤ φ_n + Π(p: K(p_0;p) < ǫ)^{−1} e^{nǫ} (1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p). Step 3: P_0^n((1−φ_n) ∫_{U^c} ∏_{i=1}^n (p/p_0)(X_i) dΠ(p)) = ∫_{U^c} P_0^n[(1−φ_n) ∏_{i=1}^n (p/p_0)(X_i)] dΠ(p) ≤ ∫_{U^c} P^n(1−φ_n) dΠ(p). Step 4: Split U^c into U^c ∩ P_n and U^c ∩ P_n^c and use that P^n(1−φ_n) ≤ e^{−Cn} on the first set, while Π(U^c ∩ P_n^c) ≤ e^{−Cn}.

108 Strong consistency and entropy Definition (Covering number). N(ǫ, P, d) is the minimal number of d-balls of radius ǫ needed to cover P. Theorem. The posterior distribution is consistent relative to the L 1 -distance at every p 0 with the KL-property if for every ǫ > 0 there exist a partition P = P n,1 P n,2 (which may depend on ǫ) such that, for C > 0, (i) Π(P n,2 ) e Cn. (ii) logn ( ǫ,p n,1, 1 ) nǫ 2 /3.

109 Strong consistency and entropy Definition (Covering number). N(ǫ, P, d) is the minimal number of d-balls of radius ǫ needed to cover P. Theorem. The posterior distribution is consistent relative to the L_1-distance at every p_0 with the KL-property if for every ǫ > 0 there exists a partition P = P_{n,1} ∪ P_{n,2} (which may depend on ǫ) such that, for some C > 0, (i) Π(P_{n,2}) ≤ e^{−Cn}, (ii) log N(ǫ, P_{n,1}, ‖·‖_1) ≤ nǫ^2/3. Proof. Entropy gives tests. See below. Apply the extended Schwartz theorem.

112 Tests the two Luciens Lucien le Cam Lucien Birgé

113 Tests minimax theorem minimax risk for testing P versus Q : π(p,q) = inf φ ( Pφ+ sup Q(1 φ) Q Q ).

114 Tests minimax theorem minimax risk for testing P versus Q : Hellinger affinity: π(p,q) = inf φ ( Pφ+ sup Q(1 φ) Q Q ). ρ 1/2 (p,q) = p qdµ = 1 h 2 (p,q)/2, for h 2 (p,q) = ( p q) 2 dµ square Hellinger distance

115 Tests minimax theorem minimax risk for testing P versus Q: π(P,Q) = inf_φ (Pφ + sup_{Q∈Q} Q(1−φ)). Hellinger affinity: ρ_{1/2}(p,q) = ∫ √(pq) dµ = 1 − h^2(p,q)/2, for h^2(p,q) = ∫ (√p − √q)^2 dµ the square Hellinger distance. Proposition. For dominated probability measures P and Q, π(P,Q) = 1 − ‖P − conv(Q)‖ ≤ sup_{Q∈conv(Q)} ρ_{1/2}(p,q).

116 Tests minimax risk Proof. π(P,Q) = inf_φ sup_{Q∈conv(Q)} (Pφ + Q(1−φ)) = sup_{Q∈conv(Q)} inf_φ (Pφ + Q(1−φ)) = sup_{Q∈conv(Q)} (P1l{p < q} + Q1l{p ≥ q}) (= sup_{Q∈conv(Q)} ‖p ∧ q‖_1). P1l{p < q} + Q1l{p ≥ q} = ∫_{p<q} p dµ + ∫_{p≥q} q dµ ≤ ∫ √(pq) dµ.

117 Tests product measures ρ 1/2 (p 1 p 2,q 1 q 2 ) = ρ 1/2 (p 1,q 1 )ρ 1/2 (p 2,q 2 ).

118 Tests product measures ρ 1/2 (p 1 p 2,q 1 q 2 ) = ρ 1/2 (p 1,q 1 )ρ 1/2 (p 2,q 2 ). Lemma. For any probability measures P i and Q i ρ 1/2 ( i P i,conv( i Q i ) ) i ρ 1/2 ( Pi,conv(Q i ) ).

119 Tests product measures ρ_{1/2}(p_1 × p_2, q_1 × q_2) = ρ_{1/2}(p_1,q_1) ρ_{1/2}(p_2,q_2). Lemma. For any probability measures P_i and Q_i, ρ_{1/2}(⊗_i P_i, conv(⊗_i Q_i)) ≤ ∏_i ρ_{1/2}(P_i, conv(Q_i)). Proof. It suffices to consider products of 2. If q(x,y) = ∑_j κ_j q_{1j}(x) q_{2j}(y), then ρ_{1/2}(p_1 × p_2, q) = ∫ p_1(x)^{1/2} (∑_j κ_j q_{1j}(x))^{1/2} [∫ p_2(y)^{1/2} (∑_j κ_j q_{1j}(x) q_{2j}(y))^{1/2} / (∑_j κ_j q_{1j}(x))^{1/2} dµ_2(y)] dµ_1(x).

120 Tests product measures (2) Corollary. π(p n,q n ) ρ 1/2 (P n,conv(q n )) ρ 1/2 (P,conv(Q)) n.

121 Tests product measures (2) Corollary. π(p n,q n ) ρ 1/2 (P n,conv(q n )) ρ 1/2 (P,conv(Q)) n. Theorem. For any probability measure P and convex set of dominated probability measures Q with h(p,q) > ǫ for every q Q and any n N, there exists a test φ such that P n φ e nǫ2 /2, sup Q n (1 φ) e nǫ2 /2. Q Q

122 Tests product measures (2) Corollary. π(P^n, Q^n) ≤ ρ_{1/2}(P^n, conv(Q^n)) ≤ ρ_{1/2}(P, conv(Q))^n. Theorem. For any probability measure P and convex set of dominated probability measures Q with h(P,q) > ǫ for every q ∈ Q and any n ∈ N, there exists a test φ such that P^n φ ≤ e^{−nǫ^2/2}, sup_{Q∈Q} Q^n(1−φ) ≤ e^{−nǫ^2/2}. Proof. ρ_{1/2}(P,Q) = 1 − h^2(P,Q)/2 ≤ 1 − ǫ^2/2. π(P^n, Q^n) ≤ (1 − ǫ^2/2)^n ≤ e^{−nǫ^2/2}.

123 Tests nonconvex alternatives Definition (Covering number). N(ǫ, Q, d) is the minimal number of d-balls of radius ǫ needed to cover Q. Proposition. Let d ≤ h be a metric whose balls are convex. If N(ǫ/4, Q, d) ≤ N(ǫ) for every ǫ > ǫ_n > 0 and some nonincreasing function N: (0,∞) → (0,∞), then for every ǫ > ǫ_n and n there exists a test φ such that, for all j ∈ N, P^n φ ≤ N(ǫ) e^{−nǫ^2/8}/(1 − e^{−nǫ^2/8}), sup_{Q∈Q: d(P,Q)>jǫ} Q^n(1−φ) ≤ e^{−nǫ^2 j^2/8}.

124 Tests nonconvex alternatives Proof. For j ∈ N, choose a maximal set of jǫ/2-separated points Q_{j,1},...,Q_{j,N_j} in Q_j := {Q ∈ Q: jǫ < d(P,Q) ≤ 2jǫ}. (i). N_j ≤ N(jǫ/4, Q_j, d). (ii). The N_j balls B_{j,l} of radius jǫ/2 around the Q_{j,l} cover Q_j. (iii). h(P, B_{j,l}) ≥ d(P, B_{j,l}) > jǫ/2 for every ball B_{j,l}. For every ball take a test φ_{j,l} of P versus B_{j,l}. Let φ be their supremum. P^n φ ≤ ∑_{j=1}^∞ ∑_{l=1}^{N_j} e^{−nj^2ǫ^2/8} ≤ ∑_{j=1}^∞ N(jǫ/4, Q_j, d) e^{−nj^2ǫ^2/8} ≤ N(ǫ) e^{−nǫ^2/8}/(1 − e^{−nǫ^2/8}) and, for every j ∈ N, sup_{Q∈∪_{l>j} Q_l} Q^n(1−φ) ≤ sup_{l>j} e^{−nl^2ǫ^2/8} ≤ e^{−nj^2ǫ^2/8}.

125 Rate of contraction Definition. The posterior distribution Π n ( X (n) ) contracts at rate ǫ n 0 ( at θ 0 Θ if Π n θ:d(θ,θ0 ) > M n ǫ n X (n)) 0 in P (n) θ 0 -probability, for every M n as n.

126 Rate of contraction Definition. The posterior distribution Π_n(·| X^{(n)}) contracts at rate ǫ_n → 0 at θ_0 ∈ Θ if Π_n(θ: d(θ,θ_0) > M_nǫ_n| X^{(n)}) → 0 in P^{(n)}_{θ_0}-probability, for every M_n → ∞ as n → ∞. Proposition (Point estimator). If the posterior distribution contracts at rate ǫ_n at θ_0, then θ̂_n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(θ̂_n,θ_0) = O_P(ǫ_n) under P^{(n)}_{θ_0}.

127 Basic contraction theorem K(p_0;p) = P_0 log(p_0/p), V(p_0;p) = P_0 (log(p_0/p))^2. Theorem. Given d ≤ h whose balls are convex suppose that there exist P_n ⊂ P and C > 0, such that (i) Π_n(p: K(p_0;p) < ǫ_n^2, V(p_0;p) < ǫ_n^2) ≥ e^{−Cnǫ_n^2}, (ii) log N(ǫ_n, P_n, d) ≤ nǫ_n^2, (iii) Π_n(P_n^c) ≤ e^{−(C+4)nǫ_n^2}. Then the posterior rate of contraction for d is ǫ_n ∨ n^{−1/2}.

128 Basic contraction theorem proof Proof. There exist tests φ_n with P_0^n φ_n ≤ e^{nǫ_n^2} e^{−nM^2ǫ_n^2/8}/(1 − e^{−nM^2ǫ_n^2/8}), sup_{p∈P_n: d(p,p_0)>Mǫ_n} P^n(1−φ_n) ≤ e^{−nM^2ǫ_n^2/8}. For A_n = {∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) ≥ e^{−(2+C)nǫ_n^2}}: Π_n(p: d(p,p_0) > Mǫ_n| X_1,...,X_n) ≤ φ_n + 1l{A_n^c} + e^{(2+C)nǫ_n^2} ∫_{d(p,p_0)>Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) (1−φ_n). P_0^n(A_n^c) → 0. See further on.

129 Basic contraction theorem proof continued Proof. (Continued) P_0^n((1−φ_n) ∫_{p∈P_n: d(p,p_0)>Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p)) ≤ ∫_{p∈P_n: d(p,p_0)>Mǫ_n} P^n(1−φ_n) dΠ_n(p) ≤ e^{−nM^2ǫ_n^2/8}, and P_0^n(∫_{P\P_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p)) ≤ Π_n(P \ P_n).

130 Bounding the denominator Lemma. For any probability measure Π on P, and positive constant ǫ, with P n 0 -probability at least 1 (nǫ2 ) 1, n i=1 p p 0 (X i )dπ(p) Π ( p:k(p 0 ;p) < ǫ 2,V(p 0 ;p) < ǫ 2) e 2nǫ2.

131 Bounding the denominator Lemma. For any probability measure Π on P, and positive constant ǫ, with P_0^n-probability at least 1 − (nǫ^2)^{−1}, ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p: K(p_0;p) < ǫ^2, V(p_0;p) < ǫ^2) e^{−2nǫ^2}. Proof. B := {p: K(p_0;p) < ǫ^2, V(p_0;p) < ǫ^2}, and let Π_B := Π(· ∩ B)/Π(B). By Jensen's inequality, log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_B(p) ≥ ∫ ∑_{i=1}^n log(p/p_0)(X_i) dΠ_B(p) =: Z. EZ = −n ∫ K(p_0;p) dΠ_B(p) > −nǫ^2, var Z ≤ n P_0 (∫ log(p_0/p) dΠ_B(p))^2 ≤ n P_0 ∫ (log(p_0/p))^2 dΠ_B(p) ≤ nǫ^2. Apply Chebyshev's inequality.

132 Interpretation Consider a maximal set of points p_1,...,p_N in P_n with d(p_i,p_j) ≥ ǫ_n. Maximality implies N ≥ N(ǫ_n, P_n, d), which may be as large as e^{c_1 nǫ_n^2} under the entropy bound. The balls of radius ǫ_n/2 around the points are disjoint and hence the sum of their prior masses will be less than 1. If the prior mass were evenly distributed over these balls, then each would have no more mass than e^{−c_1 nǫ_n^2}. This is of the same order as the prior mass bound. This argument suggests that the conditions can only be satisfied for every p_0 in the model if the prior distributes its mass uniformly, at discretization level ǫ_n.

133 General observations Experiments (X^{(n)}, X^{(n)}, P^{(n)}_θ: θ ∈ Θ_n), with observations X^{(n)}, and true parameters θ_{n,0} ∈ Θ_n. d_n and e_n semi-metrics on Θ_n such that: there exist ξ, K > 0 such that for every ǫ > 0 and every θ_{n,1} ∈ Θ_n with d_n(θ_{n,1},θ_{n,0}) > ǫ, there exists a test φ_n such that P^{(n)}_{θ_{n,0}} φ_n ≤ e^{−Knǫ^2}, sup_{θ∈Θ_n: e_n(θ,θ_{n,1})<ξǫ} P^{(n)}_θ(1−φ_n) ≤ e^{−Knǫ^2}.

134 General observations rate of contraction B_{n,k}(θ_{n,0},ǫ) = {θ ∈ Θ_n: K(p^{(n)}_{θ_{n,0}}; p^{(n)}_θ) ≤ nǫ^2, V_{k,0}(p^{(n)}_{θ_{n,0}}; p^{(n)}_θ) ≤ n^{k/2} ǫ^k}. Theorem. If for arbitrary Θ_{n,1} ⊂ Θ_n and k > 1, nǫ_n^2 ≥ 1, and every j ∈ N, (i) Π_n(θ ∈ Θ_{n,1}: jǫ_n < d_n(θ,θ_{n,0}) ≤ 2jǫ_n) / Π_n(B_{n,k}(θ_{n,0},ǫ_n)) ≤ e^{Knǫ_n^2 j^2/2}, (ii) sup_{ǫ>ǫ_n} log N(ξǫ, {θ ∈ Θ_{n,1}: d_n(θ,θ_{n,0}) < 2ǫ}, e_n) ≤ nǫ_n^2, then Π_n(θ ∈ Θ_{n,1}: d_n(θ,θ_{n,0}) ≥ M_nǫ_n| X^{(n)}) → 0, in P^{(n)}_{θ_{n,0}}-probability, for every M_n → ∞. Theorem. If for arbitrary Θ_{n,2} ⊂ Θ_n and some k > 1, (iii) Π_n(Θ_{n,2}) / Π_n(B_{n,k}(θ_{n,0},ǫ_n)) = o(e^{−2nǫ_n^2}), (3) then Π_n(Θ_{n,2}| X^{(n)}) → 0, in P^{(n)}_{θ_{n,0}}-probability.

135 Gaussian process priors

136 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W t :t T) such that (W t1,...,w tk ) is multivariate normal, for every t 1,...,t k T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW t, K(s,t) = EW s W t, s,t T.

137 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W t :t T) such that (W t1,...,w tk ) is multivariate normal, for every t 1,...,t k T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW t, K(s,t) = EW s W t, s,t T. The law of a Gaussian process is a prior for a function.

138 Gaussian processes Definition. A Gaussian process is a set of random variables (or vectors) W = (W_t: t ∈ T) such that (W_{t_1},...,W_{t_k}) is multivariate normal, for every t_1,...,t_k ∈ T. The finite-dimensional distributions are determined by the mean function and the covariance function µ(t) = EW_t, K(s,t) = EW_sW_t, s,t ∈ T. The law of a Gaussian process is a prior for a function. Gaussian process priors have been found useful, because they offer great variety, they are easy (?) to understand through their covariance function, and they can be computationally attractive (e.g. ...).
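
A sketch of simulating such a prior on a grid directly from its covariance function (the squared-exponential kernel and its length scale are illustrative choices): draw from the multivariate normal with covariance matrix (K(s_i, s_j)).

import numpy as np

def sample_gp_paths(grid, cov, n_paths=3, rng=None, jitter=1e-6):
    """Sample paths of a zero-mean Gaussian process with covariance function cov on a grid."""
    rng = rng or np.random.default_rng()
    K = cov(grid[:, None], grid[None, :]) + jitter * np.eye(grid.size)   # covariance matrix (+ jitter)
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((grid.size, n_paths))

grid = np.linspace(0.0, 1.0, 200)
squared_exp = lambda s, t, ell=0.2: np.exp(-((s - t) ** 2) / (2 * ell**2))  # K(s,t), length scale ell
paths = sample_gp_paths(grid, squared_exp, rng=np.random.default_rng(7))
print("one sample path at t = 0, 0.5, 1:", paths[[0, 100, -1], 0])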

139 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.

140 Brownian density estimation [Figures: a Brownian motion sample path t ↦ W_t and the corresponding prior density t ↦ c·exp(W_t).]


142 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy. Theorem. If w_0 := log p_0 ∈ C^α[0,1], then the L_2-rate is: n^{−1/4} if α ≥ 1/2; n^{−α/2} if α ≤ 1/2.

143 Brownian density estimation X_1,...,X_n i.i.d. from density p_0 on [0,1]. (W_x: x ∈ [0,1]) Brownian motion. As prior on p use: x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy. Theorem. If w_0 := log p_0 ∈ C^α[0,1], then the L_2-rate is: n^{−1/4} if α ≥ 1/2; n^{−α/2} if α ≤ 1/2. This is optimal if and only if α = 1/2. The rate does not improve if α increases from 1/2. Consistency for any α > 0.
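
A sketch of one draw from this prior: simulate a Brownian motion path W on a grid of [0,1] (random-walk approximation) and normalize x ↦ e^{W_x}.

import numpy as np

rng = np.random.default_rng(8)
m = 1000
t = np.linspace(0.0, 1.0, m + 1)
W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(m)) * np.sqrt(1.0 / m)))  # Brownian path on the grid

density = np.exp(W)
density /= density.mean()        # Riemann approximation of the normalizing integral over [0,1]
print("prior density draw at x = 0, 0.5, 1:", density[[0, m // 2, -1]])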

144 Integrated Brownian density estimation [Figures: integrated Brownian motion sample paths and the corresponding prior densities.]

145 Integrated Brownian motion: Riemann-Liouville process α − 1/2 times integrated Brownian motion, released at 0: W_t = ∫_0^t (t−s)^{α−1/2} dB_s + ∑_{k=0}^{[α]+1} Z_k t^k [B Brownian motion, α > 0, (Z_k) iid N(0,1), fractional integral]. Theorem. IBM gives an appropriate model for α-smooth functions: consistency if w_0 ∈ C^β[0,1] for any β > 0, but the optimal rate n^{−β/(2β+1)} if and only if α = β.

146 Settings Density estimation X_1,...,X_n iid in [0,1], p_θ(x) = e^{θ(x)} / ∫_0^1 e^{θ(t)} dt. Distance on parameter: Hellinger on p_θ. Norm on W: uniform. Classification (X_1,Y_1),...,(X_n,Y_n) iid in [0,1] × {0,1}, Pr_θ(Y = 1| X = x) = 1/(1 + e^{−θ(x)}). Distance on parameter: L_2(G) on Pr_θ. (G marginal of X_i.) Norm on W: L_2(G). Regression Y_1,...,Y_n independent N(θ(x_i),σ^2), for fixed design points x_1,...,x_n. Distance on parameter: empirical L_2-distance on θ. Norm on W: empirical L_2-distance. Ergodic diffusions (X_t: t ∈ [0,n]), ergodic, recurrent: dX_t = θ(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ‖·/σ‖_{µ_0,2}). Norm on W: L_2(µ_0). (µ_0 stationary measure.)

147 Other Gaussian processes Brownian sheet. Fractional Brownian motion [figures for two values of α]. Series prior: θ(x) = ∑_i θ_i e_i(x), θ_i indep ∼ N(0,λ_i).

148 Stationary processes A stationary Gaussian field (W_t: t ∈ R^d) is characterized through a spectral measure µ, by cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} dµ(λ). [Figures: sample paths and spectral measures of the radial basis (squared-exponential) and Matérn (3/2) processes.]

149 Stationary processes radial basis Stationary Gaussian field (W_t: t ∈ R^d) characterized through cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} e^{−‖λ‖^2} dλ. Theorem. Let ŵ_0 be the Fourier transform of the true parameter w_0: [0,1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|^2 dλ < ∞, then the rate of contraction is near 1/√n. If |ŵ_0(λ)| ≍ (1+‖λ‖^2)^{−β}, then the rate is a power of 1/log n. Excellent if the truth is supersmooth; disastrous otherwise.

150 Stretching or shrinking: length scale Sample paths can be smoothed by stretching

151 Stretching or shrinking: length scale Sample paths can be smoothed by stretching and roughened by shrinking

152 Rescaled Brownian motion W t = B t/cn for B Brownian motion, and c n n (2α 1)/(2α+1) α < 1/2: c n 0 (shrink) α (1/2,1]: c n (stretch) Theorem. The prior W t = B t/cn gives optimal rate for w 0 C α [0,1], α (0,1].

153 Rescaled Brownian motion W t = B t/cn for B Brownian motion, and c n n (2α 1)/(2α+1) α < 1/2: c n 0 (shrink) α (1/2,1]: c n (stretch) Theorem. The prior W t = B t/cn gives optimal rate for w 0 C α [0,1], α (0,1]. Surprising? (Brownian motion is self-similar!)

154 Rescaled Brownian motion W_t = B_{t/c_n} for B Brownian motion, and c_n ≍ n^{(2α−1)/(2α+1)}. α < 1/2: c_n → 0 (shrink); α ∈ (1/2,1]: c_n → ∞ (stretch). Theorem. The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0,1], α ∈ (0,1]. Surprising? (Brownian motion is self-similar!) Appropriate rescaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k+1].
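
A sketch of the rescaling W_t = B_{t/c}: simulating B on [0, 1/c] and reading it off on the unit-time grid shows that shrinking (c small) roughens and stretching (c large) smooths the sample paths; the values of c below are illustrative.

import numpy as np

def rescaled_bm(c, m=2000, rng=None):
    """W_t = B_{t/c} on a uniform grid of [0,1]: simulate B on a grid of [0, 1/c] with the same m steps."""
    rng = rng or np.random.default_rng()
    T = 1.0 / c
    B = np.concatenate(([0.0], np.cumsum(rng.standard_normal(m)) * np.sqrt(T / m)))
    return B        # B at the grid of [0, T] equals W at the uniform grid of [0, 1]

rng = np.random.default_rng(9)
for c in (0.1, 1.0, 10.0):
    W = rescaled_bm(c, rng=rng)
    print("c = %5.1f   mean |increment| on [0,1]: %.4f" % (c, np.mean(np.abs(np.diff(W)))))
# small c (shrink): larger increments, rougher paths; large c (stretch): smaller increments, smoother paths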

155 Rescaled smooth stationary process A Gaussian field with infinitely-smooth sample paths is obtained with EG_sG_t = ψ(s−t), ∫ e^{‖λ‖} |ψ̂(λ)| dλ < ∞. Theorem. The prior W_t = G_{t/c_n} for c_n ≍ n^{−1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0,1], any α > 0.

156 Gaussian elements in a Banach space Definition. A Gaussian random variable in a (separable) Banach space B is a Borel measurable map W:(Ω, U,Pr) B such that b W is normally distributed for every b in the dual space B. Many Gaussian processes (W t :t T) can be viewed as a Gaussian variable in a space of functions w:t R d. EXAMPLES Brownian motion can be viewed as a map in C[0,1], equipped with the uniform norm w = sup t [0,1] w(t).

157 Gaussian elements in a Banach space Definition. A Gaussian random variable in a (separable) Banach space B is a Borel measurable map W: (Ω, U, Pr) → B such that b*(W) is normally distributed for every b* in the dual space B*. Many Gaussian processes (W_t: t ∈ T) can be viewed as a Gaussian variable in a space of functions w: T → R. EXAMPLES Brownian motion can be viewed as a map in C[0,1], equipped with the uniform norm ‖w‖ = sup_{t∈[0,1]} |w(t)|. Brownian motion is also a map in L_2[0,1], or C^{1/4}[0,1], or some Besov space.

158 RKHS definition W zero-mean Gaussian in Banach space (B, ‖·‖). S: B* → B, Sb* = E W b*(W). Definition. The reproducing kernel Hilbert space (H, ‖·‖_H) of W is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E b*_1(W) b*_2(W).

159 RKHS definition (2) W = (W t :t T) Gaussian process that can be seen as tight, Borel measurable map in l (T) = {f:t R: f := sup t f(t) < }. with covariance function K(s,t) = EW s W t. Theorem. Then RKHS is completion of the set of functions t i α i K(s i,t) relative to inner product α i K(r i, ), j i β j K(s j, ) = H i α i β j K(r i,t j ). j

160 RKHS definition (2) W = (W_t: t ∈ T) Gaussian process that can be seen as a tight, Borel measurable map in ℓ^∞(T) = {f: T → R: ‖f‖_∞ := sup_t |f(t)| < ∞}, with covariance function K(s,t) = EW_sW_t. Theorem. Then the RKHS is the completion of the set of functions t ↦ ∑_i α_i K(s_i,t) = E(∑_i α_i W_{s_i}) W_t relative to the inner product ⟨∑_i α_i K(r_i,·), ∑_j β_j K(s_j,·)⟩_H = ∑_i ∑_j α_i β_j K(r_i,s_j), i.e. all functions t ↦ h_L(t) := E L W_t, where L ∈ L_2(W), with inner product ⟨h_{L_1}, h_{L_2}⟩_H = E L_1L_2.

161 RKHS definition (3) Any Gaussian random element in a separable Banach space can be represented (in many ways, e.g. spectral decomposition) as for µ i 0 Z 1,Z 2,... i.i.d. N(0,1) e 1 = e 2 = = 1 W = µ i Z i e i i=1

162 RKHS definition (3) Any Gaussian random element in a separable Banach space can be represented (in many ways, e.g. spectral decomposition) as W = ∑_{i=1}^∞ µ_i Z_i e_i, for µ_i ≥ 0, Z_1,Z_2,... i.i.d. N(0,1), ‖e_1‖ = ‖e_2‖ = ··· = 1. Theorem. The RKHS consists of all elements h := ∑_i h_i e_i with ‖h‖^2_H := ∑_i h_i^2/µ_i^2 < ∞.

163 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is { f:f (k+1) L 2 [0,1],f(0) = = f (k) (0) = 0 }, f H = f (k+1) 2.

164 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is { f:f (k+1) L 2 [0,1],f(0) = = f (k) (0) = 0 }, f H = f (k+1) 2. Proof. For k = 0: EW s W t = s t = t 0 1l [0,s]dλ. The set of all linear combinations i α i1l [0,si ] is dense in L 2 [0,1]. For k > 0: use the general result that the RKHS is equivariant under continous linear transformations, like integration.

165 EXAMPLE Brownian motion Theorem. The RKHS of k times IBM is {f: f^{(k+1)} ∈ L_2[0,1], f(0) = ··· = f^{(k)}(0) = 0}, ‖f‖_H = ‖f^{(k+1)}‖_2. Proof. For k = 0: EW_sW_t = s ∧ t = ∫_0^t 1l_{[0,s]} dλ. The set of all linear combinations ∑_i α_i 1l_{[0,s_i]} is dense in L_2[0,1]. For k > 0: use the general result that the RKHS is equivariant under continuous linear transformations, like integration. Theorem. The RKHS of the sum of k times IBM and t ↦ ∑_{i=0}^k Z_i t^i is {f: f^{(k+1)} ∈ L_2[0,1]}, ‖f‖^2_H = ‖f^{(k+1)}‖_2^2 + ∑_{i=0}^k f^{(i)}(0)^2.

166 Example stationary processes A stationary Gaussian process is characterized through a spectral measure µ, by cov(W_s,W_t) = ∫ e^{iλ^T(s−t)} dµ(λ). Theorem. The RKHS of (W_t: t ∈ T) is the set of real parts of the functions t ↦ ∫ e^{iλ^T t} ψ(λ) dµ(λ), ψ ∈ L_2(µ), with RKHS-norm ‖h‖_H = inf{‖ψ‖_2: h_ψ = h}. If the interior of T is nonempty and ∫ e^{‖λ‖} µ(dλ) < ∞, then ψ is unique and ‖h‖_H = ‖ψ‖_2. Proof. EW_sW_t = ⟨e_s, e_t⟩_{2,µ}, e_s(λ) = e^{iλ^T s}.

167 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ).

168 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2.

169 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2. α 1/2 times integrated BM: φ 0 (ǫ) (1/ǫ) 1/α.

170 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ) is Pr( W < ǫ), and the small ball exponent is φ 0 (ǫ) = logpr( W < ǫ). EXAMPLES Brownian motion: φ 0 (ǫ) (1/ǫ) 2. α 1/2 times integrated BM: φ 0 (ǫ) (1/ǫ) 1/α. Radial basis: φ 0 (ǫ) ( log(1/ǫ) ) 1+d.
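
A crude Monte Carlo sketch of the Brownian-motion example: estimate Pr(‖W‖_∞ < ǫ) from simulated paths and compare −log of it with the (1/ǫ)^2 order (grid size, replications and the values of ǫ are illustrative; the estimate degrades quickly as ǫ decreases).

import numpy as np

rng = np.random.default_rng(10)
m, reps = 500, 20000
paths = np.cumsum(rng.standard_normal((reps, m)) * np.sqrt(1.0 / m), axis=1)  # Brownian motion on a grid
sup_norm = np.max(np.abs(paths), axis=1)

for eps in (1.0, 0.7, 0.5):
    p = np.mean(sup_norm < eps)
    print("eps = %.1f   -log Pr(||W|| < eps) ~ %.2f   (1/eps)^2 = %.2f" % (eps, -np.log(p), 1.0 / eps**2))
# the ratio of the two printed numbers stabilizes as eps decreases (around pi^2/8 for the uniform norm)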

171 Small ball probability Definition. The small ball probability of a Gaussian random element W in (B, ‖·‖) is Pr(‖W‖ < ǫ), and the small ball exponent is φ_0(ǫ) = −log Pr(‖W‖ < ǫ). Small ball probabilities can be computed either by probabilistic arguments, or analytically from the RKHS. Theorem. φ_0(ǫ) ≍ log N(ǫ/√φ_0(ǫ), H_1, ‖·‖). EXAMPLE The RKHS of Brownian motion is the Sobolev space of first order. Its unit ball has entropy ≍ 1/ǫ for the uniform norm, so φ_0(ǫ) ≍ √φ_0(ǫ)/ǫ, i.e. φ_0(ǫ) ≍ (1/ǫ)^2.

172 Posterior contraction rates for Gaussian priors Prior W is centered Gaussian map in Banach space (B, ) with RKHS (H, H ) and small ball exponent φ 0 (ǫ) = logπ( W < ǫ). Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ǫ n if φ 0 (ǫ n ) nǫ n 2 AND inf h 2 H nǫ n 2. h H: h w 0 <ǫ n

173 Posterior contraction rates for Gaussian priors Prior W is a centered Gaussian map in a Banach space (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ǫ) = −log Π(‖W‖ < ǫ). Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ǫ_n if φ_0(ǫ_n) ≤ nǫ_n^2 AND inf_{h∈H: ‖h−w_0‖<ǫ_n} ‖h‖^2_H ≤ nǫ_n^2. Both inequalities give a lower bound on ǫ_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ǫ_n ≍ 1/√n.

174 Density estimation As prior on density p use p W for: p w (x) = ew x 1 0 ew t dt.

175 Density estimation As prior on density p use p_W for: p_w(x) = e^{w(x)} / ∫_0^1 e^{w(t)} dt. Lemma. For all v,w: h(p_v,p_w) ≤ ‖v−w‖_∞ e^{‖v−w‖_∞/2}, K(p_v;p_w) ≲ ‖v−w‖_∞^2 e^{‖v−w‖_∞}(1+‖v−w‖_∞), V(p_v;p_w) ≲ ‖v−w‖_∞^2 e^{‖v−w‖_∞}(1+‖v−w‖_∞)^2.

176 Settings
Density estimation: X_1, ..., X_n iid in [0,1], p_θ(x) = e^{θ(x)} / ∫_0^1 e^{θ(t)} dt. Distance on parameter: Hellinger on p_θ. Norm on W: uniform.
Classification: (X_1,Y_1), ..., (X_n,Y_n) iid in [0,1] × {0,1}, Pr_θ(Y = 1 | X = x) = 1/(1 + e^{−θ(x)}). Distance on parameter: L_2(G) on Pr_θ (G the marginal of X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(θ(x_i), σ^2), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on θ. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t : t ∈ [0,n]), ergodic, recurrent, dX_t = θ(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ||·/σ||_{µ_0,2}). Norm on W: L_2(µ_0). (µ_0 the stationary measure.)

177 Brownian motion: rate calculation
Small ball probability: φ_0(ε) ≍ (1/ε)^2; solving (1/ε)^2 ≤ n ε^2 gives ε_n ≍ n^{−1/4}.
Approximation: if w_0 ∈ C^β[0,1], β ≤ 1,
  inf{ ||h||_H^2 : h ∈ H, ||h − w_0||_∞ < ε } ≲ ε^{−(2−2β)/β}
(attained for h = w_0 * φ_σ with σ ≍ ε^{1/β}); solving ε^{−(2−2β)/β} ≤ n ε^2 gives ε_n ≍ n^{−β/2}.
The contraction rate is the slowest of the two rates.
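A small sympy check of the exponent algebra above (not in the slides; it treats the ≍ relations as equalities and solves on the log scale).

import sympy as sp

beta = sp.symbols('beta', positive=True)
log_n, log_eps = sp.symbols('log_n log_eps')

# Small-ball condition: (1/eps)^2 = n * eps^2, i.e. -2*log(eps) = log(n) + 2*log(eps).
sol1 = sp.solve(sp.Eq(-2 * log_eps, log_n + 2 * log_eps), log_eps)[0]
print(sp.simplify(sol1 / log_n))  # -1/4, i.e. eps_n = n^(-1/4)

# Approximation condition for w_0 in C^beta: eps^(-(2-2*beta)/beta) = n * eps^2.
sol2 = sp.solve(sp.Eq(-(2 - 2 * beta) / beta * log_eps, log_n + 2 * log_eps), log_eps)[0]
print(sp.simplify(sol2 / log_n))  # -beta/2, i.e. eps_n = n^(-beta/2)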

178 Example radial basis stationary process
Small ball probability: φ_0(ε) ≍ (log(1/ε))^2; solving (log(1/ε))^2 ≤ n ε^2 gives ε_n ≍ n^{−1/2}(log n)^2.
Approximation: since dµ(λ) = e^{−λ^2} dλ,
  w_0(t) = ∫ e^{itλ} ŵ_0(λ) dλ = ∫ e^{itλ} ŵ_0(λ) e^{λ^2} dµ(λ).
If the function ŵ_0(λ) e^{λ^2} is in L_2(µ), then w_0 ∈ H. Otherwise approximate it by ψ(λ) = ŵ_0(λ) e^{λ^2} 1{|λ| ≤ M}, and optimize over M.
The contraction rate is the slowest of the two rates, typically the second.

179 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.

180 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.
Proof. It suffices to find sets B_n ⊂ B with
  log N(ε_n, B_n, ||·||) ≤ n ε_n^2    (complexity)
  Π_n(B_n) = 1 − o(e^{−n ε_n^2})    (remaining mass)
  Π_n(w : ||w − w_0|| < ε_n) ≥ e^{−n ε_n^2}    (prior mass)

181 Posterior contraction rates for Gaussian priors
Prior: W a centered Gaussian map in a Banach space (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε) = −log Π(||W|| < ε).
Theorem. If statistical distances on the model combine appropriately with the norm of B, then the posterior contraction rate is ε_n if
  φ_0(ε_n) ≤ n ε_n^2   AND   inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε_n } ≤ n ε_n^2.
Proof. It suffices to find sets B_n ⊂ B with
  log N(ε_n, B_n, ||·||) ≤ n ε_n^2    (complexity)
  Π_n(B_n) = 1 − o(e^{−n ε_n^2})    (remaining mass)
  Π_n(w : ||w − w_0|| < ε_n) ≥ e^{−n ε_n^2}    (prior mass)
Take B_n = M_n H_1 + ε_n B_1 for an appropriate M_n.

182 Prior mass = decentered small ball probability
W a centered Gaussian map in (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε).
  φ_{w_0}(ε) := φ_0(ε) + (1/2) inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε }.

183 Prior mass = decentered small ball probability
W a centered Gaussian map in (B, ||·||) with RKHS (H, ||·||_H) and small ball exponent φ_0(ε).
  φ_{w_0}(ε) := φ_0(ε) + (1/2) inf{ ||h||_H^2 : h ∈ H, ||h − w_0|| < ε }.
Theorem. Pr(||W − w_0|| < 2ε) ≥ e^{−φ_{w_0}(ε)}.

184 Prior mass = decentered small ball probability: proof
Proof. (Sketch) For h ∈ H the distribution of W + h is absolutely continuous relative to that of W, and
  Pr(||W − h|| < ε) = E e^{Uh − (1/2)||h||_H^2} 1{||W|| < ε},
where Uh is the zero-mean Gaussian variable with variance ||h||_H^2 associated with h. The left side does not change if −h replaces h. Take the average:
  Pr(||W − h|| < ε) = E (1/2)(e^{Uh} + e^{−Uh}) e^{−(1/2)||h||_H^2} 1{||W|| < ε} ≥ e^{−(1/2)||h||_H^2} Pr(||W|| < ε).

185 Prior mass = decentered small ball probability: proof
Proof. (Sketch) For h ∈ H the distribution of W + h is absolutely continuous relative to that of W, and
  Pr(||W − h|| < ε) = E e^{Uh − (1/2)||h||_H^2} 1{||W|| < ε},
where Uh is the zero-mean Gaussian variable with variance ||h||_H^2 associated with h. The left side does not change if −h replaces h. Take the average:
  Pr(||W − h|| < ε) = E (1/2)(e^{Uh} + e^{−Uh}) e^{−(1/2)||h||_H^2} 1{||W|| < ε} ≥ e^{−(1/2)||h||_H^2} Pr(||W|| < ε).
For a general w_0: if h ∈ H with ||w_0 − h|| < ε, then ||W − h|| < ε implies ||W − w_0|| < 2ε.

186 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).

187 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).
Theorem (Borell '75). For H_1 and B_1 the unit balls of the RKHS and of B,
  Pr(W ∉ M H_1 + ε B_1) ≤ 1 − Φ( Φ^{−1}(e^{−φ_0(ε)}) + M ).

188 Complexity and remaining mass
Theorem. The closure of H in B is the support of the Gaussian measure (and hence the posterior is inconsistent if w_0 is at positive distance from this closure).
Theorem (Borell '75). For H_1 and B_1 the unit balls of the RKHS and of B,
  Pr(W ∉ M H_1 + ε B_1) ≤ 1 − Φ( Φ^{−1}(e^{−φ_0(ε)}) + M ).
Corollary. For M(W) a median of ||W|| and σ^2(W) = sup_{||b*|| ≤ 1} var b*(W),
  Pr(||W|| ≥ M(W) + x) ≤ 1 − Φ(x/σ(W)) ≤ e^{−x^2/(2σ^2(W))}.
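Not from the slides: a Monte Carlo sanity check of the corollary for Brownian motion on [0,1] with the uniform norm, where σ(W) = 1 (the variance of b*(W) over the dual unit ball is maximal for evaluation at t = 1); sample sizes and grid are arbitrary choices.

import numpy as np
from math import erf, sqrt

def Phi(x):
    # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(2)
n_paths, n_grid = 40_000, 250
incr = rng.normal(scale=np.sqrt(1.0 / n_grid), size=(n_paths, n_grid))
sup_norm = np.abs(np.cumsum(incr, axis=1)).max(axis=1)  # ||W|| for the uniform norm

median = np.median(sup_norm)  # empirical estimate of M(W)
sigma = 1.0                   # sd of W_1, the maximal-variance linear functional here
for x in [0.5, 1.0, 1.5, 2.0]:
    empirical = float(np.mean(sup_norm >= median + x))
    print(x, empirical, "<=", 1.0 - Phi(x / sigma))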

189 Adaptation
Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by adapting the prior to the data by
hierarchical Bayes: putting a prior on the regularity, or on a scaling;
empirical Bayes: using a regularity or scaling determined by maximum likelihood on the marginal distribution of the data.
The first is known to work in some generality. For the second there are some, but not many, results.

190 Adaptation by random scaling: example
Choose A^d from a Gamma distribution. Choose (G_t : t ∈ R_+^d) a radial basis stationary Gaussian process. Set W_t = G_{At}.
Theorem.
If w_0 ∈ C^β[0,1]^d, then the rate of contraction is nearly n^{−β/(2β+d)}.
If w_0 is supersmooth, then the rate is nearly n^{−1/2}.
Proof. Use the basic contraction theorem (and careful estimates).
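A sketch, not from the slides, of one draw from this randomly rescaled prior for d = 1 (so A itself is Gamma distributed); the Gamma parameters, grid size and jitter are arbitrary choices.

import numpy as np

def scaled_squared_exp_path(n_grid=200, a_shape=1.0, a_rate=1.0, seed=3):
    # Draw W_t = G_{A t} on [0,1]: G a squared-exponential ("radial basis") GP,
    # A ~ Gamma (the case d = 1 of A^d ~ Gamma). Discretized illustration only.
    rng = np.random.default_rng(seed)
    A = rng.gamma(shape=a_shape, scale=1.0 / a_rate)
    t = np.linspace(0.0, 1.0, n_grid)
    s = A * t                                        # rescaled time points
    cov = np.exp(-np.subtract.outer(s, s) ** 2)      # squared-exponential kernel
    cov += 1e-8 * np.eye(n_grid)                     # jitter for numerical stability
    w = rng.multivariate_normal(np.zeros(n_grid), cov)
    return t, w, A

t, w, A = scaled_squared_exp_path()
print("scaling A =", A, "path range:", w.min(), w.max())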

191 Acknowledgement Harry van Zanten

192 Dirichlet mixtures

193 Acknowledgement Subhashis Ghosal
