Nonparametric Bayesian Uncertainty Quantification


1 Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017

2 Contents Introduction Recovery Gaussian process priors Dirichlet process Dirichlet process mixtures

3 Introduction

4 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(θ ∈ B | X).

5 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(θ ∈ B | X). We assume whatever is needed (e.g. Θ Polish and Π a probability distribution on its Borel σ-field; a Polish sample space) to make this well defined.

6 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(θ ∈ B | X). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives Π(Θ ∈ B | X = x) = ∫_B p_θ(x) dΠ(θ) / ∫_Θ p_θ(x) dΠ(θ).

7 Bayes's rule A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a measure P_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution: Π(θ ∈ B | X). If P_θ is given by a density x ↦ p_θ(x), then Bayes's rule gives dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

8 Reverend Thomas Thomas Bayes (c. 1701–1761, published 1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n, θ). Using his famous rule he computed that the posterior distribution is then Beta(X+1, n−X+1). P(a ≤ Θ ≤ b) = b − a, 0 < a < b < 1, P(X = x | Θ = θ) = (n choose x) θ^x (1−θ)^{n−x}, x = 0, 1, ..., n, dΠ(θ | X) ∝ θ^X (1−θ)^{n−X} dθ, 0 < θ < 1.

9–10 [Figures: the Beta(X+1, n−X+1) posterior density, for n = 20.]
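
As a quick numerical illustration (a sketch, not part of the slides; n = 20 as in the figure, with an arbitrary illustrative value for x), the Beta(X+1, n−X+1) posterior can be evaluated with scipy:

    from scipy.stats import beta

    n, x = 20, 14                       # n as in the figure; x = 14 is an arbitrary illustrative value
    posterior = beta(x + 1, n - x + 1)  # uniform prior + binomial(n, theta) likelihood

    print("posterior mean:", posterior.mean())             # (x + 1) / (n + 2)
    print("95% credible interval:", posterior.interval(0.95))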

11 Parametric Bayes Pierre-Simon Laplace (1749–1827) rediscovered Bayes' argument and applied it to general parametric models: models smoothly indexed by a Euclidean parameter θ. He had many followers, but Ronald Aylmer Fisher (1890–1962) did not buy into it. The Bayesian method regained popularity following the development of MCMC methods in the 1980s/90s. Nonparametric Bayesian statistics took off in the 1970s–90s, although it was long thought not to work.

12 Nonparametric Bayes If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes's formula does not change: dΠ(θ | X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions.

13–17 [Figures: functions simulated from the prior and from the posterior distribution.]

18 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set dependent on X. RECOVERY We like Π(· | X) to put most of its mass near θ_0 for most X.

19 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set dependent on X. RECOVERY We like Π(· | X) to put most of its mass near θ_0 for most X. UNCERTAINTY QUANTIFICATION We like the spread of Π(· | X) to indicate the remaining uncertainty.

20 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set dependent on X. RECOVERY We like Π(· | X) to put most of its mass near θ_0 for most X. UNCERTAINTY QUANTIFICATION We like the spread of Π(· | X) to indicate the remaining uncertainty. A set C_X with Π(θ ∈ C_X | X) = 0.95 is called a credible set. Is it a confidence set? Is P_{θ_0}(C_X ∋ θ_0) = 0.95? Is its order of magnitude correct?

21 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set dependent on X. RECOVERY We like Π(· | X) to put most of its mass near θ_0 for most X. UNCERTAINTY QUANTIFICATION We like the spread of Π(· | X) to indicate the remaining uncertainty. A set C_X with Π(θ ∈ C_X | X) = 0.95 is called a credible set. Is it a confidence set? Is P_{θ_0}(C_X ∋ θ_0) = 0.95? Is its order of magnitude correct? Asymptotic setting: data X^(n), where the information increases as n → ∞. We want Π_n(· | X^(n)) ⇝ δ_{θ_0}, at a good rate. We like P_{θ_0}(C_{X^(n)} ∋ θ_0) → 0.95.

22 Parametric models Suppose the data are a random sample X_1, ..., X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem. [Laplace, Bernstein, von Mises, Le Cam 1989] Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(· | X_1, ..., X_n) − N_d(θ̂_n, (1/n) I_{θ_0}^{-1})‖ → 0. Here θ̂_n are estimators with √n(θ̂_n − θ_0) ⇝ N(0, I_{θ_0}^{-1}).

23 Parametric models Suppose the data are a random sample X_1, ..., X_n from a density x ↦ p_θ(x) that is smoothly and identifiably parametrized by a vector θ ∈ R^d (e.g. θ ↦ √p_θ continuously differentiable as a map in L_2(µ)). Theorem. [Laplace, Bernstein, von Mises, Le Cam 1989] Under P^n_{θ_0}-probability, for any prior with density that is positive around θ_0, ‖Π(· | X_1, ..., X_n) − N_d(θ̂_n, (1/n) I_{θ_0}^{-1})‖ → 0. Here θ̂_n are estimators with √n(θ̂_n − θ_0) ⇝ N(0, I_{θ_0}^{-1}). RECOVERY: The posterior distribution concentrates most of its mass on balls of radius O(1/√n) around θ_0. UNCERTAINTY QUANTIFICATION: A central set of posterior probability 95% is equivalent to the usual Wald confidence set {θ : n(θ − θ̂_n)^T I_{θ̂_n} (θ − θ̂_n) ≤ χ²_{d,1−α}}.
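
A small numerical sketch of this Bernstein–von Mises statement in the simplest parametric model (binomial with uniform prior; illustrative values, not from the slides): the total variation distance between the Beta posterior and its normal approximation shrinks as n grows.

    import numpy as np
    from scipy.stats import beta, norm

    rng = np.random.default_rng(0)
    theta0 = 0.3
    grid = np.linspace(0.001, 0.999, 4001)
    dx = grid[1] - grid[0]
    for n in (20, 200, 2000):
        x = rng.binomial(n, theta0)
        theta_hat = x / n                                    # MLE; I_theta = 1/(theta(1-theta))
        post = beta(x + 1, n - x + 1)                        # posterior for a uniform prior
        approx = norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))
        tv = 0.5 * np.sum(np.abs(post.pdf(grid) - approx.pdf(grid))) * dx
        print(f"n={n}: total variation distance ~ {tv:.3f}")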

24 These lectures Recovery and uncertainty quantification for nonparametric models. LECTURE 1: Introduction to recovery. LECTURE 2: Uncertainty quantification for curve fitting/smoothing. LECTURE 3: Uncertainty quantification in high dimensions under sparsity. Point of view: How does the posterior distribution for natural priors behave, in particular for priors that adapt to complexity in the data?

25 Recovery

26 Consistency X^(n) an observation in a sample space (X^(n), 𝒳^(n)) with distribution P^(n)_θ. θ belongs to a metric space (Θ, d). Definition. The posterior distribution is consistent at θ_0 ∈ Θ if, for every ǫ > 0, E_{θ_0} Π_n(θ : d(θ, θ_0) > ǫ | X^(n)) → 0, as n → ∞.

27 Schwartz's theorem (1965) Parameter p: a ν-density on the sample space (X, 𝒳). True value p_0. K(p_0; p) = ∫ p_0 log(p_0/p) dν.

28 Schwartz's theorem (1965) Parameter p: a ν-density on the sample space (X, 𝒳). True value p_0. K(p_0; p) = ∫ p_0 log(p_0/p) dν. Definition. p_0 is said to possess the Kullback-Leibler property relative to Π if Π(p : K(p_0; p) < ǫ) > 0 for every ǫ > 0. Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π.

29 Schwartz's theorem (1965) Parameter p: a ν-density on the sample space (X, 𝒳). True value p_0. K(p_0; p) = ∫ p_0 log(p_0/p) dν. Definition. p_0 is said to possess the Kullback-Leibler property relative to Π if Π(p : K(p_0; p) < ǫ) > 0 for every ǫ > 0. Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property, and for every neighbourhood U of p_0 there exist tests φ_n such that P_0^n φ_n → 0, sup_{p ∈ U^c} P^n(1 − φ_n) → 0, then Π_n(· | X_1, ..., X_n) is consistent at p_0.

30 Extended Schwartz's theorem Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π. Theorem. If p_0 has the KL-property and for every neighbourhood U of p_0 there exist C > 0, sets P_n ⊂ P and tests φ_n such that Π(P \ P_n) < e^{−Cn}, P_0^n φ_n ≤ e^{−Cn}, sup_{p ∈ P_n ∩ U^c} P^n(1 − φ_n) ≤ e^{−Cn}, then Π_n(· | X_1, ..., X_n) is consistent at p_0. Tests exist if: Weak topology: always. L_1-distance: if log N(ǫ, P_n, ‖·‖_1) ≤ nǫ²/3, for all ǫ > 0. Definition. The covering number N(ǫ, P, d) is the minimal number of d-balls of radius ǫ needed to cover P.

31 Rate of contraction Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π. Definition. The posterior distribution Π_n(· | X^(n)) contracts at rate ǫ_n → 0 at θ_0 ∈ Θ if, for every M_n → ∞, E_{θ_0} Π_n(θ : d(θ, θ_0) > M_n ǫ_n | X^(n)) → 0, as n → ∞.

32 Rate of contraction Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π. Definition. The posterior distribution Π_n(· | X^(n)) contracts at rate ǫ_n → 0 at θ_0 ∈ Θ if, for every M_n → ∞, E_{θ_0} Π_n(θ : d(θ, θ_0) > M_n ǫ_n | X^(n)) → 0, as n → ∞. Benchmark rate for curve fitting: a function θ of d variables that has bounded derivatives of order β is estimable based on n observations at rate n^{−β/(2β+d)}. Proposition. If the posterior distribution contracts at rate ǫ_n at θ_0, then θ̂_n defined as the center of a (nearly) smallest ball that contains posterior mass at least 1/2 satisfies d(θ̂_n, θ_0) = O_P(ǫ_n) under P^(n)_{θ_0}.
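
For orientation, the benchmark rate n^{−β/(2β+d)} is easy to tabulate (a sketch with arbitrary illustrative values of n, β and d):

    n = 1000
    for d in (1, 2):
        for b in (0.5, 1.0, 2.0):
            print(f"d={d}, beta={b}: n^(-beta/(2*beta+d)) = {n ** (-b / (2 * b + d)):.3f}")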

33 Basic contraction theorem Bayesian model: X_1, ..., X_n | p iid ∼ p, p ∼ Π. K(p_0; p) = P_0 log(p_0/p), V(p_0; p) = P_0 (log(p_0/p))², h²(p_0, p) = ∫ (√p_0 − √p)² dν. Theorem. Given a metric d ≤ h whose balls are convex, suppose that there exist P_n ⊂ P and C > 0 such that (i) Π_n(p : K(p_0; p) < ǫ_n², V(p_0; p) < ǫ_n²) ≥ e^{−Cnǫ_n²}, (prior mass) (ii) log N(ǫ_n, P_n, d) ≤ nǫ_n², (complexity) (iii) Π_n(P_n^c) ≤ e^{−(C+4)nǫ_n²}. Then the posterior contracts at rate ǫ_n relative to d, provided ǫ_n ≥ n^{−1/2}.

34 Basic contraction theorem proof Proof. There exist tests φ_n with P_0^n φ_n ≤ e^{nǫ_n²} e^{−nM²ǫ_n²/8} / (1 − e^{−nM²ǫ_n²/8}) and sup_{p ∈ P_n : d(p,p_0) > Mǫ_n} P^n(1 − φ_n) ≤ e^{−nM²ǫ_n²/8}. For A_n = {∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) ≥ e^{−(2+C)nǫ_n²}}, Π_n(p : d(p, p_0) > Mǫ_n | X_1, ..., X_n) ≤ φ_n + 1{A_n^c} + e^{(2+C)nǫ_n²} ∫_{d(p,p_0) > Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) (1 − φ_n). Furthermore P_0^n(A_n^c) → 0: see further on.

35 Basic contraction theorem proof continued Proof. (Continued) By Fubini, P_0^n ∫_{p ∈ P_n : d(p,p_0) > Mǫ_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) (1 − φ_n) ≤ ∫_{p ∈ P_n : d(p,p_0) > Mǫ_n} P^n(1 − φ_n) dΠ_n(p) ≤ e^{−nM²ǫ_n²/8}. By Fubini, P_0^n ∫_{P \ P_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(p) ≤ Π_n(P \ P_n).

36 Bounding the denominator Lemma. For any probability measure Π on P and positive constant ǫ, with P_0^n-probability at least 1 − (nǫ²)^{−1}, ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p : K(p_0; p) < ǫ², V(p_0; p) < ǫ²) e^{−2nǫ²}.

37 Bounding the denominator Lemma. For any probability measure Π on P and positive constant ǫ, with P_0^n-probability at least 1 − (nǫ²)^{−1}, ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(p) ≥ Π(p : K(p_0; p) < ǫ², V(p_0; p) < ǫ²) e^{−2nǫ²}. Proof. Let B := {p : K(p_0; p) < ǫ², V(p_0; p) < ǫ²} and let Π_B = Π(· ∩ B)/Π(B). By Jensen's inequality, log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_B(p) ≥ ∫ ∑_{i=1}^n log(p/p_0)(X_i) dΠ_B(p) =: Z. Then EZ = −n ∫ K(p_0; p) dΠ_B(p) > −nǫ², and var Z ≤ n P_0(∫ log(p_0/p) dΠ_B(p))² ≤ n P_0 ∫ (log(p_0/p))² dΠ_B(p) ≤ nǫ². Apply Chebyshev's inequality.

38 Interpretation Consider a maximal set of points p_1, ..., p_N in P_n with d(p_i, p_j) ≥ ǫ_n. Maximality implies N ≥ N(ǫ_n, P_n, d), which is of the order e^{c_1 nǫ_n²} when the entropy bound is (nearly) sharp. The balls of radius ǫ_n/2 around the points are disjoint and hence the sum of their prior masses will be less than 1. If the prior mass were evenly distributed over these balls, then each would have no more mass than e^{−c_1 nǫ_n²}. This is of the same order as the prior mass bound. This argument suggests that the conditions can only be satisfied for every p_0 in the model if the prior distributes its mass uniformly, at discretization level ǫ_n.

39 Gaussian process priors

40 Gaussian process prior The law of a stochastic process W = (W_t : t ∈ T) is a prior distribution on the space of functions θ : T → R. W is a Gaussian process if (W_{t_1}, ..., W_{t_k}) is multivariate Gaussian, for every t_1, ..., t_k. Mean and covariance function: t ↦ EW_t and (s, t) ↦ cov(W_s, W_t), s, t ∈ T.

41 Brownian motion and its primitives [Figure: sample paths of Brownian motion and of 1, 2 and 3 times integrated Brownian motion.]
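
Paths like those in the figure can be simulated on a grid by cumulative sums (a rough sketch; the repeated Riemann-sum integration below is an assumption about how such pictures are typically produced, not the slides' code):

    import numpy as np

    rng = np.random.default_rng(1)
    m = 1000                                                  # grid points on [0, 1]
    dt = 1.0 / m

    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=m))       # Brownian motion path
    paths = [W]
    for _ in range(3):                                        # 1, 2 and 3 times integrated Brownian motion
        paths.append(np.cumsum(paths[-1]) * dt)               # primitive, released at zero

    for k, path in enumerate(paths):
        print(f"{k} times integrated: value at t=1 is {path[-1]:.4f}")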

42 Settings Density estimation: X_1, ..., X_n iid in [0,1], p_θ(x) = e^{θ(x)} / ∫_0^1 e^{θ(t)} dt. Distance on parameter: Hellinger on p_θ. Norm on W: uniform. Classification: (X_1, Y_1), ..., (X_n, Y_n) iid in [0,1] × {0,1}, P_θ(Y = 1 | X = x) = 1/(1 + e^{−θ(x)}). Distance on parameter: L_2(G) on P_θ (G the marginal of X_i). Norm on W: L_2(G). Regression: Y_1, ..., Y_n independent N(θ(x_i), σ²), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on θ. Norm on W: empirical L_2-distance. Ergodic diffusions: (X_t : t ∈ [0,n]), ergodic, recurrent: dX_t = θ(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ‖·/σ‖_{µ_0,2}). Norm on W: L_2(µ_0) (µ_0 the stationary measure).

43 Posterior contraction rates for Gaussian priors View the Gaussian process W as a measurable map into a Banach space (B, ‖·‖). Theorem. If the statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if P(‖W − w_0‖ < ε_n) ≥ e^{−nε_n²}.

44 Posterior contraction rates for Gaussian priors View the Gaussian process W as a measurable map into a Banach space (B, ‖·‖). Theorem. If the statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if P(‖W − w_0‖ < ε_n) ≥ e^{−nε_n²}. Proof. The stated condition is the prior mass condition. Complexity is automatic due to the concentration of Gaussian processes.
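
The prior-mass condition can be probed by crude Monte Carlo for Brownian motion with the uniform norm (a sketch; w_0 below is an arbitrary smooth function, and the grid discretization is a simplification):

    import numpy as np

    rng = np.random.default_rng(2)
    m, reps = 200, 20000
    dt = 1.0 / m
    t = np.linspace(dt, 1.0, m)
    w0 = 0.5 * t * (1 - t)                                    # arbitrary smooth "truth"

    W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(reps, m)), axis=1)
    for eps in (1.0, 0.75, 0.5, 0.25):
        prob = np.mean(np.max(np.abs(W - w0), axis=1) < eps)
        print(f"eps={eps}: P(||W - w0||_inf < eps) ~ {prob:.4f}")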

45 Posterior contraction rates for Gaussian priors View the Gaussian process W as a measurable map into a Banach space (B, ‖·‖). Theorem. If the statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if P(‖W − w_0‖ < ε_n) ≥ e^{−nε_n²}. An equivalent condition is φ_0(ε_n) ≤ nε_n² AND inf_{h ∈ H : ‖h − w_0‖ < ε_n} ‖h‖_H² ≤ nε_n², where φ_0(ε) = −log P(‖W‖ < ε) is the small ball exponent and (H, ‖·‖_H) is the RKHS. Both inequalities give a lower bound on ε_n. The first depends on W and not on w_0.

46 Brownian Motion Theorem. If θ_0 ∈ C^β[0,1], then the rate for Brownian motion is n^{−β/2} if β ≤ 1/2 and n^{−1/4} for every β ≥ 1/2. The rate is n^{−β/(2β+1)} iff β = 1/2.

47 Brownian Motion Theorem. If θ_0 ∈ C^β[0,1], then the rate for Brownian motion is n^{−β/2} if β ≤ 1/2 and n^{−1/4} for every β ≥ 1/2. The rate is n^{−β/(2β+1)} iff β = 1/2. The small ball probability of Brownian motion is P(‖W‖ < ε) ≍ e^{−(1/ε)²}, as ε → 0. Solving φ_0(ε_n) ≍ (1/ε_n)² = nε_n² gives ε_n ≍ n^{−1/4}. This causes an n^{−1/4}-rate even for very smooth truths.

48 Integrated Brownian Motion Theorem. If θ_0 ∈ C^β[0,1], then the rate for (α − 1/2)-times integrated Brownian motion is n^{−(α∧β)/(2α+1)}. The rate is n^{−β/(2β+1)} iff β = α.

49 Integrated Brownian motion 1/c ∼ Γ(a,b). (G_t : t ≥ 0) the k-fold integral of Brownian motion released at zero, W_t = √c G_t. Theorem. The prior W = (√c G_t : 0 ≤ t ≤ 1) gives contraction rate n^{−β/(2β+1)} for θ_0 ∈ C^β[0,1], for any β ∈ (0, k+1]. Bayes solves the bandwidth problem.

50 Square exponential process cov(G_s, G_t) = e^{−‖s−t‖²}, s, t ∈ R^d. Theorem. The prior G gives a rate (log n)^γ/√n if θ_0 is analytic, but may give a rate as slow as (log n)^{−γ} if θ_0 is only ordinary smooth.

51 Square exponential process adaptation by random time scaling c^d ∼ Γ. (G_t : t > 0) the square exponential process. W_t = G_{ct}. Theorem. If θ_0 ∈ C^β[0,1]^d, then the rate of contraction is nearly n^{−β/(2β+d)}. If θ_0 is supersmooth, then the rate is nearly n^{−1/2}.
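
A sketch of drawing from the randomly rescaled square exponential process on a grid (d = 1, an illustrative Gamma prior on the scale c, and a jitter term added for a stable Cholesky factorization):

    import numpy as np

    rng = np.random.default_rng(3)
    t = np.linspace(0.0, 1.0, 200)

    def draw_rescaled_sqexp(t, rng):
        c = rng.gamma(shape=2.0, scale=2.0)                   # random time scale (illustrative hyperparameters)
        s = c * t                                             # W_t = G_{ct}
        K = np.exp(-np.subtract.outer(s, s) ** 2)             # cov(G_u, G_v) = exp(-|u - v|^2)
        K += 1e-8 * np.eye(len(t))                            # jitter for numerical stability
        return np.linalg.cholesky(K) @ rng.standard_normal(len(t))

    for _ in range(3):
        path = draw_rescaled_sqexp(t, rng)
        print(f"draw with random scale: max |W_t| = {np.max(np.abs(path)):.3f}")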

52 Gaussian processes: summary Recovery is best if prior matches truth. Mismatch slows down, but does not prevent, recovery. Mismatch can be prevented by using hyperparameters.

53 Dirichlet process

54 Finite-dimensional Dirichlet distribution Definition. (θ_1, ..., θ_k) possesses a Dir(k; α_1, ..., α_k)-distribution for α_i > 0 if it has (Lebesgue) density on the unit simplex proportional to θ_1^{α_1−1} θ_2^{α_2−1} ⋯ θ_k^{α_k−1}.

55 Finite-dimensional Dirichlet distribution Definition. (θ_1, ..., θ_k) possesses a Dir(k; α_1, ..., α_k)-distribution for α_i > 0 if it has (Lebesgue) density on the unit simplex proportional to θ_1^{α_1−1} θ_2^{α_2−1} ⋯ θ_k^{α_k−1}. We extend to α_i = 0 for one or more i on the understanding that then θ_i = 0. Theorem. If θ ∼ Dir(k; α) and N | θ ∼ Multinom(n; θ), then θ | N ∼ Dir(k; α + N). Proof. Π(dθ | N) ∝ ∏_{i=1}^k θ_i^{N_i} ∏_{i=1}^k θ_i^{α_i−1}.
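
A numerical check of this conjugacy (a sketch with arbitrary illustrative numbers): posterior draws from Dir(k; α + N) reproduce the exact posterior mean (α + N)/(|α| + n).

    import numpy as np

    rng = np.random.default_rng(4)
    alpha = np.array([1.0, 1.0, 1.0])                         # Dir(3; 1, 1, 1) prior
    theta_true = np.array([0.2, 0.3, 0.5])
    N = rng.multinomial(100, theta_true)                      # observed cell counts

    draws = rng.dirichlet(alpha + N, size=5000)               # theta | N ~ Dir(3; alpha + N)
    print("counts:", N)
    print("Monte Carlo posterior mean:", draws.mean(axis=0).round(3))
    print("exact posterior mean:      ", ((alpha + N) / (alpha.sum() + N.sum())).round(3))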

56 Dirichlet process Definition. A random measure P on a measurable space (X, 𝒳) is a Dirichlet process with base measure α, if for every partition A_1, ..., A_k of X, (P(A_1), ..., P(A_k)) ∼ Dir(k; α(A_1), ..., α(A_k)).

57 Dirichlet process Definition. A random measure P on a measurable space (X, 𝒳) is a Dirichlet process with base measure α, if for every partition A_1, ..., A_k of X, (P(A_1), ..., P(A_k)) ∼ Dir(k; α(A_1), ..., α(A_k)). Lemma. EP(A) = α(A)/|α|, where |α| = α(X).


59 Dirichlet process Definition. A random measure P on a measurable space (X, 𝒳) is a Dirichlet process with base measure α, if for every partition A_1, ..., A_k of X, (P(A_1), ..., P(A_k)) ∼ Dir(k; α(A_1), ..., α(A_k)). Lemma. EP(A) = α(A)/|α|. Theorem. If θ_1, θ_2, ... iid ∼ ᾱ and Y_1, Y_2, ... iid ∼ Be(1, M) are independent, and W_j = (1 − Y_1) ⋯ (1 − Y_{j−1}) Y_j, then P := ∑_{j=1}^∞ W_j δ_{θ_j} ∼ DP(Mᾱ).
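
A sketch of this stick-breaking construction, truncated at a finite number of sticks, with ᾱ taken to be the standard normal distribution and M = 5 (both illustrative choices):

    import numpy as np

    rng = np.random.default_rng(5)

    def draw_dp(M, n_sticks, rng):
        theta = rng.standard_normal(n_sticks)                          # atoms theta_j ~ base measure
        Y = rng.beta(1.0, M, size=n_sticks)                            # stick proportions Y_j ~ Be(1, M)
        W = Y * np.cumprod(np.concatenate(([1.0], 1.0 - Y[:-1])))      # W_j = (1-Y_1)...(1-Y_{j-1}) Y_j
        return theta, W

    theta, W = draw_dp(M=5.0, n_sticks=500, rng=rng)
    print("weight captured by the truncation:", round(W.sum(), 6))     # close to 1
    print("mean of the realized random measure:", round(float(np.sum(W * theta)), 3))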

60 Conjugacy of the Dirichlet process P ∼ DP(α), X_1, X_2, ... | P iid ∼ P. Theorem. P | X_1, ..., X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}.

61 Conjugacy of the Dirichlet process P ∼ DP(α), X_1, X_2, ... | P iid ∼ P. Theorem. P | X_1, ..., X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}. Proof. (P(A_1), ..., P(A_k)) | X_1, ..., X_n ∼ (P(A_1), ..., P(A_k)) | (N_1, ..., N_k), where N_i = nP_n(A_i). Apply the result for the finite-dimensional Dirichlet distribution.

62 Conjugacy of the Dirichlet process P ∼ DP(α), X_1, X_2, ... | P iid ∼ P. Theorem. P | X_1, ..., X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}. Corollary. E(P(A) | X_1, ..., X_n) = α(A)/(|α| + n) + (n/(|α| + n)) P_n(A).
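
The corollary says the posterior mean is a convex combination of the prior mean ᾱ(A) and the empirical frequency P_n(A); a one-line check with illustrative numbers:

    M, prior_mean_A = 2.0, 0.5        # |alpha| and alpha(A)/|alpha| (illustrative)
    n, Pn_A = 50, 0.62                # sample size and empirical frequency of A
    post_mean_A = (M * prior_mean_A + n * Pn_A) / (M + n)
    print(post_mean_A)                # = alpha(A)/(|alpha|+n) + n/(|alpha|+n) * Pn(A)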

63 Conjugacy of the Dirichlet process P ∼ DP(α), X_1, X_2, ... | P iid ∼ P. Theorem. P | X_1, ..., X_n ∼ DP(α + nP_n), for P_n = n^{−1} ∑_{i=1}^n δ_{X_i}. Theorem. The posterior distribution of P(A) satisfies a Bernstein–von Mises theorem: ‖Π(P(A) ∈ · | X_1, ..., X_n) − N(P_n(A), P_0(A)(1 − P_0(A))/n)‖ → 0.

64 Pitman-Yor process θ_1, θ_2, ... iid ∼ ᾱ, independent of Y_j ∼ind Be(1 − σ, M + jσ). W_j = (1 − Y_1) ⋯ (1 − Y_{j−1}) Y_j. P := ∑_{j=1}^∞ W_j δ_{θ_j}. Theorem. The posterior distribution based on X_1, ..., X_n | P iid ∼ P is inconsistent unless σ = 0 or P_0 = ᾱ or P_0 is discrete.

65 Dirichlet process mixtures

66 Dirichlet normal mixtures p_{F,σ}(x) = ∫ (1/σ) φ((x − z)/σ) dF(z). X_1, ..., X_n | (F, σ) iid ∼ p_{F,σ}, F ∼ DP(α), σ ∼ π. [Figure: posterior mean (solid black) and 10 draws from the posterior distribution for a sample of size 50 from a mixture of two normals (red).]

67 Dirichlet normal mixtures p_{F,σ}(x) = ∫ (1/σ) φ((x − z)/σ) dF(z). X_1, ..., X_n | (F, σ) iid ∼ p_{F,σ}, F ∼ DP(α), σ ∼ π. Two cases for the true density p_0: Supersmooth: p_0 = p_{F_0,σ_0}, for compactly supported F_0. Take a prior for σ with continuous positive density on an interval (a,b) ∋ σ_0. Ordinary smooth: p_0 has β derivatives and exponentially small tails. Take 1/σ a priori Gamma distributed.

68 Dirichlet normal mixtures p_{F,σ}(x) = ∫ (1/σ) φ((x − z)/σ) dF(z). X_1, ..., X_n | (F, σ) iid ∼ p_{F,σ}, F ∼ DP(α), σ ∼ π. Two cases for the true density p_0: Supersmooth: p_0 = p_{F_0,σ_0}, for compactly supported F_0. Take a prior for σ with continuous positive density on an interval (a,b) ∋ σ_0. Ordinary smooth: p_0 has β derivatives and exponentially small tails. Take 1/σ a priori Gamma distributed. Theorem. The Hellinger rate of contraction is nearly n^{−1/2} in the supersmooth case, and nearly n^{−β/(2β+1)} in the ordinary smooth case.
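
In practice such Dirichlet process mixtures of normals can be fitted approximately with standard software; below is a sketch using scikit-learn's BayesianGaussianMixture, which implements a truncated variational approximation with a Dirichlet-process weight prior (an approximation, not the exact posterior of the model above; data and settings are illustrative).

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(6)
    # a sample of size 50 from a mixture of two normals, as in the figure on slide 66
    X = np.concatenate([rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 0.5, 25)]).reshape(-1, 1)

    dpgmm = BayesianGaussianMixture(
        n_components=10,                                     # truncation level of the stick-breaking prior
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=1.0,                      # plays the role of the total mass |alpha|
        max_iter=500,
        random_state=0,
    ).fit(X)

    keep = dpgmm.weights_ > 0.01                             # components with non-negligible weight
    print("weights:", np.round(dpgmm.weights_[keep], 3))
    print("means:  ", np.round(dpgmm.means_[keep].ravel(), 3))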

69 Dirichlet normal mixtures p_{F,σ}(x) = ∫ (1/σ) φ((x − z)/σ) dF(z). X_1, ..., X_n | (F, σ) iid ∼ p_{F,σ}, F ∼ DP(α), σ ∼ π. Two cases for the true density p_0: Supersmooth: p_0 = p_{F_0,σ_0}, for compactly supported F_0. Take a prior for σ with continuous positive density on an interval (a,b) ∋ σ_0. Ordinary smooth: p_0 has β derivatives and exponentially small tails. Take 1/σ a priori Gamma distributed. Adaptation to any smoothness with a Gaussian kernel! Kernel density estimation needs higher order kernels. (The classical kernel estimator is itself a normal mixture: (1/(nσ)) ∑_{i=1}^n φ((x − X_i)/σ) = p_{F_n,σ}(x), with F_n the empirical distribution.)

70 Key lemma: finite approximation Lemma. For any probability measure F on the interval [0,1] there exists a discrete probability measure F′ with at most N ≲ log(1/ε) support points, such that ‖p_{F,1} − p_{F′,1}‖_∞ ≤ ε and ‖p_{F,1} − p_{F′,1}‖_1 ≤ ε (log(1/ε))^{1/2}. Proof. Match the moments of F and F′ up to order log(1/ε). Taylor expand the kernel z ↦ φ(x − z).
