Bayesian Regularization


1 Bayesian Regularization. Aad van der Vaart, Vrije Universiteit Amsterdam. International Congress of Mathematicians, Hyderabad, August 2010.

2 Contents: Introduction. Abstract result. Gaussian process priors.

3 Co-authors: Abstract result with Subhashis Ghosal; Gaussian process priors with Harry van Zanten.

4 Introduction

5 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a probability density p_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x: the posterior distribution,

Π(Θ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ),   i.e.   dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

8 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n, θ). Using his famous rule he computed that the posterior distribution is then Beta(X + 1, n − X + 1):

dΠ_n(θ | X) ∝ (n choose X) θ^X (1 − θ)^{n−X} · 1.

[Figure: prior and posterior densities for n = 20 and several values of X.]
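To make this concrete, here is a minimal Python sketch (not from the slides) of the Beta(X + 1, n − X + 1) posterior obtained from a uniform prior and binomial data; the values n = 20 and X = 7 are illustrative.

```python
import numpy as np
from scipy.stats import beta

n, X = 20, 7                        # illustrative data: X successes in n trials
posterior = beta(X + 1, n - X + 1)  # uniform prior + binomial likelihood

print("posterior mean:", posterior.mean())           # (X + 1) / (n + 2)
print("95% credible interval:", posterior.interval(0.95))

# posterior density on a grid, e.g. for plotting
theta = np.linspace(0, 1, 201)
density = posterior.pdf(theta)
```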

13 Nonparametric Bayes If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes' formula does not change: dΠ(θ | X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions. [Figures: sample functions simulated from the prior and from the posterior.]
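The plots on these slides are lost in this transcription, but such pictures are easy to reproduce. A minimal sketch, assuming a simple Gaussian sine-series prior chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
alpha = 1.0  # assumed regularity level of the prior draws (illustrative)

def draw_prior_path(rng, t, alpha, n_terms=200):
    """One sample path of a series prior w(t) = sum_i mu_i Z_i e_i(t)."""
    i = np.arange(1, n_terms + 1)
    mu = i ** -(alpha + 0.5)                             # decaying standard deviations
    Z = rng.standard_normal(n_terms)                     # iid N(0, 1) coefficients
    basis = np.sqrt(2) * np.sin(np.pi * np.outer(t, i))  # sine basis e_i(t)
    return basis @ (mu * Z)

for _ in range(5):                                       # a few draws from the prior
    plt.plot(t, draw_prior_path(rng, t, alpha))
plt.title("Sample paths simulated from a Gaussian series prior")
plt.show()
```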

22 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(θ ∈ · | X) as a random measure on the parameter set. We like Π(· | X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n, where the information increases as n → ∞. We like the posterior Π_n(· | X^n) to contract to {θ_0}, at a good rate.

25 Rate of contraction Assume X^n is generated according to a given parameter θ_0, where the information increases as n → ∞. The posterior is consistent if, for every ε > 0, E_{θ_0} Π(θ: d(θ, θ_0) < ε | X^n) → 1. The posterior contracts at rate at least ε_n if E_{θ_0} Π(θ: d(θ, θ_0) < ε_n | X^n) → 1. Basic results on consistency were proved by Doob (1948) and Schwartz (1965). Interest in rates is recent.
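For the binomial example above this contraction can be checked numerically. A sketch, with illustrative values for θ_0, ε and the sample sizes:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
theta0, eps, reps = 0.3, 0.1, 2000       # true parameter, ball radius, replications

for n in [10, 100, 1000, 10000]:
    X = rng.binomial(n, theta0, size=reps)
    # posterior mass of {theta: |theta - theta0| < eps} under Beta(X+1, n-X+1)
    mass = (beta.cdf(theta0 + eps, X + 1, n - X + 1)
            - beta.cdf(theta0 - eps, X + 1, n - X + 1))
    print(n, mass.mean())                # E_theta0 Pi(|theta - theta0| < eps | X) -> 1
```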

26 Minimaxity To a given model Θ is attached an optimal rate of convergence defined by the minimax criterion ε_n = inf_T sup_{θ ∈ Θ} E_θ d(T(X), θ). This criterion has nothing to do with Bayes. A prior is good if the posterior contracts at this rate. (?)

27 Adaptation A model can be viewed as an instrument to test quality. It makes sense to use a collection (Θ_α : α ∈ A) of models simultaneously, e.g. a scale of regularity classes. A posterior is good if it adapts: if the true parameter belongs to Θ_α, then the contraction rate is at least the minimax rate for this model.

28 Bayesian perspective Any prior (and hence posterior) is appropriate per se. In complex situations subject knowledge can be and must be incorporated in the prior. Computational ease is important for prior choice as well. Frequentist properties reveal key properties of priors of interest.

29 Abstract result

30 Entropy The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ. [Figure: coverings by big and by small balls.] Entropy is its logarithm: log N(ε, Θ, d).
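As a toy illustration (not from the slides): for the unit interval with the absolute-value metric, N(ε, [0, 1], |·|) = ⌈1/(2ε)⌉, so the entropy grows like log(1/ε). A sketch:

```python
import math

def covering_number_unit_interval(eps):
    """Minimal number of eps-balls (intervals of length 2*eps) needed to cover [0, 1]."""
    return math.ceil(1.0 / (2.0 * eps))

for eps in [0.5, 0.1, 0.01, 0.001]:
    N = covering_number_unit_interval(eps)
    print(f"eps={eps}: N={N}, entropy log N = {math.log(N):.2f}")
```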

31 Rate theorem, iid observations Given a random sample X_1, ..., X_n from a density p_0 and a prior Π on a set P of densities, consider the posterior

dΠ_n(p | X_1, ..., X_n) ∝ ∏_{i=1}^n p(X_i) dΠ(p).

THEOREM [Ghosal+Ghosh+vdV, 2000] The Hellinger contraction rate is ε_n if there exist P_n ⊂ P such that
(1) log N(ε_n, P_n, h) ≤ nε_n² and Π(P_n) = 1 − o(e^{−3nε_n²}). [entropy]
(2) Π(B_KL(p_0, ε_n)) ≥ e^{−nε_n²}. [prior mass]

Here h is the Hellinger distance: h²(p, q) = ∫ (√p − √q)² dμ, and B_KL(p_0, ε) is a Kullback-Leibler neighbourhood of p_0.

The entropy condition ensures that the likelihood is not too variable, so that it cannot be large at a wrong place by pure randomness. Le Cam (1964) showed that it gives the minimax rate.

33 Rate theorem, general Given data X^n following a model (P^n_θ : θ ∈ Θ) that satisfies Le Cam's testing criterion, and a prior Π, form the posterior

dΠ_n(θ | X^n) ∝ p^n_θ(X^n) dΠ(θ).

THEOREM The rate of contraction is ε_n (≥ 1/√n) if there exist Θ_n ⊂ Θ such that
(1) D_n(ε_n, Θ_n, d_n) ≤ nε_n² and Π_n(Θ \ Θ_n) = o(e^{−3nε_n²}).
(2) Π_n(B_n(θ_0, ε_n; k)) ≥ e^{−nε_n²}.

B_n(θ_0, ε; k) is a Kullback-Leibler type neighbourhood of p^n_{θ_0}. The theorem can be refined in various ways.

35 Le Cam's testing criterion Statistical model (P^n_θ : θ ∈ Θ) indexed by a metric space (Θ, d). For all ε > 0 and all θ_1 with d(θ_1, θ_0) > ε there exists a test φ_n with

P^n_{θ_0} φ_n ≤ e^{−nε²},   sup_{θ ∈ Θ: d(θ, θ_1) < ε/2} P^n_θ(1 − φ_n) ≤ e^{−nε²}.

[Figure: θ_0 and a ball of radius ε/2 around θ_1, at distance > ε.]

This applies to independent data, Markov chains, Gaussian time series, ergodic diffusions, ...

36 Adaptation by hierarchical prior (i.i.d. observations) P_{n,α} a collection of densities with prior Π_{n,α}, for α ∈ A. Prior weights λ_n = (λ_{n,α} : α ∈ A):

Π_n = Σ_{α ∈ A} λ_{n,α} Π_{n,α}.

THEOREM The Hellinger contraction rate is ε_{n,β} if the prior weights satisfy (*) below and
(1) log N(ε_{n,α}, P_{n,α}, h) ≤ nε_{n,α}², for every α ∈ A.
(2) Π_{n,β}(B_{n,β}(p_0, ε_{n,β})) ≥ e^{−nε_{n,β}²}.

B_{n,α}(p_0, ε) is a Kullback-Leibler type neighbourhood of p_0 within P_{n,α}.

37 Condition (*) on prior weights (simplified)

Σ_{α<β} (λ_{n,α}/λ_{n,β}) e^{nε_{n,α}²} + Σ_{α>β} (λ_{n,α}/λ_{n,β}) e^{nε_{n,β}²} ≤ e^{4nε_{n,β}²},
Σ_{α<β} (λ_{n,α}/λ_{n,β}) Π_{n,α}(C_{n,α}(p_0, ε_{n,α})) ≤ e^{4nε_{n,β}²}.

α < β means ε_{n,α} ≥ ε_{n,β}. C_{n,α}(p_0, ε) is the Hellinger ball of radius ε around p_0 in P_{n,α}.

In many situations there is much freedom in the choice of weights. The weights λ_{n,α} ∝ μ_α e^{−Cnε_{n,α}²} always work.

39 Model selection THEOREM Under the conditions of the theorem,

Π_n(α: α < β | X_1, ..., X_n) → 0 in probability,
Π_n(α: α ≥ β, h(P_{n,α}, p_0) ≥ ε_{n,β} | X_1, ..., X_n) → 0 in probability.

Too big models do not get posterior weight. Neither do small models that are far from the truth.

40 Examples of priors Dirichlet mixtures of normals. Discrete priors. Mixtures of betas. Series priors (splines, Fourier, wavelets, ...). Independent increment process priors. Sparse priors. Gaussian process priors.

41 Gaussian process priors

42 Gaussian process The law of a stochastic process W = (W_t : t ∈ T) is a prior distribution on the space of functions w: T → R. Gaussian processes have been found useful, because of their variety and because of computational properties. Every Gaussian prior is reasonable in some way. We shall study performance with smoothness classes as a test case.

43 Example: Brownian density estimation For W Brownian motion, use as prior on a density p on [0, 1]:

x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.

[Leonard, Lenk, Tokdar & Ghosh]

[Figure: a Brownian motion sample path t ↦ W_t and the corresponding prior density t ↦ c exp(W_t).]
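A minimal sketch of this construction on a grid (grid size illustrative; the normalizing integral is approximated by a Riemann sum):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
m = 1000                                   # grid size on [0, 1] (illustrative)
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Brownian motion on the grid: cumulative sum of independent N(0, dt) increments
W = np.concatenate([[0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(m - 1))])

# prior draw of a density: x -> exp(W_x) / integral_0^1 exp(W_y) dy
p = np.exp(W) / (np.exp(W).sum() * dt)     # Riemann-sum normalization

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(t, W); axes[0].set_title("Brownian motion W_t")
axes[1].plot(t, p); axes[1].set_title("Prior density c exp(W_t)")
plt.show()
```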

45 Integrated Brownian motion [Figure: sample paths of 0, 1, 2 and 3 times integrated Brownian motion.]
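A sketch of k-times integrated Brownian motion via repeated cumulative summation on a grid (a rough discretization, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 1000
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

def brownian_path(rng):
    """Brownian motion on the grid t."""
    increments = np.sqrt(dt) * rng.standard_normal(m - 1)
    return np.concatenate([[0.0], np.cumsum(increments)])

def integrated_bm(rng, k):
    """k-times integrated Brownian motion: integrate the path k times."""
    path = brownian_path(rng)
    for _ in range(k):
        path = np.cumsum(path) * dt        # cumulative integral; each pass adds smoothness
    return path

paths = {k: integrated_bm(rng, k) for k in range(4)}   # 0, 1, 2 and 3 times integrated
```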

46 Stationary processes [Figures: sample paths of stationary Gaussian processes with a Gaussian spectral measure and with a Matérn (3/2) spectral measure.]
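A sketch simulating such processes on a grid by a Cholesky factorization of the covariance matrix, assuming the squared-exponential kernel (Gaussian spectral measure) and the Matérn 3/2 kernel; the length scale and the numerical jitter are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 200)
D = np.abs(t[:, None] - t[None, :])          # pairwise distances

def sample_gp(cov, rng, jitter=1e-6):
    """One sample path of a centered Gaussian process with covariance matrix cov."""
    L = np.linalg.cholesky(cov + jitter * np.eye(len(cov)))
    return L @ rng.standard_normal(len(cov))

ell = 0.1                                    # illustrative length scale
cov_se = np.exp(-(D / ell) ** 2)             # squared-exponential (Gaussian spectral measure)
cov_m32 = (1 + np.sqrt(3) * D / ell) * np.exp(-np.sqrt(3) * D / ell)  # Matern 3/2

path_smooth = sample_gp(cov_se, rng)         # very smooth sample path
path_rough = sample_gp(cov_m32, rng)         # rougher (once-differentiable) sample path
```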

47 Other Gaussian processes [Figures: Brownian sheet; fractional Brownian motion for two values of alpha; a series prior w = Σ_i w_i e_i with w_i independent N(0, σ_i²).]

48 Rates for Gaussian priors The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = −log P(‖W‖ < ε).

THEOREM If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²   AND   inf_{h ∈ H: ‖h − w_0‖ < ε_n} ‖h‖_H² ≤ nε_n².

Both inequalities give a lower bound on ε_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ε_n of the order 1/√n.

51 Settings
Density estimation: X_1, ..., X_n iid in [0, 1], p_w(x) = e^{w_x} / ∫_0^1 e^{w_t} dt. Distance on parameter: Hellinger on p_w. Norm on W: uniform.
Classification: (X_1, Y_1), ..., (X_n, Y_n) iid in [0, 1] × {0, 1}, P_w(Y = 1 | X = x) = 1/(1 + e^{−w_x}). Distance on parameter: L_2(G) on P_w (G the marginal of the X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(w(x_i), σ²), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on w. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t : t ∈ [0, n]), ergodic, recurrent: dX_t = w(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ‖·/σ‖_{μ_0,2}). Norm on W: L_2(μ_0) (μ_0 the stationary measure).

52 Reproducing kernel Hilbert space To every Gaussian random element W with values in a Banach space (B, ‖·‖) is attached a certain Hilbert space (H, ‖·‖_H), called the RKHS. The norm ‖·‖_H is stronger than ‖·‖, and hence we can consider H ⊂ B.

53 Reproducing kernel Hilbert space DEFINITION For S: B* → B defined by Sb* = E W b*(W), the RKHS is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E b*_1(W) b*_2(W).

54 Reproducing kernel Hilbert space DEFINITION For a process W = (W_x : x ∈ X) with bounded sample paths and covariance function K(x, y) = E W_x W_y, the RKHS is the completion of the set of functions x ↦ Σ_i α_i K(y_i, x) under ⟨Σ_i α_i K(y_i, ·), Σ_j β_j K(z_j, ·)⟩_H = Σ_i Σ_j α_i β_j K(y_i, z_j).

55 Reproducing kernel Hilbert space EXAMPLE If W is multivariate normal N_d(0, Σ), then the RKHS is R^d with norm ‖h‖_H² = h^t Σ^{−1} h.

56 Reproducing kernel Hilbert space EXAMPLE Any W can be represented as W = Σ_{i=1}^∞ μ_i Z_i e_i, for numbers μ_i ≥ 0, iid standard normal Z_1, Z_2, ..., and e_1, e_2, ... ∈ B with ‖e_1‖ = ‖e_2‖ = ... = 1. The RKHS consists of all h := Σ_i h_i e_i with ‖h‖_H² := Σ_i h_i²/μ_i² < ∞.

57 Reproducing kernel Hilbert space EXAMPLE Brownian motion is a random element in C[0, 1]. Its RKHS is H = {h: ∫ h′(t)² dt < ∞} with norm ‖h‖_H = ‖h′‖_2.
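A sketch of the series representation with a truncated expansion, together with the RKHS norm of an element given by its coefficients; the basis, the decay of μ_i and the truncation level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n_terms = 100
i = np.arange(1, n_terms + 1)
mu = 1.0 / i                                       # illustrative decay of mu_i
t = np.linspace(0, 1, 300)
e = np.sqrt(2) * np.sin(np.pi * np.outer(i, t))    # sine basis e_i(t) (illustrative)

# Truncated draw W = sum_i mu_i Z_i e_i
Z = rng.standard_normal(n_terms)
W = (mu * Z) @ e

# An element h = sum_i h_i e_i lies in the RKHS iff sum_i h_i^2 / mu_i^2 < infinity
h_coef = 1.0 / i ** 2                              # coefficients of an illustrative h
rkhs_norm_sq = np.sum(h_coef ** 2 / mu ** 2)       # ||h||_H^2 = sum_i h_i^2 / mu_i^2
print("||h||_H^2 =", rkhs_norm_sq)
```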

58 Small ball probability The small ball probability of a Gaussian random element W in (B, ‖·‖) is P(‖W‖ < ε), and the small ball exponent φ_0(ε) is minus the logarithm of this. [Figure: small ball probabilities for the uniform norm.] It can be computed either by probabilistic arguments, or analytically from the RKHS.

THEOREM [Kuelbs & Li (93)] For H_1 the unit ball of the RKHS (up to constants),

φ_0(ε) ≍ log N(ε/√φ_0(ε), H_1, ‖·‖).

There is a big literature. (In July, … entries in the database maintained by Michael Lifshits.)
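A crude Monte Carlo check of the small ball exponent of Brownian motion (φ_0(ε) ≍ (1/ε)², see the Brownian motion slide further on); a sketch only, since for very small ε the event is far too rare for naive Monte Carlo, so moderate ε are used and only the order of magnitude is meaningful:

```python
import numpy as np

rng = np.random.default_rng(6)
m, reps, chunk = 200, 200_000, 20_000
dt = 1.0 / m
eps_values = [0.7, 0.6, 0.5, 0.4]
hits = np.zeros(len(eps_values))

for _ in range(reps // chunk):
    incr = np.sqrt(dt) * rng.standard_normal((chunk, m))
    sup_norm = np.abs(np.cumsum(incr, axis=1)).max(axis=1)   # sup_t |W_t| on the grid
    for j, eps in enumerate(eps_values):
        hits[j] += (sup_norm < eps).sum()

for eps, h in zip(eps_values, hits):
    prob = h / reps                                          # P(sup_t |W_t| < eps)
    print(f"eps={eps}: -log P = {-np.log(prob):.2f}  vs  (1/eps)^2 = {1 / eps**2:.2f}")
```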

60 Rates for Gaussian priors: proof The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = −log P(‖W‖ < ε).

THEOREM If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²   AND   inf_{h ∈ H: ‖h − w_0‖ < ε_n} ‖h‖_H² ≤ nε_n².

PROOF The posterior rate is ε_n if there exist sets B_n such that
(1) log N(ε_n, B_n, d) ≤ nε_n² and P(W ∈ B_n) = 1 − o(e^{−3nε_n²}). [entropy]
(2) P(‖W − w_0‖ < ε_n) ≥ e^{−nε_n²}. [prior mass]

Take B_n = M_n H_1 + ε_n B_1 for large M_n (H_1, B_1 the unit balls of H, B).

63 Proof of (2): key results

φ_{w_0}(ε) := φ_0(ε) + inf_{h ∈ H: ‖h − w_0‖ < ε} ‖h‖_H².

THEOREM [Kuelbs & Li (93)] The concentration function measures concentration around w_0 (up to factors 2):

P(‖W − w_0‖ < ε) ≍ e^{−φ_{w_0}(ε)}.

THEOREM [Borell (75)] For H_1 and B_1 the unit balls of RKHS and B,

P(W ∉ M H_1 + ε B_1) ≤ 1 − Φ(Φ^{−1}(e^{−φ_0(ε)}) + M).

64 (Integrated) Brownian motion THEOREM If w_0 ∈ C^β[0, 1], then the rate for Brownian motion is: n^{−1/4} if β ≥ 1/2; n^{−β/2} if β ≤ 1/2.

The small ball exponent of Brownian motion is φ_0(ε) ≍ (1/ε)² as ε → 0. This gives the n^{−1/4} rate, even for very smooth truths. Truths with β ≤ 1/2 are far from the RKHS, giving the rate n^{−β/2}. The minimax rate is attained iff β = 1/2.

THEOREM If w_0 ∈ C^β[0, 1], then the rate for (α − 1/2)-times integrated Brownian motion is n^{−(β∧α)/(2α+d)}. The minimax rate is attained iff β = α.

66 Stationary processes A stationary Gaussian field (W_t : t ∈ R^d) is characterized through a spectral measure μ, by

cov(W_s, W_t) = ∫ e^{i λ^T (s − t)} dμ(λ).

THEOREM Suppose that μ is Gaussian. Let ŵ_0 be the Fourier transform of w_0: [0, 1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|² dλ < ∞, then the rate of contraction is near 1/√n. If ∫ (1 + ‖λ‖²)^β |ŵ_0(λ)|² dλ < ∞, then the rate is (1/log n)^{κ_β}.

THEOREM Suppose that dμ(λ) = (1 + ‖λ‖²)^{−(α + d/2)} dλ. If w_0 ∈ C^β[0, 1]^d, then the rate of contraction is n^{−(α∧β)/(2α+d)}.

69 Adaptation Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by putting a prior on the regularity of the process. An alternative, more attractive approach is scaling.

70 Stretching or shrinking Sample paths can be smoothed by stretching and roughened by shrinking. [Figures: a sample path together with stretched and shrunk versions.]
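A sketch of the effect: W_t = B_{t/c} can be simulated on [0, 1] by using Brownian increments with variance dt/c, so that c > 1 stretches (visually smoother paths) and c < 1 shrinks (rougher paths); the values of c are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
m = 1000
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

def scaled_bm(rng, c):
    """W_t = B_{t/c} on [0, 1]: Brownian increments with variance dt / c."""
    incr = np.sqrt(dt / c) * rng.standard_normal(m - 1)
    return np.concatenate([[0.0], np.cumsum(incr)])

for c, label in [(10.0, "stretched (c > 1)"), (1.0, "unscaled"), (0.1, "shrunk (c < 1)")]:
    plt.plot(t, scaled_bm(rng, c), label=label)
plt.legend()
plt.show()
```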

72 Scaled (integrated) Brownian motion W_t = B_{t/c_n} for B Brownian motion, and c_n ≍ n^{(2α−1)/(2α+1)}.
α < 1/2: c_n → 0 (shrink). α ∈ (1/2, 1]: c_n → ∞ (stretch).

THEOREM The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0, 1], α ∈ (0, 1].

THEOREM Appropriate scaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k + 1].

Stretching helps a little, shrinking helps a lot.

75 Scaled smooth stationary process A Gaussian field with infinitely-smooth sample paths is obtained for E G_s G_t = exp(−‖s − t‖²).

THEOREM The prior W_t = G_{t/c_n} for c_n ≍ n^{−1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0, 1]^d, any α > 0.

76 Adaptation by random scaling Choose A^d from a Gamma distribution. Choose (G_t : t > 0) centered Gaussian with E G_s G_t = exp(−‖s − t‖²). Set W_t = G_{At}.

THEOREM If w_0 ∈ C^α[0, 1]^d, then the rate of contraction is nearly n^{−α/(2α+d)}. If w_0 is supersmooth, then the rate is nearly n^{−1/2}.

The first result is also true for randomly scaled k-times integrated Brownian motion and α ≤ k + 1.
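A sketch of one draw from this randomly rescaled prior in dimension d = 1: A is drawn from a Gamma distribution (shape and rate illustrative) and W_t = G_{At} is sampled from the squared-exponential covariance evaluated at the rescaled inputs; the jitter is for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(8)
t = np.linspace(0, 1, 300)

def draw_randomly_scaled_path(rng, t, shape=2.0, rate=1.0, jitter=1e-6):
    """One draw of W_t = G_{A t}, with A ~ Gamma (d = 1) and G squared-exponential."""
    A = rng.gamma(shape, 1.0 / rate)                 # random bandwidth parameter (scale = 1/rate)
    s = A * t                                        # rescaled inputs
    cov = np.exp(-(s[:, None] - s[None, :]) ** 2)    # E G_u G_v = exp(-(u - v)^2)
    L = np.linalg.cholesky(cov + jitter * np.eye(len(t)))
    return L @ rng.standard_normal(len(t))

W = draw_randomly_scaled_path(rng, t)
```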

78 Conclusion

79 Conclusion and final remark There exist natural (fixed) priors that yield fully automatic smoothing at the correct bandwidth. (For instance, randomly scaled Gaussian processes.) Similar statements are true for adaptation to the scale of models described by sparsity (research in progress).
