Bayesian nonparametrics
1 Bayesian nonparametrics: posterior contraction and limiting shape. Ismaël Castillo (LPMA, Université Paris VI), BNP lecture series, Berlin, September 2016.
2 Part I: Introduction
3-4 Standard frequentist framework. Statistical experiment: X random object (= data), model P = {P_θ, θ ∈ Θ}. Frequentist assumption: there exists θ_0 ∈ Θ with X ~ P_{θ_0}. Estimator: a measurable function θ̂(X) ∈ Θ; θ̂(X) is a random point in Θ, and one studies θ̂(X) under X ~ P_{θ_0}. Example: θ̂_MLE(X) = argmax_{θ∈Θ} p_θ(X).
5 Bayesian framework. Statistical experiment: X random object (= data), P = {P_θ, θ ∈ Θ} model. Bayesian setting [do not know θ? View it as random!]: (a) θ ~ Π, a probability measure on Θ, the prior distribution; (b) view P_θ as the law of X given θ, so the joint distribution of (θ, X) is specified; (c) the law of θ given X is the posterior distribution, denoted Π[· | X]. Bayesian "estimator": Π[· | X] ∈ M_1(Θ); Π[· | X] is a data-dependent measure on Θ.
6 Example 0 (E0). X = (X_1, ..., X_n), P = {N(θ,1)^⊗n, θ ∈ R}. Frequentist estimator: θ̂_MLE(X) = X̄_n. Bayesian setting: (a) θ ~ N(0,1) = Π prior (say); (b) X | θ ~ N(θ,1)^⊗n; (c) θ | X ~ N(nX̄_n/(n+1), 1/(n+1)) = Π[· | X] posterior.
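The conjugate update on this slide is a two-line computation. A minimal numerical sketch (the values of θ_0 and n are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 1.5          # illustrative "true" parameter
n = 100
X = rng.normal(theta0, 1.0, size=n)

xbar = X.mean()
# Conjugate update for the N(0,1) prior with N(theta,1)^n likelihood:
# posterior is N( n*xbar/(n+1), 1/(n+1) )
post_mean = n * xbar / (n + 1)
post_var = 1.0 / (n + 1)
print(post_mean, post_var)
```

The posterior mean shrinks X̄_n slightly towards the prior mean 0, and the posterior variance is of order 1/n.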
7-8 Bayesian framework. Bayesian setting: (a) θ ~ Π prior; (b) X | θ ~ P_θ; (c) θ | X: Π[· | X] posterior. All this produces a data-dependent measure Π[· | X]. And what if one forgot (a)+(b)+(c) and studied Π[· | X] as a standard estimator?
9-10 Frequentist analysis of Bayesian procedures. Posterior distribution Π[· | X]; frequentist assumption: θ_0 ∈ Θ, X ~ P_{θ_0}. E0 (Example 0): with θ ~ N(0,1) = Π and X | θ ~ N(θ,1)^⊗n, Π[· | X] = N(nX̄_n/(n+1), 1/(n+1)); that is, a posterior draw can be written θ̄(X) + (n+1)^{-1/2} N(0,1) with θ̄(X) = nX̄_n/(n+1). The posterior is centered close to θ̂_MLE(X) = X̄_n and converges at rate 1/√n towards θ_0 [see below].
11 Nonparametric models: regression. E1 Gaussian white noise: dX^(n)(t) = f(t)dt + (1/√n) dB(t), t ∈ [0,1], with f ∈ L²[0,1] and B Brownian motion. E1 Fixed design regression: X_i = f(i/n) + ε_i, with f ∈ C⁰[0,1] and ε_1, ..., ε_n iid N(0,1). The unknown parameter f is a function, an element of a functional space.
12 Sparsity. E2 Sparse Gaussian sequence model: X_i = θ_i + ε_i, 1 ≤ i ≤ n, with θ ∈ l_0[s] = {θ ∈ R^n : #{i : θ_i ≠ 0} ≤ s}. The unknown θ is a sparse vector.
13 Nonparametric models: sampling models. E3 Sampling from an unknown distribution: X^(n) = (X_1, ..., X_n) i.i.d. P on [0,1], with P a probability measure on [0,1]. E3 Density estimation: X^(n) = (X_1, ..., X_n) i.i.d. f, with f a density on [0,1]. Unknown P: a measure with total mass 1. Unknown f: a density, f ≥ 0, ∫ f = 1.
14 Nonparametric models: functionals. E4 Linear functional: ψ(f) = ∫_0^1 a(u) f(u) du, with a bounded on [0,1]. E5 Distribution function: F(·) = ∫_0^· f(u) du, with f a density. Many other semiparametric functionals are also of interest...
15-17 Priors on functions on [0,1]. Consider the law of a process Z with a.s. continuous sample paths. Example (centered Gaussian process): K(x,y) = E(Z_x Z_y), the covariance of the GP. K(x,y) = 1 + x∧y: Brownian motion released at 0. K(x,y) = e^{-(x-y)²}: squared-exponential GP. Example (series-expansion point of view): Z(x) = Σ_{k≥1} γ_k α_k ε_k(x).
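The two covariance kernels on this slide can be simulated directly by drawing a multivariate Gaussian with the corresponding covariance matrix. A minimal sketch (grid size and jitter are implementation choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)

# The two kernels from the slide:
K_bm = 1.0 + np.minimum.outer(x, x)          # Brownian motion released at 0
K_se = np.exp(-np.subtract.outer(x, x) ** 2) # squared-exponential GP

def sample_path(K, jitter=1e-8):
    """Draw one centered Gaussian path with covariance matrix K."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.normal(size=len(K))

z_bm = sample_path(K_bm)  # rough path (nowhere differentiable in the limit)
z_se = sample_path(K_se)  # very smooth (analytic) path
```

The contrast in path smoothness between the two draws is exactly the point of the two examples: the kernel encodes the regularity of the prior sample paths.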
18 Priors on sparse vectors. Define Π, a prior on l_0[s] = {θ ∈ R^n : #{i : θ_i ≠ 0} ≤ s}: draw k ~ π_n, a distribution on {0, ..., n}; given k, pick S ⊂ {1, ..., n} of size |S| = k uniformly at random, Π_n(S | k) = 1/C(n,k); given S, set θ_{S^c} = 0 and θ_S ~ ⊗_{i∈S} g, with g a density on R.
19 Sparsity. Example [special case of π_n = binomial: "coin flipping" prior, e.g. α_n = 1/n]: k ~ Bin(n, α_n) = π_n, g a density on R, Π = ⊗_{i=1}^n [(1 − α_n) δ_0 + α_n g]. [Remark: the posterior median can be shown to be a strict thresholding rule: θ̂_i^med(X_i) = 0 ⟺ |X_i| ≤ t(α_n) for some threshold t(α_n). For α_n = 1/n, one has t(α_n) ≈ √(2 log n).]
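Sampling from this coin-flipping (spike-and-slab) prior is immediate: flip n independent coins, and draw from the slab g where the coin comes up heads. A minimal sketch with a Laplace slab; the values n = 10 000 and α = 1/100 are illustrative (the slide's example uses α_n = 1/n):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
alpha = 1.0 / 100   # illustrative coin probability; the slide's example is 1/n

# theta_i = 0 with probability 1 - alpha, otherwise a draw from the slab g (Laplace)
coins = rng.random(n) < alpha
theta = np.where(coins, rng.laplace(0.0, 1.0, n), 0.0)
print(np.count_nonzero(theta))   # about n * alpha nonzero coordinates on average
```

A draw is sparse by construction: the number of nonzero coordinates is Bin(n, α), matching the hierarchical description k ~ Bin(n, α_n) on the slide.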
20-21 Priors on densities on [0,1]. Example (renormalisation): for Z a random process, x ↦ e^{Z_x} / ∫_0^1 e^{Z_u} du. Example: [more to follow!]
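The renormalisation trick turns any a.s. continuous process into a random density. A minimal sketch using a Brownian-motion path for Z, approximating the normalising integral by a Riemann sum on a uniform grid (grid size is an implementation choice):

```python
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0.0, 1.0, 512)

# Brownian-motion-like path Z on the grid: cumulative sum of Gaussian increments
dz = rng.normal(0.0, np.sqrt(1.0 / len(grid)), size=len(grid))
Z = np.cumsum(dz)

# Renormalisation of the slide: x -> exp(Z_x) / int_0^1 exp(Z_u) du
f = np.exp(Z)
f /= f.mean()   # on a uniform grid over [0,1], the mean approximates the integral
```

By construction f > 0 and its (discretised) integral over [0,1] equals 1, so each draw is a valid density.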
22 Priors on probability measures. Prior on probability measures on R: the Dirichlet process DP(M, G_0), with G_0 a probability measure on R (the mean measure) and M > 0 a concentration parameter. There exists a random probability measure G on R such that, for any finite partition (B_1, ..., B_r) of R, (G(B_1), ..., G(B_r)) ~ Dir(MG_0(B_1), ..., MG_0(B_r)), with Dir(a_1, ..., a_r) the standard Dirichlet distribution on the r-simplex. [Ferguson 1973]
23-24 Priors on probability measures: the Dirichlet process. G ~ DP(M, G_0): what does it look like? G is a discrete distribution almost surely. We have the representation [Sethuraman 94]: G = Σ_{k=1}^∞ π_k δ_{θ_k}, with θ_k ~ G_0 i.i.d. and weights π_k given by stick-breaking: π_k = V_k Π_{i=1}^{k−1} (1 − V_i), with V_1, V_2, ... ~ Beta(1, M) i.i.d. Pitman-Yor process PY(M, d, G_0): same representation with V_k ~ Beta(1 − d, M + kd).
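The Sethuraman stick-breaking representation gives a direct sampler for (a truncation of) a DP draw. A minimal sketch with G_0 = N(0,1); the concentration M and the truncation level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 2.0         # concentration parameter (illustrative)
trunc = 500     # truncation level of the stick-breaking series

V = rng.beta(1.0, M, size=trunc)
# pi_k = V_k * prod_{i<k} (1 - V_i): break off a fraction V_k of the remaining stick
pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
atoms = rng.normal(0.0, 1.0, size=trunc)    # theta_k ~ G_0 = N(0,1), i.i.d.

# G is (up to truncation) the discrete measure sum_k pi_k * delta_{atoms[k]}
print(pi.sum())   # close to 1 for a large truncation level
```

The leftover stick length Π_k (1 − V_k) decays geometrically, which is why a moderate truncation already captures essentially all the mass; swapping V_k ~ Beta(1 − d, M + kd) into the same code gives the Pitman-Yor process.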
25-27 Priors on probability measures: Pólya trees on [0,1]. Consider a sequence I of dyadic partitions of [0,1] = I: I_0 = {[0,1]}, I_1 = {I_0, I_1}, I_2 = {I_00, I_01, I_10, I_11}, ..., I_k = {I_ε, ε ∈ {0,1}^k}, ...; say, for simplicity, split in half: dyadic intervals (k2^{−l}, (k+1)2^{−l}]. [Binary tree: I splits into I_0, I_1 with weights V_0, V_1; I_0 into I_00, I_01 with weights V_00, V_01; I_1 into I_10, I_11 with weights V_10, V_11; and so on.] We set P(I_01) = V_0 V_01 and, more generally, P(I_{ε_1...ε_k}) = V_{ε_1} V_{ε_1ε_2} ··· V_{ε_1ε_2···ε_k} for ε = ε_1ε_2···ε_k. To get a probability measure, one sets V_1 = 1 − V_0 and, more generally, V_{ε1} = 1 − V_{ε0}.
28-29 Pólya trees on [0,1]. A Pólya tree PT(I, α) on [0,1] is the random probability measure P on [0,1] defined as follows. Given a collection of independent random variables V_{ε0} ~ Beta(α_{ε0}, α_{ε1}), for any ε ∈ {0,1}^k, k ≥ 0, set, for any ε = ε_1...ε_k ∈ {0,1}^k, any k ≥ 0, P(I_ε) = Π_{j=1; ε_j=0}^k V_{ε_1...ε_{j−1}0} × Π_{j=1; ε_j=1}^k (1 − V_{ε_1...ε_{j−1}0}). Special cases: if α_ε = M α(I_ε) for a probability measure α(·) on [0,1], this is DP(M, α). If α_{ε_1···ε_{k−1}0} = α_{ε_1···ε_{k−1}1} = a_k with Σ_k a_k^{−1} < ∞ and I the regular partition, then P ≪ Leb[0,1] a.s.: in this case it is a prior on densities!
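A Pólya tree draw can be simulated level by level: each interval's mass is split between its two children by an independent Beta variable. A minimal sketch for the symmetric case of the slide, with the common choice a_k = k² (which satisfies Σ 1/a_k < ∞, the density regime); depth and a_k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
depth = 8
a = lambda k: k ** 2   # a_k = k^2: sum of 1/a_k converges, so P has a density a.s.

def polya_tree_masses(depth):
    """Masses P(I_eps) of all dyadic intervals at the given level."""
    masses = np.array([1.0])
    for k in range(1, depth + 1):
        # V_{eps0} ~ Beta(a_k, a_k); left child gets V, right child gets 1 - V
        V = rng.beta(a(k), a(k), size=len(masses))
        masses = np.column_stack((masses * V, masses * (1 - V))).ravel()
    return masses

P = polya_tree_masses(depth)
print(len(P), P.sum())   # 2^depth intervals, total mass 1
```

Each level redistributes mass without creating or destroying any, so the 2^depth interval masses always sum to 1: a histogram approximation of the random density.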
30 Part II: Rates
31 Bayesian dominated framework. Experiment: X = X^(n), P = {P_θ^(n), θ ∈ F}, with (F, F) a measurable space. Dominated framework: suppose there exists a dominating measure μ^(n), dP_θ^(n) = p_θ^(n)(·) dμ^(n). Bayesian setting: (a) θ ~ Π prior distribution; (b) X | θ ~ P_θ^(n); (c) θ | X: Π[· | X] posterior. Bayes formula: for any measurable set B ∈ F, Π(B | X^(n)) = ∫_B p_θ^(n)(X^(n)) dΠ(θ) / ∫ p_θ^(n)(X^(n)) dΠ(θ). Remark: Π[B] = 0 ⟹ Π[B | X] = 0.
32 Consistency. The posterior is consistent at θ_0 if, as n → ∞, Π(· | X^(n)) → δ_{θ_0} weakly, in P_{θ_0}^(n)-probability. In a separable metric space equipped with a distance d: for any ε > 0, Π(θ : d(θ, θ_0) < ε | X^(n)) → 1 in P_{θ_0}^(n)-probability. [Notation: in the sequel, drop the index n and write P_{θ_0}, E_{θ_0}, X.] General consistency results: [Doob (1949)], [Schwartz (1965)], [Barron, Schervish, Wasserman (1999)].
33 Convergence rate. The posterior converges at rate ε_n for the distance d at θ_0 if, as n → ∞, E_{θ_0} Π(θ : d(θ, θ_0) ≤ ε_n | X) → 1. It is an upper bound: we look for the smallest possible ε_n. What happens in a nonparametric framework? [Ghosal, Ghosh, van der Vaart 00], [Ghosal, van der Vaart 07]
34 Convergence rate: lower bound. We say that ζ_n is a lower bound for the rate for d at θ_0 if E_{θ_0} Π(θ : d(θ, θ_0) ≤ ζ_n | X) → 0. One looks for the largest possible ζ_n. It may look a little odd at first that there is no mass close to θ_0... It just says that ζ_n is a too-fast scaling that captures little posterior mass.
35 Rates: regular parametric models. E0: X | θ ~ N(θ,1)^⊗n, θ ~ N(0,1) = Π, Π[· | X] = N(nX̄_n/(n+1), 1/(n+1)). Exercise: show that, for any M_n → ∞, E_{θ_0} Π[θ : |θ − θ_0| ≤ M_n/√n | X] → 1. This extends to regular parametric models and most reasonable priors.
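Since the posterior here is an explicit Gaussian, the mass of the M_n/√n ball in the exercise can be computed in closed form with the normal CDF. A minimal sketch, taking X̄_n at its typical value θ_0 for a deterministic illustration (n, θ_0 and the M_n values are illustrative):

```python
import math

n = 10_000
theta0 = 0.7
# Posterior is N(n*xbar/(n+1), 1/(n+1)); under P_{theta0}, xbar concentrates at theta0,
# so take xbar = theta0 for a deterministic illustration.
xbar = theta0
post_mean, post_sd = n * xbar / (n + 1), math.sqrt(1.0 / (n + 1))

def post_mass(radius):
    """Posterior mass of {|theta - theta0| <= radius} under the Gaussian posterior."""
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    lo, hi = theta0 - radius, theta0 + radius
    return Phi((hi - post_mean) / post_sd) - Phi((lo - post_mean) / post_sd)

for Mn in (1, 3, 10):   # as M_n grows, the mass of the M_n/sqrt(n) ball tends to 1
    print(Mn, round(post_mass(Mn / math.sqrt(n)), 4))
```

The printed masses increase towards 1 as M_n grows, which is exactly the statement of the exercise for this conjugate example.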
36 Posterior and aspects of the posterior. The posterior distribution Π[· | X] may have a well-defined posterior mean θ̄(X) := ∫ θ dΠ(θ | X), posterior mode θ*(X) := argmax_θ (dΠ(θ | X)/dμ), posterior median, posterior quantiles... These are estimators in the standard sense. Often, but not always, a convergence rate for Π[· | X] implies the same rate for a given aspect of it.
37 Posterior and aspects of the posterior. Fact 1: suppose Π[d(θ, θ_0) > ε_n | X] = o_P(1). Let θ_B(X) be the center of the smallest d-ball containing at least 1/2 of the posterior mass; then d(θ_B(X), θ_0) = O_P(ε_n). Fact 2: suppose Π[d(θ, θ_0) > ε_n | X] = O_P(ε_n²) and θ ↦ d²(θ, θ_0) is convex and bounded [e.g. d = h]. Then d(θ̄(X), θ_0) = O_P(ε_n).
38 Posterior integrated loss. Fact 3: suppose E_{θ_0} ∫ d(θ, θ_0)² dΠ(θ | X) ≲ ε_n². Then for any M_n → ∞, E_{θ_0} Π[θ : d(θ, θ_0)² > M_n ε_n² | X] = o(1), and, if θ ↦ d²(θ, θ_0) is convex, E_{θ_0}[d(θ̄(X), θ_0)²] ≲ ε_n².
39 First examples. E1 Fixed design regression [van der Vaart, van Zanten 08, 09]. True function: let f_0 ∈ C^β[0,1]. Loss function: ‖g‖_n² = n^{−1} Σ_{i=1}^n g(t_i)². Prior: Brownian motion + Gaussian: W_t = B_t + Z_0, with Z_0 ~ N(0,1). Then, as n → ∞, E_{f_0} Π[f : ‖f − f_0‖_n ≤ ε_n | X] → 1, where ε_n = n^{−1/4} if β ≥ 1/2, and ε_n = n^{−β/2} if β ≤ 1/2.
40 First examples. E1 Fixed design regression (continued). Prior: Riemann-Liouville process with parameter α > 0: W_t = ∫_0^t (t − s)^{α−1/2} dB_s + Σ_{k=0}^{⌈α⌉} Z_k t^k, with Z_k ~ N(0,1) iid. Then E_{f_0} Π[f : ‖f − f_0‖_n ≤ ε_n | X] → 1, where ε_n = n^{−α/(2α+1)} if β ≥ α, and ε_n = n^{−β/(2α+1)} if β ≤ α.
41 First examples. E3 Density estimation [van der Vaart, van Zanten 08, 09]. True density: let f_0 ∈ C^β[0,1] with f_0 > 0. Loss function: Hellinger distance, h(f,g)² = ∫(√f − √g)². Prior: consider the distribution on continuous densities induced by t ↦ e^{W_t} / ∫_0^1 e^{W_u} du, with W_t either Brownian motion or a Riemann-Liouville process with parameter α > 0. Then, for ε_n as before, E_{f_0} Π[h(f, f_0) ≤ ε_n | X] → 1.
42 Theory: Bayesian nonparametrics. Consistency: [Doob (1949)] consistency up to a Π-null set; [Schwartz (1965)] consistency under testing and enough prior mass in Kullback-Leibler-type neighborhoods; [Diaconis, Freedman (1986)] example of an innocent-looking prior in a semiparametric framework for which the posterior is not consistent. Rates of convergence?
43 GGV theory: first lemma. To fix ideas, consider first the density estimation model: X = (X_1, ..., X_n), P = {P_f^n, dP_f = f dμ, f ∈ F}, with F some set of densities and Π a prior distribution on F. Bayes formula: Π[B | X] = ∫_B Π_{i=1}^n f(X_i) dΠ(f) / ∫ Π_{i=1}^n f(X_i) dΠ(f). Notation: K(f_0, f) = ∫ log(f_0/f) f_0 dμ, V(f_0, f) = ∫ (log(f_0/f) − K(f_0, f))² f_0 dμ.
44 GGV theory: first lemma. Define a Kullback-Leibler-type neighborhood of f_0 by B_KL(f_0, ε_n) := {f : K(f_0, f) ≤ ε_n², V(f_0, f) ≤ ε_n²}. Lemma 1: let A_n be measurable and let nε_n² → ∞. Suppose Π[A_n] ≤ e^{−2nε_n²} Π[B_KL(f_0, ε_n)]. Then Π[A_n | X] = o_P(1). [Idea: very small prior mass implies small posterior mass.]
45-46 Lemma 2: for any probability measure Π on F, any C > 0 and ε > 0, ∫ Π_{i=1}^n (f/f_0)(X_i) dΠ(f) ≥ Π(B_KL(f_0, ε)) e^{−(1+C)nε²}, with P_{f_0}-probability at least 1 − 1/(C²nε²). [Proof] For B with Π(B) > 0, write Π̃ = Π(· ∩ B)/Π(B), so that ∫ Π_{i=1}^n (f/f_0)(X_i) dΠ(f) ≥ Π(B) ∫_B Π_{i=1}^n (f/f_0)(X_i) dΠ̃(f). By Jensen's inequality, log ∫_B Π_{i=1}^n (f/f_0)(X_i) dΠ̃(f) ≥ Σ_{i=1}^n ∫_B log(f/f_0)(X_i) dΠ̃(f) = −Σ_{i=1}^n ∫_B [log(f_0/f)(X_i) − K(f_0, f)] dΠ̃(f) − n ∫_B K(f_0, f) dΠ̃(f) ≥ −Σ_{i=1}^n Z_i − nε², if B = B_KL(f_0, ε).
47-48 With Z_i := ∫_B [log(f_0/f)(X_i) − K(f_0, f)] dΠ̃(f), the Z_i are iid with E_{f_0} Z_1 = 0 [Fubini] and Var_{f_0} Z_1 = E_{f_0} (∫_B [log(f_0/f)(X_1) − K(f_0, f)] dΠ̃(f))² ≤ ∫_B E_{f_0}[log(f_0/f)(X_1) − K(f_0, f)]² dΠ̃(f) [Cauchy-Schwarz, Fubini] = ∫_B V(f_0, f) dΠ̃(f) ≤ ε², since Π̃(B) = 1. By Chebyshev's inequality, P_{f_0}[Σ_{i=1}^n Z_i > Cnε²] ≤ 1/(C²nε²). On the complementary event, one thus has log ∫_B Π_{i=1}^n (f/f_0)(X_i) dΠ̃(f) ≥ −(C+1)nε², that is, ∫ Π_{i=1}^n (f/f_0)(X_i) dΠ(f) ≥ Π(B) e^{−(C+1)nε²}.
49 GGV theory: first lemma. Recall B_KL(f_0, ε_n) := {f : K(f_0, f) ≤ ε_n², V(f_0, f) ≤ ε_n²}. Lemma 1: let A_n be measurable and nε_n² → ∞; if Π[A_n] ≤ e^{−2nε_n²} Π[B_KL(f_0, ε_n)], then Π[A_n | X] = o_P(1). [Proof of Lemma 1: bound the numerator of Bayes' formula in expectation by Π[A_n] via Fubini, and lower-bound the denominator via Lemma 2.]
50-51 Theory: Bayesian nonparametrics. Experiment: X = X^(n), P = {P_θ^(n), θ ∈ F} as before [not necessarily iid sampling]. Prior: Π a prior distribution on θ. Goal: for some distance d_n and rate ε_n, E_{θ_0} Π[θ : d_n(θ, θ_0) > Mε_n | X] → 0. Let us assume that, for nε_n² → ∞, the prior puts enough mass on neighborhoods of θ_0: with B_KL(θ_0, ε_n) = {θ : ∫ log(p_{θ_0}^(n)/p_θ^(n)) dP_{θ_0}^(n) ≤ nε_n², ∫ log²(p_{θ_0}^(n)/p_θ^(n)) dP_{θ_0}^(n) ≤ nε_n²}, one has Π(B_KL(θ_0, ε_n)) ≥ e^{−cnε_n²}. [This extends the definition given above in the iid sampling model.]
52-53 Theory: Bayesian nonparametrics. Some sieve sets Θ_n capture most of the prior mass: Π(Θ \ Θ_n) ≤ e^{−(c+4)nε_n²}. Test of the true parameter vs. the complement of a ball (T1): there exist tests ψ_n such that E_{θ_0} ψ_n → 0 and sup_{θ∈Θ_n : d_n(θ,θ_0)>Mε_n} E_θ(1 − ψ_n) ≤ e^{−(c+4)nε_n²}. A sufficient condition is a control of an entropy [see below].
54 Theory: Bayesian nonparametrics. Theorem 1 [Ghosal, Ghosh, van der Vaart 00], [Ghosal, van der Vaart 07]: suppose there are sets Θ_n ⊂ Θ and c, M > 0 such that: (tests) there exist tests ψ_n with E_{θ_0} ψ_n → 0 and sup_{θ∈Θ_n : d_n(θ,θ_0)>Mε_n} E_θ(1 − ψ_n) ≤ e^{−(c+4)nε_n²}; (remaining mass) Π(Θ \ Θ_n) ≤ e^{−(c+4)nε_n²}; (prior mass) Π(B_KL(θ_0, ε_n)) ≥ e^{−cnε_n²}. Then, for M > 0 as above, E_{θ_0} Π[θ : d_n(θ, θ_0) ≤ Mε_n | X] → 1.
55 Testing via entropy. (T0): tests of a point vs. a ball exist for the distance d_n. N(ε, Θ_n, d_n) = minimal number of balls of radius ε for d_n needed to cover Θ_n. Lemma 3: suppose (T0) holds and log N(ε_n, Θ_n, d_n) ≤ Cnε_n². Then for any given c > 0 and M large enough, (T1) holds.
56 Proof of Lemma 3 [sketch]. Build shells C_j covering the complement of the ball; cover each shell by a minimal number of balls; to each ball center, associate a test ϕ_{i,j} via (T0); set ψ := max_{i,j} ϕ_{i,j}; the testing errors, of order e^{−Knjε_n²} over shell j, are summable in j.
57-58 Testing distances: examples. About the testing distance in (T0): general results for testing between convex sets [Le Cam 70s, Birgé 80s] imply that (T0) holds in iid and some simple dependency settings, for d = h or d = ‖·‖_1. Proposition: in density estimation, P_f^(n) = P_f^n, (T0) holds for d = ‖·‖_1. [Idea of proof] Let f_0, f_1 with ‖f_0 − f_1‖_1 > ε. For A = {x : f_0(x) < f_1(x)} and P_n(A) = n^{−1} Σ_{i=1}^n 1l{X_i ∈ A}, consider φ_n = 1l{P_n(A) > P_{f_0}(A) + ‖f_0 − f_1‖_1/3}. Control E_{f_0} φ_n and E_f(1 − φ_n) via Hoeffding's inequality.
59 Theory: Bayesian nonparametrics. Theorem 1, entropy version [GGV 00]: suppose there exist Θ_n ⊂ Θ and c > 0 such that, for d_n satisfying (T0): (entropy) log N(ε_n, Θ_n, d_n) ≤ dnε_n²; (remaining mass) Π(Θ \ Θ_n) ≤ e^{−(c+4)nε_n²}; (prior mass) Π(B_KL(θ_0, ε_n)) ≥ e^{−cnε_n²}. Then, for M > 0 large enough, E_{θ_0} Π[θ : d_n(θ, θ_0) ≤ Mε_n | X] → 1.
60-61 Theory: Gaussian priors. W = (W_t : t ∈ T) a centered Gaussian process taking values in a Banach space (B, ‖·‖). Covariance kernel: K(s,t) = E(W_s W_t). Reproducing kernel Hilbert space H (RKHS) associated with W: define a norm ‖·‖_H via ⟨Σ_{i=1}^p a_i K(s_i, ·), Σ_{j=1}^q b_j K(t_j, ·)⟩_H = Σ_{i,j} a_i b_j K(s_i, t_j); then one sets H = closure of Vect{K(s, ·), s ∈ T} for ‖·‖_H. Brownian motion (B_t): H = {∫_0^· f(u) du, f ∈ L²[0,1]}. Series prior Σ_k σ_k ν_k ε_k: H = {h = (h_k) ∈ l² : Σ_{k≥1} σ_k^{−2} h_k² < +∞}.
62 Theory: Gaussian priors. Concentration function: let g be in the support of W in B. For ε > 0, set ϕ_g(ε) = inf_{h∈H : ‖h−g‖<ε} ‖h‖_H²/2 − log P(‖W‖ < ε) [approximation term + small ball probability]. Fact: for all g in the support of W in B and all ε > 0, e^{−ϕ_g(ε)} ≤ P(‖W − g‖ < ε) ≤ e^{−ϕ_g(ε/2)}. Example [small ball probability]: for Brownian motion (B_t), −log P(‖B‖ < ε) ≍ ε^{−2} (ε → 0).
63 Theory: Gaussian priors [van der Vaart, van Zanten 08]. Consider a nonparametric problem with unknown function f_0 ∈ B. Prior Π = law of a Gaussian process W on B, with RKHS H. Assume that f_0 is in the support of the prior in B, and that the norm on B combines correctly with the testing distance d. Let ε_n be a solution of the equation ϕ_{f_0}(ε_n) ≤ nε_n². Then the posterior contracts at rate ε_n: for large enough M, E_{f_0} Π(d(f, f_0) > Mε_n | X) → 0. Lower bound: [C 08].
64 Theory: Gaussian priors. Ingredients of proof: (prior mass) the [Fact] links P(‖W − w‖ < ε) and the concentration function; (sieves) Borell's inequality [Borell 75]: with B_1 and H_1 the unit balls of B and H associated with W, P(W ∉ MH_1 + εB_1) ≤ 1 − Φ(Φ^{−1}(e^{−ϕ_0(ε)}) + M); this suggests setting Θ_n = √n ε_n H_1 + ε_n B_1; (entropy) one can link the entropy of H_1 and the small ball probability.
65-66 Example: density estimation and Gaussian priors. Squared-exponential covariance kernel: centered Gaussian process Z_t with covariance E(Z_t Z_s) = e^{−(s−t)²/L}. [van der Vaart, van Zanten 10] show that, for fixed L, there are regular functions f_0 for which the rate is at best logarithmic: ε_n ≍ (log n)^{−γ(β)}. However, those priors are used in machine learning and give very good results in practice when the parameter L is "well chosen"...
67-68 Example: density estimation and Gaussian priors, adaptation [van der Vaart, van Zanten 09]. [Figure] Prior Π: consider Z_{At} with A ~ Gamma distribution, and set t ↦ e^{Z_{At}} / ∫_0^1 e^{Z_{Au}} du, where u ↦ Z_u is a centered GP with squared-exponential kernel. Then the posterior converges at the minimax rate up to a log factor: E_{f_0} Π[h(f, f_0) ≤ M (log n)^{γ(β)} n^{−β/(2β+1)} | X] → 1.
69 Example: adaptation on manifolds [C, Kerkyacharian, Picard 14]. Let M be a compact Riemannian manifold of dimension d without boundary, and Δ_M the Laplacian, a linear operator on functions on M with discrete spectrum: (−Δ_M) ϕ_p = λ_p ϕ_p, with 0 ≤ λ_1 ≤ λ_2 ≤ ··· . Since Δ_M(e^{−λ_p t} ϕ_p) = −λ_p e^{−λ_p t} ϕ_p = ∂_t(e^{−λ_p t} ϕ_p), this is a special solution of the "heat equation" on M: Δ_M f = ∂_t f.
70 Example: adaptation on manifolds. Let α_p ~ N(0,1) iid. A GP "random solution of the heat equation" is W_t := Σ_{p≥1} e^{−λ_p t/2} α_p ϕ_p. The associated family of covariance kernels is P_t(x,y) = Σ_{p≥1} e^{−λ_p t} ϕ_p(x) ϕ_p(y), the M-heat kernel. Subgaussian estimates: c_1 e^{−c ρ²(x,y)/t} / √(|B(x,√t)| |B(y,√t)|) ≤ P_t(x,y) ≤ c_2 e^{−c' ρ²(x,y)/t} / √(|B(x,√t)| |B(y,√t)|).
71-73 Example: adaptation on manifolds. Model: dX^(n)(x) = f(x)dx + (1/√n) dZ(x), x ∈ M. Prior Π: W_T = Σ_p e^{−λ_p T/2} α_p ϕ_p, with T random with density proportional to t^{−a} e^{−t^{−d/2} log^q(1/t)}; W_T is seen as a prior on (B, ‖·‖) = (L², ‖·‖_2); set q = 1 + d/2. Suppose f_0 ∈ B^s_{2,∞}(M). Then, for M large enough, as n → ∞, E_{f_0} Π[f : ‖f − f_0‖_2 ≥ M (log n / n)^{s/(2s+d)} | X] → 0. The rate is sharp: for small enough ρ, there exists f_0 ∈ B^s_{2,∞}(M) such that E_{f_0} Π[f : ‖f − f_0‖_2 ≤ ρ (log n / n)^{s/(2s+d)} | X] → 0.
74 Example: sparsity. Recall X_i = θ_i + ε_i and θ ∈ l_0[s], an s-sparse vector. Example [coin flipping prior with fixed α_n = 1/n]: k ~ Bin(n, α_n) = π_n, g a density on R, Π = ⊗_{i=1}^n [(1 − α_n) δ_0 + α_n g].
75 Example: sparsity. Example [coin flipping prior with random α]: draw α ~ Beta(1, n); given α, k ~ Bin(n, α) = π_n; g a density on R; Π: given α, θ ~ ⊗_{i=1}^n [(1 − α) δ_0 + α g].
76 Example: sparsity. Theorem: let Π be the Bayesian coin-flipping prior with random α ~ Beta(1, n) and g the Laplace (or a heavier-tailed) density. Then, for M large enough, sup_{θ_0∈l_0[s_n]} E_{θ_0} Π[θ : ‖θ − θ_0‖² > M s_n log(n/s_n) | X] → 0.
77-78 Example: mixtures [Rousseau 10]. Observations X_1, ..., X_n i.i.d. with density f_0 > 0, Hölder C^β[0,1]. Beta densities: g(x | a, b) = x^{a−1}(1 − x)^{b−1}/B(a,b) (usual parametrisation); reparametrisation g_{α,ε}(x) = g(x | αε, α(1 − ε)). Prior Π = hierarchical mixture of Beta densities: g_{α,k,p_k,ɛ_k}(·) = Σ_{j=1}^k p_j g_{α,ε_j}(·), with α ~ π_α, k ~ π_k, (p_1, ..., p_k) | k ~ a law on the canonical simplex of R^k, (ε_1, ..., ε_k) | k ~ a law on (0,1)^k. Then the posterior concentrates at rate E_{f_0} Π[h(f, f_0) ≤ M (log n)^{ρ(β)} n^{−β/(2β+1)} | X] → 1.
79-82 Posterior measure and aspects of the measure (I). MESSAGE: the full posterior measure and aspects of it may differ significantly! Model: Y = Xθ + ɛ with X = √n I_n [sequence model]. Let Π^λ = ⊗_{i=1}^p Laplace(λ) [without point masses at zero]. Then θ̂^λ_LASSO = argmax_θ [−‖Y − Xθ‖_2² − 2λ‖θ‖_1] is the posterior mode of Π^λ[· | Y]. Lemma [C, Schmidt-Hieber, van der Vaart 15]: there exists δ > 0 such that, as n → ∞, E_{θ_0=0} Π^{λ_n}(θ : ‖θ‖_2 ≤ δ√n/λ_n | Y) → 0, with λ_n = λ_n^LASSO = C√(log n). Note: √(n/log n) is a suboptimal convergence rate!
83-84 Posterior measure and aspects of the measure (II). Model: Y = Xθ + ε with X = √n I_n. Prior Π on θ: Bayesian coin flipping with random α and g (e.g.) the Laplace density. Consider estimation of θ under the l_q metric, 0 < q ≤ 2: d_q(θ, θ') = Σ_{i=1}^n |θ_i − θ'_i|^q. Theorem [C, van der Vaart 12]: for large enough M, as n → ∞, sup_{θ_0∈l_0[s_n]} E_{θ_0} Π[θ : d_q(θ, θ_0) > M r_{n,q} | Y] → 0. The posterior measure is rate-optimal for any 0 < q ≤ 2. The posterior mean is suboptimal (!) for q < 1 [Johnstone, Silverman 04]. The posterior median is rate-optimal for any 0 < q ≤ 2 [Johnstone, Silverman 04], [C, van der Vaart 12].
85 Rates: conclusion and further topics

- The previous general rate theorem applies to a variety of problems (i.i.d., non-i.i.d., Markov, hidden Markov, ...) as soon as one can find a suitable testing distance d_n and verify prior-mass-type conditions.
- Sometimes refinements can be used to tackle specific problems [e.g. more precise assumptions using similar ideas].
- Other formulations/approaches are also possible: PAC-Bayes, non-asymptotic versions, etc.
- Yet, with these techniques, it is unclear how to obtain results for other distances/loss functions that do not satisfy the testing condition (T0).
86 Part III: Shape
87 Gaussian example E0

X | θ ~ N(θ, 1)^{⊗n} ; θ ~ N(0, 1) = Π

Π[· | X] = N( n X̄_n/(n + 1), 1/(n + 1) )

Is Π[· | X], after recentering and rescaling, close to N(0, 1)?

- Recentering and rescaling: set τ_a : x ↦ √n(x − a).
- Comparing distributions: ‖P − Q‖_TV = sup_A |P(A) − Q(A)| = ½ ‖P − Q‖_1.

The recentering choice a = X̄_n seems natural. Then
Π[· | X] ∘ τ_{X̄_n}^{−1} = N( −√n X̄_n/(n + 1), n/(n + 1) ) ≈ N(0, 1).

Set τ_{X̄_n} : x ↦ √n(x − X̄_n).

Proposition [convergence in the Gaussian conjugate case]
E_{θ_0} ‖ Π[· | X] ∘ τ_{X̄_n}^{−1} − N(0, 1) ‖_TV → 0.

Exercise: prove it. [For instance: use Pinsker's inequality ‖·‖_TV ≤ √(KL/2) and compute the KL divergence between Gaussians.]
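The exercise above can be checked numerically: both the rescaled posterior and the target are Gaussian, so the KL divergence is explicit and Pinsker's inequality bounds the TV distance. A quick sketch (illustrative, not from the slides):

```python
import math
import random

# In the conjugate Gaussian case the rescaled posterior sqrt(n)(theta - Xbar)|X
# is N(-sqrt(n)*Xbar/(n+1), n/(n+1)); its KL divergence to N(0,1) vanishes,
# hence so does the TV distance by Pinsker's inequality.
def kl_to_std_normal(mu, var):
    # closed form for KL( N(mu, var) || N(0, 1) )
    return 0.5 * (var + mu * mu - 1.0 - math.log(var))

random.seed(1)
theta0 = 0.7
for n in (10, 100, 10000):
    xbar = theta0 + random.gauss(0, 1) / math.sqrt(n)   # Xbar_n ~ N(theta0, 1/n)
    mu = -math.sqrt(n) * xbar / (n + 1)
    var = n / (n + 1)
    kl = kl_to_std_normal(mu, var)
    tv_bound = math.sqrt(kl / 2.0)                       # Pinsker's inequality
    print(n, tv_bound)
```

The printed TV bound shrinks as n grows, in line with the Proposition.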
90 Historical example

Laplace [towards 1810] considered the setting
X | θ ~ Bin(n, θ), θ ~ Unif[0, 1] = Π,
with associated posterior
Π[· | X] = Beta(X + 1, n − X + 1)
[a fact noted by Thomas Bayes a few decades before].

Laplace noticed that, with τ_{X/n} : x ↦ √n(x − X/n),
Π[· | X] ∘ τ_{X/n}^{−1} ≈ N( 0, (X/n)(1 − X/n) ).
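Laplace's observation is easy to reproduce by Monte Carlo (illustrative, not from the slides): draws from the Beta posterior, recentred at X/n and scaled by √n, have approximately Gaussian shape with variance close to (X/n)(1 − X/n), the inverse Fisher information.

```python
import math
import random
import statistics

# Monte Carlo check of the normal approximation to the Beta(X+1, n-X+1)
# posterior after recentring at X/n and scaling by sqrt(n).
random.seed(2)
n, theta0 = 20000, 0.3
X = sum(random.random() < theta0 for _ in range(n))      # X ~ Bin(n, theta0)

draws = [random.betavariate(X + 1, n - X + 1) for _ in range(20000)]
u = [math.sqrt(n) * (t - X / n) for t in draws]          # rescaled posterior draws

p = X / n
print(statistics.mean(u), statistics.variance(u), p * (1 - p))
```

The empirical mean of the rescaled draws is near 0 and their variance near (X/n)(1 − X/n), matching the limit.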
91 Regular parametric models

P = {P_θ, θ ∈ Θ ⊂ R^k}, dP_θ(x) = p_θ(x)dx, ℓ_θ := log p_θ.
Suppose the score ℓ̇_θ := ∂ℓ_θ/∂θ exists and 0 < I_θ := E_θ[ ℓ̇_θ ℓ̇_θ^T ] < ∞.

The model is locally asymptotically normal (LAN) at the point θ_0 ∈ Θ if, for all h ∈ R^k,
log Π_{i=1}^n ( p_{θ_0 + h/√n} / p_{θ_0} )(X_i) = (1/√n) h^T Σ_{i=1}^n ℓ̇_{θ_0}(X_i) − ½ h^T I_{θ_0} h + o_{P_{θ_0}}(1).

An estimator θ̂ = θ̂_n(X) is (asymptotically linear and) efficient at θ_0 if
√n(θ̂ − θ_0) = I_{θ_0}^{−1} (1/√n) Σ_{i=1}^n ℓ̇_{θ_0}(X_i) + o_{P_{θ_0}}(1).

Recall the notation τ_θ̂ : x ↦ √n(x − θ̂).
92 Bernstein-von Mises, smooth parametric models

X_1, ..., X_n | θ ~ P_θ^n, θ ~ Π

Theorem [Bernstein-von Mises] [Le Cam, van der Vaart]
Suppose
(S) the model is LAN at θ_0,
(T) for any ε > 0 there exist tests φ_n with E_{θ_0} φ_n = o(1) and sup_{‖θ − θ_0‖ > ε} E_θ(1 − φ_n) = o(1),
(P) the prior Π has a continuous positive density at θ_0.
Then, for θ̂_n an efficient estimator at θ_0, as n → ∞,
‖ Π[· | X] ∘ τ_{θ̂}^{−1} − N(0, I_{θ_0}^{−1}) ‖_TV → 0 under P_{θ_0}.

By the Bernstein-von Mises (BvM) theorem, the recentred, rescaled posterior merges with N(0, I_{θ_0}^{−1}) in total variation. For θ̂ an efficient estimator, under P_{θ_0},
√n(θ̂ − θ_0) →_d N(0, I_{θ_0}^{−1}).
This shows the remarkable duality, asymptotically,
L^Π( √n(θ − θ̂) | X ) ≈ L( √n(θ̂ − θ) | θ = θ_0 )
Bayes law ↔ frequentist law.
94 Parametric BvM implies: credible sets are confidence sets!

Suppose BvM holds for Θ ⊂ R [dimension 1]:
‖ Π(· | X) ∘ τ_{θ̂}^{−1} − N(0, I_{θ_0}^{−1}) ‖_TV → 0 under P_{θ_0}.

Let A_n, B_n be such that, for α = 5%,
Π( (−∞, A_n) | X ) = Π( (B_n, +∞) | X ) = α/2.
By definition Π[ (A_n, B_n) | X ] = 1 − α: a credible set.

Exercise. As n → ∞, P_{θ_0}[ θ_0 ∈ (A_n, B_n) ] → 1 − α, so that (A_n, B_n) is an asymptotic confidence set at level 1 − α.

So, if the parametric BvM holds, credible sets are (asymptotically) confidence sets.
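The coverage statement in the exercise can be illustrated by simulation in the conjugate Gaussian model from example E0, where the equal-tailed credible interval is explicit. A sketch (illustrative, not from the slides):

```python
import math
import random
from statistics import NormalDist

# Monte Carlo coverage of the (1-alpha) equal-tailed credible interval (A_n, B_n)
# built from the exact posterior N(n*Xbar/(n+1), 1/(n+1)) of example E0.
random.seed(3)
theta0, n, alpha, reps = 0.5, 200, 0.05, 2000
covered = 0
for _ in range(reps):
    xbar = theta0 + random.gauss(0, 1) / math.sqrt(n)          # Xbar_n ~ N(theta0, 1/n)
    post = NormalDist(n * xbar / (n + 1), 1 / math.sqrt(n + 1))
    A, B = post.inv_cdf(alpha / 2), post.inv_cdf(1 - alpha / 2)
    covered += A < theta0 < B
print(covered / reps)    # frequentist coverage, close to 1 - alpha = 0.95
```

The empirical coverage is close to the nominal 95%, exactly as the parametric BvM predicts.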
95 BvM in the Gaussian model for a nonconjugate prior

As a particular case of the previous BvM theorem, we have

Proposition [BvM for a nonconjugate prior in the Gaussian model]
Let X_1, ..., X_n | θ ~ N(θ, 1)^{⊗n}, θ ~ Π, with dΠ(θ) = π(θ)dθ and π a continuous, positive density on R. Then
‖ Π[· | X] ∘ τ_{X̄_n}^{−1} − N(0, 1) ‖_TV → 0 in probability.

Idea of proof. The rescaled posterior has density
g_{X̄_n}(u) := exp(−u²/2) π(X̄_n + u/√n) / ∫ exp(−v²/2) π(X̄_n + v/√n) dv → exp(−u²/2) π(θ_0) / ( √(2π) π(θ_0) ),
then use Scheffé's lemma on the event { |X̄_n − θ_0| ≤ 1/log n }.
97 Towards a nonparametric BvM?

And what about a nonparametric BvM?
98 Gaussian white noise model

dX^(n)(t) = f(t)dt + (1/√n) dW(t), t ∈ [0, 1].

One observes a trajectory X^(n); f ∈ L²[0, 1] =: L² is unknown = the object of interest.

Projecting onto an orthonormal basis {φ_k}:
∫ φ_k(t) dX^(n)(t) = ∫ φ_k(t) f(t) dt + n^{−1/2} ∫ φ_k(t) dW(t), k ≥ 1,
that is, X_k = f_k + n^{−1/2} ε_k, k ≥ 1, with ε_k i.i.d. N(0, 1).

Equivalently, writing X^(n) for the sequence {X_k, k ≥ 1},
X^(n) = f + n^{−1/2} W.
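The sequence-space form X_k = f_k + ε_k/√n is easy to simulate directly. A minimal sketch (illustrative, not from the slides; the decaying truth f_k = k^{-2} is a made-up example of a smooth signal):

```python
import math
import random

# Simulate the sequence-space form of the white noise model:
# X_k = f_k + eps_k / sqrt(n), with eps_k i.i.d. N(0,1),
# for a toy smooth truth with fast-decaying coefficients.
random.seed(4)
n, K = 10_000, 50
f = [k ** -2.0 for k in range(1, K + 1)]                      # toy truth f_k = k^{-2}
X = [fk + random.gauss(0, 1) / math.sqrt(n) for fk in f]      # observed coefficients

# The raw sequence "estimator" X is unbiased coordinatewise, with total
# squared error of expected size K/n over the first K coordinates.
err2 = sum((x - fk) ** 2 for x, fk in zip(X, f))
print(err2)
```

This is just the observation scheme; the slides' point is that a prior on the coefficients f_k then induces a posterior on f.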
101 Bernstein-von Mises, nonparametric?

X^(n) = f + n^{−1/2} W.

Let Π be a prior on f ∈ L² [a prior on the coordinates f_k = ⟨f, φ_k⟩, k ≥ 1].
Set f̂ := E^Π[f | X^(n)], the posterior mean.

Two questions:
1. Can one find Π in such a way that Π( f − f̂ ∈ · | X^(n) ) ≈ L( f̂ − f | f ), an optimal law in some sense?
2. In which sense should one interpret "≈"?
102 Nonparametric BvMs, related work

Negative results [Cox 93], [Freedman 99], [Leahu 11]: let Π be a Gaussian prior on f sitting on L²; then a nonparametric BvM does not hold in L².

Some positive results exist for specific models/priors, mostly in cases where some form of conjugacy is present.

- [Lo 80s] Consider an i.i.d. sample X_1, ..., X_n ~ P on R with c.d.f. F, and put a Dirichlet process prior on P. Nonparametric BvM for F:
√n(F − F_n) | X_1, ..., X_n →_d G, in probability, with G the law of a P-Brownian bridge.
- [Kim and Lee 06] Neutral-to-the-right prior on F in the survival analysis model: NP-BvM for the cumulative hazard function A.
103 A large enough Hilbert space

Standard Sobolev spaces. For {φ_k, k ≥ 1} a smooth enough orthonormal basis of L²,
H_2^s := { f ∈ L²([0, 1]) : ‖f‖²_{s,2} := Σ_{k≥1} k^{2s} ⟨φ_k, f⟩² < ∞ }.

In double-index notation, for {ψ_lk} a smooth enough orthonormal basis of L²,
H_2^s := { f ∈ L²([0, 1]) : ‖f‖²_{s,2} := Σ_{l∈L} a_l^{2s} Σ_{k∈Z_l} ⟨ψ_lk, f⟩² < ∞ },
with either
1. L = Z, |Z_l| = 1, a_l = max(2, |l|) [Fourier-type basis], or
2. L = N, a_l = |Z_l| = 2^l [wavelet-type basis].

Negative Sobolev spaces: for s ∈ R,
H_2^{−s} = { f : ‖f‖²_{−s,2} := Σ_{l∈L} a_l^{−2s} Σ_{k∈Z_l} ⟨ψ_lk, f⟩² < ∞ }.
105 A large enough Hilbert space H

We use logarithmic Sobolev spaces to get sharp rates. For δ > 1, let
H := H_2^{−1/2,δ} = { f : ‖f‖²_{−1/2,2,δ} := Σ_{l∈L} a_l^{−1} (log a_l)^{−2δ} Σ_{k∈Z_l} ⟨ψ_lk, f⟩² < ∞ }.

X^(n) = f + (1/√n) W: the white noise model in H.

Note that W a.s. belongs to H (as do X^(n) and f), since
E ‖W‖²_{−1/2,2,δ} = Σ_{l∈L} a_l^{−1} (log a_l)^{−2δ} Σ_{k∈Z_l} E g²_lk < ∞.
[This implies that W takes values in H a.s. and is tight in H.]

W is the centered Gaussian process on H with E W(g)W(h) = ⟨g, h⟩, g, h ∈ L².

Denote by N the law of W as a random variable in H [the limiting nonparametric distribution].
108 Weak nonparametric BvM, definition

X^(n) = f + (1/√n) W.

Let Π be a prior on f ∈ L² ⊂ H. Let Π_n = Π(· | X^(n)) be the posterior distribution on H.

Denote τ : f ↦ √n(f − X^(n)) and let Π_n ∘ τ^{−1} be the rescaled posterior.

Let β be the bounded Lipschitz metric for weak convergence of probability measures on a metric space S:
β(μ, ν) = sup_{u ∈ BL(1)} | ∫_S u(s) (dμ − dν)(s) |.

Definition. The prior Π satisfies the weak Bernstein-von Mises phenomenon in H if
β( Π_n ∘ τ^{−1}, N ) → 0 in P^n_{f_0}-probability, as n → ∞.
112 Product priors and γ-smooth functions

Consider priors of the form Π = ⊗_{l,k} π_lk defined on the coordinates in the basis {ψ_lk}, with π_lk probability measures with Lebesgue densities φ_lk on R. Further assume, for some fixed density φ,
φ_lk(·) = (1/σ_l) φ(·/σ_l), k ∈ Z_l,
with σ_l > 0.

For instance, to model γ-smooth functions, take σ_l = 2^{−l(γ+1/2)} and set, with g_lk ~ φ i.i.d.,
G_γ = Σ_l Σ_{k∈Z_l} 2^{−l(γ+1/2)} g_lk ψ_lk, γ > 0.
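A draw from the random series G_γ is straightforward to simulate: 2^l Laplace variates at level l, each scaled by 2^{-l(γ+1/2)}. A sketch (illustrative, not from the slides; truncation at a finite level L and the Laplace choice for φ are assumptions for the demo):

```python
import random

# One (truncated) draw of the coefficient array of
# G_gamma = sum_l sum_k 2^{-l(gamma+1/2)} g_lk psi_lk, with g_lk i.i.d. Laplace,
# in the wavelet convention |Z_l| = 2^l; we accumulate the squared L2 norm
# of the coefficients, which stays bounded since the level sums decay like 4^{-l}.
def laplace():
    # standard Laplace variate as a difference of two Exp(1) variates
    return random.expovariate(1.0) - random.expovariate(1.0)

random.seed(5)
gamma, L = 1.0, 15
sq_norm = 0.0
for l in range(L):
    scale = 2.0 ** (-l * (gamma + 0.5))
    sq_norm += sum((scale * laplace()) ** 2 for _ in range(2 ** l))  # 2^l coeffs at level l
print(sq_norm)
```

The level-l contribution has expectation 2 * 4^{-γl} here, so the draw lies in L² (indeed in a Sobolev/Hölder-type ball), which is what "modeling γ-smooth functions" means.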
113 Main theorem - Condition (P)

Denote f_{0,lk} = ⟨f_0, ψ_lk⟩ the coordinates of the true f_0 in the basis {ψ_lk}. Suppose that for some M > 0,
(P1) sup_{l,k} |f_{0,lk}| / σ_l ≤ M.

Suppose also that φ is bounded and that, for some τ > M and c_φ > 0,
(P2) φ(x) ≥ c_φ for x ∈ (−τ, τ), and ∫_R x² φ(x) dx < ∞.

For the previous wavelet prior, (P1) asks for
|f_{0,lk}| ≤ M 2^{−l(γ+1/2)} for all l, k,
i.e. f_0 in a Hölder-type ball C^γ.
114 Main theorem in white noise [C. and Nickl 13]

Theorem [Weak nonparametric BvM in H]
Any product prior Π and f_0 satisfying Conditions (P) verify the weak Bernstein-von Mises phenomenon in H:
β( Π(· | X^(n)) ∘ τ^{−1}, N ) → 0 in P^n_{f_0}-probability.

Moreover the posterior mean f̂_n = E^Π[f | X^(n)] is efficient in the sense that
‖f̂_n − X^(n)‖_H = o_P(1/√n).
115 Nonparametric BvM: note on weak convergence

Unlike with total variation convergence, one does not have uniformity over all Borel sets B in
| Π(· | X) ∘ τ^{−1}(B) − N(B) | → 0 in P^n_{f_0}-probability.

One does have uniformity over all sets with a uniformly smooth boundary for the probability measure N. In particular, uniformity holds over all H-balls {B_H(0, t) : 0 < t ≤ M}.
116 Application: Credible ellipsoids

Suppose the weak BvM theorem holds for Π. A natural (1 − α)-credible set is then obtained by solving for R_n = R_n(α, X^(n)) such that
Π(C_n | X^(n)) = 1 − α, where C_n = { f : ‖f − X^(n)‖_H ≤ R_n/√n }.

C_n is the smallest ball around X^(n) charged by the posterior with probability 1 − α.

Theorem [Elliptical credible set C_n]
The random set C_n has asymptotic confidence 1 − α, in that
P^n_{f_0}( f_0 ∈ C_n ) → 1 − α, and R_n = O_P(1).
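In practice, solving for R_n amounts to taking an empirical quantile of the scaled norm √n‖f − X^(n)‖_H over posterior draws. A mechanical sketch (illustrative, not from the slides; the "draws" below are faked as |N(0,1)| variates just to show the quantile step):

```python
import random

# Given posterior draws of the scaled norm sqrt(n)*||f - X||_H, the credible
# radius R_n at level 1-alpha is their empirical (1-alpha) quantile.
def radius(scaled_norms, alpha=0.05):
    s = sorted(scaled_norms)
    return s[int((1 - alpha) * len(s))]     # empirical (1-alpha) quantile

random.seed(6)
fake_norms = [abs(random.gauss(0, 1)) for _ in range(10000)]   # stand-in draws
R = radius(fake_norms)
print(R)    # ~ 1.96 for |N(0,1)| stand-in draws
```

With genuine posterior draws in place of the stand-ins, C_n = { f : ‖f − X^(n)‖_H ≤ R/√n } is exactly the credible ellipsoid of the theorem.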
117 Application: BvM and credible sets for smooth functionals

Linear functionals: for g_L ∈ H_2^s, s > 1/2,
L : f ↦ ∫_0^1 f(t) g_L(t) dt.

Smooth nonlinear functionals, e.g.
f ↦ ∫ f²(t) dt, f_0 ∈ C^β, β > 1/2,
and self-convolutions of 1-periodic functions, f ↦ f ⋆ f.

- Leads to BvM for these functionals.
- Deduce credible sets with correct confidence coverage by taking quantiles of the induced posterior.
119 Application: Confidence sets in L², uniform priors

Consider first the special case of a uniform wavelet prior Π on L²:
U_{γ,M} = Σ_{l,k} 2^{−l(γ+1/2)} u_lk ψ_lk(·), u_lk ~ U[−M, M] i.i.d.

Such priors model functions in a Hölder ball of radius M, with posteriors Π_n contracting around f_0 at the L²-minimax rate n^{−γ/(2γ+1)}, within log factors, if ‖f_0‖_{γ,∞} ≤ M.

It is natural to intersect the ellipsoid credible set C_n with the Hölderian support of Π:
C_n = { f : ‖f − f̂_n‖_H ≤ R_n/√n, ‖f‖_{γ,∞} ≤ M }.

Proposition [Confidence sets for uniform product priors]
Π(C_n | X^(n)) = 1 − α, P^n_{f_0}( f_0 ∈ C_n ) → 1 − α, and the L²-diameter |C_n|_2 of C_n satisfies, for some κ > 0,
|C_n|_2 = O_P( n^{−γ/(2γ+1)} (log n)^κ ).
122 Application: Confidence sets in L², general product priors

Consider more generally, if f_0 ∈ C^γ,
G_γ = Σ_{l,k} 2^{−l(γ+1/2)} g_lk ψ_lk(·), g_lk i.i.d. ~ φ.

Let δ > 0 be arbitrary. Let, for ‖·‖_{γ,2,1} the γ-Sobolev norm with log correction,
C_n = { f : ‖f − f̂_n‖_H ≤ R_n/√n, ‖f‖_{γ,2,1} ≤ M_n + 4δ },
where M_n is defined as a quantile for the γ-norm: for any n and δ_n = (log n)^{−1/4},
M_n = inf{ M > 0 : Π_n( f : ‖f‖_{γ,2,1} ≤ M − δ ) ≥ 1 − δ_n }.

Proposition [Confidence sets for general product priors]
P^n_{f_0}( f_0 ∈ C_n ) → 1 − α, Π(C_n | X^(n)) = 1 − α + o_P(1), and
|C_n|_2 = O_P( n^{−γ/(2γ+1)} (log n)^κ ).
124 Nonparametric BvMs in white noise, further questions

Adaptation: put a prior on the regularity α [Kolyan Ray]; one then obtains e.g. adaptive confidence regions (modulo the expected appropriate restrictions).
Nonparametric estimation under Shape Restrictions Jon A. Wellner University of Washington, Seattle Statistical Seminar, Frejus, France August 30 - September 3, 2010 Outline: Five Lectures on Shape Restrictions
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More informationOptimal series representations of continuous Gaussian random fields
Optimal series representations of continuous Gaussian random fields Antoine AYACHE Université Lille 1 - Laboratoire Paul Painlevé A. Ayache (Lille 1) Optimality of continuous Gaussian series 04/25/2012
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationAsymptotics for posterior hazards
Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationarxiv: v1 [math.st] 6 Nov 2012
The Annals of Statistics 212, Vol. 4, No. 4, 269 211 DOI: 1.1214/12-AOS129 c Institute of Mathematical Statistics, 212 arxiv:1211.1197v1 [math.st] 6 Nov 212 NEEDLES AND STRAW IN A HAYSTACK: POSTERIOR CONCENTRATION
More informationSome Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book)
Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Principal Topics The approach to certainty with increasing evidence. The approach to consensus for several agents, with increasing
More information1 Continuity Classes C m (Ω)
0.1 Norms 0.1 Norms A norm on a linear space X is a function : X R with the properties: Positive Definite: x 0 x X (nonnegative) x = 0 x = 0 (strictly positive) λx = λ x x X, λ C(homogeneous) x + y x +
More informationNonparametric estimation under Shape Restrictions
Nonparametric estimation under Shape Restrictions Jon A. Wellner University of Washington, Seattle Statistical Seminar, Frejus, France August 30 - September 3, 2010 Outline: Five Lectures on Shape Restrictions
More informationPOSTERIOR CONSISTENCY IN CONDITIONAL DENSITY ESTIMATION BY COVARIATE DEPENDENT MIXTURES. By Andriy Norets and Justinas Pelenis
Resubmitted to Econometric Theory POSTERIOR CONSISTENCY IN CONDITIONAL DENSITY ESTIMATION BY COVARIATE DEPENDENT MIXTURES By Andriy Norets and Justinas Pelenis Princeton University and Institute for Advanced
More informationThe main results about probability measures are the following two facts:
Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a
More informationGeneral Bayesian Inference I
General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for
More informationLecture 16: Sample quantiles and their asymptotic properties
Lecture 16: Sample quantiles and their asymptotic properties Estimation of quantiles (percentiles Suppose that X 1,...,X n are i.i.d. random variables from an unknown nonparametric F For p (0,1, G 1 (p
More informationA talk on Oracle inequalities and regularization. by Sara van de Geer
A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More informationPosterior properties of the support function for set inference
Posterior properties of the support function for set inference Yuan Liao University of Maryland Anna Simoni CNRS and CREST November 26, 2014 Abstract Inference on sets is of huge importance in many statistical
More informationNonparametric estimation of log-concave densities
Nonparametric estimation of log-concave densities Jon A. Wellner University of Washington, Seattle Seminaire, Institut de Mathématiques de Toulouse 5 March 2012 Seminaire, Toulouse Based on joint work
More informationLecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable
Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationEstimation of the functional Weibull-tail coefficient
1/ 29 Estimation of the functional Weibull-tail coefficient Stéphane Girard Inria Grenoble Rhône-Alpes & LJK, France http://mistis.inrialpes.fr/people/girard/ June 2016 joint work with Laurent Gardes,
More informationEE514A Information Theory I Fall 2013
EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationBERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po
The Annals of Statistics 2001, Vol. 29, No. 5, 1264 1280 CONVERGENCE RATES FOR DENSITY ESTIMATION WITH BERNSTEIN POLYNOMIALS By Subhashis Ghosal University of Minnesota Mixture models for density estimation
More informationLecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable
Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed
More informationChapter 2: Fundamentals of Statistics Lecture 15: Models and statistics
Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationStatistical inference on Lévy processes
Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline
More informationNonparametric estimation of log-concave densities
Nonparametric estimation of log-concave densities Jon A. Wellner University of Washington, Seattle Northwestern University November 5, 2010 Conference on Shape Restrictions in Non- and Semi-Parametric
More informationBrittleness and Robustness of Bayesian Inference
Brittleness and Robustness of Bayesian Inference Tim Sullivan 1 with Houman Owhadi 2 and Clint Scovel 2 1 Free University of Berlin / Zuse Institute Berlin, Germany 2 California Institute of Technology,
More information1.1 Basis of Statistical Decision Theory
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes
More informationBayesian statistics: Inference and decision theory
Bayesian statistics: Inference and decision theory Patric Müller und Francesco Antognini Seminar über Statistik FS 28 3.3.28 Contents 1 Introduction and basic definitions 2 2 Bayes Method 4 3 Two optimalities:
More informationTD 1: Hilbert Spaces and Applications
Université Paris-Dauphine Functional Analysis and PDEs Master MMD-MA 2017/2018 Generalities TD 1: Hilbert Spaces and Applications Exercise 1 (Generalized Parallelogram law). Let (H,, ) be a Hilbert space.
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationBrownian Motion and Conditional Probability
Math 561: Theory of Probability (Spring 2018) Week 10 Brownian Motion and Conditional Probability 10.1 Standard Brownian Motion (SBM) Brownian motion is a stochastic process with both practical and theoretical
More information