Bayesian nonparametrics


1  Bayesian nonparametrics: posterior contraction and limiting shape
Ismaël Castillo, LPMA, Université Paris VI. Berlin, September 2016.

2  Part I: Introduction

3-4  Standard frequentist framework
Statistical experiment: $X$ random object = data; model $\mathcal P=\{P_\theta,\ \theta\in\Theta\}$.
Frequentist assumption: there is $\theta_0\in\Theta$ such that $X\sim P_{\theta_0}$.
Estimator: a measurable function $\hat\theta(X)\in\Theta$; $\hat\theta(X)$ is a random point in $\Theta$, and one studies $\hat\theta(X)$ under $X\sim P_{\theta_0}$.
Example: $\hat\theta_{MLE}(X)=\operatorname{argmax}_{\theta\in\Theta}\,p_\theta(X)$.

5  Bayesian framework
Statistical experiment: $X$ random object = data; model $\mathcal P=\{P_\theta,\ \theta\in\Theta\}$.
Bayesian setting [do not know $\theta$? view it as random!]:
a) $\theta\sim\Pi$, a probability measure on $\Theta$: the prior distribution;
b) view $P_\theta$ as the law of $X\mid\theta$; the joint distribution of $(\theta,X)$ is then specified;
c) the law of $\theta\mid X$ is the posterior distribution, denoted $\Pi[\cdot\mid X]$.
Bayesian estimator: $\Pi(\cdot\mid X)\in\mathcal M_1(\Theta)$, a data-dependent measure on $\Theta$.

6  E0, Example 0
$X=(X_1,\dots,X_n)$, $\mathcal P=\{\mathcal N(\theta,1)^{\otimes n},\ \theta\in\mathbb R\}$.
Frequentist estimator: $\hat\theta_{MLE}(X)=\bar X_n$.
Bayesian setting:
a) $\theta\sim\mathcal N(0,1)=\Pi$ prior (say);
b) $X\mid\theta\sim\mathcal N(\theta,1)^{\otimes n}$;
c) $\theta\mid X\sim\mathcal N\big(\frac{n\bar X_n}{n+1},\frac1{n+1}\big)=\Pi[\cdot\mid X]$ posterior.
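A minimal numerical sketch of Example 0 (variable names and the simulated values are mine): simulate data and form the conjugate posterior $\mathcal N(n\bar X_n/(n+1),\,1/(n+1))$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 0: X_1,...,X_n | theta ~ N(theta, 1) i.i.d., prior theta ~ N(0, 1).
theta0, n = 0.7, 200                      # "true" parameter used to simulate data
X = theta0 + rng.standard_normal(n)
Xbar = X.mean()

# Conjugate posterior: N( n*Xbar/(n+1), 1/(n+1) )
post_mean = n * Xbar / (n + 1)
post_var = 1.0 / (n + 1)
print(post_mean, post_var)                # close to the MLE Xbar, variance close to 1/n
```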

7-8  Bayesian framework
Bayesian setting: a) $\theta\sim\Pi$ prior; b) $X\mid\theta\sim P_\theta$; c) $\theta\mid X\sim\Pi[\cdot\mid X]$ posterior.
All this produces a data-dependent measure $\Pi[\cdot\mid X]$.
And what if one forgot a)+b)+c) and studied $\Pi[\cdot\mid X]$ as a standard estimator?

9-10  Frequentist analysis of Bayesian procedures
Posterior distribution $\Pi[\cdot\mid X]$; frequentist assumption: $\theta_0\in\Theta$, $X\sim P_{\theta_0}$.
E0, Example 0: $X\mid\theta\sim\mathcal N(\theta,1)^{\otimes n}$, $\theta\sim\mathcal N(0,1)=\Pi$, so
$\Pi[\cdot\mid X]=\mathcal N\big(\frac{n\bar X_n}{n+1},\frac1{n+1}\big)$, i.e. $\theta\mid X\ \stackrel{d}{=}\ \frac{n}{n+1}\bar X_n+\frac1{\sqrt{n+1}}\,\mathcal N(0,1)$,
which is centered close to $\hat\theta_{MLE}(X)=\bar X_n$ and converges at rate $1/\sqrt n$ towards $\theta_0$ [see below].

11  Nonparametric models: regression
E1 Gaussian white noise: $dX^{(n)}(t)=f(t)\,dt+\frac1{\sqrt n}\,dB(t)$, $t\in[0,1]$, with $f\in L^2[0,1]$ and $B$ a Brownian motion.
E1 Fixed design regression: $X_i=f(i/n)+\varepsilon_i$, with $f\in\mathcal C^0[0,1]$ and $\varepsilon_1,\dots,\varepsilon_n$ i.i.d. $\mathcal N(0,1)$.
The unknown parameter $f$ is a function, an element of a functional space.

12  Sparsity
E2 Sparse Gaussian sequence model: $X_i=\theta_i+\varepsilon_i$, $1\le i\le n$, with $\theta\in\ell_0[s]=\{\theta\in\mathbb R^n:\ \#\{i:\theta_i\ne0\}\le s\}$.
The unknown $\theta$ is a sparse vector.

13  Nonparametric models: sampling models
E3 Sampling from an unknown distribution: $X^{(n)}=(X_1,\dots,X_n)$ i.i.d. $P$ on $[0,1]$, $P$ a probability measure on $[0,1]$; the unknown $P$ is a measure with total mass 1.
E3 Density estimation: $X^{(n)}=(X_1,\dots,X_n)$ i.i.d. $f$, a density on $[0,1]$; the unknown $f$ is a density: $f\ge0$, $\int f=1$.

14  Nonparametric models: functionals
E4 Linear functional: $\psi(f)=\int_0^1 a(u)f(u)\,du$, with $a$ bounded on $[0,1]$.
E5 Distribution function: $F(\cdot)=\int_0^{\cdot}f(u)\,du$, $f$ a density.
Many other semiparametric functionals are also of interest...

15-17  Prior on functions on $[0,1]$
Consider the law of a process $Z$ with a.s. continuous sample paths.
Example (centered Gaussian process): $K(x,y)=\mathbb E(Z_xZ_y)$ is the covariance of the GP.
  $K(x,y)=1+x\wedge y$: Brownian motion released at 0.
  $K(x,y)=e^{-(x-y)^2}$: squared-exponential GP.
Example (series expansion point of view): $Z(x)=\sum_{k\ge1}\gamma_k\,\alpha_k\,\varepsilon_k(x)$, a random series over a fixed basis $(\varepsilon_k)$.
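As a quick illustration of the two covariance kernels above, here is a hedged sketch (grid, jitter value and function names are my own choices) drawing GP sample paths from their covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)

def sample_gp(kernel, t, n_paths=3, jitter=1e-6):
    """Draw sample paths of a centered GP on the grid t from its covariance kernel."""
    K = kernel(t[:, None], t[None, :]) + jitter * np.eye(len(t))
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(t), n_paths))).T

# Brownian motion released at 0: K(x, y) = 1 + min(x, y)
bm_paths = sample_gp(lambda x, y: 1.0 + np.minimum(x, y), t)
# Squared-exponential kernel: K(x, y) = exp(-(x - y)^2)
se_paths = sample_gp(lambda x, y: np.exp(-(x - y) ** 2), t)
```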

18  Prior on sparse vectors
Define a prior $\Pi$ on $\ell_0[s]=\{\theta\in\mathbb R^n:\ \#\{i:\theta_i\ne0\}\le s\}$:
  draw $k\sim\pi_n$, a distribution on $\{0,\dots,n\}$;
  given $k$, pick $S\subset\{1,\dots,n\}$ of size $|S|=k$ uniformly at random: $\Pi_n(S\mid k)=1/\binom nk$;
  given $S$, set $\theta_{S^c}=0$ and $\theta_S\sim\bigotimes_{i\in S}g$, with $g$ a density on $\mathbb R$.

19  Sparsity
Example [special case of $\pi_n$ binomial; "coin flipping" prior, e.g. $\alpha_n=1/n$]:
$k\sim\operatorname{Bin}(n,\alpha_n)=\pi_n$, $g$ a density on $\mathbb R$, so that $\Pi_n=\bigotimes_{i=1}^n\big[(1-\alpha_n)\delta_0+\alpha_n g\big]$.
[Remark. The posterior median can be shown to be a strict thresholding rule: $\hat\theta^{\mathrm{med}}_i(X_i)=0$ iff $|X_i|\le t(\alpha_n)$ for some threshold $t(\alpha_n)$. For $\alpha_n=1/n$, one has $t(\alpha_n)\approx\sqrt{2\log n}$.]
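Below is a small illustrative sketch (function and variable names are mine) of drawing a sparse vector from the coin-flipping prior with a Laplace slab, and comparing the number of nonzero coordinates with the threshold $\sqrt{2\log n}$ mentioned in the remark.

```python
import numpy as np

rng = np.random.default_rng(2)

def coin_flipping_prior(n, alpha, slab=lambda size, rng: rng.laplace(size=size)):
    """Draw theta in R^n from the i.i.d. spike-and-slab prior (1-alpha)*delta_0 + alpha*g."""
    active = rng.random(n) < alpha           # coin flips: which coordinates are nonzero
    theta = np.zeros(n)
    theta[active] = slab(active.sum(), rng)  # slab density g (here Laplace)
    return theta

n = 1000
theta = coin_flipping_prior(n, alpha=1.0 / n)            # on average one nonzero coordinate
print(np.count_nonzero(theta), np.sqrt(2 * np.log(n)))   # compare with threshold t(1/n) ~ sqrt(2 log n)
```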

20-21  Prior on densities on $[0,1]$
Example (renormalisation): given a process $(Z_t)$ on $[0,1]$, take the random density
$x\ \mapsto\ \dfrac{e^{Z_x}}{\int_0^1 e^{Z_u}\,du},\qquad x\in[0,1].$
Example: [more to follow!]

22  Priors on probability measures
Prior on probability measures on $\mathbb R$: Dirichlet process $DP(M,G_0)$, with $G_0$ a probability measure on $\mathbb R$ (the mean) and $M>0$ a concentration parameter.
There exists a random probability measure $G$ on $\mathbb R$ such that, for any finite partition $(B_1,\dots,B_r)$ of $\mathbb R$,
$(G(B_1),\dots,G(B_r))\sim\operatorname{Dir}(MG_0(B_1),\dots,MG_0(B_r)),$
with $\operatorname{Dir}(a_1,\dots,a_r)$ the standard Dirichlet distribution on the $r$-simplex. [Ferguson 1973]

23-24  Priors on probability measures: Dirichlet process
$G\sim DP(M,G_0)$: what does it look like? $G$ is a discrete distribution almost surely, and we have the stick-breaking representation [Sethuraman 94]
$G=\sum_{k=1}^{\infty}\pi_k\,\delta_{\theta_k},\qquad \theta_k\sim G_0\ \text{i.i.d.},\qquad \pi_k=V_k\prod_{i=1}^{k-1}(1-V_i),\qquad V_1,V_2,\dots\sim\operatorname{Beta}(1,M)\ \text{i.i.d.}$
Pitman-Yor process $PY(M,d,G_0)$: same representation with $V_k\sim\operatorname{Beta}(1-d,M+kd)$.
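A truncated stick-breaking sampler, as a sketch of the representation above (the truncation level, base measure and names are mine; $d=0$ gives the Dirichlet process, $d>0$ the Pitman-Yor process).

```python
import numpy as np

rng = np.random.default_rng(3)

def stick_breaking(M, n_atoms, base=lambda size, rng: rng.standard_normal(size), d=0.0):
    """Truncated stick-breaking draw from DP(M, G0) (d=0) or Pitman-Yor PY(M, d, G0)."""
    k = np.arange(1, n_atoms + 1)
    V = rng.beta(1.0 - d, M + k * d)              # V_k ~ Beta(1-d, M+kd); d=0 gives Beta(1, M)
    pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))   # pi_k = V_k prod_{i<k}(1 - V_i)
    theta = base(n_atoms, rng)                    # atom locations theta_k ~ G0 i.i.d.
    return pi, theta

pi, theta = stick_breaking(M=5.0, n_atoms=500)    # G ~ sum_k pi_k delta_{theta_k} (truncated)
print(pi.sum())                                   # close to 1 for a large truncation level
```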

25-27  Priors on probability measures: Pólya trees on $[0,1]$
Consider a sequence $\mathcal I$ of dyadic partitions of $[0,1]=I$:
$\mathcal I_0=\{[0,1]\}$, $\mathcal I_1=\{I_0,I_1\}$, $\mathcal I_2=\{I_{00},I_{01},I_{10},I_{11}\}$, ..., $\mathcal I_k=\{I_\varepsilon,\ \varepsilon\in\{0,1\}^k\}$, ...
Say for simplicity we split in half: dyadic intervals $(k2^{-l},(k+1)2^{-l}]$.
Attach splitting variables to the tree: $V_0,V_1$ to $I_0,I_1$; $V_{00},V_{01},V_{10},V_{11}$ to $I_{00},I_{01},I_{10},I_{11}$; and so on down the tree.
We set $P(I_{01})=V_0V_{01}$ and, more generally, $P(I_{\varepsilon_1\dots\varepsilon_k})=V_{\varepsilon_1}V_{\varepsilon_1\varepsilon_2}\cdots V_{\varepsilon_1\varepsilon_2\dots\varepsilon_k}$ for $\varepsilon=\varepsilon_1\varepsilon_2\cdots\varepsilon_k$.
To get a probability measure, one sets $V_1=1-V_0$ and, more generally, $V_{\varepsilon1}=1-V_{\varepsilon0}$.

28-29  Pólya trees on $[0,1]$
A Pólya tree $PT(\mathcal I,\alpha)$ on $[0,1]$ is the random probability measure $P$ on $[0,1]$ defined as follows.
Given a collection of independent random variables $V_{\varepsilon0}\sim\operatorname{Beta}(\alpha_{\varepsilon0},\alpha_{\varepsilon1})$, for any $\varepsilon\in\{0,1\}^k$, $k\ge0$, set, for any $\varepsilon=\varepsilon_1\dots\varepsilon_k\in\{0,1\}^k$, $k\ge0$,
$P(I_\varepsilon)=\prod_{j=1;\ \varepsilon_j=0}^{k}V_{\varepsilon_1\dots\varepsilon_{j-1}0}\ \prod_{j=1;\ \varepsilon_j=1}^{k}\big(1-V_{\varepsilon_1\dots\varepsilon_{j-1}0}\big).$
Special cases:
  If $\alpha_\varepsilon=M\,\alpha(I_\varepsilon)$ for a probability measure $\alpha(\cdot)$ on $[0,1]$: this is $DP(M,\alpha)$.
  If $\alpha_{\varepsilon_1\dots\varepsilon_{k-1}0}=\alpha_{\varepsilon_1\dots\varepsilon_{k-1}1}=a_k$ with $\sum_{k\ge1}a_k^{-1}<\infty$, and $\mathcal I$ the regular partition: then $P\ll\operatorname{Leb}[0,1]$ a.s., so in this case it is a prior on densities!
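A sketch of a truncated Pólya tree draw on the regular dyadic partition, returning the cell probabilities $P(I_\varepsilon)$ at a fixed depth (the choice $a_k=k^2$, which satisfies $\sum_k a_k^{-1}<\infty$, as well as all names, are mine).

```python
import numpy as np

rng = np.random.default_rng(4)

def polya_tree_cells(depth, a=lambda k: k ** 2):
    """Truncated Polya tree on the regular dyadic partition of [0,1]:
    at level k, V_{eps 0} ~ Beta(a_k, a_k); returns the 2**depth probabilities P(I_eps)."""
    probs = np.array([1.0])
    for k in range(1, depth + 1):
        V = rng.beta(a(k), a(k), size=probs.size)          # one split variable per current cell
        probs = np.column_stack((probs * V, probs * (1 - V))).ravel()
    return probs                                           # sums to 1 over the dyadic cells

p = polya_tree_cells(depth=10)
print(p.sum(), p.size)                                     # 1.0, 1024 cells of length 2^-10
```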

30  Part II: Rates

31  Bayesian dominated framework
Experiment: $X=X^{(n)}$, $\mathcal P=\{P^{(n)}_\theta,\ \theta\in\mathcal F\}$, with $(\mathcal F,\mathscr F)$ a measurable space.
Dominated framework: suppose there exists a dominating measure $\mu^{(n)}$, with $dP^{(n)}_\theta=p^{(n)}_\theta(\cdot)\,d\mu^{(n)}$.
Bayesian setting: a) $\theta\sim\Pi$ prior distribution; b) $X\mid\theta\sim P^{(n)}_\theta$; c) $\theta\mid X\sim\Pi[\cdot\mid X]$ posterior.
Bayes formula: for any measurable set $B\in\mathscr F$,
$\Pi(B\mid X^{(n)})=\dfrac{\int_B p^{(n)}_\theta(X^{(n)})\,d\Pi(\theta)}{\int p^{(n)}_\theta(X^{(n)})\,d\Pi(\theta)}.$
Remark: $\Pi[B]=0\ \Rightarrow\ \Pi[B\mid X]=0$.

32  Consistency
Consistency: the posterior is consistent at $\theta_0$ if, as $n\to\infty$, $\Pi(\cdot\mid X^{(n)})\to_w\delta_{\theta_0}$ in $P^{(n)}_{\theta_0}$-probability.
In a separable metric space equipped with a distance $d$: for any $\varepsilon>0$, $\Pi(\theta:\ d(\theta,\theta_0)<\varepsilon\mid X^{(n)})\to1$ in $P^{(n)}_{\theta_0}$-probability.
[Notation: in the sequel, drop the index $n$ and write $P_{\theta_0}$, $E_{\theta_0}$, $X$.]
General consistency results: [Doob (1949)], [Schwartz (1965)], [Barron, Schervish, Wasserman (1999)].

33  Convergence rate
Convergence rate: the posterior converges at rate $\varepsilon_n$ for the distance $d$ at $\theta_0$ if, as $n\to\infty$, $E_{\theta_0}\Pi(\theta:\ d(\theta,\theta_0)\le\varepsilon_n\mid X)\to1$.
It is an upper bound: we look for the smallest possible $\varepsilon_n$.
What happens in a nonparametric framework? [Ghosal, Ghosh, van der Vaart 00], [Ghosal, van der Vaart 07]

34  Convergence rate: lower bound
Lower bound on the rate: we say that $\zeta_n$ is a lower bound for the rate for $d$ at $\theta_0$ if $E_{\theta_0}\Pi(\theta:\ d(\theta,\theta_0)\le\zeta_n\mid X)\to0$.
One looks for the largest possible $\zeta_n$. It may look a little odd at first that there is no mass close to $\theta_0$... It just says that $\zeta_n$ is too fast a scaling to capture much posterior mass.

35  Rates, regular parametric models
E0: $X\mid\theta\sim\mathcal N(\theta,1)^{\otimes n}$, $\theta\sim\mathcal N(0,1)=\Pi$, $\Pi[\cdot\mid X]=\mathcal N\big(\frac{n\bar X_n}{n+1},\frac1{n+1}\big)$.
Exercise: show that, for any $M_n\to\infty$,
$E_{\theta_0}\Pi\Big[\theta:\ |\theta-\theta_0|\le \tfrac{M_n}{\sqrt n}\ \Big|\ X\Big]\to1.$
This extends to regular parametric models and most reasonable priors.

36  Posterior and aspects of the posterior
The posterior distribution $\Pi[\cdot\mid X]$ may have a well-defined
  posterior mean $\bar\theta(X):=\int\theta\,d\Pi(\theta\mid X)$,
  posterior mode $\theta^*(X):=\operatorname{argmax}_\theta \frac{d\Pi(\theta\mid X)}{d\mu}$,
  posterior median, posterior quantiles, ...
These are estimators in the standard sense. Often, but not always, a convergence rate for $\Pi[\cdot\mid X]$ implies the same rate for a given aspect of it.

37  Posterior and aspects of the posterior
Fact 1. Suppose $\Pi[d(\theta,\theta_0)>\varepsilon_n\mid X]=o_P(1)$. Let $\theta_B(X)$ be the center of the smallest $d$-ball containing at least $1/2$ of the posterior mass. Then $d(\theta_B(X),\theta_0)=O_P(\varepsilon_n)$.
Fact 2. Suppose $\Pi[d(\theta,\theta_0)>\varepsilon_n\mid X]=O_P(\varepsilon_n^2)$ and that $\theta\mapsto d^2(\theta,\theta_0)$ is convex and bounded [e.g. $d=h$]. Then $d(\bar\theta(X),\theta_0)=O_P(\varepsilon_n)$.

38  Posterior integrated loss
Fact 3. Suppose $E_{\theta_0}\int d(\theta,\theta_0)^2\,d\Pi(\theta\mid X)\lesssim\varepsilon_n^2$. Then for any $M_n\to\infty$, $E_{\theta_0}\Pi[\theta:\ d(\theta,\theta_0)^2>M_n\varepsilon_n^2\mid X]=o(1)$.
If $\theta\mapsto d^2(\theta,\theta_0)$ is convex, $E_{\theta_0}\big[d(\bar\theta(X),\theta_0)^2\big]\lesssim\varepsilon_n^2$.

39  First examples: E1 fixed design regression [van der Vaart, van Zanten 08, 09]
True function: let $f_0\in\mathcal C^\beta[0,1]$. Loss function: $\|g\|_n^2=n^{-1}\sum_{i=1}^n g(t_i)^2$.
Prior: Brownian motion + Gaussian, $W_t=B_t+Z_0$, with $Z_0\sim\mathcal N(0,1)$.
Then, as $n\to\infty$, $E_{f_0}\Pi[f:\ \|f-f_0\|_n\le\varepsilon_n\mid X]\to1$, where
$\varepsilon_n\asymp n^{-1/4}$ if $\beta\ge1/2$, and $\varepsilon_n\asymp n^{-\beta/2}$ if $\beta\le1/2$.

40  First examples: E1 fixed design regression (continued)
Prior: Riemann-Liouville process with parameter $\alpha>0$,
$W_t=\int_0^t(t-s)^{\alpha-1/2}\,dB_s+\sum_{k=0}^{\lfloor\alpha\rfloor+1}Z_k t^k,\qquad Z_k\sim\mathcal N(0,1)\ \text{i.i.d.}$
Then $E_{f_0}\Pi[f:\ \|f-f_0\|_n\le\varepsilon_n\mid X]\to1$, where
$\varepsilon_n\asymp n^{-\alpha/(2\alpha+1)}$ if $\beta\ge\alpha$, and $\varepsilon_n\asymp n^{-\beta/(2\alpha+1)}$ if $\beta\le\alpha$.

41  First examples: E3 density estimation [van der Vaart, van Zanten 08, 09]
True density: let $f_0\in\mathcal C^\beta[0,1]$ with $f_0>0$. Loss function: Hellinger distance $h(f,g)^2=\int(\sqrt f-\sqrt g)^2$.
Prior: the distribution on continuous densities induced by $t\mapsto \dfrac{e^{W_t}}{\int_0^1 e^{W_u}\,du}$, with $W_t$ either Brownian motion or the Riemann-Liouville process with parameter $\alpha>0$.
Then, for $\varepsilon_n$ as before, $E_{f_0}\Pi[h(f,f_0)\le\varepsilon_n\mid X]\to1$.
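For intuition, a rough sketch of one draw from this prior on densities, with Brownian motion approximated on a grid (the grid-based approximation and names are mine, not the exact construction used in the results above).

```python
import numpy as np

rng = np.random.default_rng(5)

# One draw from the prior on densities: f(t) = exp(W_t) / int_0^1 exp(W_u) du,
# with W Brownian motion approximated on a grid.
t = np.linspace(0.0, 1.0, 1001)
W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(1000) * np.sqrt(np.diff(t)))))
f = np.exp(W)
f /= np.trapz(f, t)                       # renormalise so that the draw integrates to 1
print(np.trapz(f, t))                     # = 1 up to discretisation error
```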

42  Theory: Bayesian nonparametrics
Consistency:
  [Doob (1949)] consistency up to a $\Pi$-null set;
  [Schwartz (1965)] consistency under testing and enough prior mass in Kullback-Leibler-type neighborhoods;
  [Diaconis, Freedman (1986)] example of an innocent-looking prior in a semiparametric framework for which the posterior is not consistent.
Rates of convergence?

43  GGV theory: first lemma
To fix ideas, consider first the density estimation model: $X=(X_1,\dots,X_n)$, $\mathcal P=\{P_f^{\otimes n},\ dP_f=f\,d\mu,\ f\in\mathcal F\}$, with $\mathcal F$ some set of densities and $\Pi$ a prior distribution on $\mathcal F$.
Bayes formula: $\Pi[B\mid X]=\dfrac{\int_B\prod_{i=1}^n f(X_i)\,d\Pi(f)}{\int\prod_{i=1}^n f(X_i)\,d\Pi(f)}.$
Notation: $K(f_0,f)=\int\log\frac{f_0}{f}\,f_0\,d\mu$, $\quad V(f_0,f)=\int\big(\log\frac{f_0}{f}-K(f_0,f)\big)^2 f_0\,d\mu$.

44  GGV theory: first lemma
Define a Kullback-Leibler type neighborhood of $f_0$ by $B_{KL}(f_0,\varepsilon_n):=\{f:\ K(f_0,f)\le\varepsilon_n^2,\ V(f_0,f)\le\varepsilon_n^2\}$.
Lemma 1. Let $A_n$ be measurable and let $n\varepsilon_n^2\to\infty$. Suppose
$\dfrac{\Pi[A_n]\,e^{2n\varepsilon_n^2}}{\Pi[B_{KL}(f_0,\varepsilon_n)]}=o(1).$
Then $\Pi[A_n\mid X]=o_P(1)$.
[Idea: very small prior mass implies small posterior mass.]

45-46  GGV theory: second lemma
Lemma 2. For any prior probability measure $\Pi$ on $\mathcal F$, any $C>0$ and $\varepsilon>0$,
$\int\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\Pi(f)\ \ge\ \Pi(B_{KL}(f_0,\varepsilon))\,e^{-(1+C)n\varepsilon^2}$
with $P_{f_0}$-probability at least $1-1/(C^2n\varepsilon^2)$.
Proof. With $\tilde\Pi=\Pi(\cdot\cap B)/\Pi(B)$ the prior restricted and renormalised to a set $B$,
$\int\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\Pi(f)\ \ge\ \Pi(B)\int_B\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\tilde\Pi(f),$
$\log\int_B\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\tilde\Pi(f)\ \ge\ \sum_{i=1}^n\int_B\log\frac{f}{f_0}(X_i)\,d\tilde\Pi(f)\qquad\text{[Jensen]}$
$=\ -\sum_{i=1}^n\int_B\Big[\log\frac{f_0}{f}(X_i)-KL(f_0,f)\Big]\,d\tilde\Pi(f)\ -\ n\int_B KL(f_0,f)\,d\tilde\Pi(f)\ \ge\ -\sum_{i=1}^n Z_i\ -\ n\varepsilon^2\quad\text{if }B=B_{KL}(f_0,\varepsilon).$

47-48  Proof of Lemma 2 (continued)
With $Z_i:=\int_B\big[\log\frac{f_0}{f}(X_i)-KL(f_0,f)\big]\,d\tilde\Pi(f)$, the $Z_i$ are i.i.d. and $E_{f_0}Z_1=0$ [Fubini], and
$\operatorname{Var}_{f_0}Z_1\ \le\ E_{f_0}\int_B\Big[\log\frac{f_0}{f}(X_1)-KL(f_0,f)\Big]^2 d\tilde\Pi(f)\ \ \text{[Cauchy-Schwarz]}\ =\ \int_B V(f_0,f)\,d\tilde\Pi(f)\ \ \text{[Fubini]}\ \le\ \varepsilon^2\,\tilde\Pi(B)=\varepsilon^2.$
By Chebyshev's inequality, $P_{f_0}\Big[\sum_{i=1}^n Z_i>Cn\varepsilon^2\Big]\le\dfrac1{C^2n\varepsilon^2}$.
On the complementary event, one thus has $\log\int_B\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\tilde\Pi(f)\ge-(C+1)n\varepsilon^2$, that is, $\int\prod_{i=1}^n\frac{f}{f_0}(X_i)\,d\Pi(f)\ge\Pi(B)\,e^{-(C+1)n\varepsilon^2}$.

49  GGV theory: first lemma
Recall $B_{KL}(f_0,\varepsilon_n):=\{f:\ K(f_0,f)\le\varepsilon_n^2,\ V(f_0,f)\le\varepsilon_n^2\}$.
Lemma 1. Let $A_n$ be measurable and let $n\varepsilon_n^2\to\infty$. Suppose $\Pi[A_n]\,e^{2n\varepsilon_n^2}\,/\,\Pi[B_{KL}(f_0,\varepsilon_n)]=o(1)$. Then $\Pi[A_n\mid X]=o_P(1)$.
[Proof of Lemma 1: lower-bound the denominator of the Bayes formula using Lemma 2 with $B=B_{KL}(f_0,\varepsilon_n)$, and bound the expected numerator by $\Pi[A_n]$ via Fubini and Markov's inequality.]

50-51  Theory: Bayesian nonparametrics
Experiment: $X=X^{(n)}$, $\mathcal P=\{P^{(n)}_\theta,\ \theta\in\mathcal F\}$ as before [not necessarily i.i.d. sampling]. Prior: $\Pi$ on $\theta$.
Goal: for some distance $d_n$ and rate $\varepsilon_n$, $E_{\theta_0}\Pi[\theta:\ d_n(\theta,\theta_0)>M\varepsilon_n\mid X]\to0$.
Let us assume that, for $n\varepsilon_n^2\to\infty$, the prior puts enough mass on neighborhoods of $\theta_0$:
$B_{KL}(\theta_0,\varepsilon_n)=\Big\{\theta:\ E_{\theta_0}\log\frac{p^{(n)}_{\theta_0}}{p^{(n)}_\theta}(X)\le n\varepsilon_n^2,\ \ E_{\theta_0}\log^2\frac{p^{(n)}_{\theta_0}}{p^{(n)}_\theta}(X)\le n\varepsilon_n^2\Big\},\qquad \Pi(B_{KL}(\theta_0,\varepsilon_n))\ge e^{-cn\varepsilon_n^2};$
this extends the definition given above in the i.i.d. sampling model.

52-53  Theory: Bayesian nonparametrics
Some sieve sets $\Theta_n$ capture most of the prior mass: $\Pi(\Theta\setminus\Theta_n)\le e^{-(c+4)n\varepsilon_n^2}$.
Test the true parameter vs. the complement of a ball:
(T1) There exist tests $\psi_n$ such that $E_{\theta_0}\psi_n\to0$ and $\sup_{\theta\in\Theta_n:\ d_n(\theta,\theta_0)>M\varepsilon_n}E_\theta(1-\psi_n)\le e^{-(c+4)n\varepsilon_n^2}$.
A sufficient condition is a control of an entropy [see below].

54  Theory: Bayesian nonparametrics
Theorem 1 [Ghosal, Ghosh, van der Vaart 00], [Ghosal, van der Vaart 07]. If there are sets $\Theta_n\subset\Theta$ and constants $c,M>0$ such that
  (tests) there exist tests $\psi_n$ with $E_{\theta_0}\psi_n\to0$ and $\sup_{\theta\in\Theta_n:\ d_n(\theta,\theta_0)>M\varepsilon_n}E_\theta(1-\psi_n)\le e^{-(c+4)n\varepsilon_n^2}$;
  (remaining mass) $\Pi(\Theta\setminus\Theta_n)\le e^{-(c+4)n\varepsilon_n^2}$;
  (prior mass) $\Pi(B_{KL}(\theta_0,\varepsilon_n))\ge e^{-cn\varepsilon_n^2}$;
then, for $M>0$ as above, $E_{\theta_0}\Pi[\theta:\ d_n(\theta,\theta_0)\le M\varepsilon_n\mid X]\to1$.

55  Testing via entropy
(T0) Tests of a point vs. a ball exist for the distance $d_n$, with exponentially small error probabilities.
Let $N(\varepsilon,\Theta_n,d_n)$ be the minimal number of balls of radius $\varepsilon$ for $d_n$ needed to cover $\Theta_n$.
Lemma 3. Suppose (T0) holds and $\log N(\varepsilon_n,\Theta_n,d_n)\le Cn\varepsilon_n^2$. Then for any given $c>0$ and $M$ large enough, (T1) holds.

56  Proof of Lemma 3 [sketch]
Build shells $C_j$ covering the complement of the ball; cover each shell by a minimal number of balls; to each center of a ball, associate a test $\varphi_{i,j}$ via (T0); set $\psi:=\max_{i,j}\varphi_{i,j}$.
The testing errors, of order $e^{-Knj\varepsilon^2}$ over the $j$-th shell, are summable in $j$.

57-58  Testing distances: examples
About the testing property (T0): [Le Cam 70s, Birgé 80s] give general results for testing between convex sets; these imply that (T0) holds in i.i.d. and some simple dependency settings for $d=h$ or $d=\|\cdot\|_1$.
Proposition. In density estimation, $P^{(n)}_f=P_f^{\otimes n}$, (T0) holds for $d=\|\cdot\|_1$.
[Idea of proof.] Let $f_0,f_1$ with $\|f_0-f_1\|_1>\varepsilon$. For $A=\{x:\ f_0(x)<f_1(x)\}$ and $P_n(A)=\frac1n\sum_{i=1}^n 1\{X_i\in A\}$, consider
$\phi_n=1\Big\{P_n(A)>P_{f_0}(A)+\tfrac{\|f_0-f_1\|_1}{3}\Big\},$
and control $E_{f_0}\phi_n$ and $E_f(1-\phi_n)$ via Hoeffding's inequality.
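A sketch of the test $\phi_n$ from the proof idea, evaluated numerically on a grid (the example densities, grid and names are mine; the Hoeffding bounds on the error probabilities are not computed here).

```python
import numpy as np

def l1_test(X, f0, f1, grid):
    """Test phi_n from the proof sketch: with A = {x : f0(x) < f1(x)} and P_n the empirical
    measure, reject f0 when P_n(A) > P_{f0}(A) + ||f0 - f1||_1 / 3 (integrals on a grid)."""
    A = f0(grid) < f1(grid)
    l1_dist = np.trapz(np.abs(f0(grid) - f1(grid)), grid)
    P0_A = np.trapz(np.where(A, f0(grid), 0.0), grid)        # P_{f0}(A), numerically
    Pn_A = np.mean(f0(X) < f1(X))                            # empirical measure of A
    return Pn_A > P0_A + l1_dist / 3.0

# Example: f0 uniform on [0,1], f1 a Beta(2,2) density, data drawn from f1
rng = np.random.default_rng(6)
grid = np.linspace(0.0, 1.0, 2001)
f0 = lambda x: np.ones_like(x)
f1 = lambda x: 6.0 * x * (1.0 - x)
print(l1_test(rng.beta(2, 2, size=500), f0, f1, grid))       # True with high probability
```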

59  Theory: Bayesian nonparametrics
Theorem 1, entropy version [GGV 00]. If there are $\Theta_n\subset\Theta$ and $c>0$ such that, for $d_n$ such that (T0) is verified,
  (entropy) $\log N(\varepsilon_n,\Theta_n,d_n)\le n\varepsilon_n^2$;
  (remaining mass) $\Pi(\Theta\setminus\Theta_n)\le e^{-(c+4)n\varepsilon_n^2}$;
  (prior mass) $\Pi(B_{KL}(\theta_0,\varepsilon_n))\ge e^{-cn\varepsilon_n^2}$;
then, for $M>0$ large enough, $E_{\theta_0}\Pi[\theta:\ d_n(\theta,\theta_0)\le M\varepsilon_n\mid X]\to1$.

60-61  Theory: Gaussian priors
$W=(W_t:\ t\in T)$ a centered Gaussian process taking values in a Banach space $(\mathbb B,\|\cdot\|)$, with covariance kernel $K(s,t)=E(W_sW_t)$.
Reproducing Kernel Hilbert Space $\mathbb H$ (RKHS) associated to $W$: define a norm $\|\cdot\|_{\mathbb H}$ via
$\Big\langle\sum_{i=1}^p a_iK(s_i,\cdot),\ \sum_{j=1}^q b_jK(t_j,\cdot)\Big\rangle_{\mathbb H}=\sum_{i,j}a_ib_jK(s_i,t_j),$
and set $\mathbb H=\overline{\operatorname{Vect}\{K(s,\cdot),\ s\in T\}}^{\,\|\cdot\|_{\mathbb H}}$.
Brownian motion $(B_t)$: $\mathbb H=\big\{t\mapsto\int_0^t f(u)\,du,\ f\in L^2[0,1]\big\}$.
Series prior $\sum_k\sigma_k\nu_k\varepsilon_k$: $\mathbb H=\big\{h=(h_k)\in\ell^2:\ \sum_{k\ge1}h_k^2/\sigma_k^2<+\infty\big\}$.

62  Theory: Gaussian priors
Concentration function. Let $g$ be in the support of $W$ in $\mathbb B$. For $\varepsilon>0$, set
$\varphi_g(\varepsilon)=\inf_{h\in\mathbb H:\ \|h-g\|<\varepsilon}\frac{\|h\|_{\mathbb H}^2}{2}\ -\ \log P(\|W\|<\varepsilon),$
the first term measuring approximation of $g$ from the RKHS, the second a small ball probability.
Fact. For all $g$ in the support of $W$ in $\mathbb B$ and all $\varepsilon>0$, $e^{-\varphi_g(\varepsilon)}\le P(\|W-g\|<\varepsilon)\le e^{-\varphi_g(\varepsilon/2)}$.
Example [small ball probability], Brownian motion $(B_t)$: $-\log P(\|B\|_\infty<\varepsilon)\asymp\varepsilon^{-2}$ as $\varepsilon\to0$.

63  Theory: Gaussian priors [van der Vaart, van Zanten 08]
Consider a nonparametric problem with unknown function $f_0\in\mathbb B$. Prior $\Pi$ = law of a Gaussian process $W$ on $\mathbb B$, with RKHS $\mathbb H$. Assume that $f_0$ is in the support of the prior in $\mathbb B$, and that the norm on $\mathbb B$ combines correctly with the testing distance $d$. Let $\varepsilon_n$ be a solution of the equation $\varphi_{f_0}(\varepsilon_n)\le n\varepsilon_n^2$.
Then the posterior contracts at rate $\varepsilon_n$: for $M$ large enough, $E_{f_0}\Pi(d(f,f_0)>M\varepsilon_n\mid X)\to0$.
Lower bound: [C 08].

64  Theory: Gaussian priors
Ingredients of the proof:
  prior mass: the [Fact] links $P(\|W-w\|<\varepsilon)$ and the concentration function;
  sieves: Borell's inequality [Borell 75]: with $\mathbb B_1$ and $\mathbb H_1$ the unit balls of $\mathbb B$ and $\mathbb H$ associated to $W$,
$P(W\notin M\mathbb H_1+\varepsilon\mathbb B_1)\le 1-\Phi\big(\Phi^{-1}(e^{-\varphi_0(\varepsilon)})+M\big),$
which suggests setting $\Theta_n=M\sqrt n\,\varepsilon_n\mathbb H_1+\varepsilon_n\mathbb B_1$;
  entropy: one can link the entropy of $\mathbb H_1$ and the small ball probability.

65-66  Example: density estimation and Gaussian priors
Squared-exponential covariance kernel: centered Gaussian process $Z_t$ with covariance $E(Z_tZ_s)=e^{-(s-t)^2/L}$.
[van der Vaart, van Zanten 10] show that, for fixed $L$, there are regular functions $f_0$ for which the rate is at best logarithmic: $\varepsilon_n\gtrsim(\log n)^{-\gamma(\beta)}$.
However, those priors are used in machine learning and give very good results in practice when the parameter $L$ is "well chosen"...

67-68  Example: density estimation and Gaussian priors, adaptation [van der Vaart, van Zanten 09]
Prior $\Pi$: draw $A\sim$ Gamma distribution, consider the rescaled process $Z_{At}$ and set
$t\ \mapsto\ \dfrac{e^{Z_{At}}}{\int_0^1 e^{Z_{Au}}\,du},\qquad\text{where } u\mapsto Z_u\ \text{is a centered GP with squared-exponential kernel}.$
Then the posterior converges at the minimax rate up to a log factor:
$E_{f_0}\Pi\big[h(f,f_0)\le M(\log n)^{\gamma(\beta)}n^{-\beta/(2\beta+1)}\mid X\big]\to1.$

69  Example: adaptation on manifolds [C, Kerkyacharian, Picard 14]
On $M$ a compact Riemannian manifold of dimension $d$ without boundary, the Laplacian $\Delta_M$ is a linear operator on functions on $M$ with discrete spectrum: $(-\Delta_M)\varphi_p=\lambda_p\varphi_p$, with $0\le\lambda_1\le\lambda_2\le\cdots$.
Note that $\Delta_M\big(e^{-\lambda_pt}\varphi_p\big)=-\lambda_pe^{-\lambda_pt}\varphi_p=\partial_t\big(e^{-\lambda_pt}\varphi_p\big)$: this is a special solution of the "heat equation" on $M$, $\Delta_Mf=\partial_tf$.

70  Example: adaptation on manifolds
Let $\alpha_p\sim\mathcal N(0,1)$ i.i.d. A GP "random solution of the heat equation" is $W_t:=\sum_{p\ge1}e^{-\lambda_pt/2}\alpha_p\varphi_p$.
The associated family of covariance kernels is the $M$-heat kernel $P_t(x,y)=\sum_{p\ge1}e^{-\lambda_pt}\varphi_p(x)\varphi_p(y)$.
Sub-Gaussian estimates:
$c_1\,\frac{e^{-c\,\rho^2(x,y)/t}}{\sqrt{|B(x,\sqrt t)|\,|B(y,\sqrt t)|}}\ \le\ P_t(x,y)\ \le\ c_2\,\frac{e^{-c'\,\rho^2(x,y)/t}}{\sqrt{|B(x,\sqrt t)|\,|B(y,\sqrt t)|}}.$

71-73  Example: adaptation on manifolds
Model: $dX^{(n)}(x)=f(x)\,dx+\frac1{\sqrt n}\,dZ(x)$, $x\in M$.
Prior $\Pi$: $W^T=\sum_p e^{-\lambda_pT/2}\alpha_p\varphi_p$, with $T$ random with density proportional to $t^{a}e^{-t^{-d/2}\log^q(1/t)}$; $W^T$ is seen as a prior on $(\mathbb B,\|\cdot\|)=(L^2,\|\cdot\|_2)$; set $q=1+d/2$.
Suppose $f_0\in B^s_{2,\infty}(M)$. Then for $M$ large enough, as $n\to\infty$,
$E_{f_0}\Pi\Big[\|f-f_0\|_2\ge M\big(\tfrac{\log n}{n}\big)^{s/(2s+d)}\ \Big|\ X\Big]\to0.$
The rate is sharp: for small enough $\rho$, there exists $f_0$ in $B^s_{2,\infty}(M)$ with
$\Pi\Big[\|f-f_0\|_2\le\rho\big(\tfrac{\log n}{n}\big)^{s/(2s+d)}\ \Big|\ X\Big]\to0.$

74-75  Example: sparsity
Recall $X_i=\theta_i+\varepsilon_i$ and $\theta\in\ell_0[s]$ an $s$-sparse vector.
Example [coin flipping prior with fixed $\alpha_n=1/n$]: $k\sim\operatorname{Bin}(n,\alpha_n)=\pi_n$, $g$ a density on $\mathbb R$, $\Pi_n=\bigotimes_{i=1}^n\big[(1-\alpha_n)\delta_0+\alpha_n g\big]$.
Example [coin flipping prior with random $\alpha$]: $\alpha\sim\operatorname{Beta}(1,n)$, $k\mid\alpha\sim\operatorname{Bin}(n,\alpha)=\pi_n$, $g$ a density on $\mathbb R$, and, given $\alpha$, $\Pi=\bigotimes_{i=1}^n\big[(1-\alpha)\delta_0+\alpha g\big]$.

76  Example: sparsity
Theorem. Let $\Pi$ be the Bayesian coin flipping prior with random $\alpha$, with $\alpha\sim\operatorname{Beta}(1,n)$ and $g$ the Laplace (or a heavier tailed) density. Then for $M$ large enough,
$\sup_{\theta_0\in\ell_0[s_n]}E_{\theta_0}\Pi\big[\theta:\ \|\theta-\theta_0\|>M\sqrt{s_n\log(n/s_n)}\ \big|\ X\big]\to0.$

77-78  Example: mixtures [Rousseau 10]
Observations $X_1,\dots,X_n$ i.i.d. from a density $f_0>0$, Hölder $\mathcal C^\beta[0,1]$.
Beta densities: $g(x\mid a,b)=\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}$ (usual parametrisation); reparametrisation $g_{\alpha,\varepsilon}(x)=g\big(x\mid\frac{\alpha}{\varepsilon},\frac{1-\alpha}{\varepsilon}\big)$.
Prior $\Pi$ = hierarchical mixture of Beta densities $g_{\alpha,k,p_k,\epsilon_k}(\cdot)=\sum_{j=1}^k p_j\,g_{\alpha,\varepsilon_j}(\cdot)$, with $\alpha\sim\pi_\alpha$, $k\sim\pi_k$, $(p_1,\dots,p_k)\mid k\sim$ a law on the canonical simplex of $\mathbb R^k$, and $(\varepsilon_1,\dots,\varepsilon_k)\mid k\sim$ a law on $(0,1)^k$.
Then the posterior concentrates at rate $E_{f_0}\Pi\big[h(f,f_0)\le M(\log n)^{\rho(\beta)}n^{-\beta/(2\beta+1)}\mid X\big]\to1$.

79-82  Posterior measure and aspects of the measure (I)
MESSAGE: the full posterior measure and aspects of it may differ significantly!
Model: $Y=\theta+\varepsilon$, $\varepsilon\sim\mathcal N(0,I_n)$ [sequence model, design $X=I_n$]. Let $\Pi_\lambda\propto\bigotimes_{i=1}^n\operatorname{Laplace}(\lambda)$ [without point masses at zero].
$\hat\theta^{\mathrm{LASSO}}_\lambda=\operatorname{argmin}_\theta\big\{\|Y-X\theta\|_2^2+2\lambda\|\theta\|_1\big\}$ is the posterior mode of $\Pi_\lambda[\cdot\mid Y]$.
Lemma [C, Schmidt-Hieber, van der Vaart 15]. There exists $\delta>0$ such that, as $n\to\infty$,
$E_{\theta_0=0}\,\Pi_{\lambda_n}\Big(\theta:\ \|\theta\|_2\le\delta\frac{\sqrt n}{\lambda_n}\ \Big|\ Y\Big)\to0,\qquad\text{with }\lambda_n=\lambda_n^{\mathrm{LASSO}}=C\sqrt{\log n}.$
Note: $\sqrt{n/\log n}$ is a suboptimal convergence rate!

83-84  Posterior measure and aspects of the measure (II)
Model: $Y=\theta+\varepsilon$, $\varepsilon\sim\mathcal N(0,I_n)$. Prior $\Pi$ on $\theta$: Bayesian coin flipping with random $\alpha$ and $g$ (e.g.) the Laplace density.
Consider estimation of $\theta$ under the $\ell_q$ metric, $0<q\le2$: $d_q(\theta,\theta')=\sum_{i=1}^n|\theta_i-\theta'_i|^q$.
Theorem [C, van der Vaart 12]. For $M$ large enough, as $n\to\infty$, $\sup_{\theta_0\in\ell_0[s_n]}E_{\theta_0}\Pi\big[\theta:\ d_q(\theta,\theta_0)>Mr_{n,q}\mid Y\big]\to0$.
  The posterior measure is rate-optimal for any $0<q\le2$.
  The posterior mean is suboptimal (!) for $q<1$ [Johnstone, Silverman 04].
  The posterior median is rate-optimal for any $0<q\le2$ [Johnstone, Silverman 04], [C, van der Vaart 12].

85  Rates: conclusion and further topics
The previous general rate theorem applies to a variety of problems, i.i.d. or non-i.i.d. (Markov, hidden Markov, ...), as soon as one can find a suitable testing distance $d_n$ and verify prior-mass type conditions.
Sometimes refinements can be used to tackle specific problems [e.g. more precise assumptions using similar ideas].
Other formulations/approaches are also possible: PAC-Bayes, non-asymptotic versions, etc.
Yet, with these techniques, it is unclear how to obtain results for other distances/loss functions that do not satisfy the testing condition (T0).

86  Part III: Shape

87-89  Gaussian example
E0: $X\mid\theta\sim\mathcal N(\theta,1)^{\otimes n}$; $\theta\sim\mathcal N(0,1)=\Pi$; $\Pi[\cdot\mid X]=\mathcal N\big(\frac{n\bar X_n}{n+1},\frac1{n+1}\big)$.
Is $\Pi[\cdot\mid X]$, after recentering and rescaling, close to $\mathcal N(0,1)$?
Recentering and rescaling: set $\tau_a:\ x\mapsto\sqrt n(x-a)$. Comparing distributions: $\|P-Q\|_{TV}=\sup_A|P(A)-Q(A)|=\frac12\|P-Q\|_1$.
The recentering choice $a=\bar X_n$ seems natural. Then
$\Pi[\cdot\mid X]\circ\tau_{\bar X_n}^{-1}=\mathcal N\Big(-\frac{\sqrt n\,\bar X_n}{n+1},\ \frac{n}{n+1}\Big)\approx\mathcal N(0,1).$
Proposition [convergence in the Gaussian conjugate case]:
$E_{\theta_0}\big\|\Pi[\cdot\mid X]\circ\tau_{\bar X_n}^{-1}-\mathcal N(0,1)\big\|_{TV}\to0.$
Exercise: prove it. [For instance, use $\|\cdot\|_{TV}^2\le\frac12 KL$ and compute the KL divergence between Gaussians.]
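A numerical check of the proposition, in the spirit of the exercise (a sketch; the grid-based total-variation computation and all names are mine): the TV distance between the rescaled posterior and $\mathcal N(0,1)$ shrinks as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def tv_rescaled_posterior_vs_standard_normal(n, theta0=0.5):
    """Simulate Example 0 and compute || Pi[.|X] o tau_{Xbar}^{-1} - N(0,1) ||_TV numerically."""
    Xbar = theta0 + rng.standard_normal(n).mean()
    # Posterior N(n*Xbar/(n+1), 1/(n+1)); image under tau(x) = sqrt(n) * (x - Xbar):
    loc = np.sqrt(n) * (n * Xbar / (n + 1) - Xbar)     # = -sqrt(n) * Xbar / (n + 1)
    scale = np.sqrt(n / (n + 1))
    u = np.linspace(-10, 10, 20001)
    return 0.5 * np.trapz(np.abs(norm.pdf(u, loc, scale) - norm.pdf(u)), u)

for n in (10, 100, 10_000):
    print(n, tv_rescaled_posterior_vs_standard_normal(n))   # decreases towards 0
```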

90  Historical example
Laplace [towards 1810] considered the setting $X\mid\theta\sim\operatorname{Bin}(n,\theta)$, $\theta\sim\operatorname{Unif}[0,1]=\Pi$, with associated posterior $\Pi[\cdot\mid X]=\operatorname{Beta}(X+1,n-X+1)$ [a fact noted by Thomas Bayes a few decades before].
Laplace noticed that, with $\tau_{X/n}:\ x\mapsto\sqrt n(x-X/n)$, $\Pi[\cdot\mid X]\circ\tau_{X/n}^{-1}\approx\mathcal N\big(0,\tfrac Xn(1-\tfrac Xn)\big)$.
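A small sketch of Laplace's observation (names, the chosen $n$ and $\theta_0$ are mine): compare the rescaled Beta posterior with its Gaussian approximation.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(8)

# Laplace's example: X | theta ~ Bin(n, theta), theta ~ Unif[0,1], posterior Beta(X+1, n-X+1).
n, theta0 = 2000, 0.3
X = rng.binomial(n, theta0)
that = X / n

# Rescaled posterior of u = sqrt(n) * (theta - X/n) versus its Gaussian approximation
u = np.linspace(-4, 4, 2001)
post = beta.pdf(that + u / np.sqrt(n), X + 1, n - X + 1) / np.sqrt(n)   # density of u
approx = norm.pdf(u, scale=np.sqrt(that * (1 - that)))                  # N(0, theta(1-theta))
print(0.5 * np.trapz(np.abs(post - approx), u))                         # small TV gap
```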

91  Regular parametric models
$\mathcal P=\{P_\theta,\ \theta\in\Theta\subset\mathbb R^k\}$, $dP_\theta(x)=p_\theta(x)\,dx$, $\ell_\theta:=\log p_\theta$. Suppose $\dot\ell_\theta=\partial\ell_\theta/\partial\theta$ exists and $0<I_\theta:=E_\theta[\dot\ell_\theta\dot\ell_\theta^T]<\infty$.
The model is locally asymptotically normal (LAN) at the point $\theta_0\in\Theta$ if, for all $h\in\mathbb R^k$,
$\log\prod_{i=1}^n\frac{p_{\theta_0+h/\sqrt n}}{p_{\theta_0}}(X_i)=\frac1{\sqrt n}h^T\sum_{i=1}^n\dot\ell_{\theta_0}(X_i)-\frac12h^TI_{\theta_0}h+o_{P_{\theta_0}}(1).$
An estimator $\hat\theta=\hat\theta_n(X)$ is (asymptotically linear and) efficient at $\theta_0$ if
$\sqrt n(\hat\theta-\theta_0)=I_{\theta_0}^{-1}\frac1{\sqrt n}\sum_{i=1}^n\dot\ell_{\theta_0}(X_i)+o_{P_{\theta_0}}(1).$
Recall the notation $\tau_{\hat\theta}:\ x\mapsto\sqrt n(x-\hat\theta)$.

92  Bernstein-von Mises, smooth parametric models
$X_1,\dots,X_n\mid\theta\sim P_\theta^{\otimes n}$, $\theta\sim\Pi$.
Theorem [Bernstein-von Mises] [Le Cam, van der Vaart]. Suppose
(S) the model is LAN at $\theta_0$;
(T) for any $\varepsilon>0$ there exist tests $\phi_n$ with $E_{\theta_0}\phi_n=o(1)$ and $\sup_{\|\theta-\theta_0\|>\varepsilon}E_\theta(1-\phi_n)=o(1)$;
(P) the prior $\Pi$ has a continuous positive density at $\theta_0$.
Then for $\hat\theta_n$ an efficient estimator at $\theta_0$, as $n\to\infty$,
$\big\|\Pi[\cdot\mid X]\circ\tau_{\hat\theta_n}^{-1}-\mathcal N(0,I_{\theta_0}^{-1})\big\|_{TV}\to0\quad\text{under }P_{\theta_0}.$

93  Bernstein-von Mises, smooth parametric models
By the Bernstein-von Mises (BvM) theorem, $\|\Pi[\cdot\mid X]\circ\tau_{\hat\theta_n}^{-1}-\mathcal N(0,I_{\theta_0}^{-1})\|_{TV}\to0$ under $P_{\theta_0}$.
For $\hat\theta$ an efficient estimator, under $P_{\theta_0}$, $\sqrt n(\hat\theta-\theta_0)\to_d\mathcal N(0,I_{\theta_0}^{-1})$.
This shows the remarkable duality, asymptotically,
$\mathcal L_\Pi\big(\sqrt n(\theta-\hat\theta)\mid X\big)\ \approx\ \mathcal L\big(\sqrt n(\hat\theta-\theta)\mid\theta=\theta_0\big),$
the Bayes law on the left, the frequentist law on the right.

94  Parametric BvM implies: credible sets are confidence sets!
Suppose BvM holds for $\Theta\subset\mathbb R$ [dimension 1]: $\|\Pi(\cdot\mid X)\circ\tau_{\hat\theta}^{-1}-\mathcal N(0,I_{\theta_0}^{-1})\|_{TV}\to0$ under $P_{\theta_0}$.
Let $A_n,B_n$ be such that, for $\alpha=5\%$, $\Pi((-\infty,A_n)\mid X)=\Pi((B_n,+\infty)\mid X)=\alpha/2$. By definition, $\Pi[(A_n,B_n)\mid X]=1-\alpha$: a credible set.
Exercise: as $n\to\infty$, $P_{\theta_0}[\theta_0\in(A_n,B_n)]\to1-\alpha$, so that $(A_n,B_n)$ is an asymptotic confidence set at level $1-\alpha$.
So, if the parametric BvM holds, credible sets are (asymptotically) confidence sets.
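A Monte Carlo sketch of the exercise in the conjugate Example 0 (names and simulation sizes are mine): the frequentist coverage of the equal-tailed credible interval is close to $1-\alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

def coverage_of_credible_interval(n, theta0=0.3, alpha=0.05, reps=5000):
    """Example 0 again: frequentist coverage of the equal-tailed (1-alpha)-credible interval."""
    hits = 0
    for _ in range(reps):
        Xbar = theta0 + rng.standard_normal(n).mean()
        loc, scale = n * Xbar / (n + 1), np.sqrt(1.0 / (n + 1))     # conjugate posterior
        A, B = norm.ppf([alpha / 2, 1 - alpha / 2], loc, scale)     # posterior quantiles
        hits += (A < theta0 < B)
    return hits / reps

print(coverage_of_credible_interval(n=200))    # close to 0.95, as predicted by the BvM
```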

95-96  BvM in the Gaussian model for a nonconjugate prior
As a particular case of the previous BvM theorem, we have
Proposition [BvM for a nonconjugate prior in the Gaussian model]. Let $X_1,\dots,X_n\mid\theta\sim\mathcal N(\theta,1)$, $\theta\sim\Pi$ with $d\Pi(\theta)=\pi(\theta)\,d\theta$ and $\pi$ a continuous and positive density on $\mathbb R$. Then
$\big\|\Pi[\cdot\mid X]\circ\tau_{\bar X_n}^{-1}-\mathcal N(0,1)\big\|_{TV}\ \to^{P}\ 0.$
Idea of proof: the density of $u=\sqrt n(\theta-\bar X_n)$ under the posterior is
$g_{\bar X_n}(u):=\frac{\exp(-u^2/2)\,\pi(\bar X_n+\frac u{\sqrt n})}{\int\exp(-v^2/2)\,\pi(\bar X_n+\frac v{\sqrt n})\,dv}\ \longrightarrow\ \frac{\exp(-u^2/2)\,\pi(\theta_0)}{\sqrt{2\pi}\,\pi(\theta_0)},$
and one uses Scheffé's lemma on the event $\{|\bar X_n-\theta_0|\le1/\log n\}$.

97  Towards a nonparametric BvM?
And what about a nonparametric BvM?

98-100  Gaussian white noise model
$dX^{(n)}(t)=f(t)\,dt+\frac1{\sqrt n}\,dW(t)$, $t\in[0,1]$. One observes a trajectory $X^{(n)}$; $f\in L^2[0,1]=:L^2$ is unknown and is the object of interest.
Projecting on an orthonormal basis $(\varphi_k)$:
$\int\varphi_k(t)\,dX^{(n)}(t)=\int\varphi_k(t)f(t)\,dt+n^{-1/2}\int\varphi_k(t)\,dW(t),\qquad\text{i.e.}\quad X_k=f_k+n^{-1/2}\varepsilon_k,\quad k\ge1.$
Equivalently, writing $X^{(n)}$ for the sequence $\{X_k,\ k\ge1\}$: $X^{(n)}=f+n^{-1/2}\mathbb W$.

101  Bernstein-von Mises, nonparametric?
$X^{(n)}=f+n^{-1/2}\mathbb W$. Let $\Pi$ be a prior on $f\in L^2$, i.e. a prior on the coordinates $f_k=\langle f,\varphi_k\rangle$, $k\ge1$. Set $\hat f:=E^\Pi[f\mid X^{(n)}]$, the posterior mean.
Two questions:
1. Can one find $\Pi$ in such a way that $\Pi(f-\hat f\mid X^{(n)})\approx\mathcal L(\hat f-f\mid f)$, an optimal law in some sense?
2. In which sense should one interpret $\approx$?

102  Nonparametric BvMs, related work
Negative results [Cox 93], [Freedman 99], [Leahu 11]: let $\Pi$ be a Gaussian prior on $f$ sitting on $L^2$; a nonparametric BvM does not hold in $L^2$.
Some positive results exist for specific models/priors, mostly in cases where some form of conjugacy is present.
[Lo 80s] Consider an i.i.d. sample $X_1,\dots,X_n\sim P$ in $\mathbb R$ with c.d.f. $F$, and a Dirichlet process prior on $P$: a nonparametric BvM holds for $F$,
$\sqrt n(F-\mathbb F_n)\mid X_1,\dots,X_n\ \to_d\ \mathbb G,\ \text{in probability},$
with $\mathbb G$ the law of a $P$-Brownian bridge.
[Kim and Lee 06] Neutral-to-the-right priors on $F$ in the survival analysis model: NP-BvM for the cumulative hazard function $A$.

103-104  A large enough Hilbert space
Standard Sobolev spaces. For $\{\varphi_k,\ k\ge1\}$ a smooth enough orthonormal basis of $L^2$:
$H_2^s:=\Big\{f\in L^2([0,1]):\ \|f\|_{s,2}^2:=\sum_{k\ge1}k^{2s}\langle\varphi_k,f\rangle^2<\infty\Big\}.$
For $\{\psi_{lk}\}$ a smooth enough orthonormal basis of $L^2$:
$H_2^s:=\Big\{f\in L^2([0,1]):\ \|f\|_{s,2}^2:=\sum_{l\in L}a_l^{2s}\sum_{k\in Z_l}\langle\psi_{lk},f\rangle^2<\infty\Big\},$
with $|Z_l|=1$, $a_l=\max(2,|l|)$ for a Fourier-type basis, and $L\subset\mathbb N$, $a_l=|Z_l|=2^l$ for a wavelet-type basis.
Negative Sobolev space: for $s\in\mathbb R$, $H_2^{-s}=\{f:\ \|f\|_{-s,2}^2:=\sum_{l\in L}a_l^{-2s}\sum_{k\in Z_l}\langle\psi_{lk},f\rangle^2<\infty\}$.

105-107  A large enough Hilbert space $\mathbb H$
We use logarithmic Sobolev spaces to get sharp rates. For $\delta>1$ and $s=-1/2$, let
$\mathbb H:=H_2^{-1/2,\delta}=\Big\{f:\ \|f\|_{-1/2,2,\delta}^2:=\sum_{l\in L}\frac{a_l^{2(-1/2)}}{(\log a_l)^{2\delta}}\sum_{k\in Z_l}\langle\psi_{lk},f\rangle^2<\infty\Big\}.$
$X^{(n)}=f+\frac1{\sqrt n}\mathbb W$: the white noise model in $\mathbb H$.
Note that $\mathbb W$ a.s. belongs to $\mathbb H$ (as do $X^{(n)}$ and $f$), since
$E\|\mathbb W\|_{-1/2,2,\delta}^2=\sum_{l\in L}\frac{a_l^{-1}}{(\log a_l)^{2\delta}}\sum_{k\in Z_l}E g_{lk}^2<\infty.$
[This implies that $\mathbb W$ takes values in $\mathbb H$ a.s. and is tight in $\mathbb H$.]
$\mathbb W$ is a centered Gaussian measure on $\mathbb H$ with $E\,\mathbb W(g)\mathbb W(h)=\langle g,h\rangle$, $g,h\in L^2$.
Denote by $\mathcal N$ the law of $\mathbb W$ as a random variable in $\mathbb H$ [the limiting nonparametric distribution].

108-111  Weak nonparametric BvM: definition
$X^{(n)}=f+\frac1{\sqrt n}\mathbb W$. Let $\Pi$ be a prior on $f\in L^2\subset\mathbb H$, and let $\Pi_n=\Pi(\cdot\mid X^{(n)})$ be the posterior distribution on $\mathbb H$.
Denote $\tau:\ f\mapsto\sqrt n(f-X^{(n)})$ and let $\Pi_n\circ\tau^{-1}$ be the rescaled posterior.
Let $\beta$ be the bounded Lipschitz metric for weak convergence of probability measures on a metric space $S$ (here $S=\mathbb H$):
$\beta(\mu,\nu)=\sup_{u\in BL(1)}\Big|\int_S u(s)\,(d\mu-d\nu)(s)\Big|.$
Definition. The prior $\Pi$ satisfies the weak Bernstein-von Mises phenomenon in $\mathbb H$ if
$\beta(\Pi_n\circ\tau^{-1},\mathcal N)\ \to^{P_{f_0}^n}\ 0,\qquad(n\to\infty).$

112  Product priors and $\gamma$-smooth functions
Consider priors of the form $\Pi=\bigotimes_{l,k}\pi_{lk}$ defined on the coordinates of the basis $\{\psi_{lk}\}$, with $\pi_{lk}$ probability measures with Lebesgue densities $\varphi_{lk}$ on $\mathbb R$. Further assume, for some fixed density $\varphi$ and $\sigma_l>0$,
$\varphi_{lk}(\cdot)=\frac1{\sigma_l}\varphi\Big(\frac{\cdot}{\sigma_l}\Big),\qquad k\in Z_l.$
For instance, to model $\gamma$-smooth functions, take $\sigma_l=2^{-l(\gamma+1/2)}$ and set, with $g_{lk}\sim\varphi$,
$G_\gamma=\sum_l\sum_{k\in Z_l}2^{-l(\gamma+1/2)}g_{lk}\psi_{lk},\qquad\gamma>0.$

113  Main theorem: Condition (P)
Denote $f_{0,lk}=\langle f_0,\psi_{lk}\rangle$ the coordinates of the true $f_0$ on the basis $\{\psi_{lk}\}$. Suppose that for some $M>0$,
(P1) $\sup_{l,k}|f_{0,lk}|/\sigma_l\le M$.
Suppose also that $\varphi$ is bounded and that, for some $\tau>M$ and $c_\varphi>0$,
(P2) $\varphi(x)\ge c_\varphi$ for $x\in(-\tau,\tau)$, and $\int_{\mathbb R}x^2\varphi(x)\,dx<\infty$.
For the previous wavelet prior, (P1) asks for $|f_{0,lk}|\le M2^{-l(\gamma+1/2)}$ for all $l,k$, i.e. $f_0\in\mathcal C^\gamma$.

114  Main theorem in white noise [C. and Nickl 13]
Theorem [Weak nonparametric BvM in $\mathbb H$]. Any product prior $\Pi$ and $f_0$ satisfying Conditions (P) verify the weak Bernstein-von Mises phenomenon in $\mathbb H$:
$\beta\big(\Pi(\cdot\mid X^{(n)})\circ\tau^{-1},\mathcal N\big)\ \to^{P_{f_0}^n}\ 0.$
Moreover, the posterior mean $\hat f_n=E^\Pi[f\mid X^{(n)}]$ is efficient in the sense that $\|\hat f_n-X^{(n)}\|_{\mathbb H}=o_P(1/\sqrt n)$.

115  Nonparametric BvM: note on weak convergence
Unlike with total variation convergence, one does not have uniformity over all Borel sets $B$ in $\big|\Pi(\cdot\mid X)\circ\tau^{-1}(B)-\mathcal N(B)\big|\to^{P_{f_0}^n}0$.
One does have uniformity over all sets that have a uniformly smooth boundary for the probability measure $\mathcal N$. In particular, uniformity holds for all $\mathbb H$-balls $\{B_{\mathbb H}(0,t):\ 0<t\le M\}$.

116  Application: credible ellipsoids
Suppose the weak BvM theorem holds for $\Pi$. A natural $(1-\alpha)$-credible set is then obtained by solving for $R_n=R_n(\alpha,X^{(n)})$ such that
$\Pi(C_n\mid X^{(n)})=1-\alpha,\qquad\text{where }C_n=\Big\{f:\ \|f-X^{(n)}\|_{\mathbb H}\le\frac{R_n}{\sqrt n}\Big\}:$
$C_n$ is the smallest ball around $X^{(n)}$ charged by the posterior with probability $1-\alpha$.
Theorem [Elliptical credible set $C_n$]. The random set $C_n$ has confidence $1-\alpha$ asymptotically, in that $P_{f_0}^n(f_0\in C_n)\to1-\alpha$ and $R_n=O_P(1)$.
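A sketch of how $R_n$ could be calibrated by posterior sampling, under simplifying assumptions that are mine: a coordinatewise Gaussian posterior centred at $X^{(n)}$ (roughly justified by $\|\hat f_n-X^{(n)}\|_{\mathbb H}=o_P(1/\sqrt n)$), a finite truncation, and a weighted $\ell^2$ norm standing in for $\|\cdot\|_{\mathbb H}$; all names and inputs below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)

def credible_ellipsoid_radius(n, X_coef, post_sd, weights, alpha=0.05, draws=4000):
    """Monte Carlo calibration of R_n: smallest R such that the posterior puts mass 1 - alpha
    on C_n = { f : ||f - X^{(n)}||_H <= R / sqrt(n) }, for a coordinatewise Gaussian posterior
    centred at X_coef with standard deviations post_sd, using sum_k weights_k * h_k^2 as a
    stand-in for the squared H-norm (finitely many coefficients)."""
    f_draws = X_coef + post_sd * rng.standard_normal((draws, len(X_coef)))
    h_norm = np.sqrt(((f_draws - X_coef) ** 2 * weights).sum(axis=1))
    return np.quantile(np.sqrt(n) * h_norm, 1 - alpha)

# Toy usage with made-up inputs: 2^J coefficients, posterior sds of order 1/sqrt(n)
n, J = 1000, 8
X_coef = rng.standard_normal(2 ** J)
R_n = credible_ellipsoid_radius(n, X_coef, post_sd=np.full(2 ** J, 1 / np.sqrt(n)),
                                weights=np.full(2 ** J, 1.0))
print(R_n)   # the credible ball is C_n = { f : ||f - X^{(n)}||_H <= R_n / sqrt(n) }
```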

117-118  Application: BvM and credible sets for smooth functionals
Linear functionals: for $g_L\in H_2^s$, $s>1/2$, $L:\ f\mapsto\int_0^1 f(t)g_L(t)\,dt$.
Smooth nonlinear functionals, e.g. $f\mapsto\int f^2(t)\,dt$ for $f_0\in\mathcal C^\beta$, $\beta>1/2$, or self-convolutions of 1-periodic functions, $f\mapsto f\ast f$.
This leads to BvM results for these functionals; one deduces credible sets with correct frequentist confidence by taking quantiles of the induced posterior.

119-121  Application: confidence sets in $L^2$, uniform priors
Consider first the special case of a uniform wavelet prior $\Pi$ on $L^2$:
$U_{\gamma,M}=\sum_{l,k}2^{-l(\gamma+1/2)}u_{lk}\psi_{lk}(\cdot),\qquad u_{lk}\sim U[-M,M]\ \text{i.i.d.}$
Such priors model functions in a Hölder ball of radius $M$, with posteriors $\Pi_n$ contracting about $f_0$ at the $L^2$-minimax rate $n^{-\gamma/(2\gamma+1)}$ within log factors if $\|f_0\|_{\gamma,\infty}\le M$.
It is natural to intersect the ellipsoid credible set with the Hölderian support of $\Pi$:
$C_n=\Big\{f:\ \|f-\hat f_n\|_{\mathbb H}\le\frac{R_n}{\sqrt n},\ \|f\|_{\gamma,\infty}\le M\Big\}.$
Proposition [Confidence sets for uniform product priors]. $\Pi(C_n\mid X^{(n)})=1-\alpha$, $P_{f_0}^n(f_0\in C_n)\to1-\alpha$, and the $L^2$-diameter $|C_n|_2$ of $C_n$ satisfies, for some $\kappa>0$, $|C_n|_2=O_P(n^{-\gamma/(2\gamma+1)}(\log n)^\kappa)$.

122-123  Application: confidence sets in $L^2$, general product priors
Consider more generally, if $f_0\in\mathcal C^\gamma$, the product prior $G_\gamma=\sum_{l,k}2^{-l(\gamma+1/2)}g_{lk}\psi_{lk}(\cdot)$, $g_{lk}$ i.i.d. $\varphi$.
Let $\delta>0$ be arbitrary. Let, for $\|\cdot\|_{\gamma,2,1}$ the $\gamma$-Sobolev norm with log correction,
$C_n=\Big\{f:\ \|f-\hat f_n\|_{\mathbb H}\le\frac{R_n}{\sqrt n},\ \|f\|_{\gamma,2,1}\le M_n+4\delta\Big\},$
where $M_n$ is defined as a quantile for the $\gamma$-norm: for any $n$ and $\delta_n=(\log n)^{-1/4}$,
$M_n=\inf\{M>0:\ \Pi_n(f:\ \|f\|_{\gamma,2,1}\le M-\delta)\ge1-\delta_n\}.$
Proposition [Confidence sets for general product priors]. $P_{f_0}^n(f_0\in C_n)\to1-\alpha$, $\Pi(C_n\mid X^{(n)})=1-\alpha+o_P(1)$, and $|C_n|_2=O_P(n^{-\gamma/(2\gamma+1)}(\log n)^\kappa)$.

124  Nonparametric BvMs in white noise: further questions
Adaptation: put a prior on the regularity $\alpha$ [Kolyan Ray]; one obtains e.g. adaptive confidence regions (modulo the expected appropriate restrictions).

Nonparametric estimation using wavelet methods. Dominique Picard. Laboratoire Probabilités et Modèles Aléatoires Université Paris VII Nonparametric estimation using wavelet methods Dominique Picard Laboratoire Probabilités et Modèles Aléatoires Université Paris VII http ://www.proba.jussieu.fr/mathdoc/preprints/index.html 1 Nonparametric

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

Gaussian Approximations for Probability Measures on R d

Gaussian Approximations for Probability Measures on R d SIAM/ASA J. UNCERTAINTY QUANTIFICATION Vol. 5, pp. 1136 1165 c 2017 SIAM and ASA. Published by SIAM and ASA under the terms of the Creative Commons 4.0 license Gaussian Approximations for Probability Measures

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES

ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES Submitted to the Annals of Statistics arxiv: 1111.1044v1 ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES By Anirban Bhattacharya,, Debdeep Pati, and David Dunson, Duke University,

More information

Bayesian Statistics in High Dimensions

Bayesian Statistics in High Dimensions Bayesian Statistics in High Dimensions Lecture 2: Sparsity Aad van der Vaart Universiteit Leiden, Netherlands 47th John H. Barrett Memorial Lectures, Knoxville, Tenessee, May 2017 Contents Sparsity Bayesian

More information

Brittleness and Robustness of Bayesian Inference

Brittleness and Robustness of Bayesian Inference Brittleness and Robustness of Bayesian Inference Tim Sullivan with Houman Owhadi and Clint Scovel (Caltech) Mathematics Institute, University of Warwick Predictive Modelling Seminar University of Warwick,

More information

arxiv: v1 [math.st] 15 Dec 2017

arxiv: v1 [math.st] 15 Dec 2017 A THEORETICAL FRAMEWORK FOR BAYESIAN NONPARAMETRIC REGRESSION: ORTHONORMAL RANDOM SERIES AND RATES OF CONTRACTION arxiv:171.05731v1 math.st] 15 Dec 017 By Fangzheng Xie, Wei Jin, and Yanxun Xu Johns Hopkins

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Lecture 1: Introduction

Lecture 1: Introduction Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Lectures on Nonparametric Bayesian Statistics

Lectures on Nonparametric Bayesian Statistics Lectures on Nonparametric Bayesian Statistics Aad van der Vaart Universiteit Leiden, Netherlands Bad Belzig, March 2013 Contents Introduction Dirichlet process Consistency and rates Gaussian process priors

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

Minimax Estimation of Kernel Mean Embeddings

Minimax Estimation of Kernel Mean Embeddings Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University Submitted to the Annals o Statistics RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS By Chao Gao and Harrison H. Zhou Yale University A novel block prior is proposed or adaptive Bayesian estimation.

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Context tree models for source coding

Context tree models for source coding Context tree models for source coding Toward Non-parametric Information Theory Licence de droits d usage Outline Lossless Source Coding = density estimation with log-loss Source Coding and Universal Coding

More information

Empirical priors for targeted posterior concentration rates

Empirical priors for targeted posterior concentration rates Empirical priors for targeted posterior concentration rates Ryan Martin and Stephen G. Walker June 7, 2018 arxiv:1604.05734v4 [math.st] 6 Jun 2018 Abstract In high-dimensional problems, choosing a prior

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Nonparametric estimation under Shape Restrictions

Nonparametric estimation under Shape Restrictions Nonparametric estimation under Shape Restrictions Jon A. Wellner University of Washington, Seattle Statistical Seminar, Frejus, France August 30 - September 3, 2010 Outline: Five Lectures on Shape Restrictions

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Optimal series representations of continuous Gaussian random fields

Optimal series representations of continuous Gaussian random fields Optimal series representations of continuous Gaussian random fields Antoine AYACHE Université Lille 1 - Laboratoire Paul Painlevé A. Ayache (Lille 1) Optimality of continuous Gaussian series 04/25/2012

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Lecture 2: Priors and Conjugacy

Lecture 2: Priors and Conjugacy Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

arxiv: v1 [math.st] 6 Nov 2012

arxiv: v1 [math.st] 6 Nov 2012 The Annals of Statistics 212, Vol. 4, No. 4, 269 211 DOI: 1.1214/12-AOS129 c Institute of Mathematical Statistics, 212 arxiv:1211.1197v1 [math.st] 6 Nov 212 NEEDLES AND STRAW IN A HAYSTACK: POSTERIOR CONCENTRATION

More information

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book)

Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Some Asymptotic Bayesian Inference (background to Chapter 2 of Tanner s book) Principal Topics The approach to certainty with increasing evidence. The approach to consensus for several agents, with increasing

More information

1 Continuity Classes C m (Ω)

1 Continuity Classes C m (Ω) 0.1 Norms 0.1 Norms A norm on a linear space X is a function : X R with the properties: Positive Definite: x 0 x X (nonnegative) x = 0 x = 0 (strictly positive) λx = λ x x X, λ C(homogeneous) x + y x +

More information

Nonparametric estimation under Shape Restrictions

Nonparametric estimation under Shape Restrictions Nonparametric estimation under Shape Restrictions Jon A. Wellner University of Washington, Seattle Statistical Seminar, Frejus, France August 30 - September 3, 2010 Outline: Five Lectures on Shape Restrictions

More information

POSTERIOR CONSISTENCY IN CONDITIONAL DENSITY ESTIMATION BY COVARIATE DEPENDENT MIXTURES. By Andriy Norets and Justinas Pelenis

POSTERIOR CONSISTENCY IN CONDITIONAL DENSITY ESTIMATION BY COVARIATE DEPENDENT MIXTURES. By Andriy Norets and Justinas Pelenis Resubmitted to Econometric Theory POSTERIOR CONSISTENCY IN CONDITIONAL DENSITY ESTIMATION BY COVARIATE DEPENDENT MIXTURES By Andriy Norets and Justinas Pelenis Princeton University and Institute for Advanced

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Lecture 16: Sample quantiles and their asymptotic properties

Lecture 16: Sample quantiles and their asymptotic properties Lecture 16: Sample quantiles and their asymptotic properties Estimation of quantiles (percentiles Suppose that X 1,...,X n are i.i.d. random variables from an unknown nonparametric F For p (0,1, G 1 (p

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Posterior properties of the support function for set inference

Posterior properties of the support function for set inference Posterior properties of the support function for set inference Yuan Liao University of Maryland Anna Simoni CNRS and CREST November 26, 2014 Abstract Inference on sets is of huge importance in many statistical

More information

Nonparametric estimation of log-concave densities

Nonparametric estimation of log-concave densities Nonparametric estimation of log-concave densities Jon A. Wellner University of Washington, Seattle Seminaire, Institut de Mathématiques de Toulouse 5 March 2012 Seminaire, Toulouse Based on joint work

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Estimation of the functional Weibull-tail coefficient

Estimation of the functional Weibull-tail coefficient 1/ 29 Estimation of the functional Weibull-tail coefficient Stéphane Girard Inria Grenoble Rhône-Alpes & LJK, France http://mistis.inrialpes.fr/people/girard/ June 2016 joint work with Laurent Gardes,

More information

EE514A Information Theory I Fall 2013

EE514A Information Theory I Fall 2013 EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po The Annals of Statistics 2001, Vol. 29, No. 5, 1264 1280 CONVERGENCE RATES FOR DENSITY ESTIMATION WITH BERNSTEIN POLYNOMIALS By Subhashis Ghosal University of Minnesota Mixture models for density estimation

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Nonparametric estimation of log-concave densities

Nonparametric estimation of log-concave densities Nonparametric estimation of log-concave densities Jon A. Wellner University of Washington, Seattle Northwestern University November 5, 2010 Conference on Shape Restrictions in Non- and Semi-Parametric

More information

Brittleness and Robustness of Bayesian Inference

Brittleness and Robustness of Bayesian Inference Brittleness and Robustness of Bayesian Inference Tim Sullivan 1 with Houman Owhadi 2 and Clint Scovel 2 1 Free University of Berlin / Zuse Institute Berlin, Germany 2 California Institute of Technology,

More information

1.1 Basis of Statistical Decision Theory

1.1 Basis of Statistical Decision Theory ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

More information

Bayesian statistics: Inference and decision theory

Bayesian statistics: Inference and decision theory Bayesian statistics: Inference and decision theory Patric Müller und Francesco Antognini Seminar über Statistik FS 28 3.3.28 Contents 1 Introduction and basic definitions 2 2 Bayes Method 4 3 Two optimalities:

More information

TD 1: Hilbert Spaces and Applications

TD 1: Hilbert Spaces and Applications Université Paris-Dauphine Functional Analysis and PDEs Master MMD-MA 2017/2018 Generalities TD 1: Hilbert Spaces and Applications Exercise 1 (Generalized Parallelogram law). Let (H,, ) be a Hilbert space.

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Brownian Motion and Conditional Probability

Brownian Motion and Conditional Probability Math 561: Theory of Probability (Spring 2018) Week 10 Brownian Motion and Conditional Probability 10.1 Standard Brownian Motion (SBM) Brownian motion is a stochastic process with both practical and theoretical

More information