Bayesian Adaptation
Aad van der Vaart
http://www.math.vu.nl/~aad
Vrije Universiteit Amsterdam
Joint work with Jyri Lember
Adaptation
Given a collection of possible models, find a single procedure that works well for all models, as well as a procedure specifically targeted to the correct model.
The correct model is the one that contains the true distribution of the data.
Adaptation to Smoothness
Given a random sample of size $n$ from a density $p_0$ on $\mathbb{R}$ that is known to have $\alpha$ derivatives, there exist estimators $\hat p_n$ with rate $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$, i.e.
$\mathrm{E}_{p_0}\, d^2(\hat p_n, p_0) = O(\epsilon_{n,\alpha}^2)$,
uniformly in $p_0$ with $\int (p_0^{(\alpha)})^2 \, d\lambda$ bounded (if $d = \|\cdot\|_2$).
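As a quick numeric illustration of the rate $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$ (a sketch; the sample size and smoothness values are arbitrary choices, not from the slides):

```python
# Minimax rates eps_{n,alpha} = n^(-alpha/(2*alpha+1)) for density estimation.
# Smoother densities (larger alpha) permit faster rates, approaching the
# parametric rate n^(-1/2) as alpha -> infinity.

def rate(n, alpha):
    """Optimal rate for estimating an alpha-smooth density from n observations."""
    return n ** (-alpha / (2 * alpha + 1))

n = 10_000
for alpha in (0.5, 1, 2, 8):
    print(f"alpha={alpha}: eps = {rate(n, alpha):.4f}")
# The parametric benchmark for comparison:
print(f"parametric  : eps = {n ** -0.5:.4f}")
```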
Distances
Global distances $d$ on densities can be one of:
Hellinger: $h(p,q) = \bigl(\int (\sqrt p - \sqrt q)^2 \, d\mu\bigr)^{1/2}$
Total variation: $\|p - q\|_1 = \int |p - q| \, d\mu$
$L_2$: $\|p - q\|_2 = \bigl(\int (p - q)^2 \, d\mu\bigr)^{1/2}$
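A minimal numerical check of the three distances, for two normal densities (an illustrative choice, not from the slides), with the integrals replaced by Riemann sums:

```python
import numpy as np

# Hellinger, total variation and L2 distances between the densities of
# N(0,1) and N(1,1), with integrals approximated by Riemann sums on a grid.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # N(0,1)
q = np.exp(-(x - 1.0)**2 / 2) / np.sqrt(2 * np.pi)  # N(1,1)

hellinger = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx)
total_var = np.sum(np.abs(p - q)) * dx
l2_dist = np.sqrt(np.sum((p - q)**2) * dx)

print(hellinger, total_var, l2_dist)
```

The distances are comparable in general: $h^2 \le \|p - q\|_1 \le 2h$ always holds, and the numbers above obey it.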
Adaptation
Data $X_1,\dots,X_n$ i.i.d. $p_0$
Models $\mathcal P_{n,\alpha}$ for $\alpha \in A$, countable
Optimal rates $\epsilon_{n,\alpha}$
$p_0$ contained in or close to $\mathcal P_{n,\beta}$, some $\beta \in A$
We want procedures that (almost) attain the rate $\epsilon_{n,\beta}$, but we do not know $\beta$
Adaptation-NonBayesian
Main methods: penalization, cross validation.
Penalization: minimize your favourite contrast function (MLE, LS, ...), but add a penalty for model complexity.
Cross validation: split the sample; use the first half to select the best estimator for each model; use the second half to select the best model.
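A minimal sketch of the penalization recipe, for histogram density estimators on $[0,1]$ indexed by their number of bins $K$ (the contrast, the model family and the linear penalty are illustrative choices, not prescribed by the slides):

```python
import numpy as np

# Penalized model selection sketch: histogram estimators on [0,1], negative
# log-likelihood contrast, penalty proportional to the number of bins K.
rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=500)          # sample from a smooth density on [0,1]

def neg_loglik(x, K):
    """Negative log likelihood of the K-bin histogram density estimator."""
    counts, _ = np.histogram(x, bins=K, range=(0.0, 1.0))
    p_hat = K * counts / len(x)       # histogram density on each bin
    bin_of = np.clip((x * K).astype(int), 0, K - 1)
    return -np.sum(np.log(p_hat[bin_of] + 1e-300))

def select_K(x, K_grid, pen=lambda K: K):
    """Minimize contrast + penalty over the candidate models."""
    scores = {K: neg_loglik(x, K) + pen(K) for K in K_grid}
    return min(scores, key=scores.get)

K_hat = select_K(x, range(1, 51))
print("selected number of bins:", K_hat)
```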
Adaptation-Penalization
Models $\mathcal P_{n,\alpha}$, $\alpha \in A$
Estimator given model: $\hat p_{n,\alpha} = \operatorname{argmin}_{p \in \mathcal P_{n,\alpha}} M_n(p)$
Estimated model: $\hat\alpha_n = \operatorname{argmin}_{\alpha \in A} \bigl( M_n(\hat p_{n,\alpha}) + \mathrm{pen}_n(\alpha) \bigr)$
Final estimator: $\hat p_n = \hat p_{n,\hat\alpha_n}$
If $M_n$ is minus the log likelihood, then $\hat p_n$ is the posterior mode relative to the prior $\pi_n(p,\alpha) \propto \exp\bigl(-\mathrm{pen}_n(\alpha)\bigr)$
Adaptation-Bayesian
Models $\mathcal P_{n,\alpha}$, $\alpha \in A$
Prior $\Pi_{n,\alpha}$ on $\mathcal P_{n,\alpha}$
Prior $(\lambda_{n,\alpha})_{\alpha \in A}$ on $A$
Overall prior $\Pi_n = \sum_{\alpha \in A} \lambda_{n,\alpha} \Pi_{n,\alpha}$
Posterior $B \mapsto \Pi_n(B \mid X_1,\dots,X_n)$:
$\Pi_n(B \mid X_1,\dots,X_n) = \dfrac{\int_B \prod_{i=1}^n p(X_i)\,d\Pi_n(p)}{\int \prod_{i=1}^n p(X_i)\,d\Pi_n(p)} = \dfrac{\sum_{\alpha \in A} \lambda_{n,\alpha} \int_{p \in \mathcal P_{n,\alpha} : p \in B} \prod_{i=1}^n p(X_i)\,d\Pi_{n,\alpha}(p)}{\sum_{\alpha \in A} \lambda_{n,\alpha} \int_{\mathcal P_{n,\alpha}} \prod_{i=1}^n p(X_i)\,d\Pi_{n,\alpha}(p)}$
Desired result: if $p_0 \in \mathcal P_{n,\beta}$ (or is close), then
$\mathrm{E}_{p_0} \Pi_n\bigl(p : d(p,p_0) \ge M_n \epsilon_{n,\beta} \mid X_1,\dots,X_n\bigr) \to 0$ for every $M_n \to \infty$
Single Model
Model $\mathcal P_{n,\beta}$, prior $\Pi_{n,\beta}$
THEOREM (GGvdV, 2000) If
$\log N(\epsilon_{n,\beta}, \mathcal P_{n,\beta}, d) \le E n\epsilon_{n,\beta}^2$ (entropy)
$\Pi_{n,\beta}\bigl(B_{n,\beta}(\epsilon_{n,\beta})\bigr) \ge e^{-F n\epsilon_{n,\beta}^2}$ (prior mass)
then the posterior rate of convergence is $\epsilon_{n,\beta}$
$B_{n,\alpha}(\epsilon)$ is a Kullback-Leibler ball around $p_0$:
$B_{n,\alpha}(\epsilon) = \Bigl\{p \in \mathcal P_{n,\alpha} : P_0 \log\dfrac{p_0}{p} \le \epsilon^2,\; P_0\Bigl(\log\dfrac{p_0}{p}\Bigr)^2 \le \epsilon^2 \Bigr\}$
Covering Numbers
DEFINITION The covering number $N(\epsilon, \mathcal P, d)$ is the minimal number of balls of radius $\epsilon$ needed to cover the set $\mathcal P$.
The rate at which $N(\epsilon, \mathcal P, d)$ increases as $\epsilon \downarrow 0$ determines the size of the model:
Parametric model: $(1/\epsilon)^d$
Nonparametric model: $e^{(1/\epsilon)^{1/\alpha}}$, e.g. smoothness $\alpha$
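The two growth regimes can be compared numerically; a sketch with the illustrative choices $d = 5$ and $\alpha = 1$:

```python
import math

# Growth of log-covering numbers as eps -> 0: parametric d*log(1/eps) versus
# nonparametric (1/eps)^(1/alpha).  The nonparametric entropy eventually
# dominates any parametric one, however large the dimension d.

def log_N_parametric(eps, d=5):
    return d * math.log(1.0 / eps)

def log_N_nonparametric(eps, alpha=1.0):
    return (1.0 / eps) ** (1.0 / alpha)

for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps}: parametric {log_N_parametric(eps):8.1f}, "
          f"nonparametric {log_N_nonparametric(eps):8.1f}")
```

Note the crossover: at $\epsilon = 0.1$ the five-dimensional parametric entropy is still the larger of the two, but for small $\epsilon$ the nonparametric entropy wins.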
Motivation Entropy
The solution $\epsilon_n$ to $\log N(\epsilon_n, \mathcal P_n, d) \asymp n\epsilon_n^2$ gives the optimal rate of convergence for the model $\mathcal P_n$ in the minimax sense.
Le Cam (1975, 1986), Birgé (1983), Barron and Yang (1999)
Motivation Prior Mass
$\Pi_n\bigl(B_n(\epsilon_n)\bigr) \ge e^{-n\epsilon_n^2}$ (prior mass)
Need $N(\epsilon_n, \mathcal P, d) \le \exp(n\epsilon_n^2)$ balls
Can place $\exp(c\,n\epsilon_n^2)$ balls
If $\Pi_n$ is uniform, then each ball receives mass $\exp(-C n\epsilon_n^2)$
Equivalence KL and Hellinger
The prior mass condition uses Kullback-Leibler balls, whereas the entropy condition uses $d$-balls. These are typically (almost) equivalent:
If the ratios $p_0/p$ of densities are bounded, then fully equivalent
If $P_0 (p_0/p)^b$ is bounded for some $b > 0$, then equivalent up to logarithmic factors
Single Model
Recall the single-model theorem: entropy plus prior mass at the rate $\epsilon_{n,\beta}$ give posterior rate $\epsilon_{n,\beta}$.
Can actually replace the entropy $\log N(\epsilon, \mathcal P_{n,\beta}, d)$ by the Le Cam dimension $\sup_{\eta > \epsilon} \log N\bigl(\eta/2, C_{n,\beta}(\eta), d\bigr)$
Can also refine the prior mass condition
Adaptation-Bayesian
Models $\mathcal P_{n,\alpha}$, priors $\Pi_{n,\alpha}$ and $(\lambda_{n,\alpha})_{\alpha \in A}$, overall prior $\sum_{\alpha \in A} \lambda_{n,\alpha}\Pi_{n,\alpha}$, posterior $B \mapsto \Pi_n(B \mid X_1,\dots,X_n)$
Desired result: if $p_0 \in \mathcal P_{n,\beta}$ (or is close), then
$\mathrm{E}_{p_0} \Pi_n\bigl(p : d(p,p_0) \ge M \epsilon_{n,\beta} \mid X_1,\dots,X_n\bigr) \to 0$ for every sufficiently large $M$.
Adaptation (1)
$A$ finite, ordered; $n\epsilon_{n,\beta}^2 \to \infty$; $\epsilon_{n,\alpha} \ge \epsilon_{n,\beta}$ if $\alpha \le \beta$
$\lambda_{n,\alpha} \propto \lambda_\alpha e^{-C n\epsilon_{n,\alpha}^2}$
Small models get big weights
THEOREM If
$\log N(\epsilon_{n,\alpha}, \mathcal P_{n,\alpha}, d) \le E n\epsilon_{n,\alpha}^2$ (entropy, every $\alpha$)
$\Pi_{n,\beta}\bigl(B_{n,\beta}(\epsilon_{n,\beta})\bigr) \ge e^{-F n\epsilon_{n,\beta}^2}$ (prior mass)
then the posterior rate is $\epsilon_{n,\beta}$
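The weights $\lambda_{n,\alpha} \propto \lambda_\alpha e^{-Cn\epsilon_{n,\alpha}^2}$ can be computed for the smoothness rates $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$; a sketch with the arbitrary choices $C = 1$, $\lambda_\alpha = 1$ and $n = 1000$:

```python
import math

# Model weights lambda_{n,alpha} ∝ exp(-C * n * eps_{n,alpha}^2) with
# eps_{n,alpha} = n^(-alpha/(2*alpha+1)), so n * eps^2 = n^(1/(2*alpha+1)).

def weights(n, alphas, C=1.0):
    raw = {a: math.exp(-C * n * n ** (-2 * a / (2 * a + 1))) for a in alphas}
    total = sum(raw.values())
    return {a: w / total for a, w in raw.items()}

w = weights(n=1000, alphas=(0.5, 1, 2, 4))
print(w)
```

Larger $\alpha$ means a smaller (smoother) model and a smaller exponent $n\epsilon_{n,\alpha}^2 = n^{1/(2\alpha+1)}$, so the smallest model indeed gets the biggest weight.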
Adaptation (2)
Extension to countable $A$ is possible in two ways:
truncation of the weights $\lambda_{n,\alpha}$ to subsets $A_n \uparrow A$
additional entropy control
Also replace $\beta$ by $\beta_n$
Always assume $\sum_\alpha (\lambda_\alpha/\lambda_{\beta_n}) \exp(-C n\epsilon_{n,\alpha}^2/4) = O(1)$
Adaptation (2a)-Truncation
$A_n \subset A$, $\beta_n \in A_n$, $\log(\#A_n) \lesssim n\epsilon_{n,\beta_n}^2$
$\lambda_{n,\alpha} \propto \lambda_\alpha e^{-C n\epsilon_{n,\alpha}^2} 1_{A_n}(\alpha)$
$\max_{\alpha \in A_n : \epsilon_{n,\alpha}^2 \le H\epsilon_{n,\beta_n}^2} E_\alpha \epsilon_{n,\alpha}^2 / \epsilon_{n,\beta_n}^2 = O(1)$, $H \ge 1$
THEOREM If
$\log N(\epsilon_{n,\alpha}, \mathcal P_{n,\alpha}, d) \le E n\epsilon_{n,\alpha}^2$ (entropy, every $\alpha$)
$\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr) \ge e^{-F n\epsilon_{n,\beta_n}^2}$ (prior mass)
then the posterior rate is $\epsilon_{n,\beta_n}$
Adaptation (2b)-Entropy control
$A$ countable, $n\epsilon_{n,\beta}^2 \to \infty$
$\lambda_{n,\alpha} \propto \lambda_\alpha e^{-C n\epsilon_{n,\alpha}^2}$
THEOREM If $H \ge 1$ and
$\log N\bigl(\epsilon_{n,\beta_n}, \bigcup_{\alpha : \epsilon_{n,\alpha} \le H\epsilon_{n,\beta_n}} \mathcal P_{n,\alpha}, d\bigr) \le E n\epsilon_{n,\beta_n}^2$ (entropy)
$\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr) \ge e^{-F n\epsilon_{n,\beta_n}^2}$ (prior mass)
then the posterior rate is $\epsilon_{n,\beta_n}$
Discrete priors
Discrete priors that are uniform on specially constructed approximating sets are universal, in the sense that under abstract and mild conditions they give the desired result.
To avoid unnecessary logarithmic factors we need to replace ordinary entropy by the slightly more restrictive bracketing entropy.
Bracketing Numbers
Given $l, u : \mathcal X \to \mathbb{R}$, the bracket $[l,u]$ is the set of $p : \mathcal X \to \mathbb{R}$ with $l \le p \le u$.
[figure: a pair of functions $l \le u$ bracketing a density]
An $\epsilon$-bracket relative to $d$ is a bracket $[l,u]$ with $d(u,l) < \epsilon$.
DEFINITION The bracketing number $N_{[\,]}(\epsilon, \mathcal P, d)$ is the minimum number of $\epsilon$-brackets needed to cover $\mathcal P$.
Discrete priors
$\mathcal Q_{n,\alpha}$ a collection of nonnegative functions with $\log N_{[\,]}(\epsilon_{n,\alpha}, \mathcal Q_{n,\alpha}, h) \le E_\alpha n\epsilon_{n,\alpha}^2$
$u_1,\dots,u_N$ a minimal set of $\epsilon_{n,\alpha}$-upper brackets
$\tilde u_1,\dots,\tilde u_N$ the normalized functions
Prior: $\Pi_{n,\alpha}$ uniform on $\tilde u_1,\dots,\tilde u_N$; model: $\bigcup_{M>0} M\mathcal Q_{n,\alpha}$
THEOREM If $\lambda_{n,\alpha}$ and $A_n \subset A$ are as before, and $p_0 \in M_0 \mathcal Q_{n,\beta}$, then the posterior rate is $\epsilon_{n,\beta}$, relative to the Hellinger distance.
Smoothness Spaces
$\mathbb B^\alpha_1$ the unit ball in a Banach space $\mathbb B^\alpha$ of functions, with $\log N_{[\,]}\bigl(\epsilon_{n,\alpha}, \mathbb B^\alpha_1, \|\cdot\|_2\bigr) \le E_\alpha n\epsilon_{n,\alpha}^2$
Model: $p \in \mathbb B^\alpha$
THEOREM There exists a prior such that the posterior rate is $\epsilon_{n,\beta}$ whenever $p_0 \in \mathbb B^\beta$ for some $\beta > 0$.
EXAMPLE Hölder spaces and Sobolev spaces of $\alpha$-smooth functions, with $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$. Besov spaces (in progress).
Finite-Dimensional Models
Model $\mathcal P_J$ of dimension $J$
Bias: $p_0$ is $\beta$-regular if $d(p_0, \mathcal P_J) \le (1/J)^\beta$
Variance: precision when estimating $J$ parameters is $\sqrt{J/n}$
Bias-variance trade-off: $(1/J)^{2\beta} \asymp J/n$
Optimal dimension $J \asymp n^{1/(2\beta+1)}$, rate $\epsilon_{n,J} \asymp n^{-\beta/(2\beta+1)}$
We want to adapt to $\beta$ by putting weights on $J$
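The trade-off can be checked numerically by minimizing the sum of the two error terms over $J$ (a sketch; the values of $n$ and $\beta$ are arbitrary):

```python
# Bias-variance trade-off: bias (1/J)^beta versus estimation error sqrt(J/n).
# Balancing the two terms gives J ~ n^(1/(2*beta+1)) and rate n^(-beta/(2*beta+1)).

def total_error(J, n, beta):
    return (1.0 / J) ** beta + (J / n) ** 0.5

def best_dimension(n, beta):
    """Brute-force minimizer of bias + estimation error over J = 1, ..., n."""
    return min(range(1, n + 1), key=lambda J: total_error(J, n, beta))

n, beta = 100_000, 1.0
J_star = best_dimension(n, beta)
print("optimal J:", J_star, " theory ~ n^(1/3) =", round(n ** (1 / (2 * beta + 1))))
```

The brute-force minimizer agrees with the theoretical order $n^{1/(2\beta+1)}$ up to a constant.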
Finite-Dimensional Models
The model dimension can be taken to be the Le Cam dimension
$\sup_{\eta > \epsilon} \log N\bigl(\eta/2, \{p \in \mathcal P_J : d(p,p_0) < \eta\}, d\bigr)$,
which for a $J$-dimensional model is of the order $J$.
Finite-Dimensional Models
Models $\mathcal P_{J,M}$ of Le Cam dimension $\le A_M J$, $J \in \mathbb{N}$, $M \in \mathcal M$
Prior mass: $\Pi_{J,M}\bigl(B_{J,M}(\epsilon)\bigr) \ge (B_J C_M \epsilon)^J$ for $\epsilon > D_M\, d(p_0, \mathcal P_{J,M})$
This corresponds to a smooth prior on the $J$-dimensional model
Weights: $\lambda_{n,J,M} \propto e^{-C n\epsilon_{n,J,M}^2}\, 1_{\mathcal J_n \times \mathcal M_n}(J,M)$
Assume $\log C_M \lesssim A_M$, $B_J \lesssim J^k$, $\sum_{M \in \mathcal M} e^{-HA_M} < \infty$, and set $\epsilon_{n,J,M} = \sqrt{A_M J \log n / n}$
THEOREM If there exist $J_n \in \mathcal J_n$ with $J_n \lesssim n$ and $d(p_0, \mathcal P_{n,J_n,M_0}) \le \epsilon_{n,J_n,M_0}$, then the posterior rate is $\epsilon_{n,J_n,M_0}$
Finite-Dimensional Models: Examples
If $p_0 \in \mathcal P_{J_0,M_0}$ for some $J_0$, then the rate is $\sqrt{(\log n)/n}$.
If $d(p_0, \mathcal P_{J,M_0}) \le J^{-\beta}$ for every $J$, the rate is $(n/\log n)^{-\beta/(2\beta+1)}$.
If $d(p_0, \mathcal P_{J,M_0}) \le e^{-J^\beta}$ for every $J$, then the rate is $(\log n)^{1/\beta + 1/2}/\sqrt{n}$.
Can the logarithmic factors be avoided? By using different weights and/or different model priors?
Splines
$[0,1) = \bigcup_{k=1}^K \bigl[(k-1)/K, k/K\bigr)$
A spline of order $q$ is a continuous function $f : [0,1] \to \mathbb{R}$ that is
$q-2$ times differentiable on $[0,1)$
a polynomial of degree $< q$ when restricted to every $\bigl[(k-1)/K, k/K\bigr)$
[figure: a linear spline]
Splines form a $J = q + K - 1$-dimensional vector space
Convenient basis: the B-splines $B_{J,1},\dots,B_{J,J}$
Splines-Properties
$[0,1) = \bigcup_{k=1}^K \bigl[(k-1)/K, k/K\bigr)$, $\theta^T B_J = \sum_j \theta_j B_{J,j}$, $\theta \in \mathbb{R}^J$, $J = K + q - 1$
Approximation of smooth functions: if $q \ge \alpha > 0$ and $f \in C^\alpha[0,1]$, then
$\inf_{\theta \in \mathbb{R}^J} \|\theta^T B_J - f\|_\infty \le C_{q,\alpha} \Bigl(\dfrac{1}{J}\Bigr)^\alpha \|f\|_\alpha$
Equivalence of norms: for any $\theta \in \mathbb{R}^J$,
$\dfrac{\|\theta\|_2}{\sqrt J} \lesssim \|\theta^T B_J\|_2 \lesssim \dfrac{\|\theta\|_2}{\sqrt J}$
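The B-spline basis can be built directly from the Cox-de Boor recursion; a self-contained sketch (the clamped knot convention, with boundary knots repeated $q$ times, is one standard choice):

```python
import numpy as np

# B-spline basis B_{J,1},...,B_{J,J} on [0,1) with K equal knot intervals and
# order q, so J = K + q - 1, built via the Cox-de Boor recursion.

def bspline_basis(x, K, q):
    """Return the J = K + q - 1 B-spline basis functions evaluated at x."""
    interior = np.arange(1, K) / K
    t = np.concatenate([np.zeros(q), interior, np.ones(q)])  # clamped knots
    x = np.asarray(x, dtype=float)
    # order-1 (piecewise constant) B-splines
    B = [((t[i] <= x) & (x < t[i + 1])).astype(float) for i in range(len(t) - 1)]
    for r in range(1, q):                      # raise the order step by step
        B_next = []
        for i in range(len(B) - 1):
            left = np.zeros_like(x)
            if t[i + r] > t[i]:
                left = (x - t[i]) / (t[i + r] - t[i]) * B[i]
            right = np.zeros_like(x)
            if t[i + r + 1] > t[i + 1]:
                right = (t[i + r + 1] - x) / (t[i + r + 1] - t[i + 1]) * B[i + 1]
            B_next.append(left + right)
        B = B_next
    return np.array(B)                         # shape (J, len(x))

x = np.linspace(0, 0.999, 500)
B = bspline_basis(x, K=5, q=3)                 # J = 7 quadratic B-splines
print(B.shape, np.allclose(B.sum(axis=0), 1.0))
```

The printed check confirms the partition-of-unity property $\sum_j B_{J,j} \equiv 1$ on $[0,1)$.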
Log Spline Models
$[0,1) = \bigcup_{k=1}^K \bigl[(k-1)/K, k/K\bigr)$, $\theta^T B_J = \sum_j \theta_j B_{J,j}$, $J = K + q - 1$
$p_{J,\theta}(x) = e^{\theta^T B_J(x) - c_J(\theta)}$, $e^{c_J(\theta)} = \int_0^1 e^{\theta^T B_J(x)}\,dx$
A prior on $\theta$ induces a prior on $p_{J,\theta}$ for fixed $J$
A prior on $J$ gives model weights $\lambda_{n,J}$
A flat prior on $\theta$ with model weights $\lambda_{n,J}$ as before gives adaptation to smoothness classes up to a logarithmic factor
Can we do better?
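The log-spline density and its normalizing constant $c_J(\theta)$ can be computed by simple quadrature; a sketch using the linear ($q = 2$) B-splines, i.e. hat functions, and an arbitrary coefficient vector $\theta$:

```python
import numpy as np

# p_{J,theta}(x) = exp(theta^T B_J(x) - c_J(theta)) on [0,1], with c_J(theta)
# computed by Riemann-sum quadrature.  The basis here consists of the
# J = K + 1 linear (order q = 2) B-splines: hat functions at 0, 1/K, ..., 1.

def hat_basis(x, J):
    centers = np.linspace(0.0, 1.0, J)
    width = 1.0 / (J - 1)
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / width)

def log_spline_density(theta, grid):
    B = hat_basis(grid, len(theta))          # (n_grid, J) design matrix
    log_unnorm = B @ theta                   # theta^T B_J(x) on the grid
    dx = grid[1] - grid[0]
    c = np.log(np.sum(np.exp(log_unnorm)) * dx)   # c_J(theta) by quadrature
    return np.exp(log_unnorm - c)

grid = np.linspace(0.0, 1.0, 2001)
theta = np.array([0.0, 1.0, 2.0, 1.0, 0.0])       # J = 5, so K = 4 intervals
p = log_spline_density(theta, grid)
print("total mass:", np.sum(p) * (grid[1] - grid[0]))
```

By construction the quadrature mass is exactly 1, and the exponential form makes every $p_{J,\theta}$ strictly positive.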
Adaptation (3)
$A$ finite, ordered; $\epsilon_{n,\alpha} < \epsilon_{n,\beta}$ if $\alpha > \beta$; $n\epsilon_{n,\alpha}^2 \to \infty$ for every $\alpha$
THEOREM If
$\sup_{\epsilon \ge \epsilon_{n,\alpha}} \log N\bigl(\epsilon/2, C_{n,\alpha}(\epsilon), d\bigr) \le E n\epsilon_{n,\alpha}^2$, $\alpha \in A$,
$\dfrac{\lambda_{n,\alpha} \Pi_{n,\alpha}\bigl(C_{n,\alpha}(B\epsilon_{n,\alpha})\bigr)}{\lambda_{n,\beta} \Pi_{n,\beta}\bigl(B_{n,\beta}(\epsilon_{n,\beta})\bigr)} = o\bigl(e^{-2n\epsilon_{n,\beta}^2}\bigr)$, $\alpha < \beta$,
$\dfrac{\lambda_{n,\alpha} \Pi_{n,\alpha}\bigl(C_{n,\alpha}(i\epsilon_{n,\alpha})\bigr)}{\lambda_{n,\beta} \Pi_{n,\beta}\bigl(B_{n,\beta}(\epsilon_{n,\beta})\bigr)} \le e^{i^2 n(\epsilon_{n,\alpha}^2 \wedge \epsilon_{n,\beta}^2)}$,
then the posterior rate is $\epsilon_{n,\beta}$
$B_{n,\alpha}(\epsilon)$ and $C_{n,\alpha}(\epsilon)$ are the KL-ball and the $d$-ball in $\mathcal P_{n,\alpha}$ around $p_0$
Log Spline Models
Consider four combinations of priors $\Pi_{n,\alpha}$ on $\theta$ and weights $\lambda_{n,\alpha}$ on $J_{n,\alpha}$, to adapt to smoothness classes:
$J_{n,\alpha} \asymp n^{1/(2\alpha+1)}$, $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$
Assume $p_0$ is $\beta$-smooth and sufficiently regular
Flat prior, uniform weights
$\Pi_{n,\alpha}$ uniform on $[-M,M]^{J_{n,\alpha}}$, $M$ large
Uniform weights $\lambda_{n,\alpha} = \lambda_\alpha$
THEOREM The posterior rate is $\epsilon_{n,\beta} \sqrt{\log n}$
Flat prior, decreasing weights
$\Pi_{n,\alpha}$ uniform on $[-M,M]^{J_{n,\alpha}}$, $M$ large
$\lambda_{n,\alpha} \propto \prod_{\gamma < \alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}$, $C > 1$
THEOREM The posterior rate is $\epsilon_{n,\beta}$
Small models get small weight!
Discrete priors, increasing weights
$\Pi_{n,\alpha}$ discrete on $\mathbb{R}^J$, with the minimal number of support points needed to obtain approximation error $\epsilon_{n,\alpha}$
$\lambda_{n,\alpha} \propto \lambda_\alpha e^{-C n\epsilon_{n,\alpha}^2}$
THEOREM The posterior rate is $\epsilon_{n,\beta}$
Small models get big weight!
Splines of dimension $J_{n,\alpha}$ give approximation error $\epsilon_{n,\alpha}$. A uniform grid on the coefficients in dimension $J_{n,\alpha}$ that gives approximation error $\epsilon_{n,\alpha}$ is too large; we need a sparse subset. Similarly, a smooth prior on the coefficients in dimension $J_{n,\alpha}$ is too rich.
Special smooth prior, increasing weights
$\Pi_{n,\alpha}$ continuous and uniform on a minimal subset of $\mathbb{R}^J$ that allows approximation with error $\epsilon_{n,\alpha}$
Special, increasing weights $\lambda_{n,\alpha}$
THEOREM (Huang, 2002) The posterior rate is $\epsilon_{n,\beta}$
Huang obtains this result for the full scale of regularity spaces in a general finite-dimensional setting
Conclusion
There is a range of weights $\lambda_{n,\alpha}$ that works
Which weights $\lambda_{n,\alpha}$ work depends on the fine properties of the priors on the models $\mathcal P_{n,\alpha}$
Gaussian mixtures
Model: $p_{F,\sigma}(x) = \int \phi_\sigma(x - z)\,dF(z)$
Prior: $F \sim \mathrm{Dirichlet}(\alpha)$, $\sigma \sim \pi_n$, independent ($\alpha$ Gaussian, $\pi_n$ smooth)
CASE ss: $\pi_n$ fixed
CASE s: $\pi_n$ shrinks at rate $n^{-1/5}$
THEOREM (Ghosal, vdV) The rate of convergence relative to the (truncated) Hellinger distance is
CASE ss: if $p_0 = p_{\sigma_0,F_0}$, then $(\log n)^k/\sqrt{n}$
CASE s: if $p_0$ is 2-smooth, then $n^{-2/5}(\log n)^2$
(Assume $p_0$ subgaussian)
Can we adapt to the two cases?
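The mixture density can be written down directly for a finitely supported mixing distribution $F$ (a sketch with arbitrary atoms and weights; realizations of a Dirichlet process are almost surely discrete, so this is the relevant form):

```python
import numpy as np

# Gaussian location mixture p_{F,sigma}(x) = ∫ phi_sigma(x - z) dF(z) for a
# mixing distribution F with finitely many atoms.

def mixture_density(x, atoms, weights, sigma):
    x = np.asarray(x, dtype=float)
    z = np.asarray(atoms)[None, :]
    phi = np.exp(-(x[:, None] - z) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return phi @ np.asarray(weights)

x = np.linspace(-4.0, 4.0, 1601)
p = mixture_density(x, atoms=[-1.0, 0.5, 2.0], weights=[0.3, 0.5, 0.2], sigma=0.4)
print("total mass:", np.sum(p) * (x[1] - x[0]))
```

However rough the mixing distribution $F$, the convolution with $\phi_\sigma$ makes $p_{F,\sigma}$ infinitely smooth, which is what drives the near-parametric rate in case ss.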
Gaussian mixtures
Weights $\lambda_{n,s}$ and $\lambda_{n,ss}$
THEOREM Adaptation up to logarithmic factors holds if
$\exp\bigl(c(\log n)^k\bigr) < \dfrac{\lambda_{n,ss}}{\lambda_{n,s}} < \exp\bigl(C n^{1/5}(\log n)^k\bigr)$
We believe this already works if
$\exp\bigl(-c(\log n)^k\bigr) < \dfrac{\lambda_{n,ss}}{\lambda_{n,s}} < \exp\bigl(C n^{1/5}(\log n)^k\bigr)$
In particular: equal weights.
Conclusion
There is a range of weights $\lambda_{n,\alpha}$ that works
Which weights $\lambda_{n,\alpha}$ work depends on the fine properties of the priors on the models $\mathcal P_{n,\alpha}$
This interaction makes comparison with penalized minimum contrast estimation difficult
We need refined asymptotics and numerical implementation for further understanding
Is Bayesian density estimation 10 years behind?