Bayesian Adaptation

Aad van der Vaart
http://www.math.vu.nl/~aad
Vrije Universiteit Amsterdam

Joint work with Jyri Lember

Adaptation

Given a collection of possible models, find a single procedure that works well for all models, and that works as well as a procedure specifically targeted to the correct model. The correct model is the one that contains the true distribution of the data.

Adaptation to Smoothness

Given a random sample of size $n$ from a density $p_0$ on $\mathbb{R}$ that is known to have $\alpha$ derivatives, there exist estimators $\hat p_n$ with rate
$$\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)},$$
i.e. $E_{p_0}\, d(\hat p_n, p_0)^2 = O(\epsilon_{n,\alpha}^2)$, uniformly in $p_0$ with $\int (p_0^{(\alpha)})^2\, d\lambda$ bounded (if $d = \|\cdot\|_2$).

Distances

The global distance $d$ on densities can be one of:
- Hellinger: $h(p,q) = \big(\int (\sqrt p - \sqrt q)^2\, d\mu\big)^{1/2}$
- Total variation: $\|p-q\|_1 = \int |p-q|\, d\mu$
- $L_2$: $\|p-q\|_2 = \big(\int (p-q)^2\, d\mu\big)^{1/2}$

Adaptation

Data: $X_1,\dots,X_n$ i.i.d. $\sim p_0$
Models: $\mathcal P_{n,\alpha}$ for $\alpha \in A$, countable
Optimal rates: $\epsilon_{n,\alpha}$
$p_0$ is contained in, or close to, $\mathcal P_{n,\beta}$ for some $\beta \in A$.
We want procedures that (almost) attain the rate $\epsilon_{n,\beta}$, but we do not know $\beta$.

Adaptation - Non-Bayesian

Main methods: penalization and cross validation.
Penalization: minimize your favourite contrast function (MLE, LS, ...), but add a penalty for model complexity.
Cross validation: split the sample; use the first half to compute the best estimator within each model, and the second half to select the best model.

Adaptation - Penalization

Models $\mathcal P_{n,\alpha}$, $\alpha \in A$.
Estimator given the model: $\hat p_{n,\alpha} = \operatorname{argmin}_{p\in\mathcal P_{n,\alpha}} M_n(p)$
Estimated model: $\hat\alpha_n = \operatorname{argmin}_{\alpha\in A} \big(M_n(\hat p_{n,\alpha}) + \mathrm{pen}_n(\alpha)\big)$
Final estimator: $\hat p_n = \hat p_{n,\hat\alpha_n}$
If $M_n$ is minus the log likelihood, then $\hat p_n$ is the posterior mode relative to the prior $\pi_n(p,\alpha) \propto \exp\big(-\mathrm{pen}_n(\alpha)\big)$.
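As an illustration of the penalization recipe, here is a minimal sketch (not from the talk): histogram density models $\mathcal P_J$ indexed by the number of bins $J$, fitted by maximum likelihood, with a BIC-type penalty $\tfrac12 J\log n$ standing in for a generic $\mathrm{pen}_n$.

```python
# Minimal sketch (not from the talk): penalized maximum likelihood over
# histogram models P_J indexed by bin count J, with a BIC-type penalty
# pen_n(J) = (J/2) * log n standing in for a generic complexity penalty.
import numpy as np

def log_likelihood_hist(x, J):
    """Log likelihood of the MLE histogram with J equal bins on [0, 1)."""
    counts, _ = np.histogram(x, bins=J, range=(0.0, 1.0))
    n = len(x)
    heights = (counts / n) * J            # MLE density heights per bin
    mask = counts > 0
    return np.sum(counts[mask] * np.log(heights[mask]))

def select_model(x, J_max=50):
    """argmin over J of  -loglik + pen_n(J)  (penalized contrast)."""
    n = len(x)
    crit = [-log_likelihood_hist(x, J) + 0.5 * J * np.log(n)
            for J in range(1, J_max + 1)]
    return int(np.argmin(crit)) + 1

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=1000)             # smooth true density on [0, 1]
print("selected number of bins:", select_model(x))
```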

Adaptation - Bayesian

Models $\mathcal P_{n,\alpha}$, $\alpha \in A$.
Prior $\Pi_{n,\alpha}$ on $\mathcal P_{n,\alpha}$; prior $(\lambda_{n,\alpha})_{\alpha\in A}$ on $A$.
Overall prior: $\Pi_n = \sum_{\alpha\in A} \lambda_{n,\alpha}\, \Pi_{n,\alpha}$
Posterior: $B \mapsto \Pi_n(B\mid X_1,\dots,X_n)$,
$$\Pi_n(B\mid X_1,\dots,X_n) = \frac{\int_B \prod_{i=1}^n p(X_i)\, d\Pi_n(p)}{\int \prod_{i=1}^n p(X_i)\, d\Pi_n(p)} = \frac{\sum_{\alpha\in A} \lambda_{n,\alpha} \int_{\{p\in\mathcal P_{n,\alpha}:\, p\in B\}} \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}{\sum_{\alpha\in A} \lambda_{n,\alpha} \int_{\mathcal P_{n,\alpha}} \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}.$$
Desired result: if $p_0 \in \mathcal P_{n,\beta}$ (or is close to it), then
$$E_{p_0}\, \Pi_n\big(p : d(p,p_0) \ge M_n \epsilon_{n,\beta} \mid X_1,\dots,X_n\big) \to 0 \quad \text{for every } M_n \to \infty.$$
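In this mixture-of-models form, the posterior weight of model $\alpha$ is proportional to $\lambda_{n,\alpha}$ times the marginal likelihood of the data under $\Pi_{n,\alpha}$. A minimal sketch (not from the talk, with two made-up normal-location models and a crude Monte Carlo marginal likelihood):

```python
# Minimal sketch (not from the talk): posterior weights of two candidate
# models via Monte Carlo marginal likelihoods, mirroring the displayed
# formula with B = "p belongs to model alpha".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=200)        # data, i.i.d. from p0

def log_marginal_likelihood(x, prior_draws):
    """log of the average of prod_i p(X_i) over draws from the prior."""
    loglik = np.array([np.sum(stats.norm.logpdf(x, mu, 1.0))
                       for mu in prior_draws])
    m = loglik.max()
    return m + np.log(np.mean(np.exp(loglik - m)))   # stable log-mean-exp

# Model 1: N(mu, 1) with a tight prior mu ~ N(0, 0.1^2)   ("small" model)
# Model 2: N(mu, 1) with a diffuse prior mu ~ N(0, 10^2)  ("big" model)
lam = np.array([0.5, 0.5])                 # prior model weights lambda
logml = np.array([
    log_marginal_likelihood(x, rng.normal(0, 0.1, size=5000)),
    log_marginal_likelihood(x, rng.normal(0, 10.0, size=5000)),
])
w = lam * np.exp(logml - logml.max())
print("posterior model weights:", w / w.sum())
```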

Single Model

Model $\mathcal P_{n,\beta}$, prior $\Pi_{n,\beta}$.
THEOREM (GGvdV = Ghosal, Ghosh and van der Vaart, 2000). If
$$\log N(\epsilon_{n,\beta}, \mathcal P_{n,\beta}, d) \le E n\epsilon_{n,\beta}^2 \quad \text{(entropy)}$$
$$\Pi_{n,\beta}\big(B_{n,\beta}(\epsilon_{n,\beta})\big) \ge e^{-Fn\epsilon_{n,\beta}^2} \quad \text{(prior mass)}$$
then the posterior rate of convergence is $\epsilon_{n,\beta}$.
Here $B_{n,\alpha}(\epsilon)$ is a Kullback-Leibler ball around $p_0$:
$$B_{n,\alpha}(\epsilon) = \Big\{p \in \mathcal P_{n,\alpha} : P_0\log\frac{p_0}{p} \le \epsilon^2,\; P_0\Big(\log\frac{p_0}{p}\Big)^2 \le \epsilon^2\Big\}.$$

Covering Numbers

DEFINITION. The covering number $N(\epsilon, \mathcal P, d)$ is the minimal number of balls of radius $\epsilon$ needed to cover the set $\mathcal P$.
The rate at which $N(\epsilon, \mathcal P, d)$ increases as $\epsilon \downarrow 0$ determines the size of the model:
- parametric model: $(1/\epsilon)^d$
- nonparametric model: $e^{(1/\epsilon)^{1/\alpha}}$, e.g. smoothness $\alpha$
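A quick numerical illustration of the parametric scaling (a sketch, not from the talk): covering $[0,1]^d$ by sup-norm balls centered on a grid gives $\log N(\epsilon) \approx d\log\big(1/(2\epsilon)\big)$.

```python
# Minimal sketch (not from the talk): the parametric scaling
# N(eps, [0,1]^d, sup-norm) ~ (1/(2*eps))^d, via an explicit grid cover.
import numpy as np

def grid_covering_number(eps, d):
    """Number of sup-norm eps-balls in a grid cover of [0,1]^d."""
    balls_per_axis = int(np.ceil(1.0 / (2.0 * eps)))  # centers 2*eps apart
    return balls_per_axis ** d

for d in (1, 2, 5):
    for eps in (0.1, 0.01):
        N = grid_covering_number(eps, d)
        print(f"d={d}, eps={eps}: log N = {np.log(N):.1f} "
              f"~ d*log(1/(2 eps)) = {d * np.log(1 / (2 * eps)):.1f}")
```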

Motivation: Entropy

The solution $\epsilon_n$ of $\log N(\epsilon_n, \mathcal P_n, d) \asymp n\epsilon_n^2$ gives the optimal rate of convergence for the model $\mathcal P_n$ in the minimax sense.
Le Cam (1975, 1986), Birgé (1983), Yang and Barron (1999)


Motivation: Prior Mass

$$\Pi_n\big(B_n(\epsilon_n)\big) \ge e^{-n\epsilon_n^2} \quad \text{(prior mass)}$$
One needs $N(\epsilon_n, \mathcal P, d) \approx \exp(n\epsilon_n^2)$ balls to cover the model, and one can place $\exp(Cn\epsilon_n^2)$ balls. If $\Pi_n$ is uniform, then each ball receives mass $\approx \exp(-Cn\epsilon_n^2)$.

Equivalence of KL and Hellinger

The prior mass condition uses Kullback-Leibler balls, whereas the entropy condition uses $d$-balls. These are typically (almost) equivalent:
- if the ratios $p_0/p$ of densities are bounded, they are fully equivalent;
- if $P_0(p_0/p)^b$ is bounded for some $b > 0$, they are equivalent up to logarithmic factors.

Single Model

In the theorem above, the entropy $\log N(\epsilon, \mathcal P_{n,\beta}, d)$ can actually be replaced by the Le Cam dimension
$$\sup_{\eta > \epsilon} \log N\big(\eta/2, C_{n,\beta}(\eta), d\big),$$
where $C_{n,\beta}(\eta)$ is the $d$-ball of radius $\eta$ in $\mathcal P_{n,\beta}$ around $p_0$. The prior mass condition can also be refined.


Adaptation (1)

$A$ finite, ordered; $n\epsilon_{n,\beta}^2 \to \infty$; $\epsilon_{n,\alpha} \le \epsilon_{n,\beta}$ if $\alpha \ge \beta$.
Weights: $\lambda_{n,\alpha} \propto \lambda_\alpha e^{-Cn\epsilon_{n,\alpha}^2}$. Small models get big weights.
THEOREM. If
$$\log N(\epsilon_{n,\alpha}, \mathcal P_{n,\alpha}, d) \le E n\epsilon_{n,\alpha}^2 \quad \text{(entropy, every } \alpha\text{)}$$
$$\Pi_{n,\beta}\big(B_{n,\beta}(\epsilon_{n,\beta})\big) \ge e^{-Fn\epsilon_{n,\beta}^2} \quad \text{(prior mass)}$$
then the posterior rate is $\epsilon_{n,\beta}$.

Adaptation (2)

Extension to countable $A$ is possible in two ways:
- truncation of the weights $\lambda_{n,\alpha}$ to subsets $A_n \subset A$
- additional entropy control
Also replace $\beta$ by $\beta_n$. Always assume
$$\sum_\alpha \frac{\lambda_\alpha}{\lambda_{\beta_n}} \exp\big(-Cn\epsilon_{n,\alpha}^2/4\big) = O(1).$$

Adaptation (2a) - Truncation

$A_n \subset A$, $\beta_n \in A_n$, $\log(\#A_n) \lesssim n\epsilon_{n,\beta_n}^2$, $n\epsilon_{n,\beta_n}^2 \to \infty$.
Weights: $\lambda_{n,\alpha} \propto \lambda_\alpha e^{-Cn\epsilon_{n,\alpha}^2}\, 1_{A_n}(\alpha)$.
$$\max_{\alpha\in A_n :\, \epsilon_{n,\alpha}^2 \le H\epsilon_{n,\beta_n}^2} E_\alpha\, \frac{\epsilon_{n,\alpha}^2}{\epsilon_{n,\beta_n}^2} = O(1), \qquad H \ge 1.$$
THEOREM. If
$$\log N(\epsilon_{n,\alpha}, \mathcal P_{n,\alpha}, d) \le E_\alpha n\epsilon_{n,\alpha}^2 \quad \text{(entropy, every } \alpha\text{)}$$
$$\Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big) \ge e^{-Fn\epsilon_{n,\beta_n}^2} \quad \text{(prior mass)}$$
then the posterior rate is $\epsilon_{n,\beta_n}$.

Adaptation (2b) - Entropy Control

$A$ countable, $n\epsilon_{n,\beta_n}^2 \to \infty$.
Weights: $\lambda_{n,\alpha} \propto \lambda_\alpha e^{-Cn\epsilon_{n,\alpha}^2}$.
THEOREM. If $H \ge 1$ and
$$\log N\Big(\epsilon_{n,\beta_n}, \bigcup_{\alpha :\, \epsilon_{n,\alpha} \le H\epsilon_{n,\beta_n}} \mathcal P_{n,\alpha}, d\Big) \le E n\epsilon_{n,\beta_n}^2 \quad \text{(entropy)}$$
$$\Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big) \ge e^{-Fn\epsilon_{n,\beta_n}^2} \quad \text{(prior mass)}$$
then the posterior rate is $\epsilon_{n,\beta_n}$.

Discrete Priors

Discrete priors that are uniform on specially constructed approximating sets are universal, in the sense that under abstract and mild conditions they give the desired result. To avoid unnecessary logarithmic factors, we need to replace the ordinary entropy by the slightly more restrictive bracketing entropy.

Bracketing Numbers

Given $l, u : \mathcal X \to \mathbb R$, the bracket $[l,u]$ is the set of all $p : \mathcal X \to \mathbb R$ with $l \le p \le u$.
[Figure: a bracket $[l,u]$ containing a function $p$]
An $\epsilon$-bracket relative to $d$ is a bracket $[l,u]$ with $d(u,l) < \epsilon$.
DEFINITION. The bracketing number $N_{[\,]}(\epsilon, \mathcal P, d)$ is the minimum number of $\epsilon$-brackets needed to cover $\mathcal P$.

Discrete Priors

$\mathcal Q_{n,\alpha}$: a collection of nonnegative functions with
$$\log N_{[\,]}(\epsilon_{n,\alpha}, \mathcal Q_{n,\alpha}, h) \le E_\alpha n\epsilon_{n,\alpha}^2$$
$u_1,\dots,u_N$: a minimal set of $\epsilon_{n,\alpha}$-upper brackets; $\tilde u_1,\dots,\tilde u_N$: the normalized functions.
Prior: $\Pi_{n,\alpha}$ uniform on $\tilde u_1,\dots,\tilde u_N$. Model: $\bigcup_{M>0} M\mathcal Q_{n,\alpha}$.
THEOREM. If $\lambda_{n,\alpha}$ and $A_n \subset A$ are as before, and $p_0 \in M_0\mathcal Q_{n,\beta}$, then the posterior rate is $\epsilon_{n,\beta}$, relative to the Hellinger distance.

Smoothness Spaces

$\mathbb B^\alpha_1$: the unit ball in a Banach space $\mathbb B^\alpha$ of functions, with
$$\log N_{[\,]}\big(\epsilon_{n,\alpha}, \mathbb B^\alpha_1, \|\cdot\|_2\big) \le E_\alpha n\epsilon_{n,\alpha}^2$$
Model: $p \in \mathbb B^\alpha$.
THEOREM. There exists a prior such that the posterior rate is $\epsilon_{n,\beta}$ whenever $p_0 \in \mathbb B^\beta$ for some $\beta > 0$.
EXAMPLE. Hölder and Sobolev spaces of $\alpha$-smooth functions, with $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$. Besov spaces (in progress).

Finite-Dimensional Models

Model $\mathcal P_J$ of dimension $J$.
Bias: $p_0$ is $\beta$-regular if $d(p_0, \mathcal P_J) \lesssim (1/J)^\beta$.
Variance: the precision when estimating $J$ parameters is $\sqrt{J/n}$.
Bias-variance trade-off: $(1/J)^{2\beta} \asymp J/n$.
Optimal dimension: $J \asymp n^{1/(2\beta+1)}$. Rate: $\epsilon_{n,J} \asymp n^{-\beta/(2\beta+1)}$.
We want to adapt to $\beta$ by putting weights on $J$.
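The trade-off can be checked numerically; a minimal sketch (not from the talk), minimizing the risk proxy $(1/J)^{2\beta} + J/n$ over $J$ and comparing with the theoretical order $n^{1/(2\beta+1)}$ (which holds up to constants):

```python
# Minimal sketch (not from the talk): the bias-variance trade-off
# (1/J)^(2*beta) vs J/n, checked against J ~ n^(1/(2*beta+1)) (up to constants).
import numpy as np

def optimal_dimension(n, beta, J_max=10**4):
    """Minimize the squared-rate proxy (1/J)^(2*beta) + J/n over J."""
    J = np.arange(1, J_max)
    risk = (1.0 / J) ** (2 * beta) + J / n
    return J[np.argmin(risk)]

n, beta = 10**5, 2.0
print("numerical optimum J        :", optimal_dimension(n, beta))
print("order n^(1/(2b+1))         :", round(n ** (1 / (2 * beta + 1))))
print("rate n^(-b/(2b+1))         :", n ** (-beta / (2 * beta + 1)))
```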

Finite-Dimensional Models

The model dimension can be taken to be the Le Cam dimension:
$$J \asymp \sup_{\eta > \epsilon} \log N\big(\eta/2, \{p \in \mathcal P_J : d(p,p_0) < \eta\}, d\big).$$

Finite-Dimensional Models

Models $\mathcal P_{J,M}$ of Le Cam dimension $\le A_M J$, for $J \in \mathbb N$, $M \in \mathcal M$.
Prior: $\Pi_{J,M}\big(B_{J,M}(\epsilon)\big) \ge (B_J C_M \epsilon)^J$ for $\epsilon > D_M\, d(p_0, \mathcal P_{J,M})$.
This corresponds to a smooth prior on the $J$-dimensional model.
Weights: $\lambda_{n,J,M} \propto e^{-Cn\epsilon_{n,J,M}^2}\, 1_{\mathcal J_n \times \mathcal M_n}(J,M)$, with
$$(\log C_M) \vee A_M \gtrsim 1, \qquad B_J \gtrsim J^{-k}, \qquad \sum_{M\in\mathcal M} e^{-HA_M} < \infty, \qquad \epsilon_{n,J,M} = \sqrt{\frac{J\log n}{n}\, A_M}.$$
THEOREM. If there exist $J_n \in \mathcal J_n$ with $J_n \lesssim n$ and $d(p_0, \mathcal P_{J_n,M_0}) \le \epsilon_{n,J_n,M_0}$, then the posterior rate is $\epsilon_{n,J_n,M_0}$.

Finite-Dimensional Models: Examples

- If $p_0 \in \mathcal P_{J_0,M_0}$ for some $J_0$, then the rate is $\sqrt{(\log n)/n}$.
- If $d(p_0, \mathcal P_{J,M_0}) \lesssim J^{-\beta}$ for every $J$, the rate is $(n/\log n)^{-\beta/(2\beta+1)}$.
- If $d(p_0, \mathcal P_{J,M_0}) \lesssim e^{-J^\beta}$ for every $J$, the rate is $(\log n)^{1/\beta + 1/2}/\sqrt n$.
Can the logarithmic factors be avoided, by using different weights and/or different model priors?

Splines

$[0,1) = \bigcup_{k=1}^K \big[(k-1)/K, k/K\big)$
A spline of order $q$ is a continuous function $f : [0,1] \to \mathbb R$ that is $q-2$ times differentiable on $[0,1)$ and whose restriction to every $\big[(k-1)/K, k/K\big)$ is a polynomial of degree $< q$.
[Figure: a linear spline]
Splines form a $J = q + K - 1$ dimensional vector space, with the B-splines $B_{J,1},\dots,B_{J,J}$ as a convenient basis.
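To make the basis concrete, here is a minimal sketch (not from the talk) constructing the $J = K + q - 1$ B-splines with scipy and checking that they sum to one (partition of unity):

```python
# Minimal sketch (not from the talk): the B-spline basis B_{J,1},...,B_{J,J}
# of the space of splines of order q (degree q-1) with K equal knot intervals.
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(q, K):
    """Return the J = K + q - 1 basis functions as callables on [0, 1)."""
    degree = q - 1
    # Open uniform knot vector: boundary knots repeated `degree` extra times.
    knots = np.concatenate([[0.0] * degree, np.linspace(0, 1, K + 1),
                            [1.0] * degree])
    J = K + q - 1
    return [BSpline(knots, np.eye(J)[j], degree) for j in range(J)]

basis = bspline_basis(q=2, K=5)           # linear splines, J = 6
x = np.linspace(0, 1 - 1e-9, 7)
total = np.sum([B(x) for B in basis], axis=0)
print(total)                               # all 1.0: B-splines sum to one
```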

Splines - Properties

$\theta^T B_J = \sum_j \theta_j B_{J,j}$, $\theta \in \mathbb R^J$, $J = K + q - 1$.
Approximation of smooth functions: if $q \ge \alpha > 0$ and $f \in C^\alpha[0,1]$, then
$$\inf_{\theta\in\mathbb R^J} \|\theta^T B_J - f\|_\infty \le C_{q,\alpha} \Big(\frac1J\Big)^\alpha \|f\|_\alpha.$$
Equivalence of norms: for any $\theta \in \mathbb R^J$,
$$\|\theta^T B_J\|_2 \asymp \frac{\|\theta\|_2}{\sqrt J}.$$

Log Spline Models

$[0,1) = \bigcup_{k=1}^K \big[(k-1)/K, k/K\big)$, $\theta^T B_J = \sum_j \theta_j B_{J,j}$, $J = K + q - 1$.
$$p_{J,\theta}(x) = e^{\theta^T B_J(x) - c_J(\theta)}, \qquad e^{c_J(\theta)} = \int_0^1 e^{\theta^T B_J(x)}\, dx.$$
A prior on $\theta$ induces a prior on $p_{J,\theta}$ for fixed $J$; a prior on $J$ gives the model weights $\lambda_{n,J}$.
A flat prior on $\theta$ with model weights $\lambda_{n,J}$ as before gives adaptation to smoothness classes up to a logarithmic factor. Can we do better?
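A minimal sketch (not from the talk) of the log-spline density, with the normalizing constant $c_J(\theta)$ computed by numerical quadrature:

```python
# Minimal sketch (not from the talk): the log-spline density
# p_{J,theta}(x) = exp(theta^T B_J(x) - c_J(theta)), normalized numerically.
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid

q, K = 2, 5                                   # order and knot intervals
degree, J = q - 1, K + q - 1
knots = np.concatenate([[0.0] * degree, np.linspace(0, 1, K + 1),
                        [1.0] * degree])
theta = np.array([0.0, 1.0, 2.0, 2.0, 1.0, 0.0])   # J = 6 coefficients

x = np.linspace(0, 1 - 1e-9, 2000)
design = np.column_stack([BSpline(knots, np.eye(J)[j], degree)(x)
                          for j in range(J)])       # B_{J,1..J} at the grid
log_p = design @ theta                              # theta^T B_J(x)
c = np.log(trapezoid(np.exp(log_p), x))             # c_J(theta) by quadrature
p = np.exp(log_p - c)
print("integrates to:", trapezoid(p, x))            # ~ 1.0
```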

Adaptation (3)

$A$ finite, ordered; $\epsilon_{n,\alpha} < \epsilon_{n,\beta}$ if $\alpha > \beta$; $n\epsilon_{n,\alpha}^2 \to \infty$ for every $\alpha$.
THEOREM. If
$$\sup_{\epsilon \ge \epsilon_{n,\alpha}} \log N\big(\epsilon/2, C_{n,\alpha}(\epsilon), d\big) \le E n\epsilon_{n,\alpha}^2, \qquad \alpha \in A,$$
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(B\epsilon_{n,\alpha})\big)}{\lambda_{n,\beta}\, \Pi_{n,\beta}\big(B_{n,\beta}(\epsilon_{n,\beta})\big)} = o\big(e^{-2n\epsilon_{n,\beta}^2}\big), \qquad \alpha < \beta,$$
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(i\epsilon_{n,\alpha})\big)}{\lambda_{n,\beta}\, \Pi_{n,\beta}\big(B_{n,\beta}(\epsilon_{n,\beta})\big)} \le e^{i^2 n(\epsilon_{n,\alpha}^2 \wedge \epsilon_{n,\beta}^2)},$$
then the posterior rate is $\epsilon_{n,\beta}$.
$B_{n,\alpha}(\epsilon)$ and $C_{n,\alpha}(\epsilon)$ are the KL-ball and the $d$-ball in $\mathcal P_{n,\alpha}$ around $p_0$.

Log Spline Models

Consider four combinations of priors $\Pi_{n,\alpha}$ on $\theta$ and weights $\lambda_{n,\alpha}$ on $J_{n,\alpha}$, to adapt to smoothness classes:
$$J_{n,\alpha} \asymp n^{1/(2\alpha+1)}, \qquad \epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}.$$
Assume $p_0$ is $\beta$-smooth and sufficiently regular.

Flat prior, uniform weights

$\Pi_{n,\alpha}$ uniform on $[-M,M]^{J_{n,\alpha}}$, $M$ large; uniform weights $\lambda_{n,\alpha} = \lambda_\alpha$.
THEOREM. The posterior rate is $\epsilon_{n,\beta}\sqrt{\log n}$.

Flat prior, decreasing weights

$\Pi_{n,\alpha}$ uniform on $[-M,M]^{J_{n,\alpha}}$, $M$ large; $\lambda_{n,\alpha} \propto \prod_{\gamma<\alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}$, $C > 1$.
THEOREM. The posterior rate is $\epsilon_{n,\beta}$.
Small models get small weight!

Discrete priors, increasing weights

$\Pi_{n,\alpha}$ discrete on $\mathbb R^{J_{n,\alpha}}$, with the minimal number of support points needed to obtain approximation error $\epsilon_{n,\alpha}$; $\lambda_{n,\alpha} \propto \lambda_\alpha e^{-Cn\epsilon_{n,\alpha}^2}$.
THEOREM. The posterior rate is $\epsilon_{n,\beta}$.
Small models get big weight!
Splines of dimension $J_{n,\alpha}$ give approximation error $\epsilon_{n,\alpha}$, but a uniform grid on the coefficients in dimension $J_{n,\alpha}$ that achieves approximation error $\epsilon_{n,\alpha}$ is too large; a sparse subset is needed. Similarly, a smooth prior on the coefficients in dimension $J_{n,\alpha}$ is too rich.

Special smooth prior, increasing weights

$\Pi_{n,\alpha}$ continuous and uniform on a minimal subset of $\mathbb R^{J_{n,\alpha}}$ that allows approximation with error $\epsilon_{n,\alpha}$; special, increasing weights $\lambda_{n,\alpha}$.
THEOREM (Huang, 2002). The posterior rate is $\epsilon_{n,\beta}$.
Huang obtains this result for the full scale of regularity spaces, in a general finite-dimensional setting.

Conclusion

There is a range of weights $\lambda_{n,\alpha}$ that works. Which weights $\lambda_{n,\alpha}$ work depends on the fine properties of the priors on the models $\mathcal P_{n,\alpha}$.

Gaussian mixtures

Model: $p_{F,\sigma}(x) = \int \phi_\sigma(x-z)\, dF(z)$
Prior: $F \sim \mathrm{Dirichlet}(\alpha)$, $\sigma \sim \pi_n$, independent ($\alpha$ Gaussian, $\pi_n$ smooth).
CASE ss: $\pi_n$ fixed.
CASE s: $\pi_n$ shrinks at rate $n^{-1/5}$.
THEOREM (Ghosal, vdV). The rate of convergence relative to the (truncated) Hellinger distance is:
CASE ss: if $p_0 = p_{\sigma_0,F_0}$, then $(\log n)^k/\sqrt n$;
CASE s: if $p_0$ is 2-smooth, then $n^{-2/5}(\log n)^2$.
(Assume $p_0$ subgaussian.)
Can we adapt to the two cases?
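A minimal sketch (not from the talk) of one draw from this prior, with the Dirichlet process truncated by stick-breaking and a uniform distribution as a stand-in for $\pi_n$:

```python
# Minimal sketch (not from the talk): a draw from the mixture prior
# p_{F,sigma}(x) = int phi_sigma(x - z) dF(z), with F from a (truncated)
# Dirichlet process with Gaussian base measure, and sigma ~ pi_n.
import numpy as np
from scipy import stats

def draw_prior_density(rng, alpha=1.0, trunc=100):
    """Sample (weights, locations, sigma) defining one density p_{F,sigma}."""
    v = rng.beta(1.0, alpha, size=trunc)             # stick-breaking fractions
    w = v * np.cumprod(np.concatenate([[1.0], 1 - v[:-1]]))
    w = w / w.sum()                                  # renormalize truncation
    z = rng.normal(0.0, 1.0, size=trunc)             # Gaussian base measure
    sigma = rng.uniform(0.5, 1.5)                    # stand-in for pi_n
    return lambda x: sum(wj * stats.norm.pdf(x, zj, sigma)
                         for wj, zj in zip(w, z))

rng = np.random.default_rng(2)
p = draw_prior_density(rng)
x = np.linspace(-3, 3, 5)
print(p(x))                                           # density values at x
```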

Gaussian mixtures

Weights $\lambda_{n,s}$ and $\lambda_{n,ss}$.
THEOREM. Adaptation up to logarithmic factors holds if
$$\exp\big(c(\log n)^k\big) < \frac{\lambda_{n,ss}}{\lambda_{n,s}} < \exp\big(Cn^{1/5}(\log n)^k\big).$$
We believe this works already if
$$\exp\big(-c(\log n)^k\big) < \frac{\lambda_{n,ss}}{\lambda_{n,s}} < \exp\big(Cn^{1/5}(\log n)^k\big),$$
in particular for equal weights.

Conclusion

There is a range of weights $\lambda_{n,\alpha}$ that works. Which weights $\lambda_{n,\alpha}$ work depends on the fine properties of the priors on the models $\mathcal P_{n,\alpha}$. This interaction makes comparison with penalized minimum contrast estimation difficult; we need refined asymptotics and numerical implementation for further understanding.
Bayesian density estimation is 10 years behind?