Bayesian Regularization


Bayesian Regularization. Aad van der Vaart, Vrije Universiteit Amsterdam. International Congress of Mathematicians, Hyderabad, August 2010.

Contents: Introduction; Abstract result; Gaussian process priors.

Co-authors: Subhashis Ghosal (abstract result) and Harry van Zanten (Gaussian process priors).

Introduction

The Bayesian paradigm. A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a probability density p_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution:

Π(Θ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ), equivalently dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

Reverend Thomas. Thomas Bayes (1702-1761; published 1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial(n, θ). Using his famous rule he computed that the posterior distribution is then Beta(X + 1, n - X + 1):

dΠ_n(θ | X) ∝ (n choose X) θ^X (1 - θ)^{n-X} · 1.

[Figure: posterior density for n = 20, X = 13.]
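
As a concrete check of Bayes' computation, here is a minimal sketch in Python (assuming numpy and scipy are available; the choice n = 20, X = 13 matches the plotted example):

```python
# Sketch: uniform prior on theta, X | theta ~ Binomial(n, theta).
# The posterior is then Beta(X + 1, n - X + 1), as computed by Bayes.
import numpy as np
from scipy import stats

n, X = 20, 13                      # observed data (as in the slide's figure)
posterior = stats.beta(X + 1, n - X + 1)

theta = np.linspace(0, 1, 501)
density = posterior.pdf(theta)     # posterior density on [0, 1]

print("posterior mean:", posterior.mean())          # (X + 1) / (n + 2)
print("95% credible interval:", posterior.interval(0.95))
```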

Nonparametric Bayes. If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes' formula does not change: dΠ(θ | X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions. [Figure: simulated draws from a prior and from the corresponding posterior on a function space.]

Frequentist Bayesian. Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set. We like Π(· | X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n, where the information increases as n → ∞. We like the posterior Π_n(· | X^n) to contract to {θ_0}, at a good rate.

Rate of contraction. Assume X^n is generated according to a given parameter θ_0, where the information increases as n → ∞. The posterior is consistent if, for every ε > 0,

E_{θ_0} Π(θ: d(θ, θ_0) < ε | X^n) → 1.

The posterior contracts at rate at least ε_n if

E_{θ_0} Π(θ: d(θ, θ_0) < ε_n | X^n) → 1.

Basic results on consistency were proved by Doob (1948) and Schwartz (1965). Interest in rates is recent.
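
To make the definition concrete, a small Monte Carlo sketch (an illustration, not from the talk; numpy/scipy assumed) estimates E_{θ_0} Π(|θ - θ_0| < ε | X_n) in the uniform-prior binomial example, where the posterior is available in closed form; the expected posterior mass near θ_0 should approach 1 as n grows:

```python
# Sketch: estimate E_theta0 Pi(|theta - theta0| < eps | X_n) by simulation
# for the uniform-prior / binomial model, whose posterior is Beta(X+1, n-X+1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, eps = 0.3, 0.05

for n in [50, 500, 5000]:
    X = rng.binomial(n, theta0, size=2000)            # 2000 replicated data sets
    mass = stats.beta.cdf(theta0 + eps, X + 1, n - X + 1) \
         - stats.beta.cdf(theta0 - eps, X + 1, n - X + 1)
    print(n, mass.mean())                              # increases towards 1
```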

Minimaxity. To a given model Θ is attached an optimal rate of convergence, defined by the minimax criterion

ε_n = inf_T sup_{θ ∈ Θ} E_θ d(T(X), θ).

This criterion has nothing to do with Bayes. A prior is good if the posterior contracts at this rate. (?)

Adaptation. A model can be viewed as an instrument to test quality. It makes sense to use a collection (Θ_α: α ∈ A) of models simultaneously, e.g. a scale of regularity classes. A posterior is good if it adapts: if the true parameter belongs to Θ_α, then the contraction rate is at least the minimax rate for this model.

Bayesian perspective Any prior (and hence posterior) is appropriate per se. In complex situations subject knowledge can be and must be incorporated in the prior. Computational ease is important for prior choice as well. Frequentist properties reveal key properties of priors of interest.

Abstract result

Entropy. The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ. [Figure: coverings with a big ε and with a small ε.] The entropy is its logarithm: log N(ε, Θ, d).
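
A toy illustration (a sketch, not from the talk): for the unit interval with the absolute-value metric, N(ε, [0, 1], |·|) = ⌈1/(2ε)⌉, so the entropy grows like log(1/ε):

```python
# Sketch: covering numbers and entropy of the unit interval [0, 1].
# Balls of radius eps are intervals of length 2*eps, so ceil(1/(2*eps)) suffice.
import math

for eps in [0.5, 0.1, 0.01, 0.001]:
    N = math.ceil(1 / (2 * eps))       # covering number N(eps, [0,1], |.|)
    print(eps, N, math.log(N))         # entropy log N(eps) ~ log(1/eps)
```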

Rate theorem, i.i.d. observations. Given a random sample X_1, ..., X_n from a density p_0 and a prior Π on a set P of densities, consider the posterior

dΠ_n(p | X_1, ..., X_n) ∝ ∏_{i=1}^n p(X_i) dΠ(p).

THEOREM [Ghosal+Ghosh+vdV, 2000]. The Hellinger contraction rate is ε_n if there exist P_n ⊂ P such that
(1) log N(ε_n, P_n, h) ≤ nε_n² and Π(P_n) = 1 - o(e^{-3nε_n²}). [entropy]
(2) Π(B_KL(p_0, ε_n)) ≥ e^{-nε_n²}. [prior mass]
Here h is the Hellinger distance, h²(p, q) = ∫ (√p - √q)² dµ, and B_KL(p_0, ε) is a Kullback-Leibler neighborhood of p_0.

The entropy condition ensures that the likelihood is not too variable, so that it cannot be large at a wrong place by pure randomness. Le Cam (1964) showed that it gives the minimax rate.

Rate theorem, general. Given data X^n following a model (P^n_θ: θ ∈ Θ) that satisfies Le Cam's testing criterion, and a prior Π, form the posterior

dΠ_n(θ | X^n) ∝ p^n_θ(X^n) dΠ(θ).

THEOREM. The rate of contraction is ε_n ≥ 1/√n if there exist Θ_n ⊂ Θ such that
(1) D_n(ε_n, Θ_n, d_n) ≤ nε_n² and Π_n(Θ \ Θ_n) = o(e^{-3nε_n²}).
(2) Π_n(B_n(θ_0, ε_n; k)) ≥ e^{-nε_n²}.
Here B_n(θ_0, ε; k) is a Kullback-Leibler type neighbourhood of p^n_{θ_0}.

The theorem can be refined in various ways.

Le Cam's testing criterion. A statistical model (P^n_θ: θ ∈ Θ) indexed by a metric space (Θ, d). For all ε > 0 and all θ_1 with d(θ_1, θ_0) > ε there exist tests φ_n with

P^n_{θ_0} φ_n ≤ e^{-nε²},   sup_{θ ∈ Θ: d(θ, θ_1) < ε/2} P^n_θ(1 - φ_n) ≤ e^{-nε²}.

[Figure: θ_0 and θ_1 at distance > ε; a ball of radius ε/2 around θ_1.]

This applies to independent data, Markov chains, Gaussian time series, ergodic diffusions, ...

Adaptation by hierarchical prior, i.i.d. Let P_{n,α}, for α ∈ A, be collections of densities with priors Π_{n,α}, and let λ_n = (λ_{n,α}: α ∈ A) be prior weights. Form the mixture prior Π_n = Σ_{α ∈ A} λ_{n,α} Π_{n,α}.

THEOREM. The Hellinger contraction rate is ε_{n,β} if the prior weights satisfy (*) below and
(1) log N(ε_{n,α}, P_{n,α}, h) ≤ nε²_{n,α}, for every α ∈ A.
(2) Π_{n,β}(B_{n,β}(p_0, ε_{n,β})) ≥ e^{-nε²_{n,β}}.
Here B_{n,α}(p_0, ε) is a Kullback-Leibler type neighbourhood of p_0 within P_{n,α}.

Condition (*) on prior weights (simplified). The weights of the models α ≠ β must be suitably small relative to λ_{n,β}: both

Σ_{α<β} (λ_{n,α}/λ_{n,β}) e^{nε²_{n,α}} + Σ_{α>β} (λ_{n,α}/λ_{n,β}) e^{nε²_{n,β}}  and  Σ_{α<β} (λ_{n,α}/λ_{n,β}) Π_{n,α}(C_{n,α}(p_0, ε_{n,α}))

are bounded by e^{4nε²_{n,β}}. Here α < β means ε_{n,α} ≥ ε_{n,β}, and C_{n,α}(p_0, ε) is the Hellinger ball of radius ε around p_0 in P_{n,α}.

In many situations there is much freedom in the choice of weights. The weights λ_{n,α} ∝ µ_α e^{-Cnε²_{n,α}} always work.
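
A minimal sketch of the "always work" weights (the values of µ_α, C and the rates ε_{n,α} below are illustrative assumptions, not from the talk):

```python
# Sketch: hierarchical prior weights lambda_{n,alpha} proportional to
# mu_alpha * exp(-C * n * eps_{n,alpha}^2), normalized to sum to one.
import numpy as np

n, C = 1000, 2.0
alphas = np.array([0.5, 1.0, 2.0])              # regularities indexing the models
mu = np.array([1.0, 1.0, 1.0])                  # fixed base weights
eps = n ** (-alphas / (2 * alphas + 1))         # minimax rates for C^alpha on [0, 1]

lam = mu * np.exp(-C * n * eps ** 2)
lam /= lam.sum()                                # prior weights on the models
print(dict(zip(alphas, lam)))                   # rough models get exponentially small weight
```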

Model selection. THEOREM. Under the conditions of the theorem,

Π_n(α: α < β | X_1, ..., X_n) → 0 in probability,
Π_n(α: α > β, h(p_{n,α}, p_0) ≥ ε_{n,β} | X_1, ..., X_n) → 0 in probability.

Too big models do not get posterior weight. Neither do small models that are far from the truth.

Examples of priors Dirichlet mixtures of normals. Discrete priors. Mixtures of betas. Series priors (splines, Fourier, wavelets,...). Independent increment process priors. Sparse priors....... Gaussian process priors.

Gaussian process priors

Gaussian process. The law of a stochastic process W = (W_t: t ∈ T) is a prior distribution on the space of functions w: T → R. Gaussian processes have been found useful because of their variety and because of their computational properties. Every Gaussian prior is reasonable in some way. We shall study performance with smoothness classes as a test case.

Example: Brownian density estimation. For W Brownian motion, use as prior on a density p on [0, 1]:

x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.   [Leonard, Lenk, Tokdar & Ghosh]

[Figure: sample paths t ↦ W_t of Brownian motion and the corresponding prior densities t ↦ c exp(W_t).]
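
A sketch of drawing from this prior (the Brownian motion is approximated by a random walk on a grid; the grid size is an assumption):

```python
# Sketch: sample densities x -> exp(W_x) / int_0^1 exp(W_y) dy
# with W a Brownian motion approximated on a grid of [0, 1].
import numpy as np

rng = np.random.default_rng(1)
m = 1000                                   # grid points on [0, 1]
dt = 1.0 / m
x = np.linspace(0, 1, m + 1)

W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), m))])
p = np.exp(W)
p /= p.mean()                              # grid mean approximates the normalizing integral
# p is now one draw from the prior on densities; plot x versus p to visualize
```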

Integrated Brownian motion. [Figure: sample paths of 0, 1, 2 and 3 times integrated Brownian motion.]
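
A sketch of simulating repeatedly integrated Brownian motion on a grid by cumulative summation (a discrete approximation; grid size is an illustrative choice):

```python
# Sketch: 0-, 1-, 2- and 3-times integrated Brownian motion on [0, 1],
# using cumulative sums as a discrete approximation of integration.
import numpy as np

rng = np.random.default_rng(2)
m = 2000
dt = 1.0 / m
B = np.cumsum(rng.normal(0, np.sqrt(dt), m))   # Brownian motion on the grid

paths = [B]
for k in range(3):
    paths.append(np.cumsum(paths[-1]) * dt)    # integrate the previous path
# paths[k] approximates k-times integrated Brownian motion;
# each integration makes the sample path one degree smoother.
```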

Stationary processes. [Figure: sample paths of stationary Gaussian processes with a Gaussian spectral measure and with a Matérn (3/2) spectral measure.]

Other Gaussian processes. [Figure: Brownian sheet; fractional Brownian motion (α = 0.2 and α = 0.8); series prior w = Σ_i w_i e_i with w_i independent N(0, σ_i²).]
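
A sketch of the series prior w = Σ_i w_i e_i with independent w_i ~ N(0, σ_i²); the cosine basis and the decay σ_i = i^{-(α+1/2)} are illustrative choices, not prescribed by the talk:

```python
# Sketch: random series prior w = sum_i w_i e_i, w_i ~ N(0, sigma_i^2) independently.
import numpy as np

rng = np.random.default_rng(3)
alpha, I = 1.0, 200                       # "regularity" and truncation level
x = np.linspace(0, 1, 500)

i = np.arange(1, I + 1)
sigma = i ** (-(alpha + 0.5))             # standard deviations of the coefficients
w_i = rng.normal(0, sigma)                # one draw of the coefficients
e = np.sqrt(2) * np.cos(np.outer(i, np.pi * x))   # cosine basis functions e_i(x)

w = w_i @ e                               # one sample path of the prior on [0, 1]
```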

Rates for Gaussian priors. The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = -log P(‖W‖ < ε).

THEOREM. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²  AND  inf_{h ∈ H: ‖h - w_0‖ < ε_n} ‖h‖²_H ≤ nε_n².

Both inequalities give a lower bound on ε_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ε_n ≍ 1/√n.

Settings.
Density estimation: X_1, ..., X_n i.i.d. in [0, 1], p_w(x) = e^{w(x)} / ∫_0^1 e^{w(t)} dt. Distance on parameter: Hellinger on p_w. Norm on W: uniform.
Classification: (X_1, Y_1), ..., (X_n, Y_n) i.i.d. in [0, 1] × {0, 1}, P_w(Y = 1 | X = x) = 1/(1 + e^{-w(x)}). Distance on parameter: L_2(G) on P_w (G the marginal of the X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(w(x_i), σ²), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on w. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t: t ∈ [0, n]), ergodic, recurrent: dX_t = w(X_t) dt + σ(X_t) dB_t. Distance on parameter: (random) Hellinger semimetric h_n, comparable to ‖·/σ‖_{µ_0,2}. Norm on W: L_2(µ_0) (µ_0 the stationary measure).

Reproducing kernel Hilbert space. To every Gaussian random element W with values in a Banach space (B, ‖·‖) is attached a certain Hilbert space (H, ‖·‖_H), called the RKHS. The norm ‖·‖_H is stronger than ‖·‖, and hence we can consider H ⊂ B.

DEFINITION. For S: B* → B defined by Sb* = E[W b*(W)], the RKHS is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E[b*_1(W) b*_2(W)].

DEFINITION. For a process W = (W_x: x ∈ X) with bounded sample paths and covariance function K(x, y) = E[W_x W_y], the RKHS is the completion of the set of functions x ↦ Σ_i α_i K(y_i, x), under ⟨Σ_i α_i K(y_i, ·), Σ_j β_j K(z_j, ·)⟩_H = Σ_i Σ_j α_i β_j K(y_i, z_j).

EXAMPLE. If W is multivariate normal N_d(0, Σ), then the RKHS is R^d with norm ‖h‖²_H = hᵗ Σ^{-1} h.

EXAMPLE. Any W can be represented as W = Σ_{i=1}^∞ µ_i Z_i e_i, for numbers µ_i ≥ 0, i.i.d. standard normal Z_1, Z_2, ..., and e_1, e_2, ... ∈ B with ‖e_1‖ = ‖e_2‖ = ... = 1. The RKHS consists of all h := Σ_i h_i e_i with ‖h‖²_H := Σ_i h_i²/µ_i² < ∞.

EXAMPLE. Brownian motion is a random element in C[0, 1]. Its RKHS is H = {h: h(0) = 0, ∫ h'(t)² dt < ∞} with norm ‖h‖_H = ‖h'‖_2.
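
A small numerical sketch of the two finite and series examples above (dimensions, covariance and coefficient values are illustrative assumptions):

```python
# Sketch: RKHS norms in the multivariate normal and series examples.
import numpy as np

# Multivariate normal N_d(0, Sigma): ||h||_H^2 = h^T Sigma^{-1} h.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
h = np.array([1.0, -1.0])
print("finite-dimensional RKHS norm^2:", h @ np.linalg.solve(Sigma, h))

# Series representation W = sum_i mu_i Z_i e_i: ||h||_H^2 = sum_i h_i^2 / mu_i^2.
mu = 1.0 / np.arange(1, 101) ** 2          # illustrative decay of mu_i
h_coef = 1.0 / np.arange(1, 101) ** 3      # coefficients of h in the basis (e_i)
print("series RKHS norm^2:", np.sum(h_coef ** 2 / mu ** 2))
```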

Small ball probability. The small ball probability of a Gaussian random element W in (B, ‖·‖) is P(‖W‖ < ε), and the small ball exponent φ_0(ε) is minus the logarithm of this. [Figure: a small ball for the uniform norm.]

It can be computed either by probabilistic arguments, or analytically from the RKHS.

THEOREM [Kuelbs & Li (93)]. For H_1 the unit ball of the RKHS (up to constants),

φ_0(ε) ≍ log N(ε/√φ_0(ε), H_1, ‖·‖).

There is a big literature. (In July 2009 the database maintained by Michael Lifshits contained 243 entries.)
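
A crude Monte Carlo sketch of the small ball exponent of Brownian motion under the uniform norm (grid and sample sizes are illustrative, and the grid supremum slightly underestimates the true one); for small ε the theory gives φ_0(ε) ≍ ε^{-2}:

```python
# Sketch: estimate phi_0(eps) = -log P(sup_t |W_t| < eps) for Brownian motion
# by simulating paths on a grid; compare with the theoretical order eps^{-2}.
import numpy as np

rng = np.random.default_rng(4)
m, reps = 400, 20_000
dt = 1.0 / m

W = np.cumsum(rng.normal(0, np.sqrt(dt), size=(reps, m)), axis=1)
sup = np.abs(W).max(axis=1)                 # (discretized) uniform norm of each path

for eps in [1.0, 0.7, 0.5]:
    p = (sup < eps).mean()
    print(eps, -np.log(p), 1 / eps ** 2)    # -log P(||W|| < eps) versus eps^{-2}
```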

Rates for Gaussian priors: proof. The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = -log P(‖W‖ < ε).

THEOREM. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if φ_0(ε_n) ≤ nε_n² AND inf_{h ∈ H: ‖h - w_0‖ < ε_n} ‖h‖²_H ≤ nε_n².

PROOF. The posterior rate is ε_n if there exist sets B_n such that
(1) log N(ε_n, B_n, d) ≤ nε_n² and P(W ∈ B_n) = 1 - o(e^{-3nε_n²}). [entropy]
(2) P(‖W - w_0‖ < ε_n) ≥ e^{-nε_n²}. [prior mass]
Take B_n = M_n H_1 + ε_n B_1 for large M_n (H_1, B_1 the unit balls of H, B).

Proof of (2): key results. Define the concentration function

φ_{w_0}(ε) := φ_0(ε) + inf_{h ∈ H: ‖h - w_0‖ < ε} ‖h‖²_H.

THEOREM [Kuelbs & Li (93)]. The concentration function measures concentration around w_0 (up to factors 2):

P(‖W - w_0‖ < ε) ≍ e^{-φ_{w_0}(ε)}.

THEOREM [Borell (75)]. For H_1 and B_1 the unit balls of the RKHS and B,

P(W ∉ MH_1 + εB_1) ≤ 1 - Φ(Φ^{-1}(e^{-φ_0(ε)}) + M).

(Integrated) Brownian motion. THEOREM. If w_0 ∈ C^β[0, 1], then the rate for Brownian motion is n^{-1/4} if β ≥ 1/2 and n^{-β/2} if β ≤ 1/2.

The small ball exponent of Brownian motion is φ_0(ε) ≍ (1/ε)² as ε ↓ 0. This gives the n^{-1/4}-rate, even for very smooth truths. Truths with β ≤ 1/2 are far from the RKHS, giving the rate n^{-β/2}. The minimax rate is attained iff β = 1/2.

THEOREM. If w_0 ∈ C^β[0, 1], then the rate for (α - 1/2)-times integrated Brownian motion is n^{-(α∧β)/(2α+1)}. The minimax rate is attained iff β = α.

Stationary processes. A stationary Gaussian field (W_t: t ∈ R^d) is characterized through a spectral measure µ, by

cov(W_s, W_t) = ∫ e^{iλᵗ(s-t)} dµ(λ).

THEOREM. Suppose that µ is Gaussian, and let ŵ_0 be the Fourier transform of w_0: [0, 1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|² dλ < ∞, then the rate of contraction is near 1/√n. If ∫ (1 + ‖λ‖²)^β |ŵ_0(λ)|² dλ < ∞, then the rate is (1/log n)^{κ_β}.

THEOREM. Suppose that dµ(λ) = (1 + ‖λ‖²)^{-α-d/2} dλ. If w_0 ∈ C^β[0, 1]^d, then the rate of contraction is n^{-(α∧β)/(2α+d)}.
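
A sketch of drawing sample paths of a stationary Gaussian process whose covariance is the squared exponential exp(-(s - t)²), which corresponds (up to constants) to a Gaussian spectral measure; the grid size and jitter are illustrative numerical choices:

```python
# Sketch: sample paths on [0, 1] of a stationary Gaussian process with
# covariance exp(-(s - t)^2), i.e. a Gaussian spectral measure.
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 150)
K = np.exp(-np.subtract.outer(t, t) ** 2)            # covariance matrix on the grid
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))    # small jitter for numerical stability
W = L @ rng.normal(size=(len(t), 3))                 # three independent sample paths
```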

Adaptation Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by putting a prior on the regularity of the process. An alternative, more attractive approach is scaling.

Stretching or shrinking. Sample paths can be smoothed by stretching and roughened by shrinking. [Figure: a stretched sample path and a shrunken sample path.]

Scaled (integrated) Brownian motion. Take W_t = B_{t/c_n} for B Brownian motion and c_n ≍ n^{(2α-1)/(2α+1)}. For α < 1/2: c_n → 0 (shrink). For α ∈ (1/2, 1]: c_n → ∞ (stretch).

THEOREM. The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0, 1], α ∈ (0, 1].

THEOREM. Appropriate scaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k + 1].

Stretching helps a little, shrinking helps a lot.
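
A sketch of the scaling construction W_t = B_{t/c_n}, with the Brownian motion simulated directly at the rescaled time points (the values of α and n are illustrative):

```python
# Sketch: scaled Brownian motion W_t = B_{t / c_n} on [0, 1] with
# c_n = n^{(2*alpha - 1) / (2*alpha + 1)}: c_n -> 0 shrinks (roughens),
# c_n -> infinity stretches (smooths).
import numpy as np

rng = np.random.default_rng(6)
n, alpha = 10_000, 0.8
c_n = n ** ((2 * alpha - 1) / (2 * alpha + 1))   # > 1 here, so the path is stretched

m = 5000                                         # grid points on [0, 1]
t = np.linspace(0, 1, m)
s = t / c_n                                      # time points of B that are needed
ds = np.diff(s, prepend=0.0)
W = np.cumsum(rng.normal(0, np.sqrt(ds)))        # B evaluated at s, i.e. W_t = B_{t/c_n}
```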

Scaled smooth stationary process. A Gaussian field with infinitely smooth sample paths is obtained for E[G_s G_t] = exp(-‖s - t‖²).

THEOREM. The prior W_t = G_{t/c_n} for c_n ≍ n^{-1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0, 1]^d, any α > 0.

Adaptation by random scaling. Choose A^d from a Gamma distribution. Choose (G_t: t > 0) centered Gaussian with E[G_s G_t] = exp(-‖s - t‖²). Set W_t = G_{At}.

THEOREM. If w_0 ∈ C^α[0, 1]^d, then the rate of contraction is nearly n^{-α/(2α+d)}. If w_0 is supersmooth, then the rate is nearly n^{-1/2}.

The first result is also true for randomly scaled k-times integrated Brownian motion and α ≤ k + 1.
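
A sketch of one draw from the randomly rescaled smooth process in dimension d = 1 (the Gamma parameters, grid and jitter are illustrative assumptions): draw A with A^d Gamma distributed, then sample W_t = G_{At}, which conditionally on A is a Gaussian process with covariance exp(-A²(s - t)²):

```python
# Sketch: random-scaling prior in dimension d = 1.
# Draw A with A^d ~ Gamma, then W_t = G_{A t}, where E G_s G_t = exp(-(s - t)^2),
# so conditionally on A the covariance on [0, 1] is exp(-A^2 (s - t)^2).
import numpy as np

rng = np.random.default_rng(7)
d = 1
A = rng.gamma(shape=2.0, scale=1.0) ** (1.0 / d)   # A^d ~ Gamma(2, 1) (illustrative)

t = np.linspace(0, 1, 150)
K = np.exp(-(A ** 2) * np.subtract.outer(t, t) ** 2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))  # jitter for numerical stability
W = L @ rng.normal(size=len(t))                    # one draw from the prior
```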

Conclusion

Conclusion and final remark. There exist natural (fixed) priors that yield fully automatic smoothing at the correct bandwidth, for instance randomly scaled Gaussian processes. Similar statements are true for adaptation to the scale of models described by sparsity (research in progress).