Bayesian Regularization


Bayesian Regularization. Aad van der Vaart, Vrije Universiteit Amsterdam. International Congress of Mathematicians, Hyderabad, August 2010.

Contents: Introduction; Abstract result; Gaussian process priors.

Co-authors: Subhashis Ghosal (abstract result) and Harry van Zanten (Gaussian process priors).

Introduction

The Bayesian paradigm. A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a probability density p_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x, the posterior distribution:

Π(Θ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ), equivalently dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

Reverend Thomas. Thomas Bayes (1702-1761; published 1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial(n, θ). Using his famous rule he computed that the posterior distribution is then Beta(X + 1, n - X + 1):

dΠ_n(θ | X) ∝ (n choose X) θ^X (1 - θ)^{n-X} · 1.

[Figure: posterior density for n = 20, X = 13.]
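
As a concrete check of Bayes' computation, here is a minimal sketch in Python (assuming numpy and scipy are available; the choice n = 20, X = 13 matches the plotted example):

```python
# Sketch: uniform prior on theta, X | theta ~ Binomial(n, theta).
# The posterior is then Beta(X + 1, n - X + 1), as computed by Bayes.
import numpy as np
from scipy import stats

n, X = 20, 13                      # observed data (as in the slide's figure)
posterior = stats.beta(X + 1, n - X + 1)

theta = np.linspace(0, 1, 501)
density = posterior.pdf(theta)     # posterior density on [0, 1]

print("posterior mean:", posterior.mean())          # (X + 1) / (n + 2)
print("95% credible interval:", posterior.interval(0.95))
```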

Nonparametric Bayes. If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes' formula does not change: dΠ(θ | X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions. [Figure: simulated draws from a prior and from the corresponding posterior on a function space.]

Frequentist Bayesian. Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(· | X) as a random measure on the parameter set. We like Π(· | X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n, where the information increases as n → ∞. We like the posterior Π_n(· | X^n) to contract to {θ_0}, at a good rate.

Rate of contraction. Assume X^n is generated according to a given parameter θ_0, where the information increases as n → ∞. The posterior is consistent if, for every ε > 0,

E_{θ_0} Π(θ: d(θ, θ_0) < ε | X^n) → 1.

The posterior contracts at rate at least ε_n if

E_{θ_0} Π(θ: d(θ, θ_0) < ε_n | X^n) → 1.

Basic results on consistency were proved by Doob (1948) and Schwartz (1965). Interest in rates is recent.
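
To make the definition concrete, a small Monte Carlo sketch (an illustration, not from the talk; numpy/scipy assumed) estimates E_{θ_0} Π(|θ - θ_0| < ε | X_n) in the uniform-prior binomial example, where the posterior is available in closed form; the expected posterior mass near θ_0 should approach 1 as n grows:

```python
# Sketch: estimate E_theta0 Pi(|theta - theta0| < eps | X_n) by simulation
# for the uniform-prior / binomial model, whose posterior is Beta(X+1, n-X+1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0, eps = 0.3, 0.05

for n in [50, 500, 5000]:
    X = rng.binomial(n, theta0, size=2000)            # 2000 replicated data sets
    mass = stats.beta.cdf(theta0 + eps, X + 1, n - X + 1) \
         - stats.beta.cdf(theta0 - eps, X + 1, n - X + 1)
    print(n, mass.mean())                              # increases towards 1
```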

Minimaxity. To a given model Θ is attached an optimal rate of convergence, defined by the minimax criterion

ε_n = inf_T sup_{θ ∈ Θ} E_θ d(T(X), θ).

This criterion has nothing to do with Bayes. A prior is good if the posterior contracts at this rate. (?)

Adaptation. A model can be viewed as an instrument to test quality. It makes sense to use a collection (Θ_α: α ∈ A) of models simultaneously, e.g. a scale of regularity classes. A posterior is good if it adapts: if the true parameter belongs to Θ_α, then the contraction rate is at least the minimax rate for this model.

Bayesian perspective Any prior (and hence posterior) is appropriate per se. In complex situations subject knowledge can be and must be incorporated in the prior. Computational ease is important for prior choice as well. Frequentist properties reveal key properties of priors of interest.

Abstract result

Entropy. The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ. [Figure: coverings with a big ε and with a small ε.] The entropy is its logarithm: log N(ε, Θ, d).
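
A toy illustration (a sketch, not from the talk): for the unit interval with the absolute-value metric, N(ε, [0, 1], |·|) = ⌈1/(2ε)⌉, so the entropy grows like log(1/ε):

```python
# Sketch: covering numbers and entropy of the unit interval [0, 1].
# Balls of radius eps are intervals of length 2*eps, so ceil(1/(2*eps)) suffice.
import math

for eps in [0.5, 0.1, 0.01, 0.001]:
    N = math.ceil(1 / (2 * eps))       # covering number N(eps, [0,1], |.|)
    print(eps, N, math.log(N))         # entropy log N(eps) ~ log(1/eps)
```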

Rate theorem, i.i.d. observations. Given a random sample X_1, ..., X_n from a density p_0 and a prior Π on a set P of densities, consider the posterior

dΠ_n(p | X_1, ..., X_n) ∝ ∏_{i=1}^n p(X_i) dΠ(p).

THEOREM [Ghosal+Ghosh+vdV, 2000]. The Hellinger contraction rate is ε_n if there exist P_n ⊂ P such that
(1) log N(ε_n, P_n, h) ≤ nε_n² and Π(P_n) = 1 - o(e^{-3nε_n²}). [entropy]
(2) Π(B_KL(p_0, ε_n)) ≥ e^{-nε_n²}. [prior mass]
Here h is the Hellinger distance, h²(p, q) = ∫ (√p - √q)² dµ, and B_KL(p_0, ε) is a Kullback-Leibler neighborhood of p_0.

The entropy condition ensures that the likelihood is not too variable, so that it cannot be large at a wrong place by pure randomness. Le Cam (1964) showed that it gives the minimax rate.

Rate theorem, general. Given data X^n following a model (P^n_θ: θ ∈ Θ) that satisfies Le Cam's testing criterion, and a prior Π, form the posterior

dΠ_n(θ | X^n) ∝ p^n_θ(X^n) dΠ(θ).

THEOREM. The rate of contraction is ε_n ≥ 1/√n if there exist Θ_n ⊂ Θ such that
(1) D_n(ε_n, Θ_n, d_n) ≤ nε_n² and Π_n(Θ \ Θ_n) = o(e^{-3nε_n²}).
(2) Π_n(B_n(θ_0, ε_n; k)) ≥ e^{-nε_n²}.
Here B_n(θ_0, ε; k) is a Kullback-Leibler type neighbourhood of p^n_{θ_0}.

The theorem can be refined in various ways.

Le Cam's testing criterion. A statistical model (P^n_θ: θ ∈ Θ) indexed by a metric space (Θ, d). For all ε > 0 and all θ_1 with d(θ_1, θ_0) > ε there exist tests φ_n with

P^n_{θ_0} φ_n ≤ e^{-nε²},   sup_{θ ∈ Θ: d(θ, θ_1) < ε/2} P^n_θ(1 - φ_n) ≤ e^{-nε²}.

[Figure: θ_0 and θ_1 at distance > ε; a ball of radius ε/2 around θ_1.]

This applies to independent data, Markov chains, Gaussian time series, ergodic diffusions, ...

Adaptation by hierarchical prior, i.i.d. Let P_{n,α}, for α ∈ A, be collections of densities with priors Π_{n,α}, and let λ_n = (λ_{n,α}: α ∈ A) be prior weights. Form the mixture prior Π_n = Σ_{α ∈ A} λ_{n,α} Π_{n,α}.

THEOREM. The Hellinger contraction rate is ε_{n,β} if the prior weights satisfy (*) below and
(1) log N(ε_{n,α}, P_{n,α}, h) ≤ nε²_{n,α}, for every α ∈ A.
(2) Π_{n,β}(B_{n,β}(p_0, ε_{n,β})) ≥ e^{-nε²_{n,β}}.
Here B_{n,α}(p_0, ε) is a Kullback-Leibler type neighbourhood of p_0 within P_{n,α}.

Condition (*) on prior weights (simplified). The weights of the models α ≠ β must be suitably small relative to λ_{n,β}: both

Σ_{α<β} (λ_{n,α}/λ_{n,β}) e^{nε²_{n,α}} + Σ_{α>β} (λ_{n,α}/λ_{n,β}) e^{nε²_{n,β}}  and  Σ_{α<β} (λ_{n,α}/λ_{n,β}) Π_{n,α}(C_{n,α}(p_0, ε_{n,α}))

are bounded by e^{4nε²_{n,β}}. Here α < β means ε_{n,α} ≥ ε_{n,β}, and C_{n,α}(p_0, ε) is the Hellinger ball of radius ε around p_0 in P_{n,α}.

In many situations there is much freedom in the choice of weights. The weights λ_{n,α} ∝ µ_α e^{-Cnε²_{n,α}} always work.
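
A minimal sketch of the "always work" weights (the values of µ_α, C and the rates ε_{n,α} below are illustrative assumptions, not from the talk):

```python
# Sketch: hierarchical prior weights lambda_{n,alpha} proportional to
# mu_alpha * exp(-C * n * eps_{n,alpha}^2), normalized to sum to one.
import numpy as np

n, C = 1000, 2.0
alphas = np.array([0.5, 1.0, 2.0])              # regularities indexing the models
mu = np.array([1.0, 1.0, 1.0])                  # fixed base weights
eps = n ** (-alphas / (2 * alphas + 1))         # minimax rates for C^alpha on [0, 1]

lam = mu * np.exp(-C * n * eps ** 2)
lam /= lam.sum()                                # prior weights on the models
print(dict(zip(alphas, lam)))                   # rough models get exponentially small weight
```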

Model selection. THEOREM. Under the conditions of the theorem,

Π_n(α: α < β | X_1, ..., X_n) → 0 in probability,
Π_n(α: α > β, h(p_{n,α}, p_0) ≥ ε_{n,β} | X_1, ..., X_n) → 0 in probability.

Too big models do not get posterior weight. Neither do small models that are far from the truth.

Examples of priors Dirichlet mixtures of normals. Discrete priors. Mixtures of betas. Series priors (splines, Fourier, wavelets,...). Independent increment process priors. Sparse priors....... Gaussian process priors.

Gaussian process priors

Gaussian process. The law of a stochastic process W = (W_t: t ∈ T) is a prior distribution on the space of functions w: T → R. Gaussian processes have been found useful because of their variety and because of their computational properties. Every Gaussian prior is reasonable in some way. We shall study performance with smoothness classes as a test case.

Example: Brownian density estimation. For W Brownian motion, use as prior on a density p on [0, 1]:

x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.   [Leonard, Lenk, Tokdar & Ghosh]

[Figure: sample paths t ↦ W_t of Brownian motion and the corresponding prior densities t ↦ c exp(W_t).]
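
A sketch of drawing from this prior (the Brownian motion is approximated by a random walk on a grid; the grid size is an assumption):

```python
# Sketch: sample densities x -> exp(W_x) / int_0^1 exp(W_y) dy
# with W a Brownian motion approximated on a grid of [0, 1].
import numpy as np

rng = np.random.default_rng(1)
m = 1000                                   # grid points on [0, 1]
dt = 1.0 / m
x = np.linspace(0, 1, m + 1)

W = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(dt), m))])
p = np.exp(W)
p /= p.mean()                              # grid mean approximates the normalizing integral
# p is now one draw from the prior on densities; plot x versus p to visualize
```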

Integrated Brownian motion. [Figure: sample paths of 0, 1, 2 and 3 times integrated Brownian motion.]
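
A sketch of simulating repeatedly integrated Brownian motion on a grid by cumulative summation (a discrete approximation; grid size is an illustrative choice):

```python
# Sketch: 0-, 1-, 2- and 3-times integrated Brownian motion on [0, 1],
# using cumulative sums as a discrete approximation of integration.
import numpy as np

rng = np.random.default_rng(2)
m = 2000
dt = 1.0 / m
B = np.cumsum(rng.normal(0, np.sqrt(dt), m))   # Brownian motion on the grid

paths = [B]
for k in range(3):
    paths.append(np.cumsum(paths[-1]) * dt)    # integrate the previous path
# paths[k] approximates k-times integrated Brownian motion;
# each integration makes the sample path one degree smoother.
```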

Stationary processes. [Figure: sample paths of stationary Gaussian processes with a Gaussian spectral measure and with a Matérn (3/2) spectral measure.]

Other Gaussian processes. [Figure: Brownian sheet; fractional Brownian motion (α = 0.2 and α = 0.8); series prior w = Σ_i w_i e_i with w_i independent N(0, σ_i²).]
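
A sketch of the series prior w = Σ_i w_i e_i with independent w_i ~ N(0, σ_i²); the cosine basis and the decay σ_i = i^{-(α+1/2)} are illustrative choices, not prescribed by the talk:

```python
# Sketch: random series prior w = sum_i w_i e_i, w_i ~ N(0, sigma_i^2) independently.
import numpy as np

rng = np.random.default_rng(3)
alpha, I = 1.0, 200                       # "regularity" and truncation level
x = np.linspace(0, 1, 500)

i = np.arange(1, I + 1)
sigma = i ** (-(alpha + 0.5))             # standard deviations of the coefficients
w_i = rng.normal(0, sigma)                # one draw of the coefficients
e = np.sqrt(2) * np.cos(np.outer(i, np.pi * x))   # cosine basis functions e_i(x)

w = w_i @ e                               # one sample path of the prior on [0, 1]
```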

Rates for Gaussian priors. The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = -log P(‖W‖ < ε).

THEOREM. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²  AND  inf_{h ∈ H: ‖h - w_0‖ < ε_n} ‖h‖²_H ≤ nε_n².

Both inequalities give a lower bound on ε_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ε_n ≍ 1/√n.

Settings.
Density estimation: X_1, ..., X_n i.i.d. in [0, 1], p_w(x) = e^{w(x)} / ∫_0^1 e^{w(t)} dt. Distance on parameter: Hellinger on p_w. Norm on W: uniform.
Classification: (X_1, Y_1), ..., (X_n, Y_n) i.i.d. in [0, 1] × {0, 1}, P_w(Y = 1 | X = x) = 1/(1 + e^{-w(x)}). Distance on parameter: L_2(G) on P_w (G the marginal of the X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(w(x_i), σ²), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on w. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t: t ∈ [0, n]), ergodic, recurrent: dX_t = w(X_t) dt + σ(X_t) dB_t. Distance on parameter: (random) Hellinger semimetric h_n, comparable to ‖·/σ‖_{µ_0,2}. Norm on W: L_2(µ_0) (µ_0 the stationary measure).

Reproducing kernel Hilbert space. To every Gaussian random element W with values in a Banach space (B, ‖·‖) is attached a certain Hilbert space (H, ‖·‖_H), called the RKHS. The norm ‖·‖_H is stronger than ‖·‖, and hence we can consider H ⊂ B.

DEFINITION. For S: B* → B defined by Sb* = E[W b*(W)], the RKHS is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E[b*_1(W) b*_2(W)].

DEFINITION. For a process W = (W_x: x ∈ X) with bounded sample paths and covariance function K(x, y) = E[W_x W_y], the RKHS is the completion of the set of functions x ↦ Σ_i α_i K(y_i, x), under ⟨Σ_i α_i K(y_i, ·), Σ_j β_j K(z_j, ·)⟩_H = Σ_i Σ_j α_i β_j K(y_i, z_j).

EXAMPLE. If W is multivariate normal N_d(0, Σ), then the RKHS is R^d with norm ‖h‖²_H = hᵗ Σ^{-1} h.

EXAMPLE. Any W can be represented as W = Σ_{i=1}^∞ µ_i Z_i e_i, for numbers µ_i ≥ 0, i.i.d. standard normal Z_1, Z_2, ..., and e_1, e_2, ... ∈ B with ‖e_1‖ = ‖e_2‖ = ... = 1. The RKHS consists of all h := Σ_i h_i e_i with ‖h‖²_H := Σ_i h_i²/µ_i² < ∞.

EXAMPLE. Brownian motion is a random element in C[0, 1]. Its RKHS is H = {h: h(0) = 0, ∫ h'(t)² dt < ∞} with norm ‖h‖_H = ‖h'‖_2.
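
A small numerical sketch of the two finite and series examples above (dimensions, covariance and coefficient values are illustrative assumptions):

```python
# Sketch: RKHS norms in the multivariate normal and series examples.
import numpy as np

# Multivariate normal N_d(0, Sigma): ||h||_H^2 = h^T Sigma^{-1} h.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
h = np.array([1.0, -1.0])
print("finite-dimensional RKHS norm^2:", h @ np.linalg.solve(Sigma, h))

# Series representation W = sum_i mu_i Z_i e_i: ||h||_H^2 = sum_i h_i^2 / mu_i^2.
mu = 1.0 / np.arange(1, 101) ** 2          # illustrative decay of mu_i
h_coef = 1.0 / np.arange(1, 101) ** 3      # coefficients of h in the basis (e_i)
print("series RKHS norm^2:", np.sum(h_coef ** 2 / mu ** 2))
```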

Small ball probability. The small ball probability of a Gaussian random element W in (B, ‖·‖) is P(‖W‖ < ε), and the small ball exponent φ_0(ε) is minus the logarithm of this. [Figure: a small ball for the uniform norm.]

It can be computed either by probabilistic arguments, or analytically from the RKHS.

THEOREM [Kuelbs & Li (93)]. For H_1 the unit ball of the RKHS (up to constants),

φ_0(ε) ≍ log N(ε/√φ_0(ε), H_1, ‖·‖).

There is a big literature. (In July 2009 the database maintained by Michael Lifshits contained 243 entries.)
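
A crude Monte Carlo sketch of the small ball exponent of Brownian motion under the uniform norm (grid and sample sizes are illustrative, and the grid supremum slightly underestimates the true one); for small ε the theory gives φ_0(ε) ≍ ε^{-2}:

```python
# Sketch: estimate phi_0(eps) = -log P(sup_t |W_t| < eps) for Brownian motion
# by simulating paths on a grid; compare with the theoretical order eps^{-2}.
import numpy as np

rng = np.random.default_rng(4)
m, reps = 400, 20_000
dt = 1.0 / m

W = np.cumsum(rng.normal(0, np.sqrt(dt), size=(reps, m)), axis=1)
sup = np.abs(W).max(axis=1)                 # (discretized) uniform norm of each path

for eps in [1.0, 0.7, 0.5]:
    p = (sup < eps).mean()
    print(eps, -np.log(p), 1 / eps ** 2)    # -log P(||W|| < eps) versus eps^{-2}
```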

Rates for Gaussian priors: proof. The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = -log P(‖W‖ < ε).

THEOREM. If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if φ_0(ε_n) ≤ nε_n² AND inf_{h ∈ H: ‖h - w_0‖ < ε_n} ‖h‖²_H ≤ nε_n².

PROOF. The posterior rate is ε_n if there exist sets B_n such that
(1) log N(ε_n, B_n, d) ≤ nε_n² and P(W ∈ B_n) = 1 - o(e^{-3nε_n²}). [entropy]
(2) P(‖W - w_0‖ < ε_n) ≥ e^{-nε_n²}. [prior mass]
Take B_n = M_n H_1 + ε_n B_1 for large M_n (H_1, B_1 the unit balls of H, B).

Proof of (2): key results. Define the concentration function

φ_{w_0}(ε) := φ_0(ε) + inf_{h ∈ H: ‖h - w_0‖ < ε} ‖h‖²_H.

THEOREM [Kuelbs & Li (93)]. The concentration function measures concentration around w_0 (up to factors 2):

P(‖W - w_0‖ < ε) ≍ e^{-φ_{w_0}(ε)}.

THEOREM [Borell (75)]. For H_1 and B_1 the unit balls of the RKHS and B,

P(W ∉ MH_1 + εB_1) ≤ 1 - Φ(Φ^{-1}(e^{-φ_0(ε)}) + M).

(Integrated) Brownian motion. THEOREM. If w_0 ∈ C^β[0, 1], then the rate for Brownian motion is n^{-1/4} if β ≥ 1/2 and n^{-β/2} if β ≤ 1/2.

The small ball exponent of Brownian motion is φ_0(ε) ≍ (1/ε)² as ε ↓ 0. This gives the n^{-1/4}-rate, even for very smooth truths. Truths with β ≤ 1/2 are far from the RKHS, giving the rate n^{-β/2}. The minimax rate is attained iff β = 1/2.

THEOREM. If w_0 ∈ C^β[0, 1], then the rate for (α - 1/2)-times integrated Brownian motion is n^{-(α∧β)/(2α+1)}. The minimax rate is attained iff β = α.

Stationary processes. A stationary Gaussian field (W_t: t ∈ R^d) is characterized through a spectral measure µ, by

cov(W_s, W_t) = ∫ e^{iλᵗ(s-t)} dµ(λ).

THEOREM. Suppose that µ is Gaussian, and let ŵ_0 be the Fourier transform of w_0: [0, 1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|² dλ < ∞, then the rate of contraction is near 1/√n. If ∫ (1 + ‖λ‖²)^β |ŵ_0(λ)|² dλ < ∞, then the rate is (1/log n)^{κ_β}.

THEOREM. Suppose that dµ(λ) = (1 + ‖λ‖²)^{-α-d/2} dλ. If w_0 ∈ C^β[0, 1]^d, then the rate of contraction is n^{-(α∧β)/(2α+d)}.
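
A sketch of drawing sample paths of a stationary Gaussian process whose covariance is the squared exponential exp(-(s - t)²), which corresponds (up to constants) to a Gaussian spectral measure; the grid size and jitter are illustrative numerical choices:

```python
# Sketch: sample paths on [0, 1] of a stationary Gaussian process with
# covariance exp(-(s - t)^2), i.e. a Gaussian spectral measure.
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 150)
K = np.exp(-np.subtract.outer(t, t) ** 2)            # covariance matrix on the grid
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))    # small jitter for numerical stability
W = L @ rng.normal(size=(len(t), 3))                 # three independent sample paths
```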

Adaptation Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by putting a prior on the regularity of the process. An alternative, more attractive approach is scaling.

Stretching or shrinking. Sample paths can be smoothed by stretching and roughened by shrinking. [Figure: a stretched sample path and a shrunken sample path.]

Scaled (integrated) Brownian motion. Take W_t = B_{t/c_n} for B Brownian motion and c_n ≍ n^{(2α-1)/(2α+1)}. For α < 1/2: c_n → 0 (shrink). For α ∈ (1/2, 1]: c_n → ∞ (stretch).

THEOREM. The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0, 1], α ∈ (0, 1].

THEOREM. Appropriate scaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k + 1].

Stretching helps a little, shrinking helps a lot.
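
A sketch of the scaling construction W_t = B_{t/c_n}, with the Brownian motion simulated directly at the rescaled time points (the values of α and n are illustrative):

```python
# Sketch: scaled Brownian motion W_t = B_{t / c_n} on [0, 1] with
# c_n = n^{(2*alpha - 1) / (2*alpha + 1)}: c_n -> 0 shrinks (roughens),
# c_n -> infinity stretches (smooths).
import numpy as np

rng = np.random.default_rng(6)
n, alpha = 10_000, 0.8
c_n = n ** ((2 * alpha - 1) / (2 * alpha + 1))   # > 1 here, so the path is stretched

m = 5000                                         # grid points on [0, 1]
t = np.linspace(0, 1, m)
s = t / c_n                                      # time points of B that are needed
ds = np.diff(s, prepend=0.0)
W = np.cumsum(rng.normal(0, np.sqrt(ds)))        # B evaluated at s, i.e. W_t = B_{t/c_n}
```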

Scaled smooth stationary process. A Gaussian field with infinitely smooth sample paths is obtained for E[G_s G_t] = exp(-‖s - t‖²).

THEOREM. The prior W_t = G_{t/c_n} for c_n ≍ n^{-1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0, 1]^d, any α > 0.

Adaptation by random scaling. Choose A^d from a Gamma distribution. Choose (G_t: t > 0) centered Gaussian with E[G_s G_t] = exp(-‖s - t‖²). Set W_t = G_{At}.

THEOREM. If w_0 ∈ C^α[0, 1]^d, then the rate of contraction is nearly n^{-α/(2α+d)}. If w_0 is supersmooth, then the rate is nearly n^{-1/2}.

The first result is also true for randomly scaled k-times integrated Brownian motion and α ≤ k + 1.
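
A sketch of one draw from the randomly rescaled smooth process in dimension d = 1 (the Gamma parameters, grid and jitter are illustrative assumptions): draw A with A^d Gamma distributed, then sample W_t = G_{At}, which conditionally on A is a Gaussian process with covariance exp(-A²(s - t)²):

```python
# Sketch: random-scaling prior in dimension d = 1.
# Draw A with A^d ~ Gamma, then W_t = G_{A t}, where E G_s G_t = exp(-(s - t)^2),
# so conditionally on A the covariance on [0, 1] is exp(-A^2 (s - t)^2).
import numpy as np

rng = np.random.default_rng(7)
d = 1
A = rng.gamma(shape=2.0, scale=1.0) ** (1.0 / d)   # A^d ~ Gamma(2, 1) (illustrative)

t = np.linspace(0, 1, 150)
K = np.exp(-(A ** 2) * np.subtract.outer(t, t) ** 2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))  # jitter for numerical stability
W = L @ rng.normal(size=len(t))                    # one draw from the prior
```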

Conclusion

Conclusion and final remark. There exist natural (fixed) priors that yield fully automatic smoothing at the correct bandwidth, for instance randomly scaled Gaussian processes. Similar statements are true for adaptation to the scale of models described by sparsity (research in progress).