Bayesian Regularization


1 Bayesian Regularization. Aad van der Vaart, Vrije Universiteit Amsterdam. International Congress of Mathematicians, Hyderabad, August 2010.

2 Contents: Introduction. Abstract result. Gaussian process priors.

3 Co-authors: Abstract result with Subhashis Ghosal; Gaussian process priors with Harry van Zanten.

4 Introduction

5 The Bayesian paradigm A parameter Θ is generated according to a prior distribution Π. Given Θ = θ the data X is generated according to a probability density p_θ. This gives a joint distribution of (X, Θ). Given observed data x the statistician computes the conditional distribution of Θ given X = x: the posterior distribution,

Π(Θ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ),   i.e.   dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

8 Reverend Thomas Thomas Bayes (1763) followed this argument with Θ possessing the uniform distribution and X given Θ = θ binomial (n, θ). Using his famous rule he computed that the posterior distribution is then Beta(X + 1, n − X + 1):

dΠ_n(θ | X) ∝ (n choose X) θ^X (1 − θ)^{n−X} · 1.

[Figure: prior and posterior densities for n = 20 and several values of X.]
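To make this concrete, here is a minimal Python sketch (not from the slides) of the Beta(X + 1, n − X + 1) posterior obtained from a uniform prior and binomial data; the values n = 20 and X = 7 are illustrative.

```python
import numpy as np
from scipy.stats import beta

n, X = 20, 7                        # illustrative data: X successes in n trials
posterior = beta(X + 1, n - X + 1)  # uniform prior + binomial likelihood

print("posterior mean:", posterior.mean())           # (X + 1) / (n + 2)
print("95% credible interval:", posterior.interval(0.95))

# posterior density on a grid, e.g. for plotting
theta = np.linspace(0, 1, 201)
density = posterior.pdf(theta)
```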

13 Nonparametric Bayes If the parameter θ is a function, then the prior is a probability distribution on a function space. So is the posterior, given the data. Bayes' formula does not change: dΠ(θ | X) ∝ p_θ(X) dΠ(θ). Prior and posterior can be visualized by plotting functions that are simulated from these distributions. [Figures: sample functions simulated from the prior and from the posterior.]
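The plots on these slides are lost in this transcription, but such pictures are easy to reproduce. A minimal sketch, assuming a simple Gaussian sine-series prior chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
alpha = 1.0  # assumed regularity level of the prior draws (illustrative)

def draw_prior_path(rng, t, alpha, n_terms=200):
    """One sample path of a series prior w(t) = sum_i mu_i Z_i e_i(t)."""
    i = np.arange(1, n_terms + 1)
    mu = i ** -(alpha + 0.5)                             # decaying standard deviations
    Z = rng.standard_normal(n_terms)                     # iid N(0, 1) coefficients
    basis = np.sqrt(2) * np.sin(np.pi * np.outer(t, i))  # sine basis e_i(t)
    return basis @ (mu * Z)

for _ in range(5):                                       # a few draws from the prior
    plt.plot(t, draw_prior_path(rng, t, alpha))
plt.title("Sample paths simulated from a Gaussian series prior")
plt.show()
```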

22 Frequentist Bayesian Assume that the data X is generated according to a given parameter θ_0 and consider the posterior Π(θ ∈ · | X) as a random measure on the parameter set. We like Π(· | X) to put most of its mass near θ_0 for most X. Asymptotic setting: data X^n, where the information increases as n → ∞. We like the posterior Π_n(· | X^n) to contract to {θ_0}, at a good rate.

25 Rate of contraction Assume X^n is generated according to a given parameter θ_0, where the information increases as n → ∞. The posterior is consistent if, for every ε > 0, E_{θ_0} Π(θ: d(θ, θ_0) < ε | X^n) → 1. The posterior contracts at rate at least ε_n if E_{θ_0} Π(θ: d(θ, θ_0) < ε_n | X^n) → 1. Basic results on consistency were proved by Doob (1948) and Schwartz (1965). Interest in rates is recent.
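For the binomial example above this contraction can be checked numerically. A sketch, with illustrative values for θ_0, ε and the sample sizes:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
theta0, eps, reps = 0.3, 0.1, 2000       # true parameter, ball radius, replications

for n in [10, 100, 1000, 10000]:
    X = rng.binomial(n, theta0, size=reps)
    # posterior mass of {theta: |theta - theta0| < eps} under Beta(X+1, n-X+1)
    mass = (beta.cdf(theta0 + eps, X + 1, n - X + 1)
            - beta.cdf(theta0 - eps, X + 1, n - X + 1))
    print(n, mass.mean())                # E_theta0 Pi(|theta - theta0| < eps | X) -> 1
```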

26 Minimaxity To a given model Θ is attached an optimal rate of convergence defined by the minimax criterion ε_n = inf_T sup_{θ ∈ Θ} E_θ d(T(X), θ). This criterion has nothing to do with Bayes. A prior is good if the posterior contracts at this rate. (?)

27 Adaptation A model can be viewed as an instrument to test quality. It makes sense to use a collection (Θ_α : α ∈ A) of models simultaneously, e.g. a scale of regularity classes. A posterior is good if it adapts: if the true parameter belongs to Θ_α, then the contraction rate is at least the minimax rate for this model.

28 Bayesian perspective Any prior (and hence posterior) is appropriate per se. In complex situations subject knowledge can be and must be incorporated in the prior. Computational ease is important for prior choice as well. Frequentist properties reveal key properties of priors of interest.

29 Abstract result

30 Entropy The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ. [Figure: coverings by big and by small balls.] Entropy is its logarithm: log N(ε, Θ, d).
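As a toy illustration (not from the slides): for the unit interval with the absolute-value metric, N(ε, [0, 1], |·|) = ⌈1/(2ε)⌉, so the entropy grows like log(1/ε). A sketch:

```python
import math

def covering_number_unit_interval(eps):
    """Minimal number of eps-balls (intervals of length 2*eps) needed to cover [0, 1]."""
    return math.ceil(1.0 / (2.0 * eps))

for eps in [0.5, 0.1, 0.01, 0.001]:
    N = covering_number_unit_interval(eps)
    print(f"eps={eps}: N={N}, entropy log N = {math.log(N):.2f}")
```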

31 Rate theorem, iid observations Given a random sample X_1, ..., X_n from a density p_0 and a prior Π on a set P of densities, consider the posterior

dΠ_n(p | X_1, ..., X_n) ∝ ∏_{i=1}^n p(X_i) dΠ(p).

THEOREM [Ghosal+Ghosh+vdV, 2000] The Hellinger contraction rate is ε_n if there exist P_n ⊂ P such that
(1) log N(ε_n, P_n, h) ≤ nε_n² and Π(P_n) = 1 − o(e^{−3nε_n²}). [entropy]
(2) Π(B_KL(p_0, ε_n)) ≥ e^{−nε_n²}. [prior mass]

Here h is the Hellinger distance: h²(p, q) = ∫ (√p − √q)² dμ, and B_KL(p_0, ε) is a Kullback-Leibler neighbourhood of p_0.

The entropy condition ensures that the likelihood is not too variable, so that it cannot be large at a wrong place by pure randomness. Le Cam (1964) showed that it gives the minimax rate.

33 Rate theorem, general Given data X^n following a model (P^n_θ : θ ∈ Θ) that satisfies Le Cam's testing criterion, and a prior Π, form the posterior

dΠ_n(θ | X^n) ∝ p^n_θ(X^n) dΠ(θ).

THEOREM The rate of contraction is ε_n (≥ 1/√n) if there exist Θ_n ⊂ Θ such that
(1) D_n(ε_n, Θ_n, d_n) ≤ nε_n² and Π_n(Θ \ Θ_n) = o(e^{−3nε_n²}).
(2) Π_n(B_n(θ_0, ε_n; k)) ≥ e^{−nε_n²}.

B_n(θ_0, ε; k) is a Kullback-Leibler type neighbourhood of p^n_{θ_0}. The theorem can be refined in various ways.

35 Le Cam's testing criterion Statistical model (P^n_θ : θ ∈ Θ) indexed by a metric space (Θ, d). For all ε > 0 and all θ_1 with d(θ_1, θ_0) > ε there exists a test φ_n with

P^n_{θ_0} φ_n ≤ e^{−nε²},   sup_{θ ∈ Θ: d(θ, θ_1) < ε/2} P^n_θ(1 − φ_n) ≤ e^{−nε²}.

[Figure: θ_0 and a ball of radius ε/2 around θ_1, at distance > ε.]

This applies to independent data, Markov chains, Gaussian time series, ergodic diffusions, ...

36 Adaptation by hierarchical prior (i.i.d. observations) P_{n,α} a collection of densities with prior Π_{n,α}, for α ∈ A. Prior weights λ_n = (λ_{n,α} : α ∈ A):

Π_n = Σ_{α ∈ A} λ_{n,α} Π_{n,α}.

THEOREM The Hellinger contraction rate is ε_{n,β} if the prior weights satisfy (*) below and
(1) log N(ε_{n,α}, P_{n,α}, h) ≤ nε_{n,α}², for every α ∈ A.
(2) Π_{n,β}(B_{n,β}(p_0, ε_{n,β})) ≥ e^{−nε_{n,β}²}.

B_{n,α}(p_0, ε) is a Kullback-Leibler type neighbourhood of p_0 within P_{n,α}.

37 Condition (*) on prior weights (simplified)

Σ_{α<β} (λ_{n,α}/λ_{n,β}) e^{nε_{n,α}²} + Σ_{α>β} (λ_{n,α}/λ_{n,β}) e^{nε_{n,β}²} ≤ e^{4nε_{n,β}²},
Σ_{α<β} (λ_{n,α}/λ_{n,β}) Π_{n,α}(C_{n,α}(p_0, ε_{n,α})) ≤ e^{4nε_{n,β}²}.

α < β means ε_{n,α} ≥ ε_{n,β}. C_{n,α}(p_0, ε) is the Hellinger ball of radius ε around p_0 in P_{n,α}.

In many situations there is much freedom in the choice of weights. The weights λ_{n,α} ∝ μ_α e^{−Cnε_{n,α}²} always work.

39 Model selection THEOREM Under the conditions of the theorem,

Π_n(α: α < β | X_1, ..., X_n) → 0 in probability,
Π_n(α: α ≥ β, h(P_{n,α}, p_0) ≥ ε_{n,β} | X_1, ..., X_n) → 0 in probability.

Too big models do not get posterior weight. Neither do small models that are far from the truth.

40 Examples of priors Dirichlet mixtures of normals. Discrete priors. Mixtures of betas. Series priors (splines, Fourier, wavelets, ...). Independent increment process priors. Sparse priors. Gaussian process priors.

41 Gaussian process priors

42 Gaussian process The law of a stochastic process W = (W_t : t ∈ T) is a prior distribution on the space of functions w: T → R. Gaussian processes have been found useful, because of their variety and because of computational properties. Every Gaussian prior is reasonable in some way. We shall study performance with smoothness classes as a test case.

43 Example: Brownian density estimation For W Brownian motion, use as prior on a density p on [0, 1]:

x ↦ e^{W_x} / ∫_0^1 e^{W_y} dy.

[Leonard, Lenk, Tokdar & Ghosh]

[Figure: a Brownian motion sample path t ↦ W_t and the corresponding prior density t ↦ c exp(W_t).]
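A minimal sketch of this construction on a grid (grid size illustrative; the normalizing integral is approximated by a Riemann sum):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
m = 1000                                   # grid size on [0, 1] (illustrative)
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Brownian motion on the grid: cumulative sum of independent N(0, dt) increments
W = np.concatenate([[0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(m - 1))])

# prior draw of a density: x -> exp(W_x) / integral_0^1 exp(W_y) dy
p = np.exp(W) / (np.exp(W).sum() * dt)     # Riemann-sum normalization

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(t, W); axes[0].set_title("Brownian motion W_t")
axes[1].plot(t, p); axes[1].set_title("Prior density c exp(W_t)")
plt.show()
```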

45 Integrated Brownian motion [Figure: sample paths of 0, 1, 2 and 3 times integrated Brownian motion.]
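A sketch of k-times integrated Brownian motion via repeated cumulative summation on a grid (a rough discretization, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 1000
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

def brownian_path(rng):
    """Brownian motion on the grid t."""
    increments = np.sqrt(dt) * rng.standard_normal(m - 1)
    return np.concatenate([[0.0], np.cumsum(increments)])

def integrated_bm(rng, k):
    """k-times integrated Brownian motion: integrate the path k times."""
    path = brownian_path(rng)
    for _ in range(k):
        path = np.cumsum(path) * dt        # cumulative integral; each pass adds smoothness
    return path

paths = {k: integrated_bm(rng, k) for k in range(4)}   # 0, 1, 2 and 3 times integrated
```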

46 Stationary processes [Figures: sample paths of stationary Gaussian processes with a Gaussian spectral measure and with a Matérn (3/2) spectral measure.]
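A sketch simulating such processes on a grid by a Cholesky factorization of the covariance matrix, assuming the squared-exponential kernel (Gaussian spectral measure) and the Matérn 3/2 kernel; the length scale and the numerical jitter are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 200)
D = np.abs(t[:, None] - t[None, :])          # pairwise distances

def sample_gp(cov, rng, jitter=1e-6):
    """One sample path of a centered Gaussian process with covariance matrix cov."""
    L = np.linalg.cholesky(cov + jitter * np.eye(len(cov)))
    return L @ rng.standard_normal(len(cov))

ell = 0.1                                    # illustrative length scale
cov_se = np.exp(-(D / ell) ** 2)             # squared-exponential (Gaussian spectral measure)
cov_m32 = (1 + np.sqrt(3) * D / ell) * np.exp(-np.sqrt(3) * D / ell)  # Matern 3/2

path_smooth = sample_gp(cov_se, rng)         # very smooth sample path
path_rough = sample_gp(cov_m32, rng)         # rougher (once-differentiable) sample path
```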

47 Other Gaussian processes [Figures: Brownian sheet; fractional Brownian motion for two values of alpha; a series prior w = Σ_i w_i e_i with w_i independent N(0, σ_i²).]

48 Rates for Gaussian priors The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = −log P(‖W‖ < ε).

THEOREM If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²   AND   inf_{h ∈ H: ‖h − w_0‖ < ε_n} ‖h‖_H² ≤ nε_n².

Both inequalities give a lower bound on ε_n. The first depends on W and not on w_0. If w_0 ∈ H, then the second inequality is satisfied for ε_n of the order 1/√n.

51 Settings
Density estimation: X_1, ..., X_n iid in [0, 1], p_w(x) = e^{w_x} / ∫_0^1 e^{w_t} dt. Distance on parameter: Hellinger on p_w. Norm on W: uniform.
Classification: (X_1, Y_1), ..., (X_n, Y_n) iid in [0, 1] × {0, 1}, P_w(Y = 1 | X = x) = 1/(1 + e^{−w_x}). Distance on parameter: L_2(G) on P_w (G the marginal of the X_i). Norm on W: L_2(G).
Regression: Y_1, ..., Y_n independent N(w(x_i), σ²), for fixed design points x_1, ..., x_n. Distance on parameter: empirical L_2-distance on w. Norm on W: empirical L_2-distance.
Ergodic diffusions: (X_t : t ∈ [0, n]), ergodic, recurrent: dX_t = w(X_t) dt + σ(X_t) dB_t. Distance on parameter: random Hellinger h_n (≍ ‖·/σ‖_{μ_0,2}). Norm on W: L_2(μ_0) (μ_0 the stationary measure).

52 Reproducing kernel Hilbert space To every Gaussian random element W with values in a Banach space (B, ‖·‖) is attached a certain Hilbert space (H, ‖·‖_H), called the RKHS. The norm ‖·‖_H is stronger than ‖·‖, and hence we can consider H ⊂ B.

53 Reproducing kernel Hilbert space DEFINITION For S: B* → B defined by Sb* = E W b*(W), the RKHS is the completion of SB* under ⟨Sb*_1, Sb*_2⟩_H = E b*_1(W) b*_2(W).

54 Reproducing kernel Hilbert space DEFINITION For a process W = (W_x : x ∈ X) with bounded sample paths and covariance function K(x, y) = E W_x W_y, the RKHS is the completion of the set of functions x ↦ Σ_i α_i K(y_i, x) under ⟨Σ_i α_i K(y_i, ·), Σ_j β_j K(z_j, ·)⟩_H = Σ_i Σ_j α_i β_j K(y_i, z_j).

55 Reproducing kernel Hilbert space EXAMPLE If W is multivariate normal N_d(0, Σ), then the RKHS is R^d with norm ‖h‖_H² = h^t Σ^{−1} h.

56 Reproducing kernel Hilbert space EXAMPLE Any W can be represented as W = Σ_{i=1}^∞ μ_i Z_i e_i, for numbers μ_i ≥ 0, iid standard normal Z_1, Z_2, ..., and e_1, e_2, ... ∈ B with ‖e_1‖ = ‖e_2‖ = ... = 1. The RKHS consists of all h := Σ_i h_i e_i with ‖h‖_H² := Σ_i h_i²/μ_i² < ∞.

57 Reproducing kernel Hilbert space EXAMPLE Brownian motion is a random element in C[0, 1]. Its RKHS is H = {h: ∫ h′(t)² dt < ∞} with norm ‖h‖_H = ‖h′‖_2.
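A sketch of the series representation with a truncated expansion, together with the RKHS norm of an element given by its coefficients; the basis, the decay of μ_i and the truncation level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
n_terms = 100
i = np.arange(1, n_terms + 1)
mu = 1.0 / i                                       # illustrative decay of mu_i
t = np.linspace(0, 1, 300)
e = np.sqrt(2) * np.sin(np.pi * np.outer(i, t))    # sine basis e_i(t) (illustrative)

# Truncated draw W = sum_i mu_i Z_i e_i
Z = rng.standard_normal(n_terms)
W = (mu * Z) @ e

# An element h = sum_i h_i e_i lies in the RKHS iff sum_i h_i^2 / mu_i^2 < infinity
h_coef = 1.0 / i ** 2                              # coefficients of an illustrative h
rkhs_norm_sq = np.sum(h_coef ** 2 / mu ** 2)       # ||h||_H^2 = sum_i h_i^2 / mu_i^2
print("||h||_H^2 =", rkhs_norm_sq)
```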

58 Small ball probability The small ball probability of a Gaussian random element W in (B, ‖·‖) is P(‖W‖ < ε), and the small ball exponent φ_0(ε) is minus the logarithm of this. [Figure: small ball probabilities for the uniform norm.] It can be computed either by probabilistic arguments, or analytically from the RKHS.

THEOREM [Kuelbs & Li (93)] For H_1 the unit ball of the RKHS (up to constants),

φ_0(ε) ≍ log N(ε/√φ_0(ε), H_1, ‖·‖).

There is a big literature. (In July, … entries in the database maintained by Michael Lifshits.)
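A crude Monte Carlo check of the small ball exponent of Brownian motion (φ_0(ε) ≍ (1/ε)², see the Brownian motion slide further on); a sketch only, since for very small ε the event is far too rare for naive Monte Carlo, so moderate ε are used and only the order of magnitude is meaningful:

```python
import numpy as np

rng = np.random.default_rng(6)
m, reps, chunk = 200, 200_000, 20_000
dt = 1.0 / m
eps_values = [0.7, 0.6, 0.5, 0.4]
hits = np.zeros(len(eps_values))

for _ in range(reps // chunk):
    incr = np.sqrt(dt) * rng.standard_normal((chunk, m))
    sup_norm = np.abs(np.cumsum(incr, axis=1)).max(axis=1)   # sup_t |W_t| on the grid
    for j, eps in enumerate(eps_values):
        hits[j] += (sup_norm < eps).sum()

for eps, h in zip(eps_values, hits):
    prob = h / reps                                          # P(sup_t |W_t| < eps)
    print(f"eps={eps}: -log P = {-np.log(prob):.2f}  vs  (1/eps)^2 = {1 / eps**2:.2f}")
```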

60 Rates for Gaussian priors: proof The prior W is a Gaussian map in (B, ‖·‖) with RKHS (H, ‖·‖_H) and small ball exponent φ_0(ε) = −log P(‖W‖ < ε).

THEOREM If statistical distances on the model combine appropriately with the norm of B, then the posterior rate is ε_n if

φ_0(ε_n) ≤ nε_n²   AND   inf_{h ∈ H: ‖h − w_0‖ < ε_n} ‖h‖_H² ≤ nε_n².

PROOF The posterior rate is ε_n if there exist sets B_n such that
(1) log N(ε_n, B_n, d) ≤ nε_n² and P(W ∈ B_n) = 1 − o(e^{−3nε_n²}). [entropy]
(2) P(‖W − w_0‖ < ε_n) ≥ e^{−nε_n²}. [prior mass]

Take B_n = M_n H_1 + ε_n B_1 for large M_n (H_1, B_1 the unit balls of H, B).

63 Proof of (2): key results

φ_{w_0}(ε) := φ_0(ε) + inf_{h ∈ H: ‖h − w_0‖ < ε} ‖h‖_H².

THEOREM [Kuelbs & Li (93)] The concentration function measures concentration around w_0 (up to factors 2):

P(‖W − w_0‖ < ε) ≍ e^{−φ_{w_0}(ε)}.

THEOREM [Borell (75)] For H_1 and B_1 the unit balls of RKHS and B,

P(W ∉ M H_1 + ε B_1) ≤ 1 − Φ(Φ^{−1}(e^{−φ_0(ε)}) + M).

64 (Integrated) Brownian motion THEOREM If w_0 ∈ C^β[0, 1], then the rate for Brownian motion is: n^{−1/4} if β ≥ 1/2; n^{−β/2} if β ≤ 1/2.

The small ball exponent of Brownian motion is φ_0(ε) ≍ (1/ε)² as ε → 0. This gives the n^{−1/4} rate, even for very smooth truths. Truths with β ≤ 1/2 are far from the RKHS, giving the rate n^{−β/2}. The minimax rate is attained iff β = 1/2.

THEOREM If w_0 ∈ C^β[0, 1], then the rate for (α − 1/2)-times integrated Brownian motion is n^{−(β∧α)/(2α+d)}. The minimax rate is attained iff β = α.

66 Stationary processes A stationary Gaussian field (W_t : t ∈ R^d) is characterized through a spectral measure μ, by

cov(W_s, W_t) = ∫ e^{i λ^T (s − t)} dμ(λ).

THEOREM Suppose that μ is Gaussian. Let ŵ_0 be the Fourier transform of w_0: [0, 1]^d → R. If ∫ e^{‖λ‖} |ŵ_0(λ)|² dλ < ∞, then the rate of contraction is near 1/√n. If ∫ (1 + ‖λ‖²)^β |ŵ_0(λ)|² dλ < ∞, then the rate is (1/log n)^{κ_β}.

THEOREM Suppose that dμ(λ) = (1 + ‖λ‖²)^{−(α + d/2)} dλ. If w_0 ∈ C^β[0, 1]^d, then the rate of contraction is n^{−(α∧β)/(2α+d)}.

69 Adaptation Every Gaussian prior is good for some regularity class, but may be very bad for another. This can be alleviated by putting a prior on the regularity of the process. An alternative, more attractive approach is scaling.

70 Stretching or shrinking Sample paths can be smoothed by stretching and roughened by shrinking. [Figures: a sample path together with stretched and shrunk versions.]
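A sketch of the effect: W_t = B_{t/c} can be simulated on [0, 1] by using Brownian increments with variance dt/c, so that c > 1 stretches (visually smoother paths) and c < 1 shrinks (rougher paths); the values of c are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
m = 1000
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

def scaled_bm(rng, c):
    """W_t = B_{t/c} on [0, 1]: Brownian increments with variance dt / c."""
    incr = np.sqrt(dt / c) * rng.standard_normal(m - 1)
    return np.concatenate([[0.0], np.cumsum(incr)])

for c, label in [(10.0, "stretched (c > 1)"), (1.0, "unscaled"), (0.1, "shrunk (c < 1)")]:
    plt.plot(t, scaled_bm(rng, c), label=label)
plt.legend()
plt.show()
```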

72 Scaled (integrated) Brownian motion W_t = B_{t/c_n} for B Brownian motion, and c_n ≍ n^{(2α−1)/(2α+1)}.
α < 1/2: c_n → 0 (shrink). α ∈ (1/2, 1]: c_n → ∞ (stretch).

THEOREM The prior W_t = B_{t/c_n} gives the optimal rate for w_0 ∈ C^α[0, 1], α ∈ (0, 1].

THEOREM Appropriate scaling of k times integrated Brownian motion gives an optimal prior for every α ∈ (0, k + 1].

Stretching helps a little, shrinking helps a lot.

75 Scaled smooth stationary process A Gaussian field with infinitely-smooth sample paths is obtained for E G_s G_t = exp(−‖s − t‖²).

THEOREM The prior W_t = G_{t/c_n} for c_n ≍ n^{−1/(2α+d)} gives a nearly optimal rate for w_0 ∈ C^α[0, 1]^d, any α > 0.

76 Adaptation by random scaling Choose A^d from a Gamma distribution. Choose (G_t : t > 0) centered Gaussian with E G_s G_t = exp(−‖s − t‖²). Set W_t = G_{At}.

THEOREM If w_0 ∈ C^α[0, 1]^d, then the rate of contraction is nearly n^{−α/(2α+d)}. If w_0 is supersmooth, then the rate is nearly n^{−1/2}.

The first result is also true for randomly scaled k-times integrated Brownian motion and α ≤ k + 1.
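A sketch of one draw from this randomly rescaled prior in dimension d = 1: A is drawn from a Gamma distribution (shape and rate illustrative) and W_t = G_{At} is sampled from the squared-exponential covariance evaluated at the rescaled inputs; the jitter is for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(8)
t = np.linspace(0, 1, 300)

def draw_randomly_scaled_path(rng, t, shape=2.0, rate=1.0, jitter=1e-6):
    """One draw of W_t = G_{A t}, with A ~ Gamma (d = 1) and G squared-exponential."""
    A = rng.gamma(shape, 1.0 / rate)                 # random bandwidth parameter (scale = 1/rate)
    s = A * t                                        # rescaled inputs
    cov = np.exp(-(s[:, None] - s[None, :]) ** 2)    # E G_u G_v = exp(-(u - v)^2)
    L = np.linalg.cholesky(cov + jitter * np.eye(len(t)))
    return L @ rng.standard_normal(len(t))

W = draw_randomly_scaled_path(rng, t)
```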

78 Conclusion

79 Conclusion and final remark There exist natural (fixed) priors that yield fully automatic smoothing at the correct bandwidth. (For instance, randomly scaled Gaussian processes.) Similar statements are true for adaptation to the scale of models described by sparsity (research in progress).
