Foundations of Nonparametric Bayesian Methods
Part II: Models on the Simplex

Peter Orbanz
http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html

Tutorial Overview

Part I: Basics
Part II: Models on the simplex
  1. Bayesian models and de Finetti
  2. DP construction
  3. Generalization examples: Pólya trees, neutral processes
  4. Consistency of NP Bayesian estimates
Part III: Construction of new models

Introduction: NP Bayes Motivation

Machine learning
  Roughly: flexible models that adapt well to data.

Bayesian statistics
  Search for universal priors.

Bayesian modeling
  Assumption: data i.i.d. from an unknown model. (Essential!)
  De Finetti: justified for exchangeable data.
  The theorem of de Finetti (in part) prompted the search for NP priors.

Exchangeable sequence
  The probability of data X_1, X_2, ... is invariant under permutations of the
  sampling order.

De Finetti's Theorem (1)

Theorem (De Finetti, 1937; Hewitt & Savage, 1955)
  On a Polish space (D, B(D)), RVs X_1, X_2, ... are exchangeable iff there is
  a random probability measure F on D such that X_1, X_2, ... are conditionally
  i.i.d. with distribution F.

Additionally:
  1. X_i exchangeable ⇒ the distribution G of F is uniquely determined.
  2. For B ∈ B(D):
       $\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_B(X_i) = F(B)$  a.s.
     That is: the empirical distribution converges to F almost surely.

De Finetti's Theorem (2)

Interpretation
  If the X_i are exchangeable, then:
  1. There is a RV F with values f ∈ M_+^1(D)
  2. Distribution: G = F(P) (the image measure of P under F)
  3. The X_i are generated as: f ~ G and X_i ~ i.i.d. f

Consequence
  For $A := \times_{i=1}^{\infty} A_i$:
    $\mu_X(A) = \int_{M_+^1(D)} \Big( \prod_{i=1}^{\infty} f(A_i) \Big) \, dG(f)$

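The a.s. convergence in point 2 of the theorem is easy to check numerically. A
minimal sketch; the toy setup is an illustrative choice, not from the slides
(F a random Bernoulli measure with parameter p ~ Beta, and B = {1}):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy de Finetti setup (illustrative choice): draw a random measure F,
# here a Bernoulli(p) with p ~ Beta, then sample X_1, X_2, ... i.i.d. F.
p = rng.beta(2.0, 5.0)                 # one draw f ~ G
x = rng.binomial(1, p, size=100_000)   # X_i i.i.d. f, given f

# Empirical frequency of B = {1} converges to F(B) = p almost surely.
print(f"F(B) = {p:.4f}   empirical = {x.mean():.4f}")
```
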
Bayes and de Finetti

NP Bayesian density estimation
  1. Start with a prior assumption (e.g., a DP) for G.
  2. Update the prior given data.
  3. Hopefully: the posterior will concentrate at the true f.
  (We are not trying to estimate G!)

Requirements
  1. Prior sufficiently spread over M_+^1(D)
  2. Updating the prior on F given draws from F must be possible

Terminology: Conjugacy
  Algebraic: prior in G, likelihood in F ⇒ posterior in G
  Functional: prior and posterior in the same parametric class;
    measurable map (prior parameters, data) → posterior parameters

Prior Domains

Note
  Parametric priors fit perfectly into the de Finetti framework.
  Why not use a parametric prior on F?

Example (Draper, 1999)
  1. Start with a Gaussian model, prior on µ, σ.
  2. Data comes in: bimodal.
  3. Two choices:
     Keep the model ⇒ poor fit
     Change to a bimodal model ⇒ confounded (data used twice)

Prior domain requirement
  The prior should support sufficiently diverse models to roughly fit
  arbitrary data.

Process Models: Divergent Aims

NP Bayesian statistics
  Prior models with large support in M_+^1(D)
  DP: concentrates on the discrete distributions in M_+^1(D)
  Principal motivation of DP generalizations: overcome discreteness

Machine learning
  Principal application: clustering and mixtures
  Discreteness required

Construction Approaches

Kolmogorov constructions
  Construct the process directly from its marginals.

Generalizations of the DP
  Consider all processes with a particular property of the DP:
    Partitioning: tailfree processes and Pólya trees
    Random CDFs: neutral-to-the-right (NTTR) and Lévy processes
  Related: DP mixtures (smoothed DPs)

De Finetti-based
  Derive the class of possible G from knowledge about the X_i.
  Example: X_i follow a generalized Pólya urn scheme ⇒ G is a DP

Dirichlet Process Construction

Recall: Two-stage construction
  1. Extend marginals to a process on (Ω^E, B^E)  (Kolmogorov)
  2. Restrict the process to a subset Ω ⊂ Ω^E  (Doob)

Kolmogorov's theorem: Dirichlet process
  Axes: Ω_0 := [0, 1]
  Index set: E = H(D), the "measurable partitions"
    H(D) := {(B_1, ..., B_d) partition of D | d ∈ N, B_i ∈ B(D)}
  Marginals: Dirichlet distributions; for I ∈ E,
    $d\mu^I_{\Theta_I \mid \alpha, y_I}(\theta_I) = p_I(\theta_I \mid \alpha, y_I) \, d\lambda_I(\theta_I)$
  where:
    p_I = Dirichlet density
    λ_I = Lebesgue measure restricted to Sim(R, I)

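The Kolmogorov extension requires the Dirichlet marginals to be consistent
under coarsening of partitions. A quick numerical sanity check of this
aggregation property (the specific parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Consistency of Dirichlet marginals under bin merging (the property
# Kolmogorov's theorem needs): if (θ_1,...,θ_4) ~ Dir(α_1,...,α_4),
# then (θ_1+θ_2, θ_3+θ_4) ~ Dir(α_1+α_2, α_3+α_4).
alpha = np.array([1.0, 2.0, 3.0, 4.0])
fine = rng.dirichlet(alpha, size=200_000)
merged = np.column_stack([fine[:, :2].sum(1), fine[:, 2:].sum(1)])
coarse = rng.dirichlet([alpha[:2].sum(), alpha[2:].sum()], size=200_000)

# First two moments of the first coordinate should agree.
print(merged[:, 0].mean(), coarse[:, 0].mean())
print(merged[:, 0].var(),  coarse[:, 0].var())
```
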
DP: Restriction Step

Restrict to Ω = M_+^1(D)
  Check: Ω Polish (in the weak topology; true if D is Polish)
  Check: outer measure µ_{E,∗}(Ω) = 1 (true for Dirichlet marginals)
  Doob's theorem: there exists µ̃_E on Ω with the same marginals as µ_E.
  We are done: µ̃_E is our Dirichlet process, µ̃_E = DP(αG_0)

Historical note
  Ferguson's (1973) construction: only the extension step.
  Ghosh & Ramamoorthi (2002) note that singletons are not in B^E and give an
  equivalent construction on Q.

Conjugate Posterior

Bayesian model with DP
  Unobserved: G ~ DP(· | α_0 G_0)  (draw from prior)
  Observed: θ ~ G  (sample draw)

Consider marginals
  For any H = (B_1, ..., B_m) ∈ H: exactly one bin B_k contains θ.
  Posterior update:
    Prior parameters: y_0^H = (G_0(B_1), ..., G_0(B_m)) and α_0
    Posterior: $y_1^H \propto (G_0(B_1), \ldots, G_0(B_k) + \tfrac{1}{\alpha_0}, \ldots, G_0(B_m))$
    and α_1 = α_0 + 1

DP posterior
  The posterior is DP(· | α_1 G_1) with G_1 ∝ α_0 G_0 + δ_θ and α_1 = α_0 + 1

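On any fixed partition H, this update is just a Dirichlet-multinomial step. A
minimal sketch of that marginal computation (function name and parameter
values are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical sketch: the DP posterior update seen through one fixed
# partition H = (B_1, ..., B_m). Prior Dirichlet parameters are
# alpha0 * G0(B_j); observing theta in bin k adds 1 to that bin.
def dp_posterior_on_partition(alpha0, g0_probs, k):
    """Return (alpha1, y1) for the Dirichlet marginal after one draw in bin k."""
    params = alpha0 * np.asarray(g0_probs, dtype=float)
    params[k] += 1.0                  # + delta_theta mass in bin k
    alpha1 = alpha0 + 1.0
    return alpha1, params / alpha1    # y1 = G1 restricted to H

alpha1, y1 = dp_posterior_on_partition(alpha0=2.0, g0_probs=[0.5, 0.3, 0.2], k=1)
print(alpha1, y1)   # 3.0, proportional to (alpha0 * G0 + delta_theta) on H
```
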
Sampling Under DP Prior

Model
  θ_1, ..., θ_n ~ G,  G ~ DP(α_0 G_0)
  Marginalize out G.

Sequential sampling
  With the posterior formula:
    θ_1 ~ G_0
    $\theta_{i+1} \sim \frac{1}{i + \alpha_0} \Big( \sum_{j=1}^{i} \delta_{\theta_j} + \alpha_0 G_0 \Big)$

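This sequential scheme (the generalized Pólya urn, a.k.a. Chinese restaurant
process) is straightforward to simulate. A sketch, with G_0 = N(0, 1) as an
arbitrary base measure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Pólya-urn / Chinese-restaurant sampling with G marginalized out
# (a minimal sketch; G0 = standard normal is an arbitrary choice).
def sample_dp_sequence(n, alpha0, sample_g0):
    thetas = [sample_g0()]                      # theta_1 ~ G0
    for i in range(1, n):
        # With prob alpha0/(i + alpha0) draw fresh from G0,
        # otherwise repeat a uniformly chosen previous value.
        if rng.random() < alpha0 / (i + alpha0):
            thetas.append(sample_g0())
        else:
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

draws = sample_dp_sequence(1000, alpha0=2.0, sample_g0=rng.standard_normal)
print("distinct values:", np.unique(draws).size)  # grows ~ alpha0 * log(n)
```
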
Discreteness

Key observation
  If G ~ DP(α_0 G_0 + δ_θ), then G({θ}) > 0 a.s.

Consequence
  A draw G from a DP is discrete almost surely, so
    $G = \sum_{i=1}^{\infty} c_i \, \delta_{\theta_i}$

Stick-breaking (Sethuraman, 1994)
  Generative model for the c_i and θ_i:
    θ_i ~ i.i.d. G_0
    $c_i := \Big( \prod_{j=1}^{i-1} (1 - Z_j) \Big) Z_i$  with  Z_i ~ i.i.d. Beta(1, α_0)

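The stick-breaking representation translates directly into a truncated sampler
for G. A sketch (the truncation level K is an approximation choice of mine,
not part of the construction):

```python
import numpy as np

rng = np.random.default_rng(3)

# Truncated stick-breaking draw of G = sum_i c_i * delta_{theta_i}
# (a sketch; truncation at K terms approximates the infinite sum).
def stick_breaking(alpha0, sample_g0, K=500):
    z = rng.beta(1.0, alpha0, size=K)             # Z_i ~ Beta(1, alpha0)
    c = z * np.cumprod(np.concatenate([[1.0], 1.0 - z[:-1]]))
    thetas = sample_g0(K)                          # theta_i i.i.d. G0
    return c, thetas

c, thetas = stick_breaking(2.0, rng.standard_normal)
print("total mass kept by truncation:", c.sum())   # close to 1 for large K
```
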
Other Properties

Undominated model
  The family DP(· | α_0 G_0 + δ_θ) of posteriors is not dominated.
  In particular: no Bayes equation (no density form of Bayes' theorem).
  Implication: conjugacy is essential.

Prior support: KL property
  The DP puts non-zero mass on every Kullback-Leibler ε-neighborhood of every
  measure in M_+^1(D).

Now: Generalizations

(Potentially) smooth models
  Dirichlet process mixtures
  Pólya trees, tailfree processes
  Neutral-to-the-right (NTTR) and Lévy processes

DP Mixtures

Original motivation
  Smooth the DP by convolution with a parametric model.

Mixture model
  q a conditional density, µ_Z a mixing distribution:
    $p(x) = \int_{\Omega_z} q(x \mid z) \, d\mu_Z(z)$
  Example: q Gaussian with variance z; suitable µ_Z ⇒ p Student

Finite mixture
  $\mu_Z(B) = \sum_{i=1}^{n} c_i \, \delta_{z_i}(B)$  for B ∈ B(Ω_z)

DP mixture
  Draw µ_Z from a DP.

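A truncated stick-breaking draw of µ_Z makes the smoothing explicit: the
random discrete measure is convolved with a Gaussian kernel q. A sketch; the
choices of G_0 (on component means), bandwidth, and truncation level are all
mine:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# A DP mixture of Gaussians, sketched via truncated stick-breaking:
# draw mu_Z ~ DP(alpha0 * G0), then p(x) = sum_i c_i q(x | z_i).
K, alpha0 = 200, 2.0
z = rng.beta(1.0, alpha0, size=K)
c = z * np.cumprod(np.concatenate([[1.0], 1.0 - z[:-1]]))
means = rng.normal(0.0, 3.0, size=K)          # z_i ~ G0 (component means)

def density(x, sigma=0.5):
    """Smoothed random density p(x) = sum_i c_i N(x | means_i, sigma^2)."""
    return (c * norm.pdf(x[:, None], means[None, :], sigma)).sum(axis=1)

xs = np.linspace(-8, 8, 5)
print(density(xs))
```
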
Pólya Trees

Idea (on R)
  Binary splitting of R into measurable sets. Subdivision as a tree:
    Nodes = sets, with children = subsets
    Root = R
  One random probability per edge:
    P(B) = product over the probabilities on the path root → B

Model components
  m = level in the tree
  π_m = measurable partition; π_{m+1} refines π_m
  V_{m,B} = Beta RVs
  α_{m,B} = Beta parameters

Pólya Tree Properties

  Include discrete and continuous models.
  Particular expectation G_0: choose the π_m such that the partition sets B
  match the quantiles of G_0.
  α_{m,B} = m² ⇒ draws absolutely continuous
  α_{m,B} ∝ 2^{−m} ⇒ DP

Conjugacy
  Sample observation:
    G ~ PT(π, α),  Y ~ G
  Posterior update:
    α_{m,B} → α_{m,B} + 1  for every B with Y ∈ B.

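The edge-probability construction can be sketched at finite depth on [0, 1]
with dyadic partitions: one Beta variable per internal node, leaf masses as
products along the path. The choice α_{m,B} = m² follows the continuity
condition above; the depth and domain are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

# Finite-depth sketch of a Pólya tree draw on [0, 1]: dyadic partitions
# pi_m, one Beta-distributed probability per edge, P(leaf) = product of
# edge probabilities along the path. alpha_m = m^2 favors continuity.
def polya_tree_leaf_probs(depth):
    probs = np.array([1.0])
    for m in range(1, depth + 1):
        a = m ** 2                          # alpha_{m,B} = m^2 (one choice)
        v = rng.beta(a, a, size=probs.size) # V_{m,B}: mass sent to left child
        probs = np.column_stack([probs * v, probs * (1 - v)]).ravel()
    return probs                            # masses of the 2^depth dyadic bins

p = polya_tree_leaf_probs(depth=10)
print(p.sum(), p.max())                     # sums to 1; no dominant atom
```
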
The Tailfree Property

Pólya trees are an example of tailfree processes.

Def: Tailfree process
  Consider a process defined like the PT, but with arbitrary RVs V_{m,B}.
  Tailfree: the RV sets {V_{m,B} | B ∈ π_m} are independent between tree levels.

Meaning: Tailfree model
  The prior makes no assumption about the shape of the tails outside the
  region covered by the samples.

Freedman (1968)
  Countable domains: tailfree processes have consistent posteriors.

Random CDFs on R

Observation: For D = R
  DP: normalized gamma process.
  Draw the CDF as a process trajectory.
  CDF properties are local:
    1. initialize at 0
    2. monotonicity expressible through increments

  [Figure: sample trajectory of a monotone (CDF-like) process]

Possible choice: Lévy processes
  Easily constrained to be monotonic ("subordinators")
  Can be simulated numerically
  Best-understood class of stochastic processes
  Independence of components may be desirable

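A gamma subordinator on a grid makes this concrete: independent Gamma
increments give a monotone trajectory, and normalizing produces a random CDF,
a draw from a DP in the normalized-gamma view above. A sketch with
G_0 = Uniform[0, T] as an arbitrary base measure:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch: random CDF on [0, T] via a gamma subordinator on a grid.
# Independent Gamma(shape = alpha0 * G0-mass of the cell) increments;
# normalizing the trajectory gives a draw from DP(alpha0 * G0)
# (here G0 = Uniform[0, T], so each cell gets shape alpha0 * dt / T).
T, n_grid, alpha0 = 5.0, 1000, 2.0
dt = T / n_grid
increments = rng.gamma(shape=alpha0 * dt / T, scale=1.0, size=n_grid)
path = np.concatenate([[0.0], np.cumsum(increments)])  # starts at 0, monotone
cdf = path / path[-1]                                  # normalize: random CDF
print(cdf[0], cdf[-1])                                 # 0.0, 1.0
```
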
Lévy Processes

Def: Lévy process
  Stochastic process with stationary, independent increments and
  right-continuous paths with left-hand limits (rcll).

Intuition: rcll
  Only a countable number of jumps.
  Jump at x: the value at x belongs to the right-hand branch.

Decomposition property
  Decomposes into a Brownian motion (continuous) and a Poisson (jump)
  component (⇒ simulation; see the sketch below).

De Finetti property (Bühlmann, 1960)
  Process (X_t) with increments X_t − X_s (s ≤ t):
  increments exchangeable ⇔ conditionally stationary and independent.

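The decomposition is what makes simulation easy: simulate the continuous and
jump parts separately and add them. A sketch (drift omitted; the jump rate and
jump-size distribution are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(7)

# Sketch of the Lévy decomposition used for simulation: a continuous
# Brownian part plus a compound-Poisson jump part, on a time grid.
T, n = 1.0, 10_000
dt = T / n
brownian = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=n))
n_jumps = rng.poisson(5.0 * T)                   # jumps of a rate-5 Poisson
jump_times = np.sort(rng.uniform(0.0, T, size=n_jumps))
jump_sizes = rng.exponential(1.0, size=n_jumps)
t = np.linspace(dt, T, n)
jumps = (jump_sizes[None, :] * (jump_times[None, :] <= t[:, None])).sum(axis=1)
x = brownian + jumps                             # rcll Lévy trajectory on grid
print(x[-1])
```
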
Lévy Processes and NPB

Direct approach
  To generate a CDF:
    1. Draw a trajectory from an increasing process
    2. Normalize

Dead end (James et al., 2005)
  If a normalized subordinator is conjugate, it is a Dirichlet process.

Neutrality (NTTR processes)
  A process is neutral to the right if the cumulative hazard function
  (of 1 − CDF) has independent increments:
    F NTTR ⇔ −log(1 − F(·)) has independent increments (Lévy-type)

Lévy NTTR process properties
  Conjugacy (algebraic), contain the DP, range of consistency results

Lévy Processes and Kolmogorov Extensions

Construction of a Lévy process
  Choose the µ_I to guarantee independent, stationary increments
  Apply Kolmogorov ⇒ process µ_E on Ω^E
  Restrict to Ω = set of rcll functions

Convolution semigroup
  Set of measures ν_s (s ∈ R_+) with
    $\nu_{s+t} = \nu_s * \nu_t$  for all s, t ∈ R_+

Lévy marginals from a convolution semigroup
  For I = {t}, set µ_I := ν_t, with {ν_s}_{s ∈ R_+} a convolution semigroup.

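The gamma family, which underlies the gamma subordinator above, is a standard
example of a convolution semigroup: ν_s = Gamma(shape = s), and
ν_s ∗ ν_t = ν_{s+t}. A quick numerical check of this identity:

```python
import numpy as np

rng = np.random.default_rng(8)

# The gamma family as a convolution semigroup (basis of the gamma
# subordinator): nu_s = Gamma(shape=s), and nu_s * nu_t = nu_{s+t}.
s, t, n = 0.7, 1.8, 500_000
conv = rng.gamma(s, size=n) + rng.gamma(t, size=n)   # draw from nu_s * nu_t
direct = rng.gamma(s + t, size=n)                    # draw from nu_{s+t}
print(conv.mean(), direct.mean())    # both approximately s + t
print(conv.var(),  direct.var())     # both approximately s + t
```
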
Consistency

Consistency means: if θ_0 is the true solution, the posterior (regarded as a
random measure) converges to δ_{θ_0} a.s. as n → ∞.

Two notions of consistency
  Frequentist consistency: a.s. under µ_X(X | Θ = θ_0).
  Bayesian consistency: a.s. under the evidence;
    only ensures consistency for θ_0 in the prior domain.

Different results
  Bayesian definition: under very general conditions, any Bayesian model is
  consistent (Doob, 1949).
  Frequentist definition: models like the DP can be completely wrong.

Example Results

Countable domain D (Freedman, 1968)
  Freedman gives an example of a Bayes-consistent model which converges to an
  arbitrary prescribed distribution for any true value of θ_0.
  Remedy: tailfree prior.

DP inconsistency (Diaconis & Freedman, 1986)
  The DP estimate of the mean (of an unknown distribution) can be inconsistent.

Recent results
  Quantify model complexity (learning-theory style).
  Under conditions on the complexity: consistency and convergence rates.
  Controllable behavior under model misspecification.
  Shen & Wasserman (1998); Ghosal, van der Vaart et al. (2002 and later)

Preview Part III

Constructing parameterized models from parameterized marginals
  Conjugacy
  Sufficient statistics of process models
  What are suitable families of marginals?