Foundations of Nonparametric Bayesian Methods


Slide 1/27: Foundations of Nonparametric Bayesian Methods, Part II: Models on the Simplex (Peter Orbanz)

Slide 2/27: Tutorial Overview

Part I: Basics

Part II: Models on the simplex
1. Bayesian models and de Finetti
2. DP construction
3. Generalization examples: Pólya trees, neutral processes
4. Consistency of NP Bayesian estimates

Part III: Construction of new models

Slide 3/27: Introduction: NP Bayes Motivation

Machine learning
Roughly: flexible models that adapt well to data.

Bayesian statistics
Search for universal priors.

Bayesian modeling
Assumption: data i.i.d. from an unknown model. (Essential!)
De Finetti: justified for exchangeable data.
The theorem of de Finetti (in part) prompted the search for NP priors.

Exchangeable sequence
The probability of data X_1, X_2, ... is invariant under permutation of the sampling order.

Slide 4/27: De Finetti's Theorem (1)

Theorem (De Finetti, 1937; Hewitt & Savage, 1955)
On a Polish space (D, B(D)), random variables X_1, X_2, ... are exchangeable iff there is a random probability measure F on D such that X_1, X_2, ... are conditionally i.i.d. with distribution F.

Additionally:
1. X_i exchangeable ⇒ the distribution G of F is uniquely determined.
2. For B ∈ B(D):

    lim_{n→∞} (1/n) Σ_{i=1}^{n} I_B(X_i) = F(B)   almost surely

That is: the empirical distribution converges to F almost surely.
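
Point 2 of the theorem can be sketched numerically (a minimal illustration, not from the slides; F is taken to be a Bernoulli measure with a Beta-distributed parameter):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a random distribution F (here a Bernoulli parameter from a Beta
# prior), then sample X_1, X_2, ... i.i.d. from F.  The sequence is
# exchangeable, and the empirical frequency of an event B converges
# to F(B), as in point 2 of the theorem.
p = rng.beta(2.0, 2.0)            # the random measure F = Bernoulli(p)
x = rng.random(200_000) < p       # X_i i.i.d. from F, given p

empirical = x.mean()              # (1/n) Σ_i I_B(X_i) with B = {1}
print(abs(empirical - p))        # small for large n
```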

Slide 5/27: De Finetti's Theorem (2)

Interpretation
If the X_i are exchangeable, then:
1. There is a random variable F with values f ∈ M_+^1(D).
2. Its distribution is G = F(P), the image measure of P under F.
3. The X_i are generated as: f ~ G and X_i i.i.d. ~ f.

Consequence
For A^∞ := ∏_{i=1}^{∞} A_i:

    µ_X(A^∞) = ∫_{M_+^1(D)} ( ∏_{i=1}^{∞} f(A_i) ) dG(f)

Slide 6/27: Bayes and de Finetti

NP Bayesian density estimation
1. Start with a prior assumption (e.g. a DP) for G.
2. Update the prior given data.
3. Hopefully: the posterior will concentrate at the true f. (We are not trying to estimate G!)

Requirements
1. Prior sufficiently spread over M_+^1(D).
2. Updating the prior on F given draws from F must be possible.

Terminology: Conjugacy
Algebraic: prior in G, likelihood in F ⇒ posterior in G.
Functional: prior and posterior in the same parametric class.
Measurable map: (prior parameter, data) → posterior parameter.

Slide 7/27: Prior Domains

Note
Parametric priors fit perfectly into the de Finetti framework. Why not use a parametric prior on F?

Example (Draper, 1999)
1. Start with a Gaussian model, prior on µ, σ.
2. Data comes in: bimodal.
3. Two choices:
   - Keep the model ⇒ poor fit.
   - Change to a bimodal model ⇒ confounded (data used twice).

Prior domain requirement
The prior should support sufficiently diverse models to roughly fit arbitrary data.

Slide 8/27: Process Models: Divergent Aims

NP Bayesian statistics
Prior models with large support in M_+^1(D).
The DP concentrates on the discrete distributions in M_+^1(D).
Principal motivation for generalizing the DP: overcome discreteness.

Machine learning
Principal application: clustering and mixtures.
Discreteness is required.

Slide 9/27: Construction Approaches

Kolmogorov constructions
Construct the process directly from its marginals.

Generalizations of the DP
Consider all processes with a particular property of the DP:
- Partitioning: tailfree processes and Pólya trees
- Random CDFs: neutral-to-the-right (NTTR) and Lévy processes
Related: DP mixtures (smoothed DPs)

De Finetti-based
Derive the class of possible G from knowledge about the X_i.
Example: if the X_i follow a generalized Pólya urn scheme, then G is a DP.

Slide 10/27: Dirichlet Process Construction

Recall: two-stage construction
1. Extend marginals to a process on (Ω_E, B_E) (Kolmogorov).
2. Restrict the process to a subset Ω ⊂ Ω_E (Doob).

Kolmogorov's theorem: Dirichlet process
Axes: Ω_0 := [0, 1]
Index set: E = H(D), the "measurable partitions"
    H(D) := { (B_1, ..., B_d) partition of D | d ∈ N, B_i ∈ B(D) }
Marginals: Dirichlet distributions (for I ∈ E)
    dµ^I_{Θ_I}(θ_I | α, y_I) = p^I(θ_I | α, y_I) dλ_I(θ_I)
where p^I is the Dirichlet density and λ_I is Lebesgue measure restricted to the simplex Sim(R, I).

Slide 11/27: DP: Restriction Step

Restrict to Ω = M_+^1(D)
Check: Ω is Polish (in the weak topology; true if D is Polish).
Check: Ω has outer measure one, µ*_E(Ω) = 1 (true for the Dirichlet marginals).
Doob's theorem: there exists a measure µ̃_E on Ω with the same marginals as µ_E.
We are done: µ̃_E is our Dirichlet process, µ̃_E = DP(αG_0).

Historical note
Ferguson's (1973) construction: only the extension step.
Ghosh & Ramamoorthi (2002) note that singletons are not in B_E, and give an equivalent construction on Q.

Slide 12/27: Conjugate Posterior

Bayesian model with DP
Unobserved: G ~ DP(. | α_0 G_0) (draw from prior)
Observed: θ ~ G (sample draw)

Consider marginals
For any H = (B_1, ..., B_m) ∈ H(D): exactly one bin B_k contains θ.
Posterior update:
Prior parameters: y_0^H = (G_0(B_1), ..., G_0(B_m)) and α_0
Posterior parameters:
    y_1^H ∝ (α_0 G_0(B_1), ..., α_0 G_0(B_k) + 1, ..., α_0 G_0(B_m))
    α_1 = α_0 + 1

DP posterior
The posterior is DP(. | α_1 G_1) with G_1 ∝ α_0 G_0 + δ_θ and α_1 = α_0 + 1.
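
On a fixed partition the update above reduces to ordinary Dirichlet conjugacy. A minimal numerical sketch (the bin masses for G_0 are made-up values, not from the slides):

```python
import numpy as np

# Marginal view of the DP posterior on a fixed partition (B_1, B_2, B_3):
# the prior is Dirichlet(α_0·G_0(B_1), ..., α_0·G_0(B_m)); an observation
# θ falling in bin k adds 1 to the k-th parameter.
alpha0 = 2.0
g0 = np.array([0.2, 0.5, 0.3])     # hypothetical masses G_0(B_i)
prior_par = alpha0 * g0

k = 1                              # bin containing the observed θ
post_par = prior_par.copy()
post_par[k] += 1.0

# This matches DP(α_1 G_1): total mass α_1 = α_0 + 1,
# and posterior base measure G_1 ∝ α_0·G_0 + δ_θ.
alpha1 = post_par.sum()
g1 = post_par / alpha1
print(alpha1, g1)
```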

Slide 13/27: Sampling Under the DP Prior

Model
    θ_1, ..., θ_n ~ G
    G ~ DP(α_0 G_0)
Marginalize out G.

Sequential sampling
With the posterior formula:
    θ_1 ~ G_0
    θ_{i+1} ∝ Σ_{j=1}^{i} δ_{θ_j} + α_0 G_0
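
The sequential rule above is the Blackwell-MacQueen urn; a minimal sampler (the base measure G_0 = N(0, 1) is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def polya_urn(n, alpha, base_sampler, rng):
    """Sample θ_1..θ_n with G marginalized out: θ_{i+1} is a fresh draw
    from G_0 with probability α/(i+α), or repeats a previous value with
    probability i/(i+α) (Blackwell-MacQueen urn)."""
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_sampler(rng))          # new atom from G_0
        else:
            thetas.append(thetas[rng.integers(i)])    # reuse an old value
    return thetas

draws = polya_urn(500, alpha=3.0, base_sampler=lambda r: r.normal(), rng=rng)
n_clusters = len(set(draws))
print(n_clusters)   # ties occur: far fewer distinct values than draws
```

Because G_0 is continuous, ties can arise only through reuse, which is why the urn induces clustering.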

Slide 14/27: Discreteness

Key observation
If G ~ DP(α_0 G_0 + δ_θ), then G({θ}) > 0 almost surely.

Consequence
A draw G from a DP is discrete almost surely, so
    G = Σ_{i=1}^{∞} c_i δ_{θ_i}

Stick-breaking (Sethuraman, 1994)
Generative model for the c_i and θ_i:
    θ_i ~ G_0 i.i.d.
    c_i := Z_i ∏_{j=1}^{i-1} (1 - Z_j)   with Z_i ~ Beta(1, α_0) i.i.d.
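
The stick-breaking construction can be sketched with a truncation (the truncation level and G_0 = N(0, 1) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def stick_breaking(alpha, g0_sampler, n_atoms, rng):
    """Truncated Sethuraman construction: atoms θ_i ~ G_0 and weights
    c_i = Z_i · Π_{j<i} (1 - Z_j), with Z_i ~ Beta(1, α) i.i.d."""
    z = rng.beta(1.0, alpha, size=n_atoms)
    # Π_{j<i} (1 - Z_j): 1 for i = 1, then running products.
    weights = z * np.concatenate(([1.0], np.cumprod(1.0 - z)[:-1]))
    atoms = g0_sampler(rng, n_atoms)
    return atoms, weights

atoms, weights = stick_breaking(
    alpha=5.0, g0_sampler=lambda r, n: r.normal(size=n), n_atoms=2000, rng=rng
)
print(weights.sum())  # close to 1: the truncation leaves negligible mass
```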

Slide 15/27: Other Properties

Undominated model
The family DP(. | α_0 G_0 + δ_θ) of posteriors is not dominated.
In particular: no Bayes equation.
Implication: conjugacy is essential.

Prior support: KL property
The DP puts non-zero mass on every Kullback-Leibler ε-neighborhood of every measure in M_+^1(D).

Slide 16/27: Now: Generalizations

(Potentially) smooth models:
- Dirichlet process mixtures
- Pólya trees, tailfree processes
- Neutral-to-the-right (NTTR) and Lévy processes

Slide 17/27: DP Mixtures

Original motivation
Smooth the DP by convolution with a parametric model.

Mixture model
With q a conditional density and µ_Z a mixing distribution:
    p(x) = ∫_{Ω_z} q(x | z) dµ_Z(z)
Example: q and µ_Z Gaussian, z the variance ⇒ p is Student.

Finite mixture
    µ_Z(B) = Σ_{i=1}^{n} c_i δ_{z_i}(B)   for B ∈ B(Ω_z)

DP mixture
Draw µ_Z from a DP.
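
Sampling from a DP mixture can be sketched by drawing µ_Z via truncated stick-breaking and then drawing x from the component densities (q(x|z) = N(z, 1) and G_0 = N(0, 9) are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)

# DP mixture: draw the mixing distribution µ_Z from a DP (truncated
# stick-breaking), then sample x by picking an atom z_i with
# probability c_i and drawing x ~ q(. | z_i).
alpha, n_atoms, n_samples = 2.0, 1000, 5000
z_beta = rng.beta(1.0, alpha, size=n_atoms)
weights = z_beta * np.concatenate(([1.0], np.cumprod(1.0 - z_beta)[:-1]))
weights /= weights.sum()                      # renormalize the truncation
atoms = rng.normal(0.0, 3.0, size=n_atoms)    # z_i ~ G_0

idx = rng.choice(n_atoms, size=n_samples, p=weights)
x = rng.normal(atoms[idx], 1.0)               # x ~ q(x | z)
print(x.mean(), x.std())
```

Because µ_Z is discrete, the draw is a countable mixture of Gaussians, i.e. a smooth density even though the DP draw itself is not.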

Slide 18/27: Pólya Trees

Idea (on R)
Binary splitting of R into measurable sets.
Subdivision as a tree:
- Nodes = sets, with children = subsets
- Root = R
- One random probability per edge
- P(B) = product over the edge probabilities on the path root → B

Model components
m = level in the tree
π_m = measurable partition; π_{m+1} refines π_m
V_{m,B} = Beta random variables
α_{m,B} = Beta parameters
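
The tree construction can be sketched on the dyadic partition of [0, 1] (a minimal illustration; the symmetric Beta split and the level-only dependence of α are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def polya_tree_masses(levels, alpha_of_level, rng):
    """Random probabilities of the 2**levels dyadic subintervals of [0,1].
    At each node, a Beta variable V_{m,B} splits the node's mass between
    its two children; a set's probability is the product of the edge
    probabilities along the path from the root."""
    masses = np.array([1.0])
    for m in range(1, levels + 1):
        a = alpha_of_level(m)
        v = rng.beta(a, a, size=masses.size)       # left-child proportions
        masses = np.column_stack((masses * v, masses * (1 - v))).ravel()
    return masses

masses = polya_tree_masses(10, alpha_of_level=lambda m: m ** 2, rng=rng)
print(masses.size, masses.sum())
```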

Slide 19/27: Pólya Tree Properties

Include discrete and continuous models.

Particular expectation G_0: choose π_m such that the partition points match the quantiles of G_0.
    α_{m,B} = m^2    ⇒ draws absolutely continuous
    α_{m,B} = 2^{-m} ⇒ DP

Conjugacy
Sample observation:
    G ~ PT(π, α)
    Y ~ G
Posterior update: α_{m,B} → α_{m,B} + 1 for every B with Y ∈ B.

Slide 20/27: The Tailfree Property

Pólya trees are an example of tailfree processes.

Def: Tailfree process
Consider a process defined like a PT, but with arbitrary random variables V_{m,B}.
Tailfree: the variable sets {V_{m,B} | B ∈ π_m} are independent between tree levels.

Meaning: tailfree model
The prior makes no assumption about the shape of the tails outside the region covered by the samples.

Freedman (1968)
On countable domains, tailfree processes have consistent posteriors.

Slide 21/27: Random CDFs on R

Observation: for D = R
The DP is a normalized gamma process.
Draw a CDF as a process trajectory.
CDF properties are local:
1. initialize at 0
2. monotonicity expressible via increments

Possible choice: Lévy processes
- Easily constrained to be monotonic ("subordinators")
- Can be simulated numerically
- Best-understood class of stochastic processes
- Independence of components may be desirable
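
The "DP = normalized gamma process" observation can be sketched on a grid (a minimal illustration; G_0 uniform on [0, 1] and the grid size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Normalized gamma process on a grid: independent Gamma increments with
# shape α_0·G_0(bin) give, after normalization, Dirichlet-distributed
# bin masses -- a draw of a random CDF from the DP restricted to the grid.
alpha0, n_bins = 10.0, 200
shape = alpha0 / n_bins                      # α_0·G_0(bin), G_0 uniform
increments = rng.gamma(shape, 1.0, size=n_bins)
probs = increments / increments.sum()        # ~ Dirichlet(shape, ..., shape)
cdf = np.cumsum(probs)                       # monotone trajectory, ends at 1
print(cdf[-1])
```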

Slide 22/27: Lévy Processes

Def: Lévy process
A stochastic process with stationary, independent increments and right-continuous paths with left-hand limits (rcll).

Intuition: rcll
- Only a countable number of jumps
- Jump at x: the value at x belongs to the right-hand branch

Decomposition property
Decomposes into a Brownian motion (continuous) component and a Poisson (jump) component (⇒ simulation).

De Finetti property (Bühlmann, 1960)
For a process (X_t) with X_s ≤ X_t whenever s ≤ t:
increments exchangeable ⇔ conditionally stationary and independent.

Slide 23/27: Lévy Processes and NPB

Direct approach
To generate a CDF:
1. Draw a trajectory from an increasing process.
2. Normalize.

Dead end (James et al., 2005)
If a normalized subordinator is conjugate, it is a Dirichlet process.

Neutrality (NTTR processes)
A process is neutral to the right if the cumulative hazard function -log(1 - F) has independent increments:
    F NTTR ⇔ -log(1 - F(.)) is a Lévy process

NTTR process properties
Conjugacy (algebraic), contain the DP, a range of consistency results.

Slide 24/27: Lévy Processes and Kolmogorov Extensions

Construction of a Lévy process
1. Choose the µ_I to guarantee independent, stationary increments.
2. Apply Kolmogorov ⇒ process µ_E on Ω_E.
3. Restrict to Ω = the set of rcll functions.

Convolution semigroup
A set of measures ν_s (s ∈ R_+) with
    ν_{s+t} = ν_s * ν_t   for all s, t ∈ R_+

Lévy marginals from a convolution semigroup
For I = {t}, set µ_I := ν_t, with {ν_s}_{s ∈ R_+} a convolution semigroup.
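
The semigroup property can be checked by simulation for the gamma family, which underlies the gamma subordinator (a minimal sketch; the shape parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Convolution semigroup for the gamma family (rate 1): ν_s = Gamma(s)
# satisfies ν_s * ν_t = ν_{s+t}, so the sum of independent Gamma(s) and
# Gamma(t) samples matches Gamma(s + t); for rate 1 both mean and
# variance equal s + t.
s, t, n = 1.5, 2.5, 400_000
convolved = rng.gamma(s, 1.0, n) + rng.gamma(t, 1.0, n)   # ν_s * ν_t
direct = rng.gamma(s + t, 1.0, n)                         # ν_{s+t}
print(convolved.mean(), direct.mean(), convolved.var(), direct.var())
```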

Slide 25/27: Consistency

Consistency means: if θ_0 is the true solution, the posterior (regarded as a random measure) converges to δ_{θ_0} almost surely as n → ∞.

Two notions of consistency
- Frequentist consistency: a.s. under µ_X(X | Θ = θ_0).
- Bayesian consistency: a.s. under the evidence; only ensures consistency for θ_0 in the prior domain.

Different results
- Bayesian definition: under very general conditions, any Bayesian model is consistent (Doob, 1949).
- Frequentist definition: models like the DP can be completely wrong.

Slide 26/27: Example Results

Countable domain D (Freedman, 1968)
Freedman gives an example of a Bayes-consistent model which converges to an arbitrary prescribed distribution for any true value of θ_0.
Remedy: tailfree prior.

DP inconsistency (Diaconis & Freedman, 1986)
The DP estimate of the mean (of an unknown distribution) can be inconsistent.

Recent results
- Quantify model complexity (learning-theory style).
- Under conditions on the complexity: consistency and convergence rates.
- Controllable behavior under model misspecification.
Shen & Wasserman (1998); Ghosal, van der Vaart et al. (2002 and later)

Slide 27/27: Preview: Part III

- Constructing parameterized models from parameterized marginals
- Conjugacy
- Sufficient statistics of process models
- What are suitable families of marginals?


More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Probability Background

Probability Background CS76 Spring 0 Advanced Machine Learning robability Background Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu robability Meure A sample space Ω is the set of all possible outcomes. Elements ω Ω are called sample

More information

Nonparametric Bayesian Uncertainty Quantification

Nonparametric Bayesian Uncertainty Quantification Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017 Contents Introduction Recovery

More information

Density Modeling and Clustering Using Dirichlet Diffusion Trees

Density Modeling and Clustering Using Dirichlet Diffusion Trees p. 1/3 Density Modeling and Clustering Using Dirichlet Diffusion Trees Radford M. Neal Bayesian Statistics 7, 2003, pp. 619-629. Presenter: Ivo D. Shterev p. 2/3 Outline Motivation. Data points generation.

More information

Latent factor density regression models

Latent factor density regression models Biometrika (2010), 97, 1, pp. 1 7 C 2010 Biometrika Trust Printed in Great Britain Advance Access publication on 31 July 2010 Latent factor density regression models BY A. BHATTACHARYA, D. PATI, D.B. DUNSON

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:

More information

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Latent Dirichlet Alloca/on

Latent Dirichlet Alloca/on Latent Dirichlet Alloca/on Blei, Ng and Jordan ( 2002 ) Presented by Deepak Santhanam What is Latent Dirichlet Alloca/on? Genera/ve Model for collec/ons of discrete data Data generated by parameters which

More information

Order-q stochastic processes. Bayesian nonparametric applications

Order-q stochastic processes. Bayesian nonparametric applications Order-q dependent stochastic processes in Bayesian nonparametric applications Department of Statistics, ITAM, Mexico BNP 2015, Raleigh, NC, USA 25 June, 2015 Contents Order-1 process Application in survival

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Construction of Dependent Dirichlet Processes based on Poisson Processes

Construction of Dependent Dirichlet Processes based on Poisson Processes 1 / 31 Construction of Dependent Dirichlet Processes based on Poisson Processes Dahua Lin Eric Grimson John Fisher CSAIL MIT NIPS 2010 Outstanding Student Paper Award Presented by Shouyuan Chen Outline

More information

Review of Probabilities and Basic Statistics

Review of Probabilities and Basic Statistics Alex Smola Barnabas Poczos TA: Ina Fiterau 4 th year PhD student MLD Review of Probabilities and Basic Statistics 10-701 Recitations 1/25/2013 Recitation 1: Statistics Intro 1 Overview Introduction to

More information

arxiv: v1 [stat.ml] 20 Nov 2012

arxiv: v1 [stat.ml] 20 Nov 2012 A survey of non-exchangeable priors for Bayesian nonparametric models arxiv:1211.4798v1 [stat.ml] 20 Nov 2012 Nicholas J. Foti 1 and Sinead Williamson 2 1 Department of Computer Science, Dartmouth College

More information

Bayes spaces: use of improper priors and distances between densities

Bayes spaces: use of improper priors and distances between densities Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

Outline. Introduction to Bayesian nonparametrics. Notation and discrete example. Books on Bayesian nonparametrics

Outline. Introduction to Bayesian nonparametrics. Notation and discrete example. Books on Bayesian nonparametrics Outline Introduction to Bayesian nonparametrics Andriy Norets Department of Economics, Princeton University September, 200 Dirichlet Process (DP) - prior on the space of probability measures. Some applications

More information

ICES REPORT Model Misspecification and Plausibility

ICES REPORT Model Misspecification and Plausibility ICES REPORT 14-21 August 2014 Model Misspecification and Plausibility by Kathryn Farrell and J. Tinsley Odena The Institute for Computational Engineering and Sciences The University of Texas at Austin

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

More information