Foundations of Nonparametric Bayesian Methods


1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html

2 / 27 Tutorial Overview Part I: Basics Part II: Models on the simplex 1. Bayesian models and de Finetti 2. DP construction 3. Generalization examples: Pólya trees, neutral processes 4. Consistency of NP Bayesian estimates Part III: Construction of new models

3 / 27 Introduction: NP Bayes Motivation Machine learning Roughly: Flexible models that adapt well to data. Bayesian statistics Search for universal priors. Bayesian modeling Assumption: Data i.i.d. from unknown model. (Essential!) De Finetti: Justified for exchangeable data. Theorem of de Finetti (in part) prompted search for NP priors. Exchangeable sequence Probability of data $X_1, X_2, \ldots$ invariant under permutation of sampling order.

4 / 27 De Finetti's Theorem (1) Theorem (De Finetti, 1937; Hewitt & Savage, 1955) On a Polish space $(D, \mathcal{B}(D))$, RVs $X_1, X_2, \ldots$ are exchangeable iff there is a random probability measure $F$ on $D$ such that $X_1, X_2, \ldots$ are conditionally i.i.d. with distribution $F$. Additionally: 1. $X_i$ exchangeable $\Rightarrow$ distribution $G$ of $F$ uniquely determined. 2. For $B \in \mathcal{B}(D)$: $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_B(X_i) = F(B)$ a.s. That is: the empirical distribution converges to $F$ almost surely.

5 / 27 De Finetti's Theorem (2) Interpretation If $X_i$ exchangeable, then: 1. There is an RV $F$ with values $f \in M_+^1(D)$ 2. Distribution: $G = F(P)$ (the law of $F$) 3. $X_i$ are generated as: $f \sim G$ and $X_i \overset{\text{iid}}{\sim} f$ Consequence For $A^\infty := \prod_{i=1}^{\infty} A_i$: $\mu_X(A^\infty) = \int_{M_+^1(D)} \Big( \prod_{i=1}^{\infty} f(A_i) \Big) \, dG(f)$
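
The two-stage sampling in the interpretation above can be made concrete with a small simulation. The following is a minimal sketch, assuming a toy choice where $G$ is a Beta(2, 2) law over Bernoulli parameters (so a draw $f$ is a Bernoulli distribution); it only illustrates "draw $f \sim G$, then $X_i$ i.i.d. from $f$" and the almost-sure convergence of the empirical distribution.

```python
# Illustrative sketch only: G = Beta(2, 2) over Bernoulli parameters plays the
# role of the de Finetti mixing distribution; all names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def exchangeable_sequence(n, a=2.0, b=2.0):
    """Draw f ~ G (here: p ~ Beta(a, b)), then X_1, ..., X_n i.i.d. from f."""
    p = rng.beta(a, b)                 # f ~ G
    x = rng.binomial(1, p, size=n)     # X_i | f   i.i.d.
    return p, x

p, x = exchangeable_sequence(10_000)
# LLN part of the theorem: empirical frequency of {1} is close to F({1}) = p
print(p, x.mean())
```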

6 / 27 Bayes and de Finetti NP Bayesian density estimation 1. Start with a prior assumption (e.g. a DP) for $G$. 2. Update prior given data. 3. Hopefully: Posterior will concentrate at true $f$. (We are not trying to estimate $G$!) Requirements 1. Prior sufficiently spread over $M_+^1(D)$ 2. Updating prior on $F$ given draws from $F$ must be possible Terminology: Conjugacy Algebraic: Prior in $G$, likelihood in $F$ $\Rightarrow$ posterior in $G$ Functional: Prior, posterior in same parametric class Measurable map (prior parameters, data) $\to$ posterior parameters

7 / 27 Prior Domains Note Parametric priors fit perfectly into the de Finetti framework. Why not use a parametric prior on $F$? Example (Draper, 1999) 1. Start with a Gaussian model, prior on $\mu, \sigma$. 2. Data comes in: bimodal. 3. Two choices: Keep model $\Rightarrow$ poor fit; Change to bimodal model $\Rightarrow$ confounded (data used twice) Prior domain requirement Prior should support sufficiently diverse models to roughly fit arbitrary data.

8 / 27 Process Models: Divergent Aims NP Bayesian statistics Prior models with large support in $M_+^1(D)$ DP: Concentrates on discrete distributions in $M_+^1(D)$ Principal motivation of DP generalizations: Overcome discreteness Machine Learning Principal application: Clustering and mixtures Discreteness required

9 / 27 Construction Approaches Kolmogorov Constructions Construct process directly from marginals. Generalization of DP Consider all processes with a particular property of the DP: Partitioning: Tailfree processes and Pólya trees Random CDFs: Neutral-to-the-right (NTTR) and Lévy processes Related: DP mixtures (smoothed DPs) De Finetti-based Derive class of possible $G$ from knowledge about $X_i$. Example: $X_i$ follow generalized Pólya urn scheme $\Rightarrow$ $G$ is a DP

10 / 27 Dirichlet Process Construction Recall: Two-stage construction 1. Extend marginals to a process on $(\Omega^E, \mathcal{B}^E)$ (Kolmogorov) 2. Restrict process to a subset $\Omega \subset \Omega^E$ (Doob) Kolmogorov's theorem: Dirichlet process Axes $\Omega_0 := [0, 1]$ Index set $E = H(D)$, the "measurable partitions": $H(D) := \{ (B_1, \ldots, B_d) \text{ partition of } D,\ d \in \mathbb{N},\ B_i \in \mathcal{B}(D) \}$ Marginals: Dirichlet distributions (for $I \in E$) $d\mu^I_{\Theta^I \mid \alpha, y^I}(\theta^I) = p^I(\theta^I \mid \alpha, y^I) \, d\lambda^I(\theta^I)$ where: $p^I$ = Dirichlet density, $\lambda^I$ = Lebesgue measure restricted to the simplex in $\mathbb{R}^{|I|}$

11 / 27 DP: Restriction Step Restrict to $\Omega = M_+^1(D)$ Check: $\Omega$ Polish (in the weak topology; true if $D$ Polish) Check: outer measure $\mu^{E,*}(\Omega) = 1$ (true for Dirichlet marginals) Doob's theorem: There exists $\tilde{\mu}^E$ on $\Omega$ with the same marginals as $\mu^E$. We are done: $\tilde{\mu}^E$ is our Dirichlet process, $\tilde{\mu}^E = \mathrm{DP}(\alpha G_0)$ Historical Note Ferguson's (1973) construction: Only the extension step Ghosh & Ramamoorthi (2002) note that singletons are not in $\mathcal{B}^E$ and give an equivalent construction on $\mathbb{Q}$

12 / 27 Conjugate Posterior Bayesian Model with DP Unobserved: $G \sim \mathrm{DP}(\cdot \mid \alpha_0 G_0)$ (draw from prior) Observed: $\theta \sim G$ (sample draw) Consider Marginals For any $H = (B_1, \ldots, B_m) \in H(D)$: Exactly one bin $B_k$ contains $\theta$ Posterior update: Prior parameters: $y^H_0 = (G_0(B_1), \ldots, G_0(B_m))$ and $\alpha_0$ Posterior: $y^H_1 \propto (G_0(B_1), \ldots, G_0(B_k) + \tfrac{1}{\alpha_0}, \ldots, G_0(B_m))$ and $\alpha_1 = \alpha_0 + 1$ DP posterior Posterior is $\mathrm{DP}(\cdot \mid \alpha_1 G_1)$ with $G_1 \propto \alpha_0 G_0 + \delta_\theta$ and $\alpha_1 = \alpha_0 + 1$
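
As a concrete illustration of the marginal update, here is a minimal sketch, assuming $D = [0, 1]$, a uniform base measure $G_0$, and a fixed four-bin partition $H$ (all illustrative choices, not from the slides): the Dirichlet parameters on $H$ gain a count of 1 in the bin containing $\theta$, which is exactly $(\alpha_0 G_0 + \delta_\theta)/(\alpha_0 + 1)$ evaluated on $H$.

```python
# Hypothetical example: DP marginal (Dirichlet) posterior update on a fixed
# partition H of D = [0, 1] with uniform G_0 and concentration alpha0.
import numpy as np

alpha0 = 5.0
edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])        # partition H = (B_1, ..., B_4)
G0_H = np.diff(edges)                                 # G_0(B_1), ..., G_0(B_4)

theta = 0.6                                           # one observed draw theta ~ G
k = np.searchsorted(edges, theta, side="right") - 1   # index of the bin containing theta

post = alpha0 * G0_H                                  # Dirichlet parameters alpha0 * y_0^H
post[k] += 1.0                                        # add one count in bin B_k

alpha1 = alpha0 + 1.0
G1_H = post / alpha1                                  # = (alpha0*G_0 + delta_theta)(B_j) / alpha1
print(G1_H, alpha1)
```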

13 / 27 Sampling Under DP Prior Model $\theta_1, \ldots, \theta_n \sim G$, $G \sim \mathrm{DP}(\alpha_0 G_0)$ Marginalize out $G$ Sequential Sampling With posterior formula: $\theta_1 \sim G_0$, $\theta_{i+1} \sim \frac{1}{i + \alpha_0} \Big( \sum_{j=1}^{i} \delta_{\theta_j} + \alpha_0 G_0 \Big)$
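
A minimal sketch of this sequential (Pólya urn / Chinese restaurant) sampling scheme, with $G$ marginalized out; taking $G_0$ to be a standard normal is an illustrative assumption.

```python
# Sketch of Polya-urn sampling from the DP(alpha0 * G_0) predictive, with
# G marginalized out; G_0 = N(0, 1) is an illustrative choice.
import numpy as np

rng = np.random.default_rng(1)

def polya_urn(n, alpha0=1.0):
    thetas = [rng.normal()]                         # theta_1 ~ G_0
    for i in range(1, n):
        if rng.random() < alpha0 / (alpha0 + i):    # fresh draw from G_0 w.p. alpha0/(alpha0+i)
            thetas.append(rng.normal())
        else:                                       # otherwise repeat a past value uniformly
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

theta = polya_urn(100)
print(np.unique(theta).size)                        # number of distinct values, typically << 100
```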

14 / 27 Discreteness Key Observation If $G \sim \mathrm{DP}(\alpha_0 G_0 + \delta_\theta)$, then $G(\{\theta\}) > 0$ a.s. Consequence A draw $G$ from a DP is discrete almost surely, so $G = \sum_{i=1}^{\infty} c_i \delta_{\theta_i}$ Stick-breaking (Sethuraman, 1994) Generative model for $c_i$ and $\theta_i$: $\theta_i \overset{\text{iid}}{\sim} G_0$, $c_i := \Big( \prod_{j=1}^{i-1} (1 - Z_j) \Big) Z_i$ with $Z_i \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha_0)$
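
The stick-breaking representation translates directly into a sampler for (approximate) draws of $G$; the sketch below truncates the infinite sum at $K$ atoms, and the choices $K = 1000$ and $G_0 = N(0, 1)$ are illustrative assumptions.

```python
# Sketch of a truncated stick-breaking draw G = sum_i c_i * delta_{theta_i}.
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking(alpha0=1.0, K=1000):
    """Return weights c_i and atoms theta_i of a truncated DP(alpha0 G_0) draw."""
    Z = rng.beta(1.0, alpha0, size=K)                            # Z_i ~ Beta(1, alpha0)
    c = Z * np.concatenate(([1.0], np.cumprod(1.0 - Z)[:-1]))    # c_i = Z_i * prod_{j<i} (1 - Z_j)
    theta = rng.normal(size=K)                                   # theta_i ~ G_0 = N(0, 1)
    return c, theta

c, theta = stick_breaking()
print(c.sum())    # close to 1 for large K
```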

15 / 27 Other Properties Undominated Model Family $\mathrm{DP}(\cdot \mid \alpha_0 G_0 + \delta_\theta)$ of posteriors is not dominated In particular: No Bayes equation Implication: Conjugacy essential Prior support: KL property DP puts non-zero mass on every Kullback-Leibler $\varepsilon$-neighborhood of every measure in $M_+^1(D)$.

16 / 27 Now: Generalizations (Potentially) Smooth models Dirichlet process mixtures Pólya trees, tailfree processes Neutral-to-the-right (NTTR) and Lévy processes

17 / 27 DP mixtures Original motivation Smooth the DP by convolution with a parametric model. Mixture model $q$ conditional density, $\mu_Z$ mixing distribution $p(x) = \int_{\Omega_z} q(x \mid z) \, d\mu_Z(z)$ Example: $q$ Gaussian with variance $z$; for a suitable $\mu_Z$, $p$ is Student-t Finite mixture $\mu_Z(B) = \sum_{i=1}^{n} c_i \delta_{z_i}(B)$ for $B \in \mathcal{B}(\Omega_z)$ DP mixture Draw $\mu_Z$ from a DP.
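
A minimal sketch of sampling data from a DP mixture: draw the mixing distribution $\mu_Z$ via truncated stick-breaking and convolve with a Gaussian kernel $q(x \mid z) = N(x; z, 1)$, here mixing over the mean rather than the variance; the truncation level, $\alpha_0 = 2$, and $G_0 = N(0, 3^2)$ are illustrative assumptions.

```python
# Sketch: sample from a DP mixture of Gaussians (mixing over the mean z).
import numpy as np

rng = np.random.default_rng(3)

def dp_mixture_sample(n, alpha0=2.0, K=500):
    Z = rng.beta(1.0, alpha0, size=K)
    c = Z * np.concatenate(([1.0], np.cumprod(1.0 - Z)[:-1]))    # weights of mu_Z
    z = rng.normal(0.0, 3.0, size=K)                             # atoms z_i ~ G_0 = N(0, 9)
    comp = rng.choice(K, size=n, p=c / c.sum())                  # z_i drawn from mu_Z
    return rng.normal(z[comp], 1.0)                              # x_i ~ q(. | z_i) = N(z_i, 1)

x = dp_mixture_sample(2000)
print(x.mean(), x.std())
```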

18 / 27 Pólya Trees Idea (on $\mathbb{R}$) Binary splitting of $\mathbb{R}$ into measurable sets. Subdivision as tree: Nodes = sets, Children = subsets, Root = $\mathbb{R}$ One random probability per edge $P(B)$ = product over probabilities on the path Root $\to B$ Model components $m$ = level in tree $\pi_m$ = measurable partition, $\pi_{m+1}$ refines $\pi_m$ $V_{m,B}$ = Beta RVs $\alpha_{m,B}$ = Beta parameters

19 / 27 Pólya Tree Properties Include discrete and continuous models Particular expectation $G_0$: Choose $\pi_m$ such that the values $G_0(B)$ match quantiles of $G_0$ $\alpha_{m,B} = m^2$ $\Rightarrow$ draws absolutely continuous $\alpha_{m,B} \propto 2^{-m}$ $\Rightarrow$ DP Conjugacy Sample observation: $G \sim \mathrm{PT}(\pi, \alpha)$, $Y \sim G$ Posterior update: $\alpha_{m,B} \to \alpha_{m,B} + 1$ for $B$ with $Y \in B$.
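
To make the construction concrete, here is a minimal sketch of a Pólya tree draw on $[0, 1]$ with dyadic partitions and $\alpha_{m,B} = m^2$ (the absolutely continuous regime above); all numerical choices are illustrative.

```python
# Sketch: one Polya tree draw on [0, 1], dyadic partitions, alpha_{m,B} = m^2.
import numpy as np

rng = np.random.default_rng(4)

def polya_tree_draw(M=10):
    """Return probabilities G(B) of the 2**M dyadic intervals at level M."""
    probs = np.array([1.0])                              # level 0: P([0, 1]) = 1
    for m in range(1, M + 1):
        a = float(m ** 2)                                # alpha_{m,B} = m^2
        V = rng.beta(a, a, size=probs.size)              # one Beta split per parent set
        left, right = probs * V, probs * (1.0 - V)
        probs = np.column_stack((left, right)).ravel()   # children in left-to-right order
    return probs

G = polya_tree_draw()
print(G.sum())    # = 1; G(B)/|B| gives a piecewise-constant density on [0, 1]
```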

20 / 27 The Tailfree Property Pólya trees are an example of tailfree processes. Def: Tailfree process Consider a process defined like the PT, but with arbitrary RVs $V_{m,B}$. Tailfree: If the RV sets $\{ V_{m,B} \mid B \in \pi_m \}$ are independent between tree levels. Meaning: Tailfree model Prior makes no assumption about the shape of the tails outside the region covered by samples. Freedman (1968) Countable domains: Tailfree processes have consistent posteriors.

21 / 27 Random CDFs on $\mathbb{R}$ Observation: For $D = \mathbb{R}$ DP: normalized gamma process. Draw CDF as a process trajectory. CDF properties are local: 1. initialize at 0 2. monotonicity expressible via increments [Figure: sample trajectory of a random CDF] Possible choice: Lévy processes Easily chosen as monotonic ("subordinators") Can be simulated numerically Best-understood class of stochastic processes Independence of components may be desirable
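
The "random CDF as a normalized increasing trajectory" picture can be sketched numerically: simulate a gamma process on a grid through independent gamma increments and normalize; on the grid this reproduces the Dirichlet marginals of a DP draw. The grid resolution and the base measure $\alpha_0 G_0 = 5 \cdot \mathrm{Uniform}[0, 5]$ are illustrative assumptions.

```python
# Sketch: a DP draw as a normalized gamma process trajectory on a grid.
import numpy as np

rng = np.random.default_rng(5)

alpha0, T, n_grid = 5.0, 5.0, 1000
t = np.linspace(0.0, T, n_grid + 1)
# independent increments with shape alpha0 * G_0((t_i, t_{i+1}]), G_0 = Uniform[0, T]
incr = rng.gamma(shape=alpha0 * np.diff(t) / T, scale=1.0)
path = np.concatenate(([0.0], np.cumsum(incr)))      # increasing gamma-process trajectory
cdf = path / path[-1]                                 # normalize: a random CDF on [0, T]
print(cdf[0], cdf[-1])                                # 0.0 and 1.0
```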

22 / 27 Lévy Processes Def: Lévy process Stochastic process with stationary, independent increments and right-continuous paths with left-hand limits (rcll). Intuition: rcll Only a countable number of jumps Jump at $x$: value at $x$ belongs to the right-hand branch Decomposition property Decomposes into a Brownian motion (continuous) and a Poisson (jump) component ($\Rightarrow$ simulation). De Finetti property (Bühlmann, 1960) Process $(X_t)$ with $X_s \leq X_t$ whenever $s \leq t$. Increments: exchangeable $\Leftrightarrow$ conditionally stationary and independent

23 / 27 Lévy Processes and NPB Direct approach To generate a CDF: 1. Draw trajectory from an increasing process 2. Normalize Dead end (James et al., 2005) If a normalized subordinator is conjugate, it is a Dirichlet process. Neutrality (NTTR processes) A process is neutral to the right if its cumulative hazard function has independent increments: $F$ NTTR $\Leftrightarrow$ $-\log(1 - F(\cdot))$ has independent increments (Lévy-type) NTTR process properties Conjugacy (algebraic), contain the DP, range of consistency results

24 / 27 Lévy Processes and Kolmogorov Extensions Construction of a Lévy process Choose $\mu^I$ to guarantee independent, stationary increments Apply Kolmogorov $\Rightarrow$ process $\mu^E$ on $\Omega^E$ Restrict to $\Omega$ = set of rcll functions Convolution Semigroup Set of measures $\nu_s$ ($s \in \mathbb{R}_+$) with $\nu_{s+t} = \nu_s * \nu_t$ for all $s, t \in \mathbb{R}_+$ Lévy marginals from a convolution semigroup For $I = \{t\}$ set $\mu^I := \nu_t$ with $\{\nu_s\}_{s \in \mathbb{R}_+}$ a convolution semigroup.
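
As a small numerical sanity check of the semigroup condition (an illustrative example, not from the slides): the Gamma family $\nu_s = \mathrm{Gamma}(\text{shape} = s, \text{scale} = 1)$ satisfies $\nu_s * \nu_t = \nu_{s+t}$, and thus supplies valid marginals for the gamma process used earlier.

```python
# Monte Carlo check that Gamma(shape=s, scale=1) forms a convolution semigroup.
import numpy as np

rng = np.random.default_rng(6)
s, t, N = 0.7, 1.8, 200_000

x = rng.gamma(s, 1.0, N) + rng.gamma(t, 1.0, N)    # samples from nu_s * nu_t
y = rng.gamma(s + t, 1.0, N)                       # samples from nu_{s+t}

# both should have mean ~ s + t and variance ~ s + t
print(x.mean(), y.mean(), x.var(), y.var())
```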

25 / 27 Consistency Consistency means: If $\theta_0$ is the true solution, the posterior (regarded as a random measure) converges to $\delta_{\theta_0}$ a.s. as $n \to \infty$. Two notions of consistency Frequentist consistency: a.s. under $\mu_X(\cdot \mid \Theta = \theta_0)$. Bayesian consistency: a.s. under the evidence $\Rightarrow$ only ensures consistency for $\theta_0$ in the prior domain Different results Bayesian definition: Under very general conditions, any Bayesian model is consistent (Doob, 1949) Frequentist definition: Models like the DP can be completely wrong.

26 / 27 Example Results Countable domain $D$ (Freedman, 1968) Freedman gives an example of a Bayes-consistent model which converges to an arbitrary prescribed distribution for any true value of $\theta_0$. Remedy: Tailfree prior. DP inconsistency (Diaconis & Freedman, 1986) DP estimate of the mean (of an unknown distribution) can be inconsistent. Recent results Quantify model complexity (learning-theory style) Under conditions on complexity: Consistency and convergence rates Controllable behavior under model misspecification Shen & Wasserman (1998); Ghosal, van der Vaart et al. (2002 and later)

27 / 27 Preview Part III Constructing parameterized models from parameterized marginals Conjugacy Sufficient statistics of process models What are suitable families of marginals?