Properties of Bayesian nonparametric models and priors over trees

Properties of Bayesian nonparametric models and priors over trees. David A. Knowles, Computer Science Department, Stanford University. July 24, 2013.

Introduction. Theory: what characteristics might we want? Exchangeability; projectivity/consistency; large support; de Finetti's representation theorem. Applications: distributions over trees, namely the nested Chinese restaurant process and the Dirichlet diffusion tree, Kingman's coalescent, and, if we have time, fragmentation-coagulation processes.

Exchangeability. Intuition: data = pattern + noise. Consider data $X_1, X_2, \ldots$ where $X_i \in \mathcal{X}$. A very strong assumption we could make: the $X_i$ are i.i.d. (in the Bayesian paradigm this corresponds to data = noise). An often reasonable assumption: the $X_i$ are exchangeable. $X_1, X_2, \ldots$ are exchangeable if $P(X_1, X_2, \ldots)$ is invariant under any finite permutation $\sigma$, i.e. $P(X_1, X_2, \ldots) = P(X_{\sigma(1)}, X_{\sigma(2)}, \ldots)$. We can still have dependence! Intuition: the order of the observations doesn't matter.

Exchangeability: the CRP. Recall the CRP predictive probability:
$$P(c_n = k \mid c_1, \ldots, c_{n-1}) = \begin{cases} \frac{a_{k,n-1}}{n-1+\theta} & k \in \{1, \ldots, K_{n-1}\} \\ \frac{\theta}{n-1+\theta} & k = K_{n-1} + 1 \end{cases}$$
where $K_n$ is the number of blocks after assigning $c_n$ and $a_{k,n} = \sum_{i=1}^n \mathbb{I}[c_i = k]$. It is easy to show that
$$P(c_1, \ldots, c_n) = \frac{\theta^{K-1} \prod_{k=1}^K 1 \cdot 2 \cdots (a_{k,n} - 1)}{(\theta+1) \cdots (\theta+n-1)} = \frac{\Gamma(\theta)\, \theta^K \prod_{k=1}^K \Gamma(a_{k,n})}{\Gamma(\theta+n)}$$
(with $K = K_n$) using $\Gamma(a+1) = a\Gamma(a)$. This depends only on the sizes and number of blocks, and on $n$. No dependence on the order: exchangeable! (But the $c_i$ are still dependent.)
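
Below is a minimal sketch (plain Python; the helper names crp_sample and crp_log_prob are ours) of sampling from a CRP and evaluating the joint probability above via its Gamma-function form. It illustrates the point of the slide: the probability depends only on the block sizes, so permuting the observations leaves it unchanged.

```python
import math
import random
from collections import Counter

def crp_sample(n, theta, rng=random.Random(0)):
    """Sample table assignments c_1..c_n from a CRP with concentration theta."""
    counts, assignments = [], []
    for _ in range(n):
        # existing table k is chosen with weight a_k, a new table with weight theta
        weights = counts + [theta]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

def crp_log_prob(assignments, theta):
    """log P(c_1..c_n) = log[ Gamma(theta) theta^K prod_k Gamma(a_k) / Gamma(theta + n) ]."""
    n, sizes = len(assignments), Counter(assignments).values()
    return (math.lgamma(theta) + len(sizes) * math.log(theta)
            + sum(math.lgamma(a) for a in sizes) - math.lgamma(theta + n))

c = crp_sample(10, theta=1.5)
shuffled = random.sample(c, len(c))                        # permute the observations
print(crp_log_prob(c, 1.5), crp_log_prob(shuffled, 1.5))   # equal: only block sizes matter
```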

Exchangeability: breaking the CRP. I want more reinforcement! Let's square those counts:
$$P(c_n = k \mid c_1, \ldots, c_{n-1}) = \begin{cases} \frac{a_{k,n-1}^2}{\sum_{i=1}^{K_{n-1}} a_{i,n-1}^2 + \theta} & k \in \{1, \ldots, K_{n-1}\} \\ \frac{\theta}{\sum_{i=1}^{K_{n-1}} a_{i,n-1}^2 + \theta} & k = K_{n-1} + 1 \end{cases}$$
But no longer exchangeable :(
$$P(c_1, \ldots, c_n) = \frac{\theta^{K-1} \prod_{k=1}^K 1^2 \cdot 2^2 \cdots (a_{k,n} - 1)^2}{(\theta + 1)\left(\theta + \sum_{i=1}^{K_2} a_{i,2}^2\right) \cdots \left(\theta + \sum_{i=1}^{K_{n-1}} a_{i,n-1}^2\right)}$$
The normalising constants in the denominator now depend on the order in which the blocks grew, not just on the final block sizes.
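
A quick numerical check of the non-exchangeability, again as a hedged sketch (the helper seq_log_prob is hypothetical): evaluate the probability of two label sequences that induce the same partition, under the usual CRP weights and under the squared-count weights.

```python
import math

def seq_log_prob(labels, theta, reinforce=lambda a: a):
    """log-probability of a label sequence under a CRP-like scheme in which an
    existing block of size a attracts weight reinforce(a): a for the CRP,
    a**2 for the 'squared' variant above."""
    counts, logp = {}, 0.0
    for c in labels:
        weights = {k: reinforce(a) for k, a in counts.items()}
        z = sum(weights.values()) + theta
        logp += math.log(weights.get(c, theta) / z)   # an unseen label opens a new block
        counts[c] = counts.get(c, 0) + 1
    return logp

theta = 1.0
a = [0, 0, 1]   # block of size 2 formed first, then a singleton
b = [0, 1, 0]   # same block sizes, different order of arrival
print(seq_log_prob(a, theta), seq_log_prob(b, theta))    # equal: the CRP is exchangeable
print(seq_log_prob(a, theta, lambda x: x ** 2),
      seq_log_prob(b, theta, lambda x: x ** 2))          # differ: squaring breaks it
```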

Projectivity. This is a consistency property. Consider the finite sequence of random variables $X_1, X_2, \ldots, X_N$ with law $P_N$. Projectivity requires that
$$P_{N-1}(X_1, \ldots, X_{N-1}) = \int P_N(X_1, X_2, \ldots, X_N)\, dX_N$$
Priors with a sequential generative process get this for free (e.g. the CRP):
$$P_N(X_1, X_2, \ldots, X_N) = P_{N-1}(X_1, \ldots, X_{N-1})\, P(X_N \mid X_1, \ldots, X_{N-1})$$

Projectivity: a counterexample. The uniform distribution on partitions is exchangeable but not projective. The five partitions of {1, 2, 3} each receive probability 1/5:
1/5  {{1, 2, 3}}
1/5  {{1, 2}, {3}}
1/5  {{1}, {2, 3}}
1/5  {{1, 3}, {2}}
1/5  {{1}, {2}, {3}}
Marginalising out element 3 gives {{1, 2}} with probability 2/5 and {{1}, {2}} with probability 3/5, which is not the uniform distribution on the partitions of {1, 2}.
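
To make the marginalisation explicit, here is a small self-contained sketch (hypothetical helper names): enumerate the partitions of {1, 2, 3}, put mass 1/5 on each, and restrict to {1, 2}.

```python
from collections import Counter
from fractions import Fraction

def partitions(items):
    """Enumerate all set partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):                 # put `first` into an existing block...
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p                     # ...or into a new block of its own

def restrict(partition, keep):
    """Drop elements outside `keep`, discarding any blocks that become empty."""
    return frozenset(b for b in (frozenset(block) & keep for block in partition) if b)

parts3 = list(partitions([1, 2, 3]))            # the 5 partitions, each with probability 1/5
marginal = Counter()
for p in parts3:
    marginal[restrict(p, {1, 2})] += Fraction(1, len(parts3))
for part, prob in marginal.items():
    print(sorted(sorted(b) for b in part), prob)
# [[1, 2]] 2/5 and [[1], [2]] 3/5: not the uniform distribution on partitions of {1, 2}
```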

de Finetti's theorem. Kolmogorov extension theorem: if the $P_N$ are projective then there exists a limiting distribution $P$ on the infinite sequence $X_1, X_2, \ldots$ whose finite-dimensional marginals are given by $P_N$. de Finetti's theorem: in that case, if $X_1, X_2, \ldots$ are exchangeable then there exists $Q$ such that
$$P(X_1, X_2, \ldots) = \int_{M(\mathcal{X})} \prod_i \mu(X_i)\, Q(d\mu)$$
where $M(\mathcal{X})$ is the space of probability measures on $\mathcal{X}$ and $\mu \in M(\mathcal{X})$ is a random probability measure. As a graphical model: $\mu \sim Q$ and $X_i \mid \mu \overset{\mathrm{iid}}{\sim} \mu$. In words: any exchangeable sequence of r.v.s can be represented as (formally, is equal in distribution to) a mixture of i.i.d. r.v.s.

de Finetti's theorem: notes. $Q$ is a "distribution over distributions": sometimes called the de Finetti mixing measure. The theorem is not a constructive proof: sometimes we can characterise $Q$ nicely (e.g. the Dirichlet process for the CRP, the beta process for the Indian buffet process), sometimes not (e.g. the tree priors presented later). It gives a decomposition into pattern ($\mu$) and noise, and motivates Bayesian hierarchical models: $Q$ is the prior. $\mu$ is infinite-dimensional in general but might be finite-dimensional in some cases (e.g. the proportion of ones for a binary sequence). This is just one example of a family of "ergodic decomposition theorems": group invariance corresponds to an integral decomposition (e.g. the Aldous-Hoover theorem for random graphs or arrays). Can get more formal: $\mu$ is determined by the limiting empirical distribution of the data, or by the tail $\sigma$-algebra of the sequence $X_i$.
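
A tiny illustration for the binary case (our own example, assuming the standard Beta-Bernoulli setup): the Polya urn scheme generates an exchangeable 0/1 sequence, and the limiting fraction of ones in each run is a draw of $\mu$ from the de Finetti mixing measure $Q = \mathrm{Beta}(a, b)$; the run-to-run variation is the "pattern", the within-run variation is the "noise".

```python
import random

def polya_urn_sequence(n, a=1.0, b=1.0, rng=random.Random(0)):
    """Exchangeable binary sequence from the Polya urn scheme; its de Finetti
    mixing measure is Beta(a, b) over the coin bias mu."""
    ones, zeros, seq = a, b, []
    for _ in range(n):
        x = 1 if rng.random() < ones / (ones + zeros) else 0
        seq.append(x)
        ones += x
        zeros += 1 - x
    return seq

# The empirical frequency of ones settles to a different value in each run:
# these are (approximate) draws of mu ~ Beta(1, 1).
print([sum(polya_urn_sequence(2000, rng=random.Random(s))) / 2000 for s in range(5)])
```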

Support and consistency. The space of probability measures $M(\mathcal{X})$ might be finite-dimensional (e.g. for $\mathcal{X} = \{1, 2\}$) or infinite-dimensional (e.g. for $\mathcal{X} = \mathbb{N}$ or $\mathbb{R}$). Nonparametric models can have support (non-zero probability mass) on infinite-dimensional subsets of $M(\mathcal{X})$, whereas a parametric model can only put support on a finite-dimensional subspace. [Figure from Teh and Orbanz (2011): the model shown as a subset of $M(\mathcal{X})$; the "truth" $P_0$ may lie inside the model, $P_0 = P_{\theta_0}$, or outside it, in which case the model is misspecified.]

Consistency. Under mild conditions (e.g. identifiability), Bayesian models exhibit weak consistency: in the infinite-data limit the posterior converges to a point mass at the correct parameter value ($\mu$), assuming this was sampled from the prior (this says nothing about model misspecification or approximate inference). Frequentist consistency (consistency for any true $P_0 \in M(\mathcal{X})$ in some class) is more difficult to ensure: some cases are known, e.g. Dirichlet process mixtures of diagonal-covariance Gaussians for smooth densities. Convergence rates are also an active area of research: smooth parametric models typically achieve the rate $n^{-1/2}$.

Break. Next up: priors over trees.

Motivation. True hierarchies; parameter tying; visualisation and interpretability. [Figure: an example hierarchy over animals, with leaves such as camel, cow, chimp, dolphin, salmon, penguin, ant, and alligator.]

Trees. For statistical applications we are usually interested in trees with labelled leaves. Mathematically we can think of such a tree as a hierarchical partition or as an acyclic graph. Usually we distinguish the root node. [Figure: the tree corresponding to the hierarchical partition {{{1, 2, 3}, {4, 5}}, {{6, 7}}}, with leaves 1, 2, ..., 7.]

Nested CRP. A distribution over hierarchical partitions. Denote the $K$ blocks in the first level by $\{B^1_k : k = 1, \ldots, K\}$. Partition these blocks with independent CRPs, and denote the partition of $B^1_k$ by $\{B^2_{kl} : l = 1, \ldots, K_k\}$. Recurse for $S$ iterations, forming an $S$-deep hierarchy. [Figure: a two-level example with first-level blocks $B^1_1, B^1_2$ and second-level blocks $B^2_{11}, B^2_{12}, B^2_{21}$ over leaves 1-7.] Used to define a hierarchical topic model; see Blei et al. (2010).
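
A short sketch of sampling from a finite-depth nested CRP (plain Python, hypothetical helper names; the hierarchical topic model of Blei et al. additionally attaches topics to the nodes, which is omitted here).

```python
import random

def crp_partition(items, theta, rng):
    """Partition `items` with a single CRP draw; returns a list of blocks."""
    blocks = []
    for x in items:
        weights = [len(b) for b in blocks] + [theta]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(blocks):
            blocks.append([x])
        else:
            blocks[k].append(x)
    return blocks

def nested_crp(items, theta, depth, rng=random.Random(0)):
    """S-deep nested CRP: recursively partition every block with a fresh CRP."""
    if depth == 0 or len(items) <= 1:
        return items
    return [nested_crp(block, theta, depth - 1, rng)
            for block in crp_partition(items, theta, rng)]

print(nested_crp(list(range(1, 8)), theta=1.0, depth=2))
# e.g. [[[1, 2], [3]], [[4, 6], [5, 7]]]: a random 2-level hierarchical partition of 1..7
```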

nCRP: when to stop? Two approaches: use a finite depth $S$, or work with the infinitely deep tree. In the latter case we can either augment with a per-node probability of stopping, e.g. Adams et al. (2009), or integrate over chains of infinite length, e.g. Steinhardt and Ghahramani (2012).

The Dirichlet diffusion tree. The Dirichlet diffusion tree (DDT; Neal, 2003) is the continuum limit of the nCRP. Embed an $S$-deep nCRP in continuous time $\mathbb{R}_+$ such that the time of level $i$ is $i/\sqrt{S}$. Set the concentration parameter of each CRP to $\theta = 1/\sqrt{S}$. Take the limit as $S \to \infty$. Ignore degree-two nodes (apart from the root). What does this process look like?

The Dirichlet diffusion tree. Item 1 sits at its own table for all time $t \in \mathbb{R}_+$. Item $i$ sits at the same table as all the previous items at time $t = 0$. In the time interval $[t, t + dt]$ it splits off to start its own table with probability $dt/m$, where $m$ is the number of previous items at this table (at time $t$). If a previous item started a new table at some point (this is a "branch point"), then item $i$ chooses between the two tables with probability proportional to the number of items that previously went each way.
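
To make the table-switching dynamics concrete, here is a crude discrete-time rendering of this sequential description (our own simplification, with divergence function a(t) = 1, a finite horizon, and hypothetical function names; it is not how DDT samplers are implemented in practice).

```python
import random
from collections import Counter

def simulate_ddt_paths(n, T=1.0, dt=1e-3, rng=random.Random(1)):
    """paths[i][k] is the 'table' (branch) item i occupies at time k*dt."""
    steps = int(T / dt)
    next_table = 1
    paths = [[0] * steps]                        # item 1 stays at its own table forever
    for i in range(1, n):
        path, table, diverged = [], 0, False     # every later item starts at table 0 at t = 0
        for k in range(steps):
            path.append(table)
            if diverged:
                continue
            m = sum(1 for p in paths if p[k] == table)       # previous items at this table now
            if rng.random() < dt / m:
                table, diverged = next_table, True           # split off to a new table, kept forever
                next_table += 1
            elif k + 1 < steps:
                # otherwise follow the previous items; at a branch point choose a side
                # with probability proportional to how many previously went each way
                options = Counter(p[k + 1] for p in paths if p[k] == table)
                table = rng.choices(list(options), weights=list(options.values()))[0]
        paths.append(path)
    return paths

for i, p in enumerate(simulate_ddt_paths(5)):
    leave = next((k * 1e-3 for k, t in enumerate(p) if t != 0), None)
    print(f"item {i + 1} leaves item 1's path at t = {leave}")
```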

The Dirichlet diffusion tree. While the nCRP allowed arbitrary branching, the DDT only has binary branching events; this is addressed by the Pitman-Yor diffusion tree (K and Ghahramani, 2011). Branches now have associated branch lengths: these come from chains in the nCRP. Each item diverges to form a singleton table at some finite $t$ almost surely. (Drawn on the board.)

As a model for data. We have a prior over trees with unbounded depth and width; however, it is difficult to use with unbounded divergence times. Solution: transform the branch times $t$ according to $t' = A^{-1}(t)$, where $A^{-1} : \mathbb{R}_+ \to [0, 1]$ is strictly increasing, e.g. $A^{-1}(t) = 1 - e^{-t/c}$, so that $A(t') = -c \log(1 - t')$. This is equivalent to changing the probability of divergence in $[t, t + dt]$ to $a(t)\,dt/m$, where $a(t) = A'(t) = c/(1 - t)$.
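
Under this divergence function, the time at which a new item leaves a branch can be sampled in closed form by inverse-transform sampling of the survival function $P(T > t) = \exp\{-(A(t) - A(t_s))/m\}$, where $t_s$ is the time the item entered the branch. A small sketch (our own helper names), assuming $a(t) = c/(1 - t)$:

```python
import math
import random

def sample_divergence_time(t_start, m, c=1.0, rng=random.Random(0)):
    """Time at which a new item diverges from a branch entered at t_start that
    carries m previous items, for a(t) = c/(1 - t), i.e. A(t) = -c*log(1 - t)."""
    A = lambda t: -c * math.log(1.0 - t)
    A_inv = lambda y: 1.0 - math.exp(-y / c)
    u = rng.random()
    return A_inv(A(t_start) - m * math.log(u))   # inverse-transform sample of T

print([round(sample_divergence_time(0.0, m), 3) for m in (1, 2, 5, 20)])
# divergence always happens before t = 1; larger m delays it on average
```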

Diffusion on the tree. We will model data $x$ at the leaves using a diffusion process on the tree. Interpret the tree structure as a graphical model: $P(x_{\mathrm{child}} \mid x_{\mathrm{parent}}) = k(x_{\mathrm{child}} \mid x_{\mathrm{parent}}, t_{\mathrm{child}} - t_{\mathrm{parent}})$. We will focus on Brownian (Gaussian) diffusion: $k(x_{\mathrm{child}} \mid x_{\mathrm{parent}}, t_{\mathrm{child}} - t_{\mathrm{parent}}) = N(x_{\mathrm{child}}; x_{\mathrm{parent}}, t_{\mathrm{child}} - t_{\mathrm{parent}})$. This allows the marginal likelihood $P(x_{\mathrm{leaves}} \mid \mathrm{tree})$ to be calculated with a single upwards sweep of message passing (belief propagation).
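
A minimal one-dimensional sketch of that upward sweep (Gaussian belief propagation; the nested-tuple tree encoding and function names here are our own, assuming unit diffusion variance and the root fixed at the origin at time 0).

```python
import math

def norm_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multiply_gaussians(msgs):
    """Product of Gaussian messages (logZ, mean, var) in the same variable."""
    logZ, mean, var = msgs[0]
    for lz, m, v in msgs[1:]:
        logZ += lz + norm_logpdf(mean, m, var + v)    # constant from merging two Gaussians
        prec = 1.0 / var + 1.0 / v
        mean = (mean / var + m / v) / prec
        var = 1.0 / prec
    return logZ, mean, var

def upward_message(node, t_parent):
    """Message from `node` to its parent about the parent's location.
    node = (time, value) for a leaf, or (time, [children]) for an internal node."""
    t, content = node
    if isinstance(content, list):                     # internal node: combine child messages
        logZ, mean, var = multiply_gaussians([upward_message(c, t) for c in content])
    else:                                             # leaf: observed location, no uncertainty
        logZ, mean, var = 0.0, content, 0.0
    return logZ, mean, var + (t - t_parent)           # diffuse along the branch to the parent

def ddt_log_marginal_likelihood(root_children):
    """log P(leaf data | tree structure and times), root at location 0, time 0."""
    logZ, mean, var = multiply_gaussians([upward_message(c, 0.0) for c in root_children])
    return logZ + norm_logpdf(0.0, mean, var)

# Root -> internal node at t=0.4 with leaves 1.0 and 1.2 at t=1, plus a third leaf -0.5 at t=1.
tree = [(0.4, [(1.0, 1.0), (1.0, 1.2)]), (1.0, -0.5)]
print(ddt_log_marginal_likelihood(tree))
```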

[Figures: step-by-step construction of a DDT in one dimension against time, adding items 1, 2 and 3; item 2 diverges from item 1's path at point a, and item 3 diverges at point b.]

Properties. Exchangeable (we can show this by explicit calculation of the probability of the tree structure, divergence times and locations); see the proof in Neal (2003). Projective, by its sequential construction. Positive support for any binary tree.

Figure 3. Generation of a two-dimensional data set from the Dirichlet diffusion tree prior with σ = 1 and a(t) = 1/(1 - t). The plot on the left shows the first twenty data points generated, along with the underlying tree structure. The right plot shows 1000 data points obtained by continuing the procedure beyond these twenty points. Figure 4. Two data sets of 1000 points drawn from Dirichlet diffusion tree priors with σ = 1. For the data set on the left, the divergence function used was a(t) = (1/4)/(1 - t). For the data set on the right, a(t) = (3/2)/(1 - t).

Kingman's coalescent. The DDT is an example of a Gibbs fragmentation tree (McCullagh et al., 2008). We can also construct trees using coagulation processes. Kingman's coalescent (KC; Kingman, 1982): iteratively merge subtrees, starting with all leaves in their own subtrees. It can also be derived as the continuum limit of a population-genetics model (the Wright-Fisher model) of large populations of haploid individuals (i.e. each individual has only one parent).

Kingman's coalescent. Subtrees merge independently with rate 1: with $m$ subtrees remaining, the time until the next merge is $\mathrm{Exp}(m(m-1)/2)$. Used in a similar manner to the DDT for hierarchical clustering in Teh et al. (2008). [Figure 1 from Teh et al. (2008): (a) variables describing the n-coalescent, with the partition-valued path $\pi(t)$ coarsening from $\{\{1\},\{2\},\{3\},\{4\}\}$ at $t_0 = 0$ through $\{\{1\},\{2\},\{3,4\}\}$ and $\{\{1,2\},\{3,4\}\}$ to $\{\{1,2,3,4\}\}$, with coalescence times $t_1, t_2, t_3$ and inter-event durations $\delta_1, \delta_2, \delta_3$; (b) a sample path from a Brownian diffusion coalescent process in 1D, circles are coalescent points; (c) sample observed points from the same process in 2D; notice the hierarchically clustered nature of the points.]
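
A short sketch of sampling from Kingman's coalescent (our own nested-tuple encoding of the tree; times follow the convention that coalescence happens in the past, i.e. at negative times).

```python
import random

def kingmans_coalescent(n, rng=random.Random(0)):
    """Sample a tree over n leaves: with m lineages left, wait an
    Exp(m*(m-1)/2) time, then merge a uniformly chosen pair."""
    subtrees = [(i,) for i in range(n)]       # every leaf starts in its own subtree
    t, times = 0.0, []
    while len(subtrees) > 1:
        m = len(subtrees)
        t -= rng.expovariate(m * (m - 1) / 2)            # next coalescence, further back in time
        i, j = rng.sample(range(m), 2)                   # each pair merges with rate 1
        merged = (subtrees[i], subtrees[j])
        subtrees = [s for k, s in enumerate(subtrees) if k not in (i, j)] + [merged]
        times.append(t)
    return subtrees[0], times

tree, times = kingmans_coalescent(5)
print(tree)     # e.g. ((((0,), (3,)), (4,)), ((1,), (2,)))
print(times)    # strictly decreasing coalescence times
```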

Fragmentation-coagulation processes. The DDT and KC are dual in the following sense: start with a CRP-distributed partition, run the DDT fragmentation dynamics for time dt, then the KC coagulation dynamics for time dt, and the result is still CRP distributed with the same parameters. This is used to construct the fragmentation-coagulation (FC) process: a reversible, Markov, partition-valued process with the CRP as its stationary distribution. Applied to modelling genetic variation (Teh et al., 2011). [Figure: schematic of fragmentation and coagulation events; the caption is truncated in the transcription.]

Tree models I didn't talk about. λ-coalescents: a broad class generalising Kingman's coalescent (Pitman, 1999). Polya trees: split the unit interval recursively (binary splits) (Lavine, 1992; Mauldin et al., 1992); used for univariate density modelling. Tree-structured stick breaking: a stick-breaking representation of the nested CRP that uses per-node stopping probabilities (Adams et al., 2009). The time-marginalised KC (Boyles and Welling, 2012). No prior at all (implicitly uniform)! Some work does ML estimation of the tree while integrating over other variables (Heller and Ghahramani, 2005; Blundell et al., 2010).

Inference. Detaching and reattaching subtrees (Neal, 2003; Boyles and Welling, 2012; K et al., 2011). Sequential Monte Carlo (Bouchard-Côté et al., 2012; Gorur and Teh, 2008). Greedy agglomerative methods (Heller and Ghahramani, 2005; Blundell et al., 2010). [Figure: the join, absorb and collapse moves on subtrees from the Bayesian rose trees work; the surrounding caption text is truncated in the transcription.]

Conclusions. Exchangeability, projectivity and support are key characteristics to consider when designing models. Proving consistency and convergence rates is an active area of research. There are a lot of priors over the space of trees, with interesting relationships between them. There is a rich probability literature on many of these processes.

Bibliography I
Adams, R., Ghahramani, Z., and Jordan, M. (2009). Tree-structured stick breaking processes for hierarchical modeling. Statistics, (1):1-3.
Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57.
Blundell, C., Teh, Y. W., and Heller, K. A. (2010). Bayesian rose trees. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence.
Bouchard-Côté, A., Sankararaman, S., and Jordan, M. I. (2012). Phylogenetic inference via sequential Monte Carlo. Systematic Biology, 61(4):579-593.
Boyles, L. and Welling, M. (2012). The time-marginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems 25, pages 2978-2986.
Gorur, D. and Teh, Y. (2008). A sequential Monte Carlo algorithm for coalescent clustering. gatsby.ucl.ac.uk.

Bibliography II
Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, page 304.
K, D. A., Gael, J. V., and Ghahramani, Z. (2011). Message passing algorithms for Dirichlet diffusion trees. In Proceedings of the 28th Annual International Conference on Machine Learning.
K, D. A. and Ghahramani, Z. (2011). Pitman-Yor diffusion trees. In The 28th Conference on Uncertainty in Artificial Intelligence (to appear).
Kingman, J. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235-248.
Lavine, M. (1992). Some aspects of Polya tree distributions for statistical modelling. The Annals of Statistics, pages 1222-1235.
Mauldin, R. D., Sudderth, W. D., and Williams, S. (1992). Polya trees and random distributions. The Annals of Statistics, pages 1203-1221.
McCullagh, P., Pitman, J., and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli, 14(4):988-1002.
Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619-629.

Bibliography III
Pitman, J. (1999). Coalescents with multiple collisions. Annals of Probability, pages 1870-1902.
Steinhardt, J. and Ghahramani, Z. (2012). Flexible martingale priors for deep hierarchies. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Teh, Y. and Orbanz, P. (2011). NIPS tutorial: Bayesian nonparametrics. In NIPS.
Teh, Y. W., Blundell, C., and Elliott, L. T. (2011). Modelling genetic variations with fragmentation-coagulation processes. In Advances in Neural Information Processing Systems (NIPS).
Teh, Y. W., Daumé III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. Advances in Neural Information Processing Systems, 20.