Bayesian Clustering with the Dirichlet Process: Issues with priors and interpreting MCMC

Shane T. Jensen
Department of Statistics, The Wharton School, University of Pennsylvania
stjensen@wharton.upenn.edu

Collaborative work with J. Liu, L. Dicker, and G. Tuteja

May 13, 2006

Introduction

- Bayesian non-parametric or semi-parametric models are very useful in many applications.
- Non-parametric: random variables are realizations from an unspecified probability distribution, e.g., $X_i \sim F(\cdot)$, $i = 1, \ldots, n$.
- The $X_i$'s can be observed data, latent variables, or unknown parameters (often in a hierarchical setting).
- Prior distributions for $F(\cdot)$ play an important role in non-parametric modeling.

Dirichlet Process Priors

- A commonly-used prior distribution for an unknown probability distribution is the Dirichlet process: $F(\cdot) \sim \mathrm{DP}(\theta, F_0)$.
- $F_0$ is a probability measure: it represents prior belief about the form of $F$.
- $\theta$ is a weight parameter: it represents the degree of belief in the prior form $F_0$.
- Ferguson (1973, 1974); Antoniak (1974); many others.
- An important consequence of the Dirichlet process is that it induces a discretized posterior distribution.
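A draw from a DP prior is (almost surely) a discrete distribution, which is easiest to see from the stick-breaking construction. Below is a minimal sketch of this, my own illustration rather than anything from the talk, assuming a standard normal base measure $F_0$ and an arbitrary truncation level:

```python
# Minimal sketch (not from the slides): an approximate realization of
# F ~ DP(theta, F0) via the stick-breaking construction, truncated at K atoms.
# It illustrates why a DP draw is discrete: a countable set of atoms from F0
# with random weights.
import numpy as np

def stick_breaking_dp(theta, f0_sampler, K=1000, seed=None):
    """Return atoms and weights of a truncated stick-breaking draw from DP(theta, F0)."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, theta, size=K)                    # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                             # sum to ~1 for large K
    atoms = f0_sampler(K, rng)                              # atom locations drawn i.i.d. from F0
    return atoms, weights

# Example: F0 = standard normal base measure, theta = 5 (illustrative values)
atoms, weights = stick_breaking_dp(theta=5.0,
                                   f0_sampler=lambda k, rng: rng.normal(size=k))
print(atoms[:5], weights[:5], weights.sum())
```

For a large truncation level the weights sum to essentially one; ties among values drawn from such a discrete $F$ are what produce the clustering behavior discussed on the following slides.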

Consequence of DP priors

- Ferguson (1974): using a Dirichlet process $\mathrm{DP}(\theta, F_0)$ prior for $F(\cdot)$ results in a posterior mixture of $F_0$ and point masses at the observations $X_i$:
    $F(\cdot) \mid X_1, \ldots, X_n \sim \mathrm{DP}\left(\theta + n,\ F_0 + \sum_{i=1}^{n} \delta(X_i)\right)$
- For density estimation, discreteness may be a problem: convolutions with kernel functions can be used to produce a continuous density estimate.
- In other applications, discreteness is not a disadvantage!

Clustering with a DP prior

- The point-mass component of the posterior leads to a random partition of our variables.
- Consider a new variable $X_{n+1}$ and let $X^*_1, \ldots, X^*_C$ be the unique values of $X_{1:n} = (X_1, \ldots, X_n)$. Then,
    $P(X_{n+1} = X^*_c \mid X_{1:n}) = \dfrac{N_c}{\theta + n}, \quad c = 1, \ldots, C$
    $P(X_{n+1} = \text{new} \mid X_{1:n}) = \dfrac{\theta}{\theta + n}$
- $N_c$ = size of cluster $c$: the number of values in $X_{1:n}$ that equal $X^*_c$.
- "Rich get richer": will return to this...
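To make the induced random partition concrete, here is a small sketch (mine, not from the slides) that seats $n$ items one at a time according to the predictive probabilities above; the value of $\theta$ is an arbitrary illustrative choice:

```python
# Sequentially assign items to clusters using the DP predictive rule:
# join existing cluster c with probability N_c/(theta+i), start a new cluster
# with probability theta/(theta+i). Returns a cluster label for every item.
import numpy as np

def dp_random_partition(n, theta, seed=None):
    rng = np.random.default_rng(seed)
    labels, sizes = [], []                 # sizes[c] = N_c, the current cluster sizes
    for i in range(n):
        probs = np.array(sizes + [theta], dtype=float) / (theta + i)
        c = rng.choice(len(probs), p=probs)
        if c == len(sizes):
            sizes.append(0)                # open a new cluster
        sizes[c] += 1
        labels.append(c)
    return labels

print(dp_random_partition(20, theta=1.0, seed=0))   # e.g. [0, 0, 1, 0, ...]
```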

Motivating Application: TF motifs

- Genes are regulated by transcription factor (TF) proteins that bind to the DNA sequence near the gene.
- TF proteins can selectively control only certain target genes by binding only to a particular sequence pattern, called a motif.
- The motif sites are highly conserved but not identical, so we use a matrix description of the motif appearance.

Frequency Matrix $X_i$ (columns are motif positions), shown alongside its sequence logo:

      1     2     3     4     5     6
  A   0.05  0.02  0.85  0.02  0.21  0.06
  C   0.04  0.02  0.03  0.93  0.05  0.06
  G   0.06  0.94  0.06  0.04  0.70  0.11
  T   0.85  0.02  0.06  0.01  0.04  0.77

Collections of TF motifs

- Large databases contain motif information on many TFs, but with a large amount of redundancy.
- TRANSFAC and JASPAR are the largest (hundreds of motifs in each).
- We want to cluster motifs together to either reduce redundancy in the databases or match new motifs to a database.
- Nucleotide conservation varies both within a single motif (between positions) and between different motifs.

[Example motif logos: Tal1beta-E47S and AGL3]

Motif Clustering with DP prior

- Hierarchical model with levels for both within-unit and between-unit variability in discovered motifs.
- The observed count matrix $Y_i$ is a product-multinomial realization of the frequency matrix $X_i$.
- The unknown $X_i$'s share an unknown distribution $F(\cdot)$.
- A Dirichlet process $\mathrm{DP}(\theta, F_0)$ prior for $F(\cdot)$ leads to a posterior mixture of $F_0$ and point masses at each $X_i$.
- Our prior measure $F_0$ in this application is a product Dirichlet distribution.
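To make the two levels of the hierarchy concrete, the following is a minimal generative sketch of a single motif, my own illustration rather than the talk's code: each column of $X_i$ is drawn from a Dirichlet (so $F_0$ is a product Dirichlet), and $Y_i$ is a product-multinomial realization of $X_i$. The motif width, the number of binding sites, and the Dirichlet hyperparameter are assumed values:

```python
# Generative sketch of one motif: columns of the frequency matrix X_i are
# Dirichlet draws, and the observed count matrix Y_i has one multinomial
# column per position given X_i. W, N, and alpha are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
W = 6                           # motif width (number of positions), assumed
N = 20                          # number of aligned binding sites giving counts
alpha = np.ones(4) * 0.5        # Dirichlet hyperparameter for F0 (assumed value)

# Frequency matrix X_i: one Dirichlet draw per position (rows A, C, G, T)
X_i = rng.dirichlet(alpha, size=W).T                         # shape (4, W)

# Observed counts Y_i: one multinomial draw per position from its column of X_i
Y_i = np.column_stack([rng.multinomial(N, X_i[:, w]) for w in range(W)])
print(X_i.round(2))
print(Y_i)
```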

Benefits and Issues with DP prior

- Allows an unknown number of clusters without needing to model the number of clusters directly.
- We have no real prior knowledge about the number of clusters in our application.
- However, with the DP there are implicit assumptions about the number of clusters (and their size distribution).
- The "rich get richer" property influences the prior predictive number of clusters and the cluster size distribution.
- How influential is this property in an application?

Benefits and Issues with MCMC

- The DP-based model is easy to implement via Gibbs sampling: $p(X_i \mid X_{-i})$ has the same choice structure as $p(X_{n+1} \mid X_{1:n})$.
- $X_i$ is either sampled into one of the current clusters defined by $X_{-i}$, or sampled from $F_0$ to form a new cluster.
- The alternative is a direct model on the number of clusters, and then using something like Reversible Jump MCMC.
- Mixing can be an issue with the Gibbs sampler:
  - collapsed Gibbs sampler: integrate out the $X_i$'s and deal directly with the clustering indicators
  - split/merge moves to speed up mixing: lots of great work by R. Neal, D. Dahl and others
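A compact sketch of such a collapsed Gibbs sweep is shown below. It is my own illustration under a toy conjugate model (univariate Gaussian observations with known variance), not the product-Dirichlet/multinomial motif model used in the talk; mu0, tau2, sigma2, and theta are placeholder values:

```python
# Collapsed Gibbs sweep for a DP mixture: component parameters are integrated
# out and only the cluster indicators are sampled. Toy Gaussian model, not the
# motif model from the talk.
import numpy as np

def log_pred_gaussian(y_new, cluster_ys, mu0=0.0, tau2=10.0, sigma2=1.0):
    """Log posterior-predictive density of y_new given the points already in a
    cluster, under a N(mu0, tau2) prior on the cluster mean and known variance sigma2."""
    n = len(cluster_ys)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (mu0 / tau2 + np.sum(cluster_ys) / sigma2)
    pred_var = post_var + sigma2
    return -0.5 * (np.log(2 * np.pi * pred_var) + (y_new - post_mean) ** 2 / pred_var)

def collapsed_gibbs_sweep(y, z, theta, rng):
    """One sweep over all items; z[i] is the cluster label of item i."""
    for i in range(len(y)):
        z[i] = -1                                        # remove item i from its cluster
        labels = sorted(set(z) - {-1})
        logw = []
        for c in labels:                                 # weight for each existing cluster
            members = y[np.array(z) == c]
            logw.append(np.log(len(members)) + log_pred_gaussian(y[i], members))
        logw.append(np.log(theta) + log_pred_gaussian(y[i], np.array([])))   # new cluster
        logw = np.array(logw)
        w = np.exp(logw - logw.max())
        k = rng.choice(len(w), p=w / w.sum())
        z[i] = labels[k] if k < len(labels) else max(labels, default=-1) + 1
    return z

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3, 1, 20), rng.normal(3, 1, 20)])
z = list(range(len(y)))                                  # start with every item in its own cluster
for _ in range(50):
    z = collapsed_gibbs_sweep(y, z, theta=1.0, rng=rng)
print(len(set(z)), "clusters after 50 sweeps")
```

Each item is reassigned given all the others, with weight proportional to the cluster size times the posterior-predictive density for an existing cluster, and proportional to $\theta$ times the prior-predictive density for a new cluster.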

Main Issue 1: Posterior Inference from MCMC

- However, there are still issues with posterior inference based on the Gibbs sampling output.
- We need to infer a set of clusters from the sampled partitions, but we have a label switching problem (Stephens, 1999):
  - cluster labels are exchangeable for a particular partition
  - usual summaries such as the posterior mean can be misleading mixtures over these exchangeable labelings
  - we need summaries that are uninfluenced by the labeling

Posterior Inference Options

- Option 1: clusters defined by the last partition visited
  - the sampled partition produced at the end of the Gibbs chain
  - surprisingly popular, e.g. Latent Dirichlet Allocation models
- Option 2: clusters defined by the MAP partition
  - the sampled partition with the highest posterior density
  - simple and popular
- Option 3: clusters defined by a threshold on the pairwise posterior probabilities $P_{ij}$
  - $P_{ij}$ = frequency of iterations with motifs $i$ and $j$ in the same cluster
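A small sketch of Option 3 (my own, assuming the Gibbs sampler saves a vector of cluster labels at each iteration): estimate $P_{ij}$ as the fraction of iterations in which items $i$ and $j$ share a cluster, then link items whose pairwise probability exceeds a chosen cutoff:

```python
# Pairwise posterior probabilities P_ij from saved partitions, thresholded to
# define clusters via connected components of the resulting graph. The 0.5
# cutoff is one of the thresholds considered on the next slides.
import numpy as np

def pairwise_coclustering(partitions):
    """partitions: array of shape (n_iterations, n_items) of cluster labels."""
    partitions = np.asarray(partitions)
    n_iter, n_items = partitions.shape
    P = np.zeros((n_items, n_items))
    for z in partitions:
        P += (z[:, None] == z[None, :]).astype(float)   # 1 if i and j co-clustered
    return P / n_iter

def clusters_from_threshold(P, cutoff=0.5):
    """Union of items whose pairwise probability exceeds the cutoff (union-find)."""
    n = P.shape[0]
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]; a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if P[i, j] > cutoff:
                parent[find(i)] = find(j)               # merge the two components
    return [find(i) for i in range(n)]

# Toy usage: three saved partitions of five items
samples = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 2], [3, 3, 1, 1, 1]]
P = pairwise_coclustering(samples)
print(clusters_from_threshold(P, cutoff=0.5))
```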

Main Issue 2: Implicit DP Assumptions

- The DP has an implicit "rich get richer" property, easy to see from the predictive distribution:
    $P(X_{n+1} \text{ joins cluster } c) = \dfrac{N_c}{\theta + n}, \quad c = 1, \ldots, C$
    $P(X_{n+1} \text{ forms new cluster}) = \dfrac{\theta}{\theta + n}$
- Chinese restaurant process: a new customer chooses a table
  - sits at a current table with probability proportional to $N_c$, the number of customers already sitting there
  - sits at an entirely new table with probability proportional to $\theta$

Alternative Priors for Clustering

- Uniform Prior: socialism, no one gets rich
    $P(X_{n+1} \text{ joins cluster } c) = \dfrac{1}{\theta + C}, \quad c = 1, \ldots, C$
    $P(X_{n+1} \text{ forms new cluster}) = \dfrac{\theta}{\theta + C}$
- Pitman-Yor Prior: rich get richer, but charitable
    $P(X_{n+1} \text{ joins cluster } c) = \dfrac{N_c - \alpha}{\theta + n}, \quad c = 1, \ldots, C$
    $P(X_{n+1} \text{ forms new cluster}) = \dfrac{\theta + C\alpha}{\theta + n}$
- $0 \leq \alpha \leq 1$ is often called the discount factor.

Asymptotic Comparison of Priors

- The number of clusters $C_n$ is clearly a function of the sample size $n$. How does $C_n$ grow as $n \to \infty$?
    DP Prior:          $E(C_n) \approx \theta \log(n)$
    Pitman-Yor Prior:  $E(C_n) \approx K(\theta, \alpha)\, n^{\alpha}$
    Uniform Prior:     $E(C_n) \approx K(\theta)\, n^{1/2}$
- The DP prior shows the slowest growth in the number of clusters $C_n$.
- Interestingly, Pitman-Yor can lead to either faster or slower growth vs. Uniform, depending on $\alpha$.
- Also working on results for the distribution of cluster sizes.
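These growth rates can also be checked numerically. The sketch below (mine, not the talk's code) seats $n$ items sequentially under each prior's predictive rule and records the resulting number of clusters; $\theta = 10$ and $\alpha = 0.5$ are illustrative values in the spirit of the figure on the next slide:

```python
# Monte Carlo check of cluster growth: seat n items one at a time under the
# DP, Pitman-Yor (PY), or Uniform (UN) predictive rule and count clusters.
import numpy as np

def simulate_clusters(n, rule, theta, alpha=0.0, seed=None):
    """Return C_n after sequentially seating n items under the given predictive rule."""
    rng = np.random.default_rng(seed)
    sizes = []                                            # N_c for each existing cluster
    for i in range(n):
        C = len(sizes)
        if rule == "DP":
            w = np.array(sizes + [theta], dtype=float)                     # N_c and theta
        elif rule == "PY":
            w = np.array([s - alpha for s in sizes] + [theta + C * alpha])
        elif rule == "UN":
            w = np.array([1.0] * C + [theta])                              # equal weights
        k = rng.choice(C + 1, p=w / w.sum())
        if k == C:
            sizes.append(1)                               # new cluster
        else:
            sizes[k] += 1                                 # join existing cluster k
    return len(sizes)

for rule in ["DP", "UN", "PY"]:
    cn = np.mean([simulate_clusters(5000, rule, theta=10.0, alpha=0.5, seed=r)
                  for r in range(20)])
    print(rule, round(cn, 1))
```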

Finite Sample Comparison of Priors

[Figure: expected number of clusters $C_n$ (y-axis) versus number of observations $n$ (x-axis), on log scales, in three panels for $\theta = 1$, $\theta = 10$, and $\theta = 100$. Each panel compares the DP, Uniform, and Pitman-Yor ($\alpha = 0.25, 0.5, 0.75$) priors.]

Simulation Study of Motif Clustering

- Evaluation of different priors and modes of inference in the context of the motif clustering application.
- Simulated realistic collections of motifs (known partitions).
- Different simulation conditions to vary clustering difficulty:
  - high to low within-cluster similarity
  - high to low between-cluster similarity
- Success measured by the Jaccard similarity between the true partition $z$ and the inferred partition $\hat{z}$:
    $J(z, \hat{z}) = \dfrac{TP}{TP + FP + FN}$
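For reference, a short sketch of this Jaccard computation over pairs of items (my reading of the TP/FP/FN counts, which the slide does not spell out): TP counts pairs clustered together in both partitions, FP pairs together only in $\hat{z}$, and FN pairs together only in $z$:

```python
# Jaccard similarity between two partitions, computed over all pairs of items.
from itertools import combinations

def jaccard_index(z_true, z_hat):
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        tp += same_true and same_hat            # together in both partitions
        fp += (not same_true) and same_hat      # together only in the inferred partition
        fn += same_true and (not same_hat)      # together only in the true partition
    return tp / (tp + fp + fn)

print(jaccard_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))   # toy example
```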

Simulation Comparison of Inference Alternatives

[Figure: Jaccard index (0.2 to 1.0) versus increasing clustering difficulty for the MAP partition and for pairwise posterior probability thresholds of 0.5 and 0.25.]

- The MAP partition is consistently inferior to the pairwise posterior probabilities.
- Posterior probabilities incorporate uncertainty across iterations.

Simulation Comparison of Prior Alternatives

[Figure: Jaccard index (roughly 0.70 to 0.95) versus increasing clustering difficulty for the Uniform, Pitman-Yor ($\alpha = 0.25, 0.5, 0.75$), and DP priors.]

- Not much difference in general between the priors.
- The Uniform prior does a little worse in most situations.

Real Data Results: Clustering JASPAR database

[Figure: tree of JASPAR motifs, each labeled by species, TF family, and MA accession (e.g. Homo.sapiens NUCLEAR MA0065), built from the pairwise posterior probabilities; the vertical axis is 1 - Prob(Clustering).]

- The post-processed MAP partition, with weak relationships removed, is then very similar to the thresholded pairwise posterior probabilities.

Comparing Priors: Clustering JASPAR database

[Figure: posterior histograms of the number of clusters (roughly 20 to 35) and the average cluster size (roughly 2.5 to 3.5) under the Uniform and DP priors.]

- Very little difference between using the DP and the uniform prior.
- The likelihood is dominating any prior assumption on the partition.

Summary

- Non-parametric Bayesian approaches based on the Dirichlet process can be very useful for clustering applications.
- Issues with MCMC inference: popular MAP partitions seem inferior to partitions based on pairwise posterior probabilities.
- Issues with implicit DP assumptions: alternative priors give quite different prior partitions.
- Posterior differences between priors are small in our motif application, but can be larger in other applications.
- Jensen and Liu, JASA (forthcoming), plus other manuscripts soon available on my website: http://stat.wharton.upenn.edu/~stjensen