Bayesian Clustering with the Dirichlet Process: Issues with Priors and Interpreting MCMC

Shane T. Jensen
Department of Statistics, The Wharton School, University of Pennsylvania
stjensen@wharton.upenn.edu
Collaborative work with J. Liu, L. Dicker, and G. Tuteja
May 13, 2006
Introduction
- Bayesian non-parametric and semi-parametric models are very useful in many applications
- Non-parametric: random variables are realizations from an unspecified probability distribution, e.g., X_i ~ F(·), i = 1, ..., n
- The X_i's can be observed data, latent variables, or unknown parameters (often in a hierarchical setting)
- Prior distributions for F(·) play an important role in non-parametric modeling
Dirichlet Process Priors
- A commonly-used prior distribution for an unknown probability distribution is the Dirichlet process: F(·) ~ DP(θ, F_0)
  - F_0 is a probability measure that represents prior belief about the form of F
  - θ is a weight parameter that represents the degree of belief in the prior form F_0
- Ferguson (1973, 1974); Antoniak (1974); many others
- An important consequence of the Dirichlet process is that it induces a discretized posterior distribution
Consequence of DP Priors
- Ferguson (1974): using a Dirichlet process DP(θ, F_0) prior for F(·) results in a posterior that mixes F_0 with point masses at the observations X_i:

    F(·) | X_1, ..., X_n ~ DP( θ + n, [θ F_0 + Σ_{i=1}^n δ(X_i)] / (θ + n) )

- For density estimation, this discreteness may be a problem: convolutions with kernel functions can be used to produce a continuous density estimate
- In other applications, discreteness is not a disadvantage!
Clustering with a DP Prior
- The point-mass component of the posterior leads to a random partition of our variables
- Consider a new variable X_{n+1} and let X*_1, ..., X*_C be the unique values among X_{1:n} = (X_1, ..., X_n). Then,

    P(X_{n+1} = X*_c | X_{1:n}) = N_c / (θ + n),   c = 1, ..., C
    P(X_{n+1} = new | X_{1:n}) = θ / (θ + n)

  where N_c = size of cluster c: the number of variables in X_{1:n} that equal X*_c
- "Rich get richer": will return to this...
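The two predictive probabilities above are simple to compute. A minimal sketch (function name `dp_predictive` is illustrative, not from the talk):

```python
def dp_predictive(sizes, theta):
    """Return ([P(join cluster c)], P(new cluster)) given current cluster
    sizes N_c and DP weight parameter theta."""
    n = sum(sizes)                              # number of variables so far
    join = [N_c / (theta + n) for N_c in sizes] # N_c / (theta + n)
    new = theta / (theta + n)                   # theta / (theta + n)
    return join, new

# Example: clusters of size 3 and 1 (so n = 4), theta = 1
join, new = dp_predictive([3, 1], theta=1.0)
# join == [0.6, 0.2], new == 0.2; the C + 1 options sum to 1
```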
Motivating Application: TF Motifs
- Genes are regulated by transcription factor (TF) proteins that bind to the DNA sequence near a gene
- TF proteins can selectively control only certain target genes by binding only to a particular sequence pattern, called a motif
- The motif sites are highly conserved but not identical, so we use a matrix description of the motif appearance

  Frequency matrix X_i (columns = positions; shown alongside its sequence logo):

        1     2     3     4     5     6
  A  0.05  0.02  0.85  0.02  0.21  0.06
  C  0.04  0.02  0.03  0.93  0.05  0.06
  G  0.06  0.94  0.06  0.04  0.70  0.11
  T  0.85  0.02  0.06  0.01  0.04  0.77
Collections of TF Motifs
- Large databases contain motif information on many TFs, but with a large amount of redundancy
  - TRANSFAC and JASPAR are the largest (100s of motifs in each)
- We want to cluster motifs together either to reduce redundancy in the databases or to match new motifs to the database
- Nucleotide conservation varies both within a single motif (between positions) and between different motifs (example logos: Tal1beta-E47S, AGL3)
Motif Clustering with a DP Prior
- Hierarchical model with levels for both within-unit and between-unit variability in discovered motifs
- The observed count matrix Y_i is a product-multinomial realization of the frequency matrix X_i
- The unknown X_i's share an unknown distribution F(·)
- A Dirichlet process DP(θ, F_0) prior for F(·) leads to a posterior mixture of F_0 and point masses at each X_i
- Our prior measure F_0 in this application is a product Dirichlet distribution
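A product Dirichlet prior draws each column of the frequency matrix as an independent Dirichlet vector over (A, C, G, T). A hedged sketch of sampling one matrix from such an F_0 (the pseudocount vector `alpha` and function names are illustrative assumptions, not the talk's actual hyperparameters):

```python
import random

def sample_dirichlet(alpha, rng):
    """One Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_motif_matrix(width, alpha, rng):
    """One frequency matrix from product Dirichlet F0:
    each of `width` columns is an independent Dirichlet draw."""
    return [sample_dirichlet(alpha, rng) for _ in range(width)]

rng = random.Random(0)
X = sample_motif_matrix(6, alpha=[0.5, 0.5, 0.5, 0.5], rng=rng)
# X has 6 columns, each a probability vector over A, C, G, T
```

Small pseudocounts (alpha < 1) favor sharply conserved columns, matching the look of real motif matrices.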
Benefits and Issues with the DP Prior
- Allows an unknown number of clusters without the need to model the number of clusters directly
  - we have no real prior knowledge about the number of clusters in our application
- However, with the DP there are implicit assumptions about the number of clusters (and their size distribution)
  - the "rich get richer" property influences the prior predictive number of clusters and the cluster size distribution
- How influential is this property in an application?
Benefits and Issues with MCMC
- The DP-based model is easy to implement via Gibbs sampling
  - p(X_i | X_{-i}) has the same choice structure as p(X_{n+1} | X_{1:n})
  - X_i is either sampled into one of the current clusters defined by X_{-i} or sampled from F_0 to form a new cluster
- The alternative is a direct model on the number of clusters, fit with something like reversible jump MCMC
- Mixing can be an issue with the Gibbs sampler
  - collapsed Gibbs sampler: integrate out the X_i and deal directly with clustering indicators
  - split/merge moves to speed up mixing: lots of great work by R. Neal, D. Dahl, and others
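The choice structure of one collapsed-Gibbs update can be sketched as follows: remove item i, then reassign it to an existing cluster or a new one with probability proportional to (CRP prior weight) times (marginal likelihood). The `loglik(item, cluster_items)` function is a placeholder assumption standing in for the model's integrated likelihood:

```python
import math, random

def gibbs_step(i, assign, items, theta, loglik, rng):
    """One collapsed-Gibbs reassignment of item i given all other labels."""
    # cluster memberships with item i removed
    clusters = {}
    for j, c in enumerate(assign):
        if j != i:
            clusters.setdefault(c, []).append(j)
    labels = list(clusters)
    n = len(items) - 1                      # items other than i
    # log weight for joining each existing cluster: CRP prior x likelihood
    logw = [math.log(len(clusters[c]) / (theta + n)) +
            loglik(items[i], [items[j] for j in clusters[c]])
            for c in labels]
    # log weight for opening a new cluster
    logw.append(math.log(theta / (theta + n)) + loglik(items[i], []))
    # sample proportionally to exp(logw), stabilized by the max
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    r = rng.random() * sum(w)
    for k, wk in enumerate(w):
        r -= wk
        if r <= 0:
            break
    assign[i] = labels[k] if k < len(labels) else max(assign) + 1
    return assign
```

Sweeping this step over all i yields one full Gibbs scan over clustering indicators.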
Main Issue 1: Posterior Inference from MCMC
- Posterior inference based on Gibbs sampling output has its own issues
- We need to infer a set of clusters from the sampled partitions, but we have a label-switching problem (Stephens, 1999)
  - cluster labels are exchangeable for a particular partition
  - usual summaries such as the posterior mean can be misleading mixtures of these exchangeable labelings
  - we need summaries that are uninfluenced by the labeling
Posterior Inference Options
- Option 1: clusters defined by the last partition visited
  - the sampled partition produced at the end of the Gibbs chain
  - surprisingly popular, e.g., in Latent Dirichlet Allocation models
- Option 2: clusters defined by the MAP partition
  - the sampled partition with the highest posterior density
  - simple and popular
- Option 3: clusters defined by a threshold on the pairwise posterior probabilities P_ij
  - P_ij = frequency of iterations with motifs i and j in the same cluster
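Option 3 is easy to implement from saved label vectors: estimate each P_ij as a co-clustering frequency, then link pairs above a threshold and report connected components. A minimal sketch (function names are illustrative):

```python
def pairwise_probs(samples):
    """samples: list of label vectors, one per MCMC iteration.
    Returns the symmetric matrix of co-clustering frequencies P_ij."""
    n = len(samples[0])
    P = [[0.0] * n for _ in range(n)]
    for z in samples:
        for i in range(n):
            for j in range(i + 1, n):
                if z[i] == z[j]:
                    P[i][j] += 1
    T = len(samples)
    for i in range(n):
        for j in range(i + 1, n):
            P[i][j] /= T
            P[j][i] = P[i][j]
    return P

def threshold_clusters(P, cut=0.5):
    """Union items whose P_ij exceeds `cut`; return the components."""
    n = len(P)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if P[i][j] > cut:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

samples = [[0, 0, 1, 1], [0, 0, 0, 1], [2, 2, 1, 1]]
P = pairwise_probs(samples)       # P[0][1] == 1.0, P[0][2] == 1/3
clusters = threshold_clusters(P)  # [[0, 1], [2, 3]]
```

Note that label switching is harmless here: P_ij only asks whether i and j share a label within each iteration, never what the label is.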
Main Issue 2: Implicit DP Assumptions
- The DP has an implicit "rich get richer" property, easy to see from the predictive distribution:

    P(X_{n+1} joins cluster c) = N_c / (θ + n),   c = 1, ..., C
    P(X_{n+1} forms new cluster) = θ / (θ + n)

- Chinese restaurant process: a new customer chooses a table
  - sits at current table c with probability proportional to N_c, the number of customers already sitting there
  - sits at an entirely new table with probability proportional to θ
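The restaurant metaphor can be simulated directly: seat n customers one at a time, each joining table c with probability proportional to N_c or a new table with probability proportional to θ. A small sketch:

```python
import random

def crp_partition(n, theta, rng):
    """Simulate one Chinese restaurant process seating of n customers.
    Returns (table labels per customer, table sizes)."""
    sizes = []   # sizes[c] = customers currently at table c
    labels = []
    for i in range(n):
        # existing tables weighted by occupancy, new table weighted by theta
        weights = sizes + [theta]
        r = rng.random() * (i + theta)   # total weight = i + theta
        for c, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if c == len(sizes):
            sizes.append(1)              # new table opened
        else:
            sizes[c] += 1
        labels.append(c)
    return labels, sizes

labels, sizes = crp_partition(20, theta=1.0, rng=random.Random(0))
# sizes always sums to 20; as theta -> 0 everyone sits at one table
```

Repeating this many times shows the prior's size-biased behavior: a few large tables and many singletons.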
Alternative Priors for Clustering
- Uniform prior: socialism, no one gets rich

    P(X_{n+1} joins cluster c) = 1 / (θ + C),   c = 1, ..., C
    P(X_{n+1} forms new cluster) = θ / (θ + C)

- Pitman-Yor prior: rich get richer, but charitable

    P(X_{n+1} joins cluster c) = (N_c − α) / (θ + n),   c = 1, ..., C
    P(X_{n+1} forms new cluster) = (θ + Cα) / (θ + n)

  where 0 ≤ α < 1 is often called the discount factor
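The three predictive rules differ only in how they weight existing clusters against a new one, so they fit in one function. A sketch (with `sizes` the current cluster sizes N_c; the DP case is Pitman-Yor with α = 0):

```python
def predictive(sizes, theta, prior="DP", alpha=0.5):
    """Predictive cluster-assignment probabilities under each prior."""
    n, C = sum(sizes), len(sizes)
    if prior == "DP":            # rich get richer
        join = [N / (theta + n) for N in sizes]
        new = theta / (theta + n)
    elif prior == "uniform":     # every existing cluster equally likely
        join = [1.0 / (theta + C) for _ in sizes]
        new = theta / (theta + C)
    elif prior == "PY":          # Pitman-Yor with discount factor alpha
        join = [(N - alpha) / (theta + n) for N in sizes]
        new = (theta + C * alpha) / (theta + n)
    return join, new

# Example: sizes [3, 1], theta = 1, Pitman-Yor with alpha = 0.5
py_join, py_new = predictive([3, 1], 1.0, "PY", 0.5)
# py_join == [0.5, 0.1], py_new == 0.4
```

In every case the C + 1 probabilities sum to one; for Pitman-Yor the discount α shifts mass from large existing clusters toward opening new ones.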
Asymptotic Comparison of Priors
- The number of clusters C_n is clearly a function of the sample size n; how does C_n grow as n → ∞?

    DP prior:          E(C_n) ≈ θ log(n)
    Pitman-Yor prior:  E(C_n) ≈ K(θ, α) n^α
    Uniform prior:     E(C_n) ≈ K(θ) n^{1/2}

- The DP prior shows the slowest growth in the number of clusters C_n
- Interestingly, Pitman-Yor can lead to either faster or slower growth than Uniform, depending on α
- Also working on results for the distribution of cluster sizes
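The DP rate is easy to verify exactly: a new cluster forms at step i with probability θ/(θ + i), so E(C_n) is a finite sum that behaves like θ log(n). A quick numerical check (for θ = 1 the sum is the harmonic number H_n ≈ log(n) + 0.5772):

```python
import math

def expected_clusters_dp(n, theta):
    """Exact E(C_n) under a DP(theta, F0) prior:
    sum over steps of the new-cluster probability theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))

e = expected_clusters_dp(10_000, theta=1.0)
# e is within ~1e-4 of log(10000) + 0.5772, i.e. about 9.79 clusters
```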
Finite Sample Comparison of Priors
- [Figure: expected number of clusters C_n vs. n = number of observations (log-log scale), one panel each for θ = 1, 10, 100, comparing the DP, Uniform, and Pitman-Yor (α = 0.25, 0.5, 0.75) priors over n = 100 to 50,000]
Simulation Study of Motif Clustering
- Evaluation of the different priors and modes of inference in the context of the motif clustering application
- Simulated realistic collections of motifs (known partitions)
- Different simulation conditions to vary clustering difficulty:
  - high to low within-cluster similarity
  - high to low between-cluster similarity
- Success is measured by the Jaccard similarity between the true partition z and the inferred partition ẑ, counting pairs of motifs:

    J(z, ẑ) = TP / (TP + FP + FN)
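The Jaccard similarity above is computed over pairs of items: TP = pairs clustered together in both partitions, FP = together only in the inferred partition, FN = together only in the true one. A minimal sketch:

```python
from itertools import combinations

def jaccard(z_true, z_hat):
    """Jaccard similarity between two partitions, given as label vectors."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(z_true)), 2):
        same_true = z_true[i] == z_true[j]
        same_hat = z_hat[i] == z_hat[j]
        if same_true and same_hat:
            tp += 1           # pair together in both partitions
        elif same_hat:
            fp += 1           # together only in the inferred partition
        elif same_true:
            fn += 1           # together only in the true partition
    return tp / (tp + fp + fn)

score = jaccard([0, 0, 1, 1], [0, 0, 0, 1])  # -> 0.25
```

Because it compares pairs rather than labels, the index is unaffected by label switching, so it is a fair scorecard for all three inference options.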
Simulation Comparison of Inference Alternatives
- [Figure: Jaccard index vs. increasing clustering difficulty for the MAP partition and for pairwise posterior probabilities thresholded at 0.5 and 0.25]
- The MAP partition is consistently inferior to pairwise probabilities
- Posterior probabilities incorporate uncertainty across iterations
Simulation Comparison of Prior Alternatives
- [Figure: Jaccard index vs. increasing clustering difficulty for the Uniform, Pitman-Yor (α = 0.25, 0.5, 0.75), and DP priors]
- Not much difference in general between the priors
- Uniform does a little worse in most situations
Real Data Results: Clustering the JASPAR Database
- [Figure: tree of JASPAR motifs based on pairwise posterior probabilities (height = 1 − P(clustering)), with leaves labeled by species, TF family, and JASPAR ID, e.g. Homo.sapiens NUCLEAR MA0065]
- Post-processing the MAP partition to remove weak relationships makes it very similar to the thresholded posterior probabilities
Comparing Priors: Clustering the JASPAR Database
- [Figure: posterior histograms of the number of clusters (roughly 20 to 35) and the average cluster size (roughly 2.5 to 3.5) under the Uniform and DP priors]
- Very little difference between using the DP and the uniform prior
- The likelihood is dominating any prior assumption on the partition
Summary
- Non-parametric Bayesian approaches based on the Dirichlet process can be very useful for clustering applications
- Issues with MCMC inference: popular MAP partitions seem inferior to partitions based on pairwise posterior probabilities
- Issues with implicit DP assumptions: alternative priors give quite different prior partitions
- Posterior differences between priors are small in our motif application, but can be larger in other applications
- Jensen and Liu, JASA (forthcoming), plus other manuscripts soon available on my website: http://stat.wharton.upenn.edu/~stjensen