Priors for Random Count Matrices with Random or Fixed Row Sums
Mingyuan Zhou
Joint work with Oscar Madrid and James Scott
IROM Department, McCombs School of Business
Department of Statistics and Data Sciences
The University of Texas at Austin
Conference on Bayesian Nonparametrics, Raleigh, NC, June
Table of Contents
- Motivations
- How to construct an infinite random count matrix?
- Priors for random count matrices
- Infinite vocabulary naive Bayes classifiers
- Random count matrices and mixed-membership modeling
- Conclusions
Motivations
Where do random count matrices appear?
Directly observable random count matrices:
- Text analysis: document-word count matrix
- DNA sequencing: sample-gene count matrix
- Social network analysis: user-venue check-in count matrix
- Consumer behavior: consumer-product count matrix
Latent random count matrices:
- Topic models [Blei et al.]: document-topic count matrix (the sum of each row is the length of the corresponding document)
- Hidden Markov models: state-state transition count matrix
Motivations
Motivations to Study Random Count Matrices
- Lack of priors to describe random count matrices with a potentially infinite number of rows/columns.
- A naive Bayes classifier often requires a predetermined vocabulary shared across all categories, and has to ignore previously unseen features/terms. How to calculate the predictive distribution of a new count vector that brings previously unseen terms?
- Interesting combinatorial structures unique to infinite random count matrices.
- Priors for random count matrices can be used to construct priors for mixed-membership modeling.
Motivations
Representation of a count vector under a count matrix
[Figure: term-frequency bar plots of example Mac.Hardware and Politics.Guns documents; the bottom rows show new terms that do not appear in the training vocabulary.]
Motivations
Infinite random count matrices to be studied
- No natural upper bound on the number of rows or columns
- Conditionally independent rows, i.i.d. columns
- Parallel column-wise construction
- Sequential row-wise constructions
- Predictive distribution of a new row count vector that brings new features
- Random count matrices with fixed row sums for mixed-membership modeling
How to construct an infinite random count matrix?
Related prior distributions
Prior distributions for counts:
- Poisson, logarithmic, and digamma distributions
- Negative binomial, beta-negative binomial, and gamma-negative binomial distributions
- Poisson-logarithmic bivariate distribution [Zhou & Carin]
Generating a random count vector:
- Chinese restaurant process, Pitman-Yor process
- Normalized random measures with independent increments [Regazzini, Lijoi & Prünster; James, Lijoi & Prünster]
- Exchangeable partition probability functions (EPPFs) [Pitman]; size-dependent EPPFs [Zhou & Walker]
Generating an infinite random binary matrix:
- Indian buffet process [Griffiths & Ghahramani]; beta-Bernoulli process [Thibaux & Jordan]
Generating an infinite random count matrix: How?
How to construct an infinite random count matrix?
Steps to construct an infinite random count matrix
- Choose a completely random measure $G$, a draw from which consists of countably infinite atoms: $G = \sum_{k=1}^{\infty} r_k \delta_{\omega_k}$.
- For $X_j := \sum_{k=1}^{\infty} n_{jk}\delta_{\omega_k}$, draw counts $n_{jk} \sim f(r_k, \theta_j)$, where $f$ denotes a count distribution parameterized by $r_k$ and $\theta_j$.
- Denote $n_{:k} = (n_{1k}, \ldots, n_{Jk})^T$ and $n_{\cdot k} = \sum_{j=1}^{J} n_{jk}$. The count matrix $\mathbf{N}_J$ is constructed by organizing all the nonzero column count vectors, $\{n_{:k}\}_{k : n_{\cdot k} > 0}$, in an arbitrary order into a random count matrix.
- In practice, we cannot instantiate all the atoms of $G$. Thus we will have to marginalize $G$ out from $\{X_j\}_{1,J}$ to construct $\mathbf{N}_J$.
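Before marginalizing $G$ out, the generative recipe above is easy to simulate with a finite truncation of $G$. A minimal NumPy sketch, using the gamma-Poisson choice of the next slides; the truncation level `K_trunc`, the function name, and all parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_nbp_matrix(J, gamma0, c, K_trunc=2000, rng=rng):
    # Finite approximation of G ~ GammaP(G0, 1/c): K_trunc atoms with
    # weights r_k ~ Gamma(gamma0 / K_trunc, 1/c)
    r = rng.gamma(gamma0 / K_trunc, 1.0 / c, size=K_trunc)
    # n_jk ~ Pois(r_k) independently for each of the J rows
    X = rng.poisson(np.tile(r, (J, 1)))
    # keep only the nonzero columns, in arbitrary order, to form N_J
    return X[:, X.sum(axis=0) > 0]

N = truncated_nbp_matrix(J=5, gamma0=3.0, c=1.0)
```

As the truncation level grows, the number of retained columns converges in distribution to the Poisson-distributed column count of the marginal construction.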
Priors for random count matrices / Example: gamma-Poisson or negative binomial process
Gamma-Poisson process [Titsias; Zhou & Carin; Zhou et al.]
\[ X_j \sim \mathrm{PP}(G), \qquad G \sim \Gamma\mathrm{P}(G_0, 1/c) \]
Conditional likelihood:
\[ p(\{X_j\}_{1,J} \mid G) = \prod_{k=1}^{\infty} \frac{r_k^{n_{\cdot k}}\, e^{-J r_k}}{\prod_{j=1}^{J} n_{jk}!} = e^{-J\, G(\Omega\setminus\mathcal{D})} \prod_{k=1}^{K_J} \frac{r_k^{n_{\cdot k}}\, e^{-J r_k}}{\prod_{j=1}^{J} n_{jk}!} \]
To marginalize $G$ out, one may separate $\Omega$ into the absolutely continuous space and the points of discontinuity $\mathcal{D}$, and then apply the characteristic function to $G(\Omega\setminus\mathcal{D})$ and the Lévy measure of $G$ to each point of discontinuity.
The mapping from $\{X_j\}_{1,J}$ to $\mathbf{N}_J$ is one-to-$(K_J!)$, thus
\[ f(\mathbf{N}_J \mid \gamma_0, c) = \mathbb{E}_G\big[p(\{X_j\}_{1,J} \mid G)\big] \cdot K_J! \]
Priors for random count matrices / Example: gamma-Poisson or negative binomial process
Exchangeable rows and i.i.d. columns
Distribution for the count matrix:
\[ f(\mathbf{N}_J \mid \gamma_0, c) = \frac{\gamma_0^{K_J} \exp\left[-\gamma_0 \ln\left(\frac{J+c}{c}\right)\right]}{K_J!} \prod_{k=1}^{K_J} \frac{\Gamma(n_{\cdot k})}{(J+c)^{n_{\cdot k}} \prod_{j=1}^{J} n_{jk}!} \]
Row exchangeable, column i.i.d.:
\[ n_{:k} \sim \mathrm{Multinomial}(n_{\cdot k};\, 1/J, \ldots, 1/J), \qquad n_{\cdot k} \sim \mathrm{Log}[J/(J+c)], \qquad K_J \sim \mathrm{Pois}\{\gamma_0 [\ln(J+c) - \ln(c)]\} \]
Closed-form Gibbs sampling update equations for model parameters
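The column-wise characterization above yields an exact, truncation-free sampler: draw the number of columns, each column's sum, and then split each sum uniformly over rows. A minimal NumPy sketch (the inverse-CDF logarithmic sampler, function names, and parameter values are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_log(p, rng):
    # inverse-CDF draw from the logarithmic distribution Log(p),
    # with pmf -p^k / (k ln(1-p)) on k = 1, 2, ...
    u, k, cdf = rng.random(), 0, 0.0
    while cdf < u:
        k += 1
        cdf += -p**k / (k * np.log(1.0 - p))
    return k

def nbp_matrix(J, gamma0, c, rng):
    # K_J ~ Pois(gamma0 [ln(J+c) - ln c]) i.i.d. columns
    K_J = rng.poisson(gamma0 * (np.log(J + c) - np.log(c)))
    p = J / (J + c)
    cols = []
    for _ in range(K_J):
        n_col = sample_log(p, rng)  # column sum n_.k ~ Log(J/(J+c))
        # split the column sum uniformly over the J rows
        cols.append(rng.multinomial(n_col, np.full(J, 1.0 / J)))
    return np.array(cols, dtype=int).T if cols else np.zeros((J, 0), dtype=int)

N = nbp_matrix(J=4, gamma0=5.0, c=1.0, rng=rng)
```

Because the columns are i.i.d., this construction parallelizes trivially over columns.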
Priors for random count matrices / Example: gamma-Poisson or negative binomial process
Sequential row-wise construction
\[ p(\mathbf{N}^+_{J+1} \mid \mathbf{N}_J, \theta) = \frac{f(\mathbf{N}_{J+1} \mid \theta)}{f(\mathbf{N}_J \mid \theta)} \cdot \frac{K_J!\, K^+_{J+1}!}{K_{J+1}!} = \prod_{k=1}^{K_J} \mathrm{NB}\left(n_{(J+1)k};\, n_{\cdot k}, \frac{1}{J+c+1}\right) \prod_{k=K_J+1}^{K_{J+1}} \mathrm{Log}\left(n_{(J+1)k};\, \frac{1}{J+c+1}\right) \mathrm{Pois}\left\{K^+_{J+1};\, \gamma_0[\ln(J+c+1) - \ln(J+c)]\right\} \]
To add a new row to $\mathbf{N}_J \in \mathbb{Z}^{J \times K_J}$:
- First, draw a $\mathrm{NB}(n_{\cdot k}, p_{J+1})$ count at each existing column, where $p_{J+1} = 1/(J+c+1)$
- Second, draw $K^+_{J+1} \sim \mathrm{Pois}\{\gamma_0[\ln(J+c+1) - \ln(J+c)]\}$ new columns
- Third, draw a $\mathrm{Log}(p_{J+1})$ random count at each new column
The combinatorial coefficient arises as the newly added columns are inserted into the original ones at random locations, with their relative orders preserved.
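The three row-wise steps can be simulated directly. One caveat worth a comment: NumPy parameterizes the negative binomial by the success probability of the complementary outcome, so `1 - p` is passed. The helper names and example values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_log(p, rng):
    # inverse-CDF draw from the logarithmic distribution Log(p)
    u, k, cdf = rng.random(), 0, 0.0
    while cdf < u:
        k += 1
        cdf += -p**k / (k * np.log(1.0 - p))
    return k

def nbp_add_row(N, gamma0, c, rng):
    # Sequentially extend an NBP count matrix N (J rows, K_J columns) by one row
    J, K = N.shape
    p = 1.0 / (J + c + 1)
    # counts at existing columns: NB(n_.k, p); numpy draws failures before
    # r successes with per-trial success probability 1 - p
    old = rng.negative_binomial(N.sum(axis=0), 1.0 - p) if K > 0 else np.zeros(0, dtype=int)
    # number of new columns brought by the new row
    K_plus = rng.poisson(gamma0 * (np.log(J + c + 1) - np.log(J + c)))
    new = np.array([sample_log(p, rng) for _ in range(K_plus)], dtype=int)
    top = np.hstack([N, np.zeros((J, K_plus), dtype=int)])
    return np.vstack([top, np.concatenate([old, new])[None, :]])

N = np.array([[2, 0], [1, 3]])
N2 = nbp_add_row(N, gamma0=2.0, c=1.0, rng=rng)
```

Applying `nbp_add_row` repeatedly starting from an empty matrix reproduces, in distribution, the parallel column-wise construction.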
Priors for random count matrices / Example: gamma-Poisson or negative binomial process
Figure: A sequentially constructed negative binomial process random count matrix $\mathbf{N}_J \sim \mathrm{NBPM}(\gamma_0, c)$.
Priors for random count matrices / Example: gamma-negative binomial process
Gamma-negative binomial process [Zhou & Carin; Zhou et al.]
\[ X_j \sim \mathrm{NBP}(G, p_j), \qquad G \sim \Gamma\mathrm{P}(G_0, 1/c) \]
Conditional likelihood:
\[ p(\{X_j\}_{1,J} \mid G, \mathbf{p}) = \prod_{k=1}^{\infty} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_k)}{n_{jk}!\,\Gamma(r_k)}\, p_j^{n_{jk}} (1 - p_j)^{r_k} \]
Augmented likelihood:
\[ p(\{X_j, L_j\}_{1,J} \mid G, \mathbf{p}) = e^{-q_\cdot G(\Omega\setminus\mathcal{D})} \prod_{k=1}^{K_J} r_k^{l_{\cdot k}} e^{-q_\cdot r_k} \left( \prod_{j=1}^{J} |s(n_{jk}, l_{jk})| \frac{p_j^{n_{jk}}}{n_{jk}!} \right) \]
where $q_j = -\ln(1 - p_j)$ and $q_\cdot = \sum_{j=1}^{J} q_j$.
Priors for random count matrices / Example: gamma-negative binomial process
Distribution for the (augmented) count matrix:
\[ f(\mathbf{N}_J, \mathbf{L}_J \mid \theta) = \frac{\gamma_0^{K_J} \exp\left[-\gamma_0 \ln\left(\frac{c+q_\cdot}{c}\right)\right]}{K_J!} \prod_{k=1}^{K_J} \frac{\Gamma(l_{\cdot k})}{(c+q_\cdot)^{l_{\cdot k}}} \left( \prod_{j=1}^{J} |s(n_{jk}, l_{jk})| \frac{p_j^{n_{jk}}}{n_{jk}!} \right) \]
Row heterogeneity, column i.i.d.:
\[ n_{jk} = \sum_{t=1}^{l_{jk}} n_{jkt}, \quad n_{jkt} \sim \mathrm{Log}(p_j), \quad (l_{1k}, \ldots, l_{Jk}) \sim \mathrm{Mult}(l_{\cdot k};\, q_1/q_\cdot, \ldots, q_J/q_\cdot), \quad l_{\cdot k} \sim \mathrm{Log}[q_\cdot/(c+q_\cdot)], \quad K_J \sim \mathrm{Pois}\{\gamma_0[\ln(c+q_\cdot) - \ln(c)]\} \]
Closed-form Gibbs sampling update equations for model parameters.
Priors for random count matrices / Example: gamma-negative binomial process
Predictive distribution of a new row:
\[ p(\mathbf{N}^+_{J+1}, \mathbf{L}^+_{J+1} \mid \mathbf{N}_J, \mathbf{L}_J, \theta) = \frac{K_J!\, K^+_{J+1}!}{K_{J+1}!} \prod_{k=1}^{K_J} \mathrm{NB}\left(l_{(J+1)k};\, l_{\cdot k}, \frac{q_{J+1}}{c + q_\cdot + q_{J+1}}\right) \prod_{k=K_J+1}^{K_{J+1}} \mathrm{Log}\left(l_{(J+1)k};\, \frac{q_{J+1}}{c + q_\cdot + q_{J+1}}\right) \prod_{k=1}^{K_{J+1}} \mathrm{SumLog}\left(n_{(J+1)k};\, l_{(J+1)k}, p_{J+1}\right) \mathrm{Pois}\left\{K^+_{J+1};\, \gamma_0[\ln(c + q_\cdot + q_{J+1}) - \ln(c + q_\cdot)]\right\} \]
To add a new row:
- Draw $\mathrm{NB}\big(l_{\cdot k}, \frac{q_{J+1}}{c + q_\cdot + q_{J+1}}\big)$ tables at existing columns (dishes)
- Draw $K^+_{J+1} \sim \mathrm{Pois}\{\gamma_0[\ln(c + q_\cdot + q_{J+1}) - \ln(c + q_\cdot)]\}$ new dishes
- Draw a $\mathrm{Log}\big(\frac{q_{J+1}}{c + q_\cdot + q_{J+1}}\big)$ number of tables at each new dish
- Draw $\mathrm{Log}(p_{J+1})$ customers at each table and aggregate the counts across the tables of the same dish as $n_{(J+1)k} = \sum_{t=1}^{l_{(J+1)k}} n_{(J+1)kt}$
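At the level of table counts, the four steps above can be simulated directly: draw the table counts for old and new dishes, then aggregate logarithmic customer counts over the tables of each dish. A sketch under illustrative parameter values (helper and function names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_log(p, rng):
    # inverse-CDF draw from the logarithmic distribution Log(p)
    u, k, cdf = rng.random(), 0, 0.0
    while cdf < u:
        k += 1
        cdf += -p**k / (k * np.log(1.0 - p))
    return k

def gnbp_add_row(l_dot, q_dot, gamma0, c, p_new, rng):
    # One GNBP row-wise step; l_dot holds the table totals l_.k of the
    # existing columns, q_dot the accumulated q_. of the first J rows
    q_new = -np.log(1.0 - p_new)
    pi = q_new / (c + q_dot + q_new)
    # tables at existing dishes: NB(l_.k, pi); numpy flips the probability
    l_old = rng.negative_binomial(l_dot, 1.0 - pi) if len(l_dot) else np.zeros(0, dtype=int)
    # new dishes, each with a Log(pi) table count
    K_plus = rng.poisson(gamma0 * (np.log(c + q_dot + q_new) - np.log(c + q_dot)))
    l_new = np.array([sample_log(pi, rng) for _ in range(K_plus)], dtype=int)
    l_row = np.concatenate([l_old, l_new])
    # customers: aggregate Log(p_new) counts over the tables of each dish
    n_row = np.array([sum(sample_log(p_new, rng) for _ in range(l))
                      for l in l_row], dtype=int)
    return l_row, n_row

l_row, n_row = gnbp_add_row(np.array([2, 1]), q_dot=1.2, gamma0=2.0,
                            c=1.0, p_new=0.5, rng=rng)
```

Since every occupied table seats at least one customer, the aggregated count of a dish is never smaller than its table count.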
Priors for random count matrices / Example: gamma-negative binomial process
Figure: A sequentially constructed gamma-negative binomial process random count matrix $\mathbf{N}_J \sim \mathrm{GNBPM}(\gamma_0, c, p_1, \ldots, p_J)$.
Priors for random count matrices / Example: beta-negative binomial process
Beta-negative binomial process [Zhou et al.; Broderick et al.; Zhou & Carin; Heaukulani & Roy]
\[ X_j \sim \mathrm{NBP}(r_j, B), \qquad B \sim \mathrm{BP}(c, B_0) \]
Conditional likelihood:
\[ p(\{X_j\}_{1,J} \mid B, \mathbf{r}) = e^{-\tilde{p}\, r_\cdot} \prod_{k=1}^{K_J} p_k^{n_{\cdot k}} (1 - p_k)^{r_\cdot} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{n_{jk}!\,\Gamma(r_j)}, \qquad \text{where } \tilde{p} = -\sum_{k=K_J+1}^{\infty} \ln(1 - p_k) \]
Priors for random count matrices / Example: beta-negative binomial process
Distribution for the count matrix:
\[ f(\mathbf{N}_J \mid \gamma_0, c, \mathbf{r}) = \frac{\gamma_0^{K_J} e^{-\gamma_0 [\psi(c+r_\cdot) - \psi(c)]}}{K_J!} \prod_{k=1}^{K_J} \frac{\Gamma(n_{\cdot k})\Gamma(c + r_\cdot)}{\Gamma(c + n_{\cdot k} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{n_{jk}!\,\Gamma(r_j)} \]
Row heterogeneity, column i.i.d.:
\[ n_{:k} \sim \mathrm{DirMult}(n_{\cdot k};\, r_1, \ldots, r_J), \qquad n_{\cdot k} \sim \mathrm{Digam}(r_\cdot, c), \qquad K_J \sim \mathrm{Pois}\left\{\gamma_0 [\psi(c + r_\cdot) - \psi(c)]\right\} \]
where $\mathrm{Digam}(n \mid r, c) = \frac{1}{\psi(c+r) - \psi(c)} \cdot \frac{\Gamma(r+n)\Gamma(c+r)}{n\,\Gamma(c+n+r)\Gamma(r)}$.
Closed-form Gibbs sampling update equations for model parameters
Priors for random count matrices / Example: beta-negative binomial process
Ice cream buffet process (a.k.a. multi-scoop IBP [Zhou et al.] and negative binomial IBP [Heaukulani & Roy])
Sequential row-wise construction:
\[ p(\mathbf{N}^+_{J+1} \mid \mathbf{N}_J) = \frac{K_J!\, K^+_{J+1}!}{K_{J+1}!} \prod_{k=1}^{K_J} \mathrm{BNB}(n_{(J+1)k};\, r_{J+1}, n_{\cdot k}, c + r_\cdot) \prod_{k=K_J+1}^{K_{J+1}} \mathrm{Digam}(n_{(J+1)k};\, r_{J+1}, c + r_\cdot)\, \mathrm{Pois}\left\{K^+_{J+1};\, \gamma_0[\psi(c + r_\cdot + r_{J+1}) - \psi(c + r_\cdot)]\right\} \]
To add a new row:
- Customer $J+1$ takes $n_{(J+1)k} \sim \mathrm{BNB}(r_{J+1}, n_{\cdot k}, c + r_\cdot)$ scoops at each existing ice cream (column).
- The customer further selects $K^+_{J+1} \sim \mathrm{Pois}\{\gamma_0[\psi(c + r_\cdot + r_{J+1}) - \psi(c + r_\cdot)]\}$ new ice creams out of the buffet line.
- The customer takes $n_{(J+1)k} \sim \mathrm{Digam}(r_{J+1}, c + r_\cdot)$ scoops at each new ice cream.
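The buffet step can be simulated by exploiting the beta-mixture representation of the BNB distribution and sampling the digamma distribution by normalizing its pmf over a large finite support. The compact digamma-function implementation (upward recurrence plus an asymptotic series), the truncation cap, and all parameter values are our own illustrative choices, not part of the talk:

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(4)

def digamma(x):
    # psi(x) via recurrence psi(x) = psi(x+1) - 1/x, then asymptotic series
    s = 0.0
    while x < 6.0:
        s -= 1.0 / x
        x += 1.0
    return s + log(x) - 1/(2*x) - 1/(12*x**2) + 1/(120*x**4) - 1/(252*x**6)

def digam_sample(r, c, rng, cap=10000):
    # approximate Digam(r, c) draw: normalize the pmf on 1..cap
    n = np.arange(1, cap + 1)
    logw = np.array([lgamma(r + k) - lgamma(c + r + k) - log(k) for k in n])
    w = np.exp(logw - logw.max())
    return int(rng.choice(n, p=w / w.sum()))

def buffet_add_row(n_dot, r_new, r_dot, gamma0, c, rng):
    # scoops at existing ice creams: BNB(r_new, n_.k, c + r_dot),
    # drawn via p_k ~ Beta(n_.k, c + r_dot), n ~ NB(r_new, p_k)
    p = rng.beta(n_dot, c + r_dot)
    old = rng.negative_binomial(r_new, 1.0 - p)
    # new ice creams, each with a digamma-distributed number of scoops
    K_plus = rng.poisson(gamma0 * (digamma(c + r_dot + r_new) - digamma(c + r_dot)))
    new = np.array([digam_sample(r_new, c + r_dot, rng) for _ in range(K_plus)], dtype=int)
    return np.concatenate([old, new])

row = buffet_add_row(np.array([3.0, 1.0, 2.0]), r_new=1.0, r_dot=4.0,
                     gamma0=3.0, c=2.0, rng=rng)
```

The finite-support normalization is adequate here because the Digam pmf tail decays polynomially at rate $n^{-c-1}$.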
Priors for random count matrices / Example: beta-negative binomial process
Figure: A sequentially constructed beta-negative binomial process random count matrix $\mathbf{N}_J \sim \mathrm{BNBPM}(\gamma_0, c, r_1, \ldots, r_J)$.
Priors for random count matrices / Example: beta-negative binomial process
Comparison of different priors
\[ \mathrm{NBP:} \quad \mathrm{Var}[n_{(J+1)k}] = \mathbb{E}[n_{(J+1)k}] + \frac{\mathbb{E}^2[n_{(J+1)k}]}{n_{\cdot k}} \]
\[ \mathrm{GNBP:} \quad \mathrm{Var}[n_{(J+1)k}] = \frac{\mathbb{E}[n_{(J+1)k}]}{1 - p_{J+1}} + \frac{\mathbb{E}^2[n_{(J+1)k}]}{l_{\cdot k}} \]
\[ \mathrm{BNBP:} \quad \mathrm{Var}[n_{(J+1)k}] = \mathbb{E}[n_{(J+1)k}]\, \frac{n_{\cdot k} + c + r_\cdot - 1}{c + r_\cdot - 2} + \frac{\mathbb{E}^2[n_{(J+1)k}]}{n_{\cdot k}} \cdot \frac{n_{\cdot k} + c + r_\cdot - 1}{c + r_\cdot - 2} \]
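The NBP line follows directly from the moments of the negative binomial distribution; a quick Monte Carlo check under arbitrary illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)

# Under the NBP, the count at an existing column is n ~ NB(n_.k, p) with
# pmf proportional to p^n (1-p)^{n_.k}; its mean is E = n_.k p / (1-p) and
# its variance obeys Var = E + E^2 / n_.k.
n_k, p = 5, 0.3
x = rng.negative_binomial(n_k, 1.0 - p, size=200_000)  # numpy passes 1 - p
E = n_k * p / (1.0 - p)
print(x.mean(), E)               # sample mean vs. E ≈ 2.143
print(x.var(), E + E**2 / n_k)   # sample variance vs. the identity ≈ 3.061
```

The same kind of simulation check applies to the GNBP and BNBP lines once their compound representations are sampled.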
Priors for random count matrices Example: beta-negative binomial process columns rows NBP columns rows NBP columns rows NBP columns rows GNBP 7 7 columns rows GNBP 7 7 columns rows GNBP 9 columns rows BNBP 7 columns rows BNBP 7 7 columns rows BNBP / 7
Priors for random count matrices / Example: beta-negative binomial process
Training and posterior predictive checking
Figure: (a) The observed document-word count matrix; (b) a simulated NBP random count matrix; (c) a simulated GNBP random count matrix; (d) a simulated BNBP random count matrix.
Infinite vocabulary naive Bayes classifiers
Predictive distribution of a new row vector
The predictive distribution of a row vector $n_{J+1}$ is
\[ p(n_{J+1} \mid \mathbf{N}_J, \theta) = \frac{p(\mathbf{N}^+_{J+1} \mid \mathbf{N}_J, \theta)}{K^+_{J+1}!} \tag{1} \]
\[ = \frac{K_J!}{K_{J+1}!} \cdot \frac{f(\mathbf{N}_{J+1} \mid \theta)}{f(\mathbf{N}_J \mid \theta)} \tag{2} \]
The normalizing constant $1/K^+_{J+1}!$ in (1) arises because the mapping from a realization of $\mathbf{N}^+_{J+1}$ to $n_{J+1}$ is one-to-many, with $K^+_{J+1}!$ distinct orderings of these new columns. The normalizing constant $K_J!/K_{J+1}!$ in (2) arises because there are $\prod_{i=1}^{K^+_{J+1}} (K_J + i) = K_{J+1}!/K_J!$ ways to insert the $K^+_{J+1}$ new columns into the original ordered $K_J$ columns, which is again a one-to-many mapping.
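The column-insertion count can be verified numerically for small matrices (the values below are arbitrary):

```python
from math import factorial

# Inserting K_plus new columns, one at a time, into K_J ordered existing
# columns while preserving the new columns' relative order gives
# (K_J + 1)(K_J + 2) ... (K_J + K_plus) = K_{J+1}! / K_J! arrangements.
K_J, K_plus = 4, 3
ways = 1
for i in range(1, K_plus + 1):
    ways *= K_J + i
assert ways == factorial(K_J + K_plus) // factorial(K_J)
print(ways)  # 5 * 6 * 7 = 210
```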
Infinite vocabulary naive Bayes classifiers
- Each category is summarized as a random count matrix $\mathbf{N}_J$; columns with all zeros are excluded.
- Gibbs sampling is used to infer the parameters $\theta$ that generate $\mathbf{N}_J$; to represent the posterior of $\theta$, $S$ MCMC samples $\{\theta^{[s]}\}_{1,S}$ are collected.
- For a testing row count vector $n_{J+1}$, its predictive likelihood given $\mathbf{N}_J$ is calculated via Monte Carlo integration using
\[ p(n_{J+1} \mid \mathbf{N}_J) = \frac{1}{S} \sum_{s=1}^{S} \frac{p(\mathbf{N}^+_{J+1} \mid \mathbf{N}_J, \theta^{[s]})}{K^+_{J+1}!} \]
for both the NBP and BNBP, and using
\[ p(n_{J+1} \mid \mathbf{N}_J) = \frac{1}{S} \sum_{s=1}^{S} \frac{p(\mathbf{N}^+_{J+1} \mid \mathbf{N}_J, \mathbf{L}_J^{[s]}, \theta^{[s]})}{K^+_{J+1}!} \]
for the GNBP.
Infinite vocabulary naive Bayes classifiers
Infinite vocabulary naive Bayes classifiers
[Figure: the Mac.Hardware / Politics.Guns term-frequency illustration from the Motivations section; test documents bring previously unseen terms.]
Infinite vocabulary naive Bayes classifiers
Figure: Document categorization results on the 20 Newsgroups dataset with (a) an unconstrained vocabulary that can grow to infinity, and (b) a predetermined finite vocabulary of size $V$, using the negative binomial process (NBP), gamma-negative binomial process (GNBP), and beta-negative binomial process (BNBP). The results of the multinomial naive Bayes classifier using Laplace smoothing are included for comparison. Accuracy is plotted against the ratio of training documents.
Infinite vocabulary naive Bayes classifiers
Figure: Plots analogous to those in the previous figure, for the TDT dataset with a predetermined finite vocabulary of size $V$.
Infinite vocabulary naive Bayes classifiers
Figure: (a) The predicted probabilities of the test documents under different categories for the CNAE-9 dataset, using the GNBP nonparametric Bayesian naive Bayes classifier, with a fixed percentage of the documents of each of the nine categories used for training. (b) Boxplots of the categorization accuracies; each accuracy is computed with a varying number $S$ of MCMC samples.
Random count matrices and mixed-membership modeling
Beta-negative binomial process (BNBP) mixed-membership modeling
Construct EPPFs for mixture modeling using priors for random count vectors [Zhou & Walker]
One way to generate a random count vector $(n_1, \ldots, n_l)$:
- Draw $l$, the length of the vector, and then draw independent positive random counts $\{n_k\}_{1,l}$.
Another way to generate such a random count vector:
- Draw a total count $n$, and partition it using an EPPF, resulting in a set of exchangeable categorical variables $z = (z_1, \ldots, z_n)$. Map $z$ to a random positive count vector $(n_1, \ldots, n_l)$, where $n_k := \sum_{i=1}^{n} \delta(z_i = k) > 0$.
Both ways lead to the same distribution for $(n_1, \ldots, n_l)$ if and only if
\[ P(n_1, \ldots, n_l, n) = \frac{n!}{l! \prod_{k=1}^{l} n_k!}\, P(z, n) \]
(Sample size dependent) EPPF for mixture modeling:
\[ P(z \mid n) = \frac{P(z, n)}{P(n)} = \frac{l! \prod_{k=1}^{l} n_k!}{n!} \cdot \frac{P(n_1, \ldots, n_l, n)}{P(n)} \]
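The map from $z$ to a positive count vector, with clusters indexed by order of appearance, is the only algorithmic ingredient in the second construction; a minimal sketch:

```python
def z_to_counts(z):
    """Map a sequence of exchangeable categorical labels z to the positive
    count vector (n_1, ..., n_l), clusters indexed by order of appearance."""
    index, counts = {}, []
    for zi in z:
        if zi not in index:       # first time this label appears
            index[zi] = len(counts)
            counts.append(0)
        counts[index[zi]] += 1
    return counts

print(z_to_counts(['a', 'b', 'a', 'c', 'a']))  # [3, 1, 1]
```

Every count in the output is positive by construction, matching the requirement $n_k > 0$.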
Random count matrices and mixed-membership modeling
Beta-negative binomial process (BNBP) mixed-membership modeling
Construct EPPFs for mixed-membership modeling using priors for random count matrices [Zhou]
BNBP random count matrix prior:
\[ f(\mathbf{N}_J \mid \mathbf{r}, \gamma_0, c) = \frac{\gamma_0^{K_J} e^{-\gamma_0 [\psi(c+r_\cdot) - \psi(c)]}}{K_J!} \prod_{k=1}^{K_J} \frac{\Gamma(n_{\cdot k})\Gamma(c+r_\cdot)}{\Gamma(c+n_{\cdot k}+r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk}+r_j)}{n_{jk}!\,\Gamma(r_j)} \]
With $z = (z_{11}, \ldots, z_{J m_J})$ and $n_{jk} = \sum_{i=1}^{m_j} \delta(z_{ji} = k)$, the joint distribution of a column count vector $m = (m_1, \ldots, m_J)^T$ and its partition into a column exchangeable latent random count matrix with $K_J$ nonempty columns can be expressed as
\[ f(z, m \mid \mathbf{r}, \gamma_0, c) = K_J! \frac{\prod_{j=1}^{J} \prod_{k=1}^{K_J} n_{jk}!}{\prod_{j=1}^{J} m_j!}\, f(\mathbf{N}_J \mid \mathbf{r}, \gamma_0, c) = \frac{\gamma_0^{K_J} e^{-\gamma_0[\psi(c+r_\cdot) - \psi(c)]}}{\prod_{j=1}^{J} m_j!} \prod_{k=1}^{K_J} \frac{\Gamma(n_{\cdot k})\Gamma(c + r_\cdot)}{\Gamma(c + n_{\cdot k} + r_\cdot)} \prod_{j=1}^{J} \frac{\Gamma(n_{jk} + r_j)}{\Gamma(r_j)} \]
Random count matrices and mixed-membership modeling
Beta-negative binomial process (BNBP) mixed-membership modeling
The BNBP's EPPF for mixed-membership modeling:
\[ f(z \mid m, \mathbf{r}, \gamma_0, c) = \frac{f(z, m \mid \mathbf{r}, \gamma_0, c)}{f(m \mid \mathbf{r}, \gamma_0, c)} = K_J! \frac{\prod_{j=1}^{J} \prod_{k=1}^{K_J} n_{jk}!}{\prod_{j=1}^{J} m_j!} \cdot \frac{f(\mathbf{N}_J \mid \mathbf{r}, \gamma_0, c)}{f(m \mid \mathbf{r}, \gamma_0, c)} \]
The prediction rule is simple:
\[ P(z_{ji} = k \mid z^{-ji}, m, \mathbf{r}, \gamma_0, c) = \frac{f(z_{ji} = k, z^{-ji}, m \mid \mathbf{r}, \gamma_0, c)}{\sum_{k'=1}^{K_J^{-ji}+1} f(z_{ji} = k', z^{-ji}, m \mid \mathbf{r}, \gamma_0, c)} \propto \begin{cases} \dfrac{n_{\cdot k}^{-ji}}{c + n_{\cdot k}^{-ji} + r_\cdot}\, (n_{jk}^{-ji} + r_j), & \text{for } k = 1, \ldots, K_J^{-ji}; \\[2mm] \dfrac{\gamma_0\, r_j}{c + r_\cdot}, & \text{if } k = K_J^{-ji} + 1. \end{cases} \]
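The prediction rule translates directly into one step of a collapsed Gibbs sampler for a single token. The function below evaluates the two cases; all counts are assumed to already exclude token $(j, i)$, and the function name and example values are our own:

```python
import numpy as np

rng = np.random.default_rng(6)

def bnbp_token_probs(n_col, n_jk, r_j, r_dot, gamma0, c):
    """Unnormalized Gibbs probabilities for z_ji over the existing clusters
    plus one new cluster. n_col: column totals n_.k^{-ji}; n_jk: within-group
    counts n_jk^{-ji}; both exclude token (j, i)."""
    existing = n_col / (c + n_col + r_dot) * (n_jk + r_j)  # existing clusters
    new = gamma0 * r_j / (c + r_dot)                       # brand-new cluster
    return np.append(existing, new)

probs = bnbp_token_probs(n_col=np.array([10.0, 5.0, 1.0]),
                         n_jk=np.array([4.0, 0.0, 1.0]),
                         r_j=1.0, r_dot=3.0, gamma0=2.0, c=1.0)
z_ji = rng.choice(len(probs), p=probs / probs.sum())  # sample the new label
```

One full Gibbs sweep applies this update to every token in every group, deleting clusters that become empty.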
Random count matrices and mixed-membership modeling
Beta-negative binomial process (BNBP) mixed-membership modeling
Random count matrices with fixed row sums
Figure: Random draws from the EPPF that governs the BNBP's exchangeable random partitions of groups (rows), each with a fixed number of data points. The jth row of each matrix, which sums to $m_j$, represents the partition of the $m_j$ data points of the jth group over a random number of exchangeable clusters. The kth column of each matrix represents the kth nonempty cluster in order of appearance in Gibbs sampling (the empty clusters are deleted). Panels (a)-(c) correspond to different values of $r_i$.
Random count matrices and mixed-membership modeling
Gamma-negative binomial process (GNBP) mixed-membership modeling
GNBP random count matrix prior:
\[ f(\mathbf{N}_J, \mathbf{L}_J \mid \gamma_0, c, \mathbf{p}) = \frac{\gamma_0^{K_J} \exp\left[-\gamma_0 \ln\left(\frac{c+q_\cdot}{c}\right)\right]}{K_J!} \prod_{k=1}^{K_J} \frac{\Gamma(l_{\cdot k})}{(c+q_\cdot)^{l_{\cdot k}}} \prod_{j=1}^{J} |s(n_{jk}, l_{jk})| \frac{p_j^{n_{jk}}}{n_{jk}!} \]
With $z = (z_{11}, \ldots, z_{J m_J})$, $b = (b_{11}, \ldots, b_{J m_J})$, and $n_{jkt} = \sum_{i=1}^{m_j} \delta(z_{ji} = k, b_{ji} = t)$, the joint distribution of a column count vector $m = (m_1, \ldots, m_J)^T$, its partition into a column exchangeable latent random count matrix with $K_J$ nonempty columns, and an auxiliary categorical random vector can be expressed as
\[ f(b, z, m \mid \gamma_0, c, \mathbf{p}) = \gamma_0^{K_J} e^{-\gamma_0 \ln\left(\frac{c+q_\cdot}{c}\right)} \prod_{j=1}^{J} \frac{p_j^{m_j}}{m_j!} \prod_{k=1}^{K_J} \frac{\Gamma(l_{\cdot k})}{(c+q_\cdot)^{l_{\cdot k}}} \prod_{j=1}^{J} \prod_{t=1}^{l_{jk}} \Gamma(n_{jkt}) \]
Random count matrices and mixed-membership modeling
Gamma-negative binomial process (GNBP) mixed-membership modeling
The GNBP's EPPF for mixed-membership modeling:
\[ f(z, b \mid m, \gamma_0, c, \mathbf{p}) = \frac{f(z, b, m \mid \gamma_0, c, \mathbf{p})}{f(m \mid \gamma_0, c, \mathbf{p})} \]
The prediction rule is simple:
\[ P(z_{ji} = k, b_{ji} = t \mid b^{-ji}, z^{-ji}, m, \mathbf{p}, c) = \frac{f(z_{ji} = k, b_{ji} = t, b^{-ji}, z^{-ji}, m \mid \mathbf{p}, c)}{\sum_{z_{ji}, b_{ji}} f(z_{ji}, b_{ji}, b^{-ji}, z^{-ji}, m \mid \mathbf{p}, c)} \propto \begin{cases} n_{jkt}^{-ji}, & \text{if } k \le K_J^{-ji},\; t \le l_{jk}^{-ji}; \\[1mm] l_{\cdot k}^{-ji}/(c + q_\cdot), & \text{if } k \le K_J^{-ji},\; t = l_{jk}^{-ji} + 1; \\[1mm] \gamma_0/(c + q_\cdot), & \text{if } k = K_J^{-ji} + 1,\; t = 1. \end{cases} \]
If we let $z_{ji}$ be the dish index and $b_{ji}$ be the table index for customer $i$ in restaurant $j$, then the collapsed Gibbs sampler can be related to the Chinese restaurant franchise sampler of the hierarchical Dirichlet process (Teh et al.).
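The three cases mirror the Chinese restaurant franchise: sit at an occupied table, open a new table at an existing dish, or order a new dish. A sketch that enumerates the unnormalized weights for one token; the data layout (a list of per-table count arrays), function name, and example values are our own assumptions:

```python
def gnbp_token_options(tables, l_col, gamma0, c, q_dot):
    """Enumerate (dish k, table t, unnormalized weight) triples for one token.
    tables[k]: per-table counts n_jkt of dish k in group j; l_col[k]: total
    table count l_.k across groups; all counts exclude the token itself."""
    options = []
    for k, tbl in enumerate(tables):
        for t, n_jkt in enumerate(tbl):
            options.append((k, t, float(n_jkt)))               # occupied table
        options.append((k, len(tbl), l_col[k] / (c + q_dot)))  # new table, old dish
    options.append((len(tables), 0, gamma0 / (c + q_dot)))     # brand-new dish
    return options

# group j has two dishes: dish 0 with tables of sizes 3 and 1, dish 1 with one table
opts = gnbp_token_options(tables=[[3, 1], [2]], l_col=[4.0, 2.0],
                          gamma0=2.0, c=1.0, q_dot=1.5)
```

Normalizing the third components of `opts` and sampling one triple gives the joint update of $(z_{ji}, b_{ji})$.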
Conclusions
Conclusions
- A family of probability mass functions for random count matrices.
- The proposed random count matrices have a random number of i.i.d. columns and can also be constructed by adding one row at a time.
- Their parameters can be inferred with closed-form Gibbs sampling update equations.
- Infinite vocabulary naive Bayes classifiers.
- Priors for random count matrices can be used to construct (group size dependent) EPPFs for mixed-membership modeling, with simple prediction rules for collapsed Gibbs sampling.
Conclusions
Main References
- M. Zhou, O. H. M. Padilla, and J. G. Scott. Priors for random count matrices derived from a family of negative binomial processes. arXiv preprint.
- M. Zhou. Beta-negative binomial process and exchangeable random partitions for mixed-membership modeling. In NIPS.
- M. Zhou and S. G. Walker. Sample size dependent species models. arXiv preprint.
- C. Heaukulani and D. M. Roy. The combinatorial structure of beta negative binomial processes. arXiv preprint.
- T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. IEEE Trans. Pattern Analysis and Machine Intelligence.
- M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Analysis and Machine Intelligence.
- M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In NIPS.
- M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS.