Clustering bi-partite networks using collapsed latent block models
Jason Wyse, Nial Friel & Pierre Latouche
Insight at UCD; Laboratoire SAMM, Université Paris 1
Mail: jason.wyse@ucd.ie
Insight Latent Space workshop, Friday 17th January
1 / 29
Bi-partite networks
Consider an observed bi-partite network with clubs 1, ..., c and members 1, ..., m.
Adjacency matrix Y such that
    Y_ij = 1 if member i is in club j, and 0 otherwise.
Assume binary-valued ties for the moment.
Bi-partite networks
Movie-Lens data: 943 users, 1682 movies; movies rated/not rated.
[Figure: 943 x 1682 users-by-movies adjacency matrix.]
Here movies play the role of clubs and users the role of members.
Bi-partite networks
Is there clustering of members and clubs?
Identify groups of members with similar linking attributes to groups of clubs, should they exist, and vice versa.
Linking attribute: a random variable describing a tie (e.g. Bernoulli for Movie-Lens; can be count- or continuous-valued).
Model these groups using the same probability distribution for linking attributes within a group.
Rest of talk...
Using the latent block model for bi-partite network modelling
Using the integrated classification likelihood for model selection
A greedy search algorithm for model selection
Applications
Latent block model
Assume there are K member groups (rows) and G club groups (columns).
For a member i in group k, the linking attribute to club j in group g is modelled by p(y_ij | θ_kg).
In this talk, for the most part we'll assume binary links:
    p(y_ij | θ_kg) = θ_kg^{y_ij} (1 - θ_kg)^{1 - y_ij}
Latent block model
Latent block model: consider the generative model for Y_ij
    Label z_i generated from (1, ..., K) with weights (ω_1, ..., ω_K)
    Label w_j generated from (1, ..., G) with weights (ρ_1, ..., ρ_G)
    Conditional on z_i and w_j, Y_ij is generated from the link model with parameter θ_{z_i w_j}:
        Y_ij ~ p(· | θ_{z_i w_j})
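The generative steps above can be sketched in a few lines of Python with NumPy (the slides contain no code, so this and the function name `simulate_lbm` are illustrative, not the authors' implementation):

```python
import numpy as np

def simulate_lbm(m, c, omega, rho, theta, seed=0):
    """Draw one adjacency matrix from the latent block model:
    row labels z from weights omega, column labels w from weights rho,
    then each tie Y_ij ~ Bernoulli(theta[z_i, w_j])."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(omega), size=m, p=omega)   # member (row) labels
    w = rng.choice(len(rho), size=c, p=rho)       # club (column) labels
    theta = np.asarray(theta)
    Y = rng.binomial(1, theta[np.ix_(z, w)])      # binary ties, blockwise rates
    return Y, z, w

Y, z, w = simulate_lbm(m=100, c=60,
                       omega=[0.5, 0.5], rho=[0.3, 0.7],
                       theta=[[0.9, 0.1], [0.1, 0.8]])
```

With well-separated θ values like these, the simulated matrix shows a clear block structure once rows and columns are sorted by their labels.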
Latent block model
See Govaert & Nadif (2008) for full details.
Let z be a label vector such that z_i = k if row (user) i is in row group k. Similarly let w_j be labels for the columns (movies) j.
The likelihood of observing the adjacency matrix Y can be written as a sum over all latent partitions:
    p(Y | K, G, θ, ω, ρ) = Σ_{(z,w) ∈ Z × W} p(z, w | ω, ρ) p(Y | z, w, θ, K, G)
This is intractable, so we work with the likelihood completed with the labels.
Latent block model
Assume row and column allocations are independent a priori:
    p(z, w | ω, ρ, K, G) = p(z | ω, K) p(w | ρ, G)
                         = ( ∏_{i=1}^m ∏_{k=1}^K ω_k^{I(z_i = k)} ) ( ∏_{j=1}^c ∏_{g=1}^G ρ_g^{I(w_j = g)} )
Assume local independence of the entries of the adjacency matrix conditional on the labels:
    p(Y | z, w, θ, K, G) = ∏_{k=1}^K ∏_{g=1}^G ∏_{i: z_i = k} ∏_{j: w_j = g} p(y_ij | θ_kg)
Task: find the clustering via the two label vectors and also infer the number of groups for the clustering.
Latent block model
See Govaert & Nadif (2008) for full details.
Mixture weights ω and labels z for the row clustering; mixture weights ρ and labels w for the column clustering:
    p(z, w | ω, ρ, K, G) = ∏_{k=1}^K ω_k^{m_k} ∏_{g=1}^G ρ_g^{c_g}
    p(Y | z, w, θ, K, G) = ∏_{k=1}^K ∏_{g=1}^G ∏_{i: z_i = k} ∏_{j: w_j = g} p(y_ij | θ_kg)
using a local independence assumption; here m_k and c_g count the rows in group k and the columns in group g.
Loosely speaking, a latent mixture model on rows and columns; use the latent mixture to infer K and G.
Latent block model
Priors:
    ω | K ~ Dirichlet(α, ..., α)
    ρ | G ~ Dirichlet(β, ..., β)
    p(θ | K, G)
Note that here we condition on K and G, which are generally not known in practice.
For the latent block model, Wyse & Friel (2012) used collapsing and MCMC schemes for the choice of K and G by assuming p(θ | K, G) is fully conjugate to p(y_ij | θ_kg), i.e. integrating out ω, ρ and θ analytically.
Integrated classification likelihood
Consider the integrated complete-data log likelihood, giving rise to the ICL criterion:
    log p(Y, z, w | K, G) = log ∫ p(Y, z, w, ω, ρ, θ | K, G) dω dρ dθ
                          = log p(Y | z, w, K, G) + log p(z, w | K, G)
                          = ICL(z, w, K, G)
where
    log p(Y | z, w, K, G) = log ∫ p(Y | z, w, θ, K, G) p(θ | K, G) dθ
    log p(z, w | K, G) = log ∫ p(z, w | ω, ρ, K, G) p(ω, ρ | K, G) dρ dω
Integrated classification likelihood
The ICL criterion can be used for selecting the numbers of clusters K and G; larger values of the ICL are more favourable.
As shown by McDaid et al. (2013), collapsing can be performed for stochastic block models with fairly standard prior assumptions.
Côme and Latouche (2013) use a greedy search on the exact ICL to find the number of stochastic blocks as well as the block memberships.
The advantage of such approaches is that they may perform better than competing MCMC schemes, e.g. MCMC can have poor mixing and require a very large number of iterations on larger networks.
ICL greedy search
We can use a scheme very similar to Côme and Latouche (2013) to find the numbers of clusters K and G for the bi-partite network.
Assume that
    p(θ | K, G) = ∏_{k=1}^K ∏_{g=1}^G p(θ_kg)
and also that
    ∫ p(θ_kg) ∏_{i: z_i = k} ∏_{j: w_j = g} p(y_ij | θ_kg) dθ_kg
can be computed exactly (standard conjugate prior). Then log p(Y | z, w, K, G) can be computed exactly.
We can compute log p(z, w | K, G) exactly too, giving the exact ICL criterion.
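Under the Beta-Bernoulli conjugacy used for binary ties, the per-block integral has a closed form in Beta functions. A minimal sketch (function names are hypothetical):

```python
from math import lgamma, log

def log_beta(a, b):
    """log B(a, b) computed via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def block_log_marginal(s, n, a=1.0, b=1.0):
    """Exact log of the integral of theta^s (1-theta)^(n-s) against a
    Beta(a, b) prior, for one (k, g) block with s ties among n entries."""
    return log_beta(a + s, b + n - s) - log_beta(a, b)
```

Summing `block_log_marginal` over all K x G blocks gives log p(Y | z, w, K, G) exactly, which is what makes the exact ICL available without MCMC.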
ICL greedy search
The scheme we use is applied alternately to the rows and columns of our adjacency matrix.
First initialize the labels z, w, choosing conservative (larger than needed) values K_max, G_max for K and G.
The greedy search algorithm iteratively reallocates members and clubs and merges existing clusters so as to maximize the ICL.
ICL greedy search
Randomly scan the rows. Take member i with z_i = k. Compute the change in ICL for moving member i to cluster l ≠ k:
    Δ_{k→l} = ICL(z', w, K, G) − ICL(z, w, K, G),
where z' equals z except that z'_i = l, and we take Δ_{k→k} = 0.
Move member i to the cluster l that gives the largest change in ICL. If all Δ_{k→l} are negative, leave i where it is.
ICL greedy search
If taking member i from cluster k would leave it empty, we instead compute the differences as
    Δ_{k→l} = ICL(z', w, K − 1, G) − ICL(z, w, K, G).
This is the process by which clusters disappear as the greedy search progresses.
The same process is applied to the clubs. The greedy search terminates when no further moves can increase the ICL.
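One row sweep of the greedy search can be sketched as follows. This is an illustrative, unoptimised version: it recomputes the full ICL for every candidate move and omits the empty-cluster (K − 1) case, whereas the actual algorithm uses cheap incremental differences. All function names here are hypothetical.

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def icl(Y, z, w, K, G, a=1.0, b=1.0, alpha=1.0, beta=1.0):
    """Exact ICL: Beta-Bernoulli block marginals plus
    Dirichlet-multinomial terms for the row and column labels."""
    total = 0.0
    for k in range(K):
        ri = np.where(z == k)[0]
        for g in range(G):
            cj = np.where(w == g)[0]
            block = Y[np.ix_(ri, cj)]
            s, n = int(block.sum()), block.size
            total += log_beta(a + s, b + n - s) - log_beta(a, b)
    for labels, H, conc in ((z, K, alpha), (w, G, beta)):
        total += lgamma(H * conc) - lgamma(H * conc + len(labels))
        for h in range(H):
            total += lgamma(conc + int((labels == h).sum())) - lgamma(conc)
    return total

def greedy_row_sweep(Y, z, w, K, G, rng):
    """Visit rows in random order; move each row to the group that most
    increases the ICL, or leave it in place if no move improves it."""
    for i in rng.permutation(len(z)):
        base = icl(Y, z, w, K, G)
        best_l, best_delta = z[i], 0.0
        for l in range(K):
            if l == z[i]:
                continue
            old, z[i] = z[i], l
            delta = icl(Y, z, w, K, G) - base   # Delta_{k -> l}
            z[i] = old
            if delta > best_delta:
                best_delta, best_l = delta, l
        z[i] = best_l
    return z
```

Because a move is only accepted when its Δ is positive, the ICL is non-decreasing over sweeps, which is what guarantees termination of the search.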
Greedy search pruning
After a few full sweeps of the data, we may already expect a good deal of clustering.
Updating each row requires O(c_M K G) computation, with c_M the average cost of computing a marginal likelihood.
Reduce this cost by pruning off unlikely clusters: low probabilities of being reassigned from cluster k to l correspond to large negative differences in the exact ICL.
Greedy search pruning
For the rows, the full conditional for the label of row i can be written
    π(z_i = l | everything else) = exp{Δ_{k→l}} / Σ_{l'=1}^K exp{Δ_{k→l'}},
where k is the allocation of row i from the previous iteration.
Of most interest is when π(z_i = k | everything else) is large compared with the other groups, i.e.
    π(z_i = k | everything else) > 1 − δ with δ small ⇒ strong cohesion to group k.
Greedy search pruning
Prune off clusters with a very small full conditional probability compared with cluster k*, where k* gives the maximum change in ICL (possibly k itself). Consider clusters pairwise:
    exp{Δ_{k→k*}} / ( exp{Δ_{k→k*}} + exp{Δ_{k→l}} ) > 1 − δ,
or equivalently
    Δ_{k→k*} − Δ_{k→l} > log[(1 − δ)/δ];
then prune off cluster l from the search options in future iterations.
We take log[(1 − δ)/δ] = 150, which implies a very small δ.
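To see how conservative this threshold is, a quick check of the δ implied by log[(1 − δ)/δ] = 150:

```python
import math

# Pruning rule: drop cluster l from future consideration for a row once
# the best candidate k* beats it by more than log[(1 - delta)/delta] in ICL.
threshold = 150.0                          # value used in the talk
delta = 1.0 / (1.0 + math.exp(threshold))  # implied delta, astronomically small
```

Since δ is of order 10^-66, a cluster is only pruned when its reassignment probability is negligible by any practical standard, so the speed-up costs essentially nothing in accuracy at this setting.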
Sparse storage
Store only the present ties and their positions in triplet form; useful for sparse networks.
Then we can vastly reduce the computations over the no-tie Y_ij's:
    ∫ π(θ_kg) ∏_{i: z_i = k} ∏_{j: w_j = g} p(y_ij | θ_kg) dθ_kg
        = ∫ π(θ_kg) p(no-tie | θ_kg)^{n_kg − s_kg} ∏_{stored ties (i,j) in block (k,g)} p(y^s_ij | θ_kg) dθ_kg,
where n_kg is the number of entries in block (k, g) and s_kg the number of stored ties falling in it.
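A sketch of the tie-only computation for the binary (Beta-Bernoulli) case, with a dense version alongside for comparison; names such as `block_log_marginal_sparse` are illustrative:

```python
import numpy as np
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def block_log_marginal_sparse(tie_i, tie_j, z, w, k, g, a=1.0, b=1.0):
    """Beta-Bernoulli marginal for block (k, g) touching only the stored
    ties: s counts the ties falling in the block; the no-tie entries
    enter only through the block's total size n."""
    s = int(((z[tie_i] == k) & (w[tie_j] == g)).sum())
    n = int((z == k).sum()) * int((w == g).sum())
    return log_beta(a + s, b + n - s) - log_beta(a, b)

# Small simulated network and its tie positions in coordinate form
rng = np.random.default_rng(0)
Y = rng.binomial(1, 0.2, size=(30, 20))
z = rng.integers(0, 2, size=30)
w = rng.integers(0, 3, size=20)
tie_i, tie_j = np.nonzero(Y)               # positions of the present ties

def dense_value(k, g, a=1.0, b=1.0):
    """Same quantity computed from the full dense matrix, for checking."""
    block = Y[np.ix_(np.where(z == k)[0], np.where(w == g)[0])]
    s, n = int(block.sum()), block.size
    return log_beta(a + s, b + n - s) - log_beta(a, b)
```

The sparse version does work proportional to the number of ties rather than to m x c, which is where the savings come from on sparse networks like Movie-Lens.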
Models
Depending on the type of ties in the observed network, one has a choice of assumed models that still allow the ICL to be computed exactly:
    p(y_ij | θ_kg)    p(θ_kg)
    Binomial          Beta
    Multinomial       Dirichlet
    Poisson           Gamma
    Gaussian          Gaussian-Gamma
This allows for probabilistic modelling of richer network information than tie/no-tie, if available.
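As an example of another conjugate pair from the table, the Poisson-Gamma marginal for count-valued ties also has a closed form. A sketch; the shape-rate parameterisation of the Gamma prior is an assumption here:

```python
from math import lgamma, log

def poisson_gamma_log_marginal(counts, a=1.0, b=1.0):
    """Exact log marginal likelihood of count-valued ties in one block:
    Poisson(theta) likelihood with conjugate Gamma(a, b) prior
    (shape a, rate b), integrating theta out in closed form."""
    n, s = len(counts), sum(counts)
    out = a * log(b) - lgamma(a)                 # prior normalising constant
    out += lgamma(a + s) - (a + s) * log(b + n)  # posterior Gamma integral
    out -= sum(lgamma(y + 1) for y in counts)    # product of y_i! normalisers
    return out
```

Swapping this in for the Beta-Bernoulli marginal is all that changes in the greedy search when the ties are counts rather than binary.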
Applications: four algorithms
There are four possible algorithms available to us:
    Algorithm    Pruning    Sparse form
    A0           No         No
    A1           No         Yes
    A2           Yes        No
    A3           Yes        Yes
In terms of speed we would expect A3 to be fastest and A0 to be slowest on large data.
Applications: congressional voting
We applied the ICL search to the UCI congressional voting data analysed in Wyse and Friel (2011) (abstain = nay for our purposes): 435 congressmen (members) voting on 16 key issues (clubs).
Number of groups found: K = 6, G = 11.
Little difference between the four algorithms (speed & max ICL).
Applications: congressional voting (A0)
A closer look at the randomness introduced by randomly processing the rows: 100 runs of the algorithm gave the maximum ICLs reached.
[Figure: histogram of the maximum ICL values over the 100 runs.]
Algorithm run times averaged 0.6 of a second.
This is in contrast to the 1 hour it took the Wyse and Friel (2011) algorithm to generate 100,000 posterior samples of the clustering (inefficient).
Applications: Movie-Lens 100k data
Start the four algorithms A0-A3 with the same random seed; this allows a direct comparison.
    Algorithm    maximum ICL    time (sec)    (K, G)
    A0           -225670.9      183.8         (49, 40)
    A1           -225670.9      69.8          (49, 40)
    A2           -225670.9      134.3         (49, 40)
    A3           -225670.9      51.8          (49, 40)
All algorithms reach the same result from the same starting position. However, we see a marked speed-up from using sparse forms (A1 & A3) and pruning (A2 & A3).
Pruning can give a faster run with a looser threshold, but this can introduce error.
Applications: Movie-Lens 100k data
[Figure: re-ordered 943 x 1682 users-by-movies adjacency matrix.]
Identified 49 user and 40 movie clusters.
MCMC is practically infeasible for even this size of matrix.
In problems like this, we see that making use of sparsity gives good savings.
Conclusion/Further work
The ICL greedy search can be much more scalable than MCMC while giving similar conclusions.
Scalability can be improved even further by exploiting sparsity and other ideas (e.g. pruning off bad clusters).
Ceilings on the number of rows/columns that are manageable need investigation.
Convergence results for the greedy search and investigation of other search strategies would be desirable.
Any suggestions?
References
Govaert & Nadif (2008). Block clustering with Bernoulli mixture models: comparison of different approaches. Computational Statistics and Data Analysis 52, 3233-3245.
Côme & Latouche (2013). Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. arXiv:1303.2962v1.
McDaid, Murphy, Friel & Hurley (2013). Improved Bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60, 12-31.
Wyse & Friel (2012). Block clustering with collapsed latent block models. Statistics and Computing 22, 415-428.