Clustering bi-partite networks using collapsed latent block models

Clustering bi-partite networks using collapsed latent block models
Jason Wyse, Nial Friel & Pierre Latouche
Insight at UCD; Laboratoire SAMM, Université Paris 1
Mail: jason.wyse@ucd.ie
Insight Latent Space workshop, Friday 17th January

Bi-partite networks

Consider an observed bi-partite network:
Clubs: 1, ..., c
Members: 1, ..., m
Adjacency matrix Y such that
$$Y_{ij} = \begin{cases} 1 & \text{if member } i \text{ is in club } j \\ 0 & \text{otherwise.} \end{cases}$$
Assume binary valued ties for the moment.

Bi-partite networks: Movie-Lens data

Movie-Lens data: 943 users, 1682 movies. Ties: movies rated / not rated. Movies play the role of clubs, users the role of members.
[Figure: the users-by-movies adjacency matrix, with Users on the vertical axis and Movies on the horizontal axis.]

Bi-partite networks

Is there clustering of members and clubs? Identify groups of members with similar linking attributes to groups of clubs, should they exist, and vice versa.
Linking attribute: a random variable describing a tie (e.g. Bernoulli for Movie-Lens; it can also be count or continuous valued).
Model these groups using the same probability distribution for linking attributes within a group.

Rest of talk...
- Using the latent block model for bi-partite network modelling
- Using the integrated classification likelihood for model selection
- A greedy search algorithm for model selection
- Applications

Latent block model

Assume there are K member groups (rows) and G club groups (columns). For a member i in group k, the linking attribute to club j in group g is modelled by $p(Y_{ij} \mid \theta_{kg})$.
In this talk, for the most part we'll assume binary links:
$$p(Y_{ij} \mid \theta_{kg}) = \theta_{kg}^{Y_{ij}} (1 - \theta_{kg})^{1 - Y_{ij}}$$

Latent block model

Latent block model: consider the generative model for $Y_{ij}$:
- Label $z_i$ generated from $(1, \ldots, K)$ with weights $(\omega_1, \ldots, \omega_K)$
- Label $w_j$ generated from $(1, \ldots, G)$ with weights $(\rho_1, \ldots, \rho_G)$
- Conditioning on $z_i$ and $w_j$, $Y_{ij}$ is generated from the model for links with parameter $\theta_{z_i w_j}$:
$$Y_{ij} \sim p(\cdot \mid \theta_{z_i w_j}).$$
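A minimal simulation sketch of this generative process, assuming Bernoulli ties as above (the function name and the toy parameter values are illustrative, not from the talk):

```python
import numpy as np

def simulate_lbm(m, c, omega, rho, theta, rng=None):
    """Sketch: draw one bipartite adjacency matrix from the latent block model.

    m, c       : number of members (rows) and clubs (columns)
    omega, rho : row- and column-group weights, lengths K and G
    theta      : K x G matrix of Bernoulli tie probabilities theta_kg
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice(len(omega), size=m, p=omega)   # row labels z_i
    w = rng.choice(len(rho), size=c, p=rho)       # column labels w_j
    # Y_ij ~ Bernoulli(theta[z_i, w_j])
    Y = rng.binomial(1, theta[np.ix_(z, w)])
    return Y, z, w

# toy example: 2 member groups, 3 club groups
theta = np.array([[0.9, 0.1, 0.2],
                  [0.1, 0.8, 0.3]])
Y, z, w = simulate_lbm(m=100, c=60,
                       omega=np.array([0.6, 0.4]),
                       rho=np.array([0.3, 0.3, 0.4]),
                       theta=theta)
```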

Latent block model

Govaert & Nadif (2008) for full details. Let z be a label vector such that $z_i = k$ if row (user) i is in row group k. Similarly let $w_j$ be the labels for the columns (movies) j.
The likelihood of observing the adjacency matrix Y can be written as a sum over all latent partitions:
$$p(Y \mid K, G, \theta, \omega, \rho) = \sum_{(z, w) \in \mathcal{Z} \times \mathcal{W}} p(z, w \mid \omega, \rho) \, p(Y \mid z, w, \theta, K, G).$$
Intractable, so work with the likelihood completed with the labels.

Latent block model

Assume the row and column allocations are independent a priori:
$$p(z, w \mid \omega, \rho, K, G) = p(z \mid \omega, K)\, p(w \mid \rho, G) = \prod_{i=1}^{m} \prod_{k=1}^{K} \omega_k^{I(z_i = k)} \prod_{j=1}^{c} \prod_{g=1}^{G} \rho_g^{I(w_j = g)}$$
Assume local independence of the entries of the adjacency matrix, conditioning on the labels:
$$p(Y \mid z, w, \theta, K, G) = \prod_{k=1}^{K} \prod_{g=1}^{G} \prod_{i: z_i = k} \prod_{j: w_j = g} p(Y_{ij} \mid \theta_{kg})$$
Task: find the clustering via the two label vectors and also infer the number of groups for the clustering.

Latent block model

Govaert & Nadif (2008) for full details:
- mixture weights ω for the row clustering, labels z for the rows
- mixture weights ρ for the column clustering, labels w for the columns

$$p(z, w \mid \omega, \rho, K, G) = \prod_{k=1}^{K} \omega_k^{m_k} \prod_{g=1}^{G} \rho_g^{c_g}$$
$$p(Y \mid z, w, \theta, K, G) = \prod_{k=1}^{K} \prod_{g=1}^{G} \prod_{i: z_i = k} \prod_{j: w_j = g} p(Y_{ij} \mid \theta_{kg})$$
using a local independence assumption; here $m_k$ is the number of rows in group k and $c_g$ the number of columns in group g. Loosely speaking, a latent mixture model on rows and columns; use the latent mixture to infer K and G.
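A short sketch of how these completed-likelihood terms can be evaluated for binary ties: the per-block tie counts and block sizes are the sufficient statistics (the function names and the numpy-based layout are illustrative):

```python
import numpy as np

def block_counts(Y, z, w, K, G):
    """Per-block sufficient statistics for Bernoulli links:
    s[k, g] = number of ties in block (k, g), n[k, g] = total entries."""
    m_k = np.bincount(z, minlength=K)          # row-group sizes m_k
    c_g = np.bincount(w, minlength=G)          # column-group sizes c_g
    s = np.zeros((K, G))
    # accumulate s[k, g] = sum_{i: z_i = k} sum_{j: w_j = g} Y_ij
    np.add.at(s, (z[:, None], w[None, :]), Y)
    n = np.outer(m_k, c_g)
    return s, n, m_k, c_g

def complete_data_loglik(Y, z, w, omega, rho, theta):
    """log p(z, w | omega, rho) + log p(Y | z, w, theta) for Bernoulli ties."""
    K, G = theta.shape
    s, n, m_k, c_g = block_counts(Y, z, w, K, G)
    ll = m_k @ np.log(omega) + c_g @ np.log(rho)
    ll += np.sum(s * np.log(theta) + (n - s) * np.log1p(-theta))
    return ll
```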

Latent block model

Priors:
$$p(\omega \mid K) \sim \mathrm{Dir}(\alpha, \ldots, \alpha), \qquad p(\rho \mid G) \sim \mathrm{Dir}(\beta, \ldots, \beta), \qquad p(\theta \mid K, G).$$
Note that here we condition on K and G, which are generally not known in practice.
For the latent block model, Wyse & Friel (2012) have used collapsing and MCMC schemes for the choice of K and G by assuming $p(\theta \mid K, G)$ is fully conjugate to $p(Y_{ij} \mid \theta_{kg})$, i.e. integrating out ω, ρ and θ analytically.

Integrated classification likelihood

Consider the integrated complete data log likelihood, giving rise to the ICL criterion:
$$\log p(Y, z, w \mid K, G) = \log\left( \int_{\omega, \rho, \theta} p(Y, z, w, \omega, \rho, \theta \mid K, G) \, d\omega \, d\rho \, d\theta \right) = \log p(Y \mid z, w, K, G) + \log p(z, w \mid K, G) = \mathrm{ICL}(z, w, K, G)$$
where
$$\log p(Y \mid z, w, K, G) = \log\left( \int_{\theta} p(Y \mid z, w, \theta, K, G)\, p(\theta \mid K, G) \, d\theta \right)$$
$$\log p(z, w \mid K, G) = \log\left( \int_{\rho, \omega} p(z, w \mid \omega, \rho, K, G)\, p(\omega, \rho \mid K, G) \, d\rho \, d\omega \right)$$
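Both terms have closed forms under the conjugacy assumed on the surrounding slides. A minimal sketch for binary ties, assuming a Beta(a, b) prior on each θ_kg and the symmetric Dirichlet priors above (the function name and hyperparameter defaults are illustrative):

```python
import numpy as np
from scipy.special import gammaln, betaln

def exact_icl(Y, z, w, K, G, a=1.0, b=1.0, alpha=1.0, beta=1.0):
    """Sketch of the exact ICL for binary ties with conjugate priors:
    Beta(a, b) on each theta_kg, Dir(alpha, ..., alpha) on omega,
    Dir(beta, ..., beta) on rho."""
    m_k = np.bincount(z, minlength=K)
    c_g = np.bincount(w, minlength=G)
    s = np.zeros((K, G))
    np.add.at(s, (z[:, None], w[None, :]), Y)        # ties per block
    n = np.outer(m_k, c_g)                           # entries per block

    # log p(Y | z, w, K, G): Beta-Bernoulli marginal, block by block
    log_py = np.sum(betaln(a + s, b + n - s) - betaln(a, b))

    # log p(z, w | K, G): Dirichlet-multinomial for rows and columns
    def dir_mult(counts, conc):
        t = len(counts)
        return (gammaln(t * conc) - gammaln(counts.sum() + t * conc)
                + np.sum(gammaln(counts + conc) - gammaln(conc)))

    log_pzw = dir_mult(m_k, alpha) + dir_mult(c_g, beta)
    return log_py + log_pzw
```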

Integrated classification likelihood

The ICL criterion can be used for selecting the numbers of clusters K and G; larger values of the ICL are more favourable.
As shown by McDaid et al. (2013), collapsing can be performed for stochastic block models with fairly standard prior assumptions. Côme and Latouche (2013) use a greedy search on the exact ICL to find the number of stochastic blocks as well as the block memberships.
The advantage of such approaches is that they may perform better than competing MCMC schemes, e.g. MCMC can have poor mixing and require a very large number of iterations on larger networks.

ICL greedy search

We can use a very similar scheme to Côme and Latouche (2013) to find the numbers of clusters K and G for the bi-partite network.
Assume that
$$p(\theta \mid K, G) = \prod_{k=1}^{K} \prod_{g=1}^{G} p(\theta_{kg})$$
and also that
$$\int_{\theta_{kg}} p(\theta_{kg}) \prod_{i: z_i = k} \prod_{j: w_j = g} p(Y_{ij} \mid \theta_{kg}) \, d\theta_{kg}$$
can be computed exactly (standard conjugate prior). Then $\log p(Y \mid z, w, K, G)$ can be computed exactly. We can also compute $\log p(z, w \mid K, G)$ exactly, to give the exact ICL criterion.

ICL greedy search

The scheme we use is applied alternately to the rows and columns of our adjacency matrix.
Firstly, initialize the labels z, w, choosing conservative (larger than needed) values $K_{\max}$, $G_{\max}$ for K and G.
The greedy search algorithm iteratively reallocates members and clubs and merges existing clusters so as to maximize the ICL.

ICL greedy search

Randomly scan the rows. Take member i with $z_i = k$. Compute the change in ICL for moving member i to cluster $l \neq k$,
$$\Delta_{k \to l} = \mathrm{ICL}(z^{i: k \to l}, w, K, G) - \mathrm{ICL}(z, w, K, G),$$
where $z^{i: k \to l}$ denotes z with member i reassigned to cluster l, and we take $\Delta_{k \to k} = 0$.
Move member i to the cluster l that gives the largest change in ICL. If all $\Delta_{k \to l}$ are negative, leave i where it is.
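A simplified sketch of this row sweep, assuming no cluster is emptied (the K − 1 case on the next slide and the column sweep are omitted). Here icl_fn could be a closure over Y around the exact_icl sketch above; recomputing the ICL from scratch for each candidate move is for clarity only, since in practice only the affected blocks need updating:

```python
import numpy as np

def greedy_row_sweep(Y, z, w, K, G, icl_fn, rng=None):
    """One greedy sweep over the rows: each member is moved to the cluster
    that gives the largest positive change in exact ICL, or left in place
    if every move decreases the ICL."""
    rng = np.random.default_rng() if rng is None else rng
    current_icl = icl_fn(z, w, K, G)
    for i in rng.permutation(len(z)):              # randomly scan the rows
        k = z[i]
        deltas = np.zeros(K)                       # Delta_{k -> k} = 0
        for l in range(K):
            if l == k:
                continue
            z[i] = l                               # tentatively move i to l
            deltas[l] = icl_fn(z, w, K, G) - current_icl
            z[i] = k                               # restore
        best = int(np.argmax(deltas))
        if deltas[best] > 0:                       # otherwise leave i where it is
            z[i] = best
            current_icl += deltas[best]
    return z, current_icl
```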

ICL greedy search

If taking member i from cluster k would leave it empty, we compute the differences instead as
$$\Delta_{k \to l} = \mathrm{ICL}(z^{i: k \to l}, w, K - 1, G) - \mathrm{ICL}(z, w, K, G).$$
This is the process by which clusters disappear as the greedy search progresses.
The process just described is applied to the clubs too. The greedy search terminates when no further moves can increase the ICL.

Greedy search pruning

After a few full sweeps of the data, we may already expect a good deal of clustering. Updating each row requires $O(c_M K G)$ computation, with $c_M$ the average cost of computing a marginal likelihood.
Reduce this cost by pruning off unlikely clusters. Low probabilities of being reassigned from cluster k to l correspond to large negative differences in the exact ICL.

Greedy search pruning

For the rows, the form of the full conditional for row label i can be written
$$\pi(z_i = k' \mid \text{everything else}) = \frac{\exp\{\Delta_{k \to k'}\}}{\sum_{l=1}^{K} \exp\{\Delta_{k \to l}\}},$$
where k is the allocation of row i from the previous iteration.
Of most interest is when $\pi(z_i = k' \mid \text{everything else})$ is large compared with the other groups, i.e.
$$\pi(z_i = k' \mid \text{everything else}) > 1 - \delta \text{ with } \delta \text{ small} \;\Longrightarrow\; \text{strong cohesion to group } k'.$$

Greedy search pruning

Prune off clusters with a very small full conditional probability compared with cluster $k^*$, where $k^*$ gives the maximum change in ICL (and can be the same as k). Consider clusters pairwise: if
$$\frac{\exp\{\Delta_{k \to k^*}\}}{\exp\{\Delta_{k \to k^*}\} + \exp\{\Delta_{k \to l}\}} > 1 - \delta,$$
or equivalently
$$\Delta_{k \to k^*} - \Delta_{k \to l} > \log\left[\frac{1 - \delta}{\delta}\right],$$
then prune off cluster l from the search options in future iterations.
Take $\log[(1 - \delta)/\delta] = 150$. This implies a very small δ.
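A small sketch of this pairwise pruning check, applied to the vector of ICL changes for one row (names and the toy numbers are illustrative):

```python
import numpy as np

def prune_candidates(deltas, k_star, log_ratio=150.0):
    """Keep cluster l only if Delta_{k->k*} - Delta_{k->l} <= log[(1-delta)/delta]
    (taken to be 150 here, as on the slide). Returns a boolean mask of the
    clusters kept as future reassignment options."""
    return deltas > deltas[k_star] - log_ratio

# illustrative use: keep only clusters within 150 log-units of the best one
deltas = np.array([0.0, -3.2, -400.0, -12.5])
k_star = int(np.argmax(deltas))
keep = prune_candidates(deltas, k_star)   # -> [ True,  True, False,  True]
```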

Sparse storage

Store only the present ties and their positions in a triplet form. Useful for sparse networks.
Then we can vastly reduce the computation spent on the no-tie $Y_{ij}$s:
$$\int \pi(\theta_{kg}) \prod_{i: z_i = k} \prod_{j: w_j = g} p(Y_{ij} \mid \theta_{kg}) \, d\theta_{kg} = \int \pi(\theta_{kg}) \prod_{i: z_i = k} \left[ p(\text{no-tie} \mid \theta_{kg})^{c_i^g} \prod_{j \in J_i^g} p(Y_{ij} \mid \theta_{kg}) \right] d\theta_{kg},$$
where, for row i, $J_i^g$ indexes the stored ties to columns in group g and $c_i^g$ counts the no-tie entries in group g.
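An illustrative sketch of triplet (COO) storage and of accumulating the per-block tie counts from the stored ties only, assuming scipy; the toy indices and labels are made up:

```python
import numpy as np
from scipy.sparse import coo_matrix

# store only the present ties (binary here) in triplet/COO form
rows = np.array([0, 0, 2, 3])          # i indices of observed ties (illustrative)
cols = np.array([1, 4, 2, 4])          # j indices of observed ties
vals = np.ones(len(rows))              # tie values (1 for binary networks)
Y_sparse = coo_matrix((vals, (rows, cols)), shape=(5, 6))

z = np.array([0, 0, 1, 1, 1])          # row labels (K = 2)
w = np.array([0, 0, 1, 1, 2, 2])       # column labels (G = 3)
K, G = 2, 3

# s[k, g] = number of ties in block (k, g), computed from the triplets only
s = np.zeros((K, G))
np.add.at(s, (z[Y_sparse.row], w[Y_sparse.col]), Y_sparse.data)

# n[k, g] = total entries in block (k, g); the no-tie count is n - s
n = np.outer(np.bincount(z, minlength=K), np.bincount(w, minlength=G))
```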

Models

Depending on the type of ties in the observed network, one has a choice of assumed models that still allow the ICL to be computed exactly:

$p(Y_{ij} \mid \theta_{kg})$ | $p(\theta_{kg})$
Binomial                     | Beta
Multinomial                  | Dirichlet
Poisson                      | Gamma
Gaussian                     | Gaussian-Gamma

This allows for probabilistic modelling of richer network information than tie/no-tie, if available.
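The binary (Bernoulli-Beta) case is the one sketched in exact_icl above. As an illustration of how another row of the table plugs in, here is a hedged sketch of the exact log marginal for one block of Poisson-distributed ties under a Gamma(a, b) prior (shape-rate parameterisation; hyperparameters illustrative, not from the slides):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_poisson_gamma(y, a=1.0, b=1.0):
    """Exact log marginal likelihood of the count ties y in one block under
    Poisson(lambda_kg) links with a Gamma(a, b) prior on lambda_kg.
    Analogous closed forms exist for the other conjugate pairs in the table."""
    y = np.asarray(y, dtype=float)
    n, S = y.size, y.sum()                       # block size and total count
    return (a * np.log(b) - gammaln(a)
            + gammaln(a + S) - (a + S) * np.log(b + n)
            - np.sum(gammaln(y + 1.0)))
```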

Applications - four algorithms

There are four possible algorithms available to us:

Algorithm | Pruning | Sparse form
A0        | No      | No
A1        | No      | Yes
A2        | Yes     | No
A3        | Yes     | Yes

In terms of speed we would expect A3 to be fastest and A0 to be slowest for large data.

Applications - congressional voting

We applied the ICL search to the UCI congressional voting data analysed in Wyse and Friel (2011) (abstain = nay for our purposes): 435 congressmen (members) voting on 16 key issues (clubs).
Number of groups found: K = 6, G = 11.
Little difference between the four algorithms (speed & max ICL).

Applications - congressional voting

A closer look, using A0, at the randomness introduced by randomly processing the rows: 100 runs of the algorithm gave the maximum ICLs reached.
[Figure: histogram of the maximum ICL values reached over the 100 runs.]
Algorithm run times averaged at 0.6 of a second. This is in contrast to the 1 hour it took the Wyse and Friel (2011) algorithm to generate 100,000 posterior samples of the clustering (inefficient).

Applications - Movie-Lens 100k data

Start the four algorithms A0-A3 with the same random seed. This allows for direct comparison.

Algorithm | maximum ICL | time (sec) | (K, G)
A0        | -225670.9   | 183.8      | (49, 40)
A1        | -225670.9   | 69.8       | (49, 40)
A2        | -225670.9   | 134.3      | (49, 40)
A3        | -225670.9   | 51.8       | (49, 40)

All algorithms get to the same result from the same starting position. However, we see a marked speed-up from using the sparse forms (A1 & A3) and pruning (A2 & A3). Pruning can give a faster run with a looser threshold, but this can introduce error.

Applications - Movie-Lens 100k data

[Figure: the re-ordered users-by-movies adjacency matrix after clustering.]
Identified 49 user and 40 movie clusters. MCMC is practically infeasible for even this size of matrix. In problems like this, we see that making use of sparsity gives good savings.

Conclusion / Further work

The ICL greedy search could be much more scalable than MCMC while giving similar conclusions. Scalability can be improved even further by exploiting sparsity and other ideas (e.g. pruning off bad clusters).
Ceilings on the number of rows/columns that are manageable need investigation. Convergence results for the greedy search and investigation of other search strategies would be desirable. Any suggestions?

References

Govaert & Nadif (2008). Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis 52, 3233-3245.
Côme & Latouche (2013). Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. arXiv:1303.2962v1.
McDaid, Murphy, Friel & Hurley (2013). Improved Bayesian inference for the stochastic block model with application to large networks. Computational Statistics & Data Analysis 60, 12-31.
Wyse & Friel (2012). Block clustering with collapsed latent block models. Statistics and Computing 22, 415-428.