Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Outline
- Task at hand
- Introduction
- Related work
- Our approach: Latent Dirichlet Allocation for Graphs (LDA-G)
- Experimental study
- Conclusion
Task at Hand
Inputs:
- Two abstracts from PubMed
- A co-authorship network from PubMed
- A bipartite author-by-knowledge graph from PubMed
Output:
- Other researchers whose work is similar to that of the authors of the two given abstracts
Our solution: Latent Dirichlet Allocation for Graphs (LDA-G)
- A scalable nonparametric Bayesian group discovery algorithm that finds soft groups / communities with mixed memberships
Introduction
Problem: Given a graph G = (V, E), find groups / communities in G.
The algorithm must satisfy all of these properties:
1) Nonparametric: the number of groups is not known a priori
2) Scalable: runs in less than O(|V|^2) time and requires less than O(|V|^2) space
3) Effective: finds groups that are highly predictive of G's underlying topology
4) Mixed-membership: nodes can belong to multiple groups with varying degrees
Related Work
Non-probabilistic approach: Cross-associations [Chakrabarti+, KDD 04]
- Generates a compressed form of the adjacency matrix by minimizing the total encoding cost
- Assumes uniform likelihood for links between groups
- Produces a hard clustering
- Scales linearly in the number of links

Probabilistic approach: Infinite Relational Model (IRM) [Kemp+, AAAI 06]
- A simple nonparametric Bayesian approach to group discovery: p(z | R) ∝ p(R | z) p(z), where R is the observed relational data and z the group assignments
- Assumes non-uniform likelihood for links between groups
- Robust to missing data
- Generates soft groups with mixed memberships
- Not scalable: (1) requires both present and absent links and (2) has slow inference compared to other models
Related Work (continued)
Simple Social Network-LDA (SSN-LDA) [Zhang+, ISI 07]
- SSN-LDA and LDA-G are formally similar: both extend LDA [Blei+, JMLR 03] to find communities in graphs
- BUT LDA-G's motivations and applications differ from SSN-LDA's:
  - LDA-G was motivated by delineating scientific domains based on one or two given data points and was applied to PubMed
  - SSN-LDA was developed to find community structure in social networks and was applied to CiteSeer
Background: Latent Dirichlet Allocation
LDA [D. Blei+, JMLR 03]: a nonparametric generative model for topic discovery in text documents
- LDA takes a set of documents; each document is represented as a bag of words
- Each document generates a series of topics from a multinomial distribution
- LDA uses infinite (Dirichlet) priors on documents and topics
  - The number of topics is learned, not fixed
  - Topics assign some probability to words they have not yet generated
- LDA uses Gibbs sampling to efficiently infer posterior estimates of all topic and document distributions
- LDA can be parallelized without sacrificing MCMC convergence [D. Newman+, NIPS 07]
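The collapsed Gibbs sampler used by LDA can be sketched as follows. This is a minimal illustration only: it fixes the number of topics K up front (the full model described here lets the number of topics be learned), and the symmetric priors and toy corpus are assumptions, not values from this work.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of token lists."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # topic totals
    z = []                                       # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t); ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional p(z_i = k | all other assignments)
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                           for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # posterior estimate of p(topic | doc)
    theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
             for d in range(len(docs))]
    return theta, nkw

theta, nkw = lda_gibbs([["a", "b", "a"], ["c", "d", "c"]])
```

Each sweep resamples one token's topic from its full conditional; the per-document topic mixtures `theta` are then read off the final counts.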
Our LDA for Graphs (LDA-G)
- Each node has its own document
- Each document consists of its node's edge list
- Each neighbor is treated as a word in the vocabulary
- Topics are assigned to each edge using LDA inference

Generative model:
  w_i | z_i, φ^(z_i) ~ Discrete(φ^(z_i)),  φ ~ Dirichlet(β)
  z_i | θ^(d_i) ~ Discrete(θ^(d_i)),       θ ~ Dirichlet(α)

Learned link prediction model:
  p(edge(v_i, v_j) | v_i) = Σ_{t ∈ topics} p(edge(v_j) | t) · p(t | v_i)

Glossary: Document = Source node, Topic = Group, Word = Destination node
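The link prediction model above is a sum over topics of the topic's destination distribution weighted by the source node's topic mixture. A minimal sketch (the function name and the toy distributions are assumptions for illustration):

```python
def link_prob(theta_i, phi, j):
    """p(edge to node j | source i) = sum_t p(j | t) * p(t | i).

    theta_i: list of p(topic t | source node i), one entry per topic
    phi:     list of dicts, phi[t][j] = p(destination j | topic t)
    """
    return sum(theta_i[t] * phi[t].get(j, 0.0) for t in range(len(theta_i)))

# Toy example with 2 topics and 3 destination nodes
theta_i = [0.7, 0.3]
phi = [{"v1": 0.5, "v2": 0.5}, {"v2": 0.2, "v3": 0.8}]
print(link_prob(theta_i, phi, "v2"))  # 0.7*0.5 + 0.3*0.2 = 0.41
```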
Document Representation in LDA vs. LDA-G
LDA (text corpus):
- Doc 1: it, was, the, best, of, times, ...
- Doc 2: hi, mom, just, writing, to, let, you, know, ...
- Doc 3: we, hold, these, truths, to, be, self, evident, ...
LDA-G (a six-node example graph):
- Doc 1: 2, 4
- Doc 2: <empty>
- Doc 3: <empty>
- Doc 4: 1
- Doc 5: 2, 3, 4, 6
- Doc 6: 3
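The graph-to-corpus mapping above can be sketched as follows, using the six-node example (the helper name and the edge list are assumptions consistent with the documents shown):

```python
def graph_to_documents(num_nodes, edges):
    """Build one 'document' per node: its out-neighbor list,
    with each neighbor treated as a word in the vocabulary."""
    docs = {v: [] for v in range(1, num_nodes + 1)}
    for src, dst in edges:
        docs[src].append(dst)
    return docs

# Directed edges matching the six-node example on this slide
edges = [(1, 2), (1, 4), (4, 1), (5, 2), (5, 3), (5, 4), (5, 6), (6, 3)]
docs = graph_to_documents(6, edges)
print(docs[5])  # [2, 3, 4, 6]
print(docs[2])  # [] -- a node with no out-edges yields an empty document
```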
LDA-G meets all the requirements
- Is a simple nonparametric Bayesian model
  - Number of groups is learned from the data
  - Assumes non-uniform likelihood for links between groups
  - Is robust to missing data
  - Generates soft groups with mixed memberships
- Is scalable
  - Requires O(|V| log |V|) runtime and storage
  - Does not require absent links
  - Can be parallelized without sacrificing MCMC convergence [D. Newman+, NIPS 07]
- Produces effective (high) quantitative results on link prediction tasks
Experiments: Description of Data
Data motivated by two articles about the 1918 influenza:
- Team 1: co-authors of "Aberrant innate immune response in lethal infection of macaques with the 1918 influenza virus," published in Nature (2007)
- Team 2: co-authors of "Characterization of the Reconstructed 1918 Spanish Influenza Pandemic Virus," published in Science (2005)
Description of Data (continued)
Two graphs were generated from PubMed:
- Authors from Teams 1 and 2 were used to discover articles in PubMed
- For all PubMed articles from Team 2, knowledge themes were extracted based on term frequency
- All papers in PubMed with the major term of each theme were extracted
1. Author x Knowledge graph, with published-on-theme links
   - 37,463 nodes: 37,346 authors and 117 knowledge nodes
   - 119,443 who-knows-what links
2. Author x Author graph, with co-authorship links
   - 37,227 author nodes
   - 143,364 who-knows-whom links
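Theme extraction by term frequency can be sketched as follows. This is a toy illustration only: the tokenization, stopword list, and cutoff are assumptions, not the procedure actually used on the PubMed articles.

```python
from collections import Counter

def extract_themes(abstracts, top_k=5,
                   stopwords=frozenset({"the", "of", "a", "in", "and", "with", "for"})):
    """Rank candidate knowledge themes by term frequency across abstracts."""
    counts = Counter()
    for text in abstracts:
        for tok in text.lower().split():
            tok = tok.strip(".,;:()")
            if tok and tok not in stopwords:
                counts[tok] += 1
    return [term for term, _ in counts.most_common(top_k)]

abstracts = ["Influenza virus infection of macaques",
             "Reconstructed influenza pandemic virus characterization"]
print(extract_themes(abstracts, top_k=2))  # ['influenza', 'virus']
```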
Adjacency Matrices of the Two Graphs
[Figure: Author x Knowledge adjacency matrix (37,346 authors x 117 knowledge nodes) and Author x Author adjacency matrix (37,227 authors x 37,227 authors)]
Knowledge-Infused Author x Author Graph
- Start with the Author x Author graph
- Fuse information from the Author x Knowledge graph into the Author x Author graph:
  1. Remove links from the Author x Knowledge graph with weight less than 12
  2. For any two authors that share a knowledge theme, add a link to the Author x Author graph
[Figure: adjacency matrices of the original and knowledge-infused Author x Author graphs]
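The two infusion steps above can be sketched as follows (the weight threshold of 12 is from this slide; the data structures and function name are assumptions):

```python
from itertools import combinations

def infuse_knowledge(author_edges, author_knowledge, min_weight=12):
    """Add an author-author link for every pair of authors that share
    a knowledge theme whose links meet the weight threshold."""
    infused = set(author_edges)
    theme_to_authors = {}
    for (author, theme), weight in author_knowledge.items():
        if weight >= min_weight:                   # step 1: drop weak links
            theme_to_authors.setdefault(theme, set()).add(author)
    for authors in theme_to_authors.values():      # step 2: connect co-themed authors
        for a, b in combinations(sorted(authors), 2):
            infused.add((a, b))
    return infused

# Toy example: authors A and B share the theme "influenza" strongly; C's link is too weak
ak = {("A", "influenza"): 15, ("B", "influenza"): 20, ("C", "influenza"): 3}
print(sorted(infuse_knowledge({("B", "C")}, ak)))  # [('A', 'B'), ('B', 'C')]
```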
Experimental Methodology
- Randomly select 500 present links and 500 absent links for testing
- Train on the remaining present links (Cross-associations implicitly considers absent links as well)
- Test link prediction on the held-out links; report ROC curves (TPR / sensitivity vs. FPR / 1 - specificity, with the diagonal as random guessing)
- Results are averaged over 5 independent trials
Caveat: IRM cannot handle graphs with thousands of nodes and links
- Down-sample the graphs (for IRM only)
- Randomly select links from the sampled graph for testing
- Train on the remaining links in the sampled graph
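The held-out link sampling can be sketched as follows (the 500/500 split is from this slide; the graph representation and function name are assumptions):

```python
import random

def split_links(num_nodes, edges, n_test=500, seed=0):
    """Hold out n_test present links for testing, keep the rest for
    training, and sample n_test absent links uniformly at random."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    test_present, train = edges[:n_test], edges[n_test:]
    present = set(edges)
    test_absent = set()
    while len(test_absent) < n_test:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in present:
            test_absent.add((u, v))
    return train, test_present, sorted(test_absent)

# Toy usage on a small synthetic graph with 50 links
edges = {(i, (i * 7 + 1) % 50) for i in range(50)}
train, pos, neg = split_links(50, edges, n_test=10)
print(len(train), len(pos), len(neg))  # 40 10 10
```

A trained model's scores on `pos` and `neg` then give one point per threshold on the ROC curve.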
Quantitative Results on Link Prediction
Number of Groups Found in the Data

                     AxK                                      AxA        Knowledge-Infused AxA
Cross-associations   28 author groups & 14 knowledge groups   11 groups  12 groups
LDA-G                34 groups                                18 groups  18 groups

[Figure: Knowledge-Infused Author x Author matrix: original data, Cross-associations' groups, and LDA-G's groups]
Author x Knowledge Graph: LDA-G's Group Distributions
Value of Knowledge Infusion
The Knowledge-Infused AxA graph generates groups with larger overlap among the authors of Team 1 (the Nature paper) and Team 2 (the Science paper) than the AxA graph does.
Conclusion
- LDA-G is a group discovery algorithm that is nonparametric, scalable, effective, and produces mixed memberships
- We have recently extended LDA-G to incorporate graph attributes and time-evolving graphs
- LDA-G successfully found a group of researchers whose work is similar to that of Teams 1 and 2 (based on feedback from our subject-matter experts)
- Lessons learned from the experiments:
  - Collecting a small amount of highly structured data can improve performance significantly more than collecting large amounts of messy data
  - Collecting data from more than one source is better than allocating all resources to the same type of observation
References
- D.M. Blei, A.Y. Ng, M.I. Jordan: Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
- D. Chakrabarti, S. Papadimitriou, D.S. Modha, C. Faloutsos: Fully automatic cross-associations. KDD 2004: 79-88.
- C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, N. Ueda: Learning systems of concepts with an infinite relational model. AAAI 2006.
- D. Newman, A. Asuncion, P. Smyth, M. Welling: Distributed inference for latent Dirichlet allocation. NIPS 2007: 1081-1088.
- H. Zhang, B. Qiu, C.L. Giles, H.C. Foley, J. Yen: An LDA-based community structure discovery approach for large-scale social networks. ISI 2007: 200-207.
Thank you
Contact information: keith@llnl.gov