Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs

Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Outline: Task at hand; Introduction; Related work; Our approach: Latent Dirichlet Allocation for Graphs (LDA-G); Experimental study; Conclusion.

Task at Hand. Inputs: two abstracts from PubMed, a co-authorship network from PubMed, and a bipartite author-by-knowledge graph from PubMed. Output: find other researchers whose work is similar to the authors of the two given abstracts. Our solution: Latent Dirichlet Allocation for Graphs (LDA-G), a scalable nonparametric Bayesian group discovery algorithm that discovers soft groups / communities with mixed memberships.

Introduction. Problem: given a graph G = (V, E), find groups / communities in G. The algorithm must satisfy all of these properties: 1) Nonparametric: the number of groups is not known a priori. 2) Scalable: the algorithm is faster than O(|V|^2) and requires less than O(|V|^2) space. 3) Effective: the algorithm finds groups that are highly predictive of G's underlying topology. 4) Mixed-membership: nodes can belong to multiple groups with varying degrees.

Related Work. Non-probabilistic approach: Cross-associations [Chakrabarti+, KDD 04] generates a compressed form of the adjacency matrix by minimizing the total encoding cost; it assumes uniform likelihood for links between groups, produces a hard clustering, and scales linearly in the number of links. Probabilistic approach: Infinite Relational Models (IRM) [Kemp+, AAAI 06], a simple nonparametric Bayesian approach to group discovery with posterior p(Z | data, R) ∝ p(data | R, Z) p(Z); it assumes non-uniform likelihood for links between groups, is robust to missing data, and generates soft groups with mixed memberships, but it is not scalable: (1) it requires both present and absent links, and (2) its inference is slow compared to other models.

Related Work (continued). Simple Social Network-LDA (SSN-LDA) [Zhang+, ISI 07]: SSN-LDA and LDA-G are formally similar; both extend LDA [Blei+, JMLR 03] to find communities in graphs. BUT LDA-G's motivations and applications differ from SSN-LDA's: LDA-G was motivated by delineating scientific domains based on one or two given data points and was applied to PubMed, while SSN-LDA was developed to find community structure in social networks and was applied to CiteSeer.

Background: Latent Dirichlet Allocation. LDA [D. Blei+, JMLR 03] is a nonparametric generative model for topic discovery in text documents. LDA takes a set of documents; each document is represented as a bag of words, and each document generates a series of topics from a multinomial distribution. LDA uses infinite (Dirichlet) priors on documents and topics: the number of topics is learned, not fixed, and topics assign some probability to words they have not yet generated. LDA uses Gibbs sampling to efficiently infer posterior estimates of all topic and document distributions, and it can be parallelized without sacrificing MCMC convergence [D. Newman+, NIPS 07].
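To make the Gibbs sampling step concrete, here is a minimal collapsed Gibbs sampler for LDA. This is an illustrative sketch, not the authors' implementation: it fixes the number of topics T for simplicity rather than learning it, and all names (`docs`, `n_dt`, `n_tw`, and so on) are hypothetical.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, V)
    Returns document-topic counts n_dt and topic-word counts n_tw.
    """
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), T))   # topic counts per document
    n_tw = np.zeros((T, V))           # word counts per topic
    n_t = np.zeros(T)                 # total words per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):    # accumulate the initial counts
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]           # remove the current assignment
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                # full conditional p(z_i = t | z_-i, w), up to a constant
                p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t           # resample and restore the counts
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    return n_dt, n_tw
```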

Our LDA for Graphs (LDA-G). Each node has its own document; each document consists of its node's edge list, and each neighbor is treated as a word in the vocabulary. LDA inference assigns a topic to each edge:

w_i | z_i, φ^(z_i) ~ Discrete(φ^(z_i)),  φ ~ Dirichlet(β)
z_i | θ^(d_i) ~ Discrete(θ^(d_i)),  θ ~ Dirichlet(α)

This learns a link prediction model:

p(edge(v_j) | v_i) = Σ_{t ∈ topics} p(edge(v_j) | t) · p(t | v_i)

Glossary: Document = Source node, Topic = Group, Word = Destination node.
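The prediction rule can be read directly off the learned counts. Here is a minimal sketch (again hypothetical, not the authors' code) that computes p(edge(v_j) | v_i) from the count matrices produced by the sampler above, with Dirichlet smoothing via α and β:

```python
import numpy as np

def link_probability(n_dt, n_tw, i, j, alpha=0.1, beta=0.01):
    """p(edge(v_j) | v_i) = sum_t p(edge(v_j) | t) * p(t | v_i).

    n_dt[d]: topic (group) counts for node d's document
    n_tw[t]: counts of destination nodes generated by topic (group) t
    """
    T, V = n_tw.shape
    p_t_given_i = (n_dt[i] + alpha) / (n_dt[i].sum() + T * alpha)      # p(t | v_i)
    p_j_given_t = (n_tw[:, j] + beta) / (n_tw.sum(axis=1) + V * beta)  # p(v_j | t)
    return float(p_t_given_i @ p_j_given_t)
```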

Document representation in LDA vs. LDA-G. In LDA, each document is a bag of words: Doc 1 = "it, was, the, best, of, times, ..."; Doc 2 = "hi, mom, just, writing, to, let, you, know, ..."; Doc 3 = "we, hold, these, truths, to, be, self, evident, ...". In LDA-G, each node's document is its edge list; for an example six-node graph: Doc 1 = {2, 4}; Doc 2 = <empty>; Doc 3 = <empty>; Doc 4 = {1}; Doc 5 = {2, 3, 4, 6}; Doc 6 = {3}.
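Under this glossary, building an LDA-G corpus is just a relabeling of the adjacency list. A sketch using the six-node example above (variable names are hypothetical); the resulting `docs` can be fed straight into a sampler like `collapsed_gibbs_lda`:

```python
# Six-node example graph: each node's document is its edge list.
edges = {1: [2, 4], 2: [], 3: [], 4: [1], 5: [2, 3, 4, 6], 6: [3]}

nodes = sorted(edges)
word_id = {v: k for k, v in enumerate(nodes)}  # neighbor -> vocabulary id
docs = [[word_id[nbr] for nbr in edges[v]] for v in nodes]
# docs == [[1, 3], [], [], [0], [1, 2, 3, 5], [2]]
```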

LDA-G meets all the requirements. It is a simple nonparametric Bayesian model: the number of groups is learned from the data, it assumes non-uniform likelihood for links between groups, it is robust to missing data, and it generates soft groups with mixed memberships. It is scalable: it requires O(|V| log |V|) in both runtime and storage, does not require absent links, and can be parallelized without sacrificing MCMC convergence [D. Newman+, NIPS 07]. It produces effective (high) quantitative results on link prediction tasks.

Experiments: Description of Data. The data is motivated by two articles about the 1918 influenza. Team 1: co-authors of "Aberrant innate immune response in lethal infection of macaques with the 1918 influenza virus," published in Nature (2007). Team 2: co-authors of "Characterization of the Reconstructed 1918 Spanish Influenza Pandemic Virus," published in Science (2005).

Description of Data (continued). Two graphs were generated from PubMed. Authors from Teams 1 and 2 were used to discover articles in PubMed; for all PubMed articles from Team 2, knowledge themes were extracted based on term frequency, and all papers in PubMed containing the major term of each theme were extracted. 1. Author x Knowledge graph, with published-on-theme links: 37,463 nodes (37,346 authors and 117 knowledge nodes) and 119,443 who-knows-what links. 2. Author x Author graph, with co-authorship links: 37,227 author nodes and 143,364 who-knows-whom links.

Adjacency Matrices of the Two Graphs. [Figure: Author x Knowledge adjacency matrix (37,346 author nodes by 117 knowledge nodes) and Author x Author adjacency matrix (37,227 authors by 37,227 authors).]

Knowledge-Infused Author x Author Graph. Start with the Author x Author graph and fuse in information from the Author x Knowledge graph: 1. Remove links from the Author x Knowledge graph with weight less than 12. 2. For any authors that share a knowledge theme, add a link to the Author x Author graph. [Figure: original Author x Author vs. Knowledge-Infused Author x Author adjacency matrices.]
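A sketch of the two fusion steps, assuming the AxA graph is stored as a set of undirected links and the AxK graph as a dict mapping (author, theme) pairs to weights; this layout and the names are hypothetical, not the authors' code:

```python
import itertools
from collections import defaultdict

def knowledge_infuse(author_author, author_knowledge, min_weight=12):
    """Fuse Author x Knowledge links into the Author x Author graph."""
    # Step 1: drop who-knows-what links with weight less than 12.
    theme_authors = defaultdict(set)
    for (author, theme), weight in author_knowledge.items():
        if weight >= min_weight:
            theme_authors[theme].add(author)
    # Step 2: link every pair of authors sharing a surviving theme.
    fused = set(author_author)
    for authors in theme_authors.values():
        for a, b in itertools.combinations(sorted(authors), 2):
            fused.add(frozenset((a, b)))
    return fused
```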

Experimental Methodology. Randomly select 500 present-links and 500 absent-links for testing; train on the remaining present links (Cross-associations implicitly considers absent links as well); test link prediction on the held-out links; results are averaged over 5 independent trials. [Figure: ROC curve axes, TPR (sensitivity) vs. FPR (1 - specificity), with the random-guess diagonal.] Caveat: IRM cannot handle graphs with thousands of nodes and links, so (for IRM only) we down-sample the graphs, randomly select links from the sampled graph for testing, and train on the remaining links in the sampled graph.
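The held-out evaluation can be sketched as follows, assuming links are (source, destination) pairs and `score_fn` is a trained model's scorer such as `link_probability` above (all names hypothetical); scikit-learn's `roc_curve` and `auc` supply the ROC statistics. Averaging `roc_auc` over 5 random seeds mirrors the trial-averaging described above.

```python
import random
import numpy as np
from sklearn.metrics import roc_curve, auc

def split_links(present, absent, n_test=500, seed=0):
    """Hold out n_test present and n_test absent links for testing;
    the model is trained on the remaining present links only."""
    rng = random.Random(seed)
    test_pos = rng.sample(list(present), n_test)
    test_neg = rng.sample(list(absent), n_test)
    held = set(test_pos)
    train = [e for e in present if e not in held]
    return train, test_pos, test_neg

def roc_auc(score_fn, test_pos, test_neg):
    """Score the held-out links and return the area under the ROC curve."""
    y_true = np.r_[np.ones(len(test_pos)), np.zeros(len(test_neg))]
    y_score = np.array([score_fn(i, j) for i, j in test_pos + test_neg])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return auc(fpr, tpr)
```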

Quantitative Results on Link Prediction. [Figure: link prediction results.]

Number of Groups Found in the Data

                     AxK                                    AxA        Knowledge-Infused AxA
  Cross-Associations 28 author groups, 14 knowledge groups  11 groups  12 groups
  LDA-G              34 groups                              18 groups  18 groups

[Figure: Knowledge-Infused Author x Author adjacency matrix: original data, Cross-Associations' groups, and LDA-G's groups.]

Author x Knowledge Graph: LDA-G's Group Distributions. [Figure: LDA-G's group distributions on the Author x Knowledge graph.]

Value of Knowledge Infusion. The Knowledge-Infused AxA graph generates groups with larger overlap between the authors of Team 1 (the Nature paper) and Team 2 (the Science paper) than the AxA graph does.

Conclusion. LDA-G is a group discovery algorithm that is nonparametric, scalable, and effective, and it produces mixed memberships. We have recently extended LDA-G to incorporate graph attributes and time-evolving graphs. LDA-G successfully found a group of researchers whose work is similar to Teams 1 and 2, based on feedback from our subject matter experts. Lessons learned from the experiments: collecting a small amount of highly structured data can improve performance significantly more than collecting large amounts of messy data, and collecting data from more than one source is better than allocating all resources to the same type of observation.

References
D.M. Blei, A.Y. Ng, M.I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).
D. Chakrabarti, S. Papadimitriou, D.S. Modha, C. Faloutsos: Fully automatic cross-associations. KDD 2004: 79-88.
C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, N. Ueda: Learning systems of concepts with an infinite relational model. AAAI 2006.
D. Newman, A. Asuncion, P. Smyth, M. Welling: Distributed inference for latent Dirichlet allocation. NIPS 2007: 1081-1088.
H. Zhang, B. Qiu, C.L. Giles, H.C. Foley, J. Yen: An LDA-based community structure discovery approach for large-scale social networks. ISI 2007: 200-207.

Thank you. Contact information: keith@llnl.gov