Web Search and Text Mining. Lecture 16: Topics and Communities


Outline: Latent Dirichlet Allocation (LDA); graphical models for social networks.

Exploration, discovery, and query answering in the context of the relationships among authors/actors, topics, communities, and roles for large document collections and their associated social networks: 1) analysis of topic trends; 2) finding the authors who are most likely to write on a topic; 3) finding the unusual paper written by an author; 4) discovering organizational structure via communication patterns; 5) role/community membership of social actors.

Topic Models: discovering the topics in a text corpus. Latent semantic indexing (the discovered dimensions are hard to interpret and not natural from a generative-model viewpoint). Probabilistic latent semantic indexing (not a complete generative model). Latent Dirichlet allocation: topics are multinomial distributions over words (Blei et al., 2003).

Representation of Topics (four example topics, each shown by its most probable words):

  WORD        PROB.   WORD         PROB.   WORD      PROB.   WORD          PROB.
  LIGHT       .0306   RECOGNITION  .0500   KERNEL    .0547   SOURCE        .0389
  RESPONSE    .0282   CHARACTER    .0334   VECTOR    .0293   INDEPENDENT   .0376
  INTENSITY   .0252   TANGENT      .0246   SUPPORT   .0293   SOURCES       .0344
  RETINA      .0241   CHARACTERS   .0232   MARGIN    .0239   SEPARATION    .0322
  OPTICAL     .0233   DISTANCE     .0197   SVM       .0196   INFORMATION   .0319
  KOCH        .0190   HANDWRITTEN  .0166   DATA      .0165   ICA           .0276

Each topic t is represented as a multinomial distribution over the set of words in the vocabulary: φ_t = [φ_{1t}, ..., φ_{Vt}]^T, with Σ_{i=1}^{V} φ_{it} = 1. Conceptually, each document is a mixture of topics.
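As a small illustration (not from the original slides), the topic-word distributions φ_t and a document's topic mixture can be written down directly; the sizes K and V below are arbitrary:

```python
# Minimal sketch of the representation above, with illustrative sizes.
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 1000                      # number of topics, vocabulary size (illustrative)

# phi[t] is the multinomial distribution of topic t over the V words;
# each row is non-negative and sums to 1.
phi = rng.dirichlet(alpha=np.full(V, 0.1), size=K)
assert np.allclose(phi.sum(axis=1), 1.0)

# Conceptually, a document is a mixture of topics: theta sums to 1 over the K topics.
theta = rng.dirichlet(alpha=np.full(K, 1.0))

# The word distribution the mixture implies for this document.
p_word_given_doc = theta @ phi      # shape (V,), sums to 1
```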

Some Models Unigra mmodel The ords of every document are dran independently from a single multinomial distribution, Graphical representation, p() = N BLEI, NG, AND JORDAN i=1 p( i ) N (a) unigram M

Mixture of unigrams. Each document is generated by 1) choosing a topic z; 2) generating N words independently from the conditional multinomial p(w|z): p(w) = Σ_z p(z) ∏_{i=1}^{N} p(w_i|z). A document belongs to a single topic.

[Graphical representation: plate diagrams (a) unigram model and (b) mixture of unigrams.]
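A minimal sketch of the mixture-of-unigrams generative process described above; the topic prior p(z), the topic-word distributions, and the document length are illustrative placeholders:

```python
# Sketch: sample one document from a mixture of unigrams.
import numpy as np

rng = np.random.default_rng(1)
K, V, N = 3, 50, 20                 # topics, vocabulary size, words per document (illustrative)

p_z = rng.dirichlet(np.ones(K))                 # p(z): prior over topics
phi = rng.dirichlet(np.full(V, 0.5), size=K)    # p(w | z) for each topic

z = rng.choice(K, p=p_z)                        # 1) choose a single topic for the document
doc = rng.choice(V, size=N, p=phi[z])           # 2) draw all N words from p(w | z)
```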

pLSI. A document label d and a word w_i are conditionally independent given an unobserved topic z: p(d, w_i) = Σ_z p(z) p(d, w_i|z) = Σ_z p(z) p(d|z) p(w_i|z) = p(d) Σ_z p(z|d) p(w_i|z). A document can have multiple topics, with p(z|d) as the mixture weights.
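A small sketch of the pLSI decomposition p(d, w_i) = p(d) Σ_z p(z|d) p(w_i|z), using illustrative parameter matrices:

```python
# Sketch: assemble the pLSI joint p(d, w) from its factors.
import numpy as np

rng = np.random.default_rng(2)
D, V, K = 5, 50, 3                  # documents, vocabulary size, topics (illustrative)

p_d = np.full(D, 1.0 / D)                        # p(d)
p_z_given_d = rng.dirichlet(np.ones(K), size=D)  # p(z|d), one row per document
p_w_given_z = rng.dirichlet(np.ones(V), size=K)  # p(w|z), one row per topic

# p(d, w): D x V matrix of joint probabilities.
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)
assert np.isclose(p_dw.sum(), 1.0)
```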

[Figure 3: graphical model representation of different models of discrete data: (a) unigram, (b) mixture of unigrams, (c) pLSI/aspect model.] But pLSI has drawbacks: 1) p(z|d) is learned only for the documents in the training set, so there is no generalization to new documents; 2) the number of parameters grows linearly with the number of documents (kV + kM), making the model prone to overfitting.

LDA. Key difference from pLSI: the topic mixture weights are modeled by a k-parameter hidden random variable rather than a large set of individual parameters explicitly linked to the training set (p(z|d) for each of the M training documents). The hidden variable θ ~ Dirichlet(α).

Topic (LDA) model. [Figure 10(a): graphical model for the LDA topic model, with hyperparameter α, per-document topic distribution θ, topic assignments z, words, and topic-word distributions φ with prior β.] The corpus is generated by a three-step process:
1) for each document, a distribution over topics is sampled from a Dirichlet distribution: θ ~ Dirichlet(α);
2) for each word in the document, a single topic is chosen according to this distribution: z ~ Multinomial(θ);
3) each word is sampled from a multinomial distribution over words specific to the sampled topic: w ~ Multinomial(φ_z).
The parameters of this model: Φ represents a distribution over words for each topic, and Θ represents a distribution over topics for each document.
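The three-step process can be sketched directly; the hyperparameters and corpus sizes below are illustrative, not taken from the lecture:

```python
# Sketch: sample a toy corpus from the LDA generative process.
import numpy as np

rng = np.random.default_rng(3)
K, V, D, N = 4, 100, 10, 30         # topics, vocab size, documents, words per doc (illustrative)
alpha, beta = 0.5, 0.1              # symmetric Dirichlet hyperparameters (illustrative)

phi = rng.dirichlet(np.full(V, beta), size=K)   # topic-word distributions

corpus = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))    # 1) per-document topic mixture
    z = rng.choice(K, size=N, p=theta)          # 2) a topic for each word position
    words = np.array([rng.choice(V, p=phi[t]) for t in z])  # 3) words drawn from phi_z
    corpus.append(words)
```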

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is
p(θ, z, w | α, β) = p(θ|α) ∏_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β).
Summing over θ and z gives the marginal distribution of a document:
p(w | α, β) = ∫ p(θ|α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) ) dθ.
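The integral above has no closed form, so in practice LDA is fit with approximate inference. A minimal sketch using gensim (assuming gensim is available; the toy corpus and settings are purely illustrative):

```python
# Sketch: fit LDA on a tiny toy corpus with gensim.
from gensim import corpora
from gensim.models import LdaModel

texts = [["topic", "model", "word", "distribution"],
         ["author", "community", "email", "network"],
         ["topic", "word", "dirichlet", "mixture"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=10, random_state=0)
print(lda.print_topics())                      # top words per topic (phi)
print(lda.get_document_topics(bow_corpus[0]))  # topic mixture (theta) for the first document
```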

Geometric Picture. The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics, i.e., a point on the topic simplex; there is one such distribution for each document. [Figure 4: the topic simplex for three topics embedded in the word simplex for three words. The corners of the word simplex correspond to the three distributions where each word (respectively) has probability one. The three points of the topic simplex correspond to three different distributions over words. The mixture of unigrams places each document at one of the corners of the topic simplex. The pLSI model induces an empirical distribution on the topic simplex, denoted by x's. LDA places a smooth distribution on the topic simplex, denoted by the contour lines.] The word simplex is a (V-1)-dimensional simplex.

1. Unigram: one point in the word simplex; all words are drawn from the same distribution.
The other models use latent topics and form a (K-1)-dimensional sub-simplex spanned by K points in the word simplex:
2. Mixture of unigrams: one of the K points is chosen randomly for each document.
3. pLSI: one point on the topic simplex for each document in the training set.
4. LDA: both training documents and unseen documents are drawn from a continuous distribution on the topic simplex.

Author Models. Documents are written by authors, so model a document based on the interests of its authors, expressed as author-word distributions (McCallum, 1999). [Figure 10: different generative models for documents: (a) Topic (LDA), (b) Author, (c) Author-Topic.]

Author-Topic Models. Each author is a multinomial distribution over topics (Rosen-Zvi et al., 2005). [Figure 10(b, c): the Author and Author-Topic generative models.] Generative process: 1) an author is chosen uniformly from the document's author list; each author has a distribution over topics; 2) a topic is chosen from that author's distribution, and a word is sampled from the chosen topic's distribution over words.
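A minimal sketch of this author-topic generative process for one document; the sizes and the author list a_d are illustrative placeholders:

```python
# Sketch: sample one document under the author-topic model.
import numpy as np

rng = np.random.default_rng(4)
A, T, W, N_d = 5, 3, 50, 20         # authors, topics, vocabulary size, words in the document

theta = rng.dirichlet(np.ones(T), size=A)   # Theta: per-author distribution over topics
phi = rng.dirichlet(np.ones(W), size=T)     # Phi: per-topic distribution over words

a_d = [0, 2, 4]                     # observed author list of this document (illustrative)
doc = []
for _ in range(N_d):
    x = rng.choice(a_d)             # 1) author chosen uniformly from the author list
    z = rng.choice(T, p=theta[x])   # 2) topic from that author's topic distribution,
    w = rng.choice(W, p=phi[z])     #    and a word sampled from the topic
    doc.append(w)
```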

[Figure 3: graphical model for the author-topic model.] Under this generative process, each topic is drawn independently when conditioned on Θ, and each word is drawn independently when conditioned on Φ and z. The parameters are:
Probabilities of words given topics: Φ, a W × T matrix; probabilities of words given topic t: φ_t, a W-dimensional vector.
Probabilities of topics given authors: Θ, a T × A matrix; probabilities of topics given author a: θ_a, a T-dimensional vector.
The probability of the corpus, conditioned on Θ and Φ (and implicitly on a fixed number of topics T), is
P(w | Θ, Φ, A) = ∏_{d=1}^{D} P(w_d | Θ, Φ, a_d).   (1)
We can obtain the probability of the words in each document d by summing over the latent variables x and z, to give
P(w_d | Θ, Φ, A) = ∏_{i=1}^{N_d} P(w_i | Θ, Φ, a_d)
                 = ∏_{i=1}^{N_d} Σ_{a=1}^{A} Σ_{t=1}^{T} P(w_i, z_i = t, x_i = a | Θ, Φ, a_d)
                 = ∏_{i=1}^{N_d} Σ_{a=1}^{A} Σ_{t=1}^{T} P(w_i | z_i = t, φ_t) P(z_i = t | x_i = a, θ_a) P(x_i = a | a_d)
                 = ∏_{i=1}^{N_d} (1/A_d) Σ_{a ∈ a_d} Σ_{t=1}^{T} φ_{w_i t} θ_{ta},   (2)
where the factorization in the third line makes use of the independence assumptions of the model. The last line expresses the probability of the words in terms of the entries of the parameter matrices Φ and Θ introduced earlier. The probability distribution over author assignments, P(x_i = a | a_d), is assumed to be uniform over the elements of a_d.
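Equation (2) can be evaluated directly from the parameter matrices; a small sketch, assuming Φ is stored as a W × T array and Θ as a T × A array as on the slide (the function and variable names are illustrative):

```python
# Sketch of Eq. (2):
#   P(w_d | Theta, Phi, a_d) = prod_i (1/A_d) * sum_{a in a_d} sum_t Phi[w_i, t] * Theta[t, a]
import numpy as np

def doc_log_likelihood(word_ids, a_d, Phi, Theta):
    """log P(w_d | Theta, Phi, a_d), with the author chosen uniformly from a_d."""
    # For word i and author a: sum_t Phi[w_i, t] * Theta[t, a]  -> shape (N_d, |a_d|)
    per_word_author = Phi[word_ids, :] @ Theta[:, a_d]
    p_word = per_word_author.mean(axis=1)          # averaging over a_d implements the 1/A_d factor
    return np.log(p_word).sum()
```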

Social Network Analysis. In a document corpus, authors interact through co-authorship, citations, etc. In an email corpus, users can be authors and recipients of email messages. The interactions among the authors/users give rise to a social network. We can extract communities from the network and discover roles played by the social actors, exploring communication patterns as well as semantics.

! a d " z $ # T z N Author-Recipient-Topic Models Eplicit modeling D authors as ell as recipient D in emails (McCallumAuthor-Topic et. al., 2005). Model Etensions: Author-Recipient-Topic eplicit modeling Model of roles: RART (AT) model. [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004] d $ # A N d (ART) [This paper] a d a d r d! " A z! " A,A z $ # T N d $ # T N d D D Figure 1: Three related models, and the ART model. In all models, each observed ord,, is generated from a multinomial ord distribution, φ z, specific to a particular topic/author, z, hoever topics are selected differently in each of the models.

Community-Author-Topic Models. Discovering e-communities through topic similarity as well as communication patterns (Ding et al., WWW 2006). [Figures: modeling a community with topics; modeling a community with topics and users; community similarity comparisons.]
In Gibbs sampling for these models, the topic z_i and the author x_i responsible for word ω_i are assigned based on the posterior probability conditioned on all other variables, P(z_i, x_i | ω_i, z_{-i}, x_{-i}, ω_{-i}, a_d), where z_{-i} and x_{-i} denote all other assignments of topics and authors excluding the current instance, ω_{-i} represents the other words in the document set, and a_d is the observed author set for this document. A key issue in using Gibbs sampling for distribution approximation is the evaluation of this conditional probability. In the Author-Topic model, given T topics and V words, it is estimated as
P(z_i = j, x_i = k | ω_i = m, z_{-i}, x_{-i}, ω_{-i}, a_d) ∝ (C^{WT}_{mj} + β) / (Σ_{m'} C^{WT}_{m'j} + Vβ) · (C^{AT}_{kj} + α) / (Σ_{j'} C^{AT}_{kj'} + Tα),
where C^{WT}_{mj} represents the number of times that word m is assigned to topic j, C^{AT}_{kj} represents the number of times that author k is assigned to topic j, and α and β are prior parameters for the topic and word Dirichlets.
The CUT models interpret a community as no more than a set of topics or a set of users: CUT_1 models a community as a distribution over topics, and CUT_2 defines a community as a distribution over users. One would expect new communities to emerge when both factors are considered simultaneously, i.e., when the community is constrained to be a joint distribution over topics and users. In that case one considers the conditional probability P(c, u, z | ω), where a word ω associates three variables: community, user, and topic; its interpretation is the probability that word ω is generated by user u under topic z in community c. Unfortunately, this conditional probability cannot be computed directly, and such nonlinear generative models require larger computational resources and call for more efficient yet approximate solutions.
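A sketch, under the assumptions stated in the comments, of how the collapsed Gibbs conditional above can be evaluated for a single token of the Author-Topic model; count matrices, priors, and indexing conventions are illustrative:

```python
# Sketch of the collapsed Gibbs conditional quoted above:
#   P(z_i=j, x_i=k | ...) ∝ (C_WT[m,j] + beta) / (sum_m' C_WT[m',j] + V*beta)
#                         * (C_AT[k,j] + alpha) / (sum_j' C_AT[k,j'] + T*alpha)
# The counts for the current token are assumed to have been decremented already.
import numpy as np

def at_gibbs_probs(m, a_d, C_WT, C_AT, alpha, beta):
    """Normalized sampling probabilities over (author k in a_d, topic j) for word id m."""
    V, T = C_WT.shape                                                   # C_WT: word-topic counts (V x T)
    word_term = (C_WT[m, :] + beta) / (C_WT.sum(axis=0) + V * beta)     # shape (T,)
    denom = C_AT[a_d, :].sum(axis=1, keepdims=True) + T * alpha         # C_AT: author-topic counts (A x T)
    author_term = (C_AT[a_d, :] + alpha) / denom                        # shape (|a_d|, T)
    probs = author_term * word_term[None, :]
    return probs / probs.sum()

# Illustrative single sampling step for one token with word id m:
# probs = at_gibbs_probs(m, a_d, C_WT, C_AT, alpha=0.1, beta=0.01)
# flat = np.random.default_rng(5).choice(probs.size, p=probs.ravel())
# k_idx, j = divmod(flat, probs.shape[1])   # author index within a_d, and topic
```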