Web Search and Text Mining Lecture 16: Topics and Communities
Outline Latent Dirichlet Allocation (LDA) Graphical models for social networks
Exploration, discovery, and query-answering in the context of the relationships of authors/actors, topics, communities, and roles for large document collections and the associated social networks. 1) analysis of topic trends; 2) find the authors who are most likely to write on a topic; 3) find the unusual papers written by an author; 4) discover organizational structure via communication patterns; 5) discover the roles/community memberships of social actors.
Topic Models Discovering the topics in a text corpus. Latent semantic indexing (the discovered dimensions are hard to interpret and not natural from a generative-model viewpoint). Probabilistic latent semantic indexing (not a complete generative model). Latent Dirichlet allocation: topics are multinomial distributions over words (Blei et al., 2003).
Representation of Topics

WORD       PROB.   WORD         PROB.   WORD     PROB.   WORD         PROB.
LIGHT      .0306   RECOGNITION  .0500   KERNEL   .0547   SOURCE       .0389
RESPONSE   .0282   CHARACTER    .0334   VECTOR   .0293   INDEPENDENT  .0376
INTENSITY  .0252   TANGENT      .0246   SUPPORT  .0293   SOURCES      .0344
RETINA     .0241   CHARACTERS   .0232   MARGIN   .0239   SEPARATION   .0322
OPTICAL    .0233   DISTANCE     .0197   SVM      .0196   INFORMATION  .0319
KOCH       .0190   HANDWRITTEN  .0166   DATA     .0165   ICA          .0276

Each topic is represented as a multinomial distribution over the set of words in a vocabulary: φ_t = [φ_{1t}, ..., φ_{Vt}]^T, with Σ_{i=1}^V φ_{it} = 1. Conceptually, each document is a mixture of topics.
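A minimal sketch of the representation above: a topic is just a normalized probability table over the vocabulary. The numbers loosely echo one column of the slide's table and are restricted to a toy vocabulary, so they must be renormalized to sum to one.

```python
# A topic phi_t as a multinomial distribution over a (toy) vocabulary.
# Probabilities are illustrative values from the slide's table, renormalized.
phi_t = {"kernel": 0.0547, "vector": 0.0293, "support": 0.0293,
         "margin": 0.0239, "svm": 0.0196, "data": 0.0165}

# Renormalize so the restricted distribution sums to 1.
total = sum(phi_t.values())
phi_t = {w: p / total for w, p in phi_t.items()}

assert abs(sum(phi_t.values()) - 1.0) < 1e-12
```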
Some Models Unigram model: the words of every document are drawn independently from a single multinomial distribution: p(w) = Π_{i=1}^N p(w_i). [Graphical representation: plate diagram (a), the unigram model.]
Mixture of unigrams Each document is generated by 1) choosing a topic z; 2) generating N words independently from the conditional multinomial p(w|z): p(w) = Σ_z p(z) Π_{i=1}^N p(w_i|z). A document belongs to a single topic.
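The two-step process above can be sketched directly (all topics, words, and probabilities here are toy assumptions, not from the lecture): one topic is drawn per document, then every word is drawn i.i.d. from that single topic's word distribution.

```python
import random

random.seed(0)

p_z = {"vision": 0.5, "svm": 0.5}          # p(z): prior over topics
p_w_given_z = {                            # p(w | z): per-topic word dists
    "vision": {"light": 0.6, "retina": 0.4},
    "svm":    {"kernel": 0.7, "margin": 0.3},
}

def generate_document(n_words):
    # Step 1: choose a single topic z for the whole document.
    z = random.choices(list(p_z), weights=p_z.values())[0]
    # Step 2: draw all N words independently from p(w | z).
    words = random.choices(list(p_w_given_z[z]),
                           weights=p_w_given_z[z].values(), k=n_words)
    return z, words

z, doc = generate_document(5)
# Every word comes from the one chosen topic's vocabulary.
assert all(w in p_w_given_z[z] for w in doc)
```

This makes the model's limitation concrete: unlike LDA, a document can never mix words from two topics.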
[Graphical representation: plate diagrams for (a) the unigram model and (b) the mixture of unigrams.]
pLSI A document label d and a word w_i are conditionally independent given an unobserved topic z: p(d, w_i) = Σ_z p(z) p(d, w_i|z) = Σ_z p(z) p(d|z) p(w_i|z) = p(d) Σ_z p(z|d) p(w_i|z). A document can have multiple topics, with p(z|d) as the mixture weights.
[Graphical representation: (c) the pLSI/aspect model, which augments the unigram model with a discrete random topic variable z.] But drawbacks: 1) p(z|d) is defined only for documents in the training set, so there is no generalization to unseen documents; 2) the number of parameters, kV + kM, grows linearly with the number of documents, making the model prone to overfitting.
LDA Key difference from pLSI: the topic mixture weights are modeled by a k-parameter hidden random variable rather than a large set of individual parameters explicitly linked to the training set (p(z|d) for each of the M training documents). The hidden variable θ ~ Dirichlet(α).
[Figure 10: graphical models for (a) the Topic (LDA) model and (b) the Author model.] The corpus is generated by a three-step process: 1) for each document, a distribution over topics is sampled from a Dirichlet distribution, θ ~ Dirichlet(α); 2) for each word in the document, a single topic is chosen according to this distribution, z ~ Multinomial(θ); 3) each word is sampled from a multinomial distribution over words specific to the sampled topic, w ~ Multinomial(φ_z).
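The three steps of the LDA generative process can be sketched with the standard library alone (the Dirichlet draw uses the gamma-normalization trick; α, φ, and the vocabulary are toy assumptions):

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_lda_document(alpha, phi, vocab, n_words, rng):
    theta = sample_dirichlet(alpha, rng)          # step 1: theta ~ Dir(alpha)
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # step 2: z ~ Mult(theta)
        w = rng.choices(vocab, weights=phi[z])[0]             # step 3: w ~ Mult(phi_z)
        doc.append((z, w))
    return theta, doc

rng = random.Random(42)
vocab = ["light", "retina", "kernel", "margin"]
phi = [[0.5, 0.5, 0.0, 0.0],   # topic 0: vision-like words (toy values)
       [0.0, 0.0, 0.6, 0.4]]   # topic 1: SVM-like words (toy values)
theta, doc = generate_lda_document([0.5, 0.5], phi, vocab, 10, rng)
assert abs(sum(theta) - 1.0) < 1e-12 and len(doc) == 10
```

Unlike the mixture of unigrams, a single document here can contain words drawn from both topics, with per-document proportions θ.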
Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by

p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β).

Summing over z and integrating over θ gives the marginal distribution of a document:

p(w | α, β) = ∫ p(θ | α) Π_{n=1}^N Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) dθ.
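The integral over θ has no closed form in general, but for a toy model it can be approximated by Monte Carlo: sample θ from the Dirichlet prior many times and average the inner sum over topics. All parameters below (α, φ) are illustrative assumptions; with a symmetric α, E[θ] = (0.5, 0.5), so the marginal of word 0 is about 0.5·0.9 + 0.5·0.2 = 0.55.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

phi = [[0.9, 0.1],      # phi[z][w] = p(word w | topic z), toy values
       [0.2, 0.8]]
alpha = [1.0, 1.0]
rng = random.Random(0)

def marginal_word_prob(w, n_samples=20000):
    """Monte Carlo estimate of p(w | alpha, beta) for a single word."""
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        total += sum(theta[z] * phi[z][w] for z in range(len(alpha)))
    return total / n_samples

p0 = marginal_word_prob(0)
assert abs(p0 - 0.55) < 0.02   # matches E[theta] . phi[:, 0] = 0.55
```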
Geometric Picture The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics, i.e., a point on the topic simplex; there is one such distribution for each document. [Figure 4: The topic simplex for three topics embedded in the word simplex for three words. The corners of the word simplex correspond to the three distributions where each word (respectively) has probability one. The three points of the topic simplex correspond to three different distributions over words. The mixture of unigrams places each document at one of the corners of the topic simplex. The pLSI model induces an empirical distribution on the topic simplex, denoted by x. LDA places a smooth distribution on the topic simplex, denoted by the contour lines.] The word simplex is a (V−1)-dimensional simplex.
1. Unigram: one point in the word simplex; all words come from the same distribution. The other models use latent topics, forming a (K−1)-subsimplex from K points in the word simplex. 2. Mixture of unigrams: one of the K points is chosen randomly per document. 3. pLSI: one point in the topic simplex for each document in the training set. 4. LDA: both training documents and unseen documents are drawn from a continuous distribution on the topic simplex.
Author Models Documents are written by authors; model documents based on the interests of authors, expressed as author-word distributions (McCallum, 1999). [Figure 10: different generative models for documents: (a) Topic (LDA), (b) Author, (c) Author-Topic.]
Author-Topic Models Each author is a multinomial distribution over topics (Rosen-Zvi et al., 2005). 1) An author is chosen uniformly from the document's author list; each author has a distribution over topics. 2) A topic is chosen from that author's distribution, and a word is sampled from the topic's distribution over words. [Figure: graphical models for (b) the Author model and (c) the Author-Topic model.]
Under this generative process, each topic is drawn independently when conditioned on Θ, and each word is drawn independently when conditioned on Φ and z. Probabilities of words given topics: Φ, a W × T matrix; probabilities of words given topic t: φ_t, a W-dimensional vector. Probabilities of topics given authors: Θ, a T × A matrix; probabilities of topics given author a: θ_a, a T-dimensional vector. The probability of the corpus, conditioned on Θ and Φ (and implicitly on a fixed number of topics T), is

P(w | Θ, Φ, A) = Π_{d=1}^D P(w_d | Θ, Φ, a_d).  (1)

We can obtain the probability of the words in each document, w_d, by summing over the latent variables x and z, to give

P(w_d | Θ, Φ, A) = Π_{i=1}^{N_d} P(w_i | Θ, Φ, a_d)
  = Π_{i=1}^{N_d} Σ_{a=1}^A Σ_{t=1}^T P(w_i, z_i = t, x_i = a | Θ, Φ, a_d)
  = Π_{i=1}^{N_d} Σ_{a=1}^A Σ_{t=1}^T P(w_i | z_i = t, φ_t) P(z_i = t | x_i = a, θ_a) P(x_i = a | a_d)
  = Π_{i=1}^{N_d} (1/A_d) Σ_{a ∈ a_d} Σ_{t=1}^T φ_{w_i t} θ_{ta},  (2)

where the factorization in the third line makes use of the independence assumptions of the model. The last line expresses the probability of the words in terms of the entries of the parameter matrices Φ and Θ introduced earlier. The probability distribution over author assignments, P(x_i = a | a_d), is assumed to be uniform over the elements of a_d (and deterministic if the document has a single author).
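Equation (2) can be evaluated directly given Φ and Θ. The sketch below uses tiny hypothetical matrices (two topics, two authors, three words); the uniform 1/A_d factor implements the uniform author assignment noted above.

```python
# Toy parameters (assumed, not from the lecture):
phi = {                       # phi[w][t] = P(word w | topic t); columns sum to 1
    "kernel": [0.1, 0.7],
    "margin": [0.2, 0.3],
    "retina": [0.7, 0.0],
}
theta = [[0.8, 0.1],          # theta[t][a] = P(topic t | author a); columns sum to 1
         [0.2, 0.9]]
T = 2

def doc_probability(words, authors):
    """Equation (2): P(w_d) = prod_i (1/A_d) sum_{a in a_d} sum_t phi[w_i][t] theta[t][a]."""
    p = 1.0
    for w in words:
        s = sum(phi[w][t] * theta[t][a] for a in authors for t in range(T))
        p *= s / len(authors)            # uniform P(x_i = a | a_d)
    return p

p = doc_probability(["kernel", "margin"], authors=[0, 1])
assert 0.0 < p <= 1.0
```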
Social Network Analysis In a document corpus, authors interact through co-authorship, citations, etc. In an email corpus, users can be authors and recipients of email messages. The interactions among the authors/users give rise to a social network. We can extract communities from the network and discover the roles played by the social actors, exploring communication patterns as well as semantics.
Author-Recipient-Topic Models Explicit modeling of authors as well as recipients in emails (McCallum et al., 2005). Extensions: explicit modeling of roles: the RART model. [Figure 1: three related models (Topic/LDA, Author, and Author-Topic [Rosen-Zvi, Griffiths, Steyvers, Smyth 2004]) and the ART model. In all models, each observed word w is generated from a multinomial word distribution φ_z specific to a particular topic/author z; however, topics are selected differently in each of the models.]
Community-Author-Topic Models Discovering e-communities through topic similarity as well as communication patterns (Ding et al., WWW 2006). A word ω associates three variables: community, user, and topic; P(c, u, z | ω) is the probability that word ω is generated by user u under topic z in community c. One variant (CUT_1) defines a community as no more than a distribution over users; another (CUT_2) models a community with topics; a further model constrains the community as a joint distribution over both topics and users. Typically, the topics (users) associated with a community represent high semantic similarity; when both factors are considered simultaneously, new communities can emerge.

Unfortunately, the conditional probability P(c, u, z | ω) cannot be computed directly; such nonlinear generative models require more efficient yet approximate solutions, e.g., Gibbs sampling. The topic z_i and the author x_i responsible for word ω_i are assigned based on the posterior probability conditioned on all other variables, P(z_i, x_i | ω_i, z_{-i}, x_{-i}, ω_{-i}, a_d), where z_{-i} and x_{-i} denote all other topic and author assignments, ω_{-i} the other words in the document set, and a_d the observed author set for the document. In the Author-Topic model, given T topics and a vocabulary of size V, this conditional is estimated by

P(z_i = j, x_i = k | ω_i = m, z_{-i}, x_{-i}, ω_{-i}, a_d) ∝ (C^{WT}_{mj} + β) / (Σ_{m'} C^{WT}_{m'j} + Vβ) · (C^{AT}_{kj} + α) / (Σ_{j'} C^{AT}_{kj'} + Tα),

where C^{WT}_{mj} is the number of times that word m is assigned to topic j, C^{AT}_{kj} is the number of times that author k is assigned topic j (both excluding the current instance), and α and β are the Dirichlet priors on the topic and word distributions.
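The Gibbs-sampling conditional for the Author-Topic model can be sketched with toy count matrices (all counts and hyperparameters below are assumed values for illustration):

```python
V, T = 4, 2                  # vocabulary size, number of topics (toy)
alpha, beta = 0.1, 0.01      # Dirichlet priors (toy)

# C_WT[m][j]: times word m is assigned to topic j;
# C_AT[k][j]: times topic j is assigned to author k
# (in a real sampler these would exclude the current word instance).
C_WT = [[3, 0], [1, 2], [0, 4], [2, 1]]
C_AT = [[5, 1], [1, 6]]

def conditional(m, j, k):
    """P(z_i=j, x_i=k | w_i=m, ...) up to the normalizing constant."""
    word_term = (C_WT[m][j] + beta) / (sum(C_WT[mm][j] for mm in range(V)) + V * beta)
    author_term = (C_AT[k][j] + alpha) / (sum(C_AT[k]) + T * alpha)
    return word_term * author_term

# Normalize over topics for one word and a single-author document:
probs = [conditional(0, j, 0) for j in range(T)]
total = sum(probs)
probs = [p / total for p in probs]
assert abs(sum(probs) - 1.0) < 1e-12
```

In a full sampler these normalized probabilities would be sampled from, the chosen (topic, author) pair recorded, and the count matrices updated before moving to the next word.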