Web-Mining Agents Topic Analysis: pLSI and LDA Tanya Braun Ralf Möller Universität zu Lübeck Institut für Informationssysteme
Acknowledgments Pilfered from: Ramesh M. Nallapati Machine Learning applied to Natural Language Processing Thomas J. Watson Research Center, Yorktown Heights, NY USA from his presentation on Generative Topic Models for Community Analysis & David M. Blei MLSS 2012 from his presentation on Probabilistic Topic Models & Sina Miran CS 290D Paper Presentation on Probabilistic Latent Semantic Indexing (PLSI) 2
Recap Agents Task/goal: Information retrieval Environment: Document repository Means: Vector space (bag-of-words) Dimension reduction (LSI) Probability based retrieval (binary) Language models Today: Topic models as special language models Probabilistic LSI (pLSI) Latent Dirichlet Allocation (LDA) Soon: What agents can take with them What agents can leave at the repository (win-win) 3
Objectives Topic Models: statistical methods that analyze the words of texts in order to: Discover the themes that run through them (topics) How those themes are connected to each other How they change over time
[Figure: per-decade frequencies of the words RELATIVITY, LASER, FORCE ("Theoretical Physics") and NEURON, NERVE, OXYGEN ("Neuroscience") in articles, 1880-2000.] 4
Topic Modeling Scenario Each topic is a distribution over words Each document is a mixture of corpus-wide topics Each word is drawn from one of those topics 5
Topic Modeling Scenario Topics Documents Topic proportions and assignments In reality, we only observe the documents The other structures are hidden variables Topic modeling algorithms infer these variables from data 6
Plate Notation Naïve Bayes Model: Compact representation
C = topic/class (name for a word distribution)
N = number of words in the considered document
W_i = one specific word in the corpus
M = number of documents
[Diagram: class node C with word children w_1, w_2, w_3, …, w_N; in plate notation, a single node W inside a plate of size N, inside a plate of size M]
Idea: Generate doc from P(W, C) 7
Generative vs. Descriptive Models Generative models: Learn P(x, y) Tasks: Transform P(x, y) into P(y | x) for classification Use the model to predict (infer) new data Advantages Assumptions and model are explicit Use well-known algorithms (e.g., MCMC, EM) Descriptive (discriminative) models: Learn P(y | x) Task: Classify data Advantages Fewer parameters to learn Better performance for classification 8
Earlier Topic Models: Topics Known Unigram No context information
P(w_1, …, w_N) = ∏_i P(w_i)
Automatically generated sentences from a unigram model:
"fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"
"thrift, did, eighty, said, hard, 'm, july, bullish"
"that, or, limited, the" 9
Earlier Topic Models: Topics Known Unigram No context information
P(w_1, …, w_N) = ∏_i P(w_i)
Mixture of Unigrams One topic per document
∏_{d=1}^M P(w_1, …, w_{N_d}, c_d) = ∏_{d=1}^M P(c_d) ∏_i P(w_i | c_d)
How to specify P(c_d)? 10
Mixture of Unigrams: Known Topics Multinomial Naïve Bayes
For each document d = 1, …, M Generate c_d ~ Mult(· | π)
Indicator vector c = (z_1, …, z_K) ∈ {0, 1}^K, P(C = c) = ∏_{k=1}^K π_k^{z_k}
For each position i = 1, …, N_d Generate w_i ~ Mult(· | β, c_d)
∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π) = ∏_{d=1}^M P(c_d | π) ∏_{i=1}^{N_d} P(w_i | β, c_d) = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}
with π_{c_d} = P(c_d | π) and β_{c_d, w_i} = P(w_i | β, c_d), both multinomial 11
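The generative process above can be sketched in a few lines of code. This is a minimal illustration, not part of the original slides; the function and parameter names are mine.

```python
import numpy as np

def generate_mixture_docs(pi, beta, doc_lengths, seed=0):
    """Sample documents from the mixture-of-unigrams model:
    one topic c_d ~ Mult(pi) per document, then every word of that
    document is drawn from the topic's word distribution beta[c_d]."""
    rng = np.random.default_rng(seed)
    K, V = np.asarray(beta).shape
    docs = []
    for n_d in doc_lengths:
        c = rng.choice(K, p=pi)                      # one topic for the whole doc
        docs.append(rng.choice(V, size=n_d, p=beta[c]))
    return docs

pi = np.array([0.5, 0.5])
beta = np.array([[1.0, 0.0, 0.0],      # topic 0 only emits word 0
                 [0.0, 0.5, 0.5]])     # topic 1 emits words 1 and 2
docs = generate_mixture_docs(pi, beta, [10, 5])
```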
Multinomial Distribution Generalization of the binomial distribution K possible outcomes instead of 2
Probability mass function: n = number of trials, x_j = count of how often class j occurs, p_j = probability of class j occurring
f(x_1, …, x_K; p_1, …, p_K) = Γ(Σ_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
Here, the input to Γ(·) is a positive integer, so Γ(n) = (n − 1)! 12
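The pmf above is easy to evaluate directly, since Γ(n + 1) = n! for the integer counts involved. A small sketch (function name is mine):

```python
import math

def multinomial_pmf(counts, probs):
    """Multinomial probability mass function:
    f(x; p) = Gamma(sum x_i + 1) / prod Gamma(x_i + 1) * prod p_i^{x_i},
    where Gamma(n + 1) = n! for non-negative integer counts."""
    n = sum(counts)
    coef = math.factorial(n)
    for x in counts:
        coef //= math.factorial(x)     # multinomial coefficient n!/(x_1!...x_K!)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob
```

For K = 2 this reduces to the binomial pmf, e.g. one head and one tail in two fair coin flips has probability 0.5.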
Mixture of Unigrams: Unknown Topics Topics/classes are hidden
Joint probability of words and classes
∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π) = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}
Sum over topics (K = number of topics)
∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M Σ_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
with π_{z_k} = P(z_k | π) and β_{z_k, w_i} = P(w_i | β, z_k) 13
Mixture of Unigrams: Learning Learn parameters π and β
Use likelihood ∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M Σ_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
Take logs Σ_{d=1}^M log P(w_1, …, w_{N_d} | β, π) = Σ_{d=1}^M log Σ_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
Solve argmax_{β,π} Σ_{d=1}^M log Σ_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
Not a concave/convex function Note: a non-concave/non-convex function is not necessarily convex/concave Possibly no unique maximum, many saddle or turning points No easy way to find roots of the derivative 14
Trick: Optimize Lower Bound [Figure: the log-likelihood as a function of (γ_{dk}, θ), together with a lower bound that touches it at the current estimate] 15
Mixture of Unigrams: Learning The problem argmax_{β,π} Σ_{d=1}^M log Σ_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
Optimize w.r.t. each document Derive lower bound
Jensen's inequality: log Σ_i γ_i x_i ≥ Σ_i γ_i log x_i, where γ_i ≥ 0 and Σ_i γ_i = 1
log Σ_i x_i = log Σ_i γ_i (x_i / γ_i) ≥ Σ_i γ_i log(x_i / γ_i) = Σ_i γ_i log x_i − Σ_i γ_i log γ_i, where the last term is the entropy H(γ)
The problem again, per document: log Σ_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k, w_i} ≥ Σ_{k=1}^K γ_k log(π_k ∏_{i=1}^{N_d} β_{k, w_i}) + H(γ) 16
Mixture of Unigrams: Learning For each document d
log Σ_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k, w_i} ≥ Σ_{k=1}^K γ_{dk} log(π_k ∏_{i=1}^{N_d} β_{k, w_i}) + H(γ)
Chicken-and-egg problem: If we knew γ_{dk}, we could find π_k and β_{k, w_i} with ML If we knew π_k and β_{k, w_i}, we could find γ_{dk} with ML Finally we need π_{z_k} and β_{z_k, w_i}
Solution: Expectation Maximization Iterative algorithm to find a local optimum Guaranteed to maximize a lower bound on the log-likelihood of the observed data 17
Graphical Idea of the EM Algorithm [Figure: alternating maximization over (γ_{dk}, θ), θ = (π_k, β_{k, w_i}): each E step re-fits the lower bound at the current parameters, each M step maximizes it] 18
Mixture of Unigrams: Learning For each document d
log Σ_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k, w_i} ≥ Σ_{k=1}^K γ_{dk} log(π_k ∏_{i=1}^{N_d} β_{k, w_i}) + H(γ)
EM solution
E step: γ_{dk}^{(t)} ∝ π_k^{(t)} ∏_{i=1}^{N_d} β_{k, w_i}^{(t)} (normalized over k)
M step: π_k^{(t+1)} = Σ_{d=1}^M γ_{dk}^{(t)} / M
β_{k, w_i}^{(t+1)} = Σ_{d=1}^M γ_{dk}^{(t)} n(d, w_i) / Σ_{d=1}^M γ_{dk}^{(t)} Σ_j n(d, w_j) 19
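The E and M steps above can be sketched compactly with a document-term count matrix. This is a minimal illustration under my own naming; it works in log space in the E step and adds a small smoothing constant in the M step so that log β stays finite.

```python
import numpy as np

def mixture_unigrams_em(counts, K, iters=50, seed=0):
    """EM for the mixture-of-unigrams model.
    counts: (D, V) document-term count matrix n(d, w).
    Returns pi (K,) topic priors and beta (K, V) topic-word distributions."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)          # random init of P(w | k)
    for _ in range(iters):
        # E step: gamma[d, k] ∝ pi_k * prod_w beta[k, w]^n(d, w), in log space
        log_gamma = np.log(pi + 1e-12) + counts @ np.log(beta).T
        log_gamma -= log_gamma.max(axis=1, keepdims=True)
        gamma = np.exp(log_gamma)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate pi and beta from the expected assignments
        pi = gamma.sum(axis=0) / D
        beta = gamma.T @ counts + 1e-3                # smoothed expected counts
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta
```

On a toy corpus whose documents use disjoint vocabularies, the two recovered topics concentrate on the two words, illustrating that EM finds a (local) optimum of the lower bound.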
Back to Topic Modeling Scenario Topics Documents Topic proportions and assignments 20
Probabilistic LSI Select a document d with probability P(d)
For each word of d in the training set Choose a topic z with probability P(z | d) Generate a word with probability P(w | z)
P(d, w_i) = P(d) Σ_{k=1}^K P(w_i | z_k) P(z_k | d)
Documents can have multiple topics
Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999 21
pLSI Joint probability for all documents and words
∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}
Distribution for document d, word w_i
P(d, w_i) = P(d) Σ_{k=1}^K P(w_i | z_k) P(z_k | d)
Reformulate P(z_k | d) with Bayes' rule
P(d, w_i) = Σ_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k) 22
pLSI: Learning Using EM Model
∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)} with P(d, w_i) = Σ_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
Likelihood
L = Σ_{d=1}^M Σ_{i=1}^{N_d} n(d, w_i) log P(d, w_i) = Σ_{d=1}^M Σ_{i=1}^{N_d} n(d, w_i) log Σ_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
Parameters to learn (M step): P(d | z_k), P(z_k), P(w_i | z_k) (E step): P(z_k | d, w_i) 23
pLSI: Learning Using EM EM solution
E step
P(z_k | d, w_i) = P(z_k) P(d | z_k) P(w_i | z_k) / Σ_{l=1}^K P(z_l) P(d | z_l) P(w_i | z_l)
M step
P(w_i | z_k) = Σ_{d=1}^M n(d, w_i) P(z_k | d, w_i) / Σ_{d=1}^M Σ_{j=1}^{N_d} n(d, w_j) P(z_k | d, w_j)
P(d | z_k) = Σ_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i) / Σ_{c=1}^M Σ_{i=1}^{N_c} n(c, w_i) P(z_k | c, w_i)
P(z_k) = (1/R) Σ_{d=1}^M Σ_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i), where R = Σ_{d=1}^M Σ_{i=1}^{N_d} n(d, w_i) 24
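The pLSI update equations above translate almost line by line into code. A sketch under my own naming; it keeps the full (K, D, V) posterior in memory, so it only suits small corpora.

```python
import numpy as np

def plsi_em(counts, K, iters=50, seed=0):
    """EM for pLSI in the symmetric parameterization
    P(d, w) = sum_k P(z_k) P(d | z_k) P(w | z_k).
    counts: (D, V) document-term count matrix n(d, w)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_z = np.full(K, 1.0 / K)                     # P(z_k)
    p_d_z = rng.dirichlet(np.ones(D), size=K)     # (K, D): P(d | z_k)
    p_w_z = rng.dirichlet(np.ones(V), size=K)     # (K, V): P(w | z_k)
    R = counts.sum()
    for _ in range(iters):
        # E step: P(z_k | d, w) ∝ P(z_k) P(d | z_k) P(w | z_k)
        post = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M step: renormalize the expected counts n(d, w) P(z_k | d, w)
        nz = counts[None, :, :] * post            # (K, D, V)
        p_w_z = nz.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = nz.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = nz.sum(axis=(1, 2)) / R
    return p_z, p_d_z, p_w_z
```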
plsi: Overview More realistic than mixture model Documents can discuss multiple topics! Problems Very many parameters Danger of overfitting 25
pLSI Test Run pLSI topics (TDT-1 corpus) Approx. 7 million words, 15,863 documents, K = 128 The two most probable topics that generate the term "flight" (left) and "love" (right). List of most probable words per topic, with probability decreasing down the list. 26
Relation with LSI
P = U_k Σ_k V_k^T
P(d, w_i) = Σ_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
with U_k = (P(d | z_k))_{d,k}, Σ_k = diag(P(z_k))_k, V_k = (P(w_i | z_k))_{i,k}
Difference: LSI: minimize the Frobenius (L2) norm pLSI: maximize the log-likelihood of the training data 27
pLSI with Multinomials Multinomial Naïve Bayes
Select document d ~ Mult(· | π)
For each position i = 1, …, N_d Generate z_i ~ Mult(· | θ_d) Generate w_i ~ Mult(· | β_{z_i})
∏_{d=1}^M P(w_1, …, w_{N_d}, d | β, θ, π) = ∏_{d=1}^M P(d | π) ∏_{i=1}^{N_d} Σ_{k=1}^K P(z_i = k | d, θ_d) P(w_i | β_k) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} Σ_{k=1}^K θ_{d,k} β_{k, w_i} 28
Prior Distribution for Topic Mixture Goal: topic mixture proportions for each document are drawn from some distribution. Distribution on multinomials (k-tuples of non-negative numbers that sum to one) The space of all of these multinomials can be interpreted geometrically as a (k−1)-simplex Generalization of a triangle to (k−1) dimensions Criteria for selecting our prior: It needs to be defined for a (k−1)-simplex Should have nice properties In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. [Wikipedia] 29
Latent Dirichlet Allocation Document = mixture of topics (as in pLSI), but according to a Dirichlet prior When we use a uniform Dirichlet prior, pLSI = LDA D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003 30
Dirichlet Distributions
p(θ | α) = Γ(Σ_i α_i) / ∏_i Γ(α_i) · ∏_{i=1}^K θ_i^{α_i − 1}
Defined over a (k−1)-simplex: takes K non-negative arguments which sum to one. Consequently it is a natural distribution to use over multinomial distributions.
The Dirichlet parameter α_i can be thought of as a prior count of the i-th class
Conjugate prior to the multinomial distribution f(x_1, …, x_K; p_1, …, p_K) = Γ(Σ_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
Conjugate prior: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also a Dirichlet 31
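The conjugacy makes the posterior update a one-liner: with multinomial counts x and a Dirichlet(α) prior, the posterior is Dirichlet(α + x). A tiny illustration (the numbers are made up):

```python
import numpy as np

# Dirichlet-multinomial conjugacy: the posterior parameters are simply
# the prior pseudo-counts plus the observed class counts.
alpha = np.array([1.0, 1.0, 1.0])   # prior "pseudo-counts"
x = np.array([5.0, 0.0, 2.0])       # observed class counts
posterior = alpha + x               # Dirichlet posterior parameters
posterior_mean = posterior / posterior.sum()
```

The posterior mean smoothly interpolates between the prior and the empirical frequencies, which is exactly the smoothing role the Dirichlet plays in LDA.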
LDA Model [Figure: unrolled LDA model for three documents: α generates a separate θ per document; each θ generates topic assignments z_1, z_2, z_3, z_4; each z_i generates a word w_i; all documents share the topic-word parameter β] 32
LDA Model Parameters
α Proportions parameter (k-dimensional vector of real numbers)
θ Per-document topic distribution (k-dimensional vector of probabilities summing up to 1)
z Per-word topic assignment (number from 1 to k)
w Observed word (number from 1 to v, where v is the number of words in the vocabulary)
β Word prior (v-dimensional) 33
Dirichlet Distribution over a 2-Simplex A panel illustrating probability density functions of a few Dirichlet distributions over a 2-simplex, for the following α vectors (clockwise, starting from the upper left corner): (1.3, 1.3, 1.3), (3,3,3), (7,7,7), (2,6,11), (14, 9, 5), (6,2,6). [Wikipedia] 34
LDA Model Plate Notation
For each document d, Generate θ_d ~ Dirichlet(· | α)
For each position i = 1, …, N_d Generate a topic z_i ~ Mult(· | θ_d) Generate a word w_i ~ Mult(· | β_{z_i})
P(θ, z, w | α, β) = ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_i | θ_d) P(w_i | β, z_i) 35
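The generative process on this slide can be run directly as a sampler. A sketch with my own function name, useful for producing synthetic corpora to test inference code against:

```python
import numpy as np

def generate_lda_corpus(alpha, beta, doc_lengths, seed=0):
    """Sample documents from the LDA generative process:
    theta_d ~ Dirichlet(alpha); for each position, z ~ Mult(theta_d)
    and w ~ Mult(beta[z]). beta: (K, V) topic-word matrix."""
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    docs = []
    for n_d in doc_lengths:
        theta = rng.dirichlet(alpha)             # per-document topic proportions
        z = rng.choice(K, size=n_d, p=theta)     # per-word topic assignments
        w = np.array([rng.choice(V, p=beta[k]) for k in z])
        docs.append(w)
    return docs

beta = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.5, 0.5]])
docs = generate_lda_corpus(np.ones(2), beta, [10, 5])
```

Unlike the mixture of unigrams, each document here mixes topics at the word level, which is the key difference the slides emphasize.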
Corpus-level Parameter α ᾱ = K^{-1} Σ_i α_i Let ᾱ = 1 Per-document topic distribution: K = 10, D = 15 [Figure: 15 bar charts, one per document, of topic proportions sampled from Dirichlet(α = 1): the proportions vary across documents, but no single topic dominates] 36
Corpus-level Parameter α [Figure: sampled per-document topic distributions for α = 10 (left) and α = 100 (right): as α grows, the proportions concentrate around the uniform distribution] 37
Corpus-level Parameter α [Figure: sampled per-document topic distributions for α = 0.1 (left) and α = 0.01 (right): as α shrinks, the distributions become sparse, putting almost all mass on one or a few topics] 38
Back to Topic Modeling Scenario Topics Documents Topic proportions and assignments 39
Smoothed LDA Model Give a different word distribution to each topic
β is a K × V matrix (V = vocabulary size); each row denotes the word distribution of a topic
For each topic k = 1, …, K Choose β_k ~ Dirichlet(· | η)
For each document d Choose θ_d ~ Dirichlet(· | α)
For each position i = 1, …, N_d Generate a topic z_i ~ Mult(· | θ_d) Generate a word w_i ~ Mult(· | β_{z_i}) 40
Smoothed LDA Model
P(β, θ, z, w | α, η) = ∏_{k=1}^K P(β_k | η) ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_{d,i} | θ_d) P(w_{d,i} | β_{1:K}, z_{d,i}) 41
Back to Topic Modeling Scenario But 42
Why does LDA work? Trade-off between two goals 1. For each document, allocate its words to as few topics as possible. 2. For each topic, assign high probability to as few terms as possible. These goals are at odds. Putting a document in a single topic makes #2 hard: All of its words must have high probability under that topic. Putting very few words in each topic makes #1 hard: To cover a document's words, it must assign many topics to it. Trading off these goals finds groups of tightly co-occurring words 43
Inference: The Problem (non-smoothed version) To which topics does a given document belong? Compute the posterior distribution of the hidden variables given a document:
P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)
P(θ, z, w | α, β) = P(θ | α) ∏_{i=1}^N P(z_i | θ) P(w_i | z_i, β)
P(w | α, β) = Γ(Σ_i α_i) / ∏_i Γ(α_i) ∫ (∏_{k=1}^K θ_k^{α_k − 1}) ∏_{i=1}^N Σ_{k=1}^K ∏_{j=1}^V (θ_k β_{kj})^{w_i^j} dθ
This not only looks awkward, but is also computationally intractable in general: coupling between θ and β_{kj}. Solution: Approximations. 44
LDA Learning Parameter learning: Variational EM Numerical approximation using lower-bounds Results in biased solutions Convergence has numerical guarantees Gibbs Sampling Stochastic simulation Unbiased solutions Stochastic convergence D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003 45
LDA: Variational Inference Replace the LDA model with a simpler one [Figure: the LDA graphical model next to a fully factorized variational model in which the coupling between θ and z is removed] Minimize the Kullback-Leibler (KL) divergence between the two distributions. 46
LDA: Gibbs Sampling MCMC algorithm Fix all current values but one and sample that value, e.g., for z: P(z_i = k | z_{−i}, w) Eventually converges to the true posterior 47
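A collapsed Gibbs sampler for (smoothed) LDA makes this concrete: θ and β are integrated out, and each z_i is resampled from a conditional built from the current count statistics. A compact sketch, not an optimized sampler; names and hyperparameter values are mine.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of word-id arrays.
    Resamples each z_i from P(z_i = k | z_-i, w) ∝
    (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))      # topic counts per document
    n_kw = np.zeros((K, V))              # word counts per topic
    n_k = np.zeros(K)                    # total counts per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):       # initialize counts from random z
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k              # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw
```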
Variational Inference vs. Gibbs Sampling Gibbs sampling is slower (takes days for moderately sized datasets); variational inference takes a few hours. Gibbs sampling is more accurate. Gibbs sampling convergence is difficult to test, although quite a few machine learning approximate inference techniques have the same problem. 48
LDA Application: Reuters Data Setup 100-topic LDA trained on a 16,000-document corpus of news articles by Reuters Some standard stop words removed Top words from some of the P(w | z):
Arts: new, film, show, music, movie, play, musical
Budgets: million, tax, program, budget, billion, federal, year
Children: children, women, people, child, years, families, work
Education: school, students, schools, education, teachers, high, public 49
LDA Application: Reuters Data Result Again: Arts, Budgets, Children, Education. The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services, Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. 50
Measuring Performance Perplexity of a probability model How well a probability distribution or probability model predicts a sample q: Model of an unknown probability distribution p, based on a training sample drawn from p Evaluate q by asking how well it predicts a separate test sample x_1, …, x_N also drawn from p Perplexity of q w.r.t. sample x_1, …, x_N defined as
2^{−(1/N) Σ_{i=1}^N log_2 q(x_i)}
A better model q will tend to assign higher probabilities q(x_i) → lower perplexity ("less surprised by the sample") [Wikipedia] 51
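The definition above is a one-line computation once the model probabilities q(x_i) of the test items are known (function name is mine):

```python
import math

def perplexity(probs):
    """Perplexity 2^{-(1/N) sum_i log2 q(x_i)} of a model on a test
    sample, given the model probabilities q(x_i) of each test item."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)
```

As a sanity check, a model that assigns uniform probability 1/4 to every test item has perplexity 4: it is exactly as "surprised" as a uniform choice among four options.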
Perplexity of Various Models [Figure: held-out perplexity for the unigram model and LDA; LDA achieves the lower perplexity] 52
Use of LDA A widely used topic model (Griffiths, Steyvers, 04) Complexity is an issue Use in IR: Ad hoc retrieval (Wei and Croft, SIGIR 06: TREC benchmarks) Improvements over traditional LM (e.g., LSI techniques) But no consensus on whether there is any improvement over Relevance model, i.e., model with relevance feedback (relevance feedback part of the TREC tests) T. Griffiths, M. Steyvers, Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004 Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 178-185. 2006. TREC=Text REtrieval Conference 53
Generative Topic Models for Community Analysis Pilfered from: Ramesh Nallapati http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt & Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling of Text and Network Data 54
What if the corpus has network structure? CORA citation network. Figure from [Chang, Blei, AISTATS 2009] J. Chang, and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009. 55
Outline Topic Models for Community Analysis Citation Modeling with plsi with LDA Modeling influence of citations Relational Topic Models 56
Hyperlink Modeling Using pLSI
Select document d ~ Mult(· | π)
For each position n = 1, …, N_d Generate z_n ~ Mult(· | θ_d) Generate w_n ~ Mult(· | β_{z_n})
For each citation j = 1, …, L_d Generate z_j ~ Mult(· | θ_d) Generate c_j ~ Mult(· | γ_{z_j})
D. A. Cohn, Th. Hofmann, The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000 57
Hyperlink Modeling Using pLSI pLSI likelihood
∏_{d=1}^M P(w_1, …, w_{N_d}, d | θ, β, π) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} Σ_{k=1}^K θ_{dk} β_{k, w_i}
New likelihood
∏_{d=1}^M P(w_1, …, w_{N_d}, c_1, …, c_{L_d}, d | θ, β, γ, π) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} Σ_{k=1}^K θ_{dk} β_{k, w_i} ∏_{j=1}^{L_d} Σ_{k=1}^K θ_{dk} γ_{k, c_j}
Learning using EM 58
Hyperlink Modeling Using pLSI Heuristic: 0 < α < 1 determines the relative importance of content and hyperlinks
∏_{d=1}^M P(w_1, …, w_{N_d}, c_1, …, c_{L_d}, d | θ, β, γ, π) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} (Σ_{k=1}^K θ_{dk} β_{k, w_i})^α ∏_{j=1}^{L_d} (Σ_{k=1}^K θ_{dk} γ_{k, c_j})^{1−α} 59
Hyperlink Modeling Using pLSI [Figure: classification performance using hyperlinks vs. content] 60
Hyperlink Modeling Using LDA
For each document d, Generate θ_d ~ Dirichlet(· | α)
For each position i = 1, …, N_d Generate a topic z_i ~ Mult(· | θ_d) Generate a word w_i ~ Mult(· | β_{z_i})
For each citation j = 1, …, L_d Generate z_j ~ Mult(· | θ_d) Generate c_j ~ Mult(· | γ_{z_j})
Learning using variational EM, Gibbs sampling
E. Erosheva, S. Fienberg, J. Lafferty, Mixed-membership models of scientific publications. Proc. National Academy of Sciences U S A. 2004 Apr 6;101 Suppl 1:5220-7. Epub 2004 Mar 12. 61
Link-pLSI-LDA: Topic Influence in Blogs R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008. 62
Modeling Citation Influences - Copycat Model Each topic in a citing document is drawn from one of the topic mixtures of cited publications L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007. 64
Modeling Citation Influences Citation influence model: Combination of LDA and Copycat model L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007. 65
Modeling Citation Influences Citation influence graph for LDA paper 66
Modeling Citation Influences Words in LDA paper assigned to citations 67
Modeling Citation Influences Predictive Performance 68
Relational Topic Model (RTM) [Chang, Blei 2009] Same setup as LDA, except now we have observed network information across documents: a binary link indicator per document pair, y_{d,d'} ~ ψ(· | z_d, z_{d'}, η), where ψ is the link probability function Documents with similar topics are more likely to be linked. J. Chang, and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009. 69
Relational Topic Model (RTM) [Chang, Blei 2009]
For each document d Draw topic proportions θ_d | α ~ Dir(α)
For each word w_{d,n} Draw assignment z_{d,n} | θ_d ~ Mult(θ_d) Draw word w_{d,n} | z_{d,n}, β_{1:K} ~ Mult(β_{z_{d,n}})
For each pair of documents d, d' Draw binary link indicator y | z_d, z_{d'} ~ ψ(· | z_d, z_{d'}, η) 70
Collapsed Gibbs Sampling for RTM Conditional distribution of each z:
P(z_{d,n} = k | z_{−(d,n)}, …) ∝ (N_{dk}^{−(d,n)} + α) · (N_{kw}^{−(d,n)} + β) / (N_k^{−(d,n)} + Wβ) · ∏_{d': y_{d,d'} = 1} ψ(y_{d,d'} = 1 | z_d, z_{d'}, η) · ∏_{d': y_{d,d'} = 0} ψ(y_{d,d'} = 0 | z_d, z_{d'}, η)
LDA term · Edge term · Non-edge term
Using the exponential link probability function, it is computationally efficient to calculate the edge term. It is very costly to compute the non-edge term exactly. 71
Approximating the Non-edges 1. Assume non-edges are missing and ignore the term entirely (Chang/Blei) 2. Make the following fast approximation: ∏_i (1 − exp(c_i)) ≈ (1 − exp(c̄))^N, where c̄ = (1/N) Σ_i c_i 3. Subsample non-edges and exactly calculate the term over the subset. 4. Subsample non-edges but, instead of recalculating statistics for every z_{d,n} token, calculate statistics once per document and cache them over each Gibbs sweep. 72
Document networks
CORA: 4,000 docs, 17,000 links, avg. doc length 1,200, vocab size 60,000; link semantics: paper citation (undirected)
Netflix Movies: 10,000 docs, 43,000 links, avg. doc length 640, vocab size 38,000; link semantics: common actor/director
Enron (undirected): 1,000 docs, 16,000 links, avg. doc length 7,000, vocab size 55,000; link semantics: communication between person i and person j
Enron (directed): 2,000 docs, 21,000 links, avg. doc length 3,500, vocab size 55,000; link semantics: email from person i to person j 73
Link Rank Link rank on held-out data as evaluation metric Lower is better [Diagram: the trained model acts as a black-box predictor: given the edges among {d_train} and a test document d_test, it produces a ranking over {d_train}; the observed edges between d_test and {d_train} yield the link ranks] How to compute link rank (simplified): 1. Train the model with {d_train}. 2. Given the model, calculate the probability that d_test would link to each d_train. Rank {d_train} according to these probabilities. 3. For each observed link between d_test and {d_train}, find the rank, and average all these ranks to obtain the link rank 74
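Step 3 of the recipe above reduces to ranking the training documents by predicted link probability and averaging the ranks of the observed links. A small sketch (function name is mine):

```python
import numpy as np

def link_rank(link_probs, observed_links):
    """Average rank (1 = best) of the observed links for one test document.
    link_probs: (D_train,) predicted link probabilities for each training
    document; observed_links: indices of the actually linked documents."""
    order = np.argsort(-link_probs)                  # descending by probability
    rank = np.empty(len(link_probs), dtype=int)
    rank[order] = np.arange(1, len(link_probs) + 1)  # rank of each document
    return rank[np.array(observed_links)].mean()
```

With uniformly random scores over D_train documents, the expected link rank is about D_train / 2, which matches the "random guessing gives 2000 on 4,000 CORA documents" remark on the results slide.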
Results on CORA data [Figure: link rank on CORA, K = 20, for the baseline (TF-IDF/cosine), LDA + regression, and the RTM variants: ignoring non-edges, fast approximation of non-edges, and subsampling non-edges (20%) + caching] 8-fold cross-validation. Random guessing gives link rank = 2000. 75
Results on CORA data [Figure, left: link rank vs. number of topics for the baseline and RTM with the fast approximation. Figure, right: link rank vs. percentage of observed words for the baseline, LDA + regression (K = 40), ignoring non-edges (K = 40), fast approximation (K = 40), and subsampling (5%) + caching (K = 40)] The model does better with more topics. The model does better with more words in each document. 76
Timing Results on CORA [Figure: runtime in seconds vs. number of documents on CORA, K = 20, for LDA + regression, ignoring non-edges, fast approximation, subsampling (5%) + caching, and subsampling (20%) + caching] Subsampling (20%) without caching is not shown, since it takes 62,000 seconds for D = 1,000 and 3,720,150 seconds for D = 4,000. 77
Conclusion Relational topic modeling provides a useful start for combining text and network data in a single statistical framework RTM can improve over simpler approaches for link prediction Opportunities for future work: Faster algorithms for larger data sets Better understanding of non-edge modeling Extended models 78