
Web-Mining Agents Topic Analysis: plsi and LDA Tanya Braun Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Acknowledgments. Pilfered from: Ramesh M. Nallapati (Machine Learning applied to Natural Language Processing, Thomas J. Watson Research Center, Yorktown Heights, NY, USA), from his presentation on Generative Topic Models for Community Analysis; David M. Blei (MLSS 2012), from his presentation on Probabilistic Topic Models; and Sina Miran, CS 290D paper presentation on Probabilistic Latent Semantic Indexing (PLSI).

Recap: Agents. Task/goal: information retrieval. Environment: document repository. Means: vector space (bag-of-words), dimension reduction (LSI), probability-based retrieval (binary), language models. Today: topic models as special language models, Probabilistic LSI (plsi) and Latent Dirichlet Allocation (LDA). Soon: what agents can take with them, and what agents can leave at the repository (win-win).

Objectives. Topic Models: statistical methods that analyze the words of texts in order to discover the themes that run through them (topics), how those themes are connected to each other, and how they change over time. [Figure: frequency of the words FORCE, LASER, RELATIVITY ("Theoretical Physics") and NEURON, NERVE, OXYGEN ("Neuroscience") over time, 1880-2000]

Topic Modeling Scenario. Each topic is a distribution over words. Each document is a mixture of corpus-wide topics. Each word is drawn from one of those topics.

Topic Modeling Scenario: Topics, Documents, Topic proportions and assignments. In reality, we only observe the documents. The other structures are hidden variables. Topic modeling algorithms infer these variables from the data.

Plate Notation. Naïve Bayes model: compact representation. C = topic/class (name for a word distribution), N = number of words in the considered document, W_i = one specific word, M = number of documents. Idea: generate a document from P(W, C). [Diagram: the expanded model, C with children w_1, w_2, w_3, ..., w_N, next to the plate notation, C pointing to W inside a plate over N words, nested in a plate over M documents]

Generative vs. Descriptive Models. Generative models: learn P(x, y). Tasks: transform P(x, y) into P(y|x) for classification; use the model to predict (infer) new data. Advantages: assumptions and model are explicit; use well-known algorithms (e.g., MCMC, EM). Descriptive models: learn P(y|x). Task: classify data. Advantages: fewer parameters to learn; better performance for classification.

Earlier Topic Models: Topics Known. Unigram model: no context information. $P(w_1, \ldots, w_N) = \prod_i P(w_i)$. Automatically generated sentences from a unigram model: "fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass!", "thrift, did, eighty, said, hard, 'm, july, bullish!", "that, or, limited, the!"
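To make the unigram model concrete, here is a minimal sketch (not part of the slides) that samples "sentences" word by word from a single corpus-wide word distribution; the toy vocabulary and probabilities are invented for illustration.

```python
import numpy as np

# Toy unigram model: one multinomial over the whole vocabulary (made-up numbers).
vocab = ["the", "a", "of", "inflation", "dollars", "quarter", "futures", "said"]
probs = np.array([0.25, 0.15, 0.12, 0.10, 0.10, 0.10, 0.09, 0.09])

rng = np.random.default_rng(0)

def generate_unigram_sentence(length=10):
    # Every position is drawn i.i.d. from P(w); no context is used,
    # which is why the output reads like the scrambled examples above.
    return " ".join(rng.choice(vocab, size=length, p=probs))

print(generate_unigram_sentence())
```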

Earlier Topic Models: Topics Known. Unigram: $P(w_1, \ldots, w_N) = \prod_i P(w_i)$. Mixture of Unigrams: one topic per document, $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, c_d) = \prod_{d=1}^{M} P(c_d) \prod_i P(w_i \mid c_d)$. How to specify $P(c_d)$?

Mixture of Unigrams: Known Topics (Multinomial Naïve Bayes). For each document $d = 1, \ldots, M$: generate $c_d \sim \mathrm{Mult}(\cdot \mid \pi)$; for each position $i = 1, \ldots, N_d$: generate $w_i \sim \mathrm{Mult}(\cdot \mid \beta, c_d)$. With an indicator vector $c = (z_1, \ldots, z_K) \in \{0,1\}^K$, $P(C = c) = \prod_{k=1}^{K} \pi_k^{z_k}$. Joint probability: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, c_d \mid \beta, \pi) = \prod_{d=1}^{M} P(c_d \mid \pi) \prod_{i=1}^{N_d} P(w_i \mid \beta, c_d) = \prod_{d=1}^{M} \pi_{c_d} \prod_{i=1}^{N_d} \beta_{c_d, w_i}$, where $\pi_{c_d} = P(c_d \mid \pi)$ and $\beta_{c_d, w_i} = P(w_i \mid \beta, c_d)$ are multinomial.
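The following is a small illustrative sketch of this generative process (my own toy parameters π and β, not taken from the slides): one class is drawn per document, and every word of that document is then drawn from that class's word distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

K, V = 2, 5                      # topics/classes, vocabulary size
pi = np.array([0.6, 0.4])        # P(c) -- toy class prior
beta = np.array([                # beta[k, w] = P(w | class k), toy values
    [0.4, 0.3, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.3, 0.3],
])

def generate_document(n_words):
    c = rng.choice(K, p=pi)                         # one topic per document
    words = rng.choice(V, size=n_words, p=beta[c])  # all words from that topic
    return c, words

c_d, w_d = generate_document(8)
print("class:", c_d, "words:", w_d)
```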

Multinomial Distribution. Generalization of the binomial distribution: $K$ possible outcomes instead of 2. Probability mass function, with $n$ = number of trials, $x_j$ = count of how often class $j$ occurs, $p_j$ = probability of class $j$ occurring: $f(x_1, \ldots, x_K; p_1, \ldots, p_K) = \frac{\Gamma(\sum_i x_i + 1)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{K} p_i^{x_i}$. Here, the input to $\Gamma(\cdot)$ is a positive integer, so $\Gamma(n) = (n-1)!$.

Mixture of Unigrams: Unknown Topics. Topics/classes are hidden. Joint probability of words and classes: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, c_d \mid \beta, \pi) = \prod_{d=1}^{M} \pi_{c_d} \prod_{i=1}^{N_d} \beta_{c_d, w_i}$. Sum over topics ($K$ = number of topics): $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d} \mid \beta, \pi) = \prod_{d=1}^{M} \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$, where $\pi_{z_k} = P(z_k \mid \pi)$ and $\beta_{z_k, w_i} = P(w_i \mid \beta, z_k)$.

Mixture of Unigrams: Learning. Learn the parameters $\pi$ and $\beta$ using the likelihood $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d} \mid \beta, \pi) = \prod_{d=1}^{M} \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$, i.e., the log-likelihood $\sum_{d=1}^{M} \log P(w_1, \ldots, w_{N_d} \mid \beta, \pi) = \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$. Solve $\operatorname{argmax}_{\beta, \pi} \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$. Not a concave/convex function (note: a non-concave/non-convex function is not necessarily convex/concave). Possibly no unique maximum, many saddle or turning points. No easy way to find the roots of the derivative.

Trick: Optimize a Lower Bound. [Figure: the log-likelihood and a lower bound $\mathcal{L}(\gamma_{dk}, \theta)$ of it]

Mixture of Unigrams: Learning. The problem: $\operatorname{argmax}_{\beta, \pi} \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$. Optimize w.r.t. each document and derive a lower bound via Jensen's inequality: $\log \sum_i \gamma_i x_i \geq \sum_i \gamma_i \log x_i$, where $\gamma_i \geq 0$ and $\sum_i \gamma_i = 1$. Hence $\log \sum_i x_i = \log \sum_i \gamma_i \frac{x_i}{\gamma_i} \geq \sum_i \gamma_i \log \frac{x_i}{\gamma_i} = \sum_i \gamma_i \log x_i - \sum_i \gamma_i \log \gamma_i$, where the last term is the entropy $H(\gamma)$. The problem again, per document: $\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \geq \sum_{k=1}^{K} \gamma_k \log\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$.
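As a quick numeric sanity check of this bound (my own example, not from the slides): for any non-negative weights γ that sum to one and positive x, log Σ_i γ_i x_i is at least Σ_i γ_i log x_i.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=6)      # arbitrary positive values
gamma = rng.dirichlet(np.ones(6))      # non-negative weights summing to one

lhs = np.log(np.sum(gamma * x))        # log sum_i gamma_i x_i
rhs = np.sum(gamma * np.log(x))        # sum_i gamma_i log x_i
assert lhs >= rhs                      # Jensen's inequality (log is concave)
print(lhs, rhs)
```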

Mixture of Unigrams: Learning. For each document $d$: $\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \geq \sum_{k=1}^{K} \gamma_{dk} \log\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$. Chicken-and-egg problem: if we knew $\gamma_{dk}$, we could find $\pi_k$ and $\beta_{k, w_i}$ with ML; if we knew $\pi_k$ and $\beta_{k, w_i}$, we could find $\gamma_{dk}$ with ML. Finally we need $\pi_{z_k}$ and $\beta_{z_k, w_i}$. Solution: Expectation Maximization, an iterative algorithm to find a local optimum, guaranteed to maximize a lower bound on the log-likelihood of the observed data.

Graphical Idea of the EM Algorithm. [Figure: alternately maximizing the lower bound $\mathcal{L}(\gamma_{dk}, \theta)$ over $\gamma_{dk}$ and over $\theta = (\pi_k, \beta_{k, w_i})$]

Mixture of Unigrams: Learning. For each document $d$: $\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \geq \sum_{k=1}^{K} \gamma_{dk} \log\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$. EM solution. E step: $\gamma_{dk}^{(t)} = \frac{\pi_k^{(t)} \prod_{i=1}^{N_d} \beta_{k, w_i}^{(t)}}{\sum_{j=1}^{K} \pi_j^{(t)} \prod_{i=1}^{N_d} \beta_{j, w_i}^{(t)}}$. M step: $\pi_k^{(t+1)} = \frac{\sum_{d=1}^{M} \gamma_{dk}^{(t)}}{M}$, $\beta_{k, w_i}^{(t+1)} = \frac{\sum_{d=1}^{M} \gamma_{dk}^{(t)} \, n(d, w_i)}{\sum_{d=1}^{M} \gamma_{dk}^{(t)} \sum_{j=1}^{N_d} n(d, w_j)}$.
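A compact sketch of these EM updates on a document-term count matrix (my own illustration; the names n_dw, gamma, pi and beta mirror the symbols in the formulas above, and a tiny smoothing constant is added to avoid log(0)).

```python
import numpy as np

def em_mixture_of_unigrams(n_dw, K, n_iter=50, seed=0):
    """EM for the mixture of unigrams. n_dw: (M, V) matrix of counts n(d, w)."""
    rng = np.random.default_rng(seed)
    M, V = n_dw.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)            # beta[k, w] = P(w | z_k)
    for _ in range(n_iter):
        # E step: gamma[d, k] is proportional to pi_k * prod_i beta[k, w_i] (in log space)
        log_gamma = np.log(pi) + n_dw @ np.log(beta).T
        log_gamma -= log_gamma.max(axis=1, keepdims=True)
        gamma = np.exp(log_gamma)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: pi_k = sum_d gamma_dk / M,  beta_kw proportional to sum_d gamma_dk n(d, w)
        pi = gamma.sum(axis=0) / M
        beta = gamma.T @ n_dw + 1e-12
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, gamma
```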

Back to Topic Modeling Scenario Topics Documents Topic proportions and assignments 20

Probabilistic LSI. Select a document $d$ with probability $P(d)$. For each word of $d$ in the training set: choose a topic $z$ with probability $P(z \mid d)$, then generate a word with probability $P(w \mid z)$. $P(d, w_i) = P(d) \sum_{k=1}^{K} P(w_i \mid z_k) P(z_k \mid d)$. Documents can have multiple topics. Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999

plsi. Joint probability for all documents and words: $\prod_{d=1}^{M} \prod_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}$. Distribution for document $d$, word $w_i$: $P(d, w_i) = P(d) \sum_{k=1}^{K} P(w_i \mid z_k) P(z_k \mid d)$. Reformulate $P(z_k \mid d)$ with Bayes' rule: $P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k) P(z_k) P(w_i \mid z_k)$.

plsi: Learning Using EM. Model: $\prod_{d=1}^{M} \prod_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}$ with $P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k) P(z_k) P(w_i \mid z_k)$. Likelihood: $L = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i) \log P(d, w_i) = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i) \log \sum_{k=1}^{K} P(d \mid z_k) P(z_k) P(w_i \mid z_k)$. Parameters to learn (M step): $P(d \mid z_k)$, $P(z_k)$, $P(w_i \mid z_k)$. Posterior of the hidden topics (E step): $P(z_k \mid d, w_i)$.

plsi: Learning Using EM. EM solution. E step: $P(z_k \mid d, w_i) = \frac{P(z_k) P(d \mid z_k) P(w_i \mid z_k)}{\sum_{l=1}^{K} P(z_l) P(d \mid z_l) P(w_i \mid z_l)}$. M step: $P(w_i \mid z_k) = \frac{\sum_{d=1}^{M} n(d, w_i) P(z_k \mid d, w_i)}{\sum_{d=1}^{M} \sum_{j=1}^{N_d} n(d, w_j) P(z_k \mid d, w_j)}$, $P(d \mid z_k) = \frac{\sum_{i=1}^{N_d} n(d, w_i) P(z_k \mid d, w_i)}{\sum_{c=1}^{M} \sum_{i=1}^{N_c} n(c, w_i) P(z_k \mid c, w_i)}$, $P(z_k) = \frac{1}{R} \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i) P(z_k \mid d, w_i)$, where $R = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i)$.
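A direct translation of these pLSI updates into code (again my own sketch; it materializes the full (M, V, K) responsibility array, which is fine for toy data but not for a real corpus).

```python
import numpy as np

def plsi_em(n_dw, K, n_iter=50, seed=0):
    """EM for pLSI. n_dw: (M, V) matrix of counts n(d, w)."""
    rng = np.random.default_rng(seed)
    M, V = n_dw.shape
    p_z = np.full(K, 1.0 / K)                      # P(z_k)
    p_d_z = rng.dirichlet(np.ones(M), size=K).T    # P(d | z_k), shape (M, K)
    p_w_z = rng.dirichlet(np.ones(V), size=K).T    # P(w | z_k), shape (V, K)
    for _ in range(n_iter):
        # E step: P(z_k | d, w) proportional to P(z_k) P(d | z_k) P(w | z_k)
        post = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True)    # shape (M, V, K)
        # M step: weighted by the expected counts n(d, w) P(z_k | d, w)
        nc = n_dw[:, :, None] * post
        denom = nc.sum(axis=(0, 1))                # one normalizer per topic
        p_w_z = nc.sum(axis=0) / denom
        p_d_z = nc.sum(axis=1) / denom
        p_z = denom / n_dw.sum()
    return p_z, p_d_z, p_w_z
```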

plsi: Overview. More realistic than the mixture model: documents can discuss multiple topics! Problems: very many parameters; danger of overfitting.

plsi Testrun PLSI topics (TDT-1 corpus) Approx. 7 million words, 15863 documents, K = 128 The two most probable topics that generate the term flight (left) and love (right). List of most probable words per topic, with decreasing probability going down the list. 26

Relation with LSI. $P = U_k \Sigma_k V_k^{\top}$ with $P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k) P(z_k) P(w_i \mid z_k)$, where $U_k = \left[P(d \mid z_k)\right]_{d,k}$, $\Sigma_k = \operatorname{diag}\left(P(z_k)\right)_k$, $V_k = \left[P(w_i \mid z_k)\right]_{i,k}$. Difference: LSI minimizes the Frobenius (L2) norm; plsi maximizes the log-likelihood of the training data.

plsi with Multinomials (cf. Multinomial Naïve Bayes). Select a document $d \sim \mathrm{Mult}(\cdot \mid \pi)$. For each position $i = 1, \ldots, N_d$: generate $z_i \sim \mathrm{Mult}(\cdot \mid d, \theta_d)$ and $w_i \sim \mathrm{Mult}(\cdot \mid z_i, \beta_k)$. Joint probability: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, d \mid \beta, \theta, \pi) = \prod_{d=1}^{M} P(d \mid \pi) \prod_{i=1}^{N_d} \sum_{k=1}^{K} P(z_i = k \mid d, \theta_d) P(w_i \mid \beta_k) = \prod_{d=1}^{M} \pi_d \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k} \beta_{k, w_i}$.

Prior Distribution for Topic Mixture. Goal: the topic mixture proportions for each document are drawn from some distribution, i.e., a distribution on multinomials (k-tuples of non-negative numbers that sum to one). The space of all of these multinomials can be interpreted geometrically as a (k-1)-simplex, the generalization of a triangle to (k-1) dimensions. Criteria for selecting our prior: it needs to be defined for a (k-1)-simplex, and it should have nice properties. In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. [Wikipedia]

Latent Dirichlet Allocation. Document = mixture of topics (as in plsi), but according to a Dirichlet prior. When we use a uniform Dirichlet prior, plsi = LDA. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003

Dirichlet Distributions. $p(\theta \mid \alpha) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$. Defined over a (k-1)-simplex: it takes K non-negative arguments which sum to one. Consequently, it is a natural distribution to use over multinomial distributions. The Dirichlet parameter $\alpha_i$ can be thought of as a prior count of the i-th class. Conjugate prior to the multinomial distribution $f(x_1, \ldots, x_K; p_1, \ldots, p_K) = \frac{\Gamma(\sum_i x_i + 1)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{K} p_i^{x_i}$: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also a Dirichlet.
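A short sketch (my own, using numpy) of what draws from a symmetric Dirichlet look like for different α values; it previews the effect of the corpus-level parameter α illustrated a few slides further on.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10

for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    # One draw theta ~ Dirichlet(alpha, ..., alpha): a point on the (K-1)-simplex.
    theta = rng.dirichlet(np.full(K, alpha))
    print(f"alpha={alpha:>6}: max component = {theta.max():.2f}")
    # Small alpha -> mass concentrated on few components (sparse mixtures);
    # large alpha -> components close to uniform (about 1/K each).
```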

LDA Model. [Diagram: the unrolled model for three documents; $\alpha$ generates a per-document $\theta$, each $\theta$ generates topic assignments $z_1, \ldots, z_4$, and each $z_i$ generates a word $w_i$ using $\beta$]

LDA Model Parameters. $\alpha$: proportions parameter (k-dimensional vector of real numbers). $\theta$: per-document topic distribution (k-dimensional vector of probabilities summing up to 1). $z$: per-word topic assignment (number from 1 to k). $w$: observed word (number from 1 to v, where v is the number of words in the vocabulary). $\beta$: word prior (v-dimensional).

Dirichlet Distribution over a 2-Simplex A panel illustrating probability density functions of a few Dirichlet distributions over a 2-simplex, for the following α vectors (clockwise, starting from the upper left corner): (1.3, 1.3, 1.3), (3,3,3), (7,7,7), (2,6,11), (14, 9, 5), (6,2,6). [Wikipedia] 34

LDA Model Plate Notation. For each document $d$: generate $\theta_d \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$; for each position $i = 1, \ldots, N_d$: generate a topic $z_i \sim \mathrm{Mult}(\cdot \mid \theta_d)$ and a word $w_i \sim \mathrm{Mult}(\cdot \mid z_i, \beta)$. $P(\beta, \theta, z_1, \ldots, z_{N_d}, w_1, \ldots, w_{N_d}) = \prod_{d=1}^{M} P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} P(z_i \mid \theta_d) P(w_i \mid \beta, z_i)$.
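The generative process above, written out as a sketch with toy values for α and β (in the smoothed model introduced later, β itself is also drawn from a Dirichlet):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 8                               # number of topics, vocabulary size
alpha = np.full(K, 0.5)                   # Dirichlet proportions parameter (toy value)
beta = rng.dirichlet(np.ones(V), size=K)  # beta[k] = word distribution of topic k

def generate_lda_document(n_words):
    theta = rng.dirichlet(alpha)              # per-document topic proportions
    z = rng.choice(K, size=n_words, p=theta)  # per-word topic assignments
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
    return theta, z, w

theta_d, z_d, w_d = generate_lda_document(10)
print(z_d, w_d)
```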

Corpus-level Parameter $\alpha = K^{-1} \sum_i \alpha_i$. Let $\alpha = 1$. Per-document topic distributions for K = 10, D = 15. [Figure: bar charts of the sampled topic proportions of the 15 documents]

Corpus-level Parameter $\alpha$. [Figure: sampled per-document topic distributions for $\alpha = 10$ and $\alpha = 100$; with larger $\alpha$ the proportions become increasingly uniform]

Corpus-level Parameter $\alpha$. [Figure: sampled per-document topic distributions for $\alpha = 0.1$ and $\alpha = 0.01$; with smaller $\alpha$ most of each document's probability mass concentrates on very few topics]

Back to Topic Modeling Scenario Topics Documents Topic proportions and assignments 39

Smoothed LDA Model. Give a different word distribution to each topic: $\beta$ is a $K \times V$ matrix ($V$ = vocabulary size), each row being the word distribution of a topic, with $\beta_k \sim \mathrm{Dirichlet}(\cdot \mid \eta)$ for each topic $k$. For each document $d$: choose $\theta_d \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$; for each position $i = 1, \ldots, N_d$: generate a topic $z_i \sim \mathrm{Mult}(\cdot \mid \theta_d)$ and a word $w_i \sim \mathrm{Mult}(\cdot \mid z_i, \beta_k)$.

Smoothed LDA Model. $P(\beta, \theta, z_1, \ldots, z_{N_d}, w_1, \ldots, w_{N_d}) = \prod_{k=1}^{K} P(\beta_k \mid \eta) \prod_{d=1}^{M} P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} P(z_{d,i} \mid \theta_d) P(w_{d,i} \mid \beta_{1:K}, z_{d,i})$.

Back to Topic Modeling Scenario But 42

Why does LDA work? Trade-off between two goals: 1. For each document, allocate its words to as few topics as possible. 2. For each topic, assign high probability to as few terms as possible. These goals are at odds. Putting a document in a single topic makes #2 hard: all of its words must have probability under that topic. Putting very few words in each topic makes #1 hard: to cover a document's words, it must assign many topics to it. Trading off these goals finds groups of tightly co-occurring words.

Inference: The Problem (non-smoothed version). To which topics does a given document belong? Compute the posterior distribution of the hidden variables given a document: $P(\theta, z \mid w, \alpha, \beta) = \frac{P(\theta, z, w \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}$, with $P(\theta, z, w \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{i=1}^{N} P(z_i \mid \theta) P(w_i \mid z_i, \beta)$ and $P(w \mid \alpha, \beta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \int \left(\prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\right) \left(\prod_{i=1}^{N} \sum_{k=1}^{K} \prod_{j=1}^{V} (\theta_k \beta_{kj})^{w_i^j}\right) d\theta$. This not only looks awkward, it is also computationally intractable in general, due to the coupling between $\theta$ and $\beta_{kj}$ in the sum over latent topics. Solution: approximations.

LDA Learning. Parameter learning: (1) Variational EM: numerical approximation using lower bounds; results in biased solutions; convergence has numerical guarantees. (2) Gibbs sampling: stochastic simulation; unbiased solutions; stochastic convergence. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003

LDA: Variational Inference. Replace the LDA model with a simpler one and minimize the Kullback-Leibler (KL) divergence between the two distributions. [Diagram: the LDA graphical model next to the simplified variational model]

LDA: Gibbs Sampling. MCMC algorithm: fix all current values but one and sample that value, e.g., for $z$: $P(z_i = k \mid z_{-i}, w)$. Eventually converges to the true posterior.
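To make the update $P(z_i = k \mid z_{-i}, w)$ concrete, here is a minimal collapsed Gibbs sampler sketch of my own, assuming symmetric priors alpha and eta and doing no convergence checking; it is meant to show the bookkeeping, not to be an efficient implementation.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_sweeps=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns topic-word counts."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))                 # document-topic counts
    n_kw = np.zeros((K, V))                         # topic-word counts
    n_k = np.zeros(K)                               # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                         # remove current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # P(z_i = k | z_-i, w) is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())    # sample new assignment
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw
```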

Variational Inference vs. Gibbs Sampling. Gibbs sampling is slower (it takes days for moderately sized datasets), while variational inference takes a few hours. Gibbs sampling is more accurate. Gibbs sampling convergence is difficult to test, although quite a few machine-learning approximate inference techniques have the same problem.

LDA Application: Reuters Data Setup. 100-topic LDA trained on a 16,000-document corpus of news articles by Reuters; some standard stop words removed. Top words from some of the $P(w \mid z)$ distributions: Arts: new, film, show, music, movie, play, musical. Budgets: million, tax, program, budget, billion, federal, year. Children: children, women, people, child, years, families, work. Education: school, students, schools, education, teachers, high, public.

LDA Application: Reuters Data Result Again: Arts, Budgets, Children, Education. The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services, Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. 50

Measuring Performance. Perplexity of a probability model: how well a probability distribution or probability model predicts a sample. Let q be a model of an unknown probability distribution p, based on a training sample drawn from p. Evaluate q by asking how well it predicts a separate test sample $x_1, \ldots, x_N$ also drawn from p. The perplexity of q w.r.t. the sample $x_1, \ldots, x_N$ is defined as $2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}$. A better model q will tend to assign higher probabilities $q(x_i)$ to the test events and thus have lower perplexity (it is "less surprised" by the sample). [Wikipedia]
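A small sketch of this perplexity computation (my example; q stands for whatever fitted model assigns a probability to a test token):

```python
import numpy as np

def perplexity(test_tokens, q):
    """test_tokens: observed test items; q(x): the model's probability of x."""
    log2_probs = [np.log2(q(x)) for x in test_tokens]
    return 2 ** (-np.mean(log2_probs))     # 2^(-(1/N) * sum_i log2 q(x_i))

# Sanity check: a uniform model over a 1000-word vocabulary has perplexity ~1000.
print(perplexity(["w"] * 50, lambda x: 1.0 / 1000))
```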

Perplexity of Various Models. [Figure: test-set perplexity of the unigram model and LDA]

Use of LDA A widely used topic model (Griffiths, Steyvers, 04) Complexity is an issue Use in IR: Ad hoc retrieval (Wei and Croft, SIGIR 06: TREC benchmarks) Improvements over traditional LM (e.g., LSI techniques) But no consensus on whether there is any improvement over Relevance model, i.e., model with relevance feedback (relevance feedback part of the TREC tests) T. Griffiths, M. Steyvers, Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004 Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 178-185. 2006. TREC=Text REtrieval Conference 53

Generative Topic Models for Community Analysis Pilfered from: Ramesh Nallapati http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt & Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling of Text and Network Data 54

What if the corpus has network structure? CORA citation network. Figure from [Chang, Blei, AISTATS 2009] J. Chang, and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009. 55

Outline Topic Models for Community Analysis Citation Modeling with plsi with LDA Modeling influence of citations Relational Topic Models 56

Hyperlink Modeling Using plsi. Select a document $d \sim \mathrm{Mult}(\pi)$. For each position $n = 1, \ldots, N_d$: generate $z_n \sim \mathrm{Mult}(\cdot \mid \theta_d)$ and $w_n \sim \mathrm{Mult}(\cdot \mid \beta_{z_n})$. For each citation $j = 1, \ldots, L_d$: generate $z_j \sim \mathrm{Mult}(\cdot \mid \theta_d)$ and $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$. D. A. Cohn, Th. Hofmann, The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000

Hyperlink Modeling Using plsi. plsi likelihood: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, d \mid \theta, \beta, \pi) = \prod_{d=1}^{M} \pi_d \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk} \beta_{k w_i}$. New likelihood: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, c_1, \ldots, c_{L_d}, d \mid \theta, \beta, \gamma, \pi) = \prod_{d=1}^{M} \pi_d \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk} \beta_{k w_i} \prod_{j=1}^{L_d} \sum_{k=1}^{K} \theta_{dk} \gamma_{k c_j}$. Learning using EM.

Hyperlink Modeling Using plsi. Heuristic: $0 < \alpha < 1$ determines the relative importance of content and hyperlinks: $\prod_{d=1}^{M} P(w_1, \ldots, w_{N_d}, c_1, \ldots, c_{L_d}, d \mid \theta, \beta, \gamma, \pi) = \prod_{d=1}^{M} \pi_d \left(\prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk} \beta_{k w_i}\right)^{\alpha} \left(\prod_{j=1}^{L_d} \sum_{k=1}^{K} \theta_{dk} \gamma_{k c_j}\right)^{1-\alpha}$.

Hyperlink Modeling Using plsi. Classification performance. [Figure: classification performance when using hyperlinks only vs. content only]

Hyperlink modeling using LDA. For each document $d$: generate $\theta_d \sim \mathrm{Dirichlet}(\alpha)$. For each position $i = 1, \ldots, N_d$: generate a topic $z_i \sim \mathrm{Mult}(\cdot \mid \theta_d)$ and a word $w_i \sim \mathrm{Mult}(\cdot \mid \beta_{z_i})$. For each citation $j = 1, \ldots, L_d$: generate $z_j \sim \mathrm{Mult}(\theta_d)$ and $c_j \sim \mathrm{Mult}(\cdot \mid \gamma_{z_j})$. Learning using variational EM, Gibbs sampling. E. Erosheva, S. Fienberg, J. Lafferty, Mixed-membership models of scientific publications. Proc. National Academy of Sciences USA, 101 Suppl 1:5220-7, 2004

Link-pLSI-LDA: Topic Influence in Blogs. R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008.


Modeling Citation Influences - Copycat Model Each topic in a citing document is drawn from one of the topic mixtures of cited publications L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007. 64

Modeling Citation Influences Citation influence model: Combination of LDA and Copycat model L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007. 65

Modeling Citation Influences Citation influence graph for LDA paper 66

Modeling Citation Influences Words in LDA paper assigned to citations 67

Modeling Citation Influences Predictive Performance 68

Relational Topic Model (RTM) [Chang, Blei 2009]. Same setup as LDA, except that now we have observed network information across documents: $y_{d,d'} \sim \psi(y_{d,d'} \mid z_d, z_{d'}, \eta)$, the link probability function. Documents with similar topics are more likely to be linked. J. Chang and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, pages 81-88. JMLR.org, 2009.

Relational Topic Model (RTM) [Chang, Blei 2009]. For each document $d$: draw topic proportions $\theta_d \mid \alpha \sim \mathrm{Dir}(\alpha)$; for each word $w_{d,n}$: draw the assignment $z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d)$ and the word $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_{d,n}})$. For each pair of documents $d, d'$: draw a binary link indicator $y \mid z_d, z_{d'} \sim \psi(\cdot \mid z_d, z_{d'})$.

Collapsed Gibbs Sampling for RTM. Conditional distribution of each $z$: $P(z_{d,n} = k \mid z_{-(d,n)}, \ldots) \propto \left(N_{d,k}^{-(d,n)} + \alpha\right) \frac{N_{k,w}^{-(d,n)} + \beta}{N_{k}^{-(d,n)} + W\beta} \prod_{d': y_{d,d'}=1} \psi(y_{d,d'} = 1 \mid z_d, z_{d'}, \eta) \prod_{d': y_{d,d'}=0} \psi(y_{d,d'} = 0 \mid z_d, z_{d'}, \eta)$, i.e., an LDA term, an edge term, and a non-edge term. Using the exponential link probability function, it is computationally efficient to calculate the edge term. It is very costly to compute the non-edge term exactly.

Approximating the Non-edges. 1. Assume non-edges are missing and ignore the term entirely (Chang/Blei). 2. Make the following fast approximation: $\prod_i \left(1 - \exp(c_i)\right) \approx \left(1 - \exp(\bar{c})\right)^N$, where $\bar{c} = \frac{1}{N} \sum_i c_i$. 3. Subsample non-edges and exactly calculate the term over the subset. 4. Subsample non-edges, but instead of recalculating statistics for every $z_{d,n}$ token, calculate statistics once per document and cache them over each Gibbs sweep.

Document networks:
CORA: 4,000 docs, 17,000 links, avg. doc length 1,200, vocab size 60,000; link semantics: paper citation (undirected).
Netflix Movies: 10,000 docs, 43,000 links, avg. doc length 640, vocab size 38,000; link semantics: common actor/director.
Enron (undirected): 1,000 docs, 16,000 links, avg. doc length 7,000, vocab size 55,000; link semantics: communication between person i and person j.
Enron (directed): 2,000 docs, 21,000 links, avg. doc length 3,500, vocab size 55,000; link semantics: email from person i to person j.

Link Rank. Link rank on held-out data as evaluation metric; lower is better. [Diagram: a black-box predictor is trained on the edges among {d_train} and produces a ranking over {d_train} for a test document d_test; the observed edges between d_test and {d_train} yield the link ranks] How to compute the link rank (simplified): 1. Train the model with {d_train}. 2. Given the model, calculate the probability that d_test would link to each d_train, and rank {d_train} according to these probabilities. 3. For each observed link between d_test and {d_train}, find its rank, and average all these ranks to obtain the link rank.
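A sketch of this simplified link-rank computation (my own illustration; predict_link_prob is a hypothetical stand-in for the trained model's link-probability predictor):

```python
import numpy as np

def link_rank(d_test, train_docs, observed_links, predict_link_prob):
    """observed_links: indices of train docs that d_test actually links to.
    predict_link_prob(d_test, d_train): model probability of a link."""
    # Step 2: score every training document and rank them (rank 1 = most likely link).
    scores = np.array([predict_link_prob(d_test, d) for d in train_docs])
    order = np.argsort(-scores)                     # descending by probability
    ranks = np.empty(len(train_docs), dtype=int)
    ranks[order] = np.arange(1, len(train_docs) + 1)
    # Step 3: average the ranks of the actually observed links (lower is better).
    return ranks[np.array(observed_links)].mean()
```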

Results on CORA data. [Figure: link rank comparison on CORA, K = 20, for Baseline (TFIDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching; link ranks roughly between 150 and 270] 8-fold cross-validation. Random guessing gives link rank = 2000.

Results on CORA data. [Figure, left: link rank vs. number of topics for the Baseline and the RTM with fast approximation. Figure, right: link rank vs. percentage of words per document for the Baseline, LDA + Regression, Ignoring Non-Edges, Fast Approximation, and Subsampling (5%) + Caching, all at K = 40] The model does better with more topics and with more words in each document.

Timing Results on CORA. [Figure: training time in seconds vs. number of documents on CORA, K = 20, for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching] Subsampling (20%) without caching is not shown since it takes 62,000 seconds for D = 1000 and 3,720,150 seconds for D = 4000.

Conclusion Relational topic modeling provides a useful start for combining text and network data in a single statistical framework RTM can improve over simpler approaches for link prediction Opportunities for future work: Faster algorithms for larger data sets Better understanding of non-edge modeling Extended models 78