CS145: Introduction to Data Mining
Text Data: Topic Models
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
December 4, 2017
Methods to be Learnt
- Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); pLSA (text data)
- Prediction: Linear Regression; GLM* (vector data)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Similarity Search: DTW (sequence data)
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Text Data
- Word/term
- Document: a sequence of words
- Corpus: a collection of documents
Represent a Document
- Most common way: Bag-of-Words
- Ignore the order of words, keep the counts
- Vector space model
(Figure omitted: example count vectors for two documents.)
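As an aside (not from the slides), a minimal Python sketch of turning a document into such a count vector; the example sentence and the 7-word vocabulary are made up for illustration:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document (word order is ignored)."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and document, for illustration only
vocabulary = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
doc = "data mining finds frequent pattern from data"
print(bag_of_words(doc, vocabulary))  # [2, 1, 1, 1, 0, 0, 0]
```

Each document then becomes a point in the vector space indexed by the vocabulary.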
Topics
- A topic is represented by a word distribution
- A topic relates to an issue
Topic Modeling
- Get topics automatically from a corpus
- Assign documents to topics automatically
- Most frequently used topic models: pLSA, LDA
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Mixture Model-Based Clustering
- A set $C$ of $k$ probabilistic clusters $C_1, \ldots, C_k$ with probability density/mass functions $f_1, \ldots, f_k$ and cluster prior probabilities $w_1, \ldots, w_k$, where $\sum_j w_j = 1$
- Joint probability of an object $i$ and its cluster $C_j$: $P(x_i, z_i = C_j) = w_j f_j(x_i)$, where $z_i$ is a hidden random variable
- Probability of object $i$: $P(x_i) = \sum_j w_j f_j(x_i)$
(Figure omitted: two component densities $f_1(x)$ and $f_2(x)$.)
Maximum Likelihood Estimation
- Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have
  $P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$
  $\log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$
- Task: find a set $C$ of $k$ probabilistic clusters such that $P(D)$ is maximized
Gaussian Mixture Model
- Generative model; for each object:
  - Pick its cluster, i.e., a distribution component: $Z \sim \text{Multinoulli}(w_1, \ldots, w_k)$
  - Sample a value from the selected distribution: $X \mid Z \sim N(\mu_Z, \sigma_Z^2)$
- Overall likelihood function: $L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$, s.t. $\sum_j w_j = 1$ and $w_j \ge 0$
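To make the generative story concrete, here is a hedged sketch (not from the slides) that samples from a toy one-dimensional Gaussian mixture and evaluates the likelihood above; all parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D GMM parameters, for illustration only
w = np.array([0.6, 0.4])       # cluster priors, sum to 1
mu = np.array([0.0, 5.0])      # component means
sigma = np.array([1.0, 2.0])   # component standard deviations

# Generative process: pick a component Z, then sample X | Z ~ N(mu_Z, sigma_Z^2)
z = rng.choice(len(w), size=200, p=w)
x = rng.normal(mu[z], sigma[z])

def log_likelihood(x, w, mu, sigma):
    """log P(D) = sum_i log sum_j w_j N(x_i | mu_j, sigma_j^2)."""
    x = x[:, None]                                                        # shape (n, 1)
    densities = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log((densities * w).sum(axis=1)).sum()

print(log_likelihood(x, w, mu, sigma))
```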
Multinomial Mixture Model
- For documents with bag-of-words representation $x_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the $n$-th vocabulary word in document $d$
- Generative model; for each document:
  - Sample its cluster label $z \sim \text{Multinoulli}(\pi)$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\pi_k$ is the proportion of the $k$-th cluster, i.e., $p(z = k) = \pi_k$
  - Sample its word vector $x_d \sim \text{Multinomial}(\beta_z)$, where $\beta_z = (\beta_{z1}, \beta_{z2}, \ldots, \beta_{zN})$ and $\beta_{zn}$ is the parameter associated with the $n$-th vocabulary word:
    $p(x_d \mid z = k) = \frac{(\sum_n x_{dn})!}{\prod_n x_{dn}!} \prod_n \beta_{kn}^{x_{dn}}$
Likelihood Function
- For a set of $M$ documents:
  $L = \prod_d p(x_d) = \prod_d \sum_k p(x_d, z = k) = \prod_d \sum_k p(x_d \mid z = k)\, p(z = k) \propto \prod_d \sum_k p(z = k) \prod_n \beta_{kn}^{x_{dn}}$
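A hedged sketch of evaluating this likelihood in log space (not part of the slides); the count matrix and parameters are toy values, and SciPy's gammaln is used only for the multinomial coefficient:

```python
import numpy as np
from scipy.special import gammaln

def multinomial_mixture_log_likelihood(X, pi, beta):
    """log L = sum_d log sum_k pi_k * Multinomial(x_d | beta_k).

    X    : (M, N) word-count matrix (documents x vocabulary)
    pi   : (K,)   cluster proportions, sum to 1
    beta : (K, N) per-cluster word distributions, rows sum to 1
    """
    # log of the multinomial coefficient (sum_n x_dn)! / prod_n x_dn!
    log_coef = gammaln(X.sum(axis=1) + 1) - gammaln(X + 1).sum(axis=1)   # (M,)
    # log p(x_d | z=k) for every document/cluster pair
    log_px_given_z = log_coef[:, None] + X @ np.log(beta).T              # (M, K)
    # log sum_k pi_k p(x_d | z=k), computed stably with log-sum-exp
    log_joint = log_px_given_z + np.log(pi)                              # (M, K)
    m = log_joint.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))).sum()

# Toy numbers, made up for illustration
X = np.array([[5, 4, 3, 2, 1, 1, 1],
              [2, 3, 1, 1, 2, 3, 3]])
pi = np.array([0.5, 0.5])
beta = np.full((2, 7), 1.0 / 7)
print(multinomial_mixture_log_likelihood(X, pi, beta))
```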
Mixture of Unigrams
- For documents represented by a sequence of words $w_d = (w_{d1}, w_{d2}, \ldots, w_{dN_d})$, where $N_d$ is the length of document $d$ and $w_{dn}$ is the word at the $n$-th position of the document
- Generative model; for each document:
  - Sample its cluster label $z \sim \text{Multinoulli}(\pi)$, where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ and $\pi_k$ is the proportion of the $k$-th cluster, i.e., $p(z = k) = \pi_k$
  - For each word in the sequence, sample the word $w_{dn} \sim \text{Multinoulli}(\beta_z)$, i.e., $p(w_{dn} \mid z = k) = \beta_{k w_{dn}}$
Likelihood Function
- For a set of $M$ documents:
  $L = \prod_d p(w_d) = \prod_d \sum_k p(w_d, z = k) = \prod_d \sum_k p(w_d \mid z = k)\, p(z = k) = \prod_d \sum_k p(z = k) \prod_n \beta_{k w_{dn}}$
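A parallel sketch for the sequence-of-words representation (again an illustration, not course code); the word-index sequences and parameters are invented:

```python
import numpy as np

def mixture_of_unigrams_log_likelihood(docs, pi, beta):
    """log L = sum_d log sum_k pi_k * prod_n beta_{k, w_dn}.

    docs : list of documents, each a list of word indices (order is kept)
    pi   : (K,) cluster proportions
    beta : (K, V) per-cluster word distributions
    """
    total = 0.0
    for words in docs:
        # log p(w_d | z=k) = sum over positions of log beta_{k, w_dn}
        log_pw_given_z = np.log(beta[:, words]).sum(axis=1)   # (K,)
        log_joint = np.log(pi) + log_pw_given_z
        m = log_joint.max()
        total += m + np.log(np.exp(log_joint - m).sum())
    return total

# Toy word-index sequences over a 7-word vocabulary, made up for illustration
docs = [[0, 0, 1, 2, 3], [4, 5, 6, 5]]
pi = np.array([0.5, 0.5])
beta = np.full((2, 7), 1.0 / 7)
print(mixture_of_unigrams_log_likelihood(docs, pi, beta))
```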
Question
- Are the multinomial mixture model and the mixture of unigrams model equivalent? Why?
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Notations
- Word, document, topic: $w$, $d$, $z$
- Word count in document: $c(w, d)$
- Word distribution for each topic ($\beta_z$): $\beta_{zw} = p(w \mid z)$
- Topic distribution for each document ($\theta_d$): $\theta_{dz} = p(z \mid d)$ (yes, this is soft clustering)
Issues of Mixture of Unigrams
- All the words in the same document are sampled from the same topic
- In practice, people switch topics during their writing
Illustration of pLSA
(Figure omitted.)
Generative Model for pLSA
- Describes how a document is generated probabilistically
- For each position in $d$, $n = 1, \ldots, N_d$:
  - Generate the topic for the position as $z_n \sim \text{Multinoulli}(\theta_d)$, i.e., $p(z_n = k) = \theta_{dk}$ (note: a 1-trial multinomial, i.e., a categorical distribution)
  - Generate the word for the position as $w_n \sim \text{Multinoulli}(\beta_{z_n})$, i.e., $p(w_n = w) = \beta_{z_n w}$
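As an illustration only, a small sketch that runs this per-position generative process for one document; the topic distribution θ_d, the word distributions β, and the vocabulary are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters for illustration: 2 topics, 7-word vocabulary
vocabulary = ["data", "mining", "frequent", "pattern", "web", "information", "retrieval"]
theta_d = np.array([0.7, 0.3])                                  # p(z | d) for this document
beta = np.array([[0.30, 0.30, 0.20, 0.10, 0.05, 0.03, 0.02],    # p(w | z=1)
                 [0.05, 0.05, 0.05, 0.05, 0.30, 0.25, 0.25]])   # p(w | z=2)

def generate_document(theta_d, beta, length):
    """For each position: draw a topic z_n ~ Multinoulli(theta_d), then a word w_n ~ Multinoulli(beta_{z_n})."""
    words = []
    for _ in range(length):
        z_n = rng.choice(len(theta_d), p=theta_d)     # topic for this position
        w_n = rng.choice(beta.shape[1], p=beta[z_n])  # word for this position
        words.append(vocabulary[w_n])
    return words

print(generate_document(theta_d, beta, length=10))
```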
Graphical Model
(Figure omitted.)
Note: sometimes people also include the parameters θ and β in the graphical model.
The Likelihood Function for a Corpus
- Probability of a word: $p(w \mid d) = \sum_k p(w, z = k \mid d) = \sum_k p(w \mid z = k)\, p(z = k \mid d) = \sum_k \beta_{kw} \theta_{dk}$
- Likelihood of a corpus: $\log L = \sum_d \pi_d \sum_w c(w, d) \log p(w \mid d)$, where the document weight $\pi_d$ is usually considered uniform, i.e., $1/M$
Re-arrange the Likelihood Function
- Group occurrences of the same word at different positions together:
  $\max \log L = \sum_{d,w} c(w, d) \log \sum_z \theta_{dz} \beta_{zw}$
  s.t. $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$
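A one-function sketch of this objective (an assumed helper, not from the course materials), where C[d, w] stores c(w, d):

```python
import numpy as np

def plsa_log_likelihood(C, theta, beta):
    """log L = sum_{d,w} c(w,d) * log( sum_z theta_dz * beta_zw ).

    C     : (M, V) word-count matrix
    theta : (M, K) per-document topic distributions, rows sum to 1
    beta  : (K, V) per-topic word distributions, rows sum to 1
    """
    p_w_given_d = theta @ beta                      # (M, V): p(w | d) = sum_z theta_dz beta_zw
    return (C * np.log(p_w_given_d + 1e-12)).sum()  # small constant guards against log(0)

# Toy check with uniform parameters (made-up numbers)
C = np.array([[5, 4, 3, 2, 1, 1, 1], [2, 3, 1, 1, 3, 4, 3]], dtype=float)
theta = np.full((2, 2), 0.5)
beta = np.full((2, 7), 1.0 / 7)
print(plsa_log_likelihood(C, theta, beta))
```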
Optimization: EM Algorithm
- Repeat until convergence:
  - E-step: for each word in each document, calculate its conditional probability of belonging to each topic:
    $p(z \mid w, d) \propto p(w \mid z, d)\, p(z \mid d) = \beta_{zw} \theta_{dz}$, i.e., $p(z \mid w, d) = \frac{\beta_{zw} \theta_{dz}}{\sum_{z'} \beta_{z'w} \theta_{dz'}}$
  - M-step: given the conditional distribution, find the parameters that maximize the expected likelihood:
    $\beta_{zw} \propto \sum_d p(z \mid w, d)\, c(w, d)$, i.e., $\beta_{zw} = \frac{\sum_d p(z \mid w, d)\, c(w, d)}{\sum_{w'} \sum_d p(z \mid w', d)\, c(w', d)}$
    $\theta_{dz} \propto \sum_w p(z \mid w, d)\, c(w, d)$, i.e., $\theta_{dz} = \frac{\sum_w p(z \mid w, d)\, c(w, d)}{N_d}$
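A minimal NumPy sketch of these E- and M-step updates (not the instructor's implementation); the toy count matrix is invented and loosely echoes the 7-word vocabulary of the example on the next slide:

```python
import numpy as np

def plsa_em(C, K, n_iters=100, seed=0):
    """EM for pLSA on a word-count matrix C (M documents x V vocabulary words).

    Returns theta (M x K, p(z|d)) and beta (K x V, p(w|z)).
    """
    rng = np.random.default_rng(seed)
    M, V = C.shape
    theta = rng.random((M, K)); theta /= theta.sum(axis=1, keepdims=True)
    beta = rng.random((K, V));  beta /= beta.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: p(z | w, d) proportional to beta_zw * theta_dz, for every (d, w, z)
        joint = theta[:, :, None] * beta[None, :, :]           # (M, K, V)
        p_z_given_wd = joint / joint.sum(axis=1, keepdims=True)

        # M-step: re-estimate beta and theta from expected counts c(w,d) * p(z|w,d)
        expected = C[:, None, :] * p_z_given_wd                # (M, K, V)
        beta = expected.sum(axis=0)                            # (K, V)
        beta /= beta.sum(axis=1, keepdims=True)
        theta = expected.sum(axis=2)                           # (M, K)
        theta /= theta.sum(axis=1, keepdims=True)              # row sum equals N_d

    return theta, beta

# Toy corpus, counts made up for illustration (7-word vocabulary)
C = np.array([[5, 4, 3, 2, 1, 1, 1],
              [2, 3, 1, 1, 3, 4, 3]], dtype=float)
theta, beta = plsa_em(C, K=2)
print(np.round(theta, 2))
print(np.round(beta, 2))
```

Each E/M pass never decreases the re-arranged log-likelihood from the previous slide.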
Example
- Two documents, two topics
- Vocabulary: {data, mining, frequent, pattern, web, information, retrieval}
- At some iteration of the EM algorithm, E-step: (table of $p(z \mid w, d)$ values omitted)
Example (Continued)
- M-step:
  $\beta_{11} = \frac{0.8 \times 5 + 0.5 \times 2}{11.8 + 5.8} = 5/17.6$, $\beta_{12} = \frac{0.8 \times 4 + 0.5 \times 3}{11.8 + 5.8} = 4.7/17.6$
  $\beta_{13} = 3/17.6$, $\beta_{14} = 1.6/17.6$, $\beta_{15} = 1.3/17.6$, $\beta_{16} = 1.2/17.6$, $\beta_{17} = 0.8/17.6$
  $\theta_{11} = 11.8/17$, $\theta_{12} = 5.2/17$
Text Data: Topic Models
- Text Data and Topic Models
- Revisit of Mixture Model
- Probabilistic Latent Semantic Analysis (pLSA)
- Summary
Summary
- Basic concepts: word/term, document, corpus, topic
- Mixture of unigrams
- pLSA: generative model, likelihood function, EM algorithm
Quiz
- Q1: Is Multinomial Naïve Bayes a linear classifier?
- Q2: In pLSA, does the same word appearing at different positions in a document have the same conditional probability $p(z \mid w, d)$?