Language Information Processing, Advanced. Topic Models


1 Language Information Processing, Advanced. Topic Models

2 Today's talk
Continue exploring the representation of text as a histogram of words. Objective: automatically uncover the topics present in large corpora, and the distribution of those topics in each text. These techniques are called topic models. Topic models are related to other algorithms: dictionary learning in computer vision, matrix factorization. A lot of work appeared in the previous decade. We start with a precursor, Latent Semantic Indexing ('88), follow with probabilistic Latent Semantic Indexing ('99), continue with Latent Dirichlet Allocation ('03), and finish with Pachinko Allocation ('06). This field is still very active: generalizations to non-parametric Bayes (Chinese Restaurant Process, Indian Buffet Process, etc.).

3 From a factorization
Reminder: the Naive Bayes assumption. The chain rule
$P(C, w_1, \dots, w_n) = P(C) \prod_{i=1}^{n} P(w_i \mid C, w_1, \dots, w_{i-1})$
handles all the conditional structure of text. We assume that each word appears independently, conditionally on C,
$P(w_i \mid C, w_1, \dots, w_{i-1}) = P(w_i \mid C)$
and thus
$P(C, w_1, \dots, w_n) = P(C) \prod_{i=1}^{n} P(w_i \mid C).$
The only thing the Bayes classifier considers is the word histogram.
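The following is a minimal sketch (ours, not from the slides) of that last observation: once the model is factorized, the log joint probability depends on the document only through its word histogram. The class prior and word probabilities are hypothetical toy values.

```python
import math
from collections import Counter

def log_joint(words, log_prior, log_word_prob):
    """log P(C, w_1..w_n) = log P(C) + sum_i log P(w_i | C).

    Only the word histogram matters: word order never enters the sum.
    """
    hist = Counter(words)  # histogram of the document's words
    return log_prior + sum(c * log_word_prob[w] for w, c in hist.items())

# Hypothetical parameters for a single class C.
log_prior = math.log(0.5)
log_word_prob = {"the": math.log(0.7), "topic": math.log(0.2), "model": math.log(0.1)}
print(log_joint(["the", "topic", "model", "the"], log_prior, log_word_prob))
```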

4 A Few Examples

5 Science
(Image source: "Topic Models", Blei and Lafferty, 2009)

6 Yale Law Journal
(Image source: "Topic Models", Blei and Lafferty, 2009)

7 Single Result for Science Article

8 Topic Graphs

9 Latent Semantic Indexing
A variation of PCA for normalized word counts...

10 Latent Semantic Indexing [Deerwester et al., '88]
Uncover recurring patterns in text by considering examples. These patterns are groups of words which tend to appear together. To do so, given a set of n documents, LSI considers a term/document matrix $T = [tf_{i,j}] \in \mathbb{R}^{m \times n}$, where $tf_{i,j}$ counts the frequency of term i in document j. Using this information, LSI builds a set of influential groups of words. This is similar in spirit to PCA: learn principal components from data, represent each datapoint as the sum of a few principal components, and use the principal coordinates for clustering or in supervised tasks.

11 Renormalizing Frequencies, Preprocessing
Rather than considering only $tf_{ij}$, introduce a term $x_{ij} = l_{ij}\, g_i$ which incorporates both local and global weights.
Local weights (i.e. relative to a term i and a document j):
binary weight: $l_{ij} = \delta_{tf_{ij} > 0}$
simple frequency: $l_{ij} = tf_{ij}$
Hellinger: $l_{ij} = \sqrt{tf_{ij}}$
log: $l_{ij} = \log(tf_{ij} + 1)$
relative to max: $l_{ij} = \frac{tf_{ij}}{2 \max_i(tf_{ij})}$
Global weights (i.e. relative to a term i across all documents):
equally weighted documents: $g_i = 1$
inverse $\ell_2$ norm of frequencies: $g_i = 1 / \sqrt{\sum_j tf_{ij}^2}$
GfIdf: $g_i = gf_i / df_i$, where $gf_i = \sum_j tf_{ij}$ and $df_i = \sum_j \delta_{tf_{ij} > 0}$
inverse document frequency: $g_i = \log_2 \frac{n}{1 + df_i}$
entropy: $g_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}$, where $p_{ij} = \frac{tf_{ij}}{gf_i}$

12 Word/Document Representation
Typically, one can define
$X = [x_{ij}], \qquad x_{ij} = \underbrace{\Big(1 + \sum_j \frac{p_{ij}\log p_{ij}}{\log n}\Big)}_{g_i}\, \underbrace{\log(tf_{ij}+1)}_{l_{ij}}$
After preprocessing, consider the normalized occurrences of words,
$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$
X represents both the term vectors $t_i$ (its rows $t_i^T$) and the document vectors $d_j$ (its columns): a normalized representation of points (documents) in variables (terms), or vice-versa.
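As a concrete illustration, here is a minimal numpy sketch (our own, not code from the lecture) of the log-entropy weighting $x_{ij} = g_i\, l_{ij}$ defined above; `tf` is a toy term/document count matrix in which every term occurs at least once.

```python
import numpy as np

def log_entropy_weight(tf):
    """Sketch of x_ij = g_i * l_ij with entropy global and log local weights."""
    m, n = tf.shape
    gf = tf.sum(axis=1, keepdims=True)                 # global frequency gf_i
    p = np.where(tf > 0, tf / gf, 1.0)                 # p_ij; p = 1 makes p*log(p) = 0
    g = 1.0 + (p * np.log(p)).sum(axis=1) / np.log(n)  # entropy global weight g_i
    l = np.log(tf + 1.0)                               # log local weight l_ij
    return g[:, None] * l

tf = np.array([[3.0, 0.0, 1.0], [0.0, 2.0, 2.0], [1.0, 1.0, 1.0]])
print(log_entropy_weight(tf))
```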

13 Word/Document Representation
Each row represents a term, described by its relation to each document: $t_i^T = [x_{i,1} \dots x_{i,n}]$.
Each column represents a document, described by its relation to each word: $d_j = [x_{1,j} \dots x_{m,j}]^T$.
$t_i^T t_{i'}$ is the correlation between terms $i, i'$ over all documents; $XX^T$ contains all these dot products.
$d_j^T d_{j'}$ is the correlation between documents $j, j'$ over all terms; $X^T X$ contains all these dot products.

14 Singular Value Decomposition
Consider the singular value decomposition (SVD) of X,
$X = U \Sigma V^T$
where $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal. The matrix products highlighting term/document correlations are
$XX^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma V^T V \Sigma^T U^T = U\Sigma\Sigma^T U^T$
$X^T X = (U\Sigma V^T)^T (U\Sigma V^T) = V\Sigma^T U^T U \Sigma V^T = V\Sigma^T\Sigma V^T$
U contains the eigenvectors of $XX^T$; V contains the eigenvectors of $X^T X$. Both $XX^T$ and $X^T X$ have the same non-zero eigenvalues, given by the non-zero entries of $\Sigma\Sigma^T$.

15 Singular Value Decomposition
Let l be the number of non-zero eigenvalues of $\Sigma\Sigma^T$. Then
$X = \hat{X}_{(l)} \stackrel{\text{def}}{=} U_{(l)} \Sigma_{(l)} V_{(l)}^T$
that is,
$\underbrace{\begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}}_{\text{columns } d_j,\ \text{rows } t_i^T} = \underbrace{\begin{bmatrix} u_1 & \cdots & u_l \end{bmatrix}}_{\text{rows } \tau_i^T} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_l \end{bmatrix} \underbrace{\begin{bmatrix} v_1^T \\ \vdots \\ v_l^T \end{bmatrix}}_{\text{columns } \delta_j}$
$\sigma_1, \dots, \sigma_l$ are the singular values; $u_1, \dots, u_l$ and $v_1, \dots, v_l$ are the left and right singular vectors. The only part of U that contributes to $t_i$ is its i-th row, written $\tau_i^T$. The only part of $V^T$ that contributes to $d_j$ is its j-th column, written $\delta_j$.

16 Low Rank Approximations
A property of the SVD is that for $k \le l$,
$\hat{X}_k \stackrel{\text{def}}{=} U_{(k)} \Sigma_{(k)} V_{(k)}^T = \operatorname*{argmin}_{Y \in \mathbb{R}^{m \times n},\ \mathrm{rank}(Y) = k} \|X - Y\|_F$
i.e. $\hat{X}_k$ is the best approximation of X with rank k. The term and document vectors can be considered as concept spaces: the k entries of $\tau_i$ provide the occurrence of term i in each of the k concepts, and $\delta_j^T$ provides the relation between document j and each concept.
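This Eckart-Young property is easy to check numerically. A minimal sketch, assuming a random stand-in for the weighted term/document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                  # stand-in for a weighted matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k truncation of the SVD

# The Frobenius error equals the l2 norm of the discarded singular values.
print(np.linalg.norm(X - X_k))
print(np.sqrt((s[k:] ** 2).sum()))
```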

17 Latent Semantic Indexing: Representation of Documents
We can use LSI to:
quantify the relationship between documents j and j', by comparing the vectors $\Sigma_k \delta_j$ and $\Sigma_k \delta_{j'}$;
compare terms i and i' through $\tau_i^T \Sigma_k$ and $\tau_{i'}^T \Sigma_k$, which provides a clustering of the terms in the concept space;
project a new document q onto the concept space, $\chi = \Sigma_k^{-1} U_k^T q$.
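A minimal sketch of that projection step, reusing an SVD computed as above; the matrix and the new document are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))                  # toy weighted term/document matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
q = rng.standard_normal(6)                       # a new document as a term vector
chi = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q      # chi = Sigma_k^{-1} U_k^T q
print(chi)                                       # coordinates in the concept space
```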

18 Probabilistic Latent Semantic Indexing

19 Latent Variable Probabilistic Modeling
pLSI builds on LSI by considering a probabilistic model based on a latent class variable. Namely, the joint likelihood that word w appears in document d depends on an unobserved variable $z \in Z = \{z_1, \dots, z_K\}$, which defines a joint probability model over $W \times D$ (words × documents) as
$p(d,w) = p(d)\, p(w \mid d), \qquad p(w \mid d) = \sum_{z \in Z} P(w \mid z) P(z \mid d)$
which thus gives
$p(d,w) = p(d) \sum_{z \in Z} P(w \mid z) P(z \mid d).$
We also have that
$p(d,w) = \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z).$

20 Probabilistic Latent Semantic Indexing
The different parameters of the probability below,
$p(d,w) = p(d) \sum_{z \in Z} P(w \mid z) P(z \mid d),$
namely $P(z)$, $P(w \mid z)$ and $P(d \mid z)$, are all multinomial distributions, i.e. distributions on the simplex. These coefficients can be estimated using maximum likelihood with latent variables, typically with the Expectation-Maximization (EM) algorithm, as in the sketch below.
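A minimal numpy sketch of this EM procedure (our own illustration, not the lecture's code); `N` is a toy term/document count matrix:

```python
import numpy as np

def plsa_em(N, K, n_iter=100, seed=0):
    """EM for pLSI: returns P(w|z) (m x K), P(d|z) (n x K), P(z) (K,)."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    p_w_z = rng.random((m, K)); p_w_z /= p_w_z.sum(axis=0)
    p_d_z = rng.random((n, K)); p_d_z /= p_d_z.sum(axis=0)
    p_z = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w) for every (word, document) pair.
        post = p_z[None, None, :] * p_w_z[:, None, :] * p_d_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True)
        # M-step: reweight the posteriors by the observed counts.
        w = N[:, :, None] * post
        p_w_z = w.sum(axis=1); p_w_z /= p_w_z.sum(axis=0)
        p_d_z = w.sum(axis=0); p_d_z /= p_d_z.sum(axis=0)
        p_z = w.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_w_z, p_d_z, p_z

N = np.array([[4.0, 0.0, 1.0], [0.0, 3.0, 2.0], [2.0, 1.0, 0.0]])
print(plsa_em(N, K=2)[2])                        # estimated topic proportions P(z)
```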

21 Probabilistic Latent Semantic Indexing
Consider again the formula
$p(d,w) = \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z).$
If we define the matrices
$U = [P(w_i \mid z_k)]_{i,k}, \qquad V = [P(d_j \mid z_k)]_{j,k}, \qquad \Sigma = \mathrm{diag}(P(z_k)),$
we obtain that
$P = [P(w_i, d_j)] = U \Sigma V^T.$
P plays the same role as X: we have found a different factorization of P (or X). The difference: in LSI, the SVD uses the Frobenius norm to penalize discrepancies; in probabilistic LSI, we use a different criterion, the likelihood function.

22 Probabilistic Latent Semantic Indexing
The probabilistic viewpoint provides a different cost function. The probabilistic assumption is made explicit by the following graphical model (image source: Wikipedia). Here θ stands for a document d, M is the number of documents, and N is the number of words in a document. The plates indicate that these dependencies are repeated M and N times.

23 Latent Dirichlet Allocation

24 Dirichlet Distribution
The Dirichlet distribution is a distribution on the canonical simplex
$\Sigma_d = \{x \in \mathbb{R}_+^d : \sum_{i=1}^d x_i = 1\}.$
Its density is parameterized by a family β of d positive real numbers, $\beta = (\beta_1, \dots, \beta_d)$, and has the expression
$p_\beta(x) = \frac{1}{B(\beta)} \prod_{i=1}^d x_i^{\beta_i - 1},$
with normalizing constant $B(\beta)$ computed using the Gamma function,
$B(\beta) = \frac{\prod_{i=1}^d \Gamma(\beta_i)}{\Gamma\big(\sum_{i=1}^d \beta_i\big)}.$
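A quick way to get a feel for the distribution is to sample from it. A minimal numpy sketch, with β matching one of the parameter choices shown on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([6.0, 2.0, 2.0])        # one of the parameter families shown next
samples = rng.dirichlet(beta, size=5)   # 5 points on the canonical simplex

print(samples)
print(samples.sum(axis=1))              # every sample sums to 1
```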

25 Dirichlet Distribution
The Dirichlet distribution is widely used to model count histograms. Shown here, for instance, are densities for β = (6,2,2), (3,7,5), (6,2,6) and (2,3,4). (Image source: Wikipedia)

26 Probabilistic Modeling in Latent Dirichlet Allocation
LDA assumes that documents are random mixtures over latent topics; each topic is characterized by a distribution over words, and each word is generated following this distribution. Consider K topics, a Dirichlet parameter on topics $\alpha \in \mathbb{R}_{++}^K$ for documents, and K multinomials on V words, described by a Markov (row-stochastic) matrix $\phi \in \mathbb{R}_+^{K \times V}$ (rows sum to 1), with $\phi_k \sim \mathrm{Dir}(\beta)$.

27 Latent Dirichlet Allocation
Assume that each document $d_i = (w_{i,1}, \dots, w_{i,N_i})$, $i \in \{1, \dots, M\}$, has been generated by the following mechanism:
Choose a distribution of topics $\theta_i \sim \mathrm{Dir}(\alpha)$ for document $d_i$.
For each word location $(i,j)$, where $j \in \{1, \dots, N_i\}$: choose a topic $z_{i,j} \sim \mathrm{Multinomial}(\theta_i)$, then choose a word $w_{i,j} \sim \mathrm{Multinomial}(\phi_{z_{i,j}})$.
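A minimal simulation of this generative mechanism, with toy α, φ and document lengths of our choosing:

```python
import numpy as np

def lda_generate(alpha, phi, doc_lengths, seed=0):
    """Generate documents following the LDA mechanism above."""
    rng = np.random.default_rng(seed)
    K, V = phi.shape
    docs = []
    for n_i in doc_lengths:
        theta = rng.dirichlet(alpha)                 # topic mixture theta_i ~ Dir(alpha)
        z = rng.choice(K, size=n_i, p=theta)         # topic z_ij of each word location
        docs.append([rng.choice(V, p=phi[zj]) for zj in z])  # word w_ij ~ Mult(phi_z)
    return docs

phi = np.array([[0.7, 0.2, 0.1],        # topic 0: distribution over V = 3 words
                [0.1, 0.2, 0.7]])       # topic 1
print(lda_generate(alpha=np.ones(2), phi=phi, doc_lengths=[5, 8]))
```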

28 Latent Dirichlet Allocation
The graphical model of LDA can be displayed as follows. (Image source: Wikipedia)

29 Latent Dirichlet Allocation
Inferring all parameters and latent variables, namely the set of K topics, the topic mixture $\theta_i$ of each document $d_i$, the set of word probabilities $\phi_k$ for each topic, and the topic $z_{ij}$ of each word $w_{ij}$, is a Bayesian inference problem. Many different techniques can be used to tackle it (see the talk by Arnaud Doucet last week): Gibbs sampling, Variational Bayes. This is, in practice, the main challenge in using LDA.
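In practice one often delegates this inference to a library. A minimal sketch with scikit-learn, whose LatentDirichletAllocation estimator implements online variational Bayes; the count matrix is a toy stand-in (documents in rows, words in columns):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.array([[4, 0, 1, 0],              # toy documents x words count matrix
              [3, 1, 0, 0],
              [0, 0, 3, 4],
              [0, 1, 2, 3]])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # theta_i: topic mixture of each document
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # phi_k

print(doc_topics)
print(topic_words)
```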

30 Pachinko Allocation

31 The idea in one image
From a simple multinomial (per document) to Pachinko allocation. (Image source: "Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations", Li and McCallum)

32 The idea in one image
Difference with LDA. (Image source: "Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations", Li and McCallum)
