CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hofmann 1
Plan for next few weeks Project 1: done (submit by Friday). Project 2: (topic) language models: TBA tomorrow Monday 2/22: no class. Watch LDA Google talk by David Blei: https://www.youtube.com/watch?v=7bmsuybpx90 Wednesday 2/24: guest lecture: Prof. Joyce Ho Monday 2/29: Semantics (conclusion); NLP for IR Wednesday 3/2: NLP for IR + guest lecture Wednesday 3/2: Midterm (take-home) assigned. Due by 5pm Thursday 3/3. 2
Recall: Term-document matrix

            Anthony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
anthony     5.25               3.18           0.0          0.0     0.0      0.35
brutus      1.21               6.10           0.0          1.0     0.0      0.0
caesar      8.59               2.54           0.0          1.51    0.25     0.0
calpurnia   0.0                1.54           0.0          0.0     0.0      0.0
cleopatra   2.85               0.0            0.0          0.0     0.0      0.0
mercy       1.51               0.0            1.90         0.12    5.25     0.88
worser      1.37               0.0            0.11         4.15    0.25     1.95

Today: Can we transform this matrix to identify the "meaning" or topic of the documents, and use that for retrieval/classification, etc.? 3
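The matrix above can be put in code directly; a minimal numpy sketch (numpy is assumed to be available) computing the word-overlap cosine similarity that later slides try to improve on:

```python
import numpy as np

# The tf-idf term-document matrix from the slide: rows are the terms
# anthony, brutus, caesar, calpurnia, cleopatra, mercy, worser;
# columns are the six plays.
A = np.array([
    [5.25, 3.18, 0.00, 0.00, 0.00, 0.35],
    [1.21, 6.10, 0.00, 1.00, 0.00, 0.00],
    [8.59, 2.54, 0.00, 1.51, 0.25, 0.00],
    [0.00, 1.54, 0.00, 0.00, 0.00, 0.00],
    [2.85, 0.00, 0.00, 0.00, 0.00, 0.00],
    [1.51, 0.00, 1.90, 0.12, 5.25, 0.88],
    [1.37, 0.00, 0.11, 4.15, 0.25, 1.95],
])

def cosine(u, v):
    # Cosine similarity between two column vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Word-overlap similarity of "Anthony and Cleopatra" vs. "Julius Caesar".
sim = cosine(A[:, 0], A[:, 1])
```

This baseline only credits exact term overlap, which is exactly what the polysemy/synonymy problems on the next slides break.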
Problems with Lexical Semantics Ambiguity and association in natural language Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections). The word-based retrieval model is unable to discriminate between different meanings of the same word. 4
Problems with Lexical Semantics Synonymy: Different terms may have identical or similar meanings (weaker: words indicating the same topic). No associations between words are made in the matrix or vector space representation. 5
Polysemy and Context. Document similarity on the single-word level suffers from polysemy and context: "planet ... saturn ..." contributes to similarity if "saturn" is used in its first meaning (meaning 1: ring, jupiter, space, voyager), but not if it is used in its second (meaning 2: car, company, dodge, ford). 6
Solution: Topic Models Idea: model words in context (e.g., document) Examples: Topic models in science: http://topics.cs.princeton.edu/science/browser/ Topic models in javascript (by David Mimno) http://mimno.infosci.cornell.edu/jslda/ 7
Application: Model Evolution of Topics 8
Progression of Topic Models: Latent Semantic Analysis / Indexing (LSA/LSI); Probabilistic LSI (pLSI); Probabilistic LSI with Dirichlet priors (LDA): Mon 2/22, Google tech talk by David Blei; Scalable topic models (SVD/NMF, Bayesian MF): Wed 2/24, Prof. Joyce Ho; Word2Vec and other extensions (Mon 2/29). 9
Latent Semantic Indexing (LSI). Perform a low-rank approximation of the document-term matrix (typical rank 100-300). General idea: Map documents (and terms) to a low-dimensional representation. Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space). Compute document similarity based on the inner product in this latent semantic space. 10
Goals of LSI Similar terms map to similar location in low dimensional space Noise reduction by dimension reduction 11
Latent Semantic Analysis Latent semantic space: illustrative example courtesy of Susan Dumais 12
Latent semantic indexing: Overview. Decompose the term-document matrix into a product of matrices. Decomposition used: singular value decomposition (SVD). SVD: C = UΣV^T (where C = term-document matrix). Then, use the SVD to compute a new, improved term-document matrix C_k. Hope: get better similarity values out of C_k (compared to C). Using SVD for this purpose is called latent semantic indexing, or LSI. 13
Singular Value Decomposition. For an M×N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD): A = UΣV^T, where U is M×M, Σ is M×N, and V is N×N. The columns of U are orthogonal eigenvectors of AA^T. The columns of V are orthogonal eigenvectors of A^TA. The eigenvalues λ1 ... λr of AA^T are also the eigenvalues of A^TA, and Σ = diag(σ1, ..., σr) with σi = √λi, the singular values. 14
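These properties can be checked numerically; a minimal numpy sketch (the matrix size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3
A = rng.standard_normal((M, N))

# Full SVD: A = U @ Sigma @ Vt with U (M x M), Sigma (M x N), V (N x N).
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((M, N))
Sigma[:N, :N] = np.diag(s)

# The factorization reproduces A, and U, V are orthogonal.
assert np.allclose(U @ Sigma @ Vt, A)
assert np.allclose(U.T @ U, np.eye(M))
assert np.allclose(Vt @ Vt.T, np.eye(N))

# Squared singular values are the eigenvalues of A^T A (and of A A^T).
assert np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(A.T @ A)))
```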
Singular Value Decomposition Illustration of SVD dimensions and sparseness 15
SVD example. Let A = [1 1; 0 1; 1 0] (M = 3, N = 2). Its SVD A = UΣV^T (U is M×M, Σ is M×N, V is N×N) is:

U   = [ 2/√6    0     1/√3 ]
      [ 1/√6  -1/√2  -1/√3 ]
      [ 1/√6   1/√2  -1/√3 ]
Σ   = [ √3  0 ]
      [ 0   1 ]
      [ 0   0 ]
V^T = [ 1/√2   1/√2 ]
      [ 1/√2  -1/√2 ]

Typically, the singular values are arranged in decreasing order. 16
Low-rank Approximation. SVD can be used to compute optimal low-rank approximations. Approximation problem: find a matrix A_k of rank k such that A_k = argmin_{X: rank(X)=k} ||A − X||_F (Frobenius norm), where A_k and X are both M×N matrices. Typically, we want k ≪ r. 2/17/2016 17 CS572: Information Retrieval. Spring 2016
Low-rank Approximation: Solution via SVD. A_k = U diag(σ1, ..., σk, 0, ..., 0) V^T (set the smallest r−k singular values to zero). In column notation: A_k = Σ_{i=1}^{k} σ_i u_i v_i^T, a sum of k rank-1 matrices. 18
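Both forms of A_k can be sketched in numpy (the toy matrix is illustrative); zeroing singular values and summing rank-1 terms give the same matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Way 1: zero out the smallest r - k singular values.
s_trunc = s.copy()
s_trunc[k:] = 0.0
A_k = U @ np.diag(s_trunc) @ Vt

# Way 2: sum of the k leading rank-1 matrices sigma_i * u_i * v_i^T.
A_k2 = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

assert np.allclose(A_k, A_k2)
assert np.linalg.matrix_rank(A_k) == k
```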
Reduced SVD. If we retain only k singular values, and set the rest to 0, then we don't need the matrix parts shown in red. Then Σ is k×k, U is M×k, V^T is k×N, and A_k is M×N. This is referred to as the reduced SVD. It is the convenient (space-saving) and usual form for computational applications. 19
Approximation error. How good (bad) is this approximation? It's the best possible, as measured by the Frobenius norm of the error: min_{X: rank(X)=k} ||A − X||_F = ||A − A_k||_F = √(σ_{k+1}² + ... + σ_r²), where the σi are ordered such that σi ≥ σi+1. This suggests why the Frobenius error drops as k increases. 20
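A numerical check of the error claim (numpy assumed; the random rank-k comparison matrix is only a spot check, not a proof of optimality):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the rank-k truncation achieves the minimal Frobenius error,
# which equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Any other rank-k matrix does at least as well or worse (random spot check).
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 4))
assert np.linalg.norm(A - B, "fro") >= err
```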
SVD Low-rank approximation. Whereas the term-doc matrix A may have M = 50,000, N = 10 million (and rank close to 50,000), we can construct an approximation A_100 with rank 100. Of all rank-100 matrices, it would have the lowest Frobenius error. C. Eckart, G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218, 1936. 21
Connection to Vector Space Model Intuition: Dimension reduction through LSI brings together related axes in the vector space. 22
Intuition from block matrices. m terms × n documents: Block 1, Block 2, ..., Block k on the diagonal, 0's elsewhere (shaded = non-zero entries). What's the rank of this matrix? 23
Intuition from block matrices. m terms × n documents: Block 1, ..., Block k on the diagonal, 0's elsewhere. Vocabulary partitioned into k topics (clusters); each doc discusses only one topic. 24
Intuition from block matrices. m terms × n documents: Block 1, ..., Block k on the diagonal, 0's elsewhere (shaded = non-zero entries). What's the best rank-k approximation to this matrix? 25
Intuition from block matrices. Likely there's a good rank-k approximation to this matrix: the off-block regions now have a few nonzero entries (e.g., Block 1 covers wiper, tire, V6; Block k covers car and automobile, where "car" appears in some documents and "automobile" in others). 26
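The idealized block structure on these slides can be sketched numerically (the block sizes are hypothetical): with k disjoint topic blocks of identical rows, the whole term-document matrix has rank exactly k.

```python
import numpy as np

# Hypothetical collection: vocabulary split into k = 3 disjoint topics and
# each document using words from exactly one topic -> block-diagonal matrix.
blocks = [np.ones((4, 5)), np.ones((3, 6)), np.ones((2, 4))]
A = np.zeros((9, 15))
r0 = c0 = 0
for B in blocks:
    m, n = B.shape
    A[r0:r0 + m, c0:c0 + n] = B
    r0, c0 = r0 + m, c0 + n

# Each all-ones block has rank 1, so the whole matrix has rank k = 3.
rank = np.linalg.matrix_rank(A)
```

Perturbing a few off-block entries leaves the matrix close (in Frobenius norm) to this rank-3 matrix, which is why a rank-k approximation recovers the topic structure.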
Assumption/Hope Topic 1 Topic 2 Topic 3 27
Latent Semantic Indexing by SVD 28
Performing the maps. Each row and column of A gets mapped into the k-dimensional LSI space by the SVD. Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval. A query q is also mapped into this space by q_k = Σ_k^{-1} U_k^T q. Note: the mapped query is NOT a sparse vector. 29
Sec. 18.4. Performing the maps. The entries of A^T A are the dot products of pairs of document vectors. A_k^T A_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T) = V_k Σ_k U_k^T U_k Σ_k V_k^T = (V_k Σ_k)(V_k Σ_k)^T. Since V_k = A_k^T U_k Σ_k^{-1}, we should transform query q to q_k as follows: q_k = Σ_k^{-1} U_k^T q. 30
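The document and query mappings above can be sketched end to end (toy matrix, hypothetical query terms; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((8, 5))                 # toy term-document matrix (8 terms, 5 docs)
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the k-dimensional LSI space: columns of S_k @ Vt_k.
docs_k = S_k @ Vt_k

# Map a sparse query into the same space: q_k = S_k^{-1} U_k^T q.
q = np.zeros(8)
q[1] = q[4] = 1.0                      # query contains terms 1 and 4
q_k = np.linalg.inv(S_k) @ U_k.T @ q   # dense k-vector: no longer sparse

# Rank documents by cosine similarity in the latent space.
scores = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
best = int(np.argmax(scores))
```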
Empirical evidence. Experiments on TREC 1/2/3 (Dumais). Lanczos SVD code (available on netlib), due to Berry, was used in these experiments. Running times of ~one day on tens of thousands of docs [still an obstacle to use]. Dimensions: various values in 250-350 reported; reducing k improves recall (under 200 reported unsatisfactory). Generally expect recall to improve; what about precision? 31
Empirical evidence: Conclusion. Precision at or above median TREC precision; top scorer on almost 20% of TREC topics; slightly better on average than straight vector spaces. Effect of dimensionality:

Dimensions  Precision
250         0.367
300         0.371
346         0.374

40
Failure modes. Negated phrases: TREC topics sometimes negate certain query terms/phrases, which is lost in the automatic conversion of topics to Boolean queries. As usual, the freetext/vector space syntax of LSI queries precludes, say, "Find any doc having to do with the following 5 companies". See Berry and Dumais for more (resources slide). 41
LSI has many other applications. In many settings in pattern recognition and retrieval, we have a feature-object matrix. For text, the terms are features and the docs are objects. Could be opinions & users (Recommender Systems). This matrix may be redundant in dimensionality. We can work with a low-rank approximation. If entries are missing (e.g., users' opinions), we can recover them if the dimensionality is low. 42
Resources. http://www.cs.utk.edu/~berry/lsi++/ ; http://lsi.argreenhouse.com/lsi/lsipapers.html ; Dumais (1993). LSI meets TREC: A status report. ; Dumais (1994). Latent Semantic Indexing (LSI) and TREC-2. ; Dumais (1995). Using LSI for information filtering: TREC-3 experiments. ; M. Berry, S. Dumais, and G. O'Brien (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595. 43
Probabilistic View: Topic Language Models. Diagram: an information need generates the query; each document di in the collection is associated with a document model M_di, topics with topic models M_T1, ..., M_Tm, and the whole collection with a collection model M_C. Query generation is scored as P(Q | M_C, M_T) or, with the document model included, P(Q | M_C, M_T, M_d). 44
Latent Aspects: Example 45
(probabilistic) LSI: plsi 47
Aspect Model. Generation process: Choose a doc d with prob P(d); there are N d's. Choose a latent class z with (generated) prob P(z|d); there are K z's, and K ≪ N; K is chosen in advance (how many topics are in the collection?). Generate a word w with (generated) prob P(w|z). This creates the pair (d, w), without direct concern for z. Joining the probabilities: P(d, w) = P(d) Σ_z P(z|d) P(w|z). Remember: P(z|d) means the probability of z, given d. 48
Aspect Model (2). Log-likelihood: L = Σ_d Σ_w n(d, w) log P(d, w). Maximize this to find P(d), P(z|d), P(w|z). Applying Bayes' theorem, we end up with P(w|d) = Σ_z P(w|z) P(z|d). What is modeled? Doc-specific word distributions, P(w|d), are based on a combination of specific classes/factors/aspects, P(w|z); a document is not just assigned to the nearest cluster. 49
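The aspect model's joint distribution and log-likelihood can be sketched in numpy (toy sizes; the parameter tables are random placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(3)
D, W, K = 4, 6, 2  # docs, words, latent classes (toy sizes)

def rows_to_dist(X):
    # Normalize each row into a probability distribution.
    return X / X.sum(axis=1, keepdims=True)

P_z_given_d = rows_to_dist(rng.random((D, K)))  # P(z|d)
P_w_given_z = rows_to_dist(rng.random((K, W)))  # P(w|z)
P_d = np.full(D, 1.0 / D)                       # P(d), uniform here

# Aspect model: P(d, w) = P(d) * sum_z P(z|d) P(w|z)
P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)

# The joint distribution over all (d, w) pairs sums to 1.
assert np.isclose(P_dw.sum(), 1.0)

# Log-likelihood of a toy count matrix n(d, w): sum_d sum_w n(d,w) log P(d,w).
n = rng.integers(0, 5, size=(D, W))
loglik = float(np.sum(n * np.log(P_dw)))
```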
plsi Learning 50
plsi Generative Model 51
Approach: Expectation Maximization (EM). EM is a popular technique for maximum likelihood estimation. It alternates between: E-step: calculate posterior probabilities of z based on the current parameter estimates; M-step: update the parameter estimates based on the calculated probabilities. 52
Simple example. Data: points on the real line (roughly between -4 and 5). OBJECTIVE: fit a mixture-of-Gaussians model with C = 2 components. Model: P(x|θ) is a weighted sum of two Gaussians. Parameters: keep σ fixed, i.e., only estimate the means μ. 53
Likelihood function. The likelihood is a function of the parameters θ; probability is a function of the r.v. x (different from the last plot). 54
Probabilistic model. Imagine the model generating the data. We need to introduce a label, z, for each data point; the label is called a latent variable (also called hidden, unobserved, or missing). This simplifies the problem: if we knew the labels, we could decouple the components and estimate the parameters separately for each one. 55
Intuition of EM. E-step: compute a distribution on the labels of the points, using the current parameters. M-step: update the parameters using the current guess of the label distribution. Repeat: E, M, E, M, ... 56
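The E/M alternation above, sketched on the two-Gaussian example (the true means, seed, and sample sizes are illustrative; σ is kept fixed as on the earlier slide):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data from two Gaussians with true means -2 and 3, sigma = 1.
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

mu = np.array([-1.0, 1.0])   # initial mean guesses
pi = np.array([0.5, 0.5])    # mixing weights
sigma = 1.0                  # kept fixed: only the means are estimated

for _ in range(50):
    # E-step: responsibility of each component for each point.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate means and weights from the responsibilities.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)
```

After the loop, the estimated means should land near the true component means -2 and 3.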
EM for plsi 57
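A minimal EM sketch for the aspect model itself (toy counts; refinements from Hofmann's paper, such as tempered EM and held-out early stopping, are omitted):

```python
import numpy as np

rng = np.random.default_rng(5)
D, W, K = 6, 10, 2                                  # docs, words, topics
n = rng.integers(0, 5, size=(D, W)).astype(float)   # toy counts n(d, w)

# Random initial parameters, rows normalized to distributions.
P_z_d = rng.random((D, K)); P_z_d /= P_z_d.sum(axis=1, keepdims=True)  # P(z|d)
P_w_z = rng.random((K, W)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)  # P(w|z)

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(z|d) * P(w|z).
    post = P_z_d[:, :, None] * P_w_z[None, :, :]    # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
    nz = n[:, None, :] * post                       # shape (D, K, W)
    P_w_z = nz.sum(axis=0)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = nz.sum(axis=2)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)
```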
pLSA for IR: T. Hofmann 2000. Collections: MED (1033 docs), CRAN (1400 docs), CACM (3204 docs), CISI (1460 docs). Reporting best results with K varying over 32, 48, 64, 80, 128. The pLSA* model takes the average across all models at the different K values. 58
Example of topics found from a Science Magazine papers collection 59
Using Aspects for Query Expansion 60
Relevance Results. Cosine similarity is the baseline. In LSI, the query vector q is multiplied by Σ_k^{-1} U_k^T to get the reduced-space vector. In pLSI, we use p(z|d) and p(z|q); in the EM iterations, only P(z|q) is adapted. 61
Precision-Recall results (4/4) 62
Experiment: pLSI with a 128-factor decomposition. 63
Extension: Document Priors Model the document prior LDA: extension of plsi to better model document generation process [David Blei] Video: https://www.youtube.com/watch?v=7bmsuybpx90 Lecture slides: http://www.cs.columbia.edu/~blei/talks/blei_mlss_2012.pdf 64