Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology


1 Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

2 Vector space model: pros. Partial matching of queries and docs, which handles the case where no doc contains all search terms. Ranking according to similarity score. Term weighting schemes improve retrieval performance. Various extensions: relevance feedback (modifying the query vector), doc clustering and classification.

3 Problems with lexical semantics. Ambiguity and association in natural language. Polysemy: words often have a multitude of meanings and different types of usage. This is more severe in very heterogeneous collections. The vector space model is unable to discriminate between different meanings of the same word.

4 Problems with lexical semantics. Synonymy: different terms may have identical or similar meanings (in a weaker sense: words indicating the same topic). The vector space representation makes no associations between words.
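The synonymy problem can be made concrete with a tiny sketch. The vocabulary and the three docs below are hypothetical and chosen only for illustration; every term is its own axis, so docs that use different words for the same thing share no dimension.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical vocabulary (an assumption, not from the slides).
vocab = ["car", "automobile", "engine", "repair"]
doc_a = [1, 0, 1, 0]  # "car engine"
doc_b = [0, 1, 0, 1]  # "automobile repair"
doc_c = [1, 0, 0, 1]  # "car repair"

sim_ab = cosine(doc_a, doc_b)  # no shared term, although "car" ~ "automobile"
sim_ac = cosine(doc_a, doc_c)  # shares the literal term "car"
```

Here `sim_ab` is zero even though the two docs are about the same topic, which is the gap LSI tries to close.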

5 Polysemy and context. Doc similarity on the single-word level: a shared word contributes to similarity if it is used in the 1st meaning, but not if it is used in the 2nd meaning. [Figure: the words "planet ... saturn ..." shown with context words for meaning 1 (ring, jupiter, space, voyager) and meaning 2 (car, company, dodge, ford).]

6 Latent Semantic Indexing (LSI). Perform a low-rank approximation of the doc-term matrix (typical rank ) to obtain a latent semantic space. Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense). General idea: map docs (and terms) to a low-dimensional space; design the mapping such that the low-dimensional space reflects semantic associations; compute doc similarity based on the inner product in this latent semantic space.

7 Goals of LSI. Similar terms map to similar locations in the low-dimensional space. Noise reduction by dimension reduction.

8 Term-document matrix. This matrix is the basis for computing similarity between docs and queries. Can we transform this matrix so that we get a better measure of similarity between docs and queries?

9 LSI: Overview. Decompose the term-doc matrix into a product of matrices using Singular Value Decomposition (SVD): $C = U \Sigma V^T$, where $C$ is the term-doc matrix. We use the columns of $U$ and $V$ that correspond to the largest values in the diagonal matrix $\Sigma$ as term and document dimensions in the new space. Using SVD for this purpose is called LSI.

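10 Singular Value Decomposition (SVD). For an $M \times N$ matrix $A$ of rank $r$ there exists a factorization $A = U \Sigma V^T$, where $U$ is $M \times M$, $\Sigma$ is $M \times N$ and $V^T$ is $N \times N$. The columns of $U$ are orthogonal eigenvectors of $A A^T$. The columns of $V$ are orthogonal eigenvectors of $A^T A$. The nonzero eigenvalues $\lambda_1, \dots, \lambda_r$ of $A A^T$ are also the nonzero eigenvalues of $A^T A$. $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$, with singular values $\sigma_i = \sqrt{\lambda_i}$. Typically, the singular values are arranged in decreasing order.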
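The properties listed on this slide can be checked numerically. The following is a minimal sketch using NumPy's `np.linalg.svd`; the 3 x 2 matrix is an illustrative choice, not necessarily the one used in the lecture.

```python
import numpy as np

# Illustrative 3 x 2 matrix (an assumption for this sketch).
A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

# Full SVD: U is M x M, Vt is N x N, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in an M x N matrix Sigma.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Eigenvalues of A^T A, sorted in decreasing order.
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

reconstruction_ok = np.allclose(U @ Sigma @ Vt, A)
sigma_is_sqrt_lambda = np.allclose(s ** 2, eig_AtA)
```

For this matrix the eigenvalues of $A^T A$ are 3 and 1, so the singular values come out as $\sqrt{3}$ and 1.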

11 Singular Value Decomposition (SVD): truncated SVD. $A = U \Sigma V^T$, where $U$ is $M \times \min(M,N)$, $\Sigma$ is $\min(M,N) \times \min(M,N)$ and $V^T$ is $\min(M,N) \times N$.
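In NumPy the two shapes of the decomposition correspond to the `full_matrices` flag of `np.linalg.svd`. A small sketch, with an arbitrary 5 x 3 matrix chosen only to make the shapes visible:

```python
import numpy as np

M, N = 5, 3
A = np.arange(M * N, dtype=float).reshape(M, N)  # arbitrary example matrix

# Full form: U is M x M, V^T is N x N.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)

# Reduced form: U is M x min(M, N), V^T is min(M, N) x N.
U_red, s_red, Vt_red = np.linalg.svd(A, full_matrices=False)

shapes_full = (U_full.shape, Vt_full.shape)
shapes_red = (U_red.shape, Vt_red.shape)

# The reduced factors still reproduce A exactly (up to rounding).
A_rebuilt = U_red @ np.diag(s_red) @ Vt_red
```

The reduced form drops only columns of $U$ (or rows of $V^T$) that would be multiplied by zero rows or columns of $\Sigma$, so nothing is lost.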

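12 SVD example ($M = 3$, $N = 2$).
$$A = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$
Or equivalently:
$$A = \begin{pmatrix} 0 & 2/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{6} \\ 1/\sqrt{2} & 1/\sqrt{6} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$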

13 Example. We use a non-weighted matrix here to simplify the example.

14 Example of $C = U \Sigma V^T$: all four matrices.

15 Example of $C = U \Sigma V^T$: the matrix $U$. One row per term, one column for each of the $\min(M,N)$ semantic dimensions. Columns: semantic dims (distinct topics like politics, sports, ...). $u_{ij}$: how strongly related term $i$ is to the topic in column $j$.

16 Example of $C = U \Sigma V^T$: the matrix $\Sigma$. A square, diagonal matrix of size $\min(M,N) \times \min(M,N)$. Each singular value measures the importance of the corresponding semantic dimension. We'll make use of this by omitting unimportant dimensions.

17 Example of $C = U \Sigma V^T$: the matrix $V^T$. One column per doc, one row for each of the $\min(M,N)$ semantic dimensions. Columns of $V$: semantic dims. $v_{ij}$: how strongly related doc $i$ is to the topic in column $j$.

18 Matrix decomposition: Summary. We've decomposed the term-doc matrix $C$ into a product of three matrices. $U$: consists of one (row) vector for each term. $V^T$: consists of one (column) vector for each doc. $\Sigma$: diagonal matrix with singular values, reflecting the importance of each dimension. Next: why are we doing this?

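19 Low-rank approximation: solution via SVD. $A_k = U \, \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0) \, V^T$. We retain only the $k$ largest singular values and set the smallest $r - k$ singular values to zero. Equivalently, $A_k = U_k \Sigma_k V_k^T$, with $U_k$ of size $M \times k$, $\Sigma_k$ of size $k \times k$ and $V_k^T$ of size $k \times N$. In column notation, $A_k$ is a sum of rank-1 matrices: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$.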
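The three ways of writing the rank-$k$ approximation (zeroing singular values, slicing the factors, summing rank-1 outer products) give the same matrix. A short sketch on a random matrix; the seed, sizes and variable names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # arbitrary example matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# (1) Keep the k largest singular values, set the rest to zero.
s_trunc = s.copy()
s_trunc[k:] = 0.0
A_k_zeroed = U @ np.diag(s_trunc) @ Vt

# (2) Slice the factors down to k columns / rows.
A_k_sliced = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# (3) Sum of k rank-1 matrices sigma_i * u_i * v_i^T.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

rank_of_A_k = np.linalg.matrix_rank(A_k_sum)
```

Form (2) is the one usually stored in practice, since it needs only $k(M + N + 1)$ numbers.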

20 Low-rank approximation. SVD can be used to compute optimal low-rank approximations. Keeping the $k$ largest singular values and setting all others to zero results in the optimal approximation [Eckart-Young]: no matrix of rank $k$ can approximate $A$ better than $A_k$. Approximation problem: given a matrix $A$, find a matrix $A_k$ of rank $k$ (i.e., a matrix with $k$ linearly independent rows or columns) such that $A_k = \arg\min_{X : \mathrm{rank}(X) = k} \|A - X\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. $A_k$ and $X$ are both $M \times N$ matrices. Typically, we want $k \ll r$.
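The optimality claim can be probed empirically, though a random search is of course not a proof. The sketch below compares the truncated SVD against randomly generated rank-$k$ competitors; the matrix, seed and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))  # arbitrary example matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
best_err = np.linalg.norm(A - A_k, "fro")

# Random competitors X = P @ Q have rank at most k by construction.
competitor_errs = [
    np.linalg.norm(A - rng.random((8, k)) @ rng.random((k, 5)), "fro")
    for _ in range(200)
]
svd_wins = best_err <= min(competitor_errs)
```

By the Eckart-Young theorem `svd_wins` holds for any competitor of rank at most $k$, not only for the sampled ones.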

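21 Approximation error. How good (bad) is this approximation? It's the best possible, measured by the Frobenius norm of the error: $\min_{X : \mathrm{rank}(X) = k} \|A - X\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$, where $A_k = U \, \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0) \, V^T$ and the $\sigma_i$ are ordered such that $\sigma_i \ge \sigma_{i+1}$. (Measured in the spectral norm instead, the error is exactly $\sigma_{k+1}$.) This suggests why the Frobenius error drops as $k$ increases.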
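A numerical check of the truncation error, as a sketch on an arbitrary random matrix. It uses the standard facts that the Frobenius error of $A_k$ is the root of the sum of the discarded squared singular values, while the spectral-norm (2-norm) error is $\sigma_{k+1}$; the helper name `truncate` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((7, 5))  # arbitrary example matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(k):
    """Rank-k approximation built from the k largest singular triplets."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

ks = range(1, len(s) + 1)
fro_errors = [np.linalg.norm(A - truncate(k), "fro") for k in ks]
predicted = [np.sqrt(np.sum(s[k:] ** 2)) for k in ks]

# Spectral-norm error for k = 2 equals the third singular value.
spectral_err_k2 = np.linalg.norm(A - truncate(2), 2)
```

Since each additional retained dimension removes one non-negative term from the sum, the error can only go down as $k$ grows.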

22 SVD low-rank approximation. The term-doc matrix $C$ may have $M = 50000$, $N = 10^6$ (rank close to ). Construct an approximation $C_{100}$ with rank 100. Of all rank-100 matrices, it would have the lowest Frobenius error. Great, but why would we? Answer: Latent Semantic Indexing. C. Eckart, G. Young, The approximation of one matrix by another of lower rank. Psychometrika, 1, 1936.

23 Recall the unreduced decomposition $C = U \Sigma V^T$.

24 Reducing the dimensionality to 2

25 Reducing the dimensionality to 2

26 Original matrix $C$ vs. reduced $C_2 = U \Sigma_2 V^T$. $C_2$ is a two-dimensional representation of $C$: dimensionality reduction to two dimensions.

27 Why is the reduced matrix better? Similarity of d2 and d3 in the original space: 0. Similarity of d2 and d3 in the reduced space: 0.52.

28 Why is the reduced matrix better? "boat" and "ship" are semantically similar. The reduced similarity measure reflects this. What property of the SVD reduction is responsible for the improved similarity?
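The comparison can be reproduced with a small sketch. The matrices themselves did not survive in the text, so the 5 x 6 term-doc matrix and the term labels below are an assumption: a small example in the style of the IIR textbook, in which d2 contains "boat" and "ocean" and d3 contains only "ship".

```python
import numpy as np

# Assumed term-doc matrix (rows: terms, columns: d1..d6); labels are illustrative.
terms = ["ship", "boat", "ocean", "voyage", "trip"]
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 approximation of C

d2, d3 = 1, 2  # zero-based column indices of docs d2 and d3
sim_original = float(C[:, d2] @ C[:, d3])   # inner product in the original space
sim_reduced = float(C2[:, d2] @ C2[:, d3])  # inner product in the reduced space
```

In the original space d2 and d3 share no term, so their inner product is 0. In the rank-2 space "boat" and "ship" are pulled together through their co-occurrence with "ocean", and the similarity becomes clearly positive (about 0.52 for this matrix).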

29 Example [Example from Dumais et al.]

30 Example [Example from Dumais et al.]

31 Example (k=2): $U_k$, $\Sigma_k$, $V_k^T$ [Example from Dumais et al.]

32 [Figure: two-dimensional plot of terms (squares) and docs (circles). Term labels: graph, tree, minor, survey, time, response, computer, user, interface, human, EPS, system.]

33 [Example from Dumais et al.]

34 How we use the SVD in LSI. Key property of SVD: each singular value tells us how important its dimension is. By setting less important dimensions to zero, we keep the important information but get rid of the details. These details may be noise, in which case reduced LSI is a better representation. These details may also make things dissimilar that should be similar; again, reduced LSI is a better representation because it represents similarity better.

35 How does LSI address synonymy and semantic relatedness? Docs may be semantically similar but not similar in the vector space (when they talk about the same topics but use different words). Desired effect of LSI: synonyms contribute strongly to doc similarity. Standard vector space: synonyms contribute nothing to doc similarity. LSI (via SVD) selects the least costly mapping: different words (= different dimensions of the full space) are mapped to the same dimension in the reduced space. Thus, it maps synonyms or semantically related words to the same dimension. The cost of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words, so LSI will avoid collapsing unrelated words.
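A toy sketch of this effect, with an entirely hypothetical vocabulary and collection: "car" and "automobile" never co-occur but both appear with "engine", while "banana" and "fruit" form an unrelated topic. Term coordinates in the reduced space are taken here as the rows of $U_k \Sigma_k$, which is one common convention and an assumption of this sketch.

```python
import numpy as np

# Hypothetical terms (rows) and four docs (columns).
terms = ["car", "automobile", "engine", "banana", "fruit"]
C = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # one k-dimensional vector per term

sim_full = cosine(C[0], C[1])                       # car vs automobile, full space
sim_lsi = cosine(term_vecs[0], term_vecs[1])        # car vs automobile, LSI space
sim_unrelated = cosine(term_vecs[0], term_vecs[3])  # car vs banana, LSI space
```

In the full space the two synonyms are orthogonal; in the rank-2 space they land on the same direction, while "car" and "banana" stay orthogonal because collapsing them would cost far more approximation error.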

36 Performing the maps. Each row and column of $C$ gets mapped into the $k$-dimensional LSI space by the SVD. A query $q$ is also mapped into this space, by $q_k = q^T U_k \Sigma_k^{-1}$. Since $V_k = C_k^T U_k \Sigma_k^{-1}$, we should transform the query $q$ to $q_k$ in the same way. Note that the mapped query is NOT a sparse vector. Claim: this is not only the mapping with the best (Frobenius error) approximation to $C$, but it also improves retrieval.
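The query mapping can be sketched as follows. The term-doc matrix is an assumed toy example (not taken from the slides) and `fold_in` is a hypothetical helper name. The check at the end confirms that applying the same map to the columns of $C$ reproduces the doc vectors $V_k$, which is why the query is mapped this way.

```python
import numpy as np

# Assumed toy term-doc matrix: 5 terms x 6 docs.
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
U_k = U[:, :k]
Sigma_k_inv = np.diag(1.0 / s[:k])
V_k = Vt[:k, :].T  # one row per doc

def fold_in(q):
    """Map a term-space vector q into the k-dim LSI space: q^T U_k Sigma_k^{-1}."""
    return q @ U_k @ Sigma_k_inv

# Mapping every doc column of C reproduces V_k.
V_k_from_C = C.T @ U_k @ Sigma_k_inv

q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # query using the 1st and 3rd terms
q_k = fold_in(q)
```

The raw query has only two non-zero entries out of five, but `q_k` is a dense $k$-vector, which is the point of the "NOT a sparse vector" remark.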

37 Implementation. Compute the SVD of the term-doc matrix. Map docs to the reduced space. Map the query into the reduced space: $q_k = q^T U_k \Sigma_k^{-1}$. Compute the similarity of $q_k$ with all reduced docs in $V_k$. Output a ranked list of docs as usual. What is the fundamental problem with this approach?
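The steps above can be sketched end to end with dense NumPy arrays. This is an illustration under stated assumptions, not a production implementation: the function name `lsi_rank`, the use of cosine similarity, and the toy term-doc matrix are all choices made for this example, and a real collection would need a sparse, truncated SVD solver.

```python
import numpy as np

def lsi_rank(C, q, k):
    """Rank docs (columns of C) for query q in a rank-k LSI space.

    Returns (order, scores): doc indices sorted by decreasing cosine
    similarity, and the cosine score of every doc.
    """
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T        # reduced doc vectors, one row per doc
    q_k = (q @ U_k) / s_k       # q^T U_k Sigma_k^{-1}
    denom = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    scores = np.divide(docs_k @ q_k, denom,
                       out=np.zeros(len(docs_k)), where=denom > 0)
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Assumed toy collection: rows are terms, columns are docs d1..d6.
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

q = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # query consisting of the 2nd term only
order, scores = lsi_rank(C, q, k=2)
```

Only d2 contains the query term literally, and it is ranked first. d3, which shares no term with the query, still receives a positive score because its only term co-occurs with the same context, whereas d4 from the other topic scores negatively.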

38 Empirical evidence. Experiments on TREC 1/2/3 (Dumais). Lanczos SVD code (available on netlib) due to Berry was used in these experiments. Running times of about one day on tens of thousands of docs [still an obstacle to use]. Dimensions: various values reported. Reducing k improves recall. Under 200 reported unsatisfactory. Generally expect recall to improve; what about precision?

39 Empirical evidence. Precision at or above median TREC precision. Top scorer on almost 20% of TREC topics. Slightly better on average than straight vector spaces. Effect of dimensionality: [Table: Dimensions vs. Precision]

40 But why is this clustering? We've talked about docs, queries, retrieval and precision here. What does this have to do with clustering? Intuition: dimension reduction through LSI brings together related axes in the vector space.

41 Intuition from block matrices. [Figure: an M terms x N documents matrix with Block 1, Block 2, ..., Block k on the diagonal and 0's elsewhere; the blocks are homogeneous non-zero blocks.] What's the rank of this matrix?

42 Intuition from block matrices. [Figure: an M terms x N documents matrix with Block 1, Block 2, ..., Block k on the diagonal and 0's elsewhere.] Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.
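The idealized block picture can be checked directly. In the sketch below the helper `block_diagonal_ones` and the block sizes are hypothetical; each block is filled with ones to model a homogeneous non-zero block.

```python
import numpy as np

def block_diagonal_ones(block_shapes):
    """Term-doc matrix with all-ones blocks on the diagonal and zeros elsewhere."""
    M = sum(r for r, _ in block_shapes)
    N = sum(c for _, c in block_shapes)
    C = np.zeros((M, N))
    r0 = c0 = 0
    for r, c in block_shapes:
        C[r0:r0 + r, c0:c0 + c] = 1.0
        r0 += r
        c0 += c
    return C

# k = 3 topics with (terms, docs) block sizes chosen arbitrarily.
C = block_diagonal_ones([(3, 4), (2, 3), (4, 2)])
rank = np.linalg.matrix_rank(C)
singular_values = np.linalg.svd(C, compute_uv=False)
```

Every all-ones block has rank 1, so the whole matrix has rank $k = 3$, with exactly one non-zero singular value per topic (the square root of terms times docs in that block).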

43 Intuition from block matrices. Likely there's a good rank-k approximation to this matrix. [Figure: the same block matrix with Block 1, Block 2, ..., Block k on the diagonal, but with a few nonzero entries outside the blocks; example term labels: wiper, tire, V6, car, automobile.]
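A sketch of the "nearly block" case: starting from three all-ones 3 x 3 blocks, a single off-block entry is switched on (the position is an arbitrary choice). The exact rank goes up, but the rank-3 approximation remains close.

```python
import numpy as np

k = 3
# Three 3 x 3 all-ones blocks on the diagonal of a 9 x 9 matrix.
C = np.kron(np.eye(k), np.ones((3, 3)))
C[0, 4] = 1.0  # one stray non-zero entry outside the blocks

rank = np.linalg.matrix_rank(C)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative Frobenius error of the rank-k approximation.
rel_err = np.linalg.norm(C - C_k, "fro") / np.linalg.norm(C, "fro")
```

The stray entry perturbs the matrix by at most 1 in spectral norm, so the fourth singular value is at most 1 while the first three stay at 2 or above; the relative error of the rank-3 approximation is therefore below 20% here.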

44 Simplistic picture. [Figure: Topic 1, Topic 2, Topic 3.]

45 Reference: Chapter 18 of the IIR book.
