Latent Semantic Indexing
- Godfrey Booker
1 Latent Semantic Indexing 1
2 Latent Semantic Indexing
Term-document matrices are very large, though most cells are zeros
But the number of topics that people talk about is small (in some sense): clothes, movies, politics, ...
Each topic can be represented as a cluster of (semantically) related terms, e.g. clothes = golf (sweater), jacket, shoe, ...
Can we represent the term-document space by a lower-dimensional latent space (latent space = set of topics)? 2
3 Searching with latent topics
Given a collection of documents, LSI learns clusters of frequently co-occurring terms (e.g. information retrieval, ranking and web)
If you query with "ranking, information retrieval", LSI automatically extends the search to documents that also (and even ONLY) include "web" 3
4 [Figure: a document base of 20 short documents shown as word lists, e.g. "Car GTI", "Car Clarkson Badge", "Polo Red", "Tiger Woods Belfry Tee", "Car GTI Polo", "Motor Bike Oil Tourer", "Fish Pond gold Koi", "PC Dell RAM Floppy Core", "Office Pen Desk VDU", ...]
With the standard VSM, 4 documents are selected (those containing the query term)
5 [Figure: the same 20 documents; the 4 documents selected by the VSM are ranked]
The most relevant words associated with golf in these docs are Car and others; their tf·idf scores: Car: 2·(20/3) ≈ 13; a second term: 2·(20/3) ≈ 13; a third term: 3·(20/16) ≈ 4
6 [Figure: the same 20 documents; selection based on (Selezione basata su) the query, with the ranked docs and their tf·idf scores]
If we consider the co-occurring terms with the highest tf·idf, car and topgear turn out to be related to golf more than petrol
7 Selection based on the semantic domain of golf
[Figure: the 20 documents; the selection now uses all the words in the semantic domain of golf (Car, Wheel, GTI, Polo, ...)]
We now search with all the words in the semantic domain of golf. The list of retrieved docs is now ranked on golf and the other related words
8 [Figure: the ranking based on the semantic domain of golf, compared with the keyword-based selection (Selezione basata su = selection based on)]
The co-occurrence-based ranking improves the performance. Note that one of the most relevant docs does NOT include the word golf, and a doc with a spurious sense disappears
9 Ranking with Latent Semantic Indexing
The previous example just gives the intuition
Latent Semantic Indexing is an algebraic method to identify clusters of co-occurring terms, called latent topics, and to compute query-doc similarity in a latent space, in which every coordinate is a latent topic.
A latent quantity is one which cannot be directly observed; what is observed is a measurement which may include some amount of random error (topics are latent in this sense: what we observe is an approximation of the true semantic topics)
Since it is an algebraic method, it needs some linear algebra background 9
10 The LSI method: how to detect topics Linear Algebra Background 10
11 Eigenvalues & Eigenvectors
Def: a vector v ∈ Rᵐ, v ≠ 0, is an eigenvector of an m×m matrix A, with corresponding eigenvalue λ, if: Av = λv 11
12 Algebraic method
How many eigenvalues are there at most?
The equation Av = λv, i.e. (A − λI)v = 0, has a non-zero solution iff det(A − λI) = 0, where I is the identity matrix
This is an m-th order equation in λ, which can have at most m distinct solutions (the roots of the characteristic polynomial); they can be complex even though A is real. 12
13 Example of eigenvector/eigenvalues
A = [1 1; −3 5], v = [1; 3], λ = 4
Check Av = λv:
[1 1; −3 5]·[1; 3] = [1 + 1·3; −3·1 + 5·3] = [4; 12] = 4·[1; 3]
14 Example of computation
Remember: det [a b; c d] = ad − bc
A = [1 1; −3 5]
Characteristic polynomial: det(A − λI) = det [1−λ 1; −3 5−λ] = (λ−1)(λ−5) + 3 = λ² − 6λ + 8 = 0
(λ − 4)(λ − 2) = 0, so λ₁ = 4 and λ₂ = 2 are the eigenvalues of A
Eigenvectors: solve (A − λI)v = 0
λ₁ = 4: [−3 1; −3 1]·[α; β] = [0; 0] ⇒ −3α + β = 0 ⇒ β = 3α ⇒ v₁ = [1; 3]
λ₂ = 2: [−1 1; −3 3]·[α; β] = [0; 0] ⇒ −α + β = 0 ⇒ α = β ⇒ v₂ = [1; 1]
Note that we compute only the DIRECTION of eigenvectors 14
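The worked example can be checked numerically. A minimal sketch (mine, not part of the slides) using NumPy:

```python
import numpy as np

# Matrix from the worked example: eigenvalues should be 4 and 2,
# with eigenvector directions (1, 3) and (1, 1).
A = np.array([[1.0, 1.0],
              [-3.0, 5.0]])

eigvals, eigvecs = np.linalg.eig(A)
assert np.allclose(np.sort(eigvals), [2.0, 4.0])

# Columns of eigvecs are (normalized) eigenvectors: check A v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# Only the DIRECTION matters: any multiple of (1, 3) is an eigenvector for lambda = 4
assert np.allclose(A @ np.array([2.0, 6.0]), 4 * np.array([2.0, 6.0]))
```

Note that `np.linalg.eig` returns unit-length eigenvectors, but any non-zero scalar multiple along the same direction is equally valid.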
15 Geometric interpretation A matrix-vector multiplication Ax is a linear transformation of the vector x 15
16 Matrix-vector multiplication Matrix multiplication by a vector = a linear transformation of the initial vector, which involves rotation and scaling 16
17 Geometric interpretation 17
18 Geometric interpretation Ax = 1.34x = λx An eigenvector is a special vector that is transformed into a scalar multiple of itself by a given matrix (no rotation!) 18
19 Geometric interpretation Here we found another eigenvector of the matrix A. Note that for any single eigenvalue there are infinitely many eigenvectors, but they all have the same direction 19
20 Matrix-vector multiplication
Eigenvectors of different eigenvalues are linearly independent (i.e., if α₁, ..., αₙ are not all zero, then α₁v₁ + ... + αₙvₙ ≠ 0)
For square normal matrices, the eigenvectors of different eigenvalues define an orthonormal space, and they are orthogonal
A square matrix is NORMAL iff it commutes with its transpose, i.e. AAᵀ = AᵀA 20
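As a quick sketch (the example matrices are mine, not from the slides), normality and the orthonormality of the eigenvectors can be checked directly:

```python
import numpy as np

def is_normal(A: np.ndarray) -> bool:
    """A square matrix is NORMAL iff it commutes with its transpose."""
    return np.allclose(A @ A.T, A.T @ A)

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # symmetric, hence normal
T = np.array([[1.0, 2.0],
              [0.0, 1.0]])        # triangular, not normal

assert is_normal(S) and not is_normal(T)

# For a (symmetric, hence normal) matrix the eigenvectors form an orthonormal basis
_, V = np.linalg.eigh(S)          # eigh is the eig variant for symmetric matrices
assert np.allclose(V.T @ V, np.eye(2))
```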
21 Difference between orthonormal and orthogonal? Orthogonal means that the dot product is null (the cosine of the angle is zero). Orthonormal means that the dot product is null and the norm of the vectors is equal to 1. What we are actually saying is that eigenvectors define a set of DIRECTIONS which are orthogonal. If two or more vectors are orthonormal they are also orthogonal, but the inverse is not true. 21
22 Example: projecting a vector on 2 orthonormal spaces (or bases) e₁, e₂ and e′₁, e′₂ are unit vectors, and v₁, v₂ and v′₁, v′₂ are the coordinates of v along the directions of e₁, e₂ and e′₁, e′₂, respectively 22
23 The effect of a matrix-vector multiplication is governed by eigenvectors and eigenvalues
A matrix-vector multiplication such as Ax (A a normal matrix, x a generic vector as in the previous slide) can be rewritten in terms of the eigenvalues/eigenvectors of A
Example: x has coordinates (2, 4, 6) in the eigenvector basis, i.e. x = 2v₁ + 4v₂ + 6v₃
Ax = A(2v₁ + 4v₂ + 6v₃) = 2Av₁ + 4Av₂ + 6Av₃ = 2λ₁v₁ + 4λ₂v₂ + 6λ₃v₃
where v₁, v₂, v₃ are the (orthogonal) eigenvectors of A
Even though x is an arbitrary vector, the action of A on x (transformation) is determined by the eigenvalues/eigenvectors. Why? 23
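The identity above can be sketched numerically for a symmetric (hence normal) 3×3 matrix of my choosing:

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])     # symmetric => orthonormal eigenvectors

lam, V = np.linalg.eigh(A)          # columns of V: eigenvectors v1, v2, v3

coords = np.array([2.0, 4.0, 6.0])  # x expressed in the eigenbasis
x = V @ coords                      # x = 2 v1 + 4 v2 + 6 v3

# A x = 2 lam1 v1 + 4 lam2 v2 + 6 lam3 v3
assert np.allclose(A @ x, V @ (lam * coords))
```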
24 Geometric explanation x is a generic vector with coordinates xᵢ in the eigenvector basis; λᵢ, vᵢ are the eigenvalues and eigenvectors of A: Ax = x₁λ₁v₁ + x₂λ₂v₂ + x₃λ₃v₃ Multiplying a matrix and a vector has two effects on the vector: rotation (the coordinates of the vector change) and scaling (the length changes). The maximum compression and rotation depend on the matrix's eigenvalues λᵢ (s1, s2 and s3 in the picture)
25 Geometric explanation In the distortion, the maximum effect is produced by the biggest eigenvalues (s1 and s2 in the picture). The eigenvalues describe the distortion operated by the matrix on the original vector
26 Summary so far
A matrix A has eigenvectors v and eigenvalues λ, defined by Av = λv
Eigenvalues can be computed as the roots of the characteristic polynomial: det(A − λI) = 0
We can compute only the direction of eigenvectors, since for any eigenvalue there are infinitely many eigenvectors lying on the same direction
If A is normal (i.e. if AAᵀ = AᵀA) then the eigenvectors form an orthonormal basis
The product of A with ANY vector x is a linear transformation of x, where the rotation is determined by the eigenvectors and the scaling is determined by the eigenvalues. The biggest role in this transformation is played by the biggest eigenvalues. 26
27 Eigen/diagonal decomposition Let A be a square matrix with m orthogonal eigenvectors (hence, A is normal) Theorem: there exists an eigendecomposition A = UΛU⁻¹, where Λ is a diagonal matrix (all cells zero except the diagonal), the columns of U are the eigenvectors of A, and the diagonal elements of Λ are the eigenvalues of A 27
28 Diagonal decomposition: why/how
Let U have the eigenvectors as columns: U = [v₁ ... vₙ]
Then AU can be written: AU = A[v₁ ... vₙ] = [λ₁v₁ ... λₙvₙ] = [v₁ ... vₙ]·diag(λ₁ ... λₙ) = UΛ
Thus AU = UΛ, or U⁻¹AU = Λ, and A = UΛU⁻¹ 28
29 Example: AU = UΛ [figure: a worked example] From det(A − λI) = 0 we compute λ₁ = 1, λ₂ = 3; substituting back into (A − λI)v = 0 we get v₁₁ = 2v₁₂ for one eigenvalue, and v₂₁ = 0 with v₂₂ any real for the other 29
30 Diagonal decomposition example
Recall A = [2 1; 1 2]; λ₁ = 1, λ₂ = 3
The eigenvectors [1; −1] and [1; 1] form U = [1 1; −1 1]
Inverting, we have U⁻¹ = [1/2 −1/2; 1/2 1/2] (recall UU⁻¹ = I)
Then, A = UΛU⁻¹ = [1 1; −1 1]·[1 0; 0 3]·[1/2 −1/2; 1/2 1/2] 30
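A one-line NumPy check of this decomposition (a sketch, under the assumption that the matrices above were reconstructed correctly):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
U = np.array([[1.0, 1.0],          # columns: eigenvectors (1, -1) and (1, 1)
              [-1.0, 1.0]])
Lam = np.diag([1.0, 3.0])          # Lambda: eigenvalues on the diagonal

# A = U Lambda U^-1, and equivalently U^-1 A U = Lambda
assert np.allclose(A, U @ Lam @ np.linalg.inv(U))
assert np.allclose(np.linalg.inv(U) @ A @ U, Lam)
```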
31 So what? What do these matrices have to do with Information Retrieval and document ranking? Recall our M×N term-document matrices. But everything so far needs square normal matrices, so you need to be patient and learn one last notion 31
32 Singular Value Decomposition for non-square matrices
For a non-square M×N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: A = UΣVᵀ, where U is M×M, Σ is M×N and V is N×N
The columns of U are orthogonal eigenvectors of AAᵀ (left singular vectors)
The columns of V are orthogonal eigenvectors of AᵀA (right singular vectors)
NOTE THAT AAᵀ and AᵀA are square symmetric (and hence NORMAL)
The eigenvalues λ₁ ≥ ... ≥ λ_r of AAᵀ are also the eigenvalues of AᵀA, and: Σ = diag(σ₁ ... σ_r), with singular values σᵢ = √λᵢ 32
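These relations are easy to verify numerically; a sketch with a small made-up 2×3 matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],     # M = 2, N = 3, rank r = 2
              [0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)        # A = U Sigma V^T; s holds the singular values

# Columns of U are eigenvectors of A A^T, with eigenvalues sigma_i^2
for i in range(len(s)):
    assert np.allclose((A @ A.T) @ U[:, i], s[i] ** 2 * U[:, i])

# Eigenvalues of A A^T = non-zero eigenvalues of A^T A, and sigma_i = sqrt(lambda_i)
lam = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
assert np.allclose(s, np.sqrt(lam))
```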
33 An example 33
34 Singular Value Decomposition Illustration of SVD dimensions and sparseness [figure: A (M×N) = U (M×M) · Σ (M×N) · Vᵀ (N×N); for M > N the bottom rows of Σ are all zeros, so the corresponding columns of U and rows of Vᵀ become zeros (irrelevant), too]
35 Back to matrix-vector multiplication Remember what we said? In a matrix-vector multiplication the biggest role is played by the biggest eigenvalues. The diagonal matrix Σ holds the singular values of A (the square roots of the eigenvalues of AᵀA) in decreasing order along the diagonal. We can therefore apply an approximation by setting σᵢ = 0 if σᵢ ≤ θ, i.e. by considering only the first k singular values 35
36 Reduced SVD If we retain only k singular values, and set the rest to 0, then we don't need the matrix parts shown in red. Then Σ is k×k, U is M×k, Vᵀ is k×N, and A_k is M×N. This is referred to as the reduced SVD, or rank-k approximation [figure: now all the red and yellow parts are zeros] 36
37 Let's recap [figure: SVD shapes for M < N and M > N] Since the yellow part is zero, an exact representation of A is: A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σ_r u_r v_rᵀ, with r = rank(A) ≤ min(M, N). But for some k < r, a good approximation is: A_k = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σ_k u_k v_kᵀ 37
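The sum-of-outer-products form can be sketched as follows (random matrix; the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k: int) -> np.ndarray:
    """A_k = sigma_1 u_1 v_1^T + ... + sigma_k u_k v_k^T."""
    return sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))

r = len(s)                                   # here r = min(M, N) = 4
assert np.allclose(rank_k(r), A)             # k = r reconstructs A exactly
err = [np.linalg.norm(A - rank_k(k), "fro") for k in range(r + 1)]
assert all(err[k + 1] <= err[k] + 1e-12 for k in range(r))   # error shrinks with k
```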
38 Example of rank approximation [figure: A ≈ A*, the approximation obtained by keeping only the first singular values] 38
39 Approximation error How good (bad) is this approximation? It's the best possible, measured by the Frobenius norm of the error: min_{X : rank(X)=k} ‖A − X‖_F = ‖A − A_k‖_F = √(σ_{k+1}² + ... + σ_r²), where the σᵢ are ordered such that σᵢ ≥ σᵢ₊₁. This suggests why the Frobenius error drops as k increases. 39
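A numeric sketch of this bound (random made-up data; this is the Frobenius-norm version of the Eckart-Young theorem):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # best rank-k approximation

# || A - A_k ||_F = sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
err = np.linalg.norm(A - A_k, "fro")
assert np.allclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```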
40 Images give a better intuition (image = matrix of pixels) 40
41 K=10 41
42 K=20 42
43 K=30 43
44 K=100 44
45 K=322 (the original image) 45
46 We save space!! But this is only one issue 46
47 So, finally, back to IR
Our initial problem was: the term-document (M×N) matrix A is highly sparse (has many zeros)
However, since groups of terms tend to co-occur together, can we identify the semantic space of these clusters of terms, and apply the vector space model in the semantic space defined by such clusters?
What we learned so far: matrix A can be decomposed and rank-k approximated using SVD
Does this help solve our initial problem? 47
48 Latent Semantic Indexing via the SVD If A is our term-document matrix, then AAᵀ and AᵀA are the matrices of term and document co-occurrences, respectively 48
49 Meaning of AᵀA and AAᵀ
L = AAᵀ: L_ij = Σ_{k=1..N} A_ik (Aᵀ)_kj = Σ_{k=1..N} A_ik A_jk
Each product A_ik A_jk is non-zero iff word i and word j both occur in doc k, so (for a binary incidence matrix) L_ij is the number of documents in which w_i and w_j co-occur
Similarly, for L′ = AᵀA, L′_ij is the number of words co-occurring in docs d_i and d_j (or vice versa if A is a document-term rather than a term-document matrix)
50 Example
51 Term-document matrix A
52 Term co-occurrences example L_trees,graph = (row vector of "trees") · (row vector of "graph")ᵀ = 2
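For a 0/1 term-document matrix this computation is just the dot product of two term rows; a sketch with made-up incidence values (the term names follow the example, the occurrences are mine):

```python
import numpy as np

terms = ["graph", "trees", "minors"]
#                 d1 d2 d3 d4
A = np.array([[1, 1, 1, 0],        # graph
              [0, 1, 1, 0],        # trees
              [1, 0, 1, 1]])       # minors

L = A @ A.T                        # term co-occurrence matrix L = A A^T

# trees and graph co-occur in d2 and d3, so the entry is 2
assert L[terms.index("trees"), terms.index("graph")] == 2
```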
53 So the matrix L = AAᵀ is the matrix of term co-occurrences
Remember: the eigenvectors of a matrix define an orthonormal space
Remember: bigger eigenvalues define the main directions of this space
But: the matrices L = AAᵀ and L′ = AᵀA are SIMILARITY matrices (of terms and of documents, respectively). They define a SIMILARITY SPACE (the orthonormal space of their eigenvectors)
If the matrix elements are word co-occurrences, bigger eigenvalues are associated with bigger groups of similar words
Similarly, bigger eigenvalues of L′ = AᵀA are associated with bigger groups of similar documents (those in which the same terms co-occur)
54 LSI: the intuition
[figure: document vectors (green, yellow, red) in the term space t1, t2, t3; black vectors: principal directions; blue segments: projections]
Projecting A in the term space: the green, yellow and red vectors are documents. If they form small angles, they have common words (remember cosine similarity)
The black vectors are the unit eigenvectors of A: they represent the main directions of the document vectors
The blue segments give the intuition of the eigenvalues of L′ = AᵀA: the bigger eigenvalues are those for which the projection of all vectors on the direction of the corresponding eigenvector is larger 54
55 LSI intuition
d_j = (d_1j, d_2j, d_3j)
L′d_j = d_1j λ₁v₁ + d_2j λ₂v₂ + d_3j λ₃v₃
If we multiply all document vectors by L′ = AᵀA, their distortion is mostly determined by the highest eigenvalues
We now project our document vectors on the reference orthonormal system represented by the 3 black vectors [figure: axes t1, t2, t3] 55
56 LSI intuition
...if we remove the dimension(s) with lower eigenvalues (i.e. if we rank-reduce Σ) we don't lose much information:
d_j = (d_1j, d_2j, d_3j)
L′d_j ≈ d_1j λ₁v₁ + d_2j λ₂v₂
Remember that the two new axes represent a combination of co-occurring words, i.e. a latent semantic space 56
57 Example
[figure: a 4-term × 4-document term-document matrix, and the approximated term-doc matrix]
We project terms and docs on two dimensions, v1 and v2 (the principal eigenvectors)
Note that the direction of each eigenvector is determined by the direction of just two terms: (t1, t2) or (t3, t4). We have two new latent semantic coordinates: s1: (t1, t2) and s2: (t3, t4)
Even if t2 does not occur in d2, if we now query with t2 the system will return d2 as well!! 57
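This effect can be reproduced with a small made-up matrix (the values are mine, chosen so that t1 and t2 co-occur while t3 and t4 form a second cluster):

```python
import numpy as np

#             d1 d2 d3 d4
A = np.array([[1, 1, 0, 0],        # t1
              [1, 0, 0, 0],        # t2: co-occurs with t1 in d1, absent from d2
              [0, 0, 1, 1],        # t3
              [0, 0, 1, 1]], dtype=float)   # t4

U, s, Vt = np.linalg.svd(A)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # rank-2 approximation

assert np.isclose(A[1, 1], 0.0)    # t2 never occurs in d2...
assert A_k[1, 1] > 0.3             # ...but A_k gives it a clearly positive weight
```

So a query on t2 now matches d2 too, because the surviving latent dimension mixes t1 and t2.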
58 Co-occurrence space In principle, the space of document or term co-occurrences has a much (much!) higher dimension than the original space of terms!! But with SVD we consider only the most relevant dimensions, through rank reduction: A = UΣVᵀ ≈ U_kΣ_kV_kᵀ = A_k 58
59 Summary so far We compute the SVD rank-k approximation of the term-document matrix A This approximation is based on considering only the principal eigenvalues of the term co-occurrence and document similarity matrices (L = AAᵀ and L′ = AᵀA) The eigenvectors associated with the principal eigenvalues of L = AAᵀ and L′ = AᵀA represent the main directions identified by the term vectors and document vectors, respectively. 59
60 LSI: what are the steps
From the term-doc matrix A, compute its rank-k approximation A_k with SVD
Project docs, terms and queries into a space of k << r dimensions (the k surviving eigenvectors) and compute cosine similarity as usual
These dimensions are not the original axes, but those defined by the orthonormal space of the reduced matrix A_k
Aq ≈ A_k q = σ₁q₁v₁ + σ₂q₂v₂ + ... + σ_k q_k v_k
where σᵢqᵢ (i = 1, 2, ..., k << r) are the new coordinates of q in the orthonormal space of A_k 60
61 Projecting terms, documents and queries in the LS space
If A = UΣVᵀ we also have that V = AᵀUΣ⁻¹, so:
t′ = tᵀVΣ⁻¹ (terms), d′ = dᵀUΣ⁻¹ (documents), q′ = qᵀUΣ⁻¹ (queries)
After the rank-k approximation A ≈ A_k = U_kΣ_kV_kᵀ:
d_k = dᵀU_kΣ_k⁻¹, q_k = qᵀU_kΣ_k⁻¹
sim(q, d) = sim(qᵀU_kΣ_k⁻¹, dᵀU_kΣ_k⁻¹) 61
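A compact sketch of these projection formulas (tiny invented term-document matrix; the fold-in helper name is mine):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],     # term-document matrix: M = 4 terms, N = 3 docs
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk_inv = U[:, :k], np.diag(1.0 / s[:k])

def fold_in(x: np.ndarray) -> np.ndarray:
    """Project a term-space vector into the k-dim LSI space: x^T U_k Sigma_k^-1."""
    return x @ Uk @ Sk_inv

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 0.0, 0.0])           # query: only term t1
q_k = fold_in(q)
sims = [cos_sim(q_k, fold_in(A[:, j])) for j in range(A.shape[1])]

# d1 (which contains t1) beats d3 (which shares no term with the query)
assert sims[0] > sims[2]
```

The cosine similarities are invariant to the sign ambiguity of the SVD, since any sign flip of a column of U affects query and document coordinates in the same way.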
62 Consider a term-doc matrix M×N (M = 11, N = 3) and a query q [figure: the query vector q and the matrix A]
63 1. Compute the SVD: A = UΣVᵀ
64 2. Obtain a low-rank approximation (k = 2): A_k = U_kΣ_kV_kᵀ
65 3a. Compute doc/query similarity For N documents, A k has N columns, each representing the coordinates of a document d i projected in the k LSI dimensions A query is considered like a document, and is projected in the LSI space
66 3c. Compute the query vector q_k = qᵀU_kΣ_k⁻¹: q is projected into the 2-dimensional LSI space!
67 Documents and queries projected in the LSI space
68 q/d similarity
69 An overview of a semantic network of terms based on the top 100 most significant latent semantic dimensions (Zhu&Chen) 69
70 Conclusion LSI performs a low-rank approximation of the document-term matrix (typical rank k << min(M, N)) General idea: map documents (and terms) to a low-dimensional representation. Design the mapping such that the low-dimensional space reflects semantic associations between words (the latent semantic space). Compute document similarity based on cosine similarity in this latent semantic space
71 Another LSI Example
72
73 [figure: a term-document matrix A (terms t1..t4, docs d1..d3), its transpose Aᵀ, and the term co-occurrence matrix AAᵀ] Term co-occurrences
74 A_k = U_kΣ_kV_kᵀ [figure: the rank-k approximation A_k compared with A] Now it is as if t3 belonged to d1!
More informationThe Singular Value Decomposition
The Singular Value Decomposition Philippe B. Laval KSU Fall 2015 Philippe B. Laval (KSU) SVD Fall 2015 1 / 13 Review of Key Concepts We review some key definitions and results about matrices that will
More informationConceptual Questions for Review
Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.
More informationCS 340 Lec. 6: Linear Dimensionality Reduction
CS 340 Lec. 6: Linear Dimensionality Reduction AD January 2011 AD () January 2011 1 / 46 Linear Dimensionality Reduction Introduction & Motivation Brief Review of Linear Algebra Principal Component Analysis
More informationLinear Least Squares. Using SVD Decomposition.
Linear Least Squares. Using SVD Decomposition. Dmitriy Leykekhman Spring 2011 Goals SVD-decomposition. Solving LLS with SVD-decomposition. D. Leykekhman Linear Least Squares 1 SVD Decomposition. For any
More informationSingular Value Decomposition (SVD)
School of Computing National University of Singapore CS CS524 Theoretical Foundations of Multimedia More Linear Algebra Singular Value Decomposition (SVD) The highpoint of linear algebra Gilbert Strang
More informationFoundations of Computer Vision
Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply
More informationLarge Scale Data Analysis Using Deep Learning
Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset
More informationLinear Algebra Fundamentals
Linear Algebra Fundamentals It can be argued that all of linear algebra can be understood using the four fundamental subspaces associated with a matrix. Because they form the foundation on which we later
More informationCS 246 Review of Linear Algebra 01/17/19
1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector
More informationThe Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)
Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will
More information235 Final exam review questions
5 Final exam review questions Paul Hacking December 4, 0 () Let A be an n n matrix and T : R n R n, T (x) = Ax the linear transformation with matrix A. What does it mean to say that a vector v R n is an
More informationMATH 315 Linear Algebra Homework #1 Assigned: August 20, 2018
Homework #1 Assigned: August 20, 2018 Review the following subjects involving systems of equations and matrices from Calculus II. Linear systems of equations Converting systems to matrix form Pivot entry
More informationSymmetric and anti symmetric matrices
Symmetric and anti symmetric matrices In linear algebra, a symmetric matrix is a square matrix that is equal to its transpose. Formally, matrix A is symmetric if. A = A Because equal matrices have equal
More informationLinear Algebra Primer
Linear Algebra Primer David Doria daviddoria@gmail.com Wednesday 3 rd December, 2008 Contents Why is it called Linear Algebra? 4 2 What is a Matrix? 4 2. Input and Output.....................................
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationManning & Schuetze, FSNLP, (c)
page 554 554 15 Topics in Information Retrieval co-occurrence Latent Semantic Indexing Term 1 Term 2 Term 3 Term 4 Query user interface Document 1 user interface HCI interaction Document 2 HCI interaction
More informationLinear Algebra review Powers of a diagonalizable matrix Spectral decomposition
Linear Algebra review Powers of a diagonalizable matrix Spectral decomposition Prof. Tesler Math 283 Fall 2016 Also see the separate version of this with Matlab and R commands. Prof. Tesler Diagonalizing
More informationj=1 u 1jv 1j. 1/ 2 Lemma 1. An orthogonal set of vectors must be linearly independent.
Lecture Notes: Orthogonal and Symmetric Matrices Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk Orthogonal Matrix Definition. Let u = [u
More informationA = 3 1. We conclude that the algebraic multiplicity of the eigenvalues are both one, that is,
65 Diagonalizable Matrices It is useful to introduce few more concepts, that are common in the literature Definition 65 The characteristic polynomial of an n n matrix A is the function p(λ) det(a λi) Example
More informationTBP MATH33A Review Sheet. November 24, 2018
TBP MATH33A Review Sheet November 24, 2018 General Transformation Matrices: Function Scaling by k Orthogonal projection onto line L Implementation If we want to scale I 2 by k, we use the following: [
More informationSolutions to Final Practice Problems Written by Victoria Kala Last updated 12/5/2015
Solutions to Final Practice Problems Written by Victoria Kala vtkala@math.ucsb.edu Last updated /5/05 Answers This page contains answers only. See the following pages for detailed solutions. (. (a x. See
More informationLinear Algebra Review
Chapter 1 Linear Algebra Review It is assumed that you have had a course in linear algebra, and are familiar with matrix multiplication, eigenvectors, etc. I will review some of these terms here, but quite
More informationCSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides
CSE 494/598 Lecture-6: Latent Semantic Indexing LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Homework-1 and Quiz-1 Project part-2 released
More informationDIAGONALIZATION. In order to see the implications of this definition, let us consider the following example Example 1. Consider the matrix
DIAGONALIZATION Definition We say that a matrix A of size n n is diagonalizable if there is a basis of R n consisting of eigenvectors of A ie if there are n linearly independent vectors v v n such that
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationCOMP 558 lecture 18 Nov. 15, 2010
Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to
More informationEmbeddings Learned By Matrix Factorization
Embeddings Learned By Matrix Factorization Benjamin Roth; Folien von Hinrich Schütze Center for Information and Language Processing, LMU Munich Overview WordSpace limitations LinAlgebra review Input matrix
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationApplied Linear Algebra in Geoscience Using MATLAB
Applied Linear Algebra in Geoscience Using MATLAB Contents Getting Started Creating Arrays Mathematical Operations with Arrays Using Script Files and Managing Data Two-Dimensional Plots Programming in
More informationNotes on Linear Algebra
1 Notes on Linear Algebra Jean Walrand August 2005 I INTRODUCTION Linear Algebra is the theory of linear transformations Applications abound in estimation control and Markov chains You should be familiar
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the
More information22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes
Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one
More information1. Linear systems of equations. Chapters 7-8: Linear Algebra. Solution(s) of a linear system of equations (continued)
1 A linear system of equations of the form Sections 75, 78 & 81 a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2 a m1 x 1 + a m2 x 2 + + a mn x n = b m can be written in matrix
More informationMath Bootcamp An p-dimensional vector is p numbers put together. Written as. x 1 x =. x p
Math Bootcamp 2012 1 Review of matrix algebra 1.1 Vectors and rules of operations An p-dimensional vector is p numbers put together. Written as x 1 x =. x p. When p = 1, this represents a point in the
More informationLinear Algebra Review. Fei-Fei Li
Linear Algebra Review Fei-Fei Li 1 / 37 Vectors Vectors and matrices are just collections of ordered numbers that represent something: movements in space, scaling factors, pixel brightnesses, etc. A vector
More informationEE731 Lecture Notes: Matrix Computations for Signal Processing
EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University October 17, 005 Lecture 3 3 he Singular Value Decomposition
More informationA matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and
Section 5.5. Matrices and Vectors A matrix is a rectangular array of objects arranged in rows and columns. The objects are called the entries. A matrix with m rows and n columns is called an m n matrix.
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More information.. CSC 566 Advanced Data Mining Alexander Dekhtyar..
.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a
More informationBackground Mathematics (2/2) 1. David Barber
Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and
More information3 (Maths) Linear Algebra
3 (Maths) Linear Algebra References: Simon and Blume, chapters 6 to 11, 16 and 23; Pemberton and Rau, chapters 11 to 13 and 25; Sundaram, sections 1.3 and 1.5. The methods and concepts of linear algebra
More informationA matrix is a rectangular array of. objects arranged in rows and columns. The objects are called the entries. is called the size of the matrix, and
Section 5.5. Matrices and Vectors A matrix is a rectangular array of objects arranged in rows and columns. The objects are called the entries. A matrix with m rows and n columns is called an m n matrix.
More informationElementary maths for GMT
Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1
More information