Problems. Looks for literal term matches. Problems:

Antony O’Neal’
5 years ago

1 Problems
Looks for literal term matches.
Terms in queries (especially short ones) don't always capture the user's information need well.
Problems:
Synonymy: other words with the same meaning, e.g. car and automobile, or 电脑 vs. 计算机 (two Chinese words for "computer").
What if we could match against concepts that represent related words, rather than against the words themselves?
Wang Houfeng, ICL, PKU

2 Latent Semantic Indexing (LSI)
Key idea: instead of representing documents as vectors in a term-dimensional space (one axis per term), represent them (and the terms themselves) as vectors in a lower-dimensional space whose axes are concepts that effectively group together similar words.
These axes are the principal components from PCA.
Suppose we have the keywords car, automobile, driver, elephant.
We want queries on car to also retrieve documents about drivers and automobiles, but not about elephants.
Relevant documents may not contain the query terms, but may contain many related terms.

3 LSI via SVD
The matrix $A$ can be decomposed into three matrices (the SVD) as follows: $A = U \Sigma V^T$.
$U$ is the matrix of orthogonal eigenvectors of $A A^T$.
$V$ is the matrix of orthogonal eigenvectors of $A^T A$.
The nonzero eigenvalues $\lambda_1, \ldots, \lambda_r$ of $A A^T$ are also the nonzero eigenvalues of $A^T A$.
$\Sigma$ is an $r \times r$ diagonal matrix of singular values, where $r$ is the rank of $A$:
$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$, with singular values $\sigma_i = \sqrt{\lambda_i}$.
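The relations on this slide can be checked numerically. The following is a minimal sketch using NumPy; the small matrix `A` is a made-up term-document matrix (an assumption for illustration, not data from the slides). Note that `np.linalg.svd` returns $V^T$ rather than $V$, and returns $\min(m, n)$ singular values, the trailing ones being zero when the rank is lower.

```python
import numpy as np

# Hypothetical 4x3 term-document matrix (rows: terms, columns: documents).
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Thin SVD: U is 4x3, s holds the singular values in descending order,
# and Vt is V transposed (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together reproduces A.
reconstructed = U @ np.diag(s) @ Vt

# Eigenvalues of A^T A, sorted in descending order.
# They should equal the squared singular values (sigma_i = sqrt(lambda_i)).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

print(np.allclose(reconstructed, A))
print(np.allclose(eigvals, s ** 2))
```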

4 Example
Term-document matrix: 9 terms (rows) by 8 chapters (columns ch1 to ch8). The terms are controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices. [Matrix entries not recovered.]
This happens to be a rank-7 matrix, so only 7 dimensions are required.
Its SVD has factors $U$ (9×7), $\Sigma$ (7×7) and $V$ (8×7). [Matrix entries not recovered.]

5 Dimension Reduction in LSI
In the matrix $\Sigma$, select only the $k$ largest values.
Keep the corresponding columns in $U$ and $V$.
The resultant matrix $A_k$ is given by $A_k = U_k \Sigma_k V_k^T$, where $k$, $k < r$, is the dimensionality of the concept space.
The parameter $k$ should be large enough to allow fitting the characteristics of the data, yet small enough to filter out the non-relevant representational detail.
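The truncation step can be sketched in a few lines of NumPy. The helper name `truncate_svd` and the matrix `A` below are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def truncate_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T for matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Singular values come back sorted in descending order, so keeping the
    # first k columns of U and the first k rows of V^T keeps the k largest.
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical 5x4 term-document matrix (not from the slides).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

U_k, s_k, Vt_k = truncate_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # same shape as A, but rank 2

print(A_k.shape, np.linalg.matrix_rank(A_k))
```

In practice $k$ is chosen empirically; the sketch only shows the mechanics of keeping the leading columns.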

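6 Computing and Using LSI
$A = U \Sigma V^T$, with $A$ of size $m \times n$ (terms by documents), $U$ of size $m \times r$, $\Sigma$ of size $r \times r$ and $V^T$ of size $r \times n$.
$\hat{A}_k = U_k \Sigma_k V_k^T$, with $\hat{A}_k$ of size $m \times n$, $U_k$ of size $m \times k$, $\Sigma_k$ of size $k \times k$ and $V_k^T$ of size $k \times n$.
Singular Value Decomposition (SVD): convert the term-document matrix into three matrices $U$, $\Sigma$ and $V$.
Reduce dimensionality: throw out the low-order rows and columns.
Recreate matrix: multiply to produce an approximate term-document matrix.
Use the new matrix to process queries or, better, map the query to the reduced space.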

7 Low-rank Approximation: Reduction
SVD can be used to compute optimal low-rank approximations.
Approximation problem: find $A_k$ of rank $k$ such that
$A_k = \arg\min_{X:\, \mathrm{rank}(X) = k} \| A - X \|_F$, where $\|\cdot\|_F$ is the Frobenius norm.
$A_k$ and $X$ are both $m \times n$ matrices.
Typically, we want $k \ll r$.

8 Low-rank Approximation
Solution via SVD: $A_k = U \, \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0) \, V^T$, that is, set the smallest $r - k$ singular values to zero.
In column notation, $A_k$ is a sum of $k$ rank-1 matrices: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$.
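The two ways of writing $A_k$ (zeroing the smallest singular values, or summing rank-1 outer products) give the same matrix. A small NumPy sketch, using a made-up matrix `A` that is assumed purely for illustration:

```python
import numpy as np

# Hypothetical 5x4 term-document matrix (not from the slides).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Form 1: keep U and V as they are, but zero all but the k largest singular values.
s_zeroed = s.copy()
s_zeroed[k:] = 0.0
A_k_diag = U @ np.diag(s_zeroed) @ Vt

# Form 2: sum of k rank-1 matrices sigma_i * u_i * v_i^T,
# where u_i is column i of U and v_i^T is row i of Vt.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

print(np.allclose(A_k_diag, A_k_sum))
```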

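9 Approximation error
It is the best possible, measured by the Frobenius norm of the error:
$\min_{X:\, \mathrm{rank}(X) = k} \| A - X \|_F = \| A - A_k \|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$,
where the $\sigma_i$ are ordered such that $\sigma_i \ge \sigma_{i+1}$. (Measured in the spectral norm instead, the error is exactly $\sigma_{k+1}$.)
This suggests why the Frobenius error drops as $k$ is increased.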
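The error of the truncated SVD can be checked numerically against the discarded singular values. The sketch below assumes random data (not from the slides). It compares the Frobenius error with $\sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ and the spectral-norm error with $\sigma_{k+1}$, for every $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # hypothetical data, not from the slides

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(k):
    """Truncated SVD: keep only the k largest singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

frobenius_errors = []
spectral_errors = []
for k in range(1, len(s)):
    residual = A - rank_k_approximation(k)
    frobenius_errors.append(np.linalg.norm(residual, "fro"))
    spectral_errors.append(np.linalg.norm(residual, 2))

# Predictions from the discarded singular values.
# With 0-based indexing, s[k] is sigma_{k+1}.
predicted_frobenius = [np.sqrt(np.sum(s[k:] ** 2)) for k in range(1, len(s))]
predicted_spectral = [s[k] for k in range(1, len(s))]

print(np.allclose(frobenius_errors, predicted_frobenius))
print(np.allclose(spectral_errors, predicted_spectral))
```

Each increase in $k$ removes one more non-negative term from the sum, so the Frobenius error cannot increase as $k$ grows.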

10 SVD Low-rank approximation
Whereas the term-document matrix $A$ may have $m = 50000$, $n = 10$ million (and rank close to 50000), we can construct an approximation $A_{100}$ with rank 100.
Of all rank-100 matrices, it would have the lowest Frobenius error.

11 Following the Example
The 9-term by 8-chapter matrix (terms controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices; columns ch1 to ch8) and its factors $U$ (9×7), $\Sigma$ (7×7) and $V$ (8×7). [Matrix entries not recovered.]

12 Formally, this will be the rank-$k$ ($k = 2$) matrix that is closest to the original matrix $M$ in the matrix norm sense.
Full factors: $U$ (9×7), $\Sigma$ (7×7), $V^T$ (7×8). Truncated factors: $U_2$ (9×2), $\Sigma_2$ (2×2), $V_2$ (8×2). [Matrix entries not recovered.]
$U_2 \Sigma_2 V_2^T$ will be a 9×8 matrix that approximates the original matrix.

13 What should be the value of k?
[Figure: the 9-term by 8-chapter example matrix $U \Sigma V^T = U_7 \Sigma_7 V_7^T$ compared with its reconstructions $U_6 \Sigma_6 V_6^T$ ($k = 6$, one component ignored), $U_4 \Sigma_4 V_4^T$ ($k = 4$, 3 components ignored) and $U_2 \Sigma_2 V_2^T$ ($k = 2$, 5 components ignored). Matrix entries not recovered.]

14 Querying
To query for feedback controller, the query vector would be $q = [0\ 0\ 0\ 1\ 1\ 0\ 0\ 0\ 0]'$ (' indicates transpose), with a 1 at the positions of feedback and controller in the term order of the example.
Let $q$ be the query vector. Then the document-space vector corresponding to $q$ is given by $D_q = q' U_2 \Sigma_2^{-1}$: the point at the centroid of the query terms' positions in the new space.
For the feedback controller query vector, the result is: $D_q$ = [values not recovered].
To find the best document match, we compare the $D_q$ vector against all the document vectors in the 2-dimensional $V_2$ space. The document vector that is nearest in direction to $D_q$ is the best match.
The cosine values for the eight document vectors and the query vector are: [values not recovered].
The example uses $U_2$ (9×2), $\Sigma_2$ (2×2) and $V_2$ (8×2) from the 9-term by 8-chapter matrix.
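The querying procedure can be sketched end to end in NumPy. Since the numbers of the chapter example are not available here, the sketch assumes a small made-up term-document matrix built on the keywords car, automobile, driver and elephant from the LSI slide. The function names `fold_in_query` and `cosine_to_documents` are likewise hypothetical.

```python
import numpy as np

# Hypothetical term-document matrix (rows: terms, columns: documents d0..d4).
terms = ["car", "automobile", "driver", "elephant"]
A = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0, 0.0],  # driver
    [0.0, 0.0, 0.0, 1.0, 1.0],  # elephant
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
V_k = Vt[:k, :].T  # one row per document in the k-dimensional concept space

def fold_in_query(q):
    """Map a term-space query to concept space: D_q = q' U_k inv(Sigma_k)."""
    return (q @ U_k) / s_k

def cosine_to_documents(dq):
    """Cosine between the folded-in query and every document vector in V_k."""
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(dq)
    return (V_k @ dq) / norms

q = np.array([1.0, 0.0, 0.0, 0.0])  # the single-term query "car"
cosines = cosine_to_documents(fold_in_query(q))
ranking = np.argsort(-cosines)  # document indices, best match first

print(np.round(cosines, 3))
```

In this toy matrix, document d1 contains only automobile and driver, so a literal term match on car would miss it. In the 2-dimensional concept space it points in the same direction as the query, while the elephant documents d3 and d4 are orthogonal to it.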

15 What LSI can do
LSI analysis effectively does dimensionality reduction, noise reduction, correlation analysis and query expansion (with related words).
Some of the individual effects can be achieved with simpler techniques (e.g. thesaurus construction). LSI does them together.
LSI handles synonymy well, polysemy not so much.
Challenge: SVD is complex to compute ($O(n^3)$) and needs to be updated as new documents are found or updated.
