Fast LSI-based techniques for query expansion in text retrieval systems
1 Fast LSI-based techniques for query expansion in text retrieval systems. L. Laura, U. Nanni, F. Sarracco. Department of Computer and System Science, University of Rome "La Sapienza". 2nd Workshop on Text-based Information Retrieval, Koblenz, 11 Sept. 2005.
2 Outline
1. Preliminaries: Classical Text Matching, Query expansion by Thesauri, Spectral Techniques
2. LS-Thesaurus, LS-Filter
3.
4.
3 Term-Documents matrix
Collection $D = \{d_1, \ldots, d_n\}$ of text documents. $T = \{t_1, \ldots, t_m\}$: set of distinct index terms in $D$. The term-document matrix is
$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$
where row $i$ corresponds to term $t_i$, column $j$ to document $d_j$, and $a_{i,j}$ is a function of the weight of term $t_i$ in document $d_j$.
7 TF-IDF term-weighting schemes
The value of $a_{i,j}$ is a function of two factors.
A LOCAL FACTOR $L(i,j)$ measuring the relevance of term $t_i$ in document $d_j$. We used:
$$L(i,j) = \frac{\mathrm{freq}(i,j)}{\max_{i \in [1,m]} \mathrm{freq}(i,j)}$$
A GLOBAL FACTOR $G(i)$ to de-amplify the relative weight of terms that are used very frequently in the collection. We used:
$$G(i) = \log \frac{n}{n_i}$$
where $n_i$ is the number of documents containing term $t_i$.
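A minimal NumPy sketch (not from the paper; the function name and the dense-matrix representation are illustrative assumptions) that builds $A$ from a raw term-by-document frequency matrix using exactly these local and global factors:

```python
import numpy as np

def tfidf_matrix(freq):
    """Build the term-document matrix A from raw counts.

    freq[i, j] = number of occurrences of term t_i in document d_j.
    Each entry is a_{i,j} = L(i, j) * G(i), with the local and global
    factors defined above.
    """
    m, n = freq.shape
    freq = freq.astype(float)
    # Local factor: frequency divided by the largest frequency in the document.
    col_max = freq.max(axis=0)
    col_max[col_max == 0.0] = 1.0          # guard against empty documents
    L = freq / col_max
    # Global factor: G(i) = log(n / n_i), n_i = number of documents containing t_i.
    n_i = np.count_nonzero(freq, axis=1).astype(float)
    n_i[n_i == 0.0] = n                    # unused terms end up with weight 0 anyway
    G = np.log(n / n_i)
    return L * G[:, np.newaxis]
```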
10 Text Matching Algorithm
Input: a query vector $q = (q_1, \ldots, q_m)$. Output: the rank vector $r = q \cdot A$.
$A$ and $q$ are sparse, so $q \cdot A$ can be computed very efficiently. If the size of the query is limited to $k$ terms, TM has cost $O(kn)$.
Issues
Polysemy: e.g., polo
Synonymy: e.g., automobile, car, machine
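As a sketch of how cheap this matching step is, the score vector reduces to one (sparse) matrix-vector product; the helper below is an illustration under these assumptions, not code from the paper:

```python
import numpy as np

def text_matching(q, A):
    """Plain text matching: score r = q . A, then sort documents.

    q : length-m query vector (mostly zeros);
    A : m x n term-document matrix, dense ndarray or scipy.sparse matrix.
    Returns document indices ordered by decreasing score.
    """
    r = np.asarray(A.T.dot(q)).ravel()     # r_j = sum_i q_i * a_{i,j}
    return np.argsort(-r)
```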
15 Query expansion
QUERY EXPANSION (or QUERY REWEIGHTING): the process that alters the weights, and possibly the terms, of a query. Two approaches:
1. use relevance feedback
2. use some knowledge of the relationships among terms (a thesaurus)
The term-term correlation matrix $A \cdot A^T$ gives a statistical estimate of the relationships among terms in the collection. Query expansion: $q' \leftarrow q \cdot A \cdot A^T$.
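A minimal sketch of this expansion step; evaluating $(q \cdot A) \cdot A^T$ instead of materialising the $m \times m$ matrix $A \cdot A^T$ is an implementation choice of this sketch, not something stated on the slides:

```python
import numpy as np

def expand_query(q, A):
    """Query reweighting with the term-term correlation matrix:
    q' = q . (A A^T), evaluated as (q . A) . A^T so that the
    m x m correlation matrix is never materialised."""
    return (q @ A) @ A.T
```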
18 Similarity Thesaurus
Qiu and Frei presented an alternative approach to compute $A$. Idea: compute the probability that a document is representative of a term. They propose the following weighting scheme:
$$a_{i,j} = \begin{cases} \dfrac{\left(\frac{\mathrm{freq}(i,j)}{\max_j \mathrm{freq}(i,j)}\right) \cdot \mathrm{itf}(j)}{\sqrt{\sum_{l=1}^{n} \left(\left(\frac{\mathrm{freq}(i,l)}{\max_l \mathrm{freq}(i,l)}\right) \cdot \mathrm{itf}(l)\right)^2}} & \text{if } \mathrm{freq}(i,j) > 0 \\[2mm] 0 & \text{otherwise} \end{cases}$$
$\max_j \mathrm{freq}(i,j)$ is the maximum frequency of term $t_i$ over all the documents in the collection; $\mathrm{itf}(j) = \log \frac{m}{m_j}$, with $m_j$ the number of distinct terms in document $d_j$.
19 Query expansion by Similarity Thesaurus
Algorithm
1: Compute $A$ with the previous weighting function;
2: Compute the similarity thesaurus: $S \leftarrow A \cdot A^T$;
3: Given a query vector $q$:
4: $s \leftarrow q \cdot S$;
5: $s \leftarrow \xi(s, x_r)$;
6: $s \leftarrow s / |q|$, where $|q| = \sum_{i=1}^{m} q_i$;
7: $q' \leftarrow q + s$;
Search quality: use of the similarity thesaurus improves Text Matching but does not completely resolve polysemy and synonymy.
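The slides do not spell out the operator $\xi$; assuming it keeps the $x_r$ largest components of a vector and zeroes the rest, steps 4-7 might look like the following sketch (keep_largest and expand_with_thesaurus are hypothetical names):

```python
import numpy as np

def keep_largest(v, x):
    """One reading of xi(v, x): keep the x largest components, zero the rest."""
    out = np.zeros_like(v, dtype=float)
    idx = np.argsort(-v)[:x]
    out[idx] = v[idx]
    return out

def expand_with_thesaurus(q, S, x_r):
    """Steps 4-7: s = q.S, truncate with xi, normalise by |q| = sum_i q_i, add to q."""
    s = q @ S
    s = keep_largest(s, x_r)
    s = s / q.sum()
    return q + s
```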
21 Latent Semantic Indexing
Idea: use the Singular Value Decomposition to produce a low rank approximation of $A$. The full decomposition of the $m \times n$ matrix $A$ of rank $r$ is
$$A = U_r \Sigma_r V_r^T$$
with $U_r$ of size $m \times r$, $\Sigma_r$ of size $r \times r$ and $V_r^T$ of size $r \times n$. Keeping only the $k$ largest singular values gives the rank-$k$ approximation
$$A_k = U_k \Sigma_k V_k^T$$
with $U_k$ of size $m \times k$, $\Sigma_k$ of size $k \times k$ and $V_k^T$ of size $k \times n$.
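A small sketch of how the rank-$k$ truncation could be obtained with NumPy (for a large sparse $A$ one would rather use a partial SVD such as scipy.sparse.linalg.svds; this is only an illustration):

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T via the full SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # singular values in decreasing order
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k
    return U_k, s_k, Vt_k, A_k
```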
23 LSI Search
The document collection is represented in the reduced $k$-dimensional subspace:
$$D = \Sigma_k^{-1} U_k^T A$$
The query vector is also projected into the $k$-dimensional subspace:
$$q_k = \Sigma_k^{-1} U_k^T q$$
The rank of the $i$-th document is given by
$$r_i = \cos \alpha_i = \frac{q_k \cdot d_i}{\|q_k\| \, \|d_i\|}$$
Issue: this requires the computation of $n$ cosines, which is very slow and unusable for large collections.
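For illustration, a sketch of this ranking step, assuming $U_k$ and the singular values come from the truncated SVD sketched above; the small constant in the denominator only guards against zero norms:

```python
import numpy as np

def lsi_search(q, U_k, s_k, A):
    """Rank documents by cosine similarity in the k-dimensional LSI subspace."""
    proj = (U_k / s_k).T                   # Sigma_k^{-1} U_k^T, a k x m projection
    D = proj @ A                           # documents in the reduced space (k x n)
    q_k = proj @ q                         # projected query (length k)
    cos = (q_k @ D) / (np.linalg.norm(q_k) * np.linalg.norm(D, axis=0) + 1e-12)
    return np.argsort(-cos)
```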
27 LS-Thesaurus
Idea: compute the similarity matrix starting from a low rank approximation of $A$ (as in LSI). We can formally define our similarity matrix $S_k$ in the following way:
$$S_k = A_k A_k^T = U_k \Sigma_k V_k^T V_k \Sigma_k^T U_k^T = U_k \Sigma_k^2 U_k^T$$
Note that the columns of $U$ are the eigenvectors of the matrix $A \cdot A^T$, and the diagonal elements of $\Sigma^2$ are its eigenvalues.
28 Algorithm LS-Thesaurus - Pre-Process
1: Input: a collection of documents D;
2: Output: a Similarity Thesaurus, i.e., an $m \times m$ matrix $S_k$;
4: Compute the term-document matrix $A$ from D;
5: $(U, \Lambda) \leftarrow \mathrm{EIGEN}(A \cdot A^T)$;
6: $U_k \leftarrow U_{(1:m,\,1:k)}$;
7: $\Lambda_k \leftarrow \Lambda_{(1:k,\,1:k)}$;
8: $S_k \leftarrow U_k \Lambda_k U_k^T$;
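A possible NumPy rendering of this pre-process; the call to EIGEN is replaced here by numpy.linalg.eigh on the symmetric matrix $A \cdot A^T$, and the function name and dense storage are assumptions of this sketch:

```python
import numpy as np

def ls_thesaurus(A, k):
    """Pre-process: S_k = U_k Lambda_k U_k^T from the eigendecomposition of A A^T."""
    C = A @ A.T                            # m x m term-term matrix
    lam, U = np.linalg.eigh(C)             # eigh returns eigenvalues in ascending order
    top = np.argsort(-lam)[:k]             # indices of the k largest eigenvalues
    U_k, lam_k = U[:, top], lam[top]
    return (U_k * lam_k) @ U_k.T           # equals U_k diag(lam_k) U_k^T
```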
29 Algorithm LS-Thesaurus - Expand
1: Input: a query vector $q$, a similarity thesaurus $S_k$,
2: a positive integer $x_r$;
3: Output: a query vector $q'$;
5: $s \leftarrow q \cdot S_k$;
6: $s \leftarrow \xi(s, x_r)$;
7: $s \leftarrow s / |q|$, where $|q| = \sum_{i=1}^{m} q_i$;
8: $q' \leftarrow q + s$;
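Tying these sketches together, a hypothetical end-to-end run on a toy random collection could look like the following; k, x_r and the matrix sizes are arbitrary illustrative values, and ls_thesaurus, expand_with_thesaurus and text_matching are the helpers sketched earlier:

```python
import numpy as np

# Hypothetical end-to-end run on a toy random collection.
rng = np.random.default_rng(0)
A = rng.random((500, 200))                 # toy 500-term x 200-document matrix
q = np.zeros(500)
q[[3, 42]] = 1.0                           # a two-term query

S_k = ls_thesaurus(A, k=50)                # pre-process: once per collection
q_exp = expand_with_thesaurus(q, S_k, x_r=20)   # per-query expansion
ranking = text_matching(q_exp, A)          # plain text matching on the expanded query
print(ranking[:10])                        # ten best-ranked document indices
```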
30 LS-Filter
Assumptions
In the LSI algorithm, documents and queries are projected (and compared) in a $k$-dimensional subspace.
The axes of this subspace represent the $k$ most important concepts arising from the documents in the collection.
The user query tries to capture one (or more) of these concepts by using an appropriate set of terms.
Idea: let the system select the terms which best represent the required concepts.
32-37 LS-Filter - Schematic view
[Diagram: the query is mapped to the concept space through $P_{tc}$ and back to the term space through $P_{ct}$ before reaching the search engine.]
1. The user inserts a query.
2. The query vector is projected into the concept space.
3. Minor concepts are removed.
4. The concept vector is projected back into the term space.
5. Minor terms are removed.
6. The remaining terms are added to the original query vector.
38 Algorithm LS-Filter - Pre-Process
1: Input: a collection of documents D;
2: Output: a pair of matrices $(P, P^{-1})$;
4: Compute the term-document matrix $A$ from D;
5: $(U, \Sigma, V) \leftarrow \mathrm{SVD}(A)$;
6: $U_k \leftarrow U_{(1:m,\,1:k)}$;
7: $\Sigma_k \leftarrow \Sigma_{(1:k,\,1:k)}$;
8: $P \leftarrow \Sigma_k^{-1} U_k^T$;
9: $P^{-1} \leftarrow U_k \Sigma_k$;
39 Algorithm LS-Filter - Expand
1: Input: a query vector $q$, the matrices $(P, P^{-1})$,
2: two positive integers $x_c$ and $x_t$;
3: Output: a query vector $q'$;
5: $p \leftarrow P \cdot q$;
6: $p \leftarrow \xi(p, x_c)$;
7: $p \leftarrow P^{-1} \cdot p$;
8: $q' \leftarrow \xi(p, x_t)$;
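A sketch of both LS-Filter phases under the same assumptions as before; the keep_largest helper sketched earlier plays the role of $\xi$, and whether $\xi$ selects by signed or absolute value is not specified on the slides, so this is one plausible reading:

```python
import numpy as np

def ls_filter_preprocess(A, k):
    """Pre-process: P = Sigma_k^{-1} U_k^T (terms -> concepts) and
    P_inv = U_k Sigma_k (concepts -> terms) from the rank-k SVD of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    P = (U_k / s_k).T
    P_inv = U_k * s_k
    return P, P_inv

def ls_filter_expand(q, P, P_inv, x_c, x_t):
    """Expand: project the query onto the concepts, keep the x_c strongest,
    project back and keep the x_t strongest terms (keep_largest plays xi)."""
    p = P @ q                              # concept-space representation
    p = keep_largest(p, x_c)               # drop minor concepts
    p = P_inv @ p                          # back to term space
    return keep_largest(p, x_t)            # drop minor terms: expanded query q'
```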
40 Algorithms
We compared the behavior of the following approaches:
TM: the simple text matching;
LS-T: text matching with queries previously expanded by the LS-Thesaurus algorithm;
LS-F: text matching with queries previously expanded by the LS-Filter algorithm;
LSI: the full LSI computation.
41 Dataset
Three books from O'Reilly in HTML format, about Perl, Unix and Java: 3000 documents (HTML pages) and 150 short queries with a human-made collection of relevant documents, publicly available on the web at URL:
42 Retrieval effectiveness
[Precision-recall plot (% precision vs. % recall) comparing LSI, TM, LS-F and LS-T.]
43 Time performances
[Plot of query time in seconds against the number of documents, for LSI and for TM, LS-T and LS-F.]
44 Examples of LS-Filter
[Table of sample queries and the terms selected by LS-Filter; the transcribed terms are: job control, shell logout, zip file, job, shell, file, background, bourne, util, echo, login, zip, number, perl, checksum, list, prompt, control, ctrl, object, line, filename.]
45 What we have done
Previous techniques: TM is fast but prone to synonymy and polysemy; LSI is effective but slow.
We introduced 2 techniques to improve TM search by using query expansion.
We compared them with TM and LSI search: we used a non-standard mid-size dataset; their performance is generally better than TM's, but not homogeneous.
48 What we want to do
Further experiments on standard datasets.
Exploit user relevance feedback to discriminate relevant concepts in the case of ambiguous queries.
The data structures are very large: can we decrease the space cost?
49 Bibliography
H. Bast and D. Majumdar. Why spectral retrieval works. In Proceedings of ACM SIGIR, 2005.
S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, and R.A. Harshman. Indexing by latent semantic analysis. J. Soc. Inform. Sci., 41(6), 1990.
C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. System Sci., 61(2), 2000.
Y. Qiu and H. Frei. Improving the retrieval effectiveness by a similarity thesaurus. Technical Report 225, ETH Zurich, 1994.
51 Thank you
More information