INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 13: Query Expansion and Probabilistic Retrieval Paul Ginsparg Cornell University, Ithaca, NY 8 Oct 2009 1/ 34
Administrativa No office hours tomorrow, Fri 9 Oct; e-mail questions, doubts, concerns, problems to cs4300-l@lists.cs.cornell.edu. Remember: the mid-term is one week from today, Thu 15 Oct. For more info, see http://www.infosci.cornell.edu/courses/info4300/2009fa/exams.html 2/ 34
Overview 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabilistic Retrieval 5 Discussion 3/ 34
Outline 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabilistic Retrieval 5 Discussion 4/ 34
Selection of singular values
[diagram: C (t × d) = U (t × m) Σ (m × m) V^T (m × d), truncated to C_k (t × d) = U_k (t × k) Σ_k (k × k) V_k^T (k × d)]
m is the original rank of C. k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k ≪ m. Σ_k^{-1} is defined only on the k-dimensional subspace.
5/ 34
Now approximate C → C_k
In the LSI approximation, use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

  (q · d^(j)) / (|q| |d^(j)|) = (q · C e^(j)) / (|q| |C e^(j)|) → (q · C_k e^(j)) / (|q| |C_k e^(j)|) = (q · d̃^(j)) / (|q| |d̃^(j)|),   (2)

where d̃^(j) = C_k e^(j) = U_k Σ_k V_k^T e^(j) is the LSI representation of the j-th document vector in the original term-document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, ..., N documents, and returning the best matches.
6/ 34
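A minimal NumPy sketch of this computation; the toy term-document matrix and query below are invented for illustration:

```python
import numpy as np

# Toy term-document matrix C (terms x documents); values are invented.
C = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # keep the k largest singular values
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation

q = np.array([1., 1., 0., 0.])                 # query in term space

# Equation (2): cosine between q and each LSI document vector C_k e_j
scores = (q @ C_k) / (np.linalg.norm(q) * np.linalg.norm(C_k, axis=0))
ranking = np.argsort(-scores)                  # best matches first
```

Note that C_k has the same shape as C but rank at most k; the cosine is taken column by column.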
Compare documents in concept space
Recall the (i,j) entry of C^T C is the dot product between columns i and j of C (the term vectors for documents i and j). In the truncated space,

  C_k^T C_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T) = V_k Σ_k U_k^T U_k Σ_k V_k^T = (V_k Σ_k)(V_k Σ_k)^T .

Thus the (i,j) entry is the dot product between columns i and j of (V_k Σ_k)^T = Σ_k V_k^T. In concept space, the comparison between pseudo-document q̂ and document d̂^(j) is thus given by the cosine between Σ_k q̂ and Σ_k d̂^(j):

  (Σ_k q̂ · Σ_k d̂^(j)) / (|Σ_k q̂| |Σ_k d̂^(j)|) = ((q^T U_k Σ_k^{-1} Σ_k)(Σ_k Σ_k^{-1} U_k^T d̃^(j))) / (|U_k^T q| |U_k^T d̃^(j)|) = (q · d̃^(j)) / (|U_k^T q| |d̃^(j)|),   (3)

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.
7/ 34
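The agreement between the concept-space cosine (3) and the term-space cosine (2) can be checked numerically; a sketch with an invented toy matrix and query:

```python
import numpy as np

C = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])
q = np.array([1., 1., 0., 0.])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Concept-space representations: Sigma_k q_hat = U_k^T q for the query,
# and the columns of Sigma_k V_k^T for the documents.
q_con = Uk.T @ q
docs_con = Sk @ Vtk

cos_concept = (q_con @ docs_con) / (np.linalg.norm(q_con) * np.linalg.norm(docs_con, axis=0))

# Equation (2) computed in the full term space with C_k:
Ck = Uk @ Sk @ Vtk
cos_full = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))

# (3) agrees with (2) up to the q-dependent factor |q| / |U_k^T q|
assert np.allclose(cos_concept * np.linalg.norm(q_con), cos_full * np.linalg.norm(q))
```

The computation in concept space only needs k-dimensional vectors, which is the practical payoff.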
Rocchio illustrated
[figure: relevant and nonrelevant documents plotted with their centroids, the difference vector, and the optimal query q_opt]
μ_R: centroid of relevant documents
μ_NR: centroid of nonrelevant documents
μ_R − μ_NR: difference vector
Add the difference vector to μ_R to get q_opt.
q_opt separates relevant/nonrelevant perfectly.
9/ 34
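A sketch of the standard Rocchio update, which moves the query toward μ_R and away from μ_NR; the weights alpha/beta/gamma and the toy vectors are invented for illustration:

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward the centroid of relevant
    documents and away from the centroid of nonrelevant ones."""
    mu_r = np.mean(rel_docs, axis=0)      # centroid of relevant docs
    mu_nr = np.mean(nonrel_docs, axis=0)  # centroid of nonrelevant docs
    q_new = alpha * q0 + beta * mu_r - gamma * mu_nr
    return np.maximum(q_new, 0.0)         # negative weights are usually clipped to 0

# Invented toy vectors over a 4-term vocabulary
q0 = np.array([1., 0., 0., 0.])
rel = np.array([[1., 1., 0., 0.], [1., 0.8, 0., 0.]])
nonrel = np.array([[0., 0., 1., 1.]])
q_new = rocchio(q0, rel, nonrel)
```

Terms common to the relevant documents (here the second vocabulary term) gain weight even though they were absent from the original query.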
Outline 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabalistic Retrieval 5 Discussion 10/ 34
Relevance feedback: Problems Relevance feedback is expensive. Relevance feedback creates long modified queries. Long queries are expensive to process. Users are reluctant to provide explicit feedback. It's often hard to understand why a particular document was retrieved after applying relevance feedback. Excite had full relevance feedback at one point, but abandoned it later. 11/ 34
Pseudo-relevance feedback Pseudo-relevance feedback automates the manual part of true relevance feedback. Pseudo-relevance algorithm: Retrieve a ranked list of hits for the user's query. Assume that the top k documents are relevant. Do relevance feedback (e.g., Rocchio). Works very well on average, but can go horribly wrong for some queries. Several iterations can cause query drift. 12/ 34
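The three steps above can be sketched as follows; the toy documents and the Rocchio weights are invented, and only the assumed-relevant centroid is used (no nonrelevant set):

```python
import numpy as np

def pseudo_relevance_feedback(query_vec, doc_matrix, k=10, alpha=1.0, beta=0.75):
    """One round of pseudo-relevance feedback:
    1. rank documents by cosine similarity to the query,
    2. assume the top k are relevant,
    3. apply a Rocchio-style update toward their centroid."""
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / np.where(norms > 0, norms, 1.0)
    top_k = np.argsort(-scores)[:k]
    centroid = doc_matrix[top_k].mean(axis=0)
    return alpha * query_vec + beta * centroid

# Invented toy data: 5 documents over a 3-term vocabulary
docs = np.array([[1., 1., 0.],
                 [1., 0., 0.],
                 [0., 1., 1.],
                 [0., 0., 1.],
                 [1., 1., 1.]])
q = np.array([1., 0., 0.])
q_expanded = pseudo_relevance_feedback(q, docs, k=2)
```

If the top-k assumption is wrong for a query, the centroid pulls the query off topic, which is exactly the "can go horribly wrong" failure mode; iterating compounds this into query drift.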
Pseudo-relevance feedback at TREC4 Cornell SMART system. Results show the number of relevant documents out of the top 100 for 50 queries (so the total number of documents is 5000):

  method          number of relevant documents
  lnc.ltc         3210
  lnc.ltc-PsRF    3634
  Lnu.ltu         3709
  Lnu.ltu-PsRF    4350

Results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query. (Rocchio will add many more.) This demonstrates that pseudo-relevance feedback is effective on average. 13/ 34
Outline 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabalistic Retrieval 5 Discussion 14/ 34
Query expansion Query expansion is another method for increasing recall. We use global query expansion to refer to global methods for query reformulation. In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent. Main information we use: (near-)synonymy A publication or database that collects (near-)synonyms is called a thesaurus. We will look at two types of thesauri: manually created and automatically created. 15/ 34
Query expansion: Example 16/ 34
Types of user feedback User gives feedback on documents. More common in relevance feedback User gives feedback on words or phrases. More common in query expansion 17/ 34
Types of query expansion Manual thesaurus (maintained by editors, e.g., PubMed) Automatically derived thesaurus (e.g., based on co-occurrence statistics) Query-equivalence based on query log mining (common on the web as in the palm example) 18/ 34
Thesaurus-based query expansion For each term t in the query, expand the query with words the thesaurus lists as semantically related with t. Example from earlier: hospital → medical. Generally increases recall. May significantly decrease precision, particularly with ambiguous terms: interest rate → interest rate fascinate. Widely used in specialized search engines for science and engineering. It's very expensive to create a manual thesaurus and to maintain it over time. A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary. 19/ 34
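A minimal sketch of term-by-term expansion; the thesaurus entries below are invented, and they show how the wrong sense of an ambiguous term ("interest" as in fascination) leaks into a query about interest rates:

```python
# Invented thesaurus entries for illustration only.
thesaurus = {
    "hospital": ["medical", "clinic"],
    "interest": ["fascinate"],   # wrong sense for the query "interest rate"
}

def expand_query(terms, thesaurus):
    """Expand each query term with its thesaurus neighbors (OR-style)."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(thesaurus.get(t, []))
    return expanded

print(expand_query(["interest", "rate"], thesaurus))
```

Real systems often weight expansion terms lower than original query terms to limit the precision loss.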
Example for manual thesaurus: PubMed 20/ 34
Automatic thesaurus generation Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents. Fundamental notion: similarity between two words. Definition 1: Two words are similar if they co-occur with similar words. car and motorcycle co-occur with road, gas and license, so they must be similar. Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar. Co-occurrence is more robust, grammatical relations are more accurate. 21/ 34
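A sketch of Definition 1 using document-level co-occurrence counts; the toy corpus is invented, and word similarity is the cosine between co-occurrence profiles:

```python
import numpy as np

# Toy corpus (invented). Two words score as similar if their
# co-occurrence profiles over the vocabulary are similar (Definition 1).
docs = [
    "car road gas license",
    "motorcycle road gas license",
    "apple pear harvest",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# C[i, j] = number of documents in which words i and j co-occur
C = np.zeros((len(vocab), len(vocab)))
for d in docs:
    words = d.split()
    for a in words:
        for b in words:
            if a != b:
                C[idx[a], idx[b]] += 1

def sim(w1, w2):
    """Cosine similarity between the co-occurrence rows of two words."""
    v1, v2 = C[idx[w1]], C[idx[w2]]
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

# car and motorcycle never co-occur, but share neighbors road, gas, license
assert sim("car", "motorcycle") > sim("car", "apple")
```

Definition 2 would instead count (verb, object) pairs from a parser, which is more accurate but needs more machinery.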
Co-occurrence-based thesaurus: Examples

  word          nearest neighbors
  absolutely    absurd, whatsoever, totally, exactly, nothing
  bottomed      dip, copper, drops, topped, slide, trimmed
  captivating   shimmer, stunningly, superbly, plucky, witty
  doghouse      dog, porch, crawling, beside, downstairs
  makeup        repellent, lotion, glossy, sunscreen, skin, gel
  mediating     reconciliation, negotiate, case, conciliation
  keeping       hoping, bring, wiping, could, some, would
  lithographs   drawings, Picasso, Dali, sculptures, Gauguin
  pathogens     toxins, bacteria, organisms, bacterial, parasite
  senses        grasp, psyche, truly, clumsy, naive, innate
22/ 34
Summary Relevance feedback and query expansion increase recall. In many cases, precision is decreased, often significantly. Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback. 23/ 34
Outline 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabalistic Retrieval 5 Discussion 24/ 34
Basics of probability theory
A = event, 0 ≤ p(A) ≤ 1
Joint probability: p(A, B) = p(A ∩ B)
Conditional probability: p(A|B) = p(A, B)/p(B)
Note p(A, B) = p(A|B) p(B) = p(B|A) p(A), which gives the posterior probability of A after seeing the evidence B:

  Bayes' Thm: p(A|B) = p(B|A) p(A) / p(B)

In the denominator, use p(B) = p(B, A) + p(B, Ā) = p(B|A) p(A) + p(B|Ā) p(Ā)
Odds: O(A) = p(A)/p(Ā) = p(A)/(1 − p(A))
25/ 34
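A quick numeric sanity check of these identities; the probabilities below are invented:

```python
# Invented probabilities for a quick check of the slide's identities.
p_A = 0.3                 # prior p(A)
p_B_given_A = 0.8         # p(B|A)
p_B_given_notA = 0.2      # p(B|~A)

# Total probability: p(B) = p(B|A)p(A) + p(B|~A)p(~A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: posterior p(A|B)
p_A_given_B = p_B_given_A * p_A / p_B

# Odds of A
odds_A = p_A / (1 - p_A)

print(p_B, p_A_given_B, odds_A)
```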
Probability Ranking Principle (PRP)
For query q and document d, let R_{d,q} be a binary indicator of whether d is relevant to q: R = 1 if relevant, else R = 0.
Order documents according to the estimated probability of relevance with respect to the information need, p(R = 1|d, q).
(Bayes optimal decision rule) d is relevant iff p(R = 1|d, q) > p(R = 0|d, q)
26/ 34
Binary independence model (BIM)
Represent documents and queries as binary term incidence vectors: x = (x_1, ..., x_M), q = (q_1, ..., q_M) (x_t = 1 if term t is present in document d, else x_t = 0).
Assume independence: no association between terms in the binary bag of words (not true of course, but works OK to first approximation).
Want to determine

  p(R = 1|x, q) = p(x|R = 1, q) p(R = 1|q) / p(x|q)
  p(R = 0|x, q) = p(x|R = 0, q) p(R = 0|q) / p(x|q)

(On the r.h.s.: the probabilities that if a document is retrieved, then its representation is x, and the prior probabilities of retrieving a relevant or nonrelevant document; also p(R = 1|x, q) + p(R = 0|x, q) = 1.)
27/ 34
Ranking function
Order by descending p(R = 1|d, q), using the BIM p(R = 1|x, q). Use the odds of relevance:

  O(R|x, q) = [p(x|R = 1, q) p(R = 1|q) / p(x|q)] / [p(x|R = 0, q) p(R = 0|q) / p(x|q)]
            = [p(R = 1|q) / p(R = 0|q)] · [p(x|R = 1, q) / p(x|R = 0, q)]

How to estimate the last term?
28/ 34
Naive Bayes conditional independence assumption:

  p(x|R = 1, q) / p(x|R = 0, q) = ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)

so

  O(R|x, q) = O(R|q) ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)

Let the probabilities of terms appearing in relevant/nonrelevant documents w.r.t. the query be p_t = p(x_t = 1|R = 1, q) and u_t = p(x_t = 1|R = 0, q):

                         relevant (R = 1)   nonrelevant (R = 0)
  term present x_t = 1        p_t                 u_t
  term absent  x_t = 0      1 − p_t             1 − u_t
29/ 34
One more approximation
Also assume that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: p_t = u_t if q_t = 0. Then we only need to retain the query terms (q_t = 1):

  O(R|x, q) = O(R|q) ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)
            = O(R|q) ∏_{t: x_t=1, q_t=1} (p_t / u_t) ∏_{t: x_t=0, q_t=1} (1 − p_t) / (1 − u_t)
30/ 34
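The document-dependent part of this product over query terms can be sketched directly; the p_t, u_t values, query, and documents below are invented:

```python
def odds_ratio_factor(x, q, p, u):
    """Document-dependent factor of O(R|x, q): a product over query terms
    of p_t/u_t when the term is present in the document (x_t = 1) and
    (1-p_t)/(1-u_t) when it is absent. x, q, p, u are indexed by term."""
    val = 1.0
    for t in range(len(q)):
        if q[t] == 1:
            val *= (p[t] / u[t]) if x[t] == 1 else ((1 - p[t]) / (1 - u[t]))
    return val

# Invented example: 3 query terms over a 4-term vocabulary
q = [1, 1, 0, 1]
p = [0.8, 0.6, 0.5, 0.7]   # p_t = p(x_t = 1 | R = 1, q)
u = [0.3, 0.3, 0.5, 0.4]   # u_t = p(x_t = 1 | R = 0, q)

doc1 = [1, 1, 0, 0]   # contains two query terms
doc2 = [0, 0, 1, 0]   # contains no query terms
assert odds_ratio_factor(doc1, q, p, u) > odds_ratio_factor(doc2, q, p, u)
```

Since O(R|q) is the same for every document, this factor alone determines the ranking; in practice one works with its logarithm to avoid underflow.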
Estimates
df_t is the number of documents that contain term t:

                       relevant        nonrelevant           total
  present x_t = 1         s             df_t − s             df_t
  absent  x_t = 0       S − s     (N − df_t) − (S − s)     N − df_t
  total                   S              N − S                 N

Hence p_t = s/S and u_t = (df_t − s)/(N − S). In practice, if the relevant documents are a small percentage of the collection, then u_t ≈ df_t/N and

  log[(1 − u_t)/u_t] = log[(N − df_t)/df_t] ≈ log(N/df_t),

reproducing the idf weight.
31/ 34
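A quick numeric check of the approximation; the collection statistics below are invented:

```python
import math

# Check that u_t ~ df_t/N and log[(1 - u_t)/u_t] ~ log(N/df_t)
# when relevant documents are a tiny fraction of the collection.
N, S = 1_000_000, 10          # collection size, relevant set size (invented)
df_t, s = 100, 5              # term's document frequency, relevant docs containing it

u_t = (df_t - s) / (N - S)
exact = math.log((1 - u_t) / u_t)
idf = math.log(N / df_t)
print(exact, idf)             # nearly equal for df_t << N
```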
Outline 1 Recap 2 Pseudo Relevance Feedback 3 Query expansion 4 Probabalistic Retrieval 5 Discussion 32/ 34
Discussion 4
Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.
Some questions:
Explain the name "latent semantic analysis".
What problems is LSA attempting to solve? Does it succeed?
What criteria were used in selecting the SVD of the term-document matrix?
Explain the meaning of the matrices in the SVD C = UΣV^T.
What does the rank reduction C → C_k = U_k Σ_k V_k^T (keeping only the first k elements of Σ) have to do with latent semantics?
Fig. 1: what aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite not containing words in common with the query?)
33/ 34
Fig. 4: (a) LSI-100 does better at the right of this graph than at the left. What does this have to do with synonymy and polysemy?
Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results? The results on CISI were not as strong; possible explanations?
Fig. 5: what data does the graph plot? What conclusions can you draw?
The article states that "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors." Explain the statement. Is it a serious problem?
34/ 34