INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 13: Query Expansion and Probabilistic Retrieval

Paul Ginsparg
Cornell University, Ithaca, NY
8 Oct 2009

Administrativa

No office hours tomorrow, Fri 9 Oct. E-mail questions, doubts, concerns, problems to cs4300-l@lists.cs.cornell.edu

Remember: the mid-term is one week from today, Thu 15 Oct. For more info, see http://www.infosci.cornell.edu/courses/info4300/2009fa/exams.html

Overview

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Outline

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Selection of singular values

[Diagram: the t × d matrix C_k factors as U_k (t × k) times Σ_k (k × k) times V_k^T (k × d).]

m is the original rank of C. k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k ≪ m. Σ_k^{-1} is defined only on the k-dimensional subspace.
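
As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of the rank-k truncation; the toy term-document matrix is invented:

    import numpy as np

    def truncated_svd(C, k):
        # Full SVD, then keep only the k largest singular values/vectors.
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        return U[:, :k], np.diag(s[:k]), Vt[:k, :]

    # Toy 5-term x 4-document count matrix (invented).
    C = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.],
                  [1., 0., 0., 1.]])
    U_k, S_k, Vt_k = truncated_svd(C, k=2)
    C_k = U_k @ S_k @ Vt_k    # the rank-2 approximation of C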

Now approximate C ≈ C_k

In the LSI approximation, use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

q·d^(j) / (|q| |d^(j)|) = q·(C e^(j)) / (|q| |C e^(j)|) → q·(C_k e^(j)) / (|q| |C_k e^(j)|) = q·d'^(j) / (|q| |d'^(j)|),   (2)

where d'^(j) = C_k e^(j) = U_k Σ_k V_k^T e^(j) is the LSI representation of the j-th document vector in the original term-document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, ..., N documents and returning the best matches.

Compare documents in concept space

Recall that the (i, j) entry of C^T C is the dot product between columns i and j of C (the term vectors for documents i and j). In the truncated space,

C_k^T C_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T) = V_k Σ_k U_k^T U_k Σ_k V_k^T = (V_k Σ_k)(V_k Σ_k)^T,

so the (i, j) entry is the dot product between columns i and j of (V_k Σ_k)^T = Σ_k V_k^T. In concept space, the comparison between the pseudo-document q̂ = Σ_k^{-1} U_k^T q and document d̂^(j) = Σ_k^{-1} U_k^T d^(j) is thus given by the cosine between Σ_k q̂ and Σ_k d̂^(j):

(Σ_k q̂)·(Σ_k d̂^(j)) / (|Σ_k q̂| |Σ_k d̂^(j)|) = (q^T U_k Σ_k^{-1} Σ_k)(Σ_k Σ_k^{-1} U_k^T d^(j)) / (|U_k^T q| |U_k^T d^(j)|) = q·d'^(j) / (|U_k^T q| |d'^(j)|),   (3)

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.
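
A quick numerical check (my own sketch, with invented toy data) that the concept-space cosine (3) ranks documents identically to (2), the two differing only by the q-dependent factor |q| / |U_k^T q|:

    import numpy as np

    C = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.],
                  [1., 0., 0., 1.]])
    q = np.array([1., 0., 1., 0., 0.])    # toy query over the 5 terms

    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    C_k = U_k @ S_k @ Vt_k

    # Eq. (2): cosine between q and each column d'^(j) of C_k.
    cos2 = (q @ C_k) / (np.linalg.norm(q) * np.linalg.norm(C_k, axis=0))

    # Eq. (3): cosine between Sigma_k q_hat = U_k^T q and
    # Sigma_k d_hat^(j) = Sigma_k V_k^T e^(j), computed column-wise.
    sq = U_k.T @ q
    SD = S_k @ Vt_k
    cos3 = (sq @ SD) / (np.linalg.norm(sq) * np.linalg.norm(SD, axis=0))

    # The ratio is the constant |q| / |U_k^T q|, so similarity rankings agree.
    assert np.allclose(cos3 / cos2, cos3[0] / cos2[0])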

Rocchio illustrated

[Figure: relevant documents (circles) with centroid µ_R, nonrelevant documents (x's) with centroid µ_NR, and the optimal query q_opt obtained by adding µ_R − µ_NR to µ_R.]

µ_R: centroid of relevant documents
µ_NR: centroid of nonrelevant documents
µ_R − µ_NR: difference vector
Add the difference vector to µ_R to get q_opt.
q_opt separates relevant/nonrelevant perfectly.
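
A minimal sketch (my own, not from the slides) of the optimal Rocchio query described above:

    import numpy as np

    def rocchio_opt(relevant, nonrelevant):
        # q_opt = mu_R + (mu_R - mu_NR): start at the relevant centroid and
        # move away from the nonrelevant centroid along the difference vector.
        mu_r = relevant.mean(axis=0)
        mu_nr = nonrelevant.mean(axis=0)
        return mu_r + (mu_r - mu_nr)

    # Invented 2-d toy clusters: relevant docs near (1, 1), nonrelevant near (0, 0).
    R = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
    NR = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, 0.1]])
    print(rocchio_opt(R, NR))    # lies beyond mu_R, away from the nonrelevant cluster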

Outline

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Relevance feedback: Problems

Relevance feedback is expensive.
Relevance feedback creates long modified queries.
Long queries are expensive to process.
Users are reluctant to provide explicit feedback.
It's often hard to understand why a particular document was retrieved after applying relevance feedback.
Excite had full relevance feedback at one point, but abandoned it later.

Pseudo-relevance feedback

Pseudo-relevance feedback automates the "manual" part of true relevance feedback.

Pseudo-relevance algorithm:
Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).

Works very well on average, but can go horribly wrong for some queries. Several iterations can cause query drift.
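
A minimal sketch of one pseudo-relevance feedback iteration, assuming document vectors (e.g., tf-idf) are already built; the Rocchio weights alpha/beta and the cutoff k are illustrative choices, not values from the slides:

    import numpy as np

    def psrf_query(q, doc_vectors, k=10, alpha=1.0, beta=0.75):
        # 1. Retrieve: cosine-score all documents against the query.
        norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
        scores = (doc_vectors @ q) / np.maximum(norms, 1e-12)
        # 2. Assume the top k documents are relevant.
        top_k = np.argsort(-scores)[:k]
        # 3. Rocchio update toward their centroid (no negative term here).
        return alpha * q + beta * doc_vectors[top_k].mean(axis=0)

Iterating this update is exactly where query drift can creep in: each round trusts the previous round's top k.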

Pseudo-relevance feedback at TREC4

Cornell SMART system. Results show the number of relevant documents out of the top 100 for 50 queries (so 5000 top-ranked documents in total):

method        number of relevant documents
lnc.ltc       3210
lnc.ltc-PsRF  3634
Lnu.ltu       3709
Lnu.ltu-PsRF  4350

The results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query (Rocchio would add many more). This demonstrates that pseudo-relevance feedback is effective on average.

Outline

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Query expansion

Query expansion is another method for increasing recall. We use "global query expansion" to refer to global methods for query reformulation. In global query expansion, the query is modified based on some global resource, i.e., a resource that is not query-dependent.

The main information we use: (near-)synonymy. A publication or database that collects (near-)synonyms is called a thesaurus. We will look at two types of thesauri: manually created and automatically created.

Query expansion: Example

Types of user feedback

User gives feedback on documents. (More common in relevance feedback.)
User gives feedback on words or phrases. (More common in query expansion.)

Types of query expansion

Manual thesaurus (maintained by editors, e.g., PubMed)
Automatically derived thesaurus (e.g., based on co-occurrence statistics)
Query-equivalence based on query-log mining (common on the web, as in the "palm" example)

Thesaurus-based query expansion

For each term t in the query, expand the query with words the thesaurus lists as semantically related to t. Example from earlier: hospital → medical.
Generally increases recall.
May significantly decrease precision, particularly with ambiguous terms: "interest rate" → "interest rate fascinate".
Widely used in specialized search engines for science and engineering.
It's very expensive to create a manual thesaurus and to maintain it over time.
A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary.
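
A toy sketch of thesaurus-based expansion; the two-entry thesaurus is invented, and the ambiguous term "interest" illustrates the precision problem:

    # Hypothetical miniature thesaurus (invented entries).
    thesaurus = {
        "hospital": ["medical", "clinic"],
        "interest": ["rate", "fascinate"],   # both senses: financial and attentional
    }

    def expand(query_terms, thesaurus):
        expanded = list(query_terms)
        for t in query_terms:
            expanded.extend(thesaurus.get(t, []))
        return expanded

    print(expand(["interest", "rate"], thesaurus))
    # ['interest', 'rate', 'rate', 'fascinate'] -- 'fascinate' is the precision loss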

Example for manual thesaurus: PubMed

Automatic thesaurus generation

Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents.
Fundamental notion: similarity between two words.
Definition 1: Two words are similar if they co-occur with similar words. ("car" and "motorcycle" co-occur with "road", "gas", and "license", so they must be similar.)
Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. (You can harvest, peel, eat, prepare, etc., apples and pears, so apples and pears must be similar.)
Co-occurrence is more robust; grammatical relations are more accurate.
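
A sketch of Definition 1 (my own toy example): build document-level co-occurrence vectors and compare words by cosine similarity:

    import numpy as np
    from itertools import combinations

    docs = [["car", "road", "gas"], ["motorcycle", "road", "license"],
            ["car", "license", "gas"], ["apple", "pear", "eat"]]

    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for d in docs:
        for a, b in combinations(set(d), 2):   # document-level co-occurrence counts
            M[idx[a], idx[b]] += 1
            M[idx[b], idx[a]] += 1

    def sim(w1, w2):
        v1, v2 = M[idx[w1]], M[idx[w2]]
        return v1 @ v2 / max(np.linalg.norm(v1) * np.linalg.norm(v2), 1e-12)

    print(sim("car", "motorcycle"))   # > 0: both co-occur with "road"/"license"
    print(sim("car", "pear"))         # 0: no shared co-occurring words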

Co-occurrence-based thesaurus: Examples

Word          Nearest neighbors
absolutely    absurd, whatsoever, totally, exactly, nothing
bottomed      dip, copper, drops, topped, slide, trimmed
captivating   shimmer, stunningly, superbly, plucky, witty
doghouse      dog, porch, crawling, beside, downstairs
makeup        repellent, lotion, glossy, sunscreen, skin, gel
mediating     reconciliation, negotiate, case, conciliation
keeping       hoping, bring, wiping, could, some, would
lithographs   drawings, Picasso, Dali, sculptures, Gauguin
pathogens     toxins, bacteria, organisms, bacterial, parasite
senses        grasp, psyche, truly, clumsy, naive, innate

Summary

Relevance feedback and query expansion increase recall.
In many cases, precision is decreased, often significantly.
Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback.

Outline

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Basics of probability theory

A = event; 0 ≤ p(A) ≤ 1
Joint probability: p(A, B) = p(A ∩ B)
Conditional probability: p(A|B) = p(A, B)/p(B)
Note p(A, B) = p(A|B) p(B) = p(B|A) p(A), which gives the posterior probability of A after seeing the evidence B:

Bayes' theorem: p(A|B) = p(B|A) p(A) / p(B)

In the denominator, use p(B) = p(B, A) + p(B, Ā) = p(B|A) p(A) + p(B|Ā) p(Ā)

Odds: O(A) = p(A)/p(Ā) = p(A)/(1 − p(A))
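
A worked numeric check of Bayes' theorem and odds (the numbers are invented):

    # Invented numbers: p(A) = 0.3, p(B|A) = 0.8, p(B|not A) = 0.2.
    p_A, p_B_given_A, p_B_given_notA = 0.3, 0.8, 0.2
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability: 0.38
    p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes: 0.24/0.38 ~ 0.63
    odds_A = p_A / (1 - p_A)                               # O(A) = 0.3/0.7 ~ 0.43
    print(p_A_given_B, odds_A)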

Probability Ranking Principle (PRP)

For query q and document d, let R_{d,q} be a binary indicator of whether d is relevant to q: R = 1 if relevant, else R = 0.

Order documents according to the estimated probability of relevance with respect to the information need: p(R = 1|d, q).

(Bayes optimal decision rule) d is relevant iff p(R = 1|d, q) > p(R = 0|d, q).

Binary independence model (BIM)

Represent documents and queries as binary term-incidence vectors: x = (x_1, ..., x_M), q = (q_1, ..., q_M), where x_t = 1 if term t is present in document d, else x_t = 0.

Assume independence: no association between terms, a binary bag of words (not true of course, but works OK to first approximation).

Want to determine:

p(R = 1|x, q) = p(x|R = 1, q) p(R = 1|q) / p(x|q)
p(R = 0|x, q) = p(x|R = 0, q) p(R = 0|q) / p(x|q)

(On the r.h.s. are the probabilities that, if a document is retrieved, its representation is x, and the prior probabilities of retrieving a relevant or nonrelevant document. Also p(R = 1|x, q) + p(R = 0|x, q) = 1.)

Ranking function

Order by descending p(R = 1|d, q), using the BIM p(R = 1|x, q). Use the odds of relevance:

O(R|x, q) = p(R = 1|x, q) / p(R = 0|x, q)
          = [p(x|R = 1, q) p(R = 1|q) / p(x|q)] / [p(x|R = 0, q) p(R = 0|q) / p(x|q)]
          = [p(R = 1|q) / p(R = 0|q)] · [p(x|R = 1, q) / p(x|R = 0, q)]

How to estimate the last term?

Naive Bayes conditional independence assumption:

p(x|R = 1, q) / p(x|R = 0, q) = ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)

so

O(R|x, q) = O(R|q) ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)

Let the probabilities of a term appearing in relevant/nonrelevant documents w.r.t. the query be p_t = p(x_t = 1|R = 1, q) and u_t = p(x_t = 1|R = 0, q):

document              relevant (R = 1)   nonrelevant (R = 0)
term present x_t = 1  p_t                u_t
term absent  x_t = 0  1 − p_t            1 − u_t

One more approximation

Also assume that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: p_t = u_t if q_t = 0. Then only the query terms (q_t = 1) need be retained:

O(R|x, q) = O(R|q) ∏_{t=1}^{M} p(x_t|R = 1, q) / p(x_t|R = 0, q)
          = O(R|q) ∏_{t: x_t=1, q_t=1} (p_t / u_t) · ∏_{t: x_t=0, q_t=1} (1 − p_t) / (1 − u_t)

Estimate

df_t is the number of documents that contain term t.

term                relevant   nonrelevant            total
present x_t = 1     s          df_t − s               df_t
absent  x_t = 0     S − s      (N − df_t) − (S − s)   N − df_t
total               S          N − S                  N

Hence p_t = s/S and u_t = (df_t − s)/(N − S). In practice, if the relevant documents are a small percentage of the collection, then u_t ≈ df_t/N and

log[(1 − u_t)/u_t] = log[(N − df_t)/df_t] ≈ log(N/df_t),

reproducing the idf weight.
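
A sketch of the resulting term weight with the usual add-0.5 smoothing (the smoothing and toy numbers are my additions, not from the slides):

    import math

    def bim_term_weight(s, S, df_t, N):
        # c_t = log[ p_t (1 - u_t) / (u_t (1 - p_t)) ], with add-0.5 smoothing;
        # a document's retrieval score is the sum of c_t over query terms it contains.
        p_t = (s + 0.5) / (S + 1.0)
        u_t = (df_t - s + 0.5) / (N - S + 1.0)
        return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

    # With no relevance information (s = S = 0), this is roughly log(N/df_t), i.e. idf.
    N, df_t = 100000, 100
    print(bim_term_weight(0, 0, df_t, N), math.log(N / df_t))   # ~6.90 vs ~6.91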

Outline

1 Recap
2 Pseudo Relevance Feedback
3 Query expansion
4 Probabilistic Retrieval
5 Discussion

Discussion 4

Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

Some questions:
Explain the name "latent semantic analysis".
What problems is LSA attempting to solve? Does it succeed?
What criteria were used in selecting the SVD of the term-document matrix?
Explain the meaning of the matrices in the SVD C = U Σ V^T.
What does the rank reduction C ≈ C_k = U_k Σ_k V_k^T (keeping only the first k elements of Σ) have to do with "latent semantics"?
Fig. 1: What aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite having no words in common with the query?)

Fig. 4(a): LSI-100 does better at the right of this graph than at the left. What does this have to do with synonymy and polysemy?
Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results?
The results on CISI were not as strong; possible explanations?
Fig. 5: What data does the graph plot? What conclusions can you draw?
The article states that "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors." Explain the statement. Is it a serious problem?