INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 12: Latent Semantic Indexing and Relevance Feedback

Paul Ginsparg
Cornell University, Ithaca, NY
6 Oct 2009

Overview
1 Recap
2 Motivation for query expansion
3 Relevance feedback: Basics
4 Relevance feedback: Details


Term–term comparison
To compare two terms, take the dot product between two rows of C, which measures the extent to which they have a similar pattern of occurrence across the full set of documents.
The (i,j) entry of C C^T is equal to the dot product between rows i and j of C.
Since C C^T = U Σ V^T V Σ U^T = U Σ^2 U^T = (U Σ)(U Σ)^T, the (i,j) entry is also the dot product between rows i and j of U Σ.
Hence the rows of U Σ can be considered as coordinates for terms, whose dot products give comparisons between terms. (Σ just rescales the coordinates.)
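This identity can be checked numerically. A minimal numpy sketch; the small term–document matrix C below is invented purely for illustration:

```python
import numpy as np

# Small term-document matrix C (terms x documents), invented for illustration.
C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
T = U * s  # rows of U @ diag(s): term coordinates

# The (i,j) entry of C C^T equals the dot product of rows i and j of U.Sigma.
assert np.allclose(C @ C.T, T @ T.T)

# Comparing terms 0 and 1 in the new coordinates gives the same value
# as the dot product of the corresponding rows of C.
assert np.isclose(T[0] @ T[1], C[0] @ C[1])
```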

Document–document comparison
To compare two documents, take the dot product between two columns of C, which measures the extent to which the two documents have a similar profile of terms.
The (i,j) entry of C^T C is equal to the dot product between columns i and j of C.
Since C^T C = V Σ U^T U Σ V^T = V Σ^2 V^T = (V Σ)(V Σ)^T, the (i,j) entry is also the dot product between rows i and j of V Σ.
Hence the rows of V Σ can be considered as coordinates for documents, whose dot products give comparisons between documents. (Σ again just rescales the coordinates.)

Term–document comparison
To compare a term and a document, use directly the value of the (i,j) entry of C = U Σ V^T.
This is the dot product between the i-th row of U Σ^{1/2} and the j-th row of V Σ^{1/2}, so use U Σ^{1/2} and V Σ^{1/2} as coordinates.
Recall U Σ for term–term, and V Σ for document–document comparisons: we can't use a single set of coordinates to make both between term-and-document and within-term (or within-document) comparisons, but the difference is only a Σ^{1/2} stretch.
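The Σ^{1/2} split of coordinates can be checked with a minimal numpy sketch, again on a small invented term–document matrix:

```python
import numpy as np

# Small invented term-document matrix (terms x documents).
C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)

T = U * np.sqrt(s)      # term coordinates: rows of U.Sigma^{1/2}
D = Vt.T * np.sqrt(s)   # document coordinates: rows of V.Sigma^{1/2}

# C_ij is the dot product of term row i with document row j.
assert np.allclose(T @ D.T, C)
```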

Pseudo-document–document comparison
How do we represent pseudo-documents, and how do we compute comparisons? E.g., given a novel query, find its location in concept space, and find its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD).
A query q is a vector of terms, like the columns of C, hence can be considered a pseudo-document.
We want a representation for any term vector q that can be used in the document-comparison formulas (like a row of V, as above).
Constraint: for a real document, q = d^(j) (the j-th column of C), and before truncation (i.e., for C_k = C), it should give the corresponding row of V.
Use q^(s) = q U Σ^{-1} for comparing pseudo-documents to documents.

Pseudo-document–document comparison: q^(s) = q U Σ^{-1}
Consider the (j,i) component of C^T U Σ^{-1} = (V Σ U^T) U Σ^{-1} = V.
By inspection, the j-th row of the l.h.s. corresponds to the case q = d^(j):
(C^T U Σ^{-1})_{ji} = (d^(j) U Σ^{-1})_i,
and on the r.h.s. V_{ji} gives the j-th row of V, as desired for comparing documents.
So use q^(s) = q U Σ^{-1}, which sums the corresponding rows of U Σ, hence corresponds to placing the pseudo-document at the centroid of the corresponding term points (up to rescaling of the rows by Σ).
(Just as a row of V scaled by Σ^{1/2} or Σ can be used in semantic space for making term–document or document–document comparisons.)
Note: all of the above applies after any preprocessing used to construct C.
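The constraint can be verified numerically: for each real document column of an invented matrix C, q U Σ^{-1} reproduces the corresponding row of V. A sketch:

```python
import numpy as np

# Small invented term-document matrix (terms x documents), full rank.
C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T

def pseudo_doc(q):
    """Represent a term vector q as a pseudo-document: q^(s) = q U Sigma^{-1}."""
    return (q @ U) / s

# Sanity check: for a real document q = d^(j) (the j-th column of C),
# before truncation this recovers the j-th row of V.
for j in range(C.shape[1]):
    assert np.allclose(pseudo_doc(C[:, j]), V[j])
```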

Selection of singular values
Full SVD: C (t × d) = U (t × m) · Σ (m × m) · V^T (m × d).
Truncated: C_k (t × d) = U_k (t × k) · Σ_k (k × k) · V_k^T (k × d).
m is the original rank of C; k is the number of singular values chosen to represent the concepts in the set of documents. Usually k ≪ m.
Σ_k^{-1} is defined only on the k-dimensional subspace.
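A minimal sketch of truncating the SVD to the k largest singular values; the matrix and the choice k = 2 are invented for illustration:

```python
import numpy as np

def truncated_svd(C, k):
    """Keep the k largest singular values: C_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
Uk, sk, Vtk = truncated_svd(C, k=2)   # here t = 3, d = 4, m = 3, k = 2
Ck = Uk @ np.diag(sk) @ Vtk           # best rank-2 approximation of C
assert Ck.shape == C.shape
assert np.linalg.matrix_rank(Ck) == 2
```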

More on query–document comparison
Query = vector q in term space, with components q_i = 1 if term i is in the query, and 0 otherwise; any query terms not in the original term vector space are ignored.
In the VSM, the similarity between query q and the j-th document d^(j) is given by the cosine measure q·d^(j) / (‖q‖ ‖d^(j)‖).
Using the term–document matrix C_ij, this dot product is given by the j-th component of q·C: d^(j) = C e^(j), where e^(j) is the j-th basis vector (single 1 in the j-th position, 0 elsewhere). Hence

Similarity(q, d^(j)) = cos θ = q·d^(j) / (‖q‖ ‖d^(j)‖) = q·C e^(j) / (‖q‖ ‖C e^(j)‖).   (1)
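Equation (1) as a minimal numpy sketch; the query and term–document matrix are invented for illustration:

```python
import numpy as np

# Small invented term-document matrix and a two-term query.
C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
q = np.array([1., 0., 1.])   # query contains terms 0 and 2

def cosine_sim(q, d):
    """Eq. (1): cos(theta) = q.d / (||q|| ||d||)."""
    return (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))

scores = [cosine_sim(q, C[:, j]) for j in range(C.shape[1])]
ranking = np.argsort(scores)[::-1]   # best-matching documents first
assert ranking[0] == 2               # document 2 has the highest cosine here
```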

Now approximate C by C_k
In the LSI approximation, use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

q·d^(j) / (‖q‖ ‖d^(j)‖) = q·C e^(j) / (‖q‖ ‖C e^(j)‖) → q·C_k e^(j) / (‖q‖ ‖C_k e^(j)‖) = q·d̃^(j) / (‖q‖ ‖d̃^(j)‖),   (2)

where d̃^(j) = C_k e^(j) = U_k Σ_k V_k^T e^(j) is the LSI representation of the j-th document vector in the original term–document space.
Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, …, N documents, and returning the best matches.
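A sketch of LSI retrieval via equation (2), using numpy's SVD on an invented matrix:

```python
import numpy as np

C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation C_k

q = np.array([1., 0., 1.])

def lsi_scores(q, Ck):
    """Eq. (2): cosine between q and each LSI document vector (column of C_k)."""
    return (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))

best_first = np.argsort(lsi_scores(q, Ck))[::-1]
```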

Pseudo-document
To see that this agrees with the prescription given in the course text (and the original LSI article), recall:
The j-th column of V_k^T represents document j in concept space: d̂^(j) = V_k^T e^(j).
The query q is considered a pseudo-document in this space.
The LSI document vector in term space was given above as d̃^(j) = C_k e^(j) = U_k Σ_k V_k^T e^(j) = U_k Σ_k d̂^(j), so it follows that d̂^(j) = Σ_k^{-1} U_k^T d̃^(j).
The pseudo-document query vector q is translated into concept space using the same transformation: q̂ = Σ_k^{-1} U_k^T q.
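A sketch of the concept-space mapping q̂ = Σ_k^{-1} U_k^T q, checking that LSI document vectors map back to their columns of V_k^T (toy matrix invented for illustration):

```python
import numpy as np

C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def to_concept_space(x):
    """x_hat = Sigma_k^{-1} U_k^T x : map a term-space vector into concept space."""
    return (Uk.T @ x) / sk

# For an LSI document vector d~^(j) = C_k e^(j), this recovers
# the j-th column of V_k^T, i.e. document j in concept space.
Ck = Uk @ np.diag(sk) @ Vtk
for j in range(C.shape[1]):
    assert np.allclose(to_concept_space(Ck[:, j]), Vtk[:, j])

q_hat = to_concept_space(np.array([1., 0., 1.]))   # query as pseudo-document
```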

Compare documents in concept space
Recall the (i,j) entry of C^T C is the dot product between columns i and j of C (the term vectors for documents i and j).
In the truncated space, C_k^T C_k = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T) = V_k Σ_k U_k^T U_k Σ_k V_k^T = (V_k Σ_k)(V_k Σ_k)^T.
Thus the (i,j) entry is the dot product between columns i and j of (V_k Σ_k)^T = Σ_k V_k^T.
In concept space, the comparison between a pseudo-document q̂ and a document d̂^(j) is thus given by the cosine between Σ_k q̂ and Σ_k d̂^(j):

(Σ_k q̂)·(Σ_k d̂^(j)) / (‖Σ_k q̂‖ ‖Σ_k d̂^(j)‖) = (q^T U_k Σ_k^{-1} Σ_k)(Σ_k Σ_k^{-1} U_k^T d̃^(j)) / (‖U_k^T q‖ ‖U_k^T d̃^(j)‖) = q·d̃^(j) / (‖U_k^T q‖ ‖d̃^(j)‖),   (3)

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.
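The agreement between (3) and (2), up to a q-dependent factor, can be checked numerically on an invented example:

```python
import numpy as np

C = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 3., 1.]])
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Ck = Uk @ np.diag(sk) @ Vtk

q = np.array([1., 0., 1.])
q_hat = (Uk.T @ q) / sk              # query mapped into concept space

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Eq. (3): cosine in concept space, after rescaling by Sigma_k.
concept = np.array([cos(sk * q_hat, sk * Vtk[:, j]) for j in range(C.shape[1])])
# Eq. (2): cosine in term space against the columns of C_k.
term = np.array([cos(q, Ck[:, j]) for j in range(C.shape[1])])

# The two agree up to the q-dependent factor ||U_k^T q|| / ||q||,
# so document rankings are identical.
ratio = np.linalg.norm(Uk.T @ q) / np.linalg.norm(q)
assert np.allclose(concept * ratio, term)
```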



How can we improve recall in search?
Main topic today: two ways of improving recall: relevance feedback and query expansion.

Example
Query q: [aircraft]
Document d contains "plane", but doesn't contain "aircraft".
A simple IR system will not return d for q, even if d is the most relevant document for q!

Options for improving recall
Local: do a local, on-demand analysis for a user query. Main local method: relevance feedback.
Global: do a global analysis once (e.g., of the collection) to produce a thesaurus; use the thesaurus for query expansion.


Relevance feedback: Basic idea
The user issues a (short, simple) query.
The search engine returns a set of documents.
The user marks some docs as relevant, some as nonrelevant.
The search engine computes a new representation of the information need, which should be better than the initial query.
The search engine runs the new query and returns new results.
The new results have (hopefully) better recall.

Relevance feedback
We can iterate this: several rounds of relevance feedback.
We will use the term "ad hoc retrieval" to refer to regular retrieval without relevance feedback.
We will now look at three different examples of relevance feedback that highlight different aspects of the process.

Relevance Feedback: Example 1

Results for initial query

User feedback: Select what is relevant

Results after relevance feedback

Vector space example: query "canine" (1)
source: Fernando Díaz

Similarity of docs to query "canine"
source: Fernando Díaz

User feedback: Select relevant documents
source: Fernando Díaz

Results after relevance feedback
source: Fernando Díaz

Example 3: A real (non-image) example
Initial query: [new space satellite applications]
Results for initial query (r = rank):
+ 1 0.539 NASA Hasn't Scrapped Imaging Spectrometer
+ 2 0.533 NASA Scratches Environment Gear From Satellite Plan
  3 0.528 Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4 0.526 A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5 0.525 Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6 0.524 Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7 0.516 Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8 0.509 Telecommunications Tale of Two Companies
The user then marks relevant documents with +.

Expanded query after relevance feedback
 2.074 new           15.106 space        30.816 satellite
 5.660 application    5.991 nasa          5.196 eos
 4.196 launch         3.972 aster         3.516 instrument
 3.446 arianespace    3.004 bundespost    2.806 ss
 2.790 rocket         2.053 scientist     2.003 broadcast
 1.172 earth          0.836 oil           0.646 measure

Results for expanded query (r = rank)
* 1 0.513 NASA Scratches Environment Gear From Satellite Plan
* 2 0.500 NASA Hasn't Scrapped Imaging Spectrometer
  3 0.493 When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4 0.493 NASA Uses Warm Superconductors For Fast Circuit
* 5 0.492 Telecommunications Tale of Two Companies
  6 0.491 Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7 0.490 Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8 0.490 Rescue of Satellite By Space Agency To Cost $90 Million


Key concept for relevance feedback: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional space.
Thus we can compute centroids of documents.
Definition: μ(D) = (1/|D|) Σ_{d ∈ D} v(d), where D is a set of documents and v(d) = d is the vector we use to represent document d.
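The definition as a one-line numpy sketch; the document vectors below are invented for illustration:

```python
import numpy as np

def centroid(docs):
    """mu(D) = (1/|D|) sum_{d in D} v(d): center of mass of document vectors."""
    return np.mean(docs, axis=0)

# Three invented 3-dimensional document vectors.
D = np.array([[1., 0., 2.],
              [3., 1., 0.],
              [2., 2., 1.]])
assert np.allclose(centroid(D), [2., 1., 1.])
```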

Centroid: Examples
(figure: example point sets and their centroids)

Rocchio algorithm
The Rocchio algorithm implements relevance feedback in the vector space model.
Rocchio chooses the query q_opt that maximizes
q_opt = arg max_q [sim(q, μ(D_r)) − sim(q, μ(D_nr))],
which is closely related to maximizing the separation between relevant and nonrelevant docs.
Making some additional assumptions, we can rewrite q_opt as:
q_opt = μ(D_r) + [μ(D_r) − μ(D_nr)]
D_r: set of relevant docs; D_nr: set of nonrelevant docs.

Rocchio algorithm
The optimal query vector is:
q_opt = μ(D_r) + [μ(D_r) − μ(D_nr)]
      = (1/|D_r|) Σ_{d_j ∈ D_r} d_j + [ (1/|D_r|) Σ_{d_j ∈ D_r} d_j − (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j ]
That is, we move the centroid of the relevant documents by the difference between the two centroids.
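A minimal sketch of the optimal query vector; the toy judged documents are invented for illustration:

```python
import numpy as np

def rocchio_opt(D_r, D_nr):
    """q_opt = mu(D_r) + [mu(D_r) - mu(D_nr)]."""
    mu_r = np.mean(D_r, axis=0)
    mu_nr = np.mean(D_nr, axis=0)
    return mu_r + (mu_r - mu_nr)

# Invented toy judgments in a 2-d vector space.
D_r = np.array([[1., 1.], [3., 1.]])   # relevant documents
D_nr = np.array([[0., 4.]])            # nonrelevant document
q_opt = rocchio_opt(D_r, D_nr)
assert np.allclose(q_opt, [4., -2.])
```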

Exercise: Compute the Rocchio vector
(figure: circles = relevant documents, X's = nonrelevant documents)

Rocchio illustrated
(figure: relevant and nonrelevant documents, the centroids μ_R and μ_NR, the difference vector μ_R − μ_NR, and the resulting q_opt)
μ_R: centroid of relevant documents
μ_NR: centroid of nonrelevant documents
μ_R − μ_NR: difference vector
Add the difference vector to μ_R to get q_opt.
q_opt separates relevant/nonrelevant perfectly.

Rocchio 1971 algorithm (SMART)
Used in practice:
q_m = α q_0 + β μ(D_r) − γ μ(D_nr)
    = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
q_m: modified query vector; q_0: original query vector; D_r and D_nr: sets of known relevant and nonrelevant documents, respectively; α, β, and γ: weights attached to each term.
The new query moves towards the relevant documents and away from the nonrelevant documents.
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ.
Set negative term weights to 0: a negative weight for a term doesn't make sense in the vector space model.
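A minimal sketch of the SMART formula, including the clipping of negative weights to 0; the weights α, β, γ are set to common illustrative values and the data is invented:

```python
import numpy as np

def rocchio(q0, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio 1971 (SMART): q_m = alpha q_0 + beta mu(D_r) - gamma mu(D_nr),
    with negative term weights set to 0."""
    q_m = alpha * q0
    if len(D_r) > 0:
        q_m = q_m + beta * np.mean(D_r, axis=0)
    if len(D_nr) > 0:
        q_m = q_m - gamma * np.mean(D_nr, axis=0)
    return np.maximum(q_m, 0.0)  # negative weights make no sense in the VSM

# Invented toy example with 3 terms.
q0 = np.array([1., 0., 1.])
D_r = np.array([[2., 2., 0.], [0., 2., 0.]])
D_nr = np.array([[0., 0., 4.]])
q_m = rocchio(q0, D_r, D_nr)
assert np.allclose(q_m, [1.75, 1.5, 0.0])
```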

Positive vs. negative relevance feedback
Positive feedback is more valuable than negative feedback.
For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback.
Many systems only allow positive feedback.