INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 9: Collaborative Filtering, SVD, and Linear Algebra Review Paul Ginsparg Cornell University, Ithaca, NY 22 Sep 2011 1/ 50

Administrativa No office hour tomorrow, Fri 23 Sep 2011. Ass't 2 to be posted 24 Sep, due Sat 8 Oct, 1pm (late submission permitted until Sun 9 Oct at 11 p.m.). No class Tue 11 Oct (midterm break). The Midterm Examination is on Thu Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. Topics examined include assignments, lectures, and discussion class readings before the midterm break. 2/ 50

Overview 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 3/ 50

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 4/ 50

Discussion 2 K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 11-21, 1972. Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. 5/ 50

Exhaustivity and specificity What are the semantic and statistical interpretations of specificity? Semantic: tea, coffee, cocoa (more specific, smaller # of docs) vs. beverage (less specific, larger # of docs). Statistical: specificity is a function of term usage; a frequently used term behaves as non-specific (even if it has a specific meaning). Exhaustivity of a document description is determined by the number of controlled vocabulary terms assigned. Reject frequently occurring terms? Via conjunction (but according to item C of table I, the average number of matched terms is smaller than the request, so this would reduce recall), or remove them entirely (again hurts recall, since they are needed for many relevant documents). What is graphed in figure 1 and what does it illustrate? (Why aren't the axes labelled?) 6/ 50

idf weight Spärck Jones defines f(n) = m such that 2^{m-1} < n <= 2^m (in other words f(n) = ⌈log_2(n)⌉, where ⌈x⌉ denotes the smallest integer not less than x, equivalent to one plus the greatest integer less than x) and suggests weight = f(N) − f(n) + 1. E.g., for N = 200 documents, f(N) = 8 (2^8 = 256): n = 90 gives f(n) = 7 (2^7 = 128), hence weight = 8 − 7 + 1 = 2; n = 3 gives f(n) = 2 (2^2 = 4), hence weight = 8 − 2 + 1 = 7; the overall weight for the query is then 2 + 7 = 9. The +1 is so that terms occurring in more than roughly half the documents in the corpus are not given zero weight (for N = 200, anything in more than 128 documents). 7/ 50
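
A minimal Python sketch (not from the slides) of this step-function weight, using the N = 200 example above; the function names are illustrative only.

    import math

    def f(n):
        # Sparck Jones's f(n) = m with 2^(m-1) < n <= 2^m, i.e. ceil(log2(n))
        return math.ceil(math.log2(n))

    def sj_weight(N, n):
        # weight = f(N) - f(n) + 1
        return f(N) - f(n) + 1

    N = 200                                    # documents in the collection
    print(f(N))                                # 8, since 2^8 = 256
    print(sj_weight(N, 90))                    # 2  (f(90) = 7)
    print(sj_weight(N, 3))                     # 7  (f(3) = 2)
    print(sj_weight(N, 90) + sj_weight(N, 3))  # 9, overall query weight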

idf weight, modified Robertson: the Spärck Jones weight f(N) − f(n) + 1 ≈ log_2(N/n) + 1. Note that n/N is the probability that an item chosen (at random) will contain the term. Suppose an item contains terms a, b, c in common with the query, with probabilities p_a, p_b, p_c. Then the weight assigned to the document is log(1/p_a) + log(1/p_b) + log(1/p_c) = log(1/(p_a p_b p_c)), where p_a p_b p_c is the probability that a document will randomly contain all three terms a, b, c (under what assumption?). This quantifies the statement: the less likely a given combination of terms is to occur, the more likely the document is relevant to the query (a theoretical justification for logarithmic idf weights). 8/ 50
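
A short sketch comparing the step weight above with Robertson's continuous form log_2(N/n) + 1; the document frequencies used here are illustrative only.

    import math

    def robertson_weight(N, n):
        # continuous form log2(N/n) + 1
        return math.log2(N / n) + 1

    N = 200
    for n in (3, 90, 150):
        print(n, round(robertson_weight(N, n), 2))
    # 3   -> 7.06  (step weight above was 7)
    # 90  -> 2.15  (step weight above was 2)
    # 150 -> 1.42  (terms in more than half the collection still get weight > 1)
    # Summing per-term weights corresponds to log(1/(p_a p_b p_c))
    # if term occurrences are assumed independent.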

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 9/ 50

Precision and recall Precision (P) is the fraction of retrieved documents that are relevant: Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved). Recall (R) is the fraction of relevant documents that are retrieved: Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant). 10/ 50
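
A minimal sketch (not from the slides) computing both measures from sets of document ids; the ids are made up for illustration.

    retrieved = {1, 2, 3, 4, 5}      # ids returned by the system (hypothetical)
    relevant = {2, 4, 6, 8}          # ids judged relevant (hypothetical)

    tp = len(retrieved & relevant)   # relevant items retrieved
    precision = tp / len(retrieved)  # 2/5 = 0.4
    recall = tp / len(relevant)      # 2/4 = 0.5
    print(precision, recall)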

A precision-recall curve [Figure: precision (y-axis) vs. recall (x-axis), both from 0.0 to 1.0.] Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, ...). Interpolation (in red): take the maximum of all future points. Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better. Questions? 11/ 50

Averaged 11-point precision/recall graph [Figure: averaged interpolated precision (y-axis) vs. recall (x-axis).] Compute interpolated precision at recall levels 0.0, 0.1, 0.2, ...; do this for each of the queries in the evaluation benchmark, and average over queries. This measure captures performance at all recall levels. The curve is typical of performance levels at TREC. Note that performance is not very good! 12/ 50

Precision/recall tradeoff You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall! The converse is also true (usually): It's easy to get high precision for very low recall. Suppose the document with the largest score is relevant. How can we maximize precision? 13/ 50

A combined measure: balanced F Frequently used: balanced F, the harmonic mean of P and R: 1/F = (1/2)(1/P + 1/R), or F = 2PR/(P + R). Extremes: if P ≪ R, then F ≈ 2P; if R ≪ P, then F ≈ 2R. So F is automatically sensitive to the one that is much smaller. If P ≈ R, then F ≈ P ≈ R. 14/ 50

F more generally F allows us to trade off precision against recall: 1/F = α(1/P) + (1 − α)(1/R), equivalently F = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α, α ∈ [0,1] and thus β² ∈ [0,∞]. Most frequently used: balanced F with β = 1 (α = 0.5). This is the harmonic mean of P and R: 1/F = (1/2)(1/P + 1/R). What value range of β weights recall higher than precision? 15/ 50
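
A small sketch of the general F_β formula above; the P and R values are made up for illustration.

    def f_beta(precision, recall, beta=1.0):
        # F = (beta^2 + 1) P R / (beta^2 P + R)
        b2 = beta * beta
        return (b2 + 1) * precision * recall / (b2 * precision + recall)

    P, R = 1 / 3, 1 / 4
    print(f_beta(P, R))           # 2/7 ~ 0.286, the balanced F1 of the next slide
    print(f_beta(P, R, beta=2))   # beta > 1 weights recall more heavily than precision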

F: Example
                  relevant   not relevant       total
  retrieved           20             40            60
  not retrieved       60      1,000,000     1,000,060
  total               80      1,000,040     1,000,120
P = 20/(20 + 40) = 1/3, R = 20/(20 + 60) = 1/4, F_1 = 2/(1/P + 1/R) = 2/(3 + 4) = 2/7. 16/ 50

Accuracy Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct. In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN). Why is accuracy not a useful measure for web information retrieval? 17/ 50
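
A quick sketch, using the contingency table from the F example above, of how a very high accuracy can coexist with poor precision and recall (illustrative numbers only).

    TP, FP, FN, TN = 20, 40, 60, 1_000_000   # contingency table from the F example

    accuracy = (TP + TN) / (TP + FP + FN + TN)            # ~0.9999, dominated by TN
    precision = TP / (TP + FP)                            # 1/3
    recall = TP / (TP + FN)                               # 1/4
    f1 = 2 * precision * recall / (precision + recall)    # 2/7
    print(accuracy, precision, recall, f1)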

Exercise Compute precision, recall and F_1 for this result set:
                  relevant   not relevant
  retrieved           18              2
  not retrieved       82    1,000,000,000
The snoogle search engine below always returns 0 results ("0 matching results found"), regardless of the query. 18/ 50

Why accuracy is not a useful measure in IR Simple trick to maximize accuracy in IR: always say no and return nothing. You then get 99.99% accuracy on most queries. Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk. It's better to return some bad hits as long as you return something. We use precision, recall, and F for evaluation, not accuracy. 19/ 50

F: Why harmonic mean? Why don't we use a different mean of P and R as a measure, e.g., the arithmetic mean? The simple (arithmetic) mean is 50% for a return-everything search engine, which is too high. Desideratum: punish really bad performance on either precision or recall. Taking the minimum achieves this, but the minimum is not smooth and is hard to weight. F (harmonic mean) is a kind of smooth minimum. 20/ 50

Difficulties in using precision, recall and F We need relevance judgments for information-need-document pairs but they are expensive to produce. There are some alternatives to using precision/recall and having to produce relevance judgments 21/ 50

Variance of measures like precision/recall For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.8). Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query. That is, there are easy information needs and hard ones. 22/ 50

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 23/ 50

What we need for a benchmark A collection of documents Documents must be representative of the documents we expect to see in reality. A collection of information needs... incorrectly but necessarily called queries Information needs must be representative of the information needs we expect to see in reality. Human relevance assessments We need to hire/pay judges or assessors to do this. Expensive, time-consuming Judges must be representative of the users we expect to see in reality. 24/ 50

Standard relevance benchmark: Cranfield Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness Late 1950s, UK 1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document-pairs Too small, too untypical for serious IR evaluation today 25/ 50

Standard relevance benchmark: TREC TREC = Text Retrieval Conference (TREC) Organized by the U.S. National Institute of Standards and Technology (NIST) TREC is actually a set of several different relevance benchmarks. Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999. 1.89 million documents, mainly newswire articles, 450 information needs. No exhaustive relevance judgments (too expensive). Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed. 26/ 50

Standard relevance benchmarks: Others GOV2 Another TREC/NIST collection 25 million web pages Used to be largest collection that is easily available But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index NTCIR East Asian language and cross-language information retrieval Cross Language Evaluation Forum (CLEF) This evaluation series has concentrated on European languages and cross-language information retrieval. Many others 27/ 50

Validity of relevance assessments Relevance assessments are only usable if they are consistent. If they are not consistent, then there is no truth and experiments are not repeatable. How can we measure this consistency or agreement among judges? Kappa measure 28/ 50

Kappa measure Kappa is a measure of how much judges agree or disagree. Designed for categorical judgments. Corrects for chance agreement. P(A) = proportion of the time the judges agree; P(E) = the agreement we would get by chance. κ = (P(A) − P(E)) / (1 − P(E)). κ = ? for (i) chance agreement (ii) total agreement 29/ 50

Kappa measure (2) Values of κ in the interval [2/3,1.0] are seen as acceptable. With smaller values: need to redesign relevance assessment methodology used etc. 30/ 50

Calculating the kappa statistic
                        Judge 2 relevance
                        Yes     No    Total
  Judge 1     Yes       300     20      320
  relevance   No         10     70       80
              Total     310     90      400
Observed proportion of the times the judges agreed: P(A) = (300 + 70)/400 = 370/400 = 0.925. Pooled marginals: P(non-relevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125, P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875. Probability that the two judges agreed by chance: P(E) = P(non-relevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665. Kappa statistic: κ = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776 (still in the acceptable range). 31/ 50
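
A short sketch reproducing the calculation above from the raw 2x2 table (a generic computation, not tied to any particular library).

    table = [[300, 20],   # Judge 1 Yes: Judge 2 Yes / No
             [10, 70]]    # Judge 1 No:  Judge 2 Yes / No
    n = sum(sum(row) for row in table)                             # 400 judged pairs

    p_agree = (table[0][0] + table[1][1]) / n                      # P(A) = 0.925
    p_yes = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * n)  # pooled P(relevant) = 0.7875
    p_no = 1 - p_yes                                               # pooled P(non-relevant)
    p_chance = p_yes ** 2 + p_no ** 2                              # P(E) ~ 0.665
    kappa = (p_agree - p_chance) / (1 - p_chance)
    print(round(kappa, 3))                                         # 0.776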

Interjudge agreement at TREC
  information need   docs judged   number of disagreements
         51              211                  6
         62              400                157
         67              400                 68
         95              400                110
        127              400                106
fair range (0.67–0.8) 32/ 50

Impact of interjudge disagreement Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless? No. Large impact on absolute performance numbers, but virtually no impact on the ranking of systems. Suppose we want to know if algorithm A is better than algorithm B: an information retrieval experiment will give us a reliable answer to this question, even if there is a lot of disagreement between judges. 33/ 50

Evaluation at large search engines Recall is difficult to measure on the web. Search engines often use precision at top k, e.g., k = 10, or use measures that reward getting rank 1 right more than getting rank 10 right. Search engines also use non-relevance-based measures. Example 1: clickthrough on the first result. Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant), but pretty reliable in the aggregate. Example 2: ongoing studies of user behavior in the lab (recall last lecture). Example 3: A/B testing. 34/ 50

A/B testing Purpose: Test a single innovation Prerequisite: You have a large search engine up and running. Have most users use old system Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation Evaluate with an automatic measure like clickthrough on first result Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most 35/ 50

Critique of pure relevance We've defined relevance for an isolated query-document pair. Alternative definition: marginal relevance. The marginal relevance of a document at position k in the result list is the additional information it contributes over and above the information that was contained in documents d_1, ..., d_{k−1}. Exercise Why is marginal relevance a more realistic measure of user happiness? Give an example where a non-marginal measure like precision or recall is a misleading measure of user happiness, but marginal relevance is a good measure. In a practical application, what is the difficulty of using marginal measures instead of non-marginal measures? 36/ 50

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 37/ 50

Definitions C = M × N matrix with real-valued entries. The rank of C is the number of linearly independent rows (or columns), rank(C) ≤ min{M, N}. A square r × r matrix with off-diagonal entries equal to zero is called a diagonal matrix (rank = # of non-zero diagonal entries). If all diagonal entries are equal to one, it is called the identity matrix, I_r = diag(1, 1, ..., 1). 38/ 50

Eigenvalues and eigenvectors For a square M × M matrix C and (non-vanishing) vector x, the values λ satisfying C x = λ x are called the eigenvalues of C; x is the (right) eigenvector. The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector. There are also left eigenvectors: M-vectors y with y^T C = λ y^T. 39/ 50

Characteristic equation The eigenvalues are determined by the characteristic equation. Write the above as (C − λI_M) x = 0, and take the determinant: |C − λI_M| = 0. This gives an M-th order polynomial equation in λ, with at most M roots, which are the eigenvalues of C (in general complex, even if all entries of C are real). 40/ 50
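
A minimal NumPy sketch (not part of the slides) computing eigenvalues and eigenvectors numerically rather than via the characteristic polynomial; the matrix is an arbitrary small example.

    import numpy as np

    C = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigvals, eigvecs = np.linalg.eig(C)   # columns of eigvecs are right eigenvectors
    print(eigvals)                        # eigenvalues 3 and 1 (order not guaranteed)
    x = eigvecs[:, 0]                     # check C x = lambda x
    print(np.allclose(C @ x, eigvals[0] * x))   # True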

Example S = diag(30, 20, 1) has rank 3, eigenvalues λ_1 = 30, λ_2 = 20, λ_3 = 1, with eigenvectors x_1 = (1, 0, 0)^T, x_2 = (0, 1, 0)^T, x_3 = (0, 0, 1)^T. Consider an arbitrary v = (2, 4, 6)^T = 2 x_1 + 4 x_2 + 6 x_3. Then S v = S(2 x_1 + 4 x_2 + 6 x_3) = 2 S x_1 + 4 S x_2 + 6 S x_3 = 2 λ_1 x_1 + 4 λ_2 x_2 + 6 λ_3 x_3 = 60 x_1 + 80 x_2 + 6 x_3. 41/ 50

Example, cont'd Multiplication by S is determined by its eigenvalues and eigenvectors, and S v is relatively unaffected by terms from small eigenvalues. Ignore λ_3 = 1: then S v ≈ (60, 80, 0)^T instead of (60, 80, 6)^T. This is the idea behind low-rank approximations. 42/ 50
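
A tiny NumPy sketch of this truncation (illustrative only): zero out the smallest eigenvalue of S from the example above and compare the results.

    import numpy as np

    S = np.diag([30.0, 20.0, 1.0])
    v = np.array([2.0, 4.0, 6.0])

    S_trunc = np.diag([30.0, 20.0, 0.0])   # drop the smallest eigenvalue
    print(S @ v)         # [60. 80.  6.]
    print(S_trunc @ v)   # [60. 80.  0.] -- close to S v, but rank reduced to 2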

Symmetric matrices For symmetric S, eigenvectors corresponding to distinct eigenvalues are orthogonal. If S is real and symmetric, then the eigenvalues are real. Example: S = the 2 × 2 matrix with rows (2, 1) and (1, 2); 0 = |S − λI| = (2 − λ)² − 1 gives λ = 3, 1. The eigenvectors (1, 1)^T and (1, −1)^T are orthogonal. 43/ 50

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 44/ 50

Matrix diagonalization theorem Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a decomposition S = U Λ U^{-1}, where the columns of U are the eigenvectors of S and Λ = diag(λ_1, λ_2, ..., λ_M) with λ_i ≥ λ_{i+1}. If the eigenvalues are distinct, then the decomposition is unique. 45/ 50

Matrix diagonalization theorem, cont'd Write U = (u_1 u_2 ... u_M). Then SU = S(u_1 u_2 ... u_M) = (λ_1 u_1 λ_2 u_2 ... λ_M u_M) = (u_1 u_2 ... u_M) diag(λ_1, λ_2, ..., λ_M). Thus SU = UΛ, or S = U Λ U^{-1}. 46/ 50
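
A quick NumPy check of the decomposition S = U Λ U^{-1} (a sketch, using an arbitrary small matrix with distinct eigenvalues).

    import numpy as np

    S = np.array([[3.0, 1.0],
                  [0.0, 2.0]])    # distinct eigenvalues, so diagonalizable

    eigvals, U = np.linalg.eig(S)             # columns of U are eigenvectors of S
    Lam = np.diag(eigvals)
    print(np.allclose(U @ Lam @ np.linalg.inv(U), S))   # True: S = U Lam U^{-1}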

Symmetric diagonalization theorem Let S be a square, symmetric, real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition S = Q Λ Q^{-1}, where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. All entries of Q are real and Q^{-1} = Q^T. We will use this to build low-rank approximations to term-document matrices, using CC^T. 47/ 50
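
A short NumPy sketch of the symmetric case (illustrative matrix only), using np.linalg.eigh, which is designed for symmetric matrices and returns orthonormal eigenvectors.

    import numpy as np

    S = np.array([[2.0, 1.0],
                  [1.0, 2.0]])    # real and symmetric

    eigvals, Q = np.linalg.eigh(S)            # eigh is for symmetric/Hermitian matrices
    Lam = np.diag(eigvals)
    print(np.allclose(Q @ Lam @ Q.T, S))      # True: S = Q Lam Q^T
    print(np.allclose(Q.T @ Q, np.eye(2)))    # True: Q^{-1} = Q^T (Q is orthogonal)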

Outline 1 Recap Discussion 2 2 More on Unranked evaluation 3 Evaluation benchmarks 4 Matrix basics 5 Matrix Decompositions 6 Singular value decomposition 48/ 50

SVD Let C be an M × N matrix of rank r, and C^T its N × M transpose. CC^T and C^T C have the same r non-zero eigenvalues λ_1, ..., λ_r. Let U = the M × M matrix whose columns are the orthogonal eigenvectors of CC^T, and V = the N × N matrix whose columns are the orthogonal eigenvectors of C^T C. Then there is a singular value decomposition (SVD) C = U Σ V^T, where the M × N matrix Σ has Σ_ii = σ_i for 1 ≤ i ≤ r, and zero entries otherwise. The σ_i are called the singular values of C. 49/ 50
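
A minimal NumPy sketch (not from the slides) of the SVD of a small matrix; the matrix is arbitrary toy data.

    import numpy as np

    C = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])   # M = 2 x N = 3, rank 2

    U, sigma, Vt = np.linalg.svd(C)   # C = U Sigma V^T; sigma holds the singular values
    print(sigma)                      # [1.732..., 1.0]: CC^T has eigenvalues 3 and 1
    Sigma = np.zeros_like(C)
    Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)
    print(np.allclose(U @ Sigma @ Vt, C))   # True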

Compare to S = QΛQ^T From C = U Σ V^T we get CC^T = U Σ V^T V Σ U^T = U Σ² U^T (and C^T C = V Σ U^T U Σ V^T = V Σ² V^T). The l.h.s. is square, symmetric, and real-valued, and the r.h.s. is a symmetric diagonal decomposition. CC^T (C^T C) is a square matrix with rows and columns corresponding to each of the M terms (N documents); its i, j entry measures the overlap between the i-th and j-th terms (documents), based on document (term) co-occurrence. This depends on the term weighting: in the simplest (1, 0) case, the i, j entry counts the number of documents in which both terms i and j occur (the number of terms which occur in both documents i and j). 50/ 50
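
A small end-to-end sketch (with made-up toy data): build a tiny (1, 0) term-document matrix, inspect the term-term overlap counts in CC^T, and form a rank-k approximation by keeping the k largest singular values, in the spirit of the low-rank approximations mentioned above.

    import numpy as np

    # rows = terms, columns = documents; entry 1 if the term occurs in the document
    C = np.array([[1, 1, 0, 0],    # "ship"   (hypothetical toy data)
                  [1, 0, 1, 0],    # "boat"
                  [0, 1, 0, 1]],   # "ocean"
                 dtype=float)

    print(C @ C.T)   # term-term matrix: entry (i, j) = # docs containing both terms i and j

    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    k = 2                                            # keep the 2 largest singular values
    C_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]  # rank-k approximation of C
    print(np.round(C_k, 2))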