INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 9: Collaborative Filtering, SVD, and Linear Algebra Review
Paul Ginsparg
Cornell University, Ithaca, NY
22 Sep 2011
1/ 50
Administrativa
No office hour tomorrow, Fri 23 Sep 2011
Assignment 2 to be posted 24 Sep, due Sat 8 Oct, 1pm (late submission permitted until Sun 9 Oct at 11 p.m.)
No class Tue 11 Oct (midterm break)
The Midterm Examination is on Thu Oct 13 from 11:40 to 12:55, in Kimball B11. It will be open book. Topics examined include assignments, lectures, and discussion class readings before the midterm break.
Overview
1 Recap Discussion 2
2 More on Unranked evaluation
3 Evaluation benchmarks
4 Matrix basics
5 Matrix Decompositions
6 Singular value decomposition
Discussion 2
K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation 28, 11-21, 1972.
Letter by Stephen Robertson and reply by Karen Spärck Jones, Journal of Documentation 28, 164-165, 1972.
Exhaustivity and specificity
What are the semantic and statistical interpretations of specificity?
Semantic: tea, coffee, cocoa (more specific, smaller # docs); beverage (less specific, larger # docs)
Statistical: specificity is a function of term usage; a frequently used term is non-specific (even if it has a specific meaning).
Exhaustivity of a document description is determined by the number of controlled vocabulary terms assigned.
Reject frequently occurring terms?
via conjunction (but according to item C of table I, the average number of matched terms is smaller than the request, so this would reduce recall)
remove them entirely (again hurts recall; they are needed for many relevant documents)
What is graphed in figure 1 and what does it illustrate? (Why aren't the axes labelled?)
idf weight
Spärck Jones defines f(n) = m such that 2^(m-1) < n <= 2^m
(in other words f(n) = ceil(log2(n)), where ceil(x) denotes the smallest integer not less than x, equivalent to one plus the greatest integer less than x)
and suggests weight = f(N) - f(n) + 1
e.g., for N = 200 documents, f(N) = 8 (2^8 = 256)
n = 90: f(n) = 7 (2^7 = 128), hence weight = 8 - 7 + 1 = 2
n = 3: f(n) = 2 (2^2 = 4), hence weight = 8 - 2 + 1 = 7
The overall weight for the query is then 2 + 7 = 9.
The +1 ensures that terms occurring in more than roughly half the documents in the corpus are not given zero weight (for N = 200, anything in more than 128 documents)
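A minimal sketch of this weighting in Python (function names are mine, not from the paper):

```python
import math

def f(n):
    """f(n) = m such that 2^(m-1) < n <= 2^m, i.e. the ceiling of log2(n)."""
    return math.ceil(math.log2(n))

def sparck_jones_weight(N, n):
    """Spärck Jones weight for a term occurring in n of N documents."""
    return f(N) - f(n) + 1

# The examples from the slide, with N = 200 documents (f(200) = 8):
print(sparck_jones_weight(200, 90))  # 8 - 7 + 1 = 2
print(sparck_jones_weight(200, 3))   # 8 - 2 + 1 = 7
```

Summing the two term weights gives the overall query weight of 9, as on the slide.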
idf weight, modified
Robertson: the Spärck Jones weight f(N) - f(n) + 1 is approximately log2(N/n) + 1
Note that n/N is the probability that an item chosen (at random) will contain the term.
Suppose an item contains a, b, c in common with the query, and the probabilities are p_a, p_b, p_c. Then the weight assigned to the document is
log(1/p_a) + log(1/p_b) + log(1/p_c) = log(1/(p_a p_b p_c))
(the probability that the doc will randomly contain all three terms a, b, c; under what assumption?)
This quantifies the statement: the less likely that a given combination of terms occurs, the more likely the document is relevant to the query (theoretical justification for logarithmic idf weights)
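Robertson's approximation can be checked numerically; a quick sketch (function names are mine):

```python
import math

def sj_weight(N, n):
    """Spärck Jones weight f(N) - f(n) + 1, with f = ceil(log2)."""
    return math.ceil(math.log2(N)) - math.ceil(math.log2(n)) + 1

def robertson_weight(N, n):
    """Robertson's continuous form: log2(N/n) + 1."""
    return math.log2(N / n) + 1

for n in (3, 90, 128):
    print(n, sj_weight(200, n), round(robertson_weight(200, n), 2))
# The two agree to within about one unit; the continuous form removes
# the dependence on where n falls between powers of two.
```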
Precision and recall
Precision (P) is the fraction of retrieved documents that are relevant:
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
Recall (R) is the fraction of relevant documents that are retrieved:
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)
A precision-recall curve
[Figure: precision (y-axis, 0.0-1.0) vs. recall (x-axis, 0.0-1.0)]
Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, ...).
Interpolation (in red): take the maximum of all future points.
Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.
Questions?
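The interpolation rule ("take the maximum of all future points") can be sketched as follows, for a hypothetical ranked list in which the 1st, 3rd, and 6th hits are relevant (names and data are mine):

```python
def precision_recall_points(relevant_flags, total_relevant):
    """(recall, precision) after each of the top k hits."""
    points, hits = [], 0
    for k, rel in enumerate(relevant_flags, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))
    return points

def interpolate(points):
    """Interpolated precision at recall r = max precision at any recall >= r."""
    interp, best = [], 0.0
    for recall, prec in reversed(points):
        best = max(best, prec)
        interp.append((recall, best))
    interp.reverse()
    return interp

pts = precision_recall_points([1, 0, 1, 0, 0, 1], total_relevant=3)
ip = interpolate(pts)
print(ip)
```

Note how the sawtooth dips in raw precision are smoothed away by carrying the best future precision backward.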
Averaged 11-point precision/recall graph
[Figure: averaged interpolated precision (y-axis) vs. recall (x-axis)]
Compute interpolated precision at recall levels 0.0, 0.1, 0.2, ...
Do this for each of the queries in the evaluation benchmark
Average over queries
This measure assesses performance at all recall levels.
The curve is typical of performance levels at TREC.
Note that performance is not very good!
Precision/recall tradeoff
You can increase recall by returning more docs.
Recall is a non-decreasing function of the number of docs retrieved.
A system that returns all docs has 100% recall!
The converse is also true (usually): it's easy to get high precision for very low recall.
Suppose the document with the largest score is relevant. How can we maximize precision?
A combined measure: balanced F
Frequently used: balanced F, the harmonic mean of P and R:
1/F = (1/2)(1/P + 1/R), or F = 2PR/(P + R)
Extremes: If P << R, then F ≈ 2P. If R << P, then F ≈ 2R.
So F is automatically sensitive to the one that is much smaller.
If P ≈ R, then F ≈ P ≈ R.
F more generally
F allows us to trade off precision against recall:
1/F = α(1/P) + (1 - α)(1/R), or equivalently F = (β² + 1)PR / (β²P + R), where β² = (1 - α)/α
α ∈ [0,1] and thus β² ∈ [0,∞]
Most frequently used: balanced F with β = 1 or α = 0.5
This is the harmonic mean of P and R: 1/F = (1/2)(1/P + 1/R)
What value range of β weights recall higher than precision?
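A short sketch of the general F measure above; it also illustrates the answer to the question, since β > 1 weights recall higher than precision (function name is mine):

```python
def f_measure(P, R, beta=1.0):
    """General F: (beta^2 + 1) P R / (beta^2 P + R)."""
    b2 = beta * beta
    return (b2 + 1) * P * R / (b2 * P + R)

# A system with low precision but high recall:
print(f_measure(0.2, 0.8, beta=1))  # balanced F
print(f_measure(0.2, 0.8, beta=2))  # beta > 1 rewards the high recall more
```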
F: Example

               relevant   not relevant
retrieved         20            40          60
not retrieved     60     1,000,000   1,000,060
                  80     1,000,040   1,000,120

P = 20/(20 + 40) = 1/3
R = 20/(20 + 60) = 1/4
F_1 = 2 / (1/P + 1/R) = 2 / (3 + 4) = 2/7
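The worked example, reproduced as a short computation (helper name is mine):

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall, and F1 from contingency-table counts."""
    P = tp / (tp + fp)
    R = tp / (tp + fn)
    F1 = 2 * P * R / (P + R)
    return P, R, F1

# tp = 20 retrieved relevant, fp = 40 retrieved nonrelevant,
# fn = 60 relevant but not retrieved (the table above)
P, R, F1 = prf_from_counts(tp=20, fp=40, fn=60)
print(P, R, F1)  # 1/3, 1/4, 2/7
```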
Accuracy
Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy?
Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct.
In terms of the contingency table above, accuracy = (TP + TN)/(TP + FP + FN + TN).
Why is accuracy not a useful measure for web information retrieval?
Exercise
Compute precision, recall and F_1 for this result set:

               relevant    not relevant
retrieved         18              2
not retrieved     82     1,000,000,000

The snoogle search engine always returns 0 results ("0 matching results found"), regardless of the query.
Why accuracy is not a useful measure in IR
Simple trick to maximize accuracy in IR: always say no and return nothing.
You then get 99.99% accuracy on most queries.
Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk.
It's better to return some bad hits as long as you return something.
We use precision, recall, and F for evaluation, not accuracy.
F: Why harmonic mean?
Why don't we use a different mean of P and R as a measure? e.g., the arithmetic mean
The simple (arithmetic) mean is 50% for a return-everything search engine, which is too high.
Desideratum: punish really bad performance on either precision or recall.
Taking the minimum achieves this, but the minimum is not smooth and is hard to weight.
F (harmonic mean) is a kind of smooth minimum.
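A quick numeric illustration of why the harmonic mean behaves like a smooth minimum (the numbers are made up; a return-everything engine has R = 1 and tiny P):

```python
def arithmetic_mean(P, R):
    return (P + R) / 2

def harmonic_mean(P, R):
    return 2 * P * R / (P + R)

# Return-everything engine: say 1 relevant doc among 10,000 returned
P, R = 0.0001, 1.0
print(arithmetic_mean(P, R))  # ~0.5: misleadingly high
print(harmonic_mean(P, R))    # ~0.0002: stays close to min(P, R)
```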
Difficulties in using precision, recall and F
We need relevance judgments for information-need-document pairs, but they are expensive to produce.
There are some alternatives to using precision/recall and having to produce relevance judgments.
Variance of measures like precision/recall
For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.8).
Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones.
What we need for a benchmark
A collection of documents
Documents must be representative of the documents we expect to see in reality.
A collection of information needs
... incorrectly but necessarily called queries
Information needs must be representative of the information needs we expect to see in reality.
Human relevance assessments
We need to hire/pay judges or assessors to do this. Expensive, time-consuming
Judges must be representative of the users we expect to see in reality.
Standard relevance benchmark: Cranfield
Pioneering: the first testbed allowing precise quantitative measures of information retrieval effectiveness
Late 1950s, UK
1398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query-document pairs
Too small, too untypical for serious IR evaluation today
Standard relevance benchmark: TREC
TREC = Text REtrieval Conference
Organized by the U.S. National Institute of Standards and Technology (NIST)
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999
1.89 million documents, mainly newswire articles, 450 information needs
No exhaustive relevance judgments (too expensive)
Rather, NIST assessors' relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.
Standard relevance benchmarks: Others
GOV2
Another TREC/NIST collection
25 million web pages
Used to be the largest collection that is easily available
But still 3 orders of magnitude smaller than what Google/Yahoo/MSN index
NTCIR
East Asian language and cross-language information retrieval
Cross Language Evaluation Forum (CLEF)
This evaluation series has concentrated on European languages and cross-language information retrieval.
Many others
Validity of relevance assessments
Relevance assessments are only usable if they are consistent.
If they are not consistent, then there is no truth and experiments are not repeatable.
How can we measure this consistency or agreement among judges?
Kappa measure
Kappa measure
Kappa is a measure of how much judges agree or disagree.
Designed for categorical judgments
Corrects for chance agreement
P(A) = proportion of the time the judges agree
P(E) = what agreement we would get by chance
κ = (P(A) - P(E)) / (1 - P(E))
κ = ? for (i) chance agreement (ii) total agreement
Kappa measure (2)
Values of κ in the interval [2/3, 1.0] are seen as acceptable.
With smaller values: need to redesign the relevance assessment methodology used, etc.
Calculating the kappa statistic

                       Judge 2 Relevance
                       Yes     No    Total
Judge 1     Yes        300     20      320
Relevance   No          10     70       80
            Total      310     90      400

Observed proportion of the times the judges agreed:
P(A) = (300 + 70)/400 = 370/400 = 0.925
Pooled marginals:
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875
Probability that the two judges agreed by chance:
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665
Kappa statistic:
κ = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776 (still in the acceptable range)
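The kappa computation above, as a short sketch (function name and table encoding are mine):

```python
def kappa(table):
    """table[i][j] = count of items judge 1 rated i and judge 2 rated j,
    with index 0 = Yes (relevant) and 1 = No (nonrelevant)."""
    total = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / total
    # pooled marginals over both judges
    p_yes = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
    p_no = 1 - p_yes
    p_chance = p_yes ** 2 + p_no ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# The table from the slide:
print(round(kappa([[300, 20], [10, 70]]), 3))  # ~0.776
```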
Interjudge agreement at TREC

information need   docs judged   number of disagreements
      51               211                  6
      62               400                157
      67               400                 68
      95               400                110
     127               400                106

(κ in the fair range, 0.67-0.8)
Impact of interjudge disagreement
Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?
No.
Large impact on absolute performance numbers
Virtually no impact on ranking of systems
Suppose we want to know if algorithm A is better than algorithm B.
An information retrieval experiment will give us a reliable answer to this question ...
... even if there is a lot of disagreement between judges.
Evaluation at large search engines
Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10 ...
... or use measures that reward you more for getting rank 1 right than for getting rank 10 right.
Search engines also use non-relevance-based measures.
Example 1: clickthrough on first result
Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is nonrelevant) ...
... but pretty reliable in the aggregate.
Example 2: Ongoing studies of user behavior in the lab (recall last lecture)
Example 3: A/B testing
A/B testing
Purpose: test a single innovation
Prerequisite: you have a large search engine up and running.
Have most users use the old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an automatic measure like clickthrough on first result
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most
Critique of pure relevance
We've defined relevance for an isolated query-document pair.
Alternative definition: marginal relevance
The marginal relevance of a document at position k in the result list is the additional information it contributes over and above the information that was contained in documents d_1, ..., d_(k-1).
Exercise
Why is marginal relevance a more realistic measure of user happiness?
Give an example where a non-marginal measure like precision or recall is a misleading measure of user happiness, but marginal relevance is a good measure.
In a practical application, what is the difficulty of using marginal measures instead of non-marginal measures?
Definitions
C = M × N matrix with real-valued entries.
The rank of C is the number of linearly independent rows (or columns): rank(C) <= min{M, N}
A square r × r matrix with off-diagonal entries equal to zero is called a diagonal matrix (rank = # of non-zero diagonal entries)
If all diagonal entries are equal to one, it is called the identity matrix:

I_r = | 1 0 0 ... 0 |
      | 0 1 0 ... 0 |
      | 0 0 1 ... 0 |
      |     ...     |
      | 0 0 0 ... 1 |
Eigenvalues and eigenvectors
For a square M × M matrix C and a (non-vanishing) vector x, the values λ satisfying
C x = λ x
are called the eigenvalues of C; x is the right eigenvector.
The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
There are also left eigenvectors: M-vectors y with y^T C = λ y^T
Characteristic equation
The eigenvalues are determined by the characteristic equation. Write the above as (C - λI_M) x = 0, and take the determinant:
|C - λI_M| = 0
This gives an Mth-order polynomial equation in λ, with at most M roots, which are the eigenvalues of C (in general complex, even if all entries of C are real).
Example

S = | 30  0  0 |
    |  0 20  0 |
    |  0  0  1 |

has rank 3, eigenvalues λ_1 = 30, λ_2 = 20, λ_3 = 1, with eigenvectors
x_1 = (1, 0, 0)^T, x_2 = (0, 1, 0)^T, x_3 = (0, 0, 1)^T
Consider an arbitrary v = (2, 4, 6)^T = 2 x_1 + 4 x_2 + 6 x_3
Then S v = S(2 x_1 + 4 x_2 + 6 x_3) = 2 S x_1 + 4 S x_2 + 6 S x_3
         = 2 λ_1 x_1 + 4 λ_2 x_2 + 6 λ_3 x_3 = 60 x_1 + 80 x_2 + 6 x_3
Example, cont'd
Multiplication by S is determined by its eigenvalues and eigenvectors.
S v is relatively unaffected by the terms from small eigenvalues:
ignore λ_3 = 1; then S v = (60, 80, 0)^T instead of (60, 80, 6)^T
This is the idea behind low-rank approximations.
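The same example in numpy (a sketch, assuming numpy is available):

```python
import numpy as np

S = np.diag([30.0, 20.0, 1.0])
v = np.array([2.0, 4.0, 6.0])

print(S @ v)  # (60, 80, 6)

# Drop the smallest eigenvalue (set it to zero): a rank-2 approximation of S
S2 = np.diag([30.0, 20.0, 0.0])
print(S2 @ v)  # (60, 80, 0): close to S @ v in relative terms
```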
Symmetric matrices
For symmetric S, eigenvectors corresponding to distinct eigenvalues are orthogonal.
If S is real and symmetric, then the eigenvalues are real.
Example:

S = | 2 1 |
    | 1 2 |

0 = |S - λI| = (2 - λ)² - 1, so λ = 3, 1.
The eigenvectors (1, 1)^T and (1, -1)^T are orthogonal.
Matrix diagonalization theorem
Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a decomposition
S = U Λ U^(-1)
where the columns of U are the eigenvectors of S, and Λ = diag(λ_1, λ_2, ..., λ_M) with λ_i >= λ_(i+1).
If the eigenvalues are distinct, then the decomposition is unique.
Matrix diagonalization theorem, cont'd
Write U = (u_1 u_2 ... u_M). Then
S U = S(u_1 u_2 ... u_M) = (λ_1 u_1  λ_2 u_2 ... λ_M u_M) = (u_1 u_2 ... u_M) diag(λ_1, λ_2, ..., λ_M)
Thus S U = U Λ, or S = U Λ U^(-1)
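A numerical check of the decomposition, using the symmetric 2 × 2 example from the earlier slide (a sketch, assuming numpy is available):

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, U = np.linalg.eig(S)   # eigenvalues and eigenvector columns
Lam = np.diag(lam)

# Verify S = U Lam U^{-1}
print(np.allclose(S, U @ Lam @ np.linalg.inv(U)))  # True
print(np.sort(lam))  # eigenvalues 1 and 3, as computed by hand
```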
Symmetric diagonalization theorem
Let S be a square, symmetric, real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition
S = Q Λ Q^(-1)
where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix with entries the eigenvalues of S.
All entries of Q are real, and Q^(-1) = Q^T.
We will use this to build low-rank approximations to term-document matrices, using C C^T.
SVD
Let C be an M × N matrix of rank r, and C^T its N × M transpose.
C C^T and C^T C have the same r eigenvalues λ_1, ..., λ_r.
U = M × M matrix whose columns are the orthogonal eigenvectors of C C^T
V = N × N matrix whose columns are the orthogonal eigenvectors of C^T C
Then there is a singular value decomposition (SVD)
C = U Σ V^T
where the M × N matrix Σ has Σ_ii = σ_i for 1 <= i <= r, and zero otherwise.
The σ_i are called the singular values of C.
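A small sketch with numpy's built-in SVD (the toy matrix is mine; assuming numpy is available):

```python
import numpy as np

C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # toy 2 x 3 "term-document" matrix

U, sigma, Vt = np.linalg.svd(C)   # numpy returns V^T directly
Sigma = np.zeros_like(C)          # build the M x N Sigma from the singular values
Sigma[:len(sigma), :len(sigma)] = np.diag(sigma)

print(np.allclose(C, U @ Sigma @ Vt))  # True: C = U Sigma V^T
# The squared singular values are the eigenvalues of C C^T:
print(np.sort(sigma**2), np.linalg.eigvalsh(C @ C.T))
```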
Compare to S = Q Λ Q^T:
C = U Σ V^T implies C C^T = U Σ V^T V Σ^T U^T = U Σ² U^T (and C^T C = V Σ^T U^T U Σ V^T = V Σ² V^T)
The l.h.s. is square, symmetric, and real-valued, and the r.h.s. is a symmetric diagonal decomposition.
C C^T (C^T C) is a square matrix with rows and columns corresponding to each of the M terms (N documents).
Its i, j entry measures the overlap between the i-th and j-th terms (documents), based on document (term) co-occurrence.
This depends on the term weighting: in the simplest case (1/0 weights), the i, j entry counts the number of documents in which both terms i and j occur (the number of terms which occur in both documents i and j).
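A toy illustration of the co-occurrence interpretation, with 1/0 term weights (the matrix is mine; assuming numpy is available):

```python
import numpy as np

# Binary term-document matrix: rows = terms, columns = documents
C = np.array([[1, 1, 0],    # term 0 occurs in docs 0, 1
              [1, 0, 1],    # term 1 occurs in docs 0, 2
              [0, 1, 1]])   # term 2 occurs in docs 1, 2

TT = C @ C.T   # (i, j): number of documents containing both terms i and j
DD = C.T @ C   # (i, j): number of terms occurring in both documents i and j

print(TT)
print(DD)
```

The diagonal of C C^T gives each term's document frequency; the off-diagonal entries count pairwise co-occurrences.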