INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 5: Scoring, Term Weighting, The Vector Space Model II. Paul Ginsparg, Cornell University, Ithaca, NY. 14 Sep 2010. 1/ 34

Administrativa. Course Webpage: http://www.infosci.cornell.edu/courses/info4300/2010fa/ Assignment 1: Posted 3 Sep, Due Sun 19 Sep. Lectures: Tuesday and Thursday 11:40-12:55, Olin Hall 165. Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Cornell Information Science, 301 College Avenue. Instructor's Assistant: Corinne Russell, crussell@cs..., 255-5925, Cornell Information Science, 301 College Avenue. Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail the instructor to schedule an appointment. Teaching Assistant: Niranjan Sivakumar, ns253@... The Teaching Assistants do not have scheduled office hours but are available to help you by email. Send messages about the course to: cs4300-l@lists.cs.cornell.edu (forwarded to the Instructor and the Teaching Assistants). Course text at: http://informationretrieval.org/ 2/ 34

Overview: 1. Recap; 2. The vector space model; 3. Zones; 4. Discussion. 3/ 34

Outline: 1. Recap; 2. The vector space model; 3. Zones; 4. Discussion. 4/ 34

Term frequency weight. The log frequency weight of term t in d is defined as follows: w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and w_{t,d} = 0 otherwise. 5/ 34

idf weight. The document frequency df_t is defined as the number of documents that t occurs in. We define the idf weight of term t as follows: idf_t = log10(N / df_t). idf is a measure of the informativeness of the term. 6/ 34

tf.idf weight. The tf.idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = w^(tf)_{t,d} · w^(idf)_t, e.g. w_{t,d} = (1 + log tf_{t,d}) · log(N / df_t). 7/ 34
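A minimal Python sketch of these three weights (not from the slides; the collection size and counts below are only illustrative):

import math

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf(tf, N, df):
    """tf.idf weight: product of the log tf weight and the idf weight."""
    return log_tf(tf) * idf(N, df)

# Illustrative numbers: a 1,000,000-document collection, a term occurring
# 3 times in the document and in 1,000 documents overall.
print(tf_idf(3, 1_000_000, 1_000))  # (1 + log10 3) * log10(1000) ≈ 4.43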

Outline 1 Recap 2 The vector space model 3 Zones 4 Discussion 8/ 34

Binary → count → weight matrix (tf.idf weights per term and play):

term        Anthony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth   ...
Anthony     5.25                    3.18            0.0           0.0      0.0       0.35
Brutus      1.21                    6.10            0.0           1.0      0.0       0.0
Caesar      8.59                    2.54            0.0           1.51     0.25      0.0
Calpurnia   0.0                     1.54            0.0           0.0      0.0       0.0
Cleopatra   2.85                    0.0             0.0           0.0      0.0       0.0
mercy       1.51                    0.0             1.90          0.12     5.25      0.88
worser      1.37                    0.0             0.11          4.15     0.25      1.95
...

Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|. 9/ 34

Documents as vectors. Each document is now represented by a real-valued vector of tf.idf weights ∈ R^|V|. So we have a |V|-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines. Each vector is very sparse: most entries are zero. Not only useful for scoring documents on a query, but also essential for document classification and document clustering. 10/ 34
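Because the vectors are so sparse, they are usually stored as term → weight maps rather than dense arrays. A minimal sketch (illustrative, not from the slides):

# Sparse document vector: only nonzero tf.idf weights are stored.
doc_vector = {"caesar": 8.59, "brutus": 1.21, "mercy": 1.51}

def weight(vec, term):
    """Weight of a term in a sparse vector; absent terms have weight 0."""
    return vec.get(term, 0.0)

print(weight(doc_vector, "brutus"))     # 1.21
print(weight(doc_vector, "calpurnia"))  # 0.0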

Queries as vectors. Key idea 1: do the same for queries: represent them as vectors in the same high-dimensional space. Key idea 2: rank documents according to their proximity to the query, where proximity = similarity ≈ inverse distance. Recall: we're doing this because we want to get away from the you're-either-with-us-or-against-us Boolean model. Instead: rank relevant documents higher than nonrelevant documents. 11/ 34

How do we formalize vector space similarity? First cut: (inverse) distance between two points (= distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea... because Euclidean distance is large for vectors of different lengths. 12/ 34

Why distance is a bad idea. [Figure: documents plotted on axes rich (x) and poor (y): d1 "Ranks of starving poets swell", d2 "Engineers' pay up by 5%", d3 "Record baseball salaries in 2009", and the query q: [rich poor].] The Euclidean distance of q and d2 is large although the distribution of terms in the query q and the distribution of terms in the document d2 are very similar. Questions about basic vector space setup? 13/ 34

Use angle instead of distance. Rank documents according to angle with query. Thought experiment: take a document d and append it to itself. Call this document d′. Semantically d and d′ have the same content. The angle between the two documents is 0, corresponding to maximal similarity... even though the Euclidean distance between the two documents can be quite large. 14/ 34

From angles to cosines. The following two notions are equivalent: rank documents according to the angle between query and document in increasing order; rank documents according to cosine(query, document) in decreasing order. Cosine is a monotonically decreasing function of the angle on the interval [0°, 180°]. 15/ 34

Cosine. [Figure: plot of cos θ, decreasing monotonically from 1 at 0° to −1 at 180°.] 16/ 34

Length normalization. How do we compute the cosine? A vector can be (length-) normalized by dividing each of its components by its length; here we use the L2 norm: ||x||_2 = sqrt(Σ_i x_i²). This maps vectors onto the unit sphere... since after normalization ||x||_2 = sqrt(Σ_i x_i²) = 1.0. As a result, longer documents and shorter documents have weights of the same order of magnitude. Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization. 17/ 34

Cosine similarity between query and document. cos(q, d) = sim(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / (sqrt(Σ_{i=1}^{|V|} q_i²) · sqrt(Σ_{i=1}^{|V|} d_i²)). q_i is the idf weight of term i in the query. d_i is the tf weight of term i in the document. |q| and |d| are the lengths of q and d. This is the cosine similarity of q and d... or, equivalently, the cosine of the angle between q and d. 18/ 34

Cosine for normalized vectors. For normalized vectors, the cosine is equivalent to the dot product or scalar product: cos(q, d) = q · d = Σ_i q_i d_i (if q and d are length-normalized). 19/ 34
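A minimal sketch of cosine similarity over sparse term → weight vectors (illustrative code, not from the slides):

import math

def cosine(q, d):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

# A document appended to itself doubles every weight but keeps the angle at 0:
d = {"rich": 1.0, "poor": 3.0}
d_prime = {t: 2 * w for t, w in d.items()}
print(cosine(d, d_prime))  # 1.0 (maximal similarity despite different lengths)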

Cosine similarity illustrated. [Figure: length-normalized vectors v(d1), v(q), v(d2), v(d3) on the unit circle over axes rich (x) and poor (y); θ marks the angle between v(q) and a document vector.] 20/ 34

Cosine: Example. Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

How similar are these novels? SaS: Sense and Sensibility, PaP: Pride and Prejudice, WH: Wuthering Heights. 21/ 34

Cosine: Example. Term frequencies (counts):

term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

(To simplify this example, we don't do idf weighting.) 22/ 34

Cosine: Example. Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.0    1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

Log frequency weighting & cosine normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0.0     0.405
wuthering   0.0     0.0     0.588

cos(SaS, PaP) ≈ 0.789·0.832 + 0.515·0.555 + 0.335·0.0 + 0.0·0.0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69

Why do we have cos(SaS, PaP) > cos(SaS, WH)? 23/ 34
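A short Python check of this worked example (the counts are from the slide; the helper names are illustrative):

import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weighted(vec):
    """Apply log frequency weighting: 1 + log10(tf) for tf > 0, else 0."""
    return {t: (1 + math.log10(tf) if tf > 0 else 0.0) for t, tf in vec.items()}

def normalized(vec):
    """Length-normalize a vector (cosine normalization)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cos(a, b):
    """Dot product of two length-normalized vectors over the same terms."""
    return sum(w * b[t] for t, w in a.items())

v = {name: normalized(log_weighted(c)) for name, c in counts.items()}
print(round(cos(v["SaS"], v["PaP"]), 2))  # ≈ 0.94
print(round(cos(v["SaS"], v["WH"]), 2))   # ≈ 0.79
print(round(cos(v["PaP"], v["WH"]), 2))   # ≈ 0.69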

Computing the cosine score.

CosineScore(q)
  float Scores[N] = 0
  float Length[N]
  for each query term t
    do calculate w_t,q and fetch postings list for t
       for each pair (d, tf_t,d) in postings list
       do Scores[d] += w_t,d × w_t,q
  Read the array Length
  for each d
    do Scores[d] = Scores[d] / Length[d]
  return Top K components of Scores[]

24/ 34
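A minimal runnable sketch of this term-at-a-time scoring loop in Python (the index layout and precomputed weights below are illustrative assumptions, not the course's code):

import heapq

def cosine_score(query_weights, postings, doc_length, N, k=10):
    """Term-at-a-time cosine scoring.

    query_weights: dict term -> w_{t,q}
    postings: dict term -> list of (doc_id, w_{t,d}) pairs
    doc_length: dict doc_id -> vector length used for normalization
    """
    scores = [0.0] * N
    for t, w_tq in query_weights.items():
        for d, w_td in postings.get(t, []):
            scores[d] += w_td * w_tq
    for d in range(N):
        if doc_length[d] > 0:
            scores[d] /= doc_length[d]
    # Return the top-k (score, doc_id) pairs.
    return heapq.nlargest(k, ((s, d) for d, s in enumerate(scores)))

# Tiny illustrative index with precomputed document weights:
postings = {"car": [(0, 1.0), (2, 0.5)], "insurance": [(0, 1.3), (1, 0.7)]}
doc_length = {0: 1.92, 1: 1.0, 2: 1.0}
print(cosine_score({"car": 2.0, "insurance": 3.0}, postings, doc_length, N=3, k=2))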

Components of tf.idf weighting.

Term frequency:
  n (natural)    tf_{t,d}
  l (logarithm)  1 + log(tf_{t,d})
  a (augmented)  0.5 + 0.5 · tf_{t,d} / max_t(tf_{t,d})
  b (boolean)    1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)    (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)        1
  t (idf)       log(N / df_t)
  p (prob idf)  max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)            1
  c (cosine)          1 / sqrt(w_1² + w_2² + ... + w_M²)
  u (pivoted unique)  1/u
  b (byte size)       1 / CharLength^α, α < 1

Best known combination of weighting options. Default: no weighting. 25/ 34
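A sketch of a few of these components as Python functions (illustrative; only the natural/logarithm/boolean tf, no/idf df, and none/cosine normalization variants are shown):

import math

# Term-frequency variants (SMART letters n, l, b)
tf_variants = {
    "n": lambda tf: tf,                                   # natural
    "l": lambda tf: 1 + math.log(tf) if tf > 0 else 0.0,  # logarithm
    "b": lambda tf: 1.0 if tf > 0 else 0.0,               # boolean
}

# Document-frequency variants (SMART letters n, t)
df_variants = {
    "n": lambda N, df: 1.0,               # no df weighting
    "t": lambda N, df: math.log(N / df),  # idf
}

# Normalization variants (SMART letters n, c)
def cosine_norm(weights):
    """Divide each component by the L2 norm of the vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

norm_variants = {"n": lambda ws: list(ws), "c": cosine_norm}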

tf.idf example. We often use different weightings for queries and documents. Notation: qqq.ddd. Example: ltn.lnc. Query: logarithmic tf, idf, no normalization. Document: logarithmic tf, no df weighting, cosine normalization. Isn't it bad to not idf-weight the document? Example query: "best car insurance". Example document: "car insurance auto insurance". 26/ 34

tf.idf example: ltn.lnc. Query: "best car insurance". Document: "car insurance auto insurance".

            query                                   document
word        tf-raw  tf-wght  df      idf  weight    tf-raw  tf-wght  weight  n'lized   product
auto        0       0        5000    2.3  0         1       1        1       0.52      0
best        1       1        50000   1.3  1.3       0       0        0       0         0
car         1       1        10000   2.0  2.0       1       1        1       0.52      1.04
insurance   1       1        1000    3.0  3.0       2       1.3      1.3     0.68      2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length for cosine normalization: sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σ_i w_qi · w_di = 0 + 0 + 1.04 + 2.04 = 3.08. Questions? 27/ 34
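A small Python sketch reproducing this ltn.lnc computation (illustrative; the df values come from the slide, and the collection size N = 1,000,000 is an assumption chosen so that idf = log10(N/df) matches the idf column):

import math

N = 1_000_000  # assumed collection size; reproduces the idf values in the table
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

query_tf = {"best": 1, "car": 1, "insurance": 1}   # "best car insurance"
doc_tf = {"car": 1, "insurance": 2, "auto": 1}     # "car insurance auto insurance"

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query weighting: ltn = logarithmic tf, idf, no normalization
q_weight = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}

# Document weighting: lnc = logarithmic tf, no df weighting, cosine normalization
d_raw = {t: log_tf(tf) for t, tf in doc_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_raw.values()))
d_weight = {t: w / d_len for t, w in d_raw.items()}

score = sum(w * d_weight.get(t, 0.0) for t, w in q_weight.items())
print(round(score, 2))  # ≈ 3.07 (the slide rounds intermediate values, giving 3.08)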

Summary: Ranked retrieval in the vector space model. Represent the query as a weighted idf vector. Represent each document as a weighted tf vector. Compute the cosine similarity between the query vector and each document vector. Rank documents with respect to the query. Return the top K (e.g., K = 10) to the user. Note: not just for text retrieval! (See, e.g., iTunes Genius.) 28/ 34

Outline: 1. Recap; 2. The vector space model; 3. Zones; 4. Discussion. 29/ 34

Parametric and Zone indices. Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, ...). One parametric index for each field. Fields take a finite set of values (e.g., dates of authorship); zones are arbitrary free text (e.g., titles, abstracts). Permits searching for documents by Shakespeare written in 1601 containing the phrase "alas poor Yorick", or for documents with "merchant" in the title and "william" in the author list and the phrase "gentle rain" in the body. Use separate indexes for each field and zone, or encode the zone in the dictionary, e.g. william.abstract, william.title, william.author. Permits weighted zone scoring. 30/ 34

Weighted Zone Scoring. Given a Boolean query q and a document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores. Let g_1, ..., g_l ∈ [0, 1] such that Σ_{i=1}^{l} g_i = 1. For 1 ≤ i ≤ l, let s_i be the score between q and the i-th zone. Then the weighted zone score is defined as Σ_{i=1}^{l} g_i s_i. Example: three zones: author, title, body; g_1 = 0.2, g_2 = 0.3, g_3 = 0.5 (a match in the author zone is least important). Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a Boolean AND query, now compute a score for each document. 31/ 34
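A minimal sketch of weighted zone scoring with Boolean zone matches (illustrative code; the zone weights are the g_i from the example above):

ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}  # g_i, summing to 1

def zone_score(query_terms, zone_text):
    """Boolean zone score s_i: 1 if every query term occurs in the zone, else 0."""
    words = set(zone_text.lower().split())
    return 1.0 if all(t in words for t in query_terms) else 0.0

def weighted_zone_score(query_terms, doc_zones):
    """Weighted zone score: sum over zones of g_i * s_i."""
    return sum(g * zone_score(query_terms, doc_zones.get(zone, ""))
               for zone, g in ZONE_WEIGHTS.items())

doc = {"author": "William Shakespeare",
       "title": "The Merchant of Venice",
       "body": "The quality of mercy is not strained, it droppeth as the gentle rain"}
print(weighted_zone_score(["william"], doc))   # 0.2 (match in author zone only)
print(weighted_zone_score(["merchant"], doc))  # 0.3 (match in title zone only)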

Learning Weights. How to determine the weights g_i for weighted zone scoring? A. Specified by an expert. B. Learned from training examples that have been judged editorially (machine-learned relevance): 1. given a set of training examples [(q, d) pairs plus a relevance judgment (e.g., yes/no)], 2. set the weights g_i to best approximate the relevance judgments. Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web). Or use passive collaborative feedback? (clickthrough data) 32/ 34
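A sketch of option B for a single weight g over two zones (title and body): grid-search the g that minimizes squared error against yes/no relevance judgments. This is an illustrative approach under assumed training data, not the course's prescribed method:

# Training examples: (title match s_T, body match s_B, relevance judgment r in {0, 1}).
examples = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0), (0, 1, 1)]

def squared_error(g):
    """Total squared error of the score g*s_T + (1-g)*s_B against the judgments."""
    return sum((g * s_t + (1 - g) * s_b - r) ** 2 for s_t, s_b, r in examples)

# Grid search over g in [0, 1].
best_g = min((i / 100 for i in range(101)), key=squared_error)
print(best_g)  # weight for the title zone; 1 - best_g goes to the body zone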

Outline: 1. Recap; 2. The vector space model; 3. Zones; 4. Discussion. 33/ 34

Discussion 2, 21 Sep. For this class, read and be prepared to discuss the following: K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.) 34/ 34