Ranked IR. Text Technologies for Data Science (INFR11145). Instructor: Walid Magdy. 10 October 2017.


Text Technologies for Data Science (INFR11145)
Ranked IR
Instructor: Walid Magdy
10-Oct-2017

Lecture Objectives
- Learn about Ranked IR: TFIDF, VSM, SMART notation
- Implement: TFIDF

Boolean Retrieval
- Thus far, our queries have all been Boolean: documents either match or do not match.
- Good for expert users with a precise understanding of their needs and of the collection. Patent search, for example, uses sophisticated sets of Boolean queries and checks hundreds of search results:
  (car OR vehicle) AND (motor OR engine) AND NOT (cooler)
- Not good for the majority of users: most are incapable of writing Boolean queries, and most don't want to go through thousands of results. This is particularly true of web search.
- Question: what is the most unused web-search feature?

Ranked Retrieval
- Typical queries: free-text queries. Results are ranked with respect to the query.
- Large result sets are not an issue: we just show the top k (about 10) results, so we don't overwhelm the user.
- Criteria: the top-ranked documents are the most likely to satisfy the user's query. The score is based on how well a document matches the query: Score(d, q).

Old Example
- Find documents matching the query {ink wink}:
  1. Load the inverted list for each query word.
  2. Merge the two postings lists (linear merge).
  3. Apply a function to the matches:
     - Boolean: exists / does not exist = 0 or 1
     - Ranked: f(tf, df, length, ...) in [0, 1]
- Postings: ink -> 3:1, 4:1, 5:1; wink -> 1:1, 5:1
- Matches: doc 1: f(0,1) = 0; doc 3: f(1,0) = 0; doc 4: f(1,0) = 0; doc 5: f(1,1) = 1

Function example: Jaccard coefficient
- Remember: a commonly used measure of the overlap of two sets A and B:
  jaccard(A, B) = |A ∩ B| / |A ∪ B|
- jaccard(A, A) = 1; jaccard(A, B) = 0 if A ∩ B = ∅
- Example:
  D1: "He likes to wink, he likes to drink"
  D2: "He likes to drink, and drink, and drink"
  D1 ∪ D2 = {he, likes, to, wink, and, drink}
  D1 ∩ D2 = {he, likes, to, drink}
  jaccard(D1, D2) = 4/6 = 0.667
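The slide's Jaccard match function can be sketched as follows (whitespace tokenization and the helper name `jaccard` are illustrative choices, not part of the lecture):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two term lists."""
    a, b = set(a), set(b)
    if not (a | b):          # both empty: define overlap as 0
        return 0.0
    return len(a & b) / len(a | b)

d1 = "he likes to wink he likes to drink".split()
d2 = "he likes to drink and drink and drink".split()

score = jaccard(d1, d2)      # intersection has 4 terms, union has 6
```

Run on the slide's D1 and D2, the intersection {he, likes, to, drink} has 4 terms and the union 6, giving 4/6 ≈ 0.667.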

Jaccard coefficient: Issues
- Does not consider term frequency (how many times a term occurs in a document): it treats all terms equally!
- What about rare terms in a collection? They are more informative than frequent terms. In "He likes to drink", should "to" count the same as "drink"?
- Needs a more sophisticated way of length normalization: |D1| = 3 vs. |D2| = 1000!

Should all terms be treated the same?
- [Figure: a collection of 5 documents, D1 to D5, each drawn as a bag of balls (balls = terms), compared against a query.]
- Which is the least relevant document? Which is the most relevant document?

TFIDF
- TFIDF: Term Frequency, Inverse Document Frequency.
- tf(t,d): number of times term t appears in document d. As tf(t,d) increases, the importance of t in d increases: a document about IR contains "retrieval" more often than other documents do.
- df(t): number of documents that term t appears in. As df(t) increases, the importance of t in the collection decreases: a term that appears in many documents is not informative. "FT" is not an informative word in Financial Times articles.

DF, CF, & IDF
- cf(t) (collection frequency): total number of occurrences of term t in the collection.
- df(t) ≤ N (N: number of documents in the collection), while cf(t) can be larger than N.
- DF is more commonly used in IR than CF.
- idf(t): inverse of df(t). As idf(t) increases, the term is rarer and its importance increases: idf(t) is a measure of the informativeness of t.
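The df/cf distinction can be checked on a toy collection (the three example documents here are made up for illustration):

```python
from collections import Counter

# Toy collection of N = 3 documents.
docs = [
    "to be or not to be".split(),
    "be quick".split(),
    "quick quick quick quick".split(),
]

df = Counter()   # df(t): number of documents containing t
cf = Counter()   # cf(t): total occurrences of t in the collection
for d in docs:
    cf.update(d)        # count every occurrence
    df.update(set(d))   # count each document at most once

N = len(docs)
# df(t) <= N always holds, while cf(t) can exceed N:
# here "quick" occurs 5 times (cf > N) but only in 2 documents (df <= N).
```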

IDF: formula

  idf(t) = log10( N / df(t) )

- The log scale is used to dampen the effect of IDF.
- Suppose N = 1 million:

  term       df(t)      idf(t)
  calpurnia  1          6
  animal     100        4
  sunday     1,000      3
  fly        10,000     2
  under      100,000    1
  the        1,000,000  0

TFIDF term weighting
- One of the best-known term-weighting schemes in IR.
- Increases with the number of occurrences within a document; increases with the rarity of the term in the collection.
- Combines TF and IDF to find the weight of a term:

  w(t,d) = (1 + log10 tf(t,d)) × log10( N / df(t) )

- For a query q and document d, the retrieval score f(q,d) is:

  Score(q,d) = Σ_{t ∈ q ∩ d} w(t,d)
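A minimal sketch of these two formulas, which also reproduces the idf table for N = 1 million (the function names are illustrative):

```python
import math

def idf(df_t, N):
    """idf(t) = log10(N / df(t))."""
    return math.log10(N / df_t)

def tfidf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf(t,d)) * idf(t); zero when the term is absent."""
    if tf_td == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * idf(df_t, N)

# Reproducing the slide's table with N = 1 million:
N = 1_000_000
table = [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
         ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]
for term, df_t in table:
    print(term, idf(df_t, N))   # 6, 4, 3, 2, 1, 0
```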

Document/Term vectors with tfidf

  term       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony     5.25              3.18           0            0       0        0.35
  Brutus     1.21              6.10           0            1.0     0        0
  Caesar     8.59              2.54           0            1.51    0.25     0
  Calpurnia  0                 1.54           0            0       0        0
  Cleopatra  2.85              0              0            0       0        0
  mercy      1.51              0              1.90         0.12    5.25     0.88
  worser     1.37              0              0.11         4.15    0.25     1.95

Vector Space Model
- Documents and queries are represented as vectors.
- Match(Q, D) = distance between the vectors. Example: Q = "gossip jealous".
- Euclidean distance? It measures the distance between the endpoints of the two vectors, and is large for vectors of different lengths.
- Take a document d and append it to itself; call this document d'. Semantically, d and d' have the same content, yet their Euclidean distance can be quite large.

Angle Instead of Distance
- The angle between the two documents d and d' is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query. Rank documents in increasing order of the angle with the query, i.e., in decreasing order of cosine(query, document).
- Cosine of the angle = projection of one vector on the other.

Length Normalization
- A vector can be normalized by dividing each of its components by its length; for this we use the L2 norm:

  |x| = sqrt( Σ_i x_i^2 )

- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
- Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length normalization. Long and short documents now have comparable weights.
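The d vs. d' point can be demonstrated numerically. A small sketch (the specific vector (1, 3, 2) is an assumed example; doubling it plays the role of appending d to itself):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

d  = [1.0, 3.0, 2.0]   # term-frequency vector of document d
dd = [2.0, 6.0, 4.0]   # d appended to itself: every count doubles

# Euclidean distance is clearly nonzero even though content is identical...
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(d, dd)))

# ...but after length normalization the two vectors coincide.
u, v = l2_normalize(d), l2_normalize(dd)
```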

Example
- D1 = (1, 3, 2):  |D1| = sqrt(1 + 9 + 4) = sqrt(14) = 3.74
  D1 normalized = (0.267, 0.802, 0.535)
- D2 = (3, 9, 6):  |D2| = sqrt(9 + 81 + 36) = sqrt(126) = 11.22
  D2 normalized = (0.267, 0.802, 0.535)
- D2 = 3 × D1, so the two normalized vectors are identical.

Cosine Similarity (Query, Document)
- q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
- For normalized vectors:

  cos(q, d) = q · d = Σ_{i=1..V} q_i d_i

- For non-normalized vectors:

  cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1..V} q_i d_i / ( sqrt(Σ_{i=1..V} q_i^2) × sqrt(Σ_{i=1..V} d_i^2) )
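The non-normalized cosine formula, applied to the slide's worked example (assuming D1 = (1, 3, 2) and D2 = (3, 9, 6)), confirms that parallel vectors get maximal similarity:

```python
import math

def cosine(q, d):
    """cos(q, d) = q · d / (|q| |d|) for non-normalized vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

d1 = [1.0, 3.0, 2.0]
d2 = [3.0, 9.0, 6.0]   # d2 = 3 * d1: same direction, angle 0
sim = cosine(d1, d2)   # maximal similarity, up to rounding
```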

Algorithm

TFIDF Variants
- Many search engines allow different weightings for queries vs. documents.
- SMART notation: use the notation ddd.qqq, with the acronyms from the table.
- A very standard weighting scheme is: lnc.ltc
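The SMART table itself did not survive transcription, so the following sketch of lnc.ltc assumes the standard SMART meanings: l = 1 + log10 tf, t = idf, n = no idf, c = cosine normalization; documents (lnc) skip idf, queries (ltc) apply it:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def l2_normalize(w):
    """Cosine-normalize a {term: weight} vector."""
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else w

def lnc(doc_tf):
    """Document side: log tf, no idf, cosine normalization."""
    return l2_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

def ltc(query_tf, df, N):
    """Query side: log tf, idf, cosine normalization."""
    w = {t: log_tf(tf) * math.log10(N / df[t])
         for t, tf in query_tf.items() if t in df}
    return l2_normalize(w)

def score(qv, dv):
    """Dot product of the two weighted vectors."""
    return sum(w * dv.get(t, 0.0) for t, w in qv.items())
```

Usage with made-up statistics (N = 10 documents, illustrative df values): `score(ltc({"best": 1, "car": 1, "insurance": 1}, df, 10), lnc({"car": 1, "insurance": 2, "auto": 1}))`.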

Summary of Steps
- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity score between the query vector and each document vector.
- Rank the documents with respect to the query by score.
- Return the top K (e.g., K = 10) to the user.

Resources
- Text book 1: Intro to IR, Chapter 6, sections 6.2-6.4
- Text book 2: IR in Practice, Chapter 7
- Next lecture: Ranked IR (2)
- BM25: Robertson, Stephen E., et al. "Okapi at TREC-3." NIST Special Publication Sp 109 (1995): 109.
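The steps above can be put together into one end-to-end sketch (whitespace tokenization, in-memory lists, and the helper names are all simplifying assumptions; a real engine would use an inverted index):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """{term: (1 + log10 tf) * log10(N / df)}; unseen terms are skipped."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(c)) * math.log10(N / df[t])
            for t, c in tf.items() if df.get(t)}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs, k=10):
    """Score every document against the query, return the top k (id, score)."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    dvecs = [tfidf_vector(d.split(), df, N) for d in docs]
    qvec = tfidf_vector(query.split(), df, N)
    scored = sorted(enumerate(cosine(qvec, dv) for dv in dvecs),
                    key=lambda x: -x[1])
    return scored[:k]

docs = [
    "he likes to wink he likes to drink",
    "he likes to drink and drink and drink",
    "the cat sat on the mat",
]
top = rank("wink drink", docs, k=2)
```

The first document matches both query terms, the second only "drink", and the third neither, so they come back in that order.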