Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Content adapted from Hinrich Schütze, http://www.informationretrieval.org

Querying: corpus-wide statistics

Collection frequency, cf: the total number of occurrences of the term in the entire corpus.
Document frequency, df: the total number of documents in the corpus that contain the term.

Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760

The two words have nearly identical collection frequencies, but "insurance" occurs in far fewer documents. This suggests that df is better at discriminating between documents. How do we use df?
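Both statistics can be gathered in one pass over the corpus. A minimal sketch, assuming a hypothetical toy corpus of already-tokenized documents:

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["try", "insurance", "try"],
    ["try", "try"],
    ["insurance"],
]

cf = Counter()  # collection frequency: total occurrences in the corpus
df = Counter()  # document frequency: number of documents containing the term
for doc in corpus:
    cf.update(doc)       # every occurrence counts toward cf
    df.update(set(doc))  # each term counts at most once per document for df

print(cf["try"], df["try"])  # 4 2
```

Note how "try" has a higher collection frequency than document frequency: it is repeated within documents, which cf counts and df deliberately ignores.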

Querying: corpus-wide statistics

Term frequency, inverse document frequency (tf-idf) weights:

tf = term frequency: some measure of term density in a document.
idf = inverse document frequency: a measure of the informativeness of a term, i.e. its rarity across the corpus. It could be just a count of the documents containing the term, but more commonly it is

    idf_t = log( |corpus| / df_t )

where |corpus| is the number of documents in the corpus.

Querying: tf-idf examples

With a corpus of 1,000,000 documents, idf_t = log10( 1,000,000 / df_t ):

term        df_t         idf_t
calpurnia   1            6
animal      100          4
sunday      1,000        3
fly         10,000       2
under       100,000      1
the         1,000,000    0
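The idf column follows directly from the formula; a quick check in Python, using a base-10 logarithm and the df values from the classic idf table:

```python
import math

N = 1_000_000  # number of documents in the corpus
table = [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
         ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]

for term, df in table:
    idf = math.log10(N / df)  # idf_t = log10(N / df_t)
    print(f"{term:10s} df={df:>9,} idf={idf:.0f}")
```

A term appearing in every document ("the") gets idf = 0, so it contributes nothing to any score, while a term appearing in a single document gets the maximum idf.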

Querying: tf-idf summary

Assign a tf-idf weight to each term t in a document d:

    tf-idf(t, d) = (1 + log(tf_{t,d})) * log( |corpus| / df_t )

The weight increases with the number of occurrences of the term in the document, and increases with the rarity of the term across the entire corpus. Keep the three different statistics distinct: term frequency, document frequency, and collection (corpus) frequency.
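The weighting scheme above fits in a few lines; a sketch, assuming base-10 logarithms (the slides leave the base unspecified):

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf weight for a term with term frequency tf in a document,
    document frequency df, in a corpus of n_docs documents."""
    if tf == 0:
        return 0.0  # absent terms get zero weight
    return (1 + math.log10(tf)) * math.log10(n_docs / df)
```

For example, a term occurring once (tf = 1) in a document, and appearing in 10 of 1,000 documents, gets weight (1 + 0) × log10(100) = 2.0.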

Querying

Now, real-valued term-document matrices. In the bag-of-words model, each element of the matrix is a tf-idf value:

            Antony and  Julius   The      Hamlet   Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony      13.1        11.4     0.0      0.0      0.0      0.0
Brutus      3.0         8.3      0.0      1.0      0.0      0.0
Caesar      2.3         2.3      0.0      0.5      0.3      0.3
Calpurnia   0.0         11.2     0.0      0.0      0.0      0.0
Cleopatra   17.7        0.0      0.0      0.0      0.0      0.0
mercy       0.5         0.0      0.7      0.9      0.9      0.3
worser      1.2         0.0      0.6      0.6      0.6      0.0

Querying: vector space scoring

That is a nice matrix, but how does it relate to scoring? Next: vector space scoring.

Vector Space Model

Define: vector space model. Representing a set of documents as vectors in a common vector space. It is fundamental to many operations:
- scoring (query, document) pairs
- document classification
- document clustering
A query is represented as a document too: a short one, but mathematically equivalent.

Vector Space Model

A document d is represented as a vector V(d), with one component for each term in the dictionary. Assume each component is the tf-idf score:

    V(d)_t = (1 + log(tf_{t,d})) * log( |corpus| / df_t )

A corpus is many such vectors together. A document can be thought of as a point in a multidimensional space, with one axis per term.
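Building V(d) is then a loop over the dictionary; a minimal sketch, where the dictionary, df counts, and corpus size are illustrative assumptions:

```python
import math
from collections import Counter

def doc_vector(doc_tokens, dictionary, df, n_docs):
    """One tf-idf component per dictionary term (base-10 logs assumed)."""
    tf = Counter(doc_tokens)
    vec = []
    for t in dictionary:
        if tf[t] == 0:
            vec.append(0.0)  # term absent from the document
        else:
            vec.append((1 + math.log10(tf[t])) * math.log10(n_docs / df[t]))
    return vec

# Toy example: 2-document corpus, 3-term dictionary.
v = doc_vector(["a", "a", "b"], ["a", "b", "c"], {"a": 1, "b": 2}, 2)
```

In the toy example, "b" appears in the document but in every document of the corpus (df = n_docs), so its idf, and hence its component, is zero; "c" is absent and also gets zero. The vector is dense here for clarity; real systems store only the non-zero components.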

Vector Space Model

Recall our Shakespeare example. Each column of the matrix is a document vector, V(d_1) through V(d_6):

            Antony and  Julius   The      Hamlet   Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony      13.1        11.4     0.0      0.0      0.0      0.0
Brutus      3.0         8.3      0.0      1.0      0.0      0.0
Caesar      2.3         2.3      0.0      0.5      0.3      0.3
Calpurnia   0.0         11.2     0.0      0.0      0.0      0.0
Cleopatra   17.7        0.0      0.0      0.0      0.0      0.0
mercy       0.5         0.0      0.7      0.9      0.9      0.3
worser      1.2         0.0      0.6      0.6      0.6      0.0

(Figures: the six plays plotted as points, first in the plane spanned by the "Antony" and "Brutus" axes, then in the plane spanned by the "mercy" and "worser" axes.)

Query as a vector

A query can also be plotted in the same space. To score, we ask: how similar are two points? How do we answer that? (Figure: the query plotted alongside the six play vectors in the mercy/worser plane.)

Score by magnitude?

Similarity of magnitude? No: two documents that are similar in content but different in length can have large differences in magnitude.

Score by angle

Instead, use similarity of relative positions, i.e. the difference in angle. Two documents are maximally similar if the angle between them is 0. As long as the ratios between the components are the same, the documents will be scored as equal. This is measured via the dot product.

Score by angle

Rather than the angle itself, use the cosine of the angle. For sorting, cosine and angle are equivalent: cosine is monotonically decreasing as a function of the angle over (0°, 180°).

Big picture

Why are we turning documents and queries into vectors? We are getting away from Boolean retrieval and developing ranked retrieval methods, which need scores. Term weighting allows us to compute scores for document similarity, and the vector space model is a clean mathematical model to work with.

Big picture

The cosine similarity measure:
- is symmetric: if d_1 is close to d_2, then d_2 is close to d_1
- behaves transitively in a loose sense: if d_1 is close to d_2 and d_2 is close to d_3, then d_1 is also close to d_3 (angles obey the triangle inequality)
- makes no document closer to d_1 than d_1 itself
If the vectors are normalized (length = 1), the similarity score is just the dot product, which is fast to compute.

Queries in the vector space model

Central idea: the query is a vector. We regard the query q as a short document and return the documents ranked by the closeness of their vectors to the query vector:

    sim(q, d_i) = ( V(q) · V(d_i) ) / ( |V(q)| |V(d_i)| )

Note that V(q) is very sparse!
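The scoring rule above is a few lines of code; a sketch of the cosine computation in pure Python, using dense vectors for clarity (a real system would exploit the sparsity of V(q)):

```python
import math

def cosine_sim(q, d):
    """sim(q, d) = (q . d) / (|q| |d|); 0 if either vector is all zeros."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    if nq == 0.0 or nd == 0.0:
        return 0.0
    return dot / (nq * nd)
```

Parallel vectors score 1.0 regardless of their lengths, and orthogonal vectors (no terms in common) score 0.0.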

Cosine Similarity Score

    cos(θ) = ( V(d_1) · V(d_2) ) / ( |V(d_1)| |V(d_2)| )

    sim(d_1, d_2) = ( V(d_1) · V(d_2) ) / ( |V(d_1)| |V(d_2)| )

Cosine Similarity Score

Define the dot product:

    V(d_1) · V(d_2) = Σ_{i=1..n} V(d_1)_i × V(d_2)_i

For the Antony and Cleopatra (d_1) and Julius Caesar (d_2) columns of the Shakespeare matrix:

    V(d_1) · V(d_2) = (13.1 × 11.4) + (3.0 × 8.3) + (2.3 × 2.3) + (0 × 11.2)
                    + (17.7 × 0) + (0.5 × 0) + (1.2 × 0) = 179.53

Cosine Similarity Score

Define the Euclidean length:

    |V(d_1)| = sqrt( Σ_{i=1..n} V(d_1)_i × V(d_1)_i )

    |V(d_1)| = sqrt( (13.1 × 13.1) + (3.0 × 3.0) + (2.3 × 2.3) + (17.7 × 17.7)
                   + (0.5 × 0.5) + (1.2 × 1.2) ) = 22.38

Cosine Similarity Score

Similarly for d_2 (Julius Caesar):

    |V(d_2)| = sqrt( (11.4 × 11.4) + (8.3 × 8.3) + (2.3 × 2.3) + (11.2 × 11.2) ) = 18.15

Cosine Similarity Score

Example:

    sim(d_1, d_2) = ( V(d_1) · V(d_2) ) / ( |V(d_1)| |V(d_2)| )
                  = 179.53 / (22.38 × 18.15) = 0.442
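The whole worked example can be checked numerically using the two Shakespeare columns (component order: Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser):

```python
import math

d1 = [13.1, 3.0, 2.3, 0.0, 17.7, 0.5, 1.2]   # Antony and Cleopatra
d2 = [11.4, 8.3, 2.3, 11.2, 0.0, 0.0, 0.0]   # Julius Caesar

dot = sum(a * b for a, b in zip(d1, d2))      # dot product
n1 = math.sqrt(sum(a * a for a in d1))        # Euclidean length of d1
n2 = math.sqrt(sum(b * b for b in d2))        # Euclidean length of d2
sim = dot / (n1 * n2)                         # cosine similarity

print(f"dot={dot:.2f}  |d1|={n1:.2f}  |d2|={n2:.2f}  sim={sim:.3f}")
# dot=179.53  |d1|=22.38  |d2|=18.15  sim=0.442
```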

Exercise

Rank the following pairs by decreasing cosine similarity, assuming tf-idf weighting:
- two docs that have only frequent words in common (the, a, an, of)
- two docs that have no words in common
- two docs that have many rare words in common (mocha, volatile, organic, shade-grown)

Spamming indices

The vector space model was invented before spam. Consider indexing a sensible, passive document collection versus an active document collection, where people, companies, and bots shape documents to maximize their scores. Vector space scoring may not be as useful in the latter context.

Interaction: vectors and phrases

Scoring phrases doesn't naturally fit into the vector space world: how do we get beyond the bag of words? Consider "dark roast" and "pot roast": there is no information on "dark roast" as a phrase in our indices. A biword index can treat some phrases as terms, giving us postings and document-wide statistics for phrases.

Interaction: vectors and phrases

Theoretical problem: the axes of our term space are now correlated; there is a lot of shared information between the "light roast" and "dark roast" rows of our index. End-user problem: a user doesn't know which phrases are indexed and so can't effectively discriminate between results.

Multiple queries for phrases and vectors

Query: "rising interest rates". Iterative refinement:
1. Run the phrase query vector with the 3 words treated as a single term.
2. If there are not enough results, run 2-word phrase queries ("rising interest", "interest rates") and fold their results in.
3. If there are still not enough results, run the query with the three words as separate terms.
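The steps above can be sketched as a small driver; `run_query` here is a hypothetical search function standing in for the actual index interface, and is an assumption of this sketch:

```python
def phrase_query_with_fallback(query_terms, run_query, min_results=10):
    """Iterative refinement for a multi-word query.

    run_query is a hypothetical search function: it takes a list of
    phrases (each treated as a single term) and returns result ids.
    """
    # 1. Try the whole query as one phrase.
    results = run_query([" ".join(query_terms)])
    if len(results) >= min_results:
        return results
    # 2. Fold in results from the adjacent 2-word phrase queries.
    bigrams = [" ".join(query_terms[i:i + 2])
               for i in range(len(query_terms) - 1)]
    results += [r for r in run_query(bigrams) if r not in results]
    if len(results) >= min_results:
        return results
    # 3. Still not enough: fall back to the individual terms.
    results += [r for r in run_query(list(query_terms)) if r not in results]
    return results
```

Each stage only runs when the previous, more precise one returns too few results, so exact phrase matches always rank ahead of looser matches.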

Vectors and Boolean queries

Ranked queries and Boolean queries don't work very well together. In term space, ranked queries select based on sector containment (cosine similarity), while Boolean queries select based on unions and intersections of rectangles.

Vectors and wild cards

How could we work with the query "quick* print*"? Can we view this as a bag of words? What about expanding each wild card into the matching set of dictionary terms? Danger: unlike the Boolean case, we now have tfs and idfs to deal with. Overall, not a great idea.

Vectors and other operators

Vector space queries are good for no-syntax, bag-of-words queries: a nice mathematical formalism and a clear metaphor for similar-document queries. But the model doesn't work well with Boolean, wild-card, or positional query operators. But...

Query language vs. scoring

Interfaces to the rescue: free text queries are often separated from the operator query language. Free text is the default; advanced query operators are available in an "advanced query" section of the interface, or embedded in the free text query with special syntax (e.g. a leading - to exclude a term).

Alternatives to tf-idf: sublinear tf scaling

Twenty occurrences of "mole" do not indicate twenty times the relevance. This motivated the WTF score:

    WTF(t, d):
        if tf_{t,d} = 0 then return 0
        else return 1 + log(tf_{t,d})

There are other variants for reducing the impact of repeated terms.
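The WTF score is straightforward in code; a sketch assuming base-10 logarithms (the slide leaves the base unspecified):

```python
import math

def wtf(tf):
    """Sublinear tf scaling: 0 for absent terms, else 1 + log(tf)."""
    if tf == 0:
        return 0.0
    return 1.0 + math.log10(tf)
```

With this scaling, 20 occurrences score 1 + log10(20) ≈ 2.3 rather than 20 times the weight of a single occurrence.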