Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org

Collection Frequency, cf
Define: the total number of occurrences of the term in the entire corpus.

Document Frequency, df
Define: the total number of documents in the corpus that contain the term.

    Word        Collection Frequency    Document Frequency
    insurance   10440                   3997
    try         10422                   8760

This suggests that df is better at discriminating between documents. How do we use df?
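The distinction between cf and df can be sketched in a few lines of Python. This is a toy illustration with an invented corpus (not from the slides), built so that two terms have the same collection frequency but very different document frequencies:

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["insurance", "insurance", "insurance"],
    ["try", "cat"],
    ["try", "dog"],
    ["try", "fish"],
]

cf = Counter()  # collection frequency: total occurrences across the corpus
df = Counter()  # document frequency: number of documents containing the term
for doc in corpus:
    cf.update(doc)       # count every occurrence
    df.update(set(doc))  # count each term at most once per document

print(cf["insurance"], df["insurance"])  # 3 1
print(cf["try"], df["try"])              # 3 3
```

Both terms occur three times overall, but "insurance" is concentrated in one document while "try" is spread across three, which is exactly the property df captures.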

Term Frequency, Inverse Document Frequency Weights (tf-idf)
tf = term frequency: some measure of term density in a document.
idf = inverse document frequency: a measure of the informativeness of a term, i.e. its rarity across the corpus. It could be just a count of documents containing the term, but more commonly it is:

    idf_t = log(|corpus| / df_t)

where |corpus| is the total number of documents.

TF-IDF Examples
With idf_t = log10(1,000,000 / df_t):

    term         df_t        idf_t
    calpurnia            1       6
    animal             100       4
    sunday           1,000       3
    fly             10,000       2
    under          100,000       1
    the          1,000,000       0
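The idf column can be checked directly. A minimal sketch, assuming base-10 logarithms, a corpus of 1,000,000 documents, and the standard df values for this example (e.g. df = 100 for "animal", which is what an idf of 4 implies):

```python
import math

N = 1_000_000  # assumed corpus size (number of documents)

def idf(df_t: int, n: int = N) -> float:
    # idf_t = log10(N / df_t)
    return math.log10(n / df_t)

dfs = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
       "fly": 10_000, "under": 100_000, "the": 1_000_000}
for term, df_t in dfs.items():
    print(term, idf(df_t))  # 6.0, 4.0, 3.0, 2.0, 1.0, 0.0
```

A term appearing in every document ("the") gets idf 0, so it contributes nothing to a document's score.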

TF-IDF Summary
Assign a tf-idf weight to each term t in a document d:

    tfidf(t, d) = (1 + log(tf_{t,d})) * log(|corpus| / df_t)

The weight increases with the number of occurrences of the term in the document, and increases with the rarity of the term across the entire corpus. Note the three different metrics: term frequency (tf), document frequency (df), and collection/corpus frequency (cf).
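The weighting formula above translates directly to code. A minimal sketch, assuming base-10 logarithms (the slides leave the base unspecified) and a weight of 0 for a term absent from the document:

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """tfidf(t, d) = (1 + log10(tf_{t,d})) * log10(N / df_t), or 0 if tf is 0."""
    if tf == 0:
        return 0.0  # the term does not occur in the document
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term occurring 10 times in the document, present in 100 of 1,000,000 docs:
print(tfidf(10, 100, 1_000_000))  # (1 + 1) * 4 = 8.0
```

The two factors pull in different directions: the tf factor rewards density within the document, the idf factor rewards rarity across the corpus.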

Now, real-valued term-document matrices
Bag of words model: each element of the matrix is a tf-idf value.

                Antony and   Julius   The
                Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
    Antony         13.1       11.4      0.0       0.0      0.0       0.0
    Brutus          3.0        8.3      0.0       1.0      0.0       0.0
    Caesar          2.3        2.3      0.0       0.5      0.3       0.3
    Calpurnia       0.0       11.2      0.0       0.0      0.0       0.0
    Cleopatra      17.7        0.0      0.0       0.0      0.0       0.0
    mercy           0.5        0.0      0.7       0.9      0.9       0.3
    worser          1.2        0.0      0.6       0.6      0.6       0.0

Vector Space Scoring
That is a nice matrix, but how does it relate to scoring? Next: vector space scoring.

Vector Space Scoring: Vector Space Model
Define: Vector Space Model: representing a set of documents as vectors in a common vector space. It is fundamental to many operations:
(query, document) pair scoring
document classification
document clustering
Queries are represented as documents too: short ones, but mathematically equivalent.

Vector Space Scoring: Vector Space Model
A document, d, is defined as a vector, V(d), with one component for each term in the dictionary. Assume each component is the term's tf-idf score:

    V(d)_t = (1 + log(tf_{t,d})) * log(|corpus| / df_t)

A corpus is many such vectors together. A document can be thought of as a point in a multidimensional space, with one axis per term.
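Building such a document vector is mechanical once the dictionary and df counts are fixed. A minimal sketch, with an invented three-term dictionary and hypothetical df values over a six-document corpus (base-10 logs assumed):

```python
import math

def doc_vector(tokens, dictionary, df, n_docs):
    """One tf-idf component per dictionary term; 0.0 for terms absent from the doc."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    vec = []
    for t in dictionary:
        tf = counts.get(t, 0)
        if tf == 0:
            vec.append(0.0)
        else:
            vec.append((1 + math.log10(tf)) * math.log10(n_docs / df[t]))
    return vec

# Hypothetical toy data, not from the slides:
dictionary = ["antony", "brutus", "caesar"]
df = {"antony": 2, "brutus": 3, "caesar": 4}
v = doc_vector(["antony", "antony", "caesar"], dictionary, df, n_docs=6)
# v has one component per dictionary term; "brutus" is absent, so v[1] == 0.0
```

A query is vectorized the same way, which is what makes (query, document) scoring possible in this common space.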

Vector Space Scoring: Vector Space Model
Recall our Shakespeare example: the columns of the matrix are the document vectors V(d_1), V(d_2), ..., V(d_6).

                Antony and   Julius   The
                Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
    Antony         13.1       11.4      0.0       0.0      0.0       0.0
    Brutus          3.0        8.3      0.0       1.0      0.0       0.0
    Caesar          2.3        2.3      0.0       0.5      0.3       0.3
    Calpurnia       0.0       11.2      0.0       0.0      0.0       0.0
    Cleopatra      17.7        0.0      0.0       0.0      0.0       0.0
    mercy           0.5        0.0      0.7       0.9      0.9       0.3
    worser          1.2        0.0      0.6       0.6      0.6       0.0