CAIM: Cerca i Anàlisi d'Informació Massiva (Search and Analysis of Massive Information)

FIB, Grau en Enginyeria Informàtica (Degree in Informatics Engineering)
Slides by Marta Arias, José Balcázar, Ricard Gavaldà
Department of Computer Science, UPC. Fall 2016.
http://www.cs.upc.edu/~caim

Information Retrieval Models

Information Retrieval Models, I
Setting the stage to think about IR

What is an Information Retrieval Model? We need to clarify a proposal for:
- a logical view of documents (what information is stored/indexed about each document?),
- a query language (what kinds of queries will be allowed?), and
- a notion of relevance (how is each document handled, given a query?).

Information Retrieval Models, II
A couple of IR models

Focus for this course:
- Boolean model: Boolean queries, exact answers; extension: phrase queries.
- Vector model: weights on terms and documents; similarity queries, approximate answers, ranking.

Boolean Model of Information Retrieval
Relevance assumed binary

Documents: a document is completely identified by the set of terms it contains. The order of occurrences is considered irrelevant, and so is the number of occurrences (but a closely related model, called bag-of-words or BoW, does consider the number of occurrences relevant).

Thus, for a set of terms T = {t_1, ..., t_|T|}, a document is just a subset of T. Each document can be seen as a bit vector of length |T|, d = (d_1, ..., d_|T|), where d_i = 1 if and only if t_i appears in d or, equivalently, d_i = 0 if and only if t_i does not appear in d.
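As a minimal illustration (ours, not part of the original slides), the following Python sketch builds this bit-vector view of a document over a fixed vocabulary; the vocabulary order is the one used in the tables of the examples below.

    vocab = ["five", "four", "one", "six", "three", "two"]  # fixed term order

    def bitvector(text):
        """Boolean model: component i is 1 iff term t_i occurs in the document;
        order and number of occurrences are ignored."""
        terms = set(text.split())
        return [1 if t in terms else 0 for t in vocab]

    print(bitvector("one three four five five five"))  # document d3 below: [1, 1, 1, 0, 1, 0]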

Queries in the Boolean Model, I
Boolean queries, exact answers

- Atomic query: a single term. The answer is the set of documents that contain it.
- Combining queries: OR and AND operate as union and intersection of the answer sets.
- Set difference: t_1 BUTNOT t_2 is t_1 AND NOT t_2. Motivation: a bare negation would match almost the whole collection, so BUTNOT avoids unmanageably large answer sets.
- In Lucene: +/- signs on query terms, Boolean operators.

Queries in the Boolean Model, II
A close relative of propositional logic

Analogy:
- Terms act as propositional variables; documents act as propositional models.
- A document is relevant for a term if it contains the term, that is, if, seen as a propositional model, it satisfies the variable.
- Queries are propositional formulas (with the syntactic condition of avoiding global negation).
- A document is relevant for a query if, as a propositional model, it satisfies the propositional formula.

Example, I
A very simple toy case

Consider 7 documents with a vocabulary of 6 terms:

d1 = one three
d2 = two two three
d3 = one three four five five five
d4 = one two two two two three six six
d5 = three four four four six
d6 = three three three six six
d7 = four five

Example, II
Our documents in the Boolean model

        five  four   one   six three   two
d1 = [     0     0     1     0     1     0 ]
d2 = [     0     0     0     0     1     1 ]
d3 = [     1     1     1     0     1     0 ]
d4 = [     0     0     1     1     1     1 ]
d5 = [     0     1     0     1     1     0 ]
d6 = [     0     0     0     1     1     0 ]
d7 = [     1     1     0     0     0     0 ]

(Invent some queries and compute their answers!)
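A small Python sketch (again ours, assuming the documents are stored as plain strings) that answers Boolean queries over this toy collection by set operations; answer is a hypothetical helper name.

    docs = {
        "d1": "one three",
        "d2": "two two three",
        "d3": "one three four five five five",
        "d4": "one two two two two three six six",
        "d5": "three four four four six",
        "d6": "three three three six six",
        "d7": "four five",
    }
    doc_terms = {name: set(text.split()) for name, text in docs.items()}

    def answer(term):
        """Atomic query: the set of documents containing the term."""
        return {name for name, terms in doc_terms.items() if term in terms}

    # Combined queries operate on the answer sets.
    print(sorted(answer("one") & answer("three")))   # one AND three    -> ['d1', 'd3', 'd4']
    print(sorted(answer("five") | answer("six")))    # five OR six      -> ['d3', 'd4', 'd5', 'd6', 'd7']
    print(sorted(answer("three") - answer("two")))   # three BUTNOT two -> ['d1', 'd3', 'd5', 'd6']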

Queries in the Boolean Model, III
No ranking of answers

Answers are not quantified: a document either matches the query (it is fully relevant) or does not match it (it is fully irrelevant). Depending on user needs and the application, this feature may be good or bad.

Phrase Queries, I
Slightly beyond the Boolean model

Phrase queries: conjunction plus adjacency, that is, the ability to answer with the set of documents that contain the terms of the query consecutively. A user querying "Keith Richards" may not want a document that mentions both Keith Emerson and Emil Richards. This requires extending the notion of basic query to include adjacency.

Phrase Queries, II
Options to hack them in

Options:
- Run the query as a conjunctive query, then double-check the whole answer set to filter out non-adjacent occurrences. This option may be very slow when there are many false positives.
- Keep dedicated information in the index about adjacency of any two terms in a document (e.g., term positions), as sketched below.
- Keep dedicated index information about a chosen set of interesting pairs of words.
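A minimal sketch of the positional-index option (ours; the slides do not fix an implementation): for each term we keep, per document, the list of positions where it occurs, and a two-word phrase matches when two positions are consecutive.

    from collections import defaultdict

    docs = {
        "d1": "one three",
        "d3": "one three four five five five",
        "d4": "one two two two two three six six",
    }

    # Positional index: term -> {document -> positions of the term in that document}.
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][name].append(pos)

    def phrase(t1, t2):
        """Documents where t1 occurs immediately followed by t2."""
        hits = set()
        for name in index[t1].keys() & index[t2].keys():
            pos2 = set(index[t2][name])
            if any(p + 1 in pos2 for p in index[t1][name]):
                hits.add(name)
        return hits

    print(sorted(phrase("one", "three")))  # ['d1', 'd3']; not d4, where the terms are not adjacent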

Vector Space Model of Information Retrieval, I
Basis of all successful approaches

- The order of words is still irrelevant, but frequency is relevant.
- Not all words are equally important.
- For a set of terms T = {t_1, ..., t_|T|}, a document is a vector d = (w_1, ..., w_|T|) of floats instead of bits, where w_i is the weight of t_i in d.

Vector Space Model of Information Retrieval, II
Moving to vector space

A document is now a vector in R^|T|. The document collection conceptually becomes a terms × documents matrix, but we never compute that matrix explicitly. Queries may also be seen as vectors in R^|T|.

The tf-idf scheme
A way to assign weight vectors to documents

Two principles:
- The more frequent a term t is in a document d, the higher its weight in d should be.
- The more frequent t is in the whole collection, the less it discriminates among documents, so the lower its weight should be in all documents.

The tf-idf scheme, II
The formula

A document is a vector of weights d = [w_{d,1}, ..., w_{d,i}, ..., w_{d,|T|}]. Each weight is the product of two factors:

    w_{d,i} = tf_{d,i} · idf_i

The term frequency factor is

    tf_{d,i} = f_{d,i} / max_j f_{d,j}

where f_{d,j} is the frequency of t_j in d. And the inverse document frequency factor is

    idf_i = log_2 (D / df_i)

where D is the number of documents and df_i is the number of documents that contain term t_i.

Example, I

        five  four   one   six three   two   maxf
d1 = [     0     0     1     0     1     0 ]    1
d2 = [     0     0     0     0     1     2 ]    2
d3 = [     3     1     1     0     1     0 ]    3
d4 = [     0     0     1     2     1     4 ]    4
d5 = [     0     3     0     1     1     0 ]    3
d6 = [     0     0     0     2     3     0 ]    3
d7 = [     1     1     0     0     0     0 ]    1

df = [     2     3     3     3     6     2 ]

Example, II

df = [ 2 3 3 3 6 2 ]

For d3 = [ 3 1 1 0 1 0 ], with maxf = 3, the weight vector is

d3 = [ (3/3)·log2(7/2)  (1/3)·log2(7/3)  (1/3)·log2(7/3)  (0/3)·log2(7/3)  (1/3)·log2(7/6)  (0/3)·log2(7/2) ]
   = [ 1.81  0.41  0.41  0  0.07  0 ]

For d4 = [ 0 0 1 2 1 4 ], with maxf = 4, the weight vector is

d4 = [ (0/4)·log2(7/2)  (0/4)·log2(7/3)  (1/4)·log2(7/3)  (2/4)·log2(7/3)  (1/4)·log2(7/6)  (4/4)·log2(7/2) ]
   = [ 0  0  0.31  0.61  0.06  1.81 ]
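The computation can be checked mechanically; this Python sketch (ours) recomputes the tf-idf vectors of the toy collection and reproduces the d3 and d4 rows above.

    import math

    docs = {
        "d1": "one three",
        "d2": "two two three",
        "d3": "one three four five five five",
        "d4": "one two two two two three six six",
        "d5": "three four four four six",
        "d6": "three three three six six",
        "d7": "four five",
    }
    vocab = ["five", "four", "one", "six", "three", "two"]  # column order of the tables
    D = len(docs)

    # Raw term frequencies f_{d,i} and document frequencies df_i.
    freq = {name: {t: text.split().count(t) for t in vocab} for name, text in docs.items()}
    df = {t: sum(1 for f in freq.values() if f[t] > 0) for t in vocab}

    def tfidf(name):
        f = freq[name]
        maxf = max(f.values())
        # w_{d,i} = (f_{d,i} / max_j f_{d,j}) * log2(D / df_i)
        return [round((f[t] / maxf) * math.log2(D / df[t]), 2) for t in vocab]

    print(tfidf("d3"))  # [1.81, 0.41, 0.41, 0.0, 0.07, 0.0]
    print(tfidf("d4"))  # [0.0, 0.0, 0.31, 0.61, 0.06, 1.81]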

Similarity of Documents in the Vector Space Model
The cosine similarity measure

Similar vectors may happen to have very different sizes, so we had better compare only their directions. Equivalently, we normalize them to the same Euclidean length before comparing them:

    sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = (d1/|d1|) · (d2/|d2|)

where v · w = sum_i v_i w_i, and |v| = sqrt(v · v) = sqrt(sum_i v_i^2).

Our weights are all nonnegative; therefore, all cosines/similarities are between 0 and 1.

Cosine similarity, Example

d3 = [ 1.81  0.41  0.41  0  0.07  0 ]
d4 = [ 0  0  0.31  0.61  0.06  1.81 ]

Then |d3| = 1.898, |d4| = 1.933, d3 · d4 = 0.13, and sim(d3, d4) = 0.035 (i.e., a small similarity).
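And a direct sketch of the similarity computation (ours); note that feeding it the 2-decimal rounded weights yields 0.036, which matches the 0.035 above up to rounding.

    import math

    def cosine(v, w):
        """Cosine similarity: the dot product divided by the product of the norms."""
        dot = sum(vi * wi for vi, wi in zip(v, w))
        return dot / (math.sqrt(sum(vi * vi for vi in v)) * math.sqrt(sum(wi * wi for wi in w)))

    d3 = [1.81, 0.41, 0.41, 0.0, 0.07, 0.0]  # rounded tf-idf vectors from the example
    d4 = [0.0, 0.0, 0.31, 0.61, 0.06, 1.81]
    print(round(cosine(d3, d4), 3))  # 0.036 on the rounded weights; 0.035 on the exact ones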

Query Answering

- Queries can be transformed into vectors too: sometimes with tf-idf weights, often with binary weights.
- sim(doc, query) lies in [0, 1].
- Answer: the list of documents sorted by decreasing similarity (see the sketch below).
- We will also find uses for comparing documents to each other, sim(d1, d2).
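Finally, a sketch of ranked query answering (ours), reusing vocab from the first sketch, docs and tfidf from the tf-idf sketch, and cosine from the previous one; the query is mapped to a binary vector, as the slide suggests.

    def rank(query):
        """Sort documents by decreasing cosine similarity to a binary query vector.
        Assumes the query contains at least one vocabulary term."""
        q = [1.0 if t in query.split() else 0.0 for t in vocab]
        scores = {name: cosine(tfidf(name), q) for name in docs}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    for name, score in rank("five two"):
        print(name, round(score, 3))  # d2 ranks first: "two" is frequent there and rare overall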