Querying. 1st Semester 2008/2009


Querying. Departamento de Engenharia Informática, Instituto Superior Técnico. 1st Semester 2008/2009

The ranking function:

    sim(d_j, q) = (1 / (W_d · W_q)) · Σ_{i=1}^{t} w_{i,j} · w_{i,q}    with    w_{i,j} = f_{i,j} · idf_i

where W_d is the document norm and W_q is the query norm (irrelevant for ranking).

Implementation Problems
- Processing the documents is unfeasible
- Processing the index is expensive:
  - the index needs to store f_{d,t} (can be stored in a compressed format)
  - the inverted lists can be huge
- We need only the top r documents from a sorted list of N documents, where r ≪ N

An Example Inverted File

    Num.  Term          Addr.   Inverted list
    1     best          0000    (1;1), (2;1), (5;1)
    2     expedient     0024    (5;1)
    3     government    0032    (1;1), (2;1), (5;1), (6;1)
    4     governs       0064    (1;1), (2;1)
    5     inexpedient   0080    (6;1)
    6     least         0088    (1;1)
    7     machines      0096    (3;1)
    8     manufactured  0104    (4;1)
    9     mass          0108    (3;1)
    10    men           0116    (3;2), (4;1)
    11    purpose       0132    (4;1)
    12    serve         0140    (3;1), (4;1)
    13    state         0156    (3;1)
    14    wooden        0164    (4;1)

Ranking Algorithm
1. Set A ← {}
2. For each query term t ∈ Q:
   1. Stem t
   2. Search the lexicon
   3. Get f_t and the address of I_t
   4. Set idf_t ← log(N / f_t)
   5. Read the inverted list I_t
   6. For each (d, f_{d,t}) pair in I_t:
      1. If A_d ∉ A then set A_d ← 0 and set A ← A + {A_d}
      2. Set A_d ← A_d + f_{d,t} · idf_t
3. For each A_d ∈ A, set A_d ← A_d / W_d
4. For 1 ≤ i ≤ r:
   1. Select d such that A_d = max{A}
   2. Look up the address of d
   3. Retrieve d
   4. Set A ← A − {A_d}
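As a sketch, the ranking algorithm above might look as follows in Python. The toy index reuses a few terms from the example inverted file; the document norms W are placeholder values and stemming is omitted:

```python
import math

# Toy data modelled on the example inverted file (a subset of its terms);
# the norms W are placeholders, not values from the slides.
index = {
    "best":       [(1, 1), (2, 1), (5, 1)],
    "government": [(1, 1), (2, 1), (5, 1), (6, 1)],
    "men":        [(3, 2), (4, 1)],
}
N = 6                                    # total number of documents
W = {d: 1.0 for d in range(1, N + 1)}    # document norms (placeholder)

def rank(query_terms, r):
    """Accumulator-based TF-IDF ranking, following the algorithm above."""
    A = {}                               # accumulators: doc_id -> score
    for t in query_terms:                # (stemming omitted)
        postings = index.get(t, [])
        if not postings:
            continue
        idf_t = math.log(N / len(postings))
        for d, f_dt in postings:         # step 2.6: accumulate f_dt * idf_t
            A[d] = A.get(d, 0.0) + f_dt * idf_t
    for d in A:                          # step 3: normalise by the norm W_d
        A[d] /= W[d]
    return sorted(A, key=A.get, reverse=True)[:r]   # step 4: top r docs

print(rank(["government", "men"], 2))    # → [3, 4]
```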

Main Problems
- How to efficiently obtain the document norms?
- How to efficiently process the inverted lists?
- How to efficiently select the top-k documents?

The Problem
- All in memory: too expensive (e.g. 5×10^6 docs ≈ 20 MB)
- All on disk: too slow (e.g. it can take several seconds to read all the norms)

The Solution
- Use approximations
- Read only selected document weights

Approximate Norms
- Norms are real numbers, needing 4 to 8 bytes of storage
- Real numbers can be approximated by b-bit values

Approximate norm: using b bits to approximate x, such that L ≤ x ≤ U:

    c = ⌊ (x − L) / ((U − L + ε) / 2^b) ⌋

where c is the code for x.

An Example
Consider 10 ≤ x ≤ 18, b = 2, ε = 0.1:
- For x = 15.3: c = ⌊(15.3 − 10) / (8.1 / 4)⌋ = ⌊2.617⌋ = 2 = 10
- For x = 12.4: c = ⌊(12.4 − 10) / (8.1 / 4)⌋ = ⌊1.185⌋ = 1 = 01
- For x = 17.9: c = ⌊(17.9 − 10) / (8.1 / 4)⌋ = ⌊3.901⌋ = 3 = 11
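A minimal Python sketch of the encoding rule above, run on the example's interval, b, and ε:

```python
import math

def encode(x, L, U, b, eps=0.1):
    """b-bit code for x in [L, U]: c = floor((x - L) / ((U - L + eps) / 2**b))."""
    bucket = (U - L + eps) / 2 ** b      # width of each of the 2**b buckets
    return math.floor((x - L) / bucket)

# The example above: 10 <= x <= 18, b = 2, eps = 0.1
for x in (15.3, 12.4, 17.9):
    print(x, encode(x, 10, 18, 2))       # codes 2, 1, 3 (binary 10, 01, 11)
```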

Approximation Error
- The previous approximation assumes we are distributing the norms over equal-sized buckets
- However, some norms occur more frequently than others: short documents are more frequent than long documents
- Small values introduce higher error
- Thus, we need more precision for short documents

Relative Error: Geometric-Sized Buckets
Using b bits to approximate x, such that L ≤ x ≤ U:

    B = ((U + ε) / L)^(1 / 2^b)
    c = f(x) = ⌊ log_B(x / L) ⌋ = ⌊ log(x / L) / log B ⌋

Range of values for x: g(c) ≤ x < g(c + 1), where g(c) = L · B^c.

An Example
Consider 10 ≤ x ≤ 18, b = 2, ε = 0.1:

    B = (18.1 / 10.0)^(1/4) = 1.160

If x = 15.3: c = f(15.3) = ⌊ log_1.16(15.3 / 10.0) ⌋ = 2 = 10
Range for c = 2: 13.456 ≤ x < 15.610
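The geometric scheme can be sketched the same way. Note the example above rounds B to 1.160 before computing the bucket bounds, so the unrounded bounds below differ from 13.456 and 15.610 in the third decimal:

```python
import math

def geo_encode(x, L, U, b, eps=0.1):
    """Geometric code: B = ((U + eps) / L) ** (1 / 2**b), c = floor(log_B(x / L))."""
    B = ((U + eps) / L) ** (1 / 2 ** b)
    return math.floor(math.log(x / L) / math.log(B))

def g(c, L, U, b, eps=0.1):
    """Lower bound of bucket c: g(c) = L * B**c."""
    B = ((U + eps) / L) ** (1 / 2 ** b)
    return L * B ** c

# The example above: 10 <= x <= 18, b = 2, eps = 0.1
c = geo_encode(15.3, 10, 18, 2)          # c = 2 (binary 10)
print(c, g(c, 10, 18, 2), g(c + 1, 10, 18, 2))
```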

Using the Approximate Weights
(Steps 3–4 of the ranking algorithm are replaced as follows.)
3. For 1 ≤ d ≤ N, set A_d ← A_d / g(c_d)
4. Set H ← the top r values of A_d
5. For d ∈ H:
   1. Read W_d from disk
   2. Get the address of document d
   3. Set A_d ← A_d · g(c_d) / W_d
6. For 1 ≤ d ≤ N:
   1. If A_d ∉ H and A_d > min{H} then:
      1. Read W_d from disk
      2. Set A_d ← A_d · g(c_d) / W_d
      3. If A_d > min{H} then set H ← H − {min{H}} + {A_d} and get the address of document d
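The two-phase idea (cheap approximate scoring first, exact norms read from disk only for promising documents) might be sketched as below. The data, and the use of g(c_d) as a lower bound on the norm, are hypothetical illustrations, not values from the slides:

```python
import heapq

def rerank(A, g_cd, W, r):
    """A: doc -> accumulated score; g_cd: doc -> approximate norm g(c_d)
    (a lower bound, so A/g overestimates the true score); W: doc -> exact
    norm, looked up lazily to stand in for a disk read."""
    approx = {d: A[d] / g_cd[d] for d in A}              # cheap scores
    H = dict(heapq.nlargest(r, approx.items(), key=lambda kv: kv[1]))
    exact = {d: A[d] / W[d] for d in H}                  # exact scores, top r
    for d, s in approx.items():                          # promote near-misses
        if d not in H and s > min(exact.values()):
            e = A[d] / W[d]                              # "read W_d from disk"
            if e > min(exact.values()):
                worst = min(exact, key=exact.get)
                del exact[worst]
                exact[d] = e
    return sorted(exact, key=exact.get, reverse=True)

A    = {1: 2.0, 2: 1.5, 3: 1.0}
g_cd = {1: 0.9, 2: 0.9, 3: 0.35}
W    = {1: 1.0, 2: 1.0, 3: 0.4}
print(rerank(A, g_cd, W, 2))             # → [3, 1]; doc 2 never touches disk
```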

Accumulators
- One accumulator per document may be too expensive
- Solution: use pruning strategies, for example:
  - process terms with higher weights first
  - stop creating accumulators when the weight is below a threshold

Frequency-Sorted Indexes
Order the inverted lists by f_{d,t}, followed by d:

    5; (1, 2), (2, 2), (3, 5), (4, 1), (5, 2)
    5; (3, 5), (1, 2), (2, 2), (5, 2), (4, 1)

Advantage: allows easy access to the most important terms

Frequency-Sorted Indexes
Lists are grouped by frequencies:

    5; (3, 5), (1, 2), (2, 2), (5, 2), (4, 1)
    (5, 1 : 3), (2, 3 : 1, 2, 5), (1, 1 : 4)

- d-gap coding can be used within each block
- frequency gaps can also be coded

Processing the Lists
1. Lists are processed in parallel, one block at a time
2. The block with the highest TF×IDF value is always processed first
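The grouping step can be sketched in a few lines of Python (the posting list is the example above; gap coding is omitted):

```python
from itertools import groupby

def to_blocks(postings):
    """Group (doc, f) postings, already sorted by decreasing f, into
    (f, count, [docs]) blocks as in the layout above."""
    blocks = []
    for f, group in groupby(postings, key=lambda p: p[1]):
        docs = [d for d, _ in group]     # docs within a block keep their order
        blocks.append((f, len(docs), docs))
    return blocks

# The frequency-sorted list from the example above
postings = [(3, 5), (1, 2), (2, 2), (5, 2), (4, 1)]
print(to_blocks(postings))   # → [(5, 1, [3]), (2, 3, [1, 2, 5]), (1, 1, [4])]
```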

Processing Frequency-Sorted Indexes

Advantages:
- More accuracy: if a cut threshold is used, the lost documents will have small importance
- Less processing: the larger blocks (those with low frequencies) have a smaller chance of being processed
- Less disk transfer: lists can be read one block at a time
- No loss in retrieval effectiveness: experiments show that it does not lose, and sometimes improves, the results

Drawback:
- Not appropriate for Boolean queries

The Problem
- Sorting all documents requires N log N operations
- For large collections this implies several seconds
- However, we are only interested in the top r documents, where r ≪ N

The Solution
- Use a heap data structure

Selecting the Top r Documents
1. Set H ← {}
2. For 1 ≤ d ≤ r:
   1. Set A_d ← A_d / W_d
   2. Get the address of document d
   3. Set H ← H + {A_d}
3. Build H into a heap
4. For r + 1 ≤ d ≤ N:
   1. Set A_d ← A_d / W_d
   2. If A_d > min{H} then:
      1. Set H ← H − {min{H}} + {A_d}
      2. Sift H
      3. Get the address of document d
5. For 1 ≤ i ≤ r:
   1. Select d such that A_d = max{H}
   2. Retrieve d
   3. Set H ← H − {A_d}
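A runnable sketch of this selection using Python's heapq module; the scores stand in for the normalised accumulators A_d, and the bookkeeping of document addresses is omitted:

```python
import heapq

def top_r(scores, r):
    """Keep the r largest scores with a size-r min-heap instead of a full sort."""
    H = scores[:r]
    heapq.heapify(H)                     # step 3: build H into a heap
    for s in scores[r:]:                 # step 4: scan the remaining scores
        if s > H[0]:                     # s > min{H}: replace the minimum, sift
            heapq.heapreplace(H, s)
    return sorted(H, reverse=True)       # step 5: extract in decreasing order

print(top_r([0.2, 0.9, 0.1, 0.7, 0.5, 0.8], 3))   # → [0.9, 0.8, 0.7]
```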

Algorithm Complexity
- Worst case: 2r + (N − r) + 2(N − r) log r + r log r
- Expected value: 2r + N + 1.4 · r log r · log(N/r) + r log r

An Example
For N = 1,000,000 and r = 100:
- Full sort: 20,000,000 comparisons
- Heap: 1,013,000 comparisons
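The example figures can be reproduced numerically (assuming base-2 logarithms, which matches the slide's counts):

```python
import math

N, r = 1_000_000, 100

full_sort = N * math.log2(N)             # full sort: ~20 million comparisons
heap = (2 * r + N
        + 1.4 * r * math.log2(r) * math.log2(N / r)
        + r * math.log2(r))              # heap, expected: ~1.013 million
print(round(full_sort), round(heap))
```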

Questions?