Information Retrieval: Efficient Scoring and Ranking
Jörg Tiedemann (jorg.tiedemann@lingfil.uu.se)
Department of Linguistics and Philology, Uppsala University

Outline for today
- Recap on ranked retrieval
- Efficient scoring and ranking
- IR system architecture

tf-idf weighting
- product of term frequency (tf) and inverse document frequency (idf):
  w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t}
- best-known weighting scheme in IR
- increases with
  - the number of occurrences within a document
  - the rarity of the term in the collection

Cosine similarity between query and document
- similarity(q, d) = \cos(\theta), where \theta is the angle between the query vector and the document vector
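To make the weighting concrete, here is a minimal sketch of tf-idf term weighting in Python. The toy corpus, the pre-tokenized input, and the function name are illustrative assumptions (not from the lecture); base-10 logarithms are used, as is common in IR textbooks.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t) for each document."""
    N = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
                        for t in tf})
    return weights

# toy collection of pre-tokenized documents (illustrative only)
docs = [["catcher", "in", "the", "rye"],
        ["the", "old", "man", "and", "the", "sea"],
        ["rye", "bread", "recipe"]]
print(tfidf_weights(docs)[0])
```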

Cosine similarity between query and document
- Euclidean dot product: \vec{q} \cdot \vec{d} = |\vec{q}| \, |\vec{d}| \cos(\theta)
- We can use the dot product of normalized unit vectors:
  \cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \, |\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}
- No matching keyword → \cos(\theta) = 0

Efficient scoring
Task: Find the k most relevant documents given a query
- Use an inverted index to reduce the search space
- Space-efficient storage of weights:
  - plain term frequencies in postings instead of log values
  - plain idf values in term dictionaries
  - allows varying the weighting scheme
- We don't need a complete ranking of all matching documents:
  - binary min heap for efficient top-k selection
  - inexact top-k retrieval (search heuristics)

Binary min heap for selecting top k
- binary min heap = binary tree in which each node's value is less than the values of its children
- Why parent nodes with values less than children nodes?

Inexact top-k retrieval
- bottleneck: cosine computation for all possible candidates
- use search heuristics
- General idea:
  - find a set A of contenders with k < |A| << N
  - return the top k documents in A
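A compact sketch of both ideas above: cosine scoring over sparse term-weight vectors and top-k selection with a size-k min heap (via Python's heapq, where the root holds the current k-th best score and is evicted when a better score arrives). The vectors and scores are made-up illustrations.

```python
import heapq
import math

def cosine(q, d):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = math.sqrt(sum(w * w for w in q.values())) * \
           math.sqrt(sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Keep only the k best (score, doc_id) pairs in a min heap of size k."""
    heap = []  # the smallest of the current top-k sits at heap[0]
    for doc_id, vec in doc_vecs.items():
        score = cosine(query_vec, vec)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

# illustrative toy data
docs = {"d1": {"rye": 1.2, "catcher": 0.8}, "d2": {"bread": 1.0, "rye": 0.5}}
print(top_k({"catcher": 1.0, "rye": 1.0}, docs, k=1))
```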

Inexact top-k retrieval
- selecting A = pruning non-contenders
  - index elimination
  - champion lists
  - static quality scores
  - impact ordering
  - cluster pruning

Index elimination
- only consider high-idf query terms
  - query: "catcher in the rye"
  - accumulate scores for "catcher" and "rye" only
- only consider docs containing many query terms
  - multi-term queries: score only docs that contain at least a fixed proportion of the query terms (e.g., 3 out of 4) → soft conjunction (early Google)
  - easy to implement in postings traversal

Champion lists
- For each term in the dictionary: pre-compute the r documents of highest weight in the term's postings → champion lists
- At query time: only compute scores for documents in the union of the champion lists of all query terms
- r is chosen at indexing time
- might use a different r for each term
- (see the code sketch below)

Static quality scores
Idea 2: reorder posting lists according to expected relevance
- query-independent quality of documents (authority)
- What is a good indication of quality?
  - a paper with many citations
  - many bookmarks (del.icio.us, ...)
  - PageRank (!)
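A minimal sketch of champion-list construction and querying, assuming the index is a map from term to (doc_id, weight) postings; the helper names and the toy postings are illustrative, not from the lecture.

```python
def build_champion_lists(postings, r):
    """For each term keep only the r postings with the highest weight."""
    return {term: sorted(plist, key=lambda p: p[1], reverse=True)[:r]
            for term, plist in postings.items()}

def candidate_docs(query_terms, champions):
    """Union of the champion lists of all query terms."""
    docs = set()
    for term in query_terms:
        docs.update(doc_id for doc_id, _ in champions.get(term, []))
    return docs

# illustrative postings: term -> [(doc_id, tf-idf weight), ...]
postings = {"rye": [("d1", 2.1), ("d7", 1.4), ("d3", 0.2)],
            "catcher": [("d1", 1.9), ("d5", 0.8)]}
champions = build_champion_lists(postings, r=2)
# full cosine scores are then computed only for these candidates
print(candidate_docs(["catcher", "rye"], champions))
```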

PageRank
- model: likelihood that a random surfer arrives at page B
- Markov chain model: web = probabilistic directed connected graph
- web page = state, N x N probability transition matrix (links)
- PageRank = long-term visit rate = steady-state probability
- pre-computed, query-independent document scores!

Link Analysis - Example Graph
Link matrix:

      d0  d1  d2  d3  d4  d5  d6
  d0   0   0   1   0   0   0   0
  d1   0   1   1   0   0   0   0
  d2   1   0   1   1   0   0   0
  d3   0   0   0   1   1   0   0
  d4   0   0   0   0   0   0   1
  d5   0   0   0   0   0   1   1
  d6   0   0   0   1   1   0   1

Probability matrix P (rows normalized):

      d0    d1    d2    d3    d4    d5    d6
  d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
  d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
  d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
  d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
  d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
  d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
  d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33

Avoid dead ends: add a teleportation rate (e.g., a 14% chance to jump to any random page).

Link Analysis - PageRank
Transition matrix with teleportation:

      d0    d1    d2    d3    d4    d5    d6
  d0  0.02  0.02  0.88  0.02  0.02  0.02  0.02
  d1  0.02  0.45  0.45  0.02  0.02  0.02  0.02
  d2  0.31  0.02  0.31  0.31  0.02  0.02  0.02
  d3  0.02  0.02  0.02  0.45  0.45  0.02  0.02
  d4  0.02  0.02  0.02  0.02  0.02  0.02  0.88
  d5  0.02  0.02  0.02  0.02  0.02  0.45  0.45
  d6  0.02  0.02  0.02  0.31  0.31  0.02  0.31

- x_{t,i} = chance of being in state i at time t during the random walk
- for all documents: \vec{x}_t = (x_{t,1}, x_{t,2}, ..., x_{t,N})
- steady-state probabilities: \vec{x} = \vec{\pi} = (\pi_1, \pi_2, ..., \pi_N) = \vec{\pi} P
- task: find the steady-state probabilities \vec{\pi} → power method (iterative procedure)
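A small sketch of how the two matrices above relate: row-normalize the link matrix, then mix in a uniform teleportation probability. The 14% teleportation rate follows the example; the function name and code are an illustration, not from the slides.

```python
import numpy as np

def transition_matrix(links, teleport=0.14):
    """Row-normalize a 0/1 link matrix and add uniform teleportation."""
    links = np.asarray(links, dtype=float)
    N = links.shape[0]
    P = links / links.sum(axis=1, keepdims=True)   # probability matrix P
    return (1 - teleport) * P + teleport / N        # mix in random jumps

links = [[0,0,1,0,0,0,0],
         [0,1,1,0,0,0,0],
         [1,0,1,1,0,0,0],
         [0,0,0,1,1,0,0],
         [0,0,0,0,0,0,1],
         [0,0,0,0,0,1,1],
         [0,0,0,1,1,0,1]]
print(transition_matrix(links).round(2))  # reproduces the teleportation-adjusted table
```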

Link Analysis - Compute PageRanks
Power method to find \vec{\pi} = \vec{\pi} P:
1. start with a random distribution \vec{x}_0 = (x_{0,1}, x_{0,2}, ..., x_{0,N})
2. at step t compute \vec{x}_t = \vec{x}_{t-1} P:
   x_{t,k} = \sum_{i=1}^{N} x_{t-1,i} P_{i,k}
3. continue with step 2 until convergence (\vec{\pi} = \vec{x}_m = \vec{x}_0 P^m)

How to integrate static quality scores
- assign a quality score g(d) to each document d (e.g., PageRank)
- combine with the relevance score cos(q, d):
  net-score(q, d) = g(d) + cos(q, d)
- might use other types of combination
- return the top k documents according to net-score
- How does this help to make retrieval more efficient?

Static quality scores
- postings are ordered by g(d) (still a consistent order!)
- traverse postings and compute scores
- early termination is possible
  - stop if the minimal score cannot be improved
  - time threshold
  - threshold for the goodness score
- can be combined with champion lists

Other ideas
"High and low lists":
- for each term:
  - high list (= champion list)
  - low list (the other documents)
- use high lists first
- use low lists only if fewer than k documents are found
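A runnable sketch of the power method on the example graph; the tolerance and iteration limit are illustrative choices, and the link matrix is rebuilt here (with 14% teleportation) so the snippet stands on its own.

```python
import numpy as np

def pagerank(P, tol=1e-12, max_iter=1000):
    """Power method: iterate x_t = x_{t-1} P until the distribution stops changing."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)            # start with the uniform distribution
    for _ in range(max_iter):
        x_next = x @ P                 # x_{t,k} = sum_i x_{t-1,i} * P_{i,k}
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# the example graph again, with teleportation mixed in (rate 0.14)
links = np.array([[0,0,1,0,0,0,0],
                  [0,1,1,0,0,0,0],
                  [1,0,1,1,0,0,0],
                  [0,0,0,1,1,0,0],
                  [0,0,0,0,0,0,1],
                  [0,0,0,0,0,1,1],
                  [0,0,0,1,1,0,1]], dtype=float)
P = 0.86 * links / links.sum(axis=1, keepdims=True) + 0.14 / 7
print(pagerank(P).round(3))   # steady-state probabilities = PageRank scores
```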

Impact ordering
- compute scores only for documents with high wf_{t,d}
- sort each posting list by wf_{t,d} → non-consistent order of postings!
- solution 1: early termination; for each term
  - stop after a fixed number r of documents
  - stop when wf_{t,d} < threshold
  - score documents in the union of the retrieved postings
- solution 2: sort terms by idf
  - stop if document scores don't change much

Cluster Pruning
Pre-processing (clustering):
- pick √N docs at random (= "leaders"); random = fast + reflects the distribution well
- for every other doc, pre-compute the nearest leader and attach it to that leader
- each leader has about √N followers
Query processing:
- given query q: find the nearest leader L
- seek the k nearest docs among L's followers
- (a code sketch follows below)

Cluster Pruning (variant)
- clustering: attach documents to the x nearest leaders
- querying: find the y nearest leaders and consider their followers

Summary on Efficient Scoring
- inverted index for selecting candidates
- on-the-fly similarity score calculations
- efficient top-k selection with min heaps
- inexact retrieval (index elimination, champion lists, cluster pruning)
- static quality scores (relevance and efficiency)
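A minimal cluster-pruning sketch, under the assumption that documents and queries are dense vectors scored by cosine similarity; the helper names and the random corpus are illustrative only.

```python
import math
import random

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_prune_index(doc_vectors):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    ids = list(doc_vectors)
    leaders = random.sample(ids, max(1, int(math.sqrt(len(ids)))))
    followers = {l: [] for l in leaders}
    for d in ids:
        nearest = max(leaders, key=lambda l: cos(doc_vectors[d], doc_vectors[l]))
        followers[nearest].append(d)
    return followers

def cluster_prune_query(query, doc_vectors, followers, k):
    """Find the nearest leader, then the k nearest docs among its followers."""
    leader = max(followers, key=lambda l: cos(query, doc_vectors[l]))
    return sorted(followers[leader],
                  key=lambda d: cos(query, doc_vectors[d]), reverse=True)[:k]

# illustrative random corpus of 100 five-dimensional documents
docs = {f"d{i}": [random.random() for _ in range(5)] for i in range(100)}
index = cluster_prune_index(docs)
print(cluster_prune_query([1, 0, 0, 0, 0], docs, index, k=3))
```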

Other practical issues when building IR systems
- Tiered indexes
  - cascaded query processing
  - index the most important terms & docs first
- Zones
  - different indexes for various parts of a doc (title, body, ...)
- Query term proximity
  - more relevant: keywords in close proximity to each other
- Query parsing
  - check syntax
  - create actual index queries
  - combine results
- All parts need careful tuning! (→ Evaluation is important!)

IR system architecture
[figure]

Summary
IR includes many components:
- Document preprocessing (linguistic and otherwise)
- Positional indexes
- Tiered indexes
- Spelling correction
- k-gram indexes for wildcard queries and spelling correction
- Query parsing & query processing
- Document scoring (including proximity, ...)

Evaluation
We need to measure the success of retrieval:
- compare IR engines
- system development
- user happiness
Measure success in terms of relevance with respect to the user's information need:
- How much of the search result is relevant? (precision)
- How much of the relevant information did I find? (recall)

Precision and recall

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
Recall    = #(relevant items retrieved) / #(relevant items)  = P(retrieved | relevant)

Contingency table:

                 Relevant               Nonrelevant
  Retrieved      true positives (TP)    false positives (FP)
  Not retrieved  false negatives (FN)   true negatives (TN)

P = TP / (TP + FP)
R = TP / (TP + FN)

Why not measure accuracy instead?

A combined measure: F
- balanced F = harmonic mean of precision and recall:
  F_{balanced} = \frac{2}{\frac{1}{P} + \frac{1}{R}}
- F allows us to trade off precision against recall:
  F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}
- \alpha \in [0, 1] and thus \beta^2 \in [0, \infty), where \beta^2 = \frac{1 - \alpha}{\alpha}

Evaluation of Ranked Retrieval
- Precision/recall/F are measures for unranked sets
- Relevant documents should be ranked high → precision-recall curve
- [Figure: precision-recall curve, precision (y-axis) against recall (x-axis); red line: interpolation (maximum precision)]
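To make the formulas concrete, here is a small sketch computing precision, recall, and the balanced F measure from a retrieved set and a relevant set; the document sets are made-up examples.

```python
def precision_recall_f(retrieved, relevant):
    """P = TP/(TP+FP), R = TP/(TP+FN), balanced F = harmonic mean of P and R."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned
relevant = {"d1", "d3", "d7"}          # what the user actually needed
print(precision_recall_f(retrieved, relevant))  # (0.5, 0.667, 0.571)
```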

11-point interpolated average precision

  Recall   Interpolated Precision
  0.0      1.00
  0.1      0.67
  0.2      0.63
  0.3      0.55
  0.4      0.45
  0.5      0.41
  0.6      0.36
  0.7      0.29
  0.8      0.13
  0.9      0.10
  1.0      0.08

  11-point average: 0.425

Mean Average Precision

MAP(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})

- \{d_1, ..., d_{m_j}\}: the set of relevant documents for query q_j \in Q
- R_{jk}: the set of ranked documents from the top of the ranking down to document d_k
- precision for relevant documents that are never retrieved is taken to be 0
→ approximates the area underneath the precision-recall curve
→ no fixed recall levels, no interpolation
→ more stable than 11-point interpolation

Issues with Evaluation
- What is a relevant document?
- No account is taken of the degree of relevance
- Recall is difficult to estimate for large collections (web retrieval)
- Is measuring relevance good enough?
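A minimal sketch of MAP computed directly from ranked result lists and relevance judgments, following the formula above; the rankings and judgments are invented for illustration.

```python
def average_precision(ranking, relevant):
    """Mean of the precisions measured at the rank of each relevant document.
    Relevant documents that never appear in the ranking contribute precision 0."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # Precision(R_jk) at this cut-off
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant-set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# invented example: two queries with their rankings and relevance judgments
runs = [(["d3", "d1", "d9", "d2"], {"d1", "d2"}),
        (["d5", "d4"], {"d4", "d8"})]
print(mean_average_precision(runs))  # 0.375
```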