Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting



Information Retrieval: Efficient Scoring and Ranking
Jörg Tiedemann, Department of Linguistics and Philology, Uppsala University

Outline for today
- Recap on ranked retrieval
- Efficient scoring and ranking
- IR system architecture

tf-idf weighting
- product of term frequency (tf) and inverse document frequency (idf):
  $w_{t,d} = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{N}{\mathrm{df}_t}$
- best known weighting scheme in IR
- increases with
  - the number of occurrences within a document
  - the rarity of the term in the collection

Cosine similarity between query and document
$\mathrm{similarity}(q, d) = \cos(\theta)$
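The weighting and similarity above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the base-10 logarithm, the function names `tfidf_weight` and `cosine`, and the toy collection statistics are all assumptions.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """w = (1 + log10 tf) * log10(N / df); 0 if the term does not occur.

    The logarithm base is an assumption; the slide does not specify it.
    """
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection statistics (made up for illustration).
N = 1000
df = {"catcher": 10, "rye": 100}
doc_tf = {"catcher": 10, "rye": 1}
query_tf = {"catcher": 1, "rye": 1}

doc_vec = {t: tfidf_weight(tf, df[t], N) for t, tf in doc_tf.items()}
query_vec = {t: tfidf_weight(tf, df[t], N) for t, tf in query_tf.items()}
score = cosine(query_vec, doc_vec)
```

With these numbers the rare term "catcher" dominates both vectors, so the document scores close to 1 even though its raw term frequencies differ a lot from the query's.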

Cosine similarity between query and document
Euclidean dot product: $\vec{q} \cdot \vec{d} = |\vec{q}|\,|\vec{d}| \cos(\theta)$
→ We can use the dot product of normalized unit vectors:
$\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}$

Efficient scoring
Task: Find the k most relevant documents given a query
- No matching keyword → $\cos(\theta) = 0$ → use inverted index to reduce search space
- Space efficient storage of weights
  → plain term frequencies in postings instead of log values
  → plain IDF values in term dictionaries
  → can vary weighting schemes
- Don't need complete ranking of all matching docs
  → Binary min heap for efficient top-k selection
  → Inexact top k retrieval (search heuristics)

Binary min heap for selecting top k
binary min heap = binary tree in which each node's value is less than the values of its children
- Why parent nodes with values less than children nodes?

Inexact top k retrieval
- bottleneck: cosine computation for all possible candidates
- use search heuristics
General idea:
- find a set A of contenders with $k < |A| \ll N$
- return top k documents in A
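Top-k selection with a binary min heap can be sketched with Python's standard `heapq` module. The function name `top_k` and the example scores are assumptions made for illustration. The sketch also suggests one answer to the question on the slide: the root is the weakest of the current top k, so each new candidate needs only one comparison against it.

```python
import heapq

def top_k(scored_docs, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    heap = []  # min heap of (score, doc_id); heap[0] is the weakest kept doc
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The candidate beats the weakest kept doc: replace the root.
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]

# Hypothetical cosine scores for five candidate documents.
scores = [("d1", 0.12), ("d2", 0.87), ("d3", 0.45), ("d4", 0.91), ("d5", 0.30)]
best = top_k(scores, 3)
```

The heap never holds more than k entries, so scoring n candidates costs O(n log k) instead of the O(n log n) of a full sort.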

Inexact top k retrieval
→ selecting A = pruning non-contenders
- index elimination
- champion lists
- static quality scores
- impact ordering
- cluster pruning

Index elimination
- only consider high-idf query terms
  - query: "catcher in the rye"
  - accumulate scores for "catcher" and "rye" only
- only consider docs containing many query terms
  - multi-term queries: scores for docs that contain at least a fixed proportion of query terms (e.g., 3 out of 4)
    → soft conjunction (early Google)
  - easy to implement in postings traversal

Champion lists
- For each term in the dictionary: pre-compute the r documents of highest weight (in the postings of the term) → champion lists
- At query time: only compute scores for documents in the union of champion lists for all query terms
- r is chosen at indexing time
- might use different r for each term

Static quality scores
Idea 2: reorder posting lists according to expected relevance
- query-independent quality of documents (authority)
What is a good indication of quality?
- a paper with many citations
- many bookmarks (del.icio.us, ...)
- PageRank (!)
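The champion-list idea above can be sketched as follows. The data layout (a dict from term to a dict of document weights), the function names, and the toy postings are assumptions for illustration only.

```python
import heapq

def build_champion_lists(postings, r):
    """postings: term -> {doc_id: weight}. Keep the r highest-weight docs per term."""
    return {
        term: set(heapq.nlargest(r, docs, key=docs.get))
        for term, docs in postings.items()
    }

def candidates(champions, query_terms):
    """Union of the champion lists of all query terms (the contender set A)."""
    result = set()
    for term in query_terms:
        result |= champions.get(term, set())
    return result

# Hypothetical postings with precomputed term weights.
postings = {
    "catcher": {"d1": 3.0, "d2": 0.5, "d3": 2.1},
    "rye": {"d2": 1.7, "d3": 0.2, "d4": 2.4, "d5": 0.9},
}
champions = build_champion_lists(postings, r=2)  # r fixed at indexing time
A = candidates(champions, ["catcher", "rye"])
```

Only the documents in `A` are scored at query time. Here `d5` is never considered, which is the price of the speed-up: the result is an inexact top k.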

Page Rank
- model: likelihood that a random surfer arrives at page B
- Markov chain model: web = probabilistic directed connected graph
- web page = state, N × N probability transition matrix (links)
- PageRank = long-term visit rate = steady-state probability
→ Pre-computed, query-independent document scores

Link Analysis - Example Graph
[Table: link matrix over documents d0 to d6]

Link Analysis - Example Graph
[Table: probability matrix P over documents d0 to d6]
avoid dead ends → add teleportation rate (e.g. 14% chance to jump to any random page)

Link Analysis - PageRank
[Table: matrix over documents d0 to d6]
- $x_{t,i}$ = chance of being in state i at time t during the random walk
  For all documents: $\vec{x}_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,N})$
- steady-state probabilities $\vec{x} = \vec{\pi} = (\pi_1, \pi_2, \ldots, \pi_N) = \vec{\pi} P$
- task: find the steady-state probabilities $\vec{\pi}$ → power method (iterative procedure)
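A probability matrix with teleportation can be derived from a link matrix roughly as below. This is a sketch under assumptions: the function name `transition_matrix`, the three-page example graph, and the handling of dead ends by a uniform jump are illustrative choices, not taken from the slides.

```python
def transition_matrix(adj, teleport=0.14):
    """Turn a 0/1 link matrix into a row-stochastic transition matrix.

    With probability `teleport` the surfer jumps to a uniformly random page,
    otherwise follows a uniformly chosen outgoing link. Pages without
    outgoing links (dead ends) always jump uniformly.
    """
    n = len(adj)
    P = []
    for row in adj:
        out_links = sum(row)
        if out_links == 0:
            P.append([1.0 / n] * n)
        else:
            P.append([teleport / n + (1 - teleport) * a / out_links for a in row])
    return P

# Hypothetical three-page graph: d0 -> d1, d0 -> d2, d1 -> d2, d2 has no links.
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
P = transition_matrix(adj)
```

Every row of `P` sums to 1 and every entry is positive, so the random walk can reach any page from any other page.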

Link Analysis - Compute PageRanks
Power Method to find $\vec{\pi} = \vec{\pi} P$
1. start with random distribution $\vec{x}_0 = (x_{0,1}, x_{0,2}, \ldots, x_{0,N})$
2. at step t compute $\vec{x}_t = \vec{x}_{t-1} P$: $x_{t,k} = \sum_{i=1}^{N} x_{t-1,i} P_{i,k}$
3. continue with 2. until convergence ($\vec{\pi} = \vec{x}_m = \vec{x}_0 P^m$)

How to Integrate Static quality scores
- assign quality score g(d) to each d (e.g. PageRank)
- combine with relevance score $\cos(q, d)$: net-score(q, d) = g(d) + cos(q, d)
- might use other type of combination
- return top k documents according to net-score
How does this help to make retrieval more efficient?

Static quality scores
- postings are ordered by g(d) (still consistent order!)
- traverse postings and compute scores
- early termination is possible
  - stop if minimal score cannot be improved
  - time threshold
  - threshold for goodness score
- can be combined with champion lists

Other ideas
"High and low lists":
- for each term:
  - high list (= champion list)
  - low list (other documents)
- use high lists first
- use low list if < k documents found
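The three steps of the power method can be sketched in plain Python. The function name `power_method`, the tolerance, the uniform starting distribution (the slide suggests a random one; any probability distribution works), and the 2 × 2 example matrix are assumptions for illustration.

```python
def power_method(P, tol=1e-10, max_iter=1000):
    """Iterate x_t = x_{t-1} P until the distribution stops changing."""
    n = len(P)
    x = [1.0 / n] * n  # step 1: start distribution (uniform here)
    for _ in range(max_iter):
        # step 2: x_{t,k} = sum_i x_{t-1,i} * P[i][k]
        nxt = [sum(x[i] * P[i][k] for i in range(n)) for k in range(n)]
        # step 3: stop once no component moved by more than tol
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical two-state transition matrix (rows sum to 1).
P = [
    [0.1, 0.9],
    [0.5, 0.5],
]
pi = power_method(P)
```

For this matrix the steady state can be checked by hand: $\pi_1 = 0.1\pi_1 + 0.5\pi_2$ together with $\pi_1 + \pi_2 = 1$ gives $\pi = (5/14, 9/14)$, which is what the iteration approaches.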

Impact ordering
- compute scores only for documents with high $wf_{t,d}$
- sort each posting by $wf_{t,d}$ → non-consistent order of postings!
- solution 1: early termination: for each term
  - stop after a fixed number of r documents
  - stop when $wf_{t,d}$ < threshold
  - score documents in union of retrieved postings
- solution 2: sort terms by idf
  - stop if document scores don't change much

Cluster Pruning
Pre-processing (clustering):
- Pick $\sqrt{N}$ docs at random (= "leaders") (random = fast + reflects distribution well)
- For every other doc, pre-compute nearest leader
  - attach them to leader
  - each leader has approximately $\sqrt{N}$ followers
Query processing:
- given query q: find nearest leader L
- seek k nearest docs among L's followers

Cluster Pruning
Variant:
- clustering: attach documents to x nearest leaders
- querying: find y nearest leaders and consider their followers

Summary on Efficient Scoring
- inverted index for selecting candidates
- on-the-fly similarity score calculations
- efficient top-k selection with min heaps
- inexact retrieval (index elimination, champion lists, cluster pruning)
- static quality scores (relevance and efficiency)
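The basic cluster pruning scheme (one leader per document, one leader per query) can be sketched as below. Everything here is illustrative: the function names, the dense two-dimensional toy vectors, the fixed random seed, and the choice to count each leader as a member of its own cluster are assumptions.

```python
import math
import random

def cos_sim(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(docs, rng):
    """docs: doc_id -> vector. Returns leader_id -> member ids (leader included)."""
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)  # random leaders
    clusters = {leader: [leader] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id in clusters:
            continue
        nearest = max(leaders, key=lambda l: cos_sim(vec, docs[l]))
        clusters[nearest].append(doc_id)
    return clusters

def search(query, docs, clusters, k):
    """Find the nearest leader, then rank only that leader's cluster."""
    leader = max(clusters, key=lambda l: cos_sim(query, docs[l]))
    members = clusters[leader]
    return sorted(members, key=lambda d: cos_sim(query, docs[d]), reverse=True)[:k]

# Hypothetical document vectors.
docs = {
    "d1": [1.0, 0.1], "d2": [0.9, 0.2], "d3": [0.8, 0.3],
    "d4": [0.1, 1.0], "d5": [0.2, 0.9], "d6": [0.3, 0.8],
    "d7": [0.7, 0.7], "d8": [1.0, 0.0], "d9": [0.0, 1.0],
}
clusters = build_clusters(docs, random.Random(0))
hits = search([1.0, 0.0], docs, clusters, k=2)
```

At query time only the leaders plus one cluster are compared with the query, roughly $2\sqrt{N}$ cosine computations instead of N. A small cluster may return fewer than k documents, which is one motivation for the variant with x and y nearest leaders.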

Other practical issues when building IR systems
- Tiered indexes
  - cascaded query processing
  - index with most important terms & docs first
- Zones
  - different indexes for various parts of a doc (title, body, ...)
- Query term proximity
  - more relevant: keywords in close proximity to each other
- Query parsing
  - check syntax
  - create actual index queries
  - combine results

IR system architecture
→ All parts need careful tuning! (→ Evaluation is important!)

Summary
IR includes many components
- Document preprocessing (linguistic and otherwise)
- Positional indexes
- Tiered indexes
- Spelling correction
- k-gram indexes for wildcard queries and spelling correction
- Query parsing & query processing
- Document scoring (including proximity, ...)

Evaluation
We need to measure the success of retrieval
- compare IR engines
- system development
- user happiness
Measure success in terms of relevance with respect to the user's information need
- How much of the search result is relevant? (precision)
- How much of the relevant information did I find? (recall)

Precision and recall
$\mathrm{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved})$
$\mathrm{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant})$

Precision and recall
                 Relevant               Nonrelevant
Retrieved        true positives (TP)    false positives (FP)
Not retrieved    false negatives (FN)   true negatives (TN)

P = TP/(TP + FP)
R = TP/(TP + FN)
Why not measure accuracy?

A combined measure: F
- balanced F = harmonic mean of precision and recall:
  $F_{\text{balanced}} = \frac{2}{\frac{1}{P} + \frac{1}{R}}$
- F allows us to trade off precision against recall:
  $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
- $\alpha \in [0, 1]$ and thus $\beta^2 \in [0, \infty]$, where $\beta^2 = \frac{1 - \alpha}{\alpha}$

Evaluation of Ranked Retrieval
- Precision/recall/F are measures for unranked sets
- Relevant documents should be ranked high → precision-recall curve
[Figure: precision-recall curve; x-axis: Recall, y-axis: Precision]
- Red line: interpolation (max precision)
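The set-based measures above can be computed directly from the contingency counts. The function names and the counts in the example are made up for illustration; the accuracy function is included only to hint at why accuracy is a poor fit when almost all documents are nonrelevant.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from contingency counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    f = (b2 + 1) * p * r / (b2 * p + r)
    return p, r, f

def accuracy(tp, fp, fn, tn):
    """Fraction of all documents that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical result: 60 docs retrieved, 20 of them relevant,
# 80 relevant docs exist in a collection of one million.
p, r, f1 = precision_recall_f(tp=20, fp=40, fn=60)
acc = accuracy(tp=20, fp=40, fn=60, tn=999_880)
```

Precision is 1/3 and recall 1/4 here, yet accuracy is 0.9999 because the true negatives swamp everything else. A system that retrieves nothing at all would score almost the same accuracy.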

11-point interpolated average precision
[Table: Recall | Interpolated Precision; 11-point average]

Mean Average Precision
$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})$
- $\{d_1, \ldots, d_{m_j}\}$ = set of relevant documents for query $q_j \in Q$
- $R_{jk}$ = set of ranked documents up to document $d_k$
- Precision for non-retrieved documents = 0
→ approximates area underneath precision-recall curve
→ no fixed recall levels, no interpolation
→ more stable than 11-point interpolation

Issues with Evaluation
- What is a relevant document?
- No account for degree of relevance
- Recall is difficult for large collections (web retrieval)
- Is measuring relevance good enough?
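The MAP formula above can be sketched as follows. The function names and the two toy rankings are assumptions for illustration. Relevant documents that never appear in the ranking contribute a precision of 0, as stated on the slide.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision of the ranking cut at this doc
    # Dividing by all relevant docs gives missing ones a precision of 0.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)

# Two hypothetical queries with their rankings and relevance judgements.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6", "d9"}),        # AP = (1/2 + 0) / 2, d9 not retrieved
]
map_score = mean_average_precision(runs)
```

The two queries score 5/6 and 1/4, so the MAP is 13/24. Each query counts equally regardless of how many relevant documents it has.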
