Information Retrieval: Efficient Scoring and Ranking
Jörg Tiedemann
jorg.tiedemann@lingfil.uu.se
Department of Linguistics and Philology, Uppsala University

Outline for today
- Recap on ranked retrieval
- Efficient scoring and ranking
- IR system architecture

tf-idf weighting
- product of term frequency (tf) and inverse document frequency (idf):
  w_{t,d} = (1 + log tf_{t,d}) · log(N / df_t)
- best known weighting scheme in IR
- increases with
  - the number of occurrences within a document
  - the rarity of the term in the collection

Cosine similarity between query and document
- similarity(q, d) = cos(θ)
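A minimal sketch of the tf-idf weight defined above, assuming base-10 logarithms and a toy query with raw counts passed in directly (the function name and the example numbers are illustrative, not from the slides):

```python
import math

def tfidf_weight(tf: int, df: int, N: int) -> float:
    """w_{t,d} = (1 + log tf_{t,d}) * log(N / df_t); zero if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)

# Toy example: a term occurring 3 times in a document,
# appearing in 10 out of 1000 documents in the collection.
print(tfidf_weight(tf=3, df=10, N=1000))  # (1 + log10 3) * log10 100 ≈ 2.95
```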
Cosine similarity between query and document
- Euclidean dot product: q · d = |q| |d| cos(θ)
- We can use the dot product of normalized unit vectors:
  cos(θ) = (q · d) / (|q| |d|) = (q/|q|) · (d/|d|)
- No matching keyword → cos(θ) = 0 → use the inverted index to reduce the search space!
- Space-efficient storage of weights:
  - plain term frequencies in postings instead of log values
  - plain idf values in the term dictionary
  - weighting schemes can then be varied at query time
- We don't need a complete ranking of all matching documents:
  - binary min heap for efficient top-k selection
  - inexact top-k retrieval (search heuristics)

Efficient scoring
Task: find the k most relevant documents given a query.

Binary min heap for selecting top k
- binary min heap = binary tree in which each node's value is less than the values of its children
- Why parent nodes with values less than their children? (See the sketch after this section.)

Inexact top-k retrieval
- bottleneck: cosine computation for all possible candidates
- use search heuristics
- General idea:
  - find a set A of contenders with K < |A| << N
  - return the top k documents in A
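A minimal sketch of top-k selection with a size-k binary min heap, assuming (score, doc_id) pairs arrive one at a time from postings traversal (the input list is a made-up example). The heap root is always the weakest of the current best k, which is why parents hold smaller values than their children:

```python
import heapq

def top_k(scored_docs, k):
    """Keep the k highest-scoring (score, doc_id) pairs using a min heap of size k."""
    heap = []  # heap[0] is always the weakest of the current top k
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # New score beats the current minimum: replace the root.
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best first

scores = [(0.12, "d1"), (0.80, "d2"), (0.45, "d3"), (0.33, "d4"), (0.91, "d5")]
print(top_k(scores, k=3))  # [(0.91, 'd5'), (0.80, 'd2'), (0.45, 'd3')]
```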
Inexact top-k retrieval
Selecting A = pruning non-contenders:
- index elimination
- champion lists
- static quality scores
- impact ordering
- cluster pruning

Index elimination
- only consider high-idf query terms
  - query: "catcher in the rye"
  - accumulate scores for "catcher" and "rye" only
- only consider docs containing many query terms
  - multi-term queries: score only docs that contain at least a fixed proportion of the query terms (e.g., 3 out of 4)
  - → soft conjunction (early Google)
  - easy to implement in postings traversal

Champion lists
- For each term in the dictionary: pre-compute the r documents of highest weight in the term's postings → champion lists
- At query time: only compute scores for documents in the union of the champion lists of all query terms
- r is chosen at indexing time
- might use a different r for each term

Static quality scores
Idea 2: reorder posting lists according to expected relevance
- query-independent quality of documents (authority)
- What is a good indication of quality?
  - a paper with many citations
  - many bookmarks (del.icio.us, ...)
  - PageRank (!)
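A minimal sketch of query processing with champion lists, assuming the champion lists and term-document weights have already been pre-computed at indexing time (the dictionaries and toy data below are hypothetical):

```python
from collections import defaultdict

def champion_list_scores(query_terms, champion_lists, weights):
    """Score only documents that appear in some query term's champion list.

    champion_lists: term -> list of top-r doc_ids for that term (pre-computed)
    weights: (term, doc_id) -> term weight (e.g. tf-idf)
    """
    candidates = set()
    for term in query_terms:
        candidates.update(champion_lists.get(term, []))
    scores = defaultdict(float)
    for doc in candidates:
        for term in query_terms:
            scores[doc] += weights.get((term, doc), 0.0)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Hypothetical toy index
champion_lists = {"catcher": ["d1", "d7"], "rye": ["d7", "d3"]}
weights = {("catcher", "d1"): 2.1, ("catcher", "d7"): 1.5,
           ("rye", "d7"): 1.9, ("rye", "d3"): 1.2}
print(champion_list_scores(["catcher", "rye"], champion_lists, weights))
```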
PageRank
- model: likelihood that a random surfer arrives at page B
- Markov chain model: web = probabilistic directed connected graph
- web page = state, N × N probability transition matrix (links)
- PageRank = long-term visit rate = steady-state probability
- → pre-computed, query-independent document scores

Link Analysis – Example Graph

Link matrix:

       d0  d1  d2  d3  d4  d5  d6
  d0    0   0   1   0   0   0   0
  d1    0   1   1   0   0   0   0
  d2    1   0   1   1   0   0   0
  d3    0   0   0   1   1   0   0
  d4    0   0   0   0   0   0   1
  d5    0   0   0   0   0   1   1
  d6    0   0   0   1   1   0   1

Probability matrix P:

       d0    d1    d2    d3    d4    d5    d6
  d0  0.00  0.00  1.00  0.00  0.00  0.00  0.00
  d1  0.00  0.50  0.50  0.00  0.00  0.00  0.00
  d2  0.33  0.00  0.33  0.33  0.00  0.00  0.00
  d3  0.00  0.00  0.00  0.50  0.50  0.00  0.00
  d4  0.00  0.00  0.00  0.00  0.00  0.00  1.00
  d5  0.00  0.00  0.00  0.00  0.00  0.50  0.50
  d6  0.00  0.00  0.00  0.33  0.33  0.00  0.33

Avoid dead ends → add a teleportation rate (e.g. a 14% chance to jump to any random page):

       d0    d1    d2    d3    d4    d5    d6
  d0  0.02  0.02  0.88  0.02  0.02  0.02  0.02
  d1  0.02  0.45  0.45  0.02  0.02  0.02  0.02
  d2  0.31  0.02  0.31  0.31  0.02  0.02  0.02
  d3  0.02  0.02  0.02  0.45  0.45  0.02  0.02
  d4  0.02  0.02  0.02  0.02  0.02  0.02  0.88
  d5  0.02  0.02  0.02  0.02  0.02  0.45  0.45
  d6  0.02  0.02  0.02  0.31  0.31  0.02  0.31

Link Analysis – PageRank
- x_{t,i} = probability of being in state i at time t during the random walk
- for all documents: x_t = (x_{t,1}, x_{t,2}, ..., x_{t,N})
- steady-state probabilities: x = π = (π_1, π_2, ..., π_N) = πP
- task: find the steady-state probabilities π → power method (iterative procedure)
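A minimal sketch of how the teleportation-smoothed transition matrix above can be built from the link matrix, assuming NumPy and the 14% teleportation rate used in the example (function name and rounding are illustrative):

```python
import numpy as np

def transition_matrix(links, teleport=0.14):
    """Row-normalize a 0/1 link matrix and mix in a uniform teleportation term.

    P = (1 - teleport) * (row-normalized links) + teleport / N
    Rows without outlinks (dead ends) become fully uniform.
    """
    links = np.asarray(links, dtype=float)
    N = links.shape[0]
    out_degree = links.sum(axis=1, keepdims=True)
    norm = np.divide(links, out_degree,
                     out=np.full_like(links, 1.0 / N),  # dead ends: jump anywhere
                     where=out_degree > 0)
    return (1.0 - teleport) * norm + teleport / N

links = [[0, 0, 1, 0, 0, 0, 0],
         [0, 1, 1, 0, 0, 0, 0],
         [1, 0, 1, 1, 0, 0, 0],
         [0, 0, 0, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1, 1],
         [0, 0, 0, 1, 1, 0, 1]]
P = transition_matrix(links)
print(np.round(P, 2))  # rows sum to 1; entries ≈ 0.02 / 0.31 / 0.45 / 0.88 as on the slide
```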
Link Analysis – Computing PageRank
Power method to find π = πP:
1. start with a random distribution x_0 = (x_{0,1}, x_{0,2}, ..., x_{0,N})
2. at step t compute x_t = x_{t-1} P:
   x_{t,k} = Σ_{i=1}^{N} x_{t-1,i} P_{i,k}
3. continue with step 2 until convergence (π = x_m = x_0 P^m)

How to integrate static quality scores
- assign a quality score g(d) to each document d (e.g. PageRank)
- combine it with the relevance score cos(q, d):
  net-score(q, d) = g(d) + cos(q, d)
- other types of combination are possible
- return the top k documents according to net-score
- How does this help to make retrieval more efficient?

Static quality scores
- postings are ordered by g(d) (still a consistent order!)
- traverse postings and compute scores
- early termination is possible:
  - stop if the minimal score cannot be improved
  - time threshold
  - threshold for the goodness score
- can be combined with champion lists

Other ideas
High and low lists:
- for each term:
  - high list (= champion list)
  - low list (other documents)
- use the high lists first
- use the low lists only if fewer than k documents are found
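A minimal sketch of the power method above, assuming the NumPy matrix P built in the previous sketch; the tolerance, iteration cap, and uniform start are illustrative choices:

```python
import numpy as np

def pagerank(P, tol=1e-10, max_iter=1000):
    """Power method: iterate x_t = x_{t-1} P until the distribution stops changing."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)        # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ P             # x_{t,k} = sum_i x_{t-1,i} * P[i, k]
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

pi = pagerank(P)                    # P from the previous sketch
print(np.round(pi, 3), pi.sum())    # steady-state probabilities; they sum to 1
```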
Impact ordering
- compute scores only for documents with high wf_{t,d}
- sort each posting list by wf_{t,d} → non-consistent order of postings!
- solution 1: early termination; for each term:
  - stop after a fixed number of r documents, or
  - stop when wf_{t,d} < threshold
  - score the documents in the union of the retrieved postings
- solution 2: sort the query terms by idf
  - stop if the document scores don't change much

Cluster pruning
Pre-processing (clustering):
- pick √N docs at random (= "leaders")
  (random = fast + reflects the distribution well)
- for every other doc, pre-compute the nearest leader and attach it to that leader
- each leader has ≈ √N "followers"
Query processing:
- given query q: find the nearest leader L
- seek the k nearest docs among L's followers
(A sketch of cluster pruning follows after the summary below.)

Cluster pruning – variant
- clustering: attach documents to the x nearest leaders
- querying: find the y nearest leaders and consider their followers

Summary on efficient scoring
- inverted index for selecting candidates
- on-the-fly similarity score calculations
- efficient top-k selection with min heaps
- inexact retrieval (index elimination, champion lists, cluster pruning)
- static quality scores (relevance and efficiency)
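A minimal sketch of basic cluster pruning, assuming documents are given as a dict mapping doc ids to sparse {term: weight} vectors (all names and the data layout are hypothetical):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(docs):
    """Pick ~sqrt(N) random leaders and attach every doc to its nearest leader."""
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = random.sample(list(docs), n_leaders)
    followers = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        followers[nearest].append(doc_id)
    return followers

def cluster_pruned_search(query_vec, docs, followers, k):
    """Find the nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda l: cosine(query_vec, docs[l]))
    ranked = sorted(followers[leader],
                    key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]
```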
Other practical issues when building IR systems
- Tiered indexes
  - cascaded query processing (see the sketch below)
  - index the most important terms & docs first
- Zones
  - different indexes for various parts of a doc (title, body, ...)
- Query term proximity
  - more relevant: keywords in close proximity to each other
- Query parsing
  - check the syntax
  - create the actual index queries
  - combine the results
→ All parts need careful tuning! (→ Evaluation is important!)

IR system architecture
(architecture diagram on the original slide)

Summary
IR includes many components:
- document preprocessing (linguistic and otherwise)
- positional indexes
- tiered indexes
- spelling correction
- k-gram indexes for wildcard queries and spelling correction
- query parsing & query processing
- document scoring (including proximity, ...)

Evaluation
We need to measure the success of retrieval:
- compare IR engines
- system development
- user happiness
Measure success in terms of relevance with respect to the user's information need:
- How much of the search result is relevant? (precision)
- How much of the relevant information did I find? (recall)
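A minimal sketch of cascaded query processing over tiered indexes, assuming each tier is a hypothetical search function that yields (score, doc_id) pairs for a query:

```python
def tiered_search(query, tiers, k):
    """Query the tiers in order of importance; fall back to the next tier
    only while fewer than k results have been collected."""
    results, seen = [], set()
    for search_tier in tiers:              # e.g. [search_tier1, search_tier2, ...]
        for score, doc_id in search_tier(query):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append((score, doc_id))
        if len(results) >= k:
            break                          # enough hits from the higher tiers
    return sorted(results, reverse=True)[:k]
```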
Precision and recall

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
Recall    = #(relevant items retrieved) / #(relevant items)  = P(retrieved | relevant)

                 Relevant               Nonrelevant
  Retrieved      true positives (TP)    false positives (FP)
  Not retrieved  false negatives (FN)   true negatives (TN)

P = TP / (TP + FP)
R = TP / (TP + FN)

Why not measure accuracy?

A combined measure: F
- balanced F = harmonic mean of precision and recall:
  F_balanced = 2 / (1/P + 1/R)
- F allows us to trade off precision against recall:
  F = 1 / (α·(1/P) + (1-α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)
- α ∈ [0, 1] and thus β² ∈ [0, ∞], where β² = (1-α)/α

Evaluation of ranked retrieval
- Precision/recall/F are measures for unranked sets
- Relevant documents should be ranked high → precision-recall curve
  (plot of precision against recall; the red line is the interpolation using maximum precision)
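A minimal sketch of the set-based measures above for a single query; the document ids are made up, and beta defaults to 1 for the balanced F:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall, and F_beta for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        f = 0.0
    else:
        b2 = beta ** 2
        f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Toy example: 3 of the 4 retrieved docs are relevant, out of 6 relevant docs overall.
print(precision_recall_f(["d1", "d2", "d3", "d4"],
                         ["d1", "d2", "d3", "d5", "d6", "d7"]))  # (0.75, 0.5, 0.6)
```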
11-point interpolated average precision

  Recall   Interpolated precision
  0.0      1.00
  0.1      0.67
  0.2      0.63
  0.3      0.55
  0.4      0.45
  0.5      0.41
  0.6      0.36
  0.7      0.29
  0.8      0.13
  0.9      0.10
  1.0      0.08

  11-point average: 0.425

Mean Average Precision

  MAP(Q) = (1/|Q|) Σ_{j=1}^{|Q|} (1/m_j) Σ_{k=1}^{m_j} Precision(R_{jk})

- {d_1, ..., d_{m_j}} = the set of relevant documents for query q_j ∈ Q
- R_{jk} = the set of ranked results from the top down to document d_k
- precision for non-retrieved relevant documents = 0
- → approximates the area underneath the precision-recall curve
- → no fixed recall levels, no interpolation
- → more stable than the 11-point interpolation

Issues with evaluation
- What is a relevant document?
- No account of the degree of relevance
- Recall is difficult to estimate for large collections (web retrieval)
- Is measuring relevance good enough?
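A minimal sketch of MAP as defined above, assuming each query is given as a ranked result list plus its set of relevant documents (the toy rankings are invented for illustration):

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of Precision(R_jk) over all relevant documents d_k;
    relevant documents that are never retrieved contribute 0."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP(Q): average AP over all (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries:
runs = [(["d3", "d1", "d7"], {"d1", "d7"}),        # AP = (1/2 + 2/3) / 2
        (["d2", "d4", "d5", "d6"], {"d5"})]        # AP = 1/3
print(round(mean_average_precision(runs), 3))      # ≈ 0.458
```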