COM S/INFO 630: Representing and Accessing [Textual] Digital Information
Lecturer: Lillian Lee
Lecture 3: 2 February 2006
Scribes: Siavash Dejgosha (sd82) and Ricardo Hu (rh238)

Pivoted Length Normalization

I. Summary

Document length normalization schemes attempt to eliminate the advantage that long documents have over shorter documents under a given scoring scheme. However, the work of Singhal et al. [1, 2] suggests that normalization over-penalizes long documents relative to the actual document distribution. In particular, retrieval using cosine normalization tends to retrieve short documents more often, and long documents less often, than the distribution of lengths of the truly relevant documents suggests. Pivoted Document Length Normalization (PDLN) modifies the standard cosine normalization scheme to overcome this: PDLN transforms the cosine normalization factor so that a document's probability of retrieval better matches its true probability of relevance as a function of length.

Notation

   d   : a document in the corpus
   tf  : term frequency
   idf : inverse document frequency
   w_i : weight of term i of the vocabulary in d, with w_i = tf * idf
   norm(d) = \sqrt{w_1^2 + w_2^2 + \cdots + w_m^2}, the L_2 norm of d's weight vector
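To make the notation concrete, here is a minimal Python sketch of the weight vector and its cosine norm. The input format and the particular idf variant (log(N/df)) are assumptions; the notes do not fix either.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, df, num_docs):
    """Weight vector for one document: w_i = tf_i * idf_i.

    doc_tokens: the document's tokens (hypothetical input format)
    df:         dict mapping term -> document frequency in the corpus
    num_docs:   total number of documents in the corpus
    """
    tf = Counter(doc_tokens)
    # idf = log(N / df) is one common variant; the notes leave idf unspecified.
    return {t: tf[t] * math.log(num_docs / df[t]) for t in tf}

def cosine_norm(weights):
    """norm(d) = sqrt(w_1^2 + w_2^2 + ... + w_m^2)."""
    return math.sqrt(sum(w * w for w in weights.values()))
```

Cosine-normalized retrieval scores a document by dividing its raw query-document score by cosine_norm; PDLN, derived below, replaces only this divisor.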

II. Review

Consider the distribution of relevant documents according to document length. For a given corpus, we can plot the probability of relevance against document length for the truly relevant documents, and compare it with the probability of retrieval against document length under a standard cosine normalization scheme:

[Figure: probability of relevance/retrieval vs. document length. Two curves, "true relevance" and "cosine norm", intersect at a crossing point.]

Graph construction

1. Sort the set of all documents in the corpus by length.
2. Create bins B_i of 1,000 documents each, in sorted order.
3. The length of a bin is the median length of the documents in it:
   len(B_i) = median(len(d) : d \in B_i).
4. Each bin is assigned the fraction of the corpus's relevant documents that fall in it:
   prob(B_i) = P(d \in B_i \mid d is relevant).

(A code sketch of this construction follows the observations below.)

Observations

I. Neither curve is linear, even though cosine normalization is a linear function of the document's L_2 norm. This is because the underlying relevance distribution is apparently nonlinear.

II. The shape of the cosine normalization curve does not exactly match that of the relevance curve. The retrieval curve's shape is determined by the IR system's document representation and ranking function, which is not necessarily an accurate function of document length.
   a. For lengths below the crossing point, the cosine curve lies above the relevance curve: short documents are more likely to be retrieved than their true likelihood of relevance warrants.
   b. For lengths above the crossing point, the cosine curve lies below the relevance curve: long documents are less likely to be retrieved than their true likelihood of relevance warrants.

III. Reliability of statistics at different lengths: suppose there are only 2 documents of length 13,000 but 5,000 documents of length 13,001. The statistics at length 13,001 are then much more reliable than those at length 13,000. The 1,000-document bins adjust for this.
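A minimal sketch of the graph construction, assuming documents are given as (doc_id, length) pairs and relevance judgments as a set of ids (both hypothetical interfaces):

```python
from statistics import median

def length_bins(docs, relevant_ids, bin_size=1000):
    """Estimate P(d in B_i | d is relevant) for fixed-size length bins.

    docs:         list of (doc_id, length) pairs       (hypothetical format)
    relevant_ids: set of ids of documents judged relevant
    Returns a list of (median bin length, probability) points for plotting.
    """
    docs = sorted(docs, key=lambda d: d[1])              # step 1: sort by length
    total_relevant = len(relevant_ids)
    points = []
    for i in range(0, len(docs), bin_size):              # step 2: 1,000-doc bins
        chunk = docs[i:i + bin_size]
        bin_len = median(length for _, length in chunk)  # step 3: median length
        in_bin = sum(1 for doc_id, _ in chunk if doc_id in relevant_ids)
        points.append((bin_len, in_bin / total_relevant))  # step 4: P(B_i | rel)
    return points
```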

III. Derivation of Pivoted Length Normalization [1, 2]

Motivation: By Observation II above, the normalization factor should be lower for longer documents, because cosine normalization penalizes long documents too much.

Solution: Map the old normalization to a new normalization, norm'(d), that takes more appropriate values for long documents. We use a linear mapping:

   [1]  norm'(d) = m' \cdot norm(d) + b',   where m' and b' are parameters.

This equation defines pivoted document length normalization in terms of cosine normalization.

[Figure: pivoted normalization factor vs. cosine normalization factor. The pivoted line crosses the identity line at the pivot; its slope is tan(α).]

The pivot, p, is the norm corresponding to the crossing point in the previous graph. At this point, norm'(d_p) = norm(d_p). Note that there is some leeway in specifying d_p: it is not clear whether the authors would have used the document of median length in the bin containing the crossing point, or all the relevant documents in that bin, to compute d_p. We do not need an exact specification of d_p, because the graph above is not plotted from actual data; rather, it is intended to show how the tilting transformation works. We can simply take d_p to be a representative document whose length is approximately the value at the crossing point.

[1] Singhal, Salton, Mitra, and Buckley. "Document Length Normalization." Information Processing & Management 32(5), Sep 1996, pp. 619-633.
[2] Singhal, Buckley, and Mitra. "Pivoted Document Length Normalization." ACM SIGIR, 1996.
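The pivot's fixed-point property is easy to check mechanically. A minimal sketch with illustrative values (the slope and pivot below are not taken from the paper); the substitution b' = p(1 - m') is derived as equation [2] just below:

```python
def norm_prime(norm_d, m, b):
    """Equation [1]: norm'(d) = m' * norm(d) + b'."""
    return m * norm_d + b

m_prime, p = 0.7, 120.0        # illustrative slope and pivot, not from the paper
b_prime = p * (1 - m_prime)    # equation [2], derived below
# The pivot is a fixed point of the mapping: norm'(d_p) = norm(d_p) = p.
assert abs(norm_prime(p, m_prime, b_prime) - p) < 1e-9
```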

At the pivot, norm'(d_p) = norm(d_p) = p, so substituting into [1] (with m' = tan(α)):

   p = m' \cdot p + b'
   [2]  b' = p(1 - m')

Using [1] and [2], we can write

   [3]  norm'(d) = m' \cdot norm(d) + p(1 - m').

Removing one parameter

Note that for any constant \alpha > 0 (constant with respect to the document d),

   score(d) \stackrel{rank}{=} \alpha \cdot score(d),

since a constant factor does not affect the ranking. We can use this equivalence to simplify the constant term. Dividing [3] by p(1 - m') gives

   [4]  norm''(d) = \frac{m'}{p(1 - m')} \, norm(d) + 1.

Let's try to fix one of the parameters and optimize the other one. How about a good p? Let p = \overline{norm}, the average of norm(d) over all documents. We claim there exists an m'' such that

   \frac{m''}{\overline{norm}(1 - m'')} = \frac{m'}{p(1 - m')}.

Solving for that m'', we arrive at

   [5]  m'' = \frac{ m' \overline{norm} / (p(1 - m')) }{ 1 + m' \overline{norm} / (p(1 - m')) }.

Now, using [4] and [5], we define another new norm:

   norm'''(d) = \frac{m''}{(1 - m'') \overline{norm}} \, norm(d) + 1.

Multiplying by (1 - m'') defines norm^{iv}(d), which is equivalent for ranking purposes:

   [6]  norm^{iv}(d) = m'' \, \frac{norm(d)}{\overline{norm}} + (1 - m'').

Notice that for a document of average length, norm(d) = \overline{norm}, so norm^{iv}(d) = 1. That is, a document of average length is already of appropriate length and does not need to be rescaled under this scheme. Assuming the linear correction, we have reduced the two-parameter search problem to a search for just one parameter. All the newly defined norms are equivalent under ranking, since each was obtained by multiplying by a constant with respect to d. (The sketch below checks this numerically.)
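A minimal numeric check of the rank-equivalence claim. The raw scores, norms, and slope below are hypothetical; the point is only that scoring with the two-parameter form [3] (taking p = \overline{norm}) and with the one-parameter form [6], with m'' computed via [5], orders documents identically:

```python
raw   = {"d1": 3.2, "d2": 5.1, "d3": 4.4}       # hypothetical raw query-doc scores
norms = {"d1": 80.0, "d2": 200.0, "d3": 130.0}  # hypothetical cosine norms
avg = sum(norms.values()) / len(norms)          # \overline{norm}; also the pivot p
m_prime = 0.75                                  # illustrative slope, not from the paper

# Two-parameter form [3] with p = avg:
score3 = {d: raw[d] / (m_prime * norms[d] + avg * (1 - m_prime)) for d in raw}

# m'' via [5]; with p = \overline{norm} this simplifies to m'' = m'.
x = m_prime * avg / (avg * (1 - m_prime))
m2 = x / (1 + x)

# One-parameter form [6]:
score6 = {d: raw[d] / (m2 * norms[d] / avg + (1 - m2)) for d in raw}

# The two scorings differ only by the constant factor avg, so the rankings agree.
assert sorted(raw, key=score3.get) == sorted(raw, key=score6.get)
```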

Results

Empirically, applying this new normalization resulted in:
1. the relevance and retrieval curves becoming much more similar;
2. a 6-12% improvement in retrieval precision;
3. relatively stable optimal slope (m'') values across six corpora.

IV. Document Relevancy [3]

In motivating PDLN, we assumed that we had the distribution of truly relevant documents. But how valid is this assumption? In practice, the set of relevant documents for a TREC query is determined by humans who (1) judge the top 100 results returned by several different information retrieval systems for that query, and (2) pool those judged results together. Under this system, relevant documents that appear in no system's top 100 are never judged, and therefore never marked relevant. If a significant number of relevant documents go unmarked, or if the retrieval methods share inherent biases (e.g., over-retrieving short documents), these artifacts could skew the observed probability distribution. If the motivation for PDLN rests on such a skewed distribution, then PDLN would be fundamentally flawed.

Motivation: It is impractical to have humans judge every document in a corpus of thousands or millions.

Goal: How can we count the number of truly relevant documents in a corpus without judging them all manually?

Idea: arrival rates. Instead of considering the static set of the top k results from each retrieval method for a fixed k, let k be a free variable, and estimate the number of relevant documents from the arrival rate of newly found relevant documents as k grows. Pool the top k documents from the different retrieval methods by taking the union of their retrieved sets and retaining only the relevant documents; call this set R_k. Comparing R_k with the pool at depth k+1, R_{k+1}, define the arrival rate

   \Delta R(k) = |R_{k+1}| - |R_k|,

where k is the pool depth. We expect \Delta R(k) to be a decreasing function of k if (1) there are finitely many relevant documents and (2) our retrieval methods do tend to retrieve relevant documents. Notice that if \Delta R(k) can be fitted with a smooth function delr(k) that converges to zero, we can estimate the total number of relevant documents as

   reldocs = \int_1^{\infty} delr(k) \, dk.

[3] Zobel. "How Reliable Are the Results of Large-Scale Information Retrieval Experiments?" ACM SIGIR, 1998.
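A sketch of the estimation procedure. The observations below are synthetic and the exponential fit family is an assumption; the notes require only some smooth fit delr(k) that converges to zero.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.integrate import quad

# Synthetic arrival-rate observations Delta R(k) for pool depths k = 1..100.
k_obs = np.arange(1, 101, dtype=float)
dr_obs = 150.0 * np.exp(-0.03 * k_obs) + np.random.default_rng(0).normal(0.0, 2.0, 100)

def delr(k, c, lam):
    # Assumed fit family c * exp(-lam * k); any smooth form converging
    # to zero would serve the argument in the notes.
    return c * np.exp(-lam * k)

(c_hat, lam_hat), _ = curve_fit(delr, k_obs, dr_obs, p0=(100.0, 0.01))

# Estimate reldocs by integrating the fitted curve, truncating at k_max
# (as in the notes) rather than integrating all the way to infinity.
for k_max in (200, 500):
    est, _ = quad(delr, 1, k_max, args=(c_hat, lam_hat))
    print(f"k_max = {k_max}: estimated relevant documents ~ {est:.0f}")
```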

Results

It turns out that the observed arrival rates fit a smooth converging function very well. TREC judgments only extend to pool depth k = 100 for a given query, so results beyond k = 100 must be extrapolated; since the fit is so good for k < 100, extrapolation seems justifiable in this case.

[Figure: arrival rate vs. number of documents retrieved; the fitted curve is extrapolated beyond k = 100.]

In practice, we truncate the upper limit of the integral below infinity, because we have less confidence in the fit for large k:

   for k_max = 200:  \int_1^{200} delr(k) \, dk = 6707;
   for k_max = 500:  \int_1^{500} delr(k) \, dk = 9358.

However, the TREC judgments (which end at pool depth k = 100) identify only 5040 relevant documents. Thus many relevant documents are never marked: the judged pool substantially undercounts the truly relevant documents, and recall figures computed against it are correspondingly distorted.