Chapter 10: Information Retrieval. See the corresponding chapter in Manning & Schütze.

Evaluation Metrics in IR

Goal: In IR there is a much larger variety of possible metrics. For different tasks, different metrics might be appropriate.

Evaluation of IR Systems
(Venn diagram: all documents, the retrieved documents, the relevant documents, and their intersection "retrieved and relevant".)
Recall = #(retrieved and relevant) / #(relevant)
Precision = #(retrieved and relevant) / #(retrieved)
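A minimal sketch of both measures in Python (function and variable names are illustrative, not from the lecture):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)               # retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of 4 retrieved documents are relevant; 6 documents are relevant in total.
print(precision_recall({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6}))  # (0.75, 0.5)
```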

Precision vs. Recall

Average precision
Goal: don't focus on a specific recall level but still get one number.
$\mathrm{AvgP} = \frac{\sum_{r=1}^{N} P(r)\,\mathrm{rel}(r)}{\sum_{r=1}^{N} \mathrm{rel}(r)}$
$P(r)$: precision at rank r; $\mathrm{rel}(r)$: indicator function, 1 if the document at rank r is relevant.

Mean average precision
Problem: average precision is still specific to a single query.
$\mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AvgP}(q)$
$Q$: number of queries.
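Both measures are easy to compute from ranked result lists; a small sketch (names are illustrative):

```python
def average_precision(ranking, relevant):
    """AvgP: mean of P(r) over the ranks r where a relevant document appears."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for r, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / r          # P(r) at a relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: runs is a list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)

runs = [([1, 5, 2], {1, 2}), ([7, 3], {3})]
print(mean_average_precision(runs))            # ((1 + 2/3)/2 + 1/2)/2 = 0.667
```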

Interpolated precision
The observed recall levels for each query are distinct from the 11 standard recall levels, so an interpolation procedure is necessary. Let $r_j$ be the j-th standard recall level with j = 1, 2, ..., 10. Then
$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)$
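A sketch of the interpolation at the 11 standard recall levels. Note that it uses the common TREC-style variant, which takes the maximum precision at any recall >= r_j:

```python
def interpolated_precision(points):
    """points: observed (recall, precision) pairs for one query.
    Returns interpolated precision at the 11 standard recall levels."""
    levels = [j / 10 for j in range(11)]
    interp = []
    for rj in levels:
        candidates = [p for r, p in points if r >= rj]
        interp.append(max(candidates) if candidates else 0.0)
    return list(zip(levels, interp))
```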

Interpolated precision

Vector Space Model and tf-idf

Goal: Introduce a simple and fast retrieval algorithm.

Preprocessing
- Stemming
- Stop words
- Longer units ("New York")

Vector Space Model

Vector Space Model
Consider every document as a vector. The vector contains the weights of the index terms as components; with t index terms, the dimension of the vector space is also t. The similarity of a query to a document is the correlation between their vectors, quantified by the cosine of the angle between them.

Vector Space Model: Index term weights
$\mathrm{tf}_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}$ and $\mathrm{idf}_i = \log \frac{N}{n_i}$
$\mathrm{freq}_{i,j}$: the frequency with which term i occurs in document j; $N$: total number of documents; $n_i$: number of documents that contain term i.

Vector Space Model: Index term weights
The weight of a term in a document is then calculated as the product of the tf factor and the idf factor:
$w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$
or, for the query,
$w_{i,q} = \left(0.5 + 0.5\,\frac{\mathrm{freq}_{i,q}}{\max_l \mathrm{freq}_{l,q}}\right) \cdot \mathrm{idf}_i$
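As a sketch, the tf-idf weighting from the last two slides in Python (documents as lists of tokens; names are illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """w_{i,j} = tf_{i,j} * idf_i with tf normalised by the most frequent
    term of the document and idf_i = log(N / n_i)."""
    N = len(docs)
    n = Counter()                          # n_i: documents containing term i
    for doc in docs:
        n.update(set(doc))
    idf = {t: math.log(N / df) for t, df in n.items()}
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        weights.append({t: (f / max_freq) * idf[t] for t, f in freq.items()})
    return weights, idf

docs = [["new", "york", "new"], ["york", "times"], ["los", "angeles", "times"]]
doc_weights, idf = tf_idf(docs)
```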

Distance Metrics
Pick an L-norm, or use the angle/cosine between vectors:
$\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}$
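A minimal sketch of the cosine, operating on the sparse dict vectors produced by the tf-idf code above:

```python
import math

def cosine(q, d):
    """cos(q, d) = sum_i q_i d_i / (||q|| * ||d||) for sparse dict vectors."""
    dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```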

Vector Space Model
Advantages:
- Improves retrieval performance compared to Boolean retrieval
- Partial matching allowed
- Results can be sorted according to similarity
Disadvantages:
- Assumes that index terms are independent

Models of Term Distribution

Models for Term Distribution
Goal: understand the statistical properties of key words in a document collection.
Assumptions:
- The probability of observing a term is proportional to the length of the document
- In a short text, each word occurs only once
- Two neighbouring occurrences of the same term are statistically independent

Poisson Distribution
Probability that the i-th term occurs k times in a document:
$P_i(k) = e^{-\lambda_i}\,\frac{\lambda_i^k}{k!}$
$\lambda_i$: parameter of the distribution; k = 0, 1, 2, ...

Processes described by the Poisson distribution (Wikipedia):
- The number of cars that pass through a certain point on a road (sufficiently distant from traffic lights) during a given period of time.
- The number of spelling mistakes one makes while typing a single page.
- The number of phone calls at a call center per minute.
- The number of times a web server is accessed per minute.
- The number of roadkill (animals killed) found per unit length of road.
- The number of mutations in a given stretch of DNA after a certain amount of radiation.
- The number of unstable nuclei that decayed within a given period of time in a piece of radioactive substance. The radioactivity of the substance will weaken with time, so the total time interval used in the model should be significantly less than the mean lifetime of the substance.
- The number of pine trees per unit area of mixed forest.
- The number of stars in a given volume of space.
- The number of V2 rocket attacks per area in England, according to the fictionalized account in Thomas Pynchon's Gravity's Rainbow.
- The number of light bulbs that burn out in a certain amount of time.
- The number of viruses that can infect a cell in cell culture.
- The number of hematopoietic stem cells in a sample of unfractionated bone marrow cells.
- The inventiveness of an inventor over their career.
- The number of particles that "scatter" off of a target in a nuclear or high energy physics experiment.

Check normalization and the expectation value of k -> whiteboard

Interpretation
Let N be the number of documents in the corpus. Under the Poisson model,
$N \cdot E_i(k) = N \lambda_i = \mathrm{cf}_i$ (collection frequency)
$N \cdot (1 - P_i(0)) \approx \mathrm{df}_i$ (document frequency)
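A quick numerical check of these relations, using the "follows" row of the table below; $\lambda$ is taken from the table and N is recovered from $\mathrm{cf}_i = N \lambda_i$:

```python
import math

def poisson_pmf(k, lam):
    """P_i(k) = exp(-lambda_i) * lambda_i**k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

cf, lam = 23533, 0.2968          # 'follows' row of the table below
N = cf / lam                     # roughly 79,000 documents
df_predicted = N * (1 - poisson_pmf(0, lam))
print(round(df_predicted))       # ~20361; matches 20363 up to rounding of lambda
```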

Experimental Test of the Poisson Model

Word          cf_i     lambda_i   N(1 - P_i(0))   df_i    Overestimation
follows       23533    0.2968     20363           21744   0.94
transformed     840    0.0106       845             807   1.03
soviet        35337    0.4457     28515            8204   3.48
students      15925    0.2008     14425            4953   2.91
james         11175    0.1409     10421            9191   1.13
freshly         611    0.0077       609             395   1.54

The model often works, but some terms like "soviet" are bursty: the independence assumption is not valid.

Probabilistic Retrieval

Goal: Attempt to justify tf-idf for retrieval.

Probabilistic Retrieval -> whiteboard; see the corresponding section in Manning & Schütze.

Language Model Based Retrieval

Goal: A practical way of using the probabilistic ideas for retrieval.
Reading: J. Ponte and W. B. Croft, "A Language Modeling Approach to Information Retrieval", SIGIR 1998.

Language Modeling
The probability that a query q was generated by a probabilistic model based on a document d:
q = q_1 q_2 ... q_n,  d = d_1 d_2 ... d_m
$p(d \mid q) \propto p(q \mid d) \cdot p(d)$
Unigram model: $P(q \mid d) = \prod_{i=1}^{n} P(q_i \mid d)$
Ignore p(d) for now.
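A minimal sketch of unigram query-likelihood scoring with maximum-likelihood estimates. Note that a single unseen query term zeroes the score, which motivates the smoothing methods below:

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q | d) = prod_i P(q_i | d) with P(w | d) = c(w; d) / |d|."""
    counts = Counter(doc_terms)
    p = 1.0
    for w in query_terms:
        p *= counts[w] / len(doc_terms)    # zero if w does not occur in d
    return p

print(query_likelihood(["york", "times"],
                       ["new", "york", "new", "york", "times"]))  # 0.08
```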

Performance: tf-idf vs. LM (original Ponte & Croft paper)
Results on a TREC collection: the LM outperforms tf-idf.

Smoothing Methods
Jelinek-Mercer method: a linear interpolation of the maximum-likelihood model with the collection model:
$P_\lambda(w \mid d) = (1 - \lambda)\, P_{ml}(w \mid d) + \lambda\, P(w \mid C)$

Smoothing Methods
Absolute discounting: decrease the probability of seen words by subtracting a constant $\delta$ from their counts:
$P_\delta(w \mid d) = \frac{\max(c(w; d) - \delta,\, 0)}{\sum_{w^* \in V} c(w^*; d)} + \sigma\, P(w \mid C)$
where $\sigma = \delta\,|d|_u / |d|$ so that the probabilities sum to one ($|d|_u$: number of distinct terms in d).

Smoothing Methods
Bayesian smoothing using Dirichlet priors: the document model is a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution:
$P_\mu(w \mid d) = \frac{c(w; d) + \mu\, P(w \mid C)}{\sum_{w^* \in V} c(w^*; d) + \mu}$
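A sketch of all three smoothing methods in one place, assuming the collection model P(w | C) is estimated from a concatenation of all documents; the parameter values are common defaults, not tuned:

```python
from collections import Counter

def smoothed_lm(doc_terms, collection_terms, method="dirichlet",
                lam=0.5, delta=0.7, mu=2000):
    """Return a function w -> P(w | d) under the chosen smoothing method."""
    c_d = Counter(doc_terms)
    d_len = len(doc_terms)                    # |d| = sum_w c(w; d)
    c_C = Counter(collection_terms)
    C_len = len(collection_terms)

    def p_C(w):                               # collection model P(w | C)
        return c_C[w] / C_len

    def p(w):
        if method == "jelinek-mercer":
            return (1 - lam) * c_d[w] / d_len + lam * p_C(w)
        if method == "absolute-discount":
            sigma = delta * len(c_d) / d_len  # delta * |d|_u / |d|
            return max(c_d[w] - delta, 0) / d_len + sigma * p_C(w)
        if method == "dirichlet":
            return (c_d[w] + mu * p_C(w)) / (d_len + mu)
        raise ValueError(method)

    return p
```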

Comparing different smoothing methods: sentence retrieval in question answering

Improved Language Models
- Bigrams
- Class LMs
- Grammar
- Prior knowledge (document length)
- Other resources (e.g. WordNet)

Latent Semantic Analysis

Goal: Overcome the semantic mismatch between terms in the query and the documents (e.g. cosmonaut vs. astronaut).

Term-Document Matrix Structure
Idea: derive semantic relatedness from co-occurrence in the term-document matrix.

Term-Document Matrix Structure
- Create an artificially heterogeneous collection: 100 documents from each of 3 distinct newsgroups
- Indexed using a standard stop-word list: 12418 distinct terms
- Term-document matrix (12418 x 300), 8% fill of the sparse matrix
- The matrix of cosine similarities between documents shows clear structure

Theory of LSA -> whiteboard

Latent Semantic Analysis
- Word usage defined by the term-document co-occurrence matrix structure
- Latent structure / semantics in word usage
- Clustering of documents or words
- Singular Value Decomposition
- Cubic computational scaling
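A toy sketch of LSA via a truncated SVD of the term-document matrix, using numpy's dense SVD (real collections need sparse methods):

```python
import numpy as np

# Tiny term-document matrix (rows: terms, columns: documents).
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # dimension of the latent space
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in the latent space
```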

Term-Document Matrix Structure

LSA Performance
- LSA consistently improves recall on standard test collections (precision/recall generally improved)
- Variable performance on larger TREC collections
- The dimensionality of the latent space is a magic number; 300-1000 seems to work fine
- Computational cost is high

Toolkits

Software to use: Lucene, Lemur

Summary
- Evaluation measures
- Vector space model
- Models of term distribution
- Probabilistic retrieval
- Latent semantic analysis
- Language models for IR