Retrieval by Content, Part 2: Text Retrieval. Term Frequency and Inverse Document Frequency. Srihari: CSE 626

Text Retrieval
Retrieval of text-based information is referred to as Information Retrieval (IR), and it is what text search engines over the internet perform. Text is composed of two fundamental units: documents and terms. A document may be a journal paper, a book, an e-mail message, source code, or a web page. A term is a word, word pair, or phrase within a document.

Representation of Text
A good representation should retain as much of the semantic content of the data as possible while allowing distance measures between queries and documents to be computed efficiently. Full natural language processing is difficult because of, e.g., polysemy (the same word has different meanings) and synonymy (several different words describe the same thing). IR systems in use today therefore do not rely on NLP techniques; instead they rely on vectors of term occurrences.

Vector Space Representation
Documents are represented by a document-term matrix in which entry d_ij is the number of times term t_j appears in document D_i. In the example below, terms t1-t3 (database, SQL, index) are database-related and terms t4-t6 (regression, likelihood, linear) are regression-related.

Document-Term Matrix (t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear)
        database  SQL  index  regression  likelihood  linear
  D1       24      21     9        0           0          3
  D2       32      10     5        0           3          0
  D3       12      16     5        0           0          0
  D4        6       7     2        0           0          0
  D5       43      31    20        0           3          0
  D6        2       0     0       18           7         16
  D7        0       0     1       32          12          0
  D8        3       0     0       22           4          2
  D9        1       0     0       34          27         25
  D10       6       0     0       17           4         23
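
A minimal sketch of this representation in NumPy (the use of NumPy is an assumption; the term and document names and counts are those in the table above):

```python
import numpy as np

terms = ["database", "SQL", "index", "regression", "likelihood", "linear"]
docs = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]

# Rows are documents, columns are terms; X[i, j] = count of term j in document i.
X = np.array([
    [24, 21,  9,  0,  0,  3],   # D1
    [32, 10,  5,  0,  3,  0],   # D2
    [12, 16,  5,  0,  0,  0],   # D3
    [ 6,  7,  2,  0,  0,  0],   # D4
    [43, 31, 20,  0,  3,  0],   # D5
    [ 2,  0,  0, 18,  7, 16],   # D6
    [ 0,  0,  1, 32, 12,  0],   # D7
    [ 3,  0,  0, 22,  4,  2],   # D8
    [ 1,  0,  0, 34, 27, 25],   # D9
    [ 6,  0,  0, 17,  4, 23],   # D10
])

# e.g. how often "regression" occurs in D9:
print(X[docs.index("D9"), terms.index("regression")])   # 34
```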

Cosine Distance between Document Vectors
For two rows D_i and D_j of the document-term matrix above, the cosine measure is

  d_c(D_i, D_j) = \frac{\sum_{k=1}^{T} d_{ik} d_{jk}}{\sqrt{\sum_{k=1}^{T} d_{ik}^2 \, \sum_{k=1}^{T} d_{jk}^2}}

This is the cosine of the angle between the two vectors, equivalent to their inner product after each has been normalized to unit length. Higher values indicate more similar vectors, and the measure reflects similarity in terms of the relative distributions of the term components. Unlike Euclidean distance, the cosine is not influenced by one document being much smaller than the other.
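
A minimal sketch of this calculation in NumPy, using rows D1, D2, and D9 of the matrix above:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = np.array([24, 21, 9,  0,  0,  3])   # D1 (database document)
d2 = np.array([32, 10, 5,  0,  3,  0])   # D2 (database document)
d9 = np.array([ 1,  0, 0, 34, 27, 25])   # D9 (regression document)

print(round(cosine(d1, d2), 2))   # ~0.90: two database documents
print(round(cosine(d1, d9), 2))   # ~0.06: a database vs. a regression document
```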

Euclidean vs Cosine Distance
[Figure: two 10 x 10 grayscale plots of pairwise distances between the documents, indexed by document number. Left: Euclidean distance (white = 0, black = maximum distance). Right: cosine (white = larger cosine, i.e., smaller angle).]
Both plots show two light sub-blocks, corresponding to the cluster of database documents (D1-D5) and the cluster of regression documents (D6-D10). Under Euclidean distance, D3 and D4 appear closer to D6-D9 than to D5, because D3, D4, and D6-D9 all lie closer to the origin than D5 does. The cosine instead emphasizes the relative contributions of individual terms.
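
A quick check of that observation, assuming rows D3, D5, and D6 of the matrix above:

```python
import numpy as np

d3 = np.array([12, 16,  5,  0, 0,  0], dtype=float)   # small database document
d5 = np.array([43, 31, 20,  0, 3,  0], dtype=float)   # large database document
d6 = np.array([ 2,  0,  0, 18, 7, 16], dtype=float)   # regression document

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Euclidean distance puts D3 nearer to D6 than to D5 (D3 and D6 both lie near the origin)...
print(np.linalg.norm(d3 - d5), np.linalg.norm(d3 - d6))    # ~37.7 vs ~31.8
# ...while the cosine groups D3 with D5 (similar relative term distributions).
print(round(cosine(d3, d5), 2), round(cosine(d3, d6), 2))  # ~0.95 vs ~0.05
```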

Properties of the Document-Term Matrix
Each vector D_i is a surrogate for the original document. The full document-term matrix (N x T) is sparse; in the TREC collection, for example, only about 0.03% of its cells are non-zero. Each document is a vector in term space. Because of this sparsity, the original text documents are represented by an inverted file structure rather than by the matrix directly: each term t_j points to a list describing its occurrences in the documents (a sketch follows below). Generating the document-term matrix is itself non-trivial: are plural and singular forms counted as the same term? Are very common words used as terms at all?
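
A minimal sketch of an inverted file built from per-document term counts; the dict-of-postings layout and variable names are illustrative assumptions, not from the slides:

```python
from collections import defaultdict

# Term counts for a few of the documents above (term -> count, zero entries omitted).
doc_terms = {
    "D1": {"database": 24, "SQL": 21, "index": 9, "linear": 3},
    "D6": {"database": 2, "regression": 18, "likelihood": 7, "linear": 16},
    "D7": {"index": 1, "regression": 32, "likelihood": 12},
}

# Inverted file: each term points to a postings list of (document, count) pairs,
# so only the non-zero cells of the document-term matrix are stored.
inverted = defaultdict(list)
for doc_id, counts in doc_terms.items():
    for term, count in counts.items():
        inverted[term].append((doc_id, count))

print(inverted["regression"])   # [('D6', 18), ('D7', 32)]
```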

Vector Space Representation of Queries
Queries are expressed using the same term-based representation as documents; a query is simply a document with very few terms. With the six terms (database, SQL, index, regression, likelihood, linear), example query vectors are:
  database   = (1, 0, 0, 0, 0, 0)
  SQL        = (0, 1, 0, 0, 0, 0)
  regression = (0, 0, 0, 1, 0, 0)

Query Match Against the Database Using Cosine Distance
Matching each query vector against the rows of the document-term matrix with the cosine measure (see the sketch below) gives:
  database   = (1, 0, 0, 0, 0, 0): closest match is D2
  SQL        = (0, 1, 0, 0, 0, 0): closest match is D3
  regression = (0, 0, 0, 1, 0, 0): closest match is D8
The cosine measure thus ranks D2, D3, and D8 as the closest documents to the three queries.
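
A minimal sketch reproducing this ranking (counts from the document-term matrix above; the query vectors are the unit-basis vectors shown earlier):

```python
import numpy as np

# Document-term matrix from the slides (rows D1..D10; columns:
# database, SQL, index, regression, likelihood, linear).
X = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],[6,7,2,0,0,0],
              [43,31,20,0,3,0],[2,0,0,18,7,16],[0,0,1,32,12,0],[3,0,0,22,4,2],
              [1,0,0,34,27,25],[6,0,0,17,4,23]], dtype=float)

def cosines(M, query):
    """Cosine between the query vector and every row of M; higher = closer."""
    return M @ query / (np.linalg.norm(M, axis=1) * np.linalg.norm(query))

queries = {"database":   np.array([1, 0, 0, 0, 0, 0], dtype=float),
           "SQL":        np.array([0, 1, 0, 0, 0, 0], dtype=float),
           "regression": np.array([0, 0, 0, 1, 0, 0], dtype=float)}

for name, q in queries.items():
    best = int(np.argmax(cosines(X, q))) + 1
    print("%-10s -> D%d" % (name, best))   # database -> D2, SQL -> D3, regression -> D8
```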

Weights in the Vector Space Model
The entry d_ik is the weight of the kth term in the ith document, and many different weighting choices appear in the IR literature. The Boolean approach sets the weight to 1 if the term occurs in the document and 0 if it does not; this favors larger documents, since a larger document is more likely to contain a query term somewhere. The TF-IDF weighting scheme is the most popular: TF (term frequency) is the raw term count seen earlier, and IDF (inverse document frequency) favors terms that occur in relatively few documents.

Inverse Document Frequency of a Term
Definition: IDF(j) = log(N / n_j), where N is the total number of documents and n_j is the number of documents containing term j (so n_j / N is the fraction of documents that contain term j). IDF favors terms that occur in relatively few documents.
Example: for the document-term matrix above, the IDF weights of the six terms (using natural logarithms) are 0.105, 0.693, 0.511, 0.693, 0.357, 0.693. The term database occurs in the most documents (9 of the 10) and is given the least weight; regression (like SQL and linear) occurs in only 5 documents and is given the highest weight.
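
A sketch that reproduces these IDF weights from the document-term matrix (natural logarithms, as in the slide):

```python
import numpy as np

# Document-term matrix from the slides (rows D1..D10; columns:
# database, SQL, index, regression, likelihood, linear).
X = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],[6,7,2,0,0,0],
              [43,31,20,0,3,0],[2,0,0,18,7,16],[0,0,1,32,12,0],[3,0,0,22,4,2],
              [1,0,0,34,27,25],[6,0,0,17,4,23]])

N = X.shape[0]               # total number of documents: 10
n_j = (X > 0).sum(axis=0)    # documents containing each term: [9 5 6 5 7 5]
idf = np.log(N / n_j)

print(np.round(idf, 3))      # [0.105 0.693 0.511 0.693 0.357 0.693]
```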

TF-IDF Weighting of Terms
In the TF-IDF weighting scheme, TF(d, t) denotes the term frequency of term t in document d, and IDF(t) denotes the inverse document frequency of term t. The TF-IDF weight of a particular term in a particular document is the product TF(d, t) × IDF(t). An example is given next.

TF-IDF Document Matrix
Multiplying each column of the TF (term-count) document matrix above by the corresponding IDF weight (0.105, 0.693, 0.511, 0.693, 0.357, 0.693) gives the TF-IDF document matrix:

        database   SQL    index  regression  likelihood  linear
  D1      2.53     14.56   4.60      0           0         2.07
  D2      3.37      6.93   2.55      0           1.07      0
  D3      1.26     11.09   2.55      0           0         0
  D4      0.63      4.85   1.02      0           0         0
  D5      4.53     21.48  10.21      0           1.07      0
  D6      0.21      0      0        12.47        2.50     11.09
  D7      0         0      0.51     22.18        4.28      0
  D8      0.31      0      0        15.24        1.42      1.38
  D9      0.10      0      0        23.56        9.63     17.33
  D10     0.63      0      0        11.78        1.42     15.94
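
A sketch of the same computation: scale each column of the raw count matrix by its IDF weight.

```python
import numpy as np

# Raw document-term counts from the slides (rows D1..D10; columns:
# database, SQL, index, regression, likelihood, linear).
X = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],[6,7,2,0,0,0],
              [43,31,20,0,3,0],[2,0,0,18,7,16],[0,0,1,32,12,0],[3,0,0,22,4,2],
              [1,0,0,34,27,25],[6,0,0,17,4,23]])

idf = np.log(X.shape[0] / (X > 0).sum(axis=0))   # [0.105 0.693 0.511 0.693 0.357 0.693]
tfidf = X * idf                                  # broadcasting scales each column by its IDF

print(np.round(tfidf[0], 2))   # D1 row: [ 2.53 14.56  4.6   0.    0.    2.08]
```

(The slide's value of 2.07 for the last entry of D1 reflects rounding of the IDF weight before multiplying.)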

Classic Approach to Matching Queries to Documents
Represent the query as a term vector, with 1s for terms that occur in the query and 0s everywhere else. Represent each document as a term vector whose components are TF-IDF weights. Then use the cosine measure to rank the documents by closeness to the query. A disadvantage is that shorter documents tend to match the query terms better than longer ones.

Document Retrieval with TF and TF-IDF
Suppose the query contains both database and index: Q = (1, 0, 1, 0, 0, 0). Matching Q with the cosine measure against the TF (raw count) document matrix and against the TF-IDF document matrix above gives:

  Document   TF distance   TF-IDF distance
  D1           0.70            0.32
  D2           0.77            0.51
  D3           0.58            0.24
  D4           0.60            0.23
  D5           0.79            0.43
  D6           0.06            0.01
  D7           0.02            0.02
  D8           0.09            0.01
  D9           0.01            0.00
  D10          0.14            0.02

The cosine is higher when there is a better match, so the maximum value in each column identifies the document retrieved for Q: TF-IDF chooses D2, while raw TF chooses D5. It is unclear that D5 is really the better choice. The cosine measure also favors shorter documents, which is a disadvantage.
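
A sketch reproducing both columns for Q = (1, 0, 1, 0, 0, 0), reusing the count matrix and IDF weights from the earlier sketches:

```python
import numpy as np

# Raw document-term counts from the slides (rows D1..D10; columns:
# database, SQL, index, regression, likelihood, linear).
X = np.array([[24,21,9,0,0,3],[32,10,5,0,3,0],[12,16,5,0,0,0],[6,7,2,0,0,0],
              [43,31,20,0,3,0],[2,0,0,18,7,16],[0,0,1,32,12,0],[3,0,0,22,4,2],
              [1,0,0,34,27,25],[6,0,0,17,4,23]], dtype=float)

idf = np.log(X.shape[0] / (X > 0).sum(axis=0))
tfidf = X * idf

q = np.array([1, 0, 1, 0, 0, 0], dtype=float)    # query: "database" and "index"

def cosines(M, query):
    return M @ query / (np.linalg.norm(M, axis=1) * np.linalg.norm(query))

print(np.round(cosines(X, q), 2))       # TF:     [0.7  0.77 0.58 0.6  0.79 0.06 0.02 0.09 0.01 0.14]
print(np.round(cosines(tfidf, q), 2))   # TF-IDF: [0.32 0.51 0.24 0.23 0.43 0.01 0.02 0.01 0.   0.02]
print("TF picks D%d, TF-IDF picks D%d" % (np.argmax(cosines(X, q)) + 1,
                                          np.argmax(cosines(tfidf, q)) + 1))   # D5, D2
```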

Comments on the TF-IDF Method
A TF-IDF-based IR system first builds an inverted index containing the TF and IDF information. Given a query vector, it returns some number of document vectors that are most similar to the query. TF-IDF weighting has proved superior in precision-recall terms to many other weighting schemes and serves as the default baseline method when comparing retrieval performance.