Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing

Background on IR Retrieve textual information from document repositories. What is unstructured data? Scales of information retrieval systems Searching the web Searching document repositories (e.g. of an enterprise) Searching documents of a personal computer

Background on IR Ad-hoc retrieval: the user enters a query describing the desired information and the system returns a list of documents. Two main models: exact match (e.g. Boolean queries), which is the somewhat older approach, and the ranked list.

Text Categorization Attempt to assign documents to two or more pre-defined categories. Routing: Ranking of documents according to relevance. Training information in the form of relevance labels is available. Filtering: Absolute assessment of relevance.

Design Features of IR Systems Inverted Index: Primary data structure of IR systems. An inverted index lists, for each word, the documents that contain it and its frequency of occurrence. Including position information also allows searching for phrases. Stop List (Function Words): Lists words unlikely to be useful for searching. Examples: the, on, could. Excluding these words considerably reduces the size of the inverted index without significantly affecting its performance. However, it makes it impossible to search for phrases that contain stop words.
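As a concrete illustration of the data structure just described, here is a minimal Python sketch (not from the original slides) of a positional inverted index; the tokenizer and the tiny stop list are simplifying assumptions.

```python
from collections import defaultdict

STOP_WORDS = {"the", "on", "could", "a", "of", "to"}   # tiny illustrative stop list

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token not in STOP_WORDS:
                index[token][doc_id].append(pos)
    return index

docs = {1: "the man said that a space age man appeared",
        2: "those men appeared to say their age"}
index = build_inverted_index(docs)
print(dict(index["age"]))   # {1: [6], 2: [6]}
```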

Design Features (Cont.) Stemming: Simplified form of morphological analysis, often consisting simply of truncating a word. For example, laughing, laughs, laugh and laughed are all stemmed to laugh. The problem is that semantically different words like gallery and gall may both be truncated to gall, making the stems unintelligible to users.
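A toy sketch of suffix-stripping stemming in Python, assuming a hand-picked suffix list (real systems typically use something like the Porter stemmer):

```python
def crude_stem(word, suffixes=("ing", "ed", "s")):
    """Naive suffix-stripping stemmer: conflates laughing/laughs/laughed with laugh."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["laughing", "laughs", "laughed", "laugh"]])
# -> ['laugh', 'laugh', 'laugh', 'laugh']
```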

Evaluation Measures Precision: Percentage of returned items that are relevant. Recall: Percentage of all relevant documents in the collection that are in the returned set. Ways to combine precision and recall: precision at a particular cutoff (e.g. precision at 5); uninterpolated average precision (precision values averaged at the ranks where relevant documents occur); interpolated average precision (likewise, but interpolated); precision-recall curves; the F measure.
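A small Python sketch of these measures, assuming binary relevance judgments; the function names are mine, not from the slides:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for a returned set versus the set of all relevant documents."""
    hits = len(set(retrieved) & set(relevant))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def precision_at_k(ranked, relevant, k):
    """Cutoff precision, e.g. precision at 5."""
    return precision_recall_f1(ranked[:k], relevant)[0]

def average_precision(ranked, relevant):
    """Uninterpolated average precision: precision averaged at the ranks of relevant documents."""
    precisions = [precision_at_k(ranked, relevant, i + 1)
                  for i, doc in enumerate(ranked) if doc in relevant]
    return sum(precisions) / len(relevant) if relevant else 0.0

print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))   # (1/2 + 2/4) / 2 = 0.5
```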

Evaluation Measures example for three rankings

Un-interpolated & interpolated average precision

Probability Ranking Principle (PRP) Ranking documents in order of decreasing probability of relevance is optimal. View retrieval as a greedy search that aims to identify the most valuable document. Assumptions of PRP: Documents are independent. Complex information need is broken into a number of queries which are each optimized in isolation. Probability of relevance is only estimated.

The Vector Space Model Measure closeness between query and document. Queries and documents represented as n dimensional vectors. Each dimension corresponds to a word. Advantages: Conceptual simplicity and use of spatial proximity for semantic proximity.

Vector Similarity d1 = The man said that a space age man appeared d2 = Those men appeared to say their age

Vector Similarity (Cont.) Cosine measure, or normalized correlation coefficient; Euclidean distance.
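The formulas on the original slide were lost in extraction; the standard definitions of the two measures are:

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}, \qquad \lVert \vec{q} - \vec{d} \rVert_2 = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2}$$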

Term Weighting Quantities used: $\mathrm{tf}_{i,j}$ (term frequency): # of occurrences of $w_i$ in $d_j$. $\mathrm{df}_i$ (document frequency): # of documents that $w_i$ occurs in. $\mathrm{cf}_i$ (collection frequency): total # of occurrences of $w_i$ in the collection.

Term Weighting (Cont.) Dampened term frequency: $1 + \log(\mathrm{tf}_{i,j})$ for $\mathrm{tf}_{i,j} > 0$. $\mathrm{df}_i$: an indicator of informativeness, used via the inverse document frequency (IDF) weight $\log \frac{N}{\mathrm{df}_i}$. TF.IDF (term frequency & inverse document frequency), an indicator of semantically focused words: $\mathrm{weight}(i,j) = (1 + \log \mathrm{tf}_{i,j}) \log \frac{N}{\mathrm{df}_i}$ if $\mathrm{tf}_{i,j} \ge 1$, and $\mathrm{weight}(i,j) = 0$ if $\mathrm{tf}_{i,j} = 0$.
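A minimal Python sketch of this TF.IDF weighting over a toy collection (the collection and the helper name are assumptions for illustration):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: (1 + log tf) * log(N / df)} dict per document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf})
    return weights

docs = [["space", "age", "man", "man"], ["men", "say", "age"], ["car", "truck"]]
print(tfidf_weights(docs)[0])   # 'man' gets a tf boost; 'age' is downweighted by its higher df
```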

Normalization Normalization is considered essential in many weighting schemes; otherwise longer documents would tend to be ranked higher.

Term Distribution Models Develop a model for the distribution of a word and use this model to characterize its importance for retrieval. Estimate $p_i(k)$: the proportion of times that word $w_i$ appears $k$ times in a document. Models: Poisson, two-Poisson and the K mixture. The IDF can be derived from term distribution models.

The Poisson Distribution $p(k; \lambda_i) = e^{-\lambda_i} \frac{\lambda_i^k}{k!}$ for some $\lambda_i > 0$. The parameter $\lambda_i > 0$ is the average number of occurrences of $w_i$ per document: $\lambda_i = \frac{\mathrm{cf}_i}{N}$. We are interested in the frequency of occurrence of a particular word $w_i$ in a document. The Poisson distribution is good for estimating non-content words.

The Two-Poisson Model Better fit to the frequency distribution: a mixture of two Poissons. Non-privileged class: low average # of occurrences, occurrences are accidental. Privileged class: high average # of occurrences, central content word. $p(k; \pi, \lambda_1, \lambda_2) = \pi\, e^{-\lambda_1} \frac{\lambda_1^k}{k!} + (1-\pi)\, e^{-\lambda_2} \frac{\lambda_2^k}{k!}$ where $\pi$ is the probability of a document being in the privileged class, $1-\pi$ the probability of being in the non-privileged class, and $\lambda_1, \lambda_2$ the average number of occurrences of word $w_i$ in each class.

The K Mixture More accurate. $p_i(k) = (1-\alpha)\,\delta_{k,0} + \frac{\alpha}{\beta+1} \left(\frac{\beta}{\beta+1}\right)^k$ with $\lambda = \frac{\mathrm{cf}}{N}$, $\mathrm{IDF} = \log_2 \frac{N}{\mathrm{df}}$, $\beta = \lambda \cdot 2^{\mathrm{IDF}} - 1 = \frac{\mathrm{cf} - \mathrm{df}}{\mathrm{df}}$, $\alpha = \frac{\lambda}{\beta}$. $\beta$: # of extra terms per document in which the term occurs. $\alpha$: absolute frequency of the term.
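A small Python sketch, under the parameter definitions above, of estimating the K mixture from collection statistics (the helper name is mine; it assumes cf > df so that β > 0):

```python
def k_mixture(cf, df, N):
    """Return p(k) for the K mixture, with alpha and beta estimated from cf, df and N."""
    lam = cf / N                 # average number of occurrences per document
    beta = (cf - df) / df        # extra occurrences per document in which the term occurs
    alpha = lam / beta
    def p(k):
        tail = (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
        return (1 - alpha) + tail if k == 0 else tail
    return p

p = k_mixture(cf=100, df=60, N=1000)
print(p(0), p(1), p(2))          # probability of a document containing the term 0, 1, 2 times
```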

Latent Semantic Indexing Projects queries and documents into a space with latent semantic dimensions. Dimensionality reduction: the latent semantic space that we project into has fewer dimensions than the original space. Exploits co-occurrence: the fact that two or more terms occur in the same documents more often than chance would predict. Similarity metric: co-occurring terms are projected onto the same dimensions.

Singular Value Decomposition SVD takes a term-by-document matrix A in an n-dimensional space and projects it to a matrix $\hat{A}$ of lower rank k (n >> k), such that the 2-norm (distance) between the two matrices, $\| A - \hat{A} \|_2$, is minimized.

SVD (Cont.) SVD projection: $A_{t \times d} = T_{t \times n}\, S_{n \times n}\, (D_{d \times n})^T$ where $A_{t \times d}$ is the term-by-document matrix, $T_{t \times n}$ contains the terms in the new space, $S_{n \times n}$ holds the singular values of A in descending order, $D_{d \times n}$ is the document matrix in the new space, and $n = \min(t, d)$. T and D have orthonormal columns. Fewer dimensions may be retained to achieve dimensionality reduction.

LSI in IR Encode terms and documents using factors derived from SVD. Rank similarity of terms and docs to query via Euclidean distances or cosines.

LSI example

LSI example (cont.): the matrices T, S and D from the decomposition.

LSI example: original vs. dimension-reduced matrix
A =
1 0 1 0 0 0
0 1 0 0 0 0
1 1 0 0 0 0
1 0 0 1 1 0
0 0 0 1 0 1
k = 2:
0.85 0.52 0.28 0.13 0.21 -0.08
0.36 0.36 0.16 -0.21 -0.03 -0.18
1.00 0.72 0.36 -0.05 0.16 -0.21
0.98 0.13 0.21 1.03 0.62 0.41
0.13 -0.39 -0.08 0.90 0.41 0.49
k = 3:
1.05 -0.03 0.61 -0.02 0.29 -0.31
0.15 0.92 -0.18 -0.05 -0.12 0.06
0.87 1.07 0.15 0.04 0.10 -0.05
1.03 -0.02 0.29 0.99 0.64 0.35
-0.02 0.01 -0.31 1.01 0.35 0.66
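A numpy sketch that should reproduce the reduced-rank matrices above (up to rounding) from the original A:

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)      # A = T @ diag(s) @ Dt
for k in (2, 3):
    A_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]     # best rank-k approximation in the 2-norm
    print(f"k = {k}:\n{np.round(A_hat, 2)}")
```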

LSI example (cont.) Condensed representation of the documents: $B = S_{2 \times 2} D_{2 \times n}$; similarities between documents are then the cosines between the columns of B.

LSI example: querying. A query is folded into the reduced space via $\hat{q} = q^T T_k S_k^{-1}$. For example, the query astronaut car, $q = (0\ 1\ 0\ 1\ 0)^T$, gives $\hat{q} = (0.38\ 0.01)^T$. Query result: $\cos(\hat{q}, b_i) = (0.96\ 0.56\ 0.81\ 0.72\ 0.91\ 0.40)$.
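A self-contained numpy sketch of this query folding, using the example matrix A and the condensed representation B = S D from the previous slides; it should approximately reproduce the cosines listed above.

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
T, s, Dt = np.linalg.svd(A, full_matrices=False)

k = 2
Tk, Sk = T[:, :k], np.diag(s[:k])
B = Sk @ Dt[:k, :]                                    # condensed documents: one column per document

q = np.array([0, 1, 0, 1, 0], dtype=float)            # query "astronaut car"
q_hat = q @ Tk @ np.linalg.inv(Sk)                     # fold the query into the 2-D LSI space

cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(np.round([cosine(q_hat, B[:, j]) for j in range(B.shape[1])], 2))
```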

Latent semantic indexing in IR The application of SVD to IR is called Latent Semantic Indexing (LSI). Comparing LSI to standard vector space search: higher recall, reduced precision. The latent in the name comes from the fact that the original terms are transformed to a new basis, thought to be the true representation of the data. Is the SVD representation more efficient? It seems to be, due to compression, e.g. if one reduces to 150 dimensions. But it needs costly matrix computations: inverted indexing is not possible, and there is the effort of computing the SVD itself. Objection to SVD: SVD is really designed for normally distributed data, but count data is evidently not normal.

Discourse Segmentation Break documents into topically coherent multi-paragraph subparts. Detect topic shifts within document

TextTiling (Hearst and Plaunt, 1993) Search for vocabulary shifts from one subtopic to another. Divide the text into fixed-size blocks (20 words) and look for topic shifts in between these blocks. Cohesion scorer: measures the topic continuity at each gap (the point between two blocks). Depth scorer: at a gap, determines how low the cohesion score is compared to surrounding gaps. Boundary selector: looks at the depth scores and selects the gaps that are the best segmentation points.
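A rough Python sketch of the cohesion and depth scoring idea (a simplification, not Hearst's exact algorithm; the similarity measure and function names are assumptions):

```python
from collections import Counter
import math

def cohesion_scores(tokens, block_size=20):
    """Cosine similarity between adjacent fixed-size blocks: one cohesion score per gap."""
    blocks = [Counter(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]
    def cos(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0
    return [cos(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

def depth_scores(scores):
    """How far each gap's cohesion dips below the highest score on either side; deep dips suggest boundaries."""
    return [(max(scores[: i + 1]) - s) + (max(scores[i:]) - s) for i, s in enumerate(scores)]
```

A boundary selector would then pick the gaps with the largest depth scores as segmentation points.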

Three constellations of cohesion scores