Natural Language Processing Topics in Information Retrieval Updated 5/10
Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing
Background on IR Retrieve textual information from document repositories. What is unstructured data? Scales of information retrieval systems Searching the web Searching document repositories (e.g. of an enterprise) Searching documents of a personal computer
Background on IR Ad-hoc retrieval: The user enters a query describing the desired information, and the system returns a list of documents. Two main models: exact match (e.g. for Boolean queries), the somewhat older approach, and the ranked list.
Text Categorization Attempt to assign documents to two or more pre-defined categories. Routing: Ranking of documents according to relevance. Training information in the form of relevance labels is available. Filtering: Absolute assessment of relevance.
Design Features of IR Systems Inverted Index: Primary data structure of IR systems. An inverted index lists, for each word, the documents that contain it and its frequency of occurrence. Including position information allows searching for phrases. Stop List (Function Words): Lists words unlikely to be useful for searching. Examples: the, on, could. Excluding these words considerably reduces the size of the inverted index without significantly affecting its performance. However, it makes it impossible to search for phrases that contain stop words.
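The index structure described above can be sketched in a few lines of Python. This is a minimal illustration; the sample documents and the function name are invented for the example, and no stop-word filtering is applied.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]}.

    Storing positions (not just frequencies) is what enables phrase search:
    a phrase occurs where consecutive words have consecutive positions.
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

docs = ["the man said that a space age man appeared",
        "those men appeared to say their age"]
index = build_inverted_index(docs)
```

The term frequency of a word in a document is then simply the length of its position list, e.g. `len(index["man"][0])` is 2.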
Design Features (Cont.) Stemming: Simplified form of morphological analysis consisting simply of truncating a word. For example, laughing, laughs, laugh and laughed are all stemmed to laugh. The problem is that semantically different words like gallery and gall may both be truncated to gall, making the stems unintelligible to users.
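Truncation stemming of this kind can be sketched as follows. This is a naive illustration, not a real stemmer such as Porter's; the suffix list and minimum stem length are assumptions made for the example.

```python
def truncate_stem(word, suffixes=("ing", "ed", "s")):
    """Naive truncation stemming: strip the first matching suffix,
    keeping at least a 3-character stem."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

This maps laughing, laughs and laughed all to laugh, but a purely mechanical rule has no notion of meaning, which is exactly how conflations like gallery/gall arise in cruder length-based truncation.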
Evaluation Measures Precision: Percentage of returned documents that are relevant. Recall: Percentage of all relevant documents in the collection that are in the returned set. Combining precision and recall: Cutoff: precision at a particular cutoff, e.g. precision at 5. Uninterpolated average precision: precision values are averaged at the ranks where relevant documents occur. Interpolated average precision: likewise, but interpolated. Precision-recall curves. F measure.
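A minimal sketch of these measures for a single returned set; the retrieved and relevant sets below are made up for illustration, and the F measure shown is the balanced (harmonic-mean) form.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall, and balanced F measure for one result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant items actually returned
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

retrieved = {1, 2, 3, 4, 5}   # documents returned by the system (hypothetical)
relevant = {1, 3, 6, 7}       # ground-truth relevant documents (hypothetical)
p, r, f = precision_recall_f(retrieved, relevant)
```

Here 2 of the 5 returned documents are relevant (precision 0.4), and 2 of the 4 relevant documents were found (recall 0.5).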
Evaluation Measures example for three rankings
Un-interpolated & interpolated average precision
Probability Ranking Principle (PRP) Ranking documents in order of decreasing probability of relevance is optimal. View retrieval as a greedy search that aims to identify the most valuable document. Assumptions of PRP: Documents are independent. Complex information need is broken into a number of queries which are each optimized in isolation. Probability of relevance is only estimated.
The Vector Space Model Measure closeness between query and document. Queries and documents represented as n dimensional vectors. Each dimension corresponds to a word. Advantages: Conceptual simplicity and use of spatial proximity for semantic proximity.
Vector Similarity d1 = The man said that a space age man appeared d2 = Those men appeared to say their age
Vector Similarity (Cont.) Cosine measure (normalized correlation coefficient): cos(q, d) = Σ_i q_i d_i / (sqrt(Σ_i q_i^2) · sqrt(Σ_i d_i^2)). Euclidean distance: |q − d| = sqrt(Σ_i (q_i − d_i)^2).
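Both similarity measures can be sketched directly from their definitions (function names are assumptions for this example):

```python
import math

def cosine(x, y):
    """Cosine measure: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that on length-normalized vectors the two measures induce the same ranking; cosine is preferred in IR precisely because it ignores document length.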
Term Weighting Quantities used: tf_{i,j} (term frequency): # of occurrences of w_i in d_j. df_i (document frequency): # of documents that w_i occurs in. cf_i (collection frequency): total # of occurrences of w_i in the collection.
Term Weighting (Cont.) Dampened term frequency: 1 + log(tf_{i,j}) for tf_{i,j} > 0. df_i: indicator of informativeness; inverse document frequency (IDF) weighting: log(N / df_i). TF.IDF (term frequency & inverse document frequency), an indicator of semantically focused words: weight(i, j) = (1 + log(tf_{i,j})) · log(N / df_i) if tf_{i,j} ≥ 1, and 0 if tf_{i,j} = 0.
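The weighting scheme above can be sketched as a single function (the function name is an assumption; natural logarithms are used, as the base only rescales all weights uniformly):

```python
import math

def tfidf_weight(tf, df, n_docs):
    """TF.IDF: (1 + log tf) * log(N / df); zero when the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)
```

A term occurring once in a document but in only 10 of 100 documents gets weight log(10) ≈ 2.30, while a term appearing in every document gets weight 0 regardless of tf, matching the intuition that such terms carry no discriminating power.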
Normalization Normalization is considered essential in many weighting schemes; otherwise longer documents would tend to be ranked higher.
Term Distribution Models Develop a model for the distribution of a word and use this model to characterize its importance for retrieval. Estimate p_i(k): the proportion of times that word w_i appears k times in a document. Models: Poisson, two-Poisson and K mixture. We can derive the IDF from term distribution models.
The Poisson Distribution p(k; λ_i) = e^{−λ_i} λ_i^k / k! for some λ_i > 0. The parameter λ_i > 0 is the average number of occurrences of w_i per document: λ_i = cf_i / N. We are interested in the frequency of occurrence of a particular word w_i in a document. The Poisson distribution is good for estimating non-content words.
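A sketch of the Poisson model as defined above, with the parameter estimated from collection statistics (function names and the example counts are assumptions):

```python
import math

def poisson_pmf(k, lam):
    """p(k; lam) = e^{-lam} * lam^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_lambda(cf, n_docs):
    """lam_i = cf_i / N: average occurrences of w_i per document."""
    return cf / n_docs

# e.g. a word with collection frequency 500 in a 1000-document collection
lam = poisson_lambda(500, 1000)
p_zero = poisson_pmf(0, lam)   # estimated fraction of documents without the word
```

From p_zero one can estimate the document frequency as N(1 − p(0; λ)), which is how the IDF can be derived from a term distribution model.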
The Two-Poisson Model Better fit to the frequency distribution: a mixture of two Poissons. Non-privileged class: low average # of occurrences; occurrences are accidental. Privileged class: high average # of occurrences; central content word. p(k; π, λ_1, λ_2) = π e^{−λ_1} λ_1^k / k! + (1 − π) e^{−λ_2} λ_2^k / k! π: probability of a document being in the privileged class. 1 − π: probability of a document being in the non-privileged class. λ_1, λ_2: average number of occurrences of word w_i in each class.
The K Mixture More accurate: p_i(k) = (1 − α) δ_{k,0} + (α / (β + 1)) · (β / (β + 1))^k where λ = cf / N, IDF = log_2(N / df), β = λ · 2^{IDF} − 1 = (cf − df) / df, α = λ / β. β: # of extra terms per document in which the term occurs. α: absolute frequency of the term.
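The K mixture parameters can be computed directly from collection statistics, as the formulas above show; a sketch (function names and the example counts are assumptions):

```python
import math

def k_mixture_params(cf, df, n_docs):
    """Estimate alpha and beta of the K mixture from cf, df and N."""
    lam = cf / n_docs
    idf = math.log2(n_docs / df)
    beta = lam * 2 ** idf - 1        # algebraically equal to (cf - df) / df
    alpha = lam / beta
    return alpha, beta

def k_mixture_pmf(k, alpha, beta):
    """p_i(k) = (1-a)*delta_{k,0} + (a/(b+1)) * (b/(b+1))^k"""
    delta = 1.0 if k == 0 else 0.0
    return (1 - alpha) * delta + (alpha / (beta + 1)) * (beta / (beta + 1)) ** k

# e.g. cf = 300, df = 100 in a 1000-document collection
alpha, beta = k_mixture_params(300, 100, 1000)
```

With these numbers β = (300 − 100)/100 = 2 extra occurrences per document containing the term, and the probabilities p_i(k) sum to 1 over k, as a proper distribution must.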
Latent Semantic Indexing Projects queries and documents into a space with latent semantic dimensions. Dimensionality reduction: the latent semantic space that we project into has fewer dimensions than the original space. Exploits co-occurrence: the fact that two or more terms occur in the same document more often than chance. Similarity metric: Co-occurring terms are projected onto the same dimensions.
Singular Value Decomposition SVD takes a term-by-document matrix A of rank n and projects it to a matrix Â of lower rank k (n ≫ k) such that the 2-norm (distance) between the two matrices is minimized: Δ = ‖A − Â‖_2.
SVD (Cont) SVD projection: A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T where A_{t×d} is the term-by-document matrix, T_{t×n} contains the terms in the new space, S_{n×n} contains the singular values of A in descending order, D_{d×n} is the document matrix in the new space, and n = min(t, d). T and D have orthonormal columns. Fewer dimensions may be retained to achieve dimensionality reduction.
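The decomposition and its rank-k truncation can be sketched with NumPy, using the 5-term × 6-document matrix from the LSI example later in these notes:

```python
import numpy as np

# Term-by-document count matrix (5 terms as rows, 6 documents as columns).
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

# A = T S D^T; numpy returns D^T directly and s as a vector of
# singular values already sorted in descending order.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: the best rank-k
# approximation of A in the 2-norm.
k = 2
A_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
```

Truncating to k = 2 reproduces the dimension-reduced matrix shown in the LSI example (up to rounding).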
LSI in IR Encode terms and documents using factors derived from SVD. Rank similarity of terms and docs to query via Euclidean distances or cosines.
LSI example
LSI example cont. [Figure: the matrices T, S and D from the SVD of A]
LSI example: original vs. dimension-reduced
A =
1 0 1 0 0 0
0 1 0 0 0 0
1 1 0 0 0 0
1 0 0 1 1 0
0 0 0 1 0 1
k = 2:
0.85 0.52 0.28 0.13 0.21 -0.08
0.36 0.36 0.16 -0.21 -0.03 -0.18
1.00 0.72 0.36 -0.05 0.16 -0.21
0.98 0.13 0.21 1.03 0.62 0.41
0.13 -0.39 -0.08 0.90 0.41 0.49
k = 3:
1.05 -0.03 0.61 -0.02 0.29 -0.31
0.15 0.92 -0.18 -0.05 -0.12 0.06
0.87 1.07 0.15 0.04 0.10 -0.05
1.03 -0.02 0.29 0.99 0.64 0.35
-0.02 0.01 -0.31 1.01 0.35 0.66
LSI example cont. Condensed representation of documents: B = S_{2×2} D_{2×n}; the columns of B represent the documents in the reduced space, and document similarities are given by the cosines between these columns.
LSI example - querying q̂ = q^T T_k S_k^{−1}. For example: q = astronaut car = (0 1 0 1 0)^T, giving q̂ = (0.38 0.01)^T. Query result: cos(q̂, b_i) = (0.96 0.56 0.81 0.72 0.91 0.40).
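Folding a query into the reduced space can be sketched as follows, reusing the example's term-by-document matrix. Note that the signs of singular vectors are implementation-dependent, but the resulting cosines are unaffected.

```python
import numpy as np

# Term-by-document matrix from the LSI example (terms as rows).
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
q = np.array([0.0, 1.0, 0.0, 1.0, 0.0])   # query hitting terms 2 and 4

# Fold the query into the k-dimensional space: q_hat = q^T T_k S_k^{-1}
# (dividing by s[:k] applies the inverse of the diagonal matrix S_k).
q_hat = q @ T[:, :k] / s[:k]

# Documents in the reduced space: columns of B = S_k D_k
docs_k = np.diag(s[:k]) @ Dt[:k, :]

# Cosine of the folded-in query with each reduced document vector
sims = (q_hat @ docs_k) / (np.linalg.norm(q_hat) * np.linalg.norm(docs_k, axis=0))
```

The documents can then be ranked by `sims` in decreasing order, exactly as in the query result on the slide.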
Latent semantic indexing in IR The application of SVD to IR is called Latent Semantic Indexing (LSI). Comparing LSI to standard vector space search: higher recall, reduced precision. The name "latent" comes from the fact that the original terms are transformed to a new basis, thought to be the true representation of the data. Is the SVD representation more efficient? It seems to be, due to compression, e.g. if one reduces to 150 dimensions. But it needs costly matrix computations: inverted indexing is not possible, and there is the effort of computing the SVD itself. Objection to SVD: SVD is really designed for normally distributed data, but count data is evidently not normal.
Discourse Segmentation Break documents into topically coherent multi-paragraph subparts. Detect topic shifts within document
TextTiling (Hearst and Plaunt, 1993) Search for vocabulary shifts from one subtopic to another. Divide the text into fixed-size blocks (20 words) and look for topic shifts in between these blocks. Cohesion scorer: measures the topic continuity at each gap (point between two blocks). Depth scorer: at a gap, determines how low the cohesion score is compared to surrounding gaps. Boundary selector: looks at the depth scores and selects the gaps that are the best segmentation points.
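The two scorers can be sketched as follows. This is a simplified illustration of the idea, not the published algorithm: cohesion is taken as the cosine of word-count vectors of adjacent blocks, and depth as the drop from the nearest peaks on either side.

```python
import math
from collections import Counter

def gap_cohesion(left_words, right_words):
    """Topic continuity at a gap: cosine between the word-count
    vectors of the blocks on either side of the gap."""
    a, b = Counter(left_words), Counter(right_words)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def depth_score(scores, i):
    """How deep the valley at gap i is, relative to the highest
    cohesion score to its left and to its right."""
    left_peak = max(scores[: i + 1])
    right_peak = max(scores[i:])
    return (left_peak - scores[i]) + (right_peak - scores[i])
```

A boundary selector would then pick the gaps whose depth scores exceed some threshold (e.g. relative to the mean and standard deviation of all depth scores) as segmentation points.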
Three constellations of cohesion scores