Latent semantic indexing


Latent semantic indexing Relationship between concepts and words is many-to-many. Solve problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms. For retrieval, analyze queries the same way, and compute cosine similarity of vectors of ideas.
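
A minimal sketch of this retrieval step in Python/numpy (the concept vectors and names below are invented for illustration; they are not from the slides):

import numpy as np

def cosine(x, y):
    # cosine similarity between two vectors in the concept space
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# hypothetical 3-dimensional concept vectors for a query and two documents
query = np.array([0.9, 0.1, 0.0])
doc_a = np.array([0.8, 0.3, 0.1])
doc_b = np.array([0.0, 0.2, 0.9])
print(cosine(query, doc_a), cosine(query, doc_b))   # doc_a scores higher than doc_b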

Latent semantic analysis Latent semantic analysis (LSA). Find the latent semantic space that underlies the documents. Find the basic (coarse-grained) ideas, regardless of the words used to express them. A kind of co-occurrence analysis: co-occurring words act as bridges between words that do not co-occur. The latent semantic space has many fewer dimensions than the term space. The space depends on the documents from which it is derived. Its components have no names and can't be interpreted.

Singular value decomposition (1) Dimensionality reduction by singular value decomposition (SVD). Analogous to a least-squares fit: the closest fit of a lower-dimensional matrix to a higher-dimensional matrix. Theorem: Let A_{t×d} be a real-valued matrix with n = rank(A) ≤ min(t, d). Then there exist T_{t×n}, diagonal S_{n×n}, and D_{d×n} such that A_{t×d} = T S D^T, where s_11 ≥ s_22 ≥ ... ≥ s_nn > 0 and the columns of both T and D are orthonormal. The columns of T and D are the singular vectors of A (they represent terms and documents, respectively); the elements of S are the singular values of A.
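
A minimal numpy sketch of the decomposition just stated (numpy names the factors differently: it returns U, the singular values s, and V^T, corresponding to T, the diagonal of S, and D^T here; A below is a random stand-in matrix used only to exercise the stated properties):

import numpy as np

A = np.random.rand(8, 5)                 # a small t x d matrix (t = 8 terms, d = 5 documents)
T, s, Dt = np.linalg.svd(A, full_matrices=False)

print(T.shape, s.shape, Dt.shape)        # (8, 5) (5,) (5, 5): T is t x n, s holds diag(S), Dt is D^T
print(np.allclose(A, T @ np.diag(s) @ Dt))         # A = T S D^T
print(np.allclose(T.T @ T, np.eye(T.shape[1])))    # columns of T are orthonormal
print(np.allclose(Dt @ Dt.T, np.eye(Dt.shape[0]))) # columns of D (rows of D^T) are orthonormal
print(np.all(np.diff(s) <= 0))                     # singular values in non-increasing order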

Singular value decomposition (2) A_{t×d} = T_{t×n} S_{n×n} D_{d×n}^T, where n = rank(A) ≤ min(t, d).

Singular value decomposition (3) For k ≤ n, define Â_{t×d} = T_{t×k} S_{k×k} D_{d×k}^T. Although Â and A are both t×d matrices, Â is "really" smaller: it has rank k and can be represented by smaller matrices. Theorem: Â is the closest fit to A of any matrix of rank k; i.e., it minimizes ‖A − Â‖_2.

Singular value decomposition (4) Â_{t×d} = T_{t×k} S_{k×k} D_{d×k}^T. Usually choose k ≪ n.
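
A sketch of the rank-k truncation in the same numpy notation (the final comparison is only a spot check against one arbitrary rank-k matrix, not a proof of the theorem):

import numpy as np

A = np.random.rand(8, 5)
T, s, Dt = np.linalg.svd(A, full_matrices=False)

k = 2
A_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]            # A-hat = T_{t x k} S_{k x k} D_{d x k}^T
print(np.linalg.matrix_rank(A_hat))                       # k
print(np.isclose(np.linalg.norm(A - A_hat, 2), s[k]))     # error equals the first discarded singular value

# any other rank-k matrix should fit A no better than A_hat does
B = np.random.rand(8, k) @ np.random.rand(k, 5)
print(np.linalg.norm(A - B, 2) >= np.linalg.norm(A - A_hat, 2))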

Using singular vectors The k columns of T and D that remain in T_{t×k} and D_{d×k} are the most important ones. For a document d in the original normalized A, A^T d is the vector of d's similarities with all the documents; A^T A is the (symmetric) matrix of document-to-document similarities. Analogously, in the reduced space, Â^T Â = (S_{k×k} D_{d×k}^T)^T (S_{k×k} D_{d×k}^T). Term similarity: A A^T is approximated by Â Â^T = (T_{t×k} S_{k×k}) (T_{t×k} S_{k×k})^T.
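
A numpy sketch of these similarity computations (raw inner products only, without cosine normalization; the random A is a stand-in):

import numpy as np

A = np.random.rand(8, 5)
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2

docs_k = np.diag(s[:k]) @ Dt[:k, :]     # reduced documents: columns of S_{k x k} D_{d x k}^T
terms_k = T[:, :k] @ np.diag(s[:k])     # reduced terms: rows of T_{t x k} S_{k x k}

A_hat = T[:, :k] @ docs_k               # = T_{t x k} S_{k x k} D_{d x k}^T
print(np.allclose(docs_k.T @ docs_k, A_hat.T @ A_hat))    # document-document similarities agree
print(np.allclose(terms_k @ terms_k.T, A_hat @ A_hat.T))  # term-term similarities agree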

Example (1) Six documents, five terms.

A            d1  d2  d3  d4  d5  d6
cosmonaut     1   0   1   0   0   0
astronaut     0   1   0   0   0   0
moon          1   1   0   0   0   0
car           1   0   0   1   1   0
truck         0   0   0   1   0   1

Example (2) The SVD of A:

T_{5×5}      Dim 1  Dim 2  Dim 3  Dim 4  Dim 5
cosmonaut    -0.44  -0.30   0.57   0.58   0.25
astronaut    -0.13  -0.33  -0.59   0.00   0.73
moon         -0.48  -0.51  -0.37   0.00  -0.61
car          -0.70   0.35   0.15  -0.58   0.16
truck        -0.26   0.65  -0.41   0.58  -0.09

S_{5×5} = diag(2.16, 1.59, 1.28, 1.00, 0.39)

D_{6×5}^T     d1     d2     d3     d4     d5     d6
Dim 1      -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
Dim 2      -0.29  -0.53  -0.19   0.63   0.22   0.41
Dim 3       0.28  -0.75   0.45  -0.20   0.12  -0.33
Dim 4       0.00   0.00   0.58   0.00  -0.58   0.58
Dim 5      -0.53   0.29   0.63   0.19   0.41  -0.22
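
For reference, a sketch of how these factors can be computed with numpy (the sign of each singular-vector pair is arbitrary, so individual columns may come out sign-flipped relative to the tables above):

import numpy as np

# the term-document matrix A of Example (1); rows: cosmonaut, astronaut, moon, car, truck
A = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
np.set_printoptions(precision=2, suppress=True)
print(s)    # approximately [2.16, 1.59, 1.28, 1.00, 0.39]
print(T)    # T_{5x5}
print(Dt)   # D_{6x5}^T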

Example (3) Choose k = 2.

S_{2×2} = diag(2.16, 1.59)

D_{6×2}^T (the first two rows of D^T):
              d1     d2     d3     d4     d5     d6
Dim 1      -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
Dim 2      -0.29  -0.53  -0.19   0.63   0.22   0.41

B_{2×6} = S_{2×2} D_{6×2}^T:
              d1     d2     d3     d4     d5     d6
Dim 1      -1.62  -0.60  -0.04  -0.97  -0.71  -0.26
Dim 2      -0.46  -0.84  -0.30   1.00   0.35   0.65

Example (4) Hence inter-document similarity is given by Â^T Â = B^T B (with the columns of B length-normalized, so the entries are cosines):

      d1     d2     d3     d4     d5     d6
d1   1.00
d2   0.78   1.00
d3   0.40   0.88   1.00
d4   0.47  -0.18  -0.62   1.00
d5   0.74   0.16  -0.32   0.94   1.00
d6   0.10  -0.54  -0.87   0.93   0.74   1.00
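
A numpy sketch of Examples (3) and (4) (because the tables above are rounded to two decimals and taken from printed figures, the values computed here need not reproduce them digit for digit):

import numpy as np

A = np.array([                          # same A as in Example (1)
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)
T, s, Dt = np.linalg.svd(A, full_matrices=False)

k = 2
B = np.diag(s[:k]) @ Dt[:k, :]          # B_{2x6} = S_{2x2} D_{6x2}^T, one column per document
B_unit = B / np.linalg.norm(B, axis=0)  # normalise each column so dot products become cosines
doc_sims = B_unit.T @ B_unit            # 6x6 matrix of inter-document cosine similarities
print(np.round(doc_sims, 2))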

Queries and new documents Two problems: we need to represent queries in the same space, and we want to add new documents without recomputing the SVD. "Folding in": Let q be the term vector for a query or new document. Then q̂_{k×1} = T_{t×k}^T q_{t×1} is the vector representing q in the reduced space. If q is a query, q̂ can be compared to the other documents in D by cosine similarity. If q is a new document, q̂ can be appended to D, and d is increased by 1. As new documents are added this way, the SVD becomes a progressively poorer fit; eventually it must be recomputed.
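
A sketch of folding in a query with the formula above (the query "astronaut moon" is an invented example; documents are represented by the columns of S_{k×k} D_{d×k}^T, as in Example (3)):

import numpy as np

A = np.array([                          # term-document matrix of Example (1)
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2

q = np.array([0, 1, 1, 0, 0], dtype=float)    # term vector: astronaut + moon
q_hat = T[:, :k].T @ q                        # q-hat (k x 1) = T_{t x k}^T q (t x 1)

docs_k = np.diag(s[:k]) @ Dt[:k, :]           # documents in the reduced space
cos = (q_hat @ docs_k) / (np.linalg.norm(q_hat) * np.linalg.norm(docs_k, axis=0))
print(np.argsort(-cos))                       # document indices ranked by similarity to q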

Adding a new document Â_{t×(d+1)} = T_{t×k} S_{k×k} D_{(d+1)×k}^T

Choosing a value for k LSI is useful only if k ≪ n. If k is too large, the reduced space doesn't capture the underlying latent semantic structure; if k is too small, too much information is lost. There is no principled way of determining the best k; one needs to experiment.
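
One illustrative way to experiment, not taken from the slides: sweep a few values of k and watch how much of A each truncation retains; in a real system one would instead compare retrieval quality on test queries at each k.

import numpy as np

A = np.random.rand(200, 50)             # stand-in term-document matrix
T, s, Dt = np.linalg.svd(A, full_matrices=False)

for k in (2, 5, 10, 20, 40):
    A_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
    resid = np.linalg.norm(A - A_hat) / np.linalg.norm(A)   # relative Frobenius error
    print(k, round(resid, 3))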

How well does this work? The effectiveness of LSI compared to regular term matching depends on the nature of the documents. Typical improvement: 0 to 30% better precision. The advantage is greater for texts in which synonymy and ambiguity are more prevalent, and is largest at high recall levels. The costs of LSI might outweigh the improvement: SVD is computationally expensive, which limits its use for really large document collections (as in TREC), and an inverted index is not possible.

Other applications of LSI and LSA in NLP Cross-language information retrieval: concatenate multilingual abstracts to act as a bridge between languages. People retrieval by information retrieval. Text segmentation. Essay scoring.