Notes on Latent Semantic Analysis


Costas Boulis

1 Introduction

One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically close to a given query. Early approaches in IR used exact keyword matching to identify relevant documents, with unsatisfactory results. There are two main reasons why these approaches fail:

Synonymy. Two different words may refer to the same concept, for example car and automobile. If a document contains only the word car and the query only the word automobile, then exact keyword matching will fail to retrieve this document.

Polysemy. The same word may refer to different concepts depending on the context, for example the word bank (river bank or money bank). Context is not a simple matter of N-grams but can extend far into a document.

This many-to-many mapping between words and concepts (or semantics) makes it difficult, if not impossible, to handcraft rules. For example, some early approaches used a hand-crafted dictionary in which synonyms were specified. Each query was then expanded with synonyms of each word appearing in the query, and an exact keyword match was performed. However, such approaches do not handle polysemy and return a high number of irrelevant documents.

One way to model the interaction between words and concepts is to view a query as a noisy observation of a concept. We can think of a concept as being associated with a probability distribution function (pdf) over the vocabulary words. For example, a topic about car maintenance will assign higher probability than other topics to the words car, mechanic, parts and so on. But when a query is formed, not all of these words will appear with their true frequencies. Some words that are highly associated with a certain topic may not appear at all. Under this perspective, a query is a noisy observation of an original topic (or perhaps a set of topics).

2 Principal Components Analysis

Principal Components Analysis (PCA) is frequently used as a data compression technique and works by removing redundancies between dimensions of the data. There are many references to PCA in the literature; here I will just briefly mention some of its main characteristics without using a lot of mathematical notation. PCA projects the original data into a lower-dimensional space such that the variance of the data is best preserved. Suppose that we have N samples, each of dimension M. We can represent these data as a matrix X of size N × M. It is known from linear algebra that any such matrix can be decomposed in the following form (known as the Singular Value Decomposition or SVD):

$X_{N \times M} = U_{N \times M} \, S_{M \times M} \, V_{M \times M}^T$    (1)

where U and V are orthogonal matrices of size N × M and M × M respectively (U is orthogonal in the sense that $U^T U = I$, i.e. its columns are orthonormal). The matrix S is a diagonal M × M matrix (assuming N > M). The diagonal elements of S are the singular values of X, and the columns of U and V are the left and right singular vectors of X (equivalently, the eigenvectors of $XX^T$ and $X^T X$; for a square matrix A, u and λ are an eigenvector and eigenvalue iff $Au = \lambda u$).

If we order the singular values by magnitude and keep the highest m of them, then we map the original M-dimensional space to an m-dimensional space with the least distortion, where distortion is defined as the mean square error between the original and mapped points. So now we can write:

$X \approx \hat{X}_{N \times M} = U_{N \times m} \, S_{m \times m} \, V_{m \times M}^T$    (2)

Figure 1: Projecting the 2-D data onto a line

Figure 1 shows an example of 2-D data and their minimum mean square error projection onto a single dimension. Of all possible lines, the one shown is the line onto which the projected data retain a variance as close as possible to that of the original data. We can now see why PCA is a compression technique: before, we needed 2 real numbers to store each data point; now we need 1 (with some distortion associated with it). Even if we do not project to a lower-dimensional space (i.e. m = M), PCA may still be useful since it decorrelates the dimensions (two random variables Y and Z are correlated if knowledge of one helps predict the other through a linear relation). This is shown in Figure 2.

Figure 2: PCA as a decorrelating transform

Such a transformation can be useful for many feature normalization tasks in practice.
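To make Equations 1 and 2 concrete, here is a minimal numpy sketch; the random data, the choice m = 2, and the variable names are illustrative assumptions and not part of these notes. It computes the SVD, keeps the top m singular values, and measures the resulting distortion.

```python
import numpy as np

# Toy data: N = 100 samples in M = 5 dimensions (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# (For PCA proper, the columns of X would first be mean-centered.)

# Full SVD: X = U S V^T  (Equation 1). full_matrices=False gives the
# "economy" shapes N x M, M, M x M used in the notes.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the m largest singular values (Equation 2).
m = 2
X_hat = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]

# Distortion: mean square error between original and reconstructed points.
mse = np.mean((X - X_hat) ** 2)
print(f"rank-{m} approximation, MSE = {mse:.4f}")
```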

But PCA is not useful just for compression. We can see this method as a way to estimate the underlying relation between different variables. For example, in Figure 1 we can think of the x-axis as gallons of gas and the y-axis as the miles I get on a specific car, road, environment, driving conditions, etc. We know that there is a linear relation between these two variables, but due to noisy sensors and imperfect road conditions the actual measured points are as shown in Figure 1. Thus PCA is a way to remove the noise from the data and uncover the true relationship between different variables.

3 Factor Analysis

PCA is a method of linear algebra, and as in any method of linear algebra all the variables are deterministic. Factor Analysis (FA) is a similar method but with probabilistic foundations. In FA we assume that the observed data x are generated as a linear combination of a (usually small) set of hidden factors (this is one-mode FA, where the factors refer to just one variable; more similar to SVD is the two-mode FA described in the next section):

$x = \Lambda f + \epsilon$    (3)

where f is a vector of size m, Λ is an M × m matrix, x is a vector of size M and ε is a random vector of size M. The $\epsilon_i$, i = 1, ..., M, are mutually uncorrelated and uncorrelated with the factors f. During training, the task is to estimate the common hidden factors given a set of observed vectors x. During testing, the task is to find the projections of a test vector x onto the already estimated hidden factors f. If x is the original M-dimensional vector, then from Equation 3 we can write:

$p(x) = \int_f p(x \mid f)\, p(f)$    (4)

From Equation 4 we can see that FA is an unsupervised way of learning the m-dimensional factors from M-dimensional data. The dimensionality reduction is important because of data sparsity issues: by reducing the dimensionality we simplify our model, requiring fewer parameters to be estimated. And if the data are highly correlated, there is no sacrifice in the modeling capacity of the projected space.
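As a rough illustration of the generative model in Equation 3, the sketch below samples data from x = Λf + ε and recovers factor projections. The dimensions, noise level, and the use of scikit-learn's FactorAnalysis estimator are my own assumptions rather than anything prescribed in these notes.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
M, m, N = 10, 2, 500          # observed dim, latent dim, number of samples

# Generate data from the FA model x = Lambda f + eps (Equation 3).
Lambda = rng.normal(size=(M, m))            # loading matrix
f = rng.normal(size=(N, m))                 # hidden factors
eps = rng.normal(scale=0.1, size=(N, M))    # uncorrelated noise
X = f @ Lambda.T + eps                      # N x M observations

# Fit FA and project a test vector onto the estimated factors.
fa = FactorAnalysis(n_components=m).fit(X)
x_test = X[:1]
f_hat = fa.transform(x_test)                # projection onto learned factors
print("estimated factors for the test vector:", f_hat)
```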

In PCA each vector can also be decomposed as a linear combination of a set of basis vectors. The difference between PCA and FA (other than the former being a linear-algebra method and the latter a probabilistic one) is that PCA preserves the variance while FA preserves the covariance. Therefore PCA is insensitive to rotation but sensitive to scaling, and vice versa for FA. For an excellent discussion of PCA, FA and probabilistic extensions of PCA (like mixtures of PCA) see [4]. Probabilistic methods have a number of advantages over algebraic ones, including:

Value of m. An issue with both FA and PCA is the optimum value of m, i.e. the value of m that best represents the data (according to a predetermined criterion). There is a sound body of methods and theory for selecting the complexity of statistical models, while in PCA we have to rely on heuristics or simple techniques.

Training and updating. For most applications where the number of training samples and the dimensionality are high, as in IR, PCA results in a huge matrix which needs to be decomposed. On the other hand, FA can make use of the wealth of EM variants that exist (Expectation-Maximization, a popular algorithm in machine learning for unsupervised estimation) and apply techniques which are both fast and robust to estimation errors.

Generalization. It is well known in statistics that FA is a special case of more complex models (called graphical models). Under this perspective there are many ways to visualize generalizations of FA. See for example [3].

4 Back to Information Retrieval

How does all of this relate to Information Retrieval? First of all, we need a representation of documents. The vector-space representation is one of the most popular: each document is represented as a vector over the words appearing in the document. Some very common words like "the", "a", "and", etc. are removed, and the remaining words are stemmed, based on the assumption that words with the same stem refer to the same topic (for example run, runner, running). The dimensionality M of each document vector is usually very high, on the order of tens of thousands, and usually each entry of the vector is either 0 or 1, meaning that a term either appeared at least once in the document or did not appear at all. If we have N documents available, we can construct an N × M matrix where each row corresponds to a document. We can think of this matrix as a description of both terms and documents, where a row describes a certain document and a column describes a specific term.

Further assume that we have selected m and used PCA, as in Equation 2. The diagonal matrix S refers to the common factors. Notice that these factors are common to both terms and documents, allowing for a uniform representation. This lets us ask questions like "how similar are word W and query Q?" in addition to the standard "how similar are words W1 and W2?" or "documents D1 and D2?". The N × m matrix U projects the terms into the m-dimensional space (a term vector is N-dimensional, with one entry per document) and the M × m matrix V projects the documents into the m-dimensional space (a document vector is M-dimensional, with one entry per term). So if d is a vector describing a document and t is a vector describing a term, we have:

$\hat{d} = V^T d$    (5)

$\hat{t} = U^T t$    (6)

Notice also that although the original representation of terms and documents is discrete, the projected points lie in a continuous space. We can now measure the proximity of two points by using the Euclidean distance in the m-dimensional space instead of the original M-dimensional space.
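A minimal sketch of this pipeline follows, under the assumption of a tiny toy corpus and m = 2 (neither of which comes from the original notes): build a binary document-term matrix, take a truncated SVD, and compare a query and the documents in the shared m-dimensional space using Equations 5 and 6.

```python
import numpy as np

docs = ["car mechanic parts repair",
        "automobile engine repair",
        "river bank fishing",
        "money bank loan interest"]

# Binary document-term matrix X (N documents x M terms), as in the notes.
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[1 if w in d.split() else 0 for w in vocab] for d in docs], float)

# Truncated SVD (Equation 2) with m = 2 latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2
V = Vt[:m, :].T                      # M x m: projects document vectors (Eq. 5)
Um = U[:, :m]                        # N x m: projects term vectors    (Eq. 6)

# Project a new query (treated as a short document) and rank documents.
query = "automobile parts"
q = np.array([1.0 if w in query.split() else 0.0 for w in vocab])
q_hat = V.T @ q                      # m-dimensional representation of the query
d_hat = X @ V                        # m-dimensional representations of the documents
t_hat = Um.T @ X[:, vocab.index("bank")]   # m-dim representation of the term "bank"

dists = np.linalg.norm(d_hat - q_hat, axis=1)   # Euclidean proximity in latent space
print("closest document:", docs[int(np.argmin(dists))])
```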

In the introduction I mentioned that we can see each query or word as a noisy realization of a number of topics. This is mathematically described by Equation 3, where x is the observation vector, f are the topics or semantics estimated on a number of documents, and ε is the error or noise. We can obtain the two-mode FA for documents d and terms t by using:

$p(d, t) = \sum_f p(d \mid f)\, p(t \mid f)\, p(f)$    (7)

Notice that Equation 7 is completely symmetric in the documents and the terms. Here f are the common factors or topics associated with a collection of documents or terms. (A minimal EM sketch for this model is given after the summary.) Perhaps the strongest advantage of FA and PCA is that the procedure of uncovering the hidden or latent semantics is totally unsupervised. Latent Semantic Analysis (LSA) as presented in [1] is exactly the same as two-mode PCA or SVD, and Probabilistic LSA (PLSA) as presented in [2] is the two-mode extension of probabilistic PCA as presented in [4].

How do LSA and its variants handle the two fundamental problems of IR, namely synonymy and polysemy? Synonymy is captured by the dimensionality reduction of PCA: words which are highly correlated will be mapped in close proximity in the low-dimensional space. Polysemy, on the other hand, cannot be handled by PCA, because inherently the same transform is applied to all the data. For example, suppose that we have a collection of documents where in the first half the topic is fishing, so there is correlation between the words river and bank, and in the second half the topic is the economy, so the words money and bank are correlated. If PCA is applied in this scenario it will map the words river, bank and money in the same direction. The work in [2] offers a way to resolve polysemous words by calculating the quantity p(w|f) (see Figure 3 in Hofmann's paper). A better way would be to estimate a mixture of probabilistic PCA models, so that we learn the transformations and the class parameters jointly.

5 Handling speech recognizer output

All the variants of LSA developed so far are for text taken from newspapers, technical documents, web pages, etc. This means that the text processed thus far comes from lexically correct sources (note that we do not care about grammatically and syntactically correct sentences, because LSA assumes the bag-of-words model). But what about the case where the text is the output of an automatic speech recognizer? The output will have errors at many levels, and the challenge now is how to uncover semantics from erroneous sentences. We can think of this problem as estimating a doubly hidden parameter, and it can be represented in a nice way through graphical models. There are additional issues that need to be taken care of, like the fact that if a word is very likely to be erroneous then its neighbors will probably also be erroneous.

6 Summary

There are two main points to remember about Latent Semantic Analysis:

LSA is an unsupervised way of uncovering synonyms in a collection of documents.

LSA maps both documents and terms to the same space, allowing queries across variables (similarity of terms and documents).
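As promised above, here is a minimal EM sketch for the symmetric model of Equation 7, in the spirit of PLSA. The toy count matrix, random initialization, and iteration count are illustrative assumptions; this is not Hofmann's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa(N_counts, K, n_iter=100):
    """EM for the symmetric model p(d,t) = sum_f p(f) p(d|f) p(t|f)  (Equation 7).
    N_counts: (D, T) matrix of term counts per document; K: number of latent factors."""
    D, T = N_counts.shape
    p_f = np.full(K, 1.0 / K)                    # p(f)
    p_d_f = rng.dirichlet(np.ones(D), size=K)    # (K, D): p(d|f)
    p_t_f = rng.dirichlet(np.ones(T), size=K)    # (K, T): p(t|f)
    for _ in range(n_iter):
        # E-step: posterior p(f|d,t), shape (K, D, T)
        post = p_f[:, None, None] * p_d_f[:, :, None] * p_t_f[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts
        w = post * N_counts[None, :, :]
        p_d_f = w.sum(axis=2); p_d_f /= p_d_f.sum(axis=1, keepdims=True)
        p_t_f = w.sum(axis=1); p_t_f /= p_t_f.sum(axis=1, keepdims=True)
        p_f = w.sum(axis=(1, 2)); p_f /= p_f.sum()
    return p_f, p_d_f, p_t_f

# Toy count matrix: 4 documents x 6 terms, 2 latent factors (illustrative).
counts = np.array([[3, 2, 0, 0, 1, 0],
                   [2, 3, 1, 0, 0, 0],
                   [0, 0, 3, 2, 0, 1],
                   [0, 1, 2, 3, 0, 2]], float)
p_f, p_d_f, p_t_f = plsa(counts, K=2)
print("p(t|f) per factor:\n", np.round(p_t_f, 3))
```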

References

[1] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, pp. 391-407, 1990.

[2] T. Hofmann, "Probabilistic Latent Semantic Analysis", Proceedings of Uncertainty in Artificial Intelligence, 1999.

[3] M. Jordan, C. Bishop, An Introduction to Graphical Models.

[4] M. E. Tipping, C. M. Bishop, "Mixtures of Probabilistic Principal Component Analysers", Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.