Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

Outline
- Singular value decomposition
- Latent semantic analysis (a.k.a. latent semantic indexing)
- Probabilistic latent semantic analysis (a.k.a. probabilistic latent semantic indexing)

Range Space and Null Space
Definition. For any A ∈ R^{m×n}, its range space R(A) and null space N(A) are defined as follows:
R(A) = {y ∈ R^m : y = Ax for some x ∈ R^n},
N(A) = {x ∈ R^n : Ax = 0}.
Examples: what are the range space and null space of
A = [0 0; 0 0] and B = [1 0; 0 0]?
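As a quick sanity check (an illustrative sketch added here, not part of the original slides), the range and null spaces of the example matrices can be computed numerically from the SVD; the helper below assumes numpy and a small tolerance for the numerical rank.

```python
import numpy as np

A = np.zeros((2, 2))                      # A = [0 0; 0 0]
B = np.array([[1.0, 0.0], [0.0, 0.0]])    # B = [1 0; 0 0]

def range_and_null(M, tol=1e-12):
    """Return orthonormal bases of R(M) and N(M) via the SVD."""
    U, s, Vt = np.linalg.svd(M)
    r = int(np.sum(s > tol))              # numerical rank
    return U[:, :r], Vt[r:].T             # range basis, null-space basis

for name, M in [("A", A), ("B", B)]:
    rng_basis, null_basis = range_and_null(M)
    print(name, "range basis:\n", rng_basis, "\nnull-space basis:\n", null_basis)
# A: range is {0} (empty basis), null space is all of R^2
# B: range is span{e1}, null space is span{e2}
```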

Linear Algebraic Equations
A linear algebraic equation has the form
Ax = b, (1)
where A ∈ R^{m×n}, x ∈ R^n, and b ∈ R^m.
Theorem. A solution of Ax = b exists if and only if b ∈ R(A).
Theorem. Let x* be a particular solution to (1). Then
x = x* + N(A) = {x* + w : w ∈ N(A)}
is the general solution.
Remark: N(A) is the set of homogeneous solutions to (1). Therefore the set of all solutions is the sum of a particular solution and the set of homogeneous solutions.

Singular Value Decomposition
Theorem (SVD). Let A ∈ R^{m×n}. Then there exist orthogonal matrices
U = [u_1, u_2, ..., u_m] ∈ R^{m×m}, V = [v_1, v_2, ..., v_n] ∈ R^{n×n}
such that
U^T A V = diag(σ_1, σ_2, ..., σ_p) ∈ R^{m×n}, (2)
where p = min{m, n} and σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0. The scalars σ_i ∈ R are called singular values.
SVD: A = UΣV^T.

Low-Rank Approximation
Theorem (Eckart and Young, 1936). Suppose that A is an m×n matrix of rank r, with singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_r and SVD A = UΣV^T. Then the best rank-k approximation of A is A_k = U_1 Σ_1 V_1^T, where
U = [U_1, U_2], V = [V_1, V_2], Σ_1 = diag(σ_1, ..., σ_k),
with U_1 and V_1 containing the first k columns of U and V. That is,
‖A − A_k‖_F = min { ‖A − Â‖_F : Â ∈ R^{m×n}, rank(Â) = k }.
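To make the theorem concrete, the following numpy sketch (not from the slides; the test matrix is arbitrary) forms the rank-k approximation A_k from the truncated SVD and checks that the Frobenius error equals the root of the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))           # arbitrary test matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # best rank-k approximation A_k = U_1 Sigma_1 V_1^T

# Eckart-Young: ||A - A_k||_F^2 equals the sum of the discarded squared singular values
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))   # the two numbers agree
```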

Proof of Eckart-Young Theorem
Note that
‖A − Â‖_F = ‖UΣV^T − Â‖_F = ‖Σ − U^T Â V‖_F = ‖Σ − N‖_F, where N = U^T Â V.
Direct calculations give
‖Σ − N‖_F^2 = Σ_{i,j} |Σ_{ij} − N_{ij}|^2 = Σ_{i=1}^r |σ_i − N_{ii}|^2 + Σ_{i>r} |N_{ii}|^2 + Σ_{i≠j} |N_{ij}|^2,
which is minimal when all the off-diagonal entries of N are equal to zero, and so are all N_{ii} for i > r. The minimum of Σ_{i=1}^r |σ_i − N_{ii}|^2 under the rank-k constraint is attained when N_{ii} = σ_i for i = 1, ..., k and all other N_{ii} are zero.

Least Squares Problem
Consider a matrix A ∈ R^{m×n} with rank(A) = r. Then
R(A) = R(U_1) = sp[u_1, u_2, ..., u_r],
N(A) = sp[v_{r+1}, v_{r+2}, ..., v_n] = R(V_2).
The least squares (LS) problem is
min_{x ∈ R^n} ‖b − Ax‖^2. (3)
Setting the gradient of ‖b − Ax‖^2 with respect to x to zero leads to the normal equations
A^T A x = A^T b. (4)

Remarks
From the SVD, one can see that a particular solution to (4) is x* = V_1 Σ_1^{-1} U_1^T b. The complete solution is x = x* + N(A^T A). Note that N(A^T A) = N(A) = sp[V_2]. Hence
x = V_1 Σ_1^{-1} U_1^T b + V_2 w for some w ∈ R^{n−r}.
Since V_1 Σ_1^{-1} U_1^T b ⊥ V_2 w for all w, x* = V_1 Σ_1^{-1} U_1^T b is the minimum-norm solution, i.e., the solution for which ‖x‖ is smallest among all solutions.
The pseudo-inverse of A is
A† = V_1 Σ_1^{-1} U_1^T. (5)
The condition number of A is given by σ_1 / σ_r.
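A minimal illustrative sketch of these remarks (my own example, assuming numpy and a randomly generated rank-deficient A): the minimum-norm least-squares solution built from the thin SVD agrees with the pseudo-inverse solution.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # 5x4 matrix of rank (at most) 3
b = rng.standard_normal(5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                                     # numerical rank

# x* = V_1 Sigma_1^{-1} U_1^T b  (minimum-norm least-squares solution)
x_star = Vt[:r].T @ ((U[:, :r].T @ b) / s[:r])

print(np.allclose(x_star, np.linalg.pinv(A) @ b))              # True: matches the pseudo-inverse
print("condition number sigma_1 / sigma_r:", s[0] / s[r - 1])
```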

PCA
Principal component analysis (PCA) is a well-established technique for dimensionality reduction. Its applications include data compression, image processing, data visualization, exploratory data analysis, pattern recognition, and time series prediction.
The most common derivation of PCA is in terms of an orthogonal projection that maximizes the variance in the projected space. Given a set of d-dimensional observation vectors {x_t}, PCA aims at finding an orthogonal linear projection y = Wx such that the variance of y ∈ R^q (q < d) is maximized. The ith element of y is called the ith principal component.
Alternatively, PCA provides the orthogonal linear projection that minimizes the squared reconstruction error Σ_t ‖x_t − x̂_t‖^2. Thus PCA is the optimal linear encoding in the mean-squared-error sense.

PCA and SVD
It can be shown that the ith row vector of W, denoted by w_i^T, is the normalized eigenvector associated with the ith largest eigenvalue of the covariance matrix R_x = E{xx^T}.
Principal components can be found by SVD, linear neural networks, or probabilistic methods.
The SVD of R_x has the form R_x = UΣV^T (since R_x is symmetric, this coincides with its eigendecomposition). We select the leading q column vectors to construct U_1 = [u_1, ..., u_q]. Then the PCA transform is y = U_1^T x.
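A minimal numpy sketch of this recipe (illustrative only; the toy data and the choice q = 2 are my assumptions): center the data, estimate the covariance, take its eigendecomposition, and project onto the leading q eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
# toy data: 500 samples in R^3, with most variance along the first coordinate
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.2])
X -= X.mean(axis=0)                      # center the data

R_x = (X.T @ X) / X.shape[0]             # sample estimate of R_x = E{x x^T}
eigvals, eigvecs = np.linalg.eigh(R_x)   # eigenvalues in ascending order (symmetric matrix)
order = np.argsort(eigvals)[::-1]        # sort descending

q = 2
U1 = eigvecs[:, order[:q]]               # leading q eigenvectors as columns
Y = X @ U1                               # principal components: y = U_1^T x for each sample
print(Y.shape)                           # (500, 2)
```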

PCA: An Example
[Figure omitted: two-dimensional example with both axes ranging from −4 to 4.]

Spectral Decomposition (Eigen-Decomposition)
Given a symmetric matrix C ∈ R^{m×m}, its spectral decomposition is given by
C = λ_1 u_1 u_1^T + λ_2 u_2 u_2^T + ... + λ_m u_m u_m^T,
where λ_1 ≥ λ_2 ≥ ... ≥ λ_m are the eigenvalues of C and the u_i are the associated eigenvectors.

Power Iteration
The power iteration is a classical method for finding the leading eigenvector (the one associated with the largest eigenvalue) of a matrix C ∈ R^{m×m}. Given a symmetric matrix C ∈ R^{m×m} (hence its eigenvalues are real), the power iteration starts from a nonzero vector w(0) and iteratively updates w(t) by
w̃(t + 1) = C w(t), (6)
w(t + 1) = w̃(t + 1) / ‖w̃(t + 1)‖_2, (7)
where ‖·‖_2 denotes the Euclidean norm. Combining (6) and (7) leads to the updating rule
w(t + 1) = C w(t) [w^T(t) C^2 w(t)]^{−1/2}. (8)
Assume that C has a unique eigenvalue of maximum modulus λ_1, associated with the leading eigenvector u_1. Then the power iteration (8) makes w(t) converge to u_1.
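The iteration can be sketched as follows (an illustrative numpy example, not the slides' code); the result is compared against the leading eigenvector from a direct eigendecomposition.

```python
import numpy as np

def power_iteration(C, num_iters=1000, seed=0):
    """Leading eigenvector of a symmetric matrix C by repeated multiplication."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(C.shape[0])
    for _ in range(num_iters):
        w = C @ w
        w /= np.linalg.norm(w)           # w(t+1) = C w(t) / ||C w(t)||_2
    return w

C = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
w = power_iteration(C)
eigvals, eigvecs = np.linalg.eigh(C)
u1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
print(np.allclose(np.abs(w @ u1), 1.0, atol=1e-6))   # True: w converges to ±u1
```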

Deflation
Suppose that we are interested in computing eigenvectors of the data covariance matrix C = E{xx^T}. The power iteration is applied to C to extract its first eigenvector. A question then arises: how can we compute the second eigenvector of C using the power iteration?
The deflation method is a common numerical technique for computing several eigenvalues and eigenvectors of C. Assume that the first eigenvector u_1 is already computed. Then the data can be deflated by the transformation
x̃ = (I − u_1 u_1^T) x. (9)
One can easily see that E{x̃ x̃^T} = Σ_{i=2}^m λ_i u_i u_i^T. The power iteration is then applied to the deflated data in order to extract the second eigenvector of C.
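A toy illustration of deflation (again my own sketch, assuming approximately zero-mean data): extract u_1 by power iteration, deflate every sample by (I − u_1 u_1^T), and run power iteration on the deflated covariance to obtain the second eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.5, 0.5])   # toy (roughly zero-mean) data
C = (X.T @ X) / X.shape[0]                                       # covariance estimate

def power_iteration(C, num_iters=2000):
    w = np.ones(C.shape[0])
    for _ in range(num_iters):
        w = C @ w
        w /= np.linalg.norm(w)
    return w

u1 = power_iteration(C)                          # first eigenvector
X_defl = X - (X @ u1)[:, None] * u1[None, :]     # x_tilde = (I - u1 u1^T) x for every sample
C_defl = (X_defl.T @ X_defl) / X.shape[0]        # covariance of the deflated data
u2 = power_iteration(C_defl)                     # second eigenvector of the original C

print(abs(u1 @ u2))                              # close to 0: u2 is orthogonal to u1
```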

Term-Document Matrix
A term-document matrix X ∈ R^{D×N} is a collection of vector space representations of documents, where rows are terms (words) and columns are documents:
X_{ij} = t_{ij} log(N / df_i),
where t_{ij} is the term frequency of word i in document j and df_i is the number of documents containing word i (so log(N / df_i) is the inverse document frequency). We write
X = [d_1, d_2, ..., d_N] (columns are document vectors),
with rows t_1^T, t_2^T, ..., t_D^T (term vectors).
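A small sketch of this weighting (illustrative; the toy counts and variable names are mine): start from the raw term-frequency matrix and scale each row by log(N / df_i).

```python
import numpy as np

# toy corpus: rows = terms, columns = documents; entries are raw term frequencies t_ij
T = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 3, 1, 2]], dtype=float)

N = T.shape[1]                          # number of documents
df = np.sum(T > 0, axis=1)              # df_i: number of documents containing term i
X = T * np.log(N / df)[:, None]         # X_ij = t_ij * log(N / df_i)
print(X)
```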

Latent Semantic Analysis
A method for automatic indexing and retrieval, uncovering the latent semantic structure of a term-document matrix. The truncated SVD of X ∈ R^{D×N} is given by
X ≈ U Σ V^T,
where U ∈ R^{D×K}, Σ ∈ R^{K×K}, and V ∈ R^{N×K}.
Comparing documents: Σ^{-1} U^T X = Σ^{-1} U^T U Σ V^T = V^T = [d̂_1, ..., d̂_N], whose columns are the reduced document representations.
Comparing terms: X V Σ^{-1} = U Σ V^T V Σ^{-1} = U, whose rows t̂_1^T, ..., t̂_D^T are the reduced term representations.
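A hedged numpy sketch of LSA with a truncated SVD (illustrative; the toy matrix and K = 2 are assumptions): keep the top K singular triplets and form the reduced document and term representations, which can then be compared by cosine similarity.

```python
import numpy as np

# tf-idf weighted term-document matrix X (D terms x N documents); toy values
X = np.array([[1.4, 0.0, 0.7, 0.0],
              [0.7, 0.7, 0.0, 0.0],
              [0.0, 0.9, 0.3, 0.6]])

K = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, V_k = U[:, :K], np.diag(s[:K]), Vt[:K].T     # X ~ U_k S_k V_k^T

docs_reduced = np.linalg.inv(S_k) @ U_k.T @ X          # Sigma^{-1} U^T X: columns are d_hat_j
terms_reduced = X @ V_k @ np.linalg.inv(S_k)           # X V Sigma^{-1}: rows are t_hat_i

# cosine similarity between documents 0 and 2 in the latent space
d0, d2 = docs_reduced[:, 0], docs_reduced[:, 2]
print(d0 @ d2 / (np.linalg.norm(d0) * np.linalg.norm(d2)))
```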

Probabilistic Latent Semantic Analysis
The key idea in latent semantic analysis is to map high-dimensional count vectors (co-occurrence data, dyadic data) to a lower-dimensional representation in a so-called latent semantic space. The goal of LSA is to find a data mapping which reveals semantic relations between the entities of interest.
Probabilistic latent semantic analysis (PLSA) is a probabilistic variant of LSA:
- it has a sound statistical foundation;
- it defines a proper generative model of the data.
Aspect model: a latent variable model for co-occurrence data which associates an unobserved class z ∈ {z_1, ..., z_K} with each occurrence of a word w ∈ {w_1, ..., w_D} in a document d ∈ {d_1, ..., d_N}.

PLSA: Graphical Representation
A document d_j and a term (word) w_i are conditionally independent given an unobserved topic z:
p(w_i, d_j) = p(d_j) Σ_z p(w_i|z) p(z|d_j).
[Graphical model: d_j → z → w_i in plate notation, with N documents and D words.]
Generation process:
- Select a document d_j with probability p(d_j).
- Pick a latent class (topic) z with probability p(z|d_j).
- Generate a word w_i with probability p(w_i|z).

PLSA: Symmetric Parameterization
[Graphical model: the topic z generates both w_i and d_j.]
p(w_i, d_j) = Σ_z p(w_i, d_j, z)
            = Σ_z p(w_i, d_j|z) p(z)
            = Σ_z p(w_i|z) p(d_j|z) p(z).

Model Fitting: EM Algorithm
Dyadic data X: entries X_{ij} are made for dyads (w_i, d_j), which refer to a domain with two sets of objects, W = {w_1, ..., w_D} and D = {d_1, ..., d_N}.
Complete-data likelihood:
p(X, z) = Π_{i,j} p(w_i, d_j, z)^{C_{ij}} = Π_{i,j} [p(w_i|z) p(d_j|z) p(z)]^{C_{ij}},
where C_{ij} are the empirical counts for dyads (w_i, d_j) and X_{ij} = C_{ij} / Σ_{i,j} C_{ij}.
EM optimization:
- E-step: compute the expected complete-data log-likelihood
  L_c = Σ_i Σ_j Σ_k p(z_k|w_i, d_j) C_{ij} log[p(w_i|z_k) p(d_j|z_k) p(z_k)].
- M-step: re-estimate the parameters p(w_i|z_k), p(d_j|z_k), p(z_k) that maximize L_c.

E-Step: Compute p(z|w_i, d_j)
Compute the posterior distribution over latent variables:
p(z_k|w_i, d_j) = p(w_i, d_j|z_k) p(z_k) / p(w_i, d_j)
               = p(w_i|z_k) p(d_j|z_k) p(z_k) / Σ_l p(w_i|z_l) p(d_j|z_l) p(z_l),
where p(w_i|z_k), p(d_j|z_k), p(z_k) are the current estimates from the M-step.

M-Step: Re-estimate Parameters
Re-estimate the parameters:
p(w_i|z_k) = Σ_j C_{ij} p(z_k|w_i, d_j) / Σ_{i,j} C_{ij} p(z_k|w_i, d_j),
p(d_j|z_k) = Σ_i C_{ij} p(z_k|w_i, d_j) / Σ_{i,j} C_{ij} p(z_k|w_i, d_j),
p(z_k) = Σ_{i,j} C_{ij} p(z_k|w_i, d_j) / Σ_{i,j} C_{ij}.

The updating equations in the M-step are determined by solving
∂/∂p(w_i|z_k) [L_c + λ(1 − Σ_i p(w_i|z_k))] = 0,
∂/∂p(d_j|z_k) [L_c + λ(1 − Σ_j p(d_j|z_k))] = 0,
∂/∂p(z_k) [L_c + λ(1 − Σ_l p(z_l))] = 0,
for p(w_i|z_k), p(d_j|z_k), p(z_k), respectively.
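Putting the E- and M-steps together, the sketch below is a minimal, illustrative EM loop for the symmetric PLSA parameterization written in numpy under the notation above; it is a toy implementation (random initialization, fixed iteration count) added for clarity, not code from the slides.

```python
import numpy as np

def plsa_em(C, K, num_iters=100, seed=0):
    """EM for symmetric PLSA. C: D x N count matrix; K: number of topics."""
    rng = np.random.default_rng(seed)
    D, N = C.shape
    p_w_z = rng.random((D, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w_i | z_k)
    p_d_z = rng.random((N, K)); p_d_z /= p_d_z.sum(axis=0)   # p(d_j | z_k)
    p_z = np.full(K, 1.0 / K)                                # p(z_k)

    for _ in range(num_iters):
        # E-step: posterior p(z_k | w_i, d_j), shape (D, N, K)
        joint = p_w_z[:, None, :] * p_d_z[None, :, :] * p_z[None, None, :]
        post = joint / joint.sum(axis=2, keepdims=True)

        # M-step: re-estimate parameters from expected counts C_ij * p(z_k | w_i, d_j)
        weighted = C[:, :, None] * post                      # (D, N, K)
        p_w_z = weighted.sum(axis=1) / weighted.sum(axis=(0, 1))
        p_d_z = weighted.sum(axis=0) / weighted.sum(axis=(0, 1))
        p_z = weighted.sum(axis=(0, 1)) / C.sum()
    return p_w_z, p_d_z, p_z

# toy dyadic counts: 5 words x 4 documents, 2 topics
C = np.array([[4, 3, 0, 0],
              [2, 5, 1, 0],
              [0, 1, 6, 3],
              [0, 0, 2, 4],
              [1, 0, 3, 2]], dtype=float)
p_w_z, p_d_z, p_z = plsa_em(C, K=2)
```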

Document Clustering by PLSA
You are given the parameters p(w_i|z_k), p(d_j|z_k), p(z_k) estimated by EM optimization.
Compute p(z_k|d_j) ∝ p(d_j|z_k) p(z_k).
Assign document d_j to cluster k* where k* = arg max_k p(z_k|d_j).
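Given fitted parameters, the clustering rule is a one-liner; the sketch below is illustrative and uses made-up values for p(d_j|z_k) and p(z_k).

```python
import numpy as np

# toy fitted parameters for N = 3 documents and K = 2 topics (illustrative values)
p_d_z = np.array([[0.5, 0.1],
                  [0.3, 0.2],
                  [0.2, 0.7]])       # p(d_j | z_k), columns sum to 1
p_z = np.array([0.6, 0.4])           # p(z_k)

p_z_d = p_d_z * p_z                  # proportional to p(z_k | d_j); normalization not needed for argmax
clusters = np.argmax(p_z_d, axis=1)  # cluster index k* = argmax_k p(z_k | d_j) for each document
print(clusters)                      # [0 0 1]
```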

PLSA: Revisited
PLSA models each word in a document as a sample from a mixture model where the mixture components are multinomial random variables that can be viewed as representations of topics:
p(w, d) = Σ_z p(z) p(w|z) p(d|z) = p(d) Σ_z p(w|z) p(z|d).
Each document is represented as a list of mixing proportions for the mixture components, and is thereby reduced to a probability distribution on a fixed set of topics. This distribution is the reduced description associated with the document.
In PLSA, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to the following problems:
- The number of parameters in the model grows linearly with the size of the corpus, which leads to problems with overfitting.
- It is not clear how to assign probability to a document outside of the training set.

References
S. Deerwester, S. T. Dumais, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
T. Hofmann, "Probabilistic latent semantic indexing," in Proc. SIGIR, 1999.
T. Hofmann, "Probabilistic latent semantic analysis," in Proc. UAI, 1999.