10-708: Probabilistic Graphical Models, Spring 2015

22 : Optimization and GMs (aka. LDA, Sparse Coding, Matrix Factorization, and All That)

Lecturer: Yaoliang Yu    Scribes: Yu-Xiang Wang, Su Zhou

1 Introduction

Graphical models are cool but hard to explain in layman's words. Matrix factorizations are simple, and you can even tell your grandmother about them. The two are often intimately related, and at the interface of the two subcommunities one can borrow ideas from one field and apply them to the other. In addition, most of the results below can be viewed as removing certain restrictions from, or adding flexibility to, the naive matrix factorization.

2 Topic models and LSI

Long before Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI) was very popular. It is essentially an SVD of the bag-of-words document matrix, X = U Λ V^T, where the inner dimension k is the number of topics. The associated optimization problem can be written as

    \min_{U,V} \|X - UV^T\|_F^2,

where U and V are not entirely identifiable (applying any invertible transformation A to give UA and VA^{-T} leaves the objective unchanged). However, it is not hard to see that the SVD recovers an optimal solution. The model is very simple but has some cool applications, including cross-language retrieval (concatenate the features of the different languages in X and find the co-shared U), TOEFL/GRE synonym tasks in the same way, and matching paper submissions to reviewers.
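As a concrete illustration of Section 2, here is a minimal numpy sketch of LSI: a rank-k truncated SVD of a synthetic term-document matrix, followed by document similarity in the latent space. The toy counts, the choice k = 2, and the variable names are illustrative assumptions, not from the lecture.

    # Minimal LSI sketch: rank-k truncated SVD of a synthetic term-document matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, n_words, k = 20, 50, 2

    # Synthetic bag-of-words counts X (documents x words); a stand-in for real data.
    X = rng.poisson(lam=1.0, size=(n_docs, n_words)).astype(float)

    # Full SVD, then keep the top-k factors: X ~= U_k diag(s_k) Vt_k.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    X_hat = U_k @ np.diag(s_k) @ Vt_k
    print("rank-k reconstruction error:", np.linalg.norm(X - X_hat))

    # Compare documents in the k-dimensional latent (topic) space, not raw word space.
    doc_embed = U_k * s_k                   # each row is a document's k-dim representation
    sims = doc_embed @ doc_embed.T          # inner-product similarities between documents
    print("most similar doc to doc 0:", np.argsort(-sims[0])[1])

On a real corpus one would use a sparse, truncated SVD routine on the (large, sparse) bag-of-words matrix rather than a dense full SVD.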

3 Exponential family PCA

LSI can be generalized to generalized linear losses via a probabilistic model: X ~ Pr(· | Θ) with Θ = UV^T. When Pr(· | Θ) is Gaussian, this reduces to standard matrix factorization; the difference here is that we consider a general exponential family. The MLE can also be written in matrix factorization form,

    \min_{U,V} -\langle X, UV^T \rangle + G(UV^T),

where G is the log-partition function (applied elementwise and summed). It is strictly convex and has an analytic form. Also note that the maximized log-likelihood is simply the Fenchel conjugate of G. The whole problem is bi-strictly convex (strictly convex in U for fixed V and vice versa), and alternating minimization is guaranteed to converge to a stationary point. For example, in the Poisson model the optimization is unconstrained and looks like

    \min_{U,V} \sum_{ij} \big( -X_{ij} [UV^T]_{ij} + \exp([UV^T]_{ij}) \big).

4 Nonnegative matrix factorization

Another widely accepted method is Nonnegative Matrix Factorization (NMF). It gained popularity because it simply makes a lot of sense to assume that the factors are nonnegative. It also has an interesting multiplicative update algorithm with the property that entries that reach 0 remain at 0; this is a form of gradient descent with an adaptively chosen step size (refer to the notes for the update equations). The problem is

    \min_{U \ge 0, V \ge 0} -\langle X, UV^T \rangle + G(UV^T).

NMF is mainly used for feature learning on image data. Due to the nonnegativity constraint, the solutions are likely to be sparse and localized, which corresponds very well to parts of an image or, arguably, regions of the brain. There is also a KL-divergence version of NMF, which solves

    \min_{U \ge 0, V \ge 0} \sum_{ij} \big( -X_{ij} \log [UV^T]_{ij} + [UV^T]_{ij} \big).

If you compare this KL-divergence version of NMF with the Poisson PCA above, you will see that the forms are very similar. There are some differences, though, and one cannot simply be reparameterized as the other: KL-divergence NMF places the low-rank structure on the mean matrix itself, while Poisson PCA places it on the logarithm of the mean (the natural parameter).

NMF leads to sparse solutions mainly because of the nonnegativity constraint: a coordinate is 0 exactly when the corresponding constraint is active. Since the algorithm shrinks some coordinates to zero and they then stay at zero, one should always initialize with dense random factors U and V to avoid trivial solutions. In particular, U = 0 and V = 0 is a trivial stationary point.

5 Polysemy, Synonymy, pLSI and LDA

LSI is good at synonymy but less so at polysemy. So people introduced the notion of a topic: different topics can share the same words, but those words may mean different things in different topics. This leads to probabilistic LSI (Hofmann, ML '01). LDA is essentially a Bayesian version of pLSI with an additional Dirichlet prior. It is claimed that this seemingly minor change actually keeps probabilistic LSI from overfitting the data. The maximum-likelihood problem for pLSI turns out to be

    \min_{U, V} -\sum_{ij} X_{ij} \log [UV^T]_{ij}   s.t.   U, V \ge 0,  U^T \mathbf{1} = \mathbf{1},  V^T \mathbf{1} = \mathbf{1}.

Compared with the previous KL-divergence version of NMF, the only difference is the additional constraint that each column of U and V must lie in a probability simplex. This adds a semantic meaning to the factors.
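Since the update equations are deferred to the lecture notes, here is a stand-in sketch using the standard Lee-Seung multiplicative updates for the KL-divergence NMF objective above; the toy data, the rank, and the iteration count are assumptions, and the lecture's exact updates may differ. Note the dense, strictly positive random initialization, per the warning that zeros stay at zero.

    # Multiplicative updates for KL-divergence NMF:
    #   min_{U>=0, V>=0}  sum_ij ( -X_ij * log([U V^T]_ij) + [U V^T]_ij ).
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 30, 40, 5
    X = rng.poisson(lam=2.0, size=(n, m)).astype(float)

    # Dense positive initialization: under multiplicative updates, U = 0, V = 0
    # would be a trivial stationary point and any zero entry stays zero.
    U = rng.random((n, k)) + 0.1
    V = rng.random((m, k)) + 0.1
    eps = 1e-12

    for it in range(200):
        R = X / (U @ V.T + eps)                 # elementwise ratio X / [U V^T]
        U *= (R @ V) / (V.sum(axis=0) + eps)    # update U with V held fixed
        R = X / (U @ V.T + eps)
        V *= (R.T @ U) / (U.sum(axis=0) + eps)  # update V with U held fixed

    obj = np.sum(-X * np.log(U @ V.T + eps) + U @ V.T)
    print("final KL-NMF objective (up to terms constant in U, V):", obj)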

6 Revisiting PCA

(Deterministic) Principal Component Analysis (PCA) enjoys a set of salient properties: an orthogonal basis, an implicit Gaussian assumption, second-order decoupling (decorrelation of the principal components), and, in some sense, an assumption that the data is noiseless (we will see what this means below). "These are not necessarily the best ways of doing dimension reduction," some smart folks thought, and they came up with extensions of PCA featuring the following:

1. Sparse coding, with an overcomplete basis.
2. Exponential family PCA / kernel PCA, with a non-linear basis.
3. Independent Component Analysis, which allows for higher-order de-correlation.
4. Probabilistic PCA, which adds additive noise to the model.

The remaining parts of Yaoliang's lecture focused on pPCA and sparse coding; the second and third items are left as supplementary readings.

7 Probabilistic PCA

Probabilistic PCA assumes a generative model,

    x_i | v_i ~ N(U v_i, \sigma^2 I),   v_i ~ N(0, I),   for i = 1, ..., n.

The maximum likelihood optimization problem is

    \min_{U, \sigma^2} \log\det(UU^T + \sigma^2 I) + \langle S, (UU^T + \sigma^2 I)^{-1} \rangle,

where S is the sample covariance matrix. This is a highly non-convex problem, since log det is a concave function and UU^T is essentially a rank constraint. Quite remarkably, we can solve this maximum likelihood problem in closed form via the eigendecomposition of S. Writing S = \bar{U} diag([s_k^2]_{k=1}^W) \bar{U}^T with eigenvalues in decreasing order, the optimal solution is

    U^* = \bar{U}_{:,1:K} diag([\sqrt{s_k^2 - \sigma^2}]_{k=1}^K),   where   \sigma^2 = \frac{1}{W-K} \sum_{k=K+1}^{W} s_k^2.

This can be proven using von Neumann's trace inequality on the trace of a product of two matrices (see http://en.wikipedia.org/wiki/von_neumann%27s_trace_inequality). To see this, first note that we can replace UU^T by a matrix whose eigenvectors align with \bar{U}, since any component outside this eigenbasis only increases the objective function. The problem then reduces to a vector problem on the diagonal (the spectrum), which solves

    \min_{\{\sigma_k^2\}, \sigma^2} \sum_{k=1}^{K} \Big( \log(\sigma_k^2 + \sigma^2) + \frac{s_k^2}{\sigma_k^2 + \sigma^2} \Big) + \sum_{k=K+1}^{W} \Big( \log \sigma^2 + \frac{s_k^2}{\sigma^2} \Big).

The same reduction can also be seen directly from von Neumann's trace inequality,

    \sum_i \sigma_{W-i+1}(A)\, \sigma_i(B) \le \langle A, B \rangle \le \sum_i \sigma_i(A)\, \sigma_i(B),

where we replace the second term of the maximum likelihood objective with its lower bound. The lower bound is attained when UU^T has a range matching the leading K-dimensional eigenspace of S; the solution for the spectrum is then obtained by simply setting the derivative to 0.

Note that under the assumption that the data is contaminated by noise, the solution is no longer a rank-K projection, but rather the solution to

    \min_X \|S - X\|^2 + \lambda \|X\|_*,

with a regularization weight \lambda chosen adaptively from the remaining (discarded) part of the spectrum of S.

It is claimed in the slides that pPCA recovers PCA when \sigma = 0. This is true but fairly meaningless, because if the true parameter \sigma^2 = 0, then S is always of rank K and it does not make sense to do dimension reduction anyway.

Once the probabilistic model is in place, many extensions follow; we refer readers to Slide 21 for the references. One notable connection is that if one applies pPCA to count data (multinomial rather than normal) with a Dirichlet prior, one gets back LDA with Z marginalized out. However, we can no longer marginalize out V as in the Gaussian case.
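A small numerical sketch of the closed-form pPCA estimate above: eigendecompose the sample covariance S, estimate \sigma^2 as the mean of the discarded spectrum, and scale the leading eigenvectors. The synthetic data, the dimensions, and the noise level are illustrative assumptions.

    # Closed-form ML estimate for probabilistic PCA via the eigendecomposition of S.
    import numpy as np

    rng = np.random.default_rng(0)
    n, W, K = 500, 10, 3

    # Generate data from a pPCA-style model: x = U_true v + noise, v ~ N(0, I).
    U_true = rng.normal(size=(W, K))
    V = rng.normal(size=(n, K))
    X = V @ U_true.T + 0.5 * rng.normal(size=(n, W))

    S = np.cov(X, rowvar=False, bias=True)        # sample covariance (W x W)
    evals, evecs = np.linalg.eigh(S)              # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]    # sort in decreasing order

    sigma2 = evals[K:].mean()                     # noise variance from the discarded spectrum
    U_ml = evecs[:, :K] * np.sqrt(np.maximum(evals[:K] - sigma2, 0.0))

    print("estimated noise variance:", sigma2)    # roughly 0.25 for this toy setup
    C = U_ml @ U_ml.T + sigma2 * np.eye(W)        # model covariance implied by the ML solution
    sign, logdet = np.linalg.slogdet(C)
    nll = logdet + np.trace(S @ np.linalg.inv(C))
    print("negative log-likelihood (up to constants):", nll)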

8 Sparsity and Choosing K

The remaining issues are how to get exact sparsity and how to choose the number of topics K from data. LDA's Dirichlet prior, and other Bayesian models (even with shrinkage), cannot give exact sparsity. In general, the posterior mean estimator is rarely sparse, but the maximum a posteriori (MAP) estimator can be. The notes suggest that the MAP estimator is not Bayesian and hence unsatisfactory. Arguably, however, both the MAP and the posterior mean estimator can be Bayes estimators for the same prior from a decision-theoretic point of view; they just correspond to different loss functions: MAP corresponds to (a limit of) the 0-1 loss, while the posterior mean corresponds to the squared loss. The actual solution to this issue in Bayesian modelling often requires the not-so-pretty use of spike-and-slab priors, which place point mass on sparse solutions. The lecture, however, only uses sparsity to motivate the discussion of sparse coding, where sparsity is usually induced via an l1 penalty. This is what the next section is about.

As for the model selection problem of choosing K, people usually turn to Bayesian nonparametrics, but that is beyond the scope of this lecture. We will partially address the problem by showing that certain norm-regularized matrix factorizations correspond to infinite sparse coding, where we do not need to specify K ahead of time. This is what the section after next is about.

9 Sparse coding

Assume the factorization model X = UV^T. Wavelets, Fourier bases, and random projections can be thought of as fixing the dictionary U, so that the only problem is finding V. In sparse coding, we solve for U and V at the same time. A commonly accepted regime is that U is overcomplete and V is sparse. Overcomplete means that U has more columns than rows: one can think of U as a dictionary of parts, and every data point is a sparse linear combination of the parts. The more overcomplete U is, the sparser V should be.
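To make Section 9 concrete, here is a hedged sketch of dictionary learning by alternating minimization, instantiating the squared loss with an l1 penalty on the codes and unit-norm dictionary atoms. It uses a few ISTA (proximal gradient) steps for the codes and least squares for the dictionary; this is one simple variant rather than the lecture's algorithm, and the data generation, sizes, step sizes, and regularization weight are assumptions.

    # Sparse coding / dictionary learning by alternating minimization:
    #   min_{U,V}  ||X - U V^T||_F^2 + lam * sum_{d,k} |V_dk|,  with unit-norm columns of U.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k = 8, 200, 16               # 16 atoms for 8-dimensional data: overcomplete
    lam = 0.1

    # Synthetic data: sparse combinations of a ground-truth overcomplete dictionary.
    U_true = rng.normal(size=(d, k))
    U_true /= np.linalg.norm(U_true, axis=0)
    C_true = rng.normal(size=(k, n)) * (rng.random((k, n)) < 0.15)
    X = U_true @ C_true + 0.01 * rng.normal(size=(d, n))   # columns are data points

    U = rng.normal(size=(d, k))
    U /= np.linalg.norm(U, axis=0)     # bounded (unit) norm atoms keep the problem well posed
    V = np.zeros((n, k))

    def soft(z, t):
        # Soft-thresholding: the proximal operator of the l1 penalty.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    for outer in range(50):
        # Code step: a few ISTA iterations on V with the dictionary U fixed.
        L = np.linalg.norm(U, 2) ** 2                # Lipschitz constant of the gradient
        for _ in range(10):
            grad = (U @ V.T - X).T @ U               # gradient of 0.5*||X - U V^T||_F^2 in V
            V = soft(V - grad / L, lam / (2 * L))
        # Dictionary step: least squares for U with V fixed, then renormalize the atoms
        # (rescaling V accordingly so U V^T is unchanged).
        U = np.linalg.lstsq(V, X.T, rcond=None)[0].T
        norms = np.linalg.norm(U, axis=0, keepdims=True) + 1e-12
        U /= norms
        V *= norms

    print("fraction of nonzero code entries:", np.mean(V != 0))
    print("reconstruction error:", np.linalg.norm(X - U @ V.T))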

Variants of this model for exponential families, structured sparsity, and so forth have been proposed as well, and sparse coding has also been stacked into deep neural networks. A general formulation can be written as

    \min_{U,V} \ell(X, UV^T) + \lambda \sum_{d,k} g(V_{d,k}),   (1)

for some sparsity-inducing penalty g. Often U needs to have bounded norm; otherwise one can scale U up and V correspondingly down toward 0, and the problem becomes ill-posed. Such problems are usually solved by alternating minimization, e.g., K-SVD. Provable guarantees are still an active area of research (search for "provable dictionary learning", "provable overcomplete dictionary learning", "provable sparse coding", etc.).

10 Infinite sparse coding

An interesting result for sparse coding occurs when we take K to infinity. Consider the following special case of (1):

    \min_{\Theta} \ell(X, \Theta) + \lambda \|\Theta\|,   where   \|\Theta\| := \min_{\Theta = UV^T,\ \|U_{:,k}\| \le 1} \sum_{k=1}^{K} \|V_{:,k}\|.   (2)

For finite K this is a non-convex function of \Theta, but as K \to \infty, \|\cdot\| becomes a norm. Its dual norm is

    \|\Theta\|^o = \max\{ u^T \Theta v : \|u\| \le 1, \|v\| \le 1 \}.

In particular, when \|u\| = \|u\|_2 and \|v\| = \|v\|_2, we get \|\Theta\|^o = \|\Theta\|_2, the spectral/operator norm, and thus \|\Theta\| is the nuclear norm (trace norm). The essential rank is then obtained adaptively, in some sense (a small numerical check appears at the end of these notes), although there is still one parameter \lambda to tune.

11 Extensions to these ideas

As discussed previously, similar ideas have been exploited a lot in both matrix factorization and graphical modelling. Zhu & Xing proposed a sparse topic model that is discriminative rather than generative. Their model has many more parameters than observations, so they add strong l1-norm regularizers, and \Theta is actually clusters of S. The tweak apparently worked quite well for topic modelling (better than LDA). At the same time, because the method does not rely on sampling or on computing a posterior mean, the l1 norm gives exact sparsity.

The idea of using a latent low-dimensional subspace to represent high-dimensional data has become very popular over the past decades because it works very well on computer vision data. Almost any combination of the keywords supervised, predictive, sparse, large-margin, etc. might have generated a paper; it is as if these methods are taking over the world.

Nowadays, people often stop modelling probabilistically and instead design the optimization objective directly, to enjoy the structural implications of different regularizers. However, this approach is less principled and does not usually lead to an easy construction of confidence/credibility intervals, which makes inference with these models challenging (if not intractable). For predictive tasks this does not matter much, since one can always evaluate on held-out data, but for scientific discovery probabilistic inference is of great importance. This is probably the reason why courses like Probabilistic Graphical Models still exist and remain this popular.
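Finally, the small numerical check promised in Section 10: singular-value soft-thresholding is the proximal operator of the nuclear norm, i.e., it solves \min_\Theta \frac{1}{2}\|Y - \Theta\|_F^2 + \lambda\|\Theta\|_*, so the rank of the solution is selected adaptively by \lambda rather than fixed in advance. The toy matrix and the \lambda grid are assumptions.

    # Rank adaptivity under the nuclear norm: soft-threshold the spectrum of Y.
    import numpy as np

    rng = np.random.default_rng(0)
    n, true_rank = 50, 4
    Y = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, n))
    Y += 0.1 * rng.normal(size=(n, n))          # low-rank signal plus small noise

    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    for lam in [0.1, 1.0, 5.0, 20.0]:
        s_shrunk = np.maximum(s - lam, 0.0)     # soft-threshold the singular values
        Theta = U @ np.diag(s_shrunk) @ Vt      # solution of the prox problem for this lam
        print(f"lambda={lam:5.1f}  rank of solution: {int(np.sum(s_shrunk > 0))}")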