Manifold Learning for Signal and Visual Processing
Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA
Radu Horaud, INRIA Grenoble Rhône-Alpes, France
Radu.Horaud@inria.fr http://perception.inrialpes.fr/

Outline of This Lecture
A short reminder from Lecture 1
Probabilistic formulation of PCA (PPCA)
Maximum-likelihood PPCA
EM for PPCA
Mixture of PPCA
What is Bayesian PCA?
Factor Analysis

Material for This Lecture
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. (Chapter 12)
More involved readings:
S. Roweis. EM Algorithms for PCA and SPCA. NIPS, 1998.
M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society B, 1999.
M. E. Tipping and C. M. Bishop. Mixtures of Probabilistic Principal Component Analysers. Neural Computation, 1999.

PCA at a Glance
The input (observation) space (the data are centered): $\mathbf{X} = [\mathbf{x}_1 \dots \mathbf{x}_n \dots \mathbf{x}_N]$, $\mathbf{x}_n \in \mathbb{R}^D$.
The output (latent) space: $\mathbf{Y} = [\mathbf{y}_1 \dots \mathbf{y}_n \dots \mathbf{y}_N]$, $\mathbf{y}_n \in \mathbb{R}^d$.
Projection: $\mathbf{Y} = \mathbf{W}\mathbf{X}$, with $\mathbf{W}$ a $d \times D$ matrix.
Reconstruction: $\widehat{\mathbf{X}} = \widetilde{\mathbf{W}}\mathbf{Y}$, with $\widetilde{\mathbf{W}}$ a $D \times d$ matrix.
$\mathbf{W}\mathbf{W}^\top = \mathbf{I}_d$, i.e., $\mathbf{W}$ is a row-orthonormal matrix, when both data sets $\mathbf{X}$ and $\mathbf{Y}$ are represented in orthonormal bases: $\mathbf{y}_j = \widetilde{\mathbf{U}}^\top(\mathbf{x}_j - \bar{\mathbf{x}})$. In this case $\mathbf{W} = \widetilde{\mathbf{U}}^\top$.
$\mathbf{W}\mathbf{W}^\top = \mathbf{\Lambda}_d^{-1}$, i.e., this corresponds to the case of whitening: $\mathbf{y}_j = \mathbf{\Lambda}_d^{-1/2}\widetilde{\mathbf{U}}^\top(\mathbf{x}_j - \bar{\mathbf{x}})$, with $\mathbf{W} = \mathbf{\Lambda}_d^{-1/2}\widetilde{\mathbf{U}}^\top$.
Remember that $\mathbf{W}$ was estimated from the $d$ largest eigenvalue-eigenvector pairs of the data covariance matrix.

From Lecture #1: Data Projection on a Linear Subspace
From $\mathbf{Y} = \mathbf{W}\mathbf{X}$ we have $\mathbf{Y}\mathbf{Y}^\top = \mathbf{W}\mathbf{X}\mathbf{X}^\top\mathbf{W}^\top = \mathbf{W}\widetilde{\mathbf{U}}\mathbf{\Lambda}_d\widetilde{\mathbf{U}}^\top\mathbf{W}^\top$.
1. The projected data have a diagonal covariance matrix: $\mathbf{Y}\mathbf{Y}^\top = \mathbf{\Lambda}_d$; by identification we obtain $\mathbf{W} = \widetilde{\mathbf{U}}^\top$.
2. The projected data have an identity covariance matrix, which is called whitening the data: $\mathbf{Y}\mathbf{Y}^\top = \mathbf{I}_d \;\Rightarrow\; \mathbf{W} = \mathbf{\Lambda}_d^{-1/2}\widetilde{\mathbf{U}}^\top$.
In what follows, we will consider the reconstruction matrix (a $D \times d$ matrix, denoted $\mathbf{W}$ from now on) instead of the projection matrix.
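As a concrete illustration, here is a minimal NumPy sketch (not part of the original slides; the synthetic data and array names such as U_tilde and Lambda_d are illustrative) of the two projection choices above:

```python
# Standard PCA projection and whitening, following the notation above.
import numpy as np

rng = np.random.default_rng(0)
D, d, N = 5, 2, 1000
X = rng.normal(size=(D, N)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])[:, None]

x_bar = X.mean(axis=1, keepdims=True)
Xc = X - x_bar                                  # centered data
S = Xc @ Xc.T / N                               # sample covariance (D x D)

eigvals, U = np.linalg.eigh(S)                  # ascending eigenvalues
order = np.argsort(eigvals)[::-1][:d]
U_tilde = U[:, order]                           # D x d, top-d eigenvectors
Lambda_d = eigvals[order]                       # top-d eigenvalues

Y_proj = U_tilde.T @ Xc                         # projection: covariance ~ diag(Lambda_d)
Y_white = np.diag(Lambda_d ** -0.5) @ U_tilde.T @ Xc   # whitening: covariance ~ I_d

print(np.round(Y_proj @ Y_proj.T / N, 2))       # approximately diag(Lambda_d)
print(np.round(Y_white @ Y_white.T / N, 2))     # approximately I_d
```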

The Probabilistic Framework (I)
Consider again the reconstruction of the observed variables from the latent variables. A point $\mathbf{x}$ is reconstructed from $\mathbf{y}$ with:
$$\mathbf{x} - \boldsymbol{\mu} = \mathbf{W}\mathbf{y} + \boldsymbol{\varepsilon}$$
$\boldsymbol{\varepsilon} \in \mathbb{R}^D$ is the reconstruction error; let's suppose that it has a Gaussian distribution with zero mean and spherical covariance:
$$p(\boldsymbol{\varepsilon}) = \mathcal{N}(\boldsymbol{\varepsilon} \mid \mathbf{0}, \sigma^2\mathbf{I})$$

The Probabilistic Framework (II)
We can now define the conditional distribution of the observed variable $\mathbf{x}$, conditioned on the value of the latent variable $\mathbf{y}$:
$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{y} + \boldsymbol{\mu}, \sigma^2\mathbf{I})$$
The prior distribution of the latent variable is a Gaussian with zero mean and unit covariance:
$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{I})$$
The marginal (or predictive) distribution $p(\mathbf{x})$ can be obtained from the sum and product rules, supposing continuous latent variables:
$$p(\mathbf{x}) = \int_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})\, d\mathbf{y} = \int_{\mathbf{y}} p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}$$
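A quick sanity check of this generative model, as a sketch not taken from the lecture (the values of W, mu and sigma below are arbitrary illustrative choices): sampling y and epsilon and forming x should give an empirical covariance close to the marginal covariance derived on the next slide.

```python
# Sample from the PPCA generative model and compare the sample covariance of x
# with the model covariance W W^T + sigma^2 I.
import numpy as np

rng = np.random.default_rng(1)
D, d, N = 4, 2, 200_000
W = rng.normal(size=(D, d))
mu = rng.normal(size=D)
sigma = 0.3

Y = rng.normal(size=(d, N))                     # y ~ N(0, I)
eps = sigma * rng.normal(size=(D, N))           # eps ~ N(0, sigma^2 I)
X = W @ Y + mu[:, None] + eps                   # x = W y + mu + eps

C_empirical = np.cov(X, bias=True)
C_model = W @ W.T + sigma**2 * np.eye(D)
print(np.round(C_empirical - C_model, 2))       # entries close to zero
```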

The Probabilistic Framework (III)
This is an instance of the linear-Gaussian model, hence the marginal is Gaussian as well:
$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{C}) \propto \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{C}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$
The posterior distribution can be obtained using Bayes' theorem for Gaussian variables (see Bishop 06, Chapter 2):
$$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{M}^{-1}\mathbf{W}^\top(\mathbf{x} - \boldsymbol{\mu}), \sigma^2\mathbf{M}^{-1})$$
This is the main difference with standard PCA: the latent variable is in this case a random variable with a Gaussian distribution.

The Probabilistic Framework (IV)
The mean and covariance of this predictive distribution can be formally derived from the expression of $\mathbf{x}$ and from the Gaussian distributions just defined (using the fact that $\mathbf{y}$ and $\boldsymbol{\varepsilon}$ are independent random variables):
$$\mathbb{E}[\mathbf{x}] = \mathbb{E}[\mathbf{W}\mathbf{y} + \boldsymbol{\mu} + \boldsymbol{\varepsilon}] = \mathbf{W}\,\mathbb{E}[\mathbf{y}] + \boldsymbol{\mu} + \mathbb{E}[\boldsymbol{\varepsilon}] = \boldsymbol{\mu}$$
$$\mathbf{C} = \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top] = \mathbb{E}[(\mathbf{W}\mathbf{y} + \boldsymbol{\varepsilon})(\mathbf{W}\mathbf{y} + \boldsymbol{\varepsilon})^\top] = \mathbf{W}\,\mathbb{E}[\mathbf{y}\mathbf{y}^\top]\,\mathbf{W}^\top + \mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\top] = \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$$
Evaluating the Gaussian density requires the inverse of this covariance matrix. Using the Woodbury identity (see equation (C.7) in Bishop 06) we have:
$$\mathbf{C}^{-1} = \sigma^{-2}\mathbf{I} - \sigma^{-2}\mathbf{W}\mathbf{M}^{-1}\mathbf{W}^\top, \qquad \mathbf{M} = \mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I}$$
where $\mathbf{M}$ is a $d \times d$ matrix. This is useful when $d \ll D$.
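The Woodbury form can be verified numerically; the following sketch (not from the lecture, with arbitrary illustrative values of W and sigma^2) checks it against a direct inversion.

```python
# Check the Woodbury form of C^{-1} for C = W W^T + sigma^2 I.
import numpy as np

rng = np.random.default_rng(2)
D, d = 6, 2
W = rng.normal(size=(D, d))
sigma2 = 0.25

C = W @ W.T + sigma2 * np.eye(D)                          # D x D covariance
M = W.T @ W + sigma2 * np.eye(d)                          # d x d matrix
C_inv_woodbury = (np.eye(D) - W @ np.linalg.inv(M) @ W.T) / sigma2

print(np.allclose(C_inv_woodbury, np.linalg.inv(C)))      # True
```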

Maximum-likelihood PCA (I)
The observed-data log-likelihood is:
$$\ln p(\mathbf{x}_1, \dots, \mathbf{x}_N \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) = \sum_{j=1}^{N} \ln p(\mathbf{x}_j \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2)$$
This expression can be developed using the previous equations, to obtain:
$$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \mathbf{C}) = -\frac{N}{2}\left(D \ln(2\pi) + \ln|\mathbf{C}|\right) - \frac{1}{2}\sum_{j=1}^{N}(\mathbf{x}_j - \boldsymbol{\mu})^\top \mathbf{C}^{-1}(\mathbf{x}_j - \boldsymbol{\mu})$$

Maximum-likelihood PCA (II)
The log-likelihood is quadratic in $\boldsymbol{\mu}$; by setting the derivative with respect to $\boldsymbol{\mu}$ equal to zero, we obtain the expected result:
$$\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{j=1}^{N}\mathbf{x}_j = \bar{\mathbf{x}}$$
Maximization with respect to $\mathbf{W}$ and $\sigma^2$, while more complex, still has a closed-form solution:
$$\mathbf{W}_{\mathrm{ML}} = \widetilde{\mathbf{U}}(\mathbf{\Lambda}_d - \sigma^2\mathbf{I}_d)^{1/2}\mathbf{R}, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{D - d}\sum_{i=d+1}^{D}\lambda_i$$
with $\mathbf{\Sigma}_X = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^\top \approx \widetilde{\mathbf{U}}\mathbf{\Lambda}_d\widetilde{\mathbf{U}}^\top$, $d < D$, and $\mathbf{R}\mathbf{R}^\top = \mathbf{I}$ ($\mathbf{R}$ an arbitrary $d \times d$ orthogonal matrix).
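A minimal sketch of this closed-form solution (not from the lecture; the function name ppca_ml, the choice R = I, and the synthetic data are illustrative):

```python
# Closed-form maximum-likelihood PPCA via an eigendecomposition of the
# sample covariance, taking the rotation R as the identity.
import numpy as np

def ppca_ml(X, d):
    """X: (N, D) data matrix; returns mu, W_ML (D, d), sigma2_ML."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N                    # sample covariance
    eigvals, U = np.linalg.eigh(S)
    eigvals, U = eigvals[::-1], U[:, ::-1]           # descending order
    sigma2 = eigvals[d:].mean()                      # average discarded variance
    W = U[:, :d] * np.sqrt(eigvals[:d] - sigma2)     # U_tilde (Lambda_d - sigma2 I)^(1/2)
    return mu, W, sigma2

# Example: noisy data generated around a 2-D subspace of R^5
rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(5000, 5))
mu, W, sigma2 = ppca_ml(X, d=2)
print(W.shape, round(sigma2, 4))
```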

Maximum-likelihood PCA (Discussion)
The covariance of the predictive density, $\mathbf{C} = \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}$, is not affected by the arbitrary orthogonal transformation $\mathbf{R}$ of the latent space:
$$\mathbf{C} = \widetilde{\mathbf{U}}\,\mathrm{Diag}(\lambda_i - \sigma^2)\,\widetilde{\mathbf{U}}^\top + \sigma^2\mathbf{I}$$
The covariance projected onto a unit vector $\mathbf{v}$ is $\mathbf{v}^\top\mathbf{C}\mathbf{v}$. We obtain the following cases:
If $\mathbf{v}$ is orthogonal to the columns of $\widetilde{\mathbf{U}}$, then $\mathbf{v}^\top\mathbf{C}\mathbf{v} = \sigma^2$: the noise variance, i.e., the average variance associated with the discarded dimensions.
If $\mathbf{v} = \mathbf{u}_i$ is one of the column vectors of $\widetilde{\mathbf{U}}$, then $\mathbf{v}^\top\mathbf{C}\mathbf{v} = \lambda_i - \sigma^2 + \sigma^2 = \lambda_i$: the model correctly captures the variance along the principal directions and approximates the variance in the remaining directions with $\sigma^2$.
The matrix $\mathbf{R}$ introduces an arbitrary orthogonal transformation (rotation) of the latent space.

Projecting the Data onto the Latent Space
Any data point $\mathbf{x}$ can be summarized by its posterior mean and posterior covariance in latent space. These are provided by the posterior distribution $p(\mathbf{y} \mid \mathbf{x})$:
$$\mathbb{E}[\mathbf{y} \mid \mathbf{x}] = \mathbf{M}^{-1}\mathbf{W}^\top(\mathbf{x} - \boldsymbol{\mu}), \qquad \mathrm{Cov}(\mathbf{y} \mid \mathbf{x}) = \sigma^2\mathbf{M}^{-1}, \qquad \mathbf{M} = \mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I}$$
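As a small sketch (not from the lecture; it reuses the illustrative ppca_ml estimates from the block above), the posterior statistics can be computed as follows:

```python
# Posterior mean and covariance of the latent variables given the data.
import numpy as np

def ppca_posterior(X, mu, W, sigma2):
    """Returns posterior means (N, d) and the shared posterior covariance (d, d)."""
    d = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(d)
    M_inv = np.linalg.inv(M)
    means = (X - mu) @ W @ M_inv          # E[y | x] = M^{-1} W^T (x - mu), row-wise
    cov = sigma2 * M_inv                  # Cov(y | x) = sigma^2 M^{-1}
    return means, cov
```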

From Probabilistic to Standard PCA
The maximum-likelihood solution allows us to estimate the reconstruction matrix $\mathbf{W}$ and the variance $\sigma^2$.
The projection of the data onto the latent space can be estimated from the posterior mean. We obtain the following projection matrix:
$$(\mathbf{W}^\top\mathbf{W} + \sigma^2\mathbf{I})^{-1}\mathbf{W}^\top$$
When $\sigma^2 = 0$ this corresponds to the standard PCA solution that rotates, projects, and whitens the data (taking $\mathbf{R} = \mathbf{I}$):
$$(\mathbf{W}^\top\mathbf{W})^{-1}\mathbf{W}^\top = \mathbf{\Lambda}_d^{-1/2}\widetilde{\mathbf{U}}^\top$$

EM for PCA
We can derive an EM algorithm for PCA by following the EM framework: write the complete-data log-likelihood and take its expectation conditioned on the observed data.
Complete-data log-likelihood for observed-latent pairs $(\mathbf{x}_j, \mathbf{y}_j)$:
$$\ln p(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) = \sum_{j=1}^{N}\left(\ln p(\mathbf{x}_j \mid \mathbf{y}_j) + \ln p(\mathbf{y}_j)\right)$$
Then we take the expectation with respect to the posterior distribution over the latent variables, $\mathbb{E}[\ln p(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2)]$, which depends on the current model parameters $\boldsymbol{\mu} = \bar{\mathbf{x}}$, $\mathbf{W}$, and $\sigma^2$, as well as on the posterior statistics:
$$\mathbb{E}[\mathbf{y}_j] = \mathbf{M}^{-1}\mathbf{W}^\top(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad \mathbb{E}[\mathbf{y}_j\mathbf{y}_j^\top] = \sigma^2\mathbf{M}^{-1} + \mathbb{E}[\mathbf{y}_j]\,\mathbb{E}[\mathbf{y}_j]^\top$$

The EM Algorithm (I)
Initialize the parameter values $\mathbf{W}$ and $\sigma^2$.
E-step: estimate the posterior statistics $\mathbb{E}[\mathbf{y}_j]$ and $\mathbb{E}[\mathbf{y}_j\mathbf{y}_j^\top]$ using the current parameter values.
M-step: maximize with respect to $\mathbf{W}$ and $\sigma^2$ while keeping the posterior statistics fixed. The equations are:
$$\mathbf{W}_{\mathrm{new}} = \left[\sum_{j=1}^{N}(\mathbf{x}_j - \bar{\mathbf{x}})\,\mathbb{E}[\mathbf{y}_j]^\top\right]\left[\sum_{j=1}^{N}\mathbb{E}[\mathbf{y}_j\mathbf{y}_j^\top]\right]^{-1}$$
$$\sigma^2_{\mathrm{new}} = \frac{1}{ND}\sum_{j=1}^{N}\left(\|\mathbf{x}_j - \bar{\mathbf{x}}\|^2 - 2\,\mathbb{E}[\mathbf{y}_j]^\top\mathbf{W}_{\mathrm{new}}^\top(\mathbf{x}_j - \bar{\mathbf{x}}) + \mathrm{tr}\!\left(\mathbb{E}[\mathbf{y}_j\mathbf{y}_j^\top]\,\mathbf{W}_{\mathrm{new}}^\top\mathbf{W}_{\mathrm{new}}\right)\right)$$

The EM Algorithm (II)
By substituting $\mathbb{E}[\mathbf{y}_j]$ and $\mathbb{E}[\mathbf{y}_j\mathbf{y}_j^\top]$ (the E-step) into the expressions for $\mathbf{W}_{\mathrm{new}}$ and $\sigma^2_{\mathrm{new}}$ (the M-step), we get:
$$\mathbf{W}_{\mathrm{new}} = \mathbf{\Sigma}_X\mathbf{W}_{\mathrm{old}}\left(\sigma^2_{\mathrm{old}}\mathbf{I} + \mathbf{M}_{\mathrm{old}}^{-1}\mathbf{W}_{\mathrm{old}}^\top\mathbf{\Sigma}_X\mathbf{W}_{\mathrm{old}}\right)^{-1}$$
$$\sigma^2_{\mathrm{new}} = \frac{1}{D}\,\mathrm{tr}\!\left(\mathbf{\Sigma}_X - \mathbf{\Sigma}_X\mathbf{W}_{\mathrm{old}}\mathbf{M}_{\mathrm{old}}^{-1}\mathbf{W}_{\mathrm{new}}^\top\right)$$
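The compact matrix form above translates directly into code. The following is a minimal sketch (not from the lecture; the function name ppca_em and the iteration count are illustrative), written in terms of the sample covariance $\mathbf{\Sigma}_X$:

```python
# EM for PPCA in the compact matrix form, operating on the sample covariance.
import numpy as np

def ppca_em(X, d, n_iter=200, seed=0):
    """X: (N, D) data; returns mu, W (D, d), sigma2 after EM iterations."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Sigma_X = (X - mu).T @ (X - mu) / N
    W = rng.normal(size=(D, d))
    sigma2 = 1.0
    for _ in range(n_iter):
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(d))   # E-step statistics
        SW = Sigma_X @ W
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(d) + M_inv @ W.T @ SW)
        sigma2 = np.trace(Sigma_X - SW @ M_inv @ W_new.T) / D
        W = W_new
    return mu, W, sigma2

# The recovered C = W W^T + sigma^2 I should match the closed-form ML solution.
```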

EM for PCA (Discussion)
EM is computationally attractive in high-dimensional spaces: it is iterative, but each iteration can be quite efficient, and the covariance matrix never needs to be formed explicitly.
The case $\sigma^2 = 0$ still corresponds to a valid EM algorithm: S. Roweis. EM Algorithms for PCA and SPCA. NIPS, 1998.
More details can be found in M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society B, 1999.

Mixture of PPCA
The log-likelihood of a mixture of PPCA:
$$\sum_{i=1}^{N}\ln p(\mathbf{x}_i) = \sum_{i=1}^{N}\ln\left(\sum_{j=1}^{M}\pi_j\, p(\mathbf{x}_i \mid j)\right)$$
We seek $\boldsymbol{\mu}_j$, $\mathbf{W}_j$, and $\sigma_j^2$ for each mixture component $j$.
For a given data point $\mathbf{x}$ there is a posterior distribution associated with each latent space $j$, the mean of which is $\mathbf{M}_j^{-1}\mathbf{W}_j^\top(\mathbf{x} - \boldsymbol{\mu}_j)$.
It is also possible to define the posterior probability, or responsibility, of mixture component $j$ for generating data point $\mathbf{x}_i$:
$$r_{ij} = \frac{p(\mathbf{x}_i \mid j)\,\pi_j}{p(\mathbf{x}_i)}$$

EM for Mixtures of PPCA
Initialize the model parameters.
E-step: estimate the responsibilities $r_{ij}$.
M-step: use the maximum-likelihood formulation of PPCA to estimate $\mathbf{W}_j$ and $\sigma_j^2$ from the local, responsibility-weighted covariance matrix:
$$\mathbf{\Sigma}_j = \frac{1}{\pi_j N}\sum_{i=1}^{N} r_{ij}\,(\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^\top$$
with
$$\pi_j = \frac{1}{N}\sum_{i=1}^{N} r_{ij} \qquad \text{and} \qquad \boldsymbol{\mu}_j = \frac{\sum_{i=1}^{N} r_{ij}\,\mathbf{x}_i}{\sum_{i=1}^{N} r_{ij}}$$
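One possible implementation of this scheme, as a sketch not taken from the lecture (function and parameter names such as mixture_ppca_em and n_components are illustrative, and the M-step uses the closed-form ML PPCA solution on each responsibility-weighted covariance):

```python
# EM for a mixture of PPCA: responsibilities in the E-step, then local
# weighted statistics and a closed-form PPCA fit per component in the M-step.
import numpy as np
from scipy.stats import multivariate_normal

def mixture_ppca_em(X, n_components, d, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(n_components, 1.0 / n_components)       # mixing weights
    mu = X[rng.choice(N, n_components, replace=False)]   # random data points as means
    W = rng.normal(scale=0.1, size=(n_components, D, d))
    sigma2 = np.ones(n_components)
    for _ in range(n_iter):
        # E-step: responsibilities r_ij from the component densities
        logp = np.stack([
            multivariate_normal.logpdf(X, mu[j], W[j] @ W[j].T + sigma2[j] * np.eye(D))
            + np.log(pi[j]) for j in range(n_components)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means and covariances, then local ML PPCA per component
        Nj = r.sum(axis=0)
        pi = Nj / N
        for j in range(n_components):
            mu[j] = (r[:, j] @ X) / Nj[j]
            Xc = X - mu[j]
            Sj = (r[:, j, None] * Xc).T @ Xc / Nj[j]
            eigvals, U = np.linalg.eigh(Sj)
            eigvals, U = eigvals[::-1], U[:, ::-1]
            sigma2[j] = max(eigvals[d:].mean(), 1e-8)
            W[j] = U[:, :d] * np.sqrt(np.maximum(eigvals[:d] - sigma2[j], 1e-8))
    return pi, mu, W, sigma2
```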

Bayesian PCA (I)
How should we select the dimension $d$ of the latent space? The generative model just introduced (a well-defined likelihood function) allows us to address this problem in a principled way.
The idea is to consider each column of $\mathbf{W}$ as having an independent Gaussian prior:
$$p(\mathbf{W} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{d}\left(\frac{\alpha_i}{2\pi}\right)^{D/2}\exp\!\left(-\frac{1}{2}\alpha_i\,\mathbf{w}_i^\top\mathbf{w}_i\right)$$
where $\alpha_i = 1/\sigma_i^2$ is called a precision parameter.
The objective is to estimate these parameters, one for each principal direction, and to retain only a subset of the directions. Since we want to keep directions of maximum variance, directions whose precision is driven to infinity (i.e., whose variance shrinks to zero) will be discarded.

Bayesian PCA (II)
The approach is based on the evidence approximation, or empirical Bayes. The marginal likelihood function (the matrix $\mathbf{W}$ is integrated out):
$$p(\mathbf{X} \mid \boldsymbol{\alpha}, \boldsymbol{\mu}, \sigma^2) = \int \underbrace{p(\mathbf{X} \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2)}_{\text{ML PCA}}\; p(\mathbf{W} \mid \boldsymbol{\alpha})\, d\mathbf{W}$$
The formal derivation is quite involved, but the maximization with respect to the precision parameters yields a simple form:
$$\alpha_i^{\mathrm{new}} = \frac{D}{\mathbf{w}_i^\top\mathbf{w}_i}$$
This estimation is interleaved with the EM updates for estimating $\mathbf{W}$ and $\sigma^2$.
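A minimal sketch of the precision update and the resulting pruning (not from the lecture; the function names and the pruning threshold alpha_max are illustrative). In a full implementation, the M-step for $\mathbf{W}$ is also regularized by the diagonal matrix $\mathrm{diag}(\alpha_i)$ (see Bishop 06, Section 12.2.3).

```python
# Precision updates interleaved with EM, and pruning of switched-off columns.
import numpy as np

def update_precisions(W):
    """alpha_i = D / (w_i^T w_i), one precision per column of W."""
    D = W.shape[0]
    return D / np.sum(W**2, axis=0)

def effective_dimension(W, alpha_max=1e6):
    """Columns whose precision blows up (near-zero norm) are considered pruned."""
    alpha = update_precisions(W)
    return int(np.sum(alpha < alpha_max))
```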

Factor Analysis
Probabilistic PCA so far (the conditional, or noise, covariance is isotropic):
$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{y} + \boldsymbol{\mu}, \sigma^2\mathbf{I})$$
In factor analysis, this covariance is diagonal rather than isotropic:
$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{y} + \boldsymbol{\mu}, \mathbf{\Psi})$$
The columns of $\mathbf{W}$ are called factor loadings and the diagonal entries of $\mathbf{\Psi}$ are called uniquenesses.
The factor analysis point of view: it is one form of latent-variable density model, in which the form of the latent subspace is of interest but not the particular choice of coordinates (which is defined only up to an orthogonal transformation).
The factor analysis parameters $\mathbf{W}$ and $\mathbf{\Psi}$ are estimated via maximum likelihood using the EM framework (unlike PPCA, there is no closed-form solution).
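To make the parallel with the PPCA updates concrete, here is a minimal EM sketch for factor analysis with a diagonal $\mathbf{\Psi}$ (not from the lecture; it follows the standard updates as in Bishop 06, Section 12.2.4, and the function name factor_analysis_em is illustrative):

```python
# EM for factor analysis: posterior factor moments in the E-step, then
# loading matrix W and diagonal uniquenesses Psi in the M-step.
import numpy as np

def factor_analysis_em(X, d, n_iter=200, seed=0):
    """X: (N, D) data; returns mu, W (D, d), psi (D,) diagonal uniquenesses."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / N                                   # sample covariance
    W = rng.normal(scale=0.1, size=(D, d))
    psi = np.diag(S).copy()
    for _ in range(n_iter):
        # E-step: posterior moments of the latent factors
        Psi_inv_W = W / psi[:, None]                    # Psi^{-1} W
        G = np.linalg.inv(np.eye(d) + W.T @ Psi_inv_W)  # shared posterior covariance
        Ey = Xc @ Psi_inv_W @ G                         # (N, d) posterior means
        Eyy = N * G + Ey.T @ Ey                         # sum_i E[y_i y_i^T]
        # M-step: update loadings and uniquenesses
        W = (Xc.T @ Ey) @ np.linalg.inv(Eyy)
        psi = np.maximum(np.diag(S - W @ (Ey.T @ Xc) / N), 1e-6)
    return mu, W, psi
```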