Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning, Lecture 6: Probabilistic PCA and Factor Analysis. Radu Horaud, INRIA Grenoble Rhône-Alpes, France. Radu.Horaud@inrialpes.fr, http://perception.inrialpes.fr/

Outline of Lecture 6: A short reminder from Lecture 1; Probabilistic formulation of PCA; Maximum-likelihood PCA; EM for PCA; What is Bayesian PCA?; Factor Analysis.

Material for This Lecture: C. M. Bishop, Pattern Recognition and Machine Learning, 2006 (Chapter 12). More involved readings: S. Roweis, EM Algorithms for PCA and SPCA, NIPS 1998; M. E. Tipping and C. M. Bishop, Probabilistic Principal Component Analysis, J. R. Stat. Soc. B, 1999; M. E. Tipping and C. M. Bishop, Mixtures of Probabilistic Principal Component Analysers, Neural Computation, 1999.

PCA at a Glance. The input (observation) space: $X = [x_1 \ldots x_j \ldots x_n]$, $x_j \in \mathbb{R}^D$. The output (latent) space: $Y = [y_1 \ldots y_j \ldots y_n]$, $y_j \in \mathbb{R}^d$. Projection: $Y = \mathbb{W} X$ with $\mathbb{W}$ a $d \times D$ matrix. Reconstruction: $X = W Y$ with $W$ a $D \times d$ matrix. $\mathbb{W}\mathbb{W}^\top = I_d$, i.e., $\mathbb{W}$ is a row-orthonormal matrix when both data sets $X$ and $Y$ are represented in orthonormal bases: $y_j = \widetilde{U}^\top (x_j - \overline{x})$. $\mathbb{W}\mathbb{W}^\top = \Lambda_d^{-1}$, i.e., this corresponds to the case of whitening: $y_j = \Lambda_d^{-1/2} \widetilde{U}^\top (x_j - \overline{x})$. Remember that $\mathbb{W}$ was estimated from the $d$ largest eigenvalue-eigenvector pairs of the data covariance matrix.
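A minimal NumPy sketch of these two linear maps (orthonormal projection and whitening) on toy data; the array names and the toy covariance are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 500
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = scales[:, None] * rng.standard_normal((D, n))     # toy data matrix, D x n

x_bar = X.mean(axis=1, keepdims=True)
Xc = X - x_bar
Sigma_X = Xc @ Xc.T / n                               # data covariance
lam, U = np.linalg.eigh(Sigma_X)                      # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                        # sort in descending order
U_tilde = U[:, :d]                                    # d leading eigenvectors

Y_proj = U_tilde.T @ Xc                               # y_j = U_tilde^T (x_j - x_bar)
Y_white = np.diag(lam[:d] ** -0.5) @ Y_proj           # y_j = Lambda_d^{-1/2} U_tilde^T (x_j - x_bar)

print(np.round(Y_proj @ Y_proj.T / n, 2))             # ~ Lambda_d (diagonal)
print(np.round(Y_white @ Y_white.T / n, 2))           # ~ I_d
```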

From Lecture #1: Data Projection on a Linear Subspace. From $Y = \mathbb{W} X$ we have $YY^\top = \mathbb{W} X X^\top \mathbb{W}^\top = \mathbb{W}\,\widetilde{U} \Lambda_d \widetilde{U}^\top \mathbb{W}^\top$. (1) The projected data has a diagonal covariance matrix, $YY^\top = \Lambda_d$; by identification we obtain $\mathbb{W} = \widetilde{U}^\top$. (2) The projected data has an identity covariance matrix, which is called whitening the data: $YY^\top = I_d$, hence $\mathbb{W} = \Lambda_d^{-1/2} \widetilde{U}^\top$. In what follows, we will consider $W$ (reconstruction) instead of $\mathbb{W}$ (projection).

The Probabilistic Framework (I). Consider again the reconstruction of the observed variables from the latent variables. A point $x$ is reconstructed from $y$ with: $x - \mu = W y + \varepsilon$, where $\varepsilon \in \mathbb{R}^D$ is the reconstruction error; let's suppose that it has a Gaussian distribution with zero mean and spherical covariance: $\varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \sigma^2 I)$.
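A small sketch of sampling from this reconstruction model; $W$, $\mu$ and $\sigma^2$ are made up purely for illustration, and the samples are reused in the later snippets.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n, sigma2 = 5, 2, 2000, 0.1
W = rng.standard_normal((D, d))                       # reconstruction matrix, D x d
mu = rng.standard_normal(D)                           # data mean

Y = rng.standard_normal((d, n))                       # latent samples y_j ~ N(0, I)
eps = np.sqrt(sigma2) * rng.standard_normal((D, n))   # reconstruction error ~ N(0, sigma^2 I)
X = W @ Y + mu[:, None] + eps                         # observed samples x_j
```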

The Probabilistic Framework (II). We can now define the conditional distribution of the observed variable $x$ conditioned on the value of the latent variable $y$: $P(x \mid y) = \mathcal{N}(x \mid W y + \mu, \sigma^2 I)$. The prior distribution of the latent variable is a Gaussian with zero mean and unit covariance: $P(y) = \mathcal{N}(y \mid 0, I)$. The marginal distribution $P(x)$ can be obtained from the sum and product rules, supposing continuous latent variables: $P(x) = \int_y P(x \mid y) P(y)\, dy$. This is a linear-Gaussian model, hence the marginal is Gaussian as well: $P(x) = \mathcal{N}(x \mid \mu, C)$.

The Probabilistic Framework (III). The mean and covariance of this predictive distribution can be formally derived from the expression of $x$ and from the Gaussian distributions just defined: $E[x] = E[Wy + \mu + \varepsilon] = W E[y] + E[\mu] + E[\varepsilon] = \mu$, and $C = E[(x - \mu)(x - \mu)^\top] = E[(Wy + \varepsilon)(Wy + \varepsilon)^\top] = W E[y y^\top] W^\top + E[\varepsilon \varepsilon^\top] = W W^\top + \sigma^2 I$, where we assumed that $y$ and $\varepsilon$ are independent. Evaluating the Gaussian distribution requires the inverse of the covariance matrix: $C^{-1} = \sigma^{-2}(I - W M^{-1} W^\top)$, where $M = W^\top W + \sigma^2 I$ is a $d \times d$ matrix. This is interesting when $d \ll D$.
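A numerical check of these two results, reusing W, sigma2 and the samples X from the previous sketch; the tolerance comments are indicative only.

```python
# C = W W^T + sigma^2 I, and its inverse via the d x d matrix M.
C = W @ W.T + sigma2 * np.eye(D)

Xc = X - X.mean(axis=1, keepdims=True)
print(np.abs(Xc @ Xc.T / n - C).max())                # small for large n (sampling error)

M = W.T @ W + sigma2 * np.eye(d)                      # d x d: cheap to invert when d << D
C_inv = (np.eye(D) - W @ np.linalg.solve(M, W.T)) / sigma2
print(np.abs(C_inv - np.linalg.inv(C)).max())         # ~ machine precision
```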

Maximum-likelihood PCA (I). The observed-data log-likelihood writes: $\ln P(x_1, \ldots, x_n \mid \mu, W, \sigma^2) = \sum_{j=1}^n \ln P(x_j \mid \mu, W, \sigma^2)$. This expression can be developed using the previous equations, to obtain: $\ln P(X \mid \mu, C) = -\frac{n}{2}\left(D \ln(2\pi) + \ln |C|\right) - \frac{1}{2} \sum_{j=1}^n (x_j - \mu)^\top C^{-1} (x_j - \mu)$.
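The log-likelihood written directly from this formula, again reusing the toy quantities defined above; a sketch, not a reference implementation.

```python
def ppca_log_likelihood(X, mu, W, sigma2):
    D, n = X.shape
    C = W @ W.T + sigma2 * np.eye(D)
    Xc = X - mu[:, None]
    _, logdet = np.linalg.slogdet(C)                  # ln |C|
    quad = np.sum(Xc * np.linalg.solve(C, Xc))        # sum_j (x_j - mu)^T C^{-1} (x_j - mu)
    return -0.5 * n * (D * np.log(2.0 * np.pi) + logdet) - 0.5 * quad

print(ppca_log_likelihood(X, mu, W, sigma2))
```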

Maximum-likelihood PCA (II). The log-likelihood is quadratic in $\mu$; by setting the derivative with respect to $\mu$ equal to zero, we obtain the expected result: $\mu_{ML} = \frac{1}{n}\sum_{j=1}^n x_j = \overline{x}$. Maximization with respect to $W$ and $\sigma^2$, while more complex, also has a closed-form solution: $W_{ML} = \widetilde{U}\,(\Lambda_d - \sigma^2_{ML} I_d)^{1/2} R$ and $\sigma^2_{ML} = \frac{1}{D-d} \sum_{i=d+1}^{D} \lambda_i$, with $\Sigma_X = U \Lambda U^\top$, $d < D$, and $R R^\top = I$ ($R$ a $d \times d$ matrix).
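A sketch of this closed-form ML solution (taking R = I) applied to the toy samples; the helper name ppca_ml is illustrative.

```python
def ppca_ml(X, d):
    D, n = X.shape
    mu_ml = X.mean(axis=1)
    S = (X - mu_ml[:, None]) @ (X - mu_ml[:, None]).T / n
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                    # eigenvalues in descending order
    sigma2_ml = lam[d:].mean()                        # average of the discarded eigenvalues
    W_ml = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2_ml))
    return mu_ml, W_ml, sigma2_ml, lam, U

mu_ml, W_ml, sigma2_ml, lam, U = ppca_ml(X, d=2)
print(sigma2_ml)                                      # close to the true sigma2 of the toy data
```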

Maximum-likelihood PCA (Discussion). The covariance of the predictive density, $C = W W^\top + \sigma^2 I$, is not affected by the arbitrary orthogonal transformation $R$ of the latent space: $C = \widetilde{U} \Lambda_d \widetilde{U}^\top + \sigma^2 (I - \widetilde{U}\widetilde{U}^\top)$. The variance of the model along a unit vector $v$ is $v^\top C v$. We obtain the following cases: if $v$ is orthogonal to $\widetilde{U}$, then $v^\top C v = \sigma^2$, i.e., the average variance associated with the discarded dimensions; if $v$ is one of the column vectors $u_i$ of $\widetilde{U}$, then $u_i^\top C u_i = \lambda_i$. Matrix $R$ introduces an arbitrary orthogonal transformation of the latent space.
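A quick numerical check of the two cases, reusing the ML estimates and the eigendecomposition returned above.

```python
C_ml = W_ml @ W_ml.T + sigma2_ml * np.eye(D)
u_1, u_last = U[:, 0], U[:, -1]                       # retained vs. discarded direction
print(u_1 @ C_ml @ u_1, lam[0])                       # ~ lambda_1
print(u_last @ C_ml @ u_last, sigma2_ml)              # ~ sigma^2
```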

From Probabilistic to Standard PCA. The maximum-likelihood solution allows us to estimate the reconstruction matrix $W$ and the variance $\sigma^2$. The projection can be estimated from the pseudo-inverse of the reconstruction. We obtain: $(W^\top W)^{-1} W^\top = (\Lambda_d - \sigma^2 I_d)^{-1/2} \widetilde{U}^\top$. When $\sigma^2 = 0$ this corresponds to the standard PCA solution: rotating, projecting and whitening the data.
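The same relation checked numerically on the toy example (with R = I); setting sigma^2 = 0 recovers the whitening projection of standard PCA.

```python
proj = np.linalg.pinv(W_ml)                           # (W^T W)^{-1} W^T, shape (d, D)
ref = np.diag((lam[:d] - sigma2_ml) ** -0.5) @ U[:, :d].T
print(np.abs(proj - ref).max())                       # ~ machine precision
```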

EM for PCA. We can derive an EM algorithm for PCA by following the EM framework: write the complete-data log-likelihood, $\ln P(X, Y \mid \mu, W, \sigma^2) = \sum_{j=1}^n \left(\ln P(x_j \mid y_j) + \ln P(y_j)\right)$, then take its expectation with respect to the posterior distribution of the latent variables, $E[\ln P(X, Y \mid \mu, W, \sigma^2)]$, which depends on the current model parameters $\mu = \overline{x}$, $W$, and $\sigma^2$, as well as on the posterior statistics: $E[y_j] = M^{-1} W^\top (x_j - \overline{x})$ and $E[y_j y_j^\top] = \sigma^2 M^{-1} + E[y_j] E[y_j]^\top$.

The EM Algorithm. Initialize the parameter values $W$ and $\sigma^2$. E-step: estimate the posterior statistics $E[y_j]$ and $E[y_j y_j^\top]$ using the current parameter values. M-step: update the parameter values from the current ones to new ones: $W_{\text{new}} = \left[\sum_{j=1}^n (x_j - \overline{x})\, E[y_j]^\top\right] \left[\sum_{j=1}^n E[y_j y_j^\top]\right]^{-1}$ and $\sigma^2_{\text{new}} = \frac{1}{nD} \sum_{j=1}^n \left( \|x_j - \overline{x}\|^2 - 2\, E[y_j]^\top W_{\text{new}}^\top (x_j - \overline{x}) + \operatorname{tr}\!\left(E[y_j y_j^\top]\, W_{\text{new}}^\top W_{\text{new}}\right) \right)$.
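A compact sketch of this EM loop, combining the E-step statistics of the previous slide with the M-step updates above; the random initialization and the fixed number of iterations are arbitrary choices, not part of the lecture.

```python
def ppca_em(X, d, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D, n = X.shape
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                                     # centred data, D x n
    W = rng.standard_normal((D, d))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior statistics of the latent variables
        M = W.T @ W + sigma2 * np.eye(d)
        Ey = np.linalg.solve(M, W.T @ Xc)              # E[y_j] = M^{-1} W^T (x_j - x_bar)
        Eyy_sum = n * sigma2 * np.linalg.inv(M) + Ey @ Ey.T   # sum_j E[y_j y_j^T]
        # M-step: update W and sigma^2
        W_new = (Xc @ Ey.T) @ np.linalg.inv(Eyy_sum)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ey * (W_new.T @ Xc))
                  + np.trace(Eyy_sum @ W_new.T @ W_new)) / (n * D)
        W = W_new
    return x_bar.ravel(), W, sigma2

mu_em, W_em, sigma2_em = ppca_em(X, d=2)
print(sigma2_em, sigma2_ml)                            # EM converges towards the ML solution
```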

EM for PCA (Discussion). Computational efficiency in high-dimensional spaces: EM is iterative, but each iteration can be quite efficient, and the covariance matrix is never estimated explicitly. The case $\sigma^2 = 0$ still corresponds to a valid EM algorithm: S. Roweis, EM Algorithms for PCA and SPCA, NIPS 1998. The case of EM in the presence of missing data can be found in M. E. Tipping and C. M. Bishop, Probabilistic Principal Component Analysis, J. R. Stat. Soc. B, 1999.

Bayesian PCA (I). The problem: select the dimension $d$ of the latent space. The generative model just introduced (a well-defined likelihood function) allows us to address this problem in a principled way. The idea is to consider each column $w_i$ of $W$ as having an independent Gaussian prior: $P(W \mid \alpha) = \prod_{i=1}^{d} \left(\frac{\alpha_i}{2\pi}\right)^{D/2} \exp\left(-\tfrac{1}{2}\, \alpha_i\, w_i^\top w_i\right)$, where $\alpha_i = 1/\sigma_i^2$ is called a precision parameter. The objective is to estimate these parameters, one for each principal direction, and to select only a subset of these directions. We need to select directions of maximum variance, hence directions with infinite precision will be disregarded.

Bayesian PCA (II). The approach is based on the evidence approximation, or empirical Bayes. The marginal likelihood function (the matrix $W$ is integrated out) is: $P(X \mid \alpha, \mu, \sigma^2) = \int P(X \mid \mu, W, \sigma^2)\, P(W \mid \alpha)\, dW$, where the first factor under the integral is the ML-PCA likelihood. The formal derivation is quite involved, but maximization with respect to the precision parameters yields a simple form: $\alpha_i^{\text{new}} = \frac{D}{w_i^\top w_i}$. This estimation is interleaved with the EM updates for estimating $W$ and $\sigma^2$.
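A sketch of the precision update alone, applied to the EM estimate of W from the earlier snippet; the pruning threshold is a made-up illustration, and in full Bayesian PCA the update of W itself is also regularized by the alphas.

```python
def update_alphas(W, threshold=1e3):
    alphas = W.shape[0] / np.sum(W ** 2, axis=0)       # alpha_i = D / (w_i^T w_i), one per column
    keep = alphas < threshold                          # columns with finite precision are retained
    return alphas, keep

alphas, keep = update_alphas(W_em)
print(alphas, keep)
```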

Factor Analysis. Probabilistic PCA so far (the predictive covariance is isotropic): $P(x \mid y) = \mathcal{N}(x \mid W y + \mu, \sigma^2 I)$. In factor analysis, the noise covariance is diagonal rather than isotropic: $P(x \mid y) = \mathcal{N}(x \mid W y + \mu, \Psi)$; the columns of $W$ are called factor loadings and the diagonal entries of $\Psi$ are called uniquenesses. The factor analysis point of view: it is one form of latent-variable density model, in which the form of the latent space is of interest but not the particular choice of coordinates (defined up to an orthogonal transformation). The factor analysis parameters $W$ and $\Psi$ are estimated via the maximum likelihood and EM frameworks.
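A minimal EM sketch for factor analysis with a diagonal Psi, using the standard E- and M-step updates for the loadings and uniquenesses; it runs on the same toy data and is an illustration under these assumptions, not the lecture's derivation.

```python
def factor_analysis_em(X, d, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    S = Xc @ Xc.T / n                                  # sample covariance
    W = rng.standard_normal((D, d))                    # factor loadings
    psi = np.diag(S).copy()                            # uniquenesses (diagonal of Psi)
    for _ in range(n_iter):
        # E-step: posterior moments of y given x
        G = np.linalg.inv(np.eye(d) + W.T @ (W / psi[:, None]))   # posterior covariance of y
        Ey = G @ W.T @ (Xc / psi[:, None])             # E[y_j]
        Eyy_sum = n * G + Ey @ Ey.T                    # sum_j E[y_j y_j^T]
        # M-step: update loadings and uniquenesses
        W = (Xc @ Ey.T) @ np.linalg.inv(Eyy_sum)
        psi = np.maximum(np.diag(S - W @ (Ey @ Xc.T) / n), 1e-8)
    return W, np.diag(psi)

W_fa, Psi_fa = factor_analysis_em(X, d=2)
print(np.diag(Psi_fa))                                 # roughly isotropic here, since X came from PPCA
```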