Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/
Outline of This Lecture A short reminder from Lecture 1 Probabilistic formulation of PCA (PPCA) Maximum-likelihood PPCA EM for PPCA Mixture of PPCA What is Bayesian PCA? Factor Analysis
Material for This Lecture C. M. Bishop. Pattern Recognition and Machine Learning. 2006. (Chapter 12) More involved readings: S. Roweis. EM Algorithms for PCA and SPCA. NIPS 1998. M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis. J. R. Stat. Soc. B. 1999. M. E. Tipping and C. M. Bishop. Mixtures of Probabilistic Principal Component Analysers. Neural Computation. 1999.
PCA at a Glance The input (observation) space (the data are centered): X = [x_1 ... x_n ... x_N], x_n ∈ R^D. The output (latent) space: Y = [y_1 ... y_n ... y_N], y_n ∈ R^d. Projection: Y = W̃X, with W̃ a d × D matrix. Reconstruction: X̂ = WY, with W a D × d matrix. Two choices for the projection: W̃W̃⊤ = I_d, i.e., W̃ is a row-orthonormal matrix when both data sets X and Y are represented in orthonormal bases: y_n = Ũ⊤(x_n − x̄). In this case W̃ = Ũ⊤. W̃W̃⊤ = Λ_d⁻¹, i.e., this corresponds to the case of whitening: y_n = Λ_d^{−1/2} Ũ⊤(x_n − x̄), with W̃ = Λ_d^{−1/2} Ũ⊤. Remember that these matrices were estimated from the d largest eigenvalue-eigenvector pairs of the data covariance matrix.
From Lecture #1: Data Projection on a Linear Subspace From Y = W̃X we have YY⊤ = W̃XX⊤W̃⊤ = W̃ŨΛ_dŨ⊤W̃⊤. (1) The projected data has a diagonal covariance matrix: YY⊤ = Λ_d; by identification we obtain W̃ = Ũ⊤. (2) The projected data has an identity covariance matrix; this is called whitening the data: YY⊤ = I_d ⇒ W̃ = Λ_d^{−1/2} Ũ⊤. In what follows, we will consider the reconstruction matrix W instead of the projection matrix W̃.
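The two projection choices above can be checked numerically. Below is a minimal NumPy sketch (variable names are mine, not the lecture's) that builds both projections from the eigendecomposition of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, N = 5, 2, 200
X = rng.normal(size=(D, N))
X -= X.mean(axis=1, keepdims=True)            # center the data

# Eigendecomposition of the sample covariance, eigenvalues in descending order
C = X @ X.T / N
lam, U = np.linalg.eigh(C)
lam, U = lam[::-1], U[:, ::-1]
U_d, lam_d = U[:, :d], lam[:d]

# Case 1: projection W = U_d^T -> projected covariance is diag(lambda_1..lambda_d)
Y1 = U_d.T @ X
# Case 2 (whitening): W = Lambda_d^{-1/2} U_d^T -> identity covariance
Y2 = np.diag(lam_d ** -0.5) @ U_d.T @ X
```

The point of the check is that `Y1` has diagonal covariance Λ_d while `Y2` has identity covariance, exactly as in cases (1) and (2) above.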
The Probabilistic Framework (I) Consider again the reconstruction of the observed variables from the latent variables. A point x is reconstructed from y with: x − µ = Wy + ε where ε ∈ R^D is the reconstruction error; let us suppose that it has a Gaussian distribution with zero mean and spherical covariance: p(ε) = N(ε | 0, σ²I)
The Probabilistic Framework (II) We can now define the conditional distribution of the observed variable x, conditioned on the value of the latent variable y: p(x | y) = N(x | Wy + µ, σ²I) The prior distribution of the latent variable is a Gaussian with zero mean and unit covariance: p(y) = N(y | 0, I) The marginal (or predictive) distribution p(x) can be obtained from the sum and product rules, supposing continuous latent variables: p(x) = ∫ p(x, y) dy = ∫ p(x | y) p(y) dy
The Probabilistic Framework (III) This is an instance of the linear-Gaussian model, hence it is Gaussian as well: p(x) = N(x | µ, C) ∝ exp(−½ (x − µ)⊤C⁻¹(x − µ)) The posterior distribution can be obtained using the Bayes theorem for Gaussian variables (see Bishop 06, chapter 2): p(y | x) = N(y | M⁻¹W⊤(x − µ), σ²M⁻¹) This is the main difference with standard PCA: the latent variable is in this case a random variable with a Gaussian distribution.
The Probabilistic Framework (IV) The mean and covariance of this predictive distribution can be formally derived from the expression of x and from the Gaussian distributions just defined (using the fact that y and ε are independent random variables): E[x] = E[Wy + µ + ε] = WE[y] + µ + E[ε] = µ C = E[(x − µ)(x − µ)⊤] = E[(Wy + ε)(Wy + ε)⊤] = WE[yy⊤]W⊤ + E[εε⊤] = WW⊤ + σ²I Gaussian distributions require the inverse of the covariance matrix; using the Woodbury identity (see equation (C.7) in Bishop 06) we have: C⁻¹ = σ⁻²I − σ⁻²WM⁻¹W⊤, with M = W⊤W + σ²I, where M is a d × d matrix. Useful when d ≪ D.
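The computational point of the Woodbury identity is that only the d × d matrix M has to be inverted, never the D × D matrix C. A quick NumPy check of the identity (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, sigma2 = 6, 2, 0.5
W = rng.normal(size=(D, d))

C = W @ W.T + sigma2 * np.eye(D)              # D x D predictive covariance
M = W.T @ W + sigma2 * np.eye(d)              # d x d matrix
# Woodbury: C^{-1} = (I - W M^{-1} W^T) / sigma^2, only a d x d inverse needed
C_inv = (np.eye(D) - W @ np.linalg.inv(M) @ W.T) / sigma2
```

The sketch just verifies that `C_inv` agrees with the direct D × D inverse of `C`.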
Maximum-likelihood PCA (I) The observed-data log-likelihood writes: ln p(x_1, ..., x_N | µ, W, σ²) = Σ_{n=1}^N ln p(x_n | µ, W, σ²) This expression can be developed using the previous equations, to obtain: ln p(X | µ, C) = −(N/2)(D ln(2π) + ln|C|) − ½ Σ_{n=1}^N (x_n − µ)⊤C⁻¹(x_n − µ)
Maximum-likelihood PCA (II) The log-likelihood is quadratic in µ; by setting the derivative with respect to µ equal to zero, we obtain the expected result: µ_ML = (1/N) Σ_{n=1}^N x_n = x̄ Maximization with respect to W and σ², while more complex, still has a closed-form solution: W_ML = Ũ(Λ_d − σ²I_d)^{1/2} R σ²_ML = 1/(D − d) Σ_{i=d+1}^D λ_i with Σ_X = UΛU⊤ ≈ ŨΛ_dŨ⊤, d < D, and RR⊤ = I (R is an arbitrary d × d orthogonal matrix).
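The closed-form solution can be sketched directly from the eigendecomposition of the sample covariance. A minimal NumPy sketch on synthetic data, taking R = I (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, N = 5, 2, 5000
# Synthetic data: two large-variance directions, three small ones
X = rng.normal(size=(N, D)) @ np.diag([3.0, 2.0, 0.5, 0.4, 0.3])

mu = X.mean(axis=0)                                # mu_ML = sample mean
lam, U = np.linalg.eigh(np.cov(X.T, bias=True))
lam, U = lam[::-1], U[:, ::-1]                     # descending eigenvalues

sigma2 = lam[d:].mean()                            # average discarded variance
W_ml = U[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))   # R = I

# Model covariance implied by the ML solution
C = W_ml @ W_ml.T + sigma2 * np.eye(D)
```

Along a retained principal direction u_i the model covariance gives back λ_i exactly (the discussion on the next slide), which is easy to verify on `C`.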
Maximum-likelihood PCA (Discussion) The covariance of the predictive density, C = WW⊤ + σ²I, is not affected by the arbitrary orthogonal transformation R of the latent space: C = Ũ diag(λ_i − σ²) Ũ⊤ + σ²I The variance projected onto a unit vector v is v⊤Cv. We obtain the following cases: if v is orthogonal to the columns of Ũ, then v⊤Cv = σ² (the noise variance), i.e., the average variance associated with the discarded dimensions; if v = u_i is one of the column vectors of Ũ, then v⊤Cv = λ_i − σ² + σ² = λ_i: the model correctly captures the variance along the principal directions and approximates the variance in the remaining directions with σ². Matrix R introduces an arbitrary orthogonal transformation of the latent space.
Projecting the Data onto the Latent Space Any data point x can be summarized by its posterior mean and posterior covariance in latent space. These are provided by the posterior distribution p(y | x): E[y | x] = M⁻¹W⊤(x − µ) C(y | x) = σ²M⁻¹ with M = W⊤W + σ²I
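As a sketch, the posterior statistics can be computed as follows (dimensions and names are illustrative, and solving the d × d system avoids forming C):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, sigma2 = 4, 2, 0.1
W = rng.normal(size=(D, d))
mu = rng.normal(size=D)
x = rng.normal(size=D)

M = W.T @ W + sigma2 * np.eye(d)
y_mean = np.linalg.solve(M, W.T @ (x - mu))   # E[y|x] = M^{-1} W^T (x - mu)
y_cov = sigma2 * np.linalg.inv(M)             # C(y|x) = sigma^2 M^{-1}
```

As σ² → 0, M → W⊤W and the posterior mean reduces to the least-squares projection (W⊤W)⁻¹W⊤(x − µ), which is the link to standard PCA discussed on the next slide.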
From Probabilistic to Standard PCA The maximum-likelihood solution allows us to estimate the reconstruction matrix W and the variance σ². The projection of the data onto the latent space can be estimated from the posterior mean. We obtain the following projection matrix: (W⊤W + σ²I)⁻¹W⊤ When σ² = 0 this corresponds to the standard PCA solution of rotating, projecting, and whitening the data: (W⊤W)⁻¹W⊤ = Λ_d^{−1/2} Ũ⊤
EM for PCA We can derive an EM algorithm for PCA by following the EM framework: derive the complete-data log-likelihood conditioned on the observed data, and take its expectation. Complete-data log-likelihood for observed-latent pairs (x_n, y_n): ln p(X, Y | µ, W, σ²) = Σ_{n=1}^N (ln p(x_n | y_n) + ln p(y_n)) Then we take the expectation with respect to the posterior distribution over the latent variables, E[ln p(X, Y | µ, W, σ²)], which depends on the current model parameters µ = x̄, W, and σ², as well as on the posterior statistics: E[y_n] = M⁻¹W⊤(x_n − x̄) E[y_n y_n⊤] = σ²M⁻¹ + E[y_n]E[y_n]⊤
The EM Algorithm (I) Initialize the parameter values W and σ². E-step: Estimate the posterior statistics E[y_n] and E[y_n y_n⊤] using the current parameter values. M-step: Maximize with respect to W and σ² while keeping the posterior statistics fixed. The equations are: W_new = [Σ_{n=1}^N (x_n − x̄)E[y_n]⊤] [Σ_{n=1}^N E[y_n y_n⊤]]⁻¹ σ²_new = 1/(ND) Σ_{n=1}^N (‖x_n − x̄‖² − 2E[y_n]⊤W_new⊤(x_n − x̄) + tr(E[y_n y_n⊤]W_new⊤W_new))
The EM Algorithm (II) By substitution of E[y_n] and E[y_n y_n⊤] (the E-step) into the expressions of W_new and σ²_new (the M-step), we get updates written directly in terms of the data covariance Σ_X: W_new = Σ_X W_old (σ²_old I + M_old⁻¹ W_old⊤ Σ_X W_old)⁻¹ σ²_new = (1/D) tr(Σ_X − Σ_X W_old M_old⁻¹ W_new⊤)
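The two substituted updates give a compact EM loop that touches only the data covariance. A hedged NumPy sketch (initialization, iteration count, and synthetic data are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, N = 5, 2, 2000
# Synthetic data with two dominant directions
X = rng.normal(size=(N, D)) * np.array([3.0, 2.5, 0.3, 0.3, 0.3])
S = np.cov(X.T, bias=True)                     # data covariance Sigma_X

W = rng.normal(size=(D, d))                    # random initialization
sigma2 = 1.0
for _ in range(200):
    M = W.T @ W + sigma2 * np.eye(d)
    Minv = np.linalg.inv(M)
    # M-step in terms of the data covariance (only d x d inverses)
    W_new = S @ W @ np.linalg.inv(sigma2 * np.eye(d) + Minv @ W.T @ S @ W)
    sigma2 = np.trace(S - S @ W @ Minv @ W_new.T) / D
    W = W_new

# At convergence the model covariance matches the top-d eigenstructure of S
C = W @ W.T + sigma2 * np.eye(D)
```

At the fixed point this recovers the ML solution: σ² converges to the mean of the D − d discarded eigenvalues, and the top-d eigenvalues of C match those of Σ_X.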
EM for PCA (Discussion) Computational efficiency for high-dimensional spaces: EM is iterative, but each iteration can be quite efficient, and the covariance matrix is never estimated explicitly. The case of σ² = 0 corresponds to a valid EM algorithm: S. Roweis. EM Algorithms for PCA and SPCA. NIPS 1998. More details can be found in M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis. J. R. Stat. Soc. B. 1999.
Mixture of PPCA The log-likelihood of a mixture of PPCA: Σ_{i=1}^N ln p(x_i) = Σ_{i=1}^N ln(Σ_{j=1}^M π_j p(x_i | j)) We seek µ_j, W_j, and σ²_j for each mixture component j. For a given data point x there is a posterior distribution associated with each latent space j, the mean of which is M_j⁻¹ W_j⊤ (x − µ_j). It is also possible to define the posterior probability, or responsibility, of mixture component j for generating a data point: r_ij = π_j p(x_i | j) / p(x_i)
EM for Mixtures of PPCA Initialization of the model parameters. E-step: estimate the posteriors r_ij. M-step: use the maximum-likelihood formulation of PPCA to estimate W_j and σ²_j from the local responsibility-weighted covariance matrix: Σ_j = 1/(π_j N) Σ_{i=1}^N r_ij (x_i − µ_j)(x_i − µ_j)⊤ with π_j = (1/N) Σ_{i=1}^N r_ij and µ_j = Σ_{i=1}^N r_ij x_i / Σ_{i=1}^N r_ij
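The E-step can be sketched as below; `ppca_logpdf` and `responsibilities` are helper names I introduce, not part of the lecture. Each component's density is the PPCA marginal N(µ_j, W_j W_j⊤ + σ²_j I), and responsibilities are normalized in the log domain for numerical stability:

```python
import numpy as np

def ppca_logpdf(X, mu, W, sigma2):
    """Log-density of the rows of X under N(mu, W W^T + sigma2 I)."""
    D = mu.shape[0]
    C = W @ W.T + sigma2 * np.eye(D)
    Xc = X - mu
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum('nd,nd->n', Xc @ np.linalg.inv(C), Xc)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

def responsibilities(X, pis, mus, Ws, sigma2s):
    """E-step: r_ij = pi_j p(x_i | j) / p(x_i), computed stably in log space."""
    logp = np.stack([np.log(pj) + ppca_logpdf(X, mj, Wj, s2)
                     for pj, mj, Wj, s2 in zip(pis, mus, Ws, sigma2s)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)    # shift for numerical stability
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)
```

The M-step quantities of the slide (π_j, µ_j, Σ_j) are then responsibility-weighted averages over the rows of `r`.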
Bayesian PCA (I) How do we select the dimension d of the latent space? The generative model just introduced (a well-defined likelihood function) allows us to address the problem in a principled way. The idea is to consider each column of W as having an independent Gaussian prior: P(W | α) = Π_{i=1}^d (α_i / 2π)^{D/2} exp(−½ α_i w_i⊤w_i) where α_i = 1/σ²_i is called the precision parameter. The objective is to estimate these parameters, one for each principal direction, and select only a subset of these directions. We need to select directions of maximum variance, hence directions with infinite precision (zero variance) will be disregarded.
Bayesian PCA (II) The approach is based on evidence approximation, or empirical Bayes. The marginal likelihood function (the latent-space matrix W is integrated out): P(X | α, µ, σ²) = ∫ P(X | µ, W, σ²) P(W | α) dW where the first factor is the ML-PCA likelihood. The formal derivation is quite involved, but the maximization with respect to the precision parameters yields a simple form: α_i^new = D / (w_i⊤w_i) This estimation is interleaved with the EM updates for estimating W and σ².
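A small sketch of the precision update and the resulting pruning rule; the data, the near-zero column, and the pruning threshold are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(6)
D, d = 5, 3
W = rng.normal(size=(D, d))
W[:, 2] *= 1e-4                           # a nearly "switched-off" direction

# alpha_i = D / (w_i^T w_i): one precision per column of W
alpha = D / np.einsum('ij,ij->j', W, W)

# Columns with very large precision carry almost no variance -> disregard them
keep = alpha < 1e6
```

In the full algorithm this update alternates with the EM steps for W and σ², and the effective latent dimension is the number of retained columns.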
Factor Analysis Probabilistic PCA so far (the conditional, i.e. noise, covariance is isotropic): P(x | y) = N(x | Wy + µ, σ²I) In factor analysis, this covariance is diagonal rather than isotropic: P(x | y) = N(x | Wy + µ, Ψ) The columns of W are called factor loadings and the diagonal entries of Ψ are called uniquenesses. The factor-analysis point of view: one form of latent-variable density model, where the form of the latent space is of interest but not the particular choice of coordinates (up to an orthogonal transformation). The factor-analysis parameters W and Ψ are estimated via the maximum-likelihood and EM frameworks.