Latent Variable Models and EM Algorithm


SC4/SM8 Advanced Topics in Statistical Machine Learning
Latent Variable Models and EM Algorithm
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

Probabilistic Unsupervised Learning

Probabilistic Methods
Algorithmic approach: Data → Algorithm → Analysis / Interpretation.
Probabilistic modelling approach: an unobserved process, via a generative model, produces the data; analysis / interpretation then proceeds by inverting the generative model, i.e., inferring the unobserved process from the observed data.

Mixture Models
Mixture models suppose that our dataset $X$ was created by sampling iid from $K$ distinct populations (called mixture components). Samples in population $k$ can be modelled using a distribution $F_{\mu_k}$ with density $f(x \mid \mu_k)$, where $\mu_k$ is the model parameter for the $k$-th component. For a concrete example, consider a Gaussian with unknown mean $\mu_k$ and known diagonal covariance $\sigma^2 I$,
$$f(x \mid \mu_k) = (2\pi\sigma^2)^{-\frac{p}{2}} \exp\left(-\frac{1}{2\sigma^2}\|x - \mu_k\|_2^2\right).$$
Generative model: for $i = 1, 2, \dots, n$:
First determine the assignment variable independently for each data item $i$: $Z_i \sim \mathrm{Discrete}(\pi_1, \dots, \pi_K)$, i.e., $P(Z_i = k) = \pi_k$, where the mixing proportions satisfy $\pi_k \geq 0$ for each $k$ and $\sum_{k=1}^K \pi_k = 1$.
Given the assignment $Z_i = k$, $X_i = (X_i^{(1)}, \dots, X_i^{(p)})$ is sampled (independently) from the corresponding $k$-th component: $X_i \mid Z_i = k \sim f(x \mid \mu_k)$.
We observe $X_i = x_i$ for each $i$ but not the $Z_i$'s (latent variables), and would like to infer the parameters $\{\mu_k\}_{k=1}^K$ and $\{\pi_k\}_{k=1}^K$ ($\sigma^2$ can also be estimated).
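
To make the generative story concrete, here is a minimal sketch (Python/NumPy, not from the slides) of sampling a dataset from this mixture model with spherical Gaussian components; the parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a K=3 component mixture in p=2 dimensions.
pi = np.array([0.5, 0.3, 0.2])            # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0],
               [5.0, 5.0],
               [-5.0, 5.0]])              # component means mu_k
sigma2 = 1.0                              # known covariance sigma^2 * I
n, p = 500, mu.shape[1]

# Generative model: Z_i ~ Discrete(pi), then X_i | Z_i = k ~ N(mu_k, sigma^2 I).
z = rng.choice(len(pi), size=n, p=pi)                      # latent assignments (unobserved)
x = mu[z] + np.sqrt(sigma2) * rng.standard_normal((n, p))  # observed data
```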

Mixture Models
Unknowns to learn given the data are
Parameters: $\theta = (\pi_k, \mu_k)_{k=1}^K$, where $\pi_1, \dots, \pi_K \in [0,1]$ and $\mu_1, \dots, \mu_K \in \mathbb{R}^p$, and
Latent variables: $z_1, \dots, z_n$.
The joint probability over all cluster indicator variables $\{Z_i\}$ is:
$$p_Z((z_i)_{i=1}^n) = \prod_{i=1}^n \pi_{z_i} = \prod_{i=1}^n \prod_{k=1}^K \pi_k^{\mathbb{1}(z_i = k)}$$
The joint density at observations $X_i = x_i$ given $Z_i = z_i$ is:
$$p_X((x_i)_{i=1}^n \mid (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n f(x_i \mid \mu_{z_i}) = \prod_{i=1}^n \prod_{k=1}^K f(x_i \mid \mu_k)^{\mathbb{1}(z_i = k)}$$

Mixture Models: Joint pmf/pdf of observed and latent variables
Unknowns to learn given the data are
Parameters: $\theta = (\pi_k, \mu_k)_{k=1}^K$, where $\pi_1, \dots, \pi_K \in [0,1]$ and $\mu_1, \dots, \mu_K \in \mathbb{R}^p$, and
Latent variables: $z_1, \dots, z_n$.
The joint probability mass function/density is:
$$p_{X,Z}((x_i, z_i)_{i=1}^n) = p_Z((z_i)_{i=1}^n)\, p_X((x_i)_{i=1}^n \mid (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n \prod_{k=1}^K \left(\pi_k f(x_i \mid \mu_k)\right)^{\mathbb{1}(z_i = k)}$$
And the marginal density of $x_i$ (the resulting model on the observed data) is:
$$p(x_i) = \sum_{j=1}^K p(z_i = j, x_i) = \sum_{j=1}^K \pi_j f(x_i \mid \mu_j).$$
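
The marginal density is what we can actually evaluate on observed data. A small sketch (my own illustration, not from the slides) of evaluating $\log p(x_i) = \log \sum_j \pi_j f(x_i \mid \mu_j)$ for spherical Gaussian components, using log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal(x, pi, mu, sigma2):
    """log p(x_i) = log sum_j pi_j N(x_i | mu_j, sigma2 * I), for each row x_i of x."""
    n, p = x.shape
    # log N(x_i | mu_j, sigma2 I) for every (i, j) pair, shape (n, K)
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_f = -0.5 * p * np.log(2 * np.pi * sigma2) - sq_dist / (2 * sigma2)
    return logsumexp(np.log(pi)[None, :] + log_f, axis=1)
```

Summing these values over $i$ gives the marginal log-likelihood $\ell(\theta)$ that is maximised later in the slides.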

Mixture Models: Gaussian Mixtures with Unequal Covariances
The most widely used mixture model is the mixture of Gaussians (MoG), also called a Gaussian mixture model (GMM), in which each component is a multivariate Gaussian with mean $\mu_k$ and covariance matrix $\Sigma_k$. Here $\theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^K$ are all the model parameters and
$$f(x \mid (\mu_k, \Sigma_k)) = (2\pi)^{-\frac{p}{2}} |\Sigma_k|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right),$$
$$p(x) = \sum_{k=1}^K \pi_k f(x \mid (\mu_k, \Sigma_k)).$$
[Figure from Murphy, 2012, Ch. 11 (based on Figure 2.23 of Bishop, 2006): a mixture of 3 Gaussians in 2d, showing (a) contours of constant probability for each component and (b) a surface plot of the overall density.]

Mixture Models: Responsibility
Suppose we know the parameters $\theta = (\pi_k, \mu_k)_{k=1}^K$. $Z_i$ is a random variable and its conditional distribution given the data set $X$ is:
$$Q_{ik} := p(Z_i = k \mid x_i) = \frac{p(Z_i = k, x_i)}{p(x_i)} = \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)}$$
The conditional probability $Q_{ik}$ is called the responsibility of mixture component $k$ for data point $x_i$. These conditionals softly partition the dataset among the $K$ components: $\sum_{k=1}^K Q_{ik} = 1$.
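
A sketch of computing the responsibility matrix $Q$ (an assumed Python/NumPy illustration for spherical Gaussian components; the normalisation is done in log space to avoid underflow):

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(x, pi, mu, sigma2):
    """Q[i, k] = p(Z_i = k | x_i) for a mixture of spherical Gaussians N(mu_k, sigma2 * I)."""
    n, p = x.shape
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)           # (n, K)
    log_joint = np.log(pi)[None, :] \
        - 0.5 * p * np.log(2 * np.pi * sigma2) - sq_dist / (2 * sigma2)     # log pi_k + log f(x_i | mu_k)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))  # rows sum to 1
```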

Mixture Models: Maximum Likelihood
How can we learn about the parameters $\theta = (\pi_k, \mu_k)_{k=1}^K$ from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE). The goal is to maximise the marginal probability of the data over the parameters:
$$\hat\theta_{ML} = \operatorname{argmax}_\theta p(X \mid \theta) = \operatorname{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n p(x_i \mid (\pi_k, \mu_k)_{k=1}^K) = \operatorname{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n \sum_{k=1}^K \pi_k f(x_i \mid \mu_k) = \operatorname{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \underbrace{\sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)}_{:=\,\ell((\pi_k,\mu_k)_{k=1}^K)}.$$

Mixture Models: Maximum Likelihood
Marginal log-likelihood:
$$\ell((\pi_k, \mu_k)_{k=1}^K) := \log p(X \mid (\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)$$
The gradient w.r.t. $\mu_k$:
$$\nabla_{\mu_k} \ell((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$
Difficult to solve, as $Q_{ik}$ depends implicitly on $\mu_k$.

Likelihood Surface for a Simple Example
If the latent variables $z_i$ were all observed, we would have a unimodal likelihood surface; but when we marginalise out the latents, the likelihood surface becomes multimodal: there is no unique MLE.
[Figure (from Murphy, 2012): (left) $n = 200$ data points sampled from a mixture of two 1D Gaussians with $\pi_1 = \pi_2 = 0.5$, $\sigma = 5$, $\mu_1 = -10$ and $\mu_2 = 10$. (right) Observed-data log-likelihood surface $\ell(\mu_1, \mu_2)$, with all other parameters set to their true values; the two symmetric modes reflect the unidentifiability of the parameters.]

Mixture Models: Maximum Likelihood
Recall we would like to solve:
$$\nabla_{\mu_k} \ell((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0$$
What if we ignore the dependence of $Q_{ik}$ on the parameters? Taking the mixture of Gaussians with covariance $\sigma^2 I$ as an example,
$$\sum_{i=1}^n Q_{ik} \nabla_{\mu_k}\left(-\tfrac{p}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\|x_i - \mu_k\|_2^2\right) = \frac{1}{\sigma^2}\sum_{i=1}^n Q_{ik}(x_i - \mu_k) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n Q_{ik} x_i - \mu_k \sum_{i=1}^n Q_{ik}\right) = 0$$
$$\mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}$$

Mixture Models: Maximum Likelihood
The estimate is a weighted average of the data points, where the estimated mean of cluster $k$ uses its responsibilities to the data points as weights:
$$\mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.$$
Makes sense: suppose we knew that data point $x_i$ came from population $z_i$. Then $Q_{i z_i} = 1$ and $Q_{ik} = 0$ for $k \neq z_i$, and:
$$\mu_k^{ML?} = \frac{\sum_{i: z_i = k} x_i}{\sum_{i: z_i = k} 1} = \operatorname{avg}\{x_i : z_i = k\}$$
Our best guess of the originating population is given by $Q_{ik}$. A soft K-means algorithm?

Mixture Models: Maximum Likelihood
Gradient w.r.t. the mixing proportion $\pi_k$ (including a Lagrange multiplier term $\lambda\left(\sum_k \pi_k - 1\right)$ to enforce the constraint $\sum_k \pi_k = 1$):
$$\frac{\partial}{\partial \pi_k}\left(\ell((\pi_k, \mu_k)_{k=1}^K) - \lambda\left(\sum_{k=1}^K \pi_k - 1\right)\right) = \sum_{i=1}^n \frac{f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} - \lambda = \sum_{i=1}^n \frac{Q_{ik}}{\pi_k} - \lambda = 0 \quad\Rightarrow\quad \pi_k \propto \sum_{i=1}^n Q_{ik}$$
Note:
$$\sum_{k=1}^K \sum_{i=1}^n Q_{ik} = \sum_{i=1}^n \underbrace{\sum_{k=1}^K Q_{ik}}_{=1} = n \quad\Rightarrow\quad \pi_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik}}{n}$$
Again this makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population $k$.

Mixture Models: The EM Algorithm
Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model.
Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$. Iterate for $t = 1, 2, \dots$:
Expectation Step:
$$Q_{ik}^{(t)} := \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}$$
Maximization Step:
$$\pi_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)}}{n} \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)} x_i}{\sum_{i=1}^n Q_{ik}^{(t)}}$$
Will the algorithm converge? What does it converge to?
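
The two updates above translate directly into code. Below is a minimal, self-contained sketch (Python/NumPy, my own illustration under the assumption of spherical components with known $\sigma^2$, as in the running example); it also tracks the marginal log-likelihood at each iteration.

```python
import numpy as np
from scipy.special import logsumexp

def em_mixture(x, K, sigma2=1.0, n_iter=50, seed=0):
    """EM for a mixture of spherical Gaussians N(mu_k, sigma2 * I) with unknown pi_k, mu_k."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    pi = np.full(K, 1.0 / K)
    mu = x[rng.choice(n, size=K, replace=False)]   # initialise means at random data points
    log_lik = []
    for _ in range(n_iter):
        # E-step: responsibilities Q[i, k] = p(Z_i = k | x_i, theta)
        sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_joint = np.log(pi)[None, :] - 0.5 * p * np.log(2 * np.pi * sigma2) - sq_dist / (2 * sigma2)
        log_marg = logsumexp(log_joint, axis=1, keepdims=True)
        Q = np.exp(log_joint - log_marg)
        log_lik.append(log_marg.sum())
        # M-step: closed-form updates derived on the previous slides
        Nk = Q.sum(axis=0)                         # effective number of points per component
        pi = Nk / n
        mu = (Q.T @ x) / Nk[:, None]
    return pi, mu, log_lik
```

Run on data sampled as in the earlier sketch, this recovers the mixing proportions and means up to label permutation, and `log_lik` should be non-decreasing (up to floating-point error), which is the property proved later in the slides.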

Example: Mixture of 3 Gaussians
An example with 3 clusters. [Figure: scatter plot of the data, $X_1$ vs $X_2$.]

Example: Mixture of 3 Gaussians
After the 1st E and M step. [Figure: Iteration 1.]

Example: Mixture of 3 Gaussians
After the 2nd E and M step. [Figure: Iteration 2.]

Example: Mixture of 3 Gaussians
After the 3rd E and M step. [Figure: Iteration 3.]

Example: Mixture of 3 Gaussians
After the 4th E and M step. [Figure: Iteration 4.]

Example: Mixture of 3 Gaussians
After the 5th E and M step. [Figure: Iteration 5.]

EM Algorithm
In a maximum likelihood framework, the objective function is the log-likelihood
$$\ell(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)$$
Direct maximisation is not feasible. Consider another objective function $F(\theta, q)$, where $q$ is any probability distribution on the latent variables $z$, such that:
$$F(\theta, q) \leq \ell(\theta) \text{ for all } \theta, q, \qquad \max_q F(\theta, q) = \ell(\theta).$$
$F(\theta, q)$ is a lower bound on the log-likelihood. We can construct an alternating maximisation algorithm as follows. For $t = 1, 2, \dots$ until convergence:
$$q^{(t)} := \operatorname{argmax}_q F(\theta^{(t-1)}, q), \qquad \theta^{(t)} := \operatorname{argmax}_\theta F(\theta, q^{(t)}).$$

EM Algorithm
The lower bound we use is called the variational free energy. $q$ is a probability mass function for a distribution over $z := (z_i)_{i=1}^n$.
$$F(\theta, q) = \mathbb{E}_q\left[\log p(x, z \mid \theta) - \log q(z)\right] = \mathbb{E}_q\left[\left(\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(z_i = k)\left(\log \pi_k + \log f(x_i \mid \mu_k)\right)\right) - \log q(z)\right] = \sum_z q(z)\left[\left(\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(z_i = k)\left(\log \pi_k + \log f(x_i \mid \mu_k)\right)\right) - \log q(z)\right]$$
Lemma. $F(\theta, q) \leq \ell(\theta)$ for all $q$ and for all $\theta$.
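
As a numerical sanity check on the bound, here is a small sketch (my own Python/NumPy illustration, assuming spherical Gaussian components and a $q$ that factorises over data points as $q(z) = \prod_i q_i(z_i)$, which is the form the optimal $q$ takes on the next slide) computing $F(\theta, q)$ and $\ell(\theta)$ so that the Lemma can be checked directly:

```python
import numpy as np
from scipy.special import logsumexp

def log_components(x, pi, mu, sigma2):
    """log pi_k + log N(x_i | mu_k, sigma2 * I), shape (n, K)."""
    n, p = x.shape
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.log(pi)[None, :] - 0.5 * p * np.log(2 * np.pi * sigma2) - sq_dist / (2 * sigma2)

def free_energy(q, x, pi, mu, sigma2):
    """F(theta, q) for a q that factorises over data points: q[i, k] = q_i(z_i = k)."""
    lj = log_components(x, pi, mu, sigma2)
    return np.sum(q * (lj - np.log(np.clip(q, 1e-300, None))))

def log_likelihood(x, pi, mu, sigma2):
    """l(theta) = sum_i log sum_k pi_k f(x_i | mu_k)."""
    return logsumexp(log_components(x, pi, mu, sigma2), axis=1).sum()
```

For any valid $q$ (rows summing to one), `free_energy(...) <= log_likelihood(...)`, with equality when $q$ equals the responsibilities, matching the two Lemmas.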

EM Algorithm: Solving for q
Lemma. $F(\theta, q) = \ell(\theta)$ for $q(z) = p(z \mid x, \theta)$.
In combination with the previous Lemma, this implies that $q(z) = p(z \mid x, \theta)$ maximizes $F(\theta, q)$ for fixed $\theta$, i.e., the optimal $q$ is simply the conditional distribution of the latents given the data and that fixed $\theta$.
In the mixture model,
$$q^*(z) = \frac{p(z, x \mid \theta)}{p(x \mid \theta)} = \frac{\prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{z'} \prod_{i=1}^n \pi_{z'_i} f(x_i \mid \mu_{z'_i})} = \prod_{i=1}^n \frac{\pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_k \pi_k f(x_i \mid \mu_k)} = \prod_{i=1}^n p(z_i \mid x_i, \theta).$$

EM Algorithm: Solving for θ
Setting the derivative with respect to $\mu_k$ to 0,
$$\nabla_{\mu_k} F(\theta, q) = \sum_z q(z) \sum_{i=1}^n \mathbb{1}(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0$$
This equation can be solved quite easily. E.g., for a mixture of Gaussians,
$$\mu_k = \frac{\sum_{i=1}^n q(z_i = k) x_i}{\sum_{i=1}^n q(z_i = k)}$$
If it cannot be solved exactly, we can use a gradient ascent algorithm (generalized EM):
$$\mu_k := \mu_k + \alpha \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$
A similar derivation gives the optimal $\pi_k$ as before.

EM Algorithm
Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$. Iterate for $t = 1, 2, \dots$:
Expectation Step:
$$q^{(t)}(z_i = k) := p(z_i = k \mid x_i, \theta^{(t-1)}) = \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}$$
Maximization Step:
$$\pi_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)}{n} \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)\, x_i}{\sum_{i=1}^n q^{(t)}(z_i = k)}$$
Theorem. The EM algorithm does not decrease the log-likelihood.
Proof: $\ell(\theta^{(t-1)}) = F(\theta^{(t-1)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t+1)}) = \ell(\theta^{(t)})$.
Under the additional assumption that $\nabla^2_\theta F(\theta^{(t)}, q^{(t)})$ are negative definite with eigenvalues $< -\epsilon < 0$, it follows that $\theta^{(t)} \to \theta^*$ where $\theta^*$ is a local MLE.

Notes on the Probabilistic Approach and EM Algorithm
Some good things:
Guaranteed convergence to locally optimal parameters.
Formal reasoning about uncertainties, using both Bayes' Theorem and maximum likelihood theory.
Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.
Some bad things:
Can get stuck in local optima, so multiple restarts are recommended.
Slower and more expensive than K-means.
Choice of K is still problematic, but a rich array of methods for model selection comes to the rescue.

Flexible Gaussian Mixture Models
We can allow each cluster to have its own mean and covariance structure, which gives greater flexibility in the model.
[Figure: fitted mixtures under different covariance assumptions — different covariances; different but diagonal covariances; identical covariances; identical and spherical covariances.]

Probabilistic PCA
A probabilistic model related to PCA (also known as sensible PCA) has the following generative model. Let $k < n, p$ be given. For $i = 1, 2, \dots, n$:
Let $Y_i$ be a (latent) $k$-dimensional normally distributed random variable with zero mean and identity covariance:
$$Y_i \sim \mathcal{N}(0, I_k)$$
We model the distribution of the $i$-th data point given $Y_i$ as a $p$-dimensional normal:
$$X_i \sim \mathcal{N}(\mu + L Y_i, \sigma^2 I)$$
where the parameters are a vector $\mu \in \mathbb{R}^p$, a matrix $L \in \mathbb{R}^{p \times k}$ and $\sigma^2 > 0$.
Tipping and Bishop, 1999
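
A minimal sketch of this generative process (Python/NumPy; the dimensions and parameter values are made-up illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 5, 2                      # k < n, p

# Hypothetical PPCA parameters.
mu = rng.standard_normal(p)              # mean vector mu in R^p
L = rng.standard_normal((p, k))          # loading matrix L in R^{p x k}
sigma2 = 0.1                             # observation noise variance

# Generative model: Y_i ~ N(0, I_k), then X_i | Y_i ~ N(mu + L Y_i, sigma^2 I_p).
Y = rng.standard_normal((n, k))
X = mu + Y @ L.T + np.sqrt(sigma2) * rng.standard_normal((n, p))
```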

Probabilistic PCA: EM vs MLE
The EM algorithm can be used for ML estimation (lecture notes), but PPCA also admits a direct MLE (which is not unique). Let $\lambda_1 \geq \dots \geq \lambda_p$ be the eigenvalues of the sample covariance and $V_{1:k} \in \mathbb{R}^{p \times k}$ the top $k$ eigenvectors as before. Let $Q \in \mathbb{R}^{k \times k}$ be any orthogonal matrix. Then an MLE is given by:
$$\mu^{MLE} = \bar{x}, \qquad (\sigma^2)^{MLE} = \frac{1}{p - k} \sum_{j=k+1}^p \lambda_j, \qquad L^{MLE} = V_{1:k}\, \mathrm{diag}\!\left((\lambda_1 - (\sigma^2)^{MLE})^{\frac{1}{2}}, \dots, (\lambda_k - (\sigma^2)^{MLE})^{\frac{1}{2}}\right) Q$$
However, EM can be faster, can be implemented online, can handle missing data and can be extended to more complicated models!
Tipping and Bishop, 1999
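
A sketch of the closed-form MLE above (Python/NumPy, my own illustration; it takes $Q = I$ and assumes the data matrix has observations as rows):

```python
import numpy as np

def ppca_mle(X, k):
    """Closed-form PPCA MLE (Tipping & Bishop, 1999) with Q = I. X has shape (n, p), k < p."""
    n, p = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                      # sample covariance, (p, p)
    lam, V = np.linalg.eigh(S)                       # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]                   # sort to descending order
    sigma2 = lam[k:].mean()                          # average of the discarded eigenvalues
    L = V[:, :k] * np.sqrt(np.maximum(lam[:k] - sigma2, 0.0))
    return mu, L, sigma2
```

Applied to data drawn from the PPCA sketch above, this recovers $\mu$, $\sigma^2$ and the column span of $L$ ($L$ itself only up to the rotation $Q$).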

Probabilistic PCA
[Figure: PPCA latents and the principal subspace. Figures on these slides are from M. Sahani's UCL course on Unsupervised Learning.]

Probabilistic PCA
[Figure: PPCA latents, the PCA projection and the principal subspace.]

Probabilistic PCA
[Figure: PPCA latents, the PPCA latent prior, PPCA noise and the principal subspace.]

Probabilistic PCA
[Figure: PPCA latents, the PPCA latent prior, PPCA noise, the PPCA posterior, the PPCA projection and the principal subspace.]

Mixture of Probabilistic PCAs
We have learnt two types of unsupervised learning techniques:
Dimensionality reduction, e.g. PCA, MDS, Isomap.
Clustering, e.g. K-means, linkage and mixture models.
Probabilistic models allow us to construct more complex models from simpler pieces. A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:
$$Z_i \sim \mathrm{Discrete}(\pi_1, \dots, \pi_K), \qquad Y_i \sim \mathcal{N}(0, I_d), \qquad X_i \mid Z_i = k, Y_i = y_i \sim \mathcal{N}(\mu_k + L y_i, \sigma^2 I_p)$$
This allows flexible modelling of the covariance structure without using too many parameters.
Ghahramani and Hinton, 1996
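
A sketch of sampling from this combined model (Python/NumPy; the parameter values are made up, and the loading matrix $L$ and noise variance $\sigma^2$ are shared across components, as written above — per-component loadings $L_k$ are a common variant):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d, K = 400, 5, 2, 3

# Hypothetical parameters.
pi = np.array([0.4, 0.35, 0.25])              # mixing proportions
mu = 5.0 * rng.standard_normal((K, p))        # per-component means mu_k
L = rng.standard_normal((p, d))               # shared loading matrix
sigma2 = 0.1

# Generative model: Z_i ~ Discrete(pi), Y_i ~ N(0, I_d), X_i | Z_i=k, Y_i=y_i ~ N(mu_k + L y_i, sigma^2 I_p).
Z = rng.choice(K, size=n, p=pi)
Y = rng.standard_normal((n, d))
X = mu[Z] + Y @ L.T + np.sqrt(sigma2) * rng.standard_normal((n, p))
```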

Further reading
Hastie et al., Section 8.5.
Bishop, Chapter 9.
Roweis and Ghahramani: A unifying review of linear Gaussian models.