

Chapter 12
EM algorithms

The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables, eg. Gaussian Mixture Models (GMMs), Linear Dynamical Systems (LDSs) and Hidden Markov Models (HMMs).

12.1 Gaussian Mixture Models

Say we have a variable which is multi-modal, ie. it separates into distinct clusters. For such data the mean and variance are not very representative quantities.

In a 1-dimensional Gaussian Mixture Model (GMM) with m components the likelihood of a data point x_n is given by

    p(x_n) = \sum_{k=1}^{m} p(x_n | k) p(s_n = k)    (12.1)

where s_n is an indicator variable indicating which component is selected for which data point. These are chosen probabilistically according to

    p(s_n = k) = \pi_k    (12.2)

and each component is a Gaussian

    p(x_n | k) = (2\pi\sigma_k^2)^{-1/2} \exp\left( -\frac{(x_n - \mu_k)^2}{2\sigma_k^2} \right)    (12.3)

To generate data from a GMM we pick a Gaussian at random (according to 12.2) and then sample from that Gaussian.

To fit a GMM to a data set we need to estimate \pi_k, \mu_k and \sigma_k^2. This can be achieved in two steps. In the 'E-Step' we soft-partition the data among the different clusters. This amounts to calculating the probability \gamma_{nk} that data point n belongs to cluster k which, from Bayes' rule, is

    \gamma_{nk} \equiv p(s_n = k | x_n) = \frac{p(x_n | k) p(s_n = k)}{\sum_{k'} p(x_n | k') p(s_n = k')}    (12.4)
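To make the E-step concrete, here is a minimal NumPy sketch of equation (12.4) that computes the responsibilities \gamma_{nk} for a 1-dimensional GMM. The notes themselves contain no code; the function and variable names (gaussian_pdf, e_step, pi, mu, var) are illustrative choices, not part of the original.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, equation (12.3)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(x, pi, mu, var):
    """Soft-partitioning: gamma[n, k] = p(s_n = k | x_n), equation (12.4).

    x   : (N,) data points
    pi  : (m,) mixing proportions
    mu  : (m,) component means
    var : (m,) component variances
    """
    # Numerator of (12.4): p(x_n | k) p(s_n = k) for every point and component.
    joint = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], var[None, :])
    # Denominator of (12.4): normalise over components.
    return joint / joint.sum(axis=1, keepdims=True)
```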

In the 'M-Step' we re-estimate the parameters using Maximum Likelihood, but with the data points weighted according to the soft-partitioning

    \pi_k = \frac{1}{N} \sum_n \gamma_{nk}    (12.5)

    \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}

    \sigma_k^2 = \frac{\sum_n \gamma_{nk} (x_n - \mu_k)^2}{\sum_n \gamma_{nk}}

These two steps constitute an EM algorithm. Summarizing:

E-Step: Soft-partitioning.
M-Step: Parameter updating.

Figure 12.1: A variable with 3 modes. This can be accurately modelled with a 3-component Gaussian Mixture Model.

Application of a 3-component GMM to our example data gives, for cluster (i), \pi_1 = 0.3, \mu_1 = 10, \sigma_1^2 = 10; for cluster (ii), \pi_2 = 0.35, \mu_2 = 40, \sigma_2^2 = 10; and for cluster (iii), \pi_3 = 0.35, \mu_3 = 50, \sigma_3^2 = 5.

GMMs are readily extended to multivariate data by replacing each univariate Gaussian in the mixture with a multivariate Gaussian. See eg. chapter 3 in [3].
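The M-step formulas of equation (12.5) have an equally short sketch; as above, the names are illustrative. Alternating e_step and m_step until the log-likelihood stops increasing gives the two-step fitting procedure summarized above.

```python
import numpy as np

def m_step(x, gamma):
    """Weighted Maximum Likelihood re-estimation, equation (12.5).

    x     : (N,) data points
    gamma : (N, m) responsibilities from the E-step
    """
    Nk = gamma.sum(axis=0)                      # effective number of points in each cluster
    pi = Nk / x.shape[0]                        # mixing proportions
    mu = (gamma * x[:, None]).sum(axis=0) / Nk  # responsibility-weighted means
    var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk  # weighted variances
    return pi, mu, var
```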

12.2 General Approach

If V are visible variables, H are hidden variables and \theta are the parameters then

1. E-Step: get p(H | V, \theta).
2. M-Step: change \theta so as to maximise

    Q = \langle \log p(V, H | \theta) \rangle    (12.6)

where the expectation is with respect to p(H | V, \theta).

Why does it work? Maximising Q maximises the likelihood p(V | \theta). This can be proved as follows. Firstly

    p(V | \theta) = \frac{p(H, V | \theta)}{p(H | V, \theta)}    (12.7)

This means that the log-likelihood, L(\theta) \equiv \log p(V | \theta), can be written

    L(\theta) = \log p(H, V | \theta) - \log p(H | V, \theta)    (12.8)

If we now take expectations with respect to a distribution p'(H) then we get

    L(\theta) = \int p'(H) \log p(H, V | \theta) dH - \int p'(H) \log p(H | V, \theta) dH    (12.9)

The second term is minimised by setting p'(H) = p(H | V, \theta) (we can prove this from Jensen's inequality or the positivity of the KL divergence; see [2] or lecture 4). This takes place in the E-Step. After the E-Step the auxiliary function Q is then equal to the log-likelihood. Therefore, when we maximise Q in the M-Step we are maximising the likelihood.

12.3 Probabilistic Principal Component Analysis

In an earlier lecture, Principal Component Analysis (PCA) was viewed as a linear transform

    y = Q^T x    (12.10)

where the jth column of the matrix Q is the jth eigenvector, q_j, of the covariance matrix of the original d-dimensional data x. The jth projection

    y_j = q_j^T x    (12.11)

has a variance given by the jth eigenvalue \lambda_j. If the projections are ranked according to variance (ie. eigenvalue) then the M variables that reconstruct the original data with minimum error (and are also linear functions of x) are given by y_1, y_2, ..., y_M. The remaining variables y_{M+1}, ..., y_d can be discarded with minimal loss of information (in the sense of least squares error). The reconstructed data is given by

    \hat{x} = Q_{1:M} y_{1:M}    (12.12)

where Q_{1:M} is a matrix formed from the first M columns of Q. Similarly, y_{1:M} = [y_1, y_2, ..., y_M]^T.
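The PCA recap of equations (12.10)-(12.12) can be checked numerically. The sketch below, an illustration rather than part of the notes, forms Q from the eigenvectors of the sample covariance, ranks the projections by eigenvalue, and reconstructs the data from the first M of them.

```python
import numpy as np

def pca_reconstruct(X, M):
    """Project rows of X onto the top M principal components and reconstruct,
    following equations (12.10)-(12.12)."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # work with zero-mean data
    cov = np.cov(Xc, rowvar=False)              # sample covariance matrix
    eigvals, Q = np.linalg.eigh(cov)            # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]           # rank projections by variance (eigenvalue)
    eigvals, Q = eigvals[order], Q[:, order]
    Q_M = Q[:, :M]                              # first M columns of Q
    Y_M = Xc @ Q_M                              # y_{1:M} = Q_{1:M}^T x for every data point
    X_hat = Y_M @ Q_M.T + mean                  # reconstruction, equation (12.12)
    return X_hat, eigvals
```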

In probabilistic PCA (pPCA) [60] the PCA transform is converted into a statistical model by explaining the 'discarded' variance as observation noise

    x = W y + e    (12.13)

where the noise e is drawn from a zero mean Gaussian distribution with isotropic covariance \sigma^2 I. The 'observations' x are generated by transforming the 'sources' y with the 'mixing matrix' W and then adding 'observation noise'. The pPCA model has M sources, where M < d. For a given M, we have W = Q_{1:M} and

    \sigma^2 = \frac{1}{d - M} \sum_{j=M+1}^{d} \lambda_j    (12.14)

which is the average variance of the discarded projections.

There also exists an EM algorithm for finding the mixing matrix which is more efficient than SVD for high dimensional data. This is because it only needs to invert an M-by-M matrix rather than a d-by-d matrix.

If we define S as the sample covariance matrix and

    C = W W^T + \sigma^2 I    (12.15)

then the log-likelihood of the data under a pPCA model is given by [60]

    \log p(X) = -\frac{Nd}{2} \log 2\pi - \frac{N}{2} \log |C| - \frac{N}{2} \mathrm{Tr}(C^{-1} S)    (12.16)

where N is the number of data points. We are now in the position to apply the MDL model order selection criterion. We have

    \mathrm{MDL}(M) = -\log p(X) + \frac{Md}{2} \log N    (12.17)

This gives us a procedure for choosing the optimal number of sources.

Because pPCA is a probabilistic model (whereas PCA is a transform) it is readily incorporated in larger models. A useful model, for example, is the Mixtures of pPCA model. This is identical to the Gaussian Mixture Model except that each Gaussian is decomposed using pPCA (rather than keeping it as a full covariance Gaussian). This can greatly reduce the number of parameters in the model [6].
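Assuming the closed-form choices quoted above (W = Q_{1:M} and equation (12.14) for \sigma^2), the following sketch evaluates the pPCA log-likelihood of equation (12.16) and the MDL score of equation (12.17) for a candidate number of sources M. The function name ppca_mdl is my own.

```python
import numpy as np

def ppca_mdl(X, M):
    """MDL score for a pPCA model with M sources, equations (12.14)-(12.17)."""
    N, d = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)                  # sample covariance matrix
    eigvals, Q = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, Q = eigvals[order], Q[:, order]

    W = Q[:, :M]                                  # W = Q_{1:M}
    sigma2 = eigvals[M:].mean()                   # average discarded variance, eq. (12.14)
    C = W @ W.T + sigma2 * np.eye(d)              # eq. (12.15)

    log_lik = (-0.5 * N * d * np.log(2 * np.pi)
               - 0.5 * N * np.linalg.slogdet(C)[1]
               - 0.5 * N * np.trace(np.linalg.solve(C, S)))   # eq. (12.16)
    return -log_lik + 0.5 * M * d * np.log(N)     # eq. (12.17)

# Choosing the model order by minimising MDL(M), eg.:
# best_M = min(range(1, d), key=lambda M: ppca_mdl(X, M))
```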

12.4 Linear Dynamical Systems

A Linear Dynamical System (LDS) is given by the following 'state-space' equations

    x_{t+1} = A x_t + w_t    (12.18)
    y_t = C x_t + v_t

where the state noise w_t and observation noise v_t are zero mean Gaussian variables with covariances Q and R. Given A, C, Q and R the state can be inferred using the Kalman filter, which is suitable for real-time applications. For retrospective/offline data analysis the state at time t can be determined using data both before and after t. This is known as Kalman smoothing. See eg. [2].

Moreover, we can also infer the other parameters, eg. the state noise covariance Q, the state transformation matrix C, etc. See eg. [22]. To do this, the state is regarded as a 'hidden variable' (we do not observe it) and we apply the EM algorithm [5]. For an LDS

    x_{t+1} = A x_t + w_t    (12.19)
    y_t = C x_t + v_t

the hidden variables are the states x_t and the observed variable is the time series y_t. If x_1^T = [x_1, x_2, ..., x_T] are the states and y_1^T = [y_1, y_2, ..., y_T] are the observations then the EM algorithm is as follows.

M-Step

In the M-Step we maximise

    Q = \langle \log p(y_1^T, x_1^T | \theta) \rangle    (12.20)

Because of the Markov property of an LDS (the current state only depends on the previous one, and not on ones before that) we have

    p(y_1^T, x_1^T | \theta) = p(x_1) \prod_{t=2}^{T} p(x_t | x_{t-1}) \prod_{t=1}^{T} p(y_t | x_t)    (12.21)

and when we take logs we get

    \log p(y_1^T, x_1^T | \theta) = \log p(x_1) + \sum_{t=2}^{T} \log p(x_t | x_{t-1}) + \sum_{t=1}^{T} \log p(y_t | x_t)    (12.22)

where each PDF is a multivariate Gaussian. We now need to take expectations with respect to the distribution over the hidden variables

    p(x_t | y_1^T)    (12.23)

This gives

    \langle \log p(y_1^T, x_1^T | \theta) \rangle = \langle \log p(x_1) \rangle + \sum_{t=2}^{T} \langle \log p(x_t | x_{t-1}) \rangle + \sum_{t=1}^{T} \langle \log p(y_t | x_t) \rangle    (12.24)

By taking derivatives with respect to each of the parameters and setting them to zero we get update equations for A, Q, C and R. See [22] for details. The distribution over hidden variables is calculated in the E-Step.
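As an illustration of the Markov factorisation in equation (12.22), the sketch below evaluates the complete-data log-likelihood \log p(y_1^T, x_1^T | \theta) for a given state sequence. In the actual M-step the states are not known; the terms are replaced by their expectations under the smoothed distribution from the E-step, and the updates for A, Q, C and R follow by setting derivatives to zero, as in [22]. The helper names and argument conventions here are my own.

```python
import numpy as np

def gauss_logpdf(z, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = z.shape[0]
    diff = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def complete_data_loglik(x, y, A, C, Q, R, mu0, P0):
    """log p(y_1^T, x_1^T | theta) via the factorisation of equation (12.22).

    x : (T, dx) state sequence     y : (T, dy) observations
    A, Q : state transition matrix and state noise covariance
    C, R : observation matrix and observation noise covariance
    mu0, P0 : mean and covariance of the initial state density p(x_1)
    """
    T = x.shape[0]
    ll = gauss_logpdf(x[0], mu0, P0)                 # log p(x_1)
    for t in range(1, T):
        ll += gauss_logpdf(x[t], A @ x[t - 1], Q)    # log p(x_t | x_{t-1})
    for t in range(T):
        ll += gauss_logpdf(y[t], C @ x[t], R)        # log p(y_t | x_t)
    return ll
```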

E-Step

The E-Step consists of two parts. In the forward pass the joint probability

    \alpha_t \equiv p(x_t, y_1^t) = \int \alpha_{t-1} p(x_t | x_{t-1}) p(y_t | x_t) dx_{t-1}    (12.25)

is recursively evaluated using a Kalman filter. In the backward pass we estimate the conditional probability

    \beta_t \equiv p(y_{t+1}^T | x_t) = \int \beta_{t+1} p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1}) dx_{t+1}    (12.26)

The two are then combined to produce a smoothed estimate

    p(x_t | y_1^T) \propto \alpha_t \beta_t    (12.27)

This E-Step constitutes a Kalman smoother.
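In the Gaussian case the forward pass of equation (12.25) reduces to the standard Kalman filter predict/update recursions. The following is a minimal NumPy version of that forward pass only (the backward pass and the combination step of equation (12.27) are omitted); the variable names and the choice to return filtered means and covariances are my own.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, mu0, P0):
    """Forward pass: filtered mean and covariance of p(x_t | y_1^t) for each t.

    y : (T, dy) observations
    A, Q : state transition matrix and state noise covariance
    C, R : observation matrix and observation noise covariance
    mu0, P0 : prior mean and covariance of the initial state
    """
    T, dx = y.shape[0], mu0.shape[0]
    mu, P = np.zeros((T, dx)), np.zeros((T, dx, dx))
    m_pred, P_pred = mu0, P0                       # prediction for t = 1 is the prior
    for t in range(T):
        # Update: condition the predicted state on the observation y_t.
        S = C @ P_pred @ C.T + R                   # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)        # Kalman gain
        mu[t] = m_pred + K @ (y[t] - C @ m_pred)
        P[t] = P_pred - K @ C @ P_pred
        # Predict: propagate through x_{t+1} = A x_t + w_t.
        m_pred = A @ mu[t]
        P_pred = A @ P[t] @ A.T + Q
    return mu, P
```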

Figure 12.2: (a) Kalman filtering and (b) Kalman smoothing (axes: Seconds versus Frequency).