The Basic Idea of EM


Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com

June 7, 2017

Contents

1 Introduction
2 GMM: A working example
  2.1 Gaussian mixture model
  2.2 The hidden variable interpretation
  2.3 What if we can observe the hidden variable?
  2.4 Can we imitate an oracle?
3 An informal description of the EM algorithm
4 The Expectation-Maximization algorithm
  4.1 Jointly-non-concave incomplete log-likelihood
  4.2 (Possibly) Concave complete data log-likelihood
  4.3 The general EM derivation
  4.4 The E- & M-steps
  4.5 The EM algorithm
  4.6 Will EM converge?
5 EM for GMM

1 Introduction

Statistical learning models are very important in many areas inside computer science, including but not confined to machine learning, computer vision, pattern recognition and data mining. They are also important in some deep learning models, such as the Restricted Boltzmann Machine (RBM).

Statistical learning models have parameters, and estimating such parameters from data is one of the key problems in the study of such models. Expectation-Maximization (EM) is arguably the most widely used parameter estimation technique, so it is worthwhile to know some basics of EM. However, although EM is must-have knowledge for studying statistical learning models, it is not easy for beginners. This note introduces the basic idea behind EM. I want to emphasize that the main purpose of this note is to introduce the basic idea (or, emphasizing the intuition) behind EM, not to cover all details of EM or to present rigorous mathematical derivations.[1]

2 GMM: A working example

Let us start from a simple working example: the Gaussian Mixture Model (GMM).

2.1 Gaussian mixture model

In Figure 1, we show three curves corresponding to three different probability density functions (p.d.f.). The blue curve is the p.d.f. of a normal distribution N(10, 16), i.e., a Gaussian distribution with mean µ = 10 and standard deviation σ = 4 (and σ² = 16). We denote this p.d.f. as p_1(x) = N(x; 10, 16). The red curve is another normal distribution N(30, 49) with µ = 30 and σ = 7. Similarly, we denote it as p_2(x) = N(x; 30, 49).

We are interested in the black curve, whose first half is similar to the blue one, while the second half is similar to the red one. This curve is also the p.d.f. of a distribution, denoted by p_3. Since the black curve is similar to parts of the blue and red curves, it is reasonable to conjecture that p_3 is related to both p_1 and p_2. Indeed, p_3 is a weighted combination of p_1 and p_2. In this example,

    p_3(x) = 0.2 p_1(x) + 0.8 p_2(x).    (1)

Because 0.2 + 0.8 = 1, it is easy to verify that p_3(x) ≥ 0 always holds and that ∫ p_3(x) dx = 1. Hence, p_3 is a valid p.d.f. Since p_3 is a mixture of two Gaussians (p_1 and p_2), it is a Gaussian mixture model (GMM). The definition of a GMM is in fact more general: it can have more than two components, and the Gaussians can be multivariate.

[1] The first version of this note was written in Chinese, and was started as note-taking in a course at the Georgia Institute of Technology while I was a graduate student there. That version was typeset in Microsoft Word. Unfortunately, it contained a lot of errors and I did not have a chance to check it again. This version (written in 2016) was started while I was preparing materials for the Pattern Recognition course I will teach in the Spring Semester at Nanjing University. It is greatly expanded, and the errors that I found have been corrected.
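
As a quick numerical illustration of Equation 1 (not part of the original note), the following NumPy sketch evaluates the mixture density p_3 and checks that it integrates to approximately one; the function names and the integration grid are my own choices.

    import numpy as np

    def normal_pdf(x, mu, sigma):
        """Univariate Gaussian density N(x; mu, sigma^2)."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def p3(x):
        """The mixture of Equation 1: 0.2 N(10, 16) + 0.8 N(30, 49)."""
        return 0.2 * normal_pdf(x, 10.0, 4.0) + 0.8 * normal_pdf(x, 30.0, 7.0)

    # Riemann-sum check that p3 integrates to (approximately) 1 over a wide grid.
    x = np.linspace(-100.0, 150.0, 200001)
    print((p3(x) * (x[1] - x[0])).sum())   # close to 1.0, so p3 is a valid p.d.f.

The same check applied to any convex combination of valid densities would also return one, which is exactly why conditions (4) and (5) below suffice.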

[Figure 1: A simple GMM illustration, plotting the densities p_1 (blue), p_2 (red), and the mixture p_3 (black).]

A GMM is a distribution whose p.d.f. has the following form:

    p(x) = \sum_{i=1}^{N} \alpha_i N(x; \mu_i, \Sigma_i)    (2)
         = \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right),    (3)

in which x is a d-dimensional random vector. In this GMM, there are N Gaussian components, with the i-th Gaussian having the mean vector µ_i ∈ R^d and the covariance matrix Σ_i ∈ R^{d×d}.[2] These Gaussian components are combined through a linear combination, where the weight for the i-th component is α_i (called the mixing coefficients). The mixing coefficients must satisfy the following conditions:

    \sum_{i=1}^{N} \alpha_i = 1,    (4)
    \alpha_i \geq 0, \quad \forall i.    (5)

It is easy to verify that under these conditions, p(x) is a valid multivariate probability density function.

2.2 The hidden variable interpretation

We can have a different interpretation of the Gaussian mixture model, using the hidden variable concept, as illustrated in Figure 2.

[2] We will use boldface letters to denote a vector.

[Figure 2: GMM as a graphical model: a node Z pointing to a node X.]

In Figure 2, the random variable X follows a Gaussian mixture model (cf. Equation 3). Its parameter is

    \theta = \{\alpha_i, \mu_i, \Sigma_i\}_{i=1}^{N}.    (6)

If we want to sample an instance from this GMM, we could directly sample from the p.d.f. in Equation 3. However, there is another, two-step way to perform the sampling.

Let us define a random variable Z. Z follows a multinomial (discrete) distribution, taking values from the set {1, 2, ..., N}. The probability that Z takes the value i is α_i, i.e., Pr(Z = i) = α_i, for 1 ≤ i ≤ N. Then, the two-step sampling procedure is:

Step 1 Sample from Z, and get a value i (1 ≤ i ≤ N);
Step 2 Sample x from the i-th Gaussian component N(µ_i, Σ_i).

It is easy to verify that a sample x obtained from this two-step sampling procedure follows the underlying GMM distribution in Equation 3.

In learning GMM parameters, we are given a sample set {x_1, x_2, ..., x_M}, where the x_i are i.i.d. (identically and independently distributed) instances sampled from the p.d.f. in Equation 3. From this set of samples, we want to estimate (or learn) the GMM parameters θ = {α_i, µ_i, Σ_i}_{i=1}^N. Because we are given the samples x_i, the random variable X (cf. Figure 2) is called an observed (or observable) random variable. As shown in Figure 2, observed random variables are usually drawn as a filled circle.
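
To make the two-step procedure concrete, here is a small NumPy sketch (my own toy parameters, not from the note) that samples by first drawing the component index and then drawing from that component:

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy 2-component, 2-dimensional GMM: (alpha_i, mu_i, Sigma_i).
    alphas = np.array([0.3, 0.7])
    mus    = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
    Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

    def sample_gmm(n):
        """Two-step sampling: draw Z with Pr(Z = i) = alpha_i, then x ~ N(mu_Z, Sigma_Z)."""
        samples, labels = [], []
        for _ in range(n):
            i = rng.choice(len(alphas), p=alphas)            # Step 1: sample Z = i
            x = rng.multivariate_normal(mus[i], Sigmas[i])   # Step 2: sample x from component i
            samples.append(x)
            labels.append(i)
        return np.array(samples), np.array(labels)

    X, Z = sample_gmm(1000)
    print(np.bincount(Z) / len(Z))   # empirical component frequencies, close to alphas

The empirical frequencies of the drawn component indices approach the mixing coefficients, and the samples themselves follow the mixture density of Equation 3.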

The random variable Z, however, is not observable, and is called a hidden variable (or a latent variable). Hidden variables are shown as a circle, as the Z node in Figure 2.

2.3 What if we can observe the hidden variable?

In real applications, we do not know the value (or instantiation) of Z, because it is hidden (not observable). This fact makes estimating GMM parameters rather difficult, and techniques such as EM (the focus of this note) have to be employed.

However, for the sample set X = {x_1, x_2, ..., x_M}, let us consider the scenario in which we can further suppose that some oracle has given us the values of Z: Z = {z_1, z_2, ..., z_M}. In other words, we know that x_i is sampled from the z_i-th Gaussian component. In this case, it is easy to estimate the parameters θ.

First, we can find all those samples that are generated from the i-th component, and use X_i to denote this subset of samples. In precise mathematical language,

    X_i = \{ x_j \mid z_j = i, \ 1 \leq j \leq M \}.    (7)

The mixing coefficient estimation is a simple counting. We can count the number of examples which are generated from the i-th Gaussian component as m_i = |X_i|, where |·| is the size (number of elements) of a set. Then, the maximum likelihood estimate for α_i is

    \hat{\alpha}_i = \frac{m_i}{\sum_{j=1}^{N} m_j} = \frac{m_i}{M}.    (8)

Second, it is also easy to estimate the µ_i and Σ_i parameters for any 1 ≤ i ≤ N. The maximum likelihood estimation solutions are the same as the single Gaussian equations:[3]

    \hat{\mu}_i = \frac{1}{m_i} \sum_{x \in X_i} x,    (9)
    \hat{\Sigma}_i = \frac{1}{m_i} \sum_{x \in X_i} (x - \hat{\mu}_i)(x - \hat{\mu}_i)^T.    (10)

In short, if we know the hidden variable's instantiations, the estimation is straightforward. Unfortunately, we are only given the observed sample set X. The hidden variable instantiations Z are unknown to us. This fact complicates the entire parameter estimation process.

[3] Please refer to my note on properties of normal distributions for the derivation of these equations.
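
Equations 7-10 translate directly into a few lines of NumPy. The sketch below (function and variable names are mine) estimates the parameters from labelled data, i.e., with the oracle-provided z_j available:

    import numpy as np

    def mle_with_labels(X, Z, N):
        """Equations 7-10: ML estimates when the hidden labels Z are observed.

        X: (M, d) array of samples; Z: (M,) array of component indices in {0, ..., N-1}.
        """
        M = len(X)
        alphas, mus, Sigmas = [], [], []
        for i in range(N):
            Xi = X[Z == i]                        # the subset X_i of Equation 7
            m_i = len(Xi)                         # m_i = |X_i|
            alphas.append(m_i / M)                # Equation 8
            mu_i = Xi.mean(axis=0)                # Equation 9
            diff = Xi - mu_i
            Sigmas.append(diff.T @ diff / m_i)    # Equation 10 (the biased ML estimate)
            mus.append(mu_i)
        return np.array(alphas), mus, Sigmas

Applied to the labelled samples produced by the two-step sampler above, these estimates recover the toy parameters up to sampling noise.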

2.4 Can we imitate an oracle?

A natural question to ask ourselves is: if we do not have an oracle to teach us, can we imitate the oracle's teaching? In other words, can we guess the value of z_j for x_j?

A natural choice is to use the posterior p(z_j | x_j, θ^(t)) as a replacement for z_j. This term is the posterior probability given the sample x_j and the current parameter value θ^(t).[4] The posterior probability is the best educated guess we can make given the information that is at hand.

[4] As we will see, EM is an iterative process, in which the variable t is the iteration index. We will update the parameter θ in every iteration, and use θ^(t) to denote its value in the t-th iteration.

In this guessing game, we have at least two issues in our way.

First, an oracle is supposed to know everything, and will be able to tell us that x_7 comes from the third Gaussian component, with 100% confidence. If an oracle exists, we can simply say z_7 = 3 in this example. However, our guess will never be deterministic; it can at best be a probability distribution over the random variable z_j. Hence, we will assume that for every observed sample x_i, there is a corresponding hidden vector z_i, whose values can be guessed but cannot be observed. We still use Z to denote the underlying random variable, and use Z to denote the set of hidden vectors. In the GMM example, a vector z_j will have N dimensions, but one and only one of these dimensions will be 1, and all others will be 0.

Second, the guess we have about z_j is a distribution determined by the posterior p(z_j | x_j, θ^(t)). However, what we really want are values instead of a distribution. How are we going to use this guess? A common trick in statistical learning is to use its expectation. We will leave the details about how the expectation is used to later sections.

3 An informal description of the EM algorithm

Now we are ready to give an informal description of the EM algorithm.

- We first initialize the values of θ in any reasonable way;
- Then, we can estimate the best possible Z (the expectation of its posterior distribution) using X and the current θ estimate;
- With this Z estimate, we can find a better estimate of θ using X;
- A better θ (combined with X) will lead to a better guess of Z;
- This process (estimating θ and Z in alternating order) can proceed until the change in θ is small (i.e., the procedure converges).

In still more informal language, after proper initialization of the parameters, we can:

E-Step Find a better guess of the non-observable hidden variables, by using the data and the current parameter values;
M-Step Find a better parameter estimate, by using the current guess for the hidden variables and the data;
Repeat Repeat the above two steps until convergence.

In the EM algorithm, the first step is usually called the Expectation step, abbreviated as the E-step, while the second step is usually called the Maximization step, abbreviated as the M-step. The EM algorithm repeats E- and M-steps in alternating order. When the algorithm converges, we get the desired parameter estimates.

4 The Expectation-Maximization algorithm

Now we will show more details of the EM algorithm. Suppose we are dealing with two sets of random variables: the observed variables X and the hidden variables Z. The joint p.d.f. is p(X, Z; θ), where θ are the parameters. We are given a set of instances of X from which to learn the parameters, X = {x_1, x_2, ..., x_M}. The task is to estimate θ from X.

For every x_j, there is a corresponding z_j. We also want to clarify that θ now includes the parameters that are associated with Z. In the GMM example, the z_j are estimates for Z, {α_i, µ_i, Σ_i}_{i=1}^N are parameters specifying X, and θ includes both sets of parameters.

4.1 Jointly-non-concave incomplete log-likelihood

If we use the maximum likelihood (ML) estimation technique, the ML estimate for θ is

    \hat{\theta} = \arg\max_{\theta} p(X | \theta).    (11)

Equivalently, we can maximize the log-likelihood

    \hat{\theta} = \arg\max_{\theta} \ln p(X | \theta),    (12)

because ln(·) is a monotonically increasing function. Then, parameter estimation becomes an optimization problem. We will use the notation L(θ) to denote the log-likelihood, that is,

    L(\theta) = \ln p(X | \theta).    (13)

Recent developments in optimization tell us that we can generally consider a minimization problem easy if it is convex, while non-convex problems are usually difficult to solve. Equivalently, a concave maximization problem is generally considered easy, while non-concave maximization is usually difficult, because the negative of a convex function is concave, and vice versa.
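
The log-likelihood of Equation 13 is easy to evaluate even when it is hard to maximize. The following NumPy sketch (my own helper names) computes it for a GMM, i.e., Equation 15 below:

    import numpy as np

    def gaussian_pdf(X, mu, Sigma):
        """Density of N(mu, Sigma) evaluated at every row of X."""
        d = len(mu)
        diff = X - mu
        quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm

    def incomplete_log_likelihood(X, alphas, mus, Sigmas):
        """Equation 15: sum_j ln( sum_i alpha_i N(x_j; mu_i, Sigma_i) )."""
        mix = sum(a * gaussian_pdf(X, mu, S) for a, mu, S in zip(alphas, mus, Sigmas))
        return np.log(mix).sum()

The sum over components sits inside the logarithm, which is precisely what destroys concavity in the parameters; the complete data log-likelihood of the next subsection moves that sum outside.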

Unfortunately, the log-likelihood is non-concave in most cases. Taking the Gaussian mixture model as an example, the likelihood p(X|θ) is

    p(X|\theta) = \prod_{j=1}^{M} \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right).    (14)

The log-likelihood has the following form:

    \sum_{j=1}^{M} \ln\left( \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) \right).    (15)

This equation is non-concave with respect to the joint optimization variables {α_i, µ_i, Σ_i}_{i=1}^N. In other words, this is a difficult maximization problem.

We have two sets of random variables X and Z. The log-likelihood in Equation 15 is called the incomplete data log-likelihood because Z does not appear in it.

4.2 (Possibly) Concave complete data log-likelihood

The complete data log-likelihood is

    \ln p(X, Z | \theta).    (16)

Let us use GMM as an example once more. In GMM, the z_j vectors (which form Z) are N-dimensional vectors with N − 1 zeros and only one dimension with value 1. Hence, the complete data likelihood is

    p(X, Z | \theta) = \prod_{j=1}^{M} \prod_{i=1}^{N} \left[ \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) \right]^{z_{ji}}.    (17)

This equation can be explained using the two-step sampling process. Let us assume x_j is generated by the i-th Gaussian component. Then we know that z_{ji} = 1, and z_{ji'} = 0 for any i' ≠ i. In other words, the term inside [·] will equal 1 for the N − 1 components whose z value is 0, and the remaining one entry will evaluate to α_i N(x_j; µ_i, Σ_i), which exactly matches the two-step sampling procedure.[5]

Then, the complete data log-likelihood is

    \sum_{j=1}^{M} \sum_{i=1}^{N} z_{ji} \left( \frac{1}{2} \left( \ln|\Sigma_i^{-1}| - (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) + \ln\alpha_i \right) + \text{const}.    (18)

[5] The first step has probability α_i, and the second step has density N(x; µ_i, Σ_i). These two steps are independent of each other, hence the product rule applies.
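
For a one-hot Z, Equation 18 is also straightforward to evaluate. The sketch below (names mine; the additive constant is written out explicitly as −Md/2 · ln(2π)) computes the complete data log-likelihood:

    import numpy as np

    def complete_log_likelihood(X, Z_onehot, alphas, mus, Sigmas):
        """ln p(X, Z | theta) for a GMM, cf. Equation 18.

        X: (M, d) samples; Z_onehot: (M, N) one-hot matrix with Z_onehot[j, i] = z_ji.
        """
        M, d = X.shape
        total = -0.5 * M * d * np.log(2 * np.pi)          # the constant term of Equation 18
        for i, (a, mu, Sigma) in enumerate(zip(alphas, mus, Sigmas)):
            inv = np.linalg.inv(Sigma)
            diff = X - mu
            quad = np.einsum('nd,de,ne->n', diff, inv, diff)
            logdet_inv = np.linalg.slogdet(inv)[1]        # ln |Sigma_i^{-1}|
            per_sample = 0.5 * (logdet_inv - quad) + np.log(a)
            total += (Z_onehot[:, i] * per_sample).sum()
        return total

Unlike Equation 15, each α_i, µ_i, Σ_i term is now outside the logarithm, which is what makes the sub-problems discussed next tractable.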

Let us consider the scenario when the hidden variable z_j is known, but α_i, µ_i and Σ_i are unknown. Here we suppose Σ_i is invertible for 1 ≤ i ≤ N. Instead of considering the parameters (µ_i, Σ_i), we consider (µ_i, Σ_i^{-1}).[6] It is well known that the log-determinant function ln|·| is concave. It is also easy to prove that the quadratic term (x_j − µ_i)^T Σ_i^{-1} (x_j − µ_i) is jointly convex with respect to the variables (µ_i, Σ_i^{-1}), which directly implies that its negative is concave.[7] Hence, this sub-problem can be efficiently solved.

From this optimization perspective, we can understand the EM algorithm from a different point of view. Although the original maximum likelihood parameter estimation problem is difficult to solve (jointly non-concave), the EM algorithm can usually (but not always) produce concave subproblems, hence becoming efficiently solvable.

4.3 The general EM derivation

Now we talk about EM in the general sense. We have observable variables X and samples X. We also have hidden variables Z and unobservable samples Z. The overall system parameters are denoted by θ. The parameter learning problem tries to find optimal parameters ˆθ by maximizing the incomplete data log-likelihood

    \hat{\theta} = \arg\max_{\theta} \ln p(X | \theta).    (19)

We assume Z is discrete, and hence

    p(X | \theta) = \sum_{Z} p(X, Z | \theta).    (20)

However, this assumption is mainly for notational simplicity. If Z is continuous, we can replace the summation with an integral.

Although we have mentioned previously that we can use the posterior of Z, i.e., p(Z|X, θ), as our guess, it is also interesting to observe what will happen to the complete data likelihood if we use an arbitrary distribution for Z (and hence to understand why the posterior is special and why we should use it). Let q be any valid probability distribution for Z. We can measure how different q is from the posterior using the classic Kullback-Leibler (KL) divergence measure,

    KL(q \| p) = -\sum_{Z} q(Z) \ln\left( \frac{p(Z|X, \theta)}{q(Z)} \right).    (21)

Probability theory tells us that

    p(X | \theta) = \frac{p(X, Z | \theta)}{p(Z | X, \theta)}    (22)
                  = \frac{p(X, Z | \theta)}{q(Z)} \cdot \frac{q(Z)}{p(Z | X, \theta)}.    (23)

[6] It is more natural to understand this choice as using the canonical parameterization of a normal distribution. Please refer to my note on properties of normal distributions.

[7] For knowledge about convexity, please refer to the book Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge University Press. The PDF version of this book is available at http://stanford.edu/~boyd/cvxbook/.

Hence,

    \ln p(X|\theta) = \left( \sum_{Z} q(Z) \right) \ln p(X|\theta)    (24)
                    = \sum_{Z} q(Z) \ln p(X|\theta)    (25)
                    = \sum_{Z} q(Z) \ln\left( \frac{p(X, Z|\theta)}{q(Z)} \cdot \frac{q(Z)}{p(Z|X, \theta)} \right)    (26)
                    = \sum_{Z} q(Z) \left( \ln\frac{p(X, Z|\theta)}{q(Z)} - \ln\frac{p(Z|X, \theta)}{q(Z)} \right)    (27)
                    = \sum_{Z} q(Z) \ln\frac{p(X, Z|\theta)}{q(Z)} + KL(q \| p)    (28)
                    = L(q, \theta) + KL(q \| p).    (29)

We have decomposed the incomplete data log-likelihood into two terms. The first term is L(q, θ), defined as

    L(q, \theta) = \sum_{Z} q(Z) \ln\frac{p(X, Z|\theta)}{q(Z)}.    (30)

The second term is the KL-divergence between q and the posterior,

    KL(q \| p) = -\sum_{Z} q(Z) \ln\left( \frac{p(Z|X, \theta)}{q(Z)} \right),    (31)

which was copied from Equation 21.

There are some nice properties of the KL-divergence. For example,

    KL(q \| p) \geq 0    (32)

always holds, and the equality sign is attained if and only if q = p.[8] One direct consequence of this property is that

    L(q, \theta) \leq \ln p(X|\theta)    (33)

always holds, and

    L(q, \theta) = \ln p(X|\theta) \ \text{if and only if}\ q(Z) = p(Z|X, \theta).    (34)

In other words, we have found a lower bound of ln p(X|θ). Hence, in order to maximize ln p(X|θ), we can perform two steps.

[8] For more properties of the KL-divergence, please refer to the book Elements of Information Theory by Thomas M. Cover and Joy A. Thomas, John Wiley & Sons, Inc.
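
Before describing the two steps, the decomposition in Equation 29 can be checked numerically on a tiny example. The sketch below (a toy 1-D, two-component GMM with a single observation; all names and values are mine) verifies that ln p(x|θ) equals L(q, θ) + KL(q‖p) for an arbitrary q, and that L(q, θ) never exceeds ln p(x|θ):

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy 1-D GMM: p(x, z = i | theta) = alpha_i * N(x; mu_i, sigma_i^2).
    alphas = np.array([0.2, 0.8])
    mus, sigmas = np.array([10.0, 30.0]), np.array([4.0, 7.0])

    def joint(x):
        """p(x, z = i | theta) for all components i at once."""
        return alphas * np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))

    x = 17.0
    p_xz = joint(x)
    p_x = p_xz.sum()                        # incomplete-data likelihood p(x | theta)
    posterior = p_xz / p_x                  # p(z | x, theta)

    q = rng.dirichlet([1.0, 1.0])           # an arbitrary valid distribution q(z)
    L_q = np.sum(q * np.log(p_xz / q))      # L(q, theta), Equation 30
    KL  = np.sum(q * np.log(q / posterior)) # KL(q || p), Equation 31

    print(np.log(p_x), L_q + KL)            # the two numbers agree (Equation 29)
    print(L_q <= np.log(p_x))               # the lower bound of Equation 33 holds

Setting q equal to the posterior makes the KL term vanish and the bound tight, which is exactly the E-step described next.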

The first step is to make the lower bound L(q, θ) equal to ln p(X|θ). As aforementioned, we know the equality holds if and only if ˆq(Z) = p(Z|X, θ). Now we have

    \ln p(X|\theta) = L(\hat{q}, \theta),    (35)

and L only depends on θ now. This is the Expectation step (E-step) in the EM algorithm.

In the second step, we can maximize L(ˆq, θ) with respect to θ. Since ln p(X|θ) = L(ˆq, θ), an increase of L(ˆq, θ) also means an increase of the log-likelihood ln p(X|θ). And, because we are maximizing L(ˆq, θ) in this step, the log-likelihood will always increase if we are not already at a local maximum of the log-likelihood. This is the Maximization step (M-step) in the EM algorithm.

4.4 The E- & M-steps

In the E-step, we already know that we should set

    \hat{q}(Z) = p(Z|X, \theta),    (36)

which is straightforward (at least in its mathematical form). Then, how shall we maximize L(ˆq, θ)? We can substitute ˆq into the definition of L, and find the optimal θ that maximizes L after plugging in ˆq. However, note that ˆq involves θ too. Hence, we need some more notation.

Suppose we are in the t-th iteration. In the E-step, ˆq is computed using the current parameters, as

    \hat{q}(Z) = p(Z|X, \theta^{(t)}).    (37)

Then, L becomes

    L(\hat{q}, \theta) = \sum_{Z} \hat{q}(Z) \ln\frac{p(X, Z|\theta)}{\hat{q}(Z)}    (38)
                       = \sum_{Z} \hat{q}(Z) \ln p(X, Z|\theta) - \sum_{Z} \hat{q}(Z) \ln \hat{q}(Z)    (39)
                       = \sum_{Z} p(Z|X, \theta^{(t)}) \ln p(X, Z|\theta) + \text{const},    (40)

in which const = −Σ_Z ˆq(Z) ln ˆq(Z) does not involve the variable θ, hence it can be ignored. The remaining term is in fact an expectation, which we denote as Q(θ, θ^(t)):

    Q(\theta, \theta^{(t)}) = \sum_{Z} p(Z|X, \theta^{(t)}) \ln p(X, Z|\theta)    (41)
                            = \mathbb{E}_{Z|X, \theta^{(t)}} [\ln p(X, Z|\theta)].    (42)

That is, in the E-step, we compute the posterior of Z. In the M-step, we compute the expectation of the complete data log-likelihood ln p(X, Z|θ)

with respect to the posterior distribution p(Z|X, θ^(t)), and we maximize this expectation to get a better parameter estimate:

    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)}) = \arg\max_{\theta} \mathbb{E}_{Z|X, \theta^{(t)}} [\ln p(X, Z|\theta)].    (43)

Thus, three computations are involved in EM: 1) posterior, 2) expectation, 3) maximization. We treat 1) as the E-step, and 2)+3) as the M-step. Some researchers prefer to treat 1)+2) as the E-step, and 3) as the M-step. However, no matter how the computations are attributed to different steps, the EM algorithm does not change.

4.5 The EM algorithm

Now we are ready to write down the EM algorithm.

Algorithm 1 The Expectation-Maximization Algorithm
1: t ← 0
2: Initialize the parameters to θ^(0)
3: The E(xpectation)-step: Find p(Z|X, θ^(t))
4: The M(aximization)-step.1: Find the expectation
       Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z|X, \theta^{(t)}} [\ln p(X, Z|\theta)]    (44)
5: The M(aximization)-step.2: Find a new parameter estimate
       \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})    (45)
6: t ← t + 1
7: If the log-likelihood has not converged, go to the E-step again (Line 3)
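
Algorithm 1 maps onto a short driver loop. The sketch below is generic and not tied to any particular model: e_step, m_step, and log_likelihood are hypothetical callables standing in for the model-specific computations (their names and signatures are my own):

    def em(theta0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
        """A generic sketch of Algorithm 1.

        e_step(theta)         -> posterior of Z given the data and theta     (Line 3)
        m_step(posterior)     -> argmax_theta Q(theta, theta_t)              (Lines 4-5)
        log_likelihood(theta) -> ln p(X | theta), used for the stopping test (Line 7)
        """
        theta, prev = theta0, -float('inf')
        for t in range(max_iter):
            posterior = e_step(theta)       # E-step
            theta = m_step(posterior)       # M-step (expectation + maximization)
            ll = log_likelihood(theta)
            if ll - prev < tol:             # stop when the log-likelihood no longer increases
                break
            prev = ll
        return theta

Section 5 instantiates exactly this loop for the GMM, where the posterior is the matrix of responsibilities γ and the M-step is available in closed form.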

Now because at the t 1)-th teraton we have θ t) = arg max Lˆq t 1), θ), 48) θ Lˆq t 1), θ t) ) Lˆq t 1), θ t 1) ). 49) Smlarly, at the t-th teraton, based on Equaton 33 and Equaton 35, we have Lˆq t 1), θ t) ) ln px θ t) ) = Lˆq t), θ t) ). 50) Puttng these equatons together, we get ln px θ t) ) = Lˆq t), θ t) ) [Use 46)] 51) Lˆq t 1), θ t) ) [Use 50)] 52) Lˆq t 1), θ t 1) ) [Use 49)] 53) = ln px θ t 1) ). [Use 47)] 54) Hence, EM wll converge to a local mnmum of the lkelhood. However, the analyss of ts convergence rate s very complex and beyond the scope of ths ntroductory note. 5 EM for GMM Now we can apply the EM algorthm to GMM. The frst thng s to compute the posteror. Usng the Bayes theorem, we have pz j x j, θ t) ) = px j, z j θ t) ), 55) px j θ t) ) n whch z j can be 0 or 1, and z j = 1 s true f and only f x j s generated by the -th Gaussan component. Next, we wll compute the Q functon, whch s the expectaton of the complete data log-lkelhood ln px, Z θ) wth respect to the posteror dstrbuton we just found. The GMM complete data log-lkelhood was already computed n Equaton 18. For easer reference, we copy ths equaton here: M j=1 N 1 z j 2 ln Σ 1 x j µ ) T Σ 1 x j µ ) ) ) + ln α + const, 56) The expectaton of Equaton 56 wth respect to Z s M j=1 N 1 γ j 2 ln Σ 1 x j µ ) T Σ 1 x j µ ) ) ) + ln α, 57) 13

where the constant term s gnored and γ j s the expectaton of z j x j, θ t). In other words, we need to compute the expectaton of the condtonal dstrbuton defned by Equaton 55. In Equaton 55, the denomnator does not depend on Z, and px j θ t) ) equals N Nx j ; µ t), Σ t) ). For the numerator, we can drectly compute αt) ts expectaton, as E [ ] px j, z j θ t) ) [ ] = E pz j θ t) )px j z j, θ t) ). 58) Note that when z j = 0, we always have px j z j, θ t) ) = 0. Thus, [ ] E pz j θ t) )px j z j, θ t) ) = Prz j = 1)px j µ t), Σ t) ) 59) or, Hence, we have [ γ j = E = α t) z j x j, θ t)] α t) [ γ j = E z j x j, θ t)] = α t) N k=1 αt) k Nx j ; µ t), Σ t) ). 60) Nx j ; µ t), Σ t) ), 61) Nx j ; µ t), Σ t) )) Nx j; µ t) k, Σt) k ) 62) for 1 N, 1 j M. After γ j s computed, Equaton 57 s completely specfed. We start the optmzaton from α. Because there s a constrant that N α = 1, we use the Lagrange multpler method, remove rrelevant terms, and get M N N ) γ j ln α + λ α 1. 63) j=1 Settng the dervatve to 0 gves us that for any 1 N, M j=1 γ j α + λ = 0 64) M j=1 or, α = γj. Because N α = 1, we know that λ = M N j=1 γ j. Hence, α = λ M j=1 γj M N. j=1 γj For notatonal smplcty, we defne j=1 m = j=1 M γ j. 65) j=1 From the defnton of γ j, t s easy to prove that N N M M N ) M m = γ j = γ j = 1 = M. 66) j=1 14

Then, we get the updatng rule for α : α t+1) = m M. 67) Furthermore, usng smlar steps n dervng the sngle Gaussan equatons, 9 t s easy to show that for any 1 N, M µ t+1) j=1 = γ jx j, 68) m Σ t+1) = M j=1 γ j x j µ t+1) ) ) T x j µ t+1). 69) m Puttng these results together, we have the complete set of updatng rules for GMM. If at teraton t, the parameter are estmated as α t), µ t), and Σ t) for 1 N, the EM algorthm updates these parameters as for 1 N, 1 j M) Exercses γ j = m = α Nx j ; µ t), Σ t) )) Nx j; µ t) k N k=1 αt) k, Σt) k ), 70) M γ j, 71) j=1 M µ t+1) j=1 = γ jx j, 72) m Σ t+1) = M j=1 γ j x j µ t+1) ) ) T x j µ t+1). 73) m 1. Derve the updatng equatons for Gaussan Mxture Models by yourself. You should not refer to Secton 5 durng your dervaton. If you have just fnshed readng Secton 5, wat for at least 2 to 3 hours before workng on ths problem. 2. In ths problem, we wll use the Expectaton-Maxmzaton method to learn parameters n a hdden Markov model HMM). As wll be shown n ths problem, the Baum-Welch algorthm s ndeed performng EM updates. To work out the soluton for ths problem, you wll also need knowledge and facts learned n the HMM and nformaton theory notes. We wll use the notatons n the HMM note. For your convenence, the notatons are repeated as follows. 9 Please refer to my note on propertes of normal dstrbutons. 15

There are N dscrete states, denoted by symbols S 1, S 2,..., S N. There are M output dscrete symbols, denoted by V 1, V 2,..., V M. Assumng one sequence wth T tme steps, whose hdden state s Q t and whose observed output s O t at tme t 1 t T ). We use q t and o t to denote the ndexes for state and output symbols at tme t, respectvely,.e., Q t = S qt and O t = V ot. The notaton 1 : t denotes all the ordered tme steps between 1 and t. For example, o 1:T s the sequence of all observed output symbols. An HMM has parameters λ = π, A, B), where π R N specfes the ntal state dstrbuton, A R N N s the state transton matrx, and B R N M s the observaton probablty matrx. Note that A j = PrQ t = S j Q t 1 = S ) and b j k) = PrO t = V k Q t = S j ) are elements of A and B, respectvely. In ths problem, we use a varable r to denote the ndex of EM teratons. Hence, λ 1) are the ntal parameters. Varous probabltes have been defned n the HMM note, denoted by α t ), β t ), γ t ), δ t ) and ξ t, j). In ths problem, we assume that at the r-th teratons, λ r) are known and these probabltes are computed usng λ r). The purpose of ths problem s to use the EM algorthm to fnd λ r+1) usng a tranng sequence o 1:T and λ r), by treatng Q and O as the hdden and observed random varables, respectvely. a) Suppose the hdden varables can be observed as S q1, S q2,..., S qt. Show that the complete data log-lkelhood s T 1 ln π q1 + ln A qtq t+1 + t=1 T ln b qt o t ). 74) b) The expectaton of Equaton 74 wth respect to the hdden varables Q t condtoned on o 1:T and λ r) ) forms an auxlary functon Qλ, λ r ) the E-step). Show that the expectaton of the frst term n Equaton 74 equals N γ 1) ln π,.e., E Q1:T [ln π Q1 ] = t=1 N γ 1 ) ln π. 75) c) Because the parameter π only hnges on Equaton 75, the update rule for π can be found by maxmzng ths equaton. Prove that we should set π r+1) = γ 1 ) n the M-step. Note that γ 1 ) s computed usng λ r) 16

as parameter values. Hnt: The rght hand sde of Equaton 75 s related to the cross entropy.) d) The second part of the E-step calculates the expectaton of the mddle term n Equaton 74. Show that [ T 1 ] N N T 1 ) E Q1:T ln A qtq t+1 = ξ t, j) ln A j. 76) t=1 j=1 t=1 e) For the M-step relevant to A, prove that we should set T 1 A r+1) t=1 j = ξ t, j) T 1 t=1 γ t). 77) f) The fnal part of the E-step calculates the expectaton of the last term n Equaton 74. Show that [ T ] N M T E Q1:T ln b qt o t ) = o t = k γ t j). 78) t=1 j=1 k=1 t=1 g) For the M-step relevant to B, prove that we should set T b r+1) t=1 j k) = o t = k γ t j) T t=1 γ, 79) tj) n whch s the ndcator functon. h) Are these results obtaned usng EM the same as those n the Baum- Welch? 17