Lecture 10: Expectation-Maximization Algorithm


ECE 645: Estimation Theory, Spring 2015
Instructor: Prof. Stanley H. Chan
Lecture 10: Expectation-Maximization Algorithm
(LaTeX prepared by Shaobo Fang)
May 4, 2015

This lecture note is based on ECE 645 (Spring 2015) by Prof. Stanley H. Chan in the School of Electrical and Computer Engineering at Purdue University.

1 Motivation

Consider a set of data points with their classes labeled, and assume that each class is Gaussian, as shown in Figure 1(a). Given this set of data points, finding the means of the two Gaussians can be done easily by estimating the sample mean of each class, as the class labels are known. Now imagine that the classes are not labeled, as shown in Figure 1(b). How should we determine the mean of each class then?

To solve this problem, we could use an iterative approach: first make a guess of the class label for each data point, then compute the means, and then update the guess of the class labels again. We repeat until the means converge. The problem of estimating parameters in the absence of labels is known as unsupervised learning. There are many unsupervised learning methods; we will focus on the Expectation-Maximization (EM) algorithm.

[Figure 1: two scatter plots of the same two-class data, labelled ("Class 1", "Class 2") on the left and unlabelled on the right. Caption: Estimation of parameters becomes trivial given the labelled classes.]

2 The EM Algorithm

Notation:

1. $Y$, $y$: observations. $Y$ is a random variable; $y$ is a realization of $Y$.
2. $X$, $x$: complete data.
3. $Z$, $z$: missing data. Note that $X = (Y, Z)$.
4. $\theta$: unknown deterministic parameter; $\theta^{(t)}$: the $t$-th estimate of $\theta$ in the EM iteration.
5. $f(y \mid \theta)$ is the distribution of $Y$ given $\theta$.

6. $f(X \mid \theta)$ is a random variable taking the value $f(x \mid \theta)$. (Remember: $f(\cdot \mid \theta)$ is a function, so we can put any argument into $f(\cdot \mid \theta)$ and evaluate its output.)
7. $E_{X \mid y,\theta}[g(X)] = \int g(x)\, f_{X \mid y,\theta}(x \mid y,\theta)\, dx$ is the conditional expectation of $g(X)$ given $Y = y$ and $\theta$.
8. $l(\theta) = \log f(y \mid \theta)$ is the log-likelihood. Note that $l(\theta)$ depends on $y$.

EM Steps. The EM algorithm consists of two steps:

1. E-step: Given $y$, and pretending for the moment that $\theta^{(t)}$ is correct, formulate the distribution of the complete data $x$: $f(x \mid y, \theta^{(t)})$. Then calculate the Q-function
$$Q(\theta \mid \theta^{(t)}) \stackrel{\text{def}}{=} E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta)] = \int \log f(x \mid \theta)\, f(x \mid y, \theta^{(t)})\, dx.$$
2. M-step: Maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$:
$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}).$$

Properties of $Q(\theta \mid \theta^{(t)})$:

1. Ideally, if we had the distribution of the complete data $x$, then finding the parameter could be done by maximizing $f(x \mid \theta)$. However, the complete data is only a virtual object we created to solve the problem; in reality we never know $x$. All we know is its distribution $f(x \mid y, \theta^{(t)})$, which summarizes what we know about $x$. One way to handle this uncertainty is to average over it; this average is the Q-function.
2. Another way of looking at $Q(\theta \mid \theta^{(t)})$: we can treat $\log f(X \mid \theta)$ as a function of two variables, $h(X, \theta)$. Maximizing over $\theta$ is problematic because the objective depends on $X$. By taking the expectation $E_X[h(X, \theta)]$ we eliminate the dependency on $X$.
3. $Q(\theta \mid \theta^{(t)})$ can be thought of as a local approximation of the log-likelihood $l(\theta)$. Here, by local we mean that $Q(\theta \mid \theta^{(t)})$ stays close to the previous estimate $\theta^{(t)}$. In fact, if $Q(\theta \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$, then $l(\theta) \geq l(\theta^{(t)})$.
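
To make the two steps concrete, here is a minimal, generic MATLAB sketch of the EM loop (not from the original notes). The helper Qfunction, the data y, the initial guess theta0, and the iteration count T are assumed placeholders that a specific model would have to supply; fminsearch merely stands in for whatever M-step maximizer the model admits (often a closed form, as in the examples that follow).

% Generic EM iteration sketch (illustrative only).
% Qfunction(theta_new, y, theta_old) is an assumed, model-specific helper
% that returns Q(theta_new | theta_old); it is NOT defined in these notes.
theta = theta0;                                  % initial guess theta^(0)
for t = 1:T
    Qfun  = @(th) Qfunction(th, y, theta);       % E-step: form Q(. | theta^(t))
    theta = fminsearch(@(th) -Qfun(th), theta);  % M-step: maximize Q with a generic optimizer
end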

3 Estimating a Mean with Partial Observations

Let us consider a first example of the EM algorithm. Suppose that we generate a sequence of $n$ random variables $Y_i \sim \mathcal{N}(\theta, \sigma^2)$ for $i = 1, \ldots, n$. Imagine that we have only observed $Y = [Y_1, Y_2, \ldots, Y_m]$ where $m < n$. How should we estimate $\theta$ based on $Y$? Intuitively, the estimate of $\theta$ should be the sample mean of the $m$ observations,
$$\widehat{\theta} = \frac{1}{m}\sum_{i=1}^{m} Y_i.$$
However, in this example we would like to derive the EM algorithm and see whether it matches this intuition.

Solution: To start the EM algorithm, we first need to specify the missing data and the complete data. In this problem, the missing data is $Z = [Y_{m+1}, \ldots, Y_n]$, and the complete data is $X = [Y, Z]$. The distribution of $X$ satisfies
$$\log f(x \mid \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n}\frac{(Y_i - \theta)^2}{2\sigma^2}. \tag{1}$$

Therefore, the Q-function is
$$Q(\theta \mid \theta^{(t)}) \stackrel{\text{def}}{=} E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta)] = E_{X \mid y,\theta^{(t)}}\Big[-\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{n}\frac{(Y_i-\theta)^2}{2\sigma^2}\Big]$$
$$= -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m}\frac{(y_i-\theta)^2}{2\sigma^2} - \sum_{i=m+1}^{n}\frac{E_{Y_i \mid y,\theta^{(t)}}[(Y_i-\theta)^2]}{2\sigma^2}.$$
The last expectation can be evaluated as
$$E_{Y_i \mid y,\theta^{(t)}}[(Y_i-\theta)^2] = E_{Y_i \mid y,\theta^{(t)}}[Y_i^2 - 2Y_i\theta + \theta^2] = (\theta^{(t)})^2 + \sigma^2 - 2\theta^{(t)}\theta + \theta^2.$$
Therefore, the Q-function is
$$Q(\theta \mid \theta^{(t)}) = -\frac{n}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{m}\frac{(y_i-\theta)^2}{2\sigma^2} - \frac{n-m}{2\sigma^2}\big[(\theta^{(t)})^2 + \sigma^2 - 2\theta^{(t)}\theta + \theta^2\big].$$
In the M-step, we need to maximize the Q-function. To this end, we set
$$\frac{\partial}{\partial\theta}Q(\theta \mid \theta^{(t)}) = 0,$$
which yields
$$\theta^{(t+1)} = \frac{\sum_{i=1}^{m} y_i + (n-m)\theta^{(t)}}{n}.$$
It is not difficult to show that as $t \to \infty$, $\theta^{(t)} \to \theta^{(\infty)}$ with
$$\theta^{(\infty)} = \frac{1}{n}\sum_{i=1}^{m} y_i + \Big(1 - \frac{m}{n}\Big)\theta^{(\infty)},$$
which yields
$$\theta^{(\infty)} = \frac{1}{m}\sum_{i=1}^{m} y_i.$$
This result says that as the EM algorithm converges, the estimated parameter converges to the sample mean of the available $m$ samples, which is quite intuitive.
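
As a quick numerical sanity check of this derivation (not part of the original notes; the names and values below are illustrative), the following MATLAB snippet iterates the update $\theta^{(t+1)} = (\sum_{i=1}^{m} y_i + (n-m)\theta^{(t)})/n$ and compares the limit with the sample mean of the observed data.

% EM for a partially observed Gaussian mean: a minimal sanity check.
n = 100; m = 40; sigma = 2; theta_true = 3;
y = theta_true + sigma*randn(m,1);     % only the first m samples are observed
theta = 0;                             % arbitrary initialization theta^(0)
for t = 1:200
    theta = (sum(y) + (n-m)*theta)/n;  % M-step in closed form
end
disp([theta, mean(y)])                 % the two numbers agree at convergence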

4 Gaussian Mixture with Known Mean and Variance

Our next example of the EM algorithm is to estimate the mixture weights of a Gaussian mixture with known means and variances. A Gaussian mixture is defined as
$$f(y \mid \theta) = \sum_{i=1}^{k}\theta_i\,\mathcal{N}(y \mid \mu_i, \sigma_i^2), \tag{2}$$
where $\theta = [\theta_1, \ldots, \theta_k]$ is called the mixture weight. The mixture weight satisfies the condition
$$\sum_{i=1}^{k}\theta_i = 1.$$
Our goal is to derive the EM algorithm for $\theta$.

Solution: We first need to define the missing data. For this problem, the observed data is $Y = [y_1, y_2, \ldots, y_n]$. The missing data can be defined as the label for each $y_j$, so that $Z = [Z_1, Z_2, \ldots, Z_n]$ with $Z_j \in \{1, \ldots, k\}$. Consequently, the complete data is $X = [X_1, X_2, \ldots, X_n]$, where $X_j = (y_j, Z_j)$. The distribution of the complete data can be computed as
$$f(x_j \mid \theta) = f(y_j, z_j \mid \theta) = \theta_{z_j}\,\mathcal{N}(y_j \mid \mu_{z_j}, \sigma_{z_j}^2).$$
Thus, the Q-function is
$$Q(\theta \mid \theta^{(t)}) = E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta)] = E_{Z \mid y,\theta^{(t)}}[\log f(Z, y \mid \theta)]$$
$$= E_{Z \mid y,\theta^{(t)}}\Big[\sum_{j=1}^{n}\log\big(\theta_{Z_j}\mathcal{N}(y_j \mid \mu_{Z_j}, \sigma_{Z_j}^2)\big)\Big] = \sum_{j=1}^{n} E_{Z_j \mid y_j,\theta^{(t)}}\big[\log\theta_{Z_j} + \log\mathcal{N}(y_j \mid \mu_{Z_j}, \sigma_{Z_j}^2)\big].$$
The first expectation can be evaluated as
$$E_{Z_j \mid y_j,\theta^{(t)}}[\log\theta_{Z_j}] = \sum_{i=1}^{k}\log\theta_i\,\underbrace{P(Z_j = i \mid y_j, \theta^{(t)})}_{\stackrel{\text{def}}{=}\,\gamma_j^{(t)}(i)}, \qquad \gamma_j^{(t)}(i) = \frac{\theta_i^{(t)}\,\mathcal{N}(y_j \mid \mu_i, \sigma_i^2)}{\sum_{l=1}^{k}\theta_l^{(t)}\,\mathcal{N}(y_j \mid \mu_l, \sigma_l^2)}.$$
By summing over all $j$'s, we can further define
$$\gamma^{(t)}(i) \stackrel{\text{def}}{=} \sum_{j=1}^{n}\gamma_j^{(t)}(i).$$
Therefore, the Q-function becomes
$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{k}\sum_{j=1}^{n}\gamma_j^{(t)}(i)\log\theta_i + C = \sum_{i=1}^{k}\gamma^{(t)}(i)\log\theta_i + C,$$
for some constant $C$ independent of $\theta$. Maximizing over $\theta$ yields
$$\theta^{(t+1)} = \arg\max_{\theta}\sum_{i=1}^{k}\gamma^{(t)}(i)\log\theta_i, \qquad \text{so that} \qquad \theta_i^{(t+1)} = \frac{\gamma^{(t)}(i)}{\sum_{l=1}^{k}\gamma^{(t)}(l)},$$
where the last equality is due to Gibbs' inequality. To summarize, the EM algorithm is given in the box below.

Data: Gaussian mixture observations with known means and variances
Result: estimated $\theta$
for $t = 1, \ldots$ do
    $\gamma_j^{(t)}(i) = \dfrac{\theta_i^{(t)}\,\mathcal{N}(y_j \mid \mu_i, \sigma_i^2)}{\sum_{l=1}^{k}\theta_l^{(t)}\,\mathcal{N}(y_j \mid \mu_l, \sigma_l^2)}$
    $\theta_i^{(t+1)} = \dfrac{1}{n}\sum_{j=1}^{n}\gamma_j^{(t)}(i)$
end

Remark: To solve $\arg\max_{\theta}\sum_i\gamma^{(t)}(i)\log\theta_i$, we use Gibbs' inequality. Gibbs' inequality states that for all $\alpha_i$ and $\beta_i$ such that $\sum_i\alpha_i = 1$, $\sum_i\beta_i = 1$, $0 \leq \alpha_i \leq 1$ and $0 \leq \beta_i \leq 1$, it holds that
$$\sum_i\alpha_i\log\beta_i \leq \sum_i\alpha_i\log\alpha_i, \tag{3}$$
with equality when $\alpha_i = \beta_i$ for all $i$. The proof of Gibbs' inequality follows from the non-negativity of the KL divergence, which we will skip. What we want to show is that if we let
$$\alpha_i = \frac{\gamma^{(t)}(i)}{\sum_l\gamma^{(t)}(l)}, \qquad \beta_i = \theta_i,$$
then equality holds when
$$\theta_i = \frac{\gamma^{(t)}(i)}{\sum_l\gamma^{(t)}(l)},$$
which is the result we want.
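
As an illustration, the following MATLAB sketch implements the two updates inside the loop above for a two-component mixture with known means and variances. The data, the true weights, and the variable names are made up for this example; they are not part of the original notes.

% EM for mixture weights with known component means/variances (k = 2 sketch).
gauss = @(x,m,s) exp(-(x-m).^2./(2*s.^2)) ./ (sqrt(2*pi)*s);   % N(x | m, s^2)
mu = [0 4]; sg = [1 1];                       % known means and standard deviations
y  = [randn(300,1); 4 + randn(700,1)];        % synthetic data; true weights are [0.3 0.7]
n  = numel(y); k = numel(mu);
theta = ones(1,k)/k;                          % uniform initial mixture weights
for t = 1:100
    G = zeros(n,k);                           % E-step: responsibilities gamma_j^(t)(i)
    for i = 1:k
        G(:,i) = theta(i) * gauss(y, mu(i), sg(i));
    end
    G = G ./ repmat(sum(G,2), 1, k);
    theta = sum(G,1) / n;                     % M-step: theta_i^(t+1) = (1/n) sum_j gamma_j^(t)(i)
end
disp(theta)                                   % should be close to [0.3 0.7]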

5 Gaussian Mixture

Previously we worked on a Gaussian mixture with known means and variances. Most of the time, however, neither the means nor the variances are available to us. Thus, we are interested in deriving an EM algorithm that applies to a general Gaussian mixture model with only the observations available. Recall that a Gaussian mixture is defined as
$$f(y \mid \theta) = \sum_{i=1}^{k}\pi_i\,\mathcal{N}(y \mid \mu_i, \Sigma_i), \tag{4}$$
where $\theta \stackrel{\text{def}}{=} \{(\pi_i, \mu_i, \Sigma_i)\}_{i=1}^{k}$ is the parameter, with $\sum_{i=1}^{k}\pi_i = 1$. Our goal is to derive the EM algorithm for learning $\theta$.

Solution: We first specify the following data:
Observed data: $Y = [Y_1, \ldots, Y_n]$ with realizations $y = [y_1, \ldots, y_n]$;
Missing data: $Z = [Z_1, \ldots, Z_n]$ with realizations $z = [z_1, \ldots, z_n]$, where $z_j \in \{1, \ldots, k\}$;
Complete data: $X = [X_1, \ldots, X_n]$ with realizations $x = [x_1, \ldots, x_n]$ and $x_j = (y_j, z_j)$.
Accordingly, the distribution of the complete data is
$$f(y_j, z_j \mid \theta) = \pi_{z_j}\,\mathcal{N}(y_j \mid \mu_{z_j}, \Sigma_{z_j}).$$
Therefore, we can show that
$$\gamma_j^{(t)}(i) \stackrel{\text{def}}{=} P(Z_j = i \mid y_j, \theta^{(t)}) = \frac{\pi_i^{(t)}\,\mathcal{N}(y_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_{l=1}^{k}\pi_l^{(t)}\,\mathcal{N}(y_j \mid \mu_l^{(t)}, \Sigma_l^{(t)})}.$$
The Q-function is
$$Q(\theta \mid \theta^{(t)}) = E_{X \mid y,\theta^{(t)}}\{\log f(X \mid \theta)\} = E_{Z \mid y,\theta^{(t)}}\{\log f(Z, y \mid \theta)\}$$
$$= \sum_{j=1}^{n} E_{Z_j \mid y_j,\theta^{(t)}}\Big\{\log\pi_{Z_j} - \frac{1}{2}\log|\Sigma_{Z_j}| - \frac{1}{2}(y_j - \mu_{Z_j})^T\Sigma_{Z_j}^{-1}(y_j - \mu_{Z_j})\Big\} + C$$
$$= \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\Big\{\log\pi_i - \frac{1}{2}\log|\Sigma_i| - \frac{1}{2}(y_j - \mu_i)^T\Sigma_i^{-1}(y_j - \mu_i)\Big\} + C,$$
where $C$ is a constant independent of $\theta$.

The maximization step is to solve the following optimization problem:
$$\text{maximize}_{\theta}\ Q(\theta \mid \theta^{(t)}) \quad \text{subject to}\ \sum_{i=1}^{k}\pi_i = 1,\ \pi_i > 0,\ \Sigma_i \succeq 0. \tag{5}$$
For $\pi_i$, the maximization is
$$\text{maximize}_{\pi}\ \sum_{i=1}^{k}\sum_{j=1}^{n}\gamma_j^{(t)}(i)\log\pi_i \quad \text{subject to}\ \sum_{i=1}^{k}\pi_i = 1,\ \pi_i > 0. \tag{6}$$
The solution of this problem is
$$\pi_i^{(t+1)} = \frac{\sum_{j=1}^{n}\gamma_j^{(t)}(i)}{n}. \tag{7}$$
For $\mu_i$, the maximization can be reduced to solving the equation
$$\nabla_{\mu_i} Q(\theta \mid \theta^{(t)}) = 0. \tag{8}$$
The left-hand side is
$$\nabla_{\mu_i} Q(\theta \mid \theta^{(t)}) = \nabla_{\mu_i}\Big\{-\frac{1}{2}\sum_{j=1}^{n}\gamma_j^{(t)}(i)(y_j - \mu_i)^T\Sigma_i^{-1}(y_j - \mu_i)\Big\} = \Sigma_i^{-1}\Big(\sum_{j=1}^{n}\gamma_j^{(t)}(i)\,y_j - \sum_{j=1}^{n}\gamma_j^{(t)}(i)\,\mu_i\Big).$$
Therefore,
$$\mu_i^{(t+1)} = \frac{\sum_{j=1}^{n}\gamma_j^{(t)}(i)\,y_j}{\sum_{j=1}^{n}\gamma_j^{(t)}(i)}. \tag{9}$$

For $\Sigma_i$, the maximization is equivalent to solving
$$\nabla_{\Sigma_i} Q(\theta \mid \theta^{(t)}) = 0. \tag{10}$$
The left-hand side is
$$\nabla_{\Sigma_i} Q(\theta \mid \theta^{(t)}) = -\frac{1}{2}\Big(\sum_{j=1}^{n}\gamma_j^{(t)}(i)\Big)\nabla_{\Sigma_i}\log|\Sigma_i| - \frac{1}{2}\sum_{j=1}^{n}\gamma_j^{(t)}(i)\,\nabla_{\Sigma_i}\big\{(y_j - \mu_i)^T\Sigma_i^{-1}(y_j - \mu_i)\big\}$$
$$= -\frac{1}{2}\Big(\sum_{j=1}^{n}\gamma_j^{(t)}(i)\Big)\Sigma_i^{-1} + \frac{1}{2}\sum_{j=1}^{n}\gamma_j^{(t)}(i)\,\Sigma_i^{-1}(y_j - \mu_i)(y_j - \mu_i)^T\Sigma_i^{-1}.$$
Therefore,
$$\Sigma_i^{(t+1)} = \frac{\sum_{j=1}^{n}\gamma_j^{(t)}(i)\,(y_j - \mu_i^{(t+1)})(y_j - \mu_i^{(t+1)})^T}{\sum_{j=1}^{n}\gamma_j^{(t)}(i)}. \tag{11}$$
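
Putting (7), (9), and (11) together, the sketch below runs the full EM updates for a one-dimensional mixture, so each $\Sigma_i$ reduces to a scalar variance. This is an illustrative implementation with synthetic data and my own variable names, not code from the notes.

% EM for a 1-D Gaussian mixture: updates (7), (9), (11) in the scalar case.
gauss = @(x,m,s2) exp(-(x-m).^2./(2*s2)) ./ sqrt(2*pi*s2);   % N(x | m, s2)
y = [1 + 0.5*randn(400,1); 5 + 1.0*randn(600,1)];            % synthetic two-component data
n = numel(y); k = 2;
pi_w = ones(1,k)/k; mu = [min(y) max(y)]; s2 = [var(y) var(y)];   % initialization
for t = 1:200
    G = zeros(n,k);                                          % E-step: gamma_j^(t)(i)
    for i = 1:k
        G(:,i) = pi_w(i) * gauss(y, mu(i), s2(i));
    end
    G = G ./ repmat(sum(G,2), 1, k);
    Nk   = sum(G,1);                                         % M-step: closed-form updates
    pi_w = Nk / n;                                           % (7)
    mu   = (G' * y)' ./ Nk;                                  % (9)
    for i = 1:k
        s2(i) = sum(G(:,i) .* (y - mu(i)).^2) / Nk(i);       % (11), scalar variance
    end
end
disp([pi_w; mu; s2])                                         % weights, means, variances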

6 Bernoulli Mixture

Our next example is a Bernoulli mixture model. To motivate this problem, imagine that we have a dataset of various items, and our goal is to see whether there is any relationship between the presence or absence of these items. For example, if object A (e.g., a tree) is present, there is some probability that object B (e.g., a flower) is also present. However, if a certain object C (e.g., a dinosaur) is present, it is unlikely that we will see object D (e.g., a car, unless you are in Jurassic Park!).

To set up the problem, let us first define some notation. We use $Y^1, \ldots, Y^N$ to denote the $N$ images we have observed. In each image there are at most $M$ items, so that $Y^n = [Y_1^n, \ldots, Y_M^n]$ for $n = 1, \ldots, N$. Each entry of this vector is a Bernoulli random variable. Moreover, we define
$$P(Y_i^n = 1 \mid Y_k^n = 1) \stackrel{\text{def}}{=} \theta_{ik}. \tag{12}$$
Therefore, the goal is to estimate the matrix
$$\Theta = \begin{bmatrix}\theta_{11} & \cdots & \theta_{1M}\\ \vdots & \ddots & \vdots\\ \theta_{M1} & \cdots & \theta_{MM}\end{bmatrix} \tag{13}$$
from the observations $Y^1, \ldots, Y^N$. The general problem of estimating $\Theta$ from $Y^1, \ldots, Y^N$ is very difficult. Therefore, it is necessary to impose some assumptions. The assumption we make here is semi-valid from our daily experience: it is not completely true, but it is simple enough to provide us with a computational solution.

Assumption 1 (Conditional Independence). We assume that the observations follow the conditional independence structure
$$P(Y_i^n = 1, Y_j^n = 1 \mid Y_k^n = 1) = P(Y_i^n = 1 \mid Y_k^n = 1)\,P(Y_j^n = 1 \mid Y_k^n = 1). \tag{14}$$

Remark: Conditional independence is not the same as independence. For example, let A be the event that a puppy breaks a toy, B the event that a mother yells, and C the event that a child cries. Without knowing the relationship, it could be that the child cries because the mother yells. However, if we assume conditional independence of B and C given A, then we know that the crying of the child and the yelling of the mother are both triggered by the puppy, and not by each other.

Individual Model: In order to understand the EM algorithm for the Bernoulli mixture, let us fix $n$. Consequently,
$$P(Y^n = y^n) = \sum_{m=1}^{M} P(Y^n = y^n \mid \text{item } m \text{ is active})\,\underbrace{P(\text{item } m \text{ is active})}_{\stackrel{\text{def}}{=}\,\pi_m}.$$
Furthermore,
$$P(Y^n = y^n \mid \text{item } m \text{ is active}) = \prod_{i=1}^{M}\theta_{mi}^{\,y_i^n}(1-\theta_{mi})^{1-y_i^n} \stackrel{\text{def}}{=} f_m(y^n \mid \theta_m),$$
where $\theta_m = [\theta_{m1}, \ldots, \theta_{mM}]$ is the $m$-th row of $\Theta$. Therefore,
$$P(Y^n = y^n) = \sum_{m=1}^{M}\pi_m f_m(y^n \mid \theta_m). \tag{15}$$

EM Algorithm: Now we derive the EM algorithm to estimate $\{\pi_1, \ldots, \pi_M\}$ and $\Theta$. To start with, let us define the following types of data:
Observed data: $Y^1, \ldots, Y^N$;
Missing data: $Z^1, \ldots, Z^N$ with realizations $z^1, \ldots, z^N$, where $z^n \in \{1, \ldots, M\}$ indicates which item is active in image $n$;
Complete data: $X^1, \ldots, X^N$, with $x^n = (y^n, z^n)$.
The distribution of the complete data is
$$P(Y^n = y^n, Z^n = m \mid \Theta) = \pi_m f_m(y^n \mid \theta_m).$$
The distribution of the missing data conditioned on the observed data is
$$P(Z^n = m \mid Y^n = y^n, \Theta^{(t)}) = \frac{\pi_m^{(t)} f_m(y^n \mid \theta_m^{(t)})}{\sum_{m'=1}^{M}\pi_{m'}^{(t)} f_{m'}(y^n \mid \theta_{m'}^{(t)})} \stackrel{\text{def}}{=} \gamma_{nm}^{(t)}.$$
The $n$-th Q-function is
$$Q_n(\Theta \mid \Theta^{(t)}) \stackrel{\text{def}}{=} E_{Z^n \mid y^n,\Theta^{(t)}}[\log f(x^n \mid \Theta)] = E_{Z^n \mid y^n,\Theta^{(t)}}[\log f(z^n, y^n \mid \Theta)]$$
$$= \sum_{m=1}^{M}\log\big(\pi_m f_m(y^n \mid \theta_m)\big)\,\underbrace{P(Z^n = m \mid y^n, \Theta^{(t)})}_{\gamma_{nm}^{(t)}} = \sum_{m=1}^{M}\gamma_{nm}^{(t)}\log\big(\pi_m f_m(y^n \mid \theta_m)\big),$$
where we can show that
$$\log\big(\pi_m f_m(y^n \mid \theta_m)\big) = \log\pi_m + \sum_{i=1}^{M}\big[y_i^n\log\theta_{mi} + (1-y_i^n)\log(1-\theta_{mi})\big].$$

Therefore, the overall Q-function is
$$Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N}\sum_{m=1}^{M}\gamma_{nm}^{(t)}\Big[\log\pi_m + \sum_{i=1}^{M}\big(y_i^n\log\theta_{mi} + (1-y_i^n)\log(1-\theta_{mi})\big)\Big]. \tag{16}$$
To maximize the Q-function, we solve
$$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)}). \tag{17}$$
For a fixed $m$ and $i$, we have
$$\frac{\partial}{\partial\theta_{mi}} Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N}\gamma_{nm}^{(t)}\Big[\frac{y_i^n}{\theta_{mi}} - \frac{1-y_i^n}{1-\theta_{mi}}\Big].$$
Setting this to zero yields
$$\frac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}y_i^n}{\theta_{mi}} = \frac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}(1-y_i^n)}{1-\theta_{mi}},$$
which is
$$\theta_{mi}^{(t+1)} = \frac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}y_i^n}{\sum_{n=1}^{N}\gamma_{nm}^{(t)}}, \qquad \pi_m^{(t+1)} = \frac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}}{\sum_{m'=1}^{M}\sum_{n=1}^{N}\gamma_{nm'}^{(t)}}. \tag{18}$$

EM Algorithm for the Bernoulli Mixture Model
Data: observations $y^1, \ldots, y^N$
Result: estimated $\Theta$ and $\{\pi_m\}$
for $t = 1, \ldots$ do
    $\gamma_{nm}^{(t)} = \dfrac{\pi_m^{(t)} f_m(y^n \mid \theta_m^{(t)})}{\sum_{m'=1}^{M}\pi_{m'}^{(t)} f_{m'}(y^n \mid \theta_{m'}^{(t)})}$
    $\theta_{mi}^{(t+1)} = \dfrac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}y_i^n}{\sum_{n=1}^{N}\gamma_{nm}^{(t)}}$
    $\pi_m^{(t+1)} = \dfrac{\sum_{n=1}^{N}\gamma_{nm}^{(t)}}{\sum_{m'=1}^{M}\sum_{n=1}^{N}\gamma_{nm'}^{(t)}}$
end
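
Section 9 gives a complete, loop-based MATLAB demo of this algorithm. For reference, here is a compact vectorized sketch of a single iteration of the updates in (18); the names Y (an N-by-M binary matrix), P (the current M-by-M matrix $\Theta$, one row per component), and u (the current 1-by-M weight vector) are assumptions that only loosely mirror the demo, and the entries of P are assumed to lie strictly between 0 and 1 so the logarithms are finite.

% Vectorized sketch of one EM iteration for the Bernoulli mixture (cf. (18)).
% Assumed inputs: Y (N-by-M, binary), P (M-by-M, rows = theta_m, entries in (0,1)),
% u (1-by-M weights). The vectorization is my own; Section 9 has the full demo.
logF  = Y*log(P') + (1-Y)*log(1-P');                       % N-by-M: log f_m(y^n | theta_m)
lam   = exp(logF) .* repmat(u, size(Y,1), 1);              % unnormalized responsibilities
lam   = lam ./ repmat(sum(lam,2), 1, size(P,1));           % gamma_{nm}
u_new = sum(lam,1) / size(Y,1);                            % pi update in (18)
P_new = (lam' * Y) ./ repmat(sum(lam,1)', 1, size(Y,2));   % theta update in (18)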

7 Convergence of EM

The convergence of the EM algorithm is known to be local. What this means is that as the EM algorithm iterates, $\theta^{(t+1)}$ will never be less likely than $\theta^{(t)}$. This property is called the monotonicity of EM, and it is the result of the following theorem.

Theorem 1. Let $X$ and $Y$ be two random variables with parametric distributions controlled by a parameter $\theta \in \Lambda$. Suppose that:
1. the support $\mathcal{X}(y)$ of $X$ given $Y = y$ does not depend on $\theta$;
2. there exists a Markov relationship $\theta \to X \to Y$, i.e., $f(y \mid x, \theta) = f(y \mid x)$, for all $\theta \in \Lambda$, $x \in \mathcal{X}(y)$, and $y \in \mathcal{Y}$.
Then, for $\theta \in \Lambda$ and $y \in \mathcal{Y}$ such that $\mathcal{X}(y)$ is non-empty, we have
$$l(\theta) \geq l(\theta^{(t)}) \quad \text{if} \quad Q(\theta \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)}). \tag{19}$$

Proof.
$$l(\theta) = \log f(y \mid \theta) \quad \text{(by definition)}$$
$$= \log\int_{\mathcal{X}(y)} f(x, y \mid \theta)\,dx \quad \text{(marginalization, i.e., total probability)}$$
$$= \log\int_{\mathcal{X}(y)} \frac{f(x, y \mid \theta)}{f(x \mid y, \theta^{(t)})}\,f(x \mid y, \theta^{(t)})\,dx = \log E_{X \mid y,\theta^{(t)}}\Big[\frac{f(X, y \mid \theta)}{f(X \mid y, \theta^{(t)})}\Big]$$
$$\geq E_{X \mid y,\theta^{(t)}}\Big[\log\frac{f(X, y \mid \theta)}{f(X \mid y, \theta^{(t)})}\Big] \quad \text{(Jensen's inequality)}$$
$$= E_{X \mid y,\theta^{(t)}}\Big[\log\frac{f(y \mid X, \theta)\,f(X \mid \theta)\,f(y \mid \theta^{(t)})}{f(y \mid X, \theta^{(t)})\,f(X \mid \theta^{(t)})}\Big] \quad \text{(Bayes' rule)}$$
$$= E_{X \mid y,\theta^{(t)}}\Big[\log\frac{f(y \mid X)\,f(X \mid \theta)\,f(y \mid \theta^{(t)})}{f(y \mid X)\,f(X \mid \theta^{(t)})}\Big] \quad \text{(assumption 2)}$$
$$= E_{X \mid y,\theta^{(t)}}\Big[\log\frac{f(X \mid \theta)\,f(y \mid \theta^{(t)})}{f(X \mid \theta^{(t)})}\Big]$$
$$= E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta)] - E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta^{(t)})] + E_{X \mid y,\theta^{(t)}}[\log f(y \mid \theta^{(t)})]$$
$$= Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \underbrace{\log f(y \mid \theta^{(t)})}_{l(\theta^{(t)})}.$$
Thus, $l(\theta) - l(\theta^{(t)}) \geq Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)})$. Hence, if $Q(\theta \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})$, then $l(\theta) \geq l(\theta^{(t)})$.

8 Using a Prior with EM

The EM algorithm can fail due to singularity of the log-likelihood function. For example, when learning a GMM with 10 components, the algorithm may decide that the most likely solution is for one of the Gaussians to have only one data point assigned to it, which can yield the bad result of a zero covariance. To alleviate this problem, one can use prior information about $\theta$. In this case, we modify the EM steps as

E-step: $Q(\theta \mid \theta^{(t)}) = E_{X \mid y,\theta^{(t)}}[\log f(X \mid \theta)]$;
M-step: $\theta^{(t+1)} = \arg\max_{\theta}\ Q(\theta \mid \theta^{(t)}) + \underbrace{\log f(\theta)}_{\text{prior}}$.

Example: Assume that we have a GMM of $k$ components:
$$f(y_j \mid \theta) = \sum_{i=1}^{k} w_i\,\mathcal{N}(y_j \mid \mu_i, \sigma^2). \tag{20}$$

Let us consider a constraint on $\mu_i$: $\mu_i = \mu + (i-1)\Delta\mu$ for $i = 1, \ldots, k$, i.e., the means are equally spaced. (For details please refer to Section 3.3 of Gupta and Chen.)

Priors: We assume the following priors:

1. $\sigma^2 \sim \text{inverse-gamma}\big(\tfrac{\nu}{2}, \tfrac{\xi^2}{2}\big)$. That is,
$$f(\sigma^2) = \frac{(\xi^2/2)^{\nu/2}}{\Gamma(\nu/2)}\,(\sigma^2)^{-\frac{\nu}{2}-1}\exp\Big(-\frac{\xi^2}{2\sigma^2}\Big).$$
2. $\Delta\mu \mid \sigma^2 \sim \mathcal{N}\big(\eta, \tfrac{\sigma^2}{\rho}\big)$. That is,
$$f(\Delta\mu \mid \sigma^2) \propto (\sigma^2)^{-\frac{1}{2}}\exp\Big(-\frac{(\Delta\mu - \eta)^2}{2(\sigma^2/\rho)}\Big).$$

Therefore, the joint distribution of the prior is
$$f(\Delta\mu, \sigma^2) \propto (\sigma^2)^{-\frac{\nu+3}{2}}\exp\Big\{-\frac{\xi^2 + \rho(\Delta\mu - \eta)^2}{2\sigma^2}\Big\}. \tag{21}$$

Parameters: $\theta = (w_1, \ldots, w_k, \mu, \Delta\mu, \sigma^2)$. Our goal is to estimate $\theta$.

EM algorithm: First of all, we let
$$\gamma_j^{(t)}(i) = \frac{w_i^{(t)}\,\mathcal{N}(y_j \mid \mu_i^{(t)}, \sigma^{2(t)})}{\sum_{l=1}^{k} w_l^{(t)}\,\mathcal{N}(y_j \mid \mu_l^{(t)}, \sigma^{2(t)})}. \tag{22}$$
The EM steps can be derived as follows.

The Expectation Step:
$$Q(\theta \mid \theta^{(t)}) = \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\,\log\big(w_i\,\mathcal{N}(y_j \mid \mu_i, \sigma^2)\big) = \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\,\log\big(w_i\,\mathcal{N}(y_j \mid \mu + (i-1)\Delta\mu, \sigma^2)\big)$$
$$= \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\log w_i - \frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\,(y_j - \mu - (i-1)\Delta\mu)^2.$$

The Maximization Step:
$$\theta^{(t+1)} = \arg\max_{\theta}\ Q(\theta \mid \theta^{(t)}) + \log f(\theta)$$
$$= \arg\max_{\theta}\ \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\log w_i - \frac{n+\nu+3}{2}\log\sigma^2 - \frac{\xi^2 + \rho(\Delta\mu - \eta)^2}{2\sigma^2} - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\,(y_j - \mu - (i-1)\Delta\mu)^2 + C.$$
Thus,
$$w_i^{(t+1)} = \frac{\sum_{j=1}^{n}\gamma_j^{(t)}(i)}{\sum_{l=1}^{k}\sum_{j=1}^{n}\gamma_j^{(t)}(l)},$$
and $(\mu, \Delta\mu)$ are obtained from the stationarity conditions
$$\frac{\partial}{\partial\mu}\big[Q(\theta \mid \theta^{(t)}) + \log f(\theta)\big] = 0, \qquad \frac{\partial}{\partial\Delta\mu}\big[Q(\theta \mid \theta^{(t)}) + \log f(\theta)\big] = 0,$$
which form a $2\times 2$ linear system in $(\mu, \Delta\mu)$ whose coefficients involve $\{\gamma_j^{(t)}(i)\}$, the data, and the prior parameters $\rho$ and $\eta$. The solution of $\mu$ and $\Delta\mu$ is obtained by solving this linear system. Finally, setting
$$\frac{\partial}{\partial\sigma^2}\big[Q(\theta \mid \theta^{(t)}) + \log f(\theta)\big] = 0$$
yields
$$\sigma^{2(t+1)} = \frac{\xi^2 + \rho(\Delta\mu^{(t+1)} - \eta)^2 + \sum_{j=1}^{n}\sum_{i=1}^{k}\gamma_j^{(t)}(i)\,\big(y_j - \mu^{(t+1)} - (i-1)\Delta\mu^{(t+1)}\big)^2}{n+\nu+3}.$$

9 MATLAB Demo: EM Algorithm for Bernoulli Mixture

9.1 Synthesize the Data

function [ data_rand ] = MakeData( DS, u_vec, p_mat )

cnt = 0;
for i = 1:1:length(u_vec)
    N = DS*u_vec(i);
    p_vec = p_mat(i,:);
    %%
    for m = 1:1:length(p_vec)
        data_vec = randperm(N);
        th = N*p_mat(i,m);
        for n = 1:1:N
            if data_vec(n) > th
                data_vec(n) = 0;
            else
                data_vec(n) = 1;
            end
        end
        data(cnt+1:cnt+N, m) = data_vec';
    end
    cnt = cnt + N;
end

%% Now randomly permute the rows of the matrix
[row, column] = size(data);
row_vec = randperm(row);

for i = 1:1:row
    randtemp = row_vec(i);
    data_rand(i,:) = data(randtemp,:);
end

end

9.2 Estimate the Probability of a Vector Given a Bernoulli Distribution

function [ p_b ] = Bernoulli_vec( p_vec, y_vec )
%% Calculate the probability of y_vec under the current Bernoulli component
p_b = 1;
for i = 1:1:length(p_vec)
    p_b = p_b*(p_vec(i)^(y_vec(i)))*((1-p_vec(i))^(1-y_vec(i)));
end

end

9.3 The Main Function for EM with Bernoulli Mixture

close all
clear all
clc
DS = input('Enter the synthesized data size:');
u_vec = [1/4, 1/2, 1/4]
p_mat = [1,   0.4, 0.05;
         0.2, 1,   0.8;
         0.3, 0.7, 1]
data_rand = MakeData(DS, u_vec, p_mat);
T = input('Enter the desired number of iterations:');

%% Pick initialization of parameters
u_initial = [1/4, 1/8, 5/8];
p_initial = [0.3, 0.2,  0.8;
             0.1, 0.8,  0.7;
             0.5, 0.15, 0.6];

M = length(u_initial);
N = size(data_rand, 1);
% Initialize the parameters
u = u_initial;
p = p_initial;

u_history = zeros(M,T);
p_history = zeros(M,M,T);

for t = 1:1:T
    for m = 1:1:M
        p_m = u(m);
        p_vec = p(m,:);
        for n = 1:1:N
            y_vec = data_rand(n,:);
            %% Find the hidden variable, lambda
            numerator = p_m*Bernoulli_vec(p_vec, y_vec); % Model the Bernoulli process
            denom = 0;
            for mm = 1:1:M
                p_vec_tmp = p(mm,:);
                denom = denom + u(mm)*Bernoulli_vec(p_vec_tmp, y_vec);
            end
            lambda(m,n) = numerator/denom;
        end

    end

    sum_lambda = sum(sum(lambda));

    %% Update the mixture weights u
    for m = 1:1:M
        u(m) = sum(lambda(m,:))/sum_lambda;
    end

    %% Update the P matrix
    for i = 1:1:M
        for m = 1:1:M
            p(m,i) = (sum(lambda(m,:).*data_rand(:,i)'))/(sum(lambda(m,:)));
        end
    end

    %% Save in history for each iteration to plot
    u_history(:,t) = u;
    p_history(:,:,t) = p;
end
disp('Updated p and u:')
p
u

figure
hold on
grid on
for m = 1:1:M
    plot(u_history(m,:));
end
ylabel('Estimated \mu value', 'FontSize', 20)
xlabel('Iterations', 'FontSize', 20)
title('Convergence of \mu estimated for Mixture Number 3', 'FontSize', 20)
for m = 1:1:M
    stem(T, u_vec(m));
end


figure
hold on
grid on
for i = 1:1:M
    for jj = 1:1:M
        for t = 1:1:T
            tmp = p_history(i,jj,t);
            plot_vec(t) = tmp;
        end
        plot(plot_vec)
    end
end
ylabel('Estimated P matrix values', 'FontSize', 20)
xlabel('Iterations', 'FontSize', 20)
title('Convergence of P matrix estimated for Mixture Number 3', 'FontSize', 20)
for m = 1:1:M
    for n = 1:1:M
        stem(T, p_mat(m,n));
    end
end

for m = 1:1:M
    one_loc = find(abs(p(m,:) - 1) == min(abs(p(m,:) - 1)))
    p_final(one_loc,:) = p(m,:);
    u_final(one_loc) = u(m);
end

disp('After Automatic Sorting Based on Diagonals:')
p_final
u_final
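
Usage note: to run the demo, save the first two listings as MakeData.m and Bernoulli_vec.m (matching the function names) on the MATLAB path and then run the script in Section 9.3. The script prompts for the synthesized data size, which should be a multiple of 4 so that DS*u_vec(i) is an integer, and for the number of EM iterations; it then plots the convergence of the estimated weights and of the P matrix, and finally prints the estimates after reordering the rows so that the entry closest to 1 in each row lands on the diagonal.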