The Expectation-Maximization (EM) Algorithm


Reading Assignments
- T. Mitchell, Machine Learning, McGraw-Hill, 1997 (section 6.12, hard copy).
- S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2000 (Appendix C, hard copy).
- A. Webb, Statistical Pattern Recognition, Arnold, 1999 (section 2.3, hard copy).

Case Studies
- B. Moghaddam and A. Pentland, "Probabilistic visual learning for object representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 696-710, 1997 (on-line).
- S. McKenna, Y. Raja, and S. Gong, "Tracking colour objects using adaptive mixture models", Image and Vision Computing, vol. 17, pp. 225-231, 1999 (on-line).
- C. Stauffer and E. Grimson, "Adaptive background mixture models for real-time tracking", IEEE Computer Vision and Pattern Recognition Conference, Vol. 2, pp. 246-252, 1999 (on-line).

Overview
- EM is an iterative algorithm that starts with an initial estimate for $\theta$ and iteratively modifies $\theta$ to increase the likelihood of the observed data.
- It works best in situations where the data are incomplete or can be thought of as being incomplete.
- EM is typically used with mixture models (e.g., mixtures of Gaussians).

The case of incomplete data
- Many times it is impossible to apply ML estimation directly because we cannot measure all the features or certain feature values are missing.
- The EM algorithm is ideal (i.e., it produces ML estimates) for problems with unobserved (missing) data.

Actual data: $x = (x_1, x_2, x_3)$,  Observed data: $y = (x_1, x_2)$
Complete pdf: $p(x/\theta)$,  Incomplete pdf: $p(y/\theta)$

- The incomplete pdf can be derived from the complete pdf by integrating out the missing features:

$p(y/\theta) = \int \cdots \int p(x/\theta)\, dx_{missing}$

An example
- Assume the following two classes in a pattern-recognition problem:
  (1) a class of dark objects:
      (1.1) round black objects
      (1.2) square black objects
  (2) a class of light objects.

Complete data and pdf: $x = (x_1, x_2, x_3)$, where $x_1$ is the number of round dark objects, $x_2$ the number of square dark objects, and $x_3$ the number of light objects:

$p(x_1, x_2, x_3/\theta) = \dfrac{n!}{x_1!\, x_2!\, x_3!}\,\left(\tfrac{1}{4}\right)^{x_1}\left(\tfrac{1}{4}+\tfrac{\theta}{4}\right)^{x_2}\left(\tfrac{1}{2}-\tfrac{\theta}{4}\right)^{x_3}$

Observed (incomplete) data and pdf: $y = (y_1, y_2)$, with $y_1 = x_1 + x_2$ (number of dark objects) and $y_2 = x_3$ (number of light objects). This is a many-to-one mapping!

EM: main idea and steps
- If $x$ were available, then we could use ML to estimate $\theta$, i.e.,

$\arg\max_{\theta}\; \ln p(D_x/\theta)$

- Idea: maximize the expectation of the complete-data log-likelihood $\ln p(D_x/\theta)$ given the data $y$ and our current estimate of $\theta$.

1. Initialization step: initialize the algorithm with a guess $\theta_0$.
2. Expectation step: the expectation is taken with respect to the unknown variables, using the current estimate of the parameters and conditioned upon the observations:

$Q(\theta;\theta_t) = E_{x_{unobserved}}\big(\ln p(D_x/\theta)\,/\,D_y,\theta_t\big)$

   * The expectation is over the values of the unobserved variables, since the observed data are fixed.
   * When $\ln p(D_x/\theta)$ is a linear function of the unobserved variables, this step is equivalent to finding $E(x_{unobserved}/D_y,\theta_t)$.
3. Maximization step: provides a new estimate of the parameters:

$\theta_{t+1} = \arg\max_{\theta}\; Q(\theta;\theta_t)$

4. Convergence step: if $\|\theta_{t+1}-\theta_t\| < \varepsilon$, stop; otherwise, go to step 2.
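- As a quick illustration, steps 1-4 can be written as a small generic loop in Python. This is only a sketch; the `e_step` and `m_step` callables are hypothetical placeholders for the model-specific computations derived on the following pages.

```python
import numpy as np

def em(y, theta0, e_step, m_step, eps=1e-6, max_iter=500):
    """Generic EM loop following steps 1-4 above.

    y       : observed (incomplete) data
    theta0  : initial guess theta_0                       (step 1)
    e_step  : (y, theta_t) -> expected sufficient stats   (step 2)
    m_step  : (y, expectations) -> theta maximizing Q     (step 3)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(y, theta)                       # Expectation step
        theta_new = np.asarray(m_step(y, expectations), float)  # Maximization step
        if np.linalg.norm(theta_new - theta) < eps:           # Convergence step
            return theta_new
        theta = theta_new
    return theta
```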

An example (cont'd)
- Recall the two-class setup above: the complete data are $x = (x_1, x_2, x_3)$ (counts of round dark, square dark, and light objects) with pdf

$p(x_1, x_2, x_3/\theta) = \dfrac{n!}{x_1!\, x_2!\, x_3!}\,\left(\tfrac{1}{4}\right)^{x_1}\left(\tfrac{1}{4}+\tfrac{\theta}{4}\right)^{x_2}\left(\tfrac{1}{2}-\tfrac{\theta}{4}\right)^{x_3}$

while the observed (incomplete) data are $y = (y_1, y_2) = (x_1 + x_2,\, x_3)$, a many-to-one mapping.

Expectation step: compute $E\big(\ln p(D_x/\theta)\,/\,D_y,\theta_t\big)$.

Since the observations are independent,

$p(D_x/\theta) = \prod_{i=1}^{n} p(x_i/\theta) \;\Longrightarrow\; \ln p(D_x/\theta) = \sum_{i=1}^{n}\ln p(x_i/\theta) = \sum_{i=1}^{n}\Big[\ln\Big(\tfrac{n!}{x_{i1}!\,x_{i2}!\,x_{i3}!}\Big) + x_{i1}\ln\tfrac{1}{4} + x_{i2}\ln\big(\tfrac{1}{4}+\tfrac{\theta}{4}\big) + x_{i3}\ln\big(\tfrac{1}{2}-\tfrac{\theta}{4}\big)\Big]$

Taking the expectation given $D_y$ and $\theta_t$ (note that $x_{i3} = y_{i2}$ is observed):

$E[\ln p(D_x/\theta)/D_y,\theta_t] = \sum_{i=1}^{n}\Big[E\big[\ln\big(\tfrac{n!}{x_{i1}!\,x_{i2}!\,x_{i3}!}\big)/D_y,\theta_t\big] + E[x_{i1}/D_y,\theta_t]\,\ln\tfrac{1}{4} + E[x_{i2}/D_y,\theta_t]\,\ln\big(\tfrac{1}{4}+\tfrac{\theta}{4}\big) + x_{i3}\ln\big(\tfrac{1}{2}-\tfrac{\theta}{4}\big)\Big]$

Maximization step: compute $\theta_{t+1}$ by maximizing $E[\ln p(D_x/\theta)/D_y,\theta_t]$:

$\dfrac{d}{d\theta}\,E[\ln p(D_x/\theta)/D_y,\theta_t] = 0 \;\Longrightarrow\; \theta_{t+1} = \dfrac{2\sum_i E[x_{i2}/D_y,\theta_t] - \sum_i x_{i3}}{\sum_i E[x_{i2}/D_y,\theta_t] + \sum_i x_{i3}}$

Expectation step (cont'd): estimating $E[x_{i2}/D_y,\theta_t]$. Given $y_{i1}$ dark objects, $x_{i2}$ is binomially distributed:

$P(x_{i2}/y_{i1}, y_{i2}) = P(x_{i2}/y_{i1}) = \binom{y_{i1}}{x_{i2}}\,\dfrac{\big(\tfrac{1}{4}+\tfrac{\theta}{4}\big)^{x_{i2}}\big(\tfrac{1}{4}\big)^{y_{i1}-x_{i2}}}{\big(\tfrac{1}{2}+\tfrac{\theta}{4}\big)^{y_{i1}}}$

so that

$E[x_{i2}/D_y,\theta_t] = y_{i1}\,\dfrac{\tfrac{1}{4}+\tfrac{\theta_t}{4}}{\tfrac{1}{2}+\tfrac{\theta_t}{4}}$
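- The two updates above fit in a few lines of Python. The sketch below assumes the observations are pairs $(y_{i1}, y_{i2})$ of dark/light counts; the function name and the example counts are illustrative only.

```python
import numpy as np

def em_dark_light(y1, y2, theta=0.0, eps=1e-8, max_iter=200):
    """EM for the dark/light-objects example.

    y1 : array of dark-object counts  (y_i1 = x_i1 + x_i2)
    y2 : array of light-object counts (y_i2 = x_i3)
    """
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    for _ in range(max_iter):
        # E-step: expected number of square dark objects given y_i1 and theta_t
        e_x2 = y1 * (0.25 + theta / 4.0) / (0.5 + theta / 4.0)
        # M-step: closed-form maximizer of the expected complete-data log-likelihood
        A, B = e_x2.sum(), y2.sum()
        theta_new = (2.0 * A - B) / (A + B)
        if abs(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta

# Illustrative counts: 121 dark vs. 79 light objects overall
print(em_dark_light(y1=[63, 58], y2=[37, 42]))   # converges to about 0.42
```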

Convergence properties of the EM algorithm
- At each iteration, a value of $\theta$ is computed so that the likelihood function does not decrease.
- It can be shown that by increasing $Q(\theta;\theta_t) = E_{x_{unobserved}}\big(\ln p(D_x/\theta)/D_y,\theta_t\big)$ with the EM algorithm, we are also increasing the observed-data log-likelihood $\ln p(D_y/\theta)$.
- This does not guarantee that the algorithm will reach the ML estimate (global maximum); in practice, it may get stuck in a local optimum.
- The solution depends on the initial estimate $\theta_0$.
- The algorithm is guaranteed to be stable and to converge monotonically to a (local) maximum of the likelihood; there is no chance of "overshooting" or diverging from the maximum.

Maximum Likelihood of mixtures via EM

Mixture model
- In a mixture model there are many "sub-models", each of which has its own probability distribution describing how it generates data when it is active.
- There is also a "mixer" or "gate" which controls how often each sub-model is active.
- Formally, a mixture is defined as a weighted sum of components, where each component is a parametric density function $p(x/\theta_k)$:

$p(x/\theta) = \sum_{k=1}^{K} p(x/\theta_k)\,\pi_k$

Mixture parameters
- The parameters $\theta$ to estimate are:
  * the mixing weights $\pi_k$
  * the parameters $\theta_k$ of each component density $p(x/\theta_k)$
- The component densities $p(x/\theta_k)$ may be of different parametric forms and are specified using knowledge of the data-generation process, if available.
- The weights $\pi_k$ are the mixing parameters and they sum to unity: $\sum_{k=1}^{K}\pi_k = 1$.
- Fitting a mixture model to a set of observations $D_x$ consists of estimating the set of mixture parameters that best describe this data.

- Two fundamental issues arise in mixture fitting:
  (1) estimation of the mixture parameters;
  (2) estimation of the mixture components.

Mixtures of Gaussians
- In the mixture-of-Gaussians model, $p(x/\theta_k)$ is the multivariate Gaussian distribution.
- In this case, the parameters $\theta_k$ are $(\mu_k, \Sigma_k)$.

Mixture parameter estimation using ML
- As we have seen, given a set of data $D = (x_1, x_2, \dots, x_n)$, ML seeks the value of $\theta$ that maximizes the following probability:

$p(D/\theta) = \prod_{i=1}^{n} p(x_i/\theta)$

- Since $p(x_i/\theta)$ is modeled as a mixture (i.e., $p(x_i/\theta) = \sum_{k=1}^{K} p(x_i/\theta_k)\pi_k$), the above expression can be written as:

$p(D/\theta) = \prod_{i=1}^{n}\sum_{k=1}^{K} p(x_i/\theta_k)\,\pi_k$

- In general, it is not possible to solve $\dfrac{\partial p(D/\theta)}{\partial\theta} = 0$ explicitly for the parameters, and iterative schemes must be employed.
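- Because the mixture likelihood cannot be maximized in closed form, in practice one evaluates the log-likelihood while iterating. A small NumPy/SciPy sketch of computing $\ln p(D/\theta)$ for Gaussian components (names are illustrative, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """ln p(D/theta) = sum_i ln sum_k pi_k p(x_i/theta_k), Gaussian components."""
    dens = np.column_stack([pi_k * multivariate_normal.pdf(X, mu_k, Sigma_k)
                            for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)])
    return np.log(dens.sum(axis=1)).sum()

# Two-component example in 2-D; the weights sum to one
X = np.random.default_rng(0).normal(size=(5, 2))
print(mixture_log_likelihood(X,
                             pis=[0.3, 0.7],
                             mus=[np.zeros(2), np.ones(2)],
                             Sigmas=[np.eye(2), 2 * np.eye(2)]))
```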

Estimating the means of Gaussians using EM (special case)

Data generation process using mixtures
- Assume the data D are generated by a probability distribution that is a mixture of K Gaussians (e.g., K = 2).
- Each instance is generated using a two-step process:
  (1) One of the K Gaussians is selected at random, with probabilities $\pi_1, \pi_2, \dots, \pi_K$.
  (2) A single random instance $x_i$ is generated according to the selected distribution.
- This process is repeated to generate a set of data points D.

Assumptions (this example)
(1) $\pi_1 = \pi_2 = \dots = \pi_K$ (uniform distribution)
(2) Each Gaussian has the same variance $\sigma^2$, which is known.
- The problem is to estimate the means of the Gaussians, $\theta = (\mu_1, \mu_2, \dots, \mu_K)$.
- Note: if we knew which Gaussian generated each data point, then it would be easy to find the parameters of each Gaussian using ML.

Involving hidden or unobserved variables
- We can think of the full description of each instance $x_i$ as $y_i = (x_i, z_i) = (x_i, z_{i1}, z_{i2}, \dots, z_{iK})$, where $z_i$ is a class indicator vector (hidden variable):

$z_{ij} = \begin{cases} 1 & \text{if } x_i \text{ was generated by the } j\text{-th component} \\ 0 & \text{otherwise} \end{cases}$

- In this case, the $x_i$ are observable and the $z_i$ are non-observable.

Main steps using EM
- The EM algorithm searches for an ML hypothesis through the following iterative scheme:
  (1) Initialize the hypothesis: $\theta^0 = (\mu_1^0, \mu_2^0, \dots, \mu_K^0)$
  (2) Estimate the expected values of the hidden variables $z_{ij}$ using the current hypothesis $\theta^t = (\mu_1^t, \mu_2^t, \dots, \mu_K^t)$
  (3) Update the hypothesis to $\theta^{t+1} = (\mu_1^{t+1}, \mu_2^{t+1}, \dots, \mu_K^{t+1})$ using the expected values of the hidden variables from step 2.
- Repeat steps (2)-(3) until convergence.

Derivation of the Expectation step
- We must derive an expression for $Q(\theta;\theta_t) = E_{z}\big(\ln p(D_y/\theta)\,/\,D_x,\theta_t\big)$.

(1) Derive the form of $\ln p(D_y/\theta)$:
- We can write $p(D_y/\theta)$ and $p(y_i/\theta)$ as follows:

$p(D_y/\theta) = \prod_{i=1}^{n} p(y_i/\theta)$

$p(y_i/\theta) = p(x_i, z_i/\theta) = p(x_i/z_i,\theta)\,p(z_i/\theta) = p(x_i/\theta_j)\,\pi_j$  (assuming $z_{ij}=1$ and $z_{ik}=0$ for $k \ne j$)

- We can rewrite $p(x_i/\theta_j)\pi_j$ as follows:

$p(y_i/\theta) = \prod_{k=1}^{K}\big[p(x_i/\theta_k)\,\pi_k\big]^{z_{ik}}$

- Thus, since the $\pi_k$ are all equal (they contribute only a constant factor), $p(D_y/\theta)$ can be written as:

$p(D_y/\theta) = \prod_{i=1}^{n}\prod_{k=1}^{K}\big[p(x_i/\theta_k)\big]^{z_{ik}}$

- We have assumed the form of $p(x_i/\theta_k)$ to be Gaussian:

$p(x_i/\theta_k) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\dfrac{(x_i-\mu_k)^2}{2\sigma^2}\Big]$,  thus (since exactly one $z_{ik}$ equals 1)

$\prod_{k=1}^{K}\big[p(x_i/\theta_k)\big]^{z_{ik}} = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\dfrac{1}{2\sigma^2}\sum_{k=1}^{K} z_{ik}(x_i-\mu_k)^2\Big]$

which leads to the following form for $p(D_y/\theta)$:

$p(D_y/\theta) = \prod_{i=1}^{n}\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\dfrac{1}{2\sigma^2}\sum_{k=1}^{K} z_{ik}(x_i-\mu_k)^2\Big]$

- Let us now compute $\ln p(D_y/\theta)$:

$\ln p(D_y/\theta) = \sum_{i=1}^{n}\Big(\ln\dfrac{1}{\sqrt{2\pi}\,\sigma} - \dfrac{1}{2\sigma^2}\sum_{k=1}^{K} z_{ik}(x_i-\mu_k)^2\Big)$

(2) Take the expected value of $\ln p(D_y/\theta)$ given $D_x$ and $\theta_t$:

$E_z\big(\ln p(D_y/\theta)/D_x,\theta_t\big) = E\Big(\sum_{i=1}^{n}\Big(\ln\dfrac{1}{\sqrt{2\pi}\,\sigma} - \dfrac{1}{2\sigma^2}\sum_{k=1}^{K} z_{ik}(x_i-\mu_k)^2\Big)\Big) = \sum_{i=1}^{n}\Big(\ln\dfrac{1}{\sqrt{2\pi}\,\sigma} - \dfrac{1}{2\sigma^2}\sum_{k=1}^{K} E(z_{ik})(x_i-\mu_k)^2\Big)$

- $E(z_{ik})$ is just the probability that the instance $x_i$ was generated by the k-th component (i.e., $E(z_{ik}) = 1\cdot P(z_{ik}=1/x_i,\theta_t) + 0\cdot P(z_{ik}=0/x_i,\theta_t) = P(k/x_i)$):

$E(z_{ik}) = \dfrac{\exp\big[-\frac{(x_i-\mu_k^t)^2}{2\sigma^2}\big]}{\sum_{j=1}^{K}\exp\big[-\frac{(x_i-\mu_j^t)^2}{2\sigma^2}\big]}$

Derivation of the Maximization step
- Maximize $Q(\theta;\theta_t) = E_z\big(\ln p(D_y/\theta)\,/\,D_x,\theta_t\big)$:

$\dfrac{\partial Q}{\partial\mu_k} = 0 \;\Longrightarrow\; \mu_k^{t+1} = \dfrac{\sum_{i=1}^{n} E(z_{ik})\,x_i}{\sum_{i=1}^{n} E(z_{ik})}$

Summary of the two steps
- Choose the number of components K.

Initialization step: $\theta_k^0 = \mu_k^0$

Expectation step:

$E(z_{ik}) = \dfrac{\exp\big[-\frac{(x_i-\mu_k^t)^2}{2\sigma^2}\big]}{\sum_{j=1}^{K}\exp\big[-\frac{(x_i-\mu_j^t)^2}{2\sigma^2}\big]}$

Maximization step:

$\mu_k^{t+1} = \dfrac{\sum_{i=1}^{n} E(z_{ik})\,x_i}{\sum_{i=1}^{n} E(z_{ik})}$
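- A direct NumPy transcription of this special case (equal known variance, uniform weights); this is a sketch under those assumptions, with illustrative function and variable names. The max-subtraction inside the E-step is only for numerical stability and is not part of the formulas above.

```python
import numpy as np

def em_gaussian_means(x, mu0, sigma, eps=1e-6, max_iter=500):
    """EM for the means of K univariate Gaussians with equal known
    variance sigma**2 and uniform mixing weights."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu0, dtype=float)                    # initialization: mu_k^0
    for _ in range(max_iter):
        # Expectation step: E(z_ik) proportional to exp[-(x_i - mu_k^t)^2 / (2 sigma^2)]
        log_w = -(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma ** 2)
        log_w -= log_w.max(axis=1, keepdims=True)        # numerical stability only
        w = np.exp(log_w)
        E_z = w / w.sum(axis=1, keepdims=True)           # shape (n, K)
        # Maximization step: mu_k^{t+1} = sum_i E(z_ik) x_i / sum_i E(z_ik)
        mu_new = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
        if np.max(np.abs(mu_new - mu)) < eps:
            return mu_new
        mu = mu_new
    return mu

# Illustrative usage: data drawn from two Gaussians with sigma = 1
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
print(em_gaussian_means(x, mu0=[0.0, 1.0], sigma=1.0))   # roughly [-2, 3]
```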

Estimating the mixture parameters (general case)
- If we knew which sub-model was responsible for generating each data point, then it would be easy to find the ML parameters for each sub-model. This suggests the following scheme:
  (1) Use EM to estimate which sub-model was responsible for generating each data point.
  (2) Find the ML parameters based on these estimates.
  (3) Use the new ML parameters to re-estimate the responsibilities and iterate.

Involving hidden variables
- We do not know which instance $x_i$ was generated by which component (i.e., the missing data are the labels showing which sub-model generated each data point).
- Augment each instance $x_i$ with the missing information: $y_i = (x_i, z_i)$, where $z_i$ is a class indicator vector $z_i = (z_{i1}, z_{i2}, \dots, z_{iK})$:

$z_{ij} = \begin{cases} 1 & \text{if } x_i \text{ was generated by the } j\text{-th component} \\ 0 & \text{otherwise} \end{cases}$

(the $x_i$ are observable and the $z_i$ non-observable)

Derivation of the Expectation step
- We must derive an expression for $Q(\theta;\theta_t) = E_{z}\big(\ln p(D_y/\theta)\,/\,D_x,\theta_t\big)$.

(1) Derive the form of $\ln p(D_y/\theta)$:
- We can write $p(D_y/\theta)$ and $p(y_i/\theta)$ as follows:

$p(D_y/\theta) = \prod_{i=1}^{n} p(y_i/\theta)$

$p(y_i/\theta) = p(x_i, z_i/\theta) = p(x_i/z_i,\theta)\,p(z_i/\theta) = p(x_i/\theta_j)\,\pi_j$  (assuming $z_{ij}=1$ and $z_{ik}=0$ for $k \ne j$)

- We can rewrite the above expression as follows:

$p(y_i/\theta) = \prod_{k=1}^{K}\big[p(x_i/\theta_k)\,\pi_k\big]^{z_{ik}}$

- Thus, $p(D_y/\theta)$ can be written as follows:

$p(D_y/\theta) = \prod_{i=1}^{n}\prod_{k=1}^{K}\big[p(x_i/\theta_k)\,\pi_k\big]^{z_{ik}}$

- We can now compute $\ln p(D_y/\theta)$:

$\ln p(D_y/\theta) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\ln\big(p(x_i/\theta_k)\,\pi_k\big) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\ln p(x_i/\theta_k) + \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\ln\pi_k$

(2) Take the expected value of $\ln p(D_y/\theta)$ given $D_x$ and $\theta_t$:

$E\big(\ln p(D_y/\theta)/D_x,\theta_t\big) = \sum_{i=1}^{n}\sum_{k=1}^{K} E(z_{ik})\ln p(x_i/\theta_k) + \sum_{i=1}^{n}\sum_{k=1}^{K} E(z_{ik})\ln\pi_k$

- $E(z_{ik})$ is just the probability that instance $x_i$ was generated by the k-th component (i.e., $E(z_{ik}) = P(z_{ik}=1/x_i,\theta_t) = P(k/x_i)$):

$E(z_{ik}) = \dfrac{p(x_i/\theta_k^t)\,\pi_k^t}{\sum_{j=1}^{K} p(x_i/\theta_j^t)\,\pi_j^t}$

Derivation of the Maximization step
- Maximize $Q(\theta;\theta_t)$ subject to the constraint $\sum_{k=1}^{K}\pi_k = 1$:

$Q'(\theta;\theta_t) = \sum_{i=1}^{n}\sum_{k=1}^{K} E(z_{ik})\ln p(x_i/\theta_k) + \sum_{i=1}^{n}\sum_{k=1}^{K} E(z_{ik})\ln\pi_k + \lambda\Big(1 - \sum_{k=1}^{K}\pi_k\Big)$

where $\lambda$ is the Lagrange multiplier.

$\dfrac{\partial Q'}{\partial\pi_k} = 0 \;\Longrightarrow\; \dfrac{\sum_{i=1}^{n} E(z_{ik})}{\pi_k} - \lambda = 0 \;\Longrightarrow\; \pi_k^{t+1} = \dfrac{1}{n}\sum_{i=1}^{n} E(z_{ik})$

(the constraint $\sum_k\pi_k = 1$ gives $\lambda = \sum_{k=1}^{K}\sum_{i=1}^{n} E(z_{ik}) = n$)

$\dfrac{\partial Q'}{\partial\mu_k} = 0 \;\Longrightarrow\; \mu_k^{t+1} = \dfrac{1}{n\,\pi_k^{t+1}}\sum_{i=1}^{n} E(z_{ik})\,x_i$

$\dfrac{\partial Q'}{\partial\Sigma_k} = 0 \;\Longrightarrow\; \Sigma_k^{t+1} = \dfrac{1}{n\,\pi_k^{t+1}}\sum_{i=1}^{n} E(z_{ik})\,(x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T$

Summary of steps
- Choose the number of components K.

Initialization step: $\theta_k^0 = (\pi_k^0, \mu_k^0, \Sigma_k^0)$

Expectation step:

$E(z_{ik}) = \dfrac{p(x_i/\theta_k^t)\,\pi_k^t}{\sum_{j=1}^{K} p(x_i/\theta_j^t)\,\pi_j^t}$

Maximization step:

$\pi_k^{t+1} = \dfrac{1}{n}\sum_{i=1}^{n} E(z_{ik})$

$\mu_k^{t+1} = \dfrac{1}{n\,\pi_k^{t+1}}\sum_{i=1}^{n} E(z_{ik})\,x_i$

$\Sigma_k^{t+1} = \dfrac{1}{n\,\pi_k^{t+1}}\sum_{i=1}^{n} E(z_{ik})\,(x_i-\mu_k^{t+1})(x_i-\mu_k^{t+1})^T$

Convergence step: if $\|\theta^{t+1}-\theta^{t}\| < \varepsilon$, stop; otherwise, go back to the Expectation step.
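- The general updates translate directly into a compact NumPy/SciPy sketch. This is an illustrative implementation, not the notes' code: it runs a fixed number of iterations rather than checking convergence, and a small ridge term 1e-6*I is added to each covariance purely to keep it invertible.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a general Gaussian mixture: estimates pi_k, mu_k, Sigma_k.
    X is an (n, d) data matrix."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                              # initialization step
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.array([np.cov(X, rowvar=False).reshape(d, d) + 1e-6 * np.eye(d)
                      for _ in range(K)])
    for _ in range(n_iter):
        # Expectation step: responsibilities E(z_ik)
        R = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # Maximization step
        Nk = R.sum(axis=0)                                # Nk[k] = n * pi_k^{t+1}
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (R[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma
```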

Estimating the number of components
- Use EM to obtain a sequence of parameter estimates for a range of values of K: $\{\Theta(K),\; K = K_{min}, \dots, K_{max}\}$.
- The estimate of K is then defined as a minimizer of some cost function:

$\hat{K} = \arg\min_{K}\; C(\Theta(K), K), \quad K = K_{min}, \dots, K_{max}$

- Most often, the cost function includes the negative log-likelihood $-\ln p(D_y/\theta)$ and an additional term whose role is to penalize large values of K.
- Several such criteria have been used, e.g., Minimum Description Length (MDL).
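- As one concrete, illustrative instance of such a cost function, the sketch below scores each K with the negative log-likelihood plus a BIC-style complexity penalty (used here simply as a stand-in for criteria such as MDL). It reuses the hypothetical `em_gmm` and `mixture_log_likelihood` helpers from the earlier sketches.

```python
import numpy as np

def choose_K(X, K_min=1, K_max=6):
    """Pick the number of mixture components by minimizing
    C(Theta(K), K) = -ln p(D_y/theta) + penalty(K)."""
    n, d = X.shape
    best_K, best_cost = None, np.inf
    for K in range(K_min, K_max + 1):
        pi, mu, Sigma = em_gmm(X, K)                      # fit with EM (sketch above)
        log_lik = mixture_log_likelihood(X, pi, mu, Sigma)
        n_params = (K - 1) + K * d + K * d * (d + 1) / 2  # weights, means, covariances
        cost = -log_lik + 0.5 * n_params * np.log(n)      # penalize large K
        if cost < best_cost:
            best_K, best_cost = K, cost
    return best_K
```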

Lagrange Optimization
- Suppose we want to maximize $f(x)$ subject to some constraint expressed in the form $g(x) = 0$.
- To find the maximum, first we form the Lagrangian function:

$L(x,\lambda) = f(x) + \lambda\, g(x)$

($\lambda$ is called the Lagrange undetermined multiplier)
- Take the derivative and set it equal to zero:

$\dfrac{\partial L(x,\lambda)}{\partial x} = \dfrac{\partial f(x)}{\partial x} + \lambda\,\dfrac{\partial g(x)}{\partial x} = 0$

- Solve the resulting equations for $\lambda$ and the value of $x$ that maximizes $f(x)$.
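- This is exactly the recipe used for the constraint $\sum_k \pi_k = 1$ in the mixture M-step above. Written out with $c_k$ standing for $\sum_i E(z_{ik})$:

Maximize $f(\pi) = \sum_{k=1}^{K} c_k \ln\pi_k$ subject to $g(\pi) = 1 - \sum_{k=1}^{K}\pi_k = 0$:

$L(\pi,\lambda) = \sum_{k=1}^{K} c_k \ln\pi_k + \lambda\Big(1 - \sum_{k=1}^{K}\pi_k\Big)$

$\dfrac{\partial L}{\partial\pi_k} = \dfrac{c_k}{\pi_k} - \lambda = 0 \;\Longrightarrow\; \pi_k = \dfrac{c_k}{\lambda}$

$\sum_{k=1}^{K}\pi_k = 1 \;\Longrightarrow\; \lambda = \sum_{k=1}^{K} c_k = n \;\Longrightarrow\; \pi_k = \dfrac{1}{n}\sum_{i=1}^{n} E(z_{ik})$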