EM Algorithm

Lukáš Cerman, Václav Hlaváč
Czech Technical University, Faculty of Electrical Engineering,
Department of Cybernetics, Center for Machine Perception,
121 35 Praha 2, Karlovo nám. 13, Czech Republic
cermal1@fel.cvut.cz
http://cmp.felk.cvut.cz/~cermal1/files/lectureem.pdf

LECTURE OUTLINE
Task Formulation
EM as Lower Bound Maximization
Examples
Relation to Unsupervised Learning
Relation to K-Means
Including Known Priors

TASK FORMULATION

Consider an experiment with the probability model P(x, y | θ), where x ∈ X are observed data, y ∈ Y are unobserved data, and θ ∈ Θ are the parameters of the distribution. The task is to estimate θ given a set of measurements X = {x_1, ..., x_n}, x_i ∈ X.

Marginalize over the missing data:

    P(x \mid \theta) = \sum_{y \in Y} P(x, y \mid \theta).

The ML principle defines the likelihood of the observed data,

    l(\theta) = P(X \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \prod_{i=1}^{n} \sum_{y \in Y} P(x_i, y \mid \theta),

and we maximize the log-likelihood L(θ) = log l(θ):

    \theta^{*} = \operatorname*{argmax}_{\theta} L(\theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \log \sum_{y \in Y} P(x_i, y \mid \theta).
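
The marginal sums above are exactly what one evaluates in practice. As a small illustrative sketch (not from the slides), the following computes L(θ) for a one-dimensional, two-component Gaussian mixture; the parameter layout (weights, means, sigmas) and the function name are assumptions made for the example.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_likelihood(x, weights, means, sigmas):
    """L(theta) = sum_i log sum_y P(x_i, y | theta) for a 1-D Gaussian mixture."""
    # log P(x_i, y | theta) = log P(y) + log N(x_i | mu_y, sigma_y); shape (n, |Y|)
    log_joint = np.log(weights) + norm.logpdf(x[:, None], means, sigmas)
    # logsumexp marginalizes y stably; the outer sum runs over the data points
    return logsumexp(log_joint, axis=1).sum()

# example call on synthetic data
x = np.random.default_rng(0).normal(size=100)
print(log_likelihood(x, np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([1.0, 0.5])))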

EM AS A LOWER BOUND MAXIMIZATION

No closed-form solution exists for θ* = argmax_θ L(θ).

Option 1: a numerical solution using gradient-based optimization techniques.
Option 2: the EM algorithm, a rather simple alternative solution to the problem.

Instead of maximizing L(θ), maximize its lower bound F(θ, α):

    F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)},

where α(x_i, ·) is a distribution over Y. The bound follows from Jensen's inequality,

    \log \sum_{y \in Y} P(x_i, y \mid \theta) = \log \sum_{y \in Y} \alpha(x_i, y) \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)} \;\geq\; \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)},

which holds for each x_i ∈ X, so F(θ, α) ≤ L(θ).

FINDING THE OPTIMAL BOUND

Fix θ and maximize F(θ, α) with respect to α(x_i, y). Introduce a Lagrange multiplier λ to enforce \sum_{y \in Y} \alpha(x_i, y) = 1:

    G(\alpha) = \lambda \Big( 1 - \sum_{y \in Y} \alpha(x_i, y) \Big) + \sum_{y \in Y} \alpha(x_i, y) \log P(x_i, y \mid \theta) - \sum_{y \in Y} \alpha(x_i, y) \log \alpha(x_i, y).

Take the derivative,

    \frac{\partial G(\alpha)}{\partial \alpha(x_i, y)} = -\lambda + \log P(x_i, y \mid \theta) - \log \alpha(x_i, y) - 1,

set it to zero and solve for α(x_i, y):

    \alpha(x_i, y) = \frac{P(x_i, y \mid \theta)}{\sum_{y' \in Y} P(x_i, y' \mid \theta)} = P(y \mid x_i, \theta).

EXAMINING THE OPTIMAL BOUND

By examining the optimal bound, we see that it indeed touches the objective function L(θ):

    F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)}
                      = \sum_{i=1}^{n} \sum_{y \in Y} P(y \mid x_i, \theta) \log \frac{P(x_i, y \mid \theta)}{P(y \mid x_i, \theta)}
                      = \sum_{i=1}^{n} \sum_{y \in Y} P(y \mid x_i, \theta) \log P(x_i \mid \theta)
                      = \sum_{i=1}^{n} \log P(x_i \mid \theta) = L(\theta).
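
The two properties just derived, F(θ, α) ≤ L(θ) for any distribution α and equality at α(x_i, y) = P(y | x_i, θ), are easy to check numerically. The sketch below does so for the same toy two-component mixture used earlier; the data and all names are invented for the illustration.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w, mu, s = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([1.0, 0.5])

log_joint = np.log(w) + norm.logpdf(x[:, None], mu, s)   # log P(x_i, y | theta)
L = logsumexp(log_joint, axis=1).sum()                   # L(theta)

def F(alpha):
    # F(theta, alpha) = sum_i sum_y alpha(x_i, y) log [P(x_i, y | theta) / alpha(x_i, y)]
    return (alpha * (log_joint - np.log(alpha))).sum()

alpha_rand = rng.dirichlet([1.0, 1.0], size=20)          # an arbitrary distribution alpha(x_i, .)
alpha_post = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
print(F(alpha_rand) <= L + 1e-12)                        # True: Jensen lower bound
print(np.isclose(F(alpha_post), L))                      # True: bound is tight at the posterior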

MAXIMIZING THE BOUND

Fix α and maximize F(θ, α) with respect to θ:

    \operatorname*{argmax}_{\theta} F(\theta, \alpha) = \operatorname*{argmax}_{\theta} \Big[ \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log P(x_i, y \mid \theta) - \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \alpha(x_i, y) \Big]
                                                      = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log P(x_i, y \mid \theta)   (the second term does not depend on θ)
                                                      = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta)} \big[ \log P(x_i, y \mid \theta) \big].

That is, we maximize the expectation of the complete-data log-likelihood log P(x_i, y | θ) under our current estimate of P(y | x_i, θ).

EM ALGORITHM

At each iteration, find the optimal lower bound F(θ^t, α) at the current guess θ^t, then maximize this bound to obtain an improved estimate θ^{t+1}.

E-step: calculate

    \alpha^{t}(x_i, y) = \frac{P(x_i, y \mid \theta^{t})}{\sum_{y' \in Y} P(x_i, y' \mid \theta^{t})} = P(y \mid x_i, \theta^{t}).

M-step: calculate

    \theta^{t+1} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} \big[ \log P(x_i, y \mid \theta) \big].

The initial values θ^0 may be chosen randomly.
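
The iteration can be written as a short generic loop. This is only a skeleton under the assumption of a finite latent set Y; posterior() and m_step() are hypothetical callbacks supplied by the concrete model, and the stopping test on L(θ) relies on the fact that EM never decreases the log-likelihood.

def em(X, theta0, posterior, m_step, tol=1e-6, max_iter=200):
    """Generic EM loop; posterior() returns (alpha, L(theta)), m_step() returns a new theta."""
    theta, prev_ll = theta0, -float("inf")
    for _ in range(max_iter):
        alpha, ll = posterior(X, theta)   # E-step: alpha^t(x_i, y) = P(y | x_i, theta^t)
        theta = m_step(X, alpha)          # M-step: argmax of the expected complete log-likelihood
        if ll - prev_ll < tol:            # stop once the monotone increase of L(theta) levels off
            break
        prev_ll = ll
    return theta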

EXAMPLE: GAUSSIAN MIXTURE MODELS

A Gaussian mixture model is defined by

    P(x, y \mid \theta) = P(x \mid y, \theta) \, P(y \mid \theta) = \mathcal{N}(x \mid \mu_y, \Sigma_y) \, P(y).

E-step: calculate

    \alpha^{t}(x_i, y) = \frac{\mathcal{N}(x_i \mid \mu_y^{t}, \Sigma_y^{t}) \, P^{t}(y)}{\sum_{y' \in Y} \mathcal{N}(x_i \mid \mu_{y'}^{t}, \Sigma_{y'}^{t}) \, P^{t}(y')}.

M-step: calculate

    P^{t+1}(y) = \frac{1}{n} \sum_{i=1}^{n} \alpha^{t}(x_i, y),
    \mu_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \, x_i}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)},
    \Sigma_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \, (x_i - \mu_y^{t})(x_i - \mu_y^{t})^{\top}}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)}.
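
As a concrete sketch of these updates (not the authors' code), below is one possible NumPy implementation of the EM loop for a K-component GMM. The function name em_gmm, the initialization, the small ridge added to the covariances for numerical stability, and the use of the freshly updated means in the covariance update are all assumptions of this example.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]          # initial means: random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)                              # initial mixing proportions P(y)

    for _ in range(n_iter):
        # E-step: alpha[i, y] proportional to N(x_i | mu_y, Sigma_y) P(y)
        alpha = np.column_stack([
            pi[y] * multivariate_normal.pdf(X, mean=mu[y], cov=Sigma[y]) for y in range(K)
        ])
        alpha /= alpha.sum(axis=1, keepdims=True)

        # M-step: re-estimate P(y), mu_y, Sigma_y from the responsibilities
        Nk = alpha.sum(axis=0)                            # effective number of points per component
        pi = Nk / n
        mu = (alpha.T @ X) / Nk[:, None]
        for y in range(K):
            diff = X - mu[y]
            Sigma[y] = (alpha[:, y, None] * diff).T @ diff / Nk[y] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, alpha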

EXAMPLE: IMAGE RECONSTRUCTION

Each pixel x_{irc} of image i at position (r, c) is observed with Gaussian noise N(0, σ). The probability of observing the value x_{irc}, assuming the face is at position k_i, is therefore

    P(x_{irc} \mid k_i, f, b, \sigma) =
    \begin{cases}
      \mathcal{N}(x_{irc} \mid f_{r,\, c - k_i + 1}, \sigma) & \text{for } c \in [k_i, k_i + w), \\
      \mathcal{N}(x_{irc} \mid b, \sigma) & \text{elsewhere},
    \end{cases}

where b is the background intensity, f_{rc} are the face pixels and w is the face width. The probability of observing a set of m images X = {X_1, ..., X_m} is

    P(X \mid f, b, \sigma) = \prod_{i=1}^{m} \sum_{k} P(X_i, k \mid f, b, \sigma) = \prod_{i=1}^{m} \sum_{k} P(k) \, P(X_i \mid k, f, b, \sigma).

The unobserved data are the face positions k_i. The parameters of the probability model are the face pixels f_{r,c}, the background intensity b and the noise variance σ.
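
A compact sketch of EM for this model is given below, under several assumptions not stated on the slide: grayscale images stored as an array of shape (m, R, C), a uniform prior P(k) over positions, a known face width w, σ² treated as the noise variance, and invented names (em_face, images, n_iter).

import numpy as np

def em_face(images, w, n_iter=50):
    m, R, C = images.shape
    K = C - w + 1                        # candidate face positions k = 0, ..., K-1
    f = images[:, :, :w].mean(axis=0)    # initial face template (R x w)
    b = images.mean()                    # initial background intensity
    sigma2 = images.var()                # initial noise variance

    for _ in range(n_iter):
        # E-step: posterior alpha[i, k] = P(k | X_i, f, b, sigma), uniform prior P(k)
        loglik = np.empty((m, K))
        for k in range(K):
            mu = np.full((R, C), b)
            mu[:, k:k + w] = f                               # expected image with the face at k
            loglik[:, k] = -0.5 * ((images - mu) ** 2).sum(axis=(1, 2)) / sigma2
        alpha = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)

        # M-step: re-estimate face pixels f, background b and noise variance sigma^2
        f_num = np.zeros((R, w))
        b_num = b_cnt = ss = 0.0
        for k in range(K):
            wk = alpha[:, k]                                 # weight of position k per image
            f_num += (wk[:, None, None] * images[:, :, k:k + w]).sum(axis=0)
            mask = np.ones(C, dtype=bool)
            mask[k:k + w] = False                            # background columns for position k
            b_num += (wk[:, None, None] * images[:, :, mask]).sum()
            b_cnt += wk.sum() * R * (C - w)
        f = f_num / m                                        # sum_i sum_k alpha[i, k] = m
        b = b_num / b_cnt
        for k in range(K):                                   # residuals under the updated f, b
            mu = np.full((R, C), b)
            mu[:, k:k + w] = f
            ss += (alpha[:, k][:, None, None] * (images - mu) ** 2).sum()
        sigma2 = ss / (m * R * C)
    return f, b, sigma2, alpha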

RELATION TO UNSUPERVISED LEARNING

Consider a classification problem with measurements x ∈ X and classes y ∈ Y. The relation between each measurement x and its class assignment y can be described by the probability P(x, y | θ). Given an unlabeled training set of measurements X = {x_1, ..., x_n}, one can use the EM algorithm to estimate the probability model and even to classify the observed data without any information from a teacher. To classify the data, one can use the output of the E-step,

    \alpha^{t}(x_i, y) = \frac{P(x_i, y \mid \theta^{t})}{\sum_{y' \in Y} P(x_i, y' \mid \theta^{t})} = P(y \mid x_i, \theta^{t}),

which can be interpreted as the probability of x_i belonging to class y.
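
For instance, reusing the hypothetical em_gmm sketch above, the responsibilities alpha returned by the E-step act directly as class posteriors, so unlabeled points can be assigned to the most probable component:

pi, mu, Sigma, alpha = em_gmm(X, K=3)   # X: unlabeled data, K: assumed number of classes
labels = alpha.argmax(axis=1)           # hard class decision from the soft posteriors
confidence = alpha.max(axis=1)          # P(y = labels[i] | x_i, theta)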

RELATION TO K-MEANS

K-means is an unsupervised clustering algorithm. It iterates a classification step,

    \alpha^{t}(x_i, y) =
    \begin{cases}
      1 & \text{for } y = \operatorname*{argmin}_{y' \in Y} \lVert x_i - \mu_{y'}^{t} \rVert, \\
      0 & \text{elsewhere},
    \end{cases}

and a learning step,

    \mu_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \, x_i}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)}.

Whereas the K-means algorithm performs a hard assignment of data points to clusters, the EM algorithm makes a soft assignment based on the posterior probabilities P(y | x_i, θ). One can derive K-means as a particular limit of EM for a GMM as follows. Take shared isotropic covariances εI,

    P(x, y \mid \theta) = P(x \mid y, \theta) \, P(y \mid \theta) = \mathcal{N}(x \mid \mu_y, \epsilon I) \, P(y),

so that the E-step becomes

    \alpha^{t}(x_i, y) = \frac{P^{t}(y) \exp\!\big( -\lVert x_i - \mu_y^{t} \rVert^{2} / 2\epsilon \big)}{\sum_{y' \in Y} P^{t}(y') \exp\!\big( -\lVert x_i - \mu_{y'}^{t} \rVert^{2} / 2\epsilon \big)}.

Letting ε → 0, one obtains a hard assignment, just as in K-means.
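
The hardening of the responsibilities as ε → 0 can be seen directly in a tiny numeric sketch (the centres and the data point below are made up for the illustration):

import numpy as np

centres = np.array([[0.0, 0.0], [3.0, 3.0]])      # mu_y
x = np.array([1.0, 1.2])                          # a single data point x_i
for eps in [10.0, 1.0, 0.1, 0.01]:
    logits = -np.sum((x - centres) ** 2, axis=1) / (2 * eps)   # uniform P(y) cancels
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    print(eps, alpha)   # responsibilities concentrate on the nearest centre as eps shrinks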

INCLUDING KNOWN PRIORS

EM can also be used to find MAP solutions for models with a defined prior P(θ):

    P(\theta \mid X) = \frac{P(X \mid \theta) \, P(\theta)}{P(X)} \propto P(X \mid \theta) \, P(\theta) = P(X, \theta).

The optimized lower bound is then

    F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y, \theta)}{\alpha(x_i, y)}.

E-step: calculate

    \alpha^{t}(x_i, y) = \frac{P(x_i, y, \theta^{t})}{\sum_{y' \in Y} P(x_i, y', \theta^{t})} = P(y \mid x_i, \theta^{t}).

M-step: calculate

    \theta^{t+1} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} \big[ \log P(x_i, y, \theta) \big].
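
As one concrete illustration, not taken from the slides: putting a Dirichlet prior on the mixing proportions of a GMM changes only the corresponding M-step update, which acquires pseudo-counts (a sketch, assuming the prior below):

% Dirichlet prior P(\pi) \propto \prod_{y} \pi_y^{\,a_y - 1} on the mixing proportions.
% The MAP M-step for \pi then becomes
\pi_y^{t+1} \;=\; \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \;+\; a_y - 1}
                       {\,n \;+\; \sum_{y'} \bigl(a_{y'} - 1\bigr)\,},
% i.e. the ML update P^{t+1}(y) = \frac{1}{n}\sum_i \alpha^{t}(x_i, y) with the
% pseudo-counts a_y - 1 added; the mean and covariance updates are unchanged.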
