Statistical Pattern Recognition


Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2013 http://ce.sharif.edu/courses/9-92/2/ce725-/

Agenda: Expectation-Maximization (EM) overview, EM applications, EM algorithm, EM examples, mixture models, Gaussian mixtures.

Expectation-Maximization (EM) The EM algorithm is a general technique for finding maximum likelihood estimates when part of the data is missing (unobserved). EM is perhaps the most often used, and most often only half understood, algorithm for unsupervised learning. It is very intuitive: many people rely on their intuition to apply the algorithm in different problem domains. The EM algorithm estimates the parameters of a model iteratively. Starting from some initial guess, each iteration consists of an Expectation step and a Maximization step.

Missing Data Problem Occurs whenever part of the data is unknown: either intrinsically inaccessible (example: which component does a data point belong to in a mixture model?) or lost / erroneous (example: some faulty or noisy process has generated the data). If the missing data is correlated in any way with the observed data, we can hope to extract information about the missing data from the observed data. If the missing data is independent of the observed data, everything is lost.

EM Applications Application examples. PoS (Part-of-Speech) tagging: the complete data is a sentence (a sequence of words) and a corresponding sequence of PoS tags; the observed data is the sentence; the unobserved data is the sequence of tags; the model is an HMM with transition/emission probability tables. Model building with partial observations (the example we discuss today): our goal is to build a probabilistic model whose parameters can be estimated from a set of training examples x_1, x_2, ..., x_n, where the x_i are i.i.d. (independently and identically distributed). Unfortunately, we only get to observe part of each training example: x_i = (x_i^o, x_i^u) and we can only observe x_i^o. How do we build the model?

EM Applications More applications: filling in missing data in samples; discovering the value of latent variables; estimating the parameters of HMMs; estimating the parameters of finite mixtures; unsupervised learning of clusters.

EM Algorithm General Idea Given a set of incomplete (observed) data, assume the observed data come from a specific model, then iterate the following steps until convergence. Expectation step: using the current parameters of the model, guess the missing (latent / unobserved) data. Maximization step: from the guessed missing data and the observed data, find the most likely parameters. [Diagram: an initial guess of the unknown parameters, combined with the observed structure, yields a guess of the unknown hidden structure (E step); the hidden-structure guess yields a new guess of the unknown parameters (M step).]

EM Algorithm Assumptions: suppose that the observations are X, the latent data are Z, and the unknown parameters are θ. Initialization: initialize the parameters θ to some random value. General algorithm. E step: compute the best estimate of Z given the current parameter values. M step: use the just-computed values of Z to compute a better estimate for the parameters.
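
To make this loop concrete, here is a minimal Python sketch of the generic E/M iteration. The functions e_step, m_step, and log_likelihood are hypothetical placeholders that a specific model (such as the Gaussian mixture later in this lecture) would have to supply.

```python
import numpy as np

def em(X, theta_init, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until the log-likelihood stalls.

    e_step(X, theta)         -> expected latent data Z (e.g. responsibilities)
    m_step(X, Z)             -> updated parameters theta
    log_likelihood(X, theta) -> scalar ln P(X | theta), used only for monitoring
    """
    theta = theta_init
    prev_ll = -np.inf
    for _ in range(max_iter):
        Z = e_step(X, theta)              # E step: guess the hidden structure
        theta = m_step(X, Z)              # M step: re-estimate the parameters
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:            # EM never decreases the likelihood
            break
        prev_ll = ll
    return theta
```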

EM Algorithm Derivation We want to maximize P(X | θ). By definition (q(Z) is a distribution over the latent data), the log-likelihood decomposes as

\ln P(X \mid \theta) = L(q, \theta) + KL(q \| p)

where

L(q, \theta) = \sum_{Z} q(Z) \ln \frac{P(X, Z \mid \theta)}{q(Z)}, \qquad KL(q \| p) = -\sum_{Z} q(Z) \ln \frac{P(Z \mid X, \theta)}{q(Z)}

EM Algorithm Derivation (cont'd) KL(q \| p) is the Kullback-Leibler divergence between q(Z) and P(Z | X, θ); hence

KL(q \| p) \ge 0, \quad \text{and} \quad KL(q \| p) = 0 \iff q(Z) = P(Z \mid X, \theta)

We therefore maximize P(X | θ) by maximizing the lower bound L(q, θ). Algorithm. E step: L(q, \theta^{old}) is maximized with respect to q(Z) by substituting q(Z) with P(Z \mid X, \theta^{old}). M step: q(Z) is held fixed and L(q, \theta) is maximized with respect to θ.

EM Algorithm A simple example: maximum likelihood. Assume that after an exam in the class we have these grades:

Grade:         A  B  C  D
# of students: a  b  c  d

And suppose that we know: P(A) = 1/2, P(B) = μ, P(C) = 2μ, P(D) = 1/2 − 3μ. What is the maximum likelihood estimate of μ?

p(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}

\ln p(a, b, c, d \mid \mu) = \ln K + a \ln \tfrac{1}{2} + b \ln \mu + c \ln 2\mu + d \ln\left(\tfrac{1}{2} - 3\mu\right)

\frac{\partial}{\partial \mu} \ln p = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0 \quad\Longrightarrow\quad \mu = \frac{b + c}{6(b + c + d)}
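
A quick numeric sanity check of this closed-form answer; the grade counts below are made up for illustration, not taken from the lecture.

```python
import numpy as np

# Check that mu* = (b + c) / (6 (b + c + d)) maximizes the log-likelihood.
a, b, c, d = 14, 6, 9, 10
mu_star = (b + c) / (6 * (b + c + d))          # 0.1

def log_lik(mu):
    # log p(a, b, c, d | mu) up to the constant ln K
    return a*np.log(0.5) + b*np.log(mu) + c*np.log(2*mu) + d*np.log(0.5 - 3*mu)

grid = np.linspace(1e-4, 1/6 - 1e-4, 10_000)   # mu must satisfy 0 < mu < 1/6
assert np.isclose(grid[np.argmax(log_lik(grid))], mu_star, atol=1e-3)
```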

EM Algorithm A simple example: hidden information. As before, P(A) = 1/2, P(B) = μ, P(C) = 2μ, P(D) = 1/2 − 3μ, but now suppose we only know that: the number of high grades (A's + B's) = h, the number of C's = c, and the number of D's = d. What is the maximum likelihood estimate of μ now? Expectation: if we knew the value of μ, we could compute the expected values of a and b:

a = \frac{\tfrac{1}{2}}{\tfrac{1}{2} + \mu}\, h, \qquad b = \frac{\mu}{\tfrac{1}{2} + \mu}\, h

EM Algorithm A simple example: hidden information (cont.). Maximization: with P(A) = 1/2, P(B) = μ, P(C) = 2μ, P(D) = 1/2 − 3μ as before, if we knew the expected values of a and b we could compute the maximum likelihood value of μ as before. So we begin with a first estimate for μ and iterate between expectation and maximization to improve our estimates of μ, a, and b:

\mu^{(0)} = \text{initial guess for } \mu

b^{(t)} = E\!\left[b \mid \mu^{(t)}\right] = \frac{\mu^{(t)} h}{\tfrac{1}{2} + \mu^{(t)}} \quad \text{(E step)}

\mu^{(t+1)} = \frac{b^{(t)} + c}{6\left(b^{(t)} + c + d\right)} \quad \text{(M step: ML estimate of } \mu \text{ given } b^{(t)}\text{)}
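
A minimal sketch of this two-line EM loop in Python; the values of h, c, and d below are made up for illustration.

```python
# EM for the grades example: only h = a + b, c and d are observed.
h, c, d = 20, 9, 10
mu = 0.05                                   # mu^(0): initial guess, 0 < mu < 1/6

for t in range(50):
    b = mu * h / (0.5 + mu)                 # E step: expected number of B grades
    mu_new = (b + c) / (6 * (b + c + d))    # M step: ML estimate of mu given b
    if abs(mu_new - mu) < 1e-10:
        break
    mu = mu_new

print(mu)   # converged estimate of mu
```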

EM Algorithm Another example: K-means clustering. Goal: represent a data set {x_1, ..., x_N} in terms of K clusters, each of which is summarized by a prototype μ_k. Initialize the prototypes, then iterate between two phases. E step: assign each data point to its nearest prototype. M step: update each prototype to be the mean of its cluster. The simplest version is based on Euclidean distance. HW: derive the EM equations for the k-means algorithm, i.e. P(X_u \mid X_o, \theta^{(t)}) and \Phi(\theta, \theta^{(t)}).
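
A minimal numpy sketch of the k-means alternation described above (an illustrative implementation, not code from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """K-means as hard EM: the E step assigns points to the nearest prototype,
    the M step moves each prototype to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]      # initialize prototypes
    for _ in range(n_iter):
        # E step: index of the nearest prototype for every point (Euclidean distance)
        z = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # M step: each prototype becomes the mean of its cluster
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```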

Mixture Models Mixture density model estimation. Models the data with a mixture density

P(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j), \qquad \theta = \{\theta_1, \ldots, \theta_m\}, \qquad P(c_1) + \ldots + P(c_m) = 1

To generate a sample from the distribution P(x | θ): first select class j with probability P(c_j), then generate x according to p(x | c_j, θ_j). Mixtures provide a framework for building more complex probability distributions and can be used to cluster data (how?).
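
A short sketch of this two-stage sampling procedure for a one-dimensional Gaussian mixture; the mixture parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D Gaussian mixture: P(c_j), mu_j, sigma_j (made-up parameters).
weights = np.array([0.5, 0.3, 0.2])
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # Step 1: select a component j with probability P(c_j).
    j = rng.choice(len(weights), size=n, p=weights)
    # Step 2: generate x from the selected component p(x | c_j, theta_j).
    return rng.normal(means[j], stds[j])

x = sample_mixture(1000)
```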

Gaussian Mixtures Linear superposition of Gaussians:

P(x) = \sum_{k=1}^{K} P(c_k)\, \mathcal{N}(x \mid \mu_k, \Sigma_k)

Normalization and positivity require

\sum_{k=1}^{K} P(c_k) = 1, \qquad 0 \le P(c_k) \le 1

Example: mixture of 3 Gaussians. [Figure: the three components shown separated and then mixed.]

Gaussian Mixtures Fitting the Gaussian mixture model. The goal: given the data set, find the corresponding parameters: mixing coefficients (or prior probabilities), means, and covariances. If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster. Problem: the data set is unlabelled. We'll refer to the labels as latent (= hidden) variables. [Figure: a synthetic data set without labels.]

Gaussian Mixtures Maximum likelihood for the GMM. The log-likelihood function takes the form

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)

Note: the sum over components appears inside the log, so there is no closed-form solution for maximum likelihood. Then how do we maximize the log-likelihood? Using the EM algorithm.

Gaussian Mixtures EM Algorithm Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k, then repeat the following steps until convergence. E step: evaluate the z_ij (latent variables) using the current parameter values; z_ij is a binary variable which is 1 if x_i is drawn from the j-th component, and its expectation (responsibility) is

z_{ij} \leftarrow p(c_j \mid x_i) = \frac{p(c_j)\, p(x_i \mid c_j)}{p(x_i)} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}

M step: re-estimate the parameters using the current z_ij (equations on the next slides).

Gaussian Mixtures EM algorithm M step Let us proceed by simply differentiating the log-likelihood. Setting the derivative with respect to μ_k equal to zero gives

\sum_{i=1}^{N} \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1} (x_i - \mu_k) = \sum_{i=1}^{N} z_{ik}\, \Sigma_k^{-1} (x_i - \mu_k) = 0

(we suppose that the z_ik values are known in the M step). Multiplying both sides by Σ_k gives

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk}\, x_n, \qquad N_k = \sum_{n=1}^{N} z_{nk}

which is simply the weighted mean of the data. Similarly for the covariances, we obtain

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T

Note that the condition requiring the mixing coefficients to sum to 1 must be satisfied when maximizing the log-likelihood with respect to the π_k; we therefore use the Lagrange multiplier method, as shown on the next slide.

Gaussian Mixtures EM algorithm M step Estimating the π_k: using the Lagrange multiplier method, we must maximize the following quantity

\ln P(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)

which gives

\sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0

Multiplying both sides by π_k and summing over k, we find λ = −N. So π_k = N_k / N.

Gaussian Mixtures EM algorithm Latent variable view to obtain the M step estimates. We have:

P(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}, \qquad P(z_{nk} = 1) = \pi_k

P(x_n \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad P(x_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}

Then,

p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}

\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

Keeping the z_nk fixed and maximizing with respect to the parameters gives the previous results:

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N}
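
Putting the E and M steps together, here is a minimal numpy sketch of EM for a Gaussian mixture. It is an illustrative implementation of the updates above, not code from the lecture, and it uses scipy's multivariate_normal for the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture, implementing the updates derived above."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                  # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]              # means
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)    # covariances

    for _ in range(n_iter):
        # E step: responsibilities z_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j (...)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                         for k in range(K)], axis=1)          # shape (N, K)
        z = dens / dens.sum(axis=1, keepdims=True)

        # M step: weighted means, covariances and mixing coefficients
        Nk = z.sum(axis=0)                                    # shape (K,)
        mu = (z.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (z[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, sigma, z
```

For numerical stability one would normally work with log densities (e.g. scipy.special.logsumexp) rather than raw probabilities, but the plain form above matches the slide equations directly.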

Gaussian Mixtures Example: mixture of two Gaussians. [Figure: after 20 cycles the algorithm is close to convergence.]

Any Questions? End of Lecture 4. Thank you! Spring 2013 http://ce.sharif.edu/courses/9-92/2/ce725-/