Statistical Pattern Recognition


Statistical Pattern Recognition: Expectation Maximization (EM) and Mixture Models. Hamid R. Rabiee, Jafar Muhammadi, Mohammad J. Hosseini. Spring 2014. http://ce.sharif.edu/courses/92-93/2/ce725-2

Agenda: Expectation-Maximization (EM) Overview; EM Applications; EM Algorithm; EM Examples; Mixture Models; Gaussian Mixtures.

Expectation-Maximization (EM). The EM algorithm is a general technique for finding maximum likelihood estimators under missing (unobserved) data. EM is perhaps the most often used, and most often half-understood, algorithm for unsupervised learning. It is very intuitive, and many people rely on their intuition to apply the algorithm in different problem domains. The EM algorithm estimates the parameters of a model iteratively: starting from some initial guess, each iteration consists of an Expectation step and a Maximization step.

Missing Data Problem. Occurs whenever part of the data is unknown: either the data is intrinsically inaccessible (example: which component does a data point belong to in a mixture model?) or the data is lost / erroneous (example: some faulty / noisy process has generated the data). If the missing data is correlated in any way with the observed data, we can hope to extract information about the missing data from the observed. If the missing data is independent of the observed data, everything is lost.

EM Applications: Application Examples. PoS (Part of Speech) tagging. Complete data: a sentence (a sequence of words) and a corresponding sequence of PoS tags. Observed data: the sentence. Unobserved data: the sequence of tags. Model: an HMM with transition/emission probability tables. Model building with partial observations (we'll discuss this example today). Our goal is to build a probabilistic model whose parameters can be estimated from a set of training examples $x_1, x_2, \dots, x_n$, where the $x_i$ are i.i.d. (independently and identically distributed). Unfortunately, we only get to observe part of each training example: $x_i = (x_i^o, x_i^u)$, and we can only observe $x_i^o$. How do we build the model?

EM Applications: More Applications. Filling in missing data in samples; discovering the value of latent variables; estimating the parameters of HMMs; estimating the parameters of finite mixtures; unsupervised learning of clusters.

EM Algorithm: General Idea. Given a set of incomplete (observed) data, assume the observed data come from a specific model, and iterate the following steps until convergence. Expectation step: using some parameter values for that model, guess the missing (latent / unobserved) data. Maximization step: from the missing data and the observed data, find the most likely parameters. (Diagram: starting from an initial guess, the E step maps the current guess of the unknown parameters to a guess of the unknown hidden structure, and the M step maps the guessed hidden structure plus the observed structure back to updated parameters.)

EM Algorithm. Assumptions: suppose the observations are X, the latent data are Z, and the unknown parameters are θ. Initialization: initialize θ to some random value. General algorithm. E step: compute the best structure for Z given the current parameter values. M step: use the just-computed values of Z to compute a better estimate of the parameters.

EM Algorithm: Intuition. Consider a model $p(X \mid \theta)$. The MLE of $\theta$ can be found as $\hat{\theta} = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \log p(X \mid \theta)$. Sometimes there is a hidden variable $Z$, so the model is $p(X, Z \mid \theta)$, and marginalizing over the latent variable $Z$ yields $p(X \mid \theta) = \sum_z p(X, z \mid \theta)$. Now we can use MLE as before, but we need to perform the above summation, which may be computationally intractable. EM was proposed to address this issue.

EM Algorithm. Initialize $\theta^0$ with some random value in the domain of $\theta$. For $t = 1, 2, \dots$ repeat: E-step: compute the posterior distribution of $Z$ given $X$ and $\theta^{t-1}$, $q^t(Z) = p(Z \mid X; \theta^{t-1})$. M-step: find the optimal $\theta^t$ by maximizing the expectation of the complete log-likelihood with respect to $q^t(Z)$: $\theta^t = \arg\max_\theta \mathbb{E}_{q^t(Z)}\left[\log p(X, Z \mid \theta)\right] = \arg\max_\theta \sum_z q^t(z) \log p(X, z \mid \theta)$. The computation of this sum can often be greatly simplified by taking advantage of independence. The iterations stop when some convergence criterion is met, for example when the difference between $\theta^t$ and $\theta^{t-1}$ is below some threshold.
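As a generic illustration of this loop, here is a minimal Python sketch (not from the original slides). The functions `e_step` and `m_step` are hypothetical placeholders that a concrete model, such as the Gaussian mixture discussed later, would supply; θ is assumed to be a flat numpy array of parameters.

```python
import numpy as np

def run_em(X, theta0, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until theta stops changing.

    e_step(X, theta) should return q, the posterior over the latent
    variables given the data and the current parameters.
    m_step(X, q) should return the theta that maximizes the expected
    complete-data log-likelihood under q.
    """
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        q = e_step(X, theta)           # E-step: q^t(Z) = p(Z | X; theta^{t-1})
        new_theta = m_step(X, q)       # M-step: argmax_theta E_q[log p(X, Z | theta)]
        if np.max(np.abs(new_theta - theta)) < tol:   # convergence criterion
            return new_theta
        theta = new_theta
    return theta
```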

EM Algorithm: A Simple Example, Maximum Likelihood. Assume that after an exam the numbers of students with grades A, B, C, D are a, b, c, d respectively, and suppose that we know P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ. What is the maximum likelihood estimate of μ? $p(a,b,c,d \mid \mu) = K \left(\tfrac{1}{2}\right)^a \mu^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d$, so $\ln p(a,b,c,d \mid \mu) = \ln K + a\ln\tfrac{1}{2} + b\ln\mu + c\ln 2\mu + d\ln\!\left(\tfrac{1}{2} - 3\mu\right)$. Setting $\frac{\partial \ln p}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0$ gives $\mu = \frac{b+c}{6\,(b+c+d)}$.
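As a quick sanity check of this closed form (with made-up counts, not from the slides), the estimate can also be verified numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical grade counts, chosen only to illustrate the formula.
a, b, c, d = 14, 6, 9, 10

# Closed-form MLE derived above: mu = (b + c) / (6 (b + c + d))
mu_closed = (b + c) / (6 * (b + c + d))

# Numerical check: maximize the log-likelihood over the valid range 0 < mu < 1/6
# (the a*ln(1/2) term is constant in mu and can be dropped).
def neg_log_lik(mu):
    return -(b * np.log(mu) + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1/6 - 1e-6), method="bounded")
print(mu_closed, res.x)  # the two estimates should agree closely
```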

EM Algorithm: A Simple Example, Hidden Information. Suppose, as before, that P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ, but now we only observe the number of high grades (A's + B's) = h, the number of C's = c, and the number of D's = d. What is the maximum likelihood estimate of μ now? Expectation: if we knew the value of μ, we could compute the expected values of a and b: $a = \frac{1/2}{\tfrac{1}{2} + \mu}\, h$ and $b = \frac{\mu}{\tfrac{1}{2} + \mu}\, h$.

EM Algorithm: A Simple Example, Hidden Information (cont.). Maximization: if we knew the expected values of a and b, we could compute the maximum likelihood value of μ as before. So we begin with a first estimate for μ and iterate between expectation and maximization to improve our estimates of μ, a and b: $\mu^{(0)}$ = initial guess; $b^{(t)} = \mathbb{E}\!\left[b \mid \mu^{(t)}\right] = \frac{\mu^{(t)} h}{\tfrac{1}{2} + \mu^{(t)}}$; $\mu^{(t+1)}$ = ML estimate of $\mu$ given $b^{(t)}$ = $\frac{b^{(t)} + c}{6\,(b^{(t)} + c + d)}$.
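A minimal sketch of this iteration in Python; the observed counts h, c, d and the initial guess are made up for illustration:

```python
# EM for the grades example: observe h = a + b, c, d; estimate mu.
h, c, d = 20, 9, 10   # hypothetical observed counts, not from the slides
mu = 0.05             # initial guess mu^(0)

for t in range(50):
    # E-step: expected number of B grades given the current mu.
    b = mu * h / (0.5 + mu)
    # M-step: closed-form ML estimate of mu given the expected b.
    new_mu = (b + c) / (6 * (b + c + d))
    if abs(new_mu - mu) < 1e-10:
        mu = new_mu
        break
    mu = new_mu

print(mu)  # converged estimate of mu
```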

EM Algorithm: Another Example, K-means Clustering. Goal: represent a data set $\{x_1, \dots, x_N\}$ in terms of K clusters, each of which is summarized by a prototype $\mu_k$. Initialize the prototypes, then iterate between two phases. E-step: assign each data point to the nearest prototype. M-step: update the prototypes to be the cluster means. The simplest version is based on Euclidean distance. HW: derive the EM equations, $P(X^u \mid X^o, \theta^{(t)})$ and $Q(\theta, \theta^{(t)})$, for the k-means algorithm.
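A minimal numpy sketch of this E-step / M-step alternation (an illustrative implementation, not the course's reference code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Simple K-means: X is (N, D); returns prototypes (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize prototypes
    for _ in range(n_iter):
        # E-step: assign each point to the nearest prototype (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```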

Mixture Models: Mixture Density Model Estimation. Models the data with a mixture density $P(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$, where $\theta = \{\theta_1, \dots, \theta_m\}$ and $P(c_1) + \dots + P(c_m) = 1$. To generate a sample from the distribution $P(X \mid \theta)$, first select class $j$ with probability $P(c_j)$, then generate $x$ according to $p(x \mid c_j, \theta_j)$. Mixtures provide a framework for building more complex probability distributions and can be used to cluster data (how?).
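This two-stage generative process is easy to sketch in code. Here is an illustrative example that samples from a 1-D Gaussian mixture; the mixture weights, means, and standard deviations are made-up values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture: P(c_j), component means and std deviations.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

N = 1000
# Step 1: pick a component j for each sample with probability P(c_j).
z = rng.choice(len(weights), size=N, p=weights)
# Step 2: draw x from the chosen component p(x | c_j, theta_j).
x = rng.normal(means[z], stds[z])
```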

Gaussian Mixtures. A linear superposition of Gaussians: $P(x) = \sum_{k=1}^{K} P(c_k)\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Normalization and positivity require $\sum_{k=1}^{K} P(c_k) = 1$ and $0 \le P(c_k) \le 1$. Example: a mixture of 3 Gaussians. (Figure: the separated components and the mixed density.)
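The superposition formula translates directly into code. A sketch using scipy's multivariate normal density; the mixture parameters below are illustrative values, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D mixture of 3 Gaussians.
pis = [0.4, 0.35, 0.25]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

def gmm_density(x):
    """P(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

print(gmm_density(np.array([1.0, 1.0])))
```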

Gaussian Mixtures: Fitting the Gaussian Mixture Model. The goal: given the data set, find the corresponding parameters: mixing coefficients (or prior probabilities), means, and covariances. If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster. Problem: the data set is unlabelled. We'll refer to the labels as latent (= hidden) variables. (Figure: a synthetic data set without labels.)

Gaussian Mixtures: Maximum Likelihood for the GMM. The log likelihood function takes the form $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$. Note: the sum over components appears inside the log, so there is no closed-form solution for maximum likelihood. Then how can we maximize the log likelihood? Using the EM algorithm.

Gaussian Mixtures: EM Algorithm. Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and repeat the following steps until convergence. E step: evaluate the $z_{ij}$'s (latent variables) using the current parameter values, where $z_{ij}$ is a binary variable which is 1 if $x_i$ is drawn from the $j$-th distribution: $z_{ij} \leftarrow p(c_j \mid x_i) = \frac{p(c_j)\, p(x_i \mid c_j)}{p(x_i)} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$. M step: re-estimate the parameters using the current $z_{ij}$'s (equations in the next slides).

Gaussian Mixtures: EM Algorithm, M Step. Let us proceed by simply differentiating the log likelihood. Setting the derivative with respect to $\mu_k$ equal to zero gives $\sum_{i=1}^{N} \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1}(x_i - \mu_k) = \sum_{i=1}^{N} z_{ik}\, \Sigma_k^{-1}(x_i - \mu_k) = 0$ (we suppose that the $z_{ik}$ values are known in the M step). Multiplying both sides by $\Sigma_k$ gives $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$, where $N_k = \sum_{n=1}^{N} z_{nk}$, which is simply the weighted mean of the data. Similarly, for the covariances we obtain $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$. Note that the condition requiring the mixing coefficients to sum to 1 must be satisfied when maximizing the log-likelihood with respect to the $\pi_k$; we therefore use the Lagrange multiplier method, as shown in the next slide.

Gaussian Mixtures: EM Algorithm, M Step. Estimating the $\pi_k$'s: using the Lagrange multiplier method, we must maximize $\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$, which gives $\sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0$. Multiplying both sides by $\pi_k$ and summing over $k$, we find $\lambda = -N$. So $\pi_k = \frac{N_k}{N}$.

Gaussian Mixtures: EM Algorithm. Latent variable view to obtain the M step estimates. We have $P(z_{nk} = 1) = \pi_k$, so $P(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$, and $P(x_n \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$, so $P(x_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$. Then $p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$, so $\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$. Keeping the $z_{nk}$'s fixed and maximizing with respect to the parameters gives the previous results: $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$, $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$, $\pi_k = \frac{N_k}{N}$.
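Putting the E step and M step together, here is a compact numpy sketch of EM for a Gaussian mixture (an illustrative implementation under the assumptions above, not the course's reference code; it assumes scipy is available and adds a small ridge to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: X is (N, D); returns (pi, mu, Sigma, Z)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]        # initialize means from data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities z_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        Z = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        Z /= Z.sum(axis=1, keepdims=True)

        # M-step: weighted means, covariances, and mixing coefficients.
        Nk = Z.sum(axis=0)                               # effective counts N_k
        mu = (Z.T @ X) / Nk[:, None]                     # mu_k = (1/N_k) sum_n z_nk x_n
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (Z[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N                                      # pi_k = N_k / N
    return pi, mu, Sigma, Z
```

For instance, running `gmm_em(x[:, None], K=3)` on the 1-D samples drawn in the earlier sampling sketch should roughly recover mixing coefficients, means, and variances close to the ones used to generate the data.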

Gaussian Mixtures: Example, Mixture of Two Gaussians. (Figure: after 20 cycles the algorithm is close to convergence.)

Any questions? End of Lecture 14. Thank you! Spring 2014, http://ce.sharif.edu/courses/92-93/2/ce725-2