Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin


Recall: The Supervised Learning Problem. Given a set of n samples X = {(x_i, y_i)}, i = 1, ..., n (Chapter 3 of DHS). Assume the examples in each class come from a parameterized Gaussian density. Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification. Estimation uses the Maximum Likelihood approach.

Review of Maximum Likelihood. Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ. Goal: estimate θ, denoted θ̂, such that the observed data is most likely to have come from the distribution with that θ. Steps involved: write the likelihood of the observed data; maximize the likelihood with respect to the parameter.
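
For a Gaussian density both steps have a closed-form answer: the log-likelihood is maximized by the sample mean and the (biased) sample variance. A minimal numpy sketch, assuming 1D i.i.d. data (the array name and sample size are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)   # illustrative i.i.d. sample

# ML estimates for a 1D Gaussian: the maximizers of sum_j log N(x_j | mu, sigma^2)
mu_hat = x.mean()                              # sample mean
var_hat = ((x - mu_hat) ** 2).mean()           # biased sample variance (divides by n, not n-1)

print(mu_hat, var_hat)
```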

Example: 1D Gaussian Distribution. [Figure: maximum likelihood estimation of the mean of a Gaussian distribution; the true density compared with MLE fits from two sample sizes.]

Example: 2D Gaussian Distribution. [Figure: 2D parameter estimation for a Gaussian distribution; blue: true density, red: estimated from 50 examples.]

Multimodal Class Distributions. A single Gaussian may not accurately model the classes. Find subclasses in handwritten online characters (thousands of characters written by on the order of a hundred writers). Performance improves by modeling the subclasses. Connell and Jain, Writer Adaptation for Online Handwriting Recognition, IEEE PAMI, March 2002.

Multimodal Classes. Handwritten 'f' vs. 'y' classification task. A single Gaussian distribution may not model the classes accurately.

An extreme example of multimodal classes: limitations of unimodal class modeling. [Figure: red vs. blue classification in the (x, y) plane.] The classes are well separated; however, incorrect model assumptions result in high classification error. The red class is a mixture of two Gaussian distributions, and there is no class label information when modeling the density of just the red class.

Finite mixtures. k random sources with probability density functions f_i(x), i = 1, ..., k. [Diagram: one of the sources f_1(x), ..., f_k(x) is chosen at random, and the random variable X is drawn from it.]

Finite mixtures. Example: 3 species (Iris).

Finite mixtures. Choose a source at random, with Prob(source i) = α_i, and draw the random variable X from that source's density. Conditional: f(x | source i) = f_i(x). Joint: f(x and source i) = f_i(x) α_i. Unconditional: f(x) = Σ_{all sources i} f(x and source i) = Σ_{i=1}^{k} α_i f_i(x).

Finite mixtures. f(x) = Σ_{i=1}^{k} α_i f_i(x), where the f_i(x) are the component densities and the mixing probabilities satisfy α_i ≥ 0 and Σ_{i=1}^{k} α_i = 1. Parameterized components (e.g., Gaussian): f_i(x) = f(x | θ_i), so f(x | Θ) = Σ_{i=1}^{k} α_i f(x | θ_i), with Θ = {θ_1, θ_2, ..., θ_k, α_1, α_2, ..., α_k}.
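
As a concrete illustration of f(x | Θ) = Σ_i α_i f(x | θ_i), the sketch below evaluates a 1D Gaussian mixture density on a grid; the parameter values are made up for the example, not taken from the slides:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1D Gaussian density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Illustrative mixture parameters Theta = {mu_i, var_i, alpha_i}
alphas = np.array([0.6, 0.3, 0.1])     # mixing probabilities: non-negative, summing to 1
mus    = np.array([2.0, 4.0, 7.0])
vars_  = np.array([1.0, 0.5, 0.2])

xs = np.linspace(0.0, 9.0, 200)
# f(x | Theta) = sum_i alpha_i N(x | mu_i, var_i)
fx = sum(a * gauss_pdf(xs, m, v) for a, m, v in zip(alphas, mus, vars_))
```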

Gaussian mixtures: f(x | θ_i) is Gaussian. Arbitrary covariances: f(x | θ_i) = N(x | µ_i, C_i), with Θ = {µ_1, ..., µ_k, C_1, ..., C_k, α_1, ..., α_k}. Common covariance: f(x | θ_i) = N(x | µ_i, C), with Θ = {µ_1, ..., µ_k, C, α_1, ..., α_k}.

Mixture fitting / estimation. Data: n independent observations, x = {x^(1), x^(2), ..., x^(n)}. Goals: estimate the parameter set Θ, and possibly classify the observations. Example: how many species are there, and what is the mean of each? Which points belong to each species? [Figure: observed data vs. the same data classified (classes unknown).]

Gaussian mixtures (1D), an example. [Figure: density of a three-component 1D Gaussian mixture with mixing probabilities α_1 = 0.6, α_2 = 0.3, α_3 = 0.1 and differing means and variances.]

Gaussian mixtures, an R² example with k = 3 components. [Figure: 500 points sampled from a three-component 2D Gaussian mixture; the slide lists the means µ_1, µ_2, µ_3 and covariance matrices C_1, C_2, C_3.]
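
The generative story behind such a figure (pick a component with probability α_i, then draw from that component's Gaussian) is easy to simulate. A minimal sketch with made-up means and covariances, assuming 500 points as on the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2D, 3-component Gaussian mixture (parameter values are made up)
alphas = np.array([0.4, 0.3, 0.3])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([8.0, 0.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]]), np.eye(2)]

n = 500
labels = rng.choice(len(alphas), size=n, p=alphas)                 # choose a source at random
points = np.array([rng.multivariate_normal(mus[z], covs[z]) for z in labels])
```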

Uses of mixtures in pattern recognition. Unsupervised learning (model-based clustering): each component models one cluster, so clustering = mixture fitting. Observations: unclassified points. Goals: find the classes and classify the points.

Uses of mixtures in pattern recognition. Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning. Example: two strongly non-Gaussian classes; use mixtures to model each class-conditional density.

Uses of mixtures in pattern recognition. Find subclasses (lexemes), e.g., in online characters. Performance improves by modeling the subclasses (thousands of characters written by on the order of a hundred writers). Connell and Jain, Writer Adaptation for Online Handwriting Recognition, IEEE PAMI, 2002.

Fitting mixtures. n independent observations x = {x^(1), x^(2), ..., x^(n)}. Maximum (log-)likelihood (ML) estimate of Θ: Θ̂ = arg max_Θ L(x, Θ), where L(x, Θ) = log Π_{j=1}^{n} f(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i f(x^(j) | θ_i). The mixture ML estimate has no closed-form solution.
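
Although it cannot be maximized in closed form, this log-likelihood is straightforward to evaluate numerically. A sketch for 1D data (the function name and parameterization are my own; in practice the inner sum is usually computed with a log-sum-exp rearrangement for numerical stability):

```python
import numpy as np

def mixture_loglik(x, alphas, mus, vars_):
    """L(x, Theta) = sum_j log sum_i alpha_i N(x_j | mu_i, var_i), for 1D data x."""
    x = np.asarray(x)[:, None]                                # shape (n, 1), broadcasts against (k,)
    comp = alphas * np.exp(-(x - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    return np.log(comp.sum(axis=1)).sum()                     # sum over j of log sum over i
```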

Gaussian mixtures: a peculiar type of ML problem. Θ = {µ_1, ..., µ_k, C_1, ..., C_k, α_1, ..., α_k}. Maximum (log-)likelihood estimate: Θ̂ = arg max_Θ L(x, Θ), subject to: each C_i positive definite, α_i ≥ 0, and Σ_{i=1}^{k} α_i = 1. Problem: the likelihood function is unbounded as det(C_i) → 0, so there is no global maximum. Unusual goal: a good local maximum.

A peculiar type of ML problem. Example: a 2-component Gaussian mixture, f(x | µ_1, µ_2, σ, α) = [α / (√(2π) σ)] e^{−(x−µ_1)² / (2σ²)} + [(1−α) / √(2π)] e^{−(x−µ_2)² / 2}. Given data points {x_1, x_2, ..., x_n}, set µ_1 = x_1; then L(x, Θ) = log[ α / (√(2π) σ) + ((1−α) / √(2π)) e^{−(x_1−µ_2)² / 2} ] + Σ_{j=2}^{n} log(...) → ∞ as σ → 0.

Fitting mixtures: a missing data problem. The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm. Missing data problem: Observed data: x = {x^(1), x^(2), ..., x^(n)}. Missing data: z = {z^(1), z^(2), ..., z^(n)}, the missing labels ("colors"). Each z^(j) = [z_1^(j), z_2^(j), ..., z_k^(j)] = [0 ... 0 1 0 ... 0]^T, with the 1 at position i meaning that x^(j) was generated by component i.

Fitting mixtures: a missing data problem. Observed data: x = {x^(1), ..., x^(n)}. Missing data: z = {z^(1), ..., z^(n)}, with z^(j) = [z_1^(j), ..., z_k^(j)] containing k−1 zeros and a single one. Complete log-likelihood function: L_c(x, z, Θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log[α_i f(x^(j) | θ_i)] = Σ_{j=1}^{n} log f(x^(j), z^(j) | Θ). In the presence of both x and z, Θ would be easy to estimate, but z is missing.

The EM algorithm. Iterative procedure: Θ̂^(0), Θ̂^(1), ..., Θ̂^(t), Θ̂^(t+1), ... Under mild conditions, Θ̂^(t) converges to a local maximum of L(x, Θ) as t → ∞. The E-step: compute the expected value of L_c(x, z, Θ): Q(Θ, Θ̂^(t)) = E[L_c(x, z, Θ) | x, Θ̂^(t)]. The M-step: update the parameter estimates: Θ̂^(t+1) = arg max_Θ Q(Θ, Θ̂^(t)).

The EM algorithm: the Gaussian case. The E-step: Q(Θ, Θ̂^(t)) = E[L_c(x, z, Θ) | x, Θ̂^(t)] = L_c(x, E[z | x, Θ̂^(t)], Θ), because L_c(x, z, Θ) is linear in z. Since z_i^(j) is a binary variable, Bayes' law gives E[z_i^(j) | x, Θ̂^(t)] = Pr{z_i^(j) = 1 | x^(j), Θ̂^(t)} = α̂_i^(t) f(x^(j) | θ̂_i^(t)) / Σ_{l=1}^{k} α̂_l^(t) f(x^(j) | θ̂_l^(t)) ≡ w_i^(j,t). Here w_i^(j,t) is the estimate, at iteration t, of the probability that x^(j) was produced by component i: a soft probabilistic assignment.

The EM algorithm: the Gaussian case. Result of the E-step: w_i^(j,t), the estimate at iteration t of the probability that x^(j) was produced by component i. The M-step:
α̂_i^(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j,t)
µ̂_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) x^(j) / Σ_{j=1}^{n} w_i^(j,t)
Ĉ_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) (x^(j) − µ̂_i^(t+1)) (x^(j) − µ̂_i^(t+1))^T / Σ_{j=1}^{n} w_i^(j,t)
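
A minimal numpy sketch of one EM iteration for a 1D Gaussian mixture, directly following the E-step and M-step formulas above (function and variable names are my own; a full implementation would also handle initialization, convergence checks, and the degenerate σ → 0 case):

```python
import numpy as np

def em_step(x, alphas, mus, vars_):
    """One EM iteration for a 1D Gaussian mixture; x has shape (n,)."""
    x = np.asarray(x)[:, None]                                  # shape (n, 1)
    # E-step: w[j, i] = alpha_i N(x_j | mu_i, var_i) / sum_l alpha_l N(x_j | mu_l, var_l)
    dens = np.exp(-(x - mus) ** 2 / (2 * vars_)) / np.sqrt(2 * np.pi * vars_)
    w = alphas * dens
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted updates of mixing probabilities, means, and variances
    n_i = w.sum(axis=0)                                         # effective number of points per component
    alphas_new = n_i / len(x)
    mus_new = (w * x).sum(axis=0) / n_i
    vars_new = (w * (x - mus_new) ** 2).sum(axis=0) / n_i
    return alphas_new, mus_new, vars_new
```

Iterating em_step from some initial guess until the log-likelihood stops increasing yields the kind of local maximum discussed below.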

Difficulties with EM. It is a local (greedy) algorithm (the likelihood never decreases), and it is initialization dependent. [Figure: two runs started from different initializations converge, after different numbers of iterations, to different solutions.]

Automatically deciding the number of components. Add a penalty term to the objective function which increases with the number of clusters. Start with a large number of clusters. Modify the M-step to include a "killer" criterion which removes components satisfying a certain condition. Finally, choose the number of components resulting in the largest objective function value (likelihood − penalty).
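
As an illustration of the "likelihood − penalty" objective (the slides do not commit to a specific penalty; a BIC-style term is one common choice, so treat this sketch as an assumption, not the authors' criterion):

```python
import numpy as np

def penalized_objective(loglik, k, n):
    """Log-likelihood minus a BIC-style penalty for a k-component 1D Gaussian mixture.

    Each component has a mean and a variance, plus there are k-1 free mixing
    probabilities, so the number of free parameters is 3k - 1."""
    num_params = 3 * k - 1
    return loglik - 0.5 * num_params * np.log(n)
```

Fitting mixtures for a range of k and keeping the one with the largest penalized objective is the simplest version of this idea; the approach described above instead starts with many components and removes them during EM.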

Example. Same as in [Ueda and Nakano, 1998].

Example. A four-component Gaussian mixture with identity covariances (C_i = I), fitted starting from a larger initial number of components (k_max); the extra components are eliminated during fitting. [Figure: the data and the evolution of the fitted components.]

Example. Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Example. An example with overlapping components.

The Iris (4-dimensional) data set: 3 components correctly identified.

Another supervised learning example. Problem: learn to classify textures from 9 Gabor features; four classes. Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture, and test on the remaining data. Error rates: mixture-based 0.0074, linear discriminant 0.085, quadratic discriminant 0.055.

Resulting decision regions. [Figure: 2-d projection of the texture data and the obtained mixtures.]

Properties of EM. EM is extremely popular because of the following properties: it is easy to implement; it guarantees that the likelihood increases monotonically (why?); and it guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?). Limitations of EM: the resulting solution depends highly on the initialization, and it can be slow in several cases compared to direct optimization methods (e.g., iterative scaling).

EM as lower bound optimization.
Start with an initial guess {θ_1^0, θ_2^0}.
Come up with a lower bound l(θ_1, θ_2) ≥ l(θ_1^0, θ_2^0) + Q(θ_1, θ_2), where Q(θ_1, θ_2) is a concave function with touch point Q(θ_1 = θ_1^0, θ_2 = θ_2^0) = 0, i.e., the bound touches l(θ_1, θ_2) at {θ_1^0, θ_2^0}.
Search for the solution {θ_1^1, θ_2^1} that maximizes Q(θ_1, θ_2).
Repeat the procedure, producing {θ_1^0, θ_2^0}, {θ_1^1, θ_2^1}, {θ_1^2, θ_2^2}, ...
Converge to a local optimum.
[Figures: the log-likelihood surface l(θ_1, θ_2), successive concave lower bounds touching it, and the sequence of touch points approaching the optimal point.]
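
The lower bound pictured here is the standard Jensen-inequality bound on the mixture log-likelihood. Written out for a single observation, for any distribution q over the missing label (q_i ≥ 0, Σ_i q_i = 1; the q notation is mine, and equality holds when q is the posterior computed in the E-step):

$$
\log f(x \mid \Theta)
  = \log \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i)
  = \log \sum_{i=1}^{k} q_i \,\frac{\alpha_i f(x \mid \theta_i)}{q_i}
  \;\ge\; \sum_{i=1}^{k} q_i \log \frac{\alpha_i f(x \mid \theta_i)}{q_i},
\qquad
q_i = \frac{\alpha_i f(x \mid \theta_i)}{\sum_{l=1}^{k} \alpha_l f(x \mid \theta_l)}
\text{ at the touch point.}
$$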

Summary. The Expectation-Maximization algorithm: E-step: compute the expected complete-data likelihood; M-step: maximize that likelihood to find the parameters. It can be used with any model with hidden (latent) variables; the hidden variables can be natural to the model or artificially introduced, which makes parameter estimation simpler and more efficient. The EM algorithm can be explained from many perspectives: bound optimization, proximal point optimization, etc. Several generalizations and specializations exist. It is easy to implement, and is widely used!