Mixture Models and EM

Goal: Introduction to probabilistic mixture models and the expectation-maximization (EM) algorithm.

Motivation:
- simultaneous fitting of multiple model instances
- unsupervised clustering of data
- coping with missing data
- segmentation? (... stay tuned)

Readings: Chapter 16 in Forsyth and Ponce.
Matlab Tutorials: modelselectiontut.m (optional)

2503: Mixture Models and EM, © D.J. Fleet & A.D. Jepson, 2005

Model Fitting: Density Estimation

Let's say we want to model the distribution of grey levels $d_k \equiv d(\vec{x}_k)$ at pixels $\{\vec{x}_k\}_{k=1}^K$ within some image region of interest.

Non-parametric model: compute a histogram.

Parametric model: fit an analytic density function to the data. For example, if we assume the samples were drawn from a Gaussian distribution, then we could fit a Gaussian density to the data by computing the sample mean and variance:
$$\mu = \frac{1}{K} \sum_k d_k, \qquad \sigma^2 = \frac{1}{K-1} \sum_k (d_k - \mu)^2$$

[Figure: the right plot shows a histogram of 150 IID samples drawn from the Gaussian density on the left (dashed); overlaid is the estimated Gaussian model (solid).]
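As a concrete illustration of the parametric fit above, here is a minimal NumPy sketch (my own, not from the notes) that estimates the sample mean and unbiased variance of synthetic grey-level data; the data-generating parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the grey levels d_k in a region of interest
# (assumed true mean 120, true std 15, K = 150 samples as in the notes' figure).
d = rng.normal(loc=120.0, scale=15.0, size=150)

K = d.size
mu = d.sum() / K                          # sample mean
sigma2 = ((d - mu) ** 2).sum() / (K - 1)  # unbiased sample variance

print(f"estimated mean = {mu:.2f}, estimated variance = {sigma2:.2f}")
```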

Model Fitting: Multiple Data Modes

When the data come from an image region with more than one dominant color, perhaps near an occlusion boundary, then a single Gaussian will not fit the data well.

Missing Data: If the assignment of measurements to the two modes were known, then we could easily solve for the means and variances using sample statistics, as before, but only incorporating those data assigned to their respective models.

Soft Assignments: But we don't know the assignments of pixels to the two Gaussians. So instead, let's infer them. Using Bayes' rule, the probability that $d_k$ is owned (i.e., generated) by model $M_n$ is
$$p(M_n \,|\, d_k) = \frac{p(d_k \,|\, M_n)\, p(M_n)}{p(d_k)}$$

Ownership (example)

Above we drew samples from two Gaussians in equal proportions, so $p(M_1) = p(M_2) = \frac{1}{2}$, and $p(d_k \,|\, M_n) = G(d_k;\, \mu_n, \sigma_n^2)$, where $G(d;\, \mu, \sigma^2)$ is a Gaussian pdf with mean $\mu$ and variance $\sigma^2$ evaluated at $d$. And remember $p(d_k) = \sum_n p(d_k \,|\, M_n)\, p(M_n)$.

So the ownerships, $q_n(d_k) \equiv p(M_n \,|\, d_k)$, then reduce to
$$q_1(d_k) = \frac{G(d_k;\, \mu_1, \sigma_1^2)}{G(d_k;\, \mu_1, \sigma_1^2) + G(d_k;\, \mu_2, \sigma_2^2)}, \qquad q_2(d_k) = 1 - q_1(d_k)$$

[Figure: ownership probabilities for the 2-component density.]

Then the Gaussian parameters are given by weighted sample statistics:
$$\mu_n = \frac{1}{S_n} \sum_k q_n(d_k)\, d_k, \qquad \sigma_n^2 = \frac{1}{S_n} \sum_k q_n(d_k)\,(d_k - \mu_n)^2, \qquad S_n = \sum_k q_n(d_k)$$
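A minimal sketch (mine, not from the notes) of the ownership computation for two equal-prior Gaussian components, followed by the weighted sample statistics; the component parameters and synthetic data are illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Two modes drawn in equal proportions (illustrative parameters).
d = np.concatenate([rng.normal(100, 10, 75), rng.normal(160, 12, 75)])

# Current parameter estimates (e.g., from an initial guess).
mu = np.array([ 90.0, 170.0])
sigma2 = np.array([200.0, 200.0])

# Ownership probabilities q_n(d_k) with equal priors p(M_1) = p(M_2) = 1/2.
g = np.stack([norm.pdf(d, mu[n], np.sqrt(sigma2[n])) for n in range(2)])  # (2, K)
q = g / g.sum(axis=0, keepdims=True)

# Weighted sample statistics for each component.
S = q.sum(axis=1)
mu_new = (q * d).sum(axis=1) / S
sigma2_new = (q * (d - mu_new[:, None]) ** 2).sum(axis=1) / S

print(mu_new, sigma2_new)
```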

Mixture Model

Assume N processes, $\{M_n\}_{n=1}^N$, each of which generates some data (or measurements). Each sample $d$ from process $M_n$ is IID with density $p_n(d \,|\, \vec a_n)$, where $\vec a_n$ denotes the parameters for process $M_n$. The proportion of the entire data set produced solely by $M_n$ is denoted $m_n \equiv p(M_n)$ (it's called a mixing probability).

Generative Process: First, randomly select one of the N processes according to the mixing probabilities, $\vec m \equiv (m_1, \ldots, m_N)$. Then, given $n$, generate a sample from the observation density $p_n(d \,|\, \vec a_n)$.

Mixture Model Likelihood: The probability of observing a datum $d$ from the collection of N processes is given by their linear mixture:
$$p(d \,|\, M) = \sum_{n=1}^N m_n\, p_n(d \,|\, \vec a_n)$$
The mixture model, $M$, comprises $\vec m$ and the parameters $\{\vec a_n\}_{n=1}^N$.

Mixture Model Inference: Given K IID measurements (the data), $\{d_k\}_{k=1}^K$, our goal is to estimate the mixture model parameters.

Remarks: One may also wish to estimate N and the parametric form of each component, but that's outside the scope of these notes.
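To make the likelihood expression concrete, here is a tiny sketch (mine; the number of components and all parameter values are hypothetical) evaluating $p(d \,|\, M)$ for a Gaussian mixture.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 3-component Gaussian mixture (placeholder values).
m = np.array([0.5, 0.3, 0.2])      # mixing probabilities, sum to 1
mu = np.array([0.0, 4.0, 9.0])
sigma = np.array([1.0, 1.5, 0.5])

def mixture_pdf(d):
    """p(d | M) = sum_n m_n * p_n(d | a_n) for a Gaussian mixture."""
    d = np.atleast_1d(d)[:, None]   # shape (K, 1) broadcast against (N,) components
    return (m * norm.pdf(d, mu, sigma)).sum(axis=1)

print(mixture_pdf([0.0, 4.0, 20.0]))
```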

Expectation-Maximization (EM) Algorithm

EM is an iterative algorithm for parameter estimation, especially useful when one formulates the estimation problem in terms of observed and missing data.
- Observed data are the K intensities.
- Missing data are the assignments of observations to model components, $z_n(d_k) \in \{0, 1\}$.

Each EM iteration comprises an E-step and an M-step:

E-Step: Compute the expected values of the missing data given the current model parameter estimate. For mixture models one can show this gives the ownership probability: $E[z_n(d_k)] = q_n(d_k)$.

M-Step: Compute ML model parameters given the observed data and the expected values of the missing data. For mixture models this yields a weighted regression problem for each model component,
$$\sum_{k=1}^K q_n(d_k)\, \frac{\partial}{\partial \vec a_n} \log p_n(d_k \,|\, \vec a_n) = 0,$$
and the mixing probabilities are $m_n = \frac{1}{K} \sum_{k=1}^K q_n(d_k)$.

Remarks:
- Each EM iteration can be shown to increase the likelihood of the observed data given the model parameters.
- EM converges to local maxima (not necessarily global maxima).
- An initial guess is required (e.g., random ownerships).
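Putting the E-step and M-step together, here is a compact EM loop for a 1D two-component Gaussian mixture. This is a sketch of my own; the component count, initialization, and fixed iteration count are assumptions, not prescriptions from the notes.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(d, n_iters=50):
    """EM for a 1D mixture of two Gaussians; returns (m, mu, sigma2)."""
    K = d.size
    # Crude initial guess (an arbitrary choice): split around the quartiles.
    mu = np.array([np.percentile(d, 25), np.percentile(d, 75)])
    sigma2 = np.full(2, d.var())
    m = np.array([0.5, 0.5])

    for _ in range(n_iters):
        # E-step: ownership probabilities q_n(d_k).
        g = m[:, None] * norm.pdf(d, mu[:, None], np.sqrt(sigma2)[:, None])  # (2, K)
        q = g / g.sum(axis=0, keepdims=True)

        # M-step: weighted ML updates of means, variances, and mixing probabilities.
        S = q.sum(axis=1)
        mu = (q * d).sum(axis=1) / S
        sigma2 = (q * (d - mu[:, None]) ** 2).sum(axis=1) / S
        m = S / K
    return m, mu, sigma2

rng = np.random.default_rng(2)
d = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 2, 300)])
print(em_gmm_1d(d))
```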

Derivation of EM for Mixture Models

The mixture model likelihood function is given by
$$p\big(\{d_k\}_{k=1}^K \,|\, M\big) = \prod_{k=1}^K p(d_k \,|\, M) = \prod_{k=1}^K \sum_{n=1}^N m_n\, p_n(d_k \,|\, \vec a_n),$$
where $M \equiv \big(\vec m, \{\vec a_n\}_{n=1}^N\big)$. The log likelihood is then given by
$$L(M) = \log p\big(\{d_k\}_{k=1}^K \,|\, M\big) = \sum_{k=1}^K \log \left( \sum_{n=1}^N m_n\, p_n(d_k \,|\, \vec a_n) \right).$$

Our goal is to find extrema of the log likelihood function subject to the constraint that the mixing probabilities sum to 1. The constraint $\sum_n m_n = 1$ can be included with a Lagrange multiplier. Accordingly, the following conditions can be shown to hold at the extrema of the objective function:
$$\frac{1}{K} \sum_{k=1}^K q_n(d_k) = m_n \qquad \text{and} \qquad \frac{\partial L}{\partial \vec a_n} = \sum_{k=1}^K q_n(d_k)\, \frac{\partial}{\partial \vec a_n} \log p_n(d_k \,|\, \vec a_n) = 0.$$

The first condition is easily derived from the derivative of the log likelihood with respect to $m_n$, along with the Lagrange multiplier. The second condition is more involved, as we show here, beginning with the derivative of the log likelihood with respect to the parameters of the $n$-th component:
$$\frac{\partial L}{\partial \vec a_n} = \sum_{k=1}^K \frac{1}{\sum_{n'=1}^N m_{n'}\, p_{n'}(d_k \,|\, \vec a_{n'})}\; m_n\, \frac{\partial}{\partial \vec a_n} p_n(d_k \,|\, \vec a_n) = \sum_{k=1}^K \frac{m_n\, p_n(d_k \,|\, \vec a_n)}{\sum_{n'=1}^N m_{n'}\, p_{n'}(d_k \,|\, \vec a_{n'})}\; \frac{\partial}{\partial \vec a_n} \log p_n(d_k \,|\, \vec a_n).$$
The last step is an algebraic manipulation that uses the fact that $\frac{\partial}{\partial a} \log p(a) = \frac{1}{p(a)} \frac{\partial p(a)}{\partial a}$.

Derivation of EM for Mixture Models (cont.)

Notice that this equation can be greatly simplified, because each term in the sum is just the product of the ownership probability $q_n(d_k)$ and the derivative of the component log likelihood. Therefore
$$\frac{\partial L}{\partial \vec a_n} = \sum_{k=1}^K q_n(d_k)\, \frac{\partial}{\partial \vec a_n} \log p_n(d_k \,|\, \vec a_n).$$
This is just the gradient of a weighted log likelihood. In the case of a Gaussian component likelihood, $p_n(d_k \,|\, \vec a_n)$, it is the derivative of a weighted least-squares error. Thus, setting $\partial L / \partial \vec a_n = 0$ in the Gaussian case yields a weighted least-squares estimate for $\vec a_n$.
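To make the Gaussian case concrete, the following worked steps (added here for completeness; they are not in the original notes, but follow directly from the equations above) show how setting the weighted gradient to zero recovers the weighted mean and variance updates used earlier.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
For a scalar Gaussian component with parameters $\vec a_n = (\mu_n, \sigma_n^2)$,
\[
  \log p_n(d_k \,|\, \vec a_n)
    = -\tfrac{1}{2}\log(2\pi\sigma_n^2) - \frac{(d_k - \mu_n)^2}{2\sigma_n^2}.
\]
Setting the weighted derivatives with respect to $\mu_n$ and $\sigma_n^2$ to zero gives
\begin{align*}
  \sum_k q_n(d_k)\,\frac{d_k - \mu_n}{\sigma_n^2} = 0
    \;&\Rightarrow\; \mu_n = \frac{\sum_k q_n(d_k)\, d_k}{\sum_k q_n(d_k)}, \\
  \sum_k q_n(d_k)\!\left[\frac{(d_k - \mu_n)^2}{2\sigma_n^4} - \frac{1}{2\sigma_n^2}\right] = 0
    \;&\Rightarrow\; \sigma_n^2 = \frac{\sum_k q_n(d_k)\,(d_k - \mu_n)^2}{\sum_k q_n(d_k)},
\end{align*}
which are exactly the weighted sample statistics quoted earlier.
\end{document}
```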

Examples

Example 1: Two distant modes. (We don't necessarily need EM here, since hard assignments would be simple to determine and reasonably efficient statistically.)

Example 2: Two nearby modes. (Here, the soft ownerships are essential to the estimation of the mode locations and variances.)

More Examples

Example 3: Nearby modes with uniformly distributed outliers. The model is a mixture of two Gaussians and a uniform outlier process.

Example 4: Four modes and uniform noise present a challenge to EM. With only 1000 samples the model fit is reasonably good.

Mixture Models for Layered Optical Flow

Layered motion is a natural domain for mixture models and the EM algorithm. Here, we believe there may be multiple motions present, but we don't know the motions, nor which pixels belong together (i.e., move coherently). Two key sources of multiple motions, even within small image neighbourhoods, are occlusion and transparency. Mixture models are also useful when the motion model doesn't provide a good approximation to the 2D motion within the region.

Example: The camera in the Pepsi sequence moves left-to-right. The depth discontinuities at the boundaries of the can produce motion discontinuities. [Figure: for the pixels in the box in the left image, the right plot shows some of the motion constraint lines in velocity space.]

Mixture Models for Optical Flow

Formulation: Assume an image region contains two motions. Let the pixels in the region be $\{\vec x_k\}_{k=1}^K$. Let's use parameterized motion models $\vec u_j(\vec x;\, \vec a_j)$ for $j = 1, 2$, where $\vec a_j$ are the motion model parameters. (E.g., $\vec a$ is 2D for a translational model, and 6D for an affine motion model.)

There is one gradient measurement per pixel, $\vec c_k \equiv \big(f_x(\vec x_k),\, f_y(\vec x_k),\, f_t(\vec x_k)\big)$. As in gradient-based flow estimation, let $\nabla f(\vec x_k) \cdot \vec u + f_t(\vec x_k)$ be mean-zero Gaussian with variance $\sigma_v^2$, that is,
$$p_n(\vec c_k \,|\, \vec x_k, \vec a_n) = G\big(\nabla f(\vec x_k) \cdot \vec u + f_t(\vec x_k);\; 0,\, \sigma_v^2\big).$$

Let the fractions of measurements (pixels) owned by the two motions be denoted $m_1$ and $m_2$. Let $m_0$ denote the fraction of outlying measurements (i.e., constraints not consistent with either of the motions), and assume a uniform density for the outlier likelihood, denoted $p_0$. With three components, $m_0 + m_1 + m_2 = 1$.

Mixture Models for Optical Flow (cont.)

Mixture Model: The observation density for measurement $\vec c_k$ is
$$p(\vec c_k \,|\, \vec m, \vec a_1, \vec a_2) = m_0\, p_0 + \sum_{n=1}^2 m_n\, p_n(\vec c_k \,|\, \vec x_k, \vec a_n),$$
where $\vec m \equiv (m_0, m_1, m_2)$. Given K IID measurements $\{\vec c_k\}_{k=1}^K$, the joint likelihood is the product of the individual measurement likelihoods:
$$L(\vec m, \vec a_1, \vec a_2) = \prod_{k=1}^K p(\vec c_k \,|\, \vec m, \vec a_1, \vec a_2).$$

EM Algorithm:

E-Step: Infer the ownership probability, $q_n(\vec c_k)$, that constraint $\vec c_k$ is owned by the $n$-th mixture component. For the motion components of the mixture ($n = 1, 2$), given $\vec m$, $\vec a_1$ and $\vec a_2$, we have
$$q_n(\vec c_k) = \frac{m_n\, p_n(\vec c_k \,|\, \vec x_k, \vec a_n)}{m_0\, p_0 + m_1\, p_1(\vec c_k \,|\, \vec x_k, \vec a_1) + m_2\, p_2(\vec c_k \,|\, \vec x_k, \vec a_2)}.$$
And of course the ownership probabilities sum to one, so $q_0(\vec c_k) = 1 - q_1(\vec c_k) - q_2(\vec c_k)$.

M-Step: Compute the maximum likelihood estimates of the mixing probabilities $\vec m$ and the flow field parameters, $\vec a_1$ and $\vec a_2$.
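As an illustration of the E-step above, here is a small sketch for the translational case, written by me; the constraint array layout, the uniform outlier density value, and all parameter values are assumptions rather than quantities from the notes.

```python
import numpy as np
from scipy.stats import norm

def estep_flow(C, u1, u2, m, sigma_v, p0):
    """Ownership probabilities for a 2-motion + uniform-outlier mixture.

    C  : (K, 3) array of gradient constraints (f_x, f_y, f_t), one per pixel.
    u1, u2 : candidate translational flows, each of shape (2,).
    m  : mixing probabilities (m0, m1, m2).
    p0 : value of the uniform outlier density.
    """
    # Residual of each constraint under each motion: grad(f) . u + f_t.
    r1 = C[:, :2] @ u1 + C[:, 2]
    r2 = C[:, :2] @ u2 + C[:, 2]

    lik1 = m[1] * norm.pdf(r1, 0.0, sigma_v)
    lik2 = m[2] * norm.pdf(r2, 0.0, sigma_v)
    lik0 = m[0] * p0
    Z = lik0 + lik1 + lik2

    q1, q2 = lik1 / Z, lik2 / Z
    q0 = 1.0 - q1 - q2
    return q0, q1, q2
```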

ML Parameter Estimation

Mixing Probabilities: Given the ownership probabilities, the mixing probabilities are the fractions of the total ownership probability assigned to the respective components:
$$m_n = \frac{1}{K} \sum_{k=1}^K q_n(\vec c_k).$$

Flow Estimation: Given the ownership probabilities, we can estimate the motion parameters for each component separately with a form of weighted, least-squares, area-based regression. E.g., for 2D translation, where $\vec a_n \equiv \vec u_n$, this amounts to the minimization of the weighted least-squares error
$$E(\vec u_n) = \sum_{k=1}^K q_n(\vec c_k)\, d^2(\vec u_n, \vec c_k) = \sum_{k=1}^K q_n(\vec c_k)\, \big[ \nabla f(\vec x_k, t) \cdot \vec u_n + f_t(\vec x_k, t) \big]^2,$$
where $\nabla f \equiv (f_x, f_y)^T$ (cf. iteratively reweighted LS for robust estimation).

Convergence Behaviour: [Figure: convergence of the two motion estimates in velocity space.]
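A sketch (mine, with inputs matching the hypothetical E-step sketch above) of the weighted least-squares flow update for a single translational motion: minimizing $E(\vec u_n)$ reduces to a 2x2 normal-equation solve, and the mixing probabilities are mean ownerships.

```python
import numpy as np

def mstep_translation(C, q):
    """Weighted least-squares update of a translational flow u_n.

    C : (K, 3) array of constraints (f_x, f_y, f_t).
    q : (K,) ownership probabilities q_n(c_k) for this component.
    Minimizes sum_k q_k * (grad(f_k) . u + f_t,k)^2 over u.
    """
    G = C[:, :2]                 # spatial gradients, shape (K, 2)
    ft = C[:, 2]                 # temporal derivatives, shape (K,)
    A = G.T @ (q[:, None] * G)   # 2x2 weighted normal matrix
    b = -G.T @ (q * ft)
    return np.linalg.solve(A, b)

def update_mixing(q0, q1, q2):
    """Mixing probabilities are the mean ownership per component."""
    return np.array([q0.mean(), q1.mean(), q2.mean()])
```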

Representation for the Occlusion Example

Model: translational flow, with 2 layers and an outlier process, applied to a region at an occlusion boundary.

[Figures: pixel ownership for Layer #2; constraint ownership for Layer #1, Layer #2, and the outliers.]

Additional Test Patches

[Figures: four test patches (#1-#4); motion constraints for regions #1, #3 and #4; outliers for regions #1, #3 and #4.]

Further Readings

Papers on mixture models and EM:

A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In Learning in Graphical Models, M. Jordan (ed.), 1998.

Papers on mixture models for layered motion:

S. Ayer and H. Sawhney. Compact representations of videos through dominant and multiple motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):777-784, 1996.

A.D. Jepson and M.J. Black. Mixture models for optical flow computation. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 760-761, New York, June 1993.

Y. Weiss and E.H. Adelson. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, pp. 321-326, 1996.