Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Some slides are due to Christopher Bishop.

Limitations of K-means. Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster. It is also not clear how to choose the value of K. Solution: replace the hard clustering of K-means with soft probabilistic assignments, representing the probability distribution of the data as a Gaussian mixture model.

The Gaussian Distribution. Multivariate Gaussian: N(x | µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp{ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) }, with mean µ and covariance Σ. Define the precision to be the inverse of the covariance, Λ = Σ⁻¹. In 1-dimension: N(x | µ, σ²) = (2πσ²)^(−1/2) exp{ −(x − µ)² / (2σ²) }, with precision β = 1/σ².

Gaussian Mixtures. Linear super-position of Gaussians: p(x) = ∑_k π_k N(x | µ_k, Σ_k). Normalization and positivity require 0 ≤ π_k ≤ 1 and ∑_k π_k = 1. We can interpret the mixing coefficients π_k as prior probabilities of the components.

Example: Mixture of 3 Gaussians [figure (a)]

Contours of Probability Distribution [figure (b)]

Sampling from the Gaussian Mixture. To generate a data point: first pick one of the components k with probability π_k, then draw a sample from that component's Gaussian N(x | µ_k, Σ_k). Repeat these two steps for each new data point.
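
To make the two-step procedure concrete, here is a minimal NumPy sketch of this ancestral sampling; the function name sample_gmm and the example parameters are illustrative choices, not part of the lecture.

```python
import numpy as np

def sample_gmm(weights, means, covs, n_samples, rng=None):
    """Draw n_samples points from a Gaussian mixture by ancestral sampling."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: pick a component k for each point with probability pi_k.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: draw the point from the chosen component's Gaussian.
    samples = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
    return samples, components

# Illustrative 2-D mixture with three components (parameters chosen for this sketch).
weights = np.array([0.2, 0.3, 0.5])
means = [np.array([0.0, 0.0]), np.array([6.0, 6.0]), np.array([7.0, 7.0])]
covs = [np.eye(2), 4.0 * np.eye(2), 6.0 * np.eye(2)]
X, z = sample_gmm(weights, means, covs, n_samples=500)
```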

Example: Gaussian Mixture Density. Mixture of 3 Gaussians in two dimensions: p(x) = 0.2 p1(x) + 0.3 p2(x) + 0.5 p3(x), where the components are two-dimensional Gaussians with means [0, 0], [6, 6] and [7, 7] and isotropic covariances I, 4I and 6I respectively (figure shows the three component densities p1, p2, p3 and the resulting mixture).

Synthetic Data Set [figure (a)]

Fitting the Gaussian Mixture. We wish to invert this process: given the data set, find the corresponding parameters: mixing coefficients, means and covariances. If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster. Problem: the data set is unlabelled. We shall refer to the labels as latent (= hidden) variables.

Synthetic Data Set Without Labels [figure (b)]

Posterior Probabilities. We can think of the mixing coefficients as prior probabilities for the components. For a given value of x we can evaluate the corresponding posterior probabilities, called responsibilities. From Bayes' theorem these are given by γ_k(x) = π_k N(x | µ_k, Σ_k) / ∑_j π_j N(x | µ_j, Σ_j).
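
A short sketch of this Bayes-rule computation in NumPy/SciPy; the function responsibilities and its array layout (one row per data point, one column per component) are my own conventions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    # Prior (mixing coefficient) times likelihood for every point/component pair.
    dens = np.column_stack([
        weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
        for k in range(len(weights))
    ])
    # Normalize each row so the posterior over components sums to one.
    return dens / dens.sum(axis=1, keepdims=True)
```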

Posterior Probabilities (colour coded) [figure (a)]

Posterior Probability Map

Maximum Likelihood for the GMM. The log likelihood function takes the form ln p(X | π, µ, Σ) = ∑_{n=1..N} ln { ∑_{k=1..K} π_k N(x_n | µ_k, Σ_k) }. Note: the sum over components appears inside the log, so there is no closed-form solution for maximum likelihood. The question of how to maximize the log likelihood is solved by the expectation-maximization (EM) algorithm.
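
Because the sum over components sits inside the log, the log likelihood is usually evaluated with a log-sum-exp for numerical stability; a hedged SciPy sketch (the function name gmm_log_likelihood is mine):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log(pi_k) + log N(x_n | mu_k, Sigma_k) for every point/component pair.
    log_terms = np.column_stack([
        np.log(weights[k]) + multivariate_normal.logpdf(X, mean=means[k], cov=covs[k])
        for k in range(len(weights))
    ])
    # The inner sum over components is done in log space, per data point.
    return logsumexp(log_terms, axis=1).sum()
```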

EM Algorithm: Informal Derivation. Let us proceed by simply differentiating the log likelihood. Setting the derivative with respect to the means µ_k equal to zero gives µ_k = (1/N_k) ∑_n γ(z_nk) x_n, where N_k = ∑_n γ(z_nk), which is simply the weighted mean of the data (weighted by the responsibilities).

EM Algorithm: Informal Derivation. Similarly for the covariances: Σ_k = (1/N_k) ∑_n γ(z_nk) (x_n − µ_k)(x_n − µ_k)ᵀ. For the mixing coefficients, use a Lagrange multiplier to enforce ∑_k π_k = 1, which gives π_k = N_k / N: the fraction of points assigned to component k. Here N_k is the effective number of points assigned to cluster k, and π_k is the average responsibility which component k takes for explaining the data points.

EM Algorithm: Informal Derivation. The solutions are not closed form since they are coupled. This suggests an iterative scheme for solving them: make initial guesses for the parameters, then alternate between the following two stages: 1. E-step: evaluate the responsibilities; 2. M-step: update the parameters using the ML results.
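
Putting the E-step and M-step together, here is a compact NumPy/SciPy sketch of the iteration described above; fit_gmm_em, the initialization scheme, and the small ridge added to the covariances are my own choices rather than anything prescribed in the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM to data X (shape N x D) by alternating E- and M-steps."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initial guesses: uniform mixing coefficients, K random data points as means,
    # and the overall data covariance (plus a small ridge) for every component.
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: evaluate responsibilities gamma[n, k].
        dens = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters using the maximum-likelihood results.
        Nk = gamma.sum(axis=0)                    # effective number of points per cluster
        means = (gamma.T @ X) / Nk[:, None]       # responsibility-weighted means
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        weights = Nk / N                          # fraction of points per component
    return weights, means, covs
```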

EM Latent Variable Viewpoint. Introduce binary latent variables z_nk ∈ {0, 1}, with ∑_k z_nk = 1, describing which component generated each data point. Conditional distribution of the observed variable: p(x | z) = ∏_k N(x | µ_k, Σ_k)^(z_k). Prior distribution of the latent variables: p(z) = ∏_k π_k^(z_k). Marginalizing over the latent variables we obtain p(x) = ∑_z p(z) p(x | z) = ∑_k π_k N(x | µ_k, Σ_k).

Expected Value of Latent Variable. From Bayes' theorem the posterior distribution is p(Z | X, µ, Σ, π) ∝ ∏_{n=1..N} ∏_{k=1..K} [π_k N(x_n | µ_k, Σ_k)]^(z_nk). The expectation of z_nk under this posterior distribution is E[z_nk] = ∑_{z_n} z_nk ∏_j [π_j N(x_n | µ_j, Σ_j)]^(z_nj) / ∑_{z_n} ∏_j [π_j N(x_n | µ_j, Σ_j)]^(z_nj) = π_k N(x_n | µ_k, Σ_k) / ∑_j π_j N(x_n | µ_j, Σ_j) = γ(z_nk), the responsibility of component k for data point x_n.

Complete and Incomplete Data [figure (a): complete data; figure (b): incomplete data]

Latent Variable View of EM. If we knew the values of the latent variables, we would maximize the complete-data log likelihood, which gives a trivial closed-form solution (fit each component to the corresponding set of data points). We don't know the values of the latent variables. However, for given parameter values we can compute the expected values of the latent variables.

Expected Complete-Data Log Likelihood. Suppose we make a guess for the parameter values (means, covariances and mixing coefficients). Use these to evaluate the responsibilities. Consider the expected complete-data log likelihood, where the responsibilities are computed using the guessed parameters. We are implicitly filling in the latent variables with our best guess. Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results.
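
For reference, the standard form of this objective for a GMM, written out in LaTeX (following the Bishop-style notation used above, with the responsibilities evaluated at the current parameter guess):

```latex
\[
\mathbb{E}_{\mathbf{Z}}\big[\ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\pi})\big]
  = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})
    \big\{\ln\pi_k + \ln\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\big\},
\quad
\gamma(z_{nk}) = \frac{\pi_k^{\mathrm{old}}\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k^{\mathrm{old}},\boldsymbol{\Sigma}_k^{\mathrm{old}})}
                      {\sum_{j=1}^{K}\pi_j^{\mathrm{old}}\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_j^{\mathrm{old}},\boldsymbol{\Sigma}_j^{\mathrm{old}})}.
\]
```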

EM in General. Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, the goal is to maximize the likelihood p(X | θ) with respect to θ.
1. Choose an initial setting for the parameters θ_old.
2. E step: evaluate p(Z | X, θ_old).
3. M step: evaluate θ_new given by θ_new = arg max_θ Q(θ, θ_old), where Q(θ, θ_old) = ∑_Z p(Z | X, θ_old) ln p(X, Z | θ).
4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, let θ_old ← θ_new and return to step 2.

K-means Algorithm. Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype. Initialize the prototypes, then iterate between two phases: E-step: assign each data point to the nearest prototype; M-step: update the prototypes to be the cluster means.
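
A minimal NumPy sketch of these two alternating phases; the function kmeans and the random-point initialization are illustrative assumptions, not the lecture's prescription.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate hard assignments (E-step) and mean updates (M-step)."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: assign each data point to its nearest prototype.
        sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        assign = sq_dists.argmin(axis=1)
        # M-step: move each prototype to the mean of the points assigned to it.
        for k in range(K):
            if np.any(assign == k):
                prototypes[k] = X[assign == k].mean(axis=0)
    return prototypes, assign
```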

Responsibilities. Responsibilities r_nk ∈ {0, 1} assign data points to clusters such that each data point belongs to exactly one cluster: ∑_k r_nk = 1. Example: a responsibility matrix for 5 data points and 3 clusters.

K-means Cost Function. The cost is a function of the data {x_n}, the responsibilities {r_nk} and the prototypes {µ_k}; it sums the squared distances between each data point and the prototype it is assigned to (written out below).
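
Written out (standard notation, consistent with the r_nk introduced above), the cost being minimized is:

```latex
\[
J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2,
\qquad r_{nk}\in\{0,1\},\quad \sum_{k=1}^{K} r_{nk} = 1 .
\]
```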

Minimizing the Cost Function. E-step: minimize J with respect to the responsibilities r_nk, which assigns each data point to the nearest prototype. M-step: minimize J with respect to the prototypes µ_k, which sets each prototype to the mean of the points in that cluster. Convergence is guaranteed since there is a finite number of possible settings for the responsibilities.

K-means Revisited. Consider a GMM with common covariances Σ_k = εI. Take the limit ε → 0: the responsibilities become binary, and the expected complete-data log likelihood becomes, up to an additive constant, proportional to the negative of the K-means cost function, as shown below.
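
A LaTeX sketch of this limiting argument (following the standard Bishop-style derivation, with ε the shared isotropic variance):

```latex
\[
\gamma(z_{nk})
  = \frac{\pi_k \exp\{-\lVert \mathbf{x}_n-\boldsymbol{\mu}_k\rVert^2/2\epsilon\}}
         {\sum_j \pi_j \exp\{-\lVert \mathbf{x}_n-\boldsymbol{\mu}_j\rVert^2/2\epsilon\}}
  \;\longrightarrow\; r_{nk} \quad (\epsilon \to 0),
\qquad
\mathbb{E}_{\mathbf{Z}}\big[\ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\pi})\big]
  \;\to\; -\frac{1}{2\epsilon}\sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\lVert \mathbf{x}_n-\boldsymbol{\mu}_k\rVert^2 + \text{const}.
\]
```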