
Latent Variable View of EM
Sargur Srihari (srihari@cedar.buffalo.edu)

Examples of latent variables

1. Mixture model
   - The joint distribution is p(x, z), but we do not have values for z.
2. Hidden Markov Model
   - A single time slice is a mixture with components p(x | z); it is an extension of the mixture model in which the choice of mixture component depends on the choice of component for the previous observation.
   - The latent variables are multinomial variables z_n that describe which component is responsible for generating x_n.

Another example of latent variables

3. Topic models (Latent Dirichlet Allocation)
   - In NLP, unobserved groups explain why some observed data are similar.
   - Each document is a mixture of various topics (the latent variables), and topics generate words, e.g., CAT-related: milk, meow, kitten; DOG-related: puppy, bark, bone.
   - Topics are multinomial distributions over words with Dirichlet priors.

Main Idea of EM

- The goal of EM is to find maximum likelihood models for distributions p(x) that have latent (or missing) data, e.g., GMMs, HMMs.
- In the case of Gaussian mixture models:
  - We have a complex distribution of observed variables x and wish to estimate its parameters μ_k, Σ_k, π_k:
    p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)
  - We introduce latent variables z so that the joint distribution p(x, z) is more tractable (since we know the forms of the components):
    p(x | z_k = 1) = N(x | μ_k, Σ_k)
  - The complicated form is built from simpler components; the original distribution is obtained by marginalizing the joint distribution:
    p(x) = Σ_z p(x, z)
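As a small illustration of this marginalization (not part of the original slides), the following NumPy/SciPy sketch evaluates p(x) = Σ_k π_k N(x | μ_k, Σ_k) for a hypothetical two-component, one-dimensional mixture; the parameter values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component, 1-D GMM parameters (illustrative values only)
pi = np.array([0.4, 0.6])                        # mixing coefficients pi_k
mu = [np.array([0.0]), np.array([3.0])]          # component means mu_k
Sigma = [np.array([[1.0]]), np.array([[0.5]])]   # component covariances Sigma_k

def gmm_density(x):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k): marginalize the joint p(x, z) over z."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1.0])))
```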

Alternative View of EM

- This view recognizes the key role of latent variables.
- Observed data matrix X = [x_1, x_2, ..., x_N]^T, whose nth row is the sample vector x_n^T = [x_n1, x_n2, ..., x_nD].
- Latent variable matrix Z = [z_1, z_2, ..., z_N]^T, whose corresponding nth row is z_n^T = [z_n1, z_n2, ..., z_nK].
- The goal of the EM algorithm is to find the maximum likelihood solution for p(X) given some X, when we do not have Z.

Likelihood with Latent Variables

- The likelihood function (from the sum rule) is
  p(X | θ) = Σ_Z p(X, Z | θ)
  where θ is the set of all model parameters, e.g., means, covariances, mixing coefficients.
- The log-likelihood function is
  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }
  and we wish to find the θ that maximizes it.
- With X = [x_1, x_2, ..., x_N]^T and Z = [z_1, z_2, ..., z_N]^T, the joint likelihood can be written as
  p(X, Z | θ) = p(Z | X, θ) p(X | θ)
  We choose this factorization from the graph since we know X and not Z.

Complication due to Latent Variables

- The log-likelihood function is
  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }
- Key observation: the summation inside the braces is due to marginalization, not to the log-likelihood itself, so a summation over the latent variables appears inside the logarithm.
- Even if the joint distribution p(X, Z | θ) belongs to the exponential family, the marginal distribution p(X | θ) does not.
- Taking the log of a sum of Gaussians does not give a simple quadratic; this results in complicated expressions for the maximum likelihood solution, i.e., for the value of θ that maximizes the likelihood.
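A small sketch of this point (not from the slides): the incomplete-data log-likelihood of a GMM keeps the sum over components inside the logarithm, here evaluated stably with SciPy's logsumexp; all data and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, Sigma):
    """ln p(X | theta) = sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }.
    The sum over k sits inside the logarithm, so the log does not act
    directly on the Gaussian components."""
    N, K = X.shape[0], len(pi)
    log_terms = np.empty((N, K))
    for k in range(K):
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
    return logsumexp(log_terms, axis=1).sum()   # sum over n of log-sum-exp over k

# Illustrative data and parameters
X = np.array([[0.2], [2.9], [3.4]])
pi = np.array([0.4, 0.6])
mu = [np.array([0.0]), np.array([3.0])]
Sigma = [np.array([[1.0]]), np.array([[0.5]])]
print(gmm_log_likelihood(X, pi, mu, Sigma))
```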

Complete and Incomplete Data Sets

- The log-likelihood function is
  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }
- Complete data {X, Z}: for each observation in X we know the corresponding value of the latent variable in Z. Since we can evaluate p(X, Z | θ) = p(Z | X, θ) p(X | θ), maximization over θ is straightforward.
- Incomplete data {X}: the actual data set. Since we do not know Z, we cannot evaluate p(X, Z | θ) to maximize over θ.

Maximizing the Expectation of ln p(X, Z | θ)

- Since we do not have the complete data set {X, Z} needed to evaluate ln p(X, Z | θ), we instead evaluate its expectation.
- Since we are given X, for a given θ we first determine the distribution of the latent variables, p(Z | X, θ).
- The expected log-likelihood of the complete data is then
  E_Z[ln p(X, Z | θ)] = Σ_Z p(Z | X, θ) ln p(X, Z | θ)
- We maximize this by considering every value of θ, and for each θ by summing over every value of Z. The summation is due to the expectation, not the sum rule. Since the logarithm acts directly on the joint p(X, Z | θ) and not on a summation, this is tractable.
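The sketch below (not from the slides) evaluates this expectation for a GMM, where p(Z | X, θ) factorizes into per-point responsibilities γ_nk, as made explicit in the later slides; the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """Posterior p(Z | X, theta) for a GMM, which factorizes into
    gamma_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(len(pi))])
    return dens / dens.sum(axis=1, keepdims=True)

def expected_complete_loglik(X, gamma, pi, mu, Sigma):
    """E_Z[ln p(X, Z | theta)] = sum_n sum_k gamma_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) },
    where gamma was computed under theta_old and (pi, mu, Sigma) is a general theta."""
    log_comp = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(len(pi))])
    return (gamma * log_comp).sum()

# Illustrative usage: E step under theta_old, then evaluate Q(theta_old, theta_old)
X = np.array([[0.2], [2.9], [3.4]])
pi0 = np.array([0.5, 0.5])
mu0 = [np.array([0.0]), np.array([3.0])]
Sigma0 = [np.array([[1.0]]), np.array([[1.0]])]
gamma = responsibilities(X, pi0, mu0, Sigma0)
print(expected_complete_loglik(X, gamma, pi0, mu0, Sigma0))
```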

E and M Steps

- E step: estimate the missing values. Use the current parameter value θ_old to find the posterior distribution of the latent variables, given by p(Z | X, θ_old).
- M step: determine the revised parameter estimate θ_new by maximizing
  θ_new = argmax_θ Q(θ, θ_old)
  where
  Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)
- The summation is due to the expectation: Q(θ, θ_old) is the expectation of ln p(X, Z | θ) for some general parameter value θ, taken under the posterior computed with θ_old.

General EM Algorithm

Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X | θ).

Step 1: Choose an initial setting for the parameters θ_old.
Step 2 (E step): Evaluate p(Z | X, θ_old).
Step 3 (M step): Evaluate θ_new given by
  θ_new = argmax_θ Q(θ, θ_old)
  where
  Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)
Step 4: Check for convergence of either the log-likelihood or the parameter values. If the criterion is not satisfied, let θ_old ← θ_new and return to Step 2.
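As a minimal sketch of this loop (an assumed structure, not the author's code), the driver below takes model-specific e_step, m_step, and log_likelihood callables; these names are hypothetical placeholders supplied by the user for a particular model.

```python
def em(X, theta_init, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM loop following Steps 1-4 above.

    e_step(X, theta)         -> posterior over Z (in whatever form m_step expects)
    m_step(X, posterior)     -> theta_new = argmax_theta Q(theta, theta_old)
    log_likelihood(X, theta) -> ln p(X | theta), used for the convergence check
    """
    theta = theta_init                       # Step 1: initialize theta_old
    prev_ll = log_likelihood(X, theta)
    for _ in range(max_iter):
        posterior = e_step(X, theta)         # Step 2 (E step): p(Z | X, theta_old)
        theta = m_step(X, posterior)         # Step 3 (M step): maximize Q(theta, theta_old)
        ll = log_likelihood(X, theta)
        if abs(ll - prev_ll) < tol:          # Step 4: convergence check on the log-likelihood
            break
        prev_ll = ll                         # theta_old <- theta_new, return to Step 2
    return theta
```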

Missing Variables

- EM has been described for maximizing the likelihood function when there are discrete latent variables.
- It can also be applied when there are unobserved variables corresponding to missing values in the data set: take the joint distribution of all variables, marginalize over the missing ones, and use EM to maximize the corresponding likelihood function.
- The method is valid when data is missing at random, not when a missing value depends on the unobserved value, e.g., when a quantity is omitted whenever it exceeds some threshold.

Gaussian Mixtures Revisited

- Apply EM (in the latent variable view) to the GMM.
- In the E step we compute the expectation of the log-likelihood of the complete data {X, Z} with respect to the posterior of the latent variables:
  Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)
  What is the form of the two terms in the product?
- In the M step we maximize Q(θ, θ_old) with respect to θ.
- We will show that this leads to the same maximum likelihood estimates for the GMM parameters π, μ, Σ as before.

Likelihood for Complete Data

- The likelihood function for the complete data set is
  p(X, Z | π, μ, Σ) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_nk} N(x_n | μ_k, Σ_k)^{z_nk}
- The log-likelihood is
  ln p(X, Z | π, μ, Σ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk { ln π_k + ln N(x_n | μ_k, Σ_k) }
  which is much simpler than the log-likelihood for the incomplete data:
  ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }
- The maximum likelihood solution for the complete data can be obtained in closed form.
- Since we do not have values for the latent variables, we take the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables.
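A small sketch of the complete-data log-likelihood (illustrative, not from the slides), assuming each row of Z is a one-hot indicator of the generating component:

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_loglik(X, Z, pi, mu, Sigma):
    """ln p(X, Z | pi, mu, Sigma) = sum_n sum_k z_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) },
    where each row of Z is a one-hot indicator of the generating component."""
    log_comp = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(len(pi))])
    return (Z * log_comp).sum()

# Illustrative complete data: the second and third points belong to component 2
X = np.array([[0.2], [2.9], [3.4]])
Z = np.array([[1, 0], [0, 1], [0, 1]])
pi = np.array([0.4, 0.6])
mu = [np.array([0.0]), np.array([3.0])]
Sigma = [np.array([[1.0]]), np.array([[0.5]])]
print(complete_data_loglik(X, Z, pi, mu, Sigma))
```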

Posterior Distribution of Latent Variables

- From p(z) = Π_{k=1}^{K} π_k^{z_k} and p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k} we have
  p(Z | X, π, μ, Σ) ∝ Π_{n=1}^{N} Π_{k=1}^{K} ( π_k N(x_n | μ_k, Σ_k) )^{z_nk}
- From this we can get the expected value of the indicator variable:
  E[z_nk] = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j) = γ(z_nk)
- Substituting into the complete-data log-likelihood:
  E_Z[ln p(X, Z | π, μ, Σ)] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) { ln π_k + ln N(x_n | μ_k, Σ_k) }
- Final procedure: choose initial values π_old, μ_old, Σ_old; evaluate the responsibilities γ(z_nk) (E step); then keep the responsibilities fixed and use the closed-form solutions (M step, see the sketch below) to obtain π_new, μ_new, Σ_new:
  N_k = Σ_{n=1}^{N} γ(z_nk)
  μ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) x_n
  Σ_k = (1/N_k) Σ_{n=1}^{N} γ(z_nk) (x_n - μ_k)(x_n - μ_k)^T
  π_k = N_k / N
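A minimal NumPy sketch of these closed-form M-step updates, assuming the responsibilities γ (an N x K array) have already been computed in the E step; the names are illustrative, not from the slides.

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form M-step for a GMM given fixed responsibilities gamma (N x K):
    N_k = sum_n gamma_nk,  mu_k = (1/N_k) sum_n gamma_nk x_n,
    Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T,  pi_k = N_k / N."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                           # effective number of points per component
    mu = (gamma.T @ X) / Nk[:, None]                 # weighted means, shape (K, D)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                             # (N, D) deviations from the new mean
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi = Nk / N
    return pi, mu, Sigma
```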

Relation to K-means

- EM for Gaussian mixtures has a close similarity to K-means.
- K-means performs a hard assignment of data points to clusters: each data point is associated uniquely with one cluster. EM makes a soft assignment based on posterior probabilities (the responsibilities).
- K-means does not estimate the covariances of the clusters, only the cluster means.
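A tiny illustration of the difference (with made-up responsibility values): converting EM's soft responsibilities into the kind of hard, one-hot assignment that K-means uses.

```python
import numpy as np

# Soft responsibilities gamma_nk for 3 points and 2 components (illustrative values)
gamma = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.55, 0.45]])

# EM: every point contributes fractionally to every component through gamma.
# K-means: every point is assigned uniquely to one cluster (hard, one-hot).
hard = np.zeros_like(gamma)
hard[np.arange(len(gamma)), gamma.argmax(axis=1)] = 1.0
print(hard)
```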