Expectation Maximization, Mixture Models, HMMs

11-755 Machine Learning for Signal Processing: Expectation Maximization, Mixture Models, HMMs. Class 9, 21 Sep 2010.

Learning Distributions for Data. Problem: given a collection of examples from some data, estimate its distribution. The basic ideas of Maximum Likelihood and MAP estimation can be found in the Aarti/Paris slides pointed to in a previous class. Solution: assign a model to the distribution and learn the parameters of the model from the data. Models can be arbitrarily complex (mixture densities, hierarchical models), and learning must then be done using Expectation Maximization. The following slides give an intuitive explanation using a simple example of multinomials.

A Thought Experiment. A person shoots a loaded die repeatedly, producing, say, the sequence 6 3 1 5 4 1 2 4. You observe the series of outcomes and can form a good idea of how the die is loaded: figure out what the probabilities of the various numbers are for the die as P(number) = count(number) / total number of rolls. This is a maximum likelihood estimate: the estimate that makes the observed sequence of numbers most probable.

The Multinomial Distribution. A probability distribution over a discrete collection of items is a multinomial: P(X = x), where x belongs to a discrete set. E.g. the roll of a die: x in {1, 2, 3, 4, 5, 6}; or the toss of a coin: x in {heads, tails}.

Maximum Likelihood Estimation. (Figure: a histogram of the observed counts n_1, ..., n_6 and the corresponding multinomial probabilities p_1, ..., p_6.) Basic principle: assign a form to the distribution, e.g. a multinomial or a Gaussian, and find the distribution of that form that best fits the histogram of the data.

Defining "Best Fit". The data are generated by draws from the distribution, i.e. the generating process draws from the distribution. Assumption: the distribution has a high probability of generating the observed data (not necessarily true). So select the distribution that has the highest probability of generating the data; it should assign lower probability to less frequent observations, and vice versa.

Maximum Likelihood Estimation: Multinomial. The probability of generating the counts n_1, n_2, n_3, n_4, n_5, n_6 is
$P(n_1, \dots, n_6) = \mathrm{Const} \prod_i p_i^{n_i}$
Find p_1, ..., p_6 so that the above is maximized. Equivalently, maximize
$\log P(n_1, \dots, n_6) = \log \mathrm{Const} + \sum_i n_i \log p_i$
since log is a monotonic function: argmax_x f(x) = argmax_x log f(x). Solving for the probabilities (this requires constrained optimization to ensure the probabilities sum to 1) gives us
$p_j = \frac{n_j}{\sum_i n_i}$
EVENTUALLY IT'S JUST COUNTING!
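
A minimal sketch (not from the slides) of this counting solution in Python, using the roll sequence from the thought experiment above:

```python
import numpy as np

# ML estimation of a multinomial is just counting: p_j = n_j / sum_i n_i.
rolls = np.array([6, 3, 1, 5, 4, 1, 2, 4])        # observed outcomes
counts = np.bincount(rolls, minlength=7)[1:]      # n_1 ... n_6
p_ml = counts / counts.sum()                      # maximum-likelihood estimate

for face, p in enumerate(p_ml, start=1):
    print(f"P({face}) = {p:.3f}")
```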

Segue: Gaussians.
$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-0.5\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Parameters of a Gaussian: mean $\mu$, covariance $\Sigma$.

Maximum Likelihood: Gaussian. Given a collection of observations $x_1, x_2, \dots$, estimate the mean $\mu$ and covariance $\Sigma$:
$P(x_1, x_2, \dots) = \prod_i \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-0.5\,(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right)$
$\log P(x_1, x_2, \dots) = C - 0.5 \sum_i \left[ \log|\Sigma| + (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right]$
Maximizing w.r.t. $\mu$ and $\Sigma$ gives us
$\mu = \frac{1}{N}\sum_i x_i, \qquad \Sigma = \frac{1}{N}\sum_i (x_i-\mu)(x_i-\mu)^T$
IT'S STILL JUST COUNTING!
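
A small illustrative sketch of these closed-form estimates, assuming the data are rows of a NumPy array (it uses the 1/N, i.e. biased, covariance, matching the formula above):

```python
import numpy as np

def gaussian_ml(X):
    """ML estimates of a Gaussian's mean and covariance from the rows of X."""
    mu = X.mean(axis=0)                      # mu = (1/N) sum_i x_i
    centered = X - mu
    sigma = centered.T @ centered / len(X)   # Sigma = (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, sigma

# example usage on synthetic 2-D data
X = np.random.default_rng(0).normal(size=(1000, 2))
mu, sigma = gaussian_ml(X)
```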

Laplacian.
$L(x; \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x-\mu|}{b}\right)$
Parameters: mean (location) $\mu$ and scale $b$, with $b > 0$.

Maximum Likelihood: Laplacian. Given a collection of observations $x_1, x_2, \dots$, estimate the mean $\mu$ and scale $b$:
$\log P(x_1, x_2, \dots) = C - N \log b - \sum_i \frac{|x_i - \mu|}{b}$
Maximizing w.r.t. $\mu$ and $b$ gives us
$\mu = \operatorname{median}(\{x_i\}), \qquad b = \frac{1}{N}\sum_i |x_i - \mu|$
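
A corresponding sketch for the Laplacian, assuming a 1-D sample (location = sample median, scale = mean absolute deviation, per the estimates above):

```python
import numpy as np

def laplacian_ml(x):
    """ML estimates of a Laplacian's location and scale from a 1-D sample."""
    mu = np.median(x)                 # location: sample median
    b = np.mean(np.abs(x - mu))       # scale: mean absolute deviation from mu
    return mu, b
```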

Dirichlet. (Figures from Wikipedia, K = 3; clockwise from top left: α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4). A second figure shows the log of the density as α changes from (0.3, 0.3, 0.3) to (2.0, 2.0, 2.0), keeping all the individual α_i equal to each other.) The parameters α determine the mode and curvature. The Dirichlet is defined only over probability vectors x = [x_1, x_2, ..., x_K] with Σ_i x_i = 1 and x_i ≥ 0 for all i:
$D(x; \alpha) = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_i x_i^{\alpha_i - 1}$

Maximum Likelihood: Dirichlet. Given a collection of observations (probability vectors) $x_1, x_2, \dots$, estimate $\alpha$:
$\log P(x_1, x_2, \dots) = \sum_i (\alpha_i - 1) \sum_j \log x_{j,i} + N \log \Gamma\!\left(\sum_i \alpha_i\right) - N \sum_i \log \Gamma(\alpha_i)$
There is no closed-form solution for the $\alpha$'s; estimation needs gradient ascent. Several distributions have this property: the ML estimates of their parameters have no closed-form solution.
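
Since there is no closed form, here is a hedged sketch of plain gradient ascent on the Dirichlet log-likelihood above; the step size, iteration count and positivity clamp are arbitrary choices of mine, and fixed-point schemes converge faster in practice:

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_loglik(alpha, logx_sum, N):
    """Dirichlet log-likelihood; logx_sum[i] = sum_j log x_{j,i}."""
    return (N * gammaln(alpha.sum()) - N * gammaln(alpha).sum()
            + ((alpha - 1.0) * logx_sum).sum())

def dirichlet_ml(X, steps=2000, lr=1e-3):
    """Gradient ascent for the Dirichlet ML estimate.
    X: (N, K) array whose rows are probability vectors."""
    N, K = X.shape
    logx_sum = np.log(X).sum(axis=0)
    alpha = np.ones(K)                                    # initial guess
    for _ in range(steps):
        grad = N * digamma(alpha.sum()) - N * digamma(alpha) + logx_sum
        alpha = np.maximum(alpha + lr * grad, 1e-6)       # crude clamp: keep alpha_i > 0
    return alpha
```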

Continuing the Thought Experiment. Two persons shoot loaded dice repeatedly, producing, say, the sequences 6 3 1 5 4 1 2 4 and 4 4 1 6 3 2 1 2. The dice are differently loaded for the two of them. We observe the series of outcomes for both persons. How do we determine the probability distributions of the two dice?

Estimating Probabilities. Observation: the sequence of numbers from the two dice, 6 4 5 1 2 3 4 5 2 2 1 4 3 4 6 2 1 6. As indicated by the colors, we know who rolled what number.

Estimating Probabilities. Observation: the sequence of numbers from the two dice; as indicated by the colors, we know who rolled what number. Segregation: separate the blue observations from the red, giving a collection of blue numbers (6 5 2 4 2 1 3 6 1 ...) and a collection of red numbers (4 1 3 5 2 4 4 2 6 ...).

Estimating Probabilities. Observation: the sequence of numbers from the two dice; as indicated by the colors, we know who rolled what number. Segregation: separate the blue observations from the red (blue: 6 5 2 4 2 1 3 6 1 ...; red: 4 1 3 5 2 4 4 2 6 ...). From each set, compute probabilities for each of the 6 possible outcomes: P(number) = (no. of times the number was rolled) / (total number of observed rolls). (Figure: the two resulting histograms.)

A Thought Experiment. (Figure: the two shooters' sequences 6 3 1 5 4 1 2 4 and 4 4 1 6 3 2 1 2, and the caller's mixed output.) Now imagine that you cannot observe the dice yourself. Instead there is a caller who randomly calls out the outcomes: 40% of the time he calls out the number from the left shooter, and 60% of the time the one from the right, and you know this. At any time, you do not know which of the two he is calling out. How do you determine the probability distributions for the two dice?

A Thought Experiment. How do you now determine the probability distributions for the two sets of dice, if you do not even know what fraction of the time the blue numbers are called, and what fraction are red?

A Mixture Multinomial. The caller will call out a number X in any given callout IF he selects RED and the red die rolls the number, OR he selects BLUE and the blue die rolls the number:
$P(X) = P(\text{Red})\,P(X \mid \text{Red}) + P(\text{Blue})\,P(X \mid \text{Blue})$
E.g. $P(6) = P(\text{Red})\,P(6 \mid \text{Red}) + P(\text{Blue})\,P(6 \mid \text{Blue})$. A distribution that combines (or mixes) multiple multinomials is a mixture multinomial:
$P(X) = \sum_Z P(Z)\, P(X \mid Z)$
where the $P(Z)$ are the mixture weights and the $P(X \mid Z)$ are the component multinomials.

Mixture Distributions. A mixture Gaussian:
$P(X) = \sum_Z P(Z)\, N(X; \mu_Z, \Sigma_Z)$
where the $P(Z)$ are the mixture weights and the $N(X; \mu_Z, \Sigma_Z)$ are the component distributions. A mixture of Gaussians and Laplacians:
$P(X) = \sum_{Z \in \mathcal{G}} P(Z)\, N(X; \mu_Z, \Sigma_Z) + \sum_{Z \in \mathcal{L}} P(Z)\, L(X; \mu_Z, b_Z)$
Mixture distributions mix several component distributions, and the component distributions may be of varied type. The mixing weights must sum to 1.0, each component distribution integrates to 1.0, and hence the mixture distribution integrates to 1.0.
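
A small sketch of a mixed-type mixture density in one dimension; the weights and component parameters are made-up illustration values, not anything from the slides:

```python
import numpy as np

# A 1-D mixture with one Gaussian and one Laplacian component.
weights = np.array([0.6, 0.4])                       # mixture weights, must sum to 1.0

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def laplacian_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def mixture_pdf(x):
    # Each component integrates to 1 and the weights sum to 1,
    # so the mixture also integrates to 1.
    return (weights[0] * gaussian_pdf(x, mu=0.0, var=1.0)
            + weights[1] * laplacian_pdf(x, mu=3.0, b=0.5))
```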

Maximum Likelihood Estimation. For our problem, Z = colour of the die:
$P(n_1, \dots, n_6) = \mathrm{Const} \prod_X \left( \sum_Z P(Z)\, P(X \mid Z) \right)^{n_X}$
Maximum likelihood solution: maximize
$\log P(n_1, \dots, n_6) = \log \mathrm{Const} + \sum_X n_X \log \left( \sum_Z P(Z)\, P(X \mid Z) \right)$
There is no closed-form solution (there is a summation inside the log!). In general, ML estimates for mixtures do not have a closed form. USE EM!

Expectation Maximization. It is possible to estimate all parameters in this setup using the Expectation Maximization (EM) algorithm, first described in a landmark paper by Dempster, Laird and Rubin: "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 1977. There has been much work on the algorithm since then; the principles behind the algorithm existed for several years prior to the landmark paper, however.

Expectation Maximization. An iterative solution: get some initial estimates for all parameters (in the dice-shooter example, this includes the probability distributions for the dice AND the probability with which the caller selects the dice). Two steps are then iterated. Expectation step: statistically estimate the values of the unseen variables. Maximization step: using the estimated values of the unseen variables as truth, re-estimate the model parameters.

EM: The Auxiliary Function. EM iteratively optimizes the following auxiliary function:
$Q(\theta, \theta') = \sum_Z P(Z \mid X, \theta')\, \log P(Z, X \mid \theta)$
where Z are the unseen variables (assuming Z is discrete; it may not be), $\theta'$ are the parameter estimates from the previous iteration, and $\theta$ are the estimates to be obtained in the current iteration.

Expectation Maximization as Counting. (Figure: a called number, e.g. a 6, may be an instance from the blue die or from the red die; since the die is unknown, the observation is split between the collection of blue numbers and the collection of red numbers.) Hidden variable: Z, the identity of the die whose number has been called out. If we knew Z for every observation, we could estimate all terms by adding each observation to the right bin. Unfortunately, we do not know Z: it is hidden from us! Solution: FRAGMENT THE OBSERVATION.

Fragmenting the Observation. EM is an iterative algorithm: at each time there is a current estimate of the parameters. The size of the fragments is proportional to the a posteriori probability of the component distributions; the a posteriori probabilities of the various values of Z are computed using Bayes' rule:
$P(Z \mid X) = \frac{P(X \mid Z)\, P(Z)}{P(X)} = C\, P(X \mid Z)\, P(Z)$
Every die gets a fragment of size P(die | number).

Expectation Maximization. Hypothetical dice-shooter example: we obtain an initial estimate for the probability distributions of the two sets of dice somehow (figure: initial blue and red histograms, with e.g. P(4 | blue) = 0.1 and P(4 | red) = 0.05), and an initial estimate for the probability with which the caller calls out the two shooters somehow: P(Z) = (0.5, 0.5).

Expectation Maximization. Hypothetical dice-shooter example, with initial estimates P(blue) = P(red) = 0.5, P(4 | blue) = 0.1, P(4 | red) = 0.05. The caller has just called out 4. Posterior probability of the colours:
P(red | 4) = C · P(4 | Z = red) P(Z = red) = C · 0.05 · 0.5 = C · 0.025
P(blue | 4) = C · P(4 | Z = blue) P(Z = blue) = C · 0.1 · 0.5 = C · 0.05
Normalizing: P(red | 4) = 0.33, P(blue | 4) = 0.67.
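
The same computation in a few lines, using the slide's numbers:

```python
import numpy as np

# Posterior "fragment sizes" for one called-out number, via Bayes' rule,
# with the initial estimates from the slide: P(red) = P(blue) = 0.5,
# P(4|red) = 0.05, P(4|blue) = 0.1.
priors = np.array([0.5, 0.5])          # P(red), P(blue)
likelihoods = np.array([0.05, 0.1])    # P(4|red), P(4|blue)

unnormalized = priors * likelihoods    # the constant C cancels after normalization
posteriors = unnormalized / unnormalized.sum()
print(posteriors)                      # -> approximately [0.33, 0.67]
```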

Expectation Maximization. (The called sequence: 6 4 5 1 2 3 4 5 2 2 1 4 3 4 6 2 1 6. The called 4 is split with fragments P(red | 4) = 0.33 and P(blue | 4) = 0.67.)

Expectation Maximization. Every observed roll of the dice contributes to both Red and Blue. For the called sequence 6 4 5 1 2 3 4 5 2 2 1 4 3 4 6 2 1 6, the per-call posteriors are:

Called  red   blue
6       0.80  0.20
4       0.33  0.67
5       0.33  0.67
1       0.57  0.43
2       0.14  0.86
3       0.33  0.67
4       0.33  0.67
5       0.33  0.67
2       0.14  0.86
2       0.14  0.86
1       0.57  0.43
4       0.33  0.67
3       0.33  0.67
4       0.33  0.67
6       0.80  0.20
2       0.14  0.86
1       0.57  0.43
6       0.80  0.20

Expectation Maximization. The total count for Red is the sum of all the posterior probabilities in the red column: 7.31. The total count for Blue is the sum of all the posterior probabilities in the blue column: 10.69. Note: 10.69 + 7.31 = 18, the total number of instances.
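
A sketch that reproduces these totals (and the per-number soft counts used on the next slide) from the posterior table above:

```python
import numpy as np

# The called sequence and the per-number posteriors P(red | n) from the table.
calls = [6, 4, 5, 1, 2, 3, 4, 5, 2, 2, 1, 4, 3, 4, 6, 2, 1, 6]
p_red = {1: 0.57, 2: 0.14, 3: 0.33, 4: 0.33, 5: 0.33, 6: 0.80}

red_total = sum(p_red[n] for n in calls)          # -> 7.31
blue_total = sum(1 - p_red[n] for n in calls)     # -> 10.69

# Per-number soft counts for the red die, e.g. count(1 | red) = 3 * 0.57 = 1.71,
# and the updated probabilities P(k | red) = count(k | red) / red_total.
red_counts = {k: sum(p_red[n] for n in calls if n == k) for k in range(1, 7)}
red_probs = {k: v / red_total for k, v in red_counts.items()}
print(red_total, blue_total)
print(red_probs)
```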

Expectation Maximization. Total count for Red: 7.31. Red soft counts: total count for 1: 1.71; for 2: 0.56; for 3: 0.66; for 4: 1.32; for 5: 0.66; for 6: 2.4. Updated probability of the Red die: P(1 | Red) = 1.71/7.31 = 0.234; P(2 | Red) = 0.56/7.31 = 0.077; P(3 | Red) = 0.66/7.31 = 0.090; P(4 | Red) = 1.32/7.31 = 0.181; P(5 | Red) = 0.66/7.31 = 0.090; P(6 | Red) = 2.40/7.31 = 0.328.

Expectation Maximization. Total count for Blue: 10.69. Blue soft counts: total count for 1: 1.29; for 2: 3.44; for 3: 1.34; for 4: 2.68; for 5: 1.34; for 6: 0.6. Updated probability of the Blue die: P(1 | Blue) = 1.29/10.69 = 0.121; P(2 | Blue) = 3.44/10.69 = 0.322; P(3 | Blue) = 1.34/10.69 = 0.125; P(4 | Blue) = 2.68/10.69 = 0.251; P(5 | Blue) = 1.34/10.69 = 0.125; P(6 | Blue) = 0.60/10.69 = 0.056.

Expectation Maximization. Total count for Red: 7.31; total count for Blue: 10.69; total instances = 18 (note 7.31 + 10.69 = 18). We also revise our estimate for the probability that the caller calls out Red or Blue, i.e. the fraction of times that he calls Red and the fraction of times he calls Blue: P(Z = Red) = 7.31/18 = 0.41, P(Z = Blue) = 10.69/18 = 0.59.

The Updated Values. Probability of the Red die: P(1 | Red) = 1.71/7.31 = 0.234; P(2 | Red) = 0.56/7.31 = 0.077; P(3 | Red) = 0.66/7.31 = 0.090; P(4 | Red) = 1.32/7.31 = 0.181; P(5 | Red) = 0.66/7.31 = 0.090; P(6 | Red) = 2.40/7.31 = 0.328. Probability of the Blue die: P(1 | Blue) = 1.29/10.69 = 0.121; P(2 | Blue) = 3.44/10.69 = 0.322; P(3 | Blue) = 1.34/10.69 = 0.125; P(4 | Blue) = 2.68/10.69 = 0.251; P(5 | Blue) = 1.34/10.69 = 0.125; P(6 | Blue) = 0.60/10.69 = 0.056. P(Z = Red) = 7.31/18 = 0.41; P(Z = Blue) = 10.69/18 = 0.59. THE UPDATED VALUES CAN BE USED TO REPEAT THE PROCESS: ESTIMATION IS AN ITERATIVE PROCESS.

The Dice Shooter Example. (Figure: the two shooters' sequences 6 3 1 5 4 1 2 4 and 4 4 1 6 3 2 1 2, and the caller's mixed output.)
1. Initialize P(Z) and P(X | Z).
2. Estimate P(Z | X) for each Z, for each called-out number X; associate X with each value of Z, with weight P(Z | X).
3. Re-estimate P(X | Z) for every value of X and Z.
4. Re-estimate P(Z).
5. If not converged, return to 2.

In Squiggles. Given a sequence of observations $O_1, O_2, \dots$, let $N_X$ be the number of observations of number $X$. Initialize $P(Z)$ and $P(X \mid Z)$ for all dice $Z$ and numbers $X$. Iterate: for each number $X$, compute
$P(Z \mid X) = \frac{P(Z)\, P(X \mid Z)}{\sum_{Z'} P(Z')\, P(X \mid Z')}$
then update
$P(X \mid Z) = \frac{N_X\, P(Z \mid X)}{\sum_{X'} N_{X'}\, P(Z \mid X')}, \qquad P(Z) = \frac{\sum_X N_X\, P(Z \mid X)}{\sum_{Z'} \sum_X N_X\, P(Z' \mid X)}$
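
Putting the update equations together, a minimal EM loop for a mixture of two multinomials on the 18-call example; the random initialization is my own choice, and (as the next slide notes) the solution reached depends on it:

```python
import numpy as np

# EM for a mixture of two multinomials (the dice-shooter example),
# following the update equations above.
calls = np.array([6, 4, 5, 1, 2, 3, 4, 5, 2, 2, 1, 4, 3, 4, 6, 2, 1, 6])
N_x = np.bincount(calls, minlength=7)[1:].astype(float)     # N_X for X = 1..6

K = 2                                                       # two dice (red, blue)
rng = np.random.default_rng(0)
P_z = np.full(K, 1.0 / K)                                   # P(Z)
P_x_given_z = rng.dirichlet(np.ones(6), size=K)             # P(X|Z), rows sum to 1

for iteration in range(100):
    # E step: P(Z|X) for every number X, via Bayes' rule.
    joint = P_z[:, None] * P_x_given_z                      # shape (K, 6)
    P_z_given_x = joint / joint.sum(axis=0, keepdims=True)

    # M step: soft counts N_X * P(Z|X), then normalize.
    soft_counts = P_z_given_x * N_x                         # shape (K, 6)
    P_x_given_z = soft_counts / soft_counts.sum(axis=1, keepdims=True)
    P_z = soft_counts.sum(axis=1) / soft_counts.sum()

print(P_z)
print(P_x_given_z)
```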

Solutions May Not Be Unique. The EM algorithm will give us one of many solutions, all equally valid! For instance, the probability of 6 being called out can be written P(6) = α P(6 | red) + β P(6 | blue), which assigns some value r as the probability of 6 for the red die and some value b as the probability of 6 for the blue die. But the following is also a valid solution: P(6) = 1.0 · P(6) + 0.0 · (anything), which assigns 1.0 as the a priori probability of the red die and 0.0 as that of the blue die. The solution is NOT unique.

A More Complex Model. The Gaussian:
$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-0.5\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
Gaussian mixtures are often good models for the distribution of multivariate data. Problem: estimating the parameters, given a collection of data.

Gaussian Mixtures: Generating Model. $P(x) = \sum_Z P(Z)\, N(x; \mu_Z, \Sigma_Z)$. (Example draw sequence: 6.1, 1.4, 5.3, 1.9, 4.2, 2.2, 4.9, 0.5.) The caller now has two Gaussians. At each draw he randomly selects a Gaussian according to the mixture weight distribution, then draws an observation from that Gaussian. Much like the dice problem, only the outcomes are now real numbers and can be anything.

Estimating a GMM with Complete Information. Observation: a collection of numbers drawn from a mixture of 2 Gaussians: 6.1, 1.4, 5.3, 1.9, 4.2, 2.2, 4.9, 0.5. As indicated by the colours, we know which Gaussian generated what number (red: 6.1, 5.3, 4.2, 4.9, ...; blue: 1.4, 1.9, 2.2, 0.5, ...). Segregation: separate the blue observations from the red. From each set, compute the parameters for that Gaussian:
$\mu_{\text{red}} = \frac{1}{N_{\text{red}}} \sum_{i \in \text{red}} x_i, \qquad \Sigma_{\text{red}} = \frac{1}{N_{\text{red}}} \sum_{i \in \text{red}} (x_i - \mu_{\text{red}})(x_i - \mu_{\text{red}})^T$

Fragmenting the Observation. (Figure: the Gaussian that generated a given value, e.g. 4.2, is unknown, so the observation is split between the collection of blue numbers and the collection of red numbers.) The identity of the Gaussian is not known! Solution: fragment the observation, with fragment size proportional to the a posteriori probability:
$P(k \mid x) = \frac{P(k)\, N(x; \mu_k, \Sigma_k)}{\sum_{k'} P(k')\, N(x; \mu_{k'}, \Sigma_{k'})}$

Expectation Maximization. Initialize P(k), μ_k and Σ_k for both Gaussians. How we do this is important; a typical solution is to initialize the means randomly, the covariances as the global covariance of the data, and the mixture weights uniformly. Then compute the fragment sizes for each Gaussian, for each observation:
$P(k \mid x) = \frac{P(k)\, N(x; \mu_k, \Sigma_k)}{\sum_{k'} P(k')\, N(x; \mu_{k'}, \Sigma_{k'})}$

Number  red  blue
6.1     .81  .19
1.4     .33  .67
5.3     .75  .25
1.9     .41  .59
4.2     .64  .36
2.2     .43  .57
4.9     .66  .34
0.5     .05  .95

Expectation Maximization. Each observation contributes only as much as its fragment size to each statistic:
mean(red) = (6.1·0.81 + 1.4·0.33 + 5.3·0.75 + 1.9·0.41 + 4.2·0.64 + 2.2·0.43 + 4.9·0.66 + 0.5·0.05) / (0.81 + 0.33 + 0.75 + 0.41 + 0.64 + 0.43 + 0.66 + 0.05) = 17.05 / 4.08 = 4.18
var(red) = ((6.1−4.18)²·0.81 + (1.4−4.18)²·0.33 + (5.3−4.18)²·0.75 + (1.9−4.18)²·0.41 + (4.2−4.18)²·0.64 + (2.2−4.18)²·0.43 + (4.9−4.18)²·0.66 + (0.5−4.18)²·0.05) / (0.81 + 0.33 + 0.75 + 0.41 + 0.64 + 0.43 + 0.66 + 0.05)
(The red fragment total is 4.08 and the blue fragment total is 3.92.)
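
The same weighted statistics in code, using the observations and fragment sizes from the table:

```python
import numpy as np

# Fragment-weighted mean and variance for the "red" Gaussian.
x = np.array([6.1, 1.4, 5.3, 1.9, 4.2, 2.2, 4.9, 0.5])
w_red = np.array([0.81, 0.33, 0.75, 0.41, 0.64, 0.43, 0.66, 0.05])

mean_red = np.sum(w_red * x) / np.sum(w_red)                  # -> 17.05 / 4.08 = 4.18
var_red = np.sum(w_red * (x - mean_red) ** 2) / np.sum(w_red)
print(mean_red, var_red)
```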

EM for Gaussian Mixtures.
1. Initialize P(k), μ_k and Σ_k for all Gaussians.
2. For each observation x, compute the a posteriori probabilities for all Gaussians:
$P(k \mid x) = \frac{P(k)\, N(x; \mu_k, \Sigma_k)}{\sum_{k'} P(k')\, N(x; \mu_{k'}, \Sigma_{k'})}$
3. Update the mixture weights, means and variances for all Gaussians:
$P(k) = \frac{\sum_x P(k \mid x)}{N}, \qquad \mu_k = \frac{\sum_x P(k \mid x)\, x}{\sum_x P(k \mid x)}, \qquad \Sigma_k = \frac{\sum_x P(k \mid x)\,(x-\mu_k)(x-\mu_k)^T}{\sum_x P(k \mid x)}$
4. If not converged, return to 2.
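
A compact 1-D sketch of this loop; the initialization follows the earlier slide (random means, global variance, uniform weights), and no variance floor or convergence test is included:

```python
import numpy as np

def gmm_em_1d(x, K=2, iters=100, seed=0):
    """Minimal 1-D EM for a K-component Gaussian mixture."""
    rng = np.random.default_rng(seed)
    N = len(x)
    w = np.full(K, 1.0 / K)                    # mixture weights P(k)
    mu = rng.choice(x, size=K, replace=False)  # random means
    var = np.full(K, x.var())                  # global variance for every component

    for _ in range(iters):
        # E step: a posteriori probabilities P(k|x) for every observation.
        dens = (np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2 / var[:, None])
                / np.sqrt(2 * np.pi * var[:, None]))        # shape (K, N)
        resp = w[:, None] * dens
        resp /= resp.sum(axis=0, keepdims=True)

        # M step: update weights, means and variances from the soft counts.
        Nk = resp.sum(axis=1)
        w = Nk / N
        mu = (resp * x).sum(axis=1) / Nk
        var = (resp * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk
    return w, mu, var

# e.g. on the eight observations from the slides:
x = np.array([6.1, 1.4, 5.3, 1.9, 4.2, 2.2, 4.9, 0.5])
print(gmm_em_1d(x))
```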

EM Estimation of Gaussian Mixtures: An Example. (Figures: a histogram of 4000 instances of randomly generated data; the individual component densities of a two-Gaussian mixture estimated by EM; and the two-Gaussian mixture density estimated by EM.)

Expectation Maximization. The same principle can be extended to mixtures of other distributions, e.g. a mixture of Laplacians. The Laplacian parameters become fragment-weighted estimates, for example
$b_k = \frac{\sum_x P(k \mid x)\, |x - \mu_k|}{\sum_x P(k \mid x)}$
In a mixture of Gaussians and Laplacians, the Gaussian components use the Gaussian update rules and the Laplacian components use the Laplacian rules.

Expectation Maximization. The EM algorithm is used whenever a proper statistical analysis of a phenomenon requires knowledge of a hidden or missing variable (or a set of hidden/missing variables). The hidden variable is often called a latent variable. Some examples: estimating mixtures of distributions, where only the data are observed and the individual distributions and mixing proportions must both be learnt; estimating the distribution of data when some attributes are missing; estimating the dynamics of a system based only on observations that may be a complex function of the system state.

Solve this problem: a caller rolls a die and flips a coin. He calls out the number rolled if the coin shows heads; otherwise he calls the number + 1. Determine P(heads) and P(number) for the die from a collection of outputs. Second problem: the caller rolls two dice and calls out the sum. Determine the dice from a collection of outputs.

The Dice and the Coin. A called-out 4 is fragmented: with the posterior probability of heads it contributes to the count of 4 (heads count), and with the posterior probability of tails it contributes to the count of 3 (tails count). Unknown: whether it was heads or tails.

The Two Dice. A called-out 4 could have come from (3,1), (1,3) or (2,2). Unknown: how to partition the number. Count_blue(3) += P((3,1) | 4); Count_blue(2) += P((2,2) | 4); Count_blue(1) += P((1,3) | 4).
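
A sketch of how one called-out sum might be fragmented over the pairs that could have produced it, with made-up (uniform) current estimates for the two dice:

```python
import numpy as np
from itertools import product

# Current estimates of the two dice; uniform placeholders for illustration.
p_blue = np.full(6, 1 / 6)        # P(face | blue die)
p_red = np.full(6, 1 / 6)         # P(face | red die)

def fragment_sum(called_sum):
    """Return P((blue_face, red_face) | called_sum) for every pair summing to it."""
    pairs = [(b, r) for b, r in product(range(1, 7), repeat=2) if b + r == called_sum]
    weights = np.array([p_blue[b - 1] * p_red[r - 1] for b, r in pairs])
    weights /= weights.sum()
    return dict(zip(pairs, weights))

# e.g. a called-out 4 is split over (1,3), (2,2) and (3,1); each pair's posterior
# is then added to the soft counts of the corresponding faces of the two dice.
print(fragment_sum(4))
```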

Fragmentation Can Be Hierarchical. E.g. a mixture of mixtures: $P(X) = \sum_{Z_1} P(Z_1) \sum_{Z_2} P(Z_2 \mid Z_1)\, P(X \mid Z_1, Z_2)$. (Figure: a tree of components $Z_1, Z_2$ with sub-components $Z_3, Z_4$.) Fragments are further fragmented: the fragment for a component is split among its sub-components in proportion to their posterior probabilities. Work this out.

More Later. We will see a couple of other instances of the use of EM. Work out HMM training: assume the state output distributions are multinomials; assume they are Gaussian; assume they are Gaussian mixtures.