Machine Learning: Gaussian Mixture Models
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349, Fall 2012
Discriminative vs. Generative Models
- Discriminative: just learn a decision boundary between your sets. Example: Support Vector Machines.
- Generative: learn enough about your sets to be able to make new examples that would be set members. Example: Gaussian Mixture Models.
The Generative Model POV
- Assume the data was generated from a process we can model as a probability distribution.
- Learn that probability distribution.
- Once learned, use the probability distribution to make new examples and to classify data we haven't seen before.
Non-parametric distribution: not feasible
[Figure: histogram of machine learning student heights, 150-200 cm]
- Let's probabilistically model ML student heights.
- The ruler has 200 marks (100 to 300 cm). How many probabilities must we learn? How many students are in the class?
- What if the ruler is continuous?
Learning a Parametric Distribution
- Pick a parametric model (e.g., a Gaussian).
- Learn just a few parameter values.
- $p(x \mid \Theta)$: the probability of $x$, given the parameters $\Theta$ of a model $M$.
[Figure: Gaussian probability density over heights, 150-200 cm]
Using Generative Models for Classification
[Figure: two Gaussians over heights, 150-240 cm, whose means and variances were learned from data: machine learning students and NBA players]
- New person: which class does he belong to?
- Answer: the class that calls him most probable.
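To make the decision rule concrete, here is a minimal Python sketch of "pick the class that calls him most probable"; the means and standard deviations are assumed for illustration, since the slides give no numeric values.

```python
# A minimal sketch of classification with class-conditional Gaussians.
# The means/standard deviations (in cm) are assumptions, not slide values.
from scipy.stats import norm

classes = {
    "ML student": norm(loc=175, scale=8),   # assumed parameters
    "NBA player": norm(loc=200, scale=8),   # assumed parameters
}

height = 190  # the new person's height in cm

# Pick the class whose Gaussian assigns the highest density to this height.
best = max(classes, key=lambda c: classes[c].pdf(height))
print(best)
```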
Learning a Gaussian Distribution
- $p(x \mid \Theta)$: the probability of $x$, given the parameters $\Theta$ of a model $M$.
- $\Theta = \{\mu, \sigma\}$: the parameters we must learn.
$$p(x \mid \Theta) = \frac{1}{(2\pi)^{1/2}\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
- This is the normal (Gaussian) distribution, often denoted $N$, for "normal".
[Figure: Gaussian probability density over heights, 150-200 cm]
Goal: Find the best Gaussian
- The hypothesis space is the set of Gaussian distributions.
- Find the parameters $\Theta^*$ that maximize the probability of observing the data $X = \{x_1, \ldots, x_n\}$:
$$\Theta^* = \arg\max_\Theta p(X \mid \Theta), \quad \text{where each } \Theta = \{\mu, \sigma\}$$
Some math
$$\Theta^* = \arg\max_\Theta p(X \mid \Theta), \quad \text{where each } \Theta = \{\mu, \sigma\}$$
$$p(X \mid \Theta) = \prod_{i=1}^n p(x_i \mid \Theta)$$
...if we can assume all $x_i$ are i.i.d.
Numbers getting smaller
$$p(X \mid \Theta) = \prod_{i=1}^n p(x_i \mid \Theta)$$
- What happens as $n$ grows? Problem? We get underflow if $n$ is, say, 500.
- Taking logs solves the underflow:
$$\log p(X \mid \Theta) = \sum_{i=1}^n \log p(x_i \mid \Theta)$$
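A quick sketch of the underflow problem in Python, assuming a Gaussian with mean 175 cm and standard deviation 8 (illustrative values, not from the slides):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(175, 8, size=500)           # 500 i.i.d. samples

densities = norm.pdf(x, loc=175, scale=8)  # each value is well below 1 here
print(np.prod(densities))                  # underflows to 0.0
print(np.sum(np.log(densities)))           # the log-likelihood stays a usable number
```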
Remember what we're maximizing
$$\Theta^* = \arg\max_\Theta p(X \mid \Theta) = \arg\max_\Theta \sum_{i=1}^n \log p(x_i \mid \Theta)$$
Fitting the Gaussian into this...
$$\log p(x \mid \Theta) = \log\!\left(\frac{1}{(2\pi)^{1/2}\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right)$$
Some math gets you
$$\log\!\left(\frac{1}{(2\pi)^{1/2}\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) = \log e^{-\frac{(x-\mu)^2}{2\sigma^2}} - \log\!\left((2\pi)^{1/2}\sigma\right) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \log(2\pi)^{1/2}$$
Plug back into the equation from slide 11.
...which gives us
$$\Theta^* = \arg\max_\Theta \sum_{i=1}^n \log p(x_i \mid \Theta) = \arg\max_\Theta \sum_{i=1}^n \left(-\frac{(x_i-\mu)^2}{2\sigma^2} - \log\sigma\right)$$
(The constant term $-\log(2\pi)^{1/2}$ does not depend on $\Theta$, so it drops out of the argmax.)
Maximizing Log-likelihood
- To find the best parameters, take the partial derivative with respect to each parameter and set it to 0:
$$\Theta^* = \arg\max_{\{\mu,\sigma\}} \sum_{i=1}^n \left(-\frac{(x_i-\mu)^2}{2\sigma^2} - \log\sigma\right)$$
- The result is a closed-form solution:
$$\mu = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$$
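A short numpy sketch of these closed-form estimates on synthetic data; the true $\mu = 175$ and $\sigma = 8$ are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(175, 8, size=1000)        # synthetic heights, true mu=175, sigma=8

mu_hat = x.mean()                        # mu = (1/n) * sum(x_i)
sigma2_hat = ((x - mu_hat) ** 2).mean()  # sigma^2 = (1/n) * sum((x_i - mu)^2)
print(mu_hat, np.sqrt(sigma2_hat))       # close to 175 and 8
```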
What if the data distribution can't be well represented by a single Gaussian?
Can we model more complex distributions using multiple Gaussians?
Gaussian Mixture Model (GMM)
[Figure: heights of machine learning students & NBA players, 150-240 cm, modeled with two Gaussian components]
Model the distribution as a mix of Gaussians:
$$P(x) = \sum_{j=1}^K P(z_j)\, P(x \mid z_j)$$
where $x$ is the observed value and $z_j$ is the $j$th Gaussian.
What are we optimizing?
$$P(x) = \sum_{j=1}^K P(z_j)\, P(x \mid z_j)$$
Notating $P(z_j)$ as weight $w_j$ and using the normal (a.k.a. Gaussian) distribution $N(\mu_j, \sigma_j^2)$ gives us...
$$P(x) = \sum_{j=1}^K w_j\, N(x \mid \mu_j, \sigma_j^2)$$
This gives 3 variables per Gaussian to optimize: $w_j, \mu_j, \sigma_j$, such that $\sum_{j=1}^K w_j = 1$.
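This sum has a generative reading: first pick Gaussian $z_j$ with probability $w_j$, then draw from that Gaussian. A minimal sampling sketch, with assumed example weights, means, and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters for a 2-component mixture (weights sum to 1).
w  = np.array([0.8, 0.2])      # the P(z_j)
mu = np.array([175.0, 200.0])
sd = np.array([8.0, 8.0])

# To draw one sample: pick a component z_j with probability w_j,
# then draw from that component's Gaussian N(mu_j, sigma_j^2).
z = rng.choice(len(w), size=1000, p=w)
x = rng.normal(mu[z], sd[z])
print(x[:5])
```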
Bad news: no closed-form solution.
$$\Theta^* = \arg\max_\Theta p(X \mid \Theta) = \arg\max_\Theta \sum_{i=1}^n \log p(x_i \mid \Theta) = \arg\max_\Theta \sum_{i=1}^n \log \sum_{j=1}^K w_j\, N(x_i \mid \mu_j, \sigma_j^2)$$
The sum inside the log couples the parameters of all $K$ components, so setting derivatives to 0 no longer yields a closed-form solution.
Expectation Maximization (EM)
- Solution: the EM algorithm.
- EM updates the model parameters iteratively. After each iteration, the likelihood that the model would generate the observed data increases (or at least it doesn't decrease).
- The EM algorithm always converges to a local optimum.
EM Algorithm Summary
- Initialize the parameters.
- E step: with the current parameters, compute how likely each component is to have generated each data point (the responsibilities).
- M step: update the parameters to increase the likelihood, using the E step's responsibilities.
- Repeat the E & M steps until convergence to a local optimum.
EM for GMM - Initialization
- Choose the number of Gaussian components $K$. $K$ should be much less than the number of data points, to avoid overfitting.
- (Randomly) select parameters $w_j, \mu_j, \sigma_j$ for each Gaussian $j$, such that $\sum_{j=1}^K w_j = 1$.
EM for GMM - Expectation step
The responsibility $\gamma_{j,n}$ of Gaussian $j$ for observation $x_n$ is defined as...
$$\gamma_{j,n} \equiv p(z_j \mid x_n) = \frac{p(x_n \mid z_j)\, p(z_j)}{p(x_n)} = \frac{p(x_n \mid z_j)\, p(z_j)}{\sum_{k=1}^K p(z_k)\, p(x_n \mid z_k)} = \frac{w_j\, N(x_n \mid \mu_j, \sigma_j^2)}{\sum_{k=1}^K w_k\, N(x_n \mid \mu_k, \sigma_k^2)}$$
EM for GMM - Expectation step
Define the responsibility $\Gamma_j$ of Gaussian $j$ for all the observed data as...
$$\Gamma_j \equiv \sum_{n=1}^N \gamma_{j,n}$$
You can think of this as the effective number of data points explained by Gaussian $j$ (divide by $N$ to get a proportion).
EM for GMM - Maximization step
Update our parameters as follows...
$$w_j^{new} = \frac{\Gamma_j}{N}, \qquad \mu_j^{new} = \frac{1}{\Gamma_j}\sum_{i=1}^N \gamma_{j,i}\, x_i, \qquad (\sigma_j^2)^{new} = \frac{1}{\Gamma_j}\sum_{i=1}^N \gamma_{j,i}\, (x_i - \mu_j^{new})^2$$
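Putting the E and M steps together, here is a minimal 1-D EM sketch that follows the update equations above. The initialization scheme (uniform weights, means drawn from the data, overall variance) is one arbitrary choice among many, and no convergence test or numerical safeguards are included:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Minimal EM for a 1-D GMM, following the update equations above."""
    rng = np.random.default_rng(seed)
    N = len(x)
    w = np.full(K, 1.0 / K)                    # uniform initial weights
    mu = rng.choice(x, size=K, replace=False)  # initial means: random data points
    var = np.full(K, x.var())                  # initial variances: overall variance

    for _ in range(n_iter):
        # E step: gamma[j, n] = w_j N(x_n | mu_j, var_j) / sum_k w_k N(x_n | mu_k, var_k)
        dens = np.array([w[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(K)])
        gamma = dens / dens.sum(axis=0)
        # M step: re-estimate each Gaussian from its share of the data.
        Gamma = gamma.sum(axis=1)              # total responsibility per component
        w = Gamma / N
        mu = (gamma @ x) / Gamma
        var = (gamma * (x - mu[:, None]) ** 2).sum(axis=1) / Gamma
    return w, mu, var

# Usage on synthetic data from two assumed Gaussians (175 and 200 cm):
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(175, 8, 800), rng.normal(200, 8, 200)])
print(em_gmm_1d(x, K=2))
```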
Why does this work?
- We would need to prove that, as our model parameters are adjusted, the likelihood of the data never goes down (it is monotonically nondecreasing).
- This is the part where I point you to the textbook.
What if our data isn't just scalars, but each data point has multiple dimensions?
- Can we generalize to multiple dimensions?
- We need to define a covariance matrix.
Covariance Matrix
Given a d-dimensional random variable vector $\vec{X} = [X_1, \ldots, X_d]$, the covariance matrix, denoted $\Sigma$ (confusing, eh?), is defined as...
$$\Sigma = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_d-\mu_d)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots & E[(X_2-\mu_2)(X_d-\mu_d)] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_d-\mu_d)(X_1-\mu_1)] & E[(X_d-\mu_d)(X_2-\mu_2)] & \cdots & E[(X_d-\mu_d)(X_d-\mu_d)] \end{bmatrix}$$
This is a generalization of the one-dimensional variance of a scalar random variable $X$:
$$\sigma^2 = \operatorname{var}(X) = E\!\left[(X-\mu)^2\right]$$
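A small numpy sketch of this definition, estimating $\Sigma$ from samples by averaging outer products; the 2-D example and its true covariance are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=5000)

# Sample version of the definition: average the outer products (X - mu)(X - mu)^T.
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
print(Sigma)                               # close to the true covariance above
print(np.cov(X, rowvar=False, bias=True))  # numpy's equivalent (divide-by-N estimator)
```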
Multivariate Gaussian Mixture
[Figure: 2-D scatter over a first and second dimension, with elliptical Gaussian contours]
The $d \times d$ covariance matrix $\Sigma$ describes the shape and orientation of an ellipse.
$$P(\vec{X}) = \sum_{j=1}^K w_j\, N(\vec{X} \mid \vec{\mu}_j, \Sigma_j)$$
Given $d$ dimensions and $K$ Gaussians, how many parameters?
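In practice one rarely hand-rolls multivariate EM. As a sketch, scikit-learn's GaussianMixture (an outside library, not part of these slides) fits exactly this model with full $d \times d$ covariances; the two-cluster data below is assumed for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], 300),
    rng.multivariate_normal([4, 4], [[1.0, -0.4], [-0.4, 1.0]], 300),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)       # the w_j
print(gmm.means_)         # the mu_j (K x d)
print(gmm.covariances_)   # the Sigma_j (K x d x d)
```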
Example: EM in action
[Figures: the model at initialization, then after iterations 1, 2, 3, 4, 5, 6, and 20; illustrations from Andrew Moore's tutorial slides on GMM]
GMM Remarks
- GMM is powerful: any density function can be approximated arbitrarily well by a GMM with enough components.
- If the number of components $K$ is too large, the data will be overfitted. Likelihood increases with $K$. Extreme case: $N$ Gaussians for $N$ data points, with variances $\to 0$; then the likelihood $\to \infty$.
- How to choose $K$? Use domain knowledge. Validate through visualization.
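The "likelihood increases with K" remark is easy to see empirically. A sketch using scikit-learn on data drawn from a single Gaussian (parameters assumed for illustration); score() is the average log-likelihood on the data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(175, 8, size=(300, 1))   # data truly from ONE Gaussian

for K in (1, 2, 4, 8):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    print(K, gm.score(X))  # training log-likelihood typically creeps up with K
```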
GMM is a soft version of K-means
Similarities:
- $K$ needs to be specified.
- Both converge to a local optimum.
- Initialization affects the final result, so one would want to try different initializations.
Differences:
- GMM assigns soft labels to instances.
- GMM considers variances in addition to means.
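A side-by-side sketch of the soft vs. hard labels difference, using scikit-learn's KMeans and GaussianMixture on the same assumed synthetic 1-D data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)]).reshape(-1, 1)

# K-means: each point gets exactly one cluster label.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)[:5])
# GMM: each point gets a posterior probability (responsibility) per component.
print(GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)[:5])
```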
GMM for Classification
Given training data with multiple classes:
1) Model the training data for each class with a GMM.
2) Classify a new point by estimating the probability that each class generated the point.
3) Pick the class with the highest probability as the label.
(Illustration from Leon Bottou's slides on EM)
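A minimal sketch of the three steps above, with one GMM per class and scikit-learn's score_samples as the log-probability; the training data is assumed, and class priors are taken as equal since both assumed training sets are the same size:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_a = rng.normal([0, 0], 1.0, size=(200, 2))   # training data, class A (assumed)
X_b = rng.normal([4, 4], 1.0, size=(200, 2))   # training data, class B (assumed)

# Step 1: fit one GMM per class.
models = {
    "A": GaussianMixture(n_components=2, random_state=0).fit(X_a),
    "B": GaussianMixture(n_components=2, random_state=0).fit(X_b),
}

# Steps 2-3: score the new point under each class model; pick the argmax.
x_new = np.array([[3.0, 3.5]])
label = max(models, key=lambda c: models[c].score_samples(x_new)[0])
print(label)
```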
GMM for Regression
Given a dataset $D = \{\langle x_1, y_1\rangle, \ldots, \langle x_n, y_n\rangle\}$, where $y_i \in \mathbb{R}$ and $x_i$ is a vector of $d$ dimensions...
- Learn a $(d+1)$-dimensional GMM over the joint $(x, y)$.
- Then compute $f(x) = E[y \mid x]$.
(Illustration from Leon Bottou's slides on EM)
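The slides leave $E[y \mid x]$ unstated. As a sketch: if each component $j$ is jointly Gaussian over $(x, y)$, with its mean and covariance partitioned into $x$ and $y$ blocks, standard Gaussian conditioning gives a closed form (the block notation below is my own, not from the slides):

```latex
% Per-component conditional mean, via the usual Gaussian conditioning result,
% with mu_j = (mu_{x,j}, mu_{y,j}) and Sigma_j partitioned into xx, xy, yx, yy blocks:
\[
  E[y \mid x, z_j] = \mu_{y,j} + \Sigma_{yx,j}\, \Sigma_{xx,j}^{-1}\, (x - \mu_{x,j})
\]
% The regression function averages these, weighted by each component's
% posterior probability given x:
\[
  f(x) = E[y \mid x]
       = \sum_{j=1}^{K} \frac{w_j\, N(x \mid \mu_{x,j}, \Sigma_{xx,j})}
                             {\sum_{k=1}^{K} w_k\, N(x \mid \mu_{x,k}, \Sigma_{xx,k})}
         \; E[y \mid x, z_j]
\]
```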