Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Size: px

Start display at page:

Download "Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall"

Barrie Parks
5 years ago
Views:

1 Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

2 Discriminative vs Generative Models Discriminative: Just learn a decision boundary between your sets. Support Vector Machines Generative: Learn enough about your sets to be able to make new examples that would be set members Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

3 The Generative Model POV Assume the data was generated from a process we can model as a probability distribution Learn that probability distribution Once learned, use the probability distribution to Make new examples Classify data we haven t seen before. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

4 Non-parametric distribution not feasible Machine learning students Height (cm) Let s probabilistically model ML student heights. Ruler has 200 marks (100 to 300 cm) How many probabilities to learn? How many students in the class? What if the ruler is continuous? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

5 Learning a Parametric Distribution Pick a parametric model (e.g. Gaussian) Learn just a few parameter values p(x Θ) prob. of x, given parameters Θ of a model, M Probability density Height (cm) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

6 Using Generative Models for Classification Gaussians whose means and variances were learned from data Machine learning students NBA players Height (cm) New person. Which class does he belong to? Answer: the class that calls him most probable. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

2π ) 1/2 e σ (x µ) 2 2σ 2 Probability density The normal Gaussian distribution,

7 Learning a Gaussian Distribution p(x Θ) prob. of x, given parameters Θ of a model, M Θ {µ,σ } The parameters we must learn M 1 ( 2π ) 1/2 e σ (x µ) 2 2σ 2 Probability density The normal Gaussian distribution, often denoted N, for normal Height (cm) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

8 Goal: Find the best Gaussian Hypothesis space is Gaussian distributions. Θ * Find parameters that maximize the prob. of observing data X {x 1,...x n } Θ * = p(x Θ) argmaxθ where each Θ {µ,σ } Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

9 Some math Θ * = p(x Θ),where each Θ {µ,σ } argmaxθ n p(x Θ) = p(x Θ) i i=1...if can we assume all x i are i.i.d. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

10 p(x Θ) = Numbers getting smaller n p(x Θ) i i=1 What happens as n grows? Problem? We get underflow if n is, say, 500 p(x Θ) n i=1 log(p(x i Θ)) solves underflow. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

11 Remember what we re maximizing Θ* p(x Θ) argmaxθ n = log(p(x Θ) ) i i=1 argmaxθ fitting the Gaussian into this... log(p(x Θ)) = log e (x µ) 2 2σ 2 ( 2π ) 1/2 σ Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

12 Some math gets you log e (x µ) 2 2σ 2 ( 2π ) 1/2 σ = log e (x µ) 2 2σ 2 log( ( 2π ) 1/2 σ ) = (x µ)2 2σ 2 logσ log( 2π ) 1/2 Plug back into equation from slide 11 Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

13 Θ* p(x Θ) argmaxθ n..which gives us = log(p(x Θ) ) i i=1 argmaxθ = n i=1 (x i µ) 2 2σ 2 argmaxθ logσ Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

14 Maximizing Log-likelihood To find best parameters, take the partial derivative with respect to parameters and set to 0. Θ * = n i=1 (x i µ) 2 2σ 2 logσ The result is a closed-form solution µ = 1 n n x i i=1 argmaxθ σ 2 = 1 n {σ,µ} (x µ) 2 i i=1 Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall n

15 What if the data distribution can t be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

16 Gaussian Mixture Model (GMM) Machine learning students & NBA players Two Gaussian components Height (cm) Model the distribution as a mix of Gaussians K P(x) = j=1 x is the observed value P(z )P(x z ) j j z j is the jth Gaussian Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

17 P(x) = What are we optimizing? K P(z j )P(x z j ) j=1 Notating P(z j ) as weight w j and using the Normal (a.k.a. Gaussian) distribution N(µ j,σ j 2 ) gives us... K j=1 = w j N(x µ j,σ j 2 ) This gives 3 variables per Gaussian to optimize: w j,µ j,σ j such that 1= K w j j=1 Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

18 Bad news: No closed form solution. n Θ* p(x Θ) argmaxθ = log(p(x Θ) ) i i=1 argmaxθ n i=1 K j=1 = log w p(x N(µ,σ 2 )) j i j j argmaxθ Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

19 Expectation Maximization (EM) Solution: The EM algorithm EM updates model parameters iteratively. After each iteration, the likelihood the model would generate the observed data increases (or at least it doesn t decrease). EM algorithm always converges to a local optimum. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

20 EM Algorithm Summary Initialize the parameters E step: calculate the likelihood a model with these parameters generated the data M step: Update parameters to increase the likelihood from E step Repeat E & M steps until convergence to a local optimum. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

21 EM for GMM - Initialization Choose the number of Gaussian components K K should be much less than the number of data points to avoid overfitting. (Randomly) select parameters for each Gaussian j: w j,µ j,σ j...such that 1= K w j j=1 Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

22 EM for GMM Expectation step The responsibility γ j,n of Gaussian j for observation x n is defined as... γ j,n p(z j x n ) = p(x n z j )p(z j ) p(x n ) = p(x n z j )p(z j ) K k=1 p(z k )p(x n z k ) = w j N(x n µ j,σ 2 ) j K k=1 w k N(x n µ k,σ k 2 ) 22

23 EM for GMM Expectation step Define the responsibility Γ j of Gaussian j for all the observed data as... Γ j N γ j,n n=1 You can think of this as the proportion of the data explained by Gaussian j. 23

24 EM for GMM Maximization step Update our parameters as follows... new w j = Γ j N new µ j = N i=1 γ j,i x i new σ j 2 = N γ (x µ ) 2 j,i i j i=1 Γ j Γ j Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

25 Why does this work? We need to prove that, as our model parameters are adjusted, likelihood of the data never goes down (monotonically nondecreasing) This is the part where I point you to the textbook Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

26 What if our data isn t just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions? We need to define a covariance matrix. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

27 Covariance Matrix Given d-dimensional random variable vector X! = [ X 1,..., X d ] the covariance matrix denoted Σ (confusing, eh?) is defined as... $ & & Σ & & & % [ ] E [(X 1 µ 1 )(X 2 µ 2 )]... E [(X 1 µ 1 )(X d µ d )] [ ] E [(X 2 µ 2 )(X 2 µ 2 )]! E [(X 2 µ 2 )(X d µ d )] E (X 1 µ 1 )(X 1 µ 1 ) E (X 2 µ 2 )(X 1 µ 1 )!! "! E (X d µ d )(X 1 µ 1 ) [ ] E [(X d µ d )(X 2 µ 2 )]! E [(X d µ d )(X d µ d )] ' ) ) ) ) ) ( This is a generalization of one-dimensional variance for a scalar random variable X $ ( ) 2 σ 2 = var(x) = E % X µ ' ( Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

28 Multivariate Gaussian Mixture Second dimension The d by d covariance matrix Σ describes the shape and orientation of an elipse. K j=1 First dimension P( X)! = w p(!! X N( µ,σ )) j j Given d dimensions and K Gaussians, how many parameters? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

29 Example: Initialization (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

30 After Iteration #1 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

31 After Iteration #2 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

32 After Iteration #3 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

33 After Iteration #4 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

34 After Iteration #5 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

35 After Iteration #6 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

36 After Iteration #20 (Illustra)on from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

37 GMM Remarks GMM is powerful: any density function can be arbitrarily-well approximated by a GMM with enough components. If the number of components K is too large, data will be overfitted. Likelihood increases with K. Extreme case: N Gaussians for N data points, with variances 0, then likelihood. How to choose K? Use domain knowledge. Validate through visualization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

38 GMM is a soft version of K-means Similarity K needs to be specified. Converges to some local optima. Initialization matters final results. One would want to try different initializations. Differences GMM Assigns soft labels to instances. GMM Considers variances in addition to means. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

GMM for Classification Given training data with multiple classes 1) Model the training data for each class with a GMM 2) Classify a new point by estimating the probability each class

39 GMM for Classification Given training data with multiple classes 1) Model the training data for each class with a GMM 2) Classify a new point by estimating the probability each class generated the point 3) Pick the class with the highest probability as the label. (illustration from Leon Bottou s slides on EM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

40 GMM for Regression Given dataset D={<x 1, y 1 >,...,<x n, y n >}, where y i R and x i is a vector of d dimensions... Learn a d +1 dimensional GMM. Then, compute f (x) = E[y x] (illustration from Leon Bottou s slides on EM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume