Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1

Discriminative vs Generative Models. Discriminative: just learn a decision boundary between your sets (e.g., Support Vector Machines). Generative: learn enough about your sets to be able to make new examples that would be set members (e.g., Gaussian Mixture Models). Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 2

The Generative Model POV. Assume the data was generated from a process we can model as a probability distribution. Learn that probability distribution. Once learned, use the probability distribution to make new examples and to classify data we haven't seen before. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 3

Non-parametric distribution not feasible. Let's probabilistically model ML student heights. (Figure: histogram of machine learning students' heights, 150-200 cm.) The ruler has 200 marks (100 to 300 cm). How many probabilities to learn? How many students in the class? What if the ruler is continuous? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 4

Learning a Parametric Distribution. Pick a parametric model (e.g., a Gaussian) and learn just a few parameter values. $p(x \mid \Theta)$: the probability of $x$, given the parameters $\Theta$ of a model $M$. (Figure: Gaussian probability density over heights, 150-200 cm.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 5

Using Generative Models for Classification. (Figure: two Gaussians over height, 150-240 cm, whose means and variances were learned from data: machine learning students and NBA players.) A new person arrives: which class does he belong to? Answer: the class that calls him most probable. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 6

Learning a Gaussian Distribution. $p(x \mid \Theta)$: the probability of $x$, given the parameters $\Theta = \{\mu, \sigma\}$ of a model $M$; these are the parameters we must learn. The model is the normal (Gaussian) distribution, often denoted $\mathcal{N}$, for "normal":
$$p(x \mid \Theta) = \frac{1}{(2\pi)^{1/2}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
(Figure: Gaussian probability density over heights, 150-200 cm.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 7
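To make the formula concrete, here is a small NumPy/SciPy sketch (not from the slides; the height numbers are made up) that evaluates the density as written above and checks it against scipy.stats.norm.pdf:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, written out exactly as on the slide."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 175.0, 8.0                   # hypothetical mean/std of student heights (cm)
x = np.array([160.0, 175.0, 190.0])

print(gaussian_pdf(x, mu, sigma))        # e.g. [0.0086 0.0499 0.0086]
print(norm.pdf(x, loc=mu, scale=sigma))  # same values from SciPy
```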

Goal: Find the best Gaussian. The hypothesis space is Gaussian distributions. Find the parameters $\Theta^*$ that maximize the probability of observing the data $X = \{x_1, \ldots, x_n\}$:
$$\Theta^* = \arg\max_{\Theta} p(X \mid \Theta), \quad \text{where each } \Theta = \{\mu, \sigma\}$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 8

Some math.
$$\Theta^* = \arg\max_{\Theta} p(X \mid \Theta), \quad \text{where each } \Theta = \{\mu, \sigma\}$$
$$p(X \mid \Theta) = \prod_{i=1}^{n} p(x_i \mid \Theta)$$
...if we can assume all $x_i$ are i.i.d. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 9

Numbers getting smaller.
$$p(X \mid \Theta) = \prod_{i=1}^{n} p(x_i \mid \Theta)$$
What happens as $n$ grows? Problem? We get underflow if $n$ is, say, 500. Working with $\log p(X \mid \Theta) = \sum_{i=1}^{n} \log p(x_i \mid \Theta)$ solves the underflow. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 10
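A quick numerical illustration of that underflow problem (toy numbers of my own, not from the slides): the product of 500 densities collapses to 0 in double precision, while the sum of log densities stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=175.0, scale=8.0, size=500)  # 500 hypothetical heights (cm)

p = norm.pdf(x, loc=175.0, scale=8.0)           # per-point densities, each well below 1
print(np.prod(p))                               # 0.0 -- the product underflows
print(np.sum(np.log(p)))                        # finite log-likelihood (around -1750)
```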

Remember what we're maximizing.
$$\Theta^* = \arg\max_{\Theta} p(X \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{n} \log p(x_i \mid \Theta)$$
Fitting the Gaussian into this...
$$\log p(x \mid \Theta) = \log \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{(2\pi)^{1/2}\,\sigma}$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 11

Some math gets you
$$\log \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{(2\pi)^{1/2}\,\sigma} = \log e^{-\frac{(x-\mu)^2}{2\sigma^2}} - \log\!\left((2\pi)^{1/2}\,\sigma\right) = -\frac{(x-\mu)^2}{2\sigma^2} - \log\sigma - \log(2\pi)^{1/2}$$
Plug this back into the equation from slide 11. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 12

...which gives us
$$\Theta^* = \arg\max_{\Theta} p(X \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{n} \log p(x_i \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{n} \left( -\frac{(x_i-\mu)^2}{2\sigma^2} - \log\sigma \right)$$
(The constant $-\log(2\pi)^{1/2}$ has been dropped, since it does not affect the argmax.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 13

Maximizing Log-likelihood. To find the best parameters, take the partial derivatives with respect to the parameters and set them to 0.
$$\Theta^* = \arg\max_{\{\mu,\sigma\}} \sum_{i=1}^{n} \left( -\frac{(x_i-\mu)^2}{2\sigma^2} - \log\sigma \right)$$
The result is a closed-form solution:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i-\mu)^2$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 14
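In code, this closed-form solution is just the sample mean and the divide-by-n (maximum-likelihood) variance; a minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=175.0, scale=8.0, size=1000)  # hypothetical height data (cm)

mu_hat = x.mean()                    # mu = (1/n) * sum_i x_i
var_hat = np.mean((x - mu_hat)**2)   # sigma^2 = (1/n) * sum_i (x_i - mu)^2
# equivalently x.var(ddof=0), the maximum-likelihood (biased) estimator

print(mu_hat, np.sqrt(var_hat))      # close to 175 and 8
```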

What if the data distribution can't be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 15

Gaussian Mixture Model (GMM). (Figure: machine learning students & NBA players over height, 150-240 cm, modeled with two Gaussian components.) Model the distribution as a mix of Gaussians:
$$P(x) = \sum_{j=1}^{K} P(z_j)\, P(x \mid z_j)$$
where $x$ is the observed value and $z_j$ is the $j$th Gaussian. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2014 16

What are we optimizing?
$$P(x) = \sum_{j=1}^{K} P(z_j)\, P(x \mid z_j)$$
Notating $P(z_j)$ as weight $w_j$ and using the normal (a.k.a. Gaussian) distribution $\mathcal{N}(\mu_j, \sigma_j^2)$ gives us...
$$P(x) = \sum_{j=1}^{K} w_j\, \mathcal{N}(x \mid \mu_j, \sigma_j^2)$$
This gives 3 variables per Gaussian to optimize: $w_j, \mu_j, \sigma_j$, such that $\sum_{j=1}^{K} w_j = 1$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 17
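A short sketch of this mixture density (parameter values are assumed for illustration, not taken from the slides), evaluated as a weighted sum of component densities:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 2-component model: ML students vs. NBA players (heights in cm)
w     = np.array([0.7, 0.3])     # mixing weights; must sum to 1
mu    = np.array([175.0, 200.0])
sigma = np.array([8.0, 9.0])

def gmm_pdf(x, w, mu, sigma):
    """P(x) = sum_j w_j * N(x | mu_j, sigma_j^2) for scalar or array x."""
    x = np.atleast_1d(x)[:, None]                  # shape (n, 1) for broadcasting
    return np.sum(w * norm.pdf(x, mu, sigma), axis=1)

print(gmm_pdf([170.0, 198.0], w, mu, sigma))
```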

Bad news: no closed-form solution.
$$\Theta^* = \arg\max_{\Theta} p(X \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{n} \log p(x_i \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{n} \log \sum_{j=1}^{K} w_j\, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 18

Expectation Maximization (EM). Solution: the EM algorithm. EM updates model parameters iteratively. After each iteration, the likelihood that the model would generate the observed data increases (or at least it doesn't decrease). The EM algorithm always converges to a local optimum. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 19

EM Algorithm Summary. Initialize the parameters. E step: calculate the likelihood that a model with these parameters generated the data. M step: update the parameters to increase the likelihood from the E step. Repeat the E & M steps until convergence to a local optimum. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 20

EM for GMM - Initialization. Choose the number of Gaussian components K. K should be much less than the number of data points to avoid overfitting. (Randomly) select parameters for each Gaussian $j$: $w_j, \mu_j, \sigma_j$, such that $\sum_{j=1}^{K} w_j = 1$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 21

EM for GMM - Expectation step. The responsibility $\gamma_{j,n}$ of Gaussian $j$ for observation $x_n$ is defined as...
$$\gamma_{j,n} \equiv p(z_j \mid x_n) = \frac{p(x_n \mid z_j)\, p(z_j)}{p(x_n)} = \frac{p(x_n \mid z_j)\, p(z_j)}{\sum_{k=1}^{K} p(z_k)\, p(x_n \mid z_k)} = \frac{w_j\, \mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}{\sum_{k=1}^{K} w_k\, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)}$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 22

EM for GMM - Expectation step. Define the responsibility $\Gamma_j$ of Gaussian $j$ for all the observed data as...
$$\Gamma_j \equiv \sum_{n=1}^{N} \gamma_{j,n}$$
You can think of this as the proportion of the data explained by Gaussian $j$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 23

EM for GMM - Maximization step. Update our parameters as follows...
$$w_j^{\text{new}} = \frac{\Gamma_j}{N}, \qquad \mu_j^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{j,i}\, x_i}{\Gamma_j}, \qquad (\sigma_j^2)^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_{j,i}\, (x_i - \mu_j)^2}{\Gamma_j}$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 24
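Putting the E and M steps together, here is a minimal one-dimensional EM sketch (my own implementation of the updates above, not the authors' code; the initialization and stopping rule are deliberately simplistic):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Fit a 1-D Gaussian mixture with EM. Returns weights, means, variances."""
    rng = np.random.default_rng(seed)
    N = len(x)
    w = np.full(K, 1.0 / K)                    # mixing weights, sum to 1
    mu = rng.choice(x, size=K, replace=False)  # initialize means at random data points
    var = np.full(K, x.var())                  # start every component at the overall variance

    for _ in range(n_iter):
        # E step: responsibilities gamma[n, j] = p(z_j | x_n)
        dens = w * norm.pdf(x[:, None], mu, np.sqrt(var))    # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters from the responsibilities
        Gamma = gamma.sum(axis=0)                            # shape (K,)
        w = Gamma / N
        mu = (gamma * x[:, None]).sum(axis=0) / Gamma
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Gamma
    return w, mu, var

# Toy data: two hypothetical height populations
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(175, 8, 300), rng.normal(200, 9, 100)])
print(em_gmm_1d(x, K=2))
```

In practice one would also track the log-likelihood and stop when it no longer improves, and put a small floor under the variances to avoid the degenerate zero-variance solutions mentioned later in the remarks.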

Why does this work? We need to prove that, as our model parameters are adjusted, the likelihood of the data never goes down (it is monotonically nondecreasing). This is the part where I point you to the textbook. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 25

What if our data isn't just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions? We need to define a covariance matrix. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 26

Covariance Matrix. Given a d-dimensional random variable vector $\vec{X} = [X_1, \ldots, X_d]$, the covariance matrix, denoted $\Sigma$ (confusing, eh?), is defined as...
$$\Sigma = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_d-\mu_d)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots & E[(X_2-\mu_2)(X_d-\mu_d)] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_d-\mu_d)(X_1-\mu_1)] & E[(X_d-\mu_d)(X_2-\mu_2)] & \cdots & E[(X_d-\mu_d)(X_d-\mu_d)] \end{bmatrix}$$
This is a generalization of one-dimensional variance for a scalar random variable $X$:
$$\sigma^2 = \operatorname{var}(X) = E\!\left[(X-\mu)^2\right]$$
Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 27
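A small check (toy data, not from the slides) that this definition matches NumPy's covariance routine when the divide-by-N (maximum-likelihood) convention is used:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))   # 500 samples of a 3-dimensional random vector

mu = X.mean(axis=0)
Sigma_manual = (X - mu).T @ (X - mu) / len(X)     # averages of (X_i - mu_i)(X_j - mu_j)
Sigma_numpy = np.cov(X, rowvar=False, bias=True)  # bias=True also divides by N

print(np.allclose(Sigma_manual, Sigma_numpy))     # True
```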

Multivariate Gaussian Mixture. (Figure: 2-D data plotted over a first and second dimension, with elliptical Gaussian components.) The d-by-d covariance matrix $\Sigma$ describes the shape and orientation of an ellipse.
$$P(\vec{X}) = \sum_{j=1}^{K} w_j\, \mathcal{N}(\vec{X} \mid \vec{\mu}_j, \Sigma_j)$$
Given d dimensions and K Gaussians, how many parameters? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 28
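One way to answer the slide's question (my own worked count, assuming full covariance matrices): each component has d mean entries and, by symmetry, d(d+1)/2 distinct covariance entries, and there are K mixture weights of which only K-1 are free because they must sum to 1.

```python
def gmm_param_count(d, K):
    """Free parameters in a K-component, d-dimensional GMM with full covariances."""
    per_component = d + d * (d + 1) // 2  # mean entries + distinct covariance entries
    return K * per_component + (K - 1)    # plus K-1 free mixing weights

print(gmm_param_count(d=2, K=3))  # 17
```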

Example: Initialization (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 29

After Iteration #1 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 30

After Iteration #2 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 31

After Iteration #3 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 32

After Iteration #4 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 33

After Iteration #5 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 34

After Iteration #6 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 35

After Iteration #20 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 36

GMM Remarks. GMM is powerful: any density function can be arbitrarily well approximated by a GMM with enough components. If the number of components K is too large, the data will be overfitted. Likelihood increases with K. Extreme case: N Gaussians for N data points, with variances approaching 0; then the likelihood goes to infinity. How to choose K? Use domain knowledge. Validate through visualization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 37

GMM is a soft version of K-means. Similarities: K needs to be specified; both converge to a local optimum; initialization affects the final result, so one would want to try different initializations. Differences: GMM assigns soft labels to instances; GMM considers variances in addition to means. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 38

GMM for Classification. Given training data with multiple classes: 1) Model the training data for each class with a GMM. 2) Classify a new point by estimating the probability that each class generated the point. 3) Pick the class with the highest probability as the label. (Illustration from Leon Bottou's slides on EM.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 39
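A minimal sketch of this recipe (not the authors' code) using scikit-learn's GaussianMixture, assuming equal class priors; with unequal priors one would add log P(class) to each class's score:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X_by_class, n_components=2, seed=0):
    """Fit one GMM per class; X_by_class maps each label to an (n, d) array."""
    return {c: GaussianMixture(n_components=n_components, random_state=seed).fit(X)
            for c, X in X_by_class.items()}

def classify(gmms, X_new):
    """Label each row of X_new with the class whose GMM assigns it the highest log-likelihood."""
    labels = list(gmms)
    scores = np.stack([gmms[c].score_samples(X_new) for c in labels], axis=1)
    return [labels[i] for i in scores.argmax(axis=1)]

# Toy 1-D example: heights of two hypothetical populations
rng = np.random.default_rng(4)
train = {"student": rng.normal(175, 8, (300, 1)), "nba": rng.normal(200, 9, (100, 1))}
gmms = fit_class_gmms(train)
print(classify(gmms, np.array([[172.0], [205.0]])))  # expected: ['student', 'nba']
```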

GMM for Regression. Given a dataset $D = \{\langle x_1, y_1\rangle, \ldots, \langle x_n, y_n\rangle\}$, where $y_i \in \mathbb{R}$ and $x_i$ is a vector of d dimensions... Learn a (d+1)-dimensional GMM. Then, compute $f(x) = E[y \mid x]$. (Illustration from Leon Bottou's slides on EM.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 40
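A sketch of this idea for scalar x and y (my own derivation over a fitted joint GMM, not the authors' code): fit a 2-D GMM on (x, y) pairs; then E[y | x] is a weighted average of each component's conditional mean, with weights proportional to w_j N(x | mu_{x,j}, Sigma_{xx,j}).

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmm_regress(gmm, x):
    """E[y | x] for a 2-D GMM fit on columns [x, y], evaluated at a scalar x."""
    w, mu, Sigma = gmm.weights_, gmm.means_, gmm.covariances_
    # Responsibility of each component for this x (marginalizing out y)
    r = w * norm.pdf(x, mu[:, 0], np.sqrt(Sigma[:, 0, 0]))
    r /= r.sum()
    # Conditional mean of y given x within each component
    cond_mean = mu[:, 1] + Sigma[:, 1, 0] / Sigma[:, 0, 0] * (x - mu[:, 0])
    return float(np.sum(r * cond_mean))

# Toy data: a two-branch relationship that a single line would fit poorly
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 500)
y = np.where(x < 5, 2 * x, 10 - x) + rng.normal(0, 0.5, 500)
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(np.column_stack([x, y]))

print(gmm_regress(gmm, 2.0))  # near 4 if one component latches onto the y = 2x branch
print(gmm_regress(gmm, 8.0))  # near 2 if the other covers the y = 10 - x branch
```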