Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1

The Generative Model POV We think of the data as being generated from some process. We assume this process can be modeled statistically as an underlying distribution. We often assume a parametric distribution, like a Gaussian, because they're easier to represent. We infer model parameters from the data. Then we can use the model to classify/cluster or even generate data. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 2

Parametric Distribution Represent the underlying probability distribution with a parametric density function. [Figure: histogram of machine learning students' heights (cm), roughly 150-200 cm, with a fitted Gaussian density.] Gaussian (normal) distribution, two parameters: $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. View each point as generated from $p(x; \mu, \sigma^2)$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 3

Using Generative Models for Classification Gaussians whose means and variances were learned from data. [Figure: heights (cm) of machine learning students and NBA players, roughly 150-240 cm, each group fit with its own Gaussian.] New point: which group does he belong to? Answer: the group (class) that calls the new point most probable. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 4

Maximum Likelihood Estimation Our hypothesis space is Gaussian distributions. Find parameter(s) $\theta$ that make a Gaussian most likely to generate data $X = (x_1, \dots, x_n)$. Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$ (the product form holds only if $X$ is i.i.d.), with $\theta = \{\mu, \sigma^2\}$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 5

Log-likelihood Function Likelihood function: $l(\theta \mid X) \equiv p(X; \theta) = \prod_{i=1}^{n} p(x_i; \theta)$. Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$. Maximizing the log-likelihood is equivalent to maximizing the likelihood. It is easier to optimize, and it prevents underflow: what happens when multiplying 1000 probabilities? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 6
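
To see the underflow point concretely, here is a minimal NumPy sketch (the 1000 probabilities are made up for illustration): the raw product underflows to zero in double precision, while the sum of logs stays easily representable.

import numpy as np

probs = np.full(1000, 0.01)       # 1000 hypothetical probabilities of 0.01 each
print(np.prod(probs))             # 0.0: the true product (1e-2000) underflows double precision
print(np.sum(np.log(probs)))      # about -4605.2: the log-likelihood is perfectly representable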

Example: Gaussian Log-likelihood Log-likelihood function: $L(\theta \mid X) \equiv \log l(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i; \theta)$. Recall the Gaussian distribution $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. So the log-likelihood of the Gaussian is: $L(\mu, \sigma^2 \mid X) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$, where $-\frac{n}{2}\log(2\pi)$ is a constant term. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 7

Maximizing Log-likelihood Log-likelihood of the Gaussian: $L(\mu, \sigma^2 \mid X) = C - n\log\sigma - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$. Take the partial derivatives w.r.t. $\mu$ and $\sigma$ and set them to 0, i.e. let $\frac{\partial L}{\partial \mu} = 0$ and $\frac{\partial L}{\partial \sigma} = 0$. Solving (try it yourself), we get $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$; $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 8
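
As a sanity check on the closed-form result, a minimal NumPy sketch (the height data is simulated here, not the class data from the slides): the ML estimates are just the sample mean and the 1/n-normalized sample variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=175.0, scale=7.0, size=500)    # simulated heights in cm

mu_hat = x.mean()                                 # mu = (1/n) * sum_i x_i
var_hat = ((x - mu_hat) ** 2).mean()              # sigma^2 = (1/n) * sum_i (x_i - mu)^2  (note 1/n, not 1/(n-1))
print(mu_hat, var_hat)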

What if the data distribution can't be well represented by a single Gaussian? Can we model more complex distributions using multiple Gaussians? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 9

Gaussian Mixture Model (GMM) [Figure: heights (cm) of machine learning students & NBA players, modeled with two Gaussian components.] Represent the distribution with a mixture of Gaussians: $p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$, where $z$ is a membership random variable indicating which Gaussian $x$ belongs to, $P(z = j)$ is the weight of the $j$-th Gaussian (often notated as $w_j$), and $p(x \mid z = j)$ is the $j$-th Gaussian, with parameters $(\mu_j, \sigma_j^2)$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 10

Generative Process for GMM [Figure: heights (cm) of machine learning students & NBA players.] $p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j)$. 1. Randomly pick a component $j$, according to $P(z = j)$; 2. Generate $x$ according to $p(x \mid z = j)$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 11
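
This two-step process translates directly into sampling code. A minimal sketch, assuming two components with made-up weights, means, and standard deviations:

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.7, 0.3])         # mixing weights P(z = j); must sum to 1 (placeholder values)
mu = np.array([172.0, 200.0])    # component means (hypothetical heights in cm)
sigma = np.array([7.0, 9.0])     # component standard deviations

def sample_gmm(n):
    z = rng.choice(len(w), size=n, p=w)     # step 1: pick a component j with probability P(z = j)
    return rng.normal(mu[z], sigma[z])      # step 2: generate x from the j-th Gaussian

x = sample_gmm(1000)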

What are we optimizing? GMM distribution: $p(x) = \sum_{j=1}^{K} P(z = j)\, p(x \mid z = j) = \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\sigma_j} e^{-\frac{(x-\mu_j)^2}{2\sigma_j^2}}$. Three parameters per Gaussian in the mixture: $w_j, \mu_j, \sigma_j^2$, where $\sum_{j=1}^{K} w_j = 1$. Find parameters that maximize data likelihood. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 12
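
The mixture density is just a weighted sum of Gaussian densities. A minimal sketch of $p(x)$ for the 1-D case (gaussian_pdf and gmm_pdf are illustrative helper names, and the parameter values are placeholders, not from the slides):

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x; mu, sigma^2), applied elementwise to an array of x values
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gmm_pdf(x, w, mu, sigma2):
    # p(x) = sum_j w_j * N(x; mu_j, sigma_j^2)
    return sum(w_j * gaussian_pdf(x, mu_j, s_j) for w_j, mu_j, s_j in zip(w, mu, sigma2))

xs = np.linspace(140, 240, 500)
density = gmm_pdf(xs, w=[0.7, 0.3], mu=[172.0, 200.0], sigma2=[49.0, 81.0])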

Maximum Likelihood Estimation of GMM Given $X = (x_1, \dots, x_n)$, $x_i \sim p(x)$, the log-likelihood is $L(\theta \mid X) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} p(z_i = j)\, p(x_i \mid z_i = j) = \sum_{i=1}^{n} \log \sum_{j=1}^{K} w_j \frac{1}{\sqrt{2\pi}\sigma_j} e^{-\frac{(x_i-\mu_j)^2}{2\sigma_j^2}}$. Try to solve for the parameters $\mu_j, \sigma_j^2, w_j$ by setting their partial derivatives to 0? No closed-form solution. (Try it yourself.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 13

Why is ML hard for GMM? Each data point $x_i$ has a membership random variable $z_i$, indicating which Gaussian it comes from. But the value of $z_i$ cannot be observed the way $x_i$ can, i.e. we are uncertain about which Gaussian $x_i$ comes from. $z_i$ is a latent variable because we can't observe it. Latent variables can also be viewed as missing data, data that we didn't observe. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 14

If we know what value $z_i$ takes, ML is easy [Figure: heights (cm) of machine learning students & NBA players, with known memberships.] $w_j = \frac{1}{n}\sum_{i} 1\{z_i = j\}$, $\mu_j = \frac{\sum_{i} 1\{z_i = j\}\, x_i}{\sum_{i} 1\{z_i = j\}}$, $\sigma_j^2 = \frac{\sum_{i} 1\{z_i = j\}\, (x_i - \mu_j)^2}{\sum_{i} 1\{z_i = j\}}$. Indicator function: $1\{z_i = j\} = 1$ if $z_i = j$, and $0$ if $z_i \neq j$. $n$ = number of training examples. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 15
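
With observed memberships, the estimates above reduce to counting and averaging within each component. A minimal sketch (fit_with_known_z is an illustrative name; x and z would be the training data and its known labels):

import numpy as np

def fit_with_known_z(x, z, K):
    # x: (n,) data; z: (n,) integer labels in {0, ..., K-1}, here assumed observed
    n = len(x)
    w, mu, sigma2 = np.zeros(K), np.zeros(K), np.zeros(K)
    for j in range(K):
        mask = (z == j)                               # the indicator 1{z_i = j}
        w[j] = mask.sum() / n                         # fraction of points in component j
        mu[j] = x[mask].mean()                        # sample mean within component j
        sigma2[j] = ((x[mask] - mu[j]) ** 2).mean()   # sample variance within component j
    return w, mu, sigma2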

Illustration of Soft Membership Which component does the point $i$ come from? The probability that it comes from $j$: $q_i^{(j)} \equiv p(z_i = j \mid x_i)$. [Figure: two Gaussian components (1st and 2nd); example points with soft memberships 0.99/0.01, 0.5/0.5, and 0.1/0.9 for the 1st/2nd components.] Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 16

Improving our posterior probability The posterior probability of a Gaussian is the probability that this Gaussian generated the data we observe. Let's find a way to use posterior probabilities to make an algorithm that automatically creates a set of Gaussians that would have been very likely to generate this data. 17

Expectation Maximization (EM) Instead of analytically solving the maximum likelihood parameter estimation problem of GMM, we seek an alternative: the EM algorithm. The EM algorithm updates parameters iteratively. In each iteration, the likelihood value increases. The EM algorithm always converges (to some local optimum), i.e. the likelihood value and the parameters converge. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 18

EM Algorithm Summary Initialize parameters: $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$ in our model. E step: calculate the posterior distribution of the latent variables, i.e. how likely it is that these Gaussians generated the data. M step: update the parameters $w_j, \mu_j, \sigma_j^2$ for each Gaussian $j$. Repeat E and M steps until convergence, i.e. go until the parameters don't change much. It converges to some local optimum. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 19

EM for GMM - Initialization Start by choosing the number of Gaussian components $K$. Also, choose an initialization of the parameters of all components: $w_j, \mu_j, \sigma_j^2$ for $j = 1, \dots, K$. Make sure $\sum_{j=1}^{K} w_j = 1$. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 20

EM for GMM - Expectation step For each $x_i$, calculate its soft membership, i.e. the posterior distribution of $z_i$, using the current parameters. Bayes rule: $q_i^{(j)} \equiv p(z_i = j \mid x_i) = \frac{p(z_i = j, x_i)}{p(x_i)} = \frac{p(x_i \mid z_i = j)\, p(z_i = j)}{\sum_{l=1}^{K} p(x_i \mid z_i = l)\, p(z_i = l)}$, where $p(z_i = j) = w_j$ is the prior distribution of component $j$. Note: we are guessing the distribution (i.e. a soft membership) of $z_i$, instead of a hard membership. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 21
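
A minimal sketch of the E step for the 1-D case (e_step is an illustrative name; gaussian_pdf is repeated from the earlier sketch so this block stands alone): q has one row per data point and one column per component, and each row sums to 1.

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def e_step(x, w, mu, sigma2):
    # x: (n,) data; w, mu, sigma2: length-K arrays of current parameters
    # numerator of Bayes rule: p(x_i | z_i = j) * p(z_i = j) for every i, j
    q = np.array([w_j * gaussian_pdf(x, mu_j, s_j)
                  for w_j, mu_j, s_j in zip(w, mu, sigma2)]).T    # shape (n, K)
    return q / q.sum(axis=1, keepdims=True)                       # normalize rows (Bayes rule denominator)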

EM for GMM - Maximization step M step: update parameters. Recall $q_i^{(j)}$ is the soft membership of $x_i$ in the $j$-th Gaussian. $w_j = \frac{1}{n}\sum_{i=1}^{n} q_i^{(j)}$ (average of the memberships), $\mu_j = \frac{\sum_{i=1}^{n} q_i^{(j)} x_i}{\sum_{i=1}^{n} q_i^{(j)}}$ (weighted average using the memberships), $\sigma_j^2 = \frac{\sum_{i=1}^{n} q_i^{(j)} (x_i - \mu_j)^2}{\sum_{i=1}^{n} q_i^{(j)}}$ (weighted variance using the memberships). Repeat E step and M step until convergence. Convergence criterion in practice: compared with the previous iteration, the likelihood value doesn't increase much, or the parameters don't change much. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 22
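
A matching M step plus the outer EM loop, reusing e_step from the previous sketch; the convergence test on the means, the tolerance, and the iteration cap are arbitrary choices, not from the slides.

import numpy as np

def m_step(x, q):
    # x: (n,) data; q: (n, K) soft memberships from the E step
    nk = q.sum(axis=0)                                        # effective count per component
    w = nk / len(x)                                           # w_j = (1/n) * sum_i q_i^(j)
    mu = (q * x[:, None]).sum(axis=0) / nk                    # weighted mean
    sigma2 = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk    # weighted variance
    return w, mu, sigma2

def em_gmm(x, w, mu, sigma2, max_iter=200, tol=1e-6):
    # w, mu, sigma2: NumPy arrays holding the initial guesses for the K components
    for _ in range(max_iter):
        q = e_step(x, w, mu, sigma2)                # E step: soft memberships
        w_new, mu_new, s2_new = m_step(x, q)        # M step: re-estimate parameters
        done = np.max(np.abs(mu_new - mu)) < tol    # crude convergence check on the means
        w, mu, sigma2 = w_new, mu_new, s2_new
        if done:
            break
    return w, mu, sigma2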

What if our data isn't just scalars, but each data point has multiple dimensions? Can we generalize to multiple dimensions? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 23

Multivariate Gaussian Mixture $p(\mathbf{x}) = \sum_{j=1}^{K} w_j \frac{1}{(2\pi)^{d/2} |S_j|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\mu_j)^T S_j^{-1} (\mathbf{x}-\mu_j)\right)$, where $d$ is the dimensionality. $\mu_j$, a mean vector, marks the center of an ellipse; $S_j$, the covariance matrix ($d \times d$), describes the shape and orientation of the ellipse. Parameters: $\mu_j, S_j, w_j$ for $j = 1, \dots, K$, with $\sum_{j=1}^{K} w_j = 1$. How many parameters? $dK$ (means) $+ K\frac{d(d+1)}{2}$ (covariances) $+ K$ (weights). Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 24
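
For the multivariate case the component density can be delegated to SciPy. A minimal sketch of the mixture density over d-dimensional points, assuming SciPy is available (function name and parameter values are placeholders):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf_multivariate(X, w, means, covs):
    # X: (n, d) data; w: length-K weights; means: list of (d,) vectors; covs: list of (d, d) matrices
    return sum(w_j * multivariate_normal.pdf(X, mean=m_j, cov=S_j)
               for w_j, m_j, S_j in zip(w, means, covs))

# example with K = 2, d = 2 (placeholder parameters)
X = np.random.default_rng(0).normal(size=(5, 2))
p = gmm_pdf_multivariate(X, w=[0.6, 0.4],
                         means=[np.zeros(2), np.ones(2)],
                         covs=[np.eye(2), 0.5 * np.eye(2)])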

Example: Initialization (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 25

After Iteration #1 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 26

After Iteration #2 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 27

After Iteration #3 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 28

After Iteration #4 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 29

After Iteration #5 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 30

After Iteration #6 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 31

After Iteration #20 (Illustration from Andrew Moore's tutorial slides on GMM) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 32

GMM Remarks GMM is powerful: any density function can be arbitrarily well approximated by a GMM with enough components. If the number of components $K$ is too large, the data will be overfitted. Likelihood increases with $K$. Extreme case: one Gaussian per data point, with variances $\to 0$, then the likelihood $\to \infty$. How to choose $K$? Use domain knowledge. Validate through visualization. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 33

GMM is a soft version of K-means Similarities: $K$ needs to be specified. Converges to some local optimum. Initialization matters for the final result; one would want to try different initializations. Differences: GMM assigns soft labels to instances. GMM considers variances in addition to means. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 34

GMM for Classification 1. Given $D = \{(x_i, y_i)\}$, where $y_i \in \{1, \dots, C\}$. 2. Model $p(x \mid y = l)$ with a GMM, for each $l$. 3. Calculate the class posterior probability (Bayes optimal classifier): $p(y = l \mid x) = \frac{p(x \mid y = l)\, p(y = l)}{\sum_{k=1}^{C} p(x \mid y = k)\, p(y = k)}$. 4. Classify $x$ to the class having the largest posterior. (Illustration from Leon Bottou's slides on EM.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 35
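
A minimal sketch of this per-class-GMM classifier, assuming scikit-learn's GaussianMixture is available; the number of components per class (2 here) and the helper names are arbitrary choices, not from the slides.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_classifier(X, y, n_components=2):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])    # p(y = l) from class frequencies
    models = [GaussianMixture(n_components=n_components, random_state=0).fit(X[y == c])
              for c in classes]                               # one GMM per class: p(x | y = l)
    return classes, priors, models

def predict(X, classes, priors, models):
    # log p(x | y = l) + log p(y = l); the shared denominator cancels in the argmax
    log_post = np.stack([m.score_samples(X) + np.log(p) for m, p in zip(models, priors)], axis=1)
    return classes[np.argmax(log_post, axis=1)]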

GMM for Regression Given $D = \{(x_i, y_i)\}$, where $y_i \in \mathbb{R}$. Model the joint $p(x, y)$ with a GMM. Compute $f(x) = E[y \mid x]$, the conditional expectation. (Illustration from Leon Bottou's slides on EM.) Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 36
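
A minimal sketch of $E[y \mid x]$ for scalar x and y under a joint 2-D GMM, assuming scikit-learn and SciPy are available: fit the joint, then average each component's linear predictor $E[y \mid x, z = j]$ weighted by that component's responsibility for the observed x (the function name, the number of components, and the variable names are illustrative, not from the slides).

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmm_regression(x_train, y_train, x_query, n_components=3):
    # x_train, y_train, x_query: 1-D NumPy arrays
    gm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gm.fit(np.column_stack([x_train, y_train]))               # model the joint p(x, y)
    resp = np.zeros((len(x_query), n_components))
    cond_mean = np.zeros((len(x_query), n_components))
    for j in range(n_components):
        mx, my = gm.means_[j]
        Sxx, Sxy = gm.covariances_[j][0, 0], gm.covariances_[j][0, 1]
        resp[:, j] = gm.weights_[j] * norm.pdf(x_query, loc=mx, scale=np.sqrt(Sxx))  # w_j * N(x; mu_xj, S_xxj)
        cond_mean[:, j] = my + (Sxy / Sxx) * (x_query - mx)                          # E[y | x, z = j]
    resp /= resp.sum(axis=1, keepdims=True)                   # p(z = j | x)
    return (resp * cond_mean).sum(axis=1)                     # E[y | x] = sum_j p(z=j | x) E[y | x, z=j]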

You Should Know How to do maximum likelihood (ML) estimation? GMM models the data distribution with a mixture of $K$ Gaussians, with parameters $\mu_j, S_j, w_j$, for $j = 1, \dots, K$. No closed-form solution for ML estimation of GMM parameters. How to estimate GMM parameters with the EM algorithm? How is GMM related to K-means? How to use GMM for clustering, classification and regression? Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 37