Mixtures of Gaussians continued

Mixtures of Gaussians (continued). Machine Learning CSE446, Carlos Guestrin, University of Washington, May 17, 2013.

(One) bad case for k-means:
- Clusters may overlap.
- Some clusters may be wider than others.

Gaussians in m dimensions:
$$P(x) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

Suppose you have a Gaussian for each class y = i:
$$P(x \mid y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$
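As a quick illustration, here is a minimal sketch (not from the slides; the helper name gaussian_density is my own) that evaluates this m-dimensional density with NumPy:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the m-dimensional Gaussian density P(x) at the point x."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    # Use solve() for Sigma^{-1} (x - mu) instead of forming the inverse.
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))  # 1/(2*pi) ~ 0.159
```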

Gaussian Bayes Classifier:
- You have a Gaussian over x for each class y = i:
$$P(x \mid y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$
- But you need the probability of class y = i given x.
- Thank you, Bayes rule (a small sketch of this computation follows below):
$$P(y = i \mid x) = \frac{P(x \mid y = i)\, P(y = i)}{P(x)}$$

Predicting wealth from age (figure).
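A hedged sketch of that Bayes-rule computation, assuming one fitted Gaussian per class; the helper name class_posterior and the use of scipy.stats.multivariate_normal are my choices, not the course's code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_posterior(x, mus, Sigmas, priors):
    """P(y=i | x) = P(x | y=i) P(y=i) / P(x), with one Gaussian per class."""
    joint = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) * p
                      for mu, Sigma, p in zip(mus, Sigmas, priors)])
    return joint / joint.sum()  # dividing by the sum is dividing by P(x)
```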

Predicting wealth from age (figure).

Learning modelyear, mpg ---> maker, with a general (full) covariance matrix:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

General: O(m²) parameters (the full covariance matrix above).

Aligned: O(m) parameters, i.e. a diagonal covariance matrix:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$

Aligned: O(m) parameters (diagonal Σ, as above).

Spherical: O(1) covariance parameters:
$$\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$

Spherical: O(1) covariance parameters (Σ = σ²I, as above).

Next... back to density estimation. What if we want to do density estimation with multimodal or clumpy data?

But we don't see class labels!
- MLE: $\arg\max \prod_j P(y_j, x_j)$.
- But we don't know $y_j$!
- Maximize the marginal likelihood instead:
$$\arg\max \prod_j P(x_j) = \arg\max \prod_j \sum_{i=1}^k P(y_j = i, x_j)$$

Special case: spherical Gaussians and hard assignments.
$$P(y = i \mid x_j) \propto \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) P(y = i)$$
- If $P(X \mid Y = i)$ is spherical, with the same σ for all classes:
$$P(x_j \mid y = i) \propto \exp\!\left( -\frac{1}{2\sigma^2} \| x_j - \mu_i \|^2 \right)$$
- If each $x_j$ belongs to exactly one class $C(j)$ (hard assignment), the marginal likelihood becomes:
$$\prod_{j=1}^m \sum_{i=1}^k P(x_j, y = i) \propto \prod_{j=1}^m \exp\!\left( -\frac{1}{2\sigma^2} \| x_j - \mu_{C(j)} \|^2 \right)$$
- Same as K-means!
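To make the K-means connection concrete, here is a small sketch of my own (assuming spherical Gaussians with a shared σ): the negative log of the hard-assignment likelihood above is, up to constants, the K-means sum-of-squared-distances objective.

```python
import numpy as np

def hard_assignment_neg_log_lik(X, mus, assignments, sigma=1.0):
    """-log prod_j exp(-||x_j - mu_{C(j)}||^2 / (2 sigma^2)), dropping constants.
    Minimizing this over assignments and means is exactly K-means."""
    diffs = X - mus[assignments]                # x_j - mu_{C(j)} for every point j
    return np.sum(diffs ** 2) / (2 * sigma ** 2)
```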

The GMM assumption:
- There are k components.
- Component i has an associated mean vector $\mu_i$ (the figure shows three means $\mu_1, \mu_2, \mu_3$).
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$.
- Each data point is generated according to the following recipe:

1. Pick a component at random: choose component i with probability $P(y = i)$.
2. Generate the datapoint $x \sim N(\mu_i, \sigma^2 I)$.
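This two-step recipe translates directly into a sampler. A minimal sketch assuming a shared spherical covariance σ²I (the function name sample_gmm and its arguments are my own):

```python
import numpy as np

def sample_gmm(n, mus, sigma, priors, seed=0):
    """Step 1: pick component i with probability P(y=i).
    Step 2: draw x ~ N(mu_i, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    k, m = mus.shape
    ys = rng.choice(k, size=n, p=priors)                  # component labels
    xs = mus[ys] + sigma * rng.standard_normal((n, m))    # add spherical noise
    return xs, ys
```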

The general GMM assumption:
- There are k components.
- Component i has an associated mean vector $\mu_i$.
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$.
- Each data point is generated according to the following recipe:
1. Pick a component at random: choose component i with probability $P(y = i)$.
2. Generate the datapoint $x \sim N(\mu_i, \Sigma_i)$.

Unsupervised Learning with Mixtures of Gaussians: the EM Algorithm. Machine Learning CSE446, Carlos Guestrin, University of Washington, May 17, 2013.

Supervised learning of mixtures of Gaussians:
- Mixture of Gaussians model: prior class probabilities $P(y)$ and a likelihood function per class $P(x \mid y = i)$.
- Suppose that, for each data point, we know both the location x and the class y. Then learning is easy: estimate the prior $P(y)$ from the class counts, and fit the likelihood function for each class from the points in that class (a counting-and-fitting sketch follows below).

Unsupervised learning: not as hard as it looks. Sometimes easy, sometimes impossible, and sometimes in between. (In case you're wondering what these diagrams are, they show 2-d unlabeled data, the x vectors, distributed in 2-d space. The top one has three very clear Gaussian centers.)
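With labels, "learning is easy" really is just counting and per-class fitting. A hedged sketch under that assumption (the helper name fit_supervised_mog is mine):

```python
import numpy as np

def fit_supervised_mog(X, y):
    """Estimate P(y=i) by class frequencies and fit one Gaussian per class."""
    classes = np.unique(y)
    priors = {i: float(np.mean(y == i)) for i in classes}           # P(y=i)
    mus = {i: X[y == i].mean(axis=0) for i in classes}              # class means
    Sigmas = {i: np.cov(X[y == i], rowvar=False) for i in classes}  # class covariances
    return priors, mus, Sigmas
```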

EM: reducing unsupervised learning to supervised learning.
- If we knew the assignment of points to classes, this would be supervised learning!
- Expectation-Maximization (EM): guess the assignment of points to classes, recompute the model parameters, and iterate (see the sketch below).

DETOUR: the E.M. algorithm.
- We'll get back to unsupervised learning soon.
- But first we'll look at an even simpler case with hidden information.
- The EM algorithm:
  - can do trivial things, such as the contents of the next few slides;
  - is an excellent way of doing our unsupervised learning problem, as we'll see;
  - has many, many other uses.
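For reference before the detour, a minimal sketch of that guess/recompute/iterate loop for a GMM, under assumptions of my own choosing: spherical components with a known shared σ and soft (rather than hard) assignments. None of the names here are course code.

```python
import numpy as np

def em_spherical_gmm(X, k, sigma=1.0, iters=50, seed=0):
    """E-step: soft-assign points to components; M-step: re-estimate means
    and mixing weights from the soft assignments. Repeat."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mus = X[rng.choice(n, size=k, replace=False)]  # init means at random points
    priors = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[j, i] proportional to P(y=i) N(x_j; mu_i, sigma^2 I)
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
        log_r = np.log(priors) - d2 / (2 * sigma ** 2)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means and mixing proportions
        nk = r.sum(axis=0)
        mus = (r.T @ X) / nk[:, None]
        priors = nk / n
    return mus, priors
```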

Silly example. Let events be grades in a class:
- w1 = gets an A: P(A) = 1/2
- w2 = gets a B: P(B) = µ
- w3 = gets a C: P(C) = 2µ
- w4 = gets a D: P(D) = 1/2 - 3µ
(Note 0 ≤ µ ≤ 1/6.) Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, and d D's. What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial statistics. With P(A) = 1/2, P(B) = µ, P(C) = 2µ, P(D) = 1/2 - 3µ:
$$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}$$
$$\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\!\left(\tfrac{1}{2} - 3\mu\right)$$
For the maximum-likelihood µ, set $\partial \log P / \partial \mu = 0$:
$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0
\quad\Longrightarrow\quad
\hat\mu = \frac{b + c}{6\,(b + c + d)}$$
So if the class got A = 14, B = 6, C = 9, D = 10, the maximum-likelihood µ is 1/10. Boring, but true!
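A quick numeric check of that estimate (mle_mu is just a hypothetical helper for this example):

```python
def mle_mu(b, c, d):
    """Maximum-likelihood mu = (b + c) / (6 (b + c + d)); note that a drops out."""
    return (b + c) / (6 * (b + c + d))

print(mle_mu(b=6, c=9, d=10))  # the A=14, B=6, C=9, D=10 class gives 0.1
```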

Same problem with hidden information. Someone tells us that:
- the number of high grades (A's + B's) = h,
- the number of C's = c,
- the number of D's = d.
What is the maximum-likelihood estimate of µ now? We can answer this question circularly (remember: P(A) = 1/2, P(B) = µ, P(C) = 2µ, P(D) = 1/2 - 3µ):
- EXPECTATION: if we know the value of µ, we can compute the expected values of a and b. Since the ratio a : b should be the same as the ratio 1/2 : µ,
$$a = \frac{\tfrac{1}{2}}{\tfrac{1}{2} + \mu}\, h, \qquad b = \frac{\mu}{\tfrac{1}{2} + \mu}\, h$$
- MAXIMIZATION: if we know the expected values of a and b, we can compute the maximum-likelihood value of µ:
$$\mu = \frac{b + c}{6\,(b + c + d)}$$

E.M. for our trivial problem. We begin with a guess for µ and iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ, a, and b. Define µ(t) as the estimate of µ on the t-th iteration and b(t) as the estimate of b on the t-th iteration:
- µ(0) = initial guess
- E-step: $b^{(t)} = \dfrac{\mu^{(t)} h}{\tfrac{1}{2} + \mu^{(t)}} = E\big[b \mid \mu^{(t)}\big]$
- M-step: $\mu^{(t+1)} = \dfrac{b^{(t)} + c}{6\,\big(b^{(t)} + c + d\big)}$ = maximum-likelihood estimate of µ given b(t)
Continue iterating until converged. Good news: convergence to a local optimum is assured. Bad news: I said local optimum.

E.M. Convergence n Convergence proof based on fact that Probdata µ) must increase or remain same between each iteration [NOT OBVIOUS] n But it can never exceed 1 [OBVIOUS] So it must therefore converge [OBVIOUS] In our example, suppose we had h = 20 c = 10 d = 10 µ 0) = 0 Convergence is generally linear: error decreases by a constant factor each time step. t µ t) b t) 0 0 0 1 0.0833 2.857 2 0.0937 3.158 3 0.0947 3.185 4 0.0948 3.187 5 0.0948 3.187 6 0.0948 3.187 31 16