K-means. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University. November 19th, 2007.


EM

Machine Learning 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 19th, 2007

K-means

1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!
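The loop above maps almost line-for-line onto a few lines of NumPy. The following is a minimal illustrative sketch, not the course's reference code; the function name, the choice of random datapoints as initial centers, and the stopping test are my own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means. X is an (m, d) array of datapoints."""
    rng = np.random.default_rng(seed)
    # 2. Randomly guess k cluster center locations (here: k distinct datapoints).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 3. Each datapoint finds out which center it's closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 4./5. Each center finds the centroid of the points it owns and jumps there.
        new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        # 6. Repeat until terminated (here: until the centers stop moving).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```

Empty clusters simply keep their old center here; other tie-breaking schemes are equally valid.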

K-means

- Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$
- Classify: assign each point $j \in \{1, \ldots, m\}$ to the nearest center:
  $C^{(t)}(j) \leftarrow \arg\min_i \|\mu_i - \mathbf{x}_j\|^2$
- Recenter: $\mu_i$ becomes the centroid of its points:
  $\mu_i^{(t+1)} \leftarrow \arg\min_\mu \sum_{j : C^{(t)}(j) = i} \|\mu - \mathbf{x}_j\|^2$
  Equivalent to $\mu_i$ = average of its points!

Does K-means converge??? Part 2

- Optimize the potential function:
  $\min_\mu \min_C F(\mu, C) = \min_\mu \min_C \sum_{j=1}^m \|\mu_{C(j)} - \mathbf{x}_j\|^2$
- Fix C, optimize µ

Coordinate descent algorithms

- Want: $\min_a \min_b F(a, b)$
- Coordinate descent (a toy numeric sketch follows after this slide's notes):
  - fix a, minimize over b
  - fix b, minimize over a
  - repeat
- Converges!!!
  - if F is bounded
  - to a (often good) local optimum
  - as we saw in the applet: play with it!
- K-means is a coordinate descent algorithm!

One bad case for k-means

- Clusters may overlap
- Some clusters may be wider than others
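Picking up the coordinate descent pattern from above: here is a toy run on a bounded-below quadratic. The particular F(a, b), starting point, and tolerance are illustrative assumptions, not from the slides.

```python
def F(a, b):
    # A bounded-below function of two coordinates (illustrative choice).
    return a**2 + b**2 + a*b - a - b

a, b = 5.0, -3.0                      # arbitrary starting point
for _ in range(100):
    a_new = (1 - b) / 2               # fix b, minimize over a (set dF/da = 2a + b - 1 = 0)
    b_new = (1 - a_new) / 2           # fix a, minimize over b (set dF/db = 2b + a - 1 = 0)
    if abs(a_new - a) < 1e-12 and abs(b_new - b) < 1e-12:
        break
    a, b = a_new, b_new

print(a, b, F(a, b))                  # converges to a = b = 1/3
```

K-means follows exactly this pattern, with one block of variables being the assignments C and the other being the centers µ.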

Gaussian Bayes Classifier Reminder

$P(y = i \mid \mathbf{x}_j) = \dfrac{p(\mathbf{x}_j \mid y = i)\, P(y = i)}{p(\mathbf{x}_j)}$

$P(y = i \mid \mathbf{x}_j) \propto \dfrac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y = i)$
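The reminder formula above can be evaluated numerically with SciPy's multivariate normal density. A minimal sketch, assuming class means `mus`, covariances `sigmas`, and priors `priors` have already been estimated (those names are mine, not the lecture's):

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_posterior(x, mus, sigmas, priors):
    """P(y = i | x) for every class i of a Gaussian Bayes classifier."""
    # Unnormalized posterior: N(x; mu_i, Sigma_i) * P(y = i)
    unnorm = np.array([multivariate_normal(mean=mu, cov=S).pdf(x) * p
                       for mu, S, p in zip(mus, sigmas, priors)])
    # Normalizing by the sum divides out p(x), as in Bayes rule above.
    return unnorm / unnorm.sum()
```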

Predicting wealth from age

Learning modelyear, mpg ---> maker

$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$

General: O(m²) parameters

$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$

Aligned: O(m) parameters

$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$

Spherical: O(1) cov parameters

$\Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix}$

Next... back to Density Estimation

What if we want to do density estimation with multimodal or clumpy data?

But we don't see class labels!!!

- MLE: $\arg\max \prod_j P(y_j, \mathbf{x}_j)$
- But we don't know the $y_j$'s!!!
- Maximize the marginal likelihood:
  $\arg\max \prod_j P(\mathbf{x}_j) = \arg\max \prod_j \sum_{i=1}^k P(y_j = i, \mathbf{x}_j)$

Special case: spherical Gaussians and hard assignments

$P(y = i \mid \mathbf{x}_j) \propto \dfrac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y = i)$

- If $P(X \mid Y = i)$ is spherical, with the same σ for all classes:
  $P(\mathbf{x}_j \mid y = i) \propto \exp\!\left[-\tfrac{1}{2\sigma^2} \|\mathbf{x}_j - \mu_i\|^2\right]$
- If each $\mathbf{x}_j$ belongs to one class $C(j)$ (hard assignment), the marginal likelihood becomes:
  $\prod_{j=1}^m \sum_{i=1}^k P(\mathbf{x}_j, y = i) \propto \prod_{j=1}^m \exp\!\left[-\tfrac{1}{2\sigma^2} \|\mathbf{x}_j - \mu_{C(j)}\|^2\right]$
- Same as K-means!!!
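One way to see the "same as K-means" claim numerically: with spherical Gaussians and hard assignments, the log-likelihood is, up to additive constants, the negative K-means potential scaled by 1/(2σ²). A small sketch, assuming equal class priors (an assumption of mine) so that the prior terms drop out:

```python
import numpy as np

def kmeans_potential(X, centers, assign):
    """K-means objective: sum of squared distances to each point's assigned center."""
    return np.sum(np.linalg.norm(X - centers[assign], axis=1) ** 2)

def hard_assignment_loglik(X, centers, assign, sigma=1.0):
    """Log joint likelihood under spherical Gaussians with hard assignments,
    dropping additive terms (normalizers, equal class priors) that do not
    depend on the centers or the assignments."""
    return -kmeans_potential(X, centers, assign) / (2 * sigma ** 2)
```

Maximizing the second quantity over centers and assignments is the same as minimizing the first, which is exactly what K-means does.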

The GMM assumption

- There are k components
- Component i has an associated mean vector $\mu_i$
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$
- Each data point is generated according to the following recipe:

1. Pick a component at random: choose component i with probability $P(y = i)$.
2. Datapoint $\sim N(\mu_i, \sigma^2 I)$.

The General GMM assumption

- There are k components
- Component i has an associated mean vector $\mu_i$
- Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
- Each data point is generated according to the following recipe (a sampler sketch follows below):
  1. Pick a component at random: choose component i with probability $P(y = i)$.
  2. Datapoint $\sim N(\mu_i, \Sigma_i)$.

Unsupervised Learning: not as hard as it looks

- Sometimes easy
- Sometimes impossible
- and sometimes in between

(In case you're wondering what these diagrams are: they show 2-d unlabeled data, x vectors distributed in 2-d space. The top one has three very clear Gaussian centers.)
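The generative recipe above translates directly into a sampler. A minimal sketch; the parameter names `priors`, `mus`, and `sigmas` are illustrative assumptions:

```python
import numpy as np

def sample_gmm(n, priors, mus, sigmas, seed=0):
    """Draw n points from a general GMM by following the recipe above."""
    rng = np.random.default_rng(seed)
    # 1. Pick a component at random: choose component i with probability P(y = i).
    ys = rng.choice(len(priors), size=n, p=priors)
    # 2. Datapoint ~ N(mu_i, Sigma_i).
    X = np.array([rng.multivariate_normal(mus[i], sigmas[i]) for i in ys])
    return X, ys
```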

Marginal likelihood for the general case

$P(y = i \mid \mathbf{x}_j) \propto \dfrac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y = i)$

Marginal likelihood:

$\prod_{j=1}^m P(\mathbf{x}_j) = \prod_{j=1}^m \sum_{i=1}^k P(\mathbf{x}_j, y = i) = \prod_{j=1}^m \sum_{i=1}^k \dfrac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_j - \mu_i)\right] P(y = i)$

Special case 2: spherical Gaussians and soft assignments

- If $P(X \mid Y = i)$ is spherical, with the same σ for all classes:
  $P(\mathbf{x}_j \mid y = i) \propto \exp\!\left[-\tfrac{1}{2\sigma^2} \|\mathbf{x}_j - \mu_i\|^2\right]$
- Uncertain about the class of each $\mathbf{x}_j$ (soft assignment), marginal likelihood:
  $\prod_{j=1}^m \sum_{i=1}^k P(\mathbf{x}_j, y = i) \propto \prod_{j=1}^m \sum_{i=1}^k \exp\!\left[-\tfrac{1}{2\sigma^2} \|\mathbf{x}_j - \mu_i\|^2\right] P(y = i)$
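The general-case marginal likelihood above is most stably computed in log space with a log-sum-exp over components. A sketch, reusing the hypothetical parameter names from the sampler above:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_marginal_likelihood(X, priors, mus, sigmas):
    """log prod_j P(x_j) = sum_j log sum_i P(x_j | y = i) P(y = i)."""
    # One column per component: log P(x_j | y = i) + log P(y = i)
    log_joint = np.column_stack([
        multivariate_normal(mean=mu, cov=S).logpdf(X) + np.log(p)
        for mu, S, p in zip(mus, sigmas, priors)])
    # Sum over components inside the log, then over datapoints outside it.
    return logsumexp(log_joint, axis=1).sum()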

Unsupervised Learning: Mediumly Good News

We now have a procedure such that if you give me a guess at $\mu_1, \mu_2, \ldots, \mu_k$, I can tell you the probability of the unlabeled data given those µ's.

Suppose the x's are 1-dimensional. There are two classes, $w_1$ and $w_2$, with $P(y_1) = 1/3$, $P(y_2) = 2/3$, and $\sigma = 1$. There are 25 unlabeled datapoints:

x_1 = 0.608
x_2 = -1.590
x_3 = 0.235
x_4 = 3.949
...
x_25 = -0.712

(From Duda and Hart)

Duda & Hart's Example

We can graph the probability distribution function of the data given our $\mu_1$ and $\mu_2$ estimates. We can also graph the true function from which the data was randomly generated. They are close. Good.

The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa.

In this example unsupervised is almost as good as supervised. If the $x_1 \ldots x_{25}$ are given the class which was used to learn them, then the results are $\mu_1 = -2.176$, $\mu_2 = 1.684$. Unsupervised got $\mu_1 = -2.13$, $\mu_2 = 1.668$.
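The "give me a guess at the µ's and I'll tell you the probability of the unlabeled data" procedure is one short function in this 1-d, two-class setting. A sketch; only the datapoints actually printed on the slide are used below, since the full list of 25 is elided there:

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(xs, mu1, mu2, p1=1/3, p2=2/3, sigma=1.0):
    """log P(x_1..x_n | mu_1, mu_2) for the two-component 1-d Gaussian mixture above."""
    return np.sum(np.log(p1 * norm.pdf(xs, loc=mu1, scale=sigma) +
                         p2 * norm.pdf(xs, loc=mu2, scale=sigma)))

# Only the points shown on the slide; the remaining 20 are elided there.
xs = np.array([0.608, -1.590, 0.235, 3.949, -0.712])
print(mixture_loglik(xs, mu1=-2.13, mu2=1.668))
```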

Duda & Hart's Example

Graph of $\log P(x_1, x_2, \ldots, x_{25} \mid \mu_1, \mu_2)$ against $\mu_1$ and $\mu_2$.

Max likelihood = $(\mu_1 = -2.13, \mu_2 = 1.668)$.

Local minimum, but very close to global, at $(\mu_1 = 2.085, \mu_2 = -1.257)$: corresponds to switching $y_1$ with $y_2$.

Finding the max likelihood $\mu_1, \mu_2, \ldots, \mu_k$

- We can compute $P(\text{data} \mid \mu_1, \mu_2, \ldots, \mu_k)$.
- How do we find the $\mu_i$'s which give max likelihood?
- The normal max likelihood trick: set $\frac{\partial}{\partial \mu_i} \log P = 0$ and solve for the $\mu_i$'s. Here you get non-linear, non-analytically-solvable equations.
- Use gradient descent: often slow but doable.
- Use a much faster, cuter, and recently very popular method...

Announcements

- HW5 out later today
  - Due December 5th by 3pm, to Monica Hopes, Wean 4619
- Project:
  - Poster session: NSH Atrium, Friday 11/30, 2-5pm
  - Print your poster early!!! SCS facilities has a poster printer, ask the helpdesk
  - Students from outside SCS should check with their departments
  - It's OK to print separate pages
  - We'll provide pins, posterboard and an easel
  - Poster size: 32x40 inches
  - Invite your friends; there will be a prize for best poster, by popular vote
- Last lecture: Thursday, 11/29, 5-6:20pm, Wean 7500

Expectation Maximalization

DETOUR: The E.M. Algorithm

- We'll get back to unsupervised learning soon
- But now we'll look at an even simpler case with hidden information
- The EM algorithm
  - Can do trivial things, such as the contents of the next few slides
  - An excellent way of doing our unsupervised learning problem, as we'll see
  - Many, many other uses, including learning BNs with hidden data

Silly Example

Let events be grades in a class:
- $w_1$ = Gets an A: $P(A) = \tfrac{1}{2}$
- $w_2$ = Gets a B: $P(B) = \mu$
- $w_3$ = Gets a C: $P(C) = 2\mu$
- $w_4$ = Gets a D: $P(D) = \tfrac{1}{2} - 3\mu$

(Note $0 \le \mu \le 1/6$.)

Assume we want to estimate µ from data. In a given class there were:
- a A's
- b B's
- c C's
- d D's

What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics

$P(A) = \tfrac{1}{2}$, $P(B) = \mu$, $P(C) = 2\mu$, $P(D) = \tfrac{1}{2} - 3\mu$

$P(a, b, c, d \mid \mu) = K\, (\tfrac{1}{2})^a \mu^b (2\mu)^c (\tfrac{1}{2} - 3\mu)^d$

$\log P(a, b, c, d \mid \mu) = \log K + a \log\tfrac{1}{2} + b \log\mu + c \log 2\mu + d \log(\tfrac{1}{2} - 3\mu)$

For the max-likelihood µ, set $\frac{\partial \log P}{\partial \mu} = 0$:

$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0$

Gives max-likelihood $\mu = \dfrac{b + c}{6(b + c + d)}$

So if the class got A = 14, B = 6, C = 9, D = 10, then the max-likelihood estimate is µ = 1/10.

Boring, but true!
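As a quick numeric check of the closed-form estimate (plain arithmetic, not from the slides):

```python
a, b, c, d = 14, 6, 9, 10
mu_hat = (b + c) / (6 * (b + c + d))
print(mu_hat)   # 0.1, matching the slide's max-likelihood mu = 1/10
```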

Same Problem with Hidden Information

REMEMBER: $P(A) = \tfrac{1}{2}$, $P(B) = \mu$, $P(C) = 2\mu$, $P(D) = \tfrac{1}{2} - 3\mu$

Someone tells us that:
- Number of High grades (A's + B's) = h
- Number of C's = c
- Number of D's = d

What is the max-likelihood estimate of µ now?

We can answer this question circularly:

EXPECTATION: If we know the value of µ, we could compute the expected values of a and b:
$a = \dfrac{1/2}{1/2 + \mu}\, h, \qquad b = \dfrac{\mu}{1/2 + \mu}\, h$
since the ratio a : b should be the same as the ratio ½ : µ.

MAXIMIZATION: If we know the expected values of a and b, we could compute the maximum likelihood value of µ:
$\mu = \dfrac{b + c}{6(b + c + d)}$

E.M. for our Trivial Problem

- We begin with a guess for µ.
- We iterate between EXPECTATION and MAXIMALIZATION to improve our estimates of µ and of a and b.
- Define $\mu^{(t)}$ = the estimate of µ on the t-th iteration, $b^{(t)}$ = the estimate of b on the t-th iteration.

$\mu^{(0)}$ = initial guess

E-step: $b^{(t)} = \dfrac{\mu^{(t)} h}{\tfrac{1}{2} + \mu^{(t)}} = E[b \mid \mu^{(t)}]$

M-step: $\mu^{(t+1)} = \dfrac{b^{(t)} + c}{6(b^{(t)} + c + d)}$ = max-likelihood estimate of µ given $b^{(t)}$

- Continue iterating until converged.
- Good news: converging to a local optimum is assured.
- Bad news: I said local optimum.
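The E-step/M-step pair above is a two-line loop. The sketch below runs it with the counts from the convergence example on the next slide and reproduces that slide's table (rounding aside):

```python
h, c, d = 20, 10, 10                      # counts used in the convergence example below
mu, b = 0.0, 0.0                          # mu(0) = initial guess of 0, so b(0) = 0
print(0, mu, b)
for t in range(1, 7):
    mu = (b + c) / (6 * (b + c + d))      # M-step: max-likelihood mu given b(t-1)
    b = mu * h / (0.5 + mu)               # E-step: expected number of B's given mu(t)
    print(t, round(mu, 4), round(b, 3))   # reproduces the table on the next slide
```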

E.M. Convergence

- The convergence proof is based on the fact that $P(\text{data} \mid \mu)$ must increase or remain the same between each iteration [NOT OBVIOUS]
- But it can never exceed 1 [OBVIOUS]
- So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, $\mu^{(0)} = 0$:

t      |   0   |   1    |   2    |   3    |   4    |   5    |   6
mu(t)  |   0   | 0.0833 | 0.0937 | 0.0947 | 0.0948 | 0.0948 | 0.0948
b(t)   |   0   | 2.857  | 3.158  | 3.185  | 3.187  | 3.187  | 3.187

Convergence is generally linear: error decreases by a constant factor each time step.