Mixtures of Gaussians (continued)
Machine Learning CSE446
Carlos Guestrin
University of Washington
May 17, 2013

(One) bad case for k-means
- Clusters may overlap
- Some clusters may be wider than others
Gaussians in m Dimensions

P(x) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Suppose You Have a Gaussian For Each Class

P(x \mid y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)
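As a quick illustration, here is a minimal numpy sketch that evaluates the density above; the function name gaussian_density and the use of np.linalg.solve are my own choices, not from the slides.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the m-dimensional Gaussian density N(x; mu, Sigma) at a point x."""
    m = len(mu)
    diff = x - mu
    # Compute Sigma^{-1}(x - mu) without forming the inverse explicitly.
    sol = np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ sol) / norm_const
```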
Gaussian Bayes Classifier
- You have a Gaussian over x for each class y = i:

  P(x \mid y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)

- But you need the probability of class y = i given x.
- Thank you, Bayes rule!!

  P(y = i \mid x) = \frac{p(x \mid y = i)\, P(y = i)}{p(x)}

Predicting wealth from age
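A hedged sketch of that Bayes-rule computation, building on the gaussian_density helper above; the function name and signature are assumptions of mine, not the course's code.

```python
import numpy as np

def gaussian_bayes_posterior(x, priors, means, covariances):
    """P(y = i | x) via Bayes rule, given a Gaussian P(x | y = i) for each class."""
    likelihoods = np.array([
        gaussian_density(x, mu_i, Sigma_i)
        for mu_i, Sigma_i in zip(means, covariances)
    ])
    joint = likelihoods * np.asarray(priors)   # P(x | y = i) P(y = i)
    return joint / joint.sum()                 # divide by P(x) = sum_i P(x, y = i)
```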
Learning: modelyear, mpg ---> maker

\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}
General: O(m²) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}

Aligned: O(m) parameters

\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}
Spherical: O(1) cov parameters

\Sigma = \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix} = \sigma^2 I
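A small numpy sketch contrasting the three covariance structures and their free-parameter counts; the variable names and the specific numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

m = 5
rng = np.random.default_rng(0)

# General: full symmetric covariance -- m(m+1)/2 free entries, i.e. O(m^2).
A = rng.standard_normal((m, m))
Sigma_general = A @ A.T                          # an arbitrary positive-definite matrix

# Aligned (axis-aligned): diagonal covariance -- m free entries, i.e. O(m).
Sigma_aligned = np.diag(rng.uniform(0.5, 2.0, size=m))

# Spherical: a single shared variance -- one free covariance parameter, i.e. O(1).
sigma2 = 1.3
Sigma_spherical = sigma2 * np.eye(m)

print(m * (m + 1) // 2, m, 1)                    # parameter counts: 15, 5, 1
```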
Next... back to Density Estimation
What if we want to do density estimation with multimodal or clumpy data?
But we don't see class labels!!!
- MLE: \arg\max \prod_j P(y_j, x_j)
- But we don't know y_j!!!
- Maximize the marginal likelihood:

  \arg\max \prod_j P(x_j) = \arg\max \prod_j \sum_{i=1}^{k} P(y_j = i, x_j)

Special case: spherical Gaussians and hard assignments

P(y = i \mid x_j) \propto \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) P(y = i)

- If P(X | Y = i) is spherical, with the same σ for all classes:

  P(x_j \mid y = i) \propto \exp\!\left( -\frac{1}{2\sigma^2} \| x_j - \mu_i \|^2 \right)

- If each x_j belongs to exactly one class C(j) (hard assignment), the marginal likelihood becomes:

  \prod_{j=1}^{m} \sum_{i=1}^{k} P(x_j, y = i) \propto \prod_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} \| x_j - \mu_{C(j)} \|^2 \right)

- Maximizing this is the same as minimizing \sum_j \| x_j - \mu_{C(j)} \|^2. Same as K-means!!!
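For concreteness, here is a minimal sketch of the marginal log-likelihood of a spherical mixture (the soft version of the quantity above); the function name, the log-sum-exp trick, and the array shapes are my own assumptions.

```python
import numpy as np

def spherical_log_marginal_likelihood(X, means, sigma2, priors):
    """log prod_j sum_i P(y_j = i) N(x_j; mu_i, sigma2 * I) for an (n, d) data matrix X."""
    n, d = X.shape
    # Squared distance from every point to every component mean: shape (n, k).
    dist2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_joint = (np.log(priors)[None, :]
                 - 0.5 * d * np.log(2 * np.pi * sigma2)
                 - dist2 / (2 * sigma2))
    # Log-sum-exp over components, then sum over data points.
    mx = log_joint.max(axis=1, keepdims=True)
    return (mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))).sum()
```

Replacing the sum over components with a max (a hard assignment) turns this into the negative k-means objective, up to constants.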
The GMM assumption
- There are k components
- Component i has an associated mean vector µ_i
- Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability P(y = i)
  2. Generate the datapoint: x ~ N(µ_i, σ²I)
The General GMM assumption
- There are k components
- Component i has an associated mean vector µ_i
- Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i
- Each data point is generated according to the following recipe:
  1. Pick a component at random: choose component i with probability P(y = i)
  2. Generate the datapoint: x ~ N(µ_i, Σ_i)

Unsupervised Learning with Mixtures of Gaussians: EM Algorithm
Machine Learning CSE446
Carlos Guestrin
University of Washington
May 17, 2013
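Before moving into EM, here is a minimal sketch of the generative recipe described above; the function name sample_gmm, its signature, and the fixed seed are assumptions for illustration only.

```python
import numpy as np

def sample_gmm(n, priors, means, covariances, seed=0):
    """Draw n points by following the GMM recipe: pick a component, then sample from it."""
    rng = np.random.default_rng(seed)
    components = rng.choice(len(priors), size=n, p=priors)     # step 1: pick component i
    X = np.array([
        rng.multivariate_normal(means[i], covariances[i])      # step 2: x ~ N(mu_i, Sigma_i)
        for i in components
    ])
    return X, components
```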
Supervised Learning of Mixtures of Gaussians
- Mixtures of Gaussians:
  - Prior class probabilities: P(y)
  - Likelihood function per class: P(x | y = i)
- Suppose, for each data point, we know its location x and its class y
- Learning is easy!
  - For the prior P(y): count the fraction of points in each class
  - For the likelihood function: fit a Gaussian (mean and covariance) to the points of each class
  (a small code sketch appears after the next slide)

Unsupervised Learning: not as hard as it looks
- Sometimes easy
- Sometimes impossible
- and sometimes in between
(The diagrams on this slide show 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.)
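A hedged sketch of the supervised "learning is easy" case above: class priors from label counts, per-class Gaussians from sample means and covariances. The function name and the bias=True (MLE) choice are mine, not the course's.

```python
import numpy as np

def fit_supervised_gmm(X, y, k):
    """MLE for a mixture of Gaussians when every point's class label y is observed."""
    priors, means, covariances = [], [], []
    for i in range(k):
        Xi = X[y == i]
        priors.append(len(Xi) / len(X))              # P(y = i): fraction of points in class i
        means.append(Xi.mean(axis=0))                # mu_i: sample mean of class i
        covariances.append(np.cov(Xi, rowvar=False, bias=True))  # Sigma_i: MLE sample covariance
    return np.array(priors), np.array(means), np.array(covariances)
```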
EM: Reducing Unsupervised Learning to Supervised Learning
- If we knew the assignment of points to classes → Supervised Learning!
- Expectation-Maximization (EM):
  - Guess the assignment of points to classes
  - Recompute the model parameters
  - Iterate

DETOUR: The E.M. Algorithm
- We'll get back to unsupervised learning soon
- But now we'll look at an even simpler case with hidden information
- The EM algorithm
  - Can do trivial things, such as the contents of the next few slides
  - An excellent way of doing our unsupervised learning problem, as we'll see
  - Many, many other uses
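As a preview of where this is heading, here is one way the guess/recompute/iterate loop can look for a GMM, using soft assignments (responsibilities). The initialization, stopping rule, and all names are my own sketch rather than the course's reference implementation.

```python
import numpy as np

def em_gmm(X, k, n_iters=100, seed=0):
    """Bare-bones EM for a mixture of Gaussians with soft assignments."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    means = X[rng.choice(n, size=k, replace=False)]          # init: k random data points
    covariances = np.array([np.eye(m) for _ in range(k)])
    priors = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: responsibilities r[j, i] = P(y_j = i | x_j, current parameters).
        r = np.empty((n, k))
        for i in range(k):
            diff = X - means[i]
            inv = np.linalg.inv(covariances[i])
            quad = np.einsum('jd,de,je->j', diff, inv, diff)
            norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(covariances[i]))
            r[:, i] = priors[i] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)

        # M-step: refit parameters as if the soft assignments were observed labels.
        Nk = r.sum(axis=0)
        priors = Nk / n
        means = (r.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covariances[i] = (r[:, i, None] * diff).T @ diff / Nk[i]

    return priors, means, covariances
```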
Silly Example
Let events be grades in a class:
  w1 = Gets an A:  P(A) = 1/2
  w2 = Gets a B:   P(B) = µ
  w3 = Gets a C:   P(C) = 2µ
  w4 = Gets a D:   P(D) = 1/2 - 3µ
(Note: 0 ≤ µ ≤ 1/6)
Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, and d D's.
What's the maximum likelihood estimate of µ given a, b, c, d?

Trivial Statistics
P(A) = 1/2, P(B) = µ, P(C) = 2µ, P(D) = 1/2 - 3µ

P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^a \mu^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d

\log P(a, b, c, d \mid \mu) = \log K + a \log\tfrac{1}{2} + b \log\mu + c \log 2\mu + d \log\!\left(\tfrac{1}{2} - 3\mu\right)

For the max-likelihood µ, set ∂ log P / ∂µ = 0:

\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0

which gives the max-likelihood estimate

\hat{\mu} = \frac{b + c}{6(b + c + d)}

So if the class got A = 14, B = 6, C = 9, D = 10, the max-likelihood µ = 1/10.
Boring, but true!
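A tiny check of the closed-form estimate, using exact fractions; the function name is mine.

```python
from fractions import Fraction

def mle_mu(b, c, d):
    """Closed-form maximum-likelihood estimate: mu = (b + c) / (6 (b + c + d))."""
    return Fraction(b + c, 6 * (b + c + d))

print(mle_mu(b=6, c=9, d=10))   # 1/10, matching the slide's class counts
```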
Same Problem with Hidden Information
Someone tells us that:
  Number of high grades (A's + B's) = h
  Number of C's = c
  Number of D's = d
What is the max-likelihood estimate of µ now?

We can answer this question circularly:

MAXIMIZATION: If we know the expected values of a and b, we can compute the maximum-likelihood value of µ:

  \mu = \frac{b + c}{6(b + c + d)}

EXPECTATION: If we know the value of µ, we can compute the expected values of a and b:

  a = \frac{1/2}{1/2 + \mu}\, h, \qquad b = \frac{\mu}{1/2 + \mu}\, h

since the ratio a : b should be the same as the ratio 1/2 : µ.
(Remember: P(A) = 1/2, P(B) = µ, P(C) = 2µ, P(D) = 1/2 - 3µ)

E.M. for our Trivial Problem
We begin with a guess for µ, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and of a and b.
Define:
  µ(t) = the estimate of µ on the t-th iteration
  b(t) = the estimate of b on the t-th iteration
  µ(0) = initial guess

E-step:  b(t) = \frac{\mu(t)\, h}{1/2 + \mu(t)} = E[b \mid \mu(t)]

M-step:  \mu(t+1) = \frac{b(t) + c}{6\,(b(t) + c + d)}  (the max-likelihood estimate of µ given b(t))

Continue iterating until converged.
Good news: convergence to a local optimum is assured.
Bad news: I said "local" optimum.
E.M. Convergence
- The convergence proof is based on the fact that Prob(data | µ) must increase or remain the same between iterations [NOT OBVIOUS]
- But it can never exceed 1 [OBVIOUS]
- So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, and µ(0) = 0.
Convergence is generally linear: the error decreases by a constant factor each time step.

  t    µ(t)     b(t)
  0    0        0
  1    0.0833   2.857
  2    0.0937   3.158
  3    0.0947   3.185
  4    0.0948   3.187
  5    0.0948   3.187
  6    0.0948   3.187
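A short sketch that reproduces the table above by alternating the M-step and E-step from the previous slide; the function and variable names are my own.

```python
def em_trivial(h, c, d, mu0=0.0, n_iters=6):
    """EM for the grades example: alternate M-step (refit mu) and E-step (expected b)."""
    mu = mu0
    b = mu * h / (0.5 + mu)                 # b(0) = E[b | mu(0)]
    print(f"t=0  mu={mu:.4f}  b={b:.3f}")
    for t in range(1, n_iters + 1):
        mu = (b + c) / (6 * (b + c + d))    # M-step: mu(t) from b(t-1)
        b = mu * h / (0.5 + mu)             # E-step: b(t) = E[b | mu(t)]
        print(f"t={t}  mu={mu:.4f}  b={b:.3f}")
    return mu

em_trivial(h=20, c=10, d=10, mu0=0.0)       # converges to mu ~ 0.0948, b ~ 3.187
```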