Clustering with Gaussian Mixtures
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the link to the source repository of Andrew's tutorials. Comments and corrections gratefully received.

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
awm@cs.cmu.edu
Copyright © 2001, Andrew W. Moore

Unsupervised Learning
You walk into a bar. A stranger approaches and tells you:
"I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)'s."
So far, looks straightforward. "I need a maximum likelihood estimate of the μi's." No problem: "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)" Uh oh!!

Some data from a GMM

The GMM assumption
There are k components. The i'th component is called ωi. Component ωi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ωi).
2. Datapoint ~ N(μi, σ²I).
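The two-step generative recipe (pick a component, then draw from its Gaussian) can be sketched directly in code. This is an illustrative sketch, not part of the original slides; the component means, priors, and counts below are made-up examples.

```python
import numpy as np

def sample_gmm(n, means, priors, sigma, rng=None):
    """Draw n datapoints from a spherical GMM:
    1. pick component i with probability P(w_i),
    2. draw the point from N(mu_i, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    means = np.asarray(means, dtype=float)   # shape (k, d)
    k, d = means.shape
    comps = rng.choice(k, size=n, p=priors)                       # step 1
    points = means[comps] + sigma * rng.standard_normal((n, d))   # step 2
    return points, comps

# Hypothetical example: k = 3 components in 2-d
means = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
priors = [0.5, 0.3, 0.2]
X, z = sample_gmm(1000, means, priors, sigma=1.0, rng=0)
print(X.shape)  # (1000, 2)
```

Plotting `X` colored by `z` reproduces the kind of scatter shown on the "Some data from a GMM" slide.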
The General GMM assumption
There are k components. The i'th component is called ωi. Component ωi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix Σi.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ωi).
2. Datapoint ~ N(μi, Σi).

Unsupervised Learning: not as hard as it looks
Sometimes easy. Sometimes impossible. And sometimes in between.
(In case you're wondering, the diagrams on this slide show 2-d unlabeled data, i.e. x vectors distributed in 2-d space; the top one has three very clear Gaussian centers.)

Computing likelihoods in the unsupervised case
We have x1, x2, ... xN. We know P(w1), P(w2), ... P(wk). We know σ.
P(x | wi, μ1, ... μk) = the probability that an observation from class wi would have value x, given the class means μ1 ... μk.
Can we write an expression for that?

Likelihoods in the unsupervised case
We have x1, x2, ... xn. We have P(w1) ... P(wk). We have σ. We can define, for any x, P(x | wi, μ1, μ2, ... μk).
Can we define P(x | μ1, μ2, ... μk)?
Can we define P(x1, x2, ... xn | μ1, μ2, ... μk)?
[Yes, if we assume the x's were drawn independently.]

Unsupervised Learning: Mediumly Good News
We now have a procedure such that if you give me a guess at μ1, μ2, ... μk, I can tell you the probability of the unlabeled data given those μ's.
Suppose the x's are 1-dimensional. There are two classes, w1 and w2, with P(w1) = 1/3, P(w2) = 2/3, and σ = 1. There are 25 unlabeled datapoints x1 ... x25. (From Duda and Hart.)
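Given independence, log P(x1, ..., xn | μ1, μ2) is a sum of log mixture densities, and for a two-class 1-d setup like the one above (P(w1) = 1/3, P(w2) = 2/3, σ = 1) the maximum-likelihood (μ1, μ2) can be found by brute force over a grid. A sketch, with synthetic data standing in for Duda & Hart's actual 25 points (the data, grid, and true means used here are made up):

```python
import numpy as np

def log_likelihood(x, mu1, mu2, p1=1/3, p2=2/3, sigma=1.0):
    """log P(x_1..x_n | mu1, mu2) for a 1-d two-class GMM,
    assuming the points were drawn independently."""
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    g1 = norm * np.exp(-0.5 * ((x - mu1) / sigma) ** 2)  # p(x | w1)
    g2 = norm * np.exp(-0.5 * ((x - mu2) / sigma) ** 2)  # p(x | w2)
    return np.sum(np.log(p1 * g1 + p2 * g2))

# Synthetic stand-in for the 25 unlabeled datapoints (true means -2 and 2)
rng = np.random.default_rng(0)
from_class1 = rng.random(25) < 1/3
x = np.where(from_class1, rng.normal(-2, 1, 25), rng.normal(2, 1, 25))

# Brute-force the log-likelihood surface over a grid of (mu1, mu2)
grid = np.linspace(-4, 4, 81)
ll = np.array([[log_likelihood(x, m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print("max-likelihood guess:", grid[i], grid[j])
```

The surface `ll` is exactly what the next slide graphs against μ1 and μ2; note it has a second, nearly-as-good peak with the roles of the classes swapped.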
Duda & Hart's Example
Graph of log P(x1, x2, ... x25 | μ1, μ2) against μ1 and μ2.
Max likelihood = (μ1 = −2.13, μ2 = 1.668).
There is a local minimum, but it is very close to the global one, at (μ1 = 2.085, μ2 = −1.257)*.
(* corresponds to switching w1 ↔ w2.)

We can graph the probability distribution function of the data given our μ1 and μ2 estimates. We can also graph the true function from which the data was randomly generated. They are close. Good.
The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa.
In this example, unsupervised learning is almost as good as supervised. If the x1 ... x25 are given the classes which were used to generate them, then the results are (μ1 = −2.176, μ2 = 1.684). Unsupervised got (−2.13, 1.668).

Finding the max likelihood μ1, μ2, ... μk
We can compute P(data | μ1, μ2, ... μk). How do we find the μi's which give max likelihood?
- The normal max likelihood trick: set ∂/∂μi log Prob(...) = 0 and solve for the μi's. Here you get non-linear, non-analytically-solvable equations.
- Use gradient descent: slow but doable.
- Use a much faster, cuter, and recently very popular method: Expectation Maximization.

DETOUR: The E.M. Algorithm
We'll get back to unsupervised learning soon. But now we'll look at an even simpler case with hidden information. The EM algorithm:
- Can do trivial things, such as the contents of the next few slides.
- Is an excellent way of doing our unsupervised learning problem, as we'll see.
- Has many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example
Let events be "grades in a class":
w1 = Gets an A, P(A) = ½
w2 = Gets a B, P(B) = μ
w3 = Gets a C, P(C) = 2μ
w4 = Gets a D, P(D) = ½ − 3μ
(Note 0 ≤ μ ≤ 1/6.)
Assume we want to estimate μ from data. In a given class there were a A's, b B's, c C's and d D's. What is the maximum likelihood estimate of μ given a, b, c, d?
Computing the maximum likelihood μ
P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ
P(a, b, c, d | μ) = K (½)^a (μ)^b (2μ)^c (½ − 3μ)^d
log P(a, b, c, d | μ) = log K + a log ½ + b log μ + c log 2μ + d log(½ − 3μ)
For max likelihood μ, set ∂LogP/∂μ = 0:
∂LogP/∂μ = b/μ + c/μ − 3d/(½ − 3μ) = 0
which gives the max likelihood estimate μ = (b + c) / (6(b + c + d)).
So if the class got 14 A's, 6 B's, 9 C's and 10 D's, the max likelihood μ = 1/10.

Same Problem with Hidden Information
Someone tells us that: the number of high grades (A's + B's) = h, the number of C's = c, the number of D's = d. What is the max likelihood estimate of μ now?
(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ.)
We can answer this question circularly:
EXPECTATION: If we knew the value of μ we could compute the expected values of a and b. Since the ratio a : b should be the same as the ratio ½ : μ,
a = h · ½ / (½ + μ),  b = h · μ / (½ + μ).
MAXIMIZATION: If we knew the expected values of a and b we could compute the maximum likelihood value of μ:
μ = (b + c) / (6(b + c + d)).

E.M. for our Trivial Problem
We begin with a guess for μ. We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of μ and of a and b.
Define μ(t) = the estimate of μ on the t'th iteration, and b(t) = the estimate of b on the t'th iteration.
μ(0) = initial guess
b(t) = μ(t) h / (½ + μ(t)) = E[b | μ(t)]                    (E-step)
μ(t+1) = (b(t) + c) / (6(b(t) + c + d)) = max like estimate of μ given b(t)   (M-step)
Continue iterating until converged.
Good news: converging to a local optimum is assured. Bad news: I said "local" optimum.

E.M. Convergence
The convergence proof is based on the fact that Prob(data | μ) must increase or remain the same between each iteration [NOT OBVIOUS], but it can never exceed 1 [OBVIOUS]. So it must therefore converge [OBVIOUS].
In our example, suppose we had h = 20, c = 10, d = 10, μ(0) = 0:

t | μ(t)   | b(t)
0 | 0      | 0
1 | 0.0833 | 2.857
2 | 0.0937 | 3.158
3 | 0.0947 | 3.185
4 | 0.0948 | 3.187
5 | 0.0948 | 3.187
6 | 0.0948 | 3.187

Convergence is generally linear: the error decreases by a constant factor each time step.
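The E-step/M-step iteration for the grades problem is only a couple of lines of code. A minimal sketch (the function name is mine), run on the h = 20, c = 10, d = 10 example, where the iteration converges to μ ≈ 0.0948:

```python
def em_grades(h, c, d, mu0=0.0, iters=20):
    """EM for the grades problem.
    E-step: b(t)   = mu(t)*h / (1/2 + mu(t))       (expected number of B's)
    M-step: mu(t+1) = (b(t)+c) / (6*(b(t)+c+d))    (max-like mu given b(t))"""
    mu = mu0
    for _ in range(iters):
        b = mu * h / (0.5 + mu)           # E-step
        mu = (b + c) / (6 * (b + c + d))  # M-step
    return mu

print(em_grades(20, 10, 10))  # about 0.0948
```

Printing `mu` and `b` inside the loop reproduces the convergence table: μ(1) = 0.0833, μ(2) = 0.0937, and so on.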
Back to Unsupervised Learning of GMMs
Remember: we have unlabeled data x1, x2, ... xR. We know there are k classes. We know P(w1), P(w2), ... P(wk). We don't know μ1, μ2, ... μk.
We can write P(data | μ1 ... μk):

p(x1 ... xR | μ1 ... μk) = ∏_{i=1..R} p(xi | μ1 ... μk)
  = ∏_{i=1..R} ∑_{j=1..k} p(xi | wj, μ1 ... μk) P(wj)
  = ∏_{i=1..R} ∑_{j=1..k} K exp(−(1/(2σ²)) (xi − μj)²) P(wj)
E.M. for GMMs
For max likelihood we know ∂/∂μi log Prob(data | μ1 ... μk) = 0.
Some wild'n'crazy algebra turns this into: "For max likelihood, for each j,

μj = ∑_{i=1..R} P(wj | xi, μ1 ... μk) xi / ∑_{i=1..R} P(wj | xi, μ1 ... μk)."

This is a set of nonlinear equations in the μj's.
If, for each xi, we knew the probability that it was in class wj, P(wj | xi, μ1 ... μk), then we could easily compute μj. If we knew each μj, then we could easily compute P(wj | xi, μ1 ... μk) for each wj and xi.
I feel an EM experience coming on!!

E.M. for GMMs
Iterate. On the t'th iteration let our estimates be λt = { μ1(t), μ2(t), ... μc(t) }.
E-step: compute the "expected" classes of all datapoints for each class:

P(wi | xk, λt) = p(xk | wi, λt) P(wi | λt) / p(xk | λt)
  = p(xk | wi, μi(t), σ²I) pi(t) / ∑_{j=1..c} p(xk | wj, μj(t), σ²I) pj(t)

(Just evaluate a Gaussian at xk.)
M-step: compute the max likelihood μ given our data's class membership distributions:

μi(t+1) = ∑_k P(wi | xk, λt) xk / ∑_k P(wi | xk, λt)

E.M. Convergence
As with all EM procedures, convergence to a local optimum is guaranteed. This algorithm is REALLY USED, and in high dimensional state spaces too, e.g. Vector Quantization for Speech Data.

E.M. for General GMMs
pi(t) is shorthand for the estimate of P(ωi) on the t'th iteration.
λt = { μ1(t), ... μc(t), Σ1(t), ... Σc(t), p1(t), ... pc(t) }.
E-step: compute the "expected" classes of all datapoints for each class (just evaluate a Gaussian at xk):

P(wi | xk, λt) = p(xk | wi, μi(t), Σi(t)) pi(t) / ∑_{j=1..c} p(xk | wj, μj(t), Σj(t)) pj(t)

M-step: compute the max likelihood parameters given our data's class membership distributions:

μi(t+1) = ∑_k P(wi | xk, λt) xk / ∑_k P(wi | xk, λt)
Σi(t+1) = ∑_k P(wi | xk, λt) [xk − μi(t+1)][xk − μi(t+1)]ᵀ / ∑_k P(wi | xk, λt)
pi(t+1) = ∑_k P(wi | xk, λt) / R,  where R = #records

Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)

After first iteration
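The E-step and M-step formulas above translate almost line-for-line into code. A minimal sketch of EM for a general GMM (full covariances, learned priors) under made-up test data; the function names, initialization scheme, and small ridge added to the covariances for numerical safety are my own choices, and a production implementation would also work in log space:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Evaluate the density of N(mu, Sigma) at each row of X."""
    d = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(expo)

def em_gmm(X, k, iters=50, rng=None):
    rng = np.random.default_rng(rng)
    R, d = X.shape
    mus = X[rng.choice(R, k, replace=False)].copy()   # means init: random points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    ps = np.full(k, 1.0 / k)                          # priors init: uniform
    for _ in range(iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t), shape (R, k)
        resp = np.array([ps[i] * gaussian_pdf(X, mus[i], Sigmas[i])
                         for i in range(k)]).T
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_i, Sigma_i, p_i from the responsibilities
        Nk = resp.sum(axis=0)
        mus = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            Sigmas[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] \
                        + 1e-6 * np.eye(d)
        ps = Nk / R
    return mus, Sigmas, ps

# Made-up 2-d data from two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.5, (100, 2)),
               rng.normal([2, 0], 0.5, (100, 2))])
mus, Sigmas, ps = em_gmm(X, k=2, rng=1)
print(np.round(np.sort(mus[:, 0]), 1))
```

Run on well-separated clusters like these, the recovered means land near the true centers; on harder data the local-optimum caveat from the slides applies, and multiple restarts help.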
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator
Three classes of assay, each learned with its own mixture model. (Sorry, this will again be semi-useless in black and white.)

Resulting Bayes Classifier

Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous.
Unsupervised learning with symbolic attributes
(Example attributes: NATION, MARRIED, # KIDS, with missing values.)
It's just a "learning a Bayes net with known structure but hidden values" problem. Can use gradient descent. It's an easy, fun exercise to do an EM formulation for this case too.

Final Comments
- Remember, E.M. can get stuck in local minima, and empirically it DOES.
- Our unsupervised learning example assumed the P(wi)'s were known, and the variances fixed and known. It is easy to relax this.
- It's possible to do Bayesian unsupervised learning instead of max likelihood.
- There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting. Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.

What you should know
- How to "learn" maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
- Be happy with this kind of probabilistic analysis.
- Understand the two examples of E.M. given in these notes.
For more info, see Duda + Hart. It's a great book. There's much more in the book than in your handout.

Other unsupervised learning methods
- K-means (see next lecture)
- Hierarchical clustering, e.g. minimum spanning trees (see next lecture)
- Principal Component Analysis: a simple, useful tool
- Non-linear PCA: neural auto-associators, locally weighted PCA, others
More informationErratum: A Generalized Path Integral Control Approach to Reinforcement Learning
Journal of Machne Learnng Research 00-9 Submtted /0; Publshed 7/ Erratum: A Generalzed Path Integral Control Approach to Renforcement Learnng Evangelos ATheodorou Jonas Buchl Stefan Schaal Department of
More informationFinding Dense Subgraphs in G(n, 1/2)
Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng
More informationRandomness and Computation
Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually
More informationLinear Classification, SVMs and Nearest Neighbors
1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush
More informationClustering gene expression data & the EM algorithm
CG, Fall 2011-12 Clusterng gene expresson data & the EM algorthm CG 08 Ron Shamr 1 How Gene Expresson Data Looks Entres of the Raw Data matrx: Rato values Absolute values Row = gene s expresson pattern
More informationMaximum Likelihood Estimation
Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?
More informationThe big picture. Outline
The bg pcture Vncent Claveau IRISA - CNRS, sldes from E. Kjak INSA Rennes Notatons classes: C = {ω = 1,.., C} tranng set S of sze m, composed of m ponts (x, ω ) per class ω representaton space: R d (=
More information14 Lagrange Multipliers
Lagrange Multplers 14 Lagrange Multplers The Method of Lagrange Multplers s a powerful technque for constraned optmzaton. Whle t has applcatons far beyond machne learnng t was orgnally developed to solve
More informationThe EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X
The EM Algorthm (Dempster, Lard, Rubn 1977 The mssng data or ncomplete data settng: An Observed Data Lkelhood (ODL that s a mxture or ntegral of Complete Data Lkelhoods (CDL. (1a ODL(;Y = [Y;] = [Y,][
More informationLecture 4: Universal Hash Functions/Streaming Cont d
CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected
More informationCHALMERS, GÖTEBORGS UNIVERSITET. SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD
CHALMERS, GÖTEBORGS UNIVERSITET SOLUTIONS to RE-EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 35, FIM 72 GU, PhD Tme: Place: Teachers: Allowed materal: Not allowed: January 2, 28, at 8 3 2 3 SB
More informationConjugacy and the Exponential Family
CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the
More informationBIO Lab 2: TWO-LEVEL NORMAL MODELS with school children popularity data
Lab : TWO-LEVEL NORMAL MODELS wth school chldren popularty data Purpose: Introduce basc two-level models for normally dstrbuted responses usng STATA. In partcular, we dscuss Random ntercept models wthout
More informationCS-433: Simulation and Modeling Modeling and Probability Review
CS-433: Smulaton and Modelng Modelng and Probablty Revew Exercse 1. (Probablty of Smple Events) Exercse 1.1 The owner of a camera shop receves a shpment of fve cameras from a camera manufacturer. Unknown
More informationInternet Engineering. Jacek Mazurkiewicz, PhD Softcomputing. Part 3: Recurrent Artificial Neural Networks Self-Organising Artificial Neural Networks
Internet Engneerng Jacek Mazurkewcz, PhD Softcomputng Part 3: Recurrent Artfcal Neural Networks Self-Organsng Artfcal Neural Networks Recurrent Artfcal Neural Networks Feedback sgnals between neurons Dynamc
More informationEngineering Risk Benefit Analysis
Engneerng Rsk Beneft Analyss.55, 2.943, 3.577, 6.938, 0.86, 3.62, 6.862, 22.82, ESD.72, ESD.72 RPRA 2. Elements of Probablty Theory George E. Apostolaks Massachusetts Insttute of Technology Sprng 2007
More informationCollege of Computer & Information Science Fall 2009 Northeastern University 20 October 2009
College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:
More informationWhy Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one)
Why Bayesan? 3. Bayes and Normal Models Alex M. Martnez alex@ece.osu.edu Handouts Handoutsfor forece ECE874 874Sp Sp007 If all our research (n PR was to dsappear and you could only save one theory, whch
More informationENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition
EG 880/988 - Specal opcs n Computer Engneerng: Pattern Recognton Memoral Unversty of ewfoundland Pattern Recognton Lecture 7 May 3, 006 http://wwwengrmunca/~charlesr Offce Hours: uesdays hursdays 8:30-9:30
More informationMACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression
11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING
More information8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF
10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the
More information9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov
9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs Notaton x - scalar
More informationProblem Set 9 Solutions
Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem
More informationLogistic Regression Maximum Likelihood Estimation
Harvard-MIT Dvson of Health Scences and Technology HST.951J: Medcal Decson Support, Fall 2005 Instructors: Professor Lucla Ohno-Machado and Professor Staal Vnterbo 6.873/HST.951 Medcal Decson Support Fall
More informationAnnouncements EWA with ɛ-exploration (recap) Lecture 20: EXP3 Algorithm. EECS598: Prediction and Learning: It s Only a Game Fall 2013.
Lecture 0: EXP3 Algorthm 1 EECS598: Predcton and Learnng: It s Only a Game Fall 013 Prof. Jacob Abernethy Lecture 0: EXP3 Algorthm Scrbe: Zhhao Chen Announcements None 0.1 EWA wth ɛ-exploraton (recap)
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More information