Clustering (Bishop ch 9)
Reference: Data Mining by Margaret Dunham (a slide source)
Clustering
- Clustering is unsupervised learning: there are no class labels.
- Want to find groups of similar instances.
- Often use a distance measure (such as Euclidean distance) for dissimilarity.
- Can use cluster memberships/distances as additional (created) features.
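The Euclidean distance mentioned above can be sketched in a few lines (a minimal illustration; the function name `euclidean` is an assumption, not from the slides):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # -> 5.0
```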
Clustering Examples
- Segment a customer database based on similar buying patterns.
- Group houses in a town into neighborhoods based on features (location, sq ft, stories, lot size).
- Identify new plant species.
- Identify similar Web usage patterns.
Clustering Problem
Given data D = {x_1, x_2, ..., x_n} of feature vectors and an integer value k, the Clustering Problem is to define a mapping where each x_i is assigned to one cluster K_j, 1 <= j <= k. A cluster K_j contains precisely those vectors mapped to it. Unlike the classification problem, the clusters are not known a priori.
Impact of Outliers on Clustering
What are the best two clusters?
Types of Clustering
- Hierarchical: creates a tree of clusterings
  - Agglomerative (bottom up: merge the closest clusters)
  - Divisive (top down; less common)
- Partitional: one set of clusters is created; the number of clusters is usually supplied by the user
- Clusters can be: overlapping (soft) / non-overlapping (hard)
Closest Clusters?
- Single link: smallest distance between points
- Complete link: largest distance between points
- Average link: average distance between points
- Centroid: distance between centroids
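The four inter-cluster distances above can be sketched directly (a minimal illustration using 1-D points for simplicity; the function names and the absolute-difference distance are assumptions):

```python
def single_link(A, B, d):
    # smallest distance between any pair of points
    return min(d(a, b) for a in A for b in B)

def complete_link(A, B, d):
    # largest distance between any pair of points
    return max(d(a, b) for a in A for b in B)

def average_link(A, B, d):
    # average distance over all cross-cluster pairs
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B, d):
    # distance between the cluster centroids (1-D: plain means)
    return d(sum(A) / len(A), sum(B) / len(B))

d = lambda a, b: abs(a - b)
A, B = [1.0, 2.0], [5.0, 9.0]
print(single_link(A, B, d))    # -> 3.0 (pair 2 and 5)
print(complete_link(A, B, d))  # -> 8.0 (pair 1 and 9)
print(average_link(A, B, d))   # -> 5.5 ((4+8+3+7)/4)
print(centroid_link(A, B, d))  # -> 5.5 (|1.5 - 7.0|)
```

Note how the four criteria can disagree: here average link and centroid happen to coincide, while single and complete link bracket them.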
Levels of Clustering
Dendrogram
A dendrogram is a tree data structure which illustrates hierarchical clustering techniques.
- Each level shows the clusters for that level.
- Leaf: individual clusters. Root: one cluster.
- A cluster at level i is the union of its children clusters at level i+1.
Partitional Clustering
- Nonhierarchical: creates one level of clustering.
- Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
- Sometimes try different k and use the best one.
Partitional Algorithms
- K-means
- Gaussian mixtures (EM)
- Many, many others
K-means clustering
1. Pick k starting means, µ_1, µ_2, ..., µ_k. Can use: randomly picked examples, perturbations of the sample mean, or means equally spaced along the principal component.
2. Repeat until convergence:
   a. Split the data into k sets, S_1, S_2, ..., S_k, where x ∈ S_i iff µ_i is the closest mean to x.
   b. Update each µ_i to the mean of S_i.
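The two-step loop above can be sketched for 1-D data as follows (a minimal sketch, not the lecture's code; the `init` parameter, the function name `kmeans`, and the fixed iteration cap are assumptions):

```python
import random

def kmeans(data, k, init=None, iters=100):
    # step 1: pick k starting means (here: randomly picked examples,
    # one of the initialization options listed above)
    means = list(init) if init is not None else random.sample(data, k)
    sets = []
    for _ in range(iters):
        # step 2a: split the data into k sets; each x goes to its closest mean
        sets = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda j: abs(x - means[j]))
            sets[i].append(x)
        # step 2b: update each mean to the mean of its set (keep old mean if empty)
        new = [sum(s) / len(s) if s else means[i] for i, s in enumerate(sets)]
        if new == means:  # converged: the means stopped moving
            break
        means = new
    return means, sets

# the K-Means Example data from the slides, started from means 3 and 4
means, sets = kmeans([2, 4, 10, 12, 3, 20, 30, 11, 25], 2, init=[3, 4])
print(means)  # -> [7.0, 25.0]
```

Run on the slides' worked example, this reproduces the same final clusters {2,3,4,10,11,12} and {20,30,25}.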
Lecture Notes for E. Alpaydın, 2010, Introduction to Machine Learning 2e, The MIT Press (V1.0)
K-Means Example
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2.
Randomly assign means: m_1 = 3, m_2 = 4
K_1 = {2,3}, K_2 = {4,10,12,20,30,11,25}: m_1 = 2.5, m_2 = 16
K_1 = {2,3,4}, K_2 = {10,12,20,30,11,25}: m_1 = 3, m_2 = 18
K_1 = {2,3,4,10}, K_2 = {12,20,30,11,25}: m_1 = 4.75, m_2 = 19.6
K_1 = {2,3,4,10,11,12}, K_2 = {20,30,25}: m_1 = 7, m_2 = 25
Stop, as the clusters with these means stay the same.
[Figure: the original data and its clusterings for k=2, k=3, and k=10]
Tabular view of k-means (hard assignments: each instance belongs to exactly one cluster)

      µ_1  µ_2  µ_3  µ_4
x_1    1    0    0    0
x_2    0    0    1    0
x_3    1    0    0    0
Soft k-means clustering (each instance has a weight in every cluster; each row sums to 1)

      µ_1  µ_2  µ_3  µ_4
x_1   .6   .2   .1   .1
x_2   .2   .1   .5   .2
x_3   .4   .3   .1   .2
From Soft Clustering to EM
- Use weighted means based on the soft-clustering weights.
- Soft cluster weights are probabilities: P(cluster|x).
- Uses Bayes rule: P(cluster|x) proportional to P(x|cluster) P(cluster).
- For each x, the true cluster for x is a latent (unobserved) variable.
Soft Clustering to EM (2)
Assume parametric forms for P(cluster) (multinomial) and P(x|cluster) (Gaussian). Iteratively:
1. Smoothly estimate the cluster memberships (latent variables) based on the data and the old parameters.
2. Update the parameters to maximize the likelihood of the data, assuming the new estimates are the truth.
This is the mixture-of-Gaussians EM algorithm; see http://citeseer.ist.psu.edu/bilmes98gentle.html
        c=1      c=2      c=3      c=4
P(c)    .3       .2       .2       .3
P(x|c)  µ_1,σ_1  µ_2,σ_2  µ_3,σ_3  µ_4,σ_4

P(x|c)P(c), normalized:
x_1     .6       .2       .1       .1
x_2     .2       .1       .5       .2
x_3     .4       .3       .1       .2
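The normalization in the table above (Bayes rule: multiply each likelihood by its prior, then divide by the sum) can be sketched as follows (a minimal illustration; the function name and the example likelihood values are assumptions):

```python
def posterior_row(likelihoods, priors):
    # P(c|x) is proportional to P(x|c) P(c); normalize across clusters
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)
    return [j / z for j in joint]

# hypothetical per-cluster likelihoods P(x|c) for a single instance x
row = posterior_row([2.0, 1.0, 0.5, 1.0], [0.3, 0.2, 0.2, 0.3])
print(row)  # one row of soft-cluster weights; sums to 1
```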
Expectation-Maximization (EM)
Log likelihood with a mixture model:
L(Φ|X) = Σ_t log p(x^t|Φ) = Σ_t log Σ_{i=1}^{k} p(x^t|G_i) P(G_i)
- Assume hidden variables z which, when known, make the optimization much simpler.
- Complete likelihood, L_c(Φ|X,Z), in terms of x and z.
- Incomplete likelihood, L(Φ|X), in terms of x.
E- and M-steps
Iterate the two steps:
1. E-step: Estimate z given X and the current Φ.
2. M-step: Find the new Φ given z, X, and the old Φ.

E-step: Q(Φ|Φ^l) = E[ L_C(Φ|X,Z) | X, Φ^l ]
M-step: Φ^{l+1} = argmax_Φ Q(Φ|Φ^l)

An increase in Q increases the incomplete likelihood:
L(Φ^{l+1}|X) >= L(Φ^l|X)
EM as Likelihood Ascent
[Figure: likelihood curve with the current and new parameter values; Q (using the estimated memberships) lower-bounds the likelihood]
Φ: parameters for the mixture and all G_i
EM in Gaussian Mixtures
Hiddens: z_i^t = 1 if x^t belongs to G_i, 0 otherwise (the labels r_i^t of supervised learning); assume p(x|G_i) ~ N(µ_i, Σ_i).

E-step:
E[z_i^t | X, Φ^l] = p(x^t|G_i,Φ^l) P(G_i) / Σ_j p(x^t|G_j,Φ^l) P(G_j) = P(G_i|x^t,Φ^l) ≡ h_i^t

M-step:
P(G_i) = Σ_t h_i^t / N
m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t
S_i^{l+1} = Σ_t h_i^t (x^t − m_i^{l+1})(x^t − m_i^{l+1})^T / Σ_t h_i^t

Q uses the estimated z_i^t (the h_i^t) in place of the unknown labels.
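The E- and M-step updates above can be sketched for 1-D data (a minimal sketch, not the lecture's code; the function name `em_gmm_1d`, the fixed iteration count, and the small lower bound on σ to guard against degeneracy are all assumptions):

```python
import math

def em_gmm_1d(xs, k, means, priors=None, sigmas=None, iters=50):
    # EM for a 1-D mixture of Gaussians, following the updates above:
    # E-step computes h_i^t = P(G_i|x^t); M-step re-estimates P(G_i), m_i, s_i.
    n = len(xs)
    priors = priors or [1.0 / k] * k
    sigmas = sigmas or [1.0] * k
    for _ in range(iters):
        # E-step: responsibilities h[i][t], normalized per instance (Bayes rule)
        h = [[0.0] * n for _ in range(k)]
        for t, x in enumerate(xs):
            dens = [priors[i]
                    * math.exp(-(x - means[i]) ** 2 / (2 * sigmas[i] ** 2))
                    / (sigmas[i] * math.sqrt(2 * math.pi)) for i in range(k)]
            z = sum(dens)
            for i in range(k):
                h[i][t] = dens[i] / z
        # M-step: weighted updates for priors, means, and standard deviations
        for i in range(k):
            w = sum(h[i])
            priors[i] = w / n
            means[i] = sum(h[i][t] * xs[t] for t in range(n)) / w
            var = sum(h[i][t] * (xs[t] - means[i]) ** 2 for t in range(n)) / w
            sigmas[i] = max(math.sqrt(var), 1e-3)  # lower bound on sigma
    return priors, means, sigmas
```

On two well-separated 1-D blobs such as {0,1,2} and {10,11,12}, the means converge near 1 and 11 with priors near 0.5 each.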
[Figure: two-Gaussian mixture; at the point shown, P(G_1|x) = h_1 = 0.5]
Problems with EM
- Local maxima (the likelihood ascent can get stuck): run several times and take the best result; use a good initialization (perhaps k-means).
- Degenerate Gaussians: as σ goes to zero, the likelihood goes to infinity. Fix: a lower bound on σ.
- Lots of parameters to learn: use spherical Gaussians or shared covariance matrices (or even fixed distributions).
EM Summary
- Iterative method for maximizing likelihood.
- General method: not just for Gaussian mixtures, but also HMMs, Bayes nets, etc.
- Generally works well, but can have local maxima and degenerate situations.
- Gets both a clustering and a distribution (mixture of Gaussians); the distributions can be used for Bayesian learning (e.g., learn P(x|y) using a Gaussian mixture model).
Q in Detail (Gaussian Mixtures)
Hiddens: z_i^t = 1 if x^t belongs to G_i, 0 otherwise; assume p(x|G_i) ~ N(µ_i, Σ_i).
Complete-data log likelihood: L_c(Φ|X,Z) = log p(X,Z|Φ)
Q(Φ^new|Φ^old) = Σ_Z log(p(X,Z|Φ^new)) P(Z|X,Φ^old)
E-step: Q(Φ|Φ^l) = E[ L_C(Φ|X,Z) | X, Φ^l ]
M-step: Φ^{l+1} = argmax_Φ Q(Φ|Φ^l)
After Clustering
- Dimensionality reduction methods find correlations between features and group features.
- Clustering methods find similarities between instances and group instances.
- Allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., center, range of features.
Clustering as Preprocessing
- The estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.
- Local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) vs. distributed representation (after PCA, all z_j are nonzero).
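Turning hard labels b_j into new features can be sketched as follows (a minimal 1-D illustration; the function name `one_hot_labels` and the example means are assumptions):

```python
def one_hot_labels(data, means):
    # hard labels b_j: 1 for the closest mean, 0 elsewhere -> a local
    # (one-hot) representation usable as k new feature dimensions
    feats = []
    for x in data:
        j = min(range(len(means)), key=lambda i: abs(x - means[i]))
        feats.append([1 if i == j else 0 for i in range(len(means))])
    return feats

print(one_hot_labels([2, 11, 26], [7.0, 25.0]))  # -> [[1, 0], [1, 0], [0, 1]]
```

Each row has exactly one 1, so a downstream discriminant or regressor sees which cluster an instance fell into.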
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
p(x|C_i) = Σ_{j=1}^{k_i} p(x|G_ij) P(G_ij)
p(x) = Σ_{i=1}^{K} p(x|C_i) P(C_i)
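The two sums above can be sketched directly for 1-D Gaussians (a minimal illustration; the function names and the (weight, mean, sigma) tuple layout are assumptions):

```python
import math

def gauss(x, m, s):
    # 1-D Gaussian density N(m, s^2)
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def class_density(x, components):
    # p(x|C_i) = sum_j p(x|G_ij) P(G_ij): each class is itself a mixture
    return sum(w * gauss(x, m, s) for w, m, s in components)

def marginal(x, classes, priors):
    # p(x) = sum_i p(x|C_i) P(C_i)
    return sum(p * class_density(x, c) for c, p in zip(classes, priors))

# one class made of two Gaussian components, plus a single-component class
classes = [[(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)], [(1.0, 5.0, 1.0)]]
print(marginal(0.0, classes, [0.7, 0.3]))
```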
Clustering vs. Classification
- Less prior knowledge:
  - Number of clusters (may be assumed)
  - Meaning of clusters not assumed
- Unsupervised learning: no labels
Cluster Parameters
t_mi is the ith feature vector in cluster m
Clustering Issues
- Outlier handling
- Dynamic data
- Interpreting results
- Evaluating results
- Number of clusters
- Data to be used
- Scalability
Hierarchical Clustering
Clusters are created in levels, actually creating sets of clusters at each level.
- Agglomerative (bottom up): initially each item is in its own cluster; iteratively, clusters are merged together.
- Divisive (top down): initially all items are in one cluster; large clusters are successively divided.
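The agglomerative (bottom-up) process can be sketched for 1-D points using single link as the merge criterion (a minimal sketch; the function name, the stop-at-k convention, and the O(n^3) brute-force pair search are assumptions made for clarity, not efficiency):

```python
def agglomerative(points, k):
    # bottom up: start with every item in its own cluster, then repeatedly
    # merge the closest pair of clusters (single link) until k clusters remain
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: smallest point-to-point distance between clusters
                d = min(abs(p - q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters

print(agglomerative([1, 2, 9, 10, 25], 2))  # -> [[1, 2, 9, 10], [25]]
```

Recording the merge order and the distance at each merge would yield exactly the dendrogram described earlier.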