CHAPTER 7: CLUSTERING


Semiparametric Density Estimation

Parametric: Assume a single model for $p(x \mid C_i)$ (Chapters 4 and 5).
Semiparametric: $p(x \mid C_i)$ is a mixture of densities. Multiple possible explanations/prototypes: different handwriting styles, accents in speech.
Nonparametric: No model; the data speaks for itself (Chapter 8).

Mixture Densities

$$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$$

where $G_i$ are the components/groups/clusters, $P(G_i)$ are the mixture proportions (priors), and $p(x \mid G_i)$ are the component densities.

Gaussian mixture where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$: parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$, unlabeled sample $X = \{x^t\}_t$ (unsupervised learning).
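As an illustration, here is a minimal sketch of evaluating such a mixture density in Python, assuming NumPy and SciPy are available; the two-component parameters below are made-up illustrative values, not taken from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, priors, means, covs):
    """p(x) = sum_i P(G_i) * N(x; mu_i, Sigma_i)."""
    return sum(P * multivariate_normal.pdf(x, mean=mu, cov=S)
               for P, mu, S in zip(priors, means, covs))

# Two-component mixture in 2D (illustrative parameters)
priors = [0.6, 0.4]
means  = [np.zeros(2), np.array([3.0, 3.0])]
covs   = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), priors, means, covs))
```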

Classes vs. Clusters

Supervised: $X = \{x^t, r^t\}_t$. Classes $C_i$, $i = 1, \ldots, K$, where $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$ and $\Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}$:

$$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$$

with parameters estimated from the labels $r_i^t$:

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad
m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}, \qquad
S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$$

Unsupervised: $X = \{x^t\}_t$. Clusters $G_i$, $i = 1, \ldots, k$, where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$ and $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$:

$$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$$

but here the labels $r_i^t$ are unknown.
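With labels available, these estimates are simple weighted averages. A minimal NumPy sketch, assuming the labels are given as a one-hot matrix R (an encoding choice of this sketch, not prescribed by the slides):

```python
import numpy as np

def supervised_estimates(X, R):
    """MLE of priors, means, covariances from data X (N x d)
    and one-hot labels R (N x K)."""
    N_i = R.sum(axis=0)                      # per-class counts
    priors = N_i / len(X)                    # P^(C_i)
    means = (R.T @ X) / N_i[:, None]         # m_i
    covs = []
    for i in range(R.shape[1]):
        D = X - means[i]
        covs.append((R[:, i, None] * D).T @ D / N_i[i])  # S_i
    return priors, means, np.array(covs)
```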

Clustering

Unsupervised learning problem: we are only given a data description, i.e., $X = \{x^t\}$; no class labels are provided. Our goal is to find groups in the data, where each group possibly represents similar objects; for example, finding groups in online news articles, where individual groups contain articles related to sports, business, politics, etc.

Different methods are available for clustering:
- k-means
- E-M algorithm
- Hierarchical clustering
- Spectral clustering

k-means Clustering

Find k reference vectors (prototypes/codebook vectors/codewords) which best represent the data.

Reference vectors: $m_j$, $j = 1, \ldots, k$. Use the nearest (most similar) reference:

$$\|x^t - m_i\| = \min_j \|x^t - m_j\|$$

Reconstruction error:

$$E\big(\{m_i\}_{i=1}^{k} \mid X\big) = \sum_t \sum_i b_i^t \,\|x^t - m_i\|^2, \qquad
b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$$

Encoding/Decoding

[Figure: encoding/decoding with reference vectors]

$$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$$

k-means Clustering

[Figure: k-means clustering illustration]

k-means Clustering

This is an iterative algorithm. It takes as input k, the number of reference vectors (cluster centers). It starts with random initializations (guesses) of the k cluster centers and repeats the following two steps until convergence (see the sketch below):
1. Assign each data point to its closest cluster center. All data points close to a cluster center form a group; there are k such groups.
2. In each group, the average of all data points is computed and assigned as the new cluster center.
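A minimal NumPy sketch of this procedure; the initialization by sampling data points and the convergence test are implementation choices of this sketch, not prescribed by the slides:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate hard assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k data points as initial centers
    m = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: b[t] = index of nearest reference vector
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        b = d.argmin(axis=1)
        # Update step: each center becomes the mean of its group
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):   # stop when centers no longer move
            break
        m = new_m
    return m, b
```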


Expectation-Maximization (EM)

In k-means, we approached clustering as finding codebook vectors that minimize the reconstruction error. Now our approach is probabilistic, and we look for the component density parameters that maximize the likelihood of the sample.

The log likelihood with a mixture model, given the sample $X = \{x^t\}_t$, is

$$\mathcal{L}(\Phi \mid X) = \sum_t \log p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$$

where $\Phi$ includes the priors $P(G_i)$ and the parameters of the component densities $p(x \mid G_i)$. Unfortunately, we cannot solve this optimization problem analytically and must resort to iterative optimization.

Expectation-Maximization (EM)

The Expectation-Maximization (E-M) algorithm is used in maximum likelihood estimation where the problem involves two sets of random variables:
- Observable variable X
- Hidden variable Z

The goal of the E-M algorithm is to find the parameter vector $\Phi$ that maximizes the likelihood of the observed values of X, $\mathcal{L}(\Phi \mid X)$. In cases where this is not feasible, we associate an extra hidden variable Z and express the underlying model using X and Z: we assume hidden variables z which, when known, make the optimization much simpler.

Complete likelihood, $\mathcal{L}_c(\Phi \mid X, Z)$, in terms of x and z.
Incomplete likelihood, $\mathcal{L}(\Phi \mid X)$, in terms of x.

E- and M-steps

Iterate the two steps:
1. E-step: Estimate z given X and the current $\Phi$.
2. M-step: Find the new $\Phi$ given z, X, and the old $\Phi$.

$$\text{E-step:}\quad Q(\Phi \mid \Phi^l) = E\big[\mathcal{L}_c(\Phi \mid X, Z) \,\big|\, X, \Phi^l\big]$$
$$\text{M-step:}\quad \Phi^{l+1} = \arg\max_\Phi\, Q(\Phi \mid \Phi^l)$$

An increase in Q increases the incomplete likelihood:

$$\mathcal{L}(\Phi^{l+1} \mid X) \geq \mathcal{L}(\Phi^l \mid X)$$

EM in Gaussian Mixtures

$z_i^t = 1$ if $x^t$ belongs to $G_i$, 0 otherwise (analogous to the labels $r_i^t$ of supervised learning); assume $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$.

E-step:

$$E\big[z_i^t \,\big|\, X, \Phi^l\big] = P(G_i \mid x^t, \Phi^l) \equiv h_i^t
= \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)}$$

M-step:

$$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad
m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad
S_i^{l+1} = \frac{\sum_t h_i^t \big(x^t - m_i^{l+1}\big)\big(x^t - m_i^{l+1}\big)^T}{\sum_t h_i^t}$$

Use the estimated labels in place of the unknown labels.
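A compact sketch of these E- and M-steps in Python, assuming SciPy for the Gaussian densities; the small ridge added to the covariances is a numerical-stability choice of this sketch, not part of the algorithm above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture: soft E-step, weighted M-step."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    P = np.full(k, 1.0 / k)                       # priors P(G_i)
    m = X[rng.choice(N, size=k, replace=False)]   # means m_i
    S = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: posterior responsibilities h[t, i]
        h = np.column_stack([P[i] * multivariate_normal.pdf(X, m[i], S[i])
                             for i in range(k)])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, covariances
        Ni = h.sum(axis=0)
        P = Ni / N
        m = (h.T @ X) / Ni[:, None]
        for i in range(k):
            D = X - m[i]
            S[i] = (h[:, i, None] * D).T @ D / Ni[i] + 1e-6 * np.eye(d)
    return P, m, S, h
```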

EM in Gaussian Mixtures

If all component densities share a common covariance matrix $S = s^2 I$, then $p(x \mid G_i) \sim \mathcal{N}(m_i, s^2 I)$, and maximizing in the M-step reduces to minimizing

$$\sum_t \sum_i h_i^t \,\|x^t - m_i\|^2$$

where $h_i^t$ is a number between 0 and 1. This problem looks very similar to the k-means objective, except that $b_i^t$ of k-means makes a hard assignment while $h_i^t$ of E-M makes a soft assignment.

[Figure: EM solution on sample data; the contour marks where $P(G_1 \mid x) = h_1 = 0.5$.]

After Clustering

Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances.

Clustering allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., center and range of features. Example: CRM, customer segmentation.

Mixture of Mixtures

In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:

$$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij}), \qquad
p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$$

Hierarchical Clustering

Cluster based on similarities/distances. Distance measures between instances $x^r$ and $x^s$:

Minkowski ($L_p$) distance (Euclidean for $p = 2$):

$$d_m(x^r, x^s) = \left[\sum_{j=1}^{d} \big|x_j^r - x_j^s\big|^p\right]^{1/p}$$

City-block distance:

$$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} \big|x_j^r - x_j^s\big|$$
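These two distances are straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski (L_p) distance; p=2 gives Euclidean."""
    return np.sum(np.abs(xr - xs) ** p) ** (1.0 / p)

def city_block(xr, xs):
    """City-block (L_1) distance."""
    return np.sum(np.abs(xr - xs))
```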

Agglomerative Clustering

Start with N groups, each containing one instance, and merge the two closest groups at each iteration.

Distance between two groups $G_i$ and $G_j$:

Single-link:
$$d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$$

Complete-link:
$$d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$$

Average-link, centroid:
$$d(G_i, G_j) = \underset{x^r \in G_i,\, x^s \in G_j}{\text{ave}}\, d(x^r, x^s)$$
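In practice, a library routine such as SciPy's `linkage` covers all three linkage criteria; a short sketch (the toy data here are purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

# method: 'single' = single-link (min), 'complete' = complete-link (max),
# 'average' = average-link group distance
Z = linkage(X, method='single', metric='euclidean')

# Cut the merge tree to obtain, e.g., 3 flat clusters;
# scipy.cluster.hierarchy.dendrogram(Z) would plot the dendrogram.
labels = fcluster(Z, t=3, criterion='maxclust')
```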

Example: Single-Link Clustering

[Figure: single-link clustering of sample data and the resulting dendrogram]

Choosing k

- Defined by the application, e.g., image quantization.
- Plot the data (after PCA) and check visually for clusters.
- Incremental (leader-cluster) algorithm: add one cluster at a time until an "elbow" appears (in reconstruction error, log likelihood, or intergroup distances), as in the sketch below.
- Manually check the clusters for meaning.
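A quick way to look for the elbow, reusing the `k_means` sketch from above; the data here are random and purely illustrative:

```python
import numpy as np

def reconstruction_error(X, m, b):
    """Sum of squared distances of each point to its assigned center."""
    return np.sum(np.linalg.norm(X - m[b], axis=1) ** 2)

# Run k-means for increasing k and watch where the error stops
# dropping sharply (the "elbow").
X = np.random.default_rng(1).normal(size=(200, 2))
for k in range(1, 8):
    m, b = k_means(X, k)
    print(k, reconstruction_error(X, m, b))
```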