Some Reading. Clustering and Unsupervised Learning. Some Data. K-Means Clustering. CS 536: Machine Learning Littman (Wu, TA)

Clustering and Unsupervised Learning
CS 536: Machine Learning, Littman (Wu, TA)

Some Reading
Not sure what to suggest for K-Means and single-link hierarchical clustering.
Kleinberg (2002). An impossibility theorem for clustering. In NIPS-15.
Landauer, Laham, and Foltz (1999). Learning human-like knowledge by singular value decomposition: A progress report. In NIPS-10.

K-Means Clustering
Slides from Andrew Moore (CMU).

Some Data
This could easily be modeled by a Gaussian Mixture (with 5 components).
But let's look at a satisfying, friendly and infinitely popular alternative...

Lossy Compression
Want to transmit points, lossily, with only two bits per point.
Loss = Sum Squared Error between decoded and original coordinates.
What encoder/decoder will lose the least information?

First Idea
Break into a grid, decode each bit-pair as the middle of each grid-cell.
Any better ideas?

Second Idea
Break into a grid, decode each bit-pair as the centroid of all data in that grid-cell.
Other ideas?

K-Means
1. Ask user how many clusters he'd like (say, k=5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!
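The K-Means steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the slides' own code: the function names (`squared_distance`, `centroid`, `k_means`), the choice to initialize from k randomly sampled datapoints, and the rule "stop when no point changes owner" are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iters=100):
    """Run the K-means loop; returns (centers, owners)."""
    rng = random.Random(seed)
    # Guess k center locations (assumption: use k randomly chosen datapoints).
    centers = rng.sample(points, k)
    owners = None
    for _ in range(max_iters):
        # Each datapoint finds out which center it is closest to.
        new_owners = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Terminate once no point changes owner (assumed stopping rule).
        if new_owners == owners:
            break
        owners = new_owners
        # Each center jumps to the centroid of the points it owns.
        for j in range(k):
            owned = [p for p, o in zip(points, owners) if o == j]
            if owned:  # a center that owns nothing stays where it is
                centers[j] = centroid(owned)
    return centers, owners

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, owners = k_means(points, k=2)
```

On this toy data the two groups end up with different owners. Which group gets index 0 depends on the random initial guess.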

K-Means Illustrated
For a fast implementation, see: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on

K-Means Questions
What is it trying to optimize?
Are we sure it will terminate?
Are we sure it will find an optimal clustering?
How should we start it?
How could we automatically choose the number of centers?
...we'll deal with these questions over the next few slides.

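Distortion
Given:
an encoder function ENCODE: $\mathbb{R}^m \to [1..k]$
a decoder function DECODE: $[1..k] \to \mathbb{R}^m$
Define
$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - \mathrm{DECODE}[\mathrm{ENCODE}(x_i)] \right)^2$$
We may as well write $\mathrm{DECODE}[j] = c_j$, so
$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - c_{\mathrm{ENCODE}(x_i)} \right)^2$$

The Minimal Distortion
What properties must centers $c_1, c_2, \ldots, c_k$ have when distortion is minimized?
(1) $x_i$ must be encoded by its nearest center. Why? Otherwise distortion could be reduced by replacing ENCODE($x_i$) with the nearest center. So, at the minimal distortion,
$$c_{\mathrm{ENCODE}(x_i)} = \arg\min_{c_j \in \{c_1, c_2, \ldots, c_k\}} (x_i - c_j)^2$$

The Minimal Distortion (2)
(2) The partial derivative of Distortion with respect to each center location must be zero.
$$\mathrm{Distortion} = \sum_{i=1}^{R} \left( x_i - c_{\mathrm{ENCODE}(x_i)} \right)^2 = \sum_{j=1}^{k} \sum_{i \in \mathrm{OwnedBy}(c_j)} (x_i - c_j)^2$$
$$\frac{\partial\,\mathrm{Distortion}}{\partial c_j} = \frac{\partial}{\partial c_j} \sum_{i \in \mathrm{OwnedBy}(c_j)} (x_i - c_j)^2 = -2 \sum_{i \in \mathrm{OwnedBy}(c_j)} (x_i - c_j) = 0 \quad \text{(for a minimum)}$$
OwnedBy($c_j$) = the set of records owned by center $c_j$.
Solving for $c_j$ puts each center at the centroid of the points it owns:
$$c_j = \frac{1}{|\mathrm{OwnedBy}(c_j)|} \sum_{i \in \mathrm{OwnedBy}(c_j)} x_i$$

Improving the Configuration
What properties can be changed for centers $c_1, c_2, \ldots, c_k$ when distortion is not minimized?
(1) Change encoding so that $x_i$ is encoded by its nearest center.
(2) Set each center to the centroid of points it owns.
There's no point applying either operation twice in succession.
But it can be profitable to alternate.
...And that's K-means!
Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?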
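One way to build intuition for the termination question is to record the distortion after every application of operation (1) and operation (2) and check that it never goes up. The sketch below does that on a small one-dimensional dataset. The helper names (`sq_dist`, `distortion`, `reassign`, `recenter`, `distortion_trace`) and the data are hypothetical, chosen only for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def distortion(points, centers, owners):
    """Sum over records of the squared distance to the owning center."""
    return sum(sq_dist(p, centers[o]) for p, o in zip(points, owners))

def reassign(points, centers):
    """Operation (1): encode each point by its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def recenter(points, centers, owners):
    """Operation (2): move each center to the centroid of the points it owns."""
    new_centers = []
    for j, c in enumerate(centers):
        owned = [p for p, o in zip(points, owners) if o == j]
        if owned:  # a center that owns nothing is left in place
            c = tuple(sum(xs) / len(owned) for xs in zip(*owned))
        new_centers.append(c)
    return new_centers

def distortion_trace(points, centers, rounds=10):
    """Alternate (1) and (2), recording the distortion after every operation."""
    trace = []
    for _ in range(rounds):
        owners = reassign(points, centers)
        trace.append(distortion(points, centers, owners))
        centers = recenter(points, centers, owners)
        trace.append(distortion(points, centers, owners))
    return trace

# Hypothetical 1-D data and a deliberately poor starting guess.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
trace = distortion_trace(points, centers=[(0.0,), (1.0,)])
non_increasing = all(a >= b - 1e-9 for a, b in zip(trace, trace[1:]))
```

Here `trace` starts at 186 and settles at 2.5. Since neither operation can raise the distortion and there are only finitely many ways to assign R records to k centers, the usual argument is that the alternation cannot cycle forever.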

Is It Optimal?
Not necessarily.
Can you invent a configuration that has converged, but does not have the minimum distortion?

Seeking Good Optima
Idea 1: Be careful about where you start. (Use datapoints, far apart if possible.)
Idea 2: Do many runs of K-Means, each from a different random starting point.
Many ideas floating around.

Common Uses of K-Means
Often used as an exploratory data analysis tool.
In one dimension, a good way to quantize real-valued variables into k non-uniform buckets.
Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization).
Also used for choosing color palettes on old-fashioned graphical display devices!
(Note: How to choose the number of centers?)

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster".
2. Find the "most similar" pair of clusters.
3. Merge it into a parent cluster.
4. Repeat.
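Returning to the "Is It Optimal?" question: one configuration that has converged without reaching the minimum distortion can be checked directly. The sketch below uses four points at the corners of a wide rectangle, a hypothetical dataset picked for this example. Starting with one center on the top edge and one on the bottom edge leaves K-means stuck, while starting with one center on each short side gives a much lower distortion. The function `lloyd` is an illustrative helper, not something from the slides.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iters=100):
    """Alternate nearest-center assignment and centroid updates from a given start.

    Returns (final_centers, distortion).
    """
    centers = list(centers)
    for _ in range(max_iters):
        owners = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j, c in enumerate(centers):
            owned = [p for p, o in zip(points, owners) if o == j]
            if owned:
                c = tuple(sum(xs) / len(owned) for xs in zip(*owned))
            new_centers.append(c)
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    cost = sum(sq_dist(p, centers[o]) for p, o in zip(points, owners))
    return centers, cost

# Corners of a 4-by-1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: centers in the middle of the bottom and top edges (already a fixed point).
stuck_centers, stuck_cost = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])

# Start B: centers in the middle of the left and right sides.
good_centers, good_cost = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])
```

Start A stays where it began with distortion 16, while start B has distortion 1. Both are states that neither operation changes, which is why running from several starting points (Idea 2) and keeping the lowest-distortion result is a common remedy.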

Some Details
How do we define similarity between clusters?
Minimum distance between points in clusters (simply Euclidean Minimum Spanning Trees)
Maximum distance between points in clusters
Average distance between points in clusters
You're left with a nice dendrogram (taxonomy, hierarchy, tree) of datapoints.

Single Linkage Comments
It's nice that you get a hierarchy instead of an amorphous collection of groups.
If you want k groups, just cut the (k-1) longest links.
There's no real statistical or information-theoretic foundation to this.
Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym).

Stuff to Know
All the details of K-means
The theory behind K-means as an optimization algorithm
How K-means can get stuck
The outline of hierarchical clustering

Clustering Desiderata
Abstractly, a clustering algorithm f maps a distance matrix D between objects to a partition P of objects.
What is an ideal clustering algorithm?
1. Richness: for all P, there exists D s.t. f(D) = P.
2. Scale invariance: for all c > 0, f(D) = f(cD).
3. Consistency: for any D, if f(D) = P, then for any D' with smaller within-cluster distances and larger between-cluster distances, f(D') = P.
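The single-linkage procedure (every point starts as its own cluster, the closest pair of clusters is merged, repeat) can be sketched as below, using the minimum distance between points as the cluster similarity. This is a deliberately naive version that recomputes all pairwise distances at every merge. The function name `single_linkage` and the sample points are assumptions for illustration.

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters until k clusters remain.

    Cluster distance is the minimum distance between a point in one cluster
    and a point in the other. Returns (clusters, merges), where clusters holds
    lists of point indices and merges records (distance, cluster_a, cluster_b)
    in merge order, which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]  # every point is its own cluster
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]  # merge into a parent cluster
        del clusters[b]                          # b > a, so index a is unaffected
    return clusters, merges

# Hypothetical data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (20.0, 0.0)]
clusters, merges = single_linkage(points, k=3)
```

Stopping at k clusters gives the same groups as building the full tree and cutting its (k-1) longest links, because single-link merges happen in order of increasing link length.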

Any Pair of These Possible
Using single-link clustering, we can vary the stopping condition and get:
Richness and consistency: Stop when distances are more than r.
Scale invariance and consistency: Stop when number of clusters is k.
Richness and scale invariance: Stop when distances exceed r/max D.

All Three Impossible!
It is impossible to specify a clustering algorithm that satisfies all three conditions simultaneously (Kleinberg 2002).
Essentially, the proof shows that consistency and scale invariance severely constrain what partitions can be created.

Why Clustering?
Which brings up the question, why did we want to do clustering anyway? What is the task?
Can we accomplish the same things by assigning distances or similarities to pairs of objects?

Alternatives to Clustering
An embedding algorithm e maps distances D to locations L in some vector space so that $\mathrm{dist}(L_i, L_j)$ approximates $D_{ij}$.
multidimensional scaling (MDS)
singular value decomposition (SVD)
non-negative matrix factorization (NMF)
non-linear embedding (LLE, Isomap)
and more
No hard decision on clusters.
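The effect of the stopping condition on scale invariance, from the "Any Pair of These Possible" slide, can be checked on a toy example. The sketch below runs single-link clustering with a pluggable stopping rule and compares the result on a dataset and on the same dataset with every coordinate multiplied by 10. All names (`single_link`, `stop_at_k`, `stop_at_r`, `scale`) and the values k = 2 and r = 2.0 are hypothetical choices for this illustration.

```python
import math

def single_link(points, stop):
    """Single-link clustering; stop(clusters, next_link) decides when to halt."""
    clusters = [frozenset([i]) for i in range(len(points))]
    while len(clusters) > 1:
        # Shortest link between any two current clusters, with their indices.
        d, a, b = min(
            (min(math.dist(points[i], points[j]) for i in ca for j in cb), a, b)
            for a, ca in enumerate(clusters)
            for b, cb in enumerate(clusters) if a < b
        )
        if stop(clusters, d):
            break
        merged = clusters[a] | clusters[b]
        clusters = [c for n, c in enumerate(clusters) if n not in (a, b)] + [merged]
    return set(clusters)

def stop_at_k(clusters, next_link):
    """Stop when the number of clusters is k (here k = 2)."""
    return len(clusters) <= 2

def stop_at_r(clusters, next_link):
    """Stop when the next link is longer than r (here r = 2.0)."""
    return next_link > 2.0

def scale(points, c):
    """Multiply every coordinate by c, which multiplies every distance by c."""
    return [tuple(c * x for x in p) for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
k_same = single_link(points, stop_at_k) == single_link(scale(points, 10), stop_at_k)
r_same = single_link(points, stop_at_r) == single_link(scale(points, 10), stop_at_r)
```

With the k rule the partition is unchanged by rescaling (`k_same` is true). With the fixed-threshold rule the rescaled data never merges at all, so `r_same` is false, in line with the slide's pairing of the threshold rule with richness and consistency rather than scale invariance.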

Latent Semantic Analysis
Start with a set of text documents and form a term-by-document matrix M.
Say the similarity between two words is defined by the dot product of the vectors of documents in which the words appear.
Near synonyms are often not assigned a high similarity score if they tend not to co-occur. Clustering could help.
Different idea: embed the words in a lower-dimensional space, capturing similarity. Will tend to drop insignificant differences.

Example LSA Applications
Train on 30k+ documents, 16k+ words.
Scores at human level in vocabulary, multiple-choice psych exams.
Can grade essay tests.
Recognizes synonyms and antonyms as being closely related (but separable).
Similarity patterns more like kids (r=.5) than adults (r=.3).
Exhibits priming-like effects.
Useful for IR, and cross-language IR.
Can judge coherence, predict what to read.

Open Question
What is something we can do with clustering that we don't know how to do with embedding?
high-level representations?
composability?
collecting cluster-based statistics?
Lots to think about.
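The problem with raw dot-product similarity described on the Latent Semantic Analysis slide can be seen on a tiny made-up corpus. In the sketch below the four documents, the variable names and the `similarity` helper are all hypothetical. It only builds the term-by-document matrix and compares word vectors; the lower-dimensional embedding step of LSA is not shown.

```python
# Hypothetical mini-corpus: "car" and "automobile" never share a document.
docs = [
    "the car needs a new engine",
    "the automobile needs a new engine",
    "the car was parked outside",
    "the cat sat outside",
]

vocab = sorted({word for doc in docs for word in doc.split()})

# Term-by-document matrix M: M[word][j] is the count of word in document j.
M = {word: [doc.split().count(word) for doc in docs] for word in vocab}

def similarity(w1, w2):
    """Dot product of the two words' document-count vectors."""
    return sum(a * b for a, b in zip(M[w1], M[w2]))

car_vs_automobile = similarity("car", "automobile")
car_vs_engine = similarity("car", "engine")
automobile_vs_engine = similarity("automobile", "engine")
```

The near synonyms "car" and "automobile" score 0 because they never co-occur, even though each scores 1 against "engine". That shared context is the kind of indirect evidence an embedding is meant to pick up.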
