Clustering & Unsupervised Learning
Nuno Vasconcelos (Ken Kreutz-Delgado), UCSD

Statistical Learning
Goal: Given a relationship between a feature vector $x$ and a vector $y$, and iid data samples $(x_i, y_i)$, find an approximating function $f(x) \approx y$:
$$x \;\longrightarrow\; f(\cdot) \;\longrightarrow\; \hat{y} = f(x) \approx y$$
This is called training or learning.
Two major types of learning:
- Unsupervised Classification (aka Clustering): only X is known.
- Supervised Classification or Regression: both X and target value Y are known during training, only X is known at test time.

Unsupervised Learning: Clustering
Why learning without supervision?
- In many problems labels are not available, or are impossible or expensive to get.
- E.g. in the hand-written digits example, a human sat in front of the computer for hours to label all those examples.
- For other problems the classes to be labeled depend on the application.
- A good example is image segmentation: if you want to know if this is an image of the wild or of a big city, there is probably no need to segment. If you want to know if there is an animal in the image, then you would segment. Unfortunately, the segmentation mask is usually not available.

Review of Supervised Classification
Although our focus is on clustering, let us start by reviewing supervised classification.
To implement the optimal decision rule for a supervised classification problem, we need to:
- Collect a labeled iid training data set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a vector of observations and $y_i$ is the associated class label, and then
- Learn a probability model for each class. This involves estimating $P_{X|Y}(x|i)$ and $P_Y(i)$ for each class $i$.

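Supervised Classification
This can be done by Maximum Likelihood Estimation (MLE). MLE has two steps:
1) Choose a parametric model for each class pdf: $P_{X|Y}(x|i; \theta_i)$, $\theta_i \in \Theta_i$.
2) Select the parameters of class $i$ to be the ones that maximize the probability of the iid data $D^{(i)}$ from that class:
$$\hat{\theta}_i = \arg\max_{\theta \in \Theta_i} P_{X|Y}\big(D^{(i)} \,\big|\, i; \theta\big) = \arg\max_{\theta \in \Theta_i} \log P_{X|Y}\big(D^{(i)} \,\big|\, i; \theta\big)$$
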
Maximum Likelihood Estimation
We have seen that MLE can be a straightforward procedure. In particular, if the pdf is twice differentiable, then solutions are parameter values $\hat{\theta}_i \in \Theta_i$ such that
$$\nabla_\theta P_{X|Y}\big(D^{(i)} \,\big|\, i; \hat{\theta}_i\big) = 0$$
$$\theta^T \, \nabla^2_\theta P_{X|Y}\big(D^{(i)} \,\big|\, i; \hat{\theta}_i\big) \, \theta \le 0, \quad \forall \theta$$
You always have to check the second-order condition: a zero gradient alone does not guarantee a maximum.
We must also find an MLE for the class probabilities $P_Y(i)$, but here there is not much choice of probability model.
E.g. Bernoulli: the ML estimate is the percentage of training points in the class.

Maximum Likelihood Estimation
We have worked out the Gaussian case in detail.
$D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ = set of examples from class $i$.
The ML estimates for class $i$ are
$$\hat{\mu}_i = \frac{1}{n_i} \sum_j x_j^{(i)}, \qquad \hat{P}_Y(i) = \frac{n_i}{n}, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_j \big(x_j^{(i)} - \hat{\mu}_i\big)\big(x_j^{(i)} - \hat{\mu}_i\big)^T$$
There are many other distributions for which we can derive a similar set of equations, but the Gaussian case is particularly relevant for clustering (more on this later).

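The three Gaussian ML estimates can be sketched in a few lines of NumPy. This is only an illustration: the function name `gaussian_mle` and the toy data are assumptions, not part of the slides. Note the $1/n_i$ normalization of the covariance, which is the ML estimate rather than the unbiased $1/(n_i - 1)$ one.

```python
import numpy as np

def gaussian_mle(X, y):
    """Per-class ML estimates: mean, covariance (1/n_i normalization), prior n_i/n."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                      # D^(i): examples from class i
        mu = Xc.mean(axis=0)                # sample mean
        diff = Xc - mu
        sigma = diff.T @ diff / len(Xc)     # ML covariance estimate
        params[c.item()] = {"mean": mu, "cov": sigma, "prior": len(Xc) / n}
    return params

# Hypothetical toy data: four points in class 0, two in class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [10.0, 10.0], [12.0, 10.0]])
y = np.array([0, 0, 0, 0, 1, 1])
params = gaussian_mle(X, y)
print(params[0]["mean"], params[0]["prior"])
```
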
Supervised Learning via MLE
This gives probability models for each of the classes.
Now we utilize the fact that, assuming the zero/one loss, the optimal decision rule (BDR) is the MAP rule:
$$i^*(x) = \arg\max_i P_{Y|X}(i|x)$$
which can also be written as
$$i^*(x) = \arg\max_i \big[ \log P_{X|Y}(x|i) + \log P_Y(i) \big]$$
This completes the process of supervised learning of a BDR. We now have a rule for classifying any (unlabeled) future measurement $x$.

Gaussian Classifier
In the Gaussian case the BDR is
$$i^*(x) = \arg\min_i \big[ d_i^2(x, \mu_i) + \alpha_i \big]$$
with
$$d_i^2(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = \log\big((2\pi)^d |\Sigma_i|\big) - 2 \log P_Y(i)$$
[Figure: discriminant, i.e. the set of points where $P_{Y|X}(1|x) = 0.5$]
This can be seen as finding the nearest class neighbor, using a funny metric:
- Each class has its own squared distance, which is the squared Mahalanobis distance for that class plus the constant $\alpha_i$.
- We effectively have different metrics in different regions of the space.

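A minimal sketch of this decision rule, assuming the class means, covariances, and priors have already been estimated. The function name `gaussian_bdr` and the example parameters are hypothetical.

```python
import numpy as np

def gaussian_bdr(x, means, covs, priors):
    """Return argmin_i [ d_i^2(x, mu_i) + alpha_i ] for Gaussian classes."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    costs = []
    for mu, cov, p in zip(means, covs, priors):
        diff = x - np.asarray(mu, dtype=float)
        maha = diff @ np.linalg.solve(cov, diff)       # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(cov)             # log |Sigma_i|
        alpha = d * np.log(2 * np.pi) + logdet - 2 * np.log(p)
        costs.append(maha + alpha)
    return int(np.argmin(costs))

# Hypothetical two-class problem with equal priors.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(gaussian_bdr([1.0, 0.0], means, covs, priors))
print(gaussian_bdr([3.5, 0.0], means, covs, priors))
```

Using `np.linalg.solve` and `np.linalg.slogdet` avoids forming the inverse or the determinant explicitly, which tends to be more stable for poorly conditioned covariances.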
Gaussian Classifier
A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
$$i^*(x) = \arg\min_i \big[ d^2(x, \mu_i) + \alpha_i \big]$$
with
$$d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2 \log P_Y(i)$$
[Figure: discriminant, i.e. the set of points where $P_{Y|X}(1|x) = 0.5$]
Note: $\alpha_i$ can be dropped when all classes have equal probability.
- Then this is close to the NN classifier with Mahalanobis distance.
- However, instead of finding the nearest neighbor, it looks for the nearest class prototype or template $\mu_i$.

Gaussian Classifier
$\Sigma_i = \Sigma$ for two classes (detection)
One important property of this case is that the decision boundary is a hyperplane.
This can be shown by computing the set of points $x$ such that
$$d^2(x, \mu_0) + \alpha_0 = d^2(x, \mu_1) + \alpha_1$$
and showing that they satisfy
$$w^T (x - x_0) = 0$$
This is the equation of a hyperplane with normal $w$. $x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case $w$ and $x_0$ are then parallel.
[Figure: hyperplane discriminant, the set where $P_{Y|X}(1|x) = 0.5$, with normal $w$ and point $x_0$]

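For reference, the hyperplane can be worked out explicitly. The following is a sketch of the algebra under the shared-covariance assumption, with $\alpha_i = -2 \log P_Y(i)$; the symbol $b$ is introduced here for convenience and does not appear on the slides. Expanding both squared distances, the quadratic term $x^T \Sigma^{-1} x$ cancels, leaving a condition that is linear in $x$:

$$
\begin{aligned}
d^2(x, \mu_0) + \alpha_0 &= d^2(x, \mu_1) + \alpha_1 \\
-2 \mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 + \alpha_0 &= -2 \mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 + \alpha_1 \\
w^T x &= b, \qquad w = \Sigma^{-1} (\mu_1 - \mu_0), \\
b &= \tfrac{1}{2} \big( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \big) + \log \frac{P_Y(0)}{P_Y(1)}
\end{aligned}
$$

The minimum-norm point on the hyperplane is $x_0 = b \, w / \|w\|^2$, which is parallel to $w$ and satisfies $w^T x_0 = b$, so the boundary is $w^T (x - x_0) = 0$. Class 1 is chosen when $w^T x > b$.
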
Gaussian Classifier
If all the covariances are the identity, $\Sigma_i = I$:
$$i^*(x) = \arg\min_i \big[ d^2(x, \mu_i) + \alpha_i \big]$$
with
$$d^2(x, y) = \|x - y\|^2, \qquad \alpha_i = -2 \log P_Y(i)$$
This is just (Euclidean distance) template matching with class means as templates.
E.g. for digit classification, the class means (templates) are: [figure]
Compare complexity of template matching to nearest neighbors!

Unsupervised Classification - Clustering
In a clustering problem we do not have labels in the training set.
We can try to estimate both the class labels and the class pdf parameters.
Here is a strategy:
Assume k classes with pdfs initialized to randomly chosen parameter values. Then iterate between two steps:
1) Apply the optimal decision rule for the (estimated) class pdfs. This assigns each point to one of the clusters, creating pseudo-labeled data.
2) Update the pdf estimates by doing parameter estimation within each estimated (pseudo-labeled) class cluster found in step 1.

Unsupervised Classification - Clustering
Natural question: what probability model do we assume? Let's start as simple as possible.
Assume: k Gaussian classes with identity covariances and equal $P_Y(i)$.
Each class has an unknown mean (prototype) $\mu_i$ which must be learned.
The resulting clustering algorithm is the k-means algorithm:
Start with some initial estimate of the $\mu_i$ (e.g. random, but distinct). Then iterate between
1) BDR classification using the current estimates of the k class means:
$$i^*(x) = \arg\min_{1 \le i \le k} \|x - \mu_i\|^2$$
2) Re-estimation of the k class means:
$$\mu_i^{new} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_j^{(i)} \quad \text{for } i = 1, \ldots, k$$

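The two alternating steps can be sketched as follows. This is a minimal illustration, not an optimized implementation: the names `assign` and `kmeans`, the stopping test, and the toy data are assumptions. A cluster that receives no points simply keeps its previous mean here.

```python
import numpy as np

def assign(X, means):
    """Step 1: BDR classification, i*(x) = argmin_i ||x - mu_i||^2."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, init_means, n_iter=100):
    X = np.asarray(X, dtype=float)
    means = np.array(init_means, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, means)
        # Step 2: re-estimate each mean from the points assigned to it.
        new_means = means.copy()
        for i in range(len(means)):
            members = X[labels == i]
            if len(members) > 0:
                new_means[i] = members.mean(axis=0)
        if np.allclose(new_means, means):   # assignments can no longer change
            break
        means = new_means
    return means, assign(X, means)

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means, labels = kmeans(X, init_means=X[[0, 3]])
print(means)
print(labels)
```
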
K-means (thanks to Andrew Moore, CMU)
[figures]

K-means Clustering
The name comes from the fact that we are trying to learn the k means (mean values) of k assumed clusters.
It is optimal if you want to minimize the expected value of the squared error between vector x and the template to which x is assigned.
K-means results in a Voronoi tessellation of the feature space.
Problems:
- How many clusters? (i.e., what is k?)
  - Various methods available: Bayesian information criterion, Akaike information criterion, minimum description length.
  - Guessing can work pretty well.
- Algorithm converges to a local minimum solution only.
- How does one initialize?
  - Random can be pretty bad.
  - Mean Splitting can be significantly better.

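To make the optimality statement concrete, the sample version of the expected squared error can be written as follows. The symbol $J$ is introduced here for illustration and is not used on the slides.

$$
J(\mu_1, \ldots, \mu_k) = \frac{1}{n} \sum_{j=1}^{n} \min_{1 \le i \le k} \|x_j - \mu_i\|^2
$$

For fixed means, assigning each point to its nearest mean minimizes $J$ over assignments; for fixed assignments, the cluster sample means minimize $J$ over the $\mu_i$. Neither step can increase $J$, which is why the iteration converges, but only to a local minimum that depends on the initialization.
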
Growing k via Mean Splitting
Let k = 1. Compute the sample mean of all points, $\mu^{(1)}$. (The superscript denotes the current value of k.)
To initialize the means for k = 2, perturb the mean $\mu^{(1)}$ randomly:
$$\mu_1^{(2)} = \mu^{(1)}, \qquad \mu_2^{(2)} = (1 + \varepsilon)\, \mu^{(1)}, \qquad \varepsilon \ll 1$$
Then run k-means until convergence for k = 2.
Initialize means for k = 4:
$$\mu_1^{(4)} = \mu_1^{(2)}, \quad \mu_2^{(4)} = (1 + \varepsilon)\, \mu_1^{(2)}, \quad \mu_3^{(4)} = \mu_2^{(2)}, \quad \mu_4^{(4)} = (1 + \varepsilon)\, \mu_2^{(2)}$$
Then run k-means until convergence for k = 4.
Etc.

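A sketch of the splitting schedule, assuming a fixed scalar `eps` in place of a random perturbation and a compact k-means loop (`run_kmeans`); both function names are hypothetical. Because the perturbation is multiplicative, it does nothing to a mean that sits exactly at the origin, so an additive perturbation may be preferable in that case. As written, k doubles at each stage, so the final k is a power of two.

```python
import numpy as np

def run_kmeans(X, means, n_iter=100):
    """Plain k-means from the given initial means; empty clusters keep their mean."""
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(len(means))])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means

def mean_splitting(X, k_final, eps=1e-3):
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0, keepdims=True)        # k = 1: global sample mean
    while len(means) < k_final:
        means = np.repeat(means, 2, axis=0)      # [m1, m1, m2, m2, ...]
        means[1::2] *= (1.0 + eps)               # [m1, (1+eps) m1, m2, (1+eps) m2, ...]
        means = run_kmeans(X, means)             # converge before splitting again
    return means

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means2 = mean_splitting(X, k_final=2)
print(means2)
```
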
Deleting Empty Clusters
Empty clusters can be a source of algorithmic difficulties.
Therefore, at the end of each iteration of k-means:
- Check the number of elements in each cluster.
- If too low, throw the cluster away.
- Reinitialize from the most populated cluster, using a perturbed version of that cluster's mean.
Note that there are alternative names:
- In the compression literature this is known as the Generalized Lloyd Algorithm. This is actually the right name, since Lloyd was the first to invent it.
- It is also known as (data) Vector Quantization and is used in the design of vector quantizers.

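One possible reading of this check, sketched under stated assumptions: any cluster with fewer than `min_count` points is discarded and its mean is replaced by a slightly scaled copy of the mean of the most populated cluster. The function name, the threshold `min_count`, and the multiplicative perturbation `eps` are illustrative choices, not taken from the slides.

```python
import numpy as np

def reseed_small_clusters(means, labels, min_count=1, eps=1e-3):
    """Replace the means of under-populated clusters with a perturbed copy
    of the mean of the most populated cluster."""
    means = np.array(means, dtype=float)
    counts = np.bincount(labels, minlength=len(means))   # points per cluster
    biggest = counts.argmax()
    for i in np.flatnonzero(counts < min_count):
        means[i] = (1.0 + eps) * means[biggest]
    return means

# Hypothetical state after one k-means iteration: cluster 2 received no points.
means = np.array([[0.0, 0.0], [10.0, 10.0], [50.0, 50.0]])
labels = np.array([0, 0, 1, 1, 1])
new_means = reseed_small_clusters(means, labels)
print(new_means)
```
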
Vector Quantization
Is a popular data compression technique:
- Find a codebook of prototypes for the vectors to compress.
- Instead of transmitting each vector, transmit the codebook index.
Image compression example:
- Each pixel has 3 colors (requiring 3 bytes of information).
- Instead, find the optimal 256 color prototypes! (256 ~ 1 byte of information)

Vector Quantization
We now have an image compression scheme:
- Each pixel has 3 colors (1 byte per color = 3 bytes total needed).
- Instead, find the nearest neighbor template among the 256 color prototypes.
- We transmit the template index.
- Since there are only 256 templates, only one byte is needed.
- Using the index, the decoder looks up the prototype in its table.
By accepting a little bit of distortion, we saved 2 bytes per pixel!

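The encoder and decoder can be sketched as a nearest-prototype lookup and a table lookup. The function names and the four-color codebook are illustrative assumptions; in the scheme described above the codebook would hold 256 prototypes learned with k-means.

```python
import numpy as np

def vq_encode(pixels, codebook):
    """Map each RGB pixel to the index of its nearest codebook prototype."""
    diff = pixels[:, None, :].astype(float) - codebook[None, :, :].astype(float)
    d2 = (diff ** 2).sum(axis=2)
    return d2.argmin(axis=1).astype(np.uint8)    # one byte per pixel (<= 256 entries)

def vq_decode(indices, codebook):
    """Look each index up in the codebook table."""
    return codebook[indices]

# Hypothetical codebook: black, red, green, blue.
codebook = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8)
pixels = np.array([[250, 5, 5], [3, 2, 1], [10, 240, 20]], dtype=np.uint8)

indices = vq_encode(pixels, codebook)
decoded = vq_decode(indices, codebook)
print(indices, pixels.nbytes, indices.nbytes)
```

The cast to `float` before subtracting matters: subtracting `uint8` arrays directly would wrap around instead of producing negative differences.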
K-means
There are many other applications of K-means.
E.g. image segmentation: decompose each image into component objects.
- Run k-means on the colors and look at the assignments.
- E.g., the pixels assigned to the red cluster tend to be from the booth: [figure]

K-means
We can also use texture information in addition to color.
- Many methods for clustering using texture metrics.
- Here are some results: [figure]
Note that this is not the state-of-the-art in image segmentation, but it gives a good idea of what k-means can do.

Extensions to basic K-means
There are many extensions to the basic k-means algorithm.
One of the most important applications is to the problem of learning accurate approximations to general, nontrivial pdfs.
Remember that the decision rule
$$i^*(x) = \arg\max_i \big[ \log P_{X|Y}(x|i) + \log P_Y(i) \big]$$
is optimal iff the true probabilities $P_{X|Y}(x|i)$ are correctly estimated.
This often turns out to be impossible when we use overly simple parametric models like the Gaussian:
- Often the true probability is too complicated for any simple model to hold accurately.
- Even if simple models provide good local approximations, there are usually multiple clusters when we take a global view.
These weaknesses can be addressed by the use of mixture distributions and the Expectation-Maximization (EM) Algorithm.

Mixture Distributions
Consider the following problem:
- Certain types of traffic are banned from a bridge.
- We want an automatic detector/classifier to see if the ban is holding.
- A sensor measures vehicle weight.
- We want to classify each car into class = OK or class = Banned.
- We know that in each class there are multiple sub-classes. E.g. OK = {compact, sedan, station wagon, SUV}, Banned = {truck, bus, semi}.
Each of the sub-classes is close to Gaussian, but for the whole class we get this: [figure]

Mixture Distributions
This distribution is a mixture.
- The overall shape is determined by a number of (sub) class densities.
- We introduce a random variable Z to account for this.
- A value of Z = c points to class c and thus picks out the c-th component density from the mixture.
E.g. a Gaussian mixture:
$$P_X(x) = \sum_{c=1}^{C} \pi_c \, G(x; \mu_c, \sigma_c^2)$$
where $C$ is the number of mixture components, $\pi_c$ is the c-th component weight, and $G(x; \mu_c, \sigma_c^2)$ is the c-th mixture component, a Gaussian pdf.

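A one-dimensional Gaussian mixture density can be evaluated with a few lines of NumPy. The function names and the two-component parameters below are made up for illustration; the check at the end confirms numerically that the mixture integrates to about one when the weights sum to one.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density G(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, mus, variances):
    """P_X(x) = sum_c pi_c G(x; mu_c, var_c)."""
    x = np.asarray(x, dtype=float)
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))

# Hypothetical two-component mixture.
weights, mus, variances = [0.3, 0.7], [0.0, 10.0], [1.0, 4.0]
grid = np.linspace(-20.0, 40.0, 20001)
density = mixture_pdf(grid, weights, mus, variances)
area = density.sum() * (grid[1] - grid[0])    # Riemann-sum approximation of the integral
print(area)
```
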
Mixture Distributions
Learning a mixture density is a type of soft clustering problem.
- For each training point $x_k$ we need to figure out from which component class $Z_k = Z(x_k) = j$ it was drawn.
- Once we know how points are assigned to a component $j$, we can estimate the component $j$ pdf parameters.
This could be done with k-means. A more general algorithm is Expectation-Maximization (EM).
A key difference from k-means: we never hard assign the points $x_k$.
- In the expectation step we compute the posterior probabilities that a point $x_k$ belongs to class $j$, for every $j$, conditioned on all the data D.
- But we do not make a hard decision! (E.g., we do not assign the point $x_k$ only to a single class via the MAP rule.)
- Instead, in the maximization step, the point $x_k$ participates in all classes to a degree weighted by the posterior class probabilities.

Expectation-Maximization (EM)
The EM Algorithm:
1. Start with an initial parameter vector estimate $\theta^{(0)}$.
2. E-step: Given current parameters $\theta^{(i)}$ and observations in D, estimate the indicator functions $\chi(Z_k = j)$ via the conditional expectation
$$h_{kj} = E\{\chi(Z_k = j) \mid D; \theta^{(i)}\} = E\{\chi(Z_k = j) \mid x_k; \theta^{(i)}\}$$
3. M-step: Weighting the data $x_k$ by $h_{kj}$, we have a complete-data MLE problem for each class $j$. I.e., maximize the class $j$ likelihoods for the parameters, i.e. re-compute $\theta^{(i+1)}$.
4. Go to 2.
In graphical form: [Figure: loop between the E-step, which fills in the class assignments $h_{kj}$, and the M-step, which estimates the parameters $\theta^{(i+1)}$]

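Expectation-Maximization (EM)
Note that for any mixture density we have:
$$h_{kj} = E\{\chi(Z_k = j) \mid x_k; \theta^{(i)}\} = P_{Z|X}(Z_k = j \mid x_k; \theta^{(i)})$$
$$= \frac{P_{X|Z}(x_k \mid Z_k = j; \theta^{(i)}) \, P_Z(Z_k = j; \theta^{(i)})}{P_X(x_k; \theta^{(i)})} \quad \text{(from Bayes rule)}$$
$$= \frac{P_{X|Z}(x_k \mid Z_k = j; \theta^{(i)}) \, \pi_j^{(i)}}{\sum_{c=1}^{C} P_{X|Z}(x_k \mid Z_k = c; \theta^{(i)}) \, \pi_c^{(i)}}$$
and
$$n_j = \sum_{k=1}^{n} \chi(Z_k = j), \qquad \hat{n}_j = E\{n_j \mid D; \theta^{(i)}\} = \sum_{k=1}^{n} h_{kj}, \qquad n = \sum_{j=1}^{C} n_j = \sum_{j=1}^{C} \hat{n}_j$$
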
Expectation-Maximization (EM)
In particular, for a Gaussian mixture we have:
Expectation Step
$$h_{kj} = P_{Z|X}(Z_k = j \mid x_k; \theta^{(i)}) = \frac{G\big(x_k; \mu_j^{(i)}, \sigma_j^{2\,(i)}\big) \, \pi_j^{(i)}}{\sum_{c=1}^{C} G\big(x_k; \mu_c^{(i)}, \sigma_c^{2\,(i)}\big) \, \pi_c^{(i)}}$$
Maximization Step
$$\hat{n}_j = \sum_{k=1}^{n} h_{kj}, \qquad \pi_j^{(i+1)} = \frac{\hat{n}_j}{n}$$
$$\mu_j^{(i+1)} = \frac{1}{\hat{n}_j} \sum_{k=1}^{n} h_{kj} \, x_k, \qquad \sigma_j^{2\,(i+1)} = \frac{1}{\hat{n}_j} \sum_{k=1}^{n} h_{kj} \big(x_k - \mu_j^{(i+1)}\big)^2$$
Compare to the single (non-mixture) Gaussian MLE solution worked out earlier for the Gaussian case (slide 7)! They are equivalent solutions when $h_{kj}$ is the hard indicator function which selects class-labeled data.

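The E-step and M-step updates for a one-dimensional Gaussian mixture can be sketched as below. This is a bare-bones illustration: the function names, the fixed iteration count, the initial values, and the synthetic data are all assumptions, and there is no safeguard against a variance collapsing to zero or against numerical underflow in the responsibilities.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density G(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, pi, mu, var, n_iter=50):
    x = np.asarray(x, dtype=float)
    pi, mu, var = (np.array(a, dtype=float) for a in (pi, mu, var))
    n = len(x)
    for _ in range(n_iter):
        # E-step: posterior h[k, j] that point k came from component j.
        joint = pi * gaussian_pdf(x[:, None], mu, var)       # shape (n, C)
        h = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted ML estimates, each point weighted by h[k, j].
        n_hat = h.sum(axis=0)
        pi = n_hat / n
        mu = (h * x[:, None]).sum(axis=0) / n_hat
        var = (h * (x[:, None] - mu) ** 2).sum(axis=0) / n_hat
    return pi, mu, var, h        # h is from the last E-step

# Synthetic data: two unit-variance Gaussians centered at 0 and 8.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(8.0, 1.0, 300)])
pi, mu, var, h = em_gmm_1d(x, pi=[0.5, 0.5], mu=[1.0, 6.0], var=[4.0, 4.0])
print(pi, mu, var)
```
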
Expectation-Maximization (EM)
Note that the difference between EM and k-means is that:
- In the E-step $h_{kj}$ is not hard-limited to 0 or 1. Doing so would make the M-step exactly the same as k-means.
- Plus we get estimates of the class covariances and class probabilities automatically.
k-means can be seen as a greedy version of EM:
- At each iteration, for each point we make a hard decision (the optimal MAP BDR for identity covariances and equal class priors).
- But this does not take into account the information in the points we throw away. I.e., potentially all points carry information about all (sub) classes.
Note: If the hard assignment is best, EM will learn it.
To get a feeling for EM you can use http://www-cse.ucsd.edu/users/bayrakt/java/em/

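The effect of hard-limiting can be seen on a tiny example. In the sketch below, `harden` replaces each row of posteriors with a 0/1 indicator of its largest entry, and `m_step_means` is the weighted mean update of the M-step; both names and the numbers are hypothetical. With hardened weights the update reduces to the plain per-cluster sample mean used by k-means.

```python
import numpy as np

def harden(h):
    """Replace soft posteriors by a 0/1 indicator of the most probable class."""
    hard = np.zeros_like(h)
    hard[np.arange(len(h)), h.argmax(axis=1)] = 1.0
    return hard

def m_step_means(x, h):
    """Weighted mean update: mu_j = sum_k h_kj x_k / sum_k h_kj."""
    return (h * x[:, None]).sum(axis=0) / h.sum(axis=0)

# Hypothetical 1-D points and soft posteriors for two components.
x = np.array([0.0, 1.0, 9.0, 10.0])
h = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

soft_means = m_step_means(x, h)           # every point contributes to both means
hard_means = m_step_means(x, harden(h))   # k-means style update
print(soft_means, hard_means)
```
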
END