Machine Learning for Data Science (CS 4786)
Lectures 6, 7 & 8: Ellipsoidal Clustering, Gaussian Mixture Models and General Mixture Models

The text in black outlines high-level ideas. The text in blue provides simple mathematical details to derive or get to the algorithm or method. The text in red gives mathematical details for those who are interested.

1 Motivation

The K-means algorithm looks for round clusters and cannot explicitly model the case where one cluster has fewer points than another. We would like to address this issue. We will do so by changing the dissimilarity function, first to allow ellipsoidal clusters, and further by explicitly maintaining a parameter $\pi$, called the mixture distribution, that tells us the proportion of points within each cluster.

2 Ellipsoidal Clustering

The basic idea is that each of our clusters will be explicitly modeled by an ellipsoid.

2.1 Prelude: Axis-aligned case

To this end, say the data points within a cluster are spread as in the figure below.

[Figure: an axis-aligned cloud of points around a center $r$, with variance 2 along the vertical coordinate and variance 0.5 along the horizontal coordinate.]

For this example, intuitively we would like the fitted ellipse to be a vertically standing one. How do we obtain such an ellipse?
Well, what we require is that, in terms of the dissimilarity measure, all the blue points on the outer ellipse have the same dissimilarity value. That is, we want to squish the ellipse vertically and elongate it horizontally so that it becomes circular (i.e., all blue dots are the same distance from the center). Now say we scale each coordinate by $1/\sqrt{\text{variance of the coordinate}}$. In this case, we will find that the points have variance 1 on each coordinate. To see this, let
\[
\tilde{x}_t = \left( \frac{x_t[1]}{\sqrt{\mathrm{Var}(x_1[1],\ldots,x_n[1])}},\ \frac{x_t[2]}{\sqrt{\mathrm{Var}(x_1[2],\ldots,x_n[2])}} \right).
\]
That is, each $\tilde{x}_t$ is the new point whose coordinates are those of $x_t$ scaled inversely by the standard deviation. We will notice that when the set of points is axis-aligned (i.e., standing vertically or lying horizontally), the points $\tilde{x}_1,\ldots,\tilde{x}_n$ have variance 1 on each coordinate and covariance 0 between coordinates. Hence, under $\tilde{x}_1,\ldots,\tilde{x}_n$, all the blue points that lay on the ellipse for the original set now lie on a circle. To see this mathematically, note that
\[
\mathrm{Var}(\tilde{x}_1[1],\ldots,\tilde{x}_n[1])
= \frac{1}{n}\sum_{t=1}^{n}\left(\tilde{x}_t[1] - \frac{1}{n}\sum_{s=1}^{n}\tilde{x}_s[1]\right)^2
= \frac{1}{n}\sum_{t=1}^{n}\left(\frac{x_t[1] - \frac{1}{n}\sum_{s=1}^{n}x_s[1]}{\sqrt{\mathrm{Var}(x_1[1],\ldots,x_n[1])}}\right)^2
= \frac{\mathrm{Var}(x_1[1],\ldots,x_n[1])}{\mathrm{Var}(x_1[1],\ldots,x_n[1])} = 1.
\]
Similarly, you will find that the variance of the second coordinate is 1 as well. And since we began with points distributed in an axis-aligned way, the covariance between the two coordinates is 0. Thus the new set of points is best described by a circle.

Thus we find that, to define the right ellipse for the original axis-aligned points $x_1,\ldots,x_n$, we can measure the ellipsoidal distance of a point $x$ to the center by instead measuring the usual notion of distance (Euclidean distance) between the rescaled point $\tilde{x}$ and the rescaled center $\tilde{r}$, where each coordinate is divided by the corresponding standard deviation. That is, the ellipsoidal distance is
\[
d(x, C) = \|\tilde{x} - \tilde{r}\|^2
= (\tilde{x} - \tilde{r})^\top (\tilde{x} - \tilde{r})
= (x - r)^\top
\begin{pmatrix}
\frac{1}{\mathrm{Var}(x_1[1],\ldots,x_n[1])} & 0\\[4pt]
0 & \frac{1}{\mathrm{Var}(x_1[2],\ldots,x_n[2])}
\end{pmatrix}
(x - r)
= (x - r)^\top \Sigma^{-1} (x - r),
\]
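The rescaling argument above can be checked numerically. Below is a short NumPy sketch (the synthetic data and all variable names are ours, not from the notes) that standardizes each coordinate of an axis-aligned cloud and confirms both that the rescaled coordinates have variance 1 and that Euclidean distance on the rescaled points equals the diagonal-covariance ellipsoidal distance on the originals.

```python
import numpy as np

# Hypothetical axis-aligned cluster: std 0.5 horizontally, std 2 vertically
# (matching the figure's var = 0.5 and var = 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([0.5, 2.0])

# Scale each coordinate by 1 / (its standard deviation): the rescaled
# points have variance 1 on each coordinate, so the ellipse becomes a circle.
stds = X.std(axis=0)
X_tilde = X / stds
print(X_tilde.var(axis=0))        # both entries are 1 (up to float error)

# Euclidean distance on the rescaled points equals the ellipsoidal distance
# (x - r)^T Sigma^{-1} (x - r) on the originals, with Sigma diagonal.
r = X.mean(axis=0)
Sigma_inv = np.diag(1.0 / X.var(axis=0))
x = X[0]
d_ellip = (x - r) @ Sigma_inv @ (x - r)
d_eucl = np.sum(((x - r) / stds) ** 2)
print(abs(d_ellip - d_eucl))      # ~ 0
```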
where $\Sigma$ is the (diagonal) covariance matrix. Thus we have established that, for the axis-aligned case, the dissimilarity measure is
\[
d(x, C) = (x - r)^\top \Sigma^{-1} (x - r).
\]
We will next show that even for the general case we get the same dissimilarity measure.

2.2 General Ellipsoids

Consider the general (slanted) ellipsoid shown below.

[Figure: a slanted ellipsoidal cloud of points around a center $r$.]

How do we get a slanted ellipsoidal dissimilarity for the above? The high-level picture is that, if we could somehow rotate the points so that they are axis-aligned, then we can use the axis-aligned version of the dissimilarity. How do we rotate the points? Basically, we want to rotate the points such that the new set of rotated points is axis-aligned. We can achieve this by considering the eigendecomposition
\[
\Sigma = U \Lambda U^\top,
\]
where $U$ is a rotation matrix (and hence $U^{-1} = U^\top$) and $\Lambda$ is a diagonal matrix. Now, for the given $x_1,\ldots,x_n$, consider a new, rotated bunch of points $\tilde{x}_1,\ldots,\tilde{x}_n$ where $\tilde{x}_t = U^\top x_t$. Note that the covariance matrix of $\tilde{x}_1,\ldots,\tilde{x}_n$, say $\tilde{\Sigma}$, is given by
\[
\tilde{\Sigma} = \frac{1}{n}\sum_{t=1}^{n}\left(\tilde{x}_t - \frac{1}{n}\sum_{s=1}^{n}\tilde{x}_s\right)\left(\tilde{x}_t - \frac{1}{n}\sum_{s=1}^{n}\tilde{x}_s\right)^\top
= \frac{1}{n}\sum_{t=1}^{n}\left(U^\top x_t - \frac{1}{n}\sum_{s=1}^{n}U^\top x_s\right)\left(U^\top x_t - \frac{1}{n}\sum_{s=1}^{n}U^\top x_s\right)^\top
\]
\[
= U^\top \left[\frac{1}{n}\sum_{t=1}^{n}\left(x_t - \frac{1}{n}\sum_{s=1}^{n}x_s\right)\left(x_t - \frac{1}{n}\sum_{s=1}^{n}x_s\right)^\top\right] U
= U^\top \Sigma U = U^\top U \Lambda U^\top U = \Lambda.
\]
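The rotation step can be illustrated numerically. The NumPy sketch below (synthetic data and names are ours) eigendecomposes the covariance of a slanted cloud and verifies that rotating every point by $U^\top$ yields a cloud whose covariance is the diagonal matrix $\Lambda$.

```python
import numpy as np

# Hypothetical slanted cluster: axis-aligned points rotated by 30 degrees.
rng = np.random.default_rng(1)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(2000, 2)) * np.array([0.5, 2.0])) @ R.T

# Eigendecomposition Sigma = U Lambda U^T (U orthogonal, Lambda diagonal).
Sigma = np.cov(X, rowvar=False, bias=True)
lam, U = np.linalg.eigh(Sigma)

# Rotate: x_tilde_t = U^T x_t.  The rotated cloud is axis-aligned, i.e.
# its covariance is exactly the diagonal matrix of eigenvalues.
X_tilde = X @ U                   # row t holds (U^T x_t)^T
Sigma_tilde = np.cov(X_tilde, rowvar=False, bias=True)
print(np.round(Sigma_tilde, 6))   # off-diagonal ~ 0, diagonal = eigenvalues
```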
Thus we see that the rotated points $\tilde{x}_t = U^\top x_t$ are now axis-aligned with covariance $\Lambda$. Hence, the dissimilarity measure for the general case can be set as
\[
d(x, C) = (\tilde{x} - \tilde{r})^\top \Lambda^{-1} (\tilde{x} - \tilde{r})
= (U^\top x - U^\top r)^\top \Lambda^{-1} (U^\top x - U^\top r)
= (x - r)^\top U \Lambda^{-1} U^\top (x - r)
= (x - r)^\top \Sigma^{-1} (x - r).
\]
Thus, even for the general case,
\[
d(x, C) = (x - r)^\top \Sigma^{-1} (x - r)
\]
defines the right ellipsoidal dissimilarity measure.

As for the algorithm, it has the same flavor as the K-means algorithm. It first randomly initializes the parameters $r_1,\ldots,r_K$ and $\Sigma_1,\ldots,\Sigma_K$. Next, each point is assigned to the closest cluster under the new ellipsoidal dissimilarity measure
\[
d(x, C_k) = (x - r_k)^\top \Sigma_k^{-1} (x - r_k).
\]
Then, within the same iteration, for each cluster we recompute the mean $r_k$ and covariance $\Sigma_k$ from the points assigned to it. We repeat these two steps iteratively, as shown in the pseudocode in the lecture slides.

2.3 Modeling the Mixture Distribution

Say we had two clusters drawn from normal distributions with the same covariance structure, with means separated by some distance. Now say we have a point equidistant from the means of the two clusters. If the numbers of points drawn from the two Gaussians were exactly the same, then we would of course have to conclude that this equidistant point could belong to either cluster with the same probability. However, now say you were informed that one of the clusters has many times the number of points of the other. You would then expect the equidistant point to be proportionally more likely to belong to the larger cluster. However, a cluster-assignment step that only looks at dissimilarity does not capture this information. Hence, to fix this, we maintain a mixture-distribution parameter that tracks the proportion of points in each cluster at every iteration, and we penalize more likely clusters less. The penalty added to the dissimilarity function is $-\log(\pi_k)$ for the $k$-th cluster. The algorithm is given in the lecture slides.
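The two-step algorithm above, including the $-\log(\pi_k)$ penalty from Section 2.3, can be sketched as follows. This is our own illustrative code, not the course pseudocode: the small ridge term added to each covariance, the explicit initial means, and the iteration count are our choices for numerical stability and reproducibility.

```python
import numpy as np

def ellipsoidal_kmeans(X, K, init_means, n_iter=20):
    """Hard ellipsoidal clustering with a mixture penalty (a sketch).

    Assignment uses d(x, C_k) = (x - r_k)^T Sigma_k^{-1} (x - r_k) - log(pi_k).
    """
    n, d = X.shape
    r = np.asarray(init_means, dtype=float).copy()
    Sigma = np.stack([np.eye(d)] * K)          # start with unit covariances
    pi = np.full(K, 1.0 / K)                   # start with equal proportions
    for _ in range(n_iter):
        # Step 1: assign each point to the closest cluster under the
        # penalized ellipsoidal dissimilarity.
        dist = np.empty((n, K))
        for k in range(K):
            diff = X - r[k]
            inv = np.linalg.inv(Sigma[k])
            dist[:, k] = np.einsum('ij,jk,ik->i', diff, inv, diff) - np.log(pi[k])
        z = dist.argmin(axis=1)
        # Step 2: recompute mean, covariance, and proportion per cluster.
        for k in range(K):
            members = X[z == k]
            if len(members) == 0:
                continue                       # keep old parameters if empty
            r[k] = members.mean(axis=0)
            Sigma[k] = np.cov(members, rowvar=False, bias=True) + 1e-6 * np.eye(d)
            pi[k] = len(members) / n
    return z, r, Sigma, pi

# Hypothetical data: a big blob (200 points) and a small one (50 points).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(200, 2)) + [6.0, 0.0],
               rng.normal(size=(50, 2)) - [6.0, 0.0]])
z, r, Sigma, pi = ellipsoidal_kmeans(X, 2, init_means=[[5.0, 0.0], [-5.0, 0.0]])
print(np.round(np.sort(pi), 2))    # proportions roughly [0.2, 0.8]
```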
3 Probabilistic Interpretation: Hard Gaussian Mixture Models

One can obtain a probabilistic interpretation of the algorithm as follows: the probability of a point belonging to a particular cluster is proportional to the probability $\pi_k$ of picking cluster $k$, times the likelihood of the point belonging to cluster $k$. Notice that $\pi_k = \exp(-(-\log(\pi_k)))$; similarly, we can set the likelihood
\[
p(x \mid C_k) \propto \exp(-\mathrm{Dissimilarity}(x, C_k)).
\]
Notice that this ensures that the density $p$ is non-negative. To ensure that $p$ is a valid density, it also needs to integrate to 1. Hence
\[
p(x \mid C_k) = \frac{1}{Z} \exp(-\mathrm{Dissimilarity}(x, C_k)),
\]
and one can calculate the normalizing constant $Z$ so that $p$ integrates to 1. For the probabilistic interpretation of the ellipsoidal dissimilarity, take
\[
\mathrm{Dissimilarity}(x, C_k) = \tfrac{1}{2} (x - r_k)^\top \Sigma_k^{-1} (x - r_k).
\]
The factor $1/2$ just makes calculations easier. The likelihood is given by
\[
p(x; r_k, \Sigma_k) \propto \exp\left(-\tfrac{1}{2}(x - r_k)^\top \Sigma_k^{-1} (x - r_k)\right).
\]
But note that $p$ is proportional to the multivariate Gaussian density, and
\[
\int \exp\left(-\tfrac{1}{2}(x - r_k)^\top \Sigma_k^{-1} (x - r_k)\right) dx = (2\pi)^{d/2} \sqrt{\det(\Sigma_k)},
\]
and hence the density function for the probabilistic interpretation can be obtained by setting
\[
p(x; r_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma_k)}} \exp\left(-\tfrac{1}{2}(x - r_k)^\top \Sigma_k^{-1} (x - r_k)\right),
\]
which is the multivariate Gaussian distribution. Under this probabilistic interpretation, the hard Gaussian mixture model algorithm can be found in the lecture slides. Specifically, we use hard cluster assignment: a point is assigned to the cluster to which it has the maximum probability of belonging. This probability is proportional to $\pi_k$ (the probability of picking cluster $k$) times the likelihood of the point under cluster $k$.

4 (Soft) Gaussian Mixture Models

One issue with hard clustering is that when we begin, we randomly guess the parameters and recompute them over multiple iterations, hoping to converge to the right ones. Say there is a point that, on iteration 1, has probability 0.51 of belonging to cluster 1 and probability 0.49 of belonging to cluster 2. The point is close to being equally likely to belong to each of the two clusters, and this computation is based on randomly initialized parameters. Based on this, assigning the point to cluster 1 only, and not to cluster 2, seems too harsh. Soft assignment takes care of this issue by replacing the cluster-assignment step of each iteration with a step that updates, for each point, the probability that it belongs to each of the $K$ clusters. That is, every point belongs to every cluster with some probability, given by the variable $Q$. Specifically, at iteration $m$, $Q^{(m)}_t[k]$ is the probability, based on the parameters at iteration $m$, that point $x_t$ belongs to cluster $k$. Now, when we compute the means and covariances in step 2 of each iteration, for every cluster we compute the weighted means, covariances, and $\pi_k$, as shown in the lecture slides.
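A single soft-assignment iteration can be sketched as follows. This is our own minimal NumPy version, not the course code: the array `Q` plays the role of $Q^{(m)}$, and the two-blob synthetic data are made up for illustration.

```python
import numpy as np

def gaussian_pdf(X, r, Sigma):
    """Multivariate Gaussian density N(x; r, Sigma), evaluated per row of X."""
    d = X.shape[1]
    diff = X - r
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def soft_gmm_step(X, r, Sigma, pi):
    """One soft iteration: responsibilities Q, then weighted parameter updates."""
    n, d = X.shape
    K = len(pi)
    # Soft assignment: Q[t, k] proportional to pi_k * N(x_t; r_k, Sigma_k),
    # normalized so each row sums to 1.
    Q = np.stack([pi[k] * gaussian_pdf(X, r[k], Sigma[k]) for k in range(K)], axis=1)
    Q /= Q.sum(axis=1, keepdims=True)
    # Weighted updates: every point contributes to every cluster's mean,
    # covariance, and proportion, with weight Q[t, k].
    Nk = Q.sum(axis=0)
    r_new = (Q.T @ X) / Nk[:, None]
    Sigma_new = np.empty((K, d, d))
    for k in range(K):
        diff = X - r_new[k]
        Sigma_new[k] = (Q[:, k, None] * diff).T @ diff / Nk[k]
    pi_new = Nk / n
    return Q, r_new, Sigma_new, pi_new

# Hypothetical data: two well-separated blobs of 100 points each.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(100, 2)) + 5, rng.normal(size=(100, 2)) - 5])
r = np.array([[4.0, 4.0], [-4.0, -4.0]])
Sigma = np.stack([np.eye(2)] * 2)
pi = np.array([0.5, 0.5])
Q, r2, S2, pi2 = soft_gmm_step(X, r, Sigma, pi)
print(Q[0])   # first point: almost all of its probability on cluster 0
```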
When we get to probabilistic models later in the course, we will come back to mixture models and see how this all makes sense.