Non-linear Feature Extraction by the Coordination of Mixture Models


J.J. Verbeek    N. Vlassis    B.J.A. Kröse

Intelligent Autonomous Systems Group, Informatics Institute, Faculty of Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands

Keywords: Canonical Correlation Analysis, Principal Component Analysis, Self-organizing Maps.

Abstract

We present a method for non-linear data projection that offers non-linear versions of Principal Component Analysis and Canonical Correlation Analysis. The data is accessed through a probabilistic mixture model only; therefore any mixture model for any type of data can be plugged in. Gaussian mixtures are one example, but mixtures of Bernoullis to model discrete data might be used as well. The algorithm minimizes an objective function that exhibits one global optimum, which can be found by computing the eigenvectors of some matrix. Experimental results on toy data and real data are provided.

1 Introduction

Much research has been done in the field of unsupervised non-linear feature extraction to obtain a low dimensional description of high dimensional data that preserves important properties of the data. In this paper we are interested in data that lies on or close to a low dimensional manifold embedded (possibly non-linearly) in a Euclidean space of much higher dimension. A typical example of such data is a set of images of an object under different conditions (e.g. pose and lighting) obtained with a high resolution camera. A low dimensional representation may be desirable for several reasons, like compression for storage and communication, visualization of high dimensional data, and preprocessing for further data analysis or prediction tasks. An example is given in Fig. 1: we have data in $\mathbb{R}^3$ which lies on a two dimensional manifold (top plot). We want to recover the structure of the data manifold, so that we can unroll the manifold and work with the data expressed in the latent coordinates, i.e. coordinates on the manifold (second plot).

Here, we consider a method to integrate several local feature extractors into a single global representation. The idea is to learn a mixture model on the data where each mixture component acts as a local feature extractor; a mixture of Probabilistic Principal Component Analyzers [1] is a typical example. After this mixture has been learned, the local feature extractors are coordinated by finding for each of them a suitable linear map into a single global low-dimensional coordinate system. The collective of the local feature extractors, together with the linear maps into the global latent space, provides a global non-linear projection from the data space into the latent space.

This differs from other non-linear feature extraction methods like the Generative Topographic Map (GTM) [2], Locally Linear GTM [3], Kohonen's Self-Organizing Map [4] and a Variational Expectation-Maximization algorithm similar to SOM [5]. These algorithms are actually quite opposite in nature to the method presented here: they start with a given configuration of mixture components in the latent space and adapt the mixture configuration in the data space to optimize some objective function. Here we start with a given and fixed mixture in the data space and find an optimal configuration in the latent space. The advantage of the latter method is that it is not prone to local optima in the objective function (of course, fitting the mixture model on the data might itself be susceptible to local optima). Moreover, the solution for a d-dimensional latent space can be found by computing the eigenvectors of a matrix corresponding to the 2nd smallest up to the (d + 1)st smallest eigenvalues.

The size of this matrix is given by the total number of locally extracted features plus the number of mixture components, i.e. if all k components extract d features then the size is k(d + 1).

The work presented in this paper can be categorized together with other recent algorithms for unsupervised non-linear feature extraction like IsoMap [6], Locally Linear Embedding [7], Kernel PCA [8], Kernel CCA [9] and Laplacian Eigenmaps [10]. All these algorithms perform non-linear feature extraction by minimizing an objective function that exhibits a limited number of stable points, which can be characterized as eigenvectors of some matrix. These algorithms are generally simple to implement, since one only needs to construct some matrix; the eigenvectors are then found by standard methods.

In this paper we build upon recent work of Brand [11]. In the next section we review the work of Brand, derive the objective function by a slight modification of the free energy used in [11], and discuss how we can find its stable points. In Section 3 we present a method to obtain better results by using several mixtures for the data in parallel. Next, we show in Section 4 that finding the locally valid linear maps between the data space and the latent space can be regarded as a weighted form of Canonical Correlation Analysis (CCA), and we show how we can generalize the work of Brand to perform non-linear CCA. A general discussion and conclusions can be found in Section 5.

2 Linear Embedding of Local Feature Extractors

In this section we describe the work of Brand [11], in a notation that allows us to draw links with CCA in Section 4. Consider a given data set $X = \{x_1, \ldots, x_N\}$ and a collection of k local feature extractors of the data, $f_s(x)$ for $s = 1, \ldots, k$. Thus, $f_s(x)$ is a vector containing the features extracted from x by feature extractor s; each feature extractor gives zero or more (linear) features of the data. With each feature extractor, we associate a mixture component in a probabilistic k-component mixture model. We write the density on high dimensional data items x as:

$p(x) = \sum_{s=1}^{k} p(s)\, p(x \mid s)$,    (1)

where s is a variable ranging over the mixture components and $p(s)$ is the a priori probability of component s (sometimes also termed the mixing weight of component s).

Next, we consider the relationship between the given representation of the data (denoted by x) and the representation of the data in the latent space, which we have to find. Throughout, we will use g to denote latent coordinates for data items; the g is for "global" latent coordinate, as opposed to locally extracted features, which we will denote by z. For the unobserved latent coordinate g corresponding to a data point $x_n$ and conditioned on s, we assume the density:

$p(g \mid x_n, s) = \mathcal{N}(g;\ \kappa_s + A_s f_s(x_n),\ \sigma^2 I)$,    (2)

where $\mathcal{N}(g; \mu, \Sigma)$ is a Gaussian distribution on g with mean µ and covariance Σ. The mean, $g_n^s$, of $p(g \mid x_n, s)$ is the sum of the component mean $\kappa_s$ in the latent space and a linear transformation, implemented by $A_s$, of $f_s(x_n)$. From now on we will use homogeneous coordinates and write $L_s = [A_s\ \ \kappa_s]$ and $z_n^s = [f_s(x_n)^\top\ \ 1]^\top$, and thus $g_n^s = L_s z_n^s$. Consider the marginal (over s) distribution on latent coordinates given data:

$p(g \mid x) = \sum_s p(s, g \mid x) = \sum_s p(s \mid x)\, p(g \mid x, s)$.    (3)

Given a fixed set of local feature extractors and a corresponding mixture model, we are interested in finding linear maps $L_s$ that give rise to "nice" projections of the data in the latent space. By nice, we mean that components that have a high posterior probability of having generated $x_n$ (i.e. $p(s \mid x_n)$ is large) should give similar projections of $x_n$ in the latent space.
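To make these ingredients concrete, the following is a minimal numpy/scipy sketch computing the posteriors $q_{ns} = p(s \mid x_n)$ and the per-component latent predictions $g_n^s$ for a Gaussian mixture whose component s carries local PCA directions. All names (V, A, kappa) and the assumption $f_s(x) = V_s^\top(x - \mu_s)$ are our own illustrative choices, not code from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors_and_projections(X, pi, mu, cov, V, A, kappa):
    """Compute q_ns = p(s | x_n) and g_n^s = kappa_s + A_s f_s(x_n), eq. (2),
    with local PCA features f_s(x) = V[s]^T (x - mu[s]).
    Shapes: X (N,D); pi (k,); mu (k,D); cov (k,D,D); V[s] (D,d_s);
    A[s] (d,d_s); kappa (k,d)."""
    k = len(pi)
    # responsibilities q_ns, computed in the log domain for stability
    log_p = np.column_stack([np.log(pi[s]) + multivariate_normal.logpdf(X, mu[s], cov[s])
                             for s in range(k)])
    q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # per-component latent means g_n^s; result has shape (N, k, d)
    g = np.stack([kappa[s] + (X - mu[s]) @ V[s] @ A[s].T for s in range(k)], axis=1)
    return q, g
```

The maps A[s] and offsets kappa[s] are exactly the unknowns $L_s = [A_s\ \kappa_s]$ that the remainder of this section solves for; the sketch only shows where they act.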
So we would like the $p(g \mid x_n, s)$ to look similar for the components with large posterior. Since the marginal $p(g \mid x_n)$ is a mixture of Gaussians, we can measure the similarity between the predictions by looking at how much the mixture resembles a single Gaussian: if the predictions are all the same, the mixture is a single Gaussian. We are now ready to formally define the objective function $\Phi(\{L_1, \ldots, L_k\})$, which sums the data log-likelihood and a Kullback-Leibler divergence D that measures how unimodal the distribution $p(g \mid x_n) = \sum_s p(s \mid x_n)\, p(g \mid x_n, s)$ is:

$\Phi = \sum_{n=1}^{N} \left[ \log p(x_n) - D\big(Q_n(g, s)\ \|\ p(g, s \mid x_n)\big) \right]$,    (4)

where $Q_n$ distributes independently over g and s:

$Q_n(g, s) = q_{ns}\, Q_n(g) = q_{ns}\, \mathcal{N}(g;\ g_n, \Sigma_n)$.    (5)

This objective function is similar to the one used in [12, 13]; the difference here is that the covariance matrix of $p(g \mid x, s)$ does not depend on the linear maps $L_s$. If we fix the data space model, turning the data log-likelihood into a constant, and set $q_{ns} = p(s \mid x_n)$, then the objective sums for each data point $x_n$ the Kullback-Leibler divergence between $Q_n(g)$ and $p(g \mid x_n, s)$, weighted by the posterior $p(s \mid x_n)$:

$\Phi = \sum_{n,s} q_{ns}\, D\big(Q_n(g)\ \|\ p(g \mid x_n, s)\big)$.    (6)

Minimizing the objective Φ with respect to $g_n$ and $\Sigma_n$, it is easy to derive that we get:

$g_n = \sum_s q_{ns}\, g_n^s$  and  $\Sigma_n = \sigma^2 I$,    (7)

where I denotes the identity matrix. Skipping some additive and multiplicative constants with respect to the linear maps $L_s$, the objective Φ then simplifies to:

$\Phi = \sum_{n,s} q_{ns}\, \| g_n - g_n^s \|^2$,    (8)

with $g_n^s = L_s z_n^s$ as before. Note that we can rewrite the objective as:

$\Phi = \tfrac{1}{2} \sum_{n,s,t} q_{ns}\, q_{nt}\, \| g_n^t - g_n^s \|^2$.    (9)

The objective is a quadratic function of the linear maps $L_s$ [11]. At the expense of some extra notation, we obtain a clearer form of the objective as a function of the linear maps. Let:

$u_n^\top = [\,q_{n1} z_n^{1\top} \ \cdots\ q_{nk} z_n^{k\top}\,]$,    (10)

$U = [\,u_1 \ \cdots\ u_N\,]^\top$,    (11)

$L = [\,L_1 \ \cdots\ L_k\,]^\top$.    (12)

Note that from (7), (10) and (12) we have $g_n^\top = u_n^\top L$. The expected projection coordinates can thus be computed as:

$G = [\,g_1 \ \cdots\ g_N\,]^\top = U L$.    (13)

We define the block-diagonal matrix D as:

$D = \mathrm{blockdiag}(D_1, \ldots, D_k)$,    (14)

where $D_s = \sum_n q_{ns}\, z_n^s z_n^{s\top}$. The objective then can be written as:

$\Phi = \mathrm{Tr}\{ L^\top (D - U^\top U) L \}$.    (15)

The objective function is invariant to translation and rotation of the global latent space, as can easily be seen from (8); furthermore, re-scaling the latent space changes the objective monotonically. To make solutions unique with respect to translation, rotation and scaling, we impose two constraints:

(trans.):  $\bar g = \frac{1}{N} \sum_n g_n = 0$,    (16)

(rot. + sc.):  $\Sigma_g = \frac{1}{N} \sum_n (g_n - \bar g)(g_n - \bar g)^\top = \frac{1}{N}\, L^\top U^\top (I - \mathbf{1}/N)\, U L = I$,    (17)

where we used $\mathbf{1}$ to denote the matrix with ones everywhere. Solutions are found by differentiation using Lagrange multipliers, and the columns v of L are found to be generalized eigenvectors:

$(D - U^\top U)\, v = \lambda\, U^\top (I - \mathbf{1}/N)\, U v$.    (18)

Since the solutions are zero mean, it follows that $\mathbf{1} U v = 0$, and hence that:

$(D - U^\top U)\, v = \lambda\, U^\top U v$    (19)

$\Leftrightarrow\ \ D v = (\lambda + 1)\, U^\top U v$.    (20)

The value of the objective function is given by:

$\Phi = \sum_i v_i^\top (D - U^\top U)\, v_i$    (21)

$= \sum_i \lambda_i\, v_i^\top U^\top U v_i$    (22)

$= (N - 1) \sum_i \lambda_i\, \mathrm{var}(g_i)$,    (23)

where $\mathrm{var}(g_i)$ denotes the variance in the i-th coordinate of the projected points. So rescaling to get unit variance makes the objective equal to $(N - 1) \sum_i \lambda_i$. The smallest eigenvalue is always zero, corresponding to a projection that collapses all projections into the point κ; this projection has $A_s = 0$ and $\kappa_s = \kappa$ for $s = 1, \ldots, k$. Such an embedding tells us nothing about the data; therefore we need the eigenvectors corresponding to the second up to the (d + 1)st smallest eigenvalues to obtain the best embedding in d dimensions. Note that there is no problem if different feature extractors extract different numbers of features.

In Fig. 1 we give an illustration of the above. The top two images show the original data presented to the algorithm and the 2-dimensional representation found by the algorithm. The blue and yellow sticks indicate the local feature extractors, which are centered on the red dots. The two lower scatter plots compare the coordinates on the data generating manifold with the discovered latent coordinates; for both coordinates there is high correlation with the true latent coordinates.
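The entire coordination step of this section reduces to a few lines of linear algebra. Below is a hedged numpy/scipy sketch that builds U and D from the posteriors and homogeneous local coordinates, solves the generalized eigenproblem (20), and applies the constraints (16) and (17); the small ridge term and the per-coordinate rescaling are our own numerical choices, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import block_diag, eigh

def coordinate_charts(q, Z, d):
    """Align local charts into a d-dimensional latent space.
    q: (N, k) posteriors q_ns; Z: list of k arrays (N, d_s + 1) holding the
    homogeneous coordinates z_n^s = [f_s(x_n); 1]; d: latent dimension."""
    k = q.shape[1]
    U = np.hstack([q[:, [s]] * Z[s] for s in range(k)])               # rows u_n, eqs. (10)-(11)
    D = block_diag(*[(q[:, [s]] * Z[s]).T @ Z[s] for s in range(k)])  # eq. (14)
    B = U.T @ U
    # solve D v = (lambda + 1) U^T U v, eq. (20); eigenvalues come out ascending.
    # The tiny ridge keeps B positive definite for the solver.
    lam, V = eigh(D, B + 1e-9 * np.eye(len(B)))
    L = V[:, 1:d + 1]        # drop the trivial lambda = 0 (collapsed) solution
    G = U @ L                # latent coordinates, eq. (13)
    G -= G.mean(axis=0)      # translation constraint (16)
    G /= G.std(axis=0)       # rescale each coordinate to unit variance
    return G, L
```

Called with the q and per-component features of the previous sketch (each stacked with a column of ones), this would produce embeddings like the one shown in Fig. 1.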

3 Incorporating more points in the objective function

In this section we discuss how using multiple mixtures for the data can help to obtain more stable results; the next section then discusses how this method can be exploited to define a non-linear form of Canonical Correlation Analysis.

Sometimes the method of the previous section collapses several linear feature extractors in some part of the underlying data manifold onto a point or a line, whereas the Automatic Alignment (AA) method [14] was less sensitive to such phenomena. AA is based on the Locally Linear Embedding algorithm [7] and finds linear maps to integrate a collection of locally valid feature extractors in a manner similar to what we described in the previous section. In AA the objective function is:

$\Phi_{AA} = \sum_n \big\| g_n - \sum_m W_{nm}\, g_m \big\|^2$,    (24)

where again $g_n = \sum_s q_{ns}\, g_n^s$. The weights, collected in the weight matrix W, are found by first determining for every data point $x_n$ its nearest neighbors $\mathcal{N}(n)$ in the data space, and then using the W that minimizes:

$\sum_n \big\| x_n - \sum_{m \in \mathcal{N}(n)} W_{nm}\, x_m \big\|^2$,    (25)

under the constraint that $\sum_m W_{nm} = 1$ and that only nearest neighbors can have non-zero weights, i.e. $W_{nm} = 0$ for $m \notin \mathcal{N}(n)$.

Figure 1: From top to bottom: (1) Data in $\mathbb{R}^3$ with charts indicated by the blue and yellow sticks. (2) Data representation in $\mathbb{R}^2$. (3) First original coordinate on manifold against first found latent coordinate. (4) Second original coordinate on manifold against second found latent coordinate.

The objective function (8) of the previous section is in fact only based on points for which the posterior $q_{ns}$ assigns non-negligible mass to at least two mixture components. This is seen directly from (9), since if one of the $q_{ns}$ takes all mass, all products of the posteriors are zero except for one term, in which the squared distance is zero. In other words, the objective only focuses on data in the overlap of at least two mixture components. So only points for which the entropy of the posterior is reasonably high contribute to the objective function; the other data, for which the entropy of the posterior is low, is completely neglected in the objective function and thus in the coordination of the local feature extractors. In practice, often the bulk of the data has a low entropy posterior and hardly plays a role in the objective function.

To make more use of the available data we should have more entropy in the posteriors. One way to do that is to smooth the posteriors by taking the posterior closest, in KL-divergence sense, to the old posterior that has a certain entropy, yielding $q_{ns} \propto q_{ns}^{\alpha}$.
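A small sketch of this smoothing follows; the bisection search for α is our own choice, since the paper does not specify how the exponent is found, and it assumes the target entropy does not exceed that of the uniform distribution.

```python
import numpy as np

def smooth_posterior(q, target_entropy, tol=1e-8):
    """Flatten a posterior q (nonnegative, summing to one) to q**alpha,
    renormalized, with alpha in [0, 1] chosen by bisection so that the
    entropy (in nats) reaches target_entropy."""
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    if entropy(q) >= target_entropy:
        return q                  # already flat enough
    lo, hi = 0.0, 1.0             # alpha = 0: uniform; alpha = 1: unchanged
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        p = q ** alpha
        p /= p.sum()
        if entropy(p) > target_entropy:
            lo = alpha            # entropy still above target: alpha may grow
        else:
            hi = alpha
    p = q ** lo                   # largest alpha found with enough entropy
    return p / p.sum()
```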

Another possibility is to smooth such that the average entropy has a certain value. However, determining the right amount of entropy in the posteriors seems a rather problematic question.

The problem can also be circumvented by using several mixtures, say two, in parallel. The mixtures are trained independently, and the final posterior for a specific mixture component from one of the two mixtures is defined as half of the posterior within the mixture from which this component originates. So the sum of the posteriors on the components originating from the first mixture equals 1/2, and similarly for the second mixture. In this manner we always have an entropy of at least one bit in the posteriors, and hence all points make a significant contribution to the objective function. Note that we are using just a heuristic here; more principled ways to learn a mixture with high entropy in the posteriors could be devised, but do not seem essential. Only in the case that the two independently learned mixtures are the same have we gained nothing. This, however, is not likely to be the case for mixture models fitted by EM algorithms, which find local optima of the log-likelihood function: we initialize the EM mixture learning algorithm at different values, and in this manner we (probably) find mixtures that are different local optima of the data log-likelihood function. We can think of this method as covering the data manifold not with one set of tiles, but with two coverings of the manifold; in this way every tile overlaps with several other tiles over its complete extent.

We found in experiments that using two mixtures instead of one gives great improvement in the obtained results. Moreover, we avoid the computation of the weight matrix W needed by AA: the time needed to compute the weight matrix is in principle quadratic in the number of data points, and it can be problematic in non-Euclidean spaces. By using several mixtures in parallel we avoid the quadratic cost and have a method that is based purely on mixture models and that can be readily applied to any type of mixture model.

In Fig. 2 we give some examples of data embeddings obtained on a data set of 1965 pictures of a face (these images can be obtained from the web page of Sam Roweis). First the images were projected from the original 28 × 20 pixel coordinates to 20 dimensions with PCA. The projection from 20 to 2 dimensions was done by fitting a 20 component mixture of 2-dimensional probabilistic PCA [1] and aligning the components with the method described in the previous section. We show the latent coordinates assigned to the pictures as dots; some pictures are shown next to their latent coordinates. The two bottom figures, where we used two parallel mixtures, show a nice separation of the smiling faces and the angry faces; furthermore, the variations in gaze direction are nicely identified. The two top images, where we used single mixtures with respectively ten and twenty mixture components, do not show the separation between the two moods.

4 Non-linear Canonical Correlation Analysis

In Canonical Correlation Analysis (CCA) [15, 16] two zero-mean sets of points are given: $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^p$ and $Y = \{y_1, \ldots, y_N\} \subset \mathbb{R}^q$. The aim is to find linear maps a and b that map members of X and Y, respectively, onto the real line, such that the correlation between the linearly transformed variables is maximized. This is easily shown to be equivalent to minimizing:

$E = \sum_n [\, a x_n - b y_n \,]^2$    (26)

under the constraint that $a \left[\sum_n x_n x_n^\top\right] a^\top + b \left[\sum_n y_n y_n^\top\right] b^\top = 1$. Several generalizations of CCA are possible.
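For reference, this classical linear case has a closed-form solution via whitening and a singular value decomposition; the following is a textbook numpy sketch (our own implementation, not code from the paper), already written for d canonical pairs.

```python
import numpy as np

def linear_cca(X, Y, d, reg=1e-8):
    """Classical CCA: maps A, B such that corr(X A, Y B) is maximized,
    with unit-covariance canonical variates. reg is a small ridge for
    numerical stability (our addition)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # whiten both views, then take the SVD of the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    A = Wx.T @ U[:, :d]      # columns map X to the canonical variates
    B = Wy.T @ Vt[:d].T      # columns map Y to the canonical variates
    return A, B, s[:d]       # s[:d] holds the canonical correlations
```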
Generalizing the above such that the sets do not need to be zero mean, and allowing a translation as well, gives the problem of minimizing:

$E = \sum_n [\,(a x_n + t_a) - (b y_n + t_b)\,]^2$,    (27)

again under the constraint that the sum of the variances of the projections of X and Y equals some fixed constant. We can also generalize by mapping to $\mathbb{R}^d$ instead of the real line, and then requiring the covariance matrix of the projections to be the identity. (If the linear maps are constrained to be orthonormal matrices, i.e. rotations plus reflections, this is known as the Procrustes rotation [17].) CCA can also be readily extended to take into account more than two point sets, cf. [9].
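The Procrustes rotation mentioned above has a well-known closed-form solution via the SVD; for completeness, a small sketch (standard textbook material, not from the paper), assuming the two point sets are already centered.

```python
import numpy as np

def procrustes_rotation(Gx, Gy):
    """Orthogonal Procrustes: the orthonormal R minimizing ||Gx @ R - Gy||_F
    for centered (N, d) point sets Gx and Gy."""
    U, _, Vt = np.linalg.svd(Gx.T @ Gy)
    return U @ Vt
```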

In the generalized CCA with multiple point sets, allowing translations and mapping to $\mathbb{R}^d$, we minimize the squared distance between all pairs of projections, under the constraint that each set of projected points has unit variance. We denote the projection of the n-th point in the s-th point set as $g_n^s$, and thus minimize the error function:

$\Phi_{CCA} = \frac{1}{2k^2} \sum_{s=1}^{k} \sum_{t=1}^{k} \sum_n \| g_n^s - g_n^t \|^2$    (28)

$= \frac{1}{k} \sum_{s=1}^{k} \sum_n \| g_n^s - g_n \|^2$,    (29)

where $g_n = \frac{1}{k} \sum_s g_n^s$, and the constraint is that the projections $g_n$ have unit covariance. Note that the objective Φ in equation (8) coincides with $\Phi_{CCA}$ in (29) if we set $q_{ns} = 1/k$. The different constraints imposed upon the optimization by CCA and by our objective of the previous sections turn out to be equivalent, since the stable points under both constraints coincide. We can regard the features extracted by each local feature extractor as a point set; hence, we can interpret the method of Section 2 as a weighted form of CCA on multiple point sets that allows translations.

Figure 2: From top to bottom: (1) One mixture of 10 components. (2) One mixture of 20 components. (3) Two parallel mixtures of 10 components each.

Viewing the coordination of local feature extractors as a form of CCA suggests using the coordination technique for non-linear CCA. This is achieved quite easily, without modifying the objective function (8). We consider different point sets, each having a mixture of locally valid linear projections into the global (in the sense of shared by all mixture components and point sets) latent space. We minimize the weighted sum of the squared distances between all pairs of projections, i.e. we have pairs of projections due to the same point set and also pairs that combine projections from different point sets. As with the use of parallel mixtures, the weight $q_{ns}$ associated with the s-th projection of observation n is given by the posterior of the mixture component in the corresponding mixture (e.g. the posterior $p(s \mid x_n)$ if projection s is due to a component in the mixture for X, and $p(s \mid y_n)$ if s refers to a component in the mixture fitted to Y), divided by the number of point sets.

It is now clear that the use of parallel mixtures in the previous section was merely a form of non-linear CCA where both point sets are the same. We again define $g_n = \sum_s q_{ns}\, g_n^s$, where s now ranges over all mixture components in all mixtures fitted on the C different point sets. We use $q_{nr}^c$ to denote the posterior on component r for observation n in point set c, where the $q_{nr}^c$ are rescaled with a factor C such that all $q_{nr}^c$ for one point set c sum to one. Furthermore, we define $g_n^c = \sum_r q_{nr}^c\, g_n^{cr}$ as the average projection due to point set c. The objective can then be rewritten in different forms:

$\Phi = \frac{1}{2} \sum_n \sum_{s,t} q_{ns}\, q_{nt}\, \| g_n^s - g_n^t \|^2$    (30)

$= \sum_n \sum_s q_{ns}\, \| g_n^s - g_n \|^2$    (31)

$= \frac{1}{C} \sum_c \sum_{n,s} q_{ns}^c\, \| g_n - g_n^{cs} \|^2$    (32)

$= \frac{1}{C} \sum_c \sum_n \| g_n - g_n^c \|^2 \ +\ \frac{1}{C} \sum_c \sum_{n,s} q_{ns}^c\, \| g_n^c - g_n^{cs} \|^2$.    (33)

Observe how in (33) the objective sums both a between-point-set consistency of the projections (first summand) and a within-point-set consistency of the projections (second summand).
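The between/within decomposition in (33) is easy to verify numerically; below is a small self-contained check on random projections and per-set posteriors of our own construction (shapes and names are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, k, d = 5, 2, 3, 2                       # points, point sets, components, dims
g = rng.normal(size=(N, C, k, d))             # projections g_n^{cs}
q = rng.random((N, C, k))
q /= q.sum(axis=2, keepdims=True)             # q_ns^c sums to one within a set

gc = np.einsum('ncs,ncsd->ncd', q, g)         # per-set averages g_n^c
gn = gc.mean(axis=1)                          # overall averages g_n
lhs = np.einsum('ncs,ncsd->', q, (g - gn[:, None, None]) ** 2) / C       # eq. (32)
between = ((gn[:, None] - gc) ** 2).sum() / C                            # eq. (33), 1st term
within = np.einsum('ncs,ncsd->', q, (g - gc[:, :, None]) ** 2) / C       # eq. (33), 2nd term
assert np.isclose(lhs, between + within)
```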

As an illustrative toy application of the non-linear CCA we took two point sets in $\mathbb{R}^2$ of 600 points each. The first point set was generated on an S-shaped curve, the second point set along a circle segment; to both point sets we added Gaussian noise. We learned a k = 10 component mixture model on both point sets, illustrated in the top plots of Fig. 3. In the bottom plot the discovered latent space representation (vertical axis) is plotted against the coordinate on the generating curve (horizontal axis); the relationship is monotonic and roughly linear.

Figure 3: (Top) Two point sets and the fitted mixture models. Local charts are indicated by the blue sticks. (Bottom) Scatter plot of the found latent coordinates (vertical) against the coordinate along the generating curve (horizontal).
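To reproduce such an experiment, the inputs of the alignment eigenproblem of Section 2 only need to be assembled from the per-set posteriors and local features. A sketch under our own naming, reusing the coordinate_charts solver sketched in Section 2:

```python
import numpy as np

def nonlinear_cca_inputs(posteriors, features):
    """Assemble alignment inputs for non-linear CCA over C point sets.
    posteriors: list of C arrays (N, k_c), rows summing to one;
    features: list of C lists with one (N, d_s) local-feature array per
    component. Weights are divided by C, as in the text."""
    C = len(posteriors)
    q = np.hstack(posteriors) / C                 # q_ns over all sets' components
    Z = [np.hstack([f, np.ones((len(f), 1))])     # homogeneous z_n^s = [f; 1]
         for fs in features for f in fs]
    return q, Z

# e.g., with posteriors qx, qy and per-component features fx_list, fy_list of
# the two fitted mixtures (hypothetical variables):
# q, Z = nonlinear_cca_inputs([qx, qy], [fx_list, fy_list])
# G, L = coordinate_charts(q, Z, d=1)   # shared 1-dimensional latent space
```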

5 Discussion and Conclusions

We have shown how we can globally maximize a free-energy objective similar to that of [12, 13] with respect to the linear maps $L_s$ that map homogeneous coordinates of the local feature extractors into a global latent space. This was achieved by making the variance of $p(g \mid s, x)$ independent of the linear maps. The linear maps are found by computing the d + 1 smallest eigenvectors of a matrix. The edge size of the eigenproblem is given by k plus the total number of extracted features; typically, if all feature extractors give the same number d of features, this will be k(d + 1). However, it is not necessary to let every mixture component extract an equal number of features.

The way in which we find the latent representation of the data is very similar to the k-segments algorithm [18]. In the latter we first fitted a mixture of PCAs and used an (only locally optimal) combinatorial optimization algorithm to discover how the different PCAs (segments) could be aligned. Here too, first a mixture model is fitted to the data and then an optimization problem is solved to coordinate the mixture components. However, in the current work the optimization of the alignment is globally optimal, and it can be applied to PCAs of any dimension and latent spaces of any dimension. Compared to [14] we do not have to compute neighbors in the data space, which is costly and, moreover, can be hard when processing data with mixed discrete and continuous features; we only need a mixture model to produce posteriors. In [11] the same objective and solution are used as here. We showed how using multiple mixtures in parallel can boost performance significantly, and how the same technique can be used for non-linear CCA. We illustrated the viability of the presented algorithms on synthetic and real data. In future work, we want to exploit the link with CCA further by investigating its applicability in classification and regression problems.

Acknowledgment

This research is supported as project AIF 4997 by the Technology Foundation STW, applied science division of NWO, and the technology program of the Ministry of Economic Affairs.

References

[1] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.

[2] C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215-234, 1998.

[3] J.J. Verbeek, N. Vlassis, and B. Kröse. Locally linear generative topographic mapping. In M. Wiering, editor, Proc. of 12th Belgian-Dutch conf. on Machine Learning, 2002.

[4] T. Kohonen. Self-Organizing Maps. Springer, 2001.

[5] J.J. Verbeek, N. Vlassis, and B.J.A. Kröse. Self-organization by optimizing free-energy. In M. Verleysen, editor, Proc. of European Symposium on Artificial Neural Networks. D-side, Evere, Belgium, 2003. To appear.

[6] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000.

[7] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 2000.

[8] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[9] F.R. Bach and M.I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.

[10] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

[11] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

[12] S.T. Roweis, L.K. Saul, and G.E. Hinton. Global coordination of local linear models. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

[13] J.J. Verbeek, N. Vlassis, and B. Kröse. Coordinating principal component analyzers. In J.R. Dorronsoro, editor, Proceedings of Int. Conf. on Artificial Neural Networks, pages 914-919, Madrid, Spain, 2002. Springer.

[14] Y.W. Teh and S.T. Roweis. Automatic alignment of local representations. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

[15] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936.

[16] K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate Analysis. Academic Press, 1994.

[17] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Number 59 in Monographs on Statistics and Applied Probability. Chapman & Hall, 1994.

[18] J.J. Verbeek, N. Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23(8):1009-1017, 2002.
