Improved Multidimensional Scaling Analysis Using Neural Networks with Distance-Error Backpropagation

Lluis Garrido (1), Sergio Gomez (2) and Jaume Roca (1)

(1) Departament d'Estructura i Constituents de la Materia / IFAE, Universitat de Barcelona, Diagonal 647, E-08028 Barcelona, Spain.
(2) Departament d'Enginyeria Informatica (ETSE), Universitat Rovira i Virgili, Crta. de Salou s/n (Complex educatiu), E-43006 Tarragona, Spain.

Abstract

We show that neural networks, with a suitable error function for backpropagation, can be successfully used for Metric Multidimensional Scaling (MDS), i.e. dimensional reduction while trying to preserve the original distances between patterns, and are in fact able to outdo the standard algebraic approach to MDS, known as Classical Scaling (CLS).

Published in Neural Computation 11 (1999) 595-600.

Introduction

A standard problem in multidimensional scaling analysis consists in trying to map a collection of patterns, represented as points in an n-dimensional space {x_a in R^n; a = 1, ..., p}, to a lower-dimensional space in such a way that the distances between the projected points resemble as closely as possible the distances between the original ones.
More precisely, given the collection {x_a}, with Euclidean distances between pairs (a, b) of patterns

    d_ab^(n) = sqrt( (x_a - x_b)^2 ),

one has to find a map phi : R^n -> R^m, with m < n, such that it minimizes the quadratic distance-error function

    E_phi = sum_{a<b} ( d_ab^(n) - d_ab^(m) )^2,

where d_ab^(m) are the Euclidean distances computed in the projected space,

    d_ab^(m) = sqrt( (phi(x_a) - phi(x_b))^2 ).

Typically, m is chosen to be two or three in order to make a graphical representation of the projected configuration available. This can help visualize an underlying structure that might be obscured by the cluttering of data in the original space.

It is not known in general how to find the exact expression of the best map phi. Yet, there is a standard method to approximate it, known as Classical Scaling (CLS), which involves the diagonalization of the symmetric matrix S of scalar products,

    S_ab = x_a . x_b,

by means of an orthogonal matrix C. Taking {x_a} to be centered at the origin, i.e. sum_a x_a = 0, and assuming that p > n, it is simple to show that S can have at most n non-zero eigenvalues. Each of these eigenvalues can be regarded as the typical scale of a principal direction. If we denote by lambda_1, ..., lambda_m the m largest eigenvalues, the resultant mapping to R^m is given by

    phi_CLS(x_a)_i = sqrt(lambda_i) C_ai,    i = 1, ..., m.

See, for example, [1] and references therein for a detailed description of this method. It should be said that Classical Scaling can be used in a broader context, when only a matrix of dissimilarities is known, as a tool to assign coordinates to the patterns. It can be shown that once coordinates are already known for the patterns, as in our case, Classical Scaling actually reduces to Principal Component Analysis (PCA).
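As a concrete illustration (a minimal NumPy sketch, not the authors' code), the CLS projection can be computed by diagonalizing the scalar-product matrix of the centered patterns and keeping the m leading principal directions:

```python
import numpy as np

def classical_scaling(X, m):
    """Project the p x n pattern matrix X to m dimensions via Classical Scaling."""
    Xc = X - X.mean(axis=0)           # center the configuration at the origin
    S = Xc @ Xc.T                     # scalar-product matrix S_ab = x_a . x_b (p x p)
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]  # eigenvalues in decreasing order
    eigval = eigval[order[:m]]        # keep the m largest eigenvalues
    eigvec = eigvec[:, order[:m]]
    # coordinates of pattern a: sqrt(lambda_i) * C_ai, i = 1..m
    return eigvec * np.sqrt(np.maximum(eigval, 0.0))
```

With m equal to the original dimension n the projection is a rigid rotation of the centered data, so all pairwise distances are reproduced exactly; with m < n it keeps the largest-variance directions, as described above.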
Multidimensional scaling with neural networks

In this note we provide an alternative solution to this problem which involves the use of neural networks. The main idea consists in building a net with n input units and a number of hidden layers, containing a bottle-neck layer with only m units and an output layer with n units. A modified version of the standard backpropagation algorithm [2] is then invoked. In addition to the quadratic error term between input and output, it contains a new term which is introduced to minimize the difference between the distances of pairs in the input and neck layers. Finally, when enough iterations have been performed, the projected configuration is read out from the neck layer.

In order to use the net in the most efficient way it is convenient to perform a translation and global scaling of the initial data,

    x_a -> xi_a^in = alpha_in (x_a - a),

so as to make xi_a^in in [0,1]^n. Then one can use xi_a^in as the input to the net. The outcome of the neck layer, xi_a^nk, lives in the region [0,1]^m since we are using sigmoid activation functions. This implies that 0 <= d_ab^nk <= sqrt(m) while 0 <= d_ab^in <= sqrt(n) for any pair of input points (xi_a^in, xi_b^in), where d_ab^nk and d_ab^in stand for the distances between patterns a and b in the neck and initial layers, respectively.

The error function that we have considered in the backpropagation method is given by

    E = lambda E_1 + (1 - lambda) E_2,

where

    E_1 = sum_a ( xi_a^out - xi_a^in )^2    and    E_2 = sum_{a<b} ( d_ab^in / sqrt(n) - d_ab^nk / sqrt(m) )^2,

and lambda in [0,1] controls the relative contribution of each part. The term E_1 favours those maps for which the representation in the bottle-neck layer can be most accurately inverted to recover the original configuration. The second term, E_2, is the most important one since it forces this representation in the bottle-neck to inherit, as closely as possible, the metric structure of the original configuration. The different scalings for d_ab^in and d_ab^nk in this term are introduced in order to have both numbers in the same range. In this way we can guarantee that all possible configurations can still be covered with the use of sigmoids.
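The combined error can be sketched as follows (a NumPy illustration, not the authors' implementation; the layer activations are assumed to be given as arrays of shape (p, n), (p, m) and (p, n)):

```python
import numpy as np

def pairwise_dist(A):
    """Full matrix of Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mds_error(xi_in, xi_neck, xi_out, lam):
    """E = lam * E1 + (1 - lam) * E2 for one pass over all p patterns."""
    p, n = xi_in.shape
    m = xi_neck.shape[1]
    # E1: quadratic reconstruction error between output and input layers
    E1 = ((xi_out - xi_in) ** 2).sum()
    # E2: mismatch between rescaled input and neck distances, over pairs a < b
    iu = np.triu_indices(p, k=1)
    d_in = pairwise_dist(xi_in)[iu]
    d_nk = pairwise_dist(xi_neck)[iu]
    E2 = ((d_in / np.sqrt(n) - d_nk / np.sqrt(m)) ** 2).sum()
    return lam * E1 + (1 - lam) * E2
```

Setting lam = 1 recovers the purely auto-associative (NLPCA-like) error, while lam = 0 leaves only the distance term E2 discussed above.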
The various scalings involved in this process make the outcome of the neck layer not directly interpretable as the final answer; we can bring it back to the original scale by setting

    phi_NN(x_a) = alpha_out xi_a^nk,

with alpha_out = sqrt(n/m) / alpha_in. However, a slightly better solution can be obtained by choosing instead

    alpha_out = sum_{a<b} d_ab^(n) d_ab^nk / sum_{a<b} ( d_ab^nk )^2,

since this is the value of alpha that minimizes the function E(alpha) = sum_{a<b} ( d_ab^(n) - alpha d_ab^nk )^2 for the given neck configuration, which is what we are ultimately trying to achieve with the whole procedure.

In the practical use of the neural network we have noticed that the best results are obtained by letting the parameter lambda fall to zero as the learning grows, so that the error function E reduces to E_2 after a certain number of iterations. Actually, a non-zero value of lambda is only useful in the early stages of the learning, in order to speed up convergence. In this situation, i.e. with E = E_2, it is easy to prove analytically that the configuration minimizing E differs from the one minimizing directly sum_{a<b} ( d_ab^in - d_ab^nk )^2 only by a global scaling sqrt(n/m) of all coordinates. Thus, the (otherwise technically convenient) scalings that we have introduced above are completely harmless for the purpose of searching for the best mapped configuration.

It is well known that a network with just the input, output and neck layers, with linear activation functions and subject to self-supervised backpropagation, is equivalent to PCA [3]. Our approach goes beyond PCA, not only because of the use of sigmoid (non-linear) activation functions and the addition of a number of hidden layers, but essentially for the presence of this new distance-term contribution, E_2, which favours those configurations in the neck layer that better approximate the original distances.

One may wonder how our method compares to Non-Linear PCA (NLPCA) [4, 5, 6]. Actually, NLPCA can be recovered as a particular case of our approach by setting lambda = 1 in the error function (i.e. with E = E_1). Clearly, NLPCA will generally do better than ordinary PCA in the minimization of the term E_1 because of its ability to model non-linear configurations.
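The closed-form rescaling alpha_out above is an ordinary least-squares fit of the original distances by a multiple of the neck distances; a minimal sketch (hypothetical helper, not from the paper):

```python
import numpy as np

def optimal_alpha(d_orig, d_neck):
    """Least-squares minimizer of E(alpha) = sum_{a<b} (d_orig - alpha * d_neck)^2.

    d_orig, d_neck: 1-D arrays of pairwise distances over the pairs a < b.
    """
    return (d_orig * d_neck).sum() / (d_neck ** 2).sum()
```

Differentiating E(alpha) and setting the derivative to zero gives exactly this ratio, which is why it beats the fixed choice sqrt(n/m) / alpha_in for any particular neck configuration.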
However, NLPCA does not care at all about the distances between patterns in the bottle-neck representation: any two neck configurations are equally good for NLPCA if both provide the same result in the output layer. Hence, the comparison of NLPCA with our approach is really inappropriate because both methods are in fact designed for different purposes (minimizing E_1 and
E_2, respectively). On the contrary, the projected configuration of standard PCA still retains part of the metric structure of the initial configuration, since it is just a linear orthogonal projection onto the largest-variance axes, and hence it produces better results for E_2 than NLPCA. This is why we will compare the performance of our method with CLS (i.e. PCA) and not with NLPCA.

A comparative analysis of both approaches over several types of configurations shows that our method produces better results in the "tougher" situations, i.e. when some of the discarded directions in the CLS method still have relatively large associated eigenvalues. Finally, it is worth stressing that CLS provides just a linear orthogonal projection whereas the neural net is able to produce more general (non-linear) mappings.

An example

As an illustration of both procedures we have considered a data set consisting of different animal species, characterized by n = 17 attributes each (15 boolean + 2 numerical). (*) The coordinates x_a and distances d_ab have been obtained after scaling the numerical attributes to the range [0,1] in order to assign an equal weight to all attributes (implying that in this case we simply have xi_a^in = x_a). When using the neural net, the best scaling for the two-dimensional neck representation is given by alpha_out = 2.946, which is in less than 1.1 % disagreement with the expected value of sqrt(17/2).

The projected configurations obtained with each method are drawn in figure 1. Patterns are represented by their label. As shown in the plot, both approaches produce a fairly similar configuration. However, the computation of the overall relative error,

    epsilon = ( sum_{a<b} ( d_ab^(n) - d_ab^(m) )^2 / sum_{a<b} ( d_ab^(n) )^2 )^(1/2),

for each method shows that the neural network gives a slightly better result,

    epsilon_CLS = 0.2728,    epsilon_NN = 0.2346,

which amounts to a 14.00 % improvement over the CLS method.

(*) Extracted from the "Zoo Database" created by Richard S. Forsyth (1990) (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/zoo).
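The overall relative error used in this comparison can be sketched in a few lines (NumPy illustration; the distance arrays are assumed precomputed over the pairs a < b):

```python
import numpy as np

def relative_error(d_orig, d_proj):
    """epsilon = sqrt( sum (d_orig - d_proj)^2 / sum d_orig^2 ) over pairs a < b."""
    return np.sqrt(((d_orig - d_proj) ** 2).sum() / (d_orig ** 2).sum())
```

A perfect projection gives epsilon = 0, while collapsing all patterns to a single point gives epsilon = 1, so the values quoted above are directly comparable across methods.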
Acknowledgements

This work has been supported in part by CICYT contract AEN95-0590 and by URV project URV96-GNI-3. J.R. also thanks the Ministerio de Educacion y Cultura of Spain for financial support.

References

[1] T.F. Cox and M.A.A. Cox, Multidimensional Scaling. Chapman & Hall, London 1994.

[2] D.E. Rumelhart, G.E. Hinton and R.J. Williams 1986. Learning representations by back-propagating errors, Nature 323, 533.

[3] T.D. Sanger 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Networks 2, 459.

[4] M.A. Kramer 1991. Non-linear principal component analysis using autoassociative neural networks, AIChE Journal 37, 233.

[5] D. DeMers and G. Cottrell 1994. Non-linear dimensionality reduction, published in NIPS 5; N. Kambhatla and T.K. Leen 1995. Fast non-linear dimension reduction, published in NIPS 6.

[6] Ll. Garrido, V. Gaitan, M. Serra-Ricart and X. Calbet 1995. Use of multilayer feedforward neural nets as a display method for multidimensional distributions, Int. J. Neural Systems 6, 273; Ll. Garrido, S. Gomez, V. Gaitan and M. Serra-Ricart 1996. A regularization term to avoid the saturation of the sigmoids in multilayer neural networks, Int. J. Neural Systems 7, 257.
[Figure 1: Two-dimensional configurations of the 90 animal patterns obtained with CLS (left) and with the neural network, NN (right). Patterns are represented by their label.]