Entropic Graphs for Manifold Learning

Jose A. Costa and Alfred O. Hero III
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109
Email: jcosta@umich.edu, hero@eecs.umich.edu

Abstract—We propose a new algorithm that simultaneously estimates the intrinsic dimension and intrinsic entropy of random data sets lying on smooth manifolds. The method is based on asymptotic properties of entropic graph constructions. In particular, we compute the Euclidean k-nearest neighbors (k-NN) graph over the sample points and use its overall total edge length to estimate intrinsic dimension and entropy. The algorithm is validated on standard synthetic manifolds.

I. INTRODUCTION

Several interesting classes of signals arising in fields such as bioinformatics, image processing or Internet traffic analysis lie in high dimensional vector spaces. It is well known that both the computational complexity and the statistical performance of most algorithms quickly degrade as dimension increases. This phenomenon, usually known as the curse of dimensionality, makes it impracticable to process such high dimensional data sets. However, many real life signals do not fill the space entirely but are constrained to lie on a smooth low dimensional nonlinear manifold embedded in the high dimensional space. Manifold learning is concerned with the problem of discovering low dimensional structure based on a set of observed high dimensional sample points on the manifold. In the recent past, manifold learning has received substantial attention from researchers in machine learning, computer vision, signal processing and statistics [1]–[4]. This is due to the fact that effectively solving the manifold learning problem can bring considerable improvement to the solution of such diverse problems as: feature extraction in pattern recognition; multivariate density estimation and regression in statistics; data compression and coding in information theory; visualization of high dimensional data; or complexity reduction of algorithms.
Several techniques for recovering the low dimensional structure of high dimensional data have been proposed. These range from: linear methods such as principal components analysis (PCA) [5] and classical multidimensional scaling (MDS) [6]; local methods such as locally linear embedding (LLE) [1], local linear projections (LLP) [7], and Hessian eigenmaps [4]; to global methods such as ISOMAP [2]. One step common to all the manifold reconstruction algorithms mentioned above is that they require explicit knowledge of the intrinsic dimension of the manifold. In many real life applications, this parameter cannot be assumed known and has to be estimated from the data. A frequent way of doing this is to use linear projection techniques ([8]): a linear map is explicitly constructed and dimension is estimated by applying PCA, factor analysis, or MDS to analyze the eigenstructure of the data. These methods rely on the assumption that only a small number of the eigenvalues of the (processed) data covariance will be significant. Linear methods tend to overestimate the intrinsic dimension, as they don't account for non-linearities in the data. Both nonlinear PCA methods [3] and ISOMAP circumvent this problem, but they still rely on unreliable and costly eigenstructure estimates. Other methods have been proposed based on local geometric techniques, e.g., estimation of local neighborhoods [8] or fractal dimension [9], and estimation of packing numbers [10] of the manifold. The closely related problem of estimating the manifold's intrinsic entropy arises when the data samples are drawn from a multivariate distribution supported on the manifold. When the distribution is absolutely continuous with respect to the Lebesgue measure restricted to the lower dimensional manifold, this intrinsic entropy can be useful for exploring data compression over the manifold or, as suggested in [11], clustering of multiple sub-populations on the manifold.
The goal of this paper is to develop an algorithm that jointly estimates both the intrinsic dimension and the intrinsic entropy on the manifold, without knowing the manifold description, given only a set of random sample points. Our approach is based on entropic graph methods; see [11] for an overview. Specifically: construct the Euclidean k-nearest neighbors (k-NN) graph over all the sample points and use its growth rate to estimate the intrinsic dimension and entropy by a simple linear least squares and method of moments procedure. This method shares with the geodesic minimal spanning tree (GMST) method, introduced by us in previous work [12], the simplicity of avoiding the reconstruction of the manifold or the estimation of the multivariate density of the samples. However, it has the main advantage of reducing runtime complexity by an order of magnitude, and it is applicable to a wider class of manifolds.

The remainder of the paper is organized as follows. In Section II we discuss the asymptotic behavior of the k-NN graph on a manifold and the approximation of k-NN geodesic distances by the corresponding Euclidean distances. The proposed algorithm is described in Section III. Experimental results are reported in Section IV. The theoretical results introduced in this paper are presented without proof due to space limitations. The corresponding proofs can be found in [13].

II. THE k-NN GRAPH

Let $Y_1, \dots, Y_n$ be independent and identically distributed (i.i.d.) random vectors with values in a compact
subset of $\mathbb{R}^d$. The (1-)nearest neighbor of $Y_i$ in $\mathcal{Y}_n = \{Y_1, \dots, Y_n\}$ is given by
$$\arg\min_{Y \in \mathcal{Y}_n \setminus \{Y_i\}} d(Y, Y_i),$$
where distances between points are measured in terms of some suitable distance function $d(\cdot,\cdot)$. For a general integer $k \ge 1$, the $k$-nearest neighbor of a point is defined in a similar way. The $k$-NN graph puts an edge between each point in $\mathcal{Y}_n$ and its $k$-nearest neighbors. Let $N_{k,i} = N_{k,i}(\mathcal{Y}_n)$ be the set of $k$-nearest neighbors of $Y_i$ in $\mathcal{Y}_n$. The total edge length of the $k$-NN graph is defined as:
$$L_{\gamma,k}(\mathcal{Y}_n) = \sum_{i=1}^{n} \sum_{Y \in N_{k,i}} d(Y, Y_i)^{\gamma}, \qquad (1)$$
where $\gamma > 0$ is a power weighting constant. If $d(Y, Y_i) = \|Y - Y_i\|$, where $\|\cdot\|$ is the usual Euclidean ($L^2$) norm in $\mathbb{R}^d$, then the $k$-NN graph falls under the framework of continuous quasi-additive Euclidean functionals [14]. As a consequence, its almost sure (a.s.) asymptotic behavior (and also convergence in the mean) follows easily from the umbrella theorems for such graphs:

Theorem 1 ([14]): Let $Y_1, \dots, Y_n$ be i.i.d. random vectors with values in a compact subset of $\mathbb{R}^d$ and Lebesgue density $f$. Let $d \ge 2$, $1 \le \gamma < d$ and define $\alpha = (d - \gamma)/d$. Then
$$\lim_{n \to \infty} \frac{L_{\gamma,k}(\mathcal{Y}_n)}{n^{\alpha}} = \beta_{d,\gamma,k} \int_{\mathbb{R}^d} f^{\alpha}(y)\, dy \quad \text{a.s.},$$
where $L_{\gamma,k}(\mathcal{Y}_n)$ is given by equation (1) with Euclidean distance, and $\beta_{d,\gamma,k}$ is a constant independent of $f$. Furthermore, the mean length $E[L_{\gamma,k}(\mathcal{Y}_n)]/n^{\alpha}$ converges to the same limit.

The integral factor $\int f^{\alpha}$ in the a.s. limit is a monotonic function of the extrinsic Rényi $\alpha$-entropy of the multivariate Lebesgue density $f$:
$$H_{\alpha}(f) = \frac{1}{1-\alpha} \log \int_{\mathbb{R}^d} f^{\alpha}(y)\, dy. \qquad (2)$$
In the limit, when $\alpha \to 1$, the usual Shannon entropy, $-\int_{\mathbb{R}^d} f(y) \log f(y)\, dy$, is obtained.

Assume now that $Y_1, \dots, Y_n$ are constrained to lie on a compact smooth $m$-dimensional manifold $\mathcal{M}$. The distribution of $Y_i$ becomes singular with respect to Lebesgue measure, and an application of Theorem 1 results in a zero limit for the length functional of the $k$-NN graph. However, this behavior can be modified by changing the way distances between points are measured. For this purpose, we use the framework of Riemann manifolds.

A. Random Points in a Riemann Manifold

Given a smooth manifold $\mathcal{M}$, a Riemann metric $g$ is a mapping which associates to each point $y \in \mathcal{M}$ an inner product $g_y(\cdot,\cdot)$ between vectors tangent to $\mathcal{M}$ at $y$ [15].
A Riemann manifold $(\mathcal{M}, g)$ is just a smooth manifold $\mathcal{M}$ with a given Riemann metric $g$. As an example, when $\mathcal{M}$ is a submanifold of the Euclidean space $\mathbb{R}^d$, the naturally induced Riemann metric on $\mathcal{M}$ is just the usual dot product between vectors. For any tangent vector $v$ to $\mathcal{M}$ at $y$, we can define its norm as $\|v\| = \sqrt{g_y(v, v)}$. Using this norm, it is natural to define the length of a piecewise smooth curve $c: [0,1] \to \mathcal{M}$ on $\mathcal{M}$ as $L(c) = \int_0^1 \|c'(t)\|\, dt$. The geodesic distance between points $y_1, y_2 \in \mathcal{M}$ is the length of the shortest piecewise smooth curve between the two points:
$$\rho(y_1, y_2) = \min \{ L(c) : c(0) = y_1,\ c(1) = y_2 \}.$$
Given the geodesic distance, one can construct a geodesic $k$-NN graph on $\mathcal{Y}_n$ by computing the nearest neighbor relations between points using $\rho$ instead of the usual Euclidean distance. Consequently, we define the total edge length of this new graph as $L^{\rho}_{\gamma,k}(\mathcal{Y}_n)$, given by (1) with the correspondence $d \leftrightarrow \rho$.

We can now extend Theorem 1 to general compact Riemann manifolds. This extension, Theorem 2 below, states that the asymptotic behavior of $L^{\rho}_{\gamma,k}(\mathcal{Y}_n)$ is no longer determined by the density of $Y$ relative to the Lebesgue measure of $\mathbb{R}^d$, but depends instead on the density of $Y$ relative to $\mu_g$, the measure induced on $\mathcal{M}$ via the volume element [15].

Theorem 2: Let $(\mathcal{M}, g)$ be a compact Riemann $m$-dimensional manifold. Suppose $Y_1, \dots, Y_n$ are i.i.d. random elements of $\mathcal{M}$ with bounded density $f$ relative to $\mu_g$. Let $L^{\rho}_{\gamma,k}$ be the $k$-NN graph length computed using the geodesic distance $\rho$. Assume $m \ge 2$, $1 \le \gamma < m$ and define $\alpha = (m - \gamma)/m$. Then,
$$\lim_{n \to \infty} \frac{L^{\rho}_{\gamma,k}(\mathcal{Y}_n)}{n^{\alpha}} = \beta_{m,\gamma,k} \int_{\mathcal{M}} f^{\alpha}(y)\, \mu_g(dy) \quad \text{a.s.}, \qquad (3)$$
where $\beta_{m,\gamma,k}$ is a constant independent of $f$ and $(\mathcal{M}, g)$. Furthermore, the mean length $E[L^{\rho}_{\gamma,k}(\mathcal{Y}_n)]/n^{\alpha}$ converges to the same limit.

Now, the integral factor in the a.s. limit of (3) is a monotonic function of the intrinsic Rényi $\alpha$-entropy of the multivariate density $f$ on $\mathcal{M}$:
$$H^{\mathcal{M}}_{\alpha}(f) = \frac{1}{1-\alpha} \log \int_{\mathcal{M}} f^{\alpha}(y)\, \mu_g(dy). \qquad (4)$$
An immediate consequence of Theorem 2 is that, for known $m$,
$$\hat{H}^{\mathcal{M}}_{\alpha}(\mathcal{Y}_n) = \frac{m}{\gamma} \left[ \log \frac{L^{\rho}_{\gamma,k}(\mathcal{Y}_n)}{n^{\alpha}} - \log \beta_{m,\gamma,k} \right] \qquad (5)$$
is an asymptotically unbiased and strongly consistent estimator of the intrinsic $\alpha$-entropy $H^{\mathcal{M}}_{\alpha}(f)$.

The intuition behind the proof of Theorem 2 comes from the fact that a Riemann manifold $(\mathcal{M}, g)$, with associated distance $\rho$ and measure $\mu_g$, looks locally like $\mathbb{R}^m$ with Euclidean distance and Lebesgue measure. This implies that on small neighborhoods of the manifold the total edge length $L^{\rho}_{\gamma,k}(\mathcal{Y}_n)$ behaves like a Euclidean length functional. As $\mathcal{M}$ is assumed compact, it can
be covered by a finite number of such neighborhoods. This fact, together with the subadditive and superadditive properties [14] of $L^{\rho}_{\gamma,k}$, allows for repeated applications of Theorem 1, resulting in (3).

B. Approximating Geodesic k-NN Distances

Assume now that $\mathcal{M} \subset \mathbb{R}^d$. In the manifold learning problem, $\mathcal{M}$ (or any representation of it) is not known in advance. Consequently, the geodesic distances between points on $\mathcal{M}$ cannot be computed exactly and have to be estimated solely from the data samples. In the GMST algorithm [12] (or in ISOMAP [2]), this is done by running a costly optimization algorithm over a global graph of neighborhood relations among all points. Unlike the MST, the $k$-NN graph is only influenced by local distances. For fixed $k$, the maximum nearest neighbor distance over all points in $\mathcal{Y}_n$ goes to zero as the number of samples increases. For sufficiently large $n$, this implies that the $k$-nearest neighbors of each point will fall in a neighborhood of the manifold where geodesic curves are well approximated by the corresponding straight lines between end points. This suggests using simple Euclidean $k$-NN distances as surrogates for the corresponding true geodesic distances. In fact, we prove that the geodesic $k$-NN distances are uniformly well approximated by the corresponding Euclidean $k$-NN distances in the following sense:

Theorem 3: Let $(\mathcal{M}, g)$ be a compact Riemann submanifold of $\mathbb{R}^d$. Suppose $Y_1, \dots, Y_n$ are i.i.d. random vectors of $\mathcal{M}$. Then, with probability 1, the relative error between the Euclidean and geodesic $k$-NN distances vanishes uniformly over the sample:
$$\max_{1 \le i \le n}\ \max_{Y \in N_{k,i}} \left| \frac{\|Y - Y_i\|}{\rho(Y, Y_i)} - 1 \right| \to 0 \quad \text{as } n \to \infty. \qquad (6)$$

III. JOINT INTRINSIC DIMENSION/ENTROPY ESTIMATION

Let $L_{\gamma,k}(\mathcal{Y}_n)$ be the total edge length of the Euclidean $k$-NN graph over $\mathcal{Y}_n$. Its asymptotic behavior is a simple consequence of Theorems 2 and 3:

Corollary 1: Let $(\mathcal{M}, g)$ be a compact Riemann $m$-dimensional submanifold of $\mathbb{R}^d$. Suppose $Y_1, \dots, Y_n$ are i.i.d. random vectors of $\mathcal{M}$ with bounded density $f$ relative to $\mu_g$. Assume $m \ge 2$, $1 \le \gamma < m$ and define $\alpha = (m - \gamma)/m$. Then,
$$\lim_{n \to \infty} \frac{L_{\gamma,k}(\mathcal{Y}_n)}{n^{\alpha}} = \beta_{m,\gamma,k} \int_{\mathcal{M}} f^{\alpha}(y)\, \mu_g(dy) \quad \text{a.s.}, \qquad (7)$$
where $\beta_{m,\gamma,k}$ is a constant independent of $f$ and $(\mathcal{M}, g)$. Furthermore, the mean length $E[L_{\gamma,k}(\mathcal{Y}_n)]/n^{\alpha}$ converges to the same limit.
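Corollary 1 licenses working with plain Euclidean $k$-NN distances. The length functional $L_{\gamma,k}$ of (1) is then easy to compute with a spatial index. The following is a minimal sketch in Python using SciPy's `cKDTree`; the function name and implementation choices are ours, as the paper prescribes no particular implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph_length(X, k=4, gamma=1.0):
    """Total edge length L_{gamma,k} of the Euclidean k-NN graph, eq. (1).

    X is an (n, d) array of sample points; each point contributes the sum
    of the distances to its k nearest neighbors, raised to the power gamma.
    """
    tree = cKDTree(X)
    # Query k+1 neighbors: the closest "neighbor" of each query point is
    # the point itself (at distance 0), so the first column is discarded.
    dists, _ = tree.query(X, k=k + 1)
    return float(np.sum(dists[:, 1:] ** gamma))
```

As a sanity check: for two points at unit distance, with $k = 1$ and $\gamma = 1$, each point's nearest neighbor is the other, so the total length is 2.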
We are now ready to apply this result to jointly estimate intrinsic dimension and entropy. The key is to notice that the growth rate of the length functional is strongly dependent on $m$, while the constant in the convergent limit encodes the intrinsic $\alpha$-entropy. We use this strong growth dependence as a motivation for a simple estimator of $m$. Define $\ell_n = \log L_{\gamma,k}(\mathcal{Y}_n)$. According to Corollary 1, $\ell_n$ has the following approximation:
$$\ell_n = a \log n + b + \epsilon_n, \qquad (8)$$
where
$$a = \frac{m - \gamma}{m}, \qquad b = \log \beta_{m,\gamma,k} + \frac{\gamma}{m} H^{\mathcal{M}}_{\alpha}(f), \qquad (9)$$
$\alpha = (m - \gamma)/m$, and $\epsilon_n$ is an error residual that goes to zero a.s. as $n \to \infty$.

Using the additive model (8), we propose a simple nonparametric least squares strategy based on resampling from the population $\mathcal{Y}_n$ of points in $\mathcal{M}$. Specifically, let $p_1, \dots, p_Q$, $1 \le p_1 < \dots < p_Q \le n$, be $Q$ integers and let $N$ be an integer that satisfies $N/n = \rho$ for some fixed $\rho \in (0, 1]$. For each value of $p \in \{p_1, \dots, p_Q\}$, randomly draw $N$ bootstrap datasets $\mathcal{Y}_p^j$, $j = 1, \dots, N$, with replacement, where the $p$ data points within each $\mathcal{Y}_p^j$ are chosen from the entire data set $\mathcal{Y}_n$ independently. From these samples compute the empirical mean of the $k$-NN length functionals, $\bar{L}_p = N^{-1} \sum_{j=1}^{N} L_{\gamma,k}(\mathcal{Y}_p^j)$. Defining $\bar{\ell} = [\log \bar{L}_{p_1}, \dots, \log \bar{L}_{p_Q}]^T$, we write down the linear vector model
$$\bar{\ell} = A \theta + \epsilon, \qquad (10)$$
where
$$A = \begin{bmatrix} \log p_1 & \cdots & \log p_Q \\ 1 & \cdots & 1 \end{bmatrix}^T, \qquad \theta = [a, b]^T.$$
We now take a method-of-moments (MOM) approach in which we use (10) to solve for the linear least squares (LLS) estimates $\hat{a}, \hat{b}$ of $a, b$, followed by inversion of the relations (9). After making a simple large $n$ approximation, this approach yields the following estimates:
$$\hat{m} = \operatorname{round}\left\{ \frac{\gamma}{1 - \hat{a}} \right\}, \qquad \hat{H}^{\mathcal{M}}_{\alpha} = \frac{\hat{m}}{\gamma} \left( \hat{b} - \log \beta_{\hat{m},\gamma,k} \right). \qquad (11)$$
The importance of the constants $\beta_{m,\gamma,k}$ differs according to whether dimension or entropy estimation is considered. On one hand, due to the slow growth of $m \mapsto \beta_{m,\gamma,k}$ in the large $m$ regime for which the above estimates were derived, $\beta_{m,\gamma,k}$ is not required for the dimension estimator. On the other hand, the value of $\beta_{\hat{m},\gamma,k}$ is required for the entropy estimator to be unbiased. From the proof of Theorem 2, it follows that $\beta_{m,\gamma,k}$ is the limit of the normalized length functional of the Euclidean $k$-NN graph for a uniform distribution on the unit cube $[0,1]^m$.
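That characterization of $\beta_{m,\gamma,k}$ suggests a direct Monte Carlo computation: draw uniform samples on $[0,1]^m$, normalize the $k$-NN length by $n^{\alpha}$, and average over trials. The sketch below does this; the sample size, trial count and seed are our own illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def beta_monte_carlo(m, k=4, gamma=1.0, n=2000, trials=10, seed=0):
    """Estimate beta_{m,gamma,k} as the average over independent trials of
    L_{gamma,k}(uniform sample on [0,1]^m) / n^alpha, alpha = (m-gamma)/m."""
    rng = np.random.default_rng(seed)
    alpha = (m - gamma) / m
    vals = []
    for _ in range(trials):
        X = rng.random((n, m))                       # uniform on [0,1]^m
        dists, _ = cKDTree(X).query(X, k=k + 1)      # drop the self-match
        vals.append(np.sum(dists[:, 1:] ** gamma) / n ** alpha)
    return float(np.mean(vals))
```

Since the estimator (11) uses $\log \beta_{\hat{m},\gamma,k}$, in practice one would tabulate these values once, offline, for the range of dimensions of interest.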
As closed form expressions are not available, this constant must be determined by Monte Carlo simulations of the $k$-NN length on the corresponding unit cube for uniform random samples. We note, however, that in many applications all that is required is knowledge of the entropy up to a constant. For example, when maximum or minimum entropy is used as a discriminant on several data sets [11], only the relative ordering of the entropies is important.
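Putting the pieces together, the resampling and least-squares procedure of this section can be sketched as follows. The subsample sizes, bootstrap counts and the optional `log_beta` argument (the precomputed $\log \beta_{\hat{m},\gamma,k}$) are our own illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_length(X, k=4, gamma=1.0):
    # Euclidean k-NN total edge length, eq. (1); self-match discarded.
    dists, _ = cKDTree(X).query(X, k=k + 1)
    return np.sum(dists[:, 1:] ** gamma)

def joint_estimate(X, k=4, gamma=1.0, n_boot=10, seed=0, log_beta=None):
    """Sketch of the joint dimension/entropy estimator of eq. (11).

    Fits log(mean k-NN length) against log(subsample size) by least
    squares; the slope a gives m_hat = round(gamma / (1 - a)) and, if
    log_beta = log(beta_{m_hat,gamma,k}) is supplied, the intercept b
    gives H_hat = (m_hat / gamma) * (b - log_beta).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    sizes = np.linspace(n // 2, n, 5, dtype=int)   # p_1 < ... < p_Q (a choice)
    ell = []
    for p in sizes:
        # Bootstrap subsamples drawn with replacement, as in the paper.
        L = [knn_length(X[rng.choice(n, size=p, replace=True)], k, gamma)
             for _ in range(n_boot)]
        ell.append(np.log(np.mean(L)))
    A = np.column_stack([np.log(sizes), np.ones(len(sizes))])
    (a_hat, b_hat), *_ = np.linalg.lstsq(A, np.asarray(ell), rcond=None)
    m_hat = int(round(gamma / (1.0 - a_hat)))
    H_hat = None if log_beta is None else (m_hat / gamma) * (b_hat - log_beta)
    return m_hat, H_hat
```

With `log_beta` supplied from a Monte Carlo calibration on the unit cube, the second return value estimates the intrinsic $\alpha$-entropy; without it, only the dimension estimate is produced.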
Fig. 1. Original data and corresponding $k$-NN graph for the Swiss roll manifold.

Fig. 2. Log-log plot of the average $k$-NN length for the Swiss roll manifold and its least squares linear fit. The estimated slope implies $\hat{m} = 2$.

Finally, the complexity of the algorithm is dominated by the search for nearest neighbors in the Euclidean metric. Using efficient constructions such as K-D trees, this task can be performed in $O(n \log n)$ time for $n$ sample points. This contrasts with both the GMST and ISOMAP, which require a costly $O(n^2 \log n)$ implementation of a geodesic pairwise distance estimation step.

IV. EXPERIMENTAL RESULTS

We illustrate the performance of the proposed $k$-NN algorithm on manifolds of known dimension. In all the simulations we used the same values of $\gamma$, $k$ and of the resampling parameters $p_1, \dots, p_Q$. With regard to intrinsic dimension estimation, we compare our algorithm to ISOMAP. In ISOMAP, similarly to PCA, intrinsic dimension is usually estimated by looking at the residual errors as a function of subspace dimension.

A. Swiss Roll

The first manifold considered is the standard 2-dimensional Swiss roll surface [2] embedded in $\mathbb{R}^3$ (Fig. 1). Fig. 2 shows a log-log plot of the average $k$-NN length $\bar{L}_p$ as a function of the number of samples. The good agreement between $\bar{\ell}$ and its least squares linear fit confirms the large sample behavior predicted by Corollary 1 and shows evidence in favor of the linear model (8). To compare the dimension estimation performance of the $k$-NN method to ISOMAP we ran a Monte Carlo simulation. For each of several sample sizes, independent sets of i.i.d. random vectors uniformly distributed on the surface were generated. We then counted the number of times that the intrinsic dimension was correctly estimated. To automatically estimate dimension with ISOMAP, we look at its eigenvalue residual variance plot and try to detect the "elbow" at which residuals cease to decrease significantly as the estimated dimension increases [2].
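The log-log slope behavior of Fig. 2 can be reproduced in a few lines. The Swiss roll parametrization constants below are a common choice, not taken from the paper, and the sample sizes are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def swiss_roll(n, rng):
    """Uniform parameter samples mapped onto a 2-D Swiss roll in R^3."""
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))
    h = 21.0 * rng.random(n)
    return np.column_stack([t * np.cos(t), h, t * np.sin(t)])

rng = np.random.default_rng(0)
ns = [200, 400, 800, 1600]
log_lengths = []
for n in ns:
    X = swiss_roll(n, rng)
    dists, _ = cKDTree(X).query(X, k=5)              # 4 neighbors plus self
    log_lengths.append(np.log(np.sum(dists[:, 1:])))  # gamma = 1
slope = np.polyfit(np.log(ns), log_lengths, 1)[0]
# For m = 2 and gamma = 1 the predicted slope is (m - gamma)/m = 0.5,
# so round(1 / (1 - slope)) recovers the intrinsic dimension 2.
```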
The elbow detection is implemented by a simple minimum angle threshold rule. Table I shows the results of this experiment. As can be observed, the $k$-NN algorithm outperforms ISOMAP for small sample sizes.

B. Hyper-spheres

A more challenging problem is the case of the $m$-dimensional sphere $S^m$ (embedded in $\mathbb{R}^{m+1}$). This manifold does not satisfy any of the usual isometric or conformal embedding constraints required by ISOMAP or other methods like C-ISOMAP [16] and Hessian eigenmaps [4]. Once again, we tested the algorithm over repeated generations of uniform random samples on $S^m$, for several values of $m$ and different sample sizes, and counted the number of correct dimension estimates. We note that in all the simulations ISOMAP always overestimated the intrinsic dimension. The results for $k$-NN are shown in Table II for different values of the parameter $k$. As can be seen, the $k$-NN method succeeds in finding the correct intrinsic dimension. However, Table II also shows that the number of samples required to achieve the same level of accuracy increases with the manifold dimension. This is the usual curse of dimensionality phenomenon: as the dimension increases, more samples are needed for the asymptotic regime in (7) to settle in and validate the limit in Corollary 1.

C. Hyper-planes

We also investigate $m$-dimensional hyper-planes, for which PCA methods are designed. We consider hyper-planes of the form $\{x \in \mathbb{R}^d : x_{m+1} = \dots = x_d = 0\}$. Table III shows the results of running a Monte Carlo simulation under the same conditions as in the previous subsections.

TABLE I
NUMBER OF CORRECT DIMENSION ESTIMATES, OVER INDEPENDENT TRIALS, AS A FUNCTION OF THE NUMBER OF SAMPLES, FOR THE SWISS ROLL MANIFOLD (ROWS: ISOMAP; $k$-NN).

TABLE II
NUMBER OF CORRECT DIMENSION ESTIMATES, OVER INDEPENDENT TRIALS, AS A FUNCTION OF THE NUMBER OF SAMPLES, FOR HYPER-SPHERES $S^m$ AND DIFFERENT NUMBERS OF NEIGHBORS $k$.
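The hyper-sphere experiments are easy to replicate in outline: uniform samples on $S^m$ are obtained by normalizing Gaussian vectors, and the slope of the log length functional recovers $m$. The sketch below estimates the dimension of $S^2$; sample sizes and seed are our own illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def uniform_sphere(n, m, rng):
    """n uniform samples on the m-sphere S^m embedded in R^(m+1)."""
    g = rng.standard_normal((n, m + 1))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(2)
ns = [250, 500, 1000, 2000]
log_lengths = []
for n in ns:
    X = uniform_sphere(n, 2, rng)
    dists, _ = cKDTree(X).query(X, k=5)              # 4 neighbors plus self
    log_lengths.append(np.log(np.sum(dists[:, 1:])))  # gamma = 1
slope = np.polyfit(np.log(ns), log_lengths, 1)[0]
m_hat = round(1.0 / (1.0 - slope))                    # eq. (11), gamma = 1
```

Because the sphere has no boundary, the asymptotic regime of Corollary 1 sets in quickly, making this a convenient sanity check.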
TABLE III
NUMBER OF CORRECT DIMENSION ESTIMATES, OVER INDEPENDENT TRIALS, AS A FUNCTION OF THE NUMBER OF SAMPLES, FOR HYPER-PLANES OF DIFFERENT DIMENSIONS.

TABLE IV
NUMBER OF CORRECT DIMENSION ESTIMATES, OVER INDEPENDENT TRIALS, AS A FUNCTION OF THE NUMBER OF SAMPLES, FOR A FULL DIMENSIONAL UNIFORM DISTRIBUTION ON THE UNIT CUBE.

Fig. 3. Real valued intrinsic dimension estimates and corresponding histogram for the hyper-plane experiment.

Unlike ISOMAP, which was observed to correctly predict the dimension for all sample sizes investigated, the $k$-NN method has a tendency to underestimate the correct dimension at smaller sample sizes. This fact can be observed in Fig. 3. The first column shows the real valued estimates of the intrinsic dimension, i.e., the estimates obtained before the rounding operation in (11). Any value that falls between the dashed lines will then be rounded to the middle point. The second column of Fig. 3 shows the histogram of these rounded estimates over the simulation trials. We believe that the resampling strategy of the algorithm may be responsible for this underestimation. Several methods for improving the performance of the $k$-NN algorithm are currently under investigation.

D. Full Dimensional Uniform Samples on the Unit Cube

Finally, we consider uniformly distributed samples on the full dimensional unit cube $[0,1]^d$. The results, summarized in Table IV, are similar to those for hyper-planes in the previous subsection. ISOMAP correctly estimated the dimensionality of the data for all sample sizes.

V. CONCLUSION

We have introduced a novel method for intrinsic dimension and entropy estimation based on the growth rate of the Euclidean $k$-NN graph length functional. The proposed algorithm is applicable to a wider class of manifolds than previous methods and has reduced computational complexity. We have validated the new method by testing it on synthetic manifolds of known dimension.

In order to improve the performance of the derived estimators, a better understanding of the statistics of the error term in the linear model (8) would be important. Also of great interest is the study of the effect of additive noise on the manifold samples. With regard to applications, we plan to test the proposed algorithm on databases of faces, handwritten digits and genetic data.

ACKNOWLEDGMENT

The work presented here was partially funded by NIH grant PO CA763- and NSF grant CCR-37.

REFERENCES

[1] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[2] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
[3] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley-Interscience, 2001.
[4] D. Donoho and C. Grimes, "Hessian eigenmaps: new locally linear embedding techniques for high dimensional data," Tech. Rep., Dept. of Statistics, Stanford University, 2003.
[5] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[6] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.
[7] X. Huo and J. Chen, "Local linear projection (LLP)," in Proc. of First Workshop on Genomic Signal Processing and Statistics (GENSIPS), 2002.
[8] P. Verveer and R. Duin, "An evaluation of intrinsic dimensionality estimators," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 81–86, January 1995.
[9] F. Camastra and A. Vinciarelli, "Estimating the intrinsic dimension of data with a fractal-based method," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1404–1407, October 2002.
[10] B. Kégl, "Intrinsic dimension estimation using packing numbers," in Neural Information Processing Systems: NIPS, Vancouver, Canada, Dec. 2002.
[11] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95, October 2002.
[12] J. A. Costa and A. O. Hero, "Geodesic minimal spanning trees for dimension and entropy estimation in manifold learning," IEEE Trans. on Signal Processing, 2003, under revision.
[13] J. A. Costa and A. O. Hero, "Manifold learning using Euclidean k-nearest neighbor graphs," in preparation, 2003.
[14] J. E. Yukich, Probability Theory of Classical Euclidean Optimization Problems, vol. 1675 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1998.
[15] M. do Carmo, Riemannian Geometry, Birkhäuser, Boston, 1992.
[16] V. de Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction," in Neural Information Processing Systems (NIPS), Vancouver, Canada, Dec. 2002.