Entropic Graphs for Manifold Learning


Entropic Graphs for Manifold Learning
Jose A. Costa and Alfred O. Hero III
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI 48109
Email: jcosta@umich.edu, hero@eecs.umich.edu

Abstract—We propose a new algorithm that simultaneously estimates the intrinsic dimension and intrinsic entropy of random data sets lying on smooth manifolds. The method is based on asymptotic properties of entropic graph constructions. In particular, we compute the Euclidean k-nearest neighbors (k-NN) graph over the sample points and use its overall total edge length to estimate intrinsic dimension and entropy. The algorithm is validated on standard synthetic manifolds.

I. INTRODUCTION

Several interesting classes of signals arising in fields such as bioinformatics, image processing or Internet traffic analysis lie in high dimensional vector spaces. It is well known that both the computational complexity and the statistical performance of most algorithms quickly degrade as dimension increases. This phenomenon, usually known as the curse of dimensionality, makes it impracticable to process such high dimensional data sets. However, many real life signals do not fill the space entirely but are constrained to lie on a smooth low dimensional nonlinear manifold embedded in the high dimensional space. Manifold learning is concerned with the problem of discovering low dimensional structure based on a set of observed high dimensional sample points on the manifold. In the recent past, manifold learning has received substantial attention from researchers in machine learning, computer vision, signal processing and statistics [1]–[4]. This is due to the fact that effectively solving the manifold learning problem can bring considerable improvement to the solution of such diverse problems as: feature extraction in pattern recognition; multivariate density estimation and regression in statistics; data compression and coding in information theory; visualization of high dimensional data; or complexity reduction of algorithms.
Several techniques for recovering the low dimensional structure of high dimensional data have been proposed. These range from linear methods, such as principal components analysis (PCA) [5] and classical multidimensional scaling (MDS) [6]; to local methods, such as locally linear embedding (LLE) [1], locally linear projections (LLP) [7], and Hessian eigenmaps [4]; to global methods, such as ISOMAP [2]. One step common to all the manifold reconstruction algorithms mentioned above is that they require explicit knowledge of the intrinsic dimension of the manifold. In many real life applications, this parameter cannot be assumed known and has to be estimated from the data. A frequent way of doing this is to use linear projection techniques [8]: a linear map is explicitly constructed and dimension is estimated by applying PCA, factor analysis, or MDS to analyze the eigenstructure of the data. These methods rely on the assumption that only a small number of the eigenvalues of the (processed) data covariance will be significant. Linear methods tend to overestimate the intrinsic dimension, as they do not account for non-linearities in the data. Both nonlinear PCA [3] methods and ISOMAP circumvent this problem, but they still rely on unreliable and costly eigenstructure estimates. Other methods have been proposed based on local geometric techniques, e.g., estimation of local neighborhoods [8] or fractal dimension [9], and estimation of packing numbers [10] of the manifold. The closely related problem of estimating the manifold's intrinsic entropy arises when the data samples are drawn from a multivariate distribution supported on the manifold. When the distribution is absolutely continuous with respect to the Lebesgue measure restricted to the lower dimensional manifold, this intrinsic entropy can be useful for exploring data compression over the manifold or, as suggested in [11], clustering of multiple sub-populations on the manifold.
The goal of this paper is to develop an algorithm that jointly estimates both the intrinsic dimension and the intrinsic entropy on the manifold, without knowing the manifold description, given only a set of random sample points. Our approach is based on entropic graph methods; see [11] for an overview. Specifically, we construct the Euclidean k-nearest neighbors (k-NN) graph over all the sample points and use its growth rate to estimate the intrinsic dimension and entropy by a simple linear least squares and method of moments procedure. This method shares with the geodesic minimal spanning tree (GMST) method, introduced by us in previous work [12], the simplicity of avoiding reconstruction of the manifold or estimation of the multivariate density of the samples. However, it has the main advantage of reducing runtime complexity by an order of magnitude, and it is applicable to a wider class of manifolds. The remainder of the paper is organized as follows. In Section II we discuss the asymptotic behavior of the k-NN graph on a manifold and the approximation of k-NN geodesic distances by the corresponding Euclidean distances. The proposed algorithm is described in Section III. Experimental results are reported in Section IV. The theoretical results introduced in this paper are presented without proof due to space limitations; the corresponding proofs can be found in [13].

II. THE k-NN GRAPH

Let $X_1, \dots, X_n$ be independent and identically distributed (i.i.d.) random vectors with values in a compact

subset of $\mathbb{R}^d$. The (1-)nearest neighbor of $X_i$ in $\mathcal{X}_n = \{X_1, \dots, X_n\}$ is given by

\[ X_{(1),i} = \arg\min_{X \in \mathcal{X}_n \setminus \{X_i\}} d(X, X_i), \]

where distances between points are measured in terms of some suitable distance function $d(\cdot,\cdot)$. For a general integer $k \ge 1$, the k-nearest neighbor of a point is defined in a similar way. The k-NN graph puts an edge between each point in $\mathcal{X}_n$ and its k-nearest neighbors. Let $N_{k,i} = N_{k,i}(\mathcal{X}_n)$ be the set of k-nearest neighbors of $X_i$ in $\mathcal{X}_n$. The total edge length of the k-NN graph is defined as:

\[ L_{\gamma,k}(\mathcal{X}_n) = \sum_{i=1}^{n} \sum_{X \in N_{k,i}} |X - X_i|^{\gamma}, \qquad (1) \]

where $\gamma > 0$ is a power weighting constant. If $d(X, X_i) = \|X - X_i\|$, where $\|\cdot\|$ is the usual Euclidean ($L_2$) norm in $\mathbb{R}^d$, then the k-NN graph falls under the framework of continuous quasi-additive Euclidean functionals [14]. As a consequence, its almost sure (a.s.) asymptotic behavior (and also convergence in the mean) follows easily from the umbrella theorems for such graphs:

Theorem 1 ([14, Theorem 8.3]): Let $X_1, \dots, X_n$ be i.i.d. random vectors with values in a compact subset of $\mathbb{R}^d$ and Lebesgue density $f$. Let $d \ge 2$, $1 \le \gamma < d$, and define $\alpha = (d - \gamma)/d$. Then, almost surely,

\[ \lim_{n \to \infty} \frac{L_{\gamma,k}(\mathcal{X}_n)}{n^{\alpha}} = \beta_{d,\gamma,k} \int f^{\alpha}(x)\, dx, \]

where $L_{\gamma,k}(\mathcal{X}_n)$ is given by equation (1) with Euclidean distance, and $\beta_{d,\gamma,k}$ is a constant independent of $f$. Furthermore, the mean length $E[L_{\gamma,k}(\mathcal{X}_n)]/n^{\alpha}$ converges to the same limit. The integral factor in the a.s. limit is a monotonic function of the extrinsic Rényi $\alpha$-entropy of the multivariate Lebesgue density $f$:

\[ H_{\alpha}(f) = \frac{1}{1 - \alpha} \log \int f^{\alpha}(x)\, dx. \qquad (2) \]

In the limit, as $\alpha \to 1$, the usual Shannon entropy, $-\int f(x) \log f(x)\, dx$, is obtained.

Assume now that $X_1, \dots, X_n$ are constrained to lie on a compact smooth $m$-dimensional manifold $M$. The distribution of $X_i$ becomes singular with respect to Lebesgue measure, and an application of Theorem 1 results in a zero limit for the length functional of the k-NN graph. However, this behavior can be modified by changing the way distances between points are measured. For this purpose, we use the framework of Riemann manifolds.

A. Random Points in a Riemann Manifold

Given a smooth manifold $M$, a Riemann metric $g$ is a mapping which associates to each point $y \in M$ an inner product $g_y(\cdot,\cdot)$ between vectors tangent to $M$ at $y$ [15].
A Riemann manifold $(M, g)$ is just a smooth manifold $M$ with a given Riemann metric $g$. As an example, when $M$ is a submanifold of the Euclidean space $\mathbb{R}^d$, the naturally induced Riemann metric on $M$ is just the usual dot product between vectors. For any tangent vector $v$ to $M$ at $y$, we can define its norm as $\|v\| = \sqrt{g_y(v, v)}$. Using this norm, it is natural to define the length of a piecewise smooth curve $c : [0,1] \to M$ as $L(c) = \int_0^1 \|c'(t)\|\, dt$. The geodesic distance between points $y_1, y_2 \in M$ is the length of the shortest piecewise smooth curve between the two points:

\[ d_g(y_1, y_2) = \min \{ L(c) : c(0) = y_1,\ c(1) = y_2 \}. \]

Given the geodesic distance, one can construct a geodesic k-NN graph on $\mathcal{X}_n$ by computing the nearest neighbor relations between points using $d_g$ instead of the usual Euclidean distance. Consequently, we define the total edge length of this new graph as $L^g_{\gamma,k}(\mathcal{X}_n)$, where $L^g_{\gamma,k}(\mathcal{X}_n)$ is given by (1) with the correspondence $d \leftrightarrow d_g$.

We can now extend Theorem 1 to general compact Riemann manifolds. This extension, Theorem 2 below, states that the asymptotic behavior of $L^g_{\gamma,k}(\mathcal{X}_n)$ is no longer determined by the density of $X$ relative to the Lebesgue measure of $\mathbb{R}^d$, but depends instead on the density of $X$ relative to $\mu_g$, the measure induced on $M$ via the volume element [15].

Theorem 2: Let $(M, g)$ be a compact Riemann $m$-dimensional manifold. Suppose $X_1, \dots, X_n$ are i.i.d. random elements of $M$ with bounded density $f$ relative to $\mu_g$. Let $L^g_{\gamma,k}$ be the k-NN length functional computed using the geodesic distance. Assume $m \ge 2$, $1 \le \gamma < m$, and define $\alpha = (m - \gamma)/m$. Then, almost surely,

\[ \lim_{n \to \infty} \frac{L^g_{\gamma,k}(\mathcal{X}_n)}{n^{\alpha}} = \beta_{m,\gamma,k} \int_M f^{\alpha}(y)\, \mu_g(dy), \qquad (3) \]

where $\beta_{m,\gamma,k}$ is a constant independent of $f$ and $(M, g)$. Furthermore, the mean length $E[L^g_{\gamma,k}(\mathcal{X}_n)]/n^{\alpha}$ converges to the same limit. Now, the integral factor in the a.s. limit of (3) is a monotonic function of the intrinsic Rényi $\alpha$-entropy of the multivariate density $f$ on $M$:

\[ H^M_{\alpha}(f) = \frac{1}{1 - \alpha} \log \int_M f^{\alpha}(y)\, \mu_g(dy). \qquad (4) \]

An immediate consequence of Theorem 2 is that, for known $m$,

\[ \hat{H}^M_{\alpha}(\mathcal{X}_n) = \frac{m}{\gamma} \left[ \log \frac{L^g_{\gamma,k}(\mathcal{X}_n)}{n^{\alpha}} - \log \beta_{m,\gamma,k} \right] \qquad (5) \]
is an asymptotically unbiased and strongly consistent estimator of the intrinsic $\alpha$-entropy $H^M_{\alpha}(f)$. The intuition behind the proof of Theorem 2 comes from the fact that a Riemann manifold $(M, g)$, with its associated distance $d_g$ and measure $\mu_g$, looks locally like $\mathbb{R}^m$ with Euclidean distance and Lebesgue measure. This implies that on small neighborhoods of the manifold the total edge length $L^g_{\gamma,k}$ behaves like a Euclidean length functional. As $M$ is assumed compact, it can

be covered by a finite number of such neighborhoods. This fact, together with the subadditive and superadditive properties [14] of $L^g_{\gamma,k}$, allows for repeated applications of Theorem 1, resulting in (3).

B. Approximating Geodesic k-NN Distances

Assume now that $M$ is a submanifold of $\mathbb{R}^d$. In the manifold learning problem, $M$ (or any representation of it) is not known in advance. Consequently, the geodesic distances between points on $M$ cannot be computed exactly and have to be estimated solely from the data samples. In the GMST algorithm [12] (or in ISOMAP [2]), this is done by running a costly optimization algorithm over a global graph of neighborhood relations among all points. Unlike the MST, the k-NN graph is only influenced by local distances. For fixed $k$, the maximum nearest neighbor distance over all points in $\mathcal{X}_n$ goes to zero as the number of samples increases. For sufficiently large $n$, this implies that the k-nearest neighbors of each point will fall in a neighborhood of the manifold where geodesic curves are well approximated by the corresponding straight lines between end points. This suggests using the simple Euclidean k-NN distances as surrogates for the corresponding true geodesic distances. In fact, we prove that the geodesic k-NN distances are uniformly well approximated by the corresponding Euclidean k-NN distances, in the following sense:

Theorem 3: Let $(M, g)$ be a compact Riemann submanifold of $\mathbb{R}^d$, and suppose $X_1, \dots, X_n$ are i.i.d. random vectors on $M$. Then, with probability 1,

\[ \max_{1 \le i \le n} \, \max_{X \in N_{k,i}} \left| \frac{d_g(X, X_i)}{\|X - X_i\|} - 1 \right| \to 0 \quad \text{as } n \to \infty. \qquad (6) \]

III. JOINT INTRINSIC DIMENSION/ENTROPY ESTIMATION

Let $L_{\gamma,k}(\mathcal{X}_n)$ be the total edge length of the Euclidean k-NN graph over $\mathcal{X}_n$. Its asymptotic behavior is a simple consequence of Theorems 2 and 3:

Corollary 1: Let $(M, g)$ be a compact Riemann $m$-dimensional submanifold of $\mathbb{R}^d$. Suppose $X_1, \dots, X_n$ are i.i.d. random vectors on $M$ with bounded density $f$ relative to $\mu_g$. Assume $m \ge 2$, $1 \le \gamma < m$, and define $\alpha = (m - \gamma)/m$. Then, almost surely,

\[ \lim_{n \to \infty} \frac{L_{\gamma,k}(\mathcal{X}_n)}{n^{\alpha}} = \beta_{m,\gamma,k} \int_M f^{\alpha}(y)\, \mu_g(dy), \qquad (7) \]

where $\beta_{m,\gamma,k}$ is a constant independent of $f$ and $(M, g)$. Furthermore, the mean length $E[L_{\gamma,k}(\mathcal{X}_n)]/n^{\alpha}$ converges to the same limit.
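The Euclidean k-NN length functional at the center of this result is simple to compute in practice. The sketch below is our own illustration, not the authors' code: it evaluates the total edge length by brute-force pairwise distances (standing in for the more efficient nearest-neighbor constructions discussed later) on an arbitrary uniform sample, whose growth with $n$ is governed by the exponent $(d - \gamma)/d$.

```python
import numpy as np

def knn_total_length(X, k=5, gamma=1.0):
    """Total edge length of the Euclidean k-NN graph: for each point,
    sum its k nearest-neighbor distances raised to the power gamma."""
    # brute-force pairwise distances; after sorting each row, column 0
    # is the zero self-distance, so columns 1..k are the k nearest neighbors
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(np.sum(np.sort(D, axis=1)[:, 1:k + 1] ** gamma))

rng = np.random.default_rng(0)
X = rng.random((1000, 3))   # n = 1000 uniform points in [0,1]^3
L = knn_total_length(X, k=5, gamma=1.0)
# by the umbrella-theorem asymptotics, L grows roughly like n^{(3-1)/3} = n^{2/3}
```

Doubling the sample size should therefore increase the total length by roughly a factor of $2^{2/3}$ in this full-dimensional example.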
We are now ready to apply this result to jointly estimate intrinsic dimension and entropy. The key is to notice that the growth rate of the length functional is strongly dependent on $m$, while the constant in the convergent limit is determined by the intrinsic $\alpha$-entropy. We use this strong growth dependence as a motivation for a simple estimator of $m$. Define $l_n = \log L_{\gamma,k}(\mathcal{X}_n)$. According to Corollary 1, $l_n$ has the following approximation:

\[ l_n = a \log n + b + \epsilon_n, \qquad (8) \]

where

\[ a = \frac{m - \gamma}{m}, \qquad b = \log \beta_{m,\gamma,k} + \frac{\gamma}{m} H^M_{\alpha}(f), \qquad (9) \]

$\alpha = (m - \gamma)/m$, and $\epsilon_n$ is an error residual that goes to zero a.s. as $n \to \infty$. Using the additive model (8), we propose a simple nonparametric least squares strategy based on resampling from the population $\mathcal{X}_n$ of points in $M$. Specifically, let $p_1 < \dots < p_Q$, with $1 \le p_1$ and $p_Q \le n$, be $Q$ integers. For each value of $p \in \{p_1, \dots, p_Q\}$, randomly draw $N$ bootstrap datasets $\mathcal{X}_p^j$, $j = 1, \dots, N$, with replacement, where the $p$ data points within each $\mathcal{X}_p^j$ are chosen from the entire data set $\mathcal{X}_n$ independently. From these samples compute the empirical mean of the k-NN length functionals, $\bar{L}_p = N^{-1} \sum_{j=1}^{N} L_{\gamma,k}(\mathcal{X}_p^j)$. Defining $\bar{l} = [\log \bar{L}_{p_1}, \dots, \log \bar{L}_{p_Q}]^T$, we write down the linear vector model

\[ \bar{l} = A \begin{bmatrix} a \\ b \end{bmatrix} + \epsilon, \qquad A = \begin{bmatrix} \log p_1 & \dots & \log p_Q \\ 1 & \dots & 1 \end{bmatrix}^T. \qquad (10) \]

We now take a method-of-moments (MOM) approach in which we use (10) to solve for the linear least squares (LLS) estimates $\hat{a}, \hat{b}$ of $a, b$, followed by inversion of the relations (9). After making a simple large $n$ approximation, this approach yields the following estimates:

\[ \hat{m} = \mathrm{round}\left\{ \frac{\gamma}{1 - \hat{a}} \right\}, \qquad \hat{H}^M_{\alpha} = \frac{\hat{m}}{\gamma} \left( \hat{b} - \log \beta_{\hat{m},\gamma,k} \right). \qquad (11) \]

The importance of the constants $\beta_{m,\gamma,k}$ differs according to whether dimension or entropy estimation is considered. On the one hand, due to the slow growth of $\log \beta_{m,\gamma,k}$ in $m$, in the large $n$ regime for which the above estimates were derived, $\beta_{m,\gamma,k}$ is not required for the dimension estimator. On the other hand, the value of $\beta_{\hat{m},\gamma,k}$ is required for the entropy estimator to be unbiased. From the proof of Theorem 2, it follows that $\beta_{m,\gamma,k}$ is the limit of the normalized length functional of the Euclidean k-NN graph for a uniform distribution on the unit cube $[0,1]^m$.
As closed form expressions are not available, this constant must be determined by Monte Carlo simulations of the k-NN length on the corresponding unit cube for uniform random samples. We note, however, that in many applications all that is required is knowledge of the entropy up to a constant. For example, when maximum or minimum entropy is used as a discriminant on several data sets [11], only the relative ordering of the entropies is important.
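As a concrete sketch of the joint estimation procedure described in this section (our own illustration, not the authors' code: the subsample sizes, $k = 5$, $\gamma = 1$, the plane test case, and resampling without replacement are simplifying assumptions), the code below averages k-NN lengths over resampled subsets, fits the log-length versus log-size line by least squares, inverts the slope to estimate the intrinsic dimension, and plugs a Monte Carlo estimate of the constant $\beta$ into the entropy formula.

```python
import numpy as np

def knn_total_length(X, k=5, gamma=1.0):
    # total gamma-weighted edge length of the Euclidean k-NN graph
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(np.sum(np.sort(D, axis=1)[:, 1:k + 1] ** gamma))

def beta_mc(m, k=5, gamma=1.0, n=1000, trials=5, seed=0):
    # Monte Carlo estimate of the constant beta: the normalized k-NN
    # length for uniform samples on the unit cube [0,1]^m
    rng = np.random.default_rng(seed)
    alpha = (m - gamma) / m
    return float(np.mean([knn_total_length(rng.random((n, m)), k, gamma) / n**alpha
                          for _ in range(trials)]))

def estimate_dim_entropy(X, k=5, gamma=1.0,
                         sizes=(100, 200, 400, 800), n_boot=10, seed=0):
    rng = np.random.default_rng(seed)
    log_means = []
    for p in sizes:
        # empirical mean k-NN length over resampled subsets of size p
        # (drawn without replacement here, a simplification of the bootstrap)
        L = [knn_total_length(X[rng.choice(len(X), p, replace=False)], k, gamma)
             for _ in range(n_boot)]
        log_means.append(np.log(np.mean(L)))
    # least squares fit of the linear model: log-length = a*log(size) + b
    a_hat, b_hat = np.polyfit(np.log(sizes), log_means, 1)
    m_hat = int(round(gamma / (1.0 - a_hat)))   # invert a = (m - gamma)/m
    H_hat = (m_hat / gamma) * (b_hat - np.log(beta_mc(m_hat, k, gamma)))
    return m_hat, H_hat

# test case: a 2-D plane embedded in R^3, so the intrinsic dimension is 2
rng = np.random.default_rng(1)
uv = rng.random((1000, 2))
X = np.c_[uv, uv[:, 0] + uv[:, 1]]
m_hat, H_hat = estimate_dim_entropy(X)
```

The dimension estimate is insensitive to $\beta$, which only shifts the intercept; the entropy estimate, by contrast, depends directly on the Monte Carlo value of $\beta$.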

Fig. 1. The Swiss roll manifold and the corresponding k-NN graph on the sample points.

Fig. 2. Log-log plot of the average k-NN length for the Swiss roll manifold and its least squares linear fit. The estimated slope implies $\hat{m} = 2$.

Finally, the complexity of the algorithm is dominated by the search for nearest neighbors in the Euclidean metric. Using efficient constructions such as k-d trees, this task can be performed in $O(n \log n)$ time for $n$ sample points. This contrasts with both the GMST and ISOMAP, which require a costly $O(n^2 \log n)$ implementation of a geodesic pairwise distance estimation step.

IV. EXPERIMENTAL RESULTS

We illustrate the performance of the proposed k-NN algorithm on manifolds of known dimension. In all the simulations we used $\gamma = 1$. With regard to intrinsic dimension estimation, we compare our algorithm to ISOMAP. In ISOMAP, similarly to PCA, intrinsic dimension is usually estimated by looking at the residual errors as a function of subspace dimension.

A. Swiss Roll

The first manifold considered is the standard 2-dimensional Swiss roll surface [2] embedded in $\mathbb{R}^3$ (Fig. 1). Fig. 2 shows a log-log plot of the average k-NN length $\bar{l}_n$ as a function of the number of samples $n$. The good agreement between $\bar{l}_n$ and its least squares linear fit confirms the large sample behavior predicted by Corollary 1 and shows evidence in favor of the linear model (8). To compare the dimension estimation performance of the k-NN method to that of ISOMAP, we ran a Monte Carlo simulation. For each of several sample sizes, 30 independent sets of i.i.d. random vectors uniformly distributed on the surface were generated. We then counted the number of times that the intrinsic dimension was correctly estimated. To automatically estimate dimension with ISOMAP, we look at its eigenvalue residual variance plot and try to detect the "elbow" at which residuals cease to decrease significantly as the estimated dimension increases [2].
This is implemented by a simple minimum angle threshold rule. Table I shows the results of this experiment. As can be observed, the k-NN algorithm outperforms ISOMAP for small sample sizes.

TABLE I. Number of correct dimension estimates over the Monte Carlo trials, as a function of the number of samples, for the Swiss roll manifold (ISOMAP vs. k-NN).

B. Hyper-spheres

A more challenging problem is the case of the $m$-dimensional sphere $S^m$ (embedded in $\mathbb{R}^{m+1}$). This manifold does not satisfy any of the usual isometric or conformal embedding constraints required by ISOMAP or by other methods like C-ISOMAP [16] and Hessian eigenmaps [4]. Once again, we tested the algorithm over 30 generations of uniform random samples over $S^m$, for several values of $m$ and different sample sizes, and counted the number of correct dimension estimates. We note that in all the simulations ISOMAP always overestimated the intrinsic dimension as $m + 1$. The results for k-NN are shown in Table II for different values of the parameter $k$. As can be seen, the k-NN method succeeds in finding the correct intrinsic dimension. However, Table II also shows that the number of samples required to achieve the same level of accuracy increases with the manifold dimension. This is the usual curse of dimensionality phenomenon: as the dimension increases, more samples are needed for the asymptotic regime in (7) to set in and validate the limit in Corollary 1.

TABLE II. Number of correct dimension estimates over 30 trials, as a function of the number of samples, for hyper-spheres and different numbers of neighbors $k$.

C. Hyper-planes

We also investigate $m$-dimensional hyper-planes, for which PCA methods are designed. Table III shows the results of running a Monte Carlo simulation under the same conditions as in the previous subsections.
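The log-log linearity reported for the Swiss roll is easy to reproduce in a few lines. The sketch below is our own illustration (the Swiss roll parameterization and sample sizes are arbitrary choices, not necessarily the paper's settings): it fits the slope of log k-NN length against $\log n$, which should come out near $(m - \gamma)/m = 1/2$ for this 2-dimensional surface with $\gamma = 1$.

```python
import numpy as np

def knn_total_length(X, k=5, gamma=1.0):
    # total edge length of the Euclidean k-NN graph (brute force)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(np.sum(np.sort(D, axis=1)[:, 1:k + 1] ** gamma))

# sample points on a Swiss roll surface: (t*cos t, h, t*sin t)
rng = np.random.default_rng(0)
n_max = 1600
t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n_max))
h = 21.0 * rng.random(n_max)
X = np.c_[t * np.cos(t), h, t * np.sin(t)]

# slope of log k-NN length versus log n estimates (m - gamma)/m
sizes = np.array([200, 400, 800, 1600])
lengths = [knn_total_length(X[:p]) for p in sizes]
slope, intercept = np.polyfit(np.log(sizes), np.log(lengths), 1)
# a slope near 0.5 corresponds to intrinsic dimension m = 2
```

Note that the k-NN distances here are purely Euclidean; by Theorem 3 they nevertheless track the local geodesic structure once the sampling is dense relative to the gap between the roll's sheets.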

Unlike ISOMAP, which was observed to correctly predict the dimension for all sample sizes investigated, the k-NN method has a tendency to underestimate the correct dimension at smaller sample sizes. This fact can be observed in Fig. 3. The first column shows the real valued estimates of the intrinsic dimension, i.e., the estimates obtained before the rounding operation in (11). Any value that falls between the dashed lines is rounded to the middle point. The second column of Fig. 3 shows the histogram of these rounded estimates over the 30 simulation trials. We believe that the resampling strategy of the algorithm may be responsible for this underestimation. Several methods for improving the performance of the k-NN algorithm are currently under investigation.

TABLE III. Number of correct dimension estimates over 30 trials, as a function of the number of samples, for hyper-planes.

Fig. 3. Real valued intrinsic dimension estimates and histogram of the rounded estimates for the hyper-plane experiment.

D. Full Dimensional Uniform Samples on the Unit Cube

Finally, we consider uniformly distributed samples on the full dimensional unit cube $[0,1]^d$. The results, summarized in Table IV, are similar to those for hyper-planes in the previous subsection. ISOMAP correctly estimated the dimensionality of the data for all sample sizes.

TABLE IV. Number of correct dimension estimates over 30 trials, as a function of the number of samples, for the full dimensional uniform distribution on the unit cube.

V. CONCLUSION

We have introduced a novel method for intrinsic dimension and entropy estimation based on the growth rate of the Euclidean k-NN graph length functional. The proposed algorithm is applicable to a wider class of manifolds than previous methods and has reduced computational complexity. We have validated the new method by testing it on synthetic manifolds of known dimension.

In order to improve the performance of the derived estimators, a better understanding of the statistics of the error term in the linear model (8) would be important. Also of great interest is the study of the effect of additive noise on the manifold samples. With regard to applications, we plan to test the proposed algorithm on databases of faces, handwritten digits and genetic data.

ACKNOWLEDGMENT

The work presented here was partially funded by NIH PO CA763- and NSF CCR-37.

REFERENCES

[1] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[2] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
[3] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley-Interscience, 2001.
[4] D. Donoho and C. Grimes, "Hessian eigenmaps: new locally linear embedding techniques for high dimensional data," Tech. Rep., Dept. of Statistics, Stanford University, 2003.
[5] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[6] T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, London, 1994.
[7] X. Huo and J. Chen, "Local linear projection (LLP)," in Proc. First Workshop on Genomic Signal Processing and Statistics (GENSIPS), 2002.
[8] P. Verveer and R. Duin, "An evaluation of intrinsic dimensionality estimators," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 81–86, January 1995.
[9] F. Camastra and A. Vinciarelli, "Estimating the intrinsic dimension of data with a fractal-based method," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1404–1407, October 2002.
[10] B.
Kégl, "Intrinsic dimension estimation using packing numbers," in Neural Information Processing Systems: NIPS, Vancouver, CA, Dec. 2002.
[11] A. O. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95, October 2002.
[12] J. A. Costa and A. O. Hero, "Geodesic minimal spanning trees for dimension and entropy estimation in manifold learning," IEEE Trans. Signal Processing, 2003, under revision.
[13] J. A. Costa and A. O. Hero, "Manifold learning using Euclidean k-nearest neighbor graphs," in preparation, 2003.
[14] J. E. Yukich, Probability Theory of Classical Euclidean Optimization Problems, vol. 1675 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1998.
[15] M. do Carmo, Riemannian Geometry, Birkhäuser, Boston, 1992.
[16] V. de Silva and J. B. Tenenbaum, "Global versus local methods in nonlinear dimensionality reduction," in Neural Information Processing Systems (NIPS), Vancouver, Canada, Dec. 2002.