An Analysis of Locally Defined Principal Curves and Surfaces


James McQueen
Department of Statistics, University of Washington, Seattle, WA 98195, USA

Abstract

Principal curves are generally defined as smooth curves passing through the middle of the data and have wide applications in machine learning, for example in dimensionality reduction and feature extraction. Recently, Umut Ozertem and Deniz Erdogmus (O&E) provided a novel approach to defining principal curves and surfaces in their paper "Locally Defined Principal Curves and Surfaces" (2011). This report aims to reproduce the results of their paper and to provide a critical assessment of its performance, flaws, and merits.

1 Introduction

Perhaps the most popular dimension reduction tool used today is principal components analysis (PCA). PCA is an orthogonal linear transformation which transforms (rotates) the data into a new coordinate system such that the greatest variance of the projected data is associated with the first coordinate (the first principal component), the second greatest variance with the second coordinate, and so on. It is a commonly used dimension reduction tool because we can project the original data onto the subspace spanned by the first d principal components, thus preserving most of the variance of the data while reducing the dimension. As we are projecting onto a linear subspace, this can be described as linear dimension reduction.

Inherent to this method is that the first principal component is a line, the first and second principal components span a plane, and so on, and we rank these subspaces according to the amount of variance explained by each. The success of PCA as a linear dimension reduction technique raises the question: can we extend PCA to a non-linear set of principal surfaces that retain some of the desirable properties of principal lines?

2 Challenges

There are a number of challenges in non-linear dimension reduction in general; additionally, there is no agreed-upon definition of a principal curve and therefore no agreed method of estimating them. Popular non-linear dimension reduction (or manifold learning) techniques such as Isomap (Tenenbaum et al. [2000]), local linear embedding (Roweis and Saul [2000]), Laplacian eigenmaps (Belkin and Niyogi [2003]), and maximum variance unfolding (Weinberger and Saul [2006]) rely on generating locality information about the data samples from a data proximity graph. These techniques, however, depend on careful tuning of the parameters controlling the graph structure, as their accuracy depends on the quality of the graph. Furthermore, many of these techniques assume that the data truly lie on a manifold of inherent dimension d and try to recover that underlying manifold; the methods rely on the validity of this assumption.

There is currently no agreed-upon definition of principal curves, as they do not come about as naturally as principal components (lines). Principal curves are generally understood to be smooth curves passing through the middle of the data; however, a more mathematically precise definition is required. Many definitions take a property of the principal line and then try to find a smooth curve that satisfies this property: Hastie and Stuetzle [1989] require principal curves to be self-consistent, whereas Delicado [1998] restricts their total variance and conditional means.

After requiring that a principal curve satisfies some constraint, one then devises an algorithm that finds the principal curve meeting the requirement while minimizing a criterion (such as the mean squared projection error). These methods therefore aim to find a curve that best fits the data, which has two primary flaws. First, minimizing a data-weighted criterion leads to overfitting, so regularization is almost certainly required. A more philosophical issue is that principal curves ought to be thought of as inherent structures of the data generating mechanism that have to be approximated, as opposed to being defined as the solutions of an algorithm.

3 Method

Ozertem and Erdogmus (henceforth referred to as O&E) take a novel approach to principal curves by defining them as inherent structures of the underlying probability distribution of the data. They consider a principal surface defined such that every point on the surface is a local maximum (local mode) of the probability density in the local orthogonal subspace. In particular, this definition implies that the principal curve (the surface of dimension 1) is the ridge of the probability density, a natural result. Defining principal curves as structures of probability density functions leads to a differential-geometric definition depending on the gradient and Hessian of the density.

Let $p(x)$ be the density function of $x \in \mathbb{R}^n$, let $g(\cdot)$ be the gradient of $p(\cdot)$ and $H(\cdot)$ its Hessian. Let $(\lambda_i(x), q_i(x))$ be the $i$-th eigenvalue-eigenvector pair of $H(x)$. Let $\mathcal{C}^d$ be the set of points $x$ such that there exists a set $I \subset \{1, \dots, n\}$ with $|I| = n - d$ such that $g(x)^T q_i(x) = 0$ for all $i \in I$. We say $x$ is a regular point of $\mathcal{C}^d$ if the set $I$ is unique, that is, $g(x)$ is perpendicular to exactly $(n - d)$ eigenvectors. At any regular point $x$ of $\mathcal{C}^d$ the gradient is perpendicular to $n - d$ orthogonal eigenvectors, and therefore all such points must lie on a surface of intrinsic dimension $d$. Then define $\mathcal{P}^d$, the principal surface of dimension $d$, to be the set of regular points of $\mathcal{C}^d$ such that $\lambda_i(x) < 0$ for all $i \in I$; that is, $\mathcal{P}^d$ contains the local maxima in the orthogonal subspace $\mathcal{C}^d_\perp(x) = \mathrm{span}\{q_i(x) : i \in I\}$ (the gradient at these points projected onto $\mathcal{C}^d_\perp(x)$ is by definition zero).
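To make the definition concrete, the following is a minimal numerical sketch (our own illustration using NumPy and an assumed two-dimensional Gaussian density with diagonal covariance, not anything specified by O&E) that evaluates the gradient and Hessian at a point and checks the membership condition for $d = 1$: the gradient must be orthogonal to $n - d = 1$ Hessian eigenvector whose eigenvalue is negative. For this density the principal curve is the major principal axis.

```python
import numpy as np

# Illustrative 2-D Gaussian density with diagonal covariance; its principal
# curve (d = 1) is the major principal axis, i.e. the x1-axis here.
Sigma = np.diag([4.0, 1.0])
Sigma_inv = np.linalg.inv(Sigma)

def density(x):
    norm = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * x @ Sigma_inv @ x) / norm

def gradient(x):
    return -density(x) * (Sigma_inv @ x)

def hessian(x):
    v = Sigma_inv @ x
    return density(x) * (np.outer(v, v) - Sigma_inv)

def on_principal_curve(x, tol=1e-10):
    """O&E condition for d = 1 in n = 2 dimensions: g(x) is orthogonal to
    n - d = 1 eigenvector of H(x), and that eigenvalue is negative."""
    g = gradient(x)
    lam, Q = np.linalg.eigh(hessian(x))
    ortho = np.abs(Q.T @ g) < tol        # eigenvectors orthogonal to g
    return bool(np.any(ortho & (lam < 0)))

print(on_principal_curve(np.array([1.5, 0.0])))  # on the major axis  -> True
print(on_principal_curve(np.array([1.5, 0.7])))  # off the axis       -> False
```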

To see why this definition is natural, consider $\mathcal{P}^0$. This is the set of points whose gradient is orthogonal to $n - 0 = n$ eigenvectors (that is, all of them), therefore $g(x) = 0$ and they are critical points; additionally they are the critical points associated with negative eigenvalues and are thus local maxima. Hence $\mathcal{P}^0$ is the set of local modes, and $\mathcal{P}^1$ is the principal curve (surface of dimension 1), which defines the ridge of the density function, i.e. there is only one direction of increase. This is a satisfying definition as it both defines principal curves as inherent structures of the data generating mechanism and naturally extends from principal curves (of dimension 1) to principal surfaces of arbitrary dimension $d > 1$. Additionally, these surfaces exist whenever the density function admits a gradient and Hessian. In practice kernel density estimation or parametric density estimation (e.g. mixture models) is used to estimate $p$, and these estimates are composed of densities with at least second-order derivatives.

3.1 Existence and Consistency of Principal Curves

Since this definition of principal curves depends on the first and second derivatives of the density, as long as these exist and the Hessian is non-zero the principal curves exist. These conditions are mild, and since in practice kernel densities are used, the gradient and Hessian are guaranteed to exist and be continuous, so the principal surfaces will exist. Chacón et al. [2011] show that, under the assumptions that the kernel bandwidth matrix converges to zero fast enough, that the underlying density and kernel have a sufficient number of continuous square-integrable derivatives, and that the kernel has finite covariance, the integrated mean squared error between the vector of order-r derivatives of the KDE and that of the true density converges to zero. Therefore, for a sufficiently smooth kernel and density, the derivatives of the KDE are consistent. Consequently, since the principal surfaces are defined by the first and second derivatives, they too must be consistent.

4 Algorithm

O&E present an adjustment of the mean-shift algorithm that they claim will converge to the principal surface of dimension $d$, i.e. $\mathcal{P}^d$. The algorithm is initialized with either a mesh of points or the data points themselves (the latter ensures that the resulting principal surface lies in the support of the data and is also the projection of the data onto the surface).

4.1 Monotonically Increasing Functions and Local Covariance

Instead of involving the Hessian $H(x)$ directly in their subspace constrained mean-shift algorithm, the authors define a new matrix called the local covariance. To motivate this definition, the authors show in their paper that:

Lemma 4.1  For strictly increasing, twice differentiable functions $f$, the principal set $\mathcal{P}^d$ of a density $p(x)$ is the same as the principal set $\mathcal{P}^d_f$ of the transformed density $f(p(x))$.

Let $x \in \mathcal{P}^d$ for a pdf $p(x)$ with gradient $g(x)$ and Hessian $H(x)$, and let $H(x) = Q \Lambda Q^T$ be its eigendecomposition. Since $x$ is a point of $\mathcal{P}^d$, its gradient $g(x)$ is orthogonal to all eigenvectors $q_i(x)$ with $i \in I$, whose span is the orthogonal space. Let $Q_\perp$ be the matrix whose columns are these eigenvectors, and let $Q_\parallel$ consist of the remaining eigenvectors, which span the parallel space. Then we may write $H(x) = Q_\perp \Lambda_\perp Q_\perp^T + Q_\parallel \Lambda_\parallel Q_\parallel^T$, where the $\Lambda$'s are the corresponding eigenvalues. Since $g(x)$ is orthogonal to the columns of $Q_\perp$ by definition, it must lie in the parallel space: $g(x) = Q_\parallel \beta$ for some weight vector $\beta$. We then calculate the gradient and Hessian of the transformed pdf $f(p(x))$:

$$ g_f(x) = f'(p(x))\, g(x) = f'(p(x))\, Q_\parallel \beta = Q_\parallel \big(f'(p(x))\, \beta\big) \equiv Q_\parallel \tilde\beta. $$

Therefore the gradient of the transformed density is also in the parallel space, and $x \in \mathcal{C}^d_f$ as well. For the Hessian,

$$ H_f(x) = f'(p(x))\, H(x) + f''(p(x))\, g(x) g(x)^T $$
$$ \qquad = f'(p(x)) \left[ Q_\perp \Lambda_\perp Q_\perp^T + Q_\parallel \Lambda_\parallel Q_\parallel^T \right] + f''(p(x))\, Q_\parallel \beta \beta^T Q_\parallel^T $$
$$ \qquad = \left\{ f'(p(x))\, Q_\parallel \Lambda_\parallel Q_\parallel^T + f''(p(x))\, Q_\parallel \beta \beta^T Q_\parallel^T \right\} + f'(p(x))\, Q_\perp \Lambda_\perp Q_\perp^T. $$

Then, since $f'(x) > 0$ for all $x$, the signs of the eigenvalues of the orthogonal space do not change, and as such $x \in \mathcal{P}^d_f$, as required.

Consider the special case $f = \log$ with $p(x)$ a Gaussian density. Then we have the property that $H_{\log}(x) = -\Sigma^{-1}$, where $\Sigma$ is the covariance matrix of the Gaussian distribution. This implies that when the underlying density is assumed to be Gaussian, the principal curve definition coincides with principal components. This leads O&E to define a local inverse covariance for any density $p(x)$ based on the above. It also gives us an ordering of the eigenvectors (as in PCA), such that we select the $n - d$ eigenvectors associated with the $n - d$ largest eigenvalues of the local inverse covariance:

$$ \Sigma^{-1}(x) \equiv -H_{\log}(x) = -\frac{H(x)}{p(x)} + \frac{g(x) g(x)^T}{p(x)^2}. $$

By Lemma 4.1, the principal surface defined using the local inverse covariance is identical to the principal surface defined by the Hessian. In fact, ignoring the rank-one gradient term, $\Sigma^{-1}(x) \approx -H(x)/p(x)$, so we may equally take the eigenvectors of $H(x)$ associated with the $n - d$ smallest eigenvalues of the Hessian.
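As a quick numerical sanity check of this special case (again our own sketch, using NumPy and the analytic gradient and Hessian of an assumed two-dimensional Gaussian), the local inverse covariance $-H(x)/p(x) + g(x)g(x)^T/p(x)^2$ recovers the constant matrix $\Sigma^{-1}$ at every point, which is exactly why the principal surfaces of a Gaussian reduce to PCA subspaces:

```python
import numpy as np

# Analytic density, gradient and Hessian of an assumed 2-D Gaussian N(0, Sigma).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def p(x):
    norm = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * x @ Sigma_inv @ x) / norm

def g(x):
    return -p(x) * (Sigma_inv @ x)

def H(x):
    v = Sigma_inv @ x
    return p(x) * (np.outer(v, v) - Sigma_inv)

x = np.array([0.8, -1.3])                      # an arbitrary evaluation point
local_inv_cov = -H(x) / p(x) + np.outer(g(x), g(x)) / p(x) ** 2

# For a Gaussian the local inverse covariance equals Sigma^{-1} everywhere.
print(np.allclose(local_inv_cov, Sigma_inv))   # True
```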

Ghassabeh et al. [2013] point out that, while the motivation for the so-called local inverse covariance is its relationship to principal components when the underlying density is Gaussian, in practice the density used will never be exactly Gaussian, and so they argue that direct use of the Hessian or other estimates of the local covariance can be used with impunity. In fact, they prove convergence of the algorithm in all of these cases (that is, convergence in a finite number of steps, not necessarily to the principal surface). Simulations in their paper also indicate significant computational savings from using two local-covariance estimates defined by Wang and Carreira-Perpiñán [2010], with no worse performance (in terms of mean squared deviation from the underlying generative spiral) when compared to the local covariance as defined by O&E.

4.2 Mean Shift

In order to define an algorithm that converges to the principal surface of dimension $d$, $\mathcal{P}^d$, the authors adjust the well-known mean-shift algorithm; thus we briefly review what the mean-shift algorithm does and why it makes sense to adjust it for this purpose. The mean-shift algorithm is a general-purpose algorithm for finding local modes of data. It is often used for clustering (Comaniciu and Meer [2002]). It is a non-parametric method that assumes an underlying kernel density estimate. As we will be using the Gaussian kernel throughout the paper we specialize to this case. Define $k(x)$ to be the Gaussian profile, $k(x) = \exp(-\tfrac{1}{2} x)$ (different from the kernel in that the squaring operation is applied before passing the argument to $k$). Then we define the kernel density estimate of $p(x^t)$, the underlying distribution of $x^t \in \mathbb{R}^n$, based on $N$ i.i.d. samples $x_i \sim p(\cdot)$ as

$$ \hat p(x^t) \equiv \frac{1}{N h^n} \left(\frac{1}{2\pi}\right)^{n/2} \sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right). $$

A local mode is a local maximum of the density function; thus, to find a local maximum we take the gradient of the density with respect to the point of interest $x^t$ and set it to zero.

Differentiating the kernel density estimate gives

$$ \nabla \hat p(x^t) \equiv g(x^t) = \frac{2}{N h^{n+2}} \left(\frac{1}{2\pi}\right)^{n/2} \sum_{i=1}^N (x^t - x_i)\, k'\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) $$
$$ \qquad = \frac{1}{N h^{n+2}} \left(\frac{1}{2\pi}\right)^{n/2} \sum_{i=1}^N (x_i - x^t)\, k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) $$
$$ \qquad = \frac{1}{N h^{n+2}} \left(\frac{1}{2\pi}\right)^{n/2} \left[ \sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) x_i}{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right)} - x^t \right]. $$

The step in the second line follows because $k'(x) = -\tfrac{1}{2} k(x)$ for the Gaussian profile. The quantity in the second pair of brackets,

$$ m(x^t) \equiv \frac{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) x_i}{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right)} - x^t, $$

is called the mean shift, and we iteratively seek a zero of the gradient by setting

$$ x^{t+1} = x^t + m(x^t), $$

the so-called mean-shift update. When this reaches a fixed point we have found a local mode.
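The following is a minimal sketch of the Gaussian mean-shift iteration just described (our own NumPy illustration with made-up data and bandwidth, not the authors' code): starting from a point, repeatedly add the mean-shift vector until it becomes negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy sample: two Gaussian clusters in the plane.
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])
h = 0.5                                       # illustrative bandwidth

def mean_shift_vector(x, X, h):
    """m(x): kernel-weighted sample mean minus x, for the Gaussian profile."""
    w = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))
    return w @ X / w.sum() - x

def mean_shift_mode(x0, X, h, tol=1e-7, max_iter=500):
    x = x0.astype(float)
    for _ in range(max_iter):
        m = mean_shift_vector(x, X, h)
        x = x + m                             # mean-shift update x <- x + m(x)
        if np.linalg.norm(m) < tol:           # fixed point => local mode of the KDE
            break
    return x

print(mean_shift_mode(np.array([2.5, 2.5]), X, h))   # converges near (3, 3)
```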

4.3 Deriving the Gradient and Hessian

As $p(x)$ is generally unknown, O&E assume an underlying kernel density estimate. The definition applies to any estimate of $p(x)$; presently we consider fixed-bandwidth kernel density estimators with a Gaussian kernel. Since we are using the same KDE as in the mean shift, the gradient is the same:

$$ g(x^t) = \frac{1}{N h^{n+2}} \left(\frac{1}{2\pi}\right)^{n/2} \left[ \sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) x_i}{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right)} - x^t \right], $$

which we can simplify to

$$ g(x^t) = \frac{\hat p(x^t)}{h^2}\, m(x^t). $$

Taking the second derivative we get the Hessian:

$$ H(x^t) = \frac{1}{N h^{n+2}} \left(\frac{1}{2\pi}\right)^{n/2} \sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) \left[ \frac{1}{h^2} (x^t - x_i)(x^t - x_i)^T - I_n \right] $$
$$ \qquad = \frac{\hat p(x^t)}{h^4} \left[ \frac{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right) (x^t - x_i)(x^t - x_i)^T}{\sum_{i=1}^N k\!\left( \left\| \frac{x^t - x_i}{h} \right\|^2 \right)} - h^2 I_n \right] \equiv \frac{\hat p(x^t)}{h^4} \left\{ v(x^t) - h^2 I_n \right\}. $$

We are then trying to find points $x^t$ whose gradient is orthogonal to exactly $n - d$ eigenvectors of the Hessian; that is, we are looking for local modes in the orthogonal subspace as defined in Section 3. This leads to an adjustment of the mean-shift algorithm that the authors name the subspace constrained mean shift. It is similar to a projected-gradient (Goldstein [1964], Levitin and Polyak [1966]) version of the mean shift, where the mean-shift update $m(x)$ is projected onto the local orthogonal space before being used to update the trajectory $x^t$.

4.4 Subspace Constrained Mean Shift

Given the above idea of constraining the mean-shift update to the orthogonal subspace, the authors adjust the mean-shift algorithm to create the subspace constrained mean shift (SCMS) algorithm.

Subspace Constrained Mean Shift (SCMS) for a Gaussian KDE

Input: density estimate $\hat p(x)$, desired dimension $d$, tolerance $\epsilon > 0$.
Initialize: trajectories $x_1^0, \dots, x_K^0$ to a mesh or to the data points.
for $k = 1$ to $K$ do
    while not converged do
        1. $m(x_k^t) \leftarrow \dfrac{\sum_{i=1}^N k\left( \| (x_k^t - x_i)/h \|^2 \right) x_i}{\sum_{i=1}^N k\left( \| (x_k^t - x_i)/h \|^2 \right)} - x_k^t$   (evaluate the mean shift)
        2. $g(x_k^t) \leftarrow \dfrac{\hat p(x_k^t)}{h^2}\, m(x_k^t)$   (evaluate the gradient)
        3. $v(x_k^t) \leftarrow \dfrac{\sum_{i=1}^N k\left( \| (x_k^t - x_i)/h \|^2 \right) (x_k^t - x_i)(x_k^t - x_i)^T}{\sum_{i=1}^N k\left( \| (x_k^t - x_i)/h \|^2 \right)}$
        4. $H(x_k^t) \leftarrow \dfrac{\hat p(x_k^t)}{h^4} \left\{ v(x_k^t) - h^2 I_n \right\}$   (evaluate the Hessian)
        5. $\Sigma^{-1}(x_k^t) \leftarrow -\dfrac{1}{\hat p(x_k^t)} H(x_k^t) + \dfrac{1}{\hat p(x_k^t)^2} g(x_k^t) g(x_k^t)^T$   (evaluate the local inverse covariance)
        6. Perform the eigendecomposition $\Sigma^{-1}(x_k^t) = V \Lambda V^T$.
        7. $\bar V \leftarrow [v_1, \dots, v_{n-d}]$, the eigenvectors with the $(n - d)$ largest eigenvalues of $\Sigma^{-1}(x_k^t)$.
        8. $\hat m(x_k^t) \leftarrow \bar V \bar V^T m(x_k^t)$   (project $m(x_k^t)$ onto the orthogonal subspace)
        9. $x_k^{t+1} \leftarrow x_k^t + \hat m(x_k^t)$   (projected / subspace constrained mean-shift update)
        if $\left| g(x_k^t)^T \bar V \bar V^T g(x_k^t) \right| / \left( \| g(x_k^t) \| \, \| \bar V \bar V^T g(x_k^t) \| \right) < \epsilon$ then declare converged
    end while
end for

Note that each trajectory $k$ can be run without knowledge of the others (thus the for loop can be run in parallel). Parallelization decreases the computation time; however, the procedure is still inherently iterative within each trajectory and requires evaluating the kernel density as well as an eigendecomposition at each step. The algorithm is $O(N^2 n^3)$, where $N$ is the number of data points and $n$ is the dimension of the data.
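Below is a compact NumPy sketch of an SCMS trajectory following the steps above (a reimplementation for illustration only, with our own choice of toy data and bandwidth, not the authors' code). It evaluates the KDE, gradient, Hessian and local inverse covariance at the current point, projects the mean-shift vector onto the span of the $n - d$ leading eigenvectors of $\Sigma^{-1}(x)$, and iterates until the gradient is numerically orthogonal to that subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed toy data: noisy points around a half circle (a curve in the plane).
t = rng.uniform(0.0, np.pi, size=300)
X = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.05, size=(300, 2))
h, d = 0.2, 1                                  # illustrative bandwidth, target dimension
N, n = X.shape

def kde_parts(x):
    """Return p(x), g(x), H(x) and the kernel weights for a Gaussian KDE."""
    diff = x - X                               # (N, n)
    k = np.exp(-0.5 * np.sum((diff / h) ** 2, axis=1))
    const = (2 * np.pi) ** (-n / 2) / (N * h ** n)
    p = const * k.sum()
    g = -(const / h ** 2) * (k @ diff)         # = (p / h^2) * (weighted mean - x)
    v = (k[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(0) / k.sum()
    H = (p / h ** 4) * (v - h ** 2 * np.eye(n))
    return p, g, H, k

def scms_point(x0, tol=1e-6, max_iter=500):
    x = x0.astype(float)
    for _ in range(max_iter):
        p, g, H, k = kde_parts(x)
        m = k @ X / k.sum() - x                          # mean-shift vector
        Sig_inv = -H / p + np.outer(g, g) / p ** 2       # local inverse covariance
        lam, V = np.linalg.eigh(Sig_inv)
        V_perp = V[:, np.argsort(lam)[::-1][: n - d]]    # n-d largest eigenvalues
        x = x + V_perp @ (V_perp.T @ m)                  # constrained update
        # converged when g is (numerically) orthogonal to the orthogonal space
        if np.linalg.norm(V_perp.T @ g) < tol * np.linalg.norm(g):
            break
    return x

projected = np.array([scms_point(x) for x in X[:20]])    # project a few points
print(np.round(np.linalg.norm(projected, axis=1), 2))    # radii roughly 1 on the circle
```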

Thus, even when run in parallel, this algorithm can be slow for large data sets, especially in high dimension.

4.5 On the Convergence of SCMS

It should be noted that the authors claim convergence of the SCMS algorithm by relating it to the convergence of the mean-shift algorithm proposed in Comaniciu and Meer [2002]; however, Li et al. [2007] pointed out a fundamental mistake in the proof of the MS algorithm in Comaniciu and Meer [2002], so there is no proof of the optimality of the algorithm nor of whether it converges to the principal curve/surface. Recently, Ghassabeh et al. [2013] investigated the convergence properties of the SCMS algorithm. It was shown in Carreira-Perpiñán [2007] that if a Gaussian profile is used then the MS algorithm reduces to an EM algorithm and thus converges; however, use of other profiles does not guarantee convergence. Ghassabeh et al. [2013] point out that even if the MS algorithm converges, it is not obvious that this implies convergence of the SCMS, let alone to the desired principal surface. They do show, however, that the algorithm will converge (i.e. it will terminate) in a finite number of steps, though not necessarily to the correct surface.

5 Experiments

In order to examine the principal curve method proposed by O&E we perform a number of experiments. In the first two sections we compare O&E's principal curve method to the original Hastie & Stuetzle principal curve method as well as the method proposed by Kegl [1999]. In the third section we display the ability of O&E's principal curve method and the SCMS algorithm to handle more complicated data sets without changes. Finally, we perform a simulation study comparing the principal curve and wavelet denoising methods.

5.1 Standard Principal Curve Data Sets

In this section we examine how the principal curve algorithm defined by O&E performs on some standard principal curve data sets. This is primarily to ensure that the methods have been replicated accurately, and the results should look similar to those in the original paper. We compare this method to the methods proposed by Hastie & Stuetzle and Kegl.

5.1.1 Zig-Zag Data Set

Figure 1: The zig-zag data set is plotted as points, Hastie & Stuetzle's principal curve in yellow, and Kegl's polygonal line in blue [1].

Figure 1 plots the Hastie and Stuetzle curve and Kegl's polygonal line on the zig-zag data set [2]. Figure 2 plots the principal curve for a variety of values of the bandwidth parameter. Once a bandwidth is selected, the principal curve is found using the SCMS algorithm presented in Section 4.4. The algorithm is initialized at the original data points, so that the resulting curve is the projection of the data onto the principal curve as defined by O&E. The bandwidth parameters were chosen to display the importance of appropriate bandwidth selection, as the results can vary heavily with small changes.

[1] Computed using Kegl's Java application code.
[2] Data set provided by Kegl.

Figure 2: Zig-zag data set (points), O&E principal curve (blue), and an intensity map of the associated KDE estimate in green, indicating the different curves obtained for different bandwidths.

5.1.2 Spiral Data Set

Here we take the spiral data set and compare the three methods of finding principal curves. Figure 3 plots the Hastie & Stuetzle line as well as the polygonal line. Figure 4 plots the principal curve solution for different values of the bandwidth parameter. It should be noted that a bandwidth that performs well on one data set (e.g. the zig-zag data) does not necessarily perform well on another data set (e.g. the spiral data). Therefore, in practice it is advisable to use a data-dependent kernel bandwidth, for example selected by leave-one-out maximum likelihood as we do below.

Figure 3: The spiral data set is plotted as points, Hastie & Stuetzle's principal curve in yellow, and Kegl's polygonal line in blue.

Figure 4: Spiral data set (points), O&E principal curve (blue), and an intensity map of the associated KDE estimate in green, indicating the different curves obtained for different bandwidths.

5.2 Other Data Sets

In addition to the standard principal curve data sets, the principal curve method defined by O&E can handle arbitrarily complicated data sets, including those with self-intersections, bifurcations and loops, and these data sets do not require any alteration of the algorithm. Both of these are improvements over existing principal curve methods. Figure 5 shows the resulting principal curve on two complicated data sets with self-intersections: the first is a star shape and the second is an epitrochoid. Both have many loops and self-intersections but are handled by the SCMS algorithm with no additional changes.

Figure 5: The underlying data are plotted as points and the resulting O&E principal curve in blue. The left is a star and the right is an epitrochoid. These plots display the ability of principal curves as defined by the authors to handle complex data sets.

5.3 Signal Denoising

The authors claim that the principal curve method can be applied to the problem of denoising a signal. In Ozertem et al. [2008] they apply their method to piecewise-linear functions that have been corrupted by white noise, where they achieve some level of success but do not compare against other, more robust denoising methods. Here we consider a deterministic one-dimensional time signal D which has been corrupted by some form of i.i.d. mean-zero noise $\epsilon$; we let $\epsilon$ be Gaussian white noise. We define the signal D deterministically as a sum of sinusoids. The goal of any denoising method is, given a corrupted signal $X = D + \epsilon$, to estimate D. Naturally, as the variability of the noise increases this task becomes more challenging. This leads to the notion of the signal-to-noise ratio:

$$ \mathrm{SNR} = \frac{\|D\|^2}{E[\|\epsilon\|^2]}. $$

It is common in the signal processing literature to write the SNR in decibels, $\mathrm{SNR}\ (\mathrm{in\ dB}) = 10 \log_{10}(\mathrm{SNR})$, and we follow this standard.

5.3.1 Principal Curve Denoising

Here we follow Ozertem et al. [2008] in their paper on applying principal curve methods to denoise piecewise-linear signals. That is, we use the Gaussian kernel with a single bandwidth parameter, which we select by leave-one-out maximum likelihood cross-validation as in Leiva-Murillo and Rodríguez [2012]: we select the bandwidth that maximizes the leave-one-out likelihood $\prod_i \hat p_{-i}(x_i)$ over the entire data set, where $\hat p_{-i}$ is the kernel density estimate computed with the point $x_i$ excluded. The estimate of the true signal D is then $\hat D = \mathcal{P}^1$, i.e. the principal curve of the data set under the Gaussian KDE.
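A minimal sketch of this leave-one-out bandwidth selection rule follows (our own NumPy illustration; the data and the grid of candidate bandwidths are placeholders, and we maximize the sum of log leave-one-out densities, which is equivalent to maximizing the product):

```python
import numpy as np

def loo_log_likelihood(X, h):
    """Sum of log leave-one-out Gaussian-KDE densities, sum_i log p_{-i}(x_i)."""
    N, n = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # pairwise squared dists
    K = np.exp(-0.5 * d2 / h ** 2)
    np.fill_diagonal(K, 0.0)                                    # leave each point out
    p_loo = K.sum(axis=1) / ((N - 1) * (2 * np.pi) ** (n / 2) * h ** n)
    return np.sum(np.log(p_loo))

def select_bandwidth(X, candidates):
    scores = [loo_log_likelihood(X, h) for h in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))                       # placeholder data
grid = np.linspace(0.05, 1.0, 40)                   # placeholder candidate bandwidths
print(select_bandwidth(X, grid))
```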

5.3.2 Wavelet Denoising

As much of the detail of wavelet denoising theory is beyond the scope of this paper, our discussion will be brief. The discrete wavelet transform (DWT) is an orthonormal transformation: for an orthonormal matrix $\mathcal{W}$ (defined by a choice of wavelet filter) we define the wavelet coefficients to be $W = \mathcal{W} X$. Since $\mathcal{W}$ is orthonormal, this representation both preserves energy (i.e. $\|W\|^2 = \|X\|^2$) and is an exact (alternate) representation of the signal, in that we can invert the transformation to recover the data: $X = \mathcal{W}^{-1} W = \mathcal{W}^T W$, where the last step follows from the orthonormality of $\mathcal{W}$. Wavelet denoising is a special case of the more general orthonormal transformation denoising. The methodology is simple:

1. Given a signal $X$ and a choice of wavelet filter, calculate the wavelet coefficients $W$.
2. Given a threshold $\delta$, set to zero all wavelet coefficients $W_t$ such that $|W_t| < \delta$.
3. Given the thresholded coefficients $W^{(T)}$, calculate the inverse transformation to arrive at the new signal $X^{(T)}$.

We then take $X^{(T)}$ to be our estimate of D. Based on the work of Donoho and Johnstone [1994] we use the universal threshold, which is derived under an assumption of Gaussian white noise (as in this case): $\delta^{(U)} = \sqrt{2 \hat\sigma_e^2 \log(N)}$. Since the variance of the noise is typically unknown, we use the median absolute deviation (MAD) estimate $\hat\sigma_{(\mathrm{MAD})} = \mathrm{median}\{|W_{1,1}|, \dots, |W_{1,N/2}|\} / 0.6745$ computed from the finest-scale coefficients, which under certain assumptions (Donoho and Johnstone [1994]) is a consistent estimate of $\sigma_e$.
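The following sketch implements the hard-thresholding recipe above under our own assumptions: it uses the PyWavelets package (pywt) with a Daubechies filter of our choosing, estimates $\sigma_e$ from the finest-scale detail coefficients via the MAD rule, and applies the universal threshold.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4"):
    """Hard-threshold the DWT coefficients with the universal threshold."""
    coeffs = pywt.wavedec(x, wavelet)              # [cA_J, cD_J, ..., cD_1]
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745 # MAD estimate from finest scale
    delta = sigma * np.sqrt(2.0 * np.log(len(x)))  # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, delta, mode="hard") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]

# Illustrative use on a sinusoidal signal corrupted by Gaussian white noise.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 256)
D = np.sin(2 * np.pi * 4 * t) + 0.5 * np.sin(2 * np.pi * 9 * t)
X = D + rng.normal(scale=0.3, size=t.size)
print(np.mean((wavelet_denoise(X) - D) ** 2))      # MSE of the denoised signal
```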

5.3.3 Results

We fix a sample size N and generate the corrupted signal $X = D + \epsilon$, varying the SNR (a short sketch of this generation step appears at the end of this subsection). We then apply both the wavelet and principal curve methods to estimate D, and we evaluate their performance by the mean squared error (MSE) from the true signal D. We repeat this procedure 100 times for each sample size N and each SNR to provide Monte Carlo standard deviation bounds.

In Figure 6 we consider small sample sizes, N = 32 and N = 64. We plot log MSE to make the plots easier to read. In both cases we see that, in terms of mean squared error, the principal curve method outperforms the wavelet method at every signal-to-noise ratio.

Figure 6: The top panel on both sides plots the underlying (uncorrupted) signal for data sizes 32 and 64. The bottom panels plot the resulting log MSE for the principal curve method (black) and the wavelet method (blue) for varying SNRs.

What we see in Figures 7 and 8, however, is a change as we increase the sample size. For larger data sets the more theoretically well-founded wavelet denoising method vastly outperforms the principal curve method, which appears to stagnate around -8 log mean squared error regardless of sample size. There may be a variety of reasons for this. First, there exists a great deal of theory on wavelet denoising (and orthonormal transforms in general), whereas there is essentially none for principal curves. In particular, the threshold $\delta$ was derived assuming an underlying Gaussian white noise process, whereas the bandwidth selection method for the principal curve was a generic method for density estimation. That being said, the principal curve method still performs generally well, and additional theory (in particular on kernel and bandwidth selection in this specific signal denoising setting) could lead to an improved principal curve denoising method that is more comparable to state-of-the-art methods.
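For completeness, this is roughly how a corrupted signal at a prescribed SNR (in dB) can be generated for the simulation (our own sketch; the particular sinusoidal signal is a placeholder for the one actually used in the experiments):

```python
import numpy as np

def corrupt_at_snr(D, snr_db, rng):
    """Add Gaussian white noise so that ||D||^2 / E||eps||^2 hits the target SNR."""
    snr = 10.0 ** (snr_db / 10.0)
    noise_var = np.sum(D ** 2) / (len(D) * snr)    # per-sample noise variance
    return D + rng.normal(scale=np.sqrt(noise_var), size=D.shape)

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 64)
D = np.sin(2 * np.pi * 3 * t) + 0.4 * np.sin(2 * np.pi * 7 * t)   # placeholder signal
for snr_db in (0, 5, 10, 15):
    X = corrupt_at_snr(D, snr_db, rng)
    print(snr_db, np.mean((X - D) ** 2))           # noise MSE shrinks as SNR grows
```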

Figure 7: The top panel on both sides plots the underlying (uncorrupted) signal for data sizes 128 and 256. The bottom panels plot the resulting log MSE for the principal curve method (black) and the wavelet method (blue) for varying SNRs.

In order to assess whether this test was biased towards the wavelet method due to the use of a true signal built from sinusoids, the experiment was re-run using a piecewise-linear signal as in Ozertem et al. [2008]. The resulting plots (similar to those in Figures 6 through 8) are in the supplementary appendix. In that case as well, the principal curve method outperforms the wavelet method for small data sets (N < 32), but by the time N > 64 the wavelet method outperforms the principal curve method (in terms of MSE). It should be noted that KDE-SCMS is $O(N^2)$ whereas the DWT is $O(N)$ (faster than the fast Fourier transform), though in principle KDE-SCMS can be parallelized, reducing the computational load.

Figure 8: The top plots the underlying (uncorrupted) signal for the largest data size considered. The bottom plots the resulting log MSE for the principal curve method (black) and the wavelet method (blue) for varying SNRs.

6 Significance

This method of defining principal curves and surfaces offers a number of advantages over existing methods. By defining principal surfaces as inherent structures of the underlying density rather than as solutions to an optimization criterion, O&E allow for a richer definition of principal curves. Furthermore, the method extends the existing principal curves literature by defining principal surfaces in a manner that extends naturally from principal curves of dimension 1 to principal surfaces of arbitrary dimension; currently no other definition of principal curves allows for this. Additionally, this definition allows principal curves to be found in data with loops, bifurcations and self-intersections without any changes to the definition or the algorithm.

In their definition of principal curves, O&E rely on a known density function p(x). In practice, of course, this is not available and must be estimated from data. O&E take the approach of approximating p(x) via kernel density estimation. This allows them to do a number of things. First, the smoothness constraints that are usually placed on principal curves can be removed by assuming that p(x) itself is smooth, resulting in inherently smooth principal curves; if p(x) is estimated via a smooth KDE, the result will be inherently smooth. Furthermore, these kernel density estimates (with a Gaussian kernel) always have second-order derivatives, so principal curves as defined by O&E are well defined and, under certain regularity conditions, consistent. Finally, issues of overfitting and outlier robustness can be handled in the density estimation phase (which has a much larger existing literature) rather than in the principal curve approximation phase.

While this method offers new insight into principal curves, its impact on manifold learning is less well supported. The subspace constrained mean shift (SCMS) algorithm presented by the authors may converge to a principal surface of dimension d, but the resulting points still live in the larger ambient dimension n > d.

Thus, while the points are guaranteed to lie on a surface of lower inherent dimension, this method alone cannot be used for dimension reduction unless it is paired with another suitable algorithm to parametrize the principal surface or to approximate it by projecting the points onto vectors of lower dimension. Much of the literature on manifold learning assumes that there is an underlying true manifold from which the data are generated, and these methods can assess their quality by asking whether they recover the true manifold given sufficient data. The principal surface method defined by O&E does not assume an underlying manifold (in fact there is no guarantee that the resulting principal surface is itself a manifold), and it is unknown whether the method would recover the underlying manifold if the data were generated from one, or whether the principal surface of appropriate dimension can be used as a reasonable estimate of the underlying manifold.

Nevertheless, principal curves as defined by O&E have enjoyed success in signal processing. In particular, they have been used in vector quantization (Ghassabeh et al. [2012]) as well as in signal denoising (Ozertem et al. [2008]). Recently, Zhang and Pedrycz [2014] proposed extending principal curves to granular principal curves in order to apply principal curves to large data sets by granulating the data. The method proposed by O&E opens the door to a new (potentially rich) principal curve/surface framework to be studied and applied.

References

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 2003.

M. A. Carreira-Perpiñán. Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 2007.

J. E. Chacón, T. Duong, and M. P. Wand. Asymptotics for general multivariate kernel density derivative estimators. Statistica Sinica, in press, 2011.

D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 2002.

P. Delicado. Principal curves and principal oriented points, 1998.

D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 1994.

Y. A. Ghassabeh, T. Linder, and G. Takahara. On noisy source vector quantization via a subspace constrained mean shift algorithm. Proceedings of the 26th Biennial Symposium on Communications, Kingston, Canada, 2012.

Y. A. Ghassabeh, T. Linder, and G. Takahara. On some convergence properties of the subspace constrained mean shift. Pattern Recognition, 46, 2013.

A. A. Goldstein. Convex programming in Hilbert spaces. Bulletin of the American Mathematical Society, 70, 1964.

T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84, 1989.

B. Kegl. Principal curves: learning, design, and applications. PhD thesis, Concordia University, Montreal, Canada, 1999.

J. P. Leiva-Murillo and A. A. Rodríguez. Algorithms for Gaussian bandwidth selection in kernel density estimators. Pattern Recognition Letters, 33(13), 2012.

E. S. Levitin and B. T. Polyak. Constrained minimization problems. USSR Computational Mathematics and Mathematical Physics, 6, pages 1-50, 1966.

X. Li, Z. Hu, and F. Wu. A note on the convergence of the mean shift. Pattern Recognition, 40, 2007.

U. Ozertem and D. Erdogmus. Local conditions for critical and principal manifolds. IEEE International Conference on Acoustics, Speech and Signal Processing.

U. Ozertem, D. Erdogmus, and O. Arikan. Piecewise smooth signal denoising via principal curve projections. IEEE International Conference on Machine Learning for Signal Processing, 2008.

S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2000.

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2000.

W. Wang and M. A. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010.

K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1), pages 77-90, 2006.

H. Zhang and W. Pedrycz. From principal curves to granular principal curves. IEEE Transactions on Cybernetics, 44(6), 2014.


More information

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS Po-Ceng Cang National Standard Time & Frequency Lab., TL, Taiwan 1, Lane 551, Min-Tsu Road, Sec. 5, Yang-Mei, Taoyuan, Taiwan 36 Tel: 886 3

More information

Notes on Neural Networks

Notes on Neural Networks Artificial neurons otes on eural etwors Paulo Eduardo Rauber 205 Consider te data set D {(x i y i ) i { n} x i R m y i R d } Te tas of supervised learning consists on finding a function f : R m R d tat

More information

2.11 That s So Derivative

2.11 That s So Derivative 2.11 Tat s So Derivative Introduction to Differential Calculus Just as one defines instantaneous velocity in terms of average velocity, we now define te instantaneous rate of cange of a function at a point

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

Physically Based Modeling: Principles and Practice Implicit Methods for Differential Equations

Physically Based Modeling: Principles and Practice Implicit Methods for Differential Equations Pysically Based Modeling: Principles and Practice Implicit Metods for Differential Equations David Baraff Robotics Institute Carnegie Mellon University Please note: Tis document is 997 by David Baraff

More information

Chapter 2 Limits and Continuity

Chapter 2 Limits and Continuity 4 Section. Capter Limits and Continuity Section. Rates of Cange and Limits (pp. 6) Quick Review.. f () ( ) () 4 0. f () 4( ) 4. f () sin sin 0 4. f (). 4 4 4 6. c c c 7. 8. c d d c d d c d c 9. 8 ( )(

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsce. MIT Press, 997 Bayesian Model Comparison by Monte Carlo Caining David Barber D.Barber@aston.ac.uk

More information

Learning based super-resolution land cover mapping

Learning based super-resolution land cover mapping earning based super-resolution land cover mapping Feng ing, Yiang Zang, Giles M. Foody IEEE Fellow, Xiaodong Xiuua Zang, Siming Fang, Wenbo Yun Du is work was supported in part by te National Basic Researc

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

[db]

[db] Blind Source Separation based on Second-Order Statistics wit Asymptotically Optimal Weigting Arie Yeredor Department of EE-Systems, el-aviv University P.O.Box 3900, el-aviv 69978, Israel Abstract Blind

More information

Continuous Stochastic Processes

Continuous Stochastic Processes Continuous Stocastic Processes Te term stocastic is often applied to penomena tat vary in time, wile te word random is reserved for penomena tat vary in space. Apart from tis distinction, te modelling

More information

A SHORT INTRODUCTION TO BANACH LATTICES AND

A SHORT INTRODUCTION TO BANACH LATTICES AND CHAPTER A SHORT INTRODUCTION TO BANACH LATTICES AND POSITIVE OPERATORS In tis capter we give a brief introduction to Banac lattices and positive operators. Most results of tis capter can be found, e.g.,

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here!

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here! Precalculus Test 2 Practice Questions Page Note: You can expect oter types of questions on te test tan te ones presented ere! Questions Example. Find te vertex of te quadratic f(x) = 4x 2 x. Example 2.

More information

Topics in Generalized Differentiation

Topics in Generalized Differentiation Topics in Generalized Differentiation J. Marsall As Abstract Te course will be built around tree topics: ) Prove te almost everywere equivalence of te L p n-t symmetric quantum derivative and te L p Peano

More information

Fast Explicit and Unconditionally Stable FDTD Method for Electromagnetic Analysis Jin Yan, Graduate Student Member, IEEE, and Dan Jiao, Fellow, IEEE

Fast Explicit and Unconditionally Stable FDTD Method for Electromagnetic Analysis Jin Yan, Graduate Student Member, IEEE, and Dan Jiao, Fellow, IEEE Tis article as been accepted for inclusion in a future issue of tis journal. Content is final as presented, wit te exception of pagination. IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES 1 Fast Explicit

More information

arxiv: v1 [math.pr] 28 Dec 2018

arxiv: v1 [math.pr] 28 Dec 2018 Approximating Sepp s constants for te Slepian process Jack Noonan a, Anatoly Zigljavsky a, a Scool of Matematics, Cardiff University, Cardiff, CF4 4AG, UK arxiv:8.0v [mat.pr] 8 Dec 08 Abstract Slepian

More information