Very fast optimal bandwidth selection for univariate kernel density estimation


Very fast optimal bandwidth selection for univariate kernel density estimation

VIKAS CHANDRAKANT RAYKAR and RAMANI DURAISWAMI
Perceptual Interfaces and Reality Laboratory
Department of Computer Science and Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20783

Most automatic bandwidth selection procedures for kernel density estimates require estimation of quantities involving the density derivatives. Estimation of modes and inflexion points of densities also requires derivative estimates. The computational complexity of evaluating the density derivative at M evaluation points given N sample points from the density is O(MN). In this paper we propose a computationally efficient ε-exact approximation algorithm for univariate Gaussian kernel based density derivative estimation that reduces the computational complexity from O(MN) to linear O(N + M). The constant depends on the desired arbitrary accuracy ε. We apply the density derivative evaluation procedure to estimate the optimal bandwidth for kernel density estimation, a process that is often intractable for large data sets. For example, for N = M = 409,600 points, the direct evaluation of the density derivative takes around 12.76 hours, while the fast evaluation requires only 65 seconds with an error of around 10^-12. Algorithm details, error bounds, a procedure to choose the parameters, and numerical experiments are presented. We demonstrate the speedup achieved on bandwidth selection using the solve-the-equation plug-in method. We also demonstrate that the proposed procedure can be extremely useful for speeding up exploratory projection pursuit techniques. [CS-TR-4774/UMIACS-TR-2005-73, December 2005]

1. INTRODUCTION

Kernel density estimation/regression techniques [Wand and Jones 1995] are widely used in various inference procedures in machine learning, data mining, pattern recognition, and computer vision. Efficient use of these methods requires the optimal selection of the smoothing parameter, called the bandwidth of the kernel. A plethora of techniques have been proposed for data-driven bandwidth selection [Jones et al. 1996]. The most successful state-of-the-art methods rely on the estimation of general integrated squared density derivative functionals. This is the most computationally intensive task, the computational cost being O(N^2), in addition to the O(N^2) cost of computing the kernel density estimate. The core task is to efficiently compute an estimate of the density derivative. The currently most practically successful approach, the solve-the-equation plug-in method [Sheather and Jones 1991], involves the numerical solution of a non-linear equation; iterative methods to solve this equation involve repeated use of the density functional estimator for different bandwidths, which adds much further to the computational burden. We also point out that estimation of density derivatives comes up in various other applications, like estimation of modes and inflexion points of densities [Fukunaga and Hostetler 1975] and estimation of the derivatives of the projection index in projection pursuit algorithms [Huber 1985; Jones and Sibson 1987]. A good list of applications which require the estimation of density derivatives can be found in [Singh 1977a].

The computational complexity of evaluating the density derivative at M evaluation points given N sample points from the density is O(MN). In this paper we propose a computationally efficient ε-exact approximation algorithm for univariate Gaussian kernel based density derivative estimation that reduces the computational complexity from O(MN) to linear O(N + M). The algorithm is ε-exact in the sense that the constant hidden in O(N + M) depends on the desired accuracy, which can be arbitrary. In fact, for machine precision accuracy there is no difference between the direct and the fast methods. The proposed method can be viewed as an extension of the improved fast Gauss transform [Yang et al. 2003], originally proposed to accelerate the kernel density estimate.

The rest of the paper is organized as follows. In §2 we introduce the kernel density estimate and discuss the performance of the estimator. The kernel density derivative estimate is introduced in §3. §4 discusses the density functionals which are used by most of the automatic bandwidth selection strategies. §5 briefly describes the different strategies for automatic optimal bandwidth selection; the solve-the-equation plug-in method is described in detail. Our proposed fast method is described in §6, where algorithm details, error bounds, a procedure to choose the parameters, and numerical experiments are presented. In §7 we show the speedup achieved for bandwidth estimation both on simulated and real data. In §8 we show how the proposed procedure can be used for speeding up projection pursuit techniques. §9 concludes with a brief discussion of further extensions.

2. KERNEL DENSITY ESTIMATION

A univariate random variable X on ℝ has a density p if, for all Borel sets A of ℝ, ∫_A p(x)dx = Pr[X ∈ A]. The task of density estimation is to estimate p from an i.i.d. sample drawn from it.

Given an i.i.d. sample x_1, ..., x_N drawn from p, the estimate p̂ : ℝ × ℝ^N → ℝ is called the density estimate. The parametric approach to density estimation assumes a functional form for the density, and then estimates the unknown parameters using techniques like maximum likelihood estimation. However, unless the form of the density is known a priori, assuming a functional form for a density very often leads to erroneous inference. Nonparametric methods, on the other hand, do not make any assumption on the form of the underlying density. This is sometimes referred to as "letting the data speak for themselves" [Wand and Jones 1995]. The price to be paid is a rate of convergence slower than the 1/N rate typical of parametric methods. Some of the commonly used nonparametric estimators include histograms, kernel density estimators, and orthogonal series estimators [Izenman 1991]. The histogram is very sensitive to the placement of the bin edges, and its asymptotic convergence is much slower than that of kernel density estimators¹.

The most popular nonparametric method for density estimation is the kernel density estimator (KDE), also known as the Parzen window estimator [Parzen 1962], given by

p̂(x) = (1/Nh) Σ_{i=1}^N K((x − x_i)/h),   (1)

where K(u) is called the kernel function and h = h(N) is called the bandwidth. The bandwidth is a scaling factor which goes to zero as N → ∞. In order that p̂(x) be a bona fide density, K(u) is required to satisfy the following two conditions:

K(u) ≥ 0,  ∫ K(u) du = 1.   (2)

The kernel function essentially spreads a probability mass of 1/N, associated with each point, about its neighborhood². The most widely used kernel is the Gaussian of zero mean and unit variance:

K(u) = (1/√(2π)) e^{−u²/2}.   (3)

In this case the kernel density estimate can be written as

p̂(x) = (1/(N√(2π)h)) Σ_{i=1}^N e^{−(x−x_i)²/2h²}.   (4)

¹The best rate of convergence of the MISE of the kernel density estimate is of order N^{−4/5}, while that of the histogram is of order N^{−2/3}.
²The KDE is not very sensitive to the shape of the kernel. While the Epanechnikov kernel is the optimal kernel, in the sense that it minimizes the MISE, other kernels are not that suboptimal [Wand and Jones 1995]. The Epanechnikov kernel is not used here because it gives an estimate having a discontinuous first derivative, because of its finite support.
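As a concrete reference point, the following is a minimal NumPy sketch of the direct evaluation of Eq. 4 (the function and variable names are ours, not part of the paper); it is exactly this O(NM) sum that the rest of the paper sets out to accelerate.

    import numpy as np

    def kde_direct(x, y, h):
        # Direct O(NM) evaluation of the Gaussian KDE of Eq. 4 at the points y.
        n = len(x)
        u = (y[:, None] - x[None, :]) / h   # (M, N) matrix of (y_j - x_i)/h
        return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))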

2.1 Computational complexity

The computational cost of evaluating Eq. 4 at N points is O(N²), making it prohibitively expensive. Different methods have been proposed to accelerate this sum. If the source points are on an evenly spaced grid then we can evaluate the sum at an evenly spaced grid exactly in O(N log N) using the fast Fourier transform (FFT). One of the earliest methods, proposed especially for univariate fast kernel density estimation, was based on this idea [Silverman 1982]. For irregularly spaced data, the space is divided into boxes, and the data are assigned to the closest neighboring grid points to obtain grid counts; the KDE is then evaluated at regular grid points, and for target points not lying on the grid the value is obtained by some sort of interpolation based on the values at the neighboring grid points. As a result there is no guaranteed error bound for such methods. The Fast Gauss Transform (FGT) [Greengard and Strain 1991] is an approximation algorithm that reduces the computational complexity to O(N), at the expense of reduced precision; the constant depends on the desired precision, the dimensionality of the problem, and the bandwidth. Yang et al. [Yang et al. 2003; Yang et al. 2005] presented an extension of the fast Gauss transform (the improved fast Gauss transform, or IFGT) that is suitable for higher dimensional problems and provides comparable performance in lower dimensions. The main contribution of the current paper is the extension of the improved fast Gauss transform to accelerate the kernel density derivative estimate, and to solve the optimal bandwidth problem. Another class of methods for such problems are the dual-tree methods [Gray and Moore 2001; 2003], which are based on space partitioning trees for both the source and target points; using the tree data structure, distance bounds between nodes can be computed. An advantage of the dual-tree methods is that they work for all common kernel choices, not necessarily Gaussian.

2.2 Performance

In order to understand the performance of the KDE we need a measure of distance between two densities. The commonly used criterion, which can be easily manipulated, is the L₂ norm, also called the integrated squared error (ISE)³. The ISE between the estimate p̂(x) and the actual density p(x) is given by

ISE(p̂, p) = L₂²(p̂, p) = ∫ [p̂(x) − p(x)]² dx.   (5)

The ISE depends on a particular realization of N points. It can be averaged over these realizations to get the mean integrated squared error (MISE), defined as

MISE(p̂, p) = E[ISE(p̂, p)] = E[ ∫ [p̂(x) − p(x)]² dx ] = ∫ E[{p̂(x) − p(x)}²] dx = IMSE(p̂, p),   (6)

where IMSE is the integrated mean squared error. The MISE or IMSE does not depend on the actual data set, as we take the expectation; it is a measure of the average performance of the kernel density estimator, averaged over the support of the density and over different realizations of the points.

³Other distance measures, like the mean integrated absolute error (based on the L₁ distance [Devroye and Lugosi 2001]), the Kullback-Leibler divergence, and the Hellinger distance, are also used. In this paper we use only the L₂ criterion.

The MISE for the KDE can be shown to be (see Appendix 1 for a derivation)

MISE(p̂, p) = (1/N) ∫ [ (K_h² ∗ p)(x) − (K_h ∗ p)²(x) ] dx + ∫ [ (K_h ∗ p)(x) − p(x) ]² dx,   (7)

where ∗ is the convolution operator and K_h(x) = (1/h)K(x/h). The dependence of the MISE on the bandwidth h is not very explicit in this expression, which makes it difficult to interpret the influence of the bandwidth on the performance of the estimator. An asymptotic large-sample approximation, called the AMISE (the A is for asymptotic), is usually derived via Taylor series. Under certain assumptions⁴, the AMISE between the actual density and the estimate can be shown to be

AMISE(p̂, p) = (1/Nh) R(K) + (1/4) h⁴ μ₂(K)² R(p″),   (8)

where

R(g) = ∫ g(x)² dx,  μ₂(g) = ∫ x² g(x) dx,   (9)

and p″ is the second derivative of the density p (see Appendix 2 for a complete derivation). The first term in expression (8) is the integrated variance and the second term is the integrated squared bias. The bias is proportional to h⁴ whereas the variance is proportional to 1/Nh, which leads to the well-known bias-variance tradeoff. Based on the AMISE expression the optimal bandwidth h_AMISE can be obtained by differentiating Eq. 8 with respect to h and setting it to zero:

h_AMISE = [ R(K) / (μ₂(K)² R(p″) N) ]^{1/5}.   (10)

However this expression cannot be used directly, since R(p″) depends on the second derivative of the density p, which we are trying to estimate in the first place; we need to use an estimate of R(p″). Substituting Eq. 10 in Eq. 8, the minimum AMISE that can be attained is

inf_h AMISE(p̂, p) = (5/4) [ μ₂(K)² R(K)⁴ R(p″) ]^{1/5} N^{−4/5}.   (11)

This expression shows that the best rate of convergence of the MISE of the KDE is of order N^{−4/5}.

⁴The second derivative p″(x) is continuous, square integrable and ultimately monotone. lim_{N→∞} h = 0 and lim_{N→∞} Nh = ∞, i.e., as the number of samples N is increased, h approaches zero at a rate slower than 1/N. The kernel function is assumed to be symmetric about the origin (∫ zK(z)dz = 0) and to have finite second moment (∫ z²K(z)dz < ∞).
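Eq. 10 becomes usable once R(p″) is replaced by an estimate. As a worked special case (this is the "rules of thumb" idea reviewed in §5.1): if p is taken to be normal with standard deviation σ, then R(p″) = 3/(8√π σ⁵), and for the Gaussian kernel (R(K) = 1/(2√π), μ₂(K) = 1) Eq. 10 reduces to h = (4/3)^{1/5} σ N^{−1/5} ≈ 1.06 σ N^{−1/5}. A one-line sketch (names ours):

    import numpy as np

    def normal_reference_bandwidth(x):
        # Eq. 10 with R(p'') evaluated for a normal density:
        # h = (4/3)^(1/5) * sigma * N^(-1/5)
        return (4.0 / 3.0) ** 0.2 * np.std(x, ddof=1) * len(x) ** -0.2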

3. KERNEL DENSITY DERIVATIVE ESTIMATION

In order to estimate R(p″) we will need an estimate of the density derivative. A simple estimator for the density derivative can be obtained by taking the derivative of the kernel density estimate p̂(x) defined earlier [Bhattacharya 1967; Schuster 1969]⁵. If the kernel K is differentiable r times then the r-th density derivative estimate p̂^{(r)}(x) can be written as

p̂^{(r)}(x) = (1/(N h^{r+1})) Σ_{i=1}^N K^{(r)}((x − x_i)/h),   (12)

where K^{(r)} is the r-th derivative of the kernel K. The r-th derivative of the Gaussian kernel K(u) is given by

K^{(r)}(u) = (−1)^r H_r(u) K(u),   (13)

where H_r(u) is the r-th Hermite polynomial. The Hermite polynomials are a set of orthogonal polynomials [Abramowitz and Stegun 1972]; the first few are H_0(u) = 1, H_1(u) = u, and H_2(u) = u² − 1. Hence the density derivative estimate with the Gaussian kernel can be written as

p̂^{(r)}(x) = ((−1)^r/(√(2π) N h^{r+1})) Σ_{i=1}^N H_r((x − x_i)/h) e^{−(x−x_i)²/2h²}.   (14)

3.1 Computational complexity

The computational complexity of evaluating the r-th derivative of the density estimate due to N points at M target locations is O(rNM).

3.2 Performance

Similar to the analysis done for the KDE, the AMISE for the kernel density derivative estimate, under certain assumptions⁶, can be shown to be (see Appendix 3 for a complete derivation)

AMISE(p̂^{(r)}, p^{(r)}) = R(K^{(r)})/(N h^{2r+1}) + (1/4) h⁴ μ₂(K)² R(p^{(r+2)}).   (15)

It can be observed that the AMISE for estimating the r-th derivative depends upon the (r+2)-th derivative of the true density. Differentiating Eq. 15 with respect to h and setting it to zero, we obtain the optimal bandwidth h^r_AMISE for estimating the r-th density derivative:

h^r_AMISE = [ R(K^{(r)})(2r+1) / (μ₂(K)² R(p^{(r+2)}) N) ]^{1/(2r+5)}.   (16)

Substituting Eq. 16 in the expression for the AMISE, the minimum AMISE that can be attained is

inf_h AMISE(p̂^{(r)}, p^{(r)}) = C_r [ μ₂(K)^{2(2r+1)} R(K^{(r)})⁴ R(p^{(r+2)})^{2r+1} ]^{1/(2r+5)} N^{−4/(2r+5)},

where C_r is a constant depending on r.

⁵Some better estimators, which are not necessarily the r-th order derivatives of the KDE, have been proposed [Singh 1977b].
⁶The (r+2)-th derivative p^{(r+2)}(x) is continuous, square integrable and ultimately monotone. lim_{N→∞} h = 0 and lim_{N→∞} N h^{2r+1} = ∞, i.e., as the number of samples N is increased, h approaches zero at a rate slower than 1/N^{1/(2r+1)}. The kernel function is assumed to be symmetric about the origin (∫ zK(z)dz = 0) and to have finite second moment (∫ z²K(z)dz < ∞).
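A minimal sketch of the direct O(rNM) evaluation of Eq. 14 (names ours); NumPy's probabilists' Hermite polynomials match the H_r of Eq. 13:

    import numpy as np
    from numpy.polynomial.hermite_e import hermeval

    def kde_derivative_direct(x, y, h, r):
        # Direct evaluation of the r-th density derivative estimate (Eq. 14) at y.
        n = len(x)
        u = (y[:, None] - x[None, :]) / h
        hr = hermeval(u, [0] * r + [1])   # H_r(u), probabilists' convention
        g = np.exp(-0.5 * u**2)
        return (-1) ** r * (hr * g).sum(axis=1) / (np.sqrt(2.0 * np.pi) * n * h ** (r + 1))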

This expression shows that the best rate of convergence of the MISE of the KDE of the derivative is of order N^{−4/(2r+5)}. The rate becomes slower for higher values of r, which says that estimating the derivative is more difficult than estimating the density.

4. ESTIMATION OF DENSITY FUNCTIONALS

Rather than the actual density derivative, methods for automatic bandwidth selection require the estimation of what are known as density functionals. The general integrated squared density derivative functional is defined as

R(p^{(s)}) = ∫ [p^{(s)}(x)]² dx.   (17)

Using integration by parts, this can be written in the following form:

R(p^{(s)}) = (−1)^s ∫ p^{(2s)}(x) p(x) dx.   (18)

More specifically, for even s we are interested in estimating density functionals of the form

Φ_r = ∫ p^{(r)}(x) p(x) dx = E[p^{(r)}(X)].   (19)

An estimator for Φ_r is

Φ̂_r = (1/N) Σ_{i=1}^N p̂^{(r)}(x_i),   (20)

where p̂^{(r)}(x_i) is the estimate of the r-th derivative of the density p(x) at x = x_i. Using the kernel density derivative estimate (Eq. 12) for p̂^{(r)}(x_i), we have

Φ̂_r = (1/(N² h^{r+1})) Σ_{i=1}^N Σ_{j=1}^N K^{(r)}((x_i − x_j)/h).   (21)

It should be noted that computation of Φ̂_r is O(rN²) and hence can be very expensive if a direct algorithm is used.

4.1 Performance

The asymptotic MSE for the density functional estimator, under certain assumptions⁷, is as follows (see [Wand and Jones 1995] for a complete derivation):

AMSE(Φ̂_r, Φ_r) = [ (1/(N h^{r+1})) K^{(r)}(0) + (h²/2) μ₂(K) Φ_{r+2} ]² + (2/(N² h^{2r+1})) Φ_0 R(K^{(r)}) + (4/N) [ ∫ p^{(r)}(y)² p(y) dy − Φ_r² ].   (22)

⁷The density p has k > 2 continuous derivatives which are ultimately monotone. The (r+2)-th derivative p^{(r+2)}(x) is continuous, square integrable and ultimately monotone. lim_{N→∞} h = 0 and lim_{N→∞} N h^{2r+1} = ∞, i.e., as the number of samples N is increased, h approaches zero at a rate slower than 1/N^{1/(2r+1)}. The kernel function is assumed to be symmetric about the origin (∫ zK(z)dz = 0) and to have finite second moment (∫ z²K(z)dz < ∞).
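A direct sketch of the functional estimator of Eq. 21 for the Gaussian kernel (names ours); the O(rN²) double sum here is exactly the cost that the fast method of §6 removes:

    import numpy as np
    from numpy.polynomial.hermite_e import hermeval

    def phi_direct(x, h, r):
        # Direct O(rN^2) evaluation of Eq. 21 with the Gaussian kernel,
        # using K^(r)(u) = (-1)^r H_r(u) K(u) from Eq. 13.
        n = len(x)
        u = (x[:, None] - x[None, :]) / h
        kr = (-1) ** r * hermeval(u, [0] * r + [1]) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        return kr.sum() / (n**2 * h ** (r + 1))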

The optimal bandwidth for estimating the density functional is chosen to make the bias term zero, and is given by [Wand and Jones 1995]

g_MSE = [ −2 K^{(r)}(0) / (μ₂(K) Φ_{r+2} N) ]^{1/(r+3)}.   (23)

5. AMISE OPTIMAL BANDWIDTH SELECTION

For a practical implementation of the KDE the choice of the bandwidth h is very important. A small h leads to an estimator with small bias and large variance; a large h leads to a small variance at the expense of an increase in bias. The bandwidth h has to be chosen optimally. Various techniques have been proposed for optimal bandwidth selection; a brief survey can be found in [Jones et al. 1996] and [Wand and Jones 1995]. The best known of these include rules of thumb, oversmoothing, least squares cross-validation, biased cross-validation, direct plug-in methods, the solve-the-equation plug-in method, and the smoothed bootstrap.

5.1 Brief review of different methods

Based on the AMISE expression, the optimal bandwidth h_AMISE has the following form:

h_AMISE = [ R(K) / (μ₂(K)² R(p″) N) ]^{1/5}.   (24)

However this expression cannot be used directly, since R(p″) depends on the second derivative of the density p, which we are trying to estimate in the first place.

The rules of thumb use an estimate of R(p″) assuming that the data is generated by some parametric form of the density (typically a normal distribution). The oversmoothing methods rely on the fact that there is a simple upper bound for the AMISE-optimal bandwidth for estimation of densities with a fixed value of a particular scale measure. Least squares cross-validation directly minimizes the MISE based on a leave-one-out kernel density estimator; the problem is that the function to be minimized has a fairly large number of local minima, and the practical performance of this method is somewhat disappointing. Biased cross-validation uses the AMISE instead of the exact MISE formula; this is more stable than least squares cross-validation but has a larger bias.

The plug-in methods use an estimate of the density functional R(p″) in Eq. 24. However this is not completely automatic, since estimation of R(p″) requires the specification of another pilot bandwidth g. This bandwidth for estimating the density functional is quite different from the bandwidth h used for the kernel density estimate. As discussed in Section 4, we can find an expression for the AMISE-optimal bandwidth for the estimation of R(p″); however, this bandwidth will in turn depend on a further unknown density functional. This problem continues, since the optimal bandwidth for estimating R(p^{(s)}) will depend on R(p^{(s+1)}). The usual strategy used by the direct plug-in methods is to estimate R(p^{(l)}) for some l, with bandwidth chosen with reference to a parametric family, usually a normal density. This method is usually referred to as the l-stage direct plug-in method. As the number of stages l increases, the bias of the bandwidth decreases, since the dependence on the assumption of some parametric family decreases.

However, this comes at the price of the estimate being more variable. There is no good method for the choice of l, the most common choice being l = 2.

5.2 Solve-the-equation plug-in method

The most successful among all the current methods, both empirically and theoretically, is the solve-the-equation plug-in method [Jones et al. 1996]. This method differs from the direct plug-in approach in that the pilot bandwidth used to estimate R(p″) is written as a function of the kernel bandwidth h. We use the following version, as described in [Sheather and Jones 1991]. The AMISE-optimal bandwidth is the solution to the equation

h = [ R(K) / (μ₂(K)² Φ̂_4[γ(h)] N) ]^{1/5},   (25)

where Φ̂_4[γ(h)] is an estimate of Φ_4 = R(p″) using the pilot bandwidth γ(h), which depends on the kernel bandwidth h. The pilot bandwidth is chosen to minimize the asymptotic MSE for the estimation of Φ_4 and is given by

g_MSE = [ −2 K^{(4)}(0) / (μ₂(K) Φ_6 N) ]^{1/7}.   (26)

Substituting for N from Eq. 24, g_MSE can be written as a function of h as follows:

g_MSE = [ −2 K^{(4)}(0) μ₂(K) Φ_4 / (R(K) Φ_6) ]^{1/7} h_AMISE^{5/7}.   (27)

This suggests that we set

γ(h) = [ −2 K^{(4)}(0) μ₂(K) Φ̂_4(g_1) / (R(K) Φ̂_6(g_2)) ]^{1/7} h^{5/7},   (28)

where Φ̂_4(g_1) and Φ̂_6(g_2) are estimates of Φ_4 and Φ_6 using bandwidths g_1 and g_2 respectively:

Φ̂_4(g_1) = (1/(N(N−1) g_1^5)) Σ_{i=1}^N Σ_{j=1}^N K^{(4)}((x_i − x_j)/g_1),   (29)

Φ̂_6(g_2) = (1/(N(N−1) g_2^7)) Σ_{i=1}^N Σ_{j=1}^N K^{(6)}((x_i − x_j)/g_2).   (30)

The bandwidths g_1 and g_2 are chosen to minimize the asymptotic MSE:

g_1 = [ −2 K^{(4)}(0) / (μ₂(K) Φ̂_6 N) ]^{1/7},  g_2 = [ −2 K^{(6)}(0) / (μ₂(K) Φ̂_8 N) ]^{1/9},   (31)

where Φ̂_6 and Φ̂_8 are estimators for Φ_6 and Φ_8 respectively. We could use a similar strategy for the estimation of Φ_6 and Φ_8, but this problem would continue, since the optimal bandwidth for estimating Φ_r depends on Φ_{r+2}.

The usual strategy is to estimate Φ_r at some stage, using a quick and simple bandwidth chosen with reference to a parametric family, usually a normal density. It has been observed that as the number of stages increases, the variance of the bandwidth increases; the most common choice is to use only two stages.

If p is a normal density with variance σ², then for even r we can compute Φ_r exactly [Wand and Jones 1995]:

Φ_r = ((−1)^{r/2} r!) / ((2σ)^{r+1} (r/2)! √π).   (32)

An estimator of Φ_r will use an estimate σ̂² of the variance. Based on this we can write estimators for Φ_6 and Φ_8 as follows:

Φ̂_6 = −15/(16√π σ̂^7),  Φ̂_8 = 105/(32√π σ̂^9).   (33)

The two-stage solve-the-equation method using the Gaussian kernel can be summarized as follows.

(1) Compute an estimate σ̂ of the standard deviation σ.

(2) Estimate the density functionals Φ_6 and Φ_8 using the normal scale rule:

Φ̂_6 = −15/(16√π σ̂^7),  Φ̂_8 = 105/(32√π σ̂^9).

(3) Estimate the density functionals Φ_4 and Φ_6 using the kernel density derivative estimators with the optimal bandwidth based on the asymptotic MSE:

g_1 = [ −6/(√(2π) Φ̂_6 N) ]^{1/7},  g_2 = [ 30/(√(2π) Φ̂_8 N) ]^{1/9},

Φ̂_4(g_1) = (1/(N(N−1)√(2π) g_1^5)) Σ_{i=1}^N Σ_{j=1}^N H_4((x_i − x_j)/g_1) e^{−(x_i−x_j)²/2g_1²},

Φ̂_6(g_2) = (1/(N(N−1)√(2π) g_2^7)) Σ_{i=1}^N Σ_{j=1}^N H_6((x_i − x_j)/g_2) e^{−(x_i−x_j)²/2g_2²}.

(4) The bandwidth h is the solution to the equation

h = [ 1/(2√π Φ̂_4[γ(h)] N) ]^{1/5},

where

Φ̂_4[γ(h)] = (1/(N(N−1)√(2π) γ(h)^5)) Σ_{i=1}^N Σ_{j=1}^N H_4((x_i − x_j)/γ(h)) e^{−(x_i−x_j)²/2γ(h)²}

and

γ(h) = [ −6√2 Φ̂_4(g_1)/Φ̂_6(g_2) ]^{1/7} h^{5/7}.
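The following is a sketch of the two-stage procedure above, with the pilot functionals computed by direct O(N²) double sums for clarity (the point of this paper is precisely to replace those sums with the fast ε-exact evaluation of §6). The function names are ours, the fixed-point iteration is one simple way to solve the equation in step (4), and the constant −6√2 in γ(h) follows from Eq. 28 with the Gaussian-kernel values K^{(4)}(0) = 3/√(2π), μ₂(K) = 1, and R(K) = 1/(2√π):

    import numpy as np

    SQRT2PI = np.sqrt(2.0 * np.pi)

    def _phi_hat(x, g, r):
        # Direct O(N^2) version of Eqs. 29-30 with the Gaussian kernel.
        n = len(x)
        u = (x[:, None] - x[None, :]) / g
        if r == 4:
            hr = u**4 - 6.0 * u**2 + 3.0                      # H_4(u)
        else:
            hr = u**6 - 15.0 * u**4 + 45.0 * u**2 - 15.0      # H_6(u)
        return (hr * np.exp(-0.5 * u**2)).sum() / (n * (n - 1) * SQRT2PI * g ** (r + 1))

    def solve_the_equation_bandwidth(x, tol=1e-6, max_iter=50):
        n = len(x)
        sigma = np.std(x, ddof=1)                             # step (1)
        phi6 = -15.0 / (16.0 * np.sqrt(np.pi) * sigma**7)     # step (2), Eq. 33
        phi8 = 105.0 / (32.0 * np.sqrt(np.pi) * sigma**9)
        g1 = (-6.0 / (SQRT2PI * phi6 * n)) ** (1.0 / 7.0)     # step (3)
        g2 = (30.0 / (SQRT2PI * phi8 * n)) ** (1.0 / 9.0)
        phi4_g1, phi6_g2 = _phi_hat(x, g1, 4), _phi_hat(x, g2, 6)
        h = 1.06 * sigma * n ** -0.2        # start from the normal reference rule
        for _ in range(max_iter):           # step (4) by fixed-point iteration
            gamma = (-6.0 * np.sqrt(2.0) * phi4_g1 / phi6_g2) ** (1.0 / 7.0) * h ** (5.0 / 7.0)
            h_new = (1.0 / (2.0 * np.sqrt(np.pi) * _phi_hat(x, gamma, 4) * n)) ** 0.2
            if abs(h_new - h) < tol * h:
                break
            h = h_new
        return h_new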

This equation can be solved using any numerical routine, like the Newton-Raphson method. The main computational bottleneck is the estimation of Φ̂_4, which is of O(N²).

6. FAST DENSITY DERIVATIVE ESTIMATION

The r-th kernel density derivative estimate using the Gaussian kernel of bandwidth h is given by

p̂^{(r)}(x) = ((−1)^r/(√(2π) N h^{r+1})) Σ_{i=1}^N H_r((x − x_i)/h) e^{−(x−x_i)²/2h²}.   (34)

Let us say we have to estimate the density derivative at M target points {y_j}_{j=1}^M. More generally, we need to evaluate the following sum:

G_r(y_j) = Σ_{i=1}^N q_i H_r((y_j − x_i)/h_2) e^{−(y_j−x_i)²/h_1²},  j = 1, ..., M,   (35)

where {q_i}_{i=1}^N will be referred to as the source weights, h_1 ∈ ℝ⁺ is the bandwidth of the Gaussian, and h_2 ∈ ℝ⁺ is the bandwidth of the Hermite polynomial. The computational complexity of evaluating Eq. 35 directly is O(rNM). The fast algorithm is based on separating the x_i and y_j in the Gaussian via a factorization of the Gaussian by Taylor series, retaining only the first few terms so that the error due to truncation is less than the desired error; the Hermite polynomial is factorized via the binomial theorem. For any given ε > 0 the algorithm computes an approximation Ĝ_r(y_j) such that

|Ĝ_r(y_j) − G_r(y_j)| ≤ Q ε,   (36)

where Q = Σ_{i=1}^N |q_i|. We call Ĝ_r(y_j) an ε-exact approximation to G_r(y_j).

6.1 Factorization of the Gaussian

For any point x_* ∈ ℝ, the Gaussian can be written as

e^{−(y_j−x_i)²/h_1²} = e^{−((y_j−x_*)−(x_i−x_*))²/h_1²} = e^{−(x_i−x_*)²/h_1²} e^{−(y_j−x_*)²/h_1²} e^{2(x_i−x_*)(y_j−x_*)/h_1²}.   (37)

In Eq. 37 the first exponential, e^{−(x_i−x_*)²/h_1²}, depends only on the source coordinates x_i; the second exponential, e^{−(y_j−x_*)²/h_1²}, depends only on the target coordinates y_j. However, in the third exponential, e^{2(x_i−x_*)(y_j−x_*)/h_1²}, the source and target are entangled. This entanglement is separated using a Taylor series expansion. The factorization of the Gaussian and the evaluation of the error bounds are based on the Taylor series and Lagrange's evaluation of the remainder, which we state here without proof.

Theorem 6.1. [Taylor's Series] For any point x_* ∈ ℝ, let I ⊂ ℝ be an open set containing x_*. Let f : I → ℝ be a function which is n times differentiable on I.

Then for any x ∈ I, there is a θ with 0 < θ < 1 such that

f(x) = Σ_{k=0}^{n−1} (1/k!)(x − x_*)^k f^{(k)}(x_*) + (1/n!)(x − x_*)^n f^{(n)}(x_* + θ(x − x_*)),   (38)

where f^{(k)} is the k-th derivative of the function f.

Based on the above theorem we have the following corollary.

Corollary 6.2. Let B_{r_x}(x_*) be an open interval of radius r_x with center x_* ∈ ℝ, i.e., B_{r_x}(x_*) = {x : |x − x_*| < r_x}. Let h_1 ∈ ℝ⁺ be a positive constant and y be a fixed point such that |y − x_*| < r_y. For any x ∈ B_{r_x}(x_*) and any non-negative integer p, the function f(x) = e^{2(x−x_*)(y−x_*)/h_1²} can be written as

f(x) = Σ_{k=0}^{p−1} (2^k/k!) ((x−x_*)/h_1)^k ((y−x_*)/h_1)^k + R_p(x),   (39)

and the residual satisfies

|R_p(x)| ≤ (2^p/p!) (|x−x_*|/h_1)^p (|y−x_*|/h_1)^p e^{2|x−x_*||y−x_*|/h_1²} < (2^p/p!) (r_x r_y/h_1²)^p e^{2 r_x r_y/h_1²}.   (40)

Proof. Define a new function g(x) = e^{2x(y−x_*)/h_1²}. Using the result

g^{(k)}(x_*) = (2^k/h_1^k) e^{2x_*(y−x_*)/h_1²} ((y−x_*)/h_1)^k   (41)

and Theorem 6.1, we have, for any x ∈ B_{r_x}(x_*), a θ with 0 < θ < 1 such that

g(x) = e^{2x_*(y−x_*)/h_1²} { Σ_{k=0}^{p−1} (2^k/k!) ((x−x_*)/h_1)^k ((y−x_*)/h_1)^k + (2^p/p!) ((x−x_*)/h_1)^p ((y−x_*)/h_1)^p e^{2θ(x−x_*)(y−x_*)/h_1²} }.

Hence

f(x) = e^{2(x−x_*)(y−x_*)/h_1²} = Σ_{k=0}^{p−1} (2^k/k!) ((x−x_*)/h_1)^k ((y−x_*)/h_1)^k + R_p(x),

where

R_p(x) = (2^p/p!) ((x−x_*)/h_1)^p ((y−x_*)/h_1)^p e^{2θ(x−x_*)(y−x_*)/h_1²}.

The remainder is bounded as follows:

|R_p(x)| ≤ (2^p/p!) (|x−x_*|/h_1)^p (|y−x_*|/h_1)^p e^{2|x−x_*||y−x_*|/h_1²}  [since 0 < θ < 1]
< (2^p/p!) (r_x r_y/h_1²)^p e^{2 r_x r_y/h_1²}  [since |x−x_*| < r_x and |y−x_*| < r_y]. ∎
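A quick numerical check of Corollary 6.2 (all values below are illustrative, not from the paper): the truncated series should approach e^{2(x−x_*)(y−x_*)/h_1²}, and the observed error should stay under the stated bound.

    import numpy as np
    from math import factorial

    h1, x, y, c = 0.5, 0.3, 0.7, 0.25       # bandwidth, source, target, center
    a, b = x - c, y - c
    exact = np.exp(2.0 * a * b / h1**2)
    for p in (2, 4, 8):
        series = sum((2.0**k / factorial(k)) * (a / h1)**k * (b / h1)**k for k in range(p))
        bound = (2.0**p / factorial(p)) * (abs(a) * abs(b) / h1**2)**p \
                * np.exp(2.0 * abs(a) * abs(b) / h1**2)   # Eq. 40
        print(p, abs(exact - series), bound)   # error decays rapidly with p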

Using Corollary 6.2 the Gaussian can now be factorized as

e^{−(y_j−x_i)²/h_1²} = e^{−(x_i−x_*)²/h_1²} [ Σ_{k=0}^{p−1} (2^k/k!) ((x_i−x_*)/h_1)^k ((y_j−x_*)/h_1)^k ] e^{−(y_j−x_*)²/h_1²} + error_p,   (42)

where

|error_p| ≤ (2^p/p!) (|x_i−x_*|/h_1)^p (|y_j−x_*|/h_1)^p e^{−(|x_i−x_*| − |y_j−x_*|)²/h_1²}.   (43)

6.2 Factorization of the Hermite polynomial

The r-th Hermite polynomial can be written as [Wand and Jones 1995]

H_r(x) = Σ_{l=0}^{⌊r/2⌋} a_l x^{r−2l},  where a_l = ((−1)^l r!)/(2^l l! (r−2l)!).

Hence

H_r((y_j − x_i)/h_2) = Σ_{l=0}^{⌊r/2⌋} a_l ((y_j−x_*)/h_2 − (x_i−x_*)/h_2)^{r−2l}.

Using the binomial theorem, (a + b)^n = Σ_{m=0}^n C(n,m) a^m b^{n−m}, the x_i and y_j can be separated as follows:

((y_j−x_*)/h_2 − (x_i−x_*)/h_2)^{r−2l} = Σ_{m=0}^{r−2l} C(r−2l, m) (−1)^m ((x_i−x_*)/h_2)^m ((y_j−x_*)/h_2)^{r−2l−m}.

Substituting in the previous equation we have

H_r((y_j − x_i)/h_2) = Σ_{l=0}^{⌊r/2⌋} Σ_{m=0}^{r−2l} a_{lm} ((x_i−x_*)/h_2)^m ((y_j−x_*)/h_2)^{r−2l−m},   (44)

where

a_{lm} = ((−1)^{l+m} r!)/(2^l l! m! (r−2l−m)!).   (45)
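The coefficients of Eq. 45 can be verified numerically; the sketch below (with illustrative values, names ours) checks that the double sum of Eq. 44 reproduces H_r((y_j − x_i)/h_2):

    import numpy as np
    from math import factorial
    from numpy.polynomial.hermite_e import hermeval

    def a_lm(r, l, m):
        # Eq. 45
        return (-1) ** (l + m) * factorial(r) / (
            2**l * factorial(l) * factorial(m) * factorial(r - 2*l - m))

    r, h2, xi, yj, c = 4, 0.5, 0.3, 0.9, 0.25   # illustrative values
    direct = hermeval((yj - xi) / h2, [0] * r + [1])
    expanded = sum(a_lm(r, l, m) * ((xi - c) / h2) ** m * ((yj - c) / h2) ** (r - 2*l - m)
                   for l in range(r // 2 + 1) for m in range(r - 2*l + 1))
    print(direct, expanded)   # the two values agree to rounding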

6.3 Regrouping of the terms

Using Eqs. 42 and 44, G_r(y_j), after ignoring the error terms, can be approximated as

Ĝ_r(y_j) = Σ_{k=0}^{p−1} Σ_{l=0}^{⌊r/2⌋} Σ_{m=0}^{r−2l} a_{lm} B_{km} e^{−(y_j−x_*)²/h_1²} ((y_j−x_*)/h_1)^k ((y_j−x_*)/h_2)^{r−2l−m},

where

B_{km} = (2^k/k!) Σ_{i=1}^N q_i e^{−(x_i−x_*)²/h_1²} ((x_i−x_*)/h_1)^k ((x_i−x_*)/h_2)^m.

The coefficients B_{km} can be evaluated separately in O(prN). Evaluation of Ĝ_r(y_j) at M points is O(pr²M). Hence the computational complexity has been reduced from the quadratic O(rNM) to the linear O(prN + pr²M).

6.4 Space subdivision

Thus far we have used the Taylor series expansion about a single point x_*. However, if we use the same x_* for all the points we would typically require a very high truncation number p, since the Taylor series gives a good approximation only in a small open interval around x_*. We therefore uniformly subdivide the space into K intervals of length 2r_x. The N source points are assigned to K clusters S_n, n = 1, ..., K, with c_n the center of each cluster. The aggregated coefficients are computed for each cluster, and the total contribution from all the clusters is summed up:

Ĝ_r(y_j) = Σ_{n=1}^K Σ_{k=0}^{p−1} Σ_{l=0}^{⌊r/2⌋} Σ_{m=0}^{r−2l} a_{lm} B^n_{km} e^{−(y_j−c_n)²/h_1²} ((y_j−c_n)/h_1)^k ((y_j−c_n)/h_2)^{r−2l−m},   (46)

where

B^n_{km} = (2^k/k!) Σ_{x_i∈S_n} q_i e^{−(x_i−c_n)²/h_1²} ((x_i−c_n)/h_1)^k ((x_i−c_n)/h_2)^m.   (47)

6.5 Decay of the Gaussian

Since the Gaussian decays very rapidly, a further speedup is achieved if we ignore all the sources belonging to a cluster whenever the cluster is farther than a certain distance from the target point, i.e., |y_j − c_n| > r_y. The cluster cutoff radius r_y depends on the desired error ε. Substituting h_1 = √2 h and h_2 = h, we have

Ĝ_r(y_j) = Σ_{|y_j−c_n|≤r_y} Σ_{k=0}^{p−1} Σ_{l=0}^{⌊r/2⌋} Σ_{m=0}^{r−2l} a_{lm} B^n_{km} e^{−(y_j−c_n)²/2h²} ((y_j−c_n)/h)^{k+r−2l−m},   (48)

where

B^n_{km} = (1/k!) Σ_{x_i∈S_n} q_i e^{−(x_i−c_n)²/2h²} ((x_i−c_n)/h)^{k+m}.   (49)
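To make §§6.3-6.5 concrete, here is a self-contained sketch of the r = 0 special case of Eqs. 48-49, i.e., a fast evaluation of the plain Gaussian sum Σ_i q_i e^{−(y_j−x_i)²/2h²} (all names ours; the full algorithm additionally carries the Hermite indices l and m, and chooses p by the rule of §6.7.2 rather than fixing it):

    import numpy as np
    from math import sqrt, log, factorial

    def fast_gauss_sum(x, q, y, h, p=12, eps=1e-6):
        rx = h / 2.0                                   # interval half-length (Sec. 6.7.2)
        ry = rx + 2.0 * h * sqrt(log(1.0 / eps))       # cutoff radius, Eq. 60 with r = 0
        lo = x.min()
        idx = np.floor((x - lo) / (2.0 * rx)).astype(int)   # assign sources to intervals
        K = idx.max() + 1
        centers = lo + (np.arange(K) + 0.5) * 2.0 * rx
        B = np.zeros((K, p))                           # cluster coefficients, Eq. 49, m = 0
        for n in range(K):
            t = (x[idx == n] - centers[n]) / h
            w = q[idx == n] * np.exp(-0.5 * t**2)
            for k in range(p):
                B[n, k] = np.sum(w * t**k) / factorial(k)
        G = np.zeros(len(y))
        for j, yj in enumerate(y):                     # sum only over nearby clusters (Sec. 6.5)
            for n in np.nonzero(np.abs(yj - centers) <= ry)[0]:
                u = (yj - centers[n]) / h
                G[j] += np.exp(-0.5 * u**2) * np.polynomial.polynomial.polyval(u, B[n])
        return G

For moderate sizes the result can be checked directly against the O(NM) double sum.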

6.6 Computational and space complexity

Computing the coefficients B^n_{km} for all the clusters is O(prN). Evaluation of Ĝ_r(y_j) at M points is O(npr²M), where n is the maximum number of neighbor clusters which influence a target y_j. Hence the total computational complexity is O(prN + npr²M). Assuming N = M, the total computational complexity is O(cN), where the constant c = pr + npr² depends on the desired error, the bandwidth, and r. For each cluster we need to store all the pr coefficients, so the storage needed is O(prK + N + M).

6.7 Error bounds and choosing the parameters

Given any ε > 0, we want to choose the following parameters: K (the number of intervals), r_y (the cutoff radius for each cluster), and p (the truncation number), such that for any target point y_j

|Ĝ_r(y_j) − G_r(y_j)| ≤ Q ε,   (50)

where Q = Σ_{i=1}^N |q_i|. Let us define Δ_ij to be the pointwise error in Ĝ_r(y_j) contributed by the i-th source x_i. We require that

|Ĝ_r(y_j) − G_r(y_j)| = |Σ_{i=1}^N Δ_ij| ≤ Σ_{i=1}^N |Δ_ij| ≤ Q ε.   (51)

One way to achieve this is to require |Δ_ij| ≤ |q_i| ε for each i = 1, ..., N. We choose this strategy because it helps us get tighter bounds. Let c_n be the center of the cluster to which x_i belongs. There are two different ways in which a source can contribute to the error. The first is due to ignoring the cluster S_n when it is outside a given radius r_y from the target point y_j. In this case,

Δ_ij = q_i H_r((y_j − x_i)/h) e^{−(y_j−x_i)²/2h²}.   (52)

For all clusters which are within a distance r_y from the target point, the error is due to the truncation of the Taylor series after order p. From Eq. 43, and using the fact that h_1 = √2 h and h_2 = h, we have

|Δ_ij| ≤ (|q_i|/p!) |H_r((y_j − x_i)/h)| (|x_i−c_n|/h)^p (|y_j−c_n|/h)^p e^{−(|x_i−c_n| − |y_j−c_n|)²/2h²}.   (53)

6.7.1 Choosing the cutoff radius. From Eq. 52 we require

|H_r((y_j − x_i)/h)| e^{−(y_j−x_i)²/2h²} ≤ ε.   (54)

[Fig. 1. The error at y_j due to a source x_i, i.e., Δ_ij (Eq. 62), as a function of |y_j − c_n|, for different values of p, with h = 0.1, r = 4, q_i = 1, and a fixed |x_i − c_n|. The error increases as a function of |y_j − c_n|, reaches a maximum, and then starts decreasing; the maximum is marked with a *.]

We use the following inequality to bound the Hermite polynomial [Baxter and Roussos 2002]:

|H_r((y_j − x_i)/h)| ≤ √(r!) e^{(y_j−x_i)²/4h²}.   (55)

Substituting this bound in Eq. 54, we need

e^{−(y_j−x_i)²/4h²} ≤ ε/√(r!),   (56)

which implies that |y_j − x_i| > 2h √(ln(√(r!)/ε)). Using the reverse triangle inequality, |a − b| ≥ ||a| − |b||, and the facts that |y_j − c_n| > r_y and |x_i − c_n| ≤ r_x, we have

|y_j − x_i| = |(y_j − c_n) − (x_i − c_n)| ≥ ||y_j − c_n| − |x_i − c_n|| > r_y − r_x.   (57)

So in order that the error due to ignoring the faraway clusters is less than |q_i| ε, we have to choose r_y such that

r_y − r_x > 2h √(ln(√(r!)/ε)).   (58)

If we choose r_y > r_x, then

r_y > r_x + 2h √(ln(√(r!)/ε)).   (59)

Let R be the maximum distance between any source and target point. Then we choose the cutoff radius as

r_y = r_x + min( R, 2h √(ln(√(r!)/ε)) ).   (60)
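Eq. 60 translates directly into code (a one-function sketch; names ours):

    from math import sqrt, log, factorial

    def cutoff_radius(h, r, eps, rx, R):
        # Eq. 60: r_y = r_x + min(R, 2h sqrt(ln(sqrt(r!)/eps)))
        return rx + min(R, 2.0 * h * sqrt(log(sqrt(factorial(r)) / eps)))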

6.7.2 Choosing the truncation number. For all sources for which |y_j − c_n| ≤ r_y we have

|Δ_ij| ≤ (|q_i|/p!) |H_r((y_j − x_i)/h)| (|x_i−c_n|/h)^p (|y_j−c_n|/h)^p e^{−(|x_i−c_n| − |y_j−c_n|)²/2h²}.   (61)

Using the bound on the Hermite polynomial (Eq. 55), this can be written as

|Δ_ij| ≤ (|q_i| √(r!)/p!) (|x_i−c_n|/h)^p (|y_j−c_n|/h)^p e^{−(|x_i−c_n| − |y_j−c_n|)²/4h²}.   (62)

For a given source x_i we have to choose p such that |Δ_ij| ≤ |q_i| ε. Δ_ij depends both on the distance between the source and the cluster center, |x_i − c_n|, and on the distance between the target and the cluster center, |y_j − c_n|. The speedup is achieved because at each cluster S_n we sum up the effect of all the sources; as a result we have no knowledge of |y_j − c_n|, so we have to bound the right-hand side of Eq. 62 in a way that is independent of |y_j − c_n|. Fig. 1 shows the error at y_j due to a source x_i, i.e., Δ_ij (Eq. 62), as a function of |y_j − c_n| for different values of p, with h = 0.1 and r = 4. The error increases as a function of |y_j − c_n|, reaches a maximum, and then starts decreasing. The maximum is attained at (obtained by taking the first derivative of the right-hand side of Eq. 62 and setting it to zero)

|y_j − c_n|_* = ( |x_i − c_n| + √(|x_i − c_n|² + 8ph²) ) / 2.   (63)

Hence we choose p such that

Δ_ij [ at |y_j − c_n| = |y_j − c_n|_* ] ≤ |q_i| ε.   (64)

In case |y_j − c_n|_* > r_y, we need to choose p based on r_y instead, since Δ_ij will be much lower there. Hence our strategy for choosing p is (we choose r_x = h/2)

Δ_ij [ at |y_j − c_n| = min(|y_j − c_n|_*, r_y) and |x_i − c_n| = h/2 ] ≤ |q_i| ε.   (65)
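A sketch of this strategy (names ours): increase p until the bound of Eq. 62, evaluated at the worst-case |y_j − c_n| of Eq. 63 (capped at r_y) with |x_i − c_n| = r_x = h/2, drops below ε.

    from math import sqrt, exp, factorial

    def choose_truncation_number(h, r, eps, ry, p_max=100):
        a = h / 2.0                                    # r_x = h/2
        for p in range(1, p_max + 1):
            b = min((a + sqrt(a * a + 8.0 * p * h * h)) / 2.0, ry)   # Eq. 63, capped at r_y
            err = (sqrt(factorial(r)) / factorial(p)) * (a * b / h**2) ** p \
                  * exp(-((a - b) ** 2) / (4.0 * h**2))              # bound of Eq. 62
            if err <= eps:
                return p
        return p_max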

6.8 Numerical experiments

In this section we present some numerical studies of the speedup and error as a function of the number of data points, the bandwidth, the order r, and the desired error ε. The algorithms were programmed in C++ and run on a 1.6 GHz Pentium M processor with 512 MB of RAM.

Figure 2 shows the running time and the maximum absolute error relative to Q for both the direct and the fast methods as a function of N = M. The bandwidth was h = 0.1 and the order of the derivative was r = 4. The source and target points were uniformly distributed in the unit interval. We see that the running time of the fast method grows linearly as the number of sources and targets increases, while that of the direct evaluation grows quadratically. We also observe that the actual error is well below the desired error, thus validating our bound; however, the bound is not very tight.

[Fig. 2. (a) The running time in seconds and (b) the maximum absolute error relative to Q for the direct and the fast methods as a function of N. N = M source and target points were uniformly distributed in the unit interval. For large N the timing results for the direct evaluation were obtained by evaluating the result at a reduced number of points and then extrapolating. (h = 0.1, r = 4, and ε = 10^−6.)]

Figure 3 shows the tradeoff between precision and speedup: an increase in speedup is obtained at the cost of reduced accuracy. Figure 4 shows the results as a function of the bandwidth; better speedup is obtained at larger bandwidths. Figure 5 shows the results for different orders of the density derivatives.

7. SPEEDUP ACHIEVED FOR BANDWIDTH ESTIMATION

The solve-the-equation plug-in method of [Jones et al. 1996] was implemented in MATLAB, with the core computational task of computing the density derivative written in C++.

7.1 Synthetic data

We demonstrate the speedup achieved on the mixture of normal densities used by Marron and Wand [Marron and Wand 1992]. The family of normal mixture densities is extremely rich and, in fact, any density can be approximated arbitrarily well by a member of this family. Fig. 6 shows the fifteen densities which were used by the authors in [Marron and Wand 1992] as typical representatives of the densities likely to be encountered in real data situations. We sampled N = 50,000 points from each density. The AMISE-optimal bandwidth was estimated both using the direct method and using the proposed fast method. Table I shows the speedup achieved and the absolute relative error. Fig. 6 shows the actual density together with the estimated density using the optimal bandwidth estimated with the fast method.

7.2 Real data

We used the Adult database from the UCI machine learning repository [Newman et al. 1998]. The database, extracted from the census bureau database, contains 32,561 training instances with 14 attributes per instance. Of the 14 attributes, 6 are continuous and 8 nominal. Table II shows the speedup achieved and the absolute relative error for two of the continuous attributes.
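For a flavor of the synthetic-data experiment, the sketch below draws from a simple two-component normal mixture (the parameters are illustrative, not Marron and Wand's exact ones) and runs the solve-the-equation sketch given after §5.2:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    pick = rng.random(n) < 0.5
    x = np.where(pick, rng.normal(-1.0, 0.4, n), rng.normal(1.0, 0.4, n))
    h_opt = solve_the_equation_bandwidth(x)   # sketch defined after Sec. 5.2
    print("estimated AMISE-optimal bandwidth:", h_opt)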

[Fig. 3. (a) The speedup achieved and (b) the maximum absolute error relative to Q for the direct and the fast methods as a function of ε. N = M = 50,000 source and target points were uniformly distributed in the unit interval. (h = 0.1 and r = 4.)]

[Fig. 4. (a) The running time in seconds and (b) the maximum absolute error relative to Q for the direct and the fast methods as a function of the bandwidth h. N = M = 50,000 source and target points were uniformly distributed in the unit interval. (ε = 10^−6 and r = 4.)]

8. PROJECTION PURSUIT

Projection pursuit (PP) is an exploratory technique for visualizing and analyzing large multivariate data sets [Friedman and Tukey 1974; Huber 1985; Jones and Sibson 1987]. The idea of projection pursuit is to search for projections from high- to low-dimensional space that are most interesting. These projections can then be used for other nonparametric fitting and other data-analytic purposes. Conventional dimension reduction techniques, like principal component analysis, look for a projection that maximizes the variance; the idea of PP is to look for projections that maximize other measures of interestingness, like non-normality or entropy.

[Fig. 5. (a) The running time in seconds and (b) the maximum absolute error relative to Q for the direct and the fast methods as a function of the order r. N = M = 50,000 source and target points were uniformly distributed in the unit interval. (ε = 10^−6 and h = 0.1.)]

[Table I. The bandwidth estimated using the solve-the-equation plug-in method for the fifteen normal mixture densities of Marron and Wand. h_direct and h_fast are the bandwidths estimated using the direct and the fast methods respectively; the columns show h_direct, h_fast, the running times T_direct and T_fast in seconds, the speedup, and the absolute relative error, defined as |h_direct − h_fast|/h_direct. N points were sampled from the corresponding densities; for the fast method we used ε = 10^−3.]

The PP algorithm for finding the most interesting one-dimensional subspace is as follows.

(1) Given N data points in a d-dimensional space (centered and scaled), {x_i ∈ ℝ^d}_{i=1}^N, project each data point onto the direction vector a ∈ ℝ^d, i.e., z_i = a^T x_i.

(2) Compute the univariate nonparametric kernel density estimate p̂ of the projected points z_i.

(3) Compute the projection index I(a) based on the density estimate.

(4) Locally optimize over the choice of a to get the most interesting projection of the data.

(5) Repeat from a new initial projection to get a different view.

[Fig. 6. The fifteen normal mixture densities of Marron and Wand: (a) Gaussian, (b) skewed unimodal, (c) strongly skewed, (d) kurtotic unimodal, (e) outlier, (f) bimodal, (g) separated bimodal, (h) skewed bimodal, (i) trimodal, (j) claw, (k) double claw, (l) asymmetric claw, (m) asymmetric double claw, (n) smooth comb, (o) discrete comb. The solid line corresponds to the actual density, while the dotted line is the estimated density using the optimal bandwidth estimated with the fast method.]

[Table II. Optimal bandwidth estimation for five continuous attributes (including Age and fnlwgt) of the Adult database from the UCI machine learning repository. The database contains 32,561 training instances. The bandwidth was estimated using the solve-the-equation plug-in method; the columns show h_direct, h_fast, the running times T_direct and T_fast in seconds, the speedup, and the absolute relative error |h_direct − h_fast|/h_direct. For the fast method we used ε = 10^−3.]

The projection index is designed to reveal specific structure in the data, like clusters, outliers, or smooth manifolds. Some of the commonly used projection indices are the Friedman-Tukey index [Friedman and Tukey 1974], the entropy index [Jones and Sibson 1987], and the moment index. The entropy index, based on Rényi's order-1 entropy, is given by

I(a) = −∫ p(z) log p(z) dz.   (66)

The density of zero mean and unit variance which uniquely minimizes this is the standard normal density; thus the projection index finds the direction which is most non-normal. In practice we need to use an estimate p̂ of the true density p, for example the kernel density estimate using the Gaussian kernel. Thus we have an estimate of the entropy index as follows:

Î(a) = −∫ log p̂(z) p(z) dz = −E[log p̂(z)] ≈ −(1/N) Σ_{i=1}^N log p̂(z_i) = −(1/N) Σ_{i=1}^N log p̂(a^T x_i).   (67)

The entropy index Î(a) has to be optimized over the d-dimensional vector a, subject to the constraint ‖a‖ = 1. The optimization procedure requires the gradient of the objective function; for the index defined above the gradient can be written as

(d/da)[Î(a)] = −(1/N) Σ_{i=1}^N (p̂′(a^T x_i)/p̂(a^T x_i)) x_i.   (68)

For PP, the computational burden is greatly reduced if we use the proposed fast method. The burden is reduced in the following three instances (a sketch of the index computation follows the list):

(1) computation of the kernel density estimate;

(2) estimation of the optimal bandwidth;

(3) computation of the first derivative of the kernel density estimate, which is required in the optimization procedure.
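A sketch of Eqs. 67-68 with a direct O(N²) Gaussian KDE (names ours; in the paper both the KDE and the derivative sum would instead be evaluated with the fast method of §6):

    import numpy as np

    def entropy_index_and_gradient(X, a, h):
        z = X @ a                                  # step (1): project onto direction a
        u = (z[:, None] - z[None, :]) / h
        g = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        n = len(z)
        p = g.sum(axis=1) / (n * h)                # KDE of the projected points, Eq. 4
        dp = -(u * g).sum(axis=1) / (n * h**2)     # its first derivative, Eq. 14 with r = 1
        index = -np.mean(np.log(p))                # Eq. 67
        grad = -(dp / p) @ X / n                   # Eq. 68
        return index, grad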

[Fig. 7. The estimated density using the optimal bandwidth estimated with the fast method, for two of the continuous attributes (Age and fnlwgt) in the Adult database from the UCI machine learning repository.]

[Fig. 8. (a) The original image. (b) The centered and scaled RGB space; each pixel in the image is a point in the RGB space. (c) KDE of the projection of the pixels on the most interesting direction found by projection pursuit. (d) The assignment of the pixels to the three modes in the KDE.]

Fig. 8 shows an example of the PP algorithm on an image. Fig. 8(a) shows the original image of a hand with a ring against a background. Perceptually the image has three distinct regions: the hand, the ring, and the background. Each pixel is represented as a point in a three-dimensional RGB space. Fig. 8(b) shows the presence of three clusters in the RGB space. We ran the PP algorithm on this space. Fig. 8(c) shows the KDE of the points projected on the most interesting direction; this direction is clearly able to distinguish the three clusters. Fig. 8(d) shows the segmentation where each pixel is assigned to the mode nearest to it.

9. CONCLUSIONS

We proposed a fast ε-exact algorithm for kernel density derivative estimation which reduces the computational complexity from O(N²) to O(N).

We demonstrated the speedup achieved for optimal bandwidth estimation both on simulated and real data. As an example, we demonstrated how to potentially speed up the projection pursuit algorithm. We focused on the univariate case in the current paper, since the bandwidth selection procedures for the univariate case are quite mature. Bandwidth selection for the multivariate case is a field of very active research [Wand and Jones 1994]; our future work includes the relatively straightforward but more involved extension of the current procedure to handle higher dimensions. As pointed out earlier, many applications other than bandwidth estimation require derivative estimates, and we hope that our fast computation scheme will benefit all the related applications. The C++ code is available for academic use by contacting the first author.

10. APPENDIX 1: MISE FOR KERNEL DENSITY ESTIMATORS

First note that MISE = IMSE:

MISE(p̂, p) = E[ ∫ [p̂(x) − p(x)]² dx ] = ∫ E[{p̂(x) − p(x)}²] dx = IMSE(p̂, p).   (69)

The mean squared error (MSE) can be decomposed into the variance and the squared bias of the estimator:

MSE(p̂, p, x) = E[{p̂(x) − p(x)}²] = Var[p̂(x)] + (E[p̂(x)] − p(x))².   (70)

The kernel density estimate p̂(x) is given by

p̂(x) = (1/Nh) Σ_{i=1}^N K((x − x_i)/h) = (1/N) Σ_{i=1}^N K_h(x − x_i),

where K_h(x) = (1/h)K(x/h).

10.1 Bias

The mean of the estimator can be written as

E[p̂(x)] = (1/N) Σ_{i=1}^N E[K_h(x − x_i)] = E[K_h(x − X)] = ∫ K_h(x − y) p(y) dy.   (71)

Using the convolution operator ∗ we have

E[p̂(x)] − p(x) = (K_h ∗ p)(x) − p(x).   (72)

The bias is the difference between the smoothed version (using the kernel) of the density and the actual density.

10.2 Variance

The variance of the estimator can be written as

Var[p̂(x)] = (1/N) Var[K_h(x − X)] = (1/N) ( E[K_h(x − X)²] − E[K_h(x − X)]² ).   (73)

Using Eq. 71 we have the following expression for the variance:

Var[p̂(x)] = (1/N) [ (K_h² ∗ p)(x) − (K_h ∗ p)²(x) ].   (74)

10.3 MSE

Using Eq. 72 and Eq. 74, the MSE at a point x can be written as

MSE(p̂, p, x) = (1/N) [ (K_h² ∗ p)(x) − (K_h ∗ p)²(x) ] + [ (K_h ∗ p)(x) − p(x) ]².   (75)

10.4 MISE

Since MISE = IMSE, we have

MISE(p̂, p) = (1/N) ∫ [ (K_h² ∗ p)(x) − (K_h ∗ p)²(x) ] dx + ∫ [ (K_h ∗ p)(x) − p(x) ]² dx.   (76)

The dependence of the MISE on the bandwidth h is not very explicit in this expression, which makes it difficult to interpret the influence of the bandwidth on the performance of the estimator. An asymptotic approximation to this expression, called the AMISE, is usually derived.

11. APPENDIX 2: ASYMPTOTIC MISE FOR KERNEL DENSITY ESTIMATORS

In order to derive a large-sample approximation to the MISE, we make the following assumptions on the density p, the bandwidth h, and the kernel K.

(1) The second derivative p″(x) is continuous, square integrable and ultimately monotone⁸.

(2) lim_{N→∞} h = 0 and lim_{N→∞} Nh = ∞, i.e., as the number of samples N is increased, h approaches zero at a rate slower than 1/N.

(3) In order that p̂(x) be a valid density we assume K(z) ≥ 0 and ∫ K(z)dz = 1. The kernel function is assumed to be symmetric about the origin (∫ zK(z)dz = 0) and to have finite second moment (∫ z²K(z)dz < ∞).

11.1 Bias

From Eq. 72 and a change of variables we have

E[p̂(x)] = (K_h ∗ p)(x) = ∫ K_h(x − y) p(y) dy = ∫ K(z) p(x − hz) dz.   (77)

Using Taylor series, p(x − hz) can be expanded as

p(x − hz) = p(x) − hz p′(x) + (h²z²/2) p″(x) + o(h²).   (78)

Hence

E[p̂(x)] = p(x) ∫ K(z)dz − h p′(x) ∫ zK(z)dz + (h²/2) p″(x) ∫ z²K(z)dz + o(h²).   (79)

⁸An ultimately monotone function is one that is monotone over both (−∞, −M) and (M, ∞) for some M > 0.

From Assumption 3 we have

∫ K(z)dz = 1,  ∫ zK(z)dz = 0,  μ₂(K) = ∫ z²K(z)dz < ∞.   (80)

Hence

E[p̂(x)] − p(x) = (h²/2) μ₂(K) p″(x) + o(h²).   (81)

The KDE is asymptotically unbiased. The bias is directly proportional to the value of the second derivative of the density function, i.e., the curvature of the density function.

11.2 Variance

From Eq. 74 and a change of variables we have

Var[p̂(x)] = (1/N) [ (K_h² ∗ p)(x) − (K_h ∗ p)²(x) ]
= (1/N) ∫ K_h(x − y)² p(y) dy − (1/N) [ ∫ K_h(x − y) p(y) dy ]²
= (1/Nh) ∫ K(z)² p(x − hz) dz − (1/N) [ ∫ K(z) p(x − hz) dz ]².   (82)

Using Taylor series, p(x − hz) can be expanded as

p(x − hz) = p(x) + o(1).   (83)

We need only the first term because of the factor 1/N. Hence

Var[p̂(x)] = (1/Nh) [p(x) + o(1)] ∫ K(z)² dz − (1/N) [p(x) + o(1)]²
= (1/Nh) p(x) R(K) + o(1/Nh).   (84)

Based on Assumption 2, lim_{N→∞} Nh = ∞, so the variance asymptotically converges to zero.

11.3 MSE

The MSE at a point x can be written as (using Eqs. 81 and 84)

MSE(p̂, p, x) = (1/Nh) p(x) R(K) + (h⁴/4) μ₂(K)² p″(x)² + o(h⁴ + 1/Nh),   (85)

where R(K) = ∫ K(z)² dz.

11.4 MISE

Since MISE = IMSE, we have

MISE(p̂, p) = (1/Nh) R(K) ∫ p(x)dx + (h⁴/4) μ₂(K)² ∫ p″(x)² dx + o(h⁴ + 1/Nh)
= AMISE(p̂, p) + o(h⁴ + 1/Nh),   (86)

where

AMISE(p̂, p) = (1/Nh) R(K) + (h⁴/4) μ₂(K)² R(p″).   (87)

12. APPENDIX 3: AMISE FOR KERNEL DENSITY DERIVATIVE ESTIMATORS

First note that MISE = IMSE:

MISE(p̂^{(r)}, p^{(r)}) = E[ ∫ [p̂^{(r)}(x) − p^{(r)}(x)]² dx ] = ∫ E[{p̂^{(r)}(x) − p^{(r)}(x)}²] dx = IMSE(p̂^{(r)}, p^{(r)}).   (88)

The mean squared error (MSE) can be decomposed into the variance and the squared bias of the estimator:

MSE(p̂^{(r)}, p^{(r)}, x) = E[{p̂^{(r)}(x) − p^{(r)}(x)}²] = Var[p̂^{(r)}(x)] + (E[p̂^{(r)}(x)] − p^{(r)}(x))².   (89)

A simple estimator for the density derivative can be obtained by taking the derivative of the kernel density estimate p̂(x) [Bhattacharya 1967; Schuster 1969]. If the kernel K is differentiable r times, then the r-th density derivative estimate p̂^{(r)}(x) can be written as

p̂^{(r)}(x) = (1/(N h^{r+1})) Σ_{i=1}^N K^{(r)}((x − x_i)/h) = (1/N) Σ_{i=1}^N K^{(r)}_h(x − x_i),   (90)

where K^{(r)} is the r-th derivative of the kernel K and K^{(r)}_h(x) = (1/h^{r+1}) K^{(r)}(x/h).

In order to derive a large-sample approximation to the MISE, we make the following assumptions on the density p, the bandwidth h, and the kernel K.

(1) The (r+2)-th derivative p^{(r+2)}(x) is continuous, square integrable and ultimately monotone⁹.

(2) lim_{N→∞} h = 0 and lim_{N→∞} N h^{2r+1} = ∞, i.e., as the number of samples N is increased, h approaches zero at a rate slower than 1/N^{1/(2r+1)}.

⁹An ultimately monotone function is one that is monotone over both (−∞, −M) and (M, ∞) for some M > 0.

(3) In order that p̂(x) be a valid density we assume K(z) ≥ 0 and ∫ K(z)dz = 1. The kernel function is assumed to be symmetric about the origin (∫ zK(z)dz = 0) and to have finite second moment (∫ z²K(z)dz < ∞).

12.1 Bias

The mean of the estimator can be written as

E[p̂^{(r)}(x)] = (1/N) Σ_{i=1}^N E[K^{(r)}_h(x − x_i)] = E[K^{(r)}_h(x − X)] = ∫ K^{(r)}_h(x − y) p(y) dy.   (91)

Using the convolution operator ∗ we have

E[p̂^{(r)}(x)] = (K^{(r)}_h ∗ p)(x) = (K_h ∗ p^{(r)})(x),   (92)

where we have used the relation K^{(r)} ∗ p = K ∗ p^{(r)}. We now derive a large-sample approximation to the mean. Using a change of variables, the mean can be written as

E[p̂^{(r)}(x)] = (K_h ∗ p^{(r)})(x) = ∫ K_h(x − y) p^{(r)}(y) dy = ∫ K(z) p^{(r)}(x − hz) dz.   (93)

Using Taylor series, p^{(r)}(x − hz) can be expanded as

p^{(r)}(x − hz) = p^{(r)}(x) − hz p^{(r+1)}(x) + (h²z²/2) p^{(r+2)}(x) + o(h²).   (94)

Hence

E[p̂^{(r)}(x)] = p^{(r)}(x) ∫ K(z)dz − h p^{(r+1)}(x) ∫ zK(z)dz + (h²/2) p^{(r+2)}(x) ∫ z²K(z)dz + o(h²).   (95)

From Assumption 3 we have

∫ K(z)dz = 1,  ∫ zK(z)dz = 0,  μ₂(K) = ∫ z²K(z)dz < ∞.   (96)

Hence the bias can be written as

E[p̂^{(r)}(x)] − p^{(r)}(x) = (h²/2) μ₂(K) p^{(r+2)}(x) + o(h²).   (97)

The estimate is asymptotically unbiased. The bias in estimating the r-th derivative is directly proportional to the value of the (r+2)-th derivative of the density function.


More information

Exercises for numerical differentiation. Øyvind Ryan

Exercises for numerical differentiation. Øyvind Ryan Exercises for numerical differentiation Øyvind Ryan February 25, 2013 1. Mark eac of te following statements as true or false. a. Wen we use te approximation f (a) (f (a +) f (a))/ on a computer, we can

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

New Streamfunction Approach for Magnetohydrodynamics

New Streamfunction Approach for Magnetohydrodynamics New Streamfunction Approac for Magnetoydrodynamics Kab Seo Kang Brooaven National Laboratory, Computational Science Center, Building 63, Room, Upton NY 973, USA. sang@bnl.gov Summary. We apply te finite

More information

Scalable machine learning for massive datasets: Fast summation algorithms

Scalable machine learning for massive datasets: Fast summation algorithms Scalable machine learning for massive datasets: Fast summation algorithms Getting good enough solutions as fast as possible Vikas Chandrakant Raykar vikas@cs.umd.edu University of Maryland, CollegePark

More information

Kernel Smoothing and Tolerance Intervals for Hierarchical Data

Kernel Smoothing and Tolerance Intervals for Hierarchical Data Clemson University TigerPrints All Dissertations Dissertations 12-2016 Kernel Smooting and Tolerance Intervals for Hierarcical Data Cristoper Wilson Clemson University, cwilso6@clemson.edu Follow tis and

More information

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x) Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of

More information

2.8 The Derivative as a Function

2.8 The Derivative as a Function .8 Te Derivative as a Function Typically, we can find te derivative of a function f at many points of its domain: Definition. Suppose tat f is a function wic is differentiable at every point of an open

More information

Artificial Neural Network Model Based Estimation of Finite Population Total

Artificial Neural Network Model Based Estimation of Finite Population Total International Journal of Science and Researc (IJSR), India Online ISSN: 2319-7064 Artificial Neural Network Model Based Estimation of Finite Population Total Robert Kasisi 1, Romanus O. Odiambo 2, Antony

More information

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error.

AMS 147 Computational Methods and Applications Lecture 09 Copyright by Hongyun Wang, UCSC. Exact value. Effect of round-off error. Lecture 09 Copyrigt by Hongyun Wang, UCSC Recap: Te total error in numerical differentiation fl( f ( x + fl( f ( x E T ( = f ( x Numerical result from a computer Exact value = e + f x+ Discretization error

More information

Gradient Descent etc.

Gradient Descent etc. 1 Gradient Descent etc EE 13: Networked estimation and control Prof Kan) I DERIVATIVE Consider f : R R x fx) Te derivative is defined as d fx) = lim dx fx + ) fx) Te cain rule states tat if d d f gx) )

More information

An Empirical Bayesian interpretation and generalization of NL-means

An Empirical Bayesian interpretation and generalization of NL-means Computer Science Tecnical Report TR2010-934, October 2010 Courant Institute of Matematical Sciences, New York University ttp://cs.nyu.edu/web/researc/tecreports/reports.tml An Empirical Bayesian interpretation

More information

A = h w (1) Error Analysis Physics 141

A = h w (1) Error Analysis Physics 141 Introduction In all brances of pysical science and engineering one deals constantly wit numbers wic results more or less directly from experimental observations. Experimental observations always ave inaccuracies.

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

Solving Continuous Linear Least-Squares Problems by Iterated Projection

Solving Continuous Linear Least-Squares Problems by Iterated Projection Solving Continuous Linear Least-Squares Problems by Iterated Projection by Ral Juengling Department o Computer Science, Portland State University PO Box 75 Portland, OR 977 USA Email: juenglin@cs.pdx.edu

More information

2.3 Algebraic approach to limits

2.3 Algebraic approach to limits CHAPTER 2. LIMITS 32 2.3 Algebraic approac to its Now we start to learn ow to find its algebraically. Tis starts wit te simplest possible its, and ten builds tese up to more complicated examples. Fact.

More information

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems Comp. Part. Mec. 04) :357 37 DOI 0.007/s4057-04-000-9 Optimal parameters for a ierarcical grid data structure for contact detection in arbitrarily polydisperse particle systems Dinant Krijgsman Vitaliy

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

Deconvolution problems in density estimation

Deconvolution problems in density estimation Deconvolution problems in density estimation Dissertation zur Erlangung des Doktorgrades Dr. rer. nat. der Fakultät für Matematik und Wirtscaftswissenscaften der Universität Ulm vorgelegt von Cristian

More information

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS

HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS HOW TO DEAL WITH FFT SAMPLING INFLUENCES ON ADEV CALCULATIONS Po-Ceng Cang National Standard Time & Frequency Lab., TL, Taiwan 1, Lane 551, Min-Tsu Road, Sec. 5, Yang-Mei, Taoyuan, Taiwan 36 Tel: 886 3

More information

1 Introduction to Optimization

1 Introduction to Optimization Unconstrained Convex Optimization 2 1 Introduction to Optimization Given a general optimization problem of te form min x f(x) (1.1) were f : R n R. Sometimes te problem as constraints (we are only interested

More information

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a?

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a? Solutions to Test 1 Fall 016 1pt 1. Te grap of a function f(x) is sown at rigt below. Part I. State te value of eac limit. If a limit is infinite, state weter it is or. If a limit does not exist (but is

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these.

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these. Mat 11. Test Form N Fall 016 Name. Instructions. Te first eleven problems are wort points eac. Te last six problems are wort 5 points eac. For te last six problems, you must use relevant metods of algebra

More information

Differential equations. Differential equations

Differential equations. Differential equations Differential equations A differential equation (DE) describes ow a quantity canges (as a function of time, position, ) d - A ball dropped from a building: t gt () dt d S qx - Uniformly loaded beam: wx

More information

A h u h = f h. 4.1 The CoarseGrid SystemandtheResidual Equation

A h u h = f h. 4.1 The CoarseGrid SystemandtheResidual Equation Capter Grid Transfer Remark. Contents of tis capter. Consider a grid wit grid size and te corresponding linear system of equations A u = f. Te summary given in Section 3. leads to te idea tat tere migt

More information

ON RENYI S ENTROPY ESTIMATION WITH ONE-DIMENSIONAL GAUSSIAN KERNELS. Septimia Sarbu

ON RENYI S ENTROPY ESTIMATION WITH ONE-DIMENSIONAL GAUSSIAN KERNELS. Septimia Sarbu ON RENYI S ENTROPY ESTIMATION WITH ONE-DIMENSIONAL GAUSSIAN KERNELS Septimia Sarbu Department of Signal Processing Tampere University of Tecnology PO Bo 527 FI-330 Tampere, Finland septimia.sarbu@tut.fi

More information

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h Lecture Numerical differentiation Introduction We can analytically calculate te derivative of any elementary function, so tere migt seem to be no motivation for calculating derivatives numerically. However

More information

CS522 - Partial Di erential Equations

CS522 - Partial Di erential Equations CS5 - Partial Di erential Equations Tibor Jánosi April 5, 5 Numerical Di erentiation In principle, di erentiation is a simple operation. Indeed, given a function speci ed as a closed-form formula, its

More information

Differential Calculus (The basics) Prepared by Mr. C. Hull

Differential Calculus (The basics) Prepared by Mr. C. Hull Differential Calculus Te basics) A : Limits In tis work on limits, we will deal only wit functions i.e. tose relationsips in wic an input variable ) defines a unique output variable y). Wen we work wit

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab

Te comparison of dierent models M i is based on teir relative probabilities, wic can be expressed, again using Bayes' teorem, in terms of prior probab To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsce. MIT Press, 997 Bayesian Model Comparison by Monte Carlo Caining David Barber D.Barber@aston.ac.uk

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

Data-Based Optimal Bandwidth for Kernel Density Estimation of Statistical Samples

Data-Based Optimal Bandwidth for Kernel Density Estimation of Statistical Samples Commun. Teor. Pys. 70 (208) 728 734 Vol. 70 No. 6 December 208 Data-Based Optimal Bandwidt for Kernel Density Estimation of Statistical Samples Zen-Wei Li ( 李振伟 ) 2 and Ping He ( 何平 ) 3 Center for Teoretical

More information

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations International Journal of Applied Science and Engineering 2013. 11, 4: 361-373 Parameter Fitted Sceme for Singularly Perturbed Delay Differential Equations Awoke Andargiea* and Y. N. Reddyb a b Department

More information

On Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys

On Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys American Journal of Teoretical and Applied Statistics 2018; 7(3): 92-101 ttp://www.sciencepublisinggroup.com/j/ajtas doi: 10.11648/j.ajtas.20180703.11 ISSN: 2326-8999 (Print); ISSN: 2326-9006 (Online)

More information

Digital Filter Structures

Digital Filter Structures Digital Filter Structures Te convolution sum description of an LTI discrete-time system can, in principle, be used to implement te system For an IIR finite-dimensional system tis approac is not practical

More information

POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY

POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY APPLICATIONES MATHEMATICAE 36, (29), pp. 2 Zbigniew Ciesielski (Sopot) Ryszard Zieliński (Warszawa) POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY Abstract. Dvoretzky

More information

Finite Difference Method

Finite Difference Method Capter 8 Finite Difference Metod 81 2nd order linear pde in two variables General 2nd order linear pde in two variables is given in te following form: L[u] = Au xx +2Bu xy +Cu yy +Du x +Eu y +Fu = G According

More information

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER*

ERROR BOUNDS FOR THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BRADLEY J. LUCIER* EO BOUNDS FO THE METHODS OF GLIMM, GODUNOV AND LEVEQUE BADLEY J. LUCIE* Abstract. Te expected error in L ) attimet for Glimm s sceme wen applied to a scalar conservation law is bounded by + 2 ) ) /2 T

More information

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example,

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example, NUMERICAL DIFFERENTIATION James T Smit San Francisco State University In calculus classes, you compute derivatives algebraically: for example, f( x) = x + x f ( x) = x x Tis tecnique requires your knowing

More information

. If lim. x 2 x 1. f(x+h) f(x)

. If lim. x 2 x 1. f(x+h) f(x) Review of Differential Calculus Wen te value of one variable y is uniquely determined by te value of anoter variable x, ten te relationsip between x and y is described by a function f tat assigns a value

More information

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds.

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds. Numerical solvers for large systems of ordinary differential equations based on te stocastic direct simulation metod improved by te and Runge Kutta principles Flavius Guiaş Abstract We present a numerical

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t).

1. Consider the trigonometric function f(t) whose graph is shown below. Write down a possible formula for f(t). . Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd, periodic function tat as been sifted upwards, so we will use

More information

Function Composition and Chain Rules

Function Composition and Chain Rules Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

Computational tractability of machine learning algorithms for tall fat data

Computational tractability of machine learning algorithms for tall fat data Computational tractability of machine learning algorithms for tall fat data Getting good enough solutions as fast as possible Vikas Chandrakant Raykar vikas@cs.umd.edu University of Maryland, CollegePark

More information

arxiv: v1 [math.pr] 28 Dec 2018

arxiv: v1 [math.pr] 28 Dec 2018 Approximating Sepp s constants for te Slepian process Jack Noonan a, Anatoly Zigljavsky a, a Scool of Matematics, Cardiff University, Cardiff, CF4 4AG, UK arxiv:8.0v [mat.pr] 8 Dec 08 Abstract Slepian

More information

The Verlet Algorithm for Molecular Dynamics Simulations

The Verlet Algorithm for Molecular Dynamics Simulations Cemistry 380.37 Fall 2015 Dr. Jean M. Standard November 9, 2015 Te Verlet Algoritm for Molecular Dynamics Simulations Equations of motion For a many-body system consisting of N particles, Newton's classical

More information

MVT and Rolle s Theorem

MVT and Rolle s Theorem AP Calculus CHAPTER 4 WORKSHEET APPLICATIONS OF DIFFERENTIATION MVT and Rolle s Teorem Name Seat # Date UNLESS INDICATED, DO NOT USE YOUR CALCULATOR FOR ANY OF THESE QUESTIONS In problems 1 and, state

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here!

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here! Precalculus Test 2 Practice Questions Page Note: You can expect oter types of questions on te test tan te ones presented ere! Questions Example. Find te vertex of te quadratic f(x) = 4x 2 x. Example 2.

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

Functions of the Complex Variable z

Functions of the Complex Variable z Capter 2 Functions of te Complex Variable z Introduction We wis to examine te notion of a function of z were z is a complex variable. To be sure, a complex variable can be viewed as noting but a pair of

More information

Learning based super-resolution land cover mapping

Learning based super-resolution land cover mapping earning based super-resolution land cover mapping Feng ing, Yiang Zang, Giles M. Foody IEEE Fellow, Xiaodong Xiuua Zang, Siming Fang, Wenbo Yun Du is work was supported in part by te National Basic Researc

More information

Efficient algorithms for for clone items detection

Efficient algorithms for for clone items detection Efficient algoritms for for clone items detection Raoul Medina, Caroline Noyer, and Olivier Raynaud Raoul Medina, Caroline Noyer and Olivier Raynaud LIMOS - Université Blaise Pascal, Campus universitaire

More information

Handling Missing Data on Asymmetric Distribution

Handling Missing Data on Asymmetric Distribution International Matematical Forum, Vol. 8, 03, no. 4, 53-65 Handling Missing Data on Asymmetric Distribution Amad M. H. Al-Kazale Department of Matematics, Faculty of Science Al-albayt University, Al-Mafraq-Jordan

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

Robust Average Derivative Estimation. February 2007 (Preliminary and Incomplete Do not quote without permission)

Robust Average Derivative Estimation. February 2007 (Preliminary and Incomplete Do not quote without permission) Robust Average Derivative Estimation Marcia M.A. Scafgans Victoria inde-wals y February 007 (Preliminary and Incomplete Do not quote witout permission) Abstract. Many important models, suc as index models

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Chapter 4: Numerical Methods for Common Mathematical Problems

Chapter 4: Numerical Methods for Common Mathematical Problems 1 Capter 4: Numerical Metods for Common Matematical Problems Interpolation Problem: Suppose we ave data defined at a discrete set of points (x i, y i ), i = 0, 1,..., N. Often it is useful to ave a smoot

More information

Journal of Computational and Applied Mathematics

Journal of Computational and Applied Mathematics Journal of Computational and Applied Matematics 94 (6) 75 96 Contents lists available at ScienceDirect Journal of Computational and Applied Matematics journal omepage: www.elsevier.com/locate/cam Smootness-Increasing

More information

Finite Difference Methods Assignments

Finite Difference Methods Assignments Finite Difference Metods Assignments Anders Söberg and Aay Saxena, Micael Tuné, and Maria Westermarck Revised: Jarmo Rantakokko June 6, 1999 Teknisk databeandling Assignment 1: A one-dimensional eat equation

More information

ch (for some fixed positive number c) reaching c

ch (for some fixed positive number c) reaching c GSTF Journal of Matematics Statistics and Operations Researc (JMSOR) Vol. No. September 05 DOI 0.60/s4086-05-000-z Nonlinear Piecewise-defined Difference Equations wit Reciprocal and Cubic Terms Ramadan

More information