Fast optimal bandwidth selection for kernel density estimation

Vikas Chandrakant Raykar and Ramani Duraiswami
Dept. of computer science and UMIACS, University of Maryland, College Park
{vikas,ramani}@cs.umd.edu

Abstract

We propose a computationally efficient \epsilon-exact approximation algorithm for univariate Gaussian kernel based density derivative estimation that reduces the computational complexity from O(MN) to linear O(N + M). We apply the procedure to estimate the optimal bandwidth for kernel density estimation. We demonstrate the speedup achieved on this problem using the solve-the-equation plug-in method, and on exploratory projection pursuit techniques.

1 Introduction

Kernel density estimation techniques [10] are widely used in various inference procedures in machine learning, data mining, pattern recognition, and computer vision. Efficient use of these methods requires the optimal selection of the bandwidth of the kernel. A series of techniques have been proposed for data-driven bandwidth selection [4]. The most successful state of the art methods rely on the estimation of general integrated squared density derivative functionals. This is the most computationally intensive task, with O(N^2) cost, in addition to the O(N^2) cost of computing the kernel density estimate. The core task is to efficiently compute an estimate of the density derivative. The currently most practically successful approach, the solve-the-equation plug-in method [9], involves the numerical solution of a nonlinear equation. Iterative methods to solve this equation involve repeated use of the density functional estimator for different bandwidths, which adds much to the computational burden. Estimation of density derivatives is also needed in various other applications, like estimation of modes and inflexion points of densities [2] and estimation of the derivatives of the projection index in projection pursuit algorithms [5].

2 Optimal bandwidth selection

A univariate random variable X on R has a density p if, for all Borel sets A of R, \int_A p(x) dx = Pr[x \in A]. The task of density estimation
is to estimate p from an i.i.d. sample x_1, \ldots, x_N drawn from p. The estimate \hat{p} : R \times (R)^N \to R is called the density estimate. The most popular non-parametric method for density estimation is the kernel density estimator (KDE) [10]

(2.1) \hat{p}(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right),

where K(u) is the kernel function and h is the bandwidth. The kernel K(u) is required to satisfy the following two conditions:

(2.2) K(u) \ge 0 \quad \text{and} \quad \int_R K(u)\, du = 1.

The most widely used kernel is the Gaussian of zero mean and unit variance. In this case the KDE can be written as

(2.3) \hat{p}(x) = \frac{1}{\sqrt{2\pi} N h} \sum_{i=1}^{N} e^{-(x - x_i)^2 / 2h^2}.

The computational cost of evaluating Eq. 2.3 at N points is O(N^2), making it prohibitively expensive. The Fast Gauss Transform (FGT) [3] is an approximation algorithm that reduces the computational complexity to O(N), at the expense of reduced precision. Yang et al. [11] presented an extension, the improved fast Gauss transform (IFGT), that scales well with dimension. The main contribution of the current paper is the extension of the IFGT to accelerate the kernel density derivative estimate, and to solve the optimal bandwidth problem.

The integrated square error (ISE) between the estimate \hat{p}(x) and the actual density p(x) is given by ISE(\hat{p}, p) = \int_R [\hat{p}(x) - p(x)]^2 dx. The ISE depends on a particular realization of N points. It can be averaged over these realizations to get the mean integrated squared error (MISE). An asymptotic large sample approximation for MISE, the AMISE, is usually derived via the Taylor series; the A here is for asymptotic. Based on certain assumptions, the AMISE between the actual density and the estimate can be shown to be

(2.4) AMISE(\hat{p}, p) = \frac{R(K)}{Nh} + \frac{h^4}{4} \mu_2(K)^2 R(p''),

where R(g) = \int_R g(x)^2 dx, \mu_2(g) = \int_R x^2 g(x) dx, and p'' is the second derivative of the density p. The first term in Eq. 2.4 is the integrated variance and the second term is the integrated squared bias. The bias is proportional to h^4 whereas the variance is proportional to 1/Nh, which leads to the well known bias-variance tradeoff. Based on the AMISE expression, the optimal bandwidth h_{AMISE} can be obtained by differentiating Eq. 2.4 w.r.t. the bandwidth h and setting it to
zero [10]:

(2.5) h_{AMISE} = \left[ \frac{R(K)}{\mu_2(K)^2 R(p'') N} \right]^{1/5}.
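As an illustration of the quadratic cost discussed above, a direct evaluation of the Gaussian KDE of Eq. 2.3 takes only a few lines. This sketch is ours (not from the paper) and the function names are our own:

```python
import numpy as np

def gaussian_kde(x_eval, x_data, h):
    """Direct evaluation of the Gaussian KDE (Eq. 2.3).

    Cost is O(N*M) for N data points and M evaluation points --
    the quadratic bottleneck that the paper's fast method removes.
    """
    d = (x_eval[:, None] - x_data[None, :]) / h   # (M, N) pairwise scaled differences
    return np.exp(-0.5 * d**2).sum(axis=1) / (len(x_data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)          # N samples from p = N(0, 1)
grid = np.linspace(-3.0, 3.0, 61)      # M evaluation points
p_hat = gaussian_kde(grid, x, h=0.3)
```

Evaluating on a grid of M points this way costs O(NM); the fast method of Section 5 brings this to O(N + M) for a fixed \epsilon.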
However, this expression cannot be used directly since R(p'') depends on the second derivative of the density p. In order to estimate R(p'') we will need an estimate of the density derivative. A simple estimator for the density derivative can be obtained by taking the derivative of the KDE \hat{p}(x) defined earlier [1]. The r-th density derivative estimate \hat{p}^{(r)}(x) can be written as

(2.6) \hat{p}^{(r)}(x) = \frac{1}{N h^{r+1}} \sum_{i=1}^{N} K^{(r)}\left(\frac{x - x_i}{h}\right),

where K^{(r)} is the r-th derivative of the kernel K. The r-th derivative of the Gaussian kernel K(u) is given by K^{(r)}(u) = (-1)^r H_r(u) K(u), where H_r(u) is the r-th Hermite polynomial. Hence the density derivative estimate can be written as

(2.7) \hat{p}^{(r)}(x) = \frac{(-1)^r}{\sqrt{2\pi} N h^{r+1}} \sum_{i=1}^{N} H_r\left(\frac{x - x_i}{h}\right) e^{-(x - x_i)^2 / 2h^2}.

The computational complexity of evaluating the r-th derivative of the density estimate due to N points at M target locations is O(rNM). Based on a similar analysis, the optimal bandwidth h^{(r)}_{AMISE} to estimate the r-th density derivative can be shown to be [10]

(2.8) h^{(r)}_{AMISE} = \left[ \frac{R(K^{(r)})(2r+1)}{\mu_2(K)^2 R(p^{(r+2)}) N} \right]^{1/(2r+5)}.

3 Estimation of density functionals

Rather than requiring the actual density derivative, methods for automatic bandwidth selection require the estimation of what are known as density functionals. The general integrated squared density derivative functional is defined as R(p^{(s)}) = \int_R \left[ p^{(s)}(x) \right]^2 dx. Using integration by parts, this can be written in the following form: R(p^{(s)}) = (-1)^s \int_R p^{(2s)}(x) p(x) dx. More specifically, for even s we are interested in estimating density functionals of the form

(3.9) \Phi_r = \int_R p^{(r)}(x) p(x) dx = E\left[ p^{(r)}(X) \right].

An estimator for \Phi_r is

(3.10) \hat{\Phi}_r = \frac{1}{N} \sum_{i=1}^{N} \hat{p}^{(r)}(x_i),

where \hat{p}^{(r)}(x_i) is the estimate of the r-th derivative of the density p(x) at x = x_i. Using a kernel density derivative estimate for \hat{p}^{(r)}(x_i) (Eq. 2.6) we have

(3.11) \hat{\Phi}_r = \frac{1}{N^2 h^{r+1}} \sum_{i=1}^{N} \sum_{j=1}^{N} K^{(r)}\left(\frac{x_i - x_j}{h}\right).

It should be noted that computation of \hat{\Phi}_r is O(rN^2) and hence can be very expensive if a direct algorithm is used. The optimal bandwidth for estimating the density functional is given by [10]

(3.12) g_{AMSE} = \left[ \frac{-2 K^{(r)}(0)}{\mu_2(K)\, \Phi_{r+2}\, N} \right]^{1/(r+3)}.
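The direct O(N^2) estimator of Eq. 3.11 is straightforward to implement. The following sketch is ours (not from the paper); it uses NumPy's probabilists' Hermite polynomials (numpy.polynomial.hermite_e) to form K^{(r)}(u) = (-1)^r H_r(u) K(u):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

SQRT_2PI = np.sqrt(2 * np.pi)

def gaussian_kernel_deriv(u, r):
    """r-th derivative of the Gaussian kernel:
    K^(r)(u) = (-1)^r H_r(u) K(u), with H_r the probabilists' Hermite polynomial."""
    coeffs = np.zeros(r + 1)
    coeffs[r] = 1.0                          # selects He_r in hermeval
    return (-1) ** r * hermeval(u, coeffs) * np.exp(-0.5 * u**2) / SQRT_2PI

def phi_hat(x, r, h):
    """Direct O(r N^2) density functional estimator of Eq. 3.11."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h        # all pairwise scaled differences
    return gaussian_kernel_deriv(d, r).sum() / (n**2 * h**(r + 1))

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
phi4 = phi_hat(x, r=4, h=0.5)                # estimate of Phi_4 = R(p'')
```

For a standard normal density the true value is \Phi_4 = 3/(8\sqrt{\pi}) \approx 0.21, which the estimate should roughly reproduce. It is this double sum, evaluated repeatedly for different bandwidths, that the fast method of Section 5 accelerates.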
4 Solve-the-equation plug-in method

The most successful among all current bandwidth selection methods [4], both empirically and theoretically, is the solve-the-equation plug-in method [4]. We use the following version, as described in [9]. The AMISE optimal bandwidth is the solution to the equation

(4.13) h = \left[ \frac{R(K)}{\mu_2(K)^2\, \hat{\Phi}_4(\gamma(h))\, N} \right]^{1/5},

where \hat{\Phi}_4(\gamma(h)) is an estimate of \Phi_4 = R(p'') using the pilot bandwidth \gamma(h), which depends on h. The bandwidth g is chosen such that it minimizes the asymptotic MSE for the estimation of \Phi_4, and is

(4.14) g_{MSE} = \left[ \frac{-2 K^{(4)}(0)}{\mu_2(K)\, \Phi_6\, N} \right]^{1/7}.

Substituting for N, g_{MSE} can be written as a function of h as follows:

(4.15) g_{MSE} = \left[ \frac{-2 K^{(4)}(0)\, \mu_2(K)\, \Phi_4}{R(K)\, \Phi_6} \right]^{1/7} h^{5/7}.

This suggests that we set

(4.16) \gamma(h) = \left[ \frac{-2 K^{(4)}(0)\, \mu_2(K)\, \hat{\Phi}_4(g_1)}{R(K)\, \hat{\Phi}_6(g_2)} \right]^{1/7} h^{5/7},

where \hat{\Phi}_4(g_1) and \hat{\Phi}_6(g_2) are estimates of \Phi_4 and \Phi_6 using bandwidths g_1 and g_2 respectively. The bandwidths g_1 and g_2 are chosen such that they minimize the asymptotic MSE:

(4.17) g_1 = \left[ \frac{-6}{\sqrt{2\pi}\, \hat{\Phi}_6\, N} \right]^{1/7}, \quad g_2 = \left[ \frac{30}{\sqrt{2\pi}\, \hat{\Phi}_8\, N} \right]^{1/9},

where \hat{\Phi}_6 and \hat{\Phi}_8 are estimators for \Phi_6 and \Phi_8 respectively. We could use a similar strategy for the estimation of \Phi_6 and \Phi_8, but this regress would continue, since the optimal bandwidth for estimating \Phi_r depends on \Phi_{r+2}. The usual strategy is to estimate a \Phi_r at some stage using a quick and simple bandwidth chosen with reference to a parametric family, usually a normal density. It has been observed that as the number of stages increases, the variance of the bandwidth increases. The most common choice is to use only two stages. If p is a normal density with variance \sigma^2 then for
even r we can compute \Phi_r exactly [10]. An estimator of \Phi_r will use an estimate \hat{\sigma}^2 of the variance. Based on this we can estimate \Phi_6 and \Phi_8 as

(4.18) \hat{\Phi}_6 = \frac{-15}{16\sqrt{\pi}}\, \hat{\sigma}^{-7}, \quad \hat{\Phi}_8 = \frac{105}{32\sqrt{\pi}}\, \hat{\sigma}^{-9}.

The two-stage solve-the-equation method using the Gaussian kernel can be summarized as follows. (1) Compute an estimate \hat{\sigma} of the standard deviation. (2) Estimate the density functionals \hat{\Phi}_6 and \hat{\Phi}_8 using the normal scale rule (Eq. 4.18). (3) Estimate the density functionals \hat{\Phi}_4 and \hat{\Phi}_6 using Eq. 3.11 with the optimal bandwidths g_1 and g_2 (Eq. 4.17). (4) The bandwidth is the solution to the nonlinear Eq. 4.13, which can be solved using any numerical routine like the Newton-Raphson method. The main computational bottleneck is the estimation of \hat{\Phi}_4, which is of O(N^2).

5 Fast density derivative estimation

To estimate the density derivative at M target points, \{y_j \in R\}_{j=1}^{M}, we need to evaluate sums such as

(5.19) G_r(y_j) = \sum_{i=1}^{N} q_i H_r\left(\frac{y_j - x_i}{h_2}\right) e^{-(y_j - x_i)^2 / h_1^2}, \quad j = 1, \ldots, M,

where \{q_i \in R\}_{i=1}^{N} will be referred to as the source weights, h_1 \in R^+ is the bandwidth of the Gaussian, and h_2 \in R^+ is the bandwidth of the Hermite polynomial. The computational complexity of evaluating Eq. 5.19 is O(rNM). For any given \epsilon > 0 the algorithm computes an approximation \hat{G}_r(y_j) such that

(5.20) \left| \hat{G}_r(y_j) - G_r(y_j) \right| \le Q\epsilon,

where Q = \sum_{i=1}^{N} |q_i|. We call \hat{G}_r(y_j) an \epsilon-exact approximation to G_r(y_j). We describe the algorithm briefly; more details can be found in [8]. For any point x_* \in R the Gaussian can be written as

(5.21) e^{-(y_j - x_i)^2 / h_1^2} = e^{-(x_i - x_*)^2 / h_1^2}\, e^{-(y_j - x_*)^2 / h_1^2}\, e^{2(x_i - x_*)(y_j - x_*) / h_1^2}.

In Eq. 5.21 the third exponential, e^{2(x_i - x_*)(y_j - x_*)/h_1^2}, entangles the source and the target. This entanglement is separated using the Taylor series expansion as follows:

(5.22) e^{2(x - x_*)(y - x_*)/h_1^2} = \sum_{k=0}^{p-1} \frac{2^k}{k!} \left(\frac{x - x_*}{h_1}\right)^k \left(\frac{y - x_*}{h_1}\right)^k + \text{error}.

Using this, the Gaussian can now be factorized as

(5.23) e^{-(y_j - x_i)^2 / h_1^2} = \sum_{k=0}^{p-1} \left[ \frac{2^k}{k!} e^{-(x_i - x_*)^2 / h_1^2} \left(\frac{x_i - x_*}{h_1}\right)^k \right] \left[ e^{-(y_j - x_*)^2 / h_1^2} \left(\frac{y_j - x_*}{h_1}\right)^k \right] + \text{error}.

The r-th Hermite polynomial can be factorized as

(5.24) H_r\left(\frac{y_j - x_i}{h_2}\right) = \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} \left(\frac{x_i - x_*}{h_2}\right)^m \left(\frac{y_j - x_*}{h_2}\right)^{r-2l-m},

where

(5.25) a_{lm} = \frac{(-1)^{l+m}\, r!}{2^l\, l!\, m!\, (r - 2l - m)!}.
Using Eqs. 5.23 and 5.24, G_r(y_j), after ignoring the error terms, can be approximated as

\hat{G}_r(y_j) = \sum_{k=0}^{p-1} \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} B_{km}\, e^{-(y_j - x_*)^2 / h_1^2} \left(\frac{y_j - x_*}{h_1}\right)^k \left(\frac{y_j - x_*}{h_2}\right)^{r-2l-m},

where

B_{km} = \frac{2^k}{k!} \sum_{i=1}^{N} q_i\, e^{-(x_i - x_*)^2 / h_1^2} \left(\frac{x_i - x_*}{h_1}\right)^k \left(\frac{x_i - x_*}{h_2}\right)^m.

Thus far we have used the Taylor series expansion about a certain point x_*. However, if we use the same x_* for all the points, we would typically require a very high truncation number p, since the Taylor series gives a good approximation only in a small open interval around x_*. We uniformly sub-divide the space into K intervals of length 2r_x. The N source points are assigned to K clusters, S_n for n = 1, \ldots, K, with c_n being the center of each cluster. The aggregated coefficients are now computed for each cluster, and the total contribution from all the clusters is summed up. Since the Gaussian decays very rapidly, a further speedup is achieved if we ignore all the sources belonging to a cluster when the cluster is greater than a certain distance from the target point, i.e., |y_j - c_n| > r_y. Substituting h_1 = \sqrt{2} h and h_2 = h, the final algorithm can be written as

(5.26) \hat{G}_r(y_j) = \sum_{|y_j - c_n| \le r_y} \sum_{k=0}^{p-1} \sum_{l=0}^{\lfloor r/2 \rfloor} \sum_{m=0}^{r-2l} a_{lm} B^n_{km}\, e^{-(y_j - c_n)^2 / 2h^2} \left(\frac{y_j - c_n}{h}\right)^{k + r - 2l - m},

B^n_{km} = \frac{1}{k!} \sum_{x_i \in S_n} q_i\, e^{-(x_i - c_n)^2 / 2h^2} \left(\frac{x_i - c_n}{h}\right)^{k+m}.

5.1 Computational and space complexity

Computing the coefficients B^n_{km} for all the clusters is O(prN). Evaluation of \hat{G}_r(y_j) at M points is O(npr^2 M), where n is the maximum number of neighbor clusters which influence y_j. Hence the total computational complexity is O(prN + npr^2 M). For each cluster we need to store all the pr coefficients, so the storage needed is of O(prK + N + M).
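To make the factorization concrete, the sketch below (ours; the paper's actual implementation is in C++) applies the clustered Taylor expansion of Eq. 5.23, with h_1 = \sqrt{2} h, to the simplest case r = 0, i.e. a plain Gaussian sum. The Hermite factor and the cutoff radius r_y are omitted for clarity, so evaluation here costs O(pKM) rather than O(npM):

```python
import numpy as np

def fast_gauss_sum(y, x, q, h, p=20, rx=None):
    """Approximate G(y_j) = sum_i q_i exp(-(y_j - x_i)^2 / 2h^2) via the
    clustered p-term Taylor factorization (Eq. 5.23 with h1 = sqrt(2) h, r = 0)."""
    if rx is None:
        rx = h / 2.0                          # cluster half-width, as in Sec. 5.2
    lo = x.min()
    K = max(int(np.ceil((x.max() - lo) / (2 * rx))), 1)
    centers = lo + (np.arange(K) + 0.5) * 2 * rx
    idx = np.minimum(((x - lo) / (2 * rx)).astype(int), K - 1)

    ks = np.arange(p)
    fact = np.concatenate(([1.0], np.cumprod(np.arange(1.0, p))))  # k! for k = 0..p-1

    # B[n, k] = (1/k!) sum_{x_i in S_n} q_i e^{-(x_i-c_n)^2/2h^2} ((x_i-c_n)/h)^k
    B = np.zeros((K, p))
    for n in range(K):
        mask = idx == n
        if not mask.any():
            continue
        t = (x[mask] - centers[n]) / h
        w = q[mask] * np.exp(-0.5 * t**2)
        B[n] = (w[:, None] * t[:, None] ** ks).sum(axis=0) / fact

    # G_hat(y_j) = sum_n sum_k B[n, k] e^{-(y_j-c_n)^2/2h^2} ((y_j-c_n)/h)^k
    u = (y[:, None] - centers[None, :]) / h                        # shape (M, K)
    target = np.exp(-0.5 * u**2)[:, :, None] * u[:, :, None] ** ks
    return np.einsum('nk,mnk->m', B, target)
```

With r_x = h/2 and p around 20 the truncation error is far below typical values of \epsilon; adding the |y_j - c_n| \le r_y cutoff, i.e. ignoring distant clusters, gives the O(pN + npM) cost of the full algorithm for this r = 0 case.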
Figure 1: Running time in seconds and maximum absolute error relative to Q for the direct and the fast methods as a function of N. N = M source and target points were uniformly distributed; r = 4 and \epsilon = 10^{-6}.

5.2 Choosing the parameters

Given any \epsilon > 0, we want to choose the following parameters: K (the number of intervals), r_y (the cutoff radius for each cluster), and p (the truncation number) such that for any target point y_j we can guarantee |\hat{G}_r(y_j) - G_r(y_j)| \le Q\epsilon. We give the final results for the choice of the parameters; the detailed derivations can be seen in the technical report [8]. The number of clusters K is chosen such that r_x = h/2. The cutoff radius r_y is given by r_y = r_x + \sqrt{2h^2 \ln\left(\sqrt{r!}/\epsilon\right)}. The truncation number p is chosen such that \left[ \Delta_p \right]_{b = \min(b_*, r_y),\, a = r_x} \le \epsilon, where \Delta_p = \frac{\sqrt{r!}}{p!} \left( \frac{ab}{2h^2} \right)^p e^{-(a-b)^2/4h^2} and b_* = \frac{a + \sqrt{a^2 + 8ph^2}}{2}.

5.3 Numerical experiments

The algorithm was programmed in C++ and was run on a 1.6 GHz Pentium M processor with 512Mb of RAM. Figure 1 shows the running time and the maximum absolute error relative to Q for both the direct and the fast methods as a function of N = M. We see that the running time of the fast method grows linearly as the number of sources and targets increases, while that of the direct evaluation grows quadratically.

6 Speedup achieved for bandwidth estimation

We demonstrate the speedup achieved on the mixture of normal densities used by Marron and Wand [6]. The family of normal mixture densities is extremely rich and, in fact, any density can be approximated arbitrarily well by a member of this family. Fig. 2 shows a sample of four different densities out of the fifteen densities which were used by the authors in [6] as typical representatives of the densities likely to be encountered in real data situations. We sampled N = 50,000 points from each density. The AMISE optimal bandwidth was estimated using both the direct method and the proposed fast method. Table 1 shows the speedup achieved and the
Figure 2: Four normal mixture densities from Marron and Wand [6]. The solid line shows the actual density and the dotted line is the estimated density using the optimal bandwidth.

absolute relative error. We also used the Adult database from the UCI machine learning repository [7]. The database, extracted from the census bureau database, contains 32,561 training instances with 14 attributes per instance. Table 2 shows the speedup achieved and the absolute relative error for two of the continuous attributes.

7 Projection pursuit

Projection pursuit (PP) is an exploratory technique for visualizing and analyzing large multivariate datasets [5]. The idea of PP is to search for projections from high- to low-dimensional space that are most interesting. The PP algorithm for finding the most interesting one-dimensional subspace is as follows. First project each data point onto the direction vector a \in R^d, i.e., z_i = a^T x_i. Compute the univariate nonparametric kernel density estimate, \hat{p}, of the projected points z_i. Compute the projection index I based on the density estimate. Locally optimize over the choice of a to get the most interesting projection of the data. Repeat from a new initial projection to get a different view. The projection index is designed to reveal specific structure in the data, like clusters, outliers, or smooth manifolds. The entropy index based on Rényi's order-1 entropy is given by I = \int p(z) \log p(z)\, dz. The density of zero mean and unit variance which uniquely minimizes this index is the standard normal density. Thus the projection index finds the direction which is most non-normal. In practice we need to use an estimate \hat{p} of the true density p, for example the KDE using the Gaussian kernel. Thus we have an estimate of the entropy index as follows:

(7.27) \hat{I} = \int \log \hat{p}(z)\, \hat{p}(z)\, dz \approx \frac{1}{N} \sum_{i=1}^{N} \log \hat{p}(a^T x_i).
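As a concrete illustration (our sketch, not the paper's implementation), the entropy index of Eq. 7.27 and the KDE derivative \hat{p}' needed for its gradient can be computed directly; the fast method of Section 5 would replace the quadratic sums below. We standardize the projected points, as is usual in projection pursuit, so that scale does not dominate the index; the bandwidth value is arbitrary here:

```python
import numpy as np

SQRT_2PI = np.sqrt(2 * np.pi)

def kde(z_eval, z, h):
    """Gaussian KDE p_hat (Eq. 2.3), direct O(NM) evaluation."""
    d = (z_eval[:, None] - z[None, :]) / h
    return np.exp(-0.5 * d**2).sum(axis=1) / (len(z) * h * SQRT_2PI)

def kde_deriv(z_eval, z, h):
    """First derivative p_hat' (Eq. 2.7 with r = 1, H_1(u) = u)."""
    d = (z_eval[:, None] - z[None, :]) / h
    return -(d * np.exp(-0.5 * d**2)).sum(axis=1) / (len(z) * h**2 * SQRT_2PI)

def entropy_index(a, X, h):
    """Sample estimate of the entropy projection index (Eq. 7.27)."""
    z = X @ (a / np.linalg.norm(a))          # project onto the unit direction a
    z = (z - z.mean()) / z.std()             # standardize the projections
    return np.mean(np.log(kde(z, z, h)))

# Two well-separated clusters along the first axis; the second axis is plain
# normal noise, so the first direction should score as "more interesting".
rng = np.random.default_rng(3)
X = np.empty((400, 2))
X[:200, 0] = rng.normal(-3.0, 1.0, 200)
X[200:, 0] = rng.normal(3.0, 1.0, 200)
X[:, 1] = rng.standard_normal(400)
i_bimodal = entropy_index(np.array([1.0, 0.0]), X, h=0.2)
i_normal = entropy_index(np.array([0.0, 1.0]), X, h=0.2)
```

Since the standard normal uniquely minimizes the index among standardized densities, i_bimodal should exceed i_normal. The gradient of \hat{I} used in the local optimization over a needs only \hat{p} and \hat{p}' at the projected points, which is exactly what kde and kde_deriv provide.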
Table 1: The running time in seconds for the direct and the fast methods for four normal mixture densities of Marron and Wand [6] (see Fig. 2). The absolute relative error is defined as |h_direct - h_fast| / h_direct. For the fast method we used \epsilon = 10^{-3}.

Density | h_direct | h_fast | T_direct (sec) | T_fast (sec) | Speedup | Abs. relative error
(a)     | 543      | 543    | 8536           | 6            | 8387    | 53e-6
(b)     | 94       | 94     | 5989           | 886          | 6679    | 634e-6
(c)     | 436      | 436    | 7867           | 67           | 6769    | 84e-6
(d)     | 349      | 3493   | 839            | 9            | 6983    | 383e-6

Table 2: Optimal bandwidth estimation for two continuous attributes of the Adult database [7].

Attribute | h_direct   | h_fast | T_direct (sec) | T_fast (sec) | Speedup | Abs. relative error
Age       | 86846      | 86856  | 46793          | 664          | 745     | 7e-5
fnlwgt    | 499564359  | 499584 | 46379          | 6883         | 6737    | 49e-6

The entropy index \hat{I} has to be optimized over the d-dimensional vector a, subject to the constraint that \|a\| = 1. The optimization procedure will require the gradient of the objective function. For the index defined above, the gradient can be written as

\frac{d}{da}\left[\hat{I}\right] = \frac{1}{N} \sum_{i=1}^{N} \frac{\hat{p}'(a^T x_i)}{\hat{p}(a^T x_i)}\, x_i.

For PP the computational burden is greatly reduced if we use the proposed fast method. The computational burden is reduced in the following three instances: (1) computation of the kernel density estimate, (2) estimation of the optimal bandwidth, and (3) computation of the first derivative of the kernel density estimate, which is required in the optimization procedure. Fig. 3 shows an example of the PP algorithm used to segment an image based on color.

8 Conclusions

We proposed a fast \epsilon-exact algorithm for kernel density derivative estimation which reduces the computational complexity from O(N^2) to O(N). We demonstrated the speedup achieved for optimal bandwidth estimation both on simulated as well as real data. An extended version of this paper is available as a technical report [8].

References

[1] P. K. Bhattacharya. Estimation of a probability density function and its derivatives. Sankhya, Series A, 29:373-382, 1967.
[2] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Info. Theory, 21(1):32-40, 1975.
[3] L. Greengard and
J. Strain. The fast Gauss transform. SIAM J. Sci. Stat. Comput., 12(1):79-94, 1991.
[4] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. J. Amer. Stat. Assoc., 91(433):401-407, March 1996.
[5] M. C. Jones and R. Sibson. What is projection pursuit? J. R. Statist. Soc. A, 150:1-36, 1987.

Figure 3: (a) The original image. (b) The centered and scaled RGB space. Each pixel in the image is a point in the RGB space. (c) KDE of the projection of the pixels on the most interesting direction found by projection pursuit. (d) The assignment of the pixels to the three modes in the KDE.

[6] J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Ann. of Stat., 20(2):712-736, 1992.
[7] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[8] V. C. Raykar and R. Duraiswami. Very fast optimal bandwidth selection for univariate kernel density estimation. Technical Report CS-TR-4774, University of Maryland, College Park, 2005.
[9] S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. J. Royal Statist. Soc. B, 53:683-690, 1991.
[10] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.
[11] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In IEEE Int. Conf. on Computer Vision, pages 464-471, 2003.