Fast Exact Univariate Kernel Density Estimation


David P. Hofmeyr
Department of Statistics and Actuarial Science, Stellenbosch University

arXiv: v2 [stat.CO] 12 Jul 2018

July 13, 2018

Abstract

This paper presents new methodology for computationally efficient kernel density estimation. It is shown that a large class of kernels allows for exact evaluation of the density estimates using simple recursions. The same methodology can be used to compute density derivative estimates exactly. Given an ordered sample the computational complexity is linear in the sample size. Combining the proposed methodology with existing approximation methods results in extremely fast density estimation. Extensive experimentation documents the effectiveness and efficiency of this approach compared with the existing state-of-the-art.

Keywords: linear time, density derivative

1 Introduction

Estimation of density functions is a crucial task in exploratory data analysis, with broad application in the fields of statistics, machine learning and data science. Here a sample of observations, $x_1, \dots, x_n$, is assumed to represent a collection of realisations of a random variable, $X$, with unknown density function, $f$. The task is to obtain an estimate of $f$ based on the sample values. Kernel based density estimation is arguably the most popular non-parametric approach. In kernel density estimation, the density estimator, $\hat f$, is given by a mixture model comprising a large number of (usually $n$) components. In the canonical form, one has

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \qquad (1)$$

where $K(\cdot)$ is called the kernel, and is a density function in its own right, satisfying $K \geq 0$, $\int K = 1$. The parameter $h > 0$ is called the bandwidth, and controls the smoothness of $\hat f$, with larger values resulting in a smoother estimator. A direct evaluation of (1) at a collection of $m$ evaluation points, $\{\tilde x_1, \dots, \tilde x_m\}$, has computational complexity $O(nm)$, which quickly becomes prohibitive as the sample size becomes large, especially if the function estimate is required at a large number of evaluation points. Furthermore many popular methods for bandwidth selection necessitate evaluating the density estimate (or its derivatives) at the sample points themselves (Scott and Terrell, 1987; PW, 1976; Sheather and Jones, 1991), making the procedure for choosing $h$ quadratic in computational complexity.

Existing methods which overcome this quadratic complexity barrier are limited to kernels with bounded support and the Laplace kernel (Fan and Marron, 1994), or they rely on approximations. Popular approximations include binning (Scott and Sheather, 1985; Hall and Wand, 1994) and the fast Gauss (Yang et al., 2003, FGT) and Fourier (Silverman, 1982, FFT) transforms, as well as combinations of these. A more recent approach (Raykar et al., 2010) relies on truncations of the Taylor series expansion of the kernel function. Generally speaking these methods reduce the complexity to $O(n + m)$, with the constant term depending on the desired accuracy level.

In this paper the class of kernels of the form $K(x) = \mathrm{poly}(|x|)\exp(-|x|)$, where $\mathrm{poly}(\cdot)$ denotes a polynomial function of finite degree, is considered. It is shown that these kernels allow for extremely fast and exact evaluation of the corresponding density estimates. This is achieved by defining a collection of $O((\alpha + 1)n)$ terms, where $\alpha$ is the degree of the polynomial, of which the values $\{\hat f(x_1), \dots, \hat f(x_n)\}$ are linear combinations. These terms arise from exploiting the binomial expansion of polynomial terms and the trivial factorisation of the exponential function. Furthermore these terms can be computed recursively from the order statistics of the sample. Given an ordered sample, the exact computation of the collection of values $\{\hat f(x_1), \dots, \hat f(x_n)\}$ therefore has complexity $O((\alpha + 1)n)$. Henceforth we will use $\mathrm{poly}_\alpha(\cdot)$ to denote a polynomial function of degree $\alpha$.

An important benefit of the proposed kernels over those used in the fast sum updating approach (Fan and Marron, 1994) is that they have unbounded support, whereas bounded kernels cannot be reliably used in cross validation pseudo-likelihood computations. This is because the likelihood for points which do not lie within the support of the density estimate based on the remaining points is zero. With bounded kernels, numerous popular bandwidth selection techniques can therefore not be applied.

Remark 1 The derivative of a $\mathrm{poly}_\alpha(|x|)\exp(-|x|)$ function is equal to $x$ multiplied by a $\mathrm{poly}_{\alpha-1}(|x|)\exp(-|x|)$ function, provided this derivative exists. The proposed methodology can therefore be used to exactly and efficiently evaluate $\{\hat f^{(k)}(x_1), \dots, \hat f^{(k)}(x_n)\}$, where $\hat f^{(k)}$ denotes the $k$-th derivative of $\hat f$. Although a given $\mathrm{poly}(|x|)\exp(-|x|)$ function is not infinitely differentiable at 0, for a given value of $k$ it is straightforward to construct a $\mathrm{poly}(|x|)\exp(-|x|)$ function with at least $k$ continuous derivatives. An alternative is to utilise leave-one-out estimates of the derivative, which can be computed for any $\mathrm{poly}(|x|)\exp(-|x|)$ function provided there are no repeated values in the sample.

Remark 2 The proposed class of kernels is extremely rich. The popular Gaussian kernel is a limit case, which can be seen by considering that the density of an arbitrary sum of Laplace random variables lies in this class.

The remainder of the paper is organised as follows. In Section 2 the kernels used in the proposed method are introduced, and relevant properties for kernel density estimation are discussed. It is shown that density estimation using this class of kernels can be performed in linear time from an ordered sample using the recursive formulae mentioned above. An extensive simulation study is documented in Section 3, which shows the efficiency and effectiveness of the proposed approach. A final discussion is given in Section 4.
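For reference, the direct evaluation of (1) whose $O(nm)$ cost motivates the paper can be sketched as follows. This is only an illustrative baseline, not the proposed method; the function names `laplace_kernel` and `kde_direct` are hypothetical, and the Laplace kernel is used simply because it is the lowest-degree member of the class considered.

```python
import math


def laplace_kernel(u):
    """Laplace kernel K(u) = exp(-|u|) / 2, which integrates to one."""
    return 0.5 * math.exp(-abs(u))


def kde_direct(sample, eval_points, h, kernel=laplace_kernel):
    """Evaluate estimator (1) directly at each evaluation point.

    Every evaluation point visits every observation, so the cost is
    O(n * m) kernel evaluations for n observations and m points.
    """
    n = len(sample)
    return [
        sum(kernel((x - xi) / h) for xi in sample) / (n * h)
        for x in eval_points
    ]
```

Calling `kde_direct(sample, sample, h)` gives the estimate at the sample points themselves, which is the quadratic-cost case that arises in bandwidth selection.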

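2 Computing Kernel Density Estimates Exactly

This section is concerned with efficient evaluation of univariate kernel density estimates. A general approach for evaluating the estimated density based on kernels which are of the type $K(x) = \mathrm{poly}(|x|)\exp(-|x|)$ is provided. These kernels admit a convenient algebraic expansion of their sums, which allows for the evaluation of the density estimates using a few simple recursions. The resulting computational complexity is $O((\alpha + 1)n)$ for an ordered sample of size $n$, where $\alpha$ is the degree of the polynomial.

To illustrate the proposed approach we need only consider the evaluation of a function of the type

$$\sum_{i=1}^{n} |x - x_i|^{\alpha} \exp\left(-\frac{|x - x_i|}{h}\right), \qquad (2)$$

for an arbitrary $\alpha \in \{0, 1, 2, \dots\}$. The extension to a linear combination of finitely many such functions, of which $\hat f$ is an example, is trivial. To that end let $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ be the order statistics from the sample. Then define for each $k = 0, 1, 2, \dots, \alpha$ and each $j = 0, 1, 2, \dots, n$ the terms

$$\ell(k, j) = \sum_{i=1}^{j} (-x_{(i)})^{k} \exp\left(\frac{x_{(i)} - x_{(j)}}{h}\right), \qquad (3)$$

$$r(k, j) = \sum_{i=j+1}^{n} (x_{(i)})^{k} \exp\left(\frac{x_{(j)} - x_{(i)}}{h}\right), \qquad (4)$$

where for convenience $\ell(k, 0)$ and $r(k, n)$ are set to zero for all $k$. Next, for a given $x \in \mathbb{R}$ define $n(x)$ to be the number of sample points less than or equal to $x$, i.e., $n(x) = \sum_{i=1}^{n} \delta_{x_i}((-\infty, x])$, where $\delta_{x_i}(\cdot)$ is the Dirac measure for $x_i$. Then,

$$\begin{aligned}
\sum_{i=1}^{n} |x - x_i|^{\alpha} \exp\left(-\frac{|x - x_i|}{h}\right)
&= \sum_{i=1}^{n} |x - x_{(i)}|^{\alpha} \exp\left(-\frac{|x - x_{(i)}|}{h}\right) \\
&= \sum_{i=1}^{n(x)} (x - x_{(i)})^{\alpha} \exp\left(\frac{x_{(i)} - x}{h}\right) + \sum_{i=n(x)+1}^{n} (x_{(i)} - x)^{\alpha} \exp\left(\frac{x - x_{(i)}}{h}\right) \\
&= \exp\left(\frac{x_{(n(x))} - x}{h}\right) \sum_{k=0}^{\alpha} \binom{\alpha}{k} x^{\alpha-k} \sum_{i=1}^{n(x)} (-x_{(i)})^{k} \exp\left(\frac{x_{(i)} - x_{(n(x))}}{h}\right) \\
&\quad + \exp\left(\frac{x - x_{(n(x))}}{h}\right) \sum_{k=0}^{\alpha} \binom{\alpha}{k} (-x)^{\alpha-k} \sum_{i=n(x)+1}^{n} (x_{(i)})^{k} \exp\left(\frac{x_{(n(x))} - x_{(i)}}{h}\right) \\
&= \sum_{k=0}^{\alpha} \binom{\alpha}{k} \left( \exp\left(\frac{x_{(n(x))} - x}{h}\right) x^{\alpha-k}\, \ell(k, n(x)) + \exp\left(\frac{x - x_{(n(x))}}{h}\right) (-x)^{\alpha-k}\, r(k, n(x)) \right).
\end{aligned}$$

Now, if $x$ is itself an element of the sample, say $x = x_{(j)}$, then we have

$$\sum_{i=1}^{n} |x_{(j)} - x_i|^{\alpha} \exp\left(-\frac{|x_{(j)} - x_i|}{h}\right) = \sum_{k=0}^{\alpha} \binom{\alpha}{k} \left( (x_{(j)})^{\alpha-k}\, \ell(k, j) + (-x_{(j)})^{\alpha-k}\, r(k, j) \right).$$

The values $\hat f(x_{(1)}), \dots, \hat f(x_{(n)})$ can therefore be expressed as linear combinations of terms in $\bigcup_{k,j} \{\ell(k, j), r(k, j)\}$. Next it is shown that for each $k = 0, \dots, \alpha$, the terms $\ell(k, j), r(k, j)$ can be obtained recursively. Consider,

$$\begin{aligned}
\ell(k, j+1) &= \sum_{i=1}^{j+1} (-x_{(i)})^{k} \exp\left(\frac{x_{(i)} - x_{(j+1)}}{h}\right) \\
&= \sum_{i=1}^{j} (-x_{(i)})^{k} \exp\left(\frac{x_{(i)} - x_{(j)} + x_{(j)} - x_{(j+1)}}{h}\right) + (-x_{(j+1)})^{k} \\
&= \exp\left(\frac{x_{(j)} - x_{(j+1)}}{h}\right) \sum_{i=1}^{j} (-x_{(i)})^{k} \exp\left(\frac{x_{(i)} - x_{(j)}}{h}\right) + (-x_{(j+1)})^{k} \\
&= \exp\left(\frac{x_{(j)} - x_{(j+1)}}{h}\right) \ell(k, j) + (-x_{(j+1)})^{k}.
\end{aligned}$$

And similarly,

$$\begin{aligned}
r(k, j-1) &= \sum_{i=j}^{n} (x_{(i)})^{k} \exp\left(\frac{x_{(j-1)} - x_{(i)}}{h}\right) \\
&= \exp\left(\frac{x_{(j-1)} - x_{(j)}}{h}\right) \left( \sum_{i=j+1}^{n} (x_{(i)})^{k} \exp\left(\frac{x_{(j)} - x_{(i)}}{h}\right) + (x_{(j)})^{k} \right) \\
&= \exp\left(\frac{x_{(j-1)} - x_{(j)}}{h}\right) \left( r(k, j) + (x_{(j)})^{k} \right).
\end{aligned}$$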
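The forward and backward recursions for $\ell(k, j)$ and $r(k, j)$ can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the author's C++ implementation: the function name `kernel_sums_at_sample` is hypothetical, the index `j` in the code is zero-based (so `l[k][j]` holds $\ell(k, j+1)$), and it only evaluates the sum (2) at the sample points.

```python
import math


def kernel_sums_at_sample(sample, h, alpha):
    """Return the order statistics x and, for each x_(j), the sum
    sum_i |x_(j) - x_(i)|**alpha * exp(-|x_(j) - x_(i)| / h),
    computed with one forward and one backward pass per power k."""
    x = sorted(sample)
    n = len(x)
    # l[k][j] stores l(k, j + 1) and r[k][j] stores r(k, j + 1).
    l = [[0.0] * n for _ in range(alpha + 1)]
    r = [[0.0] * n for _ in range(alpha + 1)]
    for k in range(alpha + 1):
        # Forward pass: l(k, j + 1) = exp((x_(j) - x_(j+1)) / h) * l(k, j) + (-x_(j+1))**k.
        l[k][0] = (-x[0]) ** k
        for j in range(1, n):
            decay = math.exp((x[j - 1] - x[j]) / h)
            l[k][j] = decay * l[k][j - 1] + (-x[j]) ** k
        # Backward pass, starting from r(k, n) = 0:
        # r(k, j - 1) = exp((x_(j-1) - x_(j)) / h) * (r(k, j) + x_(j)**k).
        for j in range(n - 2, -1, -1):
            decay = math.exp((x[j] - x[j + 1]) / h)
            r[k][j] = decay * (r[k][j + 1] + x[j + 1] ** k)
    sums = []
    for j in range(n):
        total = 0.0
        for k in range(alpha + 1):
            total += math.comb(alpha, k) * (
                x[j] ** (alpha - k) * l[k][j] + (-x[j]) ** (alpha - k) * r[k][j]
            )
        sums.append(total)
    return x, sums
```

After sorting, the work is proportional to $(\alpha + 1)n$. A density estimate at the sample points is then a linear combination of such sums over the powers appearing in the kernel's polynomial, scaled by the appropriate powers of $h$ and the normalising constant.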

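The complete set of values $\hat f(x_{(1)}), \dots, \hat f(x_{(n)})$ can thus be computed with a single forward and a single backward pass over the order statistics, requiring $O((\alpha + 1)n)$ operations in total. On the other hand evaluation at an arbitrary collection of $m$ evaluation points requires $O((\alpha + 1)(n + m))$ operations.

Relevant properties of the chosen class of kernels can be simply derived. Consider the kernel given by

$$K(x) = c \exp(-|x|) \sum_{k=0}^{\alpha} \beta_k |x|^{k},$$

where $c$ is the normalising constant. Of course $c$ can be incorporated directly into the coefficients $\beta_0, \dots, \beta_\alpha$, but for completeness the un-normalised kernel formulation is also considered. It might be convenient to a practitioner to only be concerned with the shape of a kernel, which is defined by the relative values of the coefficients $\beta_0, \dots, \beta_\alpha$, without necessarily concerning themselves initially with normalisation. Many important properties in relation to the field of kernel density estimation can be simply derived using the fact that

$$\int_{-\infty}^{\infty} |x|^{k} \exp(-|x|)\,dx = 2\int_{0}^{\infty} x^{k} \exp(-x)\,dx = 2k!.$$

Specifically, one has

$$c^{-1} = \int_{-\infty}^{\infty} \exp(-|x|) \sum_{k=0}^{\alpha} \beta_k |x|^{k}\,dx = 2\sum_{k=0}^{\alpha} \beta_k k!,$$

$$\sigma_K^2 := \int_{-\infty}^{\infty} x^2 K(x)\,dx = c\sum_{k=0}^{\alpha} \beta_k \int_{-\infty}^{\infty} |x|^{k+2} \exp(-|x|)\,dx = 2c\sum_{k=0}^{\alpha} \beta_k (k+2)!,$$

$$R(K) := \int_{-\infty}^{\infty} K(x)^2\,dx = c^2 \sum_{k=0}^{\alpha}\sum_{j=0}^{\alpha} \beta_k \beta_j \int_{-\infty}^{\infty} |x|^{k+j} \exp(-2|x|)\,dx = c^2 \sum_{k=0}^{\alpha}\sum_{j=0}^{\alpha} \beta_k \beta_j \frac{1}{2}\int_{-\infty}^{\infty} \frac{|x|^{k+j}}{2^{k+j}} \exp(-|x|)\,dx = c^2 \sum_{k=0}^{\alpha}\sum_{j=0}^{\alpha} \beta_k \beta_j \frac{(k+j)!}{2^{k+j}}.$$

Furthermore it can be shown that for $K(x)$ to have at least $k$ continuous derivatives it is sufficient that $\beta_i \propto \frac{1}{i!}$, for all $i = 0, 1, \dots, k$.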
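As a quick check of these expressions, consider the lowest-degree differentiable case, $\alpha = 1$ with $\beta_0 = \beta_1 = 1$, so that $K(x) = c(1 + |x|)\exp(-|x|)$. The worked values below follow directly from the three formulas above and involve no further assumptions.

```latex
\begin{align*}
c^{-1} &= 2\,(0! + 1!) = 4, \qquad c = \tfrac{1}{4},\\
\sigma_K^2 &= 2c\,(2! + 3!) = 2 \cdot \tfrac{1}{4} \cdot 8 = 4,\\
R(K) &= c^2\left(\frac{0!}{2^0} + 2\cdot\frac{1!}{2^1} + \frac{2!}{2^2}\right)
      = \frac{1}{16}\left(1 + 1 + \tfrac{1}{2}\right) = \frac{5}{32}.
\end{align*}
```

The same values are obtained by integrating $(1+|x|)\exp(-|x|)/4$ directly, for instance $\int x^2 K(x)\,dx = \tfrac{1}{2}\int_0^\infty (x^2 + x^3)\exp(-x)\,dx = \tfrac{1}{2}(2 + 6) = 4$.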

Choosing the simplest kernel from the proposed class (i.e., that with the lowest degree polynomial) which admits each smoothness level leads us to the sub-class defined by

$$K_\alpha(x) := \frac{1}{2(\alpha + 1)} \sum_{k=0}^{\alpha} \frac{|x|^{k}}{k!} \exp(-|x|), \quad \alpha = 1, 2, \dots \qquad (5)$$

In the experiments presented in the following section the kernels $K_1$ and $K_4$ will be considered. The kernel $K_1$ is chosen as the simplest differentiable kernel in the class, while $K_4$ is selected as it has efficiency very close to that of the ubiquitous Gaussian kernel. The efficiency of a kernel $K$ relates to the asymptotic mean integrated error which it induces, and may be defined as $\mathrm{eff}(K) := (\sigma_K R(K))^{-1}$. It is standard to consider the relative efficiency $\mathrm{eff}_{rel}(K) := \mathrm{eff}(K)/\mathrm{eff}(K^\star)$. The kernel $K^\star$ is the kernel which maximises $\mathrm{eff}(K)$ as defined here, and is given by the Epanechnikov kernel. The efficiency and shape of the chosen kernels can be seen in Figure 1, in relation to the popular Gaussian kernel.

Remark 3 The efficiency of a kernel is more frequently defined as the inverse of the definition adopted here. It is considered preferable here to speak of maximising efficiency, rather than minimising it, and hence the above formulation is adopted instead.

2.1 Density Derivative Estimation

It is frequently the case that the most important aspects of a density for analysis can be determined using estimates of its derivatives. For example, the roots of the first derivative provide the locations of the stationary points (modes and anti-modes) of the density. In addition pointwise derivatives are useful for determining gradients of numerous projection indices used in projection pursuit (Huber, 1985). The natural estimate for the $k$-th derivative of $f$ at $x$ is simply,

$$\widehat{f^{(k)}}(x) = \hat f^{(k)}(x) = \frac{1}{nh} \sum_{i=1}^{n} \frac{d^{k}}{dx^{k}} K\left(\frac{x - x_i}{h}\right) = \frac{1}{nh^{k+1}} \sum_{i=1}^{n} K^{(k)}\left(\frac{x - x_i}{h}\right).$$

Only the first derivative will be considered explicitly; higher order derivatives can be simply derived, given an appropriate kernel (i.e., one with sufficiently many derivatives).
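The efficiency definition given earlier in this section, $\mathrm{eff}(K) = (\sigma_K R(K))^{-1}$, can be evaluated numerically for the sub-class $K_\alpha$ using the closed forms for $c$, $\sigma_K^2$ and $R(K)$. The sketch below is illustrative only: the function names are hypothetical, and the Epanechnikov constants ($\sigma^2 = 1/5$, $R = 3/5$) and Gaussian constants ($\sigma^2 = 1$, $R = 1/(2\sqrt{\pi})$) are standard values supplied here rather than taken from the text.

```python
import math


def kernel_constants(beta):
    """Return (c, sigma_K^2, R(K)) for K(x) = c * exp(-|x|) * sum_k beta[k] * |x|**k."""
    c = 1.0 / (2.0 * sum(b * math.factorial(k) for k, b in enumerate(beta)))
    sigma2 = 2.0 * c * sum(b * math.factorial(k + 2) for k, b in enumerate(beta))
    roughness = c * c * sum(
        bk * bj * math.factorial(k + j) / 2.0 ** (k + j)
        for k, bk in enumerate(beta)
        for j, bj in enumerate(beta)
    )
    return c, sigma2, roughness


def efficiency(sigma2, roughness):
    """eff(K) = 1 / (sigma_K * R(K)), the definition adopted in the text."""
    return 1.0 / (math.sqrt(sigma2) * roughness)


def relative_efficiency_K_alpha(alpha):
    """Efficiency of K_alpha (beta_k = 1 / k!) relative to the Epanechnikov kernel."""
    beta = [1.0 / math.factorial(k) for k in range(alpha + 1)]
    _, sigma2, roughness = kernel_constants(beta)
    eff_epanechnikov = efficiency(1.0 / 5.0, 3.0 / 5.0)  # standard constants
    return efficiency(sigma2, roughness) / eff_epanechnikov


def relative_efficiency_gaussian():
    """Relative efficiency of the standard Gaussian kernel, for comparison."""
    return efficiency(1.0, 1.0 / (2.0 * math.sqrt(math.pi))) / efficiency(0.2, 0.6)
```

Comparing `relative_efficiency_K_alpha(4)` with `relative_efficiency_gaussian()` gives a numerical view of the statement that $K_4$ has efficiency very close to that of the Gaussian kernel.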

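[Figure 1: Relative efficiency and shape of the kernels used in experiments. (a) Relative efficiency of $K_\alpha$ for $\alpha = 0, 1, \dots, 15$, with the relative efficiency of the Gaussian kernel shown for reference. (b) Plots of $K_1$, $K_4$ and the Gaussian kernel.]

Considering again the kernels $K_\alpha$ as defined previously, consider that for $\alpha = 1, 2, \dots$ we have

$$\begin{aligned}
K_\alpha^{(1)}(x) &= \frac{d}{dx} \frac{1}{2(\alpha + 1)} \sum_{k=0}^{\alpha} \frac{|x|^{k}}{k!} \exp(-|x|) \\
&= \frac{1}{2(\alpha + 1)} \left( \sum_{k=1}^{\alpha} \frac{k |x|^{k-1} \mathrm{sign}(x)}{k!} \exp(-|x|) - \mathrm{sign}(x) \exp(-|x|) \sum_{k=0}^{\alpha} \frac{|x|^{k}}{k!} \right) \\
&= \frac{\exp(-|x|)}{2(\alpha + 1)} \left( \sum_{k=0}^{\alpha-1} \frac{x |x|^{k-1}}{k!} - \sum_{k=0}^{\alpha} \frac{x |x|^{k-1}}{k!} \right) \\
&= -\frac{\exp(-|x|)}{2(\alpha + 1)} \frac{x |x|^{\alpha-1}}{\alpha!} \\
&= -\frac{\exp(-|x|)}{2(\alpha + 1)!}\, x |x|^{\alpha-1}.
\end{aligned}$$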
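The closed form for $K_\alpha^{(1)}$ can be checked numerically against a finite difference of $K_\alpha$, and used in the derivative estimator $\hat f^{(1)}(x) = \frac{1}{nh^2}\sum_i K^{(1)}((x - x_i)/h)$. The following is a direct (quadratic-cost) sketch for illustration; the function names are hypothetical and the fast recursions are deliberately not used here.

```python
import math


def K_alpha(u, alpha):
    """Kernel (5): exp(-|u|) * sum_{k<=alpha} |u|**k / k!, divided by 2 * (alpha + 1)."""
    poly = sum(abs(u) ** k / math.factorial(k) for k in range(alpha + 1))
    return math.exp(-abs(u)) * poly / (2 * (alpha + 1))


def K_alpha_deriv(u, alpha):
    """Closed form for alpha >= 1: -u * |u|**(alpha - 1) * exp(-|u|) / (2 * (alpha + 1)!)."""
    return -u * abs(u) ** (alpha - 1) * math.exp(-abs(u)) / (2 * math.factorial(alpha + 1))


def density(sample, x, h, alpha):
    """Direct evaluation of the density estimate at a single point x."""
    return sum(K_alpha((x - xi) / h, alpha) for xi in sample) / (len(sample) * h)


def density_derivative(sample, x, h, alpha):
    """Direct evaluation of the first derivative estimate at a single point x."""
    return sum(K_alpha_deriv((x - xi) / h, alpha) for xi in sample) / (len(sample) * h * h)
```

Because $K_\alpha$ with $\alpha \geq 1$ is continuously differentiable, a central difference of `density` around any `x` should agree with `density_derivative` up to discretisation error.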

To compute estimates of $\hat f^{(1)}(x)$ only a very slight modification to the methodology discussed previously is required. Specifically, consider that

$$\begin{aligned}
\sum_{i=1}^{n} (x - x_i) |x - x_i|^{\alpha-1} \exp\left(-\frac{|x - x_i|}{h}\right)
&= \sum_{i=1}^{n(x)} (x - x_{(i)})^{\alpha} \exp\left(\frac{x_{(i)} - x}{h}\right) - \sum_{i=n(x)+1}^{n} (x_{(i)} - x)^{\alpha} \exp\left(\frac{x - x_{(i)}}{h}\right) \\
&= \sum_{k=0}^{\alpha} \binom{\alpha}{k} \left( \exp\left(\frac{x_{(n(x))} - x}{h}\right) x^{\alpha-k}\, \ell(k, n(x)) - \exp\left(\frac{x - x_{(n(x))}}{h}\right) (-x)^{\alpha-k}\, r(k, n(x)) \right).
\end{aligned}$$

The only difference between this and the corresponding terms in the estimated density is the "$-$" separating terms in the final expression above.

Now, using the above expression for $K_\alpha^{(1)}(x)$, consider that

$$\begin{aligned}
R(K_\alpha^{(1)}) &= \frac{1}{(2(\alpha + 1)!)^2} \int_{-\infty}^{\infty} |x|^{2\alpha} \exp(-2|x|)\,dx \\
&= \frac{1}{(2(\alpha + 1)!)^2} \frac{1}{2^{2\alpha+1}} \int_{-\infty}^{\infty} |x|^{2\alpha} \exp(-|x|)\,dx \\
&= \frac{(2\alpha)!}{((\alpha + 1)!)^2}\, 2^{-2\alpha-2}.
\end{aligned}$$

Unlike for the task of density estimation, the relative efficiency of a kernel for estimating the first derivative of a density function is determined in relation to the biweight kernel. The relative efficiency of the adopted class for estimation of the derivative of a density is shown in Figure 2. The relative efficiency of the Gaussian kernel is again included for context. Again the kernel $K_4$ has similar efficiency to the Gaussian. Here, unlike for density estimation, we can see a clear maximiser with $K_7$. This kernel will therefore also be considered for the task of density derivative estimation in the experiments to follow.

3 Simulations

This section presents the results from a thorough simulation study conducted to illustrate the efficiency and effectiveness of the proposed approach for density and density derivative estimation. A collection of eight univariate densities is considered, many of which are taken from the popular collection of benchmark densities given in Marron and Wand (1992). Plots of all eight densities are given in Figure 3.

[Figure 2: Relative efficiency of kernels $K_\alpha$, $\alpha = 1, 2, \dots, 15$, for estimating the derivative of a density function, with the relative efficiency of the Gaussian kernel shown for reference. Axes: $\alpha$ (horizontal), relative efficiency (vertical).]

[Figure 3: Collection of densities used in experiments. (a) Gaussian; (b) Uniform; (c) Scale mixture; (d) Simple bimodal; (e) Skew; (f) Spiked bimodal; (g) Claw; (h) Skew bimodal.]

For context, comparisons will be made with the following existing methods. For these the Gaussian kernel was used, as the most popular kernel used in the literature.

1. The exact estimator using the Gaussian kernel, for which the implementation in the R package kedd (Guidoum, 2015) was used. This approach was only applied to the smaller samples, due to the high computation time required for large samples.
2. The binned estimator with Gaussian kernel using the package KernSmooth (Wand, 2015).
3. The fast Fourier transform using R's base stats package.
4. The truncated Taylor expansion approach (Raykar et al., 2010), for which a wrapper was created to implement the authors' C++ code [1] from within R.

The main computational components of the proposed method were implemented in C++, with the master functions in R via the Rcpp (Eddelbuettel and François, 2011) package [2]. The binned estimator using the proposed class of kernels will also be considered. Because of the nature of the kernels used, as discussed in Section 2, the computational complexity of the corresponding binned estimator is $O(n + (\alpha + 1)b)$, where $b$ is the number of bins.

[1] The authors' code was obtained from https:// Software/optimal_bw/optimal_bw_code.htm
[2] A simple R package is available from https://github.com/davidhofmeyr/fkde

Accuracy will be assessed using the integrated squared error between the kernel estimates and the true sampling densities (or their derivatives), i.e.,

$$\|\hat f^{(k)} - f^{(k)}\|_2^2 = \int_{-\infty}^{\infty} \left( \hat f^{(k)}(x) - f^{(k)}(x) \right)^2 dx.$$

Exact evaluation of these integrals is only possible for very specific cases, and so they are numerically integrated. For simplicity in all cases Silverman's rule of thumb is used to select the bandwidth parameter (Silverman, 2018). This extremely popular heuristic is motivated by the optimal asymptotic mean integrated squared error (AMISE) bandwidth value. The heuristic is most commonly applied to density estimation; its direct extension to the first two derivatives will also be used herein. For kernel $K$ the AMISE optimal bandwidth is given by

$$h_{AMISE} = \left( \frac{(2k + 1) R(K^{(k)})}{\sigma_K^4 R(f^{(k+2)})\, n} \right)^{1/(2k+5)}.$$

This objective is generally preferred over the mean integrated squared error as it reduces the dependency on the underlying unknown density function to only the functional $R(f^{(k+2)})$. Silverman's heuristic replaces $R(f^{(k+2)})$ with $R(\phi_{\hat\sigma}^{(k+2)})$, where $\phi_\sigma$ is the normal density with scale parameter $\sigma$. The scale estimate $\hat\sigma$ is computed from the observations, usually as their standard deviation.

3.1 Density Estimation

In this subsection the accuracy and efficiency of the proposed method for density estimation are investigated.

3.1.1 Evaluation on a Grid

Many of the approximation methods for kernel density estimation necessitate that the evaluation points, $\{\tilde x_1, \dots, \tilde x_m\}$, are equally spaced (Scott and Sheather, 1985; Silverman, 1982). In addition such an arrangement is most suitable for visualisation purposes. Here the speed and accuracy of the various methods for evaluation/approximation of the density estimates are considered, where evaluation points are restricted to being on a grid.

Accuracy: The accuracy of all methods is reported in Table 1. Sixty samples were drawn from each density, thirty at each of two sample sizes. The number of evaluation points was kept fixed. The estimated mean integrated squared error is reported in the table. The lowest average is highlighted in each case. The error values for all methods utilising the Gaussian kernel ($\phi$) are extremely similar, which attests to the accuracy of the approximate methods. The error of kernel $K_4$ is also very similar to that of the Gaussian kernel methods. This is unsurprising due to its similar efficiency value. The kernel $K_1$ obtains the lowest error overall, and in the most cases.

Table 1: Estimated mean integrated squared error of density estimates from 30 replications. Two sample sizes are considered. Lowest average error for each scenario is highlighted in bold. Apparent ties were broken by considering more significant figures. Columns (methods): Exact $\phi$, Tr. Taylor $\phi$, Binned $\phi$, FFT $\phi$, Exact $K_1$, Exact $K_4$, Binned $K_1$, Binned $K_4$. Rows: densities (a) to (h), one row per sample size. [Table entries are illegible in this copy.]

In addition the estimated pointwise mean squared error for density (d) was computed for the exact estimates using kernels $K_1$ and $K_4$ and the truncated Taylor approximation for the Gaussian kernel estimate. These can be seen in Figure 4. In addition the shape of density (d) is shown. This density was chosen as it illustrates the improved relative performance of more efficient kernels as the sample size increases. For the smaller sample size kernel $K_1$ has a lower estimated mean integrated squared error, which is evident in Figure 4(a). The mean squared error for the other two methods is almost indistinguishable. On the other hand for the large sample size, shown in Figure 4(b), the error for kernel $K_1$ is noticeably larger at the extrema of the underlying density than for $K_4$ and the Gaussian approximation. Kernel efficiency and the choice of kernel are discussed briefly in Section 4.

[Figure 4: Estimated pointwise mean squared error for density (d). (a) Smaller sample size, n = 1000; (b) larger sample size. Curves: Tr. Taylor $\phi$, Exact $K_1$, Exact $K_4$. Axes: $x$ (horizontal), MSE (vertical).]

Computational efficiency: The running times for all densities are extremely similar, and more importantly the comparative running times between different methods are almost exactly the same across the different densities. It is therefore sufficient for comparisons to consider a single density. Note that in order to evaluate the density estimate at a point not in the sample, the proposed approach requires all computations needed to evaluate the density at the sample points. Evaluation on a grid may therefore be seen as something

of a worst case for the proposed approach. However, once the decision to evaluate the density estimate at points other than the sample points has been made, the marginal cost of increasing the number of evaluation points is extremely small. This fact is well captured by Figure 5. This figure shows plots of the average running times from the methods considered when applied to density (d), plotted on a log-scale. Figure 5(a) shows the effect of increasing the number of observations, while keeping the number of evaluation points fixed. On the other hand Figure 5(b) shows the case where the number of observations is kept fixed and the number of evaluation points is increased. In the former the proposed method, despite obtaining an exact evaluation of the estimated density, is reasonably competitive with the slower of the approximate methods. It is also orders of magnitude faster than the exact method using the Gaussian kernel. In the latter it can be seen that as the number of evaluation points increases the proposed exact approach is even competitive with the fastest approximate methods. Overall the binned approximations provide the fastest evaluation. The nature of the proposed kernels and the proposed method for fast evaluation means that the corresponding binned estimators (particularly that pertaining to kernel $K_1$) are extremely computationally efficient.

[Figure 5: Computation times for density (d) evaluated on a grid. (a) Fixed number of evaluation points, increasing sample size; (b) fixed number of observations, increasing number of evaluation points. Methods shown: Exact $\phi$, Tr. Taylor $\phi$, Binned $\phi$, FFT $\phi$, Exact $K_1$, Exact $K_4$, Binned $K_1$, Binned $K_4$. Axes: sample size or number of grid points (horizontal, log scale), running time in seconds (vertical, log scale).]

3.1.2 Evaluation at the Sample Points

Evaluation of the estimated density at the sample points themselves has important applications in, among other things, computation of non-parametric pseudo-likelihoods and in the estimation of sample entropy. Of the methods considered only the exact methods and the truncated Taylor expansion approximation are applicable to this problem. Table 2 shows the average integrated squared error of the estimated densities from 30 replications for each sampling scenario. Unsurprisingly the accuracy values and associated conclusions are similar to those for the grid evaluations above.

An important difference is that when the density estimates are required at all of the sample values, the proposed exact method outperforms the approximate method in terms of computation time. This is seen in Table 3, where the average running times for all densities are reported. The exact evaluation for kernel $K_4$ is similar to the truncated Taylor approximate method, while the exact evaluation using kernel $K_1$ is roughly five times faster with the current implementations.

Remark 4 It is important to reiterate the fact that the proposed approach is exact. This exactness becomes increasingly important when these density estimates form part of a larger routine, such as maximum pseudo-likelihood or in projection pursuit. When the density estimates are only approximate it becomes more difficult to determine how changes in the sample points, or in hyperparameters, will affect these estimated values.

3.2 Density Derivative Estimation

In this subsection the estimation of the first derivative of a density is considered. The same collection of densities used in density estimation is considered, except that density (b) is omitted since it is not differentiable at its boundaries. Of the methods considered, implementations for derivative estimation were only available for the exact estimation and for the truncated Taylor expansion with the Gaussian kernel. Only estimation at the sample points was considered, since all available methods are capable of this task. The average integrated squared error accuracy is reported in Table 4. Once again the kernel $K_1$ shows the lowest error most often,

however in this case only when the densities have very sharp features. The performance of the lower efficiency kernel is slightly worse on densities (a) and (d), for which the heuristic used for bandwidth selection is closer to optimal based on the AMISE objective (in the case of density (a) it is exactly optimal). In these cases the error of kernel $K_7$ is lowest.

Table 2: Average integrated squared error of density estimates from 30 replications. Two sample sizes are considered. Evaluation is conducted for the entire collection of sample points in each case. Lowest average error for each scenario is highlighted in bold. Apparent ties were broken by considering more significant figures. Columns: densities (a) to (h). Rows (methods): Exact Gauss (smaller sample size only), Trunc. Taylor Gauss, Exact $K_1$, Exact $K_4$, one row per sample size. [Table entries are illegible in this copy.]

Table 3: Average running time of density estimation from 30 replications. Two sample sizes are considered. Evaluation is conducted for the entire collection of sample points in each case. Lowest average computation time for each scenario is highlighted in bold. Columns: densities (a) to (h). Rows (methods): Exact Gauss (smaller sample size only), Trunc. Taylor Gauss, Exact $K_1$, Exact $K_4$, one row per sample size. [Table entries are illegible in this copy.]

Table 4: Average integrated squared error of first derivative estimates from 30 replications. Two sample sizes are considered. Evaluation is conducted for the entire collection of sample points in each case. Lowest average for each scenario is highlighted in bold. Columns: densities (a), (c), (d), (e), (f), (g), (h). Rows (methods): Exact Gauss (smaller sample size only), Trunc. Taylor Gauss, Exact $K_1$, Exact $K_4$, Exact $K_7$, one row per sample size. [Table entries are illegible in this copy.]

The relative computational efficiency of the proposed approach is even more apparent in the task of density derivative estimation. Table 5 reports the average running times on all densities considered. Here it can be seen that the evaluation of the pointwise derivative at the sample points when using kernel $K_1$ is an order of magnitude faster than when using the truncated Taylor expansion. Evaluation with the kernel $K_4$ is roughly three times faster than the approximate method with the current implementations, and the running time with kernel $K_7$ is similar to the approximate approach.

Table 5: Average running time of estimation of first derivative from 30 replications. Two sample sizes are considered. Evaluation is conducted for the entire collection of sample points in each case. Lowest average computation time for each scenario is highlighted in bold. Columns: densities (a), (c), (d), (e), (f), (g), (h). Rows (methods): Exact Gauss (smaller sample size only), Trunc. Taylor Gauss, Exact $K_1$, Exact $K_4$, Exact $K_7$, one row per sample size. [Table entries are illegible in this copy.]

4 Discussion and A Brief Comment on Kernel Choice

In this work a rich class of kernels was introduced whose members allow for extremely efficient and exact evaluation of kernel density and density derivative estimates. A much smaller sub-class was investigated more deeply. Kernels in this sub-class were selected for their simplicity of expression and the fact that they admit a large number of derivatives relative to this simplicity. Thorough experimentation with kernels from this sub-class was conducted, showing extremely promising performance in terms of accuracy and empirical running time.

It is important to note that the efficiency of a kernel for a given task relates to the AMISE which it induces, but under the assumption that the corresponding optimal bandwidth parameter is also selected. The popular heuristic for bandwidth selection which was used herein tends to over-estimate the AMISE optimal value when the underlying density has sharp features and high curvature. With this heuristic there is strong evidence that kernel $K_1$ represents an excellent choice for its fast computation and its accurate density estimation. On the other hand, if a more sophisticated method is employed to select a bandwidth parameter closer to the AMISE optimal, then $K_4$ is recommended for its very similar error to the popular Gaussian kernel and its comparatively fast computation.

An interesting direction for future research will be in the design of kernels in the broader class introduced herein which have simple expressions (in the sense that the polynomial component has a low degree), and which have high relative efficiency for estimation of a specific derivative of the density which is of relevance for a given task.
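The rule-of-thumb bandwidth referred to in the discussion, that is, the AMISE optimal bandwidth with $R(f^{(k+2)})$ replaced by its value for a normal density with scale $\hat\sigma$, can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the function names are hypothetical, the normal reference value $R(\phi_\sigma^{(s)}) = (2s)!/(s!\,2^{2s+1}\sqrt{\pi}\,\sigma^{2s+1})$ is a standard result supplied here, and the constants $\sigma_K^2 = 4$ and $R(K_1) = 5/32$ follow from the formulas in Section 2.

```python
import math
import statistics


def normal_derivative_roughness(s, sigma):
    """R(phi_sigma^(s)): integral of the squared s-th derivative of the
    normal density with scale sigma (standard closed form, assumed here)."""
    return math.factorial(2 * s) / (
        math.factorial(s) * 2 ** (2 * s + 1) * math.sqrt(math.pi) * sigma ** (2 * s + 1)
    )


def amise_bandwidth(n, k, roughness_Kk, sigma2_K, sigma_hat):
    """AMISE optimal bandwidth for the k-th derivative, with R(f^(k+2))
    replaced by the normal reference value at scale sigma_hat."""
    denom = sigma2_K ** 2 * normal_derivative_roughness(k + 2, sigma_hat) * n
    return ((2 * k + 1) * roughness_Kk / denom) ** (1.0 / (2 * k + 5))


def silverman_bandwidth_K1(sample):
    """Rule-of-thumb bandwidth for density estimation (k = 0) with kernel K_1,
    using sigma_K^2 = 4 and R(K_1) = 5/32."""
    sigma_hat = statistics.stdev(sample)  # sample standard deviation
    return amise_bandwidth(len(sample), 0, 5.0 / 32.0, 4.0, sigma_hat)
```

With the Gaussian kernel's constants ($\sigma_K^2 = 1$, $R(K) = 1/(2\sqrt{\pi})$) and $k = 0$, `amise_bandwidth` reduces to the familiar $(4/(3n))^{1/5}\hat\sigma$. For derivative estimation with $K_\alpha$, the value $R(K_\alpha^{(1)})$ derived in Section 2.1 would be passed as `roughness_Kk` with `k = 1`.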

References

Eddelbuettel, D. and R. François (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40(8).

Fan, J. and J. S. Marron (1994). Fast implementations of nonparametric curve estimators. Journal of Computational and Graphical Statistics 3(1).

Guidoum, A. (2015). kedd: Kernel estimator and bandwidth selection for density and its derivatives. R package.

Hall, P. and M. P. Wand (1994). On the accuracy of binned kernel density estimators.

Huber, P. J. (1985). Projection pursuit. The Annals of Statistics.

Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. The Annals of Statistics.

PW, R. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers.

Raykar, V. C., R. Duraiswami, and L. H. Zhao (2010). Fast computation of kernel estimators. Journal of Computational and Graphical Statistics 19(1).

Scott, D. W. and S. J. Sheather (1985). Kernel density estimation with binned data. Communications in Statistics - Theory and Methods 14(6).

Scott, D. W. and G. R. Terrell (1987). Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association 82(400).

Sheather, S. J. and M. C. Jones (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological).

Silverman, B. (1982). Algorithm AS 176: Kernel density estimation using the fast Fourier transform. Journal of the Royal Statistical Society. Series C (Applied Statistics) 31(1).

Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.

Wand, M. (2015). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones. R package.

Yang, C., R. Duraiswami, N. A. Gumerov, L. S. Davis, et al. (2003). Improved fast Gauss transform and efficient kernel density estimation. In ICCV, Volume 1.
