arxiv: v1 [stat.me] 25 Mar 2019


β-Divergence loss for the kernel density estimation with bias reduced

Hamza Dhaker (a,*), El Hadji Deme (b), and Youssou Ciss (b)

(a) Département de mathématiques et statistique, Université de Moncton, NB, Canada
(b) LERSTAD, UFR SAT, Université Gaston Berger, Saint-Louis, Senegal

Abstract

Although nonparametric kernel density estimation with bias reduction is nowadays a standard technique in exploratory data analysis, there is still considerable dispute about how to assess the quality of the estimate and which choice of bandwidth is optimal. This article examines the most important bandwidth selection methods for kernel density estimation with bias reduction, in particular the normal reference, least squares cross-validation, biased cross-validation and β-divergence loss methods. The methods are described and their expressions are presented. We compare these bandwidth selectors on simulated data. As examples with real data, we use an econometric data set (CO2 per capita) in Example 1, and a second data set consisting of 107 eruption lengths in minutes for the Old Faithful geyser in Yellowstone National Park, USA.

Keywords: β-divergence; kernel density estimation; bandwidth.

1. Introduction

Selecting an appropriate bandwidth for a kernel density estimator is of crucial importance, and the purpose of the estimation may be an influential factor in the selection method. In many situations, it is sufficient to choose the smoothing parameter subjectively by looking at the density estimates produced by a range of bandwidths. A good overview of kernel density estimators is supplied by Silverman [18], Scott [17], and Mugdadi and Ahmad [14].

Throughout this article, we use the following notation. Let $X_1, \dots, X_n$ be a sample of size $n$ from a density $f$. The Parzen [15] kernel estimator $\hat f_n$ is defined by

\[ \hat f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right). \]

In general, $K$ is the kernel function (e.g. the normal density function) and $h$ is the bandwidth.
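To make the definition of the kernel estimator concrete, the following is a minimal sketch in plain Python, assuming a Gaussian kernel. The function names `gaussian_kernel` and `kde` and the toy sample are illustrative choices, not part of the paper.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel estimate at x: (1 / (n h)) * sum_i K((x - X_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


# Hypothetical toy sample, for illustration only.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]

# A larger bandwidth gives a flatter, smoother estimate at the same point.
for h in (0.2, 0.5, 1.0):
    print(h, kde(0.0, sample, h))
```

Evaluating `kde` on a grid of x values for several bandwidths reproduces the subjective "look at a range of bandwidths" approach mentioned in the text.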
A large body of research is devoted to choosing $h$, essentially the amount of smoothing to apply. Usually, smoothing parameters are chosen via cross-validation or by minimizing a measure of error. An important recent paper in this area is Xie and Wu [21], who studied an estimator that reduces the bias and whose performance, in both theory and simulations, proved to be clearly superior to other methods currently popular in the literature. The bias-reduced estimator of Xie and Wu [21] is defined by

\[ \tilde f_n(x) = \hat f_n(x) - \widehat{\mathrm{Bias}}(\hat f_n(x)) \tag{1} \]
\[ \phantom{\tilde f_n(x)} = \hat f_n(x) - \frac{h^2}{2} \hat f_n''(x) \int t^2 K(t)\,dt. \tag{2} \]

The bandwidth is the most dominant parameter in the kernel density estimator. This parameter controls the amount of smoothing, and is analogous to the binwidth in a histogram. Even though the kernel estimator depends on the kernel and the bandwidth in a rather complicated way, a graphical representation clearly illustrates the difference in importance between these two parameters; see figure.3 and.6a in Wand and Jones [20]. To explore the most relevant bandwidth selection methods in density estimation for complete data, see the reviews of Turlach [19], Cao et al. [3], Jones et al. [8], Heidenreich et al. (2013), and Mammen et al. ([11] and [12]), and the recent work on β-divergence for bandwidth selection by Dhaker et al. [5].

Our aim in this paper is to propose and compare several bandwidth selection procedures for the kernel density estimators introduced by Xie and Wu [21]. The procedures we study are bandwidth selectors based on the β-divergence criterion with different values of β. A simulation study is then carried out to assess the finite sample behavior of these bandwidth selectors.

The remainder of the paper is organised as follows. In Section 2, we state our main results and present the proposed bandwidth selector based on $D_\beta$. Section 3 deals with the estimation of the optimal bandwidth. Section 4 is devoted to our simulation results. Section 5 applies the methods to real data. We conclude the paper in Section 6.

(*) Corresponding author: hamza.dhaker@umoncton.ca (Hamza Dhaker). Preprint, March 2019.

2. Bandwidth selection based on β-divergence

The β-divergence (Cichocki [4], Basu [1] and Eguchi [6]) is a general framework of similarity measures induced from various statistical models, such as the Poisson, Gamma, Gaussian, Inverse Gaussian and compound Poisson distributions. For the connection between the β-divergence and the various statistical distributions, see Jorgensen [9]. The β-divergence was proposed in Basu [1] and Minami and Eguchi [13] and is defined as a dissimilarity between the density function and its estimator:

\[ D_\beta(\tilde f_n, f) = \frac{1}{\beta} \int \tilde f_n^{\beta}(x)\,dx - \frac{1}{\beta - 1} \int \tilde f_n^{\beta - 1}(x) f(x)\,dx + \frac{1}{\beta(\beta - 1)} \int f^{\beta}(x)\,dx. \]

In the case $\beta = 2$, $D_2(\tilde f_n, f) = \tfrac{1}{2} ISE(\tilde f_n) = \tfrac{1}{2} \int (\tilde f_n - f)^2\,dx$.

The following theorem gives the analytical value of the bandwidth that minimizes the mean of $D_\beta(\tilde f_n, f)$.

Theorem 1.
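Let the following conditions on $f$ be satisfied:

(F1) $f$ is compactly supported.
(F2) $f$ is four times continuously differentiable.
(F3) $\int f^{\beta - 2} (f^{(4)})^2\,dx < \infty$.

Then the bandwidth $h_{ED_\beta}$ that minimizes the mean β-divergence between the kernel estimator $\tilde f_h$ and the density $f$ is

\[ h = h_{ED_\beta} = \left\{ \frac{72 \int K^2(t)\,dt \int f^{\beta - 1}\,dx}{\left[\int t^4 K(t)\,dt\right]^2 \int f^{\beta - 2} (f^{(4)})^2\,dx} \right\}^{1/9} n^{-1/9}. \tag{3} \]

Proposition 1. Under (F1)-(F3) we have

\[ ED_\beta(\tilde f_n, f) = \frac{1}{2} \left[ \frac{h^8}{576} \left(\int t^4 K(t)\,dt\right)^2 \int f^{\beta - 2} (f^{(4)})^2\,dx + \frac{1}{nh} \int K^2(t)\,dt \int f^{\beta - 1}\,dx \right] + O_p(n^{-c}) + O(h^6) \tag{4} \]

and

\[ AED_\beta(\tilde f_n, f) = \frac{1}{2} \left[ \frac{h^8}{576} \left(\int t^4 K(t)\,dt\right)^2 \int f^{\beta - 2} (f^{(4)})^2\,dx + \frac{1}{nh} \int K^2(t)\,dt \int f^{\beta - 1}\,dx \right], \tag{5} \]

where $0 < c < 1/8$.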
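The divergence $D_\beta$ can be evaluated numerically once an estimate and a reference density are available. The sketch below is a hedged illustration in plain Python, not the authors' code. It assumes $\beta > 1$ and a Gaussian kernel, for which the bias-reduced estimator (2) reduces to a kernel estimate with the fourth-order kernel $\tfrac{1}{2}(3 - u^2)\varphi(u)$. Because that estimate can be negative, it is clipped at zero before non-integer powers are taken; the clipping is an assumption of this sketch. All function names and the toy sample are hypothetical.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def bias_reduced_kde(x, data, h):
    """Estimator (2) with a Gaussian kernel: fhat - (h^2 / 2) * fhat''.

    For K = phi, K''(u) = (u^2 - 1) phi(u) and int t^2 K = 1, so each
    summand becomes 0.5 * (3 - u^2) * phi(u).
    """
    total = 0.0
    for xi in data:
        u = (x - xi) / h
        total += 0.5 * (3.0 - u * u) * phi(u)
    return total / (len(data) * h)


def integrate(g, lo, hi, steps=4000):
    """Composite trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for k in range(1, steps):
        total += g(lo + k * dx)
    return total * dx


def beta_divergence(est, f, beta, lo, hi):
    """D_beta(est, f); est must be non-negative on [lo, hi]."""
    if beta <= 1:
        raise ValueError("this sketch assumes beta > 1")
    t1 = integrate(lambda x: est(x) ** beta, lo, hi) / beta
    t2 = integrate(lambda x: est(x) ** (beta - 1) * f(x), lo, hi) / (beta - 1)
    t3 = integrate(lambda x: f(x) ** beta, lo, hi) / (beta * (beta - 1))
    return t1 - t2 + t3


# Hypothetical sample, compared against a standard normal reference density.
data = [-1.6, -0.9, -0.3, 0.0, 0.2, 0.5, 1.1, 1.8]
estimate = lambda x: max(bias_reduced_kde(x, data, 0.8), 0.0)  # clipped at zero
for beta in (1.5, 2.0):
    print(beta, beta_divergence(estimate, phi, beta, -10.0, 10.0))
```

With $\beta = 2$ the three terms combine pointwise into $\tfrac{1}{2}(\tilde f_n - f)^2$, so the function returns half the integrated squared error.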

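Corollary 1. Assume that the assumptions of Theorem 1 hold with $\beta = 2$. Then $ED_2(\tilde f_h, f) = \tfrac{1}{2} MISE(\tilde f_h, f)$ and $AED_2(\tilde f_h, f) = \tfrac{1}{2} AMISE(\tilde f_h, f)$, and in that case

\[ h_{AMISE(\tilde f_h, f)} = \left\{ \frac{72\, R(K)}{(\mu_4(K))^2 R(f^{(4)})} \right\}^{1/9} n^{-1/9}, \tag{6} \]

where $R(g) = \int g(t)^2\,dt$ and $\mu_4(K) = \int x^4 K(x)\,dx$.

Proof. Recall that $\tilde f_n = \hat f_n - \widehat{\mathrm{Bias}}(\hat f_n)$. With a random variable $\xi = O_p(1)$ whose expectation is 0 and variance is 1, we can write $\hat f_n$ as (see Kanazawa [10])

\[ \hat f_n = f \left[ 1 + \frac{h^2}{2} \frac{f^{(2)}}{f} \int t^2 K(t)\,dt + \frac{h^4}{24} \frac{f^{(4)}}{f} \int t^4 K(t)\,dt + O(h^6) + \left\{ \frac{\int K(t)^2\,dt}{nhf} \right\}^{1/2} \xi + O_p(n^{-1/2}) \right]. \tag{7} \]

Using the corollary in [7], we have

\[ \lim_{n \to \infty} \sup_x\, n^{c} \left| \hat f_n^{(r)}(x) - f^{(r)}(x) \right| = 0 \quad \text{with } 0 < c < \frac{1}{2r + 4}. \]

Hence

\[ \tilde f_n = \hat f_n - \widehat{\mathrm{Bias}}(\hat f_n) = \hat f_n - \frac{h^2}{2} \hat f_n^{(2)} \int t^2 K(t)\,dt = \hat f_n - \frac{h^2}{2} f^{(2)} \int t^2 K(t)\,dt + O(n^{-c}) \]
\[ = f \left[ 1 + \frac{h^2}{2} \frac{f^{(2)}}{f} \int t^2 K(t)\,dt + \frac{h^4}{24} \frac{f^{(4)}}{f} \int t^4 K(t)\,dt + O(h^6) + \left\{ \frac{\int K(t)^2\,dt}{nhf} \right\}^{1/2} \xi + O_p(n^{-1/2}) \right] - \frac{h^2}{2} f^{(2)} \int t^2 K(t)\,dt + O(n^{-c}) \]
\[ = f \left[ 1 + \frac{h^4}{24} \frac{f^{(4)}}{f} \int t^4 K(t)\,dt + O(h^6) + \left\{ \frac{\int K(t)^2\,dt}{nhf} \right\}^{1/2} \xi + O_p(n^{-1/2}) \right] + O(n^{-c}), \]

where the $O(h^6)$ terms depend upon $x$. Using $(1 + z)^{\beta} = 1 + \beta z + \frac{\beta(\beta - 1)}{2} z^2 + O(z^3)$ and dropping the cross term of $z^2$, which is linear in $\xi$ and therefore has zero mean, we obtain

\[ \tilde f_n^{\beta} = f^{\beta} \left[ 1 + \beta \left( \frac{h^4}{24} \frac{f^{(4)}}{f} \int t^4 K(t)\,dt + \left\{ \frac{\int K(t)^2\,dt}{nhf} \right\}^{1/2} \xi \right) + \frac{\beta(\beta - 1)}{2} \left( \frac{h^8}{576} \frac{(f^{(4)})^2}{f^2} \left( \int t^4 K(t)\,dt \right)^2 + \frac{\int K(t)^2\,dt}{nhf} \xi^2 \right) + O_p(n^{-c}) + O(h^6) \right] \]

and

\[ \tilde f_n^{\beta - 1} = f^{\beta - 1} \left[ 1 + (\beta - 1) \left( \frac{h^4}{24} \frac{f^{(4)}}{f} \int t^4 K(t)\,dt + \left\{ \frac{\int K(t)^2\,dt}{nhf} \right\}^{1/2} \xi \right) + \frac{(\beta - 1)(\beta - 2)}{2} \left( \frac{h^8}{576} \frac{(f^{(4)})^2}{f^2} \left( \int t^4 K(t)\,dt \right)^2 + \frac{\int K(t)^2\,dt}{nhf} \xi^2 \right) + O_p(n^{-c}) + O(h^6) \right]. \]

Substituting both expansions into the definition of the divergence,

\[ D_\beta(\tilde f_n, f) = \frac{1}{\beta} \int \tilde f_n^{\beta}\,dx - \frac{1}{\beta - 1} \int \tilde f_n^{\beta - 1} f\,dx + \frac{1}{\beta(\beta - 1)} \int f^{\beta}\,dx. \]

The constant terms cancel because $\frac{1}{\beta} - \frac{1}{\beta - 1} + \frac{1}{\beta(\beta - 1)} = 0$, the first-order terms cancel because both carry coefficient 1, and the second-order terms carry the coefficient $\frac{\beta - 1}{2} - \frac{\beta - 2}{2} = \frac{1}{2}$. Therefore

\[ D_\beta(\tilde f_n, f) = \int f^{\beta} \left[ \frac{1}{2} \left( \frac{h^8}{576} \frac{(f^{(4)})^2}{f^2} \left( \int t^4 K(t)\,dt \right)^2 + \frac{\int K(t)^2\,dt}{nhf} \xi^2 \right) + O_p(n^{-c}) + O(h^6) \right] dx \]
\[ = \frac{1}{2} \left[ \frac{h^8}{576} \left( \int t^4 K(t)\,dt \right)^2 \int f^{\beta - 2} (f^{(4)})^2\,dx + \frac{1}{nh} \int K^2(t)\,dt \int f^{\beta - 1}\,dx\; \xi^2 \right] + O_p(n^{-c}) + O(h^6). \]

Taking expectations, with $E\xi^2 = 1$,

\[ ED_\beta(\tilde f_n, f) = \frac{1}{2} \left[ \frac{h^8}{576} \left( \int t^4 K(t)\,dt \right)^2 \int f^{\beta - 2} (f^{(4)})^2\,dx + \frac{1}{nh} \int K^2(t)\,dt \int f^{\beta - 1}\,dx \right] + O_p(n^{-c}) + O(h^6). \]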
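As a worked step, the bandwidth in (3) follows from minimizing the asymptotic criterion over $h$. The derivation below assumes that (5) is read as the two-term expression in its first line. The shorthand $A$ and $B$ is introduced here for illustration only and does not appear in the paper.

```latex
\begin{align*}
AED_\beta(h) &= \frac{1}{2}\left[\frac{h^8}{576}\,A + \frac{1}{nh}\,B\right],
  \qquad A = \left(\int t^4 K(t)\,dt\right)^2 \int f^{\beta-2}\,(f^{(4)})^2\,dx,
  \qquad B = \int K^2(t)\,dt \int f^{\beta-1}\,dx, \\
\frac{d}{dh}\,AED_\beta(h) &= \frac{1}{2}\left[\frac{8h^7}{576}\,A - \frac{1}{nh^2}\,B\right] = 0
  \iff h^9 = \frac{576}{8}\,\frac{B}{nA} = \frac{72\,B}{nA}, \\
h_{ED_\beta} &= \left\{\frac{72\,B}{A}\right\}^{1/9} n^{-1/9}.
\end{align*}
```

The second derivative, $\tfrac{1}{2}\left[\tfrac{56 h^6}{576} A + \tfrac{2}{n h^3} B\right]$, is positive for every $h > 0$, so the stationary point is a minimum. The overall factor $\tfrac{1}{2}$ does not affect the minimizer.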

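3. The choice of the bandwidth h

In this section, we describe bandwidth selection methods for the density estimator defined in (2). These methods are adaptations of common automatic selectors for kernel density estimation. We propose two selection methods: a normal reference rule and a cross-validation method. The normal reference bandwidth is based on estimating the infeasible optimal expression (6), in which the unknown element is $R(f^{(4)})$.

3.1. Rule-of-thumb for bandwidth selection

This method is based on the rule of thumb of Silverman [18] for complete data. The idea is to assume that the underlying distribution is normal with mean $m$ and variance $\sigma^2$. In this situation

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2}, \]
\[ f^{(4)}(x) = \frac{1}{\sigma^5 \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2} \left( 3 - 6 \left( \frac{x - m}{\sigma} \right)^2 + \left( \frac{x - m}{\sigma} \right)^4 \right), \]
\[ (f^{(4)}(x))^2 = \frac{1}{2\pi \sigma^{10}}\, e^{-\left( \frac{x - m}{\sigma} \right)^2} \left( 9 - 36 \left( \frac{x - m}{\sigma} \right)^2 + 42 \left( \frac{x - m}{\sigma} \right)^4 - 12 \left( \frac{x - m}{\sigma} \right)^6 + \left( \frac{x - m}{\sigma} \right)^8 \right), \]

so that

\[ \int f^{\beta - 2} (f^{(4)})^2\,dx = \frac{(2\pi)^{(1 - \beta)/2}}{\sigma^{\beta + 7} \sqrt{\beta}} \left( 9 - \frac{36}{\beta} + \frac{126}{\beta^2} - \frac{180}{\beta^3} + \frac{105}{\beta^4} \right) \]

and

\[ \int f^{\beta - 1}\,dx = \frac{(2\pi)^{(2 - \beta)/2}\, \sigma^{2 - \beta}}{\sqrt{\beta - 1}}. \]

In that case the asymptotically optimal bandwidth $h$ in Equation (3) becomes the normal reference bandwidth

\[ h = h_{ED_\beta} = \left\{ \frac{72\, R(K) \int f^{\beta - 1}\,dx}{(\mu_4(K))^2 \int f^{\beta - 2} (f^{(4)})^2\,dx} \right\}^{1/9} n^{-1/9} = \left\{ \frac{72\, R(K)\, (2\pi)^{1/2} \sqrt{\beta}}{(\mu_4(K))^2 \sqrt{\beta - 1} \left( 9 - \frac{36}{\beta} + \frac{126}{\beta^2} - \frac{180}{\beta^3} + \frac{105}{\beta^4} \right)} \right\}^{1/9} \sigma\, n^{-1/9}, \]

with $\sigma$ being the standard deviation of $f$. For the Gaussian kernel, $\mu_4(K) = 3$ and $R(K) = (4\pi)^{-1/2}$, so that

\[ h_{NR} = \left\{ \frac{4 \sqrt{2}\, \sqrt{\beta}}{\sqrt{\beta - 1} \left( 9 - \frac{36}{\beta} + \frac{126}{\beta^2} - \frac{180}{\beta^3} + \frac{105}{\beta^4} \right)} \right\}^{1/9} \sigma\, n^{-1/9}, \tag{8} \]

and for $\beta = 2$

\[ h_{NR} = \left( \frac{128}{105} \right)^{1/9} \sigma\, n^{-1/9}. \tag{9} \]

The standard deviation $\sigma$ can be estimated by the sample standard deviation $s$ or, for robustness against outliers, by the standardized interquartile range $IQR/1.34$ (where $1.34 \approx \Phi^{-1}(3/4) - \Phi^{-1}(1/4)$), but a better rule of thumb is (e.g., Silverman, 1986, pp. 45-47; Härdle, 1991, p. 91)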

\[ \hat h_{NR} = \left( \frac{128}{105} \right)^{1/9} \hat\sigma\, n^{-1/9}, \tag{10} \]

with $\hat\sigma = \min(s, IQR/1.34)$.

3.2. Cross-validation

The method previously defined is based on minimising estimates of the MISE, more precisely of the AMISE. The procedure in this section relies instead on the minimisation of the ISE (integrated squared error); the methodology is the same as in Rudemo [16] and Bowman [2], applied to (2). Let us write

\[ D_\beta(\tilde f_n, f) = \frac{1}{\beta} \int \tilde f_n^{\beta}\,dx - \frac{1}{\beta - 1} \int \tilde f_n^{\beta - 1} f\,dx + \frac{1}{\beta(\beta - 1)} \int f^{\beta}\,dx. \]

Note that $\frac{1}{\beta(\beta - 1)} \int f^{\beta}\,dx$ does not depend on $h$, so the minimisation of the ISE is equivalent to minimising the following function:

\[ L(h) = D_\beta(\tilde f_n, f) - \frac{1}{\beta(\beta - 1)} \int f^{\beta}\,dx = \frac{1}{\beta} \int \tilde f_n^{\beta}\,dx - \frac{1}{\beta - 1} \int \tilde f_n^{\beta - 1} f\,dx = \frac{1}{\beta} \int \tilde f_n^{\beta}\,dx - \frac{1}{\beta - 1} E\left( \tilde f_n^{\beta - 1} \right). \]

The principle of the least squares cross-validation method is to find an estimate of $L(h)$ from the data and minimize it over $h$. Consider the estimator

\[ LSCV(h) = \frac{1}{\beta} \int \tilde f_n^{\beta}\,dx - \frac{1}{(\beta - 1)\, n} \sum_{i=1}^{n} \tilde f_{h(i)}^{\beta - 1}(X_i), \]

with the leave-one-out estimate

\[ \tilde f_{h(i)}(X_i) = \frac{1}{h(n - 1)} \sum_{j \neq i} K\left( \frac{X_i - X_j}{h} \right). \]

4. Simulation

In this section we evaluate the performance of the bandwidth selection procedures presented in Section 2. To this end we have carried out a simulation study including the rule-of-thumb bandwidth ($h_{NR}$), the cross-validation bandwidth ($h_{LSCV}$) and the bandwidth $h_\beta$ minimizing the β-divergence criterion (with β = .5, . and .9). We consider various sets of experiments in which data are generated from the mixture of a Normal $N(0, 1)$ and a Normal $N(\mu, \sigma)$ distribution. Hence, the DGP (data generating process) is $m(\mu, \sigma)$ with density

\[ m(\mu, \sigma) = 0.5\, N(0, 1) + 0.5\, N(\mu, \sigma), \tag{11} \]

where µ = 0,, 5 and σ =, 0.5, 0.. One thousand Monte Carlo samples of size $n$ are generated from the normal mixture model in Equation (11) for each of n = 50, 00, 700.

The results of our different sets of experiments are presented in Tables 1-3. Table 1 exhibits the simulated relative efficiency $RE(\hat h) = MISE(\tilde f_{\hat h_{MISE}}) / MISE(\tilde f_{\hat h})$ of the kernel estimator using the bandwidths $\hat h_{NR}$, $\hat h_{LSCV}$ and $\hat h_\beta$; it is lower than 1 because the optimal bandwidth $h_{MISE}$ minimizes the MISE. For each bandwidth, the mean $E(\hat h)$ and the mean relative error $E(\hat h / h_{MISE})$ are reported in Tables 2 and 3, respectively.
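The two selectors of Section 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The normal reference function reads (3) as $h = \{72 R(K) \int f^{\beta-1} / (\mu_4(K)^2 \int f^{\beta-2}(f^{(4)})^2)\}^{1/9} n^{-1/9}$ and plugs in a normal density by numerical integration. The cross-validation function assumes $\beta > 1$ and, for simplicity, uses the uncorrected Gaussian kernel estimate $\hat f$ in both terms; swapping in the bias-reduced estimate would need a guard against negative values. All names are hypothetical.

```python
import math


def phi(u):
    """Standard normal density (also the Gaussian kernel K)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def trapz(g, lo, hi, steps=2000):
    """Composite trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    inner = sum(g(lo + k * dx) for k in range(1, steps))
    return dx * (0.5 * (g(lo) + g(hi)) + inner)


def normal_reference_bandwidth(n, sigma, beta=2.0):
    """Plug a N(0, sigma^2) density into (3), Gaussian kernel assumed."""
    r_k = 1.0 / (2.0 * math.sqrt(math.pi))  # R(K) = integral of K^2
    mu4 = 3.0                               # integral of t^4 K(t)

    def f(x):
        return phi(x / sigma) / sigma

    def f4(x):  # fourth derivative of the normal density
        u = x / sigma
        return (u ** 4 - 6.0 * u ** 2 + 3.0) * phi(u) / sigma ** 5

    lo, hi = -12.0 * sigma, 12.0 * sigma
    num = trapz(lambda x: f(x) ** (beta - 1.0), lo, hi)
    den = trapz(lambda x: f(x) ** (beta - 2.0) * f4(x) ** 2, lo, hi)
    return (72.0 * r_k * num / (mu4 ** 2 * den * n)) ** (1.0 / 9.0)


def kde(x, data, h):
    """Uncorrected kernel estimate (1 / (n h)) * sum_i K((x - X_i) / h)."""
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)


def lscv_beta(h, data, beta=2.0):
    """(1/beta) int fhat^beta dx - 1/((beta-1) n) sum_i fhat_(i)(X_i)^(beta-1)."""
    n = len(data)
    lo, hi = min(data) - 8.0 * h, max(data) + 8.0 * h
    term1 = trapz(lambda x: kde(x, data, h) ** beta, lo, hi) / beta
    loo = 0.0
    for i, xi in enumerate(data):
        others = data[:i] + data[i + 1:]  # leave observation i out
        loo += kde(xi, others, h) ** (beta - 1.0)
    return term1 - loo / ((beta - 1.0) * n)


# Hypothetical data and bandwidth grid, for illustration only.
data = [-1.3, -0.2, 0.1, 0.4, 0.9, 2.0, 2.4, 3.1]
grid = [0.2 + 0.1 * k for k in range(15)]
print("normal reference:", normal_reference_bandwidth(len(data), 1.0, beta=2.0))
print("cross-validated:", min(grid, key=lambda h: lscv_beta(h, data, beta=1.5)))
```

For $\beta = 2$ the normal reference function reproduces the constant $(128/105)^{1/9}$ that follows from (6) with a Gaussian kernel, and `lscv_beta` equals half the classical least squares cross-validation score. A grid search is used only because it is simple; any one-dimensional minimizer would do.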

- For all situations, each relative efficiency $RE(\hat h) < 1$ because the optimal bandwidth $h_{MISE}$ minimizes the MISE.
- The normal reference bandwidth $\hat h_{NR}$ performs well if the true density is not very far from normal, such as the cases of (µ, σ) = (0, ), (0, 0.5), (, ), and (, 0.5). Otherwise, it usually has the smallest $RE(\hat h)$ and the largest $E(\hat h)$, tending to oversmooth its kernel density estimate the most.
- In Table 1, the LSCV bandwidth $\hat h_{LSCV}$ needs a large sample size in order to be competitive. Note also that in Table 2, $E(\hat h_{LSCV})$ is close to the optimal $\hat h_{MISE}$, but the corresponding $E(\hat h_{LSCV} / \hat h_{MISE})$ is large, which means that the bias of $\hat h_{LSCV}$ is small but its variation is large.
- The bandwidth $\hat h_\beta$ seems to be the best of the existing bandwidth selectors. In most situations it is indeed one of the best bandwidth selectors; however, it behaves very poorly for small σ (when the true density curve is sharp).

Table 1: $RE(\hat h)$ for the normal mixture $f(x) = 0.5\varphi(x) + 0.5\varphi_\sigma(x - \mu)$. Columns: $n$, $\hat h_{NR}$, $\hat h_{LSCV}$, $\hat h_{D_{.}CV}$, $\hat h_{D_{.5}CV}$, $\hat h_{D_{.9}CV}$. Rows are grouped by the nine combinations of (µ, σ).

Table 2: $E(\hat h)$ for the normal mixture $f(x) = 0.5\varphi(x) + 0.5\varphi_\sigma(x - \mu)$. Columns: $n$, $\hat h_{NR}$, $\hat h_{LSCV}$, $\hat h_{D_{.}CV}$, $\hat h_{D_{.5}CV}$, $\hat h_{D_{.9}CV}$, $h_{MISE}$. Rows are grouped by the nine combinations of (µ, σ).

Table 3: $E(\hat h / h_{MISE})$ for the normal mixture $f(x) = 0.5\varphi(x) + 0.5\varphi_\sigma(x - \mu)$. Columns: $n$, $\hat h_{NR}$, $\hat h_{LSCV}$, $\hat h_{D_{.}CV}$, $\hat h_{D_{.5}CV}$, $\hat h_{D_{.9}CV}$. Rows are grouped by the nine combinations of (µ, σ).
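The quantities reported in Tables 1-3 come from a Monte Carlo design that can be sketched as follows. This is a hedged, scaled-down illustration in plain Python, not the authors' simulation code. It assumes that σ in $N(\mu, \sigma)$ is a standard deviation, uses the Gaussian-kernel form of the bias-reduced estimator (2), and approximates the MISE by an average of integrated squared errors over a few samples. The parameter values, grid and function names are hypothetical.

```python
import math
import random


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x, mu, sigma):
    """Density of 0.5 * N(0, 1) + 0.5 * N(mu, sigma^2)."""
    return 0.5 * phi(x) + 0.5 * phi((x - mu) / sigma) / sigma


def sample_mixture(n, mu, sigma, rng):
    """Each draw comes from N(0, 1) or N(mu, sigma^2) with probability 1/2."""
    return [rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(mu, sigma)
            for _ in range(n)]


def bias_reduced_kde(x, data, h):
    """Estimator (2) for a Gaussian kernel: summands 0.5 * (3 - u^2) * phi(u)."""
    total = 0.0
    for xi in data:
        u = (x - xi) / h
        total += 0.5 * (3.0 - u * u) * phi(u)
    return total / (len(data) * h)


def ise(data, h, mu, sigma, steps=800):
    """Integrated squared error against the mixture (trapezoidal rule)."""
    lo = min(min(data), -5.0, mu - 5.0 * sigma) - 3.0 * h
    hi = max(max(data), 5.0, mu + 5.0 * sigma) + 3.0 * h
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        diff = bias_reduced_kde(x, data, h) - mixture_pdf(x, mu, sigma)
        total += weight * diff * diff
    return total * dx


def relative_efficiency(h, grid, samples, mu, sigma, steps=800):
    """Smallest average ISE over the grid divided by the average ISE at h."""
    def mise(b):
        return sum(ise(s, b, mu, sigma, steps) for s in samples) / len(samples)
    return min(mise(b) for b in grid) / mise(h)


# Hypothetical, scaled-down setting: 3 samples of size 40.
rng = random.Random(1)
samples = [sample_mixture(40, 3.0, 0.5, rng) for _ in range(3)]
grid = [0.3, 0.5, 0.8, 1.2]
print(relative_efficiency(0.8, grid, samples, 3.0, 0.5, steps=300))
```

When `h` belongs to `grid`, the returned ratio cannot exceed 1, which mirrors the remark that every relative efficiency in Table 1 is below 1. A study on the scale described in the text would use many more replications and a finer integration grid for small σ.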

Figure 1: Boxplots of the relative values RE for the bandwidth selectors, for estimation of the densities with µ = 0,, 5 and σ =, 0.5, 0.. The sample size varies from 00 to 000.

Figure 1 compares, for densities with (µ = 0,, 5 and σ =,.5,.), the results of the five bandwidth selectors NR, LSCV and $D_\beta$ (discussed in Section 3), relative to the results obtained by using the MISE-optimal bandwidth ($h_{MISE}$). The figure presents boxplots of the ratio $RE(\hat h) = MISE(\tilde f_{\hat h_{MISE}}) / MISE(\tilde f_{\hat h})$ for each of the bandwidths $\hat h_{NR}$, $\hat h_{LSCV}$ and $\hat h_{D_\beta}$ (with β =.,.5 and.9). We see that the LSCV and $D_\beta$ (with β =.5) methods gave overall the best ratios across all simulations, and that this ratio was rather large in general.

5. Illustration with real data

Two examples are provided to demonstrate the performance of kernel density estimation with different bandwidths, where the Gaussian kernel is used. They are classical examples of unimodal and bimodal distributions, respectively.

Example 1. The first data set comprises the CO2 per capita in the year 2014. The data set can be downloaded from the World Bank website. Figure 2 shows the estimated density of CO2 per capita in the year 2008, using the bandwidths ĥ_NR =.38, ĥ_LSCV = 0.439, ĥ_.5 = 0.83, ĥ_. = 0.93 and ĥ_.9 = 0.54. The estimated density computed with the ĥ_LSCV and ĥ_.9 bandwidths captures the peak that characterizes the mode, while the estimated densities computed with ĥ_NR, ĥ_.5 and ĥ_. smooth out this peak. This happens because the outliers in the tail of the distribution cause ĥ_NR, ĥ_.5 and ĥ_. to be larger than the other bandwidths.

Figure 2: Estimated density of CO2 per capita in 2008 using the different bandwidths. D_. (solid line); D_.9 (dashed line); D_.5 (dotted line); LSCV, least squares cross-validation (dot-dash line) and NR, normal reference (long-dash line).

Example 2. We use the time between eruptions data set for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA (107 sample data, source: Silverman [18]). Figure 3 plots the data points and the kernel density estimates for the Old Faithful geyser data, using the bandwidths ĥ_NR = 0.44, ĥ_LSCV = 0.6, ĥ_.5 = 0.76, ĥ_. = 0.8 and ĥ_.9 = 0.0.

An important point to note is that the density curve for eruption length is similar to a bimodal normal density (normal mixture). From our example we see that h_NR is always larger than the other bandwidths; it heavily oversmooths its kernel density curve, underestimating the two peaks of the curve but overestimating the valley between them. By contrast, h_LSCV, ĥ_.5 and ĥ_.9 seem to undersmooth the curve too much, overestimating the two peaks but underestimating the valley. However, ĥ_. is a proper bandwidth, allowing the density estimate to capture the features of the true density curve.

6. Conclusions

This paper proposed a method for bandwidth selection for the bias-reduced kernel density estimator given in (2). Various bandwidth selection strategies have been considered, such as the normal reference h_NR, least squares cross-validation h_LSCV and the bandwidth h_β minimizing the β-divergence criterion (with β =.5,. and.9). The normal reference bandwidth h_NR is a simple and quick selector, but it is of limited practical use, since it is restricted to situations where a pre-specified family of densities is correctly selected. LSCV does not provide a smooth density estimate: although it is asymptotically optimal, the finite sample behavior of h_LSCV is disappointing because of its variability and undersmoothing. We have attempted to evaluate the choice of the optimal bandwidths h_LSCV and h_NR using the β-divergence. Compared to traditional bandwidth selection methods designed for kernel density estimation, our proposed $D_\beta$ bandwidth selection method is in most situations one of the best, with large $RE(\hat h)$ and small $E(\hat h / \hat h_{MISE})$. Simulation studies showed that the proposed $D_\beta$ bandwidth adapts to different situations and outperforms the other bandwidths. We conclude that the choice of bandwidth based on the real data is consistent with the one based on simulations: the $D_\beta$ method (with β =. and.5) gives a smoother density estimate.

References

[1] Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence, Biometrika 85(3).
[2] Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71(2).
[3] Cao, R., Cuevas, A. and González-Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation, Computational Statistics and Data Analysis, 17, 153-176.
[4] Cichocki, A., Zdunek, R. and Amari, S. (2006). Csiszár's divergences for nonnegative matrix factorization: Family of new algorithms. In Lecture Notes in Computer Science; Springer: Charleston, SC, USA, 3889, 32-39.

Figure 3: Estimated density of the Old Faithful geyser data using the different bandwidths. D_. (solid line); D_.9 (dashed line); D_.5 (dotted line); LSCV, least squares cross-validation (dot-dash line) and NR, normal reference (long-dash line).

[5] Dhaker, H., Ngom, P., Deme, E. H. and Mbodj, M. (2018). New approach for bandwidth selection in the kernel density estimation based on β-divergence, Journal of Mathematical Sciences: Advances and Applications, 5.
[6] Eguchi, S. and Kano, Y. (2001). Robustifying maximum likelihood estimation. Technical report, Institute of Statistical Mathematics, June.
[7] Schuster, E. F. (1969). Estimation of a probability density function and its derivatives, The Annals of Mathematical Statistics, 40(4).
[8] Jones, M. C., Marron, J. S. and Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation, Journal of the American Statistical Association, 91.
[9] Jorgensen, B. (1997). The Theory of Dispersion Models. Chapman and Hall/CRC Monographs on Statistics and Applied Probability.
[10] Kanazawa, Y. (1993). Hellinger distance and Kullback-Leibler loss for the kernel density estimator, Statistics and Probability Letters 18(4).
[11] Mammen, E., Martinez-Miranda, M. D., Nielsen, J. P. and Sperlich, S. (2011). Do-validation for kernel density estimation, Journal of the American Statistical Association, 106.
[12] Mammen, E., Martinez-Miranda, M. D., Nielsen, J. P. and Sperlich, S. (2014). Further theoretical and practical insight to the do-validated bandwidth selector, Journal of the Korean Statistical Society, 43.
[13] Minami, M. and Eguchi, S. (2002). Robust blind source separation by Beta-divergence. Neural Computation 14.
[14] Mugdadi, A. R. and Ahmad, I. A. (2004). A bandwidth selection for kernel density estimation of functions of random variables, Computational Statistics and Data Analysis, 47(1), 49-62.
[15] Parzen, E. (1962). On estimation of a probability density function and mode, Annals of Mathematical Statistics 33.
[16] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators, Scandinavian Journal of Statistics 9(2).
[17] Scott, D. W. (1992). Multivariate density estimation: theory, practice, and visualization, Wiley, New York.
[18] Silverman, B. W. (1986). Density estimation for statistics and data analysis, Chapman and Hall, London.
[19] Turlach, B. A. (1993). Bandwidth selection in kernel density estimation: A review, Technical report, Université catholique de Louvain.
[20] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing, Chapman and Hall, London, UK.
[21] Xie, X. and Wu, J. (2014). Some Improvement on Convergence Rates of Kernel Density Estimator, Applied Mathematics, 5.
