JSM 2014 - Section on Nonparametric Statistics

On variable bandwidth kernel density estimation

Janet Nakarmi*    Hailin Sang*

Abstract

In this paper we study the ideal variable bandwidth kernel estimator introduced by McKay [7, 8] and the plug-in practical version of the variable bandwidth kernel estimator with two sequences of bandwidths as in Giné and Sang [4]. The dominating terms of the variance of the true estimator are separated from the other terms in the variance decomposition. Based on the exact formula of the bias and these dominating terms, we develop the optimal bandwidth selection for this variable kernel estimator.

Key Words: bandwidth selection, variable kernel density estimation.

1. Introduction

Suppose that $X_i$, $i \in \mathbb{N}$, are independent identically distributed (i.i.d.) observations with density function $f(t)$, $t \in \mathbb{R}$. Let $K$ be a symmetric probability kernel satisfying some differentiability properties. The classical kernel density estimator is
$$\hat f(t; h_n) = \frac{1}{n h_n} \sum_{i=1}^{n} K\Big(\frac{t - X_i}{h_n}\Big), \qquad (1)$$
where $h_n$ is the bandwidth sequence with $h_n \to 0$, $n h_n \to \infty$, and its properties have been well studied in the literature. The variance of (1) has order $O((n h_n)^{-1})$ and the bias has order $h_n^2$ if $f$ has a bounded second order derivative. In this paper, we study the following variable bandwidth kernel density estimator proposed by McKay [7, 8]:
$$\tilde f(t; h_n) = \frac{1}{n h_n} \sum_{i=1}^{n} \alpha(f(X_i))\, K\Big(\frac{\alpha(f(X_i))\,(t - X_i)}{h_n}\Big), \qquad (2)$$
where $\alpha(s)$ is a smooth function of the form
$$\alpha(s) := c\, p\big(\sqrt{s}/c\big). \qquad (3)$$
The function $p$ is at least four times differentiable and satisfies $p(x) \ge x$ for all $x$ and $p(x) = x$ for all $x \ge t_0$ for some $t_0 < \infty$, and $0 < c < \infty$ is a fixed number. (2) is a variable bandwidth kernel density estimator since the bandwidth has the form $h_n/\alpha(f(X_i))$ if we rewrite (2) in the form of the classical estimator (1).

The study of variable bandwidth kernel density estimation goes back to Abramson [1]. He proposed the following estimator:
$$f_A(t; h_n) = \frac{1}{n} \sum_{i=1}^{n} \frac{\gamma(t, X_i)}{h_n}\, K\Big(\frac{\gamma(t, X_i)\,(t - X_i)}{h_n}\Big), \qquad (4)$$

*Department of Mathematics, University of Mississippi, University, MS 38677, USA
where $\gamma(t, s) = \big(f(s) \vee f(t)/10\big)^{1/2}$. The bandwidth $h_n/\gamma(t, X_i)$ at each observation $X_i$ is inversely proportional to $f^{1/2}(X_i)$ if $f(X_i) \ge f(t)/10$. Notice that (2) also has the square root law since $\alpha(f(X_i)) = f^{1/2}(X_i)$ if $f(X_i) \ge c^2 t_0^2$, by the definition of the function $p$. The estimator (2) or (4) has a clipping procedure in $\alpha(s)$ or $\gamma(t, s)$, since they make the true bandwidth $h_n/\alpha(f(X_i)) \le h_n/c$ or $h_n/\gamma(t, X_i) \le 10^{1/2} h_n / f^{1/2}(t)$. The clipping procedure prevents an observation $X_i$ that is too far away from $t$ from contributing too much to the density estimation at $t$. Abramson [1] showed that this square root law improves the bias from the order of $h_n^2$ to the order of $h_n^4$ for the estimator (4), while at the same time keeping the variance at the order of $(n h_n)^{-1}$, if $f(t) > 0$ and $f$ has a fourth order continuous derivative at $t$. However, this variable bandwidth estimator (4) is not a density function of a true probability measure since the integral of $f_A(t; h_n)$ over $t$ is not 1 (it would be if $\gamma$ depended only on $s$). Terrell and Scott [11] and McKay [8] showed that the following modification of the Abramson estimator without the clipping filter $(f(t)/10)^{1/2}$ on $f^{1/2}(X_i)$, studied in Hall and Marron [5],
$$f_{HM}(t; h_n) = \frac{1}{n} \sum_{i=1}^{n} \frac{f^{1/2}(X_i)}{h_n}\, K\Big(\frac{f^{1/2}(X_i)\,(t - X_i)}{h_n}\Big), \qquad (5)$$
which has integral 1 and hence is a true probability density, may have bias of order much larger than $h_n^4$. Therefore the clipping is necessary for such bias reduction. Hall, Hu and Marron [6] then proposed the estimator
$$f_{HHM}(t; h_n) = \frac{1}{n h_n} \sum_{i=1}^{n} f^{1/2}(X_i)\, K\Big(\frac{f^{1/2}(X_i)\,(t - X_i)}{h_n}\Big)\, I\big(|t - X_i| < h_n B\big), \qquad (6)$$
where $B$ is a fixed constant; see also Novak [10] for a similar estimator. This estimator is nonnegative and achieves the desired bias reduction but, like Abramson's, it does not integrate to 1. In conclusion, it seems that the estimator (2) has all the advantages: it is a true density function with the square root law and a smooth clipping procedure.
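As a numerical illustration of the ideal estimators surveyed above, the following sketch implements Abramson's clipped estimator (4) and the unclipped Hall-Marron version (5). The concrete choices here are ours, not the paper's: a Gaussian kernel (the paper's $K$ has compact support), the function names, and the standard normal test density.

```python
import numpy as np

def K(u):
    # Gaussian kernel for illustration; the paper's K has compact support.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def abramson_kde(t, data, h, f):
    """Ideal Abramson estimator (4): bandwidth h / gamma(t, X_i) with the
    clipping gamma(t, s) = max(f(s), f(t)/10) ** 0.5 (square root law)."""
    t = np.asarray(t, dtype=float)
    gamma = np.sqrt(np.maximum(f(data)[None, :], f(t)[:, None] / 10.0))
    u = gamma * (t[:, None] - data[None, :]) / h
    return (gamma * K(u)).mean(axis=1) / h

def hall_marron_kde(t, data, h, f):
    """Unclipped version (5): bandwidth h / f(X_i)**0.5.  It integrates
    to 1, but without clipping its bias can be much larger than h**4."""
    t = np.asarray(t, dtype=float)
    r = np.sqrt(f(data))
    u = r[None, :] * (t[:, None] - data[None, :]) / h
    return (r[None, :] * K(u)).mean(axis=1) / h
```

Both functions are "ideal" in the paper's sense: they require the true density `f` as an argument, so they can serve only as benchmarks, not as practical estimators.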
But notice that this estimator and all the other variable bandwidth kernel density estimators cannot be applied in practice since they all involve the density function $f$ under study. Therefore, we call them ideal estimators in the literature. Hall and Marron [5] studied a true density estimator
$$\hat f_{HM}(t; h_{1,n}, h_{2,n}) = \frac{1}{n} \sum_{i=1}^{n} \frac{\hat f^{1/2}(X_i; h_{2,n})}{h_{1,n}}\, K\Big(\frac{\hat f^{1/2}(X_i; h_{2,n})\,(t - X_i)}{h_{1,n}}\Big)$$
obtained by plugging in the classical estimator (1) as the pilot estimator. Here, the bandwidth sequence $h_{1,n}$ plays the role of $h_n$ in (5) and the bandwidth sequence $h_{2,n}$ is applied in the classical pilot kernel density estimator. They took a Taylor expansion of $K\big(\frac{t - X_i}{h_{1,n}} \hat f^{1/2}(X_i; h_{2,n})\big)$ at $K\big(\frac{t - X_i}{h_{1,n}} f^{1/2}(X_i)\big)$ and studied the asymptotics of the true estimator. By applying this Taylor decomposition, McKay [8] studied the convergence of the plug-in true estimator of (2) in probability and pointwise. Giné and Sang [3, 4] studied plug-in true estimators of (6) and (2) for one-dimensional and multi-dimensional ($d$-dimensional) observations. They proved that the discrepancy between the true estimator and the true value converges uniformly over a data adaptive region at a rate $O\big((\log n / n)^{4/(8+d)}\big)$ by applying empirical process techniques. The true estimator in Giné and Sang [4] has the following
form
$$\hat f(t; h_{1,n}, h_{2,n}) = \frac{1}{n h_{1,n}} \sum_{i=1}^{n} \alpha\big(\hat f(X_i; h_{2,n})\big)\, K\Big(\frac{\alpha\big(\hat f(X_i; h_{2,n})\big)\,(t - X_i)}{h_{1,n}}\Big). \qquad (7)$$
They also studied the uniform convergence, in the almost sure sense, of true estimators with bias of order $h_n^6$.

In Section 2, we provide the decomposition of some important terms which we use in later sections. In Section 3 of this paper, we study the decomposition of the variance of the true estimator (7) and find the exact formula of the dominating terms in the variance decomposition. Moreover, we provide a theoretical formula for the optimal bandwidth in Section 4.

2. Preliminary decomposition

For convenience, we adopt the notation of Giné and Sang [4] for the Taylor series expansion of $K\big(\frac{t - X_i}{h_{1,n}} \alpha(\hat f(X_i; h_{2,n}))\big)$ at $K\big(\frac{t - X_i}{h_{1,n}} \alpha(f(X_i))\big)$. We also give the statements without detailed explanation; for details, readers are referred to Giné and Sang [4]. $\mathcal{P}_C$ will denote the set of all probability densities on $\mathbb{R}$ that are uniformly continuous and are bounded by $C < \infty$, and $\mathcal{P}_{C,k}$ will denote the set of densities on $\mathbb{R}$ which, together with their derivatives of order $k$ or lower, are bounded by $C < \infty$ and are uniformly continuous. Define $\delta(t) = \delta(t, h_{2,n})$ by the equation
$$\delta(t) = \frac{\alpha\big(\hat f(t; h_{2,n})\big) - \alpha(f(t))}{\alpha(f(t))}.$$
Then
$$\alpha\big(\hat f(t; h_{2,n})\big) = \alpha(f(t))\,\big(1 + \delta(t)\big) \qquad (8)$$
and
$$|\delta(t)| \le B_c\, \big|\hat f(t; h_{2,n}) - f(t)\big|$$
for a constant $B_c$ that depends only on $p$. Although we study the asymptotics of the true estimator pointwise, the uniform asymptotic behavior of the quantity $\delta$ is needed in the later analysis. Define $D(t; h_{2,n}) = \hat f(t; h_{2,n}) - E\hat f(t; h_{2,n})$ and $b(t; h_{2,n}) = E\hat f(t; h_{2,n}) - f(t)$. Note that
$$\sup_{t \in \mathbb{R}} |b(t; h_{2,n})| = O(h_{2,n}^2) \qquad (9)$$
for $f \in \mathcal{P}_{C,2}$, and, by Giné and Guillou [2],
$$\sup_{t \in \mathbb{R}} |D(t; h_{2,n})| = O\Bigg(\sqrt{\frac{\log(1/h_{2,n})}{n h_{2,n}}}\Bigg) \quad a.s.$$
for $f \in \mathcal{P}_C$. Denote
$$U_n := \sqrt{\frac{\log(1/h_{2,n})}{n h_{2,n}}} + h_{2,n}^2.$$
Then we have
$$\sup_{t \in \mathbb{R}} \big|\hat f(t; h_{2,n}) - f(t)\big| \le \sup_{t \in \mathbb{R}} |D(t; h_{2,n})| + \sup_{t \in \mathbb{R}} |b(t; h_{2,n})| = O(U_n) \quad a.s. \qquad (10)$$
and
$$\sup_{t \in \mathbb{R}} |\delta(t)| = O(U_n) \quad a.s. \qquad (11)$$
for $f \in \mathcal{P}_{C,2}$. By the definition of $\delta(t)$, we also have
$$\delta(t) = \frac{\alpha'(f(t))}{\alpha(f(t))}\,\big[\hat f(t; h_{2,n}) - f(t)\big] + \frac{\alpha''(\eta)}{2\,\alpha(f(t))}\,\big[\hat f(t; h_{2,n}) - f(t)\big]^2, \qquad (12)$$
where $\eta = \eta(t) \ge 0$ is between $\hat f(t; h_{2,n})$ and $f(t)$. Notice that $|\alpha''(\eta(t, \cdot))| \le c^{-3} A$ for some constant $A$ which depends only on the clipping function $p$.

Set $L_1(t) = K(t) + t K'(t)$ and $L(t) = t^2 K''(t) + 2 t K'(t)$, $t \in \mathbb{R}$. We then have the following Taylor series expansion:
$$\alpha\big(\hat f(X_i; h_{2,n})\big)\, K\Big(\frac{t - X_i}{h_{1,n}}\,\alpha\big(\hat f(X_i; h_{2,n})\big)\Big) = \alpha(f(X_i))\Big[K\Big(\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\Big) + L_1\Big(\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\Big)\,\delta(X_i) + \tfrac{1}{2} L\Big(\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\Big)\,\delta^2(X_i) + \delta_2(t, X_i)\Big], \qquad (13)$$
where $\delta_2(t, X_i)$ is the third order remainder term,
$$\delta_2(t, X_i) = \frac{1}{6}\, K'''(\xi)\, \Big(\frac{t - X_i}{h_{1,n}}\Big)^3\, \alpha^3(f(X_i))\, \delta^3(X_i), \qquad (14)$$
$\xi$ being a random number between $\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))$ and $\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i)) + \frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\,\delta(X_i)$. By the analysis in Giné and Sang [4],
$$\sup_{t, x \in \mathbb{R}} |\delta_2(t, x)| = O\big(\|\hat f(\cdot\,; h_{2,n}) - f\|_\infty^3\big) = O(U_n^3) \quad \text{if } f \in \mathcal{P}_{C,2}. \qquad (15)$$
Therefore,
$$\hat f(t; h_{1,n}, h_{2,n}) = \tilde f(t; h_{1,n}) + \frac{1}{n h_{1,n}} \sum_{i=1}^{n} L_1\Big(\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\Big)\,\alpha(f(X_i))\,\delta(X_i) + \frac{1}{2 n h_{1,n}} \sum_{i=1}^{n} L\Big(\frac{t - X_i}{h_{1,n}}\,\alpha(f(X_i))\Big)\,\alpha(f(X_i))\,\delta^2(X_i) + \frac{1}{n h_{1,n}} \sum_{i=1}^{n} \alpha(f(X_i))\,\delta_2(t, X_i).$$
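Before turning to the variance analysis, here is a minimal sketch of the practical plug-in estimator (7): the classical estimator with pilot bandwidth $h_{2,n}$ replaces the unknown $f$ inside the clipping function $\alpha$ of (3). The Gaussian kernel, the constants `c = 0.5` and `t0 = 1.0`, and the non-smooth stand-in `p(x) = max(x, t0)` for the paper's smooth clipping function are illustrative assumptions of ours.

```python
import numpy as np

def K(u):
    # Gaussian kernel for illustration; the paper's K has compact support.
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def classical_kde(t, data, h):
    """Classical pilot estimator (1)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return K((t[:, None] - data[None, :]) / h).mean(axis=1) / h

def alpha(s, c=0.5, t0=1.0):
    """Clipping function (3): alpha(s) = c * p(sqrt(s)/c).  Here
    p(x) = max(x, t0) is a non-smooth stand-in for the paper's smooth p;
    it still satisfies p(x) >= x for all x and p(x) = x for x >= t0."""
    return c * np.maximum(np.sqrt(np.maximum(s, 0.0)) / c, t0)

def plugin_vb_kde(t, data, h1, h2):
    """True estimator (7): the pilot classical_kde(., h2) replaces the
    unknown f inside alpha, and h1 is the main bandwidth."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    a = alpha(classical_kde(data, data, h2))   # alpha(fhat(X_i; h2))
    u = a[None, :] * (t[:, None] - data[None, :]) / h1
    return (a[None, :] * K(u)).mean(axis=1) / h1
```

Because each summand $\alpha_i K(\alpha_i (t - X_i)/h_1)/h_1$ is itself a density in $t$, the plug-in estimator is a true probability density, in line with the advantage of (2) emphasized in the introduction.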
3. Variance of the true estimator

To ease the notation, we denote
$$K_t(x) = K\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big), \quad L_{1,t}(x) = L_1\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big), \quad L_t(x) = L\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big),$$
$$M(x) = L_{1,t}(x)\,\alpha'(f(x)) = L_1\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big)\,\alpha'(f(x)), \quad \text{and} \quad M_i = M(X_i).$$

Theorem 1. Let $X_1, \ldots, X_n$ be a random sample of size $n$ with density function $f(t)$, $t \in \mathbb{R}$, and let $\hat f(t; h_{1,n}, h_{2,n})$ defined in (7) be an estimator of $f(t)$. Assume that the kernel $K$ on $\mathbb{R}$ is a non-negative symmetric function with support contained in $[-T, T]$, $T < \infty$, integrates to 1 and has bounded fourth order derivatives. The function $\alpha$ in the estimator $\hat f(t; h_{1,n}, h_{2,n})$ is defined in (3) for a nondecreasing clipping function $p$ ($p(s) \ge s$ for all $s$ and $p(s) = s$ for all $s \ge t_0$) with five bounded and uniformly continuous derivatives, and constant $c > 0$. Then
$$\operatorname{Var}\big(\hat f(t; h_{1,n}, h_{2,n})\big) = \frac{V_1 + V_2 + V_3}{n h_{1,n}} + o\Big(\frac{1}{n h_{1,n}}\Big),$$
where
$$V_1 = \frac{1}{h_{1,n}}\, E\big[K_t^2(X_1)\,\alpha^2(f(X_1))\big] = \alpha(f(t))\, f(t) \int K^2(y)\, dy + o(1),$$
$$V_2 = \frac{2}{h_{1,n} h_{2,n}}\, E\Big[K_t(X_1)\,\alpha(f(X_1))\, M(X_2)\, K\Big(\frac{X_1 - X_2}{h_{2,n}}\Big)\Big],$$
and
$$V_3 = \frac{1}{h_{1,n} h_{2,n}^2}\, E\Big[M(X_1)\, M(X_2)\, K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big)\, K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big)\Big]. \qquad (16)$$

Proof. Note that
$$\operatorname{Var}\big(\hat f(t; h_{1,n}, h_{2,n})\big) = E \hat f^2(t; h_{1,n}, h_{2,n}) - \big(E \hat f(t; h_{1,n}, h_{2,n})\big)^2.$$
To study the variance of the true estimator (7), we start by calculating $E \hat f^2(t; h_{1,n}, h_{2,n})$. It is obvious that
$$E \hat f^2(t; h_{1,n}, h_{2,n}) = \frac{1}{n h_{1,n}^2}\, E\Big[\alpha^2\big(\hat f(X_1; h_{2,n})\big)\, K^2\Big(\frac{t - X_1}{h_{1,n}}\,\alpha\big(\hat f(X_1; h_{2,n})\big)\Big)\Big] \qquad (17)$$
$$+ \frac{n - 1}{n h_{1,n}^2}\, E\Big[\alpha\big(\hat f(X_1; h_{2,n})\big)\, K\Big(\frac{t - X_1}{h_{1,n}}\,\alpha\big(\hat f(X_1; h_{2,n})\big)\Big)\,\alpha\big(\hat f(X_2; h_{2,n})\big)\, K\Big(\frac{t - X_2}{h_{1,n}}\,\alpha\big(\hat f(X_2; h_{2,n})\big)\Big)\Big]. \qquad (18)$$
The decompositions of $\alpha(\hat f(X_1; h_{2,n}))$ in (8) and of $K\big(\frac{t - X_1}{h_{1,n}}\,\alpha(\hat f(X_1; h_{2,n}))\big)$ in (13) then give
$$(17) = \frac{1}{n h_{1,n}^2}\, E\big[K_t^2(X_1)\,\alpha^2(f(X_1))\big] \qquad (19)$$
$$+ \frac{2}{n h_{1,n}^2}\, E\big[K_t(X_1)\, L_{1,t}(X_1)\,\alpha^2(f(X_1))\,\delta(X_1)\big] \qquad (20)$$
$$+ \frac{1}{n h_{1,n}^2}\, E\big[L_{1,t}^2(X_1)\,\alpha^2(f(X_1))\,\delta^2(X_1)\big] \qquad (21)$$
$$+ \frac{1}{n h_{1,n}^2}\, E\big[K_t(X_1)\, L_t(X_1)\,\alpha^2(f(X_1))\,\delta^2(X_1)\big] \qquad (22)$$
$$+ \frac{2}{n h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha^2(f(X_1))\,\delta_2(t, X_1)\big] + \text{terms containing higher powers of } \delta \text{ or } \delta_2.$$

By Proposition 1 of Nakarmi and Sang [9], we have
$$(19) = \frac{V_1}{n h_{1,n}} + o\Big(\frac{1}{n h_{1,n}}\Big) = \frac{\alpha(f(t))\, f(t) \int K^2(y)\, dy}{n h_{1,n}} + o\Big(\frac{1}{n h_{1,n}}\Big).$$
Notice that each term in (20)-(22) contains $\delta$, $\delta^2$ or $\delta_2$. Then they all have order $o(1/(n h_{1,n}))$ if we take $h_{2,n} = O(n^{-1/5})$ and $h_{1,n} = O(n^{-1/9})$, by the boundedness of $K$, $L_1$, $L$ and $\alpha(f)$ (due to the boundedness of $f$) and (11) and (15). For example,
$$(21) = O\Big(\frac{U_n^2}{n h_{1,n}}\Big) = o\Big(\frac{1}{n h_{1,n}}\Big).$$
Again, by applying the decompositions of $\alpha(\hat f(X_i; h_{2,n}))$ in (8) and $K\big(\frac{t - X_i}{h_{1,n}}\,\alpha(\hat f(X_i; h_{2,n}))\big)$ in (13), and noting that the expectation terms with coefficient $\frac{n-1}{n h_{1,n}^2}$ which contain the quantities $\delta \delta_2$, $\delta^2 \delta_2$ or $\delta_2 \delta_2$ are negligible compared with $O(1/(n h_{1,n}))$ by the boundedness of $K$, $L_1$ and $\alpha(f)$ and (11) and (15), we get
$$(18) = \frac{n-1}{n h_{1,n}^2}\, E\big[K_t(X_1)\, K_t(X_2)\,\alpha(f(X_1))\,\alpha(f(X_2))\big] \qquad (23)$$
$$+ \frac{n-1}{n h_{1,n}^2}\, E\big[L_{1,t}(X_1)\, L_{1,t}(X_2)\,\delta(X_1)\,\delta(X_2)\,\alpha(f(X_1))\,\alpha(f(X_2))\big] \qquad (24)$$
$$+ \frac{2(n-1)}{n h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\, L_{1,t}(X_2)\,\delta(X_2)\,\alpha(f(X_2))\big] \qquad (25)$$
$$+ \frac{n-1}{n h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\, L_t(X_2)\,\delta^2(X_2)\,\alpha(f(X_2))\big] \qquad (26)$$
$$+ \frac{2(n-1)}{n h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\,\delta_2(t, X_2)\,\alpha(f(X_2))\big] + o\Big(\frac{1}{n h_{1,n}}\Big). \qquad (27)$$
On the other hand, by the boundedness of $K$, $L_1$ and $\alpha(f)$ and (11) and (15), the terms of the form $E[\cdot]\, E[\cdot]$ have order $o(1/(n h_{1,n}))$ if they include the quantities $\delta \delta_2$, $\delta^2 \delta_2$ or $\delta_2 \delta_2$. Therefore,
$$\big(E \hat f(t; h_{1,n}, h_{2,n})\big)^2 = \frac{1}{h_{1,n}^2}\,\big\{E\big[K_t(X_1)\,\alpha(f(X_1))\big]\big\}^2 \qquad (28)$$
$$+ \frac{1}{h_{1,n}^2}\,\big\{E\big[L_{1,t}(X_1)\,\delta(X_1)\,\alpha(f(X_1))\big]\big\}^2 \qquad (29)$$
$$+ \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[L_{1,t}(X_1)\,\delta(X_1)\,\alpha(f(X_1))\big] \qquad (30)$$
$$+ \frac{1}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[L_t(X_1)\,\delta^2(X_1)\,\alpha(f(X_1))\big] \qquad (31)$$
$$+ \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[\delta_2(t, X_1)\,\alpha(f(X_1))\big] + o\Big(\frac{1}{n h_{1,n}}\Big). \qquad (32)$$
By the above analysis, we shall study the differences between the terms (19), (23)-(27) and (28)-(32).

The difference between (23) and (28). The difference between (23) and (28) is
$$-\frac{1}{n}\,\Big\{\frac{1}{h_{1,n}}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\Big\}^2,$$
which has order $O(1/n)$ by the bias formula of the ideal estimator.

The difference between (24) and (29). Since $L_1$ has bounded support and $\alpha(f(x))$ is bounded and bounded away from zero,
$$\frac{1}{h_{1,n}}\, E|M(X_1)| = \frac{1}{h_{1,n}} \int \Big|L_1\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big)\,\alpha'(f(x))\Big|\, f(x)\, dx = O(1). \qquad (33)$$
Recall the decomposition of $\delta(t)$ in (12). Then (9), (10) and (33) give
$$\frac{1}{h_{1,n}}\, E\big|M(X_1)\, b(X_1; h_{2,n})\big| = O(h_{2,n}^2),$$
$$\frac{1}{h_{1,n}}\, E\big|M(X_1)\,\big[\hat f(X_1; h_{2,n}) - f(X_1)\big]\big| = O(U_n)$$
and
$$\frac{1}{h_{1,n}}\, E\big|M(X_1)\,\alpha''(\eta)\,\big[\hat f(X_1; h_{2,n}) - f(X_1)\big]^2\big| = O(U_n^2).$$
We first apply the decomposition of $\delta(t)$ in (12) and use the above analysis. Then, by applying the results from Sections 3.1 and 3.2 of Giné and Sang [4], we get the following:
$$(29) = \frac{1}{h_{1,n}^2}\,\big\{E\big[M(X_1)\,\big(D(X_1; h_{2,n}) + b(X_1; h_{2,n})\big)\big]\big\}^2 + O(U_n^3)
= \frac{1}{h_{1,n}^2}\,\big\{E\big[M(X_1)\, b(X_1; h_{2,n})\big]\big\}^2 + O\Big(U_n^3 + \frac{1}{n^2 h_{1,n} h_{2,n}}\Big). \qquad (34)$$
On the other hand, by a similar analysis,
$$(24) = \frac{1}{h_{1,n}^2}\, E\big[M(X_1)\, M(X_2)\,\big(D(X_1; h_{2,n}) + b(X_1; h_{2,n})\big)\,\big(D(X_2; h_{2,n}) + b(X_2; h_{2,n})\big)\big] + O(U_n^3) := Q_1 + Q_2 + Q_3 + O(U_n^3),$$
where
$$Q_1 = \frac{1}{h_{1,n}^2}\, E\big[M(X_1)\, M(X_2)\, D(X_1; h_{2,n})\, D(X_2; h_{2,n})\big],$$
$$Q_2 = \frac{2}{h_{1,n}^2}\, E\big[M(X_1)\, M(X_2)\, D(X_1; h_{2,n})\, b(X_2; h_{2,n})\big],$$
$$Q_3 = \frac{1}{h_{1,n}^2}\, E\big[M(X_1)\, M(X_2)\, b(X_1; h_{2,n})\, b(X_2; h_{2,n})\big] = \frac{1}{h_{1,n}^2}\,\big\{E\big[M(X_1)\, b(X_1; h_{2,n})\big]\big\}^2.$$
We also denote
$$N_i = \int K\Big(\frac{X_i - u}{h_{2,n}}\Big)\, f(u)\, du, \qquad i = 1, 2.$$
Then, expanding $D(X_1; h_{2,n})\, D(X_2; h_{2,n})$ over the observations in the pilot estimator,
$$Q_1 = \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[M_1 M_2\,\big(K(0) - N_1\big)\big(K(0) - N_2\big)\big] \qquad (35)$$
$$+ \frac{2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_2}{h_{2,n}}\Big) - N_1\Big)\big(K(0) - N_2\big)\Big] \qquad (36)$$
$$+ \frac{2(n-2)}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big) - N_1\Big)\big(K(0) - N_2\big)\Big] \qquad (37)$$
$$+ \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big) - N_1\Big)\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (38)$$
$$+ \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_2}{h_{2,n}}\Big) - N_1\Big)\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (39)$$
$$+ \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_2}{h_{2,n}}\Big) - N_1\Big)\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (40)$$
$$+ \frac{(n-2)(n-3)}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big) - N_1\Big)\Big(K\Big(\frac{X_2 - X_4}{h_{2,n}}\Big) - N_2\Big)\Big]. \qquad (41)$$
It is easy to see that (35), (36), (40) $= O\big(n^{-2} h_{2,n}^{-2}\big) = o\big((n h_{1,n})^{-1}\big)$ by applying (33). Since
$$E\Big[K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big)\,\Big|\, X_1\Big] = \int K\Big(\frac{X_1 - u}{h_{2,n}}\Big)\, f(u)\, du = N_1, \qquad (42)$$
(37) $= 0$. Similarly (39) $=$ (41) $= 0$. From (3.37) of Giné and Sang [4], we have $V_3 = O(1)$. Also it is easy to see that $E[M_1 N_1] = O(h_{1,n} h_{2,n})$. Hence
$$(38) = \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\, K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big)\, K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big)\Big] - \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\,\big(E[M_1 N_1]\big)^2 = \frac{V_3}{n h_{1,n}} + O\Big(\frac{1}{n}\Big)$$
and $Q_1 = \frac{V_3}{n h_{1,n}} + O\big(\frac{1}{n}\big)$. For $Q_2$,
$$Q_2 = \frac{2}{h_{1,n}^2}\, E\big[M_1 M_2\, D(X_1; h_{2,n})\, b(X_2; h_{2,n})\big]$$
$$= \frac{2}{n h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_2}{h_{2,n}}\Big) - N_1\Big) \int K\Big(\frac{X_2 - v}{h_{2,n}}\Big)\,\big[f(v) - f(X_2)\big]\, dv\Big] \qquad (43)$$
$$+ \frac{2(n-2)}{n h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\Big(K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big) - N_1\Big) \int K\Big(\frac{X_2 - v}{h_{2,n}}\Big)\,\big[f(v) - f(X_2)\big]\, dv\Big] \qquad (44)$$
$$+ \frac{2}{n h_{1,n}^2 h_{2,n}^2}\, E\Big[M_1 M_2\,\big(K(0) - N_1\big) \int K\Big(\frac{X_2 - v}{h_{2,n}}\Big)\,\big[f(v) - f(X_2)\big]\, dv\Big]. \qquad (45)$$
Under the condition that $f$ has a bounded second order continuous derivative,
$$\int K\Big(\frac{X_2 - v}{h_{2,n}}\Big)\,\big[f(v) - f(X_2)\big]\, dv = O(h_{2,n}^3).$$
Thus, by the boundedness of $K$ and $N$ and (33), $(43) = O(1/n) = (45)$, and $(44) = 0$ since, as in (42),
$$E\Big[K\Big(\frac{X_1 - X_3}{h_{2,n}}\Big)\,\Big|\, X_1\Big] = \int K\Big(\frac{X_1 - u}{h_{2,n}}\Big)\, f(u)\, du = N_1.$$
Therefore, $Q_2 = O(1/n)$. Notice that the first quantity in (34) is the same as $Q_3$. Hence
$$(24) - (29) = \frac{V_3}{n h_{1,n}} + O\Big(\frac{1}{n}\Big).$$

The difference between (26) and (31). Next we denote $A(x) = K_t(x)\,\alpha(f(x))$ and $\bar L_t(x) = L_t(x)\,\alpha'^2(f(x))/\alpha(f(x))$. Then (10) and the decomposition of $\delta(t)$ in (12) give
$$(26) = \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\,\bar L_t(X_2)\,\big(D(X_2; h_{2,n}) + b(X_2; h_{2,n})\big)^2\big] + O(U_n^3)$$
$$= \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\,\bar L_t(X_2)\, D^2(X_2; h_{2,n})\big] \qquad (46)$$
$$+ \frac{2}{h_{1,n}^2}\, E\big[A(X_1)\,\bar L_t(X_2)\, D(X_2; h_{2,n})\, b(X_2; h_{2,n})\big] \qquad (47)$$
$$+ \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\,\bar L_t(X_2)\, b^2(X_2; h_{2,n})\big] + O(U_n^3), \qquad (48)$$
and
$$(46) = \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2) \sum_{i=1}^{n} \Big(K\Big(\frac{X_2 - X_i}{h_{2,n}}\Big) - N_2\Big)^2\Big] \qquad (49)$$
$$+ \frac{2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2) \sum_{i < j} \Big(K\Big(\frac{X_2 - X_i}{h_{2,n}}\Big) - N_2\Big)\Big(K\Big(\frac{X_2 - X_j}{h_{2,n}}\Big) - N_2\Big)\Big]. \qquad (50)$$
By independence and a similar argument as in (33), we have
$$(49) = \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\,\bar L_t(X_2)\,\big(K(0) - N_2\big)^2\big] + \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)^2\Big]$$
$$+ \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\Big[\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)^2\Big] = O\Big(\frac{1}{n h_{2,n}}\Big) \qquad (51)$$
and
$$(50) = \frac{2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2)\,\big(K(0) - N_2\big)\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)\Big]$$
$$+ \frac{2(n-2)}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2)\,\big(K(0) - N_2\big)\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big]$$
$$+ \frac{2(n-2)}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (52)$$
$$+ \frac{(n-2)(n-3)}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\Big[A(X_1)\,\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big(K\Big(\frac{X_2 - X_4}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (53)$$
$$= O\Big(\frac{1}{n}\Big).$$
The last two terms (52) and (53) are zero by the argument (42). Now we decompose the term (31). Again, by (10) and the decomposition of $\delta(t)$ in (12),
$$(31) = \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\big]\, E\big[\bar L_t(X_2)\,\big(D(X_2; h_{2,n}) + b(X_2; h_{2,n})\big)^2\big] + O(U_n^3)$$
$$= \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\big]\, E\big[\bar L_t(X_2)\, D^2(X_2; h_{2,n})\big] \qquad (54)$$
$$+ \frac{2}{h_{1,n}^2}\, E\big[A(X_1)\big]\, E\big[\bar L_t(X_2)\, D(X_2; h_{2,n})\, b(X_2; h_{2,n})\big] \qquad (55)$$
$$+ \frac{1}{h_{1,n}^2}\, E\big[A(X_1)\big]\, E\big[\bar L_t(X_2)\, b^2(X_2; h_{2,n})\big] + O(U_n^3), \qquad (56)$$
and
$$(54) = \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\Big[\bar L_t(X_2) \sum_{i=1}^{n} \Big(K\Big(\frac{X_2 - X_i}{h_{2,n}}\Big) - N_2\Big)^2\Big] \qquad (57)$$
$$+ \frac{2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\Big[\bar L_t(X_2) \sum_{i < j} \Big(K\Big(\frac{X_2 - X_i}{h_{2,n}}\Big) - N_2\Big)\Big(K\Big(\frac{X_2 - X_j}{h_{2,n}}\Big) - N_2\Big)\Big]. \qquad (58)$$
By a similar argument as in (33), we have
$$(57) = \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\big[\bar L_t(X_2)\,\big(K(0) - N_2\big)^2\big] + \frac{1}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\Big[\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)^2\Big]$$
$$+ \frac{n-2}{n^2 h_{1,n}^2 h_{2,n}^2}\, E\big[A(X_1)\big]\, E\Big[\bar L_t(X_2)\,\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)^2\Big] = O\Big(\frac{1}{n h_{2,n}}\Big) \qquad (59)$$
and, with the two cross-product sums over distinct indices denoted (60) and (61),
$$(58) = O\Big(\frac{1}{n}\Big).$$
The last two terms (60) and (61) are zero by the argument (42). By a similar explanation as for $Q_2$, $(55) = O(1/n) = (47)$. Also notice that $(48) = (56)$ and $(51) = (59)$. Therefore, we get
$$(26) - (31) = o\Big(\frac{1}{n h_{1,n}}\Big).$$

The difference between (27) and (32). To study the difference between (27) and (32), by Proposition 1 of Giné and Sang [4], for the ideal estimator we have
$$\tilde f(t; h_{1,n}) - E \tilde f(t; h_{1,n}) = O\Bigg(\sqrt{\frac{\log n}{n h_{1,n}}}\Bigg) \quad a.s.$$
Therefore,
$$\frac{1}{n h_{1,n}} \sum_{i=1}^{n} \Big[K_t(X_i)\,\alpha(f(X_i)) - E\big(K_t(X_1)\,\alpha(f(X_1))\big)\Big] = O\Bigg(\sqrt{\frac{\log n}{n h_{1,n}}}\Bigg) \quad a.s.$$
Hence, we have the following for the difference between (27) and (32):
$$(27) - (32) = \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\,\alpha(f(X_2))\,\delta_2(t, X_2)\big] - \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[\alpha(f(X_2))\,\delta_2(t, X_2)\big]$$
$$= \frac{2}{h_{1,n}^2}\, E\Big\{\big[K_t(X_1)\,\alpha(f(X_1)) - E\big(K_t(X_1)\,\alpha(f(X_1))\big)\big]\,\alpha(f(X_2))\,\delta_2(t, X_2)\Big\}$$
$$= O\Bigg(\frac{1}{h_{1,n}}\,\sqrt{\frac{\log n}{n h_{1,n}}}\, \sup_{t, x \in \mathbb{R}} |\delta_2(t, x)|\Bigg) = o\Big(\frac{1}{n h_{1,n}}\Big).$$

The difference between (25) and (30). We shall apply the decomposition of $\delta(t)$ as in (12). The contribution of the second part of (12) to the difference between (25) and (30) has order $o(1/(n h_{1,n}))$, as in the analysis of the difference between (27) and (32). Therefore,
$$(25) - (30) = \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\, M(X_2)\,\big(D(X_2; h_{2,n}) + b(X_2; h_{2,n})\big)\big] - \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[M(X_2)\,\big(D(X_2; h_{2,n}) + b(X_2; h_{2,n})\big)\big] + o\Big(\frac{1}{n h_{1,n}}\Big)$$
$$= \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\, M(X_2)\, D(X_2; h_{2,n})\big] - \frac{2}{h_{1,n}^2}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[M(X_2)\, D(X_2; h_{2,n})\big] + o\Big(\frac{1}{n h_{1,n}}\Big)$$
$$= \frac{2}{n h_{1,n}^2 h_{2,n}}\, E\Big[K_t(X_1)\,\alpha(f(X_1))\, M(X_2)\,\Big(K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big) - N_2\Big)\Big] \qquad (62)$$
$$- \frac{2}{n h_{1,n}^2 h_{2,n}}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\Big[M(X_2)\,\Big(K\Big(\frac{X_2 - X_3}{h_{2,n}}\Big) - N_2\Big)\Big] + o\Big(\frac{1}{n h_{1,n}}\Big) \qquad (63)$$
$$= \frac{2}{n h_{1,n}^2 h_{2,n}}\, E\Big[K_t(X_1)\,\alpha(f(X_1))\, M(X_2)\, K\Big(\frac{X_2 - X_1}{h_{2,n}}\Big)\Big] \qquad (64)$$
$$- \frac{2}{n h_{1,n}^2 h_{2,n}}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E\big[M(X_2)\, N_2\big] + o\Big(\frac{1}{n h_{1,n}}\Big). \qquad (65)$$
We have the equality (62) by independence. The term (63) is zero. Since $K$, $L_1 \alpha'$ and $f$ are bounded functions and $K$ has bounded support, we have
$$V_2 = \frac{2}{h_{1,n} h_{2,n}}\, E\Big[K_t(X_1)\,\alpha(f(X_1)) \int L_1\Big(\frac{t - u}{h_{1,n}}\,\alpha(f(u))\Big)\,\alpha'(f(u))\, K\Big(\frac{X_1 - u}{h_{2,n}}\Big)\, f(u)\, du\Big]$$
$$= \frac{2}{h_{1,n}}\, E\Big[K_t(X_1)\,\alpha(f(X_1)) \int L_1\Big(\frac{t - X_1 + h_{2,n} v}{h_{1,n}}\,\alpha(f(X_1 - h_{2,n} v))\Big)\,\alpha'(f(X_1 - h_{2,n} v))\, K(v)\, f(X_1 - h_{2,n} v)\, dv\Big]$$
$$\le \frac{C}{h_{1,n}}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big] = O(1) \qquad (66)$$
and $(64) = \frac{V_2}{n h_{1,n}} = O\big(\frac{1}{n h_{1,n}}\big)$. On the other hand, $E\big[K_t(X_1)\,\alpha(f(X_1))\big] = O(h_{1,n})$ and
$$E[M N] = \int L_1\Big(\frac{t - u}{h_{1,n}}\,\alpha(f(u))\Big)\,\alpha'(f(u))\, f(u) \int K\Big(\frac{u - v}{h_{2,n}}\Big)\, f(v)\, dv\, du = h_{2,n} \int L_1\Big(\frac{t - u}{h_{1,n}}\,\alpha(f(u))\Big)\,\alpha'(f(u))\, f(u) \int K(w)\, f(u - h_{2,n} w)\, dw\, du = O(h_{1,n} h_{2,n}),$$
since the function $\alpha(f)$ in $L_1$ is bounded below and above. Therefore in (65),
$$\frac{2}{n h_{1,n}^2 h_{2,n}}\, E\big[K_t(X_1)\,\alpha(f(X_1))\big]\, E[M N] = O\Big(\frac{1}{n}\Big). \qquad (67)$$
Hence (66) and (67) give us
$$(25) - (30) = \frac{V_2}{n h_{1,n}} + O\Big(\frac{1}{n}\Big).$$
Thus,
$$\operatorname{Var}\big(\hat f(t; h_{1,n}, h_{2,n})\big) = \frac{V_1 + V_2 + V_3}{n h_{1,n}} + o\Big(\frac{1}{n h_{1,n}}\Big).$$

4. Optimal bandwidth selection

Denote $z_n = \big(w + \frac{h_{2,n}}{h_{1,n}}\, v\big)\,\alpha\big(f(t - h_{1,n} w - h_{2,n} v)\big)$. Then $\lim_{n} z_n = w\,\alpha(f(t))$ since $h_{2,n}/h_{1,n} \to 0$. In the following we change variables twice, then take the limit by applying the compact support of $K$ and the boundedness below and above of the function $\alpha$:
$$V_2 = \frac{2}{h_{1,n} h_{2,n}} \int\!\!\int K\Big(\frac{t - x}{h_{1,n}}\,\alpha(f(x))\Big)\,\alpha(f(x))\, L_1\Big(\frac{t - u}{h_{1,n}}\,\alpha(f(u))\Big)\,\alpha'(f(u))\, K\Big(\frac{x - u}{h_{2,n}}\Big)\, f(x)\, f(u)\, dx\, du$$
$$= 2 \int_{-T}^{T}\!\!\int K\big(w\,\alpha(f(t - h_{1,n} w))\big)\,\alpha(f(t - h_{1,n} w))\, L_1(z_n)\,\alpha'\big(f(t - h_{1,n} w - h_{2,n} v)\big)\, K(v)\, f(t - h_{1,n} w)\, f(t - h_{1,n} w - h_{2,n} v)\, dv\, dw$$
$$\to 2\,\alpha'(f(t))\, f^2(t)\,\alpha(f(t)) \int_{-T}^{T} K\big(w\,\alpha(f(t))\big)\, L_1\big(w\,\alpha(f(t))\big)\, dw = 2\,\alpha'(f(t))\, f^2(t) \int K(y)\, L_1(y)\, dy = 2\,\alpha'(f(t))\, f^2(t)\,\mu_0 := \bar V_2,$$
where $\mu_0 := \int K(y)\, L_1(y)\, dy$.
For the quantity $V_3$, let $x_3 = x_2 - h_{2,n} s$, $x_2 = x_1 - h_{2,n} w$ and $x_1 = t - h_{1,n} z$ be the change of variables. Again, since $h_{2,n}/h_{1,n} \to 0$, $L_1$ has compact support and the function $\alpha$ is bounded below and above,
$$(16) = \frac{1}{h_{1,n} h_{2,n}^2} \int\!\!\int\!\!\int L_1\Big(\frac{t - x_1}{h_{1,n}}\,\alpha(f(x_1))\Big)\,\alpha'(f(x_1))\, f(x_1)\, L_1\Big(\frac{t - x_2}{h_{1,n}}\,\alpha(f(x_2))\Big)\,\alpha'(f(x_2))\, f(x_2)\, K\Big(\frac{x_1 - x_3}{h_{2,n}}\Big)\, K\Big(\frac{x_2 - x_3}{h_{2,n}}\Big)\, f(x_3)\, dx_3\, dx_2\, dx_1$$
$$= \int\!\!\int\!\!\int L_1\big(z\,\alpha(f(t - h_{1,n} z))\big)\,\alpha'\big(f(t - h_{1,n} z)\big)\, f(t - h_{1,n} z)\, L_1(z_n)\,\alpha'\big(f(t - h_{1,n} z - h_{2,n} w)\big)\, f(t - h_{1,n} z - h_{2,n} w)\, K(w + s)\, K(s)\, f(t - h_{1,n} z - h_{2,n} w - h_{2,n} s)\, dw\, dz\, ds$$
$$\to f^3(t)\,\alpha'^2(f(t)) \int L_1^2\big(z\,\alpha(f(t))\big)\, dz = \frac{f^3(t)\,\alpha'^2(f(t))}{\alpha(f(t))} \int L_1^2(y)\, dy := \bar V_3,$$
while $V_1 \to \bar V_1 := \alpha(f(t))\, f(t) \int K^2(y)\, dy$.

Theorem 2. Suppose $f$ is a density in $C^4(\mathbb{R})$ and $p$ is a clipping function in $C^5(\mathbb{R})$. Then the integrated mean squared error on $D_r$ is
$$R(h_{1,n}; D_r) = \frac{\tau_4^2\, h_{1,n}^8}{576} \int_{D_r} f^2(t)\,\big[D^4 (1/f)(t)\big]^2\, dt + \frac{1}{n h_{1,n}} \int_{D_r} \big(\bar V_1 + \bar V_2 + \bar V_3\big)\, dt + o\Big(h_{1,n}^8 + \frac{1}{n h_{1,n}}\Big), \qquad (68)$$
where $\tau_4 = \int y^4 K(y)\, dy$ and $D^4(1/f)$ denotes the fourth derivative of $1/f$. Furthermore, the optimal bandwidth is given by
$$h_{1,n} = \Bigg[\frac{576 \int_{D_r} \big(\bar V_1 + \bar V_2 + \bar V_3\big)\, dt}{8\, n\, \tau_4^2 \int_{D_r} f^2(t)\,\big[D^4 (1/f)(t)\big]^2\, dt}\Bigg]^{1/9}.$$

Proof. The bias in the integrated MSE (68) is from Corollary 1 in Giné and Sang [4]. The variance is dominated by $(V_1 + V_2 + V_3)/(n h_{1,n})$; these terms come from the analysis of (19) and of the differences between (23)-(27) and (28)-(32), respectively.

References

[1] Abramson, I. (1982). On bandwidth variation in kernel estimates - a square-root law. Ann. Statist. 10, 1217-1223.

[2] Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38, 907-921.

[3] Giné, E. and Sang, H. (2010). Uniform asymptotics for kernel density estimators with variable bandwidths. J. Nonparametric Statistics 22, 773-795.
[4] Giné, E. and Sang, H. (2013). On the estimation of smooth densities by strict probability densities at optimal rates in sup-norm. IMS Collections, From Probability to Statistics and Back: High-Dimensional Models and Processes 9, 128-149.

[5] Hall, P. and Marron, J. S. (1988). Variable window width kernel estimates of probability densities. Probab. Th. Rel. Fields 80, 37-49. Erratum: Probab. Th. Rel. Fields 91, 133.

[6] Hall, P., Hu, T. and Marron, J. S. (1995). Improved variable window kernel estimates of probability densities. Ann. Statist. 23, 1-10.

[7] McKay, I. J. (1993). A note on bias reduction in variable kernel density estimates. Canad. J. Statist. 21, 367-375.

[8] McKay, I. J. (1993). Variable kernel methods in density estimation. Ph.D. Dissertation, Queen's University.

[9] Nakarmi, J. and Sang, H. (2014). Central limit theorem for the variable bandwidth kernel estimator. Submitted.

[10] Novak, S. Yu. (1999). A generalized kernel density estimator. (Russian) Teor. Veroyatnost. i Primenen. 44, 634-645; translation in Theory Probab. Appl. 44 (2000), 570-583.

[11] Terrell, G. R. and Scott, D. W. (1992). Variable kernel density estimation. Ann. Statist. 20, 1236-1265.
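As a closing numerical note: the bandwidth selection of Section 4 reduces to minimizing an expression of the shape $a h^8 + b/(nh)$ in $h$, whose minimizer satisfies $h^9 = b/(8an)$, which gives the $n^{-1/9}$ rate. A small self-contained sketch, where the constants `bias2_coef` and `var_coef` are placeholders of ours for the integrated squared-bias and variance coefficients of (68):

```python
import numpy as np

def optimal_h1(n, bias2_coef, var_coef):
    """Minimize R(h) = bias2_coef * h**8 + var_coef / (n * h), the shape of
    the integrated MSE (68).  Setting R'(h) = 0 gives
    h**9 = var_coef / (8 * bias2_coef * n), so the optimal h ~ n**(-1/9)."""
    return (var_coef / (8.0 * bias2_coef * n)) ** (1.0 / 9.0)
```

For instance, with placeholder coefficients $a = 2$, $b = 3$ and $n = 10^4$ the minimizer is about $0.30$, and multiplying $n$ by $10^9$ shrinks it by exactly a factor of $10$, reflecting the $n^{-1/9}$ rate.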