DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

ISSN 1440-771X

An Improved Method for Bandwidth Selection When Estimating ROC Curves

Peter G Hall and Rob J Hyndman

Working Paper 11/2002

An improved method for bandwidth selection when estimating ROC curves

Peter G. Hall$^{1}$ and Rob J. Hyndman$^{1,2}$

13 September 2002

Abstract: The receiver operating characteristic (ROC) curve is used to describe the performance of a diagnostic test which classifies observations into two groups. We introduce a new method for selecting bandwidths when computing kernel estimates of ROC curves. Our technique allows for interaction between the distributions of each group of observations and gives substantial improvement in MISE over other proposed methods, especially when the two distributions are very different.

Key words: Bandwidth selection; binary classification; kernel estimator; ROC curve.

JEL classification: C12, C13, C14.

1 INTRODUCTION

A receiver operating characteristic (ROC) curve can be used to describe the performance of a diagnostic test which classifies individuals into either group $G_1$ or group $G_2$. For example, $G_1$ may contain individuals with a disease and $G_2$ those without the disease. We assume that the diagnostic test is based on a continuous measurement $T$ and that a person is classified as $G_1$ if $T \ge \delta$ and $G_2$ otherwise. Let $G(t) = \Pr(T \le t \mid G_1)$ and $F(t) = \Pr(T \le t \mid G_2)$ denote the distribution functions of $T$ for each group. (Thus $F$ is the specificity of the test and $1 - G$ is the sensitivity of the test.) Then the ROC curve is defined as $R(p) = 1 - G(F^{-1}(1 - p))$ where $0 \le p \le 1$.

Let $\{X_1, \ldots, X_m\}$ and $\{Y_1, \ldots, Y_n\}$ denote independent samples of independent data from $G_2$ and $G_1$, respectively, and let $\hat F$ and $\hat G$ denote their empirical distribution functions. Then a simple estimator of $R(p)$ is $\hat R(p) = 1 - \hat G(\hat F^{-1}(1 - p))$, although this has the obvious weakness of being a step function while $R(p)$ is smooth.

Zou, W.J. Hall & Shapiro (1997) and Lloyd (1998) proposed a smooth kernel estimator of $R(p)$ as follows. Let $K(x)$ be a continuous density function and $L(x) = \int_{-\infty}^{x} K(u)\,du$.
$^1$ Centre for Mathematics and its Applications, Australian National University, Canberra ACT 0200, Australia.
$^2$ Department of Econometrics and Business Statistics, Monash University, VIC 3800, Australia.
Corresponding author: Rob Hyndman (Rob.Hyndman@monash.edu.au).

The kernel estimators of $F$ and $G$ are

$$\tilde F(t) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(\frac{t - X_i}{h_1}\right) \quad\text{and}\quad \tilde G(t) = \frac{1}{n} \sum_{i=1}^{n} L\!\left(\frac{t - Y_i}{h_2}\right).$$
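The kernel distribution estimators above, plugged into the definition $R(p) = 1 - G(F^{-1}(1 - p))$, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the biweight kernel, the grid-based inversion of the smoothed distribution function, and the names `biweight_cdf`, `kernel_cdf` and `kernel_roc` are all assumptions made for the example. Here `x` is the sample with distribution function $F$ and `y` is the sample with distribution function $G$.

```python
import numpy as np

def biweight_cdf(u):
    """L(u): integral up to u of the biweight kernel K(v) = (15/16)(1 - v^2)^2 on [-1, 1].
    The biweight kernel is an assumed choice; any continuous density K would do."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (u - 2.0 * u**3 / 3.0 + u**5 / 5.0)

def kernel_cdf(t, sample, h):
    """Smooth distribution estimate: the average of L((t - X_i) / h) over the sample."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    sample = np.asarray(sample, dtype=float)
    return biweight_cdf((t[:, None] - sample[None, :]) / h).mean(axis=1)

def kernel_roc(p, x, y, h1, h2, grid_size=2001):
    """Plug-in ROC estimate 1 - G(F^{-1}(1 - p)) using the kernel estimates of F and G."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # The smoothed F rises from 0 to 1 on [min(x) - h1, max(x) + h1].
    grid = np.linspace(x.min() - h1, x.max() + h1, grid_size)
    F = kernel_cdf(grid, x, h1)
    # Invert the nondecreasing F by linear interpolation on the grid (an approximation).
    quantile = np.interp(1.0 - np.atleast_1d(p), F, grid)
    return 1.0 - kernel_cdf(quantile, y, h2)

# Illustrative data only: F = N(0, 1) and G = N(1, 1), with arbitrary bandwidths.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(1.0, 1.0, 200)
p = np.linspace(0.0, 1.0, 101)
roc = kernel_roc(p, x, y, h1=0.5, h2=0.5)
```

The bandwidths `h1` and `h2` are fixed by hand here; choosing them from the data is the subject of the rest of the paper.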

For the sake of simplicity we have used the same kernel for each distribution, although of course this is not strictly necessary. The kernel estimator of $R(p)$ is then $\tilde R(p) = 1 - \tilde G(\tilde F^{-1}(1 - p))$. Qiu & Le (2001) and Peng & Zhou (2002) have discussed estimators alternative to $\tilde R(p)$.

Lloyd and Yong (1999) were the first to suggest empirical methods for choosing bandwidths $h_1$ and $h_2$ of appropriate size for $\tilde R(p)$, but they treated the problem as one of estimating $F$ and $G$ separately, rather than of estimating the ROC function $R$. We shall show that by adopting the latter approach one can significantly reduce the surplus of mean squared error over its theoretical minimum level. This is particularly true in the practically interesting case where $F$ and $G$ are quite different. In the present paper we introduce and describe a bandwidth choice method which achieves these levels of performance.

A related problem, which leads to bandwidths of the correct order but without the correct constants, is that of smoothing in distribution estimation. See, for example, Mielniczuk, Sarda and Vieu (1989), Sarda (1993), Altman and Léger (1995), and Bowman, Hall and Prvan (1998).

2 METHODOLOGY

2.1 Optimality criterion and optimal bandwidths

If the tails of the distribution $F$ are much lighter than those of $G$ then the error of an estimator of $F$ in its tail can produce a relatively large contribution to the error of the corresponding estimator of $G(F^{-1})$. As a result, if the $L_2$ performance criterion

$$\gamma_1(S) = \int_S E\big[\hat G(\hat F^{-1}(p)) - G(F^{-1}(p))\big]^2\,dp \tag{2.1}$$

for a set $S \subseteq [0, 1]$, is not weighted in an appropriate way then choice of the optimal bandwidth in terms of $\gamma_1(S)$ can be driven by relative tail properties of the densities $f = F'$ and $g = G'$. Formula (A.1) in the appendix will provide a theoretical illustration of this phenomenon. We suggest that the weight be chosen equal to $f(F^{-1})$, so that the $L_2$ criterion becomes

$$\gamma(S) = \int_S E\big[\hat G(\hat F^{-1}(p)) - G(F^{-1}(p))\big]^2 f(F^{-1}(p))\,dp. \tag{2.2}$$
We shall show in the appendix that for this definition of mean integrated squared error,

$$\gamma(S) \sim \beta(S) \equiv \int_{F^{-1}(S)} \Big\{ E[\hat F(t) - F(t)]^2\, g^2(t) + E[\hat G(t) - G(t)]^2\, f^2(t) \Big\}\,dt \tag{2.3}$$

where $F^{-1}(S)$ denotes the set of points $F^{-1}(p)$ with $p \in S$. Note particularly that the right-hand side is additive in the mean squared errors $E(\hat F - F)^2$ and $E(\hat G - G)^2$, so that in principle $h_1$ and $h_2$ may be chosen individually, rather than together. That is, if $h_1$ and $h_2$ minimise

$$\beta_1(S) = \int_{F^{-1}(S)} E[\hat F(t) - F(t)]^2\, g^2(t)\,dt \quad\text{and}\quad \beta_2(S) = \int_{F^{-1}(S)} E[\hat G(t) - G(t)]^2\, f^2(t)\,dt,$$

respectively, then they provide asymptotic minimisation of $\gamma(S)$.

To express optimality we take $F^{-1}(S)$ equal to the whole real line, obtaining the global criterion $\beta(h_1, h_2) = \beta_1(h_1, h_2) + \beta_2(h_1, h_2)$ where

$$\beta_1(h_1, h_2) = \int E[\hat F(t) - F(t)]^2\, g^2(t)\,dt \quad\text{and}\quad \beta_2(h_1, h_2) = \int E[\hat G(t) - G(t)]^2\, f^2(t)\,dt. \tag{2.4}$$

Suppose $K$ is a compactly supported and symmetric probability density, and $f'$ is bounded, continuous and square-integrable. Then arguments similar to those of Azzalini (1981) show that

$$E(\hat F - F)^2 = m^{-1}\big[(1 - F)F - h_1 \rho f\big] + \big(\tfrac{1}{2} \rho_2 h_1^2 f'\big)^2 + o(m^{-1} h_1 + h_1^4),$$

where

$$\rho = \int \{1 - L(u)\}\, L(u)\,du, \qquad \rho_2 = \int u^2 K(u)\,du.$$

Of course, an analogous formula holds for $E(\hat G - G)^2$, and so the formulae at (2.4) admit simple asymptotic approximations:

$$\beta_1 = m^{-1} \int (1 - F) F g^2 + \delta_1 + o(m^{-1} h_1 + h_1^4), \qquad \beta_2 = n^{-1} \int (1 - G) G f^2 + \delta_2 + o(n^{-1} h_2 + h_2^4),$$

where

$$\delta_1 = -m^{-1} h_1 \rho \int f g^2 + \tfrac{1}{4} \rho_2^2 h_1^4 \int (f' g)^2 \tag{2.5}$$

and

$$\delta_2 = -n^{-1} h_2 \rho \int f^2 g + \tfrac{1}{4} \rho_2^2 h_2^4 \int (f g')^2. \tag{2.6}$$

The asymptotically optimal bandwidths are therefore $h_1 = m^{-1/3} c(f, g)$ and $h_2 = n^{-1/3} c(g, f)$ where

$$c(f, g)^3 = \Big\{ \rho \int f(u)\, g^2(u)\,du \Big\} \Big/ \Big\{ \rho_2^2 \int [f'(u)\, g(u)]^2\,du \Big\}.$$

A conventional plug-in rule for choosing $h_1$ and $h_2$ may be developed directly from these formulae. However, it requires selection of pilot bandwidths for estimating $f$, $g$ and their derivatives. The technique suggested in the next section avoids that difficulty.
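To see where the optimal bandwidths come from, differentiate (2.5) in $h_1$ and set the result to zero: $-m^{-1} \rho \int f g^2 + \rho_2^2 h_1^3 \int (f' g)^2 = 0$, which gives $h_1^3 = m^{-1} c(f, g)^3$. The sketch below evaluates the constants numerically for one concrete case. It is an illustration under stated assumptions, not part of the paper's method: the biweight kernel, the choice $F = N(0, 1)$ and $G = N(1, 1)$ with known densities, the sample sizes, and all function names are hypothetical.

```python
import numpy as np

def trapezoid(values, grid):
    """Composite trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0)

def biweight(u):
    """Assumed kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def biweight_cdf(u):
    """L(u), the integral of the biweight kernel up to u."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (u - 2.0 * u**3 / 3.0 + u**5 / 5.0)

def kernel_constants(num=20001):
    """rho = int (1 - L) L du and rho_2 = int u^2 K(u) du for the biweight kernel."""
    u = np.linspace(-1.0, 1.0, num)
    L = biweight_cdf(u)
    return trapezoid((1.0 - L) * L, u), trapezoid(u**2 * biweight(u), u)

def normal_pdf(t, mu):
    return np.exp(-0.5 * (t - mu) ** 2) / np.sqrt(2.0 * np.pi)

def normal_pdf_deriv(t, mu):
    return -(t - mu) * normal_pdf(t, mu)

def c_constant(f, f_deriv, g, rho, rho2, grid):
    """c(f, g) = { rho int f g^2 / (rho_2^2 int (f' g)^2) }^(1/3)."""
    numerator = rho * trapezoid(f(grid) * g(grid) ** 2, grid)
    denominator = rho2**2 * trapezoid((f_deriv(grid) * g(grid)) ** 2, grid)
    return (numerator / denominator) ** (1.0 / 3.0)

rho, rho2 = kernel_constants()
grid = np.linspace(-9.5, 10.5, 20001)          # wide enough for both normal densities
f = lambda t: normal_pdf(t, 0.0)               # assumed density of F
g = lambda t: normal_pdf(t, 1.0)               # assumed density of G
f_deriv = lambda t: normal_pdf_deriv(t, 0.0)
g_deriv = lambda t: normal_pdf_deriv(t, 1.0)

m, n = 50, 50                                   # illustrative sample sizes
h1 = m ** (-1.0 / 3.0) * c_constant(f, f_deriv, g, rho, rho2, grid)
h2 = n ** (-1.0 / 3.0) * c_constant(g, g_deriv, f, rho, rho2, grid)
```

In practice $f$ and $g$ are unknown, which is why a direct plug-in version of this calculation needs pilot estimates of the densities and their derivatives.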

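2.2 Empirical choice of bandwidth

Let $\hat f_2$ and $\hat g_2$ denote leave-one-out kernel estimators of $f^2$ and $g^2$, respectively:

$$\hat f_2(x \mid h_1) = \frac{2}{m(m-1)h_1^2} \sum_{1 \le i_1 < i_2 \le m} K\!\left(\frac{x - X_{i_1}}{h_1}\right) K\!\left(\frac{x - X_{i_2}}{h_1}\right),$$

$$\hat g_2(y \mid h_2) = \frac{2}{n(n-1)h_2^2} \sum_{1 \le i_1 < i_2 \le n} K\!\left(\frac{y - Y_{i_1}}{h_2}\right) K\!\left(\frac{y - Y_{i_2}}{h_2}\right).$$

Let $\hat f_{-i}(x \mid h_1) = \{(m-1)h_1\}^{-1} \sum_{j \ne i} K\{(x - X_j)/h_1\}$, and define $\hat g_{-i}(y \mid h_2)$ analogously, and let $\hat f_1$ and $\hat g_1$ denote the kernel estimators of $(f')^2$ and $(g')^2$, respectively:

$$\hat f_1(x \mid h_1) = \frac{1}{m(m-1)h_1^4} \sum_{i_1=1}^{m} \sum_{i_2=1}^{m} K'\!\left(\frac{x - X_{i_1}}{h_1}\right) K'\!\left(\frac{x - X_{i_2}}{h_1}\right),$$

$$\hat g_1(y \mid h_2) = \frac{1}{n(n-1)h_2^4} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} K'\!\left(\frac{y - Y_{i_1}}{h_2}\right) K'\!\left(\frac{y - Y_{i_2}}{h_2}\right).$$

Note that the latter two estimators include all terms whereas the other estimators are leave-one-out estimators. We include the diagonal terms in the estimators of $(f')^2$ and $(g')^2$ as they act like ridge parameters and produce better empirical performance.

Now let

$$\Delta(h_1, h_2) = -m^{-1} h_1 \rho\, \frac{1}{m} \sum_{i=1}^{m} \hat g_2(X_i \mid h_2) + \tfrac{1}{4} \rho_2^2 h_1^4\, \frac{1}{n} \sum_{i=1}^{n} \hat f_1(Y_i \mid h_1)\, \hat g_{-i}(Y_i \mid h_2)$$
$$\qquad\qquad - n^{-1} h_2 \rho\, \frac{1}{n} \sum_{i=1}^{n} \hat f_2(Y_i \mid h_1) + \tfrac{1}{4} \rho_2^2 h_2^4\, \frac{1}{m} \sum_{i=1}^{m} \hat g_1(X_i \mid h_2)\, \hat f_{-i}(X_i \mid h_1).$$

We could choose $h_1$ and $h_2$ to minimize $\Delta(h_1, h_2)$. To motivate this approach, note that

$$E\{\Delta(h_1, h_2)\} = -m^{-1} h_1 \rho \int (E\hat g)^2 f + \tfrac{1}{4} \rho_2^2 h_1^4 \int (E\hat f')^2 (E\hat g)\, g - n^{-1} h_2 \rho \int (E\hat f)^2 g + \tfrac{1}{4} \rho_2^2 h_2^4 \int (E\hat g')^2 (E\hat f)\, f, \tag{2.7}$$

where $\hat f$ and $\hat g$ denote the ordinary kernel density estimators of $f$ and $g$ with bandwidths $h_1$ and $h_2$. This indicates that $\Delta$ is an almost-unbiased approximation to $\delta = \delta_1 + \delta_2$; compare (2.7) with the sum of the terms at (2.5) and (2.6). The relative size of stochastic error may also be shown to be asymptotically negligible. Indeed, if $m \asymp n$ as $n \to \infty$, if $K$ is compactly supported and has a Hölder-continuous derivative, and if $f$ and $g$ are compactly supported and have three bounded derivatives, then $\Delta(h_1, h_2)/\delta(h_1, h_2)$ converges to 1 with probability 1, uniformly in $n^{-1+\epsilon} \le h_1, h_2 \le n^{-\epsilon}$ for each $0 < \epsilon < 1/2$, as $n \to \infty$.

However, minimizing $\Delta(h_1, h_2)$ leads to some numerical instability. Instead, we constrain the minimization so that $h_1 = \rho h_2$ where $\rho = \hat h_1/\hat h_2$ and $\hat h_1$ and $\hat h_2$ are the bandwidths selected for estimating $F$ and $G$ using the plug-in rule proposed by Lloyd and Yong (1999). (In this constraint $\rho$ denotes the ratio of pilot bandwidths, not the kernel constant of Section 2.1.) Minimizing $\Delta(h_1, h_2)$ under this constraint provides values of $h_1$ and $h_2$ which are suitable for use in $\tilde R(p)$.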
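The empirical criterion of this section and its constrained minimisation can be sketched as follows. This is a rough illustration under explicit assumptions, not the authors' implementation: the biweight kernel (for which the two kernel constants of Section 2.1 equal 50/231 and 1/7), the grid search, and every function name are choices made for the example. In particular, the pilot ratio is passed in as an argument, and the value used in the demonstration is a hypothetical stand-in (a ratio of sample standard deviations), not the Lloyd and Yong (1999) plug-in rule.

```python
import numpy as np

# Kernel constants for the assumed biweight kernel: int (1 - L) L du and int u^2 K(u) du.
RHO, RHO2 = 50.0 / 231.0, 1.0 / 7.0

def K(u):
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def K_deriv(u):
    return np.where(np.abs(u) <= 1.0, -(15.0 / 4.0) * u * (1.0 - u**2), 0.0)

def sq_density_hat(points, sample, h):
    """Estimate of the squared density at `points`, omitting diagonal (i1 = i2) terms.
    Uses 2 * sum_{i1 < i2} a_i1 a_i2 = (sum a)^2 - sum a^2."""
    w = K((points[:, None] - sample[None, :]) / h)
    size = sample.size
    return (w.sum(axis=1) ** 2 - (w**2).sum(axis=1)) / (size * (size - 1) * h**2)

def sq_deriv_hat(points, sample, h):
    """Estimate of the squared density derivative at `points`, diagonal terms included."""
    w = K_deriv((points[:, None] - sample[None, :]) / h)
    size = sample.size
    return w.sum(axis=1) ** 2 / (size * (size - 1) * h**4)

def loo_density(sample, h):
    """Leave-one-out density estimate evaluated at each point of its own sample."""
    w = K((sample[:, None] - sample[None, :]) / h)
    return (w.sum(axis=1) - K(0.0)) / ((sample.size - 1) * h)

def delta_hat(h1, h2, x, y):
    """Empirical criterion: four terms estimating delta_1 + delta_2."""
    m, n = x.size, y.size
    t1 = -(h1 * RHO / m) * sq_density_hat(x, y, h2).mean()
    t2 = 0.25 * RHO2**2 * h1**4 * (sq_deriv_hat(y, x, h1) * loo_density(y, h2)).mean()
    t3 = -(h2 * RHO / n) * sq_density_hat(y, x, h1).mean()
    t4 = 0.25 * RHO2**2 * h2**4 * (sq_deriv_hat(x, y, h2) * loo_density(x, h1)).mean()
    return float(t1 + t2 + t3 + t4)

def select_bandwidths(x, y, ratio, h2_grid):
    """Minimise delta_hat over h2 on a grid, subject to h1 = ratio * h2."""
    values = np.array([delta_hat(ratio * h2, h2, x, y) for h2 in h2_grid])
    best = int(np.argmin(values))
    return ratio * h2_grid[best], h2_grid[best], values[best]

# Illustrative data only.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100)
y = rng.normal(1.0, 1.0, 100)
ratio = x.std(ddof=1) / y.std(ddof=1)   # hypothetical stand-in for the pilot ratio
h1, h2, value = select_bandwidths(x, y, ratio, np.linspace(0.2, 3.0, 29))
```

The constraint reduces the search to one dimension, which is why a plain grid search is enough for a sketch; a finer grid or a scalar optimiser could replace it.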

3 SOME SIMULATIONS

We compare the estimates obtained with our bandwidth selection method outlined above to those obtained by Lloyd and Yong (1999) using their plug-in rule. Let

$$W(p) = E\big[\hat G(\hat F^{-1}(p)) - G(F^{-1}(p))\big]^2 f(F^{-1}(p)) \tag{3.1}$$

denote mean squared error. Thus, mean integrated squared error, introduced at (2.2), is given by $\gamma(S) = \int_S W(p)\,dp$. The ideal but practically unattainable minimum of $W(p)$, for a nonrandom bandwidth, can be deduced by simulation, and will be denoted by $W_0(p)$. This value will be compared with its analogue, $W_1(p)$, obtained from (3.1) using the values of $h_1$ and $h_2$ chosen using the method outlined in Section 2.2; and with $W_2(p)$, obtained from (3.1) using the values of $h_1$ and $h_2$ chosen using the plug-in procedure suggested by Lloyd and Yong (1999).

In our first example, illustrated in the first panel of Figure 1, we used Lloyd and Yong's (1999) model, where $F$ and $G$ are $N(0, 1)$ and $N(1, 1)$ respectively. In the second example we chose $F$ and $G$ to be more different; $F$ was $N(0, 1)$ and $G$ was an equal mixture of $N(-2, 1)$ and $N(2, 1)$. In both cases our method offers an improvement, which as expected is greater when the distributions are further apart. The areas under the curves represent the increase in $\gamma(S)$ due to bandwidth selection. In these terms our method improves on that of Lloyd and Yong (1999) by 1.% and 8.6%, in the respective examples.

[Figure 1: two panels, Example 1 and Example 2, each plotting $W_i - W_0$ against $p$ for $0 \le p \le 1$.]

Figure 1: Solid lines: $W_1(p) - W_0(p)$. Dashed lines: $W_2(p) - W_0(p)$.
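A Monte Carlo approximation to the mean squared error curve at (3.1) for a fixed, nonrandom pair of bandwidths might look like the sketch below, using the first example ($F = N(0, 1)$, $G = N(1, 1)$). Everything here is an assumption made for illustration: the biweight kernel, the grid-based quantile inversion, the sample sizes, the number of replications and the bandwidth values are not taken from the paper, and the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def biweight_cdf(u):
    """L(u) for the assumed biweight kernel K(v) = (15/16)(1 - v^2)^2 on [-1, 1]."""
    u = np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (u - 2.0 * u**3 / 3.0 + u**5 / 5.0)

def kernel_cdf(t, sample, h):
    """Smooth distribution estimate: the average of L((t - X_i) / h)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return biweight_cdf((t[:, None] - sample[None, :]) / h).mean(axis=1)

def kernel_quantile(p, sample, h, grid_size=1001):
    """Approximate inverse of the smoothed distribution function by interpolation."""
    grid = np.linspace(sample.min() - h, sample.max() + h, grid_size)
    return np.interp(p, kernel_cdf(grid, sample, h), grid)

def simulate_W(p, h1, h2, m, n, reps, rng):
    """Monte Carlo estimate of W(p) for F = N(0, 1), G = N(1, 1), fixed bandwidths."""
    std = NormalDist()
    q = np.array([std.inv_cdf(pi) for pi in p])           # F^{-1}(p)
    target = np.array([std.cdf(qi - 1.0) for qi in q])    # G(F^{-1}(p))
    weight = np.array([std.pdf(qi) for qi in q])          # f(F^{-1}(p))
    sq_err = np.zeros_like(p)
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, m)                       # sample with distribution F
        y = rng.normal(1.0, 1.0, n)                       # sample with distribution G
        estimate = kernel_cdf(kernel_quantile(p, x, h1), y, h2)
        sq_err += (estimate - target) ** 2
    return weight * sq_err / reps

p = np.linspace(0.05, 0.95, 19)   # interior points only, since inv_cdf needs 0 < p < 1
W = simulate_W(p, h1=0.8, h2=0.8, m=50, n=50, reps=20, rng=np.random.default_rng(2))
```

Repeating the call over a grid of bandwidth pairs and taking the pointwise minimum would give one rough way to approximate an ideal curve of the kind denoted $W_0(p)$; far more replications than the 20 used here would be needed for stable results.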

APPENDIX: Derivation of (2.3)

Assume that $f$ and $g$ have continuous derivatives and are bounded away from 0 on $S$. Put $A = \hat F - F$, $B = \hat G - G$ and $C = \hat F^{-1} - F^{-1}$, and write $I$ for the identity function. Then by Taylor expansion,

$$I = \hat F(F^{-1} + C) = I + A(F^{-1}) + C f(F^{-1}) + o_p\big(|A(F^{-1})| + |C|\big),$$

whence it follows that $C = -[A(F^{-1})/f(F^{-1})] + o_p(|A(F^{-1})|)$. Hence,

$$\hat G(\hat F^{-1}) - G(F^{-1}) = B(F^{-1}) - \frac{g(F^{-1})}{f(F^{-1})}\, A(F^{-1}) + o_p\big(|A(F^{-1})| + |B(F^{-1})|\big). \tag{A.1}$$

Note the ratio $g(F^{-1})/f(F^{-1})$ on the right-hand side of (A.1). Since the variance of $A$ equals $m^{-1}(1 - F)F$ to first order, the unweighted criterion $\gamma_1$, defined at (2.1), can be largely determined by the value of $(g/f)^2 (1 - F)F$ in the tails if this quantity is not bounded. Using instead the weighted criterion $\gamma$, defined at (2.2), we may deduce from (A.1), related computations and the independence of the samples that

$$\int_S E\big[\hat G(\hat F^{-1}) - G(F^{-1})\big]^2 f(F^{-1}) = [1 + o(1)] \int_{F^{-1}(S)} \big[E(B^2) f^2 + E(A^2) g^2\big],$$

which is equivalent to (2.3).

REFERENCES

Altman, N. and Léger, C. (1995). Bandwidth selection for kernel distribution function estimation. J. Statist. Plann. Inf. 46, 195–214.

Azzalini, A. (1981). A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68, 326–328.

Bowman, A.W., Hall, P. and Prvan, T. (1998). Cross-validation for the smoothing of distribution functions. Biometrika 85, 799–808.

Lloyd, C.J. (1998). The use of smoothed ROC curves to summarise and compare diagnostic systems. J. Amer. Statist. Assoc. 93, 1356–1364.

Lloyd, C.J. and Yong, Z. (1999). Kernel estimators of the ROC curve are better than empirical. Statist. Prob. Letters 44, 221–228.

Mielniczuk, J., Sarda, P. and Vieu, P. (1989). Local data-driven bandwidth choice for density estimation. J. Statist. Plann. Inf. 23, 53–69.

Peng, L. and Zhou, X.-H. (2002). Local linear smoothing of receiver operator characteristic (ROC) curves. J. Statist. Plann. Inf., to appear.

Qiu, P. and Le, C. (2001). ROC curve estimation based on local smoothing. J. Statist. Comput. and Simul. 70, 55–69.

Sarda, P. (1993). Smoothing parameter selection for smooth distribution functions. J. Statist. Plann. Inf. 35, 65–75.

Zou, K.H., Hall, W.J. and Shapiro, D.E. (1997). Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statistics in Medicine 16, 2143–2156.