A new distribution-free quantile estimator

Biometrika (1982), 69, 3, pp. 635-40
Printed in Great Britain

BY FRANK E. HARRELL
Clinical Biostatistics, Duke University Medical Center, Durham, North Carolina, U.S.A.

AND C. E. DAVIS
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, U.S.A.

SUMMARY

A new distribution-free estimator $Q_p$ of the $p$th population quantile is formulated, where $Q_p$ is a linear combination of order statistics admitting a jackknife variance estimator having excellent properties. The small sample efficiency of $Q_p$ is studied under a variety of light- and heavy-tailed symmetric and asymmetric distributions. For the distributions and values of $p$ studied, $Q_p$ is generally substantially more efficient than the traditional estimator based on one or two order statistics.

Some key words: Distribution-free estimator; Nonparametric estimator; Order statistic; Percentile; Quantile.

1. INTRODUCTION

The estimation of population quantiles or percentiles is of great interest, particularly when the statistician is unwilling to assume a parametric form for the distribution or even to assume the distribution to be symmetric. Sample quantiles have many desirable properties. However, they also have drawbacks. They are not particularly efficient estimators of location for distributions such as the normal, good estimators of the variance of sample quantiles do not exist for general distributions, sample quantiles may not be jackknifed, and the sample median differs in form and in efficiency depending on the sample size being even or odd. Maritz & Jarrett (1978) have developed an estimator of the variance of the sample median that performs well for some distributions.

We propose to estimate the $p$th quantile by a linear combination of the order statistics. For most distributions the new estimator offers a significant gain in efficiency over the traditional one and admits a jackknife variance estimator that performs well.
The properties of the estimator and its variance estimator are studied over a wide variety of light- and heavy-tailed symmetric and asymmetric distributions, with emphasis on small sample results.

2. ESTIMATORS

Let $X_1, \ldots, X_n$ denote a random sample of size $n$ from a continuous distribution having distribution function $F(\cdot)$. Let $X_{(1)} \le \cdots \le X_{(n)}$ denote the order statistics of the sample and $X = (X_{(1)}, \ldots, X_{(n)})$. One traditional estimator of the $p$th population quantile $F^{-1}(p)$ is

$$T_p = (1-g)\, X_{(j)} + g\, X_{(j+1)}, \qquad (1)$$

where $(n+1)p = j + g$ and $j$ is the integral part of $(n+1)p$. When $p = \tfrac12$, $T_p$ is the usual sample median.

The expected value of the $k$th order statistic is given by

$$E(X_{(k)}) = \frac{1}{\beta(k,\, n-k+1)} \int_{-\infty}^{\infty} x\, F(x)^{k-1} \{1-F(x)\}^{n-k}\, dF(x).$$

Since $E(X_{\{(n+1)p\}})$ converges to $F^{-1}(p)$ for $p \in (0, 1)$, we take as our estimator of $F^{-1}(p)$ something which estimates $E(X_{\{(n+1)p\}})$ whether or not $(n+1)p$ is an integer, namely

$$Q_p = \frac{1}{\beta\{(n+1)p,\, (n+1)(1-p)\}} \int_0^1 F_n^{-1}(y)\, y^{(n+1)p-1} (1-y)^{(n+1)(1-p)-1}\, dy,$$

where $F_n(x)$ is the sample distribution function, $F_n(x) = n^{-1} \sum_{i=1}^{n} I(X_i \le x)$, $I(A)$ being the indicator function of the set $A$. The estimator can be reexpressed as

$$Q_p = \sum_{i=1}^{n} W_{n,i}\, X_{(i)}, \qquad (2)$$

where

$$W_{n,i} = \frac{1}{\beta\{(n+1)p,\, (n+1)(1-p)\}} \int_{(i-1)/n}^{i/n} y^{(n+1)p-1} (1-y)^{(n+1)(1-p)-1}\, dy = I_{i/n}\{p(n+1),\, (1-p)(n+1)\} - I_{(i-1)/n}\{p(n+1),\, (1-p)(n+1)\} \qquad (3)$$

and $I_x(a, b)$ denotes the incomplete beta function. Maritz & Jarrett (1978) used a similar idea to estimate the second moment of the sample median.

For $p = \tfrac12$, or $n \ge 100$ with $p \ne \tfrac12$, the weights $W_{n,i}$ in (3) can be adequately calculated with numerical integration using Simpson's rule with 2 intervals between $(i-1)/n$ and $i/n$. For other cases, the incomplete beta function should be calculated exactly. The algorithm of Majumdar & Bhattacharjee (1973) is very efficient for this calculation.

Following David (1981, p. 273), $Q_p$ is asymptotically normally distributed under mild assumptions on $F(\cdot)$. Monte Carlo studies using the Kolmogorov-Smirnov statistic have shown that for the uniform and normal distributions, the normal approximation is adequate for samples as small as 20 for $p = \tfrac12$ or 30-50 for $p = 0.95$. For asymmetric distributions such as the exponential, sample sizes as large as 80-100 may be required for $p = 0.9$ or above. For the calculation of tail probabilities using the variance estimator given below and a normal approximation, sample sizes necessary for accurate confidence intervals may be smaller. However, more research is needed in this area.
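As a minimal sketch of the two estimators in (1) and (2), the following Python code computes the weights in (3) as differences of the regularized incomplete beta function. The function names (`betainc_reg`, `hd_weights`, `hd_quantile`, `traditional_quantile`) are illustrative and not from the paper. The incomplete beta function is evaluated here by a standard continued fraction, which is swapped in for Algorithm AS 63, and the clamping of $T_p$ to the sample extremes when $(n+1)p$ falls outside $[1, n]$ is an assumption of this sketch.

```python
import math


def _guard(v, tiny=1e-300):
    """Keep a continued-fraction term away from exact zero."""
    return tiny if abs(v) < tiny else v


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    c = 1.0
    d = 1.0 / _guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        # Even-numbered term of the continued fraction.
        t = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        d = 1.0 / _guard(1.0 + t * d)
        c = _guard(1.0 + t / c)
        h *= d * c
        # Odd-numbered term of the continued fraction.
        t = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        d = 1.0 / _guard(1.0 + t * d)
        c = _guard(1.0 + t / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc_reg(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a) where the fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def hd_weights(n, p):
    """Weights W_{n,i}, i = 1..n, of equation (3)."""
    a, b = (n + 1) * p, (n + 1) * (1.0 - p)
    cdf = [betainc_reg(a, b, i / n) for i in range(n + 1)]
    return [cdf[i] - cdf[i - 1] for i in range(1, n + 1)]


def hd_quantile(sample, p):
    """Estimator Q_p of equation (2): weighted sum of the order statistics."""
    xs = sorted(sample)
    return sum(w * x for w, x in zip(hd_weights(len(xs), p), xs))


def traditional_quantile(sample, p):
    """Estimator T_p of equation (1), with (n+1)p = j + g."""
    xs = sorted(sample)
    n = len(xs)
    h = (n + 1) * p
    j = int(math.floor(h))
    g = h - j
    if j < 1:        # assumption of this sketch: clamp to the smallest value
        return xs[0]
    if j >= n:       # assumption of this sketch: clamp to the largest value
        return xs[-1]
    return (1.0 - g) * xs[j - 1] + g * xs[j]
```

For example, `hd_quantile(data, 0.5)` and `traditional_quantile(data, 0.5)` give the two median estimates for a list `data`. Because the weights are differences of a distribution function evaluated from 0 to 1, they sum to one for every $n$ and $p$.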
In order to calculate the jackknife variance estimate of $Q_p$ (Miller, 1974), consider removing order statistic $j$ from the sample. The resulting estimate is

$$S_j = s_j^{T} X, \qquad (s_j)_i = I(i \ne j)\, W_{n-1,\, i - I(i > j)}.$$

The jackknife variance estimator is

$$V_p = \frac{n-1}{n} \sum_{j=1}^{n} (S_j - \bar{S})^2 = (n-1) \Big( n^{-1} \sum_{j=1}^{n} S_j^2 - \bar{S}^2 \Big), \qquad (4)$$

where

$$\bar{S} = n^{-1} \sum_{j=1}^{n} S_j = n^{-1} \sum_{i=1}^{n} X_{(i)} \{ (i-1) W_{n-1,i-1} + (n-i) W_{n-1,i} \},$$

$$\sum_{j=1}^{n} S_j^2 = X^{T} A X,$$

$$A_{ll} = (l-1) W_{n-1,l-1}^2 + (n-l) W_{n-1,l}^2,$$

$$A_{lm} = (l-1) W_{n-1,l-1} W_{n-1,m-1} + (m-l-1) W_{n-1,l} W_{n-1,m-1} + (n-m) W_{n-1,l} W_{n-1,m} \quad (m > l),$$

$$A_{ml} = A_{lm}, \qquad W_{n-1,i} = 0 \quad (i < 1 \text{ or } i > n-1).$$

Simulations show that the jackknife version of $Q_p$, while having lower bias than $Q_p$, has larger variance, resulting in an estimator with similar efficiency to $T_p$. The extreme order statistic weights for the jackknife estimator for small $n$ and $p \ne \tfrac12$ are sometimes negative, resulting in nearly unbiased estimators of extreme quantiles although having large variance. The jackknifed quantile estimator will not be discussed further, only $V_p$, the associated variance estimator.

Kaigh & Lachenbruch (1982) have proposed a quantile estimator which is the average of subsample $T_p$-like estimators. Their estimator has properties similar to $Q_p$ and does not require numerical integration. It may require larger sample sizes for estimating extreme quantiles, and a variance estimator has not been studied.

3. EFFICIENCY OF $Q_p$ RELATIVE TO $T_p$ FOR VARIOUS DISTRIBUTIONS

To investigate the performance of $Q_p$ with respect to $T_p$ for a wide variety of distributions, the generalized lambda distribution (Ramberg et al., 1979) was considered. The distribution is defined by

$$F^{-1}(p) = \mu + \sigma \{ p^{a} - (1-p)^{b} \},$$

where $\mu$ and $\sigma$ are respectively location and scale parameters, set to 0 and 1 for this investigation, and $a$ and $b$ are shape parameters. Table 1 shows the distributions used with the standardized skewness, $\alpha_3$, and kurtosis, $\alpha_4$, values.

Table 1.
Generalized lambda distributions

   a         b         $\alpha_3$   $\alpha_4$   Description
   1         1         0            1.8          Light-tailed symmetric
   0.1349    0.1349    0            3            Normal-like
  -0.1359   -0.1359    0            9            Very heavy-tailed symmetric, like t distribution with 5 degrees of freedom
  -1        -1         $\infty$     $\infty$     Cauchy-like
   0.0251    0.0953    0.9          4.2          Medium-tailed asymmetric
   0         0.0004    2            9            Exponential-like

The mean squared error of an estimator was used to measure its efficiency, for example

$$\mathrm{MSE}(Q_p) = E\{Q_p - F^{-1}(p)\}^2, \qquad \mathrm{MSE}(T_p) = E\{T_p - F^{-1}(p)\}^2.$$

These mean squared errors were estimated by generating 1000 random samples for each distribution and each $n$ and averaging the squared errors. Random order statistics were generated using the method of Lurie & Hartley (1972) incorporating a Tausworthe uniform random number generator (Whittlesey, 1968).
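The Monte Carlo estimate of a mean squared error described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: samples are drawn by applying the quantile function $F^{-1}$ to uniform variates from Python's `random` module rather than by the Lurie & Hartley method with a Tausworthe generator, the names `gld_quantile` and `monte_carlo_mse` are hypothetical, and the example is intended for the rows of Table 1 with non-negative shape parameters. For $p = \tfrac12$ the sample median is used as the estimator, since $T_{1/2}$ is the usual sample median.

```python
import random
import statistics


def gld_quantile(u, a, b, mu=0.0, sigma=1.0):
    """Generalized lambda quantile function: mu + sigma * (u**a - (1-u)**b)."""
    return mu + sigma * (u ** a - (1.0 - u) ** b)


def monte_carlo_mse(estimator, n, p, a, b, reps=1000, seed=1982):
    """Average squared error of estimator(sample, p) about the true quantile.

    `estimator` is any callable taking a list of observations and p.
    The seed default is arbitrary and only makes the run reproducible.
    """
    rng = random.Random(seed)
    target = gld_quantile(p, a, b)  # true pth quantile F^{-1}(p)
    total = 0.0
    for _ in range(reps):
        # Inverse-transform sampling: F^{-1}(U) with U uniform on [0, 1).
        sample = [gld_quantile(rng.random(), a, b) for _ in range(n)]
        total += (estimator(sample, p) - target) ** 2
    return total / reps


def sample_median(sample, p):
    """T_p for p = 1/2 only; p is accepted for a uniform call signature."""
    return statistics.median(sample)
```

A call such as `monte_carlo_mse(sample_median, 10, 0.5, 0.1349, 0.1349)` then estimates $\mathrm{MSE}(T_{1/2})$ for the normal-like distribution with $n = 10$. Passing the estimator as a callable lets the same samples-and-average loop be reused for any other quantile estimator.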

[Figure 1: panels (a) a = 1, b = 1; (b) a = 0.1349, b = 0.1349; (d) a = -1, b = -1; (e) a = 0.0251, b = 0.0953; (f) a = 0, b = 0.0004. Curves are shown for p = 0.05, 0.10, 0.25, 0.50, 0.75, 0.90 and 0.95; the horizontal axis runs from 0 to 60.]

Fig. 1. Simulated mean squared error efficiency of $Q_p$ against $T_p$ for generalized lambda distributions with different parameters of the distribution.

The efficiency of $Q_p$ relative to $T_p$ is $\mathrm{MSE}(T_p)/\mathrm{MSE}(Q_p)$. This was estimated by taking the ratio of estimated mean squared errors. Monte Carlo experiments were performed for $n$ = 6, 10, 16, 23, 45, 60 and $p$ = 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95. Results for $p > \tfrac12$ are not displayed for symmetric distributions. The results are shown in Fig. 1. One additional experiment was performed for the normal distribution for $n = 250$ and $p = \tfrac12$, resulting in an estimated relative efficiency of $Q_p$ of 1.07. From Fig. 1 we see that, except for the Cauchy-like distribution, $Q_p$ is generally more than 1.1 times as efficient as $T_p$.
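Returning to the variance estimator of Section 2, whose performance is examined in Section 5: the first expression for $V_p$ in (4) can be sketched by recomputing the estimate with each order statistic removed in turn, rather than through the matrix $A$. The code below is a hedged illustration. The names `simpson_weights` and `jackknife_variance` are hypothetical, the weights $W_{n-1,i}$ are obtained by composite Simpson's rule with 20 subintervals per cell (an assumption of this sketch; the text suggests 2 intervals in the cases where Simpson's rule is adequate), and the function refuses cases where a beta parameter is below 1, where the integrand is unbounded and an exact incomplete beta function should be used instead.

```python
import math


def simpson_weights(m, p, panels=20):
    """Approximate W_{m,i}, i = 1..m, by composite Simpson's rule on each cell."""
    if panels % 2:
        raise ValueError("Simpson's rule needs an even number of subintervals")
    a, b = (m + 1) * p, (m + 1) * (1.0 - p)
    if a < 1.0 or b < 1.0:
        raise ValueError("integrand unbounded; use an exact incomplete beta")
    norm = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

    def density(y):
        # Beta density with parameters (m+1)p and (m+1)(1-p).
        return norm * y ** (a - 1.0) * (1.0 - y) ** (b - 1.0)

    weights = []
    for i in range(1, m + 1):
        lo, hi = (i - 1) / m, i / m
        h = (hi - lo) / panels
        total = density(lo) + density(hi)
        for k in range(1, panels):
            total += (4.0 if k % 2 else 2.0) * density(lo + k * h)
        weights.append(total * h / 3.0)
    return weights


def jackknife_variance(sample, p):
    """V_p = ((n-1)/n) * sum_j (S_j - S_bar)^2, with S_j the leave-one-out estimate."""
    xs = sorted(sample)
    n = len(xs)
    w = simpson_weights(n - 1, p)  # w[k-1] holds W_{n-1,k}
    estimates = []
    for j in range(1, n + 1):      # remove order statistic j
        s_j = 0.0
        for i in range(1, n + 1):
            if i == j:
                continue
            k = i - (1 if i > j else 0)  # rank of X_(i) in the reduced sample
            s_j += w[k - 1] * xs[i - 1]
        estimates.append(s_j)
    s_bar = sum(estimates) / n
    return (n - 1) / n * sum((s - s_bar) ** 2 for s in estimates)
```

An approximate standard error for $Q_p$ is then `math.sqrt(jackknife_variance(data, p))`. The direct loop costs on the order of $n^2$ operations, which is small for the sample sizes considered here; the closed form through $\bar{S}$ and $A$ avoids forming each $S_j$ separately.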

4. EXACT EFFICIENCY OF $Q_p$ AND $T_p$ RELATIVE TO PARAMETRIC ESTIMATORS FOR THE NORMAL DISTRIBUTION

For the normal distribution, the uniform minimum variance unbiased estimator of $F^{-1}(p)$ is

$$X_p = \bar{X} + \Phi^{-1}(p)\, s / E(s),$$

where $\bar{X}$ and $s$ are the sample mean and standard deviation respectively, $\Phi(\cdot)$ is the normal distribution function, and

$$E(s) = \{2/(n-1)\}^{1/2}\, \Gamma(\tfrac12 n) / \Gamma\{\tfrac12 (n-1)\}.$$

For $2 \le n \le 20$, exact efficiencies of $Q_p$ and $T_p$ can be calculated using the moments of normal order statistics tabled by Sarhan & Greenberg (1962, p. 193). For $p = 0.1$ and $p = 0.5$, the efficiencies are shown in Fig. 2. We do not recommend the use of $Q_p$ for small $n$ and extreme $p$; however, the relative performance of $X_p$ is actually worse in this situation than for larger $n$. From Fig. 2 we see that the new estimator has much to offer over $T_p$, especially for extreme quantiles.

[Figure 2: panels (a) p = 0.1 and (b) p = 0.5; the horizontal axis runs from 0 to 20.]

Fig. 2. Exact mean squared error efficiency of new estimator, $Q_p$, and sample quantile, $T_p$, with respect to $X_p$ for the normal distribution with quantiles 0.1 and 0.5.

5. PERFORMANCE OF THE VARIANCE ESTIMATOR $V_p$

For the 1000 repeated samples in the simulations for each distribution and sample size, the sample variance of the simulated $Q_p$ estimator was calculated. This was compared to the sample mean of simulated $V_p$ estimates. The ratio of the mean $V_p$ to the simulated $V(Q_p)$ was used to estimate $E(V_p)/V(Q_p)$ to measure the bias of $V_p$. The results indicated that $E(V_p)$ was seldom different from $V(Q_p)$ by more than a factor of 1.15, even for the Cauchy-like distribution with $n > 16$. Thus confidence intervals for $Q_p$ can be readily constructed using the asymptotic normality of $Q_p$.

The authors are grateful to the referee for constructive comments that improved the clarity of the paper.

REFERENCES

DAVID, H. A. (1981). Order Statistics, 2nd edition. New York: Wiley.
KAIGH, W. D. & LACHENBRUCH, P. A. (1982). A generalized quantile estimator.
Comm. Statist. A 11. To appear.
LURIE, D. & HARTLEY, H. O. (1972). Machine-generation of order statistics for Monte Carlo computations. Am. Statistician 26, 26-7. Errata (1972) 26, 56-57.
MAJUMDAR, K. L. & BHATTACHARJEE, G. P. (1973). The incomplete beta integral (Algorithm AS 63). Appl. Statist. 22, 409-11.

MARITZ, J. S. & JARRETT, R. G. (1978). A note on estimating the variance of the sample median. J. Am. Statist. Assoc. 73, 194-6.
MILLER, R. (1974). The jackknife - a review. Biometrika 61, 1-15.
RAMBERG, J. S., DUDEWICZ, E. J., TADIKAMALLA, P. R. & MYKYTKA, E. F. (1979). A probability distribution and its uses in fitting data. Technometrics 21, 201-14.
SARHAN, A. E. & GREENBERG, B. G. (Eds). (1962). Contributions to Order Statistics. New York: Wiley.
WHITTLESEY, J. R. B. (1968). A comparison of the correlational behavior of random number generators for the IBM 360. Comm. Assoc. Comp. Mach. 11, 641-4.

[Received October 1980. Revised February 1982]