UNIMODAL KERNEL DENSITY ESTIMATION BY DATA SHARPENING


Statistica Sinica 15 (2005)

UNIMODAL KERNEL DENSITY ESTIMATION BY DATA SHARPENING

Peter Hall 1 and Kee-Hoon Kang 1,2
1 Australian National University and 2 Hankuk University of Foreign Studies

Abstract: We discuss a robust data sharpening method for rendering a standard kernel estimator, with a given bandwidth, unimodal. It has theoretical and numerical properties of the type that one would like such a technique to enjoy. In particular, we show theoretically that, with probability converging to 1 as sample size diverges, our technique alters the kernel estimator only in places where the latter has spurious bumps, and is identical to the kernel estimator in places where that estimator is monotone in the correct direction. Moreover, it automatically splices together, in a smooth and seamless way, those parts of the estimator that it leaves unchanged and those that it adjusts. Provided the true density is unimodal, our estimator generally reduces the mean integrated squared error of the standard kernel estimator.

Key words and phrases: Bandwidth, data sharpening, heavy tailed distributions, kernel methods, mean squared error, nonparametric density estimation, order constraint, tilting methods, unimodal density.

1. Introduction

Motivation for the assumption of unimodality usually has a Bayesian connection. It expresses a prior belief that the sampled distribution is homogeneous. Imposing it as a constraint, when constructing a nonparametric density estimator, is arguably the most effective way of incorporating knowledge of homogeneity into the final result, without sacrificing the particularly advantageous adaptivity conferred by nonparametric methods. The constraint of unimodality also makes the estimator particularly robust against undersmoothing, which, in the absence of a restriction on the number of modes, can seriously impair the qualitative appearance of the estimator.

There is an extensive literature on function estimation under shape constraints. Recently treated methods for unimodal density estimation include those considered by Wang (1995), Bickel and Fan (1996), Birgé (1997), Cheng, Gasser and Hall (1999) and Hall and Presnell (1999). A technique based on data sharpening was suggested by Braun and Hall (2001), but without any theoretical support and in the absence of clear guidance as to choice of distance function. In the

present paper we demonstrate, both theoretically and in numerical terms, that a version of data sharpening based on L_1 loss, or using closely related distance functions, performs particularly well. It produces smooth estimators with very good mean squared error performance, and with aesthetically attractive features which few other approaches enjoy.

When used for unimodal density estimation, data sharpening usually starts from a conventional kernel density estimator. It involves modifying the sample so as to ensure that the standard estimator, applied to the altered rather than the original sample, is unimodal. In particular, for a given bandwidth and kernel the data are moved the least amount subject to the kernel estimator having the unimodal property. Provided the kernel is a probability density, altering the sample does not affect the basic properties that make standard kernel methods so attractive, for example the fact that they are nonnegative and integrate to 1.

If the sampled density truly is unimodal then, at least for moderately large samples, we expect the standard kernel estimator to depart from unimodality only in the tails and in the vicinity of the true mode. These are the places where the true density is relatively flat. There, a standard kernel density estimator tends to suffer from spurious wiggles that prevent it from reflecting qualitative features of the sampled density. Hence, in principle it is necessary only to modify the estimator in such places. We shall show that these are exactly the places where the data sharpening algorithm adjusts the data, and that with high probability it does not alter any data that are not very close to the mode or some distance out in one or other of the tails. Likewise, it alters the standard kernel estimator itself only very close to the mode or out in the tails. And it automatically splices together the adjusted and original forms of the kernel estimator at points where they join. The final estimator is very smooth; it enjoys as many derivatives as the kernel that was used in its construction.

Of course, these goals could be achieved in other ways, using more explicit techniques. One such approach would be to draw horizontal lines at appropriate heights across unwanted valleys in the conventional kernel estimator, so as to fill them in; to incorporate a degree of monotone smoothing at places where the lines met the graph of the conventional estimator, so as to ensure the final estimator was smooth; and to renormalise the result, so that it integrated to 1. However, on account of the normalisation step this approach does not have the property that it equals the standard kernel estimator in places where the latter does not need adjustment. That can lead to significant performance difficulties, as we shall explain shortly.
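To make the explicit construction just described concrete, here is a minimal grid-based sketch of the "fill in the valleys and renormalise" idea (our own illustrative Python/NumPy code, not the paper's; the monotone smoothing and taper steps are omitted). Bridging each valley with a horizontal section is equivalent to taking running maxima towards the mode on each side.

    import numpy as np

    def fill_and_renormalise(fhat, grid, mode_idx):
        # Bridge unwanted valleys with flat (horizontal) sections by taking
        # running maxima towards the mode, then renormalise to integrate to 1.
        g = fhat.copy()
        g[:mode_idx + 1] = np.maximum.accumulate(g[:mode_idx + 1])          # nondecreasing left of mode
        g[mode_idx:] = np.maximum.accumulate(g[mode_idx:][::-1])[::-1]      # nonincreasing right of mode
        dx = grid[1] - grid[0]
        return g / (g.sum() * dx)

The flat bridges, and the global renormalisation in the last line, are precisely the sources of the artifacts and performance difficulties discussed here.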

Moreover, the fact that the explicitly linearised density estimator has perfectly flat sections means that it conveys the unwanted visual impression that there is something intrinsically special and interesting about those parts of the true density. This can be a particular problem in the tails of the linearised density estimator, where a graph of the estimator will often decrease to zero in an eye-catching sequence of flat steps with curved edges. These and other aspects of the explicitly linearised estimator are of course no more than artifacts of the technique employed to construct it, although that is usually far from obvious to the casual observer. Such issues might be unimportant if the use to which the final estimate was put was a purely quantitative one, not influenced by the qualitative appearance of a graph of the estimator, but that is often not the case in practice.

The majority of these difficulties can be removed by appropriately tuning the explicitly constructed estimator. However, the total number of subsidiary smooths and tapers that are required can be large, and developing a totally objective and effective procedure for implementing them produces an inelegant and tedious procedure, involving methodology that is aesthetically unattractive. Moreover, in the case of relatively heavy-tailed densities the linearisation step can add significantly to the probability mass of the density estimator. While this problem is eliminated by the normalisation step, the latter often creates difficulties of its own, by significantly increasing or reducing the height of the density estimator in the body of the distribution. That can impair performance, for example by seriously increasing mean squared error. A related phenomenon occurs in the case of unimodal density estimation using data tilting methods (Hall and Huang (2002)), where the need to remove spurious wiggles in the tails of a conventional density estimator can result in a detrimental increase in the density estimator at other places, leading to poor mean squared error performance. Numerical results illustrating this point will be summarised in Section 4, and related phenomena can be observed for unimodal density estimators constructed using linearisation and related techniques.

In principle, data sharpening can be used to ensure unimodality for other estimator types. However, the extreme simplicity of standard kernel estimators, and the fact that they can be readily used with a fixed bandwidth, make them especially attractive on both aesthetic and computational grounds. By way of contrast, local likelihood methods for density estimation, which usually require spatially varying bandwidths in order to be implemented effectively, are unattractive.

The distance, D say, through which the dataset has to be moved in order to ensure unimodality can be used to test the null hypothesis that the density is unimodal. The hypothesis would be rejected if D were too large. The null distribution could be estimated using bootstrap methods, by resampling from a unimodal distribution. One candidate for the latter would be the constrained density estimator itself, although other options are available.
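A minimal sketch of this bootstrap test is as follows (our own scaffolding; the callables sharpen and sample_from_constrained, and all names, are hypothetical stand-ins for the machinery developed in Sections 2 and 4).

    import numpy as np

    def unimodality_test(X, sharpen, sample_from_constrained, B=200, seed=None):
        # sharpen(X) returns (Y, D): sharpened data and the distance D moved.
        # sample_from_constrained(n, rng) resamples from the fitted unimodal estimate.
        rng = np.random.default_rng(seed)
        _, D_obs = sharpen(X)
        D_boot = np.empty(B)
        for b in range(B):
            Xb = sample_from_constrained(len(X), rng)
            _, D_boot[b] = sharpen(Xb)
        # Reject the null hypothesis of unimodality when D_obs is too large.
        p_value = (1 + np.sum(D_boot >= D_obs)) / (B + 1)
        return D_obs, p_value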

Multivariate applications of our method are also possible. Indeed, in the setting of two or more dimensions there are relatively few alternative approaches. Again, the distance through which the data have to be moved could be used as the basis for a test of unimodality.

Some early methods for density estimation under qualitative constraints, starting from Grenander's (1956) introduction of the technique that is now often referred to as nonparametric maximum likelihood, employed data tilting. See Prakasa Rao (1969) for an application to unimodal density estimation. Recent applications of related ideas include work of Bickel and Fan (1996) and Hall and Huang (2002) on unimodal density estimation. Fougères (1997) has shown that the method of monotone rearrangement, suggested by Hardy, Littlewood and Pólya (1952) and applied to a kernel density estimator, produces a consistent unimodal estimator when the true density is unimodal. Methods for estimating regression functions under qualitative constraints include those suggested by Friedman and Tibshirani (1984), Ramsay (1988), Kelly and Rice (1990), Qian (1994), Tantiyaswasdikul and Woodroofe (1994), Delecroix, Simioni and Thomas-Agnan (1995), Mammen and Thomas-Agnan (1999) and Hall and Huang (2001). Data sharpening methods were surveyed by Braun and Hall (2001).

The remainder of this paper is organised as follows. Section 2 introduces methodology. Main theoretical results are summarised in Section 3, and numerical properties are outlined in Section 4. Technical arguments for Section 3 are given in Section 5.

2. Methodology

A density f defined on the real line is said to be unimodal if there exists a point m_f, called a mode of f, such that f is increasing on (−∞, m_f) and decreasing on (m_f, ∞). (Here and below we use the terms "increasing" and "decreasing" to mean nondecreasing and nonincreasing, respectively.) Assuming f has a continuous derivative on the real line, we say that f is uniquely unimodal, or equivalently that it is unimodal with a unique mode, if there exists a unique point m_f in the interior of the support of f such that f′(m_f) = 0. Note that the definition of unique unimodality excludes densities with shoulders, as well as those with more than one turning point.

Let X = {X_1, ..., X_n} denote the original dataset, drawn from the population with density f. The conventional kernel estimator of f is

    f̂_X(x) = (nh)^{−1} Σ_{i=1}^n K{(x − X_i)/h},    (2.1)

where h is a bandwidth and K a kernel function. We wish to perturb the data as little as possible such that the constraint of unimodality is satisfied.
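In Python with NumPy, the unconstrained estimator (2.1) can be sketched as follows (our own code; the Gaussian kernel matches the choice made in Section 4).

    import numpy as np

    def kde(x, data, h):
        # Standard kernel estimator (2.1) with Gaussian kernel K and bandwidth h.
        x = np.asarray(x, dtype=float)
        data = np.asarray(data, dtype=float)
        u = (x[:, None] - data[None, :]) / h             # (x - X_i) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
        return K.sum(axis=1) / (len(data) * h)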

To this end, let Y = {Y_1, ..., Y_n} and f̂_Y be, respectively, a sharpened dataset derived from X, and the corresponding kernel density estimator obtained on replacing X by Y at (2.1). We take Y to be the minimiser of

    D(X, Y) = Σ_i Ψ(X_i − Y_i)

subject to f̂_Y being unimodal, where Ψ denotes a distance function. Examples of functions Ψ include Ψ(x) = |x|^p, to which we shall refer as L_p distance, and more general distance measures analogous to those used in the theory of M estimation, such as Ψ(x) = ∫_0^x ψ(y) dy, where ψ denotes an antisymmetric function that is positive on the positive half line.

Forcing f̂_Y to be unimodal means ensuring the existence of a quantity m̂ such that f̂′_Y(x) ≥ 0 for x ≤ m̂, and f̂′_Y(x) ≤ 0 for x ≥ m̂. In practical numerical terms, we first choose a candidate m for m̂, and then choose Y(m) to minimise Σ_i Ψ(X_i − Y_i) subject to f̂_Y being unimodal with mode at m. Finally, writing D(m) for the corresponding minimised value of Σ_i Ψ(X_i − Y_i), we select m̂ to minimise D(m). (In practice, we found that m̂ did not depend on the starting candidate.) Thus, our method produces an estimator of the mode as well as a unimodal density estimator. However, we shall not explore properties of m̂ here, except to note that, while its bias and error about the mean are of the same orders as those of the conventional mode estimator (defined as the point at which f̂_X attains its global maximum), our m̂ is not asymptotically normally distributed.

As we show in Section 4, excellent statistical performance, especially for heavy tailed distributions, is obtained very generally using distance measures such as L_1. In particular, for all the distributions with which we have experimented, this approach equals or outperforms methods based on L_p distance for p > 1. The excellent statistical properties of our methods based on L_1-type distances arise from the distance function being asymptotically linear, rather than increasing at a faster rate, for large values of its argument. This feature implies that only a relatively small penalty is imposed for moving an outlying data value a long distance to the main body of the sample, in order to ensure unimodality. In contrast, when using L_2 distance the algorithm tends to move data from the middle of the distribution to the tails, as well as moving them in the opposite direction; it shifts many data by small amounts, rather than a small number of data by a large amount, and this almost invariably impairs performance.

However, the fact that the distance function |x| has a discontinuous derivative at the origin means that when using L_1 distance one tends to experience numerical difficulties in 10% to 20% of samples, depending on distribution type, when attempting to minimise Σ_i |X_i − Y_i| subject to f̂_Y being unimodal. This problem can be overcome by smoothing the loss function at its vertex. One approach is to use a function such as Ψ = Ψ_tan, constructed with ψ(x) = arctan(x). This Ψ is asymptotically linear, and perfectly convex, but has a smooth bowl shape at the origin, so its statistical performance is close to that for L_1 distance, without the latter's numerical drawbacks.
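Since ψ(x) = arctan(x) integrates in closed form, Ψ_tan(x) = ∫_0^x arctan(y) dy = x arctan(x) − (1/2) log(1 + x²), the objective D(X, Y) is easy to evaluate. A minimal sketch (our own code and names, not the paper's implementation) is:

    import numpy as np

    def psi_tan(x):
        # Psi_tan(x) = x*arctan(x) - 0.5*log(1 + x^2): quadratic near 0,
        # asymptotically linear, convex, and smooth at the vertex.
        return x * np.arctan(x) - 0.5 * np.log1p(x ** 2)

    def sharpening_objective(X, Y, psi=psi_tan):
        # D(X, Y) = sum_i Psi(X_i - Y_i), to be minimised subject to the
        # kernel estimator built from Y being unimodal.
        return np.sum(psi(np.asarray(X) - np.asarray(Y)))

In practice this objective would be handed to a constrained optimiser, as discussed in Section 4.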

As in the setting of robust statistical methods, there are many opportunities for improving performance, for instance by using a non-convex loss function corresponding to a redescending weight. An example is the function Ψ based on ψ(x) = sin(x/a) I(|x| < πa). (Here, the parameter a would have to be chosen.) For discussion of this and many other opportunities for choosing ψ see, for example, Andrews et al. (1972).

In numerical work in Section 4 we compare the above data sharpening methods with a data tilting approach suggested by Hall and Huang (2002), based on ideas discussed by Hall and Presnell (1999). To construct a unimodal density estimator by tilting the empirical distribution, we change the data weights instead of altering the data themselves. Accordingly, the kernel density estimator becomes

    f̂(x | p) = Σ_{i=1}^n p_i h^{−1} K{(x − X_i)/h},

where p_1, ..., p_n are chosen to minimise a measure D(p) of the distance between the multinomial distribution p = (p_1, ..., p_n) on n points and the corresponding uniform distribution, subject to f̂(x | p) being unimodal. The resulting constrained density estimator will be denoted by f̂_tilt. In Section 4 we take D_1(p) = Σ_{i=1}^n p_i log(np_i), this being a particular form of power divergence (Cressie and Read (1984)). It gives relatively good performance in the context of unimodal density estimation, largely because it is robust against aberrations caused by reducing one or more weights p_i to 0. That operation is frequently necessary in order to eliminate outlying data that cause spurious bumps in the tails of unconstrained kernel density estimators. In comparison, the more conventional power divergence measure, D_0(p) = −Σ_i log(np_i), which is infinite when one or more values of p_i vanish, often leads to undefinable density estimators if the imposed constraint is unimodality. Nevertheless, even when data tilting uses D_1 it is outperformed by data sharpening based on Ψ_tan.
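A minimal sketch of the tilted estimator and of D_1 (our own code; p is assumed to be a vector of nonnegative weights summing to 1):

    import numpy as np

    def tilted_kde(x, data, p, h):
        # f-hat(x | p) = sum_i p_i h^{-1} K{(x - X_i)/h}, Gaussian kernel.
        u = (np.asarray(x, dtype=float)[:, None] - np.asarray(data)[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
        return K @ np.asarray(p) / h

    def D1(p):
        # D_1(p) = sum_i p_i log(n p_i); each term vanishes as p_i -> 0,
        # so zero weights are not penalised infinitely (unlike D_0).
        p = np.asarray(p, dtype=float)
        n = len(p)
        pos = p > 0
        return np.sum(p[pos] * np.log(n * p[pos]))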

3. Theoretical Properties

There is a variety of approaches to establishing theory for data sharpening unimodal density estimators, many of them tailored to specific distance measures D. In most instances the theory shows that in the case of compactly supported kernels, with high probability and away from the mode and the tails of the sampled distribution, the sharpened estimator f̂_Y is identical to its conventional counterpart. For brevity and simplicity we shall describe only one approach here, developed for the case of L_1 distance; see (3.1) below. It has the advantage that, under a mild and straightforward condition, (3.3), the algorithm based on minimising L_1 distance guarantees not only that f̂_Y = f̂_X along most of the support of f, but also that specific convergence rates are achieved uniformly on the line. See Theorem 3.1. Alternative approaches, based on more general loss functions, do not appear to allow such a simple and elegant description of the properties of f̂_Y. All our results extend easily to the case of estimating f under the constraint that it has k ≥ 1 modes, assuming this condition is incorporated into the basic algorithm and that the modes of f̂_Y are required to be at least a certain fixed, sufficiently small distance apart.

We initially assume the kernel K is a symmetric, compactly supported, uniquely unimodal probability density with two bounded derivatives. Call this condition (C_{K,1}). Theorems below hold without change if K is taken to be the Gaussian kernel, although the proofs are more elaborate in that case. However, if K is not compactly supported then it is not generally true that f̂_Y = f̂_X, with high probability, for most of the support of f.

Given a data sharpening algorithm A that takes the original dataset X = {X_1, ..., X_n} to Y = {Y_1, ..., Y_n}, where Y_i denotes the image of X_i, we define

    D(X, Y) = Σ_{i=1}^n |X_i − Y_i|.    (3.1)

We apply subscripts to A and Y to denote specific algorithms and the sharpened datasets that they produce, respectively. In particular, let A_0 denote the version of A that selects Y = Y_0 to minimise D(X, Y) subject to f̂_Y being unimodal.

Let F_0 be the class of unimodal densities f with the properties: the support of f is an interval with endpoints a_f < b_f (not necessarily finite), the mode m_f of f is unique and lies in (a_f, b_f), f has a bounded, uniformly continuous derivative on (−∞, ∞), and f′ is bounded away from 0 on the set S(c) = [c_1, c_2] ∪ [c_3, c_4] whenever c = (c_1, ..., c_4) satisfies

    a_f < c_1 < c_2 < m_f < c_3 < c_4 < b_f.    (3.2)

Put h_0 = n^{−1/5}. The notation h ≍ h_0 means that the ratio h/h_0 is bounded away from zero and infinity as n → ∞.

Theorem 3.1. Let f ∈ F_0, assume D is given by (3.1), that (C_{K,1}) holds and that h ≍ h_0 as n → ∞. Suppose too that there exists a particular data sharpening algorithm A_1, not necessarily A_0, which with probability 1 produces a unimodal density estimator f̂_Y, and which moves X a distance D(X, Y_1) to Y_1, where

    h_0^2 D(X, Y_1) → 0    (3.3)

with probability 1 as n → ∞. Then, when A_0 is used instead of A_1 and c satisfies (3.2),

    P_f{Y_i = X_i for each i such that either X_i or Y_i is in S(c)} → 1,    (3.4)

and

    sup_{−∞<x<∞} |f̂_Y(x) − f̂_X(x)| = o_p(h_0),   sup_{−∞<x<∞} |f̂′_Y(x) − f̂′_X(x)| = o_p(1).    (3.5)

The algorithm A_0 is not necessarily uniquely defined, and results such as (3.4) should be interpreted as holding uniformly in all versions of A_0. That is, the probability that there is a version of A_0 for which Y_i ≠ X_i for some i such that either X_i or Y_i is in S(c) converges to zero as n → ∞. Non-uniqueness of A_0 can occur with positive probability when n = 2, although we conjecture that when (C_{K,1}) is satisfied and n ≥ 3, A_0 is uniquely defined with probability 1.

We may paraphrase (3.4) by saying that, with probability converging to 1, A_0 leaves unaltered each data value X_i that is neither very close to the true mode nor very far out in the tails. Since K is compactly supported, and (3.4) holding for Y_1 implies the same for Y_0, then if Y is produced by either A_0 or A_1 we have, for each vector c such that (3.2) holds,

    P_f{f̂_X(x) = f̂_Y(x) for all x ∈ S(c)} → 1    (3.6)

as n → ∞. Therefore, away from the mode and the tails, f̂_Y and its derivatives have all the weak convergence properties of f̂_X and the latter's derivatives. Hence, in terms of integrated squared error, we cannot expect improved performance of f̂_Y over f̂_X in those regions.

Note that we have assumed only one derivative of f. This explains the somewhat slow convergence rate at (3.5). However, (3.6) implies that if Y = Y_0 is produced by A_0, and if f has two bounded derivatives, then (3.5) has a more conventional form away from the mode and the tails, as follows. Using the data sharpening algorithm A_0, we have for each vector c such that (3.2) holds, f̂_Y(x) − f(x) = O_p(h_0^2) and f̂′_Y(x) − f′(x) = O_p(h_0) whenever x ∈ S(c), and

    sup_{x∈S(c)} |f̂_Y(x) − f(x)| = O_p(h_0^2 ℓ^{1/2}),   sup_{x∈S(c)} |f̂′_Y(x) − f′(x)| = O_p(h_0 ℓ^{1/2}),    (3.7)

where ℓ = log n. In fact, under mild additional conditions these results also extend to neighbourhoods of the mode, as we show following Theorem 3.2. It is only in the extreme tails, beyond locations that themselves move further out into each tail as n increases, that f̂_Y and f̂′_Y may not enjoy the convergence rates of the conventional estimators f̂_X and f̂′_X, respectively.

In principle, (3.4) does not exclude the possibility of cross mappings, where a datum X_i in one tail is mapped to Y_i in the opposite tail or in a close neighbourhood of the mode, or where X_i near the mode is mapped to Y_i in a tail. However, it is easy to see that the probability that this occurs tends to 0 as n → ∞. Indeed, if a cross mapping were to arise then with probability converging to 1 we could produce a lesser value of D(X, Y_0) by instead moving X_i to a

point in S(c) and then moving another point in S(c) to the original image of X_i. However, for any given c satisfying (3.2), result (3.4) states that the probability of this occurring converges to 0.

Result (3.4) implies that when D(X, Y) is defined by (3.1), and (3.3) holds, the algorithm A_0 is in effect constructed quite separately near the mode and in either tail. In the next two theorems we note that it is possible to find algorithms for the mode and the tails, respectively, such that f̂_Y is unimodal and (3.3) holds. Following Theorem 3.3 we describe how to combine these two algorithms into a single algorithm that satisfies (3.3).

Assume f ∈ F_0 has two bounded derivatives in a neighbourhood of the mode, f″ is continuous at the mode, and f″(m_f) < 0. Call this condition (C_f). Suppose too that K is a compactly supported, symmetric, uniquely unimodal probability density with three bounded derivatives; denote this constraint by (C_{K,2}).

Theorem 3.2. Assume (C_f) and (C_{K,2}) hold, and h ≍ h_0. Then there exists a data sharpening algorithm A_1 which, when applied to data on (m_f − ε, m_f + ε) for ε > 0 sufficiently small, ensures that f̂_Y is unimodal on (m_f − ε, m_f + ε) and also guarantees that for each 0 < δ < ε, as n → ∞,

    P_f{Y_i = X_i for each i such that either X_i or Y_i is in (m_f − ε, m_f − δ) ∪ (m_f + δ, m_f + ε)} → 1

and

    Σ_{i: |X_i − m_f| ≤ ε} |X_i − Y_i| = O_p(h_0^{−1}).    (3.8)

Result (3.8) implies a rate of convergence in the neighbourhood of the mode. Indeed, suppose we use the algorithm A_0, instead of A_1, to sharpen data. Assume too that the conditions of Theorems 3.1 and 3.2 hold. Then it follows from (3.4), (3.8) and the fact that the probability of cross-mappings converges to 0, that for any ε > 0, (3.8) holds when the Y_i's are produced by A_0 rather than A_1. Now, a Taylor expansion gives

    |f̂_X(x) − f̂_Y(x)| ≤ (sup |K′|) (nh^2)^{−1} Σ_{i: |X_i − m_f| ≤ ε} |X_i − Y_i|,

uniformly in x satisfying |x − m_f| ≤ δ, for all sufficiently large n. Therefore, by (3.8), |f̂_X − f̂_Y| = O_p(h_0^2 ℓ^{1/2}) uniformly in a neighbourhood of the mode. It follows from this result, and the fact that |f̂_X − f| = O_p(h_0^2 ℓ^{1/2}) uniformly in a neighbourhood of the mode, that |f̂_Y − f| = O_p(h_0^2 ℓ^{1/2}) there. The latter property enables us to replace the supremum over x ∈ S(c), in the first result at (3.7), by the supremum over x ∈ [c_1, c_4], where c_1 and c_4 satisfy (3.2). Similarly it may be proved that an identical change can be made to the second result at (3.7). The particular construction of A_1 that we give in the proof of Theorem 3.2 actually

ensures that |f̂_Y − f| = O_p(h_0^2) in an O_p(h_0)-neighbourhood of m_f, and f̂_X = f̂_Y outside that neighbourhood and away from the extreme tails.

Finally, we discuss a data sharpening algorithm that produces monotone tails and at the same time ensures (3.3). We consider the case of tails that decrease only polynomially fast, of which those of Student's t density are an example. Densities with tails that decrease exponentially quickly are more easily treated. It suffices to consider data sharpening in the right hand tail.

Let α > 1 be given and assume K is a symmetric, compactly supported, uniquely unimodal density with ν_{α,K} bounded derivatives, that f ∈ F_0 has ν_{α,f} derivatives on (C, ∞) for some C > 0, and that |f^{(j)}(x)| ≍ x^{−α−j} as x → ∞ for 0 ≤ j ≤ ν_{α,f}, where ν_{α,f}, ν_{α,K} ≥ 2. Call this condition (C_α). The values of ν_{α,K} and ν_{α,f} depend only on α; details are given by (5.14) in the proof of Theorem 3.3.

Theorem 3.3. If (C_α) holds for α > 8, and h ≍ h_0, then there exists a data sharpening algorithm A_1 which, when applied to data on (C, ∞) for C > 0 sufficiently large, ensures that f̂_Y is decreasing on (C, ∞) and also guarantees that for all C′ > C,

    P_f{Y_i = X_i for each i such that either X_i or Y_i is in (C, C′)} → 1

and

    h_0^2 Σ_{i: X_i > C′} |X_i − Y_i| → 0

in probability as n → ∞.

The constraint α > 8 can be weakened by using a longer argument. Moreover, our numerical work will focus particular attention on performance of data sharpening when α is as small as 3/2.

Assume f ∈ F_0, and suppose in addition that the conditions of Theorems 3.2 and 3.3 apply to f and K, with the conditions of Theorem 3.3 holding in both tails, possibly for different values of α. Theorems 3.2 and 3.3 imply that we may construct a single data sharpening algorithm A_1 that produces an estimator f̂_Y which is unimodal, and such that (3.3) and (3.4) both hold. Indeed, (3.4) itself enables us to splice together the separate data sharpening algorithms for the tails, and that for the mode, into a single algorithm that applies to the whole dataset, in the knowledge that unimodality of f̂_Y will take care of itself at all places that are intermediate between either tail and the mode. This is guaranteed by the fact that f̂_Y = f̂_X in such places; since f′ does not vanish there, f̂_X is monotone at the intermediate places, with high probability. Of course, once (3.3) has been proved for A_1 then we know from Theorem 3.1 that both (3.3) and (3.4) hold for A_0.

4. Numerical Properties

In the work summarised here we drew datasets of size n = 25 from the standard Normal distribution (distribution 1), Student's t distribution with three

degrees of freedom (distribution 2), the Normal mixture

    0.2 N(0, 1) + 0.2 N(1/2, (2/3)^2) + 0.6 N(13/12, (5/9)^2)

(distribution 3), and the Normal mixture

    0.35 N(−1, (3/5)^2) + … N(1, (5/2)^2) + … N(5, (3/2)^2)

(distribution 4). All are unimodal, and they are depicted in Figure 1. Distributions 1 and 2 were chosen because they represent opposite extremes in terms of tail weight, distribution 3 represents moderately skewed densities (it is density #2 of Marron and Wand (1992)), while distribution 4 is highly skewed with a long and relatively flat part, and therefore presents particular challenges to methods for unimodal density estimation. Nevertheless, out of the four densities the second is arguably the most difficult for which to construct unimodal estimators. Standard kernel estimators, computed using data from the second distribution, usually have quite a few spurious bumps in the tails. Ideally, these should be removed without increasing mean integrated squared error. We therefore devote much of our attention to the relative performance of methods applied to distribution 2.

Figure 1. Four true densities. Density 1 is standard normal, density 2 is Student's t with three degrees of freedom, density 3 is moderately skewed, and density 4 is highly skewed with a long, relatively flat portion.
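For reference, distribution 3 can be simulated as in the following sketch (our own helper, not part of the paper; the weights and components are those of Marron and Wand's density #2).

    import numpy as np

    def sample_dist3(n, seed=None):
        # 0.2 N(0,1) + 0.2 N(1/2, (2/3)^2) + 0.6 N(13/12, (5/9)^2)
        rng = np.random.default_rng(seed)
        comp = rng.choice(3, size=n, p=[0.2, 0.2, 0.6])
        means = np.array([0.0, 0.5, 13.0 / 12.0])
        sds = np.array([1.0, 2.0 / 3.0, 5.0 / 9.0])
        return rng.normal(means[comp], sds[comp])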

For the results presented here the Gaussian kernel was used throughout, since it is a favourite of statisticians in problems involving mode analysis. However, very similar results were obtained using, for example, the biweight kernel. Indeed, the Gaussian kernel is so light tailed that, even though it is not compactly supported, it produces density estimates which visually satisfy the properties reported in Section 3. That is, except on very fine scales, the constrained density estimates are usually visually indistinguishable, in places away from the modes and the tails, from their conventional, unconstrained counterparts. We return to this point in our discussion of Figure 4 below.

We discuss results for estimators constrained by data sharpening using either L_2 distance or the distance Ψ_tan. Similar results are obtained using L_p rather than L_2 distance, for p > 1. As expected, the efficiency advantages of Ψ_tan distance decrease as p decreases to 1, although Ψ_tan retains its competitive edge. For comparison we also give results for the standard, unconstrained kernel estimator, and for the estimator f̂_tilt obtained by data tilting, this time using the distance measure D_1. See Section 2 for details of the construction of f̂_tilt. A comparison with the kernel rearrangement method of Fougères (1997) will be made later in this section.

When constructing f̂_Y the constrained optimisation step was usually implemented using the NAG routine E04UCF. This routine failed only rarely (e.g., in fewer than one per cent of simulations in the Normal case), and it can be protected against this problem in those instances where it fails. An alternative is to use an easily-coded simulated annealing algorithm there, similar to that described by Braun and Hall ((2001), Section 3.2), using the penalty

    p(y) = Σ_{i: ξ_i < m} |f̂′(ξ_i)| I{f̂′(ξ_i) < 0} + Σ_{i: ξ_i > m} |f̂′(ξ_i)| I{f̂′(ξ_i) > 0},

where ξ_1, ..., ξ_ν denote grid points on which the constraints were imposed, and m is a candidate for the mode. In either case, the algorithm can be speeded up by moving outlying data a little closer to the body of the distribution, right at the start. This reflects the theoretical results discussed in Section 3. The simulations reported below used a small subroutine of this type.
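The penalty itself can be sketched as follows, for a Gaussian-kernel estimate on a constraint grid (our own code; the grid, bandwidth and mode candidate are inputs that the annealing loop would supply).

    import numpy as np

    def unimodality_penalty(y, grid, m, h):
        # p(y): accumulate |f-hat'(xi_i)| wherever the derivative of the KDE
        # built from sharpened data y has the wrong sign for mode candidate m.
        y = np.asarray(y, dtype=float)
        grid = np.asarray(grid, dtype=float)
        u = (grid[:, None] - y[None, :]) / h
        # Derivative of the Gaussian-kernel KDE: (n h^2)^{-1} sum_i -u phi(u).
        fprime = (-u * np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(y) * h ** 2)
        left = (grid < m) & (fprime < 0)    # should be increasing left of m
        right = (grid > m) & (fprime > 0)   # should be decreasing right of m
        return np.abs(fprime[left]).sum() + np.abs(fprime[right]).sum()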

When the estimators are applied to data from the standard normal distribution, it is found that a graph of MISE for the data sharpening estimator, using distance measure Ψ_tan, lies below that for the other three estimators across all values of h. However, although this estimator has a slight advantage over its unconstrained kernel counterpart, at the optimal bandwidth it offers only negligible improvements relative to the other two unimodal estimators. Nevertheless, the data tilting estimator, when used with a small suboptimal bandwidth, performs quite poorly relative to both of its data sharpening competitors. Analogous results are also obtained for other very light-tailed distributions. As we see below, the data sharpening estimator based on Ψ_tan really comes into its own when applied to distributions that have heavier tails than the normal.

In particular, Figure 2 plots the logarithm of MISE against the logarithm of bandwidth for all four estimators when the sampled distribution is the heavy-tailed Student's t with three degrees of freedom. The performance advantages of data sharpening using distance measure Ψ_tan are obvious. Note too that, in terms of optimal MISE performance, the unconstrained kernel method lies between the two data sharpening approaches.

Figure 2. Plots of MISE for distribution 2. The vertical axis gives the logarithm of MISE, and the horizontal axis gives the logarithm of bandwidth. Solid, bold-dashed, dotted and dashed lines correspond to the unconstrained estimate, data sharpening estimates constrained using distances Ψ_tan and L_2, and the constrained estimate using data tilting, respectively.

Figure 2 also shows that the optimal bandwidth for data sharpening, using Ψ_tan, is smaller than that for the standard kernel estimator. This reflects the fact that this form of data sharpening produces an estimator with smaller integrated variance, and slightly larger integrated squared bias. The optimal tradeoff between the two therefore occurs at a smaller bandwidth. The main contribution to reduced integrated variance comes from the tails of the distribution, where data sharpening reduces the considerable stochastic variability of the standard kernel estimator by removing all the bumps there. Variance actually increases in the middle of the distribution; see Figure 3. However, the latter property is not characteristic of data sharpening, which often reduces variance in the vicinity of the mode. In particular this is the case for distributions 1 and 3.

Figure 3, which gives pointwise mean squared error, squared bias and variance in the case of data from distribution 2, shows that in this instance data

sharpening tends to reduce bias towards the centre of the distribution. This is also observed for distribution 1, but for distributions 3 and 4 bias slightly increases at the centre.

Figure 3. Plots of pointwise mean squared error, squared bias and variance for distribution 2. Line types are as for Figure 2. In each case the bandwidth was chosen to be that which minimises MISE, and the vertical axis is marked in units of 10^{−5}.

Figure 2 revealed that, while data tilting outperformed L_2 data sharpening for the second distribution, it was not competitive with the unconstrained kernel estimator in MISE terms. Figure 3 indicates why. The only way tilting can remove spurious modes in the tails of the standard kernel estimator is to give outlying data zero weight. This forces relatively high weights to be used near the centre of the distribution, with a correspondingly large increase in bias, which is clearly evident from the second panel of Figure 3. The data tilting estimator commonly suffers from this difficulty for distributions with one or more relatively heavy tails.

Sampling from the third distribution, which has the skewed density shown in the third panel of Figure 1, one again finds that the data sharpening estimator based on distance measure Ψ_tan has the lowest minimum MISE. For this distribution

the data tilting method has the highest MISE of all four estimators. This is again a result of its relatively large bias towards the centre of the distribution, which occurs for the reasons noted in the previous paragraph; in the case of the third distribution the left hand tail is relatively heavy.

Performance advantages of data sharpening are even more clearly evident for the fourth distribution, where the MISE curve for the estimator based on Ψ_tan lies below that for each of the other two estimators over much of its range on either side of its minimum. On this occasion, since the right hand tail of the distribution is relatively heavy, the technique based on data tilting again performs poorly. However, in MISE terms L_2 data sharpening produces an estimator which performs almost as well as when sharpening is based on Ψ_tan, and which performs better than the unconstrained kernel estimator.

Figure 4. Kernel estimates, and results of data sharpening, for a dataset from the second distribution. Line types are as in Figure 2. The bold solid line represents the true density. In the lower panel the top and bottom graphs show how data are moved in the cases of data sharpening based on L_2 and Ψ_tan distance, respectively.

Figure 4 shows individual curve estimates, and the manner in which data are sharpened, in the case of a dataset drawn from the second distribution. The dataset chosen was that for which ISE was at its 75th percentile, when the bandwidth was chosen to be optimal for minimising MISE of the unconstrained kernel

estimator. This particular dataset produces a markedly asymmetric density estimate regardless of estimator type. There are several outliers in the left hand tail, with the result that the unconstrained estimate has two spurious modes there. However, data in the right hand tail are distributed in a manner which suggests properties of a light tailed distribution, rather than the heavy tailed distribution from which they came. As a result, the standard kernel estimate has no unwanted bumps in the right hand tail. Neither does it have any spurious modes in the neighbourhood of the point at which the global maximum is achieved. Therefore, except for the left-hand tail, the estimate based on Ψ_tan coincides almost exactly with its unconstrained counterpart. In the left-hand tail it produces a better fit than data tilting.

The lower panel of Figure 4 shows how data are sharpened to produce a unimodal estimate, when the distance function is either L_2 or Ψ_tan. In each instance the data in the right hand tail are virtually unaltered, and in the case of Ψ_tan distance, excepting the two most extreme values, data are altered in only minor ways in the left hand tail. (If the kernel used is compactly supported, for example the biweight, there are no changes at all except for the four data furthest to the left.) However, when distance is measured in L_2 terms the changes to data in the left hand tail are much more widespread.

We also examined the effect of data-driven bandwidth selection on the performance of our data-sharpened estimators. In particular, for each sample of size n = 25 we chose the bandwidth using the method suggested by Sheather and Jones (1991), and compared the unconstrained estimates with those constructed by sharpening based on Ψ_tan distance. Mean integrated squared error was approximated by averaging integrated squared error over 200 simulations. Data sharpening reduced MISE by 13%, 18%, 12% and 8% in the cases of distributions 1-4, respectively. Of course, the main source of improvement comes from the tails and from the vicinity of the mode. As one would expect, given that asymptotic performance of the data-sharpened estimator becomes increasingly like that of its unconstrained counterpart as n increases (see Section 3), the margin of MISE improvement offered by data sharpening narrows for large n.
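The Monte Carlo comparison just described can be organised along the following lines (our own scaffolding; estimate stands for any of the constrained or unconstrained estimators above, and a bandwidth selector such as Sheather-Jones would be applied inside it).

    import numpy as np

    def approx_mise(estimate, sample, f_true, n=25, nsim=200,
                    lo=-10.0, hi=10.0, ngrid=401, seed=None):
        # Approximate MISE by averaging integrated squared error over nsim samples.
        rng = np.random.default_rng(seed)
        grid = np.linspace(lo, hi, ngrid)
        dx = grid[1] - grid[0]
        ise = np.empty(nsim)
        for s in range(nsim):
            X = sample(n, rng)
            fhat = estimate(grid, X)                 # estimator values on the grid
            ise[s] = np.sum((fhat - f_true(grid)) ** 2) * dx
        return ise.mean()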

Figure 5 compares our method with that of Fougères (1997) for distribution 2 (Student's t with three degrees of freedom). Here, and for distribution 4, Fougères' approach has a marginal advantage. However, the method generally gives a relatively rough estimate, especially for sample sizes larger than that (n = 25) used here. A clearer idea of the roughness problem, for larger sample sizes, is obtainable from the figures of Fougères (1997). For distribution 1, Fougères' approach gives the largest minimum mean squared error of all shape-constrained methods studied; in the case of distribution 3, it ranks between the data tilting method and data sharpening, and in particular is behind the latter.

Figure 5. Comparison of Fougères' method with data sharpening using distance Ψ_tan. In the context of Figure 2 (i.e., for distribution 2) the upper panel graphs the logarithm of MISE. Kernel estimates, for the same sample as in Figure 4, are shown in the lower panel. In both panels the dotted line corresponds to Fougères' estimate, while the bold-dashed line shows the result for the data-sharpened estimate. In the upper panel the (light) solid line shows results for the unconstrained estimate, while in the lower panel the (bold) solid line depicts the true density.

Finally, we applied the sharpening approach based on Ψ_tan distance, and data tilting, to the Buffalo snowfall dataset, which is frequently used to explore the effect of bandwidth choice on density estimate shape; see, e.g., Silverman (1986, p.45) and Scott (1992, p.137). The dataset consists of the annual snowfalls, measured in inches, at Buffalo, New York, from 1910 to 1972. Figure 6 illustrates the unconstrained kernel estimate, and the results of data sharpening and data tilting, applied to these data using the bandwidth h = 6. Silverman (1986) discussed the use of this bandwidth for these data, noting that it gave a trimodal estimate when applied to a standard kernel estimator. He observed that doubling the size of the bandwidth gave a unimodal estimator in the unconstrained case.

Figure 6. Unconstrained kernel estimates, and results of data sharpening and data tilting, for Buffalo snowfall data. The upper panel graphs kernel estimates whose line types are as in Figure 2. The middle panel indicates how data are moved in the case of data sharpening based on Ψ_tan distance. Data weights for the tilting approach are shown in the lower panel.

5. Technical Arguments

5.1. Proof of Theorem 3.1

Consider a general data sharpening algorithm in which X is moved a distance D(X, Y) to Y. By a Taylor expansion,

    |f̂′_Y(x) − f̂′_X(x)| = (nh^2)^{−1} |Σ_{i=1}^n [K′{(x − X_i + X_i − Y_i)/h} − K′{(x − X_i)/h}]| ≤ (sup |K″|) n^{−1} h^{−3} D(X, Y).    (5.1)

Given a particular algorithm A_j, let Y_j denote the corresponding sharpened dataset, and let f̂_{Y_j} be the version of f̂_Y produced from Y_j. By assumption, we can construct A_1 such that f̂_{Y_1} is unimodal and h_0^2 D(X, Y_1) → 0 in probability

as n → ∞. Hence, by the definition of A_0, f̂_{Y_0} is unimodal and

    h_0^2 D(X, Y_0) → 0    (5.2)

in probability. Let c = (c_1, ..., c_4) and c′ = (c′_1, ..., c′_4) be vectors such that a_f < c′_1 < c_1 < c_2 < c′_2 < m_f < c′_3 < c_3 < c_4 < c′_4 < b_f. Then both c and c′ satisfy (3.2). Consider the algorithm A_2 constructed from A_0 by putting Y_i = X_i for X_i ∈ S(c) and taking Y_i to be defined by A_0 otherwise. Then

    D(X, Y_2) ≤ D(X, Y_0).    (5.3)

It is readily proved that for f ∈ F_0,

    sup_{−∞<x<∞} |f̂′_X(x) − f′(x)| = o_p(1).    (5.4)

Since h ≍ h_0 and K is compactly supported, then for all sufficiently large n, f̂_{Y_2} = f̂_{Y_0} outside S(c′); call this result (R_1). Properties (5.1), employed in the case Y = Y_2, (5.3) and (5.4), and the fact that h_0^2 D(X, Y_0) → 0 in probability, imply that sup |f̂′_{Y_2} − f′| → 0 in probability. It follows from this result, and the fact that f′ is bounded away from 0 on an open set containing S(c′), that with probability converging to 1 as n → ∞, f̂′_{Y_2} is positive on [c′_1, c′_2] and negative on [c′_3, c′_4]. In conjunction with (R_1) this implies that, with probability tending to 1, f̂_{Y_2} is unimodal on the whole real line; call this result (R_2).

It follows from (R_2) that if the probability that the inequality (5.3) is strict does not converge to 0, then with strictly positive probability, for arbitrarily large n, the algorithm A_2 produces a unimodal density estimator and at the same time strictly reduces the distance D(X, Y_0). However, by definition of A_0 this is impossible. Therefore, with probability converging to 1, the inequality at (5.3) is actually an identity. It follows from this result and the definition of A_2 that the probability that A_0 fixes X_i, for each i such that X_i ∈ S(c), converges to 1 as n → ∞. This is equivalent to

    P_f{Y_i = X_i for each i such that X_i ∈ S(c)} → 1,

which is a weaker form of (3.4). To obtain the full force of (3.4), observe that if A_0 involves mapping X_i ∉ S(c) to Y_i ∈ S(c), then with probability converging to 1 we may produce a lesser value of D(X, Y) by instead mapping X_i to a point in S(c′)\S(c), while retaining unimodality of f̂_Y. However, this contradicts the definition of A_0 as an algorithm that minimises D(X, Y), and so the probability that X_i ∉ S(c) is mapped to Y_i ∈ S(c) must converge to 0.

The second part of (3.5) follows from the three conditions (5.1), employed in the case Y = Y_0, (5.2) and (5.4). The first part of (3.5) may be derived similarly.

5.2. Proof of Theorem 3.2

Write Y_i = X_i − h^3 δ_i and observe that, by a Taylor expansion,

    f̂′_Y(x) = f̂′_X(x) + n^{−1} Σ_i δ_i K″{(x − X_i)/h} + O_p(n^{−1} h^2 Σ_i δ_i^2),

where, here and at (5.5) below, the remainder is of the stated form uniformly in x. Applying Theorem 3 of Komlós, Major and Tusnády (1975) we deduce that

    f̂′_X(x) = E{f̂′_X(x)} + (nh^3)^{−1/2} U(x) + O_p{(nh^2)^{−1} log n},    (5.5)

where U(x) = h^{−1/2} ∫ W_0{F(x − hy)} K″(y) dy, W_0 is a standard Brownian bridge whose construction relative to the data depends on n, and F denotes the distribution function for which f is the density. Therefore, noting that h ≍ h_0, and assuming for the present that

    Σ_i δ_i^2 = O_p(nh),    (5.6)

we deduce that

    f̂′_Y(x) = E{f̂′_X(x)} + (nh^3)^{−1/2} U(x) + n^{−1} Σ_i δ_i K″{(x − X_i)/h} + O_p(h_0^3 log n)    (5.7)

uniformly in x. Without loss of generality the true mode is 0. Let L be a bounded, compactly supported function with a continuous derivative, put δ_i = L(X_i/h), and define t = x/h and a(t) = ∫ K″(u) L(t + u) du. Then (5.6) holds and

    n^{−1} Σ_i δ_i K″{(x − X_i)/h} = h a(t) f(x) + o_p(h_0)

uniformly in x. Furthermore, (3.8) follows from the construction of the data sharpening algorithm. It remains to show that L can be chosen such that f̂_Y is unimodal. To this end, observe that E{f̂′_X(x)} = f′(x) + o(h) = h t f″(0) + o(h) and f(x) = f(0) + o(1), uniformly in x such that |x| ≤ C_1 h, for any C_1 > 0. Combining (5.7) and the results in this paragraph we see that

    f̂′_Y(x) = h{t f″(0) + a(t) f(0)} + (nh^3)^{−1/2} U(x) + o_p(h_0),   uniformly in |x| ≤ C_1 h;
    f̂′_Y(x) = E{f̂′_X(x)} + h a(t) f(x) + (nh^3)^{−1/2} U(x) + o_p(h_0),   uniformly in x.    (5.8)

Note too that for any C_2 > 0,

    sup_{|x| ≤ y} |U(x)| / max[1, {log_+(y/h)}^{1/2}] = O_p(1),    (5.9)

uniformly in 0 ≤ y ≤ C_2. Since K′(0) = 0 and K′ ≤ 0 on the positive real line, and L has a bounded derivative, then

    a(t) = −∫_0^∞ {L′(t + u) − L′(t − u)} K′(u) du.

Let the support of K be [−C_3, C_3], let C_4 > 0 be much larger than C_3, and let C_5 > 0 be much larger than C_4. Given −∞ < A_1 < ∞ and A_2 < 0, choose L such that L(u) = A_1 u^2/4 + A_2 u^3/12 for u ∈ [−C_4, C_4], and L redescends very slowly, and with very small values of the first three derivatives, to 0 to the left of −C_4 and to the right of C_4, vanishing outside [−C_5, C_5]. Put B_j = −A_j ∫_{u>0} u K′(u) du. Then a(t) = B_1 + B_2 t on [−C_4 + C_3, C_4 − C_3], and a redescends slowly to 0 to the left of −C_4 + C_3 and to the right of C_4 − C_3, vanishing outside [−C_5 − C_3, C_5 + C_3].

As |A_2| increases, so too does the curvature of f̂_Y near its global maximum, modulo the effect of the additive noise term (nh^3)^{−1/2} U(x) at (5.8). However, in neighbourhoods of width O(h) of the origin the noise is of the same order as the gradient of f̂_Y; but the size of the noise does not depend on A_2, whereas the absolute value of the gradient increases with |A_2|. Arguing thus, it follows from (5.8) and (5.9) that if η > 0 is given, then A_1, A_2, C_4 and C_5 may be chosen such that the probability that f̂_Y is unimodal on (−ε, ε) exceeds 1 − η for all sufficiently large n. Taking L to be a random function, with O_p(1) bounds applying to L and the support of L, we deduce that the probability that f̂_Y is unimodal converges to 1 as n → ∞. This is sufficient to ensure existence of the claimed algorithm A_1; in cases where this particular construction does not produce a unimodal f̂_Y we instead translate all data to a single point and rely on the unimodality of K to ensure that of f̂_Y. Note too that f̂_Y = f̂_X outside an O_p(h)-neighbourhood of the origin.

5.3. Proof of Theorem 3.3

Let m_f < C < ∞, put β = 2/(α + 2), and let δ ∈ (0, (4/α) − β) be a small positive number. We take Y_i = X_i for C ≤ X_i ≤ ξ_1 ≡ Z_1 h_0^{−β}, where Z_1 = Z_1(n) denotes a random function of the data with the property

    lim_{ε→0, λ→∞} lim inf_{n→∞} P(ε < Z_1 < λ) = 1.    (5.10)

Data X_i ∈ (ξ_1, ξ_2], where ξ_2 = h_0^{δ−(4/α)}, will be sharpened, respectively, to Y_i = F^{−1}(i/n), while data X_i > ξ_2 will be sharpened in a manner that we shall describe three paragraphs below. We call ξ_1 and ξ_2 the first and second breakpoints, respectively.

The size of the first breakpoint is that of values x at which the order of the stochastic error of f̂′_X(x) is the same as the order of the quantity f′(x) being estimated; note that the bias of f̂′_X(x) is always of strictly smaller order than f′(x). To appreciate why the critical size is h_0^{−β}, note that the variance of f̂′_X(x) is asymptotic to v_n(x) ≡ h_0^2 x^{−α} as both n and x increase, in the sense that the ratio of the variance and v_n(x) is bounded away from zero and infinity. Moreover, |f′(x)|/x^{−α−1} is bounded away from zero and infinity as x → ∞. Therefore the relative stochastic error of the estimator f̂′_X(x) of f′(x) is of order 1 for values x such that v_n(x) is of size x^{−2(α+1)}, i.e., for x such that x ≍ h_0^{−β}.

More concisely, using methods developed during the proof of Theorem 3.2, particularly the argument based on the Komlós-Major-Tusnády (1975) approximation, it may be proved that if Z_2 = Z_2(n) denotes the supremum of values z such that f̂′_X(x) < 0 for C ≤ x ≤ z h_0^{−β}, then (5.10) holds with Z_1 replaced there by Z_2. Therefore, at the very least, values of X_i that exceed Z_2 h_0^{−β} must be sharpened if the density estimator f̂_Y is to have negative gradient at all points to the right of C. We take Z_1 < Z_2 in order to effect a slight taper, to ensure the gradient remains negative on both sides of the first breakpoint. The size of the second breakpoint is chosen for technical reasons that will become clear in the proof at (5.15) below.

Tapering beyond the second breakpoint, used to ensure that the density estimator has negative gradient there, does not require any data value X_i to be moved by more than O_p(X_i), uniformly in i. For example it is sufficient, modulo minor tapering in the neighbourhood of ξ_2, to equally space the Y_i's to the right of ξ_2, using the spacing F^{−1}(i/n) − F^{−1}{(i − 1)/n}, where F^{−1}(i/n) is the supremum of values F^{−1}(j/n) ≤ ξ_2. The density of data at any point x, where x ≥ C, is by assumption bounded by a constant multiple of x^{−α}. Therefore the total contribution to h_0^2 D(X, Y) from sharpening those X_i's for which X_i > ξ_2 equals

    O_p(n h_0^2 ∫_{ξ_2}^∞ x · x^{−α} dx) = O_p(n h_0^2 ξ_2^{−α+2}) = o_p(1),    (5.11)

provided α > 8 and δ in the definition of ξ_2 is sufficiently small. The sum of the distances, multiplied by h_0^2, through which data X_i that lie between the first and second breakpoints are sharpened, equals

    h_0^2 Σ_{X_i ∈ (ξ_1, ξ_2]} |X_i − F^{−1}(i/n)| = O_p{n h_0^2 ∫_{ξ_1}^{ξ_2} (x^{−α}/n)^{1/2} x^{1/2} dx}.


More information

. If lim. x 2 x 1. f(x+h) f(x)

. If lim. x 2 x 1. f(x+h) f(x) Review of Differential Calculus Wen te value of one variable y is uniquely determined by te value of anoter variable x, ten te relationsip between x and y is described by a function f tat assigns a value

More information

Polynomial Interpolation

Polynomial Interpolation Capter 4 Polynomial Interpolation In tis capter, we consider te important problem of approximating a function f(x, wose values at a set of distinct points x, x, x 2,,x n are known, by a polynomial P (x

More information

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these.

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these. Mat 11. Test Form N Fall 016 Name. Instructions. Te first eleven problems are wort points eac. Te last six problems are wort 5 points eac. For te last six problems, you must use relevant metods of algebra

More information

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT LIMITS AND DERIVATIVES Te limit of a function is defined as te value of y tat te curve approaces, as x approaces a particular value. Te limit of f (x) as x approaces a is written as f (x) approaces, as

More information

Numerical Differentiation

Numerical Differentiation Numerical Differentiation Finite Difference Formulas for te first derivative (Using Taylor Expansion tecnique) (section 8.3.) Suppose tat f() = g() is a function of te variable, and tat as 0 te function

More information

7 Semiparametric Methods and Partially Linear Regression

7 Semiparametric Methods and Partially Linear Regression 7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).

More information

Higher Derivatives. Differentiable Functions

Higher Derivatives. Differentiable Functions Calculus 1 Lia Vas Higer Derivatives. Differentiable Functions Te second derivative. Te derivative itself can be considered as a function. Te instantaneous rate of cange of tis function is te second derivative.

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

HOMEWORK HELP 2 FOR MATH 151

HOMEWORK HELP 2 FOR MATH 151 HOMEWORK HELP 2 FOR MATH 151 Here we go; te second round of omework elp. If tere are oters you would like to see, let me know! 2.4, 43 and 44 At wat points are te functions f(x) and g(x) = xf(x)continuous,

More information

Differentiation in higher dimensions

Differentiation in higher dimensions Capter 2 Differentiation in iger dimensions 2.1 Te Total Derivative Recall tat if f : R R is a 1-variable function, and a R, we say tat f is differentiable at x = a if and only if te ratio f(a+) f(a) tends

More information

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h Lecture Numerical differentiation Introduction We can analytically calculate te derivative of any elementary function, so tere migt seem to be no motivation for calculating derivatives numerically. However

More information

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here!

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here! Precalculus Test 2 Practice Questions Page Note: You can expect oter types of questions on te test tan te ones presented ere! Questions Example. Find te vertex of te quadratic f(x) = 4x 2 x. Example 2.

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

Chapter 4: Numerical Methods for Common Mathematical Problems

Chapter 4: Numerical Methods for Common Mathematical Problems 1 Capter 4: Numerical Metods for Common Matematical Problems Interpolation Problem: Suppose we ave data defined at a discrete set of points (x i, y i ), i = 0, 1,..., N. Often it is useful to ave a smoot

More information

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line Teacing Differentiation: A Rare Case for te Problem of te Slope of te Tangent Line arxiv:1805.00343v1 [mat.ho] 29 Apr 2018 Roman Kvasov Department of Matematics University of Puerto Rico at Aguadilla Aguadilla,

More information

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point

1 The concept of limits (p.217 p.229, p.242 p.249, p.255 p.256) 1.1 Limits Consider the function determined by the formula 3. x since at this point MA00 Capter 6 Calculus and Basic Linear Algebra I Limits, Continuity and Differentiability Te concept of its (p.7 p.9, p.4 p.49, p.55 p.56). Limits Consider te function determined by te formula f Note

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4.

CHAPTER (A) When x = 2, y = 6, so f( 2) = 6. (B) When y = 4, x can equal 6, 2, or 4. SECTION 3-1 101 CHAPTER 3 Section 3-1 1. No. A correspondence between two sets is a function only if eactly one element of te second set corresponds to eac element of te first set. 3. Te domain of a function

More information

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations

Parameter Fitted Scheme for Singularly Perturbed Delay Differential Equations International Journal of Applied Science and Engineering 2013. 11, 4: 361-373 Parameter Fitted Sceme for Singularly Perturbed Delay Differential Equations Awoke Andargiea* and Y. N. Reddyb a b Department

More information

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES (Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,

More information

Time (hours) Morphine sulfate (mg)

Time (hours) Morphine sulfate (mg) Mat Xa Fall 2002 Review Notes Limits and Definition of Derivative Important Information: 1 According to te most recent information from te Registrar, te Xa final exam will be eld from 9:15 am to 12:15

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation

Local Orthogonal Polynomial Expansion (LOrPE) for Density Estimation Local Ortogonal Polynomial Expansion (LOrPE) for Density Estimation Alex Trindade Dept. of Matematics & Statistics, Texas Tec University Igor Volobouev, Texas Tec University (Pysics Dept.) D.P. Amali Dassanayake,

More information

2.8 The Derivative as a Function

2.8 The Derivative as a Function .8 Te Derivative as a Function Typically, we can find te derivative of a function f at many points of its domain: Definition. Suppose tat f is a function wic is differentiable at every point of an open

More information

Recall from our discussion of continuity in lecture a function is continuous at a point x = a if and only if

Recall from our discussion of continuity in lecture a function is continuous at a point x = a if and only if Computational Aspects of its. Keeping te simple simple. Recall by elementary functions we mean :Polynomials (including linear and quadratic equations) Eponentials Logaritms Trig Functions Rational Functions

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

Math 161 (33) - Final exam

Math 161 (33) - Final exam Name: Id #: Mat 161 (33) - Final exam Fall Quarter 2015 Wednesday December 9, 2015-10:30am to 12:30am Instructions: Prob. Points Score possible 1 25 2 25 3 25 4 25 TOTAL 75 (BEST 3) Read eac problem carefully.

More information

REVIEW LAB ANSWER KEY

REVIEW LAB ANSWER KEY REVIEW LAB ANSWER KEY. Witout using SN, find te derivative of eac of te following (you do not need to simplify your answers): a. f x 3x 3 5x x 6 f x 3 3x 5 x 0 b. g x 4 x x x notice te trick ere! x x g

More information

Regularized Regression

Regularized Regression Regularized Regression David M. Blei Columbia University December 5, 205 Modern regression problems are ig dimensional, wic means tat te number of covariates p is large. In practice statisticians regularize

More information

Section 2.7 Derivatives and Rates of Change Part II Section 2.8 The Derivative as a Function. at the point a, to be. = at time t = a is

Section 2.7 Derivatives and Rates of Change Part II Section 2.8 The Derivative as a Function. at the point a, to be. = at time t = a is Mat 180 www.timetodare.com Section.7 Derivatives and Rates of Cange Part II Section.8 Te Derivative as a Function Derivatives ( ) In te previous section we defined te slope of te tangent to a curve wit

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II

Excursions in Computing Science: Week v Milli-micro-nano-..math Part II Excursions in Computing Science: Week v Milli-micro-nano-..mat Part II T. H. Merrett McGill University, Montreal, Canada June, 5 I. Prefatory Notes. Cube root of 8. Almost every calculator as a square-root

More information

Quantum Theory of the Atomic Nucleus

Quantum Theory of the Atomic Nucleus G. Gamow, ZP, 51, 204 1928 Quantum Teory of te tomic Nucleus G. Gamow (Received 1928) It as often been suggested tat non Coulomb attractive forces play a very important role inside atomic nuclei. We can

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

MA455 Manifolds Solutions 1 May 2008

MA455 Manifolds Solutions 1 May 2008 MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using

More information

Deconvolution problems in density estimation

Deconvolution problems in density estimation Deconvolution problems in density estimation Dissertation zur Erlangung des Doktorgrades Dr. rer. nat. der Fakultät für Matematik und Wirtscaftswissenscaften der Universität Ulm vorgelegt von Cristian

More information

5.1 We will begin this section with the definition of a rational expression. We

5.1 We will begin this section with the definition of a rational expression. We Basic Properties and Reducing to Lowest Terms 5.1 We will begin tis section wit te definition of a rational epression. We will ten state te two basic properties associated wit rational epressions and go

More information

Kernel Smoothing and Tolerance Intervals for Hierarchical Data

Kernel Smoothing and Tolerance Intervals for Hierarchical Data Clemson University TigerPrints All Dissertations Dissertations 12-2016 Kernel Smooting and Tolerance Intervals for Hierarcical Data Cristoper Wilson Clemson University, cwilso6@clemson.edu Follow tis and

More information

Financial Econometrics Prof. Massimo Guidolin

Financial Econometrics Prof. Massimo Guidolin CLEFIN A.A. 2010/2011 Financial Econometrics Prof. Massimo Guidolin A Quick Review of Basic Estimation Metods 1. Were te OLS World Ends... Consider two time series 1: = { 1 2 } and 1: = { 1 2 }. At tis

More information

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING Statistica Sinica 13(2003), 641-653 EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING J. K. Kim and R. R. Sitter Hankuk University of Foreign Studies and Simon Fraser University Abstract:

More information

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION

LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LIMITATIONS OF EULER S METHOD FOR NUMERICAL INTEGRATION LAURA EVANS.. Introduction Not all differential equations can be explicitly solved for y. Tis can be problematic if we need to know te value of y

More information

Click here to see an animation of the derivative

Click here to see an animation of the derivative Differentiation Massoud Malek Derivative Te concept of derivative is at te core of Calculus; It is a very powerful tool for understanding te beavior of matematical functions. It allows us to optimize functions,

More information

Journal of Computational and Applied Mathematics

Journal of Computational and Applied Mathematics Journal of Computational and Applied Matematics 94 (6) 75 96 Contents lists available at ScienceDirect Journal of Computational and Applied Matematics journal omepage: www.elsevier.com/locate/cam Smootness-Increasing

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation Peiua Qiu Scool of Statistics University of Minnesota 313 Ford Hall 224 Curc St SE Minneapolis, MN 55455 Abstract

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 By Jiti Gao 2 and Maxwell King 3 Abstract We propose a simultaneous model specification procedure for te conditional

More information

Linearized Primal-Dual Methods for Linear Inverse Problems with Total Variation Regularization and Finite Element Discretization

Linearized Primal-Dual Methods for Linear Inverse Problems with Total Variation Regularization and Finite Element Discretization Linearized Primal-Dual Metods for Linear Inverse Problems wit Total Variation Regularization and Finite Element Discretization WENYI TIAN XIAOMING YUAN September 2, 26 Abstract. Linear inverse problems

More information

An Empirical Bayesian interpretation and generalization of NL-means

An Empirical Bayesian interpretation and generalization of NL-means Computer Science Tecnical Report TR2010-934, October 2010 Courant Institute of Matematical Sciences, New York University ttp://cs.nyu.edu/web/researc/tecreports/reports.tml An Empirical Bayesian interpretation

More information

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative

Mathematics 5 Worksheet 11 Geometry, Tangency, and the Derivative Matematics 5 Workseet 11 Geometry, Tangency, and te Derivative Problem 1. Find te equation of a line wit slope m tat intersects te point (3, 9). Solution. Te equation for a line passing troug a point (x

More information

Math 1210 Midterm 1 January 31st, 2014

Math 1210 Midterm 1 January 31st, 2014 Mat 110 Midterm 1 January 1st, 01 Tis exam consists of sections, A and B. Section A is conceptual, wereas section B is more computational. Te value of every question is indicated at te beginning of it.

More information

Continuity and Differentiability Worksheet

Continuity and Differentiability Worksheet Continuity and Differentiability Workseet (Be sure tat you can also do te grapical eercises from te tet- Tese were not included below! Typical problems are like problems -3, p. 6; -3, p. 7; 33-34, p. 7;

More information

Notes on wavefunctions II: momentum wavefunctions

Notes on wavefunctions II: momentum wavefunctions Notes on wavefunctions II: momentum wavefunctions and uncertainty Te state of a particle at any time is described by a wavefunction ψ(x). Tese wavefunction must cange wit time, since we know tat particles

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

Integral Calculus, dealing with areas and volumes, and approximate areas under and between curves.

Integral Calculus, dealing with areas and volumes, and approximate areas under and between curves. Calculus can be divided into two ke areas: Differential Calculus dealing wit its, rates of cange, tangents and normals to curves, curve sketcing, and applications to maima and minima problems Integral

More information

3.1 Extreme Values of a Function

3.1 Extreme Values of a Function .1 Etreme Values of a Function Section.1 Notes Page 1 One application of te derivative is finding minimum and maimum values off a grap. In precalculus we were only able to do tis wit quadratics by find

More information

CDF and Survival Function Estimation with Infinite-Order Kernels

CDF and Survival Function Estimation with Infinite-Order Kernels CDF and Survival Function Estimation wit Infinite-Order Kernels Artur Berg and Dimitris N. Politis Abstract An improved nonparametric estimator of te cumulative distribution function CDF) and te survival

More information

Nonparametric regression on functional data: inference and practical aspects

Nonparametric regression on functional data: inference and practical aspects arxiv:mat/0603084v1 [mat.st] 3 Mar 2006 Nonparametric regression on functional data: inference and practical aspects Frédéric Ferraty, André Mas, and Pilippe Vieu August 17, 2016 Abstract We consider te

More information

New families of estimators and test statistics in log-linear models

New families of estimators and test statistics in log-linear models Journal of Multivariate Analysis 99 008 1590 1609 www.elsevier.com/locate/jmva ew families of estimators and test statistics in log-linear models irian Martín a,, Leandro Pardo b a Department of Statistics

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation Univariate Density Estimation Suppose tat we ave a random sample of data X 1,..., X n from an unknown continuous distribution wit probability density function (pdf) f(x) and cumulative

More information

Neutron transmission probability through a revolving slit for a continuous

Neutron transmission probability through a revolving slit for a continuous Neutron transmission probability troug a revolving slit for a continuous source and a divergent neutron beam J. Peters*, Han-Meitner-Institut Berlin, Glienicer Str. 00, D 409 Berlin Abstract: Here an analytical

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Estimation of boundary and discontinuity points in deconvolution problems

Estimation of boundary and discontinuity points in deconvolution problems Estimation of boundary and discontinuity points in deconvolution problems A. Delaigle 1, and I. Gijbels 2, 1 Department of Matematics, University of California, San Diego, CA 92122 USA 2 Universitair Centrum

More information

Taylor Series and the Mean Value Theorem of Derivatives

Taylor Series and the Mean Value Theorem of Derivatives 1 - Taylor Series and te Mean Value Teorem o Derivatives Te numerical solution o engineering and scientiic problems described by matematical models oten requires solving dierential equations. Dierential

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information