Nonparametric Density Estimation and Monotone Rearrangement


Nonparametric Density Estimation and Monotone Rearrangement

Robben Riksen

July 16, 2014

Bachelor Thesis
Supervisor: dr. A.J. van Es

Korteweg-de Vries Institute for Mathematics
Faculty of Natural Sciences, Mathematics and Informatics
University of Amsterdam


Abstract

In nonparametric density estimation, the method of kernel estimators is commonly used. However, if extra information about the monotonicity of the target density is available, kernel estimation does not take this into account and will in most cases not give a monotone estimate. A recently appeared method in this field of research is monotone rearrangement, which uses this extra information to improve the density estimate and has some interesting properties. In this thesis, nonparametric density estimation by means of kernel estimators is discussed. The reflection method is introduced to improve kernel estimation when the target density equals zero on the negative real line and has a discontinuity at zero. In the case where information about monotonicity of the target density is available, monotone rearrangement is used to monotonize the original estimate. The theoretical and asymptotic properties of the reflection method and monotone rearrangement are discussed briefly, after which simulations in Matlab are run to measure the performance, in terms of MISE, of both methods separately and applied together. It appears that both the reflection method and monotone rearrangement greatly reduce the MISE of the kernel estimator, but when applied together, the MISE remains at the level of that of the reflection method alone.

Cover image: a plot of an exponential-1 density (dashed-dotted line), a kernel estimate based on a sample of size n = 100 (dashed line) and its rearranged estimate (solid line).

Title: Nonparametric Density Estimation and Monotone Rearrangement
Author: Robben Riksen, robben.riksen@student.uva.nl
Supervisor: dr. A.J. van Es
Second grader: dr. I.B. van Vulpen
Date: July 16, 2014

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Contents

Introduction
1. Theory
   1.1. Kernel Estimation of Smooth Densities
        1.1.1. The Histogram Method
        1.1.2. Kernel Estimators
   1.2. Defining Errors
   1.3. Bandwidth Selection
   1.4. Decreasing Densities
        1.4.1. Reflection Method
   1.5. Monotone Rearrangement
        1.5.1. Introducing Monotone Rearrangement
        1.5.2. Properties of Monotone Rearrangement
2. Results
   2.1. Comparing Methods
   2.2. Simulations
3. Conclusion
4. Discussion and Review of the Progress
5. Populaire Samenvatting
References
A. Theorems and Proofs
B. MATLAB code
   B.1. Epanechnikov Kernel Estimator
   B.2. Gaussian Kernel Estimator
   B.3. Monotone Rearrangement
   B.4. Simulations
C. Samples
   C.1. Sample 1
   C.2. Sample 2


Introduction

When I returned from my semester abroad at the University of Edinburgh, I had made up my mind about the master's degree I wanted to pursue after graduating from my bachelor's degree: it had to be the master Stochastics and Financial Mathematics at the University of Amsterdam. For a longer time, my interest had gone out to statistics, so writing my bachelor thesis in this field of research was a logical step in the preparation for this master's programme. When approached, Dr. Van Es suggested a thesis about monotone rearrangement, a method of improving density estimates of monotone probability densities. Being such a recently developed method in this application, it sparked my interest and we decided it would be a suitable subject for a bachelor thesis.

A common challenge in statistics is the estimation of an unknown density function from a given data set. For example, this can be valuable to gain information about the occurrence of certain events in a population. Estimating the probability density from which such a data set is sampled can be done in a parametric and a nonparametric way. If we assume or know that a sample is drawn from, say, a N(µ, σ²) distribution, we apply parametric density estimation to estimate µ and/or σ. There are many ways of doing this, but the produced estimate will be normally distributed and its accuracy can only be improved by varying the parameters. If the assumption was right, this parametric way of density estimation is satisfactory. But if we do not know whether the density we want to estimate belongs to a known family of parameterized probability densities, the parametric approach obviously is not satisfactory. In nonparametric density estimation, these kinds of assumptions are not made. We only assume the target density exists, and make no assumptions about a parametric family to which the target density might, or might not, belong. Again, there are many ways to obtain a density function from observed data in a nonparametric way, but the most common way to do this is with kernel estimators, the method we will discuss in this thesis.

The method of kernel estimation is designed to estimate smooth densities, but can also be applied to densities with discontinuities. We will take a closer look at densities that are equal to zero on the negative real line, and therefore have a discontinuity at zero, the boundary. To improve the kernel estimator in this case, the commonly used reflection method is applied. This method, introduced by Schuster (1985) [1], reflects the data set around zero, that is, adds the negatives of the data points to the data set and applies the kernel estimation method to this new, bigger set of data points. This greatly improves the kernel estimation method.

Suppose now there is some extra information about the target density: we know the density we want to estimate is monotonically decreasing, or increasing. The availability

of such information occurs often. For example, in biometrics, age-height charts, in which the expected height of children is plotted against their age, should be monotonic in age. In econometrics, demand functions are monotonic in price (both examples from [2]). And there are many other examples in which a monotone density is expected. In these cases, kernel estimation still works fine. But the method of kernel estimators does not take this extra information into account, and will in most cases not produce a monotone estimate. Wanting to produce the best possible estimate from our data set, we would like to use all information available to minimize our estimation error. To monotonize the density found by kernel estimation, the method of monotone rearrangement is introduced. This relatively simple method, essentially rearranging function values in decreasing or increasing order, appeared first in the context of variational analysis in 1952 in the book Inequalities by Hardy, Littlewood and Pólya [3], but was only very recently introduced in the context of density estimation. Monotone rearrangement, however, proved to have many interesting properties in this context. This was the motivation to look into this subject further in this thesis.

The goal of this thesis is to become familiar with the methods of kernel estimation and monotone rearrangement, and to investigate whether a combination of methods designed to improve kernel estimation improves the estimate of a target density even more. Therefore, we will apply the reflection method and monotone rearrangement together to estimate a monotone probability density from a generated sample. This leads to the following research question:

Does the combination of the reflection method and monotone rearrangement significantly improve the estimation of a monotonic density from a given data sample?

In an attempt to answer this question, these methods were implemented in Matlab and simulations were run with different sample sizes.

This thesis starts with an overview of the method of kernel estimators, introducing them in an intuitive way. Then, measures of how well an estimator estimates the target density are introduced and discussed, after which several problems with the original kernel estimators are pointed out in the case where they are used to estimate target densities with a discontinuity at zero. A solution is proposed in the form of the reflection method. After that, the method of monotone rearrangement is introduced and some of its properties are discussed. Then simulations are run to answer the main question of this thesis. The results of the simulations are discussed and possible improvements to the simulation and estimation methods are proposed.

1. Theory

This chapter starts with introducing the concept of kernel estimation in an intuitive way. The reflection method [1] is introduced as a method to improve kernel estimation for densities that have a bounded support. An example of such a density is the exponential density, which is zero for all points on the negative real axis and has a discontinuity at 0. Then monotone rearrangement is introduced and applied, to take into account extra information about monotonicity of the target density. Throughout this chapter, unless stated otherwise, the same data set with 100 points drawn from a N(µ, σ²) distribution, with µ = 2 and σ² = 1/2, will be used in the examples to illustrate the theory. The values of these data points can be found in Appendix C.1. Furthermore, in general descriptions and theory it will be assumed that a sample of n independent, identically distributed observations X_1, X_2, ..., X_n is given, for which the density f will be estimated.

1.1. Kernel Estimation of Smooth Densities

In this section we will introduce the method of kernel estimation, starting with the histogram method as a density estimator. We then expand this idea to introduce kernel estimators and find a smooth estimate of the target density.

1.1.1. The Histogram Method

Probably the most intuitive way of representing the probability density from which a data set is sampled is by use of a histogram. To do this, the bin width h and starting point x_0 are chosen. The bins are defined as the half-open intervals [x_0 + (m-1)h, x_0 + mh), for m ∈ Z. Then the histogram density approximation is defined as

\hat{f}(x) = \frac{1}{nh} \cdot \#\{X_i \text{ in the same bin as } x\}.

For the data set from Appendix C.1 this definition, with starting point x_0 = 0 and bin width h = , results in the density \hat{f} as depicted in figure 1.1. Clearly, the histogram method for approximating the density is not at all satisfactory. The approximation is not a continuous function, but the major drawback is that the choice of the starting point x_0 has a big influence on the approximated density, as is shown in figure 1.2. There the starting point is shifted to the left by 0.10 to x_0 = -0.1 while the bin width is kept the same. This results in the peak extending less far to the right, and the structure separated from the rest of the bins visible in figure 1.1 has almost disappeared in figure 1.2.
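To make the definition concrete, a minimal Matlab sketch of this histogram estimator could look as follows. This is an illustration with our own function and variable names, not the code from Appendix B.

    % Minimal sketch of the histogram density estimator defined above (an
    % illustration, not the Appendix B code). x is a vector of evaluation
    % points, X the sample, x0 the starting point and h the bin width.
    % (Saved as hist_density.m.)
    function fhat = hist_density(x, X, x0, h)
        n = length(X);
        fhat = zeros(size(x));
        for j = 1:length(x)
            m  = floor((x(j) - x0) / h);              % index of the bin containing x(j)
            lo = x0 + m * h;                          % that bin is [lo, lo + h)
            fhat(j) = sum(X >= lo & X < lo + h) / (n * h);
        end
    end

For example, figure 1.1 corresponds to a call of the form hist_density(x, X, 0, h) with the sample from Appendix C.1.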

Figure 1.1.: Histogram for the data set from Appendix C.1, with starting point x_0 = 0 and bin width h = .

Figure 1.2.: Histogram for the data set from Appendix C.1, with starting point x_0 = -0.1 and bin width h = .

If we now, instead of placing the data points in bins, place bins around the data points, we get rid of the problem of the starting point. The only variables then are the shape of the bin and the bin width h. This idea leads to the definition of kernel estimators, which we introduce below.

1.1.2. Kernel Estimators

A far more satisfactory method to estimate the probability density is a family of methods based on kernel estimators. First a kernel function K, a function that places weight on (and around) the observations, has to be introduced. This kernel function, or simply kernel, K has to integrate to 1 over the real line. Now the kernel estimator can be defined as follows.

Definition 1.1. Let X_1, X_2, ..., X_n be independent, identically distributed observations with density f. A kernel estimator \hat{f} of the target density f, with kernel K, is defined by

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),   (1.1)

where n is the number of data points and, from now on, h is called the bandwidth.

A kernel K is said to be of order m if

\int x^{j} K(x)\,dx = 0 \quad \text{for } 0 < j < m, \qquad \text{and} \qquad \int x^{m} K(x)\,dx \neq 0.   (1.2)

The most commonly used symmetric kernels are of order 2, because kernels of higher order can make the estimate take negative values in some areas, which is in most cases not desirable when estimating a probability density. An example of a commonly used kernel function is the standard normal density function. To visualize the working of a kernel estimator using the normal density as a kernel, imagine small normal densities placed over the data points X_i that add up to the estimated probability density. This is illustrated in figure 1.3, where we used the first 4 data points from sample 1 in Appendix C.1 and bandwidth h = 0.5 to construct the density estimate \hat{f}.

Varying the bandwidth h has big effects on the density estimate \hat{f}. Choosing h too small will result in too much weight being given to single data points in the tail of the distribution, while choosing h too big will result in oversmoothing and might conceal structures in the target density. Therefore, the selected bandwidth largely determines the behaviour of the kernel estimator. This is illustrated by figure 1.4, where the bandwidth h = 0.3 is chosen. A (non-existent) structure around x = 3, invisible before, has appeared because the bandwidth was chosen too small.
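Definition 1.1 translates almost directly into code. The following Matlab sketch (our own illustration, not the Appendix B implementation) evaluates the estimator on a grid, with the kernel passed in as a function handle.

    % Minimal sketch of the kernel estimator (1.1). x is a grid of evaluation
    % points, X the sample, h the bandwidth and K a kernel function handle.
    % (Saved as kde.m.)
    function fhat = kde(x, X, h, K)
        n = length(X);
        fhat = zeros(size(x));
        for i = 1:n
            fhat = fhat + K((x - X(i)) / h);          % add one kernel per data point
        end
        fhat = fhat / (n * h);
    end

The estimate in figure 1.3 corresponds, up to the exact data points used, to a call like fhat = kde(x, X(1:4), 0.5, @(t) exp(-0.5*t.^2)/sqrt(2*pi)).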

Figure 1.3.: Kernel estimator for the N(2, 0.5) distribution from 4 data points with bandwidth h = 0.5, showing the individual kernels.

Figure 1.4.: Kernel estimator for the N(2, 0.5) distribution from 4 data points with bandwidth h = 0.3, showing the individual kernels.

Another property of a kernel estimator that determines its accuracy is the choice of the kernel. A huge variation in the choice of the kernel used in kernel estimation is possible. A commonly used symmetric kernel is the Gaussian kernel we used in figure 1.3:

K(t) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} t^2}.

Another commonly used kernel is the Epanechnikov kernel K_e:

K_e(t) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{t^2}{5}\right) & \text{if } |t| \leq \sqrt{5}, \\ 0 & \text{otherwise.} \end{cases}   (1.3)

This kernel, sometimes called the optimal kernel, was introduced by Epanechnikov (1969) [4] and is the most efficient kernel function in terms of minimizing the MISE, although the difference compared to other kernels is relatively small. Its drawback, on the other hand, is that it is not continuously differentiable.

The kernels we just discussed are the same for all data points. Variable kernels can also be introduced, in which the shape of the kernel depends on the position of the data point around which it is placed, or on its distance to other data points. Depending on the information known about the data set, this can greatly increase the accuracy of the estimate.

1.2. Defining Errors

To compare different estimation methods and to draw conclusions about which method improves the estimation, it is necessary to define what measures of error will be used to compare different techniques. In the derivations in this section, the line of Density Estimation for Statistics and Data Analysis, by Silverman (1986) [5], will be followed. For the error of the estimate in a single point, the mean square error is defined in the usual way.

Definition 1.2. The mean square error is given by

MSE_x(\hat{f}) := E[\hat{f}(x) - f(x)]^2,   (1.4)

where \hat{f}(x) is the estimator of the density f.

It is easy to see, using the expression for the variance of \hat{f}, Var \hat{f}(x) = E\hat{f}^2(x) - (E\hat{f}(x))^2, that the MSE can be written in the following way:

MSE_x(\hat{f}) = (E\hat{f}(x) - f(x))^2 + Var \hat{f}(x).   (1.5)

We will now define the bias of \hat{f}.

Definition 1.3. Let \hat{f}(x) be the estimate of the density f. Then

b_{\hat{f}}(x) := E\hat{f}(x) - f(x)   (1.6)

is called the bias of \hat{f}.
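Both kernels are easily written as one-line function handles for use with the kde sketch from section 1.1.2; the Epanechnikov handle below matches the form in (1.3). This is an illustration only; the thesis's own, more complete implementations are in Appendix B.1 and B.2.

    % Gaussian and Epanechnikov kernels as function handles, usable with the
    % kde sketch from section 1.1.2; both are vectorized in t.
    Kgauss = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);
    Kepan  = @(t) (3 / (4 * sqrt(5))) * (1 - t.^2 / 5) .* (abs(t) <= sqrt(5));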

So in a single point, the mean square error is given by a bias term, representing the systematic error of \hat{f}, and a variance term. To quantify the error on the whole domain of f, the mean integrated square error is introduced.

Definition 1.4. The mean integrated square error is given by

MISE(\hat{f}) := E \int \left[\hat{f}(x) - f(x)\right]^2 dx.   (1.7)

The MISE is a commonly used way of quantifying the error in an estimate. Now, because the integrand is non-negative, the expectation and the integral can be interchanged, and the MISE takes the following form [6]:

MISE(\hat{f}) = E \int \left[\hat{f}(x) - f(x)\right]^2 dx = \int MSE_x(\hat{f})\,dx = \int \left[E\hat{f}(x) - f(x)\right]^2 dx + \int Var\,\hat{f}(x)\,dx = \int b_{\hat{f}}^2(x)\,dx + \int Var\,\hat{f}(x)\,dx.   (1.8)

Thus, the MISE is the sum of the integrated square bias and the integrated variance. If in this definition \hat{f} is taken as in definition 1.1, its expectation is [7]

E\hat{f}(x) = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\,dy,   (1.9)

and its variance can be found as

Var\,\hat{f}(x) = Var\left[\frac{1}{n}\sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right] = \frac{1}{n^2}\sum_{i=1}^{n} Var\left[\frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right] = \frac{1}{n}\left(E\left[\frac{1}{h^2} K^2\left(\frac{x - X_i}{h}\right)\right] - \left(E\left[\frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right]\right)^2\right) = \frac{1}{n}\int \frac{1}{h^2} K^2\left(\frac{x - y}{h}\right) f(y)\,dy - \frac{1}{n}\left[\int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\,dy\right]^2.   (1.10)

Written in this form, it becomes clear that the bias term in the MISE does not directly depend on the sample size, so this term will not be reduced just by taking larger samples.
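The decomposition (1.5) can also be checked numerically by Monte Carlo: repeatedly draw a sample, evaluate the estimator at a fixed point, and combine the empirical squared bias and variance. The sketch below does this for the N(2, 1/2) example density; the point x0, the bandwidth, the sample size and the number of replications are arbitrary choices of ours.

    % Monte Carlo illustration of MSE = bias^2 + variance, cf. (1.5), at a
    % single point x0 for the N(2, 1/2) density used in the examples. All
    % tuning values (n, h, B, x0) are arbitrary choices for the illustration.
    rng(1);                                           % reproducible pseudo-random numbers
    f  = @(x) exp(-(x - 2).^2) / sqrt(pi);            % N(2, 1/2) density
    K  = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);        % Gaussian kernel
    x0 = 2; n = 100; h = 0.3; B = 1000;
    est = zeros(B, 1);
    for b = 1:B
        X = 2 + sqrt(0.5) * randn(n, 1);              % sample of size n from N(2, 1/2)
        est(b) = sum(K((x0 - X) / h)) / (n * h);      % kernel estimate at x0
    end
    mse = (mean(est) - f(x0))^2 + var(est);           % squared bias plus variance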

From now on, it will be assumed that the unknown density has continuous derivatives of all required orders and that the kernel function K used to define \hat{f} is a symmetric function with the following properties:

\int K(x)\,dx = 1,   (1.11)
\int x K(x)\,dx = 0,   (1.12)
\int x^2 K(x)\,dx = k_2 \neq 0.   (1.13)

In other words, we are dealing with a symmetric kernel of order 2. These properties are met by most commonly used kernel functions, for in many cases the kernel K is a symmetric probability density with variance k_2. The kernel functions used in the simulations later on in this thesis will also satisfy these properties.

To calculate the MISE of an estimate, it is useful to derive a more applicable (approximate) form of the bias and variance. So for the rest of this section we will take on a heuristic approach. We can rewrite the bias in the following way:

b_{\hat{f}}(x) = E\hat{f}(x) - f(x) = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\,dy - f(x) = \int K(t) f(x - ht)\,dt - f(x)   (1.14)
= \int K(t)\left[f(x - ht) - f(x)\right] dt,   (1.15)

where at (1.14) the change of variable y = x - ht is applied and at (1.15) it is used that K integrates to unity. If the Taylor expansion

f(x - ht) = f(x) - ht f'(x) + \frac{1}{2} h^2 t^2 f''(x) - \ldots   (1.16)

is substituted in expression (1.15), the following expression for the bias is found:

b_{\hat{f}}(x) = -h f'(x) \int t K(t)\,dt + \frac{h^2}{2} f''(x) \int t^2 K(t)\,dt - \ldots = \sum_{p=1}^{\infty} (-1)^p h^p b^{(p)}(x),   (1.17)

where b^{(p)}(x) = \frac{f^{(p)}(x)}{p!} \int t^p K(t)\,dt. Using properties (1.12) and (1.13) of K, the bias can be approximated by

b_{\hat{f}}(x) = h^2 b^{(2)}(x) + O(h^4) \approx h^2 b^{(2)}(x).   (1.18)

So the integrated square bias can be approximated by

\int b_{\hat{f}}^2(x)\,dx \approx \frac{1}{4} h^4 k_2^2 \int (f''(x))^2\,dx.   (1.19)

In a similar manner, an approximation of the variance of \hat{f} can be found. Substitution of expression (1.9) into (1.10) gives

Var\,\hat{f}(x) = \frac{1}{n} \int \frac{1}{h^2} K^2\left(\frac{x - y}{h}\right) f(y)\,dy - \frac{1}{n}\left[f(x) + b_{\hat{f}}(x)\right]^2.

The substitution y = x - ht in the integral and approximation (1.18) for the bias lead to the following approximation of the variance:

Var\,\hat{f}(x) \approx \frac{1}{nh} \int K^2(t) f(x - ht)\,dt - \frac{1}{n}\left[f(x) + O(h^2)\right]^2.

With the Taylor expansion (1.16) and the assumption that n is large and h is small, this can be rewritten as

Var\,\hat{f}(x) \approx \frac{1}{nh} \int K^2(t)\left\{f(x) - ht f'(x) + \ldots\right\} dt + O(n^{-1}) = \frac{1}{nh} f(x) \int K^2(t)\,dt + O(n^{-1}) \approx \frac{1}{nh} f(x) \int K^2(t)\,dt.   (1.20)

Because f is a probability density, \int f(x)\,dx = 1, so the integral of the variance of \hat{f} can be approximated by

\int Var\,\hat{f}(x)\,dx \approx \frac{1}{nh} \int K^2(t)\,dt.   (1.21)

This means we can approximate the MISE by [6]

MISE = \int b_{\hat{f}}^2(x)\,dx + \int Var\,\hat{f}(x)\,dx \approx \frac{1}{4} h^4 k_2^2 \int (f''(x))^2\,dx + \frac{1}{nh} \int K^2(t)\,dt.   (1.22)
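All ingredients of (1.22) can be computed numerically for a concrete kernel and density. The sketch below does this for the Gaussian kernel and the N(2, 1/2) density from the examples, and also gives the bandwidth that minimizes the resulting expression, anticipating the optimal bandwidth of section 1.3. Grids and names are our own choices, not taken from the thesis.

    % Numerical evaluation of the AMISE approximation (1.22) for the Gaussian
    % kernel and the N(2, 1/2) example density, plus the bandwidth minimizing it.
    t     = linspace(-10, 10, 2001);
    K     = @(s) exp(-0.5 * s.^2) / sqrt(2 * pi);
    k2    = trapz(t, t.^2 .* K(t));                   % int t^2 K(t) dt (equals 1 here)
    RK    = trapz(t, K(t).^2);                        % int K(t)^2 dt
    x     = linspace(-4, 8, 2001);
    f     = exp(-(x - 2).^2) / sqrt(pi);              % N(2, 1/2) density on the grid
    dx    = x(2) - x(1);
    fpp   = gradient(gradient(f, dx), dx);            % numerical second derivative
    Rf2   = trapz(x, fpp.^2);                         % int (f''(x))^2 dx
    amise = @(h, n) 0.25 * h.^4 * k2^2 * Rf2 + RK ./ (n * h);
    h_opt = @(n) (RK ./ (k2^2 * Rf2 * n)).^(1/5);     % minimizer of the AMISE in h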

The last expression (1.22) is known as the asymptotic mean integrated square error, AMISE, and is a useful approximation to the MISE when dealing with large samples; it is much easier to calculate than the MISE in equation (1.8) [8]. It now becomes clear that decreasing the bandwidth h, whilst reducing the integrated square bias term in (1.8), increases the integrated variance term. This means that finding the optimal bandwidth will always be a trade-off between random and systematic error. This is known as the variance-bias trade-off. More about bandwidth selection will be said in section 1.3.

Besides the MISE as a measure of how well a density estimator approaches the target density, the asymptotic properties of the estimator are also important. Prakasa Rao (1983) [6] proved that there is no reasonable estimator \hat{f}_n(x) such that

E\left[\hat{f}_n(x)\right] = f(x),   (1.23)

so we are forced to look for asymptotically unbiased estimators. That is, a sequence of density estimators \hat{f}_n is asymptotically unbiased if, for every density f and every x,

\lim_{n \to \infty} E_f\left[\hat{f}_n(x)\right] = f(x).   (1.24)

We call a sequence of density estimators \hat{f}_n weakly consistent if

\hat{f}_n(x) \xrightarrow{p} f(x) \quad \text{as } n \to \infty.   (1.25)

Both definitions are valuable when looking for an appropriate density estimator.

1.3. Bandwidth Selection

As stated before, a good choice of the bandwidth h in a kernel estimator is very important, for the bandwidth largely determines the behaviour of the estimator. We define the optimal bandwidth as the bandwidth that minimizes the AMISE (1.22). In this thesis we will not go into detail about different methods to determine the optimal bandwidth, but we will just state the theoretical asymptotically optimal bandwidth. Parzen (1962) [9] showed that the asymptotically optimal bandwidth for a twice continuously differentiable density f is equal to

h_n^{opt} = k_2^{-2/5} \left(\int K(t)^2\,dt\right)^{1/5} \left(\int f''(x)^2\,dx\right)^{-1/5} n^{-1/5},   (1.26)

where k_2 is the same as in equation (1.13). In this thesis, however, we will be dealing with densities with discontinuities. In Van Eeden (1985) [10] the asymptotically optimal

bandwidth is derived for non-smooth densities. Let D be the set of discontinuity points of f; then the asymptotically optimal bandwidth is given by

h_n^{opt} = \left(\frac{\int K(t)^2\,dt}{\int_0^{\infty}\left(\int_t^{\infty} K(u)\,du\right)^2 dt}\right)^{1/2} \left(\sum_{d \in D} (f(d+) - f(d-))^2\right)^{-1/2} n^{-1/2},   (1.27)

where f(d+) = \lim_{h \downarrow 0} f(d + h) and f(d-) = \lim_{h \downarrow 0} f(d - h). So we conclude that for smooth densities the optimal bandwidth is of order n^{-1/5}, and for discontinuous densities the optimal bandwidth is of order n^{-1/2}. In practice, for finite samples, the optimal bandwidth has to be approximated. There are many methods of finding a suitable bandwidth, but in the simulations later on in this thesis we will use the Sheather and Jones bandwidth selection procedure [11].

Now that we have defined the measures by which we can judge the quality of a density estimator and know what the theoretical optimal bandwidths are, we can move on to the different methods of improving the estimator found by kernel estimation.

1.4. Decreasing Densities

When a probability density function is monotonically decreasing, the lower bound of the interval on which it is positive has to exist; otherwise the function could never be a probability density, for it would not integrate to unity. So the density function takes positive (non-zero) values on an interval [a_l, ∞) and is equal to zero on the interval (-∞, a_l). This means that, for a monotonically decreasing density (taking a_l = 0 from here on), there has to be a discontinuity at x = 0. However, most kernel estimators will not correct for this discontinuity and will have positive values on the interval where the target density equals zero. Especially at this boundary, the density estimate will be inaccurate and not even consistent with the target density. The problems induced by such a boundary are called boundary effects (or boundary constraints). Kernel estimators not taking these boundary constraints into account we will call unconstrained estimators. In this section we will discuss the reflection method, a method that corrects the kernel density estimate for these boundary effects.

1.4.1. Reflection Method

In the previous sections, the kernel density method was introduced as a method to estimate a target density that satisfies certain smoothness criteria. For example, in the derivation of the AMISE (1.22) we assumed the target density f to have a continuous second derivative on the whole real line. In many cases densities are used that do not satisfy these conditions. A common example of such a density is the exponential-λ density, f(x) = λ e^{-λx} 1_{\{x \geq 0\}}, which has an obvious discontinuity at x = 0 and positive values only on the positive real line. From figure 1.5, in which an exponential-1 density is

estimated by a kernel estimator with an Epanechnikov kernel from a sample of n = 100 observations (sample 2 in Appendix C.2), it becomes clear that the estimator gives too little weight to the points close to the boundary on the positive x-axis and too much (namely any) to points on the negative axis.

Figure 1.5.: Kernel density estimate with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line).

So such boundary effects influence the performance of the estimator near the boundary [8], [12]. To quantify this, suppose f is a density with f(x) = 0 for x < 0 and f(x) > 0 for x ≥ 0, which has a continuous second derivative away from x = 0. Let \hat{f} be a kernel estimator of f based on a kernel K with support [-1, 1] and bandwidth h. We express a point x as x = ch, h → 0. With the same change of variable as in equation (1.14) we find for the expectation of \hat{f}

E\hat{f}(x) = \int_{-1}^{c} K(t) f(x - ht)\,dt.   (1.28)

Now if c ≥ 1, that is x ≥ h, we saw in equation (1.18) that

E\hat{f}(x) = f(x) + h^2 b^{(2)}(x) + O(h^4),   (1.29)

for a kernel of order 2. If on the other hand 0 ≤ c < 1, we find after Taylor expansion of f(x - ht) around x

E\hat{f}(x) = f(x) \int_{-1}^{c} K(t)\,dt + o(h).   (1.30)

Because in general \int_{-1}^{c} K(t)\,dt \neq 1, \hat{f} is not consistent in points close to x = 0. By the assumption that the kernel K is symmetric, we find E\hat{f}(0) = \frac{1}{2} f(0) + o(h) at the boundary x = 0. Intuitively, this can be explained by the fact that the kernel estimate is a smooth function on the whole real line, while the target density f it estimates is not continuous at x = 0.

There exist multiple methods to create a consistent kernel estimator near the boundary. In this section we will take a closer look at the reflection method, as introduced by Schuster (1985) [1]. All observations are mirrored in x = 0, so to the set of observations X_1, ..., X_n the observations -X_1, ..., -X_n are added. Then the kernel estimator is applied to this new set of observations, for x ∈ [0, ∞). This method can also be written in the form of a kernel estimator on X_1, ..., X_n for x ∈ [0, ∞) as

\hat{f}_R(x) = \frac{1}{nh} \sum_{i=1}^{n} \left\{K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right)\right\}.   (1.31)

Note that \hat{f}_R(x) integrates to 1 on the positive real axis and is a density. If we apply the reflection method to the same set of observations that was used in figure 1.5, this results in figure 1.6. At first sight \hat{f}_R(x) already is a better estimator of the target density f near the boundary, and it no longer gives any weight to negative values of x. This results in an asymptotically unbiased estimate of f. Following the same steps as before,

E\hat{f}_R(x) = E\left[\frac{1}{nh} \sum_{i=1}^{n} \left\{K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right)\right\}\right] = E\left[\frac{1}{h} K\left(\frac{x - X}{h}\right)\right] + E\left[\frac{1}{h} K\left(\frac{x + X}{h}\right)\right] = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\,dy + \int \frac{1}{h} K\left(\frac{x + y}{h}\right) f(y)\,dy = \int K(t) f(x - ht)\,dt + \int K(t) f(-x + ht)\,dt,   (1.32)

where in the first integral the change of variable y = x - ht is applied, and in the second integral y = -x + ht. If we again express a point x close to the boundary as x = ch, h → 0, c ∈ [0, 1), and use that f(x) = 0 for x < 0 and K(t) = 0 for t < -1 and t > 1, we find

Figure 1.6.: Reflection method with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line).

E\hat{f}_R(x) = \int_{-1}^{c} K(t) f(x - ht)\,dt + \int_{c}^{1} K(t) f(-x + ht)\,dt = f(x) \int_{-1}^{c} K(t)\,dt + f(-x) \int_{c}^{1} K(t)\,dt + o(h),   (1.33)

using Taylor expansions of f(x - ht) around x and of f(-x + ht) around -x. So for x = 0 the expectation becomes f(0) + o(h), and we see that the estimator \hat{f}_R(x) is asymptotically unbiased at the boundary. To achieve a smaller bias of order O(h^2) near the boundary, the generalized reflection method was introduced by Karunamuni and Alberts (2006) [13], but this is beyond the scope of this thesis.
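A minimal sketch of the reflection estimator (1.31), reusing the kde sketch from section 1.1.2, could look as follows; the function name is ours and this is an illustration, not the thesis's Appendix B code. For x < 0 the estimate is simply taken to be zero, in line with the support of the target density.

    % Sketch of the reflection estimator (1.31): apply the ordinary kernel sum
    % to the enlarged sample {X_i, -X_i} and rescale, evaluating for x >= 0.
    % (Saved as kde_reflect.m; kde is the sketch from section 1.1.2.)
    function fhatR = kde_reflect(x, X, h, K)
        fhatR = 2 * kde(x, [X(:); -X(:)], h, K);      % 2n points, so multiply by 2
    end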

1.5. Monotone Rearrangement

In some cases there is additional information about the target density. If it is known that the probability density is a monotone function, it is desirable to use this extra information to find a density estimate that matches the original density better. Clearly, the kernel estimation method discussed in the previous sections does not take this information into account. In this section monotone rearrangement is introduced, a method that adapts the density estimate obtained by kernel estimation so that it respects the known monotonicity of the target density.

In the remainder of this chapter, unless stated otherwise, f will be the target density and \hat{f} the not necessarily monotone estimate of f, for example generated by a kernel method as discussed in the previous sections.

1.5.1. Introducing Monotone Rearrangement

Let U be uniformly distributed on a compact interval A = [a_l, a_u] in R, and let g: A → R be a strictly increasing and differentiable function. Then the distribution function F_g(y) of g(U) is proportional to [14]

F_g(y) = \int_{a_l}^{a_u} 1_{\{g(u) \leq y\}}\,du + a_l = g^{-1}(y), \quad y \in g([a_l, a_u]).   (1.34)

In the case where g is not a strictly increasing function, the distribution function is still given by the integral above, and we define the isotone increasing rearrangement as the generalized inverse of the distribution function:

g^{\uparrow}(x) = \inf\{y \in \mathbb{R} \mid F_g(y) \geq x\}.   (1.35)

In other words, g^{\uparrow} is the quantile function of F_g, and F_g(y) is the measure of the set on which g(u) \leq y [3]. In a similar way the isotone decreasing rearrangement can be defined, with distribution function

F_g(y) = \int_{a_l}^{a_u} 1_{\{g(u) \geq y\}}\,du + a_l, \quad y \in g([a_l, a_u]),   (1.36)

and decreasing rearrangement

g^{\downarrow}(x) = \inf\{y \in \mathbb{R} \mid F_g(y) \geq x\}.   (1.37)

This reasoning is valid for every compact interval A in R, so for reasons of convenience A = [0, 1] is used from now on. In section 1.5.2 it is shown that the rearrangement of \hat{f}, as defined in (1.35) and (1.37), always gives a better approximation of the monotone target density when applied to a non-monotonic estimate.

To illustrate the principle of monotone rearrangement, a simple example is studied. The method of monotone rearrangement as described above is applied to the function

g(x) = (2x - 1)^2, \quad x \in [0, 1].   (1.38)

Clearly, this function is not monotone. Using definitions (1.34) and (1.35), the increasing monotone rearrangement of the function g is given by

g^{\uparrow}(x) := \inf\left\{y \in \mathbb{R} \;\middle|\; \int_0^1 1_{\{(2u-1)^2 \leq y\}}\,du \geq x\right\}.   (1.39)

So the rearrangement sorts the values of g on [0, 1] in ascending order. The result is plotted in figure 1.7.

Figure 1.7.: Increasing monotone rearrangement (dashed line) applied to the function g(x) = (2x - 1)^2 (solid line).

In the same way, using definition (1.37), the decreasing rearrangement is given by

g^{\downarrow}(x) = \inf\left\{y \in \mathbb{R} \;\middle|\; \int_0^1 1_{\{(2u-1)^2 \geq y\}}\,du \geq x\right\},   (1.40)

resulting in figure 1.8.

In section 1.4.1 we used the reflection method to improve the kernel estimator. From figure 1.6 it is clear that the resulting estimate is not a monotone decreasing function, while the target density is. Applying the isotone decreasing rearrangement as defined in (1.37) to the reflected estimate results in figure 1.9. If we plot the reflection estimate and its monotone rearrangement together and zoom in on the part around x = 2 where the reflection estimate is not monotone, we see how the monotone rearrangement method rearranges the function values to find a monotone estimate, as is shown in figure 1.10.

As we just saw, the monotone rearrangement method monotonizes the kernel density estimate to achieve a better estimate of the monotone target density. However, the rearrangements achieved by equations (1.35) and (1.37) will not always be continuously differentiable functions. For example, if we take a look at the function f(x) = x + \frac{1}{4}\sin(4\pi x)

Figure 1.8.: Decreasing monotone rearrangement (dashed line) applied to the function g(x) = (2x - 1)^2 (solid line).

Figure 1.9.: Decreasing monotone rearrangement applied to the reflection method with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line).

and its isotone rearrangement, as depicted in figure 1.11 below, we clearly see that the rearrangement is not everywhere continuously differentiable. In some cases, however, it

Figure 1.10.: Decreasing monotone rearrangement (dotted line) applied to the reflection method with Epanechnikov kernel (dashed line) of an exponential-1 density (solid line).

might be necessary to find a differentiable estimator of f. To find an everywhere differentiable increasing rearrangement, the indicator function in the distribution function (1.34) can be approximated by a kernel in the following way [14]. Let K_d be a positive kernel of order 2 with compact support [-1, 1] and h_d the corresponding bandwidth. Then

F_{g,h_d}(y) = \frac{1}{h_d} \int_0^1 \int_{-\infty}^{y} K_d\left(\frac{g(u) - v}{h_d}\right) dv\,du   (1.41)

is a smoothed version of the distribution function, and

g^{\uparrow}_{h_d}(x) = \inf\{y \in \mathbb{R} \mid F_{g,h_d}(y) \geq x\}   (1.42)

is called the smoothed increasing rearrangement of g. In the same way as before, the decreasing rearrangement can be defined via the smoothed distribution function and becomes

F_{g,h_d}(y) = \frac{1}{h_d} \int_0^1 \int_{y}^{\infty} K_d\left(\frac{g(u) - v}{h_d}\right) dv\,du,   (1.43)

Figure 1.11.: Increasing isotone rearrangement (dashed line) applied to the function f(x) = x + \frac{1}{4}\sin(4\pi x) (solid line).

g^{\downarrow}_{h_d}(x) = \inf\{y \in \mathbb{R} \mid F_{g,h_d}(y) \geq x\}.   (1.44)

For sufficiently small h_d, the smoothed rearrangements g^{\uparrow}_{h_d} and g^{\downarrow}_{h_d} of an unconstrained probability density are still probability densities, and they converge pointwise to the isotone rearrangements g^{\uparrow} and g^{\downarrow} respectively, as proved in Birke (2009) [14] (see Appendix A, theorem A.1, for the proof). As pointed out in Neumeyer (2007) [15], the smoothed and the isotone rearrangement share the same rate of convergence. Depending on the properties required of the estimator, one of these methods can be chosen. The smoothed rearrangement will be preferred if one requires a smooth estimator; for the isotone rearrangement, however, there is no need to choose a bandwidth h_d, and flat parts of g are better reflected by the isotone rearrangement. In the rest of this thesis and in the simulations, we will use the isotone rearrangement rather than the smoothed rearrangement.

1.5.2. Properties of Monotone Rearrangement

The method of monotone rearrangement has promising properties. Bennett and Sharpley (1988) [16] showed that the monotone rearrangement f^{\uparrow} of a function f has the same

L^p-norm, p ∈ [1, ∞), as the function f itself: \|f^{\uparrow}\|_p = \|f\|_p. Chernozhukov et al. (2009) [2] proved that monotone rearrangement of an unconstrained estimate of a monotone target function weakly decreases the estimation error in the L^p-norm, that is,

\|\hat{f}^{\uparrow} - f\|_p \leq \|\hat{f} - f\|_p

(see Appendix A, theorem A.3, for a proof). Taking p = 2, this implies that the MISE of the rearranged estimate will be at most that of the original unconstrained estimate. Neumeyer (2007) [15] stated that the smoothed rearrangement of an estimator \hat{f} is pointwise consistent if \hat{f} is pointwise consistent. Furthermore, she showed that the asymptotic behaviour of the rearrangement is the same as that of the original unconstrained estimator.

The above definitions of monotone rearrangement were only introduced for densities with a bounded support. For many purposes, as is the case with our example of the exponential-1 density, the support of the target density is unbounded. For decreasing densities it will often be of the form A = [a_l, ∞) and for increasing densities of the form A = (-∞, a_u]. Dette and Volgushev (2008) [17] showed that in these cases the rearrangements can be defined on an unbounded support and that the asymptotic behaviour of the rearrangement is the same as described above. Hence, we can safely use monotone rearrangement in the estimation of the exponential-1 density function.
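Numerically, the rearrangements are easy to compute on a fine equidistant grid: since all grid points carry the same weight, sorting the function values in decreasing (or increasing) order gives the decreasing (or increasing) rearrangement of the gridded estimate. A minimal sketch, anticipating the implementation used in chapter 2; the bandwidth and grid below are arbitrary choices of ours.

    % Decreasing rearrangement of a gridded density estimate by sorting,
    % applied to an unconstrained kernel estimate of the exponential-1 density.
    % kde is the sketch from section 1.1.2.
    x      = linspace(0, 6, 1001);                    % equidistant grid on [0, 6]
    X      = -log(rand(100, 1));                      % sample of size 100 from exponential-1
    K      = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);
    fhat   = kde(x, X, 0.4, K);                       % unconstrained estimate on the grid
    frearr = sort(fhat, 'descend');                   % its decreasing monotone rearrangement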

2. Results

2.1. Comparing Methods

The main point of this thesis is to investigate whether improving the kernel estimator by the reflection method, as introduced in section 1.4.1, before applying the method of monotone rearrangement results in a better estimate of a monotone target density. In the last section we saw that the asymptotic properties of the monotone rearrangement of an estimate are the same as those of the original estimate. So since the reflection method produces a consistent estimate of a density function with a discontinuity at x = 0, applying monotone rearrangement to this estimate will also produce a consistent estimate, and it will share the same asymptotic properties. In practice, however, one does not encounter samples of infinite size. Therefore it is valuable to look at the behaviour of the MISE of the methods for samples of smaller sizes. To measure the performance of the methods, simulations have been run in Matlab with different sample sizes. The density we will look at is the exponential-1 density we also used in section 1.4.1: f(x) = e^{-x} 1_{[0,\infty)}(x).

2.2. Simulations

In the simulations run in this thesis, we used a Sheather and Jones bandwidth selector. This selector was originally designed by Sheather and Jones (1991) [11] for smooth densities. In our case, we applied it to the exponential-1 density, which has a discontinuity at x = 0. Van Es and Hoogstrate (1997) [18] showed that the Sheather and Jones bandwidth selection method adapts slightly to the non-smoothness of the target density. The bandwidths produced by this method are of smaller order than n^{-1/5} (1.26), the optimal rate for smooth densities, but the optimal rate of n^{-1/2} (1.27) for densities with discontinuities is not reached. However, they state that the Sheather and Jones bandwidth still performs well in smaller sample size applications, which is why we chose to use this bandwidth selection method. We therefore used a Matlab script published by Dynare [19], mh_optimal_bandwidth, which we adapted slightly to fit our needs (e.g. removed options that were unnecessary in our application, to speed up the selection process and reduce the number of input variables). The kernel we used in the simulations is a Gaussian kernel, for the Sheather and Jones bandwidth selection method requires an at least six times differentiable kernel and the Epanechnikov kernel is not continuously differentiable.

Simulations were run on samples of different sizes, n = 50, 100, 500, 1000. For each sample size, 1000 samples were generated to which the kernel estimation methods have

been applied. The estimated densities were used to approximate the MSE and MISE of each method, where the average of the function values was used to approximate the expectation of the estimator and the sample variance to approximate its variance. Furthermore, to calculate the MISE from the MSE, the composite Simpson's rule was used to approximate the integral from the numerical values. From each sample, the density was estimated by a pure kernel estimator as in definition 1.1 with a Gaussian kernel (second column of table 2.1), the decreasing isotone monotone rearrangement of this kernel estimate (third column of table 2.1), the reflection method applied to the sample (fourth column of table 2.1) and the decreasing isotone monotone rearrangement applied to the reflection method (fifth column of table 2.1). After the calculation of the estimates, the MSE and the MISE of the estimators were calculated. Hereby we have to note that for the smaller sample size n = 50, the values of the MISE can vary when the simulations are repeated. For the larger sample sizes the mean integrated square errors are stable under repetition of the simulation.

We tried several ways to implement the monotone rearrangement algorithm. An implementation literally following the definition in equation (1.37) turned out to be inaccurate if the simulations had to be run within reasonable calculation time. Another method of computing the rearranged estimate was proposed in Chernozhukov et al. (2009) [2], using quantiles of the set of numerical function values as the rearrangement. This method proved to be far more accurate in terms of MISE, and required significantly less calculation time. The most accurate and efficient method, however, also proposed in the same paper [2] and in Anevski and Fougères (2008) [20], was implementing the rearrangement as a sorting operator. If the function values of \hat{f}, the function we want to apply the rearrangement to, are calculated on a fine enough grid of equidistant points, the decreasing rearrangement can simply be obtained by sorting these function values in decreasing order. Therefore, this is the method we used to calculate the mean integrated square errors for the rearrangements. For the reproducibility of these results, the full Matlab code used for the simulations can be found in Appendix B.4.

Table 2.1.: The MISE for the different methods of improving the kernel estimation for samples of size n. Columns: n, MISE, MISE rearr, MISE refl, MISE refl+rearr.

The mean square errors of the simulations for sample size n = 1000 are shown in figure 2.1, where it becomes clear that all methods have comparable performance away from the boundary x = 0. The difference in the performance of these methods lies, as we would expect, close to the boundary. A close-up of this area is shown in the second graph of figure 2.1.
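A condensed sketch of one cell of this simulation is given below (the full code is in Appendix B.4). Here a fixed, arbitrarily chosen bandwidth stands in for the Sheather and Jones selector, and the trapezoidal rule stands in for the composite Simpson's rule; kde_reflect is the reflection sketch from section 1.4.1.

    % Condensed sketch of the simulation for one sample size: MISE of the
    % rearranged reflection estimate of the exponential-1 density. A fixed
    % bandwidth replaces the Sheather and Jones selector and trapz replaces
    % Simpson's rule; all tuning values are arbitrary choices for illustration.
    rng(2);
    x  = linspace(0, 6, 601);                         % evaluation grid
    f  = exp(-x);                                     % exponential-1 target on the grid
    n  = 100; B = 1000; h = 0.4;
    K  = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);
    se = zeros(B, length(x));
    for b = 1:B
        X        = -log(rand(n, 1));                  % sample of size n from exponential-1
        fhat     = kde_reflect(x, X, h, K);           % reflection estimate (section 1.4.1)
        frearr   = sort(fhat, 'descend');             % its decreasing rearrangement
        se(b, :) = (frearr - f).^2;                   % squared error on the grid
    end
    mise = trapz(x, mean(se, 1));                     % approximate MISE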

Figure 2.1.: In the estimation of the exponential-1 density, the MSE of the pure kernel estimator (solid line), its monotone rearrangement (dashed line), the reflection method (dotted line) and its monotone rearrangement (dotted-dashed line). Note that the dotted line and the dotted-dashed line are indistinguishable in these plots. The graph below is a close-up of the area close to the boundary x = 0.

3. Conclusion

As we saw from table 2.1, for each sample size n, applying the decreasing rearrangement after applying the reflection method has no, or only a very small, effect. One could think the MISE is reduced in the simulation with sample size n = 50, but because the variation in the MISE for this sample size is relatively big, the small improvement is not significant. What does become clear from table 2.1 is that both the monotone rearrangement method and the reflection method significantly reduce the MISE. It also shows that in terms of the reduction of the MISE, the reflection method is preferred over monotone rearrangement.

However, in terms of computation time the monotone rearrangement algorithm performs better than the reflection method. The time necessary for rearranging a found estimate is negligible compared to the computation time necessary for kernel estimation. For example, for a sample of size n = 1000, unconstrained kernel estimation and monotone rearrangement take roughly 0.5 seconds, while the reflection method takes somewhat more than 1 second to compute. Rearranging the reflected estimate takes negligible time, but on the other hand does not improve the estimate significantly. Therefore, for extremely large samples it might be more desirable to use the monotone rearrangement algorithm than to use the reflection method.

The conclusion we can draw from the data in table 2.1 is that combining the method of monotone rearrangement and the reflection method does not significantly improve the estimate relative to just applying the reflection method.

4. Discussion and Review of the Progress

Several remarks can be made about the simulations run in this thesis. First of all, it should be stressed that the sample size n = 50 is too small to draw any conclusions from, for the MISE varies a lot when the estimation procedure is repeated on a different set of samples. Secondly, more monotone probability densities should be investigated to confirm the answer to the research question more firmly. And thirdly, the Sheather and Jones bandwidth selector we used to estimate the target density in this thesis is not optimal for discontinuous densities, as was stated in section 1.3. Therefore other bandwidth selection methods should be investigated to improve the density estimates. For example, for larger sample sizes, one could use the least squares cross-validation bandwidths, which are discussed in Stone (1984) [21]. These bandwidths are asymptotically equivalent to the optimal bandwidths, even in the non-smooth cases [18]. The suboptimal bandwidth might be the reason why the accuracy of the unconstrained kernel estimator is less than that in [14], where a different bandwidth selector was used. Since the monotone rearranged estimator is based on this less accurate kernel estimator, the monotone rearrangement will also be less accurate; thus the mean integrated square errors are not comparable to those from [14].

In this thesis, we started out with the introduction of kernel estimators and some basic properties of this method. Then we pointed out the boundary effects that occur when a target density with a discontinuity at x = 0 is estimated. As a solution to the loss of consistency, the reflection method was discussed. Monotone rearrangement, a method rearranging function values in ascending or descending order, was proposed for the case where a monotonic density is to be estimated. Simulations in Matlab were run to compare the reflection method and the method of monotone rearrangement separately and applied together.

Review of the Progress

Taking on a project extending over such a long period of time was new for me, and I learned a lot in the process. Looking back on the last few months, there are a lot of things I could have done better, but also a lot of things that worked out well. The most important lesson I learned in these months is that I should not hesitate to approach my supervisor. While reading the articles I used for this thesis, there were moments where I was struggling a lot with minor details, which in some cases cost me days. However, a simple explanation by my supervisor, sometimes just a clarification of a definition, was enough to get me back on track.

After the introduction to the subject of nonparametric density estimation and monotone

rearrangement by Dr. Van Es, I started familiarizing myself with the subject. There were a lot of books at hand with a proper introduction to kernel estimation to get acquainted with this principle. However, since the method of monotone rearrangement was only recently introduced into statistics, I had to get all the information on this subject out of articles that are not written for undergraduate students who are new to this field of research. This made it harder to fully understand the method and work with it. Therefore it took me more time than I had expected, especially because most of the time while I was working on this project I still had to follow courses, so I had to divide my attention. Only when the courses ended could this project finally get my full attention, and in those weeks the real progress was made. Furthermore, simulating in Matlab was relatively new to me, but fortunately it did not take me long to gain experience in the programming and I really enjoyed it.

Personally, I found the conclusion drawn rather surprising, for I had expected that monotone rearrangement would perform better in combination with the reflection method. If I had had more time for this project, I would have liked to find out why the combined performance of these methods is relatively poor. Also, I would have liked to take a look at another method of nonparametric density estimation, the method of nonparametric maximum likelihood estimators (NPMLE), and measure its performance relative to the kernel estimators. On top of that, I would have liked to learn more about asymptotic statistics to find out more about the asymptotic properties of different estimation methods. Luckily, I will get this chance next year during my master's.

To conclude, I would like to thank Dr. Van Es for introducing me to this subject and for his support during the process of writing this thesis.

5. Populaire Samenvatting

The subject of this bachelor thesis is the estimation of probability densities and various methods to improve such an estimate. To fully understand this, one of course needs to know what a probability density is. Loosely speaking, a probability density is a function that assigns to every event the probability of its occurrence. In the case where there are finitely many possibilities, this is even the definition of a probability density. To make everything clearer, we will explore these notions further with the help of an example.

Suppose we want to be able to predict the length of an eruption of a certain geyser. To do so, we stand next to the geyser with a stopwatch for a month, and every time it erupts we measure how many minutes the eruption lasts. At the end of the month we have made, say, 50 measurements (we will use the first half of the data set from Appendix C.2 to simulate these data). A commonly used way to represent these data is a histogram. We divide the time axis into bars, and whenever the length of an eruption falls within a bar, we make that bar a little higher. In the end we obtain a histogram in which the height of each bar is the relative frequency of that eruption length, as can be seen in the left panel of figure 5.1.

Figure 5.1.: Left: histogram of the eruption lengths (first half of the data in Appendix C.2). Right: estimate of the probability density based on the histogram.

If we now think away the bars, take only their tops and consider this as a curve, as is done in the right panel of figure 5.1, we could interpret this as an estimate of the probability density. The area under an interval


Adaptive Nonparametric Density Estimators Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where

More information

Chapter 1. Density Estimation

Chapter 1. Density Estimation Capter 1 Density Estimation Let X 1, X,..., X n be observations from a density f X x. Te aim is to use only tis data to obtain an estimate ˆf X x of f X x. Properties of f f X x x, Parametric metods f

More information

LECTURE NOTE #3 PROF. ALAN YUILLE

LECTURE NOTE #3 PROF. ALAN YUILLE LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Nonparametric Density Estimation. October 1, 2018

Nonparametric Density Estimation. October 1, 2018 Nonparametric Density Estimation October 1, 2018 Introduction If we can t fit a distribution to our data, then we use nonparametric density estimation. Start with a histogram. But there are problems with

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Supervised Learning: Non-parametric Estimation

Supervised Learning: Non-parametric Estimation Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates:

More information

On variable bandwidth kernel density estimation

On variable bandwidth kernel density estimation JSM 04 - Section on Nonparametric Statistics On variable bandwidth kernel density estimation Janet Nakarmi Hailin Sang Abstract In this paper we study the ideal variable bandwidth kernel estimator introduced

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Nonparametric Econometrics

Nonparametric Econometrics Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Akaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data

Akaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data Journal of Modern Applied Statistical Methods Volume 12 Issue 2 Article 21 11-1-2013 Akaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Density Estimation (II)

Density Estimation (II) Density Estimation (II) Yesterday Overview & Issues Histogram Kernel estimators Ideogram Today Further development of optimization Estimating variance and bias Adaptive kernels Multivariate kernel estimation

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend

More information

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras Lecture - 13 Conditional Convergence Now, there are a few things that are remaining in the discussion

More information

From Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch

From Histograms to Multivariate Polynomial Histograms and Shape Estimation. Assoc Prof Inge Koch From Histograms to Multivariate Polynomial Histograms and Shape Estimation Assoc Prof Inge Koch Statistics, School of Mathematical Sciences University of Adelaide Inge Koch (UNSW, Adelaide) Poly Histograms

More information

ARE211, Fall 2004 CONTENTS. 4. Univariate and Multivariate Differentiation (cont) Four graphical examples Taylor s Theorem 9

ARE211, Fall 2004 CONTENTS. 4. Univariate and Multivariate Differentiation (cont) Four graphical examples Taylor s Theorem 9 ARE211, Fall 24 LECTURE #18: TUE, NOV 9, 24 PRINT DATE: DECEMBER 17, 24 (CALCULUS3) CONTENTS 4. Univariate and Multivariate Differentiation (cont) 1 4.4. Multivariate Calculus: functions from R n to R

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

ESTIMATORS IN THE CONTEXT OF ACTUARIAL LOSS MODEL A COMPARISON OF TWO NONPARAMETRIC DENSITY MENGJUE TANG A THESIS MATHEMATICS AND STATISTICS

ESTIMATORS IN THE CONTEXT OF ACTUARIAL LOSS MODEL A COMPARISON OF TWO NONPARAMETRIC DENSITY MENGJUE TANG A THESIS MATHEMATICS AND STATISTICS A COMPARISON OF TWO NONPARAMETRIC DENSITY ESTIMATORS IN THE CONTEXT OF ACTUARIAL LOSS MODEL MENGJUE TANG A THESIS IN THE DEPARTMENT OF MATHEMATICS AND STATISTICS PRESENTED IN PARTIAL FULFILLMENT OF THE

More information

Answers for Calculus Review (Extrema and Concavity)

Answers for Calculus Review (Extrema and Concavity) Answers for Calculus Review 4.1-4.4 (Extrema and Concavity) 1. A critical number is a value of the independent variable (a/k/a x) in the domain of the function at which the derivative is zero or undefined.

More information

1.5 Approximate Identities

1.5 Approximate Identities 38 1 The Fourier Transform on L 1 (R) which are dense subspaces of L p (R). On these domains, P : D P L p (R) and M : D M L p (R). Show, however, that P and M are unbounded even when restricted to these

More information

arxiv: v1 [stat.me] 17 Jan 2008

arxiv: v1 [stat.me] 17 Jan 2008 Some thoughts on the asymptotics of the deconvolution kernel density estimator arxiv:0801.2600v1 [stat.me] 17 Jan 08 Bert van Es Korteweg-de Vries Instituut voor Wiskunde Universiteit van Amsterdam Plantage

More information

arxiv: v2 [stat.me] 13 Sep 2007

arxiv: v2 [stat.me] 13 Sep 2007 Electronic Journal of Statistics Vol. 0 (0000) ISSN: 1935-7524 DOI: 10.1214/154957804100000000 Bandwidth Selection for Weighted Kernel Density Estimation arxiv:0709.1616v2 [stat.me] 13 Sep 2007 Bin Wang

More information

Bandwith selection based on a special choice of the kernel

Bandwith selection based on a special choice of the kernel Bandwith selection based on a special choice of the kernel Thomas Oksavik Master of Science in Physics and Mathematics Submission date: June 2007 Supervisor: Nikolai Ushakov, MATH Norwegian University

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Foundations of Analysis. Joseph L. Taylor. University of Utah

Foundations of Analysis. Joseph L. Taylor. University of Utah Foundations of Analysis Joseph L. Taylor University of Utah Contents Preface vii Chapter 1. The Real Numbers 1 1.1. Sets and Functions 2 1.2. The Natural Numbers 8 1.3. Integers and Rational Numbers 16

More information

Chapter 2: Resampling Maarten Jansen

Chapter 2: Resampling Maarten Jansen Chapter 2: Resampling Maarten Jansen Randomization tests Randomized experiment random assignment of sample subjects to groups Example: medical experiment with control group n 1 subjects for true medicine,

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Density Estimation. We are concerned more here with the non-parametric case (see Roger Barlow s lectures for parametric statistics)

Density Estimation. We are concerned more here with the non-parametric case (see Roger Barlow s lectures for parametric statistics) Density Estimation Density Estimation: Deals with the problem of estimating probability density functions (PDFs) based on some data sampled from the PDF. May use assumed forms of the distribution, parameterized

More information

Relationships Between Quantities

Relationships Between Quantities Algebra 1 Relationships Between Quantities Relationships Between Quantities Everyone loves math until there are letters (known as variables) in problems!! Do students complain about reading when they come

More information

A VERY BRIEF REVIEW OF MEASURE THEORY

A VERY BRIEF REVIEW OF MEASURE THEORY A VERY BRIEF REVIEW OF MEASURE THEORY A brief philosophical discussion. Measure theory, as much as any branch of mathematics, is an area where it is important to be acquainted with the basic notions and

More information

Positive data kernel density estimation via the logkde package for R

Positive data kernel density estimation via the logkde package for R Positive data kernel density estimation via the logkde package for R Andrew T. Jones 1, Hien D. Nguyen 2, and Geoffrey J. McLachlan 1 which is constructed from the sample { i } n i=1. Here, K (x) is a

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression

More information

A tailor made nonparametric density estimate

A tailor made nonparametric density estimate A tailor made nonparametric density estimate Daniel Carando 1, Ricardo Fraiman 2 and Pablo Groisman 1 1 Universidad de Buenos Aires 2 Universidad de San Andrés School and Workshop on Probability Theory

More information

Review: Limits of Functions - 10/7/16

Review: Limits of Functions - 10/7/16 Review: Limits of Functions - 10/7/16 1 Right and Left Hand Limits Definition 1.0.1 We write lim a f() = L to mean that the function f() approaches L as approaches a from the left. We call this the left

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Log-Density Estimation with Application to Approximate Likelihood Inference

Log-Density Estimation with Application to Approximate Likelihood Inference Log-Density Estimation with Application to Approximate Likelihood Inference Martin Hazelton 1 Institute of Fundamental Sciences Massey University 19 November 2015 1 Email: m.hazelton@massey.ac.nz WWPMS,

More information

I. ANALYSIS; PROBABILITY

I. ANALYSIS; PROBABILITY ma414l1.tex Lecture 1. 12.1.2012 I. NLYSIS; PROBBILITY 1. Lebesgue Measure and Integral We recall Lebesgue measure (M411 Probability and Measure) λ: defined on intervals (a, b] by λ((a, b]) := b a (so

More information

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1

2tdt 1 y = t2 + C y = which implies C = 1 and the solution is y = 1 Lectures - Week 11 General First Order ODEs & Numerical Methods for IVPs In general, nonlinear problems are much more difficult to solve than linear ones. Unfortunately many phenomena exhibit nonlinear

More information

Calculus at Rutgers. Course descriptions

Calculus at Rutgers. Course descriptions Calculus at Rutgers This edition of Jon Rogawski s text, Calculus Early Transcendentals, is intended for students to use in the three-semester calculus sequence Math 151/152/251 beginning with Math 151

More information

arxiv: v1 [stat.me] 11 Sep 2007

arxiv: v1 [stat.me] 11 Sep 2007 Electronic Journal of Statistics Vol. 0 (0000) ISSN: 1935-7524 arxiv:0709.1616v1 [stat.me] 11 Sep 2007 Bandwidth Selection for Weighted Kernel Density Estimation Bin Wang Mathematics and Statistics Department,

More information

Statistics for Python

Statistics for Python Statistics for Python An extension module for the Python scripting language Michiel de Hoon, Columbia University 2 September 2010 Statistics for Python, an extension module for the Python scripting language.

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

Modern Physics Part 2: Special Relativity

Modern Physics Part 2: Special Relativity Modern Physics Part 2: Special Relativity Last modified: 23/08/2018 Links Relative Velocity Fluffy and the Tennis Ball Fluffy and the Car Headlights Special Relativity Relative Velocity Example 1 Example

More information

Supporting Australian Mathematics Project. A guide for teachers Years 11 and 12. Probability and statistics: Module 25. Inference for means

Supporting Australian Mathematics Project. A guide for teachers Years 11 and 12. Probability and statistics: Module 25. Inference for means 1 Supporting Australian Mathematics Project 2 3 4 6 7 8 9 1 11 12 A guide for teachers Years 11 and 12 Probability and statistics: Module 2 Inference for means Inference for means A guide for teachers

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Experiment 2 Random Error and Basic Statistics

Experiment 2 Random Error and Basic Statistics PHY9 Experiment 2: Random Error and Basic Statistics 8/5/2006 Page Experiment 2 Random Error and Basic Statistics Homework 2: Turn in at start of experiment. Readings: Taylor chapter 4: introduction, sections

More information

MATH 1231 MATHEMATICS 1B CALCULUS. Section 5: - Power Series and Taylor Series.

MATH 1231 MATHEMATICS 1B CALCULUS. Section 5: - Power Series and Taylor Series. MATH 1231 MATHEMATICS 1B CALCULUS. Section 5: - Power Series and Taylor Series. The objective of this section is to become familiar with the theory and application of power series and Taylor series. By

More information

3 Nonparametric Density Estimation

3 Nonparametric Density Estimation 3 Nonparametric Density Estimation Example: Income distribution Source: U.K. Family Expenditure Survey (FES) 1968-1995 Approximately 7000 British Households per year For each household many different variables

More information

Adaptive Kernel Estimation of The Hazard Rate Function

Adaptive Kernel Estimation of The Hazard Rate Function Adaptive Kernel Estimation of The Hazard Rate Function Raid Salha Department of Mathematics, Islamic University of Gaza, Palestine, e-mail: rbsalha@mail.iugaza.edu Abstract In this paper, we generalized

More information

Analysis methods of heavy-tailed data

Analysis methods of heavy-tailed data Institute of Control Sciences Russian Academy of Sciences, Moscow, Russia February, 13-18, 2006, Bamberg, Germany June, 19-23, 2006, Brest, France May, 14-19, 2007, Trondheim, Norway PhD course Chapter

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Metric spaces and metrizability

Metric spaces and metrizability 1 Motivation Metric spaces and metrizability By this point in the course, this section should not need much in the way of motivation. From the very beginning, we have talked about R n usual and how relatively

More information

MATH 1231 MATHEMATICS 1B CALCULUS. Section 4: - Convergence of Series.

MATH 1231 MATHEMATICS 1B CALCULUS. Section 4: - Convergence of Series. MATH 23 MATHEMATICS B CALCULUS. Section 4: - Convergence of Series. The objective of this section is to get acquainted with the theory and application of series. By the end of this section students will

More information