Endpoint estimation for observations with normal measurement errors


Xuan Leng, Liang Peng, Xing Wang, Chen Zhou

January 6, 27

Xuan Leng: Erasmus University Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.
Liang Peng: Department of Risk Management and Insurance, Robinson College of Business, Georgia State University, Atlanta, GA 30303, USA.
Xing Wang: Department of Risk Management and Insurance, Robinson College of Business, Georgia State University, Atlanta, GA 30303, USA.
Chen Zhou: Erasmus University Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands.

Abstract

This paper investigates the estimation of the finite endpoint of a distribution function when the observations are contaminated by normally distributed measurement errors. Under the framework of Extreme Value Theory, we propose a class of estimators for the standard deviation of the measurement errors as well as for the endpoint. Asymptotic theories for the proposed estimators are established, while their finite sample performance is demonstrated by simulations. In addition, we apply the proposed methods to outdoor long jump data to estimate the ultimate limit for human beings in the long jump.

Keywords: Convolution, extreme value theory, ultimate world record, Weibull domain of attraction.

1 Introduction

For a continuous distribution function $F$, the right endpoint of the distribution is defined as $\theta = \sup\{x : F(x) < 1\}$. Estimating the endpoint $\theta$ has been applied in various contexts when $\theta < +\infty$. For example, Aarssen and de Haan (1994) estimated the endpoint of the distribution of the life span of human beings, namely, the maximum life span. In productivity analysis, estimating the production frontier can be viewed as estimating the endpoint of the distribution of outputs conditional on the inputs; see e.g. Cazals et al. (2002). Another notable application, in Einmahl and Magnus (2008), considers estimating the ultimate world record, in other words the limit of human beings, in a specific sport. For general statistical methods designed for estimating the endpoint, see, e.g., Hall (1982), Athreya and Fukuchi (1997), Hall and Wang (1999, 2005), and Girard et al. (2012), among others.

Since the endpoint is only related to the right tail region of the distribution, it is natural to consider Extreme Value Theory (EVT) in endpoint estimation. EVT models the tail region of a distribution function. The endpoint is finite if the distribution function belongs to the domain of attraction of the Weibull distribution. By regarding the endpoint as a high quantile with a probability level tending to one, the limit of the high quantile estimator can be considered as an endpoint estimator. Therefore, one may estimate the endpoint using the estimators of the extreme value index, scale and shift; see e.g. de Haan and Ferreira (2006, Chapter 4.5).

In reality, data are often contaminated by measurement errors. In other words, instead of observing an independent and identically distributed (i.i.d.) sample drawn from the underlying distribution with a finite endpoint, we observe data as the convolution of the initial random variable and an independent measurement error term. Mathematically, suppose $X_1, \dots, X_n$ are random variables with a continuous distribution function $F_X$ with a finite endpoint $\theta$. Instead of observing $\{X_i\}_{i=1}^n$, we observe
\[ Y_i = X_i + \varepsilon_i, \quad i = 1, 2, \dots, n, \]
where $\{\varepsilon_i\}$ are i.i.d. random errors with mean zero, and $\{\varepsilon_i\}$ are independent of $\{X_i\}$. The goal of this paper is to estimate the endpoint $\theta$ based on the observations $\{Y_i\}_{i=1}^n$ when $\varepsilon$ follows a normal distribution. Notice that in this case, the distribution function of $Y$ has no finite endpoint.

In a broader context, extracting the distribution of $X$ based on the observations $\{Y_i\}_{i=1}^n$ is related to the so-called deconvolution problem. In the literature on nonparametric deconvolution, kernel estimators for the density function of $\{X_i\}$ were proposed; see, e.g., Carroll and Hall (1988), Stefanski and Carroll (1990), Meister (2006) and Meister and Neumann (2010). Based on estimating the density function, Hall and Simar (2002) proposed an estimator of $\theta$ assuming that the density $f_X$ is approximated by a flat function in the neighborhood of $\theta$ and that the variance of the measurement error, $\sigma_n^2 = \mathrm{var}(\varepsilon_i)$, which depends on the sample size $n$, shrinks to zero as $n \to \infty$. By contrast, Kneip et al. (2015) did not require that $\sigma$ tends to zero as the sample size increases, and dealt with a constant $\sigma$ instead. They proposed a joint estimation of $\theta$ and $\sigma$ when the measurement errors $\{\varepsilon_i\}$ follow a normal distribution $N(0, \sigma^2)$. Nevertheless, both of these approaches require that $f_X(\theta) > 0$.

Our approach is close to that in Goldenshluger and Tsybakov (2004). They modeled $f_X$ near $\theta$ by a power function with index $\alpha$. In other words, $f_X(\theta) = 0$. The Goldenshluger and Tsybakov (2004) approach requires that the variance $\sigma^2$ and the index $\alpha$ are known. More specifically, by deriving that
\[ \frac{\sqrt{2\log n}}{\sigma}\Big(\max_{1\le i\le n} Y_i - \theta - b_n\Big) \xrightarrow{d} \Lambda, \quad \text{as } n \to \infty, \]
where $b_n$ equals $\sigma\sqrt{2\log n}$ plus lower-order correction terms involving $\sigma$, $\alpha$ and $\log\log n$, and $\Lambda$ is the Gumbel distribution, Goldenshluger and Tsybakov (2004) suggested the estimator $\hat\theta = \max_{1\le i\le n} Y_i - b_n$, which has a speed of convergence of $\sqrt{2\log n}$.

Motivated by the estimator in Goldenshluger and Tsybakov (2004), we aim at estimating the endpoint when $\sigma$ is unknown, with a broader class of $F_X$. Under an EVT model for $F_X$, we show that $\sigma$ can be estimated by using the top $k$ order statistics of $\{Y_i\}$, where $k = k(n)$ is an intermediate sequence such that $k \to \infty$ and $k/n \to 0$ as $n \to \infty$. With some proper conditions on the intermediate sequence $k$, the estimator for $\sigma$ possesses asymptotic normality with a speed of convergence $\sqrt{k}$. In addition, the conditions on $k$ ensure that, apart from the main term $\sigma\sqrt{2\log n}$, the other minor terms in $b_n$ play no role asymptotically. Consequently, we can directly estimate the endpoint by $\hat\theta = \max_{1\le i\le n} Y_i - \hat\sigma\sqrt{2\log n}$. The estimator inherits the asymptotic normality of $\hat\sigma$, however, with a compromised speed of convergence.

Compared to the existing literature, our approach has two main advantages. Firstly, we do not require that $F_X$ has a density, but only assume a second order expansion of the survival function $\bar F_X(x) = 1 - F_X(x)$ in the neighborhood of $\theta$. Therefore, it is an EVT approach. Compared to Kneip et al. (2015), our model assumption allows for a broader class of $F_X$, which includes the model in Goldenshluger and Tsybakov (2004) as a special case. Secondly, we do not require any additional information such as the variance of the measurement errors, $\sigma^2$. Therefore, our approach is closer to the real situation encountered in applications.

Our estimation procedure and its asymptotic properties are discussed in Section 2. We conduct a simulation study in Section 3 and then apply our method to sports data in Section 4. Section 5 concludes. The proofs are postponed to Section 6.

2 Methodology and asymptotic results

Recall that $X_1, \dots, X_n$ are i.i.d. random variables with a continuous distribution function $F_X$ that has a finite right endpoint $\theta := \sup\{x : F_X(x) < 1\}$. Further, in a neighborhood of $\theta$, we assume the following second order expansion of the survival function $\bar F_X(x) = 1 - F_X(x)$: as $u \downarrow 0$,
\[ \bar F_X(\theta - u) = L u^{\alpha} + d\,u^{\alpha+\beta} + o\big(u^{\alpha+\beta}\big), \tag{1} \]
where $L$, $\alpha$ and $\beta$ are positive constants, and $d \neq 0$. Under the condition (1), $F_X$ belongs to the maximum domain of attraction of the Weibull distribution with an extreme value index $\gamma = -1/\alpha$.

Suppose we do not observe $\{X_i\}$ directly. Instead, we observe $Y_i = X_i + \varepsilon_i$, $i = 1, 2, \dots, n$, where $\{\varepsilon_i\}$ are i.i.d. normally distributed random errors with mean zero and unknown variance $\sigma^2$, and $\{\varepsilon_i\}$ are independent of $\{X_i\}$. The question is how to estimate $\theta$ based on the observed $\{Y_i\}$.

By showing that, as $n \to \infty$,
\[ \frac{\sqrt{\log n}}{\log\log n}\Big(\max_{1\le i\le n} Y_i - \sigma\sqrt{2\log n} - \theta\Big) = O_p(1), \]
Goldenshluger and Tsybakov (2004) proposed to estimate $\theta$ by $\max_{1\le i\le n} Y_i - \sigma\sqrt{2\log n}$, where $\sigma$ is assumed to be known. Since $\sigma$ is often unknown in reality, we propose an estimator for $\sigma$ first, and consequently an estimator for $\theta$.

Let $Z_i = Y_i - \theta$, and denote the distribution function and survival function of $Z_i$ as $F_Z$ and $\bar F_Z = 1 - F_Z$, respectively. We first show that $\bar F_Z$ has a second order expansion, as in the following proposition.

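Proposition 1. Suppose a random variable $X$ has a distribution function $F_X$ with a finite right endpoint $\theta = \sup\{x : F_X(x) < 1\} < \infty$. Assume that $F_X$ satisfies the condition (1). Let $\varepsilon$ be a normally distributed random error with mean zero and an unknown variance $\sigma^2$, independent of $X$. Denote $Z = X + \varepsilon - \theta$. Then, as $t \to \infty$,
\[ -\log \bar F_Z(t) = \frac{t^2}{2\sigma^2} + (\alpha+1)\log t - \log c_1 - \frac{c_2}{c_1}\,t^{-\beta^*}\,(1 + o(1)), \]
where $\beta^* = \min(\beta, 2)$,
\[ c_1 = \frac{\Gamma(\alpha+1)\,L\,\sigma^{2\alpha+1}}{\sqrt{2\pi}}, \]
and
\[ c_2 = \frac{1}{\sqrt{2\pi}}\,\Gamma(\alpha+\beta+1)\,d\,\sigma^{2(\alpha+\beta)+1}\,1_{\{\beta^*=\beta\}} - \frac{1}{\sqrt{2\pi}}\Big(\frac{\alpha}{2}+1\Big)\Gamma(\alpha+2)\,L\,\sigma^{2\alpha+3}\,1_{\{\beta^*=2\}}. \]

The tail property of $\bar F_Z$ motivates the following estimator of $\sigma$. Denote $Y_{1,n} \le Y_{2,n} \le \dots \le Y_{n,n}$ as the order statistics of $Y_1, \dots, Y_n$. Then the order statistics of the unobserved random variables $\{Z_i\}_{i=1}^n$ are $Z_{n-i,n} = Y_{n-i,n} - \theta$. Consider an intermediate sequence $k = k(n)$ such that $k \to \infty$ and $k/n \to 0$ as $n \to \infty$. Since $\bar F_Z(Z_{n-i,n}) \approx i/n$, we have approximately from Proposition 1 that
\[ \sqrt{-\log(i/n)} \approx \frac{Y_{n-i,n} - \theta}{\sqrt{2}\,\sigma}, \quad \text{for } i = 1, 2, \dots, k. \]
By taking the difference between $Y_{n-i,n}$ and $Y_{n-k,n}$, we get that
\[ \frac{Y_{n-i,n} - Y_{n-k,n}}{\sqrt{2}\,\sigma} \approx \sqrt{-\log(i/n)} - \sqrt{-\log(k/n)} = \frac{\log(k/i)}{\sqrt{\log(n/i)} + \sqrt{\log(n/k)}} \approx \frac{\log(k/i)}{2\sqrt{\log(n/k)}}. \]

We use the equation above and the idea of the Generalized Method of Moments to estimate $\sigma$. Take a positive continuous function $g$ on $(0, 1]$ such that $\int_0^1 g(s)\log s\,ds = -2$. We construct a weighted sum of the differences $\{Y_{n-i,n} - Y_{n-k,n}\}$ using the weights $\{g(i/k)\}$ as
\[ \frac{1}{k}\sum_{i=1}^{k} g(i/k)\,\frac{Y_{n-i,n} - Y_{n-k,n}}{\sqrt{2}\,\sigma} \approx -\frac{1}{2\sqrt{\log(n/k)}}\int_0^1 g(s)\log s\,ds = \frac{1}{\sqrt{\log(n/k)}}. \]
Hence, we get an estimator of $\sigma$ as
\[ \hat\sigma_g = \sqrt{\frac{\log(n/k)}{2}}\;\frac{1}{k}\sum_{i=1}^{k} g(i/k)\,\big(Y_{n-i,n} - Y_{n-k,n}\big). \tag{2} \]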

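Replacing $\sigma$ with $\hat\sigma_g$ in $Y_{n,n} - \sigma\sqrt{2\log n}$, an estimator of $\theta$ is then given as
\[ \hat\theta_g = Y_{n,n} - \hat\sigma_g\sqrt{2\log n}. \tag{3} \]

To obtain the asymptotic property of the estimator, we further assume that $\beta > 1$ and that the intermediate sequence $k$ satisfies the following condition: as $n \to \infty$, $k = k(n) \to \infty$, $k/n \to 0$, and
\[ \sqrt{k}\,\big(\log(n/k)\big)^{-\beta^*/2}\,\big(\log\log(n/k)\big)^{2\cdot 1_{\{\beta^*=2\}}} = O(1). \tag{4} \]

We first prove the asymptotic property for $\hat\sigma_g$. That of $\hat\theta_g$ will then follow as a direct corollary. The asymptotic normality of $\hat\sigma_g$ requires the following conditions on $g(s)$, $s \in (0, 1]$. There exists $\epsilon > 0$ such that
\[ \lim_{s \downarrow 0} g(s)\,s^{1/2-\epsilon} = 0, \quad\text{and}\quad \int_0^1 g(s)\log s\,ds = -2. \tag{5} \]
In addition, we assume a joint condition on the intermediate sequence $k$ and the function $g$ as follows,

lim n sup s t /,/ s,t gs log s gt log t =. (6)

Examples such as $g(s) = -\log s$ and $g(s) = 2(\nu+1)^2 s^{\nu}$, $\nu > -1/2$, satisfy all conditions.

The following theorem gives the asymptotic normality of $\hat\sigma_g$.

Theorem 2. Suppose $X_1, \dots, X_n$ are i.i.d. random variables with a continuous distribution function $F_X$. Assume that $F_X$ has a finite right endpoint $\theta = \sup\{x : F_X(x) < 1\}$ and satisfies the condition (1) with $\beta > 1$. Suppose $Y_i = X_i + \varepsilon_i$, $i = 1, 2, \dots, n$, where $\{\varepsilon_i\}$ are i.i.d. normally distributed random errors with mean zero and an unknown variance $\sigma^2$, and $\{\varepsilon_i\}$ are independent of $\{X_i\}$. Assume that $g$, defined on $(0, 1]$, satisfies (5), and that $k := k(n)$ is an intermediate sequence satisfying (4) and (6). Then, as $n \to \infty$, the estimator $\hat\sigma_g$ defined in (2) has the following asymptotic property:
\[ \sqrt{k}\,\big(\hat\sigma_g - \sigma\big) \xrightarrow{d} N\Big(0,\ \frac{\sigma^2}{4}\int_0^1\!\!\int_0^1 \Big(\frac{\min(s,t)}{st} - 1\Big) g(s)g(t)\,ds\,dt\Big). \]

Consequently, the estimator $\hat\theta_g$ defined in (3) possesses asymptotic normality as follows.

Corollary 1. Under the same conditions as in Theorem 2, as $n \to \infty$,
\[ \sqrt{\frac{k}{\log n}}\,\big(\hat\theta_g - \theta\big) \xrightarrow{d} N\Big(0,\ \frac{\sigma^2}{2}\int_0^1\!\!\int_0^1 \Big(\frac{\min(s,t)}{st} - 1\Big) g(s)g(t)\,ds\,dt\Big). \]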
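The two estimators, $\hat\sigma_g$ in (2) and $\hat\theta_g$ in (3), are simple functions of the top order statistics and can be sketched in a few lines. The Python fragment below is only an illustrative sketch, not the authors' code: the function name `endpoint_estimate`, the use of NumPy, and the default weight $g(s) = -\log s$ (one of the examples listed above) are assumptions made here.

```python
import numpy as np


def endpoint_estimate(y, k, g=lambda s: -np.log(s)):
    """Return (sigma_hat, theta_hat) following the forms of (2) and (3).

    y : 1-d sample of contaminated observations Y_1, ..., Y_n
    k : number of upper order statistics used, 1 <= k < n
    g : weight function on (0, 1]; the default -log(s) is one admissible choice
    """
    y = np.sort(np.asarray(y, dtype=float))  # y[j-1] is the order statistic Y_{j,n}
    n = y.size
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    i = np.arange(1, k + 1)
    # Spacings Y_{n-i,n} - Y_{n-k,n} for i = 1, ..., k (the last one is zero).
    spacings = y[n - i - 1] - y[n - k - 1]
    weighted_mean = np.mean(g(i / k) * spacings)
    sigma_hat = np.sqrt(np.log(n / k) / 2.0) * weighted_mean   # form of (2)
    theta_hat = y[-1] - sigma_hat * np.sqrt(2.0 * np.log(n))   # form of (3)
    return sigma_hat, theta_hat
```

By construction the sketch is location and scale equivariant: shifting the sample by a constant shifts `theta_hat` by the same constant and leaves `sigma_hat` unchanged, and multiplying the sample by a positive constant multiplies both estimates by it. The choice of `k` is left to the caller, as in the text.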

3 Simulations

In this section, we investigate, through simulations, the finite sample behavior of the suggested endpoint estimator. We generate observations $Y_i = X_i + \varepsilon_i$, $i = 1, 2, \dots, n$, where $\{X_i\}_{i=1}^n$ and $\{\varepsilon_i\}_{i=1}^n$ are two sets of i.i.d. random variables independently drawn from the following data generating processes. In all three cases, we set the true value of the endpoint for $X_i$ to $\theta = 0$, while setting the distribution of $\varepsilon_i$ to a normal distribution $N(0, \sigma^2)$ with two potential levels of $\sigma$.

(a) $X_i$ follows a uniform distribution on $[-1, 0]$ and $\sigma = 0.1$ or $0.2$. Notice that condition (1) holds for the uniform distribution with $\alpha = 1$ and $\beta = \infty$.

(b) $X_i$ follows a reversed Burr distribution with the following distribution function F X x = + x 2, x <, and $\sigma = 2$ or $3$. Notice that condition (1) holds for the reversed Burr distribution with α = 2 and β = 22. [1]

(c) $X_i$ follows a shifted Beta distribution with the following probability density function $f_X(x) = 42\,(-x)(1+x)^5$, $-1 < x < 0$, and $\sigma = 0.1$ or $0.2$. Notice that condition (1) holds for the shifted Beta distribution with $\alpha = 2$ and $\beta = 1$, which violates our required condition $\beta > 1$ in Theorem 2.

From each data generating process, we draw r = samples with a sample size n = 5. Then we estimate the endpoint for each sample using an estimator $\hat\theta_g$ with a specific choice, $g(s) = -\log s$ on $(0, 1]$. It is straightforward to check that $g$ satisfies the condition (5). In addition, the condition (6) is degenerate for this $g$ function. Corollary 1 implies the asymptotic property of $\hat\theta_g$ as follows. As $n \to \infty$,
\[ \sqrt{\frac{k}{\log n}}\,\big(\hat\theta_g - \theta\big) \xrightarrow{d} N\Big(0,\ \frac{5}{2}\sigma^2\Big). \]

[1] The quantiles of the reversed Burr distribution are q. = 9.72, q.25 = 5.85, q.75 = 2.59 and q

During the estimation procedure, we need to determine the value of $k$, i.e. the number of upper order statistics used. For that purpose, we perform a pre-study as follows. By varying $k$ from 5 to 5, we plot the average estimate [2]
\[ \bar\theta_g := \frac{1}{r}\sum_{i=1}^{r}\hat\theta_{g,i}, \]
the standard deviation $\sqrt{\frac{1}{r}\sum_{i=1}^{r}\big(\hat\theta_{g,i} - \bar\theta_g\big)^2}$ and the root mean squared error (RMSE) $\sqrt{\frac{1}{r}\sum_{i=1}^{r}\hat\theta_{g,i}^2}$ for each data generating process. In Figure 1, we demonstrate the results for the two data generating processes in (b). We observe that the optimal $k$ that minimizes the RMSE is achieved at about k = . We choose k = throughout the simulation study, also for the other data generating processes in (a) and (c). [3]

Figure 1: Endpoint estimation for various $k$: reversed Burr distribution. Panels: reversed Burr with $\sigma = 2$ (left) and reversed Burr with $\sigma = 3$ (right). Vertical axis: estimation error; curves: bias, standard deviation and RMSE.
Note: The figure shows the bias, standard deviation and RMSE for the estimates $\hat\theta_g$ across samples with sample size n = 5. The observations are generated by combining the reversed Burr distribution in (b) with measurement errors following $N(0, \sigma^2)$, $\sigma = 2$ (left) or $\sigma = 3$ (right).

[2] Since the true endpoint equals 0, the average estimate can be read as the estimation bias.
[3] The figures for the other data generating processes are available upon request.

We compare the performance of our suggested endpoint estimator with that of the probability weighted moment (PWM) estimator for the endpoint, which ignores the existence of measurement errors. The PWM estimator is defined as
\[ \hat\theta_{PWM} = Y_{n-k,n} - \frac{\hat a_{PWM}(n/k)}{\hat\gamma_{PWM}}, \tag{7} \]
where
\[ \hat\gamma_{PWM} := \frac{I_1 - 4I_2}{I_1 - 2I_2}, \quad\text{and}\quad \hat a_{PWM}(n/k) := \frac{2 I_1 I_2}{I_1 - 2I_2}, \]
with the probability weighted moments given by
\[ I_j = \frac{1}{k}\sum_{i=1}^{k}\Big(\frac{i-1}{k}\Big)^{j-1}\big(Y_{n-i+1,n} - Y_{n-k,n}\big), \quad j = 1, 2. \]
Note that $\hat\theta_{PWM}$ is an estimator of $\theta$ only if there is no measurement error in the observations, i.e. $\varepsilon_i = 0$, $i = 1, 2, \dots, n$. To determine the optimal choice of $k$ used in the estimator $\hat\theta_{PWM}$, we also conduct a pre-study similar to the aforementioned procedure. We decide to choose k = 2 for all data generating processes when applying the PWM estimator. Notice that for the PWM estimator, we stop estimating the endpoint if $\hat\gamma_{PWM} > 0$, because the PWM endpoint estimator is valid only for $\gamma < 0$. In other words, from the $r$ samples, we may end up with fewer than $r$ endpoint estimates when using the PWM estimator.

For each data generating process, we plot the estimated endpoints using the two estimators across all samples in boxplots; see Figure 2. We observe that $\hat\theta_{PWM}$ largely overestimates the true endpoint. In contrast, our estimator $\hat\theta_g$ performs well across all data generating processes, and consistently outperforms the PWM estimator. The medians across simulated samples are close to the true endpoint and the variations are lower.
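The comparison described above can be mimicked with a small Monte Carlo experiment. The sketch below is not the authors' simulation code. It assumes the uniform design (a) with $\theta = 0$, uses the weight $g(s) = -\log s$, and implements the PWM estimator in its textbook form (de Haan and Ferreira, 2006), which is an assumption here. The helper names (`theta_g`, `theta_pwm`, `summarize`, `run`) and the default values of `r`, `n`, `sigma`, `k_g` and `k_pwm` are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def theta_g(y, k):
    """Proposed endpoint estimator with weight g(s) = -log(s)."""
    y = np.sort(y)
    n = y.size
    i = np.arange(1, k + 1)
    spacings = y[n - i - 1] - y[n - k - 1]            # Y_{n-i,n} - Y_{n-k,n}
    sigma_hat = np.sqrt(np.log(n / k) / 2.0) * np.mean(-np.log(i / k) * spacings)
    return y[-1] - sigma_hat * np.sqrt(2.0 * np.log(n))


def theta_pwm(y, k):
    """Textbook PWM endpoint estimator; returns nan unless gamma_hat < 0."""
    y = np.sort(y)
    n = y.size
    i = np.arange(1, k + 1)
    excess = y[n - i] - y[n - k - 1]                  # Y_{n-i+1,n} - Y_{n-k,n}
    m1 = np.mean(excess)
    m2 = np.mean((i - 1) / k * excess)
    gamma_hat = (m1 - 4.0 * m2) / (m1 - 2.0 * m2)
    scale_hat = 2.0 * m1 * m2 / (m1 - 2.0 * m2)
    if not gamma_hat < 0:
        return np.nan
    return y[n - k - 1] - scale_hat / gamma_hat


def summarize(estimates, theta=0.0):
    """Bias, standard deviation, RMSE and count over replications with an estimate."""
    est = np.asarray(estimates, dtype=float)
    est = est[~np.isnan(est)]
    bias = est.mean() - theta
    sd = est.std()                                    # divides by the number of estimates
    rmse = np.sqrt(np.mean((est - theta) ** 2))
    return bias, sd, rmse, est.size


def run(r=200, n=2000, sigma=0.2, k_g=100, k_pwm=200, seed=0):
    """Simulate Y = X + eps with X uniform on [-1, 0] and eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    out_g, out_pwm = [], []
    for _ in range(r):
        y = rng.uniform(-1.0, 0.0, n) + rng.normal(0.0, sigma, n)
        out_g.append(theta_g(y, k_g))
        out_pwm.append(theta_pwm(y, k_pwm))
    return summarize(out_g), summarize(out_pwm)
```

Calling `run()` returns two tuples `(bias, sd, rmse, count)`, one per estimator; the count for the PWM estimator can be smaller than `r` because replications with a non-negative $\hat\gamma_{PWM}$ are discarded. Repeating the call over a grid of `k_g` values gives the kind of pre-study used above to pick $k$.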

Figure 2: Boxplots of estimated endpoints. Panel (i): $\sigma = 0.1$ and $\sigma = 0.2$; panel (ii): $\sigma = 2$ and $\sigma = 3$; panel (iii): $\sigma = 0.1$ and $\sigma = 0.2$. Vertical axis: estimated endpoints; boxes: $\hat\theta_g$ and $\hat\theta_{PWM}$.
Note: The three plots show the estimated endpoints using the suggested endpoint estimator and the PWM estimator for six data generating processes. Each plot is based on samples with sample size n = 5. In panel (i), the observations are generated by combining the uniform distribution on $[-1, 0]$ with measurement errors following $N(0, \sigma^2)$, $\sigma = 0.1$ (left) or $\sigma = 0.2$ (right). In panel (ii), the observations are generated by combining the reversed Burr distribution in (b) with measurement errors following $N(0, \sigma^2)$, $\sigma = 2$ (left) or $\sigma = 3$ (right). In panel (iii), the observations are generated by combining the shifted Beta distribution in (c) with measurement errors following $N(0, \sigma^2)$, $\sigma = 0.1$ (left) or $\sigma = 0.2$ (right). The endpoints are estimated by $\hat\theta_g$ with k = and $\hat\theta_{PWM}$ in (7) with k = 2. Horizontal lines indicate the true endpoints.

4 Application

In order to investigate the limit of human beings in sports, Einmahl and Magnus (2008) apply endpoint estimation to the training data of top athletes. Initially, they gathered data for 28 types of sports. However, they stopped estimating the endpoint for five out of the 28 sports because the estimated extreme value indices, $\gamma$, for these five sports are close to 0. [4]

Continuing from their study, we apply our endpoint estimator to the outdoor long jump data (both men and women) for two reasons. Firstly, from a theoretical perspective, if we assume that the observed training data are contaminated by normally distributed measurement errors, the extreme value index $\gamma$ for the observations should be equal to 0. This is in line with the empirical observations in Einmahl and Magnus (2008). We will justify this argument by repeating the estimation of the extreme value index $\gamma$ for the training data and testing whether it is significantly below zero. Different from Einmahl and Magnus (2008), we employ the PWM estimator for this analysis. Secondly, for the outdoor long jump, the presence of wind can be a potential factor generating such a measurement error. To justify this argument, we apply our suggested estimator to the training data for the outdoor long jump, and compare it with the PWM endpoint estimator applied to the training data for the indoor long jump. Assuming that the indoor long jump is much less affected by wind, we expect the two endpoints estimated from these two different datasets to agree with each other.

We collect the data from the official website of the International Association of Athletics Federations (IAAF) [5] for the indoor and outdoor long jump, both for men and women. In total, we construct four datasets for these four sports. For each sport, the website presents the all-time personal bests of the top athletes. In addition, for the indoor long jump (both men and women), the website provides the personal bests of the top athletes in each year from 1999 to 2016. Consequently, for each of these two sports, we combine the data across the aforementioned 19 lists. Similarly, for men's outdoor long jump, we combine the data across 17 lists because the records for two of the years are not available. For women's outdoor long jump, 18 lists are combined because the records for one year are missing. When combining the data for each sport, we keep only the best record for each athlete across the lists. Table 1 gives a summary of the number of athletes and the best and worst achievements for each of the four sports.

Since there are clusters in the present data, we smooth each dataset using the method suggested in Einmahl and Magnus (2008) as follows. For example, if $c$ athletes share the same personal best, $l = 8.47$ m, we smooth them by
\[ l_i = 8.465 + 0.01\,\frac{2i-1}{2c}, \quad i = 1, \dots, c. \]

Table 1: Data description. Columns, for men and for women: the number of athletes, the best achievement and the worst achievement. Rows: outdoor long jump and indoor long jump.
Note: The table presents the descriptive information regarding the four datasets used in the application. The four datasets correspond to the indoor and outdoor long jump data for both men and women. Each dataset consists of top athletes' personal bests in the corresponding sport.

We start with estimating the extreme value index $\gamma$ by the PWM estimator. Figure 3 shows the estimates $\hat\gamma_{PWM}$ against various values of $k$. To balance the estimation bias and variance, we choose $k$ from the first stable region in each plot. Table 2 reports the chosen values of $k$ and the corresponding estimated $\gamma$ with its 95% confidence intervals. From the confidence intervals, we observe that for the indoor long jump, $\gamma = 0$ is rejected for both men and women. Hence, the endpoints of these two distributions exist. By contrast, for the outdoor long jump, we cannot reject $\gamma = 0$ for either men or women. These results agree with the finding in Einmahl and Magnus (2008) and support considering the outdoor data as contaminated by measurement errors.

[4] The five sports are 10,000 m running and outdoor long jump for both men and women, together with men's 400 m running.
[5] See
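The smoothing of tied personal bests described above can be sketched as follows. This is a minimal illustration, assuming that records are reported to the nearest 0.01 m and that $c$ tied values are spread evenly over the rounding interval around the reported value, in the spirit of Einmahl and Magnus (2008); the function name `smooth_ties` and the `step` parameter are hypothetical.

```python
from collections import Counter


def smooth_ties(records, step=0.01):
    """Spread tied records evenly over (value - step/2, value + step/2).

    The j-th of c tied records with reported value v becomes
    v - step/2 + step * (2j - 1) / (2c), j = 1, ..., c.
    """
    counts = Counter(records)          # size c of each tie group
    seen = Counter()                   # how many members of a group were handled so far
    smoothed = []
    for value in records:
        c = counts[value]
        seen[value] += 1
        j = seen[value]
        smoothed.append(value - step / 2 + step * (2 * j - 1) / (2 * c))
    return smoothed
```

For instance, two athletes sharing 8.47 m would be mapped to 8.4675 m and 8.4725 m, while a record that is not tied stays at its reported value.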

Next, we continue using the PWM method to estimate the endpoints for the indoor long jump data, while using our new estimator, $\hat\theta_g$, to estimate the endpoint for the outdoor long jump data. We plot the endpoint estimates against various values of $k$ in Figure 4. A technical difference between these two estimators concerns the sample size $n$. For the PWM estimator, the sample size $n$ is not used, whereas for our new method it is necessary to know the sample size $n$ in advance. Obviously, a lower bound for $n$ is the current sample size, i.e. the number of top athletes included on the IAAF website. We are aware of the caveat that this number may underestimate the true number of top athletes who may potentially produce a similar performance. To address this caveat and test the sensitivity to $n$, we take an arbitrary value n = 3, which is much higher than the current sample sizes.

Table 2 presents the endpoint estimates with their 95% confidence intervals based on the selected values of $k$. Firstly, we observe that the values of $\hat\theta_g$ based on the outdoor long jump data are not sensitive to the sample size $n$, particularly after considering the estimation error reflected by the confidence intervals. Secondly, although the point estimates from applying our new estimator to the outdoor long jump data are consistently lower than those from applying the PWM method to the indoor long jump data, the two results agree with each other to a certain extent. The point estimates from applying our new estimator to the outdoor long jump data fall into the confidence interval based on applying the PWM method to the indoor long jump data, and vice versa.

Finally, we compare the results to the actual observations in the dataset. Although the PWM estimator based on the indoor long jump data suggests an endpoint of 8.79 for men's long jump, and a corresponding endpoint for women's long jump, there are multiple observations in the outdoor long jump data that exceed those estimated endpoints: for men's long jump there are four observations higher than 8.79, while for women's long jump there are six observations higher than the estimated endpoint. Following our assumption, having an observation $Y$ above the endpoint of the distribution of $X$ must be due to a positive value of the measurement error $\varepsilon$. In the context of the long jump, it implies that the wind should have helped in delivering these personal bests. The IAAF website provides the wind speed at the time when the long jump data were recorded: for eight out of the nine cases, the recorded wind speeds were positive. The only exception is the 8.87 m set by Carl Lewis at the 1991 Tokyo World Championships, where the wind speed was recorded as -0.2 m/s. Aside from this exceptional case, the other eight cases support the view that the wind can be a potential factor causing measurement errors in long jump performance.

Figure 3: Estimation of the extreme value indices. Panels: indoor long jump for men, indoor long jump for women, outdoor long jump for men, outdoor long jump for women. Vertical axis: $\hat\gamma_{PWM}$.
Note: The plots show the estimated extreme value indices with the corresponding 95% confidence intervals for various values of $k$ using the PWM estimator.

Figure 4: Estimation of the endpoints. Panels: indoor long jump for men and indoor long jump for women (vertical axis: $\hat\theta_{PWM}$); outdoor long jump for men, n = 776, and outdoor long jump for women, n = 76 (vertical axis: $\hat\theta_g$).
Note: The plots show the estimated endpoints with the corresponding 95% confidence intervals for various values of $k$. For the indoor long jump data, the endpoints are estimated by the PWM estimator $\hat\theta_{PWM}$. For the outdoor long jump data, the endpoints are estimated by the estimator $\hat\theta_g$.

Table 2: Estimation results: long jump data. Columns, for men and for women: $k$, $n$, point estimate, 95% C.I. Surviving entries (95% C.I.):

Outdoor, $\hat\gamma_{PWM}$: men [ -.28, .23 ]; women [ -.277, .9 ]
Outdoor, $\hat\theta_g$: men [ 7.8, 8.85 ]; women [ 6.97, ]
Outdoor, $\hat\theta_g$: men [ 7.528, ]; women [ 6.824, 7.48 ]
Indoor, $\hat\gamma_{PWM}$: men [ -.75, ]; women [ -.56, -. ]
Indoor, $\hat\theta_{PWM}$: men [ 8.226, ]; women [ 6.85, ]

Note: The table shows the estimation results based on the long jump data. For both the indoor and the outdoor long jump, the estimated extreme value indices using the PWM estimator are presented. For the outdoor long jump data, the endpoints are estimated using the estimator $\hat\theta_g$, setting the number of athletes $n$ to either the current sample size or 3. For the indoor long jump data, the endpoints are estimated using the PWM estimator. For all estimates, the corresponding 95% confidence intervals are provided.
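Confidence intervals of the kind reported for $\hat\theta_g$ can be obtained from the limit in Corollary 1 by a plug-in argument. The following sketch assumes the weight $g(s) = -\log s$, for which the limiting variance in $\sqrt{k/\log n}\,(\hat\theta_g - \theta)$ is $\frac{5}{2}\sigma^2$, and replaces $\sigma$ by an estimate. Whether the reported intervals were computed exactly this way is not stated in the text, and the function name and arguments are hypothetical.

```python
import math


def endpoint_confidence_interval(theta_hat, sigma_hat, n, k, z=1.959964):
    """Plug-in interval based on sqrt(k / log n) * (theta_hat - theta) ~ N(0, 2.5 * sigma^2).

    theta_hat, sigma_hat : point estimates of the endpoint and of sigma
    n, k                 : sample size and number of upper order statistics used
    z                    : standard normal quantile (default: two-sided 95% level)
    """
    half_width = z * sigma_hat * math.sqrt(2.5 * math.log(n) / k)
    return theta_hat - half_width, theta_hat + half_width
```

Because $n$ enters only through $\log n$, the width of the interval changes slowly with the assumed sample size, which is consistent with the limited sensitivity to $n$ discussed above.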

5 Conclusion

In this paper, we consider the estimation of the finite endpoint $\theta$ of a distribution function $F_X$. Instead of having observations drawn from $F_X$, we only observe a contaminated sample $Y_i = X_i + \varepsilon_i$, $i = 1, 2, \dots, n$, where $X_i$ follows the distribution $F_X$ and $\varepsilon_i$ is a measurement error following $N(0, \sigma^2)$. We start with proposing a class of estimators $\hat\sigma_g$ for $\sigma$, depending on an appropriate weighting function $g$ on $(0, 1]$. Then we suggest an estimator $\hat\theta_g = \max_{1\le i\le n} Y_i - \hat\sigma_g\sqrt{2\log n}$ for estimating the endpoint. Both estimators $\hat\sigma_g$ and $\hat\theta_g$ possess asymptotic normality.

We demonstrate, by extensive simulation studies, the superior performance of our suggested estimator compared with the PWM estimator, which ignores the presence of measurement errors. In addition, we apply our suggested estimator to resolve the difficulty encountered by Einmahl and Magnus (2008): the estimated extreme value index is close to, and not significantly different from, zero. By assuming the presence of measurement errors stemming from the wind, we apply our suggested endpoint estimator to the outdoor long jump data. The results are comparable with those from applying the PWM estimator to the indoor long jump data, for which the impact of wind is negligible.

6 Proofs

Proof of Proposition 1. Write
\[ \bar F_Z(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{0} e^{-\frac{(x-t)^2}{2\sigma^2}}\,\bar F_X(\theta+t)\,dt = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}}\int_{-\infty}^{0}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt. \]
The proof is split into two parts. First, we show that as $x \to \infty$, the integral above is dominated by the integral over a neighborhood of zero, i.e., for some $\epsilon > 0$,
\[ \bar F_Z(x) = \frac{e^{-\frac{x^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma}\int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt\;\Big[1 + O\big(x^{\alpha+1}e^{-\epsilon x/\sigma^2}\big)\Big]. \tag{8} \]
Then, we calculate the integral on the right hand side of (8).

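We first handle the equation (8). Since
\[ \int_{-\infty}^{-\epsilon}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt \le \exp\Big(-\frac{\epsilon x}{\sigma^2}\Big)\int_{-\infty}^{-\epsilon}\exp\Big(-\frac{t^2}{2\sigma^2}\Big)dt \le \sqrt{2\pi}\,\sigma\exp\Big(-\frac{\epsilon x}{\sigma^2}\Big), \]
and
\[ \int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt \ge e^{-\frac{\epsilon^2}{2\sigma^2}}\int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2}\Big)\bar F_X(\theta+t)\,dt, \]
(8) is proved by showing that, as $x \to \infty$,
\[ \Big[\int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2}\Big)\bar F_X(\theta+t)\,dt\Big]^{-1} = O\big(x^{\alpha+1}\big). \tag{9} \]

Denote $G(t) := \bar F_X(\theta - t)$. Then as $t \downarrow 0$, $G(t) = Lt^{\alpha} + dt^{\alpha+\beta} + o(t^{\alpha+\beta})$. The integral in (9) is then calculated as
\[ \int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2}\Big)\bar F_X(\theta+t)\,dt = \frac{\sigma^2}{x}\int_{\frac{\sigma^2}{\epsilon x}}^{\infty} e^{-1/t}\,t^{-2}\,G\Big(\frac{\sigma^2}{tx}\Big)dt \]
\[ = \frac{\sigma^2}{x}\,G\Big(\frac{\sigma^2}{x}\Big)\Big[\int_{\frac{\sigma^2}{\epsilon x}}^{\infty} e^{-1/t}\,t^{-2}\Big(\frac{G(\sigma^2/(tx))}{G(\sigma^2/x)} - t^{-\alpha}\Big)dt + \int_{\frac{\sigma^2}{\epsilon x}}^{\infty} e^{-1/t}\,t^{-(\alpha+2)}\,dt\Big] =: \frac{\sigma^2}{x}\,G\Big(\frac{\sigma^2}{x}\Big)\,[I_1 + I_2]. \]

To deal with $I_1$, we use the fact that the function $G$ is regularly varying, i.e. $\lim_{t\downarrow 0} G(tu)/G(t) = u^{\alpha}$. From Proposition B.1.10 in de Haan and Ferreira (2006), for any $\eta > 0$ there exists $x_0 = x_0(\eta)$ such that if $x/\sigma^2 \ge x_0$ and $tx/\sigma^2 \ge x_0$,
\[ \Big|\frac{G(\sigma^2/(tx))}{G(\sigma^2/x)} - t^{-\alpha}\Big| \le \eta\max\big(t^{-\alpha+\eta}, t^{-\alpha-\eta}\big). \]
By choosing $\epsilon \le 1/x_0$, we have that for all $t > \frac{\sigma^2}{\epsilon x}$ and sufficiently large $x$ such that $x > \sigma^2 x_0$, the two required conditions hold. Therefore, we can apply the inequality above to obtain that
\[ \int_{\frac{\sigma^2}{\epsilon x}}^{\infty} e^{-1/t}\,t^{-2}\,\Big|\frac{G(\sigma^2/(tx))}{G(\sigma^2/x)} - t^{-\alpha}\Big|\,dt \le \eta\int_{0}^{\infty} e^{-1/t}\,t^{-2-\alpha}\big(t^{\eta} + t^{-\eta}\big)dt < \infty. \]
Thus we can apply the dominated convergence theorem to get that $I_1 \to 0$ as $x \to \infty$.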

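By verifying that $I_2 \to \Gamma(\alpha+1)$ as $x \to \infty$ and $\frac{\sigma^2}{x}G\big(\frac{\sigma^2}{x}\big) = O(x^{-\alpha-1})$, we have proved (9) and consequently (8), for any $\epsilon \le 1/x_0$.

Next we calculate the integral on the right hand side of (8). We first perform a variable transformation such that the term in the exponent, $\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}$, is replaced by a new term $-s$, i.e. we define $s = \frac{t^2}{2\sigma^2} - \frac{tx}{\sigma^2}$. This transformation is one to one for $t \in [-\epsilon, 0]$, with the inverse transformation
\[ t = x - \sqrt{x^2 + 2s\sigma^2} =: -\frac{s\sigma^2}{x}\,\phi(s; x), \]
where
\[ \phi(s; x) = 1 - \frac{s\sigma^2}{2x^2}\,(1 + o(1)), \tag{10} \]
as $x \to \infty$, and the $o(1)$ term is uniform for all $0 \le s \le \epsilon(x + \epsilon/2)/\sigma^2$. Write
\[ \int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt = \frac{\sigma^2}{x}\int_{0}^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}\Big(1 + \frac{2\sigma^2}{x^2}s\Big)^{-1/2} G\Big(\frac{\sigma^2}{x}\,s\phi(s; x)\Big)ds. \]

To calculate this integral, we again need a comparison between $G\big(\frac{\sigma^2}{x}s\phi(s;x)\big)$ and $G\big(\frac{\sigma^2}{x}\big)$. However, here we need a second order expansion. Notice that the function $w \mapsto G(1/w)$ satisfies the second order regular variation condition
\[ \lim_{w\to\infty}\frac{\frac{G(1/(wu))}{G(1/w)} - u^{-\alpha}}{\frac{d\beta}{L}\,w^{-\beta}} = u^{-\alpha}\,\frac{u^{-\beta} - 1}{\beta}. \]
By Theorem B.2.18 in de Haan and Ferreira (2006), for all $\eta > 0$ there exists $x_0 = x_0(\eta)$ such that for $w, wu \ge x_0$,
\[ \Big|\frac{\frac{G(1/(wu))}{G(1/w)} - u^{-\alpha}}{A(w)} - u^{-\alpha}\,\frac{u^{-\beta} - 1}{\beta}\Big| \le \eta\max\big(u^{-\alpha-\beta+\eta}, u^{-\alpha-\beta-\eta}\big), \]
where $A(w) := \frac{d\beta}{L}w^{-\beta}$. We intend to apply this inequality with $w = \frac{x}{\sigma^2}$ and $wu = \frac{x}{\sigma^2 s\phi(s;x)}$. For sufficiently large $x$, $w = \frac{x}{\sigma^2} > x_0$. For $wu$, notice that $\frac{x}{\sigma^2 s\phi(s;x)} = -1/t \ge 1/\epsilon$. We get the required condition by choosing $\epsilon$ such that $\epsilon < 1/x_0$. Hence we obtain from the above inequality that
\[ |Z(s; x)| := \Big|\frac{1}{A(x/\sigma^2)}\Big(\frac{G\big(\frac{\sigma^2}{x}s\phi(s;x)\big)}{G\big(\frac{\sigma^2}{x}\big)} - \big(s\phi(s;x)\big)^{\alpha}\Big) - \big(s\phi(s;x)\big)^{\alpha}\,\frac{\big(s\phi(s;x)\big)^{\beta} - 1}{\beta}\Big| \le \eta\max\Big(\big(s\phi(s;x)\big)^{\alpha+\beta-\eta}, \big(s\phi(s;x)\big)^{\alpha+\beta+\eta}\Big). \tag{11} \]

This inequality allows us to write the integral on the right hand side of (8) as follows. With $T := \epsilon(x+\epsilon/2)/\sigma^2$ and $h(s; x) := e^{-s}\big(1 + \frac{2\sigma^2}{x^2}s\big)^{-1/2}$,
\[ \frac{\sigma^2}{x}\int_0^{T} h(s;x)\,G\Big(\frac{\sigma^2}{x}s\phi(s;x)\Big)ds = \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)\int_0^{T} h(s;x)\big(s\phi(s;x)\big)^{\alpha}ds \]
\[ \quad + \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)A\Big(\frac{x}{\sigma^2}\Big)\int_0^{T} h(s;x)\big(s\phi(s;x)\big)^{\alpha}\,\frac{\big(s\phi(s;x)\big)^{\beta} - 1}{\beta}\,ds + \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)A\Big(\frac{x}{\sigma^2}\Big)\int_0^{T} h(s;x)\,Z(s;x)\,ds \]
\[ =: J_1 + J_2 + J_3. \]
In all three terms, we need to deal with integrals of the form
\[ I(x; \nu) := \int_0^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}\Big(1 + \frac{2\sigma^2}{x^2}s\Big)^{-1/2}\big(s\phi(s;x)\big)^{\nu}\,ds, \]
for $\nu > 0$. Notice that $\frac{2\sigma^2}{x^2}s \to 0$ and $\phi(s;x) \to 1$ as $x \to \infty$ hold uniformly for $0 \le s \le \epsilon(x+\epsilon/2)/\sigma^2$. By using the dominated convergence theorem, we get that as $x \to \infty$,
\[ I(x; \nu) \to \int_0^{\infty} e^{-s}s^{\nu}\,ds = \Gamma(\nu+1). \tag{12} \]

Further write $\big(1 + \frac{2\sigma^2}{x^2}s\big)^{-1/2} = 1 - \frac{\sigma^2}{x^2}s\,(1 + o(1))$ as $x \to \infty$, where the $o(1)$ term is again uniform for all $0 \le s \le \epsilon(x+\epsilon/2)/\sigma^2$. Then,
\[ I(x; \nu) = \int_0^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}\big(s\phi(s;x)\big)^{\nu}ds - \frac{\sigma^2}{x^2}\int_0^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}\,s\,\big(s\phi(s;x)\big)^{\nu}ds\,(1 + o(1)) =: II_1 - II_2. \]
By applying the dominated convergence theorem again, we obtain that as $x \to \infty$, $x^2\,II_2 \to \sigma^2\,\Gamma(\nu+2)$. For $II_1$, we use the expansion of $\phi(s;x)$ in (10) to get that
\[ II_1 = \int_0^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}s^{\nu}\,ds - \frac{\nu\sigma^2}{2x^2}\int_0^{\epsilon(x+\epsilon/2)/\sigma^2} e^{-s}s^{\nu+1}\,ds\,(1 + o(1)) \]
\[ = \Gamma(\nu+1) + o(x^{-2}) - \frac{\nu\sigma^2}{2x^2}\,\Gamma(\nu+2)\,(1 + o(1)) = \Gamma(\nu+1) - \frac{\nu\sigma^2}{2x^2}\,\Gamma(\nu+2) + o(x^{-2}). \]
By combining the two terms we get a second order expansion of $I(x; \nu)$: as $x \to \infty$,
\[ I(x; \nu) = \Gamma(\nu+1) - \frac{\sigma^2}{x^2}\Big(\frac{\nu}{2} + 1\Big)\Gamma(\nu+2) + o(x^{-2}). \tag{13} \]

By applying (13) with $\nu = \alpha$, we have that as $x \to \infty$,
\[ J_1 = \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)\Big[\Gamma(\alpha+1) - \frac{\sigma^2}{x^2}\Big(\frac{\alpha}{2} + 1\Big)\Gamma(\alpha+2) + o(x^{-2})\Big]. \]
By applying (12) with $\nu = \alpha$ and $\nu = \alpha+\beta$, we get that as $x \to \infty$,
\[ J_2 = \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)A\Big(\frac{x}{\sigma^2}\Big)\Big[\frac{\Gamma(\alpha+\beta+1) - \Gamma(\alpha+1)}{\beta} + o(1)\Big]. \]
Lastly, based on the inequality (11) and (12), we get that $J_3 = o(J_2)$ as $x \to \infty$.

By combining all three terms, we obtain that as $x \to \infty$,
\[ \int_{-\epsilon}^{0}\exp\Big(\frac{tx}{\sigma^2} - \frac{t^2}{2\sigma^2}\Big)\bar F_X(\theta+t)\,dt \]
\[ = \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)\Big[\Gamma(\alpha+1) - \frac{\sigma^2}{x^2}\Big(\frac{\alpha}{2}+1\Big)\Gamma(\alpha+2) + o(x^{-2}) + A\Big(\frac{x}{\sigma^2}\Big)\Big(\frac{\Gamma(\alpha+\beta+1) - \Gamma(\alpha+1)}{\beta} + o(1)\Big)\Big] \]
\[ = \frac{\sigma^2}{x}G\Big(\frac{\sigma^2}{x}\Big)\Big[\Gamma(\alpha+1) + \frac{\sqrt{2\pi}\,c_2}{L\sigma^{2\alpha+1}}\,x^{-\beta^*} + o(x^{-\beta^*}) - \frac{d}{L}\,\Gamma(\alpha+1)\,\sigma^{2\beta}x^{-\beta}\Big] \]
\[ = L\,x^{-\alpha-1}\sigma^{2\alpha+2}\Big(1 + \frac{d}{L}\sigma^{2\beta}x^{-\beta} + o(x^{-\beta})\Big)\Big[\Gamma(\alpha+1) + \frac{\sqrt{2\pi}\,c_2}{L\sigma^{2\alpha+1}}\,x^{-\beta^*} + o(x^{-\beta^*}) - \frac{d}{L}\,\Gamma(\alpha+1)\,\sigma^{2\beta}x^{-\beta}\Big] \]
\[ = \sqrt{2\pi}\,\sigma\,x^{-\alpha-1}\big(c_1 + c_2\,x^{-\beta^*} + o(x^{-\beta^*})\big), \]
where $c_1$, $\beta^*$ and $c_2$ are defined as in the proposition. Finally the proposition is proved by substituting this relation into (8).

Next we prove Theorem 2. Write the estimator $\hat\sigma_g$ in (2) as
\[ \frac{\hat\sigma_g}{\sigma} = \log(n/k)\int_{1/k}^{1} g\big([ks]/k\big)\,\frac{Z_{n-[ks],n} - Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}}\,ds. \]
We first establish the asymptotic properties of the tail quantile process $\{Z_{n-[ks],n}\}$ as follows.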

22 Proposition 3. Assume the same conditions as in Proposition. Then there exists a sequence of standard Brownian motions {W n s : s } such that as n, Z n [s],n σ 2 logn/ = ψ,n + ψ 2,n s + /2 s W n s + s /2 δ o p 2 logn/ where α + log logn/ 4 logn/ ψ,n = + logn/, c 3 = 2 log c α + logσ 2, qn/ = ψ 2,n s = σ β c 2 2 +β/2 c logn/ β 2 β < 2 α+2 log s 2 logn/ and all terms o p are uniform for s [, ]., 32 logn/ log logn/ 2 β 2 α + log logn/ + + o, 4 logn/ c 3 + qn/ logn/ + o p Proof. Denote U = log F Z, where denotes the left continuous inverse function. Write Z i = UE i where {E i } n i= is a sample of i.i.d. standard exponential distributed random variables. Let E,n E 2,n E n,n be the corresponding order statistics. To obtain the asymptotic properties of {Z n [s],n, s }, we derive the expansion of Ut and the asymptotic properties of {E n [s],n, s }. From the tail expansion of log F Z in Proposition, we get that as t, t = 2 Ut2 + α + log Ut log c c 2 c Ut β + o. 4 Since Ut as t, it implies that Ut σ 2t. Write Ut = σ 2t + rt, where rt = o t. We intend to obtain further explicit expression for the rt term. Substituting Ut in 4 by σ 2t + rt yields that, as t, = r 2 t + 2σ 2t rt + α + log t + 2 α + logσ 2 log c + 2 α + log + rt σ 2σ2 c 2 σ β 2t + o. 2t c 22

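Since $r(t) = o(\sqrt t)$ as $t\to\infty$, we can solve this equation to obtain that

$$r(t) = -\frac{\sigma(\alpha+1)}{2\sqrt2}\,t^{-1/2}\log t + \frac{\sigma}{\sqrt2}\big\{\log c_1 - (\alpha+1)\log(\sigma\sqrt2)\big\}\,t^{-1/2} + t^{-1/2}\,\tilde q(t),$$

where $\tilde q(t) = o(1)$ as $t\to\infty$. By reiterating this procedure, we eventually obtain that as $t\to\infty$,

$$U(t) = \sigma\sqrt{2t} - \frac{\sigma(\alpha+1)}{2\sqrt2}\,t^{-1/2}\log t + \sigma\sqrt2\,c_3\,t^{-1/2} + t^{-1/2}\,\tilde q(t)\{1+o(1)\}, \tag{15}$$

where

$$\tilde q(t) = \begin{cases} \dfrac{\sigma^{1-\beta}c_2}{2^{(1+\beta)/2}\,c_1}\,t^{-\beta/2}, & \beta<2,\\[2ex] -\dfrac{\sigma(\alpha+1)^2}{16\sqrt2}\,t^{-1}\log^2 t, & \beta\ge 2.\end{cases}$$

Note that $\tilde q$ is related to the $q$ function defined in this proposition by $q(t) = \tilde q(\log t)/(\sqrt2\,\sigma)$.

Next we derive the asymptotic properties of $\{E_{n-[ks],n}\}$ from a theorem in de Haan and Ferreira (2006) as follows. For any $\delta>0$, there exists a sequence of standard Brownian motions $\{W_n(s)\}_{s\ge0}$ such that as $n\to\infty$,

$$\sup_{1/k\le s\le 1} s^{1/2+\delta}\,\Big|\sqrt k\,\big(E_{n-[ks],n} - \log(n/k) + \log s\big) - s^{-1}W_n(s)\Big| \xrightarrow{p} 0.$$

When plugging the asymptotic expansion of $E_{n-[ks],n}$ into $Z_{n-[ks],n} = U(E_{n-[ks],n})$, we observe that the asymptotic property of $Z_{n-[ks],n}/\sqrt{\log(n/k)}$ is mainly driven by $\big(E_{n-[ks],n}/\log(n/k)\big)^{1/2}$. In order to make a smooth substitution, we first derive the asymptotic behavior of $\big(E_{n-[ks],n}/\log(n/k)\big)^{\gamma}$ for any given $\gamma\in\mathbb R$. Write

$$\Big(\frac{E_{n-[ks],n}}{\log(n/k)}\Big)^{\gamma} = \Big[1 + \frac{E_{n-[ks],n}-\log(n/k)}{\log(n/k)}\Big]^{\gamma} = 1 + \gamma\,\frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)} + \theta\,\Big(\frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)}\Big)^{2},$$

where $\theta$ (here a Taylor coefficient, not the endpoint) is given by $\theta = \frac12\,\frac{\partial^2(1+x)^{\gamma}}{\partial x^2}\Big|_{x=\xi}$ for some $\xi$ between $0$ and $\big\{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})\big\}/\log(n/k)$. Since as $k\to\infty$, uniformly for all $1/k\le s\le 1$,

$$\frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)} \xrightarrow{p} 0$$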

and $\partial^2(1+x)^{\gamma}/\partial x^2$ is bounded in the neighborhood of zero, we get that with probability tending to 1, $\theta$ is bounded uniformly for all $1/k\le s\le 1$. By verifying that the quadratic term

$$\Big(\frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)}\Big)^{2}$$

can be uniformly written as $\frac{s^{-1/2-\delta}}{\log(n/k)}\,o_p(k^{-1/2})$, we get that as $n\to\infty$,

$$\Big(\frac{E_{n-[ks],n}}{\log(n/k)}\Big)^{\gamma} = 1 + \gamma\,\frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)},$$

where the $o_p$ term is uniform for all $1/k\le s\le 1$. For $\gamma=0$, the relation should be read as

$$\log E_{n-[ks],n} - \log\log(n/k) = \frac{-\log s + k^{-1/2}s^{-1}W_n(s) + o_p(k^{-1/2}s^{-1/2-\delta})}{\log(n/k)}.$$

Finally, by plugging $E_{n-[ks],n}$ into $Z_{n-[ks],n} = U(E_{n-[ks],n})$, while using the expansion of $U$ in (15) and the asymptotic expansion of $\big(E_{n-[ks],n}/\log(n/k)\big)^{\gamma}$ for $\gamma = 1/2$, $-1/2$, and $-(1+\beta)/2$, we obtain the result of Proposition 3.

Remark. In the expansion of the tail quantile process, $\psi_{1,n}$ is a deterministic term not depending on $s$, $\psi_{2,n}(s)$ is a deterministic term depending on $s$, the third component gives a random term, and finally $\frac{q(n/k)}{\log(n/k)}\{1+o_p(1)\}$ has a uniform approximation independent of $s$. It will become clear in the proof of Theorem 2 that such a detailed expansion is necessary for achieving the intended asymptotic results.

We use Proposition 3 to prove the asymptotic normality of $\hat\sigma_g$.

Proof of Theorem 2. Write

$$\frac{\hat\sigma_g}{\sigma} = \frac{\log(n/k)}{k}\,g\Big(\frac1k\Big)\frac{Z_{n-1,n}-Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}} + \log(n/k)\int_{2/k}^{1} g\Big(\frac{[ks]}{k}\Big)\frac{Z_{n-[ks],n}-Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}}\,ds =: I_1 + I_2.$$

From Proposition 3, we get that, as $n\to\infty$,

$$\frac{Z_{n-[ks],n}-Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}} = \frac{-\log s + k^{-1/2}\{s^{-1}W_n(s) - W_n(1)\} + o_p(k^{-1/2}s^{-1/2-\delta})}{2\log(n/k)}$$

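$$\qquad - \frac{(\alpha+1)\log\log(n/k)}{4\log(n/k)}\cdot\frac{\log s}{2\log(n/k)}\{1+o(1)\} + \frac{q(n/k)}{\log(n/k)}\,o_p(1) \tag{16}$$

holds uniformly for $1/k\le s\le 1$.

First, for $I_1$, replacing $s$ by $1/k$ in (16) yields that as $n\to\infty$,

$$\sqrt k\,I_1 = k^{-1/2}\,g(1/k)\,\log(n/k)\Big[\frac{\log k + k^{1/2}W_n(1/k) + k^{\delta}o_p(1)}{2\log(n/k)} + O\Big(\frac{\log k\,\log\log(n/k)}{\{\log(n/k)\}^2}\Big) + \frac{q(n/k)}{\log(n/k)}\,o_p(1)\Big].$$

Using the modulus of continuity for $W_n(1/k)$ and noting that, from the condition (5), $g(1/k) = o(k^{1/2-\varepsilon})$, we then get from the above equation that, for any $\delta<\varepsilon$, as $n\to\infty$,

$$\sqrt k\,I_1 = k^{-1/2}\,g(1/k)\,k^{\delta}\,O_p(1) = k^{\delta-\varepsilon}\,o_p(1) = o_p(1).$$

Next, we deal with $I_2$. By multiplying both sides of (16) with $\log(n/k)\,g(s)$ and taking an integral on the interval $[2/k,1]$, we get that as $n\to\infty$,

$$I_2^* := \log(n/k)\int_{2/k}^{1} g(s)\,\frac{Z_{n-[ks],n}-Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}}\,ds$$

$$= -\frac12\int_{2/k}^{1} g(s)\log s\,ds + \frac{1}{2\sqrt k}\int_{2/k}^{1} g(s)\{s^{-1}W_n(s)-W_n(1)\}\,ds + o_p(k^{-1/2})\int_{2/k}^{1} g(s)\,s^{-1/2-\delta}\,ds - \frac{(\alpha+1)\log\log(n/k)}{4\log(n/k)}\cdot\frac12\int_{2/k}^{1} g(s)\log s\,ds\,\{1+o(1)\} + q(n/k)\,o_p(1)\int_{2/k}^{1} g(s)\,ds$$

$$= 1 + \frac{1}{2\sqrt k}\int_{2/k}^{1} g(s)\{s^{-1}W_n(s)-W_n(1)\}\,ds + o_p(k^{-1/2}) + \frac{(\alpha+1)\log\log(n/k)}{4\log(n/k)}\{1+o(1)\} + q(n/k)\,o_p(1)\int_{2/k}^{1} g(s)\,ds.$$

To obtain the last equality, we used the condition (5) to derive the following facts. First, both $g(s)\,s^{-1/2-\delta}$ and $g(s)$ are integrable on $[0,1]$ for any $\delta<\varepsilon$. Second, for some constant $C_1>0$, as $n\to\infty$,

$$\Big|\int_0^{2/k} g(s)\{s^{-1}W_n(s)-W_n(1)\}\,ds\Big| \le C_1\int_0^{2/k} s^{-1/2+\varepsilon}\,\big|s^{-1}W_n(s)-W_n(1)\big|\,ds = o_p(1).$$

Lastly, as $k\to\infty$,

$$\sqrt k\,\Big|\int_0^{2/k} g(s)\log s\,ds\Big| \le C_1\sqrt k\int_0^{2/k} s^{-1/2+\varepsilon}\,|\log s|\,ds = C_1\sqrt k\int_{\log k-\log 2}^{+\infty} e^{-(1/2+\varepsilon)t}\,t\,dt$$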

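$$\le C_1\sqrt2\int_{\log k-\log 2}^{+\infty} e^{-\varepsilon t}\,t\,dt = o(1).$$

Finally, the condition (4) implies that $\sqrt k\,\log\log(n/k)/\log(n/k)\to 0$ and $\sqrt k\,q(n/k)$ is bounded as $n\to\infty$, which leads to the asymptotic property of $I_2^*$: as $n\to\infty$,

$$\sqrt k\,(I_2^*-1) - \frac12\int_0^1 g(s)\{s^{-1}W_n(s)-W_n(1)\}\,ds \xrightarrow{p} 0.$$

We remark that the difference between $I_2$ and $I_2^*$ is of order $k^{-1/2}o_p(1)$. This is shown by using the condition (6) as follows. Write

$$\sqrt k\,|I_2-I_2^*| \le \sqrt k\,\log(n/k)\int_{2/k}^{1}\Big|g\Big(\frac{[ks]}{k}\Big)-g(s)\Big|\,\frac{Z_{n-[ks],n}-Z_{n-k,n}}{\sigma\sqrt{2\log(n/k)}}\,ds$$

$$\le \sqrt k\int_{2/k}^{1}\Big|g\Big(\frac{[ks]}{k}\Big)-g(s)\Big|\,\big\{|\log s| + k^{-1/2}O_p(s^{-1/2-\delta})\big\}\,ds$$

$$\le \sqrt k\int_{2/k}^{1}\Big|\frac{g([ks]/k)}{\log([ks]/k)}-\frac{g(s)}{\log s}\Big|\,|\log s|\,\big\{|\log s| + k^{-1/2}O_p(s^{-1/2-\delta})\big\}\,ds + \sqrt k\int_{2/k}^{1}\Big|\log\frac{[ks]}{k}-\log s\Big|\,\frac{g([ks]/k)}{|\log([ks]/k)|}\,\big\{|\log s| + k^{-1/2}O_p(s^{-1/2-\delta})\big\}\,ds$$

$$\le \sqrt k\sup_{|s-t|\le 1/k,\ 1/k\le s,t\le 1}\Big|\frac{g(s)}{\log s}-\frac{g(t)}{\log t}\Big|\int_{2/k}^{1}|\log s|\,\big\{|\log s| + k^{-1/2}O_p(s^{-1/2-\delta})\big\}\,ds + \sqrt k\int_{2/k}^{1}\Big|\log\frac{[ks]}{ks}\Big|\,\frac{g([ks]/k)}{|\log([ks]/k)|}\,|\log s|\,ds + O_p(1)\int_{2/k}^{1}\Big|\log\frac{[ks]}{ks}\Big|\,\frac{g([ks]/k)}{|\log([ks]/k)|}\,s^{-1/2-\delta}\,ds$$

$$=: I_{21} + I_{22} + I_{23}.$$

The condition (6) implies that as $n\to\infty$, $I_{21} = o_p(1)$. Next, for $I_{22}$, notice that $\big|\log\frac{[ks]}{ks}\big| = \big|\log\big(1-\frac{\{ks\}}{ks}\big)\big| < \frac{C_0}{ks}$ for $s\ge 2/k$, for some constant $C_0>0$. In addition, the condition (5) implies that there exists $C_1$ such that $g(s) < C_1 s^{-1/2+\varepsilon}$ for all $s\in(0,1]$. By applying the condition (6) again, we get that as $n\to\infty$,

$$I_{22} \le \sqrt k\int_{2/k}^{1}\Big|\log\frac{[ks]}{ks}\Big|\,g(s)\,ds + o(1)\int_{2/k}^{1}\Big|\log\frac{[ks]}{ks}\Big|\,|\log s|\,ds$$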

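$$\le \sqrt k\int_{2/k}^{1}\frac{C_0}{ks}\,C_1 s^{-1/2+\varepsilon}\,ds + o(1)\int_{2/k}^{1}\frac{C_0}{ks}\,|\log s|\,ds \le \frac{C_0C_1}{1/2-\varepsilon}\,2^{\varepsilon-1/2}\,k^{-\varepsilon} + o(1) \to 0.$$

Lastly, for $I_{23}$, notice that for all $2/k\le s<1$, $\big|\log\frac{[ks]}{k}\big| \ge \max\big(|\log s|,\ |\log(1-1/k)|\big)$, and $\frac s2 < \frac{[ks]}{k} \le s$ implies that $g([ks]/k) < C_1([ks]/k)^{-1/2+\varepsilon} < C_2\,s^{-1/2+\varepsilon}$ for some constant $C_2>0$. Hence, as $n\to\infty$,

$$I_{23} \le O_p(1)\int_{2/k}^{1}\frac{C_0}{ks}\cdot\frac{C_2\,s^{-1+\varepsilon-\delta}}{\max\big(|\log s|,\ |\log(1-1/k)|\big)}\,ds \le O_p(1)\Big[\int_{2/k}^{1-1/k}\frac{C_0C_2}{k}\cdot\frac{s^{-2+\varepsilon-\delta}}{|\log s|}\,ds + \frac{1}{|\log(1-1/k)|}\int_{1-1/k}^{1}\frac{C_0C_2}{k}\,s^{-2+\varepsilon-\delta}\,ds\Big].$$

Since as $n\to\infty$, $k\,|\log(1-1/k)|\to 1$, $\int_{1-1/k}^{1}s^{-2+\varepsilon-\delta}\,ds\to 0$ and, by L'Hôpital's rule,

$$\lim_{k\to\infty}\frac1k\int_{2/k}^{1-1/k}\frac{s^{-2+\varepsilon-\delta}}{|\log s|}\,ds = \lim_{k\to\infty}\Big[\frac{(1-1/k)^{-2+\varepsilon-\delta}}{k^2\,|\log(1-1/k)|} + \frac{2\,(2/k)^{-2+\varepsilon-\delta}}{k^2\,|\log(2/k)|}\Big] = 0,$$

we obtain that $I_{23}\xrightarrow{p}0$ as $n\to\infty$. Combining the three terms, we have shown that as $n\to\infty$, $\sqrt k\,(I_2-I_2^*) = o_p(1)$, which leads to

$$\sqrt k\,(I_2-1) - \frac12\int_0^1 g(s)\{s^{-1}W_n(s)-W_n(1)\}\,ds \xrightarrow{p} 0.$$

The theorem is thus proved by combining the two parts $I_1$ and $I_2$ and further calculating the asymptotic variance.

Proof of Corollary 1. Write the normalized estimation error of $\hat\theta_g$ as a difference $I_1 - I_2$, where $I_1$ collects the deviation of $Z_{n,n}/(\sigma\sqrt{2\log n})$ from 1 and $I_2$ collects the deviation of $\hat\sigma_g/\sigma$ from 1. Theorem 2 gives the asymptotic property of $I_2$. Hence we only need to show that $I_1\xrightarrow{p}0$ as $n\to\infty$. Since as $n\to\infty$, $E_{n,n}-\log n$ converges in distribution to a limit with distribution function $\exp\{-e^{-x}\}$, we get that for $\gamma\ne 0$,

$$\Big(\frac{E_{n,n}}{\log(n/k)}\Big)^{\gamma} = \Big(\frac{\log n + O_p(1)}{\log(n/k)}\Big)^{\gamma} = 1 + \gamma\,\frac{\log k}{\log(n/k)}\{1+o_p(1)\}.$$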
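The Gumbel limit of $E_{n,n}-\log n$ used above can be checked without simulation, because the maximum of $n$ i.i.d. standard exponential variables has the exact distribution function $(1-e^{-u})^n$. The short Python sketch below compares this exact distribution function, after centering by $\log n$, with $\exp\{-e^{-x}\}$ on a grid; the grid, the sample sizes and the function names are illustrative choices only.

```python
import math

def max_exp_cdf(x, n):
    """Exact P(E_{n,n} - log n <= x) for the maximum of n i.i.d. standard exponentials."""
    u = x + math.log(n)
    return (1.0 - math.exp(-u)) ** n if u > 0 else 0.0

def gumbel_cdf(x):
    """Limiting distribution function exp{-exp(-x)}."""
    return math.exp(-math.exp(-x))

grid = [-2.0 + 0.1 * i for i in range(81)]  # x from -2 to 6

# Largest absolute gap between the exact and limiting distribution functions on the grid.
gaps = {
    n: max(abs(max_exp_cdf(x, n) - gumbel_cdf(x)) for x in grid)
    for n in (10, 1000, 100000)
}
for n, gap in gaps.items():
    print(n, gap)
```

The gap decreases roughly in proportion to $1/n$, which is consistent with writing $E_{n,n} = \log n + O_p(1)$ in the expansion above.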

Plugging this expansion into the $U$ function given in (15) yields that

$$\frac{Z_{n,n}}{\sigma\sqrt{2\log(n/k)}} = \frac{U(E_{n,n})}{\sigma\sqrt{2\log(n/k)}} = \psi_{1,n} + \frac{\log k}{2\log(n/k)}\{1+o_p(1)\},$$

as $n\to\infty$. The corollary is proved by verifying the required limits for $\psi_{1,n}$ and $\log k/\log(n/k)$ as $n\to\infty$, which are implied by the condition (6).

References

[1] Aarssen, K. and L. de Haan (1994). On the maximal life span of humans. Mathematical Population Studies 4.
[2] Athreya, K.B. and J. Fukuchi (1997). Confidence intervals for endpoints of a c.d.f. via bootstrap. Journal of Statistical Planning and Inference 58.
[3] Carroll, R.J. and P. Hall (1988). Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association 83.
[4] Cazals, C., J.P. Florens and L. Simar (2002). Nonparametric frontier estimation: a robust approach. Journal of Econometrics 106, 1-25.
[5] Einmahl, J.H. and J.R. Magnus (2008). Records in athletics through extreme-value theory. Journal of the American Statistical Association 103.
[6] Girard, S., A. Guillou and G. Stupfler (2012). Estimating an endpoint with high-order moments. Test 21.
[7] Goldenshluger, A. and A. Tsybakov (2004). Estimating the endpoint of a distribution in the presence of additive observation errors. Statistics and Probability Letters 68.
[8] de Haan, L. and A. Ferreira (2006). Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering. New York: Springer.

[9] Hall, P. (1982). On estimating the endpoint of a distribution. The Annals of Statistics 10.
[10] Hall, P. and L. Simar (2002). Estimating a changepoint, boundary, or frontier in the presence of observation error. Journal of the American Statistical Association 97.
[11] Hall, P. and J.Z. Wang (1999). Estimating the end-point of a probability distribution using minimum-distance methods. Bernoulli 5.
[12] Hall, P. and J.Z. Wang (2005). Bayesian likelihood methods for estimating the endpoint of a distribution. Journal of the Royal Statistical Society: Series B 67(5).
[13] Kneip, A., L. Simar and I. Van Keilegom (2015). Frontier estimation in the presence of measurement error with unknown variance. Journal of Econometrics 184.
[14] Meister, A. (2006). Density estimation with normal measurement error with unknown variance. Statistica Sinica 16.
[15] Meister, A. and M.H. Neumann (2010). Deconvolution from non-standard error densities under replicated measurements. Statistica Sinica 20.
[16] Stefanski, L. and R.J. Carroll (1990). Deconvoluting kernel density estimators. Statistics 21.
