Estimation of the extreme value index and high quantiles under random censoring

Jan Beirlant (1) & Emmanuel Delafosse (2) & Armelle Guillou (2)

(1) Katholieke Universiteit Leuven, Department of Mathematics, Celestijnenlaan 200B, 3001 Leuven, Belgium
(2) Université Paris VI, L.S.T.A., Boîte 158, 175 rue du Chevaleret, 75013 Paris

Key words and phrases: Pareto index, extreme quantile, censoring, Kaplan-Meier estimator.

Abstract. In this paper, we consider the estimation of the extreme value index and of extreme quantiles in the presence of random censoring. Since our main motivation is application in insurance, we focus on the Fréchet and Gumbel domains of attraction. In the case of no censoring, the most famous estimator of the Pareto index is the classical Hill (1975) estimator. Some adaptations of this estimator to the case of censoring are proposed and used to build extreme quantile estimators. A theoretical study of the asymptotic properties of these estimators is initiated. The finite sample behaviour is illustrated in a small simulation study and in a practical insurance example.
1. Introduction.

When a data set contains observations that are only known to lie within a restricted range of values, but are otherwise not measured, it is called a censored data set. Statistical techniques for analyzing censored data sets are well studied, especially in survival analysis and in biostatistics in general, where censoring mechanisms are common. The case of right censoring, where some results are only known to be at least as large as the reported value, has received particular attention; see for instance Cox and Oakes (1984). This work, however, concerns central characteristics of the underlying distribution. The literature on tail or extreme value analysis for censored data is almost nonexistent. In Reiss and Thomas (1997) (section 6.1), Beirlant et al. (1996) (section 2.7) and, in the case of truncated data, Beirlant and Guillou (2001), some estimators of tail indices were proposed without any deeper study of their behaviour. However, important problems such as the estimation of extreme quantiles were apparently not considered before in general.

Data sets with censored extreme data often occur in insurance, when reported payments cannot be larger than the maximum payment value of the contract. When the reported payment equals the maximum payment, the real payment can indeed be equal to the maximum or can be censored. The situation where all data above a fixed value are censored is referred to as truncation or type I censoring. This case was considered in Beirlant and Guillou (2001). It can occur when the observations are not the real payments but the payments as a fraction of the sum insured, in which case the truncation level equals 100%.

Here we consider random right censoring. The claim sizes $X$ are possibly censored by the maximum payment $Y$. A maximum payment of a given contract is then considered as a realization of the random variable $Y$. Different situations can now occur, depending on whether the censoring values (or maximum payment values) are observed or not. To be more specific, let $X_i$, $i \in \mathbb{N}$, be independent and identically distributed (i.i.d.) random variables with common distribution function (df) $F$ and let $Y_i$, $i \in \mathbb{N}$, be a second i.i.d. sequence with df $G$. We only observe

$$Z_i = X_i \wedge Y_i, \qquad \delta_i = 1_{\{X_i \le Y_i\}}, \qquad i \in \mathbb{N}.$$

We denote by $H$ the df of $Z$ and let $\tau_H = \inf\{x : H(x) = 1\}$, the supremum of the support of $H$. We define $H^{1}(z) = \mathbb{P}(Z > z, \delta = 1) = \mathbb{P}(z < X \le Y)$. Being motivated by actuarial applications, we confine ourselves to the case where sample maxima from $X$ samples are in the domain of attraction of the Fréchet or Gumbel law.
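The observation scheme $Z_i = X_i \wedge Y_i$, $\delta_i = 1_{\{X_i \le Y_i\}}$ can be made concrete with a small simulation. The sketch below is only an illustration under assumptions that are not part of the text at this point: both $X$ and $Y$ are taken to be strict Pareto variables with indices `alpha` and `beta`, and the function name `simulate_censored_sample` is hypothetical.

```python
import random


def simulate_censored_sample(n, alpha, beta, seed=0):
    """Return (z, delta) for n independent pairs (X, Y).

    Assumption for illustration only: X and Y are strict Pareto,
    P(X > x) = x**(-alpha) and P(Y > y) = y**(-beta) for x, y >= 1.
    """
    rng = random.Random(seed)
    z, delta = [], []
    for _ in range(n):
        x = rng.paretovariate(alpha)      # claim size X
        y = rng.paretovariate(beta)       # maximum payment Y
        z.append(min(x, y))               # Z = min(X, Y) is what is observed
        delta.append(1 if x <= y else 0)  # delta = 1 when X is not censored
    return z, delta


z, delta = simulate_censored_sample(n=5000, alpha=2.0, beta=3.0)
share_uncensored = sum(delta) / len(delta)
print(f"share of uncensored observations: {share_uncensored:.3f}")
```

For two independent strict Pareto variables, $\mathbb{P}(X \le Y) = \alpha/(\alpha+\beta)$, so with the values chosen here the printed share should be close to 0.4.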
This typically means that we consider polynomially decreasing tails or exponentially decreasing tails with infinite right endpoint. We will consequently consider the following cases:

- Observing $(Z, \delta)$, $X$ independent of $Y$, and both $X$ and $Y$ in the domain of attraction of the Fréchet law;
- Observing $(Z, \delta)$, $X$ independent of $Y$, $X$ in the domain of attraction of the Fréchet or the Gumbel law, and $Y$ in the domain of attraction of the Fréchet law.

In order to illustrate the methods presented in this paper, we use a liability insurance example from Frees and Valdez (1998).

2. Estimation techniques.

2.1. Observing $(Z, \delta)$, $X$ independent of $Y$, and both $X$ and $Y$ in the domain of attraction of the Fréchet law

Suppose that $F$ is of Pareto type, that is, there exists a positive constant $\alpha$ for which

$$1 - F(x) = x^{-\alpha}\, l_1(x), \tag{1}$$

where $l_1$ is a slowly varying function at infinity, satisfying $l_1(\lambda x)/l_1(x) \to 1$ when $x \to \infty$, for all $\lambda > 0$. In order for the censoring not to be too heavy, it appears natural to assume that the censoring distribution is also heavy tailed,

$$1 - G(x) = x^{-\beta}\, l_2(x), \tag{2}$$

for some $\beta > 0$ and slowly varying $l_2$. Assuming that $X$ and $Y$ are independent, so that $1 - H(x) = (1 - F(x))(1 - G(x))$, it now follows that

$$1 - H(x) = x^{-(\alpha+\beta)}\, l(x), \tag{3}$$

with $l$ also a slowly varying function at infinity. These conditions can be restated in terms of the tail quantile functions as

$$U_F(x) = x^{1/\alpha}\, l_{1,U}(x), \qquad U_G(x) = x^{1/\beta}\, l_{2,U}(x), \qquad U_H(x) = x^{1/(\alpha+\beta)}\, l_U(x),$$

with $U_F(x) = \inf\{y : F(y) \ge 1 - 1/x\}$, $x > 1$, and $l_{1,U}$, $l_{2,U}$ and $l_U$ again slowly varying functions at infinity.

Our goal is to discuss the estimation of $\gamma := 1/\alpha$ and of extreme quantiles $x_{F,p} := U_F(1/p)$ with $p < 1/n$. This problem has received a lot of attention in the case of no censoring, i.e. when $X_i \le Y_i$ for all $i = 1, \ldots, n$. The most famous estimator of $\gamma$ is Hill's (1975) estimator, given by

$$H_{X,k,n} = \frac{1}{k} \sum_{i=1}^{k} \log X_{n-i+1,n} - \log X_{n-k,n}. \tag{4}$$

Turning to the estimation of high quantiles, the estimator proposed by Weissman (1978) serves as a reference under Pareto-type models without censoring:

$$\hat{x}_{p,k} = X_{n-k,n} \left( \frac{k+1}{(n+1)p} \right)^{H_{X,k,n}}. \tag{5}$$

In the case of random right censoring, the likelihood based on the relative excesses $E_j = Z_j/t$, $Z_j > t$, with $N_t$ the number of such excesses, is changed into

$$\prod_{j=1}^{N_t} \left( \alpha E_j^{-\alpha-1} \right)^{\delta_j} \left( E_j^{-\alpha} \right)^{1-\delta_j},$$

leading to the estimator

$$H^{(c)}_{Z,t} = \frac{\sum_{i=1}^{n} \log(Z_i/t)\, 1_{\{Z_i > t\}}}{\sum_{i=1}^{n} \delta_i\, 1_{\{Z_i > t\}}}, \tag{6}$$

while for the extreme quantile estimator we propose to use

$$\hat{x}^{(c)}_{p,t} = t \left( \frac{1 - \hat{F}_n(t)}{p} \right)^{H^{(c)}_{Z,t}}, \tag{7}$$

where $\hat{F}_n(x)$, $-\infty < x < \tau_H$, denotes the Kaplan-Meier (1958) product-limit estimator of $F(x)$, defined as

$$1 - \hat{F}_n(x) = \prod_{j=1}^{n} \left[ 1 - \frac{\delta_{j,n}}{n-j+1} \right]^{1_{\{Z_{j,n} \le x\}}},$$

where $Z_{j,n}$ denote the order statistics associated with $Z_1, \ldots, Z_n$ and $\delta_{j,n} := \delta_i$ if and only if $Z_{j,n} = Z_i$. The corresponding tail probability estimator is then given by

$$\hat{\mathbb{P}}^{(c)}(X > x) = \left( 1 - \hat{F}_n(t) \right) \left( \frac{x}{t} \right)^{-1/H^{(c)}_{Z,t}}. \tag{8}$$

When choosing $t = Z_{n-k,n}$, we obtain the estimator

$$H^{(c)}_{Z,k,n} = \frac{\sum_{j=1}^{k} \left( \log Z_{n-j+1,n} - \log Z_{n-k,n} \right)}{\sum_{j=1}^{k} \delta_{n-j+1,n}}, \tag{9}$$

which is the original Hill estimator adapted for right censoring. We will also give another interpretation of this estimator, based on a novel QQ-plot.

2.2. Observing $(Z, \delta)$, $X$ independent of $Y$, $X$ in the domain of attraction of the Fréchet or Gumbel law, and $Y$ in the domain of attraction of the Fréchet law

When considering the extension to the case where $\gamma \ge 0$, there are, as in the no-censoring case, mainly two sets of solutions, which originate from two different formulations of the model.

First, the maximum likelihood approach based on POTs (Peaks over Threshold) relies on the results of Balkema and de Haan (1974) and Pickands (1975), stating that the limit distribution of the absolute exceedances over a threshold $t$ when $t \to \infty$ is a generalized Pareto distribution (GPD). In the case of censoring, we can easily adapt the likelihood to

$$\prod_{j=1}^{N_t} \left[ f_{GPD}(\tilde{E}_j) \right]^{\delta_j} \left[ 1 - F_{GPD}(\tilde{E}_j) \right]^{1-\delta_j}$$

where $\tilde{E}_j = Z_j - t$ if $Z_j > t$, $F_{GPD}(x) = 1 - \left( 1 + \gamma x/\sigma \right)^{-1/\gamma}$ and $f_{GPD}$ is the corresponding density. The maximization of this expression leads to a POT estimator for $\gamma$, which we further denote by $\hat{\gamma}^{(c)}_{t,ML}$.

Secondly, we can construct a new estimator based on upper order statistics, for instance within the framework of the QQ-plot regression technique. For example, in the case of no censoring, Beirlant et al. (1996) proposed an estimator of a real-valued index based on a generalized quantile plot, which takes over the role of the Pareto quantile plot in this more general setting. More precisely, they proposed to look at the graph with coordinates

$$\left( \log \frac{n+1}{j},\ \log UH_{j,n} \right), \qquad j = 1, \ldots, n-1,$$

with $UH_{j,n} = X_{n-j,n}\, H_{X,j,n}$. Again this plot becomes ultimately linear for small $j$, with slope approximating $\gamma$. One can then construct several regression-based estimators, such as

$$\hat{\gamma}_{k,UH} = \frac{1}{k} \sum_{j=1}^{k} \log UH_{j,n} - \log UH_{k+1,n}.$$

From the above it appears natural to define a generalization of $\hat{\gamma}_{k,UH}$ to the censoring case as a slope estimator of the generalized quantile plot adapted for censoring,

$$\left( -\log\left( 1 - \hat{F}_n(Z_{n-j+1,n}) \right),\ \log UH^{(c)}_{j,n} \right), \qquad j = 1, \ldots, n-1, \tag{10}$$

where $UH^{(c)}_{j,n} = Z_{n-j,n}\, H^{(c)}_{Z,j,n}$:

$$\hat{\gamma}^{(c)}_{k,UH} = \frac{\sum_{j=1}^{k} \left( \log UH^{(c)}_{j,n} - \log UH^{(c)}_{k+1,n} \right)}{\sum_{j=1}^{k} \delta_{n-j+1,n}}. \tag{11}$$

Using one of the above-mentioned estimators $\hat{\gamma}^{(c)}_{\cdot,\cdot}$ of $\gamma \ge 0$, we can now propose new estimators for the quantile $x_{F,p}$, in the spirit of the one proposed by Dekkers et al. (1989) in the case of no censoring:

$$\hat{x}^{(c)}_{p,t,\cdot} = t + \hat{\gamma}^{(c)}_{\cdot,\cdot}\, t\, \frac{\left( \frac{1 - \hat{F}_n(t)}{p} \right)^{\hat{\gamma}^{(c)}_{\cdot,\cdot}} - 1}{\hat{\gamma}^{(c)}_{\cdot,\cdot}}. \tag{12}$$

Under suitable assumptions, we establish the asymptotic properties of our estimators. We illustrate their behaviour in a small simulation study and in a practical insurance example.
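The censored Hill estimator (6), the Kaplan-Meier product-limit estimator and the quantile estimator (7) of Section 2.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, ties among the $Z_i$ are assumed absent, and the test data are simulated from strict Pareto laws with $\alpha = 2$ and $\beta = 3$, which is an assumption made only for this example.

```python
import math
import random


def censored_hill(z, delta, t):
    """Estimator (6): log-excesses over t divided by the number of
    uncensored exceedances of t."""
    num = sum(math.log(zi / t) for zi in z if zi > t)
    den = sum(di for zi, di in zip(z, delta) if zi > t)
    if den == 0:
        raise ValueError("no uncensored observation above the threshold")
    return num / den


def km_survival(z, delta, x):
    """Kaplan-Meier estimate of 1 - F(x); assumes no ties among z."""
    n = len(z)
    surv = 1.0
    for j, (zj, dj) in enumerate(sorted(zip(z, delta)), start=1):
        if zj > x:
            break
        surv *= 1.0 - dj / (n - j + 1)
    return surv


def censored_quantile(z, delta, t, p):
    """Estimator (7): t * ((1 - F_n(t)) / p) ** H."""
    return t * (km_survival(z, delta, t) / p) ** censored_hill(z, delta, t)


# Illustrative data (assumption): strict Pareto X and Y, so gamma = 1/alpha.
rng = random.Random(1)
alpha, beta, n, k = 2.0, 3.0, 5000, 1000
xs = [rng.paretovariate(alpha) for _ in range(n)]
ys = [rng.paretovariate(beta) for _ in range(n)]
z = [min(x, y) for x, y in zip(xs, ys)]
delta = [1 if x <= y else 0 for x, y in zip(xs, ys)]

t = sorted(z)[n - k - 1]          # threshold t = Z_{n-k,n}, as in (9)
p = 0.001
gamma_hat = censored_hill(z, delta, t)
q_hat = censored_quantile(z, delta, t, p)
q_true = p ** (-1.0 / alpha)      # exact quantile of the strict Pareto X
print(f"gamma_hat = {gamma_hat:.3f} (target {1 / alpha})")
print(f"q_hat = {q_hat:.2f} (target {q_true:.2f})")
```

With the threshold set to $Z_{n-k,n}$, `censored_hill` coincides with (9). The choice of $k$ here is arbitrary; in practice the estimates would be inspected over a range of thresholds.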

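The censored POT likelihood of Section 2.2 can also be written out directly. The sketch below evaluates its logarithm for given $(\gamma, \sigma)$; it is restricted to $\gamma > 0$ (the exponential limit $\gamma = 0$ is left out for brevity), the function name `censored_gpd_loglik` is hypothetical, and the maximization step that would yield $\hat{\gamma}^{(c)}_{t,ML}$ is deliberately not shown.

```python
import math


def censored_gpd_loglik(gamma, sigma, z, delta, t):
    """Log of prod_j f(E_j)**delta_j * (1 - F(E_j))**(1 - delta_j),
    over the absolute excesses E_j = Z_j - t with Z_j > t.

    Assumption: gamma > 0 and sigma > 0 (the case gamma = 0 is not handled).
    """
    if gamma <= 0 or sigma <= 0:
        raise ValueError("this sketch requires gamma > 0 and sigma > 0")
    total = 0.0
    for zj, dj in zip(z, delta):
        if zj <= t:
            continue                       # only exceedances contribute
        u = 1.0 + gamma * (zj - t) / sigma
        log_surv = -math.log(u) / gamma    # log(1 - F_GPD(E_j))
        if dj:
            # uncensored: log f_GPD = log_surv - log(sigma) - log(u)
            total += log_surv - math.log(sigma) - math.log(u)
        else:
            # censored: only the survival probability is known
            total += log_surv
    return total


# Tiny hand-checkable example: excesses 1 (uncensored) and 2 (censored).
example = censored_gpd_loglik(1.0, 1.0, [2.0, 3.0, 0.5], [1, 0, 1], 1.0)
print(example)
```

In this example the uncensored excess contributes $-2\log 2$ and the censored one $-\log 3$, so the value is $-\log 12$. A numerical optimizer applied to this function over $(\gamma, \sigma)$ would give a POT-type estimate.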
Bibliography

[1] Balkema, A. and de Haan, L. (1974). Residual life time at great age, Ann. Probab., 2, 792-804.
[2] Beirlant, J. and Guillou, A. (2001). Pareto index estimation under moderate right censoring, Scand. Actuarial J., 2, 111-125.
[3] Beirlant, J., Teugels, J.L. and Vynckier, P. (1996). Practical Analysis of Extreme Values, Leuven University Press, Leuven.
[4] Beirlant, J., Vynckier, P. and Teugels, J.L. (1996). Excess functions and estimation of the extreme value index, Bernoulli, 2, 293-318.
[5] Cox, D.R. and Oakes, D. (1984). Analysis of Survival Data, Chapman and Hall, New York.
[6] Dekkers, A.L.M., Einmahl, J.H.J. and de Haan, L. (1989). A moment estimator for the index of an extreme-value distribution, Ann. Statist., 17, 1833-1855.
[7] Frees, E. and Valdez, E. (1998). Understanding relationships using copulas, North American Actuarial Journal, 2, 5.
[8] Hill, B.M. (1975). A simple general approach to inference about the tail of a distribution, Ann. Statist., 3, 1163-1174.
[9] Kaplan, E.L. and Meier, P. (1958). Non-parametric estimation from incomplete observations, J. Amer. Statist. Assoc., 53, 457-481.
[10] Pickands III, J. (1975). Statistical inference using extreme order statistics, Ann. Statist., 3, 119-131.
[11] Reiss, R.D. and Thomas, M. (1997). Statistical Analysis of Extreme Values with Applications to Insurance, Finance, Hydrology and Other Fields, Birkhäuser Verlag, Basel.
[12] Weissman, I. (1978). Estimation of parameters and large quantiles based on the k largest observations, J. Amer. Statist. Assoc., 73, 812-815.