Fitting mixtures of Erlangs to censored and truncated data using the EM algorithm


Katrien Antonio 1,2, Andrei Badescu 3, Lan Gong 3, Sheldon Lin 3, and Roel Verbelen 2

1 Faculty of Economics and Business, University of Amsterdam, The Netherlands. 2 LStat, Faculty of Economics and Business, KU Leuven, Belgium. 3 Department of Statistics, University of Toronto, Canada.

Corresponding author. Address: roel.verbelen@kuleuven.be

January 9, 2014

Abstract

We present a calibration procedure for fitting mixtures of Erlangs to censored and truncated data by iteratively using the EM algorithm. Mixtures of Erlangs form a very versatile, yet analytically tractable, class of distributions making them suitable for modeling purposes. The effectiveness of the proposed algorithm is demonstrated on simulated data as well as real data sets.

Keywords: Mixture of Erlang distributions with a common scale parameter; Censoring; Truncation; Expectation-maximization algorithm; Maximum likelihood.

1 Introduction

The class of mixtures of Erlangs with a common scale parameter is very flexible in terms of the possible shapes of its members. Tijms (1994) shows that mixtures of Erlangs are dense in the space of positive distributions in the sense that there always exists a series of mixtures of Erlangs that weakly converges, i.e. converges in distribution, to any positive distribution. As such, any continuous distribution can be approximated by a mixture of Erlang distributions to any accuracy. Furthermore, via direct manipulation of the Laplace transform, a wide variety of distributions whose membership in this class is not immediately obvious can be written as a mixture of Erlangs. The class of mixtures of Erlang distributions with a common scale is also closed under mixture, convolution and compounding. At the same time, it is possible to work analytically with this class, leading to explicit expressions for e.g. the Laplace transform, the hazard rate, a quantile or Value-at-Risk (VaR) and stop-loss moments. Klugman et al. (2013), Willmot and Lin (2011) and Willmot and Woo (2007) give an overview of these analytical and computational properties of mixtures of Erlangs.

Mixture models are often used to reflect the heterogeneity in a population consisting of multiple groups or clusters (McLachlan and Peel, 2001). In some applications, these clusters can be

physically identified and used to interpret the fitted distributions. This is however not the approach we follow; the components in the mixture will not be identified with existing groups. Mixtures of Erlangs are discussed here for their great flexibility in modeling data and should be regarded as a semiparametric density estimation technique. The densities in the mixture are parametrically specified as Erlangs, whereas the associated weights form the nonparametric part. The number of Erlangs in the mixture with non-zero weights can be viewed as a smoothing parameter. Mixtures of Erlangs have much of the flexibility of nonparametric approaches, while retaining some of the advantages of parametric approaches (such as a reasonable number of parameters) and furthermore allowing for tractable results. Lee and Lin (2010) iteratively use the EM algorithm (Dempster et al., 1977) for finite mixtures to estimate the parameters of a mixture of Erlang distributions with a common scale parameter. They demonstrate the usefulness of mixtures of Erlangs in the context of quantitative risk management for the insurance business. This industry requires on the one hand the flexibility of nonparametric density estimation techniques to describe the insurance losses and on the other hand the feasibility to analytically quantify the risk, which is exactly what the class of mixtures of Erlangs offers. However, modeling censored and/or truncated losses is not covered by the approach in Lee and Lin (2010). In many practical problems data are censored and/or truncated. In biostatistics a typical example is a study which follows patients over a period of time. In case the event of interest has not yet occurred before the end of the study, or the patient drops out or dies from another cause independent of the cause of interest, the event time is right censored.
In actuarial science, insurance losses (as studied by Lee and Lin, 2010) are often censored and truncated due to policy modifications such as deductibles (left truncation) and policy limits (right censoring). The modeling of data on operational risk (OR), as required by the Basel Committee on Banking Supervision, is another example. Banks use data from their internal OR database, and refer to the Operational Riskdata eXchange Association (ORX) for external loss data. Usually, the internal OR database of banks in North America is subject to a minimum recording threshold of $10,000 to $15,000, while in the external database operational losses start from $30,000. In both cases, when fitting a parametric distribution to the loss data, one should take the left truncation of the data into consideration. Motivated by the large number of areas where censored and truncated data are encountered, we develop an extension of the iterative EM algorithm of Lee and Lin (2010) for fitting mixtures of Erlangs with common scale parameter to censored and truncated data. The traditional way of dealing with grouped (and) truncated data using the EM algorithm involves treating the unknown number of truncated observations as a random variable and including it into the complete data vector (Dempster et al., 1977; McLachlan and Krishnan, 2007, p. 66; McLachlan and Peel, 2001, p. 257; McLachlan and Jones, 1988). We will not follow this approach and rather only include the uncensored observations and the component-label vectors in the complete data vector, as is also done in Lee and Scott (2012). We demonstrate the fitting procedure on a wide range of applications, from econometrics, actuarial science and biostatistics. Our R implementation is available online. In the following, we briefly introduce mixtures of Erlangs with a common scale parameter in Section 2. The adjusted EM algorithm, able to deal with censored and truncated data, is presented in Section 3.
The procedures used to initialize the parameters, to adjust the shapes of the Erlangs in the mixture and to choose the number of components are discussed in Section 4. Examples follow in Section 5 and Section 6 concludes.

2 Mixtures of Erlangs with a common scale parameter

The Erlang distribution is a positive continuous distribution with density function

    f(x; r, θ) = x^(r−1) e^(−x/θ) / (θ^r (r−1)!)   for x > 0,   (1)

where r, a positive integer, is the shape parameter and θ > 0 the scale parameter (the inverse λ = 1/θ is called the rate parameter). The cumulative distribution function is obtained by integrating (1) by parts r times:

    F(x; r, θ) = ∫_0^x z^(r−1) e^(−z/θ) / (θ^r (r−1)!) dz = 1 − Σ_{n=0}^{r−1} e^(−x/θ) (x/θ)^n / n!.   (2)

Using the lower incomplete gamma function, defined as γ(s, x) = ∫_0^x z^(s−1) e^(−z) dz, the cumulative distribution function becomes

    F(x; r, θ) = (1/(r−1)!) ∫_0^{x/θ} u^(r−1) e^(−u) du = γ(r, x/θ) / (r−1)!.   (3)

Following Lee and Lin (2010) we consider mixtures of M Erlang distributions with common scale parameter θ > 0, having density

    f(x; α, r, θ) = Σ_{j=1}^M α_j x^(r_j−1) e^(−x/θ) / (θ^(r_j) (r_j−1)!) = Σ_{j=1}^M α_j f(x; r_j, θ)   for x > 0,   (4)

where the positive integers r = (r_1, ..., r_M) with r_1 < ... < r_M are the shape parameters of the Erlang distributions and α = (α_1, ..., α_M) with α_j > 0 and Σ_{j=1}^M α_j = 1 are the weights used in the mixture. Similarly, the cumulative distribution function can be written as a weighted sum of terms (2) or (3). Tijms (1994, p. 163) shows that the class of mixtures of Erlang distributions with a common scale parameter is dense in the space of distributions on R+. Lee and Lin (2010) give an alternative proof using characteristic functions.

3 The EM algorithm for censored and truncated data

Lee and Lin (2010) formulate the EM algorithm customized for fitting mixtures of Erlangs with a common scale parameter to complete data. In an addendum available online, we work out the details with zero-one component indicators inspired by McLachlan and Peel (2001) and Lee and Scott (2012). Here, we construct an adjusted EM algorithm which is able to deal with censored and truncated data. We represent a censored sample truncated to the range [t_l, t_u] by

    X = {(l_i, u_i) | i = 1, ..., n},

where t_l and t_u represent the lower and upper truncation points, l_i and u_i the lower and upper censoring points, and t_l ≤ l_i ≤ u_i ≤ t_u for i = 1, ..., n. t_l = 0 and t_u = ∞ mean no truncation from below and above, respectively. The censoring status is determined as follows:

    Uncensored:        t_l ≤ l_i = u_i =: x_i ≤ t_u
    Left censored:     t_l = l_i < u_i < t_u
    Right censored:    t_l < l_i < u_i = t_u
    Interval censored: t_l < l_i < u_i < t_u

For example, when the truncation interval equals [t_l, t_u] = [0, 10], an uncensored observation at 1 is denoted by (l_i, u_i) = (1, 1), an observation left censored at 2 by (l_i, u_i) = (0, 2), an observation right censored at 3 by (l_i, u_i) = (3, 10) and an observation censored between 4 and 5 by (l_i, u_i) = (4, 5). Thus, l_i and u_i should be seen as the lower and upper endpoints of the interval which contains observation i. The parameter vector to be estimated is Θ = (α, θ). The number of Erlangs M in the mixture and the corresponding positive integer shapes r are fixed. The value of M is, in most applications, however unknown and has to be inferred from the available data, along with the shape parameters; see Section 4. The portion of the likelihood containing the unknown parameter vector Θ is given by

    L(Θ; X) = ∏_{i∈U} f(x_i; Θ) / (F(t_u; Θ) − F(t_l; Θ)) · ∏_{i∈C} (F(u_i; Θ) − F(l_i; Θ)) / (F(t_u; Θ) − F(t_l; Θ)),

where U is the subset of observations in {1, ..., n} which are uncensored and C is the subset of left, right and interval censored observations. In case there is no truncation, i.e. [t_l, t_u] = [0, ∞], the contribution of a left censored observation to the likelihood equals F(u_i; Θ) since l_i = 0, of a right censored observation 1 − F(l_i; Θ) with u_i = ∞, and of an interval censored observation F(u_i; Θ) − F(l_i; Θ). The corresponding log likelihood is

    l(Θ; X) = Σ_{i∈U} ln( Σ_{j=1}^M α_j f(x_i; r_j, θ) ) + Σ_{i∈C} ln( Σ_{j=1}^M α_j (F(u_i; r_j, θ) − F(l_i; r_j, θ)) )
              − n ln( Σ_{j=1}^M α_j (F(t_u; r_j, θ) − F(t_l; r_j, θ)) ),   (5)

which is difficult to optimize numerically.
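The four censoring cases and the (l_i, u_i) encoding above are easy to operationalize. A minimal sketch in Python (the paper's own implementation is in R; the helper name here is ours):

```python
def censoring_status(l, u, t_l, t_u):
    # Classify an observation (l, u) inside the truncation interval [t_l, t_u],
    # following the four cases of Section 3
    if l == u:
        return "uncensored"          # t_l <= l_i = u_i = x_i <= t_u
    if l == t_l and u < t_u:
        return "left censored"       # t_l = l_i < u_i < t_u
    if l > t_l and u == t_u:
        return "right censored"      # t_l < l_i < u_i = t_u
    return "interval censored"       # t_l < l_i < u_i < t_u
```

With the truncation interval [0, 10] of the example, (1, 1) is uncensored, (0, 2) left censored, (3, 10) right censored and (4, 5) interval censored.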
3.1 Construction of the complete data vector

The EM algorithm provides a computationally much easier way to fit this finite mixture to the censored and truncated data. The key idea is to regard the censored sample X as being incomplete, since the uncensored observations x = (x_1, ..., x_n) and their associated component-indicator vectors z = (z_1, ..., z_n), with

    z_ij = 1 if observation x_i comes from the jth component density f(x; r_j, θ), and z_ij = 0 otherwise,   (6)

for i = 1, ..., n and j = 1, ..., M, are not available. The component-label vectors z_1, ..., z_n are taken to be realized values of the random vectors Z_1, ..., Z_n, i.i.d. Mult_M(1, α). The complete data vector, Y = (x_1, ..., x_n, z) = {(x_i, z_i) | i = 1, ..., n}, contains all uncensored observations x_i and their corresponding mixing component vector z_i. The probability density function of an uncensored observation x_i after truncation to [t_l, t_u] is given by

    f(x_i; t_l, t_u, Θ) = f(x_i; Θ) / (F(t_u; Θ) − F(t_l; Θ))
      = Σ_{j=1}^M [ α_j (F(t_u; r_j, θ) − F(t_l; r_j, θ)) / (F(t_u; Θ) − F(t_l; Θ)) ] · [ f(x_i; r_j, θ) / (F(t_u; r_j, θ) − F(t_l; r_j, θ)) ]
      = Σ_{j=1}^M β_j f(x_i; t_l, t_u, r_j, θ),

for t_l ≤ x_i ≤ t_u and zero otherwise. This is again a mixture, with mixing weights and component density functions given by, respectively,

    β_j = α_j (F(t_u; r_j, θ) − F(t_l; r_j, θ)) / (F(t_u; Θ) − F(t_l; Θ))   (7)

and

    f(x_i; t_l, t_u, r_j, θ) = f(x_i; r_j, θ) / (F(t_u; r_j, θ) − F(t_l; r_j, θ)).   (8)

The component density functions f(x_i; t_l, t_u, r_j, θ) are truncated versions of the original component density functions f(x_i; r_j, θ). The weights β_j are re-weighted versions of the original weights α_j by means of the probabilities of the corresponding component to lie in the truncation interval. The log likelihood of the complete sample Y then becomes

    l(Θ; Y) = Σ_{i=1}^n Σ_{j=1}^M z_ij ln( β_j f(x_i; t_l, t_u, r_j, θ) ),   (9)

which has a simpler form than the incomplete log likelihood (5) as it does not contain logarithms of sums. The EM algorithm deals with the censored and truncated data from the mixture of Erlangs with common scale in the following steps.

3.2 Initial step

Initialization of the parameters θ and α is done in the same way as in the uncensored case (Lee and Lin, 2010) and is based on the denseness property (Tijms, 1994, p. 163):

    θ^(0) = max(d) / r_M   and   α_j^(0) = (1 / n_d) Σ_{i=1}^{n_d} I( r_{j−1} θ^(0) < d_i ≤ r_j θ^(0) ),   for j = 1, ..., M,   (10)

with r_0 = 0 for notational convenience. The data d used in (10), of which the size is denoted by n_d, can be restricted to the uncensored data points only. Alternatively, the censored data can also be used by treating the left and right censored data points as being observed, i.e. use u_i and l_i, respectively, and by using the midpoint for the interval censored data points, i.e. use (l_i + u_i)/2. In this manner, censored observations at the low end or the high end of the sample range, which can be of importance at initialization, will not be ignored. The truncation is only taken into account to transform the initial values for α into the initial values for β via (7).

3.3 E-step

In the kth iteration of the E-step, we take the conditional expectation of the complete log likelihood (9) given the incomplete data X, using the current estimate Θ^(k−1) for Θ:

    Q(Θ; Θ^(k−1)) = E( l(Θ; Y) | X; Θ^(k−1) )
      = Σ_{i∈U} E( Σ_j Z_ij ln( β_j f(x_i; t_l, t_u, r_j, θ) ) | X; Θ^(k−1) )
        + Σ_{i∈C} E( Σ_j Z_ij ln( β_j f(X_i; t_l, t_u, r_j, θ) ) | X; Θ^(k−1) )
      = Q_u(Θ; Θ^(k−1)) + Q_c(Θ; Θ^(k−1)),   (11)

where Q_u(Θ; Θ^(k−1)) and Q_c(Θ; Θ^(k−1)) are the conditional expectations of the uncensored and censored part of the complete log likelihood, respectively.

Uncensored case. The truncation does not complicate the computation of the expectation for the uncensored data, as

    Q_u(Θ; Θ^(k−1)) = Σ_{i∈U} Σ_j E( Z_ij | X; Θ^(k−1) ) ln( β_j f(x_i; t_l, t_u, r_j, θ) )
      = Σ_{i∈U} Σ_j uz_ij^(k) ln( β_j f(x_i; t_l, t_u, r_j, θ) )
      = Σ_{i∈U} Σ_j uz_ij^(k) [ ln(β_j) + (r_j − 1) ln(x_i) − x_i/θ − r_j ln(θ) − ln((r_j − 1)!) − ln( F(t_u; r_j, θ) − F(t_l; r_j, θ) ) ],   (12)

with, for i ∈ U and j = 1, ..., M,

    uz_ij^(k) = P( Z_ij = 1 | x_i, t_l, t_u; Θ^(k−1) )
      = β_j^(k−1) f(x_i; t_l, t_u, r_j, θ^(k−1)) / Σ_{m=1}^M β_m^(k−1) f(x_i; t_l, t_u, r_m, θ^(k−1))   [by (8)]
      = [ β_j^(k−1) f(x_i; r_j, θ^(k−1)) / (F(t_u; r_j, θ^(k−1)) − F(t_l; r_j, θ^(k−1))) ]
        / [ Σ_{m=1}^M β_m^(k−1) f(x_i; r_m, θ^(k−1)) / (F(t_u; r_m, θ^(k−1)) − F(t_l; r_m, θ^(k−1))) ]   [by (7)]
      = α_j^(k−1) f(x_i; r_j, θ^(k−1)) / Σ_{m=1}^M α_m^(k−1) f(x_i; r_m, θ^(k−1)).   (13)

The E-step for the uncensored part only requires the computation of the posterior probabilities uz_ij^(k) that observation i belongs to the jth component in the mixture, which are the same in the truncated case and in the untruncated case.

Censored case. Denote by cz_im^(k) the posterior probability that a censored observation i belongs to the mth component in the mixture. Then

    Q_c(Θ; Θ^(k−1)) = Σ_{i∈C} E( Σ_m Z_im ln( β_m f(X_i; t_l, t_u, r_m, θ) ) | l_i, u_i, t_l, t_u; Θ^(k−1) )
      = Σ_{i∈C} Σ_{m=1}^M cz_im^(k) E( ln( β_m f(X_i; t_l, t_u, r_m, θ) ) | Z_im = 1, l_i, u_i, t_l, t_u; θ^(k−1) )
      = Σ_{i∈C} Σ_{m=1}^M cz_im^(k) [ ln(β_m) + (r_m − 1) E( ln(X_i) | Z_im = 1, l_i, u_i, t_l, t_u; θ^(k−1) )
          − (1/θ) E( X_i | Z_im = 1, l_i, u_i, t_l, t_u; θ^(k−1) ) − r_m ln(θ) − ln((r_m − 1)!)
          − ln( F(t_u; r_m, θ) − F(t_l; r_m, θ) ) ].   (14)

Again using Bayes' rule, we can compute these posterior probabilities, for i ∈ C and m = 1, ..., M, as

    cz_im^(k) = P( Z_im = 1 | l_i, u_i, t_l, t_u; Θ^(k−1) )
      = β_m^(k−1) ( F(u_i; t_l, t_u, r_m, θ^(k−1)) − F(l_i; t_l, t_u, r_m, θ^(k−1)) )
        / Σ_j β_j^(k−1) ( F(u_i; t_l, t_u, r_j, θ^(k−1)) − F(l_i; t_l, t_u, r_j, θ^(k−1)) )
      = [ β_m^(k−1) ( F(u_i; r_m, θ^(k−1)) − F(l_i; r_m, θ^(k−1)) ) / ( F(t_u; r_m, θ^(k−1)) − F(t_l; r_m, θ^(k−1)) ) ]
        / [ Σ_j β_j^(k−1) ( F(u_i; r_j, θ^(k−1)) − F(l_i; r_j, θ^(k−1)) ) / ( F(t_u; r_j, θ^(k−1)) − F(t_l; r_j, θ^(k−1)) ) ]   [by (7)]
      = α_m^(k−1) ( F(u_i; r_m, θ^(k−1)) − F(l_i; r_m, θ^(k−1)) ) / Σ_j α_j^(k−1) ( F(u_i; r_j, θ^(k−1)) − F(l_i; r_j, θ^(k−1)) ).   (15)
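Both posteriors (13) and (15) reduce to simple weighted ratios in the original parameters α. A minimal sketch in Python (the paper's implementation is in R; these helper names are ours):

```python
import math

def erlang_pdf(x, r, theta):
    # Erlang density (1): x^(r-1) e^(-x/theta) / (theta^r (r-1)!)
    return x**(r - 1) * math.exp(-x / theta) / (theta**r * math.factorial(r - 1))

def erlang_cdf(x, r, theta):
    # Erlang CDF via the finite sum (2); x = infinity means "no upper bound"
    if math.isinf(x):
        return 1.0
    return 1.0 - sum(math.exp(-x / theta) * (x / theta)**n / math.factorial(n)
                     for n in range(r))

def posterior_uncensored(x, alphas, shapes, theta):
    # eq. (13): the truncation probabilities cancel, so only the untruncated
    # component densities are needed
    num = [a * erlang_pdf(x, r, theta) for a, r in zip(alphas, shapes)]
    s = sum(num)
    return [v / s for v in num]

def posterior_censored(l, u, alphas, shapes, theta):
    # eq. (15): densities replaced by the interval probabilities F(u) - F(l)
    num = [a * (erlang_cdf(u, r, theta) - erlang_cdf(l, r, theta))
           for a, r in zip(alphas, shapes)]
    s = sum(num)
    return [v / s for v in num]
```

Each call returns a probability vector over the M components; a right censored point uses u = infinity.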

The terms in (14) for Q_c(Θ; Θ^(k−1)) containing E( ln(X_i) | Z_im = 1, l_i, u_i, t_l, t_u; θ^(k−1) ) will not play a role in the EM algorithm as they do not depend on the unknown parameter vector Θ. The E-step does require the computation of the expected value of X_i conditional on the censoring times and the mixing component Z_i for the current value Θ^(k−1) of Θ:

    E( X_i | Z_im = 1, l_i, u_i, t_l, t_u; θ^(k−1) ) = ∫_{l_i}^{u_i} x f(x; r_m, θ^(k−1)) / ( F(u_i; r_m, θ^(k−1)) − F(l_i; r_m, θ^(k−1)) ) dx
      = r_m θ^(k−1) / ( F(u_i; r_m, θ^(k−1)) − F(l_i; r_m, θ^(k−1)) ) · ∫_{l_i}^{u_i} x^{r_m} e^{−x/θ^(k−1)} / ( (θ^(k−1))^{r_m+1} r_m! ) dx
      = r_m θ^(k−1) ( F(u_i; r_m + 1, θ^(k−1)) − F(l_i; r_m + 1, θ^(k−1)) ) / ( F(u_i; r_m, θ^(k−1)) − F(l_i; r_m, θ^(k−1)) ),

for i ∈ C and m = 1, ..., M, which is a closed-form expression.

3.4 M-step

In the M-step, we maximize the expected value (11) of the complete data log likelihood obtained in the E-step with respect to the parameter vector Θ, over all (β, θ) with β_j > 0, Σ_{j=1}^M β_j = 1 and θ > 0. The expressions for Q_u(Θ; Θ^(k−1)) and Q_c(Θ; Θ^(k−1)) are given in (12) and (14), respectively. The maximization over the mixing weights β requires the maximization of

    Σ_{i∈U} Σ_j uz_ij^(k) ln(β_j) + Σ_{i∈C} Σ_j cz_ij^(k) ln(β_j),

which can be done analogously to the uncensored case (see the addendum online). We implement the restriction Σ_{j=1}^M β_j = 1 by setting β_M = 1 − Σ_{j=1}^{M−1} β_j. Setting the partial derivatives at β^(k) equal to zero implies that the optimizer satisfies

    ( Σ_{i∈U} uz_ij^(k) + Σ_{i∈C} cz_ij^(k) ) / β_j^(k) = ( Σ_{i∈U} uz_iM^(k) + Σ_{i∈C} cz_iM^(k) ) / β_M^(k)   for j = 1, ..., M − 1.

By the sum constraint we have β_M^(k) = ( Σ_{i∈U} uz_iM^(k) + Σ_{i∈C} cz_iM^(k) ) / n, and the same form also follows for j = 1, ..., M − 1:

    β_j^(k) = ( Σ_{i∈U} uz_ij^(k) + Σ_{i∈C} cz_ij^(k) ) / n   for j = 1, ..., M.   (16)

The new estimate for the prior probability β_j in the truncated mixture is the average of the posterior probabilities of belonging to the jth component in the mixture.
The optimizer indeed corresponds to a maximum, since the matrix of second order partial derivatives is negative definite with a compound symmetry structure.
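The closed-form conditional expectation used in the E-step is a one-liner given an Erlang CDF. A sketch (our own helper names; the weight update (16) is just the average of the posteriors and needs no separate routine):

```python
import math

def erlang_cdf(x, r, theta):
    # Erlang CDF via the finite sum (2); x = infinity means "no upper bound"
    if math.isinf(x):
        return 1.0
    return 1.0 - sum(math.exp(-x / theta) * (x / theta)**n / math.factorial(n)
                     for n in range(r))

def cond_mean(l, u, r, theta):
    # E[X | component with shape r, l < X <= u]
    #   = r * theta * (F(u; r+1, theta) - F(l; r+1, theta))
    #               / (F(u; r,   theta) - F(l; r,   theta))
    num = erlang_cdf(u, r + 1, theta) - erlang_cdf(l, r + 1, theta)
    den = erlang_cdf(u, r, theta) - erlang_cdf(l, r, theta)
    return r * theta * num / den
```

As a sanity check, for r = 1 (an exponential component) and u = infinity the formula reduces to l + θ, the memoryless property; for l = 0 and u = infinity it returns the unconditional mean rθ.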

In order to maximize Q(Θ; Θ^(k−1)) with respect to θ, we set the first order partial derivatives equal to zero (see Appendix A). This leads to the following M-step equation for θ:

    θ^(k) = [ ( Σ_{i∈U} x_i + Σ_{i∈C} E( X_i | l_i, u_i, t_l, t_u; θ^(k−1) ) ) / n − T^(k) ] / Σ_{j=1}^M β_j^(k) r_j,   (17)

with

    T^(k) = Σ_{j=1}^M β_j^(k) [ t_l^{r_j} e^{−t_l/θ} − t_u^{r_j} e^{−t_u/θ} ] / [ θ^{r_j−1} (r_j − 1)! ( F(t_u; r_j, θ) − F(t_l; r_j, θ) ) ]   evaluated at θ = θ^(k).

As in the uncensored case, the new estimate θ^(k) in (17) for the common scale parameter θ again has the interpretation of the sample mean divided by the average shape parameter in the mixture, but in the formula for the sample mean we now take the expected value of the censored data points given the censoring times, and subtract a correction term T^(k) due to the truncation. However, T^(k) in (17) depends on θ^(k) and has a complicated form. Therefore, it is not possible to find an analytical solution of (17). We circumvent this issue by evaluating T in the previous value of θ, i.e. θ^(k−1), and use (17) with T^(k−1) in the right-hand side as an approximation for finding the maximum in the kth EM step. The convergence of the EM algorithm to the maximum likelihood estimator can then no longer be theoretically guaranteed, but works in practice, based on our experiments. Lee and Scott (2012) use the same approach to fit multivariate Gaussian mixture models to censored and truncated data. The E- and M-steps are iterated until l(Θ^(k); X) − l(Θ^(k−1); X) is sufficiently small. The maximum likelihood estimator of the original mixing weights α_j for j = 1, ..., M can be retrieved by inverting expression (7). This is most easily done by first computing

    α_j ∝ β_j / ( F(t_u; r_j, θ) − F(t_l; r_j, θ) )   for j = 1, ..., M,

where β_j and θ denote the values in the final EM step, and then normalizing the weights such that they sum to 1.
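The scale update (17), with T evaluated at the previous scale, and the final inversion of (7) can both be sketched compactly (our own Python; the paper's implementation is in R):

```python
import math

def erlang_cdf(x, r, theta):
    # Erlang CDF via the finite sum (2); x = infinity means "no upper bound"
    if math.isinf(x):
        return 1.0
    return 1.0 - sum(math.exp(-x / theta) * (x / theta)**n / math.factorial(n)
                     for n in range(r))

def scale_update(completed_sum, n, betas, shapes, theta_prev, t_l, t_u):
    # M-step equation (17), with the correction term T evaluated at theta_prev.
    # completed_sum = sum of the uncensored x_i plus E[X_i | censoring] for i in C.
    def w(t, r):
        # t^r e^(-t/theta_prev); vanishes at t = infinity (no upper truncation)
        return 0.0 if math.isinf(t) else t**r * math.exp(-t / theta_prev)
    T = sum(b * (w(t_l, r) - w(t_u, r))
            / (theta_prev**(r - 1) * math.factorial(r - 1)
               * (erlang_cdf(t_u, r, theta_prev) - erlang_cdf(t_l, r, theta_prev)))
            for b, r in zip(betas, shapes))
    return (completed_sum / n - T) / sum(b * r for b, r in zip(betas, shapes))

def alpha_from_beta(betas, shapes, theta, t_l, t_u):
    # Invert (7) at the final EM step: alpha_j ~ beta_j / P(component j in [t_l, t_u])
    raw = [b / (erlang_cdf(t_u, r, theta) - erlang_cdf(t_l, r, theta))
           for b, r in zip(betas, shapes)]
    return [v / sum(raw) for v in raw]
```

Without truncation (t_l = 0, t_u = infinity) the correction T is zero, (17) reduces to the sample mean over the average shape, and alpha_from_beta returns β unchanged, matching the uncensored case.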
4 Choice of the shape parameters and of the number of Erlangs in the mixture

4.1 Initialization

We start by making an initial choice for the number of Erlangs M in the mixture and set the shapes equal to r_j = j for j = 1, 2, ..., M. Extending Lee and Lin (2010) (we acknowledge the help of Simon Lee, who suggested this approach in personal communication), we introduce a spread factor s by which we multiply the shapes in order to get a wider spread at the initial step, i.e. r_j = s · j for j = 1, 2, ..., M. The initialization of θ and α is based on the denseness of mixtures of Erlangs (see Tijms, 1994, p. 163), as explained in Section 3.2. The initialization procedure can be based either on the

uncensored data only, or the censored data can also be employed. The initial scale θ^(0) is chosen such that θ^(0) r_M equals the maximum data point, and the initial weights α_j^(0) for j = 1, 2, ..., M are set to the relative frequency of data points in the interval (r_{j−1} θ^(0), r_j θ^(0)]. If this interval does not contain any data points for some j, the initial weight corresponding to the Erlang in the mixture with shape r_j will be zero, and consequently the weight α_j will remain zero at each subsequent iteration. This is clear from the updating scheme (16) in the M-step and the expressions (13) and (15) of the posterior probabilities in the E-step. The shapes r_j with initial weight α_j^(0) equal to zero are therefore removed from the mixture at the initial step. Numerical experiments show that the iterative scheme performs well and results in fast convergence using the above choice of initial estimates for θ and α.

4.2 Adjusting the shapes

Since the initial shape parameters are pre-fixed and hence not estimated, the fitted mixture might be sub-optimal. Adjustment of the shape parameters is necessary. Ideally, for a given number of Erlangs M, we want to choose optimal values for the shapes. The choice of the shapes for a given M is however an optimization problem over N^M which is infeasible to solve exactly. We have to resort to a practical procedure which explores the parameter space efficiently in order to obtain a satisfying choice for the shapes. After applying the EM algorithm once to obtain the maximum likelihood estimates corresponding to the initial choice of the shape parameters, we perform stepwise variations of the shapes, each time refitting the scale and the weights using the EM algorithm, and compare the log likelihoods of the results. We hereby follow the procedure proposed by Lee and Lin (2010): 1.
Run the algorithm starting from the shapes {r_1, ..., r_{M−1}, r_M + 1}, with initial scale θ and weights {α_1, ..., α_{M−1}, α_M} the final estimates of the previous execution of the EM algorithm. Repeat this step for as long as the log likelihood improves, each time replacing the old set of parameters by the new ones. This procedure is then applied to the (M−1)th shape, and so forth until all the shapes are treated. 2. Run the algorithm starting from the shapes {r_1 − 1, r_2, ..., r_M}, with initial scale θ and weights {α_1, α_2, ..., α_M} the final estimates of the previous execution of the EM algorithm. Repeat this step for as long as the log likelihood improves, each time replacing the old set of parameters by the new ones. This procedure is then applied to the 2nd shape, and so forth until all the shapes are treated. 3. Repeat the loops described in the previous steps until the log likelihood can no longer be increased. Using this algorithm we eventually reach a local maximum of the log likelihood, by which we mean that the fit can no longer be improved by either increasing or decreasing any of the r_j.

4.3 Reducing the number of Erlangs

Too many Erlangs in the mixture will result in overfitting, which is always a concern in statistical modeling. A decision rule such as Akaike's information criterion (AIC, Akaike, 1974) or Schwarz's Bayesian information criterion (BIC, Schwarz, 1978) helps to decide on the

value of M. Any other information criterion (IC) or objective function could be optimized, depending on the purpose for which the model is used. We use a backward stepwise search. As mixtures of Erlangs are dense in the space of continuous distributions, we start from a close-fitting mixture of M Erlangs resulting from the shape adjustment procedure described in Section 4.2 and compute the value of the IC. We next reduce the number of Erlangs M in the mixture by deleting the shape r_j with smallest weight α_j, refit the scale and weights using the EM algorithm, and readjust the shapes using the same shape adjustment procedure. If the resulting fit with M − 1 Erlangs attains a lower value of the IC, the new parameter values replace the old ones. We continue reducing the number of Erlangs in the mixture until the value of the IC no longer decreases by deleting an additional shape. A backward selection has the advantage of providing initial values close to the maximum likelihood estimates corresponding to the new set of shapes, which greatly reduces the run time (Lee and Lin, 2010). In contrast, when using a forward stepwise procedure it is not clear which additional shape parameter to use and how the parameters from the previous run can provide useful information on parameter initialization. As a guideline, we recommend starting from an initial choice for the number of Erlangs M and a spread s resulting in a close fit or even an overfit of the data.

4.4 Comparing the final fit using different starting values

Since the log likelihood may have multiple local maxima due to the presence of the positive integer shapes, the quality of the initial estimate can greatly influence the final result, and it is therefore wise to try and compare the results of several initial starting points.
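The stepwise shape adjustment of Section 4.2 and the backward reduction of Section 4.3 can be sketched as search loops around a fitting callback (`fit` and `fit_ic` below are hypothetical stand-ins for a full EM run and, for `fit_ic`, the IC evaluation; the real procedure lives in the paper's R code):

```python
def adjust_shapes(shapes, fit):
    # Section 4.2: nudge shapes up (starting from r_M) and down (starting from
    # r_1) while the log likelihood fit(shapes) improves.
    def valid(sh):
        return sh[0] >= 1 and all(a < b for a, b in zip(sh, sh[1:]))
    best, improved = fit(shapes), True
    while improved:
        improved = False
        moves = [(j, +1) for j in range(len(shapes) - 1, -1, -1)] + \
                [(j, -1) for j in range(len(shapes))]
        for j, delta in moves:
            while True:
                cand = shapes[:j] + [shapes[j] + delta] + shapes[j + 1:]
                if not valid(cand):
                    break
                ll = fit(cand)
                if ll <= best:
                    break
                shapes, best, improved = cand, ll, True
    return shapes, best

def reduce_mixture(shapes, fit_ic):
    # Section 4.3: delete the smallest-weight shape while the IC improves.
    ic, weights = fit_ic(shapes)
    while len(shapes) > 1:
        j = min(range(len(weights)), key=lambda i: weights[i])
        cand = shapes[:j] + shapes[j + 1:]
        cand_ic, cand_weights = fit_ic(cand)
        if cand_ic >= ic:
            break
        shapes, weights, ic = cand, cand_weights, cand_ic
    return shapes, weights, ic
```

Both loops only ever accept a candidate that strictly improves the objective, so they terminate at the kind of local optimum described above.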
We compare the final fits, after the shape adjustment procedure and reduction of the number of Erlangs using an IC, starting from different sets of initial values obtained by varying the spread factor s in the initial step.

5 Examples

The usefulness of the proposed fitting procedure is demonstrated in four applications. A first example involves simulated data from a bimodal distribution, which we censor and truncate, allowing us to compare the original density and the entire uncensored and untruncated sample to the fitted mixture of Erlangs. The second example illustrates the use of mixtures of Erlangs to represent right-censored unemployment durations. The third example fits a mixture of Erlang distributions to truncated claim size data and demonstrates how the fitted mixture can be used to analytically price reinsurance contracts. In the fourth example, we illustrate the effectiveness of the proposed algorithm on a small interval-censored dataset representing the time to cosmetic deterioration. The developed R program and the datasets discussed are available online. The R code contains the procedures discussed in Section 4. An illustration is provided for the first example we consider.
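The initial step of Section 4.1, used by each of the searches above, is easy to sketch (our own Python; the function name is illustrative):

```python
def initialize(data, M, s=1):
    # Section 4.1 initialization: shapes r_j = s*j, theta0 = max(d)/r_M,
    # alpha_j^0 = relative frequency of data in (r_{j-1} theta0, r_j theta0]
    shapes = [s * j for j in range(1, M + 1)]
    theta0 = max(data) / shapes[-1]
    edges = [0.0] + [r * theta0 for r in shapes]
    alphas = [sum(1 for d in data if edges[j] < d <= edges[j + 1]) / len(data)
              for j in range(M)]
    # shapes whose initial weight is zero would stay zero forever, so drop them
    kept = [(r, a) for r, a in zip(shapes, alphas) if a > 0]
    return [r for r, _ in kept], [a for _, a in kept], theta0
```

A larger spread factor s stretches the same M shapes over a wider grid, which is exactly what is varied when comparing starting values in Section 4.4.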

5.1 Simulated data

We generate a random sample of 5000 observations from the bimodal mixture of gamma distributions with density function given by

    f_u(x) = 0.4 f(x; r = 5, θ = 0.5) + 0.6 f(x; r = 10, θ = 1).   (18)

Next we truncate the data by rejecting all observations beneath the 5% sample quantile or above the 95% sample quantile. The remaining 4500 data points are subsequently right censored by generating 4500 observations from another mixture of gamma distributions with density function

    f_rc(x) = 0.4 f(x; r = 5, θ = 2/3) + 0.6 f(x; r = 9, θ = 1.25).

The resulting data set is composed of 2579 uncensored and 1921 right censored data points, and is used to calibrate the Erlang mixture, taking the lower and upper truncation into account. Using the automatic search from Section 4.4, we start from M = 10 Erlangs in the mixture and let the spread factor s used in the initial step range from 1 to 10. AIC is used to decide upon how many Erlangs to use in the mixture, and the right censored data points are treated as being observed at the initial step. The lowest AIC value was obtained with a mixture of 4 Erlangs. The final parameter estimates are in Table 1. This optimal choice of shapes was reached using spread factor s = 3.

Table 1: Parameter estimates (r_j, α_j and θ) of the fitted mixture of 4 Erlangs to the censored and truncated data generated starting from density (18).

A graphical comparison of the fitted distribution and the original data generated can be found in Figure 1. We compare the fitted mixture of Erlangs density to the true density (18) and a histogram of all 5000 generated data points, before truncation and censoring. The fitted mixture of Erlangs density almost perfectly coincides with the underlying true density.

Figure 1: Graphical comparison of the density of the fitted mixture of 4 Erlangs, the true underlying density (18) and the histogram of the generated data before censoring and truncation.
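The data-generating design of this example can be sketched as follows (our own script with an arbitrary seed, so the resulting censoring split only roughly matches the 2579/1921 reported above):

```python
import random

random.seed(2014)  # arbitrary seed; counts below vary with it

def draw_mix(p, g1, g2):
    # Draw from p * Gamma(g1) + (1 - p) * Gamma(g2); gammavariate(shape, scale)
    return random.gammavariate(*g1) if random.random() < p else random.gammavariate(*g2)

# 5000 draws from the bimodal density (18)
x = sorted(draw_mix(0.4, (5, 0.5), (10, 1.0)) for _ in range(5000))

# truncate at the 5% and 95% sample quantiles
t_l, t_u = x[249], x[4750]
kept = [v for v in x if t_l <= v <= t_u]

# right-censor with independent draws from the second gamma mixture;
# a censored point is recorded as (c_i, t_u), an observed one as (x_i, x_i)
sample = []
for v in kept:
    c = draw_mix(0.4, (5, 2 / 3), (9, 1.25))
    sample.append((v, v) if v <= c else (max(c, t_l), t_u))
```

Roughly half the retained points end up right censored, in line with the split reported in the text.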

5.2 Unemployment duration

We examine the economic data from the January Current Population Survey's Displaced Workers Supplements (DWS) for the years 1986, 1988, 1990, and 1992, first analyzed in McCall (1996). A thorough discussion of this dataset is available in Cameron and Trivedi (2005). The variable under consideration is unemployment duration (spell) or, more accurately, joblessness duration, measured in two-week intervals. All other covariates in the dataset are ignored in the analysis. Following Cameron and Trivedi (2005), a spell is considered complete if the person is re-employed at a full-time job (CENSOR1) and right-censored otherwise. This results in 1073 uncensored data points and 2270 right censored data points. The parameter estimates of the Erlang mixture, obtained by using the automatic search procedure starting from M = 10 Erlangs in the mixture with spread factor s used in the initial step ranging from 1 to 10, are given in Table 2. AIC is again used to decide upon the number of Erlangs in the mixture, and the right censored data points are treated as being observed at initialization. The lowest AIC value was obtained with a mixture of 4 Erlangs. This optimal choice of shapes was reached using spread factor s = 3.

Table 2: Parameter estimates (r_j, α_j and θ) of the fitted mixture of 4 Erlangs to the right-censored unemployment data.

Figure 2 compares the Kaplan-Meier survival curve (Kaplan and Meier, 1958), along with 95% confidence bounds, to the survival function of the estimated Erlang mixture. The fitted survival function provides a smooth fit of the data, closely resembling the non-parametric estimate.

Figure 2: Graphical comparison of the survival function of the fitted mixture of 4 Erlangs and the Kaplan-Meier estimator with 95% confidence bounds for the right-censored unemployment data.

5.3 Secura Belgian Re insurance data

The Secura Belgian Re dataset discussed in Beirlant et al. (2004) contains 371 automobile claims from 1988 until 2001, gathered from several European insurance companies. The data are uncensored, but left truncated at 1,200,000 since a claim is only reported to the reinsurer if the claim size is at least as large as 1,200,000 euro. The sizes of the claims are corrected, among others, for inflation. Based on these observations, the reinsurer wants to calibrate a model in order to price reinsurance contracts. The search procedure using AIC prefers a mixture of only two Erlangs, with shapes 5 and 15. The parameter estimates of this best-fitting mixture are shown in Table 3. In Figure 3 (left) we compare the histogram of the truncated data to the fitted truncated density. Figure 3 (right) illustrates that the truncated survival function of the mixture of two Erlangs perfectly coincides with the Kaplan-Meier estimate.

Table 3: Parameter estimates (r_j, α_j and θ) of the fitted mixture of 2 Erlangs to the left-truncated claim sizes in the Secura Belgian Re dataset.

Figure 3: Graphical comparison of the truncated density of the fitted mixture of 2 Erlangs and the histogram of the left-truncated claim sizes (left), and of the truncated survival function and the Kaplan-Meier estimator with 95% confidence bounds (right) for the Secura Belgian Re dataset.

Following Beirlant et al. (2004, p. 188), we use the calibrated Erlang mixture to price an excess-of-loss (XL) reinsurance contract, where the reinsurer pays for the claim amount in excess of a given limit. The net premium Π(R) of such a contract with retention level R > 1,200,000 is given by

    Π(R) = E( (X − R)_+ | X > 1,200,000 ),

where X denotes the claim size and (·)_+ = max(·, 0).
In case X follows a mixture of M Erlang distributions, where we assume without loss of generality that $r_i = i$ for $i = 1, \ldots, M$, the net premium is

$$\Pi(R) = \frac{\theta e^{-R/\theta}}{1 - F(1{,}200{,}000; \boldsymbol{\alpha}, \boldsymbol{r}, \theta)} \sum_{n=0}^{M-1} \frac{(R/\theta)^n}{n!} \sum_{k=n}^{M-1} A_k = \frac{\theta^2}{1 - F(1{,}200{,}000; \boldsymbol{\alpha}, \boldsymbol{r}, \theta)} \sum_{n=1}^{M} \Big( \sum_{k=n-1}^{M-1} A_k \Big) f(R; n, \theta), \qquad (19)$$

with $A_k = \sum_{j=k+1}^{M} \alpha_j$ for $k = 0, \ldots, M-1$. The derivation of this property can be reconstructed using Willmot and Woo (2007) or Klugman et al. (2013, p. 21). In Table 4, we compare the non-parametric, Hill and Generalized Pareto (GP) based estimates of $\Pi(R)$ for the Secura Belgian Re dataset from Table 6.1 in Beirlant et al. (2004, p. 191) to the estimates obtained using formula (19). The maximum claim size observed in the dataset equals 7,898,639, which is the only data point on which the non-parametric estimate of the net premium with retention level R = 7,500,000 is based. The non-parametric estimate corresponding to retention level R = 10,000,000 is hence zero. The fitted Erlang mixture allows us to estimate the net premium using intrinsically all data points, but postulates a lighter tail compared to the Pareto-type alternatives, since Erlang mixtures have an asymptotically exponential tail (Neuts (1981, p. 62)). Whereas the estimates based on the extreme value methodology are higher than the non-parametric ones, the estimates based on the Erlang mixture are lower. At the high end of the sample range the estimators differ strongly, as implied by the different tail behavior of the three approaches. The reinsurance actuary should carefully investigate the right tail behavior of the data in order to choose his approach.

Table 4: Non-parametric, Hill, GP and Mixed Erlang-based estimates for Π(R).
R | Non-Parametric | Hill | GP | Mixed Erlang
3,000,000 | … | … | … | …
3,500,000 | … | … | … | …
4,000,000 | 74,… | … | … | …
4,500,000 | 53,… | … | … | …
5,000,000 | 35,… | … | … | …
7,500,000 | 1,… | … | … | …
10,000,000 | 0 | … | … | …

Besides modeling the tail of the claim size distribution above a certain threshold, Beirlant et al. (2004, p. 198) also estimate a global statistical model to describe the whole range of all possible claim outcomes for the Secura Belgian Re dataset.
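Formula (19) and the excess-loss weights discussed further below are straightforward to implement. The sketch uses illustrative parameters (not the fitted Secura values of Table 3) and checks the single-Erlang(1) special case, for which the mixture is exponential and $\Pi(R) = \theta e^{-(R-t)/\theta}$ by memorylessness:

```python
import math

def erlang_sf(x, r, theta):
    """Erlang(r, theta) survival function."""
    z = x / theta
    return math.exp(-z) * sum(z**n / math.factorial(n) for n in range(r))

def erlang_pdf(x, r, theta):
    """Erlang(r, theta) density f(x; r, theta)."""
    return x**(r - 1) * math.exp(-x / theta) / (theta**r * math.factorial(r - 1))

def net_premium(R, t, alpha, theta):
    """Formula (19): E[(X - R)_+ | X > t] for a mixture with shapes 1..M,
    weights alpha and common scale theta (assumes R > t)."""
    M = len(alpha)
    A = [sum(alpha[k:]) for k in range(M)]      # A_k = sum_{j=k+1}^M alpha_j
    B = [sum(A[n:]) for n in range(M)]          # B_n = sum_{k=n}^{M-1} A_k
    z = R / theta
    stop_loss = theta * math.exp(-z) * sum(B[n] * z**n / math.factorial(n)
                                           for n in range(M))
    surv_t = sum(a * erlang_sf(t, j + 1, theta) for j, a in enumerate(alpha))
    return stop_loss / surv_t

def excess_weights(d, alpha, theta):
    """Weights of the excess-loss (residual lifetime) Erlang mixture at level d."""
    M = len(alpha)
    num = [sum(alpha[n + j - 1] * erlang_pdf(d, n + 1, theta)
               for n in range(M - j + 1))
           for j in range(1, M + 1)]
    total = sum(num)
    return [v / total for v in num]

# Illustrative parameters only.
alpha, theta = [0.2, 0.5, 0.3], 1.5

# Sanity check: exponential case, net premium theta * exp(-(R - t)/theta).
print(net_premium(3.0, 1.0, [1.0], 2.0), 2.0 * math.exp(-1.0))

# The mean excess at d equals theta * sum_j j * beta_j(d), which must agree
# with the unconditional stop-loss divided by the survival at d.
d = 2.5
beta = excess_weights(d, alpha, theta)
print(theta * sum((j + 1) * b for j, b in enumerate(beta)),
      net_premium(d, d, alpha, theta))
```

The second printed pair illustrates numerically that the excess-loss distribution is again a mixed Erlang with the same scale, as stated later in this section.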
This is needed when trying to estimate Π(R) for values of R smaller than the threshold above which the extreme value distribution is fit. Based on the mean excess function, the authors propose the use of a mixture of an exponential and a Pareto distribution. Instead of having to use such a body-tail approach explicitly, the implemented shape adjustment and reduction techniques have guided us to a mixture with two components, of which the first represents the body of the distribution and the second represents the tail. The fitting procedure for Erlang mixtures is able to make this choice implicitly, in a data-driven way, leading to a close representation of the data. In Table 5 we compare the estimated net premiums from Table 6.2 in Beirlant et al. (2004, p. 198) obtained using the Exp-Par model to the non-parametric and Mixed Erlang estimates.

Table 5: Non-parametric, Exp-Par and Mixed Erlang-based estimates for Π(R).
R | Non-Parametric | Exp-Par | Mixed Erlang
1,250,000 | … | … | …
1,500,000 | … | … | …
1,750,000 | … | … | …
2,000,000 | … | … | …
2,250,000 | … | … | …
2,500,000 | … | … | …619.0

Note that when R = 1,200,000, the net premium equals the mean excess loss $E[X - R \mid X > R]$, which is called the mean residual lifetime in a survival context. Klugman et al. (2013, p. 20) show that the distribution of the excess loss or residual lifetime is again a mixture of M Erlangs with the same scale θ and different weights, which we can compute analytically:

$$\beta_j = \frac{\sum_{n=0}^{M-j} \alpha_{n+j}\, f(R; n+1, \theta)}{\sum_{n=0}^{M-1} A_n\, f(R; n+1, \theta)} \qquad \text{for } j = 1, \ldots, M,$$

with $A_n = \sum_{j=n+1}^{M} \alpha_j$ as before.

5.4 Time to cosmetic deterioration

In order to illustrate the fitting procedure of Erlang mixtures in case of interval censoring, we consider a small dataset from a retrospective study to compare the cosmetic effects of radiotherapy alone versus radiotherapy and adjuvant chemotherapy on women with early breast cancer. The 94 subjects in the study were monitored over time with visits every 4-6 months but, as their recovery progressed, the interval between visits lengthened. This dataset is discussed, among others, in Finkelstein and Wolfe (1985) and Klein and Moeschberger (2003) and is available in the packages MLEcens and KMsurv in R. We examine the time to the first appearance of moderate or severe breast retraction, a cosmetic deterioration of the breast. As the patients were observed only at some random times, the exact times of breast retraction are known only to fall within the interval between visits (56 interval-censored observations) or after the last time the patient was seen (38 right-censored observations). In this illustration, we ignore the treatment indicator. AIC prefers a mixture of 3 Erlangs, of which the estimated parameters are shown in Table 6. A non-parametric estimator of the survival function for interval-censored data is given by Turnbull (1976).
The estimator is a modified version of the Kaplan-Meier estimator, based on an iterative procedure, and has no closed form. Figure 4 compares the survival function of the fitted Erlang mixture to Turnbull's estimate for the censored time to cosmetic deterioration. The fitted survival function of the Erlang mixture lies within the 95% confidence bounds at each time point and provides a smoother fit of the data compared to the non-parametric alternative.

Table 6: Parameter estimates of the fitted mixture of 3 Erlangs to the censored time to cosmetic deterioration.
r α θ
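For interval-censored data of this kind, the observed-data likelihood that the EM algorithm maximizes is built from cdf differences: an observation censored to the interval $(l_i, u_i]$ contributes $F(u_i) - F(l_i)$, and a right-censored observation contributes $1 - F(l_i)$. A small sketch with made-up intervals (not the breast-retraction data):

```python
import math

def erlang_cdf(x, r, theta):
    """Erlang(r, theta) cumulative distribution function."""
    z = x / theta
    return 1.0 - math.exp(-z) * sum(z**n / math.factorial(n) for n in range(r))

def mixture_cdf(x, shapes, weights, theta):
    return sum(a * erlang_cdf(x, r, theta) for r, a in zip(shapes, weights))

def loglik(intervals, shapes, weights, theta):
    """Observed-data log-likelihood: each observation is a pair (l, u);
    u = math.inf encodes right censoring at l."""
    ll = 0.0
    for l, u in intervals:
        if math.isinf(u):
            contrib = 1.0 - mixture_cdf(l, shapes, weights, theta)
        else:
            contrib = (mixture_cdf(u, shapes, weights, theta)
                       - mixture_cdf(l, shapes, weights, theta))
        ll += math.log(contrib)
    return ll

# Hypothetical observations: two interval-censored, one right-censored.
obs = [(4.0, 10.0), (8.0, 14.0), (20.0, math.inf)]
print(loglik(obs, shapes=[1, 3], weights=[0.4, 0.6], theta=5.0))
```

The E-step of the algorithm distributes each such censored contribution over the mixture components; this function only evaluates the objective being maximized.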

Figure 4: Graphical comparison of the survival function of the fitted mixture of 3 Erlangs and the Turnbull estimator of the survival function with 95% confidence bounds for the censored time to cosmetic deterioration.

6 Discussion

We extend the EM algorithm of Lee and Lin (2010) for fitting mixtures of Erlangs with a common scale parameter to censored and truncated data. The EM algorithm adapted to deal with censored and truncated data remains a simple iterative algorithm. The initialization of the parameters can be done in a similar way as in Lee and Lin (2010), based on the denseness property (Tijms, 1994, p. 163), and provides close starting values, making the algorithm converge fast. The shape adjustment procedure explores the parameter space in a clever way such that, when adjusting and reducing the shapes, the previous estimates for the scale and the weights provide a very close approximation to the maximum likelihood estimates corresponding to the new set of shapes, which greatly reduces the run time. Extending Lee and Lin (2010), we suggest the use of a spread factor to achieve a wider spread of the shapes at the initial step. We recommend comparing the resulting fits starting from different initial values, obtained by varying the spread factor and changing the initial number of Erlangs. We implement the fitting procedures in R and show how mixtures of Erlangs can be used to adequately represent any univariate distribution in a wide variety of applications where the data are allowed to be censored and truncated. The examples on several simulated and real datasets illustrate the effectiveness of our proposed algorithm and demonstrate the approximation strength of mixtures of Erlangs. Future research may explore incorporating regressor variables in the mixture of Erlangs with common scale, introducing the flexibility of this approach in a regression context.
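The denseness-based initialization with a spread factor can be sketched as follows. This is a loose reading with assumed details (shapes $r_j = s \cdot j$, scale $\max(x)/r_M$, weights taken as the empirical mass between consecutive points $\theta r_{j-1}$ and $\theta r_j$), not the authors' exact R implementation:

```python
def initialize(data, M, s):
    """Hypothetical denseness-based starting values for an M-component
    Erlang mixture with spread factor s: shapes s*j, scale max(x)/r_M,
    and weights from the empirical mass in each induced bin."""
    shapes = [s * j for j in range(1, M + 1)]
    theta = max(data) / shapes[-1]
    edges = [0.0] + [theta * r for r in shapes]
    n = len(data)
    weights = [sum(lo < x <= hi for x in data) / n
               for lo, hi in zip(edges[:-1], edges[1:])]
    return shapes, theta, weights

# Toy positive data; a larger spread factor stretches the largest shape,
# shrinking the initial common scale.
data = [0.3, 1.1, 2.4, 2.5, 3.9, 4.2, 7.7, 9.8]
shapes, theta, weights = initialize(data, M=4, s=2)
print(shapes, round(theta, 3), weights)
```

Running this for several values of s and M, then keeping the converged fit with the lowest AIC, mirrors the strategy recommended above.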
Adjusting the EM algorithm tailored to the class of multivariate mixtures of Erlangs, introduced by Lee and Lin (2012), to the case of censored and truncated data is another appealing extension.

Acknowledgements

The authors are grateful to Simon Lee and Gerda Claeskens for valuable comments and suggestions. Andrei Badescu and Sheldon Lin gratefully acknowledge the financial support received from the Natural Sciences and Engineering Research Council of Canada (NSERC). Katrien Antonio acknowledges financial support from NWO through a Veni 2009 grant. Katrien Antonio and Roel Verbelen acknowledge financial support from an FWO research project.

Appendix A Partial derivative of Q

In order to maximize $Q(\Theta; \Theta^{(k-1)})$ with respect to $\theta$, we set the first-order partial derivative at $\theta^{(k)}$ equal to zero:

$$\frac{\partial Q(\Theta; \Theta^{(k-1)})}{\partial \theta}\bigg|_{\theta=\theta^{(k)}} = \sum_{i \in U} \sum_{j=1}^{M} {}^{u}z_{ij}^{(k)} \left( \frac{x_i}{\theta^2} - \frac{r_j}{\theta} - \frac{\partial}{\partial \theta} \ln\big[F(t^u; r_j, \theta) - F(t^l; r_j, \theta)\big] \right)$$
$$+ \sum_{i \in C} \sum_{j=1}^{M} {}^{c}z_{ij}^{(k)} \left( \frac{E\big(X_i \mid Z_{ij}=1, l_i, u_i, t^l, t^u; \Theta^{(k-1)}\big)}{\theta^2} - \frac{r_j}{\theta} - \frac{\partial}{\partial \theta} \ln\big[F(t^u; r_j, \theta) - F(t^l; r_j, \theta)\big] \right) = 0 .$$

Writing the Erlang cumulative distribution function as in expression (3), $F(t; r, \theta) = \gamma(r, t/\theta)/(r-1)! = 1 - \sum_{n=0}^{r-1} e^{-t/\theta} (t/\theta)^n / n!$, with $\gamma$ the lower incomplete gamma function, its partial derivative with respect to the scale is

$$\frac{\partial}{\partial \theta} F(t; r, \theta) = -\frac{t^{r} e^{-t/\theta}}{\theta^{r+1} (r-1)!},$$

so that

$$\frac{\partial}{\partial \theta} \ln\big[F(t^u; r_j, \theta) - F(t^l; r_j, \theta)\big] = \frac{(t^l)^{r_j} e^{-t^l/\theta} - (t^u)^{r_j} e^{-t^u/\theta}}{\theta^{r_j+1} (r_j-1)! \big(F(t^u; r_j, \theta) - F(t^l; r_j, \theta)\big)} .$$

Substituting, multiplying by $\theta^2$ and dividing by $n$, with $\beta_j^{(k)} = \frac{1}{n}\big( \sum_{i \in U} {}^{u}z_{ij}^{(k)} + \sum_{i \in C} {}^{c}z_{ij}^{(k)} \big)$, yields

$$\frac{1}{n}\left( \sum_{i \in U} x_i + \sum_{i \in C} \sum_{j=1}^{M} {}^{c}z_{ij}^{(k)}\, E\big(X_i \mid Z_{ij}=1, l_i, u_i, t^l, t^u; \Theta^{(k-1)}\big) \right) - \theta \sum_{j=1}^{M} \beta_j^{(k)} r_j - \sum_{j=1}^{M} \beta_j^{(k)}\, \frac{(t^l)^{r_j} e^{-t^l/\theta} - (t^u)^{r_j} e^{-t^u/\theta}}{\theta^{r_j-1} (r_j-1)! \big(F(t^u; r_j, \theta) - F(t^l; r_j, \theta)\big)}\Bigg|_{\theta=\theta^{(k)}} = 0,$$

where we used expression (3) of the cumulative distribution function of an Erlang.
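The final equation above defines $\theta^{(k)}$ only implicitly, since $\theta$ also appears in the truncation correction (the last sum). One simple way to solve it, chosen here for illustration rather than taken from the authors' implementation, is a fixed-point iteration with placeholder values for the posterior weights and expected values:

```python
import math

def erlang_cdf(x, r, theta):
    if math.isinf(x):
        return 1.0
    z = x / theta
    return 1.0 - math.exp(-z) * sum(z**n / math.factorial(n) for n in range(r))

def m_step_theta(mean_term, shapes, beta, t_l, t_u, theta0, n_iter=100):
    """Fixed-point iteration for the scale update: solve
    mean_term - theta * sum_j beta_j r_j - T(theta) = 0, where T(theta) is
    the truncation correction from the first-order condition."""
    theta = theta0
    for _ in range(n_iter):
        T = 0.0
        for r, b in zip(shapes, beta):
            low = t_l**r * math.exp(-t_l / theta)
            up = 0.0 if math.isinf(t_u) else t_u**r * math.exp(-t_u / theta)
            denom = (theta**(r - 1) * math.factorial(r - 1)
                     * (erlang_cdf(t_u, r, theta) - erlang_cdf(t_l, r, theta)))
            T += b * (low - up) / denom
        theta = (mean_term - T) / sum(b * r for r, b in zip(shapes, beta))
    return theta

# No truncation: t_l = 0, t_u = inf, so T = 0 and theta = mean_term / sum_j beta_j r_j.
print(m_step_theta(6.0, [1, 3], [0.5, 0.5], 0.0, math.inf, theta0=1.0))  # 3.0

# Single Erlang(1) left-truncated at a: T = a, recovering theta = mean - a,
# the familiar MLE for a shifted exponential.
print(m_step_theta(5.0, [1], [1.0], 1.0, math.inf, theta0=1.0))  # 4.0
```

The two printed cases are the degenerate checks mentioned in the comments; with genuine two-sided truncation the correction term depends on $\theta$ and the iteration converges to the root of the first-order condition.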

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723.

Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., De Waal, D., and Ferro, C. (2004). Statistics of Extremes: Theory and Applications. Wiley Series in Probability and Statistics. Wiley.

Cameron, A. and Trivedi, P. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38.

Finkelstein, D. M. and Wolfe, R. A. (1985). A semiparametric model for regression analysis of interval-censored failure time data. Biometrics, 41(4):933-945.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457-481.

Klein, J. and Moeschberger, M. (2003). Survival Analysis: Techniques for Censored and Truncated Data. Statistics for Biology and Health. Springer, second edition.

Klugman, S. A., Panjer, H. H., and Willmot, G. E. (2013). Loss Models: Further Topics. John Wiley & Sons.

Lee, G. and Scott, C. (2012). EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816-2829.

Lee, S. C. and Lin, X. S. (2012). Modeling dependent risks with multivariate Erlang mixtures. ASTIN Bulletin, 42(1):153-180.

Lee, S. C. and Lin, X. S. (2010). Modeling and evaluating insurance losses via mixtures of Erlang distributions. North American Actuarial Journal, 14(1):107-130.

McCall, B. P. (1996). Unemployment insurance rules, joblessness, and part-time work. Econometrica, 64(3):647-682.

McLachlan, G. and Jones, P. (1988). Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics, 44(2):571-578.

McLachlan, G. and Peel, D. (2001). Finite Mixture Models. Wiley.

McLachlan, G. J. and Krishnan, T. (2007). The EM Algorithm and Extensions, volume 382. Wiley-Interscience.

Neuts, M. F. (1981). Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. The Johns Hopkins University Press.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.

Tijms, H. C. (1994). Stochastic Models: An Algorithmic Approach. Wiley.

Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B (Methodological), 38(3):290-295.

Willmot, G. E. and Lin, X. S. (2011). Risk modelling with the mixed Erlang distribution. Applied Stochastic Models in Business and Industry, 27(1):2-16.

Willmot, G. E. and Woo, J.-K. (2007). On the class of Erlang mixtures with risk theoretic applications. North American Actuarial Journal, 11(2):99-115.


Quantile Regression for Residual Life and Empirical Likelihood Quantile Regression for Residual Life and Empirical Likelihood Mai Zhou email: mai@ms.uky.edu Department of Statistics, University of Kentucky, Lexington, KY 40506-0027, USA Jong-Hyeon Jeong email: jeong@nsabp.pitt.edu

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Variable selection for model-based clustering

Variable selection for model-based clustering Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition

More information

Research Statement. Zhongwen Liang

Research Statement. Zhongwen Liang Research Statement Zhongwen Liang My research is concentrated on theoretical and empirical econometrics, with the focus of developing statistical methods and tools to do the quantitative analysis of empirical

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Frailty Models and Copulas: Similarities and Differences

Frailty Models and Copulas: Similarities and Differences Frailty Models and Copulas: Similarities and Differences KLARA GOETHALS, PAUL JANSSEN & LUC DUCHATEAU Department of Physiology and Biometrics, Ghent University, Belgium; Center for Statistics, Hasselt

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS

PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Statistica Sinica 15(2005), 831-840 PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Florin Vaida University of California at San Diego Abstract: It is well known that the likelihood sequence of the EM algorithm

More information

Solutions to the Spring 2015 CAS Exam ST

Solutions to the Spring 2015 CAS Exam ST Solutions to the Spring 2015 CAS Exam ST (updated to include the CAS Final Answer Key of July 15) There were 25 questions in total, of equal value, on this 2.5 hour exam. There was a 10 minute reading

More information

Parameter Estimation of Power Lomax Distribution Based on Type-II Progressively Hybrid Censoring Scheme

Parameter Estimation of Power Lomax Distribution Based on Type-II Progressively Hybrid Censoring Scheme Applied Mathematical Sciences, Vol. 12, 2018, no. 18, 879-891 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2018.8691 Parameter Estimation of Power Lomax Distribution Based on Type-II Progressively

More information

ASTIN Colloquium 1-4 October 2012, Mexico City

ASTIN Colloquium 1-4 October 2012, Mexico City ASTIN Colloquium 1-4 October 2012, Mexico City Modelling and calibration for non-life underwriting risk: from empirical data to risk capital evaluation Gian Paolo Clemente Dept. Mathematics, Finance Mathematics

More information

Bayesian Predictive Modeling for Exponential-Pareto Composite Distribution

Bayesian Predictive Modeling for Exponential-Pareto Composite Distribution Bayesian Predictive Modeling for Exponential-Pareto Composite Distribution ABSTRCT Composite distributions have well-known applications in the insurance industry. In this article a composite Exponential-Pareto

More information

Multivariate Normal-Laplace Distribution and Processes

Multivariate Normal-Laplace Distribution and Processes CHAPTER 4 Multivariate Normal-Laplace Distribution and Processes The normal-laplace distribution, which results from the convolution of independent normal and Laplace random variables is introduced by

More information

Optimal Reinsurance Strategy with Bivariate Pareto Risks

Optimal Reinsurance Strategy with Bivariate Pareto Risks University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 2014 Optimal Reinsurance Strategy with Bivariate Pareto Risks Evelyn Susanne Gaus University of Wisconsin-Milwaukee Follow

More information

A Comparison: Some Approximations for the Aggregate Claims Distribution

A Comparison: Some Approximations for the Aggregate Claims Distribution A Comparison: ome Approximations for the Aggregate Claims Distribution K Ranee Thiagarajah ABTRACT everal approximations for the distribution of aggregate claims have been proposed in the literature. In

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Goodness-of-fit tests for randomly censored Weibull distributions with estimated parameters

Goodness-of-fit tests for randomly censored Weibull distributions with estimated parameters Communications for Statistical Applications and Methods 2017, Vol. 24, No. 5, 519 531 https://doi.org/10.5351/csam.2017.24.5.519 Print ISSN 2287-7843 / Online ISSN 2383-4757 Goodness-of-fit tests for randomly

More information

A THREE-PARAMETER WEIGHTED LINDLEY DISTRIBUTION AND ITS APPLICATIONS TO MODEL SURVIVAL TIME

A THREE-PARAMETER WEIGHTED LINDLEY DISTRIBUTION AND ITS APPLICATIONS TO MODEL SURVIVAL TIME STATISTICS IN TRANSITION new series, June 07 Vol. 8, No., pp. 9 30, DOI: 0.307/stattrans-06-07 A THREE-PARAMETER WEIGHTED LINDLEY DISTRIBUTION AND ITS APPLICATIONS TO MODEL SURVIVAL TIME Rama Shanker,

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Per Hjertstrand Research Institute of Industrial Economics (IFN), Stockholm, Sweden Per.Hjertstrand@ifn.se

More information

Expectation Maximization and Mixtures of Gaussians

Expectation Maximization and Mixtures of Gaussians Statistical Machine Learning Notes 10 Expectation Maximiation and Mixtures of Gaussians Instructor: Justin Domke Contents 1 Introduction 1 2 Preliminary: Jensen s Inequality 2 3 Expectation Maximiation

More information

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models TI 2016-067/IV Tinbergen Institute Discussion Paper Accounting for Missing Values in Score- Driven Time-Varying Parameter Models André Lucas Anne Opschoor Julia Schaumburg Faculty of Economics and Business

More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

Relevance Vector Machines for Earthquake Response Spectra

Relevance Vector Machines for Earthquake Response Spectra 2012 2011 American American Transactions Transactions on on Engineering Engineering & Applied Applied Sciences Sciences. American Transactions on Engineering & Applied Sciences http://tuengr.com/ateas

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Correlation: Copulas and Conditioning

Correlation: Copulas and Conditioning Correlation: Copulas and Conditioning This note reviews two methods of simulating correlated variates: copula methods and conditional distributions, and the relationships between them. Particular emphasis

More information

Polynomial approximation of mutivariate aggregate claim amounts distribution

Polynomial approximation of mutivariate aggregate claim amounts distribution Polynomial approximation of mutivariate aggregate claim amounts distribution Applications to reinsurance P.O. Goffard Axa France - Mathematics Institute of Marseille I2M Aix-Marseille University 19 th

More information

ON THE TRUNCATED COMPOSITE WEIBULL-PARETO MODEL

ON THE TRUNCATED COMPOSITE WEIBULL-PARETO MODEL ON THE TRUNCATED COMPOSITE WEIBULL-PARETO MODEL SANDRA TEODORESCU and EUGENIA PANAITESCU The composite Weibull-Pareto model 2] was introduced as an alternative to the composite Lognormal-Pareto ] used

More information

MODELING COUNT DATA Joseph M. Hilbe

MODELING COUNT DATA Joseph M. Hilbe MODELING COUNT DATA Joseph M. Hilbe Arizona State University Count models are a subset of discrete response regression models. Count data are distributed as non-negative integers, are intrinsically heteroskedastic,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Logistic regression model for survival time analysis using time-varying coefficients

Logistic regression model for survival time analysis using time-varying coefficients Logistic regression model for survival time analysis using time-varying coefficients Accepted in American Journal of Mathematical and Management Sciences, 2016 Kenichi SATOH ksatoh@hiroshima-u.ac.jp Research

More information

Comonotonicity and Maximal Stop-Loss Premiums

Comonotonicity and Maximal Stop-Loss Premiums Comonotonicity and Maximal Stop-Loss Premiums Jan Dhaene Shaun Wang Virginia Young Marc J. Goovaerts November 8, 1999 Abstract In this paper, we investigate the relationship between comonotonicity and

More information

A Regression Model for the Copula Graphic Estimator

A Regression Model for the Copula Graphic Estimator Discussion Papers in Economics Discussion Paper No. 11/04 A Regression Model for the Copula Graphic Estimator S.M.S. Lo and R.A. Wilke April 2011 2011 DP 11/04 A Regression Model for the Copula Graphic

More information

Monotonicity and Aging Properties of Random Sums

Monotonicity and Aging Properties of Random Sums Monotonicity and Aging Properties of Random Sums Jun Cai and Gordon E. Willmot Department of Statistics and Actuarial Science University of Waterloo Waterloo, Ontario Canada N2L 3G1 E-mail: jcai@uwaterloo.ca,

More information

Likelihood ratio testing for zero variance components in linear mixed models

Likelihood ratio testing for zero variance components in linear mixed models Likelihood ratio testing for zero variance components in linear mixed models Sonja Greven 1,3, Ciprian Crainiceanu 2, Annette Peters 3 and Helmut Küchenhoff 1 1 Department of Statistics, LMU Munich University,

More information