Regression Tree Credibility Model. Liqun Diao, Chengguo Weng


Regression Tree Credibility Model

Liqun Diao, Chengguo Weng
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada

Version: Sep 16, 2016

Abstract

Credibility theory is viewed as a cornerstone of actuarial science. This paper brings machine learning techniques into the area and proposes a novel credibility model and credibility premium formula based on regression trees, called the regression tree credibility (RTC) premium. The proposed RTC method first recursively partitions a collective of individual risks into exclusive sub-collectives via binary splits, using a new credibility regression tree algorithm based on credibility loss, and then applies the classical Bühlmann-Straub credibility formula to predict individual net premiums within each sub-collective. The proposed method effectively predicts individual net premiums by incorporating covariate information, and it is particularly appealing for capturing nonlinear covariate effects and/or interaction effects because no specific regression form needs to be pre-specified. Our proposed RTC method automatically selects influential covariate variables for premium prediction, with no additional ex ante variable selection procedure required. The superiority in prediction accuracy of the proposed RTC model is demonstrated by extensive simulation studies.

Keywords: Credibility Theory; Regression Trees; Premium Rating; Predictive Analytics.

Assistant Professor. liqun.diao@uwaterloo.ca
Corresponding Author. Associate Professor. chengguo.weng@uwaterloo.ca

1 Introduction

Risk classification, as a key ingredient of insurance pricing, exploits observable characteristics to classify insureds with similar expected claims and to build a tariffication system that enables price discrimination on insurance products. An a priori classification scheme is a premium rating system developed to correctly express measurable a priori information about a new policyholder or an insured without any claims experience. An a priori classification scheme is often unable to identify all the important factors for premium rating because some are either unmeasurable or unobservable. An a posteriori classification (or experience rating) system is therefore necessary to re-rate risks by integrating claims experience into the rating system, and to obtain a fairer and more reasonable price discrimination scheme.

Credibility theory, viewed as a cornerstone of actuarial science (Hickman and Heacox, 1999), has become the paradigm for insurance experience rating and is widely used by actuaries. Its advent dates back to Whitney (1918). The idea behind the theory is to view the net premium of an individual risk as a function $\mu(\Theta)$ of a random element $\Theta$ representing the unobservable characteristics of the individual risk, and to calculate the (credibility) premium as a linear combination of individual claims experience and the collective net premium. Credibility theory has developed into two streams: limited fluctuation credibility theory and greatest accuracy credibility theory. The former is a stability-oriented theory whose main objective is to incorporate into the premium as much individual claims experience as possible while keeping the premium sufficiently stable. Modern applications of credibility theory mainly rely on the latter, the greatest accuracy theory, which approximates the net premium by minimizing the mean squared prediction error, thereby deriving the best linear unbiased estimator based on both individual and collective claims experience.

The theoretical foundation of credibility theory was first established by Bühlmann (1967, 1969), bringing about the famous Bühlmann model, and was subsequently extended in many directions. The model by Bühlmann and Straub (1970) allows individual risks to be associated with distinct volume measures (e.g., risk exposure). Hachemeister (1975) generalized the model to the regression context, and Jewell (1973) extended it to multi-dimensional risks. Since insurance data often show a natural hierarchical structure, the hierarchical credibility model was developed to embody this feature (Jewell, 1975; Taylor, 1979; Sundt, 1980; Norberg, 1986; Bühlmann and Jewell, 1987). Credibility theory within the framework of generalized linear models can be found in Nelder and Verrall (1997), Ohlsson (2008) and Garrido and Zhou (2009).

In insurance practice, relevant covariates are usually available to partially, if not fully, reflect the risk profile of individual risks (e.g., age, gender, annual income, driving history, etc.). In some of the existing literature, covariate information has been incorporated into various credibility models (e.g., Hachemeister, 1975; Nelder and Verrall, 1997; Ohlsson, 2008; Garrido and Zhou, 2009), but a specific regression form

for either the individual net premium or the claim amount on covariates needs to be pre-specified (e.g., linear regression, generalized linear regression, etc.) before the credibility premium is calculated. As such, the available credibility models lack the flexibility to incorporate nonlinear effects and/or interaction effects, which are highly likely to exist in insurance data, and thus carry the risk of model misspecification.

We propose a novel regression tree credibility (RTC) model in this paper which, for the first time, brings machine learning techniques into the framework of credibility theory and provides encouraging performance in terms of premium prediction accuracy compared to classical credibility models. Regression trees (Breiman et al., 1984) are popular machine learning algorithms developed to construct prediction models by recursively partitioning the data space, via binary splits, into smaller regions in which a simple model is sufficient. Many interesting extensions and alterations of regression trees have been proposed, and a thorough review can be found in Loh (2014). Regression trees are powerful non-parametric alternatives to parametric regression methods and are particularly appealing if interest lies in risk classification and prediction.

Our proposed RTC model provides a flexible way to utilize covariate information for premium prediction. It proceeds in two steps. In the first step, a novel regression tree algorithm is proposed to utilize covariate information to partition a collective of risks into exclusive sub-collectives such that the risk profiles tend to be homogeneous within each sub-collective and heterogeneous across sub-collectives. The resulting regression trees are called credibility regression trees because they are built to minimize credibility loss in premium prediction. In the second step, the classical credibility premium formula is applied within each sub-collective. Because the eventual credibility premium formula for individual risks results from the implementation of a regression tree algorithm, we call our credibility model the regression tree credibility model.

Our proposed RTC model possesses many advantages over the existing credibility models in the literature. First, for the first time, covariate variables are exploited to classify the unobservable risk profiles for a credibility model in a prioritized way, whereas they, if considered, are typically used to characterize the conditional net premium of an individual risk, for example, in the regression credibility model of Hachemeister (1975). The well-known hierarchical credibility models (Bühlmann and Jewell, 1987) also utilize covariate information to classify individual risks, but this is done in an ad hoc manner. The classification in hierarchical credibility models is created according to some inherent hierarchical structure of insurance data, for example, a certain geographic or demographic structure. In contrast, the RTC model classifies risks in a data-driven manner. Second, our proposed credibility model provides an effective framework to capture nonlinear and interaction effects of covariates on the individual net premium, benefiting from the use of regression trees. In other words, no explicit regression form needs to be assumed for either the individual net premium or the risk distribution in general, whereas all the existing credibility models using covariates require a specific regression form (e.g., Hachemeister, 1975; Nelder and Verrall, 1997). Last but not least,

our proposed RTC model automatically selects significant covariate variables for premium prediction in the course of building the credibility regression tree, and no ex ante variable selection procedure is required.

The rest of the paper proceeds as follows. Section 2 provides a brief review of the Bühlmann-Straub credibility model, Section 3 explains the benefit of partitioning, Section 4 specifies our credibility model, and Section 5 describes the procedure for establishing the credibility regression tree and calculating premiums. Sections 6 and 7 contain simulation results for balanced and unbalanced claims data, respectively. Section 8 concludes the paper. Finally, a technical proof and additional graphics are provided in the appendices.

2 Bühlmann-Straub Credibility Model

In this section, a brief review of the Bühlmann-Straub credibility model is presented to facilitate the development of our regression tree credibility model in the subsequent sections.

Model 1 (Bühlmann-Straub Credibility Model) Consider a portfolio of $I$ risks, where each individual risk $i$ has $n_i$ years of claims experience, $i = 1, 2, \ldots, I$. Let $Y_{i,j}$ denote the claim ratio of individual risk $i$ in year $j$, and $m_{i,j}$ be the associated volume measure, also known as the weight variable. The profile of individual risk $i$ is characterized by a scalar $\theta_i$, which is the realization of a random element $\Theta_i$ (usually either a random variable or a random vector). Collect all the claims experience of individual risk $i$ into a vector $Y_i$, i.e., $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$, and assume that the following conditions are satisfied:

H11. Conditionally given $\Theta_i = \theta_i$, $\{Y_{i,j} : j = 1, 2, \ldots, n_i\}$ are independent with
$$\mathrm{E}[Y_{i,j} \mid \Theta_i = \theta_i] = \mu(\theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid \Theta_i = \theta_i] = \frac{\sigma^2(\theta_i)}{m_{i,j}}$$
for some unknown but deterministic functions $\mu(\cdot)$ and $\sigma^2(\cdot)$;

H12. The pairs $(\Theta_1, Y_1), \ldots, (\Theta_I, Y_I)$ are independent, and $\{\Theta_1, \ldots, \Theta_I\}$ are independent and identically distributed.

Define structural parameters $\sigma^2 = \mathrm{E}[\sigma^2(\Theta_i)]$ and $\tau^2 = \mathrm{Var}[\mu(\Theta_i)]$ for risks within the collective $\mathcal{I} := \{1, 2, \ldots, I\}$, and denote
$$m_i = \sum_{j=1}^{n_i} m_{i,j}, \qquad m = \sum_{i=1}^{I} m_i.$$
Let $\mu = \mathrm{E}[Y_{i,j}]$ denote the collective net premium, and define
$$\bar{Y}_i = \sum_{j=1}^{n_i} \frac{m_{i,j}}{m_i}\, Y_{i,j}, \qquad \bar{Y} = \sum_{i=1}^{I} \frac{m_i}{m}\, \bar{Y}_i,$$

which are the sample average of claims associated with individual risk $i$ and that of the collective $\mathcal{I}$, respectively.

In the literature, two estimators were proposed for the individual premium $\mu(\Theta_i)$ in the above Bühlmann-Straub credibility model: the inhomogeneous credibility premium and the homogeneous credibility premium. The inhomogeneous credibility premium is the best premium predictor $P_i^{(I)}$ among the class
$$\left\{ P_i = a_0 + \sum_{i=1}^{I} \sum_{j=1}^{n_i} a_{i,j} Y_{i,j} \;:\; a_0 \in \mathbb{R},\ a_{i,j} \in \mathbb{R} \right\}$$
minimizing the quadratic loss $\mathrm{E}\big[(\mu(\Theta_i) - P_i)^2\big]$, or equivalently $\mathrm{E}\big[(Y_{i,n_i+1} - P_i)^2\big]$, and its formula is given by
$$P_i^{(I)} = \alpha_i \bar{Y}_i + (1 - \alpha_i)\mu,$$
where
$$\alpha_i = \frac{m_i}{m_i + \sigma^2/\tau^2} \tag{2.1}$$
is called the credibility factor for individual risk $i$ (Bühlmann and Gisler, 2005). The corresponding minimum quadratic losses for $\mathrm{E}[(\mu(\Theta_i) - P_i)^2]$ and $\mathrm{E}[(Y_{i,n_i+1} - P_i)^2]$ are respectively given by
$$L_{1,i}^{(I)} = \frac{\sigma^2}{m_i + \sigma^2/\tau^2} \tag{2.2}$$
and
$$L_{2,i}^{(I)} = \sigma^2 + \frac{\sigma^2}{m_i + \sigma^2/\tau^2}. \tag{2.3}$$

In contrast, the homogeneous credibility premium for an individual risk $i$ from the collective $\mathcal{I}$ is defined as the best premium predictor $P_i^{(H)}$ among the class
$$\left\{ Q_i : Q_i = \sum_{i=1}^{I} \sum_{j=1}^{n_i} a_{i,j} Y_{i,j},\ \mathrm{E}[Q_i] = \mathrm{E}[\mu(\Theta_i)],\ a_{i,j} \in \mathbb{R} \right\}$$
minimizing the quadratic loss $\mathrm{E}\big[(\mu(\Theta_i) - Q_i)^2\big]$, or equivalently $\mathrm{E}\big[(Y_{i,n_i+1} - Q_i)^2\big]$, and its formula is given by
$$P_i^{(H)} = \alpha_i \bar{Y}_i + (1 - \alpha_i)\bar{Y}, \tag{2.4}$$
where the credibility factor $\alpha_i$ is the same as the one in the inhomogeneous premium and is given in (2.1). The corresponding minimum quadratic losses for $\mathrm{E}[(\mu(\Theta_i) - Q_i)^2]$ and $\mathrm{E}[(Y_{i,n_i+1} - Q_i)^2]$ are respectively given by
$$L_{1,i}^{(H)} = \tau^2 (1 - \alpha_i)\left(1 + \frac{1 - \alpha_i}{\alpha_\bullet}\right) \tag{2.5}$$

and
$$L_{2,i}^{(H)} = \sigma^2 + \tau^2 (1 - \alpha_i)\left(1 + \frac{1 - \alpha_i}{\alpha_\bullet}\right), \tag{2.6}$$
where $\alpha_\bullet = \sum_{i=1}^{I} \alpha_i$.

In the implementation of the above credibility premium formulas, the structural parameters $\sigma^2$ and $\tau^2$ are respectively estimated by
$$\hat{\sigma}^2 = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{n_i - 1} \sum_{j=1}^{n_i} m_{i,j}\left(Y_{i,j} - \bar{Y}_i\right)^2 \quad \text{and} \quad \hat{\tau}^2 = \max\left(\tilde{\tau}^2, 0\right), \tag{2.7}$$
where
$$\tilde{\tau}^2 = c \left[ \frac{I}{I-1} \sum_{i=1}^{I} \frac{m_i}{m}\left(\bar{Y}_i - \bar{Y}\right)^2 - \frac{I\hat{\sigma}^2}{m} \right] \tag{2.8}$$
and
$$c = \frac{I-1}{I} \left[ \sum_{i=1}^{I} \frac{m_i}{m}\left(1 - \frac{m_i}{m}\right) \right]^{-1}.$$
$\hat{\sigma}^2$ and $\tilde{\tau}^2$ are unbiased and consistent estimators for $\sigma^2$ and $\tau^2$, respectively, but $\tilde{\tau}^2$ can possibly be negative (p. 62, Bühlmann and Gisler, 2005); thus, $\hat{\tau}^2$ is used as the estimator for $\tau^2$.
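For concreteness, the estimators (2.7)-(2.8) and the homogeneous premium (2.4) can be assembled in a few lines of R. The sketch below is ours, not the authors' code: the function name bs_credibility and the data layout (lists Y and w holding each risk's claim ratios and weights) are assumptions for illustration only.

```r
# Illustrative sketch of the Buhlmann-Straub estimators (2.7)-(2.8)
# and the homogeneous credibility premium (2.4).
bs_credibility <- function(Y, w) {
  # Y, w: lists of length I; Y[[i]] and w[[i]] hold the claim ratios
  #       Y_{i,j} and the weights m_{i,j} of risk i (needs n_i >= 2).
  I    <- length(Y)
  m_i  <- sapply(w, sum)                               # m_i = sum_j m_{i,j}
  m    <- sum(m_i)
  Ybar_i <- mapply(function(y, mw) sum(mw * y) / sum(mw), Y, w)
  Ybar   <- sum(m_i * Ybar_i) / m                      # collective average
  # sigma^2: within-risk weighted variance, averaged over risks
  s2_i <- mapply(function(y, mw) {
    yb <- sum(mw * y) / sum(mw)
    sum(mw * (y - yb)^2) / (length(y) - 1)
  }, Y, w)
  sigma2 <- mean(s2_i)
  # tau^2 with the bias-correction constant c of (2.8)
  c_const  <- (I - 1) / I / sum((m_i / m) * (1 - m_i / m))
  tau2_raw <- c_const * (I / (I - 1) * sum((m_i / m) * (Ybar_i - Ybar)^2) -
                           I * sigma2 / m)
  tau2 <- max(tau2_raw, 0)
  alpha   <- m_i / (m_i + sigma2 / tau2)               # credibility factors (2.1)
  premium <- alpha * Ybar_i + (1 - alpha) * Ybar       # homogeneous premium (2.4)
  list(sigma2 = sigma2, tau2 = tau2, alpha = alpha, premium = premium)
}
```

Note that when $\hat{\tau}^2 = 0$ the division by zero propagates in R to $\alpha_i = 0$, so the sketch automatically returns the collective mean as the premium in that degenerate case.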

3 Benefit of Partitioning

This section investigates the benefit of partitioning a collective of risks into sub-collectives for premium prediction under the Bühlmann model, which is a special case of the Bühlmann-Straub model reviewed in Section 2 with $n_i = n$ and $m_{i,j} = 1$ for $i = 1, \ldots, I$ and $j = 1, \ldots, n$. Under the Bühlmann model, it is easy to check that the credibility factor is the same for all individual risks, i.e.,
$$\alpha := \alpha_i = \frac{n}{n + \sigma^2/\tau^2}, \quad i = 1, \ldots, I;$$
the four quadratic loss functions in (2.2), (2.3), (2.5) and (2.6) respectively become
$$L_{1,i}^{(I)} = \frac{\sigma^2}{n + \sigma^2/\tau^2} = \tau^2(1 - \alpha),$$
$$L_{2,i}^{(I)} = \sigma^2 + \frac{\sigma^2}{n + \sigma^2/\tau^2} = \sigma^2 + \tau^2(1 - \alpha),$$
$$L_{1,i}^{(H)} = \tau^2(1 - \alpha) + \frac{1}{I} \cdot \frac{\tau^2(1 - \alpha)^2}{\alpha},$$
$$L_{2,i}^{(H)} = \sigma^2 + \tau^2(1 - \alpha) + \frac{1}{I} \cdot \frac{\tau^2(1 - \alpha)^2}{\alpha}.$$

As the value of $I$ becomes large, the second term in the expression of $L_{1,i}^{(H)}$ and the third term in $L_{2,i}^{(H)}$ become negligible, and $L_{1,i}^{(H)}$ and $L_{2,i}^{(H)}$ reduce to $L_{1,i}^{(I)}$ and $L_{2,i}^{(I)}$, respectively. Therefore, only $L_{1,i}^{(I)}$ and $L_{2,i}^{(I)}$ will be studied regarding the benefit of partitioning in the rest of this section.

Without loss of generality, it will be proved that the overall credibility quadratic loss for any collective $\mathcal{I} = \{1, 2, \ldots, I\}$ does not worsen if it is split into two mutually exclusive sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$, where $\mathcal{I}_1 \cup \mathcal{I}_2 = \mathcal{I}$ and $\mathcal{I}_1 \cap \mathcal{I}_2 = \emptyset$, subject to an arbitrary partitioning criterion. When working with the loss function $L_{1,i}^{(I)}$, the total credibility loss for the collective $\mathcal{I}$ is the sum of the individual losses over the collective:
$$L = \sum_{i=1}^{I} L_{1,i}^{(I)} = \frac{I}{n/\sigma^2 + 1/\tau^2},$$
where $\sigma^2$ and $\tau^2$ are the structural parameters for the collective $\mathcal{I}$. We next consider the sum of the credibility losses of the two sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$. Let $p$ be the probability that a randomly selected individual risk from the collective $\mathcal{I}$ falls into the sub-collective $\mathcal{I}_1$, so that $q = 1 - p$ is the probability that it falls into the sub-collective $\mathcal{I}_2$. On average, there will be $pI$ individual risks lying in sub-collective $\mathcal{I}_1$ and $qI$ individual risks in sub-collective $\mathcal{I}_2$. Let $\mu_{(k)}$, $\sigma^2_{(k)}$ and $\tau^2_{(k)}$ be the structural parameters for sub-collective $\mathcal{I}_k$, $k = 1, 2$, defined in the same manner as $\mu$, $\sigma^2$ and $\tau^2$ introduced in Section 2, but based on sub-collective information. The expected total credibility loss functions for sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$ are, respectively, given by
$$L_1 = \frac{Ip}{n/\sigma^2_{(1)} + 1/\tau^2_{(1)}} \quad \text{and} \quad L_2 = \frac{Iq}{n/\sigma^2_{(2)} + 1/\tau^2_{(2)}}.$$
Further, it is easy to check that
$$\sigma^2 = p\sigma^2_{(1)} + q\sigma^2_{(2)}, \qquad \tau^2 = p\tau^2_{(1)} + q\tau^2_{(2)} + pq\left(\mu_{(1)} - \mu_{(2)}\right)^2 = \bar{\tau}^2 + pq\left(\mu_{(1)} - \mu_{(2)}\right)^2, \tag{3.9}$$
where $\bar{\tau}^2 = p\tau^2_{(1)} + q\tau^2_{(2)}$.

Proposition 3.1 The credibility losses $L$, $L_1$ and $L_2$ defined in this subsection satisfy
$$L_1 + L_2 \leq L. \tag{3.10}$$

Proof. See Appendix A.
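A quick numerical check of Proposition 3.1 can be done in a few lines of R. The structural parameter values below are made up for illustration; they are not taken from the paper.

```r
# Illustrative check of (3.10) with hypothetical structural parameters.
I <- 300; n <- 5; p <- 0.4; q <- 1 - p
sigma2_1 <- 2.0; sigma2_2 <- 3.0          # within-risk variances of the two sub-collectives
tau2_1   <- 0.5; tau2_2   <- 0.8          # between-risk variances
mu1 <- 1.0;  mu2 <- 2.5                    # sub-collective net premiums

sigma2 <- p * sigma2_1 + q * sigma2_2
tau2   <- p * tau2_1 + q * tau2_2 + p * q * (mu1 - mu2)^2   # relation (3.9)

L  <- I     / (n / sigma2   + 1 / tau2)    # total loss without splitting
L1 <- I * p / (n / sigma2_1 + 1 / tau2_1)  # loss of sub-collective 1
L2 <- I * q / (n / sigma2_2 + 1 / tau2_2)  # loss of sub-collective 2
c(L = L, L1_plus_L2 = L1 + L2, gain = L - (L1 + L2))   # gain is non-negative
```

With these values the split reduces the total loss from roughly 109 to roughly 88; setting mu1 equal to mu2 and the sub-collective parameters equal to each other makes the gain vanish, in line with the equality case discussed below.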

When working with the individual credibility loss $L_{2,i}^{(I)}$, the total credibility loss for the collective $\mathcal{I}$ is given by
$$\tilde{L} = I\sigma^2 + \frac{I}{n/\sigma^2 + 1/\tau^2} = I\sigma^2 + L.$$
The expected total credibility loss functions for sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$ are respectively of the form
$$\tilde{L}_1 = Ip\sigma^2_{(1)} + \frac{Ip}{n/\sigma^2_{(1)} + 1/\tau^2_{(1)}} = Ip\sigma^2_{(1)} + L_1$$
and
$$\tilde{L}_2 = Iq\sigma^2_{(2)} + \frac{Iq}{n/\sigma^2_{(2)} + 1/\tau^2_{(2)}} = Iq\sigma^2_{(2)} + L_2.$$
It is easily shown that $\tilde{L}_1 + \tilde{L}_2 \leq \tilde{L}$ by applying Proposition 3.1 and the fact that $\sigma^2 = p\sigma^2_{(1)} + q\sigma^2_{(2)}$.

The discussion above suggests that any partitioning of a collective of individual risks does not deteriorate the prediction accuracy of the credibility premium formulas, provided that the corresponding structural parameters for each resulting sub-collective can be computed without any statistical error. This result provides a theoretical foundation for the use of any partitioning method in premium prediction, and it also encourages the use of prediction models which classify a given collective of risks into a great number of mutually exclusive sub-collectives. As in many other statistical models, however, estimation error is inevitable: the structural parameters are unknown in general, and estimating them from available data introduces statistical errors. The existence of estimation error restrains one from excessively partitioning the overall collective; instead, one needs to keep a balance between the benefits of partitioning and the adverse effects of estimation errors.

4 Model Setup

Model 2 (Unbalanced Claims Model) Consider a portfolio of $I$ risks numbered $1, \ldots, I$. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$ be the vector of claim ratios, $\mathbf{m}_i = (m_{i,1}, \ldots, m_{i,n_i})$ be the corresponding weight vector, and $X_i = (X_{i,1}, \ldots, X_{i,p})^T$ be the covariate vector associated with individual risk $i$, $i = 1, \ldots, I$. The risk profile of each individual risk $i$ is characterized by a scalar $\theta_i$, which is a realization of a random element $\Theta_i$. The following two conditions are further assumed:

H21. The triplets $(\Theta_1, Y_1, X_1), \ldots, (\Theta_I, Y_I, X_I)$ are independent;

H22. Conditionally given $\Theta_i = \theta_i$ and $X_i = x_i$, the entries $Y_{i,j}$, $j = 1, \ldots, n_i$, are independent, and
$$\mathrm{E}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \mu(x_i, \theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \frac{\sigma^2(x_i, \theta_i)}{m_{i,j}}$$
for some unknown but deterministic functions $\mu(\cdot, \cdot)$ and $\sigma^2(\cdot, \cdot)$.

The word unbalanced in the title of the above model signifies that the model permits distinct lengths of claims experience and distinct volume measures across individual risks in the collective. By setting $n_i = n$ and $m_{i,j} = 1$ for $i = 1, \ldots, I$ and $j = 1, \ldots, n_i$, the above unbalanced claims model reduces to the following balanced claims model.

Model 3 (Balanced Claims Model) Consider a portfolio of $I$ risks numbered $1, \ldots, I$. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,n})^T$ be the vector of claims and $X_i = (X_{i,1}, \ldots, X_{i,p})^T$ be the covariate vector associated with

individual risk $i$, $i = 1, \ldots, I$. The risk profile of individual risk $i$ is characterized by a scalar $\theta_i$, which is a realization of a random element $\Theta_i$. The following two conditions are further assumed:

H31. The triplets $(\Theta_1, Y_1, X_1), \ldots, (\Theta_I, Y_I, X_I)$ are independent;

H32. Conditionally given $\Theta_i = \theta_i$ and $X_i = x_i$, the entries $Y_{i,j}$, $j = 1, \ldots, n$, are independent, and
$$\mathrm{E}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \mu(x_i, \theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \sigma^2(x_i, \theta_i) \tag{4.11}$$
for some unknown but deterministic functions $\mu(\cdot, \cdot)$ and $\sigma^2(\cdot, \cdot)$.

The above credibility models are more general than the ones in the literature, because the existing ones without exception assume some specific functional form for $\mu(X_i, \Theta_i)$ and $\sigma^2(X_i, \Theta_i)$. For example, the well-known Hachemeister model assumes $\mu(X_i, \Theta_i) = X_i^T \beta(\Theta_i)$, a linear regression form with random coefficient $\beta(\Theta_i)$ depending on the underlying risk profile $\Theta_i$ of individual risk $i$. In contrast, our proposed model leaves the form of the individual net premium $\mu(X_i, \Theta_i)$ unspecified.

Since partitioning of the data space improves premium prediction accuracy, as discussed in Section 3, it is natural to consider partitioning guided by available covariate information. Inspired by ideas from nonparametric statistics, we approximate the conditional mean $\mu(X_i, \Theta_i)$ and the conditional variance of $Y_{i,j}$ by piecewise constant forms in $X_i$, respectively,
$$\sum_{k=1}^{K} I\{X_i \in A_k\}\, \mu_{(k)}(\Theta_i) \quad \text{and} \quad \sum_{k=1}^{K} I\{X_i \in A_k\}\, \frac{\sigma^2_{(k)}(\Theta_i)}{m_{i,j}},$$
where $I\{\cdot\}$ is an indicator function, $\{A_1, A_2, \ldots, A_K\}$ is a partition of the covariate space, $K$ is the number of partitions, and $\mu_{(k)}(\Theta_i)$ and $\sigma^2_{(k)}(\Theta_i)$ respectively represent the net premium and the variance of an individual risk $i$ from the $k$th sub-collective with risk profile $\Theta_i$, i.e.,
$$\mu_{(k)}(\theta_i) = \mathrm{E}(Y_{i,j} \mid X_i \in A_k, \Theta_i = \theta_i) \quad \text{and} \quad \sigma^2_{(k)}(\theta_i) = \mathrm{Var}(Y_{i,j} \mid X_i \in A_k, \Theta_i = \theta_i). \tag{4.12}$$
The condition $X_i \in A_k$ means that the individual risk $i$ is classified into the $k$th sub-collective based on its covariate information.

Following the derivation of the inhomogeneous credibility premium for the Bühlmann-Straub credibility model reviewed in Section 2, the inhomogeneous credibility premium to predict $\mu_{(k)}(\Theta_i)$ using information from the $k$th sub-collective is of the form
$$\pi_i^{(I)(k)} = \alpha_i^{(k)} \bar{Y}_i + \left(1 - \alpha_i^{(k)}\right) \mu_{(k)}, \tag{4.13}$$
where $\mu_{(k)} = \mathrm{E}[Y_{i,j} \mid X_i \in A_k] = \mathrm{E}[\mu_{(k)}(\Theta_i)]$ is the sub-collective net premium corresponding to the $k$th partition,
$$\alpha_i^{(k)} = \frac{m_i}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}}$$

is the credibility factor of the $k$th sub-collective, and $\tau^2_{(k)}$ and $\sigma^2_{(k)}$ are the structural parameters for the $k$th sub-collective, i.e., $\tau^2_{(k)} = \mathrm{Var}[\mu_{(k)}(\Theta_i)]$ and $\sigma^2_{(k)} = \mathrm{E}[\sigma^2_{(k)}(\Theta_i)]$. We develop the following prediction rule based on covariate-dependent partitioning and inhomogeneous sub-collective credibility premiums:
$$\pi_i^{(I)} = \sum_{k=1}^{K} I\{X_i \in A_k\}\, \pi_i^{(I)(k)}. \tag{4.14}$$
The minimum quadratic losses $\mathrm{E}\big[\big(\mu_{(k)}(\Theta_i) - \pi_i^{(I)(k)}\big)^2\big]$ and $\mathrm{E}\big[\big(Y_{i,n_i+1} - \pi_i^{(I)(k)}\big)^2\big]$, accumulated over the $k$th sub-collective, are respectively
$$L_{1,(k)}^{(I)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \frac{\sigma^2_{(k)}}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}} \tag{4.15}$$
and
$$L_{2,(k)}^{(I)} = \sum_{i=1}^{I} I\{X_i \in A_k\} \left( \sigma^2_{(k)} + \frac{\sigma^2_{(k)}}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}} \right). \tag{4.16}$$
Similarly, we can develop the homogeneous credibility prediction rule
$$\pi_i^{(H)} = \sum_{k=1}^{K} I\{X_i \in A_k\}\, \pi_i^{(H)(k)}, \tag{4.17}$$
where
$$\pi_i^{(H)(k)} = \alpha_i^{(k)} \bar{Y}_i + \left(1 - \alpha_i^{(k)}\right) \bar{Y}^{(k)}, \tag{4.18}$$
and $\bar{Y}^{(k)}$ denotes the average claims experience of the risks in the $k$th sub-collective. The two homogeneous credibility losses of the $k$th sub-collective are
$$L_{1,(k)}^{(H)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \tau^2_{(k)}\left(1 - \alpha_i^{(k)}\right)\left(1 + \frac{1 - \alpha_i^{(k)}}{\alpha_\bullet^{(k)}}\right) \tag{4.19}$$
and
$$L_{2,(k)}^{(H)} = \sum_{i=1}^{I} I\{X_i \in A_k\} \left[ \sigma^2_{(k)} + \tau^2_{(k)}\left(1 - \alpha_i^{(k)}\right)\left(1 + \frac{1 - \alpha_i^{(k)}}{\alpha_\bullet^{(k)}}\right) \right], \tag{4.20}$$
where $\alpha_\bullet^{(k)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \alpha_i^{(k)}$.
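Given any fixed partition of the covariate space, the homogeneous prediction rule (4.17)-(4.18) simply applies the Bühlmann-Straub machinery within each sub-collective. The R sketch below illustrates this; it reuses the illustrative bs_credibility helper sketched earlier, and the function name partition_premium and the grouping vector are our own assumptions, not part of the paper.

```r
# Homogeneous sub-collective premium (4.17)-(4.18) for a given partition.
# 'group' assigns each risk i to a sub-collective A_k based on X_i.
partition_premium <- function(Y, w, group) {
  premium <- numeric(length(Y))
  for (k in unique(group)) {
    idx <- which(group == k)
    # Buhlmann-Straub within sub-collective k only
    fit <- bs_credibility(Y[idx], w[idx])
    premium[idx] <- fit$premium
  }
  premium
}

# Example usage with an ad hoc single-covariate split (threshold 50),
# i.e. a partition of the kind denoted P(X_1) in Section 6:
# group <- ifelse(X[, 1] <= 50, 1L, 2L)
# pred  <- partition_premium(Y, w, group)
```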

Figure 1: An example of covariate-dependent partitioning.

Figure 1 illustrates one example of a covariate-dependent partitioning, in which covariate $X_{i,1}$ and a threshold $c$ are selected to partition the collective into two sub-collectives $G_{11}$ and $G_{12}$. Consequently, the profile of individual risks in sub-collective $G_{11}$ is characterized by the conditional distribution of $(\Theta_i \mid X_{i,1} > c)$, whereas that in sub-collective $G_{12}$ is governed by the conditional distribution of $(\Theta_i \mid X_{i,1} \leq c)$. If the risk profile variable $\Theta_i$ is independent of the covariate $X_{i,1}$, the two sub-collectives are homogeneous in the sense that the risk profile variables from the two sub-collectives follow the same distribution; in this case, as one can see from the proof of Proposition 3.1, there is no benefit from partitioning, at least for a balanced claims model, because the total credibility loss remains the same, i.e., the inequality in (3.10) becomes an equality, while such a partition will increase the estimation error resulting from less data being available to estimate the structural parameters after the split. Thus, such a covariate is not a significant variable for the data space partitioning, and a good partitioning scheme should avoid partitioning the sample space using such a covariate variable.

In Figure 1, the sub-collectives $G_{11}$ and $G_{12}$ are considered for further partitioning, and $G_{11}$ is selected and partitioned into two further sub-collectives $G_{21}$ and $G_{22}$. While the partitioning procedure may go further for a finer partition, Figure 1 demonstrates the case where the sample space is partitioned into three sub-collectives $G_{12}$, $G_{21}$ and $G_{22}$, respectively defined by $\{X_{i,1} \leq c\}$, $\{X_{i,1} > c \ \& \ X_{i,2} > d\}$, and $\{X_{i,1} > c \ \& \ X_{i,2} \leq d\}$. In an example of auto insurance, $X_1$ may represent gender (male if $X_1 = 1$ and female if $X_1 = 0$) and $X_2$ may represent age. If we choose $c = 0$ and $d = 21$, then $G_{21}$ is the group of male drivers aged more than 21, $G_{22}$ is the group of male drivers aged 21 or under, and $G_{12}$ is the group of female drivers. The insurer then offers a differentiated price to each insured group due to their distinct risk profiles.

The choices of variables and thresholds used for partitioning are critical for gaining the benefits of partitioning, because if they are not properly selected, the benefit from partitioning can be very small while the estimation of structural parameters with smaller samples in the sub-collectives may lead to large statistical estimation errors. Generally speaking, an effective partitioning algorithm is expected to partition the data space in a premeditated manner such that the risks are relatively homogeneous within each resulting sub-collective and inhomogeneous between sub-collectives in a certain sense. We propose a novel credibility regression tree algorithm, which chooses the covariates, cutting points and order of partitioning in a data-driven manner and forms a partition of the data space which minimizes the overall credibility loss. A credibility

regression tree is an alteration of the Classification and Regression Trees (CART; Breiman et al., 1984). A brief review of CART is given in Section 5.1, and our proposed credibility regression tree is described in Section 5.2.

5 Regression Tree Credibility Premium

5.1 General Procedure in Regression Tree Methods

Recursive partitioning methods are machine learning techniques for quantifying the relationship between two sets of variables, the responses and the covariates (e.g., Zhang and Singer, 2010). Recursive partitioning methods recursively partition the data space into small regions so that a simple regression model is sufficient in each region. They are powerful alternatives to parametric regression methods and possess many inherent advantages. First, recursive partitioning methods do not require pre-specifying any mathematical form for the regression relationship between the responses and covariates, and they therefore enjoy more flexibility in capturing nonlinear effects and/or interaction effects of the covariates on the responses. Second, no a priori variable selection procedure is necessary before the establishment of a regression model; variable selection is automatically achieved in the course of model building. Third, the decision structure can be utilized for the purpose of prediction and easily visualized by a hierarchical graphical model. Recursive partitioning methods have become popular and widely used tools in many scientific fields but, surprisingly, they have rarely been applied in actuarial science so far.

The Classification and Regression Trees (CART) algorithm, with details developed in Breiman et al. (1984), is perhaps the most well-known recursive partitioning algorithm. Many extensions and alterations have been developed on the basis of CART; see Loh (2014) for a thorough review. CART is formulated in terms of three main steps: (1) tree growing, (2) tree pruning, and (3) tree selection. In what follows, we present a brief description of each of the three steps.

(1) Tree Growing. This step grows a large tree. Consider a data set with a response $W$ and covariates $X_i = (X_{i1}, \ldots, X_{ip})^T$. The partition algorithm begins with the top node, which contains the entire data set, and proceeds through binary splits of the form $\{X_{ij} \leq c\}$ versus $\{X_{ij} > c\}$ (where $c$ is a constant) for an ordinal or continuous covariate $X_{ij}$, and $X_{ij} \in S$ (where $S$ is a subset of the values that $X_{ij}$ may take) for a categorical covariate $X_{ij}$, leading to two mutually exclusive descendant nodes. The principle behind the splitting procedure is heterogeneity reduction in the response distribution, realized by separating the data points into two exclusive descendant nodes where the responses are more homogeneous within each subset than they were combined in the parent node. The heterogeneity of each node can be measured in various ways depending on the nature of

the underlying problem, and is usually quantified by a loss function such as the quadratic or $L_2$ loss function (for a node $g$)
$$L_{2,(g)} = \sum_{i=1}^{I} I\{i \in g\} \left(W_i - \bar{W}_{(g)}\right)^2, \tag{5.21}$$
where $W_i$ is the $i$th response from the node $g$ and $\bar{W}_{(g)}$ is the sample average of the responses in node $g$. In the course of tree building, the first split is selected as the one that yields the largest reduction in loss. The second split is chosen as the one leading to the largest loss reduction among all possible splits in the two descendant nodes. The tree keeps splitting until some pre-specified criteria, such as the minimum number of observations in each terminal node and the depth of the tree, are reached. The splitting process produces a large tree $\psi_0$, which almost always overfits the data, so a less complex model is needed and is built through steps (2) and (3).

(2) Tree Pruning. This step cuts (prunes) the large tree $\psi_0$ from the last step to create a sequence of nested subtrees. The creation of the subtrees relies on a cost-complexity pruning algorithm. Let $\psi_g$ denote the subtree of $\psi_0$ rooted at node $g$, containing $g$ and all of its descendant nodes. We further denote the set of terminal nodes of a tree $\psi$ by $\tilde{\psi}$. The cost-complexity of a subtree $\psi$ is defined as
$$L_\nu(\psi) = L(\psi) + \nu |\tilde{\psi}|,$$
where $L(\psi) = \sum_{g \in \tilde{\psi}} L_{2,(g)}$ is the loss of the subtree $\psi$, $|\tilde{\psi}|$ denotes the size of the tree $\psi$, i.e., the number of terminal nodes in $\psi$, and $\nu$ is called the complexity parameter. Whether the descendants of node $g$ should be pruned is decided by comparing the following two quantities:
$$L_\nu(\psi_g) = L(\psi_g) + \nu |\tilde{\psi}_g| \quad \text{and} \quad L_\nu(g) = L(g) + \nu,$$
where $L_\nu(\psi_g)$ and $L_\nu(g)$ are respectively the cost-complexities of the tree $\psi_g$ rooted at node $g$, containing $g$ and all its descendant nodes, and of the tree having only node $g$, i.e., the tree $\psi_g$ with all its descendant nodes collapsed into node $g$. The value
$$\nu_g = \frac{L(g) - L(\psi_g)}{|\tilde{\psi}_g| - 1}$$
is the critical point at which $L_\nu(\psi_g)$ is equal to $L_\nu(g)$. If $\nu$ grows larger than $\nu_g$, then $L_\nu(\psi_g) > L_\nu(g)$, which suggests that it is not worthwhile to split further on node $g$ and all descendant nodes of node $g$ should be collapsed into node $g$ (i.e., pruned). The pruning process starts with the weakest-linked node, i.e., calculating $\nu_g$ for all non-terminal nodes in the large tree $\psi_0$ and pruning the descendants of the node with the smallest $\nu_g$. Working with the resulting subtree, the $\nu_g$'s of the non-terminal nodes in the subtree are recalculated and the weakest-linked node is pruned off. The pruning continues

until only the top node is left, and it creates a sequence of subtrees of $\psi_0$ corresponding to a sequence of critical points of the weakest-linked nodes of the subtrees. The sequence of critical points is denoted by $\nu_1, \nu_2, \ldots, \nu_M$, and each point corresponds to a subtree, where $M$ is the number of subtrees in the sequence.

(3) Tree Selection. This step selects the best model among the sequence of nested subtrees established in the last step. Typically, the model selection is conducted via a cross-validation procedure (e.g., Breiman et al., 1984, Ch. 8.5), which is probably the most widely used method for estimating the prediction error of each candidate tree. The prediction error is often, though not necessarily, computed using the same loss function as the one used for tree growing and tree pruning. To conduct an $L$-fold cross-validation, the data set is divided into $L$ exclusive subsets $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_L$. The cross-validation error for the subtree corresponding to a critical point $\nu_m$ is given by
$$\sum_{l=1}^{L} \sum_{i=1}^{I} I\{i \in \mathcal{I}_l\}\left(W_i - \hat{\psi}^{-l}(X_i; \nu_m)\right)^2,$$
where $\hat{\psi}^{-l}(X_i; \nu_m)$ is the predicted value of the $i$th response under the tree grown using the full data set excluding subset $\mathcal{I}_l$ and pruned with the complexity parameter fixed at $\nu_m$. The best tree is selected as the one attaining the smallest cross-validation error.

As described above, the loss function plays a fundamental role in building the large tree, creating the sequence of candidate trees, and selecting the best tree via the cross-validation procedure. The CART algorithm can be implemented with the rpart package (Therneau and Atkinson, 2011) in the statistical software R (R Core Team, 2014).
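As a point of reference, the default $L_2$ regression tree of the three steps above can be grown, pruned, and selected with rpart along the following lines. This is only a sketch: the data frame dat and its column names are hypothetical, and the credibility trees of Section 5.2 additionally require replacing the anova loss and the built-in cross-validation, which is not shown here.

```r
library(rpart)

# Grow a large L2 regression tree on claims data (hypothetical data frame
# 'dat' with a per-record response Y and covariates X1,...,X5).
fit <- rpart(Y ~ X1 + X2 + X3 + X4 + X5, data = dat, method = "anova",
             control = rpart.control(cp = 0, minbucket = 20, xval = 10))

# The cp table lists the nested subtrees produced by cost-complexity
# pruning together with their cross-validated errors.
printcp(fit)

# Select the subtree whose complexity parameter minimizes the
# cross-validation error, then prune back to it.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Terminal-node membership of each record defines the sub-collectives
# used for the within-node premium calculation of Section 5.3.
node_of <- pruned$where
```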

5.2 Credibility Regression Trees

As mentioned at the end of Section 4, in this paper we propose novel credibility regression trees by altering the loss function and the cross-validation procedure of the default regression tree algorithm. The loss function and cross-validation procedure we propose are as follows.

(1) Loss Functions. The loss function is the key component of the CART algorithm, and the choice of loss function determines the decision rules in the tree growing, pruning and selection steps. We would expect different decisions to be made, and a different tree structure to be eventually built, by adopting a different loss function. The credibility regression tree algorithm uses, in the tree growing and pruning steps, one of the credibility loss functions given in (4.15), (4.16), (4.19) and (4.20) in place of the default $L_2$ loss function (5.21).

(2) Longitudinal Cross-validation. The cross-validation procedure in CART divides the collective $\mathcal{I} = \{1, 2, \ldots, I\}$ into $L$ subsets, and the cross-validation error is calculated to estimate the prediction error when interest lies in predicting the individual premium for new policyholders with no claims experience. However, in the context of credibility theory, the interest lies in predicting the individual net premium for those insureds who already have claims experience with the insurance company, and thus a new cross-validation procedure is required. The cross-validation procedure we propose divides individual claims experiences into $L$ subsets instead, and accordingly we call it longitudinal cross-validation. Recall that we observe $(Y_i, X_i)$ for individual risk $i$ from the collective, where $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$ denotes the observed claims experience of individual risk $i$ and $X_i$ is the associated covariate vector, $i = 1, 2, \ldots, I$. The longitudinal cross-validation procedure we adopt for the credibility regression tree algorithm takes the following steps (a small coding sketch follows the list):

Step 1. For each $i = 1, \ldots, I$, denote $b_i = \lfloor n_i / L \rfloor$ and $l_i = (n_i \bmod L)$, and define the sequence
$$T_i = \{\underbrace{B, \ldots, B}_{b_i \text{ blocks}}, 1, 2, \ldots, l_i\},$$
where $B = \{1, 2, \ldots, L\}$. For example, when conducting 5-fold cross-validation, $T_i = \{1, 2, 3\}$ for $n_i = 3$, and $T_i = \{1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3\}$ for $n_i = 13$. For each $i = 1, \ldots, I$, independently sample a number $V_{i,j}$ from the sequence $T_i$ without replacement, for $j = 1, \ldots, n_i$.

Step 2. Denote $J_{i,l} := \{j \in \{1, \ldots, n_i\} : V_{i,j} = l\}$ and $Y_i^{(l)} = \{Y_{i,j} : j \in J_{i,l}\}$, $l = 1, \ldots, L$. We can then divide the whole data set of the collective into $L$ subsets:
$$\left\{ \left(Y_i^{(l)}, X_i\right) : i \in \mathcal{I} \right\}, \quad l = 1, \ldots, L.$$

Step 3. As in the CART algorithm, the pruning step of the credibility regression tree also results in a sequence of nested candidate trees associated with a sequence of complexity parameters $\nu_1, \ldots, \nu_M$. The cross-validation error for the candidate tree corresponding to $\nu_m$ is computed as
$$\sum_{l=1}^{L} \sum_{i=1}^{I} \sum_{j=1}^{n_i} I\{j \in J_{i,l}\}\left(Y_{i,j} - \hat{\psi}^{-l}(X_i; \nu_m)\right)^2,$$
where $\hat{\psi}^{-l}(X_i; \nu_m)$ is the predicted value of the $i$th response under the tree grown using the data excluding the set $\{Y_i^{(l)} : i \in \mathcal{I}\}$ and pruned at the fixed complexity parameter $\nu_m$.

Step 4. Select the tree which leads to the smallest cross-validation error.
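The fold assignment of Steps 1-2 is straightforward to code. The sketch below (the helper name longitudinal_folds is ours) builds the sequence $T_i$ and draws the labels $V_{i,j}$ without replacement for a single risk; applying it to every risk yields the $L$ longitudinal folds.

```r
# Longitudinal cross-validation labels: for a risk with n_i observations,
# build T_i = (1,...,L, 1,...,L, ..., 1,...,l_i) and permute it, so that
# each claim year j receives a fold label V_{i,j}.
longitudinal_folds <- function(n_i, L = 5) {
  b_i <- n_i %/% L                       # number of complete blocks of 1..L
  l_i <- n_i %%  L                       # remainder n_i mod L
  T_i <- c(rep(seq_len(L), b_i), seq_len(l_i))
  sample(T_i)                            # sampling without replacement
}

# Example: fold labels for a risk with 13 years of claims and L = 5,
# a random permutation of (1,...,5, 1,...,5, 1, 2, 3)
set.seed(1)
longitudinal_folds(13)
```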

5.3 Premium Calculation

Following the procedure described in Section 5.2, we can use one of the four loss functions defined in (4.15), (4.16), (4.19) and (4.20) to build a regression tree. The terminal nodes of an established credibility regression tree form a partition of the sample space into a number, say $K$, of sub-collectives,
$$\mathcal{I}_k = \{i \in \mathcal{I} : X_i \in A_k\}, \quad k = 1, \ldots, K,$$
where $\{A_k, k = 1, \ldots, K\}$ is a partition of the covariate space. Within each sub-collective $k \in \{1, \ldots, K\}$, the individual premium can be computed by formula (4.18) as described in Section 4.

6 Simulation Studies: Balanced Claims Model

6.1 Simulation Schemes

We consider a collective of $I = 300$ individual risks from Model 3, the Balanced Claims Model, and for each $i = 1, \ldots, I$, independently simulate a covariate vector $X_i = (X_{i,1}, \ldots, X_{i,5})$, where $X_{i,1}, \ldots, X_{i,5}$ are independent and identically distributed from the discrete uniform distribution over the set $\{1, \ldots, 100\}$. Our simulation studies are based on the following four distinct schemes, where the first scheme is the base and the others are its modifications made from different perspectives.

Scheme 1 (Base simulation scheme) For each $i = 1, \ldots, I$, independently simulate $\varepsilon_{i,j}$ from a given distribution function $F(\cdot)$, for $j = 1, \ldots, n$, and define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.22}$$
where
$$f(X_i) = 0.01\left(X_{i,1} + 2X_{i,2} - X_{i,3} + 2\sqrt{X_{i,1} X_{i,3}} - \sqrt{X_{i,2} X_{i,4}}\right). \tag{6.23}$$
In our simulation, $F$ takes one of the following three distributions:

(1) EXP(1.6487): an exponential distribution with mean 1.6487, i.e., $F(x) = 1 - e^{-x/1.6487}$, $x \geq 0$;

(2) LN(0, 1): a log-normal distribution with parameters 0 and 1, i.e., $\ln \varepsilon_{i,j} \sim N(0, 1)$;

(3) PAR(3, 3.29): a Pareto distribution with $F(x) = 1 - \left(\frac{3.29}{x + 3.29}\right)^3$.

The three distributions above are selected to reflect different levels of heavy-tailedness, and their parameter values are set to yield the same expected value. The terms $2\sqrt{X_{i,1} X_{i,3}}$ and $\sqrt{X_{i,2} X_{i,4}}$ are included

in the expression of $f(X_i)$ to reflect nonlinear and interaction effects of the covariate variables on the claim distribution. The response variable $Y_{i,j}$ only depends on the first four covariates; the fifth variable is included as a noise variable to mimic the scenario in which some covariate variables included in the analysis are not influential for the claims $Y_{i,j}$.

In Scheme 1, the mean and the variance of individual risk $i$ are respectively given by
$$\mu(X_i, \Theta_i) = e^{f(X_i)} + \mathrm{E}[\varepsilon] \tag{6.24}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon] = \begin{cases} 2.7183, & \text{for EXP(1.6487)}, \\ 4.6708, & \text{for LN(0, 1)}, \\ 8.1181, & \text{for PAR(3, 3.29)}, \end{cases} \tag{6.25}$$
for $i = 1, \ldots, I$, where $\mathrm{E}[\varepsilon]$ and $\mathrm{Var}[\varepsilon]$, respectively, denote the mean and the variance of the distribution function $F$. As shown in (6.25), the variance of the claim random variable for each individual risk simulated from Scheme 1 remains the same across the collective and does not depend on covariates. To reflect the scenario where the variance of the claim random variable may differ across the collective of individual risks, we design the following scheme, which differs from Scheme 1 in the way the variables $\varepsilon_{i,j}$ are generated.

Scheme 2 For each $i = 1, \ldots, I$, independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i)$ which depends on the covariate vector $X_i$ of the individual risk $i$, for $j = 1, \ldots, n$, and define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.26}$$
where $f(X_i)$ is given by (6.23). We respectively consider three distinct distributions for $F(\cdot\,; X_i)$:
$$\text{(1) EXP}\left(e^{\gamma(X_i)/2}\right), \qquad \text{(2) LN}\left(0, \gamma(X_i)\right), \qquad \text{(3) PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \tag{6.27}$$
where $\gamma(x_i)$ is a deterministic positive function of the covariates $X_{i,1}$ and $X_{i,2}$, normalized so that $\mathrm{E}[\gamma(X_i)] = 1$, which is the value of the scale parameter of the log-normal distribution used in Scheme 1. The mean and the variance of individual risk $i$ are respectively given by

$$\mu(X_i, \Theta_i) = e^{f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] = \begin{cases} e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.28}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} e^{\gamma(X_i)}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \left(e^{\gamma(X_i)} - 1\right) e^{\gamma(X_i)}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ 3e^{\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.29}$$
for $i = 1, \ldots, I$.

Note that the claim distribution of each individual risk $i$ is fully determined by its covariate vector $X_i$ in simulation Schemes 1 and 2. This means that the risk profile is completely characterized by the observable covariates $X_i$ and there is no random effect on the claims distribution. In insurance practice, however, two insureds with exactly the same covariate values may have different risk profiles; it is thus reasonable to take into account a random effect component when modelling the claims distribution. In light of this observation, we design simulation Schemes 3 and 4.

Scheme 3 For each $i = 1, \ldots, I$,

(1) independently simulate the random effect variable $\Theta_i$ from the uniform distribution $U(0.9, 1.1)$;

(2) independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i)$ which depends on the covariates $X_i$ associated with risk $i$, for $j = 1, \ldots, n$; and

(3) define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = \Theta_i \left[ e^{f(X_i)} + \varepsilon_{i,j} \right], \quad j = 1, \ldots, n, \tag{6.30}$$
where $f(X_i)$ is given by (6.23). In our simulation studies, we consider each of the three distributions described in Scheme 2 for the distribution $F(\cdot\,; X_i)$ in step (2) of the present scheme.

In Scheme 3, a random variable $\Theta_i$ is multiplied by $\left[e^{f(X_i)} + \varepsilon_{i,j}\right]$, $j = 1, \ldots, n$, to generate the claims $Y_{i,j}$ and introduce a random effect into the claims distribution. The claim distributions of two risks with the same covariates are no longer the same in simulation Scheme 3, because the claim distribution depends on the random effect variable $\Theta_i$ in addition to the covariate vector $X_i$. This also means that the profile of an individual risk $i$ simulated from Scheme 3 can no longer be fully captured by its covariates $X_i$. Accordingly, for each individual risk $i = 1, 2, \ldots, I$, the mean and the variance of the risk are respectively given by

$$\mu(X_i, \Theta_i) = \Theta_i \left( e^{f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] \right) = \begin{cases} \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for LN}\left(0, \gamma(X_i)\right), \\ \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.31}$$
and
$$\sigma^2(X_i, \Theta_i) = \Theta_i^2\, \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} \Theta_i^2\, e^{\gamma(X_i)}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \Theta_i^2 \left(e^{\gamma(X_i)} - 1\right) e^{\gamma(X_i)}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ 3\Theta_i^2\, e^{\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right). \end{cases} \tag{6.32}$$
Note that the variance of the individual claim distribution depends on the random effect variable $\Theta_i$ in addition to the covariate vector $X_i$ in Scheme 3.

Scheme 3 contains a multiplicative random effect. We next consider a different way of incorporating a random effect term into the claim distribution, as described in Scheme 4.

Scheme 4 For each $i = 1, \ldots, I$, with $\Theta_i = (\xi_{i,1}, \xi_{i,2})^T$,

(1) independently simulate random effect variables $\xi_{i,1}$ and $\xi_{i,2}$ from the uniform distribution $U(0.9, 1.1)$;

(2) independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i, \xi_{i,2})$ which depends on the covariate vector $X_i$ and the random effect variable $\xi_{i,2}$, for $j = 1, \ldots, n$; and

(3) define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{\xi_{i,1} f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.33}$$
where $f(X_i)$ is defined in (6.23). We respectively consider the three distributions defined in equation (6.27) of Scheme 2, with $\gamma(X_i)$ replaced by $\xi_{i,2}\, \gamma(X_i)$, so that the distribution of $\varepsilon_{i,j}$ depends on the random effect variable $\xi_{i,2}$ in addition to the covariate vector $X_i$.

For individual risks $i = 1, \ldots, I$ from Scheme 4, the mean and the variance of the individual risk are respectively given by
$$\mu(X_i, \Theta_i) = e^{\xi_{i,1} f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] = \begin{cases} e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for EXP}\left(e^{\xi_{i,2}\gamma(X_i)/2}\right), \\ e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for LN}\left(0, \xi_{i,2}\gamma(X_i)\right), \\ e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for PAR}\left(3, 2e^{\xi_{i,2}\gamma(X_i)/2}\right), \end{cases} \tag{6.34}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} e^{\xi_{i,2}\gamma(X_i)}, & \text{for EXP}\left(e^{\xi_{i,2}\gamma(X_i)/2}\right), \\ \left(e^{\xi_{i,2}\gamma(X_i)} - 1\right) e^{\xi_{i,2}\gamma(X_i)}, & \text{for LN}\left(0, \xi_{i,2}\gamma(X_i)\right), \\ 3e^{\xi_{i,2}\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\xi_{i,2}\gamma(X_i)/2}\right). \end{cases} \tag{6.35}$$
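To make the schemes concrete, the R sketch below simulates one balanced sample in the spirit of Scheme 1 with exponential errors. The object names (X, eps, Y, mu) are ours, and the functional form of f follows the statement of (6.23) above, so the sketch should be read as an approximation of the authors' set-up rather than their exact code.

```r
# Sketch of a Scheme 1-type simulation: balanced claims, exponential errors.
set.seed(2016)
I <- 300; n <- 5
X <- matrix(sample(1:100, I * 5, replace = TRUE), nrow = I)   # covariates X_{i,1..5}
f <- 0.01 * (X[, 1] + 2 * X[, 2] - X[, 3] +
               2 * sqrt(X[, 1] * X[, 3]) - sqrt(X[, 2] * X[, 4]))
eps <- matrix(rexp(I * n, rate = 1 / 1.6487), nrow = I)       # EXP errors with mean 1.6487
Y   <- exp(f) + eps                                           # Y_{i,j} = e^{f(X_i)} + eps_{i,j}
mu  <- exp(f) + 1.6487                                        # true net premiums, cf. (6.24)

# To feed such a sample into the earlier list-based sketches:
# Y_list <- split(Y, row(Y));  w_list <- rep(list(rep(1, n)), I)
```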

For each of the four simulation schemes designed above and each of the three involved distributions (exponential, log-normal, and Pareto), we independently simulate 1,000 samples of the claims experience of $I = 300$ individual risks with $n = 5$, 10 and 20, respectively. For each simulated sample, we use the $L_2$ loss function (5.21) and the four credibility loss functions defined in (4.15), (4.16), (4.19) and (4.20) to grow and prune regression trees, and the best tree is selected using longitudinal cross-validation. In addition, we consider ad hoc covariate-dependent partitions, defined via the following notation:
$$\mathcal{P}(X_j) = \Big\{ \{i \in \mathcal{I} : X_{i,j} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j} > 50\} \Big\}, \quad j = 1, \ldots, 5,$$
and
$$\mathcal{P}(X_{j_1}, X_{j_2}) = \Big\{ \{i \in \mathcal{I} : X_{i,j_1} \leq 50,\, X_{i,j_2} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j_1} \leq 50,\, X_{i,j_2} > 50\},\ \{i \in \mathcal{I} : X_{i,j_1} > 50,\, X_{i,j_2} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j_1} > 50,\, X_{i,j_2} > 50\} \Big\},$$
for $j_1, j_2 = 1, \ldots, 5$ with $j_1 \neq j_2$. $\mathcal{P}(X_j)$ represents an equal binary partition based on a single covariate, and $\mathcal{P}(X_{j_1}, X_{j_2})$ is one based on two covariates. Those based on three or four covariates are denoted by $\mathcal{P}(X_{j_1}, X_{j_2}, X_{j_3})$ and $\mathcal{P}(X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4})$ and can be defined in the same manner. In our simulations, $\mathcal{P}(X_2)$, $\mathcal{P}(X_4)$, $\mathcal{P}(X_1, X_2, X_3)$, $\mathcal{P}(X_1, X_2, X_4)$, $\mathcal{P}(X_2, X_3, X_4)$, $\mathcal{P}(X_1, X_3, X_4)$, and $\mathcal{P}(X_1, X_2, X_3, X_4)$ are considered.

6.2 Performance Measure: Prediction Error

To compare the prediction performance of the credibility premiums resulting from various (ad hoc or data-driven) partitioning methods, we compute the prediction error for a given partition $\{\mathcal{I}_1, \ldots, \mathcal{I}_K\}$ as
$$\mathrm{PE} = \frac{1}{I} \sum_{i=1}^{I} \sum_{k=1}^{K} I\{X_i \in A_k\} \left( \pi_i^{(H)(k)} - \mu(X_i, \Theta_i) \right)^2, \tag{6.36}$$
where the form of $\pi_i^{(H)(k)}$ is given in (4.18) and the net premium $\mu(X_i, \Theta_i)$ of each individual risk is given in (6.24), (6.28), (6.31) and (6.34) for Schemes 1-4, respectively. In addition, we consider the credibility premium $P_i^{(H)}$ in (2.4) without partitioning the collective. The resulting collective prediction error is computed as
$$\mathrm{PE}_0 = \frac{1}{I} \sum_{i=1}^{I} \left( P_i^{(H)} - \mu(X_i, \Theta_i) \right)^2,$$
which does not use any covariate information and is anticipated to underperform the various covariate-dependent partitions. We define the relative prediction error (RPE) for each covariate-dependent partition as the ratio of its prediction error to the collective prediction error and denote it by $R = \mathrm{PE}/\mathrm{PE}_0$. A smaller prediction error or relative prediction error indicates better performance.
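The performance measure (6.36) is immediate to compute once each risk has a predicted premium and a true net premium. A minimal sketch, with object names of our own choosing (e.g. the output of the hypothetical partition_premium helper above as pred):

```r
# Prediction error (6.36) and relative prediction error R = PE / PE0.
# 'pred' : premiums from a covariate-dependent partition
# 'pred0': collective credibility premiums without partitioning, cf. (2.4)
# 'mu'   : true individual net premiums mu(X_i, Theta_i)
prediction_error <- function(pred, mu) mean((pred - mu)^2)

# PE  <- prediction_error(pred,  mu)
# PE0 <- prediction_error(pred0, mu)
# RPE <- PE / PE0     # values below 1 favour the partition-based premium
```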

6.3 Simulation Results

The performance, in terms of RPE, of the various covariate-dependent partitioning methods under the Balanced Claims Model with $n = 5$ is summarized in Figure 2, which contains 12 plots and summarizes the results for the three distributions (exponential, log-normal, Pareto) and the four simulation schemes described in Section 6.1. The 12 boxplots in each plot respectively correspond to the four credibility regression tree premiums (highlighted in grey) using the credibility loss functions (4.15), (4.16), (4.19) and (4.20) ($R_1$-$R_4$), the $L_2$ regression tree premium, the partition $\mathcal{P}(X_2)$ ($R_{(2)}$), the partition $\mathcal{P}(X_4)$ ($R_{(4)}$), the partition $\mathcal{P}(X_1, X_2, X_3)$ ($R_{(123)}$), the partition $\mathcal{P}(X_1, X_2, X_4)$ ($R_{(124)}$), the partition $\mathcal{P}(X_2, X_3, X_4)$ ($R_{(234)}$), the partition $\mathcal{P}(X_1, X_3, X_4)$ ($R_{(134)}$) and the partition $\mathcal{P}(X_1, X_2, X_3, X_4)$ ($R_{(1234)}$). We have the following interesting observations from Figure 2.

1. Almost all the boxes in Figure 2 lie below the value of 1, with the exception of the boxplots resulting from the ad hoc partition using covariate $X_{i,4}$, labeled $R_{(4)}$. An RPE smaller than 1 indicates more accurate prediction compared to the credibility premium for which no partitioning is performed. Thus, Figure 2 indicates that all the covariate-dependent partitioning methods we considered, except $\mathcal{P}(X_4)$, enhance the prediction accuracy of the credibility premium. Revisiting the form of $f(X_i)$ in (6.23), one can see that $X_{i,4}$ only appears in an interaction term, so it may not be informative enough to be used on its own for enhancing prediction accuracy.

2. In each of the 12 plots, the first four boxplots, in grey, summarize the RPEs resulting from the 1,000 simulations using the four proposed credibility regression trees. The four boxplots are positioned at a similar level and take a similar shape across all 12 plots, which indicates that the four credibility losses perform almost equally well when used to build the regression tree credibility premium for predicting individual net premiums.

3. The four boxplots in grey are consistently positioned lower than those from the other prediction rules, and this holds across all the plots in Figure 2. This observation indicates that our regression tree credibility premium outperforms the others across all the simulation schemes and all the distribution types involved in our numerical simulation.

4. The $L_2$ regression tree performs consistently worse than the four credibility regression trees, but it performs either better than or as well as each of the seven ad hoc partitioning methods.

5. Comparing different columns in Figure 2, one can see that the relative prediction errors tend to be smaller for the Pareto distribution, whereas the exponential distribution tends to yield larger relative prediction errors.

Figure 2: Boxplots of RPEs for the Balanced Claims Model with $n = 5$, for three distributions (exponential, log-normal, Pareto) and four simulation schemes (Schemes 1-4), using various covariate-dependent partitioning methods.

As is well known, the Pareto distribution is commonly perceived as heavy-tailed, the exponential distribution is light-tailed, and the log-normal is in between. Therefore, such a comparison suggests that our regression tree based methods can improve relative prediction accuracy to a larger extent for a heavy-tailed claim distribution than for a light-tailed one.

6. The plots in the third and fourth rows contain results from simulation Schemes 3 and 4, respectively, where the individual net premium depends on unobservable random effect term(s). Compared with those in the first and second rows, they do not display an obvious shift in position or shape, which suggests that the presence of random effects does not deteriorate the prediction performance of the regression tree credibility premium.

7. The rightmost seven boxplots in each plot correspond to the ad hoc partitions $\mathcal{P}(X_2)$, $\mathcal{P}(X_4)$, $\mathcal{P}(X_1, X_2, X_3)$, $\mathcal{P}(X_1, X_2, X_4)$, $\mathcal{P}(X_2, X_3, X_4)$, $\mathcal{P}(X_1, X_3, X_4)$ and $\mathcal{P}(X_1, X_2, X_3, X_4)$. As one can see, these partitions are all based on the four influential covariate variables $X_{i,1}$, $X_{i,2}$, $X_{i,3}$ and $X_{i,4}$, and the noise covariate $X_{i,5}$ is not involved. In contrast, we did not preclude the covariate $X_{i,5}$ from the list of covariates by any ex ante variable selection procedure in the course of building the credibility regression trees and the $L_2$ regression tree. The boxplots in Figure 2, however, show that the regression tree based methods yield significantly smaller RPEs than those ad hoc partitioning methods, even though they are disadvantaged by the inclusion of the noise covariate $X_{i,5}$.

8. The relative prediction errors of partitions $\mathcal{P}(X_1, X_2, X_3)$ and $\mathcal{P}(X_1, X_2, X_4)$ are in general smaller than those of partitions $\mathcal{P}(X_2, X_3, X_4)$ and $\mathcal{P}(X_1, X_3, X_4)$, even though the number of sub-collectives is the same for the four partitioning methods. This suggests that it is not the number of sub-collectives into which the collective is partitioned that matters, but the form of the partition. A partition formed in an ad hoc manner does not guarantee a promising prediction result; thus, it is necessary to adopt data-driven partitioning algorithms, such as our regression tree based model, to develop accurate premium predictions.

We also examine the performance of the prediction rules by increasing the number of individual claims to $n = 10$ and $n = 20$; the resulting RPEs are reported in the boxplots shown in Figures 4 and 5 in Appendix B, respectively, for which all the comments made for $n = 5$ also apply. We further explore how the prediction accuracy of these 13 prediction rules changes as the claims experience of individual risks, captured by the parameter $n$ in our simulation schemes, increases. To this end, we compute the average (absolute) prediction error over the 1,000 simulations for each combination of a value of $n$ (5, 10, or 20), a distribution (exponential, log-normal, or Pareto) and one of the four simulation schemes defined previously. Table 1 reports the results for the base prediction rule ($\mathrm{PE}_0$), the CRT premium using credibility loss function (4.19) ($\mathrm{PE}_3$), the $L_2$ regression tree


More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Growing a Large Tree

Growing a Large Tree STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

More information

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks Tree-based methods Patrick Breheny December 4 Patrick Breheny STA 621: Nonparametric Statistics 1/36 Introduction Trees Algorithm We ve seen that local methods and splines both operate locally either by

More information

A Practitioner s Guide to Generalized Linear Models

A Practitioner s Guide to Generalized Linear Models A Practitioners Guide to Generalized Linear Models Background The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically

More information

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes: Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation

More information

Show that the following problems are NP-complete

Show that the following problems are NP-complete Show that the following problems are NP-complete April 7, 2018 Below is a list of 30 exercises in which you are asked to prove that some problem is NP-complete. The goal is to better understand the theory

More information

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees arxiv:1511.07178v1 [stat.me] 23 Nov 2015 Detection of Uniform and Non-Uniform Differential Functioning by Focussed Trees Moritz Berger & Gerhard Tutz Ludwig-Maximilians-Universität München Akademiestraße

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES

ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES FILIPPO SIEGENTHALER / filippo78@bluewin.ch 1 ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES ABSTRACT In this contribution we present closed-form formulas in

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk Causal Inference in Observational Studies with Non-Binary reatments Statistics Section, Imperial College London Joint work with Shandong Zhao and Kosuke Imai Cass Business School, October 2013 Outline

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

Generalized linear mixed models (GLMMs) for dependent compound risk models

Generalized linear mixed models (GLMMs) for dependent compound risk models Generalized linear mixed models (GLMMs) for dependent compound risk models Emiliano A. Valdez joint work with H. Jeong, J. Ahn and S. Park University of Connecticut 52nd Actuarial Research Conference Georgia

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Bayesian Learning. Bayesian Learning Criteria

Bayesian Learning. Bayesian Learning Criteria Bayesian Learning In Bayesian learning, we are interested in the probability of a hypothesis h given the dataset D. By Bayes theorem: P (h D) = P (D h)p (h) P (D) Other useful formulas to remember are:

More information

Mathematical models in regression credibility theory

Mathematical models in regression credibility theory BULETINUL ACADEMIEI DE ŞTIINŢE A REPUBLICII MOLDOVA. MATEMATICA Number 3(58), 2008, Pages 18 33 ISSN 1024 7696 Mathematical models in regression credibility theory Virginia Atanasiu Abstract. In this paper

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression: Biost 518 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture utline Choice of Model Alternative Models Effect of data driven selection of

More information

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 42, NO 6, JUNE 1997 771 Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach Xiangbo Feng, Kenneth A Loparo, Senior Member, IEEE,

More information

Credibility Theory for Generalized Linear and Mixed Models

Credibility Theory for Generalized Linear and Mixed Models Credibility Theory for Generalized Linear and Mixed Models José Garrido and Jun Zhou Concordia University, Montreal, Canada E-mail: garrido@mathstat.concordia.ca, junzhou@mathstat.concordia.ca August 2006

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Performance of Cross Validation in Tree-Based Models

Performance of Cross Validation in Tree-Based Models Performance of Cross Validation in Tree-Based Models Seoung Bum Kim, Xiaoming Huo, Kwok-Leung Tsui School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia 30332 {sbkim,xiaoming,ktsui}@isye.gatech.edu

More information

Notation. Pattern Recognition II. Michal Haindl. Outline - PR Basic Concepts. Pattern Recognition Notions

Notation. Pattern Recognition II. Michal Haindl. Outline - PR Basic Concepts. Pattern Recognition Notions Notation S pattern space X feature vector X = [x 1,...,x l ] l = dim{x} number of features X feature space K number of classes ω i class indicator Ω = {ω 1,...,ω K } g(x) discriminant function H decision

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data

Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data Online Supplementary Material MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data PI : Subhadeep Mukhopadhyay Department of Statistics, Temple University Philadelphia, Pennsylvania,

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo Workshop on Heterogeneous Computing, 16-20, July 2012 No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo K. P. N. Murthy School of Physics, University of Hyderabad July 19, 2012 K P N Murthy

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

Regression Clustering

Regression Clustering Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form

More information

LECTURE NOTE #3 PROF. ALAN YUILLE

LECTURE NOTE #3 PROF. ALAN YUILLE LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Representation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2

Representation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2 Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 32 Learning a generative model We are given a training

More information

Course 4 Solutions November 2001 Exams

Course 4 Solutions November 2001 Exams Course 4 Solutions November 001 Exams November, 001 Society of Actuaries Question #1 From the Yule-Walker equations: ρ φ + ρφ 1 1 1. 1 1+ ρ ρφ φ Substituting the given quantities yields: 0.53 φ + 0.53φ

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

1 Degree distributions and data

1 Degree distributions and data 1 Degree distributions and data A great deal of effort is often spent trying to identify what functional form best describes the degree distribution of a network, particularly the upper tail of that distribution.

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information