Regression Tree Credibility Model. Liqun Diao, Chengguo Weng


Regression Tree Credibility Model

Liqun Diao, Chengguo Weng
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, N2L 3G1, Canada

Version: Sep 16, 2016

Abstract

Credibility theory is viewed as a cornerstone of actuarial science. This paper brings machine learning techniques into the area and proposes a novel credibility model and credibility premium formula based on regression trees, called the regression tree credibility (RTC) premium. The proposed RTC method first recursively partitions a collective of individual risks into exclusive sub-collectives via binary splits, using a new credibility regression tree algorithm based on credibility loss, and then applies the classical Bühlmann-Straub credibility formula to predict individual net premiums within each sub-collective. The proposed method effectively predicts individual net premiums by incorporating covariate information, and it is particularly appealing for capturing nonlinear covariate effects and/or interaction effects because no specific regression form needs to be pre-specified. Our proposed RTC method automatically selects influential covariate variables for premium prediction, with no additional ex ante variable selection procedure required. The superiority in prediction accuracy of the proposed RTC model is demonstrated by extensive simulation studies.

Keywords: Credibility Theory; Regression Trees; Premium Rating; Predictive Analytics.

Assistant Professor. liqun.diao@uwaterloo.ca
Corresponding Author. Associate Professor. chengguo.weng@uwaterloo.ca

1 Introduction

Risk classification, as a key ingredient of insurance pricing, exploits observable characteristics to classify insureds with similar expected claims and to build a tariffication system that enables price discrimination on insurance products. An a priori classification scheme is a premium rating system developed to correctly express measurable a priori information about a new policyholder or an insured without any claims experience. An a priori classification scheme is often unable to identify all the important factors for premium rating because some are either unmeasurable or unobservable. An a posteriori classification (or experience rating) system is therefore necessary to re-rate risks by integrating claims experience into the rating system, and to obtain a fairer and more reasonable price discrimination scheme.

Credibility theory, viewed as a cornerstone of actuarial science (Hickman and Heacox, 1999), has become the paradigm for insurance experience rating and is widely used by actuaries. Its advent dates back to Whitney (1918). The idea behind the theory is to view the net premium of an individual risk as a function $\mu(\Theta)$ of a random element $\Theta$ representing the unobservable characteristics of the individual risk, and to calculate the (credibility) premium as a linear combination of individual claims experience and the collective net premium. Credibility theory has developed into two streams: limited fluctuation credibility theory and greatest accuracy credibility theory. The former is a stability-oriented theory whose main objective is to incorporate into the premium as much individual claims experience as possible while keeping the premium sufficiently stable. Modern applications of credibility theory mainly rely on the latter, the greatest accuracy theory, which approximates the net premium by minimizing the mean squared prediction error, thereby deriving the best linear unbiased estimator based on both individual and collective claims experience.

The theoretical foundation of credibility theory was first established by Bühlmann (1967, 1969), bringing about the famous Bühlmann model, and was subsequently extended in many directions. The model by Bühlmann and Straub (1970) allows individual risks to be associated with distinct volume measures (e.g., risk exposure). Hachemeister (1975) generalized the model to the regression context, and Jewell (1973) extended it to multi-dimensional risks. Since insurance data often show a natural hierarchical structure, the hierarchical credibility model was developed to embody this feature (Jewell, 1975; Taylor, 1979; Sundt, 1980; Norberg, 1986; Bühlmann and Jewell, 1987). Credibility theory within the framework of generalized linear models can be found in Nelder and Verrall (1997), Ohlsson (2008) and Garrido and Zhou (2009).

In insurance practice, relevant covariates are usually available to partially, if not fully, reflect the risk profile of individual risks (e.g., age, gender, annual income, driving history, etc.). In some of the existing literature, covariate information has been incorporated into various credibility models (e.g., Hachemeister, 1975; Nelder and Verrall, 1997; Ohlsson, 2008; Garrido and Zhou, 2009), but a specific regression form

for either the individual net premium or the claim amount on covariates needs to be pre-specified (e.g., linear regression, generalized linear regression, etc.) before the credibility premium is calculated. As such, the available credibility models lack the flexibility to incorporate nonlinear effects and/or interaction effects, which are highly likely to exist in insurance data, and thus carry the risk of model misspecification.

We propose a novel regression tree credibility (RTC) model in this paper which, for the first time, brings machine learning techniques into the framework of credibility theory and provides encouraging performance in terms of premium prediction accuracy compared to classical credibility models. Regression trees (Breiman et al., 1984) are popular machine learning algorithms developed to construct prediction models by recursively partitioning the data space, via binary splits, into smaller regions in which a simple model is sufficient. Many interesting extensions and alterations of regression trees have been proposed, and a thorough review can be found in Loh (2014). Regression trees are powerful non-parametric alternatives to parametric regression methods and are particularly appealing if interest lies in risk classification and prediction.

Our proposed RTC model provides a flexible way to utilize covariate information for premium prediction. It proceeds in two steps. In the first step, a novel regression tree algorithm is proposed to utilize covariate information to partition a collective of risks into exclusive sub-collectives such that the risk profiles tend to be homogeneous within each sub-collective and heterogeneous across sub-collectives. The resulting regression trees are called credibility regression trees because they are built to minimize credibility loss in premium prediction. In the second step, the classical credibility premium formula is applied within each sub-collective. Because the eventual credibility premium formula for individual risks results from the implementation of a regression tree algorithm, we call our credibility model the regression tree credibility model.

Our proposed RTC model possesses many advantages over the existing credibility models in the literature. First, for the first time, covariate variables are exploited to classify the unobservable risk profiles for a credibility model in a prioritized way, whereas they, if considered, are typically used to characterize the conditional net premium of an individual risk, for example, in the regression credibility model of Hachemeister (1975). The well-known hierarchical credibility models (Bühlmann and Jewell, 1987) also utilize covariate information to classify individual risks, but this is done in an ad hoc manner. The classification in hierarchical credibility models is created according to some inherent hierarchical structure of insurance data, for example, a certain geographic or demographic structure. In contrast, the RTC model classifies risks in a data-driven manner. Second, our proposed credibility model provides an effective framework to capture nonlinear and interaction effects of covariates on the individual net premium, benefiting from the use of regression trees. In other words, no explicit regression form needs to be assumed for either the individual net premium or the risk distribution in general, whereas all the existing credibility models using covariates require a specific regression form (e.g., Hachemeister, 1975; Nelder and Verrall, 1997). Last but not least,

our proposed RTC model automatically selects significant covariate variables for premium prediction in the course of building the credibility regression tree, and no ex ante variable selection procedure is required.

The rest of the paper proceeds as follows. Section 2 provides a brief review of the Bühlmann-Straub credibility model, Section 3 explains the benefit of partitioning, Section 4 specifies our credibility model, and Section 5 describes the procedure for establishing the credibility regression tree and calculating premiums. Sections 6 and 7 contain simulation results for balanced and unbalanced claims data, respectively. Section 8 concludes the paper. Finally, a technical proof and additional graphics are provided in the appendices.

2 Bühlmann-Straub Credibility Model

In this section, a brief review of the Bühlmann-Straub credibility model is presented to facilitate the development of our regression tree credibility model in the subsequent sections.

Model 1 (Bühlmann-Straub Credibility Model) Consider a portfolio of $I$ risks, where each individual risk $i$ has $n_i$ years of claims experience, $i = 1, 2, \ldots, I$. Let $Y_{i,j}$ denote the claim ratio of individual risk $i$ in year $j$, and $m_{i,j}$ be the associated volume measure, also known as the weight variable. The profile of individual risk $i$ is characterized by a scalar $\theta_i$, which is the realization of a random element $\Theta_i$ (usually either a random variable or a random vector). Collect all the claims experience of individual risk $i$ into a vector $Y_i$, i.e., $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$, and assume that the following conditions are satisfied:

H11. Conditionally given $\Theta_i = \theta_i$, $\{Y_{i,j} : j = 1, 2, \ldots, n_i\}$ are independent with
$$\mathrm{E}[Y_{i,j} \mid \Theta_i = \theta_i] = \mu(\theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid \Theta_i = \theta_i] = \frac{\sigma^2(\theta_i)}{m_{i,j}}$$
for some unknown but deterministic functions $\mu(\cdot)$ and $\sigma^2(\cdot)$;

H12. The pairs $(\Theta_1, Y_1), \ldots, (\Theta_I, Y_I)$ are independent, and $\{\Theta_1, \ldots, \Theta_I\}$ are independent and identically distributed.

Define structural parameters $\sigma^2 = \mathrm{E}[\sigma^2(\Theta_i)]$ and $\tau^2 = \mathrm{Var}[\mu(\Theta_i)]$ for risks within the collective $\mathcal{I} := \{1, 2, \ldots, I\}$, and denote
$$m_i = \sum_{j=1}^{n_i} m_{i,j}, \qquad m = \sum_{i=1}^{I} m_i.$$
Let $\mu = \mathrm{E}[Y_{i,j}]$ denote the collective net premium, and define
$$\bar{Y}_i = \sum_{j=1}^{n_i} \frac{m_{i,j}}{m_i}\, Y_{i,j}, \qquad \bar{Y} = \sum_{i=1}^{I} \frac{m_i}{m}\, \bar{Y}_i,$$

which are the sample average of claims associated with individual risk $i$ and that of the collective $\mathcal{I}$, respectively.

In the literature, two estimators were proposed for the individual premium $\mu(\Theta_i)$ in the above Bühlmann-Straub credibility model: the inhomogeneous credibility premium and the homogeneous credibility premium. The inhomogeneous credibility premium is the best premium predictor $P_i^{(I)}$ among the class
$$\left\{ P_i = a_0 + \sum_{i=1}^{I} \sum_{j=1}^{n_i} a_{i,j} Y_{i,j} \;:\; a_0 \in \mathbb{R},\ a_{i,j} \in \mathbb{R} \right\}$$
minimizing the quadratic loss $\mathrm{E}\big[(\mu(\Theta_i) - P_i)^2\big]$, or equivalently $\mathrm{E}\big[(Y_{i,n_i+1} - P_i)^2\big]$, and its formula is given by
$$P_i^{(I)} = \alpha_i \bar{Y}_i + (1 - \alpha_i)\mu,$$
where
$$\alpha_i = \frac{m_i}{m_i + \sigma^2/\tau^2} \tag{2.1}$$
is called the credibility factor for individual risk $i$ (Bühlmann and Gisler, 2005). The corresponding minimum quadratic losses for $\mathrm{E}[(\mu(\Theta_i) - P_i)^2]$ and $\mathrm{E}[(Y_{i,n_i+1} - P_i)^2]$ are respectively given by
$$L_{1,i}^{(I)} = \frac{\sigma^2}{m_i + \sigma^2/\tau^2} \tag{2.2}$$
and
$$L_{2,i}^{(I)} = \sigma^2 + \frac{\sigma^2}{m_i + \sigma^2/\tau^2}. \tag{2.3}$$

In contrast, the homogeneous credibility premium for an individual risk $i$ from the collective $\mathcal{I}$ is defined as the best premium predictor $P_i^{(H)}$ among the class
$$\left\{ Q_i : Q_i = \sum_{i=1}^{I} \sum_{j=1}^{n_i} a_{i,j} Y_{i,j},\ \mathrm{E}[Q_i] = \mathrm{E}[\mu(\Theta_i)],\ a_{i,j} \in \mathbb{R} \right\}$$
minimizing the quadratic loss $\mathrm{E}\big[(\mu(\Theta_i) - Q_i)^2\big]$, or equivalently $\mathrm{E}\big[(Y_{i,n_i+1} - Q_i)^2\big]$, and its formula is given by
$$P_i^{(H)} = \alpha_i \bar{Y}_i + (1 - \alpha_i)\bar{Y}, \tag{2.4}$$
where the credibility factor $\alpha_i$ is the same as the one in the inhomogeneous premium and is given in (2.1). The corresponding minimum quadratic losses for $\mathrm{E}[(\mu(\Theta_i) - Q_i)^2]$ and $\mathrm{E}[(Y_{i,n_i+1} - Q_i)^2]$ are respectively given by
$$L_{1,i}^{(H)} = \tau^2 (1 - \alpha_i)\left(1 + \frac{1 - \alpha_i}{\alpha_\bullet}\right) \tag{2.5}$$

and
$$L_{2,i}^{(H)} = \sigma^2 + \tau^2 (1 - \alpha_i)\left(1 + \frac{1 - \alpha_i}{\alpha_\bullet}\right), \tag{2.6}$$
where $\alpha_\bullet = \sum_{i=1}^{I} \alpha_i$.

In the implementation of the above credibility premium formulas, the structural parameters $\sigma^2$ and $\tau^2$ are respectively estimated by
$$\hat{\sigma}^2 = \frac{1}{I} \sum_{i=1}^{I} \frac{1}{n_i - 1} \sum_{j=1}^{n_i} m_{i,j}\left(Y_{i,j} - \bar{Y}_i\right)^2 \quad \text{and} \quad \hat{\tau}^2 = \max\left(\tilde{\tau}^2, 0\right), \tag{2.7}$$
where
$$\tilde{\tau}^2 = c \left[ \frac{I}{I-1} \sum_{i=1}^{I} \frac{m_i}{m}\left(\bar{Y}_i - \bar{Y}\right)^2 - \frac{I\hat{\sigma}^2}{m} \right] \tag{2.8}$$
and
$$c = \frac{I-1}{I} \left[ \sum_{i=1}^{I} \frac{m_i}{m}\left(1 - \frac{m_i}{m}\right) \right]^{-1}.$$
$\hat{\sigma}^2$ and $\tilde{\tau}^2$ are unbiased and consistent estimators for $\sigma^2$ and $\tau^2$, respectively, but $\tilde{\tau}^2$ can possibly be negative (p. 62, Bühlmann and Gisler, 2005); thus, $\hat{\tau}^2$ is used as the estimator for $\tau^2$.
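For concreteness, the estimators (2.7)-(2.8) and the homogeneous premium (2.4) can be assembled in a few lines of R. The sketch below is ours, not the authors' code: the function name bs_credibility and the data layout (lists Y and w holding each risk's claim ratios and weights) are assumptions for illustration only.

```r
# Illustrative sketch of the Buhlmann-Straub estimators (2.7)-(2.8)
# and the homogeneous credibility premium (2.4).
bs_credibility <- function(Y, w) {
  # Y, w: lists of length I; Y[[i]] and w[[i]] hold the claim ratios
  #       Y_{i,j} and the weights m_{i,j} of risk i (needs n_i >= 2).
  I    <- length(Y)
  m_i  <- sapply(w, sum)                               # m_i = sum_j m_{i,j}
  m    <- sum(m_i)
  Ybar_i <- mapply(function(y, mw) sum(mw * y) / sum(mw), Y, w)
  Ybar   <- sum(m_i * Ybar_i) / m                      # collective average
  # sigma^2: within-risk weighted variance, averaged over risks
  s2_i <- mapply(function(y, mw) {
    yb <- sum(mw * y) / sum(mw)
    sum(mw * (y - yb)^2) / (length(y) - 1)
  }, Y, w)
  sigma2 <- mean(s2_i)
  # tau^2 with the bias-correction constant c of (2.8)
  c_const  <- (I - 1) / I / sum((m_i / m) * (1 - m_i / m))
  tau2_raw <- c_const * (I / (I - 1) * sum((m_i / m) * (Ybar_i - Ybar)^2) -
                           I * sigma2 / m)
  tau2 <- max(tau2_raw, 0)
  alpha   <- m_i / (m_i + sigma2 / tau2)               # credibility factors (2.1)
  premium <- alpha * Ybar_i + (1 - alpha) * Ybar       # homogeneous premium (2.4)
  list(sigma2 = sigma2, tau2 = tau2, alpha = alpha, premium = premium)
}
```

Note that when $\hat{\tau}^2 = 0$ the division by zero propagates in R to $\alpha_i = 0$, so the sketch automatically returns the collective mean as the premium in that degenerate case.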

3 Benefit of Partitioning

This section investigates the benefit of partitioning a collective of risks into sub-collectives for premium prediction under the Bühlmann model, which is a special case of the Bühlmann-Straub model reviewed in Section 2 with $n_i = n$ and $m_{i,j} = 1$ for $i = 1, \ldots, I$ and $j = 1, \ldots, n$. Under the Bühlmann model, it is easy to check that the credibility factor is the same for all individual risks, i.e.,
$$\alpha := \alpha_i = \frac{n}{n + \sigma^2/\tau^2}, \quad i = 1, \ldots, I;$$
the four quadratic loss functions in (2.2), (2.3), (2.5) and (2.6) respectively become
$$L_{1,i}^{(I)} = \frac{\sigma^2}{n + \sigma^2/\tau^2} = \tau^2(1 - \alpha),$$
$$L_{2,i}^{(I)} = \sigma^2 + \frac{\sigma^2}{n + \sigma^2/\tau^2} = \sigma^2 + \tau^2(1 - \alpha),$$
$$L_{1,i}^{(H)} = \tau^2(1 - \alpha) + \frac{1}{I} \cdot \frac{\tau^2(1 - \alpha)^2}{\alpha},$$
$$L_{2,i}^{(H)} = \sigma^2 + \tau^2(1 - \alpha) + \frac{1}{I} \cdot \frac{\tau^2(1 - \alpha)^2}{\alpha}.$$

As the value of $I$ becomes large, the second term in the expression of $L_{1,i}^{(H)}$ and the third term in $L_{2,i}^{(H)}$ become negligible, and $L_{1,i}^{(H)}$ and $L_{2,i}^{(H)}$ reduce to $L_{1,i}^{(I)}$ and $L_{2,i}^{(I)}$, respectively. Therefore, only $L_{1,i}^{(I)}$ and $L_{2,i}^{(I)}$ will be studied regarding the benefit of partitioning in the rest of this section.

Without loss of generality, it will be proved that the overall credibility quadratic loss for any collective $\mathcal{I} = \{1, 2, \ldots, I\}$ does not worsen if it is split into two mutually exclusive sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$, where $\mathcal{I}_1 \cup \mathcal{I}_2 = \mathcal{I}$ and $\mathcal{I}_1 \cap \mathcal{I}_2 = \emptyset$, subject to an arbitrary partitioning criterion. When working with the loss function $L_{1,i}^{(I)}$, the total credibility loss for the collective $\mathcal{I}$ is the sum of the individual losses over the collective:
$$L = \sum_{i=1}^{I} L_{1,i}^{(I)} = \frac{I}{n/\sigma^2 + 1/\tau^2},$$
where $\sigma^2$ and $\tau^2$ are the structural parameters for the collective $\mathcal{I}$. We next consider the sum of the credibility losses of the two sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$. Let $p$ be the probability that a randomly selected individual risk from the collective $\mathcal{I}$ falls into the sub-collective $\mathcal{I}_1$, so that $q = 1 - p$ is the probability that it falls into the sub-collective $\mathcal{I}_2$. On average, there will be $pI$ individual risks lying in sub-collective $\mathcal{I}_1$ and $qI$ individual risks in sub-collective $\mathcal{I}_2$. Let $\mu_{(k)}$, $\sigma^2_{(k)}$ and $\tau^2_{(k)}$ be the structural parameters for sub-collective $\mathcal{I}_k$, $k = 1, 2$, defined in the same manner as $\mu$, $\sigma^2$ and $\tau^2$ introduced in Section 2, but based on sub-collective information. The expected total credibility loss functions for sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$ are, respectively, given by
$$L_1 = \frac{Ip}{n/\sigma^2_{(1)} + 1/\tau^2_{(1)}} \quad \text{and} \quad L_2 = \frac{Iq}{n/\sigma^2_{(2)} + 1/\tau^2_{(2)}}.$$
Further, it is easy to check that
$$\sigma^2 = p\sigma^2_{(1)} + q\sigma^2_{(2)}, \qquad \tau^2 = p\tau^2_{(1)} + q\tau^2_{(2)} + pq\left(\mu_{(1)} - \mu_{(2)}\right)^2 = \bar{\tau}^2 + pq\left(\mu_{(1)} - \mu_{(2)}\right)^2, \tag{3.9}$$
where $\bar{\tau}^2 = p\tau^2_{(1)} + q\tau^2_{(2)}$.

Proposition 3.1 The credibility losses $L$, $L_1$ and $L_2$ defined in this subsection satisfy
$$L_1 + L_2 \leq L. \tag{3.10}$$

Proof. See Appendix A.
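A quick numerical check of Proposition 3.1 can be done in a few lines of R. The structural parameter values below are made up for illustration; they are not taken from the paper.

```r
# Illustrative check of (3.10) with hypothetical structural parameters.
I <- 300; n <- 5; p <- 0.4; q <- 1 - p
sigma2_1 <- 2.0; sigma2_2 <- 3.0          # within-risk variances of the two sub-collectives
tau2_1   <- 0.5; tau2_2   <- 0.8          # between-risk variances
mu1 <- 1.0;  mu2 <- 2.5                    # sub-collective net premiums

sigma2 <- p * sigma2_1 + q * sigma2_2
tau2   <- p * tau2_1 + q * tau2_2 + p * q * (mu1 - mu2)^2   # relation (3.9)

L  <- I     / (n / sigma2   + 1 / tau2)    # total loss without splitting
L1 <- I * p / (n / sigma2_1 + 1 / tau2_1)  # loss of sub-collective 1
L2 <- I * q / (n / sigma2_2 + 1 / tau2_2)  # loss of sub-collective 2
c(L = L, L1_plus_L2 = L1 + L2, gain = L - (L1 + L2))   # gain is non-negative
```

With these values the split reduces the total loss from roughly 109 to roughly 88; setting mu1 equal to mu2 and the sub-collective parameters equal to each other makes the gain vanish, in line with the equality case discussed below.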

When working with the individual credibility loss $L_{2,i}^{(I)}$, the total credibility loss for the collective $\mathcal{I}$ is given by
$$\tilde{L} = I\sigma^2 + \frac{I}{n/\sigma^2 + 1/\tau^2} = I\sigma^2 + L.$$
The expected total credibility loss functions for sub-collectives $\mathcal{I}_1$ and $\mathcal{I}_2$ are respectively of the form
$$\tilde{L}_1 = Ip\sigma^2_{(1)} + \frac{Ip}{n/\sigma^2_{(1)} + 1/\tau^2_{(1)}} = Ip\sigma^2_{(1)} + L_1$$
and
$$\tilde{L}_2 = Iq\sigma^2_{(2)} + \frac{Iq}{n/\sigma^2_{(2)} + 1/\tau^2_{(2)}} = Iq\sigma^2_{(2)} + L_2.$$
It is easily shown that $\tilde{L}_1 + \tilde{L}_2 \leq \tilde{L}$ by applying Proposition 3.1 and the fact that $\sigma^2 = p\sigma^2_{(1)} + q\sigma^2_{(2)}$.

The discussion above suggests that any partitioning of a collective of individual risks does not deteriorate the prediction accuracy of the credibility premium formulas, provided that the corresponding structural parameters for each resulting sub-collective can be computed without any statistical error. This result provides a theoretical foundation for the use of any partitioning method in premium prediction, and it also encourages the use of prediction models which classify a given collective of risks into a great number of mutually exclusive sub-collectives. As in many other statistical models, however, estimation error is inevitable: the structural parameters are unknown in general, and estimating them from available data introduces statistical errors. The existence of estimation error restrains one from excessively partitioning the overall collective; instead, one needs to keep a balance between the benefits of partitioning and the adverse effects of estimation errors.

4 Model Setup

Model 2 (Unbalanced Claims Model) Consider a portfolio of $I$ risks numbered $1, \ldots, I$. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$ be the vector of claim ratios, $\mathbf{m}_i = (m_{i,1}, \ldots, m_{i,n_i})$ be the corresponding weight vector, and $X_i = (X_{i,1}, \ldots, X_{i,p})^T$ be the covariate vector associated with individual risk $i$, $i = 1, \ldots, I$. The risk profile of each individual risk $i$ is characterized by a scalar $\theta_i$, which is a realization of a random element $\Theta_i$. The following two conditions are further assumed:

H21. The triplets $(\Theta_1, Y_1, X_1), \ldots, (\Theta_I, Y_I, X_I)$ are independent;

H22. Conditionally given $\Theta_i = \theta_i$ and $X_i = x_i$, the entries $Y_{i,j}$, $j = 1, \ldots, n_i$, are independent, and
$$\mathrm{E}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \mu(x_i, \theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \frac{\sigma^2(x_i, \theta_i)}{m_{i,j}}$$
for some unknown but deterministic functions $\mu(\cdot, \cdot)$ and $\sigma^2(\cdot, \cdot)$.

The word unbalanced in the title of the above model signifies that the model permits distinct lengths of claims experience and distinct volume measures across individual risks in the collective. By setting $n_i = n$ and $m_{i,j} = 1$ for $i = 1, \ldots, I$ and $j = 1, \ldots, n_i$, the above unbalanced claims model reduces to the following balanced claims model.

Model 3 (Balanced Claims Model) Consider a portfolio of $I$ risks numbered $1, \ldots, I$. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,n})^T$ be the vector of claims and $X_i = (X_{i,1}, \ldots, X_{i,p})^T$ be the covariate vector associated with

individual risk $i$, $i = 1, \ldots, I$. The risk profile of individual risk $i$ is characterized by a scalar $\theta_i$, which is a realization of a random element $\Theta_i$. The following two conditions are further assumed:

H31. The triplets $(\Theta_1, Y_1, X_1), \ldots, (\Theta_I, Y_I, X_I)$ are independent;

H32. Conditionally given $\Theta_i = \theta_i$ and $X_i = x_i$, the entries $Y_{i,j}$, $j = 1, \ldots, n$, are independent, and
$$\mathrm{E}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \mu(x_i, \theta_i) \quad \text{and} \quad \mathrm{Var}[Y_{i,j} \mid X_i = x_i, \Theta_i = \theta_i] = \sigma^2(x_i, \theta_i) \tag{4.11}$$
for some unknown but deterministic functions $\mu(\cdot, \cdot)$ and $\sigma^2(\cdot, \cdot)$.

The above credibility models are more general than the ones in the literature, because the existing ones without exception assume some specific functional form for $\mu(X_i, \Theta_i)$ and $\sigma^2(X_i, \Theta_i)$. For example, the well-known Hachemeister model assumes $\mu(X_i, \Theta_i) = X_i^T \beta(\Theta_i)$, a linear regression form with random coefficient $\beta(\Theta_i)$ depending on the underlying risk profile $\Theta_i$ of individual risk $i$. In contrast, our proposed model leaves the form of the individual net premium $\mu(X_i, \Theta_i)$ unspecified.

Since partitioning of the data space improves premium prediction accuracy, as discussed in Section 3, it is natural to consider partitioning guided by available covariate information. Inspired by ideas from nonparametric statistics, we approximate the conditional mean $\mu(X_i, \Theta_i)$ and the conditional variance of $Y_{i,j}$ by piecewise constant forms in $X_i$, respectively,
$$\sum_{k=1}^{K} I\{X_i \in A_k\}\, \mu_{(k)}(\Theta_i) \quad \text{and} \quad \sum_{k=1}^{K} I\{X_i \in A_k\}\, \frac{\sigma^2_{(k)}(\Theta_i)}{m_{i,j}},$$
where $I\{\cdot\}$ is an indicator function, $\{A_1, A_2, \ldots, A_K\}$ is a partition of the covariate space, $K$ is the number of partitions, and $\mu_{(k)}(\Theta_i)$ and $\sigma^2_{(k)}(\Theta_i)$ respectively represent the net premium and the variance of an individual risk $i$ from the $k$th sub-collective with risk profile $\Theta_i$, i.e.,
$$\mu_{(k)}(\theta_i) = \mathrm{E}(Y_{i,j} \mid X_i \in A_k, \Theta_i = \theta_i) \quad \text{and} \quad \sigma^2_{(k)}(\theta_i) = \mathrm{Var}(Y_{i,j} \mid X_i \in A_k, \Theta_i = \theta_i). \tag{4.12}$$
The condition $X_i \in A_k$ means that the individual risk $i$ is classified into the $k$th sub-collective based on its covariate information.

Following the derivation of the inhomogeneous credibility premium for the Bühlmann-Straub credibility model reviewed in Section 2, the inhomogeneous credibility premium to predict $\mu_{(k)}(\Theta_i)$ using information from the $k$th sub-collective is of the form
$$\pi_i^{(I)(k)} = \alpha_i^{(k)} \bar{Y}_i + \left(1 - \alpha_i^{(k)}\right) \mu_{(k)}, \tag{4.13}$$
where $\mu_{(k)} = \mathrm{E}[Y_{i,j} \mid X_i \in A_k] = \mathrm{E}[\mu_{(k)}(\Theta_i)]$ is the sub-collective net premium corresponding to the $k$th partition,
$$\alpha_i^{(k)} = \frac{m_i}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}}$$

is the credibility factor of the $k$th sub-collective, and $\tau^2_{(k)}$ and $\sigma^2_{(k)}$ are the structural parameters for the $k$th sub-collective, i.e., $\tau^2_{(k)} = \mathrm{Var}[\mu_{(k)}(\Theta_i)]$ and $\sigma^2_{(k)} = \mathrm{E}[\sigma^2_{(k)}(\Theta_i)]$. We develop the following prediction rule based on covariate-dependent partitioning and inhomogeneous sub-collective credibility premiums:
$$\pi_i^{(I)} = \sum_{k=1}^{K} I\{X_i \in A_k\}\, \pi_i^{(I)(k)}. \tag{4.14}$$
The minimum quadratic losses $\mathrm{E}\big[\big(\mu_{(k)}(\Theta_i) - \pi_i^{(I)(k)}\big)^2\big]$ and $\mathrm{E}\big[\big(Y_{i,n_i+1} - \pi_i^{(I)(k)}\big)^2\big]$, accumulated over the $k$th sub-collective, are respectively
$$L_{1,(k)}^{(I)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \frac{\sigma^2_{(k)}}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}} \tag{4.15}$$
and
$$L_{2,(k)}^{(I)} = \sum_{i=1}^{I} I\{X_i \in A_k\} \left( \sigma^2_{(k)} + \frac{\sigma^2_{(k)}}{m_i + \sigma^2_{(k)}/\tau^2_{(k)}} \right). \tag{4.16}$$
Similarly, we can develop the homogeneous credibility prediction rule
$$\pi_i^{(H)} = \sum_{k=1}^{K} I\{X_i \in A_k\}\, \pi_i^{(H)(k)}, \tag{4.17}$$
where
$$\pi_i^{(H)(k)} = \alpha_i^{(k)} \bar{Y}_i + \left(1 - \alpha_i^{(k)}\right) \bar{Y}^{(k)}, \tag{4.18}$$
and $\bar{Y}^{(k)}$ denotes the average claims experience of the risks in the $k$th sub-collective. The two homogeneous credibility losses of the $k$th sub-collective are
$$L_{1,(k)}^{(H)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \tau^2_{(k)}\left(1 - \alpha_i^{(k)}\right)\left(1 + \frac{1 - \alpha_i^{(k)}}{\alpha_\bullet^{(k)}}\right) \tag{4.19}$$
and
$$L_{2,(k)}^{(H)} = \sum_{i=1}^{I} I\{X_i \in A_k\} \left[ \sigma^2_{(k)} + \tau^2_{(k)}\left(1 - \alpha_i^{(k)}\right)\left(1 + \frac{1 - \alpha_i^{(k)}}{\alpha_\bullet^{(k)}}\right) \right], \tag{4.20}$$
where $\alpha_\bullet^{(k)} = \sum_{i=1}^{I} I\{X_i \in A_k\}\, \alpha_i^{(k)}$.
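Given any fixed partition of the covariate space, the homogeneous prediction rule (4.17)-(4.18) simply applies the Bühlmann-Straub machinery within each sub-collective. The R sketch below illustrates this; it reuses the illustrative bs_credibility helper sketched earlier, and the function name partition_premium and the grouping vector are our own assumptions, not part of the paper.

```r
# Homogeneous sub-collective premium (4.17)-(4.18) for a given partition.
# 'group' assigns each risk i to a sub-collective A_k based on X_i.
partition_premium <- function(Y, w, group) {
  premium <- numeric(length(Y))
  for (k in unique(group)) {
    idx <- which(group == k)
    # Buhlmann-Straub within sub-collective k only
    fit <- bs_credibility(Y[idx], w[idx])
    premium[idx] <- fit$premium
  }
  premium
}

# Example usage with an ad hoc single-covariate split (threshold 50),
# i.e. a partition of the kind denoted P(X_1) in Section 6:
# group <- ifelse(X[, 1] <= 50, 1L, 2L)
# pred  <- partition_premium(Y, w, group)
```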

Figure 1: An example of covariate-dependent partitioning.

Figure 1 illustrates one example of a covariate-dependent partitioning, in which covariate $X_{i,1}$ and a threshold $c$ are selected to partition the collective into two sub-collectives $G_{11}$ and $G_{12}$. Consequently, the profile of individual risks in sub-collective $G_{11}$ is characterized by the conditional distribution of $(\Theta_i \mid X_{i,1} > c)$, whereas that in sub-collective $G_{12}$ is governed by the conditional distribution of $(\Theta_i \mid X_{i,1} \leq c)$. If the risk profile variable $\Theta_i$ is independent of the covariate $X_{i,1}$, the two sub-collectives are homogeneous in the sense that the risk profile variables from the two sub-collectives follow the same distribution; in this case, as one can see from the proof of Proposition 3.1, there is no benefit from partitioning, at least for a balanced claims model, because the total credibility loss remains the same, i.e., the inequality in (3.10) becomes an equality, while such a partition will increase the estimation error resulting from less data being available to estimate the structural parameters after the split. Thus, such a covariate is not a significant variable for the data space partitioning, and a good partitioning scheme should avoid partitioning the sample space using such a covariate variable.

In Figure 1, the sub-collectives $G_{11}$ and $G_{12}$ are considered for further partitioning, and $G_{11}$ is selected and partitioned into two further sub-collectives $G_{21}$ and $G_{22}$. While the partitioning procedure may go further for a finer partition, Figure 1 demonstrates the case where the sample space is partitioned into three sub-collectives $G_{12}$, $G_{21}$ and $G_{22}$, respectively defined by $\{X_{i,1} \leq c\}$, $\{X_{i,1} > c \ \& \ X_{i,2} > d\}$, and $\{X_{i,1} > c \ \& \ X_{i,2} \leq d\}$. In an example of auto insurance, $X_1$ may represent gender (male if $X_1 = 1$ and female if $X_1 = 0$) and $X_2$ may represent age. If we choose $c = 0$ and $d = 21$, then $G_{21}$ is the group of male drivers aged more than 21, $G_{22}$ is the group of male drivers aged 21 or under, and $G_{12}$ is the group of female drivers. The insurer then offers a differentiated price to each insured group due to their distinct risk profiles.

The choices of variables and thresholds used for partitioning are critical for gaining the benefits of partitioning, because if they are not properly selected, the benefit from partitioning can be very small while the estimation of structural parameters with smaller samples in the sub-collectives may lead to large statistical estimation errors. Generally speaking, an effective partitioning algorithm is expected to partition the data space in a premeditated manner such that the risks are relatively homogeneous within each resulting sub-collective and inhomogeneous between sub-collectives in a certain sense. We propose a novel credibility regression tree algorithm, which chooses the covariates, cutting points and order of partitioning in a data-driven manner and forms a partition of the data space which minimizes the overall credibility loss. A credibility

regression tree is an alteration of the Classification and Regression Trees (CART; Breiman et al., 1984). A brief review of CART is given in Section 5.1, and our proposed credibility regression tree is described in Section 5.2.

5 Regression Tree Credibility Premium

5.1 General Procedure in Regression Tree Methods

Recursive partitioning methods are machine learning techniques for quantifying the relationship between two sets of variables, the responses and the covariates (e.g., Zhang and Singer, 2010). Recursive partitioning methods recursively partition the data space into small regions so that a simple regression model is sufficient in each region. They are powerful alternatives to parametric regression methods and possess many inherent advantages. First, recursive partitioning methods do not require pre-specifying any mathematical form for the regression relationship between the responses and covariates, and they therefore enjoy more flexibility in capturing nonlinear effects and/or interaction effects of the covariates on the responses. Second, no a priori variable selection procedure is necessary before the establishment of a regression model; variable selection is automatically achieved in the course of model building. Third, the decision structure can be utilized for the purpose of prediction and easily visualized by a hierarchical graphical model. Recursive partitioning methods have become popular and widely used tools in many scientific fields but, surprisingly, they have rarely been applied in actuarial science so far.

The Classification and Regression Trees (CART) algorithm, with details developed in Breiman et al. (1984), is perhaps the most well-known recursive partitioning algorithm. Many extensions and alterations have been developed on the basis of CART; see Loh (2014) for a thorough review. CART is formulated in terms of three main steps: (1) tree growing, (2) tree pruning, and (3) tree selection. In what follows, we present a brief description of each of the three steps.

(1) Tree Growing. This step grows a large tree. Consider a data set with a response $W$ and covariates $X_i = (X_{i1}, \ldots, X_{ip})^T$. The partition algorithm begins with the top node, which contains the entire data set, and proceeds through binary splits of the form $\{X_{ij} \leq c\}$ versus $\{X_{ij} > c\}$ (where $c$ is a constant) for an ordinal or continuous covariate $X_{ij}$, and $X_{ij} \in S$ (where $S$ is a subset of the values that $X_{ij}$ may take) for a categorical covariate $X_{ij}$, leading to two mutually exclusive descendant nodes. The principle behind the splitting procedure is heterogeneity reduction in the response distribution, realized by separating the data points into two exclusive descendant nodes where the responses are more homogeneous within each subset than they were combined in the parent node. The heterogeneity of each node can be measured in various ways depending on the nature of

the underlying problem, and is usually quantified by a loss function such as the quadratic or $L_2$ loss function (for a node $g$)
$$L_{2,(g)} = \sum_{i=1}^{I} I\{i \in g\} \left(W_i - \bar{W}_{(g)}\right)^2, \tag{5.21}$$
where $W_i$ is the $i$th response from the node $g$ and $\bar{W}_{(g)}$ is the sample average of the responses in node $g$. In the course of tree building, the first split is selected as the one that yields the largest reduction in loss. The second split is chosen as the one leading to the largest loss reduction among all possible splits in the two descendant nodes. The tree keeps splitting until some pre-specified criteria, such as the minimum number of observations in each terminal node and the depth of the tree, are reached. The splitting process produces a large tree $\psi_0$, which almost always overfits the data, so a less complex model is needed and is built through steps (2) and (3).

(2) Tree Pruning. This step cuts (prunes) the large tree $\psi_0$ from the last step to create a sequence of nested subtrees. The creation of the subtrees relies on a cost-complexity pruning algorithm. Let $\psi_g$ denote the subtree of $\psi_0$ rooted at node $g$, containing $g$ and all of its descendant nodes. We further denote the set of terminal nodes of a tree $\psi$ by $\tilde{\psi}$. The cost-complexity of a subtree $\psi$ is defined as
$$L_\nu(\psi) = L(\psi) + \nu |\tilde{\psi}|,$$
where $L(\psi) = \sum_{g \in \tilde{\psi}} L_{2,(g)}$ is the loss of the subtree $\psi$, $|\tilde{\psi}|$ denotes the size of the tree $\psi$, i.e., the number of terminal nodes in $\psi$, and $\nu$ is called the complexity parameter. Whether the descendants of node $g$ should be pruned is decided by comparing the following two quantities:
$$L_\nu(\psi_g) = L(\psi_g) + \nu |\tilde{\psi}_g| \quad \text{and} \quad L_\nu(g) = L(g) + \nu,$$
where $L_\nu(\psi_g)$ and $L_\nu(g)$ are respectively the cost-complexities of the tree $\psi_g$ rooted at node $g$, containing $g$ and all its descendant nodes, and of the tree having only node $g$, i.e., the tree $\psi_g$ with all its descendant nodes collapsed into node $g$. The value
$$\nu_g = \frac{L(g) - L(\psi_g)}{|\tilde{\psi}_g| - 1}$$
is the critical point at which $L_\nu(\psi_g)$ is equal to $L_\nu(g)$. If $\nu$ grows larger than $\nu_g$, then $L_\nu(\psi_g) > L_\nu(g)$, which suggests that it is not worthwhile to split further on node $g$ and all descendant nodes of node $g$ should be collapsed into node $g$ (i.e., pruned). The pruning process starts with the weakest-linked node, i.e., calculating $\nu_g$ for all non-terminal nodes in the large tree $\psi_0$ and pruning the descendants of the node with the smallest $\nu_g$. Working with the resulting subtree, the $\nu_g$'s of the non-terminal nodes in the subtree are recalculated and the weakest-linked node is pruned off. The pruning continues

until only the top node is left, and it creates a sequence of subtrees of $\psi_0$ corresponding to a sequence of critical points of the weakest-linked nodes of the subtrees. The sequence of critical points is denoted by $\nu_1, \nu_2, \ldots, \nu_M$, and each point corresponds to a subtree, where $M$ is the number of subtrees in the sequence.

(3) Tree Selection. This step selects the best model among the sequence of nested subtrees established in the last step. Typically, the model selection is conducted via a cross-validation procedure (e.g., Breiman et al., 1984, Ch. 8.5), which is probably the most widely used method for estimating the prediction error of each candidate tree. The prediction error is often, though not necessarily, computed using the same loss function as the one used for tree growing and tree pruning. To conduct an $L$-fold cross-validation, the data set is divided into $L$ exclusive subsets $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_L$. The cross-validation error for the subtree corresponding to a critical point $\nu_m$ is given by
$$\sum_{l=1}^{L} \sum_{i=1}^{I} I\{i \in \mathcal{I}_l\}\left(W_i - \hat{\psi}^{-l}(X_i; \nu_m)\right)^2,$$
where $\hat{\psi}^{-l}(X_i; \nu_m)$ is the predicted value of the $i$th response under the tree grown using the full data set excluding subset $\mathcal{I}_l$ and pruned with the complexity parameter fixed at $\nu_m$. The best tree is selected as the one attaining the smallest cross-validation error.

As described above, the loss function plays a fundamental role in building the large tree, creating the sequence of candidate trees, and selecting the best tree via the cross-validation procedure. The CART algorithm can be implemented with the rpart package (Therneau and Atkinson, 2011) in the statistical software R (R Core Team, 2014).
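As a point of reference, the default $L_2$ regression tree of the three steps above can be grown, pruned, and selected with rpart along the following lines. This is only a sketch: the data frame dat and its column names are hypothetical, and the credibility trees of Section 5.2 additionally require replacing the anova loss and the built-in cross-validation, which is not shown here.

```r
library(rpart)

# Grow a large L2 regression tree on claims data (hypothetical data frame
# 'dat' with a per-record response Y and covariates X1,...,X5).
fit <- rpart(Y ~ X1 + X2 + X3 + X4 + X5, data = dat, method = "anova",
             control = rpart.control(cp = 0, minbucket = 20, xval = 10))

# The cp table lists the nested subtrees produced by cost-complexity
# pruning together with their cross-validated errors.
printcp(fit)

# Select the subtree whose complexity parameter minimizes the
# cross-validation error, then prune back to it.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Terminal-node membership of each record defines the sub-collectives
# used for the within-node premium calculation of Section 5.3.
node_of <- pruned$where
```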

5.2 Credibility Regression Trees

As mentioned at the end of Section 4, in this paper we propose novel credibility regression trees by altering the loss function and the cross-validation procedure of the default regression tree algorithm. The loss function and cross-validation procedure we propose are as follows.

(1) Loss Functions. The loss function is the key component of the CART algorithm, and the choice of loss function determines the decision rules in the tree growing, pruning and selection steps. We would expect different decisions to be made, and a different tree structure to be eventually built, by adopting a different loss function. The credibility regression tree algorithm uses, in the tree growing and pruning steps, one of the credibility loss functions given in (4.15), (4.16), (4.19) and (4.20) in place of the default $L_2$ loss function (5.21).

(2) Longitudinal Cross-validation. The cross-validation procedure in CART divides the collective $\mathcal{I} = \{1, 2, \ldots, I\}$ into $L$ subsets, and the cross-validation error is calculated to estimate the prediction error when interest lies in predicting the individual premium for new policyholders with no claims experience. However, in the context of credibility theory, the interest lies in predicting the individual net premium for those insureds who already have claims experience with the insurance company, and thus a new cross-validation procedure is required. The cross-validation procedure we propose divides individual claims experiences into $L$ subsets instead, and accordingly we call it longitudinal cross-validation. Recall that we observe $(Y_i, X_i)$ for individual risk $i$ from the collective, where $Y_i = (Y_{i,1}, \ldots, Y_{i,n_i})^T$ denotes the observed claims experience of individual risk $i$ and $X_i$ is the associated covariate vector, $i = 1, 2, \ldots, I$. The longitudinal cross-validation procedure we adopt for the credibility regression tree algorithm takes the following steps (a small coding sketch follows the list):

Step 1. For each $i = 1, \ldots, I$, denote $b_i = \lfloor n_i / L \rfloor$ and $l_i = (n_i \bmod L)$, and define the sequence
$$T_i = \{\underbrace{B, \ldots, B}_{b_i \text{ blocks}}, 1, 2, \ldots, l_i\},$$
where $B = \{1, 2, \ldots, L\}$. For example, when conducting 5-fold cross-validation, $T_i = \{1, 2, 3\}$ for $n_i = 3$, and $T_i = \{1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3\}$ for $n_i = 13$. For each $i = 1, \ldots, I$, independently sample a number $V_{i,j}$ from the sequence $T_i$ without replacement, for $j = 1, \ldots, n_i$.

Step 2. Denote $J_{i,l} := \{j \in \{1, \ldots, n_i\} : V_{i,j} = l\}$ and $Y_i^{(l)} = \{Y_{i,j} : j \in J_{i,l}\}$, $l = 1, \ldots, L$. We can then divide the whole data set of the collective into $L$ subsets:
$$\left\{ \left(Y_i^{(l)}, X_i\right) : i \in \mathcal{I} \right\}, \quad l = 1, \ldots, L.$$

Step 3. As in the CART algorithm, the pruning step of the credibility regression tree also results in a sequence of nested candidate trees associated with a sequence of complexity parameters $\nu_1, \ldots, \nu_M$. The cross-validation error for the candidate tree corresponding to $\nu_m$ is computed as
$$\sum_{l=1}^{L} \sum_{i=1}^{I} \sum_{j=1}^{n_i} I\{j \in J_{i,l}\}\left(Y_{i,j} - \hat{\psi}^{-l}(X_i; \nu_m)\right)^2,$$
where $\hat{\psi}^{-l}(X_i; \nu_m)$ is the predicted value of the $i$th response under the tree grown using the data excluding the set $\{Y_i^{(l)} : i \in \mathcal{I}\}$ and pruned at the fixed complexity parameter $\nu_m$.

Step 4. Select the tree which leads to the smallest cross-validation error.
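The fold assignment of Steps 1-2 is straightforward to code. The sketch below (the helper name longitudinal_folds is ours) builds the sequence $T_i$ and draws the labels $V_{i,j}$ without replacement for a single risk; applying it to every risk yields the $L$ longitudinal folds.

```r
# Longitudinal cross-validation labels: for a risk with n_i observations,
# build T_i = (1,...,L, 1,...,L, ..., 1,...,l_i) and permute it, so that
# each claim year j receives a fold label V_{i,j}.
longitudinal_folds <- function(n_i, L = 5) {
  b_i <- n_i %/% L                       # number of complete blocks of 1..L
  l_i <- n_i %%  L                       # remainder n_i mod L
  T_i <- c(rep(seq_len(L), b_i), seq_len(l_i))
  sample(T_i)                            # sampling without replacement
}

# Example: fold labels for a risk with 13 years of claims and L = 5,
# a random permutation of (1,...,5, 1,...,5, 1, 2, 3)
set.seed(1)
longitudinal_folds(13)
```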

5.3 Premium Calculation

Following the procedure described in Section 5.2, we can use one of the four loss functions defined in (4.15), (4.16), (4.19) and (4.20) to build a regression tree. The terminal nodes of an established credibility regression tree form a partition of the sample space into a number, say $K$, of sub-collectives,
$$\mathcal{I}_k = \{i \in \mathcal{I} : X_i \in A_k\}, \quad k = 1, \ldots, K,$$
where $\{A_k, k = 1, \ldots, K\}$ is a partition of the covariate space. Within each sub-collective $k \in \{1, \ldots, K\}$, the individual premium can be computed by formula (4.18) as described in Section 4.

6 Simulation Studies: Balanced Claims Model

6.1 Simulation Schemes

We consider a collective of $I = 300$ individual risks from Model 3, the Balanced Claims Model, and for each $i = 1, \ldots, I$, independently simulate a covariate vector $X_i = (X_{i,1}, \ldots, X_{i,5})$, where $X_{i,1}, \ldots, X_{i,5}$ are independent and identically distributed from the discrete uniform distribution over the set $\{1, \ldots, 100\}$. Our simulation studies are based on the following four distinct schemes, where the first scheme is the base and the others are its modifications made from different perspectives.

Scheme 1 (Base simulation scheme) For each $i = 1, \ldots, I$, independently simulate $\varepsilon_{i,j}$ from a given distribution function $F(\cdot)$, for $j = 1, \ldots, n$, and define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.22}$$
where
$$f(X_i) = 0.01\left(X_{i,1} + 2X_{i,2} - X_{i,3} + 2\sqrt{X_{i,1} X_{i,3}} - \sqrt{X_{i,2} X_{i,4}}\right). \tag{6.23}$$
In our simulation, $F$ takes one of the following three distributions:

(1) EXP(1.6487): an exponential distribution with mean 1.6487, i.e., $F(x) = 1 - e^{-x/1.6487}$, $x \geq 0$;

(2) LN(0, 1): a log-normal distribution with parameters 0 and 1, i.e., $\ln \varepsilon_{i,j} \sim N(0, 1)$;

(3) PAR(3, 3.29): a Pareto distribution with $F(x) = 1 - \left(\frac{3.29}{x + 3.29}\right)^3$.

The three distributions above are selected to reflect different levels of heavy-tailedness, and their parameter values are set to yield the same expected value. The terms $2\sqrt{X_{i,1} X_{i,3}}$ and $\sqrt{X_{i,2} X_{i,4}}$ are included

in the expression of $f(X_i)$ to reflect nonlinear and interaction effects of the covariate variables on the claim distribution. The response variable $Y_{i,j}$ only depends on the first four covariates; the fifth variable is included as a noise variable to mimic the scenario in which some covariate variables included in the analysis are not influential for the claims $Y_{i,j}$.

In Scheme 1, the mean and the variance of individual risk $i$ are respectively given by
$$\mu(X_i, \Theta_i) = e^{f(X_i)} + \mathrm{E}[\varepsilon] \tag{6.24}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon] = \begin{cases} 2.7183, & \text{for EXP(1.6487)}, \\ 4.6708, & \text{for LN(0, 1)}, \\ 8.1181, & \text{for PAR(3, 3.29)}, \end{cases} \tag{6.25}$$
for $i = 1, \ldots, I$, where $\mathrm{E}[\varepsilon]$ and $\mathrm{Var}[\varepsilon]$, respectively, denote the mean and the variance of the distribution function $F$. As shown in (6.25), the variance of the claim random variable for each individual risk simulated from Scheme 1 remains the same across the collective and does not depend on covariates. To reflect the scenario where the variance of the claim random variable may differ across the collective of individual risks, we design the following scheme, which differs from Scheme 1 in the way the variables $\varepsilon_{i,j}$ are generated.

Scheme 2 For each $i = 1, \ldots, I$, independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i)$ which depends on the covariate vector $X_i$ of the individual risk $i$, for $j = 1, \ldots, n$, and define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.26}$$
where $f(X_i)$ is given by (6.23). We respectively consider three distinct distributions for $F(\cdot\,; X_i)$:
$$\text{(1) EXP}\left(e^{\gamma(X_i)/2}\right), \qquad \text{(2) LN}\left(0, \gamma(X_i)\right), \qquad \text{(3) PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \tag{6.27}$$
where $\gamma(x_i)$ is a deterministic positive function of the covariates $X_{i,1}$ and $X_{i,2}$, normalized so that $\mathrm{E}[\gamma(X_i)] = 1$, which is the value of the scale parameter of the log-normal distribution used in Scheme 1. The mean and the variance of individual risk $i$ are respectively given by

$$\mu(X_i, \Theta_i) = e^{f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] = \begin{cases} e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ e^{f(X_i)} + e^{\gamma(X_i)/2}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.28}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} e^{\gamma(X_i)}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \left(e^{\gamma(X_i)} - 1\right) e^{\gamma(X_i)}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ 3e^{\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.29}$$
for $i = 1, \ldots, I$.

Note that the claim distribution of each individual risk $i$ is fully determined by its covariate vector $X_i$ in simulation Schemes 1 and 2. This means that the risk profile is completely characterized by the observable covariates $X_i$ and there is no random effect on the claims distribution. In insurance practice, however, two insureds with exactly the same covariate values may have different risk profiles; it is thus reasonable to take into account a random effect component when modelling the claims distribution. In light of this observation, we design simulation Schemes 3 and 4.

Scheme 3 For each $i = 1, \ldots, I$,

(1) independently simulate the random effect variable $\Theta_i$ from the uniform distribution $U(0.9, 1.1)$;

(2) independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i)$ which depends on the covariates $X_i$ associated with risk $i$, for $j = 1, \ldots, n$; and

(3) define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = \Theta_i \left[ e^{f(X_i)} + \varepsilon_{i,j} \right], \quad j = 1, \ldots, n, \tag{6.30}$$
where $f(X_i)$ is given by (6.23). In our simulation studies, we consider each of the three distributions described in Scheme 2 for the distribution $F(\cdot\,; X_i)$ in step (2) of the present scheme.

In Scheme 3, a random variable $\Theta_i$ is multiplied by $\left[e^{f(X_i)} + \varepsilon_{i,j}\right]$, $j = 1, \ldots, n$, to generate the claims $Y_{i,j}$ and introduce a random effect into the claims distribution. The claim distributions of two risks with the same covariates are no longer the same in simulation Scheme 3, because the claim distribution depends on the random effect variable $\Theta_i$ in addition to the covariate vector $X_i$. This also means that the profile of an individual risk $i$ simulated from Scheme 3 can no longer be fully captured by its covariates $X_i$. Accordingly, for each individual risk $i = 1, 2, \ldots, I$, the mean and the variance of the risk are respectively given by

$$\mu(X_i, \Theta_i) = \Theta_i \left( e^{f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] \right) = \begin{cases} \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for LN}\left(0, \gamma(X_i)\right), \\ \Theta_i \left( e^{f(X_i)} + e^{\gamma(X_i)/2} \right), & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right), \end{cases} \tag{6.31}$$
and
$$\sigma^2(X_i, \Theta_i) = \Theta_i^2\, \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} \Theta_i^2\, e^{\gamma(X_i)}, & \text{for EXP}\left(e^{\gamma(X_i)/2}\right), \\ \Theta_i^2 \left(e^{\gamma(X_i)} - 1\right) e^{\gamma(X_i)}, & \text{for LN}\left(0, \gamma(X_i)\right), \\ 3\Theta_i^2\, e^{\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\gamma(X_i)/2}\right). \end{cases} \tag{6.32}$$
Note that the variance of the individual claim distribution depends on the random effect variable $\Theta_i$ in addition to the covariate vector $X_i$ in Scheme 3.

Scheme 3 contains a multiplicative random effect. We next consider a different way of incorporating a random effect term into the claim distribution, as described in Scheme 4.

Scheme 4 For each $i = 1, \ldots, I$, with $\Theta_i = (\xi_{i,1}, \xi_{i,2})^T$,

(1) independently simulate random effect variables $\xi_{i,1}$ and $\xi_{i,2}$ from the uniform distribution $U(0.9, 1.1)$;

(2) independently simulate $\varepsilon_{i,j}$ from a distribution function $F(\cdot\,; X_i, \xi_{i,2})$ which depends on the covariate vector $X_i$ and the random effect variable $\xi_{i,2}$, for $j = 1, \ldots, n$; and

(3) define the $n$ claims of individual risk $i$ as
$$Y_{i,j} = e^{\xi_{i,1} f(X_i)} + \varepsilon_{i,j}, \quad j = 1, \ldots, n, \tag{6.33}$$
where $f(X_i)$ is defined in (6.23). We respectively consider the three distributions defined in equation (6.27) of Scheme 2, with $\gamma(X_i)$ replaced by $\xi_{i,2}\, \gamma(X_i)$, so that the distribution of $\varepsilon_{i,j}$ depends on the random effect variable $\xi_{i,2}$ in addition to the covariate vector $X_i$.

For individual risks $i = 1, \ldots, I$ from Scheme 4, the mean and the variance of the individual risk are respectively given by
$$\mu(X_i, \Theta_i) = e^{\xi_{i,1} f(X_i)} + \mathrm{E}[\varepsilon_{i,1}] = \begin{cases} e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for EXP}\left(e^{\xi_{i,2}\gamma(X_i)/2}\right), \\ e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for LN}\left(0, \xi_{i,2}\gamma(X_i)\right), \\ e^{\xi_{i,1} f(X_i)} + e^{\xi_{i,2}\gamma(X_i)/2}, & \text{for PAR}\left(3, 2e^{\xi_{i,2}\gamma(X_i)/2}\right), \end{cases} \tag{6.34}$$
and
$$\sigma^2(X_i, \Theta_i) = \mathrm{Var}[\varepsilon_{i,1}] = \begin{cases} e^{\xi_{i,2}\gamma(X_i)}, & \text{for EXP}\left(e^{\xi_{i,2}\gamma(X_i)/2}\right), \\ \left(e^{\xi_{i,2}\gamma(X_i)} - 1\right) e^{\xi_{i,2}\gamma(X_i)}, & \text{for LN}\left(0, \xi_{i,2}\gamma(X_i)\right), \\ 3e^{\xi_{i,2}\gamma(X_i)}, & \text{for PAR}\left(3, 2e^{\xi_{i,2}\gamma(X_i)/2}\right). \end{cases} \tag{6.35}$$
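To make the schemes concrete, the R sketch below simulates one balanced sample in the spirit of Scheme 1 with exponential errors. The object names (X, eps, Y, mu) are ours, and the functional form of f follows the statement of (6.23) above, so the sketch should be read as an approximation of the authors' set-up rather than their exact code.

```r
# Sketch of a Scheme 1-type simulation: balanced claims, exponential errors.
set.seed(2016)
I <- 300; n <- 5
X <- matrix(sample(1:100, I * 5, replace = TRUE), nrow = I)   # covariates X_{i,1..5}
f <- 0.01 * (X[, 1] + 2 * X[, 2] - X[, 3] +
               2 * sqrt(X[, 1] * X[, 3]) - sqrt(X[, 2] * X[, 4]))
eps <- matrix(rexp(I * n, rate = 1 / 1.6487), nrow = I)       # EXP errors with mean 1.6487
Y   <- exp(f) + eps                                           # Y_{i,j} = e^{f(X_i)} + eps_{i,j}
mu  <- exp(f) + 1.6487                                        # true net premiums, cf. (6.24)

# To feed such a sample into the earlier list-based sketches:
# Y_list <- split(Y, row(Y));  w_list <- rep(list(rep(1, n)), I)
```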

For each of the four simulation schemes designed above and each of the three involved distributions (exponential, log-normal, and Pareto), we independently simulate 1,000 samples of the claims experience of $I = 300$ individual risks with $n = 5$, 10 and 20, respectively. For each simulated sample, we use the $L_2$ loss function (5.21) and the four credibility loss functions defined in (4.15), (4.16), (4.19) and (4.20) to grow and prune regression trees, and the best tree is selected using longitudinal cross-validation. In addition, we consider ad hoc covariate-dependent partitions, defined via the following notation:
$$\mathcal{P}(X_j) = \Big\{ \{i \in \mathcal{I} : X_{i,j} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j} > 50\} \Big\}, \quad j = 1, \ldots, 5,$$
and
$$\mathcal{P}(X_{j_1}, X_{j_2}) = \Big\{ \{i \in \mathcal{I} : X_{i,j_1} \leq 50,\, X_{i,j_2} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j_1} \leq 50,\, X_{i,j_2} > 50\},\ \{i \in \mathcal{I} : X_{i,j_1} > 50,\, X_{i,j_2} \leq 50\},\ \{i \in \mathcal{I} : X_{i,j_1} > 50,\, X_{i,j_2} > 50\} \Big\},$$
for $j_1, j_2 = 1, \ldots, 5$ with $j_1 \neq j_2$. $\mathcal{P}(X_j)$ represents an equal binary partition based on a single covariate, and $\mathcal{P}(X_{j_1}, X_{j_2})$ is one based on two covariates. Those based on three or four covariates are denoted by $\mathcal{P}(X_{j_1}, X_{j_2}, X_{j_3})$ and $\mathcal{P}(X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4})$ and can be defined in the same manner. In our simulations, $\mathcal{P}(X_2)$, $\mathcal{P}(X_4)$, $\mathcal{P}(X_1, X_2, X_3)$, $\mathcal{P}(X_1, X_2, X_4)$, $\mathcal{P}(X_2, X_3, X_4)$, $\mathcal{P}(X_1, X_3, X_4)$, and $\mathcal{P}(X_1, X_2, X_3, X_4)$ are considered.

6.2 Performance Measure: Prediction Error

To compare the prediction performance of the credibility premiums resulting from various (ad hoc or data-driven) partitioning methods, we compute the prediction error for a given partition $\{\mathcal{I}_1, \ldots, \mathcal{I}_K\}$ as
$$\mathrm{PE} = \frac{1}{I} \sum_{i=1}^{I} \sum_{k=1}^{K} I\{X_i \in A_k\} \left( \pi_i^{(H)(k)} - \mu(X_i, \Theta_i) \right)^2, \tag{6.36}$$
where the form of $\pi_i^{(H)(k)}$ is given in (4.18) and the net premium $\mu(X_i, \Theta_i)$ of each individual risk is given in (6.24), (6.28), (6.31) and (6.34) for Schemes 1-4, respectively. In addition, we consider the credibility premium $P_i^{(H)}$ in (2.4) without partitioning the collective. The resulting collective prediction error is computed as
$$\mathrm{PE}_0 = \frac{1}{I} \sum_{i=1}^{I} \left( P_i^{(H)} - \mu(X_i, \Theta_i) \right)^2,$$
which does not use any covariate information and is anticipated to underperform the various covariate-dependent partitions. We define the relative prediction error (RPE) for each covariate-dependent partition as the ratio of its prediction error to the collective prediction error and denote it by $R = \mathrm{PE}/\mathrm{PE}_0$. A smaller prediction error or relative prediction error indicates better performance.
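The performance measure (6.36) is immediate to compute once each risk has a predicted premium and a true net premium. A minimal sketch, with object names of our own choosing (e.g. the output of the hypothetical partition_premium helper above as pred):

```r
# Prediction error (6.36) and relative prediction error R = PE / PE0.
# 'pred' : premiums from a covariate-dependent partition
# 'pred0': collective credibility premiums without partitioning, cf. (2.4)
# 'mu'   : true individual net premiums mu(X_i, Theta_i)
prediction_error <- function(pred, mu) mean((pred - mu)^2)

# PE  <- prediction_error(pred,  mu)
# PE0 <- prediction_error(pred0, mu)
# RPE <- PE / PE0     # values below 1 favour the partition-based premium
```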

6.3 Simulation Results

The performance, in terms of RPE, of the various covariate-dependent partitioning methods under the Balanced Claims Model with $n = 5$ is summarized in Figure 2, which contains 12 plots and summarizes the results for the three distributions (exponential, log-normal, Pareto) and the four simulation schemes described in Section 6.1. The 12 boxplots in each plot respectively correspond to the four credibility regression tree premiums (highlighted in grey) using the credibility loss functions (4.15), (4.16), (4.19) and (4.20) ($R_1$-$R_4$), the $L_2$ regression tree premium, the partition $\mathcal{P}(X_2)$ ($R_{(2)}$), the partition $\mathcal{P}(X_4)$ ($R_{(4)}$), the partition $\mathcal{P}(X_1, X_2, X_3)$ ($R_{(123)}$), the partition $\mathcal{P}(X_1, X_2, X_4)$ ($R_{(124)}$), the partition $\mathcal{P}(X_2, X_3, X_4)$ ($R_{(234)}$), the partition $\mathcal{P}(X_1, X_3, X_4)$ ($R_{(134)}$) and the partition $\mathcal{P}(X_1, X_2, X_3, X_4)$ ($R_{(1234)}$). We have the following interesting observations from Figure 2.

1. Almost all the boxes in Figure 2 lie below the value of 1, with the exception of the boxplots resulting from the ad hoc partition using covariate $X_{i,4}$, labeled $R_{(4)}$. An RPE smaller than 1 indicates more accurate prediction compared to the credibility premium for which no partitioning is performed. Thus, Figure 2 indicates that all the covariate-dependent partitioning methods we considered, except $\mathcal{P}(X_4)$, enhance the prediction accuracy of the credibility premium. Revisiting the form of $f(X_i)$ in (6.23), one can see that $X_{i,4}$ only appears in an interaction term, so it may not be informative enough to be used on its own for enhancing prediction accuracy.

2. In each of the 12 plots, the first four boxplots, in grey, summarize the RPEs resulting from the 1,000 simulations using the four proposed credibility regression trees. The four boxplots are positioned at a similar level and take a similar shape across all 12 plots, which indicates that the four credibility losses perform almost equally well when used to build the regression tree credibility premium for predicting individual net premiums.

3. The four boxplots in grey are consistently positioned lower than those from the other prediction rules, and this holds across all the plots in Figure 2. This observation indicates that our regression tree credibility premium outperforms the others across all the simulation schemes and all the distribution types involved in our numerical simulation.

4. The $L_2$ regression tree performs consistently worse than the four credibility regression trees, but it performs either better than or as well as each of the seven ad hoc partitioning methods.

5. Comparing different columns in Figure 2, one can see that the relative prediction errors tend to be smaller for the Pareto distribution, whereas the exponential distribution tends to yield larger relative prediction errors.

Figure 2: Boxplots of RPEs for the Balanced Claims Model with $n = 5$, for three distributions (exponential, log-normal, Pareto) and four simulation schemes (Schemes 1-4), using various covariate-dependent partitioning methods.

As is well known, the Pareto distribution is commonly perceived as heavy-tailed, the exponential distribution is light-tailed, and the log-normal is in between. Therefore, such a comparison suggests that our regression tree based methods can improve relative prediction accuracy to a larger extent for a heavy-tailed claim distribution than for a light-tailed one.

6. The plots in the third and fourth rows contain results from simulation Schemes 3 and 4, respectively, where the individual net premium depends on unobservable random effect term(s). Compared with those in the first and second rows, they do not display an obvious shift in position or shape, which suggests that the presence of random effects does not deteriorate the prediction performance of the regression tree credibility premium.

7. The rightmost seven boxplots in each plot correspond to the ad hoc partitions $\mathcal{P}(X_2)$, $\mathcal{P}(X_4)$, $\mathcal{P}(X_1, X_2, X_3)$, $\mathcal{P}(X_1, X_2, X_4)$, $\mathcal{P}(X_2, X_3, X_4)$, $\mathcal{P}(X_1, X_3, X_4)$ and $\mathcal{P}(X_1, X_2, X_3, X_4)$. As one can see, these partitions are all based on the four influential covariate variables $X_{i,1}$, $X_{i,2}$, $X_{i,3}$ and $X_{i,4}$, and the noise covariate $X_{i,5}$ is not involved. In contrast, we did not preclude the covariate $X_{i,5}$ from the list of covariates by any ex ante variable selection procedure in the course of building the credibility regression trees and the $L_2$ regression tree. The boxplots in Figure 2, however, show that the regression tree based methods yield significantly smaller RPEs than those ad hoc partitioning methods, even though they are disadvantaged by the inclusion of the noise covariate $X_{i,5}$.

8. The relative prediction errors of partitions $\mathcal{P}(X_1, X_2, X_3)$ and $\mathcal{P}(X_1, X_2, X_4)$ are in general smaller than those of partitions $\mathcal{P}(X_2, X_3, X_4)$ and $\mathcal{P}(X_1, X_3, X_4)$, even though the number of sub-collectives is the same for the four partitioning methods. This suggests that it is not the number of sub-collectives into which the collective is partitioned that matters, but the form of the partition. A partition formed in an ad hoc manner does not guarantee a promising prediction result; thus, it is necessary to adopt data-driven partitioning algorithms, such as our regression tree based model, to develop accurate premium predictions.

We also examine the performance of the prediction rules by increasing the number of individual claims to $n = 10$ and $n = 20$; the resulting RPEs are reported in the boxplots shown in Figures 4 and 5 in Appendix B, respectively, for which all the comments made for $n = 5$ also apply. We further explore how the prediction accuracy of these 13 prediction rules changes as the claims experience of individual risks, captured by the parameter $n$ in our simulation schemes, increases. To this end, we compute the average (absolute) prediction error over the 1,000 simulations for each combination of a value of $n$ (5, 10, or 20), a distribution (exponential, log-normal, or Pareto) and one of the four simulation schemes defined previously. Table 1 reports the results for the base prediction rule ($\mathrm{PE}_0$), the CRT premium using credibility loss function (4.19) ($\mathrm{PE}_3$), the $L_2$ regression tree


More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Growing a Large Tree

Growing a Large Tree STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

More information

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks

Tree-based methods. Patrick Breheny. December 4. Recursive partitioning Bias-variance tradeoff Example Further remarks Tree-based methods Patrick Breheny December 4 Patrick Breheny STA 621: Nonparametric Statistics 1/36 Introduction Trees Algorithm We ve seen that local methods and splines both operate locally either by

More information

A Practitioner s Guide to Generalized Linear Models

A Practitioner s Guide to Generalized Linear Models A Practitioners Guide to Generalized Linear Models Background The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically

More information

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes: Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation

More information

Show that the following problems are NP-complete

Show that the following problems are NP-complete Show that the following problems are NP-complete April 7, 2018 Below is a list of 30 exercises in which you are asked to prove that some problem is NP-complete. The goal is to better understand the theory

More information

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees arxiv:1511.07178v1 [stat.me] 23 Nov 2015 Detection of Uniform and Non-Uniform Differential Functioning by Focussed Trees Moritz Berger & Gerhard Tutz Ludwig-Maximilians-Universität München Akademiestraße

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES

ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES FILIPPO SIEGENTHALER / filippo78@bluewin.ch 1 ONE-YEAR AND TOTAL RUN-OFF RESERVE RISK ESTIMATORS BASED ON HISTORICAL ULTIMATE ESTIMATES ABSTRACT In this contribution we present closed-form formulas in

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet. Please

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk Causal Inference in Observational Studies with Non-Binary reatments Statistics Section, Imperial College London Joint work with Shandong Zhao and Kosuke Imai Cass Business School, October 2013 Outline

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.

Random Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes. Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

More information

ABC random forest for parameter estimation. Jean-Michel Marin

ABC random forest for parameter estimation. Jean-Michel Marin ABC random forest for parameter estimation Jean-Michel Marin Université de Montpellier Institut Montpelliérain Alexander Grothendieck (IMAG) Institut de Biologie Computationnelle (IBC) Labex Numev! joint

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

Generalized linear mixed models (GLMMs) for dependent compound risk models

Generalized linear mixed models (GLMMs) for dependent compound risk models Generalized linear mixed models (GLMMs) for dependent compound risk models Emiliano A. Valdez joint work with H. Jeong, J. Ahn and S. Park University of Connecticut 52nd Actuarial Research Conference Georgia

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Bayesian Learning. Bayesian Learning Criteria

Bayesian Learning. Bayesian Learning Criteria Bayesian Learning In Bayesian learning, we are interested in the probability of a hypothesis h given the dataset D. By Bayes theorem: P (h D) = P (D h)p (h) P (D) Other useful formulas to remember are:

More information

Mathematical models in regression credibility theory

Mathematical models in regression credibility theory BULETINUL ACADEMIEI DE ŞTIINŢE A REPUBLICII MOLDOVA. MATEMATICA Number 3(58), 2008, Pages 18 33 ISSN 1024 7696 Mathematical models in regression credibility theory Virginia Atanasiu Abstract. In this paper

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression: Biost 518 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture utline Choice of Model Alternative Models Effect of data driven selection of

More information

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 42, NO 6, JUNE 1997 771 Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach Xiangbo Feng, Kenneth A Loparo, Senior Member, IEEE,

More information

Credibility Theory for Generalized Linear and Mixed Models

Credibility Theory for Generalized Linear and Mixed Models Credibility Theory for Generalized Linear and Mixed Models José Garrido and Jun Zhou Concordia University, Montreal, Canada E-mail: garrido@mathstat.concordia.ca, junzhou@mathstat.concordia.ca August 2006

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Performance of Cross Validation in Tree-Based Models

Performance of Cross Validation in Tree-Based Models Performance of Cross Validation in Tree-Based Models Seoung Bum Kim, Xiaoming Huo, Kwok-Leung Tsui School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia 30332 {sbkim,xiaoming,ktsui}@isye.gatech.edu

More information

Notation. Pattern Recognition II. Michal Haindl. Outline - PR Basic Concepts. Pattern Recognition Notions

Notation. Pattern Recognition II. Michal Haindl. Outline - PR Basic Concepts. Pattern Recognition Notions Notation S pattern space X feature vector X = [x 1,...,x l ] l = dim{x} number of features X feature space K number of classes ω i class indicator Ω = {ω 1,...,ω K } g(x) discriminant function H decision

More information

Classification Using Decision Trees

Classification Using Decision Trees Classification Using Decision Trees 1. Introduction Data mining term is mainly used for the specific set of six activities namely Classification, Estimation, Prediction, Affinity grouping or Association

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data

Online Supplementary Material. MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data Online Supplementary Material MetaLP: A Nonparametric Distributed Learning Framework for Small and Big Data PI : Subhadeep Mukhopadhyay Department of Statistics, Temple University Philadelphia, Pennsylvania,

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo Workshop on Heterogeneous Computing, 16-20, July 2012 No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo K. P. N. Murthy School of Physics, University of Hyderabad July 19, 2012 K P N Murthy

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

Regression Clustering

Regression Clustering Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form

More information

LECTURE NOTE #3 PROF. ALAN YUILLE

LECTURE NOTE #3 PROF. ALAN YUILLE LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Representation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2

Representation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2 Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 32 Learning a generative model We are given a training

More information

Course 4 Solutions November 2001 Exams

Course 4 Solutions November 2001 Exams Course 4 Solutions November 001 Exams November, 001 Society of Actuaries Question #1 From the Yule-Walker equations: ρ φ + ρφ 1 1 1. 1 1+ ρ ρφ φ Substituting the given quantities yields: 0.53 φ + 0.53φ

More information

Announcements Kevin Jamieson

Announcements Kevin Jamieson Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

1 Degree distributions and data

1 Degree distributions and data 1 Degree distributions and data A great deal of effort is often spent trying to identify what functional form best describes the degree distribution of a network, particularly the upper tail of that distribution.

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Generalization Error on Pruning Decision Trees

Generalization Error on Pruning Decision Trees Generalization Error on Pruning Decision Trees Ryan R. Rosario Computer Science 269 Fall 2010 A decision tree is a predictive model that can be used for either classification or regression [3]. Decision

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation Vu Malbasa and Slobodan Vucetic Abstract Resource-constrained data mining introduces many constraints when learning from

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information