A Practitioner's Guide to Generalized Linear Models


Background

The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically efficient than iterative methods, and provide statistical diagnostics that aid in variable selection. Today, they're the industry standard for private passenger auto (PPA) and other personal lines pricing. Their primary applications are in ratemaking and underwriting, although there's been increased use for target marketing analysis.

The failings of one-way analysis

A one-way analysis summarizes statistics for each explanatory variable, but doesn't take into account the effect of other variables. These analyses can be distorted by correlations between rating factors. Traditional techniques attempt to standardize the data to remove the distorting effect caused by the correlations, but they are only approximations. One-way analyses also don't consider interdependencies between factors in the way they affect claims experience. Such an interdependency (an interaction) exists when the effect of one factor varies depending on the level of another factor. Multivariate methods (like GLMs) adjust for correlations, and allow us to investigate interaction effects.

The failings of minimum bias procedures

Minimum bias procedures impose a set of equations relating the observed data, the rating variables, and a set of parameters to be determined. An iterative procedure is used to converge to the optimal solution. However, once this solution is found, there's no systematic way to test whether a variable influences the result with statistical significance. This type of procedure lacks a statistical framework which would allow us to better assess the quality of the modeling.
The connection of minimum bias to GLM

Minimum Bias Procedure | Corresponding GLM Link Function | Corresponding GLM Error Function
Multiplicative Balance Principle | Logarithmic | Poisson
Additive Balance Principle | Identity | Normal
Multiplicative Least Squares | Logarithmic | Normal
Multiplicative Maximum Likelihood with exponential density function | Logarithmic | Gamma
Multiplicative Maximum Likelihood with Normal density function | Logarithmic | Normal
Additive Maximum Likelihood with Normal density function | Identity | Normal

The chi-squared additive and multiplicative minimum bias models have no corresponding GLM analog.

Linear models

The purpose of linear models and GLMs is to express the relationship between an observed response variable (Y) and a number of covariates (X). Linear models state that Y is the sum of its mean and a random variable ($Y = \mu + \varepsilon$). We assume that the expected value of Y (μ) can be written as a linear combination of the covariates, and that the error term (ε) is normally distributed with mean zero and variance $\sigma^2$.

Example: Suppose Y is the average claim severity, and that there are two factors, territory and sex, resulting in four covariates: male ($X_1$), female ($X_2$), urban ($X_3$), and rural ($X_4$). The linear model expresses the observed item (Y) as a linear combination of a specified selection of the four variables, plus an error term that's normally distributed as above. One model is

$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$.

Since there is a linear dependency between the four covariates ($X_1 + X_2 = X_3 + X_4 = 1$), the model in this form is not uniquely defined. To fix this, we instead consider

$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$.

This implies that there's an average response for men ($\beta_1$), and an average response for women ($\beta_2$). Being an urban driver has an additional additive effect ($\beta_3$) regardless of gender.
We could also think of this as a model which assumes an average response for the base case of women in rural areas, with additional additive effects for being male and for being in an urban area. If we have 4 observations (one per combination of gender and territory), we can express them as a system of four equations. For the classical linear model, we minimize the sum of the squared errors to solve for the three parameters. If the system expands greatly, vectors and matrices may be used in place of the system of equations. Solving the matrix algebra will yield the same factors as the system of equations.
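As a minimal sketch of the least-squares step described above, the following solves the normal equations $X^T X \beta = X^T Y$ for the three-parameter model. The four severity values are illustrative assumptions, and the helper names (`solve_linear_system`, `least_squares`) are hypothetical rather than taken from any library.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(x, y):
    """Minimize the sum of squared errors via the normal equations."""
    p = len(x[0])
    xtx = [[sum(row[j] * row[k] for row in x) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(x, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


# Columns: male, female, urban. Rows: male/urban, male/rural, female/urban, female/rural.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
Y = [800.0, 500.0, 400.0, 200.0]  # assumed average severities, for illustration only

beta = least_squares(X, Y)
fitted = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
print("beta:", beta)
print("fitted:", fitted)
```

With these assumed values the additive model cannot reproduce all four cells exactly, since there are four observations and only three parameters; the fitted values differ from the observations by equal and opposite residuals.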

In conclusion, the basic ingredients for a linear model consist of (1) a set of assumptions about the relationship between the observed values and the predictor variables, and (2) an objective function which is to be optimized in order to solve the problem.

Classic linear model assumptions

$Y = E[Y] + \varepsilon$, with $E[Y] = X\beta$

Random Component: Each component of Y is independent and is normally distributed. The mean of each component may differ, but they all have a common variance.
Systematic Component: The covariates are combined to give the linear predictor η; $\eta = X\beta$.
Link Function: The relationship between the random and systematic components is specified via a link function. In the linear model, the link function is equal to the identity function.

These assumptions are not easy to guarantee. It's hard to assume normality and constant variance for response variables. Linear regression attempts to transform the data so these requirements are met, but there's no reason why such a transformation should exist. Also, the values for the response variable may be restricted to be positive, in which case the assumption of normality is violated. If the response variable is strictly non-negative, then the variance is a function of the mean, and tends to zero with the mean. Additivity is also not realistic for many applications. Many insurance risks tend to vary multiplicatively with the rating factors, not additively.

Generalized linear model assumptions

Random Component: Each component of Y is independent and is from one of the exponential family of distributions.
Systematic Component: The covariates are combined to give the linear predictor η; $\eta = X\beta$.
Link Function: The relationship between the random and systematic components is specified via a link function, g, that is differentiable and monotonic, such that $E[Y] \equiv \mu = g^{-1}(\eta)$.

A member of the exponential family of distributions has 2 properties:
1. The distribution is completely specified in terms of its mean and variance.
2. The variance is a function of its mean.

The 2nd property can be seen by writing the variance as

$\mathrm{Var}[Y_i] = \frac{\phi\, V(\mu_i)}{\omega_i}$.

Distribution | V(x)
Normal | 1
Poisson | x
Gamma | x^2
Binomial | x(1 - x)
Inverse Gaussian | x^3

Here, V(x) is the variance function and is a specified function. The parameter φ scales the variance, and $\omega_i$ is a constant that assigns weights to individual observations. In addition to the distributions shown here, the Tweedie distribution is also a member of the exponential family. This distribution has a point mass at zero, and a variance function proportional to $\mu^p$. It is used to model pure premium data directly.

The choice of the variance function affects the results of the GLM. Two examples are shown here.

[Figure: fitted values from three GLMs (Normal, Poisson, and Gamma variance functions) fitted to three data points]

The first example shows the result of fitting three different GLMs to three data points. As shown in the graph, the GLM with a Normal variance function produced fitted values which are attracted to the original data points with equal weight. With a Poisson error, the GLM assumes that the variance increases with the expected value of each observation. Observations with smaller expected values have a smaller assumed variance. The model produces fitted values which are more influenced by the points on the left than by the points on the right. With the Gamma variance function, the GLM is even more strongly influenced by the point on the left, since the model assumes the variance increases with the square of the expected value.
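The variance structure $\mathrm{Var}[Y_i] = \phi V(\mu_i)/\omega_i$ can be sketched in a few lines. This is only an illustration: the dictionary name `VARIANCE_FUNCTIONS`, the function `observation_variance`, and the example means are assumptions, not part of any GLM package.

```python
# Variance functions V(mu) for the distributions listed in the table above.
VARIANCE_FUNCTIONS = {
    "normal": lambda mu: 1.0,
    "poisson": lambda mu: mu,
    "gamma": lambda mu: mu ** 2,
    "binomial": lambda mu: mu * (1.0 - mu),
    "inverse_gaussian": lambda mu: mu ** 3,
}


def observation_variance(family, mu, phi=1.0, weight=1.0):
    """Var[Y_i] = phi * V(mu_i) / omega_i."""
    return phi * VARIANCE_FUNCTIONS[family](mu) / weight


# Assumed variance at three illustrative expected values. The larger the
# assumed variance at a point, the less that point influences the fit.
for family in ("normal", "poisson", "gamma"):
    variances = [observation_variance(family, mu) for mu in (1.0, 2.0, 4.0)]
    print(family, variances)
```

The output shows why the Gamma fit is pulled hardest toward observations with small expected values: its assumed variance grows fastest as the mean increases.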

A more realistic example is listed here. This consists of an artificially generated dataset representing an insurance portfolio. Claims experience is randomly generated using a gamma distribution, then analyzed using three models to see how closely the results of each model relate to the true factor effect. Here we know the true effect of the rating factors; in practice, we would not.

Three methods are used:
1. One-way analysis
2. GLM with Normal variance function
3. GLM with Gamma variance function

Because of the correlations between the rating factors in the data, the one-way analysis is extremely distorted. The GLM with the assumed Normal variance function is close to the correct relativities, but the GLM with the Gamma variance function yields results closest to the true effect.

In addition to the variance function, two other parameters define the variance of each observation: the scale parameter φ and the prior weights $\omega_i$. The prior weights allow information about the known credibility of each observation to be incorporated in the model. Observations with higher exposure are deemed to have lower variance, and the model will be more influenced by these observations.

Let cell i denote a cell defined by a classification system, and let
$m_{ik}$ be the number of claims arising from the k-th unit of exposure in cell i,
$\omega_i$ be the number of exposures in cell i,
$Y_i$ be the observed claim frequency in cell i:

$Y_i = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} m_{ik}$.

If $m_{ik}$ is Poisson with frequency $f_i$ for all exposures, then $E[m_{ik}] = f_i = \mathrm{Var}[m_{ik}]$. Assuming that the exposures are independent,

$\mu_i = f_i$ and $\mathrm{Var}[Y_i] = \mu_i \cdot \frac{1}{\omega_i}$.

So, in this case, $V(\mu_i) = \mu_i$, $\phi = 1$, and the prior weights are the exposures.

An alternative example: Let
$z_{ik}$ be the claim size of the k-th claim in cell i,
$\omega_i$ be the number of claims in the cell,
$Y_i$ be the observed mean claim size in cell i:

$Y_i = \frac{1}{\omega_i} \sum_{k=1}^{\omega_i} z_{ik}$.

Assume that the random process generating each individual claim is gamma distributed, and that each claim is independent:

$E[z_{ik}] = m_i$, $\mathrm{Var}[z_{ik}] = \sigma^2 m_i^2$, so that $\mu_i = m_i$ and $\mathrm{Var}[Y_i] = \frac{\sigma^2 \mu_i^2}{\omega_i}$.

So, in this case, the variance of $Y_i$ follows the general form for all exponential family distributions, with $V(\mu_i) = \mu_i^2$, $\phi = \sigma^2$, and prior weight equal to the number of claims in the cell.

Prior weights can also be used to attach a lower credibility to a part of the data which is known to be less reliable.

In some cases, the scale parameter is equal to 1, and falls out of the analysis. In general, this is not true, and the scale parameter must be estimated from the data. This is not actually necessary in order to solve for the GLM parameters, but it is necessary in order to calculate certain statistics (like the standard error). φ can be treated as another parameter, and estimated by maximum likelihood. This is mathematically difficult, so in practice a simpler estimator of φ is used in its place.
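For readers who want the intermediate step, the variance of the mean claim size in the gamma example can be worked out as follows. This is a sketch that uses only the independence and moment assumptions stated above.

```latex
\begin{aligned}
\mathrm{Var}[Y_i]
  &= \mathrm{Var}\!\left[\frac{1}{\omega_i}\sum_{k=1}^{\omega_i} z_{ik}\right]
   = \frac{1}{\omega_i^{2}}\sum_{k=1}^{\omega_i}\mathrm{Var}[z_{ik}]
   && \text{(claims are independent)} \\
  &= \frac{1}{\omega_i^{2}}\,\omega_i\,\sigma^{2} m_i^{2}
   = \frac{\sigma^{2}\mu_i^{2}}{\omega_i}
   && \text{(since } \mu_i = m_i\text{)}
\end{aligned}
```

Matching this against $\mathrm{Var}[Y_i] = \phi V(\mu_i)/\omega_i$ gives $V(\mu_i) = \mu_i^2$ and $\phi = \sigma^2$. The same argument with $\mathrm{Var}[m_{ik}] = f_i$ gives the Poisson frequency result $\mathrm{Var}[Y_i] = \mu_i/\omega_i$.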

Estimates of φ:

1. The moment estimator: $\hat{\phi} = \frac{1}{n-p} \sum_i \frac{\omega_i (Y_i - \mu_i)^2}{V(\mu_i)}$
2. The total deviance estimator: $\hat{\phi} = \frac{D}{n-p}$

where n is the number of observations, p is the number of parameters, and D is the total deviance.

In practice, we sometimes attempt to transform data to satisfy the requirements of Normality, constant variance, and additivity of effects. GLMs merely require that there be a link function that guarantees the condition of additivity. Classical linear models require that Y be additive in the covariates; GLMs require that some transformation of Y be additive in the covariates. In theory, a different link function could be used for each observation i, but this is impractical and rarely done.

The link function must be differentiable and monotonic (either strictly increasing or strictly decreasing). Typical choices are:

Link | g(x) | g^{-1}(x)
Identity | x | x
Log | ln(x) | e^x
Logit | ln(x / (1 - x)) | e^x / (1 + e^x)
Reciprocal | 1/x | 1/x

The log link function is appealing because the effects of the covariates are multiplicative. When a log link function is used, the GLM estimates logs of multiplicative effects. Choices of link functions and error functions can yield GLMs which are equivalent to a number of minimum bias models, as well as a simple linear model.

When the effect of an explanatory variable is known, it's appropriate to include information about this variable in the model as a known effect, by introducing an offset term ξ into the definition of the linear predictor, giving $\eta = X\beta + \xi$. This gives us:

$E[Y] = \mu = g^{-1}(\eta) = g^{-1}(X\beta + \xi)$

An example of this is when we're fitting a GLM to the claim count (as opposed to frequency). Since we assume that the expected count of claims increases in proportion to the exposure of an observation, we should incorporate this information in the GLM. We set the offset term to be equal to the log of the exposure of each observation.
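The link functions and the offset term can be sketched as below. The parameter values (a base frequency of 0.10 and an urban relativity of 1.5) and the exposure of 20 are made-up assumptions, and `LINKS` and `expected_value` are hypothetical names used only for this illustration.

```python
import math

# Each entry is (g, g_inverse), following the table of typical link functions.
LINKS = {
    "identity": (lambda x: x, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda x: math.log(x / (1.0 - x)),
              lambda eta: math.exp(eta) / (1.0 + math.exp(eta))),
    "reciprocal": (lambda x: 1.0 / x, lambda eta: 1.0 / eta),
}


def expected_value(link, x_row, beta, offset=0.0):
    """mu = g^{-1}(X.beta + xi) for a single observation."""
    eta = sum(b * x for b, x in zip(beta, x_row)) + offset
    _, inverse = LINKS[link]
    return inverse(eta)


# Assumed log-scale parameters: intercept (base frequency) and urban effect.
beta = [math.log(0.10), math.log(1.5)]
exposure = 20.0  # assumed exposure for one urban observation

frequency = expected_value("log", [1, 1], beta)
expected_count = expected_value("log", [1, 1], beta, offset=math.log(exposure))
print("expected frequency:", frequency)
print("expected claim count:", expected_count)
```

Because the link is logarithmic, the offset $\ln(\text{exposure})$ turns into a multiplicative factor: the expected count equals the exposure times the expected frequency.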
In the case of the Poisson multiplicative GLM, modeling claim counts with an offset term equal to the log of the exposure produces identical results to modeling claim frequencies with no offset term, but with prior weights set equal to the exposure of each observation.

Typical GLM forms

Y | Claim Frequencies | Claim Counts | Average Claim Size | Probability
Link Function g(x) | ln(x) | ln(x) | ln(x) | ln(x / (1 - x))
Error | Poisson | Poisson | Gamma | Binomial
Scale Parameter φ | 1 | 1 | Estimated | 1
Variance Function V(x) | x | x | x^2 | x(1 - x)
Prior Weights ω | Exposure | 1 | # Claims | 1
Offset ξ | 0 | ln(exposure) | 0 | 0

Appeal: the Poisson error structure is invariant to measures of time, and the Gamma error structure is invariant to measures of currency.

GLM maximum likelihood estimators

Once the model is defined, the parameters are derived by maximizing the likelihood function, that is, by finding the parameters which produce the observed data with the highest probability. In simple examples, the procedure for maximizing likelihood involves finding the solution to a system of equations with linear algebra. In practice, numerical techniques are used due to the large number of observations.
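The equivalence between the count model with an offset and the frequency model with prior weights can be checked numerically. The sketch below fits a Poisson log-link GLM by iteratively reweighted least squares (a form of Newton-Raphson iteration), started from zero. The exposures and claim counts are invented for illustration, and `solve` and `fit_poisson_log` are hypothetical helpers, not a library API.

```python
import math


def solve(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_poisson_log(x, y, weights, offsets, iterations=50, tol=1e-12):
    """Poisson error, log link, prior weights and offsets, fitted by IRLS."""
    p = len(x[0])
    beta = [0.0] * p  # start the iteration from zero
    for _ in range(iterations):
        eta = [sum(b * v for b, v in zip(beta, row)) + off
               for row, off in zip(x, offsets)]
        mu = [math.exp(e) for e in eta]
        # Working response (net of the offset) and working weights.
        z = [e - off + (yi - m) / m for e, off, yi, m in zip(eta, offsets, y, mu)]
        w = [wi * m for wi, m in zip(weights, mu)]
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, x)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, x, z)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)
        converged = max(abs(n - o) for n, o in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Columns: intercept, male, urban (assumed design for four cells).
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0]]
exposure = [100.0, 200.0, 150.0, 50.0]  # assumed
claims = [12.0, 14.0, 9.0, 2.0]         # assumed

beta_counts = fit_poisson_log(X, claims, [1.0] * 4, [math.log(e) for e in exposure])
beta_freq = fit_poisson_log(X, [c / e for c, e in zip(claims, exposure)],
                            exposure, [0.0] * 4)
print("counts with offset:    ", beta_counts)
print("frequency with weights:", beta_freq)
```

Both calls produce the same working weights and working responses at every iteration, which is why the two parameter vectors agree.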


Solving simple examples

The general procedure used in the two handwritten examples is:
1. Specify the design matrix X and the vector of parameters β
2. Choose the error structure and the link function
3. Identify the likelihood function
4. Take the logarithm to convert the product of many terms into a sum
5. Maximize the logarithm of the likelihood function by taking partial derivatives with respect to each parameter, setting them to zero, and solving the resulting system of equations
6. Compute the predicted values

Solving for large datasets using numerical techniques

In insurance modeling, it's not practical to use the above techniques; instead, we use iterative numerical techniques, such as Newton-Raphson iteration. Iterative processes can be started either with a value of zero for each parameter, or with the estimates implied by a one-way analysis or by another previously fitted GLM.

Base levels and the intercept term

In practice, when considering many factors each with many levels, it's helpful to parameterize the GLM by including an intercept term, which is a parameter that applies to all observations. In our example, this is done by redefining $\beta_1$ as the intercept term in the design matrix, and having only one parameter relating to gender. When considering categorical factors and an intercept term, one level of each factor should have no parameter associated with it, so that the model remains uniquely defined. If a model were structured with an intercept term, but without each factor having a base level, then the GLM solving routine would remove as many parameters as necessary to make the model uniquely defined. This process is called aliasing.

Aliasing

Aliasing occurs when there's a linear dependency among the observed covariates. There are two types: intrinsic and extrinsic. Intrinsic aliasing occurs because of dependencies inherent in the definition of the covariates.
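To make the intrinsic dependency concrete, the sketch below computes the rank of the design matrix from the gender and territory example, with and without the fourth covariate. The `matrix_rank` helper is a hypothetical stand-in for what GLM software does internally when it detects aliased parameters.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix by Gaussian elimination."""
    m = [[float(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a usable pivot.
        pivot = next((r for r in range(rank, len(m)) if abs(m[r][col]) > tol), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Columns: male, female, urban, rural (one row per combination of levels).
full_design = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
# Dropping "rural" leaves the three-parameter model used earlier.
reduced_design = [row[:3] for row in full_design]

print("columns:", len(full_design[0]), "rank:", matrix_rank(full_design))
print("columns:", len(reduced_design[0]), "rank:", matrix_rank(reduced_design))
```

The four-column matrix has rank 3 because male + female = urban + rural for every record, so one parameter must be aliased. The three-column matrix has full column rank, so the model is uniquely defined.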
Intrinsic aliasing arises most commonly whenever categorical factors are included in the model. GLM software will remove parameters which are aliased. The choice of which parameter to alias does not affect the fitted values.

Extrinsic aliasing also arises from a dependency among the covariates, but occurs when the dependency results from the nature of the data, rather than from inherent properties of the covariates themselves. It arises if one level of a particular factor is perfectly correlated with a level of another factor.

When modeling in practice, a common problem occurs when two or more factors contain levels that are almost, but not quite, perfectly correlated. When levels of two factors are nearly aliased in this way, convergence problems can occur, or the GLM will give results that appear very confusing. To deal with this, examine two-way tables of exposure and claim counts for the factors containing the nonsensical parameter estimates. Then, identify the factor combinations which cause the near aliasing. The issue can be resolved by deleting or excluding records, or by reclassifying the records into another factor level.

Model diagnostics

GLMs can also produce additional information indicating the certainty of the parameter estimates. The multivariate version of the Cramer-Rao lower bound can define standard errors for each parameter estimate. Standard errors can be thought of as indicators of the speed with which the log-likelihood falls from its maximum, given a change in a parameter. It's assumed that parameter estimates are asymptotically normally distributed, so it's possible to use a statistical test on individual parameter estimates, comparing each estimate with zero using a $\chi^2$ test: the square of the parameter estimate divided by its variance is compared to a $\chi^2$ distribution with one degree of freedom. This compares the parameter with the base level of the factor. We can repeatedly change the base level, and construct a triangle of tests, comparing every pair of estimates.
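The single-parameter test described above can be sketched as follows. The estimate and standard error are invented numbers, `wald_chi_square` is a hypothetical helper name, and the p-value uses the identity that a $\chi^2$ variable with one degree of freedom is the square of a standard normal variable.

```python
import math


def wald_chi_square(estimate, standard_error):
    """Return (statistic, p_value) for testing a parameter against zero.

    The statistic is estimate^2 / variance. For one degree of freedom,
    P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    statistic = (estimate / standard_error) ** 2
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Assumed output from a fitted GLM: log-relativity of one level vs. the base level.
statistic, p_value = wald_chi_square(estimate=0.196, standard_error=0.1)
print("chi-square statistic:", statistic)
print("p-value:", p_value)
```

Repeating the call with each level in turn as the base level gives the triangle of pairwise tests.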
If none of the pairwise differences is significant, it's a good bet that the factor is not significant either.

Measures of deviance can be used to assess the theoretical significance of a particular factor. Deviance is a measure of how much the fitted values differ from the observations. Define the deviance function:

$d(Y_i; \mu_i) = 2\omega_i \int_{\mu_i}^{Y_i} \frac{Y_i - \zeta}{V(\zeta)}\, d\zeta$
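As a rough check on the definition, the sketch below evaluates the deviance function by numerical integration (Simpson's rule) and compares it with the closed forms obtained by carrying out the integral for the Normal, Poisson, and Gamma variance functions. The observation and fitted value are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def deviance_numeric(y, mu, variance, weight=1.0, steps=1000):
    """2 * w * integral from mu to y of (y - z) / V(z) dz, by Simpson's rule."""
    h = (y - mu) / steps  # steps must be even for Simpson's rule
    f = lambda z: (y - z) / variance(z)
    total = f(mu) + f(y)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(mu + k * h)
    return 2.0 * weight * total * h / 3.0


# Closed forms of the same integral for three variance functions.
def deviance_normal(y, mu, weight=1.0):
    return weight * (y - mu) ** 2


def deviance_poisson(y, mu, weight=1.0):
    return 2.0 * weight * (y * math.log(y / mu) - (y - mu))


def deviance_gamma(y, mu, weight=1.0):
    return 2.0 * weight * ((y - mu) / mu - math.log(y / mu))


y_obs, mu_fit = 3.0, 2.0  # assumed observation and fitted value
checks = [
    ("normal", deviance_normal, lambda z: 1.0),
    ("poisson", deviance_poisson, lambda z: z),
    ("gamma", deviance_gamma, lambda z: z ** 2),
]
for name, closed_form, variance in checks:
    print(name, closed_form(y_obs, mu_fit), deviance_numeric(y_obs, mu_fit, variance))
```

With the Normal variance function the deviance reduces to the weighted squared error, which is the sense in which deviance generalizes the squared error.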

Since V(x) is strictly positive, the deviance function is non-negative (it is zero only when $Y_i = \mu_i$), and satisfies the condition for being a distance function. This function is a measurement of the difference between the fitted and the actual observations which gives more weight to the difference between $Y_i$ and $\mu_i$ when the variance function is small. The deviance function can be thought of as a generalized form of the squared error.

Summing the deviance function across all observations gives an overall measure of deviance, called the total deviance:

$D = \sum_{i=1}^{n} 2\omega_i \int_{\mu_i}^{Y_i} \frac{Y_i - \zeta}{V(\zeta)}\, d\zeta$

We can divide this by the scale parameter to get the scaled deviance, which is a generalized form of the sum of squared errors, adjusting for the shape of the distribution:

$D^* = \sum_{i=1}^{n} \frac{2\omega_i}{\phi} \int_{\mu_i}^{Y_i} \frac{Y_i - \zeta}{V(\zeta)}\, d\zeta$

For the class of exponential family distributions, this is equal to twice the difference between the maximum achievable log-likelihood and the log-likelihood of the model.

One useful test considers the ratio of the likelihoods of two nested models. Nested models refer to the situation where one model contains explanatory variables which are a subset of the explanatory variables in a second model. The change in scaled deviance between two nested models is (asymptotically) a sample from a $\chi^2$ distribution with degrees of freedom equal to the difference in degrees of freedom between the two models. Degrees of freedom for a model are the number of observations minus the number of parameters. This allows us to test the significance of the parameters that differ between the two models. It measures whether the inclusion of an explanatory factor in a model improves the model enough, given the extra parameters which it adds to the model.

The $\chi^2$ tests depend on the scaled deviance. For some distributions, the scale parameter is not known, and must be estimated. If the scale parameter used isn't accurate, the reliability of this test is decreased.
After adjusting for degrees of freedom and the (true) scale parameter, the estimate of the scale parameter is also distributed with a $\chi^2$ distribution. The ratio of the change in deviance to the adjusted estimate of the scale is therefore distributed with an F-distribution. The F-test is suitable for use when the scale parameter is not known. When the scale is known, the F-test offers no advantage over the $\chi^2$ test.
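The two nested-model tests can be sketched as a small helper that returns the test statistic and its degrees of freedom. This is an assumption-laden illustration: the deviances and degrees of freedom are invented, the function name is hypothetical, and the scale for the F-test is estimated as the deviance of the larger model divided by its degrees of freedom, which is one common choice. Critical values would still need to be looked up in $\chi^2$ or F tables or computed with a statistics package.

```python
def nested_model_test(dev_small, df_small, dev_large, df_large, scale=None):
    """Compare a smaller model (fewer parameters) with a larger nested model.

    dev_* are unscaled total deviances; df_* are observations minus parameters.
    If the scale parameter is known, return the chi-square statistic,
    otherwise return the F statistic using an estimated scale.
    """
    delta_dev = dev_small - dev_large
    delta_df = df_small - df_large
    if scale is not None:
        # Change in scaled deviance ~ chi-square with delta_df degrees of freedom.
        return {"test": "chi-square", "statistic": delta_dev / scale,
                "df": (delta_df,)}
    scale_estimate = dev_large / df_large
    # (D1 - D2) / ((df1 - df2) * D2 / df2) ~ F with (df1 - df2, df2) d.f.
    return {"test": "F", "statistic": delta_dev / (delta_df * scale_estimate),
            "df": (delta_df, df_large)}


# Assumed results: adding a 7-level factor (6 parameters) to a model.
known_scale = nested_model_test(1250.0, 996, 1200.0, 990, scale=1.0)
unknown_scale = nested_model_test(1250.0, 996, 1200.0, 990)
print(known_scale)
print(unknown_scale)
```

A large statistic relative to the reference distribution suggests that the extra factor improves the fit by more than would be expected from adding that many parameters by chance.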
