GLM I: An Introduction to Generalized Linear Models


CAS Ratemaking and Product Management Seminar, March
Presented by: Tanya D. Havlicek, ACAS, MAAA

ANTITRUST Notice
The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to provide a forum for the expression of various points of view on topics described in the programs or agendas for such meetings. Under no circumstances shall CAS seminars be used as a means for competing companies or firms to reach any understanding, expressed or implied, that restricts competition or in any way impairs the ability of members to exercise independent business judgment regarding matters affecting competition. It is the responsibility of all seminar participants to be aware of antitrust regulations, to prevent any written or verbal discussions that appear to violate these laws, and to adhere in every respect to the CAS antitrust compliance policy.

Outline
- Overview of Statistical Modeling
- Linear Models
  - ANOVA
  - Simple Linear Regression
  - Multiple Linear Regression
  - Categorical Variables
  - Transformations
- Generalized Linear Models
  - Why GLM?
  - From Linear to GLM
  - Basic Components of GLMs
  - Common GLM structures
- References

Generic Modeling Schematic
Predictor variables (e.g., Driver Age, Region, Relative Equity, Credit Score) and weights (e.g., Claims, Exposures, Premium) feed a statistical model of the response variables (e.g., Losses, Default, Persistency). The model produces results: parameters and validation statistics.

Basic Linear Model Structures - Overview
Simple ANOVA: Yij = µ + eij, or more generally Yij = µ + ψi + eij
In words: Y is equal to the mean for the group, with random variation and possibly fixed variation. Traditional classification rating uses group means.
Assumptions: errors are independent and follow N(0, σe²)
  ψi, i = 1, ..., k (fixed effects model)
  ψi ~ N(0, σψ²) (random effects model)

Simple Linear Regression: yi = b0 + b1xi + ei
Assumptions: linear relationship; errors independent and follow N(0, σe²)

Multiple Regression: yi = b0 + b1x1i + ... + bnxni + ei
Assumptions: same, but with n independent random variables (RVs)

Transformed Regression: transform x, y, or both; maintain the assumption that errors are N(0, σe²). Example: yi = exp(xi) becomes log(yi) = xi
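As a minimal sketch of the fixed-effects ANOVA structure above (made-up claim data; none of these numbers come from the presentation), the model can be fit by group means, with a pooled estimate of the common error variance:

```python
# One-way fixed-effects ANOVA sketch: Y_ij = mu + psi_i + e_ij,
# fit by group means on simulated data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
groups = {"Rural": rng.normal(100.0, 15.0, size=50),
          "Urban": rng.normal(200.0, 15.0, size=50)}

grand_mean = np.mean(np.concatenate(list(groups.values())))
for name, y in groups.items():
    psi_hat = y.mean() - grand_mean          # estimated fixed group effect
    print(f"{name}: group mean = {y.mean():.1f}, psi_hat = {psi_hat:+.1f}")

# Pooled estimate of the common error variance sigma_e^2
resid = np.concatenate([y - y.mean() for y in groups.values()])
n_obs, k = resid.size, len(groups)
print(f"sigma_e^2 estimate: {(resid @ resid) / (n_obs - k):.1f}")
```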

Simple Regression (special case of multiple regression)
Model: Yi = b0 + b1Xi + ei
Y is the dependent variable, explained by X, the independent variable. Y could be Pure Premium, Default Frequency, etc. We want to estimate the relationship of how Y depends on X using observed data.
Prediction: Y = b0 + b1x* for some new x* (usually with some confidence interval)

Simple Regression
A formalization of best-fitting a line through data with a ruler and a pencil; a correlative relationship. Simple example: determine a trend to apply.
b1 = Σi (Yi − Ȳ)(Xi − X̄) / Σi (Xi − X̄)²
a = Ȳ − b1·X̄
[Chart: Mortgage Insurance Average Claim Paid Trend - observed and predicted severity by accident year, 1985-2005]
Note: All data in this presentation are for illustrative purposes only.
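A minimal sketch of those least-squares formulas on made-up (x, y) data (the data and numbers are assumptions, not from the presentation), cross-checked against numpy's own fit:

```python
# Least-squares slope and intercept computed directly from the
# formulas above, then checked against numpy.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=30)
y = 3.0 + 0.8 * x + rng.normal(0, 1, size=30)   # true line plus noise

b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b1 * x.mean()
print("fitted line: y =", round(a, 3), "+", round(b1, 3), "* x")
print("numpy.polyfit agrees:", np.polyfit(x, y, 1))  # [slope, intercept]
```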

Regression - Observe Data
[Scatter plot: Foreclosure Hazard vs. Borrower Equity Position - relative foreclosure hazard against equity as a % of the original mortgage]

Regression - Observe Data
[Scatter plot: Foreclosure Hazard vs. Borrower Equity Position, restricted to negative equity (< 0% of the original mortgage)]

Simple Regression - ANOVA

             df    SS       MS       F        Significance F
Regression   1     5.748    5.748    848.74   < .0001
Residual     17    .57      .6
Total        18    53.853

How much of the sum of squares is explained by the regression?
SS = sum of squares; SSTotal = SSRegression + SSResidual (the residual SS is also called the error SS)
SSTotal = Σ (yi − ȳ)² = 53.853
SSRegression = b1,est·[Σ xiyi − (1/n)(Σ xi)(Σ yi)] = 5.748
SSResidual = Σ (yi − ŷi)² = SSTotal − SSRegression
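A minimal sketch of this sum-of-squares decomposition on simulated data (the variable names and values are illustrative assumptions, not the presentation's data):

```python
# ANOVA decomposition for a simple regression:
# SSTotal = SSRegression + SSResidual, plus R^2 and the F statistic.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-50, 50, size=30)                 # e.g., equity position
y = 3.3 - 0.08 * x + rng.normal(0, 0.5, size=30)  # e.g., relative hazard

b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum((y - y_hat) ** 2)
ss_regr = ss_total - ss_resid
n = len(y)

r_sq = ss_regr / ss_total                         # percent of variance explained
f_stat = (ss_regr / 1) / (ss_resid / (n - 2))     # df: 1 and n-2
print(f"R^2 = {r_sq:.3f}, F = {f_stat:.1f}")
```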

Simple Regression - ANOVA (continued)

Regression Statistics
  Multiple R          .99
  R Square            .984
  Adjusted R Square   .979

MS = SS divided by df.
R² = SSRegression / SSTotal (= .984 here): the percent of variance explained.
F statistic = MSRegression / MSResidual: the significance of the regression. F tests H0: b1 = 0 vs. HA: b1 ≠ 0.

Simple Regression - Coefficients

            Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   3.363         .073            46.065   0.0000   3.209      3.517
X           -.828         .028            -29.15   0.0000   -.888      -.768

T statistics: (bi,est − H0(bi)) / s.e.(bi,est) give the significance of the individual coefficients.
t² = F for b1 in simple regression: (−29.15)² ≈ 848.74. In multiple regression, F tests that at least one coefficient is nonzero; in the simple case, "at least one" is the same as the entire model. The F stat tests the global null model.

Residuals Plot
Looks at (y_obs − y_pred) vs. y_pred. Can assess the linearity assumption and the constant variance of the errors, and look for outliers. Standardized residuals (the raw residual scaled by its standard error) should be a random scatter around 0, and standardized residuals should lie between −2 and 2. With small data sets, it can be difficult to assess assumptions.
[Plot: Standardized residuals vs. predicted foreclosure hazard]
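A minimal sketch of the t² = F identity and of standardized residuals, using statsmodels on simulated data (an assumption for illustration, not the presentation's output):

```python
# Slope t statistic (t^2 equals F in simple regression) and internally
# standardized (studentized) residuals via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-50, 50, size=30)
y = 3.3 - 0.08 * x + rng.normal(0, 0.5, size=30)

X = sm.add_constant(x)                  # design matrix [1, x]
fit = sm.OLS(y, X).fit()
print(fit.tvalues[1] ** 2, fit.fvalue)  # these two agree

# Standardized residuals: most should fall between -2 and 2
std_resid = fit.get_influence().resid_studentized_internal
print("fraction within +/-2:", np.mean(np.abs(std_resid) < 2))
```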

Normal Probability Plot
Can evaluate the assumption ei ~ N(0, σe²). The plot should be a straight line with intercept µ and slope σe. It can be difficult to assess with small sample sizes.
[Plot: Normal probability plot of standardized residuals vs. theoretical z percentiles]

Residuals
If the absolute size of the residuals increases as the predicted value increases, it may indicate nonconstant variance. It may indicate a need to transform the dependent variable, a need to use weighted regression, or a nonlinear relationship.
[Plot: Standardized residuals vs. predicted severity]

Distribution of Observations
Average claim amounts for Rural drivers are normally distributed, as are average claim amounts for Urban drivers. The mean for Urban drivers is twice that of Rural drivers. The variance of the observations is equal for Rural and Urban. The total distribution of average claim amounts across Rural and Urban is not Normal - here it is bimodal.
[Figure: Distribution of individual observations - two normal curves, Rural centered at µR and Urban at µU]
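A minimal sketch of the quantities behind a normal probability plot, using scipy (simulated residuals; an illustrative assumption, not the presentation's data):

```python
# Values behind a normal probability (Q-Q) plot of residuals,
# via scipy.stats.probplot -- no plotting required.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
resid = rng.normal(0, 1, size=30)       # stand-in for model residuals

(theo_q, ordered), (slope, intercept, r) = stats.probplot(resid, dist="norm")
# For normal residuals the points fall on a line with slope ~ sigma_e
# and intercept ~ mu; r near 1 indicates a good straight-line fit.
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.3f}")
```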

Distribution of Observations
The basic form of the regression model is Y = b0 + b1X + e.
µi = E[Yi] = E[b0 + b1Xi + ei] = b0 + b1Xi + E[ei] = b0 + b1Xi
The mean value of Y, rather than Y itself, is a linear function of X. The observations Yi are normally distributed about their mean µi: Yi ~ N(µi, σe²). Each Yi can have a different mean µi, but the variance σe² is the same for each observation.
[Figure: the line Y = b0 + b1X, with a normal curve for Y centered on the line at each of X1, X2, X3]

Multiple Regression (special case of a GLM)
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε, i.e., E[Y] = βX
β is a vector of the parameter coefficients; Y is a vector of the dependent variable; X is a matrix of the independent variables - each column is a variable, each row is an observation.
Same assumptions as simple regression:
1) the model is correct (there exists a linear relationship)
2) the errors are independent
3) the variance of ei is constant
4) ei ~ N(0, σe²)
Added assumption: the n variables are independent.

Multiple Regression
Uses more than one variable in the regression model. R² always goes up as variables are added; adjusted R² puts models on more equal footing. Many variables may be insignificant.
Approaches to model building:
- Forward Selection: add in variables, keep if significant
- Backward Elimination: start with all variables, remove if not significant
- Fully Stepwise Procedures: combination of forward and backward
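A minimal sketch of the R² vs. adjusted-R² point, fitting a multiple regression with statsmodels on made-up data that includes one deliberately useless predictor (all names and values are illustrative assumptions):

```python
# R^2 can only increase when a variable is added; adjusted R^2 need not.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
ltv = rng.uniform(0.80, 1.0, size=n)
hpa = rng.normal(0.03, 0.05, size=n)
junk = rng.normal(size=n)                         # unrelated to y
y = 0.02 + 0.15 * ltv - 0.40 * hpa + rng.normal(0, 0.02, size=n)

X1 = sm.add_constant(np.column_stack([ltv, hpa]))
X2 = sm.add_constant(np.column_stack([ltv, hpa, junk]))
fit1, fit2 = sm.OLS(y, X1).fit(), sm.OLS(y, X2).fit()

print(f"R2:     {fit1.rsquared:.4f} -> {fit2.rsquared:.4f}")
print(f"adj R2: {fit1.rsquared_adj:.4f} -> {fit2.rsquared_adj:.4f}")
```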

Multiple Regression
Goal: find a simple model that explains things well, with the assumptions reasonably satisfied.
Cautions: all predictor variables are assumed independent; as more are added, they may not be - multicollinearity means linear relationships among the X's.
Tradeoff: increasing the number of parameters (one for each variable in the regression) loses degrees of freedom (df). Keep df as high as possible for general predictive power; otherwise there is a problem of over-fitting.

Multiple Regression
Model: Claim Rate = f(Loan-to-Value (LTV), Delinquency Status, Home Price Appreciation (HPA))
Degrees of freedom ~ # observations − # parameters. Any parameter with a t-stat with absolute value less than 2 is not significant.

SUMMARY OUTPUT

Regression Statistics
  Multiple R          .97
  R Square            .94
  Adjusted R Square   .94
  Standard Error      .5
  Observations        586

ANOVA
             df     SS      MS     F       Significance F
Regression   10     7.76    .77    849.3   < .0001
Residual     575
Total        585    8.96

             Coefficients  t Stat  Lower 95%  Upper 95%
Intercept    .30           4.4     .24        .36
ltv85        -.10          -.9     -.12       -.09
ltv90        -.07          -9.     -.08       -.06
ltv95        -.04          -9.     -.05       -.03
ltv97        -.02          -6.     -.03       -.01
ss3          -.75          -55.3   -.77       -.73
ss6          -.61          -56.    -.63       -.59
ss9          -.45          -53.5   -.47       -.43
ss12         -.35          -4.     -.37       -.33
ssfcl        -.14          -.8     -.16       -.12
HPA          -.48          -8.     -.53       -.43

T-stats are also used for evaluating significance of coefficients in GLMs.
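A minimal sketch of diagnosing the multicollinearity caution above with variance inflation factors, on simulated correlated predictors (the data and threshold are illustrative assumptions):

```python
# Detecting multicollinearity among predictors with VIFs.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                      # independent

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
# x1 and x2 show large VIFs; a common rule of thumb flags VIF > 10.
```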

Multiple Regression Normal Probability Plot 3 Normal Probability Plot Standard Residual -3 -.5 - -.5 - -.5.5.5.5 3 - - -3 Theoretical z Percentile Percentile or Quantile Plots are also used to evaluate fits of GLM s 4 Categorical Variables (used in LM s and GLM s) Explanatory variables can be discrete or continuous Discrete variables generally referred to as factors Values each factor takes on referred to as levels Discrete variables also called Categorical variables In the multiple regression example given, all variables were categorical except HPA 5 Categorical Variables Assign each level a Dummy variable A binary valued variable X= means member of category and otherwise Always a reference category defined by being for all other levels If only one factor in model, then reference level will be intercept of regression If a category is not omitted, there will be linear dependency Intrinsic Aliasing 6

Categorical Variables Example: Loan To Value (LTV) Grouped for premium 5 Levels <=85%, LTV85 85.% - 9%, LTV9 9.% - 95%, LTV95 95.% - 97%, LTV97 >97% Reference Generally positively correlated with claim frequency Allowing each level it s own dummy variable allows for the possibility of non-monotonic relationship Each modeled coefficient will be relative to reference level X X X3 X4 Loan # LTV LTV85 LTV9 LTV95 LTV97 97 93 3 95 4 85 5 Design Matrix 7 Transformations A possible solution to nonlinear relationship or unequal variance of errors Transform predictor variables, response variable, or both Examples: Y = log(y) X = log(x) X = /X Y = Y Substitute transformed variable into regression equation Maintain assumption that errors are N(,σe ) 8 Why GLM? What if the variance of the errors increases with predicted values? More variability associated with larger claim sizes What if the values for the response variable are strictly positive? assumption of normality violates this restriction If the response variable is strictly non-negative, intuitively the variance of Y tends to zero as the mean of X tends to zero Variance is a function of the mean (poisson, gamma) What if predictor variables do not enter additively? Many insurance risks tend to vary multiplicatively with rating factors 9

Classic Linear Model to Generalized Linear Model
LM: X is a matrix of the independent variables (each column is a variable, each row is an observation); β is a vector of parameter coefficients; ε is a vector of residuals.
LM:  Y = βX + ε;  E[Y] = βX;  E[Y] = µ = η;  ε ~ N(0, σe²)
GLM: X and β are the same as in the LM; ε is still a vector of residuals; g is called the link function.
GLM: g(µ) = η = βX;  E[Y] = µ = g⁻¹(η);  Y = g⁻¹(η) + ε;  ε ~ exponential family

Classic Linear Model to Generalized Linear Model
LM:
1) Random component: each component of Y is independent and normally distributed. The means µi are allowed to differ, but all Yi have common variance σe².
2) Systematic component: the n covariates combine to give the linear predictor η = βX.
3) Link function: the relationship between the random and systematic components is specified via a link function. In the linear model, the link function is the identity function: E[Y] = µ = η.
GLM:
1) Random component: each component of Y is independent and from one of the exponential family of distributions.
2) Systematic component: the n covariates are combined to give the linear predictor η = βX.
3) Link function: the relationship between the random and systematic components is specified via a link function g that is differentiable and monotonic: E[Y] = µ = g⁻¹(η).

Linear Transformation versus a GLM
A linear transformation uses transformed variables; a GLM transforms the mean. A GLM is not trying to transform Y in a way that approximates uniform variability.
[Figure: fitted relationship of Y to X under a linear transformation vs. a GLM]
The error structure: a linear transformation retains the assumption Yi ~ N(µi, σe²). A GLM relaxes normality and allows for non-uniform variance: the variance of each observation Yi is a function of the mean E[Yi] = µi.
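A minimal sketch of the three GLM components in statsmodels, on simulated count data (family, link, and true parameters are illustrative assumptions):

```python
# The three GLM components: a random component (Poisson family),
# a systematic component (linear predictor X*beta), and a log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 1000
x = rng.uniform(0, 1, size=n)
mu = np.exp(0.5 + 1.2 * x)                 # true mean via the log link
y = rng.poisson(mu)                        # random component

X = sm.add_constant(x)                     # systematic component
fam = sm.families.Poisson(link=sm.families.links.Log())
fit = sm.GLM(y, X, family=fam).fit()
print(fit.params)                          # approximately [0.5, 1.2]
```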

The Link Function
Example: the log link function, g(x) = ln(x); g⁻¹(x) = eˣ.
Suppose Premium (Y) is a multiplicative function of Policyholder Age (X1) and Rating Area (X2), with estimated parameters β1, β2:
ηi = β1X1 + β2X2
g(µi) = ηi;  E[Yi] = µi = g⁻¹(ηi)
E[Yi] = exp(β1X1 + β2X2), i.e., E[Y] = g⁻¹(βX)
E[Yi] = exp(β1X1)·exp(β2X2) = µi
g(µi) = ln[exp(β1X1)·exp(β2X2)] = ηi = β1X1 + β2X2
The GLM here estimates the logs of multiplicative effects.

Examples of Link Functions
Identity:    g(x) = x,             g⁻¹(x) = x            (additive rating plan)
Reciprocal:  g(x) = 1/x,           g⁻¹(x) = 1/x
Log:         g(x) = ln(x),         g⁻¹(x) = eˣ           (multiplicative rating plan)
Logistic:    g(x) = ln(x/(1−x)),   g⁻¹(x) = eˣ/(1 + eˣ)

Error Structure
The exponential family: the distribution is completely specified in terms of its mean and variance, and the variance of Yi is a function of its mean:
E[Yi] = µi;  Var(Yi) = φ V(µi) / ωi
V(µ), the variance function, specifies the distribution of Y, but V(µ) is not itself the variance of Y. φ is a parameter that scales the variance; ωi is a constant that assigns a weight, or credibility, to observation i.
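A minimal sketch of the multiplicative interpretation: with a log link, exponentiated GLM coefficients act as rating relativities. The age/area factors and gamma severity below are illustrative assumptions:

```python
# Recovering a base rate and multiplicative relativities from a
# gamma GLM with log link (simulated rating data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 5000
young = rng.integers(0, 2, size=n)          # dummy: 1 = young driver
urban = rng.integers(0, 2, size=n)          # dummy: 1 = urban area
mu = 100 * 1.25**young * 1.50**urban        # true multiplicative plan
y = rng.gamma(shape=5.0, scale=mu / 5.0)    # severity-like response

X = sm.add_constant(np.column_stack([young, urban]))
fam = sm.families.Gamma(link=sm.families.links.Log())
fit = sm.GLM(y, X, family=fam).fit()
print(np.exp(fit.params))   # ~ [100, 1.25, 1.50]: base rate, relativities
```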

Error Structure
Members of the exponential family:
- Normal (Gaussian) - used in classic regression
- Poisson (common for frequency)
- Binomial
- Negative Binomial
- Gamma (common for severity)
- Inverse Gaussian
- Tweedie (common for pure premium), aka the compound Gamma-Poisson process: the claim count is Poisson distributed and the size-of-loss is Gamma distributed

General Examples of Error/Link Combinations
- Traditional linear model: response variable is continuous; error distribution normal; identity link
- Logistic regression: response variable is a proportion; error distribution binomial; logit link
- Poisson regression in a log-linear model: response variable is a count; error distribution Poisson; log link
- Gamma model with log link: response variable is positive and continuous; error distribution gamma; log link

Specific Examples of Error/Link Combinations

Observed Response   Link Fnc   Error Structure   Variance Fnc
Claim Frequency     Log        Poisson           µ
Claim Severity      Log        Gamma             µ²
Pure Premium        Log        Tweedie           µᵖ (1 < p < 2)
Retention Rate      Logit      Binomial          µ(1 − µ)
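A minimal sketch of the pure-premium row of that table: a Tweedie GLM with log link and variance function µᵖ, fit to a simulated compound Poisson-Gamma response (frequencies, severities, and the chosen p = 1.5 are illustrative assumptions):

```python
# Tweedie GLM for pure premium: many exact zeros plus positive losses.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 20_000
x = rng.integers(0, 2, size=n)                   # one rating factor
lam = 0.1 * 1.4**x                               # Poisson claim frequency
counts = rng.poisson(lam)
# Gamma severity summed over claims gives the pure premium
y = np.array([rng.gamma(2.0, 500.0, size=c).sum() for c in counts])

X = sm.add_constant(x)
fam = sm.families.Tweedie(var_power=1.5, link=sm.families.links.Log())
fit = sm.GLM(y, X, family=fam).fit()
print(np.exp(fit.params))   # base pure premium and factor relativity ~1.4
```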

References
Anderson, D.; Feldblum, S.; Modlin, C.; Schirmacher, D.; Schirmacher, E.; and Thandi, N., "A Practitioner's Guide to Generalized Linear Models" (Second Edition), CAS Study Note, May 2005.
Devore, Jay L., Probability and Statistics for Engineering and the Sciences, 3rd ed., Duxbury Press.
Foote, C., et al., 2008. "Negative equity and foreclosure: Theory and evidence," Journal of Urban Economics, 64(2): 234-245.
McCullagh, P., and J.A. Nelder, Generalized Linear Models, 2nd ed., Chapman & Hall/CRC.
SAS Institute, Inc., SAS Help and Documentation, v9.1.3.