C-14 Finding the Right Synergy from GLMs and Machine Learning

1 C-14 Finding the Right Synergy from GLMs and Machine Learning 2010 CAS Annual Meeting Claudine Modlin November 8, 2010 Slide 1

2 Definitions: Parametric modeling. Objective: build a predictive model. The user makes assumptions (e.g., distribution, model structure) and specifies a preliminary list of explanatory variables. The user guides the statistical method in order to effectively describe a particular response (e.g., claim frequency). The result is an algorithm, a set of parameters, and diagnostics. Examples: minimum bias methods, linear regression, GLM. Slide 2
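
To ground the definition, here is a minimal sketch of the kind of claim-frequency GLM the slide has in mind, written with statsmodels. The file and column names (policies.csv, claim_count, exposure, driver_age_band, vehicle_group) are illustrative assumptions, not taken from the presentation.

```python
# Minimal claim-frequency GLM sketch (illustrative names, assumed data layout).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

policies = pd.read_csv("policies.csv")  # assumed: one row per policy

# Poisson error, log link, exposure as offset: the standard multiplicative
# frequency structure referred to throughout the talk as "a GLM".
freq_model = smf.glm(
    "claim_count ~ C(driver_age_band) + C(vehicle_group)",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()

print(freq_model.summary())        # parameter estimates and diagnostics
print(np.exp(freq_model.params))   # multiplicative relativities
```

The exponentiated coefficients are the relativities a pricing actuary would read off such a model, and the summary output is the "algorithm, set of parameters, and diagnostics" the slide mentions.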

3 Definitions: Machine learning tools. Objective: learn new things (which may help in building a model). Find patterns (often complex) in an unknown underlying distribution. The tool may be supervised, unsupervised, or a blend of the two. The result might be a new variable, a tree, a grouping, a score, etc. Examples: principal components analysis, decision trees, clustering, artificial neural networks. Slide 3
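
As one illustration of "the result might be a new variable", the sketch below uses principal components analysis, one of the unsupervised tools listed, to compress several correlated vehicle characteristics into two scores that could later be offered to a GLM. The data layout and column names are assumptions.

```python
# PCA sketch: turn correlated vehicle characteristics into two candidate variables.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

vehicles = pd.read_csv("vehicles.csv")  # assumed: one row per vehicle
features = ["base_price", "curb_weight", "engine_size", "model_year"]

scaled = StandardScaler().fit_transform(vehicles[features])  # put features on one scale
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(pca.explained_variance_ratio_)    # share of variance captured by each score
vehicles["veh_score_1"] = scores[:, 0]  # new candidate rating variables
vehicles["veh_score_2"] = scores[:, 1]
```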

4 A confusing message... GLMs have weaknesses, as evidenced by unexplained predictive power in the GLM residuals. Therefore they need to be corrected via machine learning methods. Slide 4

5 Before we jump to conclusions... Make sure your GLM is as good as it can be (i.e., follow best practices). Use machine learning methods to improve each stage of the GLM process. Slide 5

6 Before we jump to conclusions... "All models are wrong, but some are useful" (George E.P. Box). What does useful imply other than reliably and accurately predictive? Easy to understand and communicate. Available in a timely manner. Capable of implementation. Slide 6

7 Agenda Kristi: GLM best practices Machine learning at every stage of GLM analysis Claudine: Additional enhancements to GLM Mining GLM residuals via machine learning Slide 7

8 GLM enhancements. Testing link function assumption. Saddles for interaction detection. Slide 8

9 GLM enhancement: Test link function via Box-Cox investigation. E[Y_i] = µ_i = g^{-1}(Σ_j X_ij β_j + ξ_i), Var[Y_i] = φ V(µ_i) / ω_i. The Box-Cox link function is defined as g(x) = (x^λ - 1) / λ for λ ≠ 0 and g(x) = ln(x) for λ = 0. λ = 1: g(x) = x - 1, additive (with a base level shift). λ → 0: g(x) → ln(x), multiplicative (via l'Hôpital's rule). λ = -1: g(x) = 1 - 1/x, inverse (with a base level shift). Test a range of values of λ and see which maximizes the likelihood. Slide 9
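
The λ search can be sketched directly: for each λ on a grid, fit a small frequency model under the Box-Cox link by maximizing the Poisson log-likelihood numerically, then compare the profiled likelihoods. This is a brute-force illustration under assumed inputs (a design matrix X with an intercept, claim counts y, exposures), not the presenters' implementation, which would normally run inside the GLM fitting software.

```python
# Profile the likelihood over a grid of Box-Cox link parameters (illustrative).
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln


def inverse_boxcox_link(eta, lam):
    """mu = g^{-1}(eta) for g(mu) = (mu**lam - 1)/lam, with the log link at lam = 0."""
    if lam == 0.0:
        return np.exp(eta)
    return np.clip(1.0 + lam * eta, 1e-10, None) ** (1.0 / lam)


def neg_loglik(beta, X, y, exposure, lam):
    # Poisson log-likelihood with exposure; mu is the expected claim count.
    mu = exposure * inverse_boxcox_link(X @ beta, lam)
    return -np.sum(y * np.log(mu) - mu - gammaln(y + 1.0))


def profile_lambda(X, y, exposure, lambdas):
    beta0 = np.zeros(X.shape[1])
    profile = {}
    for lam in lambdas:
        fit = minimize(neg_loglik, beta0, args=(X, y, exposure, lam),
                       method="Nelder-Mead")
        profile[lam] = -fit.fun  # maximized log-likelihood at this lambda
    return profile

# Usage sketch: a peak near lambda = 0 supports a multiplicative (log-link) model.
# profile = profile_lambda(X, y, exposure, np.linspace(-1.0, 1.0, 9))
```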

10 GLM enhancement: Test link function via Box-Cox investigation. [Chart: profile likelihood against λ, frequency model] Slide 10

11 GLM enhancement: Test link function via Box-Cox investigation. [Chart: profile likelihood against λ, frequency and amounts models] Slide 11

12 Interactions Policyholder Age Slide 12

13 Interactions Policyholder Age Slide 13

14 Interactions. [Chart: percentage response by Policyholder Age] Slide 14

15 GLM enhancement Interactions Why are interactions present? 1. Because that's how the factors behave 2. Because multiplicative models can go wrong at the edges 1.5 * 1.4 * 1.7 * 1.5 * 1.8 * 1.5 * 1.8 = 26! Slide 15

16 GLM enhancement: Interaction detection within GLMs - Saddles. [Diagram: grids of β parameters across Vehicle group and Age levels, illustrating how many parameters a full Age by Vehicle group interaction requires] Slide 16

17 Example Slide 17

18 Example Slide 18

19 Example Slide 19

20 Saddles Slide 20

21 Saddles Slide 21

22 Saddles Slide 22

23 Saddles Slide 23

24 Saddles Slide 24

25 Transforming categorical and non-linear responses into single parameter variates. [Chart: Response by Factor Levels] Slide 25

26 Transforming categorical and non-linear responses into single parameter variates. [Chart: Response by Factor Levels, with a fitted β] Slide 26
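
A sketch of how such single-parameter variates can feed a saddle term: map each factor level to its fitted main-effect coefficient, then add one product term to the GLM so that an interaction is tested with a single extra parameter instead of a full grid of βs. The objects freq_model and policies continue the illustrative names from the earlier frequency sketch; everything here is an assumed layout rather than the presenters' code.

```python
# Saddle-style interaction test using main-effect coefficients as variates.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Map each level to its fitted coefficient (the base level maps to 0.0).
age_beta = {lvl: freq_model.params.get(f"C(driver_age_band)[T.{lvl}]", 0.0)
            for lvl in policies["driver_age_band"].unique()}
veh_beta = {lvl: freq_model.params.get(f"C(vehicle_group)[T.{lvl}]", 0.0)
            for lvl in policies["vehicle_group"].unique()}

policies["age_variate"] = policies["driver_age_band"].map(age_beta)
policies["veh_variate"] = policies["vehicle_group"].map(veh_beta)

# Main effects plus a single product term: the "saddle".
saddle_model = smf.glm(
    "claim_count ~ C(driver_age_band) + C(vehicle_group)"
    " + age_variate:veh_variate",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()

# A significant product coefficient flags an interaction worth refining in the GLM.
print(saddle_model.params["age_variate:veh_variate"],
      saddle_model.pvalues["age_variate:veh_variate"])
```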

27 Saddles Slide 27

28 Saddles Slide 28

29 Saddles example: no interaction Slide 29

30 Saddles example: unsimplified interaction Slide 30

31 Saddles example: unsimplified interaction Slide 31

32 Saddles example: quadrant interaction Slide 32

33 Slide 33

34 Higher dimensions. Why stop at 2 dimensions? Fast and parsimonious way of detecting complex signals and model corrections. Can be used to guide GLM refinement or used in its own right (e.g., underwriting rules). Slide 34

35 Saddles - model comparison. [Chart: Saddle vs. Original model predictions with exposure weight, across bands from 70% to 130%] Slide 35

36 Saddles - model comparison. [Chart: observed experience vs. Saddle and Original model predictions with exposure weight, across bands from 70% to 130%] Slide 36

37 Saddles - model comparison. [Chart: observed experience vs. Saddle and Original model predictions with exposure weight, across bands from 70% to 150%] Slide 37

38 GLM residuals. What if there is still unexplained predictive power in the GLM residuals, and why? Limited list of explanatory variables. Missing interactions. Poor decisions in factor selection. Other? Slide 38

39 Mining GLM residuals. Supervised machine learning tools can mine residuals from the GLM and develop algorithms that group risks with similar residuals. Results can form the basis of a single correction factor to the GLM. Potential disadvantages of this approach: hard to distinguish signal from noise in the residual when there is no basis for evaluating the residual; prone to overfitting; difficult to understand and explain the effect on the model, which can lead to implementation issues. Slide 39
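
To make the overfitting caveat concrete, here is a sketch of the broad residual-mining approach the slide warns about: a shallow regression tree fit to GLM residuals, with a hold-out score to check whether the leaves capture signal or noise. The residual definition, feature list and threshold choices are simplifying assumptions, and freq_model and policies again continue the earlier illustrative names.

```python
# Broad residual mining with a regression tree (illustration of the caveat).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Crude residual on the log scale: observed vs fitted frequency (0.5 avoids log(0)).
policies["fitted_freq"] = freq_model.fittedvalues / policies["exposure"]
policies["resid"] = (np.log((policies["claim_count"] + 0.5) / policies["exposure"])
                     - np.log(policies["fitted_freq"]))

features = pd.get_dummies(
    policies[["driver_age_band", "vehicle_group", "territory"]])
train_X, test_X, train_r, test_r = train_test_split(
    features, policies["resid"], test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5000)
tree.fit(train_X, train_r)

# If the leaf means do not validate out of sample, the "signal" was mostly noise.
print("train R^2:", tree.score(train_X, train_r))
print("test  R^2:", tree.score(test_X, test_r))
```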

40 Mining GLM residuals Slide 40

41 An alternative approach to mine GLM residuals. Identify additional signal in the residual that can be attributed to a particular high-dimension factor, for example: Geography (zip code), Vehicle (VIN), Workers compensation SIC code, or any factor requiring a large number of small units as building blocks, where many building blocks have little or no claims experience. EMB uses a Bayesian-based data mining method that utilizes the signal in the residuals to correct the GLM results for that high-dimension factor. This type of focused correction factor is easier to control and understand. Slide 41
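
The talk does not spell out the Bayesian method, so the sketch below substitutes a simple Bühlmann-style credibility shrinkage to show the flavor of a focused, single-factor correction: average the residual within each zip code and shrink it toward zero in proportion to exposure, leaving the rest of the GLM untouched. The constant K, the column names and the resid/fitted_freq columns (from the previous sketch) are all assumptions.

```python
# Focused correction for one high-dimension factor via credibility shrinkage.
import numpy as np

K = 5000.0  # credibility constant; an assumption to be tuned on hold-out data

by_zip = (policies.groupby("zip_code")
                  .agg(mean_resid=("resid", "mean"),
                       exposure=("exposure", "sum")))

# Z = n / (n + K): thinly populated zips get little credibility and stay near the GLM.
by_zip["credibility"] = by_zip["exposure"] / (by_zip["exposure"] + K)
by_zip["zip_correction"] = np.exp(by_zip["credibility"] * by_zip["mean_resid"])

# Multiplicative correction applied only to geography; everything else in the
# GLM is left alone, which is what makes the correction easy to explain.
policies["corrected_freq"] = (policies["fitted_freq"]
                              * policies["zip_code"].map(by_zip["zip_correction"]))
```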

42 Mining GLM residuals in a controlled manner - Geography example. Goal is to remove the noise and find the signal. [Diagram: actual experience split into signal and noise; signal split into explained and unexplained, each with non-geographic and geographic components] Slide 42

43 Mining GLM residuals in a controlled manner - Geography example. Goal is to find the geographic signal. [Diagram: actual experience split into signal and noise; signal split into explained and unexplained, each with non-geographic and geographic components] Slide 43

44 Mining GLM residuals in a controlled manner - Geography example. [Diagram: GLM factors (operator characteristics, vehicle characteristics, policy characteristics excluding territory, geophysical, geo-demographic, other geographic) and the GLM residual (non-geo signal, geo signal, noise) combine into the explained geo risk] Slide 44

45 Mining GLM residuals in a controlled manner - Geography example. Check the residuals to determine whether there is any unexplained systematic effect. Smoothing is used to find the signal. Correction factors are applied to the geo estimates to determine the best estimate. Slide 45
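
One plausible form of the smoothing step (the slide does not say which smoother is used) is a kernel-weighted average of zip-level residuals over nearby zip centroids, as sketched below. The bandwidth, the centroid columns and the by_zip table from the previous sketch are assumptions.

```python
# Gaussian-kernel smoothing of zip-level residuals over centroid distances.
import numpy as np

def smooth_residuals(lat, lon, resid, bandwidth_km=20.0):
    """Distance-weighted average of residuals across all zip centroids."""
    lat, lon, resid = map(np.asarray, (lat, lon, resid))
    km_per_deg = 111.0  # rough planar approximation, adequate for nearby zips
    dlat = (lat[:, None] - lat[None, :]) * km_per_deg
    dlon = (lon[:, None] - lon[None, :]) * km_per_deg * np.cos(np.radians(lat)).mean()
    dist = np.sqrt(dlat ** 2 + dlon ** 2)
    weights = np.exp(-0.5 * (dist / bandwidth_km) ** 2)
    return (weights @ resid) / weights.sum(axis=1)

by_zip["smoothed_resid"] = smooth_residuals(
    by_zip["centroid_lat"], by_zip["centroid_lon"], by_zip["mean_resid"])
```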

46 Mining GLM residuals in a controlled manner - Geography example. [Diagram: GLM factors (operator, vehicle and policy characteristics excluding territory; geophysical, geo-demographic, other geographic) plus the unexplained signal and noise (non-geo signal, geo signal, noise) combine into the expected geo risk, shown on a low-to-high dollar scale] Slide 46

47 Mining GLM residuals in a controlled manner - Geography example. Assess whether new territorial groupings follow observed data well (ideally on hold-out data). [Screenshot: validation chart of the selected smoothing, showing experience and exposure weight by band over the chosen time period] Slide 47

48 Mining GLM residuals in a controlled manner - Vehicle example. [Flow: Initial Estimator (standardized fitted value from the GLM) -> Vehicle Risk Estimator (spatially corrected standardized fitted value) -> Vehicle Symbols (clustered to form symbols) -> Symbol Relativities (relativities calculated for each symbol)] Slide 48
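
A sketch of the clustering step in the flow above: group the corrected per-vehicle risk estimators into a small number of symbols and compute a relativity for each symbol, standing in for whatever clustering and relativity calculation is actually used. The vehicles table and its risk_estimator, claim_count and exposure columns are assumptions.

```python
# Cluster vehicle risk estimators into symbols and compute symbol relativities.
import pandas as pd
from sklearn.cluster import KMeans

n_symbols = 20  # illustrative; the real count would be chosen with hold-out tests
kmeans = KMeans(n_clusters=n_symbols, n_init=10, random_state=0)
vehicles["symbol"] = kmeans.fit_predict(vehicles[["risk_estimator"]])

# Relativity per symbol: observed frequency relative to the overall frequency.
overall = vehicles["claim_count"].sum() / vehicles["exposure"].sum()
by_symbol = (vehicles.groupby("symbol")
                     .apply(lambda g: g["claim_count"].sum() / g["exposure"].sum()))
symbol_relativity = by_symbol / overall
print(symbol_relativity.sort_values())
```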

49 Mining GLM residuals in a controlled manner - Vehicle example. Initial estimator: vehicle component relativities built from Base Price, Curb Weight, Engine Size, Model Year, Airbag Features and Theft Deterrents. Vehicle characteristics are used as proxies for VIN; these are standardized for non-vehicle factors (e.g., age of driver). Slide 49

50 Mining GLM residuals in a controlled manner - Vehicle example. Smooth residuals across neighboring vehicles. Slide 50

51 Mining GLM residuals in a controlled manner - Vehicle example. The vehicle estimator is clustered into new symbols. [Diagram: modeled vehicle signal = vehicle component relativities (Base Price, Curb Weight, Engine Size, Model Year, Airbag Features, Theft Deterrents) plus a spatial correction (smoothed residual), shown on a low-to-high dollar scale] Slide 51

52 Mining GLM residuals in a controlled manner - Vehicle example. The technique has proven very successful based on proper hold-out sample validation. Slide 52

53 Summary. Model building tools build models; machine learning tools explore data. GLMs are a powerful and practical multivariate method for insurance analysis, particularly ratemaking. Model building in general can be improved by following best practices and enhancements. Machine learning tools can improve the GLM process at every stage: data preparation, variable reduction, interaction detection, variable simplification, model validation. Data mining methods can squeeze additional predictive power out of GLM residuals. Rather than mining residuals on a broad basis, consider mining residuals and correcting a particular high-dimension factor: easier to control, easier to understand. Slide 53

54 Contact us EMB El Camino Real Suite 150 San Diego, California T +1 (858) F +1 (858) Slide 54

55 EMB refers to the software and consulting practice carried on by EMB America LLC, EMB Software Management LLP and their directly or indirectly affiliated firms or entities, partnerships or joint ventures, each of which is a separate and distinct legal entity. Slide 55
