De Jong and Heller, GLMs for Insurance Data: R code and output of examples in text


Contents

1 Poisson regression
2 Negative binomial regression
3 Quasi likelihood regression
4 Logistic regression
5 Ordinal regression
6 Nominal regression
7 Gamma regression
8 Inverse Gaussian regression
9 Logistic regression GLMM
10 Logistic regression GEE
11 Logistic regression GAM

January 31,

1 Poisson regression

Number of children: log link

    birth <- read.table("birth.csv",sep=",",header=TRUE)
    birth.log <- glm(formula = children ~ age, family = poisson(link = log), data=birth)
    summary(birth.log)

    glm(formula = children ~ age, family = poisson(link = log), data = birth)

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               e-08 ***
    age                                       e-07 ***

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 140 degrees of freedom
    Residual deviance:  on 139 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

    anova(birth.log)

    Analysis of Deviance Table

    Model: poisson, link: log

    Response: children

    Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev
    NULL
    age

Number of children: identity link

R produces the following error message. Notice also the error message in the SAS output. Clearly there is a problem with this model.

    birth.id <- glm(formula = children ~ age, family = poisson(link = identity), data=birth)

    Error: no valid set of coefficients has been found: please supply starting values

Diabetes deaths, categorical age

In order to read the data into R, diabetes.xls must be saved as diabetes.csv. Gender and age are both character variables in the data file, so R will treat them as categorical. The model is specified as

    deaths ~ gender + age

The default base level in R is the lowest level, which is female gender and age <25. In order to reproduce the SAS output, we control the base level using the C function. In the case of age, for example, the base level we want is the fourth level of age, so the term is specified in the model as C(age,base=4).
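The effect of the base argument can be checked on a small example. The sketch below uses simulated counts; the data frame toy, its age bands and the object names are invented for illustration and are not taken from diabetes.csv. It compares the default base level with C(age, base = 4) and with the base R function relevel. Changing the base level only reparameterises the model, so the deviance should be unchanged.

```r
# Simulated data: hypothetical age bands, not the diabetes data
set.seed(1)
bands <- c("<25", "25-34", "35-44", "45-54")
toy <- data.frame(
  age = factor(rep(bands, each = 25), levels = bands),
  y   = rpois(100, lambda = 3)
)

# Default: the first level ("<25") is the base level
fit.default <- glm(y ~ age, family = poisson(link = log), data = toy)

# Fourth level ("45-54") as base level, in the style of C(age,base=4)
fit.base4 <- glm(y ~ C(age, base = 4), family = poisson(link = log), data = toy)

# Same base level chosen with relevel()
fit.relevel <- glm(y ~ relevel(age, ref = "45-54"),
                   family = poisson(link = log), data = toy)

# Same fit, different parameterisation
same.deviance <- isTRUE(all.equal(deviance(fit.default), deviance(fit.base4),
                                  tolerance = 1e-6))
same.coef <- isTRUE(all.equal(unname(coef(fit.base4)),
                              unname(coef(fit.relevel)), tolerance = 1e-6))
print(c(same.deviance = same.deviance, same.coef = same.coef))

# With a single factor, exp(intercept) is the mean count in the base level
print(exp(coef(fit.base4)[1]))
print(mean(toy$y[toy$age == "45-54"]))
```

Either C or relevel can be used; C keeps the choice of base level inside the model formula, which is the style followed in the rest of these examples.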

    diabetes <- read.table("diabetes.csv",sep=",",header=TRUE)
    attach(diabetes)
    ### categorical age
    model1 <- glm(deaths ~ C(gender,base=2) + C(age,base=4),
                  family = poisson(link = log), offset = l_popn)
    summary(model1)

    glm(formula = deaths ~ C(gender, base = 2) + C(age, base = 4),
        family = poisson(link = log), offset = l_popn)

                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                     < 2e-16 ***
    C(gender, base = 2)                             e-15 ***
    C(age, base = 4)                                e-09 ***
    C(age, base = 4)                                ***
    C(age, base = 4)                                **
    C(age, base = 4)                                e-10 ***
    C(age, base = 4)                                < 2e-16 ***
    C(age, base = 4)                                < 2e-16 ***
    C(age, base = 4)                                < 2e-16 ***

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 15 degrees of freedom
    Residual deviance:  on 7 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

Diabetes deaths, cubic age

Polynomials are specified in R using the poly function.

    model2 <- glm(deaths ~ C(gender,base=2) + poly(agemidpt,3),
                  family = poisson(link = log), offset = l_popn)
    summary(model2)

    glm(formula = deaths ~ C(gender, base = 2) + poly(agemidpt, 3),
        family = poisson(link = log), offset = l_popn)

                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                     < 2e-16 ***
    C(gender, base = 2)                             e-15 ***
    poly(agemidpt, 3)                               < 2e-16 ***
    poly(agemidpt, 3)
    poly(agemidpt, 3)

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 15 degrees of freedom
    Residual deviance:  on 11 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

This gives coefficients for the agemidpt polynomial that differ from those in SAS, because poly uses orthogonal polynomials by default. The SAS solution is reproduced by rescaling age and using its raw powers:

    minage <- min(agemidpt)
    maxage <- max(agemidpt)
    agestd <- (agemidpt-0.5*(minage+maxage))/(0.5*(maxage-minage))
    model3 <- glm(deaths ~ C(gender,base=2) + agestd + I(agestd^2) + I(agestd^3),
                  family = poisson(link = log), offset = l_popn)
    summary(model3)

    glm(formula = deaths ~ C(gender, base = 2) + agestd + I(agestd^2) +
        I(agestd^3), family = poisson(link = log), offset = l_popn)

                        Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                     < 2e-16 ***
    C(gender, base = 2)                             e-15 ***
    agestd                                          < 2e-16 ***
    I(agestd^2)
    I(agestd^3)

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 15 degrees of freedom
    Residual deviance:  on 11 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

Third party claims

    TP <- read.table("third party claims.csv",sep=",",header=TRUE)
    attach(TP)
    model1 <- glm(claims ~ log(accidents), family=poisson, offset=log(population))
    summary(model1)

    glm(formula = claims ~ log(accidents), family = poisson, offset = log(population))

                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                <2e-16 ***
    log(accidents)                             <2e-16 ***

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance:  on 175 degrees of freedom
    Residual deviance:  on 174 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4
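The difference between poly and the rescaled raw powers used for the cubic age model can be checked on simulated data. In the sketch below, x, y and the fitted objects are invented for illustration and are not the diabetes data. The two bases span the same cubic, so the coefficients differ but the fitted means and the deviance should agree.

```r
# Simulated data: hypothetical covariate and counts, not the diabetes data
set.seed(2)
x <- seq(20, 80, by = 5)
y <- rpois(length(x), lambda = exp(0.5 + 0.03 * x))

# Rescale x to [-1, 1], in the same way as agestd
xstd <- (x - 0.5 * (min(x) + max(x))) / (0.5 * (max(x) - min(x)))

# Orthogonal cubic (poly default) versus raw powers of the rescaled variable
fit.poly <- glm(y ~ poly(x, 3), family = poisson(link = log))
fit.raw  <- glm(y ~ xstd + I(xstd^2) + I(xstd^3), family = poisson(link = log))

# The coefficients differ ...
print(cbind(poly = coef(fit.poly), raw = coef(fit.raw)))

# ... but the fitted means and the deviance are the same
same.fit <- isTRUE(all.equal(unname(fitted(fit.poly)), unname(fitted(fit.raw)),
                             tolerance = 1e-6))
same.dev <- isTRUE(all.equal(deviance(fit.poly), deviance(fit.raw),
                             tolerance = 1e-6))
print(c(same.fit = same.fit, same.dev = same.dev))
```

The choice between the two parameterisations therefore affects only how the coefficients are reported, which matters when matching output from another package.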

2 Negative binomial regression

Negative binomial regression is in the MASS package, which must be installed and loaded. The function is glm.nb.

Third party claims

    library(MASS)
    model2 <- glm.nb(claims ~ log(accidents) + offset(log(population)))
    summary(model2)

    glm.nb(formula = claims ~ log(accidents) + offset(log(population)),
        init.theta = , link = log)

                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                <2e-16 ***
    log(accidents)                             <2e-16 ***

    (Dispersion parameter for Negative Binomial(5.8309) family taken to be 1)

    Null deviance:  on 175 degrees of freedom
    Residual deviance:  on 174 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 1

    Correlation of
                   (Intercept)
    log(accidents)

    Theta:
    Std. Err.:

    2 x log-likelihood:

The dispersion parameter is Theta, which appears as 5.8309 in the family name above. In SAS the dispersion parameter is given as the reciprocal, 1/Theta.

Swedish mortality, categorical age and year

    mortality <- read.table("mortality.csv",header=TRUE,sep=",")
    mortality <- mortality[,-c(3,5,7,9,11)]
    mortality <- na.omit(mortality)
    attach(mortality)
    library(MASS)
    model1 <- glm.nb(Male_death ~ factor(age) + factor(year) + offset(l_male_exp))

    There were 50 or more warnings (use warnings() to see the first 50)

    summary(model1,corr=FALSE)

    glm.nb(formula = Male_death ~ factor(age) + factor(year) + offset(l_male_exp),
        init.theta = , link = log)

    [parameter estimates table omitted]

    (Dispersion parameter for Negative Binomial( ) family taken to be 1)

    Null deviance:  on 5867 degrees of freedom
    Residual deviance: 7709 on 5704 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 1

    Theta:
    Std. Err.:

    2 x log-likelihood:

3 Quasi likelihood regression

    model3 <- glm(claims ~ log(accidents), family=quasi(link="log",variance="mu"),
                  offset=log(population))
    summary(model3)

    glm(formula = claims ~ log(accidents), family = quasi(link = "log",
        variance = "mu"), offset = log(population))

                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                < 2e-16 ***
    log(accidents)                             e-12 ***

    (Dispersion parameter for quasi family taken to be )

    Null deviance:  on 175 degrees of freedom
    Residual deviance:  on 174 degrees of freedom
    AIC: NA

    Number of Fisher Scoring iterations: 4

4 Logistic regression

Vehicle insurance: quadratic vehicle value

    car <- read.table("car.csv",sep=",",header=TRUE)
    model1 <- glm(clm ~ veh_value + I(veh_value^2), family=binomial, data=na.omit(car))
    summary(model1)

    glm(formula = clm ~ veh_value + I(veh_value^2), family = binomial,
        data = na.omit(car))

                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                < 2e-16 ***
    veh_value                                  e-10 ***
    I(veh_value^2)                             e-05 ***

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on  degrees of freedom
    Residual deviance:  on  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 6

Vehicle insurance: banded vehicle value

    ### create banded variable
    valuecat <- cut(car$veh_value, c(-1,2.5,5.0,7.5,10.0,12.5,100))
    table(valuecat)

    valuecat
      (-1,2.5]    (2.5,5]    (5,7.5]   (7.5,10]  (10,12.5] (12.5,100]

    car <- cbind(car,valuecat)
    model2 <- glm(clm ~ factor(valuecat), family=binomial, data=na.omit(car))
    summary(model2)

    glm(formula = clm ~ factor(valuecat), family = binomial, data = na.omit(car))

                               Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                            < 2e-16 ***
    factor(valuecat)(2.5,5]                                e-06 ***
    factor(valuecat)(5,7.5]
    factor(valuecat)(7.5,10]
    factor(valuecat)(10,12.5]
    factor(valuecat)(12.5,100]

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on  degrees of freedom
    Residual deviance:  on  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

Vehicle insurance: full model, adjusted for exposure

    source("logit-exposure-adjusted.r")
    attach(car)
    model3 <- glm(clm ~ C(factor(agecat),base=3) + C(factor(area),base=3) +
                  C(factor(veh_body),base=10) + factor(valuecat),
                  family=binomial(logitexp(exposure)))
    summary(model3)

    glm(formula = clm ~ C(factor(agecat), base = 3) + C(factor(area),
        base = 3) + C(factor(veh_body), base = 10) + factor(valuecat),
        family = binomial(logitexp(exposure)))

                                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                                < 2e-16 ***
    C(factor(agecat), base = 3)                                e-06 ***
    C(factor(agecat), base = 3)
    C(factor(agecat), base = 3)
    C(factor(agecat), base = 3)                                e-06 ***
    C(factor(agecat), base = 3)                                ***
    C(factor(area), base = 3)
    C(factor(area), base = 3)
    C(factor(area), base = 3)                                  *
    C(factor(area), base = 3)
    C(factor(area), base = 3)
    C(factor(veh_body), base = 10)                             *
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)                             **
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)
    C(factor(veh_body), base = 10)                             **
    factor(valuecat)(2.5,5]                                    e-05 ***
    factor(valuecat)(5,7.5]
    factor(valuecat)(7.5,10]
    factor(valuecat)(10,12.5]
    factor(valuecat)(12.5,100]

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on  degrees of freedom
    Residual deviance:  on  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

Vehicle insurance: logistic regression on grouped data

    ### grouped data
    car.group <- read.table("car_grouped.csv",sep=",",header=TRUE)
    ### the response is a two-column matrix
    ### the first column is the number of successes (claims)
    ### the second column is the number of failures (number-claims)
    model4 <- glm(cbind(claims,number-claims) ~ C(factor(agecat),base=6) +
                  C(factor(area),base=6) +
                  C(factor(veh_body),base=13) + factor(valuecat),
                  family=binomial, data=car.group)
    summary(model4)

    glm(formula = cbind(claims, number - claims) ~ C(factor(agecat), base = 6) +
        C(factor(area), base = 6) + C(factor(veh_body), base = 13) +
        factor(valuecat), family = binomial, data = car.group)

                                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                                < 2e-16 ***
    C(factor(agecat), base = 6)                                e-05 ***
    C(factor(agecat), base = 6)
    C(factor(agecat), base = 6)
    C(factor(agecat), base = 6)                                e-05 ***
    C(factor(agecat), base = 6)                                ***

    C(factor(area), base = 6)
    C(factor(area), base = 6)
    C(factor(area), base = 6)                                  *
    C(factor(area), base = 6)
    C(factor(area), base = 6)
    C(factor(veh_body), base = 13)                             **
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)
    C(factor(veh_body), base = 13)                             ***
    factor(valuecat)                                           ***
    factor(valuecat)
    factor(valuecat)
    factor(valuecat)
    factor(valuecat)

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance:  on 928 degrees of freedom
    Residual deviance:  on 901 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 5

ROC curves and AUC

The AUC is easily computed using the somers2 function in the Hmisc package, which needs to be downloaded from the CRAN website. A function ROC for computing and plotting the ROC curve is given on the book website in file ROC-function.r.

    car <- read.table("car.csv",sep=",",header=TRUE)
    valuecat <- cut(car$veh_value, c(-1,2.5,5.0,7.5,10.0,12.5,100))
    car <- cbind(car,valuecat)
    attach(car)

    The following object(s) are masked _by_ .GlobalEnv :
    valuecat

    library(Hmisc)   ### need this for somers2 function to compute AUC

    Attaching package: Hmisc
    The following object(s) are masked from package:base :
    format.pval
    The following object(s) are masked from package:base :
    round.POSIXt
    The following object(s) are masked from package:base :
    trunc.POSIXt
    Warning message:
    package Hmisc was built under R version

    source("ROC-function.r")   ### from book website; for plotting ROC curve

    model5 <- glm(clm ~ C(factor(agecat),base=3) + C(factor(area),base=3) +
                  C(factor(veh_body),base=10) + factor(valuecat), family=binomial)
    ## compute fitted values from logistic regression and store in fittedvalues
    fittedvalues <- predict(model5, type = "response", newdata = car)
    somers2(fittedvalues,clm)

          C     Dxy       n Missing
          e       e       e    e+00

    ROC(fittedvalues,clm)

The AUC is given as the element C of the somers2 result.

[Figure: ROC curve; axis labels Sensitivity and specificity]

5 Ordinal regression

Proportional odds model

Several functions for this model are available. We prefer vglm in the VGAM package. The VGAM manual is worth consulting before attempting to implement the next three models.

    injury <- read.table("injury.csv",sep=",",header=TRUE)
    attach(injury)
    library(VGAM)

    Loading required package: splines
    Loading required package: stats4
    Attaching package: VGAM
    [warnings omitted]

    ## change base levels to those in the text
    ## (not necessary, this is just to demonstrate that the solution is the same
    ## as the SAS solution)
    road.x <- C(factor(roaduserclass),base=4)
    age.x <- C(factor(agecat),base=7)
    sex.x <- C(sex,base=2)
    model1 <- vglm(degree ~ road.x + age.x + sex.x + age.x*sex.x,
                   cumulative(parallel=TRUE),
                   weights=number)
    summary(model1)

    vglm(formula = degree ~ road.x + age.x + sex.x + age.x * sex.x,
        family = cumulative(parallel = TRUE), weights = number)

    Pearson Residuals:
    logit(P[Y<=1])
    logit(P[Y<=2])

                   Value Std. Error t value
    (Intercept):
    (Intercept):
    road.x
    road.x
    road.x
    age.x
    age.x
    age.x
    age.x
    age.x
    age.x
    sex.x
    age.x1:sex.x
    age.x2:sex.x
    age.x3:sex.x
    age.x4:sex.x
    age.x5:sex.x
    age.x6:sex.x

    Number of linear predictors: 2
    Names of linear predictors: logit(P[Y<=1]), logit(P[Y<=2])
    Dispersion Parameter for cumulative family: 1
    Residual Deviance:  on 400 degrees of freedom
    Log-likelihood:  on 400 degrees of freedom
    Number of Iterations: 7

Partial proportional odds model

We use vglm for this model. The partial proportional odds are specified via the parallel parameter.

    model2 <- vglm(degree ~ road.x + age.x + sex.x + age.x*sex.x,
                   cumulative(parallel=TRUE~age.x*sex.x-1),
                   weights=number)
    summary(model2)

    vglm(formula = degree ~ road.x + age.x + sex.x + age.x * sex.x,
        family = cumulative(parallel = TRUE ~ age.x * sex.x - 1),
        weights = number)

    Pearson Residuals:
    logit(P[Y<=1])
    logit(P[Y<=2])

                   Value Std. Error t value
    (Intercept):
    (Intercept):
    road.x1:
    road.x1:
    road.x2:
    road.x2:
    road.x3:
    road.x3:
    age.x

    age.x
    age.x
    age.x
    age.x
    age.x
    sex.x
    age.x1:sex.x
    age.x2:sex.x
    age.x3:sex.x
    age.x4:sex.x
    age.x5:sex.x
    age.x6:sex.x

    Number of linear predictors: 2
    Names of linear predictors: logit(P[Y<=1]), logit(P[Y<=2])
    Dispersion Parameter for cumulative family: 1
    Residual Deviance:  on 397 degrees of freedom
    Log-likelihood:  on 397 degrees of freedom
    Number of Iterations: 7

6 Nominal regression

As the private health insurance data are not publicly available, nominal regression is illustrated here on the degree of injury data. The vglm function in the VGAM package is used.

    injury <- read.table("injury.csv",sep=",",header=TRUE)
    attach(injury)
    library(VGAM)

    Loading required package: splines
    Loading required package: stats4
    Attaching package: VGAM
    [warnings omitted]

    ## change base levels to those in the text
    road.x <- C(factor(roaduserclass),base=4)
    age.x <- C(factor(agecat),base=7)
    sex.x <- C(sex,base=2)
    ## nominal regression model
    model3 <- vglm(degree ~ road.x + age.x + sex.x + age.x*sex.x,
                   multinomial, weights=number)
    summary(model3)

    vglm(formula = degree ~ road.x + age.x + sex.x + age.x * sex.x,
        family = multinomial, weights = number)

    Pearson Residuals:
    log(mu[,1]/mu[,3])
    log(mu[,2]/mu[,3])

                   Value Std. Error t value
    (Intercept):
    (Intercept):
    road.x1:
    road.x1:
    road.x2:
    road.x2:
    road.x3:
    road.x3:

    age.x1:
    age.x1:
    age.x2:
    age.x2:
    age.x3:
    age.x3:
    age.x4:
    age.x4:
    age.x5:
    age.x5:
    age.x6:
    age.x6:
    sex.x1:
    sex.x1:
    age.x1:sex.x1:
    age.x1:sex.x1:
    age.x2:sex.x1:
    age.x2:sex.x1:
    age.x3:sex.x1:
    age.x3:sex.x1:
    age.x4:sex.x1:
    age.x4:sex.x1:
    age.x5:sex.x1:
    age.x5:sex.x1:
    age.x6:sex.x1:
    age.x6:sex.x1:

    Number of linear predictors: 2
    Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
    Dispersion Parameter for multinomial family: 1
    Residual Deviance:  on 384 degrees of freedom
    Log-likelihood:  on 384 degrees of freedom
    Number of Iterations: 7

7 Gamma regression

Vehicle insurance

    car <- read.table("car.csv",sep=",",header=TRUE)
    #### banded vehicle value
    valuecat <- cut(car$veh_value, c(-1,2.5,5.0,7.5,10.0,12.5,100))
    #### create variables with same base levels as in the text
    age.x <- C(factor(car$agecat),base=3)        ## agecat=3 base level
    area.x <- C(factor(car$area),base=3)         ## area C is 3rd level
    gender.x <- C(factor(car$gender),base=2)     ## gender M is 2nd level
    veh_body.x <- C(factor(car$veh_body),base=10) ## SEDAN is 10th level
    car <- cbind(car,valuecat, age.x,area.x,gender.x,veh_body.x)
    model1 <- glm(claimcst0 ~ age.x + gender.x + age.x*gender.x + area.x + veh_body.x,
                  family=Gamma(link="log"),data=subset(car,clm==1))
    summary(model1)

    glm(formula = claimcst0 ~ age.x + gender.x + age.x * gender.x +
        area.x + veh_body.x, family = Gamma(link = "log"),
        data = subset(car, clm == 1))

    [output omitted]

    (Dispersion parameter for Gamma family taken to be )

    Null deviance:  on 4623 degrees of freedom
    Residual deviance:  on 4595 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 7

Personal injury insurance, no adjustment for quickly settled claims

    persinj <- read.table("persinj.csv",sep=",",header=TRUE)
    model3 <- glm(total ~ op_time + factor(legrep) + op_time*factor(legrep),
                  family=Gamma(link="log"), data=persinj)
    summary(model3)

    glm(formula = total ~ op_time + factor(legrep) + op_time * factor(legrep),
        family = Gamma(link = "log"), data = persinj)

                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                        < 2e-16 ***
    op_time                                            < 2e-16 ***
    factor(legrep)                                     < 2e-16 ***
    op_time:factor(legrep)                             e-10 ***

    (Dispersion parameter for Gamma family taken to be )

    Null deviance:  on  degrees of freedom
    Residual deviance:  on  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 6

Runoff triangle

    runoff <- read.table("runoff triangle.csv",sep=",",header=TRUE)
    runoff$Y[runoff$Y<0] <- 1   ### replace negative value by 1
    model4 <- glm(Y ~ factor(devyear) + factor(accyear),
                  family=Gamma(link="log"), data=runoff)
    summary(model4)

    glm(formula = Y ~ factor(devyear) + factor(accyear),
        family = Gamma(link = "log"), data = runoff)

                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                 < 2e-16 ***
    factor(devyear)                             *
    factor(devyear)                             *
    factor(devyear)
    factor(devyear)
    factor(devyear)

    factor(devyear)                             *
    factor(devyear)                             *
    factor(devyear)                             **
    factor(devyear)                             **
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)
    factor(accyear)

    (Dispersion parameter for Gamma family taken to be )

    Null deviance:  on 54 degrees of freedom
    Residual deviance:  on 36 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 11

8 Inverse Gaussian regression

The data frame car used here is the one created for the vehicle insurance Gamma regression model.

    model2 <- glm(claimcst0 ~ age.x + gender.x + area.x,
                  family=inverse.gaussian(link="log"),data=subset(car,clm==1))
    summary(model2)

    glm(formula = claimcst0 ~ age.x + gender.x + area.x,
        family = inverse.gaussian(link = "log"), data = subset(car, clm == 1))

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                             < 2e-16 ***
    age.x                                   *
    age.x
    age.x
    age.x
    age.x
    gender.x                                **
    area.x
    area.x
    area.x
    area.x
    area.x                                  *

    (Dispersion parameter for inverse.gaussian family taken to be )

    Null deviance:  on 4623 degrees of freedom
    Residual deviance:  on 4612 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 11

9 Logistic regression GLMM

The software in this area is developing very rapidly. Here we use glmmPQL in the MASS package.

    claimslong <- read.table("claimslong.txt",header=TRUE,sep=",")
    ## create binary variable for claim/no claim
    claimslong <- cbind(claimslong,clm=1*(claimslong$numclaims>0))
    #### create variables with same base levels as in the text, for comparability
    age.x <- C(factor(claimslong$agecat),base=6)
    value.x <- C(factor(claimslong$valuecat),base=6)
    period.x <- C(factor(claimslong$period),base=3)
    claimslong <- cbind(claimslong,age.x,value.x,period.x)
    library(MASS)
    model1 <- glmmPQL(clm ~ age.x + value.x + period.x,
                      random=~1|policyid, family=binomial, data=claimslong)

    Loading required package: nlme
    iteration 1
    iteration 2
    iteration 3
    iteration 4
    iteration 5
    iteration 6
    iteration 7
    iteration 8

    summary(model1)

    Linear mixed-effects model fit by maximum likelihood
     Data: claimslong
      AIC BIC logLik
       NA  NA     NA

    Random effects:
     Formula: ~1 | policyid
            (Intercept) Residual
    StdDev:

    Variance function:
     Structure: fixed weights
     Formula: ~invwt
    Fixed effects: clm ~ age.x + value.x + period.x
                Value Std.Error DF t-value p-value
    (Intercept)
    age.x
    age.x
    age.x
    age.x
    age.x
    value.x
    value.x
    value.x
    value.x
    value.x
    period.x
    period.x
     Correlation:
    [correlation matrix omitted]

    Standardized Within-Group Residuals:
    Min Q1 Med Q3 Max

    Number of Observations:
    Number of Groups:

Parameter estimates are similar to those produced by SAS. They are not identical because proc nlmixed and glmmPQL use different methods for finding the maximum likelihood solution.

10 Logistic regression GEE

As for GLMMs, software for these models is evolving constantly. We use geeglm in the geepack package, which gives identical parameter estimates to proc genmod.

    library(geepack)
    model2 <- geeglm(clm ~ age.x + value.x + period.x,
                     id=policyid, corstr="exchangeable", family=binomial, data=claimslong)
    summary(model2)

    geeglm(formula = clm ~ age.x + value.x + period.x, family = binomial,
        data = claimslong, id = policyid, corstr = "exchangeable")

                Estimate Std.err Wald p(>W)
    (Intercept)                e e+00
    age.x                      e e-06
    age.x                      e e-01
    age.x                      e e-01
    age.x                      e e-08
    age.x                      e e-04
    value.x                    e e-09
    value.x                    e e-01
    value.x                    e e-02
    value.x                    e e-01
    value.x                    e e-01
    period.x                   e e+00
    period.x                   e e-13

    Estimated Scale Parameters:
                Estimate Std.err
    (Intercept)

    Correlation: Structure = exchangeable  Link = identity

    Estimated Correlation Parameters:
          Estimate Std.err
    alpha

    Number of clusters:    Maximum cluster size: 3

11 Logistic regression GAM

GAMs can be fitted using either the special purpose gam package, or the more general gamlss package. We illustrate the use of both.

    ######## vehicle insurance data
    car <- read.table("car.csv",sep=",",header=TRUE)
    #### banded vehicle value
    valuecat <- cut(car$veh_value, c(-1,2.5,5.0,7.5,10.0,12.5,100))
    #### create variables with same base levels as in the text
    age.x <- C(factor(car$agecat),base=3)        ## agecat=3 base level
    area.x <- C(factor(car$area),base=3)         ## area C is 3rd level
    gender.x <- C(factor(car$gender),base=2)     ## gender M is 2nd level
    veh_body.x <- C(factor(car$veh_body),base=10) ## SEDAN is 10th level
    car <- cbind(car,valuecat, age.x,area.x,gender.x,veh_body.x)
    ### use gam in gam package:
    library(gam)

    Loading required package: splines

    model1 <- gam(clm ~ age.x + area.x + veh_body.x + s(veh_value),
                  family=binomial, data=car)
    summary(model1)

    gam(formula = clm ~ age.x + area.x + veh_body.x + s(veh_value),
        family = binomial, data = car)

    (Dispersion Parameter for binomial family taken to be 1)

    Null Deviance:  on  degrees of freedom
    Residual Deviance:  on  degrees of freedom
    AIC:

    Number of Local Scoring Iterations: 7

    DF for Terms and Chi-squares for Nonparametric Effects

                 Df Npar Df Npar Chisq P(Chi)
    (Intercept)   1
    age.x         5
    area.x        5
    veh_body.x   12
    s(veh_value)                       e-06

    par(mfrow=c(2,2))
    plot(model1)

[Figure: four panels showing the partial effects for age.x, area.x and veh_body.x, and s(veh_value) against veh_value]

The highly nonlinear effect of vehicle value, with a peak around 4 ($40 000), is seen clearly.

The gamlss implementation gives parameter estimates for the parametric explanatory variables, which are similar to those given by proc gam.

    ### use gamlss:
    library(gamlss)

    Loading required package: splines
    ********** GAMLSS Version **********
    For more on GAMLSS look at
    Type gamlssNews() to see new features/changes/bug fixes.

    model2 <- gamlss(clm ~ age.x + area.x + veh_body.x + cs(veh_value),
                     family=BI, data=car)

    GAMLSS-RS iteration 1: Global Deviance =
    GAMLSS-RS iteration 2: Global Deviance =

    summary(model2)

    *******************************************************************
    Family: c("BI", "Binomial")

    gamlss(formula = clm ~ age.x + area.x + veh_body.x + cs(veh_value),
        family = BI, data = car)

    Fitting method: RS()

    -------------------------------------------------------------------
    Mu link function: logit
    Mu Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               e+00
    age.x                                     e-05
    age.x                                     e-01
    age.x                                     e-01
    age.x                                     e-05
    age.x                                     e-04
    area.x                                    e-01
    area.x                                    e-01
    area.x                                    e-02
    area.x                                    e-01
    area.x                                    e-01
    veh_body.x                                e-03
    veh_body.x                                e-01
    veh_body.x                                e-02
    veh_body.x                                e-01
    veh_body.x                                e-01
    veh_body.x                                e-02
    veh_body.x                                e-01
    veh_body.x                                e-01
    veh_body.x                                e-01
    veh_body.x                                e-01
    veh_body.x                                e-01
    veh_body.x                                e-04
    cs(veh_value)                             e-07

    -------------------------------------------------------------------
    No. of observations in the fit:
    Degrees of Freedom for the fit:
    Residual Deg. of Freedom:
    at cycle: 2

    Global Deviance:
    AIC:
    SBC:
    *******************************************************************
    Warning message:
    addive terms exists in the mu formula results maybe are not appropriate
    in: vcov.gamlss(object, "all")
