An Introduction to GAMs based on penalized regression splines
Simon Wood, Mathematical Sciences, University of Bath, U.K.


Generalized Additive Models (GAMs)

A GAM has a form something like:

$g\{E(y_i)\} = \eta_i = X_i\beta + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + f_3(x_{4i}) + \cdots$

$g$ is a known link function. The $y_i$ are independent, with some exponential family distribution; crucially this means $\mathrm{var}(y_i) = V(E(y_i))\phi$, where $V$ is a known, distribution-dependent function. The $f_j$ are smooth unknown functions (subject to centering conditions), and $X_i\beta$ is the parametric bit. i.e. a GAM is a GLM where the linear predictor depends on smooth functions of covariates.
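As a concrete sketch (with a hypothetical data frame dat, response y, smooth covariates x0, x1, x2 and parametric covariate x3), a model of this form is specified in mgcv along these lines:

library(mgcv)
# y modelled with a smooth of x0, a joint smooth of x1 and x2, and a
# parametric effect of x3; the family matches the response distribution
b <- gam(y ~ s(x0) + s(x1, x2) + x3, family = poisson, data = dat)
summary(b)  # effective degrees of freedom, approximate significance, etc.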

GAM Representation and estimation

Originally GAMs were estimated by backfitting, with any scatterplot smoother used to estimate the $f_j$... but it was difficult to estimate the degree of smoothness. Now the tendency is to represent the $f_j$ using basis expansions of moderate size, and to apply tuneable quadratic penalties to the model likelihood, to avoid overfit... this makes it easier to estimate the degree of smoothness, by estimating the tuning parameters/smoothing parameters.

Example basis-penalty: P-splines

Eilers and Marx have popularized the use of B-spline bases with discrete penalties. If $b_k(x)$ is a B-spline and $\beta_k$ an unknown coefficient, then

$f(x) = \sum_{k=1}^{K} \beta_k b_k(x).$

Wiggliness can be penalized by e.g.

$P = \sum_{k=2}^{K-1} (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2 = \beta^T S \beta.$

[Figure: the B-spline basis functions $b_k(x)$, and an example fit $f(x)$ built from them.]
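A minimal sketch of these ingredients in base R (assuming a covariate x on [0, 1]; the knot placement shown is just one reasonable choice):

library(splines)
K <- 10                                   # basis dimension
h <- 1 / (K - 3)                          # knot spacing for cubic B-splines
knots <- seq(-3 * h, 1 + 3 * h, by = h)   # K + 4 evenly spaced knots
x <- seq(0, 1, length = 200)
y <- sin(3 * pi * x)^2 + rnorm(200, 0, 0.2)            # toy data to smooth
B <- splineDesign(knots, x, ord = 4, outer.ok = TRUE)  # n x K basis matrix
D <- diff(diag(K), differences = 2)       # second-order difference matrix
S <- t(D) %*% D                           # penalty matrix: P = beta' S beta

A penalized fit for given lambda then minimizes ||y - B beta||^2 + lambda * beta' S beta, e.g. via solve(t(B) %*% B + lambda * S, t(B) %*% y).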

Other reduced rank splines

Reduced rank versions of splines with derivative based penalties often have slightly better MSE performance. e.g. choose a set of knots, $x_k^*$, spread nicely in the range of the covariate $x$, and obtain the cubic spline basis based on the $x_k^*$, i.e. the basis that arises by minimizing, e.g.,

$\sum_k \{y_k^* - f(x_k^*)\}^2 + \lambda \int f''(x)^2 \, dx$

w.r.t. $f$. Choosing the knot locations for any penalized spline type smoother is rather arbitrary. It can be avoided by taking a reduced rank eigen approximation to a full spline (you actually get an optimal low rank basis this way).

[Figure: a rank 15 eigen approximation to a 2D thin plate spline, $f$.]

Other basis penalty smoothers

Many other basis-penalty smoothers are possible. Tensor product smooths are basis-penalty smooths of several covariates, constructed (automatically) from smooths of single covariates. Tensor product smooths are immune to covariate rescaling, provided that they are multiply penalized. Finite area smoothing is also possible (look out for soap film smoothing).

Estimation

Whatever the basis, the GAM becomes $g\{E(y_i)\} = X_i\beta$, a richly parameterized GLM. To avoid overfit, estimate $\beta$ to minimize the penalized deviance

$D(\beta) + \sum_j \lambda_j \beta^T S_j \beta.$

The $\lambda_j$ control the fit-smoothness (variance-bias) tradeoff. This objective can be motivated directly, or by putting a prior on function wiggliness, $\propto \exp(-\sum_j \lambda_j \beta^T S_j \beta / 2)$, so a GAM is also a GLMM, and the $\lambda_j$ are variance parameters. Given the $\lambda_j$, the actual fitting of $\beta$ is by a penalized version of IRLS (Fisher scoring or full Newton), or by MCMC.
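For concreteness, a bare-bones penalized IRLS loop for a log-link Poisson model might look like the following (an illustrative sketch, not mgcv's actual implementation; the model matrix X, penalty S and smoothing parameter lambda are assumed given):

pirls <- function(y, X, S, lambda, maxit = 100, tol = 1e-8) {
  eta <- log(y + 0.1)                     # crude starting linear predictor
  for (i in 1:maxit) {
    mu <- exp(eta)                        # inverse of the log link
    z  <- eta + (y - mu) / mu             # working response
    w  <- as.vector(mu)                   # IRLS weights for the log link
    beta <- solve(t(X) %*% (w * X) + lambda * S, t(X) %*% (w * z))
    eta_new <- X %*% beta
    if (max(abs(eta_new - eta)) < tol) break
    eta <- eta_new
  }
  beta
}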

Smoothness selection

Various criteria can be minimized for $\lambda$ selection/estimation.

Cross validation leads to a GCV criterion

$V_g = D(\hat\beta) / \{n - \mathrm{tr}(A)\}^2.$

AIC or Mallows' Cp leads to

$V_a = D(\hat\beta) + 2\,\mathrm{tr}(A)\,\phi.$

Taking the Bayesian/mixed model approach seriously, a REML based criterion is

$V_r = D(\hat\beta)/\phi + \hat\beta^T S \hat\beta/\phi + \log|X^T W X + S| - \log|S|_+ - 2 l_s \ldots$

Here $W$ is the diagonal matrix of IRLS weights, $S = \sum_j \lambda_j S_j$, $A = X(X^TWX + S)^{-1}X^TW$ (the trace of which is the model EDF), $|S|_+$ is the product of the non-zero eigenvalues of $S$, and $l_s$ is the saturated log likelihood.
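As a hedged illustration for the Gaussian case (where the deviance is just the residual sum of squares, and reusing B, y and S from the earlier P-spline sketch), GCV can be evaluated over a grid of smoothing parameters with the naive influence-matrix formula; the "Why worry about stability?" slides below show why this naive form can fail:

gcv <- function(loglam, X, y, S) {
  lambda <- exp(loglam)
  A <- X %*% solve(t(X) %*% X + lambda * S, t(X))  # influence matrix (naive!)
  n <- length(y)
  sum((y - A %*% y)^2) / (n - sum(diag(A)))^2      # D(beta-hat)/{n - tr(A)}^2
}
grid <- seq(-10, 10, length = 50)
V <- sapply(grid, gcv, X = B, y = y, S = S)
best <- grid[which.min(V)]                         # GCV-selected log(lambda)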

Numerical Methods for optimizing λ

All the criteria can reliably be optimized by Newton's method (outer to the PIRLS $\beta$ estimation). We need derivatives of $V$ w.r.t. $\log(\lambda_j)$ for this...

1. Get derivatives of $\hat\beta$ w.r.t. $\log(\lambda_j)$ by differentiating the PIRLS, or by an Implicit Function Theorem approach.
2. Given these, we can get the derivatives of $W$, and hence of $\mathrm{tr}(A)$, w.r.t. the $\log(\lambda_j)$, as well as the derivatives of $D$.

The derivatives of GCV, AIC and REML have very similar ingredients. Some care is needed to ensure maximum efficiency and stability. MCMC and boosting offer alternatives for estimating $\lambda$ and $\beta$.

Why worry about stability? A simple example

The $x, y$ data on the left were modelled using the cubic spline on the right (full TPS basis, $\lambda$ chosen by GCV).

[Figure: scatterplot of the $x, y$ data, and the fitted smooth $s(x, 15.07)$.]

The next slide compares the GCV calculation based on the naive normal equations calculation $\hat\beta = (X^TX + \lambda S)^{-1}X^Ty$ with a stable QR based alternative...

Stability matters for λ selection!

[Figure: $\log(\mathrm{GCV})$ against $\log_{10}(\lambda)$ for three computations: normal equations with Choleski, normal equations with SVD, and the stable QR method.]

...automatic minimization of the red or green (normal equations) versions of GCV is not a good idea.
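A sketch of the stable alternative (under the assumption that we form a square root B of the penalty, so that t(B) %*% B == S): fit by QR decomposition of an augmented model matrix, never forming X'X explicitly:

fit_qr <- function(y, X, S, lambda) {
  eS <- eigen(S, symmetric = TRUE)
  B  <- sqrt(pmax(eS$values, 0)) * t(eS$vectors)  # square root: t(B) %*% B == S
  Xa <- rbind(X, sqrt(lambda) * B)                # augmented model matrix
  ya <- c(y, rep(0, nrow(B)))                     # augmented response
  qr.coef(qr(Xa), ya)                             # beta-hat via QR, no X'X formed
}

Minimizing the augmented residual sum of squares ||ya - Xa beta||^2 is exactly minimizing ||y - X beta||^2 + lambda * beta' S beta, but with far better conditioning.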

GAM inference

The best calibrated inference, in a frequentist sense, seems to arise by taking a Bayesian approach. Recall the prior on function wiggliness,

$\propto \exp\Big(-\tfrac{1}{2}\sum_j \lambda_j \beta^T S_j \beta\Big),$

an improper Gaussian on $\beta$. Bayes' rule and some asymptotics then give the posterior

$\beta \mid y \sim N\big(\hat\beta,\ (X^TWX + \textstyle\sum_j \lambda_j S_j)^{-1}\phi\big).$

This gives e.g. CIs for the $f_j$, but we can also simulate from the posterior very cheaply, to make inferences about anything the GAM predicts.
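The posterior simulation recipe is straightforward with a fitted mgcv model b and (hypothetical) new data newd: predict with type = "lpmatrix" to get the matrix mapping coefficients to predictions, then draw coefficients from the approximate posterior:

library(MASS)
Xp <- predict(b, newdata = newd, type = "lpmatrix")  # maps beta to predictions
br <- mvrnorm(1000, coef(b), vcov(b))                # draws from posterior of beta
pred <- Xp %*% t(br)                                 # posterior draws of predictions
ci <- apply(pred, 1, quantile, probs = c(0.025, 0.975))  # pointwise 95% interval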

GAM inference II

The Bayesian CIs have good across-the-function frequentist coverage probabilities, provided the smoothing bias is somewhat less than the variance. Neglect of smoothing parameter uncertainty is not very important here. An extension of Nychka (1988, JASA) is the key to understanding these results. P-values for testing model components for equality to zero are also possible, by inverting the Bayesian CI for the component. The p-value properties are less good than those of the CIs.

Example: retinopathy data

[Figure: pairs plot of the retinopathy data: ret (0/1), bmi, gly and dur.]

Retinopathy models?

Question: how is development of retinopathy in diabetics related to duration of disease at baseline (dur), body mass index (bmi) and percentage glycosylated haemoglobin (gly)? A possible model is

$\mathrm{logit}\{E(\mathtt{ret})\} = f_1(\mathtt{dur}) + f_2(\mathtt{bmi}) + f_3(\mathtt{gly}) + f_4(\mathtt{dur}, \mathtt{bmi}) + f_5(\mathtt{dur}, \mathtt{gly}) + f_6(\mathtt{gly}, \mathtt{bmi}),$

where ret is Bernoulli. In R, the model is fit with something like

gam(ret ~ te(dur) + te(gly) + te(bmi) + te(dur,gly) + te(dur,bmi) + te(gly,bmi), family = binomial)

Retinopathy Estimated effects

[Figure: estimated main effects s(dur, 3.26), s(gly, 1) and s(bmi, 2.67), and tensor product interactions te(gly,bmi, 2.5), te(dur,gly, 0) and te(dur,bmi, 0); the numbers after each term are its effective degrees of freedom.]

Retinopathy GLY-BMI interaction

[Figure: the linear predictor plotted against gly and bmi; red/green surfaces show +/− s.e.]

GAM extensions

To obtain a satisfactory framework for generalized additive modelling has required solving a rather more general estimation problem... The GAM framework can cope with any quadratically penalized GLM where the smoothing parameters enter the objective linearly. Consequently the following example extensions can all be used without new theory (see the sketch after this list):

- Varying coefficient models, where a coefficient in a GLM is allowed to vary smoothly with another covariate.
- Model terms involving any linear functional of a smooth function, for example functional GLMs.
- Simple random effects, since a random effect can be treated as a smooth.
- Adaptive smooths, constructed by using multiple penalties for a smooth.
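For instance, hedged mgcv sketches of the first and third extensions (hypothetical variable names):

# varying coefficient model: the coefficient of numeric z varies smoothly with x
b1 <- gam(y ~ s(x, by = z), data = dat)
# simple i.i.d. random effect for a grouping factor id, treated as a smooth
b2 <- gam(y ~ s(x) + s(id, bs = "re"), data = dat)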

Example: functional covariates

Consider data on 150 functions, $x_i(t)$ (each observed at $t \in T = (t_1, \ldots, t_{200})$), with corresponding noisy univariate responses, $y_i$.

[Figure: the first 9 $(x_i(t), y_i)$ pairs, each panel labelled with its response value.]

F-GLM

An appropriate model might be the functional GLM

$g\{E(y_i)\} = \int f(t)\, x_i(t)\, dt,$

where the predictor $x_i$ is a known function and $f(t)$ is an unknown smooth regression coefficient function. Typically $f$ and $x_i$ are discretized, so that $g\{E(y_i)\} = \mathbf{f}^T \mathbf{x}_i$, where $\mathbf{f}^T = [f(t_1), f(t_2), \ldots]$ and $\mathbf{x}_i^T = [x_i(t_1), x_i(t_2), \ldots]$. Generically this is an example of dependence on a linear functional of a smooth. The R package mgcv has a simple summation convention mechanism to handle such terms...

FGLM fitting

We want to estimate the smooth, $f$, in the model $y_i = \int f(t)\, x_i(t)\, dt + \epsilon_i$.

gam(y ~ s(T, by = X)) will do this, if T and X are matrices: the i-th row of X is the observed (discretized) function $x_i(t)$, and each row of T is a replicate of the observation time vector $t$.

[Figure: the estimated coefficient function, labelled s(t, 6.54):x.]

Adaptive smoothing

Perhaps I don't like this P-spline smooth of the motorcycle crash data...

gam(accel ~ s(times, bs = "ps", k = 40))

[Figure: the fitted P-spline smooth, s(times, 10.29).]

Should I really use adaptive smoothing?

Adaptive smoothing 2

P-splines and the preceding GAM framework make it very easy to do adaptive smoothing. Use a B-spline basis $f(x) = \sum_j \beta_j b_j(x)$, with an adaptive penalty

$P = \sum_{k=2}^{K-1} c_k (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2,$

where $c_k$ varies smoothly with $k$, and hence with $x$. Defining $d_k = \beta_{k-1} - 2\beta_k + \beta_{k+1}$, and $D$ to be the matrix such that $\mathbf{d} = D\beta$, we have $P = \beta^T D^T \mathrm{diag}(\mathbf{c}) D \beta$. Now use a (B-spline) basis expansion for $\mathbf{c}$, so that $\mathbf{c} = C\lambda$. Then

$P = \sum_j \lambda_j \beta^T D^T \mathrm{diag}(C_{\cdot j}) D \beta,$

where $C_{\cdot j}$ is the $j$th column of $C$; i.e. the adaptive P-spline is just a P-spline with multiple penalties.
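A sketch of this multiple-penalty construction (using splines::bs for the basis of c; purely illustrative, with an arbitrary choice of 5 basis functions):

library(splines)
K <- 40
D <- diff(diag(K), differences = 2)            # (K-2) x K second-difference matrix
C <- bs(1:(K - 2), df = 5, intercept = TRUE)   # B-spline basis for c over penalty index
S_list <- lapply(seq_len(ncol(C)),
                 function(j) t(D) %*% (C[, j] * D))  # S_j = D' diag(C[,j]) D
# total penalty: sum over j of lambda[j] * t(beta) %*% S_list[[j]] %*% beta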

Adaptive smoothing 3

The R package mgcv has an adaptive P-spline class. Using it does give some improvement...

gam(accel ~ s(times, bs = "ad", k = 40))

[Figure: the adaptive fit, s(times, 8.56).]

Conclusions

Penalized regression splines are the starting point for a fairly complete framework for Generalized Additive Modelling. The numerical methods and theory developed for this framework are applicable to any quadratically penalized GLM, so many extensions of standard GAMs are possible. The R package mgcv tries to exploit the generality of the framework, so that almost any quadratically penalized GLM can readily be used.

References

Hastie and Tibshirani (1986) invented GAMs. The work of Wahba (e.g. 1990) and Gu (e.g. 2002) heavily influenced the work presented here. Duchon (1977) invented thin plate splines. The retinopathy data are from Gu. Penalized regression splines go back to Wahba (1980), but were given real impetus by Eilers and Marx (1996), and in a GAM context by Marx and Eilers (1998). See Wood (2006), Generalized Additive Models: An Introduction with R, CRC, for more information. Wood (2008, JRSSB) is more up to date on numerical methods. The mgcv package in R implements everything covered here.