An Introduction to GAMs based on penalized regression splines
Simon Wood, Mathematical Sciences, University of Bath, U.K.


Generalized Additive Models (GAMs)

A GAM has a form something like:

$g\{E(y_i)\} = \eta_i = X_i\beta + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + f_3(x_{4i}) + \cdots$

$g$ is a known link function. The $y_i$ are independent, with some exponential family distribution; crucially this means $\mathrm{var}(y_i) = V(E(y_i))\phi$, where $V$ is a known, distribution-dependent function. The $f_j$ are smooth unknown functions (subject to centering conditions), and $X_i\beta$ is the parametric bit. i.e. a GAM is a GLM where the linear predictor depends on smooth functions of covariates.
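As a concrete sketch (with a hypothetical data frame dat, response y, smooth covariates x0, x1, x2 and parametric covariate x3), a model of this form is specified in mgcv along these lines:

library(mgcv)
# y modelled with a smooth of x0, a joint smooth of x1 and x2, and a
# parametric effect of x3; the family matches the response distribution
b <- gam(y ~ s(x0) + s(x1, x2) + x3, family = poisson, data = dat)
summary(b)  # effective degrees of freedom, approximate significance, etc.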

GAM Representation and estimation

Originally GAMs were estimated by backfitting, with any scatterplot smoother used to estimate the $f_j$... but it was difficult to estimate the degree of smoothness. Now the tendency is to represent the $f_j$ using basis expansions of moderate size, and to apply tuneable quadratic penalties to the model likelihood, to avoid overfit... this makes it easier to estimate the degree of smoothness, by estimating the tuning parameters/smoothing parameters.

Example basis-penalty: P-splines

Eilers and Marx have popularized the use of B-spline bases with discrete penalties. If $b_k(x)$ is a B-spline and $\beta_k$ an unknown coefficient, then

$f(x) = \sum_{k=1}^{K} \beta_k b_k(x).$

Wiggliness can be penalized by e.g.

$P = \sum_{k=2}^{K-1} (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2 = \beta^T S \beta.$

[Figure: the B-spline basis functions $b_k(x)$, and an example fit $f(x)$ built from them.]
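A minimal sketch of these ingredients in base R (assuming a covariate x on [0, 1]; the knot placement shown is just one reasonable choice):

library(splines)
K <- 10                                   # basis dimension
h <- 1 / (K - 3)                          # knot spacing for cubic B-splines
knots <- seq(-3 * h, 1 + 3 * h, by = h)   # K + 4 evenly spaced knots
x <- seq(0, 1, length = 200)
y <- sin(3 * pi * x)^2 + rnorm(200, 0, 0.2)            # toy data to smooth
B <- splineDesign(knots, x, ord = 4, outer.ok = TRUE)  # n x K basis matrix
D <- diff(diag(K), differences = 2)       # second-order difference matrix
S <- t(D) %*% D                           # penalty matrix: P = beta' S beta

A penalized fit for given lambda then minimizes ||y - B beta||^2 + lambda * beta' S beta, e.g. via solve(t(B) %*% B + lambda * S, t(B) %*% y).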

Other reduced rank splines

Reduced rank versions of splines with derivative based penalties often have slightly better MSE performance. e.g. choose a set of knots, $x_k^*$, spread nicely in the range of the covariate $x$, and obtain the cubic spline basis based on the $x_k^*$, i.e. the basis that arises by minimizing, e.g.,

$\sum_k \{y_k^* - f(x_k^*)\}^2 + \lambda \int f''(x)^2 \, dx$

w.r.t. $f$. Choosing the knot locations for any penalized spline type smoother is rather arbitrary. It can be avoided by taking a reduced rank eigen approximation to a full spline (you actually get an optimal low rank basis this way).

[Figure: a rank 15 eigen approximation to a 2D thin plate spline, $f$.]

Other basis penalty smoothers

Many other basis-penalty smoothers are possible. Tensor product smooths are basis-penalty smooths of several covariates, constructed (automatically) from smooths of single covariates. Tensor product smooths are immune to covariate rescaling, provided that they are multiply penalized. Finite area smoothing is also possible (look out for soap film smoothing).

Estimation

Whatever the basis, the GAM becomes $g\{E(y_i)\} = X_i\beta$, a richly parameterized GLM. To avoid overfit, estimate $\beta$ to minimize the penalized deviance

$D(\beta) + \sum_j \lambda_j \beta^T S_j \beta.$

The $\lambda_j$ control the fit-smoothness (variance-bias) tradeoff. This objective can be motivated directly, or by putting a prior on function wiggliness, $\propto \exp(-\sum_j \lambda_j \beta^T S_j \beta / 2)$, so a GAM is also a GLMM, and the $\lambda_j$ are variance parameters. Given the $\lambda_j$, the actual fitting of $\beta$ is by a penalized version of IRLS (Fisher scoring or full Newton), or by MCMC.
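For concreteness, a bare-bones penalized IRLS loop for a log-link Poisson model might look like the following (an illustrative sketch, not mgcv's actual implementation; the model matrix X, penalty S and smoothing parameter lambda are assumed given):

pirls <- function(y, X, S, lambda, maxit = 100, tol = 1e-8) {
  eta <- log(y + 0.1)                     # crude starting linear predictor
  for (i in 1:maxit) {
    mu <- exp(eta)                        # inverse of the log link
    z  <- eta + (y - mu) / mu             # working response
    w  <- as.vector(mu)                   # IRLS weights for the log link
    beta <- solve(t(X) %*% (w * X) + lambda * S, t(X) %*% (w * z))
    eta_new <- X %*% beta
    if (max(abs(eta_new - eta)) < tol) break
    eta <- eta_new
  }
  beta
}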

Smoothness selection

Various criteria can be minimized for $\lambda$ selection/estimation.

Cross validation leads to a GCV criterion

$V_g = D(\hat\beta) / \{n - \mathrm{tr}(A)\}^2.$

AIC or Mallows' Cp leads to

$V_a = D(\hat\beta) + 2\,\mathrm{tr}(A)\,\phi.$

Taking the Bayesian/mixed model approach seriously, a REML based criterion is

$V_r = D(\hat\beta)/\phi + \hat\beta^T S \hat\beta/\phi + \log|X^T W X + S| - \log|S|_+ - 2 l_s \ldots$

Here $W$ is the diagonal matrix of IRLS weights, $S = \sum_j \lambda_j S_j$, $A = X(X^TWX + S)^{-1}X^TW$ (the trace of which is the model EDF), $|S|_+$ is the product of the non-zero eigenvalues of $S$, and $l_s$ is the saturated log likelihood.
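As a hedged illustration for the Gaussian case (where the deviance is just the residual sum of squares, and reusing B, y and S from the earlier P-spline sketch), GCV can be evaluated over a grid of smoothing parameters with the naive influence-matrix formula; the "Why worry about stability?" slides below show why this naive form can fail:

gcv <- function(loglam, X, y, S) {
  lambda <- exp(loglam)
  A <- X %*% solve(t(X) %*% X + lambda * S, t(X))  # influence matrix (naive!)
  n <- length(y)
  sum((y - A %*% y)^2) / (n - sum(diag(A)))^2      # D(beta-hat)/{n - tr(A)}^2
}
grid <- seq(-10, 10, length = 50)
V <- sapply(grid, gcv, X = B, y = y, S = S)
best <- grid[which.min(V)]                         # GCV-selected log(lambda)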

Numerical Methods for optimizing λ

All the criteria can reliably be optimized by Newton's method (outer to the PIRLS $\beta$ estimation). We need derivatives of $V$ w.r.t. $\log(\lambda_j)$ for this...

1. Get derivatives of $\hat\beta$ w.r.t. $\log(\lambda_j)$ by differentiating the PIRLS, or by an Implicit Function Theorem approach.
2. Given these, we can get the derivatives of $W$, and hence of $\mathrm{tr}(A)$, w.r.t. the $\log(\lambda_j)$, as well as the derivatives of $D$.

The derivatives of GCV, AIC and REML have very similar ingredients. Some care is needed to ensure maximum efficiency and stability. MCMC and boosting offer alternatives for estimating $\lambda$ and $\beta$.

Why worry about stability? A simple example

The $x, y$ data on the left were modelled using the cubic spline on the right (full TPS basis, $\lambda$ chosen by GCV).

[Figure: scatterplot of the $x, y$ data, and the fitted smooth $s(x, 15.07)$.]

The next slide compares the GCV calculation based on the naive normal equations calculation $\hat\beta = (X^TX + \lambda S)^{-1}X^Ty$ with a stable QR based alternative...

Stability matters for λ selection!

[Figure: $\log(\mathrm{GCV})$ against $\log_{10}(\lambda)$ for three computations: normal equations with Choleski, normal equations with SVD, and the stable QR method.]

...automatic minimization of the red or green (normal equations) versions of GCV is not a good idea.
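A sketch of the stable alternative (under the assumption that we form a square root B of the penalty, so that t(B) %*% B == S): fit by QR decomposition of an augmented model matrix, never forming X'X explicitly:

fit_qr <- function(y, X, S, lambda) {
  eS <- eigen(S, symmetric = TRUE)
  B  <- sqrt(pmax(eS$values, 0)) * t(eS$vectors)  # square root: t(B) %*% B == S
  Xa <- rbind(X, sqrt(lambda) * B)                # augmented model matrix
  ya <- c(y, rep(0, nrow(B)))                     # augmented response
  qr.coef(qr(Xa), ya)                             # beta-hat via QR, no X'X formed
}

Minimizing the augmented residual sum of squares ||ya - Xa beta||^2 is exactly minimizing ||y - X beta||^2 + lambda * beta' S beta, but with far better conditioning.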

GAM inference

The best calibrated inference, in a frequentist sense, seems to arise by taking a Bayesian approach. Recall the prior on function wiggliness,

$\propto \exp\Big(-\tfrac{1}{2}\sum_j \lambda_j \beta^T S_j \beta\Big),$

an improper Gaussian on $\beta$. Bayes' rule and some asymptotics then give the posterior

$\beta \mid y \sim N\big(\hat\beta,\ (X^TWX + \textstyle\sum_j \lambda_j S_j)^{-1}\phi\big).$

This gives e.g. CIs for the $f_j$, but we can also simulate from the posterior very cheaply, to make inferences about anything the GAM predicts.
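The posterior simulation recipe is straightforward with a fitted mgcv model b and (hypothetical) new data newd: predict with type = "lpmatrix" to get the matrix mapping coefficients to predictions, then draw coefficients from the approximate posterior:

library(MASS)
Xp <- predict(b, newdata = newd, type = "lpmatrix")  # maps beta to predictions
br <- mvrnorm(1000, coef(b), vcov(b))                # draws from posterior of beta
pred <- Xp %*% t(br)                                 # posterior draws of predictions
ci <- apply(pred, 1, quantile, probs = c(0.025, 0.975))  # pointwise 95% interval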

GAM inference II

The Bayesian CIs have good across-the-function frequentist coverage probabilities, provided the smoothing bias is somewhat less than the variance. Neglect of smoothing parameter uncertainty is not very important here. An extension of Nychka (1988, JASA) is the key to understanding these results. P-values for testing model components for equality to zero are also possible, by inverting the Bayesian CI for the component. The p-value properties are less good than those of the CIs.

Example: retinopathy data

[Figure: pairs plot of the retinopathy data: ret (0/1), bmi, gly and dur.]

Retinopathy models?

Question: how is development of retinopathy in diabetics related to duration of disease at baseline (dur), body mass index (bmi) and percentage glycosylated haemoglobin (gly)? A possible model is

$\mathrm{logit}\{E(\mathtt{ret})\} = f_1(\mathtt{dur}) + f_2(\mathtt{bmi}) + f_3(\mathtt{gly}) + f_4(\mathtt{dur}, \mathtt{bmi}) + f_5(\mathtt{dur}, \mathtt{gly}) + f_6(\mathtt{gly}, \mathtt{bmi}),$

where ret is Bernoulli. In R, the model is fit with something like

gam(ret ~ te(dur) + te(gly) + te(bmi) + te(dur,gly) + te(dur,bmi) + te(gly,bmi), family = binomial)

Retinopathy Estimated effects

[Figure: estimated main effects s(dur, 3.26), s(gly, 1) and s(bmi, 2.67), and tensor product interactions te(gly,bmi, 2.5), te(dur,gly, 0) and te(dur,bmi, 0); the numbers after each term are its effective degrees of freedom.]

Retinopathy GLY-BMI interaction

[Figure: the linear predictor plotted against gly and bmi; red/green surfaces show +/− s.e.]

GAM extensions

To obtain a satisfactory framework for generalized additive modelling has required solving a rather more general estimation problem... The GAM framework can cope with any quadratically penalized GLM where the smoothing parameters enter the objective linearly. Consequently the following example extensions can all be used without new theory (see the sketch after this list):

- Varying coefficient models, where a coefficient in a GLM is allowed to vary smoothly with another covariate.
- Model terms involving any linear functional of a smooth function, for example functional GLMs.
- Simple random effects, since a random effect can be treated as a smooth.
- Adaptive smooths, constructed by using multiple penalties for a smooth.
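For instance, hedged mgcv sketches of the first and third extensions (hypothetical variable names):

# varying coefficient model: the coefficient of numeric z varies smoothly with x
b1 <- gam(y ~ s(x, by = z), data = dat)
# simple i.i.d. random effect for a grouping factor id, treated as a smooth
b2 <- gam(y ~ s(x) + s(id, bs = "re"), data = dat)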

Example: functional covariates

Consider data on 150 functions, $x_i(t)$ (each observed at $t \in T = (t_1, \ldots, t_{200})$), with corresponding noisy univariate responses, $y_i$.

[Figure: the first 9 $(x_i(t), y_i)$ pairs, each panel labelled with its response value.]

F-GLM

An appropriate model might be the functional GLM

$g\{E(y_i)\} = \int f(t)\, x_i(t)\, dt,$

where the predictor $x_i$ is a known function and $f(t)$ is an unknown smooth regression coefficient function. Typically $f$ and $x_i$ are discretized, so that $g\{E(y_i)\} = \mathbf{f}^T \mathbf{x}_i$, where $\mathbf{f}^T = [f(t_1), f(t_2), \ldots]$ and $\mathbf{x}_i^T = [x_i(t_1), x_i(t_2), \ldots]$. Generically this is an example of dependence on a linear functional of a smooth. The R package mgcv has a simple summation convention mechanism to handle such terms...

FGLM fitting

We want to estimate the smooth, $f$, in the model $y_i = \int f(t)\, x_i(t)\, dt + \epsilon_i$.

gam(y ~ s(T, by = X)) will do this, if T and X are matrices: the i-th row of X is the observed (discretized) function $x_i(t)$, and each row of T is a replicate of the observation time vector $t$.

[Figure: the estimated coefficient function, labelled s(t, 6.54):x.]

Adaptive smoothing

Perhaps I don't like this P-spline smooth of the motorcycle crash data...

gam(accel ~ s(times, bs = "ps", k = 40))

[Figure: the fitted P-spline smooth, s(times, 10.29).]

Should I really use adaptive smoothing?

Adaptive smoothing 2

P-splines and the preceding GAM framework make it very easy to do adaptive smoothing. Use a B-spline basis $f(x) = \sum_j \beta_j b_j(x)$, with an adaptive penalty

$P = \sum_{k=2}^{K-1} c_k (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2,$

where $c_k$ varies smoothly with $k$, and hence with $x$. Defining $d_k = \beta_{k-1} - 2\beta_k + \beta_{k+1}$, and $D$ to be the matrix such that $\mathbf{d} = D\beta$, we have $P = \beta^T D^T \mathrm{diag}(\mathbf{c}) D \beta$. Now use a (B-spline) basis expansion for $\mathbf{c}$, so that $\mathbf{c} = C\lambda$. Then

$P = \sum_j \lambda_j \beta^T D^T \mathrm{diag}(C_{\cdot j}) D \beta,$

where $C_{\cdot j}$ is the $j$th column of $C$; i.e. the adaptive P-spline is just a P-spline with multiple penalties.
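A sketch of this multiple-penalty construction (using splines::bs for the basis of c; purely illustrative, with an arbitrary choice of 5 basis functions):

library(splines)
K <- 40
D <- diff(diag(K), differences = 2)            # (K-2) x K second-difference matrix
C <- bs(1:(K - 2), df = 5, intercept = TRUE)   # B-spline basis for c over penalty index
S_list <- lapply(seq_len(ncol(C)),
                 function(j) t(D) %*% (C[, j] * D))  # S_j = D' diag(C[,j]) D
# total penalty: sum over j of lambda[j] * t(beta) %*% S_list[[j]] %*% beta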

Adaptive smoothing 3

The R package mgcv has an adaptive P-spline class. Using it does give some improvement...

gam(accel ~ s(times, bs = "ad", k = 40))

[Figure: the adaptive fit, s(times, 8.56).]

Conclusions

Penalized regression splines are the starting point for a fairly complete framework for Generalized Additive Modelling. The numerical methods and theory developed for this framework are applicable to any quadratically penalized GLM, so many extensions of standard GAMs are possible. The R package mgcv tries to exploit the generality of the framework, so that almost any quadratically penalized GLM can readily be used.

References

Hastie and Tibshirani (1986) invented GAMs. The work of Wahba (e.g. 1990) and Gu (e.g. 2002) heavily influenced the work presented here. Duchon (1977) invented thin plate splines. The retinopathy data are from Gu. Penalized regression splines go back to Wahba (1980), but were given real impetus by Eilers and Marx (1996), and in a GAM context by Marx and Eilers (1998). See Wood (2006), Generalized Additive Models: An Introduction with R, CRC, for more information. Wood (2008, JRSSB) is more up to date on numerical methods. The mgcv package in R implements everything covered here.