Modelling with smooth functions. Simon Wood University of Bath, EPSRC funded


1 Modelling with smooth functions Simon Wood University of Bath, EPSRC funded

2 Some data... Daily respiratory deaths, temperature and ozone in Chicago (NMMAPS). [Time series plots of daily deaths, temperature and ozone.] Apparently daily respiratory deaths are significantly negatively correlated with ozone and temperature. But obviously there is confounding here - we need a model.

3 A model...
death_i ~ Poi(µ_i)
log(µ_i) = α + f_1(t_i) + f_2(o3_i, tmp_i) + f_3(pm10_i)
The f_j are smooth functions to estimate.
1. α + f_1 is the (log) background respiratory mortality rate.
2. f_2 is modification by ozone and temperature (interacting).
3. f_3 is the modification from particulates.
Actually the predictors are aggregated over the 3 days preceding death. Fit in R (mgcv package):
gam(death ~ s(time,k=200) + te(o3,tmp) + s(pm10), family=poisson)

4 Air pollution model estimates... [Plots of the estimated smooths: s(time,99.75), te(o3,tmp,31.6) and s(pm10,3.09).] High ozone and temp associated with increased risk.

5 Some more data... Can we predict the octane rating of fuel from its near infra-red spectrum? 60 octane/spectrum pairs available (from pls package). [Plot: spectra, log(1/r) against wavelength (nm).] Try a signal regression model. k_i is the i-th spectrum, f is a smooth function...
octane_i = ∫ f(ν) k_i(ν) dν + ɛ_i

6 Signal regression model fit... If each row of matrix NIR contains a spectral intensity, and each row of nm the corresponding wavelengths, then
gam(octane ~ s(nm,by=NIR))
estimates the model. [Plots of the estimated f(ν), s(nm,9.16):NIR, and fitted vs. observed octane.]

7 And one more dataset. [Pairs plot of ret, bmi, dur and gly.] How does the probability of diabetic retinopathy relate to previous disease duration, body mass index and % glycosylated hemoglobin? Wisconsin study (see gss package). Model: ret_i ~ Bernoulli,
logit{E(ret_i)} = f_1(dur_i) + f_2(bmi_i) + f_3(gly_i) + f_4(dur_i, bmi_i) + f_5(dur_i, gly_i) + f_6(gly_i, bmi_i)

8 Retinopathy model estimates
gam(ret ~ s(dur) + s(gly) + s(bmi) + ti(dur,gly) + ti(dur,bmi) + ti(gly,bmi), family=binomial)
[Plots of the estimated terms: s(dur,3.26), s(gly,1), s(bmi,2.67), te(gly,bmi,2.5), te(dur,gly,0), te(dur,bmi,0).]

9 Additive smooth models: structured flexibility. The preceding are examples of generalized additive models (GAMs). A GAM is a GLM in which the linear predictor depends linearly on unknown smooth functions of covariates...
y_i ~ EF(µ_i, φ), g(µ_i) = f_1(x_1i) + f_2(x_2i) + ...
Hastie and Tibshirani (1990) and Wahba (1990) laid the foundations for these models as discussed here. If we can work with GAMs at all we can easily generalize:
g(µ_i) = A_i θ + Σ_j L_ij f_j(x_j) + Z_i b
The L_ij are linear functionals; A & Z are model matrices; θ and b are parameters and Gaussian random effects.

10 Additive smooth models & practical computation. Making GAMs work in practice requires at least 3 things:
1. A way of representing the smooth functions f_j.
2. A way of estimating the f_j.
3. Some means of deciding how smooth the f_j should be.
3 is the awkward part of the enterprise, and strongly influences 1 and 2. As well as point estimates we also need further inferential tools, such as interval estimates, model selection methods, AIC, p-values etc. We'd also like to go beyond the univariate exponential family. This talk will cover these things in order.

11 Representing smooth functions: splines. To motivate how to represent several smooth terms in a model, first consider a simpler smoothing problem: estimating the smooth function f in the model y_i = f(x_i) + ɛ_i from x_i, y_i data using smoothing splines... [Scatterplot of wear against size with the fitted curve.] The red curve is the function minimizing
Σ_i (y_i − f(x_i))² + λ ∫ f''(x)² dx.

12 Splines and the smoothing parameter. The smoothing parameter λ controls the stiffness of the spline. [Fits of wear against size for four values of λ.] But the spline can be written f̂(x) = Σ_i β_i b_i(x), where the basis functions b_i(x) do not depend on λ.

13 General spline theory background. Consider a Hilbert space of real valued functions, f, on some domain τ (e.g. [0, 1]). It is a reproducing kernel Hilbert space, H, if evaluation is bounded, i.e. ∃ M s.t. |f(t)| ≤ M‖f‖. Then the Riesz representation thm says that there is a function R_t ∈ H s.t. f(t) = ⟨R_t, f⟩. Now consider R_t(u) as a function of t: R(t, u). ⟨R_t, R_s⟩ = R(t, s), so R(t, s) is known as the reproducing kernel of H. Actually, to every positive definite function R(t, s) there corresponds a unique r.k.h.s.

14 Spline smoothing problem. RKHSs are quite useful for constructing smooth models. To see why, consider finding f̂ to minimize
Σ_i {y_i − f(t_i)}² + λ ∫ f''(t)² dt.
Let H have ⟨f, g⟩ = ∫ f''(t) g''(t) dt. Let H_0 denote the RKHS of functions for which ∫ f''(t)² dt = 0, with basis φ_1(t) = 1, φ_2(t) = t, say. The spline problem seeks f̂ ∈ H_0 ⊕ H to minimize
Σ_i {y_i − f(t_i)}² + λ‖Pf‖²,
where P is the orthogonal projection onto H.

15 Spline smoothing solution.
f̂(t) = Σ_{i=1}^n c_i R_{t_i}(t) + Σ_{i=1}^M d_i φ_i(t)
is the basis representation of f̂. Why? Suppose the minimizer were f = f̂ + η, where η ∈ H and η ⊥ f̂:
1. η(t_i) = ⟨R_{t_i}, η⟩ = 0, so the data fit is unchanged;
2. ‖Pf‖² = ‖Pf̂‖² + ‖η‖², which is minimized when η = 0.
Obviously this argument is rather general. So if E_ij = ⟨R_{t_i}, R_{t_j}⟩ and T_ij = φ_j(t_i), then we seek ĉ and d̂ to minimize
‖y − Td − Ec‖² + λcᵀEc
... straightforward to compute (but at O(n³) cost).

16 Reduced rank smoothers. Can obtain an efficient reduced rank basis (and penalty) by
1. using a spline basis for a representative subset of the data, or
2. using Lanczos methods to find a low order eigenbasis.
In either case we end up representing the smoother as f(x) = Σ_j β_j b_j(x); the basis functions b_j(x) are known, the coefficients β are not. The corresponding smoothing penalty is βᵀSβ, where S is known. So y_i = f(x_i) + ɛ_i becomes y = Xβ + ɛ, where X_ij = b_j(x_i), and
β̂ = argmin_β ‖y − Xβ‖² + λβᵀSβ.
Examples follow an asymptotic justification of rank reduction.
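The penalized least squares problem above has a closed-form solution. A minimal numerical sketch in Python, using a toy Legendre polynomial basis and a second-difference penalty as stand-ins (these are assumptions of the sketch, not mgcv's actual bases):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reduced rank smoother: k basis functions evaluated at n points.
n, k, lam = 100, 8, 1.0
x = np.linspace(-1.0, 1.0, n)
X = np.polynomial.legendre.legvander(x, k - 1)   # X[i, j] = b_j(x_i)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)

D = np.diff(np.eye(k), n=2, axis=0)              # second differences
S = D.T @ D                                      # known penalty matrix

# beta_hat = argmin ||y - X beta||^2 + lam * beta' S beta
# solves the linear system (X'X + lam*S) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X + lam * S, X.T @ y)
fitted = X @ beta_hat
```

Because the objective is quadratic in β, its gradient vanishes exactly at β̂, which is an easy correctness check.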

17 What rank? Consider f, a rank k cubic regression spline (i.e. λ = 0), parameterized in terms of function values at evenly spaced knots. From basic regression theory, the average sampling standard error of f̂ is O(√(k/n)). From approximation theory for cubic splines, the asymptotic bias of f̂ is bounded by O(k⁻⁴). So if we let k = O(n^{1/9}) we minimize the asymptotic MSE, at O(n^{−8/9}). Actually in practice we would want to penalize, and then k = O(n^{1/5}) is appropriate. The point is that asymptotically k ≪ n is appropriate.

18 P-splines: B-spline basis & approximate penalty. [B-spline basis functions, and example fits labelled with the value of the difference penalty Σ_i (β_{i−1} − 2β_i + β_{i+1})², e.g. 16.4.]
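The P-spline penalty is just a sum of squared second differences of adjacent basis coefficients, which can be written as a quadratic form βᵀDᵀDβ. A small Python check of that identity (D here is the standard second-difference matrix, an illustrative choice, not anything mgcv-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
beta = rng.normal(size=k)

# Second-difference matrix: each row of D is (..., 1, -2, 1, ...),
# picking out beta_{i-1}, beta_i, beta_{i+1}.
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D                      # so the P-spline penalty is beta' S beta

quad_form = beta @ S @ beta
explicit = sum((beta[i - 1] - 2.0 * beta[i] + beta[i + 1]) ** 2
               for i in range(1, k - 1))
```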

19 Example: Thin plate splines. One way of generalizing splines from 1D to several dimensions is to turn the flexible strip into a flexible sheet: f̂ minimizing e.g.
Σ_i {y_i − f(x_i, z_i)}² + λ ∫ f_xx² + 2f_xz² + f_zz² dx dz.
This results in a thin plate spline. It is an isotropic smooth. Isotropy may be appropriate when different covariates are naturally on the same scale. [Perspective plots of the linear predictor over x and z.]

20 Cheaper thin plate splines. RKHS or similar theory gives an explicit basis (for any dimension), with coefficients (c and d) minimizing
‖y − Td − Ec‖² + λcᵀEc s.t. Tᵀc = 0.
Can reduce the O(n³) cost to O(k³) by replacing E by its rank k truncated eigen-decomposition (computed by Lanczos methods). Cheap and somewhat optimal. [Comparison plots: truth, t.p.r.s. with 16 d.f., and a 16 knot basis.]

21 Non-isotropic tensor product smooths. A different construction is needed if covariates have different scales. Start with a marginal spline f_z(z) and let its coefficients vary smoothly with x. [Illustration: the marginal f(z) and the resulting f(x,z).]

22 The complete tensor product smooth. Let each coefficient of f_z be a spline of x. The construction is symmetric in x and z (see right). [Perspective plots of f(x,z).]

23 Tensor product penalties - one per dimension. x-wiggliness: sum marginal x penalties over the red curves. z-wiggliness: sum marginal z penalties over the green curves. [Perspective plots of f(x,z) showing the curves being penalized.]

24 Tensor product expressions. Suppose the basis expansions for smoothing w.r.t. x and z marginally are Σ_i α_i a_i(x) and Σ_j B_j b_j(z), and the marginal penalties are αᵀS_x α and BᵀS_z B. The tensor product basis construction gives:
f(x, z) = Σ_ij β_ij b_j(z) a_i(x)
with two penalties (requiring 2 smoothing parameters)
J_z(f) = βᵀ(I_I ⊗ S_z)β and J_x(f) = βᵀ(S_x ⊗ I_J)β.
The construction generalizes to any number of marginals and to multi-dimensional marginals.
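The two Kronecker-product penalties are exactly the "sum the marginal penalty over the coefficient curves" recipe of the previous slide. A Python sketch verifying this, with second-difference marginal penalties as stand-ins for S_x and S_z, and a row-major coefficient ordering (both assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 5, 6                      # marginal basis dimensions for x and z

def diff_penalty(k):
    """Second-difference marginal penalty matrix (stand-in for S_x, S_z)."""
    D = np.diff(np.eye(k), n=2, axis=0)
    return D.T @ D

S_x, S_z = diff_penalty(I), diff_penalty(J)
B = rng.normal(size=(I, J))      # coefficients beta_ij of the tensor smooth
beta = B.ravel()                 # row-major vec: index i varies slowest

# z-wiggliness: the Kronecker form of the penalty...
J_z = beta @ np.kron(np.eye(I), S_z) @ beta
# ...equals summing the marginal z penalty over the I coefficient curves:
J_z_sum = sum(B[i] @ S_z @ B[i] for i in range(I))

# x-wiggliness works the same way with the roles swapped:
J_x = beta @ np.kron(S_x, np.eye(J)) @ beta
J_x_sum = sum(B[:, j] @ S_x @ B[:, j] for j in range(J))
```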

25 Isotropic vs. tensor product comparison. [Contour plots: isotropic thin plate spline fits vs. tensor product spline fits.] ... each figure smooths the same data. The only modification is that x has been divided by 5 in the bottom row.

26 Smooth model components. A rich range of basis-penalty smoothers is possible... [Examples include s(times,10.9), a tensor product f(x,z), and a spatial smooth s(region,63.04) over lon and lat.] When combining several smooths in one model, we often need an identifiability constraint Σ_i f(x_i) = 0. Re-write this as 1ᵀXβ = 0 and absorb it by reparameterization.

27 Representing a GAM. Consider the GAM
y_i ~ EF(µ_i, φ), g(µ_i) = β_0 + Σ_j f_j(x_ji).
Each f_j(x) has a basis-penalty representation, say X_j β_j with penalty β_jᵀ S̄_j β_j (with constraints absorbed). So the GAM becomes g(µ) = Xβ, where X = [1 : X_1 : X_2 : ...] and βᵀ = [β_0, β_1ᵀ, β_2ᵀ, ...]. The penalty is now Σ_j λ_j βᵀS_j β, where S_j is S̄_j padded with zeros such that βᵀS_j β ≡ β_jᵀ S̄_j β_j. Can easily include parametric components and linear functionals of smooths.

28 β̂ given λ. Let l be the model log-likelihood and use penalized MLE:
L(β) = l(β) − (1/2) Σ_j λ_j βᵀS_j β, β̂ = argmax_β L(β).
Optimize by Newton's method (penalized IRLS for the EF case). Bayes motivation. Prior: π(β) = N(0, (Σ_j λ_j S_j)⁻) (a pseudoinverse: the prior is improper in the penalty null space); likelihood, π(y|β): as is. Then the MAP estimate is β̂. Further, in the large sample limit, if I is the information matrix at β̂,
β|y ∼ N(β̂, (I + Σ_j λ_j S_j)⁻¹),
which can be used to construct well calibrated CIs. Note the link to Gaussian random effects!
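For an exponential family likelihood the Newton optimization is penalized IRLS: each step solves a weighted, penalized least squares problem. A minimal Python sketch for a Poisson/log model with a single ridge penalty (the design, penalty and sizes are assumptions of this toy, not mgcv's fitting code):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy Poisson regression with log link:
# maximize L(beta) = l(beta) - 0.5 * lam * beta' S beta.
n, k, lam = 200, 6, 1.0
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k - 1)])
beta_true = rng.normal(0.0, 0.3, size=k)
y = rng.poisson(np.exp(X @ beta_true))
S = np.eye(k)                               # simple ridge penalty

beta = np.zeros(k)
for _ in range(100):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                 # working response for log link
    W = mu                                  # IRLS weights for Poisson/log
    # Penalized weighted least squares step:
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X) + lam * S,
                               X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-12:
        beta = beta_new
        break
    beta = beta_new
```

At convergence the penalized score Xᵀ(y − µ) − λSβ is zero, which is what the check below verifies.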

29 AIC, effective degrees of freedom, p-values. Following through the derivation of AIC in the penalized case yields
AIC = −2l(β̂) + 2tr(F), where F = (I + Σ_j λ_j S_j)⁻¹ I.
Better still, F = V′_β I, where V′_β is an approximate posterior covariance matrix for β, corrected for λ uncertainty (see Greven and Kneib, 2010, Biometrika for why). tr(F) is the effective degrees of freedom of the model. p-values for testing smooth terms or variance components for equality to zero can also be obtained (see Wood 2013a,b Biometrika). But we still need smoothing parameter (λ) estimates.
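tr(F) interpolates between the dimension of the penalty null space and the full basis dimension as λ varies. A Python illustration for the Gaussian case, where the information matrix is I = XᵀX (with σ² = 1); the Legendre basis and difference penalty are assumptions of this sketch:

```python
import numpy as np

# Effective degrees of freedom of a Gaussian penalized smoother:
# F = (X'X + lam*S)^{-1} X'X and edf = tr(F).
n, k = 100, 8
t = np.linspace(-1.0, 1.0, n)
X = np.polynomial.legendre.legvander(t, k - 1)
D = np.diff(np.eye(k), n=2, axis=0)
S = D.T @ D                      # penalty null space (unpenalized part) has dim 2

def edf(lam):
    F = np.linalg.solve(X.T @ X + lam * S, X.T @ X)
    return np.trace(F)
```

lam = 0 gives an unpenalized fit (edf = k = 8), while very large lam shrinks the fit into the penalty null space (edf → 2 here), with edf decreasing monotonically in between.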

30 Smoothing parameter estimates. Let ρ = log λ and π(ρ) = constant; then the marginal likelihood is
π(ρ|y) ∝ ∫ π(y|β)π(β|ρ)dβ.
Empirical Bayes: ρ̂ = argmax_ρ π(ρ|y)... the same as REML in the Gaussian random effects context. In practice use a Laplace approximation for the integral:
log π(ρ|y) ≈ L(β̂) + (1/2) log|Σ_j λ_j S_j|₊ − (1/2) log|H| + c.
H is the negative Hessian of L at β̂. The log determinants require numerical care.

31 How marginal likelihood works. [Simulated curves: λ too low, prior variance too high; λ and prior variance about right; λ too high, prior variance too low.] Draw β from the prior implied by λ. Find the average value of the likelihood for these draws. Choose λ to maximize this average likelihood, i.e. formally, maximize ∫ π(y|β)π(β|λ)dβ w.r.t. λ. Cross validation is an alternative.
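The recipe on this slide can be checked by brute force in a tiny Gaussian model, where the marginal likelihood is also available in closed form. A Python sketch (the model, prior and sizes are assumptions of this illustration, not the slides' examples):

```python
import numpy as np

rng = np.random.default_rng(4)

# y = X beta + e, e ~ N(0, I); prior beta ~ N(0, I/lam).
n, k, lam = 10, 2, 1.0
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

# 1. Draw beta from the prior implied by lam.
draws = rng.normal(scale=1.0 / np.sqrt(lam), size=(200_000, k))
# 2. Average the likelihood over the draws (log-sum-exp for stability).
resid = y[None, :] - draws @ X.T
ll = -0.5 * np.sum(resid**2, axis=1) - 0.5 * n * np.log(2 * np.pi)
mc_log_marginal = np.log(np.mean(np.exp(ll - ll.max()))) + ll.max()

# Closed form for comparison: marginally y ~ N(0, X X'/lam + I).
V = X @ X.T / lam + np.eye(n)
_, logdetV = np.linalg.slogdet(V)
exact = -0.5 * (y @ np.linalg.solve(V, y) + logdetV + n * np.log(2 * np.pi))
```

Maximizing the Monte Carlo average over a grid of λ values would reproduce the empirical Bayes estimate; in practice the Laplace approximation of the previous slide replaces the simulation.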

32 Numerical fitting strategy. Optimize REML w.r.t. ρ = log(λ) by Newton's method. For each trial ρ:
1. Re-parameterize β so that the log|Σ_j λ_j S_j|₊ computation is stable.
2. Find β̂ by an inner Newton optimization of L(β).
3. Use implicit differentiation to find the first two derivatives of β̂ w.r.t. ρ.
4. Compute the log determinant terms and their derivatives.
5. Hence compute the REML score and its first two derivatives, as required for the next Newton step.
For large datasets an alternative is often possible: find β̂ by iterative fitting of working linear models, and estimate ρ for the working model at each iteration step.

33 Where's the exponential family assumption? ... the single parameter independent EF assumption of GAMs is barely used. It just simplifies some of the numerical computations (and adds to numerical robustness). Actually we can use most of the apparatus just described with almost any regular likelihood, provided it's practical to compute with, and the first three or four derivatives w.r.t. the model coefficients can also be computed...

34 Example: predicting prostate status. [Protein mass spectra for healthy, enlarged and cancer groups: intensity against Daltons.] Model: ordered category (benign/enlarged/cancer) predicted by a logistic latent random variable with mean
µ_i = ∫ f(D)ν_i(D)dD, where ν_i(D) is the i-th spectrum.
gam(ds ~ s(D,by=k), family=ocat(R=3))
[Plots of the estimated f(D) and of deviance residuals against theoretical quantiles.]

35 Fuel efficiency of cars. [Pairs plot of city.mpg, hw.mpg, width, height, weight and hp.]

36 Multivariate additive model. Correlated bivariate normal response (hw.mpg, city.mpg). Component means given by smooth additive predictors. The best model is very simple (and somewhat unexpected). [Estimated effects: s(weight,6.71) and s(hp,4.1) for one component, s.1(weight,5.71) and s.1(hp,5.69) for the other, with Gaussian quantile plots.]

37 Scale location models. Can additively model the mean and variance (and skew and ...). Simple example:
y_i ~ N(µ_i, σ_i), µ_i = Σ_j f_j(x_ji), log σ_i = Σ_j g_j(z_ji).
Here is a simple 1-D smoothing example of this... [Plots of accel and the fitted σ against times.]

38 R packages. There are many alternative R packages available:
1. gam for the original backfitting approach¹.
2. VGAM for vector GAMs and more².
3. gamlss for GAMs for location scale and shape².
4. mboost for GAMs via boosting.
5. gss for smoothing spline ANOVA.
6. scam for shape constrained additive models.
7. gamm4 for GAMMs using lme4.
8. BayesX for MCMC and likelihood based GAMs³.
The methods discussed here are in the R recommended package mgcv...
¹ No smoothing parameter selection. ² Limited smoothing parameter selection. ³ See also mgcv::jagam.

39 mgcv package in R


More information

Computation fundamentals of discrete GMRF representations of continuous domain spatial models

Computation fundamentals of discrete GMRF representations of continuous domain spatial models Computation fundamentals of discrete GMRF representations of continuous domain spatial models Finn Lindgren September 23 2015 v0.2.2 Abstract The fundamental formulas and algorithms for Bayesian spatial

More information

Model Selection for Gaussian Processes

Model Selection for Gaussian Processes Institute for Adaptive and Neural Computation School of Informatics,, UK December 26 Outline GP basics Model selection: covariance functions and parameterizations Criteria for model selection Marginal

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

Computational methods for mixed models

Computational methods for mixed models Computational methods for mixed models Douglas Bates Department of Statistics University of Wisconsin Madison March 27, 2018 Abstract The lme4 package provides R functions to fit and analyze several different

More information

Odds ratio estimation in Bernoulli smoothing spline analysis-ofvariance

Odds ratio estimation in Bernoulli smoothing spline analysis-ofvariance The Statistician (1997) 46, No. 1, pp. 49 56 Odds ratio estimation in Bernoulli smoothing spline analysis-ofvariance models By YUEDONG WANG{ University of Michigan, Ann Arbor, USA [Received June 1995.

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Alan Gelfand 1 and Andrew O. Finley 2 1 Department of Statistical Science, Duke University, Durham, North

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

On Gaussian Process Models for High-Dimensional Geostatistical Datasets

On Gaussian Process Models for High-Dimensional Geostatistical Datasets On Gaussian Process Models for High-Dimensional Geostatistical Datasets Sudipto Banerjee Joint work with Abhirup Datta, Andrew O. Finley and Alan E. Gelfand University of California, Los Angeles, USA May

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Andrew O. Finley Department of Forestry & Department of Geography, Michigan State University, Lansing

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

gamboostlss: boosting generalized additive models for location, scale and shape

gamboostlss: boosting generalized additive models for location, scale and shape gamboostlss: boosting generalized additive models for location, scale and shape Benjamin Hofner Joint work with Andreas Mayr, Nora Fenske, Thomas Kneib and Matthias Schmid Department of Medical Informatics,

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Uncertainty Quantification for Inverse Problems. November 7, 2011

Uncertainty Quantification for Inverse Problems. November 7, 2011 Uncertainty Quantification for Inverse Problems November 7, 2011 Outline UQ and inverse problems Review: least-squares Review: Gaussian Bayesian linear model Parametric reductions for IP Bias, variance

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota,

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS

STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS Eric Gilleland Douglas Nychka Geophysical Statistics Project National Center for Atmospheric Research Supported

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Lecture 3: Pattern Classification. Pattern classification

Lecture 3: Pattern Classification. Pattern classification EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

APTS course: 20th August 24th August 2018

APTS course: 20th August 24th August 2018 APTS course: 20th August 24th August 2018 Flexible Regression Preliminary Material Claire Miller & Tereza Neocleous The term flexible regression refers to a wide range of methods which provide flexibility

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks. Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

ESTIMATING THE MEAN LEVEL OF FINE PARTICULATE MATTER: AN APPLICATION OF SPATIAL STATISTICS

ESTIMATING THE MEAN LEVEL OF FINE PARTICULATE MATTER: AN APPLICATION OF SPATIAL STATISTICS ESTIMATING THE MEAN LEVEL OF FINE PARTICULATE MATTER: AN APPLICATION OF SPATIAL STATISTICS Richard L. Smith Department of Statistics and Operations Research University of North Carolina Chapel Hill, N.C.,

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Multinomial Regression Models

Multinomial Regression Models Multinomial Regression Models Objectives: Multinomial distribution and likelihood Ordinal data: Cumulative link models (POM). Ordinal data: Continuation models (CRM). 84 Heagerty, Bio/Stat 571 Models for

More information

Alternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods

Alternatives to Basis Expansions. Kernels in Density Estimation. Kernels and Bandwidth. Idea Behind Kernel Methods Alternatives to Basis Expansions Basis expansions require either choice of a discrete set of basis or choice of smoothing penalty and smoothing parameter Both of which impose prior beliefs on data. Alternatives

More information

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown. Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Reduced-rank hazard regression

Reduced-rank hazard regression Chapter 2 Reduced-rank hazard regression Abstract The Cox proportional hazards model is the most common method to analyze survival data. However, the proportional hazards assumption might not hold. The

More information

arxiv: v1 [stat.me] 25 Sep 2007

arxiv: v1 [stat.me] 25 Sep 2007 Fast stable direct fitting and smoothness selection arxiv:0709.3906v1 [stat.me] 25 Sep 2007 for Generalized Additive Models Simon N. Wood Mathematical Sciences, University of Bath, Bath BA2 7AY U.K. s.wood@bath.ac.uk

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

Linear Methods for Prediction

Linear Methods for Prediction This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information