Smoothing Age-Period-Cohort models with P-splines: a mixed model approach


Running headline: Smooth Age-Period-Cohort models

I. D. Currie, Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK.
J. G. Kirkby, Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK.
M. Durban, Departamento de Estadistica y Econometria, Universidad Carlos III de Madrid, Edificio Torres Quevedo, 28911 Leganes, Madrid, Spain.
P. H. C. Eilers, Department of Medical Statistics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands.

Responsible author: I. D. Currie, Email: I.D.Currie@hw.ac.uk

Abstract: We use smoothing with B-splines and penalties, the P-spline method of Eilers and Marx (1996), to smooth the Age-Period-Cohort (APC) model of mortality. We describe how smoothing with penalties in one dimension allows a mixed model approach to be used. We apply this method to the APC model and show that penalization gives a way of dealing with identifiability problems in the discrete APC model and leads to a mixed model representation of the model. We show that individual random effects can be used to model overdispersion and that this can also be achieved within the mixed model framework. We illustrate our methods with mortality data provided by the UK insurance industry.

Keywords: Age-period-cohort; identifiability; mixed models; mortality; overdispersion; P-splines; smoothing.

16 April 2007

1 Introduction

We suppose that we have mortality data arranged in a two-way table classified by age at death and year of death. Age-Period-Cohort (APC) models are an important class of models in the study of mortality in such tables and, more generally, of disease incidence. A difficulty with the APC model is the choice of parameterization, since the model is in general not identifiable. Clayton and Schifflers (1987) give a careful discussion of parameterization in APC models and sound warnings about the dangers of over-interpreting the fitted parameters; they are equally sceptical about the wisdom of forecasting by extrapolating parameter values. Holford (1983) and Carstensen and Keiding (2005) also discuss the APC model with particular reference to the problems caused by non-identifiability. The discussion in these papers revolves around the properties of different parameterizations. Our approach is different: we force smoothness on the fitted model by penalizing differences between adjacent coefficients. Penalization does two things: first, it replaces the usual identifiability constraints, and second, it allows the APC model to be cast in a mixed model framework. The purpose of this paper is to explore and illustrate these ideas.

Smoothing with P-splines was introduced by Eilers and Marx (1996) and here we apply the method to smooth the APC model. Smooth versions of the APC model have already appeared in the literature. Heuer (1997) used restricted cubic splines, or natural splines, to give a smooth version of the APC model; he also included interactions in his model by using the Kronecker product of the spline functions. An important difference between Heuer's approach and ours is that in Heuer's paper smoothness at the edges is produced by modifying the B-spline basis (natural splines), whereas in our case smoothness is produced by the use of penalties. Ogata et al. (2000) used splines in a Bayesian framework and also produced a smooth version of the APC model. Carstensen and Keiding (2005) give R code (R Development Core Team, 2004) to smooth with natural splines.

The plan of the paper is as follows. In section 2 we explain the P-spline approach in one dimension and describe a transformation of the B-spline basis which allows the model to be expressed as a mixed model. In section 3 we apply our transformation to the APC model and show that the transformation deals with the identifiability problems that arise in the APC model. The mixed model representation has an interpretation as an additive model where the fixed component is a plane and there are three random components which correspond to the age, period and cohort effects. There is evidence of overdispersion and in section 4 we

follow Perperoglou and Eilers (??) and use individual random effects to model overdispersion. We use mortality data provided by the UK insurance industry to illustrate our methods throughout. The paper ends with a short discussion.

2 Smoothing in one dimension with P-splines

Smoothing with P-splines, introduced by Eilers and Marx (1996), is based on two ideas: (a) use B-splines as the basis for a regression, and (b) use a difference penalty on adjacent regression coefficients to ensure smoothness. Estimation is by penalized likelihood. The method has two attractive features which follow from these ideas: (a) the regression nature of P-splines means that it is straightforward to introduce smooth terms into a larger regression model, and (b) the difference penalty means that the familiar least squares (LS) solution in a normal model and, more generally, the iterative weighted least squares (IWLS) algorithm in a generalized linear model (GLM), apply in the P-spline setting. A number of papers contain descriptions of the method (Eilers and Marx, 1996; Marx and Eilers, 1998; Currie et al., 2004). We present a short introduction. In section 2.1 we describe P-spline smoothing with a B-spline basis and then in section 2.2 we describe a transformation of the B-spline basis which gives an alternative way of fitting the P-spline model. This new basis has two advantages: first, it allows us to use (generalized) linear mixed model methods, and second, it enables us to deal with identifiability problems in more complex models. We describe the mixed model approach in section 2.3. This approach allows simple fitting with standard software and has the further advantage that it enables us to deal easily with overdispersion.

2.1 P-splines with a B-spline basis

Our introduction is set in the context of Poisson errors, appropriate for modelling mortality data.
We suppose we have data (d_i, e_i, x_i), i = 1, ..., n, on a set of lives all aged sixty-five, say, where d_i is the number of deaths in year x_i and e_i is the exposed to risk. Let d = (d_1, ..., d_n)', e = (e_1, ..., e_n)', etc. We suppose that the number of deaths d_i is a realization of a Poisson distribution with mean μ_i = e_i θ_i, where θ_i is the force of mortality or hazard function in year x_i. We seek a smooth estimate of θ = (θ_i). A natural approach is to fit a GLM with Poisson errors, i.e.,

log μ = log e + log θ = log e + Xa

where X = X(x), the regression matrix, is a function of year x and log e is an offset in the linear predictor. It seems unlikely that a polynomial basis will be suitable for modelling the variability in θ and a more flexible basis is provided by a set of B-splines {B_1(x), ..., B_c(x)}; such a basis is shown in the upper left panel of Fig. 1 for c = 7.

We are still in the framework of classical regression, with regression matrix B = B(x), say, where the rows of B are the values of the B-splines in the basis evaluated at each year in turn. We use some mortality data provided by the Continuous Mortality Investigation (CMI) on claim incidence in the UK insurance business to illustrate our methods. The data run from ages 20 to 90 and years 1947 to 2002; for more information on these data see Currie et al. (2004). The middle left panel of Fig. 1 shows a plot of the logarithm of the raw forces of mortality, θ̂_i = d_i/e_i, for the sixty-five year old policy-holders in the data set. We fit an unpenalized GLM with c = 23 cubic B-splines in the basis and the resulting fit is also shown. It seems that the data have been undersmoothed; conversely, if there are too few B-splines in the basis, then the data will be oversmoothed. Thus, one approach is to optimise the number, and possibly position, of splines in the basis (Friedman and Silverman, 1989). An alternative is to consider the behaviour, not of the fitted curve, but of the fitted coefficients, â_k. The c = 23 values of â_k are plotted at their corresponding knot positions, and we see that the erratic nature of the fitted curve is a consequence of similar behaviour of the â_k. The P-spline solution (Eilers and Marx, 1996) to this problem is to use a rich basis of B-splines and then ensure smoothness of the fitted curve by penalizing the resulting roughness in the â_k with a difference penalty. For example, the second order penalty (which we will use throughout this paper) is given by

(a_1 - 2a_2 + a_3)^2 + ... + (a_{c-2} - 2a_{c-1} + a_c)^2 = a'D'Da    (2.1)

where D is a difference matrix of order 2; first and third order penalties can also be used.
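As an aside, the structure of D is easy to verify numerically. The following sketch (Python with numpy rather than the paper's R; the helper name is ours) confirms that the second order penalty (2.1) leaves linear coefficient vectors unpenalized:

```python
import numpy as np

# Second-order difference matrix D of (2.1): row k maps a to
# a_k - 2*a_{k+1} + a_{k+2}, so D has c - 2 rows for c coefficients.
def second_diff_matrix(c):
    return np.diff(np.eye(c), n=2, axis=0)

c = 7
D = second_diff_matrix(c)

# A linear coefficient vector lies in the null space of D'D: its second
# differences are zero, so a'D'Da = 0 and a linear trend is unpenalized.
a_lin = 3.0 + 0.5 * np.arange(c)
pen_lin = a_lin @ D.T @ D @ a_lin

# A wiggly coefficient vector is penalized.
a_wig = np.cos(np.arange(c))
pen_wig = a_wig @ D.T @ D @ a_wig
```

The null space of D'D has dimension 2 (constant plus linear), which is exactly the split into two fixed and c - 2 random coefficients exploited in section 2.2.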
The log-likelihood function is modified by the penalty function and a is estimated by maximising the penalized log-likelihood

l_p = l(a; y) - ½ a'Pa    (2.2)

where l(a; y) is the usual log-likelihood for a GLM, P = P_B = λD'D is the penalty matrix and λ is the smoothing parameter. The suffix B, as in P_B, indicates the associated basis and emphasizes that the penalty depends on the choice of basis. Other bases and penalties will be introduced below but when the context allows we will suppress the suffix and write the penalty simply as P. Maximizing (2.2) gives the penalized likelihood equations

B'(y - μ) = Pa    (2.3)

which, conditional on the value of the smoothing parameter λ, can be solved with

(B'W̃B + P)â = B'W̃z̃,    (2.4)

the penalized version of the scoring algorithm; here B is the regression matrix, P is the penalty matrix, the tilde, as in ã, denotes a current estimate, and similarly for μ̃; z̃ = Bã + W̃^{-1}(y - μ̃) is the working variable and W̃ = diag(μ̃) the diagonal matrix of weights, while â denotes the updated estimate of a. The hat-matrix is

H = B(B'ŴB + P)^{-1}B'Ŵ    (2.5)

and the trace of the hat-matrix, tr(H), a measure of the effective dimension, ED, or degrees of freedom of the model (Hastie and Tibshirani, 1990, p52), is

ED = Tr(H) = Tr[(B'ŴB + P)^{-1}B'ŴB].    (2.6)

Standard errors for â can be computed from

Var(â) ≈ (B'ŴB + P)^{-1}.    (2.7)

Wahba (1983) and Silverman (1985) used a Bayesian argument to derive (2.7); see also Wood (2006, section 4.8) for a good discussion. We also note that in the extreme cases, λ = 0 and λ = ∞, (2.7) reduces to familiar results: we get the usual asymptotic variance in an unpenalized GLM when λ = 0; when λ → ∞ the limiting fit is a straight line (on the log scale) and the variance in (2.7) reduces to the variance when the linear predictor is linear in age. (Here we assume that a second order penalty and B-splines of degree at least two are used; see Eilers and Marx (1996).)

There remains the choice of the smoothing parameter. In this paper we will use mixed model methods to select the smoothing parameter but there are a number of other possibilities: the Akaike Information Criterion (AIC) (Akaike, 1973), the Bayesian Information Criterion (BIC) (Schwarz, 1978) or Generalised Cross Validation (GCV) (Craven and Wahba, 1979), for example. The right middle panel of Fig. 1 shows the result of using BIC to select the smoothing parameter, where BIC is defined as

BIC = Dev + log n × Tr(H)    (2.8)

where Dev is the deviance in a GLM and n is the number of observations. We still have c = 23 B-splines in the basis but with a second order penalty and λ chosen by BIC the degrees of freedom are reduced from 23 to about 6.5.
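The penalized scoring algorithm (2.4) and the effective dimension (2.6) can be sketched as follows. This is an illustrative Python/numpy translation of the method on simulated Poisson data (not the CMI data), with a home-made equally spaced B-spline basis; all names are ours:

```python
import numpy as np

def bspline_basis(x, xl, xr, ndx, deg=3):
    """Equally spaced B-spline basis (Eilers-Marx style), c = ndx + deg columns."""
    dx = (xr - xl) / ndx
    knots = xl + dx * np.arange(-deg, ndx + deg + 1)
    # Cox-de Boor recursion, starting from degree-0 indicator functions.
    B = ((x[:, None] >= knots[None, :-1]) &
         (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, deg + 1):
        B = ((x[:, None] - knots[None, :-(d + 1)]) * B[:, :-1] +
             (knots[None, d + 1:] - x[:, None]) * B[:, 1:]) / (d * dx)
    return B

rng = np.random.default_rng(1)
n, ndx = 60, 20
x = np.linspace(0.0, 1.0, n, endpoint=False) + 0.005
e = np.full(n, 500.0)                          # exposed to risk
theta = np.exp(-2.0 + np.sin(2 * np.pi * x))   # true hazard
y = rng.poisson(e * theta).astype(float)

B = bspline_basis(x, 0.0, 1.0, ndx)            # rich basis, c = 23
c = B.shape[1]
D = np.diff(np.eye(c), n=2, axis=0)            # second order differences
lam = 10.0
P = lam * D.T @ D

a = np.full(c, np.log(y.sum() / e.sum()))      # crude constant start
for _ in range(50):                            # penalized IWLS, eq. (2.4)
    eta = B @ a
    mu = e * np.exp(eta)
    z = eta + (y - mu) / mu                    # working variable
    a = np.linalg.solve(B.T * mu @ B + P, B.T @ (mu * z))

mu = e * np.exp(B @ a)
ED = np.trace(np.linalg.solve(B.T * mu @ B + P, B.T * mu @ B))  # eq. (2.6)
```

Here `B.T * mu` exploits numpy broadcasting to form B'W with W = diag(μ); λ is held fixed at an arbitrary value rather than selected by BIC or REML.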
The fitted coefficients also exhibit smoothness and demonstrate an important difference between smoothing with natural splines and P-splines: in the former, smoothness at the edges is ensured by the use of splines linear in the tails, while in the latter, B-splines are used throughout the basis and smoothness is ensured by the penalty.

2.2 P-splines with a transformed basis

The fitted coefficients with a B-spline basis have an attractive property: each coefficient is associated with a B-spline and the estimated value of the coefficient is approximately a weighted average of the observations in the vicinity of this B-spline, as the middle panels of Fig. 1 demonstrate. The second order penalty penalizes departures from linearity. An alternative strategy is to extract a linear component from the fitted trend and fit the remaining variation by fitting a smooth curve with a penalty that penalizes departures from zero. This approach has echoes of a mixed model where trend is split into a fixed part (the linear part) and a random part (the curved part); Green (1985) is an early reference to this idea, which is also discussed by Verbyla et al. (1999). A transformation that achieves this decomposition with a B-spline basis and second order penalty is described in Currie et al. (2006). This transformation not only gives access to (generalized) linear mixed model methods but also allows us to deal with problems of identifiability, as we will see in section 3. Welham et al. (2007) give a comprehensive review of mixed model representations of spline models.

Let B = B(x), n × c, be the regression matrix of B-splines and let D'D, c × c, define the penalty matrix. Let UΦU' be the singular value decomposition of D'D, where Φ is the diagonal matrix consisting of the eigenvalues φ_1, φ_2, ..., φ_c of D'D in ascending order. We assume that a second order penalty is used so φ_1 = φ_2 = 0. Now a linear function is in the null space of D'D so we can take U_n = [1 : x], c × 2, as an orthogonal basis for the null space, where 1 is (1, ..., 1)'/√c and x is (1, 2, ..., c)' centred and scaled to have unit length. Let U_s, c × (c - 2), be the submatrix of U corresponding to the c - 2 non-zero eigenvalues.
We take U = [U_n : U_s] and transform Bθ = Xβ + Zα, say, where X = BU_n and Z = BU_s(Φ_+)^{-1/2}, and β = U_n'θ and α = (Φ_+)^{1/2}U_s'θ; here Φ_+ is the diagonal matrix consisting of the c - 2 positive eigenvalues in Φ. With this transformation, the penalty θ'D'Dθ = α'α. Furthermore, since β is unpenalized we may replace X = BU_n = B[1 : x] by [1, x], where 1 is a vector of 1's of length n and x is the vector of year values. With these definitions in place, the following estimation procedures in (2.4) are equivalent:

Regression matrix: B = B(x) → [X : Z], X = [1, x], Z = BU_s(Φ_+)^{-1/2}    (2.9)

Penalty matrix: P_B = λD'D → P_F = λ blockdiag[O_2, I_{c-2}]    (2.10)

where O_2 is a 2 × 2 matrix of zeros and I_{c-2} is the identity matrix of size c - 2. The linear part, Xβ, is unpenalized, while the non-linear part, Zα, is penalized or shrunk towards zero. We interpret this representation as a mixed model with Xβ as the fixed part and Zα as the

random part in section 2.3; see also Currie et al. (2006).

Figure 1 explains how the new basis works. With c = 7 B-splines there are five basis functions in Z, as shown in the upper right panel. These new basis functions are very different from the original B-splines: first, they are no longer local functions, and second, the high frequency functions have low amplitude (a consequence of the scaling by (Φ_+)^{-1/2}, shown in the lower right panel). The lower left panel shows the values of α̂, c = 23, from the unpenalized fit and the penalized fit; the shrinkage of the penalized estimates towards zero is evident. It is important to realise that although the amplitudes of the basis functions differ greatly their coefficients are equally penalized. Indeed, it would be possible to remove the high frequency/low amplitude basis functions from Z with little effect on the resulting fit, an idea exploited by Wood (2003).

2.3 A mixed model representation

Equations (2.9) and (2.10) say that fitting the penalized GLM with regression matrix B and penalty matrix P_B = λD'D is equivalent to fitting the penalized GLM with regression matrix [X : Z] and penalty matrix P_F = λ blockdiag[O_2, I_{c-2}]. With this second representation the scoring algorithm (2.4) becomes

[ X'W̃X    X'W̃Z       ] [ β̂ ]   [ X'W̃ ]
[ Z'W̃X    Z'W̃Z + λI  ] [ α̂ ] = [ Z'W̃ ] z̃,    (2.11)

where I = I_{c-2}, z̃ = Xβ̃ + Zα̃ + W̃^{-1}(y - μ̃) and W̃ = diag(μ̃) with μ̃ = e exp(Xβ̃ + Zα̃) in the Poisson case. We recognise (2.11) as the mixed model equations for the linear mixed model

z = Xβ + Zα + ε, α ~ N(0, λ^{-1}I), ε ~ N(0, W^{-1});    (2.12)

see Searle et al. (1992), p276. Smoothing parameters may be selected by maximizing the residual log-likelihood

l(λ) = -½ log|V| - ½ log|X'V^{-1}X| - ½ z'(V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1})z    (2.13)

where

V = W^{-1} + ZGZ'    (2.14)

and G = λ^{-1}I is the variance of the random effects. We now iterate between (2.11) and (2.13).
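The transformation of section 2.2 that underlies this mixed model representation is easy to check numerically. A sketch in Python/numpy (names ours), taking B as the identity for simplicity, verifies that the penalty becomes a simple ridge penalty and that θ is recovered exactly:

```python
import numpy as np

# Check of the section 2.2 transformation (2.9)-(2.10): eigendecompose
# D'D, split null space (constant + linear) from the penalized space,
# and confirm theta' D'D theta = alpha' alpha.

c = 7
D = np.diff(np.eye(c), n=2, axis=0)         # second order differences
phi, U = np.linalg.eigh(D.T @ D)            # eigenvalues in ascending order
U_n, U_s = U[:, :2], U[:, 2:]               # null space : penalized space
phi_pos = phi[2:]                           # the c - 2 positive eigenvalues

theta = np.random.default_rng(0).normal(size=c)
beta = U_n.T @ theta                        # unpenalized coordinates
alpha = np.sqrt(phi_pos) * (U_s.T @ theta)  # penalized coordinates

# The quadratic penalty becomes a ridge penalty on alpha.
pen_B = theta @ D.T @ D @ theta
pen_F = alpha @ alpha

# And theta is recovered as U_n beta + U_s Phi_+^{-1/2} alpha, i.e.
# B theta = X beta + Z alpha with X = B U_n, Z = B U_s Phi_+^{-1/2}.
theta_back = U_n @ beta + (U_s / np.sqrt(phi_pos)) @ alpha
```

We use `numpy.linalg.eigh` rather than a singular value decomposition; for the symmetric positive semi-definite matrix D'D the two coincide up to the ordering of the eigenvalues.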
We will return to this method in section 4, but in the present case we use the proposal of Schall (1991) for the estimation of β, α and the smoothing parameter λ in a generalized linear mixed model. With the same notation as Schall we let

C'C = [ X'W̃X    X'W̃Z       ]
      [ Z'W̃X    Z'W̃Z + λ̃I  ]    (2.15)

and define T to be the lower (c - 2) × (c - 2) block of the inverse of C'C (corresponding to α). Schall's algorithm is

1. for given β = β̃, α = α̃ and λ = λ̃, estimate β and α from (2.11), and
2. for given β = β̃, α = α̃ and λ = λ̃, estimate λ from

λ̂^{-1} = α̃'α̃ / (c - 2 - ṽ)    (2.16)

where ṽ = λ̃ Tr(T). This fixed point iteration scheme yields approximate residual maximum likelihood (REML) estimates. We have found Schall's algorithm to provide an efficient solution. Approximate maximum likelihood estimates can be obtained by defining T to be the inverse of the lower (c - 2) × (c - 2) block of C'C. It follows from (2.6) and the definition of T that the effective dimension of the fitted model can be written

ED = p + (c - p - v)    (2.17)

where c is the total number of parameters, p is the dimension of β and v = λ Tr(T). This decomposition extends to additive models like the smooth APC model presented in the next section (see (3.7)). In the example in section 2.1 we have p = 2, c - p - v = 5.14 and ED = 7.14, a slightly more flexible fit than that obtained with λ chosen by BIC.

Overdispersion is a common problem with Poisson models. If Var(y) = σ²μ then Schall's estimate of σ² reduces to

σ̂² = (y - μ̂)'Ŵ^{-1}(y - μ̂) / (n - c + v)    (2.18)

which in our present example gives σ̂² = 1.20; there is little evidence of serious overdispersion. In general, it is preferable to incorporate overdispersion directly into the estimation process and the mixed model approach enables this to happen in a natural way. The mixed model (2.12) becomes

z = Xβ + Zα + ε, α ~ N(0, λ^{-1}I), ε ~ N(0, σ²W^{-1})    (2.19)

and Schall's algorithm is modified as follows: in step 1, replace W by σ^{-2}W and add step 3, estimate σ² from (2.18). In this example there is little change: we find a fitted model with a slightly lower effective dimension of 7.01 and estimated overdispersion of σ̂² = 1.19. The model may also be fitted with standard software.
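Before turning to standard software, Schall's two-step iteration can be sketched in a few lines. The following illustrative Python/numpy version uses a normal response with identity weights and a simulated design, so it shows the fixed-point update (2.16) rather than the full GLMM fit; all names and the toy design are ours:

```python
import numpy as np

# Schall's (1991) fixed-point iteration for the smoothing parameter,
# sketched for z = X beta + Z alpha + eps with identity weights.
rng = np.random.default_rng(2)
n, p, q = 200, 2, 10
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.sin((j + 1) * np.pi * x) / (j + 1) for j in range(q)])
z = (X @ np.array([1.0, -0.5]) + Z @ rng.normal(scale=0.3, size=q)
     + rng.normal(scale=0.1, size=n))

lam = 1.0
for _ in range(100):
    # Step 1: solve the mixed model equations (2.11) for the current lambda.
    CtC = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.eye(q)]])
    sol = np.linalg.solve(CtC, np.concatenate([X.T @ z, Z.T @ z]))
    alpha = sol[p:]
    # Step 2: update lambda from (2.16); T is the lower q x q block of
    # the inverse of C'C and v = lambda * tr(T).
    T = np.linalg.inv(CtC)[p:, p:]
    v = lam * np.trace(T)
    lam = (q - v) / (alpha @ alpha)

ED = p + (q - v)   # effective dimension, eq. (2.17)
```

Here q plays the role of c - 2, the number of random coefficients; in the paper's setting the same update is applied inside the penalized IWLS loop.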
We use the glmmPQL() function in the MASS library of R (Venables and Ripley, 2002). The fitted model has effective dimension

of 6.64 (computed from (2.6), with W replaced by σ^{-2}W) and estimated overdispersion of σ̂² = 1.21. We give some skeleton R code in Appendix B.

3 Smooth Age-Period-Cohort models

We suppose that we have data matrices Y and E, both n_a × n_y, of deaths and exposures respectively. The rows of Y and E are indexed by age at death x_a and the columns by year of death x_y. The classical approach to the APC model is the factor model in which the variation in the force of mortality, θ_ijk at age i in year j for cohort k, is decomposed into three components:

log θ_ijk = α_i + β_j + γ_k, i = 1, ..., n_a, j = 1, ..., n_y, k = 1, ..., n_a + n_y - 1    (3.1)

where α_i, β_j and γ_k are the age, period (year) and cohort effects respectively. With Poisson errors, this is a GLM so is easily fitted with standard software such as R (R Development Core Team, 2004). However, there is a difficulty with the interpretation of the fitted parameters since of the 2n_a + 2n_y - 1 parameters in (3.1) only 2n_a + 2n_y - 4 are identifiable; see Clayton and Schifflers (1987) for a careful discussion of the dangers of over-interpretation of the fitted parameters. Instead of trying to interpret the fitted parameters we consider the fitted log(mortality) surface, which is unique. The upper left panel in Fig. 2 shows the mean fitted log(mortality) by age, with the linear effect of age removed, for the data, for the factor model, and for the smooth model described below; the corresponding plots for year and cohort are also given. These plots suggest that a smooth model in age, year and cohort is a natural alternative to the discrete factor model. The smoothness assumption is deceptively simple, since we will see that this alone is sufficient to deal with the identifiability constraints.
A smooth model may also deal with another problem with the APC model: the cohort parameters which correspond to the oldest and youngest cohorts tend to be poorly estimated, a consequence of the small numbers of cells which contribute to estimates of the corner cohort parameters; the parameter estimates corresponding to the youngest cohorts in the CMI dataset are particularly unstable. A smooth model should help to deal with this instability.

We assume that the parameters α, β and γ in (3.1) are smooth and define a smooth APC model as follows. Let M_a, M_y and M_c be the n_a × n_y matrices with entries age at death, year of death and year of birth (cohort), and let x_a = vec(M_a), x_y = vec(M_y) and x_c = vec(M_c). Let B_a = B(x_a) be the regression matrix of B-splines based on x_a, with similar definitions for B_y and B_c. We define a smooth APC regression matrix by

B = [B_a : B_y : B_c]    (3.2)

with corresponding coefficients a = (a_a', a_y', a_c')'. We impose smoothness on the coefficients a_a, a_y and a_c by the block diagonal penalty matrix

P = blockdiag[λ_a D_a'D_a, λ_y D_y'D_y, λ_c D_c'D_c]    (3.3)

where D_a, D_y and D_c are second order difference matrices and λ_a, λ_y and λ_c are the smoothing parameters for the age, year and cohort parameters respectively. The model defined by (3.2) and (3.3) is a generalized additive model (GAM) (Hastie and Tibshirani, 1990) but instead of using back-fitting we fit directly with (2.4). However, the regression matrix in (3.2) is not of full rank so some care is required. There are a number of possibilities: we could use a small ridge penalty on the system of equations, as in Marx and Eilers (1998), or we could use a generalized inverse. A third possibility is to transform B to a non-singular basis; the transformation developed in section 2.2 enables us to extract the linear components of age, year and cohort, i.e., a plane in the age-year space. An important point is that all three methods give exactly the same fitted values.

Let X_a = [1, x_a], X_y = [1, x_y] and X_c = [1, x_c] be the n_a n_y × 2 matrices corresponding to (2.9). Then, removing the linear dependencies among the columns of X_a, X_y and X_c, we obtain the X matrix in the transformed model as

X = [1 : x_a : x_y].    (3.4)

Note that although X is not unique the space spanned by X is, since this space equals the null space of P in (3.3). The Z matrix is given by

Z = [Z_a : Z_y : Z_c]    (3.5)

where, for example, Z_a = B_a U_{a:s}(Φ_a^+)^{-1/2}, and U_{a:s} and Φ_a^+ are obtained from the singular value decomposition of D_a'D_a as in section 2.2. Lastly, with the new regression matrix defined as [X : Z], the penalty transforms into

P = blockdiag[O_3, λ_a I_{c_a-2}, λ_y I_{c_y-2}, λ_c I_{c_c-2}]    (3.6)

where O_3 is the 3 × 3 matrix of 0's and c_a - 2 is the column dimension of Z_a, etc.
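The construction of X in (3.4) can be illustrated numerically. The sketch below (Python/numpy, on a toy grid rather than the CMI ages and years) checks that the cohort linear term x_c = x_y - x_a is already in the span of [1 : x_a : x_y], which is why X contains only three columns:

```python
import numpy as np

# Vectorize the age, year and cohort grids M_a, M_y, M_c of section 3
# and confirm that appending x_c to [1 : x_a : x_y] does not raise the rank.
ages = np.arange(20, 26)          # toy grid; the CMI data use 20..90
years = np.arange(1947, 1952)     # and 1947..2002
M_a = np.tile(ages[:, None], (1, len(years)))
M_y = np.tile(years[None, :], (len(ages), 1))
M_c = M_y - M_a                   # year of birth (cohort)

x_a = M_a.ravel(order='F')        # vec() stacks columns
x_y = M_y.ravel(order='F')
x_c = M_c.ravel(order='F')

one = np.ones(x_a.shape)
X = np.column_stack([one, x_a, x_y])            # eq. (3.4)
full = np.column_stack([one, x_a, x_y, x_c])    # with the cohort term added
```

The exact linear dependence cohort = period - age is the discrete identifiability problem of (3.1) in miniature; in the transformed model it is resolved by dropping the redundant fixed-effect column while the curved cohort variation survives in Z_c.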
The model may now be fitted as in section 2.3 with fixed regression matrix given by (3.4), random regression matrix by (3.5) and penalty matrix by (3.6). We fit the smooth APC model with B_a, n × 10, B_y, n × 13 and B_c, n × 28, where n = 3976, i.e., c_a = 10, c_y = 13 and c_c = 28. Schall's algorithm, (2.15) and (2.16), extends as follows: let

T_a be the (c_a - 2) × (c_a - 2) block of the inverse of C'C which corresponds to the Z_a coefficients; we take similar definitions for T_y and T_c for Z_y and Z_c. Fitting the Poisson model without overdispersion, we find with REML that the effective dimension of the fitted model is reduced from 249 for the factor model to about 35.8. The estimate of overdispersion using (2.18) is σ̂² = 2.00, evidence of some overdispersion. We refit the model with σ² included as part of the estimation process. With overdispersion included in the estimation process we would expect heavier smoothing, since the smoothed surface will be less inclined to follow the local behaviour of the observed mortality surface. The effective dimension is further reduced to about 33.8, where (2.17) becomes

ED = p + (c - p - v_a - v_y - v_c)    (3.7)

where there are p = 3 fixed effects, c - p = c_a + c_y + c_c - 6 = 45 random effects, and v_a = λ_a Tr(T_a) = 0.49, v_y = λ_y Tr(T_y) = 3.49 and v_c = λ_c Tr(T_c) = 10.25. The resulting detrended mean log(mortality) curves have been added to Fig. 2; the fitted log(mortality) is also shown for age 65. Skeleton R code is provided in Appendix B.

In the previous paragraph we described overdispersion as a variance effect. However, with mortality data this approach ignores effects such as cold winters which can inflate death rates. In the next section we use the approach of Perperoglou and Eilers (??), where overdispersion is viewed not as a variance problem but as a problem with the linear structure of the model. They suggest the addition of individual random effects to the linear predictor as a way of dealing with the lack of fit that is otherwise modelled with overdispersion.

4 Overdispersion as individual random effects

In the previous section we showed that the linear predictor for the APC model has a mixed model representation Xβ + Zα where X and Z are defined in (3.4) and (3.5) respectively. Perperoglou and Eilers (??)
modified the linear predictor by the addition of individual effects to give

\[
X\beta + Z\alpha + \gamma, \qquad (4.1)
\]

where the length of γ is the same as the number of observations, n, say. Thus the model has more parameters than observations, but a ridge penalty on γ maintains identifiability and shrinks γ towards zero; the penalty (3.6) becomes

\[
\bar{P} = \mathrm{blockdiag}[O_3,\, \lambda_a I_{c_a-2},\, \lambda_y I_{c_y-2},\, \lambda_c I_{c_c-2},\, \kappa I_n] = \mathrm{blockdiag}[O_3,\, P,\, \kappa I_n], \qquad (4.2)
\]

say. We have a mixed model where the variance of the random effects α and γ is given by

\[
G = \mathrm{blockdiag}[\lambda_a^{-1} I_{c_a-2},\, \lambda_y^{-1} I_{c_y-2},\, \lambda_c^{-1} I_{c_c-2},\, \kappa^{-1} I_n] = \mathrm{blockdiag}[P^{-1},\, \kappa^{-1} I_n]. \qquad (4.3)
\]

The mixed model equations (2.11) become

\[
\begin{bmatrix}
X'\tilde{W}X & X'\tilde{W}Z & X'\tilde{W} \\
Z'\tilde{W}X & Z'\tilde{W}Z + P & Z'\tilde{W} \\
\tilde{W}X & \tilde{W}Z & \tilde{W} + \kappa I_n
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \\ \hat{\gamma} \end{bmatrix}
=
\begin{bmatrix} X'\tilde{W} \\ Z'\tilde{W} \\ \tilde{W} \end{bmatrix}
\tilde{z}. \qquad (4.4)
\]

This is a very large system of equations, but Perperoglou and Eilers (??) provide a device which facilitates its solution. We define a modified weight matrix

\[
W^* = \kappa(\tilde{W} + \kappa I_n)^{-1}\tilde{W} \qquad (4.5)
\]

and solve (4.4) for γ̂ to get

\[
\kappa\hat{\gamma} = W^*(\tilde{z} - X\hat{\beta} - Z\hat{\alpha}), \qquad (4.6)
\]

from which it follows that (4.4) reduces to

\[
\begin{bmatrix}
X'W^*X & X'W^*Z \\
Z'W^*X & Z'W^*Z + P
\end{bmatrix}
\begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \end{bmatrix}
=
\begin{bmatrix} X'W^* \\ Z'W^* \end{bmatrix}
\tilde{z}. \qquad (4.7)
\]

This is the same system as obtained for the original smooth APC model, but with the weight matrix W̃ replaced by W*. For given κ we optimize over the remaining parameters with Schall's algorithm; κ itself is estimated by maximizing the profile residual log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) from (2.13). It is essential to avoid the inversion of large matrices such as the left-hand side of (4.4); some matrix identities to this end are provided in Appendix A.

Figure 3 shows the results of fitting the model. Figure 3 also shows the profile log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) plotted against log κ; evidently, the smoothing parameter which shrinks the individual random effects towards zero is sharply estimated. Values of the observed and smoothed log(mortality), together with the estimated individual effects, are also shown for ages forty and sixty. There is a noticeable difference in the individual random effects at these ages. An explanation can be found in the lower right panel of Fig. 3, which gives the numbers of deaths at ages sixty and forty. Since Var(log(d/e)) ≈ 1/d, the values of log(d/e) are a good estimate of the true underlying smooth log(mortality) at age sixty, but a poor estimate at age forty.
It follows that the residuals log(d/e) − Xβ̂ − Zα̂ are almost entirely explained by the individual random effects at age sixty, while the stochastic element of the residual is substantial at age forty. Thus the model has the ability to distinguish between lack of fit (age sixty) and overdispersion (age forty).

5 Discussion

The modified weight matrix W* deserves some comment. We note that W* is a diagonal matrix with entries w*_i = κw̃_i/(w̃_i + κ), with w̃_i = µ̃_i in the Poisson case considered here. Thurston et al. (2000) used the same weight matrix in their algorithm to fit the negative binomial distribution, a distribution often used to model overdispersion. For further comment on this point see Perperoglou and Eilers (??). Further work might extend the present approach to other mortality data sets and to related applications of the method; the modelling of period shocks, such as the cold winters mentioned in section 3, is one natural candidate.
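The connection with the negative binomial can be made concrete: w*_i = κw̃_i/(w̃_i + κ) is algebraically identical to the working weight µ_i/(1 + µ_i/θ) of a log-link negative binomial fit with size parameter θ = κ. A minimal sketch (all values invented):

```python
import numpy as np

mu = np.array([0.5, 2.0, 10.0, 100.0])   # Poisson working weights w_i = mu_i
kappa = 5.0

wstar = kappa * mu / (mu + kappa)        # modified weights of section 4
nb_weight = mu / (1.0 + mu / kappa)      # negative binomial working weight, size kappa

print(np.allclose(wstar, nb_weight))     # True
# As kappa -> infinity the Poisson weights are recovered
print(np.allclose(1e9 * mu / (mu + 1e9), mu, rtol=1e-6))  # True
```

For small κ, by contrast, w*_i is capped near κ, so observations with large µ_i lose influence: exactly the down-weighting that accommodates overdispersion.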

References

Akaike H (1973) Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60, 255-65.

Carstensen B and Keiding N (2005) Age-period-cohort models: statistical inference in the Lexis diagram. Unpublished manuscript available at www.biostat.ku.dk/~bxc

Clayton D and Schifflers E (1987) Models for temporal variation in cancer rates. II: Age-period-cohort models. Statistics in Medicine, 6, 469-81.

Craven P and Wahba G (1979) Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377-403.

Currie ID, Durban M and Eilers PHC (2004) Smoothing and forecasting mortality rates. Statistical Modelling, 4, 279-98.

Currie ID, Durban M and Eilers PHC (2006) Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society: Series B, 68, 259-80.

Eilers PHC and Marx BD (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89-121.

Friedman JH and Silverman BW (1989) Flexible parsimonious smoothing and additive modeling. Technometrics, 31, 3-21.

Green PJ (1985) Linear models for field trials, smoothing and cross-validation. Biometrika, 72, 527-37.

Hastie TJ and Tibshirani RJ (1990) Generalized Additive Models. London: Chapman and Hall.

Heuer C (1997) Modeling of time trends and interactions in vital rates using restricted regression splines. Biometrics, 53, 161-77.

Holford TR (1983) The estimation of age, period and cohort effects for vital rates. Biometrics, 39, 311-24.

Marx BD and Eilers PHC (1998) Direct generalized additive modeling with penalized likelihood. Computational Statistics and Data Analysis, 28, 193-209.

Ogata Y, Katsura K, Keiding N, Holst C and Green A (2000) Empirical Bayes age-period-cohort analysis of retrospective incidence data. Scandinavian Journal of Statistics, 27, 415-32.

Perperoglou A and Eilers PHC. Overdispersion modelling with individual random effects.

R Development Core Team (2004) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.r-project.org

Schall R (1991) Estimation in generalized linear models with random effects. Biometrika, 78, 719-27.

Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics, 6, 461-64.

Searle SR, Casella G and McCulloch CE (1992) Variance Components. New York: John Wiley & Sons.

Silverman BW (1985) Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society: Series B, 47, 1-52.

Thurston SW, Wand MP and Wiencke JK (2000) Negative binomial additive models. Biometrics, 56, 139-44.

Venables WN and Ripley BD (2002) Modern Applied Statistics with S. New York: Springer-Verlag.

Verbyla AP, Cullis BR, Kenward MG and Welham SJ (1999) The analysis of designed experiments and longitudinal data by using smoothing splines (with discussion). Applied Statistics, 48, 269-311.

Wahba G (1983) Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Series B, 45, 133-50.

Welham SJ, Cullis BR, Kenward MG and Thompson R (2007) A comparison of mixed model splines for curve fitting. Australian and New Zealand Journal of Statistics, 49, 1-23.

Wood SN (2003) Thin plate regression splines. Journal of the Royal Statistical Society: Series B, 65, 95-114.

Wood SN (2006) Generalized Additive Models: An Introduction with R. London: Chapman and Hall.

Appendix A

We provide some matrix identities which allow estimation in the smooth APC model with individual random effects, model (4.1) and (4.2). In (4.1) let

\[
C'C =
\begin{bmatrix}
X'\tilde{W}X & X'\tilde{W}Z & X'\tilde{W} \\
Z'\tilde{W}X & Z'\tilde{W}Z + P & Z'\tilde{W} \\
\tilde{W}X & \tilde{W}Z & \tilde{W} + \kappa I_n
\end{bmatrix}. \qquad (5.1)
\]

This matrix is (c_a + c_y + c_c − 3 + n) × (c_a + c_y + c_c − 3 + n). For given κ, Schall's algorithm requires the leading (c_a + c_y + c_c − 3) × (c_a + c_y + c_c − 3) block of (C′C)^{−1}. It follows from results on the inverse of partitioned matrices and the definition of W* that this block is given by

\[
\begin{bmatrix}
X'W^*X & X'W^*Z \\
Z'W^*X & Z'W^*Z + P
\end{bmatrix}^{-1}, \qquad (5.2)
\]

the inverse of the matrix on the left-hand side of (4.7). The Schall estimation scheme, as in section 3, is now used (conditional on κ) to estimate the remaining parameters. To estimate κ we compute the profile residual log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) from

\[
-\tfrac{1}{2}\log|V| - \tfrac{1}{2}\log|X'V^{-1}X| - \tfrac{1}{2}\tilde{z}'\bigl(V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}\bigr)\tilde{z}. \qquad (5.3)
\]

Now, with the variance of the random effects given by (4.3), we find

\[
V = \tilde{W}^{-1} + [Z : I_n]\,\mathrm{blockdiag}[P^{-1},\, \kappa^{-1} I_n]\,[Z : I_n]' \qquad (5.4)
\]
\[
\hphantom{V} = \tilde{W}^{-1} + Z P^{-1} Z' + \kappa^{-1} I_n, \qquad (5.5)
\]

where P is defined in (4.2). Since W̃^{-1} + κ^{-1}I_n = (W*)^{-1}, it follows that V^{-1} and |V| are

\[
V^{-1} = W^* - W^*Z(P + Z'W^*Z)^{-1}Z'W^* \qquad (5.6)
\]

and

\[
|V| = (\lambda_a^{c_a-2}\, \lambda_y^{c_y-2}\, \lambda_c^{c_c-2})^{-1}\, |W^*|^{-1}\, |P + Z'W^*Z|. \qquad (5.7)
\]
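The identities (5.5)–(5.7) are straightforward to verify numerically. The sketch below uses small random matrices and, for simplicity, a single smoothing parameter λ in the penalty block P; all names and dimensions are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 40, 6
Z = rng.standard_normal((n, q))
lam = 2.0                        # stand-in: one common smoothing parameter
P = lam * np.eye(q)              # penalty block of (4.3)
kappa = 5.0
w = rng.uniform(0.5, 2.0, n)     # diagonal of W-tilde

# (5.4)-(5.5): V = W^{-1} + Z P^{-1} Z' + kappa^{-1} I
V = np.diag(1 / w) + Z @ np.linalg.inv(P) @ Z.T + np.eye(n) / kappa

# Modified weights (4.5), using the identity W^{-1} + kappa^{-1} I = (W*)^{-1}
wstar = kappa * w / (w + kappa)
Wstar = np.diag(wstar)

# (5.6): V^{-1} = W* - W* Z (P + Z' W* Z)^{-1} Z' W*
M = P + Z.T @ Wstar @ Z
Vinv = Wstar - Wstar @ Z @ np.linalg.solve(M, Z.T @ Wstar)
print(np.allclose(Vinv, np.linalg.inv(V)))            # True

# (5.7) on the log scale: log|V| = -log|P| - log|W*| + log|P + Z' W* Z|
logdetV = -np.log(np.linalg.det(P)) - np.sum(np.log(wstar)) \
          + np.log(np.linalg.det(M))
print(np.isclose(logdetV, np.log(np.linalg.det(V))))  # True
```

Both identities are instances of the Woodbury formula and the matching determinant lemma applied to V = (W*)^{-1} + ZP^{-1}Z′; only the small q × q matrix P + Z′W*Z need ever be factorized.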

Appendix B

Skeleton code to fit the mixed model (2.12) is given below. It is assumed that deaths and exposures are stored in vectors Dth and Exp, and that the fixed and random effects regression matrices are X and Z respectively. The function myglmmPQL is a copy of the R function glmmPQL in which the line mcall$method <- "ML" is replaced by mcall$method <- "REML".

  library(nlme)
  library(MASS)
  Id <- factor(rep(1, length(Dth)))
  data.fr <- groupedData(Dth ~ X[, -1] | rep(1, length = length(Dth)),
                         data = data.frame(Dth, X, Z, Exp))
  fit <- myglmmPQL(Dth ~ X[, -1] + offset(log(Exp)), data = data.fr,
                   random = list(Id = pdIdent(~Z - 1)), family = poisson)

Skeleton code to fit the penalized APC model (4.1) and (4.2) is given below. The fixed effects regression matrix is X and the random effects regression matrices are Z.a, Z.y and Z.c.

  Id <- factor(rep(1, length(Dth)))
  Z.block <- list(list(Id = pdIdent(~Z.a - 1)),
                  list(Id = pdIdent(~Z.y - 1)),
                  list(Id = pdIdent(~Z.c - 1)))
  Z.block <- unlist(Z.block, recursive = FALSE)
  data.fr <- groupedData(Dth ~ X[, -1] | rep(1, length = length(Dth)),
                         data = data.frame(Dth, X, Z.a, Z.y, Z.c, Exp))
  fit <- myglmmPQL(Dth ~ X[, -1] + offset(log(Exp)), data = data.fr,
                   random = Z.block, family = poisson)
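For readers working outside R, the inner step of the fitting — the penalized scoring solve, cf. the mixed model equations (2.11) — can be sketched directly. The example below simulates Poisson deaths with an exposure offset and fits a penalized model by iteratively reweighted least squares for a fixed smoothing parameter; it is a minimal sketch of the inner step only, not of the Schall updates for the λ's, and all data and dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 100, 2, 10
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])  # fixed effects
Z = rng.standard_normal((n, q))                          # random-effect basis (stand-in)
lam = 4.0                                                # smoothing parameter, held fixed
P = np.zeros((p + q, p + q))
P[p:, p:] = lam * np.eye(q)                              # penalty: fixed part unpenalized
C = np.hstack([X, Z])
e = np.full(n, 50.0)                                     # exposures
true_eta = X @ np.array([-2.0, 1.0])
d = rng.poisson(e * np.exp(true_eta))                    # simulated deaths

theta = np.zeros(p + q)
for _ in range(50):                                      # penalized IRLS / scoring
    eta = C @ theta
    mu = e * np.exp(eta)                                 # Poisson mean with offset log(e)
    zwork = eta + (d - mu) / mu                          # working variable
    W = np.diag(mu)                                      # working weights
    theta_new = np.linalg.solve(C.T @ W @ C + P, C.T @ W @ zwork)
    if np.max(np.abs(theta_new - theta)) < 1e-8:
        theta = theta_new
        break
    theta = theta_new

# Effective dimension: trace of the hat matrix, cf. (2.17) and (3.7)
H = C @ np.linalg.solve(C.T @ W @ C + P, C.T @ W)
print(round(np.trace(H), 2))
```

The trace of H lies between p (infinite penalty) and p + q (no penalty), which is the quantity partitioned across the age, year and cohort components in (3.7).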

[Figure 1 near here. Panels (c) and (d) show log(mortality) at age 65 with Npar = 23: DF = 23 (unpenalized) and DF = 6.5 (penalized).]

Figure 1: (a) B-spline basis; (b) transformed basis; (c) unpenalized regression: coefficients and data; (d) penalized regression: coefficients and data; (e) unpenalized and penalized coefficients in the transformed regression; (f) scaling of basis functions, φ_i^0.5, i = 3, ..., c.

[Figure 2 near here.]

Figure 2: Age-Period-Cohort model: detrended plots of mean log(mortality) by (a) age, (b) year and (c) cohort; (d) observed and fitted log(mortality) at age 65.

[Figure 3 near here.]

Figure 3: Age-Period-Cohort model with individual random effects: (a) profile residual log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) against log κ; (b) and (c) observed and fitted log(mortality), Xβ̂ + Zα̂, and individual random effects, γ̂, at ages 40 and 60; (d) numbers of deaths at ages 40 and 60.