Smoothing Age-Period-Cohort models with P-splines: a mixed model approach
Running headline: Smooth Age-Period-Cohort models

I D Currie, Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK.

J G Kirkby, Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, UK.

M Durban, Departamento de Estadistica y Econometria, Universidad Carlos III de Madrid, Edificio Torres Quevedo, Leganes, Madrid, Spain.

P H C Eilers, Department of Methodology and Statistics, Utrecht University, 3508 TC Utrecht, The Netherlands.

Responsible author: I D Currie, I.D.Currie@hw.ac.uk

Abstract: We use smoothing with B-splines and penalties, the P-spline method of Eilers and Marx (1996), to smooth the Age-Period-Cohort (APC) model of mortality. We describe how smoothing with penalties in one dimension allows a mixed model approach to be used. We apply this method to the APC model and show that penalization gives a way of dealing with identifiability problems in the discrete APC model and leads to a mixed model representation of the model. We show that individual random effects can be used to model overdispersion and that this can also be achieved within the mixed model framework. We illustrate our methods with some mortality data provided by the UK insurance industry.

Keywords: Age-period-cohort; identifiability; mixed models; mortality; overdispersion; P-splines; Schall's algorithm; smoothing.

File name: /talks/london.2006/apc.paper/paper.tex: 20 June
1 Introduction

We suppose that we have mortality data arranged in a two-way table classified by age at death and year of death. Age-Period-Cohort (APC) models are an important class of models in the study of mortality in such tables and, more generally, of disease incidence. A difficulty with the APC model is the choice of parameterization since the model is in general not identifiable. Clayton and Schifflers (1987) give a careful discussion of parameterization in APC models and sound warnings about the dangers of over-interpreting the fitted parameters; they are equally sceptical about the wisdom of forecasting by extrapolating parameter values. Holford (1983) and Carstensen (2007) also discuss the APC model with particular reference to the problems caused by non-identifiability. The discussion in these papers revolves around the properties of different parameterizations. Our approach is different: we force smoothness on the fitted model by penalizing differences between adjacent coefficients. Penalization does two things: first, it replaces the usual identifiability constraints, and second, it allows the APC model to be cast in a mixed model framework. The purpose of this paper is to explore and illustrate these ideas. Smoothing with P-splines was introduced by Eilers and Marx (1996) and here we apply the method to smooth the APC model. Smooth versions of the APC model have already appeared in the literature. Heuer (1997) used restricted cubic splines or natural splines to give a smooth version of the APC model; he also included interactions in his model by using the Kronecker product of the spline functions. An important difference between Heuer's approach and ours is that in Heuer's paper smoothness at the edges is produced by modifying the B-spline basis (natural splines) whereas in our case smoothness is produced by the use of penalties. Ogata et al. (2000) used splines in a Bayesian framework and also produced a smooth version of the APC model.
The plan of the paper is as follows. In section 2 we explain the P-spline approach in one dimension and describe a transformation of the B-spline basis which allows the model to be expressed as a mixed model; Schall's algorithm (Schall, 1991) for fitting a generalized linear mixed model is described. In section 3 we apply our transformation to the APC model and show that the transformation deals with the identifiability problems that arise in the APC model. The mixed model representation has an interpretation as an additive model where the fixed component is a plane and there are three random components which correspond to the age, period and cohort effects. There is evidence of overdispersion and in section 4 we follow Perperoglou and Eilers (unpublished) and use individual random effects to model overdispersion; the mixed
model is a natural setting for this model and Schall's algorithm copes well with the computational challenge of model fitting. We use some mortality data provided by the UK insurance industry to illustrate our methods throughout. The paper ends with a short discussion.

2 Smoothing in one dimension with P-splines

Smoothing with P-splines, introduced by Eilers and Marx (1996), is based on two ideas: (a) use B-splines as the basis for a regression, and (b) use a difference penalty on adjacent regression coefficients to ensure smoothness. Estimation is by penalized likelihood. The method has two attractive features which follow from these ideas: (a) the regression nature of P-splines means that it is straightforward to introduce smooth terms into a larger regression model, and (b) the difference penalty means that the familiar least squares (LS) solution in a normal model and, more generally, the iterative weighted least squares (IWLS) algorithm in a generalized linear model (GLM) apply in the P-spline setting. A number of papers contain descriptions of the method (Eilers and Marx, 1996; Marx and Eilers, 1998; Currie et al., 2004). We present a short introduction. In section 2.1 we describe P-spline smoothing with a B-spline basis and then in section 2.2 we describe a transformation of the B-spline basis which gives an alternative way of fitting the P-spline model. This new basis has two advantages: first, it allows us to use (generalized) linear mixed model methods and second, it enables us to deal with identifiability problems in more complex models. We describe the mixed model approach in section 2.3. This approach allows simple fitting with standard software and has the further advantage that it enables us to deal easily with overdispersion.

2.1 P-splines with a B-spline basis

Our introduction is set in the context of Poisson errors, appropriate for modelling mortality data.
We suppose we have data (d_i, e_i, x_i), i = 1, ..., n, on a set of lives all aged sixty-five, say, where d_i is the number of deaths in year x_i and e_i is the exposed to risk. Let d = (d_1, ..., d_n), e = (e_1, ..., e_n), etc. We suppose that the number of deaths d_i is a realization of a Poisson distribution with mean µ_i = e_i θ_i where θ_i is the force of mortality or hazard function in year x_i. We seek a smooth estimate of θ = (θ_i). A natural approach is to fit a GLM with Poisson errors, i.e., log µ = log e + log θ = log e + Xa, where X = X(x), the regression matrix, is a function of year x and log e is an offset in the linear predictor. It seems unlikely that a polynomial basis will be suitable for modelling the variability in θ and a more flexible basis is provided by a set of B-splines {B_1(x), ..., B_c(x)}; such a basis is shown in the upper left panel of Fig. 1 for c = 7.
We are still in the framework of classical regression with regression matrix B = B(x), say, where the rows of B are the values of the B-splines in the basis evaluated at each year in turn. We use some mortality data provided by the Continuous Mortality Investigation (CMI) on claim incidence in the UK insurance business to illustrate our methods. The data run from ages 20 to 90 and years 1947 to 2002; for more information on these data see Currie et al. (2004). The middle left panel of Fig. 1 shows a plot of the logarithm of the raw forces of mortality, θ̂_i = d_i/e_i, for the sixty-five year old policy-holders in the data set. We fit an unpenalized GLM with c = 23 cubic B-splines in the basis and the resulting fit is also shown. It seems that the data have been undersmoothed; conversely, if there are too few B-splines in the basis, then the data will be oversmoothed. Thus, one approach is to optimise the number, and possibly position, of splines in the basis (Friedman and Silverman, 1989). An alternative is to consider the behaviour, not of the fitted curve, but of the fitted coefficients, â_k. The c = 23 values of â_k are plotted at their corresponding knot positions, and we see that the erratic nature of the fitted curve is a consequence of similar behaviour of the â_k. The P-spline solution (Eilers and Marx, 1996) to this problem is to use a rich basis of B-splines and then ensure smoothness of the fitted curve by penalizing the resulting roughness in the â_k with a difference penalty. For example, the second order penalty (which we will use throughout this paper) is given by

(a_1 − 2a_2 + a_3)² + ... + (a_{c−2} − 2a_{c−1} + a_c)² = a′D′Da    (2.1)

where D is a difference matrix of order 2; first and third order penalties can also be used.
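The penalty (2.1) is straightforward to compute. The sketch below (Python/numpy here, although the paper's own skeleton code is in R; the coefficient vector is purely illustrative) builds the order-2 difference matrix D and checks that a′D′Da equals the explicit sum of squared second differences.

```python
import numpy as np

def diff_matrix(c, order=2):
    # Difference matrix D of the given order: D @ a gives order-th differences of a.
    D = np.eye(c)
    for _ in range(order):
        D = np.diff(D, axis=0)
    return D

c = 7
D = diff_matrix(c, order=2)          # (c-2) x c for a second-order penalty
a = np.arange(c, dtype=float) ** 2   # illustrative coefficients
penalty = a @ D.T @ D @ a            # a'D'Da as in (2.1)
# The same quantity written out as a sum of squared second differences:
direct = sum((a[k] - 2*a[k+1] + a[k+2])**2 for k in range(c - 2))
print(D.shape, penalty, direct)
```

For a quadratic sequence of coefficients all second differences are constant, so the two computations agree exactly; first or third order penalties follow by changing `order`.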
The log-likelihood function is modified by the penalty function and a is estimated by maximising the penalized log-likelihood

l_p = l(a; y) − (1/2) a′Pa    (2.2)

where l(a; y) is the usual log-likelihood for a GLM, P = P_B = λD′D is the penalty matrix and λ is the smoothing parameter. The suffix B, as in P_B, indicates the associated basis and emphasizes that the penalty depends on the choice of basis. Other bases and penalties will be introduced below but when the context allows we will suppress the suffix and write the penalty simply as P. Maximizing (2.2) gives the penalized likelihood equations

B′(y − µ) = Pa    (2.3)

which, conditional on the value of the smoothing parameter λ, can be solved with

(B′W̃B + P)â = B′W̃z̃,    (2.4)
the penalized version of the scoring algorithm; here B is the regression matrix, P is the penalty matrix, the tilde as in ã denotes a current estimate, and similarly for µ̃, z̃ = Bã + W̃⁻¹(y − µ̃), the working variable, and W̃ = diag(µ̃), the diagonal matrix of weights, while â denotes the updated estimate of a. The hat-matrix is

H = B(B′ŴB + P)⁻¹B′Ŵ    (2.5)

and the trace of the hat-matrix, Tr(H), a measure of the effective dimension, ED, or effective degrees of freedom of the model (Hastie and Tibshirani, 1990, p. 52), is

ED = Tr(H) = Tr[(B′ŴB + P)⁻¹B′ŴB];    (2.6)

a convenient alternative to (2.6) is

ED = Tr(H) = c − Tr[(B′ŴB + P)⁻¹P]    (2.7)

where c is the number of columns in B. Standard errors for â can be computed from

Var(â) ≈ (B′W̃B + P)⁻¹.    (2.8)

Wahba (1983) and Silverman (1985) used a Bayesian argument to derive (2.8); see also Wood (2006, section 4.8) for a good discussion. We also note that in the extreme cases, λ = 0 and λ = ∞, (2.8) reduces to familiar results: we get the usual asymptotic variance in an unpenalized GLM when λ = 0; when λ → ∞ the limiting fit is a straight line (on the log scale) and the variance in (2.8) reduces to the variance when the linear predictor is linear in age. (Here we assume that a second order penalty and B-splines of degree at least two are used; see Eilers and Marx (1996).) There remains the choice of the smoothing parameter. In this paper we will use mixed model methods to select the smoothing parameter but there are other possibilities: the Akaike Information Criterion (AIC) (Akaike, 1973), the Bayesian Information Criterion (BIC) (Schwarz, 1978) or Generalised Cross Validation (GCV) (Craven and Wahba, 1979), for example. The right middle panel of Fig. 1 shows the result of using BIC to select the smoothing parameter, where BIC is defined as

BIC = Dev + log(n) Tr(H)    (2.9)

where Dev is the deviance in a GLM and n is the number of observations.
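The penalized scoring step (2.4) and the trace identities (2.6) and (2.7) can be sketched together. The toy example below uses Python/numpy with simulated Poisson counts and, for self-containedness, a generic bump basis standing in for B-splines; the data, basis and the fixed λ are all invented for illustration and are not the CMI example of the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: Poisson counts with a smooth log-rate and an exposure offset.
n = 60
x = np.linspace(0, 1, n)
e = np.full(n, 500.0)                         # exposures
theta = np.exp(-2.0 + np.sin(2 * np.pi * x))  # true force of mortality
y = rng.poisson(e * theta)

# A generic smooth basis standing in for B-splines (Gaussian bumps), with a
# second-order difference penalty P = lambda * D'D and a fixed lambda.
knots = np.linspace(0, 1, 12)
B = np.exp(-0.5 * ((x[:, None] - knots[None, :]) / 0.12) ** 2)
c = B.shape[1]
D = np.diff(np.eye(c), n=2, axis=0)
lam = 10.0
P = lam * D.T @ D

# Penalized IWLS: iterate (B'WB + P) a = B'W z as in (2.4).
a = np.zeros(c)
for _ in range(100):
    mu = e * np.exp(B @ a)                    # mean, with offset log(e)
    W = mu                                    # Poisson weights, diag(mu)
    z = B @ a + (y - mu) / mu                 # working variable (offset removed)
    a_new = np.linalg.solve(B.T @ (W[:, None] * B) + P, B.T @ (W * z))
    if np.max(np.abs(a_new - a)) < 1e-10:
        a = a_new
        break
    a = a_new

fitted = np.exp(B @ a)                        # fitted force of mortality

# Effective dimension via (2.6) and via the shortcut (2.7): they must agree.
M = np.linalg.inv(B.T @ (W[:, None] * B) + P)
ed1 = np.trace(M @ B.T @ (W[:, None] * B))    # (2.6)
ed2 = c - np.trace(M @ P)                     # (2.7)
print(ed1, ed2)
```

The identity behind (2.7) is Tr[M B′ŴB] = Tr[M (B′ŴB + P)] − Tr[M P] = c − Tr[M P], which the last lines verify numerically.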
We still have c = 23 B-splines in the basis but with a second order penalty and λ chosen by BIC the degrees of freedom are reduced from 23 to about 6.5. The fitted coefficients also exhibit smoothness and demonstrate an important difference between smoothing with natural splines and P-splines: in
the former, smoothness at the edges is ensured by the use of splines linear in the tails while in the latter, B-splines are used throughout the basis and smoothness is ensured by the penalty.

2.2 P-splines with a transformed basis

The fitted coefficients with a B-spline basis have an attractive property: each coefficient is associated with a B-spline and the estimated value of the coefficient is approximately a weighted average of the observations in the vicinity of this B-spline, as the middle panels of Fig. 1 demonstrate. The second order penalty penalizes departures from linearity. An alternative strategy is to extract a linear component from the fitted trend and fit the remaining variation by fitting a smooth curve with a penalty that penalizes departures from zero. This approach has echoes of a mixed model where trend is split into a fixed part (the linear part) and a random part (the curved part); Green (1985) is an early reference to this idea which is also discussed by Verbyla et al. (1999). A transformation that achieves this decomposition with a B-spline basis and second order penalty is given by Eilers (1999) in the discussion of the Verbyla et al. paper; see also Currie et al. (2006). Such transformations not only give access to (generalized) linear mixed model methods but also allow us to deal with problems of identifiability, as we will see in section 3. Welham et al. (2007) give a comprehensive review of mixed model representations of spline models. Let B = B(x), n × c, be the regression matrix of B-splines and let D′D, c × c, define the penalty matrix. Let UΦU′ be the singular value decomposition of D′D where Φ is the diagonal matrix consisting of the eigenvalues φ_1, φ_2, ..., φ_c of D′D in ascending order. We assume that a second order penalty is used so φ_1 = φ_2 = 0.
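The decomposition developed in this and the next paragraph is easy to verify numerically: with U and Φ from the eigendecomposition of D′D, the fitted values Bθ and Xβ + Zα coincide and the penalty θ′D′Dθ collapses to α′α. A sketch (Python/numpy rather than the paper's R; B is a random stand-in for a B-spline regression matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

# Eigendecomposition of D'D; for a second-order penalty the two smallest
# eigenvalues are (numerically) zero: the null space of D'D.
c = 9
D = np.diff(np.eye(c), n=2, axis=0)
phi, U = np.linalg.eigh(D.T @ D)     # eigenvalues in ascending order
Un, Us = U[:, :2], U[:, 2:]
Phi_pos = phi[2:]                    # the c-2 positive eigenvalues

B = rng.normal(size=(30, c))         # stand-in regression matrix
X = B @ Un
Z = B @ Us / np.sqrt(Phi_pos)        # Z = B Us (Phi+)^(-1/2)

theta = rng.normal(size=c)
beta = Un.T @ theta
alpha = np.sqrt(Phi_pos) * (Us.T @ theta)

lhs = B @ theta
rhs = X @ beta + Z @ alpha           # identical fitted values
pen1 = theta @ D.T @ D @ theta
pen2 = alpha @ alpha                 # the penalty becomes alpha'alpha
print(np.max(np.abs(lhs - rhs)), pen1, pen2)
```

Note the exponents: Z carries the factor (Φ⁺)^{−1/2} while α carries (Φ⁺)^{1/2}, so the two factors cancel in Zα and the quadratic form θ′UΦU′θ reduces to α′α.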
Now a linear function is in the null space of D′D so we can take U_n = [1̃ : x̃], c × 2, as an orthogonal basis for the null space where 1̃ is (1, ..., 1)′/√c and x̃ is (1, 2, ..., c)′ centred and scaled to have unit length. Let U_s, c × (c − 2), be the submatrix of U corresponding to the c − 2 non-zero eigenvalues. We take U = [U_n : U_s] and transform Bθ = Xβ + Zα, say, where X = BU_n and Z = BU_s(Φ⁺)^{−1/2}, and β = U_n′θ and α = (Φ⁺)^{1/2}U_s′θ; here Φ⁺ is the diagonal matrix consisting of the c − 2 positive eigenvalues in Φ. With this transformation, the penalty θ′D′Dθ = α′α. Furthermore, since β is unpenalized we may replace X = BU_n = B[1̃ : x̃] by [1, x] where 1 is a vector of 1's of length n and x is the vector of year values. With these definitions in place the following estimation procedures
in (2.4) are equivalent:

Regression matrix: B = B(x) → [X : Z], X = [1, x], Z = BU_s(Φ⁺)^{−1/2}    (2.10)

Penalty matrix: P_B = λD′D → P_F = λ blockdiag[O_2, I_{c−2}]    (2.11)

where O_2 is a 2 × 2 matrix of zeros and I_{c−2} is the identity matrix of size c − 2. The linear part, Xβ, is unpenalized, while the non-linear part, Zα, is penalized or shrunk towards zero. We interpret this representation as a mixed model with Xβ as the fixed part and Zα as the random part in section 2.3; see also Currie et al. (2006). Figure 1 explains how the new basis works. With c = 7 B-splines there are five basis functions in Z, as shown in the upper right panel. These new basis functions are very different from the original B-splines; first, they are no longer local functions and second, the high frequency functions have low amplitude (a consequence of the scaling by (Φ⁺)^{−1/2} shown in the lower right panel). The lower left panel shows the values of α̂, c = 23, from the unpenalized fit and the penalized fit; the shrinkage of the penalized estimates towards zero is evident. It is important to realise that although the amplitudes of the basis functions differ greatly their coefficients are equally penalized. Indeed, it would be possible to remove the high frequency/low amplitude basis functions from Z with little effect on the resulting fit, an idea exploited by Wood (2003).

2.3 A mixed model representation

Equations (2.10) and (2.11) say that fitting the penalized GLM with regression matrix B and penalty matrix P_B = λD′D is equivalent to fitting the penalized GLM with regression matrix [X : Z] and penalty matrix P_F = λ blockdiag[O_2, I_{c−2}]. With this second representation the scoring algorithm (2.4) becomes

[X′W̃X   X′W̃Z     ] [β̂]   [X′W̃]
[Z′W̃X   Z′W̃Z + λI] [α̂] = [Z′W̃] z̃,    (2.12)

where I = I_{c−2}, z̃ = Xβ̃ + Zα̃ + W̃⁻¹(y − µ̃) and W̃ = diag(µ̃) with µ̃ = e exp(Xβ̃ + Zα̃) in the Poisson case.
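For a concrete toy version of this machinery, the sketch below solves the system (2.12) in the simplest Gaussian case (W = I, no dispersion step) and updates λ with the fixed-point step of Schall's algorithm, equation (2.17) below. It is Python/numpy with simulated data and a bump basis standing in for B-splines; all names and settings are invented for illustration and this is not the authors' Appendix B code.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated Gaussian data and a generic smooth basis.
n, c = 80, 12
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
knots = np.linspace(0, 1, c)
B = np.exp(-0.5 * ((x[:, None] - knots[None, :]) / 0.1) ** 2)

# Transformed basis of section 2.2: fixed part X, random part Z.
D = np.diff(np.eye(c), n=2, axis=0)
phi, U = np.linalg.eigh(D.T @ D)
X = B @ U[:, :2]
Z = B @ U[:, 2:] / np.sqrt(phi[2:])

# Schall-type fixed-point iteration for lambda.
lam = 1.0
for _ in range(100):
    CtC = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.eye(c - 2)]])
    sol = np.linalg.solve(CtC, np.concatenate([X.T @ y, Z.T @ y]))
    alpha = sol[2:]
    T = np.linalg.inv(CtC)[2:, 2:]            # lower block of (C'C)^(-1)
    v = lam * np.trace(T)
    lam_new = (c - 2 - v) / (alpha @ alpha)   # invert the update (2.17)
    if abs(lam_new - lam) < 1e-8 * lam:
        lam = lam_new
        break
    lam = lam_new

ed = 2 + (c - 2 - v)                          # effective dimension, cf. (2.18)
print(lam, ed)
```

The effective dimension stays strictly between 2 (the unpenalized linear part) and c, and the iteration needs no derivatives of the residual likelihood, which is what makes the scheme attractive.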
We recognise (2.12) as the mixed model equations for the linear mixed model

z = Xβ + Zα + ε,  α ∼ N(0, λ⁻¹I),  ε ∼ N(0, W⁻¹);    (2.13)

see Searle et al. (1992), p. 276. Smoothing parameters may be selected by maximizing the residual log-likelihood

ℓ(λ) = −(1/2) log|V| − (1/2) log|X′V⁻¹X| − (1/2) z̃′(V⁻¹ − V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹)z̃    (2.14)
where

V = W̃⁻¹ + ZGZ′    (2.15)

and G = λ⁻¹I is the variance of the random effects. We now iterate between (2.12) and (2.14). We will return to this method in section 4 but in the present case we use the proposal of Schall (1991) for the estimation of β, α and the smoothing parameter λ in a generalized linear mixed model. With the same notation as Schall we let

C′C = [X′W̃X   X′W̃Z      ]
      [Z′W̃X   Z′W̃Z + λ̃I ]    (2.16)

and define T to be the lower (c − 2) × (c − 2) block of the inverse of C′C (corresponding to α). Schall's algorithm is

1. for given β = β̃, α = α̃ and λ = λ̃ estimate β and α from (2.12), and
2. for given β = β̃, α = α̃ and λ = λ̃ estimate λ from

λ̂⁻¹ = α̃′α̃ / (c − 2 − v)    (2.17)

where v = λ̃ Tr(T). This fixed point iteration scheme yields approximate residual maximum likelihood (REML) estimates. We have found Schall's algorithm to provide an efficient solution. Approximate maximum likelihood estimates can be obtained by defining T to be the inverse of the lower (c − 2) × (c − 2) block of C′C. It follows from (2.7), the form of P_F in (2.11) and the definition of T that the effective dimension of the fitted model can be written

ED = 2 + (c − 2 − v)    (2.18)

where 2 is the number of fixed effects, c − 2 is the number of random effects and v = λ̂ Tr(T). We can interpret c − 2 − v as the effective degrees of freedom of the non-linear component of the effect of year. In the example in section 2.1 we have c − 2 − v = 5.1 with total ED = 7.1, slightly less smoothing than obtained with λ chosen by BIC when ED = 6.5. The decomposition (2.18) extends to the smooth APC model presented in the next section (see equation (3.7)). Overdispersion is a common problem with Poisson models. If Var(y) = σ²µ then Schall's estimate of σ² reduces to

σ̂² = (y − µ̂)′Ŵ⁻¹(y − µ̂) / (n − c + v)    (2.19)
which in our present example gives σ̂² = 1.20; there is little evidence of serious overdispersion. In general, it is preferable to incorporate overdispersion directly into the estimation process and the mixed model approach enables this to happen in a natural way. The mixed model (2.13) becomes

z = Xβ + Zα + ε,  α ∼ N(0, λ⁻¹I),  ε ∼ N(0, σ²W⁻¹)    (2.20)

and Schall's algorithm is modified as follows: in step 1, replace W̃ by σ̃⁻²W̃, and add step 3, estimate σ² from (2.19). In this example there is little change: we find a fitted model with a slightly lower effective dimension of 7.01 and estimated overdispersion of σ̂² = . The model may also be fitted with standard software. We use the glmmPQL( ) function of R (R Development Core Team, 2004) in the MASS library of Venables and Ripley (2002). The fitted model has effective dimension of 6.64 (computed from (2.6) with W replaced by σ⁻²W) and estimated overdispersion of σ̂² = . We give some skeleton R code in Appendix B.

3 Smooth Age-Period-Cohort models

We suppose that we have data matrices Y and E, both n_a × n_y, of deaths and exposures respectively. The rows of Y and E are indexed by age at death x_a and the columns by year of death x_y. The classical approach to the APC model is the factor model in which the variation in the force of mortality, θ_ijk at age i in year j for cohort k, is decomposed into three components:

log θ_ijk = α_i + β_j + γ_k,  i = 1, ..., n_a,  j = 1, ..., n_y,  k = 1, ..., n_a + n_y − 1    (3.1)

where α_i, β_j and γ_k are the age, period (year) and cohort effects respectively. With Poisson errors, this is a GLM so is easily fitted with standard software such as R. However, there is a difficulty with the interpretation of the fitted parameters since of the 2n_a + 2n_y − 1 parameters in (3.1) only 2n_a + 2n_y − 4 are identifiable; see Clayton and Schifflers (1987) for a careful discussion of the dangers of over-interpretation of the fitted parameters.
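The identifiability count just quoted can be checked directly: the indicator design of (3.1) has 2n_a + 2n_y − 1 columns but rank 2n_a + 2n_y − 4, a rank deficiency of 3 (two redundant intercepts among the three factors plus the linear relation cohort = period − age). A small check in Python/numpy with illustrative grid sizes:

```python
import numpy as np

# Factor APC design (3.1): one indicator column per age, period and cohort
# level, on a small illustrative grid.
na, ny = 5, 7
age, year = np.meshgrid(np.arange(na), np.arange(ny), indexing='ij')
age, year = age.ravel(), year.ravel()
cohort = year - age + (na - 1)               # cohort index, 0 .. na+ny-2

def indicators(f, levels):
    return (f[:, None] == np.arange(levels)[None, :]).astype(float)

M = np.hstack([indicators(age, na),
               indicators(year, ny),
               indicators(cohort, na + ny - 1)])

p = M.shape[1]                               # 2*na + 2*ny - 1 columns
rank = np.linalg.matrix_rank(M)
print(p, rank, p - rank)                     # rank deficiency of 3
```

The deficiency is independent of the grid size, which is why conventional fits of (3.1) need three constraints, not just the usual one per factor.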
Instead of trying to interpret the fitted parameters we consider the fitted log(mortality) surface which is unique. The upper left panel in Fig. 2 shows the mean fitted log(mortality) by age with the linear effect of age removed, for the data, for the factor model, and for the smooth model described below; the corresponding plots for year and cohort are also given. These plots suggest that a smooth model in age, year and cohort is a natural alternative to the discrete factor model. The smoothness assumption is deceptive since we will see that this alone is sufficient to deal with the identifiability constraints. A smooth model may also deal with another problem with the APC
model: the cohort parameters which correspond to the oldest and youngest cohorts tend to be poorly estimated, a consequence of the small numbers of cells which contribute to estimates of the corner cohort parameters; the parameter estimates corresponding to the youngest cohorts in the CMI dataset are particularly unstable. A smooth model should help to deal with this instability. We assume that the parameters α, β and γ in (3.1) are smooth and define a smooth APC model as follows. Let M_a, M_y and M_c be the n_a × n_y matrices with entries age at death, year of death and year of birth (cohort) and let x_a = vec(M_a), x_y = vec(M_y) and x_c = vec(M_c). Let B_a = B(x_a) be the regression matrix of B-splines based on x_a with similar definitions for B_y and B_c. We define a smooth APC regression matrix by

B = [B_a : B_y : B_c]    (3.2)

with corresponding coefficients a = (a_a′, a_y′, a_c′)′. We impose smoothness on the coefficients a_a, a_y and a_c by the block diagonal penalty matrix

P = blockdiag[λ_a D_a′D_a, λ_y D_y′D_y, λ_c D_c′D_c]    (3.3)

where D_a, D_y and D_c are second order difference matrices and λ_a, λ_y and λ_c are the smoothing parameters for the age, year and cohort parameters respectively. The model defined by (3.2) and (3.3) is a generalized additive model (GAM) (Hastie and Tibshirani, 1990) but instead of using back-fitting we fit directly with (2.4). However, the regression matrix in (3.2) is not of full rank so some care is required. There are a number of possibilities: we could use a small ridge penalty on the system of equations, as in Marx and Eilers (1998), or we could use a generalized inverse. A third possibility is to transform B to a non-singular basis; the transformation developed in section 2.2 enables us to extract the linear components of age, year and cohort, i.e., a plane in the age-year space. An important point is that all three methods give exactly the same fitted values.
Let X_a = [1, x_a], X_y = [1, x_y] and X_c = [1, x_c] be the n_a n_y × 2 matrices corresponding to (2.10). Then removing the linear dependencies among the columns of X_a, X_y and X_c we obtain the X matrix in the transformed model as

X = [1 : x_a : x_y].    (3.4)

Note that although X is not unique the space spanned by X is, since this space equals the null space of P in (3.3). The Z matrix is given by

Z = [Z_a : Z_y : Z_c]    (3.5)
where, for example, Z_a = B_a U_{a:s}(Φ_a⁺)^{−1/2}, and U_{a:s} and Φ_a⁺ are obtained from the singular value decomposition of D_a′D_a as in section 2.2. Lastly, with the new regression matrix defined as [X : Z], the penalty transforms into

P = blockdiag[O_3, λ_a I_{c_a−2}, λ_y I_{c_y−2}, λ_c I_{c_c−2}]    (3.6)

where O_3 is the 3 × 3 matrix of 0's and c_a − 2 is the column dimension of Z_a, etc. The model may now be fitted as in section 2.3 with fixed regression matrix given by (3.4), random regression matrix by (3.5) and penalty matrix by (3.6). We fit the smooth APC model with B_a, n × 10, B_y, n × 13 and B_c, n × 28, where n = 3976, i.e., c_a = 10, c_y = 13 and c_c = 28. Schall's algorithm, (2.16) and (2.17), extends as follows: let T_a be the (c_a − 2) × (c_a − 2) block of the inverse of C′C which corresponds to the Z_a coefficients; we take similar definitions for T_y and T_c for Z_y and Z_c. Fitting the Poisson model without overdispersion we find with REML that the dimension of the fitted model is reduced from 249 for the factor model to an effective dimension of about . Generalizing (2.18) we write

ED = 3 + (c_a − 2 − v_a) + (c_y − 2 − v_y) + (c_c − 2 − v_c)    (3.7)

where there are three fixed effects, c_a − 2 is the column dimension of Z_a and v_a = λ_a Tr(T_a), etc. The non-linear components of the effects of age, year and cohort are c_a − 2 − v_a = 7.7, c_y − 2 − v_y = 8.1 and c_c − 2 − v_c = 17.0 respectively in the present example. The estimate of overdispersion using (2.19) is σ̂² = 2.00, evidence of some overdispersion. We refit the model with σ² included as part of the estimation process. With overdispersion included in the estimation process we would expect heavier smoothing since the smoothed surface will be less inclined to follow the local behaviour of the observed mortality surface. The effective dimension is further reduced to about 33.8; the estimated value of σ² is 2.00, as before. The resulting detrended mean log(mortality) curves have been added to Fig.
2; the fitted log(mortality) is also shown for age 65. Skeleton R code is provided in Appendix B. In the previous paragraph we described overdispersion as a variance effect. However, with mortality data this approach ignores effects such as cold winters which can inflate death rates. In the next section we use the approach of Perperoglou and Eilers (unpublished) where overdispersion is viewed not as a variance problem but as a problem with the linear structure of the model. They suggest the addition of individual random effects to the linear predictor as a way of dealing with the lack of fit that is otherwise modelled with overdispersion.
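The linear dependencies behind (3.4) are easy to exhibit numerically: since cohort = year − age, the six columns of [X_a : X_y : X_c] span only a three-dimensional space, the plane [1 : x_a : x_y]. A toy check (Python/numpy, illustrative grid):

```python
import numpy as np

# The fixed part of the smooth APC model: the six columns [1, xa, 1, xy, 1, xc]
# are linearly dependent because xc = xy - xa, so only a plane survives, as
# retained in (3.4).
na, ny = 5, 7
A, Y = np.meshgrid(np.arange(na, dtype=float), np.arange(ny, dtype=float),
                   indexing='ij')
xa, xy = A.ravel(), Y.ravel()
xc = xy - xa                                  # year-of-birth (cohort) index
one = np.ones_like(xa)

full = np.column_stack([one, xa, one, xy, one, xc])
X = np.column_stack([one, xa, xy])            # (3.4)
print(np.linalg.matrix_rank(full), np.linalg.matrix_rank(X))
```

Both matrices have rank 3: dropping the redundant columns changes the column space not at all, which is why any full-rank choice of X spanning the plane gives identical fitted values.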
4 Overdispersion as individual random effects

In the previous section we showed that the linear predictor for the APC model has a mixed model representation Xβ + Zα where X and Z are defined in (3.4) and (3.5) respectively. Perperoglou and Eilers (unpublished) modified the linear predictor by the addition of individual random effects to give

Xβ + Zα + γ    (4.1)

where the length of γ is the same as the number of observations, n, say. Thus, the model has more parameters than observations but a ridge penalty on γ maintains identifiability and shrinks γ towards zero; the penalty (3.6) becomes

P = blockdiag[O_3, λ_a I_{c_a−2}, λ_y I_{c_y−2}, λ_c I_{c_c−2}, κI_n] = blockdiag[O_3, P, κI_n],    (4.2)

say. We have a mixed model where the variance of the random effects α and γ is given by

G = blockdiag[λ_a⁻¹ I_{c_a−2}, λ_y⁻¹ I_{c_y−2}, λ_c⁻¹ I_{c_c−2}, κ⁻¹I_n] = blockdiag[P⁻¹, κ⁻¹I_n].    (4.3)

The mixed model equations (2.12) become

[X′W̃X   X′W̃Z      X′W̃      ] [β̂]   [X′W̃]
[Z′W̃X   Z′W̃Z + P  Z′W̃      ] [α̂] = [Z′W̃] z̃.    (4.4)
[W̃X     W̃Z        W̃ + κI_n ] [γ̂]   [W̃  ]

This is a very large system of equations but Perperoglou and Eilers (unpublished) provide a device which facilitates its solution. We define a modified weight matrix

W* = κ(W̃ + κI_n)⁻¹W̃    (4.5)

and solve (4.4) for γ̂ to get

κγ̂ = W*(z̃ − Xβ̂ − Zα̂)    (4.6)

from which it follows that (4.4) reduces to

[X′W*X   X′W*Z     ] [β̂]   [X′W*]
[Z′W*X   Z′W*Z + P ] [α̂] = [Z′W*] z̃.    (4.7)

This is the same system as obtained for the original smooth APC model but with the weight matrix W̃ replaced by W*. For given κ we optimize over the remaining parameters by using Schall's algorithm; κ is estimated by maximizing the profile residual log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) from (2.14). It is essential to avoid the inversion of large matrices such as the left hand side of (4.4) and some matrix identities to this end are provided in Appendix A.
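The reduction from (4.4) to (4.7) can be verified numerically: eliminating γ̂ with the modified weights W* leaves exactly the (β̂, α̂) block of the full solution. A toy check in Python/numpy (all matrices random, dimensions purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy dimensions: n observations, px fixed and pz penalized columns.
n, px, pz = 25, 3, 6
X = rng.normal(size=(n, px))
Z = rng.normal(size=(n, pz))
w = rng.uniform(0.5, 2.0, size=n)
W = np.diag(w)
P = 2.0 * np.eye(pz)                 # stand-in for the lambda blocks in (4.2)
kappa = 3.5
z = rng.normal(size=n)

# Full system (4.4): unknowns (beta, alpha, gamma).
top = np.block([[X.T @ W @ X, X.T @ W @ Z, X.T @ W],
                [Z.T @ W @ X, Z.T @ W @ Z + P, Z.T @ W],
                [W @ X, W @ Z, W + kappa * np.eye(n)]])
rhs = np.concatenate([X.T @ W @ z, Z.T @ W @ z, W @ z])
full = np.linalg.solve(top, rhs)

# Reduced system (4.7) with W* = kappa (W + kappa I)^(-1) W.
Ws = np.diag(kappa * w / (w + kappa))
red = np.block([[X.T @ Ws @ X, X.T @ Ws @ Z],
                [Z.T @ Ws @ X, Z.T @ Ws @ Z + P]])
small = np.linalg.solve(red, np.concatenate([X.T @ Ws @ z, Z.T @ Ws @ z]))

# The (beta, alpha) entries of the two solutions coincide.
print(np.max(np.abs(full[:px + pz] - small)))
```

The algebra behind the check: substituting γ̂ = (W̃ + κI)⁻¹W̃(z̃ − Xβ̂ − Zα̂) into the first two block rows of (4.4) turns every occurrence of W̃ − W̃(W̃ + κI)⁻¹W̃ into κW̃(W̃ + κI)⁻¹ = W*.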
Figure 3 shows the results of fitting the model. Figure 3 also shows the profile log-likelihood ℓ(λ̂_a, λ̂_y, λ̂_c, κ) plotted against log κ; evidently, the smoothing parameter which shrinks the individual random effects towards zero is sharply estimated. Values of the observed and smoothed log(mortality) together with the estimated individual effects are also shown for ages forty and sixty. There is a noticeable difference in the individual random effects at these ages. An explanation can be found in the lower right panel of Fig. 3 which gives the numbers of deaths at ages sixty and forty. Since Var(log(d/e)) ≈ 1/d the values of log(d/e) are a good estimate of the true underlying smooth log(mortality) at age sixty, but a poor estimate at age forty. It follows that the residuals log(d/e) − Xβ̂ − Zα̂ are almost entirely explained by the individual random effects at age sixty, while the stochastic element of the residual is substantial at age forty. Furthermore the individual random effects show systematic deviations between the data and the model at age forty, evidence of lack of fit at this age. The modified weight matrix W* in (4.5) deserves some comment. We note that W* is a diagonal matrix with entries w*_i = κw̃_i/(w̃_i + κ) with w̃_i = µ̃_i in the Poisson case considered here. Thurston et al. (2000) used a similar weight matrix in their algorithm to fit the negative binomial distribution, a distribution often used to model overdispersion; in their paper the weight w_i did not include the estimated random effect γ_i. For further comment on this point see Perperoglou and Eilers (unpublished).

5 Discussion

The model (4.1) with overdispersion involves choosing four smoothing parameters in the framework of a GLM with over four thousand linear parameters; Schall's algorithm combined with the modified weight method in (4.5) and (4.7) gives a low-footprint, efficient method of model fitting with simple direct coding.
Our conclusion is that Schall's (1991) algorithm is a simple and effective method of fitting in the mixed model setting. In this paper we have considered random effects acting at the individual age and year level. One other possibility arises as a result of such things as outbreaks of influenza or cold winters. Such effects can be modelled as smooth random effects which act on the mortality of a whole year. The individual random effects γ with length n = n_a n_y in (4.1) are replaced by annual random effects (I_{n_y} ⊗ B_a)γ where γ has length n_y c_s; here c_s is the column dimension of the B-spline basis B_a and ⊗ denotes the Kronecker product. Some initial results from this approach are reported in Kirkby and Currie (2007).
We have used B-splines and penalties to smooth the APC model. Transformation of the B-spline basis enables the model to be expressed as a mixed model which allows the modelling of overdispersion as individual random effects. The problem of identifiability is addressed with the same transformation. In conclusion we offer a unified approach for smoothing the APC model in a mixed model framework, dealing with non-identifiability in the APC model and modelling overdispersed counts.
References

Akaike H (1973) Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60.
Carstensen B (2007) Age-period-cohort models for the Lexis diagram. Statistics in Medicine, 26.
Clayton D and Schifflers E (1987) Models for temporal variation in cancer rates. II: Age-period-cohort models. Statistics in Medicine, 6.
Craven P and Wahba G (1979) Smoothing noisy data with spline functions. Numerische Mathematik, 31.
Currie ID, Durban M and Eilers PHC (2004) Smoothing and forecasting mortality rates. Statistical Modelling, 4.
Currie ID, Durban M and Eilers PHC (2006) Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society: Series B, 68.
Eilers PHC (1999) Discussion of "The analysis of designed experiments and longitudinal data by using smoothing splines" (by AP Verbyla, BR Cullis, MG Kenward and SJ Welham). Applied Statistics, 48.
Eilers PHC and Marx BD (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11.
Friedman JH and Silverman BW (1989) Flexible parsimonious smoothing and additive modeling. Technometrics, 31.
Green PJ (1985) Linear models for field trials, smoothing and cross-validation. Biometrika, 72.
Hastie TJ and Tibshirani RJ (1990) Generalized additive models. London: Chapman and Hall.
Heuer C (1997) Modeling of time trends and interactions in vital rates using restricted regression splines. Biometrics, 53.
Holford TR (1983) The estimation of age, period and cohort effects for vital rates. Biometrics, 39.
Kirkby JG and Currie ID (2007) Smooth models of mortality with period shocks. Proceedings of the 22nd International Workshop on Statistical Modelling, Barcelona, to appear.
Marx BD and Eilers PHC (1998) Direct generalized additive modeling with penalized likelihood. Computational Statistics and Data Analysis, 28.
Ogata Y, Katsura K, Keiding N, Holst C and Green A (2000) Empirical Bayes Age-Period-Cohort analysis of retrospective incidence data. Scandinavian Journal of Statistics, 27.
Perperoglou A and Eilers PHC. Overdispersion modelling with individual random effects. Unpublished manuscript.
R Development Core Team (2004) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Schall R (1991) Estimation in generalized linear models with random effects. Biometrika, 78.
Schwarz G (1978) Estimating the dimension of a model. Annals of Statistics, 6.
Searle SR, Casella G and McCulloch CE (1992) Variance components. New York: John Wiley & Sons.
Silverman BW (1985) Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with Discussion). Journal of the Royal Statistical Society: Series B, 47.
Thurston SW, Wand MP and Wiencke JK (2000) Negative binomial additive models. Biometrics, 56.
Venables WN and Ripley BD (2002) Modern Applied Statistics with S. New York: Springer-Verlag.
Verbyla AP, Cullis BR, Kenward MG and Welham SJ (1999) The analysis of designed experiments and longitudinal data by using smoothing splines (with discussion). Applied Statistics, 48.
Wahba G (1983) Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Series B, 45.
Welham SJ, Cullis BR, Kenward MG and Thompson R (2007) A comparison of mixed model splines for curve fitting. Australian and New Zealand Journal of Statistics, 49.
Wood SN (2003) Thin plate regression splines. Journal of the Royal Statistical Society: Series B, 65.
Wood SN (2006) Generalized additive models: an introduction with R. London: Chapman and Hall.
Appendix A

We provide some matrix identities which allow estimation in the smooth APC model with individual random effects, model (4.1) and (4.2). In (4.1) let

$$C'C = \begin{bmatrix} X'WX & X'WZ & X'W \\ Z'WX & Z'WZ + P & Z'W \\ WX & WZ & W + \kappa I_n \end{bmatrix}. \quad (5.1)$$

This matrix is $(c_a + c_y + c_c - 3 + n) \times (c_a + c_y + c_c - 3 + n)$. For given $\kappa$, Schall's algorithm requires the leading $(c_a + c_y + c_c - 3) \times (c_a + c_y + c_c - 3)$ block of $(C'C)^{-1}$. It follows from results on the inverse of partitioned matrices and the definition of $\tilde W$ in (4.5) that this matrix is given by

$$\begin{bmatrix} X'\tilde WX & X'\tilde WZ \\ Z'\tilde WX & Z'\tilde WZ + P \end{bmatrix}^{-1}, \quad (5.2)$$

the inverse of the matrix on the left-hand side of (4.7). The Schall estimation scheme, as in section 3, is now used (conditional on $\kappa$) to estimate the remaining parameters. To estimate $\kappa$ we compute the profile residual log-likelihood $\ell(\hat\lambda_a, \hat\lambda_y, \hat\lambda_c, \kappa)$ from

$$-\tfrac{1}{2}\log|V| - \tfrac{1}{2}\log|X'V^{-1}X| - \tfrac{1}{2}\, z'\bigl(V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}\bigr)z. \quad (5.3)$$

Now, with the variance of the random effects given by (4.3), we find

$$V = W^{-1} + [Z : I_n] \begin{bmatrix} P^{-1} & O \\ O & \kappa^{-1} I_n \end{bmatrix} [Z : I_n]' \quad (5.4)$$

$$\phantom{V} = \tilde W^{-1} + ZP^{-1}Z' \quad (5.5)$$

where $P$ is defined in (4.2) and $\tilde W^{-1} = W^{-1} + \kappa^{-1} I_n$. It follows that $V^{-1}$ and $|V|$ are

$$V^{-1} = \tilde W - \tilde W Z(P + Z'\tilde W Z)^{-1} Z'\tilde W \quad (5.6)$$

and

$$|V| = \bigl(\lambda_a^{c_a-2}\, \lambda_y^{c_y-2}\, \lambda_c^{c_c-2}\bigr)^{-1}\, |\tilde W^{-1}|\, |P + Z'\tilde W Z|. \quad (5.7)$$
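The identities (5.6) and (5.7) are the Woodbury inversion formula and the matching determinant factorization for $V = \tilde W^{-1} + ZP^{-1}Z'$, and are easy to check numerically. A minimal sketch (Python with NumPy for convenience; the matrices are random stand-ins, not the paper's data, and $|P|^{-1}$ plays the role of the factor $(\lambda_a^{c_a-2}\lambda_y^{c_y-2}\lambda_c^{c_c-2})^{-1}$):

```python
import numpy as np

# Random stand-ins for Wtilde, Z and P.  P is diagonal here; in the paper
# it is block-diagonal with blocks lambda_a I, lambda_y I, lambda_c I,
# so that |P| is the lambda product appearing in (5.7).
rng = np.random.default_rng(1)
n, q = 12, 5
W = np.diag(rng.uniform(0.5, 2.0, n))
Z = rng.normal(size=(n, q))
P = np.diag(rng.uniform(0.5, 2.0, q))

V = np.linalg.inv(W) + Z @ np.linalg.inv(P) @ Z.T

# (5.6): Woodbury form of the inverse.
Vinv = W - W @ Z @ np.linalg.inv(P + Z.T @ W @ Z) @ Z.T @ W
assert np.allclose(Vinv, np.linalg.inv(V))

# (5.7): determinant factorization, |V| = |P|^{-1} |W^{-1}| |P + Z'WZ|.
detV = np.linalg.det(np.linalg.inv(W)) / np.linalg.det(P) \
       * np.linalg.det(P + Z.T @ W @ Z)
assert np.allclose(detV, np.linalg.det(V))
```

Both identities avoid inverting the $n \times n$ matrix $V$ directly: only the much smaller $q \times q$ system $P + Z'\tilde W Z$ is inverted.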
Appendix B

Skeleton code to fit the mixed model (2.13) is given below. It is assumed that the deaths and exposures are stored in the vectors Dth and Exp, and that the fixed and random effects regression matrices are X and Z respectively. The function myglmmPQL is a copy of the R function glmmPQL in which the line mcall$method <- "ML" is replaced by mcall$method <- "REML".

    library(nlme)
    library(MASS)
    Id <- factor(rep(1, length(Dth)))
    data.fr <- groupedData(Dth ~ X[,-1] | rep(1, length = length(Dth)),
                           data = data.frame(Dth, X, Z, Exp))
    fit <- myglmmPQL(Dth ~ X[,-1] + offset(log(Exp)), data = data.fr,
                     random = list(Id = pdIdent(~Z-1)), family = poisson)

Skeleton code to fit the penalized APC model in section 3 is given below. The fixed effects regression matrix is X, and the random effects regression matrices are Z.a, Z.y and Z.c.

    Id <- factor(rep(1, length(Dth)))
    Z.block <- list(list(Id = pdIdent(~Z.a-1)),
                    list(Id = pdIdent(~Z.y-1)),
                    list(Id = pdIdent(~Z.c-1)))
    Z.block <- unlist(Z.block, recursive = FALSE)
    data.fr <- groupedData(Dth ~ X[,-1] | rep(1, length = length(Dth)),
                           data = data.frame(Dth, X, Z.a, Z.y, Z.c, Exp))
    fit <- myglmmPQL(Dth ~ X[,-1] + offset(log(Exp)), data = data.fr,
                     random = Z.block, family = poisson)
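For readers without R, the inner loop that glmmPQL iterates can be written out directly: penalized iteratively reweighted least squares on a working variable, with each pdIdent block corresponding to a ridge penalty $\lambda = \sigma^2/\sigma_\alpha^2$ on its random coefficients. The sketch below (Python with simulated data; $\lambda$ is held fixed rather than updated by Schall's algorithm, and all names are illustrative) shows one such scheme for a Poisson model with an offset:

```python
import numpy as np

# Penalized IRLS for a Poisson model with offset: the core step that
# glmmPQL iterates.  Data are simulated and lambda is fixed; in the paper
# the variance ratios are estimated by Schall's algorithm.
rng = np.random.default_rng(2)
n, p, q = 40, 3, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed effects
Z = rng.normal(size=(n, q))                                     # random effects
offset = np.log(rng.uniform(50, 500, n))                        # log exposure
y = rng.poisson(np.exp(offset - 4.0))                           # death counts

lam = 1.0
C = np.hstack([X, Z])
Pen = np.diag([0.0] * p + [lam] * q)        # no penalty on fixed effects
beta, alpha = np.zeros(p), np.zeros(q)
for _ in range(50):
    eta = offset + X @ beta + Z @ alpha
    mu = np.exp(eta)
    W = np.diag(mu)                         # Poisson working weights
    z = eta - offset + (y - mu) / mu        # working variable (offset removed)
    theta = np.linalg.solve(C.T @ W @ C + Pen, C.T @ W @ z)
    beta, alpha = theta[:p], theta[p:]

# At convergence the penalized score C'(y - mu) - Pen @ theta vanishes.
```

The pdIdent(~Z-1) calls in the R code above impose exactly this ridge structure, one penalty per random-effects block.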
Figure 1: (a) B-spline basis; (b) transformed basis; (c) unpenalized regression: coefficients and data; (d) penalized regression: coefficients and data; (e) unpenalized and penalized coefficients in the transformed regression; (f) scaling of basis functions, $\phi_i^{0.5}$, $i = 3, \ldots, c$.
Figure 2: Age-Period-Cohort model: detrended plots of mean log(mortality) by (a) age; (b) year; (c) cohort; (d) observed and fitted log(mortality) at age 65.
Figure 3: Age-Period-Cohort model with individual random effects: (a) profile residual log-likelihood $\ell(\hat\lambda_a, \hat\lambda_y, \hat\lambda_c, \kappa)$ against $\log \kappa$; (b) and (c) observed and fitted log(mortality), $X\hat\beta + Z\hat\alpha$, and individual random effects, $\hat\gamma$; (d) numbers of deaths.