Recovering Indirect Information in Demographic Applications


Jutta Gampe
Max Planck Institute for Demographic Research, 18057 Rostock, Germany, e-mail: gampe@demogr.mpg.de

Abstract In many demographic applications the information of interest can only be estimated indirectly. Modelling events and rates is typical for demographic analyses, so that statistical models based on counts are a natural starting point. We will demonstrate that the Penalized Composite Link Model is a versatile and valuable tool for solving such indirect estimation problems in demography.

Key words: Demography, Indirect observations, Penalized Composite Link Model

1 Introduction

Demography deals with the analysis of populations and population dynamics, that is, the change of populations over time with respect to their size and composition. More specifically, the discipline analyzes the processes that drive population change: mortality, fertility and migration, as well as related processes such as marriage or divorce. The models used to describe and analyze these processes are mostly based on age-specific rates (of fertility, mortality, ...), and the underlying data are the number of events (births, deaths, ...) and the number of individuals at risk of the event. With this emphasis on rates and events, many methods in demography are based on distributional models for counts (and corresponding exposures), so that extensions of Poisson models are a natural starting point.

In many demographic applications, however, the information of interest cannot be estimated directly; indirect estimation procedures have to be employed. For many applications the indirect estimation problem can be phrased as a Penalized Composite Link Model (PCLM). This model combines a remarkable versatility of model formulation with modest additional assumptions, such as smoothness of rates across age or across time. The resulting procedures are computationally efficient, which makes it possible to handle the large data sets that are common in demography. In the next section we give a short summary of the PCLM, followed by several demographic examples in Section 3.

2 The Penalized Composite Link Model

Based on the Composite Link Model suggested in [1], the model was extended by an additional smoothness penalty in [2]. The model considers a series of Poisson counts $y_1, \ldots, y_n$ with $E(y_i) = \mu_i$. In contrast to a simple Poisson regression, the means $\mu_i$ are supposed to be linearly combined from a series $\gamma_1, \ldots, \gamma_m$, so that

$$\mu_i = \sum_{j=1}^{m} c_{ij}\,\gamma_j. \qquad (1)$$

For the $\gamma_j$ the common specification of generalized linear models (GLMs) is retained: $\gamma_j = e^{\eta_j}$, where $\eta_j = \sum_{k=1}^{p} x_{jk}\beta_k$ is the linear predictor. In matrix notation we can write

$$\mu = C\gamma, \qquad \gamma = e^{\eta}, \qquad \eta = X\beta. \qquad (2)$$

The matrix $C$ determines how the expected values of the observable $y_i$ are composed from the elements of $\gamma$. To estimate the parameters in the CLM, it can be shown that the standard iteratively weighted least-squares (IWLS) algorithm can be modified in the following way: define $\Gamma = \operatorname{diag}(\gamma)$, $M = \operatorname{diag}(\mu)$ and $U = M^{-1} C \Gamma X$. Then the next IWLS iteration solves

$$\tilde U^{\top} \tilde M \tilde U \beta = \tilde U^{\top}(y - \tilde\mu) + \tilde U^{\top} \tilde M \tilde U \tilde\beta, \qquad (3)$$

where the tilde indicates current values in the iteration. If the number of observations $n$ is smaller than the number of parameters $p$, then the problem is ill-conditioned, but an additional penalty can help. This penalty constrains the elements of the parameter vector $\beta$ and thereby allows the model to be estimated. A typical assumption is that the elements of $\gamma$ form a smooth series. In the simplest case of $X = I$, and hence $\gamma = e^{\beta}$, a difference penalty on the elements of $\beta$ will do the job. Alternatively we can express $\ln\gamma$ as a linear combination of B-splines, so that $X$ holds the spline basis; see [3].
Again a difference penalty on the elements of $\beta$ will ensure the required smoothness of $\gamma$. If $D$ is a matrix that builds differences of order $d$ (typically $d = 1$ or $d = 2$) and $\lambda$ is the smoothing parameter that balances the effect of the penalty relative to the model deviance, then maximizing the penalized log-likelihood leads to the following modified IWLS system for the penalized composite link model (PCLM):

$$(\tilde U^{\top} \tilde M \tilde U + \lambda D^{\top} D)\,\beta = \tilde U^{\top}(y - \tilde\mu) + \tilde U^{\top} \tilde M \tilde U \tilde\beta. \qquad (4)$$
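To make the estimation scheme concrete, the penalized IWLS iteration (4) can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the notation above, not code from the cited papers; the toy example at the end ungroups a noise-free grouped series with $X = I$.

```python
import numpy as np

def pclm(y, C, X, lam, d=2, max_iter=50, tol=1e-8):
    """Fit a penalized composite link model by the modified IWLS in (4).

    y: observed counts (n,); C: composition matrix (n, m);
    X: model matrix for ln(gamma) (m, p); lam: smoothing parameter.
    """
    m, p = X.shape
    D = np.diff(np.eye(p), n=d, axis=0)          # order-d difference matrix
    P = lam * (D.T @ D)                          # penalty lambda * D'D
    # rough starting value: flat gamma at the average observed level
    eta0 = np.log(y.sum() / m)
    beta = np.linalg.lstsq(X, np.full(m, eta0), rcond=None)[0]
    for _ in range(max_iter):
        gamma = np.exp(X @ beta)                 # gamma = exp(X beta)
        mu = C @ gamma                           # mu = C gamma
        U = ((C * gamma) / mu[:, None]) @ X      # U = M^{-1} C Gamma X
        UtMU = U.T @ (mu[:, None] * U)           # U' M U with M = diag(mu)
        beta_new = np.linalg.solve(UtMU + P, U.T @ (y - mu) + UtMU @ beta)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return np.exp(X @ beta)

# Toy ungrouping example: 20 single ages observed only as four 5-year groups
gamma_true = np.exp(1.0 + 0.1 * np.arange(20))
C = np.zeros((4, 20))
for i in range(4):
    C[i, 5 * i : 5 * (i + 1)] = 1.0              # group i covers ages 5i..5i+4
y = C @ gamma_true                               # grouped (noise-free) counts
gamma_hat = pclm(y, C, np.eye(20), lam=1.0)      # X = I, so gamma = exp(beta)
```

In practice $\lambda$ would be chosen by minimizing AIC or BIC over a grid of values, as described in the text.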

For detailed derivations see [2]. The optimal smoothing parameter $\lambda$ is usually chosen by minimizing an information criterion, AIC or BIC, over a grid of $\lambda$-values.

3 Examples

3.1 Ungrouping coarse histograms

Many official data (life tables etc.) are reported in grouped form. Age groups of width 5 or 10 years are common, and often the information on the elderly is summarized in a rather wide open age group such as 80+ or 85+. In aging societies more detailed information on the higher ages is needed. Even if nowadays, in developed countries, the last age group in official statistics has been changed to 100+ or 110+, studying time trends requires that the coarsely grouped data can be analyzed on a more detailed scale. In [4] it is demonstrated how the PCLM can be employed for this purpose.

Let $a_1, \ldots, a_J$ be the sequence of ages we are interested in (typically $a_1 = 0$, $a_2 = 1$, ..., $a_J = 110$ or $a_J = 120$), and let $\gamma_j$, $j = 1, \ldots, J$, be the corresponding (but unobserved, because of the grouping) expected counts for these ages. What is observed are the counts $y_1, \ldots, y_n$ in the $n$ age groups, which are realizations of Poisson variates with means $E(y_i) = \mu_i$. The $\mu_i$ are the sums of the $\gamma_j$ that lie in the age group represented by $y_i$, so that the composition matrix $C$ here is rather simple:

$$C = \begin{pmatrix}
1 \cdots 1 & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & 1 \cdots 1 & \cdots & 0 \cdots 0 \\
\vdots & & \ddots & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & 1 \cdots 1
\end{pmatrix}. \qquad (5)$$

The number of rows $n$ is the number of observed age groups; the number of columns $J$ is the length of the finer age sequence. Obviously $n < J$ here. Assuming that the ungrouped distribution $\gamma_1, \ldots, \gamma_J$ is smooth is not restrictive, and adding a smoothness penalty makes the vector $\gamma$ estimable. As demonstrated in [4], this approach even works for wide open age intervals and can also be used to disaggregate exposure numbers if rates are to be estimated on a finer age grid.
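The structure of (5) can be assembled directly from the lower boundaries of the age groups. The helper below is an illustrative sketch (not code from [4]); the final group is treated as open-ended, as for an 85+ class.

```python
import numpy as np

def composition_matrix(lower_bounds, J):
    """0/1 composition matrix for age groups on the fine grid 0, 1, ..., J-1.

    lower_bounds: lowest age of each group, e.g. [0, 1, 5, ..., 85];
    the final group is open (e.g. 85+) and extends to age J-1.
    """
    edges = list(lower_bounds) + [J]          # upper edge closes the open group
    C = np.zeros((len(lower_bounds), J))
    for i in range(len(lower_bounds)):
        C[i, edges[i]:edges[i + 1]] = 1.0     # row i marks the ages in group i
    return C

# Groups 0, 1-4, 5-84 and the open group 85+ on a fine grid up to age 110
C = composition_matrix([0, 1, 5, 85], J=111)
```

Each age on the fine grid belongs to exactly one group, so every column of C contains a single 1.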
3.2 Additive decomposition of death rates

The trajectory of human mortality over age is rather complex, see Figure 1, and decomposing death rates into additive components has a long tradition. Commonly three components are used, referring to child mortality, a stretch of elevated mortality starting in late puberty (the so-called accident hump) and finally senescent mortality, which basically increases exponentially. Particularly the dynamics of the accident hump has recently attracted attention, but this part is rather complex and parametric models rarely fit.

Fig. 1 Age-specific death rates for males in Switzerland in 1980. Data taken from the Human Mortality Database (www.mortality.org).

In [5] it was demonstrated that a (partially) nonparametric additive decomposition of death rates can be achieved by a so-called sum-of-smooth-exponentials (SSE) model. This model is particularly suited for trajectories with sharp changes. The observed data here are the death counts $y_i$ and the corresponding exposures $e_i$ across the ages $i = 1, \ldots, n$ (where $n = 100$ or even 110). The expected values of the $y_i$ are

$$\mu_i = \sum_{k=1}^{K} e_i\,\gamma_{ik},$$

where $(\gamma_{1k}, \ldots, \gamma_{nk})$ are the death rates for the $K$ components (here $K = 3$). Again the number $n$ is smaller than the number of the $\gamma_{ik}$. The components $\gamma_k = (\gamma_{1k}, \ldots, \gamma_{nk})$ are modeled as $\gamma_k = \exp\{B_k \beta_k\}$, where the matrix $B_k$ holds either a B-spline basis, for the smooth components, or expresses a parametric specification, e.g. for the exponentially increasing senescent component. The structure of the $\mu_i$ again suggests a PCLM approach. In this problem a smoothness penalty is not enough to make the decomposition identifiable; however, adding penalties for shape constraints, such as monotonicity or log-concavity of the accident-hump component, solves the problem. Details can be found in [5]. Figure 2 shows the resulting estimates of a three-component SSE model for the Swiss mortality data presented in Figure 1.
Here the child and senescent mortality components were modeled by parametric functions, and the accident-hump component was specified to be smooth and concave on the log scale. The model has also been extended to two dimensions, so that the change of the different components over time can be studied.
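The composite-link structure of the SSE model can be made explicit by stacking the component rates: with $K$ components, $\mu = C\,\mathrm{vec}(\gamma)$, where $C$ repeats $\operatorname{diag}(e)$ $K$ times. The small illustration below uses entirely made-up component shapes (these are not the Swiss data from the figures) and shows only the composition step, not the penalized fit.

```python
import numpy as np

n, K = 111, 3
ages = np.arange(n)
e = np.full(n, 1.0e4)                                          # hypothetical exposures

gamma = np.empty((n, K))                                       # component death rates
gamma[:, 0] = 0.02 * np.exp(-ages / 2.0)                       # child mortality
gamma[:, 1] = 1e-3 * np.exp(-0.5 * ((ages - 20) / 8.0) ** 2)   # accident hump
gamma[:, 2] = 1e-5 * np.exp(0.1 * ages)                        # senescent, exponential

mu = e * gamma.sum(axis=1)                # mu_i = sum_k e_i * gamma_ik

# The same composition written as a composite link mu = C @ vec(gamma):
C = np.hstack([np.diag(e)] * K)           # (n, K*n): diag(e) repeated K times
vec_gamma = gamma.T.ravel()               # columns of gamma stacked into one vector
```

Under this layout the shape-constrained penalties of [5] act block-wise on the parameters of each component.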

Fig. 2 Age-specific death rates for males in Switzerland in 1980. Estimates of the three-component SSE model.

3.3 Age-at-death distributions in paleodemography

While in modern populations ages at death are available from death certificates or, in earlier centuries, from church registers, we do not have such information if we go further back in time. In this case age estimates from skeletons excavated in historical cemeteries, which can be provided by experienced anthropologists, are the only option. Such age estimates are not necessarily unbiased, but they can be calibrated by using so-called known-age reference collections.

The overall procedure runs in two steps. If we denote the true age at death by $a$ and the estimated skeletal age by $s$, we can use $m$ skeletons from a reference collection, for whom we have $(a_i, s_i)$, $i = 1, \ldots, m$, to estimate the conditional distribution $f(s \mid a)$. This is usually done by nonparametric regression to allow for nonlinear and heteroscedastic relations between $a$ and $s$. In the so-called target population (the excavated burial site) we only observe the skeletal age of the $n$ buried individuals. The age-at-death distribution $f(a)$, which is to be estimated, is related to the distribution $g(s)$ of skeletal ages by

$$g(s) = \int f(s \mid a)\, f(a)\, \mathrm{d}a. \qquad (6)$$

Information on $f(s \mid a)$ is borrowed from the reference collection. If we model both age scales on a fine grid, i.e., $s = (s_1, \ldots, s_L)$ and $a = (a_1, \ldots, a_K)$, then the observations are the counts of skeletons in each of the $L$ skeletal-age classes: $y_1, \ldots, y_L$, with $\sum_l y_l = n$. These are realizations of Poisson variates with means $\mu_l$, where

$$\mu_l = n \sum_{k=1}^{K} c_{lk}\,\gamma_k. \qquad (7)$$
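The discretized version of (6) is straightforward to set up. The kernel below, with skeletal age normally distributed around the true age and a spread growing with age, is a purely hypothetical stand-in for a calibration estimated from a reference collection; it only illustrates the forward composition $g = C\gamma$.

```python
import numpy as np

ages = np.arange(20, 80)                      # fine grid of true ages a_k
s_grid = np.arange(20, 80)                    # fine grid of skeletal ages s_l

# Hypothetical calibration f(s | a): Gaussian around the true age,
# with a standard deviation that grows with age (heteroscedastic).
sd = 3.0 + 0.05 * (ages - 20)
Ccond = np.exp(-0.5 * ((s_grid[:, None] - ages[None, :]) / sd[None, :]) ** 2)
Ccond /= Ccond.sum(axis=0, keepdims=True)     # column k is f(s | a_k), sums to 1

# A hypothetical age-at-death distribution f(a) on the same grid
f_a = np.exp(-0.5 * ((ages - 45) / 10.0) ** 2)
f_a /= f_a.sum()

g_s = Ccond @ f_a                             # discretized g(s) = sum_k c_lk f(a_k)
```

Estimating $f(a)$ from observed skeletal-age counts then amounts to inverting this composition, which is exactly the PCLM deconvolution described in the text.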

The $c_{lk} = P(s_l \mid a_k)$ stem from the reference collection, and $\gamma_k = P(a_k)$. Again this deconvolution problem has the structure of a PCLM, and the unknown age-at-death distribution $\gamma = (\gamma_1, \ldots, \gamma_K)$ can be estimated by using a simple smoothness penalty.

References

1. Thompson, R., Baker, R.J.: Composite Link Functions in Generalized Linear Models. Applied Statistics 30(2), 125-131 (1981)
2. Eilers, P.H.C.: Ill-posed problems with counts, the composite link model and penalized likelihood. Statistical Modelling 7(3), 239-254 (2007)
3. Eilers, P.H.C., Marx, B.D.: Flexible Smoothing with B-splines and Penalties. Statistical Science 11(2), 89-121 (1996)
4. Rizzi, S., Gampe, J., Eilers, P.H.C.: Efficient Estimation of Smooth Distributions From Coarsely Grouped Data. American Journal of Epidemiology 182(2), 138-147 (2015)
5. Camarda, C.G., Eilers, P.H.C., Gampe, J.: Sums of smooth exponentials to decompose complex series of counts. Statistical Modelling 16(4), 279-296 (2016)