Modelling Survival Data using Generalized Additive Models with Flexible Link

Ana L. Papoila (1) and Cristina S. Rocha (2)

(1) Faculdade de Ciências Médicas, Dep. de Bioestatística e Informática, Universidade Nova de Lisboa, Campo Mártires da Pátria 130, 1169-056 Lisboa, Portugal, CEAUL (apapoila@hotmail.com)
(2) Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Edifício C6, Piso 4, 1749-016 Lisboa, Portugal, CEAUL (cmrocha@fc.ul.pt)

Abstract: When using Generalized Linear Models (GLMs), misspecification of the link is very likely to occur because the information needed to choose this distribution function correctly is usually unavailable. To overcome this problem, new developments emerged which, at the same time, gave rise to more flexible models. Survival analysis also benefited from this line of research. In fact, the gamma-logit model may be viewed as a GLM with binary response and unknown link function belonging to the one-parameter family of transformations introduced by Aranda-Ordaz (1981). We suggest the use of flexible parametric link families in Generalized Additive Models (GAMs) with binary response and propose a generalization of the gamma-logit model, which we call the additive gamma-logit model. Based on the local scoring algorithm, the estimation procedure minimizes the deviance through the use of a deviance profile plot. A simulation study was carried out and the proposed methodology was applied to a real current status data set.

Keywords: Generalized additive model; unknown link function; survival analysis; gamma-logit model; current status data.

1 Introduction

As Statistics has evolved, there has been an emphasis on the development of models with greater flexibility. This is what happened with GLMs, and in particular with the logistic model: several generalizations of this model were developed to minimize the errors resulting from a poor choice of link. Power transformation families were used to control symmetric and asymmetric departures from the logistic model, and many parametric link classes were proposed (e.g. Pregibon (1980) and Aranda-Ordaz (1981)). Survival analysis also benefited from these developments, owing to the correspondence that can be established between models in binary regression analysis and in survival analysis (Doksum and Gasko, 1990). For instance, the gamma-logit model is, from the inferential point of view, equivalent to a binary response GLM with unknown link function belonging to the Aranda-Ordaz (1981) family of transformations.

However, considering a GAM instead of a GLM is the natural extension of the gamma-logit model, in the sense that smooth functions may be used to relate the covariates to the response variable, often in a more realistic way. Some work has already been done to extend GAMs to a broader class of models with unknown nonparametric link function (Hastie and Tibshirani (1984) and Roca-Pardiñas et al. (2004)). In this paper we propose the introduction of parametric link families in GAMs. Although the procedures developed may be applied to any model whose response variable has a distribution belonging to the exponential family, this paper focuses on the binary response case. Our proposal lies somewhere between an additive model with a fixed link and an additive model with a fully nonparametric link.

When using GAMs with flexible link, it is necessary to calculate an odds ratio curve because, unlike in GLMs, the effect of a continuous covariate on the response depends not only on the shape of the partial function but also on the functional form of the link. In our case, we have derived an estimator of the odds ratio curve and constructed pointwise confidence intervals for the odds ratios, following Figueiras and Cadarso-Suárez (2001) and Cadarso-Suárez et al. (2005). A simulation study was conducted and the new methodology was applied to a real current status data set.

2 GLMs with flexible parametric link and the gamma-logit model

The idea of using GLMs with flexible parametric link emerged as a natural consequence of the development of goodness of link tests for GLMs. In this context, Pregibon (1980) suggested a procedure to examine the adequacy of a particular hypothesized link function of a GLM, by embedding this function and the correct, but unknown, link in a family of link functions.
Let $Y$ be a response variable with a distribution belonging to the exponential family and $(X_1, \ldots, X_p)$ a vector of $p$ covariates. A GLM with flexible parametric link is defined by

$$E(Y \mid X_1, \ldots, X_p) = h\Big(\beta_0 + \sum_{j=1}^{p} \beta_j X_j,\ \psi\Big),$$

where $h$, known as the link function, is a strictly monotone differentiable function that belongs to the family $\mathcal{H} = \{h(\cdot, \psi) : \psi \in \Psi\}$, $\psi$ represents the link parameter vector and $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients, which must be estimated from the available data. This defines a broad class of models but, for now, we focus only on the particular case of a model with binary response and parametric link belonging to the family proposed by Aranda-Ordaz (1981), in order to obtain the existing gamma-logit model. In a survival analysis context, this family is defined by

$$\gamma\text{-logit}(u) = \begin{cases} \log\left\{\dfrac{(1-u)^{-\gamma} - 1}{\gamma}\right\} & \text{if } \gamma > 0 \\[2ex] \log[-\log(1-u)] & \text{if } \gamma = 0, \end{cases} \qquad (1)$$

and $h$ is the inverse of the function defined in (1).

3 GAMs with flexible parametric link and the additive gamma-logit model

In this paper, we propose the introduction of GAMs with a flexible parametric link, in order to obtain an extension of the gamma-logit model which we call the additive gamma-logit model. Let $Y$ be a response variable with a distribution belonging to the exponential family and $(X_1, \ldots, X_p)$ a vector of $p$ covariates. A GAM with flexible parametric link is defined by

$$\mu = E(Y \mid X_1, \ldots, X_p) = h\Big(\beta_0 + \sum_{j=1}^{p} f_j(X_j),\ \psi\Big),$$

where $h$, the link function, is a strictly monotone differentiable function that belongs to the family $\mathcal{H} = \{h(\cdot, \psi) : \psi \in \Psi\}$, and $\psi$ represents the link parameter vector. The partial functions $f_j(X_j)$, $j = 1, \ldots, p$, are arbitrary univariate functions that must be estimated from the data and represent the effect of the covariates on the response. As mentioned above, we focus only on the particular case of a model with a binary response and parametric link belonging to the family proposed by Aranda-Ordaz (1981), in order to obtain the additive gamma-logit model defined by

$$F(t \mid x) = h\Big\{\gamma\text{-logit}[F_0(t)] + \sum_{j=1}^{p} f_j(x_j)\Big\},$$

where $F_0(t)$ represents the baseline distribution function.

For estimation, we added new routines to the Fortran program developed by Hastie and Tibshirani (1990). These routines allow the estimation of $\beta_0$ and of the partial functions $f_1, \ldots, f_p$ through the iterative modified backfitting (Buja et al., 1989) and local scoring algorithms (Hastie and Tibshirani, 1990). Cubic smoothing splines were used to model individual predictors. The amount of smoothing was defined, before fitting the model, by specifying the degrees of freedom. To estimate the parameter vector $\psi$, we used a deviance profile plot.

To estimate the odds ratio curve we followed Cadarso-Suárez et al. (2005), who proposed a generalization of the odds ratio curve suggested by Figueiras and Cadarso-Suárez (2001) for logistic GAMs. Cadarso-Suárez et al. (2005) defined the generalized odds ratio curve for a continuous covariate $X_j$ at point $x$, taking $x_0$ as the reference value, by

$$OR_j^{x_0}(x) = E_{(X_1, \ldots, X_p)}\left[\frac{p(X_1, \ldots, x, \ldots, X_p) / (1 - p(X_1, \ldots, x, \ldots, X_p))}{p(X_1, \ldots, x_0, \ldots, X_p) / (1 - p(X_1, \ldots, x_0, \ldots, X_p))}\right],$$

where $p(X_1, \ldots, X_p) = P(Y = 1 \mid X_1, \ldots, X_p)$ and $E_{(X_1, \ldots, X_p)}$ represents the mean operator over the covariates $\{X_k\}_{k \neq j}$. Thus, if we consider a GAM with a link belonging to the Aranda-Ordaz (1981) family of transformations, we obtain the following estimator of the odds ratio:

$$\widehat{OR}_j^{x_0}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left(1 + \hat\psi\, e^{\hat\beta_0 + \hat f_1(X_{i1}) + \ldots + \hat f_j(x) + \ldots + \hat f_p(X_{ip})}\right)^{1/\hat\psi} - 1}{\left(1 + \hat\psi\, e^{\hat\beta_0 + \hat f_1(X_{i1}) + \ldots + \hat f_j(x_0) + \ldots + \hat f_p(X_{ip})}\right)^{1/\hat\psi} - 1},$$

where $\hat\psi$, $\hat\beta_0$ and $\hat f_j$ are estimates obtained from fitting our GAM. To construct pointwise confidence intervals for the odds ratio curve, we used bootstrap techniques (Cadarso-Suárez et al., 2005).

A simulation study was carried out, not only to evaluate the quality of the link parameter estimates, but also to compare the performance of the proposed GAM with that of the GLM with the same parametric link. We concluded that the estimation process was satisfactory and that a substantial gain in deviance may be achieved with our model.

4 A real case study

To apply the proposed methodology, we studied the elapsed time from first injecting drug use until HIV infection, using a data set of 361 drug users who started using intravenous drugs between 1974 and 1997 and were admitted to the detoxification unit of the Hospital Universitari Germans Trias i Pujol in Badalona, Spain, between 1987 and 2000. For these individuals the moment of HIV infection is unknown. For 15% of the cases, the only available information about this instant is the interval [instant of last negative HIV test, instant of first positive HIV test]. For the rest of the individuals, we only know their status (infected or not infected) at the date of the last HIV test (monitoring instant). This means that the data are mainly case I interval censored, and so we decided to treat all the observations as current status data.
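The odds ratio estimator of Section 3 can be made concrete with a small numerical sketch. It is not the authors' Fortran implementation. The function names, the argument `other_terms` (the sum of the remaining fitted partial functions for each observation) and the toy partial function used in the usage note are all hypothetical, and the $\psi = 0$ branch is an added limiting case that the estimator above does not cover. The sketch relies on the fact that, under the inverse of (1), the odds $p/(1-p)$ simplify to $(1 + \psi e^{\eta})^{1/\psi} - 1$.

```python
import numpy as np

def inverse_gamma_logit(eta, psi):
    """Inverse of the Aranda-Ordaz link (1): maps a predictor eta to a probability."""
    eta = np.asarray(eta, dtype=float)
    if psi == 0:
        # Limiting case (complementary log-log); an assumption added for completeness.
        return 1.0 - np.exp(-np.exp(eta))
    return 1.0 - (1.0 + psi * np.exp(eta)) ** (-1.0 / psi)

def odds(eta, psi):
    """Odds p / (1 - p) implied by the inverse link, in closed form."""
    eta = np.asarray(eta, dtype=float)
    if psi == 0:
        return np.expm1(np.exp(eta))
    return (1.0 + psi * np.exp(eta)) ** (1.0 / psi) - 1.0

def odds_ratio_curve(x, x0, f_j, other_terms, beta0, psi):
    """Mean over observations of odds at x divided by odds at the reference x0.

    f_j         : callable standing in for the fitted partial function of covariate j
    other_terms : array with, for each observation i, the sum of f_k(X_ik) for k != j
    """
    other_terms = np.asarray(other_terms, dtype=float)
    eta_x = beta0 + other_terms + f_j(x)
    eta_x0 = beta0 + other_terms + f_j(x0)
    return float(np.mean(odds(eta_x, psi) / odds(eta_x0, psi)))
```

As a sanity check, with `psi = 1` the family reduces to the logistic link, so `odds_ratio_curve(20, 10, lambda a: 0.1 * a, [0.3, -0.2, 1.0], -1.0, 1.0)` collapses to `exp(f_j(20) - f_j(10))`, the usual logistic GAM odds ratio. For other values of `psi` the ratio depends on the remaining covariates, which is why the average over observations is needed.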
From the available data we used the following variables: age at first intravenous drug use; gender; the elapsed time ($T$), in months, from the instant of first intravenous drug use until the date of the last HIV test (monitoring time); and the indicator variable $Y$, which gives the result of the last HIV test (0 if the individual is seronegative, 1 if the individual is infected). We considered the model

$$\mu = h\{[\beta_0 + f(T)] + f_1(\text{age}) + \beta_1\, \text{gender}\}.$$

The estimate of the link parameter was obtained by minimizing the deviance, calculated for a grid of values of $\psi$, and we took $\hat\psi = 5$ as the best estimate, with a deviance of approximately 377.2. The resulting fitted model is given by

$$\hat\mu = 1 - \left(\frac{1}{1 + 5\, e^{[5.18 + \hat f(T)] + \hat f_1(\text{age}) + 2.61\, \text{gender}}}\right)^{1/5}.$$

For the variable age, Figure 1 shows the odds ratio curve, estimated for both genders and taking the mean age (19 years) as the reference value.

FIGURE 1. OR (age) estimates and corresponding 95% confidence intervals, female and male genders.

As the two panels show, the curves are very similar. There seems to be a lower risk of infection for individuals who started injecting drugs at approximately 26 years of age. Survival curves for both females and males were obtained.

FIGURE 2. Estimates of the survival functions of time until HIV infection for females and males who initiated their drug addiction with a mean age of 19 years.

From Figure 2 we can see that time until HIV infection is longer for men. The curves also seem to level off, and the resulting plateau may indicate the existence of immune individuals in the population. Indeed, it is plausible that some injecting drug users take adequate precautions, so that HIV infection is unlikely to occur.

Finally, to evaluate the goodness-of-fit of the proposed model, the deviance residuals were examined and no serious trends characteristic of a bad fit were detected. To overcome the lack of global goodness-of-fit tests for this kind of model, we used bootstrap techniques and concluded that the model was reasonably adequate. The 95% bootstrap confidence interval obtained for the deviance was (352.86, 425.69). However, we are aware of the existence of unobserved heterogeneity among the individuals, and we believe that the introduction of a frailty term would certainly improve the fit of the model.

Acknowledgements: This research was supported by FCT/POCI 2010. The authors would like to thank Drs. Klaus Langohr, Guadalupe Gómez and Robert Muga for making the data available.

References

Aranda-Ordaz, F.J. (1981). On two families of transformations to additivity for binary regression data. Biometrika 68, 357-363.
Buja, A., Hastie, T.J. and Tibshirani, R.J. (1989). Linear smoothers and additive models (with discussion). Annals of Statistics 17, 453-555.
Cadarso-Suárez, C., Roca-Pardiñas, J.R., Figueiras, A. and Manteiga, W. (2005). Non-parametric estimation of the odds ratios for continuous exposures using generalized additive models with an unknown link function. Statistics in Medicine 24, 1169-1184.
Doksum, K.A. and Gasko, M. (1990). On a correspondence between models in binary regression and in survival analysis. International Statistical Review 58, 243-252.
Figueiras, A. and Cadarso-Suárez, C. (2001). Application of nonparametric models for calculating odds ratios and their confidence intervals for continuous exposures. American Journal of Epidemiology 154, 3, 264-275.
Hastie, T. and Tibshirani, R. (1984). Generalized additive models. Tech. Rep. 98, Dept. of Statistics, Stanford University.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, New York.
Pregibon, D. (1980). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society, series C 29, 15-24.
Roca-Pardiñas, J., Manteiga, W., Bande, M., Sánchez, J. and Cadarso-Suárez, C. (2004). Predicting binary time series of SO2 using generalized additive models with unknown link function. Environmetrics 15, 1-14.