Modelling Survival Data using Generalized Additive Models with Flexible Link

Ana L. Papoila¹ and Cristina S. Rocha²

¹ Faculdade de Ciências Médicas, Dep. de Bioestatística e Informática, Universidade Nova de Lisboa, Campo Mártires da Pátria 130, 1169-056 Lisboa, Portugal, CEAUL (apapoila@hotmail.com)
² Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Edifício C6, Piso 4, 1749-016 Lisboa, Portugal, CEAUL (cmrocha@fc.ul.pt)

Abstract: When using Generalized Linear Models (GLMs), misspecification of the link is very likely to occur because the information needed to choose this distribution function correctly is usually unavailable. To overcome this problem, new developments emerged which, at the same time, gave rise to more flexible models. Survival analysis has also benefited from this line of research. In fact, the gamma-logit model may be viewed as a GLM with binary response and unknown link function belonging to the one-parameter family of transformations introduced by Aranda-Ordaz (1981). We suggest the use of flexible parametric link families in Generalized Additive Models (GAMs) with binary response and propose a generalization of the gamma-logit model, which we call the additive gamma-logit model. Based on the local scoring algorithm, the estimation procedure minimizes the deviance through the use of a deviance profile plot. A simulation study was carried out and the proposed methodology was applied to a real current status data set.

Keywords: Generalized additive model; unknown link function; survival analysis; gamma-logit model; current status data.

1 Introduction

As Statistics has evolved, there has been an emphasis on the development of models with greater flexibility. This is what happened with GLMs, and in particular with the logistic model. In fact, several generalizations of this model were developed to minimize the errors resulting from a bad choice of the link.
Power transformation families were used to control symmetric and asymmetric departures from the logistic model, and many parametric link classes were proposed (e.g. Pregibon (1980) and Aranda-Ordaz (1981)). As a consequence, survival analysis also benefited from these developments, owing to the correspondence that can be established between models in binary regression analysis and in survival analysis (Doksum and Gasko, 1990). For instance, we may mention the gamma-logit model which, from the inferential point of view, is equivalent to a binary response GLM with unknown link function belonging to the Aranda-Ordaz (1981) family of transformations.

However, considering a GAM instead of a GLM is the natural extension of the gamma-logit model, in the sense that smooth functions may be used to establish the relation between the covariates and the response variable, often in a more realistic way. Some work has already been done to extend GAMs to a broader class of models with unknown nonparametric link function (Hastie and Tibshirani (1984) and Roca-Pardiñas et al. (2004)). In this paper we propose the introduction of parametric link families in GAMs and, although the developed procedures may be applied to any model with a response variable whose distribution belongs to the exponential family, this paper focuses on the binary response case. Our proposal lies somewhere between an additive model with a fixed link and an additive model with a fully nonparametric link.

When using GAMs with flexible link, it is necessary to calculate an odds ratio curve because, unlike in GLMs, the effect of a continuous covariate on the response depends not only on the shape of the partial function but also on the functional form of the link. In our case, we have derived an estimator of the odds ratio curve and also constructed pointwise confidence intervals for the odds ratios, following Figueiras and Cadarso-Suárez (2001) and Cadarso-Suárez et al. (2005). A simulation study was conducted and the new methodology was applied to a real current status data set.

2 GLMs with flexible parametric link and the gamma-logit model

The idea of using GLMs with flexible parametric link emerged as a natural consequence of the development of goodness of link tests for GLMs. In this context, Pregibon (1980) suggested a procedure to examine the adequacy of a particular hypothesized link function of a GLM, by embedding this function and the correct, but unknown, link in a family of link functions.
Let $Y$ be a response variable with a distribution belonging to the exponential family and $(X_1, \ldots, X_p)$ a vector of $p$ covariates. A GLM with flexible parametric link is defined by
\[
E(Y \mid X_1, \ldots, X_p) = h\Big(\beta_0 + \sum_{j=1}^{p} \beta_j X_j,\ \psi\Big),
\]
where $h$, known as the link function, is a strictly monotone differentiable function that belongs to the family $H = \{h(\cdot, \psi) : \psi \in \Psi\}$, $\psi$ represents the link parameter vector and $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients, which must be estimated from the available data. This defines a broad class of models but, for now, we focus only on the particular case of a model with binary response and parametric link belonging to the family proposed by Aranda-Ordaz (1981), in order to obtain the existing gamma-logit model. In a survival analysis context, this family is defined by
\[
\gamma\text{-logit}(u) =
\begin{cases}
\log\left\{\dfrac{(1-u)^{-\gamma} - 1}{\gamma}\right\} & \text{if } \gamma > 0, \\[2ex]
\log[-\log(1-u)] & \text{if } \gamma = 0.
\end{cases}
\tag{1}
\]
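As an illustration of the family in (1), the following is a minimal Python sketch of the transformation and of its inverse, obtained by solving (1) for $u$. The function names `gamma_logit` and `inverse_gamma_logit` are our own labels for this sketch and do not come from the Fortran program used in the paper. Setting $\gamma = 1$ recovers the usual logit, and $\gamma = 0$ gives the complementary log-log transformation.

```python
import math


def gamma_logit(u, gamma):
    """Transformation (1): maps a probability u in (0, 1) to the real line."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if gamma == 0:
        # Limiting case: complementary log-log.
        return math.log(-math.log(1.0 - u))
    return math.log(((1.0 - u) ** (-gamma) - 1.0) / gamma)


def inverse_gamma_logit(eta, gamma):
    """Inverse of (1): maps a predictor value eta back to a probability."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if gamma == 0:
        return 1.0 - math.exp(-math.exp(eta))
    return 1.0 - (1.0 + gamma * math.exp(eta)) ** (-1.0 / gamma)


# Round trip for an arbitrary probability and link parameter.
eta = gamma_logit(0.3, 5.0)
print(inverse_gamma_logit(eta, 5.0))
```

Keeping $\gamma = 0$ as a separate branch avoids a division by zero; for very small positive $\gamma$ a numerically careful implementation would probably need a series expansion, which this sketch does not attempt.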

In this model, $h$ is the inverse of the function defined in (1).

3 GAMs with flexible parametric link and the additive gamma-logit model

In this paper, we propose the introduction of GAMs with a flexible parametric link, in order to obtain an extension of the gamma-logit model which we call the additive gamma-logit model. Let $Y$ be a response variable with a distribution belonging to the exponential family and $(X_1, \ldots, X_p)$ a vector of $p$ covariates. A GAM with flexible parametric link is defined by
\[
\mu = E(Y \mid X_1, \ldots, X_p) = h\Big(\beta_0 + \sum_{j=1}^{p} f_j(X_j),\ \psi\Big),
\]
where $h$, the link function, is a strictly monotone differentiable function that belongs to the family $H = \{h(\cdot, \psi) : \psi \in \Psi\}$, and $\psi$ represents the link parameter vector. The partial functions $f_j(X_j)$, $j = 1, \ldots, p$, are arbitrary univariate functions that must be estimated from the data and represent the effect of the covariates on the response. As mentioned previously, we focus only on the particular case of a model with a binary response and parametric link belonging to the family proposed by Aranda-Ordaz (1981), in order to obtain the additive gamma-logit model defined by
\[
F(t \mid x) = h\Big\{\gamma\text{-logit}[F_0(t)] + \sum_{j=1}^{p} f_j(x_j)\Big\},
\]
where $F_0(t)$ represents the baseline distribution function.

For estimation, we added new routines to the Fortran program developed by Hastie and Tibshirani (1990). These routines allow the estimation of $\beta_0$ and of the partial functions $f_1, \ldots, f_p$ through the iterative modified backfitting (Buja et al., 1989) and local scoring (Hastie and Tibshirani, 1990) algorithms. Cubic smoothing splines were used to model individual predictors. The amount of smoothing was defined, before fitting the model, by specifying the degrees of freedom. To estimate the parameter vector $\psi$, we used a deviance profile plot.

To estimate the odds ratio curve we followed Cadarso-Suárez et al. (2005), who proposed a generalization of the odds ratio curve suggested by Figueiras and Cadarso-Suárez (2001) for logistic GAMs. In fact, Cadarso-Suárez et al. (2005) defined the generalized odds ratio curve for a continuous covariate $X_j$ at point $x$, taking $x_0$ as the reference value, by
\[
OR_j^{x_0}(x) = E_{(X_1, \ldots, X_p)}\left[
\frac{p(X_1, \ldots, x, \ldots, X_p) / (1 - p(X_1, \ldots, x, \ldots, X_p))}
     {p(X_1, \ldots, x_0, \ldots, X_p) / (1 - p(X_1, \ldots, x_0, \ldots, X_p))}
\right],
\]
where $p(X_1, \ldots, X_p) = P(Y = 1 \mid X_1, \ldots, X_p)$ and $E_{(X_1, \ldots, X_p)}$ represents the mean operator over the covariates $\{X_k\}_{k \neq j}$. Thus, if we consider a GAM with a link belonging to the Aranda-Ordaz (1981) transformations family,
we obtain the following estimator of the odds ratio
\[
\widehat{OR}_j^{x_0}(x) = \frac{1}{n} \sum_{i=1}^{n}
\frac{\left(1 + \hat\psi\, e^{\hat\beta_0 + \hat f_1(X_{i1}) + \cdots + \hat f_j(x) + \cdots + \hat f_p(X_{ip})}\right)^{1/\hat\psi} - 1}
     {\left(1 + \hat\psi\, e^{\hat\beta_0 + \hat f_1(X_{i1}) + \cdots + \hat f_j(x_0) + \cdots + \hat f_p(X_{ip})}\right)^{1/\hat\psi} - 1},
\]
where $\hat\psi$, $\hat\beta_0$ and $\hat f_j$ are estimates obtained from fitting our GAM. To construct pointwise confidence intervals for the odds ratio curve, we used bootstrap techniques (Cadarso-Suárez et al., 2005).

A simulation study was carried out, not only to evaluate the quality of the link parameter estimates, but also to compare the performance of the proposed GAM with that of the GLM with the same parametric link. We concluded that the estimation process was satisfactory and that a substantial gain in terms of deviance may be achieved with our model.

4 A real case study

To apply the proposed methodology, we studied the elapsed time from first injecting drug use until HIV infection, using a data set of 361 drug users who started using intravenous drugs between 1974 and 1997 and were admitted to the detoxification unit of the Hospital Universitari Germans Trias i Pujol in Badalona, Spain, between 1987 and 2000. For these individuals the moment of HIV infection is unknown. In fact, for 15% of the cases, the only available information about this instant is limited to the interval [instant of last negative HIV test, instant of first positive HIV test]. For the rest of the individuals, we only know their status (infected or not infected) at the date of the last HIV test (the monitoring instant). This means that the data are mainly case I interval censored, and so we decided to treat all the observations as current status data.
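With current status data, each individual contributes only a monitoring time and a 0/1 indicator, so the likelihood is that of a binary response whose success probability is the distribution function evaluated at the monitoring time. The minimal Python sketch below computes the corresponding binomial deviance, that is, minus twice the Bernoulli log-likelihood. The function name `current_status_deviance` and the toy values are assumptions made for illustration; they are not taken from the HIV data set or from the authors' program.

```python
import math


def current_status_deviance(y, mu):
    """Binomial deviance, -2 * log-likelihood, for 0/1 responses.

    y  : indicators (1 = event already occurred at the monitoring time)
    mu : fitted probabilities F(t_i | x_i), each strictly inside (0, 1)
    """
    deviance = 0.0
    for yi, mi in zip(y, mu):
        # Bernoulli contribution: mu if y = 1, and 1 - mu if y = 0.
        deviance += -2.0 * math.log(mi if yi == 1 else 1.0 - mi)
    return deviance


# Toy values for illustration only.
y = [1, 0, 0, 1]
mu = [0.8, 0.3, 0.1, 0.6]
print(current_status_deviance(y, mu))
```

Evaluating such a function at the fitted probabilities obtained for each candidate value of the link parameter is one way to build a deviance profile over a grid.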
From the available data we used the variables age at first intravenous drug use, gender, the elapsed time ($T$), in months, from the instant of first intravenous drug use until the date of the last HIV test (monitoring time), and the indicator variable $Y$ that records the result of the last HIV test (0 if the individual is seronegative, 1 if the individual is infected). We considered the model
\[
\mu = h\{[\beta_0 + f(T)] + f_1(\text{age}) + \beta_1\, \text{gender}\}.
\]
The estimate of the link parameter was obtained by minimizing the deviance, calculated for a grid of values of $\psi$, and we considered that $\hat\psi = 5$ was the best estimate, for a deviance of approximately 377.2. The resulting fitted model is given by
\[
\hat\mu = 1 - \left(\frac{1}{1 + 5\, e^{[5.18 + \hat f(T)] + \hat f_1(\text{age}) + 2.61\, \text{gender}}}\right)^{1/5}.
\]

For the variable age, Figure 1 gives a graphical representation of the odds ratio curve, estimated for both genders and taking the mean age (19 years) as the reference value. As the two panels show, the curves are very similar. There appears to be a lower risk of infection for individuals who started injecting drugs at approximately 26 years of age.

FIGURE 1. OR(age) estimates and corresponding 95% confidence intervals, female and male genders.

Survival curves for both females and males were obtained, and from Figure 2 we can see that time until HIV infection is longer for men. It also seems that the curves level off, and the resulting plateau may indicate the existence of immune individuals in the population. In fact, it is plausible that some injecting drug users take adequate precautions, so that an HIV infection is unlikely to occur.

FIGURE 2. Estimates of the survival functions of time until HIV infection for females and males who initiated their drug addiction with a mean age of 19 years.

Finally, to evaluate the goodness-of-fit of the proposed model, the deviance residuals were examined and no serious trends characteristic of a bad fit were detected. To overcome the lack of global goodness-of-fit tests for this kind of model, we used bootstrap techniques and concluded that the model was reasonably adequate. The 95% bootstrap confidence interval obtained for the deviance was (352.86, 425.69). However, we are aware of the existence of unobserved heterogeneity among the individuals, and we believe that introducing a frailty term would improve the fit of the model.

Acknowledgements: This research was supported by FCT/POCI 2010. The authors would like to thank Drs. Klaus Langohr, Guadalupe Gómez and Robert Muga for making the data available.

References

Aranda-Ordaz, F.J. (1981). On two families of transformations to additivity for binary regression data. Biometrika 68, 357-363.
Buja, A., Hastie, T.J. and Tibshirani, R.J. (1989). Linear smoothers and additive models (with discussion). Annals of Statistics 17, 453-555.
Cadarso-Suárez, C., Roca-Pardiñas, J.R., Figueiras, A. and Manteiga, W. (2005). Non-parametric estimation of the odds ratios for continuous exposures using generalized additive models with an unknown link function. Statistics in Medicine 24, 1169-1184.
Doksum, K.A. and Gasko, M. (1990). On a correspondence between models in binary regression and in survival analysis. International Statistical Review 58, 243-252.
Figueiras, A. and Cadarso-Suárez, C. (2001). Application of nonparametric models for calculating odds ratios and their confidence intervals for continuous exposures. American Journal of Epidemiology 154, 3, 264-275.
Hastie, T. and Tibshirani, R. (1984). Generalized additive models. Tech. Rep. 98, Dept. of Statistics, Stanford University.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, New York.
Pregibon, D. (1980). Goodness of link tests for generalized linear models. Journal of the Royal Statistical Society, Series C 29, 15-24.
Roca-Pardiñas, J., Manteiga, W., Bande, M., Sánchez, J. and Cadarso-Suárez, C. (2004). Predicting binary time series of SO2 using generalized additive models with unknown link function. Environmetrics 15, 1-14.