Appendix A The coxvc_1-1-1 package A.1 Introduction The coxvc_1-1-1 package is a set of functions for survival analysis that run under R2.1.1 [81]. This package contains a set of routines to fit Cox models [24] with time varying effects of the covariates and reduced-rank models [77]. What makes those two modelling approaches so special is that an expanded data set has to be created before fitting, making the task computationally demanding, since even small data sets explode when stacking together all the possible risk sets. Using coxvc the models can be fitted on the original data, in a very fast and efficient algorithm, as described in [76]. The set of routines included in the package also contains some small useful functions that the authors often use when fitting survival models. The coxvc 1-1-1 requires packages MASS, splines and survival [64], which are automatically loaded when you use the package. Please refer to the manual of those packages for more information. The MASS [102] package is loaded for using the command ginverse which is essential when estimating the generalized inverse matrix of the information matrix from a reduced-rank model. Splines are loaded in order to transform some of the covariates when running the models. Note that this package is not essential (although the build in examples of the coxvc package use splines) but it is definitely useful in many applications. Last, the survival package is the base core of the package, since it is needed for creating the survival objects used in our examples. A.2 Statistical background The Cox proportional hazards models is the most common method to analyze survival data. However, the main assumption of proportionality - the hazard ratio of two different cases remain constant regardless of time- is often violated, especially in studies with long follow up. The most straightforward way to extent the model is via the inclusion of interactions of the covariates with time 123
The coxvc_1-1-1 package functions. A non-proportional Cox model may be written as: h(t X) = h 0 (t) exp(xθf ) (A.1) where h 0 (t) is the unspecified baseline hazard, X is an 1 p matrix of p covariates, F is a n q matrix of q time functions, and Θ is a p q matrix of estimable coefficients. Perperoglou, le Cessie and van Houwelingen [77] introduced the idea of reduced-rank regression to survival analysis with time varying coefficients. A reduced-rank model requires the matrix of regression coefficients Θ to be written as a product of two submatrices, B of size p r and Γ of size q r, thus resulting in Θ = BΓ, a matrix of reduced-rank r, smaller than the number of covariates p or the number of time functions q. For fitting the full model, r has to be chosen to be equal to the minimum (p, q), in which case the structure matrix Θ is of full rank. This package was created to fulfil the demand of fitting reduced- rank hazards models in a fast and efficient way. For motivation of the package use refer to [76]. The new version of the package contains an additional set of small functions that were found useful to the author in several cases when analyzing survival data. A.3 Examples First load the coxvc library: > library(coxvc) The sample data within this library come from a study of ovarian cancer patients [104]. There are in total 358 cases of patients with information of the following variables: 124 time The number of days from enrollment until death or censoring. death An indicator of death (1) or censoring (0). karn The karnofsky index measuring the ability of the patients to perform several tasks. diam The diameter of the residual tumor. figo The Figo index, denoting the site of the metastasis. x Patient id
A.3. Examples Table A.1: Definitions of variables and patients frequencies X k 0 1 2 3 4 Karnofsky 100 90 80 70 < 70 n 137 108 47 46 20 X f 0 1 Figo III IV n 262 96 X d 0 1 2 3 4 Diameter Micro < 1 1-2 2-5 > 5 n 29 67 49 68 145 For more information refer to table A.1. First attach the data: > data(ova) > attach(ova) A short summary of the data follows: > summary(ova) time death karn figo Min. : 7.0 Min. :0.000 Min. :0.000 Min. :0.0000 1st Qu.: 369.3 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 Median : 734.5 Median :1.000 Median :1.000 Median :0.0000 Mean : 980.8 Mean :0.743 Mean :1.173 Mean :0.2682 3rd Qu.:1619.3 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.0000 Max. :2729.0 Max. :1.000 Max. :4.000 Max. :1.0000 diam x Min. :0.000 Min. : 1.00 1st Qu.:1.000 1st Qu.: 90.25 Median :3.000 Median :179.50 Mean :2.651 Mean :179.50 3rd Qu.:4.000 3rd Qu.:268.75 Max. :4.000 Max. :358.00 A simple Cox proportional hazards model can be fitted in the usual way using the coxph command from survival library: 125
The coxvc_1-1-1 package > fit.ph <- coxph(surv(time, death) ~ karn + diam + figo) > fit.ph Call: coxph(formula = Surv(time, death) ~ karn + diam + figo) coef exp(coef) se(coef) z p karn 0.172 1.19 0.0532 3.23 1.2e-03 diam 0.200 1.22 0.0495 4.05 5.2e-05 figo 0.542 1.72 0.1353 4.00 6.2e-05 Likelihood ratio test=64.1 on 3 df, p=7.68e-14 n= 358 A test of proportionality based on Schoenfeld residuals [92] reveals that in fact there are deviations from proportional hazards in the data. > cox.zph(fit.ph) rho chisq p karn -0.15773 7.23734 0.00714 diam -0.03464 0.31140 0.57682 figo -0.00405 0.00432 0.94760 GLOBAL NA 9.64451 0.02184 as it is indicated by the small global p-value given above. A graphical inspection given by: > par(mfrow = c(3, 1)) > plot(cox.zph(fit.ph)) The results are shown in figure A.1 and suggest that there may be an interaction of time with the covariates. A first approach will be to fit a full rank model, which includes the full Θ matrix. We choose to transform time using B-splines, thus create the F matrix to contain F 1 (t) = 1 a constant and cubic B-spline functions on 3 degrees of freedom: > Ft <- cbind(rep(1, nrow(ova)), bs(time, df = 3)) Then the full rank model is given by: > fit.r3 <- coxvc(surv(time, death) ~ karn + diam + figo, Ft, rank = 3, + data = ova) > fit.r3 126
A.3. Examples Beta(t) for karn 1 0 1 2 82 200 350 470 630 800 1100 1900 Time Beta(t) for diam 2.0 0.5 1.0 82 200 350 470 630 800 1100 1900 Time Beta(t) for figo 2 0 2 4 82 200 350 470 630 800 1100 1900 Time Figure A.1: Test of proportionality based on scaled Schoenfeld residuals along with a spline smooth with 90% confidence intervals. call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft = Ft, rank = 3, data = ova) coef exp(coef) se(coef) z p karn 0.556 1.744 0.152 3.6579 0.00025 diam 0.199 1.220 0.168 1.1845 0.24000 figo 0.501 1.650 0.392 1.2781 0.20000 karn:f1(t) -1.370 0.254 0.730-1.8767 0.06100 diam:f1(t) 0.043 1.044 0.696 0.0618 0.95000 figo:f1(t) 0.543 1.721 1.797 0.3022 0.76000 karn:f2(t) 0.497 1.644 1.396 0.3560 0.72000 diam:f2(t) 0.465 1.592 1.072 0.4338 0.66000 figo:f2(t) -2.280 0.102 3.201-0.7123 0.48000 127
The coxvc_1-1-1 package karn:f3(t) -1.676 0.187 2.150-0.7795 0.44000 diam:f3(t) -1.550 0.212 1.546-1.0026 0.32000 figo:f3(t) 3.481 32.492 4.358 0.7988 0.42000 log-likelihood= -1373.840 algorithm converged in 5 iterations The class of object fit.r3 is coxvc. The generic function printcoxvc is included in the package for printing results from the full model. The model has 21 parameters, and in practice the results are identical with fitting a coxph model on the expanded data set. However, the fit here was done in 5 iterations, on the original data set, which makes the routine much faster and more efficient. There are in total 266 events present in the ovarian data set. The object fit.r3 also contains the baseline hazard evaluated at this event time points. The function expand.haz can be used for expanding either the baseline or the cumulative baseline hazard. > haz <- fit.r3$hazard > length(haz) [1] 266 > haz.exp <- expand.haz(haz, death, fun = "baseline") > length(haz.exp) [1] 358 When expanding the baseline hazard, the function assigns a zero value in the time points of censoring, while when expanding a cumulative baseline hazard, the function assigns the value of the cumulative baseline at the time where the previous event took place whenever there is a censored case. > cum.haz <- cumsum(haz) > cum.haz.exp <- expand.haz(cum.haz, death, fun = "cumulative") The function plotcoxvc is included in the package to draw figures of the time varying behavior of the covariates: > plotcoxvc(fit.r3, fun = "effects", xlab = "time in days") The same function can be also used for plotting the survival function. Since the object fit.r3 is a coxvc using plot(survfit(...)) will not give the survival plot. Instead, the function plotcoxvc can be used: 128
A.3. Examples 1.0 0.5 0.0 0.5 1.0 1.5 2.0 karn diam figo 0 500 1000 1500 2000 time in days Figure A.2: Estimated effects of the covariates over time, for the full rank model. > plotcoxvc(fit.r3, fun = "survival", xlab = "time in days") In figure A.2 we have seen that the time varying behavior of the covariates is too flexible, especially in the last days of the follow up. We fitted a rank=2 model at the data, to see whether the fit improves: > fit.r2 <- coxvc(surv(time, death) ~ karn + diam + figo, Ft, rank = 2, + data = ova) > fit.r2 call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft = Ft, rank = 2, data = ova) coef exp(coef) se(coef) karn 0.545 1.7246084 0.151 129
The coxvc_1-1-1 package 0.0 0.2 0.4 0.6 0.8 1.0 0 500 1000 1500 2000 time in days Figure A.3: Survival function for the full rank model. diam 0.176 1.1924381 0.163 figo 0.616 1.8515072 0.358 karn:f1(t) -1.277 0.2788727 0.711 diam:f1(t) 0.209 1.2324450 0.646 figo:f1(t) -0.415 0.6603403 1.259 karn:f2(t) 0.239 1.2699785 1.322 diam:f2(t) 0.094 1.0985597 0.942 figo:f2(t) 0.298 1.3471618 1.455 karn:f3(t) -1.146 0.3179059 1.964 diam:f3(t) -0.830 0.4360493 1.352 figo:f3(t) -2.054 0.1282210 2.202 log-likelihood= -1375.299, Rank= 2 algorithm converged in 12 iterations 130
A.3. Examples Beta : Gamma: [,1] [,2] [,1] [,2] [1,] 5.4171-72.6259 [1,] 1.0610 0.0716 [2,] 0.3858-3.2510 [2,] 1.8601 0.1563 [3,] 3.8667-48.6655 [3,] 0.5818 0.0401 [4,] -5.4324-0.3894 > summary(fit.r2) call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft = Ft, rank = 2, data = ova) Beta : Gamma: [,1] [,2] [,1] [,2] [1,] 5.4171-72.6259 [1,] 1.0610 0.0716 [2,] 0.3858-3.2510 [2,] 1.8601 0.1563 [3,] 3.8667-48.6655 [3,] 0.5818 0.0401 [4,] -5.4324-0.3894 The class of fit.r2 is coxrr. For reduced-rank models the generic function print.coxrr will print the estimated coefficients of the model along with their standard errors and so forth, as well as the factors of the Θ matrix, B and Γ. Moreover, the function summary.coxrr will provide also summary of the B and Γ matrices. We see that the rank=2 model, with 16 parameters in total, has a more reasonable fitting of the covariate effects > plotcoxvc(fit.r2, fun = "effects", xlab = "time in days") while the rank=1 model with 9 free parameters, is more much more rigid: > fit.r1 <- coxvc(surv(time, death) ~ karn + diam + figo, Ft, rank = 1, + data = ova) > fit.r1 call: coxvc(formula = Surv(time, death) ~ karn + diam + figo, Ft = Ft, rank = 1, data = ova) 131
The coxvc_1-1-1 package 1.0 0.5 0.0 0.5 1.0 karn diam figo 0 500 1000 1500 2000 time in days Figure A.4: Estimated effects of the covariates over time, for the rank=2 model. coef exp(coef) se(coef) karn 0.334 1.39654314 0.101 diam 0.381 1.46374761 0.122 figo 0.779 2.17929188 0.258 karn:f1(t) -0.478 0.62002220 0.334 diam:f1(t) -0.545 0.57984178 0.396 figo:f1(t) -1.115 0.32791528 0.796 karn:f2(t) 0.298 1.34716179 0.562 diam:f2(t) 0.340 1.40494759 0.638 figo:f2(t) 0.695 2.00370907 1.314 karn:f3(t) -1.142 0.31918002 0.873 diam:f3(t) -1.303 0.27171542 0.981 figo:f3(t) -2.664 0.06966899 2.067 132
A.3. Examples log-likelihood= -1378.049, Rank= 1 algorithm converged in 5 iterations Beta : Gamma: [,1] [,1] [1,] 0.2146 [1,] 1.5553 [2,] 0.2450 [2,] -2.2257 [3,] 0.5008 [3,] 1.3876 [4,] -5.3198 > plotcoxvc(fit.r1, fun = "effects", xlab = " time in days") 1.0 0.5 0.0 0.5 1.0 karn diam figo 0 500 1000 1500 2000 time in days Figure A.5: Estimated effects of the covariates over time, for the rank=1 model. The package also contains a small function calc.h0 to compute the baseline hazard from a Cox model, evaluated for a case with all covariate values equal 133
The coxvc_1-1-1 package to zero. For example consider the simple proportional hazards model fit.ph. To get an estimate of the baseline hazard the function coxph.details can be used: > haz.ph <- coxph.detail(fit.ph)$haz > haz.ph0 <- calc.h0(fit.ph) The object haz.ph is the baseline hazard evaluated at the mean value of the covariates, while the object haz.ph0 is the baseline hazard evaluated for all covariate values equal to zero. This can be seen in graph A.6: > plot(time[death == 1], exp(-cumsum(haz.ph)), ylim = c(0, 1), + ylab = "", "l") > lines(time[death == 1], exp(-cumsum(haz.ph0)), col = 2) 0.0 0.2 0.4 0.6 0.8 1.0 0 500 1000 1500 2000 time[death == 1] Figure A.6: Figure of survival for an average person (black line) and a person with covariates X = 0 134