The practical utility of incorporating model selection uncertainty into prognostic models for survival data


Short title: Incorporating model uncertainty into prognostic models

Nicole Augustin
Department of Statistics, University of Glasgow, Glasgow, G12 8QQ, UK

Willi Sauerbrei
Institut für Medizinische Biometrie und Medizinische Informatik, Universitätsklinikum Freiburg, Stefan-Meier-Str. 26, D-79104 Freiburg, Germany

Martin Schumacher
Institut für Medizinische Biometrie und Medizinische Informatik, Universitätsklinikum Freiburg, Freiburg, Germany

April 8, 2004

Abstract

Predictions of disease outcome in prognostic factor models are usually based on one selected model. However, often several models fit the data equally well, but these models might differ substantially in terms of included explanatory variables and might lead to different predictions for individual patients. For survival data we discuss two approaches for accounting for model selection uncertainty in two data examples, with the main emphasis on variable selection in a proportional hazards Cox model. The main aim of our investigation is to establish in which ways either of the two approaches is useful in such prognostic models. The first approach is Bayesian model averaging (BMA) adapted for the proportional hazards model (Volinsky et al., 1997). As a new approach we propose a method which averages over a set of possible models using weights estimated from bootstrap resampling, as proposed by Buckland et al. (1997), but in addition we perform an initial screening of variables based on the inclusion frequency of each variable to reduce the set of variables and corresponding models. The main objective of prognostic models is prediction, but the interpretation of single effects is also important, and models should be general enough to ensure transportability to other clinical centres. In the data examples we compare predictions of the two approaches with conventional predictions from one selected model and with predictions from the full model. Confidence intervals are compared in one example. Comparisons are based on the partial predictive score and the Brier score. We conclude that the two model averaging methods yield similar results and are especially useful when there is a high number of potential prognostic factors, most likely some of them without influence in a multivariable context. The new method based on bootstrap resampling has the additional positive effects of achieving model parsimony by reducing the number of explanatory variables and dealing with correlated variables in an automatic fashion.

Keywords: model selection uncertainty; survival analysis; model averaging; bootstrap; prognostic factor models.

1 Introduction

Prognostic studies in medicine are about identifying factors influencing disease outcome. Identifying and ranking these factors involves statistical modelling, usually with a regression model of some form; in medicine this is often the logistic regression model or the Cox proportional hazards model (Cox, 1972). The resulting models are the basis for predictions of disease outcome, which are of potential value for therapeutic and other clinical decisions. For many diseases studied there is a vast number of potential factors, and there is often a lot of uncertainty about the best model for prediction. For example, in breast cancer the effects of more than 200 prognostic factors are currently discussed. Even if only 10 variables are investigated in one study, this can lead to a large set of possible models. Not surprisingly, it is often the case that several models provide similar fits to the data, but these models differ in terms of included factors and predicted prognostic indices for individual patients. The differences are often substantial if predictions are made at extreme values of the range of the data. Also, finding the variables with the highest prognostic value is a difficult task, mainly because selected models are often unstable (Steyerberg et al., 2001; Sauerbrei, 1999; Breiman, 1996b; Sauerbrei and Schumacher, 1992; Altman and Andersen, 1989; Chen and George, 1985). Standard statistical methodology involved in making the predictions and variance estimation is based on the assumption that the model used was pre-specified without any data-dependent modelling. However, every model presented is based on many implicit assumptions made before the actual formal model selection process starts. These assumptions include data interpretation, e.g. variable definitions, coding and dealing with missing values. Also a model class is chosen; for survival data, whether a Cox model or a fully parametric model is fitted. Then a decision on the appropriate criterion for defining the best model for model selection is needed. Such criteria include combinations of measures of the discrepancy of fit (the deviance function), the prediction error, model stability, or criteria like model complexity. Obviously, many other assumptions are required before starting a model selection process, and decisions on these aspects can have a strong influence on the final model.

See Blettner and Sauerbrei (1993) for some background on these pre-model-selection decision processes for a specific example, and Chatfield (2002) for a discussion of further aspects. In this paper we will not tackle all the modelling uncertainties mentioned above. We consider two studies with survival data analysed with a Cox regression model. We restrict ourselves to fitting only linear terms in our model and assume that the further assumptions of the Cox model are acceptable. We will concentrate on the issue of whether a variable is included in or excluded from a prognostic model. Despite these assumptions and pre-modelling decisions we still have thousands of possible models, e.g. 15 variables in one example result in 2^15 = 32768 candidate models from which to select a prognostic model. Ignoring the uncertainty about the variable selection process may lead to unjustifiable confidence in parameter estimates and classification schemes derived from the selected model (Chatfield, 1995; Draper, 1995; Hoeting et al., 1999; Holländer et al., 2004). For example, confidence intervals may not yield their nominal coverage. One possible way to incorporate model selection uncertainty is to average over the set of possible models. In cases where the aim is to find the most likely model, it is useful to state the uncertainty attached to the selected model along with the results of other good approximating models. In this paper we describe some of the currently existing methods and propose a modification of one approach for handling model selection uncertainty. These are mainly Bayesian approaches (Chatfield, 1995; Draper, 1995; Kass and Raftery, 1995; Raftery et al., 1997; Hoeting et al., 1999) and methods based on bootstrap resampling (Buckland et al., 1997; Burnham and Anderson, 1998). In recent years these methods have been applied in many fields, such as mark-recapture (Stanley and Burnham, 1998; Brooks et al., 2000), ecology (MacKenzie and Kendall, 2002), econometrics (Bunnin et al., 2002) and medicine (Viallefont et al., 2001). Here we are specifically interested in applications of the above approaches to survival data. The Bayesian model averaging (BMA) method was adapted to survival data by Volinsky et al. (1997); see also Hoeting et al. (1999) for an application. Buckland et al. (1997) proposed an approach using bootstrap resampling and used it in a simple example with a model space consisting of only 4 models. For more complex situations with a larger model space we propose a modification using a two-step approach. First a subset of variables with no or little influence is eliminated, and then model averaging is applied to models selected from the remaining variables, using weights derived from a subsequent bootstrap resampling investigation. We investigate whether these two methods can improve predictions for prognostic models of cancer (glioma) and a liver disease (primary biliary cirrhosis) relative to three alternatives: (i) a single model selected using backward elimination (BE) with a predefined low nominal selection level, (ii) the full model, and (iii) predictions using the Kaplan-Meier estimate, where no covariates are used.

2 Data

Example 1: Glioma

In a randomised clinical trial two chemotherapies were compared for patients with malignant glioma (Ulm et al., 1989; Sauerbrei and Schumacher, 1992; Sauerbrei, 1999). Besides therapy, 15 variables are considered to be potential prognostic factors (see Table 1). Here we use the data from 411 patients with complete information for the variables of interest. After a median follow-up time of 712 days, 274 deaths had occurred. In Sauerbrei (1999) the full model was considered and three other possible models were selected using BE with different significance levels (α = 0.157, 0.05 and 0.01). The resulting models had nine, five and four variables, respectively (Table 1). All models had the variables mali1, age, Karn1 and surg1 in common. Furthermore, the linear predictors from the three models after selection are highly correlated with the predictor from the full model; the highest pairwise Pearson correlation coefficient is 0.99, for the BE(0.157) model with 9 variables, and the lowest is 0.94, for the BE(0.01) model with 4 variables. Each of the three variables originally measured with three ordered categories was coded with two dummy variables. Eliminating one of the dummies corresponds to collapsing two categories, which is sensible if two adjacent categories have a similar influence on outcome. Judged by the log-likelihood, the full model has the largest value. However, the leave-one-out cross-validated log-likelihood, a measure of the predictive value of a model (Verweij and Houwelingen, 1993), shows that even the model based on BE(0.157) may include variables which do not increase the predictive value of the model. Cross-validated log-likelihood values of the models with 4 or 5 variables are very similar; the corresponding value of the full model is worse.

Table 1: Brain tumour (Glioma) example: Results and selected explanatory variables of the full model and different selected models using backward elimination (BE) with different significance levels are shown. l(1) is the cross-validated log-likelihood (Van Houwelingen and LeCessie, 1990).

                                                  full model  BE(0.157)  BE(0.05)  BE(0.01)
log-likelihood
l(1)
sex                                                   x
anam: anamneses                                       x
mali1: grade of malignancy 2-3                        x           x          x         x
mali2: grade of malignancy 3                          x           x
age: age                                              x           x          x         x
Karn1: Karnofsky index >= 70                          x           x          x         x
Karn2: Karnofsky index > 80                           x
surg1: surgical resection
       (0 = biopsy, 1 = partial or total)             x           x          x         x
surg2: surgical resection (total)                     x           x
conv: convulsiva                                      x
cort: cortisone                                       x           x
epil: epilepsy                                        x           x          x
amne: amnesia                                         x
psyc: organic psychosyndrome                          x           x
apha: aphasia                                         x

Example 2: Primary Biliary Cirrhosis

Our second example concerns the survival times of 312 patients with the fatal chronic liver disease primary biliary cirrhosis (PBC). The data are from a double-blind randomised clinical trial comparing the drug DPCA with a placebo. The clinical, biochemical, serological and histological parameters listed in Table 5 are possible prognostic factors for the disease. Fleming and Harrington (1991) showed that the drug DPCA had only a negligible effect on prognosis, hence we do not include it. Following Hoeting et al. (1999) we exclude 8 observations due to missing values in the covariates. This leaves 304 patients with 123 events for this analysis. In addition we exclude the binary variable ascites (swelling in the abdominal cavity), because less than 8% of cases had ascites present and this caused convergence problems in some of the bootstrap samples. In the full model the effect of ascites is negligible. Like Hoeting et al. (1999) we used the log of bilirubin, albumin, SGOT and alkaline phosphatase. In contrast to Hoeting et al. (1999) we do not use the log of prothrombin time. From previous analysis (Fleming and Harrington, 1991) it is not clear whether transforming this variable improves the model fit. Also, using the log leads to a very high expected odds ratio in the full model, and there are convergence problems in the bootstrap samples. Fleming and Harrington (1991) used a Cox regression model for the data and selected five of the variables (age, l.albu, l.bili, edema and l.prothr). Hoeting et al. (1999) and Raftery et al. (1996) analysed these data using BMA.

3 Survival models

We will use the Cox model, where the hazard λ at time t of the event is modelled as a function of the explanatory variables X = (x_1, ..., x_p). The hazard function is

$$\lambda_i(t \mid X) = \lambda_0(t) \exp(x_i \beta), \qquad (1)$$

where β is the vector of coefficients of the explanatory variables, x_i is the covariate vector of patient i and λ_0(t) is the baseline hazard. For prognosis the survivor function will be estimated by

$$\hat{S}_i(t) = \left[\hat{S}_0(t)\right]^{\exp(x_i \hat\beta)},$$

where S_0(t) is the baseline survivor function.
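As an illustration of equation (1) and the estimated survivor function, here is a minimal sketch in Python using the lifelines package. The file name and the column names ("time", "event" and the covariates) are placeholders, not the actual variable names of the glioma or PBC data sets.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data frame: one row per patient, with follow-up time,
# event indicator (1 = death, 0 = censored) and covariate columns.
df = pd.read_csv("survival_data.csv")

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")  # estimates beta and lambda_0(t)

# S_i(t) = S_0(t)^exp(x_i beta): predicted survivor functions for, say,
# the first five patients at two time points of interest.
covariates = df.drop(columns=["time", "event"])
surv = cph.predict_survival_function(covariates.iloc[:5], times=[310, 730])
print(surv)
```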

As apparent from the two data examples, there is typically a high number of possible prognostic factors, often above 10. Even without fitting interactions and considering only the in- and exclusion of variables, this results in 2^10 = 1024 models. In the glioma example we have 2^15 = 32768 and in the PBC example there are 2^13 = 8192 possible models. If prior information does not lead to a smaller set of candidate variables with a potential effect, the model space needs to be reduced. The two approaches we describe below achieve this (a) by directly choosing a subset of models using Occam's window or (b) indirectly by choosing a smaller subset of explanatory variables after using a screening step. As variable selection method we will use backward elimination (BE) with the BIC criterion (BE_BIC). BE may be used as a substitute for an all-subsets procedure if the corresponding predefined selection level is chosen (Sauerbrei, 1999). Because of the comparison to Bayesian model averaging we used the BIC criterion; however, other criteria with a higher selection level, such as AIC or a nominal level of 0.05, could have been chosen.
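To fix ideas, here is a minimal sketch of BE_BIC for the Cox model, in Python with lifelines. The helper names are ours, not from the paper's software; the penalty k·log(d), with d the number of events, anticipates the definition given later in Section 4.4.

```python
import numpy as np
from lifelines import CoxPHFitter

def cox_bic(df, duration_col, event_col, variables):
    """BIC-type criterion for a Cox model: -2*logPL + k*log(d),
    where d is the number of events."""
    cph = CoxPHFitter()
    cph.fit(df[[duration_col, event_col] + variables], duration_col, event_col)
    d = df[event_col].sum()
    return -2.0 * cph.log_likelihood_ + len(variables) * np.log(d)

def be_bic(df, duration_col, event_col, candidates):
    """Backward elimination (BE_BIC): repeatedly drop the variable whose
    removal decreases the BIC; stop when no removal helps. The null
    model (no covariates) is not considered in this sketch."""
    current = list(candidates)
    best = cox_bic(df, duration_col, event_col, current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for v in list(current):
            trial = [w for w in current if w != v]
            b = cox_bic(df, duration_col, event_col, trial)
            if b < best:
                best, current, improved = b, trial, True
                break  # restart the scan on the reduced set
    return current
```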

4 Quantifying model selection uncertainty

4.1 Bayes approaches

Bayesian model averaging (BMA) is one approach to incorporating model uncertainty into inference. We give a short summary of the method; more details are given in Chatfield (1995), Draper (1995), Kass and Raftery (1995) and Hoeting et al. (1999). Consider the possible set of models M_1, ..., M_j, ..., M_J with associated prior probabilities of M_j being the true model. The set M_1, ..., M_J is often called the model space. Let θ be the parameter or quantity of interest, which may depend on the covariates, e.g. the linear predictor (prognostic index). The most general definition of θ is the whole predictive distribution. Given the data D, the posterior distribution pr(θ | M_j, D) of the estimated quantity θ under a specific model j is averaged over the different models considered, weighted by the respective pr(M_j | D):

$$pr(\theta \mid D) = \sum_{j=1}^{J} pr(\theta \mid M_j, D)\, pr(M_j \mid D). \qquad (2)$$

Both terms in equation (2) require solving integrals; for the general linear model these are available in closed form, but this is not the case for the Cox model. The posterior summary statistics are evaluated by incorporating the model posterior probabilities (Hoeting et al., 1999; Draper, 1995), e.g. the posterior mean is

$$E(\theta \mid D) = \sum_{j=1}^{J} E(\theta \mid M_j, D)\, pr(M_j \mid D). \qquad (3)$$

The specification of prior model probabilities pr(M_j) assumes that one of the models considered is true, since all other models which do not belong to the set M_1, ..., M_J have a prior probability of zero. Defining suitable priors pr(M_j) for the different models appears to be difficult with this approach, especially with a large set of possible models. The simplest suggested solution is to give each model equal weight when no prior information is available. This choice of prior can be problematic if two correlated explanatory variables lead to two models M_1 and M_2, each containing one of the two correlated variables; then the two models will be almost equivalent. Giving the two models equal weight a priori amounts to giving the single model, of which M_1 and M_2 are slightly different versions, twice as much actual prior weight. Another problem of the BMA approach is that large summations and integrals are necessary, especially if the model space is large, and these are often hard to compute. One approach to handling the large model space is to restrict it to models with a posterior probability above a certain threshold, e.g. by using Occam's window (Madigan and Raftery, 1994). Occam's window includes all models with a posterior probability greater than a certain proportion of that of the model with the highest posterior probability.
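Under equal model priors and a BIC approximation (used in the next subsection), the posterior model probabilities and Occam's window reduce to simple weight arithmetic. The following sketch shows our own helper functions under these assumptions, not code from the paper:

```python
import numpy as np

def posterior_model_probs(bic_values):
    """Approximate pr(M_j | D) proportional to exp(-BIC_j / 2),
    normalised over the model set (equal prior model probabilities)."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-(b - b.min()) / 2.0)  # shift by the minimum for numerical stability
    return w / w.sum()

def occams_window(probs, C=20.0):
    """Keep only models whose posterior probability is at least 1/C of
    the highest posterior probability, and renormalise the rest."""
    p = np.asarray(probs, dtype=float)
    keep = p >= p.max() / C
    trimmed = np.where(keep, p, 0.0)
    return keep, trimmed / trimmed.sum()

# Example: three models with hypothetical BIC values.
keep, probs = occams_window(posterior_model_probs([1500.2, 1501.0, 1510.3]))
```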

4.2 Bayesian model averaging for survival data (approx. BMA)

For carrying out BMA within the Cox model, the integrals of both components in equation (2) are not available in closed form. Volinsky et al. (1997) suggest several approximations (see also Hoeting et al. (1999)). Instead of integrating the posterior distribution of the quantity of interest, Volinsky et al. (1997) use the partial maximum likelihood approximation. If the quantity of interest is the prognostic index, this so-called predictive distribution is the partial likelihood with the maximum likelihood estimator θ̂_j inserted. For deriving pr(M_j | D), the posterior model probabilities in equation (2), the log of the integrated likelihood is approximated by the Bayesian information criterion (BIC) (Schwarz, 1978). Before averaging, models not supported by the data are excluded using Occam's window: these are all models where the posterior probability is smaller than a specified constant, e.g. 1/C, times the highest posterior probability. Instead of calculating all posterior model probabilities, an altered version of the leaps and bounds algorithm, also used for all-subsets selection (Furnival and Wilson, 1974; Lawless and Singhal, 1978; Kuk, 1984), is used to narrow down the choice of models. Volinsky et al. (1997) claim that by choosing a large enough set of best models with the leaps and bounds algorithm, all models within Occam's window are found. Posterior mean estimates and the variance of θ are calculated using weighted averages from the separate models, with normalised weights proportional to pr(M_j | D) and with summation over models within Occam's window. Volinsky et al. (1997) apply this approach to Cox models for assessing the risk of stroke from a cardiovascular health study. As a measure of the importance of the different variables, posterior effect probabilities pr(β_i ≠ 0 | D) for the coefficient associated with each variable x_i are given: this is the sum of pr(M_j | D) over all models containing variable x_i. In the following we will call their approach approx. BMA. To our knowledge it is currently not implemented in any of the mainstream statistics packages or Bayesian specialist software such as BUGS, but the authors have made some of their code publicly available.
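Given a set of models and their (approximate) posterior probabilities, the posterior effect probabilities are a one-line summation. A sketch with hypothetical names, using glioma variable names only as toy input:

```python
def posterior_effect_probs(models, probs):
    """pr(beta_i != 0 | D): for each variable x_i, the summed posterior
    probability of all models that contain x_i. `models` is a sequence
    of variable collections, `probs` the matching pr(M_j | D)."""
    variables = set().union(*map(set, models))
    return {v: sum(p for m, p in zip(models, probs) if v in m)
            for v in sorted(variables)}

# Toy example with three models (the probabilities are made up):
models = [("mali1", "age", "Karn1", "surg1"),
          ("mali1", "age", "surg1"),
          ("age", "surg1", "epil")]
print(posterior_effect_probs(models, [0.5, 0.3, 0.2]))
# age and surg1 get 1.0, mali1 0.8, Karn1 0.5, epil 0.2
```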

4.3 Non-Bayesian approaches

Buckland et al. (1997) have suggested several more or less ad hoc approaches to accounting for model uncertainty. In their approach an information criterion I, e.g. the Akaike information criterion, is used to estimate model weights for parameter averaging and variance estimation. These weights have the form

$$w_j = \frac{\exp(-I_j/2)}{\sum_{i=1}^{J} \exp(-I_i/2)}, \qquad j = 1, \ldots, J,$$

where I_j is the information criterion for model j. The expected value of the averaged parameter is then

$$\hat\theta = \sum_{j=1}^{J} \hat\theta_j w_j, \qquad (4)$$

where θ̂_j is the estimated parameter or quantity from the observed data. The variance of θ̂ is estimated as follows (Buckland et al., 1997):

$$\widehat{\mathrm{var}}(\hat\theta) = \left\{ \sum_{j=1}^{J} w_j \sqrt{\widehat{\mathrm{var}}(\hat\theta_j \mid \gamma_j) + \gamma_j^2} \right\}^2, \qquad (5)$$

where the model misspecification bias γ_j is estimated by γ̂_j = θ̂_j − θ̂. Here θ̂_j and var̂(θ̂_j | γ_j) are the conditional estimates under the selected model j and θ̂ is found using equation (4). These estimates are plugged into the above equation. Setting I equal to the Bayes information criterion, the weights in equation (4) are almost equal to the model posterior probabilities pr(M_j | D) as approximated by Volinsky et al. (1997) when non-informative priors are used (see above). The main difference is that in pr(M_j | D) the normalising constant is not the sum over all possible models, but a sum over a smaller subset of best models as identified by the leaps and bounds algorithm and Occam's window.

As an alternative way of obtaining model weights w_j, Buckland et al. (1997) suggest applying the bootstrap to model selection and give an illustration for an example with 4 models under consideration. The weight of model j is then estimated by the proportion of bootstrap samples in which model j was selected, and bootstrap percentile confidence intervals are given by ordering the bootstrap estimates from smallest to largest and selecting the appropriate percentiles from the list. The idea of bootstrapping the modelling process has been suggested in the past mainly for investigating model selection stability issues, e.g. Efron and Gong (1983), Chen and George (1985), Dijkstra and Veldkamp (1994), Altman and Andersen (1989), Sauerbrei and Schumacher (1992) and Hjorth (1994). Not many investigations exist incorporating a term to account for model uncertainty in the estimate of the variance. In the special case of two orthogonal covariates, Candolo et al. (2003) show analytically that in the linear regression model the above approach reduces the mean squared error if the effects of covariates are relatively weak. They support their findings by results from a simulation study considering 8 models. Especially when working with survival data, resampling patients is the preferred bootstrap approach.
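Equations (4) and (5) are straightforward to compute once the per-model estimates, their conditional variances and the weights are available. A sketch of our own helper function, directly transcribing the two formulas:

```python
import numpy as np

def buckland_average(theta, var_theta, w):
    """Model-averaged estimate (equation 4) and variance (equation 5)
    of Buckland et al. (1997). `theta` are the per-model estimates,
    `var_theta` their conditional variances, `w` the model weights
    (information-criterion based or bootstrap selection frequencies)."""
    theta = np.asarray(theta, dtype=float)
    var_theta = np.asarray(var_theta, dtype=float)
    w = np.asarray(w, dtype=float)
    theta_avg = np.sum(w * theta)                            # equation (4)
    gamma = theta - theta_avg                                # misspecification bias
    var_avg = np.sum(w * np.sqrt(var_theta + gamma**2))**2   # equation (5)
    return theta_avg, var_avg
```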

A method that uses model averaging is bootstrap aggregating, or bagging (Breiman, 1996a). In bagging the bootstrap is applied to model selection, and the parameter of interest, e.g. the prognostic index, is an average of the B bootstrap estimates:

$$\hat\theta = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^{(b)},$$

where θ̂^(b) is the estimate from the b-th bootstrap sample, b = 1, ..., B.

4.4 Screening of variables and bootstrap model averaging (bootstrap MA)

We use the bootstrap approach of Buckland et al. (1997), but in a first step we eliminate single variables thought to have at most a weak influence before we apply model averaging. This screening step helps to reduce the number of models substantially, as we have thousands of models in our examples, in contrast to only 4 models in Buckland et al. (1997). It helps to obtain more reliable weights for the models, and it is most helpful for the use of the predictor in future applications, as it does not require measuring all variables, even those with only a small influence on the predictor. Our approach is also in contrast to the approx. BMA approach, where the set of models is reduced by Occam's window but each variable may still be required because it is included in at least one of the models passing the Occam's window criterion. For the first step we use parts of a strategy suggested by Sauerbrei and Schumacher (1992) based on bootstrap resampling. Model selection is carried out in the bootstrap samples. In order to make our selection procedure comparable to approx. BMA, we use BE with a selection level corresponding to the BIC criterion, where the penalty for each covariate is log(n); n is here the number of events in the survival data. We denote this selection algorithm by BE_BIC. Then the number of times each covariate is included in the model, i.e. the inclusion frequency h(x_i), is recorded. Instead of the BIC criterion another criterion, e.g. the AIC or a nominal selection level, say 0.05, can be used. Now a subset is selected containing only variables with an inclusion frequency above a minimum value. Variables with no or hardly any prognostic influence will have very low inclusion frequencies and will therefore be eliminated in this first step (Sauerbrei and Schumacher, 1992). Obviously, the selection criterion influences the interpretation of the inclusion frequencies; see Sauerbrei and Schumacher (1992), where this is discussed for the special situation of one variable. In a second step we carry out another bootstrap resampling procedure and apply BE_BIC to the subset of remaining variables from the first step.

The set of possible models is substantially reduced after the first step, e.g. from 2^15 to 2^7 in the glioma example (Table 2). Now we record the number of times each model is selected and use these model inclusion frequencies h(M_j) as weights in equation (4) for model averaging. The screening step also influences estimation of the variance of the predictor, as the latter uses the bootstrap samples of the second step. Furthermore, the second step in our procedure makes model selection in each bootstrap sample much faster and gives larger weights to models supported by the data, because it considers only models in which every variable probably has some influence. This allows us to work with a relatively small number of bootstrap samples (several hundred to about a thousand). The precision of the relative model inclusion frequencies h(M_j) can be increased by a larger number of bootstrap replicates.
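The two-step procedure can be sketched as follows, reusing the be_bic() sketch shown earlier. The resampling scheme (patients drawn with replacement), the number of replicates B and the cut-off follow the description above; the function itself is our own illustration, not the authors' implementation.

```python
import numpy as np

def bootstrap_ma_weights(df, duration_col, event_col, candidates,
                         B=500, cutoff=0.30, seed=1):
    """Two-step bootstrap model averaging: step one screens variables by
    their bootstrap inclusion frequency h(x_i) under BE_BIC; step two
    reruns the bootstrap on the screened subset and returns the model
    selection frequencies h(M_j), used as averaging weights."""
    rng = np.random.default_rng(seed)
    n = len(df)

    def resample():  # resample patients with replacement
        return df.iloc[rng.integers(0, n, size=n)].reset_index(drop=True)

    # Step one: inclusion frequencies h(x_i)
    counts = {v: 0 for v in candidates}
    for _ in range(B):
        for v in be_bic(resample(), duration_col, event_col, candidates):
            counts[v] += 1
    screened = [v for v in candidates if counts[v] / B > cutoff]

    # Step two: model frequencies h(M_j) on the screened subset
    model_counts = {}
    for _ in range(B):
        m = tuple(sorted(be_bic(resample(), duration_col, event_col, screened)))
        model_counts[m] = model_counts.get(m, 0) + 1
    h_m = {m: c / B for m, c in model_counts.items()}
    return screened, h_m
```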

Table 2: Glioma example: Results of the bootstrap MA approach. The inclusion frequencies h(x_i) are based on the models selected in 500 bootstrap samples with BE_BIC. The 27 models are the result of step two, based on the 7 explanatory variables with h(x_i) > 30% selected in step one. The corresponding h(M_j) resulting from different selection levels in step one are also given. * Variables with h(x_i) > 30%, part of the subset of explanatory variables after step one.

explan.      step one:    step two: 10 models with the
variables    h(x_i)%      highest h(M_j)% out of 27
sex
anam
mali1*       81           x  x  x  x  x  x  -  x  -  -
mali2*                    x  x  -  x  x
age*         100          x  x  x  x  x  x  x  x  x  x
Karn1*       64           x  x  x  -  -  x  x  -  x  x
Karn2        4
surg1*       99           x  x  x  x  x  x  x  x  x  x
surg2
conv
cort*                     x  x  -  x
epil*        37           -  x  -  -  x  -  x
amne
psyc
apha

h(M_j)%, h(x_i) > 40%
h(M_j)%, h(x_i) > 30%
h(M_j)%, h(x_i) > 20%
h(M_j)%, h(x_i) > 10%
h(M_j)%, h(x_i) > 0%

5 Assessing the predictive performance

5.1 Criteria

We compare the predictive performance of the different approaches using the partial predictive score (PPS) and the Brier score (Brier, 1950; Graf et al., 1999).

Partial predictive log-score (PPS): Volinsky et al. (1997) suggested using the partial predictive log-score (PPS). This is a possible criterion if the quantity of interest is the prognostic index. Let D_build be the data used for the model building process and D_test the data used for validation. The PPS is based on the logarithmic scoring rule, the summed logarithm of the observed ordinate of the predictive density for the observations in D_test:

$$\sum_{d \in D_{test}} \log\, pr(d \mid M_j, D_{build}),$$

and for BMA the PPS is then

$$\sum_{d \in D_{test}} \log \left\{ \sum_{j=1}^{J} pr(d \mid M_j, D_{build})\, pr(M_j \mid D_{build}) \right\}.$$

For the Cox model the predictive density pr(d | M_j, D_build) is not available, so Volinsky et al. (1997) suggest using the partial likelihood (PL) instead, with the maximum likelihood estimator β̂_j from D_build inserted:

$$PPS = \sum_{d \in D_{test}} \log \left\{ \sum_{j=1}^{J} PL(\hat\beta_j^{build})\, pr(M_j \mid D_{build}) \right\},$$

i.e. the PPS is the sum of the partial log-likelihood evaluated for each observation in D_test. That is (assuming no ties):

$$PL(\hat\beta^{build}) = \prod_{i \in D_{test}} \left[ \frac{\exp(x_i \hat\beta^{build})}{\sum_{l \in R(t_i)} \exp(x_l \hat\beta^{build})} \right]^{\delta_i},$$

where x_i = (x_i1, ..., x_ip) and t_i is the observed time for the i-th patient. The indicator δ_i is zero if t_i is right-censored, unity otherwise, and R(t_i) is the risk set at time t_i. Since the partial likelihood is only defined for a whole data set, the derivation is not straightforward for a subset. The cross-validated leave-one-out partial log-likelihood as defined by Verweij and Houwelingen (1993) takes account of this problem. The above definition of the partial likelihood is only an approximation and is only valid for a fairly high number of observations in the test data. We use k-fold cross-validation where the sample size of the test data is between 16 and 20 in the two examples presented. For evaluating the PPS under bootstrap MA, the weight pr(M_j | D_build) is replaced by the bootstrap inclusion frequency h(M_j) for model j. The higher the PPS, the better the predictive performance: if observations in D_test yield high ordinates in the predictive density derived from the model selected and estimated from D_build, the model is also plausible for D_test.

Brier score: If the quantity of interest is the predicted survivor function at a given time t*, which is a function of the linear predictor and the baseline hazard function, a criterion evaluating the predicted survivor function directly is possibly most appropriate. The Brier score (Graf et al., 1999) is such a criterion. The prediction of the survivor function involves estimation of the prognostic index and of the integrated baseline hazard ∫₀ᵗ λ_0(s) ds in equation (1). The Brier score is the expected prediction error at a given time point t*:

$$BS(t^*) = E\left[ \left( I(T > t^*) - \hat{S}(t^* \mid X) \right)^2 \right],$$

where I(·) is an indicator function. It is estimated as

$$\widehat{BS}(t^*) = \frac{1}{n} \sum_{i=1}^{n} v_i \left( I(t_i > t^*) - \hat{S}(t^* \mid x_i) \right)^2,$$

where the weights v_i are equal to one if no censoring is present at all. With censoring, weights for observations with an event are equal to the inverse of the Kaplan-Meier estimate of the censoring distribution at the time of the event; individuals without an event up to t* are weighted by the inverse of the Kaplan-Meier estimate of the censoring distribution at time t*. Observations censored before time t* do not contribute to the Brier score, i.e. v_i = 0. The lower the Brier score, the better the predictive performance. With model averaging, the survival probability S(t* | X) is a weighted average of the estimated survivor functions under the different models. In the situation where one data set D_build is used for the model building process and another data set D_test is used for validation, the parameters β and the baseline hazard λ_0 are estimated from D_build. The Brier score from the pooled Kaplan-Meier estimate of the survivor function is a benchmark value for a prediction that takes no explanatory information into account. For measuring the gain from explanatory information in the model we use the percentage reduction in Brier score compared to a prediction from the Kaplan-Meier estimate (Graf et al., 1999):

$$R^2 = 1 - \widehat{BS}(t^*)_{model} \, / \, \widehat{BS}(t^*)_{K\text{-}M}.$$

R² is thus analogous to the explained variation commonly quoted for general linear models. A naive prediction with constant survival probability of 0.5 yields a BS of 0.25, useful as a worst-case benchmark.

5.2 Cross-validation

We use K-fold cross-validation (Davison and Hinkley, 1997) to estimate the predictive criteria. The data are split into K disjoint sets of similar size. A rule of thumb is to use either K = n or, in the case of large data sets, K = 10, since choosing K = n becomes very computer intensive (Davison and Hinkley, 1997). These K sets define the test set D_test for assessing the predictive performance and the build set D_build used to fit the model. In turn, one of the K sets is the test set and the remaining data form the build set. This procedure ensures one prediction per observation. For survival data it is sensible to stratify the observations in the K sets by event and censored cases. Then the following cross-validation procedure is carried out:

1. Perform the model building procedure (Kaplan-Meier, full model, backward elimination (BE_BIC), bootstrap MA or approx. BMA) on the build data to obtain:

   (a) under the full model and BE_BIC: the estimates β̂^build and λ̂_0^build, where β̂^build is a vector of estimated regression coefficients for the selected variables (coefficients of variables which are not selected are set to zero) and λ̂_0^build are the baseline hazard estimates;

   (b) under bootstrap MA: the estimates β̂_1^build, ..., β̂_J^build and λ̂_01^build, ..., λ̂_0J^build referring to the 1 to J selected models; under approx. BMA: the estimates β̂_1^build, ..., β̂_L^build and λ̂_01^build, ..., λ̂_0L^build referring to the 1 to L selected models.

2. Using the parameters estimated on the build data, predict the quantity of interest for the test data:

   (a) full model and BE_BIC: Ŝ(t | X_test) or the linear predictor θ̂ = X_test β̂^build for the test data, with covariate matrix X_test;

   (b) bootstrap MA: Ŝ_1(t | X_test), ..., Ŝ_J(t | X_test) or the linear predictors θ̂_1, ..., θ̂_J for the test data are averaged using equation (4); approx. BMA: Ŝ_1(t | X_test), ..., Ŝ_L(t | X_test) or θ̂_1, ..., θ̂_L are averaged using equation (3).

3. Calculate the assessment criteria, e.g. the Brier score or the PPS.

4. Repeat steps 1-3 for each of the K data splits and sum the criteria obtained from the 1 to K data splits.

When D_test is too small, say below 10 observations, the censoring distribution for the weights in the Brier score is calculated using D_build. However, in our examples the sample sizes of D_test are around 20, and hence this is not a problem.
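A sketch of the Brier score with the censoring weights described above (Graf et al., 1999). The Kaplan-Meier estimate Ĝ of the censoring distribution is obtained by reversing the event indicator; for simplicity the sketch evaluates Ĝ at t_i rather than at t_i−, and does not guard against Ĝ = 0.

```python
import numpy as np
from lifelines import KaplanMeierFitter

def brier_score(t_star, times, events, surv_at_t_star):
    """Brier score at t* with inverse-probability-of-censoring weights.
    `surv_at_t_star` holds each patient's predicted S(t* | x_i)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    # Kaplan-Meier estimate G(t) of the censoring distribution
    km_cens = KaplanMeierFitter()
    km_cens.fit(times, event_observed=1 - events)
    G = lambda t: float(km_cens.predict(t))

    total = 0.0
    for t_i, d_i, s_i in zip(times, events, surv_at_t_star):
        if t_i <= t_star and d_i == 1:
            total += (0.0 - s_i) ** 2 / G(t_i)       # event before t*
        elif t_i > t_star:
            total += (1.0 - s_i) ** 2 / G(t_star)    # still under observation
        # censored before t*: v_i = 0, no contribution
    return total / len(times)
```

The R² of Graf et al. then follows as 1 − BS_model / BS_K-M, with the pooled Kaplan-Meier curve itself supplying the benchmark predictions.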

6 Examples

6.1 Glioma

We applied our bootstrap MA approach to the glioma data as described above, using 500 bootstrap resamples in all steps involved. In the first step of bootstrap MA the inclusion frequencies h(x_i) range between 4% and 100% (Table 2 and Figure 1). The variables mali1, age and surg1 have the highest inclusion frequencies, all above 75%. The variable Karn2 has the lowest inclusion frequency of 4%. In the second step of the bootstrap MA procedure only variables with h(x_i) > 30% are included, and the bootstrap procedure is applied again to the remaining subset with mali1, mali2, age, Karn1, surg1, epil and cort.

Figure 1: The bootstrap inclusion frequencies versus p-values (log) from the full model for the glioma data.

In total, 27 different models were selected out of the 2^7 possible models; the top ten are shown with model inclusion frequencies h(M_j) in Table 2. The model including the variables mali1, age, Karn1 and surg1 has the highest inclusion frequency (17%). Repeating the bootstrap MA procedure with different selection levels h(x_i) > 40%, 20%, 10% and 0% in the first screening step always yields the same model with the highest h(M_j), but, of course, the distributions of the respective h(M_j) differ. The order of the next most important models also differs (Table 2). For the relatively high selection level of 40% in the screening step, only 7 of the 16 (2^4) possible models were selected, and in 99% of the replicates one of the 4 top models was selected. For selection levels 20%, 10% and 0% the resulting model sets are much larger: there were 107, 184 and 226 models in the respective sets. This reduced the h(M_j) of the strongest models.

When applying approx. BMA to the glioma data, using model priors pr(M_j) which give each model equal weight, we obtain fairly similar answers. As in the example of Volinsky et al. (1997), all models with a posterior probability greater than 1/C times the highest, with C = 20, were included. In total, 41 models were selected into the set. Table 3 shows the 10 models with the highest posterior model probabilities.

Figure 2: The bootstrap inclusion frequencies versus the posterior effect probabilities pr(β_l ≠ 0 | D) for the glioma data, marked with *. Posterior effect probabilities pr(β_l ≠ 0 | D) derived without the Occam's window model space reduction are also shown.

The model representing the highest part (16%) of the total posterior model probability is the same as was selected by BE_BIC and under bootstrap MA. The variables mali1, age and surg1 have the highest posterior effect probabilities pr(β_j ≠ 0 | D) of 100%. The variables age and surg1 also have a 100% inclusion frequency under bootstrap MA, but mali1 has an inclusion frequency of only 81%. Figure 2 shows the scatter plot of bootstrap inclusion frequencies from the bootstrap MA approach versus the posterior effect probabilities pr(β_j ≠ 0 | D). With BMA, variables with strong effects appear more important than with bootstrap MA, i.e. they obtain a higher posterior effect probability compared to the inclusion frequency; variables with weak effects appear more important with bootstrap MA compared to BMA. Some of these differences are due to the fact that the pr(β ≠ 0 | D) are derived after the model space has been reduced by Occam's window, whereas the h(x_i) are estimated before the model space reduction. The pr(β ≠ 0 | D) resulting from the BMA procedure without reducing the model space are in closer agreement with the h(x_i) for most variables (also shown in Figure 2).

We compare the predictive performance of the model averaging methods with the performance of predictions using the Kaplan-Meier estimate without covariates, the full Cox model and the single model chosen by BE_BIC.

Table 3: Glioma example: Results of approx. BMA. The second column shows the posterior effect probabilities pr(β_l ≠ 0 | D). In the right half, the models with the highest pr(M_j | D) are shown.

explan.      pr(β_l ≠ 0 | D)   10 models with the
variables    (%)               highest pr(M_j | D) out of 41
sex                            x  -  x
anam
mali1        100               x  x  x  x  x  x  x  x  x  x
mali2                          x
age          100               x  x  x  x  x  x  x  x  x  x
Karn1        82                x  x  x  x  -  x  -  x  x  x
Karn2
surg1        100               x  x  x  x  x  x  x  x  x  x
surg2                          x
conv
cort                           x
epil         32                -  x  x  -  -  x
amne
psyc                           x  -
apha

pr(M_j | D) (%)

Table 4: Glioma example 20-fold cross-validation results: Partial predictive log-score (PPS) and Brier score estimated using 20-fold cross-validation.

                            methods used for prediction
criterion          Kaplan-Meier   bootstrap MA   approx. BMA   BE_BIC   full model
PPS                n.a.
BS(t* = 310)
R² (%)
BS(t* = 730)
R² (%)

We predict the probability of survival at a relatively early time point after diagnosis, at 310 days, and after two years, at 730 days. These time points were chosen by looking at the overall Kaplan-Meier curve: at day 310 the estimated survival probability is about 0.5; day 730 is after the median follow-up time of 712 days, where the estimated survival probability is about 0.2. We estimate the Brier score using 20-fold cross-validation. The results shown in Table 4 do not indicate any major differences. There is some benefit in introducing explanatory information: the percentage reduction of the Brier score (R²) is around 15% for the four model-based predictions at day 310, and around 28% at day 730. The full model yields the highest R² at day 310 and the lowest at day 730. Differences in predictions of the prognostic indices with the four model-based approaches are also small; the estimated partial predictive scores (PPS) from the 20-fold cross-validation are all similar (Table 4), with bootstrap MA yielding the lowest, i.e. worst, and the full model the highest, i.e. best, PPS.

6.2 Primary Biliary Cirrhosis

In the PBC example our pre-model selection decisions are as follows. We apply BE_BIC to bootstrap samples in the bootstrap MA approach; for the 123 events this corresponds to the BIC with a penalty of log(123). The results given in Table 5 show that the model selection uncertainty in this example is much higher than in the glioma example: 65 different models were selected in the second step of bootstrap MA, whose inclusion frequencies are all below 12%. Table 6 shows the corresponding figures for approx. BMA, where only 32 different models were selected. The model with the highest posterior probability (18%) is the same as for the bootstrap MA method. In Figure 3 we see that the relation between the bootstrap inclusion frequencies and the posterior effect probabilities is similar to that in the glioma example. The variables l.albu, age, prothr, l.urcop and edema, with high posterior probabilities, appear to be more important under approx. BMA than under two-step bootstrap MA; variables with low posterior probabilities appear more important under two-step bootstrap MA.

We compare the predictive performance of our averaging approaches with the predictions using the Kaplan-Meier estimate, the full Cox model, and the selected model from BE_BIC (Table 7). The Brier score is estimated at time points t* = 1000 and 3000 days (approximately 3 and 8 years), with estimated survival probabilities of about 0.8 and 0.6, respectively, from the Kaplan-Meier method. At t* = 1000 the percentage reduction of the Brier score (R²) is around 35% for the four model-based predictions; the full model yields the highest R² = 36% and there is no difference between bootstrap MA and approx. BMA. At t* = 3000 the bootstrap MA procedure leads to the highest R² = 31%, while BE_BIC gives only 28%.

Figure 4 shows the distribution of bootstrap percentile confidence interval widths for bootstrap MA and BE_BIC, as proportions of the full model's interval width. The differences between the models are larger for the confidence intervals than for the Brier score or R² criterion. On average, the full model yields the widest confidence interval, BE_BIC the narrowest, and bootstrap MA is in between. The intervals from BE_BIC are shorter than the intervals of the full model for all observations, with a median width of 0.61 of the width of the intervals from the full model. The intervals from bootstrap MA have a median width which is 0.88 of the width of the intervals from the full model. In contrast to the full model, bootstrap MA and BE_BIC do not require information from all variables.

Figure 3: The bootstrap inclusion frequencies versus the posterior effect probabilities pr(β_l ≠ 0 | D) for the PBC data.

Figure 4: PBC example. Distribution of the proportion of the width of confidence intervals of linear predictors for all observations. Proportions relative to the full model are shown for bootstrap MA and BE_BIC interval widths. The dotted line at a proportion of one shows where the confidence intervals of the compared methods have the same width.

Table 5: PBC example: Results of the bootstrap MA approach. The inclusion frequencies h(x_i) are based on the models selected in 500 bootstrap samples with BE_BIC. The 65 models are the result of step two, based on the 7 explanatory variables with h(x_i) > 30% selected in step one; 10 of these models are shown, with the corresponding h(M_j) resulting from different selection levels in step one. One of the models shown was selected by Fleming and Harrington (1991); one was selected using backward elimination with the BIC criterion (BE_BIC).

explan.                              step one:   step two: 10 models with the
variables                            h(x_i)%     highest h(M_j)% out of 65
age                                  86          x  x  x  x  x  x  x  x  x  x
sex
hepato: hepatomegaly
spiders
edema                                62          x  -  -  x  x  x  x  -  x  x
l.sbili: log(bilirubin)              100         x  x  x  x  x  x  x  x  x  x
l.albu: log(albumin)                 85          x  x  x  x  x  x  x  x  x  x
l.urcop: log(urine copper)           60          x  x  x  x  x  x
l.alkal: log(alkaline phosphatase)
l.sgot: log(SGOT)                                x  x
platel: platelets
prothr: prothrombin time             66          x  x  x  -  x  x  -  x  -  x
histo: histologic stage                          x  -  -  x  x  -  -

h(M_j) %

Table 6: PBC example: Results of approx. BMA. The second column shows the posterior effect probabilities pr(β_l ≠ 0 | D). The 10 models with the highest posterior probability out of 32 are shown.

explan.      pr(β_l ≠ 0 | D)   10 models with the
variables    (%)               highest pr(M_j | D) out of 32
age          100               x  x  x  x  x  x  x  x  x  x
sex
hepato
spiders
edema        87                x  x  x  x  -  x  x  x  x  x
l.sbili      100               x  x  x  x  x  x  x  x  x  x
l.albu       100               x  x  x  x  x  x  x  x  x  x
l.urcop      73                x  x  x  x  x  -  x  -  x  -
l.alkal
l.sgot                         x  x  x  x
platel
prothr       75                x  x  -  -  x  x  x  x  x  x
histo        40                -  x  x  x  x  -

pr(M_j | D) (%)

Table 7: PBC example: Partial predictive log-score (PPS) and Brier score at day 1000 and 3000, estimated using 20-fold cross-validation.

                            methods used for prediction
criterion          Kaplan-Meier   bootstrap MA   approx. BMA   BE_BIC   full model
PPS                n.a.
BS(t* = 1000)
R² (%)
BS(t* = 3000)
R² (%)

7 Discussion

In statistical analysis it has been common practice to base inference on a conditional model, a model given a priori. This ignores the uncertainty of model predictions, estimates of effects and variances caused by model selection; Breiman (1992) calls this a "quiet scandal". A possible solution to the problem may be to use a weighted average of the models when making predictions and, similarly, to base variance estimates on several models rather than on one conditional model. Most of this methodology has been developed in a Bayesian framework and was put into practice in the last decade (Chatfield, 1995; Draper, 1995; Hoeting et al., 1999). In this paper we use Bayesian model averaging (approx. BMA) and suggest a bootstrap approach (bootstrap MA) for model averaging which has its roots in the papers by Sauerbrei and Schumacher (1992) and Buckland et al. (1997). We present survival analyses from two studies where we concentrate on variable selection in the Cox model. We assume that the effects are linear and constant over time, that interactions between covariates do not exist, and that all assumptions implied by the Cox model hold. Obviously, this is still a severe restriction of the parameter space considered. However, we see no way to escape from this problem of many simplifying assumptions. In a prototype linear regression analysis tool more model selection aspects have been incorporated to consider the cost of data analysis (Faraway, 1992), but we are not aware of any substantial improvement from this approach.

In our bootstrap MA procedure we start with a screening step aiming to eliminate variables which have no or at most a weak effect. The cost may be to lose a variable which might improve a predictor slightly; however, this step reduces the computational burden substantially and, more importantly, it gives a clearer impression of the models most supported by the data. This helps with the interpretation and may thereby increase the practical usefulness of the results, e.g. by making the transportation of the model to other medical centres easier. The screening step of our approach can reduce the number of variables used in the predictor substantially; e.g. in the glioma example the predictor is based on only 7 out of 15 variables. For a more parsimonious model, even fewer variables could be retained by increasing the cut-off value for the variable selection frequency h(x_i) above which variables are included in the subset after the first screening step.

Based on the cross-validated log-likelihood we may conclude in the glioma example that the full model and the model with 9 variables selected by BE(0.157) may be overparameterised. Hence they are less useful for describing data of a similar study collected under slightly different conditions, e.g. at a different time point, or data from different clinics under the same protocol. In addition, it is questionable whether all factors from the full model will ever be collected in a similar study. Therefore, a model requiring all variables lacks practical usefulness. The Occam's window used in Bayesian model averaging excludes models not supported by the data, but the remaining models may still require measuring all the available variables, although some of them may have a weight close to zero. Hence, if model parsimony is important in aiding interpretability, reproducibility and transportability of a model (Sauerbrei, 1999; Altman and Royston, 2000; Burnham and Anderson, 2001), the bootstrap MA approach has an advantage over the BMA method.

In both model averaging approaches, more or less arbitrary decisions on settings for the model space reduction are made. For bootstrap MA, this is the cut-off value of h(x_i) above which variables are included in the subset after the first screening step. In approx. BMA it is the setting of C in the Occam's window: only models with a posterior probability greater than 1/C times the highest are included. In the glioma example we have shown how the number of selected models changes depending on the cut-off value of bootstrap MA. In bootstrap MA, results may also be sensitive to the choice of the selection level applied in the backward elimination procedure. We have used the BIC criterion to make our procedure comparable to the approx. BMA procedure. To what extent results are sensitive to these settings is currently under investigation in a large simulation study.

In both of our examples there is a potential gain from using the proposed techniques. For the glioma example, predictions from both model averaging methods were very similar and do not differ substantially from predictions of conventional methods (BE_BIC, full model). In studies with a larger number of potential predictors it is more likely that many of them have no influence in a multivariable context, and differences between the approaches are then more likely. In the glioma example the similarity of results is not surprising, as the actual reduction of the Brier score for all model-based predictions is relatively small, with a maximum R² of 17% at day 310 and 30% at day 730. The sample size in relation


More information

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection Model Selection in GLMs Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!) Today: standard frequentist

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Group Sequential Tests for Delayed Responses. Christopher Jennison. Lisa Hampson. Workshop on Special Topics on Sequential Methodology

Group Sequential Tests for Delayed Responses. Christopher Jennison. Lisa Hampson. Workshop on Special Topics on Sequential Methodology Group Sequential Tests for Delayed Responses Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Lisa Hampson Department of Mathematics and Statistics,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples. Harald Binder & Martin Schumacher

Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples. Harald Binder & Martin Schumacher Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples Harald Binder & Martin Schumacher Universität Freiburg i. Br. Nr. 100 December 2007 Zentrum für

More information

Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models

Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/25 Right censored

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The

More information

Building a Prognostic Biomarker

Building a Prognostic Biomarker Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44 Prognostic Biomarker for a Continuous Measure On each of n patients measure y i - single continuous outcome (eg. blood pressure,

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Package CoxRidge. February 27, 2015

Package CoxRidge. February 27, 2015 Type Package Title Cox Models with Dynamic Ridge Penalties Version 0.9.2 Date 2015-02-12 Package CoxRidge February 27, 2015 Author Aris Perperoglou Maintainer Aris Perperoglou

More information

Part III Measures of Classification Accuracy for the Prediction of Survival Times

Part III Measures of Classification Accuracy for the Prediction of Survival Times Part III Measures of Classification Accuracy for the Prediction of Survival Times Patrick J Heagerty PhD Department of Biostatistics University of Washington 102 ISCB 2010 Session Three Outline Examples

More information

Dose-response modeling with bivariate binary data under model uncertainty

Dose-response modeling with bivariate binary data under model uncertainty Dose-response modeling with bivariate binary data under model uncertainty Bernhard Klingenberg 1 1 Department of Mathematics and Statistics, Williams College, Williamstown, MA, 01267 and Institute of Statistics,

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Model Combining in Factorial Data Analysis

Model Combining in Factorial Data Analysis Model Combining in Factorial Data Analysis Lihua Chen Department of Mathematics, The University of Toledo Panayotis Giannakouros Department of Economics, University of Missouri Kansas City Yuhong Yang

More information

Proportional hazards regression

Proportional hazards regression Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Nicholas C. Henderson Thomas A. Louis Gary Rosner Ravi Varadhan Johns Hopkins University September 28,

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Multivariable Fractional Polynomials

Multivariable Fractional Polynomials Multivariable Fractional Polynomials Axel Benner September 7, 2015 Contents 1 Introduction 1 2 Inventory of functions 1 3 Usage in R 2 3.1 Model selection........................................ 3 4 Example

More information

Multi-state Models: An Overview

Multi-state Models: An Overview Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed

More information

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Guosheng Yin Department of Statistics and Actuarial Science The University of Hong Kong Joint work with J. Xu PSI and RSS Journal

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Multivariable Fractional Polynomials

Multivariable Fractional Polynomials Multivariable Fractional Polynomials Axel Benner May 17, 2007 Contents 1 Introduction 1 2 Inventory of functions 1 3 Usage in R 2 3.1 Model selection........................................ 3 4 Example

More information

Chapter 4. Parametric Approach. 4.1 Introduction

Chapter 4. Parametric Approach. 4.1 Introduction Chapter 4 Parametric Approach 4.1 Introduction The missing data problem is already a classical problem that has not been yet solved satisfactorily. This problem includes those situations where the dependent

More information

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P. Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk

More information

Group Sequential Designs: Theory, Computation and Optimisation

Group Sequential Designs: Theory, Computation and Optimisation Group Sequential Designs: Theory, Computation and Optimisation Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj 8th International Conference

More information

Finite Population Correction Methods

Finite Population Correction Methods Finite Population Correction Methods Moses Obiri May 5, 2017 Contents 1 Introduction 1 2 Normal-based Confidence Interval 2 3 Bootstrap Confidence Interval 3 4 Finite Population Bootstrap Sampling 5 4.1

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

Statistics 262: Intermediate Biostatistics Model selection

Statistics 262: Intermediate Biostatistics Model selection Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.

More information

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes: Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation

More information

Extensions of Cox Model for Non-Proportional Hazards Purpose

Extensions of Cox Model for Non-Proportional Hazards Purpose PhUSE 2013 Paper SP07 Extensions of Cox Model for Non-Proportional Hazards Purpose Jadwiga Borucka, PAREXEL, Warsaw, Poland ABSTRACT Cox proportional hazard model is one of the most common methods used

More information

Univariate shrinkage in the Cox model for high dimensional data

Univariate shrinkage in the Cox model for high dimensional data Univariate shrinkage in the Cox model for high dimensional data Robert Tibshirani January 6, 2009 Abstract We propose a method for prediction in Cox s proportional model, when the number of features (regressors)

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Tied survival times; estimation of survival probabilities

Tied survival times; estimation of survival probabilities Tied survival times; estimation of survival probabilities Patrick Breheny November 5 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/22 Introduction Tied survival times Introduction Breslow approximation

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks Y. Xu, D. Scharfstein, P. Mueller, M. Daniels Johns Hopkins, Johns Hopkins, UT-Austin, UF JSM 2018, Vancouver 1 What are semi-competing

More information

Definitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen

Definitions and examples Simple estimation and testing Regression models Goodness of fit for the Cox model. Recap of Part 1. Per Kragh Andersen Recap of Part 1 Per Kragh Andersen Section of Biostatistics, University of Copenhagen DSBS Course Survival Analysis in Clinical Trials January 2018 1 / 65 Overview Definitions and examples Simple estimation

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Model Selection in Cox regression

Model Selection in Cox regression Model Selection in Cox regression Suppose we have a possibly censored survival outcome that we want to model as a function of a (possibly large) set of covariates. How do we decide which covariates to

More information

Lecture 6 PREDICTING SURVIVAL UNDER THE PH MODEL

Lecture 6 PREDICTING SURVIVAL UNDER THE PH MODEL Lecture 6 PREDICTING SURVIVAL UNDER THE PH MODEL The Cox PH model: λ(t Z) = λ 0 (t) exp(β Z). How do we estimate the survival probability, S z (t) = S(t Z) = P (T > t Z), for an individual with covariates

More information

Analysis of Fast Input Selection: Application in Time Series Prediction

Analysis of Fast Input Selection: Application in Time Series Prediction Analysis of Fast Input Selection: Application in Time Series Prediction Jarkko Tikka, Amaury Lendasse, and Jaakko Hollmén Helsinki University of Technology, Laboratory of Computer and Information Science,

More information

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models Supporting Information for Estimating restricted mean treatment effects with stacked survival models Andrew Wey, David Vock, John Connett, and Kyle Rudser Section 1 presents several extensions to the simulation

More information

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News: Today HW 1: due February 4, 11.59 pm. Aspects of Design CD Chapter 2 Continue with Chapter 2 of ELM In the News: STA 2201: Applied Statistics II January 14, 2015 1/35 Recap: data on proportions data: y

More information

Constrained estimation for binary and survival data

Constrained estimation for binary and survival data Constrained estimation for binary and survival data Jeremy M. G. Taylor Yong Seok Park John D. Kalbfleisch Biostatistics, University of Michigan May, 2010 () Constrained estimation May, 2010 1 / 43 Outline

More information

Müller: Goodness-of-fit criteria for survival data

Müller: Goodness-of-fit criteria for survival data Müller: Goodness-of-fit criteria for survival data Sonderforschungsbereich 386, Paper 382 (2004) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Goodness of fit criteria for survival data

More information

Statistical Methods for Alzheimer s Disease Studies

Statistical Methods for Alzheimer s Disease Studies Statistical Methods for Alzheimer s Disease Studies Rebecca A. Betensky, Ph.D. Department of Biostatistics, Harvard T.H. Chan School of Public Health July 19, 2016 1/37 OUTLINE 1 Statistical collaborations

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Random Forests for Ordinal Response Data: Prediction and Variable Selection

Random Forests for Ordinal Response Data: Prediction and Variable Selection Silke Janitza, Gerhard Tutz, Anne-Laure Boulesteix Random Forests for Ordinal Response Data: Prediction and Variable Selection Technical Report Number 174, 2014 Department of Statistics University of Munich

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Lecture 10. Diagnostics. Statistics Survival Analysis. Presented March 1, 2016

Lecture 10. Diagnostics. Statistics Survival Analysis. Presented March 1, 2016 Statistics 255 - Survival Analysis Presented March 1, 2016 Dan Gillen Department of Statistics University of California, Irvine 10.1 Are model assumptions correct? Is the proportional hazards assumption

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Estimation in the l 1 -Regularized Accelerated Failure Time Model

Estimation in the l 1 -Regularized Accelerated Failure Time Model Estimation in the l 1 -Regularized Accelerated Failure Time Model by Brent Johnson, PhD Technical Report 08-01 May 2008 Department of Biostatistics Rollins School of Public Health 1518 Clifton Road, N.E.

More information

Chapter 4 Regression Models

Chapter 4 Regression Models 23.August 2010 Chapter 4 Regression Models The target variable T denotes failure time We let x = (x (1),..., x (m) ) represent a vector of available covariates. Also called regression variables, regressors,

More information

β j = coefficient of x j in the model; β = ( β1, β2,

β j = coefficient of x j in the model; β = ( β1, β2, Regression Modeling of Survival Time Data Why regression models? Groups similar except for the treatment under study use the nonparametric methods discussed earlier. Groups differ in variables (covariates)

More information

Grouped variable selection in high dimensional partially linear additive Cox model

Grouped variable selection in high dimensional partially linear additive Cox model University of Iowa Iowa Research Online Theses and Dissertations Fall 2010 Grouped variable selection in high dimensional partially linear additive Cox model Li Liu University of Iowa Copyright 2010 Li

More information

Robustifying Trial-Derived Treatment Rules to a Target Population

Robustifying Trial-Derived Treatment Rules to a Target Population 1/ 39 Robustifying Trial-Derived Treatment Rules to a Target Population Yingqi Zhao Public Health Sciences Division Fred Hutchinson Cancer Research Center Workshop on Perspectives and Analysis for Personalized

More information

Harvard University. Harvard University Biostatistics Working Paper Series

Harvard University. Harvard University Biostatistics Working Paper Series Harvard University Harvard University Biostatistics Working Paper Series Year 2008 Paper 94 The Highest Confidence Density Region and Its Usage for Inferences about the Survival Function with Censored

More information

Model selection for good estimation or prediction over a user-specified covariate distribution

Model selection for good estimation or prediction over a user-specified covariate distribution Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2010 Model selection for good estimation or prediction over a user-specified covariate distribution Adam Lee

More information

Comparison of Predictive Accuracy of Neural Network Methods and Cox Regression for Censored Survival Data

Comparison of Predictive Accuracy of Neural Network Methods and Cox Regression for Censored Survival Data Comparison of Predictive Accuracy of Neural Network Methods and Cox Regression for Censored Survival Data Stanley Azen Ph.D. 1, Annie Xiang Ph.D. 1, Pablo Lapuerta, M.D. 1, Alex Ryutov MS 2, Jonathan Buckley

More information

Multiple imputation in Cox regression when there are time-varying effects of covariates

Multiple imputation in Cox regression when there are time-varying effects of covariates Received: 7 January 26 Revised: 4 May 28 Accepted: 7 May 28 DOI:.2/sim.7842 RESEARCH ARTICLE Multiple imputation in Cox regression when there are time-varying effects of covariates Ruth H. Keogh Tim P.

More information

Problems in model averaging with dummy variables

Problems in model averaging with dummy variables Problems in model averaging with dummy variables David F. Hendry and J. James Reade Economics Department, Oxford University Model Evaluation in Macroeconomics Workshop, University of Oslo 6th May 2005

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

A Regression Model For Recurrent Events With Distribution Free Correlation Structure

A Regression Model For Recurrent Events With Distribution Free Correlation Structure A Regression Model For Recurrent Events With Distribution Free Correlation Structure J. Pénichoux(1), A. Latouche(2), T. Moreau(1) (1) INSERM U780 (2) Université de Versailles, EA2506 ISCB - 2009 - Prague

More information

Faculty of Health Sciences. Cox regression. Torben Martinussen. Department of Biostatistics University of Copenhagen. 20. september 2012 Slide 1/51

Faculty of Health Sciences. Cox regression. Torben Martinussen. Department of Biostatistics University of Copenhagen. 20. september 2012 Slide 1/51 Faculty of Health Sciences Cox regression Torben Martinussen Department of Biostatistics University of Copenhagen 2. september 212 Slide 1/51 Survival analysis Standard setup for right-censored survival

More information