Power analysis for multivariable Cox regression models


Received: 21 August 2017 | Revised: 31 July 2018 | Accepted: 21 August 2018 | DOI: 10.1002/sim.7964

RESEARCH ARTICLE

Power analysis for multivariable Cox regression models

Emil Scosyrev 1, Ekkehard Glimm 2

1 Quantitative Safety and Epidemiology, Novartis Pharmaceuticals Corporation, East Hanover, New Jersey; 2 Novartis, Clinical Information Sciences, Basel, Switzerland

Correspondence: Emil Scosyrev, Quantitative Safety and Epidemiology, Novartis Pharmaceuticals Corporation, One Health Plaza, East Hanover, NJ; emil.scosyrev@novartis.com

In power analysis for multivariable Cox regression models, the variance of the estimated log-hazard ratio for the treatment effect is usually approximated by inverting the expected null information matrix. Because, in many typical power analysis settings, the assumed true values of the hazard ratios are not necessarily close to unity, the accuracy of this approximation is not theoretically guaranteed. To address this problem, the null variance expression in power calculations can be replaced with one of the alternative expressions derived under the assumed true value of the hazard ratio for the treatment effect. This approach is explored analytically and by simulations in the present paper. We consider several alternative variance expressions and compare their performance to that of the traditional null variance expression. Theoretical analysis and simulations demonstrate that, whereas the null variance expression performs well in many nonnull settings, it can also be very inaccurate, substantially underestimating or overestimating the true variance in a wide range of realistic scenarios, particularly those where the numbers of treated and control subjects are very different and the true hazard ratio is not close to one. The alternative variance expressions have much better theoretical properties, confirmed in simulations. The most accurate of these expressions has a relatively simple form.
It is the sum of inverse expected event counts under treatment and under control, scaled up by a variance inflation factor.

KEYWORDS: Cox regression, power analysis, sample size

John Wiley & Sons, Ltd. wileyonlinelibrary.com/journal/sim  Statistics in Medicine. 2019;38:88-99.

1 INTRODUCTION

Inference about treatment effects in observational (nonrandomized) studies in medicine is often based on Cox regression analysis with adjustment for multiple continuous and categorical covariates. Power analysis at the design stage of these studies must account for the effect of covariate adjustment on the variance of estimated log-hazard ratios. Therefore, power calculations based on simple unadjusted tests, such as the log-rank test, can be inaccurate in these circumstances. Hsieh and Lavori 1 proposed to approximate the variance of the estimated log-hazard ratio for the treatment effect in a multivariable Cox model by first approximating the variance of the estimated log-hazard ratio in the crude (treatment-only) model and then multiplying the crude variance by 1/(1 − R²), where R² is the multiple coefficient of determination from a linear least-squares regression of the treatment indicator variable on the other covariates. 1 The value of R² represents the anticipated strength of the association of the covariate vector with treatment. 1 Once the variance of the estimated covariate-adjusted log-hazard ratio is specified, it can be used for calculation of power for superiority and noninferiority testing. This method

has been implemented in many commonly used software packages, including SAS (proc power), Stata (stpower cox), R (powerSurvEpi), and PASS (Cox Regression module). Despite its popularity, this approach has a potential limitation arising from the fact that the variance expression for the estimated crude log-hazard ratio (ie, the quantity scaled up based on the value of R²) was originally derived under the null hypothesis, that is, assuming that treatment does not affect the study event rate. 1-6 While this assumption may hold in certain noninferiority analysis settings, in general, power is defined under the assumed true value of the hazard ratio for the treatment effect, which need not be close to unity. The potential limitation of using the null variance for power analysis has been previously recognized. 3-6 In particular, it was shown by an analytical argument and simulations that the null variance expression can be very inaccurate when unequal numbers of subjects are allocated to the comparison groups (treatment versus control), despite the fact that this expression contains the treatment allocation ratio. 4,5 To address this problem, the null variance expression in power calculations can be replaced with one of the alternative variance expressions derived under the assumed true value of the hazard ratio for the treatment effect. 3,4 This approach is explored analytically and by simulations in the present paper for a general case of Cox regression modeling with a binary treatment variable and (possibly) other covariates, which may be associated with the treatment variable (as in observational studies) or be independent of it (as in randomized trials). We consider several different expressions approximating the true variance of the estimated log-hazard ratio for the treatment effect and compare their performance to that of the traditional null variance expression. The rest of this paper is organized as follows.
Mathematical theory is reviewed in Section 2, simulation studies are reported in Section 3, and a real data example is considered in Section 4. Section 5 contains the discussion of the main points of the paper.

2 MATHEMATICAL THEORY

Let G_i be an indicator variable, taking the value of 1 if the ith subject (i = 1, …, n) experiences the event of interest, and the value of 0 otherwise. Let T_i denote the subject's at-risk survival time, that is, time until the subject experiences the study event or a competing event. Define the overall survival function as S_i(t) = Pr(T_i > t). The cause-specific cumulative incidence function and the cause-specific hazard of the study event are given, respectively, by 7-9

F_i^s(t) = Pr(T_i ≤ t, G_i = 1),
h_i^s(t) = lim_{Δt→0} Pr(t ≤ T_i < t + Δt, G_i = 1 | T_i ≥ t) / Δt.

Let x_i1 be a binary treatment indicator, taking the value of 1 if the ith subject is treated and the value of 0 otherwise, and let {x_i2, …, x_iP} be a set of baseline covariates for the ith subject. In the Cox model, the cause-specific hazard function for the event of interest is related to treatment and other covariates as follows 8,10:

h_i^s(t | x_i1, …, x_iP) = h_0^s(t) exp(β_1 x_i1 + … + β_P x_iP).   (1)

Here, h_0^s(t) is some unspecified baseline hazard function for the study event of interest. We will assume that adjustment for {x_i2, …, x_iP} is sufficient for control of confounding, so that the hazard ratio representing the true treatment effect is exp(β_1). Regression coefficients in (1) are estimated by maximizing the log-partial likelihood function. 10 Throughout this paper, our interest is in the effect of treatment on cause-specific hazards, so that competing events can be represented with a censoring indicator (ie, subjects experiencing the competing event are censored at the time of event occurrence). 8,9,11 A theoretical justification for this commonly used approach was first provided by Prentice et al, 8 based on a likelihood factorization argument.
An alternative strategy is to model the effect of covariates on the subdistribution hazard, as defined by Fine and Gray. 12 However, the Fine-Gray model is beyond the scope of this paper. Let β̂_1 denote the partial likelihood estimator of β_1, let V(β̂_1) denote its asymptotic variance, and let V̂(β̂_1) denote a consistent variance estimator. An asymptotic Wald-type (1 − α)100% confidence interval for β_1 is constructed as {LCL = β̂_1 − z_{1−α/2} V̂(β̂_1)^{1/2}, UCL = β̂_1 + z_{1−α/2} V̂(β̂_1)^{1/2}}, where LCL and UCL are the lower and upper confidence limits, respectively, whereas confidence limits for the hazard ratio are given by {exp(LCL), exp(UCL)}. The Wald-type test statistic is (β̂_1 − β_1^0) / V̂(β̂_1)^{1/2}, where β_1^0 is the log-hazard ratio under some null hypothesis of interest (eg, the traditional null of no treatment effect is β_1^0 = 0). Once the values of β_1 and V(β̂_1) are specified, the power for any superiority or noninferiority Wald-type hypothesis test based on the multivariable Cox model can be easily approximated using any program outputting percentiles of the

normal distribution. 4 For a two-sided test of the traditional null (β_1 = 0),

Power ≈ Φ{ |β_1| V(β̂_1)^{−1/2} − z_{1−α/2} }.   (2)

Before we consider power calculations for multivariable Cox models, we examine the case of Cox regression with a binary treatment indicator as the only covariate.

2.1 Cox regression with a binary treatment indicator only

Let n_T and n_C denote the number of treated and control subjects, respectively, with n = n_T + n_C; let p_T = n_T/n denote the treatment fraction; let p_C = 1 − p_T; and define the treatment allocation ratio r = n_T/n_C. Let d_T and d_C denote the observed event counts for the study event in the treatment and control groups, respectively; let d = d_T + d_C; and define the expected study event counts as D_T = E(d_T), D_C = E(d_C), and D = E(d). Let r_j denote the ratio of treated to control subjects at risk at the jth event time. According to partial-likelihood estimation theory, the variance of the estimated treatment effect can be obtained by taking the inverse of the expected information. 2,3,10 In the Cox model with treatment as the only covariate, the observed information function can be written as 3,13,14

I(β) = Σ_{j=1}^{d} r_j exp(β) / {r_j exp(β) + 1}².   (3)

Here, β is the crude log-hazard ratio, which we (for now) assume to be constant over time. Since treatment is the only factor, we have dropped the index on β_1. Because each r_j in (3) and their total number (d) are random variables, the expectation of (3) does not reduce to any simple closed-form expression. One approximate solution to this problem is to evaluate the expectation of (3) under the general null hypothesis, that is, assuming that treatment does not affect the study event or any of the competing events. 2,3 Under the general null (ie, β = 0), each r_j is approximately equal to the original treatment allocation ratio r. In this case, we can view d as the only random variable in (3).
Hence, the expected information is

E{I(β)} = E[ d r/(r + 1)² ] = D r/(r + 1)².   (4)

Taking the inverse of (4), we obtain the null variance expression 2,3,6

V_0(β̂) = (r + 1)²/(Dr) = 1/(D p_T p_C).   (5)

Substituting (5) into (2) and solving for D, we obtain the following well-known equation 1-6,14-20:

D = (z_{1−α/2} + z_power)² / (β² p_T p_C).   (6)

This formula was first proposed by George and Desu 20 for the exponential survival model under equal allocation (p_T = p_C) and was later shown by Schoenfeld 2 to extend to any correctly specified proportional hazards (PH) model. Note that the assumed true value of the log-hazard ratio (β) entered this equation as a probability limit of β̂, whereas the asymptotic variance of β̂ was defined by (5), with β set to zero. 2,3,6 To examine the effect of setting β = 0 when its true value in (3) is not close to zero, suppose that treatment does not affect the rate of censoring or any common competing events, and the study event is rare, so that each r_j is indeed approximately equal to the original treatment allocation ratio r. In this case, when evaluating the expectation of (3), we may substitute r for r_j as before, but now we also set β equal to its assumed true value instead of setting it to zero. The resulting variance expression is 3

V_1(β̂) = {r exp(β) + 1}² / {Dr exp(β)}.   (7)

This is (approximately) the true variance when the study event is rare and treatment does not affect the rate of any common competing events. The ratio of the null variance in (5) to the true rare-event variance in (7) is {(r + 1)² exp(β)} / {r exp(β) + 1}². From this ratio, we can show that the null variance expression (5) can substantially underestimate the true variance when the group with greater hazard receives more subjects. 4 For example, when exp(β) = 2 and r = 5, we have V_0(β̂)/V_1(β̂) = 0.6. When the group with greater hazard receives fewer subjects, the
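As a quick numeric check, Equations (2), (5), and (7) can be coded in a few lines of Python. This is our sketch, not the paper's code; the total event count D is an illustrative assumption:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v0(d, r):
    """Null variance, Equation (5): (r + 1)^2 / (D r) = 1 / (D p_T p_C)."""
    return (r + 1.0) ** 2 / (d * r)

def v1(d, r, beta):
    """Rare-event variance under the assumed true effect, Equation (7)."""
    return (r * math.exp(beta) + 1.0) ** 2 / (d * r * math.exp(beta))

def power_two_sided(beta, var, z=1.959964):
    """Equation (2): Phi(|beta| / sqrt(V) - z_{1-alpha/2})."""
    return norm_cdf(abs(beta) / math.sqrt(var) - z)

beta = math.log(2.0)   # assumed true HR = 2
d = 100.0              # expected total event count (illustrative)

# The ratio V0/V1 does not depend on D: (r+1)^2 exp(beta) / (r exp(beta) + 1)^2
print(round(v0(d, 5.0) / v1(d, 5.0, beta), 2))   # 0.6  (underestimation at r = 5)
print(round(v0(d, 0.2) / v1(d, 0.2, beta), 2))   # 1.47 (overestimation at r = 0.2)
print(round(power_two_sided(beta, v1(d, 5.0, beta)), 2))
```

The first two printed ratios reproduce the underestimation at r = 5 and the overestimation at r = 0.2 discussed in the text.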

null variance expression (5) can substantially overestimate the true variance. 4 For example, when exp(β) = 2 and r = 0.2, V_0(β̂)/V_1(β̂) ≈ 1.5. Unfortunately, V_1(β̂) is guaranteed to be close to the true variance only when the study event is rare and treatment does not affect the rates of any common competing events, because (7) assumes a risk set ratio that stays close to r. However, when the study event is common and/or treatment affects the rate of a common competing event, r approximates only the first risk set ratio, whereas, for subsequent events, r_j tends to deviate further from r with increasing event count. To address this problem, Scosyrev and Bancken 4 proposed to approximate r_j by the midpoint of the approximated first and last risk set ratios. If treatment does not affect the censoring rate, for the first event r_j ≈ r, and for the last event r_j ≈ r′ = (n_T − D_T − U_T)/(n_C − D_C − U_C), where U_T and U_C are the expected counts of competing events in the treatment and control groups, respectively. Taking the midpoint of r and r′, we can approximate r_j by r̄ = (r + r′)/2. Substituting r̄ for r_j in (3), evaluating the expectation, and taking its inverse, we obtain the following variance expression 4:

V_2(β̂) = {r̄ exp(β) + 1}² / {D r̄ exp(β)}.   (8)

Equation (8) is expected to have the same performance as (7) with rare events (r̄ ≈ r) but superior performance with common events, where the average distance from the r_j values to their midrange (r̄) is generally smaller than that to their lower or upper range. We now consider another (much simplified) variance expression that has been used in power calculations for the Cox model or the log-rank test under general PH alternatives 19,21:

V_3(β̂) = 1/D_T + 1/D_C.   (9)

This expression coincides with the asymptotic variance of the estimated log-hazard ratio obtained from maximum-likelihood estimation theory under the constant hazard assumption. 14
While inference for the Cox model is not based on the full likelihood, the model is highly efficient relative to the parametric exponential model and is, in fact, asymptotically fully efficient under very general conditions. 22 Thus, Equation (9) is expected to produce reasonably accurate results when the constant hazard assumption is not severely violated, whether the study event is common or rare. We also show in Appendix 1 that, with rare events, Equation (9) can be derived from the information matrix of the Cox model under an arbitrary shape of the baseline hazard. Expected event counts for Equations (5) and (7) to (9) can be estimated using various methods described in the literature. 2,4,16,19,23 For example, under constant hazards, the expected study event count for a given intervention group is given by 4,16,19,23,24

D_g = n_g (h_g^s / h_g) [1 − {exp(−h_g b_g) − exp(−h_g (a_g + b_g))} / (h_g a_g)].   (10)

Here, n_g is the sample size in group g, a_g > 0 is the length of accrual, b_g is the length of post-accrual follow-up, h_g^s is the study event rate, and h_g = h_g^s + h_g^c is the total event rate, where h_g^c includes the nonadministrative drop-out rate and the competing event rate.

2.2 Cox regression with a binary treatment indicator and additional covariates

The asymptotic variance of the estimated treatment effect in a multivariable Cox model can be obtained by inverting the expected information matrix, as explained in Appendix 1. It is seen from the expression for the observed information matrix (Equation A1.6) that the variance of the estimated treatment effect in a Cox model with multiple covariates depends on the magnitude of the treatment effect as well as the magnitudes of regression coefficients for other covariates included in the model, the joint distribution of all covariates (including the treatment indicator), and other parameters. Because Equation A1.6 is too complex to be used for power analysis directly, some simplifications were proposed in the literature.
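The expected event count formula in Equation (10) is straightforward to implement. The following stdlib-Python sketch is ours, and the group parameters (sample size, rates, accrual, and follow-up) are illustrative assumptions, not values from the paper:

```python
import math

def expected_events(n, h_s, h_c, accrual, followup):
    """
    Expected study-event count under constant hazards (Equation (10)):
    uniform accrual over [0, accrual], post-accrual follow-up of length
    `followup`; h_s is the study event rate, h_c the combined
    drop-out/competing-event rate, so the total rate is h = h_s + h_c.
    """
    h = h_s + h_c
    # Probability that the study event occurs before administrative censoring,
    # averaged over the uniform accrual distribution.
    p_event = (h_s / h) * (
        1.0 - (math.exp(-h * followup) - math.exp(-h * (accrual + followup))) / (h * accrual)
    )
    return n * p_event

# Illustrative group: 500 subjects, study event rate 0.10/yr, competing/drop-out
# rate 0.05/yr, 2 years of accrual, 3 further years of follow-up.
d1 = expected_events(500, 0.10, 0.05, 2.0, 3.0)
print(round(d1, 1))  # ~149.7
```

With infinite follow-up the count approaches n h_s/(h_s + h_c), the usual cause-specific fraction, which is a useful sanity check on the implementation.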
For the case of a binary treatment indicator with one other covariate, Schmoor et al 6 proposed to divide the crude null variance in (5) by one minus the squared Pearson correlation coefficient of the treatment indicator with the covariate. More generally, in the presence of an arbitrary number of covariates, Hsieh and Lavori 1 proposed to calculate power based on the null variance (5) inflated by the variance inflation factor (VIF) given by VIF_0 = 1/(1 − R²), where R² is the multiple coefficient of determination from the linear least-squares regression of the binary treatment indicator on the other covariates. We show in Appendix 1 that VIF_0 is the true VIF in the Cox model under the global null hypothesis for

the study event (β_1 = … = β_P = 0) and assuming that covariates do not affect the rates of any common competing events or censoring. Theoretically, VIF_0 may work reasonably well when the covariate effects are not too large. To accommodate arbitrarily large covariate effects, we derived a more general VIF expression in Appendix 1. Let w_i = h_i^s(t)/h_0^s(t) = exp(β_1 x_i1 + … + β_P x_iP), and let R_W² denote the multiple coefficient of determination from the weighted linear least-squares regression of the binary treatment indicator on the other covariates, with weights w_i. We show in Appendix 1 that, when the study event is not too common and treatment does not have a strong effect on the rate of censoring or any common competing events, the true VIF in the Cox model is approximately VIF_1 = 1/(1 − R_W²). This VIF is a generalization of VIF_0, reducing to VIF_0 under the global null. In general, the crude and the adjusted log-hazard ratios may have very different magnitudes due to confounding. In addition, if the PH assumption holds exactly given some set of covariates with nonzero regression coefficients, it cannot hold exactly in the crude model, although it can hold approximately in both models under various realistic conditions, including settings with rare events. 25 In our current analysis, we will assume that the hazard ratio for the treatment effect is constant over time, given the set of covariates {x_i2, …, x_iP}, but that the marginal (crude) hazard ratio may or may not be approximately constant over time. Let P_T and P_C denote the expected total at-risk person-time under treatment and under control, respectively, and let θ = ln{(D_T/P_T)/(D_C/P_C)} denote the crude log-rate ratio, which is a well-defined quantity whether or not the PH assumption holds in the unadjusted Cox model.
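To illustrate VIF_0 versus VIF_1, here is a stdlib-Python sketch for the simplest case of a single binary covariate, where the (weighted) multiple R² reduces to a (weighted) squared correlation. The prevalences and coefficients are illustrative assumptions, not values from the paper:

```python
import math
import random

def weighted_r2(x, z, w):
    """Weighted squared correlation between treatment indicator x and a single
    covariate z; with one covariate this equals the coefficient of determination
    from weighted least-squares regression of x on z."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mz = sum(wi * zi for wi, zi in zip(w, z)) / sw
    cxz = sum(wi * (xi - mx) * (zi - mz) for wi, xi, zi in zip(w, x, z)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vz = sum(wi * (zi - mz) ** 2 for wi, zi in zip(w, z)) / sw
    return cxz ** 2 / (vx * vz)

random.seed(1)
n = 2000
z = [1 if random.random() < 0.4 else 0 for _ in range(n)]             # binary covariate
x = [1 if random.random() < (0.6 if zi else 0.3) else 0 for zi in z]  # treatment depends on z
b1, b2 = math.log(1.5), math.log(4.0)                                 # assumed effects
w = [math.exp(b1 * xi + b2 * zi) for xi, zi in zip(x, z)]             # relative hazards w_i

vif0 = 1.0 / (1.0 - weighted_r2(x, z, [1.0] * n))  # unweighted: Hsieh-Lavori VIF_0
vif1 = 1.0 / (1.0 - weighted_r2(x, z, w))          # weighted VIF_1 from Appendix 1
print(round(vif0, 3), round(vif1, 3))
```

With all weights equal to one, VIF_1 collapses to VIF_0, mirroring the reduction under the global null described above.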
For power analysis in the multivariable Cox model, we consider variance expressions obtained by scaling up the crude variances in (5) and (7) to (9) by VIF_0 or VIF_1:

V_0(β̂_1) = {1/(D p_T p_C)} VIF_0   (11)
V_0(β̂_1)_w = {1/(D p_T p_C)} VIF_1   (12)
V_1(β̂_1) = [{r exp(θ) + 1}² / {Dr exp(θ)}] VIF_0   (13)
V_1(β̂_1)_w = [{r exp(θ) + 1}² / {Dr exp(θ)}] VIF_1   (14)
V_2(β̂_1) = [{r̄ exp(θ) + 1}² / {D r̄ exp(θ)}] VIF_0   (15)
V_2(β̂_1)_w = [{r̄ exp(θ) + 1}² / {D r̄ exp(θ)}] VIF_1   (16)
V_3(β̂_1) = (1/D_T + 1/D_C) VIF_0   (17)
V_3(β̂_1)_w = (1/D_T + 1/D_C) VIF_1.   (18)

Equation (11) corresponds to the approach proposed by Hsieh and Lavori. 1 The performance of variance expressions (11) to (18) for power analysis is examined in simulation studies, as described in Section 3. A practical approach to approximation of the true VIF based on commonly available prior data is illustrated using a real data example in Section 4.

3 SIMULATION STUDIES

3.1 Simulation settings

Three simulation studies were performed to check the main theoretical properties of the variance expressions (11) to (18). In each simulation study, we examined a cohort of 1000 subjects accrued uniformly over a period of two years and followed for an additional three years after completion of accrual, which implies a uniform distribution of administrative censoring times with a range of three to five years. Other parameters varied between simulation settings, as described in the following.

Simulation study 1

In this basic scenario, there was a binary treatment variable and no other covariates. Study event times were generated from an exponential distribution with a constant baseline (control) hazard, a hazard ratio for the treatment effect of 1.4, 1.5, or 1.6, a treatment allocation ratio of 3, 1, or 1/3, and no competing events. There were nine simulation settings in this study (Table 1).
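One realization of this Table 1-style design can be drawn in a few lines. The sketch below is ours, and the control hazard of 0.10/yr is an assumption (the value is garbled in this extraction):

```python
import random

random.seed(42)

def simulate_event_counts(n_t, n_c, h_c, hr, accrual=2.0, followup=3.0):
    """One realization of the design: exponential event times, uniform accrual
    over `accrual` years, administrative censoring only (times in [3, 5])."""
    counts = {"T": 0, "C": 0}
    for group, n, h in (("T", n_t, h_c * hr), ("C", n_c, h_c)):
        for _ in range(n):
            entry = random.uniform(0.0, accrual)
            censor_time = accrual + followup - entry  # admin censoring in [3, 5]
            if random.expovariate(h) < censor_time:
                counts[group] += 1
    return counts

# Illustrative run: r = 3 (750 vs 250 subjects), HR = 1.5, control hazard 0.10/yr
counts = simulate_event_counts(750, 250, 0.10, 1.5)
print(counts["T"] > counts["C"])  # more events in the larger, higher-hazard arm
```

Repeating such draws (the paper uses 2000 per setting, in SAS) and fitting the Cox model to each realization is how the simulated power and empirical standard errors in Section 3 are obtained.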

TABLE 1  Simulation set 1: relative bias of standard errors and power in the Cox model with a binary treatment indicator only. [Table body not recoverable from this extraction; columns give r, D_T/D_C, HR, the relative bias of standard errors for V_0(β̂) to V_3(β̂), predicted power (×100%) for each expression, and simulated power with 95% CI.]

r = treatment allocation ratio; {D_T, D_C} = mean study event counts under treatment and under control; HR = hazard ratio. Total cohort size was 1000 subjects in all simulation settings. Relative bias of standard errors was defined as the ratio of the predicted standard error (computed as the square root of the respective variance expression) to the empirical mean of the consistent model-based standard error estimator in 2000 realizations of the data-generating process per setting (a value of 1 corresponds to no bias). Power was defined as the probability of the lower 95% confidence limit for the HR falling above 1. Predicted power computed from the respective variance expressions was compared with simulated power in 2000 realizations of the data-generating process per setting (simulated power is the proportion of the 2000 realizations where the null hypothesis was rejected).

Simulation study 2

In addition to the binary treatment variable x_1, there was a binary baseline covariate x_2. Within each combination of x_1 and x_2, the study event hazard was constant (ie, the event times had an exponential distribution). In subjects with x_2 = 1, the hazard was 0.12 year⁻¹ under control and 0.18 year⁻¹ under treatment. In subjects with x_2 = 0, the hazard was either 0.06 year⁻¹ under control and 0.09 year⁻¹ under treatment (Scenario 1) or 0.03 year⁻¹ under control and 0.045 year⁻¹ under treatment (Scenario 2).
Hence, the adjusted hazard ratio for the treatment effect was 1.5 in both scenarios, whereas the adjusted hazard ratio for the covariate effect was 2.0 in Scenario 1 or 4.0 in Scenario 2. These two scenarios were examined to check the theoretical prediction of deteriorating performance of VIF_0 with increasing magnitude of covariate effects. The association of x_2 with treatment varied between simulation settings, as shown in Table 2, where p_T^x2 and p_C^x2 denote the expected proportions of subjects with x_2 = 1 under treatment and under control, respectively. In this simulation, the competing event times were generated from an exponential distribution with a constant hazard of 0.05 year⁻¹, independently from the study event time. The competing hazard did not depend on x_1 or x_2. There were 18 simulation settings in this study (Table 2).

Simulation study 3

In addition to the binary treatment variable x_1, each subject had two baseline covariates, ie, a binary covariate x_2 and a continuous normal covariate x_3. Treatment probability was a logistic function of x_2 and x_3, whereas the cause-specific hazard of the study event was a function of x_1, x_2, and x_3 with a Weibull baseline hazard. A total of 30 simulation settings were examined. Details of this simulation study are provided in Appendix 2.

Measures of performance of variance expressions (11) to (18)

Performance of variance expressions (11) to (18) was examined based on 2000 realizations of the data-generating process per simulation setting, with a total of 57 settings in the three simulation studies. Let a denote the predicted standard error in a given simulation setting, computed as the square root of the respective variance expression (11) to (18), and let m and s denote the mean and standard deviation, respectively, of the consistent model-based standard error estimator across the 2000 realizations of the data-generating process.
The point estimate of the relative bias of standard errors based on (11) to (18) was defined as B = a/m, whereas the lower and upper 95% confidence limits were constructed as a/(m ± 1.96s/√2000). These confidence limits measure the magnitude of simulation error (uncertainty in the relative bias due to the finite number of replications per setting). In principle, in the definition of B, m could be replaced with the empirical standard deviation of β̂_1 across the 2000 replications, ie, se(β̂_1). However, when the standard modeling assumptions for Cox regression are satisfied, the partial likelihood estimation theory guarantees that m/se(β̂_1) converges in probability to 1, so the two definitions of B are asymptotically equivalent, but B = a/m allows for a more straightforward quantification of simulation error because m is a sample mean, whereas se(β̂_1) is a sample standard deviation (in a sample of size 2000). The

TABLE 2  Simulation set 2: relative bias of standard errors derived from variance Equations (11) to (18) in a Cox model with a binary treatment x_1 and a binary covariate x_2. [Table body not recoverable from this extraction; columns give e^β2, r, p_T^x2, p_C^x2, R², R_W², D_T/D_C, RR, and the relative bias of standard errors based on Equations (11) to (18).]

e^β2 = adjusted hazard ratio for the covariate effect; r = treatment allocation ratio; {p_T^x2, p_C^x2} = expected proportions of subjects with x_2 = 1 in the treated and control groups, respectively; R² = coefficient of determination for the treatment-covariate association; R_W² is the weighted version of R²; {D_T, D_C} = mean study event counts under treatment and under control; RR = crude rate ratio for the study event. The adjusted HR for the treatment effect on the study event was 1.5 in all settings in this table; total cohort size was 1000 subjects in all simulation settings. Relative bias of standard errors was defined as the ratio of the predicted standard error, computed as the square root of the respective variance expression (11) to (18), to the empirical mean of the consistent model-based standard error estimator in 2000 realizations of the data-generating process per setting (a value of 1 corresponds to no bias).

ratio m/se(β̂_1) was examined for each simulation setting, along with the empirical coverage of the 95% CIs for the treatment effect exp(β_1). Superiority power was defined in each simulation setting as Pr(LCL_HR > 1) if the true HR > 1, and Pr(UCL_HR < 1) if the true HR < 1, where LCL_HR and UCL_HR are the lower and upper 95% confidence limits for the HR, respectively. In simulation study 3, noninferiority power was also examined, defined as Pr{UCL_HR < exp(1)} if the true HR > 1, and Pr{LCL_HR > exp(−1)} if the true HR < 1.
These noninferiority margins corresponded to the range of possible values of the hazard ratio in this simulation study. Predicted power values were computed based on variance expressions (11) to (18) by plugging these expressions into general superiority and noninferiority power equations 4, of which Equation (2) is a special case for traditional superiority testing. These predicted power values were tabulated along with simulated power, defined as the proportion of rejected nulls in 2000 realizations of the data-generating process per setting. For each simulated power value W, a 95% confidence interval (representing simulation error) was constructed as W ± 1.96{W(1 − W)/2000}^{1/2}. Simulation was executed in SAS 9.3. The SAS code is available as Supporting Information.

3.2 Simulation results

Simulation results are presented in Tables 1 to 3 (Simulation studies 1 and 2) and in Supplementary Tables 1 through 4 of Appendix 2 (Simulation study 3). In all settings, simulation error in the analysis of the relative bias of standard errors was negligible, with 95% confidence limits a/(m ± 1.96s/√2000) not noticeably different from point estimates when rounded to two decimal places (and hence not presented separately in the tables). On the other hand, 95% CIs for simulated power are shown explicitly in the respective tables, along with predicted power based on (11) to (18). In all simulation settings, empirical coverage of 95% CIs for exp(β_1) was close to the nominal level (median 0.95), and the standard error ratio m/se(β̂_1) was close to unity (median 1.00). The performance of variance expressions (11) to (18) is reviewed in the following.
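The simulation-error interval for a simulated power value W can be sketched as follows (our code; W = 0.70 is an illustrative input):

```python
import math

def power_ci(w, n_rep=2000, z=1.96):
    """95% CI for a simulated power value w based on n_rep replications:
    w +/- z * sqrt(w * (1 - w) / n_rep)."""
    half = z * math.sqrt(w * (1.0 - w) / n_rep)
    return (w - half, w + half)

lo, hi = power_ci(0.70)
print(round(lo, 3), round(hi, 3))  # 0.68 0.72
```

With 2000 replications the half-width is only about two percentage points, which is why differences of 10 percentage points between predicted and simulated power (reported below) cannot be explained by simulation error.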

TABLE 3  Simulation set 2: superiority power predicted based on variance Equations (11) to (18) in a Cox model with a binary treatment x_1 and a binary covariate x_2. [Table body not recoverable from this extraction; columns give e^β2, r, p_T^x2, p_C^x2, R², R_W², predicted power (×100%) based on Equations (11) to (18), and simulated power with 95% CI.]

e^β2 = adjusted hazard ratio for the covariate effect; r = treatment allocation ratio; {p_T^x2, p_C^x2} = expected proportions of subjects with x_2 = 1 in the treated and control groups, respectively; R² = multiple coefficient of determination for the treatment-covariate association; R_W² is the weighted version of R². The adjusted HR for the treatment effect on the study event was 1.5 in all settings in this table; total cohort size was 1000 subjects in all simulation settings. Predicted power was computed by plugging variance values obtained from Equations (11) to (18) into the superiority power Equation (2); simulated power is the proportion of the 2000 realizations where the null hypothesis was rejected.

Simulation study 1

With treatment as the only covariate in the model, variance expressions (11) to (18) reduce to their univariable counterparts (5) and (7) to (9). As predicted by theory, the null variance expression (5) performed well under equal allocation but had a noticeable downward bias (overestimating power) when the group with greater hazard received more subjects (r = 3) and an upward bias (underestimating power) when the group with greater hazard received fewer subjects (r = 1/3) (Table 1, Supplementary Figure 1). For example, with r = 3 and HR = 1.6, the null variance expression predicted a power of close to 80%, whereas the simulated power was only 70%.
The alternative variance expressions had much better (and almost identical) performance.

Simulation study 2

Results of this simulation study with two covariates are presented in Tables 2 and 3. As predicted by theory, under unequal treatment allocation, the null variance expressions (11) to (12) noticeably underestimated or overestimated the true variance, depending on whether the group with greater hazard received more or fewer subjects (Table 2). Power values predicted by the null variance expressions (11) to (12) differed from the simulated power by 10 percentage points in some settings (Table 3). The alternative variance expressions had much better performance, although, as predicted by theory, with a very strong covariate effect (HR = 4), expressions incorporating VIF_1 had better performance than those based on VIF_0.

Simulation study 3

Results of this simulation study are summarized in Appendix 2. As predicted by theory, the null variance expressions (11)-(12) performed relatively well when the treatment allocation ratio (r) and/or the crude rate ratio (RR) were close to unity. However, when r and RR both deviated far from unity in the same direction, the null variance expressions (11)-(12) substantially underestimated the true variance, while in those settings where r and RR deviated far from unity in opposite directions, the null variance expressions substantially overestimated the true variance. The resulting relative bias of standard errors in some cases exceeded 20%, and power was under- or overestimated by 10 percentage points in

many settings. Compared with (11) and (12), all alternative variance expressions had better performance, although (17) and (18) performed better than the other expressions (see Appendix 2 for details).

4 SETTING INITIAL CONDITIONS FOR POWER ANALYSIS BASED ON PRIOR DATA, ILLUSTRATED WITH A REAL DATA EXAMPLE

Theory and simulations presented in the previous sections indicate that power for Cox regression analysis can be accurately approximated by plugging the variance approximation from Equation (9), scaled up by an appropriate VIF, into general power equations,4 of which Equation (2) is a special case for traditional superiority testing18 (the null hypothesis of no treatment effect):

Power ≈ Φ{ |β_1| / √[(1/D_T + 1/D_C) × VIF] − z_{1−α/2} }.    (19)

If variance expression (9) in (19) is replaced with the null variance (5), we obtain a power equation corresponding to the popular approach of Hsieh and Lavori,1 which has been implemented in various software packages:

Power ≈ Φ{ |β_1| / √[VIF / ((D_T + D_C) p_T p_C)] − z_{1−α/2} }.    (20)

Both (19) and (20) scale up a crude variance expression by a VIF, but (20) uses the crude null variance (5), whereas (19) uses the crude alternative variance (9). To calculate power for a given α using either Equation (19) or (20), we need to specify the same four parameters, ie, β_1, D_T, D_C, and VIF, but these parameters are arranged in a slightly different way in (19) compared with (20). In the next paragraphs, we use a real data example to explain how the values of the parameters β_1, D_T, D_C, and VIF can be specified based on commonly available prior data.

In a retrospective cohort study aiming to examine the effect of statin treatment (versus no use of statins) on recurrence-free survival after radiation therapy for prostate cancer, we expect a sample size of approximately 5000 eligible patients with approximately five years of follow-up data per patient.
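Before turning to the example, Equations (19) and (20) can be sketched as small Python functions (function names are ours; z = 1.96 corresponds to a two-sided 5% test):

```python
from math import erf, sqrt

def phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_eq19(beta1, d_t, d_c, vif, z=1.96):
    # Equation (19): crude alternative variance (1/D_T + 1/D_C), scaled by VIF
    se = sqrt((1.0 / d_t + 1.0 / d_c) * vif)
    return phi(abs(beta1) / se - z)

def power_eq20(beta1, d_t, d_c, p_t, vif, z=1.96):
    # Equation (20): crude null variance 1/[(D_T + D_C) p_T p_C], scaled by VIF
    se = sqrt(vif / ((d_t + d_c) * p_t * (1.0 - p_t)))
    return phi(abs(beta1) / se - z)
```

Both functions take the four parameters β_1, D_T, D_C, and VIF discussed above; (20) additionally needs the treatment fraction p_T.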
The final sample size will be determined after the formal application of eligibility criteria to the existing dataset (if the study moves forward), but it is unlikely to be very different from the current estimate of n = 5000. Due to the absence of prospective accrual in this analysis, the sample size is not controlled by investigators. Hence, the purpose of power analysis in this case is not to set an accrual goal (target sample size) but to determine what effect sizes can be detected with reasonable power given the available sample size, duration of follow-up, and other initial conditions. Based on a previous study with a different dataset, we expect a treatment fraction of p_T = 0.2 (approximately n_T = 1000 patients treated with statins and n_C = 4000 controls not using statins) and a five-year recurrence-free survival of S_C = 0.83 (83%) in the control group.26 In the prior study, the crude HR for the end point of recurrence-free survival in statin users versus nonusers was HR_C = 0.60 (95% CI: 0.43, 0.83), whereas the HR adjusted for baseline demographics, tumor characteristics, and other covariates was HR_A = 0.69 (95% CI: 0.50, 0.97).26 We note that the ratio of the crude to the adjusted HRs is 0.60/0.69 = 0.87. We will refer to this quantity as the confounding factor (CF), that is, CF = HR_C/HR_A. The effect size that we wish to detect in this calculation is a relative risk reduction of 25%, so we set the adjusted HR to HR_A = 0.75 and β_1 = ln(0.75). Given the prior estimate of the background (control) five-year recurrence-free survival of 83%, the expected event count under control for our study can be approximated as D_C = n_C(1 − S_C) = 4000 × (1 − 0.83) = 680. The effect size of HR_A = 0.75 together with the CF estimate of 0.87 implies a crude hazard ratio of HR_C = HR_A × CF = 0.75 × 0.87 = 0.65, which we can use for the calculation of D_T from the baseline (control) event rate as follows.
If the PH assumption holds at least approximately in the crude (treatment-only) model, HR_C is approximately equal to the ratio of log-survival probabilities, HR_C = ln(S_T)/ln(S_C).27 Hence, we can approximate the five-year recurrence-free survival under treatment as S_T = exp(HR_C × ln S_C) = exp(0.65 × ln 0.83) = 0.89, and the expected event count under treatment as D_T = n_T(1 − S_T) = 1000 × (1 − 0.89) = 110. Equations (19) and (20) also require a VIF, which can be estimated from previously reported CIs for the crude and the adjusted HR, using the fact that the estimated variance of the estimated log-hazard ratio is related to the lower and upper 95% confidence limits of the HR (LCL_HR, UCL_HR) as follows27:

V(ln HR) = {ln(UCL_HR) − ln(LCL_HR)}² / (2 × 1.96)².    (21)
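The worked arithmetic up to this point, together with the CI-based variance of Equation (21), can be restated as a short sketch (variable and helper names are ours):

```python
from math import exp, log

# Prior-study inputs for the worked example
hr_crude, hr_adj = 0.60, 0.69
cf = hr_crude / hr_adj                # confounding factor, ≈ 0.87
hr_a = 0.75                           # target adjusted HR (effect size)
hr_c = hr_a * cf                      # implied crude HR, ≈ 0.65

n_t, n_c, s_c = 1000, 4000, 0.83
d_c = n_c * (1 - s_c)                 # expected control events = 680

# Five-year survival under treatment via S_T = exp(HR_C * ln S_C)
s_t = exp(round(hr_c, 2) * log(s_c))  # ≈ 0.886, rounded to 0.89 in the text
d_t = n_t * (1 - round(s_t, 2))       # expected treated events ≈ 110

def var_log_hr(lcl, ucl):
    # Equation (21): variance of the log-HR recovered from a 95% CI
    return (log(ucl) - log(lcl)) ** 2 / (2 * 1.96) ** 2

v_crude = var_log_hr(0.43, 0.83)      # crude HR CI from the prior study
v_adj = var_log_hr(0.50, 0.97)        # adjusted HR CI from the prior study
```

The ratio v_adj / v_crude (about 1.02 with these CIs) is exactly the VIF estimate of Equation (22).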

Applying (21) to the previously reported 95% CIs for the crude HR (LCL_HRC, UCL_HRC) and the adjusted HR (LCL_HRA, UCL_HRA), we can approximate the VIF by

VIF ≈ {ln(UCL_HRA) − ln(LCL_HRA)}² / {ln(UCL_HRC) − ln(LCL_HRC)}².    (22)

In our real data example, VIF ≈ {ln(0.97) − ln(0.50)}² / {ln(0.83) − ln(0.43)}² = 1.02. To be conservative, we round this up to 1.10 to allow for the possibility of a larger than expected variance inflation effect due to some differences in the definition of covariates. Now that we have specified the values of β_1, D_T, D_C, and VIF, we can plug these values into (19) or (20) and calculate power in Excel, respectively, as follows. For a two-sided test at 5% alpha:

Power = NORM.DIST(ABS(LN(0.75))/SQRT((1/110+1/680)*1.1)-1.96,0,1,TRUE) = 0.761

Power = NORM.DIST(ABS(LN(0.75))/SQRT(1/((110+680)*0.2*0.8)*1.1)-1.96,0,1,TRUE) = 0.869

Thus, the predicted power is 76.1% according to (19) and 86.9% according to (20). In a simulation study approximating these power analysis settings in terms of group sample sizes, expected event counts, and VIF, the simulated power was 78.1% (95% CI: 77.3%, 78.9%) (Appendix 3). As predicted by theory, when the group with greater hazard receives more subjects, the null variance expression tends to produce overly optimistic power estimates. In contrast, power estimates derived from the alternative variance expression remain reasonably accurate under equal or unequal treatment allocation. From the power calculation based on (19), we note that, with effect size HR_A = 0.75, the power is slightly below the conventional threshold of 80%. While we cannot increase the sample size or the length of follow-up due to the retrospective nature of the study, we can determine what magnitudes of effect can be detected with a power of at least 80% given the available sample size. For this purpose, we may repeat the power calculation using (19), expressing power as a function of effect size.
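The two Excel formulas above translate directly to Python, and the sweep over effect sizes described next can then be scripted (a sketch; function names are ours):

```python
from math import erf, log, sqrt

def phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

D_T, D_C, VIF, z = 110, 680, 1.1, 1.96

def power19(hr_a):
    # Equation (19): alternative crude variance 1/D_T + 1/D_C, scaled by VIF
    return phi(abs(log(hr_a)) / sqrt((1 / D_T + 1 / D_C) * VIF) - z)

def power20(hr_a, p_t=0.2):
    # Equation (20): null crude variance 1/[(D_T + D_C) p_T p_C], scaled by VIF
    return phi(abs(log(hr_a)) / sqrt(VIF / ((D_T + D_C) * p_t * (1 - p_t))) - z)

print(round(power19(0.75), 3), round(power20(0.75), 3))  # 0.761 0.869

# Power from Equation (19) as a function of effect size
sweep = {hr: round(power19(hr), 2) for hr in (0.75, 0.74, 0.73, 0.72, 0.71, 0.70)}
```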
For HR_A = 0.75, 0.74, 0.73, 0.72, 0.71, and 0.70, Equation (19) gives power of approximately 0.76, 0.80, 0.83, 0.86, 0.89, and 0.91, respectively. Thus, given the available sample size, in order to test the null hypothesis of no treatment effect with a power of at least 80%, the effect size HR_A must be 0.74 or smaller. In those studies where sample size is controlled by investigators (eg, cohort studies with prospective accrual), we may wish to determine the sample size that would result in the desired power, given the specified effect size and other initial conditions. Although we do not control the sample size in our real data example considered here, we will perform the sample size calculation for illustration, as if this were a prospective cohort study. Suppose that the target effect size is HR_A = 0.75 and the sample size should be sufficient to test the null hypothesis of no treatment effect with a power of at least 80%. For this purpose, we repeat our power calculation, increasing the D_T and D_C values in small increments, such as 5%, corresponding to a 5% incremental increase in the sample size. In the original calculation (n_T = 1000, n_C = 4000, D_T = 110, and D_C = 680), Equation (19) predicted a power of 0.76. With a 5% increase in group sample sizes and expected event counts (n_T = 1050, n_C = 4200, D_T = 116, and D_C = 714), Equation (19) yields a power of 0.78, whereas, with a 10% increase (n_T = 1100, n_C = 4400, D_T = 121, and D_C = 748), it yields a power of 0.80. Hence, in order to satisfy the effect size and power requirements stated earlier, we would set the accrual goal to n_T = 1100 treated patients and n_C = 4400 control patients, assuming the 1:4 treatment-to-control allocation ratio.

5 DISCUSSION

In power analysis for Cox regression models, variance of the estimated log-hazard ratio is usually approximated by inverting the expected null information matrix.
1,2,14,17,23,24 Although this approximation works reasonably well in many nonnull settings (Tables 1 to 3; Supplementary Tables 1 to 4 in Appendix 2), it is important to recognize the conditions that result in unsatisfactory performance of this method. As was shown analytically and by simulations, the null variance expression can substantially overestimate the true variance, thus underestimating precision and power, when the group with greater hazard receives fewer subjects. Even more importantly, the null variance expression can substantially underestimate the true variance, thus overestimating precision and power, when the group with greater hazard receives more subjects. These results, derived for the Cox model, hold whether or not covariates are present and whether or not they are associated with the treatment indicator. Hence, these findings are relevant to Cox model power calculations for randomized trials as well as observational studies. Because, in the case of a univariable Cox model, hypothesis tests are asymptotically equivalent to the log-rank test, power calculations based on (6) (which is derived from the null variance) are also inaccurate for the log-rank test when the true hazard ratio and the treatment-to-control allocation ratio are both far from 1. This was indeed previously shown by other authors.5

In any study (observational or randomized) where comparison groups have very unequal sizes, the inaccuracy of the null variance expression (5) and the corresponding power equations such as (6) or (20) becomes relevant. In the multivariable Cox model, the magnitude of this error can be large even with a correctly specified VIF, or when covariates are unassociated with the treatment indicator (VIF = 1). Among the alternative variance expressions, Equation (9) scaled up by an appropriate VIF seems most appealing due to its accuracy and relative simplicity. Power analysis based on this approach, corresponding to Equation (19), may be carried out by specifying the parameters β_1, D_T, D_C, and VIF in the steps outlined in the real data example of Section 4. These steps are fairly straightforward, although details may depend on the type of prior data available and the context of the study. Appendix 4 contains additional recommendations for each step under various scenarios typically encountered in practice.

There is one important scenario where variance expressions (5) and (9) are equal. In noninferiority studies, power can be defined as the probability that the upper (1 − α)100% confidence limit for the HR falls below some noninferiority margin M > 1, given that the true hazard ratio (HR_A) is equal to some value less than M, such as HR_A = 1.4,5 In the latter case, if planned follow-up is the same for treated and control subjects, D_T = D × p_T, D_C = D × p_C, and (9) becomes equal to (5). Power in this case can be obtained from (19) or (20), which become equal to each other, but β_1 in these equations must now be replaced with ln(M), the noninferiority margin for the log-hazard ratio:

Power ≈ Φ{ ln(M) / √[VIF / ((D_T + D_C) p_T p_C)] − z_{1−α/2} }.    (23)

Equation (23) can be obtained by plugging the null variance V_0(β̂_1) = [(D_T + D_C) p_T p_C]⁻¹ × VIF into a general noninferiority power equation, such as, for example, Equation (4) in the work of Scosyrev and Bancken.
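As a numeric check on the claim that (5) and (9) coincide under HR_A = 1 with equal follow-up, and as a sketch of Equation (23), the following uses illustrative inputs: the margin M = 1.25 and VIF = 1.1 are assumptions of ours, while the group sizes and control event rate are borrowed from the example in Section 4, with event counts computed under no treatment effect.

```python
from math import erf, log, sqrt

def phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Event counts under HR_A = 1 (same event rate in both groups)
n_T, n_C, S_C = 1000, 4000, 0.83
D_T = n_T * (1 - S_C)                 # = 170
D_C = n_C * (1 - S_C)                 # = 680
p_T = n_T / (n_T + n_C)               # = 0.2
p_C = 1 - p_T

# Under HR_A = 1 with equal follow-up, expressions (5) and (9) coincide
v_null = 1.0 / ((D_T + D_C) * p_T * p_C)
v_alt = 1.0 / D_T + 1.0 / D_C

# Equation (23): noninferiority power for an assumed margin M and VIF
M, VIF = 1.25, 1.1
power = phi(log(M) / sqrt(v_null * VIF) - 1.96)
```

With these illustrative inputs the two crude variances agree to floating-point precision, and the noninferiority power comes out at roughly 70%.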
4 We note that the values of D_T and D_C for (23) must be computed under the assumption of no treatment effect (HR_A = 1). For a one-sided noninferiority test, α/2 in (23) can be replaced with α. In a randomized trial, VIF in (23) would be set to 1.

DATA ACCESSIBILITY

All data associated with this paper, including simulation code, will be available for archiving at the journal's repository.

CONFLICTS OF INTEREST

The authors are employed by Novartis Pharma.

ORCID

Emil Scosyrev

REFERENCES

1. Hsieh FY, Lavori PW. Sample-size calculations for the Cox proportional hazards regression model with nonbinary covariates. Control Clin Trials. 2000;21(6).
2. Schoenfeld DA. Sample-size formula for the proportional-hazards regression model. Biometrics. 1983;39(2).
3. Luo J, Su Z. A note on variance estimation in the Cox proportional hazards model. J Appl Stat. 2013;40(5).
4. Scosyrev E, Bancken F. Expected precision of estimation and probability of ruling out a hypothesis based on a confidence interval. Int Stat Rev. 2017;85(3).
5. Jung S-H, Jang SJ, McCall LM, Blumenstein B. Sample size computation for two-sample noninferiority log-rank test. J Biopharm Stat. 2005;15(6).
6. Schmoor C, Sauerbrei W, Schumacher M. Sample size considerations for the evaluation of prognostic factors in survival analysis. Statist Med. 2000;19(4).
7. Latouche A, Porcher R. Sample size calculations in the presence of competing risks. Statist Med. 2007;26(30).
8. Prentice RL, Kalbfleisch JD, Peterson AV Jr, Flournoy N, Farewell VT, Breslow NE. The analysis of failure times in the presence of competing risks. Biometrics. 1978;34(4).
9. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol. 2009;170(2).
10. Cox DR. Regression models and life-tables. J R Stat Soc Ser B Methodol. 1972;34(2).
11. Beyersmann J, Scheike TH. Classical regression models for competing risks. In: Klein JP, van Houwelingen HC, Ibrahim JG, Scheike TH, eds. Handbook of Survival Analysis.
Boca Raton, FL: CRC Press.
12. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446).

13. Klein JP, Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. New York, NY: Springer.
14. Collett D. Modelling Survival Data in Medical Research. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.
15. Væth M, Skovlund E. A simple approach to power and sample size calculations in logistic regression and Cox regression models. Statist Med. 2004;23(11).
16. Pintilie M. Dealing with competing risks: testing covariates and calculating sample size. Statist Med. 2002;21(22).
17. Kocak M, Onar-Thomas A. A simulation-based evaluation of the asymptotic power formulas for Cox models in small sample cases. Am Stat. 2012;66(3).
18. Wang S, Zhang J, Lu W. Sample size calculation for the proportional hazards model with a time-dependent covariate. Comput Stat Data Anal. 2014;74(1).
19. Crisp A, Curtis P. Sample size estimation for non-inferiority trials of time-to-event data. Pharm Stat. 2008;7(4).
20. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis. 1974;27(1).
21. Korn EL, Freidlin B. Conditional power calculations for clinical trials with historical controls. Statist Med. 2006;25(17).
22. Efron B. The efficiency of Cox's likelihood function for censored data. J Am Stat Assoc. 1977;72(359).
23. Lachin JM. Sample size and power for a logrank test and Cox proportional hazards model with multiple groups and strata, or a quantitative covariate with multiple strata. Statist Med. 2013;32(25).
24. Lachin JM. Biostatistical Methods: The Assessment of Relative Risks. Hoboken, NJ: John Wiley & Sons.
25. Lin NX, Logan S, Henley WE. Bias and sensitivity analysis when estimating treatment effects from the Cox model with omitted covariates. Biometrics. 2013;69(4).
26. Kollmeier MA, Katz MS, Mak K, et al. Improved biochemical outcomes with statin use in patients with high-risk localized prostate cancer treated with radiotherapy. Int J Radiat Oncol Biol Phys. 2011;79(3).
27. Parmar MK, Torri V, Stewart L.
Extracting summary statistics to perform meta-analyses of the published literature for survival endpoints. Statist Med. 1998;17(24).

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

How to cite this article: Scosyrev E, Glimm E. Power analysis for multivariable Cox regression models. Statistics in Medicine. 2019;38.


More information

Sample size re-estimation in clinical trials. Dealing with those unknowns. Chris Jennison. University of Kyoto, January 2018

Sample size re-estimation in clinical trials. Dealing with those unknowns. Chris Jennison. University of Kyoto, January 2018 Sample Size Re-estimation in Clinical Trials: Dealing with those unknowns Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj University of Kyoto,

More information

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF

PROD. TYPE: COM. Simple improved condence intervals for comparing matched proportions. Alan Agresti ; and Yongyi Min UNCORRECTED PROOF pp: --2 (col.fig.: Nil) STATISTICS IN MEDICINE Statist. Med. 2004; 2:000 000 (DOI: 0.002/sim.8) PROD. TYPE: COM ED: Chandra PAGN: Vidya -- SCAN: Nil Simple improved condence intervals for comparing matched

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Superiority by a Margin Tests for One Proportion

Superiority by a Margin Tests for One Proportion Chapter 103 Superiority by a Margin Tests for One Proportion Introduction This module provides power analysis and sample size calculation for one-sample proportion tests in which the researcher is testing

More information

POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION. by Zhaowen Sun M.S., University of Pittsburgh, 2012

POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION. by Zhaowen Sun M.S., University of Pittsburgh, 2012 POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION by Zhaowen Sun M.S., University of Pittsburgh, 2012 B.S.N., Wuhan University, China, 2010 Submitted to the Graduate Faculty of the Graduate

More information

Penalized likelihood logistic regression with rare events

Penalized likelihood logistic regression with rare events Penalized likelihood logistic regression with rare events Georg Heinze 1, Angelika Geroldinger 1, Rainer Puhr 2, Mariana Nold 3, Lara Lusa 4 1 Medical University of Vienna, CeMSIIS, Section for Clinical

More information

Estimating terminal half life by non-compartmental methods with some data below the limit of quantification

Estimating terminal half life by non-compartmental methods with some data below the limit of quantification Paper SP08 Estimating terminal half life by non-compartmental methods with some data below the limit of quantification Jochen Müller-Cohrs, CSL Behring, Marburg, Germany ABSTRACT In pharmacokinetic studies

More information

A SMOOTHED VERSION OF THE KAPLAN-MEIER ESTIMATOR. Agnieszka Rossa

A SMOOTHED VERSION OF THE KAPLAN-MEIER ESTIMATOR. Agnieszka Rossa A SMOOTHED VERSION OF THE KAPLAN-MEIER ESTIMATOR Agnieszka Rossa Dept. of Stat. Methods, University of Lódź, Poland Rewolucji 1905, 41, Lódź e-mail: agrossa@krysia.uni.lodz.pl and Ryszard Zieliński Inst.

More information

Simulating complex survival data

Simulating complex survival data Simulating complex survival data Stata Nordic and Baltic Users Group Meeting 11 th November 2011 Michael J. Crowther 1 and Paul C. Lambert 1,2 1 Centre for Biostatistics and Genetic Epidemiology Department

More information

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS Background Independent observations: Short review of well-known facts Comparison of two groups continuous response Control group:

More information

Attributable Risk Function in the Proportional Hazards Model

Attributable Risk Function in the Proportional Hazards Model UW Biostatistics Working Paper Series 5-31-2005 Attributable Risk Function in the Proportional Hazards Model Ying Qing Chen Fred Hutchinson Cancer Research Center, yqchen@u.washington.edu Chengcheng Hu

More information

Logistic regression model for survival time analysis using time-varying coefficients

Logistic regression model for survival time analysis using time-varying coefficients Logistic regression model for survival time analysis using time-varying coefficients Accepted in American Journal of Mathematical and Management Sciences, 2016 Kenichi SATOH ksatoh@hiroshima-u.ac.jp Research

More information

Simple techniques for comparing survival functions with interval-censored data

Simple techniques for comparing survival functions with interval-censored data Simple techniques for comparing survival functions with interval-censored data Jinheum Kim, joint with Chung Mo Nam jinhkim@suwon.ac.kr Department of Applied Statistics University of Suwon Comparing survival

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 Department of Statistics North Carolina State University Presented by: Butch Tsiatis, Department of Statistics, NCSU

More information

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston A new strategy for meta-analysis of continuous covariates in observational studies with IPD Willi Sauerbrei & Patrick Royston Overview Motivation Continuous variables functional form Fractional polynomials

More information

Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials

Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials Progress, Updates, Problems William Jen Hoe Koh May 9, 2013 Overview Marginal vs Conditional What is TMLE? Key Estimation

More information

1 Introduction. 2 Residuals in PH model

1 Introduction. 2 Residuals in PH model Supplementary Material for Diagnostic Plotting Methods for Proportional Hazards Models With Time-dependent Covariates or Time-varying Regression Coefficients BY QIQING YU, JUNYI DONG Department of Mathematical

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Sample Size Determination

Sample Size Determination Sample Size Determination 018 The number of subjects in a clinical study should always be large enough to provide a reliable answer to the question(s addressed. The sample size is usually determined by

More information

4. Comparison of Two (K) Samples

4. Comparison of Two (K) Samples 4. Comparison of Two (K) Samples K=2 Problem: compare the survival distributions between two groups. E: comparing treatments on patients with a particular disease. Z: Treatment indicator, i.e. Z = 1 for

More information

Group Sequential Designs: Theory, Computation and Optimisation

Group Sequential Designs: Theory, Computation and Optimisation Group Sequential Designs: Theory, Computation and Optimisation Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj 8th International Conference

More information

A note on R 2 measures for Poisson and logistic regression models when both models are applicable

A note on R 2 measures for Poisson and logistic regression models when both models are applicable Journal of Clinical Epidemiology 54 (001) 99 103 A note on R measures for oisson and logistic regression models when both models are applicable Martina Mittlböck, Harald Heinzl* Department of Medical Computer

More information

First Aid Kit for Survival. Hypoxia cohort. Goal. DFS=Clinical + Marker 1/21/2015. Two analyses to exemplify some concepts of survival techniques

First Aid Kit for Survival. Hypoxia cohort. Goal. DFS=Clinical + Marker 1/21/2015. Two analyses to exemplify some concepts of survival techniques First Aid Kit for Survival Melania Pintilie pintilie@uhnres.utoronto.ca Two analyses to exemplify some concepts of survival techniques Checking linearity Checking proportionality of hazards Predicted curves:

More information

Continuous Time Survival in Latent Variable Models

Continuous Time Survival in Latent Variable Models Continuous Time Survival in Latent Variable Models Tihomir Asparouhov 1, Katherine Masyn 2, Bengt Muthen 3 Muthen & Muthen 1 University of California, Davis 2 University of California, Los Angeles 3 Abstract

More information

Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X

Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X Chapter 157 Tests for the Odds Ratio in a Matched Case-Control Design with a Quantitative X Introduction This procedure calculates the power and sample size necessary in a matched case-control study designed

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH

FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH Jian-Jian Ren 1 and Mai Zhou 2 University of Central Florida and University of Kentucky Abstract: For the regression parameter

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

Adaptive Extensions of a Two-Stage Group Sequential Procedure for Testing a Primary and a Secondary Endpoint (II): Sample Size Re-estimation

Adaptive Extensions of a Two-Stage Group Sequential Procedure for Testing a Primary and a Secondary Endpoint (II): Sample Size Re-estimation Research Article Received XXXX (www.interscience.wiley.com) DOI: 10.100/sim.0000 Adaptive Extensions of a Two-Stage Group Sequential Procedure for Testing a Primary and a Secondary Endpoint (II): Sample

More information

Frailty Models and Copulas: Similarities and Differences

Frailty Models and Copulas: Similarities and Differences Frailty Models and Copulas: Similarities and Differences KLARA GOETHALS, PAUL JANSSEN & LUC DUCHATEAU Department of Physiology and Biometrics, Ghent University, Belgium; Center for Statistics, Hasselt

More information

A comparison of inverse transform and composition methods of data simulation from the Lindley distribution

A comparison of inverse transform and composition methods of data simulation from the Lindley distribution Communications for Statistical Applications and Methods 2016, Vol. 23, No. 6, 517 529 http://dx.doi.org/10.5351/csam.2016.23.6.517 Print ISSN 2287-7843 / Online ISSN 2383-4757 A comparison of inverse transform

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Simulation-based robust IV inference for lifetime data

Simulation-based robust IV inference for lifetime data Simulation-based robust IV inference for lifetime data Anand Acharya 1 Lynda Khalaf 1 Marcel Voia 1 Myra Yazbeck 2 David Wensley 3 1 Department of Economics Carleton University 2 Department of Economics

More information

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;

More information

HOW TO DETERMINE THE NUMBER OF SUBJECTS NEEDED FOR MY STUDY?

HOW TO DETERMINE THE NUMBER OF SUBJECTS NEEDED FOR MY STUDY? HOW TO DETERMINE THE NUMBER OF SUBJECTS NEEDED FOR MY STUDY? TUTORIAL ON SAMPLE SIZE AND POWER CALCULATIONS FOR INEQUALITY TESTS. John Zavrakidis j.zavrakidis@nki.nl May 28, 2018 J.Zavrakidis Sample and

More information

In Defence of Score Intervals for Proportions and their Differences

In Defence of Score Intervals for Proportions and their Differences In Defence of Score Intervals for Proportions and their Differences Robert G. Newcombe a ; Markku M. Nurminen b a Department of Primary Care & Public Health, Cardiff University, Cardiff, United Kingdom

More information

Published online: 10 Apr 2012.

Published online: 10 Apr 2012. This article was downloaded by: Columbia University] On: 23 March 215, At: 12:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer

More information

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators

A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Statistics Preprints Statistics -00 A Simulation Study on Confidence Interval Procedures of Some Mean Cumulative Function Estimators Jianying Zuo Iowa State University, jiyizu@iastate.edu William Q. Meeker

More information

SAMPLE SIZE ESTIMATION FOR SURVIVAL OUTCOMES IN CLUSTER-RANDOMIZED STUDIES WITH SMALL CLUSTER SIZES BIOMETRICS (JUNE 2000)

SAMPLE SIZE ESTIMATION FOR SURVIVAL OUTCOMES IN CLUSTER-RANDOMIZED STUDIES WITH SMALL CLUSTER SIZES BIOMETRICS (JUNE 2000) SAMPLE SIZE ESTIMATION FOR SURVIVAL OUTCOMES IN CLUSTER-RANDOMIZED STUDIES WITH SMALL CLUSTER SIZES BIOMETRICS (JUNE 2000) AMITA K. MANATUNGA THE ROLLINS SCHOOL OF PUBLIC HEALTH OF EMORY UNIVERSITY SHANDE

More information

A Regression Model For Recurrent Events With Distribution Free Correlation Structure

A Regression Model For Recurrent Events With Distribution Free Correlation Structure A Regression Model For Recurrent Events With Distribution Free Correlation Structure J. Pénichoux(1), A. Latouche(2), T. Moreau(1) (1) INSERM U780 (2) Université de Versailles, EA2506 ISCB - 2009 - Prague

More information

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification Todd MacKenzie, PhD Collaborators A. James O Malley Tor Tosteson Therese Stukel 2 Overview 1. Instrumental variable

More information

PQL Estimation Biases in Generalized Linear Mixed Models

PQL Estimation Biases in Generalized Linear Mixed Models PQL Estimation Biases in Generalized Linear Mixed Models Woncheol Jang Johan Lim March 18, 2006 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure for the generalized

More information

Efficiency Comparison Between Mean and Log-rank Tests for. Recurrent Event Time Data

Efficiency Comparison Between Mean and Log-rank Tests for. Recurrent Event Time Data Efficiency Comparison Between Mean and Log-rank Tests for Recurrent Event Time Data Wenbin Lu Department of Statistics, North Carolina State University, Raleigh, NC 27695 Email: lu@stat.ncsu.edu Summary.

More information

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models

Supporting Information for Estimating restricted mean. treatment effects with stacked survival models Supporting Information for Estimating restricted mean treatment effects with stacked survival models Andrew Wey, David Vock, John Connett, and Kyle Rudser Section 1 presents several extensions to the simulation

More information

Simple Sensitivity Analysis for Differential Measurement Error. By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A.

Simple Sensitivity Analysis for Differential Measurement Error. By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A. Simple Sensitivity Analysis for Differential Measurement Error By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A. Abstract Simple sensitivity analysis results are given for differential

More information

Full likelihood inferences in the Cox model: an empirical likelihood approach

Full likelihood inferences in the Cox model: an empirical likelihood approach Ann Inst Stat Math 2011) 63:1005 1018 DOI 10.1007/s10463-010-0272-y Full likelihood inferences in the Cox model: an empirical likelihood approach Jian-Jian Ren Mai Zhou Received: 22 September 2008 / Revised:

More information

Optimising Group Sequential Designs. Decision Theory, Dynamic Programming. and Optimal Stopping

Optimising Group Sequential Designs. Decision Theory, Dynamic Programming. and Optimal Stopping : Decision Theory, Dynamic Programming and Optimal Stopping Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj InSPiRe Conference on Methodology

More information

Mantel-Haenszel Test Statistics. for Correlated Binary Data. Department of Statistics, North Carolina State University. Raleigh, NC

Mantel-Haenszel Test Statistics. for Correlated Binary Data. Department of Statistics, North Carolina State University. Raleigh, NC Mantel-Haenszel Test Statistics for Correlated Binary Data by Jie Zhang and Dennis D. Boos Department of Statistics, North Carolina State University Raleigh, NC 27695-8203 tel: (919) 515-1918 fax: (919)

More information

Extensions of Cox Model for Non-Proportional Hazards Purpose

Extensions of Cox Model for Non-Proportional Hazards Purpose PhUSE 2013 Paper SP07 Extensions of Cox Model for Non-Proportional Hazards Purpose Jadwiga Borucka, PAREXEL, Warsaw, Poland ABSTRACT Cox proportional hazard model is one of the most common methods used

More information