
The author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Any other use falls under the restrictions of the copyright, in particular concerning the obligation to mention explicitly the source when using results of this master dissertation.


I want to thank my tutor, Dr. Karel Vermeulen, for his thorough explanations and his answers to my many questions. I would like to express my gratitude towards Professor Vansteelandt for the opportunity of working on this thesis under his supervision and for his guidance and insights. Special acknowledgements go out to the VSC (Flemish Supercomputer Center), for making available their computational resources and services for the more computationally intensive methods used in this work. Finally, I would like to give thanks to my parents for their love and support. Without them, combining work and study would not have been possible for me.

Roel Vandersmissen
August 2015


Contents

List of Tables
List of Figures
Abstract
1. Introduction
   Randomized controlled trials
   Baseline covariate adjustment
   Concerns regarding covariate selection
   Solutions
   Outline
2. Methods for covariate adjustment
   Overview and notation
   Semiparametric locally efficient adjustment
   G-computation
   Targeted Maximum Likelihood Estimation
   Inverse Probability of Treatment Weighting
   Doubly robust estimation
   Firth bias reduction
3. Data generation and analysis
   Simulation studies: General; Data generation mechanism; Model fitting; Performance analysis
   Application to blood pressure trial
   Software

4. Results of simulation studies
   Semiparametric locally efficient estimation
   G-computation
   Targeted maximum likelihood estimation
   Inverse probability of treatment weighting
   Doubly robust estimation
   Comparison between methods
5. Blood pressure trial results
6. Discussion
Appendix A. Overview of warnings
Appendix B. Overview of N/A values
Appendix C. Alternative DR approach
References

List of Tables

3.1 SLE approaches
3.2 G-computation approaches
3.3 TMLE approaches
3.4 DR approaches
4.1 SLE results: Estimation
4.2 SLE results: Type I errors
4.3 SLE results: Power
4.4 G-computation results: Estimation
4.5 G-computation results: Type I errors
4.6 G-computation results: Power
4.7 TMLE results: Estimation
4.8 TMLE results: Type I errors
4.9 TMLE results: Power
4.10 IPTW results: Estimation
4.11 IPTW results: Type I errors
4.12 IPTW results: Power
4.13 DR results: Estimation
4.14 DR results: Type I errors
4.15 DR results: Power
4.16 Comparison: Estimation
4.17 Comparison: Type I errors
4.18 Comparison: Power
5.1 Blood pressure trial covariate summary
5.2 Blood pressure trial analysis
A.1 SLE estimation/power: Warnings
A.2 SLE type I errors: Warnings
A.3 G-computation estimation/power: Bootstrapping warnings
A.4 G-computation type I errors: Bootstrapping warnings
A.5 G-computation type I errors: Non-bootstrapping warnings
A.6 TMLE type I errors: Warnings
A.7 DR estimation/power: Bootstrapping warnings
A.8 DR estimation/power: Non-bootstrapping warnings

A.9 DR type I errors: Bootstrapping warnings
A.10 DR type I errors: Non-bootstrapping warnings
B.1 DR estimation/power: N/A values
B.2 DR type I errors: N/A values
C.1 DR ALT results: Estimation
C.2 DR ALT results: Type I errors
C.3 DR ALT results: Power
C.4 DR ALT estimation/power: Bootstrapping warnings
C.5 DR ALT type I errors: Bootstrapping warnings
C.6 DR ALT type I errors: Non-bootstrapping warnings

List of Figures

5.1 Blood pressure trial: Confidence intervals


Abstract

The use of baseline covariate-adjusted analyses to compare treatments in randomized clinical trials increases the power of the corresponding statistical tests. However, due to the possibility of selecting the combination of covariates that gives the most favorable result, baseline covariate adjustment is not routinely used in the analysis of randomized clinical trials. In this thesis, we compared the performance of several different estimators in simulation studies. Our focus lay on estimators that make an objective inference on the treatment effect possible. This is achieved by separating the baseline covariate adjustment from the treatment effect estimation, which is possible for the Semiparametric Locally Efficient (SLE) estimators, the Inverse Probability of Treatment Weighting (IPTW) estimators and the Doubly Robust (DR) estimators. Additionally, we included estimators using Targeted Maximum Likelihood Estimation (TMLE) and G-computation estimators in our comparison. The simulation studies generated data for a binary outcome and a binary treatment variable, with eight baseline covariates. These were conducted for sample sizes of n = 100, n = 150 and n = 200, with 5000 repetitions for each sample size. For all estimators, multiple approaches were utilized to allow for the evaluation of the effects of model selection, model misspecification and the use of Firth bias reduction. These methods were then applied to the analysis of a hypertension trial: a subset of 105 subjects of this trial was used to analyze a binary outcome using our proposed methods.


1. Introduction

A randomized controlled trial (RCT) is a type of experiment in which the subjects are randomly assigned to one of two or more treatment groups. One of these treatment groups is used as the basis for comparison with the other(s) and is called the control group [1]. The randomized treatment allocation produces comparable groups, which allows causal inferences to be made [2]. The RCT is considered to be the gold standard in clinical trial design [1].

In an RCT, chance imbalances in baseline characteristics may occur despite randomization. This is more likely for trials with small sample sizes. If these imbalances occur for prognostic variables (i.e. variables that are associated with the outcome of interest), the power of a statistical test for the treatment effect is lowered. The statistical analysis can be adjusted for possible imbalances by including prognostic variables [3]. In the case of continuous outcomes (using linear regression for the outcome of interest), the unadjusted estimates are unbiased, but the power is lower because of a lower precision compared with adjusting for these variables in the analysis. This result cannot be directly extended to the non-linear case. For logistic regression, for example, covariate adjustment does not lead to increased precision (it tends to decrease the precision instead) [4]. However, due to the non-collapsibility of odds ratios, the adjusted and unadjusted analyses estimate different treatment effects: they estimate conditional and marginal treatment effects, respectively [5]. In the case of logistic regression, the adjusted estimates are further away from the null compared with the unadjusted estimates [4]. In the case of the estimation of a log odds ratio treatment effect, for example, this corresponds to the adjusted estimates being further away from 0 than the unadjusted estimates. The overall effect of adjusting for baseline covariates in logistic regression is still an increase in power [6]. This is because the effect of the decrease in precision on the power is offset by the greater distance from the null, leading to a net increase in power for the adjusted analyses in logistic regression.

As a large number of baseline covariates are usually collected in an RCT, a choice must be made regarding the variables that will be used for adjustment. This opens up the possibility for investigators to select the combination of covariates that results in the most favorable treatment effect estimate and/or the lowest p-value (a fishing expedition). This is the case for both linear and non-linear models. Pre-specification of the variables to be adjusted for is a possible solution, but it is not always feasible as all prognostic variables may not be known in advance.

This has led to a focus on the unadjusted analysis, in spite of the adjusted analysis resulting in increased power [4]. The non-collapsibility of odds ratios presents an additional complication in adjusting for prognostic variables [5]: it causes the choice of variables to influence what treatment effect is being estimated. The value of the treatment effect thus differs for different choices of variables used for adjustment. This potentially increases the effects of a fishing expedition in the case of logistic regression.

Specifying all baseline covariates a priori is a solution that eliminates the possibility of selecting covariates to give the most favorable result. An automated model selection procedure can be applied as a solution as well, as long as the details for using this model and all the candidate covariates are specified in advance [4]. Both these methods require complete pre-specification of covariates and, as already mentioned, this is not always feasible because it is possible that not all prognostic variables are known in advance [4].

A solution to the fishing expedition, other than specifying all covariates in advance, was developed by Tsiatis et al. [7] and Zhang et al. [8]. By using semiparametric theory, the estimation of the treatment effect can be separated from the baseline covariate adjustment. This is the case for Inverse Probability of Treatment Weighting (IPTW) as well [9]. The Doubly Robust (DR) estimator as presented by Loux et al. also enables the separation of treatment effect estimation from baseline covariate adjustment [10]. All these methods thus lessen concerns regarding the possibilities to engage in a search for the most favorable result. All approaches specified above are explained in more detail in section 2.

Simulation studies were performed comparing several methods for baseline covariate adjustment in RCTs. These methods were then applied to analyze a blood pressure trial. The scope of this thesis was limited to RCTs with binary treatment, binary outcome and small sample sizes. Small sample sizes were selected because chance imbalances between treatment groups are more likely to occur with fewer subjects. In section 2, the methods used are described in detail. Section 3 then describes how the simulation studies and the analysis of the blood pressure trial dataset were conducted. This is followed by section 4, presenting the results of the simulation studies, and section 5, containing the blood pressure trial analysis results. These results are discussed in the final section (section 6).

2. Methods for covariate adjustment

The methods that will be used for baseline covariate adjustment in this work are explained in this section. These are: semiparametric locally efficient estimation, G-computation estimation, targeted maximum likelihood estimation, inverse probability of treatment weighting estimation and doubly robust estimation. Additionally, Firth bias reduction is discussed. This is not a method for baseline covariate adjustment in itself, but can be used in combination with the proposed methods as a bias-reduced alternative for standard maximum likelihood estimation.

2.1 Overview and notation

All proposed methods result in marginal treatment effect estimators. This gives treatment effect estimates that are comparable with the treatment effects in most other studies [4], that can be used in meta-analyses [5] and, as they provide a measure of an overall effect, are the main focus of policy makers (e.g. regulatory authorities) [7]. Furthermore, the true value of the marginal treatment effect does not depend on the covariates included in the model; estimating a marginal treatment effect thus lowers the influence one potentially has on the size of the treatment effect estimate when choosing the combination of covariates that results in the most favorable estimate.

For all the proposed methods, a binary outcome $Y$, a binary treatment variable $A = 0, 1$ and a vector of baseline covariates $X = (X_1, \ldots, X_p)^T$ are assumed. The probabilities of treatment assignment are $P(A = 1) = \pi_1$ and $P(A = 0) = \pi_0 = 1 - \pi_1$ and are the same for all subjects.

2.2 Semiparametric locally efficient adjustment

In [8], Zhang et al. discuss a semiparametric method for adjustment for baseline covariates which makes it possible to circumvent the problem of the fishing expedition. Consider the following model for a binary outcome $Y$ in a two-arm RCT, with binary treatment $A$:

$$\text{logit}\{E(Y \mid A)\} = \text{logit}\{P(Y = 1 \mid A)\} = \alpha + \beta I(A = 1), \qquad (2.1)$$

with $\text{logit}(p) = \log\{p/(1 - p)\}$. The maximum likelihood (ML) estimator for the parameters $\theta = (\alpha, \beta)^T$ is found by solving the following estimating equation:

$$\sum_{i=1}^{n} m(Y_i, A_i; \theta) = 0. \qquad (2.2)$$

In (2.2), the estimating function $m(Y, A; \theta)$ is given by:

$$m(Y, A; \theta) = [1, I(A = 1)]^T \left[ Y - \text{expit}\{\alpha + \beta I(A = 1)\} \right], \qquad (2.3)$$

with $\text{expit}(x) = e^x/(1 + e^x)$. This estimating function is unbiased. Under regularity conditions, this leads to a consistent and asymptotically normal estimator [12].

In order to obtain an estimator that is adjusted for baseline covariates $X = (X_1, \ldots, X_p)^T$, Zhang et al. [8] assume a semiparametric model (model $\mathcal{M}$) in which all joint densities for $(Y, A, X)$ can be written as:

$$p_{Y,A,X}(y, a, x; \theta, \eta, \pi) = p_{Y,X \mid A}(y, x \mid a; \theta, \eta)\, p_A(a; \pi),$$

with $p_A(a; \pi)$ completely specified, as $\pi = (\pi_0, \pi_1)$ is known. Randomization ensures that $A$ and $X$ are independent. Model $\mathcal{M}$ is additionally assumed to satisfy the following constraints: the marginal outcome model (2.1) holds, i.e.

$$\text{logit}\{P(Y = 1 \mid A; \theta)\} = \alpha + \beta I(A = 1), \qquad (2.4)$$

and

$$p_{X \mid A}(x \mid a; \eta) = p_X(x; \eta), \qquad (2.5)$$

with (2.5) following from the fact that $A$ and $X$ are independent due to randomization. $\eta$ is an additional nuisance parameter that is possibly infinite-dimensional, which is the reason why this is a semiparametric model. Zhang et al. go on to show that, given an unbiased estimating function $m(Y, A; \theta)$, the class of all unbiased estimating functions for $\theta$ under the semiparametric model can be written as follows:

$$m_{aug}(Y, A, X; \theta) = m(Y, A; \theta) - \sum_{g=0}^{1} \left[ I(A = g) - \pi_g \right] a_g(X). \qquad (2.6)$$

In the case that $a_g(X)$, an arbitrary function of $X$, is different from zero, this differs from the unadjusted analysis and uses information regarding the baseline covariates to augment the estimator by exploiting correlations between $m(Y, A; \theta)$ and functions of $X$. This adjusted estimator is shown to be relatively more efficient than the unadjusted estimator.

Zhang et al. further derive the optimal estimating function within the class of all unbiased estimating functions for $\theta$ by finding the estimating function with minimal variance. The estimator $\hat{\theta}$ is an M-estimator, as described in Stefanski and Boos [11], since it satisfies:

$$\sum_{i=1}^{n} m(O_i, \hat{\theta}) = 0,$$

as can be seen in (2.2) and (2.3). Stefanski and Boos derive that $\sqrt{n}(\hat{\theta} - \theta_0)$ for an M-estimator is asymptotically multivariate normal with mean 0 and variance $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1}\}^T$. Here, $\theta_0$ is the true parameter value, and $A(\theta_0)$ and $B(\theta_0)$ are given by:

$$A(\theta_0) = E\left[ -\frac{\partial m(O, \theta_0)}{\partial \theta^T} \right], \qquad B(\theta_0) = E\left[ m(O, \theta_0)\, \{m(O, \theta_0)\}^T \right].$$

Zhang et al. use these results to obtain the variance of the adjusted estimator [8]. They show that an estimator of $\theta$ solving (2.6) is consistent and asymptotically normal with asymptotic covariance matrix:

$$A(\theta_0)^{-1}\, \Sigma\, \{A(\theta_0)^{-1}\}^T, \qquad (2.7)$$

with:

$$\Sigma = E\left[ \left\{ m(Y, A; \theta_0) - \sum_{g=0}^{1} \left[ I(A = g) - \pi_g \right] a_g(X) \right\}^{\otimes 2} \right], \qquad (2.8)$$

where $u^{\otimes 2} = u u^T$ and $\theta_0$ is the true value of $\theta$.

The variance of the marginal treatment effect estimator is given by the entry in row 2, column 2 of this variance-covariance matrix. The optimal estimating function is then found by minimizing $\Sigma$, i.e. by identifying the functions $a_g(X)$, $g = 0, 1$, leading to minimal asymptotic variance of an estimator of $\beta$. For a given $m(Y, A; \theta)$, this is shown to occur when $a_g(X) = E[m(Y, A; \theta) \mid X, A = g]$ for $g = 0, 1$. This choice extracts the greatest amount of information from the data with regard to the target parameter. It is thus concluded that the optimal estimator for $\theta$ in class (2.6) is found by solving the following estimating equation:

$$\sum_{i=1}^{n} \left( m(Y_i, A_i; \theta) - \sum_{g=0}^{1} \left[ I(A_i = g) - \pi_g \right] E\left[ m(Y, A; \theta) \mid X = X_i, A = g \right] \right) = 0. \qquad (2.9)$$

The values of $E[m(Y, A; \theta) \mid X, A = g]$ are unknown and need to be estimated by postulating working models. It should be noted that for estimating function (2.3) the expected value can be written as:

$$E\left[ m(Y, A; \theta) \mid X, A = g \right] = [1, I(g = 1)]^T \left[ E(Y \mid X, A = g) - \text{expit}\{\alpha + \beta I(g = 1)\} \right]. \qquad (2.10)$$

Obtaining the expected value of the estimating function thus comes down to obtaining $E(Y \mid X, A = g)$. It can be seen that the estimates for $E[m(Y, A; \theta) \mid X, A = g]$ can be obtained separately from the treatment effect estimates. That is, it is possible to have the analyses for $E[m(Y, A; \theta) \mid X, A = g]$ performed by someone independent from the person responsible for the treatment effect analysis. This invalidates any concerns regarding fishing expeditions.
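To make this separation concrete, the following minimal sketch (Python with numpy, scipy and statsmodels, used here purely for illustration; the function names are ours, not the implementation used in this work) solves the augmented estimating equation (2.9), with plain ML logistic fits in each treatment group as working models for $E(Y \mid X, A = g)$ and the randomization probability $\pi_1$ assumed known:

```python
import numpy as np
import statsmodels.api as sm
from scipy.optimize import root

def expit(x):
    return np.exp(x) / (1.0 + np.exp(x))

def sle_log_or(y, a, X, pi1=0.5):
    """Solve the augmented estimating equation (2.9) for theta = (alpha, beta).

    The working models E(Y | X, A = g) are ML logistic fits per treatment
    group; their predictions could be produced by an independent analyst."""
    pi = {0: 1.0 - pi1, 1: pi1}
    d = sm.add_constant(X)
    q = {}
    for g in (0, 1):
        fit = sm.GLM(y[a == g], d[a == g], family=sm.families.Binomial()).fit()
        q[g] = fit.predict(d)             # predicted endpoint for ALL subjects

    def estfun(theta):
        alpha, beta = theta
        r = y - expit(alpha + beta * a)   # unadjusted estimating function (2.3)
        u = np.array([r.sum(), (a * r).sum()])
        for g in (0, 1):                  # augmentation term of (2.9)
            w = (a == g).astype(float) - pi[g]
            rg = q[g] - expit(alpha + beta * g)
            u -= np.array([np.sum(w * rg), g * np.sum(w * rg)])
        return u

    return root(estfun, x0=np.zeros(2)).x[1]   # adjusted marginal log OR
```

Because the working-model predictions enter (2.9) as fixed inputs, changing them alters the efficiency but not the consistency of the treatment effect estimate.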

2.3 G-computation

G-computation is a method that can be used to derive a marginal treatment effect estimator as a log odds ratio for an RCT. First, we will summarize how this method is used by Snowden et al. to acquire a marginal treatment effect estimator (as a risk difference) for observational studies [13]. Then, we will describe how this can be applied to an RCT in order to obtain a log odds ratio marginal treatment effect estimator.

Snowden et al. start by fitting a regression model for the outcome $Y$ on the treatment $A$ and baseline covariates $X$, i.e. fitting $E(Y \mid A, X)$. This model is then used to predict the counterfactual outcomes for both treatment regimens $a = 0, 1$ for each observation (all subjects). Snowden et al. then proceed to use this full dataset of counterfactuals to fit a marginal structural model (MSM) to obtain a marginal treatment effect estimate as a risk difference.

G-computation can be applied to randomized studies with the log odds ratio as marginal treatment effect. This amounts to fitting the following MSM on the counterfactual outcomes:

$$\text{logit}\{P(Y^a = 1)\} = \alpha_m + \beta_m a. \qquad (2.11)$$

Obtaining the maximum likelihood estimator $\hat{\beta}_m$ from the MSM (2.11) is equivalent to determining the treatment effect estimate as:

$$\hat{\beta}_{GC} = \log\left( \frac{\hat{\mu}_{1,GC}/(1 - \hat{\mu}_{1,GC})}{\hat{\mu}_{0,GC}/(1 - \hat{\mu}_{0,GC})} \right), \qquad (2.12)$$

with $\hat{\mu}_{0,GC} = \frac{1}{n}\sum_{i=1}^{n} \hat{Y}_i^0$ and $\hat{\mu}_{1,GC} = \frac{1}{n}\sum_{i=1}^{n} \hat{Y}_i^1$ the marginal means of the predicted counterfactuals for the control and treatment regimens, respectively.

As noted by Snowden et al. [13], the variances and standard errors reported by software for fitting the MSM (2.11) for G-computation are not correct, as the counterfactuals are perfectly balanced for baseline covariates between the treatment groups. This is due to the creation of two counterfactuals (one in each treatment group) with identical baseline covariate values from each original observation. The variances and standard errors will instead be determined by implementing a resampling-based method, i.e. bootstrapping.
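The G-computation recipe above translates directly into code. Below is a minimal sketch (Python with numpy/statsmodels, for illustration only; names are ours) of the estimator (2.12) together with the ordinary nonparametric bootstrap for its standard error:

```python
import numpy as np
import statsmodels.api as sm

def design(a, X):
    # Intercept, treatment and baseline covariates as linear terms.
    return np.column_stack([np.ones(len(a)), a, X])

def gcomp_log_or(y, a, X):
    """Fit the initial outcome model, predict both counterfactuals for
    every subject, and apply the marginal log odds ratio (2.12)."""
    fit = sm.GLM(y, design(a, X), family=sm.families.Binomial()).fit()
    mu1 = fit.predict(design(np.ones(len(a)), X)).mean()
    mu0 = fit.predict(design(np.zeros(len(a)), X)).mean()
    return np.log((mu1 / (1 - mu1)) / (mu0 / (1 - mu0)))

def gcomp_bootstrap_se(y, a, X, reps=500, seed=1):
    """Ordinary bootstrap SE: model-based MSM standard errors are invalid
    on the perfectly balanced counterfactual dataset."""
    rng = np.random.default_rng(seed)
    est = []
    for _ in range(reps):
        idx = rng.integers(0, len(y), len(y))
        est.append(gcomp_log_or(y[idx], a[idx], X[idx]))
    return np.std(est, ddof=1)
```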

2.4 Targeted Maximum Likelihood Estimation

Moore and van der Laan describe the use of Targeted Maximum Likelihood Estimation (TMLE) for the estimation of the treatment effect in a randomized trial with binary outcome [14]. Suppose $n$ independent and identically distributed observations of the random vector $O = (W, A, Y) \sim P_0$ are observed. Here, $P_0$ is the probability distribution of $O$. Consider a statistical model $\mathcal{M}$, which contains all possible probability distributions of the data with respect to an appropriate dominating measure: $\mathcal{M}$ is thus nonparametric up to possible smoothness conditions. $P_0 \in \mathcal{M}$ is the true data distribution.

In a first step, an initial density estimator $\hat{p}^0$ of the density $p_0$ of $O$ is found. This initial density estimator corresponds to an initial fit $\bar{Q}^0(A, W)$, which is an estimator of $\bar{Q}(A, W) = E(Y \mid A, W)$. This initial fit can be obtained by fitting a logistic regression model using ML estimation, with treatment as fixed covariate and baseline variables as candidate covariates in the model. In this case, the initial fit is given by:

$$\bar{Q}^0(A, W) = \text{expit}\left[ \hat{f}^0(A, W) \right], \qquad (2.13)$$

for the function $\hat{f}^0$, obtained through fitting an initial logistic regression model for $\text{logit}\{\bar{Q}(A, W)\}$. As $Y$ is binary, the density estimator thus corresponds to:

$$\hat{p}^0(Y \mid A, W) = \bar{Q}^0(A, W)^{Y} \left[ 1 - \bar{Q}^0(A, W) \right]^{1 - Y}. \qquad (2.14)$$

The TMLE algorithm identifies a fluctuation of $\hat{p}^0$ that results in a maximal change in the parameter of interest. This is achieved by creating a parametric fluctuation, denoted by $\hat{p}^0(\epsilon)$, through $\hat{p}^0$, with $\epsilon$ a free fluctuation parameter:

$$\hat{p}^0(\epsilon)(Y \mid A, W) = \bar{Q}^0(\epsilon)(A, W)^{Y} \left[ 1 - \bar{Q}^0(\epsilon)(A, W) \right]^{1 - Y}. \qquad (2.15)$$

In the case of a logistic regression model for the initial fit, $\bar{Q}^0(\epsilon)(A, W)$ is given by:

$$\bar{Q}^0(\epsilon)(A, W) = \text{expit}\left[ \hat{f}^0(A, W) + \epsilon\, h(A, W) \right],$$

with $h(A, W)$, the clever covariate, determining the direction of the fluctuation. $h(A, W)$ is found by requiring that the score of the fluctuated density at $\epsilon = 0$ equals the efficient influence curve $D^*(P_0)$. The influence curve is the curve that describes the changes in the estimator under slight perturbations of the empirical distribution; the efficient influence curve is the influence curve with minimal variance. In [15] it is shown that the efficient influence curve can be decomposed as:

$$D^*(P_0) = D_1^*(P_0) + D_2^*(P_0) + D_3^*(P_0),$$

with:

$$D_1^*(P_0) = D^*(P_0) - E\left[ D^*(P_0) \mid A, W \right],$$
$$D_2^*(P_0) = E\left[ D^*(P_0) \mid A, W \right] - E\left[ D^*(P_0) \mid W \right],$$
$$D_3^*(P_0) = E\left[ D^*(P_0) \mid W \right] - E\left[ D^*(P_0) \right].$$

Moore and van der Laan use this result to conclude the following. $D_1^*(P_0)$ is a score for the probability density of $Y$, $\bar{Q}(A, W)$. $D_2^*(P_0)$ is a score for the treatment assignment mechanism $g_0(A \mid W) = P(A = 1 \mid W)$. $D_3^*(P_0)$ is a score for the marginal probability distribution $Q_W(W)$ of $W$. At $\epsilon = 0$, the score for $g_0$, the true treatment assignment mechanism, is zero because randomization ensures that $g_0(A \mid W) = g_0(A)$. Also, the empirical distribution of $W$ is a nonparametric ML estimator. From the latter two statements, it follows that $h(A, W)$ only needs to be chosen so that the score of $\hat{p}^0(\epsilon)(Y \mid A, W)$ at $\epsilon = 0$ includes the efficient influence curve component for $\bar{Q}(A, W)$ (i.e. $D_1^*(P_0)$).

The fluctuation function (2.15) stretches the initial density estimator in a direction that targets the parameter of interest. By maximizing the likelihood of the data over $\epsilon$, the optimal amount of stretch is found. This corresponds to an updated density estimator $\hat{p}^1$, which is the first-step TMLE density estimator. This process is repeated until convergence is reached (i.e. until the stretch is essentially zero). This corresponds to the TMLE estimator $\hat{p}^*$. This estimator achieves a bias reduction for the parameter of interest together with a small increase in the likelihood. The TMLE estimator is thus a familiar type of likelihood-based estimator. As the TMLE estimator solves the efficient influence curve estimating equation, it inherits its properties of asymptotic linearity and local efficiency. Hence, TMLE has properties of both likelihood-based and estimating function-based methodologies.

Gruber and van der Laan [16] note that the TMLE estimators for the parameters $E(Y^1)$ and $E(Y^0)$ can be targeted simultaneously as follows, in the case of a logistic regression initial model, with $\epsilon = (\epsilon_0, \epsilon_1)$:

$$\bar{Q}^0(\epsilon)(A, W) = \text{expit}\left[ \hat{f}^0(A, W) + \epsilon_0 h_0(A, W) + \epsilon_1 h_1(A, W) \right]. \qquad (2.16)$$

Here, $h_0(A, W)$ and $h_1(A, W)$ are given by:

$$h_0(A, W) = \frac{I(A = 0)}{\hat{g}(0 \mid W)}, \qquad h_1(A, W) = \frac{I(A = 1)}{\hat{g}(1 \mid W)}. \qquad (2.17)$$

Gruber and van der Laan indicate that this guarantees one-step convergence, as $h_0(A, W)$ and $h_1(A, W)$ do not involve $\bar{Q}$. The final targeted estimate $\bar{Q}^*(A, W)$ is thus equal to $\bar{Q}^1(A, W)$ and is given by:

$$\bar{Q}^1(A, W) = \text{expit}\left[ \hat{f}^0(A, W) + \hat{\epsilon}_0 h_0(A, W) + \hat{\epsilon}_1 h_1(A, W) \right]. \qquad (2.18)$$

Van der Laan and Rose note that simultaneously targeting these two parameters results in obtaining a valid TMLE estimate for parameters that are functions of these counterfactual means [17]. This allows us to derive a log odds ratio estimator.

Gruber and van der Laan [16] describe how G-computation can first be utilized to determine the estimators of $\mu_1 = E(Y^1)$ and $\mu_0 = E(Y^0)$ as:

$$\hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}^1(1, W_i), \qquad \hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}^1(0, W_i). \qquad (2.19)$$

We can then use $\hat{\mu}_1$ and $\hat{\mu}_0$ from (2.19) to obtain a log odds ratio estimator as:

$$\hat{\beta}_{TMLE} = \log\left( \frac{\hat{\mu}_1/(1 - \hat{\mu}_1)}{\hat{\mu}_0/(1 - \hat{\mu}_0)} \right). \qquad (2.20)$$

Gruber and van der Laan note that the influence curve for subject $i$ is given by [16]:

$$IC(O_i) = \frac{1}{\hat{\mu}_1(1 - \hat{\mu}_1)} \left[ \frac{I(A_i = 1)}{\hat{g}(1 \mid W_i)} \left( Y_i - \bar{Q}^1(A_i, W_i) \right) + \bar{Q}^1(1, W_i) - \hat{\mu}_1 \right] - \frac{1}{\hat{\mu}_0(1 - \hat{\mu}_0)} \left[ \frac{I(A_i = 0)}{1 - \hat{g}(1 \mid W_i)} \left( Y_i - \bar{Q}^1(A_i, W_i) \right) + \bar{Q}^1(0, W_i) - \hat{\mu}_0 \right].$$

Moore and van der Laan explain that this result can then be used to obtain an estimator of the asymptotic variance of $\sqrt{n}(\hat{\beta}_{TMLE} - \beta_0)$ as [14]:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} IC^2(O_i), \qquad (2.21)$$

where $IC(O) = IC(O \mid \hat{p}^*)$, with $\hat{p}^*$ the TMLE of $p_0$.

Gruber and van der Laan [16] note that, because the TMLE estimators solve the efficient influence curve estimating equations and the efficient influence curves satisfy a double robustness property, the TMLE estimator is doubly robust. This means that the TMLE estimators are consistent if the estimators for either $\bar{Q}(A, W)$ or the treatment mechanism $g(A \mid W)$ are consistent. Moore and van der Laan [14] additionally note that, in the case of a randomized trial, the treatment mechanism $g(A \mid W)$ is known, so the TMLE estimator is always consistent, even if $\bar{Q}_0(A, W)$ is misspecified. If $\bar{Q}_0(A, W)$ is consistently estimated for a randomized trial, the TMLE estimator is not only consistent, but efficient as well. However, if $\bar{Q}_0(A, W)$ is misspecified for a randomized trial, the efficiency can be increased by estimating the treatment mechanism as well. The TMLE estimator then uses the information from the treatment mechanism estimate to obtain greater precision for the treatment effect estimate.

To obtain the initial fit of the TMLE algorithm, van der Laan and Rose propose the use of Super Learner [17]. Super Learner builds a data-adaptive weighted combination of a library of candidate algorithms (estimators) using cross-validation. The optimal weights are chosen based on minimizing a loss function. In the case of a binary outcome, the parameter of interest for Super Learner is $P(Y = 1 \mid A, W)$. The negative log likelihood loss function that is minimized in order to find the optimal combination of estimators is:

$$L(\bar{Q})(O) = -\left[ Y \log \bar{Q}(A, W) + (1 - Y) \log\{1 - \bar{Q}(A, W)\} \right]. \qquad (2.22)$$

The procedure starts by dividing the dataset into $V$ groups of equal size. Next, the data in groups 2 to $V$ (the training set) are used to obtain a regression fit. Group 1 is used as the validation set, i.e. the fit obtained from the training set is used to predict $\bar{Q}_0(A, W)$ for group 1. This is done for all candidate algorithms separately. This process is repeated until all groups have been used as validation set (each time using all other groups as training set). The predictions from the validation sets can be used to estimate the cross-validated risk for each algorithm, i.e. to calculate the value of the loss function (2.22). Then, a family of weighted combinations of all algorithms, indexed by a weight vector $\alpha$, is proposed. For all validation sets, the predicted values for $P(Y = 1 \mid A, W)$ for each algorithm are weighted according to $\alpha$ to predict the outcome. The cross-validated risk for the weighted combination of all algorithms is then calculated using equation (2.22). The weight vector $\alpha$ is chosen so that the cross-validated risk is minimized. Finally, each of the algorithms is fitted on the complete dataset. These fits are then combined using weight vector $\alpha$ to generate the Super Learner prediction function. This can be used as the initial fit $\bar{Q}^0(A, W)$ in the first step of the TMLE algorithm.
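As a concrete illustration of the one-step targeting described above, the sketch below (Python with numpy/statsmodels; illustrative only, not the implementation used in this work) uses a plain ML logistic initial fit in place of Super Learner and the known randomization probability $g(1 \mid W) = 0.5$:

```python
import numpy as np
import statsmodels.api as sm

def expit(x):
    return np.exp(x) / (1.0 + np.exp(x))

def tmle_log_or(y, a, X, g1=0.5):
    """One-step TMLE for the marginal log OR: initial logistic fit,
    simultaneous fluctuation (2.16) with the clever covariates (2.17),
    then G-computation (2.19) and the log odds ratio (2.20)."""
    n = len(y)
    d = np.column_stack([np.ones(n), a, X])
    init = sm.GLM(y, d, family=sm.families.Binomial()).fit()
    lp = d @ init.params                               # \hat{f}^0(A, W)
    lp1 = np.column_stack([np.ones(n), np.ones(n), X]) @ init.params
    lp0 = np.column_stack([np.ones(n), np.zeros(n), X]) @ init.params

    # Clever covariates (2.17); the initial logit fit enters as an offset,
    # so the ML fit over (eps0, eps1) maximizes the likelihood over epsilon.
    h = np.column_stack([(a == 0) / (1 - g1), (a == 1) / g1])
    eps = sm.GLM(y, h, family=sm.families.Binomial(), offset=lp).fit().params

    # Targeted counterfactual predictions (2.18) and their means (2.19).
    mu1 = expit(lp1 + eps[1] / g1).mean()
    mu0 = expit(lp0 + eps[0] / (1 - g1)).mean()
    return np.log((mu1 / (1 - mu1)) / (mu0 / (1 - mu0)))
```

A single fluctuation step suffices here because $h_0$ and $h_1$ do not involve $\bar{Q}$, as noted above.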

2.5 Inverse Probability of Treatment Weighting

The use of propensity scores (PS) to obtain a covariate-adjusted marginal treatment effect for RCTs is detailed by Williamson et al. [18]. The propensity score for a subject is the estimated probability that this subject receives the treatment. Consider an RCT with binary outcome $Y$, binary treatment $A$ and $X = (X_0, X_1, \ldots, X_p)^T$ the vector of $X_0$ and the baseline covariates that we wish to adjust for. Here, $X_0$ is set to 1 for all subjects in order to include an intercept term in the vector $X$ directly. The propensity scores $e(X) = P(A = 1 \mid X)$ are then estimated by obtaining an ML estimate $\hat{\gamma}$ for the following logistic regression model:

$$P(A = 1 \mid X) = \frac{\exp(\gamma^T X)}{1 + \exp(\gamma^T X)}.$$

Next, the ML estimate $\hat{\gamma}$ is used to find the estimated propensity score for subject $i$ as follows:

$$\hat{e}_i = \hat{e}(X_i) = \frac{\exp(\hat{\gamma}^T X_i)}{1 + \exp(\hat{\gamma}^T X_i)}.$$

The marginal mean for the treatment group is obtained by averaging the outcomes in this group, inversely weighted by their probability of receiving the treatment (i.e. the estimated propensity scores $\hat{e}_i$). The marginal mean for the control group is obtained similarly, except that the outcomes are inversely weighted by their probability of being assigned to the control group (i.e. $1 - \hat{e}_i$). The estimates for the marginal means of the outcome in the treatment and control group are thus given by:

$$\hat{\mu}_{1,IPTW} = \left( \sum_{i=1}^{n} \frac{A_i}{\hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{A_i Y_i}{\hat{e}_i}, \qquad \hat{\mu}_{0,IPTW} = \left( \sum_{i=1}^{n} \frac{1 - A_i}{1 - \hat{e}_i} \right)^{-1} \sum_{i=1}^{n} \frac{(1 - A_i) Y_i}{1 - \hat{e}_i}. \qquad (2.23)$$

The IPTW estimates for the marginal means can then be used to calculate an estimate for the log odds ratio as:

$$\hat{\beta}_{IPTW} = \log\left( \frac{\hat{\mu}_{1,IPTW}/(1 - \hat{\mu}_{1,IPTW})}{\hat{\mu}_{0,IPTW}/(1 - \hat{\mu}_{0,IPTW})} \right). \qquad (2.24)$$
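Before turning to the properties of this estimator, a minimal sketch of (2.23)-(2.24) (Python with numpy/statsmodels, for illustration; the function name is ours):

```python
import numpy as np
import statsmodels.api as sm

def iptw_log_or(y, a, X):
    """Estimate propensity scores by ML logistic regression, then take
    inverse-probability-weighted marginal means (2.23) and the log OR (2.24)."""
    d = sm.add_constant(X)                    # X_0 = 1 supplies the intercept
    ps = sm.GLM(a, d, family=sm.families.Binomial()).fit().predict(d)

    w1, w0 = a / ps, (1 - a) / (1 - ps)       # weights per treatment arm
    mu1 = np.sum(w1 * y) / np.sum(w1)         # weighted mean, treatment group
    mu0 = np.sum(w0 * y) / np.sum(w0)         # weighted mean, control group
    return np.log((mu1 / (1 - mu1)) / (mu0 / (1 - mu0)))
```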

The log odds ratio estimated by (2.24) is thus a marginal treatment effect estimate that is adjusted for the baseline covariates used in estimating the propensity scores. This method of adjustment is called Inverse Probability of Treatment Weighting (IPTW). In observational studies, IPTW removes confounding by baseline covariates in order to make causal inference possible. There is no confounding in randomized studies; here, IPTW instead takes into account possible chance imbalances between the treatment groups that may occur despite randomization. These chance imbalances are accounted for by the creation of a pseudo-population with balanced covariates between the treatment groups as a result of the weighting [19]. As there are no confounders in RCTs due to randomization (and thus no unmeasured confounders), the marginal means in the pseudo-population are the same as in the original population and, consequently, the log odds ratio is the same as well. The IPTW treatment effect estimates in RCTs are therefore always consistent.

The estimate for the log odds ratio in (2.24) can also be obtained by fitting a weighted logistic regression model for the outcome $Y$ on the treatment $A$, with weights $w_i = 1/\hat{e}_i$ if $A_i = 1$ and $w_i = 1/(1 - \hat{e}_i)$ if $A_i = 0$ [18].

Williamson et al. further derive the large-sample variance for an IPTW marginal log odds ratio. They begin by noting that using estimated propensity scores to solve for the marginal means (equation (2.23)) is equivalent to simultaneously solving the estimating equations $\sum_{i=1}^{n} \Psi(Y_i, A_i, X_i; \zeta) = 0$ for the parameter $\zeta = (\mu_1, \mu_0, \gamma^T)^T$, with:

$$\Psi(Y_i, A_i, X_i; \zeta) = \begin{pmatrix} (Y_i - \mu_1)\, A_i / e(X_i) \\ (Y_i - \mu_0)(1 - A_i)/\{1 - e(X_i)\} \\ \{A_i - e(X_i)\}\, X_i \end{pmatrix}.$$

The first two estimating equations are scalars and the third component is a $(p + 1) \times 1$ vector. Solving these estimating equations results in an estimator $\hat{\zeta}$ that is asymptotically normally distributed. As this is an M-estimator, the large-sample variance of $\hat{\zeta}$ is equal to $A^{-1} B (A^{-1})^T$. Here, $A = E\left[ -\partial \Psi / \partial \zeta^T \right]$ and $B = E\left[ \Psi \Psi^T \right]$, with the derivative evaluated at the true value of the parameter and the expectations taken over the true distribution of the data. Using standard matrix multiplication we can thus calculate $\text{var}(\hat{\mu}_1)$, $\text{var}(\hat{\mu}_0)$ and $\text{cov}(\hat{\mu}_1, \hat{\mu}_0)$. As the log odds ratio is a function of $\hat{\mu}_1$ and $\hat{\mu}_0$, the delta method can be used to obtain the large-sample variance of the log odds ratio. This result, however, requires the estimation of both counterfactuals $Y^1$ and $Y^0$ for each subject. Replacing expectations by sample averages and leaving expressions in terms of observable quantities (i.e. $Y$ and $A$, not $Y^1$ and $Y^0$) gives us a sample estimate of the large-sample variance that does not require estimation of $Y^1$ and $Y^0$:

$$\widehat{\text{var}}(\hat{\beta}_{IPTW}) = \frac{1}{n} \left( \hat{V} - \hat{H}^T \hat{S}^{-1} \hat{H} \right), \qquad (2.25)$$

where, letting $c_1 = \{\hat{\mu}_1(1 - \hat{\mu}_1)\}^{-1}$ and $c_0 = \{\hat{\mu}_0(1 - \hat{\mu}_0)\}^{-1}$, we have:

$$\hat{V} = \frac{c_1^2}{n} \sum_{i=1}^{n} \frac{A_i (Y_i - \hat{\mu}_1)^2}{\hat{e}_i^2} + \frac{c_0^2}{n} \sum_{i=1}^{n} \frac{(1 - A_i)(Y_i - \hat{\mu}_0)^2}{(1 - \hat{e}_i)^2},$$

$$\hat{H} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{c_1 A_i (Y_i - \hat{\mu}_1)(1 - \hat{e}_i)}{\hat{e}_i} + \frac{c_0 (1 - A_i)(Y_i - \hat{\mu}_0)\, \hat{e}_i}{1 - \hat{e}_i} \right\} X_i, \qquad \hat{S} = \frac{1}{n} \sum_{i=1}^{n} \hat{e}_i (1 - \hat{e}_i)\, X_i X_i^T. \qquad (2.26)$$

Williamson et al. show that, if the estimated propensity scores are treated as known quantities, the sample estimate of the variance is equal to $\hat{V}/n$, which is a sample estimate of the variance of the unadjusted estimator. Combined with the second term of equation (2.25), $\hat{H}^T \hat{S}^{-1} \hat{H}/n$, always being positive, this implies that taking into account the fact that the propensity scores were estimated increases the precision. This increase in precision with the estimation of the propensity scores is due to the balancing of baseline covariates, which is achieved by weighting the analysis.

IPTW makes it possible to completely separate the propensity score and the outcome analyses. This excludes the possibility of selecting those baseline covariates that give the most favorable result and this, together with the fact that IPTW always results in unbiased treatment effect estimates in RCTs, results in objective inferences for RCTs [9].

2.6 Doubly robust estimation

The IPTW estimator can be expanded upon by estimating initial outcome models in addition to the propensity score model. Loux et al. explain that this can be accomplished by fitting initial outcome models $q_g(X; \hat{\beta}_g)$ in each subgroup $A = g$, with $g = 0, 1$ and the coefficient vector $\beta_g$ estimated by $\hat{\beta}_g$ using ML [10]. These outcome models include only a combination of baseline covariates as explanatory variables. They note that a doubly robust estimator of the marginal means can then be found using:

$$\hat{\mu}_{1,DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{A_i Y_i}{\hat{e}_i} + \left( 1 - \frac{A_i}{\hat{e}_i} \right) q_1(X_i; \hat{\beta}_1) \right], \qquad \hat{\mu}_{0,DR} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1 - A_i) Y_i}{1 - \hat{e}_i} + \left( 1 - \frac{1 - A_i}{1 - \hat{e}_i} \right) q_0(X_i; \hat{\beta}_0) \right]. \qquad (2.27)$$

The estimators $\hat{\mu}_{1,DR}$ and $\hat{\mu}_{0,DR}$ have been shown by Robins et al. to be consistent estimators for $E(Y^1)$ and $E(Y^0)$, respectively [20]. Furthermore, they show that these estimators remain consistent as long as either the propensity score model or the outcome model is correctly specified, i.e. these are doubly robust estimators. Additionally, they prove that the DR estimators (2.27) are asymptotically efficient in the class of semiparametric estimators.

Then, we can use the DR estimators (2.27) to determine a marginal log odds ratio as:

$$\hat{\beta}_{DR} = \log\left( \frac{\hat{\mu}_{1,DR}/(1 - \hat{\mu}_{1,DR})}{\hat{\mu}_{0,DR}/(1 - \hat{\mu}_{0,DR})} \right). \qquad (2.28)$$

We will apply these estimators to randomized studies in order to take into account chance imbalances between the treatment groups that may occur despite randomization, which is the same reasoning as for using the IPTW estimators discussed above (section 2.5). As for the IPTW estimators, the possibility to separate the different analyses contributes to an objective inference in randomized studies: it is possible to completely separate the propensity score, initial outcome and marginal mean analyses. The standard error for the marginal DR estimator will be determined by bootstrapping. Loux et al. note that the DR marginal mean estimators (2.27) are not restricted to be between 0 and 1, which could give a negative odds ratio when $\hat{e}_i$ or $1 - \hat{e}_i$ becomes very small [10]. However, as the DR estimator is applied to randomized studies here, values for $\hat{e}_i$ close to 0 or 1 are of no concern.
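The DR estimator combines both nuisance fits. A minimal sketch of (2.27)-(2.28) (Python with numpy/statsmodels, illustrative only; names are ours):

```python
import numpy as np
import statsmodels.api as sm

def dr_log_or(y, a, X):
    """Doubly robust marginal log OR: ML propensity scores plus
    arm-specific initial outcome models with baseline covariates only;
    consistent if either nuisance model is correctly specified."""
    d = sm.add_constant(X)
    ps = sm.GLM(a, d, family=sm.families.Binomial()).fit().predict(d)

    q = {}
    for g in (0, 1):                          # outcome model within arm A = g
        fit = sm.GLM(y[a == g], d[a == g], family=sm.families.Binomial()).fit()
        q[g] = fit.predict(d)                 # evaluated for all subjects

    # Augmented inverse-probability-weighted marginal means (2.27).
    mu1 = np.mean(a * y / ps + (1 - a / ps) * q[1])
    mu0 = np.mean((1 - a) * y / (1 - ps) + (1 - (1 - a) / (1 - ps)) * q[0])
    return np.log((mu1 / (1 - mu1)) / (mu0 / (1 - mu0)))   # log OR (2.28)
```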

2.7 Firth bias reduction

Firth bias reduction (BR) is not a method for baseline covariate adjustment, but is included in this section as it can be used in conjunction with any of the above methods where maximum likelihood (ML) estimation is used. Firth bias reduction can be used as an alternative for standard ML estimation that has reduced bias in the estimation of the parameter.

Firth explains how the first-order term of the asymptotic bias of ML estimates can be removed [21]. He notes that the asymptotic bias of the ML estimator $\hat{\theta}$ of a $k$-dimensional parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_k)^T$ can be written as:

$$b(\theta) = \frac{b_1(\theta)}{n} + \frac{b_2(\theta)}{n^2} + \cdots,$$

with $n$ the number of subjects in the case of an RCT. Removing the first-order term of this bias is accomplished by modifying the score function and is thus a preventive approach. As such, it does not depend on the existence of the standard ML estimate for a specific sample, as is the case for a corrective approach to bias reduction. Standard ML estimation solves the following score equation:

$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = 0,$$

where $\ell(\theta) = \log\{L(\theta)\}$ is the log likelihood function and $U(\theta)$ is the score function. Firth defines the following modification to the $r$-th component of the score vector for $\theta$:

$$U_r^*(\theta) = U_r(\theta) + A_r(\theta),$$

in which $A_r$ is allowed to depend on the data and is $O(1)$ as $n \to \infty$. In the above modification to the score function, the letter $r$ indexes the parameter $\theta_r$: $r \in \{1, 2, \ldots, k\}$. Define $\hat{\theta}$ as the solution of $U(\theta) = 0$; $\hat{\theta}$ is thus the standard ML estimator. Define $\theta^*$ as the solution of $U^*(\theta) = 0$; this is the bias-reduced ML estimator.

The bias of $\theta^*$ is then shown to be:

$$E(\theta_r^* - \theta_r) = \frac{1}{n}\left\{ \frac{1}{2}\, \kappa^{r,s} \kappa^{t,u} \left( \kappa_{s,t,u} + \kappa_{s,tu} \right) + \kappa^{r,s} E(A_s) \right\} + O(n^{-3/2}),$$

where $r$, $s$, $t$ and $u$ index the parameter $\theta$ and summation over repeated indices is implied. The joint null cumulants are $\kappa_{r,s} = n^{-1} E(U_r U_s)$, $\kappa_{r,s,t} = n^{-1} E(U_r U_s U_t)$ and $\kappa_{r,st} = n^{-1} E(U_r U_{st})$. Here, $U_r = \partial \ell / \partial \theta_r$ and $U_{rs} = \partial^2 \ell / \partial \theta_r \partial \theta_s$ are the log likelihood derivatives. $\kappa^{r,s}$ denotes the inverse of $\kappa_{r,s} = n^{-1} i_{rs}(\theta)$, with $i(\theta)$ the expected Fisher information matrix. The expected value of $A_r$ is denoted by $E(A_r)$. The first-order bias of $\hat{\theta}$ is given by the term:

$$\frac{1}{2n}\, \kappa^{r,s} \kappa^{t,u} \left( \kappa_{s,t,u} + \kappa_{s,tu} \right) = \frac{b_1^r(\theta)}{n}.$$

Thus, $A$ removes the first-order term if it satisfies:

$$\kappa^{r,s} E(A_s) = -\frac{b_1^r(\theta)}{n} + O(n^{-1/2}),$$

the solution to which is:

$$A_r = -\kappa_{r,s}\, b_1^s(\theta) + O(n^{-1/2}).$$

In matrix notation, the following should hold for the vector $A(\theta)$, in order to remove the first-order term of the bias of the standard ML estimate:

$$A(\theta) = -i(\theta)\, \frac{b_1(\theta)}{n} + O(n^{-1/2}).$$

We can thus choose $A(\theta) = -i(\theta)\, b_1(\theta)/n$ or $A(\theta) = -I(\theta)\, b_1(\theta)/n$ as bias-reducing vector, using the expected and observed Fisher information matrix respectively. Either of these choices removes the first-order bias term. Additionally, in the case of an exponential family in canonical parameterization, $i(\theta)$ and $I(\theta)$ coincide, as the observed information does not involve the data. For a logistic regression model with design matrix $Z$, the $O(n^{-1})$ bias vector is given by:

$$\frac{b_1(\theta)}{n} = (Z^T W Z)^{-1} Z^T W \xi,$$

with $W = \text{diag}\{m_i \pi_i (1 - \pi_i);\ i = 1, \ldots, n\}$, $m_i$ the binomial index for the $i$-th count and the $i$-th element of $\xi$ equal to $h_i (\pi_i - \frac{1}{2}) / \{m_i \pi_i (1 - \pi_i)\}$. Here, $h_i$ refers to the $i$-th diagonal element of the hat matrix given by:

$$H = W^{1/2} Z (Z^T W Z)^{-1} Z^T W^{1/2}.$$

Thus, as $Z^T W Z$ is the Fisher information matrix, we can write $i(\theta) = Z^T W Z$. The modified score function can then be written as $U^*(\theta) = U(\theta) + A(\theta)$, with the component for the parameter $\theta_r$:

$$U_r^*(\theta) = \sum_{i=1}^{n} \left[ y_i + \frac{h_i}{2} - (m_i + h_i)\, \pi_i \right] z_{ir}, \qquad (2.29)$$

for $r = 1, \ldots, k$. Kosmidis [22] notes that, by treating the $h_i$ as if they were known constants, bias-reduced estimation using the modified scores approach as in (2.29) is formally equivalent to ML estimation with the following adjustments to the response frequencies and totals:

$$\text{Pseudo-successes: } y_i^* = y_i + \tfrac{1}{2} h_i, \qquad \text{Pseudo-totals: } m_i^* = m_i + h_i, \qquad (2.30)$$

for $i = 1, \ldots, n$.

The pseudo-data representation (2.30) has the disadvantage of incorrectly inflating the binomial totals and could thus lead to an underestimation of standard errors. This can be solved by readjusting each working weight $w_i$: dividing it by $m_i + h_i$, with $h_i$ evaluated at the estimates, and multiplying it by $m_i$. This results in correctly estimated standard errors, and in the following modified iteratively reweighted least squares (IRLS) procedure:

1. Obtain the diagonal components $h_i$ of the hat matrix $H$.
2. Determine the pseudo-data representation at the current value of the parameters according to (2.30).
3. Calculate the updated parameter estimate as:

$$\theta^{(m+1)} = (Z^T W^{(m)} Z)^{-1} Z^T W^{(m)} \zeta^{(m)}, \qquad (2.31)$$

with $\theta^{(m)}$ the estimated parameter of iteration $m$ and $\zeta^{(m)}$ the $n$-vector of modified working variates, with components:

$$\zeta_i = \eta_i + \frac{y_i/m_i - \pi_i}{\pi_i(1 - \pi_i)} - d_i, \qquad d_i = \frac{h_i (\pi_i - \frac{1}{2})}{m_i \pi_i (1 - \pi_i)},$$

for $i = 1, \ldots, n$. The variables $z_i = \eta_i + (y_i/m_i - \pi_i)/\{\pi_i(1 - \pi_i)\}$ are the working variates for ML IRLS, which are modified by subtracting the variables $d_i$.
4. Re-adjust the working weight $w_i$: divide it by $m_i + h_i$, with $h_i$ evaluated at the estimates, then multiply it by $m_i$.

For the initial estimate $\theta^{(1)}$, the ML estimates obtained after adding $\frac{1}{2}$ to the initial successes and 1 to the initial totals are used. This eliminates the possibility of infinite ML estimates.

In addition to bias reduction, another advantage of using bias-reduced regression is the production of finite estimates even when the ML estimates are infinite (e.g. in the case of complete separation). Firth notes that the first-order asymptotic covariance matrix of $\theta^*$ is the same as that of $\hat{\theta}$, so the standard errors for the bias-reduced and ML estimates are asymptotically the same [21].
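To illustrate the modified score approach, the sketch below (Python/numpy; a sketch of the general idea for binary data with $m_i = 1$, not the exact modified IRLS implementation used in this work) solves the modified score equation (2.29) by Fisher scoring:

```python
import numpy as np

def firth_logistic(Z, y, max_iter=25, tol=1e-8):
    """Firth bias-reduced logistic regression for binary data (m_i = 1).
    Z is the design matrix, including an intercept column."""
    theta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        w = pi * (1.0 - pi)                   # working weights
        WZ = Z * w[:, None]
        info = Z.T @ WZ                       # expected information Z'WZ
        # Diagonal of the hat matrix H = W^(1/2) Z (Z'WZ)^(-1) Z' W^(1/2).
        h = np.sum((Z @ np.linalg.inv(info)) * WZ, axis=1)
        # Modified score (2.29): pseudo-successes y + h/2, pseudo-totals 1 + h.
        score = Z.T @ (y + h / 2.0 - (1.0 + h) * pi)
        step = np.linalg.solve(info, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

Unlike standard ML, these iterations return finite estimates even under complete separation.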

3. Data generation and analysis

3.1 Simulation studies

General

Simulation studies were carried out in order to compare the performance of several different approaches for obtaining a marginal treatment effect estimate in the case of a binary outcome. This marginal treatment effect is the log odds ratio for the treatment group versus the control group. The data were simulated for two treatment groups; details of the data-generating mechanism can be found in the section below. This data-generating mechanism was used in the simulation of randomized studies of sample sizes 100, 150 and 200. For each of the sample sizes, 5000 Monte Carlo data sets were generated.

Data generation mechanism

The data-generating mechanism described in this section was adapted from one of the simulation studies in [8]. There was a 50% chance of being assigned to either the treatment or the control group, i.e. treatment assignment $A$ was generated as Bernoulli with $P(A = 1) = P(A = 0) = 0.5$. $A = 1$ corresponds to the treatment group, $A = 0$ corresponds to the control group. The baseline covariates $X = (X_1, \ldots, X_8)^T$ were generated as follows: $X_1$, $X_3$ and $X_8$ as N(0,1) variables; $X_4$ as Bernoulli(0.3); $X_6$ as Bernoulli(0.5); and $X_2$, $X_5$ and $X_7$ as derived variables involving additional error terms $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$, each N(0,1).

The outcome $Y$ was then generated as Bernoulli with:

$$\text{logit}\{P(Y = 1 \mid A = a, X = x)\} = \alpha + \beta a + \gamma^T x,$$

with $\alpha = 0.9$, $\beta = 1.3$ and $\gamma = (0.5, 1.3, 0.5, 1.5, 0, 0, 0, 0)^T$. The conditional treatment effect $\beta$ was chosen to be 1.3, as this was found to be the smallest treatment effect (in absolute value) to give a power of at least 80% for Fisher's exact test in a simulation study using 5000 repetitions of $n = 200$ subjects. It is clear from $\gamma$ that the last four baseline covariates ($X_5$ through $X_8$) are not predictive of the outcome and are thus not important when adjusting our treatment effect estimate for the baseline covariates.

The marginal treatment effect from this data-generating mechanism, expressed as the log odds ratio comparing the treatment group with the control group, can be found by considering the following model:

$$\text{logit}\{P(Y^a = 1)\} = \alpha_m + \beta_m a.$$

The marginal treatment effect $\beta_m$ differs from the conditional treatment effect $\beta$ due to the non-collapsibility of odds ratios. In order to find the marginal treatment effect, 100 repetitions of simulations with one million subjects were carried out. The marginal treatment effect was then calculated as the mean of the treatment effects (as log odds ratios comparing the treatment group with the control group) using unadjusted logistic regression. This gave us the true marginal treatment effect $\beta_m$ (with an empirical standard deviation of 0.004).
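A sketch of this data-generating mechanism (Python/numpy, illustrative). The transformations defining $X_2$, $X_5$ and $X_7$ below are placeholders (hypothetical choices standing in for the derived covariates; only their N(0,1) error terms are specified above):

```python
import numpy as np

def simulate_trial(n, beta=1.3, rng=None):
    """Generate one randomized trial of size n, adapted from [8]."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.binomial(1, 0.5, n)                  # randomized treatment
    x1, x3, x8 = rng.standard_normal((3, n))
    x4 = rng.binomial(1, 0.3, n)
    x6 = rng.binomial(1, 0.5, n)
    e1, e2, e3 = rng.standard_normal((3, n))     # the N(0,1) error terms
    x2, x5, x7 = x1 + e1, x3 + e2, x8 + e3       # placeholder transformations
    X = np.column_stack([x1, x2, x3, x4, x5, x6, x7, x8])
    gamma = np.array([0.5, 1.3, 0.5, 1.5, 0.0, 0.0, 0.0, 0.0])
    p = 1.0 / (1.0 + np.exp(-(0.9 + beta * a + X @ gamma)))
    return rng.binomial(1, p), a, X              # outcome, treatment, covariates
```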

Model fitting

A description of the approaches used for the analysis of the simulation studies is given in this section. We obtained an unadjusted estimate by fitting a logistic regression model containing only the treatment as explanatory variable (ULR). This model was fitted using the IRLS algorithm; model fitting using IRLS will be referred to as standard ML estimation hereafter. A Firth bias-reduced unadjusted estimate was found using a logistic regression model, containing only the treatment as explanatory variable, fitted with Firth bias reduction. Fitting a model with Firth bias reduction was performed by implementing the modified IRLS algorithm (2.31).

We started the semiparametric locally efficient (SLE) approaches by obtaining the unadjusted estimate $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$, with $\hat{\beta}$ the unadjusted estimate of the marginal treatment effect as a log odds ratio. This unadjusted estimate was found by solving the estimating equation (2.2) with estimating function (2.3). Next, separate models were fitted in both treatment groups to predict the endpoint. The model fitted in the treatment group was used to predict the endpoint $Y$ for all subjects, as if all subjects were treated: $\hat{E}(Y \mid X, A = 1)$. Likewise, the model in the control group was used to predict the endpoint for all subjects, setting the treatment variable to 0: $\hat{E}(Y \mid X, A = 0)$. These predicted outcomes were then used to obtain expected values for the estimating function as in (2.10), for $g = 0, 1$. These were then plugged into the estimating equation (2.9). Solving this estimating equation gave the estimate $\tilde{\theta} = (\tilde{\alpha}, \tilde{\beta})$, with $\tilde{\beta}$ the adjusted estimate of the marginal treatment effect as a log odds ratio and $a_g(X) = \hat{E}[m(Y, A; \theta) \mid X, A = g]$ for $g = 0, 1$. An estimate of the variance-covariance matrix for $\tilde{\theta}$ was found using (2.7), with $\Sigma$ obtained by (2.8). An estimate for the variance of $\tilde{\beta}$ was found as the value of row 2, column 2 of this matrix. This was the general approach used for all SLE estimators; the differences between the SLE estimators concern the determination of the predicted outcomes.

SLE1 used model selection in order to fit the separate models in both treatment groups to predict the endpoint. In both treatment groups separately, all possible subsets of the full model containing all eight baseline covariates as linear terms and no interaction terms were considered. These models all included an intercept and excluded the treatment variable. The full model was fitted using logistic regression in each of the two treatment groups separately. The weights obtained in the last iteration of the IRLS algorithm were then used to perform a weighted linear regression for all possible subsets of the full model (separately for both treatment groups). Subsequently, an optimal model was selected from these weighted linear models in each of the two treatment groups by using Mallows' criterion:

$$C_p = \frac{SSE_p}{MSE_8} - n + 2p,$$

with $n$ the sample size, $p$ the number of parameters in the model, $SSE_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i,p})^2$ the error sum of squares of the model with $p$ parameters and $\hat{Y}_{i,p}$ the predicted outcomes of the model with $p$ parameters. $MSE_8 = \frac{1}{n-9} \sum_{i=1}^{n} (Y_i - \hat{Y}_{i,8})^2$ is the mean squared error of the model containing all eight baseline covariates and $\hat{Y}_{i,8}$ the predicted outcomes of the model containing all baseline covariates. In each treatment group, the model with the smallest $C_p$ was selected as the optimal model. For each treatment group separately, the optimal model was then refitted using logistic regression and this fit was then used to predict the outcomes. All models for SLE1 were fitted using standard ML estimation.
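A sketch of the SLE1 all-subsets selection within one treatment group (Python/numpy, illustrative; the function name is ours): weighted least squares over every covariate subset, scored by Mallows' Cp against the full eight-covariate model, with w the final IRLS weights of the full logistic fit:

```python
import numpy as np
from itertools import combinations

def mallows_cp_select(Xg, yg, w):
    """Return the covariate indices minimizing Mallows' Cp within one
    treatment group; an intercept is always included."""
    n, k = Xg.shape                       # k = 8 baseline covariates

    def sse(cols):
        Z = np.column_stack([np.ones(n), Xg[:, list(cols)]])
        ZtW = Z.T * w                     # weighted normal equations
        beta = np.linalg.solve(ZtW @ Z, ZtW @ yg)
        return np.sum((yg - Z @ beta) ** 2)

    mse_full = sse(range(k)) / (n - (k + 1))
    best, best_cp = (), np.inf
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            p = size + 1                  # parameters including intercept
            cp = sse(cols) / mse_full - n + 2 * p
            if cp < best_cp:
                best, best_cp = cols, cp
    return best
```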

For SLE2 we fitted a logistic regression model containing an intercept and all baseline covariates in both treatment groups separately:

$$\text{logit}\{P(Y = 1 \mid A = g, X)\} = \beta_{0,g} + \beta_{1,g} X_1 + \beta_{2,g} X_2 + \beta_{3,g} X_3 + \beta_{4,g} X_4 + \beta_{5,g} X_5 + \beta_{6,g} X_6 + \beta_{7,g} X_7 + \beta_{8,g} X_8, \qquad (3.1)$$

with $g = 0$ for the control group and $g = 1$ for the treatment group. In this regression, the outcomes for the control group ($g = 0$) were inversely weighted by $1 - \hat{e}_i$, with $\hat{e}_i$ the estimated propensity score for subject $i$. Likewise, the outcomes for the treatment group ($g = 1$) were inversely weighted by the estimated propensity score $\hat{e}_i$ for each subject. This way of inversely weighting a regression model will be called inverse weighting using the propensity scores hereafter. The estimated propensity scores were obtained by fitting a logistic regression model for the treatment variable $A$, with all eight baseline covariates as explanatory variables:

$$\text{logit}\{P(A = 1 \mid X)\} = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \gamma_4 X_4 + \gamma_5 X_5 + \gamma_6 X_6 + \gamma_7 X_7 + \gamma_8 X_8. \qquad (3.2)$$

The propensity scores used were the predicted values for the treatment variable, $\hat{e}(X_i)$, obtained using model (3.2). In this text, inverse weighting using the propensity scores always used model (3.2) for the propensity scores. When inverse weighting using the propensity scores was done for a Firth bias-reduced regression model, the propensity scores were obtained by fitting model (3.2) using Firth bias reduction as well; otherwise, standard ML estimation was used. The remainder of the SLE approaches all employed inverse weighting using propensity scores for the initial outcome models.

SLE3 also fitted the full model (3.1), but differed from SLE2 in the use of Firth bias reduction for fitting these models. The fourth SLE approach (SLE4) was an efficient approach in that it used the correct model to predict the outcomes. In both treatment groups separately, a logistic regression model including the first four baseline covariates and an intercept was fitted for the outcome:

$$\text{logit}\{P(Y = 1 \mid A = g, X)\} = \beta_{0,g} + \beta_{1,g} X_1 + \beta_{2,g} X_2 + \beta_{3,g} X_3 + \beta_{4,g} X_4, \qquad (3.3)$$

with $g = 0$ for the control group and $g = 1$ for the treatment group. The fifth SLE method (SLE5) differed from SLE4 in the model fitting algorithm: Firth bias-reduced regression was used for fitting model (3.3).

SLE6 used the following model selection procedure for the initial outcome models: the selection algorithm started from the full model (3.1) in both treatment groups. At each iteration, the baseline covariate with the highest p-value was removed from the model, as long as this p-value was greater than 0.2. The final models in the two treatment groups were then used to predict the outcomes for both treatment groups separately. This model selection procedure is referred to as backward elimination hereafter. Backward elimination for SLE6 kept the intercept in the model at all times. All models for this model selection procedure were fitted using standard ML estimation and inversely weighted using propensity scores. SLE7 used backward elimination to obtain the predicted outcomes as well, differing from SLE6 in the use of Firth bias reduction in fitting the weighted outcome models as well as the propensity score model. The intercept was always kept in the model, as was the case for SLE6.

SLE8 used a misspecified logistic regression model, in both treatment groups separately, to predict the outcomes. These models included the first baseline covariate, the second baseline covariate squared, the sixth baseline covariate and an intercept:

$$\text{logit}\{P(Y = 1 \mid A = g, X)\} = \beta_{0,g} + \beta_{1,g} X_1 + \beta_{2,g} X_2^2 + \beta_{3,g} X_6, \qquad (3.4)$$

with $g = 0$ for the control group and $g = 1$ for the treatment group. All SLE8 models were fitted using standard ML estimation. The ninth and final SLE approach (SLE9) differed from SLE8 in the use of Firth bias reduction for the weighted outcome models as well as the propensity score (PS) model. A summary of the different semiparametric locally efficient estimators can be found in table 3.1 below.

Table 3.1: SLE approaches

Method | Initial outcome model logit{P(Y = 1 | A = g, X)}, g = 0, 1 | Inverse weighting of initial outcome model
SLE1   | All possible subsets selection     | No
SLE2   | All baseline covariates            | PS
SLE3   | All baseline covariates (Firth BR) | PS (Firth BR)
SLE4   | Correct model                      | PS
SLE5   | Correct model (Firth BR)           | PS (Firth BR)
SLE6   | Backward elimination               | PS
SLE7   | Backward elimination (Firth BR)    | PS (Firth BR)
SLE8   | Model misspecification             | PS
SLE9   | Model misspecification (Firth BR)  | PS (Firth BR)

For all the G-computation approaches, an initial outcome model for $\text{logit}\{E(Y \mid A, X)\}$ was fitted first. The counterfactuals for both treatment regimens $a = 0, 1$ were then predicted for all observations using this fitted model. The full dataset of counterfactuals was then used to determine an estimate for the marginal treatment effect as a log odds ratio by (2.12), in case the initial outcome model was fitted using standard ML estimation. When the initial outcome model was fitted using Firth bias reduction, the MSM (2.11) was fitted on the full dataset of counterfactuals using Firth bias reduction. The standard error was found by bootstrapping the treatment effect and determining the standard deviation of these resampled treatment effects; nonparametric bootstrapping (i.e. the ordinary bootstrap) was used with 500 replicates. The first G-computation model (GC1) started by fitting a full model for the initial outcome model, which included


Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Modern Statistical Learning Methods for Observational Biomedical Data. Chapter 2: Basic identification and estimation of an average treatment effect

Modern Statistical Learning Methods for Observational Biomedical Data. Chapter 2: Basic identification and estimation of an average treatment effect Modern Statistical Learning Methods for Observational Biomedical Data Chapter 2: Basic identification and estimation of an average treatment effect David Benkeser Emory Univ. Marco Carone Univ. of Washington

More information

This is the submitted version of the following book chapter: stat08068: Double robustness, which will be

This is the submitted version of the following book chapter: stat08068: Double robustness, which will be This is the submitted version of the following book chapter: stat08068: Double robustness, which will be published in its final form in Wiley StatsRef: Statistics Reference Online (http://onlinelibrary.wiley.com/book/10.1002/9781118445112)

More information

Propensity Score Methods for Causal Inference

Propensity Score Methods for Causal Inference John Pura BIOS790 October 2, 2015 Causal inference Philosophical problem, statistical solution Important in various disciplines (e.g. Koch s postulates, Bradford Hill criteria, Granger causality) Good

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Extending causal inferences from a randomized trial to a target population

Extending causal inferences from a randomized trial to a target population Extending causal inferences from a randomized trial to a target population Issa Dahabreh Center for Evidence Synthesis in Health, Brown University issa dahabreh@brown.edu January 16, 2019 Issa Dahabreh

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai

Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective. Anastasios (Butch) Tsiatis and Xiaofei Bai Optimal Treatment Regimes for Survival Endpoints from a Classification Perspective Anastasios (Butch) Tsiatis and Xiaofei Bai Department of Statistics North Carolina State University 1/35 Optimal Treatment

More information

A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure

A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure arxiv:1706.02675v2 [stat.me] 2 Apr 2018 Laura B. Balzer, Wenjing Zheng,

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

More information

3 Joint Distributions 71

3 Joint Distributions 71 2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

PRINCIPLES OF STATISTICAL INFERENCE

PRINCIPLES OF STATISTICAL INFERENCE Advanced Series on Statistical Science & Applied Probability PRINCIPLES OF STATISTICAL INFERENCE from a Neo-Fisherian Perspective Luigi Pace Department of Statistics University ofudine, Italy Alessandra

More information

Collaborative Targeted Maximum Likelihood Estimation. Susan Gruber

Collaborative Targeted Maximum Likelihood Estimation. Susan Gruber Collaborative Targeted Maximum Likelihood Estimation by Susan Gruber A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Biostatistics in the

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Standardization methods have been used in epidemiology. Marginal Structural Models as a Tool for Standardization ORIGINAL ARTICLE

Standardization methods have been used in epidemiology. Marginal Structural Models as a Tool for Standardization ORIGINAL ARTICLE ORIGINAL ARTICLE Marginal Structural Models as a Tool for Standardization Tosiya Sato and Yutaka Matsuyama Abstract: In this article, we show the general relation between standardization methods and marginal

More information

Targeted Learning for High-Dimensional Variable Importance

Targeted Learning for High-Dimensional Variable Importance Targeted Learning for High-Dimensional Variable Importance Alan Hubbard, Nima Hejazi, Wilson Cai, Anna Decker Division of Biostatistics University of California, Berkeley July 27, 2016 for Centre de Recherches

More information

Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing

Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing Primal-dual Covariate Balance and Minimal Double Robustness via (Joint work with Daniel Percival) Department of Statistics, Stanford University JSM, August 9, 2015 Outline 1 2 3 1/18 Setting Rubin s causal

More information

Adaptive Trial Designs

Adaptive Trial Designs Adaptive Trial Designs Wenjing Zheng, Ph.D. Methods Core Seminar Center for AIDS Prevention Studies University of California, San Francisco Nov. 17 th, 2015 Trial Design! Ethical:!eg.! Safety!! Efficacy!

More information

ENHANCED PRECISION IN THE ANALYSIS OF RANDOMIZED TRIALS WITH ORDINAL OUTCOMES

ENHANCED PRECISION IN THE ANALYSIS OF RANDOMIZED TRIALS WITH ORDINAL OUTCOMES Johns Hopkins University, Dept. of Biostatistics Working Papers 10-22-2014 ENHANCED PRECISION IN THE ANALYSIS OF RANDOMIZED TRIALS WITH ORDINAL OUTCOMES Iván Díaz Johns Hopkins University, Johns Hopkins

More information

Double Robustness. Bang and Robins (2005) Kang and Schafer (2007)

Double Robustness. Bang and Robins (2005) Kang and Schafer (2007) Double Robustness Bang and Robins (2005) Kang and Schafer (2007) Set-Up Assume throughout that treatment assignment is ignorable given covariates (similar to assumption that data are missing at random

More information

Propensity Score Methods, Models and Adjustment

Propensity Score Methods, Models and Adjustment Propensity Score Methods, Models and Adjustment Dr David A. Stephens Department of Mathematics & Statistics McGill University Montreal, QC, Canada. d.stephens@math.mcgill.ca www.math.mcgill.ca/dstephens/siscr2016/

More information

Robust Estimation of Inverse Probability Weights for Marginal Structural Models

Robust Estimation of Inverse Probability Weights for Marginal Structural Models Robust Estimation of Inverse Probability Weights for Marginal Structural Models Kosuke IMAI and Marc RATKOVIC Marginal structural models (MSMs) are becoming increasingly popular as a tool for causal inference

More information

Diagnosing and responding to violations in the positivity assumption.

Diagnosing and responding to violations in the positivity assumption. University of California, Berkeley From the SelectedWorks of Maya Petersen 2012 Diagnosing and responding to violations in the positivity assumption. Maya Petersen, University of California, Berkeley K

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial

Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial Targeted Maximum Likelihood Estimation for Adaptive Designs: Adaptive Randomization in Community Randomized Trial Mark J. van der Laan 1 University of California, Berkeley School of Public Health laan@berkeley.edu

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Targeted Group Sequential Adaptive Designs

Targeted Group Sequential Adaptive Designs Targeted Group Sequential Adaptive Designs Mark van der Laan Department of Biostatistics, University of California, Berkeley School of Public Health Liver Forum, May 10, 2017 Targeted Group Sequential

More information

Targeted Maximum Likelihood Estimation of the Parameter of a Marginal Structural Model

Targeted Maximum Likelihood Estimation of the Parameter of a Marginal Structural Model Johns Hopkins Bloomberg School of Public Health From the SelectedWorks of Michael Rosenblum 2010 Targeted Maximum Likelihood Estimation of the Parameter of a Marginal Structural Model Michael Rosenblum,

More information

OUTCOME REGRESSION AND PROPENSITY SCORES (CHAPTER 15) BIOS Outcome regressions and propensity scores

OUTCOME REGRESSION AND PROPENSITY SCORES (CHAPTER 15) BIOS Outcome regressions and propensity scores OUTCOME REGRESSION AND PROPENSITY SCORES (CHAPTER 15) BIOS 776 1 15 Outcome regressions and propensity scores Outcome Regression and Propensity Scores ( 15) Outline 15.1 Outcome regression 15.2 Propensity

More information

A causal inference approach to network meta-analysis

A causal inference approach to network meta-analysis arxiv:1506.01583v2 [stat.me] 11 Aug 2016 A causal inference approach to network meta-analysis M. E. Schnitzer 1, R. J. Steele 2, M. Bally 3, and I. Shrier 4 1 Université de Montréal, 2 McGill University

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

arxiv: v1 [stat.me] 15 May 2011

arxiv: v1 [stat.me] 15 May 2011 Working Paper Propensity Score Analysis with Matching Weights Liang Li, Ph.D. arxiv:1105.2917v1 [stat.me] 15 May 2011 Associate Staff of Biostatistics Department of Quantitative Health Sciences, Cleveland

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Comparative effectiveness of dynamic treatment regimes

Comparative effectiveness of dynamic treatment regimes Comparative effectiveness of dynamic treatment regimes An application of the parametric g- formula Miguel Hernán Departments of Epidemiology and Biostatistics Harvard School of Public Health www.hsph.harvard.edu/causal

More information

Package Rsurrogate. October 20, 2016

Package Rsurrogate. October 20, 2016 Type Package Package Rsurrogate October 20, 2016 Title Robust Estimation of the Proportion of Treatment Effect Explained by Surrogate Marker Information Version 2.0 Date 2016-10-19 Author Layla Parast

More information

Causal Sensitivity Analysis for Decision Trees

Causal Sensitivity Analysis for Decision Trees Causal Sensitivity Analysis for Decision Trees by Chengbo Li A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer

More information

Estimating the Marginal Odds Ratio in Observational Studies

Estimating the Marginal Odds Ratio in Observational Studies Estimating the Marginal Odds Ratio in Observational Studies Travis Loux Christiana Drake Department of Statistics University of California, Davis June 20, 2011 Outline The Counterfactual Model Odds Ratios

More information

The propensity score with continuous treatments

The propensity score with continuous treatments 7 The propensity score with continuous treatments Keisuke Hirano and Guido W. Imbens 1 7.1 Introduction Much of the work on propensity score analysis has focused on the case in which the treatment is binary.

More information

Integrated approaches for analysis of cluster randomised trials

Integrated approaches for analysis of cluster randomised trials Integrated approaches for analysis of cluster randomised trials Invited Session 4.1 - Recent developments in CRTs Joint work with L. Turner, F. Li, J. Gallis and D. Murray Mélanie PRAGUE - SCT 2017 - Liverpool

More information

Guideline on adjustment for baseline covariates in clinical trials

Guideline on adjustment for baseline covariates in clinical trials 26 February 2015 EMA/CHMP/295050/2013 Committee for Medicinal Products for Human Use (CHMP) Guideline on adjustment for baseline covariates in clinical trials Draft Agreed by Biostatistics Working Party

More information

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall

Structural Nested Mean Models for Assessing Time-Varying Effect Moderation. Daniel Almirall 1 Structural Nested Mean Models for Assessing Time-Varying Effect Moderation Daniel Almirall Center for Health Services Research, Durham VAMC & Dept. of Biostatistics, Duke University Medical Joint work

More information

Dose-response modeling with bivariate binary data under model uncertainty

Dose-response modeling with bivariate binary data under model uncertainty Dose-response modeling with bivariate binary data under model uncertainty Bernhard Klingenberg 1 1 Department of Mathematics and Statistics, Williams College, Williamstown, MA, 01267 and Institute of Statistics,

More information

Gov 2002: 13. Dynamic Causal Inference

Gov 2002: 13. Dynamic Causal Inference Gov 2002: 13. Dynamic Causal Inference Matthew Blackwell December 19, 2015 1 / 33 1. Time-varying treatments 2. Marginal structural models 2 / 33 1/ Time-varying treatments 3 / 33 Time-varying treatments

More information

Causal Effect Models for Realistic Individualized Treatment and Intention to Treat Rules

Causal Effect Models for Realistic Individualized Treatment and Intention to Treat Rules University of California, Berkeley From the SelectedWorks of Maya Petersen March, 2007 Causal Effect Models for Realistic Individualized Treatment and Intention to Treat Rules Mark J van der Laan, University

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification Todd MacKenzie, PhD Collaborators A. James O Malley Tor Tosteson Therese Stukel 2 Overview 1. Instrumental variable

More information

Targeted Learning with Daily EHR Data

Targeted Learning with Daily EHR Data Targeted Learning with Daily EHR Data Oleg Sofrygin 1,2, Zheng Zhu 1, Julie A Schmittdiel 1, Alyce S. Adams 1, Richard W. Grant 1, Mark J. van der Laan 2, and Romain Neugebauer 1 arxiv:1705.09874v1 [stat.ap]

More information

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid Applied Economics Regression with a Binary Dependent Variable Department of Economics Universidad Carlos III de Madrid See Stock and Watson (chapter 11) 1 / 28 Binary Dependent Variables: What is Different?

More information

Estimating the Mean Response of Treatment Duration Regimes in an Observational Study. Anastasios A. Tsiatis.

Estimating the Mean Response of Treatment Duration Regimes in an Observational Study. Anastasios A. Tsiatis. Estimating the Mean Response of Treatment Duration Regimes in an Observational Study Anastasios A. Tsiatis http://www.stat.ncsu.edu/ tsiatis/ Introduction to Dynamic Treatment Regimes 1 Outline Description

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Vector-Based Kernel Weighting: A Simple Estimator for Improving Precision and Bias of Average Treatment Effects in Multiple Treatment Settings

Vector-Based Kernel Weighting: A Simple Estimator for Improving Precision and Bias of Average Treatment Effects in Multiple Treatment Settings Vector-Based Kernel Weighting: A Simple Estimator for Improving Precision and Bias of Average Treatment Effects in Multiple Treatment Settings Jessica Lum, MA 1 Steven Pizer, PhD 1, 2 Melissa Garrido,

More information

General Regression Model

General Regression Model Scott S. Emerson, M.D., Ph.D. Department of Biostatistics, University of Washington, Seattle, WA 98195, USA January 5, 2015 Abstract Regression analysis can be viewed as an extension of two sample statistical

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Cross-fitting and fast remainder rates for semiparametric estimation

Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

Biomarkers for Disease Progression in Rheumatology: A Review and Empirical Study of Two-Phase Designs

Biomarkers for Disease Progression in Rheumatology: A Review and Empirical Study of Two-Phase Designs Department of Statistics and Actuarial Science Mathematics 3, 200 University Avenue West, Waterloo, Ontario, Canada, N2L 3G1 519-888-4567, ext. 33550 Fax: 519-746-1875 math.uwaterloo.ca/statistics-and-actuarial-science

More information

Robust Semiparametric Regression Estimation Using Targeted Maximum Likelihood with Application to Biomarker Discovery and Epidemiology

Robust Semiparametric Regression Estimation Using Targeted Maximum Likelihood with Application to Biomarker Discovery and Epidemiology Robust Semiparametric Regression Estimation Using Targeted Maximum Likelihood with Application to Biomarker Discovery and Epidemiology by Catherine Ann Tuglus A dissertation submitted in partial satisfaction

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary Treatments

Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary Treatments Propensity-Score Based Methods for Causal Inference in Observational Studies with Fixed Non-Binary reatments Shandong Zhao Department of Statistics, University of California, Irvine, CA 92697 shandonm@uci.edu

More information

Casual Mediation Analysis

Casual Mediation Analysis Casual Mediation Analysis Tyler J. VanderWeele, Ph.D. Upcoming Seminar: April 21-22, 2017, Philadelphia, Pennsylvania OXFORD UNIVERSITY PRESS Explanation in Causal Inference Methods for Mediation and Interaction

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division

More information