Adaptive designs beyond p-value combination methods
Ekkehard Glimm, Novartis Pharma
EAST user group meeting, Basel, 31 May 2013
Outline
- Introduction
- Combination-p-value method and conditional error function approach
- "Unconventional" combination methods:
  Example 1: Treatment arm selection
  Example 2: Sample size re-estimation
Introduction
Adaptive design: a clinical trial split into several stages, with an interim analysis after every stage. The conduct of subsequent trial stages depends on the interim analysis results.
For example, a Novartis trial in COPD (Barnes et al., 2010) started with 4 different dosing regimens of indacaterol. After the interim analysis, 2 of these were dropped.
If the only option at the interim analysis is to stop the trial (for efficacy or futility), the design is a group-sequential trial.
Introduction
Possible adaptations:
- Sample size
- Sample size allocation to treatments
- Treatment arms (delete, add, change)
- Population (e.g., inclusion/exclusion criteria, subgroups)
- Statistical test strategy (e.g., primary endpoint, switching of hypotheses)
Possible benefits:
- Correct wrong assumptions
- Select the most promising option early
- Make use of emerging information external to the trial
- React earlier to "surprises" (positive and negative)
Overall, adaptations are likely to lead to more reliable information earlier.
Introduction
If not properly accounted for, adaptations can introduce selection bias, type I error inflation, etc. For control of the type I error rate α, it is generally necessary that
- rules for early rejection of hypotheses are stated in the study protocol,
- rules for combining evidence across stages and the associated multiplicity adjustments are defined up front (in the protocol),
- for regulatory purposes, the class of envisaged decisions after stage 1 is stated (in the protocol).
Depending on the analysis method used, the specific rules for adaptation
- do not need to be pre-specified,
- can make use of Bayesian principles integrating all available information, including information external to the study,
- should be evaluated (e.g. via simulations) and a preferred version recommended, e.g. in the DMC charter.
Introduction
Sources of potential α inflation and the corresponding adjustment methods:
- Early rejection of null hypotheses at the interim analysis, addressed by classical group sequential plans, e.g. the α-spending approach.
- Adaptation of design features and combination of information across trial stages (e.g. sample size reassessment based on the treatment effect, next interim analysis, testing strategy), addressed by combination of p-values (e.g. inverse normal method, Fisher's combination test) or the conditional error function.
- Multiple hypothesis testing (e.g. with adaptive selection of hypotheses at the interim analysis), addressed by multiple testing methodology, e.g. closed test procedures.
All three approaches can be combined.
Introduction
For simplicity, assume a two-stage trial of a single treatment versus a control. For each stage and test, there is a test statistic, perhaps a "conventional test statistic", e.g. t_j from a t-test of the treatment after stage j (1 = interim, 2 = final), or the corresponding p-value p_j (e.g. p_j = 1 − probt(t_j)).
H is a hypothesis about the treatment (e.g. "treatment no better than control").
Combination Tests
Inverse normal combination function. Final decision: reject H (e.g. an intersection null hypothesis) if and only if
C(p1, p2) = w1 Φ⁻¹(1 − p1) + √(1 − w1²) Φ⁻¹(1 − p2) > Φ⁻¹(1 − α),
where the weight w1, 0 < w1 < 1, has to be fixed a priori.
Fisher's product combination test (a special case of inverse chi-square combination). Final decision: reject H (e.g. an intersection hypothesis) if and only if
C(p1, p2) = p1 p2 ≤ c_α = exp(−χ²(4, α)/2),
where χ²(4, α) denotes the upper α quantile of the chi-square distribution with 4 degrees of freedom.
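As a minimal sketch of the two rejection rules above (my own illustration, not part of the talk; the function names and example p-values are arbitrary):

    import numpy as np
    from scipy.stats import norm, chi2

    def inverse_normal_reject(p1, p2, w1, alpha=0.025):
        # C(p1,p2) = w1*Phi^-1(1-p1) + sqrt(1-w1^2)*Phi^-1(1-p2) > Phi^-1(1-alpha)
        w2 = np.sqrt(1.0 - w1**2)
        C = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
        return C > norm.ppf(1 - alpha)

    def fisher_reject(p1, p2, alpha=0.025):
        # reject if p1*p2 <= c_alpha = exp(-chi2(4, alpha)/2),
        # with chi2(4, alpha) the upper-alpha quantile of the chi-square with 4 df
        c_alpha = np.exp(-0.5 * chi2.ppf(1 - alpha, df=4))
        return p1 * p2 <= c_alpha

    # Example with equal weights (w1 = w2 = sqrt(0.5)) and arbitrary stage-wise p-values
    print(inverse_normal_reject(0.04, 0.03, w1=np.sqrt(0.5)), fisher_reject(0.04, 0.03))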
Combination Tests with repeated hypothesis testing
Inverse normal combination function. At the interim analysis, stop the study if p1 ≤ α1 (reject H) or p1 ≥ α0 (retain H); if α1 < p1 < α0, continue the study, resulting in p2. Final decision: reject H if and only if
C(p1, p2) = w1 Φ⁻¹(1 − p1) + √(1 − w1²) Φ⁻¹(1 − p2) > Φ⁻¹(1 − α2).
Given α, α0 and α1, α2 is determined as for standard group sequential trials.
Fisher's product combination test (a special case of inverse chi-square combination). Final decision: reject H if and only if
C(p1, p2) = p1 p2 ≤ c_α2 = exp(−χ²(4, α2)/2).
Given α, α0 and α1, α2 is determined by α1 + c_α2 (ln α0 − ln α1) = α.
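For Fisher's combination, the level condition can be solved for c_α2 in closed form; the short sketch below (my own illustration, with assumed values for α, α0 and α1) does just that.

    import numpy as np

    def fisher_c_alpha2(alpha, alpha0, alpha1):
        # rearranging alpha1 + c_alpha2*(ln(alpha0) - ln(alpha1)) = alpha gives c_alpha2 directly
        return (alpha - alpha1) / (np.log(alpha0) - np.log(alpha1))

    # Illustrative boundaries: overall alpha = 0.025, futility bound alpha0 = 0.5, efficacy bound alpha1 = 0.01
    print(fisher_c_alpha2(0.025, 0.5, 0.01))   # stage-2 critical value for p1*p2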
Conditional Error Function (CEF) approach
The conditional rejection error after n1 observations is the conditional probability (under H) of rejecting H with a test φ on n = n1 + n2 observations, given the first n1 observations:
A(x1, ..., x_n1) := P_H[ φ rejects H | x1, ..., x_n1 ],
where (x1, ..., x_n1) are the first n1 observations. The conditional rejection error is just a special case of the conditional power; whenever one is able to compute the conditional power, one can compute the CEF.
Final test: reject H after n2 additional observations with another valid test φ' if it is significant at level A(x1, ..., x_n1).
n and φ need to be pre-defined (but φ' and n2 do not).
The principle can be applied to group sequential testing procedures and to closed testing procedures.
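For the special case of a one-sided z-test with known σ, the CEF of the fixed-sample test "reject if Z_n > Φ⁻¹(1 − α)" has a closed form, A(z1) = 1 − Φ((Φ⁻¹(1−α)√n − √n1·z1)/√n2). The sketch below (my own illustration; the function name is an assumption) evaluates it given the standardized stage-1 statistic.

    import numpy as np
    from scipy.stats import norm

    def conditional_error_ztest(z1, n1, n2, alpha=0.025):
        # z1: standardized stage-1 statistic sqrt(n1)*xbar1/sigma
        n = n1 + n2
        crit = norm.ppf(1 - alpha)
        return norm.sf((crit * np.sqrt(n) - np.sqrt(n1) * z1) / np.sqrt(n2))

    # Stage 2 (possibly redesigned) is then tested at this level with another valid test phi'
    print(conditional_error_ztest(z1=1.5, n1=50, n2=50))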
Conditional Error Function (CEF) approach
For the simple case of a single hypothesis H tested with the test φ, the combination-p-value approach is the same as the CEF approach using the weights w1 = w, w2 = √(1 − w²) and the critical value c = Φ⁻¹(1 − α2).
How these methods are usually applied
The approaches are usually presented in very general terms. Then most texts quickly move to combinations of stochastically independent p-values as the quintessential example. However, there is nothing in the general CEF approach that would prevent us from
- using test statistics other than p-values,
- combining non-independent test statistics from the two stages (if we can work out the null distribution of the combination).
Motivation for looking beyond standard applications
If an explicit selection rule is given, it is often possible to work out the precise distribution of the final test statistic. Examples: group sequential trials; sample size re-estimation with the stage-2 sample size given as an explicit function of the stage-1 data.
Combination-p-value methods and the CEF approach allow full flexibility.
There is a "middle ground": some restrictions on design modifications, but no fixed rules. The remainder of the talk focuses on this.
Example 1: Treatment arm selection
Setup
Stage 1: k regimens of an experimental treatment.
Interim analysis: q out of these k treatments are selected (no fixed rule).
Stage 2: test for superiority against a control group.
Restriction on decisions: control group recruitment follows a fixed design irrespective of stages; no control group data are used in the interim decisions.
Motivation
A trial with a well-established drug in a new indication, in a very small population of severely impaired pediatric patients. We are fairly certain that the drug is efficacious, but there would be no hope of demonstrating superiority over control in a small phase II trial with several doses.
There is no comparator in stage 1 (the "Phase II component"); the comparator arm is included only in stage 2 (the "Phase III component"). Consequently, control data cannot influence interim decision making.
Notation (illustrated by example for simplicity)
- average response to treatment i in stage 1
- average response to control, all stages
- treatment S is selected, with stage-2 sample size n_S2
- average response to treatment S in stage 2
Test
The final decision is based on the test ϕ: reject H_S if t_I > c_I for all subsets I of {1, ..., k} containing S.
q(n) is just a normalizing constant of t_I. The critical values c_I are determined in such a way that the FWER is kept.
Test: Derivation
c_I can be obtained by numerical integration or by simulation: under H_I: μ_i = μ_0 for all i ∈ I,
- the quantities that make up t_I are all stochastically independent,
- the distribution of t_I does not depend on μ_0.
Test: Remarks
Hence, the null distribution of t_I is readily simulated from these components. In this way, a "Dunnett-test-like" table of critical values can be compiled.
Note that this is a conditional test; the stochastic independence holds between the unconditional stage-1 components and the conditional stage-2 component.
The test keeps the FWER (conditionally and unconditionally).
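Purely as an illustration of this simulation recipe (not the talk's exact statistic), the sketch below assumes σ = 1 known and a hypothetical Dunnett-like combination t_I = (w1·max_{i∈I} x̄_i1 + w2·x̄_S2 − (w1+w2)·x̄_0)/q(n); the form of t_I and q(n), the function name, and the illustrative sample sizes and weights are all assumptions. The critical value c_I is the simulated (1 − α) quantile under H_I.

    import numpy as np

    def crit_value(q_members, n11, nS2, n0, w1, w2, alpha=0.05, nsim=100_000, seed=1):
        rng = np.random.default_rng(seed)
        # independent components under H_I (all means equal; set to 0 w.l.o.g., sigma = 1)
        m1 = rng.normal(0, 1/np.sqrt(n11), size=(nsim, q_members))  # stage-1 means of arms in I
        m2 = rng.normal(0, 1/np.sqrt(nS2), size=nsim)               # stage-2 mean of selected arm
        m0 = rng.normal(0, 1/np.sqrt(n0),  size=nsim)               # control mean (all data)
        q_n = np.sqrt(w1**2/n11 + w2**2/nS2 + (w1 + w2)**2/n0)      # assumed normalizing constant q(n)
        t_I = (w1*m1.max(axis=1) + w2*m2 - (w1 + w2)*m0) / q_n
        return np.quantile(t_I, 1 - alpha)

    # Illustrative values: 12 patients per stage-1 arm, 30 in the selected arm, 42 controls
    print(crit_value(q_members=3, n11=12, nS2=30, n0=42, w1=12/42, w2=30/42))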
Test: Weights
Assume equal sample sizes for all arms within a stage, i.e.
- n_11 for all stage-1 treatment arms,
- n_S2 for all stage-2 treatment arms,
- n_0 = n_11 + n_S2 in the control group.
Suggestion: use w1 = n_11/(n_11 + n_S2) and w2 = n_S2/(n_11 + n_S2).
Justification: this is the optimal weight for H_S if n_S2 = n*_S2, i.e. for the test of the selected treatment only, if the "originally planned" sample size and the actually used sample size are the same. In that case this is also the ordinary t-test for H_S.
Assessment: Comparison with combi-p-value
Combination-p-value approach (CPV): H_I is tested by
C(p1^I, p2) = w1 Φ⁻¹(1 − p1^I) + w2 Φ⁻¹(1 − p2) > Φ⁻¹(1 − α).
H_S is rejected if C(p1^I, p2) > Φ⁻¹(1 − α) holds for all I containing S.
p1^I is the p-value of the Dunnett test on the stage-1 data.
As this method is based on separate tests in stage 1 and stage 2, the placebo group must be split into two stages as well.
Assessment: Comparison with combi-p-value
If σ is assumed known, there is no conceptual difficulty with this:
- For H_S, the test ϕ and the CPV test are identical (if the same weights are used).
- For the other intersection hypotheses H_I, they are not, but they are very similar in terms of power.
- The weights w1 and w2 are not optimal for H_I in either approach.
If σ is assumed unknown, it is less clear how to do the p-value combination:
- Just ignore the fact that σ is estimated?
- Estimate σ only at the end of the trial? Then what is the distribution of C(p1^I, p2)?
- It is not intuitively appealing to convert a t-test statistic into a p-value and then to "convert back" with a normal score.
Assessment: Comparison with combi-p-value
Advantages of the suggested test ϕ:
- No need to split the placebo group into a "stage 1" and a "stage 2" part.
- Avoids the "back-and-forth" transformation of the original test statistics.
Summary of simulation results: the test ϕ keeps the FWER, CPV does not
- stage 1: 12 patients per arm, 3 treatment arms,
- stage 2: 30 patients in the selected arm,
- placebo: 42 patients; 100'000 simulations, error level 5%: observed error between 4.9% and 5.1% for ϕ, between 5.4% and 5.6% for CPV (various scenarios).
The methods are very similar in terms of power (only evaluated for σ assumed known).
There is a clear power gain over a "stage-2-only" test (approx. 90% vs. 83%).
Remarks: Some open issues
Both ϕ and CPV represent (slightly different) CEF approaches. CPV could be modified to keep the FWER as well; ϕ has no "intrinsic" power advantage over CPV.
Should we use this method in conventional adaptive designs (i.e. with a control in stages 1 and 2) as well?
Is it acceptable to bring in the control arm in stage 2 only?
Example 2: Blinded sample size review
Background
Sample size review (SSR): a re-evaluation of the sample size at an interim analysis.
- Unblinded SSR: the patient-treatment allocation is disclosed.
- Blinded SSR: the patient-treatment allocation is not disclosed.
It is intuitively clear that unblinded SSR can introduce bias (e.g. over-estimation of the treatment effect, type I error inflation when testing). What about blinded SSR?
Blinded SSR for superiority testing
One-sample t-test: observations x_i from N(µ, σ²). We want to test H0: µ = 0 with a t-test. There are n1 observations at the interim analysis; n2* are added for a total of n = n1 + n2*.
Notes: n2* is a random variable. Conceptually, the one- and two-sample cases are the same.
Blinded SSR for superiority testing (Kieser and Friede, 2006)
Re-evaluate n2* based on the total variance. n2* is a function of this quantity (and thus of the stage-1 data), but it depends on stage 1 only via this quantity.
At the final analysis: the unmodified (one-sample) t-test using n = n1 + n2*, as if n had been fixed.
The total variance in the two-sample case is the one-sample variance estimate obtained by ignoring group membership.
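As a minimal sketch of such a blinded re-estimation step (details assumed, not taken from the talk): pool all stage-1 observations ignoring treatment, estimate the total variance, and plug it into the usual z-approximation for the per-group sample size.

    import numpy as np
    from scipy.stats import norm

    def blinded_reestimate(pooled_stage1, delta, alpha=0.025, power=0.80):
        s2_total = np.var(pooled_stage1, ddof=1)   # one-sample variance ignoring groups
        n_per_group = 2 * (norm.ppf(1 - alpha) + norm.ppf(power))**2 * s2_total / delta**2
        return int(np.ceil(n_per_group))

    # Illustrative blinded stage-1 pool (treatment labels are not used)
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(0.4, 1.0, 30)])
    print(blinded_reestimate(x, delta=0.4))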
Blinded SSR: Investigation of the α level
This method is generally accepted ("revisions based on blinded interim evaluations of data (...) do not introduce statistical bias", FDA (2010) guidance), but it can be shown not to keep the type I error (to be fair: only in extreme cases, and even then the type I error inflation is minute).
Of course, the p-value combination method keeps the type I error here. But there are more powerful alternatives.
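A Monte Carlo check of the actual level could look like the following sketch (my own illustration, using the one-sample formulation from the "Blinded SSR for superiority testing" slide; the 80% power target, effect size and function name are assumptions): recompute n from the stage-1 variance, apply the ordinary t-test to all n observations as if n were fixed, and estimate the rejection rate under H0.

    import numpy as np
    from scipy.stats import norm, t as tdist

    def type1_blinded_ssr(n1=10, delta=0.5, alpha=0.05, nsim=100_000, seed=2):
        rng = np.random.default_rng(seed)
        rejections = 0
        for _ in range(nsim):
            x1 = rng.normal(0.0, 1.0, n1)                          # stage-1 data, H0 true
            s2 = np.var(x1, ddof=1)                                # nuisance-parameter estimate
            n_new = (norm.ppf(1 - alpha) + norm.ppf(0.8))**2 * s2 / delta**2
            n = max(n1 + 1, int(np.ceil(n_new)))                   # total sample size n1 + n2*
            x = np.concatenate([x1, rng.normal(0.0, 1.0, n - n1)])
            tstat = np.sqrt(n) * x.mean() / x.std(ddof=1)
            rejections += tstat > tdist.ppf(1 - alpha, df=n - 1)   # unmodified one-sample t-test
        return rejections / nsim

    print(type1_blinded_ssr())   # close to alpha; any inflation is very small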
Blinded SSR: alternatives
P-value combination and CEF: the weights must be fixed before the trial and must not depend on what has been observed (in the CEF approach, this is implicit). Consequence: there is an originally intended stage-2 sample size n2, and power is best if n2 = n2*. With blinded SSR, these methods therefore suffer from some power loss. Examples (Fisher's p-value combination):
- on average 3-4% less power than the unmodified one-sided two-sample t-test if the latter has around 80-90% power (simulation of several scenarios),
- a maximum power loss of 7% (76.4% vs. 69.6% when n1 = 30, true treatment effect 0.1, and treatment effect assumed for sample size estimation 0.15).
The power loss results from the great flexibility (possible interim modifications are not restricted to SSR). A small power loss also comes from the transformation of a t-test p-value with the normal quantile function Φ⁻¹(1 − p).
Blinded SSR: alternatives
New suggestion: direct combination of test statistics.
Let T1 ~ t(n1 − 1) and T2 ~ t(n2* − 1) (under H0) be the t-test statistics for stage 1 and stage 2, respectively. The distribution of T2 is conditional on the stage-1 total variance (and hence on n2*). It can be shown (using Fang and Zhang, 1990, Theorem 2.5.8) that, given this quantity and under H0,
- T1 ~ t(n1 − 1) and T2 ~ t(n2* − 1) still hold,
- T1 and T2 are stochastically independent.
This allows the use of data-dependent weights when combining stage 1 and stage 2.
Blinded SSR: new suggestion
Suggestion: use stage weights w1 and w2 that are the optimal weights for the corresponding z-test.
This is not covered by the conventional suggestions: the second weight depends on n2* and thus on the observed total variance in stage 1. The dependence on stage 1 must be exclusively via the total variance.
A weighted sum of two independent t-distributed random variables does not have a known standard distribution, but quantiles and p-values are easy to obtain (via simulation or numerical integration).
The unmodified t-test statistic is not a special case: it cannot be written as a weighted sum of the stage-1 and stage-2 statistics.
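A sketch of how the critical value could be obtained by simulation; the weights w1 = √(n1/(n1 + n2*)) and w2 = √(n2*/(n1 + n2*)) are my reading of "optimal weights for the corresponding z-test" and are an assumption, as is the function name.

    import numpy as np

    def crit_weighted_t_sum(n1, n2_star, alpha=0.05, nsim=1_000_000, seed=3):
        rng = np.random.default_rng(seed)
        w1 = np.sqrt(n1 / (n1 + n2_star))
        w2 = np.sqrt(n2_star / (n1 + n2_star))
        # weighted sum of two independent t variates, simulated under H0
        comb = w1 * rng.standard_t(n1 - 1, nsim) + w2 * rng.standard_t(n2_star - 1, nsim)
        return np.quantile(comb, 1 - alpha)

    # Final test: reject H0 if w1*T1 + w2*T2 exceeds this critical value
    print(crit_weighted_t_sum(n1=30, n2_star=25))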
Blinded SSR: power of the new suggestion
There is still a power loss relative to the unmodified t-test, but it is minute.
Power simulations (σ = 1, α = 0.05, 1'000'000 runs per scenario):
- stage-1 sample sizes 5 to 50 in steps of 5,
- many combinations of the true treatment effect µ and the assumed treatment effect δ in the range 0 to 1.5,
- after stage 1, the sample size is recalculated so as to retain 80% power with the assumed treatment effect and the estimated σ.
Except for extremely small sample sizes, power was virtually identical (always a tiny bit lower for the new suggestion; the difference was below 1% for all simulated scenarios with n1 ≥ 30).
Graph: Power vs. assumed treatment effect, n1 = 5
Graph: Power vs. assumed treatment effect, n1 = 30
Summary
Discussion of two cases where the CEF approach is used in an unconventional way. In both cases, the interim decisions are restricted, but not completely prespecified either.
Mathematical technicalities (e.g. null distributions of test statistics) are no longer a fundamental hurdle: the desired quantities (e.g. critical values) can always be found, as long as a "standard representation" in terms of a few known random variables can be derived.
Some open questions regarding the operating characteristics of adaptive designs remain (e.g. optimal weights).
References
Barnes, P.J., Pocock, S.J., Magnussen, H., Iqbal, A., Kramer, B., Higgins, M., Lawrence, D. (2010): Integrating indacaterol dose selection in a clinical study in COPD using an adaptive seamless design. Pulmonary Pharmacology & Therapeutics 23, 165-171.
Fang, K.-T. and Zhang, Y.-T. (1990): Generalized Multivariate Analysis. Springer, New York.
FDA (2010): Guidance for Industry: Adaptive Design Clinical Trials for Drugs and Biologics. Draft Guidance.
Friede, T., Kieser, M. (2006): Sample size recalculation in internal pilot study designs: a review. Biometrical Journal 48, 537-555.