Bios 6649: Clinical Trials - Statistical Design and Monitoring

Spring Semester 2015
John M. Kittelson
Department of Biostatistics & Informatics
Colorado School of Public Health, University of Colorado Denver
© 2015 John M. Kittelson, PhD

6. Other topics

Other topics in design of biomedical studies

6.1 Designs with more than 2 treatment groups
    (a) > 2 unrelated treatments (one-way ANOVA)
    (b) Factorial designs
    (c) Dose-response designs
    (d) Designs for regression endpoints
    (e) Designs using change from baseline endpoints
6.2 Design of non-inferiority trials
6.3 Prevention of bias (randomization and blinding)
6.4 Prevention and treatment of missing data

6.1(a) Trials with more than two unrelated treatments

Evaluating more than two unrelated treatments (1-way ANOVA)

Recall the inferential objective with a 2-group RCT:
* Parameterization of the treatment effect: $\theta$ (true mean treatment effect)
* Magnitude of $\hat\theta$ used to decide whether the new treatment is suited for use in practice
* Parameter space divided into regions of clinically important effects (in both directions)

What is the question with more than 2 treatment groups?
* Which treatment (of many) is best?
* Is there a dose-response relationship?
* Is this several studies with a shared control group?
* Is a combination treatment better than each of its components?

Evaluating more than two unrelated treatments (1-way ANOVA)

Inferential objective
* Is there a difference between the $K$ treatment groups, or are they all the same?
* (My observation: investigators are more often interested in identifying the particular treatments that differ from the rest, so the ANOVA F-test does not usually satisfy the scientific/clinical objectives.)

Statistical setting
* Hypotheses:
  * Null hypothesis: the mean outcome is the same in all groups ($\theta_k = \theta$ for $k = 1, \ldots, K$)
  * Alternative hypothesis: at least one mean differs from the others.

Evaluating more than two unrelated treatments (1-way ANOVA)

Statistical inference focuses on the F-test in a 1-way ANOVA:

$$F = \frac{SS_b/(K-1)}{SS_w/[K(N-1)]}$$

where $N$ is the number of subjects in each treatment group, and $SS_b$ and $SS_w$ denote the between-group and within-group sums of squares:

$$SS_b = \sum_{i=1}^{N}\sum_{k=1}^{K} (\bar Y_k - \bar Y)^2, \qquad SS_w = \sum_{i=1}^{N}\sum_{k=1}^{K} (Y_{ik} - \bar Y_k)^2$$

Under the null hypothesis the statistic follows an F-distribution with numerator and denominator degrees of freedom of $K-1$ and $K(N-1)$, respectively. (Recall that the F-distribution comes from the ratio of two independent chi-square random variables, each divided by its degrees of freedom.)
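As a check on the formula above, the following sketch computes the F-statistic by hand and compares it with R's built-in ANOVA table. The data are simulated and purely hypothetical (3 groups of 5 subjects); only the formulas come from the slide.

```r
# Hypothetical balanced data: K = 3 groups, N = 5 subjects per group.
set.seed(6649)
K <- 3
N <- 5
group <- factor(rep(seq_len(K), each = N))
y <- rnorm(K * N, mean = rep(c(0, 4, 8), each = N), sd = 10)

group_means <- tapply(y, group, mean)
grand_mean  <- mean(y)

# Between-group and within-group sums of squares.
ss_b <- N * sum((group_means - grand_mean)^2)
ss_w <- sum((y - group_means[as.integer(group)])^2)

# F-statistic with K - 1 and K(N - 1) degrees of freedom.
f_by_hand <- (ss_b / (K - 1)) / (ss_w / (K * (N - 1)))
f_from_lm <- anova(lm(y ~ group))[["F value"]][1]

p_value <- pf(f_by_hand, df1 = K - 1, df2 = K * (N - 1), lower.tail = FALSE)
```

The double sum for $SS_b$ collapses to $N \sum_k (\bar Y_k - \bar Y)^2$ because the summand does not depend on $i$.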

Sample size evaluation (one-way ANOVA)

The ANOVA setting does not fit into the usual sample size formulation (the F-statistic is not asymptotically normal).

For the purposes of sample size evaluation:
* Assume $SS_w$ is fixed and known, and let $\sigma_w^2 = \frac{SS_w}{K(N-1)}$. This is analogous to assuming $\sigma^2$ is known in the usual sample size evaluation.
* With $\sigma_w^2$ known, $F = \frac{SS_b}{(K-1)\,\sigma_w^2}$, and under the null hypothesis $(K-1)F = SS_b/\sigma_w^2 \sim \chi^2_{K-1}$. The sample size calculations use this chi-square-scaled statistic, written simply as $F$.

Sample size evaluation (one-way ANOVA)

* Under the null hypothesis, the critical value for a level-0.05 test is cv = qchisq(0.95, K-1).
* Finding power = Pr(F > cv):
  * Under the alternative, F (on the chi-square scale, $SS_b/\sigma_w^2$) follows a non-central chi-square distribution with $K-1$ degrees of freedom.
  * The non-centrality parameter is $KN\delta^2$, where $\delta^2 = \sigma_b^2/\sigma_w^2$.

Sample size evaluation (one-way ANOVA)

* Under the alternative $F \sim \chi^2_{K-1}$ (non-central), so the power for critical value $f$ is given by:
  $1 - \Pr(F \le f)$ = 1 - pchisq(f, K-1, ncp = $KN\delta^2$)
* To find the sample size that gives power $\beta$ for a given $\sigma_b^2/\sigma_w^2$, we find $N$ such that $1 - \Pr(F \le f) = \beta$.
* For sample size evaluation, we specify the alternative hypothesis in terms of $\sigma_b^2$. Suppose we want adequate power for a particular vector of $\theta_k$'s: $(\theta_1, \ldots, \theta_K)$. This alternative determines $\sigma_b^2$ as follows:

$$\sigma_b^2 = \frac{\sum_{k=1}^{K} (\theta_k - \bar\theta)^2}{K}$$

  * NOTICE: this is not the standard variance estimator. The standard estimator would have $K-1$ in the denominator.

Sample size evaluation (one-way ANOVA)

Example: Suppose that a trial is planned with 3 treatment groups.
* Suppose that the primary outcome is LDL-C reduction (measured as percent).
* Assume $\sigma_w = 10\%$.
* Want power to detect a difference between treatment groups of 8%.
* Finding $\sigma_b^2$:

    (mu_1, mu_2, mu_3)    sigma_b^2
    (0, 0, 8)             14.22
    (0, 2, 8)             11.56
    (0, 4, 8)             10.67
    (0, 6, 8)             11.56
    (0, 8, 8)             14.22

* The critical value is: cv <- qchisq(0.95,2) = 5.991.
* Set $\delta^2 = \sigma_b^2/\sigma_w^2 = 10.67/100 = 0.1067$ (the least favorable of the alternatives listed).
* For a particular choice of N, the power is given by: 1 - pchisq(cv,2,3*n*0.1067).
* There is just over 90% power when N = 40 per group.
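The example's numbers can be reproduced with a few lines of base R. This is a minimal sketch of the chi-square approximation described on the slides; the helper names `sigma2_b` and `anova_power` are hypothetical and not part of RCTdesign.

```r
# Between-group variance of the means, with K (not K - 1) in the denominator.
sigma2_b <- function(theta) mean((theta - mean(theta))^2)

# Power of the chi-square approximation with n subjects per group.
anova_power <- function(n, theta, sigma_w, alpha = 0.05) {
  k   <- length(theta)
  cv  <- qchisq(1 - alpha, df = k - 1)
  ncp <- k * n * sigma2_b(theta) / sigma_w^2
  pchisq(cv, df = k - 1, ncp = ncp, lower.tail = FALSE)
}

# The five alternatives from the example.
alternatives <- list(c(0, 0, 8), c(0, 2, 8), c(0, 4, 8), c(0, 6, 8), c(0, 8, 8))
s2b <- round(sapply(alternatives, sigma2_b), 2)

# Power at N = 40 per group under the least favorable alternative (0, 4, 8).
power_40 <- anova_power(40, theta = c(0, 4, 8), sigma_w = 10)

# Smallest per-group sample size with at least 90% power.
n_grid <- 2:200
n_90 <- n_grid[which(anova_power(n_grid, c(0, 4, 8), 10) >= 0.90)[1]]
```

Searching over a grid of N is the simplest approach here; `uniroot` on `anova_power(n, ...) - 0.90` would give a non-integer solution comparable to the total N reported by seqdesign.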

Sample size evaluation (one-way ANOVA)

Example in seqdesign. (sgma2w holds the within-group variance; it is not defined on this slide, presumably 100 = 10^2 as in the LDL-C example.)

    > dsgn <- seqdesign(prob.model="mean",arms=3,variance=sgma2w,
    +                   alt.hypo=c(0,4,8), alpha=0.05,power = 0.90335)
    > dsgn
    Call:
    seqdesign(prob.model = "mean", arms = 3, alt.hypothesis = c(0, 4, 8),
        variance = sgma2w, power = 0.90335, alpha = 0.05)

    PROBABILITY MODEL and HYPOTHESES:
       Theta is between group variance of means
       One-sided hypothesis test of a greater alternative:
               Null hypothesis : Theta <= 0.00    (size  = 0.0500)
        Alternative hypothesis : Theta >= 10.67   (power = 0.9033)
       (Fixed sample test)

    STOPPING BOUNDARIES: Sample Mean scale
                           Futility Efficacy
        Time 1 (N= 119.96)   4.9946   4.9946
    > dsgn$std.bounds
    STOPPING BOUNDARIES: Standardized Cumulative Sum scale
                      Futility Efficacy
        Time 1 (N= 1)   5.9915   5.9915

Sample size evaluation (one-way ANOVA)

Comments on ANOVA designs
* seqdesign: there are some quirks:
  * It is not possible to specify the sample size and then calculate power.
  * By default power = 1 - alpha, which can be confusing.
* ANOVA designs are uncommon in RCTs.
* Interim decision-making is not implemented in RCTdesign:
  * It is difficult to specify reasons for early termination with more than two treatments.
  * See the earlier comments regarding the inferential objective.

6.1(b) Factorial designs

Definition: The term "factorial design" refers to an experiment that evaluates several factors simultaneously.
* The most common application in the health sciences setting is a 2 × 2 factorial design of two different treatments (A vs B) and 4 treatment groups:
  * no A; no B
  * no A; B
  * A; no B
  * A; B
* I have also seen 2-factor designs with more than 2 levels (e.g., a 3 × 4 factorial design) or with more than 2 treatments (e.g., a 2 × 2 × 2 factorial design).

Inferential objective: There are 3 common reasons for using a factorial design:
* (Classical) An efficient way to study 2 treatments separately.
* Study the interaction between two drugs.
* Select the best treatment from 4 different options.

Statistical setting in a 2 × 2 factorial design

Data: $Y_{ijk}$ = outcome for:
* the $i$th subject ($i = 1, \ldots, N$)
* the $j$th level of treatment A ($j = 0, 1$)
* the $k$th level of treatment B ($k = 0, 1$)

Distribution: $\bar Y_{jk} \sim N(\theta_{jk}, \sigma^2/N)$.

Tests: Potential tests include (see next page):
* Effect of treatment A (marginal)
* Effect of treatment B (marginal)
* Interaction between treatments A and B.

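Statistical setting in a 2 × 2 factorial design

Effect of treatment A:

$$\frac{\bar Y_{10} + \bar Y_{11}}{2} - \frac{\bar Y_{00} + \bar Y_{01}}{2} \sim N\left(\theta_{1\cdot} - \theta_{0\cdot}, \frac{\sigma^2}{N}\right), \quad \text{where } \theta_{j\cdot} = (\theta_{j0} + \theta_{j1})/2.$$

Effect of treatment B:

$$\frac{\bar Y_{01} + \bar Y_{11}}{2} - \frac{\bar Y_{00} + \bar Y_{10}}{2} \sim N\left(\theta_{\cdot 1} - \theta_{\cdot 0}, \frac{\sigma^2}{N}\right), \quad \text{where } \theta_{\cdot k} = (\theta_{0k} + \theta_{1k})/2.$$

Test interaction: Is the effect of A if you receive B the same as its effect if you do not receive B?

$$(\bar Y_{11} - \bar Y_{01}) - (\bar Y_{10} - \bar Y_{00}) \sim N\left((\theta_{11} - \theta_{01}) - (\theta_{10} - \theta_{00}), \frac{4\sigma^2}{N}\right)$$

Selecting the best of the 4 treatments: test for any evidence of group differences in a 1-way ANOVA.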
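The two variances can be verified term by term. The sketch below assumes, as in the stated distribution, that the four cell means are independent and each has variance $\sigma^2/N$:

```latex
\begin{align*}
\operatorname{Var}\!\left[\tfrac{1}{2}(\bar Y_{10} + \bar Y_{11}) - \tfrac{1}{2}(\bar Y_{00} + \bar Y_{01})\right]
  &= \tfrac{1}{4}\left[\operatorname{Var}(\bar Y_{10}) + \operatorname{Var}(\bar Y_{11})
     + \operatorname{Var}(\bar Y_{00}) + \operatorname{Var}(\bar Y_{01})\right]
   = \tfrac{1}{4}\cdot\frac{4\sigma^2}{N} = \frac{\sigma^2}{N}, \\[4pt]
\operatorname{Var}\!\left[(\bar Y_{11} - \bar Y_{01}) - (\bar Y_{10} - \bar Y_{00})\right]
  &= \operatorname{Var}(\bar Y_{11}) + \operatorname{Var}(\bar Y_{01})
     + \operatorname{Var}(\bar Y_{10}) + \operatorname{Var}(\bar Y_{00})
   = \frac{4\sigma^2}{N}.
\end{align*}
```

The interaction contrast uses the same four means without the factor of 1/2, which is where the fourfold variance comes from.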

Statistical setting in a 2 × 2 factorial design

Sample size evaluation: Let $\theta_+$ denote the smallest important difference, and let $N$ be the number of subjects in each of the 4 groups.

Test the effects of treatment A and (separately) the effects of treatment B:

$$N_{main} = \left(\frac{z_\alpha + z_\beta}{\theta_+}\right)^2 \sigma^2$$

Test for interaction:

$$N_{Xact} = \left(\frac{z_\alpha + z_\beta}{\theta_+}\right)^2 4\sigma^2$$

Notes on designing for interaction:
* $N_{Xact} = 4 N_{main}$ when the same design alternative is used.
* Interactions are usually smaller than main effects. When designing to test interactions, the design alternative ($\theta_+$) could be smaller, in which case $N_{Xact} > 4 N_{main}$.
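A short sketch of the two formulas in R. The function name `n_factorial` and the inputs ($\theta_+ = 5$, $\sigma = 10$, one-sided $\alpha = 0.025$, 90% power) are hypothetical values chosen only for illustration.

```r
# Per-group sample sizes for a 2 x 2 factorial design.
n_factorial <- function(theta_plus, sigma, alpha = 0.025, power = 0.90) {
  z      <- qnorm(1 - alpha) + qnorm(power)   # z_alpha + z_beta
  n_main <- (z / theta_plus)^2 * sigma^2      # main effect of A (or of B)
  c(n_main = n_main, n_interaction = 4 * n_main)
}

# Same design alternative for main effects and interaction.
same_alt <- n_factorial(theta_plus = 5, sigma = 10)

# Interaction assumed half the size of the main effect.
half_alt <- n_factorial(theta_plus = 2.5, sigma = 10)
ratio_half <- unname(half_alt["n_interaction"] / same_alt["n_main"])
```

Halving the design alternative for the interaction multiplies the requirement by a further factor of 4, so the interaction test needs 16 times the main-effect sample size in this illustration.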

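Statistical setting in a 2 × 2 factorial design

One-way ANOVA approach:

$$\delta^2 = \frac{\sigma_b^2}{\sigma^2}$$

where $\sigma_b^2$ is calculated from the $\theta_{jk}$.

If you want to select the sample size to detect when at least one treatment has effect $\theta_+$, then:
* choosing $(\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11}) = (0, 0, \theta_+, \theta_+)$ gives the smallest sample size ($\sigma_b^2 = 0.25\,\theta_+^2$).
* choosing $(\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11}) = (0, \tfrac{1}{3}\theta_+, \tfrac{2}{3}\theta_+, \theta_+)$ gives a much larger sample size ($\sigma_b^2 \approx 0.139\,\theta_+^2$). Strictly, the largest sample size for a given range $\theta_+$ comes from placing both intermediate means at $\theta_+/2$ ($\sigma_b^2 = 0.125\,\theta_+^2$).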
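Because the non-centrality parameter is proportional to $N\sigma_b^2$, the required sample size for fixed power scales as $1/\sigma_b^2$. The sketch below compares a few configurations with $\theta_+ = 1$; the configuration names are hypothetical labels.

```r
# Between-group variance of the means, with K in the denominator.
sigma2_b <- function(theta) mean((theta - mean(theta))^2)

configs <- list(
  two_vs_two     = c(0, 0, 1, 1),
  one_active     = c(0, 0, 0, 1),
  evenly_spaced  = c(0, 1/3, 2/3, 1),
  middle_at_half = c(0, 1/2, 1/2, 1)
)

s2b <- sapply(configs, sigma2_b)

# Sample size relative to the most favorable configuration (two_vs_two).
relative_n <- s2b["two_vs_two"] / s2b
```

Under these assumptions the evenly spaced alternative needs 1.8 times, and the middle-at-half alternative 2 times, the sample size of the (0, 0, 1, 1) alternative.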

6.1(c) Dose-response designs (inferential objective)

Inferential objective:
* Treatment groups represent different doses of a drug.
* The FDA often requires evidence of a dose-response relationship.

Some approaches to inference:
* Identify two dose levels as being of primary interest.
* Ask if any group differs from the others (ANOVA design).
* Compare all dose levels to 0-dose (placebo) to determine the minimally effective dose.
* Ask if there is evidence that response increases with dose.

I will discuss the last setting. In my experience this is a common FDA motivation for dose-response studies.

Dose-response designs (statistical setting)

* Treatment groups are ordered by dose.
* Let $\mu_d$, $d = 0, \ldots, D$, denote the mean outcome in the ordered dose groups.
* The dose-response hypothesis is that response increases with dose. Usually monotonicity is also hypothesized; that is, $\mu_0 < \mu_D$ with

$$\mu_0 \le \mu_1 \le \cdots \le \mu_{D-1} \le \mu_D$$

* Notice that this definition divides the parameter space into null and alternative regions:
  * Null region: any non-monotonic or decreasing response.
  * Alternative region: monotonic increasing response.
* As with ANOVA, these regions are multi-dimensional, and so it is not possible to express the null and alternative hypotheses by one-dimensional inequalities.
* In contrast to ANOVA, the null region is bigger than $\mu_0 = \mu_1 = \cdots = \mu_D$.

Dose-response designs (statistical setting)

* The multi-dimensional regions can be reduced to a single parameter using an appropriate contrast, $\theta = \sum_d w_d \mu_d$.
* A linear-trend contrast is interpreted as the first-order approximation to the (potentially non-linear) dose-response function.
* If the true dose-response function is non-linear, then a linear-trend contrast will still generalize to populations treated with the same dose levels.
* Linear-trend contrast (equally-spaced dose levels): $w_d = d - \bar d$.
* Once the weights are rescaled by $\sum_d (d - \bar d)^2$, the contrast $\theta = \sum_d w_d \mu_d$ is interpretable as the change in response for a 1-level change in dose.
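The "change per 1-level change in dose" interpretation can be seen from a short derivation. The sketch assumes an exactly linear mean function $\mu_d = a + b\,d$ (the symbols $a$ and $b$ are introduced here for illustration) and uses the fact that centered weights sum to zero:

```latex
\begin{align*}
\sum_d (d - \bar d)\,\mu_d
  &= a \sum_d (d - \bar d) + b \sum_d (d - \bar d)\,d \\
  &= 0 + b \sum_d (d - \bar d)^2 ,
\end{align*}
\qquad\text{so}\qquad
w_d = \frac{d - \bar d}{\sum_{d'} (d' - \bar d)^2}
\;\Longrightarrow\;
\theta = \sum_d w_d\,\mu_d = b .
```

Without the rescaling, the contrast equals the slope multiplied by $\sum_d (d - \bar d)^2$; tests and sample sizes are unaffected because the standard error rescales by the same factor.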

Dose-response designs (statistical setting)

Notes:
* There are other "linear" contrasts (e.g., orthogonal polynomials), but the above contrast is the linear contrast for a linear trend.
* The contrast can be rescaled to change the interpretation (e.g., change units).
* Unless you know that the dose-response relationship is exactly linear, it is better to fit a linear contrast across dose levels than to treat dose as continuous and estimate the slope parameter in a regression model. If dose is treated as continuous, then systematic departures from linearity are counted as residual variation, thereby decreasing power.

Dose-response designs (sample size evaluation)

Suppose that you have $D + 1$ dose groups ($d = 0, \ldots, D$), and that you want adequate power for dose-response functions with $\mu_D - \mu_0 \ge \mu_+$.

As above, consider a linear-trend contrast:
* $\theta = \sum_d w_d \mu_d$ where $w_d = d - \bar d$.
* Estimate $\theta$ by $\hat\theta = \sum_d w_d \bar Y_d$.
* If the variance is constant across dose groups, then with $N$ patients per group, $\hat\theta \sim N\left(\theta, \sum_d w_d^2 \, \sigma^2 / N\right)$.

Assume $\mu_0 = 0$. If $\mu_d$ is in fact linear (i.e., $\mu_d = \frac{d}{D}\mu_D$), then the sample size (in each dose group) required for power $\beta$ when $\mu_D = \mu_+$ is given by:

$$N = \left(\frac{z_\alpha + z_\beta}{\sum_{d=0}^{D} w_d \frac{d}{D}\mu_+}\right)^2 \sum_{d=0}^{D} w_d^2 \, \sigma^2$$

You can also evaluate power under non-linear dose-response relationships; for example, $\mu_d = \mu_D (d/D)^P$ for several choices of the exponent $P$ ($P = 1$ is the linear case).
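The sample size formula translates directly into R. This is a sketch only: the function name `trend_sample_size` and the inputs (4 equally spaced doses, $\mu_+ = 6$, $\sigma = 10$) are hypothetical.

```r
# Per-group sample size for a linear-trend contrast across dose groups.
trend_sample_size <- function(dose, mu_alt, sigma, alpha = 0.025, power = 0.90) {
  w         <- dose - mean(dose)                  # centered linear-trend weights
  theta_alt <- sum(w * mu_alt)                    # contrast under the design alternative
  z         <- qnorm(1 - alpha) + qnorm(power)    # z_alpha + z_beta
  (z / theta_alt)^2 * sum(w^2) * sigma^2
}

# Linear alternative mu_d = (d / D) * mu_plus with D = 3 and mu_plus = 6.
dose   <- 0:3
mu_alt <- (dose / max(dose)) * 6
n_per_group <- trend_sample_size(dose, mu_alt, sigma = 10)

# The result does not depend on the dose units used for the weights.
n_rescaled <- trend_sample_size(dose * 100, mu_alt, sigma = 10)
```

Rescaling the weights changes the numerator and the denominator of the formula by the same factor, which is why the unnormalized weights $d - \bar d$ suffice for sample size evaluation.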

Dose-response designs (example)

Example: Consider a trial in which children are randomized to different doses of inhaled steroid for asthma control (Busse WW (1999) J. Allergy Clin Immunol; 1215-22).
* Suppose that doses will be 100 µg/d, 400 µg/d, and 800 µg/d.
* Response will be measured by FEV1 (% predicted) at week 6, and the standard deviation is about 10 (% predicted).
* Suppose that you want to detect a linear trend in dose-response of 1 (% predicted) for every 100 µg/d increase in steroid dose. This is similar to the magnitude of the dose-response relationship for FEV1 in the paper.

Dose-response designs (example)

Approach
* Linear-trend contrast:
  * With 3 equally-spaced dose groups the contrast is $w_d = (-1, 0, 1)$.
  * This study does not have equally-spaced doses, so we use $w_d^* = (-1, -0.1, 1.1)$, which is proportional to the centered doses $d - \bar d$ for $d = 1, 4, 8$ (in units of 100 µg/d).
  * To get the correct interpretation we use $w_d = w_d^*/7.4$, since $\sum_d w_d^* d = 7.4$.
* If response increases by 1 (% pred) for every 100 µg/d increase in steroid then, for example, $\mu_{100} = 20$, $\mu_{400} = 23$, and $\mu_{800} = 27$, so that $\sum_d w_d \mu_d = 1$.
* Sample size for linear trend (90% power):
  * variance $= \sum_d w_d^2 \sigma^2 = 0.04054 \times 10^2 = 4.054$
  * Design alternative $= \theta_+ = 1$:

$$N = \left(\frac{z_\alpha + z_\beta}{1}\right)^2 \times 4.054 = \left(\frac{1.96 + 1.28}{1}\right)^2 \times 4.054 = 42.56$$

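Dose-response designs (example)

Evaluate sensitivity of power to non-linear dose-response relationships:
* Suppose $\mu_d = 20 + 7\,x_d^P$, where $x_d = (\text{dose}_d - 100)/700$ is the relative position of dose $d$ between the lowest and highest dose, so that $\mu_{100} = 20$ and $\mu_{800} = 27$ for every $P$.
* A study with 43 subjects per group has the following power for non-linear relationships ($\theta$ is the value of the contrast under each alternative):

    P       theta   Power
    0.20    0.961   0.879
    0.50    0.979   0.890
    0.80    0.993   0.898
    1.00    1.000   0.903
    1.25    1.008   0.907
    2.00    1.023   0.915
    5.00    1.039   0.923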
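The table can be reproduced in base R under the following assumptions, which match the tabulated values: the mean function is $20 + 7x^P$ with $x$ the relative dose position, the test is one-sided at level 0.025, and power uses the normal approximation. Variable names are illustrative.

```r
dose  <- c(100, 400, 800) / 100
w     <- (dose - mean(dose)) / sum((dose - mean(dose))^2)   # slope-scaled contrast
x     <- (dose - min(dose)) / (max(dose) - min(dose))       # relative dose position
sigma <- 10
n     <- 43
alpha <- 0.025
se    <- sqrt(sum(w^2) * sigma^2 / n)                       # standard error of theta-hat

# Contrast value and power for each exponent P.
sensitivity <- t(sapply(c(0.2, 0.5, 0.8, 1, 1.25, 2, 5), function(p) {
  theta <- sum(w * (20 + 7 * x^p))
  c(P = p, theta = theta, power = pnorm(theta / se - qnorm(1 - alpha)))
}))

round(sensitivity, 3)
```

Power changes by only a few percentage points over this range of $P$ because the contrast is dominated by the lowest and highest doses, whose means are fixed at 20 and 27.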

Dose-response designs (example): Implementation in RCTdesign

R code for the above calculations:

    > # Linear contrast:
    > d <- c(100,400,800)/100
    > wd <- d - mean(d)
    > wd <- wd/sum(wd^2)
    > # With this contrast thetaa is 1.0:
    > thetaa <- sum(wd*c(20,23,27))
    > thetaa
    [1] 1

Equivalent seqdesign command:

    > fxd <- seqdesign(arms=0,variance=100,ratio=c(1,4,8),
    +                  alt.hypo=1,power="calculate",sample.size=42.56*3)

    PROBABILITY MODEL and HYPOTHESES:
       Theta is difference in means per unit difference in treatment level
       One-sided hypothesis test of a greater alternative:
               Null hypothesis : Theta <= 0    (size  = 0.0250)
        Alternative hypothesis : Theta >= 1    (power = 0.8997)
       (Fixed sample test)

    STOPPING BOUNDARIES: Sample Mean scale
                           Futility Efficacy
        Time 1 (N= 127.68)   0.6049   0.6049

Notes on normed contrasts:
* A normed contrast has length = 1; i.e., divide the weights by $\sqrt{\sum_d w_d^2}$.
* A normed contrast may not preserve the units of interest (i.e., $\theta_A$ may not have the desired interpretation).
* Dose-response example with normed contrast:

    > d <- c(100,400,800)/100
    > wd <- d - mean(d)
    > wdn <- wd/sqrt(sum(wd^2))
    > sum(wdn^2)
    [1] 1
    >
    > thetaa <- sum(wdn*c(20,23,27))
    > thetaa
    [1] 4.9666

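6.1(d) Designs for regression endpoints

Consider a study for estimating the linear trend ($\theta_1$) in outcome $Y$ with a continuous explanatory variable ($X$):

$$E(Y_i) = \theta_0 + \theta_1 X_i$$

Least squares estimator:

$$\hat\theta_1 = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} \sim N\left(\theta_1, \frac{V}{N}\right)$$

where:

$$\frac{V}{N} = \frac{\sigma^2_{Y|X}}{\sum_i (X_i - \bar X)^2} \approx \frac{\sigma^2_{Y|X}}{N \sigma^2_X}$$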

Sample size evaluation for regression designs

The usual sample size formula can be used with $V = \sigma^2_{Y|X}/\sigma^2_X$, the variance factor of $\hat\theta_1$:

$$N = \left(\frac{z_\alpha + z_\beta}{\theta_{1+}}\right)^2 V = \left(\frac{z_\alpha + z_\beta}{\theta_{1+}}\right)^2 \frac{\sigma^2_{Y|X}}{\sigma^2_X}$$

For example, we require about 100 patients if we want 97.5% power for $\theta_{1+} = 0.25$ when $\sigma^2_{Y|X} = 2$ and $\sigma^2_X = 5$:

$$N = \left(\frac{3.92}{0.25}\right)^2 \times 0.4 \approx 98.3$$

The corresponding seqdesign command, assuming $X_i \sim N(0, \text{ratio})$, is:

    dsgn <- seqdesign(arms=0,variance=2,ratio=5,alt.hypo=0.25)

Logistic regression designs can be specified using the odds probability model (see the user's guide).
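The arithmetic in the regression example can be checked with a short function. The name `regression_sample_size` is hypothetical; the defaults (one-sided $\alpha = 0.025$, power 0.975) are assumptions chosen to match $z_\alpha + z_\beta = 3.92$ in the example.

```r
# Total sample size for detecting a regression slope theta_alt.
regression_sample_size <- function(theta_alt, var_y_given_x, var_x,
                                   alpha = 0.025, power = 0.975) {
  z <- qnorm(1 - alpha) + qnorm(power)      # 1.96 + 1.96 = 3.92
  (z / theta_alt)^2 * var_y_given_x / var_x
}

n_reg <- regression_sample_size(theta_alt = 0.25, var_y_given_x = 2, var_x = 5)
```

Doubling the variance of $X$ halves the required sample size, which is why a wide spread of the explanatory variable is valuable at the design stage.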

6.1(e) Designs using change from baseline endpoints

In many trials the primary endpoint is the change from baseline in a continuous outcome measure.
* Examples:
  * Change in FEV1
  * Change in 6-minute walk distance
  * Change in weight (or BMI)
* Motivation for analyzing change:
  * Looking at change within a subject removes between-subject variability
  * May reduce variation (increase power)
* Which is the most precise estimate of treatment effect?
  * T-test of difference in mean at follow-up?
  * T-test of difference in change from baseline?
  * Regression (ANCOVA) of difference in follow-up given baseline?

Designs using change from baseline endpoints

Setting:
* Let $Y_{i0k}$ and $Y_{i1k}$ denote a measurement on the $i$th subject in treatment group $k$ ($k = 0, 1$) at baseline and follow-up.
* Let $D_{ik} = Y_{i1k} - Y_{i0k}$.
* Consider the following statistical models for testing the treatment effect ($Tx = 1_{[k=1]}$):
  (a) Test difference at follow-up time: $E(Y_{i1k}) = \alpha_0 + \alpha_1 Tx$
  (b) Test change from baseline: $E(D_{ik}) = \beta_0 + \beta_1 Tx$
  (c) Test follow-up measure given baseline measure (ANCOVA): $E(Y_{i1k} \mid Y_{i0k}) = \gamma_0 + \gamma_1 Tx + \gamma_2 Y_{i0k}$

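Designs using change from baseline endpoints

Probability model

Suppose that

$$\begin{pmatrix} Y_{i0k} \\ Y_{i1k} \end{pmatrix} \sim N\left[\begin{pmatrix} \mu_{00} \\ \mu_{1k} \end{pmatrix}, \Sigma\right] \quad \text{where} \quad \Sigma = \begin{bmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{bmatrix}$$

Notice that the baseline mean is the same in both groups, $\mu_{0k} = \mu_{01} = \mu_{00}$, by randomization.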

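Designs using change from baseline endpoints

Distribution of the estimated treatment effects from the 3 statistical models with $i = 1, \ldots, N$ subjects per group, where $\Delta\mu_1 = \mu_{11} - \mu_{10}$:

(a) $E(Y_{i1k}) = \alpha_0 + \alpha_1 Tx$:

$$\hat\alpha_1 \sim N\left(\Delta\mu_1, \frac{2\sigma^2}{N}\right)$$

(b) $E(D_{ik}) = \beta_0 + \beta_1 Tx$:

$$\hat\beta_1 \sim N\left(\Delta\mu_1, \frac{4\sigma^2(1-\rho)}{N}\right)$$

(c) $E(Y_{i1k} \mid Y_{i0k}) = \gamma_0 + \gamma_1 Tx + \gamma_2 Y_{i0k}$:

$$\hat\gamma_1 \sim N\left(\Delta\mu_1, \frac{2\sigma^2(1-\rho^2)}{N}\right)$$

Also notice that all three models estimate the same treatment effect:
* $\alpha_1 = \Delta\mu_1$
* $\beta_1 = \Delta\mu_1 = (\mu_{11} - \mu_{00}) - (\mu_{10} - \mu_{00})$
* $\gamma_1 = \Delta\mu_1$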

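Designs using change from baseline endpoints

Relative efficiency of the designs using the 3 statistical models:

(a) vs (b): When is $var(\hat\beta_1) < var(\hat\alpha_1)$?

$$var(\hat\beta_1) < var(\hat\alpha_1) \iff 4\sigma^2(1-\rho) < 2\sigma^2 \iff 0.5 < \rho < 1.0$$

(a) vs (c): When is $var(\hat\gamma_1) < var(\hat\alpha_1)$?

$$var(\hat\gamma_1) < var(\hat\alpha_1) \iff 2\sigma^2(1-\rho^2) < 2\sigma^2 \iff 0.0 < \rho < 1.0$$

(b) vs (c): When is $var(\hat\gamma_1) < var(\hat\beta_1)$?

$$var(\hat\gamma_1) < var(\hat\beta_1) \iff 2\sigma^2(1-\rho^2) < 4\sigma^2(1-\rho) \iff 0.0 < \rho < 1.0$$

(The last two inequalities in fact hold for any $\rho \ne 0$ and any $\rho < 1$, respectively; the ranges shown are for positively correlated measurements.)

So, conditioning on baseline (ANCOVA) is more efficient (see also examples/proof in handout).

Note: in the model that adjusts for baseline, $\hat\gamma_1$ is the same regardless of whether $Y_{i1k}$ or $D_{ik}$ is used as the response variable (try it).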
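The "try it" suggestion and the variance ordering can both be checked in R. The sketch below uses simulated data with hypothetical values ($N = 200$ per group, $\sigma = 1$, $\rho = 0.6$, treatment effect 0.5) drawn from the bivariate normal model of the probability-model slide.

```r
set.seed(6649)
n     <- 200    # subjects per group (hypothetical)
sigma <- 1
rho   <- 0.6
tx    <- rep(0:1, each = n)

# Baseline and follow-up with common variance sigma^2 and correlation rho.
y0 <- rnorm(2 * n, mean = 0, sd = sigma)
y1 <- 0.5 * tx + rho * y0 + sqrt(1 - rho^2) * rnorm(2 * n, sd = sigma)
d  <- y1 - y0

fit_a  <- lm(y1 ~ tx)          # (a) follow-up only
fit_b  <- lm(d ~ tx)           # (b) change from baseline
fit_c  <- lm(y1 ~ tx + y0)     # (c) ANCOVA on follow-up
fit_cd <- lm(d ~ tx + y0)      # ANCOVA with change as the response

gamma_followup <- unname(coef(fit_c)["tx"])
gamma_change   <- unname(coef(fit_cd)["tx"])

# Theoretical variances of the three estimators from the slides.
theoretical_var <- c(a = 2 * sigma^2 / n,
                     b = 4 * sigma^2 * (1 - rho) / n,
                     c = 2 * sigma^2 * (1 - rho^2) / n)
```

The two ANCOVA fits give identical treatment coefficients because subtracting the baseline from the response only shifts the baseline coefficient by 1.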