Technical Manual

Contents
1 Introduction
2 TraditionalSampleSize module: Analytical calculations in fixed-sample trials
3 TraditionalSimulations module: Simulation-based calculations in fixed-sample trials
4 GroupSequential module: Analytical and simulation-based calculations in group-sequential trials
5 Appendix

1 Introduction

Mediana Designer is a free Windows-based software tool that supports traditional and simulation-based power and sample size calculations in fixed-sample and group-sequential trials. The current version of Mediana Designer comes with the following three modules:

TraditionalSampleSize module (Analytical calculations in fixed-sample trials): This module supports analytical evaluation of operating characteristics in clinical trials with a traditional (fixed-sample) design.

TraditionalSimulations module (Simulation-based calculations in fixed-sample trials): This module implements the simulation-based Clinical Scenario Evaluation approach in clinical trials with a traditional (fixed-sample) design.

GroupSequential module (Analytical and simulation-based calculations in group-sequential trials): This module implements analytical and simulation-based evaluation of operating characteristics in clinical trials that employ group-sequential designs with several decision points.

This manual provides a detailed summary of the statistical methods implemented in each module.

1.1 Version

This manual describes the features available in Mediana Designer Version 0.9 and was released on April 12, 2019.

1.2 Developer

Mediana Designer was developed and is maintained by Mediana Inc. For more information on Mediana Designer, please visit the Biopharmaceutical Network site at

http://biopharmnet.com/mediana-designer/

The latest version of this manual can be downloaded from this web site.

1.3 Reviewers

Beta versions of Mediana Designer have been reviewed by the following biopharmaceutical statisticians (in alphabetical order): Thomas Brechenmacher (IQVIA), Jian Chen (Tesaro), Gerald Crans (Gilead), Qiqi Deng (Boehringer Ingelheim), Miguel Garcia (Boehringer Ingelheim), Qi Gong (Gilead), Wei Guo (Tesaro), Julie Huang (Gilead), Rochelle Huang (Gilead), Jim Love (Boehringer Ingelheim), Lanjia Lin (Gilead), Jie Liu (Gilead), Yi Liu (Boehringer Ingelheim), Kaushik Patra (Alexion), Gautier Paux (Sanofi), Dooti Roy (Boehringer Ingelheim), Shuo Wang (Gilead), Kyle Wathen (Johnson & Johnson), Ilker Yalcin (Tesaro), Wei Ye (Gilead), Ron Yu (Gilead). The development team would like to thank the reviewers for their valuable feedback.

1.4 Validation

Multiple commercially available and open-source software tools have been used to test the implementation of the statistical methods presented in this manual, including:

EAST software (Cytel, 2016).

SAS software, e.g., the POWER procedure (SAS, 2017) and the SEQDESIGN procedure (SAS, 2017).

R packages, e.g., TrialSize (Zhang et al., 2013), gsDesign (Anderson, 2016) and Mediana (Paux and Dmitrienko, 2019).

For a detailed summary of the procedures that were carried out to test Mediana Designer, please visit http://biopharmnet.com/mediana-designer-validation/

1.5 Project files

Dozens of Mediana Designer project files/case studies have been created to illustrate the functionality supported by the individual modules. To download these project files, please visit http://biopharmnet.com/mediana-designer-project-files/

Notation and conventions

The following notation will be used throughout this manual:

$\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

$z_{1-x}$ is the upper 100x-th percentile of the standard normal distribution, i.e., $z_{1-x} = \Phi^{-1}(1-x)$.

Clinical trials will be designed to ensure power of $1-\beta$ (e.g., 90% power with $\beta = 0.1$) with a one-sided Type I error rate set to $\alpha$ (e.g., $\alpha = 0.025$).

2 TraditionalSampleSize module: Analytical calculations in fixed-sample trials

2.1 Introduction

This section focuses on sample size and power calculations in clinical trials that utilize a traditional design with a fixed number of patients (if the primary endpoint is continuous or binary) or a fixed number of events (if the primary endpoint is a time-to-event endpoint). A two-arm clinical trial with a parallel design will be assumed.

Let $n_1$ and $n_2$ denote the numbers of patients enrolled into the control arm and the treatment arm, respectively. The total number of enrolled patients is denoted by $n = n_1 + n_2$. The randomization ratio ($r$) is defined as the ratio of the number of patients in the treatment arm to that in the control arm. For example, with $r = 2$, twice as many patients are assigned to the treatment arm compared to the control arm. In other words, $n_2 = r n_1$ and thus
$$n_1 = \frac{n}{1+r} \quad \text{and} \quad n_2 = \frac{rn}{1+r}.$$
In addition, $d$ will denote the target number of events in trials with time-to-event endpoints.

Analytical frequentist approaches to sample size and power calculations are discussed in Sections 2.2 through 2.4 (these sections describe trials with continuous, binary and time-to-event endpoints). Within the frequentist framework, Mediana Designer supports the following analytical calculations:

Clinical trials with continuous or binary endpoints: Calculate power for a given sample size or calculate the required sample size for a given power level.

Clinical trials with time-to-event endpoints: Calculate power for a given number of events or calculate the target number of events for a given power level.
If the patient accrual and dropout parameters are specified, calculate power for a given sample size or calculate the required sample size for a given power level.

A simulation-based Bayesian approach to evaluating assurance (probability of success) in clinical trials is presented in Section 2.5. Mediana Designer supports the following assurance calculations:

Clinical trials with continuous or binary endpoints: Calculate assurance for a given sample size.

Clinical trials with time-to-event endpoints: Calculate assurance for a given number of events.

2.2 Frequentist calculations in trials with continuous endpoints

It is assumed that the continuous primary endpoint is normally distributed and the treatment effect is evaluated using the standard Z-test. Power and sample size calculations based on this test are supported by several R packages (TrialSize and gsDesign), the POWER procedure in SAS and EAST.

The true values of the mean effects in the control arm and treatment arm are denoted by $\mu_1$ and $\mu_2$, respectively, and, similarly, the standard deviations in the control arm and treatment arm are denoted by $\sigma_1$ and $\sigma_2$. The mean treatment difference is given by $\delta = \mu_2 - \mu_1$. The following two scenarios, corresponding to two different alternative hypotheses of a beneficial effect, will be considered in this section as well as other sections:

Upper one-sided alternative: A positive value of the mean treatment difference corresponds to a beneficial treatment effect.

Lower one-sided alternative: A negative value of the mean treatment difference indicates treatment benefit.

Superiority assessment

Consider first the superiority setting where the goal is to demonstrate that the treatment provides a statistically significant improvement over the control. The hypothesis testing problem with an upper one-sided alternative is defined as
$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta > 0$$
and, with a lower one-sided alternative,
$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta < 0.$$
The test statistic for evaluating the strength of evidence in favor of either alternative hypothesis is given by
$$Z = \frac{\hat{\mu}_2 - \hat{\mu}_1}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}},$$
where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the sample means in the two trial arms. When an upper one-sided alternative is assumed, a larger value of the test statistic leads to a decision to reject the null hypothesis of no effect, i.e., $H_0$ is rejected if $Z \geq z_{1-\alpha}$. With a lower one-sided alternative, the null hypothesis is rejected if $Z \leq -z_{1-\alpha}$.
If the average standard deviation is defined as
$$\sigma = \sqrt{\sigma_1^2 + \frac{\sigma_2^2}{r}},$$
power is easily computed as a function of $\delta$ and $\sigma$, i.e.,
$$\psi(\delta, \sigma) = \Phi\left(\sqrt{\frac{n}{1+r}}\,\frac{\delta}{\sigma} - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta, \sigma) = \Phi\left(-\sqrt{\frac{n}{1+r}}\,\frac{\delta}{\sigma} - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}.$$
If power is set to $1-\beta$, the required total number of patients is equal to
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{\delta^2}.$$
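As a sketch of how these power and sample size formulas can be evaluated, consider the following illustrative Python code (not Mediana Designer source code; the numeric inputs at the bottom are arbitrary, assuming $\delta = 4$ and $\sigma_1 = \sigma_2 = 10$ in a balanced design):

```python
# Illustrative sketch of the two-arm Z-test formulas above (upper one-sided
# alternative). sigma is the "average" standard deviation
# sqrt(sigma1^2 + sigma2^2 / r).
from math import sqrt, ceil
from statistics import NormalDist

norm = NormalDist()

def power_continuous(n, delta, sigma, r=1.0, alpha=0.025):
    """psi(delta, sigma) = Phi(sqrt(n/(1+r)) * delta/sigma - z_{1-alpha})."""
    return norm.cdf(sqrt(n / (1 + r)) * delta / sigma - norm.inv_cdf(1 - alpha))

def sample_size_continuous(delta, sigma, r=1.0, alpha=0.025, beta=0.10):
    """n = (1+r) (z_{1-alpha} + z_{1-beta})^2 sigma^2 / delta^2, rounded up."""
    z = norm.inv_cdf(1 - alpha) + norm.inv_cdf(1 - beta)
    return ceil((1 + r) * z**2 * sigma**2 / delta**2)

# Hypothetical inputs: delta = 4, sigma1 = sigma2 = 10, balanced design,
# so sigma^2 = 200.
n = sample_size_continuous(delta=4, sigma=sqrt(200))
print(n)  # 263
```

Plugging the returned $n$ back into the power function recovers (slightly more than) the 90% target power, since the sample size is rounded up.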

Non-inferiority assessment

In a non-inferiority setting, the trial's objective is to demonstrate that the treatment is not substantially worse than, i.e., not inferior to, the control. The degree of non-inferiority is determined using a pre-set constant, known as the non-inferiority margin. The margin is denoted by $\gamma$. The null hypothesis of inferiority and alternative hypothesis of non-inferiority are defined as
$$H_0: \delta = \gamma \quad \text{versus} \quad H_1: \delta = 0,$$
where $\gamma$ is negative under an upper one-sided alternative and positive under a lower one-sided alternative. The corresponding non-inferiority test statistic is given by
$$Z = \frac{\hat{\mu}_2 - \hat{\mu}_1 - \gamma}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}.$$
It is easy to verify that, under the null hypothesis of inferiority, the test statistic follows the standard normal distribution. As above, the null hypothesis is rejected if $Z \geq z_{1-\alpha}$ with an upper one-sided alternative and if $Z \leq -z_{1-\alpha}$ with a lower one-sided alternative. Power as a function of the total sample size $n$ is given by
$$\psi(\delta, \sigma \mid \gamma) = \Phi\left(\sqrt{\frac{n}{1+r}}\,\frac{\delta - \gamma}{\sigma} - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta, \sigma \mid \gamma) = \Phi\left(-\sqrt{\frac{n}{1+r}}\,\frac{\delta - \gamma}{\sigma} - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}.$$
The total sample size in the trial as a function of $\beta$ is given by
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{(\delta - \gamma)^2}.$$

Example

The following numeric example illustrates the process of computing the sample size in a clinical trial with a normally distributed endpoint. Consider an antihypertension Phase III trial with two arms (experimental treatment versus active control). The primary analysis in the trial is based on the change in systolic blood pressure (measured in mmHg) and aims to demonstrate that the treatment is non-inferior to the active control. A larger reduction in the mean systolic blood pressure is desirable and thus a lower one-sided alternative will be considered in the hypothesis testing problem. Under the alternative hypothesis, $\mu_1 = \mu_2 = -9$ (i.e., $\delta = 0$) and, under the null hypothesis of inferiority, $\mu_1 = -9$ and $\mu_2 = -6$ (i.e., $\delta = 3$).
The trial's design is balanced ($r = 1$) and the common standard deviation is $\sigma_1 = \sigma_2 = 10$, therefore
$$\sigma^2 = \sigma_1^2 + \frac{\sigma_2^2}{r} = 200.$$
The non-inferiority margin is set to $\gamma = 3$ (note that the margin is positive since a lower one-sided alternative is considered in this problem). Using a one-sided $\alpha = 0.025$ and 90% power ($\beta = 0.1$), the total sample size in the trial is
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{(\delta - \gamma)^2} = 467.$$
The resulting sample size matches the example presented in the EAST user manual (Cytel, 2016, Chapter 12).
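The calculation above can be reproduced with a few lines of illustrative Python (not Mediana Designer source code), with $\delta = 0$ evaluated under the alternative hypothesis:

```python
# Reproducing the non-inferiority sample size from the antihypertension
# example: delta = 0 under H1, margin gamma = 3, sigma^2 = 200.
from math import ceil
from statistics import NormalDist

norm = NormalDist()
alpha, beta, r = 0.025, 0.10, 1.0
delta, gamma = 0.0, 3.0
sigma2 = 10**2 + 10**2 / r   # sigma^2 = sigma1^2 + sigma2^2/r = 200

z = norm.inv_cdf(1 - alpha) + norm.inv_cdf(1 - beta)
n = ceil((1 + r) * z**2 * sigma2 / (delta - gamma)**2)
print(n)  # 467
```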

2.3 Frequentist calculations in trials with binary endpoints

Two-arm trials with binary primary endpoints are considered in this section. The Z-test for proportions with an unpooled variance estimate is supported in this setting. Power and sample size calculations for this test are presented in multiple papers and books, including Chow, Shao and Wang (2008) and Julious (2010). This approach is implemented in SAS, EAST and multiple R packages (TrialSize and gsDesign). For example, this test is implemented in the TwoSampleProportion.NIS function of the TrialSize package.

Hypothesis testing problems

Let $\pi_1$ and $\pi_2$ denote the true values of the proportions of interest, e.g., response rates, in the control arm and treatment arm, respectively. The true treatment difference is equal to $\delta = \pi_2 - \pi_1$. As in Section 2.2, the following two scenarios will be considered:

Upper one-sided alternative: A positive value of the treatment difference corresponds to a beneficial treatment effect.

Lower one-sided alternative: A negative value of the treatment difference corresponds to treatment benefit.

In a superiority setting, the corresponding null hypotheses of no effect and alternative hypotheses of a beneficial effect are defined as in Section 2.2, i.e.,
$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta > 0 \quad \text{(upper one-sided alternative)},$$
$$H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta < 0 \quad \text{(lower one-sided alternative)}.$$
If the trial is conducted to pursue a non-inferiority objective with a pre-specified non-inferiority margin $\gamma$, the null hypothesis of inferiority and alternative hypothesis of non-inferiority are again set up as in Section 2.2, i.e.,
$$H_0: \delta = \gamma \quad \text{versus} \quad H_1: \delta = 0.$$
The margin is negative if an upper one-sided alternative is considered and is positive otherwise.

Superiority and non-inferiority assessments

The test statistic for evaluating the significance of the treatment effect in a superiority setting is given by
$$Z = \frac{\hat{\pi}_2 - \hat{\pi}_1}{\sqrt{\hat{\pi}_1(1-\hat{\pi}_1)/n_1 + \hat{\pi}_2(1-\hat{\pi}_2)/n_2}},$$
where $\hat{\pi}_1$ and $\hat{\pi}_2$ are the observed proportions in the two trial arms.
The null hypothesis of no effect is rejected if $Z \geq z_{1-\alpha}$ provided a higher value of the proportion indicates a beneficial treatment effect (upper one-sided alternative) and if $Z \leq -z_{1-\alpha}$ otherwise (lower one-sided alternative). The standard deviation corresponding to this test is naturally defined as follows:
$$\sigma = \sqrt{\pi_1(1-\pi_1) + \frac{\pi_2(1-\pi_2)}{r}}.$$
Using this definition of $\sigma$, the power function of the test is given by
$$\psi(\pi_1, \pi_2) = \Phi\left(\sqrt{\frac{n}{1+r}}\,\frac{\delta}{\sigma} - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\pi_1, \pi_2) = \Phi\left(-\sqrt{\frac{n}{1+r}}\,\frac{\delta}{\sigma} - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}.$$

The total sample size of the trial with a superiority objective is
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{\delta^2}.$$
If the trial's goal is to show that the treatment is non-inferior to the control, the test statistic is easily modified as follows:
$$Z = \frac{\hat{\pi}_2 - \hat{\pi}_1 - \gamma}{\sqrt{\hat{\pi}_1(1-\hat{\pi}_1)/n_1 + \hat{\pi}_2(1-\hat{\pi}_2)/n_2}}.$$
The power function is defined as follows:
$$\psi(\pi_1, \pi_2 \mid \gamma) = \Phi\left(\sqrt{\frac{n}{1+r}}\,\frac{\delta - \gamma}{\sigma} - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\pi_1, \pi_2 \mid \gamma) = \Phi\left(-\sqrt{\frac{n}{1+r}}\,\frac{\delta - \gamma}{\sigma} - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}.$$
The total sample size in the trial is given by
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{(\delta - \gamma)^2}.$$

Example

Consider a two-arm Phase III trial in patients with HIV. The primary endpoint in the trial is binary (24-week disease-free rate) and a higher rate corresponds to a beneficial effect. Suppose that the trial is designed to perform a non-inferiority assessment for a novel treatment compared to an active control. The disease-free rate in the control arm is assumed to be 80% ($\pi_1 = 0.8$). Under the null hypothesis of inferiority, the disease-free rate in the treatment arm is 75%, i.e., $\pi_2 = 0.75$ or $\delta = -0.05$. An upper one-sided alternative is considered in the trial and states that the disease-free rate equals 80% in both trial arms ($\pi_2 = 0.8$ or $\delta = 0$). The non-inferiority margin is set to $\gamma = -0.05$ (the margin is negative since an upper one-sided alternative is considered). Assuming a balanced design ($r = 1$) with a one-sided $\alpha = 0.025$ and 90% power ($\beta = 0.1$), the required total number of patients in the trial is
$$n = \frac{(1+r)(z_{1-\alpha} + z_{1-\beta})^2 \sigma^2}{(\delta - \gamma)^2} = 2690,$$
where $\sigma = 0.5657$. This sample size calculation matches that presented in the EAST user manual (Cytel, 2016, Chapter 24).

2.4 Frequentist calculations in trials with time-to-event endpoints

A clinical trial with an event-driven design will be considered in this section and it will be assumed that the time to the primary event follows an exponential distribution. Let $\lambda_1$ and $\lambda_2$ denote the hazard rates in the control and treatment arms, respectively. The hazard ratio is denoted by $\delta$, i.e., $\delta = \lambda_2/\lambda_1$.
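The HIV example can be checked with the following illustrative Python sketch (not Mediana Designer source code), following the sign convention of this section under which the margin is negative for an upper one-sided alternative:

```python
# Reproducing the binary non-inferiority sample size from the HIV example.
from math import ceil
from statistics import NormalDist

norm = NormalDist()
alpha, beta, r = 0.025, 0.10, 1.0
pi1, pi2 = 0.8, 0.8          # disease-free rates under the alternative
delta = pi2 - pi1            # 0 under H1
gamma = -0.05                # margin (negative: upper one-sided alternative)

sigma2 = pi1 * (1 - pi1) + pi2 * (1 - pi2) / r   # 0.32, so sigma = 0.5657
z = norm.inv_cdf(1 - alpha) + norm.inv_cdf(1 - beta)
n = ceil((1 + r) * z**2 * sigma2 / (delta - gamma)**2)
print(n)  # 2690
```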
As in Sections 2.2 and 2.3, two scenarios based on lower and upper one-sided alternatives are considered to define the hypothesis testing problem. Under an upper one-sided alternative, treatment benefit is associated with a lower hazard ratio or, equivalently, with a longer time to the primary event in the treatment arm compared to the control arm. With a lower one-sided alternative, a beneficial effect is associated with a higher hazard ratio. This immediately implies that, in a trial

pursuing a superiority objective, the null and alternative hypotheses are defined as follows:
$$H_0: \delta = 1 \quad \text{versus} \quad H_1: \delta < 1 \quad \text{(upper one-sided alternative)},$$
$$H_0: \delta = 1 \quad \text{versus} \quad H_1: \delta > 1 \quad \text{(lower one-sided alternative)}.$$
Within a non-inferiority framework, let $\gamma$ denote the prospectively defined non-inferiority margin on the hazard ratio scale. Using this margin, the null and alternative hypotheses are set up as follows:
$$H_0: \delta = \gamma \quad \text{versus} \quad H_1: \delta = 1.$$
Here $\gamma$ is greater than 1 under an upper one-sided alternative and is less than 1 otherwise.

An important feature of clinical trials with time-to-event endpoints is that power calculations can be carried out to pursue two different goals:

Compute the target number of events in the trial. This calculation does not take into account patient accrual or patient dropout (in a sense, every patient is followed up until this patient experiences the event of interest).

Compute the number of enrolled patients in the trial. To perform this calculation, assumptions about the patient accrual and dropout processes need to be made.

These two goals will be discussed below. The calculation of the number of events is based on the Schoenfeld formula (Schoenfeld, 1981) and, to address the second goal, the approach developed in Lachin and Foulkes (1986) is applied. The same methodology is utilized in EAST, SAS as well as R packages (TrialSize and gsDesign).

Calculation of the number of events

Beginning with the problem of computing the target number of events in a trial where the primary endpoint is a time-to-event endpoint, consider first a superiority setting. The treatment effect will be evaluated using the standard log-rank test. Let $d$ denote the total number of events in the two trial arms.
Assuming that there are no ties, let $n_{1k}$ be the number of patients in the control arm who are at risk just before the $k$th event and, similarly, let $n_{2k}$ be the number of patients in the treatment arm who are at risk just before the $k$th event, $k = 1, \ldots, d$. Finally, $I_k = 1$ if the $k$th event occurs in the control arm and 0 otherwise. The test statistic is given by
$$Z = \sum_{k=1}^{d}\left(I_k - \frac{n_{1k}}{n_{1k} + n_{2k}}\right) \bigg/ \sqrt{\sum_{k=1}^{d} \frac{n_{1k} n_{2k}}{(n_{1k} + n_{2k})^2}}.$$
Assuming an upper one-sided alternative, a large value of this test statistic is inconsistent with the null hypothesis of no treatment effect and therefore the null hypothesis is rejected if $Z \geq z_{1-\alpha}$. If a lower one-sided alternative is considered, the null hypothesis of no effect is rejected if $Z \leq -z_{1-\alpha}$.

If there is no censoring, which means that every patient ultimately experiences the event of interest, and the true hazard ratio $\delta$ is close to 1, the power function of the log-rank test can be approximated as follows:
$$\psi(\delta) = \Phi\left(-\frac{\sqrt{dr}}{1+r}\,\log\delta - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta) = \Phi\left(\frac{\sqrt{dr}}{1+r}\,\log\delta - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)},$$

where $\log\delta$ is the natural logarithm of the hazard ratio. As a result, the total number of events in the trial is equal to
$$d = \frac{(1+r)^2}{r}\,\frac{(z_{1-\alpha} + z_{1-\beta})^2}{(\log\delta)^2}.$$
Switching to a clinical trial designed to demonstrate that the treatment is non-inferior to the control, the log-rank test needs to be modified to support a non-inferiority assessment as follows:
$$Z = \sum_{k=1}^{d}\left(I_k - \frac{n_{1k}}{n_{1k} + \gamma n_{2k}}\right) \bigg/ \sqrt{\sum_{k=1}^{d} \frac{\gamma n_{1k} n_{2k}}{(n_{1k} + \gamma n_{2k})^2}}.$$
This implies that the power function of the non-inferiority test is given by
$$\psi(\delta \mid \gamma) = \Phi\left(-\frac{\sqrt{dr}}{1+r}\,\log\frac{\delta}{\gamma} - z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta \mid \gamma) = \Phi\left(\frac{\sqrt{dr}}{1+r}\,\log\frac{\delta}{\gamma} - z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}$$
and thus the total number of events needs to be set to
$$d = \frac{(1+r)^2}{r}\,\frac{(z_{1-\alpha} + z_{1-\beta})^2}{(\log(\delta/\gamma))^2}.$$

Calculation of the number of patients

The approach presented above focuses on finding the target number of primary events and the required number of patients is not explicitly defined. To compute the number of patients to be enrolled into the trial, assumptions on the patient accrual and patient dropout processes need to be made. Suppose that the length of the accrual period is $T_R$ and the total duration of the trial, i.e., the length of time from the enrollment of the first patient to the discontinuation of the last patient, is $T_S$.

Patients can be enrolled into the trial in a uniform fashion or, alternatively, a more general distribution can be introduced to describe the patient accrual. It is common to assume that the accrual is governed by a truncated exponential distribution with the following cumulative distribution function:
$$F(x \mid \tau) = \frac{1 - \exp(-\tau x)}{1 - \exp(-\tau T_R)}, \quad 0 \leq x \leq T_R.$$
Here $\tau$ is the parameter that defines the distribution's shape. With a positive value of $\tau$, patients are initially enrolled at a high rate but the patient accrual slows down towards the end of the trial. On the other hand, if $\tau < 0$, the accrual rate increases over time.
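The Schoenfeld-style event-count formula can be sketched in a few lines of illustrative Python (not Mediana Designer source code). A margin parameter covers the non-inferiority case, where $\log\delta$ is replaced by $\log(\delta/\gamma)$; a margin of 1 gives the superiority calculation:

```python
# Target number of events for a (non-inferiority) log-rank test:
# d = (1+r)^2/r * (z_{1-alpha} + z_{1-beta})^2 / (log(hr/margin))^2.
from math import ceil, log
from statistics import NormalDist

norm = NormalDist()

def target_events(hr, r=1.0, alpha=0.025, beta=0.10, margin=1.0):
    """Rounded-up event count; margin=1 corresponds to a superiority test."""
    z = norm.inv_cdf(1 - alpha) + norm.inv_cdf(1 - beta)
    return ceil((1 + r)**2 / r * z**2 / log(hr / margin)**2)

# Superiority example from later in this section: hr = 2/3, 2:1 randomization
d = target_events(hr=2/3, r=2)
print(d)  # 288
```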
Lastly, in the limiting case $\tau = 0$, this distribution simplifies to a uniform distribution, i.e., $F(x) = x/T_R$.

In addition, there are two sources of censoring in the trial:

Administrative censoring, i.e., a patient reaches the end of the trial without experiencing the primary event.

Censoring due to dropout, i.e., a patient is lost to follow-up before experiencing the primary event.

Assuming an exponential dropout distribution, let $\eta$ denote the common hazard rate of the dropout distribution in the control and treatment arms.

To define the formula for computing the total number of enrolled patients, let
$$\sigma_0 = \sqrt{\frac{(1+r)^2}{r\,\phi(\bar{\lambda})}}, \quad \sigma_1 = \sqrt{\frac{1+r}{\phi(\lambda_1)} + \frac{1+r}{r\,\phi(\lambda_2)}},$$
where
$$\phi(x) = \frac{x}{x+\eta} + \frac{x\,\tau\,\exp(-(x+\eta)T_S)\,\left(1 - \exp((x+\eta-\tau)T_R)\right)}{(x+\eta)(x+\eta-\tau)\left(1 - \exp(-\tau T_R)\right)}$$
and $\bar{\lambda}$ is the average hazard rate, i.e.,
$$\bar{\lambda} = \frac{\lambda_1 + r\lambda_2}{1+r}.$$
Now, assuming a superiority setting, the power function is given by
$$\psi(\delta) = \Phi\left(-\frac{\sqrt{n}}{\sigma_1}\,\log\delta - \frac{\sigma_0}{\sigma_1}\,z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta) = \Phi\left(\frac{\sqrt{n}}{\sigma_1}\,\log\delta - \frac{\sigma_0}{\sigma_1}\,z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}$$
and therefore the total number of enrolled patients is equal to
$$n = \frac{(z_{1-\alpha}\sigma_0 + z_{1-\beta}\sigma_1)^2}{(\log\delta)^2}.$$
With a non-inferiority setting, the power function is defined as follows:
$$\psi(\delta \mid \gamma) = \Phi\left(-\frac{\sqrt{n}}{\sigma_1}\,\log\frac{\delta}{\gamma} - \frac{\sigma_0}{\sigma_1}\,z_{1-\alpha}\right) \quad \text{(upper one-sided alternative)},$$
$$\psi(\delta \mid \gamma) = \Phi\left(\frac{\sqrt{n}}{\sigma_1}\,\log\frac{\delta}{\gamma} - \frac{\sigma_0}{\sigma_1}\,z_{1-\alpha}\right) \quad \text{(lower one-sided alternative)}$$
and the total number of enrolled patients is equal to
$$n = \frac{(z_{1-\alpha}\sigma_0 + z_{1-\beta}\sigma_1)^2}{(\log(\delta/\gamma))^2}.$$

Example

To illustrate the methods for computing the target number of events and sample size in trials with time-to-event endpoints, consider a Phase III trial in patients with metastatic colorectal cancer. A two-arm design (experimental treatment plus best supportive care versus best supportive care) is employed in the trial and patients will be randomized in a 2:1 ratio to the treatment or control ($r = 2$). The primary objective of this trial is to demonstrate that the experimental treatment is superior to the control in terms of overall survival. It is assumed that the median survival times in the control and treatment arms are 6 and 9 months, respectively. The hazard rates corresponding to these median survival times are
$$\lambda_1 = \frac{\log 2}{6} = 0.116, \quad \lambda_2 = \frac{\log 2}{9} = 0.077,$$
and the hazard ratio is $\delta = \lambda_2/\lambda_1 = 0.667$. Assuming 90% power and a one-sided $\alpha = 0.025$, the target number of events in the trial is
$$d = \frac{(1+r)^2}{r}\,\frac{(z_{1-\alpha} + z_{1-\beta})^2}{(\log\delta)^2} = 288.$$
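As a numerical check, the following illustrative Python sketch (not Mediana Designer source code) reproduces both the event count and the enrolled-patient count for this example, using the accrual and dropout assumptions stated below in the text ($T_R = 12$, $T_S = 24$, 5% annual dropout, 9-month median accrual time). Note that, under the sign convention above, the accrual parameter $\tau$ is negative here, since a median accrual time of 9 months within a 12-month accrual period corresponds to an accrual rate that increases over time:

```python
# Lachin-Foulkes enrolled-patient calculation for the colorectal cancer example.
from math import ceil, exp, log, sqrt
from statistics import NormalDist

norm = NormalDist()
r = 2.0
lam1, lam2 = log(2) / 6, log(2) / 9   # hazard rates for 6- and 9-month medians
delta = lam2 / lam1                   # hazard ratio, approximately 0.667
TR, TS = 12.0, 24.0                   # accrual period and total trial duration
tau = -0.203                          # accrual shape: F(9 | tau) = 0.5
eta = -log(0.95) / 12                 # dropout hazard (5% annual dropout rate)

z_a, z_b = norm.inv_cdf(0.975), norm.inv_cdf(0.90)
d = ceil((1 + r)**2 / r * (z_a + z_b)**2 / log(delta)**2)

def phi(x):
    """Probability that a patient with event hazard x contributes an event."""
    return (x / (x + eta)
            + x * tau * exp(-(x + eta) * TS) * (1 - exp((x + eta - tau) * TR))
            / ((x + eta) * (x + eta - tau) * (1 - exp(-tau * TR))))

lam_bar = (lam1 + r * lam2) / (1 + r)
sigma0 = sqrt((1 + r)**2 / (r * phi(lam_bar)))
sigma1 = sqrt((1 + r) / phi(lam1) + (1 + r) / (r * phi(lam2)))
n = ceil((z_a * sigma0 + z_b * sigma1)**2 / log(delta)**2)
print(d, n)  # 288 388
```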

Furthermore, the following assumptions will be made to find the required number of patients in the trial:

The length of the patient accrual period is $T_R = 12$ months and the total length of the trial is $T_S = 24$ months.

The patient accrual is governed by a truncated exponential distribution with a median accrual time of 9 months, which means that 50% of the patients are expected to be enrolled by the 9-month milestone. The corresponding parameter of the truncated exponential distribution is $\tau = -0.203$ (the parameter is negative since the accrual rate increases over time).

The annual dropout rate is 5%, which means that the hazard rate of the exponential dropout distribution is $\eta = -\log 0.95/12 = 0.0043$.

The resulting number of enrolled patients is
$$n = \frac{(z_{1-\alpha}\sigma_0 + z_{1-\beta}\sigma_1)^2}{(\log\delta)^2} = 388.$$

2.5 Bayesian calculations

This section provides a short summary of simulation-based Bayesian calculations aimed at evaluating the probability of success, also known as assurance, in two-arm clinical trials with continuous, binary and time-to-event endpoints. In general, assurance calculations rely on averaging frequentist characteristics such as power with respect to prior distributions of the endpoint parameters, e.g., prior distributions of the true response rates in the control and treatment arms in a trial with a binary endpoint. The prior distributions are derived from historical data, and a simple approach to deriving posterior distributions and carrying out assurance calculations in trials with continuous, binary and time-to-event endpoints is presented below. For more information on the use of assurance in clinical trials and the calculation of posterior distributions, see O'Hagan, Stevens and Campbell (2005), Wang (2015) and Gelman et al. (2013).

It will be assumed throughout this section that the posterior distributions of interest are found using information from a two-arm historical trial with the same experimental treatment and control as in the current trial.
The indices corresponding to the control and treatment arms are i = 1 and i = 2, respectively; e.g., the numbers of patients in the control and treatment arms are denoted by k1 and k2. The historical trial may be either hypothetical, in which case it serves purely as a device for assessing the robustness of power calculations, or based on a real clinical trial. In the latter case it is important to remember that the assumed primary endpoint parameters, e.g., the mean and standard deviation, do not need to be equal to the actual parameters observed in the real trial. The actual parameter values may be replaced by the assumed values that will be utilized for the power calculation in the current trial. To compute assurance, let θ be the vector of endpoint parameters in the control and treatment arms, e.g., θ = (π1, π2) in a clinical trial with a binary primary endpoint, where π1 and π2 are the true values of the proportions in the control and treatment arms, respectively. If the probability of a statistically significant treatment effect in the current trial is denoted by P(θ), assurance is defined as

$$
\int P(\theta) f(\theta)\,d\theta,
$$

where f(θ) is the probability density function of the prior distribution of θ. This prior distribution is equal to the posterior distribution of θ given the data from the

historical trial. This posterior distribution is often derived from a non-informative prior for θ. Within a simulation-based framework, the integral is approximated by

$$
\frac{1}{s}\sum_{i=1}^{s} P(\theta_i),
$$

where θ1, ..., θs are sampled from the prior distribution of θ and s is the number of simulation runs.

Bayesian calculations in trials with continuous endpoints

Suppose that assurance calculations will be run for a clinical trial with a continuous primary endpoint in addition to traditional frequentist calculations. The endpoint follows a normal distribution with the parameters (μ1, σ1²) in the control arm and (μ2, σ2²) in the treatment arm. The joint distribution of the endpoint parameters in each trial arm is defined using the following two-step algorithm:

- The variance σi² follows a scaled inverse chi-square distribution with the degrees of freedom νi and scale parameter τi², i = 1, 2.
- The mean μi, conditional on the variance σi², follows a normal distribution with a pre-specified prior mean and variance σi²/κi, where κi is a pre-set scaling parameter, i = 1, 2.

A non-informative prior distribution can be defined by setting κi to 0, νi to −1 and τi² to 0, i = 1, 2. Consider the ith trial arm, i = 1, 2, and let mi and si denote the assumed mean and standard deviation that will be utilized in the power calculation in the current trial. These endpoint parameters are treated as if they were observed in the historical trial, i.e., these values are used to find the posterior distribution of the true means and standard deviations (as explained above, the assumed values may not be equal to the actual means and standard deviations in the historical trial). The posterior distribution of the endpoint parameters in the ith trial arm of the historical trial is then derived using the two-step approach described above.
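The two-step algorithm above translates directly into a sampling routine. The fragment below is an illustrative NumPy sketch (not Mediana Designer code); the hyperparameter values are hypothetical and correspond to the posterior implied by an arm of k = 100 patients with an assumed mean of 0.12 and standard deviation of 0.45 (ν = k − 1, τ² = s², prior mean m, κ = k, as derived in the next paragraph):

```python
import numpy as np

rng = np.random.default_rng(1)

def two_step_sample(nu, tau2, m0, kappa, size):
    """Draw (mu, sigma^2) using the two-step algorithm:
    sigma^2 ~ scaled inverse chi-square(nu, tau2), then
    mu | sigma^2 ~ normal with mean m0 and variance sigma^2 / kappa."""
    # A scaled inverse chi-square draw is nu * tau2 divided by a chi-square draw
    sigma2 = nu * tau2 / rng.chisquare(nu, size=size)
    mu = rng.normal(m0, np.sqrt(sigma2 / kappa))
    return mu, sigma2

# Hypothetical posterior hyperparameters for one arm of a historical trial
mu, sigma2 = two_step_sample(nu=99, tau2=0.45 ** 2, m0=0.12, kappa=100, size=50000)
# The marginal posterior of mu is centered at m0; its spread shrinks as kappa grows
print(round(float(mu.mean()), 2), round(float(np.sqrt(sigma2).mean()), 2))
```

With a large historical sample size the draws cluster tightly around the assumed mean and standard deviation, which is why assurance approaches frequentist power in that case.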
In particular, it can be shown that the posterior variance σi² follows a scaled inverse chi-square distribution with the degrees of freedom parameter given by ki − 1 and scale parameter si². Furthermore, the posterior mean μi, conditional on the posterior variance σi², is normally distributed with the mean mi and variance σi²/ki. The posterior distributions of the endpoint parameters derived from the historical trial will be used as the prior distributions when evaluating assurance in the current trial. An important feature of this approach to performing Bayesian calculations is that it depends only on the number of patients in each arm of the historical trial. If the sample size in the historical trial is large, the marginal posterior distributions will be tightly clustered around the assumed values of the endpoint parameters, i.e., around m1 and s1 in the control arm and around m2 and s2 in the treatment arm, and assurance will be reasonably close to frequentist power.

Bayesian calculations in trials with binary endpoints

Consider a two-arm historical trial with a binary endpoint and let π1 and π2 denote the true values of the proportions in the control and treatment arms, respectively. Conjugate distributions will be assumed for these proportions, i.e., πi will be assumed to follow a beta distribution with the shape parameters αi and βi, i = 1, 2.

The assumed values of the proportions to be used in the power calculations in the current trial are denoted by p1 and p2. Treating these values as if they were observed in the historical trial and assuming non-informative priors for the true proportions, i.e., αi = 1, βi = 1, i = 1, 2, it is easy to derive the parameters of the posterior distributions of π1 and π2 in the historical trial. The posterior distribution of the true proportion πi is also a beta distribution with the shape parameters

αi = 1 + pi ki,  βi = 1 + (1 − pi) ki,  i = 1, 2.

The resulting posterior distributions of the true proportions will be used as the prior distributions for computing assurance in the current trial.

Bayesian calculations in trials with time-to-event endpoints

Consider a two-arm historical trial with the same time-to-event endpoint as in the current trial, assume that the time to the event of interest is exponentially distributed, and let λ1 and λ2 denote the true hazard rates in the control and treatment arms, respectively. Using a conjugate distribution approach, it will be assumed that λi follows a gamma distribution with the shape parameter αi and rate parameter βi, i = 1, 2. An improper non-informative prior with αi = 0 and βi = 0 will be assumed for the true hazard rate λi, i = 1, 2, in the historical trial. Let l1 and l2 denote the assumed hazard rates in the control arm and treatment arm of the current trial, respectively. If the observed hazard rates in the historical trial are set to these assumed values, the posterior distribution of λi in the historical trial is a gamma distribution with the following shape and rate parameters:

αi = ki,  βi = ki/li,  i = 1, 2.

As above, the gamma distributions with these parameters will be used as the prior distributions for the hazard rates when performing assurance calculations in the current trial.

Example

Consider a development program for the treatment of rheumatoid arthritis.
The primary endpoint in the Phase II and III trials included in this program is binary (ACR20 definition of improvement) and a higher response rate indicates a beneficial treatment effect. Suppose that the sample size in a Phase III trial will be computed assuming that the control response rate is π1 = 0.3 and the treatment response rate is π2 = 0.5. These response rates are based on the results observed in a recently conducted Phase II trial, which will serve as the historical trial. Using a one-sided α = 0.025 and the Z-test for proportions, it is easy to check that the total number of patients needs to be set to 242 to ensure 90% power. To support assurance calculations in the Phase III trial, it will be assumed that the true response rates in the two trial arms follow beta distributions. Non-informative priors will be considered in the historical trial to compute the posterior distributions for the response rates that will be ultimately utilized in the assurance calculation. The non-informative beta priors are defined using the following set of

shape parameters: α1 = 1, β1 = 1 (control arm) and α2 = 1, β2 = 1 (treatment arm). Assuming a balanced design in the Phase II trial, suppose that the sample size per arm is k1 = k2 = 50 patients. To compute the posterior distributions of the true response rates, the observed response rates in the Phase II trial are assumed to be equal to p1 = 0.3 and p2 = 0.5. The posterior distributions are beta distributions with the following parameters: α1 = 16, β1 = 36 (control arm) and α2 = 26, β2 = 26 (treatment arm). These posterior distributions are plotted in Figure 1. It can be seen from this figure that the posterior distributions of the true response rates are centered around the assumed values (p1 = 0.3 and p2 = 0.5) and demonstrate a fairly high amount of variability due to the fact that the estimated response rates are obtained from a relatively small historical trial. The posterior distributions now serve as the prior distributions for π1 and π2 in the Phase III trial. A simulation-based algorithm can now be applied to generate samples from the prior distributions, compute response rates in the control and treatment arms and ultimately evaluate the significance of the treatment effect in each simulation run using the Z-test for proportions. By averaging over 10,000 runs, the assurance is estimated to be 73.7%. This value is lower than 90%, which is the target for frequentist power, since the assurance calculation accounts for the uncertainty around the assumed response rates, i.e., p1 = 0.3 and p2 = 0.5. If the response rates came from a larger Phase II trial, assurance would be closer to frequentist power. For example, if the sample size per arm in the Phase II trial is k1 = k2 = 100 patients, assurance increases to 79.2%.

Figure 1 Posterior distributions of the true response rates in the Phase II trial (solid curve, control arm; dashed curve, treatment arm).
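The assurance calculation in this example can be reproduced with a short Monte Carlo sketch. The code below is illustrative NumPy/SciPy code, not the Mediana Designer implementation; the pooled-variance form of the Z-test for proportions is one common choice, and simulation error of roughly a percentage point is expected:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_arm = 121               # 242 patients in total, balanced design
z_crit = norm.ppf(0.975)  # one-sided alpha = 0.025
s = 10000                 # number of simulation runs

# Priors for the true response rates = posteriors from the Phase II trial
pi1 = rng.beta(16, 36, size=s)  # control arm
pi2 = rng.beta(26, 26, size=s)  # treatment arm

# Simulate the Phase III trial and apply the Z-test for proportions
x1 = rng.binomial(n_arm, pi1)
x2 = rng.binomial(n_arm, pi2)
p_pool = (x1 + x2) / (2 * n_arm)
se = np.sqrt(np.maximum(p_pool * (1 - p_pool), 1e-12) * 2 / n_arm)
z = (x2 / n_arm - x1 / n_arm) / se
print(round(float((z > z_crit).mean()), 3))  # close to the 73.7% reported above
```

Increasing the historical sample size to k1 = k2 = 100 (posteriors Beta(31, 71) and Beta(51, 51)) tightens the priors and moves the estimate toward the frequentist power, in line with the 79.2% figure quoted in the text.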

3 TraditionalSimulations module: Simulation-based calculations in fixed-sample trials

3.1 Introduction

This section introduces the simulation-based Clinical Scenario Evaluation approach in clinical trials with a traditional (fixed-sample) design. The TraditionalSimulations module relies on this approach to support efficient clinical trial simulations for trials with continuous, binary or time-to-event endpoints. Simulations are run to compute power for a given sample size or a given number of events. In addition, Bayesian calculations can be performed to evaluate assurance (probability of success) for a given sample size or a given number of events. Simulation-based calculations in clinical trials with a fixed-sample design are also supported in the Mediana package (Paux and Dmitrienko, 2019).

3.2 Clinical Scenario Evaluation approach

The Clinical Scenario Evaluation (CSE) approach implemented in the TraditionalSimulations module supports clinical trial simulations in a large number of settings, including clinical trials with multiple arms and multiple patient populations. The CSE framework was introduced to facilitate the process of evaluating the operating characteristics of multiple candidate analysis methods under several sets of treatment effect assumptions. For more information on the Clinical Scenario Evaluation framework, see Benda et al. (2010), Friede et al. (2010) and Dmitrienko and Pulkstenis (2017). The CSE approach decomposes the complex problem of examining a large number of options by identifying the main components of the evaluation process. These components are termed models, i.e., data models (also known as assumptions), analysis models (also known as options) and evaluation models (also known as metrics). The data, analysis and evaluation models are defined as follows:

- Data models define the process of generating trial data (e.g., sample sizes, outcome distributions).
- Analysis models define the statistical methods applied to the trial data (e.g., statistical tests, multiplicity adjustments).
- Evaluation models specify the measures for evaluating the performance of the chosen statistical methods (success criteria such as disjunctive or weighted power).

Each combination of the data and analysis models is referred to as a clinical scenario. An important feature of the general CSE approach is that it enables clinical trial sponsors to transition from basic sample size calculations to clinical trial optimization, i.e., to a comprehensive evaluation of applicable trial designs and analysis strategies based on the selected evaluation criteria, identifying the parameter configurations that lead to optimal performance. A detailed review of CSE-based clinical trial optimization approaches is provided in Dmitrienko and Pulkstenis (2017).
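The three-way decomposition can be sketched as a small simulation harness. The class and function names below are hypothetical (they are not the actual TraditionalSimulations data structures); the sketch simply shows how a data model, an analysis model and an evaluation model plug together to evaluate one clinical scenario:

```python
from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np
from scipy.stats import norm

@dataclass
class DataModel:                 # assumptions: how trial data are generated
    samples: Dict[str, Callable[[np.random.Generator], np.ndarray]]

@dataclass
class AnalysisModel:             # options: tests returning one-sided p-values
    tests: Dict[str, Callable[[Dict[str, np.ndarray]], float]]

@dataclass
class EvaluationModel:           # metrics: criteria applied to significance flags
    criteria: Dict[str, Callable[[np.ndarray], float]]

def run_scenario(data, analysis, evaluation, runs, alpha, seed=0):
    """Evaluate one clinical scenario (data model + analysis model)."""
    rng = np.random.default_rng(seed)
    flags = []
    for _ in range(runs):
        trial = {name: gen(rng) for name, gen in data.samples.items()}
        flags.append([test(trial) < alpha for test in analysis.tests.values()])
    flags = np.asarray(flags)    # runs x tests matrix of significance indicators
    return {name: crit(flags) for name, crit in evaluation.criteria.items()}

# Toy scenario: two normal samples with known standard deviation 1
def z_test(trial):
    x, y = trial["control"], trial["treatment"]
    z = (y.mean() - x.mean()) / np.sqrt(1 / len(x) + 1 / len(y))
    return float(1 - norm.cdf(z))

result = run_scenario(
    DataModel({"control": lambda r: r.normal(0.0, 1.0, 100),
               "treatment": lambda r: r.normal(0.5, 1.0, 100)}),
    AnalysisModel({"Test 1": z_test}),
    EvaluationModel({"marginal power": lambda f: float(f[:, 0].mean())}),
    runs=2000, alpha=0.025,
)
print(result)
```

Swapping in a different data model (a new clinical scenario) or a different analysis model leaves the rest of the harness untouched, which is the practical benefit of the CSE decomposition.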

3.3 Data, analysis and evaluation models

This section defines the data, analysis and evaluation models supported by the TraditionalSimulations module. The current version of the TraditionalSimulations module allows the user to specify a single set of treatment effect assumptions, i.e., a single data model, and a single analysis model. The key concepts introduced in this section will be illustrated in Section 3.4.

Data model

A data model is defined as a collection of samples, where samples are defined as mutually exclusive groups of patients, for example, treatment arms in a trial or subsets of the trial's population. The user needs to specify the parameters of the primary endpoint's distribution in each sample and these parameters are used when generating patient data within each sample. The data can also be generated using the approach defined in Section 2.5, i.e., a prior distribution can be assigned to each parameter of the endpoint's distribution. This prior distribution is defined as the posterior distribution derived from a historical trial with the same experimental treatment or control and a given number of patients or events. By assigning prior distributions to the endpoint parameters, the user can perform sensitivity assessments to determine the impact of potential deviations from the original treatment effect assumptions on the trial's operating characteristics. In addition, the user can specify the patient accrual and patient dropout processes that apply to all samples defined in the data model. If the patient accrual or dropout parameters are not specified, these processes will not be modeled, e.g., with time-to-event endpoints, no censoring will be applied and all patients will reach the endpoint of interest. The following parameters need to be specified to define the patient accrual process:

- Length of the accrual period.
- Length of the follow-up period if the primary endpoint is continuous or binary (if the primary endpoint is a time-to-event endpoint, the length of the patient follow-up is determined by the target number of events).
- Patient accrual distribution. The patient accrual can follow a uniform distribution or a truncated exponential distribution. The truncated exponential distribution is defined in Section 2.2. The shape of this distribution is determined by the τ parameter and, for convenience, the user can specify the median accrual time, in which case the corresponding value of τ is found by Mediana Designer. The median accrual time is defined as the time point by which 50% of the patients are enrolled into the trial. For example, if the median accrual time is 6 months, 50% of the trial's total sample size will be enrolled into the trial by the 6-month time point.

If the patient dropout process is modeled, the dropout distribution is assumed to be uniform for continuous or binary endpoints and exponential for time-to-event endpoints. With continuous and binary endpoints, patients who drop out of the trial before completing the treatment period are not included in the power or assurance calculations and, with time-to-event endpoints, the corresponding time to the event of interest is censored if a patient drops out of the trial. The algorithm for generating patient data when the patient accrual and dropout processes are specified is defined in the Appendix (see Section 5.3).
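The mapping from a median accrual time to τ, and inverse-CDF sampling of enrollment times, can be sketched as follows. This is illustrative code using SciPy's root finder; Mediana Designer performs the equivalent calculation internally:

```python
from math import exp

import numpy as np
from scipy.optimize import brentq

def accrual_tau(median_time, T_R):
    """Solve for the truncated exponential parameter tau such that 50% of the
    patients are enrolled by median_time within the accrual period [0, T_R]."""
    def f(tau):
        return (1 - exp(-tau * median_time)) / (1 - exp(-tau * T_R)) - 0.5
    # median_time < T_R/2 gives tau > 0 (front-loaded accrual);
    # median_time > T_R/2 gives tau < 0; median_time = T_R/2 is uniform accrual
    if median_time < T_R / 2:
        return brentq(f, 1e-9, 5.0)
    return brentq(f, -5.0, -1e-9)

tau = accrual_tau(median_time=9, T_R=12)
print(round(tau, 3))  # about -0.203, matching the example in Section 2.4

# Inverse-CDF sampling of enrollment times over the accrual period
rng = np.random.default_rng(3)
u = rng.uniform(size=100000)
t = -np.log(1 - u * (1 - exp(-tau * 12))) / tau
print(round(float(np.median(t)), 1))  # sample median close to 9 months
```

The sampled enrollment times can then be combined with simulated event and dropout times to determine which observations are censored.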

Analysis model

An analysis model is a collection of tests. Tests are defined as statistical methods applied to the samples included in the data model and help evaluate the treatment effects based on the patient data generated from the data model. The following tests are supported by the TraditionalSimulations module:

- Continuous endpoints: Superiority or non-inferiority Z-tests.
- Binary endpoints: Superiority or non-inferiority Z-tests for proportions.
- Time-to-event endpoints: Superiority or non-inferiority log-rank tests.

If a non-inferiority test is requested, the user needs to specify the non-inferiority margin. The three tests are defined in Sections 2.2, 2.3 and 2.4, respectively. To assess the significance of treatment effects, the samples that define the control and treatment arms are specified for each test and each test generates a one-sided p-value. An important feature of Mediana Designer is that samples can be merged to perform advanced comparisons of the control and treatment arms in the analysis model. This feature is often used in clinical trials with several patient populations, e.g., the overall patient population and one or more subsets of the overall population. An example is provided in Section 3.4.

Multiplicity adjustments

If more than one test is defined in the analysis model, the user can specify any number of multiplicity adjustments to control the overall Type I error rate. The multiplicity adjustments are based on the commonly used multiple testing procedures listed below. The multiplicity adjustments do not have to apply to all tests defined in the analysis model; in fact, the user can choose the group of tests to which each individual multiplicity adjustment will be applied. Each multiple testing procedure generates a set of one-sided multiplicity-adjusted p-values. In addition, the unadjusted inferences based on the original one-sided p-values produced by the selected tests are always performed.
They serve as a reference point for the multiplicity-adjusted inferences. The following multiple testing procedures are supported by the TraditionalSimulations module:

- Bonferroni procedure.
- Holm procedure.
- Hochberg procedure.
- Hommel procedure.
- Fixed-sequence procedure.
- Family of chain procedures.

The testing algorithms used in these multiplicity adjustments are defined in the Appendix (see Section 5.1).

Evaluation model

An evaluation model is a collection of evaluation/success criteria that are applied to the tests defined in the analysis model. These criteria are applied to assess the performance of a single test or a group of tests. The one-sided significance level needs to be selected for each criterion and, as with multiplicity adjustments, the user can select the group of tests to which each individual criterion should be applied.
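Returning to the multiplicity adjustments above, the step-up logic of the Hochberg procedure makes a compact concrete example. The sketch below is illustrative code (the module's own testing algorithms are described in Section 5.1 of the Appendix):

```python
import numpy as np

def hochberg_adjust(p):
    """Compute Hochberg multiplicity-adjusted p-values for a family of m tests."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)  # indices of the p-values in ascending order
    adj = np.empty(m)
    running = 1.0
    # Step down from the largest p-value: the i-th smallest p-value (0-based)
    # is multiplied by (m - i) and capped by the running minimum to keep the
    # adjusted p-values monotone
    for i in range(m - 1, -1, -1):
        running = min(running, (m - i) * p[order[i]])
        adj[order[i]] = running
    return adj

print(hochberg_adjust([0.01, 0.04]))  # [0.02 0.04]
```

With two equally weighted hypotheses, as in the example of Section 3.4, the larger p-value is left unchanged and the smaller one is doubled unless the larger p-value is already smaller.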

The user can specify any number of success criteria. The available criteria include marginal power (probability that a single test is significant) as well as the following composite criteria for a group of tests:

- Disjunctive power (probability that at least one test in the selected group of tests is significant).
- Conjunctive power (probability that all tests in the selected group are significant).
- Weighted power (weighted sum of the marginal power values for the selected group of tests).

These success criteria are defined in the Appendix (see Section 5.2).

3.4 Example

The following example will be used to illustrate the CSE-based approach to evaluating the operating characteristics of clinical trials implemented in the TraditionalSimulations module. This example deals with a Phase III clinical trial in patients with mild or moderate asthma (it is based on a clinical trial example from Millen, Dmitrienko and Song, 2014). The trial is intended to support a tailoring strategy. In particular, the treatment effect of a single dose of a new treatment will be compared to that of placebo in the overall population of patients as well as in a pre-specified subpopulation of patients with a marker-positive status at baseline. Marker-positive patients are more likely to receive benefit from the experimental treatment. The overall objective of the clinical trial accounts for the fact that the treatment's effect may, in fact, be limited to the marker-positive subpopulation. The trial will be declared successful if the treatment's beneficial effect is established in the overall population of patients or, alternatively, the effect is established only in the subpopulation. Finally, the primary endpoint in the clinical trial is defined as an increase from baseline in the forced expiratory volume in one second (FEV1). This endpoint is normally distributed and improvement is associated with a larger change in FEV1.
Data model

To set up a data model for this clinical trial, four mutually exclusive groups of patients are defined as follows:

- Sample 1: Marker-negative patients in the placebo arm.
- Sample 2: Marker-positive patients in the placebo arm.
- Sample 3: Marker-negative patients in the treatment arm.
- Sample 4: Marker-positive patients in the treatment arm.

Using this definition of samples, the user can model the fact that the treatment's effect is most pronounced in patients with a marker-positive status. The treatment effect assumptions in the four samples are summarized in Table 1 (FEV1 is measured in liters). As shown in the table, the mean change in FEV1 is constant across the marker-negative and marker-positive subpopulations in the placebo arm (Samples 1 and 2). A positive treatment effect is expected within both subpopulations in the treatment arm but marker-positive patients (Sample 4) will experience a greater beneficial effect compared to marker-negative patients (Sample 3). The total sample size in the trial is set to 330 patients. The sizes of the individual samples are computed based on historical information, namely, based on the assumption that 40% of the patients in the trial will have a marker-positive status.

TABLE 1 Sample sizes and treatment effect assumptions in the data model

Sample      Number of patients   Mean   Standard deviation
Sample 1    99                   0.12   0.45
Sample 2    66                   0.12   0.45
Sample 3    99                   0.24   0.45
Sample 4    66                   0.30   0.45

Analysis model

The analysis model defines the treatment effect tests in the overall population as well as in the subpopulation of marker-positive patients. The primary endpoint follows a normal distribution and the treatment effect will be assessed using the two-sample Z-test. A key feature of the analysis strategy in this case study is that the samples defined in the data model are different from the samples used in the analysis of the primary endpoint. As shown in Table 1, four samples are included in the data model. However, from the analysis perspective, the user is interested in examining the treatment effect in two populations, namely, in the overall population and in the marker-positive subpopulation. This means that some of the data samples need to be merged to define the following two tests in the analysis model:

- Test 1: Treatment effect test in the overall population. The superiority Z-test is applied to the following two analysis samples:
  - Placebo arm: Samples 1 and 2 are merged.
  - Treatment arm: Samples 3 and 4 are merged.
- Test 2: Treatment effect test in the marker-positive subpopulation. The superiority Z-test is applied to the following two analysis samples:
  - Placebo arm: Sample 2.
  - Treatment arm: Sample 4.

In addition, since two null hypotheses are tested in this trial, i.e., the null hypothesis of no effect in the overall population of patients and the null hypothesis of no effect in the subpopulation, a multiplicity adjustment needs to be applied. The Hochberg procedure with equally weighted null hypotheses will be included in the analysis model to perform a multiplicity correction.
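The two tests defined above can be simulated directly. The sketch below is illustrative NumPy/SciPy code, not the TraditionalSimulations engine: it generates the sample means of the four data samples from Table 1, merges them for Test 1, and approximates the unadjusted operating characteristics reported in the simulation results later in this section. The known-variance form of the Z-statistic is used as a simplification:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
runs = 100000
z_crit = norm.ppf(0.975)   # one-sided alpha = 0.025
sd = 0.45
n_neg, n_pos = 99, 66      # marker-negative / marker-positive patients per arm

def sample_mean(mu, n):
    """Simulated sample mean of a normal sample of size n (known sd)."""
    return rng.normal(mu, sd / np.sqrt(n), size=runs)

m1 = sample_mean(0.12, n_neg)   # Sample 1: marker-negative, placebo
m2 = sample_mean(0.12, n_pos)   # Sample 2: marker-positive, placebo
m3 = sample_mean(0.24, n_neg)   # Sample 3: marker-negative, treatment
m4 = sample_mean(0.30, n_pos)   # Sample 4: marker-positive, treatment

# Test 1: overall population (Samples 1 and 2 merged vs Samples 3 and 4 merged)
n_all = n_neg + n_pos
pl = (n_neg * m1 + n_pos * m2) / n_all
tr = (n_neg * m3 + n_pos * m4) / n_all
z1 = (tr - pl) / (sd * np.sqrt(2 / n_all))

# Test 2: marker-positive subpopulation (Sample 2 vs Sample 4)
z2 = (m4 - m2) / (sd * np.sqrt(2 / n_pos))

power1 = float((z1 > z_crit).mean())                   # marginal power, Test 1
power2 = float((z2 > z_crit).mean())                   # marginal power, Test 2
disj = float(((z1 > z_crit) | (z2 > z_crit)).mean())   # disjunctive power
print(round(power1, 3), round(power2, 3), round(disj, 3))
```

The estimates should land close to the unadjusted values of 0.827, 0.634 and 0.869 shown in the simulation results; applying the Hochberg adjustment to the per-run p-values would reproduce the adjusted rows as well.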
Evaluation model The following success criteria are defined in the evaluation model: Marginal power: Probability of a significant outcome in each patient population (i.e., the probability that each test defined in the analysis model is significant). Disjunctive power: Probability of a significant treatment effect in the overall population or marker-positive subpopulation (i.e., the probability that either Test 1 or Test 2 is significant). This criterion defines the overall probability of success in this clinical trial. Conjunctive power: Probability of simultaneously achieving significance in the overall population and marker-positive subpopulation (i.e., the probability that both tests are significant). Simulation results A summary of the operating characteristics of the selected trial is provided in Table 2. The success criteria defined in the evaluation model were evaluated using

100,000 simulation runs at a one-sided α = 0.025.

TABLE 2 Summary of the operating characteristics

Success criterion    Applied to test(s)    Multiplicity adjustment   Value
Marginal power       Test 1                No adjustment             0.827
Marginal power       Test 2                No adjustment             0.634
Disjunctive power    Test 1 and Test 2     No adjustment             0.869
Conjunctive power    Test 1 and Test 2     No adjustment             0.593
Marginal power       Test 1                Hochberg procedure        0.781
Marginal power       Test 2                Hochberg procedure        0.618
Disjunctive power    Test 1 and Test 2     Hochberg procedure        0.806
Conjunctive power    Test 1 and Test 2     Hochberg procedure        0.593

Table 2 displays the values of the selected success criteria based on the unadjusted and Hochberg-adjusted p-values produced by Tests 1 and 2. The unadjusted power values serve as a benchmark to assess the magnitude of the multiplicity correction based on the Hochberg procedure. It follows from the table that the multiplicity-adjusted overall probability of success in the trial (disjunctive power) is 80.6%. The adjusted marginal probability of success (marginal power) is fairly high in the overall population (78.1%) and lower in the marker-positive subpopulation (61.8%), which is mainly due to the fact that only 40% of the patients are marker-positive. In general, it is difficult to establish a significant treatment effect simultaneously in the two populations but, in this particular case, conjunctive power is reasonably high (59.3%). The simulation results shown in Table 2 perfectly match those generated using the Mediana package.

4 GroupSequential module: Analytical and simulation-based calculations in group-sequential trials

4.1 Introduction

This section discusses analytical and simulation-based evaluation of operating characteristics in two-arm group-sequential trials with continuous, binary or time-to-event primary endpoints. An analytical approach is employed to compute standard operating characteristics in a group-sequential trial without patient accrual or dropout modeling.
The analytical evaluation of the trial's operating characteristics is complemented by a simulation-based approach that enables the user to model the patient accrual or dropout processes and perform sensitivity assessments. Analytical and simulation-based calculations in group-sequential trials are also supported by the gsDesign package, the SEQDESIGN procedure in SAS, and EAST.

4.2 Group-sequential trials

Consider a parallel-group clinical trial which is conducted to evaluate the efficacy and safety of an experimental treatment compared to placebo. It is assumed that there are m decision points in the trial. The first m − 1 decision points represent interim looks at which the trial may be stopped due to superior efficacy or futility based on the available data, and the last decision point is the final analysis. The current version of the GroupSequential module supports group-sequential trials