A Simple, Graphical Procedure for Comparing Multiple Treatment Effects

Brennan S. Thompson and Matthew D. Webb

May 15, 2015

<<< PRELIMINARY AND INCOMPLETE >>>

Abstract

In this paper, we utilize a new graphical procedure to show how multiple treatment effects can be compared while controlling the familywise error rate (the probability of finding one or more spurious differences between the parameters of interest). Monte Carlo simulations suggest that this procedure adequately controls the familywise error rate in finite samples, and has average power nearly identical to that of a simple max-t procedure. We illustrate our proposed approach using data from a field experiment on different types of performance pay for teachers.

Keywords: multiple comparisons; familywise error rate; treatment effects; bootstrap

Brennan S. Thompson: Department of Economics, Ryerson University. E-mail: brennan@ryerson.ca
Matthew D. Webb: Department of Economics, University of Calgary. E-mail: matthewdwebb@gmail.com
1 Introduction

In the case of comparing multiple treatments, the problem at hand is two-fold: (A) we want to know whether or not the effect of each treatment is different from zero, and (B) we want to know whether or not the effect of each treatment is different from that of any of the other treatments. In order to make the discussion of our problem more concrete, consider the following regression model:

Y_t = β_0 C_t + Σ_{i=1}^{k} β_i T_{i,t} + Z_t′δ + U_t,    t = 1, …, n,    (1.1)

where C_t equals one if individual t belongs to the control group and zero otherwise, T_{i,t} equals one if individual t belongs to treatment group i ∈ {1, …, k} and zero otherwise, Z_t is a vector of other characteristics for individual t, and U_t is an idiosyncratic error term for individual t. The (average marginal) treatment effect of the i-th treatment is defined as α_i ≡ β_i − β_0, for i ∈ {1, …, k}. Thus, the first part of our problem involves testing

α_i = 0, for each i ∈ {1, …, k},    (1.2)

while the second part of our problem involves testing

α_i = α_j, for each unique (i, j) ∈ {1, …, k} × {1, …, k}.    (1.3)

Note that, since α_i = 0 is equivalent to β_i = β_0, and α_i = α_j is equivalent to
β_i = β_j, our problem boils down to testing

β_i = β_j, for each unique (i, j) ∈ K × K,    (1.4)

where K = {0, …, k}. Thus, our problem can be seen to involve making a total of (k+1 choose 2) comparisons: the k comparisons implicit in (1.2), plus the (k choose 2) comparisons implicit in (1.3). For example, with k = 2 treatments, we must make 3 comparisons: the two comparisons of the treatment effects to zero, and the single comparison between the two treatment effects. With k = 3 treatments, we must make 6 comparisons, and so on.

It is well known that, when conducting more than one hypothesis test at a given nominal level simultaneously, the probability of rejecting at least one true hypothesis (i.e., the familywise error rate) is often well in excess of that nominal level. To illustrate the severity of this issue, we generate, for k ∈ {2, …, 5}, one million samples of size n = 100(k + 1) from the model in (1.1) as follows. We set β_0 = ⋯ = β_k = 0, and assign 100 observations to the control group (i.e., Σ_{t=1}^{n} C_t = 100) and 100 observations to each of the k treatment groups (i.e., Σ_{t=1}^{n} T_{i,t} = 100 for each i ∈ {1, …, k}). For each t, Z_t = 0 and U_t is an independent standard normal draw. Within each sample, we independently test (A) each of the k restrictions in (1.2), and (B) each of the (k choose 2) restrictions in (1.3), using conventional t-tests at the 5% nominal level.

The rejection frequencies for these tests are shown in Figure 1. Specifically, the dash-dotted line shows the frequency of rejecting at least one of the k restrictions in (1.2), while the dashed line shows the frequency of rejecting at least one of the (k choose 2) restrictions in (1.3). The solid line shows the empirical familywise error rate (i.e., the frequency of rejecting at least one of the (k+1 choose 2) restrictions in (1.4)). Note that, even with k = 2, the empirical familywise error rate is approximately 0.122; with k = 5, it is approximately 0.364.
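The size of this distortion can be reproduced with a few lines of code. The following is a minimal sketch of the simulation just described, under simplifying assumptions: each β_i is estimated by a group mean, the error variance is treated as known (so 1.96 serves as the 5% critical value), and far fewer replications are used than the one million in the text.

```python
import numpy as np

def empirical_fwer(k=2, n_per_group=100, reps=2000, seed=0):
    """Frequency of at least one false rejection among all pairwise
    comparisons of k+1 groups whose true means are all zero."""
    rng = np.random.default_rng(seed)
    crit = 1.96                       # approximate 5% two-sided critical value
    se = 1.0 / np.sqrt(n_per_group)   # known unit error variance, for simplicity
    rejections = 0
    for _ in range(reps):
        # k + 1 groups (control plus k treatments), all with true mean zero
        means = rng.standard_normal((k + 1, n_per_group)).mean(axis=1)
        # all (k+1 choose 2) pairwise t-tests, as in (1.4)
        any_reject = any(
            abs(means[i] - means[j]) / (se * np.sqrt(2)) > crit
            for i in range(k + 1) for j in range(i + 1, k + 1)
        )
        rejections += any_reject
    return rejections / reps
```

With k = 2 this yields an empirical familywise error rate near the 0.122 reported above, and the rate grows quickly with k.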
Figure 1: Empirical Familywise Error Rates for Independent t-tests

In recognition of this issue, a wide variety of multiple testing procedures, such as max-t procedures (see, e.g., Romano and Wolf, 2005, and Section 3 below), have been developed to control the familywise error rate (or other generalized error rates, such as the false discovery rate; see Benjamini and Hochberg, 1995). For the specific problem of making multiple pairwise comparisons that we are interested in here, Bennett and Thompson (2015), hereafter BT, have recently proposed a graphical procedure that identifies differences between a pair of parameters through the non-overlap of the so-called uncertainty intervals (see Section 2) for those parameters.¹ This graphical method, which can be seen as a resampling-based generalization of Tukey's (1953) method, is appealing because it offers users more than a binary ("yes"/"no") decision regarding differences between parameters. Indeed, this method allows users to assess both the statistical and the practical significance of pairwise differences, while also providing them with a measure of uncertainty concerning the locations of

¹ Note that BT denote the total number of parameters by k, while the total number of parameters we consider is k + 1.
the individual parameters.

The remainder of this paper is organized as follows. In Section 2, we provide a brief overview of the procedure of BT. Section 3 describes the results of a set of Monte Carlo simulations designed to examine the finite-sample performance of this procedure in the current context. In Section 4, we present the results of an empirical example, using data from a field experiment in India, in which the different treatments are different types of performance pay for teachers.

2 The Overlap Procedure

In this section, we briefly summarize how the graphical procedure of BT can be applied to the problem at hand. For complete details, including proofs of the main results, see BT.

The parameters of interest, β_i = β_i(P), i ∈ K, are unknown but are presumed to be consistently estimable from observable data generated by some (unknown) probability mechanism P. That is, for each i ∈ K, we have available to us a √n-consistent estimator β̂_{n,i} of β_i. The overlap procedure presents each parameter estimate β̂_{n,i} together with its corresponding uncertainty interval,

C_{n,i}(γ) = [β̂_{n,i} ± γ · se(β̂_{n,i})],    (2.1)

whose length is determined by the parameter γ (discussed below) and se(β̂_{n,i}), the standard error of β̂_{n,i}. In what follows, we denote the lower and upper endpoints of the interval C_{n,i} by L_{n,i} and U_{n,i}, respectively.

The uncertainty intervals can then be used to make inferences about the ordering
of the parameters of interest as follows: we infer that β_i < β_j if β̂_{n,i} < β̂_{n,j} and the uncertainty intervals for β_i and β_j are non-overlapping.

Note that, as a function of γ, the probability of declaring at least one significant difference between parameters, when all k + 1 parameters are equal, is

Q_n(γ; P) = Prob_P [ max_{i,j ∈ K} { L_{n,i}(γ) − U_{n,j}(γ) } > 0 ].    (2.2)

Thus, in order to control the familywise error rate at nominal level α, the ideal choice of γ when all k + 1 parameters are equal is

γ_n(α) = inf { γ : Q_n(γ; P) ≤ α }.    (2.3)

Unfortunately, because P is unknown, we cannot compute Q_n(γ; P), and γ_n(α) is thus infeasible. We therefore turn our attention to estimating γ_n(α) so as to achieve at least asymptotic control of the familywise error rate.

Constructing a feasible counterpart to γ_n(α) of course requires that we first formulate an empirical estimator of Q_n(γ; P). Towards this end, let P̂_n (typically the empirical distribution) be an estimate of P. The idea is to estimate Q_n(γ; P) by its bootstrap analogue,

Q_n(γ; P̂_n) = Prob_{P̂_n} [ max_{i,j ∈ K} { L*_{n,i}(γ) − U*_{n,j}(γ) } > 0 ],    (2.4)

where L*_{n,i}(γ) and U*_{n,i}(γ) are, respectively, the lower and upper endpoints of

C*_{n,i}(γ) = [β*_{n,i} ± γ · se(β*_{n,i})],    (2.5)

with β*_{n,i} being the estimate of β_i obtained by resampling from (1.1) with the restriction imposed that β_0 = ⋯ = β_k. This, in turn, leads naturally to the estimator

γ*_n(α) = inf { γ : Q_n(γ; P̂_n) ≤ α }.    (2.6)

Plugging γ*_n(α) into C_{n,i}(γ) gives rise to a simple graphical device for visualizing statistically and practically significant differences. BT show that, under quite general conditions, this method (i) controls the familywise error rate asymptotically, and (ii) is consistent, in the sense that any (true) differences between parameters are inferred with probability approaching one asymptotically.

3 Simulation Evidence

We now examine the finite-sample performance of the overlap procedure described above by way of several Monte Carlo experiments. As in BT, we consider, as a benchmark, a max-t procedure designed to control the familywise error rate at the nominal level α. Specifically, this procedure rejects the restriction that β_i = β_j whenever

T_{n,(i,j)} = |β̂_{n,i} − β̂_{n,j}| / √( [se(β̂_{n,i})]² + [se(β̂_{n,j})]² )

is greater than the 1 − α quantile of max_{i≠j} T*_{n,(i,j)}, where

T*_{n,(i,j)} = |β*_{n,i} − β*_{n,j}| / √( [se(β*_{n,i})]² + [se(β*_{n,j})]² ),
and, as above, β*_{n,i} is the estimate of β_i obtained by resampling from (1.1) with the restriction imposed that β_0 = ⋯ = β_k. In what follows, we use 199 i.i.d. bootstrap replications for both the max-t procedure and the overlap procedure.

The design of our simulations is the same as that described in Section 1, but with several variations. First, we fix k = 5 and generate 10,000 samples of size n, with n ∈ {300, 600, 1200} (i.e., when n = 300, there are 50 observations in the control group and 50 observations in each of the 5 treatment groups, and so on). Second, we set β_i = θ(i + 1), with θ ∈ {0, 0.01, 0.02, …, 1}, which allows us to examine both control of the familywise error rate (when θ = 0) and power (when θ > 0). Finally, we consider two different specifications for the error term distribution: a homoskedastic case in which all of the errors are drawn from the standard normal distribution, and a heteroskedastic case in which the errors for observations assigned to the control group are standard normal, while the errors for observations assigned to treatment group i ∈ {1, …, k} are normal with mean zero and variance i + 1.²

         Homoskedasticity        Heteroskedasticity
  n      Overlap    max-t        Overlap    max-t
  300    0.0532     0.0571       0.0555     0.0558
  600    0.0503     0.0550       0.0492     0.0529
  1200   0.0524     0.0573       0.0546     0.0574

Table 1: Empirical familywise error rates for overlap and max-t procedures

Table 1 shows that control of the familywise error rate for both the overlap procedure and the max-t procedure is quite close to the nominal level α = 0.05 at all of the sample sizes considered, in both the homoskedastic and heteroskedastic cases. Interestingly, although the differences are quite small, the rejection rates for the overlap procedure are uniformly smaller than those for the max-t procedure.

² In both cases, we estimate the parameters using ordinary least squares and obtain heteroskedasticity-consistent standard errors using the method of White (1980).
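Both calibrations can be computed from the same bootstrap draws: within each replication, the smallest γ at which some pair of intervals fails to overlap is max_{i<j} |β*_i − β*_j| / (se*_i + se*_j), while the max-t statistic replaces the sum of standard errors with the root of the sum of squares. The following is a minimal sketch under simplifying assumptions: each β_i is just a group mean (model (1.1) with Z_t = 0), the null β_0 = ⋯ = β_k is imposed by recentring each group at zero before resampling, and B is chosen purely for illustration.

```python
import numpy as np

def bootstrap_calibration(groups, alpha=0.05, B=499, seed=0):
    """groups: list of 1-d arrays of outcomes, one per group (control first).
    Returns (gamma_star, maxt_crit): the overlap calibration gamma*_n(alpha)
    and the max-t critical value, from the same i.i.d. bootstrap draws."""
    rng = np.random.default_rng(seed)
    centred = [g - g.mean() for g in groups]   # impose the joint null
    m = len(groups)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    gamma_stats, maxt_stats = [], []
    for _ in range(B):
        est, se = [], []
        for g in centred:
            g_star = rng.choice(g, size=g.size, replace=True)
            est.append(g_star.mean())
            se.append(g_star.std(ddof=1) / np.sqrt(g_star.size))
        # smallest gamma at which some pair of intervals fails to overlap
        gamma_stats.append(max(abs(est[i] - est[j]) / (se[i] + se[j])
                               for i, j in pairs))
        # the max-t statistic over the same pairs
        maxt_stats.append(max(abs(est[i] - est[j]) / np.hypot(se[i], se[j])
                              for i, j in pairs))
    return (np.quantile(gamma_stats, 1 - alpha),
            np.quantile(maxt_stats, 1 - alpha))
```

Since se_i + se_j ≥ √(se_i² + se_j²), the per-replication max-t statistic always weakly exceeds the corresponding overlap statistic, so the max-t critical value is at least as large as γ*_n(α).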
Figure 2: Average power for overlap and max-t procedures. (a) Homoskedastic case; (b) Heteroskedastic case

In order to compare the power of the two procedures, we follow BT in examining average power, i.e., the proportion of false restrictions that are rejected. Figures 2a and 2b display average power as a function of θ in the homoskedastic and heteroskedastic cases, respectively. In these figures, black lines correspond to the overlap procedure and red lines to the max-t procedure, while solid, dashed, and dotted lines correspond to n = 300, n = 600, and n = 1200, respectively. Evidently, the two procedures have nearly identical average power over θ at all of the sample sizes considered, in both the homoskedastic and heteroskedastic cases.

4 Empirical Example

As an illustration of the procedure discussed above, we revisit a field experiment on teacher performance pay in India conducted by Muralidharan and Sundararaman
(2011), hereafter MS. Specifically, MS run two sets of experiments in which teachers are offered incentive pay conditioned on their students' test scores. In the analysis, they have outcomes from three separate groups of schools: a control group (Control), a group in which teachers were paid based on the scores of their own students (Individual Incentive), and a group in which teachers were paid based on the performance of all students at their school (Group Incentive). The experiment is quite well designed, and readers can find additional details in MS.

Among other things, MS test the impact of the two interventions on combined math and language (Telugu) scores. The experiment ran for two years. In the majority of the analysis, year-two scores are analyzed, often controlling for the year-zero, or pre-treatment, scores. While most of the paper concerns the average impact of the interventions, in Table 8 the authors compare the impact of the group incentive to that of the individual incentive. To make these comparisons, they first estimate the following model:

Score2_t = α_0 + α_1 Individual_t + α_2 Group_t + Σ_{j=1}^{49} δ_j County_{j,t} + δ_{50} Score0_t + U_t,    (4.1)

where Score2 and Score0 are, respectively, the combined math and language scores in year 2 and year 0, Individual is an indicator for being in the individual incentive group, Group is an indicator variable for being in the group incentive group, and {County_j}_{j=1}^{49} is a set of indicator variables for all but one of the 50 counties (mandals) in which the experiment was conducted. The standard errors are clustered by school. Next, MS independently test the following three restrictions:

MS1: α_1 = 0

MS2: α_2 = 0
MS3: α_1 = α_2

The t-statistics obtained for the tests of these three restrictions are 4.84, 2.71, and −1.91, respectively. Thus, MS conclude that both incentives have significantly positive treatment effects, and that the two treatment effects are statistically different from one another.

We now consider an application of the overlap procedure discussed above. To do so, we first need to modify the estimating equation so that the mean of the control group is estimated as a parameter, rather than being absorbed into the intercept. Specifically, we estimate the following equation:

Score2_t = β_0 Control_t + β_1 Individual_t + β_2 Group_t + Σ_{j=1}^{49} δ_j County_{j,t} + δ_{50} Score0_t + U_t,    (4.2)

where Control is an indicator variable for membership in the control group. Notice that α_0 = β_0 and α_i = β_i − β_0, for i ∈ {1, 2}. Using 199 i.i.d. bootstrap replications, we obtain γ*_n(0.05) = 0.505, and produce Figure 3, which can be interpreted as follows:

Since C_{n,0} and C_{n,1} do not overlap, we reject β_0 = β_1, or equivalently, α_1 = 0 (MS1).

Since C_{n,0} and C_{n,2} overlap, we cannot reject β_0 = β_2, or equivalently, α_2 = 0 (MS2).

Since C_{n,1} and C_{n,2} overlap, we cannot reject β_1 = β_2, or equivalently, α_1 = α_2 (MS3).

Thus, while our conclusion on the first restriction (MS1) is the same as that of MS, our conclusions on the second and third restrictions (MS2 and MS3) are different.
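Each of the three conclusions above follows mechanically from a single pairwise rule: infer β_i ≠ β_j exactly when the two uncertainty intervals share no point. A minimal sketch of that rule follows; the estimates and standard errors in the usage example are hypothetical (they are not the MS estimates), with only γ* = 0.505 taken from the text.

```python
def intervals_overlap(est_i, se_i, est_j, se_j, gamma):
    """True if the uncertainty intervals [est ± gamma*se] for parameters
    i and j share at least one point (so equality is NOT rejected)."""
    lo_i, hi_i = est_i - gamma * se_i, est_i + gamma * se_i
    lo_j, hi_j = est_j - gamma * se_j, est_j + gamma * se_j
    return lo_i <= hi_j and lo_j <= hi_i

# Hypothetical illustration with gamma* = 0.505 as in the text:
# intervals [0.085, 0.115] and [0.230, 0.270] are disjoint, so reject equality.
intervals_overlap(0.10, 0.03, 0.25, 0.04, 0.505)   # False: reject beta_i = beta_j
```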
Figure 3: Uncertainty Intervals for Empirical Example

We also consider an alternative method of visualizing these results in Figure 4, where all quantities are expressed in marginal terms. That is, we subtract β̂_{n,0} = 0.1319 from the point estimates and from the endpoints of the uncertainty intervals. In addition, we no longer display a point estimate or uncertainty interval for β_0, but instead include a dotted horizontal line at the value of U_{n,0} − β̂_{n,0}, the upper endpoint of the re-centered version of C_{n,0}.
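The re-centring used for Figure 4 is a simple shift: subtract the control-group estimate from every estimate and endpoint, so that each interval is read in treatment-effect (marginal) units. A sketch, with β̂_{n,0} = 0.1319 taken from the text and the remaining numbers in the test purely hypothetical:

```python
def recentre(estimates, lowers, uppers, control_index=0):
    """Shift all point estimates and interval endpoints by the control-group
    estimate; also return the height of the dotted reference line of Figure 4,
    which sits at U_{n,0} minus the control-group estimate."""
    b0 = estimates[control_index]
    shift = lambda xs: [x - b0 for x in xs]
    ref_line = uppers[control_index] - b0
    return shift(estimates), shift(lowers), shift(uppers), ref_line
```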
Figure 4: Re-Centered Uncertainty Intervals for Empirical Example
References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289-300.

Bennett, C. J. and Thompson, B. S. (2015). Graphical procedures for multiple comparisons under general dependence. Unpublished manuscript.

Muralidharan, K. and Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, 119(1):39-77.

Romano, J. P. and Wolf, M. (2005). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469):94-108.

Tukey, J. W. (1953). The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948-1983, pages 1-300. Chapman and Hall, New York.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817-838.