A Simple, Graphical Procedure for Comparing. Multiple Treatment Effects

Similar documents
A Simple, Graphical Procedure for Comparing Multiple Treatment Effects

A Simple, Graphical Procedure for Comparing. Multiple Treatment Effects

Graphical Procedures for Multiple Comparisons Under General Dependence

Wild Bootstrap Inference for Wildly Dierent Cluster Sizes

Christopher J. Bennett

Casuality and Programme Evaluation

Resampling-Based Control of the FDR

QED. Queen s Economics Department Working Paper No Hypothesis Testing for Arbitrary Bounds. Jeffrey Penney Queen s University

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

Bootstrap Tests: How Many Bootstraps?

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

The Exact Distribution of the t-ratio with Robust and Clustered Standard Errors

A Practitioner s Guide to Cluster-Robust Inference

Bootstrap Testing in Econometrics

Answer Key: Problem Set 6

11. Bootstrap Methods

WILD CLUSTER BOOTSTRAP CONFIDENCE INTERVALS*

Do not copy, post, or distribute

The Exact Distribution of the t-ratio with Robust and Clustered Standard Errors

Inference on Optimal Treatment Assignments

Bootstrap-Based Improvements for Inference with Clustered Errors

Control of Generalized Error Rates in Multiple Testing

Small-sample cluster-robust variance estimators for two-stage least squares models

4. Nonlinear regression functions

large number of i.i.d. observations from P. For concreteness, suppose

Math 1040 Final Exam Form A Introduction to Statistics Fall Semester 2010

Inference via Kernel Smoothing of Bootstrap P Values

IV Quantile Regression for Group-level Treatments, with an Application to the Distributional Effects of Trade

A better way to bootstrap pairs

Economics Division University of Southampton Southampton SO17 1BJ, UK. Title Overlapping Sub-sampling and invariance to initial conditions

ECON3150/4150 Spring 2015

Economics 471: Econometrics Department of Economics, Finance and Legal Studies University of Alabama

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis

Wild Cluster Bootstrap Confidence Intervals

Path Analysis. PRE 906: Structural Equation Modeling Lecture #5 February 18, PRE 906, SEM: Lecture 5 - Path Analysis

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

Small-Sample Methods for Cluster-Robust Inference in School-Based Experiments

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Statistics Primer. ORC Staff: Jayme Palka Peter Boedeker Marcus Fagan Trey Dejong

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

Tobit and Interval Censored Regression Model

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Supplement to Randomization Tests under an Approximate Symmetry Assumption

Exploratory quantile regression with many covariates: An application to adverse birth outcomes

Obtaining Critical Values for Test of Markov Regime Switching

Heteroskedasticity-Robust Inference in Finite Samples

On block bootstrapping areal data Introduction

A Course on Advanced Econometrics

Flexible Estimation of Treatment Effect Parameters

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

Inference Based on the Wild Bootstrap

LECTURE 5. Introduction to Econometrics. Hypothesis testing

Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński

Finite Population Correction Methods

NBER WORKING PAPER SERIES ROBUST STANDARD ERRORS IN SMALL SAMPLES: SOME PRACTICAL ADVICE. Guido W. Imbens Michal Kolesar

Sociology 593 Exam 2 Answer Key March 28, 2002

Statistical Applications in Genetics and Molecular Biology

BOOTSTRAPPING DIFFERENCES-IN-DIFFERENCES ESTIMATES

6 Single Sample Methods for a Location Parameter

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner

Post-Selection Inference

Robust Standard Errors in Small Samples: Some Practical Advice

EC4051 Project and Introductory Econometrics

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

Performance Evaluation and Comparison

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

Selective Inference for Effect Modification

Does k-th Moment Exist?

Applying the Benjamini Hochberg procedure to a set of generalized p-values

The Number of Bootstrap Replicates in Bootstrap Dickey-Fuller Unit Root Tests

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

Chapter 1 Statistical Inference

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

WISE International Masters


Salt Lake Community College MATH 1040 Final Exam Fall Semester 2011 Form E

Investigating Models with Two or Three Categories

Robust Performance Hypothesis Testing with the Variance. Institute for Empirical Research in Economics University of Zurich

Applied Statistics and Econometrics

A semi-bayesian study of Duncan's Bayesian multiple

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

Robust Backtesting Tests for Value-at-Risk Models

2. Linear regression with multiple regressors

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

Modified Variance Ratio Test for Autocorrelation in the Presence of Heteroskedasticity

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1

Section 2: Estimation, Confidence Intervals and Testing Hypothesis

Inference in difference-in-differences approaches

Lecture 3: Multiple Regression. Prof. Sharyn O Halloran Sustainable Development U9611 Econometrics II

A multiple testing procedure for input variable selection in neural networks

Physics 2019 v1.2. IA1 sample assessment instrument. Data test (10%) August Assessment objectives

University of California San Diego and Stanford University and

Improving linear quantile regression for


The International Journal of Biostatistics

QED. Queen s Economics Department Working Paper No. 1355

Quantitative Empirical Methods Exam

Econometrics. 4) Statistical inference

Transcription:

A Simple, Graphical Procedure for Comparing Multiple Treatment Effects Brennan S. Thompson and Matthew D. Webb October 26, 2015 <<< PRELIMINARY AND INCOMPLETE >>> Abstract In this paper, we utilize a new graphical procedure to show how multiple treatment effects can be compared while controlling the familywise error rate (the probability of finding one or more spurious differences between the parameters of interest). Monte Carlo simulations suggest that this procedure adequately controls the familywise error rate in finite samples, and has average power nearly identical to a simple max-t procedure. We illustrate our proposed approach using data from two field experiments, one comparing different types of performance pay for teachers in India, and another comparing different student achievement programs in Canada. Keywords: multiple comparisons; familywise error rate; average treatment effect; bootstrap Department of Economics, Ryerson University. E-mail: brennan@ryerson.ca Department of Economics, Carleton University. E-mail: matt.webb@carleton.ca 1

1 Introduction In the case of comparing multiple treatments, the problem at hand is two-fold: (A) we want to know whether or not the effect of each treatment is different from zero, and (B) we want to know whether or not the effect of each treatment is different from that of any of the other treatments. 1 In order to make the discussion of our problem more concrete, consider the following regression model: Y t = β 0 C t + k β i T i,t + Z tη + U t, t = 1,..., n, (1.1) i=1 where C t equals one if individual t belongs to the control group and zero otherwise, T i,t equals one if individual t receives treatment i {1,..., k} and zero otherwise, Z t is a vector of other characteristics for individual t, and U t is an idiosyncratic error term for individual t. The (average) treatment effect of the ith treatment is defined as δ i β i β 0, for i {1,..., k}. Thus, the first part of our problem involves testing δ i = 0, for each i {1,..., k}, (1.2) while the second part of our problem involves testing δ i = δ j, for each unique (i, j) {1,..., k} {1,..., k}. (1.3) Note that, since δ i = 0 is equivalent to β i = β 0, and δ i = δ j is equivalent to 1 Note that this problem is quite distinct from the problem of comparing heterogenous treatment effects, where different types of individuals may respond differently to the same treatment. Lee and Shaikh (2014) and Gu and Shen (2015) consider this problem using a multiple comparisons approach. 2

β i = β j, our problem boils down to testing β i = β j, for each unique (i, j) K K, (1.4) where K = {0,..., k}. Thus, our problem can be seen to involve making a total of ( ) ( k+1 2 comparisons: the k comparisons implicit in (1.2), plus the k ) 2 comparisons implicit in (1.3). For example, with k = 2 treatments, we must make 3 comparisons: the comparison of the 2 treatment effects to zero, and the comparison between the 2 treatment effects. With k = 3 treatments, we must make 6 comparisons, and so on. It is well-known that, when conducting more than one hypothesis test at a given nominal level simultaneously, the probability of rejecting at least one true hypothesis (i.e., the familywise error rate) is often well in excess of that given nominal level. To illustrate the severity of this issue, we generate, for k {2,..., 5}, one million samples from the model in (1.1) as follows. We set β 0 = = β k = 0, and assign 100 observations to the control group (i.e., n t=1 C t = 100) and 100 observations to each of the k treatment groups (i.e., n t=1 T i,t = 100 for each i {1,..., k}), so that n = 100(k + 1). For each t, Z t = 0 and U t is an independent standard normal draw. Within each sample, we independently test (A) each of the k restrictions in (1.2), and (B) each of the ( k 2) restrictions in (1.3), using conventional t-tests at the 5% nominal level. The rejection frequencies for these tests are shown in Figure 1. Specifically, the dash-dotted line shows the frequency of rejecting at least one of the k restrictions in (1.2), while the dashed line shows the frequency of rejecting at least one of the ( ) k 2 restrictions in (1.3). The solid line shows the empirical familywise error rate (i.e., the frequency of rejecting at least one the ( ) k+1 2 restrictions in (1.4)). Note that, even with k = 2, the empirical familywise error rate is approximately 0.122; with k = 5, the empirical familywise error rate is approximately 0.364. 3

Figure 1: Empirical Familywise Error Rates for Independent t-tests In recognition of such problems, a wide variety of multiple testing procedures, such as max-t procedures (see, e.g., Romano and Wolf, 2005a, and Section 4 below) have been developed to control the familywise error rate (or other generalized error rates, such as the false discovery rate; see Benjamini and Hochberg, 1995). For the specific problem of making multiple pairwise comparisons that we are interested in here, Bennett and Thompson (forthcoming), hereafter BT, have recently proposed a graphical procedure that identifies differences between a pair of parameters through the non-overlap of the so-called uncertainty intervals (see Section 2) for those parameters. 2 This graphical procedure, which can be seen as a resampling-based generalization of Tukey s (1953) procedure, is appealing because it offers users more than a 0-1 ( Yes-No ) decision regarding differences between parameters. Indeed, this procedure allows users to determine both statistical and practical significance of pairwise differences, while also providing them with a measure of uncertainty concerning 2 Note that BT denote the total number of parameters by k, while the total number of parameters we consider is k + 1. 4

the locations of the individual parameters. The remainder of this paper is organized as follows. In Section 2, we describe how the procedure of BT can be applied in the current context. In Section 3, we consider an extension of this procedure designed to identify the best treatment under consideration. Section 4 describes the results of a set of Monte Carlo simulations designed to examine the finite-sample performance of these procedures. In Section 5, we present the results of two empirical examples, one in which the different treatments are k = 2 types of performance pay for teachers used in a field experiment in India, and another in which the different treatments are k = 3 student achievement programs used in a field experiment in Canada. 2 The Overlap Procedure The basic approach of this procedure is to present each of the parameter estimates ˆβ n,i, i K, together with a corresponding uncertainty interval, [ ˆβn,i ± γ se ( ˆβn,i )], ( ) whose length is determined by the parameter γ (discussed below) and se ˆβn,i, the standard error of ˆβ n,i (high-level assumptions on the large-sample behaviour of these estimators are given at the end of this section). We denote the lower and upper endpoints of the uncertainty interval for β i by L n,i and U n,i, respectively. These uncertainty intervals are used to make inferences about the ordering of the parameters of interest as follows: We infer that β i > β j if L n,i > U n,j. If the uncertainty intervals for β i and β j overlap one another, we can make no such inference. For this reason, we refer to this procedure as the overlap procedure. 5

Note that, if all k parameters are equal, the ideal choice of γ at the nominal familywise error rate α is the smallest value satisfying } Prob P {max {L n,i(γ)} > min {U n,i(γ)} α, (2.1) i K i K where P denotes the unknown probability mechanism from which our data is generated. Of course, because P is unknown, this choice is infeasible. Instead, we choose γ to satisfy the bootstrap analogue of (2.1): Prob ˆPn {max i K { L n,i (γ) } { > min U n,i (γ) }} α, i K where ˆP n is an estimate of P, and L n,i and U n,i are, respectively, the lower and upper endpoints of [ ( ˆβ n,i ˆβ ( )] n,i ) ± γ se β n,i, with ˆβ ( ) n,i and se β n,i are obtained under sampling from ˆP n. We assume only that n( ˆβn β) and n( ˆβ n,i ˆβ n,i ) both have the same (continuous and strictly increasing) limiting distribution, and that ( ) n se ˆβn,i probability to the same (positive) constant. and ( ) n se β n,i both converge in Given these high-level assumptions, BT show that the overlap procedure described above (1) controls the familywise error rate asymptotically, and (2) is consistent, in the sense that any differences between parameter pairs are inferred (in the correct direction) with probability one asymptotically. Simulation evidence presented in BT and in Section 4 below suggest that the overlap procedure provides satisfactory control of the familywise error rate and has good (average) power properties in finite samples. 6

3 A Modified Overlap Procedure for Multiple Comparisons with the Best The procedure described in the previous section is designed to control the familywise error rate across all pairwise comparisons. This approach allows for a (potentially complete) ranking of all the treatments under consideration. For example, assuming (without loss of generality) that a larger value of the outcome variable is better, one could infer that treatment i is the best (i.e., β i > β j for all j K \ {i}) if L n,i > U n,j for all j K \ {i}. Similarly, one may be able to identify a second best treatment, a third best treatment, and so on. While such a complete ranking may occasionally be of value, interest often centers on identifying only the (first) best treatment. That is, we may only want to know whether or not the treatment effect which is estimated to be the largest is actually statistically distinguishable from the other treatments effects (and zero). Such a problem is the focus of so-called multiple comparisons with the best procedures (see, e.g., Hsu, 1981, 1984 or Horrace and Schmidt 2000). Below, we follow BT in developing a modification of the graphical procedure described in the previous section to focus on this problem. 3 The basic idea behind this modification is that, by eliminating irrelevant pairwise comparisons (i.e., those in which neither of the parameters is estimated to be largest), it may be possible to increase the power of the procedure. We begin by introducing some further notation. Let [1], [2],..., [k + 1], be the 3 In fact, BT introduce a generalization of the multiple comparisons with the best approach which allows for comparisons within the r best (r being some integer smaller than the total number of parameters under consideration). Such an approach may be of use when the number of parameters under consideration is very large (perhaps in the hundreds or thousands), and one is willing to abandon pursuit of a complete ranking in return for the ability to resolve more comparisons within the top r. 7

random indices such that ˆβ n,[1] > ˆβ n,[2] > > ˆβ n,[k+1]. This means that β [1] is the true value of the parameter which is estimated to be largest, and not necessarily the largest parameter value. Similarly, L n,[1] (γ) = ˆβ ( ) n,[1] γ se ˆβn,[1] is the lower endpoint of the uncertainty interval associated with the largest point estimate, which is not necessarily the largest lower endpoint. Indeed, if the standard error of ˆβ n,[1] is relatively large, it could be the case that the lower endpoint associated with this point estimate extends below the lower endpoint associated with a smaller point estimate. Similar to what is done in the basic overlap procedure, we infer that β [1] is the largest parameter value in the collection if L n,[1] > U n,[j] for all j > 1. Thus, since our ideal choice of γ is the smallest value satisfying 4 { Prob P {L n,[1] (γ) > max Un,[j] (γ) }} α (3.1) j {2,...,k+1} when all k parameters are equal, a feasible choice of γ is the smallest value satisfying { Prob ˆPn {L n,[1](γ) > max U n,[j] (γ) }} α. (3.2) j {2,...,k+1} Simulation evidence presented in the next section suggests that the choice of γ resulting from this modification may be substantially smaller than the choice resulting from 4 If a smaller value of the outcome variable is better, { this procedure would be modified { as follows. The probability in (3.1) would be replaced by Prob P Un,[k+1] (γ) < max j {1,...,k} Ln,[j] (γ) }} }}, and the probability in (3.2) would be replaced by Prob ˆPn {Un,[k+1] (γ) < max j {1,...,k} {U n,[j] (γ). The resulting uncertainty intervals could then be used to infer that β [k+1] is the smallest parameter value in the collection if U n,[k+1] < U n,[j] for all j < (k + 1). 8

the basic overlap procedure, thereby greatly increasing the power of the procedure. 4 Simulation Evidence We now examine the finite-sample performance of the overlap procedures described above by way of several Monte Carlo experiments. As in BT, we consider a max-t procedure as a benchmark for the basic overlap procedure. Specifically, this procedure rejects the restriction that β i = β j whenever T n,(i,j) = ˆβ n,i ˆβ n,j [ ( )] 2 [ ( )]. 2 se ˆβn,i + se ˆβn,j is greater than 1 α quantile of max i j T n,(i,j), where T n,(i,j) = ( ˆβ n,i ˆβ ) ( n,i ˆβ n,j ˆβ ) n,j [ ( )] 2 [ ( )]. 2 se ˆβ n,i + se ˆβ n,j It is important to note that this test statistic (and its bootstrap counterpart) is not a function of the estimated covariance between ˆβ n,i and ˆβ n,j, since this estimate will always be zero in the designs we consider. This is due to the fact that we do not include any covariates in our designs (i.e., Z t = 0 for each t). 5 In what follows, we estimate β i, i K, using ordinary least squares and obtain heteroskedasticity-consistent standard errors using the method of White (1980). The bootstrap counterparts of these objects are obtained using the wild bootstrap (see, 5 If covariates are included (as in the empirical examples considered in the next section), the estimated covariances between the parameters of interest may be non-zero. In any case, although the overlap procedure does not make explicit use of the estimated covariances, it is valid under general forms of dependence between the estimated parameters; see BT. 9

Homoskedasticity Heteroskedasticity n max-t Overlap Mod. Overlap max-t Overlap Mod. Overlap 300 0.0627 0.0572 0.0513 0.0647 0.0613 0.0547 600 0.0600 0.0557 0.0495 0.0605 0.0582 0.0539 1200 0.0526 0.0484 0.0509 0.0551 0.0486 0.0515 Table 1: Empirical familywise error rates for max-t and basic and modified overlap procedures e.g., Mammen, 1993) with 199 replications. We set the nominal familywise error rate α equal to 0.05. The design of our simulations is the same as the one described in Section 1, but with several variations. First, we fix k = 5 and generate 10,000 samples of size n, with n {300, 600, 1200} (i.e., when n = 300, there are 50 observations in the control group and 50 observations in each of the 5 treatment groups, and so on). Second, we set β i = θ(i + 1), with θ {0, 0.01, 0.02,..., 1}, which allows us to examine both control of the familywise error rate (when θ = 0) and power (when θ > 0). Finally, we consider two different specifications for the error term distribution: A homoskedastic case in which all of the errors are drawn from the standard normal distribution, and a heteroskedastic case in which the errors for observations assigned to the control group are standard normal, while the observations assigned to treatment group i {1,..., k} are normal with mean zero and variance i + 1. Table 1 shows that control of the familywise error rate for the max-t procedure and both the basic and modified overlap procedures is adequate at all of the sample sizes considered in both the homoskedastic and heteroskedastic cases. Interestingly, although the differences are quite small, the empirical familywise error rates for the overlap procedures are uniformly smaller than those for the max-t procedure. In order to compare the power of the max-t procedure and the basic overlap procedure, we follow BT and Romano and Wolf (2005b) in examining average power, 10

(a) Homoskedastic errors (b) Heteroskedastic errors Figure 2: Empirical average power for overlap and max-t procedures which is the proportion of false restrictions (of the form β i = β j here) that are rejected. Figures 2a and 2b display the empirical average power for these two procedures as a function of θ in the homoskedastic and heteroskedastic cases, respectively. Within these figures, black lines correspond to the basic overlap procedure and red lines correspond to the max-t procedure, while lines that are solid, dashed, and dotted correspond to n = 300, n = 600, and n = 1200, respectively. Evidently, both procedures have nearly identical average power over θ and at all of the sample sizes considered in both the homoskedastic and heteroskedastic cases. Finally, we turn to the gain in power that results from using the modified overlap procedure. Specifically, we examine the probability that the largest (i.e., the best ) parameter is correctly identified. 6 Figures 3a and 3b display the empirical probability that the best is identified by the two overlap procedures as a function of θ in the homoskedastic and heteroskedastic cases, respectively. Within these figures, the black 6 Note that, here, we always have θ > 0, so the largest parameter is β k+1. Thus, the largest parameter is correctly identified whenever L n,k+1 > U n,j for all j (k + 1). 11

(a) Homoskedastic errors (b) Heteroskedastic errors Figure 3: Empirical probability of identifying the best for the basic and modified overlap procedures line corresponds to the basic overlap procedure while the blue line corresponds to the modified overlap procedure. To reduce clutter, we only plot the results for n = 600, but the results for the other sample sizes are qualitatively similar. Namely, the modified overlap procedure does a much better job in identifying the best, particularly in the heteroskedastic case. As noted in the previous section, this is due to the fact that γ is chosen to be much smaller with the modified overlap procedure, resulting in narrower uncertainty intervals. In all cases considered here, we find that the value of γ that is chosen by the modified overlap procedure is slightly less than half as large (on average, over the 10,000 samples) as the value of γ that is chosen by the basic overlap procedure. More generally, this ratio will decrease as k is increased. 12

5 Empirical Examples We now present the results of two empirical examples, both from the education literature. In Section 5.1, we re-visit a field experiment examining different types of performance pay for teachers. In Section 5.2, we re-visit a field experiment examining different student achievement programs. 5.1 Performance Pay for Teachers Muralidharan and Sundararaman (2011), hereafter MS, conducted a set of experiments in India in which they offered incentive pay to teachers conditioned on student s scores. In the analysis, they have outcomes from three separate groups of schools: a control group, a group in which teachers were paid based on the scores of their own students, and a group in which teachers were paid based on the performance of all students at their school. That is, they consider k = 2 distinct treatments. Although k is quite small here, this example provides a very easy-to-follow illustration of the overlap procedures described above. Moreover, as the dataset used in this example is publicly-available, readers can easily replicate the results presented below. 7 Among other things, MS test the impact of the two interventions on combined math and language (Telegu) scores. The experiment ran for two years. In the majority of the analysis, year two scores are analyzed controlling for the year zero (i.e., pre-treatment) scores. While most of the paper is about the average impact of the interventions, in their Table 8, the authors compare the impact of the group incentive to the impact of the individual incentive. To make these comparisons, they first 7 The dataset is available at: http://www.jstor.org/stable/suppl/10.1086/659655/suppl file/2009434data.zip. 13

estimate the following model: 49 Score2 t = δ 0 + δ 1 Group t + δ 2 Individual t + η l County l,t + η 50 Score0 t + errors, where Score2 and Score0 are, respectively, the combined math and language score in year 2 and year 0, Group is an indicator variable for being in the group incentive, Individual is an indicator for being in the individual incentive, and {County l } 49 l=1 is a set of indicator variables for all but one of the 50 counties (Mandals) in which the experiment was conducted. There are n = 29760 observations, and the model is estimated using ordinary least squares. Heteroskedasticity-consistent standard errors are obtained using clustering (see, e.g., Cameron et al., 2008); this clustering is done by school. Next, MS independently test (i.e., without controlling the familywise error rate) the following three restrictions (absolute values of the T -statistics obtained for the l=1 tests of these three restrictions are given in brackets): 8 MS1: δ 1 = 0 ( T n,ms1 = 2.71) MS2: δ 2 = 0 ( T n,ms2 = 4.84) MS3: δ 1 = δ 2 ( T n,ms3 = 1.91) Thus, MS conclude that both treatment effects are statistically different from zero, and from one another (although this latter conclusion can only be made at a nominal level of slightly less than 6%). We now consider an application of the overlap procedures discussed above. To do so, we first need to re-write the model above so that the mean of the control group is 8 Note that these T -statistics do incorporate the estimated covariances between the parameters, which are not equal to zero because of the inclusion of the covariates. 14

estimated as a parameter, rather than being absorbed into the intercept. Specifically, we estimate the following model: 49 Score2 t = β 0 Control t +β 1 Group t +β 2 Individual t + η l County l,t +η 50 Score0 t +errors, where Control is an indicator variable for membership in the control group. Notice that δ 0 = β 0 and and δ i = β i β 0, for i {1, 2}. For a nominal familywise error rate of α = 0.05, we obtain a value of 0.50 for γ with the basic overlap procedure. This choice is based on 999 wild cluster bootstrap (see, e.g., Cameron et al., 2008) replications. The resulting uncertainty intervals are shown in Figure 4a, which can be interpreted as follows: Since the uncertainty intervals for β 1 and β 0 overlap, we can not infer that β 1 > β 0 (or, equivalently, we can not infer that δ 1 is positive). l=1 Since L n,2 > U n,0, we infer that β 2 > β 0 (or, equivalently, that δ 2 > 0). Since the uncertainty intervals for β 2 and β 1 overlap, we can not infer that β 2 > β 1 (or, equivalently, we can not infer that δ 2 δ 1 is positive). Thus, while our conclusion on the second restriction (MS2) is consistent with MS, our conclusions on the first and third restrictions (MS1 and MS3) are not. 9 We also consider an alternative and perhaps more useful method of visualizing these results in Figure 4b. Here, the quantities are re-centered around the treatment effects, δ 1 β 1 β 0 and δ 2 β 2 β 0. That is, we subtract ˆβ n,0 = 0.13 from ˆβ n,1 and ˆβ n,2, and from the endpoints of the uncertainty intervals for β 1 and β 2. Moreover, we include a dotted horizontal line at the the value of U n,0 ˆβ n,0 = 0.08, the upper 9 We also applied the basic overlap procedure at a nominal familywise error rate of α = 0.06 (obtaining a value of 0.47 for γ), and found that the uncertainty intervals for β 2 and β 1 were still overlapping (recall that the absolute value of the T -statistic for the test of MS3 was 1.91). 15

(a) Uncertainty Intervals for β i, i K (b) Uncertainty Intervals for δ i, i {1,..., k} Figure 4: Performance Pay Example: Basic Overlap Procedure endpoint of the re-centered version of the uncertainty interval for β 0. 10 Given that the uncertainty interval for δ 2 (the treatment effect of for the individual incentive) lies entirely above this dotted horizontal line, for example, one can quickly infer that δ 2 > 0 (or, equivalently, β 2 > β 0, as was seen from Figure 4a). Finally, we apply the modified overlap procedure in an attempt to identify the best treatment. That is, we seek to determine whether or not the treatment effect for the individual incentive (the treatment effect which is estimated to be the largest) is statistically distinguishable from the other treatments effects (and zero). Using the modified overlap procedure, we obtain a value of 0.34 for γ, leading to substantially narrower uncertainty intervals; see Figure 5. Note that we only plot the lower endpoint of the uncertainty interval for β [1] = β 2 and the upper endpoints of the uncertainty intervals for β [2] = β 1 and β [3] = β 0 in Figure 5a (and similarly in Figure 5b). This is 10 One could similarly include a dotted horizontal line at the the value of L n,0 ˆβ n,0 if the vertical axis were to extend below zero; Figure 6b provides an example of this. 16

(a) Uncertainty Intervals for β 0, β 1, and β 2 (b) Uncertainty Intervals for δ 1 and δ 2 Figure 5: Performance Pay Example: Modified Overlap Procedure done in order to emphasize the fact that the modified overlap procedure can not be used to make inferences about the ordering of any parameter pair (β [i], β [j] ) whenever min{i, j} > 1 (i.e., whenever neither of the parameters is estimated to be largest). Here, our conclusions are more in line with MS. Specifically, we infer that β 2 > β 0 (or, equivalently, δ 2 > 0) and that β 2 > β 1 (or, equivalently, δ 2 > δ 1 ). That is, our conclusions on the second and third restrictions (MS2 and MS3) are consistent with MS. However, since the modified overlap procedure is not designed to allow us to infer anything about the ordering of β 1 and β 0 (or, equivalently, anything about the sign of δ 2 ), we can not make any conclusions about MS1. The increased power of the modified overlap procedure comes entirely at the cost of remaining silent on such comparisons. 5.2 Student Achievement Programs Angrist et al. (2009), hereafter ALO, conducted a field experiment at a large, public 17

university in Canada in order to examine programs aimed at improving students first year academic performance and second year re-enrollment. 11 The experiment involved sorting students into a control group and three treatment groups (i.e., k = 3). Students in the first treatment group were offered support services (supplemental instruction and peer advising), while students in the second treatment group were offered financial incentives (cash awards depending on their performance). Students in the third treatment group were offered both support services and financial incentives. Although ALO present results for several different outcome variables, we focus here on just first semester grade point averages for simplicity. Moreover, ALO present results for males and females (both separately and together), while we only consider females. We do so because ALO do not find any statistically significant treatment effects for males or for males and females combined (see the first two columns of their Table 5). In order to examine the effects of the different treatments in this case, ALO estimate the following model: Grades t = δ 0 + δ 1 Support t + δ 2 Financial t + δ 3 Combined t + Z tη + errors, where Grades is average grade at the end of first semester, Support, Financial, Combined are indicator variables indicating membership in treatment group 1 (support services only), group 2 (financial incentives only), and group 3 (both support services and financial incentives), respectively, and Z contains sets of dummy variables for mother tongue, high school group, number of courses in fall term, self reports on how often the student procrastinates, mother s education, and father s education. There are n = 729 observations, and the model is estimated using ordinary least squares. Heteroskedasticity-consistent standard errors are obtained using the method 11 This dataset is publicly-available from: https://www.aeaweb.org/aej-applied/data/2007-0062 data.zip 18

of White (1980). While ALO do not explicitly test for differences between the different treatment effects, they do directly compare the parameter estimates in the text. For example, on p. 14 the authors state that the estimates for women suggest the combination of services and fellowships... had a larger impact than fellowships alone. If they had formally tested for the differences between the coefficients, the ( ) 3+1 2 = 6 restrictions would be (absolute values of the T -statistics obtained for the tests of these restrictions are given in brackets): 12 ALO1: δ 1 = 0 ( T n,alo1 = 0.58) ALO2: δ 2 = 0 ( T n,alo2 = 2.21) ALO3: δ 3 = 0 ( T n,alo3 = 3.17) ALO4: δ 1 = δ 2 ( T n,alo4 = 1.21) ALO5: δ 1 = δ 3 ( T n,alo5 = 2.10) ALO6: δ 2 = δ 3 ( T n,alo6 = 1.01) Thus, at the 5% nominal level, one would only reject the restrictions ALO2, ALO3, and ALO5 if the tests were conducted independently (i.e., without controlling the familywise error rate). In other words, one could conclude that only the treatment effects for the financial group and the combined group are statistically different from zero (ALO do test the first three restrictions and come to the same conclusions about them), and that the treatment effect for the combined group is statistically different from the treatment effect for the support group. Again, while ALO do not explicitly test ALO4 ALO6, their statement quoted above (about the treatment effect for the 12 As in the previous example, these T -statistics incorporate the estimated covariances between the parameters, which are not equal to zero because of the inclusion of the covariates. 19

combined group being larger than the treatment effect for the financial group) is not statistically supported, even if one ignores the multiple comparisons problem. We now turn our attention to the overlap procedures. As in the previous example, we first need to re-write the model above so that the mean of the control group is estimated as a parameter, rather than being absorbed into the intercept. That is, we estimate the following model: GPA t = β 0 Control t + β 1 Support t + β 2 Financial t + β 3 Combined t + Z tη + errors, where Control is an indicator variable for membership in the control group. Given a nominal familywise error rate of α = 0.05, we obtain a value of 0.64 for γ with the basic overlap procedure. This value is obtained using 999 wild bootstrap replications (the standard errors obtained by ALO are not clustered). Our resulting uncertainty intervals are shown in Figure 6, which can be interpreted as follows: Since the uncertainty intervals for β 1 and β 0 overlap, we can not infer that β 1 > β 0 (or, equivalently, we can not infer that δ 1 is positive). Similarly, since the uncertainty intervals for β 2 and β 0 overlap, we can not infer that β 2 > β 1 (or, equivalently, we can not infer that δ 2 is positive). Since L n,3 > U n,0, we infer that β 3 > β 0 (or, equivalently, that δ 3 > 0). Since the uncertainty intervals for β 2 and β 1 overlap, we can not infer that β 2 > β 1 (or, equivalently, we can not infer that δ 2 δ 1 is positive). Similarly, since the uncertainty intervals for β 3 and β 1 overlap, we can not infer that β 3 > β 1 (or, equivalently, we can not infer that δ 3 δ 1 is positive). Finally, since the uncertainty intervals for β 3 and β 2 overlap, we can not infer that β 3 > β 2 (or, equivalently, we can not infer that δ 3 δ 2 is positive). 20

(a) Uncertainty Intervals for β i, i K (b) Uncertainty Intervals for δ i, i {1,..., k} Figure 6: Student Achievement Example: Basic Overlap Procedure In summary, our conclusions on four of the restrictions (ALO1, AL03, AL04, and ALO6) are consistent with the results of independent tests conducted at the 5% nominal level. However, our conclusions on the other two restrictions (ALO2 and ALO5) are not. We next apply the modified overlap procedure. Here, we obtain a value of 0.35 for γ, leading to much narrower uncertainty intervals. From Figure 7, we infer that β 3 > β 0 (or, equivalently, δ 3 > 0), and that β 3 > β 1 (or, equivalently, δ 3 > δ 1 ). However, we can not that β 3 > β 2 (or, equivalently, we can not infer that δ 3 δ 2 is positive). That is, we can conclude that the treatment effect for the combined group (the largest estimate) is larger than the treatment effect for the control group, but not that it is larger than the treatment effect for the financial group. This reinforces our finding above that the claim made by ALO that the treatment effect for the combined group is larger than the treatment effect for the financial group is not statistically supported. 21

(a) Uncertainty Intervals for β i, i K (b) Uncertainty Intervals for δ i, i {1,..., k} Figure 7: Student Achievement Example: Modified Overlap Procedure Overall, the conclusions from Figure 7 differ from those based on Figure 6 only in that we can conclude that the treatment effect for the combined group is larger than the treatment effect for the support group. Similar to what was noted in the previous example, the increased power of the modified overlap procedure comes at the cost of not being able to make comparisons between the treatment effects for the financial group and the support group, or between either of these two groups and zero. That is, we move from being able to make 6 pairwise comparisons with the basic overlap procedure, to only being able to make 3 pairwise comparisons with the modified overlap procedure. References Angrist, J., Lang, D., and Oreopoulos, P. (2009). Incentives and services for college achievement: Evidence from a randomized trial. American Economic Journal: 22

Applied Economics, 1(1):136 63. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1):289 300. Bennett, C. J. and Thompson, B. S. (forthcoming). Graphical procedures for multiple comparisons under general dependence. Journal of the American Statistical Association. Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3):414 427. Gu, J. and Shen, S. (2015). Multiple testing for positive treatment effects. Unpublished manuscript. Horrace, W. C. and Schmidt, P. (2000). Multiple comparisons with the best, with economic applications. Journal of Applied Econometrics, 15(1):1 26. Hsu, J. C. (1981). Simultaneous confidence intervals for all distances from the best. The Annals of Statistics, 9(5):1026 1034. Hsu, J. C. (1984). Constrained simultaneous confidence intervals for multiple comparisons with the best. The Annals of Statistics, 12(3):793 1150. Lee, S. and Shaikh, A. M. (2014). Multiple testing and heterogenous treatment effects: Re-evaluating the effect of Progresa on school enrollment. Journal of Applied Econometrics, 29(4):612 626. Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, 21(1):255 285. 23

Muralidharan, K. and Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, 119(1):39 77. Romano, J. P. and Wolf, M. (2005a). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469):94 108. Romano, J. P. and Wolf, M. (2005b). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4):1237 1282. Tukey, J. W. (1953). The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948-1983, pages 1 300. Chapman and Hall, New York. White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817 838. 24