
A Simple, Graphical Procedure for Comparing Multiple Treatment Effects

Brennan S. Thompson and Matthew D. Webb

May 15, 2015

<<< PRELIMINARY AND INCOMPLETE >>>

Abstract

In this paper, we utilize a new graphical procedure to show how multiple treatment effects can be compared while controlling the familywise error rate (the probability of finding one or more spurious differences between the parameters of interest). Monte Carlo simulations suggest that this procedure adequately controls the familywise error rate in finite samples, and has average power nearly identical to that of a simple max-t procedure. We illustrate our proposed approach using data from a field experiment on different types of performance pay for teachers.

Keywords: multiple comparisons; familywise error rate; treatment effects; bootstrap

Department of Economics, Ryerson University. E-mail: brennan@ryerson.ca
Department of Economics, University of Calgary. E-mail: matthewdwebb@gmail.com

1 Introduction

In the case of comparing multiple treatments, the problem at hand is two-fold: (A) we want to know whether or not the effect of each treatment is different from zero, and (B) we want to know whether or not the effect of each treatment is different from that of any of the other treatments. In order to make the discussion of our problem more concrete, consider the following regression model:

    Y_t = β_0 C_t + Σ_{i=1}^{k} β_i T_{i,t} + Z_t′ δ + U_t,   t = 1, ..., n,   (1.1)

where C_t equals one if individual t belongs to the control group and zero otherwise, T_{i,t} equals one if individual t belongs to treatment group i ∈ {1, ..., k} and zero otherwise, Z_t is a vector of other characteristics for individual t, and U_t is an idiosyncratic error term for individual t.

The (average marginal) treatment effect of the i-th treatment is defined as α_i ≡ β_i − β_0, for i ∈ {1, ..., k}. Thus, the first part of our problem involves testing

    α_i = 0, for each i ∈ {1, ..., k},   (1.2)

while the second part of our problem involves testing

    α_i = α_j, for each unique (i, j) ∈ {1, ..., k} × {1, ..., k}.   (1.3)

Note that, since α_i = 0 is equivalent to β_i = β_0, and α_i = α_j is equivalent to

β_i = β_j, our problem boils down to testing

    β_i = β_j, for each unique (i, j) ∈ K × K,   (1.4)

where K = {0, ..., k}. Thus, our problem can be seen to involve making a total of (k+1 choose 2) comparisons: the k comparisons implicit in (1.2), plus the (k choose 2) comparisons implicit in (1.3). For example, with k = 2 treatments, we must make 3 comparisons: the comparison of the 2 treatment effects to zero, and the comparison between the 2 treatment effects. With k = 3 treatments, we must make 6 comparisons, and so on.

It is well known that, when conducting more than one hypothesis test at a given nominal level simultaneously, the probability of rejecting at least one true hypothesis (i.e., the familywise error rate) is often well in excess of that nominal level. To illustrate the severity of this issue, we generate, for k ∈ {2, ..., 5}, one million samples of size n = 100(k + 1) from the model in (1.1) as follows. We set β_0 = ⋯ = β_k = 0, and assign 100 observations to the control group (i.e., Σ_{t=1}^{n} C_t = 100) and 100 observations to each of the k treatment groups (i.e., Σ_{t=1}^{n} T_{i,t} = 100 for each i ∈ {1, ..., k}). For each t, Z_t = 0 and U_t is an independent standard normal draw. Within each sample, we independently test (A) each of the k restrictions in (1.2), and (B) each of the (k choose 2) restrictions in (1.3), using conventional t-tests at the 5% nominal level.

The rejection frequencies for these tests are shown in Figure 1. Specifically, the dash-dotted line shows the frequency of rejecting at least one of the k restrictions in (1.2), while the dashed line shows the frequency of rejecting at least one of the (k choose 2) restrictions in (1.3). The solid line shows the empirical familywise error rate (i.e., the frequency of rejecting at least one of the (k+1 choose 2) restrictions in (1.4)). Note that, even with k = 2, the empirical familywise error rate is approximately 0.122; with k = 5, the empirical familywise error rate is approximately 0.364.
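The severity of this size distortion is easy to reproduce on a small scale. The sketch below is a simplified, hypothetical version of the experiment just described: it applies two-sample t-tests to raw group outcomes rather than regression-based t-tests, uses the standard normal critical value 1.96 as an approximation to the exact t critical value, and uses far fewer replications than the one million in the paper. The function name fwer_naive is ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwer_naive(k, n_per_group=100, reps=2000, crit=1.96):
    """Estimate the familywise error rate of unadjusted pairwise
    two-sample t-tests among one control group and k treatment
    groups whose true effects are all zero.  The critical value
    1.96 approximates the exact t critical value at the 5% level."""
    m = k + 1  # total number of groups
    n_reject = 0
    for _ in range(reps):
        y = rng.standard_normal((n_per_group, m))  # N(0, 1) outcomes
        mean = y.mean(axis=0)
        var = y.var(axis=0, ddof=1)
        d = np.abs(mean[:, None] - mean[None, :])
        s = np.sqrt(var[:, None] / n_per_group + var[None, :] / n_per_group)
        t = d / s  # diagonal entries are exactly zero, never "rejected"
        n_reject += (t > crit).any()
    return n_reject / reps

# With k = 2 (three pairwise comparisons) the familywise error rate
# is far above the nominal 5% level; the paper reports about 0.122.
print(fwer_naive(k=2))
```

The estimate grows quickly with k, mirroring the solid line in Figure 1.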

Figure 1: Empirical Familywise Error Rates for Independent t-tests

In recognition of this issue, a wide variety of multiple testing procedures, such as max-t procedures (see, e.g., Romano and Wolf, 2005, and Section 3 below), have been developed to control the familywise error rate (or other generalized error rates, such as the false discovery rate; see Benjamini and Hochberg, 1995). For the specific problem of making multiple pairwise comparisons that we are interested in here, Bennett and Thompson (2015), hereafter BT, have recently proposed a graphical procedure that identifies differences between a pair of parameters through the non-overlap of the so-called uncertainty intervals (see Section 2) for those parameters.[1] This graphical method, which can be seen as a resampling-based generalization of Tukey's (1953) method, is appealing because it offers users more than a 0-1 ("Yes"/"No") decision regarding differences between parameters. Indeed, this method allows users to determine both the statistical and the practical significance of pairwise differences, while also providing them with a measure of uncertainty concerning the locations of

[1] Note that BT denote the total number of parameters by k, while the total number of parameters we consider is k + 1.

the individual parameters.

The remainder of this paper is organized as follows. In Section 2, we provide a brief overview of the procedure of BT. Section 3 describes the results of a set of Monte Carlo simulations designed to examine the finite-sample performance of this procedure in the current context. In Section 4, we present the results of an empirical example, using data from a field experiment in India, in which the different treatments are different types of performance pay for teachers.

2 The Overlap Procedure

In this section, we briefly summarize how the graphical procedure of BT can be applied to the problem at hand. For complete details, including proofs of the main results, see BT.

The parameters of interest, β_i = β_i(P), i ∈ K, are unknown but are presumed to be consistently estimable from observable data generated by some (unknown) probability mechanism P. That is, for each i ∈ K, we have available to us a √n-consistent estimator β̂_{n,i} of β_i. The overlap procedure presents each parameter estimate β̂_{n,i} together with its corresponding uncertainty interval,

    C_{n,i}(γ) = [β̂_{n,i} ± γ · se(β̂_{n,i})],   (2.1)

whose length is determined by the parameter γ (discussed below) and se(β̂_{n,i}), the standard error of β̂_{n,i}. In what follows, we denote the lower and upper endpoints of the interval C_{n,i}(γ) by L_{n,i}(γ) and U_{n,i}(γ), respectively.

The uncertainty intervals can then be used to make inferences about the ordering

of the parameters of interest as follows: we infer that β_i < β_j if β̂_{n,i} < β̂_{n,j} and the uncertainty intervals for β_i and β_j are non-overlapping.

Note that, as a function of γ, the probability of declaring at least one significant difference between parameters, when all k + 1 parameters are equal, is

    Q_n(γ; P) = Prob_P[ max_{i,j ∈ K, i ≠ j} { L_{n,i}(γ) − U_{n,j}(γ) } > 0 ].   (2.2)

Thus, in order to control the familywise error rate at nominal level α, the ideal choice of γ when all of the parameters are equal is

    γ_n(α) = inf{ γ : Q_n(γ; P) ≤ α }.   (2.3)

Unfortunately, because P is unknown, we cannot compute Q_n(γ; P), and γ_n(α) is thus infeasible. We therefore turn our attention to estimating γ_n(α) so as to achieve at least asymptotic control of the familywise error rate.

Constructing a feasible counterpart to γ_n(α) of course requires that we first formulate an empirical estimator of Q_n(γ; P). Towards this end, let P̂_n (typically the empirical distribution) be an estimate of P. The idea is to estimate Q_n(γ; P) using its bootstrap analogue,

    Q_n(γ; P̂_n) = Prob_{P̂_n}[ max_{i,j ∈ K, i ≠ j} { L*_{n,i}(γ) − U*_{n,j}(γ) } > 0 ],   (2.4)

where L*_{n,i}(γ) and U*_{n,i}(γ) are, respectively, the lower and upper endpoints of

    C*_{n,i}(γ) = [β*_{n,i} ± γ · se(β*_{n,i})],   (2.5)

with β*_{n,i} being the estimate of β_i obtained by resampling from (1.1) with the restriction imposed that β_0 = ⋯ = β_k. This, in turn, leads naturally to an estimator of γ_n(α):

    γ*_n(α) = inf{ γ : Q_n(γ; P̂_n) ≤ α }.   (2.6)

Plugging γ*_n(α) into C_{n,i}(γ) gives rise to a simple graphical device for visualizing statistically and practically significant differences. BT show that, under quite general conditions, this method (i) controls the familywise error rate asymptotically, and (ii) is consistent, in the sense that any (true) differences between parameters are inferred with probability one asymptotically.

3 Simulation Evidence

We now examine the finite-sample performance of the overlap procedure described above by way of several Monte Carlo experiments. As in BT, we consider, as a benchmark, a max-t procedure designed to control the familywise error rate at the nominal level α. Specifically, this procedure rejects the restriction that β_i = β_j whenever

    T_{n,(i,j)} = |β̂_{n,i} − β̂_{n,j}| / √( [se(β̂_{n,i})]² + [se(β̂_{n,j})]² )

is greater than the 1 − α quantile of max_{i ≠ j} T*_{n,(i,j)}, where

    T*_{n,(i,j)} = |β*_{n,i} − β*_{n,j}| / √( [se(β*_{n,i})]² + [se(β*_{n,j})]² ),

and, as above, β*_{n,i} is the estimate of β_i obtained by resampling from (1.1) with the restriction imposed that β_0 = ⋯ = β_k. In what follows, we use 199 i.i.d. bootstrap replications for both the max-t procedure and the overlap procedure.

The design of our simulations is the same as the one described in Section 1, but with several variations. First, we fix k = 5 and generate 10,000 samples of size n, with n ∈ {300, 600, 1200} (i.e., when n = 300, there are 50 observations in the control group and 50 observations in each of the 5 treatment groups, and so on). Second, we set β_i = θ(i + 1), with θ ∈ {0, 0.01, 0.02, ..., 1}, which allows us to examine both control of the familywise error rate (when θ = 0) and power (when θ > 0). Finally, we consider two different specifications for the error term distribution: a homoskedastic case in which all of the errors are drawn from the standard normal distribution, and a heteroskedastic case in which the errors for observations assigned to the control group are standard normal, while those for observations assigned to treatment group i ∈ {1, ..., k} are normal with mean zero and variance i + 1.[2]

            Homoskedasticity        Heteroskedasticity
    n       Overlap    max-t       Overlap    max-t
    300     0.0532     0.0571      0.0555     0.0558
    600     0.0503     0.0550      0.0492     0.0529
    1200    0.0524     0.0573      0.0546     0.0574

Table 1: Empirical familywise error rates for overlap and max-t procedures

Table 1 shows that control of the familywise error rate for both the overlap procedure and the max-t procedure is quite close to the nominal level α = 0.05 at all of the sample sizes considered, in both the homoskedastic and heteroskedastic cases. Interestingly, although the differences are quite small, the rejection rates for the overlap procedure are uniformly smaller than those for the max-t procedure.

[2] In both cases, we estimate the parameters using ordinary least squares and obtain heteroskedasticity-consistent standard errors using the method of White (1980).
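Given restricted-bootstrap output, both calibrations reduce to quantiles of per-draw maxima. The sketch below illustrates this under stated assumptions: the array names beta_star and se_star and both function names are ours, the arrays are taken as given (a full implementation would also construct the bootstrap draws from (1.1)), and calibrate_gamma exploits the fact that interval i lies entirely above interval j iff γ < (β*_i − β*_j)/(se*_i + se*_j).

```python
import numpy as np

def calibrate_gamma(beta_star, se_star, alpha=0.05):
    """Bootstrap choice of the half-width multiplier gamma, as in (2.6).

    beta_star, se_star: (B, k+1) arrays of bootstrap estimates and
    standard errors drawn under the restriction beta_0 = ... = beta_k.
    Non-overlap of intervals i and j occurs iff
        gamma < (beta_i - beta_j) / (se_i + se_j),
    so the smallest gamma with bootstrap non-overlap probability at
    most alpha is (up to interpolation) the (1 - alpha) quantile of
    the per-draw maxima of that ratio."""
    B, m = beta_star.shape
    ratio = (beta_star[:, :, None] - beta_star[:, None, :]) / \
            (se_star[:, :, None] + se_star[:, None, :])
    ratio[:, np.arange(m), np.arange(m)] = -np.inf  # exclude i == j
    return np.quantile(ratio.reshape(B, -1).max(axis=1), 1 - alpha)

def max_t_reject(beta_hat, se_hat, beta_star, se_star, alpha=0.05):
    """Benchmark single-step max-t test of all pairwise equalities:
    reject beta_i = beta_j when the studentized difference exceeds
    the bootstrap (1 - alpha) quantile of the maximal statistic."""
    def pairwise_t(b, s):
        d = np.abs(b[..., :, None] - b[..., None, :])
        return d / np.sqrt(s[..., :, None]**2 + s[..., None, :]**2)
    t_obs = pairwise_t(beta_hat, se_hat)          # (k+1, k+1)
    t_star = pairwise_t(beta_star, se_star)       # (B, k+1, k+1)
    iu = np.triu_indices(beta_hat.shape[0], k=1)  # unique pairs i < j
    crit = np.quantile(t_star[:, iu[0], iu[1]].max(axis=1), 1 - alpha)
    return t_obs > crit, crit
```

Both functions consume the same restricted-bootstrap draws, which is why the two procedures are directly comparable in the simulations.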

(a) Homoskedastic case   (b) Heteroskedastic case

Figure 2: Average power for overlap and max-t procedures

In order to compare the power of the two procedures, we follow BT in examining average power, which is the proportion of false restrictions that are rejected. Figures 2a and 2b display average power as a function of θ in the homoskedastic and heteroskedastic cases, respectively. Within these figures, black lines correspond to the overlap procedure and red lines to the max-t procedure, while solid, dashed, and dotted lines correspond to n = 300, n = 600, and n = 1200, respectively. Evidently, the two procedures have nearly identical average power over θ at all of the sample sizes considered, in both the homoskedastic and heteroskedastic cases.

4 Empirical Example

As an illustration of the procedure discussed above, we revisit a field experiment on teacher performance pay in India conducted by Muralidharan and Sundararaman

(2011), hereafter MS. Specifically, MS run two sets of experiments in which they offer incentive pay to teachers, conditioned on students' test scores. In the analysis, they have outcomes from three separate groups of schools: a control group (Control), a group in which teachers were paid based on the scores of their own students (Individual Incentive), and a group in which teachers were paid based on the performance of all students at their school (Group Incentive). The experiment is quite well designed, and readers can find additional details in MS.

Among other things, MS test the impact of the two interventions on combined math and language (Telugu) scores. The experiment ran for two years. In the majority of the analysis, year-two scores are analyzed, often controlling for the year-zero, or pre-treatment, scores. While most of the paper is about the average impact of the interventions, in Table 8 the authors compare the impact of the group incentive to the impact of the individual incentive. To make these comparisons, they first estimate the following model:

    Score2_t = α_0 + α_1 Individual_t + α_2 Group_t + Σ_{j=1}^{49} δ_j County_{j,t} + δ_50 Score0_t + U_t,   (4.1)

where Score2 and Score0 are, respectively, the combined math and language scores in year 2 and year 0, Group is an indicator variable for being in the group incentive, Individual is an indicator for being in the individual incentive, and {County_j}_{j=1}^{49} is a set of indicator variables for all but one of the 50 counties (Mandals) in which the experiment was conducted. The standard errors are clustered by school. Next, MS independently test the following three restrictions:

MS1: α_1 = 0
MS2: α_2 = 0

MS3: α_1 = α_2

The t-statistics obtained for the tests of these three restrictions are 4.84, 2.71, and -1.91, respectively. Thus, MS conclude that both incentives have significantly positive treatment effects, and that the two treatment effects are statistically different from one another.

We now consider an application of the overlap procedure discussed above. To do so, we first need to modify the estimating equation so that the mean of the control group is estimated as a parameter, rather than being absorbed into the intercept. Specifically, we estimate the following equation:

    Score2_t = β_0 Control_t + β_1 Individual_t + β_2 Group_t + Σ_{j=1}^{49} δ_j County_{j,t} + δ_50 Score0_t + U_t,   (4.2)

where Control is an indicator variable for membership in the control group. Notice that α_0 = β_0 and α_i = β_i − β_0, for i ∈ {1, 2}.

Using 199 i.i.d. bootstrap replications, we obtain γ*_n(0.05) = 0.505, and produce Figure 3, which can be interpreted as follows:

- Since C_{n,0} and C_{n,1} do not overlap, we can reject β_0 = β_1, or equivalently, α_1 = 0 (MS1).
- Since C_{n,0} and C_{n,2} overlap, we cannot reject β_0 = β_2, or equivalently, α_2 = 0 (MS2).
- Since C_{n,1} and C_{n,2} overlap, we cannot reject β_1 = β_2, or equivalently, α_1 = α_2 (MS3).

Thus, while our conclusion on the first restriction (MS1) is the same as that of MS, our conclusions on the second and third restrictions (MS2 and MS3) are different.
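Once γ*_n(α) is in hand, the overlap reading is mechanical. A minimal sketch, assuming only the calibrated multiplier γ = 0.505 from the text; the helper name and the illustrative estimates and standard errors in the usage line are ours, not taken from the paper:

```python
def intervals_overlap(est_i, se_i, est_j, se_j, gamma):
    """Check whether the uncertainty intervals [est +/- gamma * se]
    for two parameter estimates overlap.  Non-overlap means the
    corresponding equality restriction is rejected."""
    lo_i, hi_i = est_i - gamma * se_i, est_i + gamma * se_i
    lo_j, hi_j = est_j - gamma * se_j, est_j + gamma * se_j
    return lo_i <= hi_j and lo_j <= hi_i

gamma = 0.505  # calibrated value reported in the text
# Hypothetical estimates and standard errors for two groups:
print(intervals_overlap(0.13, 0.10, 0.42, 0.10, gamma))  # False: disjoint, reject equality
```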

Figure 3: Uncertainty Intervals for Empirical Example

We also consider an alternative method of visualizing these results in Figure 4, where all quantities are expressed in marginal terms. That is, we subtract β̂_{n,0} = 0.1319 from the point estimates and from the endpoints of the uncertainty intervals. In addition, we no longer display a point estimate or uncertainty interval for β_0, but instead include a dotted horizontal line at the value of U_{n,0} − β̂_{n,0}, the upper endpoint of the re-centered version of C_{n,0}.

Figure 4: Re-Centered Uncertainty Intervals for Empirical Example
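The re-centering behind Figure 4 can be sketched as follows. This is a hypothetical illustration: the interval endpoints below are invented (chosen so the qualitative conclusions match the text), and only the control-group estimate β̂_{n,0} = 0.1319 is taken from the paper. It also assumes each treatment estimate lies above the control estimate, so that clearing the dotted line is equivalent to non-overlap of the original intervals.

```python
import numpy as np

# Hypothetical point estimates and interval endpoints for
# (Control, Individual Incentive, Group Incentive).
est = np.array([0.1319, 0.42, 0.28])
lo = np.array([0.08, 0.35, 0.12])
hi = np.array([0.18, 0.49, 0.44])

shift = est[0]  # beta_hat_{n,0} = 0.1319
est_m, lo_m, hi_m = est - shift, lo - shift, hi - shift

# Dotted reference line of Figure 4: upper endpoint of the
# re-centered control interval, U_{n,0} - beta_hat_{n,0}.
threshold = hi_m[0]

# A treatment's interval clears the line iff its re-centered lower
# endpoint exceeds the threshold, i.e. its original interval does
# not overlap the control interval (treatment above control).
for name, l in zip(["Individual", "Group"], lo_m[1:]):
    print(name, l > threshold)
```

With these invented numbers, Individual clears the line while Group does not, matching the conclusions on MS1 and MS2 above.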

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289-300.

Bennett, C. J. and Thompson, B. S. (2015). Graphical procedures for multiple comparisons under general dependence. Unpublished manuscript.

Muralidharan, K. and Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, 119(1):39-77.

Romano, J. P. and Wolf, M. (2005). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association, 100(469):94-108.

Tukey, J. W. (1953). The problem of multiple comparisons. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948-1983, pages 1-300. Chapman and Hall, New York.

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817-838.