Understanding Ding's Apparent Paradox


Submitted to Statistical Science

Peter M. Aronow and Molly R. Offer-Westort
Yale University

Peter M. Aronow, Departments of Political Science and Biostatistics, Yale University, 77 Prospect Street, New Haven, CT 06520 United States (e-mail: peter.aronow@yale.edu). Molly Offer-Westort, Departments of Political Science and Statistics, Yale University, 89 Trumbull Street, New Haven, CT 06520 United States (e-mail: molly.offer-westort@yale.edu).

1. INTRODUCTION

We are grateful for the opportunity to comment on "A Paradox From Randomization-Based Causal Inference" (Ding, 2016), an interesting discussion of the properties of Fisher's randomization test (FRT) with a comparison to a Wald-type test based on a variance estimator originally proposed by Neyman (1923 [1990]). The article illustrates that the use of a Wald-type test using Neyman's variance estimator and the FRT (with the difference in means as a test statistic) can lead to situations where the null of zero average causal effect (Neyman's null) is rejected, while the sharp null of zero individual causal effects (Fisher's null) is not. The article in large part attributes this apparent paradox, which persists asymptotically, to a difference in the implied variances of the reference distributions used to construct tests.

In this comment, we seek to situate Ding (2016)'s findings in established statistical results, and to explain how the apparent paradox is a direct consequence of two well-known results. We summarize our contribution as follows:

(i) Rao-type tests may be suboptimal under non-local alternatives (Engle, 1984), and as Ding (2016) shows, the FRT is asymptotically equivalent to a Rao-type test assuming constant effects. As an alternative to the FRT, a Wald-type analogue to the FRT using the difference in means (considered by Freedman, 2008; Samii and Aronow, 2012; Gerber and Green, 2012; and Lin, 2013) is a valid test of Fisher's null, and does not suffer from the potential pathology of Rao-type tests. Furthermore, this Wald-type test, equivalent to a pooled-variance two-sample z-test, is asymptotically equivalent to the FRT under local alternatives. Reichardt and Gollob (1999) discuss this point, with reference to Mosteller and Rourke (1973).

(ii) As highlighted by Pratt (1964), Romano (1990), and Freedman (2008), the behavior of tests under incorrect working assumptions may depend on the joint distribution of the data. Analogous to the Behrens-Fisher problem, a test of the null of no average effect that assumes there is no effect on the variance may be more or less powerful than a test of no average effect that makes no such additional assumption. We illustrate this by comparing the Wald-type analogue to the FRT to the standard Wald-type test of Neyman's null. Combining (i) and (ii), we see that Ding (2016)'s apparent paradox follows from the use of a suboptimal test under non-local alternatives and the well-known behavior of tests of joint hypotheses when one of the constituent hypotheses is false.

(iii) We also note an error in Ding (2016)'s Theorem 7 and suggest a refinement that is correct at full generality.

N.b., we consider only asymptotic results, and do not discuss the finite N differences between exact tests and tests that are only known to be asymptotically valid. Finite N differences may account in part for Ding (2016)'s simulation results; while perhaps of practical importance, such differences are in our view theoretically uninteresting as the source of a potential paradox.

2. A WALD-TYPE ANALOGUE TO THE FISHER RANDOMIZATION TEST

Ding (2016) demonstrates the asymptotic equivalence of the FRT and Rao's score test as applied to a linear model assuming homoskedasticity. This result builds on prior findings from Romano (1990), Freedman (2008), and Samii and Aronow (2012). Freedman (2008) considers the operating characteristics of a Wald test assuming a homoskedastic linear model; Samii and Aronow (2012) give the implied variance of this test a randomization basis, by establishing that the implied variance is equivalent to the randomization distribution of the difference-in-means estimator if the treatment effect is assumed to be constant and equal to the observed estimate. Gerber and Green (2012) also advocate for this Wald-type approach in practice. Ding (2016) further reproduces Lin (2013) and Samii and Aronow (2012)'s result, showing that a Wald test from a linear model allowing for heteroskedasticity is equivalent to a Wald-type test using one of Neyman's variance estimators,

$$V(\text{Neyman}) = S_1^2/N_1 + S_0^2/N_0 + o_p(N^{-1}).$$

(We follow Ding (2016)'s notation throughout. We will also assume complete random assignment with the asymptotic scaling and regularity conditions of Freedman (2008), primarily bounded fourth (cross) moments on the limit distribution of potential outcomes with $\lim_{N \to \infty} N_1/N = p$, where $0 < p < 1$. These regularity conditions appear to be acknowledged in the final paragraph of Ding (2016).)

To summarize, Ding (2016) demonstrates that the FRT is asymptotically equivalent to a Rao score test under homoskedasticity, and reiterates that the Neyman test is equivalent to using a Wald test under heteroskedasticity. The asymptotic equivalence of the FRT and the Rao test illustrates why the FRT may have poor power relative to Wald-type analogues. Rao tests are known to have the potential to be suboptimal for non-local alternatives (Engle, 1984). In this setting, the reference distribution for the Rao test is constructed assuming the null hypothesis, $\tau = \bar{Y}_1 - \bar{Y}_0 = 0$. The Wald test statistic, however, is constructed in part from the empirical distribution, using as a working assumption $\tau = \hat{\tau}$ in order to construct a reference distribution local to the observed data. Comparing the two tests based on their asymptotic approximations, the Wald test performs better than the Rao score test in settings where the true value of $\tau$ is not local to 0. We investigate this point below.

As implied by Ding (2016, Equation 7), the variance of the FRT reference distribution is

$$V(\text{Fisher}) = S_1^2/N_0 + S_0^2/N_1 + \hat{\tau}^2/N + o_p(N^{-1}).$$
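The expression for $V(\text{Fisher})$ can be checked numerically. The following is a minimal sketch, not part of the original comment: it assumes simulated Gaussian outcomes (the hypothetical helper `simulate_data`), and compares the leading terms $S_1^2/N_0 + S_0^2/N_1 + \hat{\tau}^2/N$ with the exact variance of the difference in means over all complete randomizations with the observed outcomes held fixed, and with a Monte Carlo approximation of that randomization distribution. All function names are illustrative.

```python
import random
import statistics


def diff_in_means(y, z):
    """Difference in means between treated (z == 1) and control (z == 0) units."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def v_fisher_leading_terms(y, z):
    """S1^2/N0 + S0^2/N1 + tau_hat^2/N, i.e. V(Fisher) without the o_p(1/N) remainder."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    n1, n0 = len(treated), len(control)
    s1_sq = statistics.variance(treated)  # sample variance, N1 - 1 denominator
    s0_sq = statistics.variance(control)  # sample variance, N0 - 1 denominator
    tau_hat = statistics.fmean(treated) - statistics.fmean(control)
    return s1_sq / n0 + s0_sq / n1 + tau_hat ** 2 / (n1 + n0)


def v_fisher_exact(y, z):
    """Exact variance of the difference in means over all complete randomizations,
    holding the observed outcomes fixed (the sharp null): S^2 * N / (N1 * N0),
    where S^2 is the variance of the combined sample (N - 1 denominator)."""
    n = len(y)
    n1 = sum(z)
    n0 = n - n1
    return statistics.variance(y) * n / (n1 * n0)


def v_fisher_monte_carlo(y, z, draws, rng):
    """Monte Carlo variance of the difference in means under re-randomization."""
    z_perm = list(z)
    stats = []
    for _ in range(draws):
        rng.shuffle(z_perm)
        stats.append(diff_in_means(y, z_perm))
    return statistics.pvariance(stats)


def simulate_data(rng, n1, n0):
    """Hypothetical data: treated ~ N(1, 2^2), control ~ N(0, 1). Illustration only."""
    y = [rng.gauss(1.0, 2.0) for _ in range(n1)] + [rng.gauss(0.0, 1.0) for _ in range(n0)]
    z = [1] * n1 + [0] * n0
    return y, z


if __name__ == "__main__":
    rng = random.Random(2017)
    y, z = simulate_data(rng, n1=120, n0=80)
    print("leading terms:", v_fisher_leading_terms(y, z))
    print("exact        :", v_fisher_exact(y, z))
    print("monte carlo  :", v_fisher_monte_carlo(y, z, draws=4000, rng=rng))
```

The exact variance follows from decomposing the combined sum of squares into within-group terms and a between-group term proportional to $\hat{\tau}^2$; the three quantities should agree up to terms that vanish faster than $1/N$ and, for the Monte Carlo value, simulation noise.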

Suppose now that instead of using the FRT, we were to use the Wald-type analogue considered by Freedman (2008), Gerber and Green (2012), and Samii and Aronow (2012). Ding (2016) notes the variance estimator associated with the Wald-type analogue in Section A.3.1, and refers to it as $V(\text{OLS})$, where

$$V(\text{OLS}) = S_1^2/N_0 + S_0^2/N_1 + o_p(N^{-1}).$$

Freedman (2008)'s results straightforwardly imply that a Wald-type test using $V(\text{OLS})$ is an asymptotically valid test of Fisher's null, but that it may not be a valid test of the null that $\tau = 0$ when treatment effects are heterogeneous (as is possible under Neyman's null). It trivially follows that

$$V(\text{Fisher}) - V(\text{OLS}) = \hat{\tau}^2/N + o_p(N^{-1}),$$

and the Wald-type analogue to the FRT is asymptotically uniformly more powerful than the FRT. This is not a question of Fisher's null compared to Neyman's null, but rather simply a result of the fact that the FRT is provably suboptimal given non-local alternatives. The reason for this is intuitive: in constructing the implied variance estimate, $V(\text{Fisher})$ fails to account for any location shift across potential outcomes (akin to the combined variance), whereas $V(\text{OLS})$ does (akin to the pooled variance) (Reichardt and Gollob, 1999, pp. 124–125). Via standard linearization arguments, $V(\text{OLS})$ will provide a better approximation to $V(\hat{\tau})$ in large samples when effects are indeed non-local and constant. (Aronow (2013) establishes the consistency of analogous linearized variance estimators for a broad class of non-standard experimental designs under constant effects.)

Under Pitman-type local alternatives of the form $\tau_N = o(1)$, we have $V(\text{Fisher}) - V(\text{OLS}) = o_p(N^{-1})$. The only asymptotic distinction between $V(\text{Fisher})$ and $V(\text{OLS})$ emerges with non-local alternatives. But as Engle (1984, p. 786) notes, "If the alternative is not close to the null, then presumably both tests would reject with very high probability for large samples; the asymptotic behavior of tests for non-local alternatives is usually not of particular interest." Thus, as we proceed, we focus on the Wald-type test using $V(\text{OLS})$, which will allow us to highlight the remaining source of Ding (2016)'s apparent paradox.

3. COMPARING WALD TO WALD, OR THE RETURN OF THE BEHRENS-FISHER PROBLEM

Ding (2016, Equation 7) considers $V(\text{Fisher}) - V(\text{Neyman})$; let us consider an analogue, replacing $V(\text{Fisher})$ with its Wald-type analogue $V(\text{OLS})$:

$$V(\text{OLS}) - V(\text{Neyman}) = \left( S_1^2/N_0 + S_0^2/N_1 \right) - \left( S_1^2/N_1 + S_0^2/N_0 \right) + o_p(N^{-1}).$$

A necessary condition for $\operatorname{plim}_{N \to \infty} N\, V(\text{OLS}) \neq \operatorname{plim}_{N \to \infty} N\, V(\text{Neyman})$ is $S_1 \neq S_0$. (As noted above, this result also holds for $V(\text{Fisher}) - V(\text{Neyman})$ under any Pitman-type alternative such that $\tau_N = o(1)$.)

We unpack this result. Applying arguments from Neyman (1923 [1990]), it is well known that $\lim_{N \to \infty} N\, V(\hat{\tau}) \leq S_1^2/p + S_0^2/(1-p)$, where $p = \lim_{N \to \infty} N_1/N$. Ergo, given the finite population central limit theorem, Wald-type tests constructed using $V(\text{Neyman})$ will tend to be conservative. Under constant effects, Neyman (1923 [1990]) further shows that this holds with equality: $\lim_{N \to \infty} N\, V(\hat{\tau}) = S_1^2/p + S_0^2/(1-p)$. Furthermore, if the maintained hypothesis of constant effects is true, then it must be the case that $S_1 = S_0$, and therefore it would also be the case that $\lim_{N \to \infty} N\, V(\hat{\tau}) = S_0^2/p + S_1^2/(1-p)$; in such a case, $V(\text{Neyman})$ and $V(\text{OLS})$ therefore asymptotically agree. If we are not in the world of constant effects, it is possible that $S_0 \neq S_1$, and there exist limit distributions such that $\lim_{N \to \infty} N\, V(\hat{\tau}) > S_0^2/p + S_1^2/(1-p)$, implying that using $V(\text{OLS})$ may yield anticonservative inferences and invalid tests of Neyman's null.

Fisher's null can be viewed as a joint hypothesis that implies two features of the underlying distribution of potential outcomes: $\tau = 0$ and $S_1 = S_0$. Neyman's null implies no such assumption about $S_1 = S_0$. This setting may seem familiar, as it is analogous to the Behrens-Fisher problem. If all we are interested in is testing Neyman's null, $\tau = 0$ (cf. no location shift), an additional working assumption that $S_0 = S_1$ (cf. no scale shift) may lead to unusual behavior if indeed $S_0 \neq S_1$. If both $\tau \neq 0$ and $S_0 \neq S_1$, then Wald-type tests using $V(\text{OLS})$ may be more or less powerful than tests using $V(\text{Neyman})$. Or, as Lin (2013) illustrates, two wrongs can sometimes make a right, and tests of Fisher's null may be more powerful in testing the null that $\tau = 0$. But two wrongs can also sometimes make a wrong: if $S_1^2/p + S_0^2/(1-p) < S_0^2/p + S_1^2/(1-p)$, then the converse will hold, and Ding (2016)'s apparent paradox emerges. [Note that this condition is equivalent to ($p > 1/2$ and $S_1 > S_0$) or ($p < 1/2$ and $S_1 < S_0$).] Just as with the Behrens-Fisher problem, where an incorrect working assumption about the scale shift may lead to lower power in testing the null hypothesis of no location shift, there is no generic reason why, when $S_1 \neq S_0$, a test that assumes $S_1 = S_0$ would be more powerful in testing the hypothesis that $\tau = 0$.

4. A NOTE ON THEOREM 7

Ding (2016, Theorem 7) states, "For completely randomized experiments, matched-pair experiments, and $2^K$ factorial experiments, if the outcomes are binary, then all test statistics are equivalent to the difference-in-means statistic." From this, Ding (2016) concludes, "Therefore, for binary data, the choice of test statistic is not a problem." While perhaps practically true for many test statistics, this claim is not correct when taken at full generality. For example, consider a test statistic that is not strictly increasing in the difference in means (e.g., $|\hat{\tau}|$). For such test statistics, the resulting inferences may substantively differ. Here we state a weaker claim verifiable at full generality, which has been known since Copas (1973, Equation 9): for completely randomized experiments, if outcomes are binary, then under the sharp null hypothesis of no treatment effect, all test statistics may be written as a function of the difference-in-means statistic. Stronger statements are likely available.

5. CONCLUSION

Our comment does not subtract from the paper as a useful exposition of tests commonly used to test the Fisher and Neyman nulls of no effect in randomized experiments. Ding (2016) has drawn attention to an important topic in the domain of randomization-based causal inference. We hope that by placing Ding (2016)'s contribution in the context of a broader statistical literature, we have shed light on the reasons for this apparent paradox.
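As a numerical illustration of the Section 3 comparison, the sketch below evaluates the two limits, $S_1^2/p + S_0^2/(1-p)$ for $N\,V(\text{Neyman})$ and $S_0^2/p + S_1^2/(1-p)$ for $N\,V(\text{OLS})$, and checks the bracketed sign rule on a small grid. The function names and the grid values are illustrative assumptions and do not come from Ding (2016) or from the references.

```python
def neyman_limit(s1, s0, p):
    """Limit of N * V(Neyman): S1^2 / p + S0^2 / (1 - p)."""
    return s1 ** 2 / p + s0 ** 2 / (1 - p)


def ols_limit(s1, s0, p):
    """Limit of N * V(OLS): S0^2 / p + S1^2 / (1 - p)."""
    return s0 ** 2 / p + s1 ** 2 / (1 - p)


def neyman_limit_is_smaller(s1, s0, p):
    """True when the Neyman limit is strictly below the OLS limit, the case in
    which the apparent paradox can emerge."""
    return neyman_limit(s1, s0, p) < ols_limit(s1, s0, p)


def sign_rule(s1, s0, p):
    """The equivalent condition: (p > 1/2 and S1 > S0) or (p < 1/2 and S1 < S0)."""
    return (p > 0.5 and s1 > s0) or (p < 0.5 and s1 < s0)


if __name__ == "__main__":
    # Illustrative grid of treatment shares and standard deviations.
    for p in (0.2, 0.5, 0.8):
        for s1, s0 in ((1.0, 2.0), (2.0, 2.0), (2.0, 1.0)):
            print(p, s1, s0,
                  neyman_limit(s1, s0, p),
                  ols_limit(s1, s0, p),
                  neyman_limit_is_smaller(s1, s0, p))
```

The equivalence follows from writing the difference of the two limits as $(S_1^2 - S_0^2)(1 - 2p)/\{p(1-p)\}$, which is negative exactly when $S_1^2 - S_0^2$ and $1 - 2p$ have opposite signs.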

REFERENCES

Aronow, P. M. (2013). Model assisted causal inference. PhD thesis, Department of Political Science, Yale University, New Haven, CT.

Copas, J. (1973). Randomization models for the matched and unmatched 2 × 2 tables. Biometrika 60 467–476.

Ding, P. (2016). A paradox from randomization-based causal inference. Statist. Sci.

Engle, R. F. (1984). Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handbook of Econometrics 2 775–826.

Freedman, D. A. (2008). On regression adjustments to experimental data. Adv. Appl. Math. 40 180–193.

Gerber, A. S. and Green, D. P. (2012). Field experiments: Design, analysis, and interpretation. WW Norton.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. Ann. Appl. Stat. 7 295–318.

Mosteller, F. and Rourke, R. E. (1973). Sturdy statistics: Nonparametrics and order statistics. Addison-Wesley, Reading, MA.

Neyman, J. S. (1923 [1990]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Reprinted in Statist. Sci. 5 465–472.

Pratt, J. W. (1964). Robustness of some procedures for the two-sample location problem. J. Amer. Statist. Assoc. 59 665–680.

Reichardt, C. S. and Gollob, H. F. (1999). Justifying the use and increasing the power of a t test for a randomized experiment with a convenience sample. Psychol. Meth. 4 117.

Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. J. Amer. Statist. Assoc. 85 686–692.

Samii, C. and Aronow, P. M. (2012). On equivalencies between design-based and regression-based variance estimators for randomized experiments. Statist. Probab. Lett. 82 365–370.