
QED
Queen's Economics Department Working Paper No. 903

Graphical Methods for Investigating the Size and Power of Hypothesis Tests

Russell Davidson, Queen's University and GREQAM
James G. MacKinnon, Queen's University

Department of Economics, Queen's University
94 University Avenue
Kingston, Ontario, Canada K7L 3N6

6-1994

Graphical Methods for Investigating the Size and Power of Hypothesis Tests

by Russell Davidson and James G. MacKinnon

Department of Economics, Queen's University, Kingston, Ontario, Canada K7L 3N6
GREQAM, Centre de la Vieille Charité, 2 rue de la Charité, 13002 Marseille, France

Abstract

Simple techniques for the graphical display of simulation evidence concerning the size and power of hypothesis tests are developed and illustrated. Three types of figures, called P value plots, P value discrepancy plots, and size-power curves, are discussed. Some Monte Carlo experiments on the properties of alternative forms of the information matrix test for linear regression models and probit models are used to illustrate these figures. Tests based on the OPG regression are found to perform worse in terms of both size and power than efficient score tests.

This research was supported, in part, by the Social Sciences and Humanities Research Council of Canada. We are grateful to two referees and to participants in workshops at Cornell University, York University, the State University of New York at Albany, GREQAM, the Statistical Society of Canada, and the Canadian Econometric Study Group for helpful comments.

First Version: June, 1994. Revised: June, 1995 and March, 1996.

1 Introduction

To obtain evidence on the finite-sample properties of hypothesis testing procedures, econometricians generally resort to simulation methods. As a result, many, if not most, of the papers that deal with specification testing and other forms of hypothesis tests include some Monte Carlo results. This paper discusses some simple graphical methods that appear to be very useful for characterizing both the size and the power of test statistics. The graphs convey much more information, in a more easily assimilated form, than tables can do.

Consider a Monte Carlo experiment in which N realizations of some test statistic τ are generated using a data generating process, or DGP, that is a special case of the null hypothesis. We may denote these simulated values by τ_j, j = 1, …, N. Unless τ is extraordinarily expensive to compute (as it may be if bootstrapping is involved; see Section 2), N will generally be a large number, probably 5000 or more. In practice, of course, several different test statistics may be generated on each replication.

The conventional way to report the results of such an experiment is to tabulate the proportion of the time that τ_j exceeds one or more critical values, such as the 1%, 5%, and 10% values for the asymptotic distribution of τ. This approach has at least two serious disadvantages. First of all, the tables provide information about only a few points on the finite-sample distribution of τ; as we shall see in Section 3, this can be an important limitation. Secondly, the tables require some effort to interpret, and they generally do not make it easy to see how changes in the sample size, the number of degrees of freedom, and other factors affect test size.

There are many ways to present graphically the sort of information that is usually presented in tabular form. The ones we advocate here are easy to implement and yield graphs that are easy to interpret. All of the graphs we discuss are based on the empirical distribution function, or EDF, of the P values of the τ_j's. The P value of τ_j is the probability of observing a value of τ as extreme as or more extreme than τ_j, according to some distribution F(τ). This distribution could be the asymptotic distribution of τ, or it could be a distribution derived by bootstrapping, or it could be an approximation to the (generally unknown) finite-sample distribution of τ. For notational simplicity, we shall assume that there is only one P value associated with τ_j, namely, p_j ≡ p(τ_j). Precisely how p_j is defined will vary. If τ is asymptotically distributed as χ²(r) and F_{χ²}(x, r) denotes the cdf of the χ²(r) distribution evaluated at x, then p_j = 1 − F_{χ²}(τ_j, r). If τ is asymptotically distributed as N(0, 1) and Φ(x) denotes the standard normal cdf, then, for a two-tail test, p_j = 2 − 2Φ(|τ_j|).

The EDF of the p_j's is simply an estimate of the cdf of p(τ). At any point x_i in the (0, 1) interval, it is defined by

$$\hat F(x_i) \;\equiv\; \frac{1}{N}\sum_{j=1}^{N} I(p_j \le x_i), \qquad (1)$$

where I(p_j ≤ x_i) is an indicator function that takes the value 1 if its argument is true and 0 otherwise.
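As an aside for readers who wish to replicate this kind of exercise, the mapping from simulated statistics to P values and then to the EDF (1) takes only a few lines of code. The following Python sketch is illustrative only; the χ²(5) example and all variable names are our assumptions, not anything from the experiments reported below.

```python
# Minimal sketch: simulated statistics -> asymptotic P values -> EDF (1).
# Assumption: a chi-squared(5) test statistic; tau here is just a stand-in
# drawn from the asymptotic distribution, so p should be roughly uniform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, r = 5000, 5
tau = rng.chisquare(r, size=N)          # placeholder for Monte Carlo output

p = 1.0 - stats.chi2.cdf(tau, r)        # p_j = 1 - F_chi2(tau_j, r)

def edf(p_values, x):
    """F_hat(x_i) = (1/N) * #{j : p_j <= x_i}, vectorized over a grid x."""
    return np.searchsorted(np.sort(p_values), x, side="right") / len(p_values)

print(edf(p, np.array([0.01, 0.05, 0.10])))  # close to nominal sizes here
```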

An EDF is conventionally evaluated at every data point. In order to conserve storage space, since N will often be very large, we suggest that, instead, (1) be evaluated only at m points x_i, i = 1, …, m. The x_i's must be chosen in advance, so as to provide a reasonable snapshot of the (0, 1) interval, or of that part of it which is of interest. It is difficult to state categorically how large m should be and how the x_i's should be chosen. A quite parsimonious set of x_i's is

$$x_i = 0.002,\, 0.004,\, \ldots,\, 0.01,\, 0.02,\, \ldots,\, 0.99,\, 0.992,\, \ldots,\, 0.998 \quad (m = 107). \qquad (2)$$

Another choice, which should give slightly better results, is

$$x_i = 0.001,\, 0.002,\, \ldots,\, 0.010,\, 0.015,\, \ldots,\, 0.990,\, 0.991,\, \ldots,\, 0.999 \quad (m = 215). \qquad (3)$$

For both (2) and (3), there are extra points near 0 and 1 in order to ensure that we do not miss any unusual behavior in the tails. As we shall see in Section 5, it may be necessary to add additional points in certain cases. Note that, because we only evaluate the EDF at a relatively small number of points, it is straightforward to employ variance reduction techniques. If control or antithetic variates are available, the techniques proposed by Davidson and MacKinnon (1992b) to estimate tail areas can simply be applied to each point on the EDF.

The simplest graph that we will discuss is a plot of F̂(x_i) against x_i. We shall refer to such plots as P value plots. If the distribution of τ used to compute the p_j's is correct, each of the p_j's should be distributed as uniform (0, 1). Therefore, when F̂(x_i) is plotted against x_i, the resulting graph should be close to the 45° line. As we shall see in Section 3, P value plots allow us to distinguish at a glance among test statistics that systematically over-reject, test statistics that systematically under-reject, and test statistics that reject about the right proportion of the time. However, because all test statistics that behave approximately the way they should will look roughly like 45° lines, P value plots are not very useful for distinguishing among such test statistics.

For dealing with test statistics that are well-behaved, it is much more revealing to graph F̂(x_i) − x_i against x_i. We shall refer to these graphs as P value discrepancy plots. These plots have advantages and disadvantages. They convey a lot more information than P value plots for test statistics that are well behaved. However, some of this information is spurious, simply reflecting experimental randomness. In Section 4, we therefore discuss semi-parametric methods for smoothing them. Moreover, because there is no natural scale for the vertical axis, P value discrepancy plots can be harder to interpret than P value plots.

P value plots and P value discrepancy plots are very useful for dealing with test size, but not very useful for dealing with test power. In Section 5, we will discuss graphical methods for comparing the power of competing tests using size-power curves. These curves can be constructed using two EDFs, one for an experiment in which the null hypothesis is true, and one for an experiment in which it is false.
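To make the two kinds of plot concrete, here is a hedged sketch of how the grid (3) and both figures might be produced. The uniform `p` is a stand-in for genuine Monte Carlo P values, and the matplotlib usage is purely illustrative.

```python
# Sketch: the m = 215 grid of (3), a P value plot, and a P value
# discrepancy plot.
import numpy as np
import matplotlib.pyplot as plt

grid = np.concatenate([
    np.arange(1, 11) / 1000.0,       # 0.001, ..., 0.010
    np.arange(3, 199) / 200.0,       # 0.015, ..., 0.990 in steps of 0.005
    np.arange(991, 1000) / 1000.0,   # 0.991, ..., 0.999
])
assert grid.size == 215

p = np.random.default_rng(0).uniform(size=5000)   # stand-in P values
Fhat = np.searchsorted(np.sort(p), grid, side="right") / p.size

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot(grid, Fhat)                    # P value plot
ax1.plot([0, 1], [0, 1], "--")          # 45 degree line
ax1.set_xlabel("Nominal size"); ax1.set_ylabel("Actual size")
ax2.plot(grid, Fhat - grid)             # P value discrepancy plot
ax2.axhline(0.0, linestyle="--")
ax2.set_xlabel("Nominal size"); ax2.set_ylabel("Discrepancy")
plt.show()
```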

It would be more conventional to graph the EDF of the τ_j's instead of the EDF of their P values. If the former were graphed against a hypothesized distribution, the result would be what is called a P-P plot; see Wilk and Gnanadesikan (1968). Plotting P values makes it much easier to interpret the plots, since what they should look like will not depend on the null distribution of the test statistic. This fact also makes it easy to compare test statistics which have different distributions under the null, and to compare different procedures for making inferences from the same test statistics. Note that P value plots are really just P-P plots of P values, and, contrary to what we believed when we initially wrote this paper, we are not the first to employ them. See, for example, Fisher, Mammen, and Marron (1994) and West and Wilcox (1995).

Another common type of plot is a Q-Q plot, in which the empirical quantiles of the τ_j's are plotted against the actual quantiles of their hypothesized distribution. If the empirical distribution is close to the hypothesized one, the plot will be close to the 45° line. The problem with this approach, which was used by Chesher and Spady (1991), is that a Q-Q plot has no natural scale for the axes: if the hypothesized distribution changes, so will that scale. This makes it impossible to plot on the same axes test statistics which have different distributions under the null. One way to get around this (see, for example, Mammen, 1992) is to employ a Q-Q plot of P values. This will convey exactly the same information as a P value plot, except that the plot will be reflected about the 45° line. In our view, however, Q-Q plots of P values are harder to read than P value plots, because it is more natural to have nominal test size on the horizontal axis.

Of course, graphical methods by themselves are not always enough. When test performance depends on a number of factors, the two-dimensional nature of most graphs and all tables can be limiting. In such cases, it may be desirable to supplement the graphs with estimated response surfaces which relate size or power to sample size, parameter values, and so on; see, for example, Hendry (1984).

In order to illustrate and motivate these procedures, we use them to present the results of a study of the properties of alternative forms of the information matrix (IM) test proposed by White (1982). We compare tests based on the outer-product-of-the-gradient (OPG) regression, which was proposed as a way to compute IM tests by Chesher (1983) and Lancaster (1984), with two other forms of the IM test. We find that the OPG variant performs relatively poorly in terms of both size and power, although its size can be greatly improved by bootstrapping. The result that the OPG form of the IM test frequently tends to over-reject in finite samples is not new: see, among others, Taylor (1987), Chesher and Spady (1991), Davidson and MacKinnon (1992a), and Horowitz (1994). However, the result that, for a given size, OPG IM tests have lower power than some other forms of the IM test does not appear to be well-known.

The plan of the paper is as follows. In the next section, we briefly discuss several forms of the IM test statistic. In Section 3, we present a number of Monte Carlo results to illustrate the use of P value plots and P value discrepancy plots.

In Section 4, we discuss and illustrate methods for smoothing P value discrepancy plots. Finally, in Section 5, we discuss and illustrate the use of size-power curves.

2 Alternative Forms of the Information Matrix Test

To illustrate the utility of the graphical methods we discuss in this paper, we present results for several variants of the IM test, some of which were studied in Davidson and MacKinnon (1992a). Table 1 of that paper, which occupies a page and a half, is a particularly striking example of the disadvantages of presenting simulation results for test statistics in a non-graphical way. We shall deal with two models, the univariate linear regression model and the probit model. In the former case, IM tests are pivotal, which means that their distributions do not depend on any nuisance parameters, but in the latter case they are not. Three of the variants of the IM test that we shall deal with pertain to the linear regression model

$$y_t = \delta_1 + \sum_{i=2}^{k} \delta_i X_{ti} + u_t, \qquad u_t \sim \mathrm{NID}(0, \sigma^2), \qquad t = 1, \ldots, n. \qquad (4)$$

The regressors X_ti are normal random variables, independent across observations, and equicorrelated with correlation coefficient one-half. Because all versions of the IM test for this model are pivotal, the values of the δ_i's and σ² are chosen arbitrarily.

The OPG variant of the IM test statistic is obtained by regressing an n-vector of 1s on ũ_t X_ti, for i = 1, …, k, and on ½(k² + 3k) test regressors. These test regressors are functions of ẽ_t ≡ ũ_t/σ̃ and the X_ti's. Here ũ_t is the t-th OLS residual and σ̃² = (1/n) Σ_{t=1}^n ũ_t². There are k(k+1)/2 − 1 test regressors of the form (ẽ_t² − 1)X_ti X_tj, which test for heteroskedasticity, k regressors of the form (ẽ_t³ − 3ẽ_t)X_ti, which test for skewness interacting with the regressors, and one regressor of the form ẽ_t⁴ − 5ẽ_t² + 2, which tests for kurtosis. The test statistic is n minus the sum of squared residuals from the regression, and it is asymptotically distributed as χ²(½(k² + 3k)).

The DLR form of the IM test is a bit more complicated. It involves a double-length artificial regression with 2n observations, and the test statistic is 2n minus the sum of squared residuals from this regression. The number of regressors is the same as for the OPG test, and so is the asymptotic distribution of the test statistic. See Davidson and MacKinnon (1984, 1992a).
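The OPG recipe just described is mechanical enough that a short sketch may help. The following function is our own illustrative reading of that recipe, not the authors' code; it assumes the first column of X is the constant, and the helper names are ours.

```python
# Sketch of the OPG variant of the IM test for the regression model (4).
import numpy as np
from scipy import stats

def opg_im_test(y, X):
    n, k = X.shape                        # first column of X is the constant
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta                      # OLS residuals u_t
    e = u / np.sqrt(np.mean(u**2))        # e_t = u_t / sigma_tilde

    regs = [u[:, None] * X]               # u_t X_ti, i = 1, ..., k
    # heteroskedasticity directions, skipping the (constant, constant) pair
    het = [(e**2 - 1) * X[:, i] * X[:, j]
           for i in range(k) for j in range(i, k) if not (i == 0 and j == 0)]
    skew = [(e**3 - 3*e) * X[:, i] for i in range(k)]   # skewness directions
    kurt = [e**4 - 5*e**2 + 2]                          # kurtosis direction
    Z = np.column_stack(regs + het + skew + kurt)

    ones = np.ones(n)
    ssr = np.sum((ones - Z @ np.linalg.lstsq(Z, ones, rcond=None)[0])**2)
    df = (k*k + 3*k) // 2                 # number of test regressors
    return n - ssr, 1.0 - stats.chi2.cdf(n - ssr, df)
```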

A third form of the IM test, which is less widely available than the other two, is the efficient score, or ES, form. The ES form of the Lagrange Multiplier test is often considered to have optimal or nearly optimal properties, because the only random quantities in the estimate of the information matrix are the restricted maximum likelihood parameter estimates. In this case, the ES form of the IM test is actually the sum of three test statistics:

$$\frac{1}{2}\, h_2^\top Z (Z^\top Z)^{-1} Z^\top h_2 \;+\; \frac{1}{6}\, h_3^\top X (X^\top X)^{-1} X^\top h_3 \;+\; \frac{1}{24n}\Bigl(\sum_{t=1}^{n} (\tilde e_t^4 - 3)\Bigr)^{\!2}, \qquad (5)$$

where Z is a matrix with typical element X_ti X_tj, h₂ has typical element ẽ_t² − 1, and h₃ has typical element ẽ_t³. These three statistics test for heteroskedasticity, skewness, and kurtosis, respectively; see Hall (1987).

The other two variants of the IM test that we shall deal with pertain to the probit model. Consider any binary response model of the form

$$E(y_t \mid \Omega_t) = Q(X_t \beta), \qquad t = 1, \ldots, n, \qquad (6)$$

where y_t is a 0-1 dependent variable, Ω_t denotes an information set, Q(·) is a twice continuously differentiable, increasing function that maps from the real line to the 0-1 interval, X_t is a 1 × k row vector of observations on variables that belong to Ω_t, and β is a k-vector of unknown parameters. The contribution to the loglikelihood by observation t is

$$\ell_t = y_t \log Q(X_t\beta) + (1 - y_t)\log\bigl(1 - Q(X_t\beta)\bigr), \qquad (7)$$

and the derivative with respect to β_i is

$$\frac{\partial \ell_t}{\partial \beta_i} = \frac{X_{ti}\, q_t\, (y_t - Q_t)}{Q_t (1 - Q_t)}, \qquad (8)$$

where q_t ≡ q(X_tβ) is the derivative of Q_t ≡ Q(X_tβ). After some algebra, the derivative of (8) with respect to β_j can be seen to be

$$\frac{X_{ti} X_{tj}}{Q_t^2 (1 - Q_t)^2}\Bigl((y_t - Q_t)\, q_t'\, Q_t (1 - Q_t) - q_t^2 (y_t - Q_t)^2\Bigr), \qquad (9)$$

where q_t' ≡ q'(X_tβ) is the derivative of q_t. Thus the (i, j) direction of the IM test is given by

$$\frac{\partial^2 \ell_t}{\partial\beta_i\, \partial\beta_j} + \frac{\partial \ell_t}{\partial\beta_i}\, \frac{\partial \ell_t}{\partial\beta_j} = \frac{X_{ti} X_{tj}\, q_t (y_t - Q_t)}{Q_t(1 - Q_t)}\, \frac{q_t'}{q_t}. \qquad (10)$$

The OPG regression simply involves regressing a vector of 1s on k regressors with typical element (8) and on k(k+1)/2 − 1 test regressors with typical element (10), where all regressors are evaluated at the ML estimates β̂. The test statistic is n minus the sum of squared residuals from this regression. The number of test regressors is normally one less than k(k+1)/2 because, if there is a constant term, the test regressor for which i and j both index the constant will be perfectly collinear with the regressor for the constant from (8).

In our experiments, we deal with a probit model that has between 2 and 5 regressors, one of them a constant term and the rest independent N(0, 1) random variates. For this model, the function Q(X_tβ) is the cumulative standard normal distribution function, Φ(X_tβ), its derivative q(X_tβ) is the standard normal density, φ(X_tβ), and q_t'/q_t = −X_tβ.

It is quite easy to derive an efficient score variant of the LM test for the probit model; see Orme (1988). Consider the artificial regression

$$w_t (y_t - \hat Q_t) = w_t \hat q_t X_t b + w_t \hat q_t (\hat q_t'/\hat q_t) W_t c + \text{residual}, \qquad (11)$$

where Q̂_t ≡ Q(X_tβ̂), q̂_t ≡ q(X_tβ̂), q̂_t' ≡ q'(X_tβ̂), w_t ≡ (Q̂_t(1 − Q̂_t))^{−1/2}, and W_t is a vector with typical element X_ti X_tj. By standard results for artificial regressions (see Davidson and MacKinnon, 1993, Chapter 15), the explained sum of squares from regression (11) is an LM statistic for c = 0, and from (10) it is clear that this LM statistic is testing in the directions of the IM test. That this form of the LM statistic is the efficient score variant is evident from the fact that none of the regressors in (11) depends directly on y_t; their only dependence on the data is through the ML estimates β̂. In the case of the probit model, (11) simplifies slightly to

$$w_t (y_t - \hat\Phi_t) = w_t \hat\phi_t X_t b + w_t \hat\phi_t (-X_t\hat\beta) W_t c + \text{residual}. \qquad (12)$$

This artificial regression is precisely the one for testing the hypothesis that γ = 0 in the model

$$E(y_t \mid \Omega_t) = \Phi\!\left(\frac{X_t\beta}{\exp(W_t\gamma)}\right). \qquad (13)$$

This model involves a form of heteroskedasticity, and it reduces to the ordinary probit model when γ = 0. Thus we see that, for the probit model, the IM test is equivalent to a test for a certain type of heteroskedasticity.

It is well known that many forms of the IM statistic, most notoriously the OPG form, have very poor finite-sample properties under the null. One way to improve these is to obtain P values by using the parametric bootstrap. This idea was investigated by Horowitz (1994), who found that it worked very well for IM tests in probit and tobit models. The methodology we use, which differs somewhat from that of Horowitz, is as follows. After a test statistic, say τ̂, has been obtained, B sets of simulated data, or bootstrap samples, are generated from a DGP that satisfies the null hypothesis, using as parameters the parameter estimates from the actual sample. For each of the bootstrap samples, a bootstrap test statistic is then computed. Suppose that B* of these test statistics are greater than τ̂. Then the estimated bootstrap P value for τ̂ is B*/B.

In the case of the linear regression model (4), all forms of the IM test statistic are pivotal. This implies that, as B → ∞, the EDF of the bootstrap P values must tend to the 45° line, and that the size-power curve will be unaffected by bootstrapping. Therefore, in this case, we do not need to do a Monte Carlo experiment to see how the bootstrap works; except for experimental error, it will work perfectly. Note that the error terms for the bootstrap samples must be obtained from a normal distribution rather than by resampling from the residuals, since the residuals will not be normally distributed. Thus the result that the bootstrap will work perfectly for IM tests in linear regression models holds only for the parametric bootstrap. Since the IM tests for the probit model are not pivotal, it is of interest to see how bootstrapping affects their size and power. The results, which are presented in Sections 3 and 5, turn out to be remarkably interesting.
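A schematic version of this bootstrap loop may be useful. Everything here, including the helper interface (`statistic`, `fit_null`, `simulate_null`) and the default B = 500, is an assumption made for illustration, not the authors' implementation.

```python
# Sketch of the parametric bootstrap P value described above: B*/B, the
# fraction of bootstrap statistics that exceed the observed statistic.
import numpy as np

def bootstrap_pvalue(y, X, statistic, fit_null, simulate_null, B=500, seed=0):
    rng = np.random.default_rng(seed)
    tau_hat = statistic(y, X)
    theta_hat = fit_null(y, X)            # parameter estimates under the null
    tau_star = np.empty(B)
    for b in range(B):
        y_star = simulate_null(theta_hat, X, rng)   # one bootstrap sample
        tau_star[b] = statistic(y_star, X)
    return np.mean(tau_star > tau_hat)    # B*/B
```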

how the bootstrap works; except for experimental error, it will work perfectly Note that the error terms for the bootstrap samples must be obtained from a normal distribution rather than by resampling from the residuals, since the residuals will not be normally distributed Thus the result that the bootstrap will work perfectly for IM tests in linear regression models is only for the parametric bootstrap Since the IM tests for the probit model are not pivotal, it is of interest to see how bootstrapping affects their size and power The results, which are presented in Sections 3 and 5, turn out to be remarkably interesting 3 P Value Plots and P Value Discrepancy Plots Figure 1 shows P value plots for the OPG, DLR, and ES variants of the IM test for the regression case with n = 100 and k = 2 These are based on an experiment with 5000 replications The x i s were chosen as in (3), so that m = 215 The figure makes it dramatically clear that the OPG form of the IM test works very badly under the null In this case, it rejects just under half the time at the nominal 5% level In contrast, the DLR form seems to work quite well, and the ES form over-rejects in the left tail and under-rejects elsewhere Figure 1 illustrates some of the advantages and disadvantages of P value plots On the one hand, these plots make it very easy to distinguish tests that work well, such as DLR, from tests that work badly, such as OPG On the other hand, they do not make it easy to see patterns in the behavior of tests that work well For example, one has to look quite closely at the figure to see that DLR systematically under-rejects for small test sizes and over-rejects for larger test sizes A potential disadvantage of P value plots is that they can take up a lot of space Since, in most cases, we are primarily interested in reasonably small test sizes, it generally makes sense to truncate the plot at some value of x less than unity Figure 2 shows two sets of P value plots, both truncated at x = 025 These plots provide a great deal of information about how the sample size n and the number of regressors k affect the performance of the OPG form of the IM test From Figure 2a, we see that size improves as n increases but is still very unsatisfactory for n = 1000 From Figure 2b, we see that the performance of these tests deteriorates dramatically as k (and hence the number of degrees of freedom) increases These figures tell us all we really need to know about the performance of the OPG IM test under the null An alternative approach, which was used by West and Wilcox (1995), is to show only the lower left-hand part of the plot, up to say 025 on each axis The potential problem with this approach is evident from Figure 2b It cannot be used with tests that over-reject as severely as the OPG IM test does in this figure For the DLR variant of the IM test, P value plots are not very informative because the tests perform so well Figure 3 therefore provides P value discrepancy plots for the same cases as Figure 2, but truncated at x = 040 Figure 3a makes it clear that the tendency for DLR systematically to under-reject for small test 7

We now turn to the results for the probit model. Figure 4 shows that the OPG variant of the IM test rejects far too often under the null. This over-rejection becomes worse as the number of regressors k increases and as the sample size n declines. It also depends on the parameters of the DGP. The figure reports results from two sets of experiments, one with β₀ = 0 and the other β_i's equal to 1, and the second with all the β_i's equal to 1. The performance of the OPG test is always somewhat worse in the latter case. From the figure, we see that the OPG test should never be used without bootstrapping and that, because the test is definitely nonpivotal, bootstrapping cannot be expected to work perfectly.

Figure 5 shows P value discrepancy plots for the ES form of the IM test for probit models. To keep the figure readable, only results for the β₀ = 0 case are shown. The comparable figure for β₀ = 1, which is not shown, looks similar but not identical to this one. Figure 5 is much less jagged than Figure 3 because, like Figure 4, it is based on 100,000 replications. The striking feature of this figure is that the ES test rejects too often for small test sizes and not nearly often enough for larger sizes. When k = 3, tests at the .05 level perform quite well, especially for n = 50 and n = 100. But the test over-rejects quite severely at the .01 and .001 levels, and its performance at the .05 level actually seems to deteriorate as n increases.

By this point, it should be clear why P value plots and P value discrepancy plots are often much more informative than tables of rejection frequencies at conventional significance levels like .05. First of all, there is nothing special about the .05 level. Some investigators may prefer to use the .01 level or even the .001 level, while pretests are often performed at levels of .25 or even higher. Moreover, if investigators are simply looking at P values rather than testing at some specified level, they will generally place much more weight on a P value of .001 than on a P value of .049. This can be a dangerous thing to do. A table of rejection frequencies at the .05 level would suggest that the ES form of the IM test for probit models is quite reliable, and yet very small P values occur much more often than they should do.

Another reason for being concerned with the entire distribution of a test statistic, or at least the left-hand part of it, is that, in our experience, a test statistic that is seriously unreliable at any level for some combination of parameter values and sample size is likely to be seriously unreliable at the .05 level for some other combination of parameter values and sample size; consider Figure 5. Thus serious P value discrepancies at any level should be a cause for concern.

Of course, because they use one dimension for nominal test size, P value plots and P value discrepancy plots cannot use that dimension to represent something else, such as the value of some parameter. In consequence, there are bound to be cases in which other types of plots are equally or more informative. There will certainly be many occasions when, like residual plots for regressions, P value plots provide valuable information to investigators but are not worth including in a published paper where space is tight.

Figure 3 illustrates one serious problem with P value discrepancy plots: unless the sample size is very large, they tend to be quite jagged, reflecting experimental error. Therefore, it is natural to think about how to obtain smoother plots that may be easier to interpret. This is the topic of the next section.

4 Smoothing P Value Discrepancy Plots

One natural way to smooth a P value discrepancy plot is to regress the discrepancies on smooth functions of x_i, such as polynomials or trigonometric functions. If we let v_i denote F̂(x_i) − x_i and f_l(x_i) denote the l-th function of x_i, the first of which may be a constant term, such a regression can be written as

$$v_i = \sum_{l=1}^{L} \gamma_l f_l(x_i) + u_i, \qquad i = 1, \ldots, m. \qquad (16)$$

There are two difficulties with this approach. One is that the u_i's are neither homoskedastic nor serially uncorrelated. However, it turns out to be relatively easy to derive a feasible GLS procedure. Another problem is how to choose L and the functions f_l(x_i).

There are several ways to do this, and our experience suggests that there is no single best way.

If the regression function in (16) were chosen correctly, the error term u_i would be equal to F̂(x_i) − F(x_i). It can be shown that, for any two points x and x′ in the (0, 1) interval,

$$\mathrm{Var}\bigl(\hat F(x)\bigr) = N^{-1} F (1 - F), \qquad \mathrm{Cov}\bigl(\hat F(x), \hat F(x')\bigr) = N^{-1}\bigl(\min(F, F') - F F'\bigr). \qquad (17)$$

Here F ≡ F(x) and F′ ≡ F(x′). Notice that the first part of (17) is just a restatement of the well-known result about the variance of the mean of N Bernoulli trials. Expression (17) makes it clear that the m × m covariance matrix of the u_i's in (16), which we shall call Ω, exhibits a moderate amount of heteroskedasticity and a great deal of serial correlation. Both the standard deviation of u_i and the correlation between u_i and u_{i−1} are greatest when F_i = 0.5 and decline as F_i approaches 0 or 1.

Equation (16) can be rewritten using matrix notation as

$$v = Z\gamma + u, \qquad E(uu^\top) = \Omega, \qquad (18)$$

where Z is an m × L matrix, the columns of which are the regressors in (16). The GLS estimator of γ is

$$\hat\gamma = \bigl(Z^\top \Omega^{-1} Z\bigr)^{-1} Z^\top \Omega^{-1} v. \qquad (19)$$

The smoothed discrepancies are the fitted values Zγ̂ from (18), and the covariance matrix of these fitted values is

$$Z \bigl(Z^\top \Omega^{-1} Z\bigr)^{-1} Z^\top. \qquad (20)$$

Confidence bands can be constructed by using the square roots of the diagonal elements of (20).

It is easily verified that the inverse of Ω has non-zero entries only on the principal diagonal and the two adjacent diagonals. Specifically,

$$\Omega^{-1}_{ii} = N(F_{i+1} - F_i)^{-1} + N(F_i - F_{i-1})^{-1}, \quad \Omega^{-1}_{i,i+1} = -N(F_{i+1} - F_i)^{-1}, \quad \Omega^{-1}_{i,i-1} = -N(F_i - F_{i-1})^{-1}, \qquad (21)$$

for i = 1, …, m, and Ω⁻¹_{ij} = 0 for |i − j| > 1. In these formulae, F₀ = 0 and F_{m+1} = 1. In contrast to many familiar examples of GLS estimation, it is neither easy nor necessary to triangularize Ω⁻¹ in this case. Because Ω⁻¹ has non-zero elements only along three diagonals, it is not difficult to compute Z′Ω⁻¹Z and Z′Ω⁻¹v directly. For example, the (l, j)-th element of Z′Ω⁻¹Z is

$$\sum_{i=1}^{m} Z_{il} Z_{ij}\, \Omega^{-1}_{ii} \;+\; \sum_{i=2}^{m} Z_{i-1,l} Z_{ij}\, \Omega^{-1}_{i,i-1} \;+\; \sum_{i=1}^{m-1} Z_{i+1,l} Z_{ij}\, \Omega^{-1}_{i,i+1}, \qquad (22)$$

where the needed elements of Ω⁻¹ were defined in (21). Thus, by using (22), it is straightforward to compute the GLS estimates (19).
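Expressed in code, the smoother is compact. The sketch below forms Ω⁻¹ as a dense matrix for clarity (at m ≈ 215 this is cheap), although the point of (21)-(22) is that even that is unnecessary; all names are ours, and in a first stage one can simply pass F = grid, as discussed next.

```python
# Sketch of the GLS smoother: build the tridiagonal Omega^{-1} of (21) and
# compute gamma_hat as in (19); returns fitted values and band half-widths.
import numpy as np

def gls_smooth(v, Z, F, N):
    Fpad = np.concatenate([[0.0], F, [1.0]])   # F_0 = 0, F_{m+1} = 1
    a = 1.0 / np.diff(Fpad)                     # a_i = (F_{i+1} - F_i)^{-1}
    Oinv = (np.diag(N * (a[1:] + a[:-1]))       # diagonal of (21)
            - np.diag(N * a[1:-1], k=1)         # superdiagonal
            - np.diag(N * a[1:-1], k=-1))       # subdiagonal
    ZtOZ = Z.T @ Oinv @ Z
    gamma = np.linalg.solve(ZtOZ, Z.T @ Oinv @ v)   # (19)
    cov = Z @ np.linalg.inv(ZtOZ) @ Z.T             # (20)
    return Z @ gamma, np.sqrt(np.diag(cov))
```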

As is generally the case, true GLS estimation is not feasible here. If the test statistic being studied is well-behaved, however, F(x_i) will be close to x_i for all x_i, and it will be reasonable to use x_i instead of the unknown F_i in (21). This will yield approximate GLS estimates. If the test statistic is not so well-behaved, it is natural to use a two-stage procedure. In the first stage, the approximate GLS estimates are obtained. In the second stage, the unknown F_i's in (21) are replaced by F̄_i ≡ x_i + v̄_i, where v̄_i denotes the fitted values from the approximate GLS procedure. Note that whatever values are used to estimate the F_i's must be positive, as must be the estimates of F_i − F_{i−1}. Since these conditions may not always be satisfied by the F̄_i's, it may be necessary to modify them slightly before computing the feasible GLS estimates and the final estimates of the F_i's.

The loglikelihood function associated with GLS estimation of (18) is

$$-\frac{m}{2}\log(2\pi) \;+\; \frac{1}{2}\log\bigl|\Omega^{-1}\bigr| \;-\; \frac{1}{2}(v - Z\gamma)^\top \Omega^{-1} (v - Z\gamma). \qquad (23)$$

The determinant of Ω⁻¹ which appears in (23) is

$$\bigl|\Omega^{-1}\bigr| = N^m \Bigl(\prod_{i=0}^{m} a_i\Bigr) \Bigl(\sum_{j=0}^{m} 1/a_j\Bigr), \qquad (24)$$

where a_i ≡ (F_{i+1} − F_i)⁻¹, for i = 0, …, m, and, as before, F₀ = 0 and F_{m+1} = 1.

We have not yet said anything about how to specify the regression function in (16), that is, the matrix Z. One obvious approach is to use powers of x_i as regressors. Another is to use the functions sin(lπx_i) for l = 1, 2, 3, …, and no constant term. The advantage of the latter approach is that sin(0) = sin(lπ) = 0, so that the approximation, like v_i itself, is constrained to equal zero at x_i = 0 and x_i = 1. However, this may not always be an advantage. If a test over-rejects severely, F_i − x_i may be large even for x_i near zero, and it may be hard for a function that equals zero at x_i = 0 to fit well with a reasonable number of terms.

For a given set of regressors, the choice of L can be made in various ways. We have chosen it to maximize the Akaike Information Criterion (i.e., the value of the loglikelihood function (23) minus L). It is important to make sure that (16) fits satisfactorily, as it may not if Z has been chosen poorly. One simple approach is to calculate the GLS equivalent of the regression standard error,

$$s \equiv \Bigl(\frac{1}{m - L}\, \tilde u^\top \Omega^{-1} \tilde u\Bigr)^{1/2},$$

where ũ is the vector of (feasible) GLS residuals from (18). If Z has been specified correctly, s should be approximately equal to unity.
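Under the same assumed names as the previous sketch, this specification check is a one-liner:

```python
# Sketch: GLS analogue of the regression standard error; should be near 1
# if the smoothing regression (16) is well specified. u_gls are feasible
# GLS residuals, Oinv the estimated Omega^{-1}, L the number of regressors.
import numpy as np

def gls_s(u_gls, Oinv, L):
    m = len(u_gls)
    return np.sqrt((u_gls @ Oinv @ u_gls) / (m - L))
```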

In the previous section, we briefly discussed Kolmogorov-Smirnov tests. A test that should usually be much more powerful than a KS test is a likelihood ratio test for γ = 0. Such a test may easily be computed from (23). As an example, for the DLR test for n = 500 and k = 4, the correctly computed KS test, based on (14), is 0.0185, and the approximate one, based on (15), is 0.0184. These tests are not significant (P = .065 for the correct one). In contrast, the LR test for the smoothing regression with trigonometric regressors is 21.24 with 5 degrees of freedom, which is highly significant (P = .0007). Of course, this P value is really too low, since the data were used to select both the form of the regression and the number of regressors. Nevertheless, since the LR test for the smoothing regression with polynomial regressors is 18.14, also with 5 degrees of freedom (P = .0028), there does seem to be fairly strong evidence against the hypothesis that the P value discrepancies can be explained by experimental randomness in this case.

Figure 6 illustrates the smoothing procedure we have just described for the case of the DLR test with n = 200 and k = 5. In this case, only four regressors, sin(lπx_i) for l = 1, …, 4, were needed to fit as well as we would expect (s = 0.98). A polynomial approximation with two more regressors worked equally well. The confidence bands were obtained by adding to and subtracting from the F̄_i's twice the square roots of the diagonal elements of (20). Because of the scale of the vertical axis, these bands appear to be quite wide. What is more interesting is that they are distinctly wider in the far left-hand part of the figure than in the far right-hand part. This is because the DLR test is tending to over-reject everywhere. For example, F̄(.05) = 0.0797 and F̄(.95) = 0.9622. This means that F_i(1 − F_i), to which the variance of F̂_i is assumed to be proportional, is equal to 0.0733 in the former case and 0.0364 in the latter. The fitted values F̄_i tend to be more precisely estimated where the variance of F̂_i is smaller.

Figure 7 shows two sets of truncated, smoothed P value discrepancy plots for the ES form of the IM test in the regression model. Because F̂(.001) was always substantially greater than zero, the trigonometric approximations did not work at all well, and the smoothing was done using polynomial approximations. It is clear that the ES form over-rejects for tests at small nominal sizes but under-rejects when the nominal size is large enough, just as it does for probit models. These tendencies become more pronounced as n falls and k rises, although the effects of reducing n and increasing k are by no means the same. In order to keep the figure readable, confidence bands are not shown. Since the standard error of F̄(.05) was between 0.0032 and 0.0039, the basic shape of these smoothed curves is certainly reliable, but not every wiggle should be taken seriously. The main advantage of smoothing here is that it makes the figures much easier to read.

We now turn our attention to the bootstrapped IM tests for the probit model. These results are based on 10,000 replications with 500 bootstrap samples per replication. Figure 8 shows smoothed P value discrepancy plots for both forms of the IM test, for k = 3, several sample sizes, and the two sets of parameter values used in Figure 4. Several things are evident from this figure. First of all, bootstrapping works remarkably well, but it does not work perfectly.

This is exactly what is predicted by the theoretical results of Davidson and MacKinnon (1996). The discrepancies between nominal and actual size are reasonably small, vastly smaller, especially for the OPG variant, than the discrepancies associated with using asymptotic P values. However, these discrepancies are definitely not zero. Even for the ES variant when n = 200, the hypothesis that the discrepancies are zero can be convincingly rejected. Secondly, the bootstrapped P values for the ES variant tend to be a bit too small, while, with one notable exception, those for the OPG variant tend to be somewhat too big. Finally, parameter values clearly matter, which again confirms the results of Davidson and MacKinnon (1996). Because of this, there is no guarantee that bootstrapping will always work as well in practice as it seems to in this example.

5 Size-Power Curves

It is often desirable to compare the power of alternative test statistics, but this can be difficult to do if all the tests do not have the correct size. Suppose we perform a Monte Carlo experiment in which the data are generated by a DGP belonging to the alternative hypothesis. The test statistics of interest are calculated for each replication, and corresponding P values are obtained. If the EDFs of these P values are plotted, the result will not be very useful, since we will be plotting power against nominal test size. Unfortunately, this is what is often implicitly done when test power is reported in a table.

In order to plot power against true size, we need to perform two experiments, preferably using the same sequence of random numbers. In the first experiment, the null hypothesis holds, and in the second it does not. Let the points on the two approximate EDFs be denoted F̂(x) and F̂*(x), respectively. As before, these are to be evaluated at a prechosen set of points x_i, i = 1, …, m. As we have seen, F(x) is the probability of getting a nominal P value less than x under the null. Similarly, F*(x) is the probability of getting a nominal P value less than x under the alternative. Tracing the locus of points (F(x), F*(x)) inside the unit square as x varies from 0 to 1 thus generates a size-power curve on a correct size-adjusted basis. Plotting the points (F̂(x_i), F̂*(x_i)), including the points (0, 0) and (1, 1), does exactly the same thing, except of course for experimental error. By using the same set of random numbers in both experiments, we can reduce experimental error, since the correlation between F̂(x_i) − F(x_i) and F̂*(x_i) − F*(x_i) will normally be quite high.

Since drawing a size-power curve requires results from two matched experiments, we need to use two DGPs that are related to each other. Just how they are related will generally affect the results. The exception is for test statistics that are pivotal, since their distribution will be the same for any set of parameter values that satisfies the null hypothesis. Except in this special case, it is important to choose the parameters of the null DGP so that they are as similar as possible to those of the non-null DGP. Otherwise, it does not make much sense to use F̂(x) for the size adjustment of F̂*(x).
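Since the curve is just one EDF plotted against another, the construction is short. A hedged sketch, with `p_null` and `p_alt` standing in for P values from the two matched experiments:

```python
# Sketch: a size-power curve traced by plotting the EDF of P values under
# the alternative against the EDF under the null at the same grid points.
import numpy as np
import matplotlib.pyplot as plt

def size_power_curve(p_null, p_alt, grid):
    F0 = np.searchsorted(np.sort(p_null), grid, side="right") / len(p_null)
    F1 = np.searchsorted(np.sort(p_alt), grid, side="right") / len(p_alt)
    # include the corners (0, 0) and (1, 1), as the text recommends
    return (np.concatenate([[0.0], F0, [1.0]]),
            np.concatenate([[0.0], F1, [1.0]]))

# x, y = size_power_curve(p_null, p_alt, grid)
# plt.plot(x, y); plt.plot([0, 1], [0, 1], "--"); plt.show()
```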

It is not entirely clear how to pick a null DGP that corresponds to any given non-null DGP. The best way to do so is probably to use the null DGP characterized by the pseudo-true values; see Horowitz (1994) and Davidson and MacKinnon (1996). For maximum likelihood estimation, the pseudo-true values are the parameter values associated with a null DGP which minimize the Kullback-Leibler Information Criterion, or KLIC, between the given non-null DGP and the set of null DGPs. They can be obtained by maximizing the expectation of the loglikelihood function under the null for data generated under the non-null DGP. In the case of the binary response model (6), the pseudo-true null may be obtained by maximizing

$$\sum_{t=1}^{n}\Bigl(p_t \log Q(X_t\beta) + (1 - p_t)\log\bigl(1 - Q(X_t\beta)\bigr)\Bigr), \qquad (25)$$

where p_t is the probability that y_t = 1 according to the non-null DGP. It is easy to maximize (25) using software similar to that used for estimating binary response models. However, because the maximum will depend on the regressors, we are constrained to use fixed regressors in the experiments.

The idea of plotting power against true size to obtain a size-power curve is not at all new. When we wrote the first version of this paper, we believed that the very simple method of doing so which we have advocated here was new, but in fact it is not; see Wilk and Gnanadesikan (1968). Thus the principal contribution of this section is to illustrate the use of size-power curves by presenting some results on the power of alternative forms of the IM test. We believe that these results are of some interest.

Figure 9 shows size-power curves for the OPG, DLR, and ES forms of the IM test with k = 2 for n = 100 and n = 200. The error terms in the non-null data generating process were generated as a mixture of normals with different variances, and they therefore displayed kurtosis. Since IM tests for linear regression models are pivotal, we did not have to use a pseudo-true null. Several results are immediate from this figure. The ES form has the greatest power for a given size of test, followed by the DLR form. The OPG form has far less power than the other two, and for n = 100 it actually has power less than its size for true sizes that are small enough to be interesting. Numerous other experiments, for which results are not shown, produced the same ordering of the three tests. The inferiority of the OPG form was always very marked.

These experimental results make it clear that, if one is going to use an IM test for a linear regression model, the best one to use is the bootstrapped ES form. It must have essentially the correct size (because the test is pivotal), and it will have better power than any of the other tests. This conclusion contradicts that of Davidson and MacKinnon (1992a), who failed to allow for the possibility of bootstrapping and who ignored the issue of power. Of course, it might well be better yet to test for heteroskedasticity, skewness, and kurtosis separately. The component pieces of the ES form (5), with P values determined by bootstrapping, could be used for this purpose.
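Returning to the pseudo-true null of (25): for the probit case, the maximization can be handed to a generic numerical optimizer. The sketch below is illustrative only; the function and variable names are our assumptions.

```python
# Sketch: pseudo-true null parameters for a probit null, found by
# maximizing (25) given non-null probabilities p_t and fixed regressors X.
import numpy as np
from scipy import optimize, stats

def pseudo_true_beta(p, X):
    """Maximize (25) with Q = standard normal cdf; p[t] = P(y_t = 1)."""
    def neg_expected_loglik(beta):
        Q = np.clip(stats.norm.cdf(X @ beta), 1e-12, 1.0 - 1e-12)
        return -np.sum(p * np.log(Q) + (1.0 - p) * np.log(1.0 - Q))
    res = optimize.minimize(neg_expected_loglik, x0=np.zeros(X.shape[1]))
    return res.x
```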

There is one practical problem with drawing size-power curves by plotting F̂*(x_i) against F̂(x_i). For tests that under- or over-reject severely under the null, there may be a region of the size-power curve that is left out by a choice of values of x_i such as those given in (2) or (3). For instance, suppose that a test over-rejects severely for small sizes, as the OPG IM test does. Then, even if x_i is very small, there may be many replications under the null for which the realized P value is still smaller. As an example, for the OPG test in the regression model with n = 100 and k = 5, F̂(.001) = 0.576 and F̂*(.001) = 0.405. Therefore, if the size-power curve were plotted with x₁ = .001, there would be a long straight segment extending from (0, 0) to (0.576, 0.405). Such a straight segment would bear clear witness to the gross over-rejection of which the OPG IM test is guilty, but it would also bear witness to a severe lack of detail in the depiction of how the test behaves. It could well be argued that tests which behave very badly under the null are not of much interest, so that this is not a serious problem. In any case, the problem is not difficult to solve. We simply have to make sure that the x_i's include enough very small numbers. Experience suggests that adding the following 18 points to (3) will produce reasonably good results even in extreme cases:

$$x_i = 1\times 10^{-8},\; 2\times 10^{-8},\; 5\times 10^{-8},\; \ldots,\; 1\times 10^{-3},\; 2\times 10^{-3},\; 5\times 10^{-3}.$$

In the remainder of this section, we discuss size-power curves for IM tests in probit models. These turn out to be very interesting indeed. The data are generated by the model

$$E(y_t \mid \Omega_t) = \Phi\!\left(\frac{\beta_0 + X_{t1} + X_{t2}}{\exp(0.5 X_{t1} + 0.5 X_{t2})}\right), \qquad (26)$$

where β₀ = 0 or β₀ = 1, and we used fixed regressors. This model is a special case of (13), and it is similar to one of the alternatives used by Horowitz (1994). The pseudo-true null was obtained by maximizing (25), with Φ(·) replacing Q(·), when the probabilities were generated by (26). We performed eight experiments, all with k = 3, for n = 50, n = 100, n = 200, and n = 400 for each of β₀ = 0 and β₀ = 1. We also ran a few additional experiments with k = 2 to verify that the number of regressors did not affect the principal conclusions.

It is not possible to graph the results for all eight experiments in one figure, but Figure 10 does show all the interesting results. One striking result is that the relative performance of the ES and OPG tests, based on asymptotic P values, depends greatly on β₀. ES performs very much better than OPG when β₀ = 1, but when β₀ = 0 it is impossible to rank the two tests. Another striking result is that bootstrapping can improve or worsen the performance of both tests. It is clear from Figure 10 that whether performance improves or worsens depends, in no obvious manner unfortunately, on sample size, the value of β₀, and test size. In general, the power of the ES test is little affected by bootstrapping, but that of the OPG test is, often quite significantly. In particular, for n = 100 and β₀ = 0, the bootstrap OPG test is very much better than the asymptotic form of the test, especially for small test size.

In this case, the bootstrap OPG test is in fact markedly more powerful than either form of the ES test. Horowitz (1994) also found that bootstrapping can improve the performance of the OPG test; he did not consider the ES test.

From these results and those discussed earlier, we conclude that the ES form of the IM test with bootstrap P values is the procedure of choice for probit models, just as it is for regression models. Although it occasionally may have slightly less power for a given true size than the ES form using asymptotic P values, or than the OPG form using bootstrap P values, it may equally well have more power, perhaps a lot more in the latter case. Any differences in power are outweighed by the fact that the bootstrap P values for the ES form are more accurate than the bootstrap P values for the OPG form, and substantially more accurate than the asymptotic ones for either form, especially when they are very small; compare Figures 4, 5, and 8. However, practitioners should be aware that bootstrapping will not produce entirely accurate test sizes for small samples; see Figure 8.

6 Conclusion

Monte Carlo experiments are a valuable tool for obtaining information about the properties of specification testing procedures in finite samples. However, the rich detail in the results they provide can be difficult to apprehend if presented in the usual tabular form. In this paper, we have discussed several graphical techniques that can make the principal results of an experiment immediately obvious. All of these techniques rely on the construction of an estimated cdf (EDF) of the nominal P values associated with some test statistic. From these, we can easily obtain a variety of diagrams, namely, P value plots, P value discrepancy plots (which may optionally be smoothed), and size-power curves.

It is worth noting that these techniques are of use not only for reporting experimental results in published work, but also as a research tool. The investigation of different testing procedures is greatly facilitated if one can immediately put the results of a Monte Carlo experiment on one's computer screen in easily understood form.

We have illustrated the techniques we advocate by presenting the results of a number of experiments concerning alternative forms of the information matrix test. We believe that these results, which are entirely presented in graphical form, are of considerable interest and provide more information in a more easily assimilated fashion than the tabular results which are typically presented.

References

Chesher, A. (1983). "The information matrix test: simplified calculation via a score test interpretation," Economics Letters, 13, 45-48.

Chesher, A., and R. Spady (1991). "Asymptotic expansions of the information matrix test statistic," Econometrica, 59, 787-815.

Davidson, R., and J. G. MacKinnon (1984). "Model specification tests based on artificial linear regressions," International Economic Review, 25, 485-502.

Davidson, R., and J. G. MacKinnon (1992a). "A new form of the information matrix test," Econometrica, 60, 145-157.

Davidson, R., and J. G. MacKinnon (1992b). "Regression-based methods for using control variates in Monte Carlo experiments," Journal of Econometrics, 54, 203-222.

Davidson, R., and J. G. MacKinnon (1993). Estimation and Inference in Econometrics, New York, Oxford University Press.

Davidson, R., and J. G. MacKinnon (1996). "The size and power of bootstrap tests," Queen's University IER Discussion Paper No. 932.

Fisher, N. J., E. Mammen, and J. S. Marron (1994). "Testing for multimodality," Computational Statistics and Data Analysis, 18, 499-512.

Hall, A. (1987). "The information matrix test for the linear model," Review of Economic Studies, 54, 257-263.

Hendry, D. F. (1984). "Monte Carlo experimentation in econometrics," Ch. 16 in Handbook of Econometrics, Vol. II, eds. Z. Griliches and M. D. Intriligator, Amsterdam, North-Holland.

Horowitz, J. L. (1994). "Bootstrap-based critical values for the information matrix test," Journal of Econometrics, 61, 395-411.

Lancaster, T. (1984). "The covariance matrix of the information matrix test," Econometrica, 52, 1051-1053.

Mammen, E. (1992). When Does Bootstrap Work?, New York, Springer-Verlag.

Orme, C. (1988). "The calculation of the information matrix test for binary data models," The Manchester School, 56, 370-376.

Taylor, L. W. (1987). "The size bias of White's information matrix test," Economics Letters, 24, 63-67.

West, K. D., and D. W. Wilcox (1995). "A comparison of alternative instrumental variables estimators of a dynamic linear regression model," Journal of Business and Economic Statistics, forthcoming.

White, H. (1982). "Maximum likelihood estimation of misspecified models," Econometrica, 50, 1-26.

Wilk, M. B., and R. Gnanadesikan (1968). "Probability plotting methods for the analysis of data," Biometrika, 55, 1-17.

[Figure 1. P value plots for IM tests, regression model, n = 100, k = 2. Axes: nominal size (horizontal) against actual size (vertical), both from 0 to 1. Curves shown: OPG, DLR, ES, and the 45° line; the OPG curve passes through the point (0.05, 0.497).]

[Figure 2. P value plots for OPG IM tests, regression model, with the horizontal axis truncated at 0.30. Panel a: k = 2, for n = 100, 200, 500, and 1000. Panel b: n = 200, for k = 2, 3, 4, 5, and 6. Each panel includes the 45° line.]