THE BEHAVIOUR OF LINEAR MODEL SELECTION TESTS UNDER GLOBALLY NON-NESTED HYPOTHESES


Sankhyā: The Indian Journal of Statistics, 2002, Volume 64, Series A, Pt. 1, pp. 109-138

THE BEHAVIOUR OF LINEAR MODEL SELECTION TESTS UNDER GLOBALLY NON-NESTED HYPOTHESES

By MITCHELL R. WATNIK and WESLEY O. JOHNSON
University of California, Davis

SUMMARY. We consider the asymptotic behaviour of tests for non-nested linear model selection under the alternative hypothesis. We consider three testing procedures: the J-test (cf. Davidson and MacKinnon, 1981), the JA-test (cf. Fisher and McAleer, 1981), and the modified Cox test (cf. Godfrey and Pesaran, 1983). All are shown to have positive means under the true alternative. Thus, for discriminating between models, they should be treated as one-sided tests. We derive asymptotic distributions for slightly modified versions of these statistics under the alternative in order to obtain relative efficiencies. The modified versions are selected since they have greater power than the original test statistics. Both algebraic and theoretical relationships among the tests are indicated.

1. Introduction

Choosing between non-nested linear regression models is an important practical problem in statistics. For example, suppose there are two schools of thought regarding a certain biological, economic or sociological phenomenon. Each school proceeds to analyze particular data accordingly, resulting in models that overlap to a degree but are still separate, i.e., non-nested. Alternatively, the models may have resulted from different model selection strategies, e.g., C_p (cf. Mallows, 1973) or the corrected Akaike Information Criterion (cf. Hurvich and Tsai, 1989). There is a clear interest in choosing between the two models. The problem of choosing between non-nested models traces back to Cox (1961), who formulated a testing procedure based on a likelihood ratio approach.
Under appropriate conditions, differing variants of his procedure for non-nested linear models have well-behaved asymptotic properties under the null and local alternative hypotheses (cf. Dastoor and McAleer, 1989; Pesaran, 1982, 1987; Watnik, Johnson and Bedrick, 2001).

Paper received October 2000; revised July 2001.
AMS (1991) subject classifications. Primary 62J05, 62F05; secondary 62J15, 62F03.
Key words and phrases. Asymptotic relative efficiency; J-test; JA-test; multiple correlation; power; regression; separate hypotheses.

In the significance testing arena (cf. Cox, 1962), one does not assume that the two models make up the entire space of possibilities; the alternative is simply left unspecified. Here, unlike Dastoor and McAleer (1989), for example, we are interested in the discrimination problem (Cox, 1962); i.e., determining which of the two hypotheses is better (Efron, 1984). McAleer (1995) notes that, although this is not the way non-nested tests were derived, this is typically the way they are employed. Unlike the usual nested situation, it is possible to reject both hypotheses when they are non-nested. To be specific, if the two hypotheses are referred to as H_X and H_Z, one must calculate the test statistic when H_X is the null and H_Z is the alternative, as well as when H_Z is the null and H_X is the alternative. With each calculation, the practitioner must consider whether or not to reject the current null hypothesis. In the non-nested situation, then, four conclusions are possible (Green, 1971): (1) reject both hypotheses; (2) reject H_X but not H_Z; (3) reject H_Z but not H_X; and (4) do not reject either hypothesis. Pesaran (1974) defines size and power for non-nested tests. Size, α, is the probability, calculated under H_X, of rejecting the null hypothesis when H_X is the null. Power, π_n, is defined to be the probability of correctly identifying the true model; namely, the probability under H_X of simultaneously (i) not rejecting H_X and (ii) rejecting H_Z. By definition, then, power in the non-nested situation is less than the power for a comparable test of nested hypotheses, π_n^*, which in our case would simply be the probability under H_X of rejecting the null hypothesis H_Z. The last point is important, because standard definitions of efficiency, such as the Bahadur efficiency (Bahadur, 1960) and the Hájek and Šidák (1967, p. 267) relative efficiency, depend on the definition of power in the nested situation.
We note, however, that π_n is approximately π_n^* − α, provided the chance of rejecting the false null, conditional on having rejected the true null, is nearly one. Thus, with nested power calculations, we need only make a simple adjustment to get the non-nested power under this very plausible assumption, which is supported by empirical work. Cox (1961, 1962) considers the significance testing setting and uses it to motivate the idea of artificial nesting, wherein one creates an artificial supermodel and tests hypotheses about subsets of it. This is the tactic employed in both the Davidson and MacKinnon (1981) development of the popular J-test and the Fisher and McAleer (1981) JA-test. Our results (cf. section 3) indicate that, in the discrimination setting, the above procedures should result in one-sided tests. Non-nested tests have often been considered as significance tests and thus treated as two-sided (cf. Godfrey, 1998).

The primary purpose of this paper is to make asymptotic power comparisons for the modified Cox test of Godfrey and Pesaran (1983), the J-test, the JA-test, and some variations of these three procedures. Towards that goal, we also give a way to estimate the power of non-nested model selection tests when the true parameter vector and variance can be consistently estimated. In addition, we give a method for theoretically judging which procedure is the most powerful by using measures of asymptotic relative efficiency. Although the J-test usually stands out as preferable in large samples, it has been established that this test may not hold size in moderate samples, especially when the dimensions of the two models are different (Godfrey, 1998; McAleer, 1995; Watnik, Johnson and Bedrick, 2001). Pesaran (1984) compares the Bahadur (1960) asymptotic relative efficiencies of some Cox-type testing procedures against each other and against the J-test using the two-sided versions, i.e., under the significance testing setting. Pesaran (1982) considers the power of the J-test against a Cox-type test for locally non-nested hypotheses, using both asymptotic and Monte Carlo methods. Watnik et al. (2001) give Monte Carlo results for the three aforementioned test procedures. Godfrey (1998) suggests a promising bootstrap J-test to correct for the J-test size problem, but asymptotic theory for this procedure is beyond the scope of this article. Other approaches to the non-nested testing problem in various settings have been discussed by Graybill and Iyer (1994, pp. 309-315), Efron (1984), Efron and Tibshirani (1993, pp. 190-198) and Victoria-Feser (1997). In section 2, we develop notation, introduce the models, and give preliminary results. Test statistics are introduced and asymptotic properties given in section 3, with technical details given in the appendix. Section 4 discusses the asymptotic relative efficiency comparisons for the testing procedures.
Concluding remarks are given in section 5.

2. Background

Throughout this paper, the model generating the n × 1 vector Y will be assumed to be one of two candidate models. First consider

Y = Vδ + Xβ + ε_x,    (2.1)

where V is an n × o matrix; X is n × p; and ε_x1, ..., ε_xn are independently and identically distributed with mean 0 and finite variance σ_x^2 > 0. The rank of X is defined to be p throughout. If model (2.1) generates Y, then we assume that δ_1, ..., δ_o, β_1, ..., β_p are all non-zero. As a competitor, the model

Y = Vδ + Zγ + ε_z    (2.2)

is suggested; Z is an n × q matrix; V is common to both models; and the ε_zi are independently and identically distributed with mean 0 and variance σ_z^2 for i = 1, ..., n. The rank of Z is q. If model (2.2) generates Y, then we assume that δ_1, ..., δ_o, γ_1, ..., γ_q are all non-zero. We refer to matrices V, X, Z, etc. and their respective column spaces (cf. Christensen, 1987, p. 25) interchangeably in order to ease the notational burden. We assume that the n × 1 vector of ones, e, is in V. The column vectors in X are linearly independent of those in Z and vice versa. Furthermore, neither X nor Z may be the null set, as this would result in nested models. For ease of tractability, and without loss of generality, we assume

V^T X = 0,  V^T Z = 0.    (2.3)

Thus the vector δ is the same in models (2.1) and (2.2). Otherwise, one can reparameterize by projecting both X and Z onto V^⊥, the orthogonal complement of V, in order to satisfy (2.3) without altering the models. Equations (2.3) allow us to consider the contributions due to X and Z individually; i.e., outside of whatever V already adds to the model. We develop asymptotic properties for two statistics that are key to our later results. Our inferences are conditional upon X and Z, and we require

C_XX ≡ lim X^T X/n,  C_ZZ ≡ lim Z^T Z/n,  C_XZ = C_ZX^T ≡ lim X^T Z/n,    (2.4)

where C_XX and C_ZZ are finite, positive definite, symmetric matrices. In fact, we assume that X^T X/n = C_XX, etc. for large enough n. We also require the following standard assumptions (cf. McAleer, 1995) to hold for large n:

Assumptions 2.1. a. The design matrices X and Z are completely separate (cf. Cox, 1961); i.e., no linear combination of columns in X lies in the column space of Z and vice-versa, or X ∩ Z = {0}. b. Xβ does not lie in Z. c. X^T X and Z^T Z are non-singular.

Kent (1986) discusses the implications of violating these assumptions. We address the consequences of violating Assumption 2.1b in section 3.3.
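The reparameterization behind (2.3) can be sketched numerically: replace X and Z by their projections onto the orthogonal complement of V. The designs below are synthetic illustrations (not from the paper), and numpy is assumed.

```python
import numpy as np

# Sketch of the reparameterization behind (2.3): replace X and Z by their
# projections onto the orthogonal complement of V, so that V'X = 0 and V'Z = 0.
# The matrices below are illustrative random designs, not from the paper.
rng = np.random.default_rng(0)
n, o, p, q = 50, 2, 3, 4
V = np.column_stack([np.ones(n), rng.normal(size=(n, o - 1))])  # includes the vector of ones e
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))

P_V = V @ np.linalg.solve(V.T @ V, V.T)   # projection matrix onto the column space of V
X_perp = X - P_V @ X                      # projection of X onto the orthogonal complement of V
Z_perp = Z - P_V @ Z

# After the reparameterization the orthogonality conditions (2.3) hold.
assert np.allclose(V.T @ X_perp, 0)
assert np.allclose(V.T @ Z_perp, 0)
```

As the text notes, this leaves the two models unaltered while making the X and Z contributions orthogonal to whatever V already explains.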

Pesaran (1987) defines local alternatives to mean that (2.1) and (2.2) can be brought arbitrarily close to one another in our setting. His definition of global alternatives is that the two models cannot be brought arbitrarily close; i.e., they are distinct. With the parameterization that Pesaran (1987) uses, he asserts that linear models cannot be globally non-nested. However, in our parameterization, and because of the assumptions that the β_i (i = 1, ..., p) and γ_j (j = 1, ..., q) are all non-zero, we ensure that the models are globally non-nested and are not pair-wise local alternatives. Ours separates the common part of the models (V) from the distinct parts (X and Z), while his does not. We believe that the global setting is a more natural way to consider linear alternatives. Now, define the set of linear statistics

W_X ≡ n^{-1/2}(X^T X)^{-1/2} X^T Y,  W_Z ≡ n^{-1/2}(Z^T Z)^{-1/2} Z^T Y.

Under (2.1), and with large n,

E(W_X) = C_XX^{1/2}β ≡ µ_x,  E(W_Z) = C_ZZ^{-1/2}C_ZX β ≡ µ_z.    (2.5)

Further, define

A ≡ (Z^T Z)^{-1/2} Z^T X (X^T X)^{-1/2},  R ≡ C_ZZ^{-1/2} C_ZX C_XX^{-1/2},  M ≡ R^T R.    (2.6)

Note that A = corr(n^{1/2}W_Z, n^{1/2}W_X), and that X and Z are centered since e ∈ V and both are projected onto V^⊥, due to (2.3). We use A to denote the computational formula for the correlation matrix, while R represents the limiting correlation in the theoretical results. Notably, A^T A is also the matrix from which the sample canonical correlation coefficients are obtained for the two samples of column variables associated with X and Z, while M is the population counterpart. An important property of M is that its characteristic roots are between 0 and 1, inclusive; hence M is nonnegative definite. We note in the Appendix to Section 2 that W_X and W_Z are asymptotically jointly normal under mild conditions. Recalling the definitions of M from (2.6) and µ_x from (2.5), we define

ξ_i ≡ µ_x^T M^i µ_x,  i = 0, 1, ..., 5;  λ_i ≡ ξ_{i-1} − ξ_i = µ_x^T M^{i-1}(I − M)µ_x,  i = 1, ..., 5;    (2.7)

where M^0 ≡ I by convention.
We note that λ_i ≥ 0 and that λ_i − λ_{i+1} = µ_x^T (I − M)M^{i-1}(I − M)µ_x ≥ 0, where the inequality holds because M = R^T R is symmetric with roots between 0 and 1. It follows that

ξ_0 ≥ ξ_1 ≥ ⋯ ≥ ξ_5 ≥ 0,  and  λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4 ≥ λ_5 ≥ 0.    (2.8)
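The quantities in (2.6)-(2.8) can be sketched numerically. The designs and the vector µ below are arbitrary illustrations, with M taken as the sample analogue A^T A; the assertions check the monotonicity in (2.8) and the Cauchy-Schwarz-type inequality λ_2^2 ≤ λ_1λ_3 used later.

```python
import numpy as np

# Numerical sketch of (2.6)-(2.8): A is built as in (2.6) from synthetic designs,
# M is taken as A'A (the sample analogue of R'R), and xi_i = mu'M^i mu,
# lam_i = xi_{i-1} - xi_i for an arbitrary illustrative mu.
rng = np.random.default_rng(1)
n, p, q = 60, 3, 4
X = rng.normal(size=(n, p)); Z = rng.normal(size=(n, q))

def inv_sqrt(S):
    """Symmetric inverse square root via the spectral decomposition."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(w ** -0.5) @ U.T

A = inv_sqrt(Z.T @ Z) @ (Z.T @ X) @ inv_sqrt(X.T @ X)
M = A.T @ A            # roots of M are squared sample canonical correlations, in [0, 1]
mu = rng.normal(size=p)

xi = [mu @ np.linalg.matrix_power(M, i) @ mu for i in range(6)]
lam = [xi[i - 1] - xi[i] for i in range(1, 6)]   # lam[0] is lambda_1, ..., lam[4] is lambda_5

# (2.8): both sequences are non-negative and non-increasing ...
assert all(x >= -1e-9 for x in xi) and all(l >= -1e-9 for l in lam)
assert all(xi[i] >= xi[i + 1] - 1e-9 for i in range(5))
assert all(lam[i] >= lam[i + 1] - 1e-9 for i in range(4))
# ... and the Cauchy-Schwarz-type inequality lambda_2^2 <= lambda_1 * lambda_3 holds.
assert lam[1] ** 2 <= lam[0] * lam[2] + 1e-9
```

The inequalities hold for any M with roots in [0, 1], which is what the spectral argument following (2.8) exploits.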

We note that, under Assumptions 2.1, statement (2.8) becomes λ_1 > λ_2 > λ_3 > λ_4 > λ_5 > 0. The following proposition is useful for establishing later results.

Proposition 2.1. (a) λ_2^2 ≤ λ_1λ_3; (b) λ_3^2 ≤ λ_2λ_4; (c) λ_3^2 ≤ λ_1λ_5; (d) λ_2λ_3 ≤ λ_1λ_4.    (2.9)

We prove claims (a) and (d) in the Appendix to Section 2, and the others follow similarly. Finally, we define a consistent estimator of the variance. Following Pesaran (1974), under (2.1),

s_x^2 ≡ ||P_{(V,X)}^⊥ Y||^2/(n − p − o) → σ_x^2,  s_z^2 ≡ ||P_{(V,Z)}^⊥ Y||^2/(n − q − o) → σ_x^2 + ||C_XX^{1/2}β||^2 − ||C_ZZ^{-1/2}C_ZX β||^2,

in probability as n → ∞, where P_X^⊥ ≡ I − P_X, P_{(·)} is the projection matrix onto the indicated column space, and ||·|| denotes the Euclidean norm of a vector. Similarly, s_z^2 → σ_z^2 and s_x^2 → σ_z^2 + ||C_ZZ^{1/2}γ||^2 − ||C_XX^{-1/2}C_XZ γ||^2, in probability under (2.2). Since the true model is unknown a priori, we use Royston and Thompson's (1995) estimate of variance: σ̂^2 ≡ min(s_x^2, s_z^2). Then σ̂^2 consistently estimates σ_x^2 under (2.1) and σ_z^2 under (2.2). Equivalently, σ̂^2 consistently estimates σ^2, the variance of the true model, regardless of whether σ^2 is σ_x^2 or σ_z^2. We use σ̂ instead of either s_x or s_z to estimate σ because it increases the power of non-nested tests while not changing the asymptotic distributions under the null.

3. The Test Statistics

We seek to evaluate the performance of various non-nested testing procedures. We first present the test statistics associated with the J-test (Davidson and MacKinnon, 1981), the JA-test (Fisher and McAleer, 1981), and the modified Cox test (Godfrey and Pesaran, 1983). All of the test statistics considered herein have asymptotic standard normal distributions under the null hypothesis. We will derive the asymptotic distributions under the alternative.

3.1. The null hypothesis (2.2). The version of the test statistic for the J-test developed by Royston and Thompson (1995, equation 4.5), which they

argue has closer to nominal size under (2.2) than does the original Davidson and MacKinnon (1981) proposal, is

J_Z ≡ (||P_X Y||^2 − Y^T P_X P_Z Y) / (n^{1/2} σ̂ {||P_X Y||^2 − ||P_Z P_X Y||^2}^{1/2}).    (3.1)

In the Appendix to Section 3, we present this and the other statistics in this section in a form which facilitates theoretical arguments. We consider the form of (3.1) to be more practical for calculations. The JA-test has test statistic (cf. Table 1 of Fisher, 1983)

JA_Z ≡ (Y^T P_X P_Z Y − ||P_X P_Z Y||^2) / (n^{1/2} σ̂ {||P_X P_Z Y||^2 − ||P_Z P_X P_Z Y||^2}^{1/2}).    (3.2)

Fisher (1983) shows that, when the appropriate term is substituted for σ̂ above, JA_Z has an exact t distribution under (2.2) when the errors are normally distributed. Our version of the modified Cox test statistic of Godfrey and Pesaran (1983) is the negative of the original, where

G_Z ≡ [||P_X Y||^2 − ||P_X P_Z Y||^2 − s_z^2{p − tr(P_Z P_X)}] / (n^{1/2} {4s_z^2(||P_X P_Z Y||^2 − ||P_Z P_X P_Z Y||^2) + 2s_z^4 tr(B^2)}^{1/2}),

tr(B^2) ≡ p − tr{(P_X P_Z)^2} − {p − tr(P_X P_Z)}^2/(n − q).

The numerator of this statistic is identical to expression (89) of Cox (1961), and it has expectation zero under H_Z. We consider the simpler test statistic

G̃_Z ≡ (||P_X Y||^2 − ||P_X P_Z Y||^2) / {4nσ̂^2 (||P_X P_Z Y||^2 − ||P_Z P_X P_Z Y||^2)}^{1/2}.    (3.3)

We note that G_Z − G̃_Z = o_p(1) under (2.2) (cf. Godfrey and Pesaran, 1983, p. 138).

Remarks. 1. The statistics J_X, JA_X, and G̃_X are defined analogously; namely, by exchanging Z for X and vice-versa in equations (3.1), (3.2), and (3.3).

2. The statistics J_Z, JA_Z, and G̃_Z are, with the exception of using σ̂ in place of their respective estimates of the standard deviation, exactly the same as those given in Davidson and MacKinnon (1981) (cf. Table 1 of Fisher, 1983), Fisher and McAleer (1981), and Godfrey and Pesaran (1983), respectively. Our use of σ̂ to estimate the standard deviation was done solely to increase the power of all of the tests (cf. section 2). Had we used s_z in place of σ̂ in (3.3), it would have been the case that G_Z − G̃_Z = o_p(1) under (2.1) (cf. Godfrey and Pesaran, 1983) and that G̃_Z would be equivalent to the test statistic given in equation (25) of Fisher and McAleer (1981) (cf. Table 1 of Fisher, 1983).

We define a notation for the denominators of the J and JA tests, respectively:

τ̂_JZ ≡ {σ̂^2 (||W_X||^2 − ||AW_X||^2)}^{1/2};  τ̂_JAZ ≡ {σ̂^2 (||A^T W_Z||^2 − ||AA^T W_Z||^2)}^{1/2}.    (3.4)

Note that the denominator of G̃_Z is proportional to the denominator of JA_Z. Using (2.7) and (A.1), we see that, under (2.1),

τ̂_JX →_p σ_x λ_2^{1/2},  τ̂_JAX →_p σ_x λ_2^{1/2},  τ̂_JZ →_p σ_x λ_1^{1/2},  τ̂_JAZ →_p σ_x λ_3^{1/2}.    (3.5)

This verifies the result of Fisher (1983), who notes that the denominators of J_X and JA_X are asymptotically equivalent under H_X. So, it may be interesting to investigate the properties of these statistics using the other's denominator, as Pesaran (1984, p. 251) notes that the efficiency of a Cox-type test depends on the choice of consistent estimate of the variance. We therefore define

F G̃_Z ≡ (||P_X Y||^2 − ||P_X P_Z Y||^2) / (2n τ̂_JZ);    (3.6)

F J_Z ≡ (||P_X Y||^2 − Y^T P_X P_Z Y) / (n τ̂_JAZ);    (3.7)

and

F JA_Z ≡ (Y^T P_X P_Z Y − ||P_X P_Z Y||^2) / (n τ̂_JZ).    (3.8)

We note the correspondence

F G̃_Z = (J_Z + F JA_Z)/2,  G̃_Z = (F J_Z + JA_Z)/2.    (3.9)
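The six statistics and the correspondence (3.9) can be checked numerically. The sketch below follows the displays (3.1)-(3.8) as presented here, computed directly from projection matrices; the data are illustrative, and V is omitted for brevity (an assumption of the sketch, with the residual degrees of freedom adjusted accordingly).

```python
import numpy as np

def proj(A):
    """Projection matrix onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# Sketch of the statistics (3.1)-(3.8); the designs and data are illustrative
# and V is omitted. The point checked here is the algebraic correspondence (3.9).
rng = np.random.default_rng(3)
n, p, q = 120, 2, 3
X = rng.normal(size=(n, p)); Z = rng.normal(size=(n, q))
Y = X @ np.ones(p) + rng.normal(size=n)          # generated under H_X
PX, PZ = proj(X), proj(Z)

a = Y @ PX @ Y                                   # ||P_X Y||^2
b = Y @ PX @ PZ @ Y                              # Y' P_X P_Z Y
c = Y @ PZ @ PX @ PZ @ Y                         # ||P_X P_Z Y||^2
d = Y @ PX @ PZ @ PX @ Y                         # ||P_Z P_X Y||^2
e = np.linalg.norm(PZ @ PX @ PZ @ Y) ** 2        # ||P_Z P_X P_Z Y||^2

sigma_hat = np.sqrt(min(Y @ (np.eye(n) - PX) @ Y / (n - p),
                        Y @ (np.eye(n) - PZ) @ Y / (n - q)))
den_J  = np.sqrt(n) * sigma_hat * np.sqrt(a - d)     # = n * tau_JZ
den_JA = np.sqrt(n) * sigma_hat * np.sqrt(c - e)     # = n * tau_JAZ

J_Z, FJA_Z, FG_Z = (a - b) / den_J,  (b - c) / den_J,  (a - c) / (2 * den_J)
FJ_Z, JA_Z, G_Z  = (a - b) / den_JA, (b - c) / den_JA, (a - c) / (2 * den_JA)

# The correspondence (3.9): FG = (J + FJA)/2 and G = (FJ + JA)/2.
assert np.isclose(FG_Z, (J_Z + FJA_Z) / 2)
assert np.isclose(G_Z, (FJ_Z + JA_Z) / 2)
```

The identities in (3.9) hold exactly here because the three Z-null statistics share the denominator nτ̂_JZ (and the other three share nτ̂_JAZ), while their numerators telescope.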

As a consequence of this, we thus expect the performance of F G̃_Z and G̃_Z to strike a balance between that for the corresponding two tests. We finally note that the statistics F G̃_X, F J_X, and F JA_X are defined by interchanging the roles of X and Z in equations (3.6), (3.7), and (3.8).

3.2. The alternative hypothesis (2.1). We now derive the distributions of J_Z, F J_Z, JA_Z, F JA_Z, G̃_Z, and F G̃_Z under (2.1). As the subscript indicates, (2.2) is considered to be the null. Using (2.5) and (2.7), it follows that, in probability and under (2.1),

J_Z → λ_1^{1/2}/σ,  F J_Z → λ_1/(σλ_3^{1/2}),  JA_Z → λ_2/(σλ_3^{1/2}),  F JA_Z → λ_2/(σλ_1^{1/2}),  G̃_Z → (λ_1 + λ_2)/(2σλ_3^{1/2}),  F G̃_Z → (λ_1 + λ_2)/(2σλ_1^{1/2}).    (3.10)

Under Assumptions 2.1, then, the asymptotic means of these test statistics are all strictly positive. Pesaran (1987) notes this about J, JA and G, but, since he considers local alternatives, he does not suggest using the statistics as one-sided tests. In the model discrimination setting, however, it follows directly that corresponding tests of hypotheses should be one-sided, and the null hypothesis should only be rejected if the observed test statistic is too large. We obtain the result

Proposition 3.2. Under (2.1), the assumptions for (A.1) and Assumptions 2.1, the following convergence results hold as n → ∞:

n^{1/2}(J_Z − λ_1^{1/2}/σ) → N(0, 1),

n^{1/2}(JA_Z − λ_2/(σλ_3^{1/2})) → N(0, σ_ja^2),

n^{1/2}(G̃_Z − (λ_1 + λ_2)/(2σλ_3^{1/2})) → N(0, σ_g^2),

n^{1/2}(F J_Z − λ_1/(σλ_3^{1/2})) → N(0, σ_fj^2),

n^{1/2}(F JA_Z − λ_2/(σλ_1^{1/2})) → N(0, σ_fja^2),

n^{1/2}(F G̃_Z − (λ_1 + λ_2)/(2σλ_1^{1/2})) → N(0, σ_fg^2),    (3.11)

where the asymptotic variances σ_ja^2, σ_g^2, σ_fj^2, σ_fja^2 and σ_fg^2 are functions of λ_1, ..., λ_5; their explicit forms are given in the Appendix to Section 3. The proof of Proposition 3.2 is in the Appendix to Section 3. Note that the asymptotic variances do not depend on β, while the means do. Also note that the asymptotic distributions under the alternative depend on the design matrices X and Z only through the canonical correlation matrix M. We have the following corollary to Proposition 3.2, which becomes important in the next section.

Corollary to Proposition 3.2. Under (2.1), the assumptions for (A.1) and Assumptions 2.1, the asymptotic variance of the JA test, σ_ja^2, is greater than or equal to 1. The proof of this corollary is in the Appendix to Section 3.

3.3. The case of X ⊥ Z. Although the situation with X ⊥ Z is excluded by Assumptions 2.1, it is a situation in which it should be easy to discriminate between the two models. So, when X is nearly perpendicular to Z, and the assumptions hold, one might expect high power from these test procedures. However, when X ⊥ Z, the JA-test has numerator 0. The denominators of both the JA-test and the modified Cox test, G̃_Z and G̃_X, are 0 in this situation as well (the statistics G_Z and G_X will have relatively small, but not 0, denominators). Fisher (1983, p. 10) notes the failure of the JA-test in this situation. In fact, when X is perpendicular to Z, F G̃ is the only one of these six statistics that does not necessarily have either the numerator or denominator tending to 0. To elaborate on this, we note that, when X ⊥ Z, ξ_i = 0 for i = 1, 2, ..., 5. So, λ_i = 0 for i = 2, ..., 5. The proofs of asymptotic normality for JA_Z and

G̃_Z will fail in the case where X ⊥ Z, as noted in the proofs in the Appendix to Section 3. However, when X ⊥ Z, the J-test also fails. This observation about the J-test is made in a footnote of Davidson and MacKinnon (1981). From equation (3.1), we see that, in our way of calculating the J-test, it becomes ||P_X Y||/(n^{1/2}σ̂) when X ⊥ Z. It is easy to see that this cannot be asymptotically normal with mean 0 in this case, because the statistic is necessarily non-negative. One obvious possibility is to instead consider the model E(Y) = Vδ + Xβ + Zγ and to use conventional F-tests for β = 0 and γ = 0.

4. Relative Efficiency

We proceed to calculate the asymptotic relative efficiencies for J, F J, JA, F JA, G̃ and F G̃. We define three measures of relative efficiency and then proceed to make comparisons. We conclude this section by giving an illustration of how to consistently estimate power in the general case.

4.1. Measures of Relative Efficiency. We first define the standard measure of relative efficiency for comparing one-sided tests. Suppose a statistic, say n^{1/2}S, is asymptotically standard normal under the null, and n^{1/2}(S − µ_a)/σ_a is asymptotically standard normal under the alternative. Then the nested asymptotic power of a size α test is 1 − Φ{(z_α − n^{1/2}µ_a)/σ_a}, where Φ(z_α) = 1 − α. Thus power is large provided µ_a is large and/or σ_a is small. Solving for the sample size that achieves power π_n^*, we obtain n = (z_α − σ_a z_{π*})^2/µ_a^2, provided z_α ≥ σ_a z_{π*}, where z_{π*} satisfies Φ(z_{π*}) = 1 − π_n^*. Then define the asymptotic relative efficiency of, say, J to F G̃ to be n_j/n_fg, the relative sample sizes for the two size α tests to both achieve power π_n^*. Then

e_S(J, F G̃) ≡ n_j/n_fg = {(z_α − σ_j z_{π*})/µ_j}^2 / {(z_α − σ_fg z_{π*})/µ_fg}^2,    (4.1)

where µ_fg is the mean for F G̃ under the alternative, etc. This measure of efficiency implies that J is superior to F G̃ if e_S(J, F G̃) < 1.
By (3.11) and (4.1), we see that relative efficiencies for any pair of tests will not depend on the length of the vector β. We may thus assume without loss of generality that ||β|| = 1 when making these calculations. In theory, this measure can be somewhat complicated, since it may depend on many inputs. Simplifications are obtained if we let α = .5 or if π_n^* → 1. In these instances,

e_S(J, F G̃) → e_HS(J, F G̃) = (µ_fg σ_j / (µ_j σ_fg))^2,    (4.2)

which is the Hájek and Šidák (1967, p. 267) measure of relative efficiency, which we will refer to as the HS criterion. On the other hand, if we let α → 0, or π_n^* = .5, we obtain

e_S(J, F G̃) → e_B(J, F G̃) = (µ_fg/µ_j)^2,    (4.3)

which is the Bahadur (1961) relative efficiency for comparing J to F G̃.

4.2. General conclusions about efficiencies. Using Proposition 3.2, we can order some of the statistics relative to these criteria. We begin with a comparison between JA and J under the condition π_n^* ≥ 0.5. We note that

e_S(J, JA) = {(z_α − z_{π*})/µ_j}^2 / {(z_α − σ_ja z_{π*})/µ_ja}^2.

Now, under the condition π_n^* ≥ 0.50, z_{π*} ≤ 0, which implies e_S(J, JA) ≤ (µ_ja/µ_j)^2 = e_B(J, JA), since, by the Corollary to Proposition 3.2, σ_ja ≥ 1. Therefore, e_S(J, JA) ≤ λ_2^2/(λ_1λ_3) ≤ 1, by Proposition 2.1. Hence J is superior to JA under the general criterion, as well as both the HS and Bahadur criteria. In the Appendix to Section 4, we establish that, according to the Bahadur criterion, F J ≽ G̃ ≽ J ≽ F G̃ ≽ F JA. So, according to the Bahadur criterion, F J is the theoretically optimal procedure to use, while F JA is theoretically worst. Considering only the three test statistics from the literature, we have established that G is preferable to J, which in turn is preferable to JA, using the Bahadur criterion. We shall see later that JA neither dominates nor is dominated by F G̃ in this context. Using HS efficiency, we have the following comparison between F G̃ and J:

e_HS(F G̃, J) = {2λ_1/(λ_1 + λ_2)}^2 σ_fg^2.    (4.4)

In the Appendix to Section 4, we show that e_HS(F G̃, J) ≤ 1, and thus F G̃ ≽ J under HS. Interestingly, the reverse was true according to Bahadur.
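The three efficiency measures of section 4.1 can be put into code. The helper names `n_required` and `e_S` are hypothetical, and the means and variances below are illustrative numbers, not the paper's values; the assertions check that e_S reduces to the HS and Bahadur limits as described above.

```python
from statistics import NormalDist

def n_required(alpha, pi_star, mu, sigma):
    """Sample size from section 4.1: n = (z_alpha - sigma*z_pi)^2 / mu^2,
    with Phi(z_alpha) = 1 - alpha and Phi(z_pi) = 1 - pi_star."""
    z = NormalDist().inv_cdf
    z_alpha, z_pi = z(1 - alpha), z(1 - pi_star)
    return ((z_alpha - sigma * z_pi) / mu) ** 2

def e_S(alpha, pi_star, mu_1, sigma_1, mu_2, sigma_2):
    """Relative efficiency of test 1 to test 2: n_1/n_2 (test 1 superior if < 1)."""
    return (n_required(alpha, pi_star, mu_1, sigma_1) /
            n_required(alpha, pi_star, mu_2, sigma_2))

# Illustrative means/standard deviations for two tests (not the paper's values).
mu_j, sigma_j, mu_fg, sigma_fg = 1.0, 1.0, 0.8, 0.6

e_HS = (mu_fg * sigma_j / (mu_j * sigma_fg)) ** 2    # (4.2), the Hajek-Sidak limit
e_B  = (mu_fg / mu_j) ** 2                           # (4.3), the Bahadur limit

# alpha = .5 reproduces the HS criterion; pi* = .5 reproduces the Bahadur criterion.
assert abs(e_S(0.5, 0.9, mu_j, sigma_j, mu_fg, sigma_fg) - e_HS) < 1e-9
assert abs(e_S(0.05, 0.5, mu_j, sigma_j, mu_fg, sigma_fg) - e_B) < 1e-9
```

Setting α = .5 makes z_α = 0, so only the (µ/σ) ratios matter, while π* = .5 makes z_{π*} = 0, so only the means matter, exactly as in (4.2) and (4.3).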

If we add the following assumptions

λ_3^2 = λ_2 λ_4 ;   λ_4^2 = λ_3 λ_5,    (4.5)

more conclusions under the HS criterion are possible. Equations (4.5) imply that λ_2 λ_5 = λ_3 λ_4. Note that equations (4.5) hold if the roots of M are equal or if rank(M) = 1, or, more generally, if the non-zero roots of M are equal. So, these assumptions hold in the situations when either p = 1 or q = 1, as we see in the next two subsections. In the Appendix to Section 4, we show that e_HS(G, J) ≥ 1 and that e_HS(G, FJ) ≤ 1. So, under the HS criterion and assumptions (4.5), we have the ordering FG ⪰ J ⪰ G ⪰ FJ. We do not compare FJA here, because, as we see in the succeeding subsections, no generalizations for FJA are possible. We do not compare JA to the others under HS because it is dominated by J.

4.3 Relative efficiencies; p = 1 Case. With p = 1, results are particularly simple. Keep in mind that we are testing H_Z versus H_X, where the rank of X is one and the rank of Z is arbitrary. We note that M is now the squared multiple correlation between X and Z. Then, from (2.7), we have λ_i = C_XX β^2 M^{i−1}(1 − M) for i = 1, ..., 5. It follows that assumptions (4.5) hold. Proposition 3.2 yields

μ_j = μ_ja = C_XX^{1/2} β (1 − M)^{1/2}/σ,
μ_g = C_XX^{1/2} β (1 + M)(1 − M)^{1/2}/(2Mσ),
μ_fj = C_XX^{1/2} β (1 − M)^{1/2}/(Mσ),
μ_fja = C_XX^{1/2} β M (1 − M)^{1/2}/σ,
μ_fg = C_XX^{1/2} β (1 + M)(1 − M)^{1/2}/(2σ),
σ_j^2 = σ_ja^2 = 1,
σ_g^2 = (1 + M + 3M^2 − M^3)/(4M^3),
σ_fj^2 = (1 + M − M^2)/M^3,
σ_fja^2 = M(1 − 3M + 3M^2),
σ_fg^2 = (1 + M − M^2 + 3M^3)/4.    (4.6)

This implies that, when p = 1, J and JA will be asymptotically equivalent, as they are under local alternatives (McAleer, 1995). For the remainder of this subsection, then, we will only state results for J. We observe from (4.6) that, with small M, J and FG will be more powerful than FJA. In addition, (4.1) and (4.6) imply that changes in size will have the most dramatic effect on G and FJ because of the presence of M in the denominators of their means. Note that the power of all

of the test statistics will be monotone and increasing in C_XX^{1/2} β for fixed M; more variability in the X values gives greater power. Also note that all relative efficiencies will depend on the model only through the value of M. We summarize our conclusions for the situations where α = 0.05 in Table 1. It is worth remembering that, as size decreases, e_S → e_B. In section 4.2, we saw that the Bahadur criterion had FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA. In this special case, then, the only new conclusion is that JA ⪰ FG.

Table 1. Orderings for tests with p = 1.

M                  π_n = 0.50                      π_n = 0.75
small (≈ 0.1)      FJ ⪰ G ⪰ J ⪰ JA ⪰ FG ⪰ FJA     J ⪰ JA ⪰ G ⪰ FJ ⪰ FG ⪰ FJA
moderate (≈ 0.5)   FJ ⪰ G ⪰ J ⪰ JA ⪰ FG ⪰ FJA     FJ ⪰ G ⪰ J ⪰ JA ⪰ FG ⪰ FJA
large (≈ 0.9)      FJ ⪰ G ⪰ J ⪰ JA ⪰ FG ⪰ FJA     FJ ⪰ G ⪰ J ⪰ JA ⪰ FG ⪰ FJA

M                  π_n = 0.95                      π_n = 1
small              J ⪰ JA ⪰ FG ⪰ G ⪰ FJ ⪰ FJA     FG ⪰ J ⪰ JA ⪰ G ⪰ FJ ⪰ FJA
moderate           J ⪰ JA ⪰ G ⪰ FJ ⪰ FG ⪰ FJA     FG ⪰ J ⪰ JA ⪰ G ⪰ FJ
large              J ⪰ JA ⪰ G ⪰ FJ ⪰ FG ⪰ FJA     FJA ⪰ FG ⪰ J ⪰ JA ⪰ G ⪰ FJ

When π_n = 1, we obtain the conclusions summarized in the lower righthand box of Table 1. In the previous subsection, we showed that FG ⪰ J ⪰ G ⪰ FJ according to this measure, i.e., the HS criterion. We now see that FJA is most efficient for large values of M, while it is least efficient for small values. This is why we were unable to make general conclusions about FJA for the HS criterion. Its place for moderate values of M depends on M. We view these results as theoretically interesting, but of limited value in practice since π_n must be arbitrarily close to one for them to hold. In Table 1 we also summarize the orderings for the situations when π_n = 0.75 and π_n = 0.95, with α = .05. When seeking larger power, J tends to be best, while FJ may be preferable when seeking moderate power. The orderings are almost identical to the Bahadur case with π_n = 0.75. Conclusions are practically identical with α = 0.10. The ordering tends towards that for the HS criterion with π_n = 0.95. The tests are practically equivalent with M > 0.6.
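The Bahadur column of Table 1 can be reproduced from the p = 1 means. A minimal sketch, using the expressions in (4.6) as reconstructed above (a larger asymptotic mean corresponds to a steeper Bahadur slope, hence greater efficiency):

```python
import math

def p1_means(M, Cxx=1.0, beta=1.0, sigma=1.0):
    """Asymptotic means for the p = 1 case (cf. (4.6), as reconstructed)."""
    s = math.sqrt(Cxx) * beta * math.sqrt(1 - M) / sigma
    return {'FJ': s / M, 'G': s * (1 + M) / (2 * M), 'J': s, 'JA': s,
            'FG': s * (1 + M) / 2, 'FJA': s * M}

for M in (0.1, 0.5, 0.9):
    mu = p1_means(M)
    order = sorted(mu, key=mu.get, reverse=True)
    print(M, order)
```

For every M in (0, 1) the ordering comes out FJ ⪰ G ⪰ J = JA ⪰ FG ⪰ FJA, matching the π_n = 0.50 column of Table 1, which coincides with the Bahadur ordering.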
The situation changes slightly with α = 0.10; FJA is technically most efficient, though by a practically insignificant margin, especially for M > 0.85. For values of π_n that are as large as 0.999, the orderings tend to be more like those for π_n = 0.50, 0.75, or 0.95 than for π_n = 1, so for realistic power and based on Table 1, it would appear that FJ, G, J and JA are preferable to FG and FJA. The test FJ is preferable for moderate power and J (= JA here) is preferable for higher power. The test G ranks either second or third.

We recall from (3.9) that FG is the average of J and FJA and G is the average of FJ and JA. It is thus not surprising that, when G is more efficient than FJ, it tends to be less efficient than JA and vice-versa, and analogously for FG, J, and FJA, as can be seen by consideration of Table 1.

4.4 Relative efficiencies; q = 1 Case. Here, the rank of X is arbitrary and the rank of Z is 1. For this situation, we define r = RR^T = C_ZX C_XX^{−1} C_XZ/C_ZZ, the squared multiple correlation between Z and X. Also, define c = ‖β‖^2 and η = ρ^T β/c^{1/2}, where ρ^T = C_ZX C_XX^{−1/2}/C_ZZ^{1/2} = (ρ_1, ..., ρ_p), the vector of correlations between Z and X. Note that r = Σ_{i=1}^p ρ_i^2. As a consequence of these definitions and Assumptions 2.1, 0 < η^2 ≤ r < 1. Also note that, if p = 1 then η^2 = r. We use the assumptions C_XX = I and C_ZZ = 1, without loss of generality. For example, in setting up the models (2.1) and (2.2), we could simply have reparameterized X and Z appropriately utilizing Gram-Schmidt without altering the basic structure of the models. It then follows from (2.7) that

λ_1 = c(1 − η^2),  λ_2 = cη^2(1 − r),  λ_3 = cη^2 r(1 − r),  λ_4 = cη^2 r^2(1 − r),  λ_5 = cη^2 r^3(1 − r).

It follows directly that assumptions (4.5) hold. Additionally, Proposition 3.2 implies

μ_j = {c(1 − η^2)}^{1/2}/σ,
μ_fj = c^{1/2}(1 − η^2)/( σ {η^2 r(1 − r)}^{1/2} ),
μ_ja = {cη^2(1 − r)}^{1/2}/( σ r^{1/2} ),
μ_fja = c^{1/2} η^2 (1 − r)/( σ {1 − η^2}^{1/2} ),
μ_g = c^{1/2}(1 − rη^2)/( 2ση {r(1 − r)}^{1/2} ),
μ_fg = c^{1/2}(1 − rη^2)/( 2σ {1 − η^2}^{1/2} ),
σ_j^2 = 1,  σ_ja^2 = 1,
σ_g^2 = { 1/(4rη^4(1 − r)) } ( 1 + 2rη^2 − 4rη^4 + r^2 η^4 ),
σ_fj^2 = { 1/(rη^4(1 − r)) } ( 1 + 2rη^2 + rη^4 − 4η^4 ),
σ_fja^2 = { η^2(1 − r)/(1 − η^2) } { 1 − 4η^2(1 − r)^2/(1 − η^2) + η^2(1 − r)(1 − 2η^2 + rη^2)/(1 − η^2)^2 },
σ_fg^2 = (1/4) ( 1/(1 − η^2) )^3 (1 − rη^2)^2 (1 − 2η^2 + rη^2) + rη^4 (1 − r)^2/(1 − η^2)^2.    (4.7)
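That assumptions (4.5) hold in the q = 1 case can be verified directly from the λ expressions above. A short check (using the λ's as reconstructed here):

```python
def q1_lambdas(c, eta2, r):
    """lambda_1..lambda_5 from (2.7) specialized to q = 1 (as reconstructed)."""
    return (c * (1 - eta2),
            c * eta2 * (1 - r),
            c * eta2 * r * (1 - r),
            c * eta2 * r**2 * (1 - r),
            c * eta2 * r**3 * (1 - r))

l1, l2, l3, l4, l5 = q1_lambdas(c=2.0, eta2=0.3, r=0.6)
# Assumptions (4.5), and their stated consequence, hold identically in (eta2, r):
print(abs(l3**2 - l2 * l4) < 1e-12)    # lambda_3^2 = lambda_2 * lambda_4
print(abs(l4**2 - l3 * l5) < 1e-12)    # lambda_4^2 = lambda_3 * lambda_5
print(abs(l2 * l5 - l3 * l4) < 1e-12)  # lambda_2 * lambda_5 = lambda_3 * lambda_4
```

The identities hold because λ_2, ..., λ_5 form a geometric progression with ratio r; only λ_1 escapes the progression, which is why (4.5) holds but λ_2^2 = λ_1 λ_3 generally does not.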

Recall that the efficiencies of the tests will not depend on c, but we have included it for potential use in power calculations. We must keep in mind that β here is the reparameterized version. Note that when p = 1, these formulae reduce to those in (4.6), as they should. We have already established that J is asymptotically at least as efficient as JA according to the general criterion. When η^2 < r, J ≻ JA; they are equally efficient only where η^2 = r. This illustrates how restrictive the assumption of local alternatives is (cf. Pesaran, 1987), because, under local alternatives, J is asymptotically equivalent to JA (cf. McAleer, 1995). Clearly, if η^2 ≈ 0, JA will have π_n ≈ α. The test FJA will also suffer if η^2 ≈ 0.

We saw earlier that the Bahadur criterion's ordering of the procedures was FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA and J ⪰ JA ⪰ FJA (cf. section 4.2). Now, we can see that neither FG nor JA dominates the other according to the Bahadur criterion since, with r near 1, FG ⪰ JA, while with η^2 = r, JA ⪰ FG. As was the case when p = 1, the HS measure of relative efficiency does not imply that FJA is superior or inferior to either J or FG everywhere. So, either FG or FJA will be superior when the power is 1. We present a summary of the conclusions we draw for the situation when α = 0.05 and π_n = 0.75 and 0.95 in Table 2. We note that the conclusions when α = 0.10 are virtually identical to those in Table 2.

Table 2. Orderings for tests with q = 1.

r         η^2        π_n = 0.75                  π_n = 0.95
small     small      J ⪰ G ⪰ FJ ⪰ FG ⪰ FJA      J ⪰ FG ⪰ G ⪰ FJ ⪰ FJA
moderate  small      J ⪰ FJ ⪰ G ⪰ FG ⪰ FJA      J ⪰ FG ⪰ FJ ⪰ G ⪰ FJA
moderate  moderate   FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA      J ⪰ G ⪰ FJ ⪰ FG ⪰ FJA
large     small      J ⪰ FJ ⪰ G ⪰ FG ⪰ FJA      J ⪰ FG ⪰ FJ ⪰ G ⪰ FJA
large     moderate   FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA      J ⪰ FJ ⪰ G ⪰ FG ⪰ FJA
large     large      FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA      J ⪰ FJ ⪰ G ⪰ FG ⪰ FJA

The conclusions in Table 2 are similar to the analogous ones in Table 1 in that they imply that, in general, either FJ or J is superior while FJA is inferior.

4.5.
An example: flow rate. We illustrate how our formulae can be used to make theoretical power calculations. Consider modelling flow rate from watersheds following storm episodes as in Rawlings (1988, pp. 150-1). Though the sample size of 30 is not particularly large, we use it nonetheless for illustrative purposes. The response variable is the log of the

linear model selection tests 125 peak flow rate, Y. Other variables are the average slope of the watershed in percent, s; longest stream flow in watershed in thousands of feet, f; infiltration rate of water into soil in inches per hour, i; watershed area in square miles, a; estimated soil storage capacity in inches of water, c; rainfall in inches, r; time period during which rainfall exceeded 0.25 inches per hour, t; and surface absorbancy index, where 0 represents complete absorbancy and 100 represents no absorbancy, b. The model obtained through a forward stepwise selection procedure was H X : Y = δ 0 + δ 1 s + β 1 f + β 2 i + ɛ X. Rawlings (1988, p. 519 mentions that Mallows C p selects the model H Z : Y = δ 0 + δ 1 s + γ 1 a + γ 2 c + γ r + γ 4 t + γ 5 b + ɛ Z. Comparing these models is of interest. Here, then, p = 2, q = 5 and o = 2. We projected the matrices X and Z into V and, assumed that C XX = X T X/n, etc. Since ˆβ and ˆγ are consistent estimates of β and γ under their respective hypotheses, and ˆσ = min(s x, s z is a consistent estimate of σ, we were then able to calculate the following statistics: ˆ = 0.144, ˆλ 2 = 0.0470, ˆλ = 0.058, ˆλ 4 = 0.07, ˆλ 5 = 0.027, ˆZ = 0.1408, ˆλ 2Z = 0.046, ˆλ Z = 0.04, ˆλ 4Z = 0.026, ˆλ 5Z = 0.018. Note that assumptions (4.5 do not hold. We have s X = 0.67 and s Z = 0.5487. So ˆσ = 0.5487. Using Proposition.2, we obtained the following approximations to power, say π n, for.05 level tests assuming model X is true, namely ˆπ n(j = 0.9781, ˆπ n(ja = 0.7767, ˆπ n(g = 0.941, ˆπ n(f J = 0.9496, ˆπ n(f JA = 0.2571, ˆπ n(f G = 0.9071. (4.8 Similarly, approximations to π n under model Z are: ˆπ n(j = 0.9822, ˆπ n(ja = 0.5949, ˆπ n(g = 0.956, ˆπ n(f J = 0.9470, ˆπ n(f JA = 0.0864, ˆπ n(f G = 0.8771. (4.9 Thus J is most powerful, while JA and F JA are evidently considerably less powerful. These power estimates are for the nested power and that the

actual power for this scenario will be about 5% less than the π̂_n's in (4.8) and (4.9); cf. section 1. For these data, we obtained the following test statistics:

J_X = 2.91, J_Z = 4.26, JA_X = 1.01, JA_Z = 1.68, G_X = 0.20, G_Z = 2.29, FJ_X = 5.04, FJ_Z = 8.40, FJA_X = 0.59, FJA_Z = 0.85, FG_X = 1.16, FG_Z = 2.56.

The multiple correlation between the models X and Z was 0.98, R_X^2 was approximately 86% and R_Z^2 was approximately 83%. The tests JA, FG and G lead us to the conclusion that model X was preferable to model Z. The tests J and FJ reject both models at the 1% level of significance, while FJA does not reject either model. These results are not surprising, as the power calculations indicated that the tests G and FG would probably be decisive while FJA would tend to be indecisive. The J-test may reject too often in practice (cf. McAleer, 1995), and we conjecture that it is also the case that FJ will have larger than nominal size. Therefore, as some preliminary work has been done to find finite-sample corrections for the J-test (cf. Royston and Thompson, 1995) and others have used the bootstrap as a means of correcting the size (Godfrey, 1998), we believe it would be worthwhile to thoroughly investigate such corrections for both J and FJ. In the meantime, with relatively small sample sizes, practitioners may prefer to use the more robust method presented in Victoria-Feser (1997) or the less powerful alternatives G, FG, and JA.

5. Concluding Remarks

We have considered non-nested hypothesis testing procedures in the linear regression setting. We demonstrated that the asymptotic means of the most commonly-used methods (cf. McAleer, 1995) are positive and, thus, that the procedures should be treated as one-sided tests when one is comparing two specified models. Furthermore, we presented a way to estimate the power of non-nested model selection tests when the true parameter vector and variance can be consistently estimated.
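Power approximations of the kind used in (4.8)–(4.9) follow from the asymptotic normality of the statistics: if S_n is approximately N(√n μ, σ²) under the data-generating model, a one-sided level-α test has π_n ≈ 1 − Φ((z_{1−α} − √n μ)/σ). A hedged sketch of this normal approximation (the inputs below are placeholders, not the paper's estimates):

```python
import math
from statistics import NormalDist

def approx_power(mu, sd, n, alpha=0.05):
    """Normal approximation to the power of a one-sided size-alpha test whose
    statistic is approximately N(sqrt(n)*mu, sd^2) under the alternative."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf((z_alpha - math.sqrt(n) * mu) / sd)

# With mu = 0 the approximation returns the size alpha; power grows with n.
print(round(approx_power(0.0, 1.0, 30), 6))
print(approx_power(0.5, 1.2, 30) < approx_power(0.5, 1.2, 60))
```

In practice μ and σ would be replaced by the consistent estimates obtained from Proposition 3.2, as was done for the flow-rate data.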
In addition, we have given a method for theoretically judging which procedure is the most powerful by using measures of asymptotic relative efficiency. Interestingly, the asymptotic relative efficiencies did not depend upon the norm of the parameter vector. For the cases of p = 1 or q = 1, we saw that the efficiency depends only upon how close the design matrices are and how close the mean vector is to the alternative design's column space. This is consistent with the results obtained in the local alternatives case (Pesaran, 1987).

In the general case, the relative efficiencies will involve the canonical correlation matrix M and the vector β. We saw that J is theoretically preferable to JA in general. For the Bahadur (1961) measure of asymptotic relative efficiency (cf. (4.3)), we were able to conclude that FJ ⪰ G ⪰ J ⪰ FG ⪰ FJA. For the Hájek and Šidák (1967, p. 267) measure, we saw that FG ⪰ J ⪰ G ⪰ FJ, under assumptions (4.5). We remind the reader that these asymptotic results do not take into account the possibility of incorrect size. Therefore, if one does not have a large sample, the most powerful test, as determined by use of this methodology, may not be the best test to use (McAleer, 1995).

Acknowledgements. We are indebted to an anonymous referee for helpful suggestions. We also thank Ron Christensen for his numerous useful comments on earlier versions of this paper.

A. Appendix to Section 2

Under the conditions necessary for the Lindeberg-Feller Central Limit Theorem (cf. Loève, 1977, pp. 292-294), we have, under (2.1), the following distributional result:

n^{1/2} { (W_X, W_Z) − (μ_x, μ_z) } → N { 0, σ_x^2 ( I  R^T ; R  I ) },  (n → ∞).    (A.1)

Practically speaking, the conditions sufficient for (A.1) to hold are that the matrices C_XX and C_ZZ are invertible and finite, i.e., the matrices X^T X and Z^T Z are all of order n and of full rank, and the error term ε is either bounded or derives from a distribution with at least 2 moments and sufficiently small tails.

To prove claim (a) of Proposition 2.1, we recall that the characteristic roots of M, which we call m_i for i = 1, ..., p here, are between 0 and 1. Let M_i be the eigenvectors of M and d_i = M_i^T μ_x. Then λ_j = Σ_{i=1}^p m_i^{j−1}(1 − m_i) d_i^2. Letting w_i = (1 − m_i) d_i^2/λ_1, we have that the w_i's define a probability measure. So λ_2^2 ≤ λ_1 λ_3 if and only if

( Σ_{i=1}^p w_i m_i )^2 ≤ Σ_{i=1}^p w_i m_i^2,

which is obviously true. Claims (b) and (c) follow similarly. Claim (d) follows from claims (a) and (b) and the fact that λ_i ≥ 0. B.
Appendix to Section 3

The calculational and theoretical forms for J_Z (cf. (3.1)) are

J_Z = ( ‖P_X Y‖^2 − Y^T P_X P_Z Y ) / { n^{1/2} σ̂ ( ‖P_X Y‖^2 − ‖P_Z P_X Y‖^2 )^{1/2} }
    = ( ‖W_X‖^2 − W_X^T A^T W_Z ) / { σ̂ ( ‖W_X‖^2 − ‖A W_X‖^2 )^{1/2} }.

The theoretical form of JA_Z (cf. (3.2)) is

JA_Z = ( W_X^T A^T W_Z − ‖A^T W_Z‖^2 ) / { σ̂ ( ‖A^T W_Z‖^2 − ‖A A^T W_Z‖^2 )^{1/2} }.

Similarly, the theoretical form of G_Z (cf. (3.3)) is

G_Z = ( ‖W_X‖^2 − ‖A^T W_Z‖^2 ) / { 4σ̂^2 ( ‖A^T W_Z‖^2 − ‖A A^T W_Z‖^2 ) }^{1/2}.

For FG_Z, FJ_Z, and FJA_Z (cf. (3.6), (3.7), (3.8)), the theoretical forms are

FG_Z = ( ‖W_X‖^2 − ‖A^T W_Z‖^2 ) / { 2σ̂ ( ‖W_X‖^2 − ‖A W_X‖^2 )^{1/2} } ;
FJ_Z = ( ‖W_X‖^2 − W_X^T A^T W_Z ) / { σ̂ ( ‖A^T W_Z‖^2 − ‖A A^T W_Z‖^2 )^{1/2} } ;
FJA_Z = ( W_X^T A^T W_Z − ‖A^T W_Z‖^2 ) / { σ̂ ( ‖W_X‖^2 − ‖A W_X‖^2 )^{1/2} }.

We now prove Proposition 3.2. First, note that the numerator of each of J_Z, JA_Z, G_Z, FJ_Z, FJA_Z, and FG_Z may be written as a linear combination of the three statistics ‖W_X‖^2, W_Z^T A W_X, and ‖A^T W_Z‖^2. Let b_1, b_2, b_3 represent the coefficients associated with the three statistics, respectively. That is, for J_X, b_1 = 1, b_2 = −1, and b_3 = 0. Let b_4 be an indicator variable for using τ̂_JAZ in the denominator of the test statistic. This leads us to the realization that, when calculated under H_Z, the numerators of all of our test statistics may be written in the form

u_n = b_1 ‖W_X‖^2 + b_2 W_Z^T A W_X + b_3 ‖A^T W_Z‖^2,

where b_1, b_2, b_3 ∈ {−1, 0, 1}. Under H_X, u_n →_p b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2. Therefore, all of our test statistics may be written as

S_n = u_n / ( c { (1 − b_4) τ̂_JZ + b_4 τ̂_JAZ } ),

where c = 2 if we are using G or FG and c = 1 otherwise, and b_4 ∈ {0, 1}.
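The six theoretical forms above can be computed mechanically from (W_X, W_Z, A, σ̂). A minimal sketch (the input vectors and matrix below are arbitrary illustrative values); it also exhibits the averaging identities recalled later from (3.9), namely that G averages FJ and JA while FG averages J and FJA:

```python
import math

def stats_theoretical(WX, WZ, A, sigma_hat):
    """The six statistics in their theoretical forms (as reconstructed):
    each numerator combines ||W_X||^2, W_X' A' W_Z and ||A' W_Z||^2."""
    def dot(u, v): return sum(x * y for x, y in zip(u, v))
    def mat(rows, v): return [dot(row, v) for row in rows]
    At = list(map(list, zip(*A)))          # A^T
    AWx, AtWz = mat(A, WX), mat(At, WZ)
    AAtWz = mat(A, AtWz)
    n0 = dot(WX, WX)                       # ||W_X||^2
    n1 = dot(WX, AtWz)                     # W_X^T A^T W_Z
    n2 = dot(AtWz, AtWz)                   # ||A^T W_Z||^2
    tau_j = sigma_hat * math.sqrt(n0 - dot(AWx, AWx))
    tau_ja = sigma_hat * math.sqrt(n2 - dot(AAtWz, AAtWz))
    return {'J': (n0 - n1) / tau_j,
            'JA': (n1 - n2) / tau_ja,
            'G': (n0 - n2) / (2 * tau_ja),
            'FJ': (n0 - n1) / tau_ja,
            'FJA': (n1 - n2) / tau_j,
            'FG': (n0 - n2) / (2 * tau_j)}

s = stats_theoretical([1.0, 0.5], [0.8, 0.1], [[0.6, 0.2], [0.1, 0.7]], 1.0)
print(abs(s['G'] - (s['FJ'] + s['JA']) / 2) < 1e-12)
print(abs(s['FG'] - (s['J'] + s['FJA']) / 2) < 1e-12)
```

The averaging identities hold exactly because the pairs share a common denominator (τ̂_JAZ for {G, FJ, JA}; τ̂_JZ for {FG, J, FJA}) and their numerators telescope.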

We can now obtain the partial derivatives of S_n with respect to W_X and W_Z:

∂S_n/∂W_X = (∂u_n/∂W_X) / ( c { (1 − b_4) τ̂_JZ + b_4 τ̂_JAZ } ) + (u_n/c) { (1 − b_4) ∂τ̂_JZ^{−1}/∂W_X + b_4 ∂τ̂_JAZ^{−1}/∂W_X },
∂S_n/∂W_Z = (∂u_n/∂W_Z) / ( c { (1 − b_4) τ̂_JZ + b_4 τ̂_JAZ } ) + (u_n/c) { (1 − b_4) ∂τ̂_JZ^{−1}/∂W_Z + b_4 ∂τ̂_JAZ^{−1}/∂W_Z }.    (B.1)

We note that, as alluded to in section 3.3, when X ⊥ Z these derivatives are infinite when b_4 = 1, because τ̂_JAZ = 0. So, these results do not hold for JA_Z, G_Z and FJ_Z, if X ⊥ Z. Recalling the notational definitions in equations (2.4), (2.5), (2.6) and (2.7), we note that μ_z = Rμ_x. Using all of these, we obtain:

∂u_n/∂W_X = 2b_1 W_X + b_2 A^T W_Z →_p 2b_1 μ_x + b_2 R^T μ_z = 2b_1 μ_x + b_2 R^T R μ_x ;
∂u_n/∂W_Z = b_2 A W_X + 2b_3 A A^T W_Z →_p b_2 μ_z + 2b_3 R R^T μ_z = b_2 R μ_x + 2b_3 R R^T R μ_x ;    (B.2)

where the convergences are under H_X. We now obtain the asymptotic norms:

‖∂u_n/∂W_X‖^2 →_p 4b_1^2 ξ_0 + 4b_1 b_2 ξ_1 + b_2^2 ξ_2 ≡ U_X ;
‖∂u_n/∂W_Z‖^2 →_p b_2^2 ξ_1 + 4b_2 b_3 ξ_2 + 4b_3^2 ξ_3 ≡ U_Z ;
(∂u_n/∂W_X)^T A^T (∂u_n/∂W_Z) →_p 2b_1 b_2 ξ_1 + (4b_1 b_3 + b_2^2) ξ_2 + 2b_2 b_3 ξ_3 ≡ U_XZ.    (B.3)

From (3.4), we derive the derivatives of τ̂_JZ^{−1} and τ̂_JAZ^{−1}:

∂τ̂_JZ^{−1}/∂W_X = −σ̂^2 ( W_X − A^T A W_X )/τ̂_JZ^3 ,    ∂τ̂_JZ^{−1}/∂W_Z = 0 ,
∂τ̂_JAZ^{−1}/∂W_X = 0 ,    ∂τ̂_JAZ^{−1}/∂W_Z = −σ̂^2 ( A A^T W_Z − A A^T A A^T W_Z )/τ̂_JAZ^3 .    (B.4)

Now, noting that b_4 is an indicator, we calculate the partial derivatives

of S_n using (B.1)-(B.4):

∂S_n/∂W_X = (∂u_n/∂W_X) / ( c { (1 − b_4) τ̂_JZ + b_4 τ̂_JAZ } ) − (1 − b_4) u_n σ̂^2 ( I − A^T A ) W_X / ( c τ̂_JZ^3 ),
∂S_n/∂W_Z = (∂u_n/∂W_Z) / ( c { (1 − b_4) τ̂_JZ + b_4 τ̂_JAZ } ) − b_4 u_n σ̂^2 ( A A^T − A A^T A A^T ) W_Z / ( c τ̂_JAZ^3 ).    (B.5)

From (2.7), (3.5), and (B.1)-(B.5), we see that

‖∂S_n/∂W_X‖^2 → U_X / ( c^2 σ_x^2 { (1 − b_4) λ_1 + b_4 λ_3 } )
 + (1 − b_4) { (b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)^2 (λ_1 − λ_2) / ( c^2 σ_x^2 λ_1^3 )
 − 2 (b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)(2b_1 λ_1 + b_2 λ_2) / ( c^2 σ_x^2 λ_1^2 ) } ;

‖∂S_n/∂W_Z‖^2 → U_Z / ( c^2 σ_x^2 { (1 − b_4) λ_1 + b_4 λ_3 } )
 + b_4 { (b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)^2 (λ_4 − λ_5) / ( c^2 σ_x^2 λ_3^3 )
 − 2 (b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)(b_2 λ_3 + 2b_3 λ_4) / ( c^2 σ_x^2 λ_3^2 ) } ;

(∂S_n/∂W_X)^T A^T (∂S_n/∂W_Z) → U_XZ / ( c^2 σ_x^2 { (1 − b_4) λ_1 + b_4 λ_3 } )
 − (1 − b_4)(b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)(b_2 λ_2 + 2b_3 λ_3) / ( c^2 σ_x^2 λ_1^2 )
 − b_4 (b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2)(2b_1 λ_3 + b_2 λ_4) / ( c^2 σ_x^2 λ_3^2 ).

Using the delta method, the asymptotic variance of n^{1/2} S_n under (2.1) is

( ∂S/∂W_X , ∂S/∂W_Z )^T |_{μ}  { σ_x^2 ( I  R^T ; R  I ) }  ( ∂S/∂W_X , ∂S/∂W_Z ) |_{μ}
= σ_x^2 { ‖∂S/∂W_X‖^2 + ‖∂S/∂W_Z‖^2 + 2 (∂S/∂W_X)^T R^T (∂S/∂W_Z) }    (B.6)

= ( U_X + U_Z + 2U_XZ ) / ( c^2 { (1 − b_4) λ_1 + b_4 λ_3 } )
 + (1 − b_4) { ν^2 (λ_1 − λ_2) / ( c^2 λ_1^3 ) − ν (4b_1 λ_1 + 4b_2 λ_2 + 4b_3 λ_3) / ( c^2 λ_1^2 ) }
 + b_4 { ν^2 (λ_4 − λ_5) / ( c^2 λ_3^3 ) − ν [ (4b_1 + 2b_2) λ_3 + (2b_2 + 4b_3) λ_4 ] / ( c^2 λ_3^2 ) },    (B.7)

where ν = b_1 ξ_0 + b_2 ξ_1 + b_3 ξ_2.

For J_Z, we have b_1 = 1, b_2 = −1, b_3 = 0, b_4 = 0, c = 1. Therefore, according to (B.7), the asymptotic variance of J_Z under H_X is:

( 4ξ_0 − 4ξ_1 + ξ_2 + ξ_1 + 2(ξ_2 − 2ξ_1) )/λ_1 + λ_1^2 (λ_1 − λ_2)/λ_1^3 − λ_1 (4λ_1 − 4λ_2)/λ_1^2
= ( 4λ_1 − 3λ_2 + λ_1 − λ_2 − 4λ_1 + 4λ_2 )/λ_1 = 1.    (B.8)

For JA_Z, we have b_1 = 0, b_2 = 1, b_3 = −1, b_4 = 1, c = 1. Therefore, according to (B.7), the asymptotic variance of JA_Z under H_X is:

( ξ_2 + ξ_1 − 4ξ_2 + 4ξ_3 + 2(ξ_2 − 2ξ_3) )/λ_3 + λ_2^2 (λ_4 − λ_5)/λ_3^3 − λ_2 (2λ_3 − 2λ_4)/λ_3^2
= (λ_2/λ_3) { 2λ_4/λ_3 + λ_2 (λ_4 − λ_5)/λ_3^2 − 1 }.    (B.9)

For G_Z, we have b_1 = 1, b_2 = 0, b_3 = −1, b_4 = 1, c = 2. Therefore,

according to (B.7), the asymptotic variance of G_Z under H_X is:

( 4ξ_0 + 4ξ_3 − 8ξ_2 )/(4λ_3) + (λ_1+λ_2)^2 (λ_4 − λ_5)/(4λ_3^3) − (λ_1+λ_2)(λ_3 − λ_4)/λ_3^2
= ( λ_1 + λ_2 − λ_3 )/λ_3 − (λ_1+λ_2)(λ_3 − λ_4)/λ_3^2 + (λ_1+λ_2)^2 (λ_4 − λ_5)/(4λ_3^3)
= (λ_1+λ_2) λ_4/λ_3^2 + (λ_1+λ_2)^2 (λ_4 − λ_5)/(4λ_3^3) − 1.    (B.10)

For FJ_Z, we have b_1 = 1, b_2 = −1, b_3 = 0, b_4 = 1, c = 1. Therefore, according to (B.7), the asymptotic variance of FJ_Z under H_X is:

( 4ξ_0 − 4ξ_1 + ξ_2 + ξ_1 + 2(ξ_2 − 2ξ_1) )/λ_3 + λ_1^2 (λ_4 − λ_5)/λ_3^3 − λ_1 (2λ_3 − 2λ_4)/λ_3^2
= ( 4λ_1 − 3λ_2 )/λ_3 − 2λ_1 (λ_3 − λ_4)/λ_3^2 + λ_1^2 (λ_4 − λ_5)/λ_3^3
= ( 2λ_1 − 3λ_2 )/λ_3 + 2λ_1 λ_4/λ_3^2 + λ_1^2 (λ_4 − λ_5)/λ_3^3.    (B.11)

For FJA_Z, we have b_1 = 0, b_2 = 1, b_3 = −1, b_4 = 0, c = 1. Therefore, according to (B.7), the asymptotic variance of FJA_Z under H_X is:

( ξ_2 + ξ_1 − 4ξ_2 + 4ξ_3 + 2(ξ_2 − 2ξ_3) )/λ_1 + λ_2^2 (λ_1 − λ_2)/λ_1^3 − λ_2 (4λ_2 − 4λ_3)/λ_1^2
= λ_2/λ_1 − 4λ_2 (λ_2 − λ_3)/λ_1^2 + λ_2^2 (λ_1 − λ_2)/λ_1^3.    (B.12)

We note that, when X ⊥ Z, as is considered in section 3.3, this asymptotic variance is 0.
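A useful consistency check on the variance formulas (B.9)-(B.13) is that, in the p = 1 case, they must reduce to the closed forms stated in (4.6). A numeric sketch (both sets of expressions are used here as reconstructed in this document):

```python
def lam_p1(M, k=1.0):
    """p = 1 case: lambda_i = k * M**(i-1) * (1 - M), i = 1..5 (cf. section 4.3)."""
    return [k * M ** i * (1 - M) for i in range(5)]

def variances(l1, l2, l3, l4, l5):
    """Asymptotic variances from (B.9)-(B.13), as reconstructed."""
    return {
        'ja': (l2 / l3) * (2 * l4 / l3 + l2 * (l4 - l5) / l3**2 - 1),
        'g': (l1 + l2) * l4 / l3**2 + (l1 + l2)**2 * (l4 - l5) / (4 * l3**3) - 1,
        'fj': (2 * l1 - 3 * l2) / l3 + 2 * l1 * l4 / l3**2 + l1**2 * (l4 - l5) / l3**3,
        'fja': l2 / l1 - 4 * l2 * (l2 - l3) / l1**2 + l2**2 * (l1 - l2) / l1**3,
        'fg': (l1 + l2)**2 * (l1 - l2) / (4 * l1**3) + l2 * l3 / l1**2,
    }

M = 0.4
v = variances(*lam_p1(M))
print(abs(v['ja'] - 1.0) < 1e-9)                                   # sigma_ja^2 = 1
print(abs(v['g'] - (1 + M + 3*M**2 - M**3) / (4 * M**3)) < 1e-9)   # (4.6)
print(abs(v['fj'] - (1 + M - M**2) / M**3) < 1e-9)
print(abs(v['fja'] - M * (1 - 3*M + 3*M**2)) < 1e-9)
print(abs(v['fg'] - (1 + M - M**2 + 3*M**3) / 4) < 1e-9)
```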

For FG_Z, we have b_1 = 1, b_2 = 0, b_3 = −1, b_4 = 0, c = 2. Therefore, according to (B.7), the asymptotic variance of FG_Z under H_X is:

( 4ξ_0 + 4ξ_3 − 8ξ_2 )/(4λ_1) + (λ_1+λ_2)^2 (λ_1 − λ_2)/(4λ_1^3) − (λ_1+λ_2)(λ_1 − λ_3)/λ_1^2
= ( λ_1 + λ_2 − λ_3 )/λ_1 − (λ_1+λ_2)(λ_1 − λ_3)/λ_1^2 + (λ_1+λ_2)^2 (λ_1 − λ_2)/(4λ_1^3)
= λ_2 λ_3/λ_1^2 + (λ_1+λ_2)^2 (λ_1 − λ_2)/(4λ_1^3).    (B.13)

The proof of the Corollary to Proposition 3.2 follows. From (B.9),

Var(JA_Z) = (λ_2/λ_3) { 2λ_4/λ_3 + λ_2 (λ_4 − λ_5)/λ_3^2 − 1 }.

This is greater than or equal to 1 if and only if

2λ_2 λ_3 λ_4 + λ_2^2 (λ_4 − λ_5) − λ_2 λ_3^2 ≥ λ_3^3.

The last statement is true by Proposition 2.1.

C. Appendix to Section 4

To show the Bahadur ordering, we have

e_B(FJ, G) = e_B(J, FG) = ( (λ_1+λ_2)/(2σλ_1^{1/2}) · σ/λ_1^{1/2} )^2 = ( (λ_1+λ_2)/(2λ_1) )^2 < 1,

since λ_2 < λ_1, and

e_B(FG, FJA) = ( 2λ_2/(λ_1+λ_2) )^2 < 1.
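These Bahadur inequalities can be spot-checked numerically by generating λ's from the spectral representation λ_j = Σ_i m_i^{j−1}(1 − m_i)d_i^2 used in the Appendix to Section 2. A sketch, using the efficiency expressions as reconstructed in this section:

```python
import random

random.seed(1)

def lambdas(ms, ds):
    # lambda_j = sum_i m_i^(j-1) * (1 - m_i) * d_i^2, j = 1..5
    return [sum(m ** j * (1 - m) * d * d for m, d in zip(ms, ds))
            for j in range(5)]

ok = True
for _ in range(2000):
    p = random.randint(1, 6)
    ms = [random.uniform(0.01, 0.99) for _ in range(p)]  # roots of M in (0, 1)
    ds = [random.uniform(0.1, 2.0) for _ in range(p)]
    l1, l2, l3, l4, l5 = lambdas(ms, ds)
    eps = 1e-12
    ok &= ((l1 + l2) / (2 * l1)) ** 2 <= 1 + eps      # e_B(FJ, G) = e_B(J, FG)
    ok &= (2 * l2 / (l1 + l2)) ** 2 <= 1 + eps        # e_B(FG, FJA)
    ok &= l3 / l1 <= 1 + eps                          # e_B(FJ, J) = e_B(G, FG)
    ok &= 4 * l1 * l3 / (l1 + l2) ** 2 <= 1 + eps     # e_B(G, J), cf. (C.1)
print(ok)
```

All four hold because λ_1 ≥ λ_2 ≥ λ_3 for roots in (0, 1), together with the arithmetic-geometric mean inequality (λ_1 + λ_3)^2 ≥ 4λ_1 λ_3.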

Now, we compare J and G:

e_B(FJ, J) = e_B(G, FG) = λ_3/λ_1 < 1.
e_B(G, J) = 4λ_1 λ_3/(λ_1+λ_2)^2.    (C.1)

To show that e_B(G, J) < 1, we note that 4λ_1 λ_3/(λ_1+λ_2)^2 ≤ 4λ_1 λ_3/(λ_1+λ_3)^2 < 1, because λ_3 < λ_2. But

4λ_1 λ_3/(λ_1+λ_3)^2 ≤ 1 if and only if 0 ≤ λ_1^2 + λ_3^2 − 2λ_1 λ_3 = (λ_1 − λ_3)^2,

which is obviously true.

Now, we show e_HS(J, FG) ≥ 1. Beginning with (4.4), we have

e_HS(J, FG) = [ (λ_1+λ_2)^2/(4λ_1^2) ] / [ (λ_1+λ_2)^2 (λ_1−λ_2)/(4λ_1^3) + λ_2 λ_3/λ_1^2 ]
            = 1 / { (λ_1−λ_2)/λ_1 + 4λ_2 λ_3/(λ_1+λ_2)^2 }.

This is greater than, or equal to, 1 if and only if

(λ_1−λ_2)/λ_1 + 4λ_2 λ_3/(λ_1+λ_2)^2 ≤ 1 ⟺ 4λ_2 λ_3/(λ_1+λ_2)^2 ≤ λ_2/λ_1 ⟺ 4λ_1 λ_3/(λ_1+λ_2)^2 ≤ 1.

But the right-hand side is equivalent to e_B(G, J) ≤ 1 (see equation (C.1)). So, this completes the proof.

We now show e_HS(G, J) ≥ 1. For HS efficiency, we consider the relative efficiencies between G and J, and between G and FJ. Using (B.10),

e_HS(G, J) = ( μ_j σ_g/(μ_g σ_j) )^2 = λ_1 [ (λ_1+λ_2) { 4λ_3 λ_4 + (λ_1+λ_2)(λ_4−λ_5) } − 4λ_3^3 ] / ( λ_3^2 (λ_1+λ_2)^2 ).    (C.2)

This is greater than, or equal to, 1, if and only if

λ_1 (λ_1+λ_2) { 4λ_3 λ_4 + (λ_1+λ_2)(λ_4−λ_5) } − 4λ_1 λ_3^3 ≥ λ_3^2 (λ_1+λ_2)^2.

Under assumptions (4.5), λ_4 = λ_3^2/λ_2 and λ_4 − λ_5 = λ_3^2(λ_2−λ_3)/λ_2^2, so the left-hand side equals λ_1 { 4λ_1 λ_3^3/λ_2 + (λ_1+λ_2)^2 λ_3^2 (λ_2−λ_3)/λ_2^2 }, and the condition becomes

4λ_1^2 λ_2 λ_3 + λ_1 (λ_1+λ_2)^2 (λ_2 − λ_3) ≥ λ_2^2 (λ_1+λ_2)^2.

Writing λ_2 = tλ_1 and λ_3 = utλ_1 with 0 < t, u < 1, this is equivalent to

(1 − t) { (1+t)^2 − u(1−t) } ≥ 0,

which is a true statement.

Next, we compare G to FJ according to the Hájek and Šidák criterion:

e_HS(G, FJ) = ( μ_fj σ_g/(μ_g σ_fj) )^2 = 4λ_1^2 σ_g^2 / ( (λ_1+λ_2)^2 σ_fj^2 ).

This is less than, or equal to, 1, if and only if 4λ_1^2 σ_g^2 ≤ (λ_1+λ_2)^2 σ_fj^2. Substituting (B.10) and (B.11), and using assumptions (4.5) as above, the terms involving (λ_2 − λ_3) cancel, and the condition reduces to

4λ_1^3 λ_3 ≤ (λ_1+λ_2)^2 λ_2 (2λ_1 − 3λ_2) + 2λ_1 (λ_1+λ_2)^2 λ_3.

Writing λ_2 = tλ_1 and λ_3 = utλ_1 as before, this is equivalent to

4u ≤ (1+t)^2 (2 − 3t) + 2u(1+t)^2.

Both sides are linear in u, and (2.9) gives u ≥ t, so it suffices to check the endpoints: at u = t the inequality reduces to 2 − t − t^3 ≥ 0, and at u = 1 it reduces to (1+t)^2 (4 − 3t) ≥ 4,

which is a true statement.

References

Bahadur, R.R. (1960). Stochastic comparison of tests. Ann. Math. Statist. 31, 276-295.
Christensen, R. (1987). Plane Answers to Complex Questions: The Theory of Linear Models. New York: Springer-Verlag.
Cox, D.R. (1961). Tests of separate families of hypotheses. Proceedings of the 4th Berkeley Symposium on Mathematics, Probability and Statistics, Vol. 1, 105-123. Berkeley: University of California Press.
Cox, D.R. (1962). Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc., Series B, 24, 406-424.
Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. New York: Wiley.
Dastoor, N.K. and McAleer, M. (1989). Some power comparisons of joint and paired tests for nonnested models under local hypotheses. Econometric Theory 5, 83-94.
Davidson, R. and MacKinnon, J.G. (1981). Several tests for model specification in the presence of alternative hypotheses. Econometrica 49, 781-793.
Efron, B. (1984). Comparing non-nested linear models. J. Amer. Statist. Assoc. 79, 791-803.
Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Pacific Grove, CA: Chapman and Hall.
Fisher, G.R. and McAleer, M. (1981). Alternative procedures and associated tests of significance for non-nested hypotheses. J. Econometrics 16, 103-119.
Fisher, G.R. (1983). Tests for two separate regressions. J. Econometrics 21, 117-132.
Godfrey, L.G. (1998). Tests of non-nested regression models: some results on small sample behaviour and the bootstrap. J. Econometrics 84, 59-74.
Godfrey, L.G. and Pesaran, M.H. (1983). Tests of non-nested regression models. J. Econometrics 21, 133-154.
Graybill, F.A. and Iyer, H.K. (1994). Regression Analysis: Concepts and Applications. Belmont, CA: Duxbury.
Green, J.R. (1971). Testing departure from a regression without using replication. Technometrics 13, 609-615.