WALD TESTS OF SINGULAR HYPOTHESES

By Mathias Drton and Han Xiao

University of Washington and Rutgers University


Motivated by the problem of testing tetrad constraints in factor analysis, we study the large-sample distribution of Wald statistics at parameter points at which the gradient of the tested constraint vanishes. When based on an asymptotically normal estimator, the Wald statistic converges to a rational function of a normal random vector. The rational function is determined by a homogeneous polynomial and a covariance matrix. For quadratic forms and bivariate monomials of arbitrary degree, we show unexpected relationships to chi-square distributions that explain conservative behavior of certain Wald tests. For general monomials, we offer a conjecture according to which the reciprocal of a certain quadratic form in the reciprocals of dependent normal random variables is chi-square distributed.

AMS 2000 subject classifications: 62F05, 62E20.
Keywords and phrases: Asymptotic distribution, factor analysis, large-sample theory, singular parameter point, tetrad, Wald statistic.

1. Introduction. Let f ∈ R[x_1, ..., x_k] be a homogeneous k-variate polynomial with gradient ∇f, and let Σ be a k×k positive semidefinite matrix with positive diagonal entries. In this paper, we study the distribution of the random variable

(1.1)  W_{f,Σ} = f(X)² / ((∇f(X))^T Σ ∇f(X)),

where X ∼ N_k(0, Σ) is a normal random vector with zero mean and covariance matrix Σ. The random variable W_{f,Σ} arises in the description of the large-sample behavior of Wald tests, with Σ being the asymptotic covariance matrix of an estimator and the polynomial f appearing in a Taylor approximation to the function that defines the constraint to be tested. In regular settings, the Wald statistic for a single constraint converges to χ²_1, the chi-square distribution with one degree of freedom. This familiar fact is recovered when f(x) = a^T x, a ≠ 0, is a linear form and

(1.2)  W_{f,Σ} = (a^T X)² / (a^T Σ a)

becomes the square of a standard normal random variable; the vector a corresponds to a nonzero gradient of the tested constraint.
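The regular case (1.2) is easy to check numerically. The following sketch is not part of the original paper; the helper name simulate_W is ours, and any homogeneous polynomial with an explicit gradient can be plugged in.

```python
# Minimal Monte Carlo sketch (ours): simulate W_{f,Sigma} from (1.1) and, for a linear
# form f(x) = a^T x, compare with the chi-square_1 law predicted by (1.2).
import numpy as np
from scipy import stats

def simulate_W(f, grad_f, Sigma, n_draws=100_000, seed=0):
    """Draw X ~ N_k(0, Sigma) and return W = f(X)^2 / (grad f(X)^T Sigma grad f(X))."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n_draws)
    num = np.array([f(x) for x in X]) ** 2
    grads = np.array([grad_f(x) for x in X])
    den = np.einsum('ij,jk,ik->i', grads, Sigma, grads)
    return num / den

a = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, -0.2], [0.1, -0.2, 0.5]])
W = simulate_W(lambda x: a @ x, lambda x: a, Sigma)
print(stats.kstest(W, stats.chi2(df=1).cdf))   # large p-value: consistent with (1.2)
```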

Our attention is devoted to cases in which f has degree two or larger. These singular cases occur when the gradient of the constraint is zero at the true parameter. For likelihood ratio tests, a large body of literature starting with Chernoff (1954) describes large-sample behavior in irregular settings; examples of recent work are Azaïs, Gassiat and Mercadier (2006), Drton (2009), Kato and Kuriki (2013) and Ritz and Skovgaard (2005). In contrast, much less work appears to exist for Wald tests. Three examples we are aware of are Glonek (1993), Gaffke, Steyer and von Davier (1999) and Gaffke, Heiligers and Offinger (2002), who treat singular hypotheses that correspond to collapsibility of contingency tables and confounding in regression. Our own interest is motivated by the fact that graphical models with hidden variables are singular (Drton, Sturmfels and Sullivant, 2009, Chap. 4). In graphical modeling, or more specifically in factor analysis, the testing of so-called tetrad constraints is a problem of particular practical relevance (Bollen, Lennox and Dahly, 2009; Bollen and Ting, 2000; Hipp and Bollen, 2003; Silva et al., 2006; Spirtes, Glymour and Scheines, 2000). This problem goes back to Spearman (1904); for some of the history see Harman (1976). The desire to better understand the Wald statistic for a tetrad was the initial statistical motivation for this work. We solve the tetrad problem in Section 5; the relevant polynomial is quadratic, namely, f(x) = x_1x_2 − x_3x_4. However, many other hypotheses are of interest in graphical modeling and beyond (Drton, Sturmfels and Sullivant, 2007; Drton, Massam and Olkin, 2008; Sullivant, Talaska and Draisma, 2010; Zwiernik and Smith, 2012). In principle, any homogeneous polynomial f could arise in the description of a large-sample limit and, thus, a general distribution theory for the random variable W_{f,Σ} from (1.1) would be desirable.

At first sight, it may seem as if not much concrete can be said about W_{f,Σ} when f has degree two or larger. However, the distribution of W_{f,Σ} can in surprising ways be independent of the covariance matrix Σ even if degree(f) ≥ 2. Glonek (1993) was the first to show this in his study of the case f(x) = x_1x_2 that is relevant, in particular, for hypotheses that are the union of two sets. Moreover, the asymptotic distribution in this case is smaller than χ²_1, making the Wald test maintain (at times quite conservatively) a desired asymptotic level across the entire null hypothesis. We will show that similar phenomena hold also in degree higher than two; see Section 3, which treats monomials f(x) = x_1^{α_1} x_2^{α_2}. For the tetrad, conservativeness has been remarked upon in work such as Johnson and Bodner (2007). According to our work in Section 5, this is due to the singular nature of the hypothesis rather than to effects of too small a sample size.

We remark that in singular settings standard n-out-of-n bootstrap tests may fail to achieve a desired asymptotic size, making it necessary to consider m-out-of-n and subsampling procedures; compare the discussion and references in Drton and Williams (2011).

In the remainder of this paper we first clarify the connection between Wald tests and the random variables W_{f,Σ} from (1.1); see Section 2. Bivariate monomials f of arbitrary degree are the topic of Section 3. Quadratic forms f are treated in Section 4, which gives a full classification of the bivariate case. The tetrad is studied in Section 5. Our proofs make heavy use of the polar coordinate representation of a pair of independent standard normal random variables and, unfortunately, we have so far not been able to prove the following conjecture, which we discuss further in Section 6.

Conjecture 1.1. Let Σ be any positive semidefinite k×k matrix with positive diagonal entries. If f(x_1, ..., x_k) = x_1^{α_1} x_2^{α_2} ⋯ x_k^{α_k} with nonnegative integer exponents α_1, ..., α_k that are not all zero, then

W_{f,Σ} ∼ (α_1 + ⋯ + α_k)^{−2} χ²_1.

It is not difficult to show that the conjecture holds when Σ is diagonal.

Proof under independence. Let Z be a standard normal random variable, and α > 0. Then α²/Z² follows the one-sided stable distribution of index 1/2 with parameter α, which has the density

(1.3)  p_α(x) = (α/√(2π)) x^{−3/2} e^{−α²/(2x)},  x > 0.

The law in (1.3) is the distribution of the first passage time of a Brownian motion to the level α (Feller, 1966). Hence, it obeys the convolution rule

(1.4)  p_α * p_β = p_{α+β},  α, β > 0.

When f(x) = x_1^{α_1} ⋯ x_k^{α_k} and Σ = (σ_{ij}) is diagonal with σ_{11}, ..., σ_{kk} > 0, then

(1.5)  1/W_{f,Σ} = α_1² σ_{11}/X_1² + ⋯ + α_k² σ_{kk}/X_k².

By (1.4), the distribution of 1/W_{f,Σ} is that of (α_1 + ⋯ + α_k)²/Z². Therefore,

(1.6)  W_{f,Σ} ∼ (α_1 + ⋯ + α_k)^{−2} χ²_1,

as claimed in Conjecture 1.1.
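The proof just given covers diagonal Σ only; for dependent coordinates the conjecture can at least be probed by simulation. The sketch below is ours and uses the double-sum form of 1/W_{f,Σ} stated in Remark 1.2 below, with a randomly chosen non-diagonal Σ.

```python
# Monte Carlo probe (ours) of Conjecture 1.1 for a dependent normal vector.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 0.5])              # nonnegative exponents, not all zero
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 0.5 * np.eye(3)              # generic covariance, positive diagonal
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
inv_W = np.einsum('ni,ij,nj->n', alpha / X, Sigma, alpha / X)
W = 1.0 / inv_W
print(stats.kstest(W * alpha.sum() ** 2, stats.chi2(df=1).cdf))   # Conjecture 1.1: ~ chi2_1
```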

The preceding argument can be traced back to Shepp (1964); see also Cohen (1981), Reid (1987), Quine (1994) and DasGupta and Shepp (2004). However, if X is a dependent random vector, the argument no longer applies. The case k = 2, α_1 = α_2 = 1 and Σ arbitrary was proved by Glonek (1993); see Theorem 2.3 below. We prove the general statement for k = 2 as Theorem 3.1.

Remark 1.2. When simplified as in (1.5), the random variable W_{f,Σ} is well-defined when allowing α_1, ..., α_k to take nonnegative real as opposed to nonnegative integer values, and the above proof under independence goes through in that case as well. Subsequently, we will thus consider the random variable W_{f,Σ} for a monomial f(x_1, ..., x_k) = x_1^{α_1} x_2^{α_2} ⋯ x_k^{α_k} with α_1, ..., α_k nonnegative real. To be precise, we then refer to W_{f,Σ} rewritten as

(1.7)  W_{f,Σ} = ( Σ_{i=1}^k Σ_{j=1}^k σ_{ij} α_i α_j / (X_i X_j) )^{−1}.

With this convention, we believe Conjecture 1.1 to be true for α_1, ..., α_k nonnegative real.

2. Wald tests. To make the connection between Wald tests and the random variables W_{f,Σ} from (1.1) explicit, suppose that θ ∈ R^k is a parameter of a statistical model and that, based on a sample of size n, we wish to test the hypothesis

(2.1)  H_0: γ(θ) = 0  versus  H_1: γ(θ) ≠ 0

for a continuously differentiable function γ: R^k → R. Suppose further that there is a √n-consistent estimator θ̂ of θ such that, as n → ∞, we have the convergence in distribution √n(θ̂ − θ) →_d N_k(0, Σ(θ)), where the asymptotic covariance matrix Σ(θ) is a continuous function of the parameter. The Wald statistic for testing (2.1) is the ratio

(2.2)  T_γ = γ(θ̂)² / var̂[γ(θ̂)] = n γ(θ̂)² / ((∇γ(θ̂))^T Σ(θ̂) ∇γ(θ̂)),

where the denominator of the right-most term estimates the asymptotic variance of γ(θ̂), which by the delta method is given by (∇γ(θ))^T Σ(θ) ∇γ(θ).
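As a small illustration of (2.2), the following sketch is ours; gamma, its gradient, the estimate and the plug-in covariance are placeholder inputs supplied by the user.

```python
# Sketch (ours) of the Wald statistic T_gamma in (2.2) for a single constraint.
import numpy as np

def wald_statistic(gamma, grad_gamma, theta_hat, Sigma_hat, n):
    """n * gamma(theta_hat)^2 / (grad gamma(theta_hat)^T Sigma_hat grad gamma(theta_hat))."""
    g = grad_gamma(theta_hat)
    return n * gamma(theta_hat) ** 2 / (g @ Sigma_hat @ g)

# Toy inputs for the product constraint gamma(theta) = theta_1 * theta_2.
theta_hat = np.array([0.03, -0.02])
Sigma_hat = np.array([[1.0, 0.2], [0.2, 1.0]])
T = wald_statistic(lambda t: t[0] * t[1], lambda t: np.array([t[1], t[0]]),
                   theta_hat, Sigma_hat, n=500)
print(T)   # compare with the 0.95 quantile of chi-square_1 (about 3.84)
```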

Consider now a true distribution from H_0, that is, the true parameter satisfies γ(θ) = 0. Without loss of generality, we assume that θ = 0. If the gradient is nonzero at the true parameter, then the limiting distribution of T_γ is the distribution of the random variable in (1.2) with a = ∇γ(0) ≠ 0 and Σ = Σ(0). Hence, the limit is χ²_1. However, if ∇γ(0) = 0 (i.e., the constraint γ is singular at the true parameter), then the asymptotic distribution of T_γ is no longer χ²_1 but rather given by (1.1) with the polynomial f having higher degree; the degree of f is determined by how many derivatives of γ vanish at the true parameter.

Proposition 2.1. Assume that γ(0) = 0 and that there is a homogeneous polynomial f of degree d ≥ 2 such that, as x → 0,

γ(x) = f(x) + o(‖x‖^d)  and  ∇γ(x) = ∇f(x) + o(‖x‖^{d−1}).

If √n θ̂ →_d N_k(0, Σ), then T_γ →_d W_{f,Σ}.

Example 2.2. Glonek (1993) studied testing collapsibility properties of contingency tables. Under an assumption of no three-way interaction, collapsibility with respect to a chosen margin amounts to the vanishing of at least one of two pairwise interactions, which we here simply denote by θ_1 and θ_2. In the (θ_1, θ_2)-plane, the hypothesis is the union of the two coordinate axes, which can be described as the solution set of γ(θ_1, θ_2) = θ_1θ_2 = 0 and tested using the Wald statistic T_γ based on maximum likelihood estimates of θ_1 and θ_2. The hypothesis is singular at the origin, as reflected by the vanishing of ∇γ when θ_1 = θ_2 = 0. Away from the origin, T_γ has the expected asymptotic χ²_1 distribution. At the origin, by Proposition 2.1, T_γ converges to W_{f,Σ}, where f(x) = x_1x_2 and Σ is the asymptotic covariance matrix of the two maximum likelihood estimates. The main result of Glonek (1993), stated as a theorem below, gives the distribution of W_{f,Σ} in this case. Glonek's surprising result clarifies that the Wald test for this hypothesis is conservative at (and in finite samples near) the intersection of the two sets making up the null hypothesis.

Theorem 2.3 (Glonek, 1993). If f(x) = x_1x_2 and Σ is any positive semidefinite 2×2 matrix with positive diagonal entries, then W_{f,Σ} ∼ (1/4) χ²_1.
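Theorem 2.3 is easy to confirm by simulation. The following sketch (ours) draws W_{f,Σ} for f(x) = x_1x_2 under a correlated covariance matrix and compares 4W_{f,Σ} with χ²_1.

```python
# Monte Carlo check (ours) of Theorem 2.3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.7], [0.7, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
grad = X[:, ::-1]                               # gradient of x1*x2 is (x2, x1)
W = (X[:, 0] * X[:, 1]) ** 2 / np.einsum('ni,ij,nj->n', grad, Sigma, grad)
print(stats.kstest(4.0 * W, stats.chi2(df=1).cdf))   # consistent with chi-square_1 / 4
```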

Before turning to concrete problems, we make two simple observations that we will use to bring (f, Σ) into a convenient form.

Lemma 2.4. Let f ∈ R[x_1, ..., x_k] be a homogeneous polynomial, and let Σ be a positive semidefinite k×k matrix with positive diagonal entries.
(i) If c ∈ R \ {0} is a nonzero scalar, then W_{cf,Σ} = W_{f,Σ}.
(ii) If B is an invertible k×k matrix, then W_{f∘B, B^{−1}Σ(B^{−1})^T} has the same distribution as W_{f,Σ}.

Proof. (i) Obvious, since ∇(cf) = c∇f. (ii) Let X ∼ N(0, Σ) and define Y = B^{−1}X ∼ N(0, B^{−1}Σ(B^{−1})^T). Then f(X) = (f∘B)(Y) and ∇(f∘B)(Y) = B^T ∇f(X). Substituting into (1.1) gives

W_{f,Σ} = (f∘B)(Y)² / ((∇(f∘B)(Y))^T B^{−1}Σ(B^{−1})^T ∇(f∘B)(Y)) = W_{f∘B, B^{−1}Σ(B^{−1})^T}.

3. Bivariate Monomials. In this section, we study the random variable W_{f,Σ} when f(x) = x_1^{α_1} x_2^{α_2}. If the exponents α_1, α_2 are positive integers, then f is a bivariate monomial. However, all our arguments go through for the slightly more general case in which α_1, α_2 are positive real numbers; recall Remark 1.2. Our main result is that the distribution of W_{f,Σ} does not depend on Σ.

Theorem 3.1. Let f(x) = x_1^{α_1} x_2^{α_2} with α_1, α_2 > 0, and let Σ be any positive semidefinite 2×2 matrix with positive diagonal entries. Then

W_{f,Σ} ∼ (α_1 + α_2)^{−2} χ²_1.

Proof. As shown in Section 1, the claim is true if Σ = (σ_{ij}) is diagonal. It thus suffices to show that W_{f,Σ} has the same distribution as W_f := W_{f,I}. By Lemma 2.4, we can assume without loss of generality that σ_{11} = σ_{22} = 1 and ρ := σ_{12} > 0. Since

1/W_{f,Σ} = α_1²/X_1² + 2ρ α_1α_2/(X_1X_2) + α_2²/X_2²,

we can also assume α_1 = 1 for simplicity. With σ = α_2, we have

1/W_{f,Σ} = 1/X_1² + 2ρσ/(X_1X_2) + σ²/X_2²,

and need to show that

W_{f,Σ} ∼ (1 + σ)^{−2} χ²_1.

If ρ = 1, then X_1 and X_2 are almost surely equal and it is clear that W_{f,Σ} has the same distribution as W_f. Hence, it remains to consider 0 < ρ < 1. Let Z_1 and Z_2 be independent standard normal random variables. When expressing Z_1 = R cos(Ψ) and Z_2 = R sin(Ψ) in polar coordinates, it holds that R and Ψ are independent, and Ψ is uniformly distributed over [0, 2π]. Let ρ = sin(φ) with 0 ≤ φ < π/2; then the joint distribution of X_1 and X_2 can be represented as

X_1 = R cos(Ψ − φ/2),  X_2 = R sin(Ψ + φ/2),

which leads to

1/W_{f,Σ} = T/R²,

with

T = 1/cos²(Ψ − φ/2) + 2σ sin(φ)/(cos(Ψ − φ/2) sin(Ψ + φ/2)) + σ²/sin²(Ψ + φ/2).

Routine trigonometric calculations show that 1/T can be expressed as a function of the doubled angle 2Ψ. More precisely,

1/T = (1/4) t(2Ψ, φ),

where

t(ψ, φ) = [2 − cos(2φ) − cos(2ψ) + 2cos(ψ − φ) − 2cos(ψ + φ)] / [1 − σ cos(2φ) + (1 + σ)(σ + σ cos(ψ − φ) − cos(ψ + φ))].

Since 2Ψ is uniformly distributed on [0, 4π], the distribution of T is independent of φ if and only if the same is true for the distribution of T_1 = t(Ψ, φ). We proceed by calculating the moments of T_1 and show that they are independent of φ. For each 0 ≤ φ < π/2, there exists a small interval L = [φ − ε, φ + ε] such that, for every m ≥ 1, the function

sup_{φ′ ∈ L} | [t(ψ, φ′)]^{m−1} ∂t(ψ, φ′)/∂φ |

is integrable over 0 ≤ ψ < 2π. Therefore, we have

(3.1)  ∂/∂φ E(T_1^m) = (m/2π) ∫_0^{2π} [t(ψ, φ)]^{m−1} ∂t(ψ, φ)/∂φ dψ.

(The expression for ∂t(ψ, φ)/∂φ is long, so we omit it here.)

We introduce the complex numbers z = e^{iψ} and a = e^{iφ}, and express the functions t(ψ, φ) and ∂t(ψ, φ)/∂φ in terms of z and a:

t(ψ, φ) = u(z, a) = (a − z)²(1 + az)² / [z(a + aσ + a²z − σz)(1 + a²σ − az − aσz)],

∂t(ψ, φ)/∂φ = v(z, a) = a(a − z)(1 + az)(1 + a²σ + 2az − 2aσz + a²z² + σz²)(1 − a − aσ − z − a²z + σz + a²σz − az² − aσz²) / [iz(a + aσ + a²z − σz)²(1 + a²σ − az − aσz)²].

The integral in (3.1) can be computed as a complex contour integral over the unit circle T = {z : |z| = 1}:

∫_0^{2π} [t(ψ, φ)]^{m−1} ∂t(ψ, φ)/∂φ dψ = ∮_T [u(z, a)]^{m−1} v(z, a) dz/(iz).

Let q(z, a) = [u(z, a)]^{m−1} v(z, a)/(iz). As a function of z, it has two poles within the unit disc. These two poles are at z_0 = 0 and z_1 = (a²σ − 1)/(a + aσ) and have the same order m + 1. By the Residue Theorem,

(3.2)  (1/2πi) ∮_T q(z, a) dz = Res(q; 0) + Res(q; z_1),

where Res(q; 0) and Res(q; z_1) are the residues at 0 and z_1, respectively. Let ζ_0 = {c e^{iψ} : 0 ≤ ψ ≤ 2π} be a small circle around 0 such that z_1 lies outside the circle. Let S be the Möbius transform

S(w) = (z_1 − w)/(1 − z̄_1 w).

Then S is one-to-one from the unit disk onto itself, maps 0 to z_1, and maps ζ_0 to a closed curve ζ_1 = {S(c e^{iψ}) : 0 ≤ ψ ≤ 2π} around z_1 with winding number one. It holds that

Res(q; z_1) = (1/2πi) ∮_{ζ_1} q(z, a) dz = (1/2πi) ∮_{ζ_0} q(S(w), a) S′(w) dw.

It also holds that

q(S(w), a) S′(w) = −q(w, a),

from which we deduce

(1/2πi) ∮_{ζ_0} q(S(w), a) S′(w) dw = −(1/2πi) ∮_{ζ_0} q(w, a) dw = −Res(q; 0).

Hence, the integral in (3.2) is zero. We have shown that the integral in (3.1) is zero for every m ≥ 1, which means that the moments of T_1 do not depend on φ for 0 ≤ φ < π/2. When φ = 0, the random variable T_1 is bounded, so its moments uniquely determine the distribution. Therefore, the distribution of T_1 does not depend on φ, and the proof is complete.

Remark 3.2. If α_1 = α_2, then Theorem 3.1 reduces to Theorem 2.3. In this case, our proof above would only need to treat σ = 1. Glonek's proof of Theorem 2.3 finds the distribution function of a random variable related to our T. If σ = 1, this requires solving a quadratic equation. When σ ≠ 1, we were unable to extend this approach, as a complicated quartic equation arises in the computation of the distribution function. We thus turned to the presented method of moments.

Let X = (X_1, X_2)^T and Y = (Y_1, Y_2)^T be two independent N_2(0, Σ) random vectors, where Σ has positive diagonal entries. Let p_1, p_2 be nonnegative numbers such that p_1 + p_2 = 1. The random variable

Q = (p_1 X_2 Y_1 + p_2 X_1 Y_2) / √((p_1X_2, p_2X_1) Σ (p_1X_2, p_2X_1)^T)

has the standard normal distribution, and is independent of X. For f(x) = x_1^{p_1} x_2^{p_2}, let

V_{f,Σ} = f(X) / √((∇f(X))^T Σ ∇f(X))   and   W_{f,Σ} = V²_{f,Σ}.

Then

(3.3)  p_1 Y_1/X_1 + p_2 Y_2/X_2 = Q / V_{f,Σ}.

By taking the conditional expectation given V_{f,Σ}, the characteristic function of (3.3) is seen to be

E[exp{itQ/V_{f,Σ}}] = E[exp{−t²/(2W_{f,Σ})}].

The uniqueness of the moment generating function for positive random variables (Billingsley, 1995, Thm. 22.2) yields that (3.3) has a standard Cauchy distribution (with characteristic function e^{−|t|}) if and only if W_{f,Σ} ∼ χ²_1. Therefore, we have the following equivalent version of Theorem 3.1.

Corollary 3.3. Let X = (X_1, X_2)^T and Y = (Y_1, Y_2)^T be independent N_2(0, Σ) random vectors, where Σ has positive diagonal entries. If p_1, p_2 are nonnegative numbers such that p_1 + p_2 = 1, then the random variable

p_1 Y_1/X_1 + p_2 Y_2/X_2

has the standard Cauchy distribution.

4. Quadratic Forms. In this section, we consider the distribution of W_{f,Σ} when f is a quadratic form, that is,

f(x_1, x_2, ..., x_k) = Σ_{1 ≤ i ≤ j ≤ k} a_{ij} x_i x_j

for real coefficients a_{ij}. Equivalently,

(4.1)  f(x_1, x_2, ..., x_k) = x^T A x,

where A = (ā_{ij}) is symmetric, with ā_{ii} = a_{ii} and ā_{ij} = ā_{ji} = a_{ij}/2 for i < j.

4.1. Canonical form. Let I denote the k×k identity matrix. We use the shorthand W_f := W_{f,I} when the covariance matrix Σ is the identity.

Lemma 4.1. If f ∈ R[x_1, ..., x_k] is homogeneous of degree d and Σ is a positive semidefinite k×k matrix, then W_{f,Σ} has the same distribution as W_g, where g is a homogeneous degree d polynomial in rank(Σ) many variables.

Proof. If Σ has full rank, then Σ = BB^T for an invertible matrix B. Use Lemma 2.4(ii) to transform W_{f,Σ} to W_g, where g = f∘B is homogeneous of degree d. If Σ has rank m < k, then Σ = B E_m B^T, where B is invertible and E_m is zero apart from the first m diagonal entries, which are equal to one. Form g by substituting x_{m+1} = ⋯ = x_k = 0 into f∘B.

Further simplifications are possible for a treatment of the random variables W_f. In the case of quadratic forms, we may restrict attention to canonical forms

f(x) = λ_1 x_1² + ⋯ + λ_k x_k²,

as shown in the next lemma.

Lemma 4.2. Let f(x) = x^T A x be a quadratic form given by a symmetric k×k matrix A ≠ 0. If Σ is a positive definite k×k matrix and λ_1, ..., λ_k are the eigenvalues of AΣ, then W_{f,Σ} has the same distribution as

(4.2)  (λ_1 Z_1² + ⋯ + λ_k Z_k²)² / (4(λ_1² Z_1² + ⋯ + λ_k² Z_k²)),

where Z_1, ..., Z_k are independent standard normal random variables.

Proof. Write Σ = BB^T for an invertible matrix B. By Lemma 2.4(ii), W_{f,Σ} has the same distribution as W_g with g(x) = x^T(B^T A B)x. Let Q^T(B^T A B)Q = diag(λ_1, ..., λ_k) be the spectral decomposition of B^T A B, with Q orthogonal. Then λ_1, ..., λ_k are also the eigenvalues of AΣ. Applying Lemma 2.4(ii) again, we find that W_{f,Σ} has the same distribution as W_h with

h(x) = x^T(Q^T B^T A B Q)x = λ_1 x_1² + ⋯ + λ_k x_k².

Since ∇h(x) = 2(λ_1 x_1, ..., λ_k x_k)^T, the claim follows.

In (4.2), the set of eigenvalues {λ_i : 1 ≤ i ≤ k} can be scaled to {cλ_i : 1 ≤ i ≤ k} for any c ≠ 0 without changing the distribution; recall also Lemma 2.4(i). For instance, we may scale one nonzero eigenvalue to become equal to one. When all (scaled) λ_i are in {−1, 1}, the description of the distribution of W_{f,Σ} can be simplified. We write Beta(α, β) for the Beta distribution with parameters α, β > 0.

Lemma 4.3. Let k_1 and k_2 be two positive integers, and let k = k_1 + k_2. If

f(x_1, ..., x_k) = x_1² + ⋯ + x_{k_1}² − x_{k_1+1}² − ⋯ − x_{k_1+k_2}²,

then W_f has the same distribution as (1/4) R² (2B − 1)², where R² and B are independent, R² ∼ χ²_k, and B ∼ Beta(k_1/2, k_2/2).

Proof. The distribution of W_f is that of

(Z_1² + ⋯ + Z_{k_1}² − Z_{k_1+1}² − ⋯ − Z_{k_1+k_2}²)² / (4(Z_1² + ⋯ + Z_k²))

with Z_1, ..., Z_k independent and standard normal. Let

Y_1 := Z_1² + ⋯ + Z_{k_1}² ∼ χ²_{k_1},  Y_2 := Z_{k_1+1}² + ⋯ + Z_k² ∼ χ²_{k_2}.

Then R² := Y_1 + Y_2 ∼ χ²_k. Representing Z_1, ..., Z_k in polar coordinates shows that R² and W_f/R² are independent (Muirhead, 1982, Thm. 1.5.5). Since B = Y_1/(Y_1 + Y_2) ∼ Beta(k_1/2, k_2/2) and (Y_1 − Y_2)²/(Y_1 + Y_2)² = (2B − 1)², we deduce that the two random variables W_f/R² and (1/4)(2B − 1)² have the same distribution.

We note that when k = 4 and k_1 = k_2 = 2, Lemma 4.3 gives the equality of distributions

(4.3)  W_f =_d (1/4) R² U².

The equality holds because, in this special case, U := Y_1/(Y_1 + Y_2) is uniformly distributed on [0, 1], and (2U − 1)² has the same distribution as U². The distribution from (4.3) will appear in Section 5. For general eigenvalues λ_i, it seems that the distribution from (4.2) cannot be described in as simple terms.

4.2. Classification of bivariate quadratic forms. We now turn to the bivariate case (k = 2), that is, we are considering a quadratic form in two variables,

f(x_1, x_2) = a x_1² + 2b x_1x_2 + c x_2².

In this case, we are able to give a full classification of the possible distributions of W_f in terms of linear combinations of a pair of independent χ²_1 random variables; see Johnson, Kotz and Balakrishnan (1994, Sect. 18.8) for a discussion of such distributions. Our classification reveals that for k = 2 the distributions for quadratic forms are stochastically bounded below and above by (1/4)χ²_1 and (1/4)χ²_2, respectively.

Theorem 4.4. Let Σ be a positive definite 2×2 matrix, and let f(x_1, x_2) = a x_1² + 2b x_1x_2 + c x_2² be a nonzero quadratic form with matrix

A := ( a  b ; b  c ) ≠ 0.

(a) If b² − ac ≥ 0, then W_{f,Σ} ∼ (1/4) χ²_1.
(b) If b² − ac < 0, then

W_{f,Σ} =_d (1/4)( Z_1² + (4 det(AΣ)/tr(AΣ)²) Z_2² ),

where Z_1 and Z_2 are independent standard normal random variables.
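Part (b) of Theorem 4.4 can be checked numerically. The sketch below is ours; it simulates W_{f,Σ} for a definite bivariate quadratic form and compares it with a sample from the classified limit.

```python
# Simulation sketch (ours) of Theorem 4.4(b).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.5], [0.5, 1.0]])          # b^2 - ac < 0, so case (b) applies
Sigma = np.array([[1.0, -0.4], [-0.4, 0.8]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
grad = 2.0 * X @ A                               # gradient of x^T A x
W = np.einsum('ni,ij,nj->n', X, A, X) ** 2 / np.einsum('ni,ij,nj->n', grad, Sigma, grad)
c = 4.0 * np.linalg.det(A @ Sigma) / np.trace(A @ Sigma) ** 2
Z = rng.standard_normal((200_000, 2))
ref = (Z[:, 0] ** 2 + c * Z[:, 1] ** 2) / 4.0   # the distribution in part (b)
print(stats.ks_2samp(W, ref))
```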

Before giving a proof of the theorem, we would like to point out that the key insight, Lemma 4.5 below, can also be obtained from a theorem of Marianne Mora that is based on properties of the Cauchy distribution (Seshadri, 1993, Theorem 2.3).

Proof. (a) When the discriminant b² − ac ≥ 0, then f factors into a product of two linear forms. The joint distribution of the two linear forms is bivariate normal. Writing Σ̃ for the covariance matrix of the linear forms, the distribution of W_{f,Σ} is equal to the distribution of W_{g,Σ̃} with g(x_1, x_2) = x_1x_2. Hence, the distribution is (1/4)χ²_1 by Theorem 2.3/Theorem 3.1.

(b) In this case, the discriminant is negative and f does not factor. By Lemma 4.2, we can assume Σ = I and consider the distribution of W_f for f(x_1, x_2) = λ_1 x_1² + λ_2 x_2², where λ_1 and λ_2 are the eigenvalues of AΣ. Since det(AΣ) = λ_1λ_2 and tr(AΣ) = λ_1 + λ_2, to prove the claim we must show that in this case

(4.4)  W_f =_d (1/4)( Z_1² + (4λ_1λ_2/(λ_1+λ_2)²) Z_2² ) = (1/4)( Z_1² + (4c/(1+c)²) Z_2² ),

where c = λ_2/λ_1 > 0. To show (4.4) we use polar coordinates again. So represent the two considered independent standard normal random variables as X_1 = R cos(Ψ) and X_2 = R sin(Ψ), where R² ∼ χ²_2 and Ψ ∼ Uniform[0, 2π] are independent. Then

W_f = R² [cos(Ψ)² + c sin(Ψ)²]² / (4[cos(Ψ)² + c² sin(Ψ)²])
    = (R²/4) ( 1 − ((1−c)²/(1+c)²) · (1+c)² cos(Ψ)² sin(Ψ)² / (cos(Ψ)² + c² sin(Ψ)²) ).

Using Lemma 4.5, we have

(4.5)  W_f =_d (R²/4)( 1 − ((1−c)²/(1+c)²) cos(Ψ)² ) = (R²/4)( (4c/(1+c)²) cos(Ψ)² + sin(Ψ)² ).

This is the claim from (4.4) because R cos(Ψ) and R sin(Ψ) are independent and standard normal.

Lemma 4.5. If c ≥ 0 and Ψ has a uniform distribution over [0, 2π], then

S_c(Ψ) := (1+c)² cos(Ψ)² sin(Ψ)² / (cos(Ψ)² + c² sin(Ψ)²) =_d cos(Ψ)².

Proof. Let R² ∼ χ²_2 be independent of Ψ. Then R sin(Ψ) and R cos(Ψ) are independent and standard normal. Therefore,

1/(R² S_c(Ψ)) = (1+c)^{−2}/[R sin(Ψ)]² + c²(1+c)^{−2}/[R cos(Ψ)]²

is the sum of two independent random variables that follow the one-sided stable distribution of index 1/2. Since c > 0, the first summand has the stable distribution with parameter 1/(1+c) and the second summand has parameter c/(1+c). Hence, by (1.4), their sum follows a stable law with parameter 1. Expressing this in terms of the reciprocals,

R² S_c(Ψ) =_d R² cos(Ψ)² ∼ χ²_1.

It follows that S_c(Ψ) has the same distribution as cos(Ψ)². For instance, we may argue that S_c(Ψ) and cos(Ψ)² have identical moments, which implies equality of the distributions as both are compactly supported.

The claim of Lemma 4.5 is false for c < 0. Indeed, the distribution of S_c(Ψ) varies with c when c < 0.

4.3. Stochastic bounds. To understand possible conservativeness of Wald tests, it is interesting to look for stochastic bounds on W_{f,Σ} that hold for all f and Σ. We denote the stochastic ordering of two random variables by U ≤_st V when P(U > t) ≤ P(V > t) for all t ∈ R.

Proposition 4.6. If f ∈ R[x_1, ..., x_k] is a quadratic form and Σ is any nonzero positive semidefinite k×k matrix, then W_{f,Σ} ≤_st (1/4) χ²_k. Equality is achieved when f(x) = x_1² + ⋯ + x_k² and Σ is the identity matrix.

Proof. The second claim is obvious. For the first claim, without loss of generality, we can restrict our attention to the distributions from (4.2). The Cauchy–Schwarz inequality gives

(λ_1 Z_1² + ⋯ + λ_k Z_k²)² ≤ (Z_1² + ⋯ + Z_k²)(λ_1² Z_1² + ⋯ + λ_k² Z_k²),

so that

(λ_1 Z_1² + ⋯ + λ_k Z_k²)² / (4(λ_1² Z_1² + ⋯ + λ_k² Z_k²)) ≤ (1/4)(Z_1² + ⋯ + Z_k²),

which is the desired chi-square bound.

The considered Wald test rejects the hypothesis that γ(θ) = 0 when the statistic T_γ from (2.2) exceeds c_α, where c_α is the (1 − α) quantile of the χ²_1 distribution. Let k_α be the largest degrees of freedom k such that a (1/4)χ²_k random variable exceeds c_α with probability at most α. According to Proposition 4.6, if the true parameter is a singularity at which γ can be approximated by a quadratic form in at most k_α variables, then the Wald test is guaranteed to be asymptotically conservative. Some values are k_{0.05} = 7, k_{0.01} = 16, k_{0.005} = 20 and k_{0.001} = 29.

Turning to a lower bound, we can offer the following simple observation.

Proposition 4.7. Suppose the quadratic form f is given by a symmetric k×k matrix A ≠ 0, and suppose that Σ is a positive definite k×k matrix such that all eigenvalues of AΣ are nonnegative. Then W_{f,Σ} ≥_st (1/4) χ²_1.

Proof. Let λ_1, ..., λ_k ≥ 0 be the eigenvalues of AΣ. By scaling, we can assume without loss of generality that λ_1 = 1 and 0 ≤ λ_i ≤ 1 for 2 ≤ i ≤ k. Then

(λ_1 Z_1² + ⋯ + λ_k Z_k²)² / (4(λ_1² Z_1² + ⋯ + λ_k² Z_k²)) ≥ (λ_1² Z_1² + ⋯ + λ_k² Z_k²)² / (4(λ_1² Z_1² + ⋯ + λ_k² Z_k²)) = (λ_1² Z_1² + ⋯ + λ_k² Z_k²)/4 ≥ Z_1²/4,

and the claim follows from Lemma 4.2.

Proposition 4.7, Theorem 4.4 and simulation experiments lead us to conjecture that (1/4)χ²_1 is still a stochastic lower bound when there are both positive and negative eigenvalues λ_i.

Conjecture 4.8. For any quadratic form f ≠ 0 and any positive semidefinite matrix Σ ≠ 0, the distribution of W_{f,Σ} stochastically dominates (1/4)χ²_1.

While we do not know how to prove this conjecture in general, we are able to treat the special case where the eigenvalues λ_i are either 1 or −1.

Theorem 4.9. Let k_1, k_2 > 0, and k = k_1 + k_2. If

f(x_1, ..., x_k) = x_1² + ⋯ + x_{k_1}² − x_{k_1+1}² − ⋯ − x_{k_1+k_2}²,

then W_f ≥_st (1/4)χ²_1.

Proof. Without loss of generality we assume k_1 ≤ k_2. If k_1 = 0 or k_1 = k_2 = 1, the claim follows from Proposition 4.7 and Theorem 4.4, respectively. We now consider the case k_1 ≥ 1 and k_2 ≥ 2.

By Lemma 4.3, we know that

(4.6)  W_f =_d (1/4) R² (2B − 1)²,

where R² and B are independent, R² ∼ χ²_k, and B ∼ Beta(k_1/2, k_2/2). On the other hand, if B′ ∼ Beta(1/2, (k−1)/2) is independent of R², then

(4.7)  R² B′ ∼ χ²_1.

Let g(x) and h(x) be the density functions of (2B − 1)² and B′, respectively. The comparison of (4.6) and (4.7) shows that it suffices to prove that (2B − 1)² is stochastically larger than B′. We will show a stronger result, namely, that the likelihood ratio g(x)/h(x) is an increasing function over [0, 1]. To simplify the argument, we rescale the density functions to

g(x) ∝ x^{−1/2} [ (1 + √x)^{k_1/2 − 1} (1 − √x)^{k_2/2 − 1} + (1 − √x)^{k_1/2 − 1} (1 + √x)^{k_2/2 − 1} ]

and

h(x) ∝ x^{−1/2} (1 − x)^{(k−3)/2} = x^{−1/2} (1 − √x)^{(k_1+k_2−3)/2} (1 + √x)^{(k_1+k_2−3)/2}.

For our purpose, it is equivalent to show the monotonicity of g(x²)/h(x²), which is proportional to

l(x) := (1 + x)^{(−k_2+1)/2} (1 − x)^{(−k_1+1)/2} + (1 − x)^{(−k_2+1)/2} (1 + x)^{(−k_1+1)/2}.

When k_1 = 1, the derivative of l(x) satisfies

2 l′(x) = (k_2 − 1)(1 − x)^{(−k_2−1)/2} − (k_2 − 1)(1 + x)^{(−k_2−1)/2} > 0

for 0 < x < 1, and thus the likelihood ratio is an increasing function. When k_1 ≥ 2, we have

2 l′(x) (1 + x)^{(k_2+1)/2} (1 − x)^{(k_2+1)/2}
 = (1 + x)[ (k_2 − 1)(1 + x)^{(k_2−k_1)/2} + (k_1 − 1)(1 − x)^{(k_2−k_1)/2} ] − (1 − x)[ (k_1 − 1)(1 + x)^{(k_2−k_1)/2} + (k_2 − 1)(1 − x)^{(k_2−k_1)/2} ]
 > (k_2 − k_1)[ (1 + x)^{(k_2−k_1)/2} − (1 − x)^{(k_2−k_1)/2} ] ≥ 0

for all 0 < x < 1. Therefore, l(x) is an increasing function.
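Conjecture 4.8 can be probed along the lines of the simulation experiments mentioned above. The sketch below is ours; it uses the representation (4.2) with randomly drawn mixed-sign eigenvalues and compares survival functions with the conjectured bound.

```python
# Simulation sketch (ours) probing Conjecture 4.8 via the representation (4.2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
lam = rng.normal(size=5)                         # eigenvalues of A*Sigma, both signs allowed
Z2 = rng.standard_normal((500_000, 5)) ** 2
W = (Z2 @ lam) ** 2 / (4.0 * (Z2 @ lam ** 2))
grid = np.linspace(0.01, 6.0, 50)
surv_W = np.array([(W > t).mean() for t in grid])
surv_bound = stats.chi2(df=1).sf(4.0 * grid)     # survival function of chi-square_1 / 4
print(np.all(surv_W >= surv_bound - 0.003))      # True (up to Monte Carlo noise) supports the conjecture
```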

5. Tetrads. We now turn to the problem that sparked our interest in Wald tests of singular hypotheses, namely, the problem of testing tetrad constraints on the covariance matrix Θ = (θ_{ij}) of a random vector Y in R^p with p ≥ 4. A tetrad is a 2×2 subdeterminant that only involves off-diagonal entries and, without loss of generality, we consider the tetrad

(5.1)  γ(Θ) = θ_{13}θ_{24} − θ_{14}θ_{23} = det( θ_{13}  θ_{14} ; θ_{23}  θ_{24} ).

Example 5.1. Consider a factor analysis model in which the coordinates of Y are linear functions of a latent variable X and noise terms. More precisely, Y_i = β_{0i} + β_i X + ε_i, where X ∼ N(0, 1) is independent of ε_1, ..., ε_p, which in turn are independent normal random variables. Then the covariance between Y_i and Y_j is θ_{ij} = β_iβ_j, and the tetrad from (5.1) vanishes.

Suppose now that we observe a sample of independent and identically distributed random vectors Y^{(1)}, ..., Y^{(n)} with covariance matrix Θ. Let Ȳ_n be the sample mean vector, and let

Θ̂ = (1/n) Σ_{i=1}^n (Y^{(i)} − Ȳ_n)(Y^{(i)} − Ȳ_n)^T

be the empirical covariance matrix. Assuming that the data-generating distribution has finite fourth moments, it holds that √n(Θ̂ − Θ) →_d N_k(0, V(Θ)) with k = p². The rows and columns of the asymptotic covariance matrix V(Θ) are indexed by the pairs ij := (i, j), 1 ≤ i, j ≤ p. Since the tetrad from (5.1) only involves the covariances indexed by the pairs in C = {13, 14, 23, 24}, only the principal submatrix

Σ(Θ) := V(Θ)_{C×C}

is of relevance for the large-sample distribution of the sample tetrad γ(Θ̂). The gradient of the tetrad is

∇γ(Θ) = (θ_{24}, −θ_{23}, −θ_{14}, θ_{13}).

Hence, if at least one of the four covariances in the tetrad is nonzero, the Wald statistic T_γ converges to a χ²_1 distribution.
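For concreteness, the following sketch is ours; the function name tetrad_wald and the normal-theory plug-in for V(Θ) are our choices, and the data are generated under the one-factor model of Example 5.1, where the tetrad vanishes.

```python
# Sketch (ours) of the Wald statistic for the tetrad (5.1), with the normal-theory
# asymptotic covariance V(Theta)_{ij,kl} = theta_ik*theta_jl + theta_il*theta_jk plugged in.
import numpy as np

def tetrad_wald(Y):
    n = Y.shape[0]
    S = np.cov(Y, rowvar=False, bias=True)        # empirical covariance matrix
    pairs = [(0, 2), (0, 3), (1, 2), (1, 3)]      # C = {13, 14, 23, 24}, 0-based
    gamma = S[0, 2] * S[1, 3] - S[0, 3] * S[1, 2]
    grad = np.array([S[1, 3], -S[1, 2], -S[0, 3], S[0, 2]])
    V = np.array([[S[i, k] * S[j, l] + S[i, l] * S[j, k] for (k, l) in pairs]
                  for (i, j) in pairs])
    return n * gamma ** 2 / (grad @ V @ grad)

rng = np.random.default_rng(4)
beta = np.array([1.0, 0.8, 1.2, 0.6])             # one-factor loadings: tetrad vanishes
Y = rng.standard_normal((2000, 1)) * beta + rng.standard_normal((2000, 4))
print(tetrad_wald(Y))
```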

If, on the other hand, θ_{13} = θ_{14} = θ_{23} = θ_{24} = 0, then the large-sample limit of T_γ has the distribution of W_{f,Σ(Θ)}, where f(x) = x_1x_4 − x_2x_3 is a quadratic form in k = 4 variables; recall Proposition 2.1. This form can be written as x^T A x with a matrix that is a Kronecker product, namely

(5.2)  A = (1/2) ( 0  1 ; −1  0 ) ⊗ ( 0  1 ; −1  0 ) = (1/2) ( 0 0 0 1 ; 0 0 −1 0 ; 0 −1 0 0 ; 1 0 0 0 ).

If Y is multivariate normal, then the asymptotic covariance matrix has the entries

V(Θ)_{ij,kl} = θ_{ik}θ_{jl} + θ_{il}θ_{jk}.

In the singular case with θ_{13} = θ_{14} = θ_{23} = θ_{24} = 0, we thus have

Σ(Θ) = ( θ_{11}θ_{33}  θ_{11}θ_{34}  θ_{12}θ_{33}  θ_{12}θ_{34} ;
         θ_{11}θ_{34}  θ_{11}θ_{44}  θ_{12}θ_{34}  θ_{12}θ_{44} ;
         θ_{12}θ_{33}  θ_{12}θ_{34}  θ_{22}θ_{33}  θ_{22}θ_{34} ;
         θ_{12}θ_{34}  θ_{12}θ_{44}  θ_{22}θ_{34}  θ_{22}θ_{44} )
       = ( θ_{11}  θ_{12} ; θ_{12}  θ_{22} ) ⊗ ( θ_{33}  θ_{34} ; θ_{34}  θ_{44} ),

which again is a Kronecker product. We remark that Σ(Θ) would also be a Kronecker product if we had started with an elliptical distribution instead of the normal, compare Iwashita and Siotani (1994, eqn. (2.1)), or if (Y_1, Y_2) and (Y_3, Y_4) were independent in the data-generating distribution. As we show next, in the singular case, the Kronecker structure of the two matrices A and Σ(Θ) gives a limiting distribution of the Wald statistic for the tetrad that does not depend on the block-diagonal covariance matrix Θ.

Theorem 5.2. Let Σ = Σ^{(1)} ⊗ Σ^{(2)} be the Kronecker product of two positive definite 2×2 matrices Σ^{(1)}, Σ^{(2)}. Let f(x) = x_1x_4 − x_2x_3. Then

W_{f,Σ} =_d (1/4) R² U²,

where R² ∼ χ²_4 and U ∼ Uniform[0, 1] are independent.

Proof. Since f is a quadratic form, we may consider the canonical form from Lemma 4.2, which depends on the (real) eigenvalues of AΣ. The claim follows from Lemma 4.3 and the comments in the paragraph following its proof, provided the four eigenvalues of AΣ all have the same absolute value, two of them are positive and two are negative.

Let Σ^{(i)} = (σ^{(i)}_{kl}). Then, by (5.2),

AΣ = (1/2) ( σ^{(1)}_{12}  σ^{(1)}_{22} ; −σ^{(1)}_{11}  −σ^{(1)}_{12} ) ⊗ ( σ^{(2)}_{12}  σ^{(2)}_{22} ; −σ^{(2)}_{11}  −σ^{(2)}_{12} ).

For i = 1, 2, since Σ^{(i)} is positive definite, the matrix

( σ^{(i)}_{12}  σ^{(i)}_{22} ; −σ^{(i)}_{11}  −σ^{(i)}_{12} )

has the imaginary eigenvalues ±λ^{(i)} = ±√( (σ^{(i)}_{12})² − σ^{(i)}_{11}σ^{(i)}_{22} ). It follows that AΣ has the real eigenvalues λ^{(1)}λ^{(2)}/2 and −λ^{(1)}λ^{(2)}/2, each with multiplicity two. Hence, Lemma 4.3 applies with k_1 = k_2 = 2.

The distribution function of (1/4)R²U² is

F_sing(t) = 1 − e^{−2t} + √(2πt) (1 − Φ(2√t)),  t ≥ 0,

where Φ(t) is the distribution function of N(0, 1). The density f_sing(t) of (1/4)R²U² is strictly decreasing on (0, ∞), and f_sing(t) → ∞ as t → 0. In light of Theorem 4.4, it is interesting to note that the distribution of (1/4)R²U² is not the distribution of a linear combination of four independent χ²_1 random variables, because the χ²_d distribution has a finite density at zero when d ≥ 2. However, the distribution satisfies

(1/4)χ²_1 ≤_st (1/4)R²U² ≤_st (1/4)χ²_2.

The first inequality holds according to Theorem 4.9. The second inequality holds because U² ≤ U and R²U ∼ χ²_2. According to the next result, the distribution is also no larger than a χ²_1 distribution, which means that the Wald test of a tetrad constraint is asymptotically conservative at the tetrad's singularities (which are given by block-diagonal covariance matrices).
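The limiting distribution function F_sing is easy to evaluate. The sketch below is ours; it checks F_sing against simulated values of (1/4)R²U² and reports the limiting rejection probability of a nominal 5% tetrad test at the singularity.

```python
# Numerical sketch (ours): F_sing versus simulation, and the limiting size of a nominal 5% test.
import numpy as np
from scipy import stats

def F_sing(t):
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-2.0 * t) + np.sqrt(2.0 * np.pi * t) * (1.0 - stats.norm.cdf(2.0 * np.sqrt(t)))

rng = np.random.default_rng(5)
W = 0.25 * stats.chi2(df=4).rvs(500_000, random_state=rng) * rng.uniform(size=500_000) ** 2
print(stats.kstest(W, F_sing))                    # consistent with the stated F_sing
c = stats.chi2(df=1).ppf(0.95)
print(1.0 - F_sing(c))                            # far below 0.05: the test is conservative
```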

Proposition 5.3. Suppose R² ∼ χ²_4 and U ∼ Uniform[0, 1] are independent. Then

(1/4) R² U² ≤_st χ²_1.

Proof. Let Z_1, ..., Z_4 be independent standard normal random variables. Then the sum of squares

Z_1² + Z_2² + Z_3² + Z_4² =_d R² ∼ χ²_4

and the ratio

Z_1² / (Z_1² + Z_2² + Z_3² + Z_4²) ∼ Beta(1/2, 3/2)

are independent. Hence, the claim holds if and only if (1/2)U ≤_st √B, where U ∼ Uniform[0, 1] and B ∼ Beta(1/2, 3/2). The distribution of U/2 is supported on the interval [0, 1/2], on which it has distribution function F_{U/2}(t) = 2t. For t ∈ (0, 1), the distribution function of √B has first and second derivatives

F′_{√B}(t) = (4/π)√(1 − t²)   and   F″_{√B}(t) = −4t/(π√(1 − t²)).

Hence, F_{√B} is strictly concave on (0, 1) and has a tangent with slope 4/π < 2 at t = 0. Consequently, F_{U/2}(t) ≥ F_{√B}(t) for all t ∈ R, giving the claimed ordering of (1/4)R²U² and the χ²_1 distribution.

6. Conjectures. In Section 3, we mentioned that Theorem 3.1 and Corollary 3.3 are equivalent. Similarly, Conjecture 1.1 is equivalent to the following one.

Conjecture 6.1. Let X = (X_1, X_2, ..., X_k)^T and Y = (Y_1, Y_2, ..., Y_k)^T be independent and have the same distribution N_k(0, Σ), where Σ has positive diagonal entries. If p_1, p_2, ..., p_k are nonnegative numbers such that p_1 + p_2 + ⋯ + p_k = 1, then

p_1 Y_1/X_1 + p_2 Y_2/X_2 + ⋯ + p_k Y_k/X_k

has the standard Cauchy distribution.
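Conjecture 6.1 can again be probed by simulation. The sketch below is ours; it draws a random covariance matrix and nonnegative weights and compares the weighted sum of ratios with the standard Cauchy distribution.

```python
# Monte Carlo probe (ours) of Conjecture 6.1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
k = 4
A = rng.normal(size=(k, k))
Sigma = A @ A.T + np.eye(k)
p = rng.dirichlet(np.ones(k))                     # nonnegative weights summing to one
X = rng.multivariate_normal(np.zeros(k), Sigma, size=300_000)
Y = rng.multivariate_normal(np.zeros(k), Sigma, size=300_000)
print(stats.kstest((p * Y / X).sum(axis=1), stats.cauchy.cdf))
```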

For a proof of this conjecture it is natural to try an induction-type argument, which might involve the ratio of normal random variables with nonzero means (Marsaglia, 1965). However, we were unable to make this work. By taking the reciprocal of W_{f,Σ}, we can translate Conjecture 1.1 into another equivalent form.

Conjecture 6.2. Let X = (X_1, X_2, ..., X_k)^T ∼ N_k(0, Σ) be such that none of its entries is a point mass. If p_1, p_2, ..., p_k are nonnegative numbers such that p_1 + p_2 + ⋯ + p_k = 1, then

(6.1)  [ (p_1/X_1, p_2/X_2, ..., p_k/X_k) Σ (p_1/X_1, p_2/X_2, ..., p_k/X_k)^T ]^{−1} ∼ χ²_1.

Simulation provides strong evidence for the validity of these conjectures. We have tried many randomly generated scenarios with 2 ≤ k ≤ 5, simulating large numbers of values for the rational functions in question. In all cases, empirical distribution functions were indistinguishable from the conjectured χ²_1 or Cauchy distribution functions. On the other hand, the positivity requirement for p_1, p_2, ..., p_k is crucial for the validity of the conjectures. For instance, let Q be the reciprocal of the quantity on the left-hand side of (6.1), and consider the special case where k = 2, var(X_1) = var(X_2) = 1, cor(X_1, X_2) = ρ, and p_1 = −p_2 = 1/2. Assuming that |ρ| < 1, change coordinates to Z_1 = (X_1 + X_2)/√(2(1+ρ)), Z_2 = (X_1 − X_2)/√(2(1−ρ)), and then to polar coordinates Z_1 = R cos Ψ and Z_2 = R sin Ψ. We obtain that

Q = 4 ( 1/X_1² − 2ρ/(X_1X_2) + 1/X_2² )^{−1} = R² [ρ + cos(2Ψ)]² / (1 − ρ²).

The distribution of Q now depends on ρ. For instance, E[Q] = (1 + 2ρ²)/(1 − ρ²).

7. Conclusion. In regular settings, the Wald statistic for testing a constraint on the parameters of a statistical model converges to a χ²_1 distribution as the sample size increases. When the true parameter is a singularity of the constraint, the limiting distribution is instead determined by a rational function of jointly normal random variables (recall Section 2). The distributions of these rational functions are in surprising ways related to chi-square distributions, as we showed in our main results in Sections 3–5.

Our work led to several, in our opinion, intriguing conjectures about the limiting distributions of Wald statistics. Although the conjectures can be stated in elementary terms, we are not aware of any other work that suggests these properties for the multivariate normal distribution. For quadratic forms, the usual canonical form leads to a particular class of distributions parametrized by a collection of eigenvalues (recall Lemma 4.2). It would be interesting to study Schur convexity properties of this class of distributions, which would provide further insight into the asymptotic conservativeness of Wald tests of singular hypotheses. Finally, this paper has focused on testing a single constraint. It would be interesting to develop a general theory for Wald tests of hypotheses that are defined in terms of several constraints. In this setting, the choice of the constraints representing a null hypothesis will play an important role in the distribution theory, as exemplified by Gaffke, Steyer and von Davier (1999) and Gaffke, Heiligers and Offinger (2002).

Acknowledgments. We would like to thank Gérard Letac and Lek-Heng Lim for helpful comments on our conjectures. This work was supported by the NSF under Grant No. DMS. Mathias Drton was also supported by an Alfred P. Sloan Fellowship.

REFERENCES

Azaïs, J.-M., Gassiat, É. and Mercadier, C. (2006). Asymptotic distribution and local power of the log-likelihood ratio test for mixtures: bounded and unbounded cases. Bernoulli.
Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Bollen, K. A., Lennox, R. D. and Dahly, D. L. (2009). Practical application of the vanishing tetrad test for causal indicator measurement models: an example from health-related quality of life. Stat. Med.
Bollen, K. A. and Ting, K.-F. (2000). A tetrad test for causal indicators. Psychological Methods.
Chernoff, H. (1954). On the distribution of the likelihood ratio. Ann. Math. Statistics.
Cohen, E. A. Jr. (1981). A note on normal functions of normal random variables. Comput. Math. Appl.
DasGupta, A. and Shepp, L. (2004). Chebyshev polynomials and G-distributed functions of F-distributed variables. In A Festschrift for Herman Rubin. IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH.
Drton, M. (2009). Likelihood ratio tests and singularities. Ann. Statist.

Drton, M., Massam, H. and Olkin, I. (2008). Moments of minors of Wishart matrices. Ann. Statist.
Drton, M., Sturmfels, B. and Sullivant, S. (2007). Algebraic factor analysis: tetrads, pentads and beyond. Probab. Theory Related Fields.
Drton, M., Sturmfels, B. and Sullivant, S. (2009). Lectures on Algebraic Statistics. Oberwolfach Seminars 39. Birkhäuser Verlag, Basel.
Drton, M. and Williams, B. (2011). Quantifying the failure of bootstrap likelihood ratio tests. Biometrika.
Feller, W. (1966). An Introduction to Probability Theory and Its Applications. Vol. II. John Wiley & Sons, New York.
Gaffke, N., Heiligers, B. and Offinger, R. (2002). On the asymptotic null-distribution of the Wald statistic at singular parameter points. Statist. Decisions.
Gaffke, N., Steyer, R. and von Davier, A. A. (1999). On the asymptotic null-distribution of the Wald statistic at singular parameter points. Statist. Decisions.
Glonek, G. F. V. (1993). On the behaviour of Wald statistics for the disjunction of two regular hypotheses. J. Roy. Statist. Soc. Ser. B.
Harman, H. H. (1976). Modern Factor Analysis, 3rd ed. University of Chicago Press, Chicago, Ill.
Hipp, J. R. and Bollen, K. A. (2003). Model fit in structural equation models with censored, ordinal, and dichotomous variables: testing vanishing tetrads. Sociological Methodology.
Iwashita, T. and Siotani, M. (1994). Asymptotic distributions of functions of a sample covariance matrix under the elliptical distribution. Canad. J. Statist.
Johnson, T. R. and Bodner, T. E. (2007). A note on the use of bootstrap tetrad tests for covariance structures. Struct. Equ. Model.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1994). Continuous Univariate Distributions. Vol. 1, 2nd ed. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Kato, N. and Kuriki, S. (2013). Likelihood ratio tests for positivity in polynomial regressions. J. Multivariate Anal.
Marsaglia, G. (1965). Ratios of normal variables and ratios of sums of uniform variables. J. Amer. Statist. Assoc.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Quine, M. P. (1994). A result of Shepp. Appl. Math. Lett.
Reid, J. G. (1987). Normal functions of normal random variables. Comput. Math. Appl.
Ritz, C. and Skovgaard, I. M. (2005). Likelihood ratio tests in curved exponential families with nuisance parameters present only under the alternative. Biometrika.
Seshadri, V. (1993). The Inverse Gaussian Distribution: A Case Study in Exponential Families. Oxford Science Publications. The Clarendon Press, Oxford University Press, New York.

Shepp, L. (1964). Normal functions of normal random variables. SIAM Rev.
Silva, R., Scheines, R., Glymour, C. and Spirtes, P. (2006). Learning the structure of linear latent variable models. J. Mach. Learn. Res.
Spearman, C. (1904). General intelligence, objectively determined and measured. The American Journal of Psychology.
Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson.
Sullivant, S., Talaska, K. and Draisma, J. (2010). Trek separation for Gaussian graphical models. Ann. Statist.
Zwiernik, P. and Smith, J. Q. (2012). Tree cumulants and the geometry of binary tree models. Bernoulli.

Department of Statistics
University of Washington
Seattle, WA, U.S.A.
md5@uw.edu

Department of Statistics & Biostatistics
Rutgers University
Piscataway, NJ, U.S.A.
hxiao@stat.rutgers.edu


More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

~ g-inverses are indeed an integral part of linear algebra and should be treated as such even at an elementary level.

~ g-inverses are indeed an integral part of linear algebra and should be treated as such even at an elementary level. Existence of Generalized Inverse: Ten Proofs and Some Remarks R B Bapat Introduction The theory of g-inverses has seen a substantial growth over the past few decades. It is an area of great theoretical

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Solutions to Complex Analysis Prelims Ben Strasser

Solutions to Complex Analysis Prelims Ben Strasser Solutions to Complex Analysis Prelims Ben Strasser In preparation for the complex analysis prelim, I typed up solutions to some old exams. This document includes complete solutions to both exams in 23,

More information

F (z) =f(z). f(z) = a n (z z 0 ) n. F (z) = a n (z z 0 ) n

F (z) =f(z). f(z) = a n (z z 0 ) n. F (z) = a n (z z 0 ) n 6 Chapter 2. CAUCHY S THEOREM AND ITS APPLICATIONS Theorem 5.6 (Schwarz reflection principle) Suppose that f is a holomorphic function in Ω + that extends continuously to I and such that f is real-valued

More information

Open Problems in Algebraic Statistics

Open Problems in Algebraic Statistics Open Problems inalgebraic Statistics p. Open Problems in Algebraic Statistics BERND STURMFELS UNIVERSITY OF CALIFORNIA, BERKELEY and TECHNISCHE UNIVERSITÄT BERLIN Advertisement Oberwolfach Seminar Algebraic

More information

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data Journal of Multivariate Analysis 78, 6282 (2001) doi:10.1006jmva.2000.1939, available online at http:www.idealibrary.com on Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone

More information

Control of Directional Errors in Fixed Sequence Multiple Testing

Control of Directional Errors in Fixed Sequence Multiple Testing Control of Directional Errors in Fixed Sequence Multiple Testing Anjana Grandhi Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102-1982 Wenge Guo Department of Mathematical

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 2

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 2 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 2 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory April 5, 2012 Andre Tkacenko

More information

Invertibility of random matrices

Invertibility of random matrices University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]

More information

PRIMARY DECOMPOSITION FOR THE INTERSECTION AXIOM

PRIMARY DECOMPOSITION FOR THE INTERSECTION AXIOM PRIMARY DECOMPOSITION FOR THE INTERSECTION AXIOM ALEX FINK 1. Introduction and background Consider the discrete conditional independence model M given by {X 1 X 2 X 3, X 1 X 3 X 2 }. The intersection axiom

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control

Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Iterative Solution of a Matrix Riccati Equation Arising in Stochastic Control Chun-Hua Guo Dedicated to Peter Lancaster on the occasion of his 70th birthday We consider iterative methods for finding the

More information

arxiv: v1 [math.ra] 13 Jan 2009

arxiv: v1 [math.ra] 13 Jan 2009 A CONCISE PROOF OF KRUSKAL S THEOREM ON TENSOR DECOMPOSITION arxiv:0901.1796v1 [math.ra] 13 Jan 2009 JOHN A. RHODES Abstract. A theorem of J. Kruskal from 1977, motivated by a latent-class statistical

More information

Operators with numerical range in a closed halfplane

Operators with numerical range in a closed halfplane Operators with numerical range in a closed halfplane Wai-Shun Cheung 1 Department of Mathematics, University of Hong Kong, Hong Kong, P. R. China. wshun@graduate.hku.hk Chi-Kwong Li 2 Department of Mathematics,

More information

EE226a - Summary of Lecture 13 and 14 Kalman Filter: Convergence

EE226a - Summary of Lecture 13 and 14 Kalman Filter: Convergence 1 EE226a - Summary of Lecture 13 and 14 Kalman Filter: Convergence Jean Walrand I. SUMMARY Here are the key ideas and results of this important topic. Section II reviews Kalman Filter. A system is observable

More information

On the adjacency matrix of a block graph

On the adjacency matrix of a block graph On the adjacency matrix of a block graph R. B. Bapat Stat-Math Unit Indian Statistical Institute, Delhi 7-SJSS Marg, New Delhi 110 016, India. email: rbb@isid.ac.in Souvik Roy Economics and Planning Unit

More information

Recall the convention that, for us, all vectors are column vectors.

Recall the convention that, for us, all vectors are column vectors. Some linear algebra Recall the convention that, for us, all vectors are column vectors. 1. Symmetric matrices Let A be a real matrix. Recall that a complex number λ is an eigenvalue of A if there exists

More information

1 The linear algebra of linear programs (March 15 and 22, 2015)

1 The linear algebra of linear programs (March 15 and 22, 2015) 1 The linear algebra of linear programs (March 15 and 22, 2015) Many optimization problems can be formulated as linear programs. The main features of a linear program are the following: Variables are real

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

BALANCING GAUSSIAN VECTORS. 1. Introduction

BALANCING GAUSSIAN VECTORS. 1. Introduction BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors

More information

Functions of Several Variables

Functions of Several Variables Functions of Several Variables The Unconstrained Minimization Problem where In n dimensions the unconstrained problem is stated as f() x variables. minimize f()x x, is a scalar objective function of vector

More information

b jσ(j), Keywords: Decomposable numerical range, principal character AMS Subject Classification: 15A60

b jσ(j), Keywords: Decomposable numerical range, principal character AMS Subject Classification: 15A60 On the Hu-Hurley-Tam Conjecture Concerning The Generalized Numerical Range Che-Man Cheng Faculty of Science and Technology, University of Macau, Macau. E-mail: fstcmc@umac.mo and Chi-Kwong Li Department

More information

Spectral inequalities and equalities involving products of matrices

Spectral inequalities and equalities involving products of matrices Spectral inequalities and equalities involving products of matrices Chi-Kwong Li 1 Department of Mathematics, College of William & Mary, Williamsburg, Virginia 23187 (ckli@math.wm.edu) Yiu-Tung Poon Department

More information

Linear Systems and Matrices

Linear Systems and Matrices Department of Mathematics The Chinese University of Hong Kong 1 System of m linear equations in n unknowns (linear system) a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.......

More information

Throughout these notes we assume V, W are finite dimensional inner product spaces over C.

Throughout these notes we assume V, W are finite dimensional inner product spaces over C. Math 342 - Linear Algebra II Notes Throughout these notes we assume V, W are finite dimensional inner product spaces over C 1 Upper Triangular Representation Proposition: Let T L(V ) There exists an orthonormal

More information

Failure of the Raikov Theorem for Free Random Variables

Failure of the Raikov Theorem for Free Random Variables Failure of the Raikov Theorem for Free Random Variables Florent Benaych-Georges DMA, École Normale Supérieure, 45 rue d Ulm, 75230 Paris Cedex 05 e-mail: benaych@dma.ens.fr http://www.dma.ens.fr/ benaych

More information

The extreme rays of the 5 5 copositive cone

The extreme rays of the 5 5 copositive cone The extreme rays of the copositive cone Roland Hildebrand March 8, 0 Abstract We give an explicit characterization of all extreme rays of the cone C of copositive matrices. The results are based on the

More information

arxiv: v1 [math.co] 3 Nov 2014

arxiv: v1 [math.co] 3 Nov 2014 SPARSE MATRICES DESCRIBING ITERATIONS OF INTEGER-VALUED FUNCTIONS BERND C. KELLNER arxiv:1411.0590v1 [math.co] 3 Nov 014 Abstract. We consider iterations of integer-valued functions φ, which have no fixed

More information

Notes on Linear Algebra and Matrix Theory

Notes on Linear Algebra and Matrix Theory Massimo Franceschet featuring Enrico Bozzo Scalar product The scalar product (a.k.a. dot product or inner product) of two real vectors x = (x 1,..., x n ) and y = (y 1,..., y n ) is not a vector but a

More information

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0}

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0} VARIATION OF ITERATED BROWNIAN MOTION Krzysztof Burdzy University of Washington 1. Introduction and main results. Suppose that X 1, X 2 and Y are independent standard Brownian motions starting from 0 and

More information

Supermodular ordering of Poisson arrays

Supermodular ordering of Poisson arrays Supermodular ordering of Poisson arrays Bünyamin Kızıldemir Nicolas Privault Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang Technological University 637371 Singapore

More information

Taylor and Laurent Series

Taylor and Laurent Series Chapter 4 Taylor and Laurent Series 4.. Taylor Series 4... Taylor Series for Holomorphic Functions. In Real Analysis, the Taylor series of a given function f : R R is given by: f (x + f (x (x x + f (x

More information

The Delta Method and Applications

The Delta Method and Applications Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order

More information

1 Last time: least-squares problems

1 Last time: least-squares problems MATH Linear algebra (Fall 07) Lecture Last time: least-squares problems Definition. If A is an m n matrix and b R m, then a least-squares solution to the linear system Ax = b is a vector x R n such that

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Fiedler s Theorems on Nodal Domains

Fiedler s Theorems on Nodal Domains Spectral Graph Theory Lecture 7 Fiedler s Theorems on Nodal Domains Daniel A. Spielman September 19, 2018 7.1 Overview In today s lecture we will justify some of the behavior we observed when using eigenvectors

More information

On Expected Gaussian Random Determinants

On Expected Gaussian Random Determinants On Expected Gaussian Random Determinants Moo K. Chung 1 Department of Statistics University of Wisconsin-Madison 1210 West Dayton St. Madison, WI 53706 Abstract The expectation of random determinants whose

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS

A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS K Y B E R N E T I K A V O L U M E 4 3 ( 2 0 0 7, N U M B E R 4, P A G E S 4 7 1 4 8 0 A MODIFICATION OF THE HARTUNG KNAPP CONFIDENCE INTERVAL ON THE VARIANCE COMPONENT IN TWO VARIANCE COMPONENT MODELS

More information

Problem Set 2. Assigned: Mon. November. 23, 2015

Problem Set 2. Assigned: Mon. November. 23, 2015 Pseudorandomness Prof. Salil Vadhan Problem Set 2 Assigned: Mon. November. 23, 2015 Chi-Ning Chou Index Problem Progress 1 SchwartzZippel lemma 1/1 2 Robustness of the model 1/1 3 Zero error versus 1-sided

More information

Lecture 2: Computing functions of dense matrices

Lecture 2: Computing functions of dense matrices Lecture 2: Computing functions of dense matrices Paola Boito and Federico Poloni Università di Pisa Pisa - Hokkaido - Roma2 Summer School Pisa, August 27 - September 8, 2018 Introduction In this lecture

More information

Math 443 Differential Geometry Spring Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook.

Math 443 Differential Geometry Spring Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook. Math 443 Differential Geometry Spring 2013 Handout 3: Bilinear and Quadratic Forms This handout should be read just before Chapter 4 of the textbook. Endomorphisms of a Vector Space This handout discusses

More information

WEYL S LEMMA, ONE OF MANY. Daniel W. Stroock

WEYL S LEMMA, ONE OF MANY. Daniel W. Stroock WEYL S LEMMA, ONE OF MANY Daniel W Stroock Abstract This note is a brief, and somewhat biased, account of the evolution of what people working in PDE s call Weyl s Lemma about the regularity of solutions

More information

New lower bounds for hypergraph Ramsey numbers

New lower bounds for hypergraph Ramsey numbers New lower bounds for hypergraph Ramsey numbers Dhruv Mubayi Andrew Suk Abstract The Ramsey number r k (s, n) is the minimum N such that for every red-blue coloring of the k-tuples of {1,..., N}, there

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

arxiv: v1 [math.ca] 23 Oct 2018

arxiv: v1 [math.ca] 23 Oct 2018 A REMARK ON THE ARCSINE DISTRIBUTION AND THE HILBERT TRANSFORM arxiv:80.08v [math.ca] 3 Oct 08 RONALD R. COIFMAN AND STEFAN STEINERBERGER Abstract. We prove that if fx) ) /4 L,) and its Hilbert transform

More information

Gaussian Models (9/9/13)

Gaussian Models (9/9/13) STA561: Probabilistic machine learning Gaussian Models (9/9/13) Lecturer: Barbara Engelhardt Scribes: Xi He, Jiangwei Pan, Ali Razeen, Animesh Srivastava 1 Multivariate Normal Distribution The multivariate

More information

Kernels of Directed Graph Laplacians. J. S. Caughman and J.J.P. Veerman

Kernels of Directed Graph Laplacians. J. S. Caughman and J.J.P. Veerman Kernels of Directed Graph Laplacians J. S. Caughman and J.J.P. Veerman Department of Mathematics and Statistics Portland State University PO Box 751, Portland, OR 97207. caughman@pdx.edu, veerman@pdx.edu

More information