WALD CONSISTENCY AND THE METHOD OF SIEVES IN REML ESTIMATION. By Jiming Jiang Case Western Reserve University
The Annals of Statistics 1997, Vol. 25, No. 4

WALD CONSISTENCY AND THE METHOD OF SIEVES IN REML ESTIMATION

By Jiming Jiang

Case Western Reserve University

We prove that for all unconfounded balanced mixed models of the analysis of variance, estimates of variance components parameters that maximize the (restricted) Gaussian likelihood are consistent and asymptotically normal, and this is true whether normality is assumed or not. For a general (nonnormal) mixed model, we show that estimates of the variance components parameters that maximize the (restricted) Gaussian likelihood over a sequence of approximating parameter spaces (i.e., a sieve) constitute a consistent sequence of roots of the REML equations, and that the sequence is also asymptotically normal. The results do not require the rank p of the design matrix of the fixed effects to be bounded. An example shows that, in some unbalanced cases, estimates that maximize the Gaussian likelihood over the full parameter space can be inconsistent under the condition that ensures consistency of the sieve estimates.

1. Introduction. In many cases, exploration of the asymptotic properties of maximum likelihood (ML) estimates has led to one of two types of consistency: the Cramér (1946) or the Wald (1949) type. The former establishes the consistency of some root of the likelihood equation(s), but usually gives no indication of how to identify such a root when the roots of the likelihood equation(s) are not unique. The latter, however, states that an estimate of the parameter vector that maximizes the likelihood function is consistent. Therefore it is not unusual that Wald consistency sometimes fails whereas the Cramér one holds [e.g., Le Cam (1979)]. Often, obstacles to Wald consistency arise on the boundary of the parameter space. For this reason, the maximization is sometimes carried out over a sequence of subspaces which may belong to the interior of the parameter space and approach it as the sample size increases.
Methods of this type have been studied in the literature, among them those that Grenander (1981) called sieves. In this paper we consider the asymptotic behavior of REML (restricted, or residual, maximum likelihood) estimates in variance components estimation. The REML method was first proposed by W. A. Thompson (1962). Several authors have given overviews of REML, which can be found in Harville (1977), Khuri and Sahai (1985), Robinson (1987) and Searle, Casella and McCulloch (1992). A general mixed model can be written as

(1) y = Xβ + Z_1 α_1 + ··· + Z_s α_s + ε,

Received February 1995; revised October.
AMS 1991 subject classification. 62F12.
Key words and phrases. Mixed models, restricted maximum likelihood, Wald consistency, the method of sieves.
1781
where y is an N × 1 vector of observations, X is an N × p known matrix of full rank p, β is a p × 1 vector of unknown constants (the fixed effects), Z_i is an N × m_i known matrix, α_i is an m_i × 1 vector of i.i.d. random variables with mean 0 and variance σ_i² (the random effects), i = 1, ..., s; and ε is an N × 1 vector of i.i.d. random variables with mean 0 and variance σ_0² (the errors). Asymptotic results on maximum likelihood type estimates for model (1) are few in number, with or without normality assumptions. Assuming normality, and the model having a standard ANOVA structure, Miller (1977) proved a result of Cramér consistency and asymptotic normality for the maximum likelihood estimates (MLE) of both the fixed effects and the variance components σ_0², ..., σ_s². Under conditions slightly stronger than those of Miller, Das (1979) obtained a similar result for the REML estimates. Also under the normality assumption, Cressie and Lahiri (1993) gave conditions under which Cramér consistency and asymptotic normality of the REML estimates in a general mixed model would hold, using a result of Sweeting (1980). Without assuming normality, in which case the REML estimates are defined as solutions of the REML equations derived under normality, Richardson and Welsh (1994) proved Cramér consistency and asymptotic normality for hierarchical (nested) mixed models. A common feature of the above results is that the rank p of the design matrix X of the fixed effects was held fixed. A more important and interesting question, related to the (possible) superiority of REML over straight ML in estimating the variance components, is how the REML estimates behave asymptotically when p increases with the sample size. In such situations a well-known example showing the inconsistency of the MLE is due to Neyman and Scott (1948).
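As a concrete numerical companion to model (1), the following sketch simulates a one-way special case (s = 1). All sizes, designs and parameter values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Simulation sketch of model (1) for a one-way special case (s = 1); all sizes
# and parameter values below are illustrative assumptions, not from the paper.
rng = np.random.default_rng(0)

p, m1, reps = 2, 50, 4               # rank of X, number of random effects, obs per effect
N = m1 * reps                        # total number of observations
beta = np.array([1.0, -0.5])         # fixed effects
sig2_err, sig2_a = 1.0, 2.0          # variance components sigma_0^2 and sigma_1^2

X = np.column_stack([np.ones(N), rng.standard_normal(N)])   # N x p, full rank p
Z1 = np.kron(np.eye(m1), np.ones((reps, 1)))                # N x m_1 incidence matrix

alpha1 = rng.standard_normal(m1) * np.sqrt(sig2_a)   # i.i.d., mean 0, variance sigma_1^2
eps = rng.standard_normal(N) * np.sqrt(sig2_err)     # i.i.d. errors
y = X @ beta + Z1 @ alpha1 + eps

# Cov(y) = sigma_1^2 Z_1 Z_1' + sigma_0^2 I_N (block diagonal for this design)
V = sig2_a * Z1 @ Z1.T + sig2_err * np.eye(N)
```

Note that normality of alpha1 and eps is used here only for convenience of simulation; the paper's results do not require it.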
The question was answered recently by Jiang (1996), in which Cramér consistency and asymptotic normality of the REML estimates were proved under the assumption that the model is asymptotically identifiable and infinitely informative. The results do not require boundedness of p, normality, or any structure (such as an ANOVA or nested design) for the model. As a consequence, Cramér consistency was shown to hold for all unconfounded balanced mixed models of the analysis of variance. In contrast to the preceding results, the only consistency result in variance components estimation that might be considered of Wald type was Hartley and Rao (1967) for the MLE. As was noted by Rao (1977), a corrected version of Hartley and Rao's proof would establish the Wald consistency of the MLE, under the assumption that the number of observations falling into any particular level of any random factor stays below a universal constant. To our knowledge, there has been no discussion of the method of sieves in variance components estimation. From both theoretical and practical points of view, consistency of Wald type (nonsieve or sieve) is more gratifying than that of Cramér type, although the former often requires stronger assumptions. Since in nonnormal cases [e.g., Richardson and Welsh (1994), Jiang (1996)] the REML estimates are defined as a solution of the (normal) REML equations, a natural question regarding Wald-type consistency is whether the solution that maximizes the (restricted) Gaussian likelihood is consistent. The aim of this paper is to establish the following.
1. Wald consistency holds for all unconfounded balanced mixed models of the analysis of variance. That is, estimates of the variance components parameters that maximize the restricted Gaussian likelihood are consistent.

2. The method of sieves works for all mixed models, provided the models are asymptotically identifiable and infinitely informative globally. That is, there is a clear way of constructing the sieve which will produce consistent estimates.

3. In both 1 and 2 the estimates are also asymptotically normal.

Again the results do not require normality, boundedness of p, or structures for the model. The Wald consistency in the balanced case is also obtained without the assumption that the true parameter vector is an interior point of the parameter space. We also give an example showing that in some unbalanced cases there is sieve Wald consistency but no Wald consistency. The restricted Gaussian likelihood is given in Section 2. Wald consistency in the balanced case is proved in Section 3. In Section 4 we discuss the method of sieves for the general mixed model (1). An example and a counterexample are given in Section 5. In Section 6 we make some remarks about the techniques we use and about open problems.

2. Definitions, basic assumptions, and a simple lemma. There are two parametrizations of the variance components: θ_0 = λ = σ_0², θ_i = µ_i = σ_i²/σ_0², 1 ≤ i ≤ s [Hartley and Rao (1967)], and φ_i = σ_i², 0 ≤ i ≤ s. There is a one-to-one correspondence between the two sets of parameters. In this paper, estimates of the two sets of parameters are equivalent in the sense that consistency for one set of parameters implies that for the other. The parameter spaces are Θ = {θ : λ > 0, µ_i ≥ 0, 1 ≤ i ≤ s} and Φ = {φ : σ_0² > 0, σ_i² ≥ 0, 1 ≤ i ≤ s}. The true parameter vectors will be denoted by θ_0 = (θ_0i)_{0≤i≤s} = (λ_0, µ_0) = (λ_0, (µ_0i)_{1≤i≤s}) and φ_0 = (φ_0i)_{0≤i≤s}.
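The one-to-one correspondence between the two parametrizations can be written down directly; a minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

# The two parametrizations of Section 2: theta = (lambda, mu_1, ..., mu_s) with
# lambda = sigma_0^2 and mu_i = sigma_i^2 / sigma_0^2, versus
# phi = (sigma_0^2, ..., sigma_s^2).  The map is one-to-one because sigma_0^2 > 0.
def theta_to_phi(lam, mu):
    return np.concatenate([[lam], lam * np.asarray(mu, dtype=float)])

def phi_to_theta(phi):
    phi = np.asarray(phi, dtype=float)
    return phi[0], phi[1:] / phi[0]      # lambda = phi_0, mu_i = phi_i / phi_0
```

Consistency of an estimate in one parametrization therefore carries over to the other by the continuous mapping above.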
Given the data y, the restricted (or residual) Gaussian log-likelihood is the Gaussian log-likelihood based on z = A'y, where A is an N × (N − p) matrix such that

(2) rank(A) = N − p, A'X = 0.

Thus the restricted Gaussian log-likelihoods for estimating the two sets of parameters (θ_i)_{0≤i≤s} and (φ_i)_{0≤i≤s} are given, respectively, by

(3) L(θ) = c − ((N − p)/2) log λ − (1/2) log |V_A(µ)| − (1/(2λ)) z'V_A(µ)^{-1} z

and

(4) L(φ) = c − (1/2) log |U_A(φ)| − (1/2) z'U_A(φ)^{-1} z,

where c = −((N − p)/2) log 2π, V_A(µ) = G_0 + Σ_{i=1}^s µ_i G_i, U_A(φ) = Σ_{i=0}^s φ_i G_i = λ V_A(µ), with G_0 = A'A and G_i = A'Z_i Z_i'A, 1 ≤ i ≤ s. The maximizers θ̂ = (θ̂_i)_{0≤i≤s} and φ̂ = (φ̂_i)_{0≤i≤s} of (3) and (4) do not depend on the choice of A so long as (2) is satisfied. Therefore we may assume, w.l.o.g., that

(5) A'A = I_{N−p}.

Note that the relation between (θ̂_i)_{0≤i≤s} and (φ̂_i)_{0≤i≤s} is the same as that between the parameters. The dependence of V_A(µ) and U_A(φ) on A is not important in this paper; therefore we will abbreviate these by V(µ) and U(φ), respectively. Also w.l.o.g., we can focus on a sequence of designs indexed by N, since consistency and asymptotic normality hold iff they hold for any sequence with N increasing strictly monotonically. Thus, for example, p and m_i, 1 ≤ i ≤ s, can be considered as functions of N, and y, X, the Z_i's and so on as depending on N [e.g., Jiang (1996)]. Let m_0 = N, α_0 = ε. The following assumptions A1 and A2 are made for model (1).

A1. For each N, α_0, α_1, ..., α_s are mutually independent.

A2. For 0 ≤ i ≤ s, the common distribution of α_i1, ..., α_im_i may depend on N. However, it is required that

(6) lim_{x→∞} sup_N max_{0≤i≤s} E[α_i1⁴ 1_{(|α_i1|>x)}] = 0.

The basic idea of proving Wald consistency (nonsieve or sieve) is simple. Consider, for example, the log-likelihood (3). The difference L(θ) − L(θ_0) can be decomposed as

(7) L(θ) − L(θ_0) = e(θ, θ_0) + d(θ, θ_0),

where

(8) e(θ, θ_0) = E_{θ_0}[L(θ) − L(θ_0)] = −(1/2){(N − p) log(λ/λ_0) + log(|V(µ)|/|V(µ_0)|) + (λ_0/λ) tr(V(µ)^{-1}V(µ_0)) − (N − p)},

(9) d(θ, θ_0) = L(θ) − L(θ_0) − E_{θ_0}[L(θ) − L(θ_0)] = −(1/2){z'[(1/λ)V(µ)^{-1} − (1/λ_0)V(µ_0)^{-1}]z − E_{θ_0}(z'[(1/λ)V(µ)^{-1} − (1/λ_0)V(µ_0)^{-1}]z)}.
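For readers who want to compute with these quantities, here is a minimal numerical sketch of the restricted Gaussian log-likelihood (4). The SVD-based choice of A and the helper names are our own assumptions; any A satisfying (2) and (5) gives the same value, and the additive constant c is immaterial for maximization.

```python
import numpy as np

def null_basis(X):
    """An N x (N - p) matrix A with A'X = 0 and A'A = I, as in (2) and (5):
    an orthonormal basis of the orthogonal complement of col(X), via a full SVD."""
    N, p = X.shape
    U = np.linalg.svd(X, full_matrices=True)[0]
    return U[:, p:]

def restricted_loglik(phi, y, X, Zs):
    """The restricted Gaussian log-likelihood (4); phi = (sigma_0^2, ..., sigma_s^2),
    Zs = [Z_1, ..., Z_s]."""
    N, p = X.shape
    A = null_basis(X)
    z = A.T @ y
    G = [A.T @ A] + [A.T @ Z @ Z.T @ A for Z in Zs]    # G_0 = A'A, G_i = A'Z_i Z_i'A
    U_phi = sum(ph * Gi for ph, Gi in zip(phi, G))     # U_A(phi) = sum_i phi_i G_i
    logdet = np.linalg.slogdet(U_phi)[1]
    c = -0.5 * (N - p) * np.log(2.0 * np.pi)
    return c - 0.5 * logdet - 0.5 * z @ np.linalg.solve(U_phi, z)
```

With s = 0 (no random effects) this reduces to the Gaussian likelihood of the N − p residual contrasts, which gives a quick sanity check on the implementation.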
Essentially, what we are going to show is that, with probability tending to 1 and uniformly for θ outside a small neighborhood S_δ(θ_0) = {θ : |θ − θ_0| < δ}, the second term on the RHS of (7) is negligible compared with the first term, which is negative. Therefore the LHS of (7) will be less than 0 for θ ∉ S_δ(θ_0), and the rest of the argument is the same as in Jiang (1996). Let A, B, A_i, 1 ≤ i ≤ n, be matrices. Define |A| = the determinant of A; ‖A‖ = (λ_max(A'A))^{1/2} (λ_max denotes the largest eigenvalue); ‖A‖_R = (tr(A'A))^{1/2}; Cor(A_1, ..., A_n) = (cor(A_i, A_j)) if A_i ≠ 0, 1 ≤ i ≤ n, and 0 otherwise, where cor(A, B) = tr(A'B)/(‖A‖_R ‖B‖_R) if A, B ≠ 0; and diag(A_i) to be the block-diagonal matrix with A_i as its ith diagonal block. Define b(µ) = (I, √µ_1 Z_1, ..., √µ_s Z_s), B(µ) = AV(µ)^{-1}A' [note that B(µ) indeed does not depend on A], B_0(µ) = b(µ)'B(µ)b(µ) and B_i(µ) = b(µ)'B(µ)Z_iZ_i'B(µ)b(µ), i = 1, ..., s. Let p_i, 0 ≤ i ≤ s, be sequences of positive numbers. Denote U_i(φ) = U(φ)^{-1/2}G_iU(φ)^{-1/2}, 0 ≤ i ≤ s; V_0(θ) = (1/λ)I_{N−p}, V_i(θ) = V(µ)^{-1/2}G_iV(µ)^{-1/2}, 1 ≤ i ≤ s; I_ij(θ) = tr(V_i(θ)V_j(θ))/(p_ip_j); and

K_ij(θ) = (1/(2p_ip_j)) Σ_{l=1}^{N+m} (EW_l⁴ − 3)(B_i(µ)_{ll}/λ)(B_j(µ)_{ll}/λ), i, j = 0, 1, ..., s,

where m = m_1 + ··· + m_s, B_{ll} denotes the lth diagonal element of B, and W_l = ε_l/√λ_0 for 1 ≤ l ≤ N, W_l = α_{i, l−N−Σ_{k<i}m_k}/√(λ_0µ_0i) for N + Σ_{k<i}m_k + 1 ≤ l ≤ N + Σ_{k≤i}m_k, 1 ≤ i ≤ s; or W = (W_l)_{1≤l≤N+m} = (ε'/√λ_0, α_1'/√(λ_0µ_01), ..., α_s'/√(λ_0µ_0s))', where α_i/√(λ_0µ_0i) is understood as 0 (α_ij/√(λ_0µ_0i) := 0, 1 ≤ j ≤ m_i) if µ_0i = 0. Let I(θ) = (I_ij(θ)), K(θ) = (K_ij(θ)), J(θ) = I(θ) + K(θ). We say model (1) has positive variance components if θ_0 is an interior point of Θ, and is nondegenerate if

(10) inf_N min_{0≤i≤s} var(α_i1²) > 0.

A sequence of estimates (θ̂_0, ..., θ̂_s) is called asymptotically normal if there are sequences of numbers p_i, 0 ≤ i ≤ s, and matrices M(θ_0) with lim sup ‖M^{-1}(θ_0)‖ ‖M(θ_0)‖ < ∞ such that

(11) M(θ_0)(p_0(θ̂_0 − θ_00), ..., p_s(θ̂_s − θ_0s))' → N(0, I_{s+1}) in distribution.

Similarly we define the asymptotic normality of estimates of the φ's.
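The sign structure of the first term in the decomposition (7) can be checked numerically: e(θ, θ_0) in (8) is minus a Kullback-Leibler-type divergence between two centered Gaussian laws, so it is 0 at θ = θ_0 and strictly negative whenever λV(µ) ≠ λ_0V(µ_0). A sketch, under assumed toy matrices G_i of our own choosing:

```python
import numpy as np

def e_term(lam, mu, lam0, mu0, G):
    """e(theta, theta_0) of (8), computed from G = [G_0, ..., G_s]."""
    V  = G[0] + sum(m * Gi for m, Gi in zip(mu,  G[1:]))
    V0 = G[0] + sum(m * Gi for m, Gi in zip(mu0, G[1:]))
    n = V.shape[0]                                   # n plays the role of N - p
    ld  = np.linalg.slogdet(V)[1]
    ld0 = np.linalg.slogdet(V0)[1]
    return -0.5 * (n * np.log(lam / lam0) + ld - ld0
                   + (lam0 / lam) * np.trace(np.linalg.solve(V, V0)) - n)
```

Evaluating e_term at the true parameter returns 0; at any other parameter it is negative, which is exactly why the proof can concentrate on showing the d term is uniformly negligible.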
The following lemma will be used in the proofs of both Theorem 3.1 and Theorem 4.1 in the next two sections.

Lemma 2.1. Let ξ = (X_1, ..., X_n)' be a vector of independent random variables such that EX_i = 0, EX_i⁴ < ∞, 1 ≤ i ≤ n, and let A = (a_ij) be an n × n symmetric matrix. Then

var(ξ'Aξ) = Σ_{i=1}^n a_ii² var(X_i²) + 2 Σ_{i≠j} a_ij² var(X_i) var(X_j) ≤ {max_{1≤i≤n} [var(X_i²)/(var X_i)²] ∨ 2} Σ_{i,j=1}^n a_ij² var(X_i) var(X_j),

where var(X_i²)/(var X_i)² is understood as 0 if var(X_i) = 0.

Proof. We have

(12) ξ'Aξ − E(ξ'Aξ) = Σ_{i=1}^n {a_ii(X_i² − EX_i²) + 2(Σ_{j<i} a_ij X_j)X_i},

where Σ_{j<i} = 0 if i = 1. The summands in (12) form a sequence of martingale differences with respect to the σ-fields F_i = σ(X_1, ..., X_i), 1 ≤ i ≤ n. The result then follows easily.

3. The balanced case: Wald consistency. A balanced r-factor mixed model of the analysis of variance can be expressed (after possible reparametrization) in the following way [e.g., Searle, Casella and McCulloch (1992), Rao and Kleffe (1988)]:

(13) y = Xβ + Σ_{i∈S} Z_i α_i + ε,

where X = ⊗_{q=1}^{r+1} 1_{n_q}^{d_q} with d = (d_1, ..., d_{r+1}) ∈ S_{r+1} = {0, 1}^{r+1}; Z_i = ⊗_{q=1}^{r+1} 1_{n_q}^{i_q} with i = (i_1, ..., i_{r+1}) ∈ S ⊆ S_{r+1}; 1_n⁰ = I_n and 1_n¹ = 1_n, with I_n and 1_n being the n-dimensional identity matrix and the n-dimensional vector of 1's, respectively. [For examples, see Jiang (1996).] We assume, as usual, that factor r + 1 corresponds to repetition within cells. As a consequence, in (13) one must have d_{r+1} = 1 and i_{r+1} = 1, i ∈ S. Also we can assume, w.l.o.g., that n_q ≥ 2, 1 ≤ q ≤ r (since if n_q = 1, factor q is not really a factor and the model is not really an r-factor one). The model is called unconfounded if (1) the fixed effects are not confounded with the random effects and errors, that is, rank(X, Z_i) > p, i ∈ S, and X ≠ I_N; and (2) the random effects and errors are not confounded, that is, I_N and Z_iZ_i', i ∈ S, are linearly independent [e.g., Miller (1977)]. Note that under the balanced model (13) we have the expressions m_i = ∏_{i_q=0} n_q, i ∈ S. This allows us to extend the definition of m_i to all i ∈ S_{r+1}. In particular, p = m_d = ∏_{d_q=0} n_q, and N = m_{(0,...,0)} = ∏_{q=1}^{r+1} n_q.
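The Kronecker-product structure of the balanced model (13) is easy to reproduce numerically. The sketch below builds the design matrices for an assumed two-factor layout (r = 2) with cell sizes n = (3, 4, 2) and illustrates m_i = ∏_{i_q=0} n_q and N = ∏ n_q; the specific sizes are illustrative.

```python
import numpy as np

def design(i_tup, sizes):
    """Kronecker factor of model (13): entry 0 contributes I_{n_q}, entry 1 contributes 1_{n_q}."""
    M = np.ones((1, 1))
    for iq, n in zip(i_tup, sizes):
        M = np.kron(M, np.eye(n) if iq == 0 else np.ones((n, 1)))
    return M

sizes = (3, 4, 2)                  # n_1, n_2, n_3; factor r + 1 = 3 is repetition within cells
X   = design((1, 1, 1), sizes)     # grand-mean design, d = (1, 1, 1): N x 1
Za  = design((0, 1, 1), sizes)     # main effect of factor 1: m_i = n_1 = 3 columns
Zb  = design((1, 0, 1), sizes)     # main effect of factor 2: m_i = n_2 = 4 columns
Zab = design((0, 0, 1), sizes)     # interaction: m_i = n_1 * n_2 = 12 columns
N = X.shape[0]                     # N = n_1 * n_2 * n_3 = 24
```

Each "0" coordinate of i contributes an identity factor (and hence a factor n_q to the number of columns m_i), while each "1" coordinate pools that direction, matching the formula m_i = ∏_{i_q=0} n_q above.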
Let S̄ = {(0, ..., 0)} ∪ S, and set µ_i = µ_0i = 1 for i = (0, ..., 0) ∈ S_{r+1}.
7 REML ESTIMATIO 1787 In the balanced case we have the following nice representations for the e θ θ 0 and d θ θ 0 in (7). (14) (15) Lemma 3.1. In the balanced case, e θ θ 0 = 1 S θ d θ θ 0 = 1 r l θ 1ξ l l d where for u v S u v iff u q v q 1 q r + 1, and u v iff u is not v; r l θ = λ 0 C l µ 0 /λc l µ with C l µ = i S µ i /m i 1 i l S θ = l dr l θ 1 log r l θp l with p l = l q =0n q 1 1/ ; and ξ l l d are random variables whose definition will be given later on in the proof. Proof. By the proof of Lemma 7.3 in Jiang (1996), there is an orthogonal matrix T such that T Z i Z i T = diagλ ij T XX T = diagλ dj, where λ i1 λ i = r+1 q=1 λ iqk q 1 k q n q 1 q r + 1 λ d1 λ d = r+1 q=1 λ dqk q 1 k q n q 1 q r + 1 with λ iqw = 1 i q + n q i q δ w 1 i S d (δ x y = 1 if x = y and = 0 otherwise). Let A T = B = b 1 b, then since [by (5)] AA = P X = I XX X 1 X = I p/xx B B = diagγ j, with γ j = 1 p/λ dj 1 j It is easy to see that r+1 q=1 1 i q + n q i q l q = /m i 1 i l i S d l S r+1. Hence r+1 λ iqkq q=1 = m i 1 i δk 1 i S r+1 λ dqkq q=1 = p 1 d δ k 1 where δ k1 = δ k1 1 δ kr+1 1 for k = k 1 k r+1. Thus γ 1 γ = 1 δk 1 d 1 k q n q 1 q r + 1 ow G i = A Z i Z i A = B diagλ ijb = j=1 λ ij b j b j i S, thus Vµ = I p + µ t G t = BB + µ t G t t S t S (16) = ( j=1 1 + ) µ t λ tj b j b j = t S γ j 0 ( 1 + µ t λ tj )b j b j t S since γ j = 0 b j = 0. Hence Vµ 1 = γ j 01 + t S µ t λ tj 1 b j b j since γ j 0 γ j = 1. Let Ɣ + = j γ j 0. Then Ɣ + = p (hereafter E denotes the cardinality of a set E). It follows from (16) that the eigenvalues
8 1788 J. JIAG of Vµ are 1 + t S µ t λ tj j Ɣ +. Thus by (8), e θ θ 0 = 1 { λ0 1 + t S µ 0t λ tj λ1 + t S µ t λ tj 1 log λ } 01 + t S µ 0t λ tj λ1 + t S µ t λ tj = 1 = 1 γ j 0 n 1 k 1 =1 1 l 1 =0 n r+1 k r+1 =1 1 l r+1 =0 = 1 { λ0 C l µ 0 λc l µ l d 1 δk 1 d { λ0 1 + t S µ 0t /m t 1 t δk 1 λ1 + t S µ t /m t 1 t δk 1 } 1 log { λ0 1 + t S µ 0t /m t 1 t l 1 l d λ1 + t S µ t /m t 1 t l } 1 log n q 1 l q =0 1 log λ } 0C l µ 0 n λc l µ q 1 = 1 S θ l q =0 using a formula given at the end of the proof of Lemma 7.3 in Jiang (1996) for the third equation; and by (9), d θ θ 0 = 1 ( [λ ) 1 ( µ t λ tj λ ) 1 ] µ 0t λ tj γ j 0 t S t S = 1 = 1 n 1 k 1 =1 b j z E θ0 b j z n r+1 ( [λ 1 1 l 1 =0 k r+1 =1 1 1 δk 1 d 1 + µ t 1 m t δk 1 t S t ) 1 ( λ ) 1 ] µ 0t 1 m t δk 1 t S t b k z E θ0 b k z l r+1 =0 δ k 1 =l = 1 r l θ 1ξ l l d 1 δk 1 d b k z E θ0 b k z where ξ l = λ 0 C l µ 0 1 δ k 1 =lb k z E θ0 b k z l d. Let S 1 θ = l dr l θ 1 p l S 3θ = l d r l θ 1p l. Define S + = i S µ 0i > 0 S + = 00 0 S + b 0 = min i S + µ 0i 1 i S µ + 0i if S +, and b 0 = 1 if S + = ; ε 0 = λ 0 /3 i S µ 0i 1 ; b = 1+explog /r
9 REML ESTIMATIO (here x means the largest interger less than or equal to x); s = S; and T l = q 1 q r + 1 l q = 1. The following lemma plays a key role in the proof of Wald consistency in the balanced case. (17) (18) (19) Lemma 3.. Suppose model (13) is unconfounded. For any B > 0, { ( ) ( inf S 1 θ 1/ r+3 min m i θ / B i S { ( ) ( inf S θ 1/ r+5 b 1 0 min m i θ/ B i S ( S 3θ sup θ/ B S θ r+3 b 0 inf θ/ B S θ ) 1/ + r+1 ( b log ε 0 B 3ε 0 + sλ 0 ) } ε 0 B 3ε 0 + sλ 0 ) ( ) 1/ min m i i S ) } where B = θ λ λ 0 ε 0 B/ µ l µ 0l b T l B/ m l l S. Proof. Let i = 0 = 0 0 if n r+1 ; i = 0 01 if n r+1 = 1. It is easy to show that i d, i i i S, and p i 1 r+1 m i = 1 r+1. So if λ < /3λ 0 or λ > λ 0, then (0) S 1 θ r i θ 1 p i = λ 0 /λ 1 p i 1/ r+3 If /3λ 0 λ λ 0 and µ l > λ 0 /ε 0 for some l S, then r l θ = λ 0 D l µ 0 /λd l µ < 1/, where D l µ = i S µ i m l /m i 1 i l. So S 1 θ r l θ 1 p l 1/r+ m l. ow suppose /3λ 0 λ λ 0 0 µ l λ 0 /ε 0 l S, and θ / B. If λ λ 0 > ε 0 B/, then as in (0) we have S 1 θ 1/ r+3 ε 0 /λ 0 B. If λ λ 0 ε 0 B/ and µ l µ 0l > b Tl B/ m l for some l S, there is l S such that µ l µ 0l > b Tl B/ m l and µ i µ 0i b Ti B/ m i i S i l i l. Then D l µ 0 D l µ µ 0l µ l µ 0i µ i m l /m i 1 i l i l > B/ m l i S since µ 0i µ i m l /m i 1 i l i l B/ m l i S i l i l b T i = B/ m l b + 1 Tl b Tl B/ m l b Tl 1
10 1790 J. JIAG Therefore, λ 0 D l µ 0 λd l µ λd l µ 0 D l µ λ λ 0 D l µ 0 /3λ 0 B/ m l ε 0 B/ µ 0i i S λ 0 /3B/ m l Hence S 1 θ r l θ 1 p l 1/r+ ε 0 B/3ε 0 + sλ 0 To prove (18), let l S + l = i S + i l such that m l = min i S + m l i if S + l, and l = i if S + l =. Then it is easy to show that C l µ 0 /C l µ 0 b 0 and hence (1) r l θ r l θ = C lµ 0 C l µ C l µ 0 C l µ b 0 If r l θ b 0 for some l d, then r l θ by (1). Therefore, () S θ r l θ 1 log r l θp l ( ) 1 log r l θp l 1/ r+1 1 log min m i i S using that x 1 log x/x 1 log / x. If r l θ < b 0 for all l d, then S θ S 1 θ/4b 0 by x 1 log x x 1 /L 0 < x < L L > 1. To prove (19), let S = i S. Write (3) S 3 θ = For any ε > 0, l d r l θ<b 0 r l θ 1p l + l d r l θ b 0 r l θ 1p l = S 1 + S (4) ( ) 1/ ( S 1 1 r l θ 1 p l l d r l θ<b 0 l d r l θ<b 0 } {ε r ε r l θ 1 p l l d r l θ<b 0 { ε r 1 1 } S θ + 4b 0ε S θ ) 1/ by x ε 1 + εx / and that S θ 4b 0 1 l d r l θ<b 0 r l θ 1 p l. By (1), the facts that (according to the definition) l only takes values in
S = i S and l l (therefore p l p l ), and the fact that ( ) 1 log S θ r i θp i it follows that (25) i S r i θ S b 0 + 1/ r l θp l l d r l θ = b 0 + 1/ r+1 b 0 + 1/ i S r i θ r i θ i S r i θ p l l d l =i r i θp i r+1 b log min i S p i S θ Inequality (19) now follows by (23)-(25), taking ε = (4b_0 inf_{θ∉B} S_2(θ))^{1/2} in (24), and the fact that min_{i∈S̄} p_i ≥ 2^{−(r+1)}(min_{i∈S} m_i)^{1/2}.

Let θ̂ = (λ̂, (µ̂_i)_{i∈S}) be the maximizer of (3) over Θ (one may define θ̂ as any fixed point in Θ when the maximum cannot be reached).

Theorem 3.1. Let the balanced mixed model (13) be unconfounded. As N → ∞ and m_i → ∞, i ∈ S, we have the following:

(i) θ̂ is consistent, and the sequence (√(N − p)(λ̂ − λ_0), √m_i(µ̂_i − µ_0i), i ∈ S) is bounded in probability.

(ii) If, moreover, the model has positive variance components and is nondegenerate, then θ̂ is asymptotically normal with p_0 = √(N − p), p_i = √m_i, i ∈ S, and M(θ_0) = J^{−1/2}(θ_0)I(θ_0).

Remark. Part (i) of Theorem 3.1 does not require that the model have positive variance components. In particular, when that requirement does hold, θ̂ constitutes (with probability tending to 1) a consistent sequence of roots of the REML equations [e.g., Jiang (1996)], which are identical to the ANOVA estimates in the balanced case [e.g., Searle, Casella and McCulloch (1992)]. On the other hand, it is necessary for part (ii) of Theorem 3.1 that the model have positive variance components, since otherwise the estimates, which are nonnegative, cannot be asymptotically normal.

Proof. According to the definitions in Section 2, it is always true (no matter whether µ_0i = 0 or not) that Z_iα_i = √(λ_0µ_0i) Z_i(α_i/√(λ_0µ_0i)), i ∈ S, and

(26) z = A'y = √λ_0 A'b(µ_0)'W.
Note that W is a vector of independent random variables satisfying EW_l = 0, var(W_l) ≤ 1 [var(W_l) = 1 if the corresponding µ_0i ≠ 0], and sup_l EW_l⁴ < ∞. By the definition of ξ_l at the end of the proof of Lemma 3.1, Lemma 2.1 and assumption A2 in Section 2, we see that, with B_(l) = b(µ_0)A(Σ_{δ(k−1)=l} b_k b_k')A'b(µ_0)',

(27) E_{θ_0} ξ_l² = var_{θ_0}(W'B_(l)W)/(C_l(µ_0))² ≤ C(C_l(µ_0))^{−2} tr((Σ_{δ(k−1)=l} b_k b_k' V(µ_0))²) ≤ Cp_l

for some constant C. Thus, for any M > 0,

(28) P_{θ_0}(|ξ_l| > Mp_l for some l ⊴ d) ≤ 2^{r+1}C/M².

For any ε > 0, choose M > (2^{r+1}Cε^{−1})^{1/2}. Then choose B > 0 and K such that, by Lemma 3.2, sup_{θ∉B} S_3(θ)/S_2(θ) < 1/M whenever m_i > K, i ∈ S. It then follows by (7), (14), (15), (18) and (28) that P_{θ_0}(L(θ) < L(θ_0) for all θ ∉ B) > 1 − ε when m_i > K, i ∈ S. The conclusion of (i) then follows by the definition of B. The proof for (ii) is the same as that of Theorem 4.1 of Jiang (1996).

4. The general case: the method of sieves. For convenience we consider in this section the parameters φ_i, 0 ≤ i ≤ s. Let δ_N, M_N be two sequences of positive numbers such that δ_N → 0, M_N → ∞. Let φ̂ = φ̂(δ_N, M_N) be the maximizer of L(φ) of (4) over Φ(δ_N, M_N) = {φ : δ_N ≤ φ_i ≤ M_N, 0 ≤ i ≤ s}. The above procedure is called the method of sieves [e.g., Grenander (1981)]. The following definitions are explained intuitively in Jiang (1996).

Definition 4.1. Model (1) is called asymptotically identifiable under the invariant class (AI²) at φ (θ) if

(29) lim inf λ_min(Cor(V_0(θ), ..., V_s(θ))) > 0

(see Section 2; λ_min means the smallest eigenvalue). Here the invariant class is the class of (location) invariant estimates

(30) I = {estimates which are functions of A'y with A satisfying (2)}.

Definition 4.2. Model (1) is called infinitely informative under the invariant class (I³) at φ (θ) if

(31) lim ‖V_i(θ)‖_R = ∞, 0 ≤ i ≤ s.
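A minimal computational sketch of the sieve procedure, for a simulated one-way model: the maximization of (4) is carried out over the truncated space [δ_N, M_N]² only. A coarse geometric grid stands in for a bounded numerical optimizer here, and the truncation levels and all model sizes are illustrative assumptions rather than the paper's q_N-based rates.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated one-way model: y = 2 + Z alpha + eps, sigma_0^2 = 1, sigma_1^2 = 2.
m, reps = 30, 4
N = m * reps
X = np.ones((N, 1))
Z = np.kron(np.eye(m), np.ones((reps, 1)))
y = 2.0 + Z @ (rng.standard_normal(m) * np.sqrt(2.0)) + rng.standard_normal(N)

A = np.linalg.svd(X, full_matrices=True)[0][:, 1:]   # A'X = 0, A'A = I, as in (2), (5)
z = A.T @ y
G0, G1 = A.T @ A, A.T @ Z @ Z.T @ A

def L(phi):
    """Restricted Gaussian log-likelihood (4), additive constant dropped."""
    U = phi[0] * G0 + phi[1] * G1
    return -0.5 * (np.linalg.slogdet(U)[1] + z @ np.linalg.solve(U, z))

# Sieve: maximize over the truncated space [delta_N, M_N]^2 only (a finite net
# over the sieve, in the spirit of the proof of Theorem 4.1).
delta_N, M_N = 1e-2, 1e2
grid = np.geomspace(delta_N, M_N, 30)
phi_hat = max(((a, b) for a in grid for b in grid), key=L)
```

The point of the truncation is that the maximizer is kept away from the boundary σ_0² = 0 and from infinity, while δ_N → 0 and M_N → ∞ let the sieve grow to the full parameter space.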
13 REML ESTIMATIO 1793 The following notation is needed for this section: λ θ = λ min CorV 0 θ V s θ ρ a b = inf θ a b λ θ τ c d = inf φ c d λ θ where a b = θ a θ i b 0 i s and so on. Lemma 4.1. For any x i 0 i s and φ, ( ){ } s x i U i φ λ θ s x 0 R 4s + 1 V 0θ R + λ x i V iθ R i=1 ( ) λ θ s x 4M i s + 1 V i1 R where M = max 0 i s φ i. Proof. By the relation (see Section ): U 0 φ = V 0 θ s i=1 λ 1 µ i V i θ U i φ = λ 1 V i θ 1 i s, we have (( s s ) ) s x i U i φ = tr y i V i θ λ θ y i V iθ R R where y 0 = x 0 y i = λ 1 x i µ i x 0 1 i s. Let B = 1 i s x i µ i x 0 1/x i, then s y i V iθ R 1 s + 1 x 0 V 0θ R + 1 s s + 1 { x 0 V 0θ R + λ x 0 V 0θ R + y i V iθ R i/ B i B } s x i V iθ R using the fact that V 0 θ R λ 1 µ i V i θ R 1 i s. The second inequality is obvious. Lemma 4.. Suppose model (1) is AI at all θ a b with a > 0, then lim inf ρ a b > 0. Proof. By direct computation it can be shown that (see Section ): corv θ i θ V j θ 4V kθ 4a 1 0 i j k s k The result thus follows by an argument of subsequences [e.g., Billingsley (1986), A10]. Denote, respectively, the interiors of and by and. Corollary 4.1. Suppose model (1) is AI at all θ, then lim inf τ c d > 0 for all 0 < c < d <. i=1
Let model (1) be AI² at all θ ∈ Θ° and I³ at 1 = (1, ..., 1). Define the sequences δ_N and M_N as follows. Let q_N = min_{0≤i≤s} ‖V_i(1)‖_R², and let a, b be positive numbers such that

(32) a + b(1 ∨ (b − 1)) < 2/(3s + 7).

Pick f, g, h such that

(33) 3a + b < f < (2/(s + 1))(1 − a − ((s + 7)/2)b),

(34) 0 < g < min((f − 3a − b)/2, 1 − a − ((s + 7)/2)b − ((s + 1)/2)f),

(35) 0 < h < min(f − 3a − b − 2g, 1 − a − ((s + 7)/2)b − ((s + 1)/2)f − g).

Now pick sequences c_N and d_N such that 0 < c_N ≤ c_1 ≤ 1 ≤ d_1 ≤ d_N and

(36) lim inf q_N^h τ(c_N, d_N) > 0

(by Corollary 4.1, one certainly can do this). Finally, pick a baseline region (δ_B, M_B) such that 0 < δ_B < 1 < M_B < ∞, and let δ_N = δ_B ∧ (c_N q_N^{−a}), M_N = M_B ∨ (d_N q_N^b).

Note. The selection of δ_B and M_B will not make a difference for the following asymptotic result. However, a reasonable choice of δ_B and M_B may be of practical convenience since, when N is not sufficiently large, the region (c_N q_N^{−a}, d_N q_N^b) may be too narrow.

Theorem 4.1. Consider a general mixed model (1) having positive variance components.

(i) If the model is asymptotically identifiable and infinitely informative (AI⁴) at all θ ∈ Θ° (or φ ∈ Φ°), then φ̂ is consistent and the sequence (‖V_i(1)‖_R(φ̂_i − φ_0i), 0 ≤ i ≤ s) is bounded in probability.

(ii) If, moreover, the model is nondegenerate, then φ̂ is asymptotically normal with p_0 = √(N − p), p_i being any sequence ∼ ‖V_i(1)‖_R, 1 ≤ i ≤ s, and M(θ_0) = J^{−1/2}(θ_0)I(θ_0).

Remark 1. Thus, with probability tending to 1, φ̂ constitutes a consistent sequence of roots of the REML equations.

Remark 2. The assumption that model (1) is AI⁴ at all θ ∈ Θ° is equivalent to saying that it is AI² at all θ ∈ Θ° and I³ at some θ ∈ Θ°, and w.l.o.g. one can take θ = 1.
15 REML ESTIMATIO 1795 Proof. (i) Let = φ = φ i 0 i s φ i = δ + k i ε 0 i s for some integers 0 k i < K 0 i s, where K = q b+f ε = M δ /K. Let φ δ M, then by (4) we have a decomposition similar to (7): (37) L φ L φ 0 = ẽ φ φ 0 + d φ φ 0 By an expression similar to (8), we have ẽ φ φ 0 = E φ0 L φ L φ 0 = 1/ { log Uφ 0 1/ Uφ 1 Uφ 0 1/ + tri p Uφ 0 1/ Uφ 1 Uφ 0 1/ } p = 1/ log λ i + 1 λ i i=1 where λ 1 λ p are the eigenvalues of Bφ = Uφ 0 1/ Uφ 1 Uφ 0 1/. Since λ max Bφ 1 M 0 δ 1, where M 0 = max 0 i s φ 0i, we have by x 1 log x x 1 /L 0 < x L L 1 and Lemma 4.1 that (38) (39) δ ẽ φ φ 0 41 M 0 truφ 0 1/ Uφ 1 Uφ 0 1/ I p δ = s 41 M 0 φ i φ 0i U i φ δ τ c d 16s + 11 M 0 M R s V i 1 R φ i φ 0i It is easy to verify the following identity which holds for any φ : where Uφ 0 1 Uφ 1 = s φ i φ 0i Uφ 0 1 G i Uφ s j=0 s H = 1/ j=0 s φ i φ 0i φ j φ 0j Uφ 0 1 G i Uφ 1 G j Uφ 0 1 s φ i φ 0i φ j φ 0j Uφ 0 1 G i Uφ 0 1/ H Uφ 0 1/ G j Uφ 0 1 s φ k φ k Uφ 0 1/ k=0 Uφ 1 G k Uφ 1 + Uφ 1 G k Uφ 1 Uφ 0 1/
16 1796 J. JIAG In the following we pick φ in (39) such that φ i φ i < ε 0 i s. By an expression similar to (9) and by (39) we get (40) d φ φ 0 = L φ L φ 0 E φ0 L φ L φ 0 { s = 1/ φ i φ 0i z Uφ 0 1 G i Uφ 0 1 z E φ0 s s φ i φ 0i φ j φ 0j + j=0 s j=0 ( z Uφ 0 1 G i Uφ 1 G j Uφ 0 1 z E φ0 ) s φ i φ 0i φ j φ 0j z Uφ 0 1 G i Uφ 0 1/ = 1/I 1 I + I 3 E φ0 I 3 It follows from Lemma.1 and (6) that H Uφ 0 1/ G j Uφ 0 1 z E φ0 ( s var φ0 z Uφ 0 1 G i Uφ 0 1 z CU i φ 0 R Cδ 0 V i1 R s j=0 )} where C = varε1 /varε 1 max 1 i s varα i1 /varα i1 δ 0 = min 0 i s φ 0i. Thus with L 1 = 3s + 1C 1/ q g, we have on E 1 = { z Uφ 0 1 G i Uφ 0 1 z E φ0 L 1 δ 1 0 V i1 R 0 i s } that (41) and (4) Similarly, ( s ) 1/ I 1 L 1 s + 1 1/ δ0 1 V i 1 R φ i φ 0i P φ0 E c 1 s var φ0 z z L 1 δ 0 V i1 R 1/3q g var φ0 z Uφ 0 1 G i Uφ 1 G j Uφ 0 1 z CUφ 0 1/ G i Uφ 1 G j Uφ 0 1/ R CU i φ 0 U j φ 0 U i φ R U j φ R Cδ 0 δ V i 1 R V j 1 R (Here is a brief derivation of the second inequality in the above: Uφ 0 1/ G i Uφ 1 G j Uφ 0 1/ R = AB R A A R B B R
17 REML ESTIMATIO 1797 where A = Uφ 0 1/ G i Uφ 1/ B = Uφ 1/ G j Uφ 0 1/. ow A A λ max U i φ 0 U i φ. Using the fact that A B 0 A B tra trb [note that it is not true that A B 0 A B A B (e.g., Chan and Kwong (1985))], we get A A R U i φ 0 U i φ R, and so on.) Thus with L = s + 13K s+1 C1/ q g we have on that (43) and (44) E = { z Uφ 0 1 G i Uφ 1 G j Uφ 0 1 z E φ0 L δ 0 δ 1 V i 1 1/ R V j1 1/ 0 i j s φ } ( s ) I L δ 0 δ 1 V i 1 1/ R φ i φ 0i ( s ) 1/ L s + 1 3/ δ 0 δ 1 M 0 M V i 1 R φ i φ 0i P φ0 E c φ 0 i j s R var φ0 z z L δ 1/3q g 0δ V i 1 R V j 1 R Finally, I 3 = U H U, where U = s φ i φ 0i Uφ 0 1/ G i Uφ 0 1 z. We have s H 1/ φ k φ k Uφ 0 1/ Uφ 0 1/ s + 1M 0 δ ε and k=0 U s + 1 s φ i φ 0i Uφ 0 1/ G i Uφ 0 1 z E φ0 Uφ 0 1/ G i Uφ 0 1 z δ 0 V i1 R Thus with L 3 = 3s+1 1/ q g we have on E 3 = Uφ 0 1/ G i Uφ 0 1 z L 3 δ 1 0 V i1 R 0 i s that s (45) I 3 L 3 s + 1 δ0 M 0δ V i 1 R φ i φ 0i and (46) It also follows that (47) P φ0 E c 3 s ε E φ0 z L 3 δ 0 V i1 R E φ0 I 3 s + 1 δ 0 M 0δ ε 1/3q g s V i 1 R φ i φ 0i
18 1798 J. JIAG Combining (37), (38), (40), (41), (43), (45) and (47), we have on E = E 1 E E 3 that for φ δ M, L φ L φ 0 { Q δ τ c d (48) 16s + 11 M 0 M + 1/s + 1 δ 0 M 0L 3 + 1δ ε + 1/s + 1 1/ δ 1 0 L 1 + s + 1L δ 1 M 0 M Q 1 where Q = s V i 1 R φ i φ 0i 1/. And, combining (4), (44) and (46), we have } (49) P φ0 E c q g ote that by I 3 at 1 (= 1 1 ), q tends to infinity, and so the probability of E tends to 1. It remains to show that for any η > 0 there is 0 > 0 such that for 0 the in (48) is less than 0 for all φ δ M S η φ 0 c, where S η φ 0 = φ φ i φ 0i < η 0 i s. Since δ 0 M as, this implies that ˆφ is consistent. Since φ δ M S η φ 0 c implies Q ηq, in (48) qh τ c d 16s + 11 M 0 q a b h + C η 1 q a+s+3/b+s+1/f+g 1 + C 1 q a f+g where C 1 C are constants. Also lim inf q h τ c d > 0, and ( a b h > max a f + g a + s + 3 b + s + 1 ) f + g 1 by the way we picked a, b, f, g, h. Thus in (48) is less than 0 for large uniformly for all φ δ M S η φ 0 c To prove that V i 1 R ˆφ i φ 0i 0 i s is bounded in probability, replace δ M K ε and q g by δ = 1/δ 0, M = 3/M 0, K, ε = M δ/k and η 1, respectively. Then by the same argument we have with probability greater than or equal to 1 η that for all φ δ M, { L φ L φ 0 Q δτ δ M 16s + 1M 0 M + C 1η δ ε (50) } + C η 1 K s+1/ δ 1 MQ 1 The result follows from (50) and the consistency of ˆφ. The proof of (ii) is the same as that of Theorem 4.3 in Jiang (1996).
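As a numerical companion to the identifiability condition (29) used throughout this section (and anticipating condition (53) of Example 5.1 in the next section), λ_min(Cor(V_0, V_1)) can be computed in closed form for a two-way nested layout from tr V_1 = Σ n_i/(1 + µn_i) and tr V_1² = Σ (n_i/(1 + µn_i))². The n_i patterns below are illustrative choices of ours.

```python
import numpy as np

def lambda_min_cor(n_sizes, mu):
    """lambda_min(Cor(V_0, V_1)) for a two-way nested layout (cf. Example 5.1),
    using tr V_1 = sum n_i/(1 + mu*n_i) and tr V_1^2 = sum (n_i/(1 + mu*n_i))^2."""
    n = np.asarray(n_sizes, dtype=float)
    N, p = 2.0 * n.sum(), len(n)            # N observations, p fixed effects
    t = n / (1.0 + mu * n)
    c = t.sum() / np.sqrt((N - p) * (t ** 2).sum())   # cor(V_0, V_1); lambda cancels
    return 1.0 - abs(c)                     # smallest eigenvalue of the 2 x 2 Cor matrix

balanced   = lambda_min_cor([5] * 100, mu=1.0)   # p/N = 0.1 < 1/2: identifiable
degenerate = lambda_min_cor([1] * 100, mu=1.0)   # all n_i = 1: lambda_min = 0
```

With n_i = 5 for all i the value is exactly 2/3, while with all n_i = 1 the two matrices V_0 and V_1 become perfectly correlated and λ_min drops to 0, which is precisely the asymptotic confounding phenomenon discussed in Example 5.1.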
5. Examples. The first example is used to illustrate the AI⁴ condition.

Example 5.1. Consider the two-way (unbalanced) nested model

(51) y_ijk = β_i + α_ij + ε_ijk,

where i = 1, ..., p; j = 1, 2; k = 1, ..., n_i; and β, α and ε correspond to the fixed effects, the random effects and the errors, respectively. Assume, w.l.o.g., that n_i ≥ 1, 1 ≤ i ≤ p. Let X = diag(1_{n_i} ⊗ 1_2) and Z = diag(1_{n_i} ⊗ I_2); then the model can be written as

(52) y = Xβ + Zα + ε.

So N = 2 Σ_{i=1}^p n_i. Direct calculation shows that H = Z'AA'Z = Z'(I − X(X'X)^{-1}X')Z = diag(n_i(I_2 − (1/2)J_2)), where J = 1 1', and that Z'AV(µ)^{-1}A'Z = H − H(µ^{-1}I_{2p} + H)^{-1}H = diag(n_i(1 + µn_i)^{-1}(I_2 − (1/2)J_2)). Thus

tr(V_1(θ)) = tr(Z'AV(µ)^{-1}A'Z) = Σ_{i=1}^p n_i/(1 + µn_i),

tr(V_1(θ)²) = tr((Z'AV(µ)^{-1}A'Z)²) = Σ_{i=1}^p (n_i/(1 + µn_i))².

Let q = Σ_{n_i>1}(n_i − 1) and r = q/p; then it is easy to show that

(53) lim inf λ_min(Cor(V_0(θ), V_1(θ))) > 0 for all θ ∈ Θ° = {θ = (λ, µ) : λ > 0, µ > 0}

iff lim sup p/N < 1/2 iff lim inf r > 0. Since q ≥ #{1 ≤ i ≤ p : n_i > 1}, lim inf r = 0 would mean the model is asymptotically confounded. Note that s = 1 in this example, so (32)-(35) become: a + b(1 ∨ (b − 1)) < 0.2; 3a + b < f < 1 − a − 4b; 0 < g < min(0.5f − 1.5a − 0.5b, 1 − a − 4b − f); 0 < h < min(f − 3a − b − 2g, 1 − a − 4b − f − g).

The next example is a counterexample showing that AI⁴ is not enough for Wald consistency.

Example 5.2. Let λ_1 = 0 and λ_2, ..., λ_N be positive numbers such that

(54) lim sup [Σ_{i=1}^N λ_i/(1 + µλ_i)] / {N^{1/2}[Σ_{i=1}^N (λ_i/(1 + µλ_i))²]^{1/2}} < 1

and

(55) lim Σ_{i=1}^N (λ_i/(1 + µλ_i))² = ∞
for all µ ≥ 0 [e.g., λ_2 = ··· = λ_{N/2} = a, λ_{N/2+1} = ··· = λ_N = b, where a, b > 0, a ≠ b]. Consider

(56) y_i = √λ_i α_i + ε_i, i = 1, ..., N,

where the α_i's are random effects and the ε_i's are errors satisfying P(ε_1 = 0) > 0. Thus y = Zα + ε, where Z = diag(√λ_i). Equations (54) and (55) imply lim inf λ_min(Cor(V_0(θ), V_1(θ))) > 0 and lim ‖V_i(θ)‖_R = ∞, i = 0, 1, for all θ ∈ Θ = {θ = (λ, µ) : λ > 0, µ ≥ 0}, so the model is AI⁴ not only at all θ ∈ Θ° but at all θ ∈ Θ. Assume φ_0 ∈ Φ°. It is easy to derive

(57) L(φ) − L(φ_0) = −(1/2) Σ_{i=1}^N {log[(σ_0² + σ_1²λ_i)/(σ_00² + σ_01²λ_i)] + [(σ_00² + σ_01²λ_i)/(σ_0² + σ_1²λ_i) − 1] w_i²},

where w_i = (ε_i + √λ_i α_i)/√(σ_00² + σ_01²λ_i), i = 1, ..., N.

Let B = {φ = (σ_0², σ_1²) : σ_i² ≥ σ_0i²/2, i = 0, 1} and φ* = (σ*_0², σ_01²), with

0 < σ*_0² < σ_00² exp{−3δ^{-1}(2 + (σ_00²/σ_01²) Σ_{i=2}^N λ_i^{-1})},

where δ = P_{φ_0}(ε_1 = 0). Define ξ = N^{-1} Σ_{i=1}^N w_i², η = Σ_{i=2}^N λ_i^{-1}w_i² / Σ_{i=2}^N λ_i^{-1}, and E = {ε_1 = 0, ξ ∨ η ≤ 3δ^{-1}}. Then it follows from (57) that, on E, sup_{φ∈B} L(φ) < L(φ*). Thus

P_{φ_0}(sup_{φ∈B} L(φ) < L(φ*)) ≥ P_{φ_0}(E) ≥ δ/3.

Therefore, with probability greater than some positive number, the maximum point φ̂ will not fall into B no matter how large N is. Note that in this example the REML estimates are the same as the MLE.
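The mechanism behind Example 5.2 can be seen in a small simulation: since λ_1 = 0 and P(ε_1 = 0) > 0, on the event {y_1 = 0} the Gaussian log-likelihood grows without bound as σ_0² ↓ 0, so the maximizer over the full parameter space is driven to the boundary. The specific error law and λ pattern below are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed specifics: lambda_1 = 0; lambda_i in {1, 4} otherwise; errors equal 0
# with probability 1/2 and +-sqrt(2) with probability 1/4 each (variance 1);
# true variance components sigma_00^2 = sigma_01^2 = 1.
n = 200
lam = np.concatenate([[0.0], np.where(np.arange(1, n) < n // 2, 1.0, 4.0)])
alpha = rng.standard_normal(n)
eps = rng.choice([0.0, np.sqrt(2.0), -np.sqrt(2.0)], p=[0.5, 0.25, 0.25], size=n)
eps[0] = 0.0                     # condition on the positive-probability event {eps_1 = 0}
y = np.sqrt(lam) * alpha + eps   # model (56)

def loglik(s0, s1=1.0):
    """Gaussian log-likelihood of the independent y_i with var(y_i) = s0 + s1*lambda_i."""
    v = s0 + s1 * lam
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + y ** 2 / v)

# Since y_1 = 0 and lambda_1 = 0, the i = 1 term equals -log(2*pi*s0)/2, which
# tends to +infinity as s0 -> 0 while the remaining terms stay bounded: the
# likelihood is unbounded near the boundary on an event of positive probability.
values = [loglik(s0) for s0 in (1.0, 1e-30, 1e-100)]
```

On this event the full-space Gaussian maximizer thus degenerates to σ_0² = 0, whereas the sieve, by construction, never lets the estimate reach the boundary.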
6. Concluding remarks.

6.1. In virtually all problems regarding Wald consistency there are two sources of difficulty. The first is the arbitrariness of the parameters (here, θ or φ). The second is the randomness of some quantities involved in the likelihood function (here, z = A′y). One does not want these two kinds of difficulty to be tied together, and keeping them apart is the technical origin of our proofs. As we mentioned earlier, the basic idea is to prove that the difference, say L(θ) − L(θ0), is negative for θ outside a small neighborhood of θ0. Since the first term in the difference decomposition [e.g., (7)] is nonrandom, the focus is on the second term (i.e., the d term). In the balanced case we can further decompose d(θ, θ0) as a sum in which each summand is a product of two factors, the first depending only on θ and the second only on z, and the number of summands is bounded. This technique of separating the two kinds of difficulty proves entirely successful, since it then becomes clear how to construct the small neighborhood (see the proof of Theorem 3.1). Unfortunately, this nice decomposition of the d term no longer exists in the unbalanced case (a counterexample we construct makes this clear). In the unbalanced situation we return to the original idea of Wald (1949): divide the parameter space into small regions and approximate the likelihood within each region by its value at, say, the center of the region. Note that once φ is fixed at some point, L(φ) becomes a function of z alone, so this technique does the job of separating φ and z, except for one thing: the number of such small regions has to be kept well under control, and this, as we have found, is impossible without the method of sieves. Since we are dealing with some nonstandard situations, namely quadratic forms of non-Gaussian random variables that cannot be expressed as i.i.d. sums, special techniques are needed to evaluate carefully the orders of various quantities and hence, eventually, to construct the sieve. These include an expansion of a matrix-valued function and an inequality for the variance of a quadratic form. Finally, the counterexample makes it clear that sieve–Wald consistency is all one can expect in a general unbalanced non-Gaussian situation.

6.2. In Section 4 a sequence of approximating spaces (i.e., a sieve) is constructed, and the consistency of the resulting estimates is ensured under AI 4. Although the AI 4 condition seems minimal for a theorem of this generality, the rate at which the sieve converges to the full parameter space may be slow. The rate can be much improved if, for example, the true parameter vector is known to belong to a compact subset of the interior of the parameter space, say φ0 ∈ [δ, M], where 0 < δ < M < ∞ (e.g., δ = 10^{-6}, M = 10^{6}). In such a case the actual parameter space is Φ0 = [δ, M], each approximating space can be taken as Φ0 itself (i.e., δ_B = δ and M_B = M), and the corresponding sequence of estimates φ̂ is consistent [see (50) in the proof of Theorem 4.1].
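The sieve idea of maximizing the restricted Gaussian likelihood over a compact truncation [δ_B, M_B] of the variance-component space, rather than over the full space, can be sketched numerically. The following is a minimal illustration only, not the paper's construction: it fits a balanced one-way random-effects model by grid-maximizing the restricted log-likelihood −(1/2)[log|V| + log|X′V⁻¹X| + y′Py] over a truncated box; the model sizes, the grid, and the bounds δ_B, M_B are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced one-way random-effects model: y_ij = mu + a_i + e_ij,
# i = 1..m groups, j = 1..k replicates (sizes chosen for illustration).
m, k = 10, 4
n = m * k
a = rng.normal(0.0, np.sqrt(2.0), size=m)        # true sigma_a^2 = 2
y = 1.0 + np.repeat(a, k) + rng.normal(0.0, 1.0, size=n)  # true sigma_e^2 = 1

X = np.ones((n, 1))                               # fixed-effects design (intercept)
Z = np.kron(np.eye(m), np.ones((k, 1)))           # random-effects design

def restricted_loglik(theta):
    """Restricted (REML) Gaussian log-likelihood at theta = (sigma_a^2, sigma_e^2)."""
    sa2, se2 = theta
    V = sa2 * Z @ Z.T + se2 * np.eye(n)
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.inv(XtVinvX) @ X.T @ Vinv
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(XtVinvX)
    return -0.5 * (logdetV + logdetXVX + y @ P @ y)

# Sieve step: maximize over the compact box [delta_B, M_B]^2, which stays
# inside the interior of the parameter space (away from the boundary 0).
delta_B, M_B = 1e-2, 1e2
grid = np.geomspace(delta_B, M_B, 30)
best = max(((sa2, se2) for sa2 in grid for se2 in grid), key=restricted_loglik)
print(best)
```

A coarse grid search is used here purely to keep the sketch transparent; in practice one would run a constrained numerical optimizer over the same truncated box, letting δ_B → 0 and M_B → ∞ as the sample grows.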
6.3. However, the AI 4 condition is not sufficient for Wald consistency. Additional assumptions are needed, either on the structure of the model or on the distributions of the random effects and errors. For example, it would be interesting to know whether Wald consistency holds when normality actually holds (in which case the Gaussian likelihood is the true likelihood). Note that in the example of Section 5 the distribution of the errors is not normal.

Acknowledgments. This paper is based on part of the author's Ph.D. dissertation at the University of California, Berkeley. I sincerely thank my advisor, Professor P. J. Bickel, for his guidance, encouragement and many helpful discussions. I also thank Professor L. Le Cam for helpful suggestions, and a referee and an Associate Editor for their excellent and detailed suggestions, which led to substantial improvement of the presentation.

REFERENCES
Billingsley, P. (1986). Probability and Measure. Wiley, New York.
Chan, N. N. and Kwong, M. K. (1985). Hermitian matrix inequalities and a conjecture. Amer. Math. Monthly.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton Univ. Press.
Cressie, N. and Lahiri, S. N. (1993). The asymptotic distribution of REML estimators. J. Multivariate Anal.
Das, K. (1979). Asymptotic optimality of restricted maximum likelihood estimates for the mixed model. Calcutta Statist. Assoc. Bull.
Grenander, U. (1981). Abstract Inference. Wiley, New York.
Hartley, H. O. and Rao, J. N. K. (1967). Maximum likelihood estimation for the mixed analysis of variance model. Biometrika.
Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems (with discussion). J. Amer. Statist. Assoc.
Jiang, J. (1996). REML estimation: asymptotic behavior and related topics. Ann. Statist.
Khuri, A. I. and Sahai, H. (1985). Variance components analysis: a selective literature survey. Internat. Statist. Rev.
Le Cam, L. (1979). Maximum likelihood: an introduction.
Lecture Notes in Statist. 18. Univ. Maryland, College Park.
Miller, J. J. (1977). Asymptotic properties of maximum likelihood estimates in the mixed model of the analysis of variance. Ann. Statist.
Neyman, J. and Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica.
Rao, C. R. and Kleffe, J. (1988). Estimation of Variance Components and Applications. North-Holland, Amsterdam.
Rao, J. N. K. (1977). Contribution to the discussion of "Maximum likelihood approaches to variance component estimation and to related problems" by D. A. Harville. J. Amer. Statist. Assoc.
Richardson, A. M. and Welsh, A. H. (1994). Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Austral. J. Statist.
Robinson, D. L. (1987). Estimation and use of variance components. Statistician.
Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. Wiley, New York.
Sweeting, T. J. (1980). Uniform asymptotic normality of the maximum likelihood estimator. Ann. Statist.
Thompson, W. A., Jr. (1962). The problem of negative estimates of variance components. Ann. Math. Statist.
Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist.

Department of Statistics
Case Western Reserve University
Euclid Avenue
Cleveland, Ohio
jiang@reml.cwru.edu