Some properties of Likelihood Ratio Tests in Linear Mixed Models


Ciprian M. Crainiceanu, David Ruppert, Timothy J. Vogelsang

September 19, 2003

Abstract

We calculate the finite-sample probability mass at zero and the probability of underestimating the true ratio between the random effects variance and the error variance in a LMM with one variance component. The calculations are expedited by simple matrix diagonalization techniques. One possible application is to compute the probability that the likelihood ratio test (LRT) statistic, or the restricted likelihood ratio test (RLRT) statistic, is zero. The large-sample chi-square mixture approximation to the distribution of the log-likelihood ratio, based on the usual asymptotic theory for a parameter on the boundary, has been shown to be poor in simulation studies. A large part of the problem is that the finite-sample probability that the LRT or RLRT statistic is zero is larger than 0.5, its value under the chi-square mixture approximation. Our calculations explain these empirical results. Another application is to show why standard asymptotic results can fail even when the parameter under the null is in the interior of the parameter space. This paper focuses on LMMs with one variance component because we have developed a very rapid algorithm for simulating finite-sample distributions of the LRT and RLRT statistics for this case. This allows us to compare finite-sample distributions with asymptotic approximations. The main result is that the asymptotic approximations are often poor, and this suggests that asymptotics be used with caution, or avoided altogether, for any LMM, regardless of whether it has one variance component or more. For computing the distribution of the test statistics we recommend our algorithm for the case of one variance component and the bootstrap in other cases.

Short title: Properties of (R)LRT. Keywords: Effects of dependence, Penalized splines, Testing polynomial regression.
Department of Statistical Science, Cornell University, Malott Hall, NY 14853, USA. E-mail: cmc59@cornell.edu. School of Operations Research and Industrial Engineering, Cornell University, Rhodes Hall, NY 14853, USA. E-mail: ruppert@orie.cornell.edu. Departments of Economics and Statistics, Cornell University, Uris Hall, NY 14853-7601, USA. E-mail: tjv2@cornell.edu.

1 INTRODUCTION

This work was motivated by our research in testing parametric regression models versus nonparametric alternatives. It is becoming more widely appreciated that penalized splines and other penalized likelihood models can be viewed as LMMs and the fitted curves as BLUPs (e.g., Brumback, Ruppert, and Wand, 1999). In this framework the smoothing parameter is a ratio of variance components and can be estimated by ML or REML. REML is often called generalized maximum likelihood (GML) in the smoothing spline literature. Within the random effects framework, it is natural to consider likelihood ratio tests and residual likelihood ratio tests, (R)LRTs, about the smoothing parameter. In particular, testing whether the smoothing parameter is zero is equivalent to testing for polynomial regression versus a general alternative modeled by penalized splines. These null hypotheses are also equivalent to the hypothesis that a variance component is zero.

LRTs for null variance components are non-standard for two reasons. First, the null value of the parameter is on the boundary of the parameter space. Second, the data are dependent, at least under the alternative hypothesis. The focus of our research is on the finite-sample distributions of (R)LRT statistics, and the asymptotic distributions are derived as limits of the finite-sample distributions in order to compare the accuracy of various types of asymptotics. For example, for balanced one-way ANOVA we compare asymptotics with a fixed number of samples and the number of observations per sample going to infinity with the opposite case of the number of samples tending to infinity with the number of observations per sample fixed. Our major results are:

a. The usual asymptotic theory for standard or boundary problems provides accurate approximations to finite-sample distributions only when the response vector can be partitioned into a large number of independent sub-vectors, for all values of the parameters.

b. The asymptotic approximations can be very poor when the number of independent sub-vectors is small or moderate.

c. Penalized spline models do not satisfy the condition described in point a., and standard asymptotic results fail rather dramatically.

The usual asymptotics for testing that a single parameter is at the boundary of its range give a 50 : 50 mixture of point mass at zero (called $\chi^2_0$) and a $\chi^2_1$ distribution. A major reason why

the asymptotics fail to produce accurate finite-sample approximations is that the finite-sample probability mass at zero is substantially greater than 0.5, especially for the LRT but even for the RLRT. This paper studies the amount of probability mass at 0 for these test statistics. However, our methods are sufficiently powerful that the finite-sample distribution of the LRT and RLRT statistics, conditional on not being zero, can also be derived. These distributions are studied in a later paper (Crainiceanu and Ruppert, 2003).

Our work is applicable to most LMMs with a single random effects variance component, not only to penalized likelihood models. We study LMMs with only a single variance component for tractability. When there is only one variance component, the distributions of the LRT and RLRT statistics can be simplified in such a way that simulation of these distributions is extremely rapid. Although our results do not apply to LMMs with more than one variance component, they strongly suggest that asymptotic approximations be used with great caution for such models. Since asymptotic approximations are poor for LMMs with one variance component, it is unlikely that they are satisfactory in general when there are more variance components.

Consider the following LMM:

Y = X\beta + Zb + \epsilon, \quad E\begin{bmatrix} b \\ \epsilon \end{bmatrix} = \begin{bmatrix} 0_K \\ 0_n \end{bmatrix}, \quad \mathrm{Cov}\begin{bmatrix} b \\ \epsilon \end{bmatrix} = \begin{bmatrix} \sigma^2_b \Sigma & 0 \\ 0 & \sigma^2_\epsilon I_n \end{bmatrix},  (1)

where Y is an n-dimensional response vector, $0_K$ is a K-dimensional column of zeros, $\Sigma$ is a known $K \times K$ matrix, $\beta$ is a p-dimensional vector of parameters corresponding to fixed effects, b is a K-dimensional vector of exchangeable random effects, and $(b, \epsilon)$ is a normally distributed random vector. Under these conditions it follows that $E(Y) = X\beta$ and $\mathrm{Cov}(Y) = \sigma^2_\epsilon V_\lambda$, where $\lambda = \sigma^2_b/\sigma^2_\epsilon$ is the ratio between the variance of the random effects b and the variance of the error variables $\epsilon$, $V_\lambda = I_n + \lambda Z \Sigma Z^T$, and n is the length of the response vector Y.

Note that $\sigma^2_b = 0$ if and only if $\lambda = 0$, and the parameter space for $\lambda$ is $[0, \infty)$. The LMM described by equation (1) contains standard regression fixed effects $X\beta$ specifying the conditional response mean and random effects $Zb$ that account for correlation. We are interested in testing

H_0: \lambda = \lambda_0 \quad \text{vs.} \quad H_A: \lambda \in [0, \infty) \setminus \{\lambda_0\},  (2)

where $\lambda_0 \in [0, \infty)$. Consider the case $\lambda_0 = 0$ ($\sigma^2_b = 0$), when the parameter is on the boundary of the parameter space under the null. Using non-standard asymptotic theory developed by Self and Liang (1987)

for independent data, one may be tempted to conclude that the finite-sample distribution of the (R)LRT could be approximated by a $0.5\chi^2_0 + 0.5\chi^2_1$ mixture. Here $\chi^2_k$ is the chi-square distribution with k degrees of freedom and $\chi^2_0$ denotes point probability mass at 0. However, the results of Self and Liang (1987) require independence for all values of the parameter. Because the response variable Y in model (1) is not a vector of independent random variables, this theory does not apply. Stram and Lee (1994) showed that the Self and Liang result can still be applied to testing for zero variance of random effects in LMMs in which the response variable Y can be partitioned into independent sub-vectors and the number of independent sub-vectors tends to infinity. In a simulation study for a related model, Pinheiro and Bates (2000) found that a $0.5\chi^2_0 + 0.5\chi^2_1$ mixture distribution approximates well the finite-sample distribution of the RLRT, but that a $0.65\chi^2_0 + 0.35\chi^2_1$ mixture better approximates the finite-sample distribution of the LRT. A case where it has been shown that the asymptotic mixture probabilities differ from $0.5\chi^2_0 + 0.5\chi^2_1$ is regression with a stochastic trend, analyzed by Shephard and Harvey (1990) and Shephard (1993). They consider the particular case of model (1) where the random effects b are modeled as a random walk and show that the asymptotic mass at zero can be as large as 0.96 for the LRT and 0.65 for the RLRT.

For the case when $\lambda_0 > 0$ we show that the distribution of the (RE)ML estimator of $\lambda$ has mass at zero. Therefore, even when the parameter is in the interior of the parameter space under the null, the asymptotic distributions of the (R)LRT statistics are not $\chi^2_1$. We also calculate the probability of underestimating the true parameter $\lambda_0$ and show that in penalized spline models this probability is larger than 0.5, showing that (RE)ML criteria tend to oversmooth the data. This effect is more severe for ML than for REML.

Section 6.2 of Khuri, Mathew, and Sinha (1998) studies mixed models with one variance component. (They consider the error variance as a variance component, so they call this the case of two variance components.) In their Theorem 6.2.2, they derive the LBI (locally best invariant) test of $H_0: \lambda = 0$, which rejects for large values of

F = \frac{e^T Z \Sigma Z^T e}{e^T e},  (3)

where $e = \{I - X(X^T X)^{-1} X^T\} Y$ is the residual vector from fitting the null model. A test is LBI if, among all invariant tests, it maximizes power in some neighborhood of the null hypothesis. Notice that the denominator of (3) is, except for a scale factor, the estimator of $\sigma^2$ under the null hypothesis and will be inflated by deviations from the null. This suggests that the test might have

low power at alternatives far from the null. Khuri (1994) studies the probability in a LMM that a linear combination of independent mean squares is negative. For certain balanced models, there is an estimator of $\sigma^2_b$ of this form such that the estimator is negative if and only if the (R)LRT statistic is zero; see, for example, Sections 3.7 and 3.8 of Searle, Casella, and McCulloch (1992). However, in general Khuri's results do not apply to our problem.

2 SPECTRAL DECOMPOSITION OF (R)LRT_n

Consider maximum likelihood estimation (MLE) for model (1). Twice the log-likelihood of Y given the parameters $\beta$, $\sigma^2_\epsilon$, and $\lambda$ is, up to a constant that does not depend on the parameters,

L(\beta, \sigma^2_\epsilon, \lambda) = -n \log \sigma^2_\epsilon - \log |V_\lambda| - \frac{(Y - X\beta)^T V_\lambda^{-1} (Y - X\beta)}{\sigma^2_\epsilon}.  (4)

Residual or restricted maximum likelihood (REML) was introduced by Patterson and Thompson (1971) to take into account the loss in degrees of freedom due to estimation of the $\beta$ parameters and thereby to obtain unbiased variance component estimators. REML consists of maximizing the likelihood function associated with $n - p$ linearly independent error contrasts. It makes no difference which $n - p$ contrasts are used because the likelihood functions for any two such sets differ by no more than an additive constant (Harville, 1977). For the LMM described in equation (1), twice the residual log-likelihood was derived by Harville (1974) and is

REL(\sigma^2_\epsilon, \lambda) = -(n - p) \log \sigma^2_\epsilon - \log |V_\lambda| - \log |X^T V_\lambda^{-1} X| - \frac{(Y - X\widehat{\beta}_\lambda)^T V_\lambda^{-1} (Y - X\widehat{\beta}_\lambda)}{\sigma^2_\epsilon},  (5)

where $\widehat{\beta}_\lambda = (X^T V_\lambda^{-1} X)^{-1} X^T V_\lambda^{-1} Y$ maximizes the likelihood as a function of $\beta$ for a fixed value of $\lambda$. The (R)LRT statistics for testing the hypotheses described in (2) are

LRT_n = \sup_{H_A \cup H_0} L(\beta, \sigma^2_\epsilon, \lambda) - \sup_{H_0} L(\beta, \sigma^2_\epsilon, \lambda), \quad RLRT_n = \sup_{H_A \cup H_0} REL(\sigma^2_\epsilon, \lambda) - \sup_{H_0} REL(\sigma^2_\epsilon, \lambda).  (6)

Denote by $\mu_{s,n}$ and $\xi_{s,n}$ the K eigenvalues of the $K \times K$ matrices $\Sigma^{1/2} Z^T P_0 Z \Sigma^{1/2}$ and $\Sigma^{1/2} Z^T Z \Sigma^{1/2}$, respectively, where $P_0 = I_n - X(X^T X)^{-1} X^T$.
Crainiceanu and Ruppert (2003) showed that if $\lambda_0$ is the true value of the parameter then

LRT_n \stackrel{D}{=} \sup_{\lambda \in [0, \infty)} \left[ n \log \left\{ 1 + \frac{N_n(\lambda, \lambda_0)}{D_n(\lambda, \lambda_0)} \right\} - \sum_{s=1}^{K} \log \left( \frac{1 + \lambda \xi_{s,n}}{1 + \lambda_0 \xi_{s,n}} \right) \right],  (7)

RLRT_n \stackrel{D}{=} \sup_{\lambda \in [0, \infty)} \left[ (n - p) \log \left\{ 1 + \frac{N_n(\lambda, \lambda_0)}{D_n(\lambda, \lambda_0)} \right\} - \sum_{s=1}^{K} \log \left( \frac{1 + \lambda \mu_{s,n}}{1 + \lambda_0 \mu_{s,n}} \right) \right],  (8)

where $\stackrel{D}{=}$ denotes equality in distribution,

N_n(\lambda, \lambda_0) = \sum_{s=1}^{K} \frac{(\lambda - \lambda_0) \mu_{s,n}}{1 + \lambda \mu_{s,n}} w_s^2, \quad D_n(\lambda, \lambda_0) = \sum_{s=1}^{K} \frac{1 + \lambda_0 \mu_{s,n}}{1 + \lambda \mu_{s,n}} w_s^2 + \sum_{s=K+1}^{n-p} w_s^2,

and $w_s$, for $s = 1, \ldots, n - p$, are independent N(0, 1) random variables. These null finite-sample distributions are easy to simulate (Crainiceanu and Ruppert, 2003).

3 PROBABILITY MASS AT ZERO OF (R)LRT

Denote by $f(\cdot)$ and $g(\cdot)$ the functions to be maximized in equations (7) and (8), respectively. Note that the probability mass at zero of LRT_n or RLRT_n equals the probability that the function $f(\cdot)$ or $g(\cdot)$ has a global maximum at $\lambda = 0$. For a given sample size we compute the exact probability of having a local maximum of $f(\cdot)$ or $g(\cdot)$ at $\lambda = 0$. This probability is an upper bound for the probability of having a global maximum at zero but, as we will show using simulations, it provides an excellent approximation. The first-order condition for having a local maximum of $f(\cdot)$ at $\lambda = 0$ is $f'(0) \le 0$, where the derivative is taken from the right. The finite-sample probability of a local maximum at $\lambda = 0$ for ML, when $\lambda_0$ is the true value of the parameter, is

P \left\{ \frac{\sum_{s=1}^{K} (1 + \lambda_0 \mu_{s,n}) \mu_{s,n} w_s^2}{\sum_{s=1}^{K} (1 + \lambda_0 \mu_{s,n}) w_s^2 + \sum_{s=K+1}^{n-p} w_s^2} \le \frac{1}{n} \sum_{s=1}^{K} \xi_{s,n} \right\},  (9)

where $\mu_{s,n}$ and $\xi_{s,n}$ are the eigenvalues of the $K \times K$ matrices $\Sigma^{1/2} Z^T P_0 Z \Sigma^{1/2}$ and $\Sigma^{1/2} Z^T Z \Sigma^{1/2}$, respectively, and $w_s$ are i.i.d. N(0, 1) random variables. If $\lambda_0 = 0$ then the probability of a local maximum at $\lambda = 0$ is

P \left\{ \frac{\sum_{s=1}^{K} \mu_{s,n} w_s^2}{\sum_{s=1}^{n-p} w_s^2} \le \frac{1}{n} \sum_{s=1}^{K} \xi_{s,n} \right\}.  (10)

Using similar derivations for REML, the probability mass at zero when $\lambda = \lambda_0$ is

P \left\{ \frac{\sum_{s=1}^{K} (1 + \lambda_0 \mu_{s,n}) \mu_{s,n} w_s^2}{\sum_{s=1}^{K} (1 + \lambda_0 \mu_{s,n}) w_s^2 + \sum_{s=K+1}^{n-p} w_s^2} \le \frac{1}{n - p} \sum_{s=1}^{K} \mu_{s,n} \right\},  (11)

and, in the particular case when $\lambda_0 = 0$, the probability of a local maximum at $\lambda = 0$ is

P \left\{ \frac{\sum_{s=1}^{K} \mu_{s,n} w_s^2}{\sum_{s=1}^{n-p} w_s^2} \le \frac{1}{n - p} \sum_{s=1}^{K} \mu_{s,n} \right\}.  (12)
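Probabilities such as (12) are straightforward to simulate once the eigenvalues are in hand. A minimal Python sketch (an illustration, not the authors' code), using the balanced one-way ANOVA eigenvalues derived in Section 5, namely K − 1 eigenvalues equal to J and one equal to zero:

```python
import numpy as np

def prob_local_max_at_zero_reml(mu, n, p, n_sims=200_000, seed=1):
    """Monte Carlo estimate of probability (12):
    P{ sum_s mu_s w_s^2 / sum_{s=1}^{n-p} w_s^2 <= (1/(n-p)) sum_s mu_s },
    with w_s i.i.d. N(0, 1)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    w2 = rng.standard_normal((n_sims, n - p)) ** 2
    num = w2[:, :K] @ mu          # sum_{s=1}^K mu_s w_s^2
    den = w2.sum(axis=1)          # sum_{s=1}^{n-p} w_s^2
    return np.mean(num / den <= mu.sum() / (n - p))

# Balanced one-way ANOVA, K = 5 levels, J = 5 observations per level:
# eigenvalues of Z^T P_0 Z are J (multiplicity K - 1) and 0 (Section 5).
K, J = 5, 5
mu = [J] * (K - 1) + [0]
p_zero = prob_local_max_at_zero_reml(mu, n=K * J, p=1)
```

For K = J = 5 the estimate should land near the exact value $P\{F_{4,20} \le 1\} \approx 0.569$ reported for REML in Table 1.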

Once the eigenvalues $\mu_{s,n}$ and $\xi_{s,n}$ are computed, explicitly or numerically, these probabilities can be simulated. Algorithms for computing the distribution of a linear combination of $\chi^2_1$ random variables, developed by Davies (1980) and Farebrother (1990), could also be used, but we used simulations because they are simple, accurate, and easy to program. For K = 20 we obtained 1 million simulations in 1 minute (2.66 GHz, 1 Mb RAM).

The probabilities in equations (9) and (11) are the probabilities that $\lambda = 0$ is a local maximum and provide approximations of the probabilities that $\lambda = 0$ is a global maximum. The latter is equal to the finite-sample probability mass at zero of the (R)LRT and of the (RE)ML estimator of $\lambda$ when the true value of the parameter is $\lambda_0$. For every value $\lambda_0$ we can compute the probability of a local maximum at $\lambda = 0$ for (RE)ML using the corresponding equation (9) or (11). However, there is no closed form for the probability of a global maximum at $\lambda = 0$, so we use simulation of the finite-sample distributions of the (R)LRT_n statistics described in equations (7) and (8). In Sections 5 and 6 we show that there is close agreement between the probabilities of a local and of a global maximum at $\lambda = 0$ for two examples: balanced one-way ANOVA and penalized spline models.

4 PROBABILITY OF UNDERESTIMATING THE SIGNAL-TO-NOISE PARAMETER

Denote by $\widehat{\lambda}_{ML}$ and $\widehat{\lambda}^1_{ML}$ the global and the first local maximum of $f(\cdot)$, respectively. Define $\widehat{\lambda}_{REML}$ and $\widehat{\lambda}^1_{REML}$ similarly using $g(\cdot)$. When $\lambda_0$ is the true value of the signal-to-noise parameter,

P\left(\widehat{\lambda}^1_{ML} < \lambda_0\right) \approx P\left\{ \frac{\partial}{\partial \lambda} f(\lambda) \Big|_{\lambda = \lambda_0} < 0 \right\} = P\left\{ \sum_{s=1}^{K} c_{s,n}(\lambda_0) w_s^2 < \frac{1}{n} \sum_{s=1}^{n-p} w_s^2 \right\},  (13)

where

c_{s,n}(\lambda_0) = \frac{\mu_{s,n}}{1 + \lambda_0 \mu_{s,n}} \Big/ \sum_{s=1}^{K} \frac{\xi_{s,n}}{1 + \lambda_0 \xi_{s,n}}.

Similarly, for REML we obtain

P\left(\widehat{\lambda}^1_{REML} < \lambda_0\right) \approx P\left\{ \frac{\partial}{\partial \lambda} g(\lambda) \Big|_{\lambda = \lambda_0} < 0 \right\} = P\left\{ \sum_{s=1}^{K} d_{s,n}(\lambda_0) w_s^2 < \frac{1}{n - p} \sum_{s=1}^{n-p} w_s^2 \right\},  (14)

where

d_{s,n}(\lambda_0) = \frac{\mu_{s,n}}{1 + \lambda_0 \mu_{s,n}} \Big/ \sum_{s=1}^{K} \frac{\mu_{s,n}}{1 + \lambda_0 \mu_{s,n}}.

Denote by $p_{ML}(\lambda_0)$ and $p_{REML}(\lambda_0)$ the probabilities appearing on the right-hand sides of equations (13) and (14). Our hope is that $P(\widehat{\lambda}_{(RE)ML} < \lambda_0)$ is well approximated by $P(\widehat{\lambda}^1_{(RE)ML} < \lambda_0)$ which, in turn, is well approximated by $p_{(RE)ML}(\lambda_0)$. While no general proof is available for these results, in Sections 5.2 and 6.3 we show that these approximations are very good, at least for balanced one-way ANOVA and penalized spline models.

We now develop large-$\lambda_0$ asymptotic approximations to $p_{ML}(\lambda_0)$ and $p_{REML}(\lambda_0)$ that will be used in Section 6.3. If $\mu_{s,n} = 0$ then $c_{s,n} = d_{s,n} = 0$ for all values of $\lambda_0$. If $\mu_{s,n} > 0$ then

\lim_{\lambda_0 \to \infty} c_{s,n}(\lambda_0) = 1/K_\xi \quad \text{and} \quad \lim_{\lambda_0 \to \infty} d_{s,n}(\lambda_0) = 1/K_\mu,

where $K_\xi$ and $K_\mu$ are the numbers of non-zero eigenvalues $\xi_{s,n}$ and $\mu_{s,n}$, respectively. Therefore

\lim_{\lambda_0 \to \infty} p_{ML}(\lambda_0) = P\left( F_{K_\xi,\, n-p-K_\xi} < \frac{n - p - K_\xi}{n - K_\xi} \right) \quad \text{and} \quad \lim_{\lambda_0 \to \infty} p_{REML}(\lambda_0) = P\left( F_{K_\mu,\, n-p-K_\mu} < 1 \right),  (15)

where $F_{r,s}$ denotes an F-distributed random variable with (r, s) degrees of freedom.

5 ONE-WAY ANOVA

Consider the balanced one-way ANOVA model with K levels and J observations per level:

Y_{ij} = \mu + b_i + \epsilon_{ij}, \quad i = 1, \ldots, K, \quad j = 1, \ldots, J,  (16)

where the $\epsilon_{ij}$ are i.i.d. N(0, $\sigma^2_\epsilon$) random variables, the $b_i$ are i.i.d. N(0, $\sigma^2_b$) random effects independent of the $\epsilon_{ij}$, $\mu$ is a fixed unknown intercept, and, as before, $\lambda = \sigma^2_b/\sigma^2_\epsilon$. The matrix X for fixed effects is simply a $JK \times 1$ column of ones and the matrix Z is a $JK \times K$ matrix in which every column contains only zeros except for a J-dimensional vector of 1's corresponding to the level parameter. For this model $\Sigma = I_K$, p = 1, and n = JK is the total number of observations. An important characteristic of this model is that one can explicitly calculate the eigenvalues of the matrices $Z^T P_0 Z$ and $Z^T Z$. Direct calculation shows that one eigenvalue of $Z^T P_0 Z$ equals zero and the remaining K − 1 eigenvalues are $\mu_{s,n} = J$. Also, all K eigenvalues of $Z^T Z$ are equal: $\xi_{s,n} = J$.
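These eigenvalue formulas are easy to check numerically. A small sketch (the values of K and J below are illustrative choices):

```python
import numpy as np

K, J = 5, 4           # K levels, J observations per level (illustrative)
n = J * K
X = np.ones((n, 1))                      # fixed-effects design: intercept only
Z = np.kron(np.eye(K), np.ones((J, 1)))  # one indicator column per level
P0 = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

mu = np.sort(np.linalg.eigvalsh(Z.T @ P0 @ Z))  # eigenvalues of Z^T P_0 Z
xi = np.sort(np.linalg.eigvalsh(Z.T @ Z))       # eigenvalues of Z^T Z

# One eigenvalue of Z^T P_0 Z is 0 and the remaining K - 1 equal J;
# all K eigenvalues of Z^T Z equal J.
```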

5.1 PROBABILITY MASS AT ZERO OF LRT AND RLRT

For the balanced one-way design, using equation (9) one obtains the probability of a local maximum at $\lambda = 0$ for the ML estimator of $\lambda$, when $\lambda_0$ is the true value:

P\left\{ F_{K-1,\, n-K} \le \frac{K}{K - 1} \cdot \frac{1}{1 + \lambda_0 J} \right\}.  (17)

Similarly, for REML we obtain the probability of a local maximum at $\lambda = 0$:

P\left\{ F_{K-1,\, n-K} \le \frac{1}{1 + \lambda_0 J} \right\}.  (18)

These results are known; see equations (119) and (147) of Searle, Casella, and McCulloch (1992). Table 1 shows the finite-sample probability of a global and of a local maximum at $\lambda = 0$ for ML and REML. The probability of a global maximum is reported within parentheses. It represents the frequency of estimating $\lambda = 0$ for different true values $\lambda_0$ in 1 million simulations of the distributions described in equations (7) and (8) for K = 5 levels and different numbers of observations J per level. The probability of a local maximum is calculated using equations (17) and (18). There is very close agreement between the probabilities of a global and of a local maximum at $\lambda = 0$ for both criteria and for all values of the true parameter considered.

Suppose that we want to test for no level effect, that is, $H_0: \lambda = 0$ vs. $H_A: \lambda > 0$. The probability mass at zero of the (R)LRT_n is equal to the probability of a global maximum at $\lambda = 0$. The probability mass at zero under the alternative ($\lambda_0 > 0$) is larger for LRT_n than for RLRT_n, suggesting that RLRT_n may have better power properties than LRT_n.

We focus now on the properties of the null asymptotic distributions of (R)LRT_n for testing the null hypothesis of zero random effects variance. Because the response variable Y can be partitioned into K J-dimensional i.i.d. sub-vectors corresponding to the levels, when the number of levels K increases to infinity the asymptotic distribution is $0.5\chi^2_0 + 0.5\chi^2_1$. However, in applications both the number of levels K and the number of observations per level J are fixed. If K is small or moderate (< 100) then the 0.5 approximation of the probability mass at zero is far from the true value. To make the comparison simple we consider the case $J \to \infty$ in equation (19) and obtain the null asymptotic probability mass at zero:

P_{ML}(K) = P\{X_{K-1} < K\} \quad \text{and} \quad P_{REML}(K) = P(X_{K-1} < K - 1),

Table 1: Probability of having a local (global) maximum at $\lambda = 0$ for ML and REML. The number of levels is K = 5.

                              ML                                            |                             REML
J\λ0    0              0.01           0.1            1              |  0              0.01           0.1            1
5       0.678 (0.678)  0.655 (0.655)  0.480 (0.480)  0.069 (0.069)  |  0.569 (0.570)  0.545 (0.545)  0.377 (0.377)  0.047 (0.047)
10      0.696 (0.696)  0.648 (0.648)  0.353 (0.353)  0.023 (0.023)  |  0.583 (0.582)  0.533 (0.532)  0.264 (0.264)  0.015 (0.015)
20      0.705 (0.704)  0.610 (0.609)  0.204 (0.204)  0.007 (0.007)  |  0.588 (0.588)  0.493 (0.492)  0.145 (0.145)  0.004 (0.004)
40      0.709 (0.709)  0.531 (0.529)  0.091 (0.090)  0.002 (0.002)  |  0.591 (0.591)  0.417 (0.417)  0.062 (0.062)  0.001 (0.001)

Notes: The finite-sample probability of having a global maximum (the probability mass at zero of LRT_n and RLRT_n, respectively) is reported within parentheses. It represents the frequency of estimating $\lambda = 0$ for different true values $\lambda_0$ in 1 million simulations of the distributions described in equations (7) and (8) for K = 5 levels and different numbers of observations J per level. The standard deviation of each of these estimated probabilities is at most 0.0005.

where $X_r$ denotes a random variable with a $\chi^2$ distribution with r degrees of freedom. Figure 1 shows $P_{ML}(K)$ and $P_{REML}(K)$ versus K. By the central limit theorem both $P_{ML}(K)$ and $P_{REML}(K)$ tend to 0.5, but for K < 100 these probabilities are much larger than 0.5. Indeed, $P_{ML}(5) = 0.713$, $P_{ML}(10) = 0.650$, $P_{ML}(20) = 0.605$, and $P_{ML}(100) = 0.547$.

5.2 PROBABILITY OF UNDERESTIMATING THE SMOOTHING PARAMETER

We now investigate the probability of underestimating $\lambda_0$ using ML and REML when $\lambda_0$ is the true value and the design is balanced. It is easy to see that, in equations (13) and (14), $c_{s,n}(\lambda_0) = 1/K$ and $d_{s,n}(\lambda_0) = 1/(K - 1)$ for $s = 1, \ldots, K - 1$, and $c_{K,n}(\lambda_0) = d_{K,n}(\lambda_0) = 0$. Therefore

p_{ML}(\lambda_0) = P\{F_{K-1,\, n-K} < K/(K - 1)\} \quad \text{and} \quad p_{REML}(\lambda_0) = P(F_{K-1,\, n-K} < 1),  (19)

which are the probabilities obtained from the first-order conditions and do not depend on $\lambda_0$.
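The large-J limits $P_{ML}(K)$ and $P_{REML}(K)$ above can be checked with a few lines of code. A sketch using the closed-form chi-square CDF, which is available in elementary terms when the degrees of freedom K − 1 are even (so K must be odd here; K = 5 is shown):

```python
import math

def chi2_cdf_even_df(x, df):
    """Chi-square CDF for even df:
    P(X_df <= x) = 1 - exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!"""
    assert df % 2 == 0
    m = df // 2
    return 1.0 - math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j)
                                        for j in range(m))

K = 5  # K - 1 = 4 degrees of freedom (even)
p_ml = chi2_cdf_even_df(K, K - 1)        # P_ML(K)   = P{X_{K-1} < K}
p_reml = chi2_cdf_even_df(K - 1, K - 1)  # P_REML(K) = P{X_{K-1} < K - 1}
```

Both values exceed 0.5: p_ml evaluates to about 0.713, matching $P_{ML}(5) = 0.713$ quoted above, and p_reml to about 0.594.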
Table 2 displays these probabilities for K = 5 levels and several values of J, and compares them with the exact probability of underestimating $\lambda_0$ calculated using 1 million simulations of the distributions described in equations (7) and (8). The latter is reported within parentheses. We used $\lambda_0 = 1$, but similar results were obtained for other values of $\lambda_0$. There is close agreement between these probabilities, and ML underestimates $\lambda_0$ much more frequently than REML.
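The probabilities in equation (19) also have an elementary closed form when both F degrees of freedom are even, via the regularized incomplete beta function with integer arguments. A sketch reproducing the J = 5 row of Table 2:

```python
import math

def f_cdf_int(f, d1, d2):
    """P(F_{d1,d2} <= f) via the regularized incomplete beta function
    I_x(a, b) with a = d1/2, b = d2/2 integers:
    I_x(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) x^j (1-x)^(a+b-1-j)."""
    a, b = d1 // 2, d2 // 2
    assert d1 == 2 * a and d2 == 2 * b
    x = d1 * f / (d1 * f + d2)
    m = a + b - 1
    return sum(math.comb(m, j) * x**j * (1 - x) ** (m - j)
               for j in range(a, m + 1))

K, J = 5, 5
n = K * J
p_ml = f_cdf_int(K / (K - 1), K - 1, n - K)  # P{F_{4,20} < 5/4}
p_reml = f_cdf_int(1.0, K - 1, n - K)        # P{F_{4,20} < 1}
```

These evaluate to about 0.678 and 0.569, matching the first row of Table 2.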

Calculations for the balanced one-way ANOVA model can be done analytically because the eigenvalues µ s,n and ξ s,n can be calculated explicitly and have a particularly simple form. Standard asymptotic theory for a parameter on the boundary holds when K and J are large but fails when K is moderate and J is large. Crainiceanu and Ruppert (2003) suggest using the finite sample distributions described in equations (7) and (8), which are very easy to simulate. Table 2: Probability of underestimating the true value of the signal-to-noise ratio parameter λ 0 for ML and REML. The number of levels is K = 5. J 5 10 20 40 ML REML 0.678 (0.677) 0.569 (0.568) 0.696 (0.695) 0.583 (0.581) 0.705 (0.704) 0.588 (0.587) 0.709 (0.708) 0.591 (0.590) Notes: The finite sample probability of underestimating λ 0 is reported within parentheses. It represents the frequency of estimating λ < λ 0 for different true values λ 0 in 1 million simulations of the distributions described in equations (7) or (8) for K = 5 levels and different number of observations J per level. λ 0 = 1 but other values give similar results. The standard deviation of each of these estimated probabilities is at most 0.0005. 5.3 THE UNBALANCED ONE-WAY DESIGN For unbalanced data, Searle, Casella, and McCulloch (1992, p. 88) state that the probability mass at zero cannot be easily specified. Apparently, this quantity cannot be expressed simply using the F-distribution. However, it is a simple case of (9). 5.4 OTHER TESTS Khuri, Mathew, and Sinha (1998) discuss several approaches to testing in LMMs besides (R)LRTs. Wald s variance component test is simply the F-test assuming that all parameters are fixed effects. Under the null hypothesis the test statistic has an exact F-distribution even for unbalanced data. For the balanced one-way design, Wald s test is UMPS (uniformly most powerful similar) and therefore UMPU (uniformly most powerful unbiased) and is also UMPI (uniformly most powerful invariant). 
In the case of the unbalanced one-way design, there are no UMPS, UMPU, or UMPI tests. However, there is an LBI (locally best invariant) test, which was derived by Das and Sinha

(1987). The test statistic, which is given by equation (3) or (4.2.8) of Khuri, Mathew, and Sinha (1998), is a ratio of quadratic forms in Y, so percentiles of its distribution can be found by Davies's (1980) algorithm.

6 TESTING POLYNOMIAL REGRESSION VERSUS A NONPARAMETRIC ALTERNATIVE

In this section we show that nonparametric regression using P-splines is equivalent to a particular LMM. In this context, the smoothing parameter is the ratio between the random effects and error variances, and testing assumptions about the shape of the regression function is equivalent to testing hypotheses about the smoothing parameter. We first focus on testing a polynomial regression against a general alternative modeled by penalized splines, which is equivalent to testing for a zero smoothing parameter (or zero random effects variance). For this hypothesis, we study the probability mass at zero of the (R)LRT_n statistics under the null and alternative hypotheses. In particular, we show that the null probability mass at zero is much larger than 0.5. Because in penalized spline models the data vector cannot be partitioned into more than one i.i.d. subvector, the Self and Liang assumptions do not hold, but it was an open question whether the results themselves held. Our results show that the 0.5 : 0.5 mixture of χ² distributions cannot be extended to approximate the finite sample distributions of (R)LRT_n, regardless of the number of knots used. We also investigate the probability of underestimating the true smoothing parameter, which in the context of penalized spline smoothing is the probability of oversmoothing. We show that the first order results described in Section 4 provide excellent approximations of the exact probability of oversmoothing and that the probability of oversmoothing with (RE)ML is generally larger than 0.5.
6.1 P-SPLINES REGRESSION AND LINEAR MIXED MODELS

Consider the following regression equation

y_i = m(x_i) + ε_i,    (20)

where the ε_i are i.i.d. N(0, σ²_ε) and m(·) is the unknown mean function. Suppose that we are interested in testing whether m(·) is a p-th degree polynomial:

H_0 : m(x) = β_0 + β_1 x + ... + β_p x^p.

To define an alternative that is flexible enough to describe a large class of functions, we consider the class of regression splines

H_A : m(x) = m(x, Θ) = β_0 + β_1 x + ... + β_p x^p + Σ_{k=1}^K b_k (x − κ_k)^p_+,    (21)

where Θ = (β_0, ..., β_p, b_1, ..., b_K)^T is the vector of regression coefficients, β = (β_0, ..., β_p)^T is the vector of polynomial parameters, b = (b_1, ..., b_K)^T is the vector of spline coefficients, and κ_1 < κ_2 < ... < κ_K are fixed knots. Following Gray (1994) and Ruppert (2002), we consider a number of knots that is large enough (e.g., 20) to ensure the desired flexibility. The knots are taken to be sample quantiles of the x's, with κ_k corresponding to probability k/(K + 1). To avoid overfitting, the criterion to be minimized is a penalized sum of squares

Σ_{i=1}^n {y_i − m(x_i; Θ)}² + (1/λ) Θ^T W Θ,    (22)

where λ ≥ 0 is the smoothing parameter and W is a positive semi-definite matrix. Denote Y = (y_1, y_2, ..., y_n)^T, let X be the matrix with i-th row X_i = (1, x_i, ..., x_i^p), let Z be the matrix with i-th row Z_i = {(x_i − κ_1)^p_+, (x_i − κ_2)^p_+, ..., (x_i − κ_K)^p_+}, and let C = [X Z]. In this paper we focus on matrices W of the form

W = [ 0_{(p+1)×(p+1)}   0_{(p+1)×K} ;
      0_{K×(p+1)}       Σ^{−1}      ],

where Σ is a positive definite matrix and 0_{m×l} is an m × l matrix of zeros. This type of matrix W penalizes only the coefficients of the spline basis functions (x − κ_k)^p_+ and will be used in the remainder of the paper. A standard choice is Σ = I_K, but other matrices can be used according to the specific application. If criterion (22) is divided by σ²_ε, one obtains

(1/σ²_ε) ‖Y − Xβ − Zb‖² + (1/(λσ²_ε)) b^T Σ^{−1} b.    (23)

Define σ²_b = λσ²_ε, consider the vector β as a set of unknown fixed parameters, and treat the vector b as a set of random parameters with E(b) = 0 and cov(b) = σ²_b Σ.
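The minimizer of (22) is a generalized ridge estimator, Θ̂ = (CᵀC + W/λ)⁻¹CᵀY with C = [X Z]. A sketch with Σ = I_K and quantile knots (function and variable names are ours):

```python
import numpy as np

def pspline_fit(x, y, p=1, K=20, lam=1.0):
    """Minimize ||y - C @ theta||^2 + (1/lam) * theta' W theta, where W
    penalizes only the truncated power coefficients (Sigma = I_K, p >= 1)."""
    knots = np.quantile(x, np.arange(1, K + 1) / (K + 1))   # kappa_k at prob k/(K+1)
    X = np.vander(x, p + 1, increasing=True)                # 1, x, ..., x^p
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** p   # (x - kappa_k)_+^p
    C = np.hstack([X, Z])
    W = np.zeros((p + 1 + K, p + 1 + K))
    W[p + 1:, p + 1:] = np.eye(K)                           # penalize b only
    theta = np.linalg.solve(C.T @ C + W / lam, C.T @ y)
    return C @ theta

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)
fit = pspline_fit(x, y, p=1, K=20, lam=10.0)
```

Smaller values of λ shrink the b_k toward zero (heavier smoothing, approaching the p-th degree polynomial fit), while larger values approach the unpenalized regression spline.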
If (b^T, ε^T)^T is a normal random vector and b and ε are independent, then one obtains an equivalent model representation of the penalized spline in the form of an LMM (Brumback, Ruppert, and Wand, 1999; Ruppert, Wand, and Carroll, 2003):

Y = Xβ + Zb + ε,    cov( b ; ε ) = [ σ²_b Σ   0 ;
                                     0         σ²_ε I_n ].    (24)
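Under representation (24), λ can be selected by maximizing the restricted likelihood. A grid-search sketch with Σ = I_K, profiling σ²_ε and β out of the restricted log-likelihood (our own implementation; additive constants are dropped):

```python
import numpy as np

def restricted_loglik(lam, X, Z, y):
    """Profiled restricted log-likelihood at lam = sigma_b^2 / sigma_eps^2,
    up to an additive constant, for Y = X beta + Z b + eps with Sigma = I."""
    n, q = X.shape
    V = np.eye(n) + lam * Z @ Z.T
    Vinv = np.linalg.inv(V)
    XtVX = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(XtVX, X.T @ Vinv)
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(XtVX)
    return -0.5 * (logdetV + logdetXVX + (n - q) * np.log(y @ P @ y))

# toy example: truncated-line spline design, true lam0 = 0.5
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)
knots = np.quantile(x, np.arange(1, 21) / 21)
X = np.column_stack([np.ones_like(x), x])
Z = np.maximum(x[:, None] - knots[None, :], 0.0)
y = X @ [1.0, 2.0] + Z @ rng.normal(0.0, np.sqrt(0.5), 20) + rng.normal(0.0, 1.0, 100)

grid = np.concatenate([[0.0], np.logspace(-4, 4, 161)])
rl = np.array([restricted_loglik(l, X, Z, y) for l in grid])
lam_hat = grid[rl.argmax()]
rlrt = 2.0 * (rl.max() - rl[0])   # RLRT_n for H0: lam = 0
```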

More specifically, the P-spline model is equivalent to the LMM in the following sense. Given a fixed value of λ = σ²_b/σ²_ε, the P-spline fit is equal to the BLUP of the regression function in the LMM. The P-spline model and the LMM may differ in how λ is estimated. In the LMM it would be estimated by ML or REML. In the P-spline model, λ could be determined by cross-validation, generalized cross-validation, or some other method for selecting a smoothing parameter. However, using ML or REML to select the smoothing parameter is an effective method and we will use it. There is naturally some concern about modeling a regression function by assuming that (b^T, ε^T)^T is a normal random vector with cov(b) = σ²_b Σ. However, under the null hypotheses of interest, b = 0, so this assumption does hold with σ²_b = 0. Therefore, this concern is not relevant to the problem of testing the null hypothesis of a parametric model. One can also view the LMM interpretation of a P-spline model as a hierarchical Bayesian model and the assumption about (b^T, ε^T)^T as part of the prior. This is analogous to the Bayesian interpretation of smoothing splines pioneered by Wahba (1978, 1990). In this context, testing for a polynomial fit against a general alternative described by a P-spline is equivalent to testing

H_0 : λ = 0 (σ²_b = 0)  vs.  H_A : λ > 0 (σ²_b > 0).

Given the LMM representation of a P-spline model, we can define LRT_n and RLRT_n for these hypotheses as described in Section 2. Because the b_k's have mean zero, σ²_b = 0 under H_0 is equivalent to all coefficients b_k of the truncated power functions being identically zero. These coefficients account for departures from a polynomial. For the P-spline model, Wald's variance component test mentioned in Section 5.4 would be the F-test of polynomial regression versus a regression spline viewed as a fixed effects model. The fit under the alternative would be ordinary least squares.
Because there would be no smoothing, it seems unlikely that this test would be satisfactory and, perhaps for this reason, it has not been studied, at least as far as we are aware.

6.2 PROBABILITY MASS AT ZERO OF (R)LRT

In this section we compute the probability that (R)LRT_n is 0 when testing a polynomial regression against a general alternative modeled by penalized splines. We consider testing for a constant mean, p = 0, versus the alternative of a piecewise constant spline, and for a linear polynomial,

p = 1, versus the alternative of a linear spline. For illustration we analyze the case where the x's are equally spaced on [0, 1] and K = 20 knots are used, but the same procedure applies more generally. Once the eigenvalues µ_{s,n} and ξ_{s,n} of the matrices Z^T P_0 Z and Z^T Z are calculated numerically, the probability of a local maximum at zero for (R)LRT_n is computed using equations (9) and (11). Results are reported in Tables 3 and 4. We also report, in parentheses, the estimated probabilities of a global maximum at zero. As for one-way ANOVA, we used 1 million simulations from the spectral form of the distributions described in equations (7) and (8). For RLRT_n there is close agreement between the probabilities of a local and of a global maximum at zero for every value of the parameter λ_0. For LRT_n the two probabilities are very close when λ_0 = 0, but when λ_0 > 0 the probability of a local maximum at zero is much larger than the probability of a global maximum. This happens because the likelihood function can be decreasing in a neighborhood of zero but have a global maximum in the interior of the parameter space. The restricted likelihood function can exhibit the same behavior but does so less often. An important observation is that LRT_n has almost all its mass at zero: 0.92 for p = 0 and 0.99 for p = 1. This makes the construction of an LRT very difficult, if not impossible, especially when testing linearity against a general alternative. Estimating λ to be zero with high probability when the true value is λ_0 = 0 is a desirable property of the likelihood function. However, continuing to estimate λ to be zero with high probability when the true value is λ_0 > 0 (e.g., 0.78 for n = 100, λ_0 = 1, and p = 1) suggests that the power of the LRT_n can be poor. The RLRT_n has less mass at zero, 0.65 for p = 0 and 0.67 for p = 1, thus allowing the construction of tests.
Also, the probability of estimating a zero smoothing parameter when the true parameter λ_0 > 0 is much smaller (note the different scales in Tables 3 and 4), indicating that the RLRT is probably more powerful than the LRT. In a simulation study, Crainiceanu, Ruppert, Claeskens, and Wand (2003) showed that this is indeed the case. The columns corresponding to λ_0 = 0 in Tables 3 and 4 show that the 0.5 approximation of the probability mass at zero of (R)LRT_n is very poor for K = 20 knots, regardless of the number of observations. Using an analogy with the balanced one-way ANOVA case, one may be tempted to believe that increasing the number of knots K will improve the 0.5 approximation. To address this question we calculate the asymptotic probability mass at zero when the number of observations n tends to infinity and the number of knots K is fixed.
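The asymptotic probabilities of this kind depend on the eigenvalues only through their relative sizes, so they can be approximated by plugging in the finite sample eigenvalues µ_{s,n} and ξ_{s,n} for a moderately large n. A sketch for the piecewise constant case, p = 0 with K = 20 knots (our own numerical check, using the null probability expressions implied by the first order conditions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, nsim = 400, 20, 200_000

x = np.linspace(0.0, 1.0, n)
knots = np.quantile(x, np.arange(1, K + 1) / (K + 1))
Z = (x[:, None] > knots[None, :]).astype(float)   # piecewise constant basis
X = np.ones((n, 1))                                # null model: constant mean
P0 = np.eye(n) - X @ X.T / n

mu = np.linalg.eigvalsh(Z.T @ P0 @ Z)              # mu_{s,n}
xi = np.linalg.eigvalsh(Z.T @ Z)                   # xi_{s,n}

w2 = rng.chisquare(1, size=(nsim, K))              # w_s^2, iid chi-squared(1)
q = w2 @ mu
p_lrt = np.mean(q <= xi.sum())                     # approximate mass at zero, LRT
p_rlrt = np.mean(q <= mu.sum())                    # approximate mass at zero, RLRT
print(p_lrt, p_rlrt)                               # roughly 0.95 and 0.65
```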

Table 3: Probability of having a local and a global maximum at λ = 0 for LRT_n

                  p = 0                                          p = 1
n     λ_0 = 0        λ_0 = 0.1      λ_0 = 1         λ_0 = 0         λ_0 = 1         λ_0 = 10
50    0.953 (0.917)  0.430 (0.312)  0.114 (0.021)   >0.999 (0.993)  >0.999 (0.891)  0.9941 (0.422)
100   0.954 (0.919)  0.271 (0.158)  0.0365 (0.002)  >0.999 (0.994)  0.999 (0.778)   0.773 (0.262)
200   0.954 (0.921)  0.146 (0.060)  0.008 (0)       >0.999 (0.994)  0.972 (0.623)   0.545 (0.143)
400   0.954 (0.923)  0.064 (0.015)  0.001 (0)       >0.999 (0.995)  0.874 (0.456)   0.381 (0.067)

Notes: The standard deviation of each of these estimated probabilities is at most 0.0005.

Consider the example of testing for a constant mean versus a general alternative modeled by a piecewise constant spline with equally spaced observations and K knots. In Appendix A1 we show that µ_{s,n}/n → µ_s and ξ_{s,n}/n → ξ_s, where µ_s and ξ_s are the eigenvalues of two K × K matrices. Using these results in equations (10) and (12), it follows that the asymptotic probability mass at zero for LRT_n and RLRT_n is

P( Σ_{s=1}^K µ_s w_s² ≤ Σ_{s=1}^K ξ_s )   and   P( Σ_{s=1}^K µ_s w_s² ≤ Σ_{s=1}^K µ_s ),    (25)

respectively, where the w_s are independent N(0, 1) random variables. Figure 2 shows these probabilities calculated using 1 million simulations in equation (25) for numbers of knots 1 ≤ K ≤ 100. Over a wide range of numbers of knots the probabilities are practically constant, about 0.95 for ML and 0.65 for REML. Approximating the null probability of estimating λ as 0 by 0.5 is therefore very inaccurate, since 0.5 is not even correct asymptotically.

6.3 PROBABILITY OF UNDERESTIMATING THE SMOOTHING PARAMETER

We now investigate the probability of underestimating the true value of the smoothing parameter λ_0 using the ML or REML criteria. In the penalized spline context this is the probability of oversmoothing. The REML bias towards oversmoothing of regression function estimates has been discussed before in the smoothing spline literature (e.g., Efron, 2001; and Kauermann,

Table 4: Probability of having a local and a global maximum at λ = 0 for RLRT_n

                  p = 0                                          p = 1
n     λ_0 = 0        λ_0 = 0.01     λ_0 = 0.1       λ_0 = 0        λ_0 = 0.01     λ_0 = 0.1
50    0.653 (0.642)  0.474 (0.463)  0.151 (0.139)   0.671 (0.660)  0.667 (0.656)  0.625 (0.616)
100   0.654 (0.646)  0.376 (0.366)  0.074 (0.063)   0.674 (0.663)  0.664 (0.655)  0.589 (0.579)
200   0.656 (0.647)  0.266 (0.258)  0.028 (0.021)   0.675 (0.666)  0.655 (0.646)  0.529 (0.520)
400   0.656 (0.648)  0.166 (0.155)  0.008 (0.005)   0.676 (0.667)  0.638 (0.629)  0.447 (0.438)

Notes: The finite sample probability of having a global maximum (the probability mass at zero of LRT_n and RLRT_n, respectively) is reported in parentheses. It is the frequency of estimates λ̂ = 0 for different true values λ_0 in 1 million simulations of the spectral decomposition of RLRT_n in equation (8). The standard deviation of each of these estimated probabilities is at most 0.0005.

2003). Based on first order conditions, we provide a simple and accurate approximation of the finite sample probability of oversmoothing for penalized splines. The exact finite sample probability of oversmoothing can be obtained by recording the argmax at each simulation step of the distributions described in equations (7) and (8). As an illustration, we use the same examples as in Section 6.2, a piecewise constant spline, p = 0, and a linear spline, p = 1, with equally spaced observations and K = 20 knots, although the results hold more generally for unequally spaced observations and any number of knots. Table 5 shows the approximate probabilities p_ML(λ_0) and p_REML(λ_0) of underestimating λ_0 obtained using first order conditions, as described in Section 4. These values are obtained using 1 million simulations from the expressions on the right hand side of equations (13) and (14), where µ_{s,n} and ξ_{s,n} are the eigenvalues of the matrices Z^T P_0 Z and Z^T Z, respectively.
The exact probability of underestimating λ_0, reported in parentheses, is obtained using 1 million simulations from the distributions described in equations (7) and (8), respectively. Results are reported for n = 100 and n = 400 observations. Table 5 shows very close agreement between the approximate and exact probabilities of underestimating λ_0 for all values of λ_0, for both the ML and REML criteria. For both criteria, the probabilities decrease as the true value of the smoothing parameter increases, but they remain larger than 0.5 in all cases considered. Differences are more severe for small to moderate values

Table 5: Approximate and exact probability of underestimating the smoothing parameter λ_0

                           p = 0                                                                      p = 1
n (criterion)  λ_0 = 0.01     λ_0 = 0.1      λ_0 = 1        λ_0 = 10       λ_0 = 10^5     λ_0 = 0.01      λ_0 = 0.1      λ_0 = 1        λ_0 = 10       λ_0 = 10^5
100 (ML)       0.741 (0.754)  0.606 (0.609)  0.530 (0.530)  0.516 (0.515)  0.513 (0.513)  >0.999 (0.993)  0.955 (0.980)  0.829 (0.845)  0.728 (0.734)  0.501 (0.501)
400 (ML)       0.662 (0.669)  0.567 (0.567)  0.539 (0.539)  0.536 (0.536)  0.535 (0.536)  0.992 (0.990)   0.878 (0.913)  0.772 (0.783)  0.690 (0.691)  0.533 (0.533)
100 (REML)     0.616 (0.617)  0.563 (0.563)  0.531 (0.530)  0.529 (0.528)  0.528 (0.529)  0.674 (0.663)   0.669 (0.661)  0.637 (0.638)  0.595 (0.597)  0.528 (0.528)
400 (REML)     0.586 (0.588)  0.549 (0.548)  0.539 (0.539)  0.539 (0.539)  0.539 (0.539)  0.673 (0.665)   0.656 (0.655)  0.613 (0.615)  0.581 (0.582)  0.539 (0.539)

Notes: Probabilities are reported for penalized splines with n equally spaced observations in [0, 1] and K = 20 equally spaced knots. The exact probability of underestimating λ_0, reported in parentheses, is obtained using 1 million simulations from the distributions described in equations (7) and (8), respectively. The approximate probability of underestimating λ_0 is obtained using equations (13) and (14). The standard deviation of each of these estimated probabilities is at most 0.0005.

of λ_0. The approximate probabilities depend essentially on the weights c_{s,n}(λ_0) and d_{s,n}(λ_0) and converge to the probabilities described in equation (15). For example, for n = 100 and K = 20 we have K_µ = K_ξ = 20, and we obtain

lim_{λ_0→∞} p_ML(λ_0) = F_{20,79}(79/80) = 0.514   and   lim_{λ_0→∞} p_REML(λ_0) = F_{20,79}(1) = 0.528.

These values agree with the simulation results presented in Table 5 for λ_0 = 10^5.

7 LONGITUDINAL MODELS

Modeling longitudinal data is one of the most common application areas of LMMs. LRTs for null hypotheses that include zero variance components are routinely used in this context.
We focus our discussion on (R)LRTs for a zero random effects variance. Asymptotic theory for longitudinal data models was developed by Stram and Lee (1994) under the assumption that the data can be partitioned into a large number of independent subvectors. One might argue, as one referee did, that for longitudinal data it is generally possible to partition the response vector into independent clusters corresponding to subjects and that the appropriate

asymptotics is obtained by letting the number of subjects increase rather than the number of observations per subject. However, in our view, this argument has several drawbacks. The object of interest is the finite sample distribution of the test statistic. If this is available, then it should be used. If an asymptotic distribution is used, then the type of asymptotics should be the one that gives the best approximation to the finite sample distribution. In Section 7.1, for the random intercept longitudinal model, we show that if the number of subjects is less than 100, then the 50 : 50 mixture of chi-squared distributions is a poor approximation of the (R)LRT finite sample distribution, regardless of the number of observations per subject. For this example we would use the finite sample distribution of the (R)LRT statistic, as derived by Crainiceanu and Ruppert (2003). Moreover, for many commonly used semiparametric models there are random effects in the submodel for the population mean, so that the data cannot be divided into independent subvectors and the results of Stram and Lee (1994) do not apply. For example, suppose that subjects in M groups are observed and their response curves are recorded over time. The mean response of a subject can be decomposed as the corresponding group mean plus the deviation of the subject mean from the group mean. The group mean can further be decomposed as the overall mean plus the deviation of the group mean from the overall mean. In Section 7.2 we show that if the overall mean is modeled nonparametrically in this way, then the response vector cannot be partitioned into more than one independent subvector. Moreover, if the overall mean is modeled parametrically and the group deviations are modeled nonparametrically, then the response vector cannot be partitioned into more than M independent subvectors, where M rarely exceeds 5 in applications.
7.1 RANDOM INTERCEPT LONGITUDINAL MODEL

Consider the random intercept longitudinal model with K subjects and J(k) observations on subject k,

Y_kj = β_0 + Σ_{s=1}^S β_s x^(s)_kj + b_k + ε_kj,   k = 1, ..., K,  j = 1, ..., J(k),    (26)

where the ε_kj are i.i.d. N(0, σ²_ε) random variables, the b_k are i.i.d. N(0, σ²_b) random intercepts independent of the ε_kj, and λ = σ²_b/σ²_ε. In model (26), β_0 and β_s, s = 1, ..., S, are fixed unknown parameters and x^(s)_kj is the j-th observation on the k-th subject for the s-th covariate.
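The design matrices for model (26) are easy to assemble, and the eigenvalues needed for the finite sample distributions can be computed numerically. A sketch for the balanced case with one covariate x_kj = j (our own check; the spectrum reduces to the balanced one-way ANOVA values):

```python
import numpy as np

K, J = 5, 10
n = K * J

t = np.tile(np.arange(1, J + 1), K)         # x_kj = j, same times for each subject
X = np.column_stack([np.ones(n), t])        # fixed effects: intercept and time
Z = np.kron(np.eye(K), np.ones((J, 1)))     # one indicator column per subject

P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
mu = np.sort(np.linalg.eigvalsh(Z.T @ P0 @ Z))   # expect (0, J, ..., J)
xi = np.sort(np.linalg.eigvalsh(Z.T @ Z))        # expect (J, ..., J)
print(mu, xi)
```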

The fixed effects matrix X is a JK × (S + 1) matrix whose first column contains only 1's, with the remaining S columns containing the corresponding values of the covariates. The matrix Z is a JK × K matrix in which every column contains zeros except for a J-dimensional vector of 1's corresponding to the random intercept of each subject. For this model, Σ = I_K, p = S + 1, and n = Σ_{k=1}^K J(k). The eigenvalues of the matrices Z^T P_0 Z and Z^T Z can be computed numerically, and the finite sample distributions of the (R)LRT_n statistics can be simulated using the results in Section 2. In particular, properties such as the probability mass at zero and the probability of underestimating the true value of λ_0 can be obtained in seconds. In some cases these eigenvalues can be calculated explicitly. Consider the case of an equal number of observations per subject, J(k) = J for k = 1, ..., K, one covariate, S = 1, and x^(1)_kj = x_kj = j. This situation is routinely found in practice when K subjects are observed at equally spaced times and time itself is used as a covariate. It is easy to show that in this case µ_{s,n} = J for s = 1, ..., K − 1, µ_{K,n} = 0, and ξ_{s,n} = J for s = 1, ..., K. These eigenvalues are identical to the corresponding eigenvalues for the balanced one-way ANOVA model, and all distributional results discussed in Section 5 apply to this example. If the covariates are not equally spaced and/or there are unequal numbers of observations per subject, then the eigenvalues of Z^T P_0 Z cannot be computed explicitly. In particular, the F-distribution cannot be used to compute the probability mass at zero or the probability of underestimating the true signal-to-noise ratio, but our methodology can be.

7.2 NONPARAMETRIC LONGITUDINAL MODELS

Nested families of curves in longitudinal data analysis can be modeled nonparametrically using penalized splines. An important advantage of low order smoothers, such as penalized splines, is the reduction in the dimensionality of the problem.
Moreover, smoothing can be done using standard mixed model software because splines can be viewed as BLUPs in mixed models. In this context, likelihood and restricted likelihood ratio statistics can be used to test for effects of interest. For example, the hypothesis that a curve is linear is equivalent to a variance component being zero. While computing the test statistics is straightforward using standard software, the null distribution theory is complex. Modified asymptotic theory for testing whether a parameter is on the boundary of the parameter space does not apply because the data are correlated

under the alternative (usually the full model). Brumback and Rice (1998) study an important class of models for longitudinal data; in this paper we consider a subclass of those models. In particular, we assume that repeated observations are taken on each of I subjects divided into M groups. Suppose that y_ij is the j-th observation on the i-th subject, recorded at time t_ij, where 1 ≤ i ≤ I, 1 ≤ j ≤ J(i), and n = Σ_{i=1}^I J(i) is the total number of observations. Consider the nonparametric model

y_ij = f(t_ij) + f_{g(i)}(t_ij) + f_i(t_ij) + ε_ij,    (27)

where the ε_ij are independent N(0, σ²_ε) errors and g(i) denotes the group of the i-th subject. The population curve f(·), the deviation of group g(i) from the population curve, f_{g(i)}(·), and the deviation of the i-th subject's curve from its group curve, f_i(·), are modeled nonparametrically. Models similar to (27) have been studied by many other authors, e.g., Wang (1998). We model the population, group, and subject curves as penalized splines with relatively small numbers of knots:

f(t) = Σ_{s=0}^p β_s t^s + Σ_{k=1}^{K_1} b_k (t − κ_{1,k})^p_+,
f_g(t) = Σ_{s=0}^p β_{gs} t^s + Σ_{k=1}^{K_2} u_{gk} (t − κ_{2,k})^p_+,
f_i(t) = Σ_{s=0}^p a_{is} t^s + Σ_{k=1}^{K_3} v_{ik} (t − κ_{3,k})^p_+.

For the population curve, β_0, ..., β_p will be treated as fixed effects and b_1, ..., b_{K_1} as independent random coefficients distributed N(0, σ²_b). In LMMs, parameters are usually treated as random effects because they are subject-specific and the subjects have been sampled randomly. Here, b_1, ..., b_{K_1} are treated as random effects for an entirely different reason: modeling them as random specifies a Bayesian prior and allows for shrinkage that ensures smoothness of the fit. For the group curves, β_{gs}, g = 1, ..., M, s = 1, ..., p, will be treated as fixed effects and u_{gk}, k = 1, ..., K_2, as random coefficients distributed N(0, σ²_g). To make the model identifiable we impose the restriction Σ_{g=1}^M β_{gs} = 0 or β_{Ms} = 0 for s = 1, ..., p.
This model assumes that the groups are fixed, e.g., determined by fixed treatments. If the groups were chosen randomly, then the model would be changed somewhat. The parameters β_{gs},

g = 1, ..., M, s = 1, ..., p, would be treated as random effects, and the σ²_g might be assumed to be equal or to be i.i.d. from some distribution such as an inverse gamma. For the subject curves, the a_{is}, 1 ≤ i ≤ I, will be treated as independent random coefficients distributed N(0, σ²_{a,s}), which is a typical random effects assumption since the subjects are sampled randomly. Moreover, the v_{ik} will be treated as independent random coefficients distributed N(0, σ²_v). With the additional assumption that the b_k, a_{is}, v_{ik}, and ε_ij are mutually independent, the nonparametric model (27) can be rewritten as an LMM with p + M + 3 variance components,

Y = Xβ + X_G β_G + Z_b b + Z_G u_G + Z_a a + Z_v v + ε,    (28)

where β = (β_0, ..., β_p)^T, β_g = (β_{g0}, ..., β_{gp})^T, β_G = (β_1^T, ..., β_M^T)^T, u_g = (u_{g1}, ..., u_{gK_2})^T, u_G = (u_1^T, ..., u_M^T)^T, a_s = (a_{1s}, ..., a_{Is})^T, a = (a_0^T, ..., a_p^T)^T, and v = (v_{11}, ..., v_{IK_3})^T. The X and Z matrices are defined accordingly. Inspecting the form (28) of model (27) reveals some interesting features. If σ²_b > 0, then the vector Y cannot be partitioned into more than one independent subvector because of the term Z_b b. If σ²_b = 0 but at least one σ²_g > 0, then Y cannot be partitioned into more than M independent subvectors because of the term Z_G u_G. In other words, if the overall mean or one of the group means is modeled nonparametrically, then the assumptions needed to obtain results of the Self and Liang type fail to hold. When longitudinal data are modeled semiparametrically in this way and there is more than one variance component, deriving the finite sample distribution of the (R)LRT for testing that certain variance components are zero is not straightforward. The distribution of the LRT and RLRT statistics will depend on those variance components, if any, that are not completely specified by the null hypothesis. Therefore, exact tests are generally not possible.
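When the null distribution is intractable, its finite sample behavior can be approximated by a parametric bootstrap: fit the null model, simulate new responses from the fitted null, and recompute the test statistic on each simulated data set. A generic skeleton (the helpers `test_statistic` and `fit_null` are hypothetical placeholders for whatever fitting routines are used):

```python
import numpy as np

def bootstrap_pvalue(y, X, Z, test_statistic, fit_null, nboot=500, seed=0):
    """Parametric bootstrap p-value for a test whose null model is
    Y = X beta + eps with eps ~ N(0, sigma^2 I). `fit_null` returns
    (beta_hat, sigma_hat); `test_statistic` maps (y, X, Z) to a scalar."""
    rng = np.random.default_rng(seed)
    stat_obs = test_statistic(y, X, Z)
    beta0, sig0 = fit_null(y, X)
    boot = np.empty(nboot)
    for b in range(nboot):
        y_b = X @ beta0 + rng.normal(0.0, sig0, size=len(y))  # draw under H0
        boot[b] = test_statistic(y_b, X, Z)
    # add-one correction keeps the p-value away from exactly zero
    return (1 + np.sum(boot >= stat_obs)) / (nboot + 1)
```

The null here assumes all variance components vanish; if nuisance variance components remain under H0, `fit_null` must estimate them and the simulation step must draw from the corresponding mixed model.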
We recommend that the parametric bootstrap be used to approximate the finite sample distributions of the test statistics. However, the bootstrap is much more computationally intensive than the simulations we have been able to develop.

8 DISCUSSION

We considered estimation of the ratio between the random effects and error variances in an LMM with one variance component and examined the probability mass at zero of the (RE)ML estimator, as well as the probability of underestimating the true parameter λ_0. We provided simple formulae for