Restricted Likelihood Ratio Lack of Fit Tests using Mixed Spline Models


Gerda Claeskens
Texas A&M University, College Station, USA and Université catholique de Louvain, Louvain-la-Neuve, Belgium

Summary. Penalised regression spline models afford a simple mixed model representation in which variance components control the degree of non-linearity in the smooth function estimates. This motivates the study of lack of fit tests based on the restricted maximum likelihood ratio statistic, which tests whether variance components are zero versus the alternative of taking on positive values. For this one-sided testing problem a further complication is that the variance component belongs to the boundary of the parameter space under the null hypothesis. Conditions are obtained on the design of the regression spline models under which asymptotic distribution theory applies, and finite sample approximations to the asymptotic distribution are provided. Test statistics are studied for simple as well as multiple regression models.

Keywords: Boundary hypothesis, Lack of fit, Likelihood ratio test, Mixed model, Regression spline model, Restricted maximum likelihood.

Address for correspondence: Gerda Claeskens, Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays 20, B-1348 Louvain-la-Neuve, Belgium. claeskens@stat.ucl.ac.be

1 Introduction

We construct test statistics based on splines for testing a parametric null model versus a nonparametric alternative. Adaptive tests are obtained using penalised spline regression modelling (Eilers and Marx, 1996; Aerts, Claeskens and Wand, 2002), where a relatively small set of spline basis functions is used, but with a penalty for each spline coefficient. As an example, suppose we observe data (y_i, x_i), i = 1,...,n, for which we wish to test whether the mean of Y_i, with corresponding covariate x_i, is a particular parametric function of the covariate, for example a polynomial of degree q:

H_0: E(Y) = β_0 + β_1 x + ... + β_q x^q.    (1)

A nonparametric lack of fit test does not assume any particular form of the alternative model; that is, the response variable Y_i (i = 1,...,n) depends on the covariate x_i through a semiparametric model involving an arbitrary univariate function g,

Y_i = β_0 + β_1 x_i + ... + β_q x_i^q + g(x_i) + ε_i.    (2)

The function g is estimated by a regression spline estimator with smoothing parameter λ. In particular, we fit a semiparametric model

Y_i = β_0 + β_1 x_i + ... + β_q x_i^q + Σ_{k=1}^{K_n} u_k ψ_k(x_i) + ε_i,

where ψ_k, k = 1,...,K_n, are spline basis functions, for example the truncated polynomial basis consisting of piecewise continuous qth degree polynomials with knots at values κ_k. The estimator can also be formulated in terms of other bases such as the Demmler–Reinsch basis (Nychka and Cummins, 1996) and the B-spline basis (Eilers and Marx, 1996). In regression spline modelling, the knots are specified by the user. See for example Ruppert and Carroll (2000), who suggest choosing the number of knots based on quantiles of the data. Penalised spline estimation restricts the influence of the knots on the fitted model, for example by bounding the squared L_2 norm of the spline coefficients.
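The truncated polynomial basis described above is straightforward to construct. The following is a minimal sketch in Python/NumPy (the function names are mine; the example uses the linear case q = 1, ψ_k(x) = max(x − κ_k, 0)):

```python
import numpy as np

def truncated_poly_basis(x, knots, q=1):
    """Truncated polynomial spline basis: psi_k(x) = max(x - kappa_k, 0)^q."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0) ** q

def design_matrices(x, knots, q=1):
    """Fixed-effect design X = [1, x, ..., x^q] and random-effect design Z."""
    X = np.vander(np.asarray(x, dtype=float), N=q + 1, increasing=True)
    Z = truncated_poly_basis(x, knots, q)
    return X, Z

# Small example: 5 equispaced points, 3 interior knots.
x = np.linspace(0.0, 1.0, 5)
X, Z = design_matrices(x, knots=[0.25, 0.5, 0.75], q=1)
```

Each column of Z corresponds to one spline coefficient u_k; placing the knots at sample quantiles, as suggested by Ruppert and Carroll (2000), only changes the `knots` argument.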
In a mixed regression spline model we explicitly assume that the spline coefficients are independent and identically distributed random variables; see Brumback, Ruppert and Wand (1999). Mixed regression spline models have proven useful for a variety of estimation settings, including hazard rate estimation (Cai, Hyndman and Wand, 2002) and geoadditive modelling (Kammann and Wand, 2003). The construction of such a mixed effects model, where the β coefficients remain fixed, has as a particular advantage that the hypothesis test above reduces to testing whether the single variance component of the random effects, say σ_u^2, is equal to zero; the alternative

hypothesis states that σ_u^2 > 0. The variance component plays the role of the smoothing parameter. The introduction of random spline coefficients dramatically reduces the dimensionality of the testing problem. In a purely non-random model, a nonparametric test in the same setting requires testing whether all K_n spline coefficients are equal to zero. From a practical point of view, another advantage is that a test on the variance component is easily obtained using standard statistical software. The focus of this paper is on the use of the restricted likelihood ratio statistic for testing in a random spline model. Using kernel smoothing methods, Staniswalis and Severini (1991) likewise assess the fit of regression models using likelihood-based diagnostics. Orthogonal series based tests of Eubank and Hart (1992), and extensions thereof by Aerts, Claeskens and Hart (1999, 2000) using both likelihood ratio and score statistics, form another class of omnibus nonparametric tests, explicitly using the smoothing parameter, in this case the number of terms in the series, as a test statistic. Cantoni and Hastie (2002) construct a likelihood ratio-type statistic in a setting of mixed smoothing splines, thereby reformulating the hypotheses in terms of degrees of freedom. Guo (2002) uses similar tests in smoothing spline analysis of variance. Earlier work on the usage of smoothing splines for testing includes Cox, Koh, Wahba and Yandell (1988) and Cox and Koh (1989). The testing problem is nonstandard for several reasons. First of all, we wish to use the restricted likelihood ratio test for a one-sided testing problem. In our application the parameter of interest is a variance component which has a meaningful interpretation only if its value is non-negative, implying that the parameter space for σ_u^2 is the half-line of non-negative numbers [0, ∞), with the value of σ_u^2 under the null hypothesis at its boundary.
Chernoff (1954) gave a rigorous treatment of the use of (full) likelihood ratio statistics for one-sided testing, under the assumptions that data are independent and identically distributed, and that the true value belongs to the interior of the parameter space. Applications of likelihood ratio testing for one-sided alternatives in linear models can be found in Gouriéroux, Holly and Monfort (1982). Self and Liang (1987) obtain the asymptotic distribution of the (full) likelihood ratio statistic for a situation of independent and identically distributed data when the true parameter value under the null hypothesis is on the boundary of the parameter space. This result has further been applied by Stram and Lee (1994) for testing in a longitudinal mixed effects model. It is important to note that the results of Self and Liang (1987) cannot be applied to regression spline models, where the regression setting causes the distribution of the response variable Y_i to depend on the value of the covariates x_i, and in addition the random effects induce dependence between the response values. Extending the results of Geyer (1994), Vu and Zhou (1997) obtain the asymptotic distribution of likelihood ratio statistics dealing with this type of regression data. For an overview of existing methods, see Sen and Silvapulle (2002).

A novel aspect of this paper is that we particularly focus attention on the use of the profile restricted likelihood ratio test (RLRT), as opposed to likelihood ratio testing, and this within a setting of nonparametric, mixed, regression spline fitting with a growing number of knots. In Section 3 we obtain design conditions under which the results of Vu and Zhou (1997) are applicable for testing with restricted maximum likelihood statistics, and we obtain the asymptotic null distribution of the accompanying test statistic, where the probability of estimating a zero variance component plays a key role. To address the issue of the asymptotic mixing proportions in the limiting distribution, Section 4 explains how to obtain exact finite sample mixing proportions, involving the probability of obtaining zero valued variance components. We first restrict the discussion to tests involving only a single smoothing parameter; an extension to lack of fit testing in models with several smoothing parameters is presented in Section 5. Results of a simulation study are presented in Section 6.

2 Regression splines as mixed effect models

The design matrices of fixed and random effects are, for sample size equal to n, given by

X = ( 1 x_1 ... x_1^q ; ... ; 1 x_n ... x_n^q ),   Z = ( ψ_1(x_1) ... ψ_{K_n}(x_1) ; ... ; ψ_1(x_n) ... ψ_{K_n}(x_n) ),

and a penalised least squares criterion leads to the estimators

(β̂, û) = argmin_{β,u} { ||Y − Xβ − Zu||^2 + λ^{−1} u^t u },

which can also be obtained via ridge regression (Eilers and Marx, 1996; Aerts, Claeskens and Wand, 2002). The penalisation constant in the ridge regression framework plays the role of the smoothing parameter. The introduction of random effects on the u_j's results in equivalence of best linear unbiased predictors and estimators obtained via the penalised least squares criterion. More specifically, we consider the following linear mixed model, in matrix notation:

Y = Xβ + Zu + ε,

where u and ε are independent, Var(u) = σ_u^2 R and Var(ε) = σ_ε^2 I_n.
Note that Var(Y) = σ_ε^2 I_n + σ_u^2 ZRZ^t = σ_ε^2 V, with V = I_n + λ ZRZ^t and λ = σ_u^2/σ_ε^2. For truncated polynomial basis functions R = I_{K_n}. For B-splines, the matrix R is defined via the transformation which expresses B-splines in terms of truncated polynomials. We denote the number of β components by q_1 = q + 1 and the number of u components, which is related to the number of knots, by K_n. Testing hypothesis (1) versus the two-sided alternative that the conditional mean response has any different structure is, in the mixed model representation, equivalent to testing

H_0: σ_u^2 = 0 versus H_a: σ_u^2 > 0.    (3)

For the test to be genuinely nonparametric, we let the number of spline basis functions K_n, that is, the number of knots, grow with the sample size. In multiple regression, more than one random effect can be included. For random effects u_1,...,u_a, the linear mixed model is, in vector notation, Y = Xβ + Z_1 u_1 + ... + Z_a u_a + ε; see also Section 5.

2.1 Profile restricted maximum likelihood

We assume that the error terms ε follow a normal distribution, independent of the normal random effects u. It is anticipated that similar results can be obtained for different error distributions. The log likelihood of the data is

L_ml(β, λ, σ_ε^2) = −(1/2){ n log(2πσ_ε^2) + log|V| + σ_ε^{−2} (Y − Xβ)^t V^{−1} (Y − Xβ) }.

Restricted maximum likelihood estimation of the variance components explicitly takes the degrees of freedom associated with estimation of the fixed effects into account. The restricted likelihood function is the likelihood of a linear combination A^t Y, where the n × (n − q_1) matrix A has full column rank and is constructed such that A^t X = 0. The resulting function does not depend on the precise choice of A; it is

L_reml(λ, σ_ε^2) = −(1/2){ (n − q_1) log(2πσ_ε^2) + log|A^t V A| + σ_ε^{−2} Y^t A(A^t V A)^{−1} A^t Y }.    (4)

Define the projection matrix P(λ) = I_n − X(X^t V^{−1} X)^{−1} X^t V^{−1}. Also, let P_λ = V^{−1} P(λ); as shown in Searle, Casella and McCulloch (1992), P_λ = A(A^t V A)^{−1} A^t. Since our main focus is on testing σ_u^2 = 0, or equivalently λ = 0, an important ingredient of the test statistic is the profile log REML function L(λ), obtained by substituting σ_ε^2 in the restricted likelihood function by its REML estimator σ̂_ε^2 = Y^t P(λ)^t V^{−1} P(λ)Y/(n − q_1). Specifically,

L(λ) = −(1/2) log|V| − (1/2) log|X^t V^{−1} X| − ((n − q_1)/2) log{ Y^t P(λ)^t V^{−1} P(λ)Y }.    (5)

Next we state an eigenvalue representation of the profile log REML function, which may also be of interest in general linear mixed models.
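The profile criterion (5) can be evaluated directly from the matrix formula. The sketch below is mine (variable names and the small simulated example are assumptions, not the paper's code); it returns L(λ) up to a constant not depending on λ, with R = I as holds for the truncated polynomial basis:

```python
import numpy as np

def profile_reml(lam, Y, X, Z, R=None):
    """Profile restricted log likelihood L(lambda) of equation (5),
    up to an additive constant not depending on lambda."""
    n, q1 = X.shape
    K = Z.shape[1]
    R = np.eye(K) if R is None else R            # R = I_K for truncated polynomials
    V = np.eye(n) + lam * Z @ R @ Z.T            # V = I + lambda * Z R Z'
    Vinv = np.linalg.inv(V)
    P = np.eye(n) - X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)  # P(lambda)
    quad = Y @ P.T @ Vinv @ P @ Y                # Y' P' V^{-1} P Y
    return (-0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * np.linalg.slogdet(X.T @ Vinv @ X)[1]
            - 0.5 * (n - q1) * np.log(quad))

# Illustrative data: linear mean, truncated linear basis with 4 knots.
rng = np.random.default_rng(0)
x = rng.random(40)
X = np.column_stack([np.ones(40), x])
Z = np.maximum(x[:, None] - np.array([0.2, 0.4, 0.6, 0.8])[None, :], 0.0)
Y = 1.0 + x + 0.1 * rng.standard_normal(40)
L0 = profile_reml(0.0, Y, X, Z)
```

At λ = 0 the matrix V reduces to I_n and P(0) is the ordinary least squares projection, which is the null value used throughout the paper.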
Lemma 1. In the normal linear mixed effects model Y = Xβ + Z_1 u_1 + ... + Z_a u_a + ε, the profile restricted log likelihood function evaluated at (λ_1,...,λ_a), where λ_k = σ_{u_k}^2/σ_ε^2, with corresponding true values λ_k^0, and where σ_ε^2 is the residual variance, has the eigenvalue representation

L(λ_1,...,λ_a) = −(1/2) Σ_{i=1}^{n−q_1} log[1 + ξ_{n,i}(λ_1,...,λ_a)] − ((n − q_1)/2) log{ σ_ε^2 Σ_{i=1}^{n−q_1} [1 + ξ_{n,i}(λ_1^0,...,λ_a^0)]/[1 + ξ_{n,i}(λ_1,...,λ_a)] U_i^2 },

where ξ_{n,k}(λ_1,...,λ_a), k = 1,...,K_n, are the non-zero eigenvalues of the matrix Σ_{k=1}^a λ_k W^t Z_k R_k Z_k^t W, and the matrix W is such that W^t W = I_{n−q_1}, WW^t = P(0) and W^t V W = diag[1 + ξ_{n,j}(λ_1,...,λ_a)]. A proof is given in the appendix.

A few special cases are worth mentioning. When there is only one random effect, in the sense that λ_1 = ... = λ_a = λ, or trivially when a = 1, the eigenvalues depend in a multiplicative way on λ, that is, ξ_{n,i}(λ) = λ ξ̃_{n,i}, where the ξ̃_{n,i} are the non-zero eigenvalues of the matrix Σ_{k=1}^a W^t Z_k R_k Z_k^t W. Explicit dependence of the eigenvalues on the smoothing parameters λ_k is also arrived at when the matrices Q_k = W^t Z_k R_k Z_k^t W, k = 1,...,a, can be simultaneously diagonalised, a necessary and sufficient condition for which is pairwise commutativity of the matrices Q_k, more explicitly, when for k ≠ l:

P(0)Z_k R_k Z_k^t P(0)Z_l R_l Z_l^t P(0) = P(0)Z_l R_l Z_l^t P(0)Z_k R_k Z_k^t P(0).

In this situation ξ_{n,i}(λ_1,...,λ_a) = λ_1 ξ̃_{n,1,i} + ... + λ_a ξ̃_{n,a,i}, where the ξ̃_{n,k,i} are the non-zero eigenvalues of A_k = Z_k^t P(0)Z_k R_k.

3 Tests in simple regression models

For testing hypothesis (3), which equivalently can be formulated in terms of the parameter λ, we study the restricted profile likelihood ratio statistic

R_n = 2{ L(λ̂) − L(0) },    (6)

where λ̂ is the maximiser of L(λ) in (5). As explained earlier, an additional hurdle in obtaining the distribution of likelihood ratio type test statistics is the one-sided alternative. Vu and Zhou (1997) extend the work of Chernoff (1954) to parameter values on a boundary and non-identically distributed data. In the next theorem we obtain conditions on the random spline models under which those results apply. For ease of interpretation, the conditions are formulated using both the eigenvalue representation and the original matrix notation.

(A1) The number of knots K_n, which equals the number of columns of the matrix Z, grows to infinity at a slower rate than n: K_n → ∞, K_n = o(n).
(A2) The non-zero eigenvalues ξ_{n,1},...,ξ_{n,K_n} of Z^t P(0)ZR = Z^t(I_n − X(X^t X)^{−1} X^t)ZR satisfy the following two conditions. As n → ∞,

Σ_{k=1}^{K_n} ξ_{n,k}^2 = tr{(Z^t P(0)ZR)^2} → ∞  and  Σ_{k=1}^{K_n} ξ_{n,k}^4 / (Σ_{k=1}^{K_n} ξ_{n,k}^2)^2 = tr{(Z^t P(0)ZR)^4} / (tr{(Z^t P(0)ZR)^2})^2 → 0.

The assumption of a growing number of spline basis functions is natural in nonparametric lack of fit testing. In order to construct a test which is powerful against a large class of smooth

models, the number of knots K_n should, first of all, be sample size dependent, and secondly, grow with the sample size to provide enough flexibility to capture non-trivial structural features. Condition (A2) has a more technical origin. The first assumption guarantees that the smallest eigenvalue of the Fisher information matrix tends to infinity, which guarantees that this matrix is positive definite for n large. The second condition is necessary and sufficient for the standardised score statistic, whose major component is the quadratic form Y^t P(0)^t ZRZ^t P(0)Y, to converge to a standard normal random variable (de Jong, 1987). Under (A1), a sufficient condition for (A2) is that the non-zero eigenvalues of Z^t P(0)ZR are O(n^ζ) with ζ ≥ 0. In Section 6 we investigate these conditions for two types of spline basis functions.

Theorem 1. Under conditions (A1) and (A2), the restricted likelihood ratio statistic R_n for testing H_0 in (1) versus the nonparametric alternative (2) has asymptotically, as the sample size n tends to infinity, a distribution which is an equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom, abbreviated as (1/2)χ_0^2 + (1/2)χ_1^2.

In the proof of this result (see the appendix for details), we verify that for the random spline model the conditions stated above are sufficient to apply the results of Vu and Zhou (1997). The first set of conditions in Vu and Zhou (1997) requires Chernoff regularity, and extends the original results of Chernoff (1954) to this more complicated setting. Chernoff regularity ensures the existence of an asymptotic distribution of a consistent sequence of global maximisers of the likelihood. Under the stronger assumption of Clarke regularity, Geyer (1994) proves, in the setting of M-estimation, that a root-n consistent sequence of local maximisers has that same asymptotic distribution.
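The two quantities in condition (A2) can be computed numerically for a given design; a sketch (function name mine, using R = I as for truncated polynomials) is:

```python
import numpy as np

def a2_quantities(X, Z, R=None):
    """Return tr{(Z'P(0)ZR)^2} and the ratio tr{(Z'P(0)ZR)^4}/(tr{(Z'P(0)ZR)^2})^2
    appearing in condition (A2)."""
    n = X.shape[0]
    K = Z.shape[1]
    R = np.eye(K) if R is None else R
    P0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # OLS projection P(0)
    M = Z.T @ P0 @ Z @ R    # K x K matrix with the same non-zero eigenvalues
    M2 = M @ M
    tr2 = np.trace(M2)
    return tr2, np.trace(M2 @ M2) / tr2 ** 2

# Example: truncated linear basis, n = 200, 20 equispaced knots.
n = 200
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
Z = np.maximum(x[:, None] - np.linspace(0.05, 0.95, 20)[None, :], 0.0)
tr2, ratio = a2_quantities(X, Z)
```

The first returned quantity should grow and the second should shrink as n and K_n grow; Section 6 reports that for the truncated polynomial basis the second condition fails, so in practice this check is worth running for the basis at hand.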
Since a convex set is Clarke regular at each of its points, convexity is a simple argument to prove Clarke regularity, which in its turn implies Chernoff regularity. Further details are given in the appendix.

In addition to a test on the value of λ, it might be of interest to construct a confidence interval for λ. Cressie and Lahiri (1993) and Richardson and Welsh (1994) obtain, under some regularity conditions, the asymptotic normality of REML estimators of variance components, without explicitly taking the possibility of zero variance components into account. Based on the normality result a Wald-type confidence interval can be constructed. Using the asymptotic distribution obtained in Theorem 1, a restricted likelihood based confidence interval for λ, the variance ratio σ_u^2/σ_ε^2, can be obtained directly. Denote by r_α the (1 − α) quantile of the asymptotic distribution of R_n(λ), that is, of the mixture distribution (1/2)χ_0^2 + (1/2)χ_1^2. A likelihood based confidence interval C_α is the set of all values λ for which R_n(λ) is below the critical value r_α,

C_α = { λ ∈ Ω : R_n(λ) ≤ r_α }.

Note that, by definition of L(λ), R_n(λ) does not depend on σ_ε^2. Advantages of this likelihood based method include that the confidence interval only covers values belonging to the
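The critical value r_α of the (1/2)χ_0^2 + (1/2)χ_1^2 mixture has a closed form: since P(R_n > r) = (1/2)P(χ_1^2 > r) and χ_1^2 is the square of a standard normal, r_α = z_{1−α}^2 for α < 1/2. A stdlib-only sketch (function name mine):

```python
from statistics import NormalDist

def mixture_quantile(alpha):
    """(1 - alpha) quantile of the 0.5*chi2_0 + 0.5*chi2_1 mixture (alpha < 0.5).
    Solves 0.5 * P(chi2_1 > r) = alpha, i.e. r = z_{1-alpha}^2."""
    return NormalDist().inv_cdf(1.0 - alpha) ** 2
```

For α = 0.05 this gives r_α ≈ 2.71, smaller than the two-sided χ_1^2 critical value 3.84, illustrating the remark below that the point mass at zero shrinks the critical values.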

parameter space, hence giving no negative values for variances, and the use of the (profile restricted) likelihood function avoids the construction of a symmetric interval around the estimated parameter value, an aspect which is automatically imposed by Wald-type intervals. The point mass at zero contributes to critical values which are smaller than the standard ones obtained from a full chi-squared distribution with one degree of freedom.

4 Calculation of the mixing proportions

In practice the asymptotic approximation of a 50:50 mixture between a point mass at zero and a χ_1^2 distributed random variable can be poor, and Pinheiro and Bates (2000) suggest using a 65:35 mixture of the same random variables. A maximum of the restricted likelihood occurs at the boundary value λ = 0 when the right derivative lim_{λ→0+} L′(λ) ≤ 0. The notation λ → 0+ denotes convergence of a positive sequence, approaching the value 0 only from the right. Hence, the probability of obtaining an extremum at the boundary equals P(lim_{λ→0+} L′(λ) ≤ 0). Using the algorithm of Davies (1980), the finite sample probability of a local extremum at the boundary can be calculated; this is the subject of Section 4.1. In Section 4.2 an approximation is derived using the Wilson–Hilferty (1931) transformation. The finite sample mixing probability is an alternative to the asymptotic value 1/2.

4.1 Exact finite sample calculation

Starting from the profile restricted likelihood function (5), using some results on matrix algebra, the probability of a local extremum at zero is equal to

P(lim_{λ→0+} L′(λ) ≤ 0) = P{ Y^t P(0)^t ZRZ^t P(0)Y / (Y^t P(0)^t P(0)Y) ≤ tr(ZRZ^t P(0))/(n − q_1) }.

This requires the calculation of the quantile function of a ratio of two quadratic forms in normal random variables. An eigenvalue representation of both numerator and denominator leads to the following approximation for p_Q = P(lim_{λ→0+} L′(λ) ≤ 0).
Approximation 1 (Finite sample mixing proportions). Let U_1,...,U_{n−q_1} be independent and identically distributed standard normal random variables, and let ξ_{n,k}, k = 1,...,K_n, be the non-zero eigenvalues of Z^t P(0)ZR. If (A1) and (A2) hold, the distribution of the restricted likelihood ratio test for testing hypothesis (3) can be approximated by p_Q χ_0^2 + (1 − p_Q) χ_1^2, where p_Q is the exact probability of obtaining a local maximum at the boundary,

p_Q = P{ Σ_{k=1}^{K_n} ξ_{n,k} U_k^2 / Σ_{k=1}^{n−q_1} U_k^2 ≤ Σ_{k=1}^{K_n} ξ_{n,k} / (n − q_1) }.
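Davies's AS 155 algorithm evaluates this probability exactly; a simple Monte Carlo stand-in, which only needs the eigenvalues, is sketched below (function name and defaults are mine):

```python
import numpy as np

def p_q_monte_carlo(xi, n_minus_q1, reps=200_000, seed=1):
    """Monte Carlo version of the boundary probability p_Q:
    P( sum_k xi_k U_k^2 / sum_{k<=n-q1} U_k^2 <= sum_k xi_k / (n - q1) ).
    Davies's exact algorithm (AS 155) would replace the sampling step."""
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi, dtype=float)
    U2 = rng.standard_normal((reps, n_minus_q1)) ** 2   # chi-squared(1) draws
    num = U2[:, :len(xi)] @ xi                          # sum_k xi_k U_k^2
    den = U2.sum(axis=1)                                # sum over all n - q1 terms
    return float(np.mean(num / den <= xi.sum() / n_minus_q1))
```

For instance, with a single eigenvalue ξ = 1 and n − q_1 = 50 the ratio is a Beta(1/2, 49/2) variable compared against its mean, giving a probability near 0.68 rather than the asymptotic 1/2, which illustrates why finite sample mixing proportions matter.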

From the above expression it is immediately obtained that p_Q converges to 1/2 as n → ∞. To appreciate this, let

Q̃ = Σ_{k=1}^{K_n} ξ_{n,k} U_k^2 − { Σ_{k=1}^{K_n} ξ_{n,k} / (n − q_1) } Σ_{k=1}^{n−q_1} U_k^2.

The expected value E(Q̃) equals zero, for each n, and under the assumptions on the eigenvalues both quadratic forms converge to a Gaussian limit distribution. This implies that p_Q converges to the probability that a zero-mean Gaussian random variable is smaller than its mean, which is equal to 1/2. Davies's (1980) algorithm, available as Applied Statistics algorithm AS 155, can be used to obtain the exact mixing probability.

4.2 Wilson–Hilferty approximation

We start with the eigenvalue representation of the restricted likelihood. Under the null hypothesis the true value of λ equals zero, therefore

L(λ) = −(1/2) Σ_{i=1}^{n−q_1} log(1 + λξ_{n,i}) − ((n − q_1)/2) log( σ_ε^2 Σ_{i=1}^{n−q_1} U_i^2/(1 + λξ_{n,i}) ),

where ξ_{n,k}, k = 1,...,K_n, are the eigenvalues of Z^t P(0)ZR. Taking the derivative of L(λ) with respect to λ and using the fact that the REML estimator σ̂_ε^2 is consistent for σ_ε^2 (see also the proof of Theorem 1),

L′(λ) = −(1/2) Σ_{i=1}^{K_n} ξ_{n,i}/(1 + λξ_{n,i}) + (1/2) Σ_{i=1}^{K_n} ξ_{n,i}/(1 + λξ_{n,i})^2 U_i^2 + o_P(1).

Hence, p_Q is approximated by

p_n = P( Σ_{i=1}^{K_n} ξ_{n,i} U_i^2 < Σ_{i=1}^{K_n} ξ_{n,i} ) = P( Q < E(Q) ),

where the quadratic form Q = Σ_{i=1}^{K_n} ξ_{n,i} U_i^2 converges only slowly to a normal random variable. To accelerate the convergence, Mathai and Provost (1992, Sec. 4.6) present a normalising transformation based on the Wilson–Hilferty Gaussian approximation for the chi-squared distribution. For r = 1, 2, 3, let θ_r = Σ_{k=1}^{K_n} ξ_{n,k}^r. Then for h_0 = 1 − 2θ_1θ_3/(3θ_2^2), the Wilson–Hilferty approximation states that when n tends to infinity,

(Q/θ_1)^{h_0} ≈ N( µ_Q = 1 + θ_2 h_0 (h_0 − 1)/θ_1^2 ; σ_Q^2 = 2θ_2 h_0^2/θ_1^2 ).

This implies that

P( lim_{λ→0+} L′(λ) ≤ 0 ) ≈ p̂_Q = Φ( (1 − µ_Q)/σ_Q ),

where Φ denotes the standard normal cumulative distribution function.
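The Wilson–Hilferty mixing probability only needs the first three power sums of the eigenvalues, so it is a few lines of stdlib code (function name mine, formulas as displayed above):

```python
import math
from statistics import NormalDist

def wilson_hilferty_pq(xi):
    """Approximate p_Q = P(Q < E(Q)) for Q = sum_i xi_i U_i^2 via the
    Wilson-Hilferty power transformation (Mathai and Provost, 1992, Sec. 4.6)."""
    t1 = sum(xi)                          # theta_1
    t2 = sum(v ** 2 for v in xi)          # theta_2
    t3 = sum(v ** 3 for v in xi)          # theta_3
    h0 = 1.0 - 2.0 * t1 * t3 / (3.0 * t2 ** 2)
    mu = 1.0 + t2 * h0 * (h0 - 1.0) / t1 ** 2
    sd = math.sqrt(2.0 * t2 * h0 ** 2) / t1
    return NormalDist().cdf((1.0 - mu) / sd)
```

As a sanity check, with a single unit eigenvalue Q is χ_1^2 and P(Q < 1) ≈ 0.683, which the approximation reproduces to two decimals; with many equal eigenvalues the value approaches the asymptotic 1/2.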

Approximation 2 (Wilson–Hilferty). If conditions (A1) and (A2) hold, the distribution of the restricted likelihood ratio statistic R_n can be approximated by p̂_Q χ_0^2 + (1 − p̂_Q) χ_1^2.

The estimated mixing probability p̂_Q depends solely on the eigenvalues of the quadratic form Q, which can be obtained directly from the data. Terrell (2003) connected the Wilson–Hilferty transformation to a local saddlepoint approximation; see also Kuonen (1999). This method requires a positive definite quadratic form, which here is not guaranteed since there are only K_n non-zero eigenvalues while the quadratic form contains all n − q_1 random variables U_k.

5 Tests in multiple regression

For a vector of covariates x = (x_1,...,x_a)^t we wish to test a parametric linear null hypothesis of the form

H_0: µ(x_1,...,x_a) = xβ    (7)

against the nonparametric alternative that µ is a smooth function of (x_1,...,x_a). In this situation several testing strategies are possible, presented here according to the number of different smoothing parameters.

5.1 Single smoothing parameter tests

If there is only one smoothing parameter involved, the asymptotic distribution of the test statistic is obtained by Theorem 1. Semiparametric models are built using an a-dimensional basis, that is, Y = Xβ + Zu + ε, where the i-th row of the n × K_n dimensional matrix Z consists of (ψ_1(x_{1i},...,x_{ai}),...,ψ_{K_n}(x_{1i},...,x_{ai})) and u = (u_1,...,u_{K_n})^t ~ N(0, σ_u^2 R). Examples of multivariate spline basis functions include tensor products of univariate basis functions; in two-dimensional problems thin-plate splines are often used for estimation of spatial structures, and for three dimensions radial splines can be used to fit response surfaces. If conditions (A1) and (A2) hold, the asymptotic distribution of the profile REML statistic for testing hypothesis (7) is given in Theorem 1. With high dimensional data the problem of needing a large sample size remains. Additive spline modelling provides an interesting alternative.
Often it is reasonable to assume additive effects, and then an alternative model is built as

Y = X_1 β_1 + ... + X_a β_a + Z_1 u_1 + ... + Z_{a_1} u_{a_1} + ε,

with a_1 possibly smaller than a, to allow some of the effects to be modelled parametrically only. Under the assumption that all additive components possess the same degree of smoothness, a single smoothing parameter λ can be used. In the linear mixed model this translates to all random vectors u_j being independent with the same variance structure σ_u^2 R. Following from Lemma 1, the relevant eigenvalues ξ_{n,i} are the non-zero eigenvalues of the matrix Z_1^t P(0)Z_1 R_1 + ... + Z_{a_1}^t P(0)Z_{a_1} R_{a_1}. The equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom holds asymptotically as in Theorem 1, where we now take X = (X_1,...,X_a). For both tests, finite sample corrections as in Section 4 can be used in practice.

5.2 Two smoothing parameter tests

In this section the situation is considered where the linear mixed model contains two variance components in addition to the residual variance. Without loss of generality we can write the mixed regression spline model under the alternative as

Y = X_1 β_1 + X_2 β_2 + Z_1 u_1 + Z_2 u_2 + ε,

where the design matrices Z_1 and Z_2 contain the spline basis functions for covariates x_1 and x_2 respectively, and u_j ~ N(0, σ_{u_j}^2 R_j) for j = 1, 2. The number of spline basis functions K_{nj} is allowed to be different for the different additive components. We assume that u_1, u_2, ε are independent. The natural parameter space for (σ_{u_1}^2, σ_{u_2}^2, σ_ε^2) equals [0, ∞)^2 × (0, ∞). If the null hypothesis constrains only one of the variance components to zero, the asymptotic distribution of R_n is again as given in Theorem 1. Testing whether both variance components are at the boundary of the parameter space yields a test with a different asymptotic distribution.
The hypotheses are

H_0: σ_{u_1}^2 = σ_{u_2}^2 = 0 versus H_a: σ_{u_1}^2 > 0 or σ_{u_2}^2 > 0.    (8)

The function L_reml(λ_1, λ_2, σ_ε^2) has a matrix representation as in (4), with

σ_ε^2 V = σ_ε^2 I_n + σ_{u_1}^2 Z_1 R_1 Z_1^t + σ_{u_2}^2 Z_2 R_2 Z_2^t,

where V = V(λ_1, λ_2) = I_n + λ_1 Z_1 R_1 Z_1^t + λ_2 Z_2 R_2 Z_2^t and λ_j = σ_{u_j}^2/σ_ε^2. Similarly, P(λ_1, λ_2) also depends on both smoothing parameters. Define the 2 × 2 matrix G_n with entries

G_{n,kl} = tr{ (Z_k^t P(0,0)Z_k R_k)(Z_l^t P(0,0)Z_l R_l) }.

Further, let r_n = cos^{−1}( G_{n,12}/√(G_{n,11} G_{n,22}) )/(2π) and s_n = G_{n,12}/√(G_{n,11} G_{n,22} − G_{n,12}^2). Hence, r_n = cos^{−1}( s_n/√(1 + s_n^2) )/(2π).

Theorem 2. Assume conditions (A1) and (A2) hold for both A_1 and A_2. Let (N_1, N_2) ~ N(0, I_2), and denote s = lim_{n→∞} s_n, r = lim_{n→∞} r_n. The restricted likelihood ratio statistic

R_n for testing the hypothesis in (8) has asymptotically, as the sample size n tends to infinity, the following mixture distribution:

R_n →_d  0 with probability 1/2 − r;  N_1^2 with probability 1/4;  (N_1 − sN_2)^2/(1 + s^2) with probability 1/4;  N_1^2 + N_2^2 with probability r.

A proof is given in the appendix. The method of proof differs from that in earlier work on one-sided testing problems. In particular, the explicit expression of the observed log likelihood as a sum of individual contributions, as directly used by Vu and Zhou (1997), is not immediately available for REML estimation. Instead, the obtained results rely on the eigenvalue representation stated in Lemma 1. This provides an alternative method of proof for (full) likelihood ratio testing. In the limiting distribution above, the finite sample versions r_n and s_n can be calculated exactly, and do not depend on any unknown parameters. A special case is s_n = 0, which implies that r_n = 1/4; it then follows that

R_n →_d (1/4)χ_0^2 + (1/2)χ_1^2 + (1/4)χ_2^2.

The situation s_n = 0 occurs if and only if S_1 Z_1^t P(0,0)Z_2 S_2^t = 0. Here, the matrices S_j are such that the covariance matrix can be written as R_j = S_j^t S_j. For example, s_n = 0 when the spline basis functions are orthogonal, that is Z_1^t Z_2 = 0, and at least one of the spline design matrices is orthogonal with respect to the parametric design matrix: Z_j^t X = 0 for at least one j.

5.3 More than two smoothing parameters equal to zero

If more than two variance components are set to zero under H_0, for orthogonal sets of spline basis functions such that Z_j^t Z_k = 0 (j ≠ k), the asymptotic distribution of the likelihood ratio statistic is a mixture of χ_0^2,...,χ_a^2 distributed random variables, where a denotes the number of variance components set to zero.
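The constants s_n and r_n of Theorem 2 are computable directly from G_n, and the limiting mixture can be sampled to obtain critical values. A sketch (function name and sampling scheme are mine):

```python
import math
import numpy as np

def theorem2_mixture_sample(G, reps=100_000, seed=2):
    """Compute s_n and r_n from the 2x2 matrix G_n and draw from the limiting
    mixture of Theorem 2: 0 w.p. 1/2 - r, N1^2 w.p. 1/4,
    (N1 - s*N2)^2/(1 + s^2) w.p. 1/4, N1^2 + N2^2 w.p. r."""
    g11, g12, g22 = G[0][0], G[0][1], G[1][1]
    s = g12 / math.sqrt(g11 * g22 - g12 ** 2)
    r = math.acos(s / math.sqrt(1.0 + s ** 2)) / (2.0 * math.pi)
    rng = np.random.default_rng(seed)
    n1, n2 = rng.standard_normal(reps), rng.standard_normal(reps)
    u = rng.random(reps)
    out = np.zeros(reps)                     # component 1: point mass at zero
    c1, c2, c3 = 0.5 - r, 0.75 - r, 1.0 - r  # cumulative mixing probabilities
    m1 = (u >= c1) & (u < c2)
    m2 = (u >= c2) & (u < c3)
    m3 = u >= c3
    out[m1] = n1[m1] ** 2
    out[m2] = (n1[m2] - s * n2[m2]) ** 2 / (1.0 + s ** 2)
    out[m3] = n1[m3] ** 2 + n2[m3] ** 2
    return out, s, r
```

With G_n diagonal (the orthogonal case), s_n = 0 and r_n = cos^{−1}(0)/(2π) = 1/4, recovering the (1/4)χ_0^2 + (1/2)χ_1^2 + (1/4)χ_2^2 special case noted above.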
The limiting mixing proportions are determined by the lower Cholesky square root of the Fisher information matrix, which is defined as the expected value of minus the second derivatives of the log REML function with respect to the parameters in the model. When exact expressions for the mixing proportions are unknown or difficult to obtain, the use of bounds on the P-value provides an alternative. Silvapulle (1994) and Kodde and Palm (1986) explain that the P-value can be bounded as

(1/2){ P(χ_0^2 > R̂_n) + P(χ_1^2 > R̂_n) } ≤ P-value ≤ (1/2){ P(χ_{a−1}^2 > R̂_n) + P(χ_a^2 > R̂_n) },
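These bounds are cheap to evaluate; the sketch below implements them with a stdlib-only chi-squared survival function (function names are mine; a library routine such as a statistics package's chi-squared tail would do the same job):

```python
import math

def chi2_sf(x, df):
    """P(chi2_df > x) via a series for the regularised lower incomplete gamma.
    df = 0 denotes the point mass at zero, so its tail probability is 0 for x > 0."""
    if df == 0:
        return 0.0 if x > 0 else 1.0
    a, z = 0.5 * df, 0.5 * x
    if z <= 0:
        return 1.0
    total, term = 1.0, 1.0
    for n in range(1, 500):
        term *= z / (a + n)
        total += term
        if term < 1e-16:
            break
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    return 1.0 - lower

def kodde_palm_bounds(r_obs, a):
    """Lower and upper bounds on the P-value of the observed statistic R_n
    when a variance components are set to zero (Kodde and Palm, 1986)."""
    lo = 0.5 * (chi2_sf(r_obs, 0) + chi2_sf(r_obs, 1))
    hi = 0.5 * (chi2_sf(r_obs, a - 1) + chi2_sf(r_obs, a))
    return lo, hi
```

For a = 1 the two bounds coincide with the (1/2)χ_0^2 + (1/2)χ_1^2 P-value, consistent with Theorem 1.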

where R̂_n denotes the observed value of R_n. The fewer variance components equal to zero, the more accurate the bounds are. In most realistic applications tests would only be performed for a relatively small number of variance components.

An alternative idea for performing tests on the smoothing parameters in multiple regression models is to look at maximal deviations in one direction only. More specifically, for each j = 1,...,a, test the parametric null hypothesis H_0: E(Y) = µ(x_1,...,x_a) against each of the following alternative hypotheses:

H_{a,j}: E(Y) = µ(x_1,...,x_a) + Σ_{k=1}^{K_n} u_k ψ_k(x_j), (j = 1,...,a),

using the REML test statistic R_{nj}. The final test statistic is R_n = max_{1≤j≤a} R_{nj}. The level of the test can, for example, be controlled by application of Bonferroni's inequality. Alternatively, a bootstrap method can be used where data are generated under the parametric null model. Instead of focussing on one-dimensional departures only, low order interaction splines can provide a more powerful testing approach. These tests are constructed by considering alternative models of the type

H_{a,ij}: E(Y) = µ(x_1,...,x_a) + Σ_{k=1}^{K_n} u_k ψ_k(x_i, x_j), (i ≠ j),

and taking the maximum over pairs of indices (i, j). A similar max test statistic using orthogonal series estimators in multivariate regression models has been constructed by Aerts, Claeskens and Hart (2000).

6 Numerical results

One of the main advantages of performing a lack of fit test using mixed regression splines is that the test statistic can be computed easily using statistical software packages which provide tools for fitting linear mixed models, such as proc MIXED in SAS or the function lme() in S-Plus and R. We first address the issue of calculating the exact mixing proportions in the asymptotic distribution of a test for linearity in the one-smoothing parameter case.
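Outside of a mixed-model package, the statistic R_n = 2{L(λ̂) − L(0)} can be computed directly from the profile criterion (5) by maximising over a grid of λ values. A self-contained sketch (function name, grid, and simulated example are my assumptions; a production analysis would use lme() or proc MIXED as noted above):

```python
import numpy as np

def rlrt_statistic(Y, X, Z, grid=None):
    """Restricted likelihood ratio statistic R_n = 2{L(lam_hat) - L(0)},
    maximising the profile REML criterion (5) over a grid of lambda values
    (R = I, as for the truncated polynomial basis)."""
    def L(lam):
        n, q1 = X.shape
        V = np.eye(n) + lam * Z @ Z.T
        Vinv = np.linalg.inv(V)
        P = np.eye(n) - X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
        quad = Y @ P.T @ Vinv @ P @ Y
        return (-0.5 * np.linalg.slogdet(V)[1]
                - 0.5 * np.linalg.slogdet(X.T @ Vinv @ X)[1]
                - 0.5 * (n - q1) * np.log(quad))
    grid = np.concatenate(([0.0], np.logspace(-4, 4, 81))) if grid is None else grid
    vals = np.array([L(l) for l in grid])
    return 2.0 * (vals.max() - L(0.0))

# Linear-null data versus a clearly nonlinear alternative, linear splines, 5 knots.
rng = np.random.default_rng(3)
n = 60
x = np.sort(rng.random(n))
X = np.column_stack([np.ones(n), x])
Z = np.maximum(x[:, None] - np.linspace(0.1, 0.9, 5)[None, :], 0.0)
Y_null = 1.0 + x + 0.3 * rng.standard_normal(n)
Y_alt = 1.0 + x + np.sin(4.0 * np.pi * x) + 0.1 * rng.standard_normal(n)
```

Because the grid contains λ = 0, R_n is non-negative by construction; a pronounced departure from linearity inflates the statistic, which is then compared with the mixture critical values discussed in Sections 3 and 4.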
If the probability of estimating a zero value for λ is higher than the expected value, this leads to oversmoothing, and hence to a lower power of the test. Figure 1 presents boxplots summarising the results for the calculation of the probability of a zero smoothing parameter in 1000 simulated sets of data generated from a normal linear regression model Y = 1 + x + ε, with the error variance set to 0.1 and x generated from a uniform distribution on (0, 1). Four sample sizes are chosen: for n = 100 and 200 we take 25 knots at sample quantiles, while for n = 300 and 500, 45 knots are used. For the calculation we

Figure 1: Boxplots showing the exact finite sample probability of a zero smoothing parameter for simulated data under the linear null hypothesis, for sample sizes n = 100, 200 (using 25 knots), 300 and 500 (using 45 knots), using a truncated polynomial basis (left panel) and a B-spline basis (right panel).

used either a truncated linear basis, where ψ_j(x) = max{x − κ_j, 0} for knots κ_j (left panel), or a B-spline basis (results are in the right panel). For truncated polynomial basis functions, even for larger sample sizes the probabilities do not tend to the value 0.5. Calculated values stay above 0.67, because assumption (A2) on the eigenvalues of the matrix Z^t P(0)ZR does not hold for this type of basis function, while (A1) is satisfied. For the B-spline functions both conditions are fulfilled and convergence to 0.5 holds. Note that especially when using the truncated polynomial basis there are many large outliers in the calculated probabilities.

A simulation study is performed to investigate the power properties of the test using R_n. Data are generated from a model Y_i = f(x_i) + σε_i with independent standard normal error terms ε_i. We test for linearity, H_0: f(x) = β_0 + β_1 x, and obtain simulated rejection probabilities under a sequence of alternative models with higher frequency terms:

Y_i = 1 + x_i + cos(πjx_i) + σε_i, for j = 2,...,9.

All tests are performed at the 0.05 level, using linear splines and sample size equal to 100. Simulated power curves, based on 2000 simulated datasets, are depicted in Figure 2. Curves are shown using the asymptotic critical value as obtained from the mixture distribution (see Theorem 1), as well as from a simulated distribution under the null hypothesis (based on 5000 simulated datasets) and from two approximated distributions where, for each dataset, the mixing proportion is calculated using the exact finite sample calculations according to the algorithm of Davies and the approximation method of Wilson and Hilferty.
In the figure we clearly distinguish two groups of tests: those using a truncated polynomial basis, which have high power at low frequency alternatives but lose power quickly for high frequencies, and the tests using B-spline basis functions, which have reasonably high power for all frequencies considered in the simulation study. The approximations result in slightly larger power of the tests than when using the

Figure 2: Simulated power curves using statistic R_n for a test of linearity, using asymptotic critical values (solid line), values from a simulated distribution (dot-dashed line), as well as values from approximated distributions (dashed lines, Davies; dotted lines, Wilson–Hilferty). The group of tests with the rapid decrease uses truncated polynomial basis functions; the others use a B-spline basis.

asymptotic distribution. For B-splines the asymptotic approximation agrees closely with the results from the simulated distribution and corrections are not necessary. For the truncated polynomial basis, simulated power values of the approximation methods follow closely those using the simulated distribution; for this basis function the corrections do seem necessary. As a comparison, the order selection test of Aerts, Claeskens and Hart (1999) has been calculated. This test (not shown) has power behaviour similar to the truncated polynomial based tests. Using the asymptotic distribution, the simulated probability of a type I error for the test using a truncated polynomial basis deviates from the nominal level, whereas approximations 1 and 2 give values close to 0.05. A B-spline basis leads to corresponding values close to the nominal level under all three distributions, again showing that for these basis functions finite sample corrections are not necessary. The simulated level of the order selection test is similar to that of the truncated polynomial based tests.
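Under assumptions (A1) and (A2) the asymptotic null distribution is the equal-weight mixture 0.5 χ²₀ + 0.5 χ²₁, so the asymptotic critical value solves P(R_n > c) = α. A small sketch (not from the paper; the function name is illustrative) computes this value:

```python
from scipy.stats import chi2

def rlrt_critical_value(alpha, p0=0.5):
    """Critical value c solving P(R > c) = alpha when
    R ~ p0 * chi2_0 + (1 - p0) * chi2_1 (point mass at zero plus chi-square)."""
    if alpha >= 1 - p0:
        return 0.0                      # the point mass already covers the tail
    # for c > 0:  P(R > c) = (1 - p0) * P(chi2_1 > c)
    return chi2.ppf(1 - alpha / (1 - p0), df=1)

c = rlrt_critical_value(0.05)
print(round(c, 4))   # 2.7055
```

Passing the finite sample mixing proportion for `p0` (for example, a value computed by Davies' algorithm or the Wilson–Hilferty approximation) gives the corrected critical values used for the dashed and dotted curves in Figure 2.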

7 Discussion

Difficulties with boundary values can often be circumvented by embedding the parameter space. Theorem 8.3 of Harville (1997, p. 433) ensures the existence of a strictly positive value c such that V_λ is symmetric positive definite for λ > −c. This guarantees that it is possible to reparametrise the likelihood and embed the current parameter space in a bigger space, namely Ω = (−c, ∞), for which the value of λ under the null hypothesis belongs to the interior region. Feng and McCulloch (1992) use the idea of extending the parameter space for the construction of confidence intervals for a parameter possibly on the boundary, in the setting of independent and identically distributed data. While the embedding takes away the technical difficulties with directional derivatives, the complications of performing one-sided likelihood ratio based tests remain. As a consequence, the asymptotic distribution of the test statistic does not change under the embedding approach. There is a difference in the distribution of the test statistic when a two-sided alternative hypothesis is considered: a limiting chi-squared distribution results, but the power of such a two-sided test will be smaller than the power of one-sided testing. While the focus of this paper is on the restricted likelihood ratio test, other test statistics can be considered. Silvapulle and Silvapulle (1995) construct a score statistic for testing one-sided hypotheses which only requires estimation under the null hypothesis, and show that its asymptotic distribution is equivalent to that of a likelihood ratio statistic. For a two-sided score test in variance component models see Lin (1997). Full likelihood ratio statistics can also be used, and have the advantage of allowing the possibility of restricting fixed effects under the null hypothesis.
A motivation for the use of restricted likelihood methods is that the degrees of freedom associated with the estimation of the fixed effects in the model are more effectively taken into account in the estimation of the smoothing parameter, which is of importance in the construction of the test. Another motivation comes from the finite sample calculation of the mixing proportions, which are probabilities of a variance component being estimated as zero. Crainiceanu, Ruppert and Vogelsang (2002) calculate that for maximum likelihood estimation these probabilities can be much larger than for restricted maximum likelihood, which is not advantageous in a hypothesis testing setting.

Acknowledgements

Part of this research was performed while the author was with the Department of Statistics at Texas A&M University. The author is thankful to M. Wand, M. Aerts, P. Janssen and D. Ruppert for interesting discussions related to the topic of this paper, and to the editor and reviewers of this paper for helpful comments. This research was partly supported by NSF grant DMS and by the Belgian Federal Science Policy Office.

8 Appendix

Proof of Lemma 1. Since P(0) is a symmetric and idempotent matrix, and by symmetry of V_λ, there exists a matrix W such that W^t W = I_{n−q}, W W^t = P(0) and W^t V_λ W = diag(ξ). Since

W^t V_λ W = I_{n−q} + Σ_{k=1}^a λ_k W^t Z_k R_k Z_k^t W,

this motivates us to define ξ_{n,j} such that ξ_j = 1 + ξ_{n,j}(λ_1, ..., λ_a), where the ξ_{n,j} are the (non-zero) eigenvalues of Σ_{k=1}^a λ_k W^t Z_k R_k Z_k^t W. The number of non-zero eigenvalues is equal to the rank of the matrix, which in this case is at most K_n = Σ_{k=1}^a K_{nk}. For regular spline bases where there is no linear dependence between the different basis functions, the rank equals K_n. Denote λ = (λ_1, ..., λ_a). From Patterson and Thompson (1971),

Y^t P(λ) V_λ^{−1} P(λ) Y = Y^t W { I + diag(ξ_{n,i}(λ)) }^{−1} W^t Y.

With λ⁰ denoting the true value of λ, W^t Y ∼ N(0, σ_ε² (I_{n−q} + Σ_{k=1}^a λ_k⁰ W^t Z_k R_k Z_k^t W)). As a consequence, the quadratic form

Y^t P(λ) V_λ^{−1} P(λ) Y = σ_ε² Σ_{i=1}^{n−q} { (1 + ξ_{n,i}(λ⁰)) / (1 + ξ_{n,i}(λ)) } U_i²,

with U_1, ..., U_{n−q} independent and identically distributed N(0, 1) random variables. It can be shown (Kuo, 1999) that, with C denoting a finite constant,

log|V_λ| + log|X^t V_λ^{−1} X| = Σ_{i=1}^{n−q} log(1 + ξ_{n,i}(λ)) + C.

This gives us an expression for the REML likelihood in terms of eigenvalues. We have, up to some constant not depending on any parameter value,

L_reml(λ, σ_ε²) = −(1/2) Σ_{i=1}^{n−q} log(1 + ξ_{n,i}(λ)) − ((n−q)/2) log( (σ_ε²/(n−q)) Σ_{i=1}^{n−q} { (1 + ξ_{n,i}(λ⁰)) / (1 + ξ_{n,i}(λ)) } U_i² ).

Proof of Theorem 1. The proof consists in showing that Theorem 2.2 of Vu and Zhou (1997) is applicable to the profile REML statistic. We verify their conditions. There exists a neighbourhood N of the true value λ = 0 where the profile REML function L(λ) is continuous, and first and second derivatives exist and are continuous on N ∩ Ω. Derivatives at λ = 0 are understood to be right-derivatives only. With this one-dimensional problem, Chernoff regularity is satisfied. To verify the remaining conditions, we use the eigenvalue representation from Lemma 1. Under the null hypothesis, the true value of the parameter equals zero.
First and second derivatives with respect to λ are obtained directly from this eigenvalue form:

L′(λ) = −(1/2) Σ_{i=1}^{K_n} ξ_{n,i}/(1 + λξ_{n,i}) + ((n−q)/2) { Σ_{i=1}^{K_n} ξ_{n,i} U_i² / (1 + λξ_{n,i})² } / { Σ_{i=1}^{n−q} U_i² / (1 + λξ_{n,i}) }.

The second derivative is

L″(λ) = (1/2) Σ_{i=1}^{K_n} ξ_{n,i}² / (1 + λξ_{n,i})² − (n−q) { Σ_{i=1}^{K_n} ξ_{n,i}² U_i² / (1 + λξ_{n,i})³ } / { Σ_{i=1}^{n−q} U_i² / (1 + λξ_{n,i}) } + ((n−q)/2) { Σ_{i=1}^{K_n} ξ_{n,i} U_i² / (1 + λξ_{n,i})² }² / { Σ_{i=1}^{n−q} U_i² / (1 + λξ_{n,i}) }².

The above expressions can be simplified after showing that (n−q)^{−1} Σ_{i=1}^{n−q} U_i²/(1 + λξ_{n,i}) converges to one in probability. For this, Chebyshev's inequality is used (Kuo, 1999):

P( | (n−q)^{−1} Σ_{i=1}^{n−q} U_i²/(1 + λξ_{ni}) − (n−q)^{−1} Σ_{i=1}^{n−q} U_i² | > a ) ≤ a^{−1} (n−q)^{−1} Σ_{i=1}^{n−q} λξ_{ni}/(1 + λξ_{ni}) = O(K_n/n).

Since K_n = o(n), the denominator is o_P(1) away from the average of χ_1² variables, which by the law of large numbers converges in probability to one. This is equivalent to showing that the REML estimator is consistent for the true residual variance σ_ε². Denote now by L̄′ and L̄″ the simplified versions of the derivatives above, that is,

L̄′(λ) = −(1/2) Σ_{i=1}^{K_n} ξ_{ni}/(1 + λξ_{ni}) + (1/2) Σ_{i=1}^{K_n} ξ_{ni} U_i² / (1 + λξ_{ni})².

The second derivative is

L̄″(λ) = (1/2) Σ_{i=1}^{K_n} ξ_{ni}²/(1 + λξ_{ni})² − Σ_{i=1}^{K_n} ξ_{ni}² U_i² / (1 + λξ_{ni})³ + (2(n−q))^{−1} { Σ_{i=1}^{K_n} ξ_{ni} U_i² / (1 + λξ_{ni})² }².

At the null value, E{L̄′_reml(0)} = 0, E{L̄′_reml(0)²} = (1/2) Σ_{i=1}^{K_n} ξ_{ni}², and

E{ −L̄″_reml(0) } = (1/2) Σ_{i=1}^{K_n} ξ_{ni}² − (2(n−q))^{−1} ( Σ_{i=1}^{K_n} ξ_{ni} )² − (n−q)^{−1} Σ_{i=1}^{K_n} ξ_{ni}²,

which tends to infinity as n grows. Note that for these profile likelihood based mixed models, E{L̄′_reml(0)²} ≠ E{−L̄″_reml(0)}. We now study the variance of the normal random variable to be used in the projections on the tangent cones, as in Theorem 2.2 of Vu and Zhou (1997). The quantity V is the limit expression of V_n = E{L̄′_reml(0)²}/E{−L̄″_reml(0)}. Expressed in terms of eigenvalues,

V_n^{−1} = 1 − 3(n−q)^{−1} − (n−q)^{−1} Σ_{j≠i} ξ_{ni}ξ_{nj} / Σ_{i=1}^{K_n} ξ_{ni}² = 1 + O(K_n/n).

Since K_n = o(n), we may take V = 1 in the limit. Condition (A2) is sufficient to imply normality of the score value. The result now follows from Vu and Zhou (1997).

Proof of Proposition 2. The profile score vector has two components, given by

∂L/∂λ_j = −(1/2) tr(Z_j Z_j^t V_λ^{−1} P(λ)) + ((n−q)/2) { Y^t P(λ)^t V_λ^{−1} Z_j Z_j^t V_λ^{−1} P(λ) Y } / { Y^t P(λ)^t V_λ^{−1} P(λ) Y }, j = 1, 2,

and for j, k = 1, 2 the partial second derivatives are obtained as

∂²L/∂λ_j∂λ_k = (1/2) tr{ Z_j Z_j^t P(λ)^t V_λ^{−1} Z_k Z_k^t P(λ)^t V_λ^{−1} } + ((n−q)/2) { − [Y^t P(λ)^t V_λ^{−1} Z_j Z_j^t V_λ^{−1} P(λ) Z_k Z_k^t V_λ^{−1} P(λ) Y] / [Y^t P(λ)^t V_λ^{−1} P(λ) Y] − [Y^t P(λ)^t V_λ^{−1} Z_k Z_k^t V_λ^{−1} P(λ) Z_j Z_j^t V_λ^{−1} P(λ) Y] / [Y^t P(λ)^t V_λ^{−1} P(λ) Y] + [Y^t P(λ)^t V_λ^{−1} Z_k Z_k^t V_λ^{−1} P(λ) Y][Y^t P(λ)^t V_λ^{−1} Z_j Z_j^t V_λ^{−1} P(λ) Y] / [Y^t P(λ)^t V_λ^{−1} P(λ) Y]² }.

Since Y^t P(λ)^t V_λ^{−1} P(λ) Y/(n−q) is a consistent estimator of the residual variance σ_ε², we define the following quantities, which are o_P(1) away from the derivatives given above:

L̄_j = −(1/2) tr(Z_j Z_j^t V_λ^{−1} P(λ)) + (2σ_ε²)^{−1} Y^t P(λ)^t V_λ^{−1} Z_j Z_j^t V_λ^{−1} P(λ) Y;

L̄_{jk} = (1/2) tr{ Z_j Z_j^t P(λ)^t V_λ^{−1} Z_k Z_k^t P(λ)^t V_λ^{−1} } − (2σ_ε²)^{−1} Y^t P(λ)^t V_λ^{−1} { Z_j Z_j^t V_λ^{−1} P(λ) Z_k Z_k^t + Z_k Z_k^t V_λ^{−1} P(λ) Z_j Z_j^t } V_λ^{−1} P(λ) Y + (2σ_ε⁴(n−q))^{−1} { Y^t P(λ)^t V_λ^{−1} Z_k Z_k^t V_λ^{−1} P(λ) Y }{ Y^t P(λ)^t V_λ^{−1} Z_j Z_j^t V_λ^{−1} P(λ) Y }.

Using properties of expected values of quadratic forms, at the null value E{L̄_j(0, 0)} = 0, and

G_{n,jk} = E{ −L̄_{jk}(0, 0) } = (1/2 − (n−q)^{−1}) tr{ Z_j Z_j^t P(0) Z_k Z_k^t P(0) } − (2(n−q))^{−1} tr{ Z_j Z_j^t P(0) } tr{ Z_k Z_k^t P(0) },

while

D_{n,jk} = E{ L̄_j(0, 0) L̄_k(0, 0) } = (1/2) tr{ (Z_j^t P(0) Z_k)(Z_j^t P(0) Z_k)^t }.

The off-diagonal entries of D_n are zero if and only if Z_1 and Z_2 are orthogonal in the sense that Z_j^t P(0) Z_k = 0 for j ≠ k. Define the matrices A_k = Z_k Z_k^t P(0). The Cauchy–Schwarz inequality gives that {tr(A_1 A_2)}² < tr(A_1²) tr(A_2²), where the strict inequality holds because we may rule out the situation where A_1 is a multiple of A_2. This shows that D_n is positive definite. Since we are working with profile restricted likelihood ratio tests, the matrices G_n and D_n are generally not expected to be identical. From the expressions above,

G_n = D_n − (n−q)^{−1} { 2 D_n + (1/2) C_n } = D_n ( 1 + O{ min(K_{1n}, K_{2n})/n } ),   (9)

where the 2 × 2 matrix C_n has (j, k)th entry given by tr(A_j) tr(A_k). Hence, for n large enough, also G_n is positive definite. The real symmetric matrix D_n has two eigenvalues

ζ_1(D_n) < ζ_2(D_n), given by (1/2){ tr(D_n) ± ( tr(D_n)² − 4|D_n| )^{1/2} }. By assumption (A1) on the eigenvalues of A_1 and A_2, tr(D_n) → ∞, and since ζ_1(D_n)ζ_2(D_n) = |D_n| is of the order tr(D_n)², also ζ_1(D_n) → ∞ as n → ∞. By (9), ‖ G_n^{−1/2} D_n (G_n^{−1/2})^t − I_2 ‖ = o(1), where G_n^{1/2} represents the lower (left) Cholesky square root of G_n, and ‖·‖ is the sum of the absolute values of the matrix entries. This shows that the asymptotic covariance matrix of the bivariate random variable to be used in the projection described below is equal to the identity matrix. By results of de Jong (1987) and condition (A2) on the eigenvalues of both matrices A_1 and A_2, the score vector has a limiting normal distribution. For the set Ω = [0, ∞) × [0, ∞), define the cone

C_{Ω,n} = { (λ̃_1, λ̃_2)^t = G_n^{t/2} (λ_1, λ_2)^t : (λ_1, λ_2)^t ∈ Ω }.

Inserting the Cholesky decomposition matrix G_n^{t/2} and letting n tend to infinity defines the limiting cone C_Ω = { (λ_1, λ_2)^t : λ_1 − sλ_2 ≥ 0, λ_2 ≥ 0 }. Under the results obtained above, the asymptotic distribution of the REML ratio statistic R_n is now given by the distance of (N_1, N_2) ∼ N(0, I_2) to the set C_Ω. This divides the plane into four regions; the orthogonal projection on C_Ω of values in the region defined by { (λ_1, λ_2)^t : λ_1 − sλ_2 < 0, sλ_1 + λ_2 ≥ 0 } results in the component (sN_1 + N_2)²/(1 + s²). In this region, the component of the likelihood ratio type statistic R_n is given by

N_1² + N_2² − (sN_1 + N_2)²/(1 + s²) = (N_1 − sN_2)²/(1 + s²).

Since N_1 and N_2 are uncorrelated, this component follows a χ_1² distribution. The other components are obtained in a similar way.

References

Aerts, M., Claeskens, G. and Hart, J.D. (1999). Testing the fit of a parametric function. J. Am. Statist. Assoc., 94.
Aerts, M., Claeskens, G. and Hart, J.D. (2000). Testing lack of fit in multiple regression. Biometrika, 87.
Aerts, M., Claeskens, G. and Wand, M.P. (2002). Some theory for penalized spline additive models. J. Statist. Plann. Inference, 103.
Brumback, B., Ruppert, D. and Wand, M.P. (1999). Comment on "Variable selection and function estimation in additive nonparametric regression using a data-based prior" by Shively, T.S., Kohn, R. and Wood, S. J. Am. Statist. Assoc., 94.
Cai, T., Hyndman, R.J. and Wand, M.P. (2002). Mixed model-based hazard estimation. J. Comp. Graph. Statist., 11.
Cantoni, E. and Hastie, T. (2002). Degrees-of-freedom tests for smoothing splines. Biometrika, 89.

Chernoff, H. (1954). On the distribution of the likelihood ratio. Ann. Math. Statist., 25.
Cox, D. and Koh, E. (1989). A smoothing spline based test of model adequacy in polynomial regression. Ann. Inst. Statist. Math., 41.
Cox, D., Koh, E., Wahba, G. and Yandell, B.S. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Statist., 16, 113-119.
Crainiceanu, C.M., Ruppert, D. and Vogelsang, T.J. (2002). Probability of estimating zero variance of random effects in linear mixed models. Manuscript.
Cressie, N. and Lahiri, S.N. (1993). The asymptotic distribution of REML estimators. J. Multiv. Anal., 45.
Davies, R.B. (1980). Algorithm AS 155: The distribution of a linear combination of χ² random variables. Appl. Statist., 29.
de Jong, P. (1987). A central limit theorem for generalized quadratic forms. Prob. Th. Rel. Fields, 75.
Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties (with discussion). Statist. Sci., 11, 89-121.
Eubank, R.L. and Hart, J.D. (1992). Testing goodness-of-fit in regression via order selection criteria. Ann. Statist., 20.
Feng, Z. and McCulloch, C.E. (1992). Statistical inference using maximum likelihood estimation and the generalized likelihood ratio when the true parameter is on the boundary of the parameter space. Statist. Prob. Lett., 13.
Geyer, C.J. (1994). On the asymptotics of constrained M-estimation. Ann. Statist., 22.
Gouriéroux, C., Holly, A. and Monfort, A. (1982). Likelihood ratio test, Wald test, and Kuhn-Tucker test in linear models with inequality constraints on the regression parameters. Econometrica, 50.
Guo, W. (2002). Inference in smoothing spline analysis of variance. J. R. Statist. Soc. B, 64.
Harville, D.A. (1997). Matrix Algebra from a Statistician's Perspective. Springer-Verlag, New York.
Kammann, E.E. and Wand, M.P. (2003). Geoadditive models. Appl. Statist., 52, 1-18.
Kodde, D.A. and Palm, F.C. (1986). Wald criteria for jointly testing equality and inequality restrictions. Econometrica, 54.
Kuo, B.-S. (1999). Asymptotics of ML estimator for regression models with a stochastic trend component. Econom. Th., 15.

Kuonen, D. (1999). Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86.
Lin, X. (1997). Variance component testing in generalised linear models with random effects. Biometrika, 84.
Mathai, A.M. and Provost, S.B. (1992). Quadratic Forms in Random Variables. M. Dekker, New York.
Nychka, D. and Cummins, D. (1996). Comment on "Flexible smoothing with B-splines and penalties" by P.H.C. Eilers and B.D. Marx. Statist. Sci., 11.
Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. Springer-Verlag, New York.
Richardson, A.M. and Welsh, A.H. (1994). Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Austr. J. Statist., 36.
Ruppert, D. and Carroll, R.J. (2000). Spatially-adaptive penalties for spline fitting. Austr. N.-Zeal. J. Statist., 42.
Searle, S.R., Casella, G. and McCulloch, C.E. (1992). Variance Components. John Wiley & Sons, New York.
Self, S.G. and Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Statist. Assoc., 82.
Sen, P.K. and Silvapulle, M.J. (2002). An appraisal of some aspects of statistical inference under inequality constraints. J. Statist. Plann. Inference, 107.
Silvapulle, M.J. (1994). Likelihood ratio test of one-sided hypothesis in some generalized linear model. Biometrics, 50.
Silvapulle, M.J. and Silvapulle, P. (1995). A score test against one-sided alternatives. J. Am. Statist. Assoc., 90.
Staniswalis, J.G. and Severini, T.A. (1991). Diagnostics for assessing regression models. J. Am. Statist. Assoc., 86.
Stram, D.O. and Lee, J.W. (1994). Variance component testing in the longitudinal mixed effects model. Biometrics, 50.
Terrell, G.R. (2003). The Wilson-Hilferty transformation is locally saddlepoint. Biometrika, 90.
Vu, H.T.V. and Zhou, S. (1997). Generalization of likelihood ratio tests under nonstandard conditions. Ann. Statist., 25.
Wilson, E.B. and Hilferty, M.M. (1931). The distribution of chi-square. Proc. Nat. Acad. Sci., 17.


RLRsim: Testing for Random Effects or Nonparametric Regression Functions in Additive Mixed Models RLRsim: Testing for Random Effects or Nonparametric Regression Functions in Additive Mixed Models Fabian Scheipl 1 joint work with Sonja Greven 1,2 and Helmut Küchenhoff 1 1 Department of Statistics, LMU

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Binary choice 3.3 Maximum likelihood estimation

Binary choice 3.3 Maximum likelihood estimation Binary choice 3.3 Maximum likelihood estimation Michel Bierlaire Output of the estimation We explain here the various outputs from the maximum likelihood estimation procedure. Solution of the maximum likelihood

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests Biometrika (2014),,, pp. 1 13 C 2014 Biometrika Trust Printed in Great Britain Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests BY M. ZHOU Department of Statistics, University

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Diagnostics for Linear Models With Functional Responses

Diagnostics for Linear Models With Functional Responses Diagnostics for Linear Models With Functional Responses Qing Shen Edmunds.com Inc. 2401 Colorado Ave., Suite 250 Santa Monica, CA 90404 (shenqing26@hotmail.com) Hongquan Xu Department of Statistics University

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Estimating Variances and Covariances in a Non-stationary Multivariate Time Series Using the K-matrix

Estimating Variances and Covariances in a Non-stationary Multivariate Time Series Using the K-matrix Estimating Variances and Covariances in a Non-stationary Multivariate ime Series Using the K-matrix Stephen P Smith, January 019 Abstract. A second order time series model is described, and generalized

More information

Ma 3/103: Lecture 24 Linear Regression I: Estimation

Ma 3/103: Lecture 24 Linear Regression I: Estimation Ma 3/103: Lecture 24 Linear Regression I: Estimation March 3, 2017 KC Border Linear Regression I March 3, 2017 1 / 32 Regression analysis Regression analysis Estimate and test E(Y X) = f (X). f is the

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions

Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions Chapter 3 Scattered Data Interpolation with Polynomial Precision and Conditionally Positive Definite Functions 3.1 Scattered Data Interpolation with Polynomial Precision Sometimes the assumption on the

More information

Testing Some Covariance Structures under a Growth Curve Model in High Dimension

Testing Some Covariance Structures under a Growth Curve Model in High Dimension Department of Mathematics Testing Some Covariance Structures under a Growth Curve Model in High Dimension Muni S. Srivastava and Martin Singull LiTH-MAT-R--2015/03--SE Department of Mathematics Linköping

More information

A test for improved forecasting performance at higher lead times

A test for improved forecasting performance at higher lead times A test for improved forecasting performance at higher lead times John Haywood and Granville Tunnicliffe Wilson September 3 Abstract Tiao and Xu (1993) proposed a test of whether a time series model, estimated

More information

Penalized Splines, Mixed Models, and Recent Large-Sample Results

Penalized Splines, Mixed Models, and Recent Large-Sample Results Penalized Splines, Mixed Models, and Recent Large-Sample Results David Ruppert Operations Research & Information Engineering, Cornell University Feb 4, 2011 Collaborators Matt Wand, University of Wollongong

More information

The linear model is the most fundamental of all serious statistical models encompassing:

The linear model is the most fundamental of all serious statistical models encompassing: Linear Regression Models: A Bayesian perspective Ingredients of a linear model include an n 1 response vector y = (y 1,..., y n ) T and an n p design matrix (e.g. including regressors) X = [x 1,..., x

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Empirical Likelihood Tests for High-dimensional Data

Empirical Likelihood Tests for High-dimensional Data Empirical Likelihood Tests for High-dimensional Data Department of Statistics and Actuarial Science University of Waterloo, Canada ICSA - Canada Chapter 2013 Symposium Toronto, August 2-3, 2013 Based on

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

Linear Model Under General Variance

Linear Model Under General Variance Linear Model Under General Variance We have a sample of T random variables y 1, y 2,, y T, satisfying the linear model Y = X β + e, where Y = (y 1,, y T )' is a (T 1) vector of random variables, X = (T

More information

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common Journal of Statistical Theory and Applications Volume 11, Number 1, 2012, pp. 23-45 ISSN 1538-7887 A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when

More information

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS Bull. Korean Math. Soc. 5 (24), No. 3, pp. 7 76 http://dx.doi.org/34/bkms.24.5.3.7 KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS Yicheng Hong and Sungchul Lee Abstract. The limiting

More information

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Statistica Sinica 19 (2009), 71-81 SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Song Xi Chen 1,2 and Chiu Min Wong 3 1 Iowa State University, 2 Peking University and

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

STAT 6350 Analysis of Lifetime Data. Failure-time Regression Analysis

STAT 6350 Analysis of Lifetime Data. Failure-time Regression Analysis STAT 6350 Analysis of Lifetime Data Failure-time Regression Analysis Explanatory Variables for Failure Times Usually explanatory variables explain/predict why some units fail quickly and some units survive

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

On robust and efficient estimation of the center of. Symmetry.

On robust and efficient estimation of the center of. Symmetry. On robust and efficient estimation of the center of symmetry Howard D. Bondell Department of Statistics, North Carolina State University Raleigh, NC 27695-8203, U.S.A (email: bondell@stat.ncsu.edu) Abstract

More information

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood 11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with

More information

The outline for Unit 3

The outline for Unit 3 The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.

More information

ERRATA. for Semiparametric Regression. Last Updated: 30th September, 2014

ERRATA. for Semiparametric Regression. Last Updated: 30th September, 2014 1 ERRATA for Semiparametric Regression by D. Ruppert, M. P. Wand and R. J. Carroll Last Updated: 30th September, 2014 p.6. In the vertical axis Figure 1.7 the lower 1, 2 and 3 should have minus signs.

More information

DA Freedman Notes on the MLE Fall 2003

DA Freedman Notes on the MLE Fall 2003 DA Freedman Notes on the MLE Fall 2003 The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below. Calculus. Let f be a smooth, scalar

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

Nonparametric Small Area Estimation via M-quantile Regression using Penalized Splines

Nonparametric Small Area Estimation via M-quantile Regression using Penalized Splines Nonparametric Small Estimation via M-quantile Regression using Penalized Splines Monica Pratesi 10 August 2008 Abstract The demand of reliable statistics for small areas, when only reduced sizes of the

More information

Sample size determination for logistic regression: A simulation study

Sample size determination for logistic regression: A simulation study Sample size determination for logistic regression: A simulation study Stephen Bush School of Mathematical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia Abstract This

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

ANALYSIS OF VARIANCE AND QUADRATIC FORMS

ANALYSIS OF VARIANCE AND QUADRATIC FORMS 4 ANALYSIS OF VARIANCE AND QUADRATIC FORMS The previous chapter developed the regression results involving linear functions of the dependent variable, β, Ŷ, and e. All were shown to be normally distributed

More information

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky

EMPIRICAL ENVELOPE MLE AND LR TESTS. Mai Zhou University of Kentucky EMPIRICAL ENVELOPE MLE AND LR TESTS Mai Zhou University of Kentucky Summary We study in this paper some nonparametric inference problems where the nonparametric maximum likelihood estimator (NPMLE) are

More information