Semiparametric Mean-Covariance Regression Analysis for Longitudinal Data


July 14, 2009

Acknowledgments

We would like to thank the Editor, an associate editor and two referees for the constructive comments and detailed suggestions, which have led to a substantially improved paper.

Abstract

Efficient estimation of the regression coefficients in longitudinal data analysis requires correct specification of the covariance structure. Existing approaches usually focus on modeling the mean under certain assumed covariance structures, which may lead to inefficient or biased estimators of the mean parameters if misspecification occurs. In this paper, we propose a data-driven approach based on semiparametric regression models for the mean and the covariance simultaneously, motivated by the modified Cholesky decomposition. A regression spline based approach using generalized estimating equations is developed to estimate the parameters in the mean and the covariance. The resulting estimators for the regression coefficients in both the mean and the covariance are shown to be consistent and asymptotically normally distributed. In addition, the nonparametric functions in these two structures are estimated at their optimal rate of convergence. Simulation studies and a real data analysis show that the proposed approach yields highly efficient estimators for the parameters in the mean, and provides parsimonious estimation of the covariance structure.

Keywords: Covariance misspecification; Efficiency; Generalized estimating equation; Longitudinal data; Modified Cholesky decomposition; Semiparametric models.

1 Background

Longitudinal data arise frequently in the biomedical, epidemiological, social and economic fields. A salient feature of longitudinal studies is that subjects are measured repeatedly over time, so observations for the same subject are intrinsically correlated. Regression methods for such data that account for within-subject correlation are abundant in the literature (Diggle et al., 2002; Wu and Zhang, 2006).

Within the framework of generalized linear models (GLM), the technique of generalized estimating equations (GEE, Liang and Zeger, 1986) is widely used for dealing with longitudinal data. GEE makes use of a working correlation model to estimate the mean parameters in the marginal specification of the regression. Although consistency of the mean parameter estimators is not affected, misspecification of the correlation may result in a great loss of efficiency (Wang and Carey, 2003). On the other hand, the correlation structure itself may be of scientific interest (Prentice and Zhao, 1991; Diggle and Verbyla, 1998). There is therefore a great need to model the covariance structure. However, modeling the covariance matrix is more challenging than modeling the mean, as there are usually many more parameters in the former and the positive definiteness of the covariance matrix must be assured. Recently, Pourahmadi (1999, 2000) proposed a modified Cholesky decomposition of the covariance matrix. This decomposition is attractive because it automatically yields positive definite covariance matrices, and the parameters in it correspond to well-founded statistical concepts. The parameters in this decomposition can then be modeled via regression techniques, enabling model based inference for the parameters in the mean and the covariances. See Pan and MacKenzie (2003) for a related discussion. More recently, Ye and Pan (2006) further proposed to model the parameters in this decomposition by several sets of parametric generalized estimating equations. There is a clear need to relax the parametric assumption posed in Ye and Pan (2006), as model misspecification may result in biased estimation, a problem even more severe than misspecification of the covariance.
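To make the decomposition concrete, it can be computed directly by regressing each measurement on its predecessors. A minimal numerical sketch (Python with NumPy; the function name and the AR(1)-style test matrix are ours, for illustration only):

```python
import numpy as np

def modified_cholesky(sigma):
    """Decompose a covariance matrix as Phi @ sigma @ Phi.T = D (Pourahmadi).
    Phi is unit lower triangular; its below-diagonal entries are the
    negatives of the autoregressive coefficients, and the diagonal of D
    holds the innovation variances."""
    n = sigma.shape[0]
    phi_mat = np.eye(n)
    for j in range(1, n):
        # regression coefficients of y_j on its predecessors y_1, ..., y_{j-1}
        coef = np.linalg.solve(sigma[:j, :j], sigma[:j, j])
        phi_mat[j, :j] = -coef
    d = phi_mat @ sigma @ phi_mat.T
    return phi_mat, np.diag(d)

# toy AR(1)-type covariance: the transformed matrix becomes diagonal
rho = 0.5
sigma = np.array([[1.0, rho, rho**2],
                  [rho, 1.0, rho],
                  [rho**2, rho, 1.0]])
phi_mat, innov = modified_cholesky(sigma)
# phi_mat @ sigma @ phi_mat.T is diagonal up to rounding error,
# and innov contains the innovation variances
```

For this toy matrix the innovation variances after the first are 1 - rho**2, mirroring the AR(1) prediction-error variance.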
An attractive alternative is the semiparametric regression model, or the partly linear model (PLM), which retains the flexibility of the nonparametric model but avoids fitting a fully nonparametric model (Härdle, Liang and Gao, 2000). Existing applications of PLM

to longitudinal data accounting for within-subject dependence mainly focus on regression analysis of the mean, with the covariance usually assumed known up to a few parameters. See, for example, the kernel and spline approaches in Lin and Carroll (2001), Welch, Lin and Carroll (2002), He, Zhu and Fung (2002), Wang (2003), Wang, Carroll and Lin (2005), He, Fung and Zhu (2005), and Lin and Carroll (2006). Compared to models for the mean in longitudinal data analysis, model based analysis for the covariance is much less studied. To address this issue, we propose semiparametric models for the mean and the covariance structure for longitudinal data. Our formulation builds on the modified Cholesky decomposition advocated by Pourahmadi (1999), so that the entries in this decomposition can be modeled by unconstrained semiparametric regression models. Our approach is more flexible than the model in Ye and Pan (2006) in both the mean and the covariance. There are related works in this regard. Fan, Fan and Lv (2008) studied a factor model method to parametrize a high dimensional covariance matrix. Wu and Pourahmadi (2003) proposed nonparametric estimates of the covariance matrix, but their method does not handle irregularly observed measurements and only models a few of the subdiagonals of the lower triangular matrix in the modified Cholesky decomposition. Fan, Huang and Li (2007) and Fan and Wu (2008) studied a different semiparametric model for the covariance structure by imposing the familiar variance-correlation decomposition: they estimated the marginal variance via kernel smoothing and proposed a parametric model for the correlation matrix. Similar to the method in Fan, Huang and Li (2007), our approach can handle irregular and possibly subject-specific time points. We show that the resulting estimators for the regression coefficients in the mean and the variance are consistent and asymptotically normally distributed.
Furthermore, the nonparametric parts are estimated at the optimal convergence rate by

using regression splines.

The rest of the paper is organized as follows. Section 2 introduces the models and estimation methods. Theoretical properties of the proposed estimators are given in Section 3. Extensive simulations and data analysis are presented in Section 4. Section 5 gives some concluding remarks. All the proofs are relegated to the Appendix.

2 The Models and The Estimation Methods

2.1 The models

Let $y_i = (y_{i1}, \ldots, y_{in_i})'$ be the $n_i$ repeated measurements at time points $t_i = (t_{i1}, \ldots, t_{in_i})'$ of the $i$th subject ($i = 1, \ldots, m$). Here $t_{ij}$ may be the time or any time-dependent covariate which is modeled nonparametrically. Without loss of generality, we assume that all $\{t_{ij}\}$ are scaled into the interval $[0, 1]$. Furthermore, we assume that the first two moments of the response satisfy $E(y_{ij} \mid x_{ij}, t_{ij}) = \mu_{0ij}$ and $V(y_i \mid x_i, t_i) = \Sigma_{0i}$, where $x_{ij}$ is a $p$-vector of covariates and $x_i = (x_{i1}, \ldots, x_{in_i})'$ is the covariate matrix for the $i$th subject.

To guarantee the positive definiteness of the matrices $\Sigma_{0i}$, an explicit way of modeling $\Sigma_{0i}$ is via its modified Cholesky decomposition $\Phi_i \Sigma_{0i} \Phi_i' = D_{0i}$, where $\Phi_i$ is a lower triangular matrix with 1's on its diagonal, and $D_{0i}$ is a diagonal matrix. As indicated by Pourahmadi (1999), this decomposition has a clear statistical interpretation. The below-diagonal entries of $\Phi_i$ are the negatives of the autoregressive coefficients $\phi_{ijk}$ defined in

    \hat{y}_{ij} = \mu_{ij} + \sum_{k=1}^{j-1} \phi_{ijk} (y_{ik} - \mu_{ik}).    (1)

That is, the autoregressive coefficients are the population regression coefficients of the linear regression of $y_{ij}$ on its predecessors $y_{i(j-1)}, \ldots, y_{i1}$. The diagonal entries $\sigma^2_{0ij}$ of $D_{0i}$ can be seen as the innovation variances $\sigma^2_{0ij} = \mathrm{var}(\epsilon_{ij})$, for $\epsilon_{ij} = y_{ij} - \hat{y}_{ij}$. Clearly, the modified Cholesky decomposition is advantageous in that $\phi$ and $\log(\sigma^2)$

are unconstrained, rather than a constrained parameter $\Sigma_{0i}$ that must be positive definite. To use the semiparametric regression tools, we postulate three sets of models

    g(\mu_{0ij}) = x_{ij}'\beta_0 + f_0(t_{ij}), \quad \phi_{ijk} = w_{ijk}'\gamma_0, \quad \log(\sigma^2_{0ij}) = z_{ij}'\lambda_0 + f_1(t_{ij}),    (2)

where $x_{ij}$, $w_{ijk}$ and $z_{ij}$ are the $p \times 1$, $q \times 1$ and $d \times 1$ vectors of covariates, respectively; $\beta_0$, $\gamma_0$ and $\lambda_0$ are the regression coefficients; $f_0(\cdot)$ and $f_1(\cdot)$ are unknown smooth functions. The known link function $g(\cdot)$ is assumed to be monotone and differentiable. The covariates $x_{ij}$, $w_{ijk}$ and $z_{ij}$ may contain the baseline covariates, the time and the associated interactions. The use of (2) reflects the belief that regression models for the autoregressive coefficients and innovation variances are as important as those for the mean. Furthermore, model based analysis of these parameters permits more accessible statistical inference. This technique was also used by Ye and Pan (2006), but they assumed parametric models for $g(\mu_{0ij})$ and $\log(\sigma^2_{0ij})$. As discussed earlier, the semiparametric estimating equations are more flexible and can be less biased if the parametric assumption is violated.

It is interesting to make some comparisons with the approach in Fan et al. (2007). They adopted the variance-correlation decomposition, using a well-behaved matrix such as AR(1) or ARMA(1,1) as the correlation matrix to ensure positive definiteness of the covariance. We use the modified Cholesky decomposition, which allows unconstrained parametrization, so regression based analysis of the entries in this decomposition is possible. These two approaches depend on different ways to decompose the covariance, and each has its own merits.

2.2 The estimating equations

The two nonparametric functions $f_0$ and $f_1$ are parametrized by regression splines, because splines can provide optimal rates of convergence for both the parametric

and the nonparametric components in PLM with a small number of knots (Heckman, 1986; He and Shi, 1996; He et al., 2002). Additionally, any computational algorithm developed for GLM can be used to fit a semiparametric extension of GLM, since the nonparametric function is treated as a linear function with the basis functions as covariates. For simplicity, we assume that $f_0$ and $f_1$ have the same smoothness property.

Let $0 = s_0 < s_1 < \cdots < s_{k_n} < s_{k_n+1} = 1$ be a partition of the interval $[0, 1]$. Using $\{s_i\}$ as the internal knots, we have $K = k_n + l$ normalized B-spline basis functions of order $l$ that form a basis for the linear spline space. We use the B-spline basis functions because they have bounded support and are numerically stable (Schumaker, 1981). Thus $f_0(t)$ and $f_1(t)$ are approximated by $\pi'(t)\alpha$ and $\pi'(t)\tilde{\alpha}$ respectively, where $\pi(t) = (B_1(t), \ldots, B_K(t))'$ is the vector of basis functions and $\alpha, \tilde{\alpha} \in R^K$. Let $\pi_{ij} = \pi(t_{ij})$. With this notation, the models in (2) can be linearized as follows:

    g(\mu_{ij}) = x_{ij}'\beta + \pi'(t_{ij})\alpha = b_{ij}'\theta, \quad \log(\sigma^2_{ij}) = z_{ij}'\lambda + \pi'(t_{ij})\tilde{\alpha} = h_{ij}'\rho,    (3)

where $b_{ij} = (x_{ij}', \pi_{ij}')'$, $h_{ij} = (z_{ij}', \pi_{ij}')'$, $\theta = (\beta', \alpha')'$ and $\rho = (\lambda', \tilde{\alpha}')'$. We then let $\mu_i = (\mu_{i1}, \ldots, \mu_{in_i})'$, $B_i = (b_{i1}, \ldots, b_{in_i})'$ and define $x_i$, $\pi_i$, $z_i$ and $H_i$ in a similar fashion. Throughout this paper, a scalar function acting on a vector is the vector of the function applied to each component; for example, $g(\mu_i) = (g(\mu_{i1}), \ldots, g(\mu_{in_i}))'$.

Using the GEE method of Liang and Zeger (1986), we construct the estimating equations for $\theta$, $\gamma$ and $\rho$ as follows:

    S_1(\theta) = \sum_{i=1}^m B_i' \Delta_i \Sigma_i^{-1} (y_i - \mu_i(B_i\theta)) = 0,
    S_2(\gamma) = \sum_{i=1}^m V_i' D_i^{-1} (r_i - \hat{r}_i) = 0,    (4)
    S_3(\rho) = \sum_{i=1}^m H_i' D_i W_i^{-1} (\epsilon_i^2 - \sigma_i^2(H_i\rho)) = 0,

where $\Delta_i = \Delta_i(B_i\theta) = \mathrm{diag}\{\dot{g}^{-1}(b_{i1}'\theta), \ldots, \dot{g}^{-1}(b_{in_i}'\theta)\}$ and $\dot{g}^{-1}(\cdot)$ is the derivative of the inverse link $g^{-1}(\cdot)$; $r_i$ and $\hat{r}_i$ are the $n_i \times 1$ vectors with $j$th components $r_{ij} =$

$y_{ij} - \mu_{ij}$ and $\hat{r}_{ij} = E(r_{ij} \mid r_{i1}, \ldots, r_{i(j-1)}) = \sum_{k=1}^{j-1} \phi_{ijk} r_{ik}$ ($j = 1, \ldots, n_i$). Note that for $j = 1$ the empty sum $\sum_{k=1}^{0}$ is taken to be zero throughout this paper. It can be shown that $D_i = \mathrm{diag}\{\sigma^2_{i1}, \ldots, \sigma^2_{in_i}\}$ in $S_2(\gamma)$ is actually the covariance matrix of $r_i - \hat{r}_i$, and that $V_i = \partial\hat{r}_i/\partial\gamma$ is the $q \times n_i$ matrix with $j$th column $\partial\hat{r}_{ij}/\partial\gamma = \sum_{k=1}^{j-1} r_{ik} w_{ijk}$. On the other hand, $\epsilon_i^2$ and $\sigma_i^2$ in $S_3(\rho)$ are the $n_i \times 1$ vectors with $j$th components $\epsilon_{ij}^2$ and $\sigma_{ij}^2$ ($j = 1, \ldots, n_i$), respectively, where $\epsilon_{ij} = y_{ij} - \hat{y}_{ij}$ and $\hat{y}_{ij}$ is given in (1). Obviously, $E(\epsilon_i^2) = \sigma_i^2$. In addition, $W_i$ is the covariance matrix of $\epsilon_i^2$, that is, $W_i = V(\epsilon_i^2)$. The solutions of these generalized estimating equations, say $\hat{\theta}$, $\hat{\gamma}$ and $\hat{\rho}$, are termed the GEE estimators of $\theta$, $\gamma$ and $\rho$.

As suggested by Ye and Pan (2006), a sandwich working covariance structure $W_i = A_i^{1/2} R_i(\delta) A_i^{1/2}$ can be used to approximate the true $W_i$'s, where $A_i = 2\,\mathrm{diag}\{\sigma^4_{i1}, \ldots, \sigma^4_{in_i}\}$ and $R_i(\delta)$ mimics the correlation between $\epsilon_{ij}^2$ and $\epsilon_{ik}^2$ ($j \neq k$) through a new parameter $\delta$. Typical structures for $R_i(\delta)$ include compound symmetry (exchangeable) and AR(1). As with the conventional generalized estimating equations for the mean, the parameter $\delta$ may have very little effect on the estimators of $\gamma$ and $\rho$. Our real data analysis and simulation studies reported in later sections confirm this point.

The three estimating equations in (4) can be seen as a generalization of the conventional GEE for the mean parameters. If we use a working covariance structure for $\Sigma_i$ in $S_1(\theta)$ and ignore $S_2(\gamma)$ and $S_3(\rho)$, we recover the PLM for the mean. The modified Cholesky decomposition allows us to proceed a step further and impose partly linear structures for the variance components.

2.3 The main algorithm

The solutions for $\theta$, $\gamma$ and $\rho$ satisfy the equations in (4). These parameters can be solved for iteratively, each with the others held fixed. For example, for fixed values of

$\theta$ and $\gamma$, $\rho$ can be computed via the third equation in (4). An application of the quasi-Fisher scoring algorithm to equation (4) directly yields the numerical solutions for these parameters. More specifically, given $\Sigma_i$, $\theta$ can be updated by

    \theta^{(k+1)} = \theta^{(k)} + \Big[\sum_{i=1}^m B_i' \Delta_i \Sigma_i^{-1} \Delta_i B_i\Big]^{-1} \sum_{i=1}^m B_i' \Delta_i \Sigma_i^{-1} (y_i - \mu_i(B_i\theta)) \Big|_{\theta=\theta^{(k)}}.    (5)

Given $\theta$ and $\rho$, $\gamma$ can be updated approximately through

    \gamma^{(k+1)} = \Big[\sum_{i=1}^m E(V_i' D_i^{-1} V_i)\Big]^{-1} \sum_{i=1}^m V_i' D_i^{-1} r_i \Big|_{\gamma=\gamma^{(k)}}.    (6)

In this update, the true errors are replaced by the residuals. We remark that this replacement has minimal effect on the estimation if the mean model is correctly specified. However, if the mean model is misspecified, this replacement would normally give inconsistent estimators of the autoregressive and log innovation parameters. This argument, on the other hand, demonstrates that the partly linear mean model of our approach is more robust than a parametric mean model with respect to model misspecification. Finally, given $\theta$ and $\gamma$, the innovation variance parameters $\rho$ can be updated using

    \rho^{(k+1)} = \rho^{(k)} + \Big[\sum_{i=1}^m H_i' D_i W_i^{-1} D_i H_i\Big]^{-1} \sum_{i=1}^m H_i' D_i W_i^{-1} (\epsilon_i^2 - \sigma_i^2) \Big|_{\rho=\rho^{(k)}}.    (7)

Equations (5)-(7) indicate that, iteratively, the parameters can be estimated using weighted generalized least squares. We summarize the algorithm as follows.

1. Initialization step: given a starting value $(\theta^{(0)}, \gamma^{(0)}, \rho^{(0)})$, use the model (2) to form the lower triangular matrices $\Phi_i^{(0)}$ and diagonal matrices $D_i^{(0)}$. Set $k = 0$ to obtain $\Sigma_i^{(0)}$, the starting values of $\Sigma_i$;

2. Iteration step: use (5)-(7) to calculate $\theta^{(k+1)}$, $\gamma^{(k+1)}$ and $\rho^{(k+1)}$;

3. Updating step: replace $\theta^{(k)}$, $\gamma^{(k)}$ and $\rho^{(k)}$ with the estimators $\theta^{(k+1)}$, $\gamma^{(k+1)}$ and $\rho^{(k+1)}$. Repeat Steps 2-3 until a desired convergence criterion is met.

A good starting value $\Sigma_i^{(0)}$ can simply be chosen as $I_i$, the identity matrix for the $i$th subject. This initial value of $\Sigma_i$ guarantees the consistency of the initial estimators in the mean, which in turn guarantees consistency of the autoregressive and innovation parameters after the first iteration. In the analyses presented in this paper, the convergence criterion is met when the successive difference in the Euclidean norm falls below a small tolerance. Our numerical experience shows that this iterative algorithm converges very quickly, usually in a few iterations.

2.4 Knot selection

Knot selection is an important issue in spline smoothing. The number of knots plays the same role as the smoothing parameter in the smoothing spline model and the bandwidth in kernel smoothing. In this article, we follow the spline literature (He et al., 2005) and use the sample quantiles of $\{t_{ij}, i = 1, \ldots, m, j = 1, \ldots, n_i\}$ as knots. For example, if we use three internal knots, they are taken to be the three quartiles of the observed $\{t_{ij}\}$. We use cubic splines (splines of order 4) in the numerical studies, and the number of internal knots is taken to be the integer part of $n^{1/5}$, where $n$ is the sample size. This particular choice is consistent with the asymptotic theory of Section 3, and according to our empirical experience it works well in a wide variety of problems. A data adaptive alternative is the leave-one-subject-out cross validation method, which is computationally more demanding; its theoretical justification is possible but beyond the scope of the paper.
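The knot rule and the spline basis evaluation can be sketched as follows (Python; SciPy's B-spline utilities stand in for whatever R spline routines the paper's own code uses, and the helper names are ours):

```python
import numpy as np
from scipy.interpolate import BSpline

def select_knots(t_pooled):
    """Internal knots at sample quantiles of the pooled {t_ij};
    the number of internal knots is the integer part of n**(1/5)."""
    n = len(t_pooled)
    k_n = int(n ** 0.2)
    probs = np.arange(1, k_n + 1) / (k_n + 1)
    return np.quantile(t_pooled, probs)

def bspline_basis(t, internal_knots, order=4):
    """Evaluate the K = k_n + order normalized B-spline basis functions
    of the given order (4 = cubic) at points t in [0, 1]."""
    degree = order - 1
    # clamped knot vector on [0, 1]
    knots = np.concatenate([np.zeros(order), np.sort(internal_knots), np.ones(order)])
    K = len(internal_knots) + order
    basis = np.empty((len(t), K))
    for j in range(K):
        coef = np.zeros(K)
        coef[j] = 1.0
        basis[:, j] = BSpline(knots, coef, degree)(t)
    return basis

# pooled observation times (simulated stand-in for the real {t_ij})
t_pooled = np.random.default_rng(0).uniform(0.0, 1.0, 2376)
knots = select_knots(t_pooled)
pi_mat = bspline_basis(np.linspace(0.05, 0.95, 9), knots)
```

Each row of `pi_mat` is the vector $\pi(t_{ij})'$ used in (3); by the partition-of-unity property of B-splines, each row sums to one.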

3 Asymptotic Properties

Here and throughout, $\|\cdot\|$ denotes the Euclidean norm of a vector and, for any square matrix $A$, $\|A\|$ denotes the modulus of the largest singular value of $A$. To study the rates of convergence of $\hat{\beta}$, $\hat{\gamma}$, $\hat{\lambda}$ and $\hat{f}_0$, $\hat{f}_1$, we first give a set of regularity conditions. If the estimating equation (4) has multiple solutions, only a consistent sequence of estimators $(\hat{\theta}, \hat{\gamma}, \hat{\rho})$ is considered in this section. A sequence $(\hat{\theta}, \hat{\gamma}, \hat{\rho})$ is said to be consistent if $(\hat{\beta}', \hat{\gamma}', \hat{\lambda}')' \to (\beta_0', \gamma_0', \lambda_0')'$ and $\sup_t |\pi'(t)\hat{\alpha} - f_0(t)| \to 0$, $\sup_t |\pi'(t)\hat{\tilde{\alpha}} - f_1(t)| \to 0$ in probability as $m \to \infty$. The fact that our iterative algorithm starts from consistent estimators of the parameters ensures that the final estimators are also consistent. Our basic conditions are as follows:

(A1) The dimensions $p$, $q$ and $d$ of the covariates $x_{ij}$, $w_{ijk}$ and $z_{ij}$ are fixed; $m \to \infty$ and $\max_i\{n_i\}$ is bounded, and the distinct values of $t_{ij}$ form a quasi-uniform sequence that grows dense on $[0, 1]$. The first four moments of $y_{ij}$ exist.

(A2) The $s$th derivatives of $f_0$ and $f_1$ are bounded for some $s \geq 2$.

(A3) The covariates $w_{ijk}$ and the matrices $W_i^{-1}$ are all bounded, meaning that all elements of these vectors and matrices are bounded. The function $g^{-1}(\cdot)$ has bounded second derivative.

(A4) The parameter space $\Theta$ is a compact subset of $R^{p+q+d}$, and the true value $(\beta_0', \gamma_0', \lambda_0')'$ is in the interior of $\Theta$.

The assumptions that the number of independent subjects goes to infinity and that the maximum number of repeated measurements is bounded are standard; they were used by Liang and Zeger (1986) to study the GEE estimators. The existence of the first four moments of the response is needed for consistent estimation of the parameters in the variance. The smoothness conditions on $f_0$ and $f_1$ given by Condition (A2)

determine the rate of convergence of the spline estimates. Condition (A3) is satisfied as $t$ is bounded. Assumption (A4) is routinely made in linear models.

To study the asymptotic properties of the estimators, we assume the following dependence between $x_{ij}$, $z_{ij}$ and $t_{ij}$:

    x_{ijk} = g_k(t_{ij}) + \delta_{ijk}, \quad k = 1, \ldots, p,    (8)
    z_{ijl} = \tilde{g}_l(t_{ij}) + \tilde{\delta}_{ijl}, \quad l = 1, \ldots, d; \; i = 1, \ldots, m; \; j = 1, \ldots, n_i,    (9)

where the $\delta_{ijk}$'s and $\tilde{\delta}_{ijl}$'s are mean zero random variables independent of the corresponding random errors and of one another. Relationships (8) and (9) are commonly assumed in the spline literature; see He et al. (2002) and references therein. Indeed, Härdle et al. (2000, Chapter 1.3) considered similar assumptions for partly linear kernel regression; our assumptions correspond to their random design case. Equations (8) and (9) also apply when categorical variables are present (Speckman, 1988). For example, when $x_k$ is a binary zero-one variable denoting treatment assignment, $g_k(t) = P(x_k = 1 \mid t)$. Independence between $x_k$ and the covariate $t$ implies that $g_k$ is constant; however, a systematic trend resulting in unbalanced allocation to treatment could yield a non-constant $g_k$.

Let $\Lambda_n$ and $\tilde{\Lambda}_n$ be the $n \times p$ and $n \times d$ matrices whose $k$th columns are $\delta_k = (\delta_{11k}, \ldots, \delta_{1n_1k}, \ldots, \delta_{mn_mk})'$ and $\tilde{\delta}_k = (\tilde{\delta}_{11k}, \ldots, \tilde{\delta}_{1n_1k}, \ldots, \tilde{\delta}_{mn_mk})'$ respectively. We make the following assumption:

(A5) (1) $E\Lambda_n = 0$, $\sup_n n^{-1} E\|\Lambda_n\|^2 < \infty$; $E\tilde{\Lambda}_n = 0$, $\sup_n n^{-1} E\|\tilde{\Lambda}_n\|^2 < \infty$; (2) $k_n M'\Sigma^0 M$ and $k_n M'W^0 M$ are nonsingular for sufficiently large $n$, and the eigenvalues of $k_n M'\Sigma^0 M/n$ and $k_n M'W^0 M/n$ are bounded away from 0 and infinity, where $M = (\pi_1', \ldots, \pi_m')'$, $\Sigma^0 = \mathrm{diag}\{\Sigma^0_1, \ldots, \Sigma^0_m\}$ with $\Sigma^0_i = \Delta_{0i}\Sigma_{0i}^{-1}\Delta_{0i} = \Delta_i(\eta^0_i)\Sigma_{0i}^{-1}\Delta_i(\eta^0_i)$, and $W^0$ is defined in a similar fashion.

We take the number of knots $k_n$ as the integer part of $n^{1/(2s+1)}$, where $s$ is defined

in (A2) and taken as 2 in this work. For this knot number, Condition (2) of (A5) is expected to hold, as this is a property of the B-spline basis functions (He and Shi, 1996).

The asymptotic properties of $(\hat{\beta}, \hat{\gamma}, \hat{\lambda})$ involve the covariance matrix $\Delta_m = (\delta^{kl}_m)_{k,l=1,2,3}$ of $(\tilde{S}_1', \tilde{S}_2', \tilde{S}_3')'/\sqrt{m}$, where $\tilde{S}_1$, $\tilde{S}_2$ and $\tilde{S}_3$ are defined by

    \tilde{S}_1 = \sum_{i=1}^m \tilde{X}_i' \Delta_{0i} \Sigma_{0i}^{-1} (y_i - \mu_{0i}), \quad \tilde{S}_2 = \sum_{i=1}^m V_i^{0\prime} D_{0i}^{-1} (r_{0i} - \hat{r}_{0i}), \quad \tilde{S}_3 = \sum_{i=1}^m \tilde{Z}_i' D_{0i} W_{0i}^{-1} (\epsilon^2_{0i} - \sigma^2_{0i}).    (10)

Here $\tilde{X} = (I - P)X$ with $P = M(M'\Sigma^0 M)^{-1}M'\Sigma^0$; $r_{0i} = y_i - \mu_{0i}$ and $\hat{r}_{0i} = (\hat{r}_{0i1}, \ldots, \hat{r}_{0in_i})'$ with $\hat{r}_{0ij} = \sum_{k=1}^{j-1} r_{0ik} w_{ijk}'\gamma_0$; $V_i^0 = (0, r_{0i1}w_{i21}, \ldots, \sum_{k=1}^{j-1} r_{0ik}w_{ijk}, \ldots)'$; and $\tilde{Z} = (I - P)Z$. We make the following assumption, similar to Assumption 11 in Ye and Pan (2006).

(A6) The covariance matrix $\Delta_m$ is positive definite, and

    \Delta_m = \begin{pmatrix} \delta^{11}_m & \delta^{12}_m & \delta^{13}_m \\ \delta^{21}_m & \delta^{22}_m & \delta^{23}_m \\ \delta^{31}_m & \delta^{32}_m & \delta^{33}_m \end{pmatrix} \to \begin{pmatrix} \delta^{11} & \delta^{12} & \delta^{13} \\ \delta^{21} & \delta^{22} & \delta^{23} \\ \delta^{31} & \delta^{32} & \delta^{33} \end{pmatrix} = \Delta, \quad \text{as } m \to \infty,    (11)

where $\Delta$ is a positive definite matrix.

Theorem 1. If (A1) to (A6) hold and $k_n = O(n^{1/(2s+1)})$, we have

    \frac{1}{n} \sum_{i=1}^m \sum_{j=1}^{n_i} \{\hat{f}_0(t_{ij}) - f_0(t_{ij})\}^2 = O_p(n^{-2s/(2s+1)}),    (12)
    \frac{1}{n} \sum_{i=1}^m \sum_{j=1}^{n_i} \{\hat{f}_1(t_{ij}) - f_1(t_{ij})\}^2 = O_p(n^{-2s/(2s+1)}),    (13)

where $\hat{f}_0(t) = \pi'(t)\hat{\alpha}$ and $\hat{f}_1(t) = \pi'(t)\hat{\tilde{\alpha}}$.

As pointed out by He et al. (2005), (12) and (13) imply that $\int (\hat{f}_i(t) - f_i(t))^2 dt = O_p(n^{-2s/(2s+1)})$, $i = 0, 1$, under some general conditions (Stone, 1985, Lemmas 8 and 9). This is the optimal rate of convergence for estimating $f_0$ and $f_1$ under the smoothness condition (A2). For the parametric coefficients, we have the following.

Theorem 2. Under conditions (A1)-(A6), the generalized estimating equation estimator $(\hat{\beta}_m', \hat{\gamma}_m', \hat{\lambda}_m')'$ is $\sqrt{m}$-consistent and asymptotically normal; that is,

    \sqrt{m} \begin{pmatrix} \hat{\beta}_m - \beta_0 \\ \hat{\gamma}_m - \gamma_0 \\ \hat{\lambda}_m - \lambda_0 \end{pmatrix} \to N\left(0, \begin{pmatrix} \delta^{11} & 0 & 0 \\ 0 & \delta^{22} & 0 \\ 0 & 0 & \delta^{33} \end{pmatrix}^{-1} \begin{pmatrix} \delta^{11} & \delta^{12} & \delta^{13} \\ \delta^{21} & \delta^{22} & \delta^{23} \\ \delta^{31} & \delta^{32} & \delta^{33} \end{pmatrix} \begin{pmatrix} \delta^{11} & 0 & 0 \\ 0 & \delta^{22} & 0 \\ 0 & 0 & \delta^{33} \end{pmatrix}^{-1} \right)

in distribution as $m \to \infty$.

The asymptotic variance reduces to a block diagonal matrix for a normally distributed response, which is a semiparametric analog of Theorem 2 of Ye and Pan (2006). For inference, we use a robust estimator of the covariance matrix of $\hat{\beta}_m$:

    \hat{V}(\hat{\beta}_m) = M_0^{-1} M_1 M_0^{-1},    (14)

where $M_0 = \sum_{i=1}^m \tilde{X}_i' \hat{\Delta}_i \hat{\Sigma}_i^{-1} \hat{\Delta}_i \tilde{X}_i$ and $M_1 = \sum_{i=1}^m \tilde{X}_i' \hat{\Delta}_i \hat{\Sigma}_i^{-1} (y_i - \hat{\mu}_i)(y_i - \hat{\mu}_i)' \hat{\Sigma}_i^{-1} \hat{\Delta}_i \tilde{X}_i$. The estimated covariance matrices of $\hat{\gamma}_m$ and $\hat{\lambda}_m$ can be obtained in a similar way, and the covariances $\delta^{kl}$ ($k \neq l$) can be estimated by their sample versions.

4 Numerical Study

Whenever appropriate, we compare our approach with GEE and with that of Fan et al. (2007). For Fan et al.'s method, we use local linear regression to model $f_0(t)$ in the mean and local constant regression for the marginal variance. For the conventional GEE, we use the same spline representation for $f_0$. All code is written in R, and the function geese in the R package geepack is used for GEE. For brevity, our approach is referred to as the semiparametric mean-variance method (SMV); the approach of Fan et al. (2007), which is used for estimating the correlation, is referred to as the quasi-likelihood method (QL).

4.1 Real data analysis

We apply the proposed estimation method to the CD4 cell study, which has been analyzed by many authors (Zeger and Diggle, 1994; Ye and Pan, 2006). This data set comprises

CD4 cell counts of 369 HIV-infected men, with six covariates: time since seroconversion ($t_{ij}$), age (relative to an arbitrary origin, $x_{ij1}$), packs of cigarettes smoked per day ($x_{ij2}$), recreational drug use ($x_{ij3}$), number of sexual partners ($x_{ij4}$) and mental illness score ($x_{ij5}$). Altogether there are 2,376 values of CD4 cell counts, with multiple repeated measurements taken on each individual at different times, covering a period of approximately eight and a half years. The number of measurements per individual varies from 1 to 12 and the time points are not equally spaced; thus the CD4 cell data are highly unbalanced. We apply a square root transformation to the response, following the suggestion in Zeger and Diggle (1994), where further details about the design and the medical implications of the study can be found.

To model jointly the mean and covariance structures for the CD4 cell data, we use the mean model

    y_{ij} = x_{ij1}\beta_1 + x_{ij2}\beta_2 + x_{ij3}\beta_3 + x_{ij4}\beta_4 + x_{ij5}\beta_5 + f_0(t_{ij}) + e_{ij}.

We take the covariates for the autoregressive components as $w_{ijk} = (1, t_{ij} - t_{ik}, (t_{ij} - t_{ik})^2, (t_{ij} - t_{ik})^3)'$, following the arguments in Ye and Pan (2006), and for the innovation variances we take $z_{ij} = x_{ij}$. The latter specification allows us to examine whether the innovations depend on the covariates. Finally, the number of knots is taken to be 7, which is also the optimal number of knots according to the leave-one-subject-out cross validation.

Table 1 lists the results for $\beta$ from our modified Cholesky decomposition method, where a working AR(1) model with $\delta = 0.2$ is used for the innovation. For comparison, we list the conventional GEE method for the partly linear mean model using different working correlations, including the independence, AR(1) and exchangeable structures. We also include Fan et al.'s approach using AR(1) and ARMA(1,1) for the correlation.
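For one subject, the cubic-in-lag design used here for the autoregressive part can be sketched as follows (Python; the function name is ours, for illustration only):

```python
import numpy as np

def autoregressive_design(t_i):
    """Covariates w_ijk = (1, lag, lag^2, lag^3)' for every within-subject
    pair j > k, with lag = t_ij - t_ik, as in the CD4 analysis."""
    rows = []
    for j in range(1, len(t_i)):
        for k in range(j):
            lag = t_i[j] - t_i[k]
            rows.append([1.0, lag, lag ** 2, lag ** 3])
    return np.array(rows)

# toy subject observed at three times; one row per (j, k) pair
w = autoregressive_design(np.array([0.0, 0.5, 1.5]))
```

Each row, multiplied by $\gamma$, gives one generalized autoregressive coefficient $\phi_{ijk} = w_{ijk}'\gamma$, so the fitted $\phi$ is a cubic polynomial in the time lag.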
The results show that both SMV and QL give estimators with generally smaller standard errors. For our approach, smoking

and drug use are highly significant variables, while mental illness score is marginally significant. QL identifies smoking and mental illness as significant covariates, and estimates drug use as borderline insignificant. The significance of smoking is missed by GEE with the AR(1) covariance structure, while that of drug use is missed by GEE with either the AR(1) or the exchangeable structure. Finally, GEE with the independence working correlation indicates that mental illness score is not significant at all, which contradicts the GEE results under the other working correlations. For the autoregressive and log innovation parameters, our model yields

    \hat{\gamma}_1 = 0.675\,(0.048), \quad \hat{\gamma}_2 = 0.563\,(0.033), \quad \hat{\gamma}_3 = 0.170\,(0.077), \quad \hat{\gamma}_4 = 0.017\,(0.004);
    \hat{\lambda}_1 = 0.003\,(0.011), \quad \hat{\lambda}_2 = 0.085\,(0.045), \quad \hat{\lambda}_3 = 0.048\,(0.122), \quad \hat{\lambda}_4 = 0.006\,(0.015), \quad \hat{\lambda}_5 = 0.004\,(0.005),

where the standard errors are in parentheses. All the autoregressive parameters are highly significant. Figure 1 displays the three fitted curves for $f_0$, for $\phi$ as a function of the time lag, and for $f_1$, when $R_i(\delta)$ is specified as AR(1) with $\delta = 0.2$. The trajectory of the mean curve is consistent with that in Zeger and Diggle (1994). Figure 1(B) plots the estimated generalized autoregressive parameters $\phi$ against the time lag between measurements within the same subject, which, according to our model, is simply a cubic polynomial. This plot shows that the generalized autoregressive parameters decrease sharply when the time lag is less than two years and then drop slowly as the lag becomes larger. The innovation curve appears to fluctuate around a constant. These observations broadly agree with those in Ye and Pan (2006).

It is helpful to compare our method with Fan et al.'s approach in terms of prediction. To this end, we randomly split the data into three parts, each with 369/3 = 123 subjects. We use the first two parts of this partitioning as the training data set to

Figure 1: The CD4 cell data. The fitted curves of (A) the nonparametric part in the mean against time (years since seroconversion), (B) the generalized autoregressive parameters against lag and (C) the nonparametric part in the (log) innovation variances against time, based on the AR(1) structure with $\delta = 0.2$ and square root CD4 cell numbers. Dashed curves represent asymptotic 95% confidence intervals.

Table 1: CD4 cell data. The estimates of parameters based on square root CD4 cell numbers, with standard errors in parentheses.

          SMV       Generalized Estimating Equations             QL
                    Independence   AR(1)     Exchangeable        AR(1)     ARMA(1,1)
    β1    (0.030)   (0.035)        (0.034)   (0.032)             (0.032)   (0.031)
    β2    (0.130)   (0.184)        (0.190)   (0.136)             (0.152)   (0.138)
    β3    (0.345)   (0.528)        (0.350)   (0.358)             (0.358)   (0.329)
    β4    (0.038)   (0.059)        (0.041)   (0.043)             (0.040)   (0.038)
    β5    (0.014)   (0.021)        (0.014)   (0.015)             (0.014)   (0.014)

fit the model, and then assess the out-of-sample performance on the test data set (denoted TD) that is left out. As the performance measure we use the predictive mean squared error $\sum_{i \in TD} \sum_{j=1}^{n_i} (\hat{y}_{ij} - y_{ij})^2 / n_{TD}$, where $\hat{y}_{ij}$ is the fitted response using the training data set only and $n_{TD} = \sum_{i \in TD} n_i$ is the test data sample size. This process is repeated 100 times. The average mean squared errors for QL AR(1), QL ARMA(1,1) and our approach are 37.19, 36.98, with standard errors 2.98, 2.93 and 3.40 respectively. Pairwise t-tests of the mean squared errors show that these approaches perform similarly. Note that our model is more flexible and potentially more powerful, because covariates other than $t$ are allowed in modeling the log innovation variances.

As suggested by an anonymous reviewer, we can use graphical tools to explore the covariance structure produced by the various methods. For a fair comparison, we fit the mean using the regression spline approximation for $f_0(t)$, assuming iid errors; the fitted residuals are then used for covariance modeling. For SMV, since none of the $\lambda$'s is significant, we apply SMV with only a nonparametric function for the log innovation. Note that the iteration for $\theta$ in (5) is no longer needed for this comparison. Define the variogram $v(u, t) = E[\{e(t) - e(t - u)\}^2]/2$, $u \geq 0$, as a function of time lag $u$ and observing time $t$. We then compare the sample variogram, the variogram estimated by QL with the ARMA(1,1) structure, and the variogram estimated by SMV at

the sample points, by plotting $v(u, t)$ versus $u$ and $t$ in Figure 2. The vertical axis is truncated at 40 to accentuate the shape of the smooth estimate. The sample variogram increases with the time lag, corresponding to decreasing correlation as observations are separated in time. Both QL and SMV capture this feature nicely. On the other hand, the plot of the sample variogram versus observation time indicates that $v(u, t)$ is likely independent of $t$, which is missed by QL and, to a lesser degree, by SMV. Note that the large random fluctuation of the sample variogram is a typical phenomenon (Diggle et al., 2002).

Figure 2: The sample variogram and its estimates by SMV and QL for the CD4 data, plotted against time lag ($u$) and time ($t$). Individual points represent the variogram at the sampling points and the solid lines are LOWESS smoothers.
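The sample variogram ordinates plotted in Figure 2 can be computed as follows (a sketch in Python; the function name is ours):

```python
import numpy as np

def sample_variogram(times, resid):
    """Empirical variogram ordinates v = {e(t_j) - e(t_k)}^2 / 2 for all
    within-subject pairs j > k, paired with the time lag u = t_j - t_k.
    `times` and `resid` are lists of per-subject arrays."""
    lags, ords = [], []
    for t_i, e_i in zip(times, resid):
        for j in range(1, len(t_i)):
            for k in range(j):
                lags.append(t_i[j] - t_i[k])
                ords.append(0.5 * (e_i[j] - e_i[k]) ** 2)
    return np.array(lags), np.array(ords)

# toy example: one subject with three residuals
lags, ords = sample_variogram([np.array([0.0, 1.0, 3.0])],
                              [np.array([0.0, 2.0, 1.0])])
```

Scatter-plotting `ords` against `lags` (and against the observation times), with a LOWESS smoother overlaid, reproduces the kind of display shown in Figure 2.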

4.2 Simulation study

We conduct extensive numerical studies to assess the finite sample performance of the proposed method. We also test the asymptotic covariance formula in Theorem 2 and compare SMV with QL and the conventional GEE using working correlations. For each study, we simulate one thousand data sets.

Study 1. We first consider the model

    y_{ij} = x_{ij1}\beta_1 + x_{ij2}\beta_2 + f_0(t_{ij}) + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i,

for $m = 100$. Note that we use $\epsilon_{ij}$ to denote the innovation and $e_{ij}$ the random error associated with observation $j$ for subject $i$. We use the sampling scheme in Fan et al. (2007), in which the observation times are regularly scheduled but may be randomly missed in practice. More precisely, each individual has a set of scheduled time points $\{0, 1, 2, \ldots, 12\}$, and each scheduled time, except time 0, has a 20% probability of being skipped. The actual observation time is a random perturbation of a scheduled time: a uniform $[0, 1]$ random variable is added to each non-skipped scheduled time. This results in different observed time points $t_{ij}$ per subject; the $t_{ij}$ are then transformed onto $[0, 1]$. Note that the longitudinal observations are highly irregular and some of the $\{n_i\}$ are less than the number of parameters for the same subject. We take $x_{ij1} = t_{ij} + \delta_{ij}$, where $\delta_{ij}$ follows the standard normal distribution, and let $x_{ij2}$ follow a Bernoulli distribution with success probability 0.5. Note that $x_{ij1}$ is time-varying. For the nonparametric function in the mean, we take $f_0(t) = \cos(t\pi)$. The error $(e_{i1}, \ldots, e_{in_i})'$ is generated from a multivariate normal distribution with mean 0 and covariance $\Sigma_i$ satisfying $\Phi_i \Sigma_i \Phi_i' = D_i$, where $\Phi_i$ and $D_i$ are as described in Section 2.1 with $w_{ijk} = (1, t_{ij} - t_{ik})'$, $z_{ij} = x_{ij}$ and $f_1(t) = \sin(\pi t)$. We consider using AR(1) for $R_i(\delta)$ in $W_i = A_i^{1/2} R_i(\delta) A_i^{1/2}$, the working covariance structure of

ɛ_i^2. The compound symmetry (exchangeable) structure is also used; because the results are similar, we omit the details. In each case the parameter δ, which measures the correlation between ɛ_ij^2 = (y_ij - ŷ_ij)^2 and ɛ_ik^2 = (y_ik - ŷ_ik)^2, takes four different values, δ = 0, 0.2, 0.5, 0.8, so that the effect of misspecification of R_i(δ) on estimating β, γ and λ can be studied. We take two specifications: Case (1) β = (1, 0.5)', γ = (0.2, 0.3)' and λ = (-0.5, -0.2)'; Case (2) β = (1, 0)', γ = (0.2, 0)' and λ = (-0.5, 0)'. For each setting, the expected sample size is about 1060. The number of knots is taken to be of order n^{1/5}; numerical experiments show that the results are not very sensitive to the number of knots. For each simulated data set we use δ = 0, 0.2, 0.5 and 0.8 to study the robustness of the approach with respect to δ. Table 2 shows that our semiparametric method yields virtually unbiased estimates of the parameters. In addition, when the structure of R_i(δ) is based on AR(1), the parameter δ used in the working covariance structure for the innovations has little effect on the estimation of β, γ and λ or on the estimated mean squared errors for f_0 and f_1. These results imply that SMV is robust against misspecification of the structure of R_i(δ). For Case (1), Figure 3 displays the true and fitted curves of the nonparametric functions f_0 and f_1 when R_i(δ) is specified as AR(1) with δ = 0.2. The three curves f̂_5, f̂_50 and f̂_95 represent the fits that are 5%, 50% and 95% best in terms of mean squared error over the 1000 runs, respectively; they show close agreement with the true functions. To check the robustness of the choice of W, we have also followed Ye and Pan (2006) in generating longitudinal data sets from normal mixture distributions. The results are robust to this deviation from normality and are qualitatively similar to those in Ye and Pan (2006); the detailed results can be found in the online supplemental materials.
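To make the data-generating scheme of Study 1 concrete, here is an illustrative numpy sketch (our own code, not the authors'): it draws the irregular observation times, builds Σ_i from the modified Cholesky relation Φ_i Σ_i Φ_i' = D_i implied by γ, λ and f_1, and generates one subject's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Study 1 settings, Case (1); all choices follow the text above
beta = np.array([1.0, 0.5])
gamma = np.array([0.2, 0.3])   # autoregressive coefficients: phi = w'gamma
lam = np.array([-0.5, -0.2])   # log innovation variances: z'lam + f1(t)
f0 = lambda t: np.cos(np.pi * t)
f1 = lambda t: np.sin(np.pi * t)

def simulate_subject():
    # scheduled times 0..12; each except 0 skipped w.p. 0.2, then jittered
    sched = np.arange(13.0)
    keep = np.r_[True, rng.random(12) > 0.2]
    t = (sched[keep] + rng.uniform(0, 1, keep.sum())) / 13.0  # map to [0, 1]
    n = len(t)
    x1 = t + rng.standard_normal(n)
    x2 = rng.binomial(1, 0.5, n).astype(float)
    X = np.column_stack([x1, x2])
    # modified Cholesky: Phi_i Sigma_i Phi_i' = D_i, Phi_i unit lower triangular
    Phi = np.eye(n)
    for j in range(n):
        for k in range(j):
            w = np.array([1.0, t[j] - t[k]])
            Phi[j, k] = -(w @ gamma)
    log_d = X @ lam + f1(t)                     # z_ij = x_ij
    Phi_inv = np.linalg.inv(Phi)
    Sigma = Phi_inv @ np.diag(np.exp(log_d)) @ Phi_inv.T
    e = rng.multivariate_normal(np.zeros(n), Sigma)
    y = X @ beta + f0(t) + e
    return t, X, y

t, X, y = simulate_subject()
```

Repeating `simulate_subject` m = 100 times reproduces one simulated data set of the kind summarized in Table 2.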
Hereafter, we use AR(1) with δ = 0.2 for R_i(δ) in the following simulations whenever our approach is used.

Table 2: Simulation results for Study 1 over 1000 replications. Presented are the sample means, with sample standard errors in parentheses. MSE(f̂_i), i = 0, 1, is the mean squared error of the estimate of f_i over all time points in the data.

Case (1)     True    δ = 0            δ = 0.2          δ = 0.5          δ = 0.8
β_1           1      1.00 (0.0276)    1.00 (0.0276)    1.00 (0.0276)    1.00 (0.0277)
β_2           0.5    0.50 (0.0579)    0.50 (0.0579)    0.50 (0.0579)    0.50 (0.0579)
γ_1           0.2    0.20 (0.0174)    0.20 (0.0174)    0.20 (0.0174)    0.20 (0.0174)
γ_2           0.3    0.30 (0.0525)    0.30 (0.0525)    0.30 (0.0526)    0.30 (0.0527)
λ_1          -0.5   -0.50 (0.0435)   -0.50 (0.0450)   -0.50 (0.0502)   -0.50 (0.0542)
λ_2          -0.2   -0.20 (0.0931)   -0.20 (0.0974)   -0.20 (0.1082)   -0.20 (0.1161)
MSE(f̂_0)            (0.0371)         (0.0371)         (0.0371)         (0.0371)
MSE(f̂_1)            (0.0081)         (0.0084)         (0.0097)         (0.0130)
Case (2)
β_1           1      1.00 (0.0308)    1.00 (0.0308)    1.00 (0.0308)    1.00 (0.0308)
β_2           0      0.00 (0.0615)    0.00 (0.0616)    0.00 (0.0616)    0.00 (0.0618)
γ_1           0.2    0.20 (0.0181)    0.20 (0.0181)    0.20 (0.0181)    0.20 (0.0182)
γ_2           0      0.00 (0.0487)    0.00 (0.0487)    0.00 (0.0487)    0.00 (0.0488)
λ_1          -0.5   -0.50 (0.0453)   -0.50 (0.0465)   -0.50 (0.0514)   -0.50 (0.0551)
λ_2           0      0.00 (0.0873)    0.00 (0.0909)    0.00 (0.1017)    0.00 (0.1098)
MSE(f̂_0)            (0.0230)         (0.0229)         (0.0230)         (0.0230)
MSE(f̂_1)            (0.0084)         (0.0086)         (0.0010)         (0.0137)

Study 2. This example illustrates the performance of the asymptotic covariance formula in Theorem 2, following the simulation setup of Case (1) in Study 1. In Table 3, SD denotes the sample standard deviation of the 1,000 estimates of β, which can be viewed as the true standard deviation of the resulting estimates; SE denotes the sample average of the 1,000 estimated standard errors computed from formula (14); and Std denotes the standard deviation of those 1,000 standard errors. Table 3 demonstrates that the standard error formula works well for the different AR(1) correlation structures.

Table 3: Assessment of the standard errors using formula (14).

                    δ = 0       δ = 0.2     δ = 0.5     δ = 0.8
β_1   SD
      SE (Std)      (0.0027)    (0.0027)    (0.0027)    (0.0028)
β_2   SD
      SE (Std)      (0.0047)    (0.0047)    (0.0047)    (0.0048)
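The SD/SE/Std comparison of Study 2 is a generic Monte Carlo device. The following self-contained sketch illustrates it with plain OLS standing in for the semiparametric estimator (an illustrative stand-in only, not the paper's estimator): if the standard-error formula is correct, SE should track SD closely and Std should be small.

```python
import numpy as np

rng = np.random.default_rng(2)

# SD: spread of the estimates across replications ("true" sampling SD)
# SE: average of the model-based standard errors
# Std: spread of those standard errors
n, reps, beta_true = 200, 1000, np.array([1.0, 0.5])
ests, ses = [], []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ beta_true + rng.standard_normal(n)
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)          # usual variance estimate
    ests.append(b[1])
    ses.append(np.sqrt(s2 * XtX_inv[1, 1]))

SD = float(np.std(ests))
SE = float(np.mean(ses))
Std = float(np.std(ses))
```

A well-behaved standard-error formula gives SE close to SD, exactly the pattern reported in Table 3.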

Figure 3: The nonparametric functions f_0 and f_1 and their fitted curves f̂_5, f̂_50 and f̂_95, for the AR(1) structure with δ = 0.2. [Panel (A): f_0(t) against time; panel (B): f_1(t) against time.]

Study 3. In this example we assess the effect of misspecification of the working covariance structure Σ_i on the estimation of β and f_0. For comparison, we apply GEE with independent, exchangeable and AR(1) working correlations. The results are summarized in Table 4. Not surprisingly, all methods give almost unbiased estimates of β. However, the standard errors of our semiparametric method are much smaller than those of the other methods, implying that our estimator is more efficient. Furthermore, for the nonparametric part f_0 of the mean, our semiparametric approach gives estimates with significantly smaller mean squared errors. Taken together, the study shows that the SMV approach is more accurate in estimating the mean.

Table 4: Simulation results for Study 3 with standard errors in parentheses.

                            Generalized Estimating Equations
          True   SMV        Independence   Exchangeable   AR(1)
β_1        1     1.00 ( )   1.00 ( )       1.00 ( )       1.00 ( )
β_2        0.5   0.50 ( )   0.50 ( )       0.51 ( )       0.50 ( )
MSE(f_0)         (0.0371)   (0.1328)       (0.1196)       (0.1217)
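For readers unfamiliar with the GEE competitors in Table 4, here is a minimal, illustrative numpy implementation of a GEE iteration with an identity link and an exchangeable working correlation (a sketch under simplifying assumptions, not the authors' code): the working correlation parameter is updated by a moment estimate and β by solving the weighted estimating equation.

```python
import numpy as np

def gee_exchangeable(X_list, y_list, n_iter=20):
    """Minimal GEE for an identity link with an exchangeable working
    correlation.  X_list[i] is the n_i x p design matrix of subject i
    and y_list[i] its n_i responses."""
    p = X_list[0].shape[1]
    beta = np.zeros(p)
    alpha = 0.0
    for _ in range(n_iter):
        # moment estimates of the scale and the common correlation
        resid = [y - X @ beta for X, y in zip(X_list, y_list)]
        allr = np.concatenate(resid)
        sigma2 = allr @ allr / len(allr)
        num, den = 0.0, 0
        for r in resid:
            n = len(r)
            num += r.sum() ** 2 - r @ r     # sum over j != k of r_j r_k
            den += n * (n - 1)
        alpha = num / (den * sigma2) if den > 0 else 0.0
        # solve sum_i X_i' V_i^{-1} (y_i - X_i beta) = 0 for beta
        A, b = np.zeros((p, p)), np.zeros(p)
        for X, y in zip(X_list, y_list):
            n = len(y)
            V = sigma2 * ((1 - alpha) * np.eye(n) + alpha * np.ones((n, n)))
            Vi = np.linalg.inv(V)
            A += X.T @ Vi @ X
            b += X.T @ Vi @ y
        beta = np.linalg.solve(A, b)
    return beta, alpha

# toy data with an exchangeable error structure:
# y_ij = 1 + 0.5 x_ij + u_i + eps_ij, Var(u_i) = Var(eps_ij) = 0.49
rng = np.random.default_rng(3)
X_list, y_list = [], []
for _ in range(300):
    n = int(rng.integers(3, 8))
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ np.array([1.0, 0.5]) + 0.7 * rng.standard_normal() \
        + 0.7 * rng.standard_normal(n)
    X_list.append(X); y_list.append(y)
beta_hat, alpha_hat = gee_exchangeable(X_list, y_list)
```

With equal random-intercept and noise variances, the true exchangeable correlation is 0.5, which the moment estimate should roughly recover.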

Study 4. We use this example to compare SMV and QL for estimating the covariance matrix. The mean model, taken from Example 5.1 of Fan and Wu (2008), has the parametric form

y_ij = β_1 x_ij1 + β_2 x_ij2 + β_3 x_ij3 + e_ij,

where β = (β_1, β_2, β_3)' = (2, 1.5, 3)', the x_ijk marginally follow the standard normal distribution, and the correlation between the ith and the jth covariates in x is 0.5^{|i-j|}. We use a linear model for the mean because the main difference between SMV and QL lies in the covariance estimation. We consider two scenarios for generating the covariance matrix, the first of which favors QL and the second SMV. Specifically, for QL we take h_0(t) = 0.5 exp(t/12 + sin(t/12)) as the marginal variance function and the true correlation matrix to be AR(1) with parameter a = 0, 0.3, 0.6 or 0.9. For SMV, we take h_0 as the innovation function and w_ijk = (1, t_ij - t_ik)' as in Study 1, but fix γ_1 = 0 while allowing γ_2 to be 0.3, 0.6 or 0.9. Note that the numerical values of γ_2 are not directly comparable to the values of a, because one parameterizes the autoregressive structure and the other the correlation. We remark that when a = 0 both the variance-correlation and the modified Cholesky decompositions are appropriate. The observation times are generated as in Study 1, and we take m = 200 as in Fan and Wu (2008). For QL, we use the AR(1) working correlation structure. Thus, if the data-generating process favors QL, we would fit a correctly specified model using QL but a misspecified model using SMV. Since both approaches give virtually unbiased estimates of β, here we only report the sample standard errors of the estimated β under these two methods, together with those using GEE with working independence (GEE). To compare accuracy in estimating the covariance matrix, we use the entropy loss

L(Σ, Σ̂) = m^{-1} Σ_{i=1}^m { trace(Σ_i Σ̂_i^{-1}) - log |Σ_i Σ̂_i^{-1}| - n_i },

where Σ_i is the true covariance matrix and Σ̂_i its estimator. The results are summarized in Table 5.
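The entropy loss above is straightforward to compute. A small numpy helper (the naming is ours) might look like this; it is zero exactly when every Σ̂_i equals Σ_i, and positive otherwise.

```python
import numpy as np

def entropy_loss(Sigmas, Sigma_hats):
    """Entropy loss
       L = m^{-1} sum_i { trace(S_i Shat_i^{-1}) - log|S_i Shat_i^{-1}| - n_i },
    averaged over the m subjects; zero iff every Shat_i equals S_i."""
    losses = []
    for S, Shat in zip(Sigmas, Sigma_hats):
        M = S @ np.linalg.inv(Shat)
        sign, logdet = np.linalg.slogdet(M)   # stable log-determinant
        losses.append(np.trace(M) - logdet - S.shape[0])
    return float(np.mean(losses))
```

For example, overestimating every covariance by a factor of 2 gives M = 0.5 I, so each subject contributes n_i/2 + n_i log 2 - n_i to the loss.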
When the model favors Fan et al.'s approach (i.e., a ≠ 0), QL gives more accurate estimates of β and of the covariance structure, as judged by the smaller sample standard deviations for β and the smaller means of the entropy loss. On the other hand, SMV performs better when the data are generated from the second scenario (i.e., γ_2 ≠ 0). Both SMV and QL outperform the GEE estimator with working independence, even under misspecified models. Interestingly, when both approaches are appropriate (a = 0), our approach performs better, with smaller standard errors and entropy losses. This may be due to the fact that Fan et al. (2007) used a local constant estimate of the marginal variance.

Table 5: Simulation results comparing SMV and QL. For β, the sample standard errors (multiplied by a factor of 1000) are reported, while for L(Σ, Σ̂) both the mean and the sample standard error (in parentheses), multiplied by a factor of 100, are reported.

            β_1             β_2             β_3             L(Σ, Σ̂)
            QL  SMV  GEE    QL  SMV  GEE    QL  SMV  GEE    QL        SMV
a = 0                                                       (7.6)     11.7 (2.0)
a = 0.3                                                     (1.8)     96.6 (4.2)
a = 0.6                                                     (2.2)     (6.2)
a = 0.9                                                     (5.3)     (9.9)
γ_2 = 0.3                                                   (11.0)    11.8 (2.1)
γ_2 = 0.6                                                   (9.9)     11.6 (2.1)
γ_2 = 0.9                                                   (24.1)    12.8 (2.0)

5 Discussion

We have proposed semiparametric mean-covariance models for longitudinal data analysis using the modified Cholesky decomposition. This decomposition is adopted so that partly linear regression models can be applied to the autoregressive coefficients and the log innovation variances. On the one hand, our approach extends the semiparametric model for the mean in longitudinal analysis; on the other hand, it relaxes the parametric assumption made by Ye and Pan (2006). We have provided theoretical justification for the proposed approach and compared our method with the variance-correlation decomposition of Fan et al. (2007) and with the conventional GEE.

In practice, the true underlying covariance structure is unknown, and it is important to develop tools to check whether the imposed covariance decomposition is reasonable. We used the variogram for this purpose in the CD4 data analysis; more work needs to be done in this direction. Our results are valid under the usual assumption that the number of subjects m goes to infinity while the numbers of repeated measurements n_i remain bounded. When m, or max{n_i}, or both go to infinity, the asymptotic results of Xie and Yang (2003) may be extendable to our setup, since the model in (4) can be seen as a semiparametric generalization of the GEE; our setup is more challenging, however, because of the simultaneous models for the mean and the covariance. Although limited simulation (not shown) indicates that our method still works, a rigorous study deserves further attention. In this work we consider only the classical setup in which x and z are finite-dimensional. For a diverging number of covariates in the mean, Lam and Fan (2008) studied the profile likelihood estimator; for our semiparametric approach, it would be interesting to investigate the statistical properties with diverging numbers of parameters in both the mean and the variance. Finally, it would be interesting to extend the semiparametric approach to nonnormal longitudinal data analysis.

Appendix: sketch of proofs

Proofs of Lemma 2 and Theorem 2 are available in the supplemental materials. The following lemma is stated for easy reference (Schumaker, 1981, Theorem 12.7).

Lemma 1. Under Assumptions (A1)-(A2), there exist constants C_0 and C_1 such that

sup_{t ∈ [0,1]} |f_0(t) - π(t)'α_0| ≤ C_0 k_n^{-s}   and   sup_{t ∈ [0,1]} |f_1(t) - π(t)'ᾱ_0| ≤ C_1 k_n^{-s}.

Proof of Theorem 1. Equation (12) can be obtained directly from He et al. (2005).

Here we only give a proof of equation (13) when all W_i are known, denoted by W_0i; similar asymptotic results hold when the W_0i are replaced by consistent estimates. Let

T_m = ( A_m^{-1/2}    -A_m^{-1/2} k_n^{1/2} H'W_0 M (M'W_0 M)^{-1} )
      ( 0             k_n^{1/2} Q_m^{-1}                           ),

where A_m = H'W_0 H = Σ_{i=1}^m H_i'W_0i^{-1} H_i and Q_m^2 = k_n M'W_0 M. Obviously, condition (A6) implies that A_m/m → A > 0 in probability for some positive definite matrix A as m → ∞. From the definition of T_m it is easy to see that

T_m' { Σ_{i=1}^m H_i' D_0i W_0i^{-1} D_0i H_i } T_m = I_{d+K},

where I_{d+K} is the (d+K) × (d+K) identity matrix. We further denote

ζ(ρ) = ζ = (ζ_1', ζ_2')' = (T_m')^{-1} (ρ - ρ_0)
     = ( A_m^{1/2}(λ - λ_0),  k_n^{-1/2} Q_m(ᾱ - ᾱ_0) + k_n^{-1/2} Q_m^{-1} M'W_0 H(λ - λ_0) )'

and ζ̂ = (ζ̂_1', ζ̂_2')' = ζ(λ̂, ᾱ̂).   (A.1)

From Lemma 1 it follows that, for sufficiently large n,

n^{-1} Σ_i Σ_{j=1}^{n_i} { f̂_1(t_ij) - f_1(t_ij) }^2 ≤ 2 n^{-1} Σ_i Σ_j (π_ij'(ᾱ̂ - ᾱ_0))^2 + 2 C_1^2 k_n^{-2s}

and ||A_m^{1/2}(λ̂ - λ_0)|| ≤ ||ζ̂||, with

{ Σ_i Σ_j (π_ij'(ᾱ̂ - ᾱ_0))^2 }^{1/2} = ||M(ᾱ̂ - ᾱ_0)|| ≤ C k_n^{-1/2} ||Q_m(ᾱ̂ - ᾱ_0)|| ≤ C ||ζ̂|| + C λ_n^{-1/2} ||λ̂ - λ_0|| sup_{||a||=1, ||b||=1} |a'M'W_0 H b| k_n^{-1/2},

where λ_n is the minimum eigenvalue of k_n M'W_0 M/n. Then, by Lemma 6.2 of He and Shi (1996), it suffices to show that ||ζ̂|| = O_p(k_n^{1/2}). To this end, let R_mi = π_i ᾱ_0 - f_1(t_i), η_i^0 = H_i λ_0 + f_1(t_i), and ς_i = H̃_i ζ + R_mi, where H̃_i = H_i T_m = (H_i A_m^{-1/2}, π_i Q_m^{-1} k_n^{1/2}). Then the variance model can be written as σ_i^2 = exp(η_i^0 + ς_i), and the third estimating equation of (4) can be rewritten as

S_ζ(ζ) = Σ_{i=1}^m H_i' D_i(η_i^0 + ς_i) W_0i^{-1} { ɛ_i^2 - exp(η_i^0 + ς_i) } = 0.   (A.2)

Multiplying equation (A.2) by T_m' gives

Ψ(ζ) = T_m' S_ζ(ζ) = Σ_{i=1}^m H̃_i' D_i(η_i^0 + ς_i) W_0i^{-1} { ɛ_i^2 - exp(η_i^0 + ς_i) } = 0.   (A.3)

By conditions (A4) and (A5), (A.2) and (A.3) have the same root in ζ. Let a ∈ R^{d+K} with a'a = 1. A Taylor expansion of a'Ψ(ζ) gives

a'Ψ(ζ) = Σ_i a'H̃_i' D_i(η_i^0 + ς_i) W_0i^{-1} (ɛ_i^2 - exp(η_i^0 + ς_i))
       = Σ_i a'H̃_i' D_0i W_0i^{-1} (ɛ_i^2 - σ_0i^2) - Σ_i a'H̃_i' D_0i W_0i^{-1} D_0i ς_i
         + Σ_i ς_i' [∂{a'H̃_i' D_i W_0i^{-1} (ɛ_i^2 - σ_0i^2)}/∂ς_i]_{ς_i = 0} + R_m(ς),   (A.4)

where R_m(ς) = Σ_{i=1}^m R_mi(ς_i) and

R_mi(ς_i) = (1/2) ς_i' [∂^2{a'H̃_i' D_i W_0i^{-1} (ɛ_i^2 - σ_i^2)}/∂ς_i ∂ς_i']_{ς_i = ς_i*} ς_i,

with ς_i* = η_i^0 + τ_i ς_i for some 0 < τ_i < 1, i = 1, ..., m. Further, let Φ(ζ) = Σ_i H̃_i' D_0i W_0i^{-1} (ɛ_0i^2 - σ_0i^2) - ζ, where ɛ_0i = y_i - µ_0i, and denote the root of Φ by ζ̃, that is,

ζ̃ = (ζ̃_1', ζ̃_2')' = Σ_{i=1}^m H̃_i' D_0i W_0i^{-1} (ɛ_0i^2 - σ_0i^2).   (A.5)

From these two expressions, the difference between a'Ψ(ζ) and a'Φ(ζ) can be expressed as

a'(Ψ(ζ) - Φ(ζ)) = Σ_i a'H̃_i' D_0i W_0i^{-1} (ɛ_i^2 - ɛ_0i^2) - Σ_i a'H̃_i' D_0i W_0i^{-1} D_0i R_mi
                 + Σ_i ς_i' [∂{a'H̃_i' D_i W_0i^{-1} (ɛ_i^2 - σ_0i^2)}/∂ς_i]_{ς_i = 0} + R_m(ς)
                =: I_n0 - I_n1 + I_n2(ζ) + R_m(ς).

By the Cauchy-Schwarz inequality, the definition of H̃, and k_n = O(n^{1/(2s+1)}), we have

E(I_n0)^2 ≤ Σ_i a'H̃_i' D_0i W_0i^{-1} D_0i H̃_i a · Σ_i E(ɛ_i^2 - ɛ_0i^2)' W_0i^{-1} (ɛ_i^2 - ɛ_0i^2)

= Σ_i E{ D(∆µ_i) ɛ_0i + (∆µ_i)^2 }' W_0i^{-1} { D(∆µ_i) ɛ_0i + (∆µ_i)^2 }
≤ C Σ_i [ trace{ D(∆µ_i) W_0i^{-1} D(∆µ_i) Σ_0i } + {(∆µ_i)^2}' W_0i^{-1} (∆µ_i)^2 + Σ_{j=1}^{n_i} (∆µ_ij)^4 ] = O_p(k_n),   (A.6)

where ∆µ_i = (µ_i1 - µ_0i1, ..., µ_in_i - µ_0in_i)' and D(∆µ_i) = diag{µ_i1 - µ_0i1, ..., µ_in_i - µ_0in_i}, with µ_i = µ(B_i θ̂) = g^{-1}(B_i θ̂). The last inequality in (A.6) follows easily from He et al. (2005). Thus I_n0 = O_p(k_n^{1/2}). For I_n1, obviously,

|I_n1| = |Σ_i a'H̃_i' D_0i W_0i^{-1} D_0i R_mi| = |a'H̃'W_0 R_m| ≤ {a'H̃'W_0 H̃ a}^{1/2} {R_m'W_0 R_m}^{1/2} = O(n^{1/2} k_n^{-s}) = O(k_n^{1/2}),

where R_m = (R_m1', ..., R_mm')'. For I_n2(ζ), write

I_n2(ζ) = Σ_i ζ'H̃_i' G_0,i + Σ_i R_mi' G_0,i =: I_n2^{(1)}(ζ) + I_n2^{(2)},

where G_0,i = [∂{a'H̃_i' D_i W_0i^{-1} (ɛ_i^2 - σ_0i^2)}/∂ς_i]_{ς_i = 0}. Let ẽ_i = W_0i^{-1}(ɛ_i^2 - σ_0i^2) = (ẽ_i1, ..., ẽ_in_i)'. It is easy to see that G_0,i = diag{σ_0i1^2 ẽ_i1, ..., σ_0in_i^2 ẽ_in_i} H̃_i a =: A(ẽ_i) H̃_i a. Then, by the Cauchy-Schwarz inequality,

(I_n2^{(1)})^2 = ( Σ_i ζ'H̃_i' A(ẽ_i) H̃_i a )^2 ≤ ||ζ||^2 Σ_{k=1}^{d̃} ( Σ_i 1_k' H̃_i' A(ẽ_i) H̃_i a )^2 ≤ ||ζ||^2 Σ_{k,j=1}^{d̃} ( Σ_i 1_k' H̃_i' A(ẽ_i) H̃_i 1_j )^2,

where d̃ = d + K and 1_k = (0, ..., 0, 1, 0, ..., 0)' is the d̃-vector with 1 as its kth element and 0 elsewhere. By conditions (A1), (A3), (A5) and (A6), we have

E(I_n2^{(1)})^2 ≤ C ||ζ||^2 Σ_{k,j=1}^{d̃} E( Σ_i 1_k' H̃_i' A(ẽ_i) H̃_i 1_j )^2

≤ C ||ζ||^2 Σ_{k,j=1}^{d̃} Σ_i 1_k' H̃_i' H̃_i 1_k E||A(ẽ_i) H̃_i 1_j||^2
≤ C ||ζ||^2 sup_i trace{H̃_i' H̃_i} Σ_{k=1}^{d̃} Σ_i 1_k' H̃_i' H̃_i 1_k O(k_n)
= C ||ζ||^2 sup_i trace{H_i A_m^{-1} H_i' + k_n π_i Q_m^{-2} π_i'} O(k_n)
= O(k_n^3 ||ζ||^2 / n),

where the constant C, independent of n, may vary from line to line. Therefore, for sufficiently large L,

sup_{||ζ|| ≤ L k_n^{1/2}, a'a = 1} |I_n2^{(1)}(ζ)| = O_p(n^{-1/2} k_n^2).

Similarly, we can show that sup_{a'a=1} |I_n2^{(2)}| = O(k_n^{1-s}). Combining these two results and noting that k_n = O(n^{1/(2s+1)}), we obtain sup_{a'a=1} |I_n2| = O_p(k_n^{1/2}).

For R_m(ς), let F_i = [∂^2{a'H̃_i' D_i W_0i^{-1} (ɛ_i^2 - σ_i^2)}/∂ς_i ∂ς_i']_{ς_i = ς_i*}. We see that

R_m(ς) = (1/2) Σ_i ζ'H̃_i' F_i H̃_i ζ + Σ_i R_mi' F_i H̃_i ζ + (1/2) Σ_i R_mi' F_i R_mi =: I_n3^{(1)}(ζ) + I_n3^{(2)}(ζ) + I_n3^{(3)}.

By (A3), (A5) and (A6), sup_{a'a=1} ||F_i|| = O_p(n^{-1/2} k_n^{1/2}). Hence

sup_{||ζ|| ≤ L k_n^{1/2}, a'a=1} |I_n3^{(1)}(ζ)| = O_p(n^{-1/2} k_n^{5/2}),
sup_{||ζ|| ≤ L k_n^{1/2}, a'a=1} |I_n3^{(2)}(ζ)| = O_p(k_n^{3/2-s}),
sup_{||ζ|| ≤ L k_n^{1/2}, a'a=1} |I_n3^{(3)}| = O_p(n^{1/2} k_n^{1/2-2s}),

so that sup_{||ζ|| ≤ L k_n^{1/2}, a'a=1} |R_m(ς)| = O_p(k_n^{1/2}). Putting all the approximations together, we have

sup_{||ζ|| ≤ L k_n^{1/2}} ||Ψ(ζ) - Φ(ζ)|| = O_p(k_n^{1/2}),

and, for sufficiently large C, direct calculation gives

E||ζ̃||^2 = E[ Σ_i (ɛ_0i^2 - σ_0i^2)' W_0i^{-1} D_0i H_i A_m^{-1} H_i' D_0i W_0i^{-1} (ɛ_0i^2 - σ_0i^2)
            + k_n Σ_i (ɛ_0i^2 - σ_0i^2)' W_0i^{-1} D_0i π_i Q_m^{-2} π_i' D_0i W_0i^{-1} (ɛ_0i^2 - σ_0i^2) ]
         ≤ C trace{ H'A_m^{-1} H + k_n M Q_m^{-2} M' } = O(k_n).


More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Rejoinder. 1 Phase I and Phase II Profile Monitoring. Peihua Qiu 1, Changliang Zou 2 and Zhaojun Wang 2

Rejoinder. 1 Phase I and Phase II Profile Monitoring. Peihua Qiu 1, Changliang Zou 2 and Zhaojun Wang 2 Rejoinder Peihua Qiu 1, Changliang Zou 2 and Zhaojun Wang 2 1 School of Statistics, University of Minnesota 2 LPMC and Department of Statistics, Nankai University, China We thank the editor Professor David

More information

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Let X 1, X 2,, X n be a sequence of i.i.d. observations from a

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

BIOS 2083 Linear Models c Abdus S. Wahed

BIOS 2083 Linear Models c Abdus S. Wahed Chapter 5 206 Chapter 6 General Linear Model: Statistical Inference 6.1 Introduction So far we have discussed formulation of linear models (Chapter 1), estimability of parameters in a linear model (Chapter

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software December 2017, Volume 82, Issue 9. doi: 10.18637/jss.v082.i09 jmcm: An R Package for Joint Mean-Covariance Modeling of Longitudinal Data Jianxin Pan The University of

More information

TESTING SERIAL CORRELATION IN SEMIPARAMETRIC VARYING COEFFICIENT PARTIALLY LINEAR ERRORS-IN-VARIABLES MODEL

TESTING SERIAL CORRELATION IN SEMIPARAMETRIC VARYING COEFFICIENT PARTIALLY LINEAR ERRORS-IN-VARIABLES MODEL Jrl Syst Sci & Complexity (2009) 22: 483 494 TESTIG SERIAL CORRELATIO I SEMIPARAMETRIC VARYIG COEFFICIET PARTIALLY LIEAR ERRORS-I-VARIABLES MODEL Xuemei HU Feng LIU Zhizhong WAG Received: 19 September

More information

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53 State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

More information

The equivalence of the Maximum Likelihood and a modified Least Squares for a case of Generalized Linear Model

The equivalence of the Maximum Likelihood and a modified Least Squares for a case of Generalized Linear Model Applied and Computational Mathematics 2014; 3(5): 268-272 Published online November 10, 2014 (http://www.sciencepublishinggroup.com/j/acm) doi: 10.11648/j.acm.20140305.22 ISSN: 2328-5605 (Print); ISSN:

More information

PQL Estimation Biases in Generalized Linear Mixed Models

PQL Estimation Biases in Generalized Linear Mixed Models PQL Estimation Biases in Generalized Linear Mixed Models Woncheol Jang Johan Lim March 18, 2006 Abstract The penalized quasi-likelihood (PQL) approach is the most common estimation procedure for the generalized

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017 Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 6 Jakub Mućk Econometrics of Panel Data Meeting # 6 1 / 36 Outline 1 The First-Difference (FD) estimator 2 Dynamic panel data models 3 The Anderson and Hsiao

More information

Linear Models and Estimation by Least Squares

Linear Models and Estimation by Least Squares Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

ATINER's Conference Paper Series STA

ATINER's Conference Paper Series STA ATINER CONFERENCE PAPER SERIES No: LNG2014-1176 Athens Institute for Education and Research ATINER ATINER's Conference Paper Series STA2014-1255 Parametric versus Semi-parametric Mixed Models for Panel

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Y. Wang M. J. Daniels wang.yanpin@scrippshealth.org mjdaniels@austin.utexas.edu

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Department of Mathematics Carl von Ossietzky University Oldenburg Sonja Greven Department of

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Analysis of Longitudinal Data with Semiparametric Estimation of Covariance Function

Analysis of Longitudinal Data with Semiparametric Estimation of Covariance Function Analysis of Longitudinal Data with Semiparametric Estimation of Covariance Function Jianqing Fan, Tao Huang and Runze Li Abstract Improving efficiency for regression coefficients and predicting trajectories

More information

Modelling Survival Events with Longitudinal Data Measured with Error

Modelling Survival Events with Longitudinal Data Measured with Error Modelling Survival Events with Longitudinal Data Measured with Error Hongsheng Dai, Jianxin Pan & Yanchun Bao First version: 14 December 29 Research Report No. 16, 29, Probability and Statistics Group

More information

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Marie Davidian North Carolina State University davidian@stat.ncsu.edu www.stat.ncsu.edu/ davidian Joint work with A. Tsiatis,

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Linear Mixed Models for Longitudinal Data Yan Lu April, 2018, week 12 1 / 34 Correlated data multivariate observations clustered data repeated measurement

More information

STAT 526 Advanced Statistical Methodology

STAT 526 Advanced Statistical Methodology STAT 526 Advanced Statistical Methodology Fall 2017 Lecture Note 10 Analyzing Clustered/Repeated Categorical Data 0-0 Outline Clustered/Repeated Categorical Data Generalized Linear Mixed Models Generalized

More information

Robustness to Parametric Assumptions in Missing Data Models

Robustness to Parametric Assumptions in Missing Data Models Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice

More information

if n is large, Z i are weakly dependent 0-1-variables, p i = P(Z i = 1) small, and Then n approx i=1 i=1 n i=1

if n is large, Z i are weakly dependent 0-1-variables, p i = P(Z i = 1) small, and Then n approx i=1 i=1 n i=1 Count models A classical, theoretical argument for the Poisson distribution is the approximation Binom(n, p) Pois(λ) for large n and small p and λ = np. This can be extended considerably to n approx Z

More information

A Measure of Robustness to Misspecification

A Measure of Robustness to Misspecification A Measure of Robustness to Misspecification Susan Athey Guido W. Imbens December 2014 Graduate School of Business, Stanford University, and NBER. Electronic correspondence: athey@stanford.edu. Graduate

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models Thomas Kneib Institute of Statistics and Econometrics Georg-August-University Göttingen Department of Statistics

More information

Modeling Repeated Functional Observations

Modeling Repeated Functional Observations Modeling Repeated Functional Observations Kehui Chen Department of Statistics, University of Pittsburgh, Hans-Georg Müller Department of Statistics, University of California, Davis Supplemental Material

More information

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Statistica Sinica 13(2003), 1201-1210 ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Liqun Wang University of Manitoba Abstract: This paper studies a minimum distance moment estimator for

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models John M. Neuhaus Charles E. McCulloch Division of Biostatistics University of California, San

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions

Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Supplement to Quantile-Based Nonparametric Inference for First-Price Auctions Vadim Marmer University of British Columbia Artyom Shneyerov CIRANO, CIREQ, and Concordia University August 30, 2010 Abstract

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling J. Shults a a Department of Biostatistics, University of Pennsylvania, PA 19104, USA (v4.0 released January 2015)

More information

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES

SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Statistica Sinica 19 (2009), 71-81 SMOOTHED BLOCK EMPIRICAL LIKELIHOOD FOR QUANTILES OF WEAKLY DEPENDENT PROCESSES Song Xi Chen 1,2 and Chiu Min Wong 3 1 Iowa State University, 2 Peking University and

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Motivational Example

Motivational Example Motivational Example Data: Observational longitudinal study of obesity from birth to adulthood. Overall Goal: Build age-, gender-, height-specific growth charts (under 3 year) to diagnose growth abnomalities.

More information

1.1 Basis of Statistical Decision Theory

1.1 Basis of Statistical Decision Theory ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of

More information

Generalized Linear Models with Functional Predictors

Generalized Linear Models with Functional Predictors Generalized Linear Models with Functional Predictors GARETH M. JAMES Marshall School of Business, University of Southern California Abstract In this paper we present a technique for extending generalized

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information