New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data


New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data. Weixin Yao and Runze Li, The Pennsylvania State University. Technical Report Series #0-03, College of Health and Human Development, The Pennsylvania State University.

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data

Weixin Yao, Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu

Runze Li, Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802-2111, U.S.A. rli@stat.psu.edu

Abstract

This paper develops a new estimation procedure for the nonparametric regression function of clustered or longitudinal data. We propose to use the Cholesky decomposition and profile least squares techniques to estimate the correlation structure and the regression function simultaneously. We further prove that the proposed estimator is asymptotically as efficient as if the covariance matrix were known. A Monte Carlo simulation study is conducted to examine the finite sample performance of the proposed procedure and to compare it with existing procedures. Based on our empirical studies, the newly proposed procedure works better than naive local linear regression with a working independence error structure, and the efficiency gain can be achieved in moderate-sized samples. A real data application is also provided to illustrate the proposed estimation procedure.

Key Words: Cholesky decomposition; Local polynomial regression; Longitudinal data; Profile least squares.

Li's research is supported by National Institute on Drug Abuse grants R21 DA024260 and P50 DA10075 and a National Science Foundation grant DMS 0348869. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.

1 Introduction

For clustered or longitudinal data, observations collected from the same subject at different times are correlated, while observations from different subjects are typically independent. It is therefore of great interest to estimate the regression function in a way that incorporates the within-subject correlation, so as to improve estimation efficiency. This issue has been well studied for parametric regression models in the literature; see, for example, the generalized method of moments (Hansen, 1982), the generalized estimating equation (Liang and Zeger, 1986), and the quadratic inference function (Qu, Lindsay, and Li, 2000). Parametric regression generally has simple, intuitive interpretations and provides a parsimonious description of the relationship between the response variable and its covariates. However, such strong model assumptions may introduce modeling biases and lead to erroneous conclusions when the model is misspecified.

In this article, we focus on the nonparametric regression model for longitudinal data. Suppose that $\{(x_{ij}, y_{ij}),\ i = 1, \ldots, n,\ j = 1, \ldots, J_i\}$ is a random sample from the following nonparametric regression model:
$$ y_{ij} = m(x_{ij}) + \epsilon_{ij}, \qquad (1) $$
where $m(\cdot)$ is a nonparametric smooth function and $\epsilon_{ij}$ is a random error. Here $(x_{ij}, y_{ij})$ is the $j$th observation of the $i$th subject or cluster; thus $(x_{ij}, y_{ij})$, $j = 1, \ldots, J_i$, are correlated.

There has been substantial research interest in developing nonparametric estimation procedures for $m(\cdot)$ in the clustered/longitudinal data setting. Lin and Carroll (2000) proposed the kernel GEE, an extension of the parametric GEE, for model (1) and showed that the kernel GEE works best when the within-subject correlation is not incorporated. Wang (2003) proposed a marginal kernel method for longitudinal data and proved its efficiency when the true correlation structure is incorporated. She also demonstrated that the marginal kernel method using the true correlation structure yields a more efficient estimate than Lin and Carroll (2000)'s kernel GEE. In this article, we propose a new procedure to estimate the correlation structure and the regression function simultaneously, based on the Cholesky decomposition and profile least squares techniques.

We derive the asymptotic bias and variance of the resulting estimator and further establish its asymptotic normality. We also provide a theoretical comparison: we show that the newly proposed procedure is more efficient than Lin and Carroll (2000)'s kernel GEE, and we prove that the proposed estimator is asymptotically as efficient as if the true covariance matrix were known a priori. Compared with the marginal kernel method of Wang (2003), the newly proposed procedure does not require specifying a working correlation structure. This is appealing in practice because the true correlation structure is typically unknown. Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed procedure and to compare it with existing procedures. Results from our empirical studies suggest that the newly proposed procedure performs better than naive local linear regression and that the efficiency gain can be achieved in moderate-sized samples. We illustrate the proposed estimation method with an analysis of a real data set.

The remainder of this paper is organized as follows. In Section 2, we introduce the new estimation procedure based on profile least squares and the Cholesky decomposition, and we provide the asymptotic results of the proposed estimator. In Section 3, we present a numerical comparison and the analysis of a real data example. The proofs and the regularity conditions are given in the Appendix.

2 New estimation procedures

For ease of presentation, we start with balanced longitudinal data. The proposed procedure can easily be adapted to nearly balanced longitudinal data by binning the data. In Section 2.2, we discuss how to use the Cholesky decomposition to incorporate the within-subject correlation into the local estimation procedure for unbalanced longitudinal data.

Suppose $\{(x_{ij}, y_{ij}),\ i = 1, \ldots, n,\ j = 1, \ldots, J\}$ is a random sample from model (1). In this paper we mainly consider univariate $x_{ij}$; the newly proposed procedures are applicable to multivariate $x_{ij}$, but are practically less useful due to the curse of dimensionality. Let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iJ})^T$ and $x_i = (x_{i1}, \ldots, x_{iJ})^T$, and suppose $\mathrm{cov}(\epsilon_i \mid x_i) = \Sigma$.

Based on the Cholesky decomposition, there exists a lower triangular matrix $\Phi$ with unit diagonal such that $\mathrm{cov}(\Phi\epsilon_i) = \Phi\Sigma\Phi^T = D$, where $D$ is a diagonal matrix. In other words, we have
$$ \epsilon_{i1} = e_{i1}, \qquad \epsilon_{ij} = \phi_{j,1}\epsilon_{i,1} + \cdots + \phi_{j,j-1}\epsilon_{i,j-1} + e_{ij}, \qquad i = 1, \ldots, n,\ j = 2, \ldots, J, $$
where $e_i = (e_{i1}, \ldots, e_{iJ})^T = \Phi\epsilon_i$ and $\phi_{j,l}$ is the negative of the $(j, l)$ element of $\Phi$. Let $D = \mathrm{diag}(d_1^2, \ldots, d_J^2)$. Since $D$ is a diagonal matrix, the $e_{ij}$'s are uncorrelated and $\mathrm{var}(e_{ij}) = d_j^2$, $j = 1, \ldots, J$.

If $\{\epsilon_1, \ldots, \epsilon_n\}$ were available, then we could work with the following partially linear model with uncorrelated error terms $e_{ij}$:
$$ y_{i1} = m(x_{i1}) + e_{i1}, \qquad y_{ij} = m(x_{ij}) + \phi_{j,1}\epsilon_{i,1} + \cdots + \phi_{j,j-1}\epsilon_{i,j-1} + e_{ij}, \qquad i = 1, \ldots, n,\ j = 2, \ldots, J. \qquad (2) $$
In practice, however, $\epsilon_{ij}$ is not available, but it may be estimated by $\hat\epsilon_{ij} = y_{ij} - \hat m_I(x_{ij})$, where $\hat m_I(x_{ij})$ is a local linear estimate of $m(\cdot)$ based on model (1), pretending that the random errors $\epsilon_{ij}$ are independent. As shown in Lin and Carroll (2000), $\hat m_I(x)$ under the working independence structure is a root-$n$ consistent estimate of $m(x)$. Replacing the $\epsilon_{ij}$'s in (2) with $\hat\epsilon_{ij}$'s, we have
$$ y_{ij} = m(x_{ij}) + \phi_{j,1}\hat\epsilon_{i,1} + \cdots + \phi_{j,j-1}\hat\epsilon_{i,j-1} + e_{ij}, \qquad i = 1, \ldots, n,\ j = 2, \ldots, J. \qquad (3) $$
Let $Y = (y_{12}, \ldots, y_{1J}, \ldots, y_{nJ})^T$, $X = (x_{12}, \ldots, x_{1J}, \ldots, x_{nJ})^T$, $\phi = (\phi_{2,1}, \ldots, \phi_{J,J-1})^T$, $e = (e_{12}, \ldots, e_{nJ})^T$, and
$$ \hat F_{ij} = \big\{0^T_{(j-2)(j-1)/2},\ \hat\epsilon_{i,1}, \ldots, \hat\epsilon_{i,j-1},\ 0^T_{(J-1)J/2 - (j-1)j/2}\big\}^T, $$
where $0_k$ denotes the $k$-dimensional column vector with all entries 0.
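To make the Cholesky step concrete, the following is a minimal numerical sketch (in Python with NumPy; the function name and the example matrix are ours, not from the paper) of how a given covariance matrix $\Sigma$ yields the unit-lower-triangular $\Phi$ and the diagonal $D$ used above.

```python
import numpy as np

def cholesky_decorrelation(Sigma):
    """Given Sigma = cov(eps_i | x_i), return (Phi, d2) where Phi is lower
    triangular with unit diagonal, Phi @ Sigma @ Phi.T is diagonal, and d2
    holds the diagonal entries var(e_ij) = d_j^2."""
    L = np.linalg.cholesky(Sigma)     # Sigma = L L^T, L lower triangular
    d = np.diag(L)                    # d_1, ..., d_J
    T = L / d                         # unit lower triangular: Sigma = T diag(d^2) T^T
    Phi = np.linalg.inv(T)            # Phi Sigma Phi^T = diag(d^2)
    return Phi, d**2

# small check on an AR(1)-type covariance (illustrative values)
Sigma = np.array([[1.0, 0.6, 0.36],
                  [0.6, 1.0, 0.6],
                  [0.36, 0.6, 1.0]])
Phi, d2 = cholesky_decorrelation(Sigma)
print(np.round(Phi @ Sigma @ Phi.T, 8))   # diagonal matrix diag(d2)
# phi_{j,l} in the paper's notation is the negative of Phi[j, l] for j > l
```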

Then we can rewrite model (3) in the following matrix format:
$$ Y = m(X) + \hat F_a \phi + e, \qquad (4) $$
where $m(X) = \{m(x_{12}), \ldots, m(x_{1J}), \ldots, m(x_{nJ})\}^T$ and $\hat F_a = (\hat F_{12}, \ldots, \hat F_{1J}, \ldots, \hat F_{nJ})^T$. Let $Y^* = Y - \hat F_a \phi$. Then
$$ Y^* = m(X) + e. \qquad (5) $$
Note that the $e_{ij}$'s in $e$ are uncorrelated. Therefore, if $\Sigma$, and thus $\phi$, were known, we could use the Cholesky decomposition to transform the correlated-data model (1) into the uncorrelated-data model (5) with the new response $Y^*$. For the partially linear model (4), various estimation methods have been proposed. In this paper we employ the profile least squares technique (Fan and Li, 2004) to estimate $\phi$ and $m(\cdot)$ in (4).

2.1 Profile least squares estimate

Noting that (5) is a one-dimensional nonparametric model, given $\phi$ one may employ existing linear smoothers, such as local polynomial regression (Fan and Gijbels, 1996) or smoothing splines (Gu, 2002), to estimate $m(x)$. Here we employ local linear regression. Let
$$ A_{x_0} = \begin{pmatrix} 1 & \cdots & 1 & \cdots & 1 \\ x_{12} - x_0 & \cdots & x_{1J} - x_0 & \cdots & x_{nJ} - x_0 \end{pmatrix}^T $$
and
$$ W_{x_0} = \mathrm{diag}\{K_h(x_{12} - x_0)/\hat d_2^2, \ldots, K_h(x_{1J} - x_0)/\hat d_J^2, \ldots, K_h(x_{nJ} - x_0)/\hat d_J^2\}, $$
where $K_h(t) = h^{-1}K(t/h)$, $K(\cdot)$ is a kernel function, $h$ is the bandwidth, and $\hat d_j$ is any consistent estimate of $d_j$, the standard deviation of $e_j$. Denote by $\hat m(x_0)$ the local linear regression estimate of $m(x_0)$. Then
$$ \hat m(x_0) = \hat\beta_0 = [1, 0](A_{x_0}^T W_{x_0} A_{x_0})^{-1} A_{x_0}^T W_{x_0} Y^*. $$
Note that $\hat m(x_0)$ is a linear function of $Y^*$. Let $S_h(x_0) = [1, 0](A_{x_0}^T W_{x_0} A_{x_0})^{-1} A_{x_0}^T W_{x_0}$.
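For illustration, here is a minimal sketch of the weighted local linear estimator $\hat m(x_0) = [1, 0](A_{x_0}^T W_{x_0} A_{x_0})^{-1} A_{x_0}^T W_{x_0} Y^*$, assuming an Epanechnikov kernel; the function names and the kernel choice are ours, not prescribed by the paper.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (any bounded symmetric density would do)."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def local_linear(x, y, x0, h, weights=None):
    """Weighted local linear estimate [1,0](A^T W A)^{-1} A^T W y at x0.
    `weights` plays the role of 1/d_j^2 in W_{x0}; default is 1."""
    if weights is None:
        weights = np.ones_like(y)
    A = np.column_stack([np.ones_like(x), x - x0])   # design matrix A_{x0}
    w = epanechnikov((x - x0) / h) / h * weights      # diagonal of W_{x0}
    AtW = A.T * w                                     # A^T W
    beta = np.linalg.solve(AtW @ A, AtW @ y)          # (beta_0, beta_1)
    return beta[0]                                    # m_hat(x0)
```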

Then $\hat m(X)$ can be represented as $\hat m(X) = S_h(X) Y^*$, where $S_h(X)$ is a $(J-1)n \times (J-1)n$ smoothing matrix that depends only on $X$ and the bandwidth $h$. Substituting $m(X)$ in (5) by $\hat m(X)$, we obtain the linear regression model
$$ \{I - S_h(X)\}Y = \{I - S_h(X)\}\hat F_a \phi + e, $$
where $I$ is the identity matrix. Let $\hat G = \mathrm{diag}(\hat d_2^2, \ldots, \hat d_J^2, \ldots, \hat d_2^2, \ldots, \hat d_J^2)$. Then the profile least squares estimator of $\phi$ is
$$ \hat\phi_p = \big[\hat F_a^T\{I - S_h(X)\}^T \hat G^{-1}\{I - S_h(X)\}\hat F_a\big]^{-1}\hat F_a^T\{I - S_h(X)\}^T \hat G^{-1}\{I - S_h(X)\}Y. \qquad (6) $$
Let $\hat Y = Y - \hat F_a\hat\phi_p$; then
$$ \hat Y = m(X) + e, \qquad (7) $$
and the $e_{ij}$'s are uncorrelated. Note that when we estimate the regression function $m(x)$ we can also include the observations from the first time point. Therefore, for simplicity of notation, when estimating $m(x)$ we assume that $\hat Y$ consists of all observations, with $\hat y_{i1} = y_{i1}$; similar changes apply to all other notation when estimating $m(x)$ in (7). Since the $e_{ij}$'s in (7) are uncorrelated, we can use the conventional local linear regression estimator
$$ (\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1}(\hat Y - A_{x_0}\beta)^T W_{x_0}(\hat Y - A_{x_0}\beta), \qquad \beta = (\beta_0, \beta_1)^T. $$
Then the local linear estimate of $m(x_0)$ is $\hat m(x_0, \hat\phi_p) = \hat\beta_0$.
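The profile least squares step (6) can be sketched as follows (Python; a rough illustration under our own naming, not the authors' code): build the rows of $S_h(X)$, form $I - S_h(X)$, and solve the weighted normal equations for $\hat\phi_p$.

```python
import numpy as np

kern = lambda u: 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)   # Epanechnikov kernel

def smoother_matrix(x, h, weights):
    """Rows of the local linear smoothing matrix S_h(X): m_hat(x_k) = S[k, :] @ y."""
    n = len(x)
    S = np.empty((n, n))
    for k in range(n):
        A = np.column_stack([np.ones(n), x - x[k]])
        w = kern((x - x[k]) / h) / h * weights
        AtW = A.T * w
        S[k, :] = np.linalg.solve(AtW @ A, AtW)[0, :]   # [1,0](A^T W A)^{-1} A^T W
    return S

def profile_ls_phi(y, x, F_hat, h, d2):
    """Profile least squares estimate of phi in Y = m(X) + F_hat phi + e, cf. (6)."""
    S = smoother_matrix(x, h, 1.0 / d2)
    R = np.eye(len(y)) - S                 # I - S_h(X)
    G_inv = np.diag(1.0 / d2)              # Ghat^{-1}
    A = F_hat.T @ R.T @ G_inv @ R @ F_hat
    b = F_hat.T @ R.T @ G_inv @ R @ y
    return np.linalg.solve(A, b)
```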

Bandwidth selection. To implement the newly proposed estimation procedure, we need to specify bandwidths. We use local linear regression with the working independence correlation matrix to estimate $\hat m_I(\cdot)$, applying the plug-in bandwidth selector (Ruppert, Sheather, and Wand, 1995) for this estimation. We then calculate $\hat\epsilon_{ij} = y_{ij} - \hat m_I(x_{ij})$ and compute the difference-based estimate of $\phi$ (Fan and Li, 2004), denoted by $\hat\phi_{dbe}$. Using $\hat\phi_{dbe}$ in (5), we select a bandwidth for the proposed profile least squares estimator, again using the plug-in bandwidth selector.

2.2 Theoretic comparison

The following notation is used in the asymptotic results below. Let $F_i = (F_{i1}, \ldots, F_{iJ})^T$, where $F_{ij} = \{0^T_{(j-2)(j-1)/2}, \epsilon_{i,1}, \ldots, \epsilon_{i,j-1}, 0^T_{(J-1)J/2 - (j-1)j/2}\}^T$, and let $\mu_j = \int t^j K(t)\,dt$ and $\nu_0 = \int K^2(t)\,dt$. Denote by $f_j(x)$ the marginal density of $X_j$. The asymptotic results for the profile least squares estimators $\hat\phi_p$ and $\hat m(x_0, \hat\phi_p)$ are given in the following theorem, whose proof can be found in the Appendix.

Theorem 2.1. Suppose the regularity conditions A1–A6 in the Appendix hold and $\mathrm{cov}(\epsilon_i \mid x_i) = \Sigma$. Then:

(a) The asymptotic distribution of $\hat\phi_p$ in (6) is given by
$$ \sqrt{n}(\hat\phi_p - \phi) \to N(0, V^{-1}), \qquad \text{where } V = \sum_{j=2}^{J} E(F_j F_j^T)/d_j^2 $$
and $\mathrm{var}(e_j) = d_j^2$.

(b) Conditioning on $\{x_{11}, \ldots, x_{nJ}\}$, the asymptotic distribution of $\hat m(x_0, \hat\phi_p)$ is given by
$$ \sqrt{Nh}\Big\{\hat m(x_0, \hat\phi_p) - m(x_0) - \tfrac{1}{2}\mu_2 m''(x_0)h^2\Big\} \to N\Big(0, \frac{\nu_0}{\tau(x_0)}\Big), $$
where $N = nJ$ and $\tau(x_0) = J^{-1}\sum_{j=1}^{J} f_j(x_0)/d_j^2$.

Under the same assumptions as Theorem 2.1, the asymptotic variance of the local linear estimate with the working independence correlation structure (Lin and Carroll, 2000) is
$$ (Nh)^{-1}\nu_0\Big\{J^{-1}\sum_{j=1}^{J} f_j(x_0)/\sigma_j^2\Big\}^{-1}, $$
where $\mathrm{var}(\epsilon_j) = \sigma_j^2$. By the properties of the Cholesky decomposition, $\sigma_1^2 = d_1^2$ and $\sigma_j^2 \ge d_j^2$, $j = 2, \ldots, J$, with equality only when $\mathrm{cov}(\epsilon \mid x) = \Sigma$ is a diagonal matrix. Note that $\hat m(x_0, \hat\phi_p)$ has the same asymptotic bias as the working independence estimate of $m(x_0)$ (Lin and Carroll, 2000). Therefore, if the within-subject observations are correlated (i.e., the covariance matrix $\Sigma$ is not diagonal), our proposed estimator $\hat m(x_0, \hat\phi_p)$ is asymptotically more efficient than the local linear estimator with the working independence correlation structure.

We next describe how to use the Cholesky decomposition in (7) for unbalanced longitudinal data, and we investigate the performance of the proposed procedure when a working covariance matrix is used to calculate $\hat Y$. We shall show that the resulting local linear estimator is consistent for any positive definite working covariance matrix, and that its asymptotic variance is minimized when the covariance structure is correctly specified.

For unbalanced longitudinal data, let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iJ_i})^T$ and $x_i = (x_{i1}, \ldots, x_{iJ_i})^T$, where $J_i$ is the number of observations for the $i$th subject or cluster. Denote $\mathrm{cov}(\epsilon_i \mid x_i) = \Sigma_i$, which is a $J_i \times J_i$ matrix and may depend on $x_i$.

Based on the Cholesky decomposition, there exists a lower triangular matrix $\Phi_i$ with unit diagonal such that $\mathrm{cov}(\Phi_i\epsilon_i) = \Phi_i\Sigma_i\Phi_i^T = D_i$, where $D_i$ is a diagonal matrix. Let $\phi^{(i)}_{j,l}$ be the negative of the $(j, l)$ element of $\Phi_i$. Similar to (3), we have, for $i = 1, \ldots, n$ and $j = 2, \ldots, J_i$,
$$ y_{i1} = m(x_{i1}) + e_{i1}, \qquad y_{ij} = m(x_{ij}) + \phi^{(i)}_{j,1}\hat\epsilon_{i,1} + \cdots + \phi^{(i)}_{j,j-1}\hat\epsilon_{i,j-1} + e_{ij}, $$
where $e_i = (e_{i1}, \ldots, e_{iJ_i})^T = \Phi_i\epsilon_i$. Since $D_i$ is a diagonal matrix, the $e_{ij}$'s are uncorrelated. Therefore, if $\Sigma_i$ were known, one could adapt the newly proposed procedure to unbalanced longitudinal data.

Following the idea of the generalized estimating equation (GEE, Liang and Zeger, 1986), we replace $\Sigma_i$ with a working covariance matrix, denoted by $\tilde\Sigma_i$, since the true covariance matrix is unknown in practice. A parametric working covariance matrix can be constructed as in GEE, and a semiparametric working covariance matrix may also be constructed following Fan, Huang, and Li (2007). Let $\tilde\Phi_i$ be the corresponding lower triangular matrix with unit diagonal such that $\tilde\Phi_i\tilde\Sigma_i\tilde\Phi_i^T = \tilde D_i$, where $\tilde D_i$ is a diagonal matrix, and let $\tilde\phi^{(i)}_{j,l}$ be the negative of the $(j, l)$ element of $\tilde\Phi_i$. Let $\tilde y_{i1} = y_{i1}$ and $\tilde y_{ij} = y_{ij} - \tilde\phi^{(i)}_{j,1}\hat\epsilon_{i,1} - \cdots - \tilde\phi^{(i)}_{j,j-1}\hat\epsilon_{i,j-1}$. Then our proposed new local linear estimate $\tilde m(x_0) = \tilde\beta_0$ is the minimizer of the following weighted least squares criterion:
$$ (\tilde\beta_0, \tilde\beta_1) = \arg\min_{\beta_0, \beta_1}\sum_{i=1}^{n}\sum_{j=1}^{J_i} K_h(x_{ij} - x_0)\,\tilde d_{ij}^{-2}\,\{\tilde y_{ij} - \beta_0 - \beta_1(x_{ij} - x_0)\}^2, \qquad (8) $$
where $\tilde d_{ij}^2$ is the $j$th diagonal element of $\tilde D_i$. The asymptotic behavior of $\tilde m(x_0)$ is given in Theorem 2.2.
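Below is a sketch of the unbalanced-data estimator (8), assuming each subject supplies its own working covariance matrix $\tilde\Sigma_i$ and precomputed working-independence residuals $\hat\epsilon_{ij}$; all function and variable names are ours.

```python
import numpy as np

kern = lambda u: 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)   # Epanechnikov kernel

def decorrelate_subject(y_i, eps_hat_i, Sigma_tilde_i):
    """Transform one subject's responses with a working covariance matrix:
    returns (y_tilde_i, d2_tilde_i) as defined around equation (8)."""
    L = np.linalg.cholesky(Sigma_tilde_i)
    d = np.diag(L)
    Phi = np.linalg.inv(L / d)          # unit lower triangular, Phi Sigma Phi^T diagonal
    # y_tilde_ij = y_ij - sum_{l<j} phi_{j,l} eps_hat_{i,l}; phi_{j,l} = -Phi[j, l]
    y_tilde = y_i + (Phi - np.eye(len(y_i))) @ eps_hat_i
    return y_tilde, d**2

def m_tilde(x0, h, x_list, y_list, eps_list, Sigma_list):
    """Pooled weighted local linear fit (8) across subjects."""
    xs, ys, w = [], [], []
    for x_i, y_i, e_i, S_i in zip(x_list, y_list, eps_list, Sigma_list):
        y_t, d2 = decorrelate_subject(y_i, e_i, S_i)
        xs.append(x_i); ys.append(y_t); w.append(1.0 / d2)
    x, y, w = map(np.concatenate, (xs, ys, w))
    A = np.column_stack([np.ones_like(x), x - x0])
    kw = kern((x - x0) / h) / h * w
    beta = np.linalg.solve((A.T * kw) @ A, (A.T * kw) @ y)
    return beta[0]
```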

Similar to Lin and Carroll (2000) and Wang (2003), we assume that $J_i = J < \infty$ in order to simplify the presentation of the asymptotic results. Let $\phi_i = (\phi^{(i)}_{2,1}, \ldots, \phi^{(i)}_{J,J-1})^T$ and $\tilde\phi_i = (\tilde\phi^{(i)}_{2,1}, \ldots, \tilde\phi^{(i)}_{J,J-1})^T$.

Theorem 2.2. Suppose the regularity conditions A1–A6 in the Appendix hold and $\mathrm{cov}(\epsilon_i \mid x_i) = \Sigma_i$. Let $\tilde m(x_0)$ be the solution of (8) using the working covariance matrix $\tilde\Sigma_i$.

(a) The asymptotic bias of $\tilde m(x_0)$ is given by
$$ \mathrm{bias}\{\tilde m(x_0)\} = \tfrac{1}{2}\mu_2 m''(x_0)h^2\{1 + o_p(1)\}, $$
and the asymptotic variance is given by
$$ \mathrm{var}\{\tilde m(x_0)\} = (Nh)^{-1}\frac{\nu_0\,\gamma(x_0)}{\tilde\tau^2(x_0)}\{1 + o_p(1)\}, $$
where
$$ \tilde\tau(x_0) = J^{-1}\sum_{j=1}^{J} f_j(x_0)\,E(\tilde d_j^{-2} \mid X_j = x_0), \qquad \gamma(x_0) = J^{-1}\sum_{j=1}^{J} f_j(x_0)\,E\{(c_j^2 + d_j^2)\,\tilde d_j^{-4} \mid X_j = x_0\}, $$
and $c_j^2$ is the $j$th diagonal element of $\mathrm{cov}\{F(\tilde\phi - \phi) \mid X\}$.

(b) The asymptotic variance of $\tilde m(x_0)$ is minimized only when $\tilde\Sigma_i = k\Sigma_i$ is correctly specified for a positive constant $k$, in which case it simplifies to
$$ \mathrm{var}\{\tilde m(x_0)\} = (Nh)^{-1}\nu_0\Big\{J^{-1}\sum_{j=1}^{J} f_j(x_0)\,E(d_j^{-2} \mid X_j = x_0)\Big\}^{-1}. $$
For balanced longitudinal data, if $\Sigma_i = \Sigma$ for all $i$ and $\Sigma$ does not depend on $X$, then
$$ \mathrm{var}\{\tilde m(x_0)\} = (Nh)^{-1}\nu_0\Big\{J^{-1}\sum_{j=1}^{J} f_j(x_0)/d_j^2\Big\}^{-1}. $$

Theorem 2.2 (a) implies that the leading term of the asymptotic bias does not depend on the working covariance matrix. This is expected, since the bias is caused by the approximation error of local linear regression. Theorem 2.2 (a) also implies that the resulting estimate is consistent for any positive definite working covariance matrix. Theorem 2.2 (b) implies that the asymptotic variance of $\tilde m(x_0)$ in (8) is minimized when the working correlation matrix equals the true correlation matrix. Comparing Theorem 2.1 (b) with Theorem 2.2 (b), one can see that the proposed profile least squares estimate $\hat m(x_0, \hat\phi_p)$ for balanced longitudinal data is asymptotically as efficient as if one knew the true covariance matrix.

It is of great interest to compare the asymptotic variance in Theorem 2.2 (b) with that of the marginal kernel method proposed in Wang (2003). We attempted a theoretical comparison, but found it very challenging and beyond the scope of this paper. We conjecture that the asymptotic variance of the marginal kernel method may be smaller than the one in Theorem 2.2 (b); however, the newly proposed procedure is easier to implement in practice. Further research is needed.

3 Simulation results and real data application

In this section, we conduct a Monte Carlo simulation to assess the performance of the proposed profile least squares estimator and illustrate the newly proposed procedure with an empirical analysis of a real data example.

Example 1. In this example, data $\{(x_{ij}, y_{ij}),\ i = 1, \ldots, n,\ j = 1, \ldots, 6\}$ were generated from the model
$$ y_{ij} = 2\sin(2\pi x_{ij}) + \epsilon_{ij}, $$
where $x_{ij} \sim U(0, 1)$ and $\epsilon_{ij} \sim N(0, 1)$. Let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{i6})^T$ and $x_i = (x_{i1}, \ldots, x_{i6})^T$. We considered the following three cases:

Case I: the $\epsilon_{ij}$'s are independent.

Case II: $\mathrm{cov}(\epsilon_{ij}, \epsilon_{ik}) = 0.6$ when $j \ne k$ and $1$ otherwise.

Case III: $\Sigma = \mathrm{cov}(\epsilon_i \mid x_i)$ has an AR(1) correlation structure with $\rho = 0.6$.
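A short data-generation sketch for Example 1 (Python; our own code, with the three covariance cases as described above):

```python
import numpy as np

def make_sigma(case, J=6, rho=0.6):
    """Error covariance for Example 1: independent, exchangeable, or AR(1)."""
    if case == "I":
        return np.eye(J)
    if case == "II":
        return rho * np.ones((J, J)) + (1 - rho) * np.eye(J)   # exchangeable, corr 0.6
    if case == "III":
        idx = np.arange(J)
        return rho ** np.abs(idx[:, None] - idx[None, :])       # AR(1), rho = 0.6
    raise ValueError(case)

def simulate(n, case, J=6, seed=0):
    """Generate {(x_ij, y_ij)} from y_ij = 2 sin(2 pi x_ij) + eps_ij."""
    rng = np.random.default_rng(seed)
    Sigma = make_sigma(case, J)
    x = rng.uniform(0.0, 1.0, size=(n, J))
    eps = rng.multivariate_normal(np.zeros(J), Sigma, size=n)
    y = 2.0 * np.sin(2.0 * np.pi * x) + eps
    return x, y

x, y = simulate(n=50, case="III")
```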

To investigate the effect of errors in estimating the covariance matrix, we compared the proposed profile least squares procedure with the oracle estimator that uses the true covariance matrix. The oracle estimator serves as a benchmark for the comparison. In this example, we also compared the newly proposed procedure with local linear regression using the working independence correlation structure. We compared the performance of the different methods by the root of average squared errors (RASE), defined by
$$ \mathrm{RASE}\{\hat m(\cdot)\} = \Big[n_{\mathrm{grid}}^{-1}\sum_{j=1}^{n_{\mathrm{grid}}}\{\hat m(u_j) - m(u_j)\}^2\Big]^{1/2}, $$
where the $u_j$'s are the $n_{\mathrm{grid}}\,(= 200)$ grid points equally spaced from 0 to 1.

Simulation results are summarized in Table 1. In the table and in the discussion below, Independence stands for local linear regression with working independence, New for the newly proposed procedure, and Oracle for the oracle estimator. The mean and standard deviation of the RASEs over 500 simulations are reported in the left panel of Table 1, and the empirical relative efficiency in the right panel. Empirical relative efficiency means that, for example, $\mathrm{RE(New)} = E\{\mathrm{RASE}^2(\mathrm{Independence})\}/E\{\mathrm{RASE}^2(\mathrm{New})\}$, where $E\{\mathrm{RASE}^2(\mathrm{Independence})\}$ and $E\{\mathrm{RASE}^2(\mathrm{New})\}$ are estimated by the sample averages over the 500 simulations.

Table 1 shows that New and Oracle have smaller RASEs than Independence when the data are correlated (Cases II and III), and that the efficiency gain can be achieved even for moderate sample sizes. For independent data (Case I), New did not lose much efficiency for estimating the correlation structure compared with Independence. Furthermore, Table 1 shows that when the sample size is large, New performed as well as Oracle, which uses the true correlation structure. The simulation results confirm the theoretical findings in Section 2.

Example 2. In this example, we illustrate the proposed methodology with an empirical analysis of a data set collected from the website of the Pennsylvania-New Jersey-Maryland Interconnection (PJM), the largest regional transmission organization (RTO) in the U.S. electricity market.

Table 1: Simulation results of Example 1. The left columns report the mean (standard deviation) of RASE over 500 simulations; the right columns report empirical relative efficiencies.

Case   n     Independence     New              Oracle           RE(New)   RE(Oracle)
I      30    0.20 (0.064)     0.22 (0.068)     0.20 (0.062)     0.90      1.000
I      50    0.159 (0.047)    0.164 (0.047)    0.159 (0.047)    0.945     1.000
I      150   0.099 (0.026)    0.100 (0.026)    0.099 (0.026)    0.985     1.000
I      400   0.067 (0.016)    0.067 (0.016)    0.067 (0.016)    0.987     1.000
II     30    0.227 (0.081)    0.210 (0.078)    0.202 (0.077)    1.155     1.236
II     50    0.180 (0.057)    0.163 (0.057)    0.159 (0.056)    1.194     1.255
II     150   0.115 (0.033)    0.101 (0.033)    0.100 (0.033)    1.277     1.293
II     400   0.074 (0.018)    0.063 (0.018)    0.063 (0.018)    1.362     1.367
III    30    0.215 (0.070)    0.203 (0.067)    0.194 (0.065)    1.127     1.224
III    50    0.169 (0.053)    0.155 (0.048)    0.152 (0.048)    1.187     1.232
III    150   0.110 (0.030)    0.098 (0.028)    0.097 (0.028)    1.244     1.266
III    400   0.072 (0.018)    0.064 (0.016)    0.064 (0.016)    1.256     1.266

The data set includes hourly electricity price and electricity load in the Allegheny Power Service district from 8:00 am to 8:00 pm during the period from January 1, 2005 to January 31, 2005. We studied the effect of the electricity load on the electricity price. As an illustration, we treated day as the subject, and took the electricity price as the response variable and the electricity load as the predictor variable. Thus the sample size $n$ equals 31, and each subject has $J = 13$ observations. The scatter plot of the observations is depicted in Figure 1(b).

We first used local linear regression with the working independence covariance matrix to estimate the regression function. The plug-in bandwidth selector (Ruppert, Sheather, and Wand, 1995) yielded a bandwidth of 277. The dashed lines in Figure 1(b) are the resulting estimate along with its 95% pointwise confidence interval. Based on the resulting estimate, we obtained the residuals and estimated the correlation between $\epsilon_{i,j}$ and $\epsilon_{i,j+k}$ for $j = 1, \ldots, 12$ and $k \le 13 - j$. The estimated correlations are plotted in Figure 1(a), which shows that the within-subject correlation is moderate. Thus, our proposed method may produce a more accurate estimate than local linear regression that ignores the within-subject correlation. Next, we apply the newly proposed procedure to this data set.

[Figure 1: two panels; (a) estimated correlation versus lag k, (b) electricity price versus load with fitted curves.]

Figure 1: (a) Plot of the estimated correlation between $\epsilon_{i,j}$ and $\epsilon_{i,j+k}$ versus the lag $k$; for example, the dots at $k = 1$ correspond to the correlations between $\epsilon_{i,j}$ and $\epsilon_{i,j+1}$ for $j = 1, 2, \ldots, 12$. (b) Scatter plot of the observations and the fitted regression curves. The solid curve is the regression fitted by the proposed method, with the corresponding 95% pointwise confidence interval. The dash-dot curve is the local linear fit ignoring the within-subject correlation.

The bandwidth selected by the plug-in bandwidth selector equals 243. The solid curves in Figure 1(b) are the regression curve fitted by the newly proposed procedure, along with its 95% pointwise confidence interval. Figure 1(b) shows that the newly proposed procedure provided a narrower confidence interval than the fit ignoring the within-subject correlation, even though the number of subjects is only 31, although the difference between the two estimated regression curves is very small.

4 Concluding remark

We have developed a new local estimation procedure for regression functions of longitudinal data. The proposed procedure uses the Cholesky decomposition and profile least squares techniques to estimate the correlation structure and the regression function simultaneously. We demonstrate that the proposed estimator is asymptotically as efficient as an oracle estimator that uses the true covariance matrix to take the within-subject correlation into account. In this paper we focus on nonparametric regression models; the proposed methodology can easily be adapted to other regression models, such as additive models and varying coefficient models. Such extensions are of great interest for future research.

APPENDIX: PROOFS

Define $B = \hat F_a - F_a$. Since $G$ can be estimated at a parametric rate, we assume that $G$ is known in our proofs, without loss of generality. Our proofs use a strategy similar to that in Fan and Huang (2005). The following conditions are imposed to facilitate the proofs and are adopted from Fan and Huang (2005); they are not the weakest possible conditions.

A1. The random variable $x_{ij}$ has bounded support $\Omega$. Its density function $f_j(\cdot)$ is Lipschitz continuous and bounded away from 0 on its support. The $x_{ij}$'s are allowed to be correlated across different $j$'s.

A2. $m(\cdot)$ has a continuous second derivative in $x \in \Omega$.

A3. The kernel $K(\cdot)$ is a bounded symmetric density function with bounded support and satisfies the Lipschitz condition.

A4. $nh^8 \to 0$ and $nh^2/(\log n)^2 \to \infty$.

A5. There is an $s > 2$ such that $E\|F_j\|^s < \infty$ for all $j$, and there is some $\xi > 0$ such that $n^{1 - 2s^{-1} - 2\xi}h \to \infty$.

A6. $\sup_{x\in\Omega}|\hat m_I(x) - m(x)| = o_p(n^{-1/4})$, where $\hat m_I(x)$ is obtained by local linear regression pretending that the data are independent and identically distributed (i.i.d.).

The following lemma is taken from Lemma A.1 of Fan and Huang (2005) and will be used repeatedly in our proofs.

Lemma A.1. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent and identically distributed random vectors, where the $Y_i$ are scalar random variables. Assume further that $E|Y|^s < \infty$ and $\sup_{x\in[a,b]}\int|y|^s g(x, y)\,dy < \infty$ for some $s > 2$ and interval $[a, b]$, where $g$ denotes the joint density of $(X, Y)$. Let $K$ satisfy Condition A3. Then
$$ \sup_{x\in[a,b]}\Big|\frac{1}{n}\sum_{i=1}^{n}\big[K_h(X_i - x)Y_i - E\{K_h(X_i - x)Y_i\}\big]\Big| = O_p\Big[\Big\{\frac{\log(1/h)}{nh}\Big\}^{1/2}\Big], $$
provided that $h \to 0$ and $n^{1 - 2s^{-1} - 2\xi}h \to \infty$ for some $\xi > 0$.

Lemma A.2. Let $(x_{1j}, y_{1j}), \ldots, (x_{nj}, y_{nj})$ be independent and identically distributed random vectors for each $j = 1, \ldots, J$, where the $y_{ij}$ are scalar random variables. Assume further that $E|Y_j|^s < \infty$ and $\sup_{x\in[a,b]}\int|y|^s g_j(x, y)\,dy < \infty$ for some $s > 2$ and interval $[a, b]$, where $g_j(\cdot,\cdot)$ denotes the joint density of $(X_{ij}, Y_{ij})$. Let $K$ satisfy Condition A3. Then
$$ \sup_{x\in[a,b]}\Big|\frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\big[K_h(x_{ij} - x)y_{ij} - E\{K_h(x_{ij} - x)y_{ij}\}\big]\Big| = O_p\Big[\Big\{\frac{\log(1/h)}{nh}\Big\}^{1/2}\Big], $$
provided that $h \to 0$ and $n^{1 - 2s^{-1} - 2\xi}h \to \infty$ for some $\xi > 0$.

Proof: Based on Lemma A.1, we have
$$ \sup_{x\in[a,b]}\Big|\frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\big[K_h(x_{ij} - x)y_{ij} - E\{K_h(x_{ij} - x)y_{ij}\}\big]\Big| \le \frac{1}{J}\sum_{j=1}^{J}\sup_{x\in[a,b]}\Big|\frac{1}{n}\sum_{i=1}^{n}\big[K_h(x_{ij} - x)y_{ij} - E\{K_h(x_{ij} - x)y_{ij}\}\big]\Big| $$
$$ \le \frac{1}{J}\sum_{j=1}^{J}O_p\Big[\Big\{\frac{\log(1/h)}{nh}\Big\}^{1/2}\Big] = O_p\Big[\Big\{\frac{\log(1/h)}{nh}\Big\}^{1/2}\Big]. $$

Let
$$ A_x = \begin{pmatrix} 1 & \cdots & 1 \\ (x_{11} - x)/h & \cdots & (x_{nJ} - x)/h \end{pmatrix}^T, $$
$e = (e_{11}, \ldots, e_{nJ})^T$, and $m = (m(x_{11}), \ldots, m(x_{nJ}))^T$. For simplicity of notation, we first provide the asymptotic results including all the observations, and then consider the change (without using $\{(x_{i1}, y_{i1}),\ i = 1, \ldots, n\}$) when applying the results to $\phi$. We first prove some lemmas, which will be used in the proof of Theorem 2.1.

Lemma A.3. Under Conditions A1–A6, it follows that
$$ \frac{1}{nJ}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}F_a \xrightarrow{P} \tilde V, \qquad \text{where } \tilde V = J^{-1}\sum_{j=1}^{J}E(F_jF_j^T)/d_j^2. $$

Proof: Let
$$ W_x = \mathrm{diag}\{K_h(x_{11} - x)/d_1^2, \ldots, K_h(x_{nJ} - x)/d_J^2\}. $$

Then the smoothing matrix $S_h(X)$ of the local linear regression can be expressed as
$$ S_h(X) = \begin{pmatrix} [1, 0]\{A_{x_{11}}^T W_{x_{11}} A_{x_{11}}\}^{-1}A_{x_{11}}^T W_{x_{11}} \\ \vdots \\ [1, 0]\{A_{x_{nJ}}^T W_{x_{nJ}} A_{x_{nJ}}\}^{-1}A_{x_{nJ}}^T W_{x_{nJ}} \end{pmatrix}, $$
where $\frac{1}{nJ}A_x^T W_x A_x$ equals
$$ \begin{pmatrix} \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{K_h(x_{ij}-x)}{d_j^2} & \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{x_{ij}-x}{h}\frac{K_h(x_{ij}-x)}{d_j^2} \\ \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{x_{ij}-x}{h}\frac{K_h(x_{ij}-x)}{d_j^2} & \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{(x_{ij}-x)^2}{h^2}\frac{K_h(x_{ij}-x)}{d_j^2} \end{pmatrix}. $$
Let
$$ S_k = \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\Big(\frac{x_{ij}-x}{h}\Big)^k\frac{K_h(x_{ij}-x)}{d_j^2}. $$
Based on the result $X = E(X) + O_p(\sqrt{\mathrm{var}(X)})$ and Lemma A.2, we can show that if $k$ is even,
$$ S_k = \frac{1}{J}\sum_{j=1}^{J}\frac{1}{d_j^2}\int t^k K(t)f_j(x + ht)\,dt + O_p\big(\sqrt{\log(1/h)/(nh)}\big) = \frac{1}{J}\sum_{j=1}^{J}\frac{f_j(x)}{d_j^2}\mu_k + O_p\big(h^2 + \sqrt{\log(1/h)/(nh)}\big) $$
holds uniformly in $x$. Let $\tau(x) = J^{-1}\sum_{j=1}^{J}f_j(x)/d_j^2$. Because of the symmetry of the kernel function, $\mu_k = 0$ for any odd $k$, and then $S_k = O_p\big(h + \sqrt{\log(1/h)/(nh)}\big)$ holds uniformly in $x$.

Therefore,
$$ \frac{1}{nJ}A_x^T W_x A_x = \begin{pmatrix} \tau(x) + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} & O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} & \tau(x)\mu_2 + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix} $$
and
$$ \Big\{\frac{1}{nJ}A_x^T W_x A_x\Big\}^{-1} = \begin{pmatrix} \tau(x)^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} & O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} & \{\tau(x)\mu_2\}^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix} \qquad (9) $$
hold uniformly in $x$. Similarly, based on Lemma A.2 and the assumed independence between $\epsilon_{ij}$ and $x_{ij}$, it follows that
$$ \frac{1}{nJ}A_x^T W_x F_a = \begin{pmatrix} O_p\big\{\sqrt{\log(1/h)/(nh)}\big\}\,\mathbf{1}^T_{(J-1)J/2} \\ O_p\big\{\sqrt{\log(1/h)/(nh)}\big\}\,\mathbf{1}^T_{(J-1)J/2} \end{pmatrix}, $$
where $\mathbf{1}_k$ is the $k$-dimensional column vector with all elements 1. Consequently,
$$ [1, 0]\Big\{\frac{1}{nJ}A_x^T W_x A_x\Big\}^{-1}\Big\{\frac{1}{nJ}A_x^T W_x F_a\Big\} = o_p(1)\,\mathbf{1}^T_{(J-1)J/2}. $$
Substituting this result into the smoothing matrix $S_h(X)$, we have
$$ S_h(X)F_a = \begin{pmatrix} [1, 0]\{A_{x_{11}}^T W_{x_{11}} A_{x_{11}}\}^{-1}A_{x_{11}}^T W_{x_{11}} F_a \\ \vdots \\ [1, 0]\{A_{x_{nJ}}^T W_{x_{nJ}} A_{x_{nJ}}\}^{-1}A_{x_{nJ}}^T W_{x_{nJ}} F_a \end{pmatrix} = o_p(1)\,\mathbf{1}_{N\times(J-1)J/2}, $$
where $N = nJ$ and $\mathbf{1}_{a\times b}$ is the $a \times b$ matrix with all elements one. Therefore,
$$ F_a - S_h(X)F_a = F_a\{1 + o_p(1)\}. $$

Finally, by the WLLN,
$$ \frac{1}{nJ}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}F_a = \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}F_{ij}F_{ij}^T/d_j^2\,\{1 + o_p(1)\}^2 \xrightarrow{P} \tilde V, $$
where $\tilde V = J^{-1}\sum_{j=1}^{J}E(F_jF_j^T)/d_j^2$.

Lemma A.4. Under Conditions A1–A6, we have
$$ \frac{1}{nJ}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}\hat F_a \xrightarrow{P} \tilde V. $$

Proof: Since $B = \hat F_a - F_a$, the generic element of $B$ is of the form $m(x_{ij}) - \hat m_I(x_{ij})$, which is of order $o_p(n^{-1/4})$ uniformly in $x$ by Condition A6. Thus, $B = o_p(n^{-1/4})$. Therefore,
$$ \frac{1}{nJ}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}\hat F_a = \frac{1}{nJ}(F_a + B)^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}(F_a + B). $$
By an argument similar to that in the proof of Lemma A.3, we can show that
$$ \frac{1}{nJ}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}B = o_p(1) \quad \text{and} \quad \frac{1}{nJ}B^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}B = o_p(1). $$
Therefore,
$$ \frac{1}{nJ}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}\hat F_a = \frac{1}{nJ}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}F_a + o_p(1), $$
and the result follows by Lemma A.3.

Lemma A.5. Suppose Conditions A1–A6 hold. It follows that
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m = o_p(1) $$
and
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}B\phi = o_p(1). $$

Proof: We first prove $\frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m = o_p(1)$. Similarly to the argument in the proof of Lemma A.3, we can show that
$$ \frac{1}{nJ}A_x^T W_x m = \begin{pmatrix} m(x)\tau(x) + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix}. $$
Based on this result and result (9), we have
$$ [1, 0]\Big\{\frac{1}{nJ}A_x^T W_x A_x\Big\}^{-1}\Big\{\frac{1}{nJ}A_x^T W_x m\Big\} = m(x)\Big[1 + O_p\big\{h + \sqrt{\log(1/h)/(nh)}\big\}\Big] $$
uniformly in $x \in \Omega$. Therefore,
$$ \{I - S_h(X)\}m = m - m\Big[1 + O_p\big\{h + \sqrt{\log(1/h)/(nh)}\big\}\Big] = o_p(1). $$
Note that $\{I - S_h(X)\}F_a = F_a(1 + o_p(1))$. Therefore, we have
$$ \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m = \frac{1}{\sqrt{N}}F_a^T G^{-1}(1 + o_p(1))o_p(1). $$
Note that $E(F_j) = 0$ and the covariance matrix of $\{F_1, \ldots, F_J\}$ is finite. Thus, $\frac{1}{\sqrt{N}}F_a^T G^{-1} = O_p(1)$.

We then have
$$ \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m = o_p(1). \qquad (10) $$
Since $\hat F_a = F_a + B$, we can break $N^{-1/2}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m$ into two terms: $N^{-1/2}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m$, which is $o_p(1)$ by (10), and $N^{-1/2}B^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m$, which is also $o_p(1)$ since $B = o_p(n^{-1/4})$. Therefore,
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m = o_p(1). $$
From the proof of Lemma A.4, we have $\{I - S_h(X)\}B\phi = o_p(1)$ and $N^{-1/2}\hat F_a^T\{I - S_h(X)\}^T = O_p(1)$. Therefore,
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}B\phi = o_p(1). $$

Lemma A.6. Under Conditions A1–A6, let $e = (e_{11}, \ldots, e_{nJ})^T$ and assume $\mathrm{var}(e_{ij}) = d_j^2$. Then
$$ \Big[\frac{1}{N}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}\hat F_a\Big]^{-1}\frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e \xrightarrow{L} N(0, \tilde V^{-1}). $$

Proof: From the proof of Lemma A.5, we have
$$ \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e = \frac{1}{\sqrt{N}}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{F_{ij}}{d_j^2}\Big[e_{ij} - [1, 0]\{A_{x_{ij}}^T W_{x_{ij}} A_{x_{ij}}\}^{-1}A_{x_{ij}}^T W_{x_{ij}}e\Big]\{1 + o_p(1)\}. \qquad (11) $$

By using Lemma A.2 on $\{x_{ij}, e_{ij}\}$ and Lemma A.3, we can show that
$$ [1, 0]\{A_x^T W_x A_x\}^{-1}A_x^T W_x e = [1, 0]\begin{pmatrix} \tau(x)^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} & O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} & \{\tau(x)\mu_2\}^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix}\begin{pmatrix} O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix} = o_p(1). \qquad (12) $$
Then $e_{ij} - [1, 0]\{A_{x_{ij}}^T W_{x_{ij}} A_{x_{ij}}\}^{-1}A_{x_{ij}}^T W_{x_{ij}}e = e_{ij}\{1 + o_p(1)\}$. Plugging this into (11), we obtain
$$ \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e = \frac{1}{\sqrt{N}}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{F_{ij}e_{ij}}{d_j^2}\{1 + o_p(1)\}. $$
Note that $E(F_{ij}e_{ij}) = 0$, $\mathrm{var}(F_{ij}e_{ij}) < \infty$, and $E(F_{ij}e_{ij}F_{i'j'}^Te_{i'j'}) = 0$ when $i \ne i'$ or $j \ne j'$, since $e_{ij}$ is independent of $F_{ij}$. Based on the central limit theorem for $n^{-1/2}\sum_{i=1}^{n}(F_{i1}e_{i1}/d_1^2, \ldots, F_{iJ}e_{iJ}/d_J^2)^T$, we have
$$ \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e \xrightarrow{L} N(0, \tilde V). $$
By Lemma A.3, $\frac{1}{N}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}F_a \xrightarrow{P} \tilde V$. Applying Slutsky's theorem, we have
$$ \Big[\frac{1}{N}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}F_a\Big]^{-1}\frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e \xrightarrow{L} N(0, \tilde V^{-1}). $$
Since $\hat F_a = F_a + B$, we may write
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e = \frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e + \frac{1}{\sqrt{N}}B^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e. $$

Noting that $B = o_p(n^{-1/4})$ by Condition A6, it can be shown that
$$ \frac{1}{\sqrt{N}}B^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e = o_p(1). $$
Furthermore, we have shown that $\frac{1}{\sqrt{N}}F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e \xrightarrow{L} N(0, \tilde V)$. Thus,
$$ \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e \xrightarrow{L} N(0, \tilde V) $$
as well. The proof is completed by Lemma A.4 and Slutsky's theorem.

Proof of Theorem 2.1: Let us first show the asymptotic normality of $\hat\phi_p$. According to the expression for $\hat\phi_p$ in (6), we can break $\sqrt{N}(\hat\phi_p - \phi)$ into the sum of the following three terms $A$, $B$, and $C$:
$$ A = \Big[\frac{1}{N}\hat F_a^T(I - S_h(X))^T G^{-1}(I - S_h(X))\hat F_a\Big]^{-1}\frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m, $$
$$ B = -\Big[\frac{1}{N}\hat F_a^T(I - S_h(X))^T G^{-1}(I - S_h(X))\hat F_a\Big]^{-1}\frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}B\phi, $$
$$ C = \Big[\frac{1}{N}\hat F_a^T(I - S_h(X))^T G^{-1}(I - S_h(X))\hat F_a\Big]^{-1}\frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}e. $$
Term $A$ is the product of the two factors
$$ \Big[\frac{1}{N}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}\hat F_a\Big]^{-1} \quad \text{and} \quad \frac{1}{\sqrt{N}}\hat F_a^T\{I - S_h(X)\}^T G^{-1}\{I - S_h(X)\}m. $$

From Lemmas A.4 and A.5, the asymptotic properties of these two factors lead to the conclusion that $A = o_p(1)$. Similarly, applying Lemmas A.4 and A.5 to the two factors of term $B$ yields $B = o_p(1)$ as well. In addition, Lemma A.6 states that term $C$ converges to $N(0, \tilde V^{-1})$. Noting that $\hat\phi_p$ does not use the observations from the first time point, we should replace $J$ by $J - 1$ for $\hat\phi_p$. Putting $A$, $B$, and $C$ together, we obtain the asymptotic distribution of $\hat\phi_p$.

Next we derive the asymptotic bias and variance of $\hat m(\cdot)$. Note that
$$ \hat m(x_0, \hat\phi_p) = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}(m + e + F_a\phi - \hat F_a\hat\phi_p) = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}(m + e)\{1 + o_p(1)\}. $$
Note that $E(e \mid X) = 0$. Therefore,
$$ \mathrm{bias}\{\hat m(x_0, \hat\phi_p) \mid X\} = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}m\{1 + o_p(1)\} - m(x_0) = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}\big\{m - A_{x_0}[m(x_0), hm'(x_0)]^T\big\}\{1 + o_p(1)\}. $$
Similar to the arguments in Fan and Gijbels (1996, Section 3.7), we can prove that the asymptotic bias is $\frac{1}{2}m''(x_0)h^2\mu_2$. In addition, note that
$$ [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0} = \frac{1}{N\tau(x_0)}\big[K_h(x_{11} - x_0)/d_1^2, \ldots, K_h(x_{nJ} - x_0)/d_J^2\big]\{1 + o_p(1)\}. $$
Therefore,
$$ \mathrm{var}\{\hat m(x_0, \hat\phi_p) \mid X\} = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}\mathrm{cov}(e)W_{x_0}A_{x_0}\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}[1, 0]^T\{1 + o_p(1)\} = \frac{1}{Nh\,\tau(x_0)}\int K^2(x)\,dx\,\{1 + o_p(1)\}. $$
As to the asymptotic normality,
$$ \hat m(x_0, \hat\phi_p) - E\{\hat m(x_0, \hat\phi_p) \mid X\} = [1, 0]\{A_{x_0}^T W_{x_0} A_{x_0}\}^{-1}A_{x_0}^T W_{x_0}e\{1 + o_p(1)\}. $$

Thus, conditioning on $X$, the asymptotic normality can be established using the CLT, since for given $j$ the $e_{ij}$'s are independently and identically distributed with mean zero and variance $d_j^2$.

Proof of Theorem 2.2: 1.) Let $\tilde Y = (\tilde y_{11}, \ldots, \tilde y_{nJ})^T$ and
$$ \tilde W_x = \mathrm{diag}\{K_h(x_{11} - x)/\tilde d_{11}^2, \ldots, K_h(x_{nJ} - x)/\tilde d_{nJ}^2\}, $$
where $K_h(t) = h^{-1}K(t/h)$, $K(\cdot)$ is a kernel function, and $h$ is the bandwidth. Note that
$$ \tilde m(x_0) = [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0}(m + e + H), $$
where $H = (F_1\phi_1 - \hat F_1\tilde\phi_1, \ldots, F_n\phi_n - \hat F_n\tilde\phi_n)^T$ and $\hat F_i = (\hat F_{i1}, \ldots, \hat F_{iJ})^T$. Note that $\frac{1}{nJ}A_x^T\tilde W_xA_x$ equals
$$ \begin{pmatrix} \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{K_h(x_{ij}-x)}{\tilde d_{ij}^2} & \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{x_{ij}-x}{h}\frac{K_h(x_{ij}-x)}{\tilde d_{ij}^2} \\ \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{x_{ij}-x}{h}\frac{K_h(x_{ij}-x)}{\tilde d_{ij}^2} & \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{(x_{ij}-x)^2}{h^2}\frac{K_h(x_{ij}-x)}{\tilde d_{ij}^2} \end{pmatrix}. $$
Let $\tilde\tau(x) = J^{-1}\sum_{j=1}^{J}f_j(x)E(\tilde d_j^{-2} \mid X_j = x)$.

Similar to the proof of Lemma A.3, we have that
$$ \frac{1}{nJ}A_x^T\tilde W_xA_x = \begin{pmatrix} \tilde\tau(x) + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} & O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} & \tilde\tau(x)\mu_2 + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix} $$
and
$$ \Big\{\frac{1}{nJ}A_x^T\tilde W_xA_x\Big\}^{-1} = \begin{pmatrix} \tilde\tau(x)^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} & O_p\{h + \sqrt{\log(1/h)/(nh)}\} \\ O_p\{h + \sqrt{\log(1/h)/(nh)}\} & \{\tilde\tau(x)\mu_2\}^{-1} + O_p\{h^2 + \sqrt{\log(1/h)/(nh)}\} \end{pmatrix} $$
hold uniformly in $x$. Note that
$$ F_i\phi_i - \hat F_i\tilde\phi_i = (F_i - \hat F_i)\tilde\phi_i + F_i(\phi_i - \tilde\phi_i). \qquad (13) $$
The generic element of $F_i - \hat F_i$ is of the form $m(x_{ij}) - \hat m_I(x_{ij})$, which is of order $o_p(n^{-1/4})$ uniformly in $x$ by Condition A6; therefore, $F_i - \hat F_i = o_p(n^{-1/4})$. Note that $E(e \mid X) = 0$ and $E(F_i \mid X) = 0$. Thus,
$$ E\{\tilde m(x_0) \mid X\} = [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0}m\{1 + o_p(1)\}, $$
and
$$ \mathrm{bias}\{\tilde m(x_0) \mid X\} = [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0}m\{1 + o_p(1)\} - m(x_0) = [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0}\big\{m - A_{x_0}[m(x_0), hm'(x_0)]^T\big\}\{1 + o_p(1)\}. $$
Similar to the arguments in Fan and Gijbels (1996, Section 3.7), we can prove that the asymptotic bias is $\frac{1}{2}m''(x_0)h^2\mu_2$. In addition, note that
$$ [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0} = \frac{1}{N\tilde\tau(x_0)}\big[K_h(x_{11} - x_0)/\tilde d_{11}^2, \ldots, K_h(x_{nJ} - x_0)/\tilde d_{nJ}^2\big]\{1 + o_p(1)\}. $$

Let $c_j^2$ be the $j$th diagonal element of $\mathrm{cov}\{F(\tilde\phi - \phi) \mid X\}$. Based on (13), we have
$$ \mathrm{var}\{\tilde m(x_0) \mid X\} = [1, 0]\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}A_{x_0}^T\tilde W_{x_0}\mathrm{cov}(e + H \mid X)\tilde W_{x_0}A_{x_0}\{A_{x_0}^T\tilde W_{x_0}A_{x_0}\}^{-1}[1, 0]^T\{1 + o_p(1)\} = \frac{\nu_0}{Nh\,\tilde\tau^2(x_0)}\gamma(x_0)\{1 + o_p(1)\}, $$
where $\gamma(x_0) = J^{-1}\sum_{j=1}^{J}f_j(x_0)E\{(c_j^2 + d_j^2)\tilde d_j^{-4} \mid X_j = x_0\}$.

2.) When $\tilde\Sigma = \Sigma$, we have $c_j^2 = 0$ and $\tilde d_j^2 = d_j^2$. Hence
$$ \mathrm{var}\{\tilde m(x_0) \mid X\} = (Nh)^{-1}\nu_0\Big\{J^{-1}\sum_{j=1}^{J}f_j(x_0)E(d_j^{-2} \mid X_j = x_0)\Big\}^{-1}. $$
By noting that
$$ \gamma(x_0) \ge J^{-1}\sum_{j=1}^{J}f_j(x_0)E(d_j^2\tilde d_j^{-4} \mid X_j = x_0) \qquad (14) $$
and
$$ \Big\{\sum_{j=1}^{J}f_j(x_0)E(d_j^2\tilde d_j^{-4} \mid X_j = x_0)\Big\}\Big\{\sum_{j=1}^{J}f_j(x_0)E(d_j^{-2} \mid X_j = x_0)\Big\} \ge \Big\{\sum_{j=1}^{J}f_j(x_0)E(\tilde d_j^{-2} \mid X_j = x_0)\Big\}^2, \qquad (15) $$
we can obtain the result. For result (14), equality holds only when $\tilde\phi = \phi$. For the second inequality (15), based on the Cauchy-Schwarz inequality, equality holds only when the $\tilde d_j/d_j$ are all equal. Based on the Cholesky decomposition, $\tilde\phi = \phi$ and the $\tilde d_j/d_j$ are all equal only when $\tilde\Sigma = k\Sigma$, and thus $\tilde\Sigma_i = k\Sigma_i$, for some constant $k$.

References

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Fan, J. and Huang, T. (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli, 11, 1031-1057.

Fan, J., Huang, T., and Li, R. (2007). Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association, 102, 632-641.

Fan, J. and Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association, 99, 710-723.

Gu, C. (2002). Smoothing Spline ANOVA Models. Springer-Verlag, New York.

Hansen, L. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029-1054.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Lin, X. and Carroll, R. J. (2000). Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association, 95, 520-534.

Qu, A., Lindsay, B. G., and Li, B. (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika, 87, 823-836.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1257-1270.

Wang, N. (2003). Marginal nonparametric kernel regression accounting for within-subject correlation. Biometrika, 90, 43-52.