
DISCUSSION OF THE PAPER BY LIN AND YING

Xihong Lin and Raymond J. Carroll*

July 21, 2000

* Xihong Lin is Associate Professor, Department of Biostatistics, University of Michigan, Ann Arbor, MI. Her research was supported by a grant from the National Cancer Institute (CA 76404). Raymond J. Carroll (carroll@stat.tamu.edu) is Distinguished Professor, Departments of Statistics and Biostatistics & Epidemiology, Texas A&M University, College Station, TX. His research was supported by a grant from the National Cancer Institute (CA 57030), and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

Professors Lin and Ying are to be congratulated for an interesting paper on a challenging topic and for introducing survival analysis techniques to the analysis of longitudinal data. Their paper makes an interesting contribution to the developing literature on semiparametric and nonparametric regression in longitudinal data. Discussions are of course a means of bringing out points that authors are aware of but did not have space to address. In our discussion, we focus on two points.

• For computational convenience, Lin and Ying choose single nearest neighbor "smoothing" instead of standard smoothing techniques, i.e., they average over a neighborhood that contains but a single member. This choice permits simple computation and elegant techniques. However, their dismissal of standard smoothing techniques deserves some discussion. As we show below, their method can have near zero efficiency compared to alternative semiparametric methods, at least in the special cases we have examined.

• In their simulation, Lin and Ying find that their estimates of parametric components of semiparametric models have efficiency near one compared to parametric modeling techniques. They are very careful not to claim this as a general phenomenon, but we worry that most readers will not see how careful they really are. We indicate using semiparametric efficiency bounds that their estimators of the parametric components can have near zero efficiency compared to parametric methods, when the latter choose the correct model. We also indicate that their simulation results can be explained theoretically.

For simplicity and to satisfy space limitations, in our discussion we focus here on the case that (a) $X(t)$ is a scalar; (b) the number of observations per subject is bounded away from infinity; and (c) the number of subjects goes to infinity ($n \to \infty$). We will not repeat these assumptions in what follows.
1 Semiparametric Regression and the Choice of Singleton Nearest Neighbors

The Lin and Ying method has undeniable appeal in terms of computational simplicity, and we conjecture that it has good efficiency when the number of observations per subject is large. However, in many longitudinal studies, the number of observations per subject is small. For example, in the AIDS example considered in the paper, the number of observations per subject ranged from 1 to 18 with a median of 8. We now show that the computational simplicity achieved by using single nearest neighbor smoothing techniques as compared to standard smoothing techniques can lead to estimators with arbitrarily low efficiency compared to methods readily available in standard statistical packages, e.g., the gam function in Splus.

Consider the simple special case that (a) $X(t) = X$ is, as in their example, non-time varying (i.e., $X$ and $T$ are independent); (b) each individual has a single observation time $T$, which varies from individual to individual; and (c) these observation times have a continuous density function. Write $Y_i = Y(T_i)$ and $\epsilon_i = \epsilon(T_i)$. Then their model (1.1) is

$$Y_i = \alpha(T_i) + \beta X_i + \epsilon_i. \tag{1}$$

Assume for simplicity that $\mathrm{var}(\epsilon_i) = \sigma^2_\epsilon$, write the mean and variance of $X$ as $\mu_x$ and $\sigma^2_x$, respectively, and note that $E(X \mid T) = \mu_x$. Model (1) is often referred to as the partial linear model, and there is an enormous literature on it, including, among many others, Speckman (1988), Hastie & Tibshirani (1990) and Severini and Staniswalis (1994). Estimators of $\beta$ are available through the gam function in Splus. In particular, the latter authors show that the semiparametric efficient estimator is easily computed without iteration or backfitting as follows: (a) first regress $Y$ on $T$ to form an estimator $\hat{m}_y(t)$ using, e.g., kernel methods or spline methods; then (b) estimate $\beta$ by regressing $Y - \hat{m}_y(T)$ on $X - \bar{X}$, where $\bar{X}$ is the sample mean of the $X$'s. Severini and Staniswalis (1994) show that this estimate of $\beta$ is semiparametric efficient and has an asymptotic variance of

$$\mathrm{var}(\hat{\beta}) \approx n^{-1} \frac{\sigma^2_\epsilon}{\sigma^2_x},$$

while, as shown in the Appendix, Lin and Ying's estimator, $\hat{\beta}_{LY}$, has an asymptotic variance of

$$\mathrm{var}(\hat{\beta}_{LY}) \approx n^{-1} \frac{\mathrm{var}\{\alpha(T)\} + \sigma^2_\epsilon}{\sigma^2_x}. \tag{2}$$

The efficiency of the Lin and Ying estimator compared to the semiparametric efficient estimator is

$$\mathrm{Efficiency}(\hat{\beta}_{LY}; \hat{\beta}) = \frac{\sigma^2_\epsilon}{\mathrm{var}\{\alpha(T)\} + \sigma^2_\epsilon},$$

which can be arbitrarily small if the function $\alpha(T)$ varies considerably.
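A minimal simulation sketch of this efficiency comparison follows. Everything specific in it is an assumption made for illustration: $\alpha(t) = 3\sin(2\pi t)$, $T$ uniform on $(0, 1)$, $\sigma_x = \sigma_\epsilon = 1$, a Gaussian kernel with bandwidth 0.05, and a leave-one-out smoother in step (a). The Lin and Ying estimator is coded through the closed form it reduces to in this special case (see the Appendix), not through their general algorithm.

```python
import numpy as np


def loo_kernel_smooth(t, y, h):
    """Leave-one-out Nadaraya-Watson estimate of E(y | t) at each observed t."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)  # Gaussian kernel weights
    np.fill_diagonal(w, 0.0)  # leave each point out of its own fit
    return (w @ y) / w.sum(axis=1)


def beta_ly(x, y):
    """Closed form that the Lin-Ying estimator reduces to in this special case."""
    xc = x - x.mean()
    return np.sum(xc * (y - y.mean())) / np.sum(xc ** 2)


def beta_two_step(x, y, t, h):
    """Regress Y - m_hat_y(T) on X - X_bar, as in steps (a) and (b)."""
    xc = x - x.mean()
    return np.sum(xc * (y - loo_kernel_smooth(t, y, h))) / np.sum(xc ** 2)


def simulate(n=300, reps=300, amp=3.0, beta=1.0, h=0.05, seed=0):
    """Return Monte Carlo variances (LY, two-step) and the theoretical efficiency."""
    rng = np.random.default_rng(seed)
    est = np.empty((reps, 2))
    for r in range(reps):
        t = rng.uniform(0.0, 1.0, n)
        x = rng.normal(0.0, 1.0, n)      # X independent of T, sigma_x = 1
        eps = rng.normal(0.0, 1.0, n)    # sigma_eps = 1
        y = amp * np.sin(2 * np.pi * t) + beta * x + eps
        est[r] = beta_ly(x, y), beta_two_step(x, y, t, h)
    var_ly, var_ts = est.var(axis=0, ddof=1)
    var_alpha = amp ** 2 / 2.0           # var of amp*sin(2*pi*T) for T ~ U(0, 1)
    theory = 1.0 / (var_alpha + 1.0)     # sigma_eps^2 / (var{alpha(T)} + sigma_eps^2)
    return var_ly, var_ts, theory


var_ly, var_ts, theory = simulate()
print(f"empirical efficiency: {var_ts / var_ly:.3f}, theoretical: {theory:.3f}")
```

Increasing `amp` makes $\mathrm{var}\{\alpha(T)\}$ larger and should drive the empirical ratio toward zero, in line with the efficiency formula above. In finite samples the ratio will not match the asymptotic value exactly, since the smoother adds its own bias and variance.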
To gain insight into this result, we compare the estimating function corresponding to Lin and Ying's estimator with the semiparametric efficient score of $\beta$. One can easily show that the semiparametric efficient score of $\beta$ under model (1) is, suppressing the index $i$,

$$\{X - E(X \mid T)\}\{Y - X\beta - \alpha(T)\} / \sigma^2_\epsilon$$

(Severini and Staniswalis, 1994), which is the same as the estimating function $U_g(\beta)$ given in the equation above (2.7) when $g(T) = \alpha(T)$. This suggests that the efficiency of an estimator of $\beta$ depends heavily on the asymptotic choice of $g(T)$ and the method for estimating it. Nonparametric regression estimators of $\alpha(T)$ give an estimating function asymptotically equivalent to (correctly) assuming that $g(T) = \alpha(T)$. Lin and Ying's estimation of $\alpha(T)$ by the singleton nearest neighbor method via calculating $Y^*$, which is equal to $Y$ in this case, gives an estimating function asymptotically equivalent to (incorrectly) assuming that $g(T) = E\{\alpha(T)\}$, which can of course differ from $\alpha(T)$ dramatically if $\alpha(T)$ varies considerably.

It is possible to construct the semiparametric efficient score in our setting when there is more than one observation per individual (Lin and Carroll, 2000), but implementing this has not been done. However if, as do Lin and Ying, one ignores the correlations in the residual process $\epsilon(t)$, then the same basic estimator described above applies: one simply ignores the within-subject correlations, combines the $(Y, X, T)$ data into a "large" data set, and estimates $\alpha(t)$ using standard smoothing methods. This is in fact a GEE type estimator under working independence. The resulting estimator of $\beta$ is $\sqrt{n}$-consistent and its asymptotic variance is easy to work out (Lin and Carroll, 2000) and estimate. We expect that the GEE type estimator is generally more efficient, and often much more efficient, than that of Lin and Ying, since the above simple scenario with one observation per subject is a special case. More generally, we conjecture that replacing $Y^*(t)$ in Lin and Ying's (2.8) with a standard nonparametric regression estimator, while slightly more complex computationally, will lead to more efficient estimation, and sometimes nearly infinitely more efficient estimation. We would be interested in Professors Lin and Ying's comments on this issue.
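The pooled working-independence estimator can be sketched as follows. This is an illustration under stated assumptions rather than the estimator as implemented in Lin and Carroll (2000). It uses a Gaussian kernel smoother, and it removes a smooth in $T$ from both $Y$ and $X$ (a partial-residual form in the spirit of Speckman, 1988), which reduces to regressing on $X - \bar{X}$ when $X$ and $T$ are independent. The standard error is the usual cluster-robust sandwich form for working independence, assumed here and not taken from any formula in the papers cited. The names `kernel_smooth`, `pooled_beta` and `make_data`, and the data generating mechanism, are hypothetical.

```python
import numpy as np


def kernel_smooth(t, v, h):
    """Nadaraya-Watson estimate of E(v | t) at each pooled observation time."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return (w @ v) / w.sum(axis=1)


def pooled_beta(subject, t, x, y, h):
    """Working-independence estimate of beta with a cluster-robust standard error.

    subject: integer labels 0..n-1, one per pooled row.
    """
    x_res = x - kernel_smooth(t, x, h)
    y_res = y - kernel_smooth(t, y, h)
    sxx = np.sum(x_res ** 2)
    beta_hat = np.sum(x_res * y_res) / sxx
    # Sum score contributions within subject: within-subject correlation is
    # ignored in estimation but accounted for in the standard error.
    score = np.bincount(subject, weights=x_res * (y_res - beta_hat * x_res))
    se = np.sqrt(np.sum(score ** 2)) / sxx
    return beta_hat, se


def make_data(n=200, m=4, beta=1.0, seed=0):
    """Hypothetical longitudinal data: n subjects, m observations each."""
    rng = np.random.default_rng(seed)
    subject = np.repeat(np.arange(n), m)
    t = rng.uniform(0.0, 1.0, n * m)
    x = np.sin(np.pi * t) + rng.normal(0.0, 1.0, n * m)   # time-varying covariate
    b = np.repeat(rng.normal(0.0, 1.0, n), m)             # subject random intercept
    y = np.cos(2 * np.pi * t) + beta * x + b + rng.normal(0.0, 1.0, n * m)
    return subject, t, x, y


subject, t, x, y = make_data()
beta_hat, se = pooled_beta(subject, t, x, y, h=0.1)
print(f"beta_hat = {beta_hat:.3f} (cluster-robust se {se:.3f})")
```

Only the values of $X$ at the observation times enter this computation, so the sketch does not need the full covariate history.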
Another issue is that the consistency of Lin and Ying's estimator requires the covariate history $X_i(t)$ to be fully observed, while the GEE type estimator does not require this assumption. If $X_i(t)$ is a time-independent covariate, this assumption is easy to satisfy. If $X_i(t)$ is time-varying and only a finite number of observations per subject are available (a common case in longitudinal studies), this assumption will be difficult to satisfy, since information on $X_i(t)$ is often available only at the observation times. If one approximates $X_i(t)$ by $X_i^*(t)$, defined similarly to $Y_i^*(t)$ using the singleton nearest neighbor method, the resulting estimating equation can be shown to be biased and the estimator of $\beta$ to be $\sqrt{n}$-inconsistent. Hence an alternative method needs to be proposed. We conjecture that once again the use of conventional nonparametric regression techniques will solve the consistency issue, and we are interested in Professors Lin and Ying's suggestion on how to handle this situation.

2 Relative Efficiency of Semiparametric Regression and Parametric Regression

Lin and Ying showed through simulation studies that the loss of efficiency in estimating the finite dimensional parameter $\beta$ by fitting the semiparametric regression model (1.1) is negligible compared to a parametric model when $\alpha(t)$ is estimated parametrically. We show theoretically that semiparametric regression is fully efficient in estimating $\beta$ compared to parametric regression if $X_i(t)$ is a time-independent covariate, i.e., $X$ and $T$ are independent. However, semiparametric regression can be subject to an arbitrarily large loss of efficiency when $X_i(t)$ is time-varying.

Consider a simple special case where each subject has a single observation time, which can vary from subject to subject. Assume the optimal (semiparametric efficient) score is used, i.e., $g(T) = \alpha(T)$ in equation (2.7). Without loss of generality, assume $(X, T, Y)$ are centered and have mean zero. Compare the semiparametric model (1) with the simple parametric linear model

$$Y_i = \alpha T_i + \beta X_i + \epsilon_i, \tag{3}$$

where $\epsilon_i \sim N(0, \sigma^2_\epsilon)$. It can be shown that the semiparametric efficient information bound for $\beta$ under model (1) is

$$I_S = \frac{E\{\mathrm{var}(X \mid T)\}}{\sigma^2_\epsilon}.$$

The efficient information for $\beta$ under the parametric linear model (3) is

$$I_P = \frac{1}{\sigma^2_\epsilon}\left( E(X^2) - \frac{[E(XT)]^2}{E(T^2)} \right) = \frac{1}{\sigma^2_\epsilon}\left( E[\mathrm{var}(X \mid T)] + \frac{E[E^2(X \mid T)]\,E(T^2) - [E\{T E(X \mid T)\}]^2}{E(T^2)} \right) \tag{4}$$

$$\geq \frac{1}{\sigma^2_\epsilon} E\{\mathrm{var}(X \mid T)\} = I_S,$$

where the Cauchy-Schwarz inequality is applied in the last step and the equality holds when $E(X \mid T)$ is linear in $T$ or is free of $T$. The second term in (4) can be arbitrarily large, e.g., when $E(X \mid T)$ varies with $T$ substantially in a nonlinear fashion. Hence the loss of efficiency in estimating $\beta$ using semiparametric regression can be arbitrarily large compared to parametric regression.
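The gap between the two information bounds can be checked numerically. The sketch below assumes a hypothetical design that is not taken from Lin and Ying's simulations: $T \sim N(0, 1)$ and $X = c(T^2 - 1) + e$ with $e \sim N(0, 1)$ independent of $T$, so that $E(X \mid T) = c(T^2 - 1)$ is nonlinear in $T$ and $\mathrm{var}(X \mid T) = 1$. Under this design $E(X^2) = 2c^2 + 1$ and $E(XT) = 0$, so $I_S / I_P = 1/(1 + 2c^2)$.

```python
import numpy as np


def information_bounds(c, sigma_eps=1.0, n=200_000, seed=0):
    """Monte Carlo versions of I_S and I_P for X = c*(T^2 - 1) + e."""
    rng = np.random.default_rng(seed)
    t = rng.normal(0.0, 1.0, n)          # E(T) = 0, E(T^2) = 1
    e = rng.normal(0.0, 1.0, n)          # var(X | T) = var(e) = 1
    x = c * (t ** 2 - 1.0) + e           # E(X | T) = c*(T^2 - 1), mean zero
    # I_S uses E{var(X | T)}, estimated here from the simulated e directly.
    i_s = np.mean(e ** 2) / sigma_eps ** 2
    # I_P follows the first expression in (4).
    i_p = (np.mean(x ** 2) - np.mean(x * t) ** 2 / np.mean(t ** 2)) / sigma_eps ** 2
    return i_s, i_p


for c in (0.0, 1.0, 2.0):
    i_s, i_p = information_bounds(c)
    print(f"c={c}: I_S/I_P = {i_s / i_p:.3f}, exact = {1.0 / (1.0 + 2.0 * c ** 2):.3f}")
```

With $c = 0$ the covariate is free of $T$ and the two bounds coincide up to Monte Carlo error. As $c$ grows the ratio falls toward zero, which is the arbitrarily large loss of efficiency described above.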
Lin and Ying's simulation studies considered a time-varying $X_i(t)$ case, and the results in Table 2 showed little loss of efficiency when fitting the semiparametric regression model compared to fitting the parametric regression model. Examination of the data generating mechanism used for Table 2 suggests that it assumes realizations of $X$ vary with $T$, but $X$ and $T$ are in fact independent. Our results suggest that this is the type of situation in which semiparametric estimators are nearly efficient in the parametric sense. It would be interesting to run a simulation study where $X$ and $T$ are correlated, e.g., by allowing the mean of $X(T)$ to vary in a major way with $T$, to assess the loss of efficiency of semiparametric regression compared to parametric regression.

Appendix: Verification of (2)

The Lin and Ying estimator is given in their (2.8). For our simple special case, we have

$$\hat{\beta}_{LY} = \frac{n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2} = \beta + \frac{n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})\{\alpha(T_i) - \bar{\alpha} + \epsilon_i - \bar{\epsilon}\}}{n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2},$$

where $\bar{\alpha}$ and $\bar{\epsilon}$ are the sample means of the $\alpha(T_i)$ and the $\epsilon_i$. Since by assumption $\alpha(T_i)$, $X_i$ and $\epsilon_i$ are independent, (2) follows immediately.

References

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall, New York.

Lin, X. and Carroll, R. J. (2000), "Semiparametric Regression For Clustered Data Using Generalized Estimating Equations," under review.

Severini, T. A. and Staniswalis, J. G. (1994), "Quasilikelihood Estimation in Semiparametric Models," Journal of the American Statistical Association, 89,

Speckman, P. (1988), "Kernel smoothing in partial linear models," Journal of the Royal Statistical Society, Series B, 50,
