STATISTICAL ESTIMATION IN VARYING COEFFICIENT MODELS


The Annals of Statistics 1999, Vol. 27, No. 5

By Jianqing Fan and Wenyang Zhang

University of North Carolina and Chinese University of Hong Kong

Varying coefficient models are a useful extension of classical linear models. They arise naturally when one wishes to examine how regression coefficients change over different groups characterized by certain covariates, such as age. The appeal of these models is that the coefficient functions can easily be estimated via a simple local regression. This yields a simple one-step estimation procedure. We show that such a one-step method cannot be optimal when different coefficient functions admit different degrees of smoothness. This drawback can be repaired by using our proposed two-step estimation procedure. The asymptotic mean-squared error of the two-step procedure is obtained and is shown to achieve the optimal rate of convergence. A few simulation studies show that the gain from the two-step procedure can be quite substantial. The methodology is illustrated by an application to an environmental data set.

1. Introduction.

1.1. Background. Driven by many sophisticated applications and fueled by modern computing power, many useful data-analytic modeling techniques have been proposed to relax traditional parametric models and to exploit possible hidden structure. For an introduction to these techniques, see the books by Hastie and Tibshirani (1990), Green and Silverman (1994), Wand and Jones (1995) and Fan and Gijbels (1996), among others. In dealing with high-dimensional data, many powerful approaches have been developed to avoid the so-called curse of dimensionality.
Examples include additive models [Breiman and Friedman (1985), Hastie and Tibshirani (1990)], low-dimensional interaction models [Friedman (1991), Gu and Wahba (1993), Stone, Hansen, Kooperberg and Truong (1997)], multiple-index models [Härdle and Stoker (1989), Li (1991)], partially linear models [Wahba (1984), Green and Silverman (1994)] and their hybrids [Carroll, Fan, Gijbels and Wand (1997), Fan, Härdle and Mammen (1998), Heckman, Ichimura, Smith and Todd (1998)], among others. Different models explore different aspects of high-dimensional data and incorporate different prior knowledge into modeling and approximation. Together they form a useful tool kit for processing high-dimensional data.

Received October 1997; revised March 1999.
Supported in part by NSF Grant DMS and NSA Grant.
AMS 1991 subject classifications. Primary 62G07; secondary 62J12.
Key words and phrases. Varying coefficient models, local linear fit, optimal rate of convergence, mean-squared errors.

A useful extension of classical linear models is the family of varying coefficient models. The idea is scattered around in textbooks; see, for example, page 245 of Shumway (1988). However, the potential of such a modeling technique did not

get fully explored until the seminal work of Cleveland, Grosse and Shyu (1991) and Hastie and Tibshirani (1993). The varying coefficient models assume the following conditional linear structure:

(1.1)  Y = \sum_{j=1}^p a_j(U) X_j + \varepsilon

for given covariates (U, X_1, \ldots, X_p) and response variable Y, with

E(\varepsilon \mid U, X_1, \ldots, X_p) = 0  and  \operatorname{var}(\varepsilon \mid U, X_1, \ldots, X_p) = \sigma^2(U).

By regarding X_1 ≡ 1, (1.1) allows a varying intercept term in the model. The appeal of this model is that, by allowing the coefficients a_1(·), …, a_p(·) to depend on U, the modeling bias can be significantly reduced and the curse of dimensionality can be avoided. Another advantage of this model is its interpretability. It arises naturally when one is interested in exploring how regression coefficients change over different groups, such as age. It is particularly appealing in longitudinal studies, where it allows one to examine the extent to which covariates affect responses over time. See Hoover, Rice, Wu and Yang (1997) and Fan and Zhang (2000) for details on novel applications of varying coefficient models to longitudinal data. For nonlinear time series applications, see Chen and Tsay (1993), where functional coefficient AR models are proposed and studied.

1.2. Estimation methods. Suppose that we have a random sample {(U_i, X_{i1}, …, X_{ip}, Y_i)}_{i=1}^n from model (1.1). One simple approach to estimating the coefficient functions a_j(·), j = 1, …, p, is to use local linear modeling. For each given point u_0, approximate the functions locally as

(1.2)  a_j(u) \approx a_j + b_j (u - u_0),  for u in a neighborhood of u_0.

This leads to the following local least-squares problem: minimize

(1.3)  \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^p \{ a_j + b_j (U_i - u_0) \} X_{ij} \Big]^2 K_h(U_i - u_0)

for a given kernel function K and bandwidth h, where K_h(·) = K(·/h)/h. The idea is due to Cleveland, Grosse and Shyu (1991). While this idea is very simple and useful, it implicitly assumes that the functions a_j(·) possess about the same degrees of smoothness and hence that they can be approximated equally well in the same interval.
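For readers who want to experiment, the local least-squares problem (1.3) is just a kernel-weighted linear regression solved at each grid point. The following minimal sketch is our own illustration (all function names and the simulated model are hypothetical, not from the paper); it fits all coefficient functions at a single point u_0:

```python
import numpy as np

def epanechnikov(t):
    # K(t) = 0.75 * (1 - t^2) on [-1, 1] and zero outside
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def local_linear_vc(u0, U, X, Y, h):
    """Local linear fit of a varying coefficient model at u0, as in (1.3).

    U: (n,) smoothing covariate, X: (n, p) regressors, Y: (n,) response.
    Returns the estimated coefficients (a_1(u0), ..., a_p(u0)).
    """
    n, p = X.shape
    d = U - u0
    # design columns X_ij and X_ij * (U_i - u0), interleaved per regressor
    D = np.empty((n, 2 * p))
    D[:, 0::2] = X
    D[:, 1::2] = X * d[:, None]
    w = epanechnikov(d / h) / h            # K_h(U_i - u0)
    theta = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * Y))
    return theta[0::2]                     # the local intercepts a_j(u0)

# toy check on data simulated from a known varying coefficient model
rng = np.random.default_rng(0)
n = 2000
U = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X_1 = 1: varying intercept
Y = np.sin(2 * np.pi * U) * X[:, 0] + U**2 * X[:, 1] + 0.1 * rng.normal(size=n)
est = local_linear_vc(0.5, U, X, Y, h=0.1)
print(est)  # should be close to (sin(pi), 0.25) = (0, 0.25)
```

Sweeping u_0 over a grid recovers the whole coefficient curves; this is the one-step estimator analyzed in the rest of the section.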
If the functions possess different degrees of smoothness, suboptimal estimators are obtained from the least-squares method (1.3). To formulate the above intuition in a mathematical framework, let us assume that a_p(·) is smoother than the rest of the functions. For concreteness,

we assume that a_p(·) possesses a bounded fourth derivative, so that it can locally be approximated by a cubic function,

(1.4)  a_p(u) \approx a_p + b_p (u - u_0) + c_p (u - u_0)^2 + d_p (u - u_0)^3,

for u in a neighborhood of u_0. This naturally leads to the following weighted least-squares problem:

(1.5)  \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^{p-1} \{ a_j + b_j (U_i - u_0) \} X_{ij} - \{ a_p + b_p (U_i - u_0) + c_p (U_i - u_0)^2 + d_p (U_i - u_0)^3 \} X_{ip} \Big]^2 K_{h_1}(U_i - u_0).

Let (\hat a_{j,1}, \hat b_{j,1}), j = 1, \ldots, p-1, and (\hat a_{p,1}, \hat b_{p,1}, \hat c_{p,1}, \hat d_{p,1}) minimize (1.5). The resulting estimator \hat a_{p,OS}(u_0) = \hat a_{p,1} is called a one-step estimator. We will show that the bias of the one-step estimator is O(h_1^2) and that its variance is O((n h_1)^{-1}). Therefore, with the one-step estimator \hat a_{p,OS}(u_0), the optimal rate O(n^{-8/9}) cannot be achieved. To achieve the optimal rate, a two-step procedure has to be used. The first step involves getting an initial estimate of a_1(·), \ldots, a_{p-1}(·). Such an initial estimate is usually undersmoothed, so that its bias is small. Then, in the second step, a local least-squares regression is fitted again, substituting the initial estimates into the local least-squares problem. More precisely, we use local linear regression to obtain a preliminary estimate by minimizing

(1.6)  \sum_{k=1}^n \Big[ Y_k - \sum_{j=1}^p \{ a_j + b_j (U_k - u_0) \} X_{kj} \Big]^2 K_{h_0}(U_k - u_0)

for a given initial bandwidth h_0 and kernel K. Let \hat a_{1,0}(u_0), \ldots, \hat a_{p,0}(u_0) denote the initial estimates of a_1(u_0), \ldots, a_p(u_0). In the second step, we substitute the preliminary estimates \hat a_{1,0}(·), \ldots, \hat a_{p-1,0}(·) and use a local cubic fit to estimate a_p(u_0); namely, we minimize

(1.7)  \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^{p-1} \hat a_{j,0}(U_i) X_{ij} - \{ a_p + b_p (U_i - u_0) + c_p (U_i - u_0)^2 + d_p (U_i - u_0)^3 \} X_{ip} \Big]^2 K_{h_2}(U_i - u_0)

with respect to (a_p, b_p, c_p, d_p), where h_2 is the bandwidth used in the second step. In this way, a two-step estimator \hat a_{p,TS}(u_0) of a_p(u_0) is obtained. We will

show that the bias of the two-step estimator is O(h_2^4) and its variance is O((n h_2)^{-1}), provided that h_0 = o(h_2^2), n h_0 / \log(1/h_0) \to \infty and n h_0^3 \to \infty. This means that when the optimal bandwidth h_2 \sim n^{-1/9} is used, and the preliminary bandwidth h_0 lies between the rates O(n^{-1/3}) and O(n^{-2/9}), the optimal rate of convergence O(n^{-8/9}) for estimating a_p(·) can be achieved. Note that the condition n h_0^3 \to \infty is only a convenient technical condition based on the assumption of bounded sixth moments of the covariates. It plays little role in understanding the two-step procedure. If X is assumed to have higher moments, the condition can be relaxed to be as weak as n h_0^{1+\delta} \to \infty for some small \delta > 0; see Condition 7 in Section 4 for details. Therefore, the requirement on h_0 is very minimal. A practical implication of this is that the two-step estimation method is not sensitive to the initial bandwidth h_0. This makes practical implementation much easier.

Another possible way to conduct variable smoothing for the coefficient functions is to use the following smoothing spline approach proposed by Hastie and Tibshirani (1993): minimize

\sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^p a_j(U_i) X_{ij} \Big]^2 + \sum_{j=1}^p \lambda_j \int \{ a_j''(u) \}^2 \, du

for some smoothing parameters \lambda_1, \ldots, \lambda_p. While this idea is powerful, there are a number of potential problems. First, there are p smoothing parameters to choose simultaneously, which is quite a task in practice. Second, computation can be a challenge; an iterative scheme was proposed in Hastie and Tibshirani (1993). Third, sampling properties are somewhat difficult to obtain. It is not clear whether the resulting method can achieve the same optimal rate of convergence as the one-step procedure.

The above theoretical work is not purely academic; it has important practical implications. To validate our asymptotic claims, we use three simulated examples to illustrate our methodology. The sample size is n = 500 and p = 2.
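To make the two-stage construction concrete, here is a minimal sketch of the procedure in (1.6) and (1.7): an undersmoothed local linear first stage for all coefficients, followed by a local cubic fit of a_p alone on the partial residuals. This is our own illustration under simplified assumptions; the function names and the toy model are hypothetical:

```python
import numpy as np

def epan(t):
    # Epanechnikov kernel K(t) = 0.75 (1 - t^2) on [-1, 1]
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def loc_poly_vc(u0, U, X, Y, h, degree):
    # joint local polynomial fit of all coefficient functions at u0;
    # returns the estimated a_j(u0) for each column of X
    d = U - u0
    D = np.hstack([X * d[:, None] ** k for k in range(degree + 1)])
    w = epan(d / h) / h
    theta = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * Y))
    return theta[: X.shape[1]]

def two_step(u0, U, X, Y, h0, h2):
    n, p = X.shape
    # step 1: undersmoothed local linear estimates of a_1, ..., a_{p-1} at each U_i
    prelim = np.array([loc_poly_vc(u, U, X, Y, h0, 1)[: p - 1] for u in U])
    # step 2: substitute them as in (1.7) and fit a_p by a local cubic regression
    partial = Y - np.sum(prelim * X[:, : p - 1], axis=1)
    return loc_poly_vc(u0, U, X[:, [p - 1]], partial, h2, 3)[0]

# toy check: a_1 rough, a_2 smooth (linear), so a_2 is the smoother target
rng = np.random.default_rng(1)
n = 1500
U = rng.uniform(0.0, 1.0, n)
X = rng.normal(size=(n, 2))
Y = np.sin(6 * np.pi * U) * X[:, 0] + 2 * U * X[:, 1] + 0.1 * rng.normal(size=n)
a2_hat = two_step(0.5, U, X, Y, h0=0.05, h2=0.25)
print(a2_hat)  # should be close to a_2(0.5) = 2 * 0.5 = 1
```

The initial bandwidth h0 only needs to be small enough that the first-stage bias is negligible; the result is driven mainly by h2.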
Figure 1 depicts typical estimates from the one-step and two-step methods, both using the optimal bandwidth for estimating a_2(·). (For the two-step estimator, we do not optimize the bandwidths h_0 and h_2 simultaneously; rather, we only optimize the bandwidth h_2 for a given small bandwidth h_0.) Details of the simulations can be found in Section 5.2. In the first example, the bias of the one-step estimate is too large, since the optimal bandwidth h_1 for a_2 is so large that a_1 can no longer be approximated well by a linear function in such a large neighborhood. In the second example, the estimated curve is clearly undersmoothed by the one-step method, since the optimal bandwidth for a_2 has to be very small in order to compensate for the bias arising from approximating a_1. The one-step estimator works reasonably well in the third example, though the two-step estimator still improves somewhat on the quality of the one-step estimate.

Fig. 1. Comparisons of the performance of the one-step and two-step estimators. Solid curves: true functions; short-dashed curves: estimates based on the one-step procedure; long-dashed curves: estimates based on the two-step procedure.

In real applications, we do not know in advance whether a_p(·) is really smoother than the rest of the functions. The above discussion reveals that the two-step procedure can lead to a significant gain when a_p(·) is smoother than the rest of the functions. When a_p(·) has the same degree of smoothness as the rest of the functions, we will demonstrate that the two-step estimation procedure has the same performance as the one-step approach. Therefore, the two-step scheme is always more reliable than the one-step approach. Details of implementing the two-step method are outlined in Section 2.

1.3. Outline of the paper. Section 2 gives strategies for implementing the two-step estimators. The explicit formulas for our proposed estimators are given in Section 3. Section 4 studies asymptotic properties of the one-step and two-step estimators. In Section 5, we study finite-sample properties of the one-step and two-step estimators via some simulated examples. Two-step techniques are further illustrated by an application to an environmental data set. Technical proofs are given in Section 6.

2. Practical implementation of two-step estimators. As discussed in the introduction, a one-step procedure is not optimal when coefficient functions admit different degrees of smoothness. However, we do not know in advance which function is not smooth. To implement the two-step strategy, one minimizes (1.6) with a small bandwidth h_0 to obtain the preliminary estimates \hat a_{j,0}(U_i), j = 1, \ldots, p, for i = 1, \ldots, n. With these preliminary estimates, one can now estimate each coefficient function a_j(u_0) by using an equation similar to (1.7). Other techniques, such as smoothing splines, can also be used in the second stage of fitting.
In practical implementation, it usually suffices to use local linear fits instead of local cubic fits in the second step. This results in computational

savings. Our experience with local polynomial fits shows that, for practical purposes, the local linear fit with optimally chosen bandwidth performs comparably with the local cubic fit with optimal bandwidth. As discussed in the introduction, the two-step estimator is not very sensitive to the choice of the initial bandwidth, as long as it is small enough that the bias in the first-step smoothing is negligible. This suggests the following simple automatic rule: use cross-validation or generalized cross-validation [see, e.g., Hoover, Rice, Wu and Yang (1997)] to select the bandwidth \hat h for the one-step fit, and then use h_0 = 0.5 \hat h (say) as the initial bandwidth.

An advantage of the two-step procedure is that in the second step the problem is really a univariate smoothing problem. Therefore, one can apply univariate bandwidth selection procedures such as cross-validation [Stone (1974)], the preasymptotic substitution method [Fan and Gijbels (1995)], the plug-in bandwidth selector [Ruppert, Sheather and Wand (1995)] and the empirical bias method [Ruppert (1997)] to select the smoothing parameter. As discussed before, the preliminary bandwidth h_0 is not very crucial to the final estimates, since the two-step method achieves the optimal rate for a wide range of bandwidths h_0. This is another benefit of the two-step procedure: bandwidth selection problems become relatively easy.

3. Formulas for the proposed estimators. The solutions to the least-squares problems (1.5)-(1.7) can easily be obtained. We take this opportunity to introduce the necessary notation. In the notation below, we use the subscripts 0, 1 and 2, respectively, to indicate variables related to the initial, one-step and two-step estimators.
Let

\tilde X_0 = \begin{pmatrix} X_{11} & X_{11}(U_1 - u_0) & \cdots & X_{1p} & X_{1p}(U_1 - u_0) \\ \vdots & \vdots & & \vdots & \vdots \\ X_{n1} & X_{n1}(U_n - u_0) & \cdots & X_{np} & X_{np}(U_n - u_0) \end{pmatrix},

Y = (Y_1, \ldots, Y_n)^T and W_0 = \operatorname{diag}\{ K_{h_0}(U_1 - u_0), \ldots, K_{h_0}(U_n - u_0) \}. Then the solution to the least-squares problem (1.6) can be expressed as

(3.1)  \hat a_{j,0}(u_0) = e_{2j-1,2p}^T (\tilde X_0^T W_0 \tilde X_0)^{-1} \tilde X_0^T W_0 Y,  j = 1, \ldots, p.

Here and hereafter, we use e_{k,m} to denote the unit vector of length m with a 1 at the kth position. The solution to problem (1.5) can be expressed as follows. Let

\tilde X_2 = \begin{pmatrix} X_{1p} & X_{1p}(U_1 - u_0) & X_{1p}(U_1 - u_0)^2 & X_{1p}(U_1 - u_0)^3 \\ \vdots & \vdots & \vdots & \vdots \\ X_{np} & X_{np}(U_n - u_0) & X_{np}(U_n - u_0)^2 & X_{np}(U_n - u_0)^3 \end{pmatrix}

and

\tilde X_3 = \begin{pmatrix} X_{11} & X_{11}(U_1 - u_0) & \cdots & X_{1,p-1} & X_{1,p-1}(U_1 - u_0) \\ \vdots & \vdots & & \vdots & \vdots \\ X_{n1} & X_{n1}(U_n - u_0) & \cdots & X_{n,p-1} & X_{n,p-1}(U_n - u_0) \end{pmatrix},

\tilde X_1 = (\tilde X_3, \tilde X_2), \qquad W_1 = \operatorname{diag}\{ K_{h_1}(U_1 - u_0), \ldots, K_{h_1}(U_n - u_0) \}.

Then the solution to the least-squares problem (1.5) is given by

(3.2)  \hat a_{p,1}(u_0) = e_{2p-1,2p+2}^T (\tilde X_1^T W_1 \tilde X_1)^{-1} \tilde X_1^T W_1 Y.

Using the notation introduced above, we can express the two-step estimator as

(3.3)  \hat a_{p,2}(u_0) = e_{1,4}^T (\tilde X_2^T W_2 \tilde X_2)^{-1} \tilde X_2^T W_2 (Y - \hat V),

where W_2 = \operatorname{diag}\{ K_{h_2}(U_1 - u_0), \ldots, K_{h_2}(U_n - u_0) \} and \hat V = (\hat V_1, \ldots, \hat V_n)^T with \hat V_i = \sum_{j=1}^{p-1} \hat a_{j,0}(U_i) X_{ij}.

Note that the two-step estimator \hat a_{p,2}(u_0) is a linear estimator for given bandwidths h_0 and h_2, since it is a weighted average of the observations Y_1, \ldots, Y_n. The weights are somewhat complicated. To obtain them, let \tilde X_0^{(i)} be the matrix \tilde X_0 with u_0 = U_i and let W_0^{(i)} be the matrix W_0 with u_0 = U_i. Then

\hat V_i = \sum_{j=1}^{p-1} X_{ij} e_{2j-1,2p}^T (\tilde X_0^{(i)T} W_0^{(i)} \tilde X_0^{(i)})^{-1} \tilde X_0^{(i)T} W_0^{(i)} Y.

Set

B_n = I_n - \begin{pmatrix} \sum_{j=1}^{p-1} X_{1j} e_{2j-1,2p}^T (\tilde X_0^{(1)T} W_0^{(1)} \tilde X_0^{(1)})^{-1} \tilde X_0^{(1)T} W_0^{(1)} \\ \vdots \\ \sum_{j=1}^{p-1} X_{nj} e_{2j-1,2p}^T (\tilde X_0^{(n)T} W_0^{(n)} \tilde X_0^{(n)})^{-1} \tilde X_0^{(n)T} W_0^{(n)} \end{pmatrix}.

Then

(3.4)  \hat a_{p,2}(u_0) = e_{1,4}^T (\tilde X_2^T W_2 \tilde X_2)^{-1} \tilde X_2^T W_2 B_n Y.

4. Main results. We impose the following technical conditions:

1. E|X_j|^{2s} < \infty for some s > 2, j = 1, \ldots, p.
2. a_j''(·) is continuous in a neighborhood of u_0, for j = 1, \ldots, p. Further, assume a_j''(u_0) \ne 0, for j = 1, \ldots, p.
3. The function a_p(·) has a continuous fourth derivative in a neighborhood of u_0.
4. r_{ij}(·) is continuous in a neighborhood of u_0 and r_{ij}(u_0) \ne 0, for i, j = 1, \ldots, p, where r_{ij}(u) = E(X_i X_j \mid U = u).

5. The marginal density f(·) of U has a continuous second derivative in some neighborhood of u_0, and f(u_0) \ne 0.
6. The function K(t) is a symmetric density function with compact support.
7. h_0 / h_2 \to 0 and n h_0^\gamma / \log(1/h_0) \to \infty for some \gamma > s/(s-2), with s given in Condition 1.

Throughout this paper we use the following notation. Let

\mu_i = \int t^i K(t) \, dt, \qquad \nu_i = \int t^i K^2(t) \, dt,

and let \mathcal{X} denote the observed covariates, namely

\mathcal{X} = \{ (U_i, X_{i1}, \ldots, X_{ip}) : i = 1, \ldots, n \}.

Set r_{ij} = r_{ij}(u_0) = E(X_i X_j \mid U = u_0), for i, j = 1, \ldots, p. Put

\Sigma = \operatorname{diag}\{ \sigma^2(U_1), \ldots, \sigma^2(U_n) \}

and

\alpha_j(u) = \Omega_{p-1}^{-1}(u) \big( r_{1j}(u), \ldots, r_{p-1,j}(u) \big)^T, \qquad \alpha_j = \alpha_j(u_0),  for j = 1, \ldots, p,

\Omega_i(u) = E\{ (X_1, \ldots, X_i)^T (X_1, \ldots, X_i) \mid U = u \}, \qquad \Omega_i = \Omega_i(u_0),  for i = 1, \ldots, p.

For the one-step estimator, we have the following asymptotic bias and variance.

Theorem 1. Under Conditions 1-6, if h_1 \to 0 in such a way that n h_1 \to \infty, then the asymptotic conditional bias of \hat a_{p,OS}(u_0) is given by

\operatorname{bias}\{ \hat a_{p,OS}(u_0) \mid \mathcal{X} \} = \frac{h_1^2 \mu_2}{2 r_{pp}} \sum_{j=1}^{p-1} r_{pj} a_j''(u_0) + o_P(h_1^2),

and the asymptotic conditional variance of \hat a_{p,OS}(u_0) is

\operatorname{var}\{ \hat a_{p,OS}(u_0) \mid \mathcal{X} \} = \frac{\sigma^2(u_0)}{n h_1 f(u_0)} \cdot \frac{\lambda_2 r_{pp} + \lambda_3 \alpha_p^T \Omega_{p-1} \alpha_p}{\lambda_1 r_{pp} (r_{pp} - \alpha_p^T \Omega_{p-1} \alpha_p)} (1 + o_P(1)),

where \lambda_1 = (\mu_4 - \mu_2^2)^2, \lambda_2 = \nu_0 \mu_4^2 - 2 \nu_2 \mu_2 \mu_4 + \mu_2^2 \nu_4 and \lambda_3 = 2 \mu_2 \nu_2 \mu_4 - 2 \nu_0 \mu_2^2 \mu_4 - \mu_2^2 \nu_4 + \nu_0 \mu_2^4.

The proofs of Theorem 1 and of the other theorems are given in Section 6. It is clear that the conditional MSE of the one-step estimator \hat a_{p,OS}(u_0) is only O_P(h_1^4 + (n h_1)^{-1}), which achieves the rate O_P(n^{-4/5}) when the bandwidth h_1 = O(n^{-1/5}) is used. The bias expression above indicates clearly that the approximation errors of the functions a_1(·), …, a_{p-1}(·) are transmitted to the bias in estimating a_p(·). Thus, the one-step estimator of a_p(·) inherits nonnegligible

approximation errors and is not optimal. Note that Theorem 1 continues to hold if Condition 3 is dropped; see also Theorem 3. We now consider the asymptotic MSE of the two-step estimator.

Theorem 2. If Conditions 1-7 hold, then the asymptotic conditional bias of \hat a_{p,TS}(u_0) can be expressed as

\operatorname{bias}\{ \hat a_{p,TS}(u_0) \mid \mathcal{X} \} = \frac{1}{4!} \frac{\mu_4^2 - \mu_6 \mu_2}{\mu_4 - \mu_2^2} a_p^{(4)}(u_0) h_2^4 - \frac{\mu_2 h_0^2}{2 r_{pp}} \sum_{j=1}^{p-1} a_j''(u_0) r_{pj} + o_P(h_0^2 + h_2^4),

and the asymptotic conditional variance of \hat a_{p,TS}(u_0) is given by

\operatorname{var}\{ \hat a_{p,TS}(u_0) \mid \mathcal{X} \} = \frac{(\mu_4^2 \nu_0 - 2 \mu_4 \mu_2 \nu_2 + \mu_2^2 \nu_4) \sigma^2(u_0)}{n h_2 f(u_0) (\mu_4 - \mu_2^2)^2} \, e_{p,p}^T \Omega_p^{-1} e_{p,p} (1 + o_P(1)).

By Theorem 2, the asymptotic variance of the two-step estimator is independent of the initial bandwidth h_0 as long as n h_0^\gamma \to \infty, where \gamma is given in Condition 7. Thus, the initial bandwidth h_0 should be chosen as small as possible, subject to the constraint that n h_0^\gamma \to \infty. In particular, when h_0 = o(h_2^2), the bias from the initial estimator becomes negligible and the bias expression for the two-step estimator is

\frac{1}{4!} \frac{\mu_4^2 - \mu_6 \mu_2}{\mu_4 - \mu_2^2} a_p^{(4)}(u_0) h_2^4 + o_P(h_2^4).

Hence, taking the optimal bandwidth h_2 of order n^{-1/9}, the conditional MSE of the two-step estimator achieves the optimal rate of convergence O_P(n^{-8/9}).

Remark 1. Consider the ideal situation where a_1(·), \ldots, a_{p-1}(·) are known. Then one can simply run a local cubic regression to estimate a_p(·). The resulting estimator has asymptotic bias

\frac{1}{4!} \frac{\mu_4^2 - \mu_6 \mu_2}{\mu_4 - \mu_2^2} a_p^{(4)}(u_0) h_2^4 + o_P(h_2^4)

and asymptotic variance

\frac{\mu_4^2 \nu_0 - 2 \mu_4 \mu_2 \nu_2 + \mu_2^2 \nu_4}{n h_2 f(u_0) r_{pp} (\mu_4 - \mu_2^2)^2} \sigma^2(u_0) + o_P\{ (n h_2)^{-1} \}.

This ideal estimator has the same asymptotic bias as the two-step estimator, and its variance is of the same order as that of the two-step estimator. In other words, the two-step estimator enjoys the same optimal rate of convergence as the ideal one.
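The constants appearing in Theorems 1 and 2 involve only the kernel moments \mu_i = \int t^i K(t) dt and \nu_i = \int t^i K^2(t) dt. As a numerical sanity check (our own illustration, not part of the paper), they can be evaluated for the Epanechnikov kernel K(t) = 0.75(1 - t^2)_+ used in Section 5:

```python
import math
import numpy as np

# kernel moments mu_i = int t^i K(t) dt and nu_i = int t^i K(t)^2 dt
# for the Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+
t = np.linspace(-1.0, 1.0, 200001)
dt = t[1] - t[0]
K = 0.75 * (1.0 - t**2)

def mu(i):
    return float(np.sum(t**i * K) * dt)

def nu(i):
    return float(np.sum(t**i * K**2) * dt)

mu2, mu4, mu6 = mu(2), mu(4), mu(6)
nu0, nu2, nu4 = nu(0), nu(2), nu(4)
print(mu2, mu4, mu6)  # analytically 1/5, 3/35, 1/21
print(nu0, nu2, nu4)  # analytically 3/5, 3/35, 1/35

# leading bias constant of the two-step (local cubic) estimator in Theorem 2
bias_const = (mu4**2 - mu6 * mu2) / (math.factorial(4) * (mu4 - mu2**2))
print(bias_const)
```

Here bias_const is the factor multiplying a_p^{(4)}(u_0) h_2^4 in Theorem 2; its small magnitude (about 0.002 in absolute value for this kernel) is one reason local cubic fitting performs well.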

We now consider the case where a_p(·) is as smooth as the rest of the functions. In technical terms, we assume that a_p(·) has only a continuous second derivative. In this case, a local linear approximation is used for the function a_p(·) in both the one-step and two-step procedures. With some abuse of notation, we still denote the resulting one-step and two-step estimators by \hat a_{p,OS} and \hat a_{p,TS}, respectively. Our technical results establish that the two-step estimator does not lose statistical efficiency: it has the same performance as the one-step procedure. Since it gains efficiency when a_p(·) is smoother, we conclude that the two-step estimator is preferable. These results give theoretical endorsement to the two-step method proposed in Section 2.

Theorem 3. Under Conditions 1, 2 and 4-6, if h_1 \to 0 and n h_1 \to \infty, then the asymptotic conditional bias of the one-step estimator is given by

\operatorname{bias}\{ \hat a_{p,OS}(u_0) \mid \mathcal{X} \} = \frac{h_1^2 \mu_2}{2} a_p''(u_0) (1 + o_P(1)),

and the asymptotic conditional variance of \hat a_{p,OS}(u_0) is given by

\operatorname{var}\{ \hat a_{p,OS}(u_0) \mid \mathcal{X} \} = \frac{\sigma^2(u_0) \nu_0}{n h_1 f(u_0)} e_{p,p}^T \Omega_p^{-1} e_{p,p} (1 + o_P(1)).

We now consider the asymptotic behavior of the two-step estimator.

Theorem 4. Suppose that Conditions 1, 2 and 4-7 hold. Then the asymptotic conditional bias is

\operatorname{bias}\{ \hat a_{p,TS}(u_0) \mid \mathcal{X} \} = \frac{1}{2} \Big\{ a_p''(u_0) \mu_2 h_2^2 - \frac{\mu_2 h_0^2}{r_{pp}} \sum_{j=1}^{p-1} a_j''(u_0) r_{pj} \Big\} (1 + o_P(1)),

and the asymptotic variance is

\operatorname{var}\{ \hat a_{p,TS}(u_0) \mid \mathcal{X} \} = \frac{\nu_0 \sigma^2(u_0)}{n h_2 f(u_0)} e_{p,p}^T \Omega_p^{-1} e_{p,p} (1 + o_P(1)).

Remark 2. The asymptotic bias of the two-step estimator simplifies to

\frac{1}{2} a_p''(u_0) \mu_2 h_2^2 (1 + o_P(1))

by taking the initial bandwidth h_0 = o(h_2). Moreover, the two-step estimator has the same asymptotic variance as the one-step estimator. In other words, the performance of the one-step and two-step estimators is asymptotically identical.

Remark 3. When a_1(·), \ldots, a_{p-1}(·) are known, we can use a local linear fit to find an estimate of a_p(·). Such an ideal estimator possesses bias

\frac{1}{2} a_p''(u_0) \mu_2 h_2^2 \{ 1 + o_P(1) \}

and variance

\frac{\sigma^2(u_0) \nu_0}{n h_2 f(u_0) r_{pp}} \{ 1 + o_P(1) \}.

So both the one-step and two-step estimators have the same order of MSE as the ideal estimator. However, the variance of the ideal estimator is typically smaller: unless X_1, \ldots, X_{p-1} and X_p are uncorrelated given U = u_0, the asymptotic variance of the ideal estimator is always smaller.

5. Simulations and applications. In this section we illustrate our methodology via an application to an environmental data set and via simulations. Throughout this section, we use the Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+.

5.1. Application to an environmental data set. We now illustrate the methodology via an application to an environmental data set. The data set used here consists of daily measurements of pollutants and other environmental factors in Hong Kong between January 1, 1994 and December 31, 1995 (courtesy of Professor T. S. Lau). Three pollutants, sulphur dioxide (in μg/m³), nitrogen dioxide (in μg/m³) and dust (in μg/m³), are considered here. Table 1 summarizes their correlation coefficients. The correlation between the dust level and NO₂ is quite high. Figure 2 depicts the marginal distributions of the pollutant levels in the summer (April 1-September 30) and winter (October 1-March 31) seasons. The pollutant levels in the summer season (raining very often) tend to be lower and to have smaller variation.

An objective of the study is to understand the association between the levels of the pollutants and the number of daily total hospital admissions for circulatory and respiratory problems, and to examine the extent to which the association varies over time. We consider the relationship among the number of daily hospital admissions Y and the levels of the pollutants SO₂, NO₂ and dust, denoted by X_2, X_3 and X_4, respectively. We took X_1 ≡ 1 as the intercept term and U = t = time.
The varying coefficient model

(5.1)  Y = a_1(t) + a_2(t) X_2 + a_3(t) X_3 + a_4(t) X_4 + \varepsilon

Table 1. Correlation coefficients among the three pollutants: sulphur dioxide, nitrogen dioxide and dust.

Fig. 2. Density estimates of the distributions of the pollutants. Solid curves are for the summer season and dashed curves are for the winter season.

is fitted to the given data. The two-step method is employed. An initial bandwidth h_0 (six percent of the whole interval) was chosen. As anticipated, the results do not alter much with different choices of initial bandwidth. The second-stage bandwidths h_2 were chosen to be, respectively, 25%, 25%, 30% and 30% of the interval length for the functions a_1(·), \ldots, a_4(·). Figure 3 depicts the estimated coefficient functions; they describe the extent to which the coefficients vary with time. Two short-dashed curves indicate pointwise 95% confidence intervals with bias ignored. The standard errors are computed from the second-stage local cubic regression; see Section 4.3 of Fan and Gijbels (1996) on how to compute estimated standard errors from local polynomial regression. The figure shows that there is a strong time effect on the coefficient functions. For comparison purposes, we also superimpose in Figure 3 the estimates (long-dashed curves) from the one-step procedure with bandwidths 25%, 25%, 30% and 30% of the time interval for a_1(·), \ldots, a_4(·), respectively.

To compare the performance of the one-step and two-step methods, we define the relative efficiency between them via computing

\{ \operatorname{RSS}_h(\text{one-step}) - \operatorname{RSS}_h(\text{two-step}) \} / \operatorname{RSS}_h(\text{two-step}),

where RSS_h(one-step) and RSS_h(two-step) are the residual sums of squares for the one-step procedure using bandwidth h and for the two-step method using the same bandwidth h in the second stage, respectively. Figure 4(a) shows that the two-step method has smaller RSS than the one-step method. The gain is more pronounced as the bandwidth increases. This can be explained intuitively as follows: as the bandwidth increases, at least one of the components will have nonnegligible bias. Pollutants may not have an immediate effect on circulatory and respiratory systems.
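The relative-efficiency measure just defined is straightforward to compute once the two sets of fitted values are available; a small helper (hypothetical code, for illustration only):

```python
import numpy as np

def relative_efficiency(y, yhat_one, yhat_two):
    """{RSS_h(one-step) - RSS_h(two-step)} / RSS_h(two-step), with both
    fits evaluated using the same (second-stage) bandwidth h."""
    rss_one = float(np.sum((np.asarray(y) - np.asarray(yhat_one)) ** 2))
    rss_two = float(np.sum((np.asarray(y) - np.asarray(yhat_two)) ** 2))
    return (rss_one - rss_two) / rss_two

# toy numbers: the one-step fit misses twice as badly pointwise
y = np.array([10.0, 12.0, 9.0, 11.0])
re = relative_efficiency(y, y + 0.2, y + 0.1)
print(re)  # (4*0.04 - 4*0.01) / (4*0.01) = 3.0
```

A positive value means the two-step fit explains more of the variation at that bandwidth, which is the pattern reported in Figure 4(a).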
A natural question arises as to whether there is any time lag in the response variable. To study this question, we fit model (5.1) for each time lag τ using

Fig. 3. The estimated coefficient functions. The solid and long-dashed curves are for the two-step and one-step methods, respectively. Two short-dashed curves indicate pointwise 95% confidence intervals with bias ignored.

the data

\{ (Y(t + \tau), X_2(t), X_3(t), X_4(t)) : t = 1, 2, \ldots \}.

Figure 4(b) presents the resulting residual sum of squares for each time lag. As τ gets larger, so does the residual sum of squares. This in turn suggests no evidence of time delay in the response variable.

We now examine how the expected number of hospital admissions changes over time when the pollutant levels are set at their averages. Namely, we plot the function

\hat Y(t) = \hat a_1(t) + \hat a_2(t) \bar X_2 + \hat a_3(t) \bar X_3 + \hat a_4(t) \bar X_4

against t, where the estimated coefficient functions were obtained by the two-step approach. Figure 4(c) presents the result. It indicates an overall increasing trend in the number of hospital admissions for respiratory and circulatory problems. A seasonal pattern can also be seen. These features are not available from the usual parametric least-squares models.
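The curve in Figure 4(c) is a pointwise linear combination of the estimated coefficient curves with the pollutant levels frozen at their sample means. A sketch (our own illustration; the names and array shapes are hypothetical):

```python
import numpy as np

def fitted_mean_curve(a_hat, x_bar):
    # a_hat: (T, 4) array of estimated coefficient curves a_1(t), ..., a_4(t)
    # on a grid of T time points; x_bar: average levels of (SO2, NO2, dust).
    # Returns Yhat(t) = a_1(t) + a_2(t) xbar_2 + a_3(t) xbar_3 + a_4(t) xbar_4.
    a_hat = np.asarray(a_hat, dtype=float)
    return a_hat[:, 0] + a_hat[:, 1:] @ np.asarray(x_bar, dtype=float)

# toy check on two time points
a_hat = np.array([[1.0, 1.0, 1.0, 1.0],
                  [2.0, 0.0, 0.0, 0.0]])
yhat = fitted_mean_curve(a_hat, x_bar=[1.0, 2.0, 3.0])
print(yhat)  # [1 + 1 + 2 + 3, 2] = [7.0, 2.0]
```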

Fig. 4. (a) Comparison of the relative efficiency between the one-step and the two-step methods. (b) Testing whether there is any time delay in the response variable. (c) The expected number of hospital admissions over time when pollutant levels are set at their averages.

5.2. Simulations. We use the following three examples to illustrate the performance of our method:

Example 1. Y = \sin(6U) X_1 + 4U(1 - U) X_2 + \varepsilon.

Example 2. Y = \sin(6\pi U) X_1 + \sin(2\pi U) X_2 + \varepsilon.

Example 3. Y = \sin\{ 8\pi (U - 0.5) \} X_1 + \big( 3.5 [ \exp\{ -(4U - 1)^2 \} + \exp\{ -(4U - 3)^2 \} ] - 1.5 \big) X_2 + \varepsilon,

where U follows a uniform distribution on [0, 1], and X_1 and X_2 are normally distributed with correlation coefficient 2^{-1/2}. Further, the marginal distributions of X_1 and X_2 are standard normal, and \varepsilon, U and (X_1, X_2) are independent. The random variable \varepsilon follows a normal distribution with mean zero and variance \sigma^2. The \sigma^2 is chosen so that the signal-to-noise ratio is about 5:1, namely,

\sigma^2 = 0.2 \operatorname{var}\{ m(U, X_1, X_2) \}, \qquad m(U, X_1, X_2) = E(Y \mid U, X_1, X_2).

Figure 5 shows the varying coefficient functions a_1(·) and a_2(·) for Examples 1-3. For each of the above examples, we conducted 100 simulations with sample sizes n = 250, 500 and 1000. The mean integrated squared errors (MISE) for estimating a_2(·) are recorded. For the one-step procedure, we plot the MISE against h_1, and hence the optimal bandwidth can be chosen. For the two-step procedure, we choose a small initial bandwidth h_0 and then compute the MISE of the two-step estimator as a function of h_2. Specifically, we choose h_0 = 0.03, 0.04 and 0.05, respectively, for Examples 1, 2 and 3. The optimal bandwidths h_1 and h_2 were used to compute the resulting estimators presented in Figure 1. Among the 100 samples, we select the one for which the two-step estimator attains its median performance. Once the sample is selected, the one-step estimate and

Fig. 5. The varying coefficient functions. Solid curves are for a_1(·) and dashed curves are for a_2(·).

the two-step estimate are computed. Figure 1 depicts the resulting estimates based on n = 500. Figure 6 depicts the MISE as a function of bandwidth. The MISE curves for the two-step method lie below those for the one-step approach in all three examples that we tested. This is in line with our asymptotic theory that the two-step approach outperforms the one-step procedure if the initial bandwidth is chosen correctly. The improvement of the two-step estimator is quite substantial when the optimal bandwidth is used (in comparison with the one-step approach using its optimal bandwidth). Further, the MISE curve for the two-step estimator is flatter than that for the one-step method. This in turn reveals that the choice of bandwidth is less crucial for the two-step estimator than for the one-step procedure. This is an extra benefit of the two-step procedure.

6. Proofs. The proof of Theorem 3 (and Theorem 4) is similar to that of Theorem 1 (and Theorem 2); thus, we only prove Theorems 1 and 2. When the asymptotic conditional bias and variance of the two-step estimator \hat a_{p,TS}(u_0) are calculated, the following lemma on uniform convergence will be used.

Lemma 1. Let (X_1, Y_1), \ldots, (X_n, Y_n) be i.i.d. random vectors, where the Y_i's are scalar random variables. Assume further that E|Y|^s < \infty and \sup_x \int |y|^s f(x, y) \, dy < \infty, where f denotes the joint density of (X, Y). Let K be a bounded positive function with bounded support, satisfying a Lipschitz condition. Then

\sup_{x \in D} \Big| n^{-1} \sum_{i=1}^n \big\{ K_h(X_i - x) Y_i - E[ K_h(X_i - x) Y_i ] \big\} \Big| = O_P\big[ \{ n h / \log(1/h) \}^{-1/2} \big],

provided that n^{2\varepsilon - 1} h \to \infty for some \varepsilon < 1 - s^{-1}.

Fig. 6. MISE as a function of bandwidth. Solid curves: one-step procedure; dashed curves: two-step procedure.

Proof. This follows immediately from the result obtained by Mack and Silverman (1982). ∎

The following notation will be used in the proofs of the theorems. Let

S = \begin{pmatrix} S_{11} & S_{12} \\ S_{12}^T & S_{22} \end{pmatrix}

with

S_{11} = \Omega_{p-1} \otimes \begin{pmatrix} 1 & 0 \\ 0 & \mu_2 \end{pmatrix}, \qquad S_{12} = (\Omega_{p-1} \alpha_p) \otimes \begin{pmatrix} 1 & 0 & \mu_2 & 0 \\ 0 & \mu_2 & 0 & \mu_4 \end{pmatrix}

and

S_{22} = r_{pp} \begin{pmatrix} 1 & 0 & \mu_2 & 0 \\ 0 & \mu_2 & 0 & \mu_4 \\ \mu_2 & 0 & \mu_4 & 0 \\ 0 & \mu_4 & 0 & \mu_6 \end{pmatrix},

where \otimes denotes the Kronecker product. Let \tilde S be the matrix obtained from S by replacing the \mu_i by \nu_i. Set

S_0^{(i)} = \Omega_p(U_i) \otimes \begin{pmatrix} 1 & 0 \\ 0 & \mu_2 \end{pmatrix}, \qquad S_0 = S_0^{(i)} \big|_{U_i = u_0},

\beta_i = \mu_2 \sum_{j=1}^p a_j''(U_i) \begin{pmatrix} \Omega_{p-1}(U_i) \alpha_j(U_i) \\ r_{pj}(U_i) \end{pmatrix} \otimes \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \beta_0 = \beta_i \big|_{U_i = u_0},

and \bar\alpha = (\alpha_p^T, 1)^T. Put

A = I_{p-1} \otimes \operatorname{diag}(1, h_1), \qquad G = I_p \otimes \operatorname{diag}(1, h_0)

and

D = \operatorname{diag}(1, h_1, h_1^2, h_1^3), \qquad D_{h_2} = \operatorname{diag}(1, h_2, h_2^2, h_2^3).

We are now ready to prove the results.

Proof of Theorem 1. First, let us calculate the asymptotic conditional bias of \hat a_{p,1}(u_0). By Taylor's expansion,

Y = \tilde X_1 \big( a_1(u_0), a_1'(u_0), \ldots, a_{p-1}(u_0), a_{p-1}'(u_0), a_p(u_0), a_p'(u_0), \tfrac{1}{2} a_p''(u_0), \tfrac{1}{3!} a_p'''(u_0) \big)^T
  + \frac{1}{2} \Big( \sum_{j=1}^{p-1} a_j''(\xi_{1j}) (U_1 - u_0)^2 X_{1j}, \ \ldots, \ \sum_{j=1}^{p-1} a_j''(\xi_{nj}) (U_n - u_0)^2 X_{nj} \Big)^T
  + \frac{1}{4!} \big( a_p^{(4)}(\eta_1) (U_1 - u_0)^4 X_{1p}, \ \ldots, \ a_p^{(4)}(\eta_n) (U_n - u_0)^4 X_{np} \big)^T + \varepsilon,

where \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T and \xi_{ij}, \eta_i lie between U_i and u_0, for i = 1, \ldots, n and j = 1, \ldots, p - 1. Thus,

\hat a_{p,1}(u_0) = a_p(u_0) + \frac{1}{2} e_{2p-1,2p+2}^T (\tilde X_1^T W_1 \tilde X_1)^{-1} \tilde X_1^T W_1 \Big( \sum_{j=1}^{p-1} a_j''(\xi_{1j}) (U_1 - u_0)^2 X_{1j}, \ \ldots, \ \sum_{j=1}^{p-1} a_j''(\xi_{nj}) (U_n - u_0)^2 X_{nj} \Big)^T

  + \frac{1}{4!} e_{2p-1,2p+2}^T (\tilde X_1^T W_1 \tilde X_1)^{-1} \tilde X_1^T W_1 \big( a_p^{(4)}(\eta_1) (U_1 - u_0)^4 X_{1p}, \ \ldots, \ a_p^{(4)}(\eta_n) (U_n - u_0)^4 X_{np} \big)^T
  + e_{2p-1,2p+2}^T (\tilde X_1^T W_1 \tilde X_1)^{-1} \tilde X_1^T W_1 \varepsilon.

Obviously,

\tilde X_1^T W_1 \tilde X_1 = \begin{pmatrix} \tilde X_3^T W_1 \tilde X_3 & \tilde X_3^T W_1 \tilde X_2 \\ \tilde X_2^T W_1 \tilde X_3 & \tilde X_2^T W_1 \tilde X_2 \end{pmatrix}.

By calculating means and variances, one easily gets

\tilde X_3^T W_1 \tilde X_3 = n f(u_0) A S_{11} A (1 + o_P(1)),
\tilde X_3^T W_1 \tilde X_2 = n f(u_0) A S_{12} D (1 + o_P(1))

and

(6.1)  \tilde X_2^T W_1 \tilde X_2 = n f(u_0) D S_{22} D (1 + o_P(1)).

Combining the last three asymptotic expressions leads to

\tilde X_1^T W_1 \tilde X_1 = n f(u_0) \operatorname{diag}(A, D) \, S \, \operatorname{diag}(A, D) (1 + o_P(1)).

Similarly, we have

\tilde X_3^T W_1 \Big( \sum_{j=1}^{p-1} a_j''(\xi_{1j}) (U_1 - u_0)^2 X_{1j}, \ \ldots \Big)^T = n f(u_0) h_1^2 \sum_{j=1}^{p-1} a_j''(u_0) \, A \big[ (\Omega_{p-1} \alpha_j) \otimes (\mu_2, 0)^T \big] (1 + o_P(1))

and

\tilde X_2^T W_1 \Big( \sum_{j=1}^{p-1} a_j''(\xi_{1j}) (U_1 - u_0)^2 X_{1j}, \ \ldots \Big)^T = n f(u_0) h_1^2 \sum_{j=1}^{p-1} a_j''(u_0) \, D \, ( r_{pj} \mu_2, 0, r_{pj} \mu_4, 0 )^T (1 + o_P(1)).

Thus,

\tilde X_1^T W_1 \Big( \sum_{j=1}^{p-1} a_j''(\xi_{1j}) (U_1 - u_0)^2 X_{1j}, \ \ldots \Big)^T = n f(u_0) h_1^2 \sum_{j=1}^{p-1} a_j''(u_0) \operatorname{diag}(A, D) \begin{pmatrix} (\Omega_{p-1} \alpha_j) \otimes (\mu_2, 0)^T \\ ( r_{pj} \mu_2, 0, r_{pj} \mu_4, 0 )^T \end{pmatrix} (1 + o_P(1)).

So the asymptotic conditional bias of \hat a_{p,1}(u_0) is given by

\operatorname{bias}\{ \hat a_{p,1}(u_0) \mid \mathcal{X} \} = \frac{1}{2} h_1^2 \sum_{j=1}^{p-1} a_j''(u_0) \, e_{2p-1,2p+2}^T S^{-1} \begin{pmatrix} (\Omega_{p-1} \alpha_j) \otimes (\mu_2, 0)^T \\ ( r_{pj} \mu_2, 0, r_{pj} \mu_4, 0 )^T \end{pmatrix} (1 + o_P(1)).

Using the properties of the Kronecker product, this simplifies to

\operatorname{bias}\{ \hat a_{p,1}(u_0) \mid \mathcal{X} \} = \frac{h_1^2 \mu_2}{2 r_{pp}} \sum_{j=1}^{p-1} r_{pj} a_j''(u_0) + o_P(h_1^2).

We now calculate the asymptotic variance. Using an asymptotic argument similar to the above, it is easy to see that the asymptotic conditional variance of \hat a_{p,1}(u_0) is given by

\operatorname{var}\{ \hat a_{p,1}(u_0) \mid \mathcal{X} \} = e_{2p-1,2p+2}^T (\tilde X_1^T W_1 \tilde X_1)^{-1} \tilde X_1^T W_1 \Sigma W_1 \tilde X_1 (\tilde X_1^T W_1 \tilde X_1)^{-1} e_{2p-1,2p+2}
  = \frac{\sigma^2(u_0)}{n h_1 f(u_0)} e_{2p-1,2p+2}^T S^{-1} \tilde S S^{-1} e_{2p-1,2p+2} (1 + o_P(1)).

By the properties of the Kronecker product, it follows that

\operatorname{var}\{ \hat a_{p,1}(u_0) \mid \mathcal{X} \} = \frac{\sigma^2(u_0)}{n h_1 f(u_0)} \cdot \frac{\lambda_2 r_{pp} + \lambda_3 \alpha_p^T \Omega_{p-1} \alpha_p}{\lambda_1 r_{pp} (r_{pp} - \alpha_p^T \Omega_{p-1} \alpha_p)} (1 + o_P(1)),

where \lambda_1, \lambda_2 and \lambda_3 are as in Theorem 1. This establishes Theorem 1. ∎

Proof of Theorem 2. We first compute the asymptotic conditional bias. By Taylor's expansion,

Y = \tilde X_0^{(i)} \big( a_1(U_i), a_1'(U_i), \ldots, a_p(U_i), a_p'(U_i) \big)^T + \frac{1}{2} \Big( \sum_{j=1}^p a_j''(\xi_{1j}) (U_1 - U_i)^2 X_{1j}, \ \ldots, \ \sum_{j=1}^p a_j''(\xi_{nj}) (U_n - U_i)^2 X_{nj} \Big)^T + \varepsilon
  = \tilde X_0^{(i)} \big( a_1(U_i), a_1'(U_i), \ldots, a_p(U_i), a_p'(U_i) \big)^T

 + (1/2) (Σ_j a_j''(U_i)(U_1 − U_i)² X_{1j}, …, Σ_j a_j''(U_i)(U_n − U_i)² X_{nj})^T
 + (1/2) (Σ_j [a_j''(ξ_{1j}) − a_j''(U_i)](U_1 − U_i)² X_{1j}, …, Σ_j [a_j''(ξ_{nj}) − a_j''(U_i)](U_n − U_i)² X_{nj})^T + ε,

where ξ_{kj} is between U_i and U_k. Thus, for l = 1, …, p − 1,

â_l(U_i) = a_l(U_i) + (1/2) e_{2l−1,2p}^T (X_i^T W_i X_i)^{−1} X_i^T W_i (Σ_j a_j''(U_i)(U_1 − U_i)² X_{1j}, …, Σ_j a_j''(U_i)(U_n − U_i)² X_{nj})^T
 + (1/2) e_{2l−1,2p}^T (X_i^T W_i X_i)^{−1} X_i^T W_i (Σ_j [a_j''(ξ_{1j}) − a_j''(U_i)](U_1 − U_i)² X_{1j}, …, Σ_j [a_j''(ξ_{nj}) − a_j''(U_i)](U_n − U_i)² X_{nj})^T
 + e_{2l−1,2p}^T (X_i^T W_i X_i)^{−1} X_i^T W_i ε.

By Lemma 1, we have

(6.2)   X_i^T W_i X_i = nf(U_i) G S_i G (1 + o_P(1))

and

(6.3)   X_i^T W_i (Σ_j a_j''(U_i)(U_1 − U_i)² X_{1j}, …, Σ_j a_j''(U_i)(U_n − U_i)² X_{nj})^T = nf(U_i) h² G β_i (1 + o_P(1)).

Note that in our applications below, we only consider those U_i's which are in a neighborhood of u. By the continuity assumption, the term o_P(1) holds uniformly in i such that U_i falls in the neighborhood of u. Combining (6.2)

and (6.3), we have

(1/2) e_{2l−1,2p}^T (X_i^T W_i X_i)^{−1} X_i^T W_i (Σ_j a_j''(U_i)(U_1 − U_i)² X_{1j}, …, Σ_j a_j''(U_i)(U_n − U_i)² X_{nj})^T = (1/2) h² e_{2l−1,2p}^T S_i^{−1} β_i (1 + o_P(1)).

Note that K has a bounded support. From the last expression and the uniform continuity of the functions a_j'' in a neighborhood of u, it follows that

(6.4)   E[â_l(U_i)] − a_l(U_i) = (1/2) h² e_{2l−1,2p}^T S_i^{−1} β_i (1 + o_P(1)).

Since

(Y_1, …, Y_n)^T − (Σ_{j≠p} â_j(U_1) X_{1j}, …, Σ_{j≠p} â_j(U_n) X_{nj})^T
 = (a_p(U_1) X_{1p}, …, a_p(U_n) X_{np})^T + (Σ_{j≠p} [a_j(U_1) − â_j(U_1)] X_{1j}, …, Σ_{j≠p} [a_j(U_n) − â_j(U_n)] X_{nj})^T + ε,

it follows from (3.3) that

â_{p,2}(u) = a_p(u) + (1/4!) e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 (a_p^{(4)}(η_1)(U_1 − u)⁴ X_{1p}, …, a_p^{(4)}(η_n)(U_n − u)⁴ X_{np})^T

 + e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 (Σ_{j≠p} [a_j(U_1) − â_j(U_1)] X_{1j}, …, Σ_{j≠p} [a_j(U_n) − â_j(U_n)] X_{nj})^T
 + e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 ε
 ≡ a_p(u) + (1/4!) J_1 + J_2 + e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 ε.

By simple calculation we have

E[J_1] = h_2⁴ a_p^{(4)}(u) e_{1,2}^T ( 1 µ_2 ; µ_2 µ_4 )^{−1} (µ_4, µ_6)^T (1 + o_P(1)) = ((µ_4² − µ_2 µ_6) / (µ_4 − µ_2²)) h_2⁴ a_p^{(4)}(u) (1 + o_P(1)).

By (6.4), we have

E[J_2] = e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 (Σ_{j≠p} E[a_j(U_1) − â_j(U_1)] X_{1j}, …, Σ_{j≠p} E[a_j(U_n) − â_j(U_n)] X_{nj})^T

 = −(1/2) h² e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 (Σ_j e_{2j−1,2p}^T S_1^{−1} β_1 X_{1j}, …, Σ_j e_{2j−1,2p}^T S_n^{−1} β_n X_{nj})^T (1 + o_P(1))
 = −(h² / (2 r_pp)) Σ_j r_pj e_{2j−1,2p}^T S^{−1} β (1 + o_P(1)).

Combining the last two expressions, we obtain

bias(â_{p,2}(u)) = ( −(h² / (2 r_pp)) Σ_j r_pj e_{2j−1,2p}^T S^{−1} β + ((µ_4² − µ_2 µ_6) / (4! (µ_4 − µ_2²))) a_p^{(4)}(u) h_2⁴ ) (1 + o_P(1)).

By using the properties of the Kronecker product, we have

bias(â_{p,2}(u)) = (1/4!) ((µ_4² − µ_6 µ_2) / (µ_4 − µ_2²)) a_p^{(4)}(u) h_2⁴ − (µ_2 h² / (2 r_pp)) Σ_{j=1}^p r_pj a_j''(u) + o_P(h_2⁴ + h²).

This proves the bias expression in Theorem 2.

We now calculate the asymptotic variance. Recall B_n defined at the end of Section 3. Denote H = I − B_n. By (3.4), we have

(6.5)   var(â_{p,2}(u)) = e_{1,4}^T [ (X_2^T W_2 X_2)^{−1} X_2^T W_2 W_2 X_2 (X_2^T W_2 X_2)^{−1}
 − 2 (X_2^T W_2 X_2)^{−1} X_2^T W_2 H W_2 X_2 (X_2^T W_2 X_2)^{−1}
 + (X_2^T W_2 X_2)^{−1} X_2^T W_2 H H^T W_2 X_2 (X_2^T W_2 X_2)^{−1} ] e_{1,4} σ²(u).

Using similar arguments as before, we can show that

(6.6)   e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 W_2 X_2 (X_2^T W_2 X_2)^{−1} e_{1,4} = ((µ_4² ν_0 − 2 µ_4 µ_2 ν_2 + µ_2² ν_4) / (n h_2 f(u) r_pp (µ_4 − µ_2²)²)) σ²(u) (1 + o_P(1)).

Since

H W_2 X_2 = W_2 X_2 − (Σ_j X_{1j} e_{2j−1,2p}^T (X_1^T W_1 X_1)^{−1} X_1^T W_1 W_2 X_2, …, Σ_j X_{nj} e_{2j−1,2p}^T (X_n^T W_n X_n)^{−1} X_n^T W_n W_2 X_2)^T,

by Lemma 1, we have

(6.7)   X_i^T W_i W_2 X_2 = nf(U_i) σ²(U_i) K_{h_2}(U_i − u) G Ξ_i D_2 (1 + o_P(1)),

where Ξ_i = (ũ_i^{kl}) is a 2p × 4 matrix whose entries are given, for k = 1, …, p, by

ũ_i^{2k−1,0} = r_kp(U_i) + o_P(1),
ũ_i^{2k−1,1} = r_kp(U_i) (U_i − u)/h_2 + o_P(1),
ũ_i^{2k−1,2} = r_kp(U_i) ((U_i − u)/h_2)² + o_P(1),
ũ_i^{2k−1,3} = r_kp(U_i) ((U_i − u)/h_2)³ + o_P(1)

and

ũ_i^{2k,l} = o_P(1) ((U_i − u)/h_2)^l,   l = 0, 1, 2, 3.

Thus, we obtain

X_2^T W_2 H W_2 X_2 = nf(u) h_2 Σ_j r_pj e_{2j−1,2p}^T S^{−1} α σ²(u) D_2 (ν_{l+m})_{0≤l,m≤3} D_2 (1 + o_P(1)).

This and (6.1) together yield

(6.8)   e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 H W_2 X_2 (X_2^T W_2 X_2)^{−1} e_{1,4} = ((µ_4² ν_0 − 2 µ_4 µ_2 ν_2 + µ_2² ν_4) / (n h_2 f(u) r_pp² (µ_4 − µ_2²)²)) Σ_j r_pj e_{2j−1,2p}^T S^{−1} α σ²(u) (1 + o_P(1)).

Let

X̃_ki = (X_k1, X_k1 (U_k − U_i), …, X_kp, X_kp (U_k − U_i))^T   and   X̃_i = (X̃_1i, X̃_2i, …, X̃_ni)^T.

Then we have

X_2^T W_2 H H^T W_2 X_2 = (v_rs)_{0≤r,s≤3},

where

v_rs = Σ_{i=1}^n Σ_{l=1}^n X_ip X_lp (U_i − u)^r (U_l − u)^s K_{h_2}(U_i − u) K_{h_2}(U_l − u)
  × Σ_{k=1}^n Σ_j Σ_m X_ij e_{2j−1,2p}^T (X_i^T W_i X_i)^{−1} X̃_ki K_h(U_k − U_i) σ²(U_k) X̃_kl^T (X_l^T W_l X_l)^{−1} e_{2m−1,2p} X_lm K_h(U_k − U_l).

Using Lemma 1 and tedious calculation, we obtain

X_2^T W_2 H H^T W_2 X_2 = nf(u) σ²(u) h_2 Σ_j Σ_m r_pj r_pm e_{2j−1,2p}^T S^{−1} Q S^{−1} e_{2m−1,2p} D_2 (ν_{l+m})_{0≤l,m≤3} D_2 (1 + o_P(1)).

The combination of this and (6.1) gives

(6.9)   e_{1,4}^T (X_2^T W_2 X_2)^{−1} X_2^T W_2 H H^T W_2 X_2 (X_2^T W_2 X_2)^{−1} e_{1,4} = ((µ_4² ν_0 − 2 µ_4 µ_2 ν_2 + µ_2² ν_4) / (n h_2 f(u) r_pp² (µ_4 − µ_2²)²)) Σ_j Σ_m r_pj r_pm e_{2j−1,2p}^T S^{−1} Q S^{−1} e_{2m−1,2p} σ²(u) (1 + o_P(1)).

Substituting (6.6)-(6.9) into (6.5), we have

var(â_{p,2}(u)) = ((µ_4² ν_0 − 2 µ_4 µ_2 ν_2 + µ_2² ν_4) / (n h_2 f(u) r_pp² (µ_4 − µ_2²)²))
  × [ r_pp − 2 Σ_j r_pj e_{2j−1,2p}^T S^{−1} α + Σ_j Σ_m r_pj r_pm e_{2j−1,2p}^T S^{−1} Q S^{−1} e_{2m−1,2p} ] σ²(u) (1 + o_P(1)).

Using the properties of the Kronecker product and noting that 1_p − α_p r_pp = e_{p,p}, this simplifies to the variance expression stated in Theorem 2. This results in Theorem 2.

Acknowledgment. The authors thank an Associate Editor for helpful comments that led to improvement of the presentation of the paper.

REFERENCES

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Amer. Statist. Assoc.
Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-index models. J. Amer. Statist. Assoc.
Chen, R. and Tsay, R. S. (1993). Functional-coefficient autoregressive models. J. Amer. Statist. Assoc.
Cleveland, W. S., Grosse, E. and Shyu, W. M. (1991). Local regression models. In Statistical Models in S (J. M. Chambers and T. J. Hastie, eds.). Wadsworth/Brooks-Cole, Pacific Grove, CA.
Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.
Fan, J., Härdle, W. and Mammen, E. (1998). Direct estimation of additive and linear components for high dimensional data. Ann. Statist.
Fan, J. and Zhang, J. (2000). Two-step estimation of functional linear models with applications to longitudinal data. J. Roy. Statist. Soc. Ser. B 62. To appear.
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist.
Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.

Gu, C. and Wahba, G. (1993). Smoothing spline ANOVA with component-wise Bayesian confidence intervals. J. Comput. Graph. Statist.
Härdle, W. and Stoker, T. M. (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc.
Hastie, T. J. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. J. Roy. Statist. Soc. Ser. B 55.
Heckman, J., Ichimura, H., Smith, J. and Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica.
Hoover, D. R., Rice, J. A., Wu, C. O. and Yang, L. P. (1997). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika.
Li, K. C. (1991). Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc.
Mack, Y. P. and Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Z. Wahrsch. Verw. Gebiete.
Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Amer. Statist. Assoc.
Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc.
Shumway, R. H. (1988). Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Stone, C. J., Hansen, M., Kooperberg, C. and Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling. Ann. Statist.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B.
Wahba, G. (1984). Partial spline models for semiparametric estimation of functions of several variables. In Statistical Analysis of Time Series. Proceedings of the Japan-U.S. Joint Seminar, Tokyo. Institute of Statistical Mathematics, Tokyo.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.
Department ofstatistics University ofnorth Carolina Chapel Hill, North Carolina jfan@stat.ucla.edu Department ofstatistics Chinese University ofhong Kong Shatin Hong Kong


More information

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β

Introduction. Linear Regression. coefficient estimates for the wage equation: E(Y X) = X 1 β X d β d = X β Introduction - Introduction -2 Introduction Linear Regression E(Y X) = X β +...+X d β d = X β Example: Wage equation Y = log wages, X = schooling (measured in years), labor market experience (measured

More information

Does k-th Moment Exist?

Does k-th Moment Exist? Does k-th Moment Exist? Hitomi, K. 1 and Y. Nishiyama 2 1 Kyoto Institute of Technology, Japan 2 Institute of Economic Research, Kyoto University, Japan Email: hitomi@kit.ac.jp Keywords: Existence of moments,

More information

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS

ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Statistica Sinica 13(2003), 1201-1210 ESTIMATION OF NONLINEAR BERKSON-TYPE MEASUREMENT ERROR MODELS Liqun Wang University of Manitoba Abstract: This paper studies a minimum distance moment estimator for

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

Transformation and Smoothing in Sample Survey Data

Transformation and Smoothing in Sample Survey Data Scandinavian Journal of Statistics, Vol. 37: 496 513, 2010 doi: 10.1111/j.1467-9469.2010.00691.x Published by Blackwell Publishing Ltd. Transformation and Smoothing in Sample Survey Data YANYUAN MA Department

More information

Nonparametric Regression. Badr Missaoui

Nonparametric Regression. Badr Missaoui Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y

More information

Regularity of the density for the stochastic heat equation

Regularity of the density for the stochastic heat equation Regularity of the density for the stochastic heat equation Carl Mueller 1 Department of Mathematics University of Rochester Rochester, NY 15627 USA email: cmlr@math.rochester.edu David Nualart 2 Department

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer May 20, 2015 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression

More information

Lawrence D. Brown* and Daniel McCarthy*

Lawrence D. Brown* and Daniel McCarthy* Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals

More information

7 Semiparametric Estimation of Additive Models

7 Semiparametric Estimation of Additive Models 7 Semiparametric Estimation of Additive Models Additive models are very useful for approximating the high-dimensional regression mean functions. They and their extensions have become one of the most widely

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

Local Modelling with A Priori Known Bounds Using Direct Weight Optimization

Local Modelling with A Priori Known Bounds Using Direct Weight Optimization Local Modelling with A Priori Known Bounds Using Direct Weight Optimization Jacob Roll, Alexander azin, Lennart Ljung Division of Automatic Control Department of Electrical Engineering Linköpings universitet,

More information

Nonparametric Function Estimation with Infinite-Order Kernels

Nonparametric Function Estimation with Infinite-Order Kernels Nonparametric Function Estimation with Infinite-Order Kernels Arthur Berg Department of Statistics, University of Florida March 15, 2008 Kernel Density Estimation (IID Case) Let X 1,..., X n iid density

More information

A Conditional Approach to Modeling Multivariate Extremes

A Conditional Approach to Modeling Multivariate Extremes A Approach to ing Multivariate Extremes By Heffernan & Tawn Department of Statistics Purdue University s April 30, 2014 Outline s s Multivariate Extremes s A central aim of multivariate extremes is trying

More information

SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES. Liping Zhu, Mian Huang, & Runze Li. The Pennsylvania State University

SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES. Liping Zhu, Mian Huang, & Runze Li. The Pennsylvania State University SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES Liping Zhu, Mian Huang, & Runze Li The Pennsylvania State University Technical Report Series #10-104 College of Health and Human Development

More information

Local Polynomial Wavelet Regression with Missing at Random

Local Polynomial Wavelet Regression with Missing at Random Applied Mathematical Sciences, Vol. 6, 2012, no. 57, 2805-2819 Local Polynomial Wavelet Regression with Missing at Random Alsaidi M. Altaher School of Mathematical Sciences Universiti Sains Malaysia 11800

More information

Estimation in Covariate Adjusted Regression

Estimation in Covariate Adjusted Regression Estimation in Covariate Adjusted Regression Damla Şentürk 1 and Danh V. Nguyen 2 1 Department of Statistics, Pennsylvania State University University Park, PA 16802, U.S.A. email: dsenturk@stat.psu.edu

More information

MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES

MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES J. Korean Math. Soc. 47 1, No., pp. 63 75 DOI 1.4134/JKMS.1.47..63 MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES Ke-Ang Fu Li-Hua Hu Abstract. Let X n ; n 1 be a strictly stationary

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1

SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1 SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1 B Technical details B.1 Variance of ˆf dec in the ersmooth case We derive the variance in the case where ϕ K (t) = ( 1 t 2) κ I( 1 t 1), i.e. we take K as

More information

An application of the GAM-PCA-VAR model to respiratory disease and air pollution data

An application of the GAM-PCA-VAR model to respiratory disease and air pollution data An application of the GAM-PCA-VAR model to respiratory disease and air pollution data Márton Ispány 1 Faculty of Informatics, University of Debrecen Hungary Joint work with Juliana Bottoni de Souza, Valdério

More information

Introduction to Regression

Introduction to Regression Introduction to Regression Chad M. Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/100 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

The Pennsylvania State University VARYING-COEFFICIENT MODELS: NEW MODELS, INFERENCE PROCEDURES AND APPLICATIONS

The Pennsylvania State University VARYING-COEFFICIENT MODELS: NEW MODELS, INFERENCE PROCEDURES AND APPLICATIONS The Pennsylvania State University The Graduate School Department of Statistics VARYING-COEFFICIENT MODELS: NEW MODELS, INFERENCE PROCEDURES AND APPLICATIONS A Thesis in Statistics by Yang Wang c 27 Yang

More information

DIRECT AND INDIRECT LEAST SQUARES APPROXIMATING POLYNOMIALS FOR THE FIRST DERIVATIVE FUNCTION

DIRECT AND INDIRECT LEAST SQUARES APPROXIMATING POLYNOMIALS FOR THE FIRST DERIVATIVE FUNCTION Applied Probability Trust (27 October 2016) DIRECT AND INDIRECT LEAST SQUARES APPROXIMATING POLYNOMIALS FOR THE FIRST DERIVATIVE FUNCTION T. VAN HECKE, Ghent University Abstract Finite difference methods

More information

The Role of "Leads" in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji

The Role of Leads in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji he Role of "Leads" in the Dynamic itle of Cointegrating Regression Models Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji Citation Issue 2006-12 Date ype echnical Report ext Version publisher URL http://hdl.handle.net/10086/13599

More information

A review of some semiparametric regression models with application to scoring

A review of some semiparametric regression models with application to scoring A review of some semiparametric regression models with application to scoring Jean-Loïc Berthet 1 and Valentin Patilea 2 1 ENSAI Campus de Ker-Lann Rue Blaise Pascal - BP 37203 35172 Bruz cedex, France

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes

Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Alea 4, 117 129 (2008) Convergence rates in weighted L 1 spaces of kernel density estimators for linear processes Anton Schick and Wolfgang Wefelmeyer Anton Schick, Department of Mathematical Sciences,

More information

Bias Correction and Higher Order Kernel Functions

Bias Correction and Higher Order Kernel Functions Bias Correction and Higher Order Kernel Functions Tien-Chung Hu 1 Department of Mathematics National Tsing-Hua University Hsinchu, Taiwan Jianqing Fan Department of Statistics University of North Carolina

More information

Nonparametric regression with rescaled time series errors

Nonparametric regression with rescaled time series errors Nonparametric regression with rescaled time series errors José E. Figueroa-López Michael Levine Abstract We consider a heteroscedastic nonparametric regression model with an autoregressive error process

More information