A Difference Based Approach to Semiparametric Partial Linear Model

A Difference Based Approach to Semiparametric Partial Linear Model

Lie Wang, Lawrence D. Brown and T. Tony Cai
Department of Statistics, The Wharton School, University of Pennsylvania

Abstract

In this article, we consider a commonly used semiparametric partial linear model. We propose analyzing this model using a difference based approach. The procedure estimates the linear component based on the differences of the observations and then estimates the nonparametric component by a kernel estimate using the residuals of the linear fit. It is shown that both the estimator of the linear component and the estimator of the nonparametric component asymptotically perform as well as if the other component were known. The estimator of the linear component is asymptotically efficient and the estimator of the nonparametric component is asymptotically rate optimal. A test for linear combinations of the regression coefficients of the linear component is also developed. Both the estimation and the testing procedures are easily implementable. Numerical performance of the procedure is studied using both simulated and real data. In particular, we demonstrate our method in an analysis of an attitude data set as well as a data set from the Framingham Heart Study.

Keywords: Asymptotic efficiency; Difference-based method; Kernel method; Partial linear model; Semiparametric model.

The research of Tony Cai was supported in part by NSF Grant DMS

1 Introduction

Semiparametric models have received considerable attention in statistics and econometrics. They have a wide range of applications, from biomedical studies to economics. In these models, some of the relations are believed to be of certain parametric form while others are not easily parameterized. In this paper, we consider the following semiparametric partial linear model

Y_i = X_i'β + f(U_i) + ε_i,   i = 1, ..., n,   (1)

where X_i ∈ R^p, U_i ∈ R, β is an unknown vector of parameters, f(·) is an unknown function, and the ε_i are independent and identically distributed random noise with mean 0 and variance σ^2, independent of (X_i, U_i).

The semiparametric partial linear model has been extensively studied and several approaches have been developed to construct the estimators. A penalized least-squares method was used in, for example, Wahba (1984), Engle, Granger, Rice and Weiss (1986), and Chen and Shiau (1991). A partial residual method was used, for example, in Cuzick (1992), and a profile likelihood approach was used in Severini and Wong (1992) and Carroll, Fan, Gijbels and Wand (1997). In addition, Fan, Härdle and Mammen (1998) have shown that an additive nonparametric component can be estimated as well as if the other components were known. Similar oracle properties were obtained by Linton (1997) and Mammen, Linton and Nielsen (1999). Several methods for estimating the additive functions have been proposed, including the marginal integration estimation methods of Tjøstheim and Auestad (1994) and Linton and Nielsen (1995), the backfitting algorithms of Buja, Hastie and Tibshirani (1989) and Opsomer and Ruppert (1998), the estimating equation methods of Mammen et al. (1999), and the Fourier series approximation approach of Amato, Antoniadis and De Feis (2002), among others.
The issue of achieving the information bound in this and other non- and semiparametric models has been examined by Ritov and Bickel (1990) and extensively discussed in Bickel, Klassen, Ritov and Wellner (1993). In this article we consider a difference based estimation method for the semiparametric partial linear model (1). The estimation procedure is optimal in the sense that the estimator of the linear component is asymptotically efficient and the estimator of the nonparametric component is asymptotically minimax rate optimal. The method of taking differences to eliminate the effect of the unknown nonparametric component has been used in both nonparametric and semiparametric settings. Rice (1984) introduced a first-order differencing estimator in a nonparametric regression model for estimating the variance of the random errors. Hall, Kay and Titterington (1990) and Munk, Bissantz, Wagner and Freitag (2005)

extended the idea to higher-order differences for efficient estimation of the variance in such a setting. Horowitz and Spokoiny (2001) used differencing for testing between a parametric model and a nonparametric alternative. In the semiparametric regression setting (1), Yatchew (1997) concentrated attention on estimation of the linear component and used differencing to eliminate bias induced by the presence of the nonparametric component. The properties of the resulting estimator of the linear component were considered under the condition that the nonparametric function f has a bounded first derivative. See also Fan and Huang (2001) and Lam and Fan (2007).

In this paper we treat estimation of both the linear and the nonparametric components. We use higher-order differences, built from a special class of difference sequences, for optimal efficiency in estimating the linear part. How the linear component is estimated after taking differences is important. It is interesting to note that, although the differences are correlated, the correlation should be ignored and the linear regression coefficient vector β should be estimated by the ordinary least squares estimator rather than by a generalized least squares estimator that takes into account the correlations among the differences. If the correlation structure is incorporated in the estimation, the resulting generalized least squares estimator will not be optimal.

Our procedure begins with taking differences of the ordered observations (ordered according to the values of U_i). Let d_t, t = 1, 2, ..., m + 1, be an order-m difference sequence that satisfies Σ_t d_t = 0 and Σ_t d_t^2 = 1. For i = 1, 2, ..., n − m, let

D_i = Σ_{t=1}^{m+1} d_t Y_{i+t−1}.   (2)

D_i can be seen as the mth order difference of Y_i. The goal of this step is to eliminate the effect of the nonparametric component f.
As we will show, by taking differences the effect of the parametric component is retained while the effect of the nonparametric component f becomes negligible, provided that the function f is suitably smooth. After this step, the problem reduces to a standard multiple linear regression problem. We then proceed as if there were no correlation among the differences {D_i} and estimate the linear regression coefficient vector β by the ordinary least squares estimator based on the differences. This estimator of β is shown to be asymptotically efficient. In this step it is important to ignore the correlation among the differences. The estimate of the nonparametric function f is then constructed by a kernel estimate based on the residuals of the linear fit. This estimator is shown to attain the optimal rate of convergence. The results show that in the case where the U_i are either fixed, or random and independent of the X_i, both the linear and nonparametric components of the semiparametric model are estimated as well as

if the other component were known. Hence the estimation procedure has a very desirable oracle property. Moreover, we also derive a test for linear combinations of the regression coefficients of the linear component. The test is fully specified, and the test statistic is shown to asymptotically have the usual F distribution under the null hypothesis. Both the estimation and the testing procedures are easily implementable. Numerical performance of the estimation procedure is studied using both simulated and real data. The simulation results are consistent with the theoretical findings: the effect of the nonparametric component on the linear component is negligible, and vice versa. In addition, the simulations show that the effect of the order of the difference sequence on the estimation accuracy is quite small. As real data applications, we demonstrate our estimation and testing methods in an analysis of an attitude data set discussed in Chatterjee and Price (1977) and a data set from the Framingham Heart Study.

The paper is organized as follows. We begin in Section 2 with the simpler case where U_i = i/n is fixed, to illustrate the whole procedure. In Section 3 we treat the general case where the U_i are random and possibly correlated with the X_i, and the main results are given. In Section 4, we introduce the testing problem. A simulation study is carried out in Section 5 to study the numerical performance of the procedure; real data applications are also discussed. The proofs are contained in Section 6.

2 Fixed Design

In this section, we consider the fixed design version of the semiparametric partial linear model (1) where the U_i = i/n are fixed. The covariates X_i in the linear component are assumed to be random. This version is easier to analyze than the case where the U_i are also random; we shall treat the random design version in Section 3. Let X_i = (X_{i1}, X_{i2}, ..., X_{ip})' be p-dimensional independent random vectors with covariance matrix Σ_X.
Define the Lipschitz ball Λ^α(M) in the usual way:

Λ^α(M) = {g : for all 0 ≤ x, y ≤ 1 and k = 0, ..., ⌊α⌋, |g^(k)(x)| ≤ M, and |g^(⌊α⌋)(x) − g^(⌊α⌋)(y)| ≤ M |x − y|^{α−⌊α⌋}}

where ⌊α⌋ is the largest integer less than α. Suppose f ∈ Λ^α(M) for some α > 0. Then the partial linear model (1) can be written as

Y_i = X_i'β + f(U_i) + ε_i = X_{i1}β_1 + X_{i2}β_2 + ... + X_{ip}β_p + f(U_i) + ε_i.   (3)

The goal is to estimate the coefficient vector β and the unknown function f. This will be done through difference based estimation. To achieve asymptotic efficiency, the usual first-order differences are not sufficient and higher-order differences are needed. In fact, the order of differences, m, needs to increase as the sample size n increases.

We begin by defining a suitable difference sequence. Suppose a sequence d_1, d_2, ..., d_{m+1} satisfies Σ_i d_i = 0 and Σ_i d_i^2 = 1. Such a sequence is called an mth order difference sequence. Moreover, for k = 1, 2, ..., m let c_k = Σ_{i=1}^{m+1−k} d_i d_{i+k}. Suppose

Σ_{k=1}^m c_k^2 = O(m^{−1}) as m → ∞.   (4)

One example of a sequence that satisfies these conditions is

d_1 = √(m/(m+1)),   d_2 = d_3 = ... = d_{m+1} = −1/√(m(m+1)).   (5)

Remark 1. The asymptotic results in the theorems to follow require that the order m → ∞ and that (4) be satisfied. However, even the simple choice of m = 2 seems to yield quite satisfactory performance, as attested by the simulations and real data examples in Section 5.

We now consider the difference based estimator of β. Let

D_i = Σ_{t=1}^{m+1} d_t Y_{i+t−1},   i = 1, 2, ..., n − m.   (6)

Then

D_i = Z_i'β + δ_i + w_i,   i = 1, 2, ..., n − m,   (7)

where Z_i = Σ_{t=1}^{m+1} d_t X_{i+t−1}, δ_i = Σ_{t=1}^{m+1} d_t f(U_{i+t−1}), and w_i = Σ_{t=1}^{m+1} d_t ε_{i+t−1}. Written in matrix form, (7) becomes

D = Zβ + δ + w   (8)

where D = (D_1, D_2, ..., D_{n−m})', w = (w_1, w_2, ..., w_{n−m})' and Z is the matrix whose ith row is Z_i'. Let Ψ denote the (n − m) × (n − m) covariance matrix of w.
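The difference sequence (5) and the differencing step (2) are straightforward to implement. The following is a minimal NumPy sketch; the function names are ours, not the paper's:

```python
import numpy as np

def difference_sequence(m):
    """Order-m difference sequence from equation (5):
    d_1 = sqrt(m/(m+1)), d_2 = ... = d_{m+1} = -1/sqrt(m(m+1)).
    It satisfies sum(d) = 0 and sum(d^2) = 1 by construction."""
    d = np.full(m + 1, -1.0 / np.sqrt(m * (m + 1)))
    d[0] = np.sqrt(m / (m + 1.0))
    return d

def apply_differences(y, d):
    """Compute D_i = sum_t d_t * y_{i+t-1} for i = 1, ..., n - m, as in (2)."""
    m = len(d) - 1
    n = len(y)
    return np.array([d @ y[i:i + m + 1] for i in range(n - m)])

d = difference_sequence(4)   # an order-4 sequence
```

Applying `apply_differences` to the ordered responses eliminates a smooth f asymptotically, since the weights sum to zero and neighboring U_i are close.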

The elements of the matrix Ψ are given by

Ψ_{i,j} = 1 for i = j,   Ψ_{i,j} = c_{|i−j|} for 1 ≤ |i − j| ≤ m,   Ψ_{i,j} = 0 otherwise.   (9)

In (7), the δ_i are the deterministic errors related to the nonparametric component f in (1), and the w_i are the random errors, which are correlated and have the covariance matrix Ψ = (Ψ_{i,j}) given by (9). For estimating the linear regression coefficient vector β, we shall ignore both the deterministic errors δ_i and the correlation among the random errors w_i. That is, we estimate β by the ordinary least squares estimate

β̂ = (Z'Z)^{−1} Z'D.   (10)

Although not entirely intuitive, it is important in this step to ignore the correlation among the w_i and use the ordinary least squares estimate. If instead a generalized least squares estimator, which incorporates the correlation structure, is used, the optimality results in Theorem 1 below and Theorem 3 in the next section will not generally be valid.

Theorem 1. Suppose α > 0, m → ∞, m/n → 0, and that Condition (4) is satisfied. Then the estimator β̂ given in (10) is asymptotically efficient, i.e.

√n(β̂ − β) →_L N(0, σ^2 Σ_X^{−1})

where Σ_X = Var(X_1). Hence the coefficient β is estimated as well as if the nonparametric component were not present, and β̂ is asymptotically efficient.

Remark 2. Using the generalized least squares method on the differences is equivalent to applying the ordinary least squares regression of Y on X in the original model (1). This would cause significant bias due to the presence of f. See Section 5 for a numerical comparison.

Remark 3. Our proof shows that √n(β̂ − β) → N(0, σ^2 (1 + 2 Σ_{k=1}^m c_k^2) Σ_X^{−1}). So when condition (4) fails, our results cannot hold. This means condition (4) is necessary for our procedure.

Once we have the estimator β̂ for the regression coefficients, we can plug it back into the original model (1) and remove the effect of the linear component. For i = 1, 2, ..., n the residuals of the linear fit are

r_i = Y_i − X_i'β̂ = f(U_i) + X_i'(β − β̂) + ε_i.
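As a numerical illustration of the estimator (10), the sketch below simulates the fixed design model with p = 1 and a smooth nuisance function of our own choosing (the sinusoid is illustrative, not from the paper), differences the data with the m = 2 sequence from (5), and runs ordinary least squares on the differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, beta = 500, 2, 1.5
u = np.arange(1, n + 1) / n                  # fixed design U_i = i/n
x = rng.normal(4 * u, 1.0)                   # X_i with mean depending on U_i
f = np.sin(2 * np.pi * u)                    # illustrative smooth nuisance f
y = x * beta + f + rng.normal(0, 1.0, n)

# order-2 difference sequence from (5): d_1 = sqrt(m/(m+1)), rest = -1/sqrt(m(m+1))
d = np.array([np.sqrt(m / (m + 1)), -1 / np.sqrt(m * (m + 1)), -1 / np.sqrt(m * (m + 1))])

# difference the responses and covariates, as in (6)-(8)
D = np.array([d @ y[i:i + m + 1] for i in range(n - m)])
Z = np.array([d @ x[i:i + m + 1] for i in range(n - m)])

# ordinary least squares on the differences, equation (10) (p = 1, so scalars)
beta_hat = (Z @ D) / (Z @ Z)
```

Because f is smooth and the design points are 1/n apart, the deterministic errors δ_i are negligible and β̂ lands close to the true β even though f is never estimated.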

The nonparametric component f is estimated by a kernel estimate based on the residuals r_i. Let k(x) be a kernel function satisfying ∫k(x)dx = 1 and ∫x^j k(x)dx = 0 for j = 1, 2, ..., ⌊α⌋. Take h = n^{−1/(1+2α)} and let

K_{i,h}(u) = (1/h) ∫_{(U_{i−1}+U_i)/2}^{(U_i+U_{i+1})/2} k((u − v)/h) dv,   i = 1, 2, ..., n.

The estimator f̂ is then given by

f̂(u) = Σ_{i=1}^n K_{i,h}(u) r_i = Σ_{i=1}^n K_{i,h}(u)(Y_i − X_i'β̂).   (11)

We have the following theorem for the estimator f̂.

Theorem 2. For all α > 0, the estimator f̂ given in (11) satisfies

sup_{f ∈ Λ^α(M)} E[∫ (f̂(x) − f(x))^2 dx] ≤ C n^{−2α/(1+2α)}

for some constant C > 0. Moreover, for any x_0 ∈ (0, 1),

sup_{f ∈ Λ^α(M)} E[(f̂(x_0) − f(x_0))^2] ≤ C n^{−2α/(1+2α)}.

It is well known in nonparametric regression that the minimax rate of convergence for estimating f over the Lipschitz ball Λ^α(M) is n^{−2α/(1+2α)} under both the global integrated squared error loss and the pointwise squared error loss. That is,

inf_{f̂} sup_{f ∈ Λ^α(M)} E[∫ (f̂(x) − f(x))^2 dx] ≍ inf_{f̂} sup_{f ∈ Λ^α(M)} E[(f̂(x_0) − f(x_0))^2] ≍ n^{−2α/(1+2α)},

where for two sequences a_n > 0 and b_n > 0, a_n ≍ b_n means that there exist constants c_2 ≥ c_1 > 0 such that c_1 ≤ a_n/b_n ≤ c_2 for all n. Hence Theorem 2 shows that the estimator f̂ given in (11) attains the optimal rate of convergence over the Lipschitz ball Λ^α(M) under both the global and local losses.

Remark 4. Instead of using the kernel method to estimate the nonparametric component f, other methods such as wavelet thresholding can also be used. Wavelet estimators can achieve better spatial adaptivity over more general function spaces than the Lipschitz ball Λ^α(M).
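The residual-smoothing step can be sketched as follows. For simplicity this uses a Nadaraya-Watson smoother with a Gaussian kernel as a stand-in for the bin-integrated kernel weights K_{i,h} in (11), with an illustrative fixed bandwidth of the order n^{−1/(1+2α)}; the data-generating f is again our own choice:

```python
import numpy as np

def kernel_fit(u_eval, u, r, h):
    """Nadaraya-Watson smoother with a Gaussian kernel: a simplified
    stand-in for the kernel weights K_{i,h} in equation (11)."""
    w = np.exp(-0.5 * ((u_eval[:, None] - u[None, :]) / h) ** 2)
    return (w * r).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(1)
n = 500
u = np.sort(rng.uniform(0, 1, n))
f = np.sin(2 * np.pi * u)              # illustrative f
r = f + rng.normal(0, 0.3, n)          # stand-in for residuals Y_i - X_i' beta_hat
h = 0.05                               # illustrative bandwidth of order n^{-1/(1+2a)}
f_hat = kernel_fit(u, u, r, h)
```

Since β̂ is root-n consistent, smoothing the residuals r_i behaves asymptotically like smoothing f(U_i) + ε_i directly, which is what drives the oracle rate in Theorem 2.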

3 Random Design

We now turn to the random design version of the partial linear model (1) where both X_i and U_i are assumed to be random. Again let the X_i be p-dimensional random vectors. Let the U_i be random variables on [0, 1] and suppose that (X_i, U_i), i = 1, ..., n, are independent with an unknown joint density function g(x, u). Assume the ε_i are independent and identically distributed with mean 0 and variance σ^2 and are also independent of (X_i, U_i). Let h(U) = E(X | U) and S(U) = E(XX' | U). Suppose f(u) ∈ Λ^α(M_f) and h(u) ∈ Λ^γ(M_h) for some α > 0 and γ > 0. (When X is a vector, this means each coordinate of h(u) lies in Λ^γ(M_h).) Moreover, suppose the marginal density of U is bounded away from 0, i.e., there exists a constant c_1 > 0 such that ∫g(x, u)dx ≥ c_1 for any u ∈ [0, 1].

Suppose U_(1) ≤ U_(2) ≤ ... ≤ U_(n) are the order statistics of the U_i and the X_(i) are the corresponding X's. Note that the X_(i) are not the order statistics of the X_i, but the X associated with U_(i). Similar to the fixed design case, we take the mth order differences

D_i = Σ_{t=1}^{m+1} d_t Y_(i+t−1) = Z_i'β + δ_i + w_i

where Z_i = Σ_{t=1}^{m+1} d_t X_(i+t−1), δ_i = Σ_{t=1}^{m+1} d_t f(U_(i+t−1)), and w_i = Σ_{t=1}^{m+1} d_t ε_(i+t−1). Again we estimate the linear regression coefficient vector β by the ordinary least squares estimate

β̂ = (Z'Z)^{−1} Z'D.   (12)

In using the ordinary least squares estimator β̂ given in (12) we have ignored the errors δ_i, the correlation between Z_i and δ_i, and the correlation among the w_i. We have the following theorem.

Theorem 3. When α + γ > 1/2 and S(u) > 0 for every u, the estimator β̂ is asymptotically efficient, i.e.

√n(β̂ − β) →_L N(0, σ^2 Σ_{X|U}^{−1})

where Σ_{X|U} = E[(X_1 − E(X_1 | U_1))(X_1 − E(X_1 | U_1))'].

Remark 5. We can see from this theorem that we do not always need α > 1/2 to ensure asymptotic efficiency. We only need one of the two functions f(u) and h(u) = E(X | U = u) to have minimal smoothness. Theorem 1 can be considered a special case where γ is infinity. When the condition α + γ > 1/2 fails, Theorem 3 does not hold. In this case

√n(β̂ − β) may not be asymptotically normal. To see this, consider the following example. Take p = 1 and f(u) = h(u) ∈ Λ^α(M) for 0 < α < 1/4. Suppose X_i and ε_i both follow the standard normal distribution. Set X̂_i = X_i − h(U_i) and τ_i = (X̂_i β + ε_i)/√(1 + β^2). Then the τ_i are iid N(0, 1) and model (1) can be written as

Y_i = h(U_i)(1 + β) + √(1 + β^2) τ_i.   (13)

Treating h(U_i)(1 + β) as an unknown mean function, the problem of estimating the regression coefficient β becomes that of estimating the variance in model (13). It follows from Wang, Brown, Cai and Levine (2006) that the minimax rate for this variance estimation problem is n^{−4α} ∨ n^{−1}. Therefore Theorem 3 fails in this case.

Remark 6. When X and U are independent, E(X | U) = E(X). In this case, the result becomes √n(β̂ − β) →_L N(0, σ^2 Σ_X^{−1}). This is the same as the fixed design case, and the estimator β̂ performs as well as if the nonparametric component were known.

Remark 7. Yatchew (1997) considered the partial linear model (1) under the conditions that both f and h have bounded first derivatives, i.e., α = 1 and γ = 1, and obtained similar results. In this case the condition α + γ > 1/2 of Theorem 3 is easily satisfied.

Once we have an estimate of β, we can use the same procedure to estimate f(u) as in the fixed design case, and we have the same result.

Theorem 4. For all α > 0, the estimator f̂ given in (11) satisfies

sup_{f ∈ Λ^α(M)} E[∫ (f̂(x) − f(x))^2 dx] ≤ C n^{−2α/(1+2α)}

for some constant C > 0. Moreover, for any x_0 ∈ (0, 1),

sup_{f ∈ Λ^α(M)} E[(f̂(x_0) − f(x_0))^2] ≤ C n^{−2α/(1+2α)}.

Hence in this case the estimator f̂ also attains the optimal rate of convergence over the Lipschitz ball Λ^α(M) under both the global and local losses. The proof of Theorem 3 is given in Section 6. The following lemma is one of the main technical tools for the proof, and it is also useful in the development of the test given in Section 4.
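The random design procedure of Theorem 3, order the data by U, difference, then run ordinary least squares, can be sketched as follows. The data-generating design (two covariates with conditional means depending on U, and a cosine nuisance function) is our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 800, 2
u = rng.uniform(0, 1, n)
x = np.column_stack([rng.normal(u, 1.0), rng.normal(2 * u, 1.0)])  # X correlated with U
beta = np.array([2.0, -1.0])
y = x @ beta + np.cos(2 * np.pi * u) + rng.normal(0, 1.0, n)       # illustrative f

# order the observations by U, then difference the ordered data
idx = np.argsort(u)
xs, ys = x[idx], y[idx]
d = np.array([np.sqrt(m / (m + 1)), -1 / np.sqrt(m * (m + 1)), -1 / np.sqrt(m * (m + 1))])
D = np.array([d @ ys[i:i + m + 1] for i in range(n - m)])
Z = np.array([d @ xs[i:i + m + 1] for i in range(n - m)])           # rows Z_i

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ D)                        # OLS estimate (12)
```

Here both f and h(u) = E(X | U = u) are smooth, so the condition α + γ > 1/2 of Theorem 3 is comfortably satisfied and the correlation between X and U does not bias β̂.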

Lemma 1. Under the assumptions of Theorem 3, we have

Z'Z/n →_P Σ_{X|U} as n → ∞,

Z'ΨZ/n →_P (1 + 2 Σ_{k=1}^m c_k^2) Σ_{X|U} as n → ∞,

Z'δδ'Z/n = O_p(n^{−2α}) + O_p(n^{1−2α−2γ}) as n → ∞.

4 Testing the Linear Component

We have so far focused on the estimation problem. In many contexts, hypothesis testing is also important under the semiparametric partial linear model (1). For example, it is often of interest to test the hypothesis that a subset of the coefficients in the linear component is equal to zero. In this section, we consider the problem of testing the null hypothesis that the linear regression coefficients satisfy certain linear constraints. That is, we wish to test H_0 : Cβ = 0 against H_1 : Cβ ≠ 0, where C is an r × p matrix with rank(C) = r. A special case is testing the hypothesis H_0 : β_{i_1} = ... = β_{i_r} = 0. In this section, we shall assume the errors ε_i are independent and identically distributed N(0, σ^2) variables.

4.1 Fixed design or independent case

We divide the testing problem into two cases. We first consider the case where U_i = i/n (fixed design) or the U_i are random but independent of the X_i. From the previous sections, we know that asymptotically in this case the estimator β̂ of the linear regression coefficient vector β satisfies √n(β̂ − β) → N(0, σ^2 Σ_X^{−1}). This means that asymptotically √n(Cβ̂ − Cβ) → N(0, σ^2 CΣ_X^{−1}C'). So our test statistic will be based on nσ^{−2} β̂'C'(CΣ_X^{−1}C')^{−1}Cβ̂. Both the covariance matrix Σ_X and the error variance σ^2 are unknown in general and thus need to be estimated. It follows from Lemma 1 that Z'Z/n →_{a.s.} Σ_X in this case. If σ^2 is given, the test statistic

σ^{−2} β̂'C'(C(Z'Z)^{−1}C')^{−1}Cβ̂

has a limiting χ^2 distribution with r degrees of freedom.
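The test statistic is simple to compute from the differenced data. The sketch below simulates a fixed design with p = 2 (the design and the sinusoidal nuisance f are our illustrative choices), estimates σ^2 by the normalized residual sum of squares ‖HD‖^2/(n − m − p) from the difference regression, and forms the F statistic for H_0 : β_2 = 0, which is false here so the statistic should be large:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 600, 2, 2
u = np.arange(1, n + 1) / n
x = rng.normal(size=(n, p))                  # covariates independent of U
beta = np.array([1.0, 1.0])
y = x @ beta + np.sin(2 * np.pi * u) + rng.normal(0, 1.0, n)   # sigma^2 = 1

d = np.array([np.sqrt(m / (m + 1)), -1 / np.sqrt(m * (m + 1)), -1 / np.sqrt(m * (m + 1))])
D = np.array([d @ y[i:i + m + 1] for i in range(n - m)])
Z = np.array([d @ x[i:i + m + 1] for i in range(n - m)])

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ D

# variance estimate ||HD||^2 / (n - m - p), where HD is the OLS residual of D on Z
resid = D - Z @ beta_hat
sigma2_hat = resid @ resid / (n - m - p)

# F statistic for H0: C beta = 0 with C = [0 1], i.e. beta_2 = 0
C = np.array([[0.0, 1.0]])
r = C.shape[0]
Cb = C @ beta_hat
F = (Cb @ np.linalg.solve(C @ ZtZ_inv @ C.T, Cb) / r) / sigma2_hat
```

Under H_0 the statistic would be compared with the 1 − α quantile of the F(r, n − m − p) distribution; here β_2 = 1, so F is far beyond any reasonable critical value.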

To estimate the error variance σ^2, set H = I − Z(Z'Z)^{−1}Z'. It is easy to check that H is a projection of rank n − m − p, i.e., H^2 = H and rank(H) = n − m − p. We shall use σ̂^2 = ‖HD‖_2^2/(n − m − p) to estimate σ^2. Note that

HD = D − Z(Z'Z)^{−1}Z'D = H(δ + w).

Write

‖HD‖_2^2 = (δ + w)'H(δ + w) = w'Hw + 2w'Hδ + δ'Hδ.

Suppose α > 1/2. Then it is easy to see that δ'Hδ →_{a.s.} 0 as n → ∞. Since w'Hδ | δ ~ N(0, σ^2 δ'Hδ), we know that w'Hδ →_{a.s.} 0. Here we also assume that the first term of the difference sequence satisfies d_1^2 = O(m^{−1}) (the sequence given in (5) satisfies this condition). It can be shown that σ^{−2} w'Hw is approximately distributed as chi-squared with n − m − p degrees of freedom. We thus use σ̂^2 = ‖HD‖_2^2/(n − m − p) as an estimator of σ^2. We have the following theorem.

Theorem 5. Suppose α > 1/2 and d_1^2 = O(m^{−1}). For testing H_0 : Cβ = 0 against H_1 : Cβ ≠ 0, where C is an r × p matrix with rank(C) = r, the test statistic

F = [β̂'C'(C(Z'Z)^{−1}C')^{−1}Cβ̂ / r] / σ̂^2

asymptotically follows the F(r, n − m − p) distribution under the null hypothesis. Moreover, the asymptotic power of this test (at local alternatives) is the same as that of the usual F test when f is not present in the model (1).

Hence an approximate level-α test of H_0 : Cβ = 0 against H_1 : Cβ ≠ 0 is to reject the null hypothesis H_0 when the test statistic F ≥ F_{r,n−m−p;α}, where F_{r,n−m−p;α} is the 1 − α quantile of the F(r, n − m − p) distribution.

4.2 General random design case

We now turn to the testing problem in the general random design case where the U_i are random and correlated with the X_i. Again suppose that (X_i, U_i), i = 1, ..., n, are independent with an unknown joint density function g(x, u). We will show that the same F test also works in this case. Notice that in this case, asymptotically

√n(Cβ̂ − Cβ) →_L N(0, σ^2 CΣ_{X|U}^{−1}C'),

where Σ_{X|U} = E[(X_1 − E(X_1 | U_1))(X_1 − E(X_1 | U_1))']. Lemma 1 shows that Z'Z/n converges to Σ_{X|U}. Based on this observation and the discussion given in Section 4.1, we have the following theorem.

Theorem 6. Suppose α > 1/2 and d_1^2 = O(m^{−1}). For testing H_0 : Cβ = 0 against H_1 : Cβ ≠ 0, where C is an r × p matrix with rank(C) = r, the test statistic

F = [β̂'C'(C(Z'Z)^{−1}C')^{−1}Cβ̂ / r] / σ̂^2

asymptotically follows the F(r, n − m − p) distribution under the null hypothesis.

5 Numerical Study

The difference based procedure for estimating the linear coefficients and the unknown function introduced in the previous sections is easily implementable. In this section we investigate the numerical performance of the estimator using both simulations and analysis of real data. In the simulation studies we examine the effect of the nonparametric component f on the estimation accuracy of the linear component, as well as the effect of the presence of the linear component on the estimation accuracy of f. In addition, the effect of the order of the difference sequence is considered. We also apply our estimation and testing procedures to the analysis of the attitude data from Chatterjee and Price (1977) and a data set from the Framingham Heart Study.

5.1 Simulation

We first study the effect of the unknown nonparametric component f on the estimation accuracy of the linear component and then investigate the effect of the order of the difference sequence. In the first simulation study, we take n = 500, U_i iid Uniform(0, 1), and consider the following four test functions:

f_1(x) = 3 − 30x for 0 ≤ x ≤ 0.1,  20x − 1 for 0.1 < x ≤ 0.25,  4 + (4x − 1)^{8/9} for 0.25 < x ≤ 0.725,  (x − 0.725) for 0.725 < x ≤ 0.89,  (x − 0.89)/ for 0.89 < x ≤ 1,

f_2(x) = 1 + 4(e^{−550(x−0.2)^2} + e^{−200(x−0.8)^2} + e^{−950(x−0.8)^2}),

where f_3(x) = Σ_{j=1}^{11} h_j (1 + |x − x_j|/w_j)^{−4}, with

(x_j) = (0.10, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81),
(h_j) = (4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2),
(w_j) = (0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005),

and

f_4(x) = √(x(1 − x)) sin(2.1π/(x + 0.05)).

The test functions f_3 and f_4 are the Bumps and Doppler functions given in Donoho and Johnstone (1994). In the simulations these functions are normalized to have unit variance. The four functions are plotted in Figure 1 below.

Figure 1: Plots of the four test functions.

We also consider the case where f ≡ 0 for comparison. The errors ε_i are generated from the standard normal distribution. For X_i and β, we consider two cases:

Case (1). p = 1, X_i ~ N(4U_i, 1), β = 1;

Case (2). p = 3, X_i ~ N((U_i, 2U_i, 4U_i^2)', I_3), β = (2, 2, 4)', where I_3 denotes the 3 × 3 identity matrix.

We first examine the effect of the unknown function f on the estimation of the linear component. In this part, the difference sequence in equation (5) with m = 2 is used. The mean squared errors (MSEs) of the estimator β̂ are calculated over 100 simulation runs. Note that for Case (2) the average squared error (1/3)[(β̂_1 − β_1)^2 + (β̂_2 − β_2)^2 + (β̂_3 − β_3)^2] is computed in each simulation and then averaged over the 100 simulations. We also consider the case where the presence of f is completely ignored and we directly run the least squares regression of Y on X in model (1). This is equivalent to estimating β by the generalized least squares regression of D on Z in (8) after taking the differences. The results are summarized in Table 1; the numbers inside the parentheses are the MSEs of the estimate when the nonparametric component is ignored. By comparing the MSEs in each row, it can easily be seen that after taking the differences the effect of the nonparametric component f on the estimation accuracy of β is fairly small in this study. This means that in many cases we can estimate the linear coefficients nearly as well as if f were known. On the other hand, if f is simply ignored and β is estimated by applying the least squares regression of Y on X directly, the estimator is highly inaccurate: the mean squared errors are from 2 to over 600 times as large as those of the corresponding estimators based on the differences.

           f ≡ 0    f_1 (ignored)    f_2 (ignored)    f_3 (ignored)    f_4 (ignored)
Case (1)            (.942)           (0.054)          (0.04)           (0.0)
Case (2)            (0.70)           (0.025)          (0.009)          (0.007)

Table 1: The MSEs of the estimate β̂ over 100 replications with sample size n = 500. The numbers inside the parentheses are the MSEs of the estimate when the nonparametric component is ignored.

For estimating the nonparametric function f, we use a kernel method with Parzen's kernel.
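The Bumps and Doppler test functions above (Donoho and Johnstone, 1994) and the unit-variance normalization can be coded directly; the function names below are ours:

```python
import numpy as np

def bumps(x):
    """Donoho-Johnstone Bumps: f_3(x) = sum_j h_j (1 + |x - x_j| / w_j)^(-4)."""
    xj = np.array([0.10, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81])
    hj = np.array([4.0, 5.0, 3.0, 4.0, 5.0, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2])
    wj = np.array([0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005])
    return (hj * (1 + np.abs(x[:, None] - xj) / wj) ** -4).sum(axis=1)

def doppler(x):
    """Donoho-Johnstone Doppler: f_4(x) = sqrt(x(1-x)) sin(2.1*pi/(x + 0.05))."""
    return np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

def normalize(v):
    """Rescale function values to unit sample variance, as done in the simulations."""
    return v / v.std()

x = np.linspace(0.0, 1.0, 1000)
g = normalize(doppler(x))
```

These two functions are deliberately spatially inhomogeneous, which makes them demanding test cases for the kernel step while leaving the differencing step essentially unaffected.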
The bandwidth was selected by cross-validation; see, for example, Härdle (1991) and Scott (1992). For comparison, we also carried out the simulation in the case where β = 0, i.e., the linear component is not present. The mean squared errors of the estimated f are summarized in Table 2. It can be seen that the MSEs in each column are close to each other, and hence the performance of our estimator f̂ does not depend sensitively on the structure of X and β.
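A standard way to choose the bandwidth by cross-validation is leave-one-out prediction error. The sketch below uses a Gaussian-kernel Nadaraya-Watson smoother as a stand-in for Parzen's kernel, and an illustrative candidate grid of our own choosing:

```python
import numpy as np

def loocv_bandwidth(u, r, grid):
    """Leave-one-out cross-validation: for each candidate bandwidth, predict
    each r_i from all other observations and return the bandwidth in `grid`
    minimizing the mean squared prediction error."""
    best_h, best_score = grid[0], np.inf
    for h in grid:
        w = np.exp(-0.5 * ((u[:, None] - u[None, :]) / h) ** 2)
        np.fill_diagonal(w, 0.0)               # leave observation i out
        fit = (w * r).sum(axis=1) / w.sum(axis=1)
        score = np.mean((r - fit) ** 2)
        if score < best_score:
            best_h, best_score = h, score
    return best_h

rng = np.random.default_rng(4)
u = np.sort(rng.uniform(0, 1, 200))
r = np.sin(2 * np.pi * u) + rng.normal(0, 0.3, 200)   # residual-like data
grid = [0.02, 0.05, 0.1, 0.2]
h_star = loocv_bandwidth(u, r, grid)
```

Zeroing the diagonal of the weight matrix is what makes each fitted value a genuine out-of-sample prediction, so the criterion does not collapse to interpolation at small bandwidths.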

           f ≡ 0    f_1    f_2    f_3    f_4
β = 0
Case (1)
Case (2)

Table 2: The MSEs of the estimate f̂ over 100 replications with sample size n = 500.

We now consider the effect of the order of the difference sequence m on the estimation accuracy. In this study, different combinations of the function f and the Cases (1) and (2) yield basically the same results. As an illustration, we focus on Case (2) and f = f_2. We compare four different orders m: 2, 4, 8, 16. The difference sequence in equation (5) was used in each case. We summarize in Table 3 the mean and standard deviation of the estimate β̂ and the average MSE of the estimate f̂. By comparing the means and standard deviations in each row we can see that the performance of the estimator does not depend significantly on m. Increasing m may cause more bias in the estimator of β but can decrease the standard deviation slightly if m is not too large.

                    m = 2          m = 4          m = 8          m = 16
Mean (sd) of β̂_1    2.003 (0.057)  2.007 (0.053)  1.995 (0.049)  1.997 (0.05)
Mean (sd) of β̂_2    (0.05)         2.004 (0.05)   2.007 (0.048)  1.999 (0.049)
Mean (sd) of β̂_3    (0.057)        4.002 (0.047)  4.000 (0.043)  3.992 (0.049)
MSE of f̂

Table 3: The mean and standard deviation of the estimate β̂ and the average MSEs of the estimate f̂ over 100 replications with sample size n = 500.

5.2 Application to Attitude Data

We now apply our estimation and testing procedures to the analysis of the attitude data. This data set was first analyzed in Chatterjee and Price (1977) using multiple linear regression and variable selection. The data set came from a study of the performance of supervisors and was collected from a survey of the clerical employees of a large financial organization. The questions concerned the employees' satisfaction with their supervisors. The survey was designed to measure the overall performance of a supervisor, and included questions related to specific characteristics of the supervisor. An exploratory study was

undertaken to explain the relationship between specific supervisor characteristics and overall satisfaction with supervisors as perceived by the employees. The data are aggregated from the questionnaires of approximately 35 employees in each of the 30 departments; the numbers give the percentage of favorable responses to seven questions in each department. Seven variables are considered here: Y (overall rating of the job being done by the supervisor), X_1 (raises based on performance), X_2 (handling of employee complaints), X_3 (does not allow special privileges), X_4 (opportunity to learn new things), X_5 (rate of advancing to better jobs), and U (too critical of poor performance). The goal is to understand the effect of the other variables (X_1, ..., X_5 and U) on the rating Y. Figure 2 plots each explanatory variable against the response Y. From Figure 2 we can see that the effect of U on Y is not linear, while the effects of the other variables are roughly linear. So we employ the following semiparametric partial linear model,

Y = β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + β_5 X_5 + f(U) + ε.   (14)

Figure 2: Plots of the individual explanatory variables (Raises X_1, Complaints X_2, Privileges X_3, Learning X_4, Advance X_5, Critical U) against the response variable Rating Y.

Using the estimation procedure discussed in Section 3 with m = 2, the linear component in the model (14) is estimated. The estimated coefficients, together with the F statistics and p values for testing each coefficient H_{i0} : β_i = 0 against H_{i1} : β_i ≠ 0, are given in Table 4. The p values for β_1, β_3 and β_5 are exceedingly large.

       Estimated coefficient    F statistic    p value
X_1
X_2
X_3
X_4
X_5

Table 4: The estimated coefficients of the linear component and the significance tests.

We thus perform the simultaneous F test of the hypothesis H_0 : β_1 = β_3 = β_5 = 0 against H_1 : at least one of them is nonzero; this test fails to reject H_0. In comparison, the global hypothesis H_0 : β_1 = ... = β_5 = 0 is decisively rejected. We can thus refine the linear component by using only Complaints (X_2) and Learning (X_4) as independent variables; the F test for this reduced linear component is again highly significant.

We can then estimate the nonparametric component, the effect of Critical (U). For this, we run the kernel estimation using the residuals of the linear fits as we did in Section 5.1. Figure 3 shows the nonparametric fits. The left panel plots the estimate of f under the model (14) when the linear component is ignored, the middle panel plots the estimate of f under the model (14) with all linear variables, and the right panel plots f̂ with only the variables X_2 and X_4 in the linear part. The plot in the left panel is quite different from the other two, while the middle and right plots are similar, since including a small number of additional non-significant variables does not have a large effect on the estimates of the remaining parts of the model. Note that in Chatterjee and Price (1977), standard multiple linear regression was used to model the relationship between the response and the explanatory variables.
The linear model failed to detect the relationship between the variable U and Y, and it was concluded that the variable U did not have a significant effect on Y.

Figure 3: Kernel estimates of the nonparametric component f, plotted against Critical (U). The points are the residuals of the respective linear fits.

5.3 A Data Set from the Framingham Heart Study

To further illustrate our method, we apply it to a data set from the Framingham Heart Study. The objective of the Framingham Heart Study was to identify the common factors or characteristics that contribute to cardiovascular disease by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of cardiovascular disease or suffered a heart attack or stroke. For more details on the study, see for example Dawber, Meadors and Moore (1951). In our study, we use a data set from Liang, Härdle and Carroll (1999). The sample used here consists of n = 65 males. Let Y denote the average blood pressure in a fixed two-year period, X denote the age, and U denote the logarithm of the serum cholesterol level observed at exam 2. Then we have the following semiparametric partial linear model:

Y = βX + f(U) + ε.

We apply our procedure with m = 2 to estimate the linear coefficient and run the significance test. The estimated linear component is β̂X, and the p-value for testing β = 0 is very small, so the linear effect of age is highly significant. The unknown nonparametric function f(U) is then estimated based on the residuals of this linear fit. As in Section 5.1, we use Parzen's kernel and choose the bandwidth through cross-validation. Figure 4 below shows the nonparametric fit.
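The kernel step can be sketched in the same way. The snippet below smooths simulated residuals with a Nadaraya-Watson estimator and a leave-one-out cross-validated bandwidth; for brevity it uses a Gaussian kernel rather than Parzen's kernel, and the data, the true function and all names are our own assumptions, not the Framingham data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
U = np.sort(rng.uniform(0, 1, n))
# Stand-in for the residuals of the linear fit: true f plus noise.
r = np.sin(2 * np.pi * U) + rng.normal(scale=0.3, size=n)

def nw_fit(u0, u, y, h):
    """Nadaraya-Watson estimate at u0 with a Gaussian kernel and bandwidth h."""
    K = np.exp(-0.5 * ((u0 - u) / h) ** 2)
    return K @ y / K.sum()

def loo_cv(h):
    """Leave-one-out cross-validation score for bandwidth h."""
    idx = np.arange(n)
    preds = [nw_fit(U[i], U[idx != i], r[idx != i], h) for i in range(n)]
    return np.mean((r - np.array(preds)) ** 2)

grid = np.linspace(0.01, 0.2, 20)
h_cv = grid[np.argmin([loo_cv(h) for h in grid])]   # cross-validated bandwidth
f_hat = np.array([nw_fit(u, U, r, h_cv) for u in U])
```

Plotting `f_hat` against `U` gives a figure of the same kind as Figures 3 and 4.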

Figure 4: Kernel estimate of the nonparametric component f(U). The points are the residuals of the linear fit (blood pressure), plotted against U (log cholesterol level).

6 Proofs

We shall prove Theorems 1, 2, 3 and 5. The proofs of Theorems 4 and 6 are similar to those of Theorems 2 and 5, respectively.

6.1 Proof of Theorem 1

We first state and prove two lemmas. Theorem 1 then follows directly from these two lemmas.

Lemma 2. Under the assumptions of Theorem 1,

√n (Z′Z)^{-1} Z′w →_L N(0, (1 + O(m^{-1})) σ² Σ_X^{-1}).

Proof. Note that Z′w is a linear combination of the independent errors ε_i. The asymptotic normality of √n (Z′Z)^{-1} Z′w therefore follows from the Central Limit Theorem and the fact that Σ_{k=1}^{m} c_k² = O(m^{-1}). Moreover, E(√n (Z′Z)^{-1} Z′w) = 0. We now turn to the variance. It is easy to see that

Var((Z′Z)^{-1} Z′w ∣ Z) = σ² (Z′Z)^{-1} Z′ΨZ (Z′Z)^{-1}.

Note that

Z′Z = Σ_{i=1}^{n-m} Z_i Z_i′ = Σ_{i=1}^{n-m} (Σ_{t=0}^{m} d_t X_{i+t})(Σ_{t=0}^{m} d_t X_{i+t})′,

so, with m = o(n), (1/n) Z′Z →_{a.s.} Σ_X. Note also that

Z′ΨZ = Z′Z + 2 Σ_{k=1}^{m} c_k [ Σ_{i=1}^{n-m-k} (Σ_{t=0}^{m} d_t X_{i+t})(Σ_{t=0}^{m} d_t X_{i+k+t})′ ].

Hence, with m = o(n), for any k = 1, 2, ..., m,

(1/n) Σ_{i=1}^{n-m-k} (Σ_t d_t X_{i+t})(Σ_t d_t X_{i+k+t})′ →_{a.s.} E[(Σ_t d_t X_{i+t})(Σ_t d_t X_{i+k+t})′] = (Σ_{t=0}^{m-k} d_t d_{t+k}) Σ_X = c_k Σ_X.

This implies

n Var((Z′Z)^{-1} Z′w ∣ Z) →_{a.s.} σ² Σ_X^{-1} Σ_X (1 + 2 Σ_{k=1}^{m} c_k²) Σ_X^{-1} = (1 + O(m^{-1})) σ² Σ_X^{-1}.

This limit does not depend on X or n, so

√n (Z′Z)^{-1} Z′w →_L N(0, (1 + O(m^{-1})) σ² Σ_X^{-1}).

Lemma 3. Under the assumptions of Theorem 1,

n E[((Z′Z)^{-1} Z′δ)((Z′Z)^{-1} Z′δ)′] = O((m/n)^{2(α∧1)}) Σ_X^{-1}.

Proof. Note that E(Σ_{i=1}^{n-m} (Σ_t d_t X_{i+t}) δ_i) = 0. This implies E{(Z′Z)^{-1} Z′δ} = 0. Now

E[(Σ_{i=1}^{n-m} (Σ_t d_t X_{i+t}) δ_i)(Σ_{i=1}^{n-m} (Σ_t d_t X_{i+t}) δ_i)′] = (Σ_{i=1}^{n-m} δ_i² + 2 Σ_{k=1}^{m} c_k Σ_{j=1}^{n-m-k} δ_j δ_{j+k}) Σ_X.

When m = o(n), since f ∈ Λ^α(M), δ_i ≤ M (m/n)^{α∧1}. So

Σ_{i=1}^{n-m} δ_i² + 2 Σ_{k=1}^{m} c_k Σ_{j=1}^{n-m-k} δ_j δ_{j+k} = O(n^{1-2(α∧1)} m^{2(α∧1)}).

Also we know that (1/n) Z′Z →_{a.s.} Σ_X; therefore, as n → ∞,

n Var((Z′Z)^{-1} Z′δ) = E[ n (Z′Z)^{-1} (Σ_{i=1}^{n-m} (Σ_t d_t X_{i+t}) δ_i)(Σ_{i=1}^{n-m} (Σ_t d_t X_{i+t}) δ_i)′ (Z′Z)^{-1} ] = n · O(n^{1-2(α∧1)} m^{2(α∧1)}) · n^{-2} (Σ_X^{-1}) = O((m/n)^{2(α∧1)}) (Σ_X^{-1}).

Theorem 1 now follows from Lemmas 2 and 3. Note that √n (β̂ - β) = √n (Z′Z)^{-1} Z′(w + δ). Lemma 3 implies √n (Z′Z)^{-1} Z′δ →_P 0. This together with Lemma 2 yields Theorem 1.

6.2 Proof of Theorem 2

We shall only prove the convergence rate under the pointwise squared error loss; the rate of convergence under the global mean integrated squared error risk can be derived using a similar line of argument. Let

f̂_1(u) = Σ_{i=1}^{n} K_{i,h}(u)(f(U_i) + ε_i) and f̂_2(u) = Σ_{i=1}^{n} K_{i,h}(u) X_i′(β - β̂);

then f̂(u) = f̂_1(u) + f̂_2(u). In f̂_1 there is no linear component, so from standard nonparametric regression results we know that for any x_0,

sup_{f ∈ Λ^α(M)} E[(f̂_1(x_0) - f(x_0))²] ≤ C n^{-2α/(1+2α)}

for some constant C > 0. Note that Σ_{i=1}^{n} K²_{i,h}(u) = O(1/(nh)) = O(n^{-2α/(1+2α)}) and E(X_i′(β - β̂))² = O(n^{-1}). So

E(f̂_2(x_0)²) = E[(Σ_{i=1}^{n} K_{i,h} X_i′(β - β̂))²] ≤ n Σ_{i=1}^{n} K²_{i,h} E(X_i′(β - β̂))² = O(n^{-2α/(1+2α)}).

Putting these together, we have

sup_{f ∈ Λ^α(M)} E[(f̂(x_0) - f(x_0))²] ≤ C_1 n^{-2α/(1+2α)}

for some constant C_1 > 0. Theorem 4 can be proved by using the same argument.
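The factor 1 + O(m^{-1}) appearing in Lemmas 2 and 3 is driven by Σ_{k=1}^{m} c_k², where c_k = Σ_t d_t d_{t+k}. For the optimal difference sequences of Hall, Kay and Titterington (1990), c_k = -1/(2m) for every k, so Σ_k c_k² = 1/(4m). A quick numerical check for m = 2, using rounded values of the optimal sequence (the specific values are an assumption of this sketch):

```python
import numpy as np

d = np.array([0.8090, -0.5, -0.3090])   # rounded optimal difference sequence, m = 2
m = len(d) - 1
assert abs(d.sum()) < 1e-3              # sum d_t = 0
assert abs((d ** 2).sum() - 1) < 1e-3   # sum d_t^2 = 1

c = np.array([d[:len(d) - k] @ d[k:] for k in range(1, m + 1)])
print(c)                # each c_k is about -1/(2m) = -0.25
print((c ** 2).sum())   # about 1/(4m) = 0.125, the O(1/m) term in Lemma 2
```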

Proof of Lemma 1

Lemma 1 is the main technical tool for the proof of Theorem 3. We need the following lemma to prove Lemma 1.

Lemma 4. Under the assumptions of Theorem 3, Var((1/n) Σ_{i=1}^{n-1} X_{(i)} X′_{(i+1)}) → 0 as n → ∞. Here the variance of a matrix means the variance of each entry, and the limit is also entry-wise.

Proof. First we have

Var((1/n) Σ_i X_{(i)} X′_{(i+1)}) = (1/n²) Σ_i Var(X_{(i)} X′_{(i+1)}) + (2/n²) Σ_i Cov(X_{(i)} X′_{(i+1)}, X_{(i+1)} X′_{(i+2)}) + (2/n²) Σ_{i+1<j} Cov(X_{(i)} X′_{(i+1)}, X_{(j)} X′_{(j+1)}).

Here the covariance of two matrices means the covariance of the corresponding entries. Let η_i = h(U_{(i+1)}) - h(U_{(i)}) and H_i = h(U_{(i)}) h(U_{(i)})′ for i = 1, 2, ..., n. Note that when γ > 0, η_i = O(n^{-γ}). Then for i + 1 < j,

Cov(X_{(i)} X′_{(i+1)}, X_{(j)} X′_{(j+1)}) = E(E(X_{(i)} X′_{(i+1)} X_{(j)} X′_{(j+1)} ∣ U)) - E(E(X_{(i)} X′_{(i+1)} ∣ U)) E(E(X_{(j)} X′_{(j+1)} ∣ U))
= E(h(U_{(i)}) h(U_{(i+1)})′ h(U_{(j)}) h(U_{(j+1)})′) - E(h(U_{(i)}) h(U_{(i+1)})′) E(h(U_{(j)}) h(U_{(j+1)})′)
= E(h(U_{(i)}) (h(U_{(i)}) + η_i)′ h(U_{(j)}) (h(U_{(j)}) + η_j)′) - E(h(U_{(i)}) (h(U_{(i)}) + η_i)′) E(h(U_{(j)}) (h(U_{(j)}) + η_j)′)
= E(H_i H_j) - E(H_i) E(H_j) + O(n^{-γ})
= Cov(H_i, H_j) + O(n^{-γ}),

where all products and covariances are taken entry-wise. Hence,

(2/n²) Σ_{i+1<j} Cov(X_{(i)} X′_{(i+1)}, X_{(j)} X′_{(j+1)}) = (2/n²) Σ_{i+1<j} (Cov(H_i, H_j) + O(n^{-γ}))
= Var((1/n) Σ_{i=1}^{n} H_i) - (2/n²) Σ_i Cov(H_i, H_{i+1}) - (1/n²) Σ_i Var(H_i) + O(n^{-γ})
= Var((1/n) Σ_{i=1}^{n} H_i) + O(n^{-γ}).

Also, it is easy to see that

(1/n²) Σ_i Var(X_{(i)} X′_{(i+1)}) = O(1/n)

and

(2/n²) Σ_i Cov(X_{(i)} X′_{(i+1)}, X_{(i+1)} X′_{(i+2)}) = O(1/n).

Moreover, since the order of summation does not matter, Var((1/n) Σ_i H_i) = Var((1/n) Σ_i h(U_i) h(U_i)′) = O(1/n). Putting these together, the lemma is proved.

Remark 8. By the same calculation, we can in fact prove that for any fixed integer k > 0, Var((1/n) Σ_{i=1}^{n-k} X_{(i)} X′_{(i+k)}) goes to zero as n goes to infinity.

We now prove Lemma 1, stated in Section 3.

Proof of Lemma 1. It follows from Lemma 4 and the fact that m = o(n) that

lim_n Var(Z′Z/n) = 0, lim_n Var(Z′ΨZ/n) = 0, lim_n Var(Z′δδ′Z/n) = 0.

So we only need to check the limits of the expectations. First note that

E(Z_i Z_i′) = E(E((Σ_{t=0}^{m} d_t X_{(i+t)})(Σ_{t=0}^{m} d_t X_{(i+t)})′ ∣ U))
= Σ_{t=0}^{m} d_t² E[Var(X_{(i+t)} ∣ U)] + E{[Σ_{t=0}^{m} d_t h(U_{(i+t)})][Σ_{t=0}^{m} d_t h(U_{(i+t)})]′}.

And by the same calculation,

E(Z_i Z′_{i+j}) = Σ_t d_{t+j} d_t E[Var(X_{(i+j+t)} ∣ U)] + E{[Σ_t d_t h(U_{(i+t)})][Σ_t d_t h(U_{(i+j+t)})]′}.

This implies

lim_n (1/n) E(Z′Z) = lim_n (1/n) Σ_{i=1}^{n-m} E(Z_i Z_i′)
= lim_n (1/n) { Σ_{i=1}^{n-m} E[Var(X_i ∣ U)] - Σ_{k=1}^{m} c_k Σ_{i=1}^{n-k} E[(h(U_{(i)}) - h(U_{(i+k)})) (h(U_{(i)}) - h(U_{(i+k)}))′] }
= lim_n (1/n) Σ_{i=1}^{n-m} E[Var(X_i ∣ U)] = Σ_{X∣U}.

Here we use the fact that, since h is smooth,

lim_n (1/n) Σ_{k=1}^{m} c_k Σ_{i=1}^{n-k} E[(h(U_{(i)}) - h(U_{(i+k)})) (h(U_{(i)}) - h(U_{(i+k)}))′] = 0.

Similarly,

lim_n (1/n) E(Z′ΨZ) = lim_n (1/n) { Σ_{i=1}^{n-m} E(Z_i Z_i′) + Σ_{k=1}^{m} c_k Σ_{i=1}^{n-m-k} E(Z_i Z′_{i+k} + Z_{i+k} Z_i′) } = (1 + 2 Σ_{k=1}^{m} c_k²) Σ_{X∣U}.

Finally, for the third equation, let dh_i = Σ_{t=0}^{m} d_t h(U_{(i+t)}). Then

lim_n (1/n) E(Z′δδ′Z) = lim_n (1/n) E[(Σ_{i=1}^{n-m} Z_i δ_i)(Σ_{i=1}^{n-m} Z_i δ_i)′]
= lim_n (1/n) E{ E(Σ_i δ_i² Z_i Z_i′ ∣ U) + E(Σ_{i≠j} δ_i δ_j Z_i Z_j′ ∣ U) }
= lim_n (1/n) Σ_{i=1}^{n} E[ Var(X_{(i)} ∣ U) (Σ_t d_t² δ_{i-t}² + Σ_{k=1}^{m} Σ_t d_t d_{t+k} δ_{i-t} δ_{i+k-t}) ] + lim_n (1/n) [Σ_i δ_i E(dh_i)] [Σ_i δ_i E(dh_i)]′.

Since Σ_t d_t² δ_{i-t}² + Σ_{k=1}^{m} Σ_t d_t d_{t+k} δ_{i-t} δ_{i+k-t} is of order n^{-2α} for any i, the first part of the above expression is of order n^{-2α}. And δ_i E(dh_i) is of order n^{-α-γ} for any i, so the second part of the above expression is of order n^{1-2α-2γ}. This implies

(1/n) E(Z′δδ′Z) = O(n^{-2α}) + O(n^{1-2α-2γ}).
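The first limit in Lemma 1, Z′Z/n → Σ_{X∣U}, says that differencing the data sorted by U removes the conditional mean h(U) of X and leaves only the conditional variance of X given U. A small simulation illustrates this in one dimension; the choices h(u) = cos(πu) and Σ_{X∣U} = 0.49 are our own assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
U = np.sort(rng.uniform(0, 1, n))
X = np.cos(np.pi * U) + rng.normal(scale=0.7, size=n)   # h(U) plus noise, Var(X|U) = 0.49

d = np.array([0.8090, -0.5, -0.3090])                   # m = 2 difference sequence
m = len(d) - 1
Z = sum(d[t] * X[t:n - m + t] for t in range(m + 1))    # differencing cancels h(U)

szz = Z @ Z / n
print(szz)   # close to 0.49 = Sigma_{X|U}
```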

6.3 Proof of Theorem 3

With Lemmas 1 and 2, the proof of Theorem 3 is now straightforward. Note that √n (β̂ - β) = √n (Z′Z)^{-1} Z′w + √n (Z′Z)^{-1} Z′δ. It follows from the same argument as in the proof of Lemma 2 that √n (Z′Z)^{-1} Z′w is asymptotically normal with mean 0 and variance σ² n (Z′Z)^{-1} Z′ΨZ (Z′Z)^{-1}. The first two equalities of Lemma 1 yield

lim_n σ² n (Z′Z)^{-1} Z′ΨZ (Z′Z)^{-1} = (1 + 2 Σ_{k=1}^{m} c_k²) σ² Σ_{X∣U}^{-1},

and the third equality of Lemma 1 shows that, when α + γ > 1/2, √n (Z′Z)^{-1} Z′δ is small and negligible. Theorem 3 now follows.

6.4 Proof of Theorem 5

From the discussion in Section 4, we know that β̂′C′(C(Z′Z)^{-1}C′)^{-1}Cβ̂/σ² asymptotically follows a χ² distribution with r degrees of freedom. We only need to show that (n - m - p) σ̂²/σ² asymptotically follows a χ² distribution with n - m - p degrees of freedom and is independent of β̂′C′(C(Z′Z)^{-1}C′)^{-1}Cβ̂. The independence is easy to see, because the argument is the same as in the ordinary linear regression case. Since we know that 2w′Hδ + δ′Hδ goes to 0 in probability, we just need to focus on w′Hw. Note that d_t² = O(m^{-1}) for t ≥ 1, and let L be the (n - m) × n matrix given by

L_{i,j} = d_{j-i} for 0 ≤ j - i ≤ m, and L_{i,j} = 0 otherwise. (5)

Moreover, let J be another (n - m) × n matrix given by J_{i,i} = 1 for i = 1, 2, ..., n - m and 0 otherwise. Then w = Lε = Jε + (L - J)ε = w_1 + w_2, where w_1 = Jε and w_2 = (L - J)ε. We can see that w_1 ~ N(0, σ² I_{n-m}) and w_2 ~ N(0, σ² (L - J)(L - J)′). We can also see that the variance of each coordinate of w_2 is of order m^{-1}. Now w′Hw = w_1′Hw_1 + 2w_1′Hw_2 + w_2′Hw_2. It is easy to check that H is a projection of rank n - m - p, i.e., H² = H and rank(H) = n - m - p. Hence w_1′Hw_1 = (Hw_1)′(Hw_1)

follows, after division by σ², a χ² distribution with n - m - p degrees of freedom. Since m = o(n) and the variance of each coordinate of w_2 is of order m^{-1}, we know that both w_1′Hw_2/(n - m - p) and w_2′Hw_2/(n - m - p) converge to 0 in probability. This means that

β̂′C′(C(Z′Z)^{-1}C′)^{-1}Cβ̂ / (r σ̂²)

asymptotically follows the F(r, n - m - p) distribution. By the same argument as in the proof of Theorem 1, we can prove that the asymptotic power of this test (at local alternatives) is the same as that of the usual F test when f is not present in the model (1).

References

[1] Amato, U., Antoniadis, A., and De Feis, I. (2002). Fourier Series Approximation of Separable Models. J. Computational and Applied Mathematics 146.

[2] Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. The Johns Hopkins University Press, Baltimore.

[3] Buja, A., Hastie, T. J., and Tibshirani, R. J. (1989). Linear Smoothers and Additive Models. Ann. Statist. 17.

[4] Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997). Generalized partially linear single-index models. J. Amer. Statist. Assoc. 92.

[5] Chatterjee, S. and Price, B. (1977). Regression Analysis by Example. New York: Wiley.

[6] Chen, H. and Shiau, J. H. (1991). A two-stage spline smoothing method for partially linear models. J. Statist. Plann. Inference 27.

[7] Cuzick, J. (1992). Semiparametric additive regression. J. Roy. Statist. Soc. Ser. B 54.

[8] Dawber, T. R., Meadors, G. F., and Moore, F. E. J. (1951). Epidemiological approaches to heart disease: the Framingham Study. Amer. J. Public Health 41.

[9] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81.

[10] Engle, R. F., Granger, C. W. J., Rice, J. and Weiss, A. (1986). Nonparametric estimates of the relation between weather and electricity sales. J. Amer. Statist. Assoc. 81.

[11] Fan, J., Härdle, W., and Mammen, E. (1998). Direct estimation of additive and linear components for high-dimensional data. Ann. Statist. 26.

[12] Fan, J. and Huang, L. (2001). Goodness-of-Fit Tests for Parametric Regression Models. J. Amer. Statist. Assoc. 96.

[13] Hall, P., Kay, J. and Titterington, D. M. (1990). Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika 77.

[14] Härdle, W. (1991). Smoothing Techniques. Berlin: Springer-Verlag.

[15] Horowitz, J. and Spokoiny, V. (2001). An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica 69.

[16] Lam, C. and Fan, J. (2007). Profile-Kernel Likelihood Inference With Diverging Number of Parameters. Ann. Statist., to appear.

[17] Liang, H., Härdle, W. and Carroll, R. J. (1999). Estimation in a semiparametric partially linear errors-in-variables model. Ann. Statist. 27.

[18] Linton, O. B. (1997). Efficient Estimation of Additive Nonparametric Regression Models. Biometrika 84.

[19] Linton, O. B. and Nielsen, J. P. (1995). A Kernel Method of Estimating Structured Nonparametric Regression Based on Marginal Integration. Biometrika 82.

[20] Mammen, E., Linton, O., and Nielsen, J. (1999). The Existence and Asymptotic Properties of a Backfitting Projection Algorithm Under Weak Conditions. Ann. Statist. 27.

[21] Munk, A., Bissantz, N., Wagner, T. and Freitag, G. (2005). On difference based variance estimation in nonparametric regression when the covariate is high dimensional. J. Roy. Statist. Soc. B 67.

[22] Opsomer, J.-D. and Ruppert, D. (1998). A Fully Automated Bandwidth Selection Method for Fitting Additive Models. J. Amer. Statist. Assoc. 93.

[23] Ritov, Y. and Bickel, P. J. (1990). Achieving information bounds in semi and non parametric models. Ann. Statist. 18.

[24] Scott, D. W. (1992). Multivariate Density Estimation. New York: Wiley.

[25] Severini, T. A. and Wong, W. H. (1992). Generalized profile likelihood and conditionally parametric models. Ann. Statist. 20.

[26] Tjøstheim, D. and Auestad, B. (1994). Nonparametric Identification of Nonlinear Time Series: Projections. J. Amer. Statist. Assoc. 89.

[27] Wahba, G. (1984). Cross validated spline methods for the estimation of multivariate functions from data on functionals. In Statistics: An Appraisal, Proceedings 50th Anniversary Conference Iowa State Statistical Laboratory (H. A. David, ed.). Iowa State Univ. Press.

[28] Wang, L., Brown, L. D., Cai, T. T. and Levine, M. (2006). Effect of mean on variance function estimation in nonparametric regression. Ann. Statist., to appear.

[29] Yatchew, A. (1997). An elementary estimator of the partial linear model. Economics Letters 57.


More information

A PRACTICAL WAY FOR ESTIMATING TAIL DEPENDENCE FUNCTIONS

A PRACTICAL WAY FOR ESTIMATING TAIL DEPENDENCE FUNCTIONS Statistica Sinica 20 2010, 365-378 A PRACTICAL WAY FOR ESTIMATING TAIL DEPENDENCE FUNCTIONS Liang Peng Georgia Institute of Technology Abstract: Estimating tail dependence functions is important for applications

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

On robust and efficient estimation of the center of. Symmetry.

On robust and efficient estimation of the center of. Symmetry. On robust and efficient estimation of the center of symmetry Howard D. Bondell Department of Statistics, North Carolina State University Raleigh, NC 27695-8203, U.S.A (email: bondell@stat.ncsu.edu) Abstract

More information

Semi-Nonparametric Inferences for Massive Data

Semi-Nonparametric Inferences for Massive Data Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work

More information

Tests of independence for censored bivariate failure time data

Tests of independence for censored bivariate failure time data Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ

More information

A GENERALIZED ADDITIVE REGRESSION MODEL FOR SURVIVAL TIMES 1. By Thomas H. Scheike University of Copenhagen

A GENERALIZED ADDITIVE REGRESSION MODEL FOR SURVIVAL TIMES 1. By Thomas H. Scheike University of Copenhagen The Annals of Statistics 21, Vol. 29, No. 5, 1344 136 A GENERALIZED ADDITIVE REGRESSION MODEL FOR SURVIVAL TIMES 1 By Thomas H. Scheike University of Copenhagen We present a non-parametric survival model

More information

Nonparametric Estimation of Distributions in a Large-p, Small-n Setting

Nonparametric Estimation of Distributions in a Large-p, Small-n Setting Nonparametric Estimation of Distributions in a Large-p, Small-n Setting Jeffrey D. Hart Department of Statistics, Texas A&M University Current and Future Trends in Nonparametrics Columbia, South Carolina

More information

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human

More information

An Introduction to Nonstationary Time Series Analysis

An Introduction to Nonstationary Time Series Analysis An Introduction to Analysis Ting Zhang 1 tingz@bu.edu Department of Mathematics and Statistics Boston University August 15, 2016 Boston University/Keio University Workshop 2016 A Presentation Friendly

More information

Estimation and Testing for Varying Coefficients in Additive Models with Marginal Integration

Estimation and Testing for Varying Coefficients in Additive Models with Marginal Integration Estimation and Testing for Varying Coefficients in Additive Models with Marginal Integration Lijian Yang Byeong U. Park Lan Xue Wolfgang Härdle September 6, 2005 Abstract We propose marginal integration

More information

PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY

PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY REVSTAT Statistical Journal Volume 7, Number 1, April 2009, 37 54 PREWHITENING-BASED ESTIMATION IN PARTIAL LINEAR REGRESSION MODELS: A COMPARATIVE STUDY Authors: Germán Aneiros-Pérez Departamento de Matemáticas,

More information

Selecting Local Models in Multiple Regression. by Maximizing Power

Selecting Local Models in Multiple Regression. by Maximizing Power Selecting Local Models in Multiple Regression by Maximizing Power Chad M. Schafer and Kjell A. Doksum October 26, 2006 Abstract This paper considers multiple regression procedures for analyzing the relationship

More information

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University

The Bootstrap: Theory and Applications. Biing-Shen Kuo National Chengchi University The Bootstrap: Theory and Applications Biing-Shen Kuo National Chengchi University Motivation: Poor Asymptotic Approximation Most of statistical inference relies on asymptotic theory. Motivation: Poor

More information

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix

Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract

More information

Specification Test for Instrumental Variables Regression with Many Instruments

Specification Test for Instrumental Variables Regression with Many Instruments Specification Test for Instrumental Variables Regression with Many Instruments Yoonseok Lee and Ryo Okui April 009 Preliminary; comments are welcome Abstract This paper considers specification testing

More information

Empirical Likelihood Tests for High-dimensional Data

Empirical Likelihood Tests for High-dimensional Data Empirical Likelihood Tests for High-dimensional Data Department of Statistics and Actuarial Science University of Waterloo, Canada ICSA - Canada Chapter 2013 Symposium Toronto, August 2-3, 2013 Based on

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

arxiv: v1 [stat.me] 17 Jan 2008

arxiv: v1 [stat.me] 17 Jan 2008 Some thoughts on the asymptotics of the deconvolution kernel density estimator arxiv:0801.2600v1 [stat.me] 17 Jan 08 Bert van Es Korteweg-de Vries Instituut voor Wiskunde Universiteit van Amsterdam Plantage

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

On Estimation of Partially Linear Transformation. Models

On Estimation of Partially Linear Transformation. Models On Estimation of Partially Linear Transformation Models Wenbin Lu and Hao Helen Zhang Authors Footnote: Wenbin Lu is Associate Professor (E-mail: wlu4@stat.ncsu.edu) and Hao Helen Zhang is Associate Professor

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.

Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we

More information

7 Semiparametric Estimation of Additive Models

7 Semiparametric Estimation of Additive Models 7 Semiparametric Estimation of Additive Models Additive models are very useful for approximating the high-dimensional regression mean functions. They and their extensions have become one of the most widely

More information

Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives

Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives TR-No. 14-06, Hiroshima Statistical Research Group, 1 11 Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives Mariko Yamamura 1, Keisuke Fukui

More information

SIMULTANEOUS CONFIDENCE INTERVALS FOR SEMIPARAMETRIC LOGISTICS REGRESSION AND CONFIDENCE REGIONS FOR THE MULTI-DIMENSIONAL EFFECTIVE DOSE

SIMULTANEOUS CONFIDENCE INTERVALS FOR SEMIPARAMETRIC LOGISTICS REGRESSION AND CONFIDENCE REGIONS FOR THE MULTI-DIMENSIONAL EFFECTIVE DOSE Statistica Sinica 20 (2010), 637-659 SIMULTANEOUS CONFIDENCE INTERVALS FOR SEMIPARAMETRIC LOGISTICS REGRESSION AND CONFIDENCE REGIONS FOR THE MULTI-DIMENSIONAL EFFECTIVE DOSE Jialiang Li 1, Chunming Zhang

More information

Exact Likelihood Ratio Tests for Penalized Splines

Exact Likelihood Ratio Tests for Penalized Splines Exact Likelihood Ratio Tests for Penalized Splines By CIPRIAN CRAINICEANU, DAVID RUPPERT, GERDA CLAESKENS, M.P. WAND Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe Street, Baltimore,

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

Estimation of the Bivariate and Marginal Distributions with Censored Data

Estimation of the Bivariate and Marginal Distributions with Censored Data Estimation of the Bivariate and Marginal Distributions with Censored Data Michael Akritas and Ingrid Van Keilegom Penn State University and Eindhoven University of Technology May 22, 2 Abstract Two new

More information

A simple bootstrap method for constructing nonparametric confidence bands for functions

A simple bootstrap method for constructing nonparametric confidence bands for functions A simple bootstrap method for constructing nonparametric confidence bands for functions Peter Hall Joel Horowitz The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP4/2

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Some properties of Likelihood Ratio Tests in Linear Mixed Models

Some properties of Likelihood Ratio Tests in Linear Mixed Models Some properties of Likelihood Ratio Tests in Linear Mixed Models Ciprian M. Crainiceanu David Ruppert Timothy J. Vogelsang September 19, 2003 Abstract We calculate the finite sample probability mass-at-zero

More information

VARIANCE ESTIMATION AND KRIGING PREDICTION FOR A CLASS OF NON-STATIONARY SPATIAL MODELS

VARIANCE ESTIMATION AND KRIGING PREDICTION FOR A CLASS OF NON-STATIONARY SPATIAL MODELS Statistica Sinica 25 (2015), 135-149 doi:http://dx.doi.org/10.5705/ss.2013.205w VARIANCE ESTIMATION AND KRIGING PREDICTION FOR A CLASS OF NON-STATIONARY SPATIAL MODELS Shu Yang and Zhengyuan Zhu Iowa State

More information

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University CURRENT STATUS LINEAR REGRESSION By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University We construct n-consistent and asymptotically normal estimates for the finite

More information

Wavelet Regression Estimation in Longitudinal Data Analysis

Wavelet Regression Estimation in Longitudinal Data Analysis Wavelet Regression Estimation in Longitudinal Data Analysis ALWELL J. OYET and BRAJENDRA SUTRADHAR Department of Mathematics and Statistics, Memorial University of Newfoundland St. John s, NF Canada, A1C

More information

Restricted Likelihood Ratio Tests in Nonparametric Longitudinal Models

Restricted Likelihood Ratio Tests in Nonparametric Longitudinal Models Restricted Likelihood Ratio Tests in Nonparametric Longitudinal Models Short title: Restricted LR Tests in Longitudinal Models Ciprian M. Crainiceanu David Ruppert May 5, 2004 Abstract We assume that repeated

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for Comment Atsushi Inoue Department of Economics, Vanderbilt University (atsushi.inoue@vanderbilt.edu) While it is known that pseudo-out-of-sample methods are not optimal for comparing models, they are nevertheless

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Local Polynomial Wavelet Regression with Missing at Random

Local Polynomial Wavelet Regression with Missing at Random Applied Mathematical Sciences, Vol. 6, 2012, no. 57, 2805-2819 Local Polynomial Wavelet Regression with Missing at Random Alsaidi M. Altaher School of Mathematical Sciences Universiti Sains Malaysia 11800

More information

Semiparametric Approximation Methods in Multivariate Model Selection

Semiparametric Approximation Methods in Multivariate Model Selection journal of complexity 17, 754 772 (2001) doi:10.1006/jcom.2001.0591, available online at http://www.idealibrary.com on Semiparametric Approximation Methods in Multivariate Model Selection Jiti Gao 1 Department

More information

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017

Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION. September 2017 Supplemental Material for KERNEL-BASED INFERENCE IN TIME-VARYING COEFFICIENT COINTEGRATING REGRESSION By Degui Li, Peter C. B. Phillips, and Jiti Gao September 017 COWLES FOUNDATION DISCUSSION PAPER NO.

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information