Sparse and efficient estimation for partial spline models with increasing dimension

Ann Inst Stat Math (2015) 67:93–127
DOI 10.1007/s10463-013-0440-y

Sparse and efficient estimation for partial spline models with increasing dimension

Guang Cheng · Hao Helen Zhang · Zuofeng Shang

Received: 28 November 2011 / Revised: 9 September 2013 / Published online: 5 December 2013
© The Institute of Statistical Mathematics, Tokyo 2013

Abstract We consider model selection and estimation for partial spline models and propose a new regularization method in the context of smoothing splines. The regularization method has a simple yet elegant form, consisting of a roughness penalty on the nonparametric component and a shrinkage penalty on the parametric components, and it achieves function smoothing and sparse estimation simultaneously. We establish the convergence rate and oracle properties of the estimator under weak regularity conditions. Remarkably, the estimated parametric components are sparse and efficient, and the nonparametric component can be estimated at the optimal rate. The procedure also has attractive computational properties. Using the representer theory of smoothing splines, we reformulate the objective function as a LASSO-type problem, enabling us to use the LARS algorithm to compute the solution path. We then extend the procedure to situations in which the number of predictors increases with the sample size and investigate its asymptotic properties in that context. Finite-sample performance is illustrated by simulations.

G. Cheng: Supported by NSF Grant DMS-0906497 and CAREER Award DMS-1151692. H. H. Zhang: Supported by NSF grants DMS-0645293 and DMS-1347844, and NIH grants P01 CA142538 and R01 CA085848.

G. Cheng (B) · Z. Shang, Department of Statistics, Purdue University, West Lafayette, IN 47907-2066, USA. e-mail: chengg@purdue.edu; Z. Shang e-mail: shang9@purdue.edu
H. H. Zhang, Department of Mathematics, University of Arizona, Tucson, AZ 85721-0089, USA. e-mail: hzhang@math.arizona.edu

Keywords Smoothing splines · Semiparametric models · RKHS · High dimensionality · Solution path · Oracle property · Shrinkage methods

1 Introduction

1.1 Background

Partial smoothing splines are an important class of semiparametric regression models. Developed in the framework of reproducing kernel Hilbert spaces (RKHS), these models provide a compromise between linear and nonparametric models. In general, a partial smoothing spline model assumes the data (X_i, T_i, Y_i) follow

Y_i = X_i^T β + f(T_i) + ε_i, i = 1,...,n, f ∈ W^m[0,1], (1)

where X_i ∈ R^d are linear covariates, T_i ∈ [0,1] is the nonlinear covariate, and the ε_i are independent errors with mean zero and variance σ². The space W^m[0,1] is the mth order Sobolev Hilbert space

W^m[0,1] = {f : f, f^{(1)},..., f^{(m−1)} are absolutely continuous, f^{(m)} ∈ L_2[0,1]}

for m ≥ 1. Here f^{(j)} denotes the jth derivative of f. The function f(t) is the nonparametric component of the model. Denote the observations of (X_i, T_i, Y_i) by (x_i, t_i, y_i) for i = 1, 2,...,n. The standard approach to computing the partial spline (PS) estimator is to minimize the penalized least squares

(β̂_PS, f̂_PS) = argmin_{β ∈ R^d, f ∈ W^m} (1/n) ∑_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ_1 J²_f, (2)

where λ_1 is a smoothing parameter and J²_f = ∫_0^1 [f^{(m)}(t)]² dt is the roughness penalty on f; see Kimeldorf and Wahba (1971), Craven and Wahba (1979), Denby (1984), and Green and Silverman (1994) for details. It is known that the solution f̂_PS is a natural spline (Wahba 1990) of order 2m on [0,1] with knots at t_i, i = 1,...,n. Asymptotic theory for partial splines has been developed by several authors (Shang and Cheng 2013; Rice 1986; Heckman 1986; Speckman 1988; Shiau and Wahba 1988). In this paper, we mainly consider partial smoothing splines in the framework of Wahba (1984).

1.2 Model selection for partial splines

Variable selection is important for data analysis and model building, especially for high dimensional data, as it helps to improve the model's prediction accuracy and interpretability. For linear models, various penalization procedures have been proposed to obtain a sparse model, including the non-negative garrote (Breiman 1995; Tibshirani 1996; Fan and Li 2001; Fan and Peng 2004) and the adaptive LASSO (Zou 2006; Wang et al. 2007). Contemporary research frequently deals with problems where the input dimension d diverges to infinity as the sample size increases (Fan and Peng 2004). There is also active research on linear model selection in these

situations (Fan and Peng 2004; Zou and Zhang 2009; Fan and Lv 2008; Huang et al. 2008a, b).

In this paper, we propose and study a new approach to variable selection for partially linear models in the framework of smoothing splines. The procedure leads to a regularization problem in the RKHS, whose unified formulation facilitates numerical computation and asymptotic inference for the estimator. To conduct variable selection, we employ the adaptive LASSO penalty on the linear parameters. One advantage of this procedure is its easy implementation. We show that, by using the representer theory (Wahba 1990), the optimization problem can be reformulated as a LASSO-type problem, so that the entire solution path can be computed by the LARS algorithm (Efron et al. 2004). We show that the new procedure can asymptotically (1) correctly identify the sparse model structure; (2) estimate the nonzero β_j consistently and achieve semiparametric efficiency; and (3) estimate the nonparametric component f at the optimal nonparametric rate. We also investigate the properties of the new procedure with a diverging number of predictors (Fan and Peng 2004).

From now on, we regard (Y_i, X_i) as i.i.d. realizations from some probability distribution. We assume that the x_i belong to some compact subset of R^d and that they are standardized so that ∑_{i=1}^n x_{ij}/n = 0 and ∑_{i=1}^n x_{ij}²/n = 1 for j = 1,...,d, where x_i = (x_{i1},...,x_{id})^T. Also assume t_i ∈ [0,1] for all i. Throughout the paper, we use the convention that 0/0 = 0.

The rest of the article is organized as follows: Sect. 2 introduces our new double-penalty estimation procedure for partial spline models. Section 3 is devoted to two main theoretical results. We first establish the convergence rates and oracle properties of the estimators in the standard situation with fixed d, and then extend these results to situations where d diverges with the sample size n. Section 4 gives the computational algorithm. In particular, we show how to compute the solution path using the LARS algorithm. The issue of parameter tuning is also discussed. Section 5 illustrates the performance of the procedure via simulations and real examples. Discussions and technical proofs are presented in Sects. 6 and 7.

2 Method

We assume that t_1 < t_2 < ··· < t_n. In order to achieve a smooth estimate for the nonparametric component and sparse estimates for the parametric components simultaneously, we consider the following regularization problem:

min_{β ∈ R^d, f ∈ W^m} (1/n) ∑_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ_1 ∫_0^1 [f^{(m)}(t)]² dt + λ_2 ∑_{j=1}^d w_j |β_j|. (3)

The penalty term in (3) is naturally formed as a combination of the roughness penalty on f and the weighted LASSO penalty on β. Here λ_1 controls the smoothness of the estimated nonlinear function while λ_2 controls the degree of shrinkage on the β_j. The weights w_j are pre-specified. For convenience, we will refer to this procedure as PSA (Partial Splines with Adaptive penalty).

Note that the w_j should be adaptively chosen so that they take large values for unimportant covariates and small values for important covariates. In particular, we propose using w_j = 1/|β̃_j|^γ, where β̃ = (β̃_1,...,β̃_d)^T is some consistent estimate of β in the model (1), and γ is a fixed positive constant. For example, the standard partial smoothing spline estimate β̂_PS can be used to construct the weights. Therefore, we obtain the following optimization problem:

(β̂_PSA, f̂_PSA) = argmin_{β ∈ R^d, f ∈ W^m} (1/n) ∑_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ_1 ∫_0^1 [f^{(m)}(t)]² dt + λ_2 ∑_{j=1}^d |β_j| / |β̃_j|^γ. (4)

When β is fixed, standard smoothing spline theory implies that the solution to (4) is linear in the residual (y − Xβ), i.e., f̂(β) = A(λ_1)(y − Xβ), where y = (y_1,...,y_n)^T, X = (x_1,...,x_n)^T, and the matrix A(λ_1) is the smoother or influence matrix (Wahba 1984). The expression of A(λ_1) will be given in Sect. 4. Plugging f̂(β) into (4), we obtain an equivalent objective function for β:

Q(β) = (1/n)(y − Xβ)^T [I − A(λ_1)](y − Xβ) + λ_2 ∑_{j=1}^d |β_j| / |β̃_j|^γ, (5)

where I is the identity matrix of size n. The PSA solution can then be computed as

β̂_PSA = argmin_β Q(β), f̂_PSA = A(λ_1)(y − X β̂_PSA).

Special software such as quadratic programming (QP) or LARS (Efron et al. 2004) is needed to obtain the solution.
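As a concrete illustration of the profiling step, the sketch below forms the transformed variables y* = T y and X* = T X W and solves the resulting LASSO problem with scikit-learn's LassoLars (a LARS-based solver), as anticipated by the reformulation (21) in Sect. 4. The function name, the eigendecomposition used to obtain the square root T, and the penalty scaling passed to the solver are our own choices for this sketch, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def psa_profile_and_solve(y, X, A, beta_init, lam2, gamma=1.0):
    """Profile out f and minimize the objective Q(beta) in (5).

    A is the smoother matrix A(lambda_1); beta_init is a consistent initial
    estimate (e.g., the partial smoothing spline) used to build the adaptive
    weights w_j = 1/|beta_init_j|**gamma."""
    n, d = X.shape
    w = 1.0 / (np.abs(beta_init) ** gamma)                 # adaptive weights
    # Square root T of the symmetric PSD matrix I - A, so that I - A = T'T.
    evals, evecs = np.linalg.eigh(np.eye(n) - A)
    T = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    y_star = T @ y                                          # transformed response
    X_star = T @ X @ np.diag(1.0 / w)                       # X* = T X W, W = diag(|beta_init|^gamma)
    # LassoLars minimizes (1/(2n))||y* - X* b||^2 + alpha ||b||_1,
    # which matches (1/n)||.||^2 + lam2 ||.||_1 when alpha = lam2 / 2.
    fit = LassoLars(alpha=lam2 / 2.0, fit_intercept=False).fit(X_star, y_star)
    beta_hat = fit.coef_ / w                                # beta_j = beta*_j |beta_init_j|^gamma
    f_hat = A @ (y - X @ beta_hat)                          # plug-in nonparametric fit
    return beta_hat, f_hat
```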

3 Statistical theory

We write the true coefficient vector as β_0 = (β_{01},...,β_{0d})^T = (β_{10}^T, β_{20}^T)^T, where β_{10} consists of all q nonzero components and β_{20} consists of the remaining (d − q) zero elements, and we write the true function f as f_0. We also write the estimated vector β̂_PSA = (β̂_1,...,β̂_d)^T = (β̂_{PSA,1}^T, β̂_{PSA,2}^T)^T. In addition, we assume that X_i has zero mean and strictly positive definite covariance matrix R. The observations t_i satisfy

∫_0^{t_i} u(w) dw = i/n for i = 1,...,n, (6)

where u(·) is a continuous and strictly positive function independent of n.

3.1 Asymptotic results for fixed d

We show that, for any fixed γ > 0, if λ_1 and λ_2 converge to zero at proper rates, then both the parametric and nonparametric components can be estimated at their optimal rates. Moreover, our estimation procedure produces a nonparametric estimate f̂_PSA with the desired smoothness, i.e., J_{f̂_PSA} = O_P(1) in (10) below. Meanwhile, we conclude that our double penalization procedure estimates the nonparametric function well enough to achieve the oracle properties of the weighted LASSO estimates.

In the following we use ‖·‖ and ‖·‖_2 to denote the Euclidean norm and the L_2-norm, respectively, and use ‖·‖_n to denote the empirical L_2-norm, i.e., ‖F‖_n² = ∑_{i=1}^n F²(s_i)/n. We derive our convergence rate results under the following regularity conditions:

R1. ε is assumed to be independent of X and has a sub-exponential tail, i.e., E(exp(|ε|/C_0)) ≤ C_0 for some 0 < C_0 < ∞; see Mammen and van de Geer (1997);
R2. ∑_k φ_k φ_k^T / n converges in probability to some non-singular matrix, where φ_k = [1, t_k,..., t_k^{m−1}, x_{k1},..., x_{kd}]^T.

Theorem 1 Consider the minimization problem (4), where γ > 0 is a fixed constant. Assume the initial estimate β̃ is consistent. If

n^{2m/(2m+1)} λ_1 → λ_{10} > 0, √n λ_2 → 0, and n^{(2m−1)/(2(2m+1))} λ_2 / |β̃_j|^γ →_P λ_{20} > 0 for j = q+1,...,d (7)

as n → ∞, then we have:

1. there exists a local minimizer β̂_PSA of (4) such that

‖β̂_PSA − β_0‖ = O_P(n^{−1/2}); (8)

2. the nonparametric estimate f̂_PSA satisfies

‖f̂_PSA − f_0‖_n = O_P(λ_1^{1/2}), (9)
J_{f̂_PSA} = O_P(1); (10)

3. the local minimizer β̂_PSA = (β̂_{PSA,1}^T, β̂_{PSA,2}^T)^T satisfies
(a) Sparsity: P(β̂_{PSA,2} = 0) → 1.
(b) Asymptotic normality: √n (β̂_{PSA,1} − β_{10}) →_d N(0, σ² R_1^{−1}), where R_1 is the q × q upper-left submatrix of the covariance matrix of X_i.

Remark 1 Note that t is assumed to be nonrandom and to satisfy condition (6), and that EX = 0. In this case, the semiparametric efficiency bound for β̂_{PSA,1} in the partly

linear model under sparsity is just σ² R_1^{−1}; see van der Vaart and Wellner (1996). Thus, we can claim that β̂_{PSA,1} is semiparametric efficient. If we use the partial spline solution to construct the weights in (4) and choose γ = 1 and n^{2m/(2m+1)} λ_i → λ_{i0} > 0 for i = 1, 2, the above Theorem implies that the double penalized estimators achieve the optimal rates for both parametric and nonparametric estimation, i.e., (8)–(9), and that β̂_PSA possesses the oracle properties, i.e., the asymptotic normality of β̂_{PSA,1} and the sparsity of β̂_{PSA,2}.

3.2 Asymptotic results for diverging d_n

Let β_0 = (β_{10}^T, β_{20}^T)^T ∈ R^{q_n} × R^{m_n} = R^{d_n}. Let x_i = (w_i^T, z_i^T)^T, where w_i consists of the first q_n covariates and z_i consists of the remaining m_n covariates. Thus we can define the matrices X_1 = (w_1,...,w_n)^T and X_2 = (z_1,...,z_n)^T. For any matrix K, we denote its smallest and largest eigenvalues by λ_min(K) and λ_max(K), respectively. We now give the additional regularity conditions required to establish the large-sample theory for the increasing-dimensional case:

R1D. There exist constants 0 < b_0 ≤ b_1 < ∞ such that b_0 ≤ min{|β_{0j}|, 1 ≤ j ≤ q_n} ≤ max{|β_{0j}|, 1 ≤ j ≤ q_n} ≤ b_1.
R2D. λ_min(∑_k φ_k φ_k^T / n) ≥ c_3 > 0 for any n.
R3D. Let R be the covariance matrix of the vector X_i. We assume that 0 < c_1 ≤ λ_min(R) ≤ λ_max(R) ≤ c_2 < ∞ for any n.

Conditions R2D and R3D are equivalent to Condition R2 when d_n is assumed to be fixed.

3.2.1 Convergence rates of β̂_PSA and f̂_PSA

We first present a lemma concerning the convergence rate of the initial estimate β̂_PS given the increasing dimension d_n. For two deterministic sequences p_n, q_n = o(1), we use the symbol p_n ≍ q_n to indicate that p_n = O(q_n) and q_n = O(p_n). Define x ∨ y (x ∧ y) to be the maximum (minimum) of x and y.

Lemma 1 Suppose that β̂_PS is a partial smoothing spline estimate. Then we have

‖β̂_PS − β_0‖ = O_P(√(d_n/n)) given d_n = o(n^{1/2} ∧ nλ_1^{1/2m}). (11)

Our next theorem gives the convergence rates for β̂_PSA and f̂_PSA when the dimension of β_0 diverges to infinity. In this increasing-dimension set-up, we find three results: (i) the convergence rate for β̂_PSA coincides with that for the estimator in the linear regression model with increasing dimension (Portnoy 1984); thus we can conclude that the presence of the nonparametric function and the sparsity of β_0 do not affect the overall

convergence rate of β̂_PSA; (ii) the convergence rate for f̂_PSA is slower than that of the regular partial smoothing spline, i.e., O_P(n^{−m/(2m+1)}), and is controlled by the dimension of the important components of β_0, i.e., q_n; and (iii) the nonparametric estimator f̂_PSA always satisfies the desired smoothness condition, i.e., J_{f̂_PSA} = O_P(1), even under increasing dimension of β_0.

Theorem 2 Suppose that d_n = o(n^{1/2} ∧ nλ_1^{1/2m}), nλ_1^{1/2m} → ∞ and √(n/d_n) λ_2 → 0. Then we have

‖β̂_PSA − β_0‖ = O_P(√(d_n/n)). (12)

If we further assume that

λ_1/q_n ≍ n^{−2m/(2m+1)} and √(n/d_n) (λ_2/q_n) max_{j=q_n+1,...,d_n} |β̃_j|^{−γ} = O_P(n^{1/(2m+1)} d_n^{−3/2}), (13)

then we have

‖f̂_PSA − f_0‖_n = O_P(√(d_n/n) ∨ (n^{−m/(2m+1)} √q_n)), (14)
J_{f̂_PSA} = O_P(1). (15)

It seems nontrivial to improve the rate of convergence for the parametric estimate to the minimax optimal rate √(q_n log d_n / n) proven in Bickel et al. (2009). The main reason is that that rate result is proven in the (finite) dictionary learning framework, which requires that the nonparametric function can be well approximated by a member of the span of a finite dictionary of (basis) functions. This key assumption does not straightforwardly hold in our smoothing spline setup. In addition, it is also unclear how to relax the Gaussian error condition assumed in Bickel et al. (2009) to the fairly weak sub-exponential tail condition assumed in our paper.

3.2.2 Oracle properties

In this subsection, we show that the desired oracle properties can also be achieved in the increasing-dimension case. In particular, when showing the asymptotic normality of β̂_{PSA,1}, we consider an arbitrary linear combination of β_{10}, say G_n β_{10}, where G_n is an arbitrary l × q_n matrix with finite l.

Theorem 3 Suppose the following conditions hold:

D1. d_n = o(n^{1/3} ∧ (n^{2/3} λ_1^{1/3m})) and q_n = o(n^{−1} λ_2^{−2});
S1. λ_1 satisfies λ_1/q_n ≍ n^{−2m/(2m+1)} and n^{m/(2m+1)} λ_1 → 0;
S2. λ_2 satisfies

√(n/d_n) λ_2 min_{j=q_n+1,...,d_n} |β̃_j|^{−γ} →_P ∞. (16)

Then we have

(a) Sparsity: P(β̂_{PSA,2} = 0) → 1.
(b) Asymptotic normality:

√n G_n R_1^{1/2} (β̂_{PSA,1} − β_{10}) →_d N(0, σ² G_0), (17)

where G_n is a non-random l × q_n matrix with full row rank such that G_n G_n^T → G_0.

In Corollary 1, we give the fastest possible increasing rates for the dimensions of β_0 and its important components that guarantee estimation efficiency and selection consistency. The ranges of the smoothing and shrinkage parameters are also given.

Corollary 1 Let γ = 1 and suppose that β̃ is the partial smoothing spline solution. Then we have

1. ‖β̂_PSA − β_0‖ = O_P(√(d_n/n)) and ‖f̂_PSA − f_0‖_n = O_P(√(d_n/n) ∨ (n^{−m/(2m+1)} √q_n));
2. β̂_PSA possesses the oracle properties;

if the following dimension and smoothing parameter conditions hold:

d_n = o(n^{1/3}) and q_n = o(n^{1/3}), (18)
nλ_1^{1/2m} → ∞, n^{m/(2m+1)} λ_1 → 0 and λ_1/q_n ≍ n^{−2m/(2m+1)}, (19)
√(n/d_n) λ_2 → 0, (n/d_n) λ_2 → ∞ and √d_n (n/q_n) λ_2 = O(n^{1/(2m+1)}). (20)

Define d_n ≍ n^d and q_n ≍ n^q, where q ≤ d < 1/3 according to (18). For the usual case that m ≥ 2, we can give a set of sufficient conditions for (19)–(20) as follows: λ_1 ≍ n^{−r_1} and λ_2 ≍ n^{−r_2} for r_1 = 2m/(2m+1) − q, r_2 = d/2 + 2m/(2m+1) − q and (1 − d)/2 < r_2 < 1 − d. These conditions are very easy to check. For example, if m = 2, we can set λ_1 ≍ n^{−0.55} and λ_2 ≍ n^{−0.675} when d_n ≍ n^{1/4} and q_n ≍ n^{1/4}.

In Ni et al. (2009), the authors considered variable selection in partly linear models when the dimension is increasing. Under m = 2, they applied the SCAD penalty to the parametric part and proved the oracle property together with asymptotic normality. In this paper, we consider general m and apply the adaptive LASSO to the parametric part. Besides the oracle property, we also derive the rate of convergence for the nonparametric estimate. The technical proof relies on nontrivial applications of RKHS theory and modern empirical process theory. Therefore, our results are substantially different from those of Ni et al. (2009). The numerical results provided in Sect. 5 demonstrate satisfactory selection accuracy. Furthermore, we are able to report the estimation accuracy of the nonparametric estimate, which is also satisfactory.

4 Computation and tuning

4.1 Algorithm

We propose a two-step procedure to obtain the PSA estimator: first compute β̂_PSA, then compute f̂_PSA. As shown in Sect. 2, we need to minimize (5) to estimate β. Define

the square root matrix of I − A(λ_1) as T, i.e., I − A(λ_1) = T^T T. Then (5) can be reformulated as a LASSO-type problem

min_{β*} (1/n) (y* − X* β*)^T (y* − X* β*) + λ_2 ∑_{j=1}^d |β*_j|, (21)

where the transformed variables are y* = T y, X* = T X W, and β*_j = β_j / |β̃_j|^γ, j = 1,...,d, with W = diag{|β̃_j|^γ}. Therefore, (21) can be conveniently solved with the LARS algorithm (Efron et al. 2004).

Now assume β̂_PSA has been obtained. Using standard smoothing spline theory, it is easy to show that f̂_PSA = A(λ_1)(y − X β̂_PSA), where A is the influence matrix. By reproducing kernel Hilbert space theory (Kimeldorf and Wahba 1971), W^m[0,1] is an RKHS when equipped with the inner product

(f, g) = ∑_{ν=0}^{m−1} [∫_0^1 f^{(ν)}(t) dt][∫_0^1 g^{(ν)}(t) dt] + ∫_0^1 f^{(m)}(t) g^{(m)}(t) dt.

We can decompose W^m[0,1] = H_0 ⊕ H_1 as a direct sum of two RKHS subspaces. In particular, H_0 = {f : f^{(m)} = 0} = span{k_ν(t), ν = 0,...,m−1}, where k_ν(t) = B_ν(t)/ν! and the B_ν(t) are Bernoulli polynomials (Abramowitz and Stegun 1964). The subspace H_1 = {f : ∫_0^1 f^{(ν)}(t) dt = 0, ν = 0,...,m−1; f^{(m)} ∈ L_2[0,1]} is associated with the reproducing kernel K_1(t, s) = k_m(t) k_m(s) + (−1)^{m−1} k_{2m}([s − t]), where [τ] is the fractional part of τ. Let S be the n × m matrix with entries s_{iν} = k_{ν−1}(t_i) and let Σ be the n × n square matrix with (i, j)-th entry K_1(t_i, t_j). Let the QR decomposition of S be S = (F_1, F_2)(U^T, 0^T)^T, where F = [F_1, F_2] is orthogonal and U is upper triangular, with S^T F_2 = 0. As shown in Wahba (1984) and Gu (2002), the influence matrix A can be expressed as

A(λ_1) = I − nλ_1 F_2 (F_2^T V F_2)^{−1} F_2^T, where V = Σ + nλ_1 I.

Using the representer theorem (Wahba 1990), we can compute the nonparametric estimator as

f̂_PSA(t) = ∑_{ν=1}^m b̂_ν k_{ν−1}(t) + ∑_{i=1}^n ĉ_i K_1(t, t_i),

where, writing ỹ = y − X β̂_PSA, the coefficients are ĉ = F_2 (F_2^T V F_2)^{−1} F_2^T ỹ and b̂ = U^{−1} F_1^T (ỹ − Σ ĉ).
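To make the formulas above concrete, the following sketch assembles S, Σ, the QR factors F_1, F_2, U, the influence matrix A(λ_1), and the coefficients (b̂, ĉ) for the cubic case m = 2, so that only the Bernoulli polynomials B_1, B_2, B_4 are needed. It is a minimal illustration under that assumption; the function names are ours, not the paper's, and no attempt is made at numerical efficiency.

```python
import numpy as np

# Scaled Bernoulli polynomials k_nu(t) = B_nu(t)/nu!, the ones needed for m = 2.
def k1(t): return t - 0.5
def k2(t): return (t**2 - t + 1.0/6.0) / 2.0
def k4(t): return (t**4 - 2*t**3 + t**2 - 1.0/30.0) / 24.0

def kernel_K1(t, s):
    """Reproducing kernel of H1 for m = 2: K1(t,s) = k2(t)k2(s) - k4([s - t])."""
    return k2(t) * k2(s) - k4((s - t) % 1.0)

def spline_matrices(t, lam1):
    """Return S, Sigma, F1, F2, U, V and the influence matrix A(lam1) for knots t."""
    n, m = len(t), 2
    S = np.column_stack([np.ones(n), k1(t)])               # s_{i,nu} = k_{nu-1}(t_i)
    Sigma = kernel_K1(t[:, None], t[None, :])               # (i,j) entry K1(t_i, t_j)
    Q, Rfull = np.linalg.qr(S, mode='complete')             # S = [F1 F2][U; 0]
    F1, F2, U = Q[:, :m], Q[:, m:], Rfull[:m, :]
    V = Sigma + n * lam1 * np.eye(n)
    M = F2 @ np.linalg.solve(F2.T @ V @ F2, F2.T)           # F2 (F2' V F2)^{-1} F2'
    A = np.eye(n) - n * lam1 * M                            # influence matrix A(lam1)
    return S, Sigma, F1, F2, U, V, A

def spline_coefficients(t, resid, lam1):
    """Coefficients (b_hat, c_hat) of the smoothing-spline fit to a residual vector."""
    S, Sigma, F1, F2, U, V, _ = spline_matrices(t, lam1)
    c_hat = F2 @ np.linalg.solve(F2.T @ V @ F2, F2.T @ resid)
    b_hat = np.linalg.solve(U, F1.T @ (resid - Sigma @ c_hat))
    return b_hat, c_hat
```

For a residual vector z = y − X β̂_PSA, one can check numerically that S b̂ + Σ ĉ coincides with A(λ_1) z, as the representer theorem guarantees.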

We summarize the algorithm in the following steps:

Step 1. Fit the standard partial smoothing spline and construct the weights w̃_j. Compute y* and X*.
Step 2. Solve (21) using the LARS algorithm. Denote the solution by β̂* = (β̂*_1,...,β̂*_d)^T.
Step 3. Calculate β̂_PSA = (β̂_1,...,β̂_d)^T via β̂_j = β̂*_j |β̃_j|^γ for j = 1,...,d.
Step 4. Obtain the nonparametric fit as f̂ = S b̂ + Σ ĉ, where the coefficients are computed as ĉ = F_2 (F_2^T V F_2)^{−1} F_2^T (y − X β̂_PSA) and b̂ = U^{−1} F_1^T (y − X β̂_PSA − Σ ĉ).

4.2 Parameter tuning

One possible tuning approach for the double penalized estimator is to choose (λ_1, λ_2) jointly by minimizing some score. Following the local quadratic approximation (LQA) technique used in Tibshirani (1996) and Fan and Li (2001), we can derive the GCV score as a function of (λ_1, λ_2). Define the diagonal matrix D(β) = diag{1/(|β_1| |β̃_1|^γ),...,1/(|β_d| |β̃_d|^γ)}. The solution β̂_PSA can be approximated by

[X^T {I − A(λ_1)} X + nλ_2 D(β̂_PSA)]^{−1} X^T {I − A(λ_1)} y ≡ H y.

Correspondingly, f̂_PSA = A(λ_1)(y − X β̂_PSA) = A(λ_1)[I − X H] y. Therefore, the predicted response can be approximated as ŷ = X β̂_PSA + f̂_PSA = M(λ_1, λ_2) y, where M(λ_1, λ_2) = X H + A(λ_1)[I − X H], and the number of effective parameters in the double penalized fit (β̂_PSA, f̂_PSA) may be approximated by tr(M(λ_1, λ_2)). The GCV score can be constructed as

GCV(λ_1, λ_2) = [n^{−1} ∑_{i=1}^n (y_i − ŷ_i)²] / [1 − n^{−1} tr(M(λ_1, λ_2))]².

The two-dimensional search is computationally expensive in practice. In the following, we suggest an alternative two-stage tuning procedure. Since λ_1 controls the partial spline fit (β̃, b̃, c̃), we first select λ_1 using GCV at Step 1 of the computational algorithm:

GCV(λ_1) = [n^{−1} ∑_{i=1}^n (y_i − ỹ_i)²] / [1 − n^{−1} tr{Ã(λ_1)}]²,

where ỹ = (ỹ_1,...,ỹ_n)^T is the partial spline prediction and Ã(λ_1) is the influence matrix for the partial spline solution. Let λ̃_1 = argmin_{λ_1} GCV(λ_1). We can also select λ_1 using GCV in the smoothing spline problem Y_i − X_i^T β̄ = f(t_i) + ε_i, where β̄ is the √n-consistent difference-based estimator (Yatchew 1997). This substitution approach is theoretically valid for selecting λ_1, since the convergence rate of β̄ is faster than the nonparametric rate for estimating f, so β̄ can be treated as the true value. At the successive steps, we fix λ_1 at λ̃_1 and select only λ_2 for optimal variable selection. Wang et al. (2007), Zhang and Lu (2007) and Wang et al. (2009) suggested that BIC works better than GCV in terms of consistent model selection when tuning λ_2 for the adaptive LASSO in the context of linear models, even with diverging dimension. Therefore, we propose to choose λ_2 by minimizing

BIC(λ_2) = (y − X β̂_PSA − f̂_PSA)^T (y − X β̂_PSA − f̂_PSA)/σ̂² + log(n) · r,

where r is the number of nonzero coefficients in β̂ and the estimated residual variance σ̂² can be obtained from the standard partial spline model, i.e., σ̂² = (y − X β̂_PS − f̂_PS)^T (y − X β̂_PS − f̂_PS) / (n − tr(Ã(λ̃_1)) − d).
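The two-stage tuning described above can be sketched as follows, reusing the hypothetical helpers `spline_matrices` and `psa_profile_and_solve` from the earlier sketches. The grids are user-supplied, and the initial estimate for the adaptive weights is taken to be the profiled partial spline fit; these are illustrative choices consistent with, but not identical to, the paper's implementation.

```python
import numpy as np

def partial_spline_fit(y, X, t, lam1):
    """Standard partial smoothing spline: beta_ps minimizes (1/n)(y-Xb)'[I-A](y-Xb)."""
    n = len(y)
    _, _, _, _, _, _, A = spline_matrices(t, lam1)       # helper from the earlier sketch
    H_ps = np.linalg.solve(X.T @ (np.eye(n) - A) @ X, X.T @ (np.eye(n) - A))
    beta_ps = H_ps @ y
    A_tilde = X @ H_ps + A @ (np.eye(n) - X @ H_ps)       # full influence matrix of the PS fit
    return beta_ps, A, A_tilde

def select_lambda1_gcv(y, X, t, lam1_grid):
    """Stage 1: choose lambda_1 by GCV of the partial spline fit."""
    def gcv(lam1):
        _, _, A_tilde = partial_spline_fit(y, X, t, lam1)
        n = len(y)
        return np.mean((y - A_tilde @ y) ** 2) / (1.0 - np.trace(A_tilde) / n) ** 2
    return min(lam1_grid, key=gcv)

def select_lambda2_bic(y, X, t, lam1, lam2_grid, gamma=1.0):
    """Stage 2: fix lambda_1 and choose lambda_2 by BIC of the PSA fit."""
    n, d = X.shape
    beta_ps, A, A_tilde = partial_spline_fit(y, X, t, lam1)
    f_ps = A @ (y - X @ beta_ps)
    sigma2 = np.sum((y - X @ beta_ps - f_ps) ** 2) / (n - np.trace(A_tilde) - d)
    def bic(lam2):
        beta_hat, f_hat = psa_profile_and_solve(y, X, A, beta_ps, lam2, gamma)  # earlier sketch
        resid = y - X @ beta_hat - f_hat
        return resid @ resid / sigma2 + np.log(n) * np.sum(beta_hat != 0)
    return min(lam2_grid, key=bic)
```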

Sparse partial spline 3 even with diverging dimension. Therefore, we propose to choose λ 2 by minimizing BIC(λ 2 ) = (y X β PSA ˆf PSA ) (y X β PSA ˆf PSA )/ ˆσ 2 + log(n) r, where r is the number of nonzero coefficients in β, and the estimated residual variance ˆσ 2 can obtained from the standard partial spline model, i.e., ˆσ 2 = (y X β PS f PS ) (y X β PS f PS )/(n tr(ã(λ )) d). 5 Numerical studies 5. Simulation We compare the standard partial smoothing spline model with the new procedure under the LASSO (with w j = in(3)) and adaptive (ALASSO) penalty. In the following, these three methods are, respectively, referred to as PS, PSL, and PSA. We also include the Oracle model fit assuming the true model were known. In all the examples, we use γ = for PSA and consider two sample sizes n = and n = 2. The smoothness parameter m was chosen to be 2 in all the numerical experiments. In each setting, a total of 5 Monte Carlo (MC) simulations are carried out. We report the MC sample mean and standard deviation (given in the parentheses) for the MSEs. Following Fan and Li (24), we use mean squared [ error MSE( β) = ] E β β 2 and mean integrated squared error MISE( f ˆ ) = E { f (t) f (t)} 2 dt to evaluate goodness-of-fit for parametric and nonparametric estimation, respectively, and compute them by averaging over data knots in the simulations. To evaluate the variable selection performance of each method, we report the number of correct zero ( correct ) coefficients, the number of coefficients incorrectly set to ( incorrect ), model size, and the empirical probability of capturing the true model. We generate data from a model Y i = X i β + f (T i) + ε i and consider two following model settings: Model : β = (3, 2.5, 2,.5,,...,), d = 5 and q = 4. And f (t) =.5sin(2πt). Model 2: Let β = (3,...,3,,...,), d = 2 and q =. The nonparametric function f 2 (t) = t ( t) 4 /(3B(, 5)) + 4t 4 ( t) /(5B(5, )), where the beta function B(u,v) = tu ( t) v dt. Two model coefficient vectors β = β and β 2 = β/3 were considered. The Euclidean norms of β and β 2 are β =9.49 and β 2 =3.6 respectively. The supnorm of f is f sup =.6. So the ratios β / f sup = 8.8 and β 2 / f sup = 2.72, denoted as the parametric-to-nonparametric signal ratios (PNSR). The two settings represent high and low PNSR s, respectively. Two possible distributions for the covariates X and T Model : X,...,X 5, T are i.i.d. generated from Unif(, ). 23

4 G. Cheng et al. Table Variable selection and fitting results for Model σ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.5 PS.578 (.).5 (.) 5 () () () PSL.36 (.8).5 (.) 7.34 (.9) 7.66 (.9). (.) PSA.234 (.8).4 (.) 4.53 (.4).47 (.4). (.) Oracle.29 (.4).4 (.) 4 () () () 2 PS.249 (.4).8 (.) 5 () () () PSL.47 (.4).8 (.) 7.6 (.9) 7.84 (.9). (.) PSA. (.4).8 (.) 4.36 (.4).64 (.4). (.) Oracle.63 (.).7 (.) 4 () () () PS 2.293 (.4).55 (.2) 5 () () () PSL.256 (.32).5 (.2) 7.36 (.9) 7.64 (.9). (.) PSA. (.36).5 (.2) 4.72 (.5).25 (.5).2 (.) Oracle.5 (.7).48 (.2) 4 () () () 2 PS.989 (.7).28 (.) 5 () () () PSL.587 (.6).27 (.) 7.2 (.9) 7.8 (.9). (.) PSA.479 (.4).26 (.) 4.42 (.4).58 (.4). (.) Oracle.252 (.8).26 (.) 4 () () Model 2: X = (X,...,X 2 ) are standard normal with AR() correlation, i.e., corr(x i, X j ) = ρ i j. T follows Unif(, ) and is independent with X i s. We consider ρ =.3 and ρ =.6. Two possible error distributions are used in these two settings: Model : (normal error) ɛ N(,σ 2 ), with σ =.5 and σ =, respectively. Model 2: (non-normal error) ɛ 2 t, t-distribution with degrees of freedom. Table compares the model fitting and variable selection performance of various procedures in different settings for Model. It is evident that the PSA procedure outperforms both the PS and PSL in terms of both the MSE and variable selection. The three procedures give similar performance in estimating the nonparametric function. Table 2 shows that the PSA works much better in distinguishing important variables from unimportant variables than PSL. For example, when σ =.5, the PSA identifies the correct model 5.7 = 35 times out of 5 times when n = and 5.78 = 39 times when n = 2, while the PSL identifies the correct model only 5.9 = 45 times when n = and 5. = 5 times when n = 2. To present the performance of our nonparametric estimation procedure, we plot the estimated functions for Model in the below Fig.. The top row of Fig. depicts the typical estimated curves corresponding to the th best, the 5th best (median), and the 9th best according to MISE among simulations when n = 2 and σ =.5. It can be seen that the fitted curves are overall able to capture the shape of the true function very well. In order to describe the sampling variability of the estimated nonparametric 23

Sparse partial spline 5 Table 2 Variable selection relative frequency in percentage over 5 runs for Model σ n Important index Unimportant variable index P(correct) 3 4 5 6 7 8 9 2 3 4 5.5 PSL.3.29.32.33.3.25.32.29.32.33.3.9 PSA.5.5.7.6.6.3.4.3.5.6.4.7 2 PSL.29.29.26.3.3.28.25.3.28.29.3. PSA.3.3.4.4.3.3.3.4.3.3.4.78 PSL.3.3.32.33.3.26.32.29.32.33.3.8 PSA.98.8.7.9.8.8.4.5.5.7.8.5.55 2 PSL.29.29.27.32.3.29.24.3.29.29.29. PSA.3.3.5.5.4.3.3.4.4.3.4.75 23

function at each point, we also depict a 95% pointwise confidence interval for f_1 in the bottom row of Fig. 1. The upper and lower bounds of the confidence intervals are given by the 2.5th and 97.5th percentiles, respectively, of the estimated function at each grid point among the simulations. The results show that the function f_1 is estimated with very good accuracy.

[Figure 1 near here. Top panel: "Model 1: Fitted confidence envelope for f_1 (n = 200, σ = 0.5)", estimated f_1 plotted against t for the 10%, 50%, and 90% best fits. Bottom panel: "Model 1: 95% pointwise confidence band for f_1 (n = 200, σ = 0.5)", estimated f_1 plotted against t.]

Fig. 1 The estimated nonlinear functions given by the PSA in Model 1: the estimated nonlinear function, confidence envelope, and 95% pointwise confidence interval for Model 1 with n = 200 and σ = 0.5. In the top plot, the dashed line is the 10% best fit, the dotted line is the 50% best (median) fit, and the dash-dotted line is the 90% best fit according to MISE among the 500 simulations. The bottom plot is a 95% pointwise confidence interval.

Tables 3 and 4, and 5 and 6, summarize the simulation results when the true parametric components are β_1 and β_2, respectively. Tables 3 and 5 compare the model fitting and variable selection performance in the correlated setting of Model 2. The case ρ = 0.3 represents weak correlation among the X's and ρ = 0.6 represents a moderate situation. Again, we observe that the PSA performs best in terms of both MSE and variable selection in all settings. In particular, when n = 200, the PSA is very close to the Oracle results in this example. Tables 4 and 6 compare the variable selection results of PSL and PSA in the four scenarios with correlated covariates. Since neither method misses any important variable over the 500 runs, we only report the selection relative frequencies

Sparse partial spline 7 Table 3 Model selection and fitting results for Model 2 when the true parameter vector is β = β ρ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.3 PS.46 (.8).45 (.2) 2 () () () PSL.299 (.6).447 (.2) 2.99 (.9) 7. (.9). (.) PSA.24 (.5).443 (.2).29 (.3) 9.7 (.3). (.) Oracle.8 (.4).444 (.2) () () () 2 PS.79 (.3).48 (.) 2 () () () PSL.25 (.3).46 (.) 3.22 (.8) 6.78 (.8). (.) PSA.87 (.2).44 (.).2 (.2) 9.88 (.2). (.) Oracle.82 (.2).45 (.) () () ().6 PS.72 (.3).448 (.2) 2 () () () PSL.4 (.9).44 (.2).9 (.7) 8.9 (.7). (.) PSA.349 (.8).438 (.2).22 (.3) 9.78 (.3). (.) Oracle.3 (.4).439 (.2) () () () 2 PS.3 (.5).48 (.) 2 () () () PSL.7 (.4).45 (.) 2.47 (.7) 7.53 (.7). (.) PSA.47 (.4).44 (.).2 (.2) 9.88 (.2). (.) Oracle.39 (.4).45 (.) () () () PNSR 8.8 Table 4 Relative frequency of variables selected in 5 runs for Model 2 when the true parameter vector is β = β ρ n Method Unimportant variable index P(correct) 2 3 4 5 6 7 8 9 2.3 PSL.35.34.34.34.35.33.34.35.34.36. PSA.5.3.3.3.4.3.4.4.2.4.82 2 PSL.34.32.3.32.29.3.3.3.3.3.4 PSA.2..........9.6 PSL.32.26.24.25.25.23.22.25.24.26.2 PSA.6.2..2.2..2.2.2..87 2 PSL.29.23.2.8.9.2.8.7.9.2.25 PSA.2..........96 for the unimportant variables. Overall, the PSA results in a more sparse model and identifies the true model with a much higher frequency. For example, when the true parametric component is β, n = and the correlation is moderate with ρ =.6, the PSA identifies the correct model with relative frequency.87 (about 5.87 = 435 times) while the PSL identifies the correct model only 5.2 = times. When the true parametric component is β 2, PSA and PSL identify the correct model 45 and 9 times, respectively. 23

8 G. Cheng et al. Table 5 Model selection and fitting results for Model 2 when the true parameter vector is β 2 = β/3 ρ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.3 PS.52 (.8).432 (.2) 2 () () () PSL.32 (.5).43 (.2) 3.32 (.8) 6.68 (.8). (.) PSA.269 (.5).425 (.2).44 (.5) 9.56 (.5). (.) Oracle.23 (.4).422 (.2) () () () 2 PS.226 (.3).39 (.) 2 () () () PSL.4 (.3).39 (.) 3. (.8) 6.89 (.8). (.) PSA.6 (.2).387 (.).38 (.4) 9.62 (.4). (.) Oracle. (.2).386 (.) () () ().6 PS.9 (.3).432 (.2) 2 () () () PSL.42 (.7).425 (.2) 2.5 (.8) 7.95 (.8). (.) PSA.382 (.7).49 (.2).3 (.4) 9.7 (.4). (.) Oracle.35 (.5).47 (.2) () () () 2 PS.382 (.5).388 (.) 2 () () () PSL.92 (.3).387 (.) 2.2 (.7) 7.8 (.7). (.) PSA.8 (.4).386 (.).7 (.2) 9.83 (.2). (.) Oracle.74 (.3).385 (.) () () () PNSR 2.72 Table 6 Relative frequency of variables selected in 5 runs for Model 2 when the true parameter vector is β 2 = β/3 ρ n Method Unimportant variable index P(correct) 2 3 4 5 6 7 8 9 2.3 PSL.36.33.35.34.37.33.34.35.32.37.9 PSA.6.4.5.5.3.4.5.5.5.5.78 2 PSL.34.32.3.35.32.32.32.3.3.33. PSA.4.4.3.2.3.3.3.3.2.3.87.6 PSL.34.28.25.23.25.23.24.25.24.28.8 PSA.8.3.3.3.3.3.2.3.3.3.8 2 PSL.29.25.2.5.9.2.6.8.9.9.22 PSA.4.2.2.2.....2..92 The top rows of Figs. 2 and 3 depict the typical estimated functions corresponding to the th best, the 5th best (median), and the 9th best fits according to MISE among simulations when n = 2, ρ =.3, and the true Euclidean parameters are β and β 2, respectively. It is evident that the estimated curves are able to capture the shape of the true function very well. The bottom rows of Figs. 2 and 3 depict the 95% pointwise confidence intervals for f. The results show that, when the true Euclidean parameters are β and β 2, respectively, the function f is estimated with 23
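For completeness, here is one way the Model 2 data sets used above could be generated under the settings of Sect. 5.1 (AR(1)-correlated standard normal covariates with d = 20 and q = 10 coefficients equal to 3, T ~ Unif(0, 1), and the Beta-density form of f_2). The degrees of freedom of the t-distributed error are a placeholder choice of ours, since only a t distribution is specified in the text; everything else follows the design stated above.

```python
import numpy as np
from math import exp, lgamma

def beta_fn(u, v):
    """Beta function B(u, v) via log-gamma functions."""
    return exp(lgamma(u) + lgamma(v) - lgamma(u + v))

def f2(t):
    """Nonparametric component of simulation Model 2."""
    return (t**10 * (1 - t)**4 / (3 * beta_fn(11, 5))
            + 4 * t**4 * (1 - t)**10 / (15 * beta_fn(5, 11)))

def simulate_model2(n, rho=0.3, scale=1.0, df=10, rng=None):
    """Draw one Model 2 data set; scale=1 gives beta_1, scale=1/3 gives beta_2.

    df is an assumed value for the t-error degrees of freedom."""
    rng = np.random.default_rng() if rng is None else rng
    d, q = 20, 10
    beta = np.concatenate([np.full(q, 3.0), np.zeros(d - q)]) * scale
    idx = np.arange(d)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))        # AR(1) correlation matrix
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    T = rng.uniform(0.0, 1.0, size=n)
    y = X @ beta + f2(T) + rng.standard_t(df, size=n)
    return y, X, T, beta
```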

Sparse partial spline 9 Model 2: Fitted confidence envelope fit for f when the true parameter vector is β (n=2, ρ=.3).5 Estimated f.5 % 5% 9%.5..2.3.4.5.6.7.8.9 t Model 2: 95% pointwise confidence band for f when the true parameter vector is β (n=2, ρ=.3) 3 2 Estimated f 2..2.3.4.5.6.7.8.9 t Fig. 2 The estimated nonlinear functions given by the PSA in Model 2. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 2, n = 2, ρ =.3, and the true Euclidean parameter is β.inthetop plot, the dashed line is th best fit, the dotted line is 5th best fit, and the dashed-dotted line is 9th best of 5 simulations. The bottom plot is a 95% pointwise confidence interval reasonably good accuracy. Interestingly, in this simulation setting, the choice of β and β 2 does not affect much on the estimation accuracy of f. 5.2 Simulation 2: large dimensional setting We consider an example involving a larger number of linear variables: Model 3: Let d = 6, q = 5. We considered two parameter vectors β = β and β 2 =.3β, and two nonparametric functions f (t) = f (t) and f 2 (t) =.5 f (t), with different magnitudes on the (non)parametric component representing weak and strong (non)parametric signals, where β = (4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2,,...,) and f (t) =.2t 29 ( t) 6 /B(3, 7) +.8t 2 ( t) /B(3, ). In particular, the maximum absolute values of f and f 2 are f sup = 3.8 and f 2 sup =.54, respectively, and the l 2 -norms 23

G. Cheng et al..5 Model 2: Fitted confidence envelope fit for f when the true parameter vector is β 2 (n=2, ρ=.3) Estimated f.5 % 5% 9%.5..2.3.4.5.6.7.8.9 t 3 Model 2: 95% pointwise confidence band for f when the true parameter vector is β 2 (n=2, ρ=.3) 2 Estimated f 2..2.3.4.5.6.7.8.9 t Fig. 3 The estimated nonlinear functions given by the PSA in Model 2. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 2, n = 2, ρ =.3, and the true Euclidean parameter is β 2.Inthetop plot, the dashed line is th best fit, the dotted line is 5th best fit, and the dashed-dotted line is 9th best of 5 simulations. The bottom plot is a 95% pointwise confidence interval of the β and β 2 are β =2.4 and β 2 =3.6, respectively. So the ratio of β 2 to f sup and β to f 2 sup, i.e., the PNSR s, are β 2 / f sup =.7 and β / f 2 sup = 7.82, representing the lower to higher PNSRs. The correlated covariates (X,...,X 6 ) are generated from marginally standard normal with AR() correlation with ρ =.5. Consider two settings for the normal error ɛ N(,σ 2 ), with σ =.5 and σ =.5, respectively. Tables 7 and 8 compare the model fitting and variable selection performance of various procedures in different settings for Model 3. In particular, in Table 7, the true Euclidean parameter and nonparametric function are β 2 and f, while in Table 8 they are β and f 2. So the PNSR s in Tables 7 and 8 are.7 and 7.82, respectively. To better illustrate the performance, we considered σ =.5 and.5 in each table. The corresponding signal-to-noise ratios, defined as the ratios of β ( β 2 )toσ s, are 24.8, 8.3, 7.22, and 2.4 in the four settings. It is evident that the PSA procedure outperforms both the PS and PSL in terms of both the MSE and variable selection accuracy. The three procedures give similar performance in estimating the nonparametric 23

Sparse partial spline Table 7 Variable selection and fitting results for Model 3 when the true parameter vector is β 2 and the true function is f σ n Method MSE( β) MISE( ˆ f ) Size Number of zeros Correct Incorrect.5 PS 4.94 (.59).39 (.2) 6 () () () PSL.526 (.7).64 (.) 27.86 (.3) 32.4 (.3). (.) PSA.34 (.).6 (.) 2.43 (.29) 38.56 (.29). (.) Oracle.225 (.).522 (.) 5 () 45 () () 2 PS.95 (.).85 (.3) 6 () () () PSL.223 (.7).68 (.) 26.2 (.22) 33.88 (.22). (.) PSA.34 (.4).548 (.) 8.78 (.2) 4.22 (.2). (.) Oracle.2 (.2).478 (.9) 5 () 45 () ().5 PS.4 (.3).5 (.27) 6 () () () PSL 2.72 (.66).256 (.4) 27.9 (.35) 32. (.35). (.) PSA.4 (.4).22 (.2) 2.7 (.22) 38.3 (.22). (.) Oracle.38 (.5).28 (.9) 5 () 45 () () 2 PS 2.44 (.2).9 (.3) 6 () () () PSL.688 (.).63 (.3) 26. (.25) 35. (.25). (.) PSA.47 (.8).52 (.2) 2.5 (.2) 38.95 (.2). (.) Oracle.448 (.6).42 (.2) 5 () 45 () () PNSR.7. SNRs 7.22 and 2.4 for σ =.5and.5respectively function. In Figs. 4 and 5, we plotted the confidence envelop and the 95% confidence band of f and f 2 when n = 2, σ =.5, and the true Euclidean parameters are β 2 and β, respectively. All the figures demonstrate satisfactory coverage of the true unknown nonparametric function by confidence envelops and pointwise confidence bands. We also conclude that, at least in this simulation setup, for the two settings with different PNSR s, the estimates of f and f 2 are satisfactory. 5.3 Real example : ragweed pollen data We apply the proposed method to the Ragweed Pollen data analyzed in Ruppert et al. (23). The data consist of 87 daily observations of ragweed pollen level and relevant information collected in Kalamazoo, Michigan during the 993 ragweed season. The main purpose of this analysis is to develop an accurate model for forecasting daily ragweed pollen level based on some climate factors. The raw response ragweed is the daily ragweed pollen level (grains/m 3 ). There are four explanatory variables: X = rain: the indicator of significant rain for the following day ( = at least 3 h of steady or brief but intense rain, = otherwise); X 2 = temperature: temperature of the following day ( o F); X 3 = wind: wind speed forecast for the following day (knots); 23

2 G. Cheng et al. Table 8 Variable selection and fitting results for Model 3 when the true parameter vector is β and the true function is f 2 σ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.5 PS.599 (.34).32 (.3) 6 () () () PSL.334 (.9).24 (.2) 8.44 (.6) 4.56 (.6). (.) PSA.23 (.).232 (.2) 5.3 (.3) 44.7 (.6). (.) Oracle.63 (.2).22 (.2) 5 () 45 () () 2 PS.379 (.3).254 (.) 6 () () () PSL.2 (.2).236 (.) 7.9 (.3) 42.8 (.3). (.) PSA.94 (.).23 (.) 5.2 (.) 44.98 (.). (.) Oracle.68 (.).28 (.) 5 () 45 () ().5 PS 7.27 (.83).539 (.3) 6 () () () PSL.588 (.35).459 (.9) 27.69 (.22) 32.3 (.22). (.) PSA.285 (.28).434 (.7) 2.2 (.9) 38.88 (.9). (.) Oracle.785 (.).395 (.4) 5 () 45 () () 2 PS.886 (.4).35 (.3) 6 () () () PSL.57 (.).339 (.2) 26.35 (.9) 33.65 (.9). (.) PSA.472 (.6).334 (.2) 9.8 (.4) 4.92 (.4). (.) Oracle.342 (.5).325 (.2) 5 () 45 () () PNSR 7.82. SNRs 24.8 and 8.3 for σ =.5and.5respectively X 4 = day: the number of days in the current ragweed pollen season. We first standardize X-covariates. Since the raw response is rather skewed, Ruppert et al. (23) suggested a square root transformation Y = ragweed. Marginal plots suggest a strong nonlinear relationship between Y and the day number. Consequently, a partial linear model with a nonparametric baseline f (day) is reasonable. Ruppert et al. (23) fitted a semiparametric model with three linear effects X, X 2, and X 3, and a nonlinear effect of X 4. For the variable selection purpose, we add the quadratic terms in the model and fit an enlarged model: y = f (day) + β x + β 2 x 2 + β 3 x 3 + β 4 x 2 2 + β 4x 2 3 + ε. Table 9 gives the estimated regression coefficients. We observe that PSL and PSA end up with the same model, and all the estimated coefficients are positive, suggesting that the ragweed pollen level increases as any covariate increases. The shrinkage in parametric terms from the partial spline models that resulted from the PSA procedure is overall smaller than that resulted from the PSL procedure. Figure 6 depicts the estimated nonparametric function f (day) and its 95% pointwise confidence intervals given by the PSA. The plot indicates that f (day) increases rapidly to the peak on around day 25, plunges until day 6, and decreases steadily 23

Sparse partial spline 3 Model 3: Fitted confidence envelope fit for f when the true parameter vector is β 2 (n=2, σ=.5) 4 3 Estimated f 2 % 5% 9%..2.3.4.5.6.7.8.9 t Model 3: 95% pointwise confidence band for f when the true parameter vector is β 2 (n=2, σ=.5) 6 Estimated f 4 2 2..2.3.4.5.6.7.8.9 t Fig. 4 The estimated nonlinear functions given by the PSA in Model 3. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 3 with true nonparametric function f and true Euclidean parameter β 2, n = 2 and σ =.5. In the top plot, the dashed line is for the th best fit, the dotted line is for the 5th best fit, and the dashed-dotted line is for the 9th best among 5 simulations. The bottom plot is a 95% pointwise confidence interval thereafter. The nonparametric fits given by the other two procedures are similar and hence omitted in the paper. We examined the prediction accuracy for PS, PSL and PSA, in terms of the mean squared prediction errors (MSPE) based the leave-one-out strategy. We also fit linear models for the above data using LASSO. Our analysis shows that the MSPEs for PS, PSL, and PSA are 5.63, 5.6, and 5.47, respectively, while the MSPE for LASSO based on linear models is 2.4. Roughly speaking, the MSPEs using PS, PSL, and PSA are similar, though they provide different model selection results summarized in Table 9. Notice that the PS method keeps more variables in the model than the PSL and PSA; however, the MSPEs are not much different. Thus, using PSL or PSA one can select a subgroup of significant variables to explain the model. Furthermore, the large MSPE based on linear models demonstrates invalidity of simply using linear models for such data. 23

4 G. Cheng et al. Model 3: Fitted confidence envelope fit for f 2 when the true parameter vector is β (n=2, σ=.5) 2.5 Estimated f 2.5 % 5% 9%.5..2.3.4.5.6.7.8.9 t Model 3: 95% pointwise confidence band for f 2 when the true parameter vector is β (n=2, σ=.5) 3 2 Estimated f 2 2..2.3.4.5.6.7.8.9 Fig. 5 The estimated nonlinear functions given by the PSA in Model 3. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 3 with true nonparametric function f 2 and true Euclidean parameter β, n = 2 and σ =.5. In the top plot, the dashed line is for the th best fit, the dotted line is for the 5th best fit, and the dashed-dotted line is for the 9th best among 5 simulations. The bottom plot is a 95% pointwise confidence interval t 5.4 Real example 2: prostate cancer data We analyze the Prostate Cancer data (Stamey et al. 989). The goal is to predict the log level of prostate-specific antigen using a number of clinical measures. The data consist of 97 men who were about to receive a radical prostatectomy. There are eight predictors: X = log cancer volume (lcavol), X 2 = log prostate weight (lweight), X 3 = age, X 4 = log of benign prostatic hyperplasia amount (lbph), X 5 = seminal vesicle invasion (svi), X 6 = log of capsular penetration (lcp), X 7 = Gleason score (gleason), and X 8 = percent of Gleason scores of 4 or 5 (pgg45). A variable selection analysis was conducted in Tibshirani (996) using a linear regression model with LASSO, and it selected three important variables lcavol, lweight, svi as important variables to predict the prostate specific antigen. We fitted partially linear models by treating lweight as a nonlinear term. Table gives the estimated coefficients for different methods. Interestingly, both PSL and PSA select 23

Sparse partial spline 5 Table 9 Estimated coefficients for Ragweed Pollen data Covariate PS PSL PSA Rain.3834.362.386 Temperature.53.45.53 Wind.247.2384.249 Temp temp.42.4.4 Wind wind.4 Fig. 6 The estimated nonlinear function f (day) with its 95% pointwise confidence interval (dotted lines) given by the PSA for the ragweed pollen data 5 fhat 5 2 4 6 8 day Table Estimated coefficients for prostate cancer data Covariate PS PSL PSA lcavol.587.443.562 Age.2 lbph.7 svi.766.346.498 lcp.5 Gleason.45 pgg45.5 23

6 G. Cheng et al. lcavol and svi as important linear variables, which is consistent to the analysis by Tibshirani (996). 6 Discussion We propose a new regularization method for simultaneous variable selection for linear terms and component estimation for the nonlinear term in partial spline models. The oracle properties of the new procedure for variable selection are established. Moreover, we have shown that the new estimator can achieve the optimal convergence rates for both the parametric and nonparametric components. All the above conclusions are also proven to hold in the increasing dimensional situation. The proposed method sets up a basic framework to implement variable selection for partial spline models, and it can be generalized to other types of data analysis. In our future research, we will generalize the results in this paper to the generalized semiparametric models, robust linear regression, or survival data analysis. In this paper, we assume the errors are i.i.d. with constant variance, and the smoothness order of the Sobolev space is fixed as m, though in practice we used m = 2 to facilitate computation. In practice, the problem of heteroscedastic error, i.e., the variance of ɛ is some non-constant function of (X, T ), is often encountered. Meanwhile, the order m may not be always available which needs to be approximated. We will examine the latter two issues in the future. 7 Proofs For simplicity, we use β, β ( β 2 ) and fˆ to represent β PSA, β PSA, ( β PSA,2 ) and fˆ PSA, in the proofs. Definition Let A be a subset of a (pseudo-) metric space (L, d) of real-valued functions. The δ-covering number N(δ,A, d) of A is the smallest N for which there exist functions a,...,a N in L, such that for each a A, d(a, a j ) δ for some j {,...,N}. Theδ-bracketing number N B (δ, A, d) is the smallest N for which there exist pairs of functions {[a L j, au j ]}N j= L, with d(al j, au j ) δ, j =,...,N, such that for each a A there is a j {,...,N} such that a L j a a U j.theδ-entropy number (δ-bracketing entropy number) is defined as H(δ, A, d) = log N(δ, A, d) (H B (δ, A, d) = log N B (δ, A, d)). Entropy Calculations: For each < C < and δ>, we have ( ) C /m H B (δ, {η : η C, J η C}, ) M, (22) δ ( ) C /m H(δ, {η : η C, J η C}, ) M, (23) δ where represents the uniform norm and M is some positive number. 23

Sparse partial spline 7 Proof of Theorem In the proof of (8), we will first show for any given ɛ>, there exists a large constant M such that { } P inf (s) > s =M ɛ, (24) where (s) Q(β +n /2 s) Q(β ). This implies with probability at least ( ɛ) that there exists a local minimum in the ball {β + n /2 s : s M}. Thus, we can conclude that there exists a local minimizer such that ˆβ n β =O P (n /2 ) if (24) holds. Denote the quadratic part of Q(β) as L(β), i.e., L(β) = n (y Xβ) [I A(λ )](y Xβ). Then we can obtain the below inequality: (s) L(β + n /2 s) L(β ) + λ 2 q j= β j + n /2 s j β j β j γ, where s j is the j-th element of vector s. Note that L(β) is a quadratic function of β. Hence, by the Taylor expansion of L(β), we can show that (s) n /2 s L(β ) + 2 s [n L(β )]s + λ 2 q j= β j + n /2 s j β j β j γ, (25) where L(β ) and L(β ) are the first and second derivative of L(β) at β, respectively. Based on (5), we know that L(β ) = (2/n)X [I A(λ )](y Xβ ) and L(β ) = (2/n)X [I A(λ )]X. Combing the proof of Theorem and its four propositions in Heckman (986), we can show that n /2 X [I A(λ )]( f + ɛ) n /2 X A(λ )ɛ d N(,σ 2 R), P. provided that λ and nλ /2m. Therefore, the Slutsky s theorem implies that L(β ) = O P (n /2 ), (26) L(β ) = O P () (27) given the above conditions on λ. Based on (26) and (27), we know the first two terms in the right-hand side of (25) are of the same order, i.e., O P (n ).Andthe second term, which converges to some positive constant, dominates the first one by choosing sufficiently large M. The third term is bounded by n /2 λ 2 M for some 23

8 G. Cheng et al. positive constant M since β j is the consistent estimate for the nonzero coefficient for j =,...,q. Considering that nλ 2, we have completed the proof of (8). We next show the convergence rate for fˆ in terms of n -norm, i.e., (9). Let g (x, t) = x β + f (t), and ĝ(x, t) = x ˆβ + f ˆ(t). Then, by the definition of ( ˆβ, f ˆ), we have ĝ g 2 n + λ J 2ˆ + λ f 2 J ˆβ 2 n ɛ i (ĝ g )(X i, t i ) + λ J 2 f n + λ 2 J β, i= ĝ g 2 n 2 ɛ n ĝ g n + λ J 2 f + λ 2 J β, ĝ g 2 n ĝ g n O P () + o P (), (28) where J β d j= β j / β j γ. The second inequality follows from the Cauchy- Schwartz inequality. The last inequality holds since ɛ has sub-exponential tail, and λ,λ 2. Then the above inequality implies that ĝ g n = O P (), so that ĝ n = O P (). By Sobolev embedding theorem, we can decompose g(x, t) as g (x, t) + g 2 (x, t), where g (x, t) = x β + m j= α j t j and g 2 (x, t) = f 2 (t) with g 2 (x, t) J g2 = J f. Similarly, we can write ĝ = ĝ + ĝ 2, where ĝ = x ˆβ + m j= ˆα j t j = ˆδ φ and ĝ 2 Jĝ. We shall now show that ĝ /( + Jĝ) = O P () via the above Sobolev decomposition. Then ĝ n + Jĝ ĝ n + ĝ 2 n + Jĝ + Jĝ = O P (). (29) Based on the assumption about k φ kφ k /n,(29)impliesthat ˆδ /( + Jĝ) = O P (). Since (X, t) is in a bounded set, ĝ /( + Jĝ) = O P (). Sowehaveprovedthat ĝ /( + Jĝ) = O P (). Thus, the entropy calculation (22) impliesthat { g g H B (δ, : g G, g } ) C, M δ /m, + J g + J g where M is some positive constant, and G ={g(x, t) = x β + f (t) : β R d, J f < }. Based on Theorem 2.2 in Mammen and van de Geer (997) about the continuity modulus of the empirical processes { n i= ɛ i (g g )(z i )} indexed by g and (28), we can establish the following set of inequalities: λ J 2ˆ f 23 [ ĝ g n /2m ( + J ˆ f )/2m ( + J f ˆ 2m )n 2(2m+) ]O P (n /2) +λ J 2 f + λ 2 (J β J ˆβ ), (3)

Sparse partial spline 9 and [ ĝ g 2 n ĝ g n /2m ( + J f ˆ )/2m ( + J f ˆ 2m )n 2(2m+) ]O P (n /2) +λ J 2 f + λ 2 (J β J ˆβ ). (3) Note that λ 2 (J β J ˆβ ) λ 2 q j= β j ˆβ j β j γ + λ 2 d j=q+ β j ˆβ j β j γ O P (n 2m/(2m+)). (32) (32) in the above follows from β β =O P (n /2 ) and (7). Thus, solving the above two inequalities gives ĝ g n = O P (λ /2 ) and J f ˆ = O P () when n 2m/(2m+) λ λ >. Note that ( X n ) ( ( ˆβ β ) n = ( ˆβ β ) X i X i /n ( ˆβ β ) < ˆβ β =O P n /2) i= by (8). Applying the triangle inequality to ĝ g n = O P (λ /2 ), we have proved that fˆ f n = O P (λ /2 ). We next prove 3(a). It suffices to show that Q{( β, )} = min Q{( β, β 2 )} with probability approaching to (33) β 2 Cn /2 for any β satisfying β β =O P (n /2 ) based on (8). To show (33), we need to show that Q(β)/ β j < forβ j ( Cn /2, ), and Q(β)/ β j > for β j (, Cn /2 ),for j = q +,...,d, holds with probability tending to. By two term Taylor expansion of L(β) at β, Q(β)/ β j can be expressed in the following form for j = q +,...,d: Q(β) β j = L(β ) β j + d k= 2 L(β ) β j β k (β k β k ) + λ 2 sgn(β j ) β j γ, where β k is the k th element of vector β. Note that β β =O P (n /2 ) by the above constructions. Hence, we have Q(β) β j = O P (n /2 ) + sgn(β j ) λ 2 β j γ 23