Sparse and efficient estimation for partial spline models with increasing dimension


Ann Inst Stat Math (2015) 67:93–127
DOI 10.1007/s10463-…-y

Sparse and efficient estimation for partial spline models with increasing dimension

Guang Cheng · Hao Helen Zhang · Zuofeng Shang

Received: 28 November 2011 / Revised: 9 September 2013 / Published online: 5 December 2013
© The Institute of Statistical Mathematics, Tokyo 2013

Abstract  We consider model selection and estimation for partial spline models and propose a new regularization method in the context of smoothing splines. The regularization method has a simple yet elegant form, consisting of a roughness penalty on the nonparametric component and a shrinkage penalty on the parametric components, which can achieve function smoothing and sparse estimation simultaneously. We establish the convergence rate and oracle properties of the estimator under weak regularity conditions. Remarkably, the estimated parametric components are sparse and efficient, and the nonparametric component can be estimated at the optimal rate. The procedure also has attractive computational properties. Using the representer theory of smoothing splines, we reformulate the objective function as a LASSO-type problem, which enables us to use the LARS algorithm to compute the solution path. We then extend the procedure to situations in which the number of predictors increases with the sample size, and we investigate its asymptotic properties in that context. Finite-sample performance is illustrated by simulations.

G. Cheng: Supported by NSF Grant DMS-…… and CAREER Award DMS-…….
H. H. Zhang: Supported by NSF grants DMS-……, DMS-……, and NIH grants P01 CA142538 and R01 CA…….

G. Cheng (corresponding author) · Z. Shang
Department of Statistics, Purdue University, West Lafayette, IN, USA
e-mail: chengg@purdue.edu; Z. Shang e-mail: shang9@purdue.edu

H. H. Zhang
Department of Mathematics, University of Arizona, Tucson, AZ, USA
e-mail: hzhang@math.arizona.edu

Keywords  Smoothing splines · Semiparametric models · RKHS · High dimensionality · Solution path · Oracle property · Shrinkage methods

1 Introduction

1.1 Background

Partial smoothing splines are an important class of semiparametric regression models. Developed in the framework of reproducing kernel Hilbert spaces (RKHS), these models provide a compromise between linear and nonparametric models. In general, a partial smoothing spline model assumes that the data (X_i, T_i, Y_i) follow

Y_i = X_i^T β_0 + f_0(T_i) + ε_i,  i = 1, ..., n,  f_0 ∈ W_m[0, 1],   (1)

where X_i ∈ R^d are the linear covariates, T_i ∈ [0, 1] is the nonlinear covariate, and the ε_i's are independent errors with mean zero and variance σ². The space W_m[0, 1] is the mth-order Sobolev–Hilbert space

W_m[0, 1] = {f : f, f^(1), ..., f^(m−1) are absolutely continuous, f^(m) ∈ L_2[0, 1]}

for m ≥ 1. Here f^(j) denotes the jth derivative of f. The function f_0(t) is the nonparametric component of the model. Denote the observations of (X_i, T_i, Y_i) by (x_i, t_i, y_i) for i = 1, 2, ..., n. The standard approach to computing the partial spline (PS) estimator is to minimize the penalized least squares

(β̂_PS, f̂_PS) = argmin_{β ∈ R^d, f ∈ W_m}  (1/n) Σ_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ J_f²,   (2)

where λ ≥ 0 is a smoothing parameter and J_f² = ∫_0^1 [f^(m)(t)]² dt is the roughness penalty on f; see Kimeldorf and Wahba (1971), Craven and Wahba (1979), Denby (1984), and Green and Silverman (1994) for details. It is known that the solution f̂_PS is a natural spline (Wahba 1990) of order 2m on [0, 1] with knots at t_i, i = 1, ..., n. Asymptotic theory for partial splines has been developed by several authors (Shang and Cheng 2013; Rice 1986; Heckman 1986; Speckman 1988; Shiau and Wahba 1988). In this paper, we mainly consider partial smoothing splines in the framework of Wahba (1984).
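To make (2) concrete, here is a small self-contained sketch that simulates data from model (1) and computes a partial spline fit by penalized least squares. It is only an illustration under stated assumptions: m = 2, and a cubic B-spline basis with a second-difference penalty is used as a stand-in for the exact natural-spline solution (the exact representer-based computation is described in Sect. 4). All function names and settings below are ours, not the paper's.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(t, num_interior=18, degree=3):
    """Cubic B-spline design matrix on [0, 1] (open uniform knot vector)."""
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, num_interior + 2),
                            np.ones(degree)])
    n_basis = len(knots) - degree - 1
    return np.column_stack(
        [BSpline(knots, np.eye(n_basis)[j], degree)(t) for j in range(n_basis)])

def fit_partial_spline(y, X, t, lam, num_interior=18):
    """Penalized least squares in the spirit of (2):
    min (1/n)||y - X b - N c||^2 + lam * c' P c,
    where P is a second-difference penalty standing in for the integral of (f'')^2."""
    n = len(y)
    N = bspline_design(t, num_interior)
    D = np.diff(np.eye(N.shape[1]), n=2, axis=0)      # second differences
    P = D.T @ D
    Z = np.hstack([X, N])
    pen = np.zeros((Z.shape[1], Z.shape[1]))
    pen[X.shape[1]:, X.shape[1]:] = n * lam * P
    coef = np.linalg.solve(Z.T @ Z + pen, Z.T @ y)
    beta, c = coef[:X.shape[1]], coef[X.shape[1]:]
    f_hat = lambda tnew: bspline_design(np.atleast_1d(tnew), num_interior) @ c
    return beta, f_hat

# Toy data from a model of the form (1); these numbers are illustrative only.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.uniform(0, 1, size=(n, d))
t = np.sort(rng.uniform(0, 1, size=n))
beta0 = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = X @ beta0 + np.sin(2 * np.pi * t) + rng.normal(0, 0.5, n)
beta_ps, f_ps = fit_partial_spline(y, X, t, lam=1e-4)
```

With λ chosen by GCV (Sect. 4.2), a fit of this kind provides the initial estimate used to build the adaptive weights introduced in Sect. 2.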

1.2 Model selection for partial splines

Variable selection is important for data analysis and model building, especially for high-dimensional data, as it helps to improve the model's prediction accuracy and interpretability. For linear models, various penalization procedures have been proposed to obtain a sparse model, including the non-negative garrote (Breiman 1995), the LASSO (Tibshirani 1996), the SCAD (Fan and Li 2001; Fan and Peng 2004), and the adaptive LASSO (Zou 2006; Wang et al. 2007). Contemporary research frequently deals with problems in which the input dimension d diverges to infinity as the sample size increases (Fan and Peng 2004), and there is also active research on linear model selection in these situations (Fan and Peng 2004; Zou and Zhang 2009; Fan and Lv 2008; Huang et al. 2008a, b).

In this paper, we propose and study a new approach to variable selection for partially linear models in the framework of smoothing splines. The procedure leads to a regularization problem in the RKHS, whose unified formulation facilitates both numerical computation and asymptotic inference for the estimator. To conduct variable selection, we employ the adaptive LASSO penalty on the linear parameters. One advantage of the procedure is its easy implementation. We show that, by using the representer theory (Wahba 1990), the optimization problem can be reformulated as a LASSO-type problem, so that the entire solution path can be computed by the LARS algorithm (Efron et al. 2004). We show that the new procedure can asymptotically (1) correctly identify the sparse model structure; (2) estimate the nonzero β_j's consistently and achieve semiparametric efficiency; and (3) estimate the nonparametric component f at the optimal nonparametric rate. We also investigate the properties of the new procedure with a diverging number of predictors (Fan and Peng 2004).

From now on, we regard (Y_i, X_i) as i.i.d. realizations from some probability distribution. We assume that the x_i's belong to some compact subset of R^d and that they are standardized so that Σ_{i=1}^n x_{ij}/n = 0 and Σ_{i=1}^n x_{ij}²/n = 1 for j = 1, ..., d, where x_i = (x_{i1}, ..., x_{id})^T. Also assume t_i ∈ [0, 1] for all i. Throughout the paper, we use the convention that 0/0 = 0.

The rest of the article is organized as follows. Section 2 introduces our new double-penalty estimation procedure for partial spline models. Section 3 is devoted to the two main theoretical results: we first establish the convergence rates and oracle properties of the estimators in the standard situation with fixed d, and then extend these results to the situation in which d diverges with the sample size n. Section 4 gives the computational algorithm; in particular, we show how to compute the solution path using the LARS algorithm, and the issue of parameter tuning is also discussed. Section 5 illustrates the performance of the procedure via simulations and real examples. Discussions and technical proofs are presented in Sects. 6 and 7.

2 Method

We assume that t_1 < t_2 < ... < t_n. In order to achieve a smooth estimate of the nonparametric component and sparse estimates of the parametric components simultaneously, we consider the following regularization problem:

min_{β ∈ R^d, f ∈ W_m}  (1/n) Σ_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ_1 ∫_0^1 [f^(m)(t)]² dt + λ_2 Σ_{j=1}^d w_j |β_j|.   (3)

The penalty term in (3) is naturally formed as a combination of the roughness penalty on f and the weighted LASSO penalty on β. Here, λ_1 controls the smoothness of the estimated nonlinear function, while λ_2 controls the degree of shrinkage on the β_j's. The weights w_j are pre-specified. For convenience, we will refer to this procedure as PSA (Partial Splines with Adaptive penalty).
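As a compact summary of (3), and as a forward pointer to the two weight choices compared in Sect. 5 (constant weights give the un-weighted variant PSL, data-driven weights give PSA), the double penalty can be written as below; the display is a restatement of (3) rather than an additional assumption.

```latex
\min_{\beta\in\mathbb{R}^d,\; f\in W_m}\;
\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i - x_i^{\top}\beta - f(t_i)\bigr]^2
+\underbrace{\lambda_1\int_0^1\bigl[f^{(m)}(t)\bigr]^2\,dt}_{\text{smoothness of } f}
+\underbrace{\lambda_2\sum_{j=1}^{d} w_j\,\lvert\beta_j\rvert}_{\text{sparsity of }\beta},
\qquad
w_j \equiv 1 \ (\text{PSL}),\quad
w_j = 1/\lvert\tilde{\beta}_j\rvert^{\gamma} \ (\text{PSA}).
```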

Note that the w_j's should be adaptively chosen so that they take large values for unimportant covariates and small values for important covariates. In particular, we propose using w_j = 1/|β̃_j|^γ, where β̃ = (β̃_1, ..., β̃_d)^T is some consistent estimate of β in model (1) and γ is a fixed positive constant. For example, the standard partial smoothing spline estimate β̂_PS can be used to construct the weights. We therefore obtain the following optimization problem:

(β̂_PSA, f̂_PSA) = argmin_{β ∈ R^d, f ∈ W_m}  (1/n) Σ_{i=1}^n [y_i − x_i^T β − f(t_i)]² + λ_1 ∫_0^1 [f^(m)(t)]² dt + λ_2 Σ_{j=1}^d |β_j| / |β̃_j|^γ.   (4)

When β is fixed, standard smoothing spline theory implies that the solution to (4) is linear in the residual (y − Xβ), i.e., f̂(β) = A(λ_1)(y − Xβ), where y = (y_1, ..., y_n)^T, X = (x_1, ..., x_n)^T, and the matrix A(λ_1) is the smoother or influence matrix (Wahba 1984). The expression for A(λ_1) will be given in Sect. 4. Plugging f̂(β) into (4), we obtain an equivalent objective function for β:

Q(β) = (1/n) (y − Xβ)^T [I − A(λ_1)] (y − Xβ) + λ_2 Σ_{j=1}^d |β_j| / |β̃_j|^γ,   (5)

where I is the identity matrix of size n. The PSA solution can then be computed as

β̂_PSA = argmin_β Q(β),   f̂_PSA = A(λ_1)(y − X β̂_PSA).

Special software, such as quadratic programming (QP) or LARS (Efron et al. 2004), is needed to obtain the solution.
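The profiled criterion (5) is what the algorithm of Sect. 4 actually minimizes. A minimal sketch of evaluating it, assuming a smoother matrix A(λ_1) and an initial estimate β̃ are already available (the exact construction of A(λ_1) appears in Sect. 4; the function names below are ours):

```python
import numpy as np

def profiled_objective(beta, y, X, A, beta_init, lam2, gamma=1.0):
    """Q(beta) in (5): the nonparametric part is profiled out, leaving a
    quadratic form in I - A(lam1) plus the adaptive-LASSO penalty on beta."""
    n = len(y)
    r = y - X @ beta
    quad = r @ (r - A @ r) / n                 # (1/n) * r' (I - A(lam1)) r
    w = 1.0 / np.abs(beta_init) ** gamma       # adaptive weights w_j = 1/|beta_tilde_j|^gamma
    return quad + lam2 * np.sum(w * np.abs(beta))

def psa_nonparametric_fit(beta_hat, y, X, A):
    """Once beta_hat minimizes Q, the nonparametric fit is A(lam1)(y - X beta_hat)."""
    return A @ (y - X @ beta_hat)
```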

3 Statistical theory

We write the true coefficient vector as β_0 = (β_{01}, ..., β_{0d})^T = (β_{10}^T, β_{20}^T)^T, where β_{10} consists of all q nonzero components and β_{20} consists of the remaining (d − q) zero components, and we write the true function f as f_0. We also write the estimated vector as β̂_PSA = (β̂_1, ..., β̂_d)^T = (β̂_{PSA,1}^T, β̂_{PSA,2}^T)^T. In addition, assume that X_i has zero mean and a strictly positive definite covariance matrix R. The observations t_i satisfy

∫_0^{t_i} u(w) dw = i/n  for i = 1, ..., n,   (6)

where u(·) is a continuous and strictly positive function that does not depend on n.

3.1 Asymptotic results for fixed d

We show that, for any fixed γ > 0, if λ_1 and λ_2 converge to zero at proper rates, then both the parametric and nonparametric components can be estimated at their optimal rates. Moreover, our estimation procedure produces a nonparametric estimate f̂_PSA with the desired smoothness, i.e., J_{f̂_PSA} = O_P(1); see (10) below. Meanwhile, we conclude that our double penalization procedure estimates the nonparametric function well enough for the weighted LASSO estimates to achieve the oracle properties.

In the following we use ‖·‖ and ‖·‖_2 to represent the Euclidean norm and the L_2-norm, respectively, and we use ‖·‖_n to denote the empirical L_2-norm, i.e., ‖F‖_n² = Σ_{i=1}^n F²(s_i)/n. We derive our convergence rate results under the following regularity conditions:

R1. ε is independent of X and has a sub-exponential tail, i.e., E[exp(|ε|/C_0)] ≤ C_0 for some 0 < C_0 < ∞; see Mammen and van de Geer (1997).
R2. Σ_k φ_k φ_k^T / n converges in probability to some non-singular matrix, where φ_k = [1, t_k, ..., t_k^{m−1}, x_{k1}, ..., x_{kd}]^T.

Theorem 1  Consider the minimization problem (4), where γ > 0 is a fixed constant. Assume that the initial estimate β̃ is consistent. If

n^{2m/(2m+1)} λ_1 → λ_{10} > 0,   √n λ_2 → 0,   and   n^{(2m−1)/(2(2m+1))} λ_2 / |β̃_j|^γ →_P λ_{20} > 0  for j = q+1, ..., d   (7)

as n → ∞, then we have:

1. there exists a local minimizer β̂_PSA of (4) such that

   ‖β̂_PSA − β_0‖ = O_P(n^{−1/2});   (8)

2. the nonparametric estimate f̂_PSA satisfies

   ‖f̂_PSA − f_0‖_n = O_P(λ_1^{1/2}),   (9)
   J_{f̂_PSA} = O_P(1);   (10)

3. the local minimizer β̂_PSA = (β̂_{PSA,1}^T, β̂_{PSA,2}^T)^T satisfies
   (a) Sparsity: P(β̂_{PSA,2} = 0) → 1.
   (b) Asymptotic normality: √n (β̂_{PSA,1} − β_{10}) →_d N(0, σ² R_1^{−1}), where R_1 is the q × q upper-left submatrix of the covariance matrix R of X_i.

Remark 1  Note that t is assumed to be nonrandom and to satisfy condition (6), and that EX = 0. In this case, the semiparametric efficiency bound for estimating β_{10} in the partly linear model under sparsity is just σ²R_1^{−1}; see van der Vaart and Wellner (1996). Thus, we can claim that β̂_{PSA,1} is semiparametrically efficient. If we use the partial spline solution to construct the weights in (4) and choose γ = 1 and n^{2m/(2m+1)} λ_i → λ_{i0} > 0 for i = 1, 2, the above theorem implies that the double-penalized estimators achieve the optimal rates for both parametric and nonparametric estimation, i.e., (8)–(9), and that β̂_PSA possesses the oracle properties, i.e., the asymptotic normality of β̂_{PSA,1} and the sparsity of β̂_{PSA,2}.

3.2 Asymptotic results for diverging d_n

Let β_0 = (β_{10}^T, β_{20}^T)^T ∈ R^{q_n} × R^{m_n} = R^{d_n}, where m_n = d_n − q_n. Let x_i = (w_i^T, z_i^T)^T, where w_i consists of the first q_n covariates and z_i consists of the remaining m_n covariates. Thus we can define the matrices X_1 = (w_1, ..., w_n)^T and X_2 = (z_1, ..., z_n)^T. For any matrix K we denote its smallest and largest eigenvalues by λ_min(K) and λ_max(K), respectively. We now give the additional regularity conditions required to establish the large-sample theory in the increasing-dimension case:

R1D. There exist constants 0 < b_0 < b_1 < ∞ such that
     b_0 ≤ min{|β_{0j}| : 1 ≤ j ≤ q_n} ≤ max{|β_{0j}| : 1 ≤ j ≤ q_n} ≤ b_1.
R2D. λ_min(Σ_k φ_k φ_k^T / n) ≥ c_3 > 0 for any n.
R3D. Let R be the covariance matrix of the vector X_i. We assume that 0 < c_1 ≤ λ_min(R) ≤ λ_max(R) ≤ c_2 < ∞ for any n.

Conditions R2D and R3D are equivalent to Condition R2 when d_n is assumed to be fixed.

3.2.1 Convergence rates of β̂_PSA and f̂_PSA

We first present a lemma concerning the convergence rate of the initial estimate β̂_PS under the increasing dimension d_n. For two deterministic sequences p_n, q_n = o(1), we use the symbol p_n ≍ q_n to indicate that p_n = O(q_n) and q_n = O(p_n). Define x ∨ y (x ∧ y) to be the maximum (minimum) of x and y.

Lemma 1  Suppose that β̂_PS is the partial smoothing spline estimate. Then we have

‖β̂_PS − β_0‖ = O_P(√(d_n/n)),  provided d_n = o(n^{1/2} ∧ n λ_1^{1/(2m)}).   (11)

Our next theorem gives the convergence rates for β̂_PSA and f̂_PSA when the dimension of β diverges to infinity. In this increasing-dimension setup, we find three results: (i) the convergence rate of β̂_PSA coincides with that of the estimator in the linear regression model with increasing dimension (Portnoy 1984); thus we can conclude that the presence of the nonparametric function and the sparsity of β_0 do not affect the overall convergence rate of β̂_PSA;

(ii) the convergence rate of f̂_PSA is slower than that of the regular partial smoothing spline, i.e., O_P(n^{−m/(2m+1)}), and is controlled by the dimension of the important components of β_0, i.e., q_n; and (iii) the nonparametric estimator f̂_PSA always satisfies the desired smoothness condition, i.e., J_{f̂_PSA} = O_P(1), even when the dimension of β diverges.

Theorem 2  Suppose that d_n = o(n^{1/2} ∧ n λ_1^{1/(2m)}), n λ_1^{1/(2m)} → ∞, and √(n/d_n) λ_2 → 0. Then we have

‖β̂_PSA − β_0‖ = O_P(√(d_n/n)).   (12)

If we further assume that

λ_1/q_n ≍ n^{−2m/(2m+1)}   and   √(n/d_n) (λ_2/q_n) max_{j=q_n+1,...,d_n} |β̃_j|^{−γ} = O_P(n^{1/(2m+1)} d_n^{−3/2}),   (13)

then we have

‖f̂_PSA − f_0‖_n = O_P(√(d_n/n) ∨ (n^{−m/(2m+1)} √q_n)),   (14)
J_{f̂_PSA} = O_P(1).   (15)

It seems nontrivial to improve the rate of convergence of the parametric estimate to the minimax optimal rate √(q_n log d_n / n) proven in Bickel et al. (2009). The main reason is that the latter rate is proven in the (finite) dictionary learning framework, which requires that the nonparametric function be well approximated by a member of the span of a finite dictionary of (basis) functions. This key assumption does not straightforwardly hold in our smoothing spline setup. In addition, it is also unclear how to relax the Gaussian error condition assumed in Bickel et al. (2009) to the fairly weak sub-exponential tail condition assumed in our paper.

3.2.2 Oracle properties

In this subsection, we show that the desired oracle properties can also be achieved in the increasing-dimension case. In particular, when showing the asymptotic normality of β̂_{PSA,1}, we consider an arbitrary linear combination of β_{10}, say G_n β_{10}, where G_n is an arbitrary l × q_n matrix with a finite l.

Theorem 3  Given the following conditions:

D1. d_n = o(n^{1/3} ∧ n^{2/3} λ_1^{1/(3m)}) and q_n = o(n^{−1} λ_2^{−2});
S1. λ_1 satisfies λ_1/q_n ≍ n^{−2m/(2m+1)} and n^{m/(2m+1)} λ_1 → 0;
S2. λ_2 satisfies

√(n/d_n) λ_2 / min_{j=q_n+1,...,d_n} |β̃_j|^γ →_P ∞;   (16)

we have

(a) Sparsity: P(β̂_{PSA,2} = 0) → 1.
(b) Asymptotic normality:

√n G_n R_1^{1/2} (β̂_{PSA,1} − β_{10}) →_d N(0, σ² G),   (17)

where G_n is a non-random l × q_n matrix with full row rank such that G_n G_n^T → G.

In Corollary 1, we give the fastest possible rates of increase for the dimension of β_0 and for the number of its important components that still guarantee estimation efficiency and selection consistency. The corresponding ranges of the smoothing and shrinkage parameters are also given.

Corollary 1  Let γ = 1 and suppose that β̃ is the partial smoothing spline solution. Then we have

1. ‖β̂_PSA − β_0‖ = O_P(√(d_n/n)) and ‖f̂_PSA − f_0‖_n = O_P(√(d_n/n) ∨ (n^{−m/(2m+1)} √q_n));
2. β̂_PSA possesses the oracle properties;

if the following dimension and smoothing parameter conditions hold:

d_n = o(n^{1/3}) and q_n = o(n^{1/3}),   (18)
n λ_1^{1/(2m)} → ∞,  n^{m/(2m+1)} λ_1 → 0,  and  λ_1/q_n ≍ n^{−2m/(2m+1)},   (19)
√(n/d_n) λ_2 → 0,  (n/d_n) λ_2 → ∞,  and  √d_n (n/q_n) λ_2 = O(n^{1/(2m+1)}).   (20)

Define d_n ≍ n^{d} and q_n ≍ n^{q}, where 0 ≤ q ≤ d < 1/3 according to (18). For the usual case m ≥ 2, we can give a set of sufficient conditions for (19)–(20) as follows: λ_1 ≍ n^{−r_1} and λ_2 ≍ n^{−r_2} with r_1 = 2m/(2m+1) − q and r_2 = d/2 + 2m/(2m+1) − q, provided (1−d)/2 < r_2 < 1−d. The above conditions are very easy to check. For example, if m = 2, we can set λ_1 ≍ n^{−0.55} and λ_2 ≍ n^{−0.675} when d_n ≍ n^{1/4} and q_n ≍ n^{1/4}.

In Ni et al. (2009), the authors considered variable selection in partly linear models with increasing dimension. For m = 2, they applied the SCAD penalty to the parametric part and proved the oracle property together with asymptotic normality. In this paper, we consider general m and apply the adaptive LASSO to the parametric part. Besides the oracle property, we also derive the rate of convergence of the nonparametric estimate. The technical proofs rely on nontrivial applications of RKHS theory and modern empirical process theory. Therefore, our results are substantially different from those of Ni et al. (2009). The numerical results provided in Sect. 5 demonstrate satisfactory selection accuracy. Furthermore, we are able to report the estimation accuracy of the nonparametric estimate, which is also satisfactory.

4 Computation and tuning

4.1 Algorithm

We propose a two-step procedure to obtain the PSA estimator: first compute β̂_PSA, then compute f̂_PSA. As shown in Sect. 2, we need to minimize (5) to estimate β.

Define the square-root matrix of I − A(λ_1) as T, i.e., I − A(λ_1) = T^T T. Then (5) can be reformulated as a LASSO-type problem

min_{β*}  (1/n) (y* − X* β*)^T (y* − X* β*) + λ_2 Σ_{j=1}^d |β*_j|,   (21)

where the transformed variables are y* = T y, X* = T X W, and β*_j = β_j / |β̃_j|^γ for j = 1, ..., d, with W = diag{|β̃_j|^γ}. Therefore, (21) can be conveniently solved with the LARS algorithm (Efron et al. 2004).

Now assume β̂_PSA has been obtained. Using standard smoothing spline theory, it is easy to show that f̂_PSA = A(λ_1)(y − X β̂_PSA), where A is the influence matrix. By reproducing kernel Hilbert space theory (Kimeldorf and Wahba 1971), W_m[0, 1] is an RKHS when equipped with the inner product

(f, g) = Σ_{ν=0}^{m−1} [∫_0^1 f^(ν)(t) dt][∫_0^1 g^(ν)(t) dt] + ∫_0^1 f^(m)(t) g^(m)(t) dt.

We can decompose W_m[0, 1] = H_0 ⊕ H_1 as a direct sum of two RKHS subspaces. In particular, H_0 = {f : f^(m) = 0} = span{k_ν(t), ν = 0, ..., m−1}, where k_ν(t) = B_ν(t)/ν! and the B_ν(t) are Bernoulli polynomials (Abramowitz and Stegun 1964). The subspace H_1 = {f : ∫_0^1 f^(ν)(t) dt = 0, ν = 0, ..., m−1; f^(m) ∈ L_2[0, 1]} is associated with the reproducing kernel

K_1(t, s) = k_m(t) k_m(s) + (−1)^{m−1} k_{2m}([s − t]),

where [τ] is the fractional part of τ. Let S be the n × m matrix with entries s_{i,ν} = k_ν(t_i), ν = 0, ..., m−1, and let Σ be the n × n square matrix with (i, j)-th entry K_1(t_i, t_j). Let the QR decomposition of S be S = (F_1, F_2)(U^T, 0^T)^T, where F = [F_1, F_2] is orthogonal and U is upper triangular, so that S^T F_2 = 0. As shown in Wahba (1984) and Gu (2002), the influence matrix A can be expressed as

A(λ_1) = I − n λ_1 F_2 (F_2^T V F_2)^{−1} F_2^T,   where V = Σ + n λ_1 I.

Using the representer theorem (Wahba 1990), we can compute the nonparametric estimator as

f̂_PSA(t) = Σ_{ν=0}^{m−1} b̂_ν k_ν(t) + Σ_{i=1}^n ĉ_i K_1(t, t_i).

We summarize the algorithm as follows:

Step 1. Fit the standard partial smoothing spline and construct the weights w_j. Compute y* and X*.
Step 2. Solve (21) using the LARS algorithm. Denote the solution by β̂* = (β̂*_1, ..., β̂*_d)^T.
Step 3. Calculate β̂_PSA = (β̂_1, ..., β̂_d)^T via β̂_j = β̂*_j |β̃_j|^γ for j = 1, ..., d.
Step 4. Obtain the nonparametric fit f̂_PSA = S b̂ + Σ ĉ, where the coefficients are computed as ĉ = F_2 (F_2^T V F_2)^{−1} F_2^T (y − X β̂_PSA) and b̂ = U^{−1} F_1^T (y − X β̂_PSA − Σ ĉ).
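The reformulation (21) is easy to carry out with generic tools. Below is a minimal sketch assuming a smoother matrix A(λ_1) and an initial estimate β̃ are already available (for instance from the partial spline sketch after (2)); it builds T from the symmetric square root of I − A(λ_1) and runs LARS via scikit-learn's LassoLars. The function name and interface are ours, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def psa_via_lars(y, X, A, beta_init, lam2, gamma=1.0):
    """Two-step PSA sketch: transform to the LASSO-type problem (21), solve it
    by LARS, then back out the parametric and nonparametric fits."""
    n, d = X.shape
    # Symmetric square root T of I - A(lam1); I - A is symmetric positive
    # semidefinite for a smoothing-spline smoother, so eigh is enough.
    evals, evecs = np.linalg.eigh(np.eye(n) - (A + A.T) / 2)
    T = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    W = np.diag(np.abs(beta_init) ** gamma)        # W = diag{|beta_tilde_j|^gamma}
    y_star, X_star = T @ y, T @ X @ W
    # LassoLars minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, so alpha = lam2/2
    # reproduces the (1/n)-scaled criterion in (21).
    lars = LassoLars(alpha=lam2 / 2.0, fit_intercept=False)
    lars.fit(X_star, y_star)
    beta_star = lars.coef_
    beta_hat = W @ beta_star                        # undo the weighting (Step 3)
    f_hat = A @ (y - X @ beta_hat)                  # Step 4 via the smoother matrix
    return beta_hat, f_hat
```

In practice A(λ_1) would be computed from the Bernoulli-polynomial kernel and QR pieces above; any linear smoother whose I − A is symmetric positive semidefinite can be plugged into this sketch.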

4.2 Parameter tuning

One possible tuning approach for the double-penalized estimator is to choose (λ_1, λ_2) jointly by minimizing some score. Following the local quadratic approximation (LQA) technique used in Tibshirani (1996) and Fan and Li (2001), we can derive the GCV score as a function of (λ_1, λ_2). Define the diagonal matrix D(β) = diag{1/(|β̃_1| |β_1|), ..., 1/(|β̃_d| |β_d|)}. The solution β̂_PSA can be approximated by

[X^T {I − A(λ_1)} X + n λ_2 D(β̂_PSA)]^{−1} X^T {I − A(λ_1)} y ≡ H y.

Correspondingly, f̂_PSA = A(λ_1)(y − X β̂_PSA) = A(λ_1)[I − X H] y. Therefore, the predicted response can be approximated as ŷ = X β̂_PSA + f̂_PSA = M(λ_1, λ_2) y, where M(λ_1, λ_2) = X H + A(λ_1)[I − X H], and the number of effective parameters in the double-penalized fit (β̂_PSA, f̂_PSA) may be approximated by tr(M(λ_1, λ_2)). The GCV score can then be constructed as

GCV(λ_1, λ_2) = [n^{−1} Σ_{i=1}^n (y_i − ŷ_i)²] / [1 − n^{−1} tr(M(λ_1, λ_2))]².

The two-dimensional search is computationally expensive in practice. In the following, we suggest an alternative two-stage tuning procedure. Since λ_1 controls the partial spline fit (β̃, b̃, c̃), we first select λ_1 using GCV at Step 1 of the computational algorithm:

GCV(λ_1) = [n^{−1} Σ_{i=1}^n (y_i − ỹ_i)²] / [1 − n^{−1} tr{Ã(λ_1)}]²,

where ỹ = (ỹ_1, ..., ỹ_n)^T is the partial spline prediction and Ã(λ_1) is the influence matrix of the partial spline solution. Let λ̃_1 = argmin_{λ_1} GCV(λ_1). We can also select λ_1 using GCV in the smoothing spline problem Y_i − X_i^T β̄ = f(t_i) + ε_i, where β̄ is a √n-consistent difference-based estimator (Yatchew 1997). This substitution approach is theoretically valid for selecting λ_1, since the convergence rate of β̄ is faster than the nonparametric rate for estimating f, and thus β̄ can be treated as the true value. At the subsequent steps, we fix λ_1 at λ̃_1 and select only λ_2 for optimal variable selection. Wang et al. (2007), Zhang and Lu (2007), and Wang et al. (2009) suggested that, when tuning λ_2 for the adaptive LASSO in the context of linear models, BIC works better than GCV in terms of consistent model selection, even with diverging dimension. Therefore, we propose to choose λ_2 by minimizing

BIC(λ_2) = (y − X β̂_PSA − f̂_PSA)^T (y − X β̂_PSA − f̂_PSA) / σ̂² + log(n) · r,

where r is the number of nonzero coefficients in β̂_PSA and the estimated residual variance σ̂² is obtained from the standard partial spline model, i.e., σ̂² = (y − X β̂_PS − f̂_PS)^T (y − X β̂_PS − f̂_PS) / (n − tr(Ã(λ̃_1)) − d).
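A sketch of this two-stage tuning, assuming a routine `smoother_matrix(lam1)` that returns the smoothing-spline smoother A(λ_1) and a solver such as the `psa_via_lars` sketch above (both are our stand-ins, not the paper's code):

```python
import numpy as np

def gcv_lambda1(y, X, lam1_grid, smoother_matrix):
    """Stage 1: choose lambda_1 by GCV over the standard partial spline fit."""
    n, best = len(y), None
    for lam1 in lam1_grid:
        A = smoother_matrix(lam1)
        # partial spline coefficient map: beta_tilde = H @ y
        H = np.linalg.solve(X.T @ (X - A @ X), X.T @ (np.eye(n) - A))
        A_tilde = X @ H + A @ (np.eye(n) - X @ H)   # influence matrix of the PS fit
        resid = y - A_tilde @ y
        gcv = np.mean(resid ** 2) / (1.0 - np.trace(A_tilde) / n) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam1, A, H @ y)
    _, lam1_opt, A_opt, beta_ps = best
    return lam1_opt, A_opt, beta_ps

def bic_lambda2(y, X, A, beta_ps, sigma2_hat, lam2_grid, solver):
    """Stage 2: with lambda_1 fixed, choose lambda_2 by the BIC criterion."""
    best = None
    for lam2 in lam2_grid:
        beta_hat, f_hat = solver(y, X, A, beta_ps, lam2)
        rss = np.sum((y - X @ beta_hat - f_hat) ** 2)
        bic = rss / sigma2_hat + np.log(len(y)) * np.count_nonzero(beta_hat)
        if best is None or bic < best[0]:
            best = (bic, lam2, beta_hat, f_hat)
    return best[1:]
```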

5 Numerical studies

5.1 Simulation

We compare the standard partial smoothing spline model with the new procedure under the LASSO penalty (with w_j = 1 in (3)) and under the adaptive LASSO (ALASSO) penalty. In the following, these three methods are referred to as PS, PSL, and PSA, respectively. We also include the Oracle model fit, obtained as if the true model were known. In all the examples, we use γ = 1 for PSA and consider two sample sizes, n = 100 and n = 200. The smoothness parameter m was chosen to be 2 in all the numerical experiments. In each setting, a total of 500 Monte Carlo (MC) simulations are carried out. We report the MC sample mean and standard deviation (given in parentheses) of the MSEs. Following Fan and Li (2004), we use the mean squared error MSE(β̂) = E‖β̂ − β_0‖² and the mean integrated squared error MISE(f̂) = E∫_0^1 {f̂(t) − f_0(t)}² dt to evaluate the goodness of fit of the parametric and nonparametric estimates, respectively, and we compute them by averaging over the data knots in the simulations. To evaluate the variable selection performance of each method, we report the number of coefficients correctly set to zero ("correct"), the number of coefficients incorrectly set to zero ("incorrect"), the model size, and the empirical probability of capturing the true model.

We generate data from the model Y_i = X_i^T β_0 + f_0(T_i) + ε_i and consider the two following model settings:

Model 1: β_0 = (3, 2.5, 2, 1.5, 0, ..., 0)^T, d = 15 and q = 4, and f_1(t) = 1.5 sin(2πt).

Model 2: β_0 = (3, ..., 3, 0, ..., 0)^T with ten nonzero components, d = 20 and q = 10. The nonparametric function is f_2(t) = t^{10}(1 − t)^4/(3B(11, 5)) + 4t^4(1 − t)^{10}/(15B(5, 11)), where the beta function is B(u, v) = ∫_0^1 t^{u−1}(1 − t)^{v−1} dt. Two model coefficient vectors, β_1 = β_0 and β_2 = β_0/3, were considered. The Euclidean norms of β_1 and β_2 are ‖β_1‖ = 9.49 and ‖β_2‖ = 3.16, respectively, and the sup-norm of f_2 is ‖f_2‖_sup = 1.16. So the ratios ‖β_1‖/‖f_2‖_sup = 8.18 and ‖β_2‖/‖f_2‖_sup = 2.72, denoted as the parametric-to-nonparametric signal ratios (PNSR). The two settings represent high and low PNSRs, respectively.

Two possible distributions are used for the covariates X and T:

Model 1: X_1, ..., X_15, T are i.i.d. generated from Unif(0, 1).

12 4 G. Cheng et al. Table Variable selection and fitting results for Model σ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.5 PS.578 (.).5 (.) 5 () () () PSL.36 (.8).5 (.) 7.34 (.9) 7.66 (.9). (.) PSA.234 (.8).4 (.) 4.53 (.4).47 (.4). (.) Oracle.29 (.4).4 (.) 4 () () () 2 PS.249 (.4).8 (.) 5 () () () PSL.47 (.4).8 (.) 7.6 (.9) 7.84 (.9). (.) PSA. (.4).8 (.) 4.36 (.4).64 (.4). (.) Oracle.63 (.).7 (.) 4 () () () PS (.4).55 (.2) 5 () () () PSL.256 (.32).5 (.2) 7.36 (.9) 7.64 (.9). (.) PSA. (.36).5 (.2) 4.72 (.5).25 (.5).2 (.) Oracle.5 (.7).48 (.2) 4 () () () 2 PS.989 (.7).28 (.) 5 () () () PSL.587 (.6).27 (.) 7.2 (.9) 7.8 (.9). (.) PSA.479 (.4).26 (.) 4.42 (.4).58 (.4). (.) Oracle.252 (.8).26 (.) 4 () () Model 2: X = (X,...,X 2 ) are standard normal with AR() correlation, i.e., corr(x i, X j ) = ρ i j. T follows Unif(, ) and is independent with X i s. We consider ρ =.3 and ρ =.6. Two possible error distributions are used in these two settings: Model : (normal error) ɛ N(,σ 2 ), with σ =.5 and σ =, respectively. Model 2: (non-normal error) ɛ 2 t, t-distribution with degrees of freedom. Table compares the model fitting and variable selection performance of various procedures in different settings for Model. It is evident that the PSA procedure outperforms both the PS and PSL in terms of both the MSE and variable selection. The three procedures give similar performance in estimating the nonparametric function. Table 2 shows that the PSA works much better in distinguishing important variables from unimportant variables than PSL. For example, when σ =.5, the PSA identifies the correct model 5.7 = 35 times out of 5 times when n = and 5.78 = 39 times when n = 2, while the PSL identifies the correct model only 5.9 = 45 times when n = and 5. = 5 times when n = 2. To present the performance of our nonparametric estimation procedure, we plot the estimated functions for Model in the below Fig.. The top row of Fig. depicts the typical estimated curves corresponding to the th best, the 5th best (median), and the 9th best according to MISE among simulations when n = 2 and σ =.5. It can be seen that the fitted curves are overall able to capture the shape of the true function very well. In order to describe the sampling variability of the estimated nonparametric 23

13 Sparse partial spline 5 Table 2 Variable selection relative frequency in percentage over 5 runs for Model σ n Important index Unimportant variable index P(correct) PSL PSA PSL PSA PSL PSA PSL PSA

14 6 G. Cheng et al. 2 Model : Fitted confidence envelope fit for f (n=2, σ=.5) Estimated f % 5% 9% t 2 Model : 95% pointwise confidence band for f (n=2, σ=.5) Estimated f t Fig. The estimated nonlinear functions given by the PSA in Model. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 2 with n = 2 and σ =.5. In the top plot, the dashed line is for the th best fit, the dotted line is for the 5th best fit, and the dashed-dotted line is for the 9th best among 5 simulations. The bottom plot is a 95% pointwise confidence interval function at each point, we also depict a 95% pointwise confidence interval for f in the bottom row of Fig.. The upper and lower bound of the confidence intervals are respectively given by the 2.5th and 97.5th percentiles of the estimated function at each grid point among simulations. The results show that the function f is estimated with very good accuracy. Tables 3 and 4, 5 and 6 summarize the simulation results when the true parametric components are β and β 2, respectively. Tables 3 and 5 compare the model fitting and variable selection performance in the correlated setting Model 2. The case ρ =.3 represents a weak correlation among X s and ρ =.6 represents a moderate situation. Again, we observe that the PSA performs best in terms of both MSE and variable selection in all settings. In particular, when n = 2, the PSA is very close to the Oracle results in this example. Tables 4 and 6 compare the variable selection results of PSL and PSA in four scenarios if the covariates are correlated. Since neither of the methods misses any important variable over 5 runs, we only report the selection relative frequencies 23

15 Sparse partial spline 7 Table 3 Model selection and fitting results for Model 2 when the true parameter vector is β = β ρ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.3 PS.46 (.8).45 (.2) 2 () () () PSL.299 (.6).447 (.2) 2.99 (.9) 7. (.9). (.) PSA.24 (.5).443 (.2).29 (.3) 9.7 (.3). (.) Oracle.8 (.4).444 (.2) () () () 2 PS.79 (.3).48 (.) 2 () () () PSL.25 (.3).46 (.) 3.22 (.8) 6.78 (.8). (.) PSA.87 (.2).44 (.).2 (.2) 9.88 (.2). (.) Oracle.82 (.2).45 (.) () () ().6 PS.72 (.3).448 (.2) 2 () () () PSL.4 (.9).44 (.2).9 (.7) 8.9 (.7). (.) PSA.349 (.8).438 (.2).22 (.3) 9.78 (.3). (.) Oracle.3 (.4).439 (.2) () () () 2 PS.3 (.5).48 (.) 2 () () () PSL.7 (.4).45 (.) 2.47 (.7) 7.53 (.7). (.) PSA.47 (.4).44 (.).2 (.2) 9.88 (.2). (.) Oracle.39 (.4).45 (.) () () () PNSR 8.8 Table 4 Relative frequency of variables selected in 5 runs for Model 2 when the true parameter vector is β = β ρ n Method Unimportant variable index P(correct) PSL PSA PSL PSA PSL PSA PSL PSA for the unimportant variables. Overall, the PSA results in a more sparse model and identifies the true model with a much higher frequency. For example, when the true parametric component is β, n = and the correlation is moderate with ρ =.6, the PSA identifies the correct model with relative frequency.87 (about 5.87 = 435 times) while the PSL identifies the correct model only 5.2 = times. When the true parametric component is β 2, PSA and PSL identify the correct model 45 and 9 times, respectively. 23

16 8 G. Cheng et al. Table 5 Model selection and fitting results for Model 2 when the true parameter vector is β 2 = β/3 ρ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.3 PS.52 (.8).432 (.2) 2 () () () PSL.32 (.5).43 (.2) 3.32 (.8) 6.68 (.8). (.) PSA.269 (.5).425 (.2).44 (.5) 9.56 (.5). (.) Oracle.23 (.4).422 (.2) () () () 2 PS.226 (.3).39 (.) 2 () () () PSL.4 (.3).39 (.) 3. (.8) 6.89 (.8). (.) PSA.6 (.2).387 (.).38 (.4) 9.62 (.4). (.) Oracle. (.2).386 (.) () () ().6 PS.9 (.3).432 (.2) 2 () () () PSL.42 (.7).425 (.2) 2.5 (.8) 7.95 (.8). (.) PSA.382 (.7).49 (.2).3 (.4) 9.7 (.4). (.) Oracle.35 (.5).47 (.2) () () () 2 PS.382 (.5).388 (.) 2 () () () PSL.92 (.3).387 (.) 2.2 (.7) 7.8 (.7). (.) PSA.8 (.4).386 (.).7 (.2) 9.83 (.2). (.) Oracle.74 (.3).385 (.) () () () PNSR 2.72 Table 6 Relative frequency of variables selected in 5 runs for Model 2 when the true parameter vector is β 2 = β/3 ρ n Method Unimportant variable index P(correct) PSL PSA PSL PSA PSL PSA PSL PSA The top rows of Figs. 2 and 3 depict the typical estimated functions corresponding to the th best, the 5th best (median), and the 9th best fits according to MISE among simulations when n = 2, ρ =.3, and the true Euclidean parameters are β and β 2, respectively. It is evident that the estimated curves are able to capture the shape of the true function very well. The bottom rows of Figs. 2 and 3 depict the 95% pointwise confidence intervals for f. The results show that, when the true Euclidean parameters are β and β 2, respectively, the function f is estimated with 23

17 Sparse partial spline 9 Model 2: Fitted confidence envelope fit for f when the true parameter vector is β (n=2, ρ=.3).5 Estimated f.5 % 5% 9% t Model 2: 95% pointwise confidence band for f when the true parameter vector is β (n=2, ρ=.3) 3 2 Estimated f t Fig. 2 The estimated nonlinear functions given by the PSA in Model 2. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 2, n = 2, ρ =.3, and the true Euclidean parameter is β.inthetop plot, the dashed line is th best fit, the dotted line is 5th best fit, and the dashed-dotted line is 9th best of 5 simulations. The bottom plot is a 95% pointwise confidence interval reasonably good accuracy. Interestingly, in this simulation setting, the choice of β and β 2 does not affect much on the estimation accuracy of f. 5.2 Simulation 2: large dimensional setting We consider an example involving a larger number of linear variables: Model 3: Let d = 6, q = 5. We considered two parameter vectors β = β and β 2 =.3β, and two nonparametric functions f (t) = f (t) and f 2 (t) =.5 f (t), with different magnitudes on the (non)parametric component representing weak and strong (non)parametric signals, where β = (4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2,,...,) and f (t) =.2t 29 ( t) 6 /B(3, 7) +.8t 2 ( t) /B(3, ). In particular, the maximum absolute values of f and f 2 are f sup = 3.8 and f 2 sup =.54, respectively, and the l 2 -norms 23

18 G. Cheng et al..5 Model 2: Fitted confidence envelope fit for f when the true parameter vector is β 2 (n=2, ρ=.3) Estimated f.5 % 5% 9% t 3 Model 2: 95% pointwise confidence band for f when the true parameter vector is β 2 (n=2, ρ=.3) 2 Estimated f t Fig. 3 The estimated nonlinear functions given by the PSA in Model 2. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 2, n = 2, ρ =.3, and the true Euclidean parameter is β 2.Inthetop plot, the dashed line is th best fit, the dotted line is 5th best fit, and the dashed-dotted line is 9th best of 5 simulations. The bottom plot is a 95% pointwise confidence interval of the β and β 2 are β =2.4 and β 2 =3.6, respectively. So the ratio of β 2 to f sup and β to f 2 sup, i.e., the PNSR s, are β 2 / f sup =.7 and β / f 2 sup = 7.82, representing the lower to higher PNSRs. The correlated covariates (X,...,X 6 ) are generated from marginally standard normal with AR() correlation with ρ =.5. Consider two settings for the normal error ɛ N(,σ 2 ), with σ =.5 and σ =.5, respectively. Tables 7 and 8 compare the model fitting and variable selection performance of various procedures in different settings for Model 3. In particular, in Table 7, the true Euclidean parameter and nonparametric function are β 2 and f, while in Table 8 they are β and f 2. So the PNSR s in Tables 7 and 8 are.7 and 7.82, respectively. To better illustrate the performance, we considered σ =.5 and.5 in each table. The corresponding signal-to-noise ratios, defined as the ratios of β ( β 2 )toσ s, are 24.8, 8.3, 7.22, and 2.4 in the four settings. It is evident that the PSA procedure outperforms both the PS and PSL in terms of both the MSE and variable selection accuracy. The three procedures give similar performance in estimating the nonparametric 23

19 Sparse partial spline Table 7 Variable selection and fitting results for Model 3 when the true parameter vector is β 2 and the true function is f σ n Method MSE( β) MISE( ˆ f ) Size Number of zeros Correct Incorrect.5 PS 4.94 (.59).39 (.2) 6 () () () PSL.526 (.7).64 (.) (.3) 32.4 (.3). (.) PSA.34 (.).6 (.) 2.43 (.29) (.29). (.) Oracle.225 (.).522 (.) 5 () 45 () () 2 PS.95 (.).85 (.3) 6 () () () PSL.223 (.7).68 (.) 26.2 (.22) (.22). (.) PSA.34 (.4).548 (.) 8.78 (.2) 4.22 (.2). (.) Oracle.2 (.2).478 (.9) 5 () 45 () ().5 PS.4 (.3).5 (.27) 6 () () () PSL 2.72 (.66).256 (.4) 27.9 (.35) 32. (.35). (.) PSA.4 (.4).22 (.2) 2.7 (.22) 38.3 (.22). (.) Oracle.38 (.5).28 (.9) 5 () 45 () () 2 PS 2.44 (.2).9 (.3) 6 () () () PSL.688 (.).63 (.3) 26. (.25) 35. (.25). (.) PSA.47 (.8).52 (.2) 2.5 (.2) (.2). (.) Oracle.448 (.6).42 (.2) 5 () 45 () () PNSR.7. SNRs 7.22 and 2.4 for σ =.5and.5respectively function. In Figs. 4 and 5, we plotted the confidence envelop and the 95% confidence band of f and f 2 when n = 2, σ =.5, and the true Euclidean parameters are β 2 and β, respectively. All the figures demonstrate satisfactory coverage of the true unknown nonparametric function by confidence envelops and pointwise confidence bands. We also conclude that, at least in this simulation setup, for the two settings with different PNSR s, the estimates of f and f 2 are satisfactory. 5.3 Real example : ragweed pollen data We apply the proposed method to the Ragweed Pollen data analyzed in Ruppert et al. (23). The data consist of 87 daily observations of ragweed pollen level and relevant information collected in Kalamazoo, Michigan during the 993 ragweed season. The main purpose of this analysis is to develop an accurate model for forecasting daily ragweed pollen level based on some climate factors. The raw response ragweed is the daily ragweed pollen level (grains/m 3 ). There are four explanatory variables: X = rain: the indicator of significant rain for the following day ( = at least 3 h of steady or brief but intense rain, = otherwise); X 2 = temperature: temperature of the following day ( o F); X 3 = wind: wind speed forecast for the following day (knots); 23

20 2 G. Cheng et al. Table 8 Variable selection and fitting results for Model 3 when the true parameter vector is β and the true function is f 2 σ n Method MSE( β PSA ) MISE( ˆ f PSA ) Size Number of zeros Correct Incorrect.5 PS.599 (.34).32 (.3) 6 () () () PSL.334 (.9).24 (.2) 8.44 (.6) 4.56 (.6). (.) PSA.23 (.).232 (.2) 5.3 (.3) 44.7 (.6). (.) Oracle.63 (.2).22 (.2) 5 () 45 () () 2 PS.379 (.3).254 (.) 6 () () () PSL.2 (.2).236 (.) 7.9 (.3) 42.8 (.3). (.) PSA.94 (.).23 (.) 5.2 (.) (.). (.) Oracle.68 (.).28 (.) 5 () 45 () ().5 PS 7.27 (.83).539 (.3) 6 () () () PSL.588 (.35).459 (.9) (.22) 32.3 (.22). (.) PSA.285 (.28).434 (.7) 2.2 (.9) (.9). (.) Oracle.785 (.).395 (.4) 5 () 45 () () 2 PS.886 (.4).35 (.3) 6 () () () PSL.57 (.).339 (.2) (.9) (.9). (.) PSA.472 (.6).334 (.2) 9.8 (.4) 4.92 (.4). (.) Oracle.342 (.5).325 (.2) 5 () 45 () () PNSR SNRs 24.8 and 8.3 for σ =.5and.5respectively X 4 = day: the number of days in the current ragweed pollen season. We first standardize X-covariates. Since the raw response is rather skewed, Ruppert et al. (23) suggested a square root transformation Y = ragweed. Marginal plots suggest a strong nonlinear relationship between Y and the day number. Consequently, a partial linear model with a nonparametric baseline f (day) is reasonable. Ruppert et al. (23) fitted a semiparametric model with three linear effects X, X 2, and X 3, and a nonlinear effect of X 4. For the variable selection purpose, we add the quadratic terms in the model and fit an enlarged model: y = f (day) + β x + β 2 x 2 + β 3 x 3 + β 4 x β 4x ε. Table 9 gives the estimated regression coefficients. We observe that PSL and PSA end up with the same model, and all the estimated coefficients are positive, suggesting that the ragweed pollen level increases as any covariate increases. The shrinkage in parametric terms from the partial spline models that resulted from the PSA procedure is overall smaller than that resulted from the PSL procedure. Figure 6 depicts the estimated nonparametric function f (day) and its 95% pointwise confidence intervals given by the PSA. The plot indicates that f (day) increases rapidly to the peak on around day 25, plunges until day 6, and decreases steadily 23

21 Sparse partial spline 3 Model 3: Fitted confidence envelope fit for f when the true parameter vector is β 2 (n=2, σ=.5) 4 3 Estimated f 2 % 5% 9% t Model 3: 95% pointwise confidence band for f when the true parameter vector is β 2 (n=2, σ=.5) 6 Estimated f t Fig. 4 The estimated nonlinear functions given by the PSA in Model 3. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 3 with true nonparametric function f and true Euclidean parameter β 2, n = 2 and σ =.5. In the top plot, the dashed line is for the th best fit, the dotted line is for the 5th best fit, and the dashed-dotted line is for the 9th best among 5 simulations. The bottom plot is a 95% pointwise confidence interval thereafter. The nonparametric fits given by the other two procedures are similar and hence omitted in the paper. We examined the prediction accuracy for PS, PSL and PSA, in terms of the mean squared prediction errors (MSPE) based the leave-one-out strategy. We also fit linear models for the above data using LASSO. Our analysis shows that the MSPEs for PS, PSL, and PSA are 5.63, 5.6, and 5.47, respectively, while the MSPE for LASSO based on linear models is 2.4. Roughly speaking, the MSPEs using PS, PSL, and PSA are similar, though they provide different model selection results summarized in Table 9. Notice that the PS method keeps more variables in the model than the PSL and PSA; however, the MSPEs are not much different. Thus, using PSL or PSA one can select a subgroup of significant variables to explain the model. Furthermore, the large MSPE based on linear models demonstrates invalidity of simply using linear models for such data. 23
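The prediction comparison above is a plain leave-one-out computation. A minimal sketch of such an MSPE evaluation is given below, assuming a generic fitting routine (for instance, the PSA sketch from Sect. 4 with tuning parameters held fixed); the function name and interface are ours.

```python
import numpy as np

def loo_mspe(y, X, t, fit_predict):
    """Leave-one-out mean squared prediction error.
    `fit_predict(y_train, X_train, t_train, x_new, t_new)` returns a scalar prediction."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        pred = fit_predict(y[keep], X[keep], t[keep], X[i], t[i])
        errs[i] = (y[i] - pred) ** 2
    return errs.mean()
```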

22 4 G. Cheng et al. Model 3: Fitted confidence envelope fit for f 2 when the true parameter vector is β (n=2, σ=.5) 2.5 Estimated f 2.5 % 5% 9% t Model 3: 95% pointwise confidence band for f 2 when the true parameter vector is β (n=2, σ=.5) 3 2 Estimated f Fig. 5 The estimated nonlinear functions given by the PSA in Model 3. The estimated nonlinear function, confidence envelop, and 95% point-wise confidence interval for Model 3 with true nonparametric function f 2 and true Euclidean parameter β, n = 2 and σ =.5. In the top plot, the dashed line is for the th best fit, the dotted line is for the 5th best fit, and the dashed-dotted line is for the 9th best among 5 simulations. The bottom plot is a 95% pointwise confidence interval t 5.4 Real example 2: prostate cancer data We analyze the Prostate Cancer data (Stamey et al. 989). The goal is to predict the log level of prostate-specific antigen using a number of clinical measures. The data consist of 97 men who were about to receive a radical prostatectomy. There are eight predictors: X = log cancer volume (lcavol), X 2 = log prostate weight (lweight), X 3 = age, X 4 = log of benign prostatic hyperplasia amount (lbph), X 5 = seminal vesicle invasion (svi), X 6 = log of capsular penetration (lcp), X 7 = Gleason score (gleason), and X 8 = percent of Gleason scores of 4 or 5 (pgg45). A variable selection analysis was conducted in Tibshirani (996) using a linear regression model with LASSO, and it selected three important variables lcavol, lweight, svi as important variables to predict the prostate specific antigen. We fitted partially linear models by treating lweight as a nonlinear term. Table gives the estimated coefficients for different methods. Interestingly, both PSL and PSA select 23

23 Sparse partial spline 5 Table 9 Estimated coefficients for Ragweed Pollen data Covariate PS PSL PSA Rain Temperature Wind Temp temp Wind wind.4 Fig. 6 The estimated nonlinear function f (day) with its 95% pointwise confidence interval (dotted lines) given by the PSA for the ragweed pollen data 5 fhat day Table Estimated coefficients for prostate cancer data Covariate PS PSL PSA lcavol Age.2 lbph.7 svi lcp.5 Gleason.45 pgg

24 6 G. Cheng et al. lcavol and svi as important linear variables, which is consistent to the analysis by Tibshirani (996). 6 Discussion We propose a new regularization method for simultaneous variable selection for linear terms and component estimation for the nonlinear term in partial spline models. The oracle properties of the new procedure for variable selection are established. Moreover, we have shown that the new estimator can achieve the optimal convergence rates for both the parametric and nonparametric components. All the above conclusions are also proven to hold in the increasing dimensional situation. The proposed method sets up a basic framework to implement variable selection for partial spline models, and it can be generalized to other types of data analysis. In our future research, we will generalize the results in this paper to the generalized semiparametric models, robust linear regression, or survival data analysis. In this paper, we assume the errors are i.i.d. with constant variance, and the smoothness order of the Sobolev space is fixed as m, though in practice we used m = 2 to facilitate computation. In practice, the problem of heteroscedastic error, i.e., the variance of ɛ is some non-constant function of (X, T ), is often encountered. Meanwhile, the order m may not be always available which needs to be approximated. We will examine the latter two issues in the future. 7 Proofs For simplicity, we use β, β ( β 2 ) and fˆ to represent β PSA, β PSA, ( β PSA,2 ) and fˆ PSA, in the proofs. Definition Let A be a subset of a (pseudo-) metric space (L, d) of real-valued functions. The δ-covering number N(δ,A, d) of A is the smallest N for which there exist functions a,...,a N in L, such that for each a A, d(a, a j ) δ for some j {,...,N}. Theδ-bracketing number N B (δ, A, d) is the smallest N for which there exist pairs of functions {[a L j, au j ]}N j= L, with d(al j, au j ) δ, j =,...,N, such that for each a A there is a j {,...,N} such that a L j a a U j.theδ-entropy number (δ-bracketing entropy number) is defined as H(δ, A, d) = log N(δ, A, d) (H B (δ, A, d) = log N B (δ, A, d)). Entropy Calculations: For each < C < and δ>, we have ( ) C /m H B (δ, {η : η C, J η C}, ) M, (22) δ ( ) C /m H(δ, {η : η C, J η C}, ) M, (23) δ where represents the uniform norm and M is some positive number. 23

25 Sparse partial spline 7 Proof of Theorem In the proof of (8), we will first show for any given ɛ>, there exists a large constant M such that { } P inf (s) > s =M ɛ, (24) where (s) Q(β +n /2 s) Q(β ). This implies with probability at least ( ɛ) that there exists a local minimum in the ball {β + n /2 s : s M}. Thus, we can conclude that there exists a local minimizer such that ˆβ n β =O P (n /2 ) if (24) holds. Denote the quadratic part of Q(β) as L(β), i.e., L(β) = n (y Xβ) [I A(λ )](y Xβ). Then we can obtain the below inequality: (s) L(β + n /2 s) L(β ) + λ 2 q j= β j + n /2 s j β j β j γ, where s j is the j-th element of vector s. Note that L(β) is a quadratic function of β. Hence, by the Taylor expansion of L(β), we can show that (s) n /2 s L(β ) + 2 s [n L(β )]s + λ 2 q j= β j + n /2 s j β j β j γ, (25) where L(β ) and L(β ) are the first and second derivative of L(β) at β, respectively. Based on (5), we know that L(β ) = (2/n)X [I A(λ )](y Xβ ) and L(β ) = (2/n)X [I A(λ )]X. Combing the proof of Theorem and its four propositions in Heckman (986), we can show that n /2 X [I A(λ )]( f + ɛ) n /2 X A(λ )ɛ d N(,σ 2 R), P. provided that λ and nλ /2m. Therefore, the Slutsky s theorem implies that L(β ) = O P (n /2 ), (26) L(β ) = O P () (27) given the above conditions on λ. Based on (26) and (27), we know the first two terms in the right-hand side of (25) are of the same order, i.e., O P (n ).Andthe second term, which converges to some positive constant, dominates the first one by choosing sufficiently large M. The third term is bounded by n /2 λ 2 M for some 23

26 8 G. Cheng et al. positive constant M since β j is the consistent estimate for the nonzero coefficient for j =,...,q. Considering that nλ 2, we have completed the proof of (8). We next show the convergence rate for fˆ in terms of n -norm, i.e., (9). Let g (x, t) = x β + f (t), and ĝ(x, t) = x ˆβ + f ˆ(t). Then, by the definition of ( ˆβ, f ˆ), we have ĝ g 2 n + λ J 2ˆ + λ f 2 J ˆβ 2 n ɛ i (ĝ g )(X i, t i ) + λ J 2 f n + λ 2 J β, i= ĝ g 2 n 2 ɛ n ĝ g n + λ J 2 f + λ 2 J β, ĝ g 2 n ĝ g n O P () + o P (), (28) where J β d j= β j / β j γ. The second inequality follows from the Cauchy- Schwartz inequality. The last inequality holds since ɛ has sub-exponential tail, and λ,λ 2. Then the above inequality implies that ĝ g n = O P (), so that ĝ n = O P (). By Sobolev embedding theorem, we can decompose g(x, t) as g (x, t) + g 2 (x, t), where g (x, t) = x β + m j= α j t j and g 2 (x, t) = f 2 (t) with g 2 (x, t) J g2 = J f. Similarly, we can write ĝ = ĝ + ĝ 2, where ĝ = x ˆβ + m j= ˆα j t j = ˆδ φ and ĝ 2 Jĝ. We shall now show that ĝ /( + Jĝ) = O P () via the above Sobolev decomposition. Then ĝ n + Jĝ ĝ n + ĝ 2 n + Jĝ + Jĝ = O P (). (29) Based on the assumption about k φ kφ k /n,(29)impliesthat ˆδ /( + Jĝ) = O P (). Since (X, t) is in a bounded set, ĝ /( + Jĝ) = O P (). Sowehaveprovedthat ĝ /( + Jĝ) = O P (). Thus, the entropy calculation (22) impliesthat { g g H B (δ, : g G, g } ) C, M δ /m, + J g + J g where M is some positive constant, and G ={g(x, t) = x β + f (t) : β R d, J f < }. Based on Theorem 2.2 in Mammen and van de Geer (997) about the continuity modulus of the empirical processes { n i= ɛ i (g g )(z i )} indexed by g and (28), we can establish the following set of inequalities: λ J 2ˆ f 23 [ ĝ g n /2m ( + J ˆ f )/2m ( + J f ˆ 2m )n 2(2m+) ]O P (n /2) +λ J 2 f + λ 2 (J β J ˆβ ), (3)

27 Sparse partial spline 9 and [ ĝ g 2 n ĝ g n /2m ( + J f ˆ )/2m ( + J f ˆ 2m )n 2(2m+) ]O P (n /2) +λ J 2 f + λ 2 (J β J ˆβ ). (3) Note that λ 2 (J β J ˆβ ) λ 2 q j= β j ˆβ j β j γ + λ 2 d j=q+ β j ˆβ j β j γ O P (n 2m/(2m+)). (32) (32) in the above follows from β β =O P (n /2 ) and (7). Thus, solving the above two inequalities gives ĝ g n = O P (λ /2 ) and J f ˆ = O P () when n 2m/(2m+) λ λ >. Note that ( X n ) ( ( ˆβ β ) n = ( ˆβ β ) X i X i /n ( ˆβ β ) < ˆβ β =O P n /2) i= by (8). Applying the triangle inequality to ĝ g n = O P (λ /2 ), we have proved that fˆ f n = O P (λ /2 ). We next prove 3(a). It suffices to show that Q{( β, )} = min Q{( β, β 2 )} with probability approaching to (33) β 2 Cn /2 for any β satisfying β β =O P (n /2 ) based on (8). To show (33), we need to show that Q(β)/ β j < forβ j ( Cn /2, ), and Q(β)/ β j > for β j (, Cn /2 ),for j = q +,...,d, holds with probability tending to. By two term Taylor expansion of L(β) at β, Q(β)/ β j can be expressed in the following form for j = q +,...,d: Q(β) β j = L(β ) β j + d k= 2 L(β ) β j β k (β k β k ) + λ 2 sgn(β j ) β j γ, where β k is the k th element of vector β. Note that β β =O P (n /2 ) by the above constructions. Hence, we have Q(β) β j = O P (n /2 ) + sgn(β j ) λ 2 β j γ 23


More information

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA

Variable Selection for Nonparametric Quantile. Regression via Smoothing Spline ANOVA Variable Selection for Nonparametric Quantile Regression via Smoothing Spline ANOVA Chen-Yen Lin, Hao Helen Zhang, Howard D. Bondell and Hui Zou February 15, 2012 Author s Footnote: Chen-Yen Lin (E-mail:

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

A significance test for the lasso

A significance test for the lasso 1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

M-estimation in high-dimensional linear model

M-estimation in high-dimensional linear model Wang and Zhu Journal of Inequalities and Applications 208 208:225 https://doi.org/0.86/s3660-08-89-3 R E S E A R C H Open Access M-estimation in high-dimensional linear model Kai Wang and Yanling Zhu *

More information

Regularization Paths

Regularization Paths December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and

More information

Spatially Adaptive Smoothing Splines

Spatially Adaptive Smoothing Splines Spatially Adaptive Smoothing Splines Paul Speckman University of Missouri-Columbia speckman@statmissouriedu September 11, 23 Banff 9/7/3 Ordinary Simple Spline Smoothing Observe y i = f(t i ) + ε i, =

More information

On Model Selection Consistency of Lasso

On Model Selection Consistency of Lasso On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Spatial Process Estimates as Smoothers: A Review

Spatial Process Estimates as Smoothers: A Review Spatial Process Estimates as Smoothers: A Review Soutir Bandyopadhyay 1 Basic Model The observational model considered here has the form Y i = f(x i ) + ɛ i, for 1 i n. (1.1) where Y i is the observed

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Fadel Hamid Hadi Alhusseini Department of Statistics and Informatics, University

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University

More information

Chapter 3. Linear Models for Regression

Chapter 3. Linear Models for Regression Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear

More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Function of Longitudinal Data

Function of Longitudinal Data New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li Abstract This paper develops a new estimation of nonparametric regression functions for

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Lecture 4: Newton s method and gradient descent

Lecture 4: Newton s method and gradient descent Lecture 4: Newton s method and gradient descent Newton s method Functional iteration Fitting linear regression Fitting logistic regression Prof. Yao Xie, ISyE 6416, Computational Statistics, Georgia Tech

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

Sampling Distributions

Sampling Distributions Merlise Clyde Duke University September 8, 2016 Outline Topics Normal Theory Chi-squared Distributions Student t Distributions Readings: Christensen Apendix C, Chapter 1-2 Prostate Example > library(lasso2);

More information

arxiv: v1 [stat.me] 23 Dec 2017 Abstract

arxiv: v1 [stat.me] 23 Dec 2017 Abstract Distribution Regression Xin Chen Xuejun Ma Wang Zhou Department of Statistics and Applied Probability, National University of Singapore stacx@nus.edu.sg stamax@nus.edu.sg stazw@nus.edu.sg arxiv:1712.08781v1

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,

More information

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation

Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation Curtis B. Storlie a a Los Alamos National Laboratory E-mail:storlie@lanl.gov Outline Reduction of Emulator

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

Inference for High Dimensional Robust Regression

Inference for High Dimensional Robust Regression Department of Statistics UC Berkeley Stanford-Berkeley Joint Colloquium, 2015 Table of Contents 1 Background 2 Main Results 3 OLS: A Motivating Example Table of Contents 1 Background 2 Main Results 3 OLS:

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data

New Local Estimation Procedure for Nonparametric Regression Function of Longitudinal Data ew Local Estimation Procedure for onparametric Regression Function of Longitudinal Data Weixin Yao and Runze Li The Pennsylvania State University Technical Report Series #0-03 College of Health and Human

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

Tuning parameter selectors for the smoothly clipped absolute deviation method

Tuning parameter selectors for the smoothly clipped absolute deviation method Tuning parameter selectors for the smoothly clipped absolute deviation method Hansheng Wang Guanghua School of Management, Peking University, Beijing, China, 100871 hansheng@gsm.pku.edu.cn Runze Li Department

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Relaxed Lasso. Nicolai Meinshausen December 14, 2006

Relaxed Lasso. Nicolai Meinshausen December 14, 2006 Relaxed Lasso Nicolai Meinshausen nicolai@stat.berkeley.edu December 14, 2006 Abstract The Lasso is an attractive regularisation method for high dimensional regression. It combines variable selection with

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

Regularization Paths. Theme

Regularization Paths. Theme June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Mee-Young Park,

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

Regularization in Cox Frailty Models

Regularization in Cox Frailty Models Regularization in Cox Frailty Models Andreas Groll 1, Trevor Hastie 2, Gerhard Tutz 3 1 Ludwig-Maximilians-Universität Munich, Department of Mathematics, Theresienstraße 39, 80333 Munich, Germany 2 University

More information

Sample Size Requirement For Some Low-Dimensional Estimation Problems

Sample Size Requirement For Some Low-Dimensional Estimation Problems Sample Size Requirement For Some Low-Dimensional Estimation Problems Cun-Hui Zhang, Rutgers University September 10, 2013 SAMSI Thanks for the invitation! Acknowledgements/References Sun, T. and Zhang,

More information

Effective Computation for Odds Ratio Estimation in Nonparametric Logistic Regression

Effective Computation for Odds Ratio Estimation in Nonparametric Logistic Regression Communications of the Korean Statistical Society 2009, Vol. 16, No. 4, 713 722 Effective Computation for Odds Ratio Estimation in Nonparametric Logistic Regression Young-Ju Kim 1,a a Department of Information

More information

The Iterated Lasso for High-Dimensional Logistic Regression

The Iterated Lasso for High-Dimensional Logistic Regression The Iterated Lasso for High-Dimensional Logistic Regression By JIAN HUANG Department of Statistics and Actuarial Science, 241 SH University of Iowa, Iowa City, Iowa 52242, U.S.A. SHUANGE MA Division of

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

1. SS-ANOVA Spaces on General Domains. 2. Averaging Operators and ANOVA Decompositions. 3. Reproducing Kernel Spaces for ANOVA Decompositions

1. SS-ANOVA Spaces on General Domains. 2. Averaging Operators and ANOVA Decompositions. 3. Reproducing Kernel Spaces for ANOVA Decompositions Part III 1. SS-ANOVA Spaces on General Domains 2. Averaging Operators and ANOVA Decompositions 3. Reproducing Kernel Spaces for ANOVA Decompositions 4. Building Blocks for SS-ANOVA Spaces, General and

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich Submitted to the Annals of Applied Statistics arxiv: math.pr/0000000 THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES By Sara van de Geer and Johannes Lederer ETH Zürich We study high-dimensional

More information

Homogeneity Pursuit. Jianqing Fan

Homogeneity Pursuit. Jianqing Fan Jianqing Fan Princeton University with Tracy Ke and Yichao Wu http://www.princeton.edu/ jqfan June 5, 2014 Get my own profile - Help Amazing Follow this author Grace Wahba 9 Followers Follow new articles

More information

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing

More information

Theoretical results for lasso, MCP, and SCAD

Theoretical results for lasso, MCP, and SCAD Theoretical results for lasso, MCP, and SCAD Patrick Breheny March 2 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/23 Introduction There is an enormous body of literature concerning theoretical

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Efficient Estimation for the Partially Linear Models with Random Effects

Efficient Estimation for the Partially Linear Models with Random Effects A^VÇÚO 1 33 ò 1 5 Ï 2017 c 10 Chinese Journal of Applied Probability and Statistics Oct., 2017, Vol. 33, No. 5, pp. 529-537 doi: 10.3969/j.issn.1001-4268.2017.05.009 Efficient Estimation for the Partially

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Chapter 3: Regression Methods for Trends

Chapter 3: Regression Methods for Trends Chapter 3: Regression Methods for Trends Time series exhibiting trends over time have a mean function that is some simple function (not necessarily constant) of time. The example random walk graph from

More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science 1 Likelihood Ratio Tests that Certain Variance Components Are Zero Ciprian M. Crainiceanu Department of Statistical Science www.people.cornell.edu/pages/cmc59 Work done jointly with David Ruppert, School

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information