PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION

Statistica Sinica 23 (2013), 543-569

PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION

Tao Wang$^1$, Peirong Xu$^2$ and Lixing Zhu$^1$

$^1$Hong Kong Baptist University and $^2$East China Normal University

Abstract: For simultaneous dimension reduction and variable selection for general regression models, including the multi-index model as a special case, we propose a penalized minimum average variance estimation method, combining the ideas of minimum average variance estimation in dimension reduction and regularization in variable selection. The resulting estimator can be found in a computationally efficient manner. Under mild conditions, the new method can consistently select all relevant predictors and has the oracle property. Simulations and a data example demonstrate the effectiveness and efficiency of the proposed method.

Key words and phrases: Dimension reduction, minimum average variance estimation, oracle property, single-index model, sufficient dimension reduction, variable selection.

1. Introduction

Consider a univariate response $Y$ and a predictor vector $X = (X_1, \ldots, X_p)^T$. The goal of regression analysis is to infer about the conditional mean function or, in general, the conditional distribution of $Y \mid X$. To deal with high-dimensional data that arise frequently in modern scientific research, the literature offers sufficient dimension reduction and variable selection. Sufficient dimension reduction aims to replace the predictor vector with its low-dimensional projection onto a suitable subspace of the predictor space without losing regression information, whereas variable selection is to rule out predictors that insignificantly affect the response. More specifically, sufficient dimension reduction seeks to find a subspace $S$ of minimum dimension such that $Y \perp X \mid P_S X$, where $\perp$ indicates independence and $P_S$ stands for the orthogonal projection onto $S$; that is, $Y$ and $X$ are independent conditioned on $P_S X$ or, equivalently, the conditional distribution of $Y \mid X$ equals that of $Y \mid P_S X$. Such a space, denoted $S_{Y|X}$ and often called the central subspace, typically exists and is unique under mild conditions (Cook (1996)). When all projection directions are contained in the conditional mean of $Y$ given $X$, $E(Y \mid X)$, the above independence can be written as $Y \perp E(Y \mid X) \mid P_S X$.

The related subspace is called the central mean subspace (CMS). A number of methods have been proposed to estimate the central/central mean subspace: inverse regression methods, such as the sliced inverse regression method (SIR, Li (1991)) and the sliced average variance estimation method (SAVE, Cook and Weisberg (1991)), and direct regression methods, in particular the minimum average variance estimation method (MAVE, Xia et al. (2002)), are popular. Inverse regression methods are computationally simple and practically useful, but the probabilistic assumptions on the design of predictors, such as the linearity condition (Li (1991)), are considered to be strong (Li and Dong (2009)). Further, some of them fail in one way or another to estimate the central/central mean subspace (Cook and Weisberg (1991)). For the CMS, MAVE is useful for both dimension reduction and nonparametric function estimation. It is free of the linearity condition and often has much better performance in finite samples than inverse regression methods.

The aforementioned methods assume that all predictors contain useful information. As a result, the estimated directions generally involve all of the original predictors, making their interpretation sometimes difficult. Further, if irrelevant predictors are included, quite likely in high-dimensional situations, the precision of estimation as well as the accuracy of prediction is lessened. It is thus crucial to consider variable selection along with dimension reduction, and several attempts at this have been made within the framework of sufficient dimension reduction. For instance, Li, Cook, and Nachtsheim (2005) developed test-based selection procedures, Li (2007) suggested sparse sufficient dimension reduction via integrating inverse regression estimation with a regularization paradigm, Zhou and He (2008) studied a constrained canonical correlation method, and Wang and Yin (2008) proposed sparse minimum average variance estimation. However, most methods are designed for element screening and not variable screening, producing a shrinkage solution for a basis matrix of the central/central mean subspace. Building upon the regression-type formulation of sufficient dimension reduction methods, Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation that can simultaneously achieve dimension reduction and screen out irrelevant predictors. With the aid of the local quadratic approximation, the resulting estimators can be found by an application of the eigen-system analysis. Their general method is subspace-oriented and covers a number of inverse regression methods, but inherits all drawbacks of inverse regression methods, that is, the strong conditions on predictors and regression function. Also, as pointed out in Fan and Li (2001), the local quadratic approximation algorithm has to delete any small coefficient because otherwise it cannot produce a sparse representation.

In this paper, by incorporating a bridge penalty (Frank and Friedman (1993)) on the $L_1$ norms of the rows of a basis matrix, we propose a penalized minimum average variance estimation (P-MAVE) that can achieve sufficient dimension reduction and variable selection simultaneously.

A fast and efficient algorithm is developed to solve the optimization problem, and the resulting estimator naturally adopts a sparse representation. A BIC-type criterion is suggested to select the tuning parameter. The new method is free of the linearity condition that is needed for inverse regression methods (see, e.g., Chen, Zou, and Cook (2010)) and has the oracle property under mild conditions, so that P-MAVE can be applied to time series data while inverse regression methods cannot.

The rest of the paper is organized as follows. Section 2 introduces the new method. Its theoretical properties are studied in Section 3. Numerical studies are reported in Section 4, and there a dataset with categorical predictors is analyzed for variable selection. Concluding remarks are included in Section 5. All technical details are relegated to the Appendix.

2. Methodology

2.1. Minimum average variance estimation revisited

Let $B_0 \in \mathbb{R}^{p \times d}$ denote a $p \times d$ orthogonal matrix with $d < p$, so $B_0^T B_0 = I_d$, where $I_d$ is the $d \times d$ identity matrix. For dimension reduction we consider the model
$$Y = g(B_0^T X) + \epsilon, \qquad (2.1)$$
where $g$ is an unknown smooth link function and $E(\epsilon \mid X) = 0$ almost surely. At the population level, the MAVE procedure minimizes the objective function
$$E\{\operatorname{var}(Y \mid B^T X)\} = E\{Y - E(Y \mid B^T X)\}^2 \qquad (2.2)$$
among all $B \in \mathbb{R}^{p \times d}$ such that $B^T B = I_d$. If $\sigma_B^2(B^T X) = E[\{Y - E(Y \mid B^T X)\}^2 \mid B^T X]$ is the conditional variance of $Y$ given $B^T X$, then
$$E\{Y - E(Y \mid B^T X)\}^2 = E\{\sigma_B^2(B^T X)\}.$$
Thus, minimizing (2.2) is equivalent to
$$\min_{B \in \mathbb{R}^{p \times d}} E\{\sigma_B^2(B^T X)\} \quad \text{subject to } B^T B = I_d.$$

Suppose $\{(y_i, x_i), i = 1, \ldots, n\}$ is a random sample drawn according to (2.1). For $x_i$ close to $x_0$, a local linear approximation is
$$y_i - g(B^T x_i) \approx y_i - g(B^T x_0) - \{\nabla g(B^T x_0)\}^T B^T (x_i - x_0).$$
In the spirit of local linear smoothing, we might estimate $\sigma_B^2(B^T x_0)$ by
$$\hat{\sigma}_B^2(B^T x_0) = \min_{a \in \mathbb{R},\, b \in \mathbb{R}^d} \sum_{i=1}^n \{y_i - a - b^T B^T (x_i - x_0)\}^2 w_{i0},$$
where $w_{i0}$, $i = 1, \ldots, n$, are some weights such that $\sum_{i=1}^n w_{i0} = 1$, often centering at $x_0$.
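To make the local linear step concrete, the following is a minimal numerical sketch (not the paper's implementation) of this estimate at a single point $x_0$, using simple Gaussian product-kernel weights; the function name, the bandwidth, and the small ridge term are illustrative assumptions only.

```python
import numpy as np

def local_variance_estimate(y, X, B, x0, h=0.5):
    """Local linear estimate of sigma_B^2(B^T x0): the weighted residual sum of
    squares after regressing y_i on B^T(x_i - x0) with kernel weights near x0."""
    Z = (X - x0) @ B                                   # B^T (x_i - x0), shape (n, d)
    w = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=1))    # Gaussian kernel weights
    w /= w.sum()                                       # normalize to sum to one
    S = np.column_stack([np.ones(len(y)), Z])          # local design (1, B^T(x_i - x0))
    WS = w[:, None] * S
    coef = np.linalg.solve(S.T @ WS + 1e-10 * np.eye(S.shape[1]), WS.T @ y)
    resid = y - S @ coef
    return float(np.sum(w * resid ** 2))               # estimated sigma_B^2(B^T x0)
```

Averaging such local estimates over the sample and minimizing over orthonormal $B$, as in (2.2), is exactly what the MAVE iteration described next does.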

Let $x_{ij} = x_i - x_j$. The MAVE procedure is to minimize
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.3)$$
over $a_j \in \mathbb{R}$, $b_j \in \mathbb{R}^d$, and $B \in \mathbb{R}^{p \times d}$ such that $B^T B = I_d$. In practice, minimizing (2.3) can be decomposed into two weighted least squares problems by fixing $(a_j, b_j)$, $j = 1, \ldots, n$, and $B$, alternatively. To be specific, given $B$ we minimize, for $j = 1, \ldots, n$,
$$\frac{1}{2n} \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.4)$$
with respect to $a_j \in \mathbb{R}$ and $b_j \in \mathbb{R}^d$. With $c_j = (a_j, b_j^T)^T$ and $s_{ij} = (1, x_{ij}^T B)^T$, the solution is
$$c_j = \Big(\sum_{i=1}^n s_{ij} s_{ij}^T w_{ij}\Big)^{-1} \sum_{i=1}^n s_{ij} y_i w_{ij}. \qquad (2.5)$$
Given $(a_j, b_j)$, $j = 1, \ldots, n$, we minimize
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.6)$$
over $B \in \mathbb{R}^{p \times d}$ subject to $B^T B = I_d$. Note that
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n \{y_i - a_j - (x_{ij}^T \otimes b_j^T)\beta\}^2 w_{ij},$$
where $\beta = \operatorname{vec}(B^T)$ and $\operatorname{vec}(\cdot)$ is a matrix operator that stacks all columns of a matrix into a vector. Again, this is a weighted least squares problem with closed-form solution
$$\beta = \Big[\sum_{j=1}^n \sum_{i=1}^n \{(x_{ij} x_{ij}^T) \otimes (b_j b_j^T)\} w_{ij}\Big]^{-1} \sum_{j=1}^n \sum_{i=1}^n (x_{ij} \otimes b_j)(y_i - a_j) w_{ij}. \qquad (2.7)$$
The procedure now consists of iteration between (2.5) and (2.7). As for the weights $w_{ij}$, we employ the lower-dimensional kernel weights
$$w_{ij} = \frac{K_h(B^T x_{ij})}{\sum_{i=1}^n K_h(B^T x_{ij})}, \qquad (2.8)$$
where $B$ is the latest (current) estimate and $K_h(\cdot)$ is a $d$-dimensional kernel function with bandwidth $h$; see Xia et al. (2002) for more details.
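The two closed-form updates (2.5) and (2.7), together with the kernel weights (2.8), can be prototyped directly. Below is a didactic $O(n^2)$ sketch under the same Gaussian-kernel assumption; it is not the refined MAVE implementation used later in the paper, and the ridge terms are only for numerical stability.

```python
import numpy as np

def mave_update(y, X, B, h=0.5, ridge=1e-8):
    """One MAVE cycle: kernel weights (2.8), local fits (a_j, b_j) via (2.5),
    then the update of beta = vec(B^T) via (2.7), re-orthonormalized."""
    n, p = X.shape
    d = B.shape[1]
    Xij = X[:, None, :] - X[None, :, :]              # x_ij = x_i - x_j, shape (n, n, p)
    Z = Xij @ B                                      # B^T x_ij, shape (n, n, d)
    K = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=2))
    W = K / K.sum(axis=0, keepdims=True)             # w_ij in (2.8), sum over i is one

    a, b = np.zeros(n), np.zeros((n, d))
    for j in range(n):                               # step (2.5): local linear fits
        S = np.column_stack([np.ones(n), Z[:, j, :]])
        WS = W[:, j][:, None] * S
        c = np.linalg.solve(S.T @ WS + ridge * np.eye(d + 1), WS.T @ y)
        a[j], b[j] = c[0], c[1:]

    A = np.zeros((p * d, p * d))                     # step (2.7): update vec(B^T)
    r = np.zeros(p * d)
    for j in range(n):
        for i in range(n):
            xb = np.kron(Xij[i, j], b[j])            # x_ij Kronecker b_j
            A += W[i, j] * np.outer(xb, xb)
            r += W[i, j] * xb * (y[i] - a[j])
    beta = np.linalg.solve(A + ridge * np.eye(p * d), r)
    Q, _ = np.linalg.qr(beta.reshape(p, d))          # restore B^T B = I_d
    return Q, a, b
```

Iterating `mave_update` from a starting value of $B$ until the column space stabilizes gives the basic (unpenalized) MAVE estimator that serves as the initial value below.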

2.2. Penalized MAVE

In practice, it is often the case that some of the predictors have only marginal influence on the response. Effectively removing these irrelevant predictors could improve estimation accuracy and enhance model interpretability. This motivates us to apply regularization to simultaneously estimate parameters and select relevant predictors. By Proposition 1 in Cook (2004), the $k$th predictor can be removed from the model if and only if the $k$th row of $B_0 D$ is a zero vector, where $D$ denotes an arbitrary $d \times d$ orthogonal matrix. Thus, the selection of predictors can be realized through the selection of non-vanishing rows of $B_0 D$. Building upon this observation and the iterative least squares MAVE algorithm, we propose the penalized MAVE procedure, as follows.

For some $d \times d$ orthogonal matrix $D$, suppose that $\tilde{B}$ is an initial estimator of $B_0 D$. Replace $B$ in (2.8) by $\tilde{B}$ and let $\tilde{w}_{ij}$ denote the new weights. For each $j$, let
$$(\tilde{a}_j, \tilde{b}_j) = \operatorname*{argmin}_{a_j \in \mathbb{R},\, b_j \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n (y_i - a_j - b_j^T \tilde{B}^T x_{ij})^2 \tilde{w}_{ij}. \qquad (2.9)$$
Write $\beta = (\beta_1^T, \ldots, \beta_p^T)^T$ and $\tilde{\beta} = (\tilde{\beta}_1^T, \ldots, \tilde{\beta}_p^T)^T$, where $\beta_k = (\beta_k^1, \ldots, \beta_k^d)^T \in \mathbb{R}^d$ is the $k$th row of $B \in \mathbb{R}^{p \times d}$ and $\tilde{\beta}_k = (\tilde{\beta}_k^1, \ldots, \tilde{\beta}_k^d)^T \in \mathbb{R}^d$ is the $k$th row of $\tilde{B} \in \mathbb{R}^{p \times d}$, respectively. Similarly, write $\tilde{x}_{ij} = x_{ij} \otimes \tilde{b}_j = (\tilde{x}_{ij1}^T, \ldots, \tilde{x}_{ijp}^T)^T$, where $\tilde{x}_{ijk} = x_{ij}^k \tilde{b}_j$ and $x_{ij}^k$ is the $k$th component of $x_{ij}$. Let $\tilde{y}_{ij} = y_i - \tilde{a}_j$. For $a \in \mathbb{R}^m$, the $L_r$ norm is written as $\|a\|_r$. We consider the objective function
$$\Psi_{\lambda_n}(\beta; \tilde{\beta}) = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n \{y_i - \tilde{a}_j - (x_{ij}^T \otimes \tilde{b}_j^T)\beta\}^2 \tilde{w}_{ij} + \lambda_n \sum_{k=1}^p \|\beta_k\|_1^\gamma = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \lambda_n \sum_{k=1}^p \|\beta_k\|_1^\gamma, \qquad (2.10)$$
where $\lambda_n > 0$ is a tuning parameter and $\gamma \in (0, 1)$. By adopting a bridge penalty for the $L_1$ norms of the rows of $B$, it is then possible to carry out variable screening and element screening simultaneously. We call $\hat{\beta}$ minimizing (2.10) the penalized MAVE estimator (P-MAVE). In the literature, Huang et al. (2009) adopted the same penalty function to perform group variable selection in linear regression models.
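As a small sanity check on the notation, the criterion (2.10) can be written down directly once $\tilde{y}_{ij}$, $\tilde{x}_{ij}$ and $\tilde{w}_{ij}$ have been formed; the sketch below assumes those arrays are already available and only illustrates the group-bridge structure of the penalty.

```python
import numpy as np

def pmave_objective(beta, ytil, xtil, wtil, lam, p, d, gamma=0.5):
    """Evaluate Psi_{lambda_n}(beta) in (2.10).  ytil[i, j] = y_i - a_j,
    xtil[i, j] = x_ij Kronecker b_j, wtil[i, j] = kernel weights; beta stacks
    the d-dimensional rows of B, so beta.reshape(p, d)[k] is the k-th row."""
    n = ytil.shape[0]
    resid = ytil - xtil @ beta                       # (n, n) array of residuals
    loss = np.sum(wtil * resid ** 2) / (2.0 * n)
    row_l1 = np.abs(beta.reshape(p, d)).sum(axis=1)  # L1 norm of each row of B
    return loss + lam * np.sum(row_l1 ** gamma)      # bridge penalty on row norms
```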

2.3. Computation

Because the penalty term in (2.10) is non-convex, direct minimization of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ is difficult. However, it is possible to give an equivalent form, in a certain sense, of (2.10), one that is easier to compute. Take
$$\Phi_{\lambda_n}(\beta, \theta; \tilde{\beta}) = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \sum_{k=1}^p \theta_k^{1-1/\gamma} \|\beta_k\|_1 + \tau_n \sum_{k=1}^p \theta_k, \qquad (2.11)$$
where $\tau_n$ is a penalty parameter. The following is proved in the Appendix.

Proposition 1. Suppose that $\gamma \in (0, 1)$. If $\lambda_n = \tau_n^{1-\gamma} \gamma^{-\gamma} (1-\gamma)^{\gamma-1}$, then $\hat{\beta}$ is a minimizer of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ if and only if $(\hat{\beta}, \hat{\theta})$ is a minimizer of $\Phi_{\lambda_n}(\beta, \theta; \tilde{\beta})$ subject to $\theta \geq 0$, where $\theta \geq 0$ means $\theta_k \geq 0$ for $k = 1, \ldots, p$.

To estimate $\beta$ and $\theta$, we can use an iterative approach (see, e.g., Huang et al. (2009)). Specifically, we first fix $\beta$ and estimate $\theta$, then we fix $\theta$ and estimate $\beta$, and we iterate between these two steps to convergence. Since the value of the objective function decreases at each step, the process is guaranteed to converge. The algorithm proceeds as follows.

S1. Initialization. Set $\beta^{(0)} = \tilde{\beta}$ and $s = 1$.

S2. For $k = 1, \ldots, p$, compute
$$\theta_k^{(s)} = \Big(\frac{1-\gamma}{\tau_n \gamma}\Big)^\gamma \|\beta_k^{(s-1)}\|_1^\gamma.$$

S3. Compute
$$\beta^{(s)} = \operatorname*{argmin}_\beta\ \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \sum_{k=1}^p \big(\theta_k^{(s)}\big)^{1-1/\gamma} \|\beta_k\|_1.$$
This is a LASSO problem and can be solved efficiently using the LARS algorithm (Efron et al. (2004)) or the fast coordinate descent algorithm (Friedman et al. (2007)).

S4. If $\beta^{(s)}$ and $\beta^{(s-1)}$ are close enough, stop; otherwise, set $s = s + 1$ and go back to Step S2.

With a good initial value $\tilde{\beta}$, the penalized MAVE estimator obtained in this way has nice statistical properties, as shown in the following section. In practice, we might invoke one further step by taking the penalized MAVE estimator as the initial estimator, updating $\tilde{a}_j$ and $\tilde{b}_j$ from (2.9), and carrying out steps S1-S4 to convergence. The resulting estimator is called the one-step penalized MAVE estimator.
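A bare-bones version of steps S1-S4, assuming the arrays from the sketches above, is given below. The inner LASSO step is solved here by plain coordinate descent rather than LARS; the convergence tolerances and the handling of rows that hit zero are implementation choices, not part of the paper.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pmave_fit(ytil, xtil, wtil, tau, beta_init, p, d, gamma=0.5,
              n_outer=50, n_inner=200, tol=1e-6):
    """Steps S1-S4: alternate the closed-form theta update with a weighted
    lasso for beta, for a given penalty parameter tau_n (see Proposition 1
    for the correspondence with lambda_n).  beta_init plays the role of
    the pilot estimate beta-tilde in step S1."""
    n = ytil.shape[0]
    sw = np.sqrt(wtil / n).ravel()                   # (1/2)||yw - Xw b||^2 equals the loss in (2.11)
    Xw = xtil.reshape(-1, p * d) * sw[:, None]
    yw = ytil.ravel() * sw
    beta = np.asarray(beta_init, dtype=float).ravel().copy()
    col_ss = np.sum(Xw ** 2, axis=0) + 1e-12
    expo = 1.0 - 1.0 / gamma
    for _ in range(n_outer):                         # S2-S4 loop
        beta_old = beta.copy()
        row_l1 = np.abs(beta.reshape(p, d)).sum(axis=1)
        theta = ((1.0 - gamma) / (tau * gamma)) ** gamma * row_l1 ** gamma   # S2
        pen = np.repeat(np.array([t ** expo if t > 0 else np.inf for t in theta]), d)
        resid = yw - Xw @ beta                       # S3: coordinate descent lasso
        for _ in range(n_inner):
            for k in range(p * d):
                resid += Xw[:, k] * beta[k]
                z = Xw[:, k] @ resid
                beta[k] = 0.0 if np.isinf(pen[k]) else soft_threshold(z, pen[k]) / col_ss[k]
                resid -= Xw[:, k] * beta[k]
        if np.max(np.abs(beta - beta_old)) < tol:    # S4: stop on small change
            break
    return beta.reshape(p, d)                        # rows of the fitted B
```

Once a row's $L_1$ norm reaches zero, its penalty weight becomes infinite and the row stays at zero, which mirrors the behavior of the group bridge algorithm of Huang et al. (2009).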

As seen later in the numerical studies, the one-step estimator is able to stabilize the estimation and improve the penalized MAVE estimator. Furthermore, it is less sensitive to the choice of the initial estimator $\tilde{\beta}$.

3. Asymptotic Theory

3.1. Basic theoretical properties

In what follows, we assume with no loss of generality that only the first $q$ predictors are relevant to the response, where $d \leq q < p$. Let $A_1 = \{1, \ldots, q\}$ and $A_2 = \{q+1, \ldots, p\}$. With a slight abuse of notation, write $\beta_0 = (\beta_{0A_1}^T, \beta_{0A_2}^T)^T$ with $\beta_{0A_1} = (\beta_{0k}, k \in A_1)$ a sub-vector formed by the first $qd$ components of $\beta_0$, and $\beta_{0A_2} = (\beta_{0k}, k \in A_2)$ formed by the last $pd - qd$ components. Let $\mu_B(x) = E(X \mid B^T X = B^T x)$ and $\nu_B(x) = \mu_B(x) - x$. We use $\nu_{B,A_1}(x)$ to denote the subvector consisting of the first $q$ components of $\nu_B(x)$. Define
$$W_{g,A_1} = E[\{\nu_{B_0,A_1}(X)\nu_{B_0,A_1}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}],$$
$$W_g = E[\{\nu_{B_0}(X)\nu_{B_0}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}],$$
$$U_{g,A_1} = E[\{\nu_{B_0,A_1}(X)\nu_{B_0}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}].$$
We need the following condition.

(A) $\tilde{B} = B_0 D + O_P(n^{-1/2})$, or equivalently $\tilde{\beta} - \beta_0 = O_P(n^{-1/2})$.

Condition (A) is mild and typically holds. In particular, the ordinary MAVE estimator can be used as the initial estimator (Xia (2006); Wang and Xia (2008)). Note that the group bridge penalty in (2.10) is added to each row of $B$ and the penalty is not rotation free, so the sparseness depends on a special form of rotation. Under condition (A), $D$ is the underlying rotation matrix and the identifiability issue vanishes automatically.

Theorem 1. Suppose that $\gamma \in (0, 1)$ and $d \leq 3$. Assume (A) and conditions (C1)-(C5) in the Appendix.

(i) If $\lambda_n n^{-1/2} = O(1)$, then there exists a local minimizer $\hat{\beta}$ of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ such that
$$\|\hat{\beta} - \beta_0\|_2 = O_P\Big(\frac{1}{\sqrt{n}}\Big).$$

(ii) If $\lambda_n n^{-1/2} = O(1)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then $P(\hat{\beta}_{A_2} = 0) \to 1$ as $n \to \infty$.

(iii) Suppose that $\sqrt{n}(\tilde{\beta} - \beta_0) \to_L G_1$ and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \{\nu_{B_0,A_1}(x_i) \otimes \nabla g(B_0^T x_i)\}\epsilon_i \to_L G_2.$$

If $\lambda_n n^{-1/2} \to \lambda_0 \in [0, \infty)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L \operatorname*{argmin}_u \{V(u),\ u = (u_1^T, \ldots, u_q^T)^T \in \mathbb{R}^{qd}\},$$
where
$$V(u) = u^T (U_{g,A_1} G_1 + G_2) + u^T W_{g,A_1} u + \lambda_0 \gamma \sum_{k=1}^q \|\beta_{0k}\|_1^{\gamma-1} \sum_{l=1}^d \{u_k^l \operatorname{sign}(\beta_{0k}^l) I(\beta_{0k}^l \neq 0) + |u_k^l| I(\beta_{0k}^l = 0)\}.$$
In particular, when $\lambda_0 = 0$,
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L -\tfrac{1}{2} W_{g,A_1}^+ (U_{g,A_1} G_1 + G_2),$$
where $W_{g,A_1}^+$ denotes the Moore-Penrose inverse matrix of $W_{g,A_1}$.

Theorem 1 indicates that the penalized MAVE estimator is root-$n$ consistent, and that it is able to select all relevant predictors consistently. Part (iii) of Theorem 1 has implications: if $\lambda_0 > 0$, the limit distribution of $\hat{\beta}_{A_1}$, which corresponds to the relevant predictors, puts positive probability at $0$ when some components of $\beta_{0A_1}$ are exactly $0$; that is, element screening along with variable screening is achievable. If $\lambda_0 = 0$, the estimator $\hat{\beta}_{A_1}$ of $\beta_{0A_1}$ is asymptotically normal. For general multiple-index models, the ordinary MAVE estimate is consistent, but it is hard to specify the asymptotic distribution, particularly for $d > 3$. The asymptotic distribution is available when we use inverse regression estimates, such as sliced inverse regression, as initial estimates. However, its variance has a very complicated structure (see Zhu and Ng (1995)).

In the special case of the single-index model, we have the oracle property.

Corollary 1. Adopt the conditions of Theorem 1 with $d = 1$. Suppose that $\tilde{\beta}$ is the ordinary MAVE estimator.

(i) If $\lambda_n n^{-1/2} = O(1)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then $P(\hat{\beta}_{A_2} = 0) \to 1$ as $n \to \infty$.

(ii) If $n^{-1/2} \sum_{i=1}^n g'(\beta_0^T x_i) \nu_{\beta_0,A_1}(x_i) \epsilon_i \to_L G_2$, $\lambda_n n^{-1/2} \to 0$ and $\lambda_n n^{-\gamma/2} \to \infty$, then
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L -W_{g,A_1}^+ G_2.$$

Remark 1. It is clear that $B_0$ in model (2.1) or $B$ in (2.3) is not identifiable. So the initial condition $\tilde{B} = B_0 D + O_P(n^{-1/2})$ for some $d \times d$ orthogonal matrix $D$ is necessary. Many estimates, including the ordinary MAVE estimate, can serve as an initial estimate because they can consistently estimate $B_0 D$ ($D$ may be different for different estimates).

Of course, the P-MAVE depends on the initial one, and we have shown in Theorem 1 that it is a consistent estimator of $B_0 D$. Nevertheless, the selection consistency of the P-MAVE does not depend on the choice of the initial estimator. Note that at the population level, the (unique) solution of the MAVE can only identify $B_0 D$ for some $d \times d$ orthogonal matrix $D$. This is why we say that dimension reduction methods can identify the central (mean) dimension reduction subspace. The MAVE algorithm provides a consistent estimate of $B_0 D$ (see Xia et al. (2002) and Wang and Xia (2008)), but this does not affect the selection by our procedure. It is easy to see that, if the $k$th row of $B_0$ is a zero vector (we say $B_0$ is sparse in this sense), then the $k$th row of $B_0 D$ for any $d \times d$ orthogonal matrix $D$ is also zero, and vice versa. Thus, the ordinary MAVE procedure estimates the sparse version $B_0 D$ for some $D$, and we can apply the techniques of shrinkage and selection to produce sparse solutions, such as the P-MAVE.

3.2. Tuning parameter selection

In practice, the attractive properties of the penalized MAVE estimator rely heavily on the choice of the tuning parameter that controls the model complexity. In the literature, generalized cross-validation (GCV) has been extensively used to choose tuning parameters; see Fan and Li (2001). Recently, Wang, Li, and Tsai (2007) proved that the model selected by GCV tends to overfit, while the Bayesian information criterion (BIC) is able to identify a finite-dimensional model consistently. This motivates us to consider a BIC-type criterion for tuning parameter selection. Let
$$\mathrm{RSS}_\lambda = \frac{1}{2n^2} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \hat{\beta}_\lambda)^2 \tilde{w}_{ij}, \qquad (3.1)$$
$$\mathrm{BIC}_\lambda = \log \mathrm{RSS}_\lambda + d \cdot \mathrm{df}_\lambda \cdot \frac{\log n}{n}, \qquad (3.2)$$
where $\hat{\beta}_\lambda = (\hat{\beta}_{\lambda 1}^T, \ldots, \hat{\beta}_{\lambda p}^T)^T$ is the minimizer of $\Psi_\lambda(\beta; \tilde{\beta})$, and $\mathrm{df}_\lambda$ denotes the number of nonzero sub-vectors in $\{\hat{\beta}_{\lambda 1}, \ldots, \hat{\beta}_{\lambda p}\}$. We use the generic notation $M \subseteq \{1, \ldots, p\}$ to denote an arbitrary candidate model. Let $M_F = \{1, \ldots, p\}$, $M_T = \{1, \ldots, q\}$, and $M_\lambda = \{k : \hat{\beta}_{\lambda,k} \neq 0\}$ be the full model, the true model, and the model that is identified by $\hat{\beta}_\lambda$, respectively.

Theorem 2. Let $\hat{\lambda}$ denote the tuning parameter selected by minimizing $\mathrm{BIC}_\lambda$. Under the conditions of Corollary 1, $P(M_{\hat{\lambda}} = M_T) \to 1$ as $n \to \infty$.
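A direct way to use this criterion is to fit P-MAVE over a grid of tuning parameters and pick the value minimizing (3.2). The sketch below assumes a fitting routine such as the `pmave_fit` function sketched earlier and only shows how $\mathrm{RSS}_\lambda$ and the degrees-of-freedom term are combined; the zero threshold is an implementation choice.

```python
import numpy as np

def select_by_bic(lambdas, fit_fn, ytil, xtil, wtil, d):
    """Grid search over tuning parameters using the BIC-type criterion (3.2).
    fit_fn(lam) should return the fitted p x d matrix B_hat for that value."""
    n = ytil.shape[0]
    best_lam, best_bic, best_B = None, np.inf, None
    for lam in lambdas:
        B_hat = fit_fn(lam)
        resid = ytil - xtil @ B_hat.reshape(-1)
        rss = np.sum(wtil * resid ** 2) / (2.0 * n ** 2)        # RSS_lambda, (3.1)
        df = int(np.sum(np.abs(B_hat).sum(axis=1) > 1e-8))      # nonzero rows of B_hat
        bic = np.log(rss) + d * df * np.log(n) / n              # BIC_lambda, (3.2)
        if bic < best_bic:
            best_lam, best_bic, best_B = lam, bic, B_hat
    return best_lam, best_B
```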

4. Numerical Studies

4.1. Simulations

In this subsection, we report on a study of the finite sample performance of the proposed method via simulations. Throughout, we took the ordinary MAVE estimator to be the initial estimator and used the BIC-type criterion (3.2) to select the tuning parameter. To be specific, we used the refined MAVE estimator with the multi-dimensional Gaussian kernel, and adopted the optimal bandwidth $h = \{4/(d+2)\}^{1/(d+4)} n^{-1/(d+4)}$ (Silverman (1999)). For measuring the estimation accuracy, we employed the vector correlation coefficient $\mathrm{COR1} = (\prod_{l=1}^d \phi_l^2)^{1/2}$ and the trace correlation coefficient $\mathrm{COR2} = (d^{-1} \sum_{l=1}^d \phi_l^2)^{1/2}$, where the $\phi_l^2$'s are the eigenvalues of the matrix $\hat{B}^T B_0 B_0^T \hat{B}$; see Ye and Weiss (2003) for more details. Furthermore, we used four summary statistics to assess how well the method selected predictors: MS, TPR, FDR, and CM. MS is the average number of identified predictors, TPR is the average ratio of the number of correctly identified predictors to the number of truly important predictors, FDR is the average ratio of the number of falsely identified predictors to the total number of identified predictors, and CM is the fraction of runs in which the correct model is selected. In each of the following examples, the reported simulation results are based on 200 data replications.

Example 1. Let $d = 1$, $p = 10$, $q = 4$, and $\beta_0 = (0.4, 0.4, 0.8, 0.2, 0, \ldots, 0)^T$. Consider the single-index model
$$Y = 1 + 2(X^T \beta_0 + 3) \log(3|X^T \beta_0| + 1) + \epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$.

Example 2. The same as Example 1, except that $n = 200$ and $p = 20$.

Example 3. Let $d = 2$, $p = 10$, and $q = 4$. We considered the model
$$Y = \frac{X_1 + 2X_2 + 3X_3}{|2X_1 + X_2 + X_4|} + 0.5\epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$. In this example, the true dimension reduction subspace is spanned by $(1, 2, 3, 0, \ldots, 0)^T/\sqrt{14}$ and $(2, 1, 0, 1, 0, \ldots, 0)^T/\sqrt{6}$.
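For reference, the sketch below generates data in the spirit of Example 1 and computes the two accuracy measures COR1 and COR2 from the eigenvalues of $\hat{B}^T B_0 B_0^T \hat{B}$. The link function is only our reading of the formula above (the absolute value keeping the logarithm well defined is an assumption), so treat the generator as illustrative rather than as the paper's exact design.

```python
import numpy as np

def generate_example1(n=100, p=10, rho=0.5, seed=0):
    """X ~ N(0, Sigma) with Sigma_ij = rho^|i-j|, single index X^T beta0,
    additive N(0, 1) noise; beta0 = (0.4, 0.4, 0.8, 0.2, 0, ..., 0)^T."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:4] = [0.4, 0.4, 0.8, 0.2]
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    u = X @ beta0
    y = 1 + 2 * (u + 3) * np.log(3 * np.abs(u) + 1) + rng.standard_normal(n)  # assumed link
    return X, y, beta0.reshape(p, 1)

def subspace_correlations(B_hat, B0):
    """COR1 (vector correlation) and COR2 (trace correlation) between span(B_hat)
    and span(B0); both inputs are p x d matrices whose columns are first
    orthonormalized before forming B_hat^T B0 B0^T B_hat."""
    Qh, _ = np.linalg.qr(B_hat)
    Q0, _ = np.linalg.qr(B0)
    phi2 = np.clip(np.linalg.eigvalsh(Qh.T @ Q0 @ Q0.T @ Qh), 0.0, 1.0)
    return float(np.sqrt(np.prod(phi2))), float(np.sqrt(np.mean(phi2)))
```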

Example 4. Let $d = 3$, $p = 10$, and $q = 3$. The regression model was
$$Y = X_1 + \frac{X_2}{(1.5 + X_3)^2} + 0.2\epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$. In this example, the true dimension reduction subspace is spanned by $(1, 0, \ldots, 0)^T$, $(0, 1, 0, \ldots, 0)^T$, and $(0, 0, 1, 0, \ldots, 0)^T$.

Example 5. We took the non-linear time series model
$$Y_t = 1 + 0.4 Y_{t-1} + \cos\Big(\frac{\pi}{2} Y_{t-2}\Big) + 0.2\epsilon_t,$$
where the $\epsilon_t$'s are independent and identically distributed $N(0, 1)$, and $X_{t-1} = (Y_{t-1}, \ldots, Y_{t-6})^T$. In this example, $d = 2$, $p = 6$, $q = 2$, and the true dimension reduction subspace is spanned by $(1, 0, \ldots, 0)^T$ and $(0, 1, 0, \ldots, 0)^T$. Our sample sizes were $n = 100$ and $n = 200$.

We evaluated the performance of the P-MAVE estimator as well as the one-step P-MAVE estimator and, in the first four examples, we also chose the refined outer product of gradients estimator (OPG, Xia et al. (2002)) as the initial estimator. We used $\gamma = 0.5$ in the bridge penalty function. For comparison purposes, we report the simulation results of the coordinate-independent sparse estimation. In particular, we applied sliced inverse regression (CISE-SIR) and principal fitted components (CISE-PFC); see Section 3 of Chen, Zou, and Cook (2010) for details.

The simulation results from these five examples are summarized in Tables 1-5, respectively. When the original MAVE estimator was used as the initial value, observations are as follows. In all examples, both the penalized MAVE estimator and its one-step counterpart improved the MAVE estimator in terms of basis estimation. This phenomenon was more evident in Example 5, where the refined MAVE estimator had a very low vector correlation coefficient (COR1) when $n = 100$ (Table 5). In general, the one-step P-MAVE estimator outperformed the penalized MAVE estimator, and the proposed method worked quite well in terms of predictor selection: the true positive rate (TPR) was very high, lowest at 97.33%; and the false discovery rate (FDR) was close to 0. Further, the CM-values in Examples 1-2 were not far from 100% (Tables 1-2), which confirms the consistency of the BIC-type criterion in the context of single-index models. When there was more than one index, $d > 1$, the lowest CM-values were 67.5% and 88.5% for the P-MAVE estimator and the one-step P-MAVE estimator, respectively. We also tried two other selection criteria, generalized cross-validation and Akaike's

12 554 TAO WANG, PEIRONG XU AND LIXING ZHU Table 1. Simulation results for Example 1. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 1, n = 1, p = 1, q = 4, ρ = Initial estimate: Refined MAVE.9979 (.13).9979 (.13) COR1 COR2 MS TPR FDR CM P-MAVE.9993 (.7).9993 (.7) One-step P-MAVE.9994 (.6).9994 (.6) Initial estimate: Refined OPG.9976 (.15).9976 (.15) P-MAVE.9992 (.8).9992 (.8) One-step P-MAVE.9994 (.6).9994 (.6) CISE-SIR.9351 (.593).9351 (.593) CISE-PFC.9365 (.575).9365 (.575) d = 1, n = 1, p = 1, q = 4, ρ =.5 Initial estimate: Refined MAVE.9978 (.14).9978 (.14) P-MAVE.9992 (.9).9992 (.9) One-step P-MAVE.9994 (.6).9994 (.6) Initial estimate: Refined OPG.9973 (.18).9973 (.18) P-MAVE.9992 (.9).9992 (.9) One-step P-MAVE.9994 (.6).9994 (.6) CISE-SIR.8841 (.789).8841 (.789) CISE-PFC.88 (.767).88 (.767) information criterion, and found that they tend to over-select many irrelevant predictors. When the refined OPG estimator was used as the initial value, we can make similar conclusions. In fact, as can be seen from Tables 1-4, the P- MAVE estimator and the one-step P-MAVE estimator, especially the latter, are not very sensitive to the choice of the initial estimator. The P-MAVE estimator and the one-step P-MAVE estimator outperformed CISE-SIR and CISE-PFC. In Example 1, CISE-SIR and CISE-PFC missed on average one or two important predictors. In Example 2, they had much better performance when the predictors were uncorrelated (Table 2). As a conclusion, the coordinate-independent sparse estimation did not work well when either the sample size of the data was low or the predictors were correlated. This is further seen in Example 4 where we found that CISE-SIR and CISE-PFC could only estimate part of the dimension reduction subspace. Unreported results showed that the performance did not improve much when the sample size went from n = 1 to n = 8. In Example 3, CISE-SIR and CISE-PFC failed as expected, as there was a symmetry component in the mean regression function Automobile data As an illustration, we apply the proposed method to an automobile data set

13 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 555 Table 2. Simulation results for Example 2. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 1, n = 2, p = 2, q = 4, ρ = Initial estimate: Refined MAVE.9982 (.7).9982 (.7) COR1 COR2 MS TPR FDR CM P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) Initial estimate: Refined OPG.9978 (.9).9978 (.9) P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) CISE-SIR.9769 (.176).9769 (.176) CISE-PFC.9731 (.211).9731 (.211) d = 1, n = 2, p = 2, q = 4, ρ =.5 Initial estimate: Refined MAVE.9979 (.1).9979 (.1) P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) Initial estimate: Refined OPG.9972 (.14).9972 (.14) P-MAVE.9996 (.5).9996 (.5) One-step P-MAVE.9998 (.2).9998 (.2) CISE-SIR.9599 (.482).9599 (.482) CISE-PFC.941 (.614).941 (.614) (available from the Machine Learning Repository at the University of California- Irvine). The objective of this analysis is to study the relationship between the car price and a set of car attributes. In the data set, there are originally 25 instances and 26 attributes, and there are also some missing values. We focus our analysis on 17 attributes: Wheel-base (X 1 ), Length (X 2 ), Width (X 3 ), Height (X 4 ), Curb-weight (X 5 ), Engine-size (X 6 ), Bore (X 7 ), Stroke (X 8 ), Compression-ratio (X 9 ), Horsepower (X 1 ), Peak-rpm (X 11 ), City-mpg (X 12 ), Highway-mpg (X 13 ), Fuel-type (X 14, = diesel and 1 = gas), Aspiration (X 15, = std and 1 = turbo), No-of-cylinders (X 16, = four cylinders and 1 otherwise), and Drive-wheels (4wd, fwd and rwd). The first 13 attributes are continuous and the rest are categorical. Set X 17 = if a car has four-wheel drive and 2 otherwise, X 18 = 1 if it has frontwheel drive and 2 otherwise. The response Y is the car price in thousands of dollars. Because the number of missing values is quite small compared to the total sample size, we simply discard the instances with missing values. The resulting data set then consists of 195 instances with complete records. For ease of explanation, all predictors, X 1,..., X 18, are standardized separately. As to sufficient dimension reduction, this dataset, when the categorical predictors were removed, was analysed to identify the projection directions when the inverse regression method was applied. See Zhu and Zeng (26). The categorical predictors were

14 556 TAO WANG, PEIRONG XU AND LIXING ZHU Table 3. Simulation results for Example 3. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 2, n = 1, p = 1, q = 4, ρ = Initial estimate: Refined MAVE.9776 (.98).9888 (.49) COR1 COR2 MS TPR FDR CM P-MAVE.9919 (.77).9959 (.38) One-step P-MAVE.9933 (.78).9967 (.39) Initial estimate: Refined OPG.978 (.136).9854 (.68) P-MAVE.9924 (.64).9962 (.32) One-step P-MAVE.9941 (.5).9971 (.25) CISE-SIR.1985 (.2589).6834 (.82) CISE-PFC.331 (.2996).7289 (.968) d = 2, n = 1, p = 1, q = 4, ρ =.5 Initial estimate: Refined MAVE.9676 (.155).9838 (.77) P-MAVE.9897 (.14).9949 (.68) One-step P-MAVE.9911 (.12).9956 (.5) Initial estimate: Refined OPG.946 (.48).9731 (.197) P-MAVE.9846 (.267).9924 (.129) One-step P-MAVE.995 (.118).9952 (.58) CISE-SIR.1916 (.2419).677 (.116) CISE-PFC.2637 (.271).6879 (.1198) removed because inverse regression methods are limited in this regard whereas, MAVE is not. To estimate the dimension, both the cross-validation method of Xia et al. (22) and the modified BIC-type criterion proposed by Wang and Yin (28) suggest taking d = 1. The results of estimation and variable selection are summarized in Table 6. The vector correlation coefficient, the trace correlation coefficient between the MAVE estimator and the P-MAVE estimator, is.9865; between the MAVE estimator and the one-step P-MAVE estimator, is In addition, the P- MAVE estimator and the one-step P-MAVE have similar performance in terms of predictor selection: the relevant predictors in common are Wheel-base (X 1 ), Length (X 2 ), Width (X 3 ), Engine-size (X 6 ), Stroke (X 8 ), Compression-ratio (X 9 ), Peak-rpm (X 11 ), City-mpg (X 12 ), Highway-mpg (X 13 ), and Drive-wheels (X 18 ). The plot of Y against the reduced direction X T ˆβ is shown in Figure 1, displaying a strong non-linear relationship. As seen in Figure 1, the significance of X 18 indicates the difference between front-wheel drive cars and others. It is common knowledge that four-wheel and rear-wheel drive cars are typically more expensive to purchase than comparable front wheel drive cars.
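For readers who want to reproduce this kind of preprocessing, the following pandas sketch drops incomplete records, encodes the categorical attributes, and standardizes every predictor separately, as described above. It assumes the records are already in a DataFrame with illustrative column names, and it does not reproduce the paper's exact 0/1/2 coding of Drive-wheels (which is garbled in this transcription); it only illustrates the general recipe.

```python
import pandas as pd

def preprocess_automobile(df, continuous_cols, categorical_cols, price_col="price"):
    """Illustrative preprocessing: drop rows with missing values, one-hot encode
    the categorical attributes (dropping one level each), standardize every
    predictor separately, and return (X, y) with the price in thousands."""
    df = df.dropna(subset=continuous_cols + categorical_cols + [price_col]).copy()
    X = pd.get_dummies(df[continuous_cols + categorical_cols],
                       columns=categorical_cols, drop_first=True).astype(float)
    X = (X - X.mean()) / X.std()                    # standardize each predictor
    y = df[price_col] / 1000.0                      # price in thousands of dollars
    return X, y
```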

15 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 557 Table 4. Simulation results for Example 4. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 3, n = 1, p = 1,, q = 3, ρ = Initial estimate: Refined MAVE.827 (.2458).9476 (.646) COR1 COR2 MS TPR FDR CM P-MAVE.9546 (.1872).9927 (.382) One-step P-MAVE.9864 (.997).9968 (.211) Initial estimate: Refined OPG.7769 (.2446).9341 (.622) P-MAVE.9193 (.2597).9893 (.395) One-step P-MAVE.982 (.1262).9961 (.265) CISE-SIR.1711 (.3594).7853 (.1362) CISE-PFC.3756 (.4585).868 (.122) d = 3, n = 1, p = 1, q = 3, ρ =.5 Initial estimate: Refined MAVE.665 (.2883).94 (.721) P-MAVE.8758 (.323).9763 (.67) One-step P-MAVE.9641 (.1683).9919 (.368) Initial estimate: Refined OPG.5762 (.28).8744 (.769) P-MAVE.8241 (.353).9778 (.521) One-step P-MAVE.9396 (.2314).99 (.49) CISE-SIR.216 (.1176).791 (.1222) CISE-PFC.29 (.152).7382 (.1147) Table 5. Simulation results for Example 5. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 2, n = 1, p = 6, q = 2 Initial estimate: Refined MAVE.5771 (.2414).8277 (.85) COR1 COR2 MS TPR FDR CM P-MAVE.9365 (.1841).9749 (.68) One-step P-MAVE.9683 (.1532).9889 (.499) d = 2, n = 2, p = 6, q = 2 Initial estimate: Refined MAVE.749 (.279).892 (.739) P-MAVE.9556 (.1458).982 (.526) One-step P-MAVE.992 (.676).9969 (.226) Concluding Remarks In this paper, we suggest a penalized minimum average variance estimation for both dimension reduction and variable selection. This is a forward regression approach that does not depend on the eigen-decomposition of a kernel matrix. As such this method, compared with that of Chen, Zou, and Cook (21), requires

16 558 TAO WANG, PEIRONG XU AND LIXING ZHU Table 6. Estimation and variable selection for the automobile data. Predictor Refined MAVE P-MAVE One-step P-MAVE X X X X X X X X X X X X X X X X X X fewer conditions on the predictors and the regression function. However, as with other methods, P-MAVE seems to have difficulty handling the cases with very large p; it can be extended to the case where p diverges to infinity at certain rate as the sample size tends to infinity. As P-MAVE requires an initial nonparametric estimator with all predictors, dimension reduction and variable selection can be less accurate when p is very large. We have not yet found a good solution to this problem, it deserves further study. Acknowledgement The research described here was supported by a grant from the Research Council of Hong Kong, and a grant from Hong Kong Baptist University, Hong Kong. The authors thank the editor, associate editors and two referees for their constructive comments and suggestions which led to improvement of an early manuscript. Appendix Regularity Conditions. Here are regularity conditions needed for establishing the asymptotic properties of our estimator. (C1) The predictor vector X is bounded and its density function f( ) has a bounded second order derivative; E(X B T X = u) and E(XX T B T X = u)

17 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 559 Figure 1. The automobile data. The cars are represented by o for frontwheel drive cars, for four-wheel drive cars, and + for rear-wheel drive cars. have bounded derivatives with respect to u and B in a small neighborhood of B. (C2) E Y k < for some large k >. (C3) E(Y B T X) has a bounded fourth order derivative in a neighborhood of B. (C4) K( ) is a symmetric density function with bounded first order derivative. In particular, we use the Gaussian kernel with bandwidth h n 1/(d+4). (C5) The smallest eigenvalue of J T W gj is larger than ρ and the largest eigenvalue of W g is less than ρ for some positive constants ρ and ρ. Matrix J in (C5) can be found in the proof of Theorem 1. Conditions (C1) (C4) are similar to those of Xia (26) and Wang and Xia (28). As for condition (C5), consider the special case of the single-index model, so W g is a square matrix of order p and J T W gj is a matrix of order p 1. It follows that β T W g = and β T J =. Note that J T W gj has the same rank as (J, β ) T W g (J, β ), and (J, β ) T W g (J, β ) has the same rank as W g, so

18 56 TAO WANG, PEIRONG XU AND LIXING ZHU J T W gj has full rank p 1. As a result, condition (C5) is mild. Proof of Proposition 1. This follows directly from Proposition 1 in Huang et al. (29). Proof of Theorem 1. There are three steps: Step I establishes the order of the minimizer ˆβ of Ψ λn (β; β), Step II shows that ˆβ is variable selection consistent, and Step III derives the asymptotic distribution. Following Theorem 1 and Theorem 2 of Huang et al. (29), write where Ψ λn (β; β) = L n (β; β) + λ n L n (β; β) = 1 2n β k γ 1, (ỹ ij x T ijβ) 2 w ij. Step I. We prove that there exists a local minimizer ˆβ of Ψ λn (β; β) such that ˆβ β 2 = O P ( 1 n ). (A.1) Because B T B = I d, the number of free parameters in B equals pd d(d + 1)/2. In addition, there are d(d 1)/2 free parameters in D. For example, we may parameterize D via the polar coordinate system. Therefore, the total number of free parameters in B D is pd d 2. For ease of presentation, we first consider the single-index model. We can simply take D = 1. It follows that B = β = (β 1,..., β p ) T R p, but note that β 2 = 1, so g(β T X) does not have a derivative at the point β. To address this problem, we adopt the delete-one-component method as follows. For any vector a = (a 1,..., a p ) T R p, let a (1) = (a 2,..., a p ) T. Then we can write β 1 = β 1 (β (1) ) = (1 β (1) 2 2 )1/2 and β = β(β (1) ) = (β 1 (β (1) ), β (1)T ) T. To exclude the trivial cases, we assume q 2. Then the true parameter β (1) satisfies the constraint β (1) 2 < 1, and β is infinitely differentiable in a neighborhood of = (η 1,..., η p ) T R p (p 1), where β (1). We define the Jacobian matrix as J β (1) η 1 = (1 β (1) 2 2 ) 1/2 β (1) and, for s = 2,..., p, η s is a (p 1)-dimensional unit vector with sth component being 1. When d > 1, a Jacobian matrix J = J β (1) R pd (pd d2) can be defined in a similar way, where the vector β (1) consists of all the free parameters in B D. Let β (1) = β (1) + n 1/2 v, where v 2 = C for some positive constant C.

19 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 561 Applying a Taylor expansion, we can write L n (β ; β) L n (β ; β) = nv T J T β (1) I 1n = nv T J T β (1) vt J T β (1) I 1n + I 2n, [ W g ( β β ) + 1 n i=1 1 n 2 1 n 2 (ỹ ij x T ijβ ) x ij w ij x ij x T ij w ij J (1) β v where β (1) lies between β (1) and β (1). By Lemmas 6.4 and 6.5 in Wang and Xia (28), ] ( ) 1 {ν B (x i ) g(b T x i )}ϵ i + o P n, I 2n = v T J T β (1) It follows that {W g + o P (1)}J (1) β v. Ψ λn (β ; β) Ψ λn (β ; β) = nv T J T β (1) + v T J T β (1) [ W g ( β β ) + 1 n W g J β (1) v + λ n ] {ν B (x i ) g(b T x i )}ϵ i i=1 ( βk γ 1 β k γ 1 ) + o P T 1n + T 2n + T 3n + o P ( 1 n ). ( 1 n ) By (C5), we have T 2n ρ v 2 2 with large probability. Further, by (A), J T W β (1) g n( β β ) + 1 ( ) 1 J T {ν n β (1) B (x i ) g(b T x i )}ϵ i = O P n. i=1 Thus, by choosing a sufficiently large C, T 1n is dominated by T 2n. Since b γ a γ 2(b a)b γ 1 for a b, by the Cauchy-Schwarz inequality, T 3n 2λ n q 2 dλ n 2 d β k γ 1 1 β k β k 1 q β k γ 1 1 β k β k 2 q ( q β k γ 1 1 λ n βk β k 2 2 ) 1/2

20 562 TAO WANG, PEIRONG XU AND LIXING ZHU = O(1)λ n v 2 n = O( v 2 ), where the last equality is from λ n = O(n 1/2 ). By choosing a sufficiently large C, T 3n is also dominated by T 2n. Consequently, it holds with large probability that Ψ λn (β ; β) > Ψ λn (β ; β) uniformly in {v : v 2 = C}. This implies that, with large probability, there exists a local minimizer of of Ψ λn (β; β) in the ball {v : v 2 C}. Therefore, there exists a local minimizer ˆβ of Ψ λn (β; β) such that ˆβ β 2 = O P (n 1/2 ). Step II. In this step, we establish the consistency of variable selection. Let I( ) denote the indicator function. Define β k = ˆβ k I(k A 1 ) for k = 1,..., p. Write β = ( β T 1,..., β T p ) T. By Proposition 1 and the Karush-Kuhn-Tucker condition, we have n 1 n 2 (ỹ ij x T ˆβ) x ij l ijk w ij = ˆθ 1 1/γ sgn( ˆβ k l ), ˆβl k, (A.2) where x l ijk is the lth component of x ijk. Because ˆθ 1 1/γ ˆβ k 1 = γλ n ˆβ k γ 1, we obtain n 1 n 2 (ỹ ij x T ˆβ) x ij l ijk w ij = γλ n ˆβ k γ 1 1 sgn( ˆβ k l ), ˆβl k. (A.3) It follows that n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) = n 1 n 2 (ỹ ij x T ˆβ) ij w ij x l ijk ( ˆβ k l β k l ) k,l: ˆβ k l = γλ n ˆβ k γ 1 1 sgn( ˆβ k l )( ˆβ k l β k l ) k,l: ˆβ k l = k,l γλ n ˆβ k γ 1 1 sgn( ˆβ l k )( ˆβ l k β l k ). Note that from ( ˆβ l k β l k )sgn( ˆβ l k ) = ˆβ l k I(k A 2), we obtain n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) = γλ n ˆβ k γ 1 1 k A 2 = d ˆβ k l l=1 γλ n ˆβ k γ 1 1 ( ˆβ k 1 β k 1 ).

21 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 563 Since γb (γ 1) (b a) b γ a γ for a b, we have γ ˆβ k γ 1 1 ( ˆβ k 1 β k 1 ) ˆβ k γ 1 β k γ 1. And then n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) λ n q ( ˆβ k γ 1 β k γ 1 ) + γλ n By the definition of ˆβ, Ψ λn ( ˆβ; β) Ψ λn ( β; β). It follows that n 1 n 2 λ n (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) + (1 γ)λ n ˆβ k γ 1 λ n β k γ 1 L n ( β; β) L n ( ˆβ; β) = n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) n( ˆβ β) T 1 n 2 x ij x T ij w ij ( ˆβ β). k=q+1 ˆβ k γ 1 k=q+1 Then, with (C5) and Lemma 4 of Wang and Xia (28), we have, in probability, ˆβ k γ 1. (1 γ)λ n k=q+1 ˆβ k γ n( ˆβ β) T 1 n 2 x ij x T ij w ij ( ˆβ β) nρ ˆβ β 2 2 = nρ ˆβ k 2 2 k=q+1 nρ ˆβ β 2 2. Using the result of Step I, (1 γ)λ n ˆβ k γ 1 nρ ˆβ k 2 2 = O P (1). (A.4) On the other hand, ˆβ ( k γ 1 k=q+1 k=q+1 p ˆβ k 1 ) γ ( p k=q+1 k=q+1 k=q+1 ˆβ k 2 2) γ/2. (A.5)

22 564 TAO WANG, PEIRONG XU AND LIXING ZHU By (A.4) and (A.5), if p k=q+1 ˆβ k 2 2 > then (1 γ)λ n nρ ( p k=q+1 But, since λ n n γ/2, we obtain ( p P ˆβ k 2 2) 1 γ/2 = nρ O P (1)(nρ ) 1+γ/2 = O P (n γ/2 ). k=q+1 ˆβ ) k 2 2 > as n. Step III. It remains to derive the asymptotic distribution. Let m R m be a m-vector of zeros. Let u = (u T 1,..., ut q ) T, where u k = (u 1 k,..., ud k )T R d for k = 1,..., q. Take V n (u) = Ψ λn (β + n 1/2 (u T, T (p q) d )T ; β) Ψ λn (β ; β). (A.6) We have shown that, with large probability, ˆβ β = n 1/2 (û T, T (p q) d )T, where û is a minimizer of V n (u). We may re-express V n (u) as V n (u) = n(u T, T (p q) d ) [ W g ( β β ) + 1 n We have + (u T, T (p q) d )W g(u T, T (p q) d )T ] {ν B (x i ) g(b T x i )}ϵ i i=1 q + λ n ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) = u T U g n( β β ) u T 1 n {ν B,A 1 (x i ) g(b T x i )}ϵ i i=1 q +u T W g,a1 u + λ n ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) V 1n + λ n q ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) V 1n + V 2n + o P (1). V 1n L u T (U g,a1 G 1 + G 2 ) + u T W g,a1 u, q d V 2n λ γ β k γ 1 1 {u l k sign(βl k )I(βl k ) + ul k I(βl k = )}. l=1

23 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 565 Therefore, V n (u) L u T (U g,a1 G 1 + G 2 ) + u T W g,a1 u q d + λ γ β k γ 1 1 {u l k sign(βl k )I(βl k ) + ul k I(βl k = )} V (u). l=1 Because û = O P (1), by the argmin continuous mapping theorem of Kim and Pollard (199), n( ˆβA1 β A1 ) = û L argmin V (u). u The proof is complete. Proof of Corollary 1. The corollary follows from observing that, by Theorem 4.2 in Xia (26), W g ( β β ) = 1 ( ) 1 g (β T x i )ν β (x i )ϵ i + o P n. n i=1 Proof of Theorem 2. A candidate model M is said to be underfitted if it misses at least one important predictor, M M T ; it is overfitted if it covers all important predictors but also contains at least one irrelevant predictor, M M T but M M T. According to whether the model M λ is underfitted, correctly fitted, or overfitted, we set R ={λ : M λ M T }, R ={λ : M λ = M T } and R + ={M λ M T, M λ M T }. Let λ n be a reference tuning parameter sequence such that P (M λn = M T ) 1 as n. To prove the theorem, it suffices to show that ( ) P inf BIC λ > BIC λn 1 as n. λ R R + Clearly, we need only consider the single index model. For each M, write RSS M = 1 2n 2 (ỹ ij x T ˇβ ij M ) 2 w ij, (A.7) (A.8) where ˇβ M is an unpenalized estimator such that 1 ˇβ M = 2n 2 (ỹ ij x T ijβ) 2 w ij. argmin β j =: j M

24 566 TAO WANG, PEIRONG XU AND LIXING ZHU We consider two cases. Case 1: Underfitted model. For any λ, we have RSS λ RSS Mλ and RSS Mλ = 1 2n 2 (ỹ ij x T ˇβ ij MF ) 2 w ij + ( ˇβ MF ˇβ Mλ ) T 1 n 2 + ( ˇβ Mλ ˇβ MF ) T 1 2n 2 RSS MF + R 1λ + R 2λ. (ỹ ij x T ˇβ ij MF ) x ij w ij x ij x T ij w ij ( ˇβ Mλ ˇβ MF ) Applying techniques similar to those used in Lemma 1 of Xia et al. (22), we obtain RSS MF = 1 2n 2 ϵ 2 i w ij + o p (1) = O P (1). (A.9) Let Π = (β, J (1) β ). Because Π is full rank, there exists a vector ϕ R p such that ˇβ Mλ = Πϕ = ϕ 1 β + J (1) β ϕ (1). Note that β 2 = 1, ˇβ Mλ 2 = 1, and β T J β (1) 1 ϕ 2 1 and =, we have ϕ (1)T J T β (1) J (1) β ϕ (1) = ˇβ Mλ β 2 2 = (ϕ 1 1) 2 + ϕ (1)T J T β (1) J (1) β ϕ (1) = 2(1 ϕ 1 ). Since λ R, assume without loss of generality that ˇβ Mλ,2 =. Then, 1 ϕ 1 = 1 2 ˇβ Mλ β β 2 2 2, 1 + ϕ 1 = 1 + ˇβ T M λ β (1 + 1 β 2 2 2) = 1 2 β It follows that ϕ (1)T J T J β (1) (1) β ϕ (1) 1 4 β Thus we have for some positive constant, ϕ (1) β β > c. By (C5), we have, with large probability, ( ˇβ Mλ β ) T 1 2n 2 x ij x T ij w ij ( ˇβ Mλ β ) cρ. (A.1)

25 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 567 Further, we can show that ( ˇβ MF β ) T 1 2n 2 ( ) x ij x T ij w ij ( ˇβ 1 MF β ) = O P. (A.11) n By (A.9), (A.1), and (A.11), we have R 1λ + R 2λ cρ in probability. Therefore, with large probability, RSS λ RSS Mλ RSS MF + cρ. (A.12) Note that the above results hold uniformly over all λ R. As for the reference tuning parameter λ n, a similar decomposition yields that ( ) 1 RSS λn = RSS MF + O P. (A.13) n This, together with (A.12) and the definition of BIC λ, implies that ( ) P inf BIC λ > BIC λn 1 as n. λ R (A.14) Case 2: Overfitted model. Consider an arbitrary λ R +. By definition, RSS λ RSS Mλ. Further, by (C5), we can show that ( 1 ) RSS Mλ RSS MF = O P. n It follows that log RSS λ log RSS MF log RSS Mλ log RSS MF = log Similarly, we obtain = RSS M λ RSS MF RSS MF log RSS λn log RSS MF Then we have, with large probability, ( 1 + RSS M λ RSS ) MF RSS MF ( 1 ) {1 + o P (1)} = O P. n = O P ( 1 n inf BIC λ BIC λn = inf log RSS λ log RSS λn + (df λ df λn ) log n λ R + λ R + n ). inf λ R + log RSS λ log RSS λn + log n n O P ( 1 n It thus follows that ( ) P inf BIC λ > BIC λn 1 as n. λ R + The results of Cases 1 and 2 complete the proof. ) + log n n. (A.15)

References

Chen, X., Zou, C. and Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Statist. 38,
Cook, R. D. (1996). Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91,
Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. Ann. Statist. 32,
Cook, R. D. and Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. J. Amer. Statist. Assoc. 86,
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32,
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96,
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35,
Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1,
Huang, J., Ma, S., Xie, H. and Zhang, C. (2009). A group bridge approach for variable selection. Biometrika 96,
Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18,
Li, B. and Dong, Y. (2009). Dimension reduction for nonelliptically distributed predictors. Ann. Statist. 37,
Li, L. (2007). Sparse sufficient dimension reduction. Biometrika 94,
Li, L., Cook, R. D. and Nachtsheim, C. J. (2005). Model-free variable selection. J. Roy. Statist. Soc. Ser. B 67,
Li, K. C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86,
Silverman, B. W. (1999). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wang, H., Li, R. and Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94,
Wang, H. and Xia, Y. (2008). Sliced regression for dimension reduction. J. Amer. Statist. Assoc. 103,
Wang, Q. and Yin, X. (2008). A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE. Comp. Statist. Data Anal. 52,
Xia, Y. (2006). Asymptotic distributions of two estimators of the single-index model. Econometric Theory 22,
Xia, Y., Tong, H., Li, W. K. and Zhu, L. (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B 64,
Ye, Z. and Weiss, R. E. (2003). Using the bootstrap to select one of a new class of dimension reduction methods. J. Amer. Statist. Assoc. 98,
Zhou, J. and He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. Ann. Statist. 36,
Zhu, L. X. and Ng, K. W. (1995). Asymptotics for sliced inverse regression. Statist. Sinica 5,

Zhu, Y. and Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Amer. Statist. Assoc. 101,

Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, P. R. China.

School of Finance and Statistics, East China Normal University, Shanghai, 200241, P. R. China.

Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, P. R. China.

(Received November 2011; accepted May 2012)


More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Nonnegative Garrote Component Selection in Functional ANOVA Models

Nonnegative Garrote Component Selection in Functional ANOVA Models Nonnegative Garrote Component Selection in Functional ANOVA Models Ming Yuan School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 3033-005 Email: myuan@isye.gatech.edu

More information

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12 STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Regression Graphics. R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108

Regression Graphics. R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Sufficient Dimension Reduction for Longitudinally Measured Predictors

Sufficient Dimension Reduction for Longitudinally Measured Predictors Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver

More information

A review on Sliced Inverse Regression

A review on Sliced Inverse Regression A review on Sliced Inverse Regression Kevin Li To cite this version: Kevin Li. A review on Sliced Inverse Regression. 2013. HAL Id: hal-00803698 https://hal.archives-ouvertes.fr/hal-00803698v1

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

A Selective Review of Sufficient Dimension Reduction

A Selective Review of Sufficient Dimension Reduction A Selective Review of Sufficient Dimension Reduction Lexin Li Department of Statistics North Carolina State University Lexin Li (NCSU) Sufficient Dimension Reduction 1 / 19 Outline 1 General Framework

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS

MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS Statistica Sinica 23 (2013), 901-927 doi:http://dx.doi.org/10.5705/ss.2011.058 MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS Hyunkeun Cho and Annie Qu University of Illinois at

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

Sufficient Dimension Reduction using Support Vector Machine and it s variants

Sufficient Dimension Reduction using Support Vector Machine and it s variants Sufficient Dimension Reduction using Support Vector Machine and it s variants Andreas Artemiou School of Mathematics, Cardiff University @AG DANK/BCS Meeting 2013 SDR PSVM Real Data Current Research and

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

Ensemble estimation and variable selection with semiparametric regression models

Ensemble estimation and variable selection with semiparametric regression models Ensemble estimation and variable selection with semiparametric regression models Sunyoung Shin Department of Mathematical Sciences University of Texas at Dallas Joint work with Jason Fine, Yufeng Liu,

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Least Angle Regression, Forward Stagewise and the Lasso

Least Angle Regression, Forward Stagewise and the Lasso January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani Stanford University Annals of Statistics,

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Minimax design criterion for fractional factorial designs

Minimax design criterion for fractional factorial designs Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work

More information

Asymptotic inference for a nonstationary double ar(1) model

Asymptotic inference for a nonstationary double ar(1) model Asymptotic inference for a nonstationary double ar() model By SHIQING LING and DONG LI Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong maling@ust.hk malidong@ust.hk

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

STANDARDIZATION AND THE GROUP LASSO PENALTY

STANDARDIZATION AND THE GROUP LASSO PENALTY Statistica Sinica (0), 983-00 doi:http://dx.doi.org/0.5705/ss.0.075 STANDARDIZATION AND THE GROUP LASSO PENALTY Noah Simon and Robert Tibshirani Stanford University Abstract: We re-examine the original

More information

FUNCTIONAL ADDITIVE REGRESSION. University of Southern California

FUNCTIONAL ADDITIVE REGRESSION. University of Southern California Submitted to the Annals of Statistics FUNCTIONAL ADDITIVE REGRESSION By Yingying Fan and Gareth M. James University of Southern California and Peter Radchenko We suggest a new method, called Functional

More information

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Statistica Sinica 26 (2016), 1117-1128 doi:http://dx.doi.org/10.5705/ss.202015.0240 A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Xu He and Peter Z. G. Qian Chinese Academy of Sciences

More information