PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION

Statistica Sinica 23 (2013), 543-569

PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION

Tao Wang$^1$, Peirong Xu$^2$ and Lixing Zhu$^1$

$^1$Hong Kong Baptist University and $^2$East China Normal University

Abstract: For simultaneous dimension reduction and variable selection for general regression models, including the multi-index model as a special case, we propose a penalized minimum average variance estimation method, combining the ideas of minimum average variance estimation in dimension reduction and regularization in variable selection. The resulting estimator can be found in a computationally efficient manner. Under mild conditions, the new method can consistently select all relevant predictors and has the oracle property. Simulations and a data example demonstrate the effectiveness and efficiency of the proposed method.

Key words and phrases: Dimension reduction, minimum average variance estimation, oracle property, single-index model, sufficient dimension reduction, variable selection.

1. Introduction

Consider a univariate response $Y$ and a predictor vector $X = (X_1, \ldots, X_p)^T$. The goal of regression analysis is to infer about the conditional mean function or, in general, the conditional distribution of $Y \mid X$. To deal with high-dimensional data that arise frequently in modern scientific research, the literature offers sufficient dimension reduction and variable selection. Sufficient dimension reduction aims to replace the predictor vector with its low-dimensional projection onto a suitable subspace of the predictor space without losing regression information, whereas variable selection is to rule out predictors that insignificantly affect the response. More specifically, sufficient dimension reduction seeks to find a subspace $S$ of minimum dimension such that $Y \perp X \mid P_S X$, where $\perp$ indicates independence and $P_S$ stands for the orthogonal projection onto $S$; that is, $Y$ and $X$ are independent conditioned on $P_S X$ or, equivalently, the conditional distribution of $Y \mid X$ equals that of $Y \mid P_S X$. Such a space, denoted $S_{Y|X}$ and often called the central subspace, typically exists and is unique under mild conditions (Cook (1996)). When all projection directions are contained in the conditional mean of $Y$ given $X$, $E(Y \mid X)$, the above independence can be written as $Y \perp E(Y \mid X) \mid P_S X$.

The related subspace is called the central mean subspace (CMS). A number of methods have been proposed to estimate the central/central mean subspace: inverse regression methods, such as the sliced inverse regression method (SIR, Li (1991)) and the sliced average variance estimation method (SAVE, Cook and Weisberg (1991)), and direct regression methods, in particular the minimum average variance estimation method (MAVE, Xia et al. (2002)), are popular. Inverse regression methods are computationally simple and practically useful, but the probabilistic assumptions on the design of predictors, such as the linearity condition (Li (1991)), are considered to be strong (Li and Dong (2009)). Further, some of them fail in one way or another to estimate the central/central mean subspace (Cook and Weisberg (1991)). For the CMS, MAVE is useful for both dimension reduction and nonparametric function estimation. It is free of the linearity condition and often has much better performance in finite samples than inverse regression methods.

The aforementioned methods assume that all predictors contain useful information. As a result, the estimated directions generally involve all of the original predictors, making their interpretation sometimes difficult. Further, if irrelevant predictors are included, quite likely in high-dimensional situations, the precision of estimation as well as the accuracy of prediction is lessened. It is thus crucial to consider variable selection along with dimension reduction, and several attempts at this have been made within the framework of sufficient dimension reduction. For instance, Li, Cook, and Nachtsheim (2005) developed test-based selection procedures, Li (2007) suggested sparse sufficient dimension reduction via integrating inverse regression estimation with a regularization paradigm, Zhou and He (2008) studied a constrained canonical correlation method, and Wang and Yin (2008) proposed sparse minimum average variance estimation. However, most methods are designed for element screening and not variable screening, producing a shrinkage solution for a basis matrix of the central/central mean subspace. Building upon the regression-type formulation of sufficient dimension reduction methods, Chen, Zou, and Cook (2010) proposed a coordinate-independent sparse estimation that can simultaneously achieve dimension reduction and screen out irrelevant predictors. With the aid of the local quadratic approximation, the resulting estimators can be found by an application of the eigen-system analysis. Their general method is subspace-oriented and covers a number of inverse regression methods, but inherits all drawbacks of inverse regression methods, that is, the strong conditions on predictors and regression function. Also, as pointed out in Fan and Li (2001), the local quadratic approximation algorithm has to delete any small coefficient because otherwise it cannot produce a sparse representation.

In this paper, by incorporating a bridge penalty (Frank and Friedman (1993)) on the $L_1$ norms of the rows of a basis matrix, we propose a penalized minimum average variance estimation (P-MAVE) that can achieve sufficient dimension reduction and variable selection simultaneously.

A fast and efficient algorithm is developed to solve the optimization problem, and the resulting estimator naturally adopts a sparse representation. A BIC-type criterion is suggested to select the tuning parameter. The new method is free of the linearity condition that is needed for inverse regression methods (see, e.g., Chen, Zou, and Cook (2010)) and has the oracle property under mild conditions, so that P-MAVE can be applied to time series data while inverse regression methods cannot.

The rest of the paper is organized as follows. Section 2 introduces the new method. Its theoretical properties are studied in Section 3. Numerical studies are reported in Section 4, and there a dataset with categorical predictors is analyzed for variable selection. Concluding remarks are included in Section 5. All technical details are relegated to the Appendix.

2. Methodology

2.1. Minimum average variance estimation revisited

Let $B_0 \in \mathbb{R}^{p \times d}$ denote a $p \times d$ orthogonal matrix with $d < p$, so $B_0^T B_0 = I_d$, where $I_d$ is the $d \times d$ identity matrix. For dimension reduction we consider the model
$$Y = g(B_0^T X) + \epsilon, \qquad (2.1)$$
where $g$ is an unknown smooth link function and $E(\epsilon \mid X) = 0$ almost surely. At the population level, the MAVE procedure minimizes the objective function
$$E\{\operatorname{var}(Y \mid B^T X)\} = E\{Y - E(Y \mid B^T X)\}^2 \qquad (2.2)$$
among all $B \in \mathbb{R}^{p \times d}$ such that $B^T B = I_d$. If $\sigma_B^2(B^T X) = E[\{Y - E(Y \mid B^T X)\}^2 \mid B^T X]$ is the conditional variance of $Y$ given $B^T X$, then
$$E\{Y - E(Y \mid B^T X)\}^2 = E\{\sigma_B^2(B^T X)\}.$$
Thus, minimizing (2.2) is equivalent to
$$\min_{B \in \mathbb{R}^{p \times d}} E\{\sigma_B^2(B^T X)\} \quad \text{subject to } B^T B = I_d.$$

Suppose $\{(y_i, x_i), i = 1, \ldots, n\}$ is a random sample drawn according to (2.1). For $x_i$ close to $x_0$, a local linear approximation is
$$y_i - g(B^T x_i) \approx y_i - g(B^T x_0) - \{\nabla g(B^T x_0)\}^T B^T (x_i - x_0).$$
In the spirit of local linear smoothing, we might estimate $\sigma_B^2(B^T x_0)$ by
$$\hat{\sigma}_B^2(B^T x_0) = \min_{a \in \mathbb{R},\, b \in \mathbb{R}^d} \sum_{i=1}^n \{y_i - a - b^T B^T (x_i - x_0)\}^2 w_{i0},$$
where $w_{i0}$, $i = 1, \ldots, n$, are some weights such that $\sum_{i=1}^n w_{i0} = 1$, often centering at $x_0$.
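To make the local linear step concrete, the following is a minimal numerical sketch (not the paper's implementation) of this estimate at a single point $x_0$, using simple Gaussian product-kernel weights; the function name, the bandwidth, and the small ridge term are illustrative assumptions only.

```python
import numpy as np

def local_variance_estimate(y, X, B, x0, h=0.5):
    """Local linear estimate of sigma_B^2(B^T x0): the weighted residual sum of
    squares after regressing y_i on B^T(x_i - x0) with kernel weights near x0."""
    Z = (X - x0) @ B                                   # B^T (x_i - x0), shape (n, d)
    w = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=1))    # Gaussian kernel weights
    w /= w.sum()                                       # normalize to sum to one
    S = np.column_stack([np.ones(len(y)), Z])          # local design (1, B^T(x_i - x0))
    WS = w[:, None] * S
    coef = np.linalg.solve(S.T @ WS + 1e-10 * np.eye(S.shape[1]), WS.T @ y)
    resid = y - S @ coef
    return float(np.sum(w * resid ** 2))               # estimated sigma_B^2(B^T x0)
```

Averaging such local estimates over the sample and minimizing over orthonormal $B$, as in (2.2), is exactly what the MAVE iteration described next does.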

Let $x_{ij} = x_i - x_j$. The MAVE procedure is to minimize
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.3)$$
over $a_j \in \mathbb{R}$, $b_j \in \mathbb{R}^d$, and $B \in \mathbb{R}^{p \times d}$ such that $B^T B = I_d$. In practice, minimizing (2.3) can be decomposed into two weighted least squares problems by fixing $(a_j, b_j)$, $j = 1, \ldots, n$, and $B$, alternatively. To be specific, given $B$ we minimize, for $j = 1, \ldots, n$,
$$\frac{1}{2n} \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.4)$$
with respect to $a_j \in \mathbb{R}$ and $b_j \in \mathbb{R}^d$. With $c_j = (a_j, b_j^T)^T$ and $s_{ij} = (1, x_{ij}^T B)^T$, the solution is
$$c_j = \Big(\sum_{i=1}^n s_{ij} s_{ij}^T w_{ij}\Big)^{-1} \sum_{i=1}^n s_{ij} y_i w_{ij}. \qquad (2.5)$$
Given $(a_j, b_j)$, $j = 1, \ldots, n$, we minimize
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} \qquad (2.6)$$
over $B \in \mathbb{R}^{p \times d}$ subject to $B^T B = I_d$. Note that
$$\frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (y_i - a_j - b_j^T B^T x_{ij})^2 w_{ij} = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n \{y_i - a_j - (x_{ij}^T \otimes b_j^T)\beta\}^2 w_{ij},$$
where $\beta = \operatorname{vec}(B^T)$ and $\operatorname{vec}(\cdot)$ is a matrix operator that stacks all columns of a matrix into a vector. Again, this is a weighted least squares problem with closed-form solution
$$\beta = \Big[\sum_{j=1}^n \sum_{i=1}^n \{(x_{ij} x_{ij}^T) \otimes (b_j b_j^T)\} w_{ij}\Big]^{-1} \sum_{j=1}^n \sum_{i=1}^n (x_{ij} \otimes b_j)(y_i - a_j) w_{ij}. \qquad (2.7)$$
The procedure now consists of iteration between (2.5) and (2.7). As for the weights $w_{ij}$, we employ the lower-dimensional kernel weights
$$w_{ij} = \frac{K_h(B^T x_{ij})}{\sum_{i=1}^n K_h(B^T x_{ij})}, \qquad (2.8)$$
where $B$ is the latest (current) estimate and $K_h(\cdot)$ is a $d$-dimensional kernel function with bandwidth $h$; see Xia et al. (2002) for more details.
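The two closed-form updates (2.5) and (2.7), together with the kernel weights (2.8), can be prototyped directly. Below is a didactic $O(n^2)$ sketch under the same Gaussian-kernel assumption; it is not the refined MAVE implementation used later in the paper, and the ridge terms are only for numerical stability.

```python
import numpy as np

def mave_update(y, X, B, h=0.5, ridge=1e-8):
    """One MAVE cycle: kernel weights (2.8), local fits (a_j, b_j) via (2.5),
    then the update of beta = vec(B^T) via (2.7), re-orthonormalized."""
    n, p = X.shape
    d = B.shape[1]
    Xij = X[:, None, :] - X[None, :, :]              # x_ij = x_i - x_j, shape (n, n, p)
    Z = Xij @ B                                      # B^T x_ij, shape (n, n, d)
    K = np.exp(-0.5 * np.sum((Z / h) ** 2, axis=2))
    W = K / K.sum(axis=0, keepdims=True)             # w_ij in (2.8), sum over i is one

    a, b = np.zeros(n), np.zeros((n, d))
    for j in range(n):                               # step (2.5): local linear fits
        S = np.column_stack([np.ones(n), Z[:, j, :]])
        WS = W[:, j][:, None] * S
        c = np.linalg.solve(S.T @ WS + ridge * np.eye(d + 1), WS.T @ y)
        a[j], b[j] = c[0], c[1:]

    A = np.zeros((p * d, p * d))                     # step (2.7): update vec(B^T)
    r = np.zeros(p * d)
    for j in range(n):
        for i in range(n):
            xb = np.kron(Xij[i, j], b[j])            # x_ij Kronecker b_j
            A += W[i, j] * np.outer(xb, xb)
            r += W[i, j] * xb * (y[i] - a[j])
    beta = np.linalg.solve(A + ridge * np.eye(p * d), r)
    Q, _ = np.linalg.qr(beta.reshape(p, d))          # restore B^T B = I_d
    return Q, a, b
```

Iterating `mave_update` from a starting value of $B$ until the column space stabilizes gives the basic (unpenalized) MAVE estimator that serves as the initial value below.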

2.2. Penalized MAVE

In practice, it is often the case that some of the predictors have only marginal influence on the response. Effectively removing these irrelevant predictors could improve estimation accuracy and enhance model interpretability. This motivates us to apply regularization to simultaneously estimate parameters and select relevant predictors. By Proposition 1 in Cook (2004), the $k$th predictor can be removed from the model if and only if the $k$th row of $B_0 D$ is a zero vector, where $D$ denotes an arbitrary $d \times d$ orthogonal matrix. Thus, the selection of predictors can be realized through the selection of non-vanishing rows of $B_0 D$. Building upon this observation and the iterative least squares MAVE algorithm, we propose the penalized MAVE procedure, as follows.

For some $d \times d$ orthogonal matrix $D$, suppose that $\tilde{B}$ is an initial estimator of $B_0 D$. Replace $B$ in (2.8) by $\tilde{B}$ and let $\tilde{w}_{ij}$ denote the new weights. For each $j$, let
$$(\tilde{a}_j, \tilde{b}_j) = \operatorname*{argmin}_{a_j \in \mathbb{R},\, b_j \in \mathbb{R}^d} \frac{1}{2n} \sum_{i=1}^n (y_i - a_j - b_j^T \tilde{B}^T x_{ij})^2 \tilde{w}_{ij}. \qquad (2.9)$$
Write $\beta = (\beta_1^T, \ldots, \beta_p^T)^T$ and $\tilde{\beta} = (\tilde{\beta}_1^T, \ldots, \tilde{\beta}_p^T)^T$, where $\beta_k = (\beta_k^1, \ldots, \beta_k^d)^T \in \mathbb{R}^d$ is the $k$th row of $B \in \mathbb{R}^{p \times d}$ and $\tilde{\beta}_k = (\tilde{\beta}_k^1, \ldots, \tilde{\beta}_k^d)^T \in \mathbb{R}^d$ is the $k$th row of $\tilde{B} \in \mathbb{R}^{p \times d}$, respectively. Similarly, write $\tilde{x}_{ij} = x_{ij} \otimes \tilde{b}_j = (\tilde{x}_{ij1}^T, \ldots, \tilde{x}_{ijp}^T)^T$, where $\tilde{x}_{ijk} = x_{ij}^k \tilde{b}_j$ and $x_{ij}^k$ is the $k$th component of $x_{ij}$. Let $\tilde{y}_{ij} = y_i - \tilde{a}_j$. For $a \in \mathbb{R}^m$, the $L_r$ norm is written as $\|a\|_r$. We consider the objective function
$$\Psi_{\lambda_n}(\beta; \tilde{\beta}) = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n \{y_i - \tilde{a}_j - (x_{ij}^T \otimes \tilde{b}_j^T)\beta\}^2 \tilde{w}_{ij} + \lambda_n \sum_{k=1}^p \|\beta_k\|_1^\gamma = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \lambda_n \sum_{k=1}^p \|\beta_k\|_1^\gamma, \qquad (2.10)$$
where $\lambda_n > 0$ is a tuning parameter and $\gamma \in (0, 1)$. By adopting a bridge penalty for the $L_1$ norms of the rows of $B$, it is then possible to carry out variable screening and element screening simultaneously. We call $\hat{\beta}$ minimizing (2.10) the penalized MAVE estimator (P-MAVE). In the literature, Huang et al. (2009) adopted the same penalty function to perform group variable selection in linear regression models.
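As a small sanity check on the notation, the criterion (2.10) can be written down directly once $\tilde{y}_{ij}$, $\tilde{x}_{ij}$ and $\tilde{w}_{ij}$ have been formed; the sketch below assumes those arrays are already available and only illustrates the group-bridge structure of the penalty.

```python
import numpy as np

def pmave_objective(beta, ytil, xtil, wtil, lam, p, d, gamma=0.5):
    """Evaluate Psi_{lambda_n}(beta) in (2.10).  ytil[i, j] = y_i - a_j,
    xtil[i, j] = x_ij Kronecker b_j, wtil[i, j] = kernel weights; beta stacks
    the d-dimensional rows of B, so beta.reshape(p, d)[k] is the k-th row."""
    n = ytil.shape[0]
    resid = ytil - xtil @ beta                       # (n, n) array of residuals
    loss = np.sum(wtil * resid ** 2) / (2.0 * n)
    row_l1 = np.abs(beta.reshape(p, d)).sum(axis=1)  # L1 norm of each row of B
    return loss + lam * np.sum(row_l1 ** gamma)      # bridge penalty on row norms
```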

2.3. Computation

Because the penalty term in (2.10) is non-convex, direct minimization of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ is difficult. However, it is possible to give an equivalent form, in a certain sense, of (2.10), one that is easier to compute. Take
$$\Phi_{\lambda_n}(\beta, \theta; \tilde{\beta}) = \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \sum_{k=1}^p \theta_k^{1-1/\gamma} \|\beta_k\|_1 + \tau_n \sum_{k=1}^p \theta_k, \qquad (2.11)$$
where $\tau_n$ is a penalty parameter. The following is proved in the Appendix.

Proposition 1. Suppose that $\gamma \in (0, 1)$. If $\lambda_n = \tau_n^{1-\gamma} \gamma^{-\gamma} (1-\gamma)^{\gamma-1}$, then $\hat{\beta}$ is a minimizer of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ if and only if $(\hat{\beta}, \hat{\theta})$ is a minimizer of $\Phi_{\lambda_n}(\beta, \theta; \tilde{\beta})$ subject to $\theta \geq 0$, where $\theta \geq 0$ means $\theta_k \geq 0$ for $k = 1, \ldots, p$.

To estimate $\beta$ and $\theta$, we can use an iterative approach (see, e.g., Huang et al. (2009)). Specifically, we first fix $\beta$ and estimate $\theta$, then we fix $\theta$ and estimate $\beta$, and we iterate between these two steps to convergence. Since the value of the objective function decreases at each step, the process is guaranteed to converge. The algorithm proceeds as follows.

S1. Initialization. Set $\beta^{(0)} = \tilde{\beta}$ and $s = 1$.

S2. For $k = 1, \ldots, p$, compute
$$\theta_k^{(s)} = \Big(\frac{1-\gamma}{\tau_n \gamma}\Big)^\gamma \|\beta_k^{(s-1)}\|_1^\gamma.$$

S3. Compute
$$\beta^{(s)} = \operatorname*{argmin}_\beta\ \frac{1}{2n} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \beta)^2 \tilde{w}_{ij} + \sum_{k=1}^p \big(\theta_k^{(s)}\big)^{1-1/\gamma} \|\beta_k\|_1.$$
This is a LASSO problem and can be solved efficiently using the LARS algorithm (Efron et al. (2004)) or the fast coordinate descent algorithm (Friedman et al. (2007)).

S4. If $\beta^{(s)}$ and $\beta^{(s-1)}$ are close enough, stop; otherwise, set $s = s + 1$ and go back to Step S2.

With a good initial value $\tilde{\beta}$, the penalized MAVE estimator obtained in this way has nice statistical properties, as shown in the following section. In practice, we might invoke one further step by taking the penalized MAVE estimator as the initial estimator, updating $\tilde{a}_j$ and $\tilde{b}_j$ from (2.9), and carrying out steps S1-S4 to convergence. The resulting estimator is called the one-step penalized MAVE estimator.
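A bare-bones version of steps S1-S4, assuming the arrays from the sketches above, is given below. The inner LASSO step is solved here by plain coordinate descent rather than LARS; the convergence tolerances and the handling of rows that hit zero are implementation choices, not part of the paper.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pmave_fit(ytil, xtil, wtil, tau, beta_init, p, d, gamma=0.5,
              n_outer=50, n_inner=200, tol=1e-6):
    """Steps S1-S4: alternate the closed-form theta update with a weighted
    lasso for beta, for a given penalty parameter tau_n (see Proposition 1
    for the correspondence with lambda_n).  beta_init plays the role of
    the pilot estimate beta-tilde in step S1."""
    n = ytil.shape[0]
    sw = np.sqrt(wtil / n).ravel()                   # (1/2)||yw - Xw b||^2 equals the loss in (2.11)
    Xw = xtil.reshape(-1, p * d) * sw[:, None]
    yw = ytil.ravel() * sw
    beta = np.asarray(beta_init, dtype=float).ravel().copy()
    col_ss = np.sum(Xw ** 2, axis=0) + 1e-12
    expo = 1.0 - 1.0 / gamma
    for _ in range(n_outer):                         # S2-S4 loop
        beta_old = beta.copy()
        row_l1 = np.abs(beta.reshape(p, d)).sum(axis=1)
        theta = ((1.0 - gamma) / (tau * gamma)) ** gamma * row_l1 ** gamma   # S2
        pen = np.repeat(np.array([t ** expo if t > 0 else np.inf for t in theta]), d)
        resid = yw - Xw @ beta                       # S3: coordinate descent lasso
        for _ in range(n_inner):
            for k in range(p * d):
                resid += Xw[:, k] * beta[k]
                z = Xw[:, k] @ resid
                beta[k] = 0.0 if np.isinf(pen[k]) else soft_threshold(z, pen[k]) / col_ss[k]
                resid -= Xw[:, k] * beta[k]
        if np.max(np.abs(beta - beta_old)) < tol:    # S4: stop on small change
            break
    return beta.reshape(p, d)                        # rows of the fitted B
```

Once a row's $L_1$ norm reaches zero, its penalty weight becomes infinite and the row stays at zero, which mirrors the behavior of the group bridge algorithm of Huang et al. (2009).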

As seen later in the numerical studies, the one-step estimator is able to stabilize the estimation and improve the penalized MAVE estimator. Furthermore, it is less sensitive to the choice of the initial estimator $\tilde{\beta}$.

3. Asymptotic Theory

3.1. Basic theoretical properties

In what follows, we assume with no loss of generality that only the first $q$ predictors are relevant to the response, where $d \leq q < p$. Let $A_1 = \{1, \ldots, q\}$ and $A_2 = \{q+1, \ldots, p\}$. With a slight abuse of notation, write $\beta_0 = (\beta_{0A_1}^T, \beta_{0A_2}^T)^T$ with $\beta_{0A_1} = (\beta_{0k}, k \in A_1)$ a sub-vector formed by the first $qd$ components of $\beta_0$, and $\beta_{0A_2} = (\beta_{0k}, k \in A_2)$ formed by the last $pd - qd$ components. Let $\mu_B(x) = E(X \mid B^T X = B^T x)$ and $\nu_B(x) = \mu_B(x) - x$. We use $\nu_{B,A_1}(x)$ to denote the subvector consisting of the first $q$ components of $\nu_B(x)$. Define
$$W_{g,A_1} = E[\{\nu_{B_0,A_1}(X)\nu_{B_0,A_1}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}],$$
$$W_g = E[\{\nu_{B_0}(X)\nu_{B_0}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}],$$
$$U_{g,A_1} = E[\{\nu_{B_0,A_1}(X)\nu_{B_0}^T(X)\} \otimes \{\nabla g(B_0^T X)\nabla^T g(B_0^T X)\}].$$
We need the following condition.

(A) $\tilde{B} = B_0 D + O_P(n^{-1/2})$, or equivalently $\tilde{\beta} - \beta_0 = O_P(n^{-1/2})$.

Condition (A) is mild and typically holds. In particular, the ordinary MAVE estimator can be used as the initial estimator (Xia (2006); Wang and Xia (2008)). Note that the group bridge penalty in (2.10) is added to each row of $B$ and the penalty is not rotation free, so the sparseness depends on a special form of rotation. Under condition (A), $D$ is the underlying rotation matrix and the identifiability issue vanishes automatically.

Theorem 1. Suppose that $\gamma \in (0, 1)$ and $d \leq 3$. Assume (A) and conditions (C1)-(C5) in the Appendix.

(i) If $\lambda_n n^{-1/2} = O(1)$, then there exists a local minimizer $\hat{\beta}$ of $\Psi_{\lambda_n}(\beta; \tilde{\beta})$ such that
$$\|\hat{\beta} - \beta_0\|_2 = O_P\Big(\frac{1}{\sqrt{n}}\Big).$$

(ii) If $\lambda_n n^{-1/2} = O(1)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then $P(\hat{\beta}_{A_2} = 0) \to 1$ as $n \to \infty$.

(iii) Suppose that $\sqrt{n}(\tilde{\beta} - \beta_0) \to_L G_1$ and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \{\nu_{B_0,A_1}(x_i) \otimes \nabla g(B_0^T x_i)\}\epsilon_i \to_L G_2.$$

If $\lambda_n n^{-1/2} \to \lambda_0 \in [0, \infty)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L \operatorname*{argmin}_u \{V(u),\ u = (u_1^T, \ldots, u_q^T)^T \in \mathbb{R}^{qd}\},$$
where
$$V(u) = u^T (U_{g,A_1} G_1 + G_2) + u^T W_{g,A_1} u + \lambda_0 \gamma \sum_{k=1}^q \|\beta_{0k}\|_1^{\gamma-1} \sum_{l=1}^d \{u_k^l \operatorname{sign}(\beta_{0k}^l) I(\beta_{0k}^l \neq 0) + |u_k^l| I(\beta_{0k}^l = 0)\}.$$
In particular, when $\lambda_0 = 0$,
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L -\tfrac{1}{2} W_{g,A_1}^+ (U_{g,A_1} G_1 + G_2),$$
where $W_{g,A_1}^+$ denotes the Moore-Penrose inverse matrix of $W_{g,A_1}$.

Theorem 1 indicates that the penalized MAVE estimator is root-$n$ consistent, and that it is able to select all relevant predictors consistently. Part (iii) of Theorem 1 has implications: if $\lambda_0 > 0$, the limit distribution of $\hat{\beta}_{A_1}$, which corresponds to the relevant predictors, puts positive probability at $0$ when some components of $\beta_{0A_1}$ are exactly $0$; that is, element screening along with variable screening is achievable. If $\lambda_0 = 0$, the estimator $\hat{\beta}_{A_1}$ of $\beta_{0A_1}$ is asymptotically normal. For general multiple-index models, the ordinary MAVE estimate is consistent, but it is hard to specify the asymptotic distribution, particularly for $d > 3$. The asymptotic distribution is available when we use inverse regression estimates, such as sliced inverse regression, as initial estimates. However, its variance has a very complicated structure (see Zhu and Ng (1995)).

In the special case of the single-index model, we have the oracle property.

Corollary 1. Adopt the conditions of Theorem 1 with $d = 1$. Suppose that $\tilde{\beta}$ is the ordinary MAVE estimator.

(i) If $\lambda_n n^{-1/2} = O(1)$ and $\lambda_n n^{-\gamma/2} \to \infty$, then $P(\hat{\beta}_{A_2} = 0) \to 1$ as $n \to \infty$.

(ii) If $n^{-1/2} \sum_{i=1}^n g'(\beta_0^T x_i) \nu_{\beta_0,A_1}(x_i) \epsilon_i \to_L G_2$, $\lambda_n n^{-1/2} \to 0$ and $\lambda_n n^{-\gamma/2} \to \infty$, then
$$\sqrt{n}(\hat{\beta}_{A_1} - \beta_{0A_1}) \to_L -W_{g,A_1}^+ G_2.$$

Remark 1. It is clear that $B_0$ in model (2.1) or $B$ in (2.3) is not identifiable. So the initial condition $\tilde{B} = B_0 D + O_P(n^{-1/2})$ for some $d \times d$ orthogonal matrix $D$ is necessary. Many estimates, including the ordinary MAVE estimate, can serve as an initial estimate because they can consistently estimate $B_0 D$ ($D$ may be different for different estimates).

Of course, the P-MAVE depends on the initial one, and we have shown in Theorem 1 that it is a consistent estimator of $B_0 D$. Nevertheless, the selection consistency of the P-MAVE does not depend on the choice of the initial estimator. Note that at the population level, the (unique) solution of the MAVE can only identify $B_0 D$ for some $d \times d$ orthogonal matrix $D$. This is why we say that dimension reduction methods can identify the central (mean) dimension reduction subspace. The MAVE algorithm provides a consistent estimate of $B_0 D$ (see Xia et al. (2002) and Wang and Xia (2008)), but this does not affect the selection by our procedure. It is easy to see that, if the $k$th row of $B_0$ is a zero vector (we say $B_0$ is sparse in this sense), then the $k$th row of $B_0 D$ for any $d \times d$ orthogonal matrix $D$ is also zero, and vice versa. Thus, the ordinary MAVE procedure estimates the sparse version $B_0 D$ for some $D$, and we can apply the techniques of shrinkage and selection to produce sparse solutions, such as the P-MAVE.

3.2. Tuning parameter selection

In practice, the attractive properties of the penalized MAVE estimator rely heavily on the choice of the tuning parameter that controls the model complexity. In the literature, generalized cross-validation (GCV) has been extensively used to choose tuning parameters; see Fan and Li (2001). Recently, Wang, Li, and Tsai (2007) proved that the model selected by GCV tends to overfit, while the Bayesian information criterion (BIC) is able to identify a finite-dimensional model consistently. This motivates us to consider a BIC-type criterion for tuning parameter selection. Let
$$\mathrm{RSS}_\lambda = \frac{1}{2n^2} \sum_{j=1}^n \sum_{i=1}^n (\tilde{y}_{ij} - \tilde{x}_{ij}^T \hat{\beta}_\lambda)^2 \tilde{w}_{ij}, \qquad (3.1)$$
$$\mathrm{BIC}_\lambda = \log \mathrm{RSS}_\lambda + d \cdot \mathrm{df}_\lambda \cdot \frac{\log n}{n}, \qquad (3.2)$$
where $\hat{\beta}_\lambda = (\hat{\beta}_{\lambda 1}^T, \ldots, \hat{\beta}_{\lambda p}^T)^T$ is the minimizer of $\Psi_\lambda(\beta; \tilde{\beta})$, and $\mathrm{df}_\lambda$ denotes the number of nonzero sub-vectors in $\{\hat{\beta}_{\lambda 1}, \ldots, \hat{\beta}_{\lambda p}\}$. We use the generic notation $M \subseteq \{1, \ldots, p\}$ to denote an arbitrary candidate model. Let $M_F = \{1, \ldots, p\}$, $M_T = \{1, \ldots, q\}$, and $M_\lambda = \{k : \hat{\beta}_{\lambda,k} \neq 0\}$ be the full model, the true model, and the model that is identified by $\hat{\beta}_\lambda$, respectively.

Theorem 2. Let $\hat{\lambda}$ denote the tuning parameter selected by minimizing $\mathrm{BIC}_\lambda$. Under the conditions of Corollary 1, $P(M_{\hat{\lambda}} = M_T) \to 1$ as $n \to \infty$.
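A direct way to use this criterion is to fit P-MAVE over a grid of tuning parameters and pick the value minimizing (3.2). The sketch below assumes a fitting routine such as the `pmave_fit` function sketched earlier and only shows how $\mathrm{RSS}_\lambda$ and the degrees-of-freedom term are combined; the zero threshold is an implementation choice.

```python
import numpy as np

def select_by_bic(lambdas, fit_fn, ytil, xtil, wtil, d):
    """Grid search over tuning parameters using the BIC-type criterion (3.2).
    fit_fn(lam) should return the fitted p x d matrix B_hat for that value."""
    n = ytil.shape[0]
    best_lam, best_bic, best_B = None, np.inf, None
    for lam in lambdas:
        B_hat = fit_fn(lam)
        resid = ytil - xtil @ B_hat.reshape(-1)
        rss = np.sum(wtil * resid ** 2) / (2.0 * n ** 2)        # RSS_lambda, (3.1)
        df = int(np.sum(np.abs(B_hat).sum(axis=1) > 1e-8))      # nonzero rows of B_hat
        bic = np.log(rss) + d * df * np.log(n) / n              # BIC_lambda, (3.2)
        if bic < best_bic:
            best_lam, best_bic, best_B = lam, bic, B_hat
    return best_lam, best_B
```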

4. Numerical Studies

4.1. Simulations

In this subsection, we report on a study of the finite sample performance of the proposed method via simulations. Throughout, we took the ordinary MAVE estimator to be the initial estimator and used the BIC-type criterion (3.2) to select the tuning parameter. To be specific, we used the refined MAVE estimator with the multi-dimensional Gaussian kernel, and adopted the optimal bandwidth $h = \{4/(d+2)\}^{1/(d+4)} n^{-1/(d+4)}$ (Silverman (1999)). For measuring the estimation accuracy, we employed the vector correlation coefficient $\mathrm{COR1} = (\prod_{l=1}^d \phi_l^2)^{1/2}$ and the trace correlation coefficient $\mathrm{COR2} = (d^{-1} \sum_{l=1}^d \phi_l^2)^{1/2}$, where the $\phi_l^2$'s are the eigenvalues of the matrix $\hat{B}^T B_0 B_0^T \hat{B}$; see Ye and Weiss (2003) for more details. Furthermore, we used four summary statistics to assess how well the method selected predictors: MS, TPR, FDR, and CM. MS is the average number of identified predictors, TPR is the average ratio of the number of correctly identified predictors to the number of truly important predictors, FDR is the average ratio of the number of falsely identified predictors to the total number of identified predictors, and CM is the fraction of runs in which the correct model is selected. In each of the following examples, the reported simulation results are based on 200 data replications.

Example 1. Let $d = 1$, $p = 10$, $q = 4$, and $\beta_0 = (0.4, 0.4, 0.8, 0.2, 0, \ldots, 0)^T$. Consider the single-index model
$$Y = 1 + 2(X^T \beta_0 + 3) \log(3|X^T \beta_0| + 1) + \epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$.

Example 2. The same as Example 1, except that $n = 200$ and $p = 20$.

Example 3. Let $d = 2$, $p = 10$, and $q = 4$. We considered the model
$$Y = \frac{X_1 + 2X_2 + 3X_3}{|2X_1 + X_2 + X_4|} + 0.5\epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$. In this example, the true dimension reduction subspace is spanned by $(1, 2, 3, 0, \ldots, 0)^T/\sqrt{14}$ and $(2, 1, 0, 1, 0, \ldots, 0)^T/\sqrt{6}$.
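For reference, the sketch below generates data in the spirit of Example 1 and computes the two accuracy measures COR1 and COR2 from the eigenvalues of $\hat{B}^T B_0 B_0^T \hat{B}$. The link function is only our reading of the formula above (the absolute value keeping the logarithm well defined is an assumption), so treat the generator as illustrative rather than as the paper's exact design.

```python
import numpy as np

def generate_example1(n=100, p=10, rho=0.5, seed=0):
    """X ~ N(0, Sigma) with Sigma_ij = rho^|i-j|, single index X^T beta0,
    additive N(0, 1) noise; beta0 = (0.4, 0.4, 0.8, 0.2, 0, ..., 0)^T."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:4] = [0.4, 0.4, 0.8, 0.2]
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    u = X @ beta0
    y = 1 + 2 * (u + 3) * np.log(3 * np.abs(u) + 1) + rng.standard_normal(n)  # assumed link
    return X, y, beta0.reshape(p, 1)

def subspace_correlations(B_hat, B0):
    """COR1 (vector correlation) and COR2 (trace correlation) between span(B_hat)
    and span(B0); both inputs are p x d matrices whose columns are first
    orthonormalized before forming B_hat^T B0 B0^T B_hat."""
    Qh, _ = np.linalg.qr(B_hat)
    Q0, _ = np.linalg.qr(B0)
    phi2 = np.clip(np.linalg.eigvalsh(Qh.T @ Q0 @ Q0.T @ Qh), 0.0, 1.0)
    return float(np.sqrt(np.prod(phi2))), float(np.sqrt(np.mean(phi2)))
```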

Example 4. Let $d = 3$, $p = 10$, and $q = 3$. The regression model was
$$Y = X_1 + \frac{X_2}{(1.5 + X_3)^2} + 0.2\epsilon,$$
where $\epsilon \sim N(0, 1)$, $X \sim N(0, \Sigma)$ with $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, 10$, and $\epsilon$ and $X$ are independent. Two values of $\rho$ were explored, $0$ and $0.5$, with the sample size $n = 100$. In this example, the true dimension reduction subspace is spanned by $(1, 0, \ldots, 0)^T$, $(0, 1, 0, \ldots, 0)^T$, and $(0, 0, 1, 0, \ldots, 0)^T$.

Example 5. We took the non-linear time series model
$$Y_t = 1 + 0.4 Y_{t-1} + \cos\Big(\frac{\pi}{2} Y_{t-2}\Big) + 0.2\epsilon_t,$$
where the $\epsilon_t$'s are independent and identically distributed $N(0, 1)$, and $X_{t-1} = (Y_{t-1}, \ldots, Y_{t-6})^T$. In this example, $d = 2$, $p = 6$, $q = 2$, and the true dimension reduction subspace is spanned by $(1, 0, \ldots, 0)^T$ and $(0, 1, 0, \ldots, 0)^T$. Our sample sizes were $n = 100$ and $n = 200$.

We evaluated the performance of the P-MAVE estimator as well as the one-step P-MAVE estimator and, in the first four examples, we also chose the refined outer product of gradients estimator (OPG, Xia et al. (2002)) as the initial estimator. We used $\gamma = 0.5$ in the bridge penalty function. For comparison purposes, we report the simulation results of the coordinate-independent sparse estimation. In particular, we applied sliced inverse regression (CISE-SIR) and principal fitted components (CISE-PFC); see Section 3 of Chen, Zou, and Cook (2010) for details.

The simulation results from these five examples are summarized in Tables 1-5, respectively. When the original MAVE estimator was used as the initial value, observations are as follows. In all examples, both the penalized MAVE estimator and its one-step counterpart improved the MAVE estimator in terms of basis estimation. This phenomenon was more evident in Example 5, where the refined MAVE estimator had a very low vector correlation coefficient (COR1) when $n = 100$ (Table 5). In general, the one-step P-MAVE estimator outperformed the penalized MAVE estimator, and the proposed method worked quite well in terms of predictor selection: the true positive rate (TPR) was very high, lowest at 97.33%; and the false discovery rate (FDR) was close to 0. Further, the CM-values in Examples 1-2 were not far from 100% (Tables 1-2), which confirms the consistency of the BIC-type criterion in the context of single-index models. When there was more than one index, $d > 1$, the lowest CM-values were 67.5% and 88.5% for the P-MAVE estimator and the one-step P-MAVE estimator, respectively. We also tried two other selection criteria, generalized cross-validation and Akaike's

12 554 TAO WANG, PEIRONG XU AND LIXING ZHU Table 1. Simulation results for Example 1. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 1, n = 1, p = 1, q = 4, ρ = Initial estimate: Refined MAVE.9979 (.13).9979 (.13) COR1 COR2 MS TPR FDR CM P-MAVE.9993 (.7).9993 (.7) One-step P-MAVE.9994 (.6).9994 (.6) Initial estimate: Refined OPG.9976 (.15).9976 (.15) P-MAVE.9992 (.8).9992 (.8) One-step P-MAVE.9994 (.6).9994 (.6) CISE-SIR.9351 (.593).9351 (.593) CISE-PFC.9365 (.575).9365 (.575) d = 1, n = 1, p = 1, q = 4, ρ =.5 Initial estimate: Refined MAVE.9978 (.14).9978 (.14) P-MAVE.9992 (.9).9992 (.9) One-step P-MAVE.9994 (.6).9994 (.6) Initial estimate: Refined OPG.9973 (.18).9973 (.18) P-MAVE.9992 (.9).9992 (.9) One-step P-MAVE.9994 (.6).9994 (.6) CISE-SIR.8841 (.789).8841 (.789) CISE-PFC.88 (.767).88 (.767) information criterion, and found that they tend to over-select many irrelevant predictors. When the refined OPG estimator was used as the initial value, we can make similar conclusions. In fact, as can be seen from Tables 1-4, the P- MAVE estimator and the one-step P-MAVE estimator, especially the latter, are not very sensitive to the choice of the initial estimator. The P-MAVE estimator and the one-step P-MAVE estimator outperformed CISE-SIR and CISE-PFC. In Example 1, CISE-SIR and CISE-PFC missed on average one or two important predictors. In Example 2, they had much better performance when the predictors were uncorrelated (Table 2). As a conclusion, the coordinate-independent sparse estimation did not work well when either the sample size of the data was low or the predictors were correlated. This is further seen in Example 4 where we found that CISE-SIR and CISE-PFC could only estimate part of the dimension reduction subspace. Unreported results showed that the performance did not improve much when the sample size went from n = 1 to n = 8. In Example 3, CISE-SIR and CISE-PFC failed as expected, as there was a symmetry component in the mean regression function Automobile data As an illustration, we apply the proposed method to an automobile data set

13 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 555 Table 2. Simulation results for Example 2. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 1, n = 2, p = 2, q = 4, ρ = Initial estimate: Refined MAVE.9982 (.7).9982 (.7) COR1 COR2 MS TPR FDR CM P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) Initial estimate: Refined OPG.9978 (.9).9978 (.9) P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) CISE-SIR.9769 (.176).9769 (.176) CISE-PFC.9731 (.211).9731 (.211) d = 1, n = 2, p = 2, q = 4, ρ =.5 Initial estimate: Refined MAVE.9979 (.1).9979 (.1) P-MAVE.9996 (.4).9996 (.4) One-step P-MAVE.9998 (.2).9998 (.2) Initial estimate: Refined OPG.9972 (.14).9972 (.14) P-MAVE.9996 (.5).9996 (.5) One-step P-MAVE.9998 (.2).9998 (.2) CISE-SIR.9599 (.482).9599 (.482) CISE-PFC.941 (.614).941 (.614) (available from the Machine Learning Repository at the University of California- Irvine). The objective of this analysis is to study the relationship between the car price and a set of car attributes. In the data set, there are originally 25 instances and 26 attributes, and there are also some missing values. We focus our analysis on 17 attributes: Wheel-base (X 1 ), Length (X 2 ), Width (X 3 ), Height (X 4 ), Curb-weight (X 5 ), Engine-size (X 6 ), Bore (X 7 ), Stroke (X 8 ), Compression-ratio (X 9 ), Horsepower (X 1 ), Peak-rpm (X 11 ), City-mpg (X 12 ), Highway-mpg (X 13 ), Fuel-type (X 14, = diesel and 1 = gas), Aspiration (X 15, = std and 1 = turbo), No-of-cylinders (X 16, = four cylinders and 1 otherwise), and Drive-wheels (4wd, fwd and rwd). The first 13 attributes are continuous and the rest are categorical. Set X 17 = if a car has four-wheel drive and 2 otherwise, X 18 = 1 if it has frontwheel drive and 2 otherwise. The response Y is the car price in thousands of dollars. Because the number of missing values is quite small compared to the total sample size, we simply discard the instances with missing values. The resulting data set then consists of 195 instances with complete records. For ease of explanation, all predictors, X 1,..., X 18, are standardized separately. As to sufficient dimension reduction, this dataset, when the categorical predictors were removed, was analysed to identify the projection directions when the inverse regression method was applied. See Zhu and Zeng (26). The categorical predictors were

14 556 TAO WANG, PEIRONG XU AND LIXING ZHU Table 3. Simulation results for Example 3. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 2, n = 1, p = 1, q = 4, ρ = Initial estimate: Refined MAVE.9776 (.98).9888 (.49) COR1 COR2 MS TPR FDR CM P-MAVE.9919 (.77).9959 (.38) One-step P-MAVE.9933 (.78).9967 (.39) Initial estimate: Refined OPG.978 (.136).9854 (.68) P-MAVE.9924 (.64).9962 (.32) One-step P-MAVE.9941 (.5).9971 (.25) CISE-SIR.1985 (.2589).6834 (.82) CISE-PFC.331 (.2996).7289 (.968) d = 2, n = 1, p = 1, q = 4, ρ =.5 Initial estimate: Refined MAVE.9676 (.155).9838 (.77) P-MAVE.9897 (.14).9949 (.68) One-step P-MAVE.9911 (.12).9956 (.5) Initial estimate: Refined OPG.946 (.48).9731 (.197) P-MAVE.9846 (.267).9924 (.129) One-step P-MAVE.995 (.118).9952 (.58) CISE-SIR.1916 (.2419).677 (.116) CISE-PFC.2637 (.271).6879 (.1198) removed because inverse regression methods are limited in this regard whereas, MAVE is not. To estimate the dimension, both the cross-validation method of Xia et al. (22) and the modified BIC-type criterion proposed by Wang and Yin (28) suggest taking d = 1. The results of estimation and variable selection are summarized in Table 6. The vector correlation coefficient, the trace correlation coefficient between the MAVE estimator and the P-MAVE estimator, is.9865; between the MAVE estimator and the one-step P-MAVE estimator, is In addition, the P- MAVE estimator and the one-step P-MAVE have similar performance in terms of predictor selection: the relevant predictors in common are Wheel-base (X 1 ), Length (X 2 ), Width (X 3 ), Engine-size (X 6 ), Stroke (X 8 ), Compression-ratio (X 9 ), Peak-rpm (X 11 ), City-mpg (X 12 ), Highway-mpg (X 13 ), and Drive-wheels (X 18 ). The plot of Y against the reduced direction X T ˆβ is shown in Figure 1, displaying a strong non-linear relationship. As seen in Figure 1, the significance of X 18 indicates the difference between front-wheel drive cars and others. It is common knowledge that four-wheel and rear-wheel drive cars are typically more expensive to purchase than comparable front wheel drive cars.
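For readers who want to reproduce this kind of preprocessing, the following pandas sketch drops incomplete records, encodes the categorical attributes, and standardizes every predictor separately, as described above. It assumes the records are already in a DataFrame with illustrative column names, and it does not reproduce the paper's exact 0/1/2 coding of Drive-wheels (which is garbled in this transcription); it only illustrates the general recipe.

```python
import pandas as pd

def preprocess_automobile(df, continuous_cols, categorical_cols, price_col="price"):
    """Illustrative preprocessing: drop rows with missing values, one-hot encode
    the categorical attributes (dropping one level each), standardize every
    predictor separately, and return (X, y) with the price in thousands."""
    df = df.dropna(subset=continuous_cols + categorical_cols + [price_col]).copy()
    X = pd.get_dummies(df[continuous_cols + categorical_cols],
                       columns=categorical_cols, drop_first=True).astype(float)
    X = (X - X.mean()) / X.std()                    # standardize each predictor
    y = df[price_col] / 1000.0                      # price in thousands of dollars
    return X, y
```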

15 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 557 Table 4. Simulation results for Example 4. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 3, n = 1, p = 1,, q = 3, ρ = Initial estimate: Refined MAVE.827 (.2458).9476 (.646) COR1 COR2 MS TPR FDR CM P-MAVE.9546 (.1872).9927 (.382) One-step P-MAVE.9864 (.997).9968 (.211) Initial estimate: Refined OPG.7769 (.2446).9341 (.622) P-MAVE.9193 (.2597).9893 (.395) One-step P-MAVE.982 (.1262).9961 (.265) CISE-SIR.1711 (.3594).7853 (.1362) CISE-PFC.3756 (.4585).868 (.122) d = 3, n = 1, p = 1, q = 3, ρ =.5 Initial estimate: Refined MAVE.665 (.2883).94 (.721) P-MAVE.8758 (.323).9763 (.67) One-step P-MAVE.9641 (.1683).9919 (.368) Initial estimate: Refined OPG.5762 (.28).8744 (.769) P-MAVE.8241 (.353).9778 (.521) One-step P-MAVE.9396 (.2314).99 (.49) CISE-SIR.216 (.1176).791 (.1222) CISE-PFC.29 (.152).7382 (.1147) Table 5. Simulation results for Example 5. The means and the standard deviations (in parentheses) of the vector correlation coefficient (COR1) and the trace correlation coefficient (COR2), the average model size (MS), the true positive rate (TPR), the false discovery rate (FDR), and the proportion of selecting the correct model (CM), based on 2 data replications, are reported. d = 2, n = 1, p = 6, q = 2 Initial estimate: Refined MAVE.5771 (.2414).8277 (.85) COR1 COR2 MS TPR FDR CM P-MAVE.9365 (.1841).9749 (.68) One-step P-MAVE.9683 (.1532).9889 (.499) d = 2, n = 2, p = 6, q = 2 Initial estimate: Refined MAVE.749 (.279).892 (.739) P-MAVE.9556 (.1458).982 (.526) One-step P-MAVE.992 (.676).9969 (.226) Concluding Remarks In this paper, we suggest a penalized minimum average variance estimation for both dimension reduction and variable selection. This is a forward regression approach that does not depend on the eigen-decomposition of a kernel matrix. As such this method, compared with that of Chen, Zou, and Cook (21), requires

16 558 TAO WANG, PEIRONG XU AND LIXING ZHU Table 6. Estimation and variable selection for the automobile data. Predictor Refined MAVE P-MAVE One-step P-MAVE X X X X X X X X X X X X X X X X X X fewer conditions on the predictors and the regression function. However, as with other methods, P-MAVE seems to have difficulty handling the cases with very large p; it can be extended to the case where p diverges to infinity at certain rate as the sample size tends to infinity. As P-MAVE requires an initial nonparametric estimator with all predictors, dimension reduction and variable selection can be less accurate when p is very large. We have not yet found a good solution to this problem, it deserves further study. Acknowledgement The research described here was supported by a grant from the Research Council of Hong Kong, and a grant from Hong Kong Baptist University, Hong Kong. The authors thank the editor, associate editors and two referees for their constructive comments and suggestions which led to improvement of an early manuscript. Appendix Regularity Conditions. Here are regularity conditions needed for establishing the asymptotic properties of our estimator. (C1) The predictor vector X is bounded and its density function f( ) has a bounded second order derivative; E(X B T X = u) and E(XX T B T X = u)

17 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 559 Figure 1. The automobile data. The cars are represented by o for frontwheel drive cars, for four-wheel drive cars, and + for rear-wheel drive cars. have bounded derivatives with respect to u and B in a small neighborhood of B. (C2) E Y k < for some large k >. (C3) E(Y B T X) has a bounded fourth order derivative in a neighborhood of B. (C4) K( ) is a symmetric density function with bounded first order derivative. In particular, we use the Gaussian kernel with bandwidth h n 1/(d+4). (C5) The smallest eigenvalue of J T W gj is larger than ρ and the largest eigenvalue of W g is less than ρ for some positive constants ρ and ρ. Matrix J in (C5) can be found in the proof of Theorem 1. Conditions (C1) (C4) are similar to those of Xia (26) and Wang and Xia (28). As for condition (C5), consider the special case of the single-index model, so W g is a square matrix of order p and J T W gj is a matrix of order p 1. It follows that β T W g = and β T J =. Note that J T W gj has the same rank as (J, β ) T W g (J, β ), and (J, β ) T W g (J, β ) has the same rank as W g, so

18 56 TAO WANG, PEIRONG XU AND LIXING ZHU J T W gj has full rank p 1. As a result, condition (C5) is mild. Proof of Proposition 1. This follows directly from Proposition 1 in Huang et al. (29). Proof of Theorem 1. There are three steps: Step I establishes the order of the minimizer ˆβ of Ψ λn (β; β), Step II shows that ˆβ is variable selection consistent, and Step III derives the asymptotic distribution. Following Theorem 1 and Theorem 2 of Huang et al. (29), write where Ψ λn (β; β) = L n (β; β) + λ n L n (β; β) = 1 2n β k γ 1, (ỹ ij x T ijβ) 2 w ij. Step I. We prove that there exists a local minimizer ˆβ of Ψ λn (β; β) such that ˆβ β 2 = O P ( 1 n ). (A.1) Because B T B = I d, the number of free parameters in B equals pd d(d + 1)/2. In addition, there are d(d 1)/2 free parameters in D. For example, we may parameterize D via the polar coordinate system. Therefore, the total number of free parameters in B D is pd d 2. For ease of presentation, we first consider the single-index model. We can simply take D = 1. It follows that B = β = (β 1,..., β p ) T R p, but note that β 2 = 1, so g(β T X) does not have a derivative at the point β. To address this problem, we adopt the delete-one-component method as follows. For any vector a = (a 1,..., a p ) T R p, let a (1) = (a 2,..., a p ) T. Then we can write β 1 = β 1 (β (1) ) = (1 β (1) 2 2 )1/2 and β = β(β (1) ) = (β 1 (β (1) ), β (1)T ) T. To exclude the trivial cases, we assume q 2. Then the true parameter β (1) satisfies the constraint β (1) 2 < 1, and β is infinitely differentiable in a neighborhood of = (η 1,..., η p ) T R p (p 1), where β (1). We define the Jacobian matrix as J β (1) η 1 = (1 β (1) 2 2 ) 1/2 β (1) and, for s = 2,..., p, η s is a (p 1)-dimensional unit vector with sth component being 1. When d > 1, a Jacobian matrix J = J β (1) R pd (pd d2) can be defined in a similar way, where the vector β (1) consists of all the free parameters in B D. Let β (1) = β (1) + n 1/2 v, where v 2 = C for some positive constant C.

19 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 561 Applying a Taylor expansion, we can write L n (β ; β) L n (β ; β) = nv T J T β (1) I 1n = nv T J T β (1) vt J T β (1) I 1n + I 2n, [ W g ( β β ) + 1 n i=1 1 n 2 1 n 2 (ỹ ij x T ijβ ) x ij w ij x ij x T ij w ij J (1) β v where β (1) lies between β (1) and β (1). By Lemmas 6.4 and 6.5 in Wang and Xia (28), ] ( ) 1 {ν B (x i ) g(b T x i )}ϵ i + o P n, I 2n = v T J T β (1) It follows that {W g + o P (1)}J (1) β v. Ψ λn (β ; β) Ψ λn (β ; β) = nv T J T β (1) + v T J T β (1) [ W g ( β β ) + 1 n W g J β (1) v + λ n ] {ν B (x i ) g(b T x i )}ϵ i i=1 ( βk γ 1 β k γ 1 ) + o P T 1n + T 2n + T 3n + o P ( 1 n ). ( 1 n ) By (C5), we have T 2n ρ v 2 2 with large probability. Further, by (A), J T W β (1) g n( β β ) + 1 ( ) 1 J T {ν n β (1) B (x i ) g(b T x i )}ϵ i = O P n. i=1 Thus, by choosing a sufficiently large C, T 1n is dominated by T 2n. Since b γ a γ 2(b a)b γ 1 for a b, by the Cauchy-Schwarz inequality, T 3n 2λ n q 2 dλ n 2 d β k γ 1 1 β k β k 1 q β k γ 1 1 β k β k 2 q ( q β k γ 1 1 λ n βk β k 2 2 ) 1/2

20 562 TAO WANG, PEIRONG XU AND LIXING ZHU = O(1)λ n v 2 n = O( v 2 ), where the last equality is from λ n = O(n 1/2 ). By choosing a sufficiently large C, T 3n is also dominated by T 2n. Consequently, it holds with large probability that Ψ λn (β ; β) > Ψ λn (β ; β) uniformly in {v : v 2 = C}. This implies that, with large probability, there exists a local minimizer of of Ψ λn (β; β) in the ball {v : v 2 C}. Therefore, there exists a local minimizer ˆβ of Ψ λn (β; β) such that ˆβ β 2 = O P (n 1/2 ). Step II. In this step, we establish the consistency of variable selection. Let I( ) denote the indicator function. Define β k = ˆβ k I(k A 1 ) for k = 1,..., p. Write β = ( β T 1,..., β T p ) T. By Proposition 1 and the Karush-Kuhn-Tucker condition, we have n 1 n 2 (ỹ ij x T ˆβ) x ij l ijk w ij = ˆθ 1 1/γ sgn( ˆβ k l ), ˆβl k, (A.2) where x l ijk is the lth component of x ijk. Because ˆθ 1 1/γ ˆβ k 1 = γλ n ˆβ k γ 1, we obtain n 1 n 2 (ỹ ij x T ˆβ) x ij l ijk w ij = γλ n ˆβ k γ 1 1 sgn( ˆβ k l ), ˆβl k. (A.3) It follows that n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) = n 1 n 2 (ỹ ij x T ˆβ) ij w ij x l ijk ( ˆβ k l β k l ) k,l: ˆβ k l = γλ n ˆβ k γ 1 1 sgn( ˆβ k l )( ˆβ k l β k l ) k,l: ˆβ k l = k,l γλ n ˆβ k γ 1 1 sgn( ˆβ l k )( ˆβ l k β l k ). Note that from ( ˆβ l k β l k )sgn( ˆβ l k ) = ˆβ l k I(k A 2), we obtain n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) = γλ n ˆβ k γ 1 1 k A 2 = d ˆβ k l l=1 γλ n ˆβ k γ 1 1 ( ˆβ k 1 β k 1 ).

21 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 563 Since γb (γ 1) (b a) b γ a γ for a b, we have γ ˆβ k γ 1 1 ( ˆβ k 1 β k 1 ) ˆβ k γ 1 β k γ 1. And then n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) λ n q ( ˆβ k γ 1 β k γ 1 ) + γλ n By the definition of ˆβ, Ψ λn ( ˆβ; β) Ψ λn ( β; β). It follows that n 1 n 2 λ n (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) + (1 γ)λ n ˆβ k γ 1 λ n β k γ 1 L n ( β; β) L n ( ˆβ; β) = n 1 n 2 (ỹ ij x T ˆβ) ij w ij x T ij( ˆβ β) n( ˆβ β) T 1 n 2 x ij x T ij w ij ( ˆβ β). k=q+1 ˆβ k γ 1 k=q+1 Then, with (C5) and Lemma 4 of Wang and Xia (28), we have, in probability, ˆβ k γ 1. (1 γ)λ n k=q+1 ˆβ k γ n( ˆβ β) T 1 n 2 x ij x T ij w ij ( ˆβ β) nρ ˆβ β 2 2 = nρ ˆβ k 2 2 k=q+1 nρ ˆβ β 2 2. Using the result of Step I, (1 γ)λ n ˆβ k γ 1 nρ ˆβ k 2 2 = O P (1). (A.4) On the other hand, ˆβ ( k γ 1 k=q+1 k=q+1 p ˆβ k 1 ) γ ( p k=q+1 k=q+1 k=q+1 ˆβ k 2 2) γ/2. (A.5)

22 564 TAO WANG, PEIRONG XU AND LIXING ZHU By (A.4) and (A.5), if p k=q+1 ˆβ k 2 2 > then (1 γ)λ n nρ ( p k=q+1 But, since λ n n γ/2, we obtain ( p P ˆβ k 2 2) 1 γ/2 = nρ O P (1)(nρ ) 1+γ/2 = O P (n γ/2 ). k=q+1 ˆβ ) k 2 2 > as n. Step III. It remains to derive the asymptotic distribution. Let m R m be a m-vector of zeros. Let u = (u T 1,..., ut q ) T, where u k = (u 1 k,..., ud k )T R d for k = 1,..., q. Take V n (u) = Ψ λn (β + n 1/2 (u T, T (p q) d )T ; β) Ψ λn (β ; β). (A.6) We have shown that, with large probability, ˆβ β = n 1/2 (û T, T (p q) d )T, where û is a minimizer of V n (u). We may re-express V n (u) as V n (u) = n(u T, T (p q) d ) [ W g ( β β ) + 1 n We have + (u T, T (p q) d )W g(u T, T (p q) d )T ] {ν B (x i ) g(b T x i )}ϵ i i=1 q + λ n ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) = u T U g n( β β ) u T 1 n {ν B,A 1 (x i ) g(b T x i )}ϵ i i=1 q +u T W g,a1 u + λ n ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) V 1n + λ n q ( β k + n 1/2 u k γ 1 β k γ 1 ) + o P (1) V 1n + V 2n + o P (1). V 1n L u T (U g,a1 G 1 + G 2 ) + u T W g,a1 u, q d V 2n λ γ β k γ 1 1 {u l k sign(βl k )I(βl k ) + ul k I(βl k = )}. l=1

23 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 565 Therefore, V n (u) L u T (U g,a1 G 1 + G 2 ) + u T W g,a1 u q d + λ γ β k γ 1 1 {u l k sign(βl k )I(βl k ) + ul k I(βl k = )} V (u). l=1 Because û = O P (1), by the argmin continuous mapping theorem of Kim and Pollard (199), n( ˆβA1 β A1 ) = û L argmin V (u). u The proof is complete. Proof of Corollary 1. The corollary follows from observing that, by Theorem 4.2 in Xia (26), W g ( β β ) = 1 ( ) 1 g (β T x i )ν β (x i )ϵ i + o P n. n i=1 Proof of Theorem 2. A candidate model M is said to be underfitted if it misses at least one important predictor, M M T ; it is overfitted if it covers all important predictors but also contains at least one irrelevant predictor, M M T but M M T. According to whether the model M λ is underfitted, correctly fitted, or overfitted, we set R ={λ : M λ M T }, R ={λ : M λ = M T } and R + ={M λ M T, M λ M T }. Let λ n be a reference tuning parameter sequence such that P (M λn = M T ) 1 as n. To prove the theorem, it suffices to show that ( ) P inf BIC λ > BIC λn 1 as n. λ R R + Clearly, we need only consider the single index model. For each M, write RSS M = 1 2n 2 (ỹ ij x T ˇβ ij M ) 2 w ij, (A.7) (A.8) where ˇβ M is an unpenalized estimator such that 1 ˇβ M = 2n 2 (ỹ ij x T ijβ) 2 w ij. argmin β j =: j M

24 566 TAO WANG, PEIRONG XU AND LIXING ZHU We consider two cases. Case 1: Underfitted model. For any λ, we have RSS λ RSS Mλ and RSS Mλ = 1 2n 2 (ỹ ij x T ˇβ ij MF ) 2 w ij + ( ˇβ MF ˇβ Mλ ) T 1 n 2 + ( ˇβ Mλ ˇβ MF ) T 1 2n 2 RSS MF + R 1λ + R 2λ. (ỹ ij x T ˇβ ij MF ) x ij w ij x ij x T ij w ij ( ˇβ Mλ ˇβ MF ) Applying techniques similar to those used in Lemma 1 of Xia et al. (22), we obtain RSS MF = 1 2n 2 ϵ 2 i w ij + o p (1) = O P (1). (A.9) Let Π = (β, J (1) β ). Because Π is full rank, there exists a vector ϕ R p such that ˇβ Mλ = Πϕ = ϕ 1 β + J (1) β ϕ (1). Note that β 2 = 1, ˇβ Mλ 2 = 1, and β T J β (1) 1 ϕ 2 1 and =, we have ϕ (1)T J T β (1) J (1) β ϕ (1) = ˇβ Mλ β 2 2 = (ϕ 1 1) 2 + ϕ (1)T J T β (1) J (1) β ϕ (1) = 2(1 ϕ 1 ). Since λ R, assume without loss of generality that ˇβ Mλ,2 =. Then, 1 ϕ 1 = 1 2 ˇβ Mλ β β 2 2 2, 1 + ϕ 1 = 1 + ˇβ T M λ β (1 + 1 β 2 2 2) = 1 2 β It follows that ϕ (1)T J T J β (1) (1) β ϕ (1) 1 4 β Thus we have for some positive constant, ϕ (1) β β > c. By (C5), we have, with large probability, ( ˇβ Mλ β ) T 1 2n 2 x ij x T ij w ij ( ˇβ Mλ β ) cρ. (A.1)

25 PENALIZED MINIMUM AVERAGE VARIANCE ESTIMATION 567 Further, we can show that ( ˇβ MF β ) T 1 2n 2 ( ) x ij x T ij w ij ( ˇβ 1 MF β ) = O P. (A.11) n By (A.9), (A.1), and (A.11), we have R 1λ + R 2λ cρ in probability. Therefore, with large probability, RSS λ RSS Mλ RSS MF + cρ. (A.12) Note that the above results hold uniformly over all λ R. As for the reference tuning parameter λ n, a similar decomposition yields that ( ) 1 RSS λn = RSS MF + O P. (A.13) n This, together with (A.12) and the definition of BIC λ, implies that ( ) P inf BIC λ > BIC λn 1 as n. λ R (A.14) Case 2: Overfitted model. Consider an arbitrary λ R +. By definition, RSS λ RSS Mλ. Further, by (C5), we can show that ( 1 ) RSS Mλ RSS MF = O P. n It follows that log RSS λ log RSS MF log RSS Mλ log RSS MF = log Similarly, we obtain = RSS M λ RSS MF RSS MF log RSS λn log RSS MF Then we have, with large probability, ( 1 + RSS M λ RSS ) MF RSS MF ( 1 ) {1 + o P (1)} = O P. n = O P ( 1 n inf BIC λ BIC λn = inf log RSS λ log RSS λn + (df λ df λn ) log n λ R + λ R + n ). inf λ R + log RSS λ log RSS λn + log n n O P ( 1 n It thus follows that ( ) P inf BIC λ > BIC λn 1 as n. λ R + The results of Cases 1 and 2 complete the proof. ) + log n n. (A.15)

References

Chen, X., Zou, C. and Cook, R. D. (2010). Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Statist. 38,
Cook, R. D. (1996). Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91,
Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. Ann. Statist. 32,
Cook, R. D. and Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. J. Amer. Statist. Assoc. 86,
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32,
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96,
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics 35,
Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1,
Huang, J., Ma, S., Xie, H. and Zhang, C. (2009). A group bridge approach for variable selection. Biometrika 96,
Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18,
Li, B. and Dong, Y. (2009). Dimension reduction for nonelliptically distributed predictors. Ann. Statist. 37,
Li, L. (2007). Sparse sufficient dimension reduction. Biometrika 94,
Li, L., Cook, R. D. and Nachtsheim, C. J. (2005). Model-free variable selection. J. Roy. Statist. Soc. Ser. B 67,
Li, K. C. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86,
Silverman, B. W. (1999). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wang, H., Li, R. and Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94,
Wang, H. and Xia, Y. (2008). Sliced regression for dimension reduction. J. Amer. Statist. Assoc. 103,
Wang, Q. and Yin, X. (2008). A nonlinear multi-dimensional variable selection method for high dimensional data: Sparse MAVE. Comp. Statist. Data Anal. 52,
Xia, Y. (2006). Asymptotic distributions of two estimators of the single-index model. Econometric Theory 22,
Xia, Y., Tong, H., Li, W. K. and Zhu, L. (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B 64,
Ye, Z. and Weiss, R. E. (2003). Using the bootstrap to select one of a new class of dimension reduction methods. J. Amer. Statist. Assoc. 98,
Zhou, J. and He, X. (2008). Dimension reduction based on constrained canonical correlation and variable filtering. Ann. Statist. 36,
Zhu, L. X. and Ng, K. W. (1995). Asymptotics for sliced inverse regression. Statist. Sinica 5,

Zhu, Y. and Zeng, P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Amer. Statist. Assoc. 101,

Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, P. R. China.

School of Finance and Statistics, East China Normal University, Shanghai, 200241, P. R. China.

Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, P. R. China.

(Received November 2011; accepted May 2012)


More information

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA

The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA The Adaptive Lasso and Its Oracle Properties Hui Zou (2006), JASA Presented by Dongjun Chung March 12, 2010 Introduction Definition Oracle Properties Computations Relationship: Nonnegative Garrote Extensions:

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Nonnegative Garrote Component Selection in Functional ANOVA Models

Nonnegative Garrote Component Selection in Functional ANOVA Models Nonnegative Garrote Component Selection in Functional ANOVA Models Ming Yuan School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 3033-005 Email: myuan@isye.gatech.edu

More information

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12 STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Regression Graphics. R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108

Regression Graphics. R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Regression Graphics R. D. Cook Department of Applied Statistics University of Minnesota St. Paul, MN 55108 Abstract This article, which is based on an Interface tutorial, presents an overview of regression

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Sufficient Dimension Reduction for Longitudinally Measured Predictors

Sufficient Dimension Reduction for Longitudinally Measured Predictors Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver

More information

A review on Sliced Inverse Regression

A review on Sliced Inverse Regression A review on Sliced Inverse Regression Kevin Li To cite this version: Kevin Li. A review on Sliced Inverse Regression. 2013. HAL Id: hal-00803698 https://hal.archives-ouvertes.fr/hal-00803698v1

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

A Selective Review of Sufficient Dimension Reduction

A Selective Review of Sufficient Dimension Reduction A Selective Review of Sufficient Dimension Reduction Lexin Li Department of Statistics North Carolina State University Lexin Li (NCSU) Sufficient Dimension Reduction 1 / 19 Outline 1 General Framework

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS

MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS Statistica Sinica 23 (2013), 901-927 doi:http://dx.doi.org/10.5705/ss.2011.058 MODEL SELECTION FOR CORRELATED DATA WITH DIVERGING NUMBER OF PARAMETERS Hyunkeun Cho and Annie Qu University of Illinois at

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

Sufficient Dimension Reduction using Support Vector Machine and it s variants

Sufficient Dimension Reduction using Support Vector Machine and it s variants Sufficient Dimension Reduction using Support Vector Machine and it s variants Andreas Artemiou School of Mathematics, Cardiff University @AG DANK/BCS Meeting 2013 SDR PSVM Real Data Current Research and

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

Ensemble estimation and variable selection with semiparametric regression models

Ensemble estimation and variable selection with semiparametric regression models Ensemble estimation and variable selection with semiparametric regression models Sunyoung Shin Department of Mathematical Sciences University of Texas at Dallas Joint work with Jason Fine, Yufeng Liu,

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Regression Shrinkage and Selection via the Lasso

Regression Shrinkage and Selection via the Lasso Regression Shrinkage and Selection via the Lasso ROBERT TIBSHIRANI, 1996 Presenter: Guiyun Feng April 27 () 1 / 20 Motivation Estimation in Linear Models: y = β T x + ɛ. data (x i, y i ), i = 1, 2,...,

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Least Angle Regression, Forward Stagewise and the Lasso

Least Angle Regression, Forward Stagewise and the Lasso January 2005 Rob Tibshirani, Stanford 1 Least Angle Regression, Forward Stagewise and the Lasso Brad Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani Stanford University Annals of Statistics,

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Minimax design criterion for fractional factorial designs

Minimax design criterion for fractional factorial designs Ann Inst Stat Math 205 67:673 685 DOI 0.007/s0463-04-0470-0 Minimax design criterion for fractional factorial designs Yue Yin Julie Zhou Received: 2 November 203 / Revised: 5 March 204 / Published online:

More information

Analysis of Greedy Algorithms

Analysis of Greedy Algorithms Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work

More information

Asymptotic inference for a nonstationary double ar(1) model

Asymptotic inference for a nonstationary double ar(1) model Asymptotic inference for a nonstationary double ar() model By SHIQING LING and DONG LI Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong maling@ust.hk malidong@ust.hk

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint

On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint On Algorithms for Solving Least Squares Problems under an L 1 Penalty or an L 1 Constraint B.A. Turlach School of Mathematics and Statistics (M19) The University of Western Australia 35 Stirling Highway,

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Sparse orthogonal factor analysis

Sparse orthogonal factor analysis Sparse orthogonal factor analysis Kohei Adachi and Nickolay T. Trendafilov Abstract A sparse orthogonal factor analysis procedure is proposed for estimating the optimal solution with sparse loadings. In

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

STANDARDIZATION AND THE GROUP LASSO PENALTY

STANDARDIZATION AND THE GROUP LASSO PENALTY Statistica Sinica (0), 983-00 doi:http://dx.doi.org/0.5705/ss.0.075 STANDARDIZATION AND THE GROUP LASSO PENALTY Noah Simon and Robert Tibshirani Stanford University Abstract: We re-examine the original

More information

FUNCTIONAL ADDITIVE REGRESSION. University of Southern California

FUNCTIONAL ADDITIVE REGRESSION. University of Southern California Submitted to the Annals of Statistics FUNCTIONAL ADDITIVE REGRESSION By Yingying Fan and Gareth M. James University of Southern California and Peter Radchenko We suggest a new method, called Functional

More information

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS

A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Statistica Sinica 26 (2016), 1117-1128 doi:http://dx.doi.org/10.5705/ss.202015.0240 A CENTRAL LIMIT THEOREM FOR NESTED OR SLICED LATIN HYPERCUBE DESIGNS Xu He and Peter Z. G. Qian Chinese Academy of Sciences

More information