Accepted author version posted online: 12 Aug 2015.

To cite this article: Ngai Hang Chan, Ching-Kang Ing, Yuanbo Li & Chun Yip Yau (2015): Threshold Estimation via Group Orthogonal Greedy Algorithm, Journal of Business & Economic Statistics.

Threshold Estimation via Group Orthogonal Greedy Algorithm

Ngai Hang Chan 1,2, Ching-Kang Ing 3,4, Yuanbo Li 2 and Chun Yip Yau 2

Southwestern University of Finance and Economics 1, Chinese University of Hong Kong 2, Academia Sinica 3 and National Taiwan University 4

Abstract

The threshold autoregressive (TAR) model is an important class of nonlinear time series models that possess many desirable features, such as asymmetric limit cycles and amplitude-dependent frequencies. Statistical inference for the TAR model, however, encounters a major difficulty in the estimation of the thresholds. This paper develops an efficient procedure to estimate the thresholds. The procedure first transforms multiple-threshold detection into a regression variable selection problem, and then employs a group orthogonal greedy algorithm to obtain the threshold estimates. Desirable theoretical results are derived to support the proposed methodology. Simulation experiments are conducted to illustrate the empirical performance of the method, and an application to U.S. GNP data is investigated.

Keywords: high-dimensional regression, information criteria, multiple-regime, multiple-threshold, nonlinear time series.

1 Introduction

The $(m+1)$-regime threshold autoregressive (TAR) model, defined by
$$ Y_t = \phi_0^{(j)} + \sum_{i=1}^{p} \phi_i^{(j)} Y_{t-i} + \sigma_j \eta_t, \quad \text{for } r_{j-1} < Y_{t-d} \le r_j, \tag{1.1} $$
where $j = 1, \ldots, m+1$ and $\eta_t \sim \mathrm{IID}(0,1)$, asserts that the autoregressive structure of the process $\{Y_t\}$ is governed by the range of values of $Y_{t-d}$. The integer-valued quantity $d$ is called the delay parameter and the parameters $-\infty = r_0 < r_1 < \cdots < r_m < r_{m+1} = \infty$ are known as the thresholds. The regime-changing autoregressive structure in (1.1) offers a simple and convenient means of interpretation. More importantly, realizations of this model capture important features observed in many empirical studies that cannot be explained by linear time series models, e.g., asymmetric limit cycles, amplitude-dependent frequencies, jump resonance and chaos, among others. As a result, the TAR model has attracted enormous attention in the literature across different fields such as finance, econometrics and hydrology since the seminal paper of Tong (1978). See, for example, the excellent surveys on TAR models by Tong (1990, 2011), and the theoretical background on inference for TAR models in Chan (1993), Li and Ling (2012) and Li et al. (2015).

In spite of the well-developed asymptotic theory of estimation, estimation of TAR models incurs a high computational cost due to the irregular nature of the threshold parameters; see, e.g., Li and Ling (2012). In particular, to evaluate loss functions such as the least-squares criterion or the likelihood function explicitly, the thresholds $r_1, \ldots, r_m$ have to be specified a priori. Hence, for an $(m+1)$-regime TAR model, locating the global minimum of any loss function requires a multi-parameter grid search over all possible values of the $m$ threshold parameters, which is computationally infeasible, if not impossible. To circumvent this difficulty, Tsay (1989) develops a transformation that connects (1.1) to a change-point model and suggests a graphical approach to visually determine the number and values of the thresholds. Coakley et al. (2003) use similar techniques to develop an estimation approach based on QR factorizations of matrices. When the number of thresholds $m$ is unknown, Gonzalo and Pitarakis (2002) suggest a sequential estimation

procedure for choosing $m$, under the assumption that all the $\sigma_j$'s are equal. Recently, Chan et al. (2015) reframe the problem in a regression variable selection context and propose a computationally efficient way to solve it by least absolute shrinkage and selection operator (LASSO) estimation.

In this paper, motivated by the connection between the TAR model and the high-dimensional regression model developed by Chan et al. (2015), we explore an alternative procedure to efficiently estimate the number and values of the thresholds. The procedure is based on a group orthogonal greedy algorithm (GOGA) for the high-dimensional regression problem. With the use of the GOGA, the well-known bias problem of the LASSO procedure (e.g., Fan and Li (2001) and Zou (2006)) can be circumvented. Simulation studies demonstrate that the GOGA procedure gives much better empirical results than the LASSO procedure. On the other hand, the asymptotic theory is non-trivial, since the design matrix of the high-dimensional regression model constructed from a TAR model does not satisfy some standard regularity conditions assumed in the literature. In particular, the design matrix has highly correlated columns, and thus the standard restricted eigenvalue condition (e.g., Bickel et al. (2009), Ing and Lai (2011)) is not applicable. Substantial arguments are needed to establish the consistency of the number and locations of the thresholds estimated by the GOGA. Further, the convergence rate of the threshold estimates is also established.

This paper is organized as follows. In Section 2, we review the connection between TAR models and the high-dimensional regression problem. In Section 3, estimation of TAR models with the GOGA is discussed. Simulation studies are given in Section 4. An application to U.S. GNP data is provided in Section 5. Technical details are deferred to the Appendix.

2 TAR Model

In this section, we introduce the connection between the TAR model and a high-dimensional regression model developed in Chan et al. (2015). For notational simplicity, suppose we observe $Y = (Y_{1-d}, \ldots, Y_n)^T$ from model (1.1). Let $(Y_{\pi(1)}, Y_{\pi(2)}, \ldots, Y_{\pi(n)})^T$ be the order statistics

of $(Y_{1-d}, Y_{2-d}, \ldots, Y_{n-d})^T$, from the smallest to the largest. For example, if $Y_5 < Y_3$ are the two smallest observations among $(Y_{1-d}, Y_{2-d}, \ldots, Y_{n-d})^T$, then $\pi(1) = 5$ and $\pi(2) = 3$. Note from (1.1) that $(Y_{\pi(1)}, Y_{\pi(2)}, \ldots, Y_{\pi(n)})^T$ governs the dependence structure of $Y_n^0 \equiv (Y_{\pi(1)+d}, \ldots, Y_{\pi(n)+d})^T$, since the time lag between them is the delay parameter $d$. Denote the error terms corresponding to $Y_n^0$ by $\varepsilon(n) = (\varepsilon_{\pi(1)+d}, \ldots, \varepsilon_{\pi(n)+d})^T$. Define the design matrix $X_n$ as
$$ X_n = \begin{pmatrix} Y_{\pi(1)}^T & 0 & 0 & \cdots & 0 \\ Y_{\pi(2)}^T & Y_{\pi(2)}^T & 0 & \cdots & 0 \\ Y_{\pi(3)}^T & Y_{\pi(3)}^T & Y_{\pi(3)}^T & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Y_{\pi(n)}^T & Y_{\pi(n)}^T & Y_{\pi(n)}^T & \cdots & Y_{\pi(n)}^T \end{pmatrix}, \tag{2.2} $$
where $Y_{\pi(j)}^T = (1, Y_{\pi(j)+d-1}, Y_{\pi(j)+d-2}, \ldots, Y_{\pi(j)+d-p})$ and $p$ is the AR order. Note that we can assume that all the regimes follow the same AR($p$) model, since an AR($p'$) model with $p' < p$ can be regarded as an AR($p$) model with the last few coefficients equal to zero. Note also that the design matrix $X_n$ is of size $n \times n(p+1)$, i.e., the number of variables is greater than the number of observations. For simplicity, we assume that the delay parameter $d$ is known.

Consider the regression model
$$ Y_n^0 = X_n \theta(n) + \varepsilon(n), \tag{2.3} $$
where $\theta(n) \equiv (\theta_{\pi(1)}^T, \theta_{\pi(2)}^T, \ldots, \theta_{\pi(n)}^T)^T$ is the regression coefficient vector. By expanding the quantity $X_n\theta(n)$, it can be shown that (1.1) is equivalent to model (2.3) with $\theta_{\pi(1)} = \phi_1$ and
$$ \theta_{\pi(j)} = \begin{cases} \phi_{k+1} - \phi_k, & \text{if } Y_{\pi(j-1)} \le r_k < Y_{\pi(j)}, \text{ for } k = 1, \ldots, m, \\ 0, & \text{otherwise}, \end{cases} \tag{2.4} $$
where $\phi_k = (\phi_0^{(k)}, \phi_1^{(k)}, \ldots, \phi_p^{(k)})^T$ is the autoregressive parameter vector in the $k$-th segment of the TAR model and $r_1, \ldots, r_m$ are the unknown thresholds.
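As a concrete illustration of this transformation, the following sketch builds the reordered response $Y_n^0$ and the regressor rows $Y_{\pi(j)}^T$ from an observed series. It is only an illustration under stated assumptions: the function name `tar_to_regression`, the use of NumPy, and the array layout are ours and are not part of the original formulation.

```python
import numpy as np

def tar_to_regression(y, p, d):
    """Arrange a series into the reordered response and regressor rows of (2.2)-(2.3).

    y : 1-d array of observations, long enough that every target Y_t has p lags
        and the delay variable Y_{t-d} available.
    p : AR order;  d : delay parameter (assumed known).
    Returns the sorted threshold variable Y_{pi(1)}, ..., Y_{pi(n)}, the response
    Y_n^0, and the row matrix Z whose j-th group X_{n,j} equals Z with rows
    1, ..., j-1 replaced by zeros.
    """
    y = np.asarray(y, dtype=float)
    start = max(p, d)                          # first index with enough history
    t_idx = np.arange(start, len(y))           # positions of the targets Y_t
    thr_var = y[t_idx - d]                     # threshold variable Y_{t-d}
    order = np.argsort(thr_var, kind="stable") # permutation pi(1), ..., pi(n)
    t_sorted = t_idx[order]
    y0 = y[t_sorted]                           # reordered response Y_n^0
    Z = np.column_stack([np.ones(len(t_sorted))] +
                        [y[t_sorted - i] for i in range(1, p + 1)])
    return thr_var[order], y0, Z
```

Note that the full $n \times n(p+1)$ matrix $X_n$ never needs to be formed explicitly; the zero pattern of each group $X_{n,j}$ is implied by its index $j$.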

The basic idea of model (2.3) is as follows. The $j$-th block $\theta_{\pi(j)}$ of the coefficient vector $\theta(n)$ indicates a change in the autoregressive parameter when the value of the threshold variable increases from $Y_{\pi(j-1)}$ to $Y_{\pi(j)}$. Thus, if a threshold lies within the interval $[Y_{\pi(j-1)}, Y_{\pi(j)}]$, then $\theta_{\pi(j)}$ is non-zero and it follows that the corresponding observations $Y_{\pi(j-1)+d}$ and $Y_{\pi(j)+d}$ are generated from different autoregressive processes. Otherwise, if there is no threshold within the interval $[Y_{\pi(j-1)}, Y_{\pi(j)}]$, then $\theta_{\pi(j)}$ is zero and $Y_{\pi(j-1)+d}$ and $Y_{\pi(j)+d}$ are in the same regime. Since only $m$ of the vectors $\theta_{\pi(j)}$ in $\theta(n)$ are non-zero, the solution to the high-dimensional regression model (2.3) is sparse.

In Section 3, we develop a three-step procedure using the GOGA to obtain an estimate $\hat{\theta}(n)$ for (2.3). Given the estimate $\hat{\theta}(n)$, the unknown thresholds can be identified by the positions of the non-zero elements in $\hat{\theta}(n)$. In particular, the estimates of the thresholds are given by
$$ A_n = \{Y_{\pi(j-1)} : \hat{\theta}_{\pi(j)} \neq 0,\ j \ge 2\}. \tag{2.5} $$
Thus, $\hat{m} \equiv \#(A_n)$, the cardinality of the set $A_n$, is the estimated number of thresholds. Denote the elements of $A_n$ by $Y_{\pi(\hat{t}_1)}, \ldots, Y_{\pi(\hat{t}_{\hat{m}})}$, so that $Y_{\pi(\hat{t}_k)}$ is the $k$-th estimated threshold. Finally, the autoregressive parameters for the first and the $(k+1)$-th regimes can be estimated by $\hat{\phi}_1 = \hat{\theta}_{\pi(1)}$ and $\hat{\phi}_{k+1} = \sum_{j=1}^{\hat{t}_k} \hat{\theta}_{\pi(j)}$, respectively, where $k = 1, \ldots, \hat{m}$.

Since many computationally efficient procedures for high-dimensional regression have been developed in recent decades (e.g., Efron et al. (2004), Fan and Li (2001), Fan and Lv (2008), Buhlmann et al. (2009), Cho and Fryzlewicz (2012) and Ing and Lai (2011)), estimating the TAR model under the framework of (2.3) can be much more effective than the traditional approach of a full combinatorial search. Chan et al. (2015) proposed a group LASSO procedure to solve (2.3), with the computation conducted efficiently by a group LARS algorithm that approximately solves the group LASSO problem. In the next section we develop a group orthogonal greedy algorithm (GOGA) which yields improved threshold estimates.
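Before turning to the estimation algorithm, we note that once an estimate $\hat{\theta}(n)$ is available, the thresholds and regime parameters can be read off as in (2.5). The sketch below assumes `theta_hat` is an $n \times (p+1)$ array whose rows estimate $\theta_{\pi(1)}, \ldots, \theta_{\pi(n)}$ and `sorted_thr` holds $Y_{\pi(1)}, \ldots, Y_{\pi(n)}$; the way the increments are accumulated into regime parameters is one natural reading of the description above, and the function name is ours.

```python
import numpy as np

def recover_thresholds(theta_hat, sorted_thr, tol=1e-10):
    """Read thresholds and regime AR parameters off a sparse group estimate (cf. (2.5)).

    Row i of theta_hat (0-based) estimates theta_{pi(i+1)} in (2.3)-(2.4);
    sorted_thr[i] is Y_{pi(i+1)}.
    """
    change_rows = [i for i in range(1, theta_hat.shape[0])
                   if np.linalg.norm(theta_hat[i]) > tol]    # rows j >= 2 with nonzero group
    thresholds = [sorted_thr[i - 1] for i in change_rows]     # Y_{pi(j-1)} at each change
    # accumulate the increments to obtain the AR parameters of successive regimes
    phi = [theta_hat[0].copy()]                               # first regime: theta_{pi(1)}
    for i in change_rows:
        phi.append(phi[-1] + theta_hat[i])
    return thresholds, phi
```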

3 Three-step procedure: GOGA+HDIC+Trim

In this section, we introduce a three-step procedure, originally proposed by Ing and Lai (2011), to estimate the high-dimensional regression model (2.3). Some modifications are introduced for the current setting. The first step employs the group orthogonal greedy algorithm (GOGA) to build a solution path that includes one additional group of variables at each iteration. The second and third steps incorporate a high-dimensional information criterion (HDIC) to consistently select the relevant groups of variables on the solution path.

3.1 Group Orthogonal Greedy Algorithm

Let $X_{n,j} = (0, \ldots, 0, Y_{\pi(j)}, Y_{\pi(j+1)}, \ldots, Y_{\pi(n)})^T$ for $j = 1, \ldots, n$. From (2.2), the design matrix can be expressed as $X_n = (X_{n,1}, X_{n,2}, \ldots, X_{n,n})$. Replacing $Y_n^0$ by $Y_n^0 - \bar{Y}_n^0$ and $X_{n,i}$ by $X_{n,i} - \bar{X}_{n,i}$, for $i = 1, \ldots, n$, where $\bar{Y}_n^0 = n^{-1}\mathbf{1}\mathbf{1}^T Y_n^0$ and $\bar{X}_{n,i} = n^{-1}\mathbf{1}\mathbf{1}^T X_{n,i}$, we can assume that $Y_n^0$ and the $X_{n,i}$ have mean zero. Here $\mathbf{1}$ is an $n$-dimensional vector with all elements equal to 1. The GOGA is described as follows:

Group Orthogonal Greedy Algorithm (GOGA)

1. Initialize $k = 0$ with fitted value $\hat{Y}^{(0)} = 0$ and active set $\hat{J}_0 = \emptyset$.

2. Compute the current residuals $U^{(k)} := Y_n^0 - \hat{Y}^{(k)}$. Regress $U^{(k)}$ against each of $X_{n,j}$ for $j = 1, \ldots, n$ and obtain the most correlated group
$$ \hat{j}_{k+1} = \arg\min_{1 \le j \le n} \big\| U^{(k)} - X_{n,j}(X_{n,j}^T X_{n,j})^{-1} X_{n,j}^T U^{(k)} \big\|^2. \tag{3.6} $$
Include the $\hat{j}_{k+1}$-th group to update the active set as $\hat{J}_{k+1} = \{\hat{j}_1, \ldots, \hat{j}_k, \hat{j}_{k+1}\}$.

3. Transform the matrix $X_{n,\hat{j}_{k+1}}$ into the orthogonalized version $X_{n,\hat{j}_{k+1}}^{\perp}$ as follows:

(a) Denote $H_{\{j\}}^{\perp} = X_{n,j}^{\perp}(X_{n,j}^{\perp T} X_{n,j}^{\perp})^{-1} X_{n,j}^{\perp T}$. The first column of $X_{n,\hat{j}_{k+1}}^{\perp}$, denoted as $X_{n,\hat{j}_{k+1},1}^{\perp}$, is

computed by
$$ X_{n,\hat{j}_{k+1},1}^{\perp} = X_{n,\hat{j}_{k+1},1} - \sum_{i=1}^{k} H_{\{\hat{j}_i\}}^{\perp} X_{n,\hat{j}_{k+1},1}, \tag{3.7} $$
i.e., the residual of projecting the first column of $X_{n,\hat{j}_{k+1}}$ onto the space spanned by $\{X_{n,\hat{j}_i}^{\perp}\}_{i=1,\ldots,k}$.

(b) For $l = 2, \ldots, p+1$, the $l$-th column of $X_{n,\hat{j}_{k+1}}^{\perp}$, $X_{n,\hat{j}_{k+1},l}^{\perp}$, is computed by
$$ X_{n,\hat{j}_{k+1},l}^{\perp} = X_{n,\hat{j}_{k+1},l} - \sum_{i=1}^{k} H_{\{\hat{j}_i\}}^{\perp} X_{n,\hat{j}_{k+1},l} - \sum_{u=1}^{l-1} H_{\{\hat{j}_{k+1},u\}}^{\perp} X_{n,\hat{j}_{k+1},l}. \tag{3.8} $$
Thus, the $l$-th column $X_{n,\hat{j}_{k+1},l}^{\perp}$ is orthogonal to $X_{n,\hat{j}_1}^{\perp}, X_{n,\hat{j}_2}^{\perp}, \ldots, X_{n,\hat{j}_k}^{\perp}$ and to $X_{n,\hat{j}_{k+1},1}^{\perp}, \ldots, X_{n,\hat{j}_{k+1},l-1}^{\perp}$.

4. Update the fitted value $\hat{Y}^{(k+1)}$ by
$$ \hat{Y}^{(k+1)} = \hat{Y}^{(k)} + H_{\{\hat{j}_{k+1}\}}^{\perp} U^{(k)}. \tag{3.9} $$

5. Repeat Steps 2-4 until iteration $k = K_n$, where $K_n$ is a pre-specified upper bound. The resulting set of selected variables is given by $\hat{J}_{K_n} = \{\hat{j}_1, \ldots, \hat{j}_{K_n}\}$.

There are two advantages to obtaining the orthogonalized predictors $X_{n,i}^{\perp}$ sequentially. First, the difficulty of inverting high-dimensional matrices in the usual implementation of ordinary least squares can be circumvented by componentwise linear regression. Second, the algorithm is more efficient since the same group cannot be selected repeatedly. If each group contains only one variable, then the GOGA reduces to the ordinary orthogonal greedy algorithm (OGA). Note that the most correlated group in (3.6) can be rewritten as
$$ \hat{j}_{k+1} = \arg\min_{1 \le j \le n} (1 - r_j^2), \tag{3.10} $$
where $r_j$ is the correlation coefficient between $X_{n,j}$ and the current residual $U^{(k)}$. In other words, the OGA chooses the predictor that is most correlated with $U^{(k)}$ at the $k$-th step. This is similar to the LARS algorithm (Efron et al. (2004)). However, the LARS algorithm selects the best predictor that has as much correlation with the current residual as the best predictor at the next iteration does.
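To fix ideas, a minimal sketch of this first step is given below. It is not the authors' implementation: the function name, the use of NumPy, and the decision to update the fit by a joint least-squares refit on the selected groups (which yields the same fitted values as the sequential orthogonalization in Steps 3-4, at a higher computational cost) are our own choices.

```python
import numpy as np

def goga_path(y0, groups, K):
    """Group orthogonal greedy algorithm: return the groups in their selection order.

    y0     : centered response vector of length n.
    groups : list of centered n x (p+1) group matrices X_{n,1}, ..., X_{n,n}.
    K      : number of iterations K_n.
    """
    selected, fitted = [], np.zeros_like(y0)
    for _ in range(K):
        resid = y0 - fitted
        scores = []
        for j, Xj in enumerate(groups):
            if j in selected:                   # a selected group cannot be chosen again
                scores.append(np.inf)
                continue
            coef, *_ = np.linalg.lstsq(Xj, resid, rcond=None)
            scores.append(np.sum((resid - Xj @ coef) ** 2))   # criterion (3.6)
        j_best = int(np.argmin(scores))
        selected.append(j_best)
        X_sel = np.hstack([groups[j] for j in selected])      # refit on all selected groups
        coef, *_ = np.linalg.lstsq(X_sel, y0, rcond=None)
        fitted = X_sel @ coef
    return selected
```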

In the high-dimensional regression model (2.3), the design matrix in (2.2) has a very special structure: adjacent groups are highly correlated since they differ by only one entry. As a consequence, the LARS algorithm tends to select many irrelevant groups around each important group. Although a second step involving an information criterion can be used to exclude these irrelevant groups (Chan et al. (2015)), the effect is not satisfactory because it is difficult to distinguish the truly relevant groups. The results can be even worse when the number of thresholds is large. On the other hand, the greedy manner of the GOGA avoids including too many irrelevant groups in the active set if the number of iterations is not large. After a group $X_{n,j}$ is selected, the groups near $X_{n,j}$ are not easily selected, because their contributions to further reducing the current residuals are relatively small compared to the groups that are far away from $X_{n,j}$.

Ing and Lai (2011) first used the OGA to solve a general high-dimensional regression problem. They establish the sure screening property (i.e., including all relevant variables with probability approaching 1) of the OGA with a relatively small number of iterations, $K_n = O((n/\log p)^{1/2})$, where $n$ is the sample size and $p$ is the number of variables. Their extensive simulation studies indicate that the OGA has better performance than other penalization methods such as SIS-SCAD, ISIS-SCAD and the adaptive LASSO. In the theoretical results of Ing and Lai (2011), it is assumed that the regression predictors are not highly correlated with each other, which is reasonable in a practical variable selection context. Since the design matrix $X_n$ in (2.3) has the special property that adjacent groups are highly correlated, the assumption in Ing and Lai (2011) is not guaranteed, however. Indeed, as adjacent columns in $X_n$ differ by only one entry, they cannot be distinguished asymptotically, and thus the usual consistency of model selection does not hold. Therefore, the theoretical properties of the solution of (2.3) are different from those in Ing and Lai (2011) and should incorporate information from TAR models. To establish the asymptotic theory, we impose the following assumptions, which are similar to Chan and Tsay (1998) and Chan et al. (2015):

Assumptions.

A1: $\{\eta_t\}$ has a bounded, continuous and positive density function and $E|\eta_1|^{2+\iota} < \infty$ for some $\iota > 0$.

A2: $\{Y_t\}$ is an $\alpha$-mixing stationary process with a geometrically decaying mixing rate and $E|Y_t|^{2+\iota} < \infty$.

A3: $\min_{2 \le i \le m_0+1} \|\phi_i^0 - \phi_{i-1}^0\| > \nu$ for some $\nu > 0$, where $\phi_i^0$ is the true AR parameter vector in the $i$-th regime, $i = 1, \ldots, m_0+1$. Also, there exists $Z_i = (1, z_{p-1,i}, \ldots, z_{0,i})^T$, where $z_{p-d,i} = r_{i-1}$, such that $(\phi_{i-1}^0 - \phi_i^0)^T Z_i \neq 0$ for $i = 2, \ldots, m_0+1$.

A4: There exist constants $l, u$ such that $r_i \in [l, u]$ for $i = 1, 2, \ldots, m_0$, where $m_0$ is unknown and not fixed. Furthermore, for some $\gamma_n$ satisfying $\gamma_n/n \to 0$, $m_0\gamma_n/n \to 0$ and $m_0 n \gamma_n^{-1} (\gamma_n)^{-\iota/2} (\log n)^{2(2+\iota)} \to 0$, it holds that $\min_{2 \le i \le m_0} |r_i - r_{i-1}|/(\gamma_n/n) \to \infty$.

The following theorem ensures that the variables selected by the GOGA are close to the true ones, so that consistency of the thresholds $\hat{r}_1, \ldots, \hat{r}_m$ can be obtained.

Theorem 3.1. Suppose that Assumptions A1-A4 hold and let $K_n = O((n/\log n)^{1/2})$. Suppose that the groups $\hat{J}_{K_n} = \{\hat{j}_1, \hat{j}_2, \ldots, \hat{j}_{K_n}\}$ are selected by the GOGA at the end of $K_n$ iterations. Let $J = \{j_1^0, j_2^0, \ldots, j_{m_0}^0\}$ be the index set of the true non-zero elements in $\theta(n)$. Then, as $n \to \infty$,
$$ P\{d_H(\hat{J}_{K_n}, J) \le \gamma_n\} \to 1, \tag{3.11} $$
where $\gamma_n = \log(n)$ and $d_H(A, B)$ is the Hausdorff distance between two sets $A$ and $B$, given by $d_H(A, B) = \max_{b \in B} \min_{a \in A} |b - a|$, with $d_H(A, \emptyset) = d_H(\emptyset, B) = \infty$ if one of the sets is empty. It follows that
$$ P\Big\{d_H(\hat{\mathbf{r}}, \mathbf{r}^0) \le \frac{\gamma_n}{n}\Big\} \to 1, \tag{3.12} $$
where $\hat{\mathbf{r}}$ and $\mathbf{r}^0$ are the sets of estimated and true thresholds, respectively.

3.2 High-dimensional Information Criterion

Note that Theorem 3.1 only ensures that each true threshold can be identified by an estimated threshold up to a $\gamma_n/n$ neighborhood. It is possible that the set $\hat{\mathbf{r}}$ is larger than $\mathbf{r}^0$; thus the consistency of $\hat{m} = \#(\hat{\mathbf{r}})$ for the number of thresholds is not guaranteed. One natural remedy is to select the best subset of $\hat{\mathbf{r}}$ according to some criterion. Based on Ing and Lai (2011), we prescribe the

second and the third steps of the threshold estimation procedure in this subsection. In particular, the second step selects the model along the solution path of the GOGA based on a high-dimensional information criterion (HDIC). The third step further trims the model to eliminate irrelevant variables by the HDIC.

Specifically, for a non-empty subset $J$ of $\{1, \ldots, n\}$, let $\hat{\sigma}_J^2 = n^{-1}\|Y_n^0 - \hat{Y}_{n;J}^0\|^2$, where $\hat{Y}_{n;J}^0$ is the fitted value obtained by projecting $Y_n^0$ onto the space spanned by $\{X_{n,j}\}$, $j \in J$. Define the high-dimensional information criterion (HDIC) by
$$ \mathrm{HDIC}(J) = n \log \hat{\sigma}_J^2 + \#(J)\, \log n\, (\log n - \log\log n), \tag{3.13} $$
$$ \hat{k}_n = \arg\min_{1 \le k \le K_n} \mathrm{HDIC}(\hat{J}_k), \tag{3.14} $$
in which $\hat{J}_k = \{\hat{j}_1, \ldots, \hat{j}_k\}$ contains the first $k$ groups along the GOGA solution path. As in other information criteria, in (3.13), $n \log \hat{\sigma}_J^2$ is the lack-of-fit term and $\#(J)\log n(\log n - \log\log n)$ is the penalty term for model complexity. The HDIC reduces to the usual BIC without the factor $\log n - \log\log n$. Thus, the factor $\log n - \log\log n$ can be interpreted as an additional adjustment for the high-dimensional problem. Note also that the penalty term of the HDIC in Ing and Lai (2011) is $\log n\,\log(n(p+1))$, where $n(p+1)$ is the total number of columns in $X_n$. It is different from the term $\log n(\log n - \log\log n)$ in (3.13) because groups of variables are selected here. In the derivation of the HDIC in Section 4.2 of Ing and Lai (2011), the penalization factor $\log p - \log\log p$ is established, where $p$ is the number of variables. Since the $\log\log p$ term is relatively small, only the $\log p$ term was used in defining their HDIC. However, extensive simulation evidence suggests that keeping the $\log\log p$ term yields good finite-sample performance in the threshold estimation context. Therefore, as there are $n$ groups of variables, we have $p = n$ and the term $\log n - \log\log n$ in our definition of the HDIC. In terms of computational complexity, the second step involves relatively little cost because $\hat{\sigma}_J^2$ can be readily computed in the $k$-th GOGA iteration. Therefore, the HDIC along the GOGA solution path can be obtained efficiently at each step.
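The following sketch evaluates (3.13)-(3.14) along a given GOGA path. It is a schematic illustration only: the function name `hdic_select`, the NumPy least-squares refits, and the interpretation of $\#(J)$ as the number of selected groups are our assumptions. The trimming step described next can reuse the same HDIC computation, removing one selected group at a time.

```python
import numpy as np

def hdic_select(y0, groups, path):
    """Choose the model size along the GOGA path by the HDIC in (3.13)-(3.14).

    y0    : centered response (length n);  groups : list of group matrices;
    path  : group indices in the order chosen by GOGA.
    Returns k_hat and the list of HDIC values for k = 1, ..., len(path).
    """
    n = len(y0)
    penalty = np.log(n) * (np.log(n) - np.log(np.log(n)))   # log n (log n - log log n)
    hdic = []
    for k in range(1, len(path) + 1):
        X = np.hstack([groups[j] for j in path[:k]])
        coef, *_ = np.linalg.lstsq(X, y0, rcond=None)
        sigma2 = np.mean((y0 - X @ coef) ** 2)               # sigma_hat^2 of the fitted model
        hdic.append(n * np.log(sigma2) + k * penalty)        # #(J) taken as number of groups
    k_hat = int(np.argmin(hdic)) + 1
    return k_hat, hdic
```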

Finally, the third step of the threshold estimation procedure trims the solution set $\hat{J}_{\hat{k}_n}$ obtained in the second step to eliminate irrelevant groups as follows. Define
$$ \hat{N}_n = \{\hat{j}_l : \mathrm{HDIC}(\hat{J}_{\hat{k}_n} \setminus \{\hat{j}_l\}) > \mathrm{HDIC}(\hat{J}_{\hat{k}_n}),\ 1 \le l \le \hat{k}_n\} \quad \text{if } \hat{k}_n > 1, \tag{3.15} $$
and $\hat{N}_n = \{\hat{j}_1\}$ if $\hat{k}_n = 1$. The reason for conducting the third step is that the second-step procedure only chooses the best subset of variables along the GOGA solution path. While it guarantees that all relevant variables are selected, irrelevant groups may still be present. Therefore, the third step is required to eliminate the irrelevant variables by the HDIC again. Since $\mathrm{HDIC}(\hat{J}_{\hat{k}_n})$ is already obtained in the preceding step, obtaining $\hat{N}_n$ only requires the computation of $\hat{k}_n - 1$ ordinary least squares regressions. Since $\hat{k}_n$ is usually not large, the third step can be computed quickly.

The following theorem shows that the three-step procedure, GOGA+HDIC+Trim, has the consistency property, i.e., with probability approaching 1, the number of thresholds is correctly estimated and all the true thresholds can be identified up to a neighborhood shrinking to zero. The proof follows from arguments similar to those in Chan et al. (2015) and is thus omitted.

Theorem 3.2. Assume that Assumptions A1-A4 hold. For the index set $\hat{N}_n = \{\hat{j}_1, \hat{j}_2, \ldots, \hat{j}_{\#(\hat{N}_n)}\}$ that satisfies (3.15), there exist a constant $B > 0$ and $\gamma_n = O(\log n)$ such that, as $n \to \infty$,
$$ P\{\#(\hat{N}_n) = m_0\} \to 1 \quad \text{and} \quad P\Big\{\max_{1 \le i \le m_0} |\hat{j}_i - j_i^0| \le B\gamma_n\Big\} \to 1. $$
Equivalently, we have
$$ P\Big\{\max_{1 \le i \le m_0} |\hat{r}_i - r_i^0| \le \frac{B\gamma_n}{n}\Big\} \to 1. $$

4 Simulation Studies

In this section, we explore the finite-sample performance of GOGA+HDIC+Trim. We also consider the two-step procedure proposed by Chan et al. (2015), who use the LASSO to build the solution path. We first compare the performances of the two methods in several simple threshold models

and then demonstrate the advantage of GOGA+HDIC+Trim in more complicated cases containing a larger number of thresholds. In applying GOGA+HDIC+Trim, $K_n = n/\log n$ is employed in the first step.

Example 1

In this example, 1000 realizations are simulated from the model
$$ Y_t = \begin{cases} Y_{t-1} - 0.2 Y_{t-2} + \epsilon_t, & \text{if } y_{t-1} \le -1.5, \\ 1.9 Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } -1.5 < y_{t-1} \le 1.5, \\ Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } 1.5 < y_{t-1}. \end{cases} \tag{4.1} $$
Following the transformation in Section 2, we obtain the response $Y_n^0 \equiv \{y_1^0, y_2^0, \ldots, y_n^0\}$. The original observations $\{Y_t\}$ and the transformed observations $\{y_t^0\}$ are plotted in Figure 1 for a sample of size $n = 600$. Note that the right plot shows a clear regime-switching structure, which suggests that the model may contain two thresholds. We apply GOGA+HDIC+Trim and the LASSO procedure to estimate the thresholds. The results are summarized in Table 1. The percentage (%) of correct estimation of the number of regimes, and the bias and standard error of the threshold estimates, are reported. The bias and standard error are computed for the cases where the estimated number of thresholds equals the true value. From Table 1, the GOGA+HDIC+Trim method has 100% accuracy in detecting the two thresholds and has lower bias and standard errors for the first threshold estimate at all three sample sizes. For the second threshold estimate, GOGA+HDIC+Trim has about three times smaller bias. This is consistent with the finding that LASSO estimation tends to have a larger bias in the regression coefficients.

Example 2

In this example, 1000 realizations are generated from
$$ Y_t = \begin{cases} Y_{t-1} + \epsilon_t, & \text{if } y_{t-1} \le 1, \\ Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } 1 < y_{t-1} \le 2.5, \\ Y_{t-1} - 0.6 Y_{t-2} + \epsilon_t, & \text{if } 2.5 < y_{t-1}. \end{cases} \tag{4.2} $$
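Realizations from models such as (4.1) and (4.2) can be generated by iterating the defining recursion (1.1). The sketch below is a generic simulator; the regime coefficients in the usage line are illustrative placeholders and do not reproduce the exact specifications above, and the function name is ours.

```python
import numpy as np

def simulate_tar(n, regimes, thresholds, d=1, sigma=1.0, burn=200, seed=0):
    """Simulate a multi-regime TAR series of the form (1.1).

    regimes    : list of coefficient tuples (intercept, phi_1, ..., phi_p), one per regime.
    thresholds : increasing threshold values r_1 < ... < r_m separating the regimes.
    The regime of Y_t is determined by the interval into which Y_{t-d} falls.
    """
    rng = np.random.default_rng(seed)
    p = len(regimes[0]) - 1
    y = np.zeros(n + burn)
    for t in range(max(p, d), len(y)):
        j = int(np.searchsorted(thresholds, y[t - d], side="left"))  # regime index
        coef = np.asarray(regimes[j], dtype=float)
        lags = y[t - np.arange(1, p + 1)]                            # Y_{t-1}, ..., Y_{t-p}
        y[t] = coef[0] + coef[1:] @ lags + sigma * rng.standard_normal()
    return y[burn:]

# illustrative three-regime call (placeholder coefficients, not model (4.1)):
y = simulate_tar(600, regimes=[(0.0, 0.5, -0.2), (0.0, -0.4, 0.3), (0.0, 0.3, -0.1)],
                 thresholds=[-1.5, 1.5])
```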

Similar to Figure 1, the transformed variable $Y_n^0$ in Figure 2 also suggests that three regimes exist in the data. Table 2 reports the estimation results of GOGA+HDIC+Trim and the LASSO procedure. In this case, GOGA+HDIC+Trim has a much better performance than the LASSO procedure in terms of both the number and the accuracy of the threshold estimates. In particular, the percentage of correct identification is 100% for GOGA+HDIC+Trim but only about 90% for the LASSO. Also, GOGA+HDIC+Trim attains lower bias and standard errors for both thresholds at all sample sizes.

Example 3

In this example, the time series
$$ Y_t = \begin{cases} 0.8 Y_{t-1} - 0.2 Y_{t-2} + \epsilon_t, & \text{if } y_{t-1} \le -2, \\ 1.9 Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } -2 < y_{t-1} \le 2, \\ 0.6 Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } 2 < y_{t-1}, \end{cases} \tag{4.3} $$
is more difficult to handle because the intercepts in all segments are zero and each segment has a similar structure. In particular, it is observed from the right plot of Figure 3 that the regime-switching pattern is more obscure. The performances of the two methods are summarized in Table 3. Surprisingly, the results of GOGA+HDIC+Trim are still very promising: the number of thresholds is perfectly identified in this difficult situation. In contrast, only about 70% of the LASSO estimates identify the correct number of thresholds. Moreover, GOGA+HDIC+Trim achieves lower bias and standard deviations for all thresholds and at all sample sizes.

Example 4

In this example, 1000 realizations are simulated from the model
$$ Y_t = \begin{cases} Y_{t-1} - 0.5 Y_{t-2} + \epsilon_t, & \text{if } y_{t-1} \le 1, \\ Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } 1 < y_{t-1} \le 2.5, \\ Y_{t-1} - 0.6 Y_{t-2} + Y_{t-3} + \epsilon_t, & \text{if } 2.5 < y_{t-1}. \end{cases} \tag{4.4} $$
Compared to the previous examples, this model involves higher AR orders and conditional heteroscedasticity. The estimation results of GOGA+HDIC+Trim and the LASSO procedure are summarized in Table 4. It can be seen that the percentage of correct identification of GOGA+HDIC+Trim is always larger than that of the LASSO procedure under the three different settings of the sample size $n$. Also, GOGA+HDIC+Trim achieves lower bias and standard deviations in most cases.

Example 5

In this last example, we use the following model to explore the performance of the two methods when the number of thresholds is large:
$$ Y_t = \begin{cases} Y_{t-1} + \epsilon_t, & \text{if } y_{t-1} \le -3.5, \\ Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } -3.5 < y_{t-1} \le -2.5, \\ -0.9 Y_{t-1} + \epsilon_t, & \text{if } -2.5 < y_{t-1} \le -1.5, \\ Y_{t-1} - Y_{t-2} + \epsilon_t, & \text{if } -1.5 < y_{t-1} \le -0.5, \\ Y_{t-1} + \epsilon_t, & \text{if } -0.5 < y_{t-1} \le 0.5, \\ Y_{t-1} + \epsilon_t, & \text{if } 0.5 < y_{t-1} \le 1.5, \\ Y_{t-1} + \epsilon_t, & \text{if } 1.5 < y_{t-1} \le 2.5, \\ Y_{t-1} - 0.2 Y_{t-2} + \epsilon_t, & \text{if } 2.5 < y_{t-1} \le 3.5, \\ Y_{t-1} + \epsilon_t, & \text{if } 3.5 < y_{t-1}. \end{cases} \tag{4.5} $$

The transformed observations are plotted in Figure 5 and the estimation results are reported in Table 5. From Figure 5, the boundaries between the segments are nearly indistinguishable, especially for the last two thresholds. When the sample size is as small as 2000, each segment has only around 200 observations. With such little information, GOGA+HDIC+Trim still has over an 84% chance of detecting the correct number of thresholds, with small bias and standard errors. When the number of observations increases to 3000, GOGA+HDIC+Trim attains nearly 100% detection accuracy and achieves low bias and small standard deviations for all threshold estimates. Again, GOGA+HDIC+Trim outperforms the LASSO procedure in terms of both the number and the accuracy of the threshold estimates. The LASSO procedure is particularly disappointing in this scenario, in view of the fact that it has only 54.7% detection accuracy even for n =. One possible reason for the failure of the LASSO is that too many irrelevant variables around each true threshold are selected in the first step. The maximum number of thresholds $K$ is then not sufficient to include groups around all true thresholds when the number of true thresholds is large; hence the number of thresholds is under-estimated. As discussed in Section 3.1, the GOGA avoids this problem and thus achieves better empirical performance.

5 Applications to Real Data

In this section, we apply the GOGA+HDIC+Trim procedure to a multiple-regime TAR model for the growth rate of the quarterly U.S. real GNP data over the period 1947 to 2012. Given the quarterly GNP data $\{y_t\}_{t=1,\ldots,261}$ from 1947 to the first quarter of 2012, the growth rate is defined as $x_t = 100(\log y_t - \log y_{t-1})$, $t = 2, \ldots, 261$. This data set has been investigated previously by Li and Ling (2012) and Chan et al. (2015).
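The growth-rate transformation above is a simple log-difference; a minimal sketch (assuming `gnp` is a one-dimensional array of the 261 quarterly levels) is:

```python
import numpy as np

def growth_rate(gnp):
    """Quarterly growth rate x_t = 100 * (log y_t - log y_{t-1}), t = 2, ..., 261."""
    gnp = np.asarray(gnp, dtype=float)
    return 100.0 * np.diff(np.log(gnp))
```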

Li and Ling (2012) use a three-regime model encompassing bad, good and normal times, and the two thresholds are 1.20 and 2.43, respectively. The two-step procedure of Chan et al. (2015) identified three thresholds at 1.23, 1.83 and 2.55, respectively. They also show that a four-regime model may give a better fit to the data. Setting $K_n = 6$ and the AR order $p = 11$ in GOGA+HDIC+Trim, the estimated thresholds are 1.23, 1.65 and 2.23, which are quite close to the estimates in Chan et al. (2015). This example not only illustrates the usefulness of GOGA+HDIC+Trim, but also reinforces the notion that a four-regime model may give a better explanation of this GNP data set.

Acknowledgements

We would like to thank an Associate Editor and two anonymous referees for their thoughtful and useful comments, which led to an improved version of this paper. Research supported in part by HKSAR-RGC-GRF Nos: , and HKSAR-RGC-CRF: CityU8/CRG/12G (Chan); Academia Sinica Investigator Award (Ing); HKSAR-RGC-ECS: and HKSAR-RGC-GRF: (Yau). Part of this research was conducted while N.H. Chan was visiting Renmin University of China (RUC) and Southwestern University of Finance and Economics (SWUFE). Research support from the School of Statistics at RUC and SWUFE is gratefully acknowledged.

References

Bickel, P., Ritov, Y. and Tsybakov, A. (2009) Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 4.

Buhlmann, P., Kalisch, M. and Maathuis, M. (2009) Variable selection for high-dimensional models: partially faithful distributions and the PC-simple algorithm. Biometrika, 97.

Chan, K. S. (1993) Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. Annals of Statistics, 21.

Chan, K. S. and Tsay, R. S. (1998) Limiting properties of the least squares estimator of a continuous threshold autoregressive model. Biometrika, 85.

Chan, N. H., Yau, C. Y. and Zhang, R. (2015) Lasso estimation for threshold autoregressive models. Journal of Econometrics, in press.

Cho, H. and Fryzlewicz, P. (2012) High dimensional variable selection via tilting. Journal of the Royal Statistical Society, Series B, 74.

Coakley, J., Fuertes, A.-M. and Pérez, M.-T. (2003) Numerical issues in threshold autoregressive modeling of time series. Journal of Economic Dynamics and Control, 27.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Annals of Statistics, 32.

Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96.

Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B, 70.

Gonzalo, J. and Pitarakis, J.-Y. (2002) Estimation and model selection based inference in single and multiple threshold models. Journal of Econometrics, 110.

Ing, C. K. and Lai, T. L. (2011) A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica, 21.

Li, D. and Ling, S. (2012) On the least squares estimation of multiple-regime threshold autoregressive models. Journal of Econometrics, 167.

Li, D., Ling, S. and Zhang, R. (2015) On a threshold double autoregressive model. Journal of Business and Economic Statistics, in press.

Temlyakov, V. (2000) Weak greedy algorithms. Advances in Computational Mathematics, 12.

Tong, H. (1978) On a threshold model. In Pattern Recognition and Signal Processing, NATO ASI Series E: Applied Sc. (29) (ed. C. Chen), Oxford University Press, Oxford.

Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford University Press, Oxford.

Tong, H. (2011) Threshold models in time series analysis: 30 years on. Statistics and its Interface, 4.

Tsay, R. S. (1989) Testing and modeling threshold autoregressive processes. Journal of the American Statistical Association, 84.

Zou, H. (2006) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101.

Appendix: Proof of Theorems

We first define two types of group orthogonal greedy algorithms, the population GOGA and its generalized version, which are crucial to the study of the sample version of the GOGA introduced in Section 3.1.

For notational simplicity, we denote $X_{n,j}$ by $X_j$. Suppose the true model is $u = \sum_{j=1}^{n} X_j b_j$. Let $H_J$ be the projection matrix associated with the linear space spanned by $X_j$, $j \in J \subseteq \{1, \ldots, n\}$. Since
$$ \frac{\big\|u - X_j(X_j^T X_j)^{-1} X_j^T u\big\|^2}{\|u\|^2} = 1 - \frac{u^T H_{\{j\}} u}{\|u\|^2}, $$
the criterion (3.6) is equivalent to
$$ \hat{j}_{k+1} = \arg\max_{1 \le j \le n} U^{(k)T} H_{\{j\}} U^{(k)}. $$
This representation will be used in the following proof. First, we introduce the population GOGA as follows. Let $U^{(0)} = u$, $j_1 = \arg\max_{1 \le j \le n} U^{(0)T} H_{\{j\}} U^{(0)}$ and $U^{(1)} = (I - H_{\{j_1\}}) u$. At the $m$-th iteration, we have $j_m = \arg\max_{1 \le j \le n} U^{(m-1)T} H_{\{j\}} U^{(m-1)}$, $U^{(m)} = (I - H_{\{j_1, \ldots, j_m\}}) u$, and the active set $J_m = \{j_1, \ldots, j_m\}$. The population GOGA approximates $u$ by
$$ H_{J_m} u. \tag{A.1} $$
The generalization of the population GOGA considers an additional parameter $0 \le \xi \le 1$. At the $i$-th step, instead of (A.1), the generalized population GOGA replaces $j_i$ by $j_{i,\xi}$, where $j_{i,\xi}$ is any $1 \le l \le n$ satisfying
$$ U^{(i-1)T} H_{\{l\}} U^{(i-1)} \ge \xi \max_{1 \le j \le n} U^{(i-1)T} H_{\{j\}} U^{(i-1)}. \tag{A.2} $$
Without loss of generality, we assume that each group $X_j$ contains only two variables. Thus,

the matrix $X_n$ in (2.2) can be written as
$$ X_n = \begin{pmatrix} a_1 & b_1 & 0 & 0 & \cdots & 0 & 0 \\ a_2 & b_2 & a_2 & b_2 & \cdots & 0 & 0 \\ a_3 & b_3 & a_3 & b_3 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n-1} & b_{n-1} & a_{n-1} & b_{n-1} & \cdots & 0 & 0 \\ a_n & b_n & a_n & b_n & \cdots & a_n & b_n \end{pmatrix}, \tag{A.3} $$
where $X_j$ is the submatrix containing the $(2j-1)$-th and $(2j)$-th columns of $X_n$. Now we prove an inequality for the generalized population GOGA.

Lemma A.1. Let $0 \le \xi \le 1$, $m \ge 1$, and let $J_{m,\xi} = \{j_{1,\xi}, j_{2,\xi}, \ldots, j_{m,\xi}\}$ be the active set selected by the generalized population GOGA at the end of the $m$-th iteration. Then,
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \Big(\sum_{j=1}^{n} \|a_j\|\Big)^2 (1 + m\xi)^{-1}, $$
where $a_j = b_j^T X_j^T X_j A_j$ and $A_j A_j = (X_j^T X_j)^{-1}$.

Proof. For $J \subseteq \{1, \ldots, n\}$, $i \in \{1, \ldots, n\}$ and $m \ge 1$, define $v_{J,i} = n^{-1/2} A_i X_i^T (I - H_J) u$. Note that
$$ \begin{aligned} \big\|(I - H_{J_{m,\xi}}) u\big\|^2 &\le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - \big\|X_{j_{m,\xi}}(X_{j_{m,\xi}}^T X_{j_{m,\xi}})^{-1} X_{j_{m,\xi}}^T (I - H_{J_{m-1,\xi}}) u\big\|^2 \\ &= \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n \|v_{J_{m-1,\xi}, j_{m,\xi}}\|^2 \\ &\le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \max_{1 \le j \le n} \|v_{J_{m-1,\xi}, j}\|^2. \end{aligned} \tag{A.4} $$

Note that for $i, j \in \{1, \ldots, n\}$,
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \|v_{J_{m-1,\xi},i}\|^2, $$
and
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \|v_{J_{m-1,\xi},j}\|^2. $$
Therefore,
$$ \begin{aligned} \Big(\big\|(I - H_{J_{m,\xi}}) u\big\|^2\Big)^2 &\le \Big(\big\|(I - H_{J_{m-1,\xi}}) u\big\|^2\Big)^2 + (n\xi)^2 \|v_{J_{m-1,\xi},i}\|^2 \|v_{J_{m-1,\xi},j}\|^2 \\ &\quad - \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2\, n\xi \big(\|v_{J_{m-1,\xi},i}\|^2 + \|v_{J_{m-1,\xi},j}\|^2\big) \\ &\le \Big(\big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \|v_{J_{m-1,\xi},i}\| \|v_{J_{m-1,\xi},j}\|\Big)^2. \end{aligned} $$
Thus, we have
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \|v_{J_{m-1,\xi},i}\| \|v_{J_{m-1,\xi},j}\|. $$
Since this is true for all $i, j \in \{1, \ldots, n\}$,
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 - n\xi \max_{1 \le i, j \le n} \|v_{J_{m-1,\xi},i}\| \|v_{J_{m-1,\xi},j}\|. \tag{A.5} $$
On the other hand,
$$ \begin{aligned} \Big(\big\|(I - H_{J_{m-1,\xi}}) u\big\|^2\Big)^2 &= \Big(\sum_{j=1}^{n} b_j^T X_j^T (I - H_{J_{m-1,\xi}}) u\Big)^2 = n \Big(\sum_{j=1}^{n} a_j v_{J_{m-1,\xi},j}\Big)^2 \\ &= n \sum_{i,j=1}^{n} \big(a_i v_{J_{m-1,\xi},i}\big) \big(a_j v_{J_{m-1,\xi},j}\big) \\ &\le n \max_{1 \le i, j \le n} \|v_{J_{m-1,\xi},i}\| \|v_{J_{m-1,\xi},j}\| \Big(\sum_{j=1}^{n} \|a_j\|\Big)^2. \end{aligned} \tag{A.6} $$

It follows from (A.5) and (A.6) that
$$ \big\|(I - H_{J_{m,\xi}}) u\big\|^2 \le \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 \Bigg[ 1 - \frac{\xi}{\big(\sum_{j=1}^{n} \|a_j\|\big)^2}\, \big\|(I - H_{J_{m-1,\xi}}) u\big\|^2 \Bigg]. \tag{A.7} $$
Combining (A.7) with Lemma 3.1 of Temlyakov (2000) yields the desired conclusion.

Lemma A.2. For the matrix $X_n$ defined in (A.3), denote by $X_{j,k}$ ($1 \le j \le n$, $k = 1, 2$) the $k$-th column of the $j$-th group. Then there exists $M > 0$ such that
$$ \max_{1 \le \#(J) \le K_n,\, i \notin J,\, k = 1, 2} \big\|(X_J^T X_J)^{-1} X_J^T X_{i,k}\big\|_1 < M \quad \text{a.s. as } n \to \infty. $$

Proof. For notational simplicity, let $J = \{1, 2, \ldots, p\}$ and $i = p + s$, where $p < K_n$. That is, we select the first $p$ groups of $X_n$ as $J$ and consider $X_{p+s} = (X_{p+s,1}, X_{p+s,2})$, which is $s$ groups away from $J$. Since the best predictor of $X_{p+s}$ in terms of $X_J$ is through $X_p$, the regression coefficients of all variables except $X_p$ are zero. Thus, we only need to calculate $(X_p^T X_p)^{-1} X_p^T X_{p+s}$ instead of the complicated inverse $(X_J^T X_J)^{-1}$. Therefore,
$$ (X_J^T X_J)^{-1} X_J^T X_{p+s} = \begin{pmatrix} 0 & 0 \\ \vdots & \vdots \\ 0 & 0 \\ m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}, \tag{A.8} $$

where
$$ \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = \begin{pmatrix} \sum_{j=p}^{n} a_j^2 & \sum_{j=p}^{n} a_j b_j \\ \sum_{j=p}^{n} a_j b_j & \sum_{j=p}^{n} b_j^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{j=p+s}^{n} a_j^2 & \sum_{j=p+s}^{n} a_j b_j \\ \sum_{j=p+s}^{n} a_j b_j & \sum_{j=p+s}^{n} b_j^2 \end{pmatrix}. $$
Let $q_j = (a_j, b_j)^T$ and
$$ Q = \begin{pmatrix} \sum_{j=p}^{n} a_j^2 & \sum_{j=p}^{n} a_j b_j \\ \sum_{j=p}^{n} a_j b_j & \sum_{j=p}^{n} b_j^2 \end{pmatrix}. $$
Let $\Sigma = \lim_{n \to \infty} n^{-1} Q$ be the population covariance matrix. Then it follows that, for sufficiently large $n$,
$$ Q^{-1}\Big(Q - \sum_{j=p}^{p+s-1} q_j q_j^T\Big) = I - Q^{-1} \sum_{j=p}^{p+s-1} q_j q_j^T = I - n^{-1}\big(\Sigma^{-1} + O_p(1)\big) \sum_{j=p}^{p+s-1} q_j q_j^T. \tag{A.9} $$
Combining (A.9), $s \le n$ and the ergodicity of the TAR model, we have that $m_{ij}$, $i, j = 1, 2$, are bounded by a constant almost surely. The other cases can be proved similarly. Thus, the proof of Lemma A.2 is complete.

Proof of Theorem 3.1. For a given $0 \le \xi \le 1$, let $\xi^* = 2/(1 - \xi)$. Define $v_{J,i} = n^{-1/2} A_i X_i^T (I - H_J) u$ and $\hat{v}_{J,i} = n^{-1/2} A_i X_i^T (I - H_J) y$, where $y = \sum_{j=1}^{n} X_j b_j + \epsilon$, and the white noise $\epsilon_t = \sigma_j \eta_t$ if $Y_{t-d}$ belongs to the $j$-th regime. Define also $\sigma_U = \max\{\sigma_1, \sigma_2, \ldots, \sigma_{m_0+1}\}$. Construct the sets $A_n$ and $B_n$ as
$$ A_n = \Big\{ \max_{(J,i):\, \#(J) \le m-1,\, i \notin J} \big|\hat{v}_{J,i}^T \hat{v}_{J,i} - v_{J,i}^T v_{J,i}\big| \le C^2 \sigma_U^2 n^{-1} \log n \Big\}, $$
$$ B_n = \Big\{ \min_{0 \le k \le m-1} \max_{1 \le i, j \le n} \|v_{\hat{J}_k,i}\| \|v_{\hat{J}_k,j}\| > \xi^{*2} C^2 \sigma_U^2 n^{-1} \log n \Big\}, $$
where $\hat{J}_m = \{\hat{j}_1, \ldots, \hat{j}_m\}$ is the index set associated with the groups selected by the sample GOGA.

For all $1 \le q \le m$ and on the set $A_n \cap B_n$, we have
$$ \begin{aligned} v_{\hat{J}_{q-1}, \hat{j}_q}^T v_{\hat{J}_{q-1}, \hat{j}_q} &\ge \hat{v}_{\hat{J}_{q-1}, \hat{j}_q}^T \hat{v}_{\hat{J}_{q-1}, \hat{j}_q} - \max_{(J,i):\, \#(J) \le m-1,\, i \notin J} \big|\hat{v}_{J,i}^T \hat{v}_{J,i} - v_{J,i}^T v_{J,i}\big| \\ &\ge -C^2 \sigma_U^2 n^{-1} \log n + \max_{1 \le j \le n} \hat{v}_{\hat{J}_{q-1}, j}^T \hat{v}_{\hat{J}_{q-1}, j} \\ &\ge -2 C^2 \sigma_U^2 n^{-1} \log n + \max_{1 \le j \le n} v_{\hat{J}_{q-1}, j}^T v_{\hat{J}_{q-1}, j} \\ &\ge \xi \max_{1 \le j \le n} v_{\hat{J}_{q-1}, j}^T v_{\hat{J}_{q-1}, j}. \end{aligned} \tag{A.10} $$
That is, on the set $A_n \cap B_n$, $\hat{J}_m$ is also a set of groups chosen by the generalized population GOGA (A.2). Therefore, Lemma A.1 implies the upper bound
$$ \big\|(I - H_{\hat{J}_m}) u\big\|^2 I_{A_n \cap B_n} \le \Big(\sum_{j=1}^{n} \|a_j\|\Big)^2 (1 + m\xi)^{-1}. \tag{A.11} $$
Since $\|(I - H_{\hat{J}_m}) u\|^2 \le \|(I - H_{\hat{J}_k}) u\|^2$ for $0 \le k \le m-1$, we have that, on $B_n^c$,
$$ \big\|(I - H_{\hat{J}_m}) u\big\|^2 \le \min_{0 \le k \le m-1} \big\|(I - H_{\hat{J}_k}) u\big\|^2 \le \min_{0 \le k \le m-1} \Big(n \max_{1 \le i, j \le n} \|v_{\hat{J}_k,i}\| \|v_{\hat{J}_k,j}\|\Big)^{1/2} \sum_{j=1}^{n} \|a_j\| \le \xi^* C \sigma_U (\log n)^{1/2} \sum_{j=1}^{n} \|a_j\|, $$
where the second inequality follows from the argument in Lemma A.1. Define
$$ C_n = \Big\{ \min_{0 \le k \le m-1} \max_{1 \le i, j \le n} \|v_{\hat{J}_k,i}\| \|v_{\hat{J}_k,j}\| < \xi^{*2} C^2 \sigma_U^2 n^{-2} \log n \Big\}. $$
Similar to the preceding calculations, it can be shown that, on $C_n$,
$$ \big\|(I - H_{\hat{J}_m}) u\big\|^2 \le \min_{0 \le k \le m-1} \Big(n \max_{1 \le i, j \le n} \|v_{\hat{J}_k,i}\| \|v_{\hat{J}_k,j}\|\Big)^{1/2} \sum_{j=1}^{n} \|a_j\| \le \xi^* C \sigma_U (n^{-1} \log n)^{1/2} \sum_{j=1}^{n} \|a_j\|. \tag{A.12} $$

With the specific structure of the matrix $X_n$, it can be shown that for $m > K_n$ (with $K_n \gg m_0$),
$$ \lim_{n \to \infty} P\Big( \min_{0 \le k \le m-1} \max_{1 \le i, j \le n} \|v_{\hat{J}_k,i}\| \|v_{\hat{J}_k,j}\| < \xi^{*2} C^2 \sigma_U^2 n^{-2} \log n \Big) = 1. $$
Denote $\bar{a} = \sum_{j=1}^{n} \|a_j\| / \sqrt{n}$. For a given $m$, it follows from (A.11) and (A.12) that
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 I_{A_n} \le \frac{\bar{a}^2}{1 + m\xi}\, I_{\{m < K_n\}} + \bar{a}\, \xi^* C \sigma_U \frac{(\log n)^{1/2}}{n}\, I_{\{m \ge K_n\}}. \tag{A.13} $$
Since $A_n$ decreases as $m$ increases, we have for all $1 \le m \le K_n$ that
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 I_{A} \le \frac{\bar{a}^2}{1 + m\xi}\, I_{\{m < K_n\}} + \bar{a}\, \xi^* C \sigma_U \frac{(\log n)^{1/2}}{n}\, I_{\{m \ge K_n\}}, $$
where $A$ corresponds to the set $A_n$ with $m = K_n$. Note that
$$ \begin{aligned} \max_{\#(J) \le K_n - 1,\, i \notin J} \big|\hat{v}_{J,i}^T \hat{v}_{J,i} - v_{J,i}^T v_{J,i}\big| &= \max_{\#(J) \le K_n - 1,\, i \notin J} n^{-1} \big| u^T (I - H_J) H_{\{i\}} (I - H_J) u - y^T (I - H_J) H_{\{i\}} (I - H_J) y \big| \\ &= \max_{\#(J) \le K_n - 1,\, i \notin J} n^{-1} \big| \epsilon^T (I - H_J) H_{\{i\}} (I - H_J) \epsilon + 2 u^T (I - H_J) H_{\{i\}} (I - H_J) \epsilon \big|. \end{aligned} \tag{A.14} $$
Denote each group $X_i$ as $X_i = (X_{i1}, X_{i2})$ and
$$ n (X_i^T X_i)^{-1} = \begin{pmatrix} w_{i,11} & w_{i,12} \\ w_{i,21} & w_{i,22} \end{pmatrix}. $$
Define also $v_{i1} = X_{i1}^T (I - H_J) \epsilon$ and $v_{i2} = X_{i2}^T (I - H_J) \epsilon$. The first part of (A.14) can be expressed as
$$ n^{-1} \epsilon^T (I - H_J) X_i (X_i^T X_i)^{-1} X_i^T (I - H_J) \epsilon = w_{i,11} \frac{v_{i1}^2}{n^2} + w_{i,22} \frac{v_{i2}^2}{n^2} + 2 w_{i,12} \frac{v_{i1} v_{i2}}{n^2}. $$
Using Lemma A.2, we can show that

27 Moreover, for a sufficiently large constant C > 0, we have ( ) 1/2 log n P max #(J) K n 1,i J n 1 v i1 > Cσ U n ( n P max t=1 X i j,t η ) t 1 i n, j=1,2 > C(log n)1/2 (1 + M) 1 0, n 1/2 (A.16) as n. The last convergence (A.16) follows from the invariant principle and the fact that { n t=1 X i j,t η t } i=1,2,... is a partial sum process and {η t } t=1,2,... is an independent sequence. Similarly, we can show that ( ) 1/2 log n P max #(J) K n 1,i J n 1 v i2 > Cσ U 0 as n. (A.17) n Since w i,11, w i,12 and w i,22, 1 i n, are O p (1), it follows from (A.16) and (A.17) that ( P max #(J) K n 1,i J n 1 ɛ T (I H J )H {i} (I H J )ɛ ) > C 2 σ 2 log n U 0, n (A.18) as n. Since most coefficients are zero-vector and only m 0 groups are non-zero, the second part of (A.14) is equal to j m0 2n 1 j= j 1 b T j X T j (I H J )X i (X T i X i ) 1 X T i (I H J )ɛ. (A.19) For each component of (A.19), we have = 2 where 2n 1 b T j X T j (I H J )X i (X T i X i ) 1 X T i (I H J )ɛ ( v i1 a j1 w i,11 n n + w v i2 a j2 i,22 n n + w v i1 i,21 n max a jk #(J) K n 1,i J n = max #(J) K n 1,i J n 1 b T j X T j X ik n 1 b T j X T j H J X, ik k = 1, 2, are bounded. Combining with (A.16) and (A.17), we have a j2 n + w v i2 i,12 n a ) j1, n P( max #(J) K n 1,i J n 1 2u T (I H J )H {i} (I H J )ɛ > C 2 σ 2 log n U n ) 0, (A.20) 26

as $n \to \infty$. Finally, it follows from (A.18) and (A.20) that
$$ \lim_{n \to \infty} P(A^c) = 0. \tag{A.21} $$
Moreover, for $m \le K_n$,
$$ \lim_{n \to \infty} P(A_n^c(m)) \le \lim_{n \to \infty} P(A^c) = 0, \tag{A.22} $$
and it follows from (A.13) that
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 I_{A_n(m)} \le \frac{\bar{a}^2}{1 + m\xi}\, I_{\{m < K_n\}} + \bar{a}\, \xi^* C \sigma_U \frac{(\log n)^{1/2}}{n}\, I_{\{m \ge K_n\}}. \tag{A.23} $$
It follows from (A.21) that
$$ \lim_{n \to \infty} P(F_n) = 0, $$
where
$$ F_n = \Big\{ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 > \frac{\bar{a}^2}{1 + m\xi}\, I_{\{m < K_n\}} + \bar{a}\, \xi^* C \sigma_U \frac{(\log n)^{1/2}}{n}\, I_{\{m \ge K_n\}} \Big\}. \tag{A.24} $$
For $J \subseteq \{1, \ldots, n\}$ and $j \in J$, define $b_j(J)$ to be the parameter of $X_j$ in the best linear predictor $\sum_{i \in J} X_i b_i(J)$ of $u$, i.e., the predictor that minimizes $\|u - \sum_{i \in J} X_i \lambda_i\|^2$. Let $b_j(J) = 0$ if $j \notin J$. Thus,
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 = n^{-1} \Big\| \sum_{j} X_j \big(b_j - b_j(\hat{J}_m)\big) \Big\|^2. \tag{A.25} $$
Now suppose that a true group $j_s^0$ has not been selected, that is, $\hat{J}_m^c \cap J = \{j_s^0\}$, where $m > K_n$. We can conclude that there must exist at least one group $\hat{j}_l \in \hat{J}_m$ contained in the neighbourhood of $j_s^0$, i.e., $|\hat{j}_l - j_s^0| < \gamma_n = \log(n)$. Otherwise, on $\{\hat{J}_m^c \cap J = \{j_s^0\}$ and $|\hat{j}_l - j_s^0| > \log(n)$ for all $\hat{j}_l \in \hat{J}_m\}$, it follows from (A.25) that
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 \ge n^{-1} \big(\min_{j \in J} \|b_j\|\big)^2 \lambda_{\min}\big(X_{\hat{J}_m \cup J}^T X_{\hat{J}_m \cup J}\big) \ge \frac{D \log(n)}{n}, \tag{A.26} $$
where $D$ is a positive constant and $\lambda_{\min}(X_{\hat{J}_m \cup J}^T X_{\hat{J}_m \cup J})$ is the smallest eigenvalue of $X_{\hat{J}_m \cup J}^T X_{\hat{J}_m \cup J}$. For sufficiently large $n$, we obtain that
$$ n^{-1} \big\|(I - H_{\hat{J}_m}) u\big\|^2 \ge \frac{D \log(n)}{n} \ge \frac{(\log n)^{1/2}}{n}\, \bar{a}\, \xi^* C \sigma_U. \tag{A.27} $$
Since we have already selected more than $K_n$ variables using the sample GOGA, this implies that

$\{d_H(\hat{J}_m, J) > \log(n)\} \subseteq F_n$. Thus,
$$ \lim_{n \to \infty} P\big(d_H(\hat{J}_{K_n}, J) \le \log(n)\big) = 1. $$

Figure 1: Example 1. Left: time series plot of a realization of (4.1). Right: time series plot of the transformed observations $Y_n^0$. Sample size n = 600.

Figure 2: Example 2. Left: time series plot of a realization of (4.2). Right: time series plot of the transformed observations $Y_n^0$. Sample size n =

Figure 3: Example 3. Left: time series plot of a realization of (4.3). Right: time series plot of the transformed observations $Y_n^0$. Sample size n =

Figure 4: Example 4. Left: time series plot of a realization of (4.4). Right: time series plot of the transformed observations $Y_n^0$. Sample size n =

Figure 5: Example 5. Time series plot of the transformed observations $Y_n^0$ based on (4.5). Sample size n =

Figure 6: Left: growth rate of the U.S. GNP data from 1947 to 2012. Right: plot of the transformed observations.

Table 1: Percentage of correct identification of the number of thresholds (%$m_0$), bias and empirical standard deviation (ESD) of GOGA+HDIC+Trim. The corresponding values of the two-step estimation in Chan et al. (2015) are given in parentheses. Replications = 1000.

n %m_0 r_1 r_2
Bias (97) (0.010) (0.012) ESD (0.056) (0.018)
Bias (96) (0.006) (0.008) ESD (0.037) (0.011)
Bias (95.7) (0.004) (0.007) ESD (0.027) (0.009)

Table 2: Percentage of correct identification of the number of thresholds (%$m_0$), bias and empirical standard deviation (ESD) of GOGA+HDIC+Trim. The corresponding values of the two-step estimation in Chan et al. (2015) are given in parentheses. Replications = 1000.

n %m_0 r_1 r_2
Bias (90) (0.005) (0.009) ESD (0.025) (0.036)
Bias (89.3) (0.007) (0.008) ESD (0.027) (0.011)
Bias (91.4) (0.000) (0.005) ESD (0.026) (0.007)

Table 3: Percentage of correct identification of the number of thresholds (%$m_0$), bias and empirical standard deviation (ESD) of GOGA+HDIC+Trim. The corresponding values of the two-step estimation in Chan et al. (2015) are given in parentheses. Replications = 1000.

n %m_0 r_1 r_2
Bias (71.2) (0.002) (0.020) ESD (0.056) (0.107)
Bias (72.5) (0.000) (0.015) ESD (0.036) (0.069)
Bias (71) (0.002) (0.012) ESD (0.033) (0.053)

Table 4: Percentage of correct identification of the number of thresholds (%$m_0$), bias and empirical standard deviation (ESD) of GOGA+HDIC+Trim. The corresponding values of the two-step estimation in Chan et al. (2015) are given in parentheses. Replications = 1000.

n %m_0 r_1 r_2
Bias (93.3) (0.013) (0.012) ESD (0.040) (0.084)
Bias (96) (0.007) (0.010) ESD (0.025) (0.012)
Bias (96.6) (0.007) (0.008) ESD (0.020) (0.010)

Table 5: Percentage of correct identification of the number of thresholds (%$m_0$), bias and empirical standard deviation (ESD) of GOGA+HDIC+Trim. The corresponding values of the two-step estimation in Chan et al. (2015) are given in parentheses. Replications = 1000.

n %m_0 r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8
Bias (54.7) (0.003) (0.003) (0.002) (0.004) (0.003) (0.096) (0.059) (0.030) ESD (0.004) (0.003) (0.012) (0.014) (0.012) (0.170) (0.050) (0.137)
Bias (41) (0.004) (0.004) (0.003) (0.002) (0.004) (0.056) (0.622) (0.283) ESD (0.006) (0.004) (0.019) (0.046) (0.051) (0.117) (0.440) (0.516)
Bias (20.4) (0.006) (0.006) (0.002) (0.005) (0.003) (0.043) (0.981) (0.758) ESD (0.009) (0.008) (0.020) (0.026) (0.017) (0.025) (0.230) (0.475)


More information

Geometric View of Measurement Errors

Geometric View of Measurement Errors This article was downloaded by: [University of Virginia, Charlottesville], [D. E. Ramirez] On: 20 May 205, At: 06:4 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number:

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Characterizations of Student's t-distribution via regressions of order statistics George P. Yanev a ; M. Ahsanullah b a

Characterizations of Student's t-distribution via regressions of order statistics George P. Yanev a ; M. Ahsanullah b a This article was downloaded by: [Yanev, George On: 12 February 2011 Access details: Access Details: [subscription number 933399554 Publisher Taylor & Francis Informa Ltd Registered in England and Wales

More information

Discussion on Change-Points: From Sequential Detection to Biology and Back by David Siegmund

Discussion on Change-Points: From Sequential Detection to Biology and Back by David Siegmund This article was downloaded by: [Michael Baron] On: 2 February 213, At: 21:2 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer

More information

Park, Pennsylvania, USA. Full terms and conditions of use:

Park, Pennsylvania, USA. Full terms and conditions of use: This article was downloaded by: [Nam Nguyen] On: 11 August 2012, At: 09:14 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer

More information

Communications in Algebra Publication details, including instructions for authors and subscription information:

Communications in Algebra Publication details, including instructions for authors and subscription information: This article was downloaded by: [Professor Alireza Abdollahi] On: 04 January 2013, At: 19:35 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

CCSM: Cross correlogram spectral matching F. Van Der Meer & W. Bakker Published online: 25 Nov 2010.

CCSM: Cross correlogram spectral matching F. Van Der Meer & W. Bakker Published online: 25 Nov 2010. This article was downloaded by: [Universiteit Twente] On: 23 January 2015, At: 06:04 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office:

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

The Fourier transform of the unit step function B. L. Burrows a ; D. J. Colwell a a

The Fourier transform of the unit step function B. L. Burrows a ; D. J. Colwell a a This article was downloaded by: [National Taiwan University (Archive)] On: 10 May 2011 Access details: Access Details: [subscription number 905688746] Publisher Taylor & Francis Informa Ltd Registered

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

A note on adaptation in garch models Gloria González-Rivera a a

A note on adaptation in garch models Gloria González-Rivera a a This article was downloaded by: [CDL Journals Account] On: 3 February 2011 Access details: Access Details: [subscription number 922973516] Publisher Taylor & Francis Informa Ltd Registered in England and

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

The Risk of James Stein and Lasso Shrinkage

The Risk of James Stein and Lasso Shrinkage Econometric Reviews ISSN: 0747-4938 (Print) 1532-4168 (Online) Journal homepage: http://tandfonline.com/loi/lecr20 The Risk of James Stein and Lasso Shrinkage Bruce E. Hansen To cite this article: Bruce

More information

Full terms and conditions of use:

Full terms and conditions of use: This article was downloaded by:[smu Cul Sci] [Smu Cul Sci] On: 28 March 2007 Access Details: [subscription number 768506175] Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered

More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Discussion of High-dimensional autocovariance matrices and optimal linear prediction,

Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Electronic Journal of Statistics Vol. 9 (2015) 1 10 ISSN: 1935-7524 DOI: 10.1214/15-EJS1007 Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Xiaohui Chen University

More information

Published online: 10 Apr 2012.

Published online: 10 Apr 2012. This article was downloaded by: Columbia University] On: 23 March 215, At: 12:7 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 172954 Registered office: Mortimer

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008 Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Issues on quantile autoregression

Issues on quantile autoregression Issues on quantile autoregression Jianqing Fan and Yingying Fan We congratulate Koenker and Xiao on their interesting and important contribution to the quantile autoregression (QAR). The paper provides

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Econometric modeling of the relationship among macroeconomic variables of Thailand: Smooth transition autoregressive regression model

Econometric modeling of the relationship among macroeconomic variables of Thailand: Smooth transition autoregressive regression model The Empirical Econometrics and Quantitative Economics Letters ISSN 2286 7147 EEQEL all rights reserved Volume 1, Number 4 (December 2012), pp. 21 38. Econometric modeling of the relationship among macroeconomic

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

Tong University, Shanghai , China Published online: 27 May 2014.

Tong University, Shanghai , China Published online: 27 May 2014. This article was downloaded by: [Shanghai Jiaotong University] On: 29 July 2014, At: 01:51 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

Online publication date: 30 March 2011

Online publication date: 30 March 2011 This article was downloaded by: [Beijing University of Technology] On: 10 June 2011 Access details: Access Details: [subscription number 932491352] Publisher Taylor & Francis Informa Ltd Registered in

More information

Spatial Lasso with Applications to GIS Model Selection

Spatial Lasso with Applications to GIS Model Selection Spatial Lasso with Applications to GIS Model Selection Hsin-Cheng Huang Institute of Statistical Science, Academia Sinica Nan-Jung Hsu National Tsing-Hua University David Theobald Colorado State University

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Obtaining Critical Values for Test of Markov Regime Switching

Obtaining Critical Values for Test of Markov Regime Switching University of California, Santa Barbara From the SelectedWorks of Douglas G. Steigerwald November 1, 01 Obtaining Critical Values for Test of Markov Regime Switching Douglas G Steigerwald, University of

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Iterative Selection Using Orthogonal Regression Techniques

Iterative Selection Using Orthogonal Regression Techniques Iterative Selection Using Orthogonal Regression Techniques Bradley Turnbull 1, Subhashis Ghosal 1 and Hao Helen Zhang 2 1 Department of Statistics, North Carolina State University, Raleigh, NC, USA 2 Department

More information

High-dimensional variable selection via tilting

High-dimensional variable selection via tilting High-dimensional variable selection via tilting Haeran Cho and Piotr Fryzlewicz September 2, 2011 Abstract This paper considers variable selection in linear regression models where the number of covariates

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Version of record first published: 01 Sep 2006.

Version of record first published: 01 Sep 2006. This article was downloaded by: [University of Miami] On: 27 November 2012, At: 08:47 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office:

More information

Diatom Research Publication details, including instructions for authors and subscription information:

Diatom Research Publication details, including instructions for authors and subscription information: This article was downloaded by: [Saúl Blanco] On: 26 May 2012, At: 09:38 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA

LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA The Annals of Statistics 2009, Vol. 37, No. 1, 246 270 DOI: 10.1214/07-AOS582 Institute of Mathematical Statistics, 2009 LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS FOR HIGH-DIMENSIONAL DATA BY NICOLAI

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

MSE Performance and Minimax Regret Significance Points for a HPT Estimator when each Individual Regression Coefficient is Estimated

MSE Performance and Minimax Regret Significance Points for a HPT Estimator when each Individual Regression Coefficient is Estimated This article was downloaded by: [Kobe University] On: 8 April 03, At: 8:50 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 07954 Registered office: Mortimer House,

More information

Theoretical results for lasso, MCP, and SCAD

Theoretical results for lasso, MCP, and SCAD Theoretical results for lasso, MCP, and SCAD Patrick Breheny March 2 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/23 Introduction There is an enormous body of literature concerning theoretical

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Economics Division University of Southampton Southampton SO17 1BJ, UK. Title Overlapping Sub-sampling and invariance to initial conditions

Economics Division University of Southampton Southampton SO17 1BJ, UK. Title Overlapping Sub-sampling and invariance to initial conditions Economics Division University of Southampton Southampton SO17 1BJ, UK Discussion Papers in Economics and Econometrics Title Overlapping Sub-sampling and invariance to initial conditions By Maria Kyriacou

More information

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building

Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Two Tales of Variable Selection for High Dimensional Regression: Screening and Model Building Cong Liu, Tao Shi and Yoonkyung Lee Department of Statistics, The Ohio State University Abstract Variable selection

More information

Sparsity in Underdetermined Systems

Sparsity in Underdetermined Systems Sparsity in Underdetermined Systems Department of Statistics Stanford University August 19, 2005 Classical Linear Regression Problem X n y p n 1 > Given predictors and response, y Xβ ε = + ε N( 0, σ 2

More information

Statistica Sinica Preprint No: SS R3

Statistica Sinica Preprint No: SS R3 Statistica Sinica Preprint No: SS-2015-0413.R3 Title Regularization after retention in ultrahigh dimensional linear regression models Manuscript ID SS-2015-0413.R3 URL http://www.stat.sinica.edu.tw/statistica/

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

University of Thessaloniki, Thessaloniki, Greece

University of Thessaloniki, Thessaloniki, Greece This article was downloaded by:[bochkarev, N.] On: 14 December 2007 Access Details: [subscription number 746126554] Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number:

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information