Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis


ORIGINAL PAPER

Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis

Yasunori Fujikoshi [1] · Tetsuro Sakurai [2]

Received: 19 July 2018 / Accepted: 16 January 2019
© Japanese Federation of Statistical Science Associations 2019

Abstract
This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a test-based method (TM) drawing on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when the dimension and the sample size are large. For the case in which the dimension is larger than the sample size, a ridge-type method is proposed. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation. It is pointed out that our selection method can be applied to high-dimensional data.

Keywords: Consistency · Discriminant analysis · High-dimensional framework · Selection of variables · Test-based method

1 Introduction

This paper is concerned with the variable selection problem in two-group discriminant analysis with a common covariance matrix. In a variable selection problem under such a discriminant model, one of the goals is to find the subset of variables whose coefficients in the linear discriminant function are not zero. Several methods, including the model selection criteria AIC (Akaike information criterion; Akaike 1973) and BIC (Bayesian information criterion; Schwarz 1978), have been developed. It is known (see, e.g., Fujikoshi 1985; Nishii et al. 1988) that in a large-sample framework AIC is not consistent but BIC is consistent. On the other hand, in the multivariate regression model, it has been shown (Fujikoshi et al. 2014; Yanagihara et al. 2015) that in a high-dimensional framework AIC has consistency properties under some conditions, but BIC is not necessarily

* Yasunori Fujikoshi, fujikoshi_y@yahoo.co.jp
1 Department of Mathematics, Graduate School of Science, Hiroshima University, Kagamiyama, Higashi Hiroshima, Hiroshima, Japan
2 School of General and Management Studies, Suwa University of Science, Toyohira, Chino, Nagano, Japan

consistent. In our discriminant model, there are methods based on misclassification errors, by McLachlan (1976), Fujikoshi (1985), Hyodo and Kubokawa (2014), and Yamada et al. (2017), for the high-dimensional case as well as the large-sample case. It is known (see, e.g., Fujikoshi 1985) that the two methods based on the misclassification error rate and on AIC are asymptotically equivalent under a large-sample framework. In our discriminant model, Sakurai et al. (2013) derived an asymptotically unbiased estimator of the risk function for a high-dimensional case. On the other hand, these selection methods are based on the minimization of the criteria over all subsets of variables, and so become computationally onerous when the dimension is large. Though some stepwise methods have been proposed, their optimality is not known. For high-dimensional data, the Lasso and other regularization methods have been extended to discriminant analysis; see, e.g., Clemmensen et al. (2011), Witten and Tibshirani (2011), and Hao et al. (2015). In this paper, we propose a test-based method based on a significance test of each variable, which is useful for high-dimensional data as well as large-sample data. The idea is essentially the same as in Zhao et al. (1986). Our criterion involves a constant term which should be determined from the viewpoint of some optimality. We propose a class of constants for which the method is consistent when the dimension and the sample size are large. For the case in which the dimension is larger than the sample size, a regularized method is numerically examined. Our results and the tendencies therein are explored numerically through a Monte Carlo simulation. The remainder of the present paper is organized as follows. In Sect. 2, we present the relevant notation and the test-based method. In Sect. 3, we derive sufficient conditions for the test-based criterion to be consistent in a high-dimensional case. In Sect. 4, we study the test-based criterion through a Monte Carlo simulation. In Sect.
5, we propose the ridge-type criteria, whose consistency properties are numerically examined. In Sect. 6, conclusions are offered. All proofs of our results are provided in the Appendix.

2 Test-based method

In two-group discriminant analysis, suppose that we have independent samples $y^{(i)}_1, \dots, y^{(i)}_{n_i}$ of $y = (y_1, \dots, y_p)'$ from the $p$-dimensional normal distributions $\Pi_i : N_p(\mu^{(i)}, \Sigma)$, $i = 1, 2$. Let $Y$ be the total sample matrix defined by $Y = (y^{(1)}_1, \dots, y^{(1)}_{n_1}, y^{(2)}_1, \dots, y^{(2)}_{n_2})'$. The coefficients of the population discriminant function are given by

  $\beta = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = (\beta_1, \dots, \beta_p)'. \quad (1)$

Let $\Delta$ and $D$ be the population and the sample Mahalanobis distances defined by

  $\Delta = \{ (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \}^{1/2}$ and $D = \{ (\bar y^{(1)} - \bar y^{(2)})' S^{-1} (\bar y^{(1)} - \bar y^{(2)}) \}^{1/2},$

respectively. Here, $\bar y^{(1)}$ and $\bar y^{(2)}$ are the sample mean vectors, and $S$ is the pooled sample covariance matrix based on the $n = n_1 + n_2$ samples. Suppose that $j$ denotes a subset of $\omega = \{1, \dots, p\}$ containing $p_j$ elements, and $y_j$ denotes the $p_j$-vector consisting of the elements of $y$ indexed by the elements of $j$. We use the notation $D_j$ and $D_\omega$ for $D$ based on $y_j$ and $y_\omega$ $(= y)$, respectively. Let $M_j$ be the variable selection model defined by

  $M_j : \beta_i \ne 0$ if $i \in j$, and $\beta_i = 0$ if $i \notin j$. $\quad (2)$

The model $M_j$ is equivalent to $\Delta_j = \Delta_\omega$, i.e., the Mahalanobis distance based on $y_j$ is the same as the one based on the full set of variables $y$. We identify the selection of $M_j$ with the selection of $y_j$. Let $\mathrm{AIC}_j$ be the AIC for $M_j$. Then, it is known (see, e.g., Fujikoshi 1985) that

  $A_j = \mathrm{AIC}_j - \mathrm{AIC}_\omega = n \log\left\{ 1 + \frac{g^2 (D_\omega^2 - D_j^2)}{n - 2 + g^2 D_j^2} \right\} - 2(p - p_j), \quad (3)$

where $g = \{(n_1 n_2)/n\}^{1/2}$. Similarly, let $\mathrm{BIC}_j$ be the BIC for $M_j$; then

  $B_j = \mathrm{BIC}_j - \mathrm{BIC}_\omega = n \log\left\{ 1 + \frac{g^2 (D_\omega^2 - D_j^2)}{n - 2 + g^2 D_j^2} \right\} - (\log n)(p - p_j). \quad (4)$

In a large-sample framework, it is known (see Fujikoshi 1985; Nishii et al. 1988) that AIC is not consistent but BIC is consistent. The variable selection methods based on AIC and BIC are given as $\min_j \mathrm{AIC}_j$ and $\min_j \mathrm{BIC}_j$, respectively, and therefore such criteria become computationally onerous when $p$ is large. To circumvent this issue, we consider a test-based method (TM) drawing on the significance of each variable. A critical region for $\beta_i = 0$ based on the likelihood ratio principle is expressed (see, e.g., Rao 1973; Fujikoshi et al. 2010) as

  $T_{d,i} = n \log\left\{ 1 + \frac{g^2 (D_\omega^2 - D_{(-i)}^2)}{n - 2 + g^2 D_{(-i)}^2} \right\} - d > 0, \quad (5)$

where $(-i)$, $i = 1, \dots, p$, is the subset of $\omega = \{1, \dots, p\}$ obtained by omitting $i$ from $\omega$, and $d$ is a positive constant which may depend on $p$ and $n$. Note that $T_{2,i} > 0 \iff \mathrm{AIC}_{(-i)} - \mathrm{AIC}_\omega > 0$.
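As a concrete illustration, the statistic in (5) can be computed directly from the two sample matrices. The following is a minimal numpy sketch; the helper names `mahalanobis_sq` and `T_stat` are ours, not the paper's.

```python
import numpy as np

def mahalanobis_sq(y1, y2, idx):
    """Squared sample Mahalanobis distance D_j^2 for the variable subset idx.

    y1, y2 are (n_i, p) sample matrices for the two groups; the pooled
    covariance S is based on n = n1 + n2 samples, as in Sect. 2.
    """
    a, b = y1[:, idx], y2[:, idx]
    n1, n2 = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    S = ((n1 - 1) * np.cov(a, rowvar=False) +
         (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    return float(diff @ np.linalg.solve(np.atleast_2d(S), diff))

def T_stat(y1, y2, i, d):
    """Statistic T_{d,i} of (5): variable i is retained when T_{d,i} > 0."""
    n1, n2 = len(y1), len(y2)
    n = n1 + n2
    g2 = n1 * n2 / n                        # g^2 = n1*n2/n
    p = y1.shape[1]
    D2_full = mahalanobis_sq(y1, y2, list(range(p)))
    D2_drop = mahalanobis_sq(y1, y2, [k for k in range(p) if k != i])
    return n * np.log1p(g2 * (D2_full - D2_drop) / (n - 2 + g2 * D2_drop)) - d
```

Computing the selected set then amounts to collecting the indices with `T_stat(y1, y2, i, d) > 0`, one pass over the $p$ variables.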
We consider a test-based method for the selection of variables defined by selecting the set of suffixes, or the set of variables, given by

  $\mathrm{TM}_d = \{ i \in \omega \mid T_{d,i} > 0 \}, \quad (6)$

or $\{ y_i \in \{y_1, \dots, y_p\} \mid T_{d,i} > 0 \}$. The notation $\hat j_{\mathrm{TM}_d}$ is also used for $\mathrm{TM}_d$. This idea is closely related to an approach considered by Zhao et al. (1986) and Nishii et al.

(1988), who used a selection method based on the comparison of $\mathrm{IC}_{(-i)}$ and $\mathrm{IC}_\omega$ for a general information criterion. In general, if $d$ is large, a small number of variables is selected; if $d$ is small, a large number of variables is selected. Ideally, we want to select only the true variables, whose discriminant coefficients are not zero. For a test-based method, an important problem is how to decide the constant term $d$. Nishii et al. (1988) used the special case $d = \log n$. They noted that under a large-sample framework, $\mathrm{TM}_{\log n}$ is consistent. However, we note through a simulation experiment that $\mathrm{TM}_{\log n}$ will not be consistent in a high-dimensional case. We propose a class of $d$ satisfying a high-dimensional consistency, including $d = \sqrt{n}$, in Sect. 3.

3 Consistency of TM_d under a high-dimensional framework

For studying consistency of the variable selection criterion $\mathrm{TM}_d$, it is assumed that the true model $M_*$ is included in the full model. Let the minimum model including $M_*$ be $M_{j_*}$. For notational simplicity, we regard the true model $M_*$ as $M_{j_*}$. Let $\mathcal{F}$ be the entire suite of candidate models, that is, $\mathcal{F} = \{\{1\}, \dots, \{p\}, \{1, 2\}, \dots, \{1, \dots, p\}\}$. A subset $j$ of $\omega$ is called an overspecified model if $j$ includes the true model; otherwise, $j$ is called an underspecified model. Then, $\mathcal{F}$ is separated into two sets: the set of overspecified models, $\mathcal{F}_+ = \{ j \in \mathcal{F} \mid j_* \subseteq j \}$, and the set of underspecified models, $\mathcal{F}_- = \mathcal{F}_+^c$. It is said that a model selection criterion $\hat j$ has high-dimensional consistency if

  $\lim_{p/n \to c \in (0,1)} \Pr(\hat j = j_*) = 1.$

Here, we list some of our main assumptions:

A1 (The true model): $M_{j_*} \in \mathcal{F}$.
A2 (The high-dimensional asymptotic framework): $p \to \infty$, $n \to \infty$, $p/n \to c \in (0, 1)$, $n_i/n \to k_i > 0$ $(i = 1, 2)$.

For the dimensionality $p_*$ of the true model and the Mahalanobis distance $\Delta$, first the following case is considered:

A3: $p_*$ is finite, and $\Delta^2 = O(1)$.

For the constant $d$ of the test-based statistic $T_{d,i}$ in (5), we consider the following assumptions:

B1: $d/n \to 0$.
B2: $h \equiv d/n - (n - p - 3)^{-1} > 0$, and $h = O(n^{-a})$, where $0 < a < 1$.
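Condition B2 can be inspected numerically. Under $p/n \to c$, the term $(n-p-3)^{-1}$ behaves like $\{(1-c)n\}^{-1}$; taking $c = 1/2$ for illustration, $d = 2$ leaves $h$ negative, $d = \log n$ leaves $h$ of the too-small order $(\log n)/n$, while $d = n^r$ gives $h$ of the exact order $n^{-(1-r)}$ required by B2. A quick check (the function name and the chosen grid of $n$ are ours):

```python
import numpy as np

def h_value(d, n, p):
    """h = d/n - 1/(n - p - 3) from condition B2; B2 needs h > 0, h = O(n^{-a})."""
    return d / n - 1.0 / (n - p - 3)

for n in (200, 2000, 20000):
    p = n // 2                                   # high-dimensional regime p/n -> 1/2
    for label, d in (("d = 2", 2.0),
                     ("d = log n", np.log(n)),
                     ("d = sqrt n", np.sqrt(n))):
        print(f"n = {n:6d}, {label:10s}: h = {h_value(d, n, p):+.5f}")
```

Running this shows $h < 0$ for $d = 2$ and $h > 0$ but shrinking like $n^{-1/2}$ for $d = \sqrt{n}$, in line with the discussion of Theorems 1 and 3 below.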

The consistency of $\mathrm{TM}_d$ in (6) for some $d > 0$ shall be shown along the following outline. In general, we have

  $\mathrm{TM}_d = j_* \iff T_{d,i} > 0$ for $i \in j_*$, and $T_{d,i} \le 0$ for $i \notin j_*$.

Therefore,

  $P(\mathrm{TM}_d = j_*) = P\left( \bigcap_{i \in j_*} \{T_{d,i} > 0\} \cap \bigcap_{i \notin j_*} \{T_{d,i} < 0\} \right) = 1 - P\left( \bigcup_{i \in j_*} \{T_{d,i} \le 0\} \cup \bigcup_{i \notin j_*} \{T_{d,i} \ge 0\} \right) \ge 1 - \sum_{i \in j_*} P(T_{d,i} \le 0) - \sum_{i \notin j_*} P(T_{d,i} \ge 0).$

We shall show that

  [F1] $\sum_{i \in j_*} P(T_{d,i} \le 0) \to 0, \quad (7)$

  [F2] $\sum_{i \notin j_*} P(T_{d,i} \ge 0) \to 0. \quad (8)$

[F1] concerns the probability that the true variables are not selected; [F2] concerns the probability that the non-true variables are selected. The square of the Mahalanobis distance of $y$ is decomposed as the sum of the squared Mahalanobis distance of $y_{(-i)}$ and the conditional Mahalanobis distance of $y_{\{i\}}$ given $y_{(-i)}$ as follows:

  $\Delta^2 = \Delta^2_{(-i)} + \Delta^2_{\{i\} \cdot (-i)}. \quad (9)$

When $i \in j_*$, $(-i) \notin \mathcal{F}_+$, and hence

  $\Delta^2_{\{i\} \cdot (-i)} = \Delta^2 - \Delta^2_{(-i)} > 0.$

Related to the consistency of $\mathrm{TM}_d$ under assumption A3, we consider the following assumption:

A4: For $i \in j_*$, $\lim (n_1 n_2 / n^2)\, \Delta^2_{\{i\} \cdot (-i)} \left\{ 1 + (n_1 n_2 / n^2) \Delta^2_{(-i)} \right\}^{-1} > 0$.

Note that A4 is satisfied if $\lim \Delta^2_{\{i\} \cdot (-i)} > 0$ and $\lim \Delta^2_{(-i)} > 0$, under $\Delta^2 = O(1)$.
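The decomposition (9) is the usual conditional-distance identity: the conditional term is the squared mean difference of $y_i$ after regressing out $y_{(-i)}$, divided by the conditional variance. It can be verified numerically for any positive definite $\Sigma$; the sketch below uses synthetic population quantities of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)
p, i = 5, 2
mu_diff = rng.normal(size=p)            # mu^(1) - mu^(2), synthetic
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)         # positive definite covariance

def maha_sq(delta, S):
    """Squared population Mahalanobis distance delta' S^{-1} delta."""
    return float(delta @ np.linalg.solve(S, delta))

rest = [k for k in range(p) if k != i]
full = maha_sq(mu_diff, Sigma)
minus_i = maha_sq(mu_diff[rest], Sigma[np.ix_(rest, rest)])

# Conditional quantities of y_i given y_(-i): regression coefficients b,
# conditional mean difference, and conditional variance.
b = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, i])
delta_cond = mu_diff[i] - b @ mu_diff[rest]
var_cond = Sigma[i, i] - b @ Sigma[rest, i]
cond = delta_cond ** 2 / var_cond       # Delta^2_{{i}.(-i)}

assert np.isclose(full, minus_i + cond)  # decomposition (9)
```

The assertion holds exactly (up to floating point), since (9) is the standard Schur-complement decomposition of the quadratic form.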

Theorem 1 Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, the test-based method $\mathrm{TM}_d$ is consistent if B1 and B2 are satisfied.

Let $d = n^r$, where $0 < r < 1$. Then $h = O(n^{-(1-r)})$, and condition B2 is satisfied. Therefore, the test-based criteria with $d = n^{3/4}, n^{2/3}, n^{1/2}, n^{1/3}$ or $n^{1/4}$ have high-dimensional consistency. Among them, we have numerically seen that the one with $d = \sqrt{n}$ behaves well. Note that $\mathrm{TM}_2$ and $\mathrm{TM}_{\log n}$ do not satisfy B2. As a special case, under the assumptions of Theorem 1, we have seen that the probability of selecting overspecified models tends to zero, since

  $\sum_{i \notin j_*} P(i \in \mathrm{TM}_d) = \sum_{i \notin j_*} P(T_{d,i} \ge 0) \to 0.$

The proof given there is applicable also to the case in which assumption A3 is replaced by assumption A5:

A5: $p_* = O(p)$, and $\Delta^2 = O(p)$.

In other words, such a property holds regardless of whether the dimension of $j_*$ is finite or not. Furthermore, it does not depend on the order of the Mahalanobis distance. Related to the consistency of $\mathrm{TM}_d$ under assumption A5, we consider the following assumption:

A6: For $i \in j_*$, $\theta_i^2 = \Delta^2_{\{i\} \cdot (-i)} \{ \Delta^2_{(-i)} \}^{-1} = O(p^{b-1})$, $0 < b < 1$.

Theorem 2 Suppose that assumptions A1, A2, A5 and A6 are satisfied. Then, the test-based criterion $\mathrm{TM}_d$ is consistent if $d = n^r$, $0 < r < 1$, $r < b$ and $(3/4)(1 + \delta) < b$ are satisfied, where $\delta$ is any small positive number.

From our proof, it is conjectured that the condition $3/4 < b$ can be replaced by $1/2 < b$. From a practical point of view, it is natural for $b$ to be small, and then the sufficient condition requires $r \le 1/2$.

Theorem 3 Suppose that assumption A1 is satisfied, $p$ is fixed, $n_i/n \to k_i > 0$ $(i = 1, 2)$, $d \to \infty$, and $d/n \to 0$. Then, the test-based criterion $\mathrm{TM}_d$ is consistent as $n$ tends to infinity.

From Theorem 3, we can see that $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ are consistent under a large-sample framework. However, $\mathrm{TM}_2$ does not satisfy the sufficient condition.

4 Numerical study

In this section, we numerically explore the validity of our claims through three test-based criteria: $\mathrm{TM}_2$, $\mathrm{TM}_{\log n}$, and $\mathrm{TM}_{\sqrt{n}}$. Note that $\mathrm{TM}_{\sqrt{n}}$ satisfies the sufficient conditions B1 and B2 for consistency, but $\mathrm{TM}_2$ and $\mathrm{TM}_{\log n}$ do not. The true model was assumed as follows: the true dimension is $p_* = 3, 6$; the true mean vectors are $\mu^{(1)} = \alpha(1, \dots, 1, 0, \dots, 0)'$ and $\mu^{(2)} = \alpha(-1, \dots, -1, 0, \dots, 0)'$, where $\alpha = 1, 2, 3$; and the true covariance matrix is $\Sigma = I_p$. The selection rates associated with these criteria are given in Tables 1, 2, and 3, where "Under", "True", and "Over" denote the underspecified models, the true model, and the overspecified models, respectively. The selection rates are based on $10^3$ replications in Tables 1 and 2, and on $10^2$ replications in Table 3. From Table 1, we can identify the following tendencies. The selection probabilities of the true model by $\mathrm{TM}_2$ are relatively large when the dimension is small, as in the case $p = 5$. However, the values do not approach 1 as $n$ increases, and it seems that $\mathrm{TM}_2$ is not consistent even in the large-sample case. The selection probabilities of the true model by $\mathrm{TM}_{\log n}$ are near 1, indicating consistency in the large-sample case; however, these probabilities decrease as $p$ increases, so $\mathrm{TM}_{\log n}$ will not be consistent in a high-dimensional case. The selection probabilities of the true model by $\mathrm{TM}_{\sqrt{n}}$ approach 1 as $n$ becomes large, even if $p$ is small. Furthermore, if $p$ is large but $n$ is also large, as in the high-dimensional framework, it remains consistent, although the probabilities decrease as the ratio $p/n$ approaches 1. As the quantity $\alpha$, representing the distance between the two groups, becomes large, the selection probabilities of the true model by $\mathrm{TM}_2$ and $\mathrm{TM}_{\log n}$ increase for large sample sizes, as in the case $p = 5$.
However, the effect becomes small when $p$ is large. On the other hand, the selection probabilities of the true model by $\mathrm{TM}_{\sqrt{n}}$ increase to a certain extent in both the large-sample and the high-dimensional cases. In Table 2, we examine the case where the dimension $p_*$ of the true model is larger than the one in Table 1. The following tendencies can be identified. As in Table 1, $\mathrm{TM}_2$ and $\mathrm{TM}_{\log n}$ are not consistent as $p$ increases, but $\mathrm{TM}_{\sqrt{n}}$ is. In general, the probability of selecting the true model decreases as the dimension of the true model becomes large.
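The simulation design above can be reproduced in miniature. The sketch below is our own minimal implementation (far fewer replications and a much smaller design than the paper's tables); it tallies "Under"/"True"/"Over" outcomes for $\mathrm{TM}_{\sqrt{n}}$ on data generated as in Sect. 4.

```python
import numpy as np

def select_tm(y1, y2, d):
    """Return the index set TM_d = {i : T_{d,i} > 0} (test-based method of (6))."""
    n1, n2 = len(y1), len(y2)
    n = n1 + n2
    g2 = n1 * n2 / n
    p = y1.shape[1]

    def d2(idx):
        a, b = y1[:, idx], y2[:, idx]
        diff = a.mean(0) - b.mean(0)
        S = ((n1 - 1) * np.cov(a, rowvar=False) +
             (n2 - 1) * np.cov(b, rowvar=False)) / (n - 2)
        return float(diff @ np.linalg.solve(np.atleast_2d(S), diff))

    D2_full = d2(list(range(p)))
    chosen = []
    for i in range(p):
        D2_i = d2([k for k in range(p) if k != i])
        T = n * np.log1p(g2 * (D2_full - D2_i) / (n - 2 + g2 * D2_i)) - d
        if T > 0:
            chosen.append(i)
    return set(chosen)

# Small illustration: p = 5, true set {0, 1, 2}, alpha = 2, Sigma = I_p.
rng = np.random.default_rng(2)
p, true_set, alpha = 5, {0, 1, 2}, 2.0
mu = np.zeros(p)
mu[:3] = alpha
counts = {"Under": 0, "True": 0, "Over": 0}
for _ in range(50):
    y1 = rng.normal(size=(50, p)) + mu
    y2 = rng.normal(size=(50, p)) - mu
    sel = select_tm(y1, y2, d=np.sqrt(100))   # TM_{sqrt(n)}, n = 100
    if not sel >= true_set:
        counts["Under"] += 1
    elif sel == true_set:
        counts["True"] += 1
    else:
        counts["Over"] += 1
print(counts)
```

With this strong signal, the "True" rate should dominate, in line with the tendencies reported for $\mathrm{TM}_{\sqrt{n}}$ in Tables 1 and 2.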

[Table 1 Selection rates of $\mathrm{TM}_2$, $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ for $p_* = 3$ and $\alpha = 1, 2, 3$: "Under"/"True"/"Over" rates by $(n_1, n_2, p)$; table entries not recovered.]

In Table 3, we examine the case where the dimension $p_*$ of the true model is relatively large, especially the case $p_* = p/4$. The following tendencies can be identified. When $p_* = p/4$, the consistency of $\mathrm{TM}_2$ and $\mathrm{TM}_{\log n}$ cannot be seen, while the consistency of $\mathrm{TM}_{\sqrt{n}}$ can be seen when $n$ is large.

[Table 2 Selection rates of $\mathrm{TM}_2$, $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ for $p_* = 6$ and $\alpha = 1, 2, 3$: "Under"/"True"/"Over" rates by $(n_1, n_2, p)$; table entries not recovered.]

[Table 3 Selection rates of $\mathrm{TM}_2$, $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ for $p_* = p/4$ ($p = 100, 200$): "Under"/"True"/"Over" rates by $(n_1, n_2, \alpha)$; table entries not recovered.]

5 Ridge-type methods

When $p > n - 2$, $S$ becomes singular, and so we cannot use $\mathrm{TM}_d$. One way to overcome this problem is to use the ridge-type estimator of $\Sigma$ defined by

  $\hat\Sigma_\lambda = \frac{1}{n}\left\{ (n - 2) S + \lambda I_p \right\}, \quad (10)$

where $\lambda = (n - 2)(np)^{-1} \mathrm{tr}\, S$. This estimator was used in the multivariate regression model by Kubokawa and Srivastava (2012) and Fujikoshi and Sakurai (2016), among others. The numerical experiment was done for $p_* = 3$, $\mu^{(1)} = \alpha(1, 1, 1, 0, \dots, 0)'$, $\mu^{(2)} = \alpha(-1, -1, -1, 0, \dots, 0)'$, and $\Sigma = I_p$, where $\alpha = 1, 2, 3$. The selection rates are based on $10^2$ replications in Table 4.

[Table 4 Selection rates of $\mathrm{TM}_2$, $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ with the ridge-type estimator, $\alpha = 1$: "Under"/"True"/"Over" rates by $(n_1, n_2, p)$; table entries not recovered.]

From Table 4, we can identify the following tendencies. $\mathrm{TM}_2$ is not consistent. On the other hand, it seems that $\mathrm{TM}_{\log n}$ and $\mathrm{TM}_{\sqrt{n}}$ are consistent when the dimension $p$ and the total sample size $n$ are well separated. Recently, the use of a ridge-type estimator for $\Sigma^{-1}$ has been proposed by Ito and Kubokawa (2015) and Van Wieringen and Peeters (2016). By the use of such estimators, it is expected that the consistency property can be improved, but this is left as future work.

6 Concluding remarks

In this paper, we propose a test-based method (TM) for the variable selection problem, drawing on the significance of each variable. The method involves a constant term $d$, and is denoted by $\mathrm{TM}_d$. When $d = 2$ and $d = \log n$, the corresponding TMs are related to AIC and BIC, respectively. However, the usual model selection criteria such as AIC and BIC need to examine all the subsets, whereas $\mathrm{TM}_d$ need examine only the $p$ subsets $(-i)$, $i = 1, \dots, p$. This circumvents the computational complexity associated

with AIC and BIC. Furthermore, it was identified that $\mathrm{TM}_d$ has a high-dimensional consistency property for some $d$, including $d = \sqrt{n}$, when (i) $p_*$ is finite and $\Delta^2 = O(1)$, and (ii) $p_*$ is infinite and $\Delta^2 = O(p)$. When the dimension is larger than the sample size, we propose the ridge-type methods; however, a study of their theoretical properties is left as future work.

In discriminant analysis, it is important to select a set of variables that minimizes the expected probability of misclassification. When we use the linear discriminant function with cutoff point 0, its expected probability of misclassification is

  $R = \rho_1 P(2|1) + \rho_2 P(1|2),$

where $P(2|1)$ is the probability of allocating a new observation to $\Pi_2$ although it comes from $\Pi_1$, and similarly $P(1|2)$ is the probability of allocating a new observation to $\Pi_1$ although it comes from $\Pi_2$. An estimator of $R$ can be used as a measure for the selection of variables. Usually, the prior probabilities $\rho_i$ are unknown, so as a conventional approach we use an estimator with $\rho_1 = \rho_2 = 1/2$. Then, we have a variable selection method based on an estimator of $R$ when a subset of variables $y_j$ is used. For example, under a large-sample asymptotic framework, an estimator is given (see, e.g., Fujikoshi 1985) by $\bar R_j = \Phi(Q_j)$, where $\Phi$ is the distribution function of the standard normal distribution,

  $Q_j = -\frac{1}{2} D_j - (p - 1)\left\{\frac{n}{n_1 n_2}\right\} D_j^{-1} + (32m)^{-1} D_j \left\{ 4(4p - 1) - D_j^2 \right\}, \quad (11)$

$m = n_1 + n_2 - 2$, and $D_j$ is the sample Mahalanobis distance based on $y_j$. For estimators under a high-dimensional asymptotic framework, see, e.g., Fujikoshi (2000), Hyodo and Kubokawa (2014), and Yamada et al. (2017). A usual variable selection method is to select a subset minimizing $Q_j$. However, such a method has no consistency, as in Fujikoshi (1985), and becomes computationally onerous when the dimension is large. On the other hand, along the same lines as the test-based method, we can propose a variable selection method defined by selecting the set of suffixes given by

  $\mathrm{MM}_d = \left\{ i \in \omega \mid Q_{(-i)} - Q_\omega > d \right\},$

where $d$ is a constant, though the problem of looking for an optimum $d$ is left as a future subject. Furthermore, as pointed out by one of the reviewers, the variable selection method introduced in this paper should be compared with the methods based on estimators of the misclassification probability.

Acknowledgements We thank two referees for careful reading of our manuscript and many helpful comments which improved the presentation of this paper. The first author's research is partially supported by the Ministry of Education, Science, Sports, and Culture, a Grant-in-Aid for Scientific Research (C), 16K00047.
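For illustration, the $\mathrm{MM}_d$ rule sketched in Sect. 6 can be coded using only the leading term of (11), $Q_j \approx -D_j/2$ (our simplification; the higher-order correction terms are dropped), so that $Q_{(-i)} - Q_\omega$ reduces to $(D_\omega - D_{(-i)})/2$. The helper names below are ours.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def D_sub(y1, y2, idx):
    """Sample Mahalanobis distance D_j for the variable subset idx."""
    a, b = y1[:, idx], y2[:, idx]
    n1, n2 = len(a), len(b)
    diff = a.mean(0) - b.mean(0)
    S = ((n1 - 1) * np.cov(a, rowvar=False) +
         (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    return sqrt(diff @ np.linalg.solve(np.atleast_2d(S), diff))

def mm_select(y1, y2, d):
    """MM_d = {i : Q_(-i) - Q_omega > d}, with the leading-term approximation
    Q_j ~ -D_j/2; keeps variables whose removal noticeably shrinks D."""
    p = y1.shape[1]
    Q_full = -0.5 * D_sub(y1, y2, list(range(p)))
    out = []
    for i in range(p):
        Q_i = -0.5 * D_sub(y1, y2, [k for k in range(p) if k != i])
        if Q_i - Q_full > d:
            out.append(i)
    return set(out)
```

The error-rate estimate itself would then be `Phi(-0.5 * D_sub(y1, y2, idx))` for the chosen subset; as in the text, the choice of the threshold `d` is left open.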

Appendix: Proofs of Theorems 1, 2 and 3

Preliminary lemmas

First, we study distributional results related to the test statistic $T_{d,i}$ in (5). For notational simplicity, consider the decomposition $y = (y_1', y_2')'$, with $y_1 ; p_1 \times 1$ and $y_2 ; p_2 \times 1$. Similarly, decompose $\beta = (\beta_1', \beta_2')'$ and

  $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \quad S_{12} ; p_1 \times p_2.$

Let $\lambda$ be the likelihood ratio criterion for testing the hypothesis $\beta_2 = 0$; then

  $-2 \log \lambda = n \log\left\{ 1 + \frac{g^2 (D^2 - D_1^2)}{n - 2 + g^2 D_1^2} \right\}, \quad (12)$

where $g = \{ (n_1 n_2)/n \}^{1/2}$. The following lemma (see, e.g., Fujikoshi et al. 2010) is used.

Lemma 1 Let $D_1$ and $D$ be the sample Mahalanobis distances based on $y_1$ and $y$, respectively, and let $D_{2 \cdot 1}^2 = D^2 - D_1^2$. Similarly, the corresponding population quantities are expressed as $\Delta_1$, $\Delta$ and $\Delta_{2 \cdot 1}^2$. Then, it holds that

(1) $D_1^2 = (n - 2)\, g^{-2} R$, where $R = \chi^2_{p_1}(g^2 \Delta_1^2) \{ \chi^2_{n - p_1 - 1} \}^{-1}$;

(2) $D^2 = (n - 2)\, g^{-2} \left[ \chi^2_{p_2}\!\left( g^2 \Delta_{2 \cdot 1}^2 (1 + R)^{-1} \right) \{ \chi^2_{n - p - 1} \}^{-1} (1 + R) + R \right]$;

(3) $\dfrac{g^2 (D^2 - D_1^2)}{n - 2 + g^2 D_1^2} = \chi^2_{p_2}\!\left( g^2 \Delta_{2 \cdot 1}^2 (1 + R)^{-1} \right) \{ \chi^2_{n - p - 1} \}^{-1}$.

Here, $\chi^2_{p_1}(\cdot)$, $\chi^2_{n - p_1 - 1}$, $\chi^2_{p_2}(\cdot)$, and $\chi^2_{n - p - 1}$ are independent chi-square variates.

Related to the conditional distribution of the right-hand side of (3) with $p_2 = 1$ and $m = n - p - 1$, consider the random variable defined by

  $V = \frac{\chi^2_1(\lambda^2)}{\chi^2_m} - \frac{1 + \lambda^2}{m - 2}, \quad (13)$

where $\chi^2_1(\lambda^2)$ and $\chi^2_m$ are independent. We can express $V$ as

  $V = U_1 U_2 + (m - 2)^{-1} U_1 + (1 + \lambda^2) U_2, \quad (14)$

in terms of the centralized variables $U_1$ and $U_2$ defined by

  $U_1 = \chi^2_1(\lambda^2) - (1 + \lambda^2), \quad U_2 = \frac{1}{\chi^2_m} - \frac{1}{m - 2}. \quad (15)$

It is well known (see, e.g., Tiku 1985) that

  $E(U_1) = 0, \quad E(U_1^2) = 2(1 + 2\lambda^2), \quad E(U_1^3) = 8(1 + 3\lambda^2), \quad E(U_1^4) = 48(1 + 4\lambda^2) + 12(1 + 2\lambda^2)^2.$

Furthermore,

  $E(U_2^k) = E\left\{ \left( \frac{1}{\chi^2_m} - \frac{1}{m - 2} \right)^{k} \right\} = \sum_{i=0}^{k} {}_k C_i\, E\left\{ \left( \frac{1}{\chi^2_m} \right)^{i} \right\} \left( -\frac{1}{m - 2} \right)^{k - i} = \sum_{i=1}^{k} {}_k C_i \frac{1}{(m - 2) \cdots (m - 2i)} \left( -\frac{1}{m - 2} \right)^{k - i} + \left( -\frac{1}{m - 2} \right)^{k}.$

These give the first four moments of $V$. In particular, we use the following results.

Lemma 2 Let $V$ be the random variable defined by (13). Suppose that $\lambda^2 = O(m)$. Then

  $E(V) = 0, \quad E(V^2) = \frac{2\{ (m - 1)(1 + 2\lambda^2) + \lambda^4 \}}{(m - 2)^2 (m - 4)}, \quad E(V^3) = O(m^{-2}), \quad E(V^4) = O(m^{-2}).$

Proof of Theorem 1 First, we show [F1] $\to 0$. Let $i \in j_*$. Then $(-i) \notin \mathcal{F}_+$, and hence

  $\Delta^2_{(-i)} < \Delta^2, \quad \Delta^2_{\{i\} \cdot (-i)} > 0.$

Using (12) and Lemma 1(3),

  $T_{d,i} = n \log\left\{ 1 + \frac{\chi^2_1\!\left( g^2 \Delta^2_{\{i\} \cdot (-i)} (1 + R_i)^{-1} \right)}{\chi^2_{n - p - 1}} \right\} - d,$

where $R_i = \chi^2_{p - 1}(g^2 \Delta^2_{(-i)}) \{ \chi^2_{n - p} \}^{-1}$. Here, since $j_*$ is finite, by showing $n^{-1} T_{d,i} \xrightarrow{p} t_i > 0$ or $T_{d,i} \xrightarrow{p} \infty$, we obtain $P(T_{d,i} \le 0) \to 0$, and hence [F1] $\to 0$. It is easily seen that

  $R_i \simeq \frac{p + g^2 \Delta^2_{(-i)}}{n - p},$

where $g = \{(n_1 n_2)/n\}^{1/2}$ and $\simeq$ means "asymptotically equivalent", and hence

  $(1 + R_i)^{-1} \simeq \frac{n - p}{n + g^2 \Delta^2_{(-i)}}.$

Therefore, we obtain

  $\frac{1}{n} T_{d,i} \xrightarrow{p} \lim \log\left( 1 + \frac{g^2 \Delta^2_{\{i\} \cdot (-i)}}{n + g^2 \Delta^2_{(-i)}} \right) > 0,$

which implies our assertion. Next, consider showing [F2] $\to 0$. For any $i \notin j_*$, $\Delta^2_{(-i)} = \Delta^2$. Therefore, using Lemma 1(3), we have

  $T_{d,i} = n \log\left( 1 + \frac{\chi^2_1}{\chi^2_{n - p - 1}} \right) - d, \quad (16)$

whose distribution does not depend on $i$. Here, $\chi^2_1$ and $\chi^2_{n - p - 1}$ are independent chi-square variates with 1 and $n - p - 1$ degrees of freedom. This implies that

  $T_{d,i} > 0 \iff \frac{\chi^2_1}{\chi^2_{n - p - 1}} > e^{d/n} - 1.$

Noting that $E[\chi^2_1 / \chi^2_{n - p - 1}] = (n - p - 3)^{-1}$, let

  $U = \frac{\chi^2_1}{\chi^2_{n - p - 1}} - \frac{1}{n - p - 3}.$

Then, since $e^{d/n} - 1 - (n - p - 3)^{-1} > h$,

  $P(T_{d,i} > 0) = P\left( U > e^{d/n} - 1 - \frac{1}{n - p - 3} \right) \le P(U > h).$

Furthermore, using the Markov inequality, we have

  $P(T_{d,i} > 0) \le P(|U| > h) \le h^{-2\ell} E(U^{2\ell}), \quad \ell = 1, 2, \dots$

Furthermore, it is easily seen that $E(U^{2\ell}) = O(n^{-2\ell})$, using, e.g., a theorem in Fujikoshi et al. (2010). When $h = O(n^{-a})$,

  $h^{-2\ell} E(U^{2\ell}) = O(n^{-2(1 - a)\ell}).$

Choosing $\ell$ such that $\ell > (1 - a)^{-1}$, we have [F2] $\to 0$.

Proof of Theorem 2 First, note that in the proof of [F2] $\to 0$ in Theorem 1, assumption A3 is not used. This implies the assertion [F2] $\to 0$ in Theorem 2. Now, we consider showing [F1] $\to 0$ when $p_* = O(p)$ and $\Delta^2 = O(p)$. In this case, $p_*$ tends to $\infty$. Based on the proof of Theorem 1, we can express $T_{d,i}$ for $i \in j_*$ as

  $T_{d,i} = n \log\left\{ 1 + \frac{\chi^2_1(\hat\lambda^2_i)}{\chi^2_{n - p - 1}} \right\} - d,$

where $\hat\lambda^2_i = g^2 \Delta^2_{\{i\} \cdot (-i)} (1 + R_i)^{-1}$ and $R_i = \chi^2_{p - 1}(g^2 \Delta^2_{(-i)}) \{ \chi^2_{n - p} \}^{-1}$. Note that $\chi^2_1$ and $\chi^2_{n - p - 1}$ are independent of $R_i$, and hence of $\hat\lambda^2_i$. Then, we have

  $P(T_{d,i} \le 0) = P(\hat V \le \hat h), \quad (17)$

where

  $\hat V = \frac{\chi^2_1(\hat\lambda^2_i)}{\chi^2_{n - p - 1}} - \frac{1 + \hat\lambda^2_i}{n - p - 3}, \quad \hat h = e^{d/n} - 1 - \frac{1 + \hat\lambda^2_i}{n - p - 3}.$

Considering the conditional distribution of the right-hand side in (17), we have

  $P(\hat V \le \hat h) = E_{\hat\lambda^2_i}\left\{ Q(\hat\lambda^2_i) \right\}, \quad (18)$

where $Q(\lambda^2_i) = P(\hat V \le \hat h \mid \hat\lambda^2_i = \lambda^2_i) = P(\tilde V \le \tilde h)$. Here,

  $\tilde V = \frac{\chi^2_1(\lambda^2_i)}{\chi^2_{n - p - 1}} - \frac{1 + \lambda^2_i}{n - p - 3}, \quad \tilde h = e^{d/n} - 1 - \frac{1 + \lambda^2_i}{n - p - 3}.$

Using assumption A6, it can be seen that

  $\hat\lambda^2_i \simeq (1 - c)\, c^{-1} \theta^2_i\, p \equiv \lambda^2_{i0}, \quad \text{and} \quad \lambda^2_i = O(p^b).$

Now, we consider the probability $P(\tilde V \le \tilde h)$ when $\lambda^2_i = \lambda^2_{i0}$. From the assumption $r < b$, for large $n$, $\tilde h < 0$. Therefore, for large $n$, we have

  $P(\tilde V \le \tilde h) \le P(|\tilde V| \ge |\tilde h|) \le \tilde h^{-4} E(\tilde V^4).$

From Lemma 2, $E(\tilde V^4) = O(n^{-2})$. Noting that $\tilde h = O(n^{-(1 - b)})$, we have

  $\tilde h^{-4} E(\tilde V^4) = O(n^{4(1 - b) - 2}),$

whose order is $O(n^{-(1 + 3\delta)})$ if we choose $b$ such that $b > (3/4)(1 + \delta)$. Therefore, we have $P(T_{d,i} \le 0) = O(n^{-(1 + 3\delta)})$, which implies [F1] $\to 0$.

Proof of Theorem 3 The assertion [F1] $\to 0$ follows from the proof of [F1] $\to 0$ in Theorem 1. For a proof of [F2] $\to 0$, it is enough to show that $T_{d,i} \xrightarrow{p} -\infty$ for $i \notin j_*$, since $p$ is fixed. From (16), the limiting distribution of $T_{d,i} + d$ is $\chi^2_1$. Since $d \to \infty$, this implies [F2] $\to 0$.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. –). Budapest: Akadémiai Kiadó.
Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53.
Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria. Journal of Multivariate Analysis, 17.
Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73.
Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149.
Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: High-dimensional and large-sample approximations. Hoboken, NJ: Wiley.
Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and Cp-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144.
Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77.
Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123.
Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.
Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41.
McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32.
Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhya, Series A, 75.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6.
Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 6 (pp. –). New York: Wiley.
Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103.
Witten, D. W., & Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B, 73.
Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group Technical Report.
Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9.
Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Comparison with RSS-Based Model Selection Criteria for Selecting Growth Functions TR-No. 14-07, Hiroshima Statistical Research Group, 1 12 Comparison with RSS-Based Model Selection Criteria for Selecting Growth Functions Keisuke Fukui 2, Mariko Yamamura 1 and Hirokazu Yanagihara 2 1

More information

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification TR-No. 13-08, Hiroshima Statistical Research Group, 1 23 A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

More information

Modfiied Conditional AIC in Linear Mixed Models

Modfiied Conditional AIC in Linear Mixed Models CIRJE-F-895 Modfiied Conditional AIC in Linear Mixed Models Yuki Kawakubo Graduate School of Economics, University of Tokyo Tatsuya Kubokawa University of Tokyo July 2013 CIRJE Discussion Papers can be

More information

STRONG CONSISTENCY OF THE AIC, BIC, C P KOO METHODS IN HIGH-DIMENSIONAL MULTIVARIATE LINEAR REGRESSION

STRONG CONSISTENCY OF THE AIC, BIC, C P KOO METHODS IN HIGH-DIMENSIONAL MULTIVARIATE LINEAR REGRESSION STRONG CONSISTENCY OF THE AIC, BIC, C P KOO METHODS IN HIGH-DIMENSIONAL MULTIVARIATE LINEAR REGRESSION AND By Zhidong Bai,, Yasunori Fujikoshi, and Jiang Hu, Northeast Normal University and Hiroshima University

More information

Expected probabilities of misclassification in linear discriminant analysis based on 2-Step monotone missing samples

Expected probabilities of misclassification in linear discriminant analysis based on 2-Step monotone missing samples Expected probabilities of misclassification in linear discriminant analysis based on 2-Step monotone missing samples Nobumichi Shutoh, Masashi Hyodo and Takashi Seo 2 Department of Mathematics, Graduate

More information

Likelihood Ratio Tests in Multivariate Linear Model

Likelihood Ratio Tests in Multivariate Linear Model Likelihood Ratio Tests in Multivariate Linear Model Yasunori Fujikoshi Department of Mathematics, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi Hiroshima, Hiroshima 739-8626,

More information

Bias-corrected AIC for selecting variables in Poisson regression models

Bias-corrected AIC for selecting variables in Poisson regression models Bias-corrected AIC for selecting variables in Poisson regression models Ken-ichi Kamo (a), Hirokazu Yanagihara (b) and Kenichi Satoh (c) (a) Corresponding author: Department of Liberal Arts and Sciences,

More information

Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions

Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions c 215 FORMATH Research Group FORMATH Vol. 14 (215): 27 39, DOI:1.15684/formath.14.4 Comparison with Residual-Sum-of-Squares-Based Model Selection Criteria for Selecting Growth Functions Keisuke Fukui 1,

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Testing Some Covariance Structures under a Growth Curve Model in High Dimension

Testing Some Covariance Structures under a Growth Curve Model in High Dimension Department of Mathematics Testing Some Covariance Structures under a Growth Curve Model in High Dimension Muni S. Srivastava and Martin Singull LiTH-MAT-R--2015/03--SE Department of Mathematics Linköping

More information

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices

Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices Takahiro Nishiyama a,, Masashi Hyodo b, Takashi Seo a, Tatjana Pavlenko c a Department of Mathematical

More information

Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives

Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives TR-No. 14-06, Hiroshima Statistical Research Group, 1 11 Illustration of the Varying Coefficient Model for Analyses the Tree Growth from the Age and Space Perspectives Mariko Yamamura 1, Keisuke Fukui

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Statistical Inference On the High-dimensional Gaussian Covarianc

Statistical Inference On the High-dimensional Gaussian Covarianc Statistical Inference On the High-dimensional Gaussian Covariance Matrix Department of Mathematical Sciences, Clemson University June 6, 2011 Outline Introduction Problem Setup Statistical Inference High-Dimensional

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification

Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification (Last Modified: May 22, 2012) Yusuke Hashiyama, Hirokazu Yanagihara 1 and Yasunori

More information

A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion

A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion TR-No. 17-07, Hiroshima Statistical Research Group, 1 24 A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion Mineaki Ohishi, Hirokazu

More information

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE J Jaan Statist Soc Vol 34 No 2004 9 26 ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE Yasunori Fujikoshi*, Tetsuto Himeno

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa On the Use of Marginal Likelihood in Model Selection Peide Shi Department of Probability and Statistics Peking University, Beijing 100871 P. R. China Chih-Ling Tsai Graduate School of Management University

More information

EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS. Haruo Yanai*

EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS. Haruo Yanai* J. Japan Statist. Soc. Vol. 11 No. 1 1981 43-53 EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS Haruo Yanai* Generalized expressions of canonical correlation

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 21 Model selection Choosing the best model among a collection of models {M 1, M 2..., M N }. What is a good model? 1. fits the data well (model

More information

Testing Block-Diagonal Covariance Structure for High-Dimensional Data

Testing Block-Diagonal Covariance Structure for High-Dimensional Data Testing Block-Diagonal Covariance Structure for High-Dimensional Data MASASHI HYODO 1, NOBUMICHI SHUTOH 2, TAKAHIRO NISHIYAMA 3, AND TATJANA PAVLENKO 4 1 Department of Mathematical Sciences, Graduate School

More information

On Autoregressive Order Selection Criteria

On Autoregressive Order Selection Criteria On Autoregressive Order Selection Criteria Venus Khim-Sen Liew Faculty of Economics and Management, Universiti Putra Malaysia, 43400 UPM, Serdang, Malaysia This version: 1 March 2004. Abstract This study

More information

The Role of "Leads" in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji

The Role of Leads in the Dynamic Title of Cointegrating Regression Models. Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji he Role of "Leads" in the Dynamic itle of Cointegrating Regression Models Author(s) Hayakawa, Kazuhiko; Kurozumi, Eiji Citation Issue 2006-12 Date ype echnical Report ext Version publisher URL http://hdl.handle.net/10086/13599

More information

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion

Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS020) p.3863 Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Jinfang Wang and

More information

Discrepancy-Based Model Selection Criteria Using Cross Validation

Discrepancy-Based Model Selection Criteria Using Cross Validation 33 Discrepancy-Based Model Selection Criteria Using Cross Validation Joseph E. Cavanaugh, Simon L. Davies, and Andrew A. Neath Department of Biostatistics, The University of Iowa Pfizer Global Research

More information

T 2 Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem

T 2 Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem T Type Test Statistic and Simultaneous Confidence Intervals for Sub-mean Vectors in k-sample Problem Toshiki aito a, Tamae Kawasaki b and Takashi Seo b a Department of Applied Mathematics, Graduate School

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

On testing the equality of mean vectors in high dimension

On testing the equality of mean vectors in high dimension ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 17, Number 1, June 2013 Available online at www.math.ut.ee/acta/ On testing the equality of mean vectors in high dimension Muni S.

More information

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA Kazuyuki Koizumi Department of Mathematics, Graduate School of Science Tokyo University of Science 1-3, Kagurazaka,

More information

On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control

On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control Takahiro Nishiyama a a Department of Mathematical Information Science, Tokyo University of Science,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION

A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION J. Japan Statist. Soc. Vol. 32 No. 2 2002 165 181 A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION Hisayuki Tsukuma* The problem of estimation in multivariate linear calibration

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

COMPARISON OF FIVE TESTS FOR THE COMMON MEAN OF SEVERAL MULTIVARIATE NORMAL POPULATIONS

COMPARISON OF FIVE TESTS FOR THE COMMON MEAN OF SEVERAL MULTIVARIATE NORMAL POPULATIONS Communications in Statistics - Simulation and Computation 33 (2004) 431-446 COMPARISON OF FIVE TESTS FOR THE COMMON MEAN OF SEVERAL MULTIVARIATE NORMAL POPULATIONS K. Krishnamoorthy and Yong Lu Department

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC A Simple Approach to Inference in Random Coefficient Models March 8, 1988 Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC 27695-8203 Key Words

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-step monotone missing data Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Regularization: Ridge Regression and the LASSO

Regularization: Ridge Regression and the LASSO Agenda Wednesday, November 29, 2006 Agenda Agenda 1 The Bias-Variance Tradeoff 2 Ridge Regression Solution to the l 2 problem Data Augmentation Approach Bayesian Interpretation The SVD and Ridge Regression

More information

Obtaining simultaneous equation models from a set of variables through genetic algorithms

Obtaining simultaneous equation models from a set of variables through genetic algorithms Obtaining simultaneous equation models from a set of variables through genetic algorithms José J. López Universidad Miguel Hernández (Elche, Spain) Domingo Giménez Universidad de Murcia (Murcia, Spain)

More information

Large Sample Theory For OLS Variable Selection Estimators

Large Sample Theory For OLS Variable Selection Estimators Large Sample Theory For OLS Variable Selection Estimators Lasanthi C. R. Pelawa Watagoda and David J. Olive Southern Illinois University June 18, 2018 Abstract This paper gives large sample theory for

More information

Ken-ichi Kamo Department of Liberal Arts and Sciences, Sapporo Medical University, Hokkaido, Japan

Ken-ichi Kamo Department of Liberal Arts and Sciences, Sapporo Medical University, Hokkaido, Japan A STUDY ON THE BIAS-CORRECTION EFFECT OF THE AIC FOR SELECTING VARIABLES IN NOR- MAL MULTIVARIATE LINEAR REGRESSION MOD- ELS UNDER MODEL MISSPECIFICATION Authors: Hirokazu Yanagihara Department of Mathematics,

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED

A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED A TIME SERIES PARADOX: UNIT ROOT TESTS PERFORM POORLY WHEN DATA ARE COINTEGRATED by W. Robert Reed Department of Economics and Finance University of Canterbury, New Zealand Email: bob.reed@canterbury.ac.nz

More information

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data Outline Mixed models in R using the lme4 package Part 3: Longitudinal data Douglas Bates Longitudinal data: sleepstudy A model with random effects for intercept and slope University of Wisconsin - Madison

More information

Decomposition of Parsimonious Independence Model Using Pearson, Kendall and Spearman s Correlations for Two-Way Contingency Tables

Decomposition of Parsimonious Independence Model Using Pearson, Kendall and Spearman s Correlations for Two-Way Contingency Tables International Journal of Statistics and Probability; Vol. 7 No. 3; May 208 ISSN 927-7032 E-ISSN 927-7040 Published by Canadian Center of Science and Education Decomposition of Parsimonious Independence

More information

AR-order estimation by testing sets using the Modified Information Criterion

AR-order estimation by testing sets using the Modified Information Criterion AR-order estimation by testing sets using the Modified Information Criterion Rudy Moddemeijer 14th March 2006 Abstract The Modified Information Criterion (MIC) is an Akaike-like criterion which allows

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA STATISTICS IN MEDICINE, VOL. 17, 59 68 (1998) CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA J. K. LINDSEY AND B. JONES* Department of Medical Statistics, School of Computing Sciences,

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

On Properties of QIC in Generalized. Estimating Equations. Shinpei Imori

On Properties of QIC in Generalized. Estimating Equations. Shinpei Imori On Properties of QIC in Generalized Estimating Equations Shinpei Imori Graduate School of Engineering Science, Osaka University 1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan E-mail: imori.stat@gmail.com

More information

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection

Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Bootstrap Variants of the Akaike Information Criterion for Mixed Model Selection Junfeng Shang, and Joseph E. Cavanaugh 2, Bowling Green State University, USA 2 The University of Iowa, USA Abstract: Two

More information

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data

An Akaike Criterion based on Kullback Symmetric Divergence in the Presence of Incomplete-Data An Akaike Criterion based on Kullback Symmetric Divergence Bezza Hafidi a and Abdallah Mkhadri a a University Cadi-Ayyad, Faculty of sciences Semlalia, Department of Mathematics, PB.2390 Marrakech, Moroco

More information

On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors

On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors Takashi Seo a, Takahiro Nishiyama b a Department of Mathematical Information Science, Tokyo University

More information

Estimation of multivariate 3rd moment for high-dimensional data and its application for testing multivariate normality

Estimation of multivariate 3rd moment for high-dimensional data and its application for testing multivariate normality Estimation of multivariate rd moment for high-dimensional data and its application for testing multivariate normality Takayuki Yamada, Tetsuto Himeno Institute for Comprehensive Education, Center of General

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Outlier detection in ARIMA and seasonal ARIMA models by. Bayesian Information Type Criteria

Outlier detection in ARIMA and seasonal ARIMA models by. Bayesian Information Type Criteria Outlier detection in ARIMA and seasonal ARIMA models by Bayesian Information Type Criteria Pedro Galeano and Daniel Peña Departamento de Estadística Universidad Carlos III de Madrid 1 Introduction The

More information

arxiv: v1 [math.st] 2 Apr 2016

arxiv: v1 [math.st] 2 Apr 2016 NON-ASYMPTOTIC RESULTS FOR CORNISH-FISHER EXPANSIONS V.V. ULYANOV, M. AOSHIMA, AND Y. FUJIKOSHI arxiv:1604.00539v1 [math.st] 2 Apr 2016 Abstract. We get the computable error bounds for generalized Cornish-Fisher

More information

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club

Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club Summary and discussion of: Exact Post-selection Inference for Forward Stepwise and Least Angle Regression Statistics Journal Club 36-825 1 Introduction Jisu Kim and Veeranjaneyulu Sadhanala In this report

More information

Sparse Covariance Selection using Semidefinite Programming
