Consistency of Distance-based Criterion for Selection of Variables in High-dimensional Two-Group Discriminant Analysis


Tetsuro Sakurai and Yasunori Fujikoshi

Center of General Education, Tokyo University of Science, Suwa, Toyohira, Chino, Nagano, Japan
Department of Mathematics, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi Hiroshima, Hiroshima, Japan

Abstract

This paper is concerned with the selection of variables in two-group discriminant analysis with the same covariance matrix. We propose a distance-based criterion (DC) drawing on the distance contribution of each variable. The selection method can be applied to high-dimensional data. Sufficient conditions for the distance-based criterion to be consistent are provided when the dimension and the sample size are large. Our results, and tendencies therein, are explored numerically through a Monte Carlo simulation.

AMS 2000 subject classification: primary 62H12; secondary 62H30

Key Words and Phrases: Consistency, Discriminant analysis, High-dimensional framework, Selection of variables, Distance-based criterion.

1. Introduction

This paper is concerned with the variable selection problem in two-group discriminant analysis with the same covariance matrix. Several methods have been proposed for high-dimensional data as well as for large-sample data. One class of methods is based on model selection criteria such as AIC and BIC; see, e.g., Fujikoshi (1984) and Nishii et al. (1988). There are methods based on misclassification errors, by McLachlan (1976), Fujikoshi (1985), Hyodo and Kubokawa (2014), and Yamada et al. (2017). For high-dimensional data, there are the Lasso and other regularization methods, by Clemmensen et al. (2011), Witten and Tibshirani (2011), etc.

In this paper we propose a distance-based criterion based on the distance contribution of each variable, which is useful for high-dimensional data as well as large-sample data. Here, the distance contribution of a variable is measured after removing the effect of the other variables. The criterion involves a constant term, which should be determined through some optimality consideration. We propose a class of constants for which the criterion is consistent when the dimension and the sample size are large. This class depends on the population parameters, so, with a view toward actual use, we also propose a class of estimators for the constant term for which consistency is preserved. Our results, and tendencies therein, are explored numerically through a Monte Carlo simulation.

The remainder of the present paper is organized as follows. In Section 2, we present the relevant notation and the distance-based criterion. In Section 3 we derive sufficient conditions for the distance-based criterion to be consistent in a high-dimensional setting. In Section 4 we propose a class of estimators for the constant term. The proposed distance-based criterion is examined through a Monte Carlo simulation in Section 5. In Section 6, conclusions are offered. All proofs of our results are provided in the Appendix.

2. Distance-based Criterion

In two-group discriminant analysis, suppose that we have independent samples $y_1^{(i)}, \ldots, y_{n_i}^{(i)}$ from $p$-dimensional normal distributions $N_p(\mu^{(i)}, \Sigma)$, $i = 1, 2$. Let $Y$ be the total sample matrix defined by

  $Y = (y_1^{(1)}, \ldots, y_{n_1}^{(1)}, y_1^{(2)}, \ldots, y_{n_2}^{(2)})'$.

Let $\Delta$ and $D$ be the population and the sample Mahalanobis distances defined by

  $\Delta = \{(\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})\}^{1/2}$

and

  $D = \{(\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)})\}^{1/2}$,

respectively. Here $\bar{x}^{(1)}$ and $\bar{x}^{(2)}$ are the sample mean vectors, and $S$ is the pooled sample covariance matrix based on $n = n_1 + n_2$ samples. Let $\beta$ be the coefficient vector of the population discriminant function, given by

  $\beta = \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) = (\beta_1, \ldots, \beta_p)'$.  (2.1)

Suppose that $j$ denotes a subset of $\omega = \{1, \ldots, p\}$ containing $p_j$ elements, and $y_j$ denotes the $p_j$-vector consisting of the elements of $y$ indexed by the elements of $j$. We use the notation $D_j$ and $D_\omega$ for $D$ based on $y_j$ and $y_\omega (= y)$, respectively.

One of the approaches to variable selection is to consider a set of models as follows. For each subset $j$ of $\omega = \{1, \ldots, p\}$, consider a variable selection model $M_j$ defined by

  $M_j$: $\beta_i \ne 0$ if $i \in j$, and $\beta_i = 0$ if $i \notin j$.  (2.2)

We identify the selection of $M_j$ with the selection of $y_j$. Let $\mathrm{AIC}_j$ be the AIC for $M_j$. Then, it is known (see, e.g., Fujikoshi 1985) that

  $A_j = \mathrm{AIC}_j - \mathrm{AIC}_\omega = n \log\left\{1 + \dfrac{g^2 (D_\omega^2 - D_j^2)}{n - 2 + g^2 D_j^2}\right\} - 2(p - p_j)$,  (2.3)

where $g = (n_1 n_2 / n)^{1/2}$. Similarly, let $\mathrm{BIC}_j$ be the BIC for $M_j$; then

  $B_j = \mathrm{BIC}_j - \mathrm{BIC}_\omega = n \log\left\{1 + \dfrac{g^2 (D_\omega^2 - D_j^2)}{n - 2 + g^2 D_j^2}\right\} - (\log n)(p - p_j)$.  (2.4)

In a large-sample framework, it is known (see Fujikoshi 1985, Nishii et al. 1988) that AIC is not consistent, but BIC is consistent. The variable selection methods based on AIC and BIC are given as $\min_j \mathrm{AIC}_j$ and $\min_j \mathrm{BIC}_j$, respectively. Such model selection methods therefore become computationally onerous when $p$ is large, since they involve minimizing over the $2^p$ subsets of $\omega$.

To circumvent this issue, we consider a distance-based criterion (DC) drawing on the distance contribution of each variable. The contribution of $y_i$ to the distance between the two groups may be measured by

  $D^2 - D_{(i)}^2$,  (2.5)

where $(i)$ denotes the subset of $\omega = \{1, \ldots, p\}$ obtained by omitting $i$ from $\omega$. If $D^2 - D_{(i)}^2$ is large, $y_i$ may be considered to contribute substantially to the discrimination; conversely, if $D^2 - D_{(i)}^2$ is small, $y_i$ may be considered to contribute little. Let us define

  $D_{d,i} = D^2 - D_{(i)}^2 - d$, $\quad i = 1, \ldots, p$,  (2.6)

where $d$ is a positive constant. We propose a distance-based criterion for the selection of variables defined by selecting the set of suffixes, or the set of variables, given by

  $\mathrm{DC}_d = \{i \in \omega \mid D_{d,i} > 0\}$,  (2.7)

or $\{y_i \in \{y_1, \ldots, y_p\} \mid D_{d,i} > 0\}$. The notation $\hat{j}_{\mathrm{DC}_d}$ is also used for $\mathrm{DC}_d$.

For a distance-based criterion, an important problem is how to decide the constant term $d$. In Section 3 we derive a range of $d$ for which the criterion has high-dimensional consistency. The range depends on the population parameters. For practical use, we also propose a class of estimators

for $d$ given by

  $\hat{d} = \dfrac{n - 2 + g^2 D^2}{(n - p - 1) g^2} \left\{ a (np)^{1/3} + \dfrac{n - p - 1}{n - p - 3} \right\}$,  (2.8)

where $a$ is a constant satisfying $a = O(1)$. It is shown that $\mathrm{DC}_{\hat{d}}$ is consistent under a high-dimensional framework.

3. High-dimensional Consistency

For studying consistency properties of the variable selection criterion $\mathrm{DC}_d$, it is assumed that the true model $M_*$ is specified through the values of $\mu^{(1)}$, $\mu^{(2)}$ and $\Sigma$. For simplicity, we write the variable selection models in terms of $\beta$, or $M_j$. Assume that the true model $M_*$ is included in the full model $M_\omega$, and let the minimum model including $M_*$ be $M_{j_*}$. For notational simplicity, we regard the true model $M_*$ as $M_{j_*}$. Let $\mathcal{F}$ be the entire set of candidate models, i.e., the nonempty subsets of $\omega$:

  $\mathcal{F} = \{\{1\}, \ldots, \{p\}, \{1, 2\}, \ldots, \{1, \ldots, p\}\}$.

We separate $\mathcal{F}$ into two sets: the set of overspecified models, $\mathcal{F}_+ = \{j \in \mathcal{F} \mid j \supseteq j_*\}$, and the set of underspecified models, $\mathcal{F}_- = \mathcal{F}_+^c \cap \mathcal{F}$.

Here we list some of our main assumptions:

A1 (The true model): $M_{j_*} \in \mathcal{F}$.
A2 (The high-dimensional asymptotic framework): $p \to \infty$, $n \to \infty$, $p/n \to c \in (0, 1)$.
A3: $p_{j_*}$ is finite, and $\Delta^2 = O(1)$.
A4: $n_i / n \to k_i > 0$, $i = 1, 2$.
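As a concrete illustration of the criterion (2.7) combined with the estimated constant term (2.8) (taking $a = 1$), the following sketch estimates its selection rates by simulation. This is our own illustrative code, not from the paper; the design ($\Sigma = I_p$, mean vectors $\pm\alpha$ on the first $p_{j_*}$ coordinates) anticipates the numerical study of Section 5, and all function names are ours.

```python
import numpy as np

def pooled_mahalanobis_sq(X1, X2):
    """Squared sample Mahalanobis distance D^2 between two groups,
    using the pooled covariance S with divisor n - 2."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False)
         + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return float(diff @ np.linalg.solve(S, diff))

def d_hat1(D2, n1, n2, p):
    """Estimated constant term (2.8) with a = 1."""
    n = n1 + n2
    g2 = n1 * n2 / n
    return ((n - 2 + g2 * D2) / ((n - p - 1) * g2)
            * ((n * p) ** (1 / 3) + (n - p - 1) / (n - p - 3)))

def selection_rates(n1, n2, p, p_star, alpha, reps, seed=0):
    """Monte Carlo rates of under-, true and over-selection for DC
    with the estimated threshold; the first p_star variables are true."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[:p_star] = alpha
    true_set = list(range(p_star))
    under = true = over = 0
    for _ in range(reps):
        X1 = rng.normal(size=(n1, p)) + mu
        X2 = rng.normal(size=(n2, p)) - mu
        D2 = pooled_mahalanobis_sq(X1, X2)
        d = d_hat1(D2, n1, n2, p)
        sel = []
        for i in range(p):
            keep = [k for k in range(p) if k != i]
            # D_{d,i} > 0 means variable i is selected, as in (2.6)-(2.7)
            if D2 - pooled_mahalanobis_sq(X1[:, keep], X2[:, keep]) - d > 0:
                sel.append(i)
        if sel == true_set:
            true += 1
        elif set(sel) >= set(true_set):
            over += 1
        else:
            under += 1
    return under / reps, true / reps, over / reps
```

Only $p + 1$ Mahalanobis distances are evaluated per replication, in contrast with the $2^p$ subsets examined by full AIC/BIC minimization.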

Relating to our proof of the consistency of $\mathrm{DC}_d$ in (2.7), note that

  $\mathrm{DC}_d = j_* \iff D_{d,i} > 0$ for $i \in j_*$, and $D_{d,i} \le 0$ for $i \notin j_*$.

Therefore,

  $P(\mathrm{DC}_d = j_*) = P\Big( \bigcap_{i \in j_*} \{D_{d,i} > 0\} \cap \bigcap_{i \notin j_*} \{D_{d,i} \le 0\} \Big)$
  $= 1 - P\Big( \bigcup_{i \in j_*} \{D_{d,i} \le 0\} \cup \bigcup_{i \notin j_*} \{D_{d,i} > 0\} \Big)$
  $\ge 1 - \sum_{i \in j_*} P(D_{d,i} \le 0) - \sum_{i \notin j_*} P(D_{d,i} \ge 0)$.

Our result will be obtained by proving the following:

  [F1] $\sum_{i \in j_*} P(D_{d,i} \le 0) \to 0$,  (3.1)
  [F2] $\sum_{i \notin j_*} P(D_{d,i} \ge 0) \to 0$.  (3.2)

[F1] bounds the probability that $\mathrm{DC}_d$ fails to select some true variable; [F2] bounds the probability that $\mathrm{DC}_d$ selects some variable outside the true set. Our consistency result depends on

  $\tau_{\min} = \min_{i \in j_*} \tau_i$,  (3.3)

where $\tau_i = \Delta^2 - \Delta_{(i)}^2$, $i \in \omega$.

Theorem 3.1. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Let $b = \tau_{\min}/(1 - c)$, where $\tau_{\min}$ is defined by (3.3). Then, if $0 < d < b$, the distance-based criterion $\mathrm{DC}_d$ satisfies [F1], i.e.,

  $\sum_{i \in j_*} P(D_{d,i} \le 0) \to 0$.  (3.4)

Related to a sufficient condition for (3.2), we use the following result: for $i \notin j_*$,

  $E\{D^2 - D_{(i)}^2\} = \dfrac{n - 2}{(n - p - 3)(n - p - 2)} \left\{ \dfrac{n - 3}{g^2} + \Delta^2 \right\} \equiv q(p, n, \Delta^2)$.  (3.5)

Theorem 3.2. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, if $d = O(1)$ and $d > q(p, n, \Delta^2)$, the distance-based criterion $\mathrm{DC}_d$ satisfies

  $\sum_{i \notin j_*} P(D_{d,i} \ge 0) \to 0$.  (3.6)

Combining Theorems 3.1 and 3.2, a sufficient condition for $\mathrm{DC}_d$ to be consistent is given as follows.

Theorem 3.3. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, if $d = O(1)$ and

  $q(p, n, \Delta^2) < d < b = \tau_{\min}/(1 - c)$,  (3.7)

the distance-based criterion $\mathrm{DC}_d$ is consistent.

For actual use, it is necessary to obtain an estimator $\hat{d}$ of $d$. This problem is discussed in Section 4.

4. A Method of Determining the Constant Term

In the previous section we saw that the distance-based criterion $\mathrm{DC}_d$ is consistent if we use a constant term $d$ such that $d = O(1)$ and $q(p, n, \Delta^2) < d < \tau_{\min}/(1 - c)$, where $q(p, n, \Delta^2)$ is given by (3.5), $\tau_{\min} = \min_{i \in j_*} \tau_i$ and $\tau_i =$

$\Delta^2 - \Delta_{(i)}^2$. However, such a $d$ involves unknown parameters, so we need to estimate $d$ by using estimators of $\Delta^2$ and $\tau_{\min}$.

Here, we construct an estimator by using the fact that $\mathrm{DC}_d$ is related to the test of $\tau_i = 0$, or $\beta_i = 0$. The likelihood ratio test is based on (see, e.g., Rao 1973, Fujikoshi et al. 2010)

  $\dfrac{g^2 D_{\{i\} \cdot (i)}^2}{n - 2 + g^2 D_{(i)}^2}$,

whose null distribution is that of a ratio $\chi_1^2 / \chi_{n-p-1}^2$ of independent chi-squared variates $\chi_1^2$ and $\chi_{n-p-1}^2$, where $D_{\{i\} \cdot (i)}^2 = D^2 - D_{(i)}^2$. Let us define $\hat{d}$ as follows:

  $\hat{d} = \dfrac{n - 2 + g^2 D^2}{(n - p - 1) g^2} \left\{ a (np)^{1/3} + \dfrac{n - p - 1}{n - p - 3} \right\}$.  (4.1)

Here $a$ is a constant satisfying $a = O(1)$. Then, it can be shown that $\mathrm{DC}_{\hat{d}}$ is consistent.

Theorem 4.1. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, the distance-based criterion $\mathrm{DC}_{\hat{d}}$ with $\hat{d}$ in (4.1) is consistent.

As special cases of $\hat{d}$, we consider the following:

  $\hat{d}_1 = \dfrac{n - 2 + g^2 D^2}{(n - p - 1) g^2} \left\{ (np)^{1/3} + \dfrac{n - p - 1}{n - p - 3} \right\}$,  (4.2)

  $\hat{d}_2 = \dfrac{n - 2 + g^2 D^2}{(n - p - 1) g^2} \left\{ \Big(1 - \dfrac{p}{n}\Big)^2 (np)^{1/3} + \dfrac{n - p - 1}{n - p - 3} \right\}$.  (4.3)

The estimators $\hat{d}_1$ and $\hat{d}_2$ are obtained from $\hat{d}$ by choosing $a = 1$ and $a = (1 - p/n)^2$, respectively.

In Theorem 3.3, we saw that $\mathrm{DC}_d$ is consistent under a high-dimensional framework if $d$ satisfies $q(p, n, \Delta^2) < d < \tau_{\min}/(1 - c)$. The expectation of $\hat{d}$ is related to this range as follows:

  (1) $E(\hat{d}) > q(p, n, \Delta^2)$,  (2) $\lim_{p/n \to c} E(\hat{d}) = 0 \le \tau_{\min}/(1 - c)$.

The above result follows from

  $E[\hat{d}] = \dfrac{n - 2 + g^2 E(D^2)}{(n - p - 1) g^2} \left\{ a (np)^{1/3} + \dfrac{n - p - 1}{n - p - 3} \right\}$
  $= a (np)^{1/3} \dfrac{(n - 2)(n - 3 + g^2 \Delta^2)}{(n - p - 3)(n - p - 1) g^2} + \dfrac{(n - 2)(n - 3)}{(n - p - 3)^2 g^2} + \dfrac{(n - 2) \Delta^2}{(n - p - 3)^2}$.

Since $n - p - 3 < n - p - 2$, the sum of the last two terms exceeds $q(p, n, \Delta^2)$, which gives (1); under A2-A4 each term tends to 0, which gives (2).

5. Numerical Study

In this section we numerically explore the validity of our claims through three distance-based criteria, $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$. Here, $d_0$ is the midpoint of the interval $[q(p, n, \Delta^2), \tau_{\min}/(1 - c)]$ in (3.7). The true model was assumed as follows: the true dimension is $p_{j_*} = 4, 8$; the true mean vectors are $\mu^{(1)} = \alpha(1, \ldots, 1, 0, \ldots, 0)'$ and $\mu^{(2)} = -\alpha(1, \ldots, 1, 0, \ldots, 0)'$, each with $p_{j_*}$ nonzero elements; and the true covariance matrix is $\Sigma = I_p$. The selection rates associated with these criteria, based on $10^3$ replications, are given in Tables 5.1 to 5.4, where Under, True, and Over denote the underspecified models, the true model, and the overspecified models, respectively.

Table 5.1. Selection rates of $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$ for $p_{j_*} = 4$ and $\alpha = 1$

(Each table lists $n_1$, $n_2$, $p$ in the first three columns, followed by the selection rates Under / True / Over for each of $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$; the numerical entries are not recoverable from the source.)

Table 5.2. Selection rates of $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$ for $p_{j_*} = 4$ and $\alpha = 10$

Table 5.3. Selection rates of $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$ for $p_{j_*} = 8$ and $\alpha = 1$

Table 5.4. Selection rates of $\mathrm{DC}_{d_0}$, $\mathrm{DC}_{\hat{d}_1}$ and $\mathrm{DC}_{\hat{d}_2}$ for $p_{j_*} = 8$ and $\alpha = 10$

From Tables 5.1 to 5.4, we can identify the following tendencies.

In all the DC criteria with $d_0$, $\hat{d}_1$ and $\hat{d}_2$, there are cases in which the selection rates are high. In particular, when the dimension $p$ is small, as in $p = 10$, the selection rates are near 1. However, when $n$ and $p$ are close, the selection rates are not near 1 unless $n$ and $p$ are large. For the cases in which consistency cannot be observed in the tables, it is expected to emerge as the values of $n$ and $p$ become larger.

In all the DC criteria with $d_0$, $\hat{d}_1$ and $\hat{d}_2$, the selection rates of the true model increase as $n$ and $p$ become large.

In all the DC criteria with $d_0$, $\hat{d}_1$ and $\hat{d}_2$, the selection rates of the true model decrease as $n$ and $p$ become closer.

In all the DC criteria with $d_0$, $\hat{d}_1$ and $\hat{d}_2$, the selection rates of the true model decrease as the dimension $p$ increases.

In all the DC criteria with $d_0$, $\hat{d}_1$ and $\hat{d}_2$, the selection rates of the true model increase as the distance between the groups, i.e., $\alpha$, increases.

The DC criterion with $d_0$ tends not to select overspecified models when $p$ is small in comparison with $n$, and tends to select overspecified models as $p$ and $n$ become close.

The DC criterion with $\hat{d}_1$ does not select overspecified models, and has a tendency to choose the variables strictly.

The DC criterion with $\hat{d}_2$ tends not to select overspecified models when $p$ is small, and tends to select overspecified models as $n$ and $p$ become close, as does the DC criterion with $d_0$.

6. Concluding Remarks

In this paper we proposed a distance-based criterion (DC) for the variable selection problem, drawing on the distance contribution of each variable after removing the effect of the other variables. The criterion involves a constant term $d$, and is denoted by $\mathrm{DC}_d$. $\mathrm{DC}_d$ need not examine all the subsets of $\omega$; it examines only the $p$ subsets $(i) = \omega \setminus \{i\}$, $i = 1, \ldots, p$. This circumvents the computational complexity associated with all-subsets selection procedures. Further, we identified a range of $d$, and a class of estimators $\hat{d}$, such that $\mathrm{DC}_d$ has a high-dimensional consistency property. In particular, the criteria $\mathrm{DC}_d$ with $d = d_0$, $d = \hat{d}_1$ and $d = \hat{d}_2$ were examined numerically. However, an optimality problem for $d$ is left for future research. A study of high-dimensional consistency properties when $p_{j_*}$ tends to infinity and $\Delta^2 = O(p)$ is also left for future work.

Appendix: Proofs of Theorems 3.1 and 3.2

A1 Preliminary Lemmas

First we summarize the distributional results related to the Mahalanobis distance and the statistics $D_{d,i}$ in (2.6). For notational simplicity, consider a decomposition of $y$ as $y = (y_1', y_2')'$, where $y_1$ is $p_1 \times 1$ and $y_2$ is $p_2 \times 1$. Let $D$ and $D_1$ be the

sample Mahalanobis distances based on $y$ and $y_1$, respectively, and let $D_{2 \cdot 1}^2 = D^2 - D_1^2$. Similarly, the corresponding population quantities are expressed as $\Delta$, $\Delta_1$ and $\Delta_{2 \cdot 1}^2$. Let the coefficient vector $\beta$ of the linear discriminant function in (2.1) be decomposed as $\beta = (\beta_1', \beta_2')'$, where $\beta_1$ is $p_1 \times 1$ and $\beta_2$ is $p_2 \times 1$. The following lemmas (see, e.g., Fujikoshi et al. 2010) are used.

Lemma A.1. For the population Mahalanobis distances $\Delta$ and $\Delta_1$, it holds that (1) $\Delta_{2 \cdot 1}^2 = \Delta^2 - \Delta_1^2 \ge 0$, and (2) $\Delta_{2 \cdot 1}^2 = 0 \iff \beta_2 = 0$. In particular, note that

  $\Delta^2 - \Delta_{(i)}^2 = 0 \iff \beta_i = 0$.  (A.1)

Lemma A.2. For the sample Mahalanobis distances $D$ and $D_1$, it holds that:

  (1) $D^2 = \dfrac{n - 2}{g^2} \, \dfrac{\chi_p^2(g^2 \Delta^2)}{\chi_{n-p-1}^2}$.

  (2) If $\Delta_{2 \cdot 1}^2 = 0$, then $\dfrac{g^2 (D^2 - D_1^2)}{n - 2 + g^2 D_1^2} = \dfrac{\chi_{p_2}^2}{\chi_{n-p-1}^2}$.

  (3) If $\Delta_{2 \cdot 1}^2 = 0$, then $g^2 (D^2 - D_1^2)/\{n - 2 + g^2 D_1^2\}$ and $D_1^2$ are independent.

  (4) $D_{2 \cdot 1}^2 = \dfrac{n - 2}{g^2} \, \dfrac{\chi_{p_2}^2\big(g^2 \Delta_{2 \cdot 1}^2 (1 + R)^{-1}\big)}{\chi_{n-p-1}^2} (1 + R)$, where $R = \dfrac{\chi_{p_1}^2(g^2 \Delta_1^2)}{\chi_{n-p_1-1}^2}$.

Here $\chi_p^2(\cdot)$, $\chi_{n-p-1}^2$, $\chi_{p_2}^2(\cdot)$ and $\chi_{n-p_1-1}^2$ denote chi-square variates, independent within each expression; in (4) the noncentrality of $\chi_{p_2}^2(\cdot)$ is conditional on $R$.

From Lemma A.2 (1), $D^2$ and $D_{(i)}^2$ can be expressed as

  $D^2 = \dfrac{n - 2}{g^2} \, \dfrac{\chi_p^2(g^2 \Delta^2)}{\chi_{n-p-1}^2}$, $\quad D_{(i)}^2 = \dfrac{n - 2}{g^2} \, \dfrac{\chi_{p-1}^2(g^2 \Delta_{(i)}^2)}{\chi_{n-p}^2}$,  (A.2)

where $g = (n_1 n_2 / n)^{1/2}$. It is also easy to see that

  $\dfrac{1}{m} \chi_m^2(\delta^2) \xrightarrow{p} 1 + \eta^2$, if $\eta^2 = \lim_{m \to \infty} \delta^2 / m$.  (A.3)
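Lemma A.2 (1) implies the moment formula $E(D^2) = (n-2)\{\Delta^2 + pn/(n_1 n_2)\}/(n-p-3)$, which is used in the proofs below. This can be checked end to end by simulating $D^2$ directly from two normal samples; the following is our own illustrative sketch (with $\Sigma = I$ and our own function names), not code from the paper.

```python
import numpy as np

def pooled_mahalanobis_sq(X1, X2):
    """Squared sample Mahalanobis distance D^2 with pooled S (divisor n - 2)."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False)
         + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return float(diff @ np.linalg.solve(S, diff))

def mean_D2_theory(n1, n2, p, Delta2):
    """E(D^2) implied by Lemma A.2 (1):
    E(D^2) = (n - 2){Delta^2 + p n/(n1 n2)}/(n - p - 3)."""
    n = n1 + n2
    return (n - 2) * (Delta2 + p * n / (n1 * n2)) / (n - p - 3)

def mean_D2_sim(n1, n2, p, delta, reps, seed=0):
    """Monte Carlo mean of D^2 with mu1 - mu2 = (delta, 0, ..., 0)', Sigma = I,
    so that Delta^2 = delta^2."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(p)
    shift[0] = delta
    total = 0.0
    for _ in range(reps):
        X1 = rng.normal(size=(n1, p)) + shift
        X2 = rng.normal(size=(n2, p))
        total += pooled_mahalanobis_sq(X1, X2)
    return total / reps
```

With moderate sample sizes the simulated mean agrees with the theoretical value to within Monte Carlo error.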

A2 Proofs of Theorems 3.1 and 3.2

Without loss of generality, we may assume $j_* = \{1, \ldots, s\}$ and $p_{j_*} = s$. Then [F1] and [F2] are expressed as

  $[\mathrm{F1}] = \sum_{i=1}^{s} P(D_{d,i} \le 0)$, $\quad [\mathrm{F2}] = \sum_{i=s+1}^{p} P(D_{d,i} \ge 0)$.

Proof of Theorem 3.1. Using (A.2) and (A.3),

  $D^2 = \dfrac{n - 2}{g^2} \, \dfrac{\chi_p^2(g^2 \Delta^2)}{\chi_{n-p-1}^2} = \dfrac{n - 2}{g^2} \cdot \dfrac{\chi_p^2(g^2 \Delta^2)}{n} \cdot \dfrac{n}{\chi_{n-p-1}^2} \xrightarrow{p} \dfrac{1}{1 - c} \left( \dfrac{c}{k_1 k_2} + \Delta^2 \right)$.

Similarly,

  $D_{(i)}^2 \xrightarrow{p} \dfrac{1}{1 - c} \left( \dfrac{c}{k_1 k_2} + \Delta_{(i)}^2 \right)$.

Therefore,

  $D^2 - D_{(i)}^2 \xrightarrow{p} \dfrac{1}{1 - c} \big( \Delta^2 - \Delta_{(i)}^2 \big)$.

If $i \in j_* = \{1, \ldots, s\}$, then $\tau_i = \Delta^2 - \Delta_{(i)}^2 > 0$, $i = 1, \ldots, s$, and $D^2 - D_{(i)}^2 \xrightarrow{p} (1 - c)^{-1} \tau_i$. Therefore, if $d < (1 - c)^{-1} \tau_{\min}$, where $\tau_{\min} = \min_{i=1,\ldots,s} \tau_i$, then $D_{d,i} = D^2 - D_{(i)}^2 - d$ converges in probability to a positive constant, so that $P(D_{d,i} \le 0) \to 0$ and

  $[\mathrm{F1}] = \sum_{i=1}^{s} P(D_{d,i} \le 0) \to 0$,

since $s$ is finite.

Proof of Theorem 3.2. Next we show [F2] under the assumptions of Theorem 3.2. Assume that $i \in \{s + 1, \ldots, p\}$; then $\Delta^2 = \Delta_{(i)}^2$. In general, note that $D^2 - D_{(i)}^2 > 0$. Using Lemma A.2 (1),

  $E(D^2) = \dfrac{(n - 2) n p}{n_1 n_2 (n - p - 3)} + \dfrac{n - 2}{n - p - 3} \Delta^2$,

and hence

  $E\{D^2 - D_{(i)}^2\} = q(p, n, \Delta^2)$,

where $q(p, n, \Delta^2)$ is given by (3.5). Suppose that $d > q(p, n, \Delta^2)$. Then we have

  $P\big(D^2 - D_{(i)}^2 \ge d\big) = P\big(D^2 - D_{(i)}^2 - E[D^2 - D_{(i)}^2] \ge d - E[D^2 - D_{(i)}^2]\big)$
  $\le P\big(|D^2 - D_{(i)}^2 - E[D^2 - D_{(i)}^2]| \ge d - E[D^2 - D_{(i)}^2]\big)$
  $\le \dfrac{\mathrm{Var}(D^2 - D_{(i)}^2)}{(d - E[D^2 - D_{(i)}^2])^2}$.

The last inequality follows from Chebyshev's inequality. It is easy to see that $E[D^2 - D_{(i)}^2] = O(n^{-1})$. Since $d = O(1)$ by assumption, $(d - E[D^2 - D_{(i)}^2])^{-2} = O(1)$. Using Lemma A.2 (4), the variance is expressed as

  $\mathrm{Var}(D^2 - D_{(i)}^2) = E\{(D^2 - D_{(i)}^2)^2\} - \{E(D^2 - D_{(i)}^2)\}^2$
  $= \dfrac{(n - 2)^2}{g^4} E\left[ \left( \dfrac{\chi_1^2}{\chi_{n-p-1}^2} \right)^2 \left( 1 + \dfrac{\chi_{p-1}^2(g^2 \Delta^2)}{\chi_{n-p}^2} \right)^2 \right] - \dfrac{(n - 2)^2}{g^4 (n - p - 3)^2} \left( 1 + \dfrac{p - 1 + g^2 \Delta^2}{n - p - 2} \right)^2$.

The expectation in the first term can be computed as

  $E\left[ \left( \dfrac{\chi_1^2}{\chi_{n-p-1}^2} \right)^2 \left( 1 + \dfrac{\chi_{p-1}^2(g^2 \Delta^2)}{\chi_{n-p}^2} \right)^2 \right]$
  $= \dfrac{3}{(n - p - 3)(n - p - 5)} \, E\left[ 1 + 2 \dfrac{\chi_{p-1}^2(g^2 \Delta^2)}{\chi_{n-p}^2} + \left( \dfrac{\chi_{p-1}^2(g^2 \Delta^2)}{\chi_{n-p}^2} \right)^2 \right]$
  $= \dfrac{3}{(n - p - 3)(n - p - 5)} \left\{ 1 + 2 \dfrac{p - 1 + g^2 \Delta^2}{n - p - 2} + \dfrac{(p - 1)^2 + 2(p - 1) + 2(p - 1) g^2 \Delta^2 + 4 g^2 \Delta^2 + g^4 \Delta^4}{(n - p - 2)(n - p - 4)} \right\}$
  $= \dfrac{3}{(n - p - 3)(n - p - 5)} \{ 1 + O(1) + O(1) \} = O(n^{-2})$,

using $E[(\chi_1^2)^2] = 3$, $E[(\chi_m^2(\delta^2))^2] = (m + \delta^2)^2 + 2(m + 2\delta^2)$ and $E[(\chi_m^2)^{-2}] = \{(m - 2)(m - 4)\}^{-1}$. The second term is evaluated as

  $\dfrac{(n - 2)^2}{g^4 (n - p - 3)^2} \left( 1 + \dfrac{p - 1 + g^2 \Delta^2}{n - p - 2} \right)^2 = O\big((n - p - 3)^{-2}\big) \, O(1) = O(n^{-2})$.

Therefore, we have

  $\mathrm{Var}(D^2 - D_{(i)}^2) = O(n^{-2}) - O(n^{-2}) = O(n^{-2})$.

These imply that

  $P\big(D^2 - D_{(i)}^2 \ge d\big) \le O(n^{-2})$,

and hence

  $\sum_{i=s+1}^{p} P\big(D^2 - D_{(i)}^2 \ge d\big) \le p \, O(n^{-2}) \to 0$,

which proves [F2] $\to 0$.
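The $O(n^{-2})$ order just derived can be probed numerically: along $p/n$ fixed, the variance of the spurious increment $D^2 - D_{(i)}^2$ for a redundant variable $i$ should drop sharply as $n$ grows. The following is our own illustrative sketch (with $\Sigma = I$, $\Delta^2 = 1$), not part of the proof.

```python
import numpy as np

def pooled_mahalanobis_sq(X1, X2):
    """Squared sample Mahalanobis distance D^2 with pooled S (divisor n - 2)."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False)
         + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return float(diff @ np.linalg.solve(S, diff))

def spurious_increment_var(n1, n2, p, reps, seed=0):
    """Sample variance of D^2 - D^2_(p) when the last variable is redundant
    (the mean shift is placed on the first coordinate only)."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(p)
    shift[0] = 1.0                      # Delta^2 = Delta^2_(p) = 1
    vals = []
    for _ in range(reps):
        X1 = rng.normal(size=(n1, p)) + shift
        X2 = rng.normal(size=(n2, p))
        d_full = pooled_mahalanobis_sq(X1, X2)
        d_drop = pooled_mahalanobis_sq(X1[:, :-1], X2[:, :-1])
        vals.append(d_full - d_drop)
    return float(np.var(vals))
```

Quadrupling $n$ with $p/n$ held fixed should shrink this variance by roughly a factor of 16.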

A3 Proof of Theorem 4.1

As in the proofs of Theorems 3.1 and 3.2, assume that $j_* = \{1, \ldots, s\}$; we show that

  $[\mathrm{F1}] = \sum_{i=1}^{s} P\big(D^2 - D_{(i)}^2 \le \hat{d}\big) \to 0$, $\quad [\mathrm{F2}] = \sum_{i=s+1}^{p} P\big(D^2 - D_{(i)}^2 \ge \hat{d}\big) \to 0$.

First consider [F1], and assume that $i \in \{1, \ldots, s\}$. In the proof of Theorem 3.1, it was shown that

  $D^2 \xrightarrow{p} \dfrac{1}{1 - c} \left( \dfrac{c}{k_1 k_2} + \Delta^2 \right)$.

Using this result, we have $\hat{d} \xrightarrow{p} 0$, since $(n - 2 + g^2 D^2)/\{(n - p - 1) g^2\} = O_p(n^{-1})$ while $a (np)^{1/3} = O(n^{2/3})$. These imply that

  $D^2 - D_{(i)}^2 - \hat{d} \xrightarrow{p} \dfrac{\tau_i}{1 - c} > 0$,

and hence $P\big(D^2 - D_{(i)}^2 \le \hat{d}\big) \to 0$, which implies [F1] $\to 0$.

Next, consider [F2]. Assume that $i \in \{s + 1, \ldots, p\}$; then $\Delta^2 = \Delta_{(i)}^2$. Write $D^2 - D_{(i)}^2 = D_{\{i\} \cdot (i)}^2$. Using Lemma A.2 (2) and (3), we can write

  $P\big(D^2 - D_{(i)}^2 \ge \hat{d}\big) = P\left( \dfrac{\chi_1^2}{\chi_{n-p-1}^2} \ge \dfrac{g^2 \hat{d}}{n - 2 + g^2 D_{(i)}^2} \right) \le P\left( \dfrac{\chi_1^2}{\chi_{n-p-1}^2} \ge \dfrac{g^2 \hat{d}}{n - 2 + g^2 D^2} \right)$
  $= P\left( F_{1, n-p-1} \ge \dfrac{n - p - 1}{n - p - 3} + a (np)^{1/3} \right)$.
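The distributional fact used here, that under $\beta_i = 0$ the statistic $(n - p - 1)\, g^2 (D^2 - D_{(i)}^2)/(n - 2 + g^2 D_{(i)}^2)$ is an $F_{1, n-p-1}$ variate with mean $(n - p - 1)/(n - p - 3)$, can be checked by simulation. The sketch below is our own ($\Sigma = I$, mean shift on the first coordinate only, so the last variable is redundant).

```python
import numpy as np

def pooled_mahalanobis_sq(X1, X2):
    """Squared sample Mahalanobis distance D^2 with pooled S (divisor n - 2)."""
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False)
         + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    return float(diff @ np.linalg.solve(S, diff))

def f_stat_mean(n1, n2, p, reps, seed=0):
    """Monte Carlo mean of (n-p-1) g^2 (D^2 - D^2_(p)) / (n-2 + g^2 D^2_(p))
    when the last variable is redundant; should approach (n-p-1)/(n-p-3)."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    g2 = n1 * n2 / n
    shift = np.zeros(p)
    shift[0] = 1.0
    total = 0.0
    for _ in range(reps):
        X1 = rng.normal(size=(n1, p)) + shift
        X2 = rng.normal(size=(n2, p))
        d_full = pooled_mahalanobis_sq(X1, X2)
        d_drop = pooled_mahalanobis_sq(X1[:, :-1], X2[:, :-1])
        total += (n - p - 1) * g2 * (d_full - d_drop) / (n - 2 + g2 * d_drop)
    return total / reps
```

For $n_1 = n_2 = 30$ and $p = 6$ the target mean is $(n - p - 1)/(n - p - 3) = 53/51$.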

Noting that the mean of $F_{1, n-p-1}$ is $(n - p - 1)/(n - p - 3)$, Chebyshev's inequality gives

  $P\big(D^2 - D_{(i)}^2 \ge \hat{d}\big) \le P\big( F_{1, n-p-1} - E[F_{1, n-p-1}] \ge a (np)^{1/3} \big) \le \dfrac{\mathrm{Var}(F_{1, n-p-1})}{a^2 (np)^{2/3}}$,

where

  $\mathrm{Var}(F_{1, n-p-1}) = \dfrac{2 (n - p - 1)^2 (n - p - 2)}{(n - p - 3)^2 (n - p - 5)} = O(1)$

under A2. This implies that

  $[\mathrm{F2}] \le \sum_{i=s+1}^{p} \dfrac{\mathrm{Var}(F_{1, n-p-1})}{a^2 (np)^{2/3}} = \dfrac{p \, \mathrm{Var}(F_{1, n-p-1})}{a^2 (np)^{2/3}} = O(n^{-1/3})$,

which proves [F2] $\to 0$.

Acknowledgements

The first author's research is partially supported by the Ministry of Education, Science, Sports, and Culture, through a Grant-in-Aid for Scientific Research (C), 16K00047.

References

[1] Clemmensen, L., Hastie, T., Witten, D. M. and Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53.

[2] Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria. J. Multivariate Anal., 17.

[3] Fujikoshi, Y., Ulyanov, V. V. and Shimizu, R. (2010). Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, Hoboken, N.J.

[4] Hyodo, M. and Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. J. Multivariate Anal., 123.

[5] Nishii, R., Bai, Z. D. and Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Math. J., 18.

[6] Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York.

[7] Yamada, T., Sakurai, T. and Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z- rules. Hiroshima Statistical Research Group, TR 17-.

[8] Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B, 73.


More information

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE J Jaan Statist Soc Vol 34 No 2004 9 26 ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE Yasunori Fujikoshi*, Tetsuto Himeno

More information

On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control

On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control On the conservative multivariate multiple comparison procedure of correlated mean vectors with a control Takahiro Nishiyama a a Department of Mathematical Information Science, Tokyo University of Science,

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

Estimation of multivariate 3rd moment for high-dimensional data and its application for testing multivariate normality

Estimation of multivariate 3rd moment for high-dimensional data and its application for testing multivariate normality Estimation of multivariate rd moment for high-dimensional data and its application for testing multivariate normality Takayuki Yamada, Tetsuto Himeno Institute for Comprehensive Education, Center of General

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

Testing equality of two mean vectors with unequal sample sizes for populations with correlation

Testing equality of two mean vectors with unequal sample sizes for populations with correlation Testing equality of two mean vectors with unequal sample sizes for populations with correlation Aya Shinozaki Naoya Okamoto 2 and Takashi Seo Department of Mathematical Information Science Tokyo University

More information

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification TR-No. 13-08, Hiroshima Statistical Research Group, 1 23 A Study on the Bias-Correction Effect of the AIC for Selecting Variables in Normal Multivariate Linear Regression Models under Model Misspecification

More information

Robustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples

Robustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples DOI 10.1186/s40064-016-1718-3 RESEARCH Open Access Robustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples Atinuke Adebanji 1,2, Michael Asamoah Boaheng

More information

On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors

On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors On the conservative multivariate Tukey-Kramer type procedures for multiple comparisons among mean vectors Takashi Seo a, Takahiro Nishiyama b a Department of Mathematical Information Science, Tokyo University

More information

On High-Dimensional Cross-Validation

On High-Dimensional Cross-Validation On High-Dimensional Cross-Validation BY WEI-CHENG HSIAO Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan hsiaowc@stat.sinica.edu.tw 5 WEI-YING

More information

A Robust Approach to Regularized Discriminant Analysis

A Robust Approach to Regularized Discriminant Analysis A Robust Approach to Regularized Discriminant Analysis Moritz Gschwandtner Department of Statistics and Probability Theory Vienna University of Technology, Austria Österreichische Statistiktage, Graz,

More information

Small sample size in high dimensional space - minimum distance based classification.

Small sample size in high dimensional space - minimum distance based classification. Small sample size in high dimensional space - minimum distance based classification. Ewa Skubalska-Rafaj lowicz Institute of Computer Engineering, Automatics and Robotics, Department of Electronics, Wroc

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Extended Bayesian Information Criteria for Model Selection with Large Model Spaces

Extended Bayesian Information Criteria for Model Selection with Large Model Spaces Extended Bayesian Information Criteria for Model Selection with Large Model Spaces Jiahua Chen, University of British Columbia Zehua Chen, National University of Singapore (Biometrika, 2008) 1 / 18 Variable

More information

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data

Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone Missing Data Journal of Multivariate Analysis 78, 6282 (2001) doi:10.1006jmva.2000.1939, available online at http:www.idealibrary.com on Inferences on a Normal Covariance Matrix and Generalized Variance with Monotone

More information

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common Journal of Statistical Theory and Applications Volume 11, Number 1, 2012, pp. 23-45 ISSN 1538-7887 A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when

More information

EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS. Haruo Yanai*

EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS. Haruo Yanai* J. Japan Statist. Soc. Vol. 11 No. 1 1981 43-53 EXPLICIT EXPRESSIONS OF PROJECTORS ON CANONICAL VARIABLES AND DISTANCES BETWEEN CENTROIDS OF GROUPS Haruo Yanai* Generalized expressions of canonical correlation

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Edgeworth Expansions of Functions of the Sample Covariance Matrix with an Unknown Population

Edgeworth Expansions of Functions of the Sample Covariance Matrix with an Unknown Population Edgeworth Expansions of Functions of the Sample Covariance Matrix with an Unknown Population (Last Modified: April 24, 2008) Hirokazu Yanagihara 1 and Ke-Hai Yuan 2 1 Department of Mathematics, Graduate

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Lecture 32: Asymptotic confidence sets and likelihoods

Lecture 32: Asymptotic confidence sets and likelihoods Lecture 32: Asymptotic confidence sets and likelihoods Asymptotic criterion In some problems, especially in nonparametric problems, it is difficult to find a reasonable confidence set with a given confidence

More information

A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion

A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion TR-No. 17-07, Hiroshima Statistical Research Group, 1 24 A Fast Algorithm for Optimizing Ridge Parameters in a Generalized Ridge Regression by Minimizing an Extended GCV Criterion Mineaki Ohishi, Hirokazu

More information

Ken-ichi Kamo Department of Liberal Arts and Sciences, Sapporo Medical University, Hokkaido, Japan

Ken-ichi Kamo Department of Liberal Arts and Sciences, Sapporo Medical University, Hokkaido, Japan A STUDY ON THE BIAS-CORRECTION EFFECT OF THE AIC FOR SELECTING VARIABLES IN NOR- MAL MULTIVARIATE LINEAR REGRESSION MOD- ELS UNDER MODEL MISSPECIFICATION Authors: Hirokazu Yanagihara Department of Mathematics,

More information

A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION

A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION J. Japan Statist. Soc. Vol. 32 No. 2 2002 165 181 A NOTE ON ESTIMATION UNDER THE QUADRATIC LOSS IN MULTIVARIATE CALIBRATION Hisayuki Tsukuma* The problem of estimation in multivariate linear calibration

More information

Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification

Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification Jackknife Bias Correction of the AIC for Selecting Variables in Canonical Correlation Analysis under Model Misspecification (Last Modified: May 22, 2012) Yusuke Hashiyama, Hirokazu Yanagihara 1 and Yasunori

More information

Qing-Ming Cheng and Young Jin Suh

Qing-Ming Cheng and Young Jin Suh J. Korean Math. Soc. 43 (2006), No. 1, pp. 147 157 MAXIMAL SPACE-LIKE HYPERSURFACES IN H 4 1 ( 1) WITH ZERO GAUSS-KRONECKER CURVATURE Qing-Ming Cheng and Young Jin Suh Abstract. In this paper, we study

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression

Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Stepwise Searching for Feature Variables in High-Dimensional Linear Regression Qiwei Yao Department of Statistics, London School of Economics q.yao@lse.ac.uk Joint work with: Hongzhi An, Chinese Academy

More information

Variable selection in multiple regression with random design

Variable selection in multiple regression with random design Variable selection in multiple regression with random design arxiv:1506.08022v1 [math.st] 26 Jun 2015 Alban Mbina Mbina 1, Guy Martial Nkiet 2, Assi Nguessan 3 1 URMI, Université des Sciences et Techniques

More information

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University this presentation derived from that presented at the Pan-American Advanced

More information

Inference After Variable Selection

Inference After Variable Selection Department of Mathematics, SIU Carbondale Inference After Variable Selection Lasanthi Pelawa Watagoda lasanthi@siu.edu June 12, 2017 Outline 1 Introduction 2 Inference For Ridge and Lasso 3 Variable Selection

More information

A Confidence Region Approach to Tuning for Variable Selection

A Confidence Region Approach to Tuning for Variable Selection A Confidence Region Approach to Tuning for Variable Selection Funda Gunes and Howard D. Bondell Department of Statistics North Carolina State University Abstract We develop an approach to tuning of penalized

More information

Feature selection with high-dimensional data: criteria and Proc. Procedures

Feature selection with high-dimensional data: criteria and Proc. Procedures Feature selection with high-dimensional data: criteria and Procedures Zehua Chen Department of Statistics & Applied Probability National University of Singapore Conference in Honour of Grace Wahba, June

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Hypothesis testing:power, test statistic CMS:

Hypothesis testing:power, test statistic CMS: Hypothesis testing:power, test statistic The more sensitive the test, the better it can discriminate between the null and the alternative hypothesis, quantitatively, maximal power In order to achieve this

More information

Global Model Fit Test for Nonlinear SEM

Global Model Fit Test for Nonlinear SEM Global Model Fit Test for Nonlinear SEM Rebecca Büchner, Andreas Klein, & Julien Irmer Goethe-University Frankfurt am Main Meeting of the SEM Working Group, 2018 Nonlinear SEM Measurement Models: Structural

More information

AR-order estimation by testing sets using the Modified Information Criterion

AR-order estimation by testing sets using the Modified Information Criterion AR-order estimation by testing sets using the Modified Information Criterion Rudy Moddemeijer 14th March 2006 Abstract The Modified Information Criterion (MIC) is an Akaike-like criterion which allows

More information

A COMMENT ON FREE GROUP FACTORS

A COMMENT ON FREE GROUP FACTORS A COMMENT ON FREE GROUP FACTORS NARUTAKA OZAWA Abstract. Let M be a finite von Neumann algebra acting on the standard Hilbert space L 2 (M). We look at the space of those bounded operators on L 2 (M) that

More information

MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.)

MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.) 1/12 MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.) Dominique Guillot Departments of Mathematical Sciences University of Delaware May 6, 2016

More information

Comment about AR spectral estimation Usually an estimate is produced by computing the AR theoretical spectrum at (ˆφ, ˆσ 2 ). With our Monte Carlo

Comment about AR spectral estimation Usually an estimate is produced by computing the AR theoretical spectrum at (ˆφ, ˆσ 2 ). With our Monte Carlo Comment aout AR spectral estimation Usually an estimate is produced y computing the AR theoretical spectrum at (ˆφ, ˆσ 2 ). With our Monte Carlo simulation approach, for every draw (φ,σ 2 ), we can compute

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS Bull. Korean Math. Soc. 5 (24), No. 3, pp. 7 76 http://dx.doi.org/34/bkms.24.5.3.7 KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE BASED ON LINEAR PLACEMENTS Yicheng Hong and Sungchul Lee Abstract. The limiting

More information

Extreme inference in stationary time series

Extreme inference in stationary time series Extreme inference in stationary time series Moritz Jirak FOR 1735 February 8, 2013 1 / 30 Outline 1 Outline 2 Motivation The multivariate CLT Measuring discrepancies 3 Some theory and problems The problem

More information

Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr. R. Tsay

Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr. R. Tsay Lecture 7: Model Building Bus 41910, Time Series Analysis, Mr R Tsay An effective procedure for building empirical time series models is the Box-Jenkins approach, which consists of three stages: model

More information

arxiv: v1 [stat.co] 26 May 2009

arxiv: v1 [stat.co] 26 May 2009 MAXIMUM LIKELIHOOD ESTIMATION FOR MARKOV CHAINS arxiv:0905.4131v1 [stat.co] 6 May 009 IULIANA TEODORESCU Abstract. A new approach for optimal estimation of Markov chains with sparse transition matrices

More information

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa

1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa On the Use of Marginal Likelihood in Model Selection Peide Shi Department of Probability and Statistics Peking University, Beijing 100871 P. R. China Chih-Ling Tsai Graduate School of Management University

More information

1 Introduction. 2 AIC versus SBIC. Erik Swanson Cori Saviano Li Zha Final Project

1 Introduction. 2 AIC versus SBIC. Erik Swanson Cori Saviano Li Zha Final Project Erik Swanson Cori Saviano Li Zha Final Project 1 Introduction In analyzing time series data, we are posed with the question of how past events influences the current situation. In order to determine this,

More information

The Third International Workshop in Sequential Methodologies

The Third International Workshop in Sequential Methodologies Area C.6.1: Wednesday, June 15, 4:00pm Kazuyoshi Yata Institute of Mathematics, University of Tsukuba, Japan Effective PCA for large p, small n context with sample size determination In recent years, substantial

More information

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA

SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA SIMULTANEOUS CONFIDENCE INTERVALS AMONG k MEAN VECTORS IN REPEATED MEASURES WITH MISSING DATA Kazuyuki Koizumi Department of Mathematics, Graduate School of Science Tokyo University of Science 1-3, Kagurazaka,

More information

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data Outline Mixed models in R using the lme4 package Part 3: Longitudinal data Douglas Bates Longitudinal data: sleepstudy A model with random effects for intercept and slope University of Wisconsin - Madison

More information

High-dimensional two-sample tests under strongly spiked eigenvalue models

High-dimensional two-sample tests under strongly spiked eigenvalue models 1 High-dimensional two-sample tests under strongly spiked eigenvalue models Makoto Aoshima and Kazuyoshi Yata University of Tsukuba Abstract: We consider a new two-sample test for high-dimensional data

More information

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation

Shrinkage Tuning Parameter Selection in Precision Matrices Estimation arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang

More information

Mixture Modeling. Identifying the Correct Number of Classes in a Growth Mixture Model. Davood Tofighi Craig Enders Arizona State University

Mixture Modeling. Identifying the Correct Number of Classes in a Growth Mixture Model. Davood Tofighi Craig Enders Arizona State University Identifying the Correct Number of Classes in a Growth Mixture Model Davood Tofighi Craig Enders Arizona State University Mixture Modeling Heterogeneity exists such that the data are comprised of two or

More information

A Method to Estimate the True Mahalanobis Distance from Eigenvectors of Sample Covariance Matrix

A Method to Estimate the True Mahalanobis Distance from Eigenvectors of Sample Covariance Matrix A Method to Estimate the True Mahalanobis Distance from Eigenvectors of Sample Covariance Matrix Masakazu Iwamura, Shinichiro Omachi, and Hirotomo Aso Graduate School of Engineering, Tohoku University

More information

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation Yujin Chung November 29th, 2016 Fall 2016 Yujin Chung Lec13: MLE Fall 2016 1/24 Previous Parametric tests Mean comparisons (normality assumption)

More information

ASYMPTOTICALLY NONEXPANSIVE MAPPINGS IN MODULAR FUNCTION SPACES ABSTRACT

ASYMPTOTICALLY NONEXPANSIVE MAPPINGS IN MODULAR FUNCTION SPACES ABSTRACT ASYMPTOTICALLY NONEXPANSIVE MAPPINGS IN MODULAR FUNCTION SPACES T. DOMINGUEZ-BENAVIDES, M.A. KHAMSI AND S. SAMADI ABSTRACT In this paper, we prove that if ρ is a convex, σ-finite modular function satisfying

More information

ACCURATE ASYMPTOTIC ANALYSIS FOR JOHN S TEST IN MULTICHANNEL SIGNAL DETECTION

ACCURATE ASYMPTOTIC ANALYSIS FOR JOHN S TEST IN MULTICHANNEL SIGNAL DETECTION ACCURATE ASYMPTOTIC ANALYSIS FOR JOHN S TEST IN MULTICHANNEL SIGNAL DETECTION Yu-Hang Xiao, Lei Huang, Junhao Xie and H.C. So Department of Electronic and Information Engineering, Harbin Institute of Technology,

More information

Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals. Satoshi AOKI and Akimichi TAKEMURA

Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals. Satoshi AOKI and Akimichi TAKEMURA Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals Satoshi AOKI and Akimichi TAKEMURA Graduate School of Information Science and Technology University

More information

SUPPLEMENT TO TESTING UNIFORMITY ON HIGH-DIMENSIONAL SPHERES AGAINST MONOTONE ROTATIONALLY SYMMETRIC ALTERNATIVES

SUPPLEMENT TO TESTING UNIFORMITY ON HIGH-DIMENSIONAL SPHERES AGAINST MONOTONE ROTATIONALLY SYMMETRIC ALTERNATIVES Submitted to the Annals of Statistics SUPPLEMENT TO TESTING UNIFORMITY ON HIGH-DIMENSIONAL SPHERES AGAINST MONOTONE ROTATIONALLY SYMMETRIC ALTERNATIVES By Christine Cutting, Davy Paindaveine Thomas Verdebout

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates

More information

Structure estimation for Gaussian graphical models

Structure estimation for Gaussian graphical models Faculty of Science Structure estimation for Gaussian graphical models Steffen Lauritzen, University of Copenhagen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 3 Slide 1/48 Overview of

More information